objective
Read/write a Hive-style partitioned dataset.
minimal code
import pyarrow as pa, pyarrow.dataset as ds

table = pa.table({"country": ["FR", "FR", "DE"], "id": [1, 2, 3]})
# flavor="hive" writes key=value directories (country=FR/), matching partitioning="hive" on read
part = ds.partitioning(pa.schema([("country", pa.string())]), flavor="hive")
ds.write_dataset(table, base_dir="out_ds", format="parquet", partitioning=part)
print("written")
usage
import pyarrow.dataset as ds
dataset = ds.dataset("out_ds", format="parquet", partitioning="hive")
print(len(list(dataset.get_fragments())) >= 1)
useful variant(s)
import os, pyarrow.parquet as pq, pyarrow as pa

# pq.write_table does not create parent directories; make the partition dir first
os.makedirs("out_ds/x=1", exist_ok=True)
pq.write_table(pa.table({"x": [1]}), "out_ds/x=1/part.parquet")
print("ok")
notes
- Keep column types consistent across partitions (every file should share one schema).