Feature request
Hugging Face datasets has great support for large tabular datasets stored as parquet with large partitions. I would love to see two things in the future:
- equivalent support for lance, vortex, iceberg, zarr (in that order), in a way that I can stream them using the datasets library
- more fine-grained control of streaming, so that I can stream at the partition / shard level (see the sketch after this list)
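For illustration, here is a rough sketch of the coarse-grained control available today when streaming a parquet dataset (the repo id is hypothetical). The library decides how shards map to nodes; I can't directly pick which partitions / shards each worker reads:

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# Stream a large parquet dataset from the Hub
# ("user/big-parquet-dataset" is a hypothetical repo id).
ds = load_dataset("user/big-parquet-dataset", split="train", streaming=True)

# Distribution happens at whole-shard granularity chosen by the library;
# there is no handle for selecting specific partitions / shards per worker.
node_ds = split_dataset_by_node(ds, rank=0, world_size=8)
for example in node_ds.take(10):
    ...
```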
Motivation
I work with very large Lance datasets on S3 and often require random access for AI/ML applications like multi-node training. I was able to achieve high-throughput dataloading on a Lance dataset with ~150B rows by building distributed dataloaders that scale both vertically (until I/O and CPU are saturated) and horizontally (to work around network bottlenecks).
Using this strategy I achieved 10-20x the throughput of the streaming data loader from the huggingface/datasets library.
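For illustration, a minimal sketch of the pattern (not my actual code), assuming the pylance package and a hypothetical s3://bucket/table.lance URI:

```python
import lance
import numpy as np
from torch.utils.data import DataLoader, Dataset


class LanceRandomAccessDataset(Dataset):
    """Random-access view over a Lance table; each item is one batch of rows."""

    def __init__(self, uri, batch_size=1024, rank=0, world_size=1, columns=None):
        self.uri = uri
        self.batch_size = batch_size
        self.columns = columns
        self.num_rows = lance.dataset(uri).count_rows()
        # Shard the global row range across ranks (horizontal scaling).
        self.batch_starts = np.arange(0, self.num_rows, batch_size)[rank::world_size]
        self._ds = None  # opened lazily so each DataLoader worker gets its own handle

    def __len__(self):
        return len(self.batch_starts)

    def __getitem__(self, i):
        if self._ds is None:
            self._ds = lance.dataset(self.uri)
        start = int(self.batch_starts[i])
        stop = min(start + self.batch_size, self.num_rows)
        # Lance supports efficient random access by row index, even over S3.
        return self._ds.take(list(range(start, stop)), columns=self.columns)


# Vertical scaling: add DataLoader worker processes until I/O and CPU saturate.
loader = DataLoader(
    LanceRandomAccessDataset("s3://bucket/table.lance", rank=0, world_size=8),
    batch_size=None,              # each __getitem__ already returns a full batch
    collate_fn=lambda t: t,       # pass pyarrow batches through untouched
    num_workers=8,
)
```

The key point is that Lance's random access by row index makes batch-level reads cheap, so sharding row ranges across ranks and DataLoader workers is straightforward.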
I realized these would be great features for Hugging Face to support natively.
Your contribution
I'm not ready to make a PR yet, but I'm open to it with the right pointers!