Support hosting lance / vortex / iceberg / zarr datasets on huggingface hub #7863

@pavanramkumar

Description

Feature request

Hugging Face datasets has great support for large tabular datasets stored as Parquet with large partitions. I would love to see two things in the future:

  • equivalent support for lance, vortex, iceberg, zarr (in that order) in a way that I can stream them using the datasets library
  • more fine-grained control of streaming, so that I can stream at the partition / shard level
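To illustrate what shard-level control could look like from the caller's side, here is a minimal pure-Python sketch. It is not the datasets library API: `SHARDS` and `stream_shards` are hypothetical names, and the in-memory dict only stands in for real partitions on remote storage.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical stand-in for a sharded dataset: shard name -> rows.
# In practice each entry would be a file or fragment on remote storage.
SHARDS: Dict[str, List[dict]] = {
    "part-0000": [{"id": 0}, {"id": 1}],
    "part-0001": [{"id": 2}, {"id": 3}],
    "part-0002": [{"id": 4}],
}


def stream_shards(
    shards: Dict[str, List[dict]], wanted: Iterable[str]
) -> Iterator[dict]:
    """Lazily yield rows from only the requested shards, in the requested order."""
    for name in wanted:
        # Only shards named by the caller are ever touched.
        for row in shards[name]:
            yield row


# The caller decides which shards to stream and in what order.
ids = [row["id"] for row in stream_shards(SHARDS, ["part-0002", "part-0000"])]
```

The point of the sketch is that the shard list is an explicit argument, so a training job can pick, skip, or reorder partitions instead of consuming one opaque stream.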

Motivation

I work with very large lance datasets on S3 and often require random access for AI/ML applications like multi-node training. I was able to achieve high-throughput data loading on a lance dataset with ~150B rows by building distributed dataloaders that can be scaled both vertically (until I/O and CPU are saturated) and then horizontally (to work around network bottlenecks).
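One way to picture the vertical and horizontal scaling described above is a two-level split of shards, first across nodes and then across workers on each node. The sketch below is an assumption about how such a split might be done, not the dataloader actually used here; `assign_shards` and its parameters are hypothetical.

```python
from typing import Dict, List, Tuple


def assign_shards(
    num_shards: int, num_nodes: int, workers_per_node: int
) -> Dict[Tuple[int, int], List[int]]:
    """Round-robin shard indices over every (node, worker) pair.

    Adding workers per node scales vertically; adding nodes scales horizontally.
    """
    total_workers = num_nodes * workers_per_node
    assignment: Dict[Tuple[int, int], List[int]] = {}
    for node in range(num_nodes):
        for worker in range(workers_per_node):
            # Flatten (node, worker) into a single global rank.
            global_rank = node * workers_per_node + worker
            # Each rank takes every total_workers-th shard, starting at its rank.
            assignment[(node, worker)] = list(
                range(global_rank, num_shards, total_workers)
            )
    return assignment


plan = assign_shards(num_shards=10, num_nodes=2, workers_per_node=2)
```

With this layout every shard is read by exactly one worker, so no two workers compete for the same partition.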

Using this strategy I was able to achieve 10-20x the throughput of the streaming data loader from the huggingface/datasets library.

I realized that these would be great features for Hugging Face to support natively.

Your contribution

I'm not ready to make a PR yet, but I'm open to it with the right pointers!

Labels: enhancement (New feature or request)