Feature request
Hugging Face datasets has great support for large tabular datasets stored as parquet with large partitions. I would love to see two things in the future:
- equivalent support for lance, vortex, iceberg, zarr (in that order), in a way that I can stream them using the datasets library
- more fine-grained control of streaming, so that I can stream at the partition / shard level (see the sketch after this list)
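For illustration, here is a rough sketch of the coarse-grained control available today when streaming a parquet dataset (the repo id is hypothetical). The library decides how shards map to nodes; I can't directly pick which partitions / shards each worker reads:

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# Stream a large parquet dataset from the Hub
# ("user/big-parquet-dataset" is a hypothetical repo id).
ds = load_dataset("user/big-parquet-dataset", split="train", streaming=True)

# Distribution happens at whole-shard granularity chosen by the library;
# there is no handle for selecting specific partitions / shards per worker.
node_ds = split_dataset_by_node(ds, rank=0, world_size=8)
for example in node_ds.take(10):
    ...
```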
Motivation
I work with very large Lance datasets on S3 and often require random access for AI/ML applications like multi-node training. I was able to achieve high-throughput dataloading on a Lance dataset with ~150B rows by building distributed dataloaders that scale both vertically (until I/O and CPU are saturated) and horizontally (to work around network bottlenecks).
Using this strategy I achieved 10-20x the throughput of the streaming data loader from the huggingface/datasets library.
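For illustration, a minimal sketch of the pattern (not my actual code), assuming the pylance package and a hypothetical s3://bucket/table.lance URI:

```python
import lance
import numpy as np
from torch.utils.data import DataLoader, Dataset


class LanceRandomAccessDataset(Dataset):
    """Random-access view over a Lance table; each item is one batch of rows."""

    def __init__(self, uri, batch_size=1024, rank=0, world_size=1, columns=None):
        self.uri = uri
        self.batch_size = batch_size
        self.columns = columns
        self.num_rows = lance.dataset(uri).count_rows()
        # Shard the global row range across ranks (horizontal scaling).
        self.batch_starts = np.arange(0, self.num_rows, batch_size)[rank::world_size]
        self._ds = None  # opened lazily so each DataLoader worker gets its own handle

    def __len__(self):
        return len(self.batch_starts)

    def __getitem__(self, i):
        if self._ds is None:
            self._ds = lance.dataset(self.uri)
        start = int(self.batch_starts[i])
        stop = min(start + self.batch_size, self.num_rows)
        # Lance supports efficient random access by row index, even over S3.
        return self._ds.take(list(range(start, stop)), columns=self.columns)


# Vertical scaling: add DataLoader worker processes until I/O and CPU saturate.
loader = DataLoader(
    LanceRandomAccessDataset("s3://bucket/table.lance", rank=0, world_size=8),
    batch_size=None,              # each __getitem__ already returns a full batch
    collate_fn=lambda t: t,       # pass pyarrow batches through untouched
    num_workers=8,
)
```

The key point is that Lance's random access by row index makes batch-level reads cheap, so sharding row ranges across ranks and DataLoader workers is straightforward.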
I realized these would be great features for Hugging Face to support natively.
Your contribution
I'm not ready to make a PR yet, but I'm open to it with the right pointers!