Releases: huggingface/datasets
4.4.1
4.4.0
Dataset Features
- Add nifti support by @CloseChoice in #7815
  - Load medical imaging datasets from Hugging Face:
    ```python
    ds = load_dataset("username/my_nifti_dataset")
    ds["train"][0]
    # {"nifti": <nibabel.nifti1.Nifti1Image>}
    ```
  - Load medical imaging datasets from your disk:
    ```python
    from datasets import Dataset, Nifti

    files = ["/path/to/scan_001.nii.gz", "/path/to/scan_002.nii.gz"]
    ds = Dataset.from_dict({"nifti": files}).cast_column("nifti", Nifti())
    ds[0]
    # {"nifti": <nibabel.nifti1.Nifti1Image>}
    ```
  - Documentation: https://huggingface.co/docs/datasets/nifti_dataset
- Add num channels to audio by @CloseChoice in #7840
  ```python
  # samples have shape (num_channels, num_samples)
  ds = ds.cast_column("audio", Audio())                # default: use all channels
  ds = ds.cast_column("audio", Audio(num_channels=2))  # stereo
  ds = ds.cast_column("audio", Audio(num_channels=1))  # mono
  ```
What's Changed
- Fix random seed on shuffle and interleave_datasets by @CloseChoice in #7823
- fix ci compressionfs by @lhoestq in #7830
- fix: better args passthrough for _batch_setitems() by @sghng in #7817
- Fix: Properly render [!TIP] block in stream.shuffle documentation by @art-test-stack in #7833
- resolves the ValueError: Unable to avoid copy while creating an array by @ArjunJagdale in #7831
- fix column with transform by @lhoestq in #7843
- support fsspec 2025.10.0 by @lhoestq in #7844
New Contributors
- @sghng made their first contribution in #7817
- @art-test-stack made their first contribution in #7833
Full Changelog: 4.3.0...4.4.0
4.3.0
Dataset Features
Enable large scale distributed dataset streaming:
- Keep hffs cache in workers when streaming by @lhoestq in #7820
- Retry open hf file by @lhoestq in #7822
These improvements require huggingface_hub>=1.1.0 to take full effect.
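For illustration, a minimal sketch of the setup these changes target: a streaming dataset iterated from multiple PyTorch DataLoader workers (the repo name below is hypothetical).

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Hypothetical large dataset; streaming avoids downloading it in full
ds = load_dataset("username/large_dataset", split="train", streaming=True)

# Each worker now keeps its hffs cache and retries transient errors when opening files
dataloader = DataLoader(ds, batch_size=32, num_workers=8)
for batch in dataloader:
    ...
```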
What's Changed
- fix conda deps by @lhoestq in #7810
- Add pyarrow's binary view to features by @delta003 in #7795
- Fix polars cast column image by @CloseChoice in #7800
- Allow streaming hdf5 files by @lhoestq in #7814
- Fix batch_size default description in to_polars docstrings by @albertvillanova in #7824
- docs: document_dataset PDFs & OCR by @ethanknights in #7812
- Add custom fingerprint support to from_generator by @simonreise in #7533
- picklable batch_fn by @lhoestq in #7826
New Contributors
- @delta003 made their first contribution in #7795
- @CloseChoice made their first contribution in #7800
- @ethanknights made their first contribution in #7812
- @simonreise made their first contribution in #7533
Full Changelog: 4.2.0...4.3.0
4.2.0
Dataset Features
- Sample without replacement option when interleaving datasets by @radulescupetru in #7786
  ```python
  ds = interleave_datasets(datasets, stopping_strategy="all_exhausted_without_replacement")
  ```
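For illustration, a small self-contained example of the new stopping strategy (the toy data is not from the release notes):

```python
from datasets import Dataset, interleave_datasets

d1 = Dataset.from_dict({"text": ["a", "b", "c"]})
d2 = Dataset.from_dict({"text": ["x", "y"]})

# With "all_exhausted_without_replacement", exhausted datasets are not
# resampled, so every source example appears exactly once
ds = interleave_datasets([d1, d2], stopping_strategy="all_exhausted_without_replacement")
print(ds["text"])
```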
- Parquet: add on_bad_files argument to error/warn/skip bad files by @lhoestq in #7806
  ```python
  ds = load_dataset(parquet_dataset_id, on_bad_files="warn")
  ```
- Add parquet scan options and docs by @lhoestq in #7801
  - docs on selecting columns and filtering data efficiently:
    ```python
    ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])
    ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0)])
    ```
  - new argument to control buffering and caching when streaming:
    ```python
    import pyarrow
    import pyarrow.dataset

    fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
        cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20)
    )
    ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
    ```
What's Changed
- Document HDF5 support by @klamike in #7740
- update tips in docs by @lhoestq in #7790
- feat: avoid some copies in torch formatter by @drbh in #7787
- Support huggingface_hub v0.x and v1.x by @Wauplin in #7783
- Define CI future by @lhoestq in #7799
- More Parquet streaming docs by @lhoestq in #7803
- Less api calls when resolving data_files by @lhoestq in #7805
- typo by @lhoestq in #7807
Full Changelog: 4.1.1...4.2.0
4.1.1
What's Changed
- fix iterate nested field by @lhoestq in #7775
- Add support for arrow iterable when concatenating or interleaving by @radulescupetru in #7771
- fix empty dataset to_parquet by @lhoestq in #7779
New Contributors
- @radulescupetru made their first contribution in #7771
Full Changelog: 4.1.0...4.1.1
4.1.0
Dataset Features
- feat: use content defined chunking by @kszucs in #7589
  - internally uses use_content_defined_chunking=True when writing Parquet files - this enables fast deduped uploads to Hugging Face!
    ```python
    # Now faster thanks to content defined chunking
    ds.push_to_hub("username/dataset_name")
    ```
  - this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. It avoids re-uploading data that already exists somewhere on HF (in another file or version, for example). Content defined chunking sets Parquet page boundaries based on the content of the data, so that duplicate data is easy to detect.
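A hedged sketch of what this looks like under the hood when datasets writes a Parquet shard, assuming a pyarrow build that exposes the content defined chunking option:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"text": ["some text"] * 1_000})

# Page boundaries are derived from the data itself, so unchanged rows produce
# identical pages that Xet can dedupe across files and versions
pq.write_table(table, "shard.parquet", use_content_defined_chunking=True)
```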
- HDF5 support by @klamike in #7690
  - load HDF5 datasets in one line of code:
    ```python
    ds = load_dataset("username/dataset-with-hdf5-files")
    ```
  - each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows (see the sketch below)
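For illustration, a hedged sketch of that mapping, using h5py to write a small file (the layout and names here are assumptions, not from the release notes):

```python
import h5py
import numpy as np

# Two fields sharing the same first dimension: 10 rows
with h5py.File("scans.h5", "w") as f:
    f.create_dataset("image", data=np.zeros((10, 64, 64), dtype="uint8"))
    f.create_dataset("meta/label", data=np.arange(10))  # nested group -> nested field

# Once the file is in a Hub dataset repo, each field becomes a column:
# ds = load_dataset("username/dataset-with-hdf5-files")
```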
Other improvements and bug fixes
- Convert to string when needed + faster .zstd by @lhoestq in #7683
- fix audio cast storage from array + sampling_rate by @lhoestq in #7684
- Fix misleading add_column() usage example in docstring by @ArjunJagdale in #7648
- Allow dataset row indexing with np.int types (#7423) by @DavidRConnell in #7438
- Update fsspec max version to current release 2025.7.0 by @rootAvish in #7701
- Update dataset_dict push_to_hub by @lhoestq in #7711
- Retry intermediate commits too by @lhoestq in #7712
- num_proc=0 behave like None, num_proc=1 uses one worker (not main process) and clarify num_proc documentation by @tanuj-rai in #7702
- Update cli.mdx to refer to the new "hf" CLI by @evalstate in #7713
- fix num_proc=1 ci test by @lhoestq in #7714
- Docs: Use Image(mode="F") for PNG/JPEG depth maps by @lhoestq in #7715
- typo by @lhoestq in #7716
- fix largelist repr by @lhoestq in #7735
- Grammar fix: correct "showed" to "shown" in fingerprint.py by @brchristian in #7730
- Fix type hint train_test_split by @qgallouedec in #7736
- fix(webdataset): don't .lower() field_name by @YassineYousfi in #7726
- Refactor HDF5 and preserve tree structure by @klamike in #7743
- docs: Add column overwrite example to batch mapping guide by @Sanjaykumar030 in #7737
- Audio: use TorchCodec instead of Soundfile for encoding by @lhoestq in #7761
- Support pathlib.Path for feature input by @Joshua-Chin in #7755
- add support for pyarrow string view in features by @onursatici in #7718
- Fix typo in error message for cache directory deletion by @brchristian in #7749
- update torchcodec in ci by @lhoestq in #7764
- Bump dill to 0.4.0 by @Bomme in #7763
New Contributors
- @DavidRConnell made their first contribution in #7438
- @rootAvish made their first contribution in #7701
- @tanuj-rai made their first contribution in #7702
- @evalstate made their first contribution in #7713
- @brchristian made their first contribution in #7730
- @klamike made their first contribution in #7690
- @YassineYousfi made their first contribution in #7726
- @Sanjaykumar030 made their first contribution in #7737
- @kszucs made their first contribution in #7589
- @Joshua-Chin made their first contribution in #7755
- @onursatici made their first contribution in #7718
- @Bomme made their first contribution in #7763
Full Changelog: 4.0.0...4.1.0
4.0.0
New Features
- Add IterableDataset.push_to_hub() by @lhoestq in #7595
  ```python
  # Build streaming data pipelines in a few lines of code!
  from datasets import load_dataset

  ds = load_dataset(..., streaming=True)
  ds = ds.map(...).filter(...)
  ds.push_to_hub(...)
  ```
- Add num_proc= to .push_to_hub() (Dataset and IterableDataset) by @lhoestq in #7606
  ```python
  # Faster push to Hub! Available for both Dataset and IterableDataset
  ds.push_to_hub(..., num_proc=8)
  ```
- New Column object
  - Implementation of iteration over values of a column in an IterableDataset object by @TopCoder2K in #7564
  - Lazy column by @lhoestq in #7614
  ```python
  # Syntax:
  ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

  # Iterate on a column:
  for text in ds["text"]:
      ...

  # Load one cell without bringing the full column in memory
  first_text = ds["text"][0]  # equivalent to ds[0]["text"]
  ```
- Torchcodec decoding by @TyTodd in #7616
  - Enables streaming only the ranges you need!
    ```python
    # Don't download full audios/videos when it's not necessary
    # Now with torchcodec it only streams the required ranges/frames:
    from datasets import load_dataset

    ds = load_dataset(..., streaming=True)
    for example in ds:
        video = example["video"]
        frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
    ```
  - Requires torch>=2.7.0 and FFmpeg >= 4
  - Not available for Windows yet, but it is coming soon; in the meantime please use datasets<4.0
  - Load audio data with AudioDecoder:
    ```python
    audio = dataset[0]["audio"]
    # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
    samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
    samples.data         # tensor([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 2.3447e-06, -1.9127e-04, -5.3330e-05]])
    samples.sample_rate  # 16000

    # old syntax is still supported
    array, sr = audio["array"], audio["sampling_rate"]
    ```
  - Load video data with VideoDecoder:
    ```python
    video = dataset[0]["video"]
    # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
    first_frame = video.get_frame_at(0)
    first_frame.data.shape   # (3, 240, 320)
    first_frame.pts_seconds  # 0.0
    frames = video.get_frames_in_range(0, 6, 1)
    frames.data.shape        # torch.Size([5, 3, 240, 320])
    ```
Breaking changes
- Remove scripts altogether by @lhoestq in #7592
  - trust_remote_code is no longer supported
- Torchcodec decoding by @TyTodd in #7616
  - torchcodec replaces soundfile for audio decoding
  - torchcodec replaces decord for video decoding
- Replace Sequence by List by @lhoestq in #7634
  - Introduction of the List type:
    ```python
    from datasets import Features, List, Value

    features = Features({
        "texts": List(Value("string")),
        "four_paragraphs": List(Value("string"), length=4),
    })
    ```
  - Sequence was a legacy type from TensorFlow Datasets which converted lists of dicts to dicts of lists. It is no longer a type: it is now a utility that returns a List or a dict depending on the subfeature:
    ```python
    from datasets import Sequence, Value

    Sequence(Value("string"))             # List(Value("string"))
    Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
    ```
Other improvements and bug fixes
- Refactor Dataset.map to reuse cache files mapped with different num_proc by @ringohoffman in #7434
- fix string_to_dict test by @lhoestq in #7571
- Preserve formatting in concatenated IterableDataset by @francescorubbo in #7522
- Fix typos in PDF and Video documentation by @AndreaFrancis in #7579
- fix: Add embed_storage in Pdf feature by @AndreaFrancis in #7582
- load_dataset splits typing by @lhoestq in #7587
- Fixed typos by @TopCoder2K in #7572
- Fix regex library warnings by @emmanuel-ferdman in #7576
- [MINOR:TYPO] Update save_to_disk docstring by @cakiki in #7575
- Add missing property on RepeatExamplesIterable by @SilvanCodes in #7581
- Avoid multiple default config names by @albertvillanova in #7585
- Fix broken link to albumentations by @ternaus in #7593
- fix string_to_dict usage for windows by @lhoestq in #7598
- No TF in win tests by @lhoestq in #7603
- Docs and more methods for IterableDataset: push_to_hub, to_parquet... by @lhoestq in #7604
- Tests typing and fixes for push_to_hub by @lhoestq in #7608
- fix parallel push_to_hub in dataset_dict by @lhoestq in #7613
- remove unused code by @lhoestq in #7615
- Update _dill.py to use co_linetable for Python 3.10+ in place of co_lnotab by @qgallouedec in #7609
- Fixes in docs by @lhoestq in #7620
- Add albumentations to use dataset by @ternaus in #7596
- minor docs data aug by @lhoestq in #7621
- fix: raise error in FolderBasedBuilder when data_dir and data_files are missing by @ArjunJagdale in #7623
- fix save_infos by @lhoestq in #7639
- better features repr by @lhoestq in #7640
- update docs and docstrings by @lhoestq in #7641
- fix length for ci by @lhoestq in #7642
- Backward compat sequence instance by @lhoestq in #7643
- fix sequence ci by @lhoestq in #7644
- Custom metadata filenames by @lhoestq in #7663
- Update the beans dataset link in Preprocess by @HJassar in #7659
- Backward compat list feature by @lhoestq in #7666
- Fix infer list of images by @lhoestq in #7667
- Fix audio bytes by @lhoestq in #7670
- Fix double sequence by @lhoestq in #7672
New Contributors
- @TopCoder2K made their first contribution in #7564
- @francescorubbo made their first contribution in #7522
- @emmanuel-ferdman made their first contribution in #7576
- @SilvanCodes made their first contribution in #7581
- @ternaus made their first contribution in #7593
- @ArjunJagdale made their first contribution in #7623
- @TyTodd made their first contribution in #7616
- @HJassar made their first contribution in #7659
Full Changelog: 3.6.0...4.0.0
3.6.0
Dataset Features
- Enable xet in push to hub by @lhoestq in #7552
- Faster downloads/uploads with Xet storage
- more info: #7526
Other improvements and bug fixes
- Add try_original_type to DatasetDict.map by @yoshitomo-matsubara in #7544
- Avoid global umask for setting file mode. by @ryan-clancy in #7547
- Rebatch arrow iterables before formatted iterable by @lhoestq in #7553
- Document the HF_DATASETS_CACHE environment variable in the datasets cache documentation by @Harry-Yang0518 in #7532
- fix regression by @lhoestq in #7558
- fix: Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames (#7517) by @giraffacarp in #7521
- Remove aiohttp from direct dependencies by @akx in #7294
New Contributors
- @ryan-clancy made their first contribution in #7547
- @Harry-Yang0518 made their first contribution in #7532
- @giraffacarp made their first contribution in #7521
- @akx made their first contribution in #7294
Full Changelog: 3.5.1...3.6.0
3.5.1
Bug fixes
- support pyarrow 20 by @lhoestq in #7540
  - Fixes the pyarrow error TypeError: ArrayExtensionArray.to_pylist() got an unexpected keyword argument 'maps_as_pydicts'
- Write pdf in map by @lhoestq in #7487
Other improvements
- update fsspec 2025.3.0 by @peteski22 in #7478
- Support underscore int read instruction by @lhoestq in #7488
- Support skip_trying_type by @yoshitomo-matsubara in #7483
- pdf docs fixes by @lhoestq in #7519
- Remove conditions for Python < 3.9 by @cyyever in #7474
- mention av in video docs by @lhoestq in #7523
- correct use with polars example by @SiQube in #7524
- chore: fix typos by @afuetterer in #7436
New Contributors
- @peteski22 made their first contribution in #7478
- @yoshitomo-matsubara made their first contribution in #7483
- @SiQube made their first contribution in #7524
- @afuetterer made their first contribution in #7436
Full Changelog: 3.5.0...3.5.1
3.5.0
Datasets Features
- Introduce PDF support (#7318) by @yabramuvdi in #7325
  ```python
  >>> from datasets import load_dataset, Pdf
  >>> repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
  >>> dataset = load_dataset(repo, split="train")
  >>> dataset[0]["pdf"]
  <pdfplumber.pdf.PDF at 0x1075bc320>
  >>> dataset[0]["pdf"].pages[0].extract_text()
  ...
  ```
What's Changed
- Fix local pdf loading by @lhoestq in #7466
- Minor fix for metadata files in extension counter by @lhoestq in #7464
- Prioritize json by @lhoestq in #7476
New Contributors
- @yabramuvdi made their first contribution in #7325
Full Changelog: 3.4.1...3.5.0