Releases: huggingface/datasets
4.4.1
4.4.0
Dataset Features
- Add nifti support by @CloseChoice in #7815
  - Load medical imaging datasets from Hugging Face:
    ```python
    ds = load_dataset("username/my_nifti_dataset")
    ds["train"][0]
    # {"nifti": <nibabel.nifti1.Nifti1Image>}
    ```
  - Load medical imaging datasets from your disk:
    ```python
    from datasets import Dataset, Nifti

    files = ["/path/to/scan_001.nii.gz", "/path/to/scan_002.nii.gz"]
    ds = Dataset.from_dict({"nifti": files}).cast_column("nifti", Nifti())
    ds[0]
    # {"nifti": <nibabel.nifti1.Nifti1Image>}
    ```
  - Documentation: https://huggingface.co/docs/datasets/nifti_dataset
- Add num channels to audio by @CloseChoice in #7840
  ```python
  # samples have shape (num_channels, num_samples)
  ds = ds.cast_column("audio", Audio())                # default: use all channels
  ds = ds.cast_column("audio", Audio(num_channels=2))  # stereo
  ds = ds.cast_column("audio", Audio(num_channels=1))  # mono
  ```
What's Changed
- Fix random seed on shuffle and interleave_datasets by @CloseChoice in #7823
- fix ci compressionfs by @lhoestq in #7830
- fix: better args passthrough for _batch_setitems() by @sghng in #7817
- Fix: Properly render [!TIP] block in stream.shuffle documentation by @art-test-stack in #7833
- resolves the ValueError: Unable to avoid copy while creating an array by @ArjunJagdale in #7831
- fix column with transform by @lhoestq in #7843
- support fsspec 2025.10.0 by @lhoestq in #7844
New Contributors
- @sghng made their first contribution in #7817
- @art-test-stack made their first contribution in #7833
Full Changelog: 4.3.0...4.4.0
4.3.0
Dataset Features
Enable large scale distributed dataset streaming:
- Keep hffs cache in workers when streaming by @lhoestq in #7820
- Retry open hf file by @lhoestq in #7822
These improvements require huggingface_hub>=1.1.0 to take full effect.
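For illustration, a minimal sketch of the setup these changes target: a streaming dataset iterated from multiple PyTorch DataLoader workers (the repo name below is hypothetical).

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Hypothetical large dataset; streaming avoids downloading it in full
ds = load_dataset("username/large_dataset", split="train", streaming=True)

# Each worker now keeps its hffs cache and retries transient errors when opening files
dataloader = DataLoader(ds, batch_size=32, num_workers=8)
for batch in dataloader:
    ...
```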
What's Changed
- fix conda deps by @lhoestq in #7810
- Add pyarrow's binary view to features by @delta003 in #7795
- Fix polars cast column image by @CloseChoice in #7800
- Allow streaming hdf5 files by @lhoestq in #7814
- Fix batch_size default description in to_polars docstrings by @albertvillanova in #7824
- docs: document_dataset PDFs & OCR by @ethanknights in #7812
- Add custom fingerprint support to from_generator by @simonreise in #7533
- picklable batch_fn by @lhoestq in #7826
New Contributors
- @delta003 made their first contribution in #7795
- @CloseChoice made their first contribution in #7800
- @ethanknights made their first contribution in #7812
- @simonreise made their first contribution in #7533
Full Changelog: 4.2.0...4.3.0
4.2.0
Dataset Features
- Sample without replacement option when interleaving datasets by @radulescupetru in #7786
  ```python
  ds = interleave_datasets(datasets, stopping_strategy="all_exhausted_without_replacement")
  ```
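For illustration, a small self-contained example of the new stopping strategy (the toy data is not from the release notes):

```python
from datasets import Dataset, interleave_datasets

d1 = Dataset.from_dict({"text": ["a", "b", "c"]})
d2 = Dataset.from_dict({"text": ["x", "y"]})

# With "all_exhausted_without_replacement", exhausted datasets are not
# resampled, so every source example appears exactly once
ds = interleave_datasets([d1, d2], stopping_strategy="all_exhausted_without_replacement")
print(ds["text"])
```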
- Parquet: add on_bad_files argument to error/warn/skip bad files by @lhoestq in #7806
  ```python
  ds = load_dataset(parquet_dataset_id, on_bad_files="warn")
  ```
- Add parquet scan options and docs by @lhoestq in #7801
  - docs on selecting columns and filtering data efficiently:
    ```python
    ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])
    ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0)])
    ```
  - new argument to control buffering and caching when streaming:
    ```python
    import pyarrow
    import pyarrow.dataset

    fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
        cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20)
    )
    ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
    ```
What's Changed
- Document HDF5 support by @klamike in #7740
- update tips in docs by @lhoestq in #7790
- feat: avoid some copies in torch formatter by @drbh in #7787
- Support huggingface_hub v0.x and v1.x by @Wauplin in #7783
- Define CI future by @lhoestq in #7799
- More Parquet streaming docs by @lhoestq in #7803
- Less api calls when resolving data_files by @lhoestq in #7805
- typo by @lhoestq in #7807
Full Changelog: 4.1.1...4.2.0
4.1.1
What's Changed
- fix iterate nested field by @lhoestq in #7775
- Add support for arrow iterable when concatenating or interleaving by @radulescupetru in #7771
- fix empty dataset to_parquet by @lhoestq in #7779
New Contributors
- @radulescupetru made their first contribution in #7771
Full Changelog: 4.1.0...4.1.1
4.1.0
Dataset Features
- feat: use content defined chunking by @kszucs in #7589
  - internally uses use_content_defined_chunking=True when writing Parquet files - this enables fast deduped uploads to Hugging Face!
    ```python
    # Now faster thanks to content defined chunking
    ds.push_to_hub("username/dataset_name")
    ```
  - this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. It avoids re-uploading data that already exists somewhere on HF (in another file or version, for example). Content defined chunking sets Parquet page boundaries based on the content of the data, so that duplicate data is easy to detect.
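A hedged sketch of what this looks like under the hood when datasets writes a Parquet shard, assuming a pyarrow build that exposes the content defined chunking option:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"text": ["some text"] * 1_000})

# Page boundaries are derived from the data itself, so unchanged rows produce
# identical pages that Xet can dedupe across files and versions
pq.write_table(table, "shard.parquet", use_content_defined_chunking=True)
```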
- HDF5 support by @klamike in #7690
  - load HDF5 datasets in one line of code:
    ```python
    ds = load_dataset("username/dataset-with-hdf5-files")
    ```
  - each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows (see the sketch below)
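For illustration, a hedged sketch of that mapping, using h5py to write a small file (the layout and names here are assumptions, not from the release notes):

```python
import h5py
import numpy as np

# Two fields sharing the same first dimension: 10 rows
with h5py.File("scans.h5", "w") as f:
    f.create_dataset("image", data=np.zeros((10, 64, 64), dtype="uint8"))
    f.create_dataset("meta/label", data=np.arange(10))  # nested group -> nested field

# Once the file is in a Hub dataset repo, each field becomes a column:
# ds = load_dataset("username/dataset-with-hdf5-files")
```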
Other improvements and bug fixes
- Convert to string when needed + faster .zstd by @lhoestq in #7683
- fix audio cast storage from array + sampling_rate by @lhoestq in #7684
- Fix misleading add_column() usage example in docstring by @ArjunJagdale in #7648
- Allow dataset row indexing with np.int types (#7423) by @DavidRConnell in #7438
- Update fsspec max version to current release 2025.7.0 by @rootAvish in #7701
- Update dataset_dict push_to_hub by @lhoestq in #7711
- Retry intermediate commits too by @lhoestq in #7712
- num_proc=0 behave like None, num_proc=1 uses one worker (not main process) and clarify num_proc documentation by @tanuj-rai in #7702
- Update cli.mdx to refer to the new "hf" CLI by @evalstate in #7713
- fix num_proc=1 ci test by @lhoestq in #7714
- Docs: Use Image(mode="F") for PNG/JPEG depth maps by @lhoestq in #7715
- typo by @lhoestq in #7716
- fix largelist repr by @lhoestq in #7735
- Grammar fix: correct "showed" to "shown" in fingerprint.py by @brchristian in #7730
- Fix type hint train_test_split by @qgallouedec in #7736
- fix(webdataset): don't .lower() field_name by @YassineYousfi in #7726
- Refactor HDF5 and preserve tree structure by @klamike in #7743
- docs: Add column overwrite example to batch mapping guide by @Sanjaykumar030 in #7737
- Audio: use TorchCodec instead of Soundfile for encoding by @lhoestq in #7761
- Support pathlib.Path for feature input by @Joshua-Chin in #7755
- add support for pyarrow string view in features by @onursatici in #7718
- Fix typo in error message for cache directory deletion by @brchristian in #7749
- update torchcodec in ci by @lhoestq in #7764
- Bump dill to 0.4.0 by @Bomme in #7763
New Contributors
- @DavidRConnell made their first contribution in #7438
- @rootAvish made their first contribution in #7701
- @tanuj-rai made their first contribution in #7702
- @evalstate made their first contribution in #7713
- @brchristian made their first contribution in #7730
- @klamike made their first contribution in #7690
- @YassineYousfi made their first contribution in #7726
- @Sanjaykumar030 made their first contribution in #7737
- @kszucs made their first contribution in #7589
- @Joshua-Chin made their first contribution in #7755
- @onursatici made their first contribution in #7718
- @Bomme made their first contribution in #7763
Full Changelog: 4.0.0...4.1.0
4.0.0
New Features
- Add IterableDataset.push_to_hub() by @lhoestq in #7595
  ```python
  # Build streaming data pipelines in a few lines of code!
  from datasets import load_dataset

  ds = load_dataset(..., streaming=True)
  ds = ds.map(...).filter(...)
  ds.push_to_hub(...)
  ```
- Add num_proc= to .push_to_hub() (Dataset and IterableDataset) by @lhoestq in #7606
  ```python
  # Faster push to Hub! Available for both Dataset and IterableDataset
  ds.push_to_hub(..., num_proc=8)
  ```
- New Column object
  - Implementation of iteration over values of a column in an IterableDataset object by @TopCoder2K in #7564
  - Lazy column by @lhoestq in #7614
  ```python
  # Syntax:
  ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

  # Iterate on a column:
  for text in ds["text"]:
      ...

  # Load one cell without bringing the full column in memory
  first_text = ds["text"][0]  # equivalent to ds[0]["text"]
  ```
- Torchcodec decoding by @TyTodd in #7616
  - Enables streaming only the ranges you need!
    ```python
    # Don't download full audios/videos when it's not necessary
    # Now with torchcodec it only streams the required ranges/frames:
    from datasets import load_dataset

    ds = load_dataset(..., streaming=True)
    for example in ds:
        video = example["video"]
        frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
    ```
  - Requires torch>=2.7.0 and FFmpeg >= 4
  - Not available for Windows yet, but it is coming soon; in the meantime please use datasets<4.0
  - Load audio data with AudioDecoder:
    ```python
    audio = dataset[0]["audio"]
    # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
    samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
    samples.data         # tensor([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 2.3447e-06, -1.9127e-04, -5.3330e-05]])
    samples.sample_rate  # 16000

    # old syntax is still supported
    array, sr = audio["array"], audio["sampling_rate"]
    ```
  - Load video data with VideoDecoder:
    ```python
    video = dataset[0]["video"]
    # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
    first_frame = video.get_frame_at(0)
    first_frame.data.shape   # (3, 240, 320)
    first_frame.pts_seconds  # 0.0
    frames = video.get_frames_in_range(0, 6, 1)
    frames.data.shape        # torch.Size([5, 3, 240, 320])
    ```
Breaking changes
- Remove scripts altogether by @lhoestq in #7592
  - trust_remote_code is no longer supported
- Torchcodec decoding by @TyTodd in #7616
  - torchcodec replaces soundfile for audio decoding
  - torchcodec replaces decord for video decoding
- Replace Sequence by List by @lhoestq in #7634
  - Introduction of the List type:
    ```python
    from datasets import Features, List, Value

    features = Features({
        "texts": List(Value("string")),
        "four_paragraphs": List(Value("string"), length=4),
    })
    ```
  - Sequence was a legacy type from TensorFlow Datasets which converted lists of dicts to dicts of lists. It is no longer a type: it is now a utility that returns a List or a dict depending on the subfeature:
    ```python
    from datasets import Sequence, Value

    Sequence(Value("string"))             # List(Value("string"))
    Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
    ```
Other improvements and bug fixes
- Refactor Dataset.map to reuse cache files mapped with different num_proc by @ringohoffman in #7434
- fix string_to_dict test by @lhoestq in #7571
- Preserve formatting in concatenated IterableDataset by @francescorubbo in #7522
- Fix typos in PDF and Video documentation by @AndreaFrancis in #7579
- fix: Add embed_storage in Pdf feature by @AndreaFrancis in #7582
- load_dataset splits typing by @lhoestq in #7587
- Fixed typos by @TopCoder2K in #7572
- Fix regex library warnings by @emmanuel-ferdman in #7576
- [MINOR:TYPO] Update save_to_disk docstring by @cakiki in #7575
- Add missing property on RepeatExamplesIterable by @SilvanCodes in #7581
- Avoid multiple default config names by @albertvillanova in #7585
- Fix broken link to albumentations by @ternaus in #7593
- fix string_to_dict usage for windows by @lhoestq in #7598
- No TF in win tests by @lhoestq in #7603
- Docs and more methods for IterableDataset: push_to_hub, to_parquet... by @lhoestq in #7604
- Tests typing and fixes for push_to_hub by @lhoestq in #7608
- fix parallel push_to_hub in dataset_dict by @lhoestq in #7613
- remove unused code by @lhoestq in #7615
- Update _dill.py to use co_linetable for Python 3.10+ in place of co_lnotab by @qgallouedec in #7609
- Fixes in docs by @lhoestq in #7620
- Add albumentations to use dataset by @ternaus in #7596
- minor docs data aug by @lhoestq in #7621
- fix: raise error in FolderBasedBuilder when data_dir and data_files are missing by @ArjunJagdale in #7623
- fix save_infos by @lhoestq in #7639
- better features repr by @lhoestq in #7640
- update docs and docstrings by @lhoestq in #7641
- fix length for ci by @lhoestq in #7642
- Backward compat sequence instance by @lhoestq in #7643
- fix sequence ci by @lhoestq in #7644
- Custom metadata filenames by @lhoestq in #7663
- Update the beans dataset link in Preprocess by @HJassar in #7659
- Backward compat list feature by @lhoestq in #7666
- Fix infer list of images by @lhoestq in #7667
- Fix audio bytes by @lhoestq in #7670
- Fix double sequence by @lhoestq in #7672
New Contributors
- @TopCoder2K made their first contribution in #7564
- @francescorubbo made their first contribution in #7522
- @emmanuel-ferdman made their first contribution in #7576
- @SilvanCodes made their first contribution in #7581
- @ternaus made their first contribution in #7593
- @ArjunJagdale made their first contribution in #7623
- @TyTodd made their first contribution in #7616
- @HJassar made their first contribution in #7659
Full Changelog: 3.6.0...4.0.0
3.6.0
Dataset Features
- Enable xet in push to hub by @lhoestq in #7552
- Faster downloads/uploads with Xet storage
- more info: #7526
Other improvements and bug fixes
- Add try_original_type to DatasetDict.map by @yoshitomo-matsubara in #7544
- Avoid global umask for setting file mode. by @ryan-clancy in #7547
- Rebatch arrow iterables before formatted iterable by @lhoestq in #7553
- Document the HF_DATASETS_CACHE environment variable in the datasets cache documentation by @Harry-Yang0518 in #7532
- fix regression by @lhoestq in #7558
- fix: Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames (#7517) by @giraffacarp in #7521
- Remove aiohttp from direct dependencies by @akx in #7294
New Contributors
- @ryan-clancy made their first contribution in #7547
- @Harry-Yang0518 made their first contribution in #7532
- @giraffacarp made their first contribution in #7521
- @akx made their first contribution in #7294
Full Changelog: 3.5.1...3.6.0
3.5.1
Bug fixes
- support pyarrow 20 by @lhoestq in #7540
  - Fixes the pyarrow error TypeError: ArrayExtensionArray.to_pylist() got an unexpected keyword argument 'maps_as_pydicts'
- Write pdf in map by @lhoestq in #7487
Other improvements
- update fsspec 2025.3.0 by @peteski22 in #7478
- Support underscore int read instruction by @lhoestq in #7488
- Support skip_trying_type by @yoshitomo-matsubara in #7483
- pdf docs fixes by @lhoestq in #7519
- Remove conditions for Python < 3.9 by @cyyever in #7474
- mention av in video docs by @lhoestq in #7523
- correct use with polars example by @SiQube in #7524
- chore: fix typos by @afuetterer in #7436
New Contributors
- @peteski22 made their first contribution in #7478
- @yoshitomo-matsubara made their first contribution in #7483
- @SiQube made their first contribution in #7524
- @afuetterer made their first contribution in #7436
Full Changelog: 3.5.0...3.5.1
3.5.0
Datasets Features
- Introduce PDF support (#7318) by @yabramuvdi in #7325
  ```python
  >>> from datasets import load_dataset, Pdf
  >>> repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
  >>> dataset = load_dataset(repo, split="train")
  >>> dataset[0]["pdf"]
  <pdfplumber.pdf.PDF at 0x1075bc320>
  >>> dataset[0]["pdf"].pages[0].extract_text()
  ...
  ```
What's Changed
- Fix local pdf loading by @lhoestq in #7466
- Minor fix for metadata files in extension counter by @lhoestq in #7464
- Prioritize json by @lhoestq in #7476
New Contributors
- @yabramuvdi made their first contribution in #7325
Full Changelog: 3.4.1...3.5.0