Releases: huggingface/datasets

4.4.1

05 Nov 16:01
6a6983a

Bug fixes and improvements

  • Better streaming retries (504 and 429) by @lhoestq in #7847
  • DOC: remove mode parameter in docstring of pdf and video feature by @CloseChoice in #7848
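
A minimal sketch of tuning the retry behavior, assuming the datasets.config knobs below (their exact names are an assumption, not confirmed by the release notes):

import datasets.config

# Hedged sketch: transient HTTP 504/429 errors during streaming are now retried;
# the attribute names here are an assumption.
datasets.config.STREAMING_READ_MAX_RETRIES = 30    # retry transient errors for longer
datasets.config.STREAMING_READ_RETRY_INTERVAL = 5  # seconds between retries

from datasets import load_dataset
ds = load_dataset("username/dataset_name", streaming=True)  # placeholder repo id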

Full Changelog: 4.4.0...4.4.1

4.4.0

04 Nov 10:42
232cb10

Dataset Features

  • Add nifti support by @CloseChoice in #7815

    • Load medical imaging datasets from Hugging Face:
    from datasets import load_dataset

    ds = load_dataset("username/my_nifti_dataset")
    ds["train"][0]  # {"nifti": <nibabel.nifti1.Nifti1Image>}
    • Load medical imaging datasets from your disk:
    from datasets import Dataset, Nifti

    files = ["/path/to/scan_001.nii.gz", "/path/to/scan_002.nii.gz"]
    ds = Dataset.from_dict({"nifti": files}).cast_column("nifti", Nifti())
    ds[0]  # Dataset.from_dict returns a single split: {"nifti": <nibabel.nifti1.Nifti1Image>}
  • Add num channels to audio by @CloseChoice in #7840

# samples have shape (num_channels, num_samples)
from datasets import Audio

ds = ds.cast_column("audio", Audio())  # default: keep all channels
ds = ds.cast_column("audio", Audio(num_channels=2))  # stereo
ds = ds.cast_column("audio", Audio(num_channels=1))  # mono
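
After casting, decoded samples carry the channel dimension first; a short sketch assuming the torchcodec-backed AudioDecoder introduced in 4.0.0 (see that release below):

# Sketch: inspect the decoded shape after casting to stereo.
audio = ds[0]["audio"]             # torchcodec AudioDecoder
samples = audio.get_all_samples()
samples.data.shape                 # (2, num_samples) with Audio(num_channels=2)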

What's Changed

New Contributors

Full Changelog: 4.3.0...4.4.0

4.3.0

23 Oct 16:33
41c0529

Dataset Features

Enable large scale distributed dataset streaming: this release makes streaming more robust at scale for distributed training jobs (see the sketch below).

These improvements require huggingface_hub>=1.1.0 to take full effect.
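
For example, an IterableDataset can be sharded across training nodes with split_dataset_by_node; a minimal sketch with a placeholder repo id and ranks:

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

ds = load_dataset("username/large_dataset", split="train", streaming=True)
ds = split_dataset_by_node(ds, rank=0, world_size=8)  # each rank streams its own shards
for example in ds:
    ...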

What's Changed

New Contributors

Full Changelog: 4.2.0...4.3.0

4.2.0

09 Oct 16:18
7e1350b

Dataset Features

  • Sample without replacement option when interleaving datasets by @radulescupetru in #7786

    # "datasets" here is a list of Dataset or IterableDataset objects
    from datasets import interleave_datasets

    ds = interleave_datasets(datasets, stopping_strategy="all_exhausted_without_replacement")
  • Parquet: add on_bad_files argument to error/warn/skip bad files by @lhoestq in #7806

    ds = load_dataset(parquet_dataset_id, on_bad_files="warn")
  • Add parquet scan options and docs by @lhoestq in #7801

    • new docs on selecting columns and filtering data efficiently
    ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])
    ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0)])
    • new argument to control buffering and caching when streaming
    import pyarrow
    import pyarrow.dataset

    fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
        cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20)
    )
    ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
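
For reference, the filters argument follows pyarrow's tuple convention (tuples in a list are AND-ed); a hedged sketch with placeholder column names, assuming the argument is passed through to pyarrow:

# Hedged sketch: combine conditions pyarrow-style (AND within a list).
ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0), ("col_1", ">", 5)])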

What's Changed

New Contributors

Full Changelog: 4.1.1...4.2.0

4.1.1

18 Sep 13:15
9be15a7

What's Changed

New Contributors

Full Changelog: 4.1.0...4.1.1

4.1.0

15 Sep 16:41
dd280cb

Dataset Features

  • feat: use content defined chunking by @kszucs in #7589

    • internally uses use_content_defined_chunking=True when writing Parquet files
    • this enables fast, deduplicated uploads to Hugging Face!
    # Now faster thanks to content defined chunking
    ds.push_to_hub("username/dataset_name")
    • this optimizes Parquet for Xet, the deduplication-based storage backend of Hugging Face: it avoids re-uploading data that already exists somewhere on HF (in another file or version, for example). Content defined chunking sets Parquet page boundaries based on the data's content, so duplicate data is easy to detect.
  • Concurrent push_to_hub by @lhoestq in #7708

  • Concurrent IterableDataset push_to_hub by @lhoestq in #7710

  • HDF5 support by @klamike in #7690

    • load HDF5 datasets in one line of code
    ds = load_dataset("username/dataset-with-hdf5-files")
    • each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows (see the sketch below)
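
A minimal sketch of an HDF5 layout that maps to columns, assuming h5py is installed (file and field names are placeholders):

import h5py
import numpy as np

# Each field becomes a column; the first dimension is the row axis.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("images", data=np.zeros((100, 3, 32, 32), dtype=np.float32))  # 100 rows
    f.create_dataset("labels", data=np.arange(100))                                # 100 rows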

Other improvements and bug fixes

New Contributors

Full Changelog: 4.0.0...4.1.0

4.0.0

09 Jul 14:54
b0de7a8

New Features

  • Add IterableDataset.push_to_hub() by @lhoestq in #7595

    # Build streaming data pipelines in a few lines of code!
    from datasets import load_dataset
    
    ds = load_dataset(..., streaming=True)
    ds = ds.map(...).filter(...)
    ds.push_to_hub(...)
  • Add num_proc= to .push_to_hub() (Dataset and IterableDataset) by @lhoestq in #7606

    # Faster push to Hub! Available for both Dataset and IterableDataset
    ds.push_to_hub(..., num_proc=8)
  • New Column object

    # Syntax:
    ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)
    
    # Iterate on a column:
    for text in ds["text"]:
        ...
    
    # Load one cell without bringing the full column in memory
    first_text = ds["text"][0]  # equivalent to ds[0]["text"]
  • Torchcodec decoding by @TyTodd in #7616

    • Enables streaming only the ranges you need!
    # Don't download full audios/videos when it's not necessary
    # Now with torchcodec it only streams the required ranges/frames:
    from datasets import load_dataset
    
    ds = load_dataset(..., streaming=True)
    for example in ds:
        video = example["video"]
        frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
    • Requires torch>=2.7.0 and FFmpeg >= 4
    • Not available for Windows yet but it is coming soon - in the meantime please use datasets<4.0
    • Load audio data with AudioDecoder:
    audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
    samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
    samples.data  # tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06, -1.9127e-04, -5.3330e-05]])
    samples.sample_rate  # 16000
    
    # old syntax is still supported
    array, sr = audio["array"], audio["sampling_rate"]
    • Load video data with VideoDecoder:
    video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
    first_frame = video.get_frame_at(0)
    first_frame.data.shape  # (3, 240, 320)
    first_frame.pts_seconds  # 0.0
    frames = video.get_frames_in_range(0, 6, 1)
    frames.data.shape  # torch.Size([5, 3, 240, 320])

Breaking changes

  • Remove scripts altogether by @lhoestq in #7592

    • trust_remote_code is no longer supported
  • Torchcodec decoding by @TyTodd in #7616

    • torchcodec replaces soundfile for audio decoding
    • torchcodec replaces decord for video decoding
  • Replace Sequence by List by @lhoestq in #7634

    • Introduction of the List type
    from datasets import Features, List, Value
    
    features = Features({
        "texts": List(Value("string")),
        "four_paragraphs": List(Value("string"), length=4)
    })
    • Sequence was a legacy type from TensorFlow Datasets that converted lists of dicts to dicts of lists. It is no longer a type: it is now a utility that returns a List or a dict depending on the subfeature
    from datasets import Sequence
    
    Sequence(Value("string"))  # List(Value("string"))
    Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}

Other improvements and bug fixes

New Contributors

Full Changelog: 3.6.0...4.0.0

3.6.0

07 May 15:17
458f45a

Dataset Features

  • Enable xet in push to hub by @lhoestq in #7552
    • Faster downloads/uploads with Xet storage
    • more info: #7526
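
With the Xet client installed alongside huggingface_hub, uploads are deduplicated at the chunk level; a minimal sketch with placeholder names (the packaging detail is an assumption):

# pip install "huggingface_hub[hf_xet]"   # assumption: Xet client packaging
from datasets import load_dataset

ds = load_dataset("username/source_dataset", split="train")  # placeholder
ds.push_to_hub("username/dataset_name")  # chunk-level deduplicated upload via Xet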

Other improvements and bug fixes

New Contributors

Full Changelog: 3.5.1...3.6.0

3.5.1

28 Apr 14:02
2e94045

Bug fixes

  • support pyarrow 20 by @lhoestq in #7540
    • Fix pyarrow error TypeError: ArrayExtensionArray.to_pylist() got an unexpected keyword argument 'maps_as_pydicts'
  • Write pdf in map by @lhoestq in #7487

Other improvements

New Contributors

Full Changelog: 3.5.0...3.5.1

3.5.0

27 Mar 16:38
0b5998a

Dataset Features

  • PDF support: load PDF datasets with the new Pdf feature

>>> from datasets import load_dataset, Pdf
>>> repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
>>> dataset = load_dataset(repo, split="train")
>>> dataset[0]["pdf"]
<pdfplumber.pdf.PDF at 0x1075bc320>
>>> dataset[0]["pdf"].pages[0].extract_text()
...

What's Changed

New Contributors

Full Changelog: 3.4.1...3.5.0