feat: use content defined chunking #7589
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 960db25 to ef901ea
Need to set …
We should consider enabling page indexes by default when writing parquet files, to allow page pruning in readers like the next dataset viewer (huggingface/dataset-viewer#3199).
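For reference, a minimal sketch of what enabling page indexes looks like at the pyarrow level, assuming the `write_page_index` option of `pyarrow.parquet.write_table` (the table and file name here are only illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Tiny illustrative table; real dataset shards would be much larger.
table = pa.table({"text": ["a", "b", "c"], "label": [0, 1, 0]})

# Writing the column/offset indexes lets page-pruning readers skip individual
# data pages instead of having to read whole row groups.
pq.write_table(table, "data.parquet", write_page_index=True)
```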
src/datasets/config.py (Outdated)

  # Batch size constants. For more info, see:
  # https://github.com/apache/arrow/blob/master/docs/source/cpp/arrays.rst#size-limitations-and-recommendations)
- DEFAULT_MAX_BATCH_SIZE = 1000
+ DEFAULT_MAX_BATCH_SIZE = 1024 * 1024
This is the default Arrow row group size. If we choose too small a row group size, we can't benefit from CDC chunking as much.
Maybe we'll need to auto-tune the row group size to aim for a 30MB-300MB range, otherwise we can end up with row groups of multiple GBs.
Would it make sense to use the default row group size and expect readers to rely on the page index to fetch only the required bits? Not sure if that exists in DuckDB.
Most frameworks read row group by row group; that's why we need row groups of a reasonable size anyway.
OK, I just RTFM:
I updated the PR to write row groups of 100MB (uncompressed) instead of relying on a bigger default number of rows per row group. Let me know what you think :) This way it should provide good performance for CDC while letting the Viewer work correctly without OOM.
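As a rough sketch of the idea (not the exact code in this PR), the number of rows per row group can be derived from the average uncompressed row size reported by Arrow; the 100MB target and helper name below are illustrative:

```python
import pyarrow as pa

TARGET_ROW_GROUP_BYTES = 100 * 1024 * 1024  # ~100MB of uncompressed data per row group

def rows_per_row_group(table: pa.Table) -> int:
    """Derive a row count per row group from a target uncompressed size."""
    if table.num_rows == 0:
        return 1
    avg_row_bytes = max(1, table.nbytes // table.num_rows)
    return max(1, TARGET_ROW_GROUP_BYTES // avg_row_bytes)

# e.g. pq.write_table(table, path, row_group_size=rows_per_row_group(table))
```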
Have you tried it on some datasets? I think it's important to choose a good default, as it will impact all datasets, and it costs a lot to recompute if we need to change it later. Not saying the value is bad, I don't know, but it would be good to have some validation before setting it.
I tried with both Dataset.push_to_hub() and IterableDataset.push_to_hub() on https://huggingface.co/datasets/lmsys/lmsys-chat-1m and it worked well. The viewer works great too (see the temporary repository https://huggingface.co/datasets/lhoestq/tmp-lmsys) and the row groups ended up with the right size. I might do another run with the Hermes dataset showcased in the blog post, as a final double check that filter + reupload dedupes well.
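For context, the check above boils down to something like this (the target repository names are placeholders; the streaming one is made up):

```python
from datasets import load_dataset

# Dataset.push_to_hub()
ds = load_dataset("lmsys/lmsys-chat-1m", split="train")
ds.push_to_hub("lhoestq/tmp-lmsys")

# IterableDataset.push_to_hub() via streaming mode
ids = load_dataset("lmsys/lmsys-chat-1m", split="train", streaming=True)
ids.push_to_hub("lhoestq/tmp-lmsys-streaming")  # placeholder repo id
```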
I will also run the estimator to check the impact on deduplication. In general, smaller row groups decrease deduplication efficiency, so ideally we should use bigger row groups with page indexes. BTW, could we use smaller row groups only for …
I reproduced the example in the blog post. It uploads 60MB of new data for each 210MB shard of filtered OpenHermes-2.5 (after shuffling). The baseline is 20MB for one single-row-group shard written with pandas (unshuffled in the blog post), which is not impacted by file boundaries and row group boundaries. (FYI, increasing the row group size from 100MB to 200MB gives roughly the same result of 60MB of new data per 210MB shard.)
I just added … Merging now :) Let me know if there are other things to change before I can do a new release.
Use content defined chunking by default when writing parquet files, in `io.parquet.ParquetDatasetReader` and `arrow_writer.ParquetWriter`. It requires a new pyarrow pin ">=21.0.0", which is released now.
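For readers landing here, a minimal sketch of what content defined chunking looks like at the pyarrow level, assuming the `use_content_defined_chunking` option of the pyarrow 21.0 parquet writer (the table and file name are illustrative; the actual `datasets` integration lives in `arrow_writer.ParquetWriter`):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"text": ["hello", "world"]})

# Content defined chunking aligns data page boundaries with the content itself,
# so small edits or filters only change a few pages and deduplicated uploads
# stay effective across re-uploads.
pq.write_table(table, "shard-00000.parquet", use_content_defined_chunking=True)
```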