
Conversation

@kszucs
Member

@kszucs kszucs commented May 29, 2025

Use content-defined chunking (CDC) by default when writing parquet files.

  • set the parameters in io.parquet.ParquetDatasetReader
  • set the parameters in arrow_writer.ParquetWriter

This requires a new pyarrow pin ">=21.0.0", which has now been released.
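For context, a minimal sketch (not the datasets implementation) of what opting into CDC looks like, assuming the `use_content_defined_chunking` writer option introduced in pyarrow 21.0.0; the exact parameters that io.parquet.ParquetDatasetReader and arrow_writer.ParquetWriter set are defined in this PR:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"text": ["some", "example", "rows"]})

# Let pyarrow pick the CDC chunking parameters; this is the pyarrow >= 21.0.0
# writer option that this PR builds on.
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)
```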

@kszucs kszucs changed the title from "feat: use content defined chunking in io.parquet.ParquetDatasetReader" to "feat: use content defined chunking" May 29, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@kszucs kszucs force-pushed the cdc branch 2 times, most recently from 960db25 to ef901ea Compare June 8, 2025 15:06
@kszucs
Member Author

kszucs commented Jun 16, 2025

Need to set DEFAULT_MAX_BATCH_SIZE = 1024 * 1024

@kszucs
Member Author

kszucs commented Jul 24, 2025

We should consider enabling page indexes by default when writing parquet files, to support page-pruning readers like the next dataset viewer (huggingface/dataset-viewer#3199).

@kszucs kszucs marked this pull request as ready for review July 25, 2025 10:59
  # Batch size constants. For more info, see:
  # https://github.com/apache/arrow/blob/master/docs/source/cpp/arrays.rst#size-limitations-and-recommendations)
- DEFAULT_MAX_BATCH_SIZE = 1000
+ DEFAULT_MAX_BATCH_SIZE = 1024 * 1024
Member Author

@kszucs kszucs Jul 25, 2025


This is the default arrow row group size. If we choose too small a row group size, we cannot benefit from CDC chunking as much.

@lhoestq
Member

lhoestq commented Aug 13, 2025

> Need to set DEFAULT_MAX_BATCH_SIZE = 1024 * 1024

Maybe we'll need to auto-tweak the row group size to aim for a [30MB-300MB] interval, or we can end up with multi-GB row groups.
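A rough sketch of that auto-tweaking idea (hypothetical helper, not part of this PR): estimate the uncompressed bytes per row from a sample batch and clamp the rows per row group so each group lands in the target window.

```python
import pyarrow as pa

def rows_per_row_group(
    sample_batch: pa.RecordBatch,
    min_bytes: int = 30 * 1024**2,    # lower bound of the target window (~30MB)
    max_bytes: int = 300 * 1024**2,   # upper bound of the target window (~300MB)
    default_rows: int = 1024 * 1024,  # default row group size in rows
) -> int:
    # Estimate uncompressed bytes per row from the sample batch.
    bytes_per_row = max(1, sample_batch.nbytes // max(1, sample_batch.num_rows))
    rows = default_rows
    rows = min(rows, max_bytes // bytes_per_row)  # avoid multi-GB row groups
    rows = max(rows, min_bytes // bytes_per_row)  # avoid tiny row groups
    return max(1, rows)
```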

@severo
Collaborator

severo commented Aug 13, 2025

> Maybe we'll need to auto-tweak the row group size to aim for a [30MB-300MB] interval, or we can end up with multi-GB row groups.

> We should consider enabling page indexes by default when writing parquet files, to support page-pruning readers like the next dataset viewer (huggingface/dataset-viewer#3199).

Would it make sense to use the default row group size and expect readers to rely on the page index to fetch only the required bits? Not sure if that exists in duckdb.

@lhoestq
Member

lhoestq commented Aug 13, 2025

> Would it make sense to use the default row group size and expect readers to rely on the page index to fetch only the required bits? Not sure if that exists in duckdb.

Most frameworks read row group by row group; that's why we need them to be of a reasonable size anyway.

@severo
Collaborator

severo commented Aug 13, 2025

> We should consider enabling page indexes by default when writing parquet files, to support page-pruning readers like the next dataset viewer (huggingface/dataset-viewer#3199).

Where would the page indexes be stored? In a custom section of the Parquet file metadata? Is it standardized or ad hoc?

OK, I just RTFM:

> write_page_index: bool, default False
>
> Whether to write a page index in general for all columns. Writing statistics to the page index disables the old method of writing statistics to each data page header. The page index makes statistics-based filtering more efficient than the page header, as it gathers all the statistics for a Parquet file in a single place, avoiding scattered I/O. Note that the page index is not yet used on the read side by PyArrow.
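For reference, the page index is a standardized part of the Parquet format (the ColumnIndex/OffsetIndex structures referenced from the column chunk metadata), not an ad hoc metadata section, and enabling it from pyarrow is a single writer flag. A minimal, illustrative example (file and column names are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "text": ["a", "b", "c"]})

# write_page_index=True stores per-page statistics in the standardized
# ColumnIndex/OffsetIndex structures so readers can prune pages without
# scanning the data pages themselves.
pq.write_table(table, "indexed.parquet", write_page_index=True)
```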

@lhoestq
Member

lhoestq commented Sep 4, 2025

I updated the PR to write row groups of 100MB (uncompressed) instead of relying on a bigger default number of rows per row group. Let me know what you think :)

This way it should provide good performance for CDC while letting the Viewer work correctly without OOM.
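A simplified sketch (illustrative only, not the actual arrow_writer.ParquetWriter logic) of what targeting ~100MB uncompressed row groups can look like: buffer record batches and flush a row group once roughly 100MB has accumulated.

```python
import pyarrow as pa
import pyarrow.parquet as pq

TARGET_ROW_GROUP_BYTES = 100 * 1024**2  # ~100MB uncompressed per row group

def write_with_sized_row_groups(batches, schema, path):
    buffered, buffered_bytes = [], 0
    with pq.ParquetWriter(path, schema) as writer:
        for batch in batches:
            buffered.append(batch)
            buffered_bytes += batch.nbytes
            if buffered_bytes >= TARGET_ROW_GROUP_BYTES:
                # Flush everything buffered so far as one row group.
                tbl = pa.Table.from_batches(buffered, schema=schema)
                writer.write_table(tbl, row_group_size=tbl.num_rows)
                buffered, buffered_bytes = [], 0
        if buffered:
            tbl = pa.Table.from_batches(buffered, schema=schema)
            writer.write_table(tbl, row_group_size=tbl.num_rows)
```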

@severo
Collaborator

severo commented Sep 4, 2025

Have you tried it on some datasets? I think it's important to choose a good default, as it will impact all datasets, and it costs a lot to recompute if we need to change it. I'm not saying the value is bad, I don't know, but it would be good to have some validation before setting it.

@lhoestq
Member

lhoestq commented Sep 5, 2025

I tried with both Dataset.push_to_hub() and IterableDataset.push_to_hub() on https://huggingface.co/datasets/lmsys/lmsys-chat-1m and it worked well. The viewer works great too (see temp repository https://huggingface.co/datasets/lhoestq/tmp-lmsys) and the row groups ended up with the right size.

I might do another try with the Hermes dataset showcased in the blog post as a final double-check that filter + reupload dedupes well.

@kszucs
Member Author

kszucs commented Sep 5, 2025

I will also run the estimator to check the impact on deduplication. In general smaller row groups decrease the deduplication efficiency, so ideally we should use bigger row groups with page indexes.

BTW could we use smaller row groups only for refs/convert/parquet?

@lhoestq
Member

lhoestq commented Sep 8, 2025

I reproduced the example in the blog post.

It uploads 60MB of new data for each shard of 210MB of filtered OpenHermes2.5 (after shuffling).

The baseline is 20MB for one single-row-group shard with pandas (unshuffled in the blog post), which is not impacted by file boundaries and row group boundaries.

(fyi increasing the row group size from 100MB to 200MB gives roughly the same result of 60MB of new data per shard of 210MB)

@lhoestq
Member

lhoestq commented Sep 9, 2025

I just added write_page_index=True as well. It didn't affect dedupe perf on the OpenHermes example.

Merging now :) Let me know if there are other things to change before I can do a new release

@lhoestq lhoestq merged commit cedf831 into huggingface:main Sep 9, 2025
6 of 14 checks passed