feat: use content defined chunking #7589
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 960db25 to ef901ea
Need to set …
We should consider enabling page indexes by default when writing parquet files, to allow page pruning in readers like the next dataset viewer (huggingface/dataset-viewer#3199).
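For reference, a minimal sketch of what enabling page indexes looks like at the pyarrow level, assuming the `write_page_index` option of `pyarrow.parquet.write_table` (the table and file name here are only illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Tiny illustrative table; real dataset shards would be much larger.
table = pa.table({"text": ["a", "b", "c"], "label": [0, 1, 0]})

# Writing the column/offset indexes lets page-pruning readers skip individual
# data pages instead of having to read whole row groups.
pq.write_table(table, "data.parquet", write_page_index=True)
```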
src/datasets/config.py (Outdated)

  # Batch size constants. For more info, see:
  # https://github.com/apache/arrow/blob/master/docs/source/cpp/arrays.rst#size-limitations-and-recommendations)
- DEFAULT_MAX_BATCH_SIZE = 1000
+ DEFAULT_MAX_BATCH_SIZE = 1024 * 1024
This is the default Arrow row group size. If we choose too small a row group size, we can't benefit from CDC chunking as much.
Maybe we'll need to auto-tune the row group size to aim for a 30MB-300MB range, otherwise we can end up with row groups of multiple GBs.
Would it make sense to use the default row group size and expect readers to rely on the page index to fetch only the required bits? Not sure if that exists in DuckDB.
Most frameworks read row group by row group; that's why we need row groups of a reasonable size anyway.
OK, I just RTFM:
I updated the PR to write row groups of 100MB (uncompressed) instead of relying on a bigger default number of rows per row group. Let me know what you think :) This way it should provide good performance for CDC while letting the Viewer work correctly without OOM.
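As a rough sketch of the idea (not the exact code in this PR), the number of rows per row group can be derived from the average uncompressed row size reported by Arrow; the 100MB target and helper name below are illustrative:

```python
import pyarrow as pa

TARGET_ROW_GROUP_BYTES = 100 * 1024 * 1024  # ~100MB of uncompressed data per row group

def rows_per_row_group(table: pa.Table) -> int:
    """Derive a row count per row group from a target uncompressed size."""
    if table.num_rows == 0:
        return 1
    avg_row_bytes = max(1, table.nbytes // table.num_rows)
    return max(1, TARGET_ROW_GROUP_BYTES // avg_row_bytes)

# e.g. pq.write_table(table, path, row_group_size=rows_per_row_group(table))
```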
Have you tried it on some datasets? I think it's important to choose a good default, as it will impact all datasets, and it costs a lot to recompute if we need to change it later. Not saying the value is bad, I don't know, but it would be good to have some validation before setting it.
I tried with both Dataset.push_to_hub() and IterableDataset.push_to_hub() on https://huggingface.co/datasets/lmsys/lmsys-chat-1m and it worked well. The viewer works great too (see the temporary repository https://huggingface.co/datasets/lhoestq/tmp-lmsys) and the row groups ended up with the right size. I might do another run with the Hermes dataset showcased in the blog post, as a final double check that filter + reupload dedupes well.
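For context, the check above boils down to something like this (the target repository names are placeholders; the streaming one is made up):

```python
from datasets import load_dataset

# Dataset.push_to_hub()
ds = load_dataset("lmsys/lmsys-chat-1m", split="train")
ds.push_to_hub("lhoestq/tmp-lmsys")

# IterableDataset.push_to_hub() via streaming mode
ids = load_dataset("lmsys/lmsys-chat-1m", split="train", streaming=True)
ids.push_to_hub("lhoestq/tmp-lmsys-streaming")  # placeholder repo id
```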
I will also run the estimator to check the impact on deduplication. In general, smaller row groups decrease deduplication efficiency, so ideally we should use bigger row groups with page indexes. BTW, could we use smaller row groups only for …
I reproduced the example in the blog post. It uploads 60MB of new data for each 210MB shard of filtered OpenHermes-2.5 (after shuffling). The baseline is 20MB for one single-row-group shard written with pandas (unshuffled in the blog post), which is not impacted by file boundaries and row group boundaries. (FYI, increasing the row group size from 100MB to 200MB gives roughly the same result of 60MB of new data per 210MB shard.)
I just added … Merging now :) Let me know if there are other things to change before I can do a new release.
Use content defined chunking by default when writing parquet files, in `io.parquet.ParquetDatasetReader` and `arrow_writer.ParquetWriter`. It requires a new pyarrow pin ">=21.0.0", which is released now.
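For readers landing here, a minimal sketch of what content defined chunking looks like at the pyarrow level, assuming the `use_content_defined_chunking` option of the pyarrow 21.0 parquet writer (the table and file name are illustrative; the actual `datasets` integration lives in `arrow_writer.ParquetWriter`):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"text": ["hello", "world"]})

# Content defined chunking aligns data page boundaries with the content itself,
# so small edits or filters only change a few pages and deduplicated uploads
# stay effective across re-uploads.
pq.write_table(table, "shard-00000.parquet", use_content_defined_chunking=True)
```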