perf: add a chunk cache to avoid decoding duplicated miniblock chunks #4846
Description
When miniblock encoding is used in a Lance file, reading the file with the v2 `FileReader` via the `read_stream_projected` API can become inefficient if the provided `ReadBatchParams::Indices` contains many nearby but non-contiguous row indices (for example, every other row in a range). This kind of access pattern causes the same chunk to be decoded repeatedly, resulting in poor performance and high CPU usage, as illustrated below.
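As a rough illustration (not the PR's own benchmark), the sketch below counts how many chunk decodes a strided index list would trigger with and without reuse of the most recently decoded chunk. The chunk size of 64 rows is a hypothetical value, not the actual miniblock layout.

```rust
// Illustrative only: why nearby but non-contiguous indices are costly.
fn main() {
    let rows_per_chunk = 64u64; // hypothetical miniblock chunk size
    // Every other row in a range: nearby but non-contiguous indices.
    let indices: Vec<u64> = (0..1000).map(|i| i * 2).collect();

    // Without reuse, each index triggers a decode of its containing chunk,
    // even when consecutive indices map to the same chunk.
    let mut decodes_without_reuse = 0u64;
    let mut decodes_with_reuse = 0u64;
    let mut last_chunk: Option<u64> = None;
    for idx in &indices {
        let chunk = idx / rows_per_chunk;
        decodes_without_reuse += 1;
        if last_chunk != Some(chunk) {
            decodes_with_reuse += 1;
            last_chunk = Some(chunk);
        }
    }
    println!("decodes without reuse: {decodes_without_reuse}"); // 1000
    println!("decodes with reuse:    {decodes_with_reuse}");    // 32
}
```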
Solution
This PR introduces a lightweight single-entry cache in `DecodePageTask`. It only helps when chunks are accessed in a roughly sequential order, but row indices are typically sorted in ascending order, so the cache strikes a balance between memory use and decode performance.
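The sketch below shows the general shape of such a single-entry cache, assuming a hypothetical `DecodedChunk` type and a caller-supplied decode closure; it is illustrative only and does not reproduce the actual `DecodePageTask` code.

```rust
// Minimal sketch of a single-entry chunk cache. Names are hypothetical.
struct DecodedChunk {
    values: Vec<u8>, // decoded values of one miniblock chunk
}

struct ChunkCache {
    last: Option<(usize, DecodedChunk)>, // (chunk index, decoded chunk)
}

impl ChunkCache {
    fn new() -> Self {
        Self { last: None }
    }

    /// Returns the decoded chunk, reusing the previous result when the same
    /// chunk is requested again (the common case with sorted row indices).
    fn get_or_decode(
        &mut self,
        chunk_idx: usize,
        decode_chunk: impl FnOnce(usize) -> DecodedChunk,
    ) -> &DecodedChunk {
        let hit = matches!(&self.last, Some((idx, _)) if *idx == chunk_idx);
        if !hit {
            // Cache miss: decode the chunk and replace the single entry.
            self.last = Some((chunk_idx, decode_chunk(chunk_idx)));
        }
        &self.last.as_ref().unwrap().1
    }
}
```

Keeping only one entry means the cache adds at most one decoded chunk of extra memory per decode task, while still eliminating the repeated decodes that dominate the nearby-index access pattern.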
Test
On a local setup with a Lance file containing 100k rows (each row with a text column of 200+ bytes):
- `zstd` is used for general compression