
Conversation

@niyue niyue commented Sep 30, 2025

Description

When miniblock encoding is used in a Lance file, reading the file with the v2 FileReader via the read_stream_projected API can become inefficient if the provided ReadBatchParams::Indices contains many nearby but non-contiguous row indices.
For example:

29, 168, 180, 194, 376, 559, 574, 665, 666, 667, ..., 968, 969, 970, 973, 975, ...

This kind of access pattern causes the same chunk to be decoded repeatedly, resulting in slow performance and high CPU usage.
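To make the repeated-decode cost concrete, here is a toy illustration (not Lance's actual code; it assumes a hypothetical fixed chunk size of 256 rows, whereas real miniblock chunk sizes depend on the encoding) of how many chunk decodes a single-entry cache saves on sorted, nearby indices:

```rust
// Toy illustration only: maps a row index to the chunk containing it,
// assuming a hypothetical fixed chunk size.
fn chunk_of(row: u64, rows_per_chunk: u64) -> u64 {
    row / rows_per_chunk
}

/// Counts chunk decodes when a single-entry cache is used: a decode is
/// skipped whenever the current row falls in the same chunk as the
/// previous row. A cache-free decoder would decode once per row instead.
fn cached_decodes(indices: &[u64], rows_per_chunk: u64) -> usize {
    let mut decodes = 0;
    let mut last_chunk = None;
    for &row in indices {
        let chunk = chunk_of(row, rows_per_chunk);
        if last_chunk != Some(chunk) {
            decodes += 1;
            last_chunk = Some(chunk);
        }
    }
    decodes
}
```

With the index prefix from the example above, the 10 requested rows collapse onto just 3 distinct chunks, so 7 of the 10 decodes are avoided.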

Solution

This PR introduces a lightweight single-entry cache in DecodePageTask. A single-entry cache only helps when chunks are accessed roughly in order, but row indices are typically sorted in ascending order, so it strikes a good balance between memory overhead and decode savings.
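The idea can be sketched as follows. This is a minimal standalone sketch, not Lance's actual implementation; the names (`SingleEntryCache`, `DecodedChunk`, `get_or_decode`) are illustrative:

```rust
/// Stand-in for the decoded output of a miniblock chunk (illustrative).
#[derive(Clone)]
struct DecodedChunk {
    values: Vec<u8>,
}

/// A single-entry cache: remembers only the most recently decoded chunk,
/// keyed by its chunk index, so memory overhead is one chunk at most.
struct SingleEntryCache {
    last: Option<(usize, DecodedChunk)>,
}

impl SingleEntryCache {
    fn new() -> Self {
        Self { last: None }
    }

    /// Returns the cached chunk if the key matches the last decode;
    /// otherwise runs `decode`, replaces the cached entry, and returns
    /// the fresh result.
    fn get_or_decode<F>(&mut self, key: usize, decode: F) -> DecodedChunk
    where
        F: FnOnce() -> DecodedChunk,
    {
        match &self.last {
            Some((cached_key, chunk)) if *cached_key == key => chunk.clone(),
            _ => {
                let chunk = decode();
                self.last = Some((key, chunk.clone()));
                chunk
            }
        }
    }
}
```

Because only the previous entry is kept, the cache hits exactly when consecutive instructions target the same chunk, which is the common case for sorted indices.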

Test

On a local setup with a Lance file containing 100k rows (each row with a text column of 200+ bytes):

  • Reading 1700+ nearby but non-contiguous rows at random
  • zstd is used for general compression
  • With this change, performance improved by 3x–5x, depending on the dataset.


ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

// Now we iterate through each instruction and process it
for (instructions, chunk) in self.instructions.iter() {
// TODO: It's very possible that we have duplicate `buf` in self.instructions and we
// don't want to decode the buf again and again on the same thread.
Contributor Author


This PR partially addresses this TODO. It improves performance unless chunks are accessed in a fully random pattern, which would require a HashMap-based cache at the cost of higher memory usage.
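For comparison, the HashMap-based alternative mentioned above could look like the following sketch (illustrative names, not a proposed Lance API). It handles fully random access, but retains every decoded chunk for the task's lifetime:

```rust
use std::collections::HashMap;

/// Illustrative alternative: caches every decoded chunk by chunk index.
/// Hits on any repeated access, regardless of order, at the cost of
/// holding all previously decoded chunks in memory.
struct MapCache {
    entries: HashMap<usize, Vec<u8>>,
}

impl MapCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Returns the cached chunk for `key`, decoding and inserting it on
    /// first access.
    fn get_or_decode<F>(&mut self, key: usize, decode: F) -> &Vec<u8>
    where
        F: FnOnce() -> Vec<u8>,
    {
        self.entries.entry(key).or_insert_with(decode)
    }
}
```

The trade-off is clear: the single-entry cache bounds memory at one chunk but misses on out-of-order revisits, while this variant never re-decodes but grows with the number of distinct chunks touched.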

@niyue niyue changed the title Add a chunk cache to avoid decoding duplicated miniblock chunks perf: add a chunk cache to avoid decoding duplicated miniblock chunks Sep 30, 2025
codecov-commenter commented Sep 30, 2025

Codecov Report

❌ Patch coverage is 94.73684% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 80.84%. Comparing base (00703be) to head (611a69c).

Files with missing lines                                  Patch %   Lines
.../lance-encoding/src/encodings/logical/primitive.rs     94.73%    0 Missing, 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4846   +/-   ##
=======================================
  Coverage   80.84%   80.84%           
=======================================
  Files         330      330           
  Lines      130545   130561   +16     
=======================================
+ Hits       105539   105552   +13     
- Misses      21269    21272    +3     
  Partials     3737     3737           
Flag Coverage Δ
unittests 80.84% <94.73%> (+<0.01%) ⬆️

@niyue niyue force-pushed the feature/chunk-cache branch from f1a8fd7 to 611a69c Compare September 30, 2025 08:11
…us chunks don't have to be decoded multiple times.
@niyue niyue force-pushed the feature/chunk-cache branch from 611a69c to b7620b1 Compare September 30, 2025 09:03