perf: add a chunk cache to avoid decoding duplicated miniblock chunks #4846
Description
When miniblock encoding is used in a Lance file, reading the file with the v2 `FileReader` via the `read_stream_projected` API can become inefficient if the provided `ReadBatchParams::Indices` contains many nearby but non-contiguous row indices (for example, every other row in a range). This kind of access pattern causes the same chunk to be decoded repeatedly, resulting in poor performance and high CPU usage, as illustrated below.
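As a rough illustration (not the PR's own benchmark), the sketch below counts how many chunk decodes a strided index list would trigger with and without reuse of the most recently decoded chunk. The chunk size of 64 rows is a hypothetical value, not the actual miniblock layout.

```rust
// Illustrative only: why nearby but non-contiguous indices are costly.
fn main() {
    let rows_per_chunk = 64u64; // hypothetical miniblock chunk size
    // Every other row in a range: nearby but non-contiguous indices.
    let indices: Vec<u64> = (0..1000).map(|i| i * 2).collect();

    // Without reuse, each index triggers a decode of its containing chunk,
    // even when consecutive indices map to the same chunk.
    let mut decodes_without_reuse = 0u64;
    let mut decodes_with_reuse = 0u64;
    let mut last_chunk: Option<u64> = None;
    for idx in &indices {
        let chunk = idx / rows_per_chunk;
        decodes_without_reuse += 1;
        if last_chunk != Some(chunk) {
            decodes_with_reuse += 1;
            last_chunk = Some(chunk);
        }
    }
    println!("decodes without reuse: {decodes_without_reuse}"); // 1000
    println!("decodes with reuse:    {decodes_with_reuse}");    // 32
}
```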
Solution
This PR introduces a lightweight single-entry cache in `DecodePageTask`. It only helps when chunks are accessed in a roughly sequential order, but row indices are typically sorted in ascending order, so the cache strikes a balance between memory use and decode performance.
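The sketch below shows the general shape of such a single-entry cache, assuming a hypothetical `DecodedChunk` type and a caller-supplied decode closure; it is illustrative only and does not reproduce the actual `DecodePageTask` code.

```rust
// Minimal sketch of a single-entry chunk cache. Names are hypothetical.
struct DecodedChunk {
    values: Vec<u8>, // decoded values of one miniblock chunk
}

struct ChunkCache {
    last: Option<(usize, DecodedChunk)>, // (chunk index, decoded chunk)
}

impl ChunkCache {
    fn new() -> Self {
        Self { last: None }
    }

    /// Returns the decoded chunk, reusing the previous result when the same
    /// chunk is requested again (the common case with sorted row indices).
    fn get_or_decode(
        &mut self,
        chunk_idx: usize,
        decode_chunk: impl FnOnce(usize) -> DecodedChunk,
    ) -> &DecodedChunk {
        let hit = matches!(&self.last, Some((idx, _)) if *idx == chunk_idx);
        if !hit {
            // Cache miss: decode the chunk and replace the single entry.
            self.last = Some((chunk_idx, decode_chunk(chunk_idx)));
        }
        &self.last.as_ref().unwrap().1
    }
}
```

Keeping only one entry means the cache adds at most one decoded chunk of extra memory per decode task, while still eliminating the repeated decodes that dominate the nearby-index access pattern.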
Test
On a local setup with a Lance file containing 100k rows (each row with a text column of 200+ bytes):
- `zstd` is used for general compression