refactor: allow switching to bitpack inside RLE #5595

Xuanwo · 2025-12-30T07:42:37Z

In some cases the encoder initially selects RLE even though the data would be stored more compactly with bitpacking. This PR lets the RLE path detect that case and switch to bitpacking instead.

| Metric | Parquet (reference) | Lance (before change) | Lance (after change) | Delta (after vs before) |
| --- | --- | --- | --- | --- |
| `int_score` compressed size (bytes) | 56,035 | 377,838 | 71,556 | -306,282 (-81.06%) |
| `int_score` vs Parquet (ratio) | 1.00x | 6.74x | 1.28x | -5.47x |
| Chosen encoding (hint) | `RLE_DICTIONARY` (plus `RLE`, `PLAIN`, `SNAPPY`) | `rle` | `inline_bitpacking` | n/a |
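
To illustrate why RLE can lose badly on low-cardinality data, here is a minimal sketch that compares the two costs on values with few distinct entries but no long runs. The cost model is an assumption for illustration only (one full-width value plus a one-byte run length per run, no splitting of runs longer than 255, no chunk headers for bitpacking); it is not Lance's actual layout, and the function names are hypothetical.

```rust
/// Count maximal runs of equal adjacent values.
fn count_runs(values: &[u32]) -> usize {
    if values.is_empty() {
        return 0;
    }
    // Every position where the value changes starts a new run.
    1 + values.windows(2).filter(|w| w[0] != w[1]).count()
}

/// Hypothetical RLE cost: one 4-byte value plus a 1-byte run length per run.
/// Assumption: runs longer than 255 are not split (ignored in this sketch).
fn rle_bytes(values: &[u32]) -> usize {
    count_runs(values) * (std::mem::size_of::<u32>() + 1)
}

/// Hypothetical bitpacking cost: every value stored using the bit width
/// of the largest value, rounded up to whole bytes, with no headers.
fn bitpacked_bytes(values: &[u32]) -> usize {
    let max = values.iter().copied().max().unwrap_or(0);
    let bit_width = (32 - max.leading_zeros()) as usize;
    (values.len() * bit_width + 7) / 8
}
```

For 1,024 values cycling through 0, 1, 2, 3, every value starts a new run, so this model gives 5,120 bytes for `rle_bytes` against 256 bytes for `bitpacked_bytes`.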

Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

github-actions · 2025-12-30T07:43:53Z

Code Review

Summary: This PR adds logic to prefer bitpacking over RLE when bitpacking produces smaller output. The approach is sound and the test coverage is good.

P1 Issue: Estimation formula may underestimate bitpacking size

In `estimate_inline_bitpacking_bytes`, the calculation appears to assume all chunks are full 1024-element chunks. `words_per_chunk` is hardcoded to 1 for the bit-width header. Note that this is one word, not one byte: the actual implementation in `InlineBitpacking::bitpack_chunked` stores the bit-width as a single element of type `T` (e.g., 8 bytes for `u64`).

Looking at compression.rs:241-247:

let words_per_chunk: u128 = 1;
let word_bytes: u128 = (bits / 8) as u128;
// ...
let packed_words = (1024u128 * bit_width) / (bits as u128);
total_words = total_words.saturating_add(words_per_chunk.saturating_add(packed_words));

This correctly accounts for the header as 1 word (element) per chunk plus the packed data words, so the header itself is not underestimated. The comparison `bitpack_bytes < rle_bytes` then makes the RLE path return `None`. The current logic is:

- If bitpacking is smaller than RLE, skip RLE (return `None`)
- Then bitpacking will be tried separately

This seems correct, but I'd suggest adding a brief comment in try_rle_for_mini_block explaining that we're checking if bitpacking would be better to avoid selecting RLE when it's not optimal.
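
The estimate and the decision discussed above can be sketched as follows. This is a hedged reconstruction from the quoted snippet, not the code in `compression.rs`: the function names, signatures, and the `CHUNK_LEN` constant are hypothetical, and the sketch assumes every chunk, including the last, is counted as a full 1024-element chunk.

```rust
/// Assumed chunk length, taken from the 1024 in the quoted snippet.
const CHUNK_LEN: u128 = 1024;

/// Estimate the bitpacked size in bytes.
/// `bits` is the width of the element type (8, 16, 32 or 64) and
/// `bit_width` is the number of bits kept per value.
fn estimate_bitpacked_bytes_sketch(num_values: u64, bit_width: u32, bits: u32) -> u128 {
    // One header word per chunk holds the bit width.
    let words_per_chunk: u128 = 1;
    let word_bytes: u128 = (bits / 8) as u128;
    // Assumption: a partial trailing chunk is counted as a full one.
    let num_chunks = (num_values as u128 + CHUNK_LEN - 1) / CHUNK_LEN;
    let packed_words = (CHUNK_LEN * bit_width as u128) / (bits as u128);
    let total_words = num_chunks.saturating_mul(words_per_chunk.saturating_add(packed_words));
    total_words.saturating_mul(word_bytes)
}

/// Return `Some(rle_bytes)` only when RLE is at least as small as bitpacking.
/// If bitpacking would be smaller, return `None` so a later bitpacking
/// check can handle the data.
fn rle_size_if_preferred(rle_bytes: u128, bitpack_bytes: u128) -> Option<u128> {
    if bitpack_bytes < rle_bytes {
        None
    } else {
        Some(rle_bytes)
    }
}
```

Under these assumptions, 1,024 `u32` values at a 2-bit width need 1 header word plus 64 packed words, or 260 bytes, so an RLE estimate of 5,120 bytes would make `rle_size_if_preferred` return `None`.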

Suggestion (not blocking)

Consider adding a comment in the RLE function explaining the fallback to bitpacking check:

// If bitpacking would produce smaller output than RLE, skip RLE
// and let the subsequent bitpacking check handle compression.

Test coverage

The new test test_low_cardinality_prefers_bitpacking_over_rle is well-designed and validates the behavior. The modification to test_rle_encoding_verification to use i32::MIN values ensures RLE is still tested when bitpacking can't help (high bit set means bit-width equals type width).
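
The point about `i32::MIN` can be shown with a small sketch. It assumes the bit width is derived from the unsigned reinterpretation of each value, which is a common approach but is not confirmed here as Lance's exact method; `required_bit_width` is a hypothetical helper.

```rust
/// Number of bits needed to store every value when each `i32` is
/// reinterpreted as `u32` (assumption: no zigzag or frame-of-reference step).
fn required_bit_width(values: &[i32]) -> u32 {
    values
        .iter()
        .map(|v| 32 - (*v as u32).leading_zeros())
        .max()
        .unwrap_or(0)
}
```

With this helper, `[0, 1, 2, 3]` needs 2 bits, while any slice containing `i32::MIN` needs the full 32 bits because its top bit is set, so bitpacking saves nothing and RLE remains the candidate under test.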

LGTM with the optional suggestion above.

codecov · 2025-12-30T08:25:00Z

Codecov Report

❌ Patch coverage is 98.21429% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-encoding/src/compression.rs | 98.21% | 1 Missing ⚠️ |


refactor: allow switching to bitpack inside RLE

bab0e8e
