@Xuanwo (Collaborator) commented Dec 30, 2025

In some cases the encoder selects RLE even though the data would be stored more compactly with bitpacking. This PR makes the encoder prefer bitpacking over RLE when bitpacking produces smaller output.

| Metric | Parquet (reference) | Lance (before change) | Lance (after change) | Delta (after vs before) |
| --- | --- | --- | --- | --- |
| int_score compressed size (bytes) | 56,035 | 377,838 | 71,556 | -306,282 (-81.06%) |
| int_score vs Parquet (ratio) | 1.00x | 6.74x | 1.28x | -5.47x |
| Lance chosen encoding (hint) | RLE_DICTIONARY (plus RLE, PLAIN, SNAPPY) | rle | inline_bitpacking | n/a |

Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

@github-actions (Contributor)

Code Review

Summary: This PR adds logic to prefer bitpacking over RLE when bitpacking produces smaller output. The approach is sound and the test coverage is good.

P1 Issue: Estimation formula may underestimate bitpacking size

In estimate_inline_bitpacking_bytes, the calculation appears to assume all chunks are full 1024-element chunks. The words_per_chunk is hardcoded to 1 (for the bit-width header), but the actual implementation in InlineBitpacking::bitpack_chunked stores the bit-width as a single element of type T (e.g., 8 bytes for u64), not 1 byte.

Looking at compression.rs:241-247:

let words_per_chunk: u128 = 1;
let word_bytes: u128 = (bits / 8) as u128;
// ...
let packed_words = (1024u128 * bit_width) / (bits as u128);
total_words = total_words.saturating_add(words_per_chunk.saturating_add(packed_words));

This correctly accounts for the header as 1 word (element) per chunk plus the packed data words. The comparison bitpack_bytes < rle_bytes then decides whether the RLE path returns None. The current logic is:

  • If bitpacking is smaller than RLE, skip RLE (return None)
  • Then bitpacking will be tried separately

This seems correct, but I'd suggest adding a brief comment in try_rle_for_mini_block explaining that we're checking if bitpacking would be better to avoid selecting RLE when it's not optimal.
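To make the word-based estimate discussed above concrete, here is a minimal sketch of the arithmetic. The function `estimate_bitpacked_bytes` and its parameters are hypothetical and are not the code in compression.rs; it assumes every chunk holds a full 1024 values and that the per-chunk header occupies one word of the element type, as the quoted snippet suggests.

```rust
/// Hypothetical helper (not the Lance implementation): estimates the bytes
/// needed to bitpack chunks of 1024 values, each chunk prefixed by a
/// one-word header that stores the chunk's bit width.
///
/// `type_bits` is the width of the element type (8, 16, 32 or 64).
/// `chunk_bit_widths` holds the packed bit width assumed for each chunk.
fn estimate_bitpacked_bytes(type_bits: u32, chunk_bit_widths: &[u32]) -> u128 {
    const ELEMS_PER_CHUNK: u128 = 1024;
    let word_bytes = (type_bits / 8) as u128;
    let mut total_words: u128 = 0;
    for &bit_width in chunk_bit_widths {
        // 1024 values at `bit_width` bits each, counted in words of `type_bits` bits.
        let packed_words = (ELEMS_PER_CHUNK * bit_width as u128) / type_bits as u128;
        // One extra word per chunk for the bit-width header.
        total_words = total_words.saturating_add(1 + packed_words);
    }
    total_words.saturating_mul(word_bytes)
}
```

For a single chunk of 32-bit values packed at 3 bits each, this gives (1 + 96) words of 4 bytes, so the header costs 4 bytes rather than 1.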

Suggestion (not blocking)

Consider adding a comment in the RLE function explaining the fallback to bitpacking check:

// If bitpacking would produce smaller output than RLE, skip RLE
// and let the subsequent bitpacking check handle compression.
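As an illustration of where such a comment might sit, the following sketch places it on a simplified version of the check. The function `choose_rle` and its signature are hypothetical, introduced only for this example; keeping RLE on a tie is an assumption that follows from the strict `<` comparison mentioned in the review.

```rust
/// Hypothetical sketch (not the Lance function): returns `Some(rle_bytes)`
/// when RLE should be used, or `None` to let a later step try bitpacking.
fn choose_rle(rle_bytes: u128, bitpack_bytes: u128) -> Option<u128> {
    // If bitpacking would produce smaller output than RLE, skip RLE
    // and let the subsequent bitpacking check handle compression.
    if bitpack_bytes < rle_bytes {
        return None;
    }
    Some(rle_bytes)
}
```

With made-up estimates, `choose_rle(100, 40)` skips RLE, while `choose_rle(40, 100)` keeps it.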

Test coverage

The new test test_low_cardinality_prefers_bitpacking_over_rle is well-designed and validates the behavior. The modification to test_rle_encoding_verification to use i32::MIN values ensures RLE is still tested when bitpacking can't help (high bit set means bit-width equals type width).
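The reason i32::MIN defeats bitpacking can be sketched as follows. The helper `required_bit_width` is hypothetical and assumes the bit width is taken from the largest value after reinterpreting each i32 as unsigned; the actual computation in Lance may differ.

```rust
/// Hypothetical helper: number of bits needed to represent the largest
/// value once each i32 is reinterpreted as a u32.
fn required_bit_width(values: &[i32]) -> u32 {
    // `as u32` keeps the bit pattern, so i32::MIN becomes 0x8000_0000.
    let max = values.iter().map(|&v| v as u32).max().unwrap_or(0);
    // Bits needed = type width minus the number of leading zero bits.
    32 - max.leading_zeros()
}
```

Under that assumption, a run of small values such as `[0, 1, 2, 3]` needs only 2 bits per value, while any slice containing `i32::MIN` needs the full 32 bits, so bitpacking saves nothing and RLE remains the better candidate.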

LGTM with the optional suggestion above.

@codecov bot commented Dec 30, 2025

Codecov Report

❌ Patch coverage is 98.21429% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-encoding/src/compression.rs | 98.21% | 1 Missing ⚠️ |
