Contributor

@gz gz commented Jan 10, 2026

This changes how tuples are serialized and deserialized.

Before, a TupX<> was serialized as a struct with one field per element:

struct ArchivedTupX
  t1: T1
  t2: T2
  etc.

This is not space efficient when a Tup has many Option fields.

So we add two new formats:

  • A format for dense tuples (same as before, but every Option<T> field is stored as its inner type T, via IsNone::Inner, and a bitmap records which fields are None):
DenseTup
  bitmap: [u8; len]
  t1: T1::IsNone::Inner
  t2: T2::IsNone::Inner
  ...
  • A format for very sparse tuples (more than 40% of the fields are None):
SparseTup
  bitmap: [u8; len]
  rel_ptrs: ArchivedVec<RelPtrI32>
  <<serialized form of every TX::IsNone::Inner that is not None; each one has an entry in rel_ptrs pointing to it>>
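To make the two layouts concrete, here is a minimal sketch in plain Rust that does not use rkyv. It assumes every field is an `Option<u32>`, that a set bit means "this field is None", and that the sparse variant can pack present values in field order instead of using relative pointers. All function names are hypothetical and are not taken from the PR.

```rust
// Sketch only: plain Rust, no rkyv. All names here are hypothetical.
// Assumption: a set bit in the bitmap means "this field is None".

/// One bit per field, packed into bytes (field i lives in byte i / 8).
fn none_bitmap(fields: &[Option<u32>]) -> Vec<u8> {
    let mut bitmap = vec![0u8; (fields.len() + 7) / 8];
    for (i, field) in fields.iter().enumerate() {
        if field.is_none() {
            bitmap[i / 8] |= 1u8 << (i % 8);
        }
    }
    bitmap
}

fn is_none(bitmap: &[u8], i: usize) -> bool {
    (bitmap[i / 8] & (1u8 << (i % 8))) != 0
}

/// Dense layout: every field keeps a slot; None slots hold a placeholder 0.
fn encode_dense(fields: &[Option<u32>]) -> (Vec<u8>, Vec<u32>) {
    let values = fields.iter().map(|f| f.unwrap_or(0)).collect();
    (none_bitmap(fields), values)
}

/// Sparse layout: only present values are stored, in field order.
/// (The PR description uses relative pointers instead of positional order.)
fn encode_sparse(fields: &[Option<u32>]) -> (Vec<u8>, Vec<u32>) {
    let values = fields.iter().flatten().copied().collect();
    (none_bitmap(fields), values)
}

fn decode_dense(bitmap: &[u8], values: &[u32]) -> Vec<Option<u32>> {
    values
        .iter()
        .enumerate()
        .map(|(i, v)| if is_none(bitmap, i) { None } else { Some(*v) })
        .collect()
}

fn decode_sparse(bitmap: &[u8], values: &[u32], len: usize) -> Vec<Option<u32>> {
    let mut present = values.iter().copied();
    (0..len)
        .map(|i| if is_none(bitmap, i) { None } else { present.next() })
        .collect()
}

fn main() {
    let tup = vec![Some(1), None, None, Some(4), None, None, None, None, Some(9)];
    let (bitmap, dense) = encode_dense(&tup);
    let (_, sparse) = encode_sparse(&tup);
    // Both layouts must reproduce the original Option values.
    assert_eq!(decode_dense(&bitmap, &dense), tup);
    assert_eq!(decode_sparse(&bitmap, &sparse, tup.len()), tup);
    println!("dense slots: {}, sparse slots: {}", dense.len(), sparse.len());
}
```

In this toy version the dense layout always stores one slot per field, while the sparse layout stores only the present values, which is where the savings for mostly-None tuples come from.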

We also need to distinguish which tuple format we're dealing with, so the new tuples now look like this:

TupN (where N>=8; for smaller tuples we continue to use the rkyv default variant, because the format tag and bitfield add a constant overhead that only pays off when N is large enough)
  format (sparse or dense)
  <<SparseTup or DenseTup>>
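The selection rule described above can be sketched as a small function. This is an illustration built only from the numbers in the description (N>=8 and the 40% threshold); the enum, the constant, and the function name are hypothetical, and the merged code may well decide differently, for example by comparing estimated sizes.

```rust
// Hypothetical sketch of the format choice; names are not from the PR.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TupFormat {
    /// Plain rkyv struct layout, kept for small tuples.
    RkyvDefault,
    /// Bitmap plus one slot per field.
    Dense,
    /// Bitmap plus pointers to the fields that are present.
    Sparse,
}

/// Smallest arity that uses the tagged formats (N >= 8 in the description).
const MIN_OPTIMIZED_ARITY: usize = 8;

fn choose_format(arity: usize, none_count: usize) -> TupFormat {
    if arity < MIN_OPTIMIZED_ARITY {
        return TupFormat::RkyvDefault;
    }
    // "More than 40% None", written in integers to avoid floating point:
    // none_count / arity > 4 / 10  <=>  none_count * 10 > arity * 4
    if none_count * 10 > arity * 4 {
        TupFormat::Sparse
    } else {
        TupFormat::Dense
    }
}

fn main() {
    for (arity, nones) in [(4, 4), (10, 4), (10, 5)] {
        println!("arity={arity} nones={nones} -> {:?}", choose_format(arity, nones));
    }
}
```

Note that a tuple with exactly 40% None fields stays dense under this reading, since the description says "more than" 40%.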

Breaking Changes?

  • Storage Format / Checkpoints

Increased the storage format version.

Describe Incompatible Changes

We now write a different storage format for Tup<> types. Backward compatibility is preserved by using the version recorded in each file to distinguish the new format from the old one.
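A rough sketch of what version-based dispatch could look like follows. The version numbers 3 and 4 come from the review summary later in this thread; the type, the constants, and the method names are hypothetical and do not reflect the actual `Deserializer` wrapper in `crates/dbsp/src/storage/file.rs`.

```rust
// Hypothetical sketch; only the numbers 3 and 4 come from the thread.

/// Version written by new files.
const CURRENT_VERSION: u32 = 4;
/// First version whose tuples use the bitmap-based formats.
const FIRST_BITMAP_VERSION: u32 = 4;

#[derive(Debug, PartialEq, Eq)]
enum TupleLayout {
    /// Pre-change layout: one archived field per tuple element.
    Legacy,
    /// New layout: format tag, bitmap, then dense or sparse payload.
    Bitmap,
}

/// Carries the file's version so tuple decoding can pick a layout.
struct VersionedDeserializer {
    version: u32,
}

impl VersionedDeserializer {
    fn new(file_version: u32) -> Result<Self, String> {
        if file_version > CURRENT_VERSION {
            return Err(format!("unsupported storage format version {file_version}"));
        }
        Ok(Self { version: file_version })
    }

    fn tuple_layout(&self) -> TupleLayout {
        if self.version >= FIRST_BITMAP_VERSION {
            TupleLayout::Bitmap
        } else {
            TupleLayout::Legacy
        }
    }
}

fn main() {
    for v in [3, 4, 5] {
        match VersionedDeserializer::new(v) {
            Ok(d) => println!("v{v}: {:?}", d.tuple_layout()),
            Err(e) => println!("v{v}: error: {e}"),
        }
    }
}
```

Keeping the version on the deserializer, rather than in each tuple, means old files need no per-tuple tag to be read correctly.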

Copilot AI review requested due to automatic review settings January 10, 2026 22:28
@gz gz changed the title None optimziation part 2 None optimiziation (part 2) Jan 10, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes the serialization format for tuples containing Option types by implementing a sparse encoding strategy. Instead of storing the full size of each field even when None, the new format uses a bitmap to mark None fields and only serializes non-None values.

Changes:

  • Implemented bitmap-based sparse tuple serialization for more efficient storage of tuples with many None values
  • Incremented storage format version from 3 to 4 to support the new tuple encoding
  • Added backward compatibility layer to deserialize v3 files using the legacy format

Reviewed changes

Copilot reviewed 21 out of 29 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| crates/feldera-macros/src/tuples.rs | Core implementation of the new bitmap-based tuple serialization format |
| crates/dbsp/src/storage/file/format.rs | Incremented VERSION_NUMBER from 3 to 4 |
| crates/dbsp/src/storage/file.rs | Added Deserializer wrapper with version tracking |
| crates/dbsp/src/storage/file/reader.rs | Updated readers to pass version information for backward compatibility |
| crates/storage-test-compat/* | New compatibility testing crate with golden file tests |
| crates/feldera-macros/tests/* | Added comprehensive tests for new tuple serialization |

@gz gz requested review from blp and removed request for blp January 10, 2026 22:34
@gz gz marked this pull request as draft January 11, 2026 19:33
@blp blp changed the title None optimiziation (part 2) None optimization (part 2) Jan 12, 2026
Member

@blp blp left a comment

I looked over this, thanks for making this work. I'm impressed that rkyv was flexible enough.

Contributor Author

gz commented Jan 13, 2026

thanks @blp ! sorry you might have to review this again in a little bit because I made a bunch of changes so things are optimal in most (all?) cases of data ... but I'll incorporate your suggestions

Member

blp commented Jan 13, 2026

> thanks @blp ! sorry you might have to review this again in a little bit because I made a bunch of changes so things are optimal in most (all?) cases of data ... but I'll incorporate your suggestions

No worries

@gz gz force-pushed the none-optimziation-part2.2 branch 2 times, most recently from 1223154 to 8efaa8b Compare January 15, 2026 22:11
@mihaibudiu
Contributor

So we gave up on columnar formats? These use a vector of bits for each column.

Contributor Author

gz commented Jan 15, 2026

> So we gave up on columnar formats? These use a vector of bits for each column.

no we didn't, that would still be ideal. this is a middle ground solution for now.

@gz gz force-pushed the none-optimziation-part2.2 branch from ccf58d2 to 57566da Compare January 15, 2026 23:52
@gz gz marked this pull request as ready for review January 15, 2026 23:52
Contributor Author

gz commented Jan 15, 2026

@blp this needs a re-review

@gz gz force-pushed the none-optimziation-part2.2 branch from 57566da to 0e0df19 Compare January 15, 2026 23:56
@gz gz requested review from blp and Copilot January 16, 2026 00:05
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 42 changed files in this pull request and generated 2 comments.

@gz gz force-pushed the none-optimziation-part2.2 branch 2 times, most recently from 000c31c to baf404a Compare January 16, 2026 00:30
@gz gz force-pushed the none-optimziation-part2.2 branch 4 times, most recently from 2d02ee0 to 11ef998 Compare January 17, 2026 01:19
@gz gz force-pushed the none-optimziation-part2.2 branch from 11ef998 to 000d1f5 Compare January 19, 2026 18:58
@gz gz enabled auto-merge January 19, 2026 18:58
@gz gz force-pushed the none-optimziation-part2.2 branch from 7f01b18 to 9765a6e Compare January 19, 2026 19:48
@gz gz added the marketing Relevant for marketing content label Jan 20, 2026
Member

blp commented Jan 22, 2026

I hadn't even known about the feldera-macros crate before

Member

@blp blp left a comment

I am impressed. This is a big change. It must have been a bear to write. I was not confident that it was possible, especially with backward compatibility. A lot of this is testing code, which is appropriate and I am glad to see it. The golden files are especially welcome.

I read all of it, except that I skipped over a lot of the testing code since the details are not very important.

@gz gz added this pull request to the merge queue Jan 22, 2026
@gz gz removed this pull request from the merge queue due to a manual request Jan 22, 2026
@gz gz force-pushed the none-optimziation-part2.2 branch from 9765a6e to 1db3aec Compare January 22, 2026 22:30
@gz gz enabled auto-merge January 22, 2026 22:30
gz added 7 commits January 22, 2026 14:37
- Store each None as 1 bit in a bitmap
- If a field isn't None, store its value in a
  dynamically sized list
- Support the old format by passing the version to the deserializer

Signed-off-by: Gerd Zellweger <[email protected]>
Add code that generates batch files in different formats and makes sure
we can read them correctly in the future.

Signed-off-by: Gerd Zellweger <[email protected]>
1. We elide the Option<> in stored Tuples since
we can reconstruct it from the bitmap.

This is an optimization added to the new Tup storage format.
We used to write things into files as Some(x), but this leads
to unnecessary overhead when the Option cannot be elided.

By extending IsNone with ability to extract the inner type
and reconstruct the Option in case of Option types
we can avoid storing things as Option<T> and just use
T plus a bitfield.

2. We use both a dense and a sparse format for Tups. The dense
   format covers a Tup that has most fields set (it doesn't make
   sense to maintain a list of offset pointers in that
   case).

3. Use standard rkyv for small tuples. Given the constant overhead
   of the optimized tuples (format fields + bitmap + list (for sparse
   format)), it is more space efficient to just use the default rkyv
   format.
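
Point 1 of the commit message above (extending IsNone with an inner type so the Option layer can be rebuilt from the bitmap) could look roughly like the sketch below. This is a hypothetical reconstruction: the real `IsNone` trait in the codebase may have different method names and signatures, and only the associated type name `Inner` is taken from the PR description.

```rust
// Hypothetical reconstruction; the real IsNone trait may differ.

trait IsNone: Sized {
    /// The type stored on disk once the Option layer is stripped.
    type Inner;

    fn is_none(&self) -> bool;

    /// Strip the Option layer, if any. None means nothing needs storing.
    fn into_inner(self) -> Option<Self::Inner>;

    /// Rebuild the field from what the bitmap and payload say.
    /// Returns None if the bitmap claims "None" for a non-nullable field.
    fn from_inner(inner: Option<Self::Inner>) -> Option<Self>;
}

// A non-nullable field is stored as itself and is never None.
impl IsNone for i64 {
    type Inner = i64;
    fn is_none(&self) -> bool {
        false
    }
    fn into_inner(self) -> Option<i64> {
        Some(self)
    }
    fn from_inner(inner: Option<i64>) -> Option<Self> {
        inner
    }
}

// A nullable field stores only T; the bitmap carries the None information.
impl<T> IsNone for Option<T> {
    type Inner = T;
    fn is_none(&self) -> bool {
        Option::is_none(self)
    }
    fn into_inner(self) -> Option<T> {
        self
    }
    fn from_inner(inner: Option<T>) -> Option<Self> {
        Some(inner)
    }
}

/// Simulate writing one field (bitmap bit plus optional payload) and reading it back.
fn roundtrip<F: IsNone>(field: F) -> Option<F> {
    let none_bit = IsNone::is_none(&field);
    let stored: Option<F::Inner> = field.into_inner();
    F::from_inner(if none_bit { None } else { stored })
}

fn main() {
    assert_eq!(roundtrip(Some(7i64)), Some(Some(7)));
    assert_eq!(roundtrip(None::<i64>), Some(None));
    assert_eq!(roundtrip(5i64), Some(5));
    println!("roundtrips ok");
}
```

The point of the associated type is that `Option<i64>` and `i64` both store a bare `i64` payload, so the discriminant that `Option` would otherwise need is replaced by a single bit in the bitmap.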
Adds a more comprehensive test suite for tuples to make
sure they do the right thing, given the complexity of the new
formats.

Signed-off-by: Gerd Zellweger <[email protected]>
Signed-off-by: Gerd Zellweger <[email protected]>
Signed-off-by: Gerd Zellweger <[email protected]>
@gz gz force-pushed the none-optimziation-part2.2 branch from 1db3aec to 0cab26f Compare January 22, 2026 22:38
@gz gz added this pull request to the merge queue Jan 22, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 23, 2026
Otherwise there is a circular dependency between dbsp
and feldera-macros which breaks cargo publish.

Signed-off-by: Gerd Zellweger <[email protected]>
@gz gz force-pushed the none-optimziation-part2.2 branch from c6dd619 to 8743b9e Compare January 23, 2026 00:29
@gz gz enabled auto-merge January 23, 2026 00:30
Signed-off-by: feldera-bot <[email protected]>
@gz gz added this pull request to the merge queue Jan 23, 2026
Merged via the queue into main with commit 99c28d2 Jan 23, 2026
1 check passed
@gz gz deleted the none-optimziation-part2.2 branch January 23, 2026 04:48
Labels

marketing Relevant for marketing content

5 participants