None optimization (part 2) #5409
Pull request overview
This PR optimizes the serialization format for tuples containing Option types by implementing a sparse encoding strategy. Instead of storing the full size of each field even when None, the new format uses a bitmap to mark None fields and only serializes non-None values.
Changes:
- Implemented bitmap-based sparse tuple serialization for more efficient storage of tuples with many `None` values
- Incremented storage format version from 3 to 4 to support the new tuple encoding
- Added backward compatibility layer to deserialize v3 files using the legacy format
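To make the bitmap idea concrete, here is a minimal sketch of a sparse encoding for a row of optional fields. It is not the code generated by `feldera-macros` and does not use rkyv; the names `SparseRow`, `encode`, and `decode` are hypothetical, and the fields are fixed to `i64` for simplicity.

```rust
/// Hypothetical sparse encoding of a row of optional fields (illustration only):
/// one bit per field (1 = None) plus the values of the non-None fields.
#[derive(Debug, PartialEq)]
pub struct SparseRow {
    /// Bit `i % 8` of byte `i / 8` is set when field `i` is None.
    pub none_bitmap: Vec<u8>,
    /// Values of the fields that are Some, in field order.
    pub values: Vec<i64>,
}

/// Splits the fields into a None bitmap and a compact list of present values.
pub fn encode(fields: &[Option<i64>]) -> SparseRow {
    let mut none_bitmap = vec![0u8; (fields.len() + 7) / 8];
    let mut values = Vec::new();
    for (i, field) in fields.iter().enumerate() {
        match field {
            Some(v) => values.push(*v),
            None => none_bitmap[i / 8] |= 1u8 << (i % 8),
        }
    }
    SparseRow { none_bitmap, values }
}

/// Rebuilds the original fields by walking the bitmap and consuming
/// one stored value for every field whose bit is clear.
pub fn decode(row: &SparseRow, num_fields: usize) -> Vec<Option<i64>> {
    let mut values = row.values.iter().copied();
    (0..num_fields)
        .map(|i| {
            let is_none = row.none_bitmap[i / 8] & (1u8 << (i % 8)) != 0;
            if is_none { None } else { values.next() }
        })
        .collect()
}
```

In this sketch a row with nine fields, six of them `None`, stores two bitmap bytes and three values instead of nine full-size slots.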
Reviewed changes
Copilot reviewed 21 out of 29 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| crates/feldera-macros/src/tuples.rs | Core implementation of the new bitmap-based tuple serialization format |
| crates/dbsp/src/storage/file/format.rs | Incremented VERSION_NUMBER from 3 to 4 |
| crates/dbsp/src/storage/file.rs | Added Deserializer wrapper with version tracking |
| crates/dbsp/src/storage/file/reader.rs | Updated readers to pass version information for backward compatibility |
| crates/storage-test-compat/* | New compatibility testing crate with golden file tests |
| crates/feldera-macros/tests/* | Added comprehensive tests for new tuple serialization |
blp
left a comment
I looked over this, thanks for making this work. I'm impressed that rkyv was flexible enough.
crates/storage-test-compat/golden-files/golden-batch-v3-snappy-large.feldera
thanks @blp! sorry you might have to review this again in a little bit because I made a bunch of changes so things are optimal in most (all?) cases of data ... but I'll incorporate your suggestions
No worries
1223154 to 8efaa8b
So we gave up on columnar formats? These use a vector of bits for each column.
no we didn't, that would still be ideal. this is a middle ground solution for now.
ccf58d2 to 57566da
@blp this needs a re-review
57566da to 0e0df19
Pull request overview
Copilot reviewed 29 out of 42 changed files in this pull request and generated 2 comments.
000c31c to baf404a
2d02ee0 to 11ef998
11ef998 to 000d1f5
7f01b18 to 9765a6e
I hadn't even known about the feldera-macros crate before
blp
left a comment
I am impressed. This is a big change. It must have been a bear to write. I was not confident that it was possible, especially with backward compatibility. A lot of this is testing code, which is appropriate and I am glad to see it. The golden files are especially welcome.
I read all of it, except that I skipped over a lot of the testing code since the details are not very important.
9765a6e to 1db3aec
- Stores None's as 1-bit in a bitmap
- In case something isn't None, store as the value in a dynamically sized-list
- Support handling old format by passing version to deserializer

Signed-off-by: Gerd Zellweger <[email protected]>
Signed-off-by: Gerd Zellweger <[email protected]>
Add code to generate batch files in different formats and make sure we can read them correctly in the future.

Signed-off-by: Gerd Zellweger <[email protected]>
1. We elide the Option<> in stored Tuples, since we can reconstruct it from the bitmap. This is an optimization added to the new Tup storage format: we used to write fields into files as Some(x), but this leads to unnecessary overhead when the Option is not elided. By extending IsNone with the ability to extract the inner type, and to reconstruct the Option in the case of Option types, we can avoid storing fields as Option<T> and just use T plus a bitfield.
2. We use both a dense and a sparse format for Tups. The dense format covers a Tup that has most fields set, where it doesn't make sense to maintain a list with offset pointers.
3. Use standard rkyv for small tuples. Given the constant overhead of the optimized tuples (format fields + bitmap + list (for the sparse format)), it is more space efficient to just use the default rkyv format.
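The first point can be pictured with a small trait sketch. This is a hypothetical stand-in, not the real `IsNone` trait from the codebase: the associated type `Inner` and the methods `into_inner` and `from_inner` are assumed names, chosen only to show how a field could be split into "stored value or nothing" and rebuilt later from the bitmap.

```rust
/// Hypothetical stand-in for an extended `IsNone` trait (names are assumptions).
pub trait IsNone: Sized {
    /// The type actually written to storage: `T` for `Option<T>`,
    /// and the type itself for non-optional fields.
    type Inner;

    /// True when only a bitmap bit needs to be recorded for this field.
    fn is_none(&self) -> bool;

    /// Splits the field into the value to store, or `None` if nothing is stored.
    fn into_inner(self) -> Option<Self::Inner>;

    /// Rebuilds the field from what was stored (`None` when the bitmap bit was set).
    fn from_inner(stored: Option<Self::Inner>) -> Self;
}

impl<T> IsNone for Option<T> {
    type Inner = T;

    fn is_none(&self) -> bool {
        matches!(self, None)
    }

    fn into_inner(self) -> Option<T> {
        self
    }

    fn from_inner(stored: Option<T>) -> Self {
        stored
    }
}

impl IsNone for i32 {
    type Inner = i32;

    fn is_none(&self) -> bool {
        false
    }

    fn into_inner(self) -> Option<i32> {
        Some(self)
    }

    fn from_inner(stored: Option<i32>) -> Self {
        // A non-optional field is always stored, so a missing value is a bug.
        stored.expect("non-optional field must have a stored value")
    }
}
```

With a trait of this shape, a serializer only ever sees `Inner` values, and the `Option` wrapper is recovered at read time from the bitmap.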
Adds a more comprehensive test suite for tuples to make sure they do the right thing, given the complexity of the new formats.

Signed-off-by: Gerd Zellweger <[email protected]>
Signed-off-by: Gerd Zellweger <[email protected]>
Signed-off-by: Gerd Zellweger <[email protected]>
1db3aec to 0cab26f
Otherwise there is a circular dependency between dbsp and feldera-macros which breaks cargo publish.

Signed-off-by: Gerd Zellweger <[email protected]>
c6dd619 to 8743b9e
Signed-off-by: feldera-bot <[email protected]>
This changes the way tuples are serialized/deserialized.
Before, a TupX<> would be serialized as a struct of
This is not space efficient when a Tup has many Option types.
So we add two new formats:
We also need to distinguish which tuple format we're dealing with, which means the new Tuples now look like this
Breaking Changes?
Increased storage format versions.
Describe Incompatible Changes
We are writing a different storage format for Tup<> types. We ensure it's backward compatible by distinguishing new and old formats using the version in the respective files.
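As a rough illustration of version-based dispatch, the sketch below maps the version recorded in a file to the tuple layout a reader would expect. The version numbers 3 and 4 come from this PR; everything else (`TupleLayout`, `UnsupportedVersion`, `layout_for_version`, and the choice to reject all other versions) is a hypothetical assumption and not the actual `Deserializer` wrapper.

```rust
/// Version numbers taken from the PR: v3 is the legacy format, v4 the new one.
const LEGACY_VERSION: u32 = 3;
const CURRENT_VERSION: u32 = 4;

/// Hypothetical tag for which tuple layout a reader should expect.
#[derive(Debug, PartialEq)]
pub enum TupleLayout {
    /// Tuples as written by v3 files.
    Legacy,
    /// Tuples with a None bitmap, as written by v4 files.
    Bitmap,
}

/// Hypothetical error for files this sketch does not know how to read.
#[derive(Debug, PartialEq)]
pub struct UnsupportedVersion(pub u32);

/// Picks the tuple layout from the version stored in the file.
/// Assumption: only versions 3 and 4 are accepted here.
pub fn layout_for_version(version: u32) -> Result<TupleLayout, UnsupportedVersion> {
    match version {
        LEGACY_VERSION => Ok(TupleLayout::Legacy),
        CURRENT_VERSION => Ok(TupleLayout::Bitmap),
        other => Err(UnsupportedVersion(other)),
    }
}
```

A reader would call something like `layout_for_version` once per file and pass the result down to the tuple deserialization code, so old and new files can coexist.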