Thanks to visit codestin.com
Credit goes to github.com

Skip to content

FSSTCompressor#664

Merged
a10y merged 16 commits into
developfrom
aduffy/fsst-compressor
Sep 3, 2024
Merged

FSSTCompressor#664
a10y merged 16 commits into
developfrom
aduffy/fsst-compressor

Conversation

@a10y

@a10y a10y commented Aug 20, 2024

Copy link
Copy Markdown
Contributor
  • Implements a new codec, FSSTCompressor using latest version of fsst-rs library
  • Adds new metadata field on CompressionTree to allow reuse between the sampling and compressing stages. For example, we can save the ALP exponents to not have to calculate them twice. This is very important for FSST so that we save the overhead of training the table twice
  • Adds new compression benchmark using the lineitem table's l_comment column, with scalefactor=1, which is just over 6million rows. By default this is loaded as a ChunkedArray with 733 partitions. Compressing with FSST enabled takes 1.6s. Compressing on the canonicalized array takes ~550ms. We should be able to speed this up by at least ~2x, see FSSTCompressor #664 (comment), and we can potentially do even better. We probably want to be able to FSST compress a ChunkedArray directly so that we avoid the overhead of training/compressing each chunk from scratch.

Comment thread vortex-sampling-compressor/src/compressors/fsst.rs Outdated
Comment thread vortex-sampling-compressor/src/compressors/fsst.rs Outdated
Comment thread vortex-sampling-compressor/src/compressors/fsst.rs
@a10y

a10y commented Aug 21, 2024

Copy link
Copy Markdown
Contributor Author

image

Ran the compress_taxi benchmark, got ~80% slower. I am a bit surprised that the biggest culprit seems to be creating new counters in the FSST training loop. That doesn't even scale w.r.t. to the size of the input array, it's just a flat 2MB allocation. The zeroing of the vector seems to be the biggest problem. I think we can avoid that with a second bitmap, let me try that out

@a10y

a10y commented Aug 21, 2024

Copy link
Copy Markdown
Contributor Author

Alright, using the change in spiraldb/fsst#21 helped a lot.

New benchmark result:

end to end - taxi/compress
                        time:   [100.73 ms 101.72 ms 102.96 ms]
                        change: [-45.073% -43.470% -42.057%] (p = 0.00 < 0.05)
                        Performance has improved.

Which is about 10ms or ~11% slower than running without FSST.

@a10y

a10y commented Aug 21, 2024

Copy link
Copy Markdown
Contributor Author

And I think we can go even lower, ideally we'd just use the trained compressor over the samples to compress the full array

Comment thread bench-vortex/benches/random_access.rs
Comment thread encodings/fsst/tests/fsst_tests.rs
@robert3005

Copy link
Copy Markdown
Contributor

Just bear in mind that the samples can be very small compared to data, i.e. 1024 elements. I would say just retrain it

@a10y

a10y commented Aug 23, 2024

Copy link
Copy Markdown
Contributor Author

Ok I've done a few things today

  1. Introduced a way to reuse compressors in our samplingcompressor code
  2. Keep tweaking some things on the FSST side, including matching how the paper author's sample the full input, and tried to reduce memory accesses/extraneous instructions as much as possible (in draft at feat: port in more from the C++ code spiraldb/fsst#24)
  3. I'm trying to run some comparisons against the C++ code. Here's a screenshot comparing a microbenchmark for compressing thecomments column from the TPCH orders table (1.5mm rows) using Rust vs the C++ implementation. We seem roughly on-par with the C++ implementation here, the timings seemed consistent after several runs

image

Comment thread Cargo.toml Outdated
@a10y

a10y commented Aug 23, 2024

Copy link
Copy Markdown
Contributor Author

Ok I added a new benchmark now which just compresses the comments column in-memory via Vortex, and i'm seeing it take ~500ms, which is roughly 2-3x longer than just doing the compression without Vortex.

I think the root of the performance difference is the chunking. Here's a comparison between running FSST over the comments column chunked as per our TPC loading infra (nchunks=192) and the canonicalized version of the comments array, which is not chunked:

image

So somewhere I guess there's some fixed-size overhead in FSST training (probably a combo of allocations and double-tight-loops over 0...511) that when you try and run FSST hundreds of times, they start to add up and can skew your results.

I'm curious how DuckDB and other folks deal with FSST + chunking, it seems like we might want to treat it as a special thing that can do its own sampling + have shared symbol table across chunks

@a10y

a10y commented Aug 23, 2024

Copy link
Copy Markdown
Contributor Author

I'm currently blocking this on some work in spiraldb/fsst#24

@a10y

a10y commented Sep 3, 2024

Copy link
Copy Markdown
Contributor Author

image

Currently 59% of fsst_compress time is spent actually compressing, we break out of the fast loop to do push_null and data copying. Something to improve on in flup

Comment thread bench-vortex/src/tpch/mod.rs
Comment thread encodings/fsst/src/array.rs
Comment thread encodings/fsst/src/array.rs Outdated
// so we transmute to kill the lifetime complaints.
// This is fine because the returned `Decompressor`'s lifetime is tied to the lifetime
// of these same arrays.
let symbol_lengths = unsafe { std::mem::transmute::<&[u8], &[u8]>(symbol_lengths) };

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious for a sanity check here, or if there's another way i should be doing this. it feels a bit wrong, but I think it is currently the best way to do the thing I want...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm this is wrong, if we actually canonicalize this pointer is invalid

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this should be fixed now, instead of returning a decompressor this constructs one on-the-fly to pass to a provided function

@a10y a10y marked this pull request as ready for review September 3, 2024 19:30
Comment thread bench-vortex/benches/compress_benchmark.rs Outdated
Comment thread vortex-sampling-compressor/src/compressors/mod.rs Outdated
@a10y a10y force-pushed the aduffy/fsst-compressor branch from 58a58eb to 79197e9 Compare September 3, 2024 21:26
Comment thread vortex-sampling-compressor/src/compressors/mod.rs
Comment thread vortex-sampling-compressor/src/compressors/fsst.rs Outdated
@a10y a10y enabled auto-merge (squash) September 3, 2024 22:20
@a10y a10y merged commit fd49140 into develop Sep 3, 2024
@a10y a10y deleted the aduffy/fsst-compressor branch September 3, 2024 22:29
@robert3005 robert3005 mentioned this pull request Sep 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants