[Draft] Benchmarking 8-bits Tag instead of 7 #653
special,
x86::_mm_set1_epi8(Tag::DELETED.0 as i8),
x86::_mm_and_si128(is_special, x86::_mm_set1_epi32(Tag::EMPTY32)), // EMPTY if special
x86::_mm_set1_epi32(Tag::DELETED32), // else DELETED
You can simplify this: the compare gives you 0 and -1, to which you can just add 127 to get EMPTY and DELETED.
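To make the arithmetic in this suggestion concrete, here is a minimal scalar sketch of what happens in one byte lane. EMPTY = 127 is taken from the diff; DELETED = 126 is an assumption implied by the comment rather than something shown in this excerpt, and `tag_from_mask` is a hypothetical helper name. Which of the two compare outcomes should map to which tag depends on the polarity of the compare in the surrounding code.

```rust
// EMPTY = 127 comes from the diff in this PR.
const EMPTY: i8 = 127;
// ASSUMPTION: DELETED = 126, implied by the review comment, not shown in the diff.
const DELETED: i8 = 126;

/// `mask` is one byte lane of a SIMD compare result:
/// 0 when the compare is false, -1 (all bits set) when it is true.
/// Hypothetical helper, for illustration only.
fn tag_from_mask(mask: i8) -> i8 {
    debug_assert!(mask == 0 || mask == -1);
    // 0 + 127 = 127 (EMPTY), -1 + 127 = 126 (DELETED under the assumption above).
    mask.wrapping_add(127)
}

fn main() {
    assert_eq!(tag_from_mask(0), EMPTY);
    assert_eq!(tag_from_mask(-1), DELETED);
    println!("false lane -> {}, true lane -> {}", tag_from_mask(0), tag_from_mask(-1));
}
```

In the SSE2 path the same idea would presumably be a single byte-wise add of the compare result and a vector of 127s (for example with `_mm_add_epi8` and `_mm_set1_epi8`), replacing the and/select sequence.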
  /// Control tag value for an empty bucket.
- pub(crate) const EMPTY: Tag = Tag(0b1111_1111);
+ pub(crate) const EMPTY: Tag = Tag(0b0111_1111); // 127
+ pub(crate) const EMPTY32: i32 = 0x7F7F7F7F;
There should probably be a separate ExpandedTag type for this rather than a plain i32.
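One way such a type could look is sketched below. This is only an illustration of the newtype idea, not the branch's code: the `Tag` definition is a simplified stand-in, and `splat` and `to_i32` are hypothetical method names. The only value taken from the diff is EMPTY = 0b0111_1111, whose expansion matches the EMPTY32 constant.

```rust
/// Simplified stand-in for the crate's `Tag` type.
#[derive(Copy, Clone, PartialEq, Eq, Debug)]
#[repr(transparent)]
pub struct Tag(pub u8);

impl Tag {
    /// Value taken from the diff.
    pub const EMPTY: Tag = Tag(0b0111_1111);
}

/// A tag byte repeated in all four bytes of an i32, ready to be
/// broadcast with a 32-bit splat. Hypothetical type, for illustration.
#[derive(Copy, Clone, PartialEq, Eq, Debug)]
#[repr(transparent)]
pub struct ExpandedTag(i32);

impl ExpandedTag {
    pub const EMPTY: ExpandedTag = ExpandedTag::splat(Tag::EMPTY);

    /// Repeat the tag byte four times (hypothetical constructor name).
    pub const fn splat(tag: Tag) -> Self {
        ExpandedTag(i32::from_ne_bytes([tag.0; 4]))
    }

    /// Raw value to hand to an intrinsic such as `_mm_set1_epi32`.
    pub const fn to_i32(self) -> i32 {
        self.0
    }
}

fn main() {
    // Matches the EMPTY32 constant from the diff.
    assert_eq!(ExpandedTag::EMPTY.to_i32(), 0x7F7F7F7F);
    println!("{:#010x}", ExpandedTag::EMPTY.to_i32());
}
```

Deriving the expanded constant from `Tag::EMPTY` keeps the two values from drifting apart, and the wrapper prevents an arbitrary i32 from being passed where a broadcast tag is expected.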
Passing the benchmark results through
On my system I don't see the iteration speedups. However there is still a significant speedup for lookup-fail:
Interesting, thanks for the test run. This is a rough implementation just for benchmarking.
I started an experiment in rust-lang/rust#146909 to use your modification for the compiler's benchmark suite. This measures compilation time, which is hugely affected by hash table performance.
I tested on a Zen 2 running Linux.
Overall this seems like a slight perf loss: rust-lang/rust#146909 (comment)
Note that the standard library was previously using 0.15.5, so this comparison also includes the change from #639.
With that said, it might still be worth pursuing. Perhaps reducing the size of the lookup table might help? The current table takes up a full 1KB. But then this suffers from 2 extra cycles on x86 to broadcast a
If you push any changes to this branch I can re-run the rustc perf benchmarks.
Appreciated. I also tested with an i8 table, but the results were slower due to the additional instructions in the hot path.
On another note, I noticed that some random seeds of the hasher can lead to a poor distribution across buckets when working with serial/high-bits key values. It is quite rare (maybe 3%) but worth mentioning, as it could also skew some benchmark results.
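The size trade-off discussed in the last two comments can be sketched as follows. A 256-entry table of pre-broadcast i32 values occupies 256 × 4 = 1024 bytes, which is consistent with the "full 1KB" figure, while a byte-wide table is 256 bytes but leaves the broadcast to the lookup path. The mapping used here (`tag_for`) is purely hypothetical, since the branch's actual mapping is not shown in this conversation, and DELETED = 126 is an assumption.

```rust
// EMPTY = 127 comes from the diff; DELETED = 126 is an ASSUMPTION.
const EMPTY: u8 = 127;
const DELETED: u8 = 126;

/// HYPOTHETICAL mapping from a hash byte to a full-bucket tag that
/// avoids the two reserved values. Not the mapping used in the branch.
const fn tag_for(byte: u8) -> u8 {
    match byte {
        EMPTY | DELETED => byte ^ 0x80, // 127 -> 255, 126 -> 254
        _ => byte,
    }
}

/// Byte-wide table: 256 bytes, broadcast still needed after the lookup.
const fn build_narrow() -> [u8; 256] {
    let mut table = [0u8; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = tag_for(i as u8);
        i += 1;
    }
    table
}

/// Pre-broadcast table: each entry holds the tag in all four bytes.
const fn build_wide() -> [i32; 256] {
    let mut table = [0i32; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = i32::from_ne_bytes([tag_for(i as u8); 4]);
        i += 1;
    }
    table
}

static NARROW: [u8; 256] = build_narrow();
static WIDE: [i32; 256] = build_wide();

fn main() {
    println!("narrow table: {} bytes", std::mem::size_of_val(&NARROW));
    println!("wide table:   {} bytes", std::mem::size_of_val(&WIDE));
}
```

Whether the smaller cache footprint outweighs the extra broadcast work is exactly the question the benchmarks above are probing; the sketch only shows where the two sizes come from.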
Following #635
Here is a test branch to benchmark an "8-bits Tag" implementation (starting with SSE2).
As seen in results below, for Foldhash:
For Std:
But the more interesting part might be the iteration results.
I am not sure of the exact reason for it (without proper profiling), but we have a ~35% speed-up.
As the new implementation relies on a static const array to map hash value to tag (which might not be hot in cache), the benchmarks should be taken with a grain of salt compared to in-production scenarios.
But I feel like the iteration gain deserves to be studied.