gaujay
Contributor

@gaujay gaujay commented Sep 21, 2025

Following #635
Here is a test branch to benchmark an "8-bits Tag" implementation (starting with SSE2).

As seen in results below, for Foldhash:

  • lookup-hit is slightly slower (~-2%)
  • lookup-fail gets a small performance boost as the table fills up (~12.5%)

For Std:

  • lookup-hit is slightly faster (~3-5%)
  • lookup-fail is also a bit faster (~8%)

But the more interesting part might be the iteration results.
I'm not sure of the exact reason (I haven't profiled it properly), but iteration shows a ~35% speed-up.

As the new implementation relies on a static const array to map hash values to tags (which might not be hot in cache), the benchmarks should be taken with a grain of salt compared to production scenarios.
But I feel the iteration gain deserves to be studied.
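
To make the "static const array" idea concrete, here is a minimal sketch of a 256-entry table that maps one byte of the hash to a tag already repeated in four bytes. It is not the code from this branch: the names `TAG32` and `tag32_for_hash`, the choice of the top hash byte, the `DELETED` value of 126, and the way the reserved values are remapped are all assumptions made for illustration.

```rust
// Hypothetical sketch, not the PR's actual code.
const EMPTY: u8 = 0x7F; // 127, the EMPTY value shown in the diff
const DELETED: u8 = 0x7E; // assumption: 126 (this value is not shown in the diff)

/// Builds the table at compile time: entry `i` holds the tag for hash byte `i`,
/// repeated in each of the 4 bytes of a `u32`.
const fn build_tag32_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    let mut i = 0;
    while i < 256 {
        let mut tag = i as u8;
        // Assumed remapping: fold the two reserved values onto other tags so
        // that a full bucket can never look EMPTY or DELETED.
        if tag == EMPTY || tag == DELETED {
            tag -= 2;
        }
        table[i] = (tag as u32) * 0x0101_0101;
        i += 1;
    }
    table
}

static TAG32: [u32; 256] = build_tag32_table();

/// Looks up the pre-expanded tag for the top byte of `hash` (assumed choice of bits).
fn tag32_for_hash(hash: u64) -> u32 {
    TAG32[(hash >> 56) as usize]
}
```

With 256 entries of 4 bytes each, such a table occupies 1024 bytes, so every lookup adds a memory load that a micro-benchmark is likely to keep in cache while a larger program may not.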

// BEFORE (7-bits Tag)
running 57 tests
test clone_from_large               ... bench:      12,657.55 ns/iter (+/- 136.98)
test clone_from_small               ... bench:         131.76 ns/iter (+/- 1.26)
test clone_large                    ... bench:      12,859.06 ns/iter (+/- 144.42)
test clone_small                    ... bench:         207.03 ns/iter (+/- 2.70)
test grow_insert_foldhash_highbits  ... bench:      38,844.09 ns/iter (+/- 379.59)
test grow_insert_foldhash_random    ... bench:      42,001.36 ns/iter (+/- 427.64)
test grow_insert_foldhash_serial    ... bench:      40,266.52 ns/iter (+/- 383.22)
test grow_insert_std_highbits       ... bench:      71,766.36 ns/iter (+/- 798.64)
test grow_insert_std_random         ... bench:      71,636.36 ns/iter (+/- 778.27)
test grow_insert_std_serial         ... bench:      71,198.46 ns/iter (+/- 615.62)
test insert_erase_foldhash_highbits ... bench:      36,820.00 ns/iter (+/- 401.00)
test insert_erase_foldhash_random   ... bench:      36,681.11 ns/iter (+/- 449.44)
test insert_erase_foldhash_serial   ... bench:      36,260.00 ns/iter (+/- 415.84)
test insert_erase_std_highbits      ... bench:      65,048.33 ns/iter (+/- 749.42)
test insert_erase_std_random        ... bench:      64,969.17 ns/iter (+/- 612.08)
test insert_erase_std_serial        ... bench:      63,380.00 ns/iter (+/- 700.42)
test insert_foldhash_highbits       ... bench:      30,413.33 ns/iter (+/- 666.75)
test insert_foldhash_random         ... bench:      30,216.89 ns/iter (+/- 205.22)
test insert_foldhash_serial         ... bench:      30,227.33 ns/iter (+/- 213.67)
test insert_std_highbits            ... bench:      48,720.43 ns/iter (+/- 455.83)
test insert_std_random              ... bench:      49,116.82 ns/iter (+/- 445.82)
test insert_std_serial              ... bench:      49,087.83 ns/iter (+/- 457.78)
test iter_foldhash_highbits         ... bench:       2,144.10 ns/iter (+/- 35.21)
test iter_foldhash_random           ... bench:       2,309.01 ns/iter (+/- 41.42)
test iter_foldhash_serial           ... bench:       2,282.76 ns/iter (+/- 41.28)
test iter_std_highbits              ... bench:       2,355.90 ns/iter (+/- 39.74)
test iter_std_random                ... bench:       2,310.06 ns/iter (+/- 64.54)
test iter_std_serial                ... bench:       2,311.40 ns/iter (+/- 50.47)
test loadfactor_lookup_14500        ... bench:       3,080.38 ns/iter (+/- 48.16)
test loadfactor_lookup_16500        ... bench:       3,120.47 ns/iter (+/- 33.56)
test loadfactor_lookup_18500        ... bench:       3,071.59 ns/iter (+/- 38.13)
test loadfactor_lookup_20500        ... bench:       3,112.54 ns/iter (+/- 29.97)
test loadfactor_lookup_22500        ... bench:       3,077.02 ns/iter (+/- 31.17)
test loadfactor_lookup_24500        ... bench:       3,123.32 ns/iter (+/- 34.10)
test loadfactor_lookup_26500        ... bench:       3,097.77 ns/iter (+/- 36.29)
test loadfactor_lookup_28500        ... bench:       3,088.45 ns/iter (+/- 36.75)
test loadfactor_lookup_fail_14500   ... bench:       2,653.24 ns/iter (+/- 32.00)
test loadfactor_lookup_fail_16500   ... bench:       2,779.21 ns/iter (+/- 29.43)
test loadfactor_lookup_fail_18500   ... bench:       2,924.01 ns/iter (+/- 30.28)
test loadfactor_lookup_fail_20500   ... bench:       3,220.67 ns/iter (+/- 39.57)
test loadfactor_lookup_fail_22500   ... bench:       3,775.43 ns/iter (+/- 48.63)
test loadfactor_lookup_fail_24500   ... bench:       4,931.71 ns/iter (+/- 57.38)
test loadfactor_lookup_fail_26500   ... bench:       7,343.11 ns/iter (+/- 102.79)
test loadfactor_lookup_fail_28500   ... bench:      11,605.00 ns/iter (+/- 161.62)
test lookup_fail_foldhash_highbits  ... bench:       5,403.08 ns/iter (+/- 61.75)
test lookup_fail_foldhash_random    ... bench:       5,156.32 ns/iter (+/- 10.66)
test lookup_fail_foldhash_serial    ... bench:       5,103.27 ns/iter (+/- 46.52)
test lookup_fail_std_highbits       ... bench:      18,611.51 ns/iter (+/- 172.34)
test lookup_fail_std_random         ... bench:      18,690.19 ns/iter (+/- 36.45)
test lookup_fail_std_serial         ... bench:      18,349.81 ns/iter (+/- 171.23)
test lookup_foldhash_highbits       ... bench:       5,714.35 ns/iter (+/- 46.51)
test lookup_foldhash_random         ... bench:       5,661.02 ns/iter (+/- 41.83)
test lookup_foldhash_serial         ... bench:       5,282.01 ns/iter (+/- 47.31)
test lookup_std_highbits            ... bench:      19,089.80 ns/iter (+/- 130.08)
test lookup_std_random              ... bench:      19,325.10 ns/iter (+/- 128.25)
test lookup_std_serial              ... bench:      18,871.54 ns/iter (+/- 140.92)
test rehash_in_place                ... bench:     411,638.12 ns/iter (+/- 15,060.88)

running 2 tests
test insert                  ... bench:      13,414.26 ns/iter (+/- 155.50)
test insert_unique_unchecked ... bench:      10,216.89 ns/iter (+/- 94.85)

running 10 tests
test set_ops_bit_and                ... bench:      15,200.38 ns/iter (+/- 115.83)
test set_ops_bit_and_assign         ... bench:      11,075.67 ns/iter (+/- 32.88)
test set_ops_bit_or                 ... bench:     121,849.25 ns/iter (+/- 721.05)
test set_ops_bit_or_assign          ... bench:      99,121.43 ns/iter (+/- 1,543.14)
test set_ops_bit_xor                ... bench:     127,540.83 ns/iter (+/- 4,612.75)
test set_ops_bit_xor_assign         ... bench:      99,960.00 ns/iter (+/- 1,572.29)
test set_ops_sub_assign_large_small ... bench:      98,311.43 ns/iter (+/- 1,693.29)
test set_ops_sub_assign_small_large ... bench:      11,274.10 ns/iter (+/- 98.62)
test set_ops_sub_large_small        ... bench:     127,703.33 ns/iter (+/- 1,757.83)
test set_ops_sub_small_large        ... bench:       1,592.34 ns/iter (+/- 19.71)

Vs

// AFTER (8-bits Tag)
running 57 tests
test clone_from_large               ... bench:      12,717.72 ns/iter (+/- 114.61)
test clone_from_small               ... bench:         131.72 ns/iter (+/- 0.72)
test clone_large                    ... bench:      13,012.22 ns/iter (+/- 279.33)
test clone_small                    ... bench:         207.25 ns/iter (+/- 3.09)
test grow_insert_foldhash_highbits  ... bench:      42,258.00 ns/iter (+/- 363.40)
test grow_insert_foldhash_random    ... bench:      42,671.36 ns/iter (+/- 441.41)
test grow_insert_foldhash_serial    ... bench:      40,586.52 ns/iter (+/- 361.26)
test grow_insert_std_highbits       ... bench:      72,596.36 ns/iter (+/- 793.91)
test grow_insert_std_random         ... bench:      72,523.08 ns/iter (+/- 639.77)
test grow_insert_std_serial         ... bench:      72,351.54 ns/iter (+/- 646.62)
test insert_erase_foldhash_highbits ... bench:      37,095.33 ns/iter (+/- 441.00)
test insert_erase_foldhash_random   ... bench:      37,033.75 ns/iter (+/- 553.25)
test insert_erase_foldhash_serial   ... bench:      36,694.50 ns/iter (+/- 444.00)
test insert_erase_std_highbits      ... bench:      65,396.67 ns/iter (+/- 642.42)
test insert_erase_std_random        ... bench:      65,792.50 ns/iter (+/- 632.17)
test insert_erase_std_serial        ... bench:      65,388.33 ns/iter (+/- 612.25)
test insert_foldhash_highbits       ... bench:      30,409.17 ns/iter (+/- 326.58)
test insert_foldhash_random         ... bench:      30,762.73 ns/iter (+/- 220.45)
test insert_foldhash_serial         ... bench:      30,995.33 ns/iter (+/- 245.04)
test insert_std_highbits            ... bench:      48,786.82 ns/iter (+/- 407.00)
test insert_std_random              ... bench:      49,355.91 ns/iter (+/- 408.41)
test insert_std_serial              ... bench:      48,913.18 ns/iter (+/- 439.14)
test iter_foldhash_highbits         ... bench:       1,688.02 ns/iter (+/- 26.88)
test iter_foldhash_random           ... bench:       1,689.84 ns/iter (+/- 23.44)
test iter_foldhash_serial           ... bench:       1,681.17 ns/iter (+/- 39.69)
test iter_std_highbits              ... bench:       1,662.24 ns/iter (+/- 33.39)
test iter_std_random                ... bench:       1,652.95 ns/iter (+/- 3.03)
test iter_std_serial                ... bench:       1,650.71 ns/iter (+/- 15.33)
test loadfactor_lookup_14500        ... bench:       3,128.97 ns/iter (+/- 46.94)
test loadfactor_lookup_16500        ... bench:       3,150.54 ns/iter (+/- 34.80)
test loadfactor_lookup_18500        ... bench:       3,181.06 ns/iter (+/- 44.53)
test loadfactor_lookup_20500        ... bench:       3,167.33 ns/iter (+/- 33.99)
test loadfactor_lookup_22500        ... bench:       3,171.40 ns/iter (+/- 27.92)
test loadfactor_lookup_24500        ... bench:       3,144.05 ns/iter (+/- 34.27)
test loadfactor_lookup_26500        ... bench:       3,168.11 ns/iter (+/- 37.73)
test loadfactor_lookup_28500        ... bench:       3,127.07 ns/iter (+/- 32.85)
test loadfactor_lookup_fail_14500   ... bench:       2,295.21 ns/iter (+/- 34.08)
test loadfactor_lookup_fail_16500   ... bench:       2,372.33 ns/iter (+/- 16.12)
test loadfactor_lookup_fail_18500   ... bench:       2,523.12 ns/iter (+/- 31.05)
test loadfactor_lookup_fail_20500   ... bench:       2,776.58 ns/iter (+/- 15.96)
test loadfactor_lookup_fail_22500   ... bench:       3,293.93 ns/iter (+/- 9.05)
test loadfactor_lookup_fail_24500   ... bench:       4,389.17 ns/iter (+/- 51.16)
test loadfactor_lookup_fail_26500   ... bench:       6,520.49 ns/iter (+/- 82.83)
test loadfactor_lookup_fail_28500   ... bench:      10,327.45 ns/iter (+/- 123.22)
test lookup_fail_foldhash_highbits  ... bench:       4,580.48 ns/iter (+/- 49.10)
test lookup_fail_foldhash_random    ... bench:       4,645.15 ns/iter (+/- 44.44)
test lookup_fail_foldhash_serial    ... bench:       4,524.91 ns/iter (+/- 20.92)
test lookup_fail_std_highbits       ... bench:      17,468.55 ns/iter (+/- 136.31)
test lookup_fail_std_random         ... bench:      17,285.09 ns/iter (+/- 36.36)
test lookup_fail_std_serial         ... bench:      17,238.77 ns/iter (+/- 141.75)
test lookup_foldhash_highbits       ... bench:       5,598.82 ns/iter (+/- 46.34)
test lookup_foldhash_random         ... bench:       5,554.56 ns/iter (+/- 14.40)
test lookup_foldhash_serial         ... bench:       5,506.69 ns/iter (+/- 39.15)
test lookup_std_highbits            ... bench:      18,487.12 ns/iter (+/- 151.31)
test lookup_std_random              ... bench:      18,336.98 ns/iter (+/- 145.79)
test lookup_std_serial              ... bench:      18,272.94 ns/iter (+/- 46.86)
test rehash_in_place                ... bench:     397,508.75 ns/iter (+/- 20,795.19)

running 2 tests
test insert                  ... bench:      13,527.58 ns/iter (+/- 102.34)
test insert_unique_unchecked ... bench:      11,386.02 ns/iter (+/- 129.57)

running 10 tests
test set_ops_bit_and                ... bench:      15,375.37 ns/iter (+/- 130.88)
test set_ops_bit_and_assign         ... bench:      11,078.62 ns/iter (+/- 97.75)
test set_ops_bit_or                 ... bench:     123,941.56 ns/iter (+/- 1,511.09)
test set_ops_bit_or_assign          ... bench:      98,770.00 ns/iter (+/- 1,370.00)
test set_ops_bit_xor                ... bench:     122,981.67 ns/iter (+/- 1,234.83)
test set_ops_bit_xor_assign         ... bench:      98,363.89 ns/iter (+/- 882.89)
test set_ops_sub_assign_large_small ... bench:      98,177.14 ns/iter (+/- 1,634.14)
test set_ops_sub_assign_small_large ... bench:      11,438.95 ns/iter (+/- 112.68)
test set_ops_sub_large_small        ... bench:     121,118.33 ns/iter (+/- 1,882.50)
test set_ops_sub_small_large        ... bench:       1,566.15 ns/iter (+/- 24.13)

@gaujay gaujay changed the title Draft: change SSE2 impl to use 8-bits Tag instead of 7 [Draft] change SSE2 impl to use 8-bits Tag instead of 7 Sep 21, 2025
@gaujay gaujay changed the title [Draft] change SSE2 impl to use 8-bits Tag instead of 7 [Draft] Benchmarking SSE2 impl to use 8-bits Tag instead of 7 Sep 21, 2025
@gaujay gaujay changed the title [Draft] Benchmarking SSE2 impl to use 8-bits Tag instead of 7 [Draft] Benchmarking 8-bits Tag instead of 7 Sep 21, 2025
special,
x86::_mm_set1_epi8(Tag::DELETED.0 as i8),
x86::_mm_and_si128(is_special, x86::_mm_set1_epi32(Tag::EMPTY32)), // EMPTY if special
x86::_mm_set1_epi32(Tag::DELETED32), // else DELETED
Member

You can simplify this: the compare gives you 0 and -1, to which you can just add 127 to get EMPTY and DELETED.
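
The arithmetic behind this suggestion can be modelled per byte in scalar code. The sketch below is an illustration only: it assumes the comparison yields -1 for full buckets and 0 for special ones, and that `DELETED` is 126 (inferred from -1 + 127; the value is not shown in the diff). The function name `convert_group` is hypothetical, and the real code would use SSE2 intrinsics on the whole group at once.

```rust
// Hypothetical scalar model of one 16-byte control group.
const EMPTY: u8 = 127;
const DELETED: u8 = 126; // assumption: inferred from -1 + 127

/// Full buckets become DELETED; special (EMPTY or DELETED) buckets become EMPTY.
fn convert_group(group: [u8; 16]) -> [u8; 16] {
    group.map(|tag| {
        // Assumed comparison result: -1 (all bits set) for a full bucket, 0 otherwise.
        let is_full: i8 = if tag == EMPTY || tag == DELETED { 0 } else { -1 };
        // 0 + 127 = 127 (EMPTY), -1 + 127 = 126 (DELETED): a single add replaces
        // the and/select sequence.
        is_full.wrapping_add(127) as u8
    })
}
```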

/// Control tag value for an empty bucket.
pub(crate) const EMPTY: Tag = Tag(0b1111_1111);
pub(crate) const EMPTY: Tag = Tag(0b0111_1111); // 127
pub(crate) const EMPTY32: i32 = 0x7F7F7F7F;
Member

There should probably be a separate ExpandedTag type for this rather than a plain i32.
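
One possible shape for such a type is a newtype around the widened value, sketched below. Only the name `ExpandedTag` comes from the comment above; the `splat` and `as_i32` methods and the simplified `Tag` definition are assumptions for illustration.

```rust
// Hypothetical sketch of a dedicated wrapper for a tag repeated in 4 bytes.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
struct Tag(u8);

impl Tag {
    const EMPTY: Tag = Tag(0b0111_1111); // 127, as in the diff
}

#[derive(Copy, Clone, Debug, PartialEq, Eq)]
#[repr(transparent)]
struct ExpandedTag(i32);

impl ExpandedTag {
    /// Replaces the loose `EMPTY32: i32` constant with a typed one.
    const EMPTY: ExpandedTag = ExpandedTag::splat(Tag::EMPTY);

    /// Repeats the tag byte in all four bytes of the `i32`.
    const fn splat(tag: Tag) -> Self {
        ExpandedTag(i32::from_ne_bytes([tag.0; 4]))
    }

    /// Raw value, e.g. for passing to a 32-bit broadcast intrinsic.
    const fn as_i32(self) -> i32 {
        self.0
    }
}
```

Deriving the constant from `Tag::EMPTY` keeps the two values from drifting apart, and the distinct type prevents an arbitrary `i32` from being passed where an expanded tag is expected.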

@Amanieu
Member

Amanieu commented Sep 21, 2025

Passing the benchmark results through cargo-benchcmp makes the differences easier to see:

 name                            before.txt ns/iter  after.txt ns/iter  diff ns/iter   diff %  speedup
 clone_from_large                12,657              12,717                       60    0.47%   x 1.00
 clone_from_small                131                 131                           0    0.00%   x 1.00
 clone_large                     12,859              13,012                      153    1.19%   x 0.99
 clone_small                     207                 207                           0    0.00%   x 1.00
 grow_insert_foldhash_highbits   38,844              42,258                    3,414    8.79%   x 0.92
 grow_insert_foldhash_random     42,001              42,671                      670    1.60%   x 0.98
 grow_insert_foldhash_serial     40,266              40,586                      320    0.79%   x 0.99
 grow_insert_std_highbits        71,766              72,596                      830    1.16%   x 0.99
 grow_insert_std_random          71,636              72,523                      887    1.24%   x 0.99
 grow_insert_std_serial          71,198              72,351                    1,153    1.62%   x 0.98
 insert                          13,414              13,527                      113    0.84%   x 0.99
 insert_erase_foldhash_highbits  36,820              37,095                      275    0.75%   x 0.99
 insert_erase_foldhash_random    36,681              37,033                      352    0.96%   x 0.99
 insert_erase_foldhash_serial    36,260              36,694                      434    1.20%   x 0.99
 insert_erase_std_highbits       65,048              65,396                      348    0.53%   x 0.99
 insert_erase_std_random         64,969              65,792                      823    1.27%   x 0.99
 insert_erase_std_serial         63,380              65,388                    2,008    3.17%   x 0.97
 insert_foldhash_highbits        30,413              30,409                       -4   -0.01%   x 1.00
 insert_foldhash_random          30,216              30,762                      546    1.81%   x 0.98
 insert_foldhash_serial          30,227              30,995                      768    2.54%   x 0.98
 insert_std_highbits             48,720              48,786                       66    0.14%   x 1.00
 insert_std_random               49,116              49,355                      239    0.49%   x 1.00
 insert_std_serial               49,087              48,913                     -174   -0.35%   x 1.00
 insert_unique_unchecked         10,216              11,386                    1,170   11.45%   x 0.90
 iter_foldhash_highbits          2,144               1,688                      -456  -21.27%   x 1.27
 iter_foldhash_random            2,309               1,689                      -620  -26.85%   x 1.37
 iter_foldhash_serial            2,282               1,681                      -601  -26.34%   x 1.36
 iter_std_highbits               2,355               1,662                      -693  -29.43%   x 1.42
 iter_std_random                 2,310               1,652                      -658  -28.48%   x 1.40
 iter_std_serial                 2,311               1,650                      -661  -28.60%   x 1.40
 loadfactor_lookup_14500         3,080               3,128                        48    1.56%   x 0.98
 loadfactor_lookup_16500         3,120               3,150                        30    0.96%   x 0.99
 loadfactor_lookup_18500         3,071               3,181                       110    3.58%   x 0.97
 loadfactor_lookup_20500         3,112               3,167                        55    1.77%   x 0.98
 loadfactor_lookup_22500         3,077               3,171                        94    3.05%   x 0.97
 loadfactor_lookup_24500         3,123               3,144                        21    0.67%   x 0.99
 loadfactor_lookup_26500         3,097               3,168                        71    2.29%   x 0.98
 loadfactor_lookup_28500         3,088               3,127                        39    1.26%   x 0.99
 loadfactor_lookup_fail_14500    2,653               2,295                      -358  -13.49%   x 1.16
 loadfactor_lookup_fail_16500    2,779               2,372                      -407  -14.65%   x 1.17
 loadfactor_lookup_fail_18500    2,924               2,523                      -401  -13.71%   x 1.16
 loadfactor_lookup_fail_20500    3,220               2,776                      -444  -13.79%   x 1.16
 loadfactor_lookup_fail_22500    3,775               3,293                      -482  -12.77%   x 1.15
 loadfactor_lookup_fail_24500    4,931               4,389                      -542  -10.99%   x 1.12
 loadfactor_lookup_fail_26500    7,343               6,520                      -823  -11.21%   x 1.13
 loadfactor_lookup_fail_28500    11,605              10,327                   -1,278  -11.01%   x 1.12
 lookup_fail_foldhash_highbits   5,403               4,580                      -823  -15.23%   x 1.18
 lookup_fail_foldhash_random     5,156               4,645                      -511   -9.91%   x 1.11
 lookup_fail_foldhash_serial     5,103               4,524                      -579  -11.35%   x 1.13
 lookup_fail_std_highbits        18,611              17,468                   -1,143   -6.14%   x 1.07
 lookup_fail_std_random          18,690              17,285                   -1,405   -7.52%   x 1.08
 lookup_fail_std_serial          18,349              17,238                   -1,111   -6.05%   x 1.06
 lookup_foldhash_highbits        5,714               5,598                      -116   -2.03%   x 1.02
 lookup_foldhash_random          5,661               5,554                      -107   -1.89%   x 1.02
 lookup_foldhash_serial          5,282               5,506                       224    4.24%   x 0.96
 lookup_std_highbits             19,089              18,487                     -602   -3.15%   x 1.03
 lookup_std_random               19,325              18,336                     -989   -5.12%   x 1.05
 lookup_std_serial               18,871              18,272                     -599   -3.17%   x 1.03
 rehash_in_place                 411,638             397,508                 -14,130   -3.43%   x 1.04
 set_ops_bit_and                 15,200              15,375                      175    1.15%   x 0.99
 set_ops_bit_and_assign          11,075              11,078                        3    0.03%   x 1.00
 set_ops_bit_or                  121,849             123,941                   2,092    1.72%   x 0.98
 set_ops_bit_or_assign           99,121              98,770                     -351   -0.35%   x 1.00
 set_ops_bit_xor                 127,540             122,981                  -4,559   -3.57%   x 1.04
 set_ops_bit_xor_assign          99,960              98,363                   -1,597   -1.60%   x 1.02
 set_ops_sub_assign_large_small  98,311              98,177                     -134   -0.14%   x 1.00
 set_ops_sub_assign_small_large  11,274              11,438                      164    1.45%   x 0.99
 set_ops_sub_large_small         127,703             121,118                  -6,585   -5.16%   x 1.05
 set_ops_sub_small_large         1,592               1,566                       -26   -1.63%   x 1.02
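
For readers unfamiliar with the columns, the sketch below shows how the diff and speedup figures relate to the raw timings. The formulas are inferred from the numbers in the table above rather than taken from cargo-benchcmp's source, and the function name `compare` is hypothetical.

```rust
/// Returns (diff ns/iter, diff %, speedup) for a before/after pair of timings.
/// Formulas are inferred from the table, not from cargo-benchcmp itself.
fn compare(before_ns: f64, after_ns: f64) -> (f64, f64, f64) {
    let diff = after_ns - before_ns; // negative means "after" is faster
    let diff_pct = diff / before_ns * 100.0; // change relative to "before"
    let speedup = before_ns / after_ns; // above 1.0 means "after" is faster
    (diff, diff_pct, speedup)
}
```

For `iter_foldhash_random`, `compare(2309.0, 1689.0)` gives a diff of -620, about -26.85%, and a speedup of about 1.37, so a reduction in time of roughly 27% and a speedup of roughly 37% describe the same measurement.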

@Amanieu
Member

Amanieu commented Sep 21, 2025

On my system I don't see the iteration speedups. However there is still a significant speedup for lookup-fail:

 name                            before.txt ns/iter  after.txt ns/iter  diff ns/iter   diff %  speedup
 clone_from_large                5,049               5,218                       169    3.35%   x 0.97
 clone_from_small                51                  51                            0    0.00%   x 1.00
 clone_large                     5,164               5,212                        48    0.93%   x 0.99
 clone_small                     58                  56                           -2   -3.45%   x 1.04
 grow_insert_foldhash_highbits   21,603              19,172                   -2,431  -11.25%   x 1.13
 grow_insert_foldhash_random     23,592              21,177                   -2,415  -10.24%   x 1.11
 grow_insert_foldhash_serial     22,113              20,659                   -1,454   -6.58%   x 1.07
 grow_insert_std_highbits        35,838              36,034                      196    0.55%   x 0.99
 grow_insert_std_random          36,287              36,268                      -19   -0.05%   x 1.00
 grow_insert_std_serial          36,041              35,853                     -188   -0.52%   x 1.01
 insert                          6,465               6,584                       119    1.84%   x 0.98
 insert_erase_foldhash_highbits  16,202              16,641                      439    2.71%   x 0.97
 insert_erase_foldhash_random    16,974              17,271                      297    1.75%   x 0.98
 insert_erase_foldhash_serial    16,452              17,010                      558    3.39%   x 0.97
 insert_erase_std_highbits       33,985              34,616                      631    1.86%   x 0.98
 insert_erase_std_random         34,856              34,390                     -466   -1.34%   x 1.01
 insert_erase_std_serial         35,041              34,198                     -843   -2.41%   x 1.02
 insert_foldhash_highbits        13,835              15,221                    1,386   10.02%   x 0.91
 insert_foldhash_random          13,832              14,401                      569    4.11%   x 0.96
 insert_foldhash_serial          13,259              14,592                    1,333   10.05%   x 0.91
 insert_std_highbits             23,343              23,800                      457    1.96%   x 0.98
 insert_std_random               23,502              23,521                       19    0.08%   x 1.00
 insert_std_serial               23,340              23,256                      -84   -0.36%   x 1.00
 insert_unique_unchecked         5,102               5,343                       241    4.72%   x 0.95
 iter_foldhash_highbits          976                 948                         -28   -2.87%   x 1.03
 iter_foldhash_random            977                 944                         -33   -3.38%   x 1.03
 iter_foldhash_serial            976                 949                         -27   -2.77%   x 1.03
 iter_std_highbits               977                 946                         -31   -3.17%   x 1.03
 iter_std_random                 976                 986                          10    1.02%   x 0.99
 iter_std_serial                 978                 983                           5    0.51%   x 0.99
 loadfactor_lookup_14500         1,621               1,711                        90    5.55%   x 0.95
 loadfactor_lookup_16500         1,561               1,721                       160   10.25%   x 0.91
 loadfactor_lookup_18500         1,582               1,718                       136    8.60%   x 0.92
 loadfactor_lookup_20500         1,560               1,714                       154    9.87%   x 0.91
 loadfactor_lookup_22500         1,568               1,709                       141    8.99%   x 0.92
 loadfactor_lookup_24500         1,667               1,714                        47    2.82%   x 0.97
 loadfactor_lookup_26500         1,605               1,710                       105    6.54%   x 0.94
 loadfactor_lookup_28500         1,608               1,709                       101    6.28%   x 0.94
 loadfactor_lookup_fail_14500    1,323               1,070                      -253  -19.12%   x 1.24
 loadfactor_lookup_fail_16500    1,396               1,120                      -276  -19.77%   x 1.25
 loadfactor_lookup_fail_18500    1,480               1,221                      -259  -17.50%   x 1.21
 loadfactor_lookup_fail_20500    1,734               1,326                      -408  -23.53%   x 1.31
 loadfactor_lookup_fail_22500    2,018               1,847                      -171   -8.47%   x 1.09
 loadfactor_lookup_fail_24500    2,620               2,415                      -205   -7.82%   x 1.08
 loadfactor_lookup_fail_26500    3,842               3,682                      -160   -4.16%   x 1.04
 loadfactor_lookup_fail_28500    6,322               5,976                      -346   -5.47%   x 1.06
 lookup_fail_foldhash_highbits   2,858               2,534                      -324  -11.34%   x 1.13
 lookup_fail_foldhash_random     2,693               2,375                      -318  -11.81%   x 1.13
 lookup_fail_foldhash_serial     2,523               2,137                      -386  -15.30%   x 1.18
 lookup_fail_std_highbits        9,009               8,553                      -456   -5.06%   x 1.05
 lookup_fail_std_random          9,191               8,758                      -433   -4.71%   x 1.05
 lookup_fail_std_serial          9,082               8,603                      -479   -5.27%   x 1.06
 lookup_foldhash_highbits        2,722               2,782                        60    2.20%   x 0.98
 lookup_foldhash_random          2,827               2,908                        81    2.87%   x 0.97
 lookup_foldhash_serial          2,541               2,776                       235    9.25%   x 0.92
 lookup_std_highbits             9,522               9,622                       100    1.05%   x 0.99
 lookup_std_random               9,577               9,827                       250    2.61%   x 0.97
 lookup_std_serial               9,364               9,571                       207    2.21%   x 0.98
 rehash_in_place                 192,258             185,360                  -6,898   -3.59%   x 1.04
 set_ops_bit_and                 6,873               7,051                       178    2.59%   x 0.97
 set_ops_bit_and_assign          4,405               4,383                       -22   -0.50%   x 1.01
 set_ops_bit_or                  44,992              44,977                      -15   -0.03%   x 1.00
 set_ops_bit_or_assign           35,757              35,544                     -213   -0.60%   x 1.01
 set_ops_bit_xor                 48,045              48,529                      484    1.01%   x 0.99
 set_ops_bit_xor_assign          36,744              36,532                     -212   -0.58%   x 1.01
 set_ops_sub_assign_large_small  35,307              35,852                      545    1.54%   x 0.98
 set_ops_sub_assign_small_large  4,484               4,548                        64    1.43%   x 0.99
 set_ops_sub_large_small         48,383              46,866                   -1,517   -3.14%   x 1.03
 set_ops_sub_small_large         791                 815                          24    3.03%   x 0.97

@gaujay
Contributor Author

gaujay commented Sep 22, 2025

Interesting, thanks for the test run.
Looks like we have significant differences in results between machines (I tested on a Win11 Comet Lake).
This might explain the unexpected iterator speedup as a missed compiler optimization on this platform (maybe some inlining?).

This is a rough implementation just for benchmarking.
I could clean it up (and do the Neon version) if you think it is worth pursuing.
Given some of the slowdowns, in particular for lookup-hit, I'm not quite convinced.

rust-bors bot pushed a commit to rust-lang/rust that referenced this pull request Sep 22, 2025
Amanieu added a commit to Amanieu/rust that referenced this pull request Sep 22, 2025
@Amanieu
Member

Amanieu commented Sep 22, 2025

I started an experiment in rust-lang/rust#146909 to use your modification for the compiler's benchmark suite. This measures compilation time, which is hugely affected by hash table performance.

> Looks like we have significant differences in results between machines (I tested on a Win11 Comet Lake).

I tested on a Zen 2 running Linux.

@Amanieu
Member

Amanieu commented Sep 23, 2025

Overall this seems like a slight perf loss: rust-lang/rust#146909 (comment)

Note that the standard library was previously using 0.15.5, so this comparison also includes the change from #639.

@Amanieu
Member

Amanieu commented Sep 23, 2025

With that said, it might still be worth pursuing. Perhaps reducing the size of the lookup table would help? The current table takes up a full 1KB. A smaller table, however, costs 2 extra cycles on x86 to broadcast a u8 to a u8x16.
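
The size trade-off can be sketched in portable Rust as follows. The type aliases and function names are hypothetical, and the scalar `widen` only stands in for the SIMD broadcast, whose actual cost is not modelled here.

```rust
use std::mem::size_of;

// Hypothetical table layouts: pre-expanded entries versus one byte per entry.
type WideTable = [i32; 256];
type NarrowTable = [u8; 256];

/// Sizes in bytes of the wide and narrow tables.
fn table_sizes() -> (usize, usize) {
    (size_of::<WideTable>(), size_of::<NarrowTable>())
}

/// Repeats a tag byte in all four bytes of an `i32`; this is the extra work
/// a narrow table pushes into the hot path.
fn widen(tag: u8) -> i32 {
    i32::from_ne_bytes([tag; 4])
}

/// Wide table: one load, the value is ready to use.
fn lookup_wide(table: &WideTable, hash_byte: u8) -> i32 {
    table[hash_byte as usize]
}

/// Narrow table: one load plus the widening step.
fn lookup_narrow(table: &NarrowTable, hash_byte: u8) -> i32 {
    widen(table[hash_byte as usize])
}
```

Here `table_sizes()` returns `(1024, 256)`: the narrow table is a quarter of the size, at the price of the extra widening step on every lookup.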

If you push any changes to this branch I can re-run the rustc perf benchmarks.

@gaujay
Contributor Author

gaujay commented Sep 23, 2025

Appreciated. I also tested with an i8 table, but the results were slower due to the additional instructions in the hot path.
In my C++ investigations, the 8-bit tag + i32 LUT approach was slightly faster in micro-benchmarks, but unfortunately this doesn't seem to carry over to Rust.

On another note, I noticed that the hasher's random seeds can sometimes lead to a poor distribution across buckets when working with serial/high-bits key values. It is quite rare (maybe 3%) but worth mentioning, as it could also skew some benchmark results.
By contrast, the Boost implementation seems to use a fixed seed that is favorably biased toward 0..N sequences (almost no hash collisions). Could this be worth considering for rustc, if it fits its usage?
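
One way to check a given seed for this effect is to count how many serial keys land in each bucket. The sketch below uses the standard library's `RandomState` purely as an example of a randomly seeded hasher; the helper names `bucket_histogram` and `max_load` are hypothetical, and reducing the hash with a low-bits mask is an assumption about how buckets are selected.

```rust
use std::collections::hash_map::RandomState;
use std::hash::BuildHasher;

/// Counts how many of the keys `0..n` fall into each of `buckets` slots.
/// `buckets` must be a power of two; the low bits of the hash pick the slot.
fn bucket_histogram<S: BuildHasher>(state: &S, n: u64, buckets: usize) -> Vec<u32> {
    assert!(buckets.is_power_of_two());
    let mut counts = vec![0u32; buckets];
    for key in 0..n {
        let hash = state.hash_one(key);
        counts[(hash as usize) & (buckets - 1)] += 1;
    }
    counts
}

/// Largest number of keys sharing one slot; unusually high values flag a bad seed.
fn max_load<S: BuildHasher>(state: &S, n: u64, buckets: usize) -> u32 {
    bucket_histogram(state, n, buckets).into_iter().max().unwrap_or(0)
}

/// Runs the check with a fresh random seed, as a benchmark harness might.
fn max_load_with_random_seed(n: u64, buckets: usize) -> u32 {
    max_load(&RandomState::new(), n, buckets)
}
```

Calling `max_load_with_random_seed` repeatedly and comparing the results gives a rough picture of how much the seed alone changes the distribution before any timing is measured.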
