Optimize pack with SIMD hex decoding #15751
|
Side note: I am not a SIMD expert (yet, hehe 😅), but I am willing to add ARM NEON support in the same PR so that it works on both x86 and ARM, if the team is OK with doubling the code to cover ARM. Side note 2: if this type of work is accepted, I can submit several other SIMD-related PRs for operations in other core classes. It's a draft for now so I can fix linting and any other issues it may have. |
|
Note for myself before I forget:
|
Use SSSE3 pshufb for parallel nibble-to-hex conversion and SSE4.1
blendv for hex-to-nibble conversion. Also eliminate per-byte
rb_str_buf_cat() calls in pack by pre-allocating and writing
directly to the output buffer.
Performance improvements (A/B benchmark):
unpack('H*') - bytes to hex:
- 64 bytes: 1.4x faster
- 256 bytes: 1.7x faster
- 1KB: 2.2x faster
- 4KB: 2.3x faster
- 64KB: 2.4x faster
pack('H*') - hex to bytes:
- 64 bytes: 4.8x faster
- 256 bytes: 10.3x faster
- 1KB: 14.5x faster
- 4KB: 15.4x faster
- 64KB: 28x faster
The pack decoding improvement is especially dramatic because the
original code called rb_str_buf_cat() per byte, while the new code
pre-allocates the output buffer and writes directly.
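To illustrate the nibble-to-hex step described in the commit message, here is a minimal sketch of how `pshufb` (`_mm_shuffle_epi8`) can act as a 16-entry lookup table. It is not the code from this PR: the function names `hex_encode`, `hex_encode_scalar`, and `hex_encode_ssse3` are hypothetical, and the sketch assumes GCC or Clang on x86 (for the `target` attribute and `__builtin_cpu_supports`), falling back to scalar code elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_HAVE_SSSE3 1
#endif

static const char hex_digits[] = "0123456789abcdef";

/* Scalar reference: two output characters per input byte. */
static void hex_encode_scalar(const uint8_t *src, size_t len, char *dst)
{
    for (size_t i = 0; i < len; i++) {
        dst[2 * i]     = hex_digits[src[i] >> 4];
        dst[2 * i + 1] = hex_digits[src[i] & 0x0f];
    }
}

#ifdef SKETCH_HAVE_SSSE3
/* Converts 16 input bytes (32 output characters) per iteration. */
__attribute__((target("ssse3")))
static void hex_encode_ssse3(const uint8_t *src, size_t len, char *dst)
{
    const __m128i table = _mm_loadu_si128((const __m128i *)hex_digits);
    const __m128i mask  = _mm_set1_epi8(0x0f);
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i in = _mm_loadu_si128((const __m128i *)(src + i));
        /* Split every byte into its high and low nibble. */
        __m128i hi = _mm_and_si128(_mm_srli_epi16(in, 4), mask);
        __m128i lo = _mm_and_si128(in, mask);
        /* pshufb uses each nibble as an index into the 16-byte table. */
        __m128i hi_hex = _mm_shuffle_epi8(table, hi);
        __m128i lo_hex = _mm_shuffle_epi8(table, lo);
        /* Interleave so each byte yields "high digit, low digit". */
        _mm_storeu_si128((__m128i *)(dst + 2 * i),
                         _mm_unpacklo_epi8(hi_hex, lo_hex));
        _mm_storeu_si128((__m128i *)(dst + 2 * i + 16),
                         _mm_unpackhi_epi8(hi_hex, lo_hex));
    }
    hex_encode_scalar(src + i, len - i, dst + 2 * i); /* leftover bytes */
}
#endif

/* Writes 2 * len characters to dst (no NUL terminator). */
static void hex_encode(const uint8_t *src, size_t len, char *dst)
{
    assert(len == 0 || (src != NULL && dst != NULL));
#ifdef SKETCH_HAVE_SSSE3
    if (__builtin_cpu_supports("ssse3")) {
        hex_encode_ssse3(src, len, dst);
        return;
    }
#endif
    hex_encode_scalar(src, len, dst);
}
```

The caller is expected to size `dst` once, up front, at `2 * len` bytes, which matches the pre-allocation idea in the commit message.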
Force-pushed from 37f6ebf to 42ff3dd.
|
Last side note: I would be happy to take over maintenance of the SIMD code in Ruby if there is a will to merge such optimizations. |
|
Prebuilt Ruby packages are typically not compiled with -msse4 or -mavx flags, so runtime feature detection (via CPUID) is necessary. I’d be happy to help add this if there’s interest in using SIMD optimizations in general. |
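As a rough illustration of the runtime detection mentioned above, the following sketch queries CPUID leaf 1 and checks the SSSE3 and SSE4.1 feature bits (ECX bits 9 and 19). It assumes GCC or Clang on x86, whose `<cpuid.h>` provides `__get_cpuid`; the struct `cpu_features`, the function `detect_cpu_features`, and the `SKETCH_*` macros are hypothetical names, not part of Ruby.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_X86 1
#endif

/* CPUID leaf 1 reports these features in ECX. */
#define SKETCH_ECX_SSSE3 (1u << 9)
#define SKETCH_ECX_SSE41 (1u << 19)

struct cpu_features {
    bool ssse3;
    bool sse41;
};

/* Returns all-false on non-x86 targets or when leaf 1 is unavailable. */
static struct cpu_features detect_cpu_features(void)
{
    struct cpu_features f = { false, false };
#ifdef SKETCH_X86
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.ssse3 = (ecx & SKETCH_ECX_SSSE3) != 0;
        f.sse41 = (ecx & SKETCH_ECX_SSE41) != 0;
    }
#endif
    /* Sanity check: CPUs that report SSE4.1 also report SSSE3. */
    assert(!f.sse41 || f.ssse3);
    return f;
}
```

A build following this approach would typically call the detection once at startup and store a function pointer to either the SIMD or the scalar routine, so the same binary runs on CPUs without these extensions.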
|
Related discussion on https://bugs.ruby-lang.org/issues/16487
How much of the speedup comes from eliminating the per-byte rb_str_buf_cat() calls? It seems other pack templates could probably also use similar optimizations. |
Force-pushed from 9326533 to 42ff3dd.
|
@rhenium I will extract that change from this PR and look into applying it to the remaining pack templates in the coming days. |
Use SSSE3 pshufb for parallel nibble-to-hex conversion and SSE4.1 blendv for hex-to-nibble conversion. Also eliminate per-byte rb_str_buf_cat() calls in pack by pre-allocating and writing directly to the output buffer.
Performance Comparison Summary
DECODING (pack 'H*') - Major Improvements
Throughput Comparison (Decoding)
ENCODING (unpack 'H*') - No Regression
Encoding performance is essentially unchanged between versions (already efficient in master).
The original code called rb_str_buf_cat() per byte, while the new code pre-allocates the output buffer and writes directly.
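The pre-allocation idea can be sketched outside of Ruby's C API as follows. This is a simplified stand-in, not the PR's code: `hex_value` and `hex_decode_alloc` are hypothetical helpers, `malloc` takes the place of Ruby's string allocation, and unlike `pack` this sketch rejects non-hex characters. An odd trailing digit becomes the high nibble of the last byte, as `pack('H*')` does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Returns 0..15 for a hex digit, or -1 for any other character. */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c |= 0x20; /* fold ASCII upper case to lower case */
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decodes hex into a buffer allocated once at its final size.
 * Returns NULL on invalid input or allocation failure; caller frees. */
static uint8_t *hex_decode_alloc(const char *hex, size_t hex_len,
                                 size_t *out_len)
{
    assert(hex != NULL && out_len != NULL);
    size_t n = (hex_len + 1) / 2;      /* output size is known up front */
    uint8_t *out = malloc(n ? n : 1);  /* single allocation, no appends */
    if (out == NULL) return NULL;

    for (size_t i = 0; i < n; i++) {
        int hi = hex_value((unsigned char)hex[2 * i]);
        int lo = (2 * i + 1 < hex_len)
                     ? hex_value((unsigned char)hex[2 * i + 1])
                     : 0;              /* odd length: low nibble is 0 */
        if (hi < 0 || lo < 0) {
            free(out);
            return NULL;
        }
        out[i] = (uint8_t)((hi << 4) | lo); /* direct write */
    }
    *out_len = n;
    return out;
}
```

Because the output length is fully determined by the input length, the loop body contains no function call that can grow or reallocate the buffer, which is where the per-byte append cost came from.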
Real-World Scenarios
Common Data Formats
Network/Protocol Data
Typical Payload Sizes
Round-trip Performance (encode then decode)
Edge Cases
SIMD Boundary Cases (Decoding)
The SIMD implementation processes 32 hex characters (16 bytes output) at a time.
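A sketch of what a 32-character block loop with a scalar tail might look like is given below. It is an assumption-laden illustration rather than the PR's implementation: it uses `_mm_blendv_epi8` to choose between the digit and letter conversions, then `_mm_maddubs_epi16` and `_mm_packus_epi16` to merge nibble pairs into bytes. All function names are hypothetical, the input is assumed to contain only valid hex digits, and GCC or Clang on x86 is assumed for the SIMD path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_HAVE_SSE41 1
#endif

/* Scalar nibble conversion; assumes c is a valid hex digit. */
static uint8_t nibble(unsigned char c)
{
    return (uint8_t)(c <= '9' ? c - '0' : (c | 0x20) - 'a' + 10);
}

/* Handles whatever the block loop leaves over, including odd lengths. */
static void decode_scalar(const char *hex, size_t hex_len, uint8_t *out)
{
    size_t i = 0;
    for (; i + 2 <= hex_len; i += 2)
        out[i / 2] = (uint8_t)((nibble((unsigned char)hex[i]) << 4) |
                               nibble((unsigned char)hex[i + 1]));
    if (i < hex_len) /* odd trailing digit becomes the high nibble */
        out[i / 2] = (uint8_t)(nibble((unsigned char)hex[i]) << 4);
}

#ifdef SKETCH_HAVE_SSE41
/* Converts 16 ASCII hex digits to 16 nibble values (0..15). */
__attribute__((target("sse4.1")))
static __m128i nibbles16(__m128i c)
{
    __m128i is_alpha = _mm_cmpgt_epi8(c, _mm_set1_epi8('9'));
    __m128i digit = _mm_sub_epi8(c, _mm_set1_epi8('0'));
    __m128i alpha = _mm_sub_epi8(_mm_or_si128(c, _mm_set1_epi8(0x20)),
                                 _mm_set1_epi8('a' - 10));
    return _mm_blendv_epi8(digit, alpha, is_alpha);
}

/* Decodes full 32-character blocks; returns characters consumed. */
__attribute__((target("sse4.1")))
static size_t decode_blocks_sse41(const char *hex, size_t hex_len,
                                  uint8_t *out)
{
    /* Per 16-bit lane: even nibble * 16 + odd nibble * 1. */
    const __m128i weights = _mm_set1_epi16(0x0110);
    size_t i = 0;
    for (; i + 32 <= hex_len; i += 32) {
        __m128i a = nibbles16(_mm_loadu_si128((const __m128i *)(hex + i)));
        __m128i b = nibbles16(
            _mm_loadu_si128((const __m128i *)(hex + i + 16)));
        __m128i bytes = _mm_packus_epi16(_mm_maddubs_epi16(a, weights),
                                         _mm_maddubs_epi16(b, weights));
        _mm_storeu_si128((__m128i *)(out + i / 2), bytes);
    }
    return i;
}
#endif

/* out must hold (hex_len + 1) / 2 bytes. */
static void hex_decode(const char *hex, size_t hex_len, uint8_t *out)
{
    assert(hex != NULL && out != NULL);
    size_t done = 0;
#ifdef SKETCH_HAVE_SSE41
    if (__builtin_cpu_supports("sse4.1"))
        done = decode_blocks_sse41(hex, hex_len, out);
#endif
    decode_scalar(hex + done, hex_len - done, out + done / 2);
}
```

Inputs of 31, 32, and 33 characters exercise the boundary: the first never enters the block loop, the second is handled entirely by it, and the third leaves a single odd digit for the scalar tail.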
Odd-length Hex Strings
Case Sensitivity (Decoding 4096 B)
Benchmark Script