Conversation

@mensfeld (Contributor)

Use SSSE3 pshufb for parallel nibble-to-hex conversion and SSE4.1 blendv for hex-to-nibble conversion. Also eliminate per-byte rb_str_buf_cat() calls in pack by pre-allocating and writing directly to the output buffer.
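A minimal sketch of the encode-side idea described above (illustrative only, not the actual pack.c change): each nibble of the input becomes an index into a 16-byte "0123456789abcdef" table, so `_mm_shuffle_epi8` converts 16 nibbles to ASCII per instruction.

```c
/* Minimal sketch of the SSSE3 nibble-to-hex idea; illustrative only,
 * not the code from this PR. Converts 16 input bytes to 32 hex chars. */
#include <emmintrin.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pshufb) */

static void hex_encode_16(const unsigned char *src, char *dst)
{
    const __m128i lut  = _mm_loadu_si128((const __m128i *)"0123456789abcdef");
    const __m128i mask = _mm_set1_epi8(0x0f);

    __m128i in = _mm_loadu_si128((const __m128i *)src);
    __m128i hi = _mm_and_si128(_mm_srli_epi16(in, 4), mask);  /* high nibbles */
    __m128i lo = _mm_and_si128(in, mask);                     /* low nibbles  */

    /* pshufb: each nibble value selects a character from the lookup table */
    __m128i hi_chars = _mm_shuffle_epi8(lut, hi);
    __m128i lo_chars = _mm_shuffle_epi8(lut, lo);

    /* 'H*' emits the high nibble first, so interleave hi/lo and store 32 chars */
    _mm_storeu_si128((__m128i *)dst,        _mm_unpacklo_epi8(hi_chars, lo_chars));
    _mm_storeu_si128((__m128i *)(dst + 16), _mm_unpackhi_epi8(hi_chars, lo_chars));
}
```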

Performance Comparison Summary

DECODING (pack 'H*') - Major Improvements

| Size | Master | Optimized | Speedup |
|---|---|---|---|
| 16 B | 183 ns | 93 ns | 2.0x |
| 32 B | 303 ns | 115 ns | 2.6x |
| 64 B | 536 ns | 151 ns | 3.5x |
| 128 B | 974 ns | 240 ns | 4.1x |
| 256 B | 1.88 µs | 432 ns | 4.4x |
| 512 B | 3.68 µs | 863 ns | 4.3x |
| 1024 B | 7.79 µs | 1.72 µs | 4.5x |
| 4096 B | 28.0 µs | 6.65 µs | 4.2x |
| 16384 B | 159 µs | 26.4 µs | 6.0x |
| 65536 B | 675 µs | 323 µs | 2.1x |

Throughput Comparison (Decoding)

| Size | Master | Optimized |
|---|---|---|
| 64 B | 119 MB/s | 425 MB/s |
| 256 B | 136 MB/s | 593 MB/s |
| 1024 B | 131 MB/s | 597 MB/s |
| 4096 B | 146 MB/s | 616 MB/s |
| 16384 B | 103 MB/s | 621 MB/s |

ENCODING (unpack 'H*') - No Regression

Encoding performance is essentially unchanged between versions (already efficient in master).

| Size | Master | Optimized |
|---|---|---|
| 64 B | 135 ns | 133 ns |
| 256 B | 379 ns | 371 ns |
| 1024 B | 1.73 µs | 1.73 µs |
| 4096 B | 6.70 µs | 6.69 µs |

The pack decoding improvement is especially large because the original code called rb_str_buf_cat() per byte, while the new code pre-allocates the output buffer and writes directly.
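To show that buffer-handling change in isolation (no SIMD), here is a rough sketch; `unhex()`, `decode_slow()`, and `decode_fast()` are hypothetical names for illustration, not functions from pack.c.

```c
/* Sketch of the buffer-handling change only, without SIMD; the helper and
 * function names here are made up for illustration. */
#include "ruby.h"

static inline int unhex(unsigned char c)
{
    return c <= '9' ? c - '0' : (c | 0x20) - 'a' + 10;  /* assumes valid hex */
}

/* Old pattern: grow the string one byte at a time. */
static void decode_slow(VALUE out, const char *hex, long nbytes)
{
    for (long i = 0; i < nbytes; i++) {
        char byte = (char)((unhex(hex[2 * i]) << 4) | unhex(hex[2 * i + 1]));
        rb_str_buf_cat(out, &byte, 1);  /* per-byte append, per-byte bookkeeping */
    }
}

/* New pattern: allocate the full result once, then write through the pointer. */
static VALUE decode_fast(const char *hex, long nbytes)
{
    VALUE out = rb_str_new(NULL, nbytes);  /* pre-sized String of nbytes */
    char *dst = RSTRING_PTR(out);
    for (long i = 0; i < nbytes; i++) {
        dst[i] = (char)((unhex(hex[2 * i]) << 4) | unhex(hex[2 * i + 1]));
    }
    return out;
}
```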

Real-World Scenarios

Common Data Formats

| Scenario | Master | Optimized | Speedup |
|---|---|---|---|
| UUID decode (32 hex) | 204 ns | 106 ns | 1.9x |
| MD5 decode (32 hex) | 201 ns | 106 ns | 1.9x |
| SHA-1 decode (40 hex) | 229 ns | 111 ns | 2.1x |
| SHA-256 decode (64 hex) | 311 ns | 127 ns | 2.4x |
| SHA-512 decode (128 hex) | 560 ns | 159 ns | 3.5x |

Network/Protocol Data

| Scenario | Master | Optimized | Speedup |
|---|---|---|---|
| MAC address decode (12 hex) | 117 ns | 85 ns | 1.4x |
| IPv6 decode (32 hex) | 201 ns | 115 ns | 1.7x |
| AES-128 key decode (32 hex) | 194 ns | 106 ns | 1.8x |
| AES-256 key decode (64 hex) | 306 ns | 125 ns | 2.4x |

Typical Payload Sizes

| Scenario | Master | Optimized | Speedup |
|---|---|---|---|
| Small payload decode (512 hex) | 1.76 µs | 392 ns | 4.5x |
| Medium payload decode (2K hex) | 7.37 µs | 1.49 µs | 4.9x |
| Page-size payload decode (8K hex) | 27.3 µs | 6.04 µs | 4.5x |
| Large payload decode (128K hex) | 672 µs | 324 µs | 2.1x |

Round-trip Performance (encode then decode)

| Size | Master | Optimized | Speedup |
|---|---|---|---|
| 16 B | 245 ns | 152 ns | 1.6x |
| 64 B | 620 ns | 260 ns | 2.4x |
| 256 B | 2.16 µs | 721 ns | 3.0x |
| 1024 B | 7.94 µs | 2.37 µs | 3.4x |
| 4096 B | 31.9 µs | 10.1 µs | 3.2x |
| 16384 B | 166 µs | 43.7 µs | 3.8x |
| 65536 B | 751 µs | 414 µs | 1.8x |

Edge Cases

SIMD Boundary Cases (Decoding)

The SIMD implementation processes 32 hex characters (16 bytes output) at a time.

| Hex Length | Master | Optimized | Speedup |
|---|---|---|---|
| 28 chars | 160 ns | 95 ns | 1.7x |
| 32 chars | 190 ns | 107 ns | 1.8x |
| 64 chars | 312 ns | 126 ns | 2.5x |
| 96 chars | 379 ns | 152 ns | 2.5x |
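
Zooming in on the hex-to-nibble step those 32-character lanes rely on, a rough sketch of the SSE4.1 blendv selection looks like this (assumes already-validated input, omits packing the nibbles into output bytes, and is not the PR's exact code):

```c
/* Rough sketch of converting 16 ASCII hex chars to nibble values with
 * SSE4.1 blendv; validation and nibble packing are omitted. */
#include <smmintrin.h>  /* SSE4.1: _mm_blendv_epi8 (pulls in the SSE2/SSSE3 headers) */

static __m128i hex_to_nibbles_16(__m128i chars)
{
    __m128i lower   = _mm_or_si128(chars, _mm_set1_epi8(0x20));      /* fold case: 'A' -> 'a' */
    __m128i digits  = _mm_sub_epi8(chars, _mm_set1_epi8('0'));       /* '0'..'9' -> 0..9      */
    __m128i letters = _mm_sub_epi8(lower, _mm_set1_epi8('a' - 10));  /* 'a'..'f' -> 10..15    */

    /* 0xff wherever the char is above '9', i.e. a letter: pick `letters` there */
    __m128i is_alpha = _mm_cmpgt_epi8(chars, _mm_set1_epi8('9'));
    return _mm_blendv_epi8(digits, letters, is_alpha);
}
```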

Odd-length Hex Strings

| Hex Length | Master | Optimized | Speedup |
|---|---|---|---|
| 15 chars | 126 ns | 87 ns | 1.4x |
| 31 chars | 193 ns | 107 ns | 1.8x |
| 63 chars | 301 ns | 137 ns | 2.2x |
| 65 chars | 295 ns | 131 ns | 2.3x |

Case Sensitivity (Decoding 4096 B)

| Input Case | Master | Optimized |
|---|---|---|
| lowercase | 30.0 µs | 6.74 µs |
| UPPERCASE | 27.1 µs | 5.51 µs |
| MiXeD | 27.1 µs | 5.51 µs |

Benchmark Script

#!/usr/bin/env ruby
# frozen_string_literal: true

# Comprehensive Benchmark for pack/unpack hex operations (H/h format specifiers)
# Tests SIMD-optimized hex encoding/decoding performance
#
# Usage:
#   ruby benchmark_pack_hex.rb

def measure(iterations)
  GC.start
  GC.disable
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  GC.enable
  elapsed
end

def format_rate(bytes, elapsed)
  mb_per_sec = (bytes / elapsed) / 1_000_000.0
  "%.1f MB/s" % mb_per_sec
end

def format_ns(elapsed, iterations)
  ns = (elapsed * 1_000_000_000) / iterations
  if ns >= 1000
    "%.2f us" % (ns / 1000)
  else
    "%.1f ns" % ns
  end
end

def run_bench(label, iterations)
  elapsed = measure(iterations) { yield }
  ns = (elapsed * 1_000_000_000) / iterations
  puts "  %-45s %12s" % [label, format_ns(elapsed, iterations)]
  ns
end

puts "=" * 70
puts "Comprehensive Pack/Unpack Hex Benchmark (H/h format specifiers)"
puts "=" * 70
puts
puts "Ruby: #{RUBY_DESCRIPTION}"
puts "Time: #{Time.now}"
puts "PID: #{$$}"
puts

# Warm up CPU
10_000.times { "hello".unpack1('H*') }
10_000.times { ["68656c6c6f"].pack('H*') }

# =============================================================================
# SECTION 1: Core Performance - Various Sizes
# =============================================================================
puts "=" * 70
puts "SECTION 1: Core Performance Scaling"
puts "=" * 70
puts

# Test sizes including SIMD boundary points
# Encoding SIMD processes 16 bytes at a time
# Decoding SIMD processes 32 hex chars (16 bytes output) at a time
SIZES = [1, 2, 4, 8, 15, 16, 17, 31, 32, 33, 48, 64, 96, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]

def iterations_for(size)
  case size
  when 0..32 then 500_000
  when 33..128 then 300_000
  when 129..512 then 150_000
  when 513..2048 then 75_000
  when 2049..8192 then 30_000
  when 8193..32768 then 10_000
  else 3_000
  end
end

# Pre-generate test data
test_data = {}
SIZES.each do |size|
  binary = Random.bytes(size)
  hex_lower = binary.unpack1('H*')
  test_data[size] = {
    binary: binary,
    hex_lower: hex_lower,
    hex_upper: hex_lower.upcase,
    hex_mixed: hex_lower.chars.map.with_index { |c, i| i.even? ? c.upcase : c }.join
  }
end

puts "-" * 70
puts "1.1 ENCODING: unpack('H*') - Binary to Hex"
puts "-" * 70
puts
puts "  %-12s %12s %12s %15s" % ["Size", "Time/op", "Throughput", "Iterations"]
puts "  " + "-" * 55

SIZES.each do |size|
  data = test_data[size][:binary]
  iterations = iterations_for(size)
  elapsed = measure(iterations) { data.unpack1('H*') }
  total_bytes = size * iterations
  puts "  %-12s %12s %12s %15d" % [
    "#{size} B",
    format_ns(elapsed, iterations),
    format_rate(total_bytes, elapsed),
    iterations
  ]
end
puts

puts "-" * 70
puts "1.2 DECODING: [hex].pack('H*') - Hex to Binary"
puts "-" * 70
puts
puts "  %-12s %12s %12s %15s" % ["Size", "Time/op", "Throughput", "Iterations"]
puts "  " + "-" * 55

SIZES.each do |size|
  hex = test_data[size][:hex_lower]
  iterations = iterations_for(size)
  elapsed = measure(iterations) { [hex].pack('H*') }
  total_bytes = size * iterations
  puts "  %-12s %12s %12s %15d" % [
    "#{size} B",
    format_ns(elapsed, iterations),
    format_rate(total_bytes, elapsed),
    iterations
  ]
end
puts

# =============================================================================
# SECTION 2: Format Variations (H vs h)
# =============================================================================
puts "=" * 70
puts "SECTION 2: Format Variations (H vs h)"
puts "=" * 70
puts

test_sizes = [16, 64, 256, 1024, 4096]

puts "-" * 70
puts "2.1 Encoding: H (high nibble first) vs h (low nibble first)"
puts "-" * 70
puts
puts "  %-20s %15s %15s" % ["Size", "unpack('H*')", "unpack('h*')"]
puts "  " + "-" * 50

test_sizes.each do |size|
  data = test_data[size][:binary]
  iterations = iterations_for(size)

  elapsed_H = measure(iterations) { data.unpack1('H*') }
  elapsed_h = measure(iterations) { data.unpack1('h*') }

  puts "  %-20s %15s %15s" % [
    "#{size} B",
    format_ns(elapsed_H, iterations),
    format_ns(elapsed_h, iterations)
  ]
end
puts

puts "-" * 70
puts "2.2 Decoding: H (high nibble first) vs h (low nibble first)"
puts "-" * 70
puts
puts "  %-20s %15s %15s" % ["Size", "pack('H*')", "pack('h*')"]
puts "  " + "-" * 50

test_sizes.each do |size|
  hex = test_data[size][:hex_lower]
  iterations = iterations_for(size)

  elapsed_H = measure(iterations) { [hex].pack('H*') }
  elapsed_h = measure(iterations) { [hex].pack('h*') }

  puts "  %-20s %15s %15s" % [
    "#{size} B",
    format_ns(elapsed_H, iterations),
    format_ns(elapsed_h, iterations)
  ]
end
puts

# =============================================================================
# SECTION 3: Input Case Sensitivity (Decoding)
# =============================================================================
puts "=" * 70
puts "SECTION 3: Hex Input Case Sensitivity (Decoding)"
puts "=" * 70
puts
puts "  %-15s %15s %15s %15s" % ["Size", "lowercase", "UPPERCASE", "MiXeD"]
puts "  " + "-" * 60

test_sizes.each do |size|
  iterations = iterations_for(size)

  elapsed_lower = measure(iterations) { [test_data[size][:hex_lower]].pack('H*') }
  elapsed_upper = measure(iterations) { [test_data[size][:hex_upper]].pack('H*') }
  elapsed_mixed = measure(iterations) { [test_data[size][:hex_mixed]].pack('H*') }

  puts "  %-15s %15s %15s %15s" % [
    "#{size} B",
    format_ns(elapsed_lower, iterations),
    format_ns(elapsed_upper, iterations),
    format_ns(elapsed_mixed, iterations)
  ]
end
puts

# =============================================================================
# SECTION 4: Partial Format Specifiers
# =============================================================================
puts "=" * 70
puts "SECTION 4: Partial Format Specifiers (H4, H8, H16, etc.)"
puts "=" * 70
puts

data_1024 = test_data[1024][:binary]
hex_2048 = test_data[1024][:hex_lower]  # 2048 hex chars
iterations = 200_000

puts "-" * 70
puts "4.1 Partial Encoding: Extract first N hex characters"
puts "-" * 70
puts

[4, 8, 16, 32, 64, 128, 256, 512].each do |n|
  run_bench("unpack('H#{n}') from 1024 bytes", iterations) { data_1024.unpack1("H#{n}") }
end
puts

puts "-" * 70
puts "4.2 Partial Decoding: Pack first N hex characters"
puts "-" * 70
puts

[4, 8, 16, 32, 64, 128, 256, 512].each do |n|
  hex_n = hex_2048[0, n]
  run_bench("['#{n} hex chars'].pack('H#{n}')", iterations) { [hex_n].pack("H#{n}") }
end
puts

# =============================================================================
# SECTION 5: Special Hex Patterns
# =============================================================================
puts "=" * 70
puts "SECTION 5: Special Hex Patterns (Decoding)"
puts "=" * 70
puts

iterations = 100_000
size = 1024

puts "-" * 70
puts "5.1 Homogeneous patterns (#{size} bytes output)"
puts "-" * 70
puts

hex_zeros = "00" * size
hex_ones = "ff" * size
hex_alternating = "0f" * size
hex_sequential = (0...size).map { |i| "%02x" % (i & 0xff) }.join

run_bench("All zeros (00)", iterations) { [hex_zeros].pack('H*') }
run_bench("All ones (ff)", iterations) { [hex_ones].pack('H*') }
run_bench("Alternating (0f)", iterations) { [hex_alternating].pack('H*') }
run_bench("Sequential (00,01,02...)", iterations) { [hex_sequential].pack('H*') }
puts

puts "-" * 70
puts "5.2 Digit-heavy vs Letter-heavy hex"
puts "-" * 70
puts

# Digits only: 0-9 (no a-f)
hex_digits_only = "0123456789" * (size * 2 / 10 + 1)
hex_digits_only = hex_digits_only[0, size * 2]

# Letters heavy: mostly a-f
hex_letters_heavy = "abcdef" * (size * 2 / 6 + 1)
hex_letters_heavy = hex_letters_heavy[0, size * 2]

run_bench("Digits only (0-9)", iterations) { [hex_digits_only].pack('H*') }
run_bench("Letters heavy (a-f)", iterations) { [hex_letters_heavy].pack('H*') }
puts

# =============================================================================
# SECTION 6: Edge Cases
# =============================================================================
puts "=" * 70
puts "SECTION 6: Edge Cases"
puts "=" * 70
puts

puts "-" * 70
puts "6.1 Empty and minimal inputs"
puts "-" * 70
puts

iterations = 500_000

run_bench("Empty string encode: ''.unpack('H*')", iterations) { "".unpack1('H*') }
run_bench("Empty string decode: [''].pack('H*')", iterations) { [""].pack('H*') }
run_bench("Single byte encode: 1 byte", iterations) { "\x42".unpack1('H*') }
run_bench("Single byte decode: '42'", iterations) { ["42"].pack('H*') }
run_bench("Two bytes encode", iterations) { "\x42\x43".unpack1('H*') }
run_bench("Two bytes decode: '4243'", iterations) { ["4243"].pack('H*') }
puts

puts "-" * 70
puts "6.2 Odd-length hex strings (last nibble handling)"
puts "-" * 70
puts

iterations = 200_000

# Odd length hex strings - the last character is a half-byte
[1, 3, 5, 7, 15, 17, 31, 33, 63, 65].each do |len|
  hex = "a" * len
  run_bench("Decode #{len} hex chars (odd)", iterations) { [hex].pack('H*') }
end
puts

puts "-" * 70
puts "6.3 SIMD boundary cases (16 bytes = 32 hex chars threshold)"
puts "-" * 70
puts

iterations = 300_000

# Just below, at, and above SIMD thresholds
[14, 15, 16, 17, 18, 30, 31, 32, 33, 34, 46, 47, 48, 49, 50].each do |size|
  data = Random.bytes(size)
  run_bench("Encode #{size} bytes", iterations) { data.unpack1('H*') }
end
puts

[28, 30, 31, 32, 33, 34, 62, 63, 64, 65, 66, 94, 95, 96, 97, 98].each do |hex_len|
  hex = "a" * hex_len
  run_bench("Decode #{hex_len} hex chars", iterations) { [hex].pack('H*') }
end
puts

# =============================================================================
# SECTION 7: Real-World Scenarios
# =============================================================================
puts "=" * 70
puts "SECTION 7: Real-World Scenarios"
puts "=" * 70
puts

iterations = 200_000

puts "-" * 70
puts "7.1 Common data formats"
puts "-" * 70
puts

# UUID (16 bytes = 32 hex chars)
uuid_binary = Random.bytes(16)
uuid_hex = uuid_binary.unpack1('H*')

run_bench("UUID encode (16 bytes)", iterations) { uuid_binary.unpack1('H*') }
run_bench("UUID decode (32 hex chars)", iterations) { [uuid_hex].pack('H*') }

# MD5 hash (16 bytes = 32 hex chars)
md5_binary = Random.bytes(16)
md5_hex = md5_binary.unpack1('H*')

run_bench("MD5 encode (16 bytes)", iterations) { md5_binary.unpack1('H*') }
run_bench("MD5 decode (32 hex chars)", iterations) { [md5_hex].pack('H*') }

# SHA-1 hash (20 bytes = 40 hex chars)
sha1_binary = Random.bytes(20)
sha1_hex = sha1_binary.unpack1('H*')

run_bench("SHA-1 encode (20 bytes)", iterations) { sha1_binary.unpack1('H*') }
run_bench("SHA-1 decode (40 hex chars)", iterations) { [sha1_hex].pack('H*') }

# SHA-256 hash (32 bytes = 64 hex chars)
sha256_binary = Random.bytes(32)
sha256_hex = sha256_binary.unpack1('H*')

run_bench("SHA-256 encode (32 bytes)", iterations) { sha256_binary.unpack1('H*') }
run_bench("SHA-256 decode (64 hex chars)", iterations) { [sha256_hex].pack('H*') }

# SHA-512 hash (64 bytes = 128 hex chars)
sha512_binary = Random.bytes(64)
sha512_hex = sha512_binary.unpack1('H*')

run_bench("SHA-512 encode (64 bytes)", iterations) { sha512_binary.unpack1('H*') }
run_bench("SHA-512 decode (128 hex chars)", iterations) { [sha512_hex].pack('H*') }
puts

puts "-" * 70
puts "7.2 Network/Protocol data sizes"
puts "-" * 70
puts

# Ethernet MAC address (6 bytes)
mac_binary = Random.bytes(6)
mac_hex = mac_binary.unpack1('H*')

run_bench("MAC address encode (6 bytes)", iterations) { mac_binary.unpack1('H*') }
run_bench("MAC address decode (12 hex)", iterations) { [mac_hex].pack('H*') }

# IPv6 address (16 bytes)
ipv6_binary = Random.bytes(16)
ipv6_hex = ipv6_binary.unpack1('H*')

run_bench("IPv6 encode (16 bytes)", iterations) { ipv6_binary.unpack1('H*') }
run_bench("IPv6 decode (32 hex)", iterations) { [ipv6_hex].pack('H*') }

# AES-128 key (16 bytes)
aes128_binary = Random.bytes(16)
aes128_hex = aes128_binary.unpack1('H*')

run_bench("AES-128 key encode (16 bytes)", iterations) { aes128_binary.unpack1('H*') }
run_bench("AES-128 key decode (32 hex)", iterations) { [aes128_hex].pack('H*') }

# AES-256 key (32 bytes)
aes256_binary = Random.bytes(32)
aes256_hex = aes256_binary.unpack1('H*')

run_bench("AES-256 key encode (32 bytes)", iterations) { aes256_binary.unpack1('H*') }
run_bench("AES-256 key decode (64 hex)", iterations) { [aes256_hex].pack('H*') }
puts

puts "-" * 70
puts "7.3 Typical payload sizes"
puts "-" * 70
puts

iterations = 50_000

# Small JSON-like payload
payload_256 = Random.bytes(256)
payload_256_hex = payload_256.unpack1('H*')

run_bench("Small payload encode (256 B)", iterations) { payload_256.unpack1('H*') }
run_bench("Small payload decode (512 hex)", iterations) { [payload_256_hex].pack('H*') }

# Medium payload (1KB)
payload_1k = Random.bytes(1024)
payload_1k_hex = payload_1k.unpack1('H*')

run_bench("Medium payload encode (1 KB)", iterations) { payload_1k.unpack1('H*') }
run_bench("Medium payload decode (2K hex)", iterations) { [payload_1k_hex].pack('H*') }

# Larger payload (4KB - typical page size)
payload_4k = Random.bytes(4096)
payload_4k_hex = payload_4k.unpack1('H*')

run_bench("Page-size payload encode (4 KB)", iterations) { payload_4k.unpack1('H*') }
run_bench("Page-size payload decode (8K hex)", iterations) { [payload_4k_hex].pack('H*') }

iterations = 10_000

# Large payload (64KB)
payload_64k = Random.bytes(65536)
payload_64k_hex = payload_64k.unpack1('H*')

run_bench("Large payload encode (64 KB)", iterations) { payload_64k.unpack1('H*') }
run_bench("Large payload decode (128K hex)", iterations) { [payload_64k_hex].pack('H*') }
puts

# =============================================================================
# SECTION 8: Round-trip Performance
# =============================================================================
puts "=" * 70
puts "SECTION 8: Round-trip Performance (encode then decode)"
puts "=" * 70
puts

puts "  %-15s %15s %15s" % ["Size", "Round-trip", "Throughput"]
puts "  " + "-" * 45

[16, 32, 64, 128, 256, 512, 1024, 4096, 16384, 65536].each do |size|
  data = test_data[size][:binary]
  iterations = iterations_for(size) / 2

  elapsed = measure(iterations) do
    hex = data.unpack1('H*')
    [hex].pack('H*')
  end

  total_bytes = size * iterations * 2
  puts "  %-15s %15s %15s" % [
    "#{size} B",
    format_ns(elapsed, iterations),
    format_rate(total_bytes, elapsed)
  ]
end
puts

# =============================================================================
# SECTION 9: Correctness Verification
# =============================================================================
puts "=" * 70
puts "SECTION 9: Correctness Verification"
puts "=" * 70
puts

errors = []

# Test encoding H format
[1, 2, 8, 15, 16, 17, 31, 32, 33, 64, 128, 256, 1024].each do |size|
  data = Random.bytes(size)
  hex_h = data.unpack1('H*')
  hex_expected = data.bytes.map { |b| "%02x" % b }.join

  if hex_h != hex_expected
    errors << "unpack('H*') failed for size #{size}: got #{hex_h[0,20]}..., expected #{hex_expected[0,20]}..."
  end
end

# Test encoding h format
[1, 2, 8, 15, 16, 17, 31, 32, 33, 64, 128, 256, 1024].each do |size|
  data = Random.bytes(size)
  hex_h = data.unpack1('h*')
  hex_expected = data.bytes.map { |b| "%x%x" % [b & 0xf, b >> 4] }.join

  if hex_h != hex_expected
    errors << "unpack('h*') failed for size #{size}"
  end
end

# Test decoding H format - round trip
[1, 2, 8, 15, 16, 17, 31, 32, 33, 64, 128, 256, 1024].each do |size|
  original = Random.bytes(size)
  hex = original.unpack1('H*')
  decoded = [hex].pack('H*')

  if decoded != original
    errors << "pack('H*') round-trip failed for size #{size}"
  end

  # Test uppercase
  decoded_upper = [hex.upcase].pack('H*')
  if decoded_upper != original
    errors << "pack('H*') uppercase failed for size #{size}"
  end

  # Test mixed case
  hex_mixed = hex.chars.map.with_index { |c, i| i.even? ? c.upcase : c }.join
  decoded_mixed = [hex_mixed].pack('H*')
  if decoded_mixed != original
    errors << "pack('H*') mixed case failed for size #{size}"
  end
end

# Test decoding h format - round trip
[1, 2, 8, 15, 16, 17, 31, 32, 33, 64, 128, 256, 1024].each do |size|
  original = Random.bytes(size)
  hex = original.unpack1('h*')
  decoded = [hex].pack('h*')

  if decoded != original
    errors << "pack('h*')/unpack('h*') round-trip failed for size #{size}"
  end
end

# Test partial format specifiers
data = Random.bytes(100)
[2, 4, 8, 16, 32, 64].each do |n|
  hex_partial = data.unpack1("H#{n}")
  if hex_partial.length != n
    errors << "unpack('H#{n}') returned wrong length: #{hex_partial.length}"
  end

  expected = data.bytes[0, (n+1)/2].map { |b| "%02x" % b }.join[0, n]
  if hex_partial != expected
    errors << "unpack('H#{n}') returned wrong value"
  end
end

# Test odd-length hex strings
[1, 3, 5, 7, 9].each do |len|
  hex = "a" * len
  decoded = [hex].pack('H*')
  expected_len = (len + 1) / 2
  if decoded.length != expected_len
    errors << "Odd-length hex (#{len}) produced wrong output length: #{decoded.length} vs #{expected_len}"
  end
end

# Test empty strings
if "".unpack1('H*') != ""
  errors << "Empty string encoding failed"
end
if [""].pack('H*') != ""
  errors << "Empty string decoding failed"
end

if errors.empty?
  puts "All #{13 * 6 + 6 * 2 + 5 + 2} correctness tests PASSED"  # 97 individual checks
else
  puts "FAILURES (#{errors.length}):"
  errors.each { |e| puts "  - #{e}" }
end
puts

puts "=" * 70
puts "Benchmark complete"
puts "=" * 70

@mensfeld (Contributor, Author) commented Dec 27, 2025

Side note: I am not a SIMD expert (yet hehe 😅), but I am willing to add ARM NEON support in the same PR so it works on both x86 and ARM, if the team is ok with doubling this code for ARM.

Side note 2: if this type of work is accepted, I can submit several other SIMD-related PRs for operations on other core classes.

It's a draft for now so I can fix linting and any other issues it may have.

@mensfeld (Contributor, Author) commented Dec 27, 2025

Note for myself before I forget:

| x86 SSE/SSSE3 | ARM NEON | Purpose |
|---|---|---|
| _mm_loadu_si128 | vld1q_u8 | Load 16 bytes |
| _mm_shuffle_epi8 (pshufb) | vqtbl1q_u8 | Table lookup |
| _mm_and_si128 | vandq_u8 | Bitwise AND |
| _mm_srli_epi16 | vshrq_n_u8 | Shift right |
| _mm_blendv_epi8 | vbslq_u8 | Blend/select |
| _mm_storeu_si128 | vst1q_u8 | Store 16 bytes |
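
As a sanity check of that mapping, an AArch64 NEON version of the encode step might look roughly like this (an untested sketch built from the table above, not committed code):

```c
/* Hypothetical NEON counterpart of the SSSE3 encode step; sketch only. */
#include <arm_neon.h>

static void hex_encode_16_neon(const unsigned char *src, char *dst)
{
    static const uint8_t lut_bytes[17] = "0123456789abcdef";
    uint8x16_t lut = vld1q_u8(lut_bytes);              /* table lookup source  */
    uint8x16_t in  = vld1q_u8(src);                    /* load 16 input bytes  */
    uint8x16_t hi  = vshrq_n_u8(in, 4);                /* high nibbles (0..15) */
    uint8x16_t lo  = vandq_u8(in, vdupq_n_u8(0x0f));   /* low nibbles          */

    uint8x16_t hi_c = vqtbl1q_u8(lut, hi);             /* nibble -> hex char   */
    uint8x16_t lo_c = vqtbl1q_u8(lut, lo);

    /* interleave so the high-nibble char precedes the low-nibble char ('H*') */
    uint8x16x2_t pair = vzipq_u8(hi_c, lo_c);
    vst1q_u8((uint8_t *)dst,      pair.val[0]);        /* first 16 hex chars   */
    vst1q_u8((uint8_t *)dst + 16, pair.val[1]);        /* next 16 hex chars    */
}
```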

Use SSSE3 pshufb for parallel nibble-to-hex conversion and SSE4.1
blendv for hex-to-nibble conversion. Also eliminate per-byte
rb_str_buf_cat() calls in pack by pre-allocating and writing
directly to the output buffer.

Performance improvements (A/B benchmark):

unpack('H*') - bytes to hex:
- 64 bytes: 1.4x faster
- 256 bytes: 1.7x faster
- 1KB: 2.2x faster
- 4KB: 2.3x faster
- 64KB: 2.4x faster

pack('H*') - hex to bytes:
- 64 bytes: 4.8x faster
- 256 bytes: 10.3x faster
- 1KB: 14.5x faster
- 4KB: 15.4x faster
- 64KB: 28x faster

The pack decoding improvement is especially dramatic because the
original code called rb_str_buf_cat() per byte, while the new code
pre-allocates the output buffer and writes directly.
mensfeld force-pushed the optimize-pack-hex-unified branch from 37f6ebf to 42ff3dd on December 27, 2025 15:20
mensfeld changed the title from "Optimize pack H/h with SIMD hex decoding" to "Optimize pack with SIMD hex decoding" on Dec 27, 2025
mensfeld marked this pull request as ready for review on December 27, 2025 18:22
@mensfeld (Contributor, Author)

Last side note: I would be happy to take over maintenance of the SIMD code in Ruby if there is a will to merge such optimizations.

@ahorek (Contributor) commented Dec 28, 2025

Prebuilt Ruby packages are typically not compiled with -msse4 or -mavx flags, so runtime feature detection (via CPUID) is necessary. I’d be happy to help add this if there’s interest in using SIMD optimizations in general.
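
One common shape for that runtime dispatch is to probe CPUID once via GCC/Clang builtins and route between SIMD and scalar paths, with the SIMD functions carrying `__attribute__((target("ssse3,sse4.1")))` so the rest of the build needs no extra -m flags. The sketch below is illustrative only; `hex_simd_usable()` and the caller names are hypothetical.

```c
/* Illustrative runtime-dispatch sketch, not code from this PR. */
#include <stdbool.h>

static bool hex_simd_usable(void)
{
#if defined(__x86_64__) || defined(__i386__)
    static int cached = -1;            /* -1 = not probed yet */
    if (cached < 0) {
        __builtin_cpu_init();          /* GCC/Clang CPUID probe */
        cached = __builtin_cpu_supports("ssse3")
              && __builtin_cpu_supports("sse4.1");
    }
    return cached != 0;
#else
    return false;                      /* non-x86: fall back to scalar */
#endif
}

/* Callers would then do something like:
 *   if (hex_simd_usable()) pack_hex_simd(...); else pack_hex_scalar(...);
 * where pack_hex_simd/pack_hex_scalar are again hypothetical names. */
```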

@rhenium (Member) commented Dec 28, 2025

Related discussion on https://bugs.ruby-lang.org/issues/16487

> The original code called rb_str_buf_cat() per byte, while the new code pre-allocates the output buffer and writes directly.

How much of the speedup comes from eliminating rb_str_buf_cat() on each byte? That's a clear improvement we should apply first, before looking into whether SIMD instructions can make further gains that are worth the maintenance cost.

It seems other pack templates could probably also use similar optimizations.

mensfeld force-pushed the optimize-pack-hex-unified branch from 9326533 to 42ff3dd on December 28, 2025 21:34
@mensfeld (Contributor, Author)

@rhenium I will extract it from this and look into the rest of the pack templates in the upcoming days.
