Conversation

@mensfeld (Contributor)

Use SSSE3 pshufb for parallel nibble-to-hex conversion and SSE4.1 blendv for hex-to-nibble conversion. Also eliminate per-byte rb_str_buf_cat() calls in pack by pre-allocating and writing directly to the output buffer.
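A minimal sketch of the encode-side idea described above (illustrative only, not the actual pack.c change): each nibble of the input becomes an index into a 16-byte "0123456789abcdef" table, so `_mm_shuffle_epi8` converts 16 nibbles to ASCII per instruction.

```c
/* Minimal sketch of the SSSE3 nibble-to-hex idea; illustrative only,
 * not the code from this PR. Converts 16 input bytes to 32 hex chars. */
#include <emmintrin.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pshufb) */

static void hex_encode_16(const unsigned char *src, char *dst)
{
    const __m128i lut  = _mm_loadu_si128((const __m128i *)"0123456789abcdef");
    const __m128i mask = _mm_set1_epi8(0x0f);

    __m128i in = _mm_loadu_si128((const __m128i *)src);
    __m128i hi = _mm_and_si128(_mm_srli_epi16(in, 4), mask);  /* high nibbles */
    __m128i lo = _mm_and_si128(in, mask);                     /* low nibbles  */

    /* pshufb: each nibble value selects a character from the lookup table */
    __m128i hi_chars = _mm_shuffle_epi8(lut, hi);
    __m128i lo_chars = _mm_shuffle_epi8(lut, lo);

    /* 'H*' emits the high nibble first, so interleave hi/lo and store 32 chars */
    _mm_storeu_si128((__m128i *)dst,        _mm_unpacklo_epi8(hi_chars, lo_chars));
    _mm_storeu_si128((__m128i *)(dst + 16), _mm_unpackhi_epi8(hi_chars, lo_chars));
}
```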

Performance Comparison Summary

DECODING (pack 'H*') - Major Improvements

| Size | Master | Optimized | Speedup |
|---|---|---|---|
| 16 B | 183 ns | 93 ns | 2.0x |
| 32 B | 303 ns | 115 ns | 2.6x |
| 64 B | 536 ns | 151 ns | 3.5x |
| 128 B | 974 ns | 240 ns | 4.1x |
| 256 B | 1.88 µs | 432 ns | 4.4x |
| 512 B | 3.68 µs | 863 ns | 4.3x |
| 1024 B | 7.79 µs | 1.72 µs | 4.5x |
| 4096 B | 28.0 µs | 6.65 µs | 4.2x |
| 16384 B | 159 µs | 26.4 µs | 6.0x |
| 65536 B | 675 µs | 323 µs | 2.1x |

Throughput Comparison (Decoding)

| Size | Master | Optimized |
|---|---|---|
| 64 B | 119 MB/s | 425 MB/s |
| 256 B | 136 MB/s | 593 MB/s |
| 1024 B | 131 MB/s | 597 MB/s |
| 4096 B | 146 MB/s | 616 MB/s |
| 16384 B | 103 MB/s | 621 MB/s |

ENCODING (unpack 'H*') - No Regression

Encoding performance is essentially unchanged between versions (already efficient in master).

| Size | Master | Optimized |
|---|---|---|
| 64 B | 135 ns | 133 ns |
| 256 B | 379 ns | 371 ns |
| 1024 B | 1.73 µs | 1.73 µs |
| 4096 B | 6.70 µs | 6.69 µs |

The pack decoding improvement is especially large because the original code called rb_str_buf_cat() per byte, while the new code pre-allocates the output buffer and writes directly.
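To show that buffer-handling change in isolation (no SIMD), here is a rough sketch; `unhex()`, `decode_slow()`, and `decode_fast()` are hypothetical names for illustration, not functions from pack.c.

```c
/* Sketch of the buffer-handling change only, without SIMD; the helper and
 * function names here are made up for illustration. */
#include "ruby.h"

static inline int unhex(unsigned char c)
{
    return c <= '9' ? c - '0' : (c | 0x20) - 'a' + 10;  /* assumes valid hex */
}

/* Old pattern: grow the string one byte at a time. */
static void decode_slow(VALUE out, const char *hex, long nbytes)
{
    for (long i = 0; i < nbytes; i++) {
        char byte = (char)((unhex(hex[2 * i]) << 4) | unhex(hex[2 * i + 1]));
        rb_str_buf_cat(out, &byte, 1);  /* per-byte append, per-byte bookkeeping */
    }
}

/* New pattern: allocate the full result once, then write through the pointer. */
static VALUE decode_fast(const char *hex, long nbytes)
{
    VALUE out = rb_str_new(NULL, nbytes);  /* pre-sized String of nbytes */
    char *dst = RSTRING_PTR(out);
    for (long i = 0; i < nbytes; i++) {
        dst[i] = (char)((unhex(hex[2 * i]) << 4) | unhex(hex[2 * i + 1]));
    }
    return out;
}
```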

Real-World Scenarios

Common Data Formats

| Scenario | Master | Optimized | Speedup |
|---|---|---|---|
| UUID decode (32 hex) | 204 ns | 106 ns | 1.9x |
| MD5 decode (32 hex) | 201 ns | 106 ns | 1.9x |
| SHA-1 decode (40 hex) | 229 ns | 111 ns | 2.1x |
| SHA-256 decode (64 hex) | 311 ns | 127 ns | 2.4x |
| SHA-512 decode (128 hex) | 560 ns | 159 ns | 3.5x |

Network/Protocol Data

| Scenario | Master | Optimized | Speedup |
|---|---|---|---|
| MAC address decode (12 hex) | 117 ns | 85 ns | 1.4x |
| IPv6 decode (32 hex) | 201 ns | 115 ns | 1.7x |
| AES-128 key decode (32 hex) | 194 ns | 106 ns | 1.8x |
| AES-256 key decode (64 hex) | 306 ns | 125 ns | 2.4x |

Typical Payload Sizes

| Scenario | Master | Optimized | Speedup |
|---|---|---|---|
| Small payload decode (512 hex) | 1.76 µs | 392 ns | 4.5x |
| Medium payload decode (2K hex) | 7.37 µs | 1.49 µs | 4.9x |
| Page-size payload decode (8K hex) | 27.3 µs | 6.04 µs | 4.5x |
| Large payload decode (128K hex) | 672 µs | 324 µs | 2.1x |

Round-trip Performance (encode then decode)

| Size | Master | Optimized | Speedup |
|---|---|---|---|
| 16 B | 245 ns | 152 ns | 1.6x |
| 64 B | 620 ns | 260 ns | 2.4x |
| 256 B | 2.16 µs | 721 ns | 3.0x |
| 1024 B | 7.94 µs | 2.37 µs | 3.4x |
| 4096 B | 31.9 µs | 10.1 µs | 3.2x |
| 16384 B | 166 µs | 43.7 µs | 3.8x |
| 65536 B | 751 µs | 414 µs | 1.8x |

Edge Cases

SIMD Boundary Cases (Decoding)

The SIMD implementation processes 32 hex characters (16 bytes output) at a time.

| Hex Length | Master | Optimized | Speedup |
|---|---|---|---|
| 28 chars | 160 ns | 95 ns | 1.7x |
| 32 chars | 190 ns | 107 ns | 1.8x |
| 64 chars | 312 ns | 126 ns | 2.5x |
| 96 chars | 379 ns | 152 ns | 2.5x |
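
Zooming in on the hex-to-nibble step those 32-character lanes rely on, a rough sketch of the SSE4.1 blendv selection looks like this (assumes already-validated input, omits packing the nibbles into output bytes, and is not the PR's exact code):

```c
/* Rough sketch of converting 16 ASCII hex chars to nibble values with
 * SSE4.1 blendv; validation and nibble packing are omitted. */
#include <smmintrin.h>  /* SSE4.1: _mm_blendv_epi8 (pulls in the SSE2/SSSE3 headers) */

static __m128i hex_to_nibbles_16(__m128i chars)
{
    __m128i lower   = _mm_or_si128(chars, _mm_set1_epi8(0x20));      /* fold case: 'A' -> 'a' */
    __m128i digits  = _mm_sub_epi8(chars, _mm_set1_epi8('0'));       /* '0'..'9' -> 0..9      */
    __m128i letters = _mm_sub_epi8(lower, _mm_set1_epi8('a' - 10));  /* 'a'..'f' -> 10..15    */

    /* 0xff wherever the char is above '9', i.e. a letter: pick `letters` there */
    __m128i is_alpha = _mm_cmpgt_epi8(chars, _mm_set1_epi8('9'));
    return _mm_blendv_epi8(digits, letters, is_alpha);
}
```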

Odd-length Hex Strings

| Hex Length | Master | Optimized | Speedup |
|---|---|---|---|
| 15 chars | 126 ns | 87 ns | 1.4x |
| 31 chars | 193 ns | 107 ns | 1.8x |
| 63 chars | 301 ns | 137 ns | 2.2x |
| 65 chars | 295 ns | 131 ns | 2.3x |

Case Sensitivity (Decoding 4096 B)

| Input Case | Master | Optimized |
|---|---|---|
| lowercase | 30.0 µs | 6.74 µs |
| UPPERCASE | 27.1 µs | 5.51 µs |
| MiXeD | 27.1 µs | 5.51 µs |

Benchmark Script

#!/usr/bin/env ruby
# frozen_string_literal: true

# Comprehensive Benchmark for pack/unpack hex operations (H/h format specifiers)
# Tests SIMD-optimized hex encoding/decoding performance
#
# Usage:
#   ruby benchmark_pack_hex.rb

def measure(iterations)
  GC.start
  GC.disable
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  GC.enable
  elapsed
end

def format_rate(bytes, elapsed)
  mb_per_sec = (bytes / elapsed) / 1_000_000.0
  "%.1f MB/s" % mb_per_sec
end

def format_ns(elapsed, iterations)
  ns = (elapsed * 1_000_000_000) / iterations
  if ns >= 1000
    "%.2f us" % (ns / 1000)
  else
    "%.1f ns" % ns
  end
end

def run_bench(label, iterations)
  elapsed = measure(iterations) { yield }
  ns = (elapsed * 1_000_000_000) / iterations
  puts "  %-45s %12s" % [label, format_ns(elapsed, iterations)]
  ns
end

puts "=" * 70
puts "Comprehensive Pack/Unpack Hex Benchmark (H/h format specifiers)"
puts "=" * 70
puts
puts "Ruby: #{RUBY_DESCRIPTION}"
puts "Time: #{Time.now}"
puts "PID: #{$$}"
puts

# Warm up CPU
10_000.times { "hello".unpack1('H*') }
10_000.times { ["68656c6c6f"].pack('H*') }

# =============================================================================
# SECTION 1: Core Performance - Various Sizes
# =============================================================================
puts "=" * 70
puts "SECTION 1: Core Performance Scaling"
puts "=" * 70
puts

# Test sizes including SIMD boundary points
# Encoding SIMD processes 16 bytes at a time
# Decoding SIMD processes 32 hex chars (16 bytes output) at a time
SIZES = [1, 2, 4, 8, 15, 16, 17, 31, 32, 33, 48, 64, 96, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]

def iterations_for(size)
  case size
  when 0..32 then 500_000
  when 33..128 then 300_000
  when 129..512 then 150_000
  when 513..2048 then 75_000
  when 2049..8192 then 30_000
  when 8193..32768 then 10_000
  else 3_000
  end
end

# Pre-generate test data
test_data = {}
SIZES.each do |size|
  binary = Random.bytes(size)
  hex_lower = binary.unpack1('H*')
  test_data[size] = {
    binary: binary,
    hex_lower: hex_lower,
    hex_upper: hex_lower.upcase,
    hex_mixed: hex_lower.chars.map.with_index { |c, i| i.even? ? c.upcase : c }.join
  }
end

puts "-" * 70
puts "1.1 ENCODING: unpack('H*') - Binary to Hex"
puts "-" * 70
puts
puts "  %-12s %12s %12s %15s" % ["Size", "Time/op", "Throughput", "Iterations"]
puts "  " + "-" * 55

SIZES.each do |size|
  data = test_data[size][:binary]
  iterations = iterations_for(size)
  elapsed = measure(iterations) { data.unpack1('H*') }
  total_bytes = size * iterations
  puts "  %-12s %12s %12s %15d" % [
    "#{size} B",
    format_ns(elapsed, iterations),
    format_rate(total_bytes, elapsed),
    iterations
  ]
end
puts

puts "-" * 70
puts "1.2 DECODING: [hex].pack('H*') - Hex to Binary"
puts "-" * 70
puts
puts "  %-12s %12s %12s %15s" % ["Size", "Time/op", "Throughput", "Iterations"]
puts "  " + "-" * 55

SIZES.each do |size|
  hex = test_data[size][:hex_lower]
  iterations = iterations_for(size)
  elapsed = measure(iterations) { [hex].pack('H*') }
  total_bytes = size * iterations
  puts "  %-12s %12s %12s %15d" % [
    "#{size} B",
    format_ns(elapsed, iterations),
    format_rate(total_bytes, elapsed),
    iterations
  ]
end
puts

# =============================================================================
# SECTION 2: Format Variations (H vs h)
# =============================================================================
puts "=" * 70
puts "SECTION 2: Format Variations (H vs h)"
puts "=" * 70
puts

test_sizes = [16, 64, 256, 1024, 4096]

puts "-" * 70
puts "2.1 Encoding: H (high nibble first) vs h (low nibble first)"
puts "-" * 70
puts
puts "  %-20s %15s %15s" % ["Size", "unpack('H*')", "unpack('h*')"]
puts "  " + "-" * 50

test_sizes.each do |size|
  data = test_data[size][:binary]
  iterations = iterations_for(size)

  elapsed_H = measure(iterations) { data.unpack1('H*') }
  elapsed_h = measure(iterations) { data.unpack1('h*') }

  puts "  %-20s %15s %15s" % [
    "#{size} B",
    format_ns(elapsed_H, iterations),
    format_ns(elapsed_h, iterations)
  ]
end
puts

puts "-" * 70
puts "2.2 Decoding: H (high nibble first) vs h (low nibble first)"
puts "-" * 70
puts
puts "  %-20s %15s %15s" % ["Size", "pack('H*')", "pack('h*')"]
puts "  " + "-" * 50

test_sizes.each do |size|
  hex = test_data[size][:hex_lower]
  iterations = iterations_for(size)

  elapsed_H = measure(iterations) { [hex].pack('H*') }
  elapsed_h = measure(iterations) { [hex].pack('h*') }

  puts "  %-20s %15s %15s" % [
    "#{size} B",
    format_ns(elapsed_H, iterations),
    format_ns(elapsed_h, iterations)
  ]
end
puts

# =============================================================================
# SECTION 3: Input Case Sensitivity (Decoding)
# =============================================================================
puts "=" * 70
puts "SECTION 3: Hex Input Case Sensitivity (Decoding)"
puts "=" * 70
puts
puts "  %-15s %15s %15s %15s" % ["Size", "lowercase", "UPPERCASE", "MiXeD"]
puts "  " + "-" * 60

test_sizes.each do |size|
  iterations = iterations_for(size)

  elapsed_lower = measure(iterations) { [test_data[size][:hex_lower]].pack('H*') }
  elapsed_upper = measure(iterations) { [test_data[size][:hex_upper]].pack('H*') }
  elapsed_mixed = measure(iterations) { [test_data[size][:hex_mixed]].pack('H*') }

  puts "  %-15s %15s %15s %15s" % [
    "#{size} B",
    format_ns(elapsed_lower, iterations),
    format_ns(elapsed_upper, iterations),
    format_ns(elapsed_mixed, iterations)
  ]
end
puts

# =============================================================================
# SECTION 4: Partial Format Specifiers
# =============================================================================
puts "=" * 70
puts "SECTION 4: Partial Format Specifiers (H4, H8, H16, etc.)"
puts "=" * 70
puts

data_1024 = test_data[1024][:binary]
hex_2048 = test_data[1024][:hex_lower]  # 2048 hex chars
iterations = 200_000

puts "-" * 70
puts "4.1 Partial Encoding: Extract first N hex characters"
puts "-" * 70
puts

[4, 8, 16, 32, 64, 128, 256, 512].each do |n|
  run_bench("unpack('H#{n}') from 1024 bytes", iterations) { data_1024.unpack1("H#{n}") }
end
puts

puts "-" * 70
puts "4.2 Partial Decoding: Pack first N hex characters"
puts "-" * 70
puts

[4, 8, 16, 32, 64, 128, 256, 512].each do |n|
  hex_n = hex_2048[0, n]
  run_bench("['#{n} hex chars'].pack('H#{n}')", iterations) { [hex_n].pack("H#{n}") }
end
puts

# =============================================================================
# SECTION 5: Special Hex Patterns
# =============================================================================
puts "=" * 70
puts "SECTION 5: Special Hex Patterns (Decoding)"
puts "=" * 70
puts

iterations = 100_000
size = 1024

puts "-" * 70
puts "5.1 Homogeneous patterns (#{size} bytes output)"
puts "-" * 70
puts

hex_zeros = "00" * size
hex_ones = "ff" * size
hex_alternating = "0f" * size
hex_sequential = (0...size).map { |i| "%02x" % (i & 0xff) }.join

run_bench("All zeros (00)", iterations) { [hex_zeros].pack('H*') }
run_bench("All ones (ff)", iterations) { [hex_ones].pack('H*') }
run_bench("Alternating (0f)", iterations) { [hex_alternating].pack('H*') }
run_bench("Sequential (00,01,02...)", iterations) { [hex_sequential].pack('H*') }
puts

puts "-" * 70
puts "5.2 Digit-heavy vs Letter-heavy hex"
puts "-" * 70
puts

# Digits only: 0-9 (no a-f)
hex_digits_only = "0123456789" * (size * 2 / 10 + 1)
hex_digits_only = hex_digits_only[0, size * 2]

# Letters heavy: mostly a-f
hex_letters_heavy = "abcdef" * (size * 2 / 6 + 1)
hex_letters_heavy = hex_letters_heavy[0, size * 2]

run_bench("Digits only (0-9)", iterations) { [hex_digits_only].pack('H*') }
run_bench("Letters heavy (a-f)", iterations) { [hex_letters_heavy].pack('H*') }
puts

# =============================================================================
# SECTION 6: Edge Cases
# =============================================================================
puts "=" * 70
puts "SECTION 6: Edge Cases"
puts "=" * 70
puts

puts "-" * 70
puts "6.1 Empty and minimal inputs"
puts "-" * 70
puts

iterations = 500_000

run_bench("Empty string encode: ''.unpack('H*')", iterations) { "".unpack1('H*') }
run_bench("Empty string decode: [''].pack('H*')", iterations) { [""].pack('H*') }
run_bench("Single byte encode: 1 byte", iterations) { "\x42".unpack1('H*') }
run_bench("Single byte decode: '42'", iterations) { ["42"].pack('H*') }
run_bench("Two bytes encode", iterations) { "\x42\x43".unpack1('H*') }
run_bench("Two bytes decode: '4243'", iterations) { ["4243"].pack('H*') }
puts

puts "-" * 70
puts "6.2 Odd-length hex strings (last nibble handling)"
puts "-" * 70
puts

iterations = 200_000

# Odd length hex strings - the last character is a half-byte
[1, 3, 5, 7, 15, 17, 31, 33, 63, 65].each do |len|
  hex = "a" * len
  run_bench("Decode #{len} hex chars (odd)", iterations) { [hex].pack('H*') }
end
puts

puts "-" * 70
puts "6.3 SIMD boundary cases (16 bytes = 32 hex chars threshold)"
puts "-" * 70
puts

iterations = 300_000

# Just below, at, and above SIMD thresholds
[14, 15, 16, 17, 18, 30, 31, 32, 33, 34, 46, 47, 48, 49, 50].each do |size|
  data = Random.bytes(size)
  run_bench("Encode #{size} bytes", iterations) { data.unpack1('H*') }
end
puts

[28, 30, 31, 32, 33, 34, 62, 63, 64, 65, 66, 94, 95, 96, 97, 98].each do |hex_len|
  hex = "a" * hex_len
  run_bench("Decode #{hex_len} hex chars", iterations) { [hex].pack('H*') }
end
puts

# =============================================================================
# SECTION 7: Real-World Scenarios
# =============================================================================
puts "=" * 70
puts "SECTION 7: Real-World Scenarios"
puts "=" * 70
puts

iterations = 200_000

puts "-" * 70
puts "7.1 Common data formats"
puts "-" * 70
puts

# UUID (16 bytes = 32 hex chars)
uuid_binary = Random.bytes(16)
uuid_hex = uuid_binary.unpack1('H*')

run_bench("UUID encode (16 bytes)", iterations) { uuid_binary.unpack1('H*') }
run_bench("UUID decode (32 hex chars)", iterations) { [uuid_hex].pack('H*') }

# MD5 hash (16 bytes = 32 hex chars)
md5_binary = Random.bytes(16)
md5_hex = md5_binary.unpack1('H*')

run_bench("MD5 encode (16 bytes)", iterations) { md5_binary.unpack1('H*') }
run_bench("MD5 decode (32 hex chars)", iterations) { [md5_hex].pack('H*') }

# SHA-1 hash (20 bytes = 40 hex chars)
sha1_binary = Random.bytes(20)
sha1_hex = sha1_binary.unpack1('H*')

run_bench("SHA-1 encode (20 bytes)", iterations) { sha1_binary.unpack1('H*') }
run_bench("SHA-1 decode (40 hex chars)", iterations) { [sha1_hex].pack('H*') }

# SHA-256 hash (32 bytes = 64 hex chars)
sha256_binary = Random.bytes(32)
sha256_hex = sha256_binary.unpack1('H*')

run_bench("SHA-256 encode (32 bytes)", iterations) { sha256_binary.unpack1('H*') }
run_bench("SHA-256 decode (64 hex chars)", iterations) { [sha256_hex].pack('H*') }

# SHA-512 hash (64 bytes = 128 hex chars)
sha512_binary = Random.bytes(64)
sha512_hex = sha512_binary.unpack1('H*')

run_bench("SHA-512 encode (64 bytes)", iterations) { sha512_binary.unpack1('H*') }
run_bench("SHA-512 decode (128 hex chars)", iterations) { [sha512_hex].pack('H*') }
puts

puts "-" * 70
puts "7.2 Network/Protocol data sizes"
puts "-" * 70
puts

# Ethernet MAC address (6 bytes)
mac_binary = Random.bytes(6)
mac_hex = mac_binary.unpack1('H*')

run_bench("MAC address encode (6 bytes)", iterations) { mac_binary.unpack1('H*') }
run_bench("MAC address decode (12 hex)", iterations) { [mac_hex].pack('H*') }

# IPv6 address (16 bytes)
ipv6_binary = Random.bytes(16)
ipv6_hex = ipv6_binary.unpack1('H*')

run_bench("IPv6 encode (16 bytes)", iterations) { ipv6_binary.unpack1('H*') }
run_bench("IPv6 decode (32 hex)", iterations) { [ipv6_hex].pack('H*') }

# AES-128 key (16 bytes)
aes128_binary = Random.bytes(16)
aes128_hex = aes128_binary.unpack1('H*')

run_bench("AES-128 key encode (16 bytes)", iterations) { aes128_binary.unpack1('H*') }
run_bench("AES-128 key decode (32 hex)", iterations) { [aes128_hex].pack('H*') }

# AES-256 key (32 bytes)
aes256_binary = Random.bytes(32)
aes256_hex = aes256_binary.unpack1('H*')

run_bench("AES-256 key encode (32 bytes)", iterations) { aes256_binary.unpack1('H*') }
run_bench("AES-256 key decode (64 hex)", iterations) { [aes256_hex].pack('H*') }
puts

puts "-" * 70
puts "7.3 Typical payload sizes"
puts "-" * 70
puts

iterations = 50_000

# Small JSON-like payload
payload_256 = Random.bytes(256)
payload_256_hex = payload_256.unpack1('H*')

run_bench("Small payload encode (256 B)", iterations) { payload_256.unpack1('H*') }
run_bench("Small payload decode (512 hex)", iterations) { [payload_256_hex].pack('H*') }

# Medium payload (1KB)
payload_1k = Random.bytes(1024)
payload_1k_hex = payload_1k.unpack1('H*')

run_bench("Medium payload encode (1 KB)", iterations) { payload_1k.unpack1('H*') }
run_bench("Medium payload decode (2K hex)", iterations) { [payload_1k_hex].pack('H*') }

# Larger payload (4KB - typical page size)
payload_4k = Random.bytes(4096)
payload_4k_hex = payload_4k.unpack1('H*')

run_bench("Page-size payload encode (4 KB)", iterations) { payload_4k.unpack1('H*') }
run_bench("Page-size payload decode (8K hex)", iterations) { [payload_4k_hex].pack('H*') }

iterations = 10_000

# Large payload (64KB)
payload_64k = Random.bytes(65536)
payload_64k_hex = payload_64k.unpack1('H*')

run_bench("Large payload encode (64 KB)", iterations) { payload_64k.unpack1('H*') }
run_bench("Large payload decode (128K hex)", iterations) { [payload_64k_hex].pack('H*') }
puts

# =============================================================================
# SECTION 8: Round-trip Performance
# =============================================================================
puts "=" * 70
puts "SECTION 8: Round-trip Performance (encode then decode)"
puts "=" * 70
puts

puts "  %-15s %15s %15s" % ["Size", "Round-trip", "Throughput"]
puts "  " + "-" * 45

[16, 32, 64, 128, 256, 512, 1024, 4096, 16384, 65536].each do |size|
  data = test_data[size][:binary]
  iterations = iterations_for(size) / 2

  elapsed = measure(iterations) do
    hex = data.unpack1('H*')
    [hex].pack('H*')
  end

  total_bytes = size * iterations * 2
  puts "  %-15s %15s %15s" % [
    "#{size} B",
    format_ns(elapsed, iterations),
    format_rate(total_bytes, elapsed)
  ]
end
puts

# =============================================================================
# SECTION 9: Correctness Verification
# =============================================================================
puts "=" * 70
puts "SECTION 9: Correctness Verification"
puts "=" * 70
puts

errors = []

# Test encoding H format
[1, 2, 8, 15, 16, 17, 31, 32, 33, 64, 128, 256, 1024].each do |size|
  data = Random.bytes(size)
  hex_h = data.unpack1('H*')
  hex_expected = data.bytes.map { |b| "%02x" % b }.join

  if hex_h != hex_expected
    errors << "unpack('H*') failed for size #{size}: got #{hex_h[0,20]}..., expected #{hex_expected[0,20]}..."
  end
end

# Test encoding h format
[1, 2, 8, 15, 16, 17, 31, 32, 33, 64, 128, 256, 1024].each do |size|
  data = Random.bytes(size)
  hex_h = data.unpack1('h*')
  hex_expected = data.bytes.map { |b| "%x%x" % [b & 0xf, b >> 4] }.join

  if hex_h != hex_expected
    errors << "unpack('h*') failed for size #{size}"
  end
end

# Test decoding H format - round trip
[1, 2, 8, 15, 16, 17, 31, 32, 33, 64, 128, 256, 1024].each do |size|
  original = Random.bytes(size)
  hex = original.unpack1('H*')
  decoded = [hex].pack('H*')

  if decoded != original
    errors << "pack('H*') round-trip failed for size #{size}"
  end

  # Test uppercase
  decoded_upper = [hex.upcase].pack('H*')
  if decoded_upper != original
    errors << "pack('H*') uppercase failed for size #{size}"
  end

  # Test mixed case
  hex_mixed = hex.chars.map.with_index { |c, i| i.even? ? c.upcase : c }.join
  decoded_mixed = [hex_mixed].pack('H*')
  if decoded_mixed != original
    errors << "pack('H*') mixed case failed for size #{size}"
  end
end

# Test decoding h format - round trip
[1, 2, 8, 15, 16, 17, 31, 32, 33, 64, 128, 256, 1024].each do |size|
  original = Random.bytes(size)
  hex = original.unpack1('h*')
  decoded = [hex].pack('h*')

  if decoded != original
    errors << "pack('h*')/unpack('h*') round-trip failed for size #{size}"
  end
end

# Test partial format specifiers
data = Random.bytes(100)
[2, 4, 8, 16, 32, 64].each do |n|
  hex_partial = data.unpack1("H#{n}")
  if hex_partial.length != n
    errors << "unpack('H#{n}') returned wrong length: #{hex_partial.length}"
  end

  expected = data.bytes[0, (n+1)/2].map { |b| "%02x" % b }.join[0, n]
  if hex_partial != expected
    errors << "unpack('H#{n}') returned wrong value"
  end
end

# Test odd-length hex strings
[1, 3, 5, 7, 9].each do |len|
  hex = "a" * len
  decoded = [hex].pack('H*')
  expected_len = (len + 1) / 2
  if decoded.length != expected_len
    errors << "Odd-length hex (#{len}) produced wrong output length: #{decoded.length} vs #{expected_len}"
  end
end

# Test empty strings
if "".unpack1('H*') != ""
  errors << "Empty string encoding failed"
end
if [""].pack('H*') != ""
  errors << "Empty string decoding failed"
end

if errors.empty?
  puts "All #{13 * 6 + 6 * 2 + 5 + 2} correctness tests PASSED"  # 97 individual checks
else
  puts "FAILURES (#{errors.length}):"
  errors.each { |e| puts "  - #{e}" }
end
puts

puts "=" * 70
puts "Benchmark complete"
puts "=" * 70

@mensfeld (Contributor, Author) commented Dec 27, 2025

Side note: I am not a SIMD expert (yet hehe 😅), but I am willing to add ARM NEON support in the same PR so it works on both x86 and ARM, if the team is ok with doubling this code for ARM.

Side note 2: if this type of work is accepted, I can submit several other SIMD-related PRs for operations on other core classes.

It's a draft for now so I can fix linting and any other issues it may have.

@mensfeld (Contributor, Author) commented Dec 27, 2025

Note for myself before I forget:

| x86 SSE/SSSE3 | ARM NEON | Purpose |
|---|---|---|
| _mm_loadu_si128 | vld1q_u8 | Load 16 bytes |
| _mm_shuffle_epi8 (pshufb) | vqtbl1q_u8 | Table lookup |
| _mm_and_si128 | vandq_u8 | Bitwise AND |
| _mm_srli_epi16 | vshrq_n_u8 | Shift right |
| _mm_blendv_epi8 | vbslq_u8 | Blend/select |
| _mm_storeu_si128 | vst1q_u8 | Store 16 bytes |
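
As a sanity check of that mapping, an AArch64 NEON version of the encode step might look roughly like this (an untested sketch built from the table above, not committed code):

```c
/* Hypothetical NEON counterpart of the SSSE3 encode step; sketch only. */
#include <arm_neon.h>

static void hex_encode_16_neon(const unsigned char *src, char *dst)
{
    static const uint8_t lut_bytes[17] = "0123456789abcdef";
    uint8x16_t lut = vld1q_u8(lut_bytes);              /* table lookup source  */
    uint8x16_t in  = vld1q_u8(src);                    /* load 16 input bytes  */
    uint8x16_t hi  = vshrq_n_u8(in, 4);                /* high nibbles (0..15) */
    uint8x16_t lo  = vandq_u8(in, vdupq_n_u8(0x0f));   /* low nibbles          */

    uint8x16_t hi_c = vqtbl1q_u8(lut, hi);             /* nibble -> hex char   */
    uint8x16_t lo_c = vqtbl1q_u8(lut, lo);

    /* interleave so the high-nibble char precedes the low-nibble char ('H*') */
    uint8x16x2_t pair = vzipq_u8(hi_c, lo_c);
    vst1q_u8((uint8_t *)dst,      pair.val[0]);        /* first 16 hex chars   */
    vst1q_u8((uint8_t *)dst + 16, pair.val[1]);        /* next 16 hex chars    */
}
```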

Use SSSE3 pshufb for parallel nibble-to-hex conversion and SSE4.1
blendv for hex-to-nibble conversion. Also eliminate per-byte
rb_str_buf_cat() calls in pack by pre-allocating and writing
directly to the output buffer.

Performance improvements (A/B benchmark):

unpack('H*') - bytes to hex:
- 64 bytes: 1.4x faster
- 256 bytes: 1.7x faster
- 1KB: 2.2x faster
- 4KB: 2.3x faster
- 64KB: 2.4x faster

pack('H*') - hex to bytes:
- 64 bytes: 4.8x faster
- 256 bytes: 10.3x faster
- 1KB: 14.5x faster
- 4KB: 15.4x faster
- 64KB: 28x faster

The pack decoding improvement is especially dramatic because the
original code called rb_str_buf_cat() per byte, while the new code
pre-allocates the output buffer and writes directly.
mensfeld force-pushed the optimize-pack-hex-unified branch from 37f6ebf to 42ff3dd on December 27, 2025 15:20
mensfeld changed the title from "Optimize pack H/h with SIMD hex decoding" to "Optimize pack with SIMD hex decoding" on Dec 27, 2025
mensfeld marked this pull request as ready for review on December 27, 2025 18:22
@mensfeld (Contributor, Author)

Last side note: I would be happy to take over maintenance of the SIMD code in Ruby if there is a will to merge such optimizations.

@ahorek (Contributor) commented Dec 28, 2025

Prebuilt Ruby packages are typically not compiled with -msse4 or -mavx flags, so runtime feature detection (via CPUID) is necessary. I’d be happy to help add this if there’s interest in using SIMD optimizations in general.
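
One common shape for that runtime dispatch is to probe CPUID once via GCC/Clang builtins and route between SIMD and scalar paths, with the SIMD functions carrying `__attribute__((target("ssse3,sse4.1")))` so the rest of the build needs no extra -m flags. The sketch below is illustrative only; `hex_simd_usable()` and the caller names are hypothetical.

```c
/* Illustrative runtime-dispatch sketch, not code from this PR. */
#include <stdbool.h>

static bool hex_simd_usable(void)
{
#if defined(__x86_64__) || defined(__i386__)
    static int cached = -1;            /* -1 = not probed yet */
    if (cached < 0) {
        __builtin_cpu_init();          /* GCC/Clang CPUID probe */
        cached = __builtin_cpu_supports("ssse3")
              && __builtin_cpu_supports("sse4.1");
    }
    return cached != 0;
#else
    return false;                      /* non-x86: fall back to scalar */
#endif
}

/* Callers would then do something like:
 *   if (hex_simd_usable()) pack_hex_simd(...); else pack_hex_scalar(...);
 * where pack_hex_simd/pack_hex_scalar are again hypothetical names. */
```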

@rhenium (Member) commented Dec 28, 2025

Related discussion on https://bugs.ruby-lang.org/issues/16487

> The original code called rb_str_buf_cat() per byte, while the new code pre-allocates the output buffer and writes directly.

How much of the speedup comes from eliminating rb_str_buf_cat() on each byte? That's a clear improvement we should apply first, before looking into whether SIMD instructions can make further gains that are worth the maintenance cost.

It seems other pack templates could probably also use similar optimizations.

mensfeld force-pushed the optimize-pack-hex-unified branch from 9326533 to 42ff3dd on December 28, 2025 21:34
@mensfeld (Contributor, Author)

@rhenium I will extract it from this and look into the rest of the pack templates in the upcoming days.
