optimize utf8_length_from_latin1 for short strings #499

lemire · 2024-08-13T02:46:52Z

Using this short UTF-8 file:
https://github.com/lemire/unicode_lipsum/tree/main/short

Run the benchmark:

benchmark -P utf8_length_from_latin1 -F fourbytes.utf8.txt -I 1000000

GCC 12, Intel Ice Lake:

utf8_length_from_latin1+fallback, input size: 64, iterations: 100000, dataset: shorter.txt
   1.281 ins/byte,    0.266 cycle/byte,   11.825 GB/s (16.2 %),     3.141 GHz,    4.824 ins/cycle 
   5.125 ins/char,    1.062 cycle/char,    2.956 Gc/s (16.2 %)     4.00 byte/char      5.4 ns
WARNING: Measurements are noisy, try increasing iteration count (-I).
utf8_length_from_latin1+haswell, input size: 64, iterations: 100000, dataset: shorter.txt
   0.672 ins/byte,    0.219 cycle/byte,   14.493 GB/s (14.2 %),     3.170 GHz,    3.071 ins/cycle 
   2.688 ins/char,    0.875 cycle/char,    3.623 Gc/s (14.2 %)     4.00 byte/char      4.4 ns
WARNING: Measurements are noisy, try increasing iteration count (-I).
utf8_length_from_latin1+icelake, input size: 64, iterations: 100000, dataset: shorter.txt
   0.750 ins/byte,    0.172 cycle/byte,   17.913 GB/s (0.7 %),     3.079 GHz,    4.364 ins/cycle 
   3.000 ins/char,    0.688 cycle/char,    4.478 Gc/s (0.7 %)     4.00 byte/char      3.6 ns
utf8_length_from_latin1+node, input size: 64, iterations: 100000, dataset: shorter.txt
   1.266 ins/byte,    0.297 cycle/byte,   10.714 GB/s (0.6 %),     3.181 GHz,    4.263 ins/cycle 
   5.062 ins/char,    1.188 cycle/char,    2.678 Gc/s (0.6 %)     4.00 byte/char      6.0 ns
utf8_length_from_latin1+westmere, input size: 64, iterations: 100000, dataset: shorter.txt
   0.938 ins/byte,    0.172 cycle/byte,   18.357 GB/s (21.3 %),     3.155 GHz,    5.455 ins/cycle 
   3.750 ins/char,    0.688 cycle/char,    4.589 Gc/s (21.3 %)     4.00 byte/char      3.5 ns
WARNING: Measurements are noisy, try increasing iteration count (-I).

LLVM 16, Apple Silicon M2:

utf8_length_from_latin1+arm64, input size: 64, iterations: 1000000, dataset: shorter.txt
   0.844 ins/byte,    0.156 cycle/byte,   25.372 GB/s (7.9 %),     3.964 GHz,    5.400 ins/cycle
   3.375 ins/char,    0.625 cycle/char,    6.343 Gc/s (7.9 %)     4.00 byte/char      2.5 ns
utf8_length_from_latin1+fallback, input size: 64, iterations: 1000000, dataset: shorter.txt
   1.266 ins/byte,    0.203 cycle/byte,   22.475 GB/s (10.0 %),     4.565 GHz,    6.231 ins/cycle
   5.062 ins/char,    0.812 cycle/char,    5.619 Gc/s (10.0 %)     4.00 byte/char      2.8 ns
utf8_length_from_latin1+node, input size: 64, iterations: 1000000, dataset: shorter.txt
   1.641 ins/byte,    0.203 cycle/byte,   18.291 GB/s (6.7 %),     3.715 GHz,    8.077 ins/cycle
   6.562 ins/char,    0.812 cycle/char,    4.573 Gc/s (6.7 %)     4.00 byte/char      3.5 ns

See nodejs/node#54345
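
For context, the quantity being measured is simple to state: a Latin-1 byte below 0x80 encodes to one UTF-8 byte and a byte at or above 0x80 encodes to two, so the UTF-8 length is the input length plus the number of bytes with the high bit set. A minimal scalar sketch under that definition follows; the name `utf8_length_from_latin1_sketch` is hypothetical and this is not the code changed in this PR:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical name; illustrative only, not the simdutf implementation.
// Each Latin-1 byte < 0x80 becomes 1 UTF-8 byte; each byte >= 0x80 becomes 2.
inline std::size_t utf8_length_from_latin1_sketch(const char *input,
                                                  std::size_t length) {
  const std::uint8_t *data = reinterpret_cast<const std::uint8_t *>(input);
  std::size_t extra = 0;
  for (std::size_t i = 0; i < length; i++) {
    extra += data[i] >> 7; // 1 when the high bit is set, 0 otherwise
  }
  return length + extra;
}
```

For short inputs such as the 64-byte case measured here, a plain loop like this has little setup cost, which is why the scalar and SIMD kernels end up close to each other.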

lemire · 2024-08-13T02:47:14Z

cc @ronag

ronag · 2024-08-13T06:18:02Z

I'm not sure how to read those bench results. What's before and after?

src/fallback/implementation.cpp

ronag

Not sure what is going on in the SIMD variations but left some comments on the scalar one.

src/fallback/implementation.cpp

Co-authored-by: Robert Nagy <[email protected]>

lemire · 2024-08-13T13:05:07Z

> I'm not sure how to read those bench results. What's before and after?

The reference is utf8_length_from_latin1+node. The way benchmarking works in simdutf is that we compare different implementations on the same task.

optimize utf8_length_from_latin1 for short strings

8aa2da7

adding explicit cast

10a8327

lemire mentioned this pull request Aug 13, 2024

buffer: optimize byteLength for short strings nodejs/node#54345

Merged

ronag approved these changes Aug 13, 2024
