ENH: optimize expandtabs to only call memcpy on full runs #30205
base: main
Conversation
CC @lysnikolaou for awareness. Now that I glance at this, I wonder a bit if it may be easier to just reinterpret the …
Just a quick note: I just tried this approach but seemingly couldn't get it right, and the performance impact all but evaporated. I can try tackling it again; I just wanted to make sure I updated y'all on this.
Buffer<enc> tmp = buf;
if (enc == ENCODING::UTF8) {
I'm confused why you're only doing this for UTF-8. I would naively expect it to be possible to update this generically without adding this if statement.
To quote my original PR description:
Why only UTF-8? I had a bit of trouble understanding the whole machinery involved, but it looks like only UTF-8 re-encodes on buffered memset, so it’s the only encoding that benefits from clustering. When I tried expanding the approach to the other encodings, it actually slowed down. I might be missing something, though.
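To illustrate the asymmetry being described, here is a rough standalone model (my own sketch for illustration, not NumPy's actual Buffer<enc> code): a fixed-width encoding writes a character with a single store, while UTF-8 has to re-encode every UCS4 code point into 1-4 bytes, so per-character writes carry real per-call work only for UTF-8.

```cpp
#include <cstddef>

// Hypothetical fixed-width write: one 4-byte store per character, cheap
// whether it happens once or in a loop.
static inline std::size_t write_ucs4(char32_t ch, char32_t *dst) {
    *dst = ch;
    return 1;
}

// Hypothetical UTF-8 write: every call branches and re-encodes the code
// point into 1-4 bytes -- the per-character cost that clustering runs
// into a single memcpy avoids.
static inline std::size_t write_utf8(char32_t ch, unsigned char *dst) {
    if (ch < 0x80) {
        dst[0] = static_cast<unsigned char>(ch);
        return 1;
    }
    if (ch < 0x800) {
        dst[0] = static_cast<unsigned char>(0xC0 | (ch >> 6));
        dst[1] = static_cast<unsigned char>(0x80 | (ch & 0x3F));
        return 2;
    }
    if (ch < 0x10000) {
        dst[0] = static_cast<unsigned char>(0xE0 | (ch >> 12));
        dst[1] = static_cast<unsigned char>(0x80 | ((ch >> 6) & 0x3F));
        dst[2] = static_cast<unsigned char>(0x80 | (ch & 0x3F));
        return 3;
    }
    dst[0] = static_cast<unsigned char>(0xF0 | (ch >> 18));
    dst[1] = static_cast<unsigned char>(0x80 | ((ch >> 12) & 0x3F));
    dst[2] = static_cast<unsigned char>(0x80 | ((ch >> 6) & 0x3F));
    dst[3] = static_cast<unsigned char>(0x80 | (ch & 0x3F));
    return 4;
}

int main() {
    char32_t u32[1];
    unsigned char u8[4];
    // 'é' (U+00E9): a single store for UCS4, but re-encoded to two bytes
    // (0xC3 0xA9) for UTF-8.
    return (write_ucs4(U'\u00E9', u32) == 1 &&
            write_utf8(U'\u00E9', u8) == 2) ? 0 : 1;
}
```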
I'm a little surprised that doing fewer copies doesn't help. Do you happen to have the version that did it for all three string types saved somewhere? I'd like to take a shot - I originally wrote a good chunk of this code.
If we end up going with the approach here, I'd prefer to structure this as a template specialization rather than as a runtime if statement. I'd also add a comment explaining the performance analysis.
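A self-contained sketch of that structure (hypothetical simplified types, not the actual string_buffer.h machinery): since the encoding is already a template parameter, the run-clustering variant can live in an explicit specialization rather than behind a branch in the hot loop.

```cpp
#include <cstdio>

enum class Encoding { ASCII, UCS4, UTF8 };

// Generic version: the plain per-character loop stays the default for the
// fixed-width encodings, where extra bookkeeping only adds overhead.
template <Encoding enc>
void expandtabs_strategy() {
    std::puts("per-character writes");
}

// Explicit specialization for UTF-8, the one encoding that re-encodes on
// per-character writes: cluster plain characters and memcpy whole runs.
template <>
void expandtabs_strategy<Encoding::UTF8>() {
    std::puts("memcpy full runs between tabs");
}

int main() {
    expandtabs_strategy<Encoding::ASCII>();  // prints: per-character writes
    expandtabs_strategy<Encoding::UTF8>();   // prints: memcpy full runs between tabs
}
```

The compiler selects the variant at instantiation time, so the fixed-width paths would keep exactly the code they have today.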
So, I tried a few different solutions, but found this one to be the fastest and, unfortunately, the ugliest. I’d happily hand this over to someone more experienced with the string buffer, though, in the hopes that at least my benchmark can help guide the decision.
This is the simpler diff (from main):
diff --git a/numpy/_core/src/umath/string_buffer.h b/numpy/_core/src/umath/string_buffer.h
index 1e7bea49a3..f279b7ab1d 100644
--- a/numpy/_core/src/umath/string_buffer.h
+++ b/numpy/_core/src/umath/string_buffer.h
@@ -1533,26 +1533,43 @@ string_expandtabs(Buffer<enc> buf, npy_int64 tabsize, Buffer<enc> out)
npy_intp new_len = 0, line_pos = 0;
Buffer<enc> tmp = buf;
+ Buffer<enc> chunk_start = buf;
+
for (size_t i = 0; i < len; i++) {
npy_ucs4 ch = *tmp;
if (ch == '\t') {
+ std::ptrdiff_t span = tmp - chunk_start;
+ if (span > 0) {
+ size_t copy_len = (size_t)span;
+ chunk_start.buffer_memcpy(out, copy_len);
+ out.advance_chars_or_bytes(copy_len);
+ new_len += (npy_intp)span;
+ }
if (tabsize > 0) {
npy_intp incr = tabsize - (line_pos % tabsize);
line_pos += incr;
- new_len += out.buffer_memset((npy_ucs4) ' ', incr);
- out += incr;
+ npy_intp spaces_written =
+ out.buffer_memset((npy_ucs4) ' ', (size_t)incr);
+ new_len += spaces_written;
+ out.advance_chars_or_bytes((size_t)spaces_written);
}
+ chunk_start = tmp + 1;
}
else {
line_pos++;
- new_len += out.buffer_memset(ch, 1);
- out++;
if (ch == '\n' || ch == '\r') {
line_pos = 0;
}
}
tmp++;
}
+ std::ptrdiff_t span = tmp - chunk_start;
+ if (span > 0) {
+ size_t copy_len = (size_t)span;
+ chunk_start.buffer_memcpy(out, copy_len);
+ out.advance_chars_or_bytes(copy_len);
+ new_len += (npy_intp)span;
+ }
return new_len;
 }

And these are the numbers if I compare it to the special case:
· Creating environments
· Discovering benchmarks.
·· Uninstalling from virtualenv-py3.12-Cython-build-packaging
·· Installing 3fe1c802 <main> into virtualenv-py3.12-Cython-build-packaging.
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For numpy commit fb29d905 <main~2> (round 1/2):
[ 0.00%] ·· Building for virtualenv-py3.12-Cython-build-packaging..
[ 0.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[25.00%] ··· Running (bench_strings.ExpandTabs.time_expandtabs--).
[25.00%] · For numpy commit 3fe1c802 <main> (round 1/2):
[25.00%] ·· Building for virtualenv-py3.12-Cython-build-packaging..
[25.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[50.00%] ··· Running (bench_strings.ExpandTabs.time_expandtabs--).
[50.00%] · For numpy commit 3fe1c802 <main> (round 2/2):
[50.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[75.00%] ··· bench_strings.ExpandTabs.time_expandtabs ok
[75.00%] ··· ================ ============= ============= ============ =============
-- tab_pattern / size
---------------- ------------------------------------------------------
dtype 3 / 32 3 / 2048 11 / 32 11 / 2048
================ ============= ============= ============ =============
dtype('<U256') 50.1±0.2μs 3.32±0.1ms 35.7±0.2μs 2.28±0.01ms
dtype('S256') 44.1±0.07μs 2.66±0ms 32.0±0.1μs 1.88±0ms
StringDType() 83.1±0.1μs 5.23±0.01ms 67.7±0.4μs 4.17±0.01ms
================ ============= ============= ============ =============
[75.00%] · For numpy commit fb29d905 <main~2> (round 2/2):
[75.00%] ·· Building for virtualenv-py3.12-Cython-build-packaging..
[75.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[100.00%] ··· bench_strings.ExpandTabs.time_expandtabs ok
[100.00%] ··· ================ ============= ============= ============= =============
-- tab_pattern / size
---------------- -------------------------------------------------------
dtype 3 / 32 3 / 2048 11 / 32 11 / 2048
================ ============= ============= ============= =============
dtype('<U256') 44.0±0.2μs 2.81±0ms 33.4±0.1μs 2.24±0.1ms
dtype('S256') 40.2±0.2μs 2.40±0.01ms 31.0±0.06μs 1.82±0.02ms
StringDType() 83.7±0.09μs 5.22±0.02ms 67.1±0.5μs 4.21±0.03ms
================ ============= ============= ============= =============
| Change | Before [fb29d905] <main~2> | After [3fe1c802] <main> | Ratio | Benchmark (Parameter) |
|----------|------------------------------|---------------------------|---------|-------------------------------------------------------------------|
| + | 2.81±0ms | 3.32±0.1ms | 1.18 | bench_strings.ExpandTabs.time_expandtabs(dtype('<U256'), 3, 2048) |
| + | 44.0±0.2μs | 50.1±0.2μs | 1.14 | bench_strings.ExpandTabs.time_expandtabs(dtype('<U256'), 3, 32) |
| + | 2.40±0.01ms | 2.66±0ms | 1.11 | bench_strings.ExpandTabs.time_expandtabs(dtype('S256'), 3, 2048) |
| + | 40.2±0.2μs | 44.1±0.07μs | 1.1 | bench_strings.ExpandTabs.time_expandtabs(dtype('S256'), 3, 32) |
| + | 33.4±0.1μs | 35.7±0.2μs | 1.07 | bench_strings.ExpandTabs.time_expandtabs(dtype('<U256'), 11, 32) |
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
As you can see, the UTF-8 case doesn’t really change, but the others tank.
Ah yeah, sorry, of course the character counting is off with my suggestion 🤦
This PR tackles #25910 by making a special case in expandtabs that deals with UTF-8 and calls memcpy only when we have identified a full run. I also added a benchmark to look at some cases and ensure I understood the assignment correctly. Output on my machine:

Why only UTF-8? I had a bit of trouble understanding the whole machinery involved, but it looks like only UTF-8 re-encodes on buffered memset, so it's the only encoding that benefits from clustering. When I tried expanding the approach to the other encodings, it actually slowed down. I might be missing something, though.

Cheers