
Conversation

@hellerve
Contributor

This PR tackles #25910 by adding a special case to expandtabs for UTF-8 that calls memcpy only once a full run of non-tab characters has been identified. I also added a benchmark to cover a few cases and make sure I understood the assignment correctly. Output on my machine:

$ spin bench -t bench_strings.ExpandTabs --compare "HEAD^" HEAD

$ cd benchmarks
$ asv continuous --factor 1.05 --bench bench_strings.ExpandTabs 60878d23a0610fd2ceb0bf57a44464c21bf8a075 fb29d9053b90f21dda0a4d241888d2c01bc72578
· Creating environments
· Discovering benchmarks.
·· Uninstalling from virtualenv-py3.12-Cython-build-packaging
·· Installing fb29d905 <main> into virtualenv-py3.12-Cython-build-packaging.
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For numpy commit 60878d23 <main~1> (round 1/2):
[ 0.00%] ·· Building for virtualenv-py3.12-Cython-build-packaging..
[ 0.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[25.00%] ··· Running (bench_strings.ExpandTabs.time_expandtabs--).
[25.00%] · For numpy commit fb29d905 <main> (round 1/2):
[25.00%] ·· Building for virtualenv-py3.12-Cython-build-packaging...
[25.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[50.00%] ··· Running (bench_strings.ExpandTabs.time_expandtabs--).
[50.00%] · For numpy commit fb29d905 <main> (round 2/2):
[50.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[75.00%] ··· bench_strings.ExpandTabs.time_expandtabs                                                                  ok
[75.00%] ··· ================ ============ ============= ============ =============
             --                                 tab_pattern / size
             ---------------- -----------------------------------------------------
                  dtype          3 / 32       3 / 2048     11 / 32      11 / 2048
             ================ ============ ============= ============ =============
              dtype('<U256')   44.7±0.6μs    2.98±0.1ms   33.8±0.3μs   2.28±0.06ms
              dtype('S256')    40.8±0.2μs   2.47±0.01ms   31.1±0.3μs   1.84±0.02ms
              StringDType()    83.8±0.7μs    5.29±0.1ms   68.1±0.3μs   4.25±0.04ms
             ================ ============ ============= ============ =============

[75.00%] · For numpy commit 60878d23 <main~1> (round 2/2):
[75.00%] ·· Building for virtualenv-py3.12-Cython-build-packaging..
[75.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[100.00%] ··· bench_strings.ExpandTabs.time_expandtabs                                                                  ok
[100.00%] ··· ================ ============ ============= ============ =============
              --                                 tab_pattern / size
              ---------------- -----------------------------------------------------
                   dtype          3 / 32       3 / 2048     11 / 32      11 / 2048
              ================ ============ ============= ============ =============
               dtype('<U256')   44.2±0.2μs   2.99±0.08ms   33.9±0.3μs   2.21±0.03ms
               dtype('S256')    40.4±0.1μs   2.44±0.02ms   31.4±0.5μs   1.88±0.02ms
               StringDType()    86.7±0.7μs   5.54±0.06ms   70.5±0.7μs   4.46±0.05ms
              ================ ============ ============= ============ =============

| Change   | Before [60878d23] <main~1>   | After [fb29d905] <main>   |   Ratio | Benchmark (Parameter)                                             |
|----------|------------------------------|---------------------------|---------|-------------------------------------------------------------------|
| -        | 4.46±0.05ms                  | 4.25±0.04ms               |    0.95 | bench_strings.ExpandTabs.time_expandtabs(StringDType(), 11, 2048) |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
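For readers skimming the diff below, the run-batching idea can be sketched in isolation. This is a hypothetical standalone version operating on plain `std::string`; the real implementation works on NumPy's `Buffer<enc>` template and its `buffer_memcpy`/`buffer_memset` helpers:

```cpp
#include <string>

// Sketch of run-batched tab expansion: instead of writing one character at
// a time, remember where the current run of non-tab characters started and
// copy the whole run in a single append when a tab (or end of input) is hit.
static std::string expandtabs_batched(const std::string &in, long tabsize)
{
    std::string out;
    long line_pos = 0;     // column within the current line, in characters
    size_t run_start = 0;  // start of the pending run of non-tab characters
    for (size_t i = 0; i < in.size(); i++) {
        char ch = in[i];
        if (ch == '\t') {
            // Flush the pending run with one bulk copy.
            out.append(in, run_start, i - run_start);
            if (tabsize > 0) {
                long incr = tabsize - (line_pos % tabsize);
                out.append(incr, ' ');
                line_pos += incr;
            }
            run_start = i + 1;  // tab itself is consumed, not copied
        }
        else {
            line_pos++;
            if (ch == '\n' || ch == '\r') {
                line_pos = 0;  // tab stops restart at each line
            }
        }
    }
    // Flush whatever run remains after the last tab.
    out.append(in, run_start, in.size() - run_start);
    return out;
}
```

The payoff depends on how expensive the per-character write path is; for fixed-width encodings a per-character memset is already cheap, which matches the benchmark numbers further down.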

Why only UTF-8? I had a bit of trouble understanding the whole machinery involved, but it looks like only UTF-8 re-encodes on buffered memset, so it’s the only encoding that benefits from clustering. When I tried expanding the approach to the other encodings, it actually slowed down. I might be missing something, though.

Cheers

@seberg
Member

seberg commented Nov 13, 2025

CC @lysnikolaou for awareness. Now that I glance at this, I wonder a bit if it may be easier to just reinterpret the utf8 buffer as an ascii/latin1 buffer (do we have that pattern elsewhere)?
All characters that we are interested in are ASCII, so the beauty of utf8 is that it will do the right thing.

@hellerve
Contributor Author

CC @lysnikolaou for awareness. Now that I glance at this, I wonder a bit if it may be easier to just reinterpret the utf8 buffer as an ascii/latin1 buffer (do we have that pattern elsewhere)? All characters that we are interested in are ASCII, so the beauty of utf8 is that it will do the right thing.

Just a quick note: I tried this approach but seemingly couldn't get it right, and the performance impact all but evaporated. I can try tackling it again; I just wanted to keep y'all updated on this.


Buffer<enc> tmp = buf;

if (enc == ENCODING::UTF8) {
Member

I'm confused why you're only doing this for UTF-8. I would naively expect it to be possible to update this generically without adding this if statement.

Contributor Author

To quote my original PR description:

Why only UTF-8? I had a bit of trouble understanding the whole machinery involved, but it looks like only UTF-8 re-encodes on buffered memset, so it’s the only encoding that benefits from clustering. When I tried expanding the approach to the other encodings, it actually slowed down. I might be missing something, though.

Member

I'm a little surprised that doing fewer copies doesn't help. Do you happen to have the version that did it for all three string types saved somewhere? I'd like to take a shot - I originally wrote a good chunk of this code.

If we end up going with the approach here, I'd prefer to structure this as a template specialization rather than as a runtime if statement. I'd also add a comment explaining the performance analysis.
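One way the compile-time structure could look, very roughly (hypothetical sketch; the `ENCODING` enum name mirrors the PR's code, the trait and its member are invented for illustration):

```cpp
// A small trait lets the algorithm select the batched path at compile time
// instead of branching on the encoding at runtime.
enum class ENCODING { ASCII, UTF32, UTF8 };

// Generic case: per-character writes are cheap, no run batching.
template <ENCODING enc>
struct ExpandTabsTraits {
    static constexpr bool batch_runs = false;
};

// UTF-8 specialization: re-encoding per character is costly, so opt in
// to the run-batching path.
template <>
struct ExpandTabsTraits<ENCODING::UTF8> {
    static constexpr bool batch_runs = true;
};

// Inside string_expandtabs one would then write
//   if constexpr (ExpandTabsTraits<enc>::batch_runs) { ... }
// so the dead branch is discarded at compile time.
```

The `if constexpr` branch is resolved per template instantiation, so each encoding's hot loop contains only its own path.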


Contributor Author

So, I tried a few different solutions, but found this one to be the fastest and, unfortunately, the ugliest. I'd happily hand this over to someone more experienced with the string buffer, though, in the hope that my benchmark can at least help guide the decision.

This is the simpler diff (from main):

diff --git a/numpy/_core/src/umath/string_buffer.h b/numpy/_core/src/umath/string_buffer.h
index 1e7bea49a3..f279b7ab1d 100644
--- a/numpy/_core/src/umath/string_buffer.h
+++ b/numpy/_core/src/umath/string_buffer.h
@@ -1533,26 +1533,43 @@ string_expandtabs(Buffer<enc> buf, npy_int64 tabsize, Buffer<enc> out)
     npy_intp new_len = 0, line_pos = 0;

     Buffer<enc> tmp = buf;
+    Buffer<enc> chunk_start = buf;
+
     for (size_t i = 0; i < len; i++) {
         npy_ucs4 ch = *tmp;
         if (ch == '\t') {
+            std::ptrdiff_t span = tmp - chunk_start;
+            if (span > 0) {
+                size_t copy_len = (size_t)span;
+                chunk_start.buffer_memcpy(out, copy_len);
+                out.advance_chars_or_bytes(copy_len);
+                new_len += (npy_intp)span;
+            }
             if (tabsize > 0) {
                 npy_intp incr = tabsize - (line_pos % tabsize);
                 line_pos += incr;
-                new_len += out.buffer_memset((npy_ucs4) ' ', incr);
-                out += incr;
+                npy_intp spaces_written =
+                        out.buffer_memset((npy_ucs4) ' ', (size_t)incr);
+                new_len += spaces_written;
+                out.advance_chars_or_bytes((size_t)spaces_written);
             }
+            chunk_start = tmp + 1;
         }
         else {
             line_pos++;
-            new_len += out.buffer_memset(ch, 1);
-            out++;
             if (ch == '\n' || ch == '\r') {
                 line_pos = 0;
             }
         }
         tmp++;
     }
+    std::ptrdiff_t span = tmp - chunk_start;
+    if (span > 0) {
+        size_t copy_len = (size_t)span;
+        chunk_start.buffer_memcpy(out, copy_len);
+        out.advance_chars_or_bytes(copy_len);
+        new_len += (npy_intp)span;
+    }
     return new_len;
 }

And these are the numbers if I compare it to the special case:

· Creating environments
· Discovering benchmarks.
·· Uninstalling from virtualenv-py3.12-Cython-build-packaging
·· Installing 3fe1c802 <main> into virtualenv-py3.12-Cython-build-packaging.
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For numpy commit fb29d905 <main~2> (round 1/2):
[ 0.00%] ·· Building for virtualenv-py3.12-Cython-build-packaging..
[ 0.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[25.00%] ··· Running (bench_strings.ExpandTabs.time_expandtabs--).
[25.00%] · For numpy commit 3fe1c802 <main> (round 1/2):
[25.00%] ·· Building for virtualenv-py3.12-Cython-build-packaging..
[25.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[50.00%] ··· Running (bench_strings.ExpandTabs.time_expandtabs--).
[50.00%] · For numpy commit 3fe1c802 <main> (round 2/2):
[50.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[75.00%] ··· bench_strings.ExpandTabs.time_expandtabs                                                                  ok
[75.00%] ··· ================ ============= ============= ============ =============
            --                                 tab_pattern / size
            ---------------- ------------------------------------------------------
                 dtype           3 / 32       3 / 2048     11 / 32      11 / 2048
            ================ ============= ============= ============ =============
             dtype('<U256')    50.1±0.2μs    3.32±0.1ms   35.7±0.2μs   2.28±0.01ms
             dtype('S256')    44.1±0.07μs     2.66±0ms    32.0±0.1μs     1.88±0ms
             StringDType()     83.1±0.1μs   5.23±0.01ms   67.7±0.4μs   4.17±0.01ms
            ================ ============= ============= ============ =============

[75.00%] · For numpy commit fb29d905 <main~2> (round 2/2):
[75.00%] ·· Building for virtualenv-py3.12-Cython-build-packaging..
[75.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[100.00%] ··· bench_strings.ExpandTabs.time_expandtabs                                                                  ok
[100.00%] ··· ================ ============= ============= ============= =============
             --                                  tab_pattern / size
             ---------------- -------------------------------------------------------
                  dtype           3 / 32       3 / 2048      11 / 32      11 / 2048
             ================ ============= ============= ============= =============
              dtype('<U256')    44.0±0.2μs     2.81±0ms     33.4±0.1μs    2.24±0.1ms
              dtype('S256')     40.2±0.2μs   2.40±0.01ms   31.0±0.06μs   1.82±0.02ms
              StringDType()    83.7±0.09μs   5.22±0.02ms    67.1±0.5μs   4.21±0.03ms
             ================ ============= ============= ============= =============

| Change   | Before [fb29d905] <main~2>   | After [3fe1c802] <main>   |   Ratio | Benchmark (Parameter)                                             |
|----------|------------------------------|---------------------------|---------|-------------------------------------------------------------------|
| +        | 2.81±0ms                     | 3.32±0.1ms                |    1.18 | bench_strings.ExpandTabs.time_expandtabs(dtype('<U256'), 3, 2048) |
| +        | 44.0±0.2μs                   | 50.1±0.2μs                |    1.14 | bench_strings.ExpandTabs.time_expandtabs(dtype('<U256'), 3, 32)   |
| +        | 2.40±0.01ms                  | 2.66±0ms                  |    1.11 | bench_strings.ExpandTabs.time_expandtabs(dtype('S256'), 3, 2048)  |
| +        | 40.2±0.2μs                   | 44.1±0.07μs               |    1.1  | bench_strings.ExpandTabs.time_expandtabs(dtype('S256'), 3, 32)    |
| +        | 33.4±0.1μs                   | 35.7±0.2μs                |    1.07 | bench_strings.ExpandTabs.time_expandtabs(dtype('<U256'), 11, 32)  |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

As you can see, the UTF-8 case doesn’t really change, but the others tank.

Member

Ah yeah, sorry, the character counting is off with my suggestion, of course 🤦
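To spell out the problem: tab stops are defined in characters, but a byte-level scan of a UTF-8 buffer counts bytes, so the column is miscounted as soon as a multibyte character precedes a tab. A minimal illustration (hypothetical code, not from the PR):

```cpp
#include <string>

// Column of the first '\t' in s, counted either in bytes or in code
// points. UTF-8 continuation bytes have the bit pattern 10xxxxxx, so
// skipping them yields the code-point count.
static size_t tab_column(const std::string &s, bool count_bytes)
{
    size_t col = 0;
    for (unsigned char b : s) {
        if (b == '\t') {
            return col;
        }
        if (count_bytes || (b & 0xC0) != 0x80) {
            col++;
        }
    }
    return col;
}

// "\xC3\xA9\t" is U+00E9 ('e' with acute accent) followed by a tab:
// two bytes but one character before the tab, so the two counts disagree
// and a byte-based scan would expand the tab to the wrong number of spaces.
```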
