ENH: optimize expandtabs to only call memcpy on full runs #30205
base: main
Conversation
CC @lysnikolaou for awareness. Now that I glance at this, I wonder a bit if it may be easier to just reinterpret the …
Just a quick note: I just tried this approach but seemingly couldn't get it right, and the performance impact all but evaporated. I can try tackling it again; I just wanted to make sure I updated y'all on this.
Buffer<enc> tmp = buf;
if (enc == ENCODING::UTF8) {
I'm confused why you're only doing this for UTF-8. I would naively expect it to be possible to update this generically without adding this if statement.
To quote my original PR description:
Why only UTF-8? I had a bit of trouble understanding the whole machinery involved, but it looks like only UTF-8 re-encodes on buffered memset, so it’s the only encoding that benefits from clustering. When I tried expanding the approach to the other encodings, it actually slowed down. I might be missing something, though.
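To illustrate the asymmetry being described, here is a rough standalone model (my own sketch for illustration, not NumPy's actual Buffer<enc> code): a fixed-width encoding writes a character with a single store, while UTF-8 has to re-encode every UCS4 code point into 1-4 bytes, so per-character writes carry real per-call work only for UTF-8.

```cpp
#include <cstddef>

// Hypothetical fixed-width write: one 4-byte store per character, cheap
// whether it happens once or in a loop.
static inline std::size_t write_ucs4(char32_t ch, char32_t *dst) {
    *dst = ch;
    return 1;
}

// Hypothetical UTF-8 write: every call branches and re-encodes the code
// point into 1-4 bytes -- the per-character cost that clustering runs
// into a single memcpy avoids.
static inline std::size_t write_utf8(char32_t ch, unsigned char *dst) {
    if (ch < 0x80) {
        dst[0] = static_cast<unsigned char>(ch);
        return 1;
    }
    if (ch < 0x800) {
        dst[0] = static_cast<unsigned char>(0xC0 | (ch >> 6));
        dst[1] = static_cast<unsigned char>(0x80 | (ch & 0x3F));
        return 2;
    }
    if (ch < 0x10000) {
        dst[0] = static_cast<unsigned char>(0xE0 | (ch >> 12));
        dst[1] = static_cast<unsigned char>(0x80 | ((ch >> 6) & 0x3F));
        dst[2] = static_cast<unsigned char>(0x80 | (ch & 0x3F));
        return 3;
    }
    dst[0] = static_cast<unsigned char>(0xF0 | (ch >> 18));
    dst[1] = static_cast<unsigned char>(0x80 | ((ch >> 12) & 0x3F));
    dst[2] = static_cast<unsigned char>(0x80 | ((ch >> 6) & 0x3F));
    dst[3] = static_cast<unsigned char>(0x80 | (ch & 0x3F));
    return 4;
}

int main() {
    char32_t u32[1];
    unsigned char u8[4];
    // 'é' (U+00E9): a single store for UCS4, but re-encoded to two bytes
    // (0xC3 0xA9) for UTF-8.
    return (write_ucs4(U'\u00E9', u32) == 1 &&
            write_utf8(U'\u00E9', u8) == 2) ? 0 : 1;
}
```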
I'm a little surprised that doing fewer copies doesn't help. Do you happen to have the version that did it for all three string types saved somewhere? I'd like to take a shot - I originally wrote a good chunk of this code.
If we end up going with the approach here, I'd prefer to structure this as a template specialization rather than as a runtime if statement. I'd also add a comment explaining the performance analysis.
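A self-contained sketch of that structure (hypothetical simplified types, not the actual string_buffer.h machinery): since the encoding is already a template parameter, the run-clustering variant can live in an explicit specialization rather than behind a branch in the hot loop.

```cpp
#include <cstdio>

enum class Encoding { ASCII, UCS4, UTF8 };

// Generic version: the plain per-character loop stays the default for the
// fixed-width encodings, where extra bookkeeping only adds overhead.
template <Encoding enc>
void expandtabs_strategy() {
    std::puts("per-character writes");
}

// Explicit specialization for UTF-8, the one encoding that re-encodes on
// per-character writes: cluster plain characters and memcpy whole runs.
template <>
void expandtabs_strategy<Encoding::UTF8>() {
    std::puts("memcpy full runs between tabs");
}

int main() {
    expandtabs_strategy<Encoding::ASCII>();  // prints: per-character writes
    expandtabs_strategy<Encoding::UTF8>();   // prints: memcpy full runs between tabs
}
```

The compiler selects the variant at instantiation time, so the fixed-width paths would keep exactly the code they have today.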
So, I tried a few different solutions, but found this one to be the fastest and, unfortunately, the ugliest. I’d happily hand this over to someone more experienced with the string buffer, though, in the hopes that at least my benchmark can help guide the decision.
This is the simpler diff (from main):
diff --git a/numpy/_core/src/umath/string_buffer.h b/numpy/_core/src/umath/string_buffer.h
index 1e7bea49a3..f279b7ab1d 100644
--- a/numpy/_core/src/umath/string_buffer.h
+++ b/numpy/_core/src/umath/string_buffer.h
@@ -1533,26 +1533,43 @@ string_expandtabs(Buffer<enc> buf, npy_int64 tabsize, Buffer<enc> out)
npy_intp new_len = 0, line_pos = 0;
Buffer<enc> tmp = buf;
+ Buffer<enc> chunk_start = buf;
+
for (size_t i = 0; i < len; i++) {
npy_ucs4 ch = *tmp;
if (ch == '\t') {
+ std::ptrdiff_t span = tmp - chunk_start;
+ if (span > 0) {
+ size_t copy_len = (size_t)span;
+ chunk_start.buffer_memcpy(out, copy_len);
+ out.advance_chars_or_bytes(copy_len);
+ new_len += (npy_intp)span;
+ }
if (tabsize > 0) {
npy_intp incr = tabsize - (line_pos % tabsize);
line_pos += incr;
- new_len += out.buffer_memset((npy_ucs4) ' ', incr);
- out += incr;
+ npy_intp spaces_written =
+ out.buffer_memset((npy_ucs4) ' ', (size_t)incr);
+ new_len += spaces_written;
+ out.advance_chars_or_bytes((size_t)spaces_written);
}
+ chunk_start = tmp + 1;
}
else {
line_pos++;
- new_len += out.buffer_memset(ch, 1);
- out++;
if (ch == '\n' || ch == '\r') {
line_pos = 0;
}
}
tmp++;
}
+ std::ptrdiff_t span = tmp - chunk_start;
+ if (span > 0) {
+ size_t copy_len = (size_t)span;
+ chunk_start.buffer_memcpy(out, copy_len);
+ out.advance_chars_or_bytes(copy_len);
+ new_len += (npy_intp)span;
+ }
return new_len;
 }

And these are the numbers if I compare it to the special case:
· Creating environments
· Discovering benchmarks.
·· Uninstalling from virtualenv-py3.12-Cython-build-packaging
·· Installing 3fe1c802 <main> into virtualenv-py3.12-Cython-build-packaging.
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For numpy commit fb29d905 <main~2> (round 1/2):
[ 0.00%] ·· Building for virtualenv-py3.12-Cython-build-packaging..
[ 0.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[25.00%] ··· Running (bench_strings.ExpandTabs.time_expandtabs--).
[25.00%] · For numpy commit 3fe1c802 <main> (round 1/2):
[25.00%] ·· Building for virtualenv-py3.12-Cython-build-packaging..
[25.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[50.00%] ··· Running (bench_strings.ExpandTabs.time_expandtabs--).
[50.00%] · For numpy commit 3fe1c802 <main> (round 2/2):
[50.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[75.00%] ··· bench_strings.ExpandTabs.time_expandtabs ok
[75.00%] ··· ================ ============= ============= ============ =============
-- tab_pattern / size
---------------- ------------------------------------------------------
dtype 3 / 32 3 / 2048 11 / 32 11 / 2048
================ ============= ============= ============ =============
dtype('<U256') 50.1±0.2μs 3.32±0.1ms 35.7±0.2μs 2.28±0.01ms
dtype('S256') 44.1±0.07μs 2.66±0ms 32.0±0.1μs 1.88±0ms
StringDType() 83.1±0.1μs 5.23±0.01ms 67.7±0.4μs 4.17±0.01ms
================ ============= ============= ============ =============
[75.00%] · For numpy commit fb29d905 <main~2> (round 2/2):
[75.00%] ·· Building for virtualenv-py3.12-Cython-build-packaging..
[75.00%] ·· Benchmarking virtualenv-py3.12-Cython-build-packaging
[100.00%] ··· bench_strings.ExpandTabs.time_expandtabs ok
[100.00%] ··· ================ ============= ============= ============= =============
-- tab_pattern / size
---------------- -------------------------------------------------------
dtype 3 / 32 3 / 2048 11 / 32 11 / 2048
================ ============= ============= ============= =============
dtype('<U256') 44.0±0.2μs 2.81±0ms 33.4±0.1μs 2.24±0.1ms
dtype('S256') 40.2±0.2μs 2.40±0.01ms 31.0±0.06μs 1.82±0.02ms
StringDType() 83.7±0.09μs 5.22±0.02ms 67.1±0.5μs 4.21±0.03ms
================ ============= ============= ============= =============
| Change | Before [fb29d905] <main~2> | After [3fe1c802] <main> | Ratio | Benchmark (Parameter) |
|----------|------------------------------|---------------------------|---------|-------------------------------------------------------------------|
| + | 2.81±0ms | 3.32±0.1ms | 1.18 | bench_strings.ExpandTabs.time_expandtabs(dtype('<U256'), 3, 2048) |
| + | 44.0±0.2μs | 50.1±0.2μs | 1.14 | bench_strings.ExpandTabs.time_expandtabs(dtype('<U256'), 3, 32) |
| + | 2.40±0.01ms | 2.66±0ms | 1.11 | bench_strings.ExpandTabs.time_expandtabs(dtype('S256'), 3, 2048) |
| + | 40.2±0.2μs | 44.1±0.07μs | 1.1 | bench_strings.ExpandTabs.time_expandtabs(dtype('S256'), 3, 32) |
| + | 33.4±0.1μs | 35.7±0.2μs | 1.07 | bench_strings.ExpandTabs.time_expandtabs(dtype('<U256'), 11, 32) |
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
As you can see, the UTF-8 case doesn’t really change, but the others tank.
Ah yeah, sorry, of course the character counting is off with my suggestion 🤦
This PR tackles #25910 by making a special case in expandtabs that deals with UTF-8 and calls memcpy only when we have identified a full run. I also added a benchmark to look at some cases and ensure I understood the assignment correctly. Output on my machine:

Why only UTF-8? I had a bit of trouble understanding the whole machinery involved, but it looks like only UTF-8 re-encodes on buffered memset, so it's the only encoding that benefits from clustering. When I tried expanding the approach to the other encodings, it actually slowed down. I might be missing something, though.

Cheers