Conversation

@pauldreik
Collaborator

This is an attempt to make atomic_binary_to_base64 faster by using architecture-specific knowledge to do atomic reads.

The base64 benchmark, run before and after the change, shows a 3.7× speedup on a 10M file:

# current system detected as icelake.
# loading files: .
# volume: 100000000 bytes
# max length: 100000000 bytes
# number of inputs: 1
# encode
memcpy                                   :  17.51 GB/s  9.40 % 
libbase64                                :  12.73 GB/s  17.39 % 
simdutf::icelake                         :  12.85 GB/s  4.69 % 
simdutf::haswell                         :  13.46 GB/s  5.68 % 
simdutf::westmere                        :  11.69 GB/s  2.16 % 
simdutf::fallback                        :   2.25 GB/s  0.26 % 
simdutf::atomic_binary_to_base64         :   3.36 GB/s  0.85 %   # <---------- before



# current system detected as icelake.
# loading files: .
# volume: 100000000 bytes
# max length: 100000000 bytes
# number of inputs: 1
# encode
memcpy                                   :  17.35 GB/s  10.57 % 
libbase64                                :  12.90 GB/s  21.67 % 
simdutf::icelake                         :  13.05 GB/s  2.66 % 
simdutf::haswell                         :  13.62 GB/s  4.20 % 
simdutf::westmere                        :  11.76 GB/s  1.46 % 
simdutf::fallback                        :   2.25 GB/s  0.24 % 
simdutf::atomic_binary_to_base64         :  12.37 GB/s  3.72 %  # <---------- after

@pauldreik pauldreik added the enhancement New feature or request label Apr 28, 2025
Member

@lemire lemire left a comment


I have not investigated but as long as it does not trigger a data race warning using standard sanitizers, this PR looks great.

#if SIMDUTF_ATOMIC_REF
void implementation::memcpy_atomic_read(char *const dst, const char *const src,
                                        const std::size_t len) const noexcept {
  scalar::memcpy_atomic_read(dst, src, len);
Member

@lemire lemire Apr 28, 2025


64-bit ARM is atomic when loading and storing 64-bit words.

[screenshot of the Arm architecture documentation]

https://developer.arm.com/documentation/ddi0553/latest

@pauldreik
Collaborator Author

pauldreik commented Apr 28, 2025

I have not investigated but as long as it does not trigger a data race warning using standard sanitizers, this PR looks great.

I have now tried running the existing threaded test with thread sanitizer, and it fails...

This gcc/clang-only code (somewhat portable between architectures!) passes thread sanitizer, because the sanitizer understands these builtins (see https://github.com/google/sanitizers/wiki/ThreadSanitizerAtomicOperations):

// 10.52 GB/s and thread sanitizer clean
const std::uint64_t tmp1 =
    __atomic_load_n((const std::uint64_t *)src, __ATOMIC_RELAXED);
const std::uint64_t tmp2 =
    __atomic_load_n((const std::uint64_t *)(src + 8), __ATOMIC_RELAXED);
std::memcpy(dst, &tmp1, sizeof(tmp1));
std::memcpy(dst + 8, &tmp2, sizeof(tmp2));

UPDATE: the above is portable if replaced with std::atomic_ref<std::uint64_t>

This SIMD code is portable between compilers and faster, but does not pass thread sanitizer:

// 12.09 GB/s, not thread sanitizer clean
const __m128i tmp = _mm_load_si128((const __m128i *)src);
_mm_storeu_si128((__m128i *)dst, tmp);

I think it is acceptable to trade some performance for being sanitizer clean, but it means one needs alternate code for gcc/clang vs MSVC.
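A sketch of what that dual code path could look like (the helper name is hypothetical, not from the PR):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper sketching the gcc/clang vs MSVC split described
// above. src must point to an 8-byte-aligned location.
inline std::uint64_t load8_relaxed(const char *src) {
#if defined(__GNUC__) || defined(__clang__)
  // Builtin path: thread sanitizer understands __atomic_load_n.
  return __atomic_load_n(reinterpret_cast<const std::uint64_t *>(src),
                         __ATOMIC_RELAXED);
#else
  // Portable C++20 path for compilers without the builtin (e.g. MSVC).
  auto &word = *reinterpret_cast<std::uint64_t *>(const_cast<char *>(src));
  return std::atomic_ref<std::uint64_t>(word).load(std::memory_order_relaxed);
#endif
}
```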

It would also be interesting to test some tool other than thread sanitizer, perhaps valgrind: https://valgrind.org/docs/manual/hg-manual.html
UPDATE: I tried valgrind --tool=helgrind and it did not give any warnings on any of the constructs, not even if I changed it to an ordinary std::memcpy. It did, however, give warnings about std::barrier which I believe are false positives and which did not go away even with the recommended macros under "Data Race Hunting" in https://gcc.gnu.org/onlinedocs/libstdc++/manual/debug.html.

@lemire
Member

lemire commented Apr 28, 2025

@pauldreik It is trivial to silence the sanitizers... see the function attributes...

https://clang.llvm.org/docs/ThreadSanitizer.html

But it won't work with other tools.
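For reference, the attribute approach from the clang ThreadSanitizer docs could look roughly like this (illustrative only; note that it silences instrumentation rather than making the read formally atomic):

```cpp
#include <cstddef>
#include <cstring>

// Illustration of the function attribute from the clang ThreadSanitizer
// documentation: TSan does not instrument a function marked
// no_sanitize("thread"), so a plain memcpy here is not reported as a
// race, even though the read is not formally atomic. Other tools
// (e.g. helgrind) ignore this attribute.
#if defined(__clang__) || defined(__GNUC__)
__attribute__((no_sanitize("thread")))
#endif
void memcpy_read_unchecked(char *dst, const char *src, std::size_t len) {
  std::memcpy(dst, src, len);
}
```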

@lemire
Member

lemire commented Apr 28, 2025

@pauldreik I think we could silence the sanitizers, that's what v8 does...

https://github.com/v8/v8/blob/611eac6d865e2957e9aa3bfd5d4bdb6f1b7bc660/src/heap/base/stack.cc#L56

@lemire lemire mentioned this pull request Apr 29, 2025
@lemire
Member

lemire commented Apr 29, 2025

I wrote #769 as a simpler alternative. It shies away from kernel-specific code. I recommend not going there for now as it adds complexity that might not be needed in the short run.

(I am somewhat in a hurry to get a new release out.)

@pauldreik
Collaborator Author

I wrote #769 as a simpler alternative. It shies away from kernel-specific code. I recommend not going there for now as it adds complexity that might not be needed in the short run.

(I am somewhat in a hurry to get a new release out.)

Closing this.

@pauldreik pauldreik closed this Apr 30, 2025
@lemire lemire mentioned this pull request May 1, 2025
