Accelerate atomic_binary_to_base64 #767
lemire
left a comment
I have not investigated but as long as it does not trigger a data race warning using standard sanitizers, this PR looks great.
```cpp
#if SIMDUTF_ATOMIC_REF
void implementation::memcpy_atomic_read(char *const dst, const char *const src,
                                        const std::size_t len) const noexcept {
  scalar::memcpy_atomic_read(dst, src, len);
```
I have now tried running the existing threaded test with thread sanitizer, and it fails...

This gcc/clang-only code (somewhat portable between architectures!) passes thread sanitizer, because thread sanitizer understands these builtins (see https://github.com/google/sanitizers/wiki/ThreadSanitizerAtomicOperations):

```cpp
// 10.52 GB/s and thread sanitizer clean
const std::uint64_t tmp1 = __atomic_load_n((const std::uint64_t *)src, __ATOMIC_RELAXED);
const std::uint64_t tmp2 = __atomic_load_n((const std::uint64_t *)(src + 8), __ATOMIC_RELAXED);
std::memcpy(dst, &tmp1, sizeof(tmp1));
std::memcpy(dst + 8, &tmp2, sizeof(tmp2));
```

UPDATE: the above is portable if replaced with

This SIMD code is portable between compilers and faster, but does not pass thread sanitizer:

```cpp
// 12.09 GB/s, not thread sanitizer clean
const __m128i tmp = _mm_load_si128((const __m128i *)src);
_mm_storeu_si128((__m128i *)dst, tmp);
```

I think it is acceptable to take somewhat lower performance in exchange for being sanitizer clean; however, it means one needs alternate code for gcc/clang vs msvc. It would also be interesting to test with a tool other than thread sanitizer, perhaps valgrind: https://valgrind.org/docs/manual/hg-manual.html
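For reference, the 16-byte snippet above could be generalized to an arbitrary length roughly as follows. This is only a sketch, not code from the PR: the name `relaxed_atomic_copy` is hypothetical, it relies on the gcc/clang `__atomic_load_n` builtin, and it assumes `src` is 8-byte aligned (as the original snippet implicitly does when it casts `src` to `const std::uint64_t *`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of simdutf): copy len bytes from src to dst,
// reading src with relaxed atomic loads. Requires the gcc/clang builtins.
inline void relaxed_atomic_copy(char *dst, const char *src, std::size_t len) {
  // Assumption: src is 8-byte aligned, so the 8-byte loads are aligned.
  assert(reinterpret_cast<std::uintptr_t>(src) % alignof(std::uint64_t) == 0);
  std::size_t i = 0;
  // Bulk of the copy: one relaxed 8-byte atomic load per iteration.
  for (; i + sizeof(std::uint64_t) <= len; i += sizeof(std::uint64_t)) {
    const std::uint64_t word = __atomic_load_n(
        reinterpret_cast<const std::uint64_t *>(src + i), __ATOMIC_RELAXED);
    std::memcpy(dst + i, &word, sizeof(word));
  }
  // Tail: fewer than 8 bytes remain, so load them one byte at a time.
  for (; i < len; ++i) {
    dst[i] = __atomic_load_n(src + i, __ATOMIC_RELAXED);
  }
}
```

Only the reads from `src` are atomic here; the writes to `dst` are plain stores, matching the snippet above.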
@pauldreik It is trivial to silence the sanitizers... see the function attributes... https://clang.llvm.org/docs/ThreadSanitizer.html But it won't work with other tools.
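A minimal sketch of what such an attribute could look like, assuming gcc or clang. The macro name `EXAMPLE_NO_SANITIZE_THREAD` and the function `unchecked_copy` are made up for illustration; only the `no_sanitize("thread")` attribute itself comes from the Clang documentation linked above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical macro name; expands to nothing on other compilers.
#if defined(__clang__) || defined(__GNUC__)
#define EXAMPLE_NO_SANITIZE_THREAD __attribute__((no_sanitize("thread")))
#else
#define EXAMPLE_NO_SANITIZE_THREAD
#endif

// Hypothetical function: a plain byte copy whose loads and stores are
// excluded from thread sanitizer instrumentation by the attribute.
EXAMPLE_NO_SANITIZE_THREAD
inline void unchecked_copy(char *dst, const char *src, std::size_t len) {
  assert(dst != nullptr && src != nullptr);
  for (std::size_t i = 0; i < len; ++i) {
    dst[i] = src[i];
  }
}
```

The attribute only suppresses the report; it does not make the accesses atomic, which is why it does not help with other race detectors.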
@pauldreik I think we could silence the sanitizers, that's what v8 does... https://github.com/v8/v8/blob/611eac6d865e2957e9aa3bfd5d4bdb6f1b7bc660/src/heap/base/stack.cc#L56
I wrote #769 as a simpler alternative. It shies away from kernel-specific code. I recommend not going there for now, as it adds complexity that might not be needed in the short run. (I am somewhat in a hurry to get a new release out.)
closing this |
This is an attempt to make `atomic_binary_to_base64` faster by using architecture-specific knowledge to do atomic reads.

The base64 benchmark, run before and after the change, shows a 3.7x speedup on a 10M file: