ENH: Convert arithmetic from C universal intrinsics to C++ using Highway #27402
Conversation
numpy/_core/src/multiarray/common.h
|| (defined __clang__ && __clang_major__ < 8))
#  define NPY_ALIGNOF(type) offsetof(struct {char c; type v;}, v)
#define NPY_ALIGNOF(type) __alignof__(type)
What is the motivation to change these macros? You don't seem to be using them anywhere either?
Hi @r-devulap
The change to the NPY_ALIGNOF macro, replacing offsetof with __alignof__, is motivated by the need for compliance with modern C++ standards and to eliminate compiler errors. The original use of offsetof with an anonymous struct may not be standard in C++11 and later, which leads to errors as shown in the screenshot.
Modern compilers, on the other hand, provide alignof, which is built-in, efficient, and compatible with both GCC and Clang, ensuring adherence to the C++11 standard and simplifying the alignment calculation.
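A minimal sketch of the point being made (reduced for illustration; only the macro name comes from the diff above, the rest is a hypothetical reconstruction):

```cpp
#include <cstddef>  // offsetof

// Portable C++11 form: alignof is a standard keyword, whereas
// __alignof__ is a GCC/Clang extension that MSVC rejects, and the old
// offsetof trick defines an anonymous struct inside the operand, which
// standard C++ does not allow.
#define NPY_ALIGNOF(type) alignof(type)

// The old C form, shown for comparison (invalid in standard C++):
// #define NPY_ALIGNOF(type) offsetof(struct {char c; type v;}, v)
```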
- On GCC and Clang (Linux/macOS), the __alignof__ operator is supported and returns the required alignment of a type.
- On MSVC, __alignof__ is not recognized, so alignof can be used instead.
This causes the failure of the Windows CI with the MSVC compiler.
The original use of offsetof with an anonymous struct may not be standard in C++11 and later, which leads to errors as shown in pic.
Does this happen on the main branch as well? I suggest we make a separate PR for this change, because it seems unrelated to this patch.
Does this happen on the main branch as well?
No
@abhishek-iitmadras Is this patch still WIP or ready for review? I noticed this patch is missing the vsx4-specific optimizations in the original source file: numpy/numpy/_core/src/umath/loops_arithmetic.dispatch.c.src Lines 221 to 326 in 49e0743
I am currently trying to modify the code to ensure all CI tests pass. At present, I am facing a challenge in obtaining a Windows machine with the MSVC compiler.
Force-pushed from 527a2f1 to 681c51a
memcpy(dst, src, len * sizeof(T));
} else if (scalar == static_cast<T>(-1)) {
const auto vec_min_val = Set(d, std::numeric_limits<T>::min());
for (npy_intp i = 0; i < len; i += N) {
I strongly recommend splitting up loops into the main body, and a tail. The LoadN/StoreN functions are quite expensive, so in the main loop they should be replaced with LoadU/StoreU.
Ack Done.
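The suggested loop structure can be sketched in scalar form (illustrative only; the lane count N and the negation operation are placeholders, not the PR's code). The main body processes full vectors with cheap unaligned loads/stores (hn::LoadU / hn::StoreU in Highway), and the expensive masked hn::LoadN / hn::StoreN run at most once, for the final len % N elements:

```cpp
#include <cstddef>

// Main-body/tail split: full vectors first (LoadU/StoreU territory),
// then a single partial step for the remainder (LoadN/StoreN territory).
template <typename T, std::size_t N>
void negate_all(const T* src, T* dst, std::size_t len) {
    std::size_t i = 0;
    // Main body: whole vectors of N lanes, unaligned load/store.
    for (; i + N <= len; i += N) {
        for (std::size_t j = 0; j < N; ++j)  // stands in for one vector op
            dst[i + j] = -src[i + j];
    }
    // Tail: at most one partial vector, so a masked load/store is
    // acceptable here because its cost is paid only once.
    for (; i < len; ++i)
        dst[i] = -src[i];
}
```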
Might it help to run MSVC under an emulator (WINE)?
numpy/_core/src/common/npy_atomic.h
return (void *)(*(volatile uint64_t *)obj);
#elif defined(_M_ARM64)
return (uint64_t)__ldar64((unsigned __int64 volatile *)obj);
return (void *)(uint64_t)__ldar64((unsigned __int64 volatile *)obj);
#endif
#else
#if defined(_M_X64) || defined(_M_IX86)
return *(volatile uint32_t *)obj;
return (void *)(*(volatile uint32_t *)obj);
#elif defined(_M_ARM64)
return (uint32_t)__ldar32((unsigned __int32 volatile *)obj);
return (void *)(uint32_t)__ldar32((unsigned __int32 volatile *)obj);
This Windows fix has already been merged, causing a conflict here.
Force-pushed from 681c51a to a94d62a
Force-pushed from a94d62a to e30b14c
It's surprising to hear there are no performance regressions, especially on architectures like x86. As far as I understand, Highway will fall back to the FPU unit to handle integer vector division on architectures that don't provide native instructions for it. This fallback could potentially introduce a performance hit. The C code, on the other hand, relies on division by invariant integers using multiplication, as mentioned at the beginning of the C source file: numpy/numpy/_core/src/umath/loops_arithmetic.dispatch.c.src Lines 25 to 27 in 8c2476b
For further clarification, please take a look at the following sources: numpy/numpy/_core/src/common/simd/intdiv.h Lines 11 to 79 in d315181
numpy/numpy/_core/src/common/simd/sse/arithmetic.h Lines 88 to 262 in 8c2476b
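The multiplication trick the C code relies on can be sketched in scalar form. This is a hedged illustration of the Granlund-Montgomery idea in Lemire's formulation, not NumPy's actual intdiv.h implementation; it assumes a GCC/Clang-style unsigned __int128 and a divisor d >= 2:

```cpp
#include <cstdint>

// Division by an invariant 32-bit integer via multiplication: the
// expensive division happens once per divisor (here, in the
// constructor), and each element then needs only a multiply and shift,
// which vectorize well even without native integer-division instructions.
struct FastDiv32 {
    uint64_t M;  // ceil(2^64 / d), precomputed once per divisor
    explicit FastDiv32(uint32_t d)
        : M(UINT64_MAX / d + 1) {}  // requires d >= 2
    uint32_t divide(uint32_t n) const {
        // floor(n / d) equals the high 64 bits of M * n,
        // exact for every 32-bit n (Lemire's fastmod lemma).
        return (uint32_t)(((unsigned __int128)M * n) >> 64);
    }
};
```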
Hm, are we sure floating-point div instructions are actually used? I have also seen clang transform div into mul in a similar way.
@abhishek-iitmadras Could you rebase to main to fix the QEMU CI failures? Apart from that, you still seem to have a few more CI failures:
Force-pushed from e30b14c to 9bebb5f
Force-pushed from 9bebb5f to 6d52554
Force-pushed from 3b537e4 to e2c93c7
Out of all 9 CI failures, I am focusing on the one associated with the Pyodide build, as it is the only one that provides a clear reason for the failure. I would appreciate any insights into why these failures appear only in the Pyodide build, as well as any suggestions on how to resolve them.
Force-pushed from dcf8087 to 364d295
For anyone interested, here are some techniques for speeding up division by a non-constant: http://0x80.pl/notesen/2024-12-21-uint8-division.html
Hi @seiko2plus, I am extremely sorry for the huge delay in replying to you regarding benchmarking performance on AVX512 and AVX2; I was busy with some other framework optimization. Below are the performance measurements:
AVX2
Update the commit message accordingly.
So benchmarks got significantly slower?
Yes. After addressing these 9 CI failures, I will proceed with optimization.
Force-pushed from cc9e074 to 795cc9b
In my opinion, some axioms typically don’t require explicit testing. My intention was to guide you toward a more efficient approach to save time and effort. However, these benchmark results don’t seem accurate to me. The AVX2 and AVX512 benchmarks appear equivalent. Are you certain that AVX512 was correctly disabled before running the AVX2 benchmark? Also, have you tried increasing the array length used in the test to improve stability? see:
What kind of optimization are you planning to perform? Could you elaborate?
I used the following method to build for AVX2 only:
Below is a screenshot of the build configuration. Now for benchmarking:
Correct me if I am doing something wrong. @seiko2plus
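The actual commands are in screenshots that did not survive the export. As a rough, hypothetical reconstruction (the exact flags used here are unknown): NumPy's Meson build exposes cpu-baseline and cpu-dispatch options that can pin the SIMD feature level, along the lines of:

```shell
# Hypothetical sketch: restrict the baseline to AVX2 and disable
# runtime dispatch of wider features such as AVX-512.
pip install . --no-build-isolation \
  -Csetup-args=-Dcpu-baseline=avx2 \
  -Csetup-args=-Dcpu-dispatch=none

# Verify which SIMD extensions the resulting build enables:
python -c "import numpy; numpy.show_config()"
```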
No
I already tried points 1 and 2; after that the build is OK, but a few tests are failing even on my aarch64 machine. Maybe I need to get more understanding of this.
Force-pushed from c3ecde7 to 4ac89ba
Force-pushed from 5fdd529 to 8699aa9
const auto different_signs = hn::Xor(src_sign, scalar_sign);

auto adjustment = hn::And(different_signs, has_remainder);
vec_div = hn::IfThenElse(adjustment, hn::Sub(vec_div, hn::Set(d, static_cast<T>(1))), vec_div);
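In scalar terms, the adjustment sequence above derives Python-style floor division from the hardware's truncating division; a sketch of the same logic (illustrative, not the PR's code):

```cpp
// Hardware division truncates toward zero; NumPy's integer division
// floors. When the operands have different signs and the division left
// a remainder, the truncated quotient is one too large, so subtract 1.
inline long floor_div(long a, long b) {
    long q = a / b;       // truncating quotient (hn::Div)
    long r = a - q * b;   // remainder via multiply-back (hn::Mul, hn::Ne)
    if (r != 0 && ((a ^ b) < 0))  // has_remainder && different_signs
        --q;                      // the conditional adjustment
    return q;
}
```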
Instead of IfThenElse(adjust, Sub(div, one), div) we can use MaskedSubOr(div, adjust, div, one). This would be faster on AVX-512.
Done
const auto has_remainder = hn::Ne(vec_src, vec_mul);
const auto src_sign = hn::Lt(vec_src, vec_zero);
const auto scalar_sign = hn::Lt(vec_scalar, vec_zero);
const auto different_signs = hn::Xor(src_sign, scalar_sign);
Instead of Lt, Lt, Xor, we can Xor src and scalar, and then test only that one sign bit.
Done
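The simplification can be shown in scalar form (illustrative function names, not the PR's code): the original computes two comparisons and XORs the resulting masks, while the suggestion XORs the values first and tests a single sign bit.

```cpp
#include <cstdint>

// Original formulation: Lt, Lt, then Xor of the two masks.
inline bool different_signs_orig(int32_t a, int32_t b) {
    return (a < 0) != (b < 0);
}

// Suggested formulation: Xor the values, then one sign test.
// The sign bit of a ^ b is set exactly when the signs differ.
inline bool different_signs_fast(int32_t a, int32_t b) {
    return (a ^ b) < 0;
}
```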
const auto vec_src = hn::LoadU(d, src + i);
auto vec_div = hn::Div(vec_src, vec_scalar);
const auto vec_mul = hn::Mul(vec_div, vec_scalar);
const auto has_remainder = hn::Ne(vec_src, vec_mul);
Instead of And(diff, Ne(src, mul)) consider AndNot(Eq(src, mul), diff) - bit faster on x86.
Done
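The identity behind this suggestion, with masks viewed as booleans (illustrative, not the PR's code): AndNot(a, b) computes !a && b, and the negation of Eq is Ne, so the two forms are equivalent while the AndNot form folds the negation into one instruction on x86.

```cpp
// Original form: And(diff, Ne(src, mul)).
inline bool adjust_orig(bool diff, int src, int mul) {
    return diff && (src != mul);
}

// Suggested form: AndNot(Eq(src, mul), diff) == !(src == mul) && diff.
inline bool adjust_andnot(bool diff, int src, int mul) {
    return !(src == mul) && diff;
}
```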
thanks @jan-wassenberg
Force-pushed from 218f319 to 512e512
This patch converts loops_arithmetic.dispatch.c.src to numpy/_core/src/umath/loops_arithmetic.dispatch.cpp using Google Highway, as per the NEP 54 resolutions.
All requirements pass: