ENH: use AVX for float32 and float64 implementation of sqrt, square, absolute, reciprocal, rint, floor, ceil and trunc #13885
Conversation
Force-pushed from de47213 to 93e2e3f
The failing test is a newly added test (by this patch) that fails on Windows Python 3.6 32-bit:
Ideally, an FP exception should be raised if a negative input is passed to sqrt (as happens on other platforms). Does anyone know why this fails on Windows Python 3.6 32-bit?
Force-pushed from 76a620b to 2bfb768
Added AVX implementations of floor, ceil, trunc and rint. As with the other AVX implementations, these handle strided arrays as well. These functions see up to a 14x speed-up. Detailed numbers are presented below:
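One way to sanity-check these new paths on any NumPy build is to verify that the rounding ufuncs give identical results on contiguous and strided inputs, and match the expected scalar semantics. A minimal sketch (the array size and stride are arbitrary choices here, not anything the patch requires):

```python
import numpy as np

# The AVX rounding paths must agree with scalar semantics on both
# contiguous and strided float32 arrays.
a = np.linspace(-5.0, 5.0, 10000, dtype=np.float32)
strided = a[::4]

for fn in (np.floor, np.ceil, np.trunc, np.rint):
    # contiguous and strided inputs must produce identical values
    assert np.array_equal(fn(strided), fn(a)[::4])

# spot-check a few known values
assert np.floor(np.float32(1.7)) == 1.0
assert np.ceil(np.float32(1.2)) == 2.0
assert np.trunc(np.float32(-1.7)) == -1.0
assert np.rint(np.float32(2.5)) == 2.0   # round-half-to-even
```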
Force-pushed from 2bfb768 to 226ae20
Rebased with master. Can someone help review the code, please?
Should these have tests added to
I do not think it is necessary, though. There is no change to the way these functions are computed, and all of these functions directly use the hardware instructions anyway (sqrt is computed using
OK. The code, as much as I understand it, looks reasonable. Maybe someone else would like to take a look?
ping .. anyone else who can review the code?
@mattip how do you want to proceed with this patch?
@r-devulap sorry this is taking so long. I think this PR has the same problem with duplicate loops that gh-14554 cleaned up; could you take a quick look? Also, a rebase/merge with master is needed to clear the conflict. FWIW, adding this code on my Ubuntu 18.04 dev system makes the wheel grow to 10_655_010 bytes, adding 45_101 bytes or about 0.5%. I think this is acceptable for the speed boost, so I propose to merge it.
thanks @mattip, I will take a look and fix it.
(1) Workaround for a bug in clang 6: added a missing GCC attribute to the prototype of the ISA_sqrt_TYPE function, which otherwise leads to a weird build failure in clang 6 (gcc and clang 7.0 don't have this issue). (2) Changed np.float128 to np.longdouble in tests: NumPy on Windows doesn't support the np.float128 dtype. (3) GCC 4.8/5.0 doesn't support the _mm512_abs_ps/pd intrinsics, and clang 6 generates an invalid exception when computing the absolute value of +/-np.nan.
Force-pushed from 226ae20 to 5ee46de
Rebased with master and added a commit to fix the duplicated inner loop for float16. Let me know if that looks correct.
Please independently verify these benchmarks. We often get this MR (using AVX for e.g. sqrt), and we always rejected it because the benchmarks turned out to be wrong and there were no performance improvements: the hardware did not actually use the wider registers in parallel. I think the last time I verified this was on a Skylake Xeon Gold, which should still be the latest Intel generation.
Do we need manually written code for this?
gcc refuses to vectorize this and gave me this info:
@juliantaylor @mattip ping ...
Did you commit the benchmarks you are showing above? I do not see them. Benchmarking results on a machine with AVX-512:
I just added a commit for the benchmarks.
Comparing the pre- and post-PR benchmarks, I see large changes. I am not sure why the single-strided cases are showing such a large speed-up. Do these results make sense?
I think these numbers make sense. NumPy's current implementation of the rounding functions ceil, floor, rint and trunc is scalar even for stride 1, so the 10x speed-up for these functions is expected (my numbers are similar too). As expected, we do not see any significant speed-up for sqrt, square, reciprocal and absolute for stride 1, since these are currently implemented with SSE.
Just out of curiosity, what CPU did you run these benchmarks on?
Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
@juliantaylor ping
if (!run_unary_@isa@_sqrt_@TYPE@(args, dimensions, steps)) {
    UNARY_LOOP {
        const @type@ in1 = *(@type@ *)ip1;
        *(@type@ *)op1 = npy_sqrt@typesub@(in1);
    }
}
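The loop above follows a simple dispatch pattern: try the AVX fast path, and only run the scalar loop when the fast path declines. A minimal Python sketch of the same pattern (the names `try_vectorized_sqrt` and `sqrt_loop` are illustrative stand-ins, not NumPy internals):

```python
import numpy as np

def try_vectorized_sqrt(src, dst):
    # Stand-in for run_unary_@isa@_sqrt_@TYPE@: handle only the cases
    # the SIMD kernel supports (here, pretend that is contiguous
    # float32) and report whether it actually ran.
    if src.dtype == np.float32 and src.flags['C_CONTIGUOUS']:
        np.sqrt(src, out=dst)
        return True
    return False

def sqrt_loop(src, dst):
    # Mirrors the C inner loop: if the fast path declines,
    # fall back to an element-by-element scalar loop.
    if not try_vectorized_sqrt(src, dst):
        for i in range(len(src)):
            dst[i] = np.sqrt(src[i])

x = np.array([1.0, 4.0, 9.0], dtype=np.float64)  # float64 -> scalar path
out = np.empty_like(x)
sqrt_loop(x, out)
```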
Is the compiler able to generate the AVX code automatically if you use

if (IS_OUTPUT_BLOCKABLE_UNARY(sizeof(@type@), @REGISTER_SIZE@)) {
    UNARY_LOOP { ... }
}
else {
    // as above
    UNARY_LOOP { ... }
}
We use this trick in all sorts of places today to encourage it to generate optimized code.
I tried several options with GCC 9.2 and found the following:
- Any compiler-generated vectorized loop for floating point seems to require extra compiler options like -ffast-math (see https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using). Here is the code for an example of the sqrt loop with and without this option. There are several problems with this path: (1) -ffast-math obviously should not be used as a global compile option, and (2) the code generated with this option ends up using a combination of vrsqrt14ps and vmulps instructions to compute the square root, which is neither accurate nor fast (vrsqrt14ps is only accurate up to the 6th decimal place, and I have no idea why even the latest GCC won't use a simple vsqrtps instruction instead!).
- The other problem is that, no matter what option I try, I could not get GCC to vectorize the strided-array case (see an example here). Even if we could properly vectorize the case where stride = 1, as far as I know, we cannot auto-vectorize general strided arrays.
I finally learnt why GCC won't use vsqrtps! The vrsqrt14ps instruction takes 1-3 cycles, whereas vsqrtps takes > 14 cycles. So it is basically faster to compute invsqrt, multiply it with the input, and then correct it with one step of Newton-Raphson than to compute an accurate sqrt directly. -ffast-math obviously chooses speed over accuracy. This logic works for single precision but not for double precision, where GCC uses the vsqrtpd instruction (see code here) :)
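A rough numerical illustration of that trade-off (not GCC's actual output): start from a cheap reciprocal-square-root estimate — here the classic bit-trick guess stands in for vrsqrt14ps, which is far more accurate — refine it with Newton-Raphson, then multiply by x to get sqrt(x):

```python
import numpy as np

def rsqrt_estimate(x):
    # Cheap initial 1/sqrt(x) guess via the well-known bit trick;
    # it plays the role of vrsqrt14ps in this sketch.
    i = x.astype(np.float32).view(np.uint32)
    i = np.uint32(0x5F3759DF) - (i >> np.uint32(1))
    return i.view(np.float32)

def fast_sqrt(x):
    y = rsqrt_estimate(x)
    # Two Newton-Raphson steps for f(y) = 1/y**2 - x; each step
    # roughly squares the relative error.
    for _ in range(2):
        y = y * (1.5 - 0.5 * x * y * y)
    # sqrt(x) = x * rsqrt(x): the multiply -ffast-math emits (vmulps
    # after the estimate) instead of a full-latency vsqrtps.
    return x * y

x = np.linspace(0.01, 100.0, 1000, dtype=np.float32)
rel_err = np.abs(fast_sqrt(x) - np.sqrt(x)) / np.sqrt(x)
```

With the bit-trick guess the error starts around a few percent, so two refinement steps are used here; vrsqrt14ps starts accurate enough that one step suffices, which is exactly the sequence GCC emits.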
Thanks @r-devulap.
thanks @mattip !
/*
 * Replace masked elements with 1.0f to avoid divide by zero fp
 * exception in reciprocal
 */
x = @isa@_set_masked_lanes_ps(x, ones_f, inv_load_mask);
I'm trying to understand this line, "How and why"?
@r-devulap, you forgot to remove it, right?
The masked load instruction loads 0 for elements whose mask bit is 0 (which happens for the trailing end of the array). For reciprocal, this causes a 1/0, which raises a divide-by-zero exception. This line replaces the zeros with ones to avoid that exception.
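A pure-NumPy sketch of that fix (illustrative only — the real code uses masked AVX intrinsics): the inactive tail lanes come back as 0.0 from the masked load, so they are overwritten with 1.0 before the division ever happens.

```python
import numpy as np

def masked_reciprocal(buf, mask):
    # What a masked SIMD load produces: 0.0 in lanes where mask is 0.
    lanes = np.where(mask, buf, np.float32(0.0))
    # The fix: set the inactive lanes to 1.0 so 1/lane never divides
    # by zero (mirrors @isa@_set_masked_lanes_ps(x, ones_f, mask)).
    lanes = np.where(mask, lanes, np.float32(1.0))
    with np.errstate(divide='raise'):   # would trip without the fix
        recip = np.float32(1.0) / lanes
    # Only the active lanes are ever stored back.
    return recip[mask]

buf = np.array([2.0, 4.0, 0.0, 0.0], dtype=np.float32)
mask = np.array([True, True, False, False])
out = masked_reciprocal(buf, mask)
```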
@r-devulap, "the trailing end of the array", oh it makes sense now. thank you
By leveraging AVX gather instructions, this patch enables SIMD processing of strided arrays, which are currently handled in a scalar fashion. Performance of functions like sqrt improves by 9x. Detailed performance numbers are presented below (array size = 10000 for every benchmark):
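A conceptual sketch of what the gather-based strided path does (pure NumPy, not the intrinsics): build a vector of lane indices — the input a gather instruction like _mm512_i32gather_ps consumes — fetch all lanes in one operation, then apply the vector operation to the whole register.

```python
import numpy as np

def gathered_sqrt(buf, start, stride, nlanes):
    # Index vector fed to the gather (analogue of _mm512_i32gather_ps).
    idx = start + stride * np.arange(nlanes)
    lanes = buf[idx]       # one "gather" instead of nlanes scalar loads
    return np.sqrt(lanes)  # one vector sqrt over all lanes

buf = np.arange(1.0, 33.0, dtype=np.float32)
out = gathered_sqrt(buf, 0, 4, 8)   # every 4th element, 8 lanes
```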