ENH: Add AVX2 and AVX512 functionality for numpy.sqrt #12459
In microbenchmarks which measured TSC cycles for SSE2, AVX2 and AVX512 versions of the sqrt function, AVX2 and AVX512 showed a 1.99x and 3.6x performance improvement over SSE2. The experiments were performed on Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz. Detailed numbers are listed below:

|------------+-------------------+---------------------|
| Array size | Speedup with AVX2 | Speedup with AVX512 |
|------------+-------------------+---------------------|
|       5000 |           1.99173 |                3.76 |
|      20000 |           1.99784 |                3.69 |
|      40000 |           1.99861 |                3.85 |
|      80000 |           1.99891 |                3.85 |
|     160000 |           1.99957 |                3.86 |
|     320000 |           1.99521 |                3.86 |
|     640000 |           1.99623 |                3.86 |
|    1280000 |           1.99966 |                3.81 |
|    2560000 |           1.96163 |                3.38 |
|    5120000 |           1.92656 |                3.24 |
|   10240000 |           1.92680 |                3.17 |
|   20480000 |           1.92289 |                3.12 |
|------------+-------------------+---------------------|
Couple of style nits, otherwise LGTM.
I'd prefer this to use runtime selection; compile-time-only options are not very useful for the majority of users. It is also interesting that AVX finally does something for square root on newer CPUs; it used to do nothing, as it was effectively just two sequential SSE2 operations up to and including Haswell CPUs.
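To make the runtime-versus-compile-time distinction concrete, here is a minimal sketch of runtime selection. It is an illustration only, not NumPy's mechanism and not the approach in #11113: the names `PREFERENCE`, `read_cpu_flags` and `pick_variant` are hypothetical, and the flag spellings assume the `flags` line of Linux `/proc/cpuinfo`.

```python
from pathlib import Path

# Hypothetical variant names, ordered from widest to narrowest vector width.
# The spellings follow the "flags" line of Linux /proc/cpuinfo (an assumption).
PREFERENCE = ("avx512f", "avx2", "sse2")


def read_cpu_flags(cpuinfo_path="/proc/cpuinfo"):
    """Return the set of CPU feature flags, or an empty set if unavailable."""
    try:
        text = Path(cpuinfo_path).read_text()
    except OSError:
        return set()
    for line in text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def pick_variant(cpu_flags, built_variants=PREFERENCE):
    """Pick the widest variant that was compiled in and that the CPU supports."""
    for name in PREFERENCE:
        if name in built_variants and name in cpu_flags:
            return name
    return "scalar"  # portable fallback when no SIMD variant is usable


if __name__ == "__main__":
    # The choice is made on the machine that runs the code, not the build host.
    print(pick_variant(read_cpu_flags()))
```

With compile-time selection the equivalent of `pick_variant` is decided once by the build flags, so a binary built without `-march=native` (or built on an older machine) never uses the wider variants even when the user's CPU supports them.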
Hm, I didn't have much time, but a quick test on an
Also no difference in runtime or cycle count between SSE and AVX2 on a Can you provide your benchmark please?
Perhaps these different experiences depend on the compiler/hardware. @juliantaylor I assume you are using (recent?) gcc. @r-devulap Are you using the Intel compiler? What hardware?
I am using gcc 7.3.0 on Intel i9-7900X. I built NumPy using -march=native. I am basically measuring TSC cycles the function
Also gcc 7.3. The difference must be hardware; I checked the assembly that was produced and it looks fine. I also used rdtscp to check the cycle counts.
What confuses me about your numbers is that there is almost no impact from increasing array size. sqrt is relatively slow, so memory bandwidth is not as problematic as it is for simple addition, but I would still have expected an effect when you go beyond the L3 cache.
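The L3 argument can be checked with a little arithmetic on the array sizes from the table. The sketch below rests on assumptions the thread does not state: 4-byte (float32) elements, one input and one output array, and 13.75 MiB of shared L3 on the i9-7900X. The helper names are hypothetical.

```python
# Assumptions (not stated in the thread): float32 elements, one input and one
# output array, and 13.75 MiB of shared L3 on the i9-7900X.
ELEMENT_BYTES = 4
ARRAYS_TOUCHED = 2
L3_BYTES = int(13.75 * 1024 * 1024)

# Array sizes taken from the table in the PR description.
ARRAY_SIZES = [5000, 20000, 40000, 80000, 160000, 320000, 640000,
               1280000, 2560000, 5120000, 10240000, 20480000]


def working_set_bytes(n_elements):
    """Bytes touched by one pass: every element is read once and written once."""
    return n_elements * ELEMENT_BYTES * ARRAYS_TOUCHED


def fits_in_l3(n_elements):
    """True when input plus output would fit in the assumed L3 capacity."""
    return working_set_bytes(n_elements) <= L3_BYTES


for n in ARRAY_SIZES:
    mib = working_set_bytes(n) / (1024 * 1024)
    print(f"{n:>9} elements  {mib:8.2f} MiB  fits in L3: {fits_in_l3(n)}")
```

Under these assumptions the largest size that still fits is 1,280,000 elements; with 8-byte elements the boundary would move one row earlier, to 640,000. Sizes past that boundary are where a bandwidth effect would be expected.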
While I would like to understand why I cannot reproduce any speedups with AVX2/AVX512, and I do not like the way this is implemented, it does not make things worse and I might just be using the wrong hardware (though I have tested a lot of different CPUs over time ...). (And again, I want to plug my runtime variant of this in #11113. There are things to finish and improve in that PR, but the base concept of runtime-selected ufuncs works.)
I think I'll leave this open for a bit; it will be easy enough to backport if wanted. The main thing seems to be the runtime vs compile-time variants, although it would be nice to track down the timing differences. @r-devulap Any chance you could try a lower-end CPU in the same series?
My bad, @juliantaylor is right, I do not see any performance benefits either. I was using -O2 for compiling my micro-benchmarks, and after debugging I realized gcc does not inline small functions consistently. If I inline them or use the -O3 flag, then SSE2/AVX2/AVX512 all perform similarly. VTune analysis shows that the function is memory bound and sqrt vectorization does not seem to help.
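Why wider vectors stop helping once a loop is memory bound can be shown with a very simple bound: the elapsed time is at least the larger of the compute time and the memory-traffic time. The sketch below uses made-up throughput and bandwidth figures purely for illustration; none of them are measurements from this thread, and `elapsed_seconds` is a hypothetical helper.

```python
def elapsed_seconds(n_elements, bytes_per_element, elements_per_second,
                    bytes_per_second):
    """Simple bound: the slower of compute and memory traffic sets the time."""
    compute = n_elements / elements_per_second
    memory = n_elements * bytes_per_element / bytes_per_second
    return max(compute, memory)


# All figures below are invented for illustration; they are not measurements.
N = 10_000_000
BYTES_PER_ELEMENT = 8   # 4-byte load plus 4-byte store, assuming float32
BANDWIDTH = 10e9        # hypothetical sustained bytes per second
SSE2_RATE = 2e9         # hypothetical sqrt results per second with SSE2

# Assume AVX2 and AVX512 double and quadruple the compute rate respectively.
for label, width in (("SSE2", 1), ("AVX2", 2), ("AVX512", 4)):
    t = elapsed_seconds(N, BYTES_PER_ELEMENT, SSE2_RATE * width, BANDWIDTH)
    print(f"{label:7} {t * 1e3:.2f} ms")
```

With these hypothetical figures the memory term is larger than the compute term even for SSE2, so all three variants come out at the same time: widening the vectors shrinks only the term that was not the bottleneck.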
I'll close this then. @r-devulap Thanks for the work and tracking down the cause of the timing discrepancy. @juliantaylor Thanks for the review.