
ENH: Add AVX2 and AVX512 functionality for numpy.sqrt #12459


Closed
wants to merge 2 commits

Conversation

r-devulap
Member

r-devulap commented Nov 28, 2018

In microbenchmarks that measured TSC cycles for the SSE2, AVX2, and AVX512 versions of the sqrt function, AVX2 and AVX512 showed roughly 1.99x and 3.6x performance improvements over SSE2, respectively. The experiments were performed on an Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz. Detailed numbers are listed below:

| Array size | Speedup with AVX2 | Speedup with AVX512 |
|------------|-------------------|---------------------|
| 5000       | 1.99173           | 3.76                |
| 20000      | 1.99784           | 3.69                |
| 40000      | 1.99861           | 3.85                |
| 80000      | 1.99891           | 3.85                |
| 160000     | 1.99957           | 3.86                |
| 320000     | 1.99521           | 3.86                |
| 640000     | 1.99623           | 3.86                |
| 1280000    | 1.99966           | 3.81                |
| 2560000    | 1.96163           | 3.38                |
| 5120000    | 1.92656           | 3.24                |
| 10240000   | 1.92680           | 3.17                |
| 20480000   | 1.92289           | 3.12                |
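
For context, a vectorized sqrt kernel along the lines of what this PR adds could look roughly like the minimal sketch below, using AVX512F intrinsics. This is illustrative only and not the code from the PR; the function name `sqrt_f32_avx512` is hypothetical, and it assumes a contiguous single-precision array.

```c
/* Minimal illustrative sketch (not the PR's code): square-root a contiguous
 * float array 16 elements at a time with AVX512F, handling the tail with a
 * masked load/store. Build with e.g. gcc -O3 -mavx512f. */
#include <immintrin.h>
#include <stddef.h>

void sqrt_f32_avx512(const float *src, float *dst, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 v = _mm512_loadu_ps(src + i);
        _mm512_storeu_ps(dst + i, _mm512_sqrt_ps(v));
    }
    if (i < n) {
        /* mask covers only the n - i remaining lanes */
        __mmask16 m = (__mmask16)((1u << (n - i)) - 1u);
        __m512 v = _mm512_maskz_loadu_ps(m, src + i);
        _mm512_mask_storeu_ps(dst + i, m, _mm512_sqrt_ps(v));
    }
}
```

An AVX2 variant would look much the same with `__m256`, `_mm256_sqrt_ps`, and 8-element strides, plus a scalar tail loop since AVX2 lacks the AVX512 mask registers.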
@charris
Member

charris commented Nov 29, 2018

Couple of style nits, otherwise LGTM.

@juliantaylor
Contributor

juliantaylor commented Nov 29, 2018

I'd prefer this to use runtime selection; compile-time-only options are not very useful for the majority of users.

Also interesting that AVX finally does something for square root on newer CPUs; it used to do nothing, as it was effectively just two sequential SSE2 operations up to and including Haswell CPUs.
I'll have a look at verifying these benchmarks on some newer CPUs at work tomorrow.
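
As a rough illustration of what runtime selection means here (this is not the mechanism from #11113, and the kernel names are hypothetical placeholders), a dispatcher could pick the widest kernel the running CPU supports, e.g. with GCC's `__builtin_cpu_supports`:

```c
/* Illustrative sketch only: choose a sqrt kernel at runtime based on what
 * the executing CPU supports, instead of fixing the choice at compile time.
 * The three kernels are hypothetical placeholders. */
#include <stddef.h>

void sqrt_f32_sse2(const float *src, float *dst, size_t n);
void sqrt_f32_avx2(const float *src, float *dst, size_t n);
void sqrt_f32_avx512(const float *src, float *dst, size_t n);

typedef void (*sqrt_kernel)(const float *, float *, size_t);

static sqrt_kernel select_sqrt_kernel(void)
{
    if (__builtin_cpu_supports("avx512f"))
        return sqrt_f32_avx512;
    if (__builtin_cpu_supports("avx2"))
        return sqrt_f32_avx2;
    return sqrt_f32_sse2;   /* baseline: SSE2 is guaranteed on x86-64 */
}
```

The selection would typically run once at module load and the resulting function pointer be cached, so the per-call overhead is a single indirect call.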

@juliantaylor
Contributor

Hm, I didn't have much time, but a quick test on an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz showed no difference in performance between SSE2, AVX2, and AVX512, despite the benchmark definitely running the right code paths.

@juliantaylor
Contributor

juliantaylor commented Dec 1, 2018

Also no difference in runtime or cycle count between SSE and AVX2 on an AMD Ryzen 7 1700X.

Can you provide your benchmark please?

@charris
Member

charris commented Dec 4, 2018

Perhaps these different experiences depend on the compiler/hardware. @juliantaylor I assume you are using (recent?) gcc. @r-devulap Are you using the Intel compiler? What hardware?

@r-devulap
Member Author

r-devulap commented Dec 5, 2018

I am using gcc 7.3.0 on an Intel i9-7900X. I built NumPy using -march=native. I am basically measuring the TSC cycles the DOUBLE_sqrt/FLOAT_sqrt function takes to execute (using the cpuid, rdtsc and rdtscp, cpuid instruction pairs before and after, respectively). I did disable turbo and set the intel_pstate governor to "performance" mode (to reduce variance across multiple runs). @juliantaylor How are you building NumPy?
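
For anyone who wants to reproduce this kind of measurement, the cpuid/rdtsc ... rdtscp/cpuid pattern described above looks roughly like the sketch below. This is illustrative only, not the actual benchmark from this PR; the timed loop is a stand-in for the FLOAT_sqrt inner loop.

```c
/* Illustrative TSC measurement sketch: serialize with cpuid, read the start
 * count with rdtsc, run the code under test, then read the end count with
 * rdtscp followed by cpuid. Assumes x86-64 and GCC. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp */
#include <cpuid.h>       /* __cpuid */

static inline uint64_t tsc_begin(void)
{
    unsigned int a, b, c, d;
    __cpuid(0, a, b, c, d);       /* serialize before reading the counter */
    return __rdtsc();
}

static inline uint64_t tsc_end(void)
{
    unsigned int aux, a, b, c, d;
    uint64_t t = __rdtscp(&aux);  /* waits for preceding instructions */
    __cpuid(0, a, b, c, d);       /* keep later instructions from hoisting */
    return t;
}

int main(void)
{
    enum { N = 5000 };
    static float in[N], out[N];
    for (int i = 0; i < N; i++)
        in[i] = (float)(i + 1);

    uint64_t t0 = tsc_begin();
    for (int i = 0; i < N; i++)   /* stand-in for the loop being benchmarked */
        out[i] = __builtin_sqrtf(in[i]);
    uint64_t t1 = tsc_end();

    printf("TSC cycles: %llu (checksum %f)\n",
           (unsigned long long)(t1 - t0), (double)out[N - 1]);
    return 0;
}
```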

@juliantaylor
Contributor

Also gcc 7.3. The difference must be the hardware; I checked the generated assembly and it looks fine. I also used rdtscp to check the cycle counts.
The Intel(R) Xeon(R) Gold 6130 is a server CPU, not a high-end desktop variant, but it is of the same generation and has the same number of FMA units. I rather doubt that something as fundamental as sqrt would differ between these types of CPUs.
But I could be wrong. A major difference like that ought to be documented somewhere, though.

@juliantaylor
Contributor

juliantaylor commented Dec 6, 2018

What confuses me about your numbers is that there is almost no impact from increasing the array size. sqrt is relatively slow, so memory bandwidth is not as problematic as it is for a simple addition, but I would still have expected an effect once you go beyond the L3 cache.
But maybe RAM has just gotten faster since I was deep into this stuff :/

@juliantaylor
Contributor

juliantaylor commented Dec 6, 2018

While I would like to understand why I cannot reproduce any speedups with AVX2/AVX512, and I do not like the way this is implemented, it does not make things worse, and I might just be using the wrong hardware (though I have tested a lot of different CPUs over time ...).
So from my side this can be merged at the more active maintainers' discretion.

(And again, I want to plug my runtime variant of this in #11113. There are things to finish and improve in that PR, but the base concept of runtime-selected ufuncs works.)

@charris
Member

charris commented Dec 6, 2018

I think I'll leave this open for a bit; it will be easy enough to backport if wanted. The main question seems to be runtime vs. compile-time selection, although it would be nice to track down the timing differences. @r-devulap Any chance you could try a lower-end CPU in the same series?

@r-devulap
Member Author

r-devulap commented Dec 7, 2018

My bad, @juliantaylor is right: I do not see any performance benefits either. I was using -O2 to compile my micro-benchmarks, and after debugging I realized that gcc does not inline small functions consistently at that level. If I inline them or use the -O3 flag, then SSE2/AVX2/AVX512 all perform similarly. VTune analysis shows that the function is memory bound, so vectorizing sqrt does not seem to help.
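
As an aside, one common way to guard against the inconsistent inlining at -O2 described above is to force inlining of the small helper being timed. This is a generic GCC idiom, not code from this PR, and the helper name is made up:

```c
/* Forcing the tiny helper to be inlined keeps call overhead out of the
 * measured cycle counts even at -O2. Generic GCC idiom, for illustration. */
static inline __attribute__((always_inline))
float sqrtf_one(float x)
{
    return __builtin_sqrtf(x);
}
```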

@charris
Copy link
Member

charris commented Dec 7, 2018

I'll close this then. @r-devulap Thanks for the work and tracking down the cause of the timing discrepancy. @juliantaylor Thanks for the review.

Labels
01 - Enhancement · component: numpy._core · component: SIMD
4 participants