
ENH: Add AVX2 and AVX512 functionality for numpy.sqrt #12459


Closed
wants to merge 2 commits

Conversation

r-devulap
Member

r-devulap commented Nov 28, 2018

In microbenchmarks that measured TSC cycles for the SSE2, AVX2, and AVX512 versions of the sqrt function, AVX2 and AVX512 showed roughly 1.99x and 3.6x performance improvements over SSE2, respectively. The experiments were performed on an Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz. Detailed numbers are listed below:

| Array size | Speedup with AVX2 | Speedup with AVX512 |
|------------|-------------------|---------------------|
| 5000       | 1.99173           | 3.76                |
| 20000      | 1.99784           | 3.69                |
| 40000      | 1.99861           | 3.85                |
| 80000      | 1.99891           | 3.85                |
| 160000     | 1.99957           | 3.86                |
| 320000     | 1.99521           | 3.86                |
| 640000     | 1.99623           | 3.86                |
| 1280000    | 1.99966           | 3.81                |
| 2560000    | 1.96163           | 3.38                |
| 5120000    | 1.92656           | 3.24                |
| 10240000   | 1.92680           | 3.17                |
| 20480000   | 1.92289           | 3.12                |
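
For context, a vectorized sqrt kernel along the lines of what this PR adds could look roughly like the minimal sketch below, using AVX512F intrinsics. This is illustrative only and not the code from the PR; the function name `sqrt_f32_avx512` is hypothetical, and it assumes a contiguous single-precision array.

```c
/* Minimal illustrative sketch (not the PR's code): square-root a contiguous
 * float array 16 elements at a time with AVX512F, handling the tail with a
 * masked load/store. Build with e.g. gcc -O3 -mavx512f. */
#include <immintrin.h>
#include <stddef.h>

void sqrt_f32_avx512(const float *src, float *dst, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 v = _mm512_loadu_ps(src + i);
        _mm512_storeu_ps(dst + i, _mm512_sqrt_ps(v));
    }
    if (i < n) {
        /* mask covers only the n - i remaining lanes */
        __mmask16 m = (__mmask16)((1u << (n - i)) - 1u);
        __m512 v = _mm512_maskz_loadu_ps(m, src + i);
        _mm512_mask_storeu_ps(dst + i, m, _mm512_sqrt_ps(v));
    }
}
```

An AVX2 variant would look much the same with `__m256`, `_mm256_sqrt_ps`, and 8-element strides, plus a scalar tail loop since AVX2 lacks the AVX512 mask registers.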
@charris
Member

charris commented Nov 29, 2018

Couple of style nits, otherwise LGTM.

@juliantaylor
Contributor

juliantaylor commented Nov 29, 2018

I'd prefer this to use runtime selection; compile-time-only options are not very useful for the majority of users.

Also interesting that AVX finally does something for square root on newer CPUs; it used to do nothing, as it was effectively just two sequential SSE2 operations up to and including Haswell CPUs.
I'll have a look at verifying these benchmarks on some newer CPUs at work tomorrow.
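
As a rough illustration of what runtime selection means here (this is not the mechanism from #11113, and the kernel names are hypothetical placeholders), a dispatcher could pick the widest kernel the running CPU supports, e.g. with GCC's `__builtin_cpu_supports`:

```c
/* Illustrative sketch only: choose a sqrt kernel at runtime based on what
 * the executing CPU supports, instead of fixing the choice at compile time.
 * The three kernels are hypothetical placeholders. */
#include <stddef.h>

void sqrt_f32_sse2(const float *src, float *dst, size_t n);
void sqrt_f32_avx2(const float *src, float *dst, size_t n);
void sqrt_f32_avx512(const float *src, float *dst, size_t n);

typedef void (*sqrt_kernel)(const float *, float *, size_t);

static sqrt_kernel select_sqrt_kernel(void)
{
    if (__builtin_cpu_supports("avx512f"))
        return sqrt_f32_avx512;
    if (__builtin_cpu_supports("avx2"))
        return sqrt_f32_avx2;
    return sqrt_f32_sse2;   /* baseline: SSE2 is guaranteed on x86-64 */
}
```

The selection would typically run once at module load and the resulting function pointer be cached, so the per-call overhead is a single indirect call.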

@juliantaylor
Contributor

Hm, I didn't have much time, but a quick test on an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz showed no difference in performance between SSE2, AVX2, and AVX512, despite the benchmark definitely running the right code paths.

@juliantaylor
Contributor

juliantaylor commented Dec 1, 2018

Also no difference in runtime or cycle count between SSE and AVX2 on an AMD Ryzen 7 1700X.

Can you provide your benchmark please?

@charris
Member

charris commented Dec 4, 2018

Perhaps these different experiences depend on the compiler/hardware. @juliantaylor I assume you are using (recent?) gcc. @r-devulap Are you using the Intel compiler? What hardware?

@r-devulap
Member Author

r-devulap commented Dec 5, 2018

I am using gcc 7.3.0 on an Intel i9-7900X. I built NumPy using -march=native. I am basically measuring the TSC cycles the DOUBLE_sqrt/FLOAT_sqrt function takes to execute (using the cpuid, rdtsc and rdtscp, cpuid instruction pairs before and after, respectively). I did disable turbo and set the intel_pstate governor to "performance" mode (to reduce variance across multiple runs). @juliantaylor How are you building NumPy?
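
For anyone who wants to reproduce this kind of measurement, the cpuid/rdtsc ... rdtscp/cpuid pattern described above looks roughly like the sketch below. This is illustrative only, not the actual benchmark from this PR; the timed loop is a stand-in for the FLOAT_sqrt inner loop.

```c
/* Illustrative TSC measurement sketch: serialize with cpuid, read the start
 * count with rdtsc, run the code under test, then read the end count with
 * rdtscp followed by cpuid. Assumes x86-64 and GCC. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp */
#include <cpuid.h>       /* __cpuid */

static inline uint64_t tsc_begin(void)
{
    unsigned int a, b, c, d;
    __cpuid(0, a, b, c, d);       /* serialize before reading the counter */
    return __rdtsc();
}

static inline uint64_t tsc_end(void)
{
    unsigned int aux, a, b, c, d;
    uint64_t t = __rdtscp(&aux);  /* waits for preceding instructions */
    __cpuid(0, a, b, c, d);       /* keep later instructions from hoisting */
    return t;
}

int main(void)
{
    enum { N = 5000 };
    static float in[N], out[N];
    for (int i = 0; i < N; i++)
        in[i] = (float)(i + 1);

    uint64_t t0 = tsc_begin();
    for (int i = 0; i < N; i++)   /* stand-in for the loop being benchmarked */
        out[i] = __builtin_sqrtf(in[i]);
    uint64_t t1 = tsc_end();

    printf("TSC cycles: %llu (checksum %f)\n",
           (unsigned long long)(t1 - t0), (double)out[N - 1]);
    return 0;
}
```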

@juliantaylor
Contributor

Also gcc 7.3. The difference must be the hardware; I checked the generated assembly and it looks fine. I also used rdtscp to check the cycle counts.
The Intel(R) Xeon(R) Gold 6130 is a server CPU, not a high-end desktop variant, but it is of the same generation and has the same number of FMA units. I rather doubt that something as fundamental as sqrt would differ between these types of CPUs.
But I could be wrong. A major difference like that ought to be documented somewhere, though.

@juliantaylor
Contributor

juliantaylor commented Dec 6, 2018

What confuses me about your numbers is that there is almost no impact from increasing the array size. sqrt is relatively slow, so memory bandwidth is not as problematic as it is for a simple addition, but I would still have expected an effect once you go beyond the L3 cache.
But maybe RAM has just gotten faster since I was deep into this stuff :/

@juliantaylor
Contributor

juliantaylor commented Dec 6, 2018

While I would like to understand why I cannot reproduce any speedups with AVX2/AVX512, and I do not like the way this is implemented, it does not make things worse, and I might just be using the wrong hardware (though I have tested a lot of different CPUs over time ...).
So from my side this can be merged at the more active maintainers' discretion.

(And again, I want to plug my runtime variant of this in #11113. There are things to finish and improve in that PR, but the base concept of runtime-selected ufuncs works.)

@charris
Member

charris commented Dec 6, 2018

I think I'll leave this open for a bit; it will be easy enough to backport if wanted. The main question seems to be runtime vs. compile-time selection, although it would be nice to track down the timing differences. @r-devulap Any chance you could try a lower-end CPU in the same series?

@r-devulap
Member Author

r-devulap commented Dec 7, 2018

My bad, @juliantaylor is right: I do not see any performance benefits either. I was using -O2 to compile my micro-benchmarks, and after debugging I realized that gcc does not inline small functions consistently at that level. If I inline them or use the -O3 flag, then SSE2/AVX2/AVX512 all perform similarly. VTune analysis shows that the function is memory bound, so vectorizing sqrt does not seem to help.
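
As an aside, one common way to guard against the inconsistent inlining at -O2 described above is to force inlining of the small helper being timed. This is a generic GCC idiom, not code from this PR, and the helper name is made up:

```c
/* Forcing the tiny helper to be inlined keeps call overhead out of the
 * measured cycle counts even at -O2. Generic GCC idiom, for illustration. */
static inline __attribute__((always_inline))
float sqrtf_one(float x)
{
    return __builtin_sqrtf(x);
}
```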

@charris
Copy link
Member

charris commented Dec 7, 2018

I'll close this then. @r-devulap Thanks for the work and tracking down the cause of the timing discrepancy. @juliantaylor Thanks for the review.

Labels
01 - Enhancement · component: numpy._core · component: SIMD
4 participants