
ENH: Vectorize INT_FastClip operation using AVX2 #9037


Closed
wants to merge 5 commits into from

Conversation

ysingh7

@ysingh7 ysingh7 commented May 1, 2017

Hi Everyone,

I was working on training a deep neural network and found INT_fastclip to be the major bottleneck, consuming more than 50% of the cycles. I looked into the algorithm and found that it can easily be vectorized using AVX2. After rewriting the code for INT_fastclip using the AVX2 extension, I got around a 50% speedup for my workload. I am therefore creating a pull request, which contains the algorithm, for evaluation by the community, to see if it can be merged into numpy. I am also attaching the benchmarks I used for performance evaluation. I would appreciate the community's feedback.

min_max_array.txt
min_max_array_avx2.txt

Description of the algorithm -
First it checks whether the AVX2 attribute is present. If AVX2 support is present, it breaks the algorithm into two parts.

  1. First it looks at the total length of the input array and finds out how many elements are left over relative to 256 bits, because the AVX2 registers used here are 256 bits (32 bytes) wide.
  2. For the portion that fills whole 256-bit registers, it loops over the data and loads it into one of the vector registers.
  3. It loads the vector register vec_max_256 with the maximum value and vec_min_256 with the minimum value, broadcast into each 4-byte lane.
  4. A parallel comparison is made between the loaded data and vec_max_256 first, via a vectorized min operation between the data and the max value. If a data element is greater than the max value, it is replaced by the max value; otherwise the data is kept.
  5. We then take the register where the result of the above min operation is stored and do a vectorized max operation between that data and the min value. If a data element is less than the min value, it is replaced by the min value; otherwise the data is kept.
  6. After that we take care of the leftover portion by simply looping over it, as the previous algorithm did.
  7. On a simple benchmark, this algorithm gave around a 40-50% performance boost.

Regards
Yash

@eric-wieser
Member

eric-wieser commented May 1, 2017

It'd be nice if we could finish up #7876 first, which would probably make this a fair bit easier to implement - and it would mean that our vectorized code all ends up in the same place.

*/
static void
@name@_fastclip(@type@ *in, npy_intp ni, @type@ *min, @type@ *max, @type@ *out)
{
Member

Is this an exact copy of the #else branch? If so, that's not good - perhaps you can move the HAVE_ATTRIBUTE_TARGET_AVX2 inside the function.

Author

It's not an exact copy, but almost. The else part has @name@ and @type@ as INT and npy_int, which I removed in the HAVE_ATTRIBUTE_TARGET_AVX2 portion. It looks bad, and I will try to move HAVE_ATTRIBUTE_TARGET_AVX2 inside the function.

}
if (max == NULL) {
    for (i = 0; i < ni/8; i++) {
        vec_array = _mm256_loadu_si256(in + (i*8));
Member

This seems to be assuming that sizeof(npy_int) == 4, right? I don't think that's a safe assumption

Member

I'm pretty sure every real modern general purpose platform has 32 bit ints though. An assert or something might make sense but I don't think there's any need to go beyond that.

Member

@eric-wieser eric-wieser May 1, 2017

Wouldn't be much work to switch on NPY_BITSOF_@TYPE@ == 32 - especially given the change I suggest above to remove code duplication. Also, presumably this would vectorize for other sizes of ints too?

Author

Yes, I made the assumption that sizeof(npy_int) == 4, which won't be true for all architectures. I will put an appropriate check in place.

@njsmith
Member

njsmith commented May 1, 2017

cc @juliantaylor

@@ -3710,6 +3730,65 @@ static void
npy_intp i;
@type@ max_val = 0, min_val = 0;

#if HAVE_ATTRIBUTE_TARGET_AVX2 && @type@==npy_int && NPY_BITSOF_@type@ == 32
Member

This will make this miss LONG_fastclip, which is likely to be 32 bits on some platforms. It would be better to add an isint flag to match the isfloat one above, and switch on that - you don't care that it's an npy_int, only that it is some kind of int.

Author

Thank you for your feedback. I have added the check isint.

Member

@eric-wieser eric-wieser left a comment

Indentation needs work, but the structure is looking much better.

I'm afraid I can't comment on the correctness of the AVX2 stuff, but @juliantaylor should be able to

5. We then use the register where the result of the above min operation is stored and do a max operation between the data and the min value. If a data element is less than the min value, it is replaced by the min value; otherwise the data is kept.
6. After that we take care of the unaligned portion by simply looping over it, like the previous algorithm was doing.
7. For a simple benchmark, this algorithm gave around 40-50% performance boost.
*/
Member

I think most of these comments would be better inline with the bit of code that does them

Author

I have inlined the comments.

Author

I have also fixed the indentation.

else if (min == NULL) {
    for (i = 0; i < ni/8; i++) {
        vec_array = _mm256_loadu_si256(in + (i*8));
        vec_array = _mm256_min_epi32(vec_array, vec_max_256);
Member

@eric-wieser eric-wieser May 1, 2017

Would be great if this could also use _mm256_min_epi16 and _mm256_min_epi8 for other sizes of input

@juliantaylor
Contributor

juliantaylor commented May 1, 2017

This is compiler-vectorizable. GCC uses vpmaskmov instead of min/max, but that is only about 10% slower on my garbage Haswell laptop.
Note you have to replace the else-if/else for the min + max case with a flat if for it to vectorize. This is very odd considering the other cases explicitly need the else (probably worth filing a GCC bug about).
The advantage of letting the compiler do it is that we get the other types for free too.
To enable it you have to add NPY_GCC_OPT_3 and NPY_GCC_TARGET_AVX2; see examples in loops.c.src.

Also, AVX2 needs runtime detection: HAVE_ATTRIBUTE_TARGET_AVX2 only means the compiler can emit AVX2 code. This means creating two copies of the function, one generic and one with AVX2, and then selecting between them at runtime by setting up the fastclip member of the array's ArrFuncs correctly.

@ysingh7
Author

ysingh7 commented May 1, 2017

@juliantaylor Thank you for your feedback. What benchmark did you use for the comparison? Is the algorithm using min/max 10% slower than the vpmaskmov?

@juliantaylor
Contributor

I used numpy itself, adding NPY_GCC_OPT_3 and NPY_GCC_TARGET_AVX2 to the fastclip function.
Using min/max is the slightly faster variant.

@ysingh7
Author

ysingh7 commented May 1, 2017

@juliantaylor Can you point me to the benchmark you used for the performance assessment, so that I can try it at my end and see how the performance compares? I have attached the benchmarks I used for evaluation; I compiled both files with the -O3 -mavx2 switches.

@charris charris changed the title Added algorithm to vectorize INT_FastClip operation using AVX2 ENH: Vectorize INT_FastClip operation using AVX2 May 7, 2017
@seberg
Member

seberg commented Sep 22, 2019

Well, the clip ufunc is merged now, so this code would have to be moved around quite a bit. On the other hand, that probably makes the whole thing much more doable (but also quite a bit different). Since this is old, I think I will just close the PR, even though there may be some very nice code in it. Thanks @ysingh7, maybe you even want to look at it again?

@seberg seberg closed this Sep 22, 2019