
ENH: Vectorize INT_FastClip operation using AVX2 #9037


Closed
wants to merge 5 commits into from

Conversation

ysingh7

@ysingh7 ysingh7 commented May 1, 2017

Hi Everyone,

I was working on training a deep neural network and found INT_fastclip to be the major bottleneck, consuming more than 50% of the cycles. I looked into the algorithm and found that it can easily be vectorized using AVX2. After rewriting the code for INT_fastclip using the AVX2 extension, I got around a 50% speedup for my workload. I am therefore creating a pull request, which contains the algorithm, for evaluation by the community, to see if it can be merged into numpy. I am also attaching the benchmarks I used for performance evaluation. I would appreciate the community's feedback.

min_max_array.txt
min_max_array_avx2.txt

Description of the algorithm -
First it checks whether the AVX2 attribute is present. If AVX2 support is present, it breaks the algorithm into two parts.

  1. First it looks at the total length of the input array and finds out how many elements are left over relative to 256 bits, because the AVX2 registers used here are 256 bits (32 bytes) wide.
  2. For the portion that fills whole 256-bit registers, it loops over the data and loads it into one of the vector registers.
  3. It loads the vector register vec_max_256 with the maximum value and vec_min_256 with the minimum value, broadcast into each 4-byte lane.
  4. A parallel comparison is made between the loaded data and vec_max_256 first, via a vectorized min operation between the data and the max value. If a data element is greater than the max value, it is replaced by the max value; otherwise the data is kept.
  5. We then take the register where the result of the above min operation is stored and do a vectorized max operation between that data and the min value. If a data element is less than the min value, it is replaced by the min value; otherwise the data is kept.
  6. After that we take care of the leftover portion by simply looping over it, as the previous algorithm did.
  7. On a simple benchmark, this algorithm gave around a 40-50% performance boost.

Regards
Yash

@eric-wieser
Member

eric-wieser commented May 1, 2017

It'd be nice if we could finish up #7876 first, which would probably make this a fair bit easier to implement - and it would mean that our vectorized code all ends up in the same place.

*/
static void
@name@_fastclip(@type@ *in, npy_intp ni, @type@ *min, @type@ *max, @type@ *out)
{
Member

Is this an exact copy of the #else branch? If so, that's not good - perhaps you can move the HAVE_ATTRIBUTE_TARGET_AVX2 inside the function.

Author

It's not an exact copy, but almost. The else part has @name@ and @type@ as INT and npy_int, which I removed in the HAVE_ATTRIBUTE_TARGET_AVX2 portion. It looks bad, and I will try to move HAVE_ATTRIBUTE_TARGET_AVX2 inside the function.

}
if (max == NULL) {
    for (i = 0; i < ni/8; i++) {
        vec_array = _mm256_loadu_si256(in + (i*8));
Member

This seems to be assuming that sizeof(npy_int) == 4, right? I don't think that's a safe assumption

Member

I'm pretty sure every real modern general purpose platform has 32 bit ints though. An assert or something might make sense but I don't think there's any need to go beyond that.

Member

@eric-wieser eric-wieser May 1, 2017

Wouldn't be much work to switch on NPY_BITSOF_@TYPE@ == 32 - especially given the change I suggest above to remove code duplication. Also, presumably this would vectorize for other sizes of ints too?

Author

Yes, I made the assumption that sizeof(npy_int) == 4, which won't be true for all architectures. I will put an appropriate check in place.

@njsmith
Member

njsmith commented May 1, 2017

cc @juliantaylor

@@ -3710,6 +3730,65 @@ static void
npy_intp i;
@type@ max_val = 0, min_val = 0;

#if HAVE_ATTRIBUTE_TARGET_AVX2 && @type@==npy_int && NPY_BITSOF_@type@ == 32
Member

This will make this miss LONG_fastclip, which is likely to be 32 bits on some platforms. It would be better to add an isint flag to match the isfloat one above, and switch on that - you don't care that it's an npy_int, only that it is some kind of int.

Author

Thank you for your feedback. I have added the check isint.

Member

@eric-wieser eric-wieser left a comment

Indentation needs work, but the structure is looking much better.

I'm afraid I can't comment on the correctness of the AVX2 stuff, but @juliantaylor should be able to

5. We then use the register where the result of the above min operation is stored and do a max operation between the data and the min value. If a data element is less than the min value, it is replaced by the min value; otherwise the data is kept.
6. After that we take care of the unaligned portion by simply looping over it, like the previous algorithm was doing.
7. For a simple benchmark, this algorithm gave around 40-50% performance boost.
*/
Member

I think most of these comments would be better inline with the bit of code that does them

Author

I have inlined the comments.

Author

I have also fixed the indentation.

else if (min == NULL) {
    for (i = 0; i < ni/8; i++) {
        vec_array = _mm256_loadu_si256(in + (i*8));
        vec_array = _mm256_min_epi32(vec_array, vec_max_256);
Member

@eric-wieser eric-wieser May 1, 2017

Would be great if this could also use _mm256_min_epi16 and _mm256_min_epi8 for other sizes of input

@juliantaylor
Contributor

juliantaylor commented May 1, 2017

This is compiler-vectorizable. GCC uses vpmaskmov instead of min/max, but that is only about 10% slower on my garbage Haswell laptop.
Note you have to replace the else-if/else for the min + max case with a flat if for it to vectorize. This is very odd considering the other cases explicitly need the else (probably worth filing a GCC bug about).
The advantage of letting the compiler do it is that we get the other types for free too.
To enable it you have to add NPY_GCC_OPT_3 and NPY_GCC_TARGET_AVX2; see examples in loops.c.src.

Also, AVX2 needs runtime detection: HAVE_ATTRIBUTE_TARGET_AVX2 only means the compiler can emit AVX2 code. This means creating two copies of the function, one generic and one with AVX2, and then selecting between them at runtime by setting up the fastclip member of the array's ArrFuncs correctly.

@ysingh7
Author

ysingh7 commented May 1, 2017

@juliantaylor Thank you for your feedback. What benchmark did you use for the comparison? Is the algorithm using min/max 10% slower than the vpmaskmov?

@juliantaylor
Contributor

I used numpy itself, adding NPY_GCC_OPT_3 and NPY_GCC_TARGET_AVX2 to the fastclip function.
Using min/max is the slightly faster variant.

@ysingh7
Author

ysingh7 commented May 1, 2017

@juliantaylor Can you point me to the benchmark you used for the performance assessment, so that I can try it at my end and see how the performance compares? I have attached the benchmarks I used for evaluation; I compiled both files with the -O3 -mavx2 switches.

@charris charris changed the title Added algorithm to vectorize INT_FastClip operation using AVX2 ENH: Vectorize INT_FastClip operation using AVX2 May 7, 2017
@seberg
Member

seberg commented Sep 22, 2019

Well, the clip ufunc is merged now, so this code would have to be moved around quite a bit. On the other hand, that probably makes the whole thing much more doable (but also quite a bit different). Since this is old, I think I will just close the PR, even though there may be some very nice code in it. Thanks @ysingh7, maybe you even want to look at it again?

@seberg seberg closed this Sep 22, 2019