ENH: Vectorize INT_FastClip operation using AVX2 #9037
Conversation
It'd be nice if we could finish up #7876 first, which would probably make this a fair bit easier to implement - and it would mean that our vectorized code all ends up in the same place.
*/
static void
@name@_fastclip(@type@ *in, npy_intp ni, @type@ *min, @type@ *max, @type@ *out)
{
Is this an exact copy of the #else branch? If so, that's not good - perhaps you can move the HAVE_ATTRIBUTE_TARGET_AVX2 inside the function.
It's not an exact copy, but it is almost one. The #else part has the name and type as INT and npy_int, which I removed in the HAVE_ATTRIBUTE_TARGET_AVX2 portion. It looks bad, and I will try to move HAVE_ATTRIBUTE_TARGET_AVX2 inside the function.
}
if (max == NULL) {
for(i=0;i<ni/8;i++){
vec_array = _mm256_loadu_si256(in+(i*8));
This seems to be assuming that sizeof(npy_int) == 4, right? I don't think that's a safe assumption.
I'm pretty sure every real modern general-purpose platform has 32-bit ints though. An assert or something might make sense, but I don't think there's any need to go beyond that.
Wouldn't be much work to switch on NPY_BITSOF_@TYPE@ == 32 - especially given the change I suggest above to remove code duplication. Also, presumably this would vectorize for other sizes of ints too?
Yes, I made the assumption that sizeof(npy_int) == 4, which won't be true on all architectures. I will put the appropriate check in place.
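A minimal sketch of what such a guard could look like in plain C (illustrative only; in the templated NumPy source it would be expressed with the NPY_BITSOF_@TYPE@ check suggested above, and the FASTCLIP_USE_AVX2 macro name is hypothetical):

```c
#include <limits.h>

/* Illustrative guard only: take the AVX2 path only when int really has
 * 32 value bits, instead of silently assuming sizeof(npy_int) == 4.
 * The macro name is hypothetical; NumPy's templated source would use
 * NPY_BITSOF_@TYPE@ == 32 instead. */
#if defined(__AVX2__) && (INT_MAX == 0x7fffffff)
    #define FASTCLIP_USE_AVX2 1   /* eight ints fit in one 256-bit register */
#else
    #define FASTCLIP_USE_AVX2 0   /* fall back to the plain scalar loop */
#endif
```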
…TARGET_AVX2 inside the function
@@ -3710,6 +3730,65 @@ static void
npy_intp i;
@type@ max_val = 0, min_val = 0;

#if HAVE_ATTRIBUTE_TARGET_AVX2 && @type@==npy_int && NPY_BITSOF_@type@ == 32
This will make this miss LONG_fastclip, which is likely to be 32 bits on some platforms. Would be better to add an isint flag to match the isfloat one above, and switch on that - you don't care that it's an npy_int, only that it is some kind of int.
Thank you for your feedback. I have added the isint check.
Indentation needs work, but the structure is looking much better.
I'm afraid I can't comment on the correctness of the AVX2 stuff, but @juliantaylor should be able to
5. We then take the register where the result of the above min comparison is stored and do a max operation between the data and the min value. If the data is less than the min value, it gets replaced by the min value; otherwise the data is kept.
6. After that we take care of the unaligned portion by simply looping over it, like the previous algorithm was doing.
7. For a simple benchmark, this algorithm gave around a 40-50% performance boost.
*/
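Roughly, the core described in these steps looks like the standalone sketch below (a paraphrase, not the PR's exact code; it assumes 32-bit ints, both bounds present, and the fastclip_avx2_sketch name is made up for illustration):

```c
#include <immintrin.h>
#include <stddef.h>

/* Illustrative sketch of the loop described above (not the PR's exact code):
 * eight 32-bit ints are clipped per iteration with vectorized min/max, and
 * the remainder that does not fill a 256-bit register is handled by the
 * original scalar loop. Build with -mavx2 (or dispatch at runtime). */
void
fastclip_avx2_sketch(const int *in, size_t n, int min_val, int max_val, int *out)
{
    size_t i = 0;
    const __m256i vmin = _mm256_set1_epi32(min_val);
    const __m256i vmax = _mm256_set1_epi32(max_val);

    for (; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(in + i));
        v = _mm256_min_epi32(v, vmax);   /* clamp to the upper bound */
        v = _mm256_max_epi32(v, vmin);   /* clamp to the lower bound (step 5) */
        _mm256_storeu_si256((__m256i *)(out + i), v);
    }
    for (; i < n; i++) {                 /* leftover elements (step 6) */
        int x = in[i];
        if (x > max_val) x = max_val;
        if (x < min_val) x = min_val;
        out[i] = x;
    }
}
```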
I think most of these comments would be better inline with the bit of code that does them
I have inlined the comments.
I have also fixed the indentation.
else if (min == NULL) {
for(i=0;i<ni/8;i++){
vec_array = _mm256_loadu_si256(in+(i*8));
vec_array = _mm256_min_epi32(vec_array,vec_max_256);
Would be great if this could also use _mm256_min_epi16 and _mm256_min_epi8 for other sizes of input.
This is compiler-vectorizable; GCC uses vpmaskmov instead of min/max, but it is only about 10% slower on my garbage Haswell laptop. Also, the AVX2 code needs runtime detection.
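For comparison, a plain scalar clip loop like the one below is what the compiler can auto-vectorize at -O3 -mavx2, and __builtin_cpu_supports is one way to do the runtime detection with GCC/clang; the function names and dispatch structure here are illustrative only, not NumPy's actual mechanism:

```c
#include <stddef.h>

/* Plain scalar clip loop. Built with -O3 -mavx2, GCC can auto-vectorize
 * this (reportedly using vpmaskmov rather than min/max instructions).
 * Names and signatures are illustrative only. */
void
fastclip_scalar(const int *in, size_t n, int min_val, int max_val, int *out)
{
    for (size_t i = 0; i < n; i++) {
        int x = in[i];
        if (x > max_val) x = max_val;
        if (x < min_val) x = min_val;
        out[i] = x;
    }
}

/* AVX2 variant, e.g. the sketch shown earlier in this thread. */
void fastclip_avx2_sketch(const int *in, size_t n, int min_val, int max_val,
                          int *out);

/* Runtime-detection sketch: only call the AVX2 code when the CPU supports it.
 * __builtin_cpu_supports is a GCC/clang builtin; NumPy has its own
 * CPU-feature machinery, so this is purely illustrative. */
void
fastclip_dispatch(const int *in, size_t n, int min_val, int max_val, int *out)
{
    if (__builtin_cpu_supports("avx2"))
        fastclip_avx2_sketch(in, n, min_val, max_val, out);
    else
        fastclip_scalar(in, n, min_val, max_val, out);
}
```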
@juliantaylor Thank you for your feedback. What benchmark did you use for the comparison? Is the algorithm using min/max 10% slower than the vpmaskmov version?
I have used numpy adding
@juliantaylor Can you point me to the benchmark you used for the performance assessment, so that I can try it at my end and see how the performance compares? I have attached the benchmarks I used for evaluation. I compiled both files with the -O3 -mavx2 switches.
Well, the clip ufunc is merged now, so the code would have to be moved around quite a bit. On the other hand, that probably makes this whole thing much more doable (but also quite a bit different). Since this has stalled, I think I will just close the PR, even though there may be some very nice code in it. Thanks @ysingh7, maybe you even want to look at it again?
Hi Everyone,
I was working on training a deep neural network and found INT_fastclip to be the major bottleneck, consuming more than 50% of the cycles. I looked into the algorithm and found that it can be easily vectorized using AVX2. After rewriting the code for INT_fastclip using the AVX2 extension, I got around a 50% speedup for my workload. I am therefore creating a pull request containing the implementation, for evaluation by the community, to see whether it can be merged into numpy. I am also attaching the benchmarks I used for performance evaluation. I would appreciate the community's feedback.
min_max_array.txt
min_max_array_avx2.txt
Description of the Algorithm -
First it checks whether the AVX2 attribute is present. If AVX2 support is present, it breaks the work down into two parts with respect to 256 bits (the AVX2 registers used here are 256 bits, i.e. eight 32-bit ints, in size): a portion processed with the vector instructions and a leftover portion handled by the original scalar loop.
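As a small illustration of that split (hypothetical numbers, not taken from the PR):

```c
#include <stdio.h>

/* Illustrative only: for a made-up length of 1003 ints, show how the work
 * splits into whole 256-bit registers (eight 32-bit ints each) handled by
 * AVX2, plus a scalar remainder handled by the ordinary loop. */
int main(void)
{
    long ni = 1003;            /* hypothetical array length           */
    long vec_iters = ni / 8;   /* 125 full 256-bit loads/stores       */
    long tail      = ni % 8;   /* 3 elements left for the scalar loop */
    printf("%ld vector iterations, %ld scalar tail elements\n", vec_iters, tail);
    return 0;
}
```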
Regards
Yash