BUG: exp, log AVX loops do not use steps #13520
Conversation
This passes the
Also, it seems like reciprocal is failing the newly added test:
The reciprocal failure is due to
Perhaps it is due to using
Now the same test seems to fail for macOS int16, and I am unable to reproduce this on my machine using gcc 7.3 or clang 6.0.
Due to the new generic test?
Agreed.
Please do not merge this just yet. Now might be a good time to think about any potential bugs or missing elements of this patch. @juliantaylor could you please review this when you have the time?
/**begin repeat1
 * #func = exp, log#
 */
#if defined HAVE_ATTRIBUTE_TARGET_@ISA@_WITH_INTRINSICS && defined NPY_HAVE_SSE2_INTRINSICS
Why are SSE2 intrinsics needed in this avx2/avx512 section?
#if defined @CHK@ && defined NPY_HAVE_SSE2_INTRINSICS
@ISA@_@func@_FLOAT((npy_float*)args[1], (npy_float*)args[0], dimensions[0]);
@ISA@_@func@_FLOAT((npy_float *)op1, (npy_float *)ip1, 1);
This is not correct. The existing logic is to dispatch to the vectorized function if appropriate, and if not, to return false so the caller can run the fallback code.

The logic may be unnecessarily convoluted; a dispatcher that just runs both variants would be nicer. The SIMD code is in need of refactoring, but that is a different issue.

To fix your code, just remove this line and the preprocessor directives around it.

The NPY_GCC_OPT_3 macro can probably also be removed. The UNARY_LOOP macro does not allow for much optimization, and to vectorize it with UNARY_LOOP_FAST gcc would require -ffast-math, which we intentionally do not use as it has some unwanted side effects (mostly in the handling of special float values and exceptions).

As far as I know this doesn't apply to exp and log, but in general you also need to take care that scalar and non-scalar code give the same results; for some functions we unfortunately use different code. As far as I know this applies to, for example,
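For illustration, a minimal self-contained sketch of the dispatch pattern described above; all names here are illustrative stand-ins, not the actual numpy source:

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical SIMD path: returns 1 if it handled the data, 0 otherwise
 * (e.g. because the stride or alignment is unsuitable for vector code). */
static int try_vectorized_exp(float *dst, const float *src, size_t n)
{
    (void)dst; (void)src; (void)n;
    return 0;  /* stand-in: pretend the SIMD preconditions failed */
}

static void exp_loop(float *dst, const float *src, size_t n)
{
    if (try_vectorized_exp(dst, src, n)) {
        return;                       /* vectorized code did the work */
    }
    for (size_t i = 0; i < n; i++) {  /* scalar fallback */
        dst[i] = expf(src[i]);
    }
}
```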
With the commented issue fixed and a clarification on one maybe-unnecessary macro check, this looks good.
@mattip: Can we always expect the stride of the input array in FLOAT_exp_AVX to be constant? What happens in this case?
@r-devulap the strides are always constant for the ufunc inner loops.
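To make that concrete, this is the standard shape of a ufunc inner loop, where steps[] is fixed for the entire call; the function name is made up, and npy_expf from npy_math stands in as the scalar kernel:

```c
#include "numpy/ndarraytypes.h"
#include "numpy/npy_math.h"

/* steps[0] and steps[1] do not change while the loop runs, so the
 * stride seen by any vectorized code is one constant per operand. */
static void FLOAT_exp_like(char **args, npy_intp *dimensions,
                           npy_intp *steps, void *data)
{
    char *in = args[0], *out = args[1];
    npy_intp n = dimensions[0];
    (void)data;
    for (npy_intp i = 0; i < n; i++, in += steps[0], out += steps[1]) {
        *(npy_float *)out = npy_expf(*(npy_float *)in);
    }
}
```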
Actually, for these expensive functions it could be worthwhile to make the iterator always buffering, as it will be faster to copy to a buffer and call the vectorized function than to use the unvectorized one. But I am not sure if we can configure that on a per-ufunc level.
It is very worthwhile, though only because the code currently calls the vectorized function on a one-element array in a loop.

So this is something we should look into before release; in particular, getting the same result for differently strided data is important. We do have differences with the summation depending on strides, and that issue comes up all the time. I'll open another issue, as it is not directly related to this fix, which we should merge soon.
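A rough sketch of the buffering idea: gather strided input into a small contiguous buffer, run the contiguous kernel, and scatter results back. The buffer size and names are made up, the contiguous kernel is a scalar stand-in for the real SIMD routine, and positive byte strides are assumed:

```c
#include <math.h>
#include <stddef.h>

#define BUF 128

static void vec_exp_contig(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {  /* stand-in for the SIMD kernel */
        dst[i] = expf(src[i]);
    }
}

static void exp_strided(char *dst, ptrdiff_t dst_step,
                        const char *src, ptrdiff_t src_step, size_t n)
{
    float in[BUF], out[BUF];
    while (n > 0) {
        size_t chunk = n < BUF ? n : BUF;
        for (size_t i = 0; i < chunk; i++) {          /* copy in */
            in[i] = *(const float *)(src + i * src_step);
        }
        vec_exp_contig(out, in, chunk);               /* contiguous kernel */
        for (size_t i = 0; i < chunk; i++) {          /* copy out */
            *(float *)(dst + i * dst_step) = out[i];
        }
        src += chunk * src_step;
        dst += chunk * dst_step;
        n -= chunk;
    }
}
```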
Opened #13557 for the buffered iterator and restoring result consistency.
I do not think you can do this always, but I have seen that you (Julian, at least I think so) added iterator flags (and op flags) to the ufunc struct. So, for most loops using
But I am not sure they are picked up for all things such as
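If those fields do what I think, forcing buffering for a single ufunc might look roughly like this. This is an unverified sketch: whether NPY_ITER_BUFFERED set via iter_flags is honored on every execution path is exactly the open question above.

```c
#include <Python.h>
#include "numpy/ndarraytypes.h"
#include "numpy/ufuncobject.h"

/* PyUFuncObject exposes iter_flags/op_flags that override the default
 * nditer setup; the assumption here is that adding NPY_ITER_BUFFERED
 * makes every invocation of this ufunc go through the buffered path. */
static void force_buffering(PyUFuncObject *ufunc)
{
    ufunc->iter_flags |= NPY_ITER_BUFFERED;
}
```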
Wait, I forgot this branch still has the bug I commented on when I benchmarked ... So we should probably take a step back and do some platform benchmarking to see how many platforms actually already provide a well-performing function for us. But this fix (with the commented issue addressed) can already be merged so we have correct results and benchmarks again.
@juliantaylor this PR still needs the fix from your comment, correct?
Yes.
We are getting close to the deadline for 1.17.0. We should finish this so it gets some testing before RC1.
Hm, so soon already? This is a very significant low-level change and it should not be done hastily.
@juliantaylor @mattip I am slightly unclear about this comment #13520 (comment). Removing those 3 lines will result in your new test failing, because the outputs of the strided and non-strided cases will differ. I think we need to keep these lines. Am I missing something?
Instead of copying the entire array to a buffer and calling the vectorized function, wouldn't it be better to use _mm256_mask_i32gather_ps instructions to accommodate strides in the vectorized function? Since the stride is always uniform, this is fairly easy to do and might speed up performance of the strided case.
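Something along these lines, assuming AVX2; the helper names are illustrative, and scale 1 is used so the uniform stride can be passed as byte offsets:

```c
#include <immintrin.h>

/* Gather 8 floats that are byte_stride bytes apart, starting at p. */
static __m256 load_strided_8(const char *p, int byte_stride)
{
    const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    /* offsets = {0, s, 2s, ..., 7s} in bytes */
    __m256i offsets = _mm256_mullo_epi32(lane, _mm256_set1_epi32(byte_stride));
    return _mm256_i32gather_ps((const float *)p, offsets, 1);
}

/* Masked variant for the final partial vector: lanes whose mask high
 * bit is clear are not loaded and pass through from the first operand. */
static __m256 load_strided_partial(const char *p, int byte_stride, __m256 mask)
{
    const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i offsets = _mm256_mullo_epi32(lane, _mm256_set1_epi32(byte_stride));
    return _mm256_mask_i32gather_ps(_mm256_setzero_ps(),
                                    (const float *)p, offsets, mask, 1);
}
```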
Oh, I misinterpreted the intention of the line; yes, it is needed for that purpose. Please add a comment to the line explaining why it exists. Using gather would be an option to use the same code with lower impact, but the performance versus a buffered iterator should be tested.
The single-element loop is still an issue, as the code will currently load at least 32 bytes of data on a single-element loop, but there is no guarantee that the input data is 32-byte aligned, so the load can cross a page boundary and cause a segmentation fault.
Sounds good to me. I am working on using gather instructions for exp and log, but will need some time to implement, test, and benchmark. Maybe we can merge this as is and I can submit another patch with gather instructions? Would that be okay?
Or by adapting the load_partial_lanes function to handle the unaligned out-of-bounds case via scalar loads.
Loading 32 bytes of data even across a page boundary is not a problem with the masked and unaligned loads and stores. That's what the masks are for, right?
Ah, I did not know that. Looking at the description, it does say the masked value is not loaded, though it does not explicitly say the masked memory may be unmapped and will not throw an exception; that could be inferred from the wording, though. If that does indeed work, then the code is fine.
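For reference, a sketch of the masked partial load being discussed: with VMASKMOV-style loads the masked-out lanes are not read at all, so no fault is raised even if those bytes sit on an unmapped page. The helper name and mask construction are illustrative, assuming AVX2:

```c
#include <immintrin.h>

/* Load only the first nlanes floats (1..7); the remaining lanes are
 * never touched in memory and read back as 0.0f. */
static __m256 load_partial(const float *p, int nlanes)
{
    const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    /* lane i is enabled iff i < nlanes (the mask's sign bit selects) */
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(nlanes), lane);
    return _mm256_maskload_ps(p, mask);
}
```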
I still have the one comment on the SSE2 preprocessor check, but that does no harm, so it is OK from me for merging. We should probably still discuss release timing in the community meeting tomorrow (if there is one).
I added a comment stating the performance issues for strided loops. I will submit another patch to fix that using gather instructions (if it fixes the performance issues). I think this is okay to merge now.
There are merge conflicts.
Rebased with master; no conflicts now.