BUG: exp, log AVX loops do not use steps #13520


Merged: 3 commits merged into numpy:master on May 16, 2019

Conversation

r-devulap
Member

No description provided.

@r-devulap
Member Author

This passes the numpy/core/tests/test_ufunc.py and numpy/core/tests/test_umath.py tests, but I am seeing non-deterministic failures in numpy/core/tests/test_numeric.py. It fails about 1 in 5 times when I run the test suite (this happens even on the master branch). I have yet to determine what is causing it.

@r-devulap
Member Author

r-devulap commented May 10, 2019

Also, it seems reciprocal is failing the newly added test test_ufunc_noncontiguous: https://app.shippable.com/github/numpy/numpy/runs/3868/1/console

@mattip
Member

mattip commented May 10, 2019

The reciprocal failure is due to 1.0 / 0 which, when cast to npy_int, results in either -1 or 1 + 2**31.

Perhaps it is due to using -O3 with the different branches in UNARY_LOOP_FAST. What happens if you make the numerator 1.0F on line 717 in loops.c.src?
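
For context, a standalone illustration (not code from this PR) of why that cast misbehaves: converting float infinity to a signed integer is undefined behaviour in C, so different compilers, optimization levels, and targets are free to produce different values, which fits the -1 vs. 1 + 2**31 discrepancy seen in the test.

#include <stdio.h>
#include <limits.h>

int main(void)
{
    volatile float zero = 0.0f;   /* volatile defeats constant folding */
    float inf = 1.0f / zero;      /* +inf at runtime, no compile-time trap */
    int as_int = (int)inf;        /* undefined behaviour: value varies by compiler/target */
    printf("(int)(1.0f / 0.0f) = %d (INT_MIN is %d)\n", as_int, INT_MIN);
    return 0;
}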

@r-devulap
Member Author

r-devulap commented May 11, 2019

Now the same test seems to fail on macOS for int16, and I am unable to reproduce this on my machine using gcc 7.3 or clang 6.0.

@juliantaylor
Contributor

Is this due to the new generic test?
If so, one could remove the fast loop for that function and handle it in another issue.
I don't think the performance of integer reciprocal operations is particularly important.

@mattip
Member

mattip commented May 11, 2019

Agreed, np.reciprocal(np.array([0], t)) for t in the integer dtypes should be a separate issue. Perhaps for now change range(6) to range(1, 7) to avoid 0.

@r-devulap
Member Author

Please do not merge this just yet; now might be a good time to think about any potential bugs or missing elements of this patch. @juliantaylor could you please review this when you have the time?


/**begin repeat1
* #func = exp, log#
*/

#if defined HAVE_ATTRIBUTE_TARGET_@ISA@_WITH_INTRINSICS && defined NPY_HAVE_SSE2_INTRINSICS
Contributor

Why are SSE2 intrinsics needed in this AVX2/AVX512 section?

#if defined @CHK@ && defined NPY_HAVE_SSE2_INTRINSICS
@ISA@_@func@_FLOAT((npy_float*)args[1], (npy_float*)args[0], dimensions[0]);
@ISA@_@func@_FLOAT((npy_float *)op1, (npy_float *)ip1, 1);
@juliantaylor
Contributor

juliantaylor commented May 12, 2019

This is not correct: the existing logic is to dispatch to the vectorized code function if appropriate, and if not, to return false so the caller can run the fallback code.

The logic may be unnecessarily convoluted; a dispatcher that just runs both variants would be nicer. The SIMD code is in need of refactoring, but that is a different issue.
To fix your code, just remove this line and the preprocessor directives around it.

The NPY_GCC_OPT_3 macro can probably also be removed: the UNARY_LOOP macro does not allow for much optimization, and to vectorize it with UNARY_LOOP_FAST gcc would require -ffast-math, which we intentionally do not use as it has some unwanted side effects (mostly in the handling of special float values and exceptions).
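
A minimal sketch of the dispatch-and-fallback pattern being described (illustrative names, not the actual loops.c.src symbols; a scalar stand-in replaces the real SIMD body): the vectorized helper reports whether it handled the data, and only if it declines does the caller run the scalar loop that UNARY_LOOP provides in the real code.

#include <math.h>
#include <stddef.h>

/* Pretend AVX kernel: only accepts contiguous float32 data. */
static int
avx_exp_maybe(float *dst, const float *src, size_t n, size_t istep, size_t ostep)
{
    if (istep != sizeof(float) || ostep != sizeof(float)) {
        return 0;                     /* strided input: let the caller fall back */
    }
    for (size_t i = 0; i < n; i++) {  /* stand-in for the SIMD body */
        dst[i] = expf(src[i]);
    }
    return 1;                         /* handled */
}

static void
float_exp_loop(char **args, const size_t *dimensions, const size_t *steps)
{
    if (avx_exp_maybe((float *)args[1], (const float *)args[0],
                      dimensions[0], steps[0], steps[1])) {
        return;
    }
    /* scalar fallback, the role UNARY_LOOP plays in the real code */
    for (size_t i = 0; i < dimensions[0]; i++) {
        const float *ip = (const float *)(args[0] + i * steps[0]);
        float *op = (float *)(args[1] + i * steps[1]);
        *op = expf(*ip);
    }
}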

@juliantaylor
Contributor

As far as I know it doesn't apply to exp and log, but in general you also need to take care that scalar and non-scalar code give the same results.

For some functions we unfortunately use different code. As far as I know this applies to, for example, power; see some old experiments of mine: juliantaylor@6c29106

@juliantaylor
Contributor

juliantaylor commented May 12, 2019

With the commented issue fixed and a clarification on the one possibly unnecessary macro check, this looks good.

@charris charris added this to the 1.17.0 release milestone May 12, 2019
@r-devulap
Member Author

@mattip: Can we always expect the stride of the input array in FLOAT_exp_AVX to be constant? What happens in this case?

>>> np.random.rand(10,5)[::2].strides
(80, 8)

@juliantaylor
Contributor

@r-devulap the strides are always constant for the ufunc inner loops.
In your strided case the iterator will copy the data to a contiguous buffer and then call the inner loop on the buffer.

@juliantaylor
Contributor

Actually, for these expensive functions it could be worthwhile to make the iterator always buffer, as it will be faster to copy to a buffer and call the vectorized function than to use the unvectorized one.
It also ensures the same results for all strides of data.

But I am not sure if we can configure that on a per-ufunc level.

@juliantaylor
Contributor

juliantaylor commented May 14, 2019

It is very worthwhile, though currently only because the code calls the vectorized function on a one-element array in a loop:

import numpy as np
d = np.random.rand(1000, 50).astype(np.float32)
print("2d strided -> buffered iterator")
%timeit np.exp(d[::2])
d = np.random.rand(1000 * 50).astype(np.float32)
print("1d strided unbuffered iterator")
%timeit np.exp(d[::2])


2d strided -> buffered iterator
10000 loops, best of 3: 77.3 µs per loop
1d strided unbuffered iterator
1000 loops, best of 3: 1.02 ms per loop

So this is something we should look into before the release; in particular, getting the same result for differently strided data is important. We do have differences with summation depending on strides, and that issue comes up all the time.

I'll open another issue, as it is not directly related to this fix which we should merge soon.

@juliantaylor
Contributor

Opened #13557 for the buffered iterator and for restoring result consistency.

@seberg
Member

seberg commented May 14, 2019

I do not think you can always do this, but I have seen that you (Julian, at least I think so) added iterator flags (and op flags) to the ufunc struct. So, for most loops one could use ufunc->op_flags = {NPY_ITER_CONTIG, NPY_ITER_CONTIG, NPY_ITER_CONTIG}; (or similar).

But I am not sure they are picked up for all things such as reduceat or accumulate.
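
A rough sketch of what that suggestion could look like (based only on the comment above, not on this PR; whether every path such as reduceat or accumulate honours these flags is exactly the open question). PyUFuncObject carries a per-operand op_flags array, so a unary ufunc like exp would have one flag word for its input and one for its output; the helper name here is hypothetical.

#include <Python.h>
#include <numpy/arrayobject.h>
#include <numpy/ufuncobject.h>

/* exp is unary: one flag word for the input operand, one for the output */
static npy_uint32 exp_op_flags[2] = {NPY_ITER_CONTIG, NPY_ITER_CONTIG};

/* hypothetical helper: ask the iterator for contiguous (hence buffered) operands */
static void
request_contiguous_operands(PyUFuncObject *ufunc)
{
    ufunc->op_flags = exp_op_flags;
}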

@juliantaylor
Contributor

juliantaylor commented May 14, 2019

Wait, I forgot this branch still has the bug I commented on when I benchmarked...
Actually, I see no performance difference when using the corrected code. It turns out glibc 2.27's expf is actually slightly faster than our AVX2 vectorized version when called directly, as the specialized ufunc now does.

So we should probably take a step back and do some platform benchmarking to see how many platforms actually already provide a well performing function for us.

But this fix (with the commented issue addressed) can already be merged so that we have correct results and benchmarks again.

@mattip
Member

mattip commented May 14, 2019

@juliantaylor this PR still needs the fix from your comment, correct?

@juliantaylor
Contributor

yes

@mattip
Member

mattip commented May 14, 2019

We are getting close to the deadline for 1.17.0. We should finish this so it gets some testing before RC1

@juliantaylor
Contributor

Hm, so soon already? This is a very significant low-level change and it should not be done hastily.
Also, with this fix I would not consider it finished. At the least we need to ensure equal results regardless of the strides of the input, and I would still like to see some benchmarks on more platforms comparing against system libraries.

@r-devulap
Member Author

r-devulap commented May 14, 2019

@juliantaylor @mattip I am slightly unclear about this comment #13520 (comment). Removing those 3 lines will result in your new test failing, because the outputs of the strided and non-strided cases will differ. I think we need to keep these lines. Am I missing something?

@r-devulap
Member Author

Actually for these expensive functions it could be worthwhile to make the iterator always buffering, as it will be faster to copy to a buffer and call the vectorized function than using the unvectorized one
it also ensures the same results for all strides of data

but I am not sure if we can configure that on a per ufunc level

Instead of copying the entire array to a buffer and calling the vectorized function, wouldn't it be better to use _mm256_mask_i32gather_ps instructions to accommodate strides in the vectorized function? Since the stride is always uniform, this is fairly easy to do and might speed up the strided case.
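
For illustration, a minimal AVX2 sketch of the gather idea (not this PR's code; it assumes the stride is given in elements and that a tail of fewer than 8 elements is handled separately):

#include <immintrin.h>

/* Load 8 float32 values spaced stride_in_elements apart, starting at base.
 * Compile with -mavx2. */
static __m256
gather_strided_ps(const float *base, int stride_in_elements)
{
    /* per-lane indices 0, s, 2s, ..., 7s */
    const __m256i idx = _mm256_mullo_epi32(
        _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0),
        _mm256_set1_epi32(stride_in_elements));
    /* all-ones mask enables every lane; a partial mask would cover the tail */
    const __m256 mask = _mm256_castsi256_ps(_mm256_set1_epi32(-1));
    /* scale = 4 bytes per float */
    return _mm256_mask_i32gather_ps(_mm256_setzero_ps(), base, idx, mask, 4);
}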

@juliantaylor
Contributor

juliantaylor commented May 14, 2019

Oh, I misinterpreted the intention of the line; yes, it is needed for that purpose. Please add a comment to the line explaining why it exists.
That it causes a factor-10 slowdown for strided data is still a concern.

Using gather would be an option to use the same code with lower impact, but the performance versus a buffered iterator should be tested.
Implementing the gather in the iterator buffer creation would be better, as that would benefit all ufuncs, not only float32 exp/log, but as a quick fix one could also do it in those functions and iterate on the solution later.

@juliantaylor
Contributor

The single-element loop is still an issue, as the code currently loads at least 32 bytes of data in a single-element loop, but there is no guarantee that the input data is 32-byte aligned, so the load can cross a page boundary and cause a segmentation fault.
This could be fixed with buffering too.

@r-devulap
Member Author

Sounds good to me. I am working on using gather instructions for exp and log, but will need some time to implement, test, and benchmark. Maybe we can merge this as is and I can submit another patch with gather instructions? Would that be okay?

@juliantaylor
Contributor

juliantaylor commented May 14, 2019

Or it could be fixed by adapting the load_partial_lanes function to handle the unaligned out-of-bounds case via scalar loads.

@r-devulap
Member Author

Loading 32 bytes of data, even across a page boundary, is not a problem with the masked and unaligned loads and stores; that's what the masks are for, right?
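
A small sketch (illustrative, not the PR's actual helper) of the kind of masked tail load being described, where lane i is enabled only when i < remaining, so the disabled lanes past the end of the array are never read:

#include <immintrin.h>

/* Masked tail load of the last `remaining` (< 8) floats at p.
 * Compile with -mavx2. */
static __m256
maskload_tail_ps(const float *p, int remaining)
{
    const __m256i lane = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    const __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(remaining), lane);
    return _mm256_maskload_ps(p, mask);   /* disabled lanes read as zero */
}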

@juliantaylor
Contributor

Ah, I did not know that. Looking at the description, it does say the masked value is not loaded, though it does not explicitly say that the masked memory may be unmapped and will not raise an exception; but that could be inferred from the wording. If that does indeed work, then the code is fine.

@juliantaylor
Contributor

I still have the one comment on the SSE2 preprocessor check, but that does no harm, so it is OK from me for merging.

But we should probably still discuss release timing in the community meeting tomorrow (if there is one).

@r-devulap
Member Author

I added a comment noting the performance issue for strided loops. I will submit another patch to fix that using gather instructions (if that fixes the performance issue). I think this is okay to merge now.

@r-devulap r-devulap changed the title WIP, BUG: exp, log AVX loops do not use steps BUG: exp, log AVX loops do not use steps May 14, 2019
@mattip
Member

mattip commented May 16, 2019

There are merge conflicts.

@r-devulap
Member Author

Rebased on master; no conflicts now.

@mattip mattip merged commit 7be5f11 into numpy:master May 16, 2019
@rgommers rgommers added the component: SIMD label Jul 12, 2022
Labels
00 - Bug, component: numpy._core, component: SIMD