BUG: exp, log AVX loops do not use steps #13520


Merged: 3 commits merged into numpy:master on May 16, 2019

Conversation

r-devulap
Member

No description provided.

@r-devulap
Member Author

This passes the numpy/core/tests/test_ufunc.py and numpy/core/tests/test_umath.py tests, but I am seeing non-deterministic failures in numpy/core/tests/test_numeric.py. It fails about 1 in 5 times when I run the test suite (this happens even on the master branch). I have yet to determine what is causing it.

@r-devulap
Member Author

r-devulap commented May 10, 2019

Also, it seems reciprocal is failing the newly added test test_ufunc_noncontiguous: https://app.shippable.com/github/numpy/numpy/runs/3868/1/console

@mattip
Member

mattip commented May 10, 2019

The reciprocal failure is due to 1.0 / 0 which, when cast to npy_int, results in either -1 or 1 + 2**31.

Perhaps it is due to using -O3 with the different branches in UNARY_LOOP_FAST. What happens if you make the numerator 1.0F on line 717 in loops.c.src?
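
For context, a standalone illustration (not code from this PR) of why that cast misbehaves: converting float infinity to a signed integer is undefined behaviour in C, so different compilers, optimization levels, and targets are free to produce different values, which fits the -1 vs. 1 + 2**31 discrepancy seen in the test.

#include <stdio.h>
#include <limits.h>

int main(void)
{
    volatile float zero = 0.0f;   /* volatile defeats constant folding */
    float inf = 1.0f / zero;      /* +inf at runtime, no compile-time trap */
    int as_int = (int)inf;        /* undefined behaviour: value varies by compiler/target */
    printf("(int)(1.0f / 0.0f) = %d (INT_MIN is %d)\n", as_int, INT_MIN);
    return 0;
}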

@r-devulap
Member Author

r-devulap commented May 11, 2019

Now the same test seems to fail on macOS for int16, and I am unable to reproduce this on my machine using gcc 7.3 or clang 6.0.

@juliantaylor
Contributor

Is this due to the new generic test?
If so, one could remove the fast loop for that function and handle it in another issue.
I don't think the performance of integer reciprocal operations is particularly important.

@mattip
Member

mattip commented May 11, 2019

Agreed, np.reciprocal(np.array([0], t)) for t in the integer dtypes should be a separate issue. Perhaps for now change range(6) to range(1, 7) to avoid 0.

@r-devulap
Member Author

Please do not merge this just yet; now might be a good time to think about any potential bugs or missing elements of this patch. @juliantaylor could you please review this when you have the time?


/**begin repeat1
* #func = exp, log#
*/

#if defined HAVE_ATTRIBUTE_TARGET_@ISA@_WITH_INTRINSICS && defined NPY_HAVE_SSE2_INTRINSICS
Contributor

Why are SSE2 intrinsics needed in this AVX2/AVX512 section?

#if defined @CHK@ && defined NPY_HAVE_SSE2_INTRINSICS
@ISA@_@func@_FLOAT((npy_float*)args[1], (npy_float*)args[0], dimensions[0]);
@ISA@_@func@_FLOAT((npy_float *)op1, (npy_float *)ip1, 1);
@juliantaylor
Contributor

juliantaylor commented May 12, 2019

This is not correct: the existing logic is to dispatch to the vectorized code function if appropriate, and if not, to return false so the caller can run the fallback code.

The logic may be unnecessarily convoluted; a dispatcher that just runs both variants would be nicer. The SIMD code is in need of refactoring, but that is a different issue.
To fix your code, just remove this line and the preprocessor directives around it.

The NPY_GCC_OPT_3 macro can probably also be removed: the UNARY_LOOP macro does not allow for much optimization, and to vectorize it with UNARY_LOOP_FAST gcc would require -ffast-math, which we intentionally do not use as it has some unwanted side effects (mostly in the handling of special float values and exceptions).
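
A minimal sketch of the dispatch-and-fallback pattern being described (illustrative names, not the actual loops.c.src symbols; a scalar stand-in replaces the real SIMD body): the vectorized helper reports whether it handled the data, and only if it declines does the caller run the scalar loop that UNARY_LOOP provides in the real code.

#include <math.h>
#include <stddef.h>

/* Pretend AVX kernel: only accepts contiguous float32 data. */
static int
avx_exp_maybe(float *dst, const float *src, size_t n, size_t istep, size_t ostep)
{
    if (istep != sizeof(float) || ostep != sizeof(float)) {
        return 0;                     /* strided input: let the caller fall back */
    }
    for (size_t i = 0; i < n; i++) {  /* stand-in for the SIMD body */
        dst[i] = expf(src[i]);
    }
    return 1;                         /* handled */
}

static void
float_exp_loop(char **args, const size_t *dimensions, const size_t *steps)
{
    if (avx_exp_maybe((float *)args[1], (const float *)args[0],
                      dimensions[0], steps[0], steps[1])) {
        return;
    }
    /* scalar fallback, the role UNARY_LOOP plays in the real code */
    for (size_t i = 0; i < dimensions[0]; i++) {
        const float *ip = (const float *)(args[0] + i * steps[0]);
        float *op = (float *)(args[1] + i * steps[1]);
        *op = expf(*ip);
    }
}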

@juliantaylor
Contributor

As far as I know it doesn't apply to exp and log, but in general you also need to take care that scalar and non-scalar code give the same results.

For some functions we unfortunately use different code. As far as I know this applies to, for example, power; see some old experiments of mine: juliantaylor@6c29106

@juliantaylor
Contributor

juliantaylor commented May 12, 2019

With the commented issue fixed and a clarification on the one possibly unnecessary macro check, this looks good.

@charris charris added this to the 1.17.0 release milestone May 12, 2019
@r-devulap
Member Author

@mattip: Can we always expect the stride of the input array in FLOAT_exp_AVX to be constant? What happens in this case?

>>> np.random.rand(10,5)[::2].strides
(80, 8)

@juliantaylor
Contributor

@r-devulap the strides are always constant for the ufunc inner loops.
In your strided case the iterator will copy the data to a contiguous buffer and then call the inner loop on the buffer.

@juliantaylor
Contributor

Actually, for these expensive functions it could be worthwhile to make the iterator always buffer, as it will be faster to copy to a buffer and call the vectorized function than to use the unvectorized one.
It also ensures the same results for all strides of data.

But I am not sure if we can configure that on a per-ufunc level.

@juliantaylor
Contributor

juliantaylor commented May 14, 2019

It is very worthwhile, though currently only because the code calls the vectorized function on a one-element array in a loop:

import numpy as np
d = np.random.rand(1000, 50).astype(np.float32)
print("2d strided -> buffered iterator")
%timeit np.exp(d[::2])
d = np.random.rand(1000 * 50).astype(np.float32)
print("1d strided unbuffered iterator")
%timeit np.exp(d[::2])


2d strided -> buffered iterator
10000 loops, best of 3: 77.3 µs per loop
1d strided unbuffered iterator
1000 loops, best of 3: 1.02 ms per loop

So this is something we should look into before the release; in particular, getting the same result for differently strided data is important. We do have differences with summation depending on strides, and that issue comes up all the time.

I'll open another issue, as it is not directly related to this fix which we should merge soon.

@juliantaylor
Contributor

Opened #13557 for the buffered iterator and for restoring result consistency.

@seberg
Member

seberg commented May 14, 2019

I do not think you can always do this, but I have seen that you (Julian, at least I think so) added iterator flags (and op flags) to the ufunc struct. So, for most loops one could use ufunc->op_flags = {NPY_ITER_CONTIG, NPY_ITER_CONTIG, NPY_ITER_CONTIG}; (or similar).

But I am not sure they are picked up for all things such as reduceat or accumulate.
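
A rough sketch of what that suggestion could look like (based only on the comment above, not on this PR; whether every path such as reduceat or accumulate honours these flags is exactly the open question). PyUFuncObject carries a per-operand op_flags array, so a unary ufunc like exp would have one flag word for its input and one for its output; the helper name here is hypothetical.

#include <Python.h>
#include <numpy/arrayobject.h>
#include <numpy/ufuncobject.h>

/* exp is unary: one flag word for the input operand, one for the output */
static npy_uint32 exp_op_flags[2] = {NPY_ITER_CONTIG, NPY_ITER_CONTIG};

/* hypothetical helper: ask the iterator for contiguous (hence buffered) operands */
static void
request_contiguous_operands(PyUFuncObject *ufunc)
{
    ufunc->op_flags = exp_op_flags;
}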

@juliantaylor
Contributor

juliantaylor commented May 14, 2019

Wait, I forgot this branch still has the bug I commented on when I benchmarked...
Actually, I see no performance difference when using the corrected code. It turns out glibc 2.27's expf is actually slightly faster than our AVX2 vectorized version when called directly, as the specialized ufunc now does.

So we should probably take a step back and do some platform benchmarking to see how many platforms actually already provide a well performing function for us.

But this fix (with the commented issue addressed) can already be merged so that we have correct results and benchmarks again.

@mattip
Member

mattip commented May 14, 2019

@juliantaylor this PR still needs the fix from your comment, correct?

@juliantaylor
Contributor

yes

@mattip
Member

mattip commented May 14, 2019

We are getting close to the deadline for 1.17.0. We should finish this so it gets some testing before RC1

@juliantaylor
Contributor

Hm, so soon already? This is a very significant low-level change and it should not be done hastily.
Also, with this fix I would not consider it finished. At the least we need to ensure equal results regardless of the strides of the input, and I would still like to see some benchmarks on more platforms comparing against system libraries.

@r-devulap
Member Author

r-devulap commented May 14, 2019

@juliantaylor @mattip I am slightly unclear about this comment #13520 (comment). Removing those 3 lines will result in your new test failing, because the outputs of the strided and non-strided cases will differ. I think we need to keep these lines. Am I missing something?

@r-devulap
Member Author

Actually for these expensive functions it could be worthwhile to make the iterator always buffering, as it will be faster to copy to a buffer and call the vectorized function than using the unvectorized one
it also ensures the same results for all strides of data

but I am not sure if we can configure that on a per ufunc level

Instead of copying the entire array to a buffer and calling the vectorized function, wouldn't it be better to use _mm256_mask_i32gather_ps instructions to accommodate strides in the vectorized function? Since the stride is always uniform, this is fairly easy to do and might speed up the strided case.
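
For illustration, a minimal AVX2 sketch of the gather idea (not this PR's code; it assumes the stride is given in elements and that a tail of fewer than 8 elements is handled separately):

#include <immintrin.h>

/* Load 8 float32 values spaced stride_in_elements apart, starting at base.
 * Compile with -mavx2. */
static __m256
gather_strided_ps(const float *base, int stride_in_elements)
{
    /* per-lane indices 0, s, 2s, ..., 7s */
    const __m256i idx = _mm256_mullo_epi32(
        _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0),
        _mm256_set1_epi32(stride_in_elements));
    /* all-ones mask enables every lane; a partial mask would cover the tail */
    const __m256 mask = _mm256_castsi256_ps(_mm256_set1_epi32(-1));
    /* scale = 4 bytes per float */
    return _mm256_mask_i32gather_ps(_mm256_setzero_ps(), base, idx, mask, 4);
}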

@juliantaylor
Contributor

juliantaylor commented May 14, 2019

Oh, I misinterpreted the intention of the line; yes, it is needed for that purpose. Please add a comment to the line explaining why it exists.
That it causes a factor-10 slowdown for strided data is still a concern.

Using gather would be an option to use the same code with lower impact, but the performance versus a buffered iterator should be tested.
Implementing the gather in the iterator buffer creation would be better, as that would benefit all ufuncs, not only float32 exp/log, but as a quick fix one could also do it in those functions and iterate on the solution later.

@juliantaylor
Contributor

The single-element loop is still an issue, as the code currently loads at least 32 bytes of data in a single-element loop, but there is no guarantee that the input data is 32-byte aligned, so the load can cross a page boundary and cause a segmentation fault.
This could be fixed with buffering too.

@r-devulap
Member Author

Sounds good to me. I am working on using gather instructions for exp and log, but will need some time to implement, test, and benchmark. Maybe we can merge this as is and I can submit another patch with gather instructions? Would that be okay?

@juliantaylor
Contributor

juliantaylor commented May 14, 2019

Or it could be fixed by adapting the load_partial_lanes function to handle the unaligned out-of-bounds case via scalar loads.

@r-devulap
Member Author

Loading 32 bytes of data, even across a page boundary, is not a problem with the masked and unaligned loads and stores; that's what the masks are for, right?
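
A small sketch (illustrative, not the PR's actual helper) of the kind of masked tail load being described, where lane i is enabled only when i < remaining, so the disabled lanes past the end of the array are never read:

#include <immintrin.h>

/* Masked tail load of the last `remaining` (< 8) floats at p.
 * Compile with -mavx2. */
static __m256
maskload_tail_ps(const float *p, int remaining)
{
    const __m256i lane = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    const __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(remaining), lane);
    return _mm256_maskload_ps(p, mask);   /* disabled lanes read as zero */
}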

@juliantaylor
Contributor

Ah, I did not know that. Looking at the description, it does say the masked value is not loaded, though it does not explicitly say that the masked memory may be unmapped and will not raise an exception; but that could be inferred from the wording. If that does indeed work, then the code is fine.

@juliantaylor
Contributor

I still have the one comment on the SSE2 preprocessor check, but that does no harm, so it is OK from me for merging.

But we should probably still discuss release timing in the community meeting tomorrow (if there is one).

@r-devulap
Member Author

I added a comment noting the performance issue for strided loops. I will submit another patch to fix that using gather instructions (if that fixes the performance issue). I think this is okay to merge now.

@r-devulap r-devulap changed the title WIP, BUG: exp, log AVX loops do not use steps BUG: exp, log AVX loops do not use steps May 14, 2019
@mattip
Member

mattip commented May 16, 2019

There are merge conflicts.

@r-devulap
Member Author

Rebased on master; no conflicts now.

@mattip mattip merged commit 7be5f11 into numpy:master May 16, 2019
@rgommers rgommers added the component: SIMD label Jul 12, 2022
Labels
00 - Bug, component: numpy._core, component: SIMD