
ENH: use AVX for float32 and float64 implementation of sqrt, square, absolute, reciprocal, rint, floor, ceil and trunc #13885


Merged
merged 8 commits into from
Oct 15, 2019

Conversation

r-devulap
Member

By leveraging AVX gather instructions, this patch enables SIMD processing of strided arrays, which is currently handled in a scalar fashion. Performance of functions like sqrt improves by 9x. Detailed performance numbers are presented below (array size = 10000 for every benchmark):

       before           after         ratio
     [a14a8cef]       [de47213c]
     <master>         <sqrt-sq-rcp-abs-avx>
-     2.27±0.02ns      2.05±0.04ns     0.90  bench_avx.Custom.time_square_stride1_float32
-     3.49±0.04ns      3.11±0.06ns     0.89  bench_avx.Custom.time_square_stride1_float64
-     3.38±0.03ns      2.93±0.03ns     0.87  bench_avx.Custom.time_reciprocal_stride1_float32
-     2.04±0.02ns      1.75±0.02ns     0.86  bench_avx.Custom.time_absolute_stride1_float32
-     3.37±0.02ns      2.78±0.05ns     0.83  bench_avx.Custom.time_absolute_stride1_float64
-     7.15±0.03ns      5.52±0.02ns     0.77  bench_avx.Custom.time_square_stride4_float64
-     7.15±0.03ns      5.04±0.03ns     0.71  bench_avx.Custom.time_square_stride2_float64
-     7.56±0.04ns      4.46±0.03ns     0.59  bench_avx.Custom.time_square_stride4_float32
-     8.94±0.03ns      5.15±0.02ns     0.58  bench_avx.Custom.time_absolute_stride4_float64
-     13.6±0.03ns      7.49±0.03ns     0.55  bench_avx.Custom.time_reciprocal_stride4_float64
-     8.52±0.01ns      4.67±0.02ns     0.55  bench_avx.Custom.time_absolute_stride2_float64
-     13.7±0.05ns      7.48±0.02ns     0.55  bench_avx.Custom.time_reciprocal_stride2_float64
-     7.55±0.05ns      4.13±0.02ns     0.55  bench_avx.Custom.time_square_stride2_float32
-     8.41±0.03ns      4.13±0.01ns     0.49  bench_avx.Custom.time_absolute_stride4_float32
-     7.97±0.03ns      3.80±0.01ns     0.48  bench_avx.Custom.time_absolute_stride2_float32
-     10.6±0.04ns      4.46±0.03ns     0.42  bench_avx.Custom.time_reciprocal_stride4_float32
-     10.5±0.03ns      4.26±0.03ns     0.40  bench_avx.Custom.time_reciprocal_stride2_float32
-     40.6±0.02ns      7.97±0.03ns     0.20  bench_avx.Custom.time_sqrt_stride2_float64
-     40.6±0.05ns      7.97±0.05ns     0.20  bench_avx.Custom.time_sqrt_stride4_float64
-     37.5±0.03ns      4.24±0.02ns     0.11  bench_avx.Custom.time_sqrt_stride4_float32
-     37.5±0.02ns      4.06±0.02ns     0.11  bench_avx.Custom.time_sqrt_stride2_float32
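The gather-based strided loop can be modeled in pure Python (a simplified sketch, not the actual C implementation; a width of 8 mimics an AVX512 float64 or AVX float32 vector, and all names here are illustrative):

```python
import math

def strided_sqrt(ip, stride, n, width=8):
    """Model of a gather-based SIMD loop: process `width` strided
    elements per iteration instead of one element at a time."""
    op = [0.0] * n
    i = 0
    while i + width <= n:
        # AVX gather: fetch `width` elements that sit `stride` apart
        vec = [ip[(i + k) * stride] for k in range(width)]
        res = [math.sqrt(v) for v in vec]  # one vsqrtps on real hardware
        for k in range(width):
            op[i + k] = res[k]
        i += width
    while i < n:  # scalar tail for the remaining elements
        op[i] = math.sqrt(ip[i * stride])
        i += 1
    return op

data = [float(x) for x in range(40)]
assert strided_sqrt(data, 2, 20)[4] == math.sqrt(8.0)
```

The scalar fallback previously ran the tail loop over the whole array for any stride > 1, which is where the 9x gap comes from.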

@r-devulap
Member Author

The first 10 commits overlap with PR #13368, I will rebase if and when PR #13368 gets merged.

@r-devulap
Member Author

Wonder why I assumed npy_abs was a legit function :/ Will fix that error ...

@r-devulap
Member Author

r-devulap commented Jul 3, 2019

The CI failure comes from a test newly added by this patch, which fails on Windows Python 3.6 32-bit:

with np.errstate(invalid='raise'):
    assert_raises(FloatingPointError, np.sqrt, np.array(-100., dtype=dt))

Ideally, an FP exception should be raised when a negative input is passed to sqrt (as it is on other platforms). Does anyone know why this fails on Windows py3.6 32-bit?

@r-devulap r-devulap force-pushed the sqrt-sq-rcp-abs-avx branch 2 times, most recently from 76a620b to 2bfb768 on July 10, 2019 03:22
@r-devulap r-devulap changed the title ENH: use AVX for sqrt, square, absolute and reciprocal ENH: use AVX for float32 and float64 implementation of sqrt, square, absolute, reciprocal, rint, floor, ceil and trunc Jul 10, 2019
@r-devulap
Member Author

Added AVX implementations of floor, ceil, trunc and rint. As with the other AVX implementations, these handle strided arrays as well. These functions see up to a 14x speed up. Detailed numbers presented below:

before           after         ratio
[af5a1084]       [2bfb7685]
<master>         <sqrt-sq-rcp-abs-avx>
19.5±0.01ns      5.17±0.02ns     0.27  bench_avx.Custom.time_rint_stride4_f64
20.6±0.02ns      5.19±0.01ns     0.25  bench_avx.Custom.time_trunc_stride4_f64
19.5±0.02ns      4.71±0.01ns     0.24  bench_avx.Custom.time_rint_stride2_f64
17.6±0.01ns      4.13±0.02ns     0.24  bench_avx.Custom.time_rint_stride4_f32
20.5±0.01ns      4.70±0.02ns     0.23  bench_avx.Custom.time_trunc_stride2_f64
24.1±0.03ns      5.19±0.02ns     0.22  bench_avx.Custom.time_ceil_stride4_f64
17.6±0.01ns      3.77±0.01ns     0.21  bench_avx.Custom.time_rint_stride2_f32
20.5±0.01ns      4.12±0.01ns     0.20  bench_avx.Custom.time_trunc_stride4_f32
23.9±0.01ns      4.69±0.01ns     0.20  bench_avx.Custom.time_ceil_stride2_f64
26.4±0.02ns      5.18±0.02ns     0.20  bench_avx.Custom.time_floor_stride4_f64
20.4±0.01ns      3.78±0.01ns     0.19  bench_avx.Custom.time_trunc_stride2_f32
26.3±0.02ns      4.70±0.01ns     0.18  bench_avx.Custom.time_floor_stride2_f64
23.8±0.01ns      4.13±0.01ns     0.17  bench_avx.Custom.time_ceil_stride4_f32
23.8±0.01ns      3.81±0.01ns     0.16  bench_avx.Custom.time_ceil_stride2_f32
26.2±0.01ns      4.12±0.02ns     0.16  bench_avx.Custom.time_floor_stride4_f32
26.0±0.20ns      3.79±0.02ns     0.15  bench_avx.Custom.time_floor_stride2_f32
19.2±0.01ns      2.71±0.07ns     0.14  bench_avx.Custom.time_rint_stride1_f64
20.2±0.02ns      2.73±0.07ns     0.13  bench_avx.Custom.time_trunc_stride1_f64
23.6±0.01ns      2.72±0.07ns     0.12  bench_avx.Custom.time_ceil_stride1_f64
25.9±0.02ns      2.70±0.08ns     0.10  bench_avx.Custom.time_floor_stride1_f64
17.2±0.01ns      1.76±0.04ns     0.10  bench_avx.Custom.time_rint_stride1_f32
20.1±0.02ns      1.74±0.03ns     0.09  bench_avx.Custom.time_trunc_stride1_f32
23.5±0.02ns      1.73±0.03ns     0.07  bench_avx.Custom.time_ceil_stride1_f32
25.9±0.01ns      1.73±0.04ns     0.07  bench_avx.Custom.time_floor_stride1_f32
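The benchmarks above run the rounding ufuncs over strided views; a minimal reproduction of that setup (illustrative, not the actual benchmark code — the real benchmarks live in bench_avx.py) looks like:

```python
import numpy as np

# Strided views exercise the new strided inner loops (stride 4, float32,
# mirroring e.g. time_floor_stride4_f32 above).
n, stride = 10000, 4
a = np.linspace(-5, 5, n * stride, dtype=np.float32)
strided = a[::stride]                  # non-contiguous view of n elements
for ufunc in (np.floor, np.ceil, np.trunc, np.rint):
    out = ufunc(strided)
    # The strided loop must match the contiguous (stride-1) loop exactly
    assert np.array_equal(out, ufunc(strided.copy()))
```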

@r-devulap r-devulap force-pushed the sqrt-sq-rcp-abs-avx branch from 2bfb768 to 226ae20 on August 24, 2019 15:21
@r-devulap
Member Author

Rebased with master. Can someone help review the code, please?

@r-devulap
Member Author

@mattip @charris, any way forward with this patch?

@mattip
Member

mattip commented Sep 1, 2019

Should these have tests added to numpy/core/tests/test_umath_accuracy.py via some numpy/core/tests/data/umath-validation*? I guess abs, floor, ceil, truncate are quite straightforward, but maybe for reciprocal, sqrt, square?

@r-devulap
Member Author

I do not think that is necessary. There is no change to the way these functions are computed, and all of them directly use the hardware instructions anyway (sqrt is computed using the vsqrtps/vsqrtpd instruction, reciprocal is just vdivps, and square is just vmulps(x,x)). I did add a bunch of tests to validate special-value floats and strided arrays to make sure I didn't break functionality.
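The special-value expectations those tests encode can be sketched in pure Python (a model of the semantics only; the real tests run the NumPy ufuncs over strided arrays):

```python
import math

# Scalar models of the ufuncs under test; names are illustrative.
def square(x):
    return x * x

def reciprocal(x):
    return 1.0 / x

nan, inf = float('nan'), float('inf')
assert math.isnan(square(nan))                    # nan propagates
assert square(inf) == inf and square(-inf) == inf
assert reciprocal(inf) == 0.0
assert math.isnan(math.sqrt(nan))                 # sqrt(nan) -> nan
```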

@mattip mattip requested a review from juliantaylor September 1, 2019 18:45
@mattip
Member

mattip commented Sep 1, 2019

OK. The code, as much as I understand it, looks reasonable. Maybe someone else would like to take a look?

@r-devulap
Member Author

ping .. anyone else who can review the code?

@r-devulap
Member Author

@mattip how do you want to proceed with this patch?

@mattip
Member

mattip commented Sep 23, 2019

@r-devulap sorry this is taking so long.

I think this PR has the same problem with duplicate loops that gh-14554 cleaned up, could you take a quick look? Also, a rebase/merge with master is needed to clear the conflict.

FWIW, adding this code on my Ubuntu 18.04 dev system makes the wheel grow to 10_655_010 bytes, adding 45_101 bytes or about 0.5%, and it makes the _multiarray_umath shared object grow by 200_432 bytes to 17_778_296 bytes or about 1.1%.

I think this is acceptable for the speed boost, so I propose to merge it.

@r-devulap
Member Author

thanks @mattip, I will take a look and fix it.

(1) Workaround for a bug in clang6: added a missing GCC attribute to the
prototype of the ISA_sqrt_TYPE function, which otherwise leads to a weird
build failure in clang6 (gcc and clang7.0 don't have this issue)

(2) Changed np.float128 to np.longdouble in tests: NumPy on Windows
doesn't support the np.float128 dtype

(3) GCC 4.8/5.0 doesn't support the _mm512_abs_ps/pd intrinsic;
clang6 generates an invalid exception when computing the abs value of
+/-np.nan.
@r-devulap
Member Author

Rebased with master and added a commit to fix the duplicated inner loop for float16. Let me know if that looks correct.

@juliantaylor
Contributor

juliantaylor commented Sep 28, 2019

Please independently verify these benchmarks. We have often received PRs using AVX for e.g. sqrt, and we always rejected them because the benchmarks turned out to be wrong: there were no performance improvements, as the hardware did not actually use the wider registers in parallel.

I think the last time I verified this was on a Skylake Xeon Gold ... which should still be the latest Intel generation.

@juliantaylor
Contributor

Do we need manually written code for this?
The compiler should be able to vectorize these functions fine, like it does for our AVX integer code.

@r-devulap
Member Author

  1. This patch does not significantly improve performance of sqrt, reciprocal, square and absolute for stride = 1 (as can be seen in the performance numbers). These are memory-bound functions, and hence AVX provides no significant benefit over SSE. The patch I submitted a while back (ENH: Add AVX2 and AVX512 functionality for numpy.sqrt #12459) missed exactly this point.
  2. The main feature this patch brings is vectorizing strided arrays, which are currently processed in a scalar fashion (gather instructions were not supported with SSE). It would be great if someone else could corroborate the performance numbers for strided arrays, but since we are moving from scalar to vector processing, I am pretty confident the numbers should hold.
  3. floor, rint, ceil and trunc are currently processed scalar for all strides, and gcc refuses to auto-vectorize them. When I tried to compile a simple loop with gcc-9:
for (int ii = 0; ii < N; ii++)                                              
    op[ii] = floor(ip[ii]); 

gcc refuses to vectorize this and reports: `missed: not vectorized: relevant stmt not supported: _23 = __builtin_floorf (_22)`. And I doubt gcc can vectorize strided arrays.

@r-devulap
Member Author

@juliantaylor @mattip ping ...

@mattip
Member

mattip commented Oct 8, 2019

Did you commit the benchmarks you are showing above? I do not see them in the benchmarking results on a machine with AVX512.

@r-devulap
Member Author

I just added a commit for the benchmarks.

@mattip
Member

mattip commented Oct 10, 2019

Comparing the pre- and post-PR benchmarks, I see large changes. The bench_avx parameters are ufunc, stride, dtype.

% asv compare 68bd6e35 5ee46de5 --only-changed -f 1.3 --sort ratio
· `wheel_cache_size` has been renamed to `build_cache_size`. Update your \ 
`asv.conf.json` accordingly.
       before           after         ratio
     [68bd6e35]       [5ee46de5]
     <sqrt-sq-rcp-abs-avx~8>       <sqrt-sq-rcp-abs-avx~1>
-         277±1μs          204±1μs     0.73  bench_ufunc.UFunc.time_ufunc_types('trunc')
-         293±2μs          213±3μs     0.73  bench_ufunc.UFunc.time_ufunc_types('ceil')
-         307±2μs        218±0.9μs     0.71  bench_ufunc.UFunc.time_ufunc_types('floor')
-     7.94±0.03μs       5.33±0.2μs     0.67  bench_avx.AVX_UFunc.time_ufunc('square', 4, 'd')
-      7.40±0.5μs      4.83±0.01μs     0.65  bench_avx.AVX_UFunc.time_ufunc('square', 2, 'd')
-      12.3±0.5μs      7.14±0.04μs     0.58  bench_avx.AVX_UFunc.time_ufunc('reciprocal', 4, 'd')
-     7.89±0.02μs       4.51±0.2μs     0.57  bench_avx.AVX_UFunc.time_ufunc('square', 4, 'f')
-      7.07±0.5μs      3.99±0.01μs     0.56  bench_avx.AVX_UFunc.time_ufunc('square', 2, 'f')
-     14.1±0.01μs       7.76±0.3μs     0.55  bench_avx.AVX_UFunc.time_ufunc('reciprocal', 2, 'd')
-      9.35±0.5μs       5.05±0.2μs     0.54  bench_avx.AVX_UFunc.time_ufunc('absolute', 4, 'd')
-     7.44±0.03μs      4.00±0.02μs     0.54  bench_avx.AVX_UFunc.time_ufunc('absolute', 2, 'f')
-      7.59±0.5μs         3.96±0μs     0.52  bench_avx.AVX_UFunc.time_ufunc('absolute', 4, 'f')
-      10.8±0.3μs       4.87±0.2μs     0.45  bench_avx.AVX_UFunc.time_ufunc('absolute', 2, 'd')
-      10.2±0.7μs       4.34±0.2μs     0.43  bench_avx.AVX_UFunc.time_ufunc('reciprocal', 4, 'f')
-      10.2±0.7μs       4.31±0.2μs     0.42  bench_avx.AVX_UFunc.time_ufunc('reciprocal', 2, 'f')
-     15.8±0.01μs      4.39±0.01μs     0.28  bench_avx.AVX_UFunc.time_ufunc('rint', 4, 'f')
-     20.3±0.01μs       5.13±0.2μs     0.25  bench_avx.AVX_UFunc.time_ufunc('rint', 4, 'd')
-     20.3±0.02μs      5.04±0.03μs     0.25  bench_avx.AVX_UFunc.time_ufunc('trunc', 4, 'd')
-        21.1±2μs      5.00±0.02μs     0.24  bench_avx.AVX_UFunc.time_ufunc('rint', 2, 'd')
-     36.4±0.02μs       8.25±0.3μs     0.23  bench_avx.AVX_UFunc.time_ufunc('sqrt', 4, 'd')
-     18.4±0.01μs      4.04±0.02μs     0.22  bench_avx.AVX_UFunc.time_ufunc('trunc', 4, 'f')
-        23.3±2μs      5.11±0.03μs     0.22  bench_avx.AVX_UFunc.time_ufunc('ceil', 4, 'd')
-        21.1±2μs      4.59±0.01μs     0.22  bench_avx.AVX_UFunc.time_ufunc('trunc', 2, 'd')
-     23.2±0.01μs       4.99±0.2μs     0.21  bench_avx.AVX_UFunc.time_ufunc('floor', 2, 'd')
-     21.4±0.01μs      4.58±0.04μs     0.21  bench_avx.AVX_UFunc.time_ufunc('ceil', 2, 'd')
-     36.3±0.01μs       7.66±0.2μs     0.21  bench_avx.AVX_UFunc.time_ufunc('sqrt', 2, 'd')
-     18.3±0.04μs       3.77±0.1μs     0.21  bench_avx.AVX_UFunc.time_ufunc('rint', 2, 'f')
-      21.4±0.5μs       4.07±0.1μs     0.19  bench_avx.AVX_UFunc.time_ufunc('ceil', 4, 'f')
-     27.1±0.04μs      5.04±0.02μs     0.19  bench_avx.AVX_UFunc.time_ufunc('floor', 4, 'd')
-        20.9±1μs      3.74±0.02μs     0.18  bench_avx.AVX_UFunc.time_ufunc('trunc', 2, 'f')
-     22.9±0.03μs      4.03±0.02μs     0.18  bench_avx.AVX_UFunc.time_ufunc('floor', 4, 'f')
-        23.0±2μs       4.04±0.1μs     0.18  bench_avx.AVX_UFunc.time_ufunc('ceil', 2, 'f')
-        20.2±1μs      3.10±0.04μs     0.15  bench_avx.AVX_UFunc.time_ufunc('rint', 1, 'd')
-        26.5±1μs      4.04±0.02μs     0.15  bench_avx.AVX_UFunc.time_ufunc('floor', 2, 'f')
-     21.4±0.02μs      3.18±0.03μs     0.15  bench_avx.AVX_UFunc.time_ufunc('ceil', 1, 'd')
-        21.8±2μs      3.20±0.04μs     0.15  bench_avx.AVX_UFunc.time_ufunc('trunc', 1, 'd')
-      15.8±0.2μs      2.12±0.05μs     0.13  bench_avx.AVX_UFunc.time_ufunc('rint', 1, 'f')
-        33.6±3μs       4.01±0.2μs     0.12  bench_avx.AVX_UFunc.time_ufunc('sqrt', 2, 'f')
-     26.8±0.02μs      3.19±0.04μs     0.12  bench_avx.AVX_UFunc.time_ufunc('floor', 1, 'd')
-     18.3±0.01μs      2.12±0.02μs     0.12  bench_avx.AVX_UFunc.time_ufunc('trunc', 1, 'f')
-        36.5±3μs      4.08±0.01μs     0.11  bench_avx.AVX_UFunc.time_ufunc('sqrt', 4, 'f')
-        22.9±2μs      2.18±0.02μs     0.10  bench_avx.AVX_UFunc.time_ufunc('ceil', 1, 'f')
-     26.6±0.04μs      2.18±0.02μs     0.08  bench_avx.AVX_UFunc.time_ufunc('floor', 1, 'f')

I am not sure why the single strided cases are showing such a large speed-up. Do these results make sense?

@r-devulap
Member Author

I think these numbers make sense. NumPy's current implementation of the rounding functions ceil, floor, rint and trunc is scalar even for stride 1, so the 10x speed up for these functions is expected (my numbers are similar too). As expected, we do not see any significant speed up for sqrt, square, reciprocal and absolute at stride 1, since these are currently implemented with SSE.

@r-devulap
Member Author

just out of curiosity, what CPU did you run these benchmarks on?

@mattip
Member

mattip commented Oct 10, 2019

Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz

@mattip
Member

mattip commented Oct 10, 2019

@juliantaylor ping

Comment on lines +1692 to +1697
if (!run_unary_@isa@_sqrt_@TYPE@(args, dimensions, steps)) {
UNARY_LOOP {
const @type@ in1 = *(@type@ *)ip1;
*(@type@ *)op1 = npy_sqrt@typesub@(in1);
}
}
Member

Is the compiler able to generate the AVX code automatically if you use

if (IS_OUTPUT_BLOCKABLE_UNARY(sizeof(@type@), @REGISTER_SIZE@)) {
    UNARY_LOOP { ... }
}
else {
    // as above
    UNARY_LOOP { ... }
}

We use this trick in all sorts of places today to encourage it to generate optimized code.

Member Author

r-devulap commented Oct 10, 2019

I tried several options with GCC-9.2 and found the following:

  • Any compiler-generated vectorized loop for floating point seems to require extra compiler options like -ffast-math (see https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using). Here is the code for an example of the sqrt loop with and without this option. There are several problems with this path: (1) -ffast-math obviously should not be used as a global compile option, and (2) the code generated with this option ends up using a combination of vrsqrt14ps and vmulps instructions to compute the square root, which is neither accurate nor fast (vrsqrt14ps is only accurate up to the 6th decimal place, and I have no idea why even the latest GCC won't use a simple vsqrtps instruction instead!)

  • The other problem is that no matter what option I tried, I could not get GCC to vectorize the strided array case (see an example here). Even if we were somehow able to properly vectorize the stride = 1 case, as far as I know we cannot auto-vectorize for general strided arrays.

Member Author

r-devulap commented Oct 11, 2019

I finally learnt why gcc won't use vsqrtps! The vrsqrt14ps instruction takes 1-3 cycles, whereas vsqrtps takes > 14 cycles. So it's basically faster to compute invsqrt, multiply it with the input and then correct it with one step of Newton-Raphson than to compute an accurate sqrt directly. -ffast-math obviously chooses speed over accuracy. This logic works for single precision but not for double precision, where gcc uses the vsqrtpd instruction (see code here) :)
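The fast-math trick gcc uses can be modeled in Python: start from a low-precision reciprocal-square-root estimate and refine it with one Newton-Raphson step (a sketch of the idea; the actual error profile of vrsqrt14ps differs):

```python
import math

def newton_rsqrt_step(x, y):
    """One Newton-Raphson refinement of y ~ 1/sqrt(x)."""
    return y * (1.5 - 0.5 * x * y * y)

x = 2.0
# Emulate a ~6-decimal-digit estimate, as vrsqrt14ps would return
y = (1.0 / math.sqrt(x)) * (1 + 1e-6)
y = newton_rsqrt_step(x, y)    # quadratic convergence: ~12 digits now
approx_sqrt = x * y            # sqrt(x) = x * rsqrt(x)
assert abs(approx_sqrt - math.sqrt(x)) < 1e-10
```

One cheap estimate plus one fused-multiply-heavy refinement step beats the long-latency vsqrtps, which is exactly the trade -ffast-math makes.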

@mattip mattip merged commit c7f532e into numpy:master Oct 15, 2019
@mattip
Member

mattip commented Oct 15, 2019

Thanks @r-devulap.

@r-devulap
Member Author

thanks @mattip !

/*
 * Replace masked elements with 1.0f to avoid divide by zero fp
 * exception in reciprocal
 */
x = @isa@_set_masked_lanes_ps(x, ones_f, inv_load_mask);
Member

I'm trying to understand this line: how and why?

Member

@r-devulap, you forgot to remove it, right?

Member Author

The masked load instruction loads 0 for elements where the mask is set to 0 (which happens for the trailing end of the array). For reciprocal, this causes a 1/0, which raises a divide-by-zero exception. This line replaces the zeros with ones to avoid that exception.
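In pure Python the masked-lane fix looks like this (a model of the C intrinsic sequence; function names are illustrative, not NumPy's):

```python
def masked_load(data, i, width, mask):
    # Masked-out lanes read as 0.0, like a maskz load intrinsic
    return [data[i + k] if mask[k] else 0.0 for k in range(width)]

def set_masked_lanes(vec, fill, mask):
    # Replace lanes whose mask bit is 0 with `fill`
    return [v if m else fill for v, m in zip(vec, mask)]

data = [4.0, 2.0, 8.0]            # trailing tail: only 3 of 8 lanes valid
mask = [1, 1, 1, 0, 0, 0, 0, 0]
x = masked_load(data, 0, 8, mask)
x = set_masked_lanes(x, 1.0, mask)   # zeros -> 1.0, so 1/x never hits 1/0
recip = [1.0 / v for v in x]         # only the first 3 lanes get stored back
assert recip[:3] == [0.25, 0.5, 0.125]
```

The garbage 1.0 results in the masked-out lanes are simply never written back to the output array.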

Member

@r-devulap, "the trailing end of the array", oh it makes sense now. thank you

@rgommers rgommers added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Jul 12, 2022
Labels
01 - Enhancement 03 - Maintenance component: numpy._core component: SIMD Issues in SIMD (fast instruction sets) code or machinery
7 participants