ENH: use AVX for float32 and float64 implementation of sqrt, square, absolute, reciprocal, rint, floor, ceil and trunc #13885
Merged
8 commits
c37fb93 ENH: use AVX for sqrt, square, reciprocal and absolute value (r-devulap)
d874804 BUG: fixing multiple CI failures (r-devulap)
344b40f BUG: ignore invalid exception raised by absolute (r-devulap)
7a327d0 ENH: use AVX for floor, rint, ceil and trunc (r-devulap)
299e533 TEST: disable raise invalid exception test for sqrt (r-devulap)
0286715 MAINT: rebase with master (r-devulap)
5ee46de MAINT: removing duplicated inner loop for e->e (r-devulap)
5323bbf BENCH: adding benchmarks for avx based ufuncs (r-devulap)
@@ -0,0 +1,34 @@
from __future__ import absolute_import, division, print_function

from .common import Benchmark

import numpy as np

avx_ufuncs = ['sqrt',
              'absolute',
              'reciprocal',
              'square',
              'rint',
              'floor',
              'ceil' ,
              'trunc']
stride = [1, 2, 4]
dtype = ['f', 'd']

class AVX_UFunc(Benchmark):
    params = [avx_ufuncs, stride, dtype]
    param_names = ['avx_based_ufunc', 'stride', 'dtype']
    timeout = 10

    def setup(self, ufuncname, stride, dtype):
        np.seterr(all='ignore')
        try:
            self.f = getattr(np, ufuncname)
        except AttributeError:
            raise NotImplementedError()
        N = 10000
        self.arr = np.ones(stride*N, dtype)

    def time_ufunc(self, ufuncname, stride, dtype):
        self.f(self.arr[::stride])
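For readers who want to try the same measurement without the asv harness, the following is a minimal sketch that mirrors the benchmark's setup using only `timeit` and NumPy. The helper names `strided_view` and `time_ufunc` and the `number=200` repeat count are assumptions made for illustration; they are not part of the PR.

```python
import timeit

import numpy as np

N = 10000  # same element count as the benchmark's setup()


def strided_view(stride, dtype):
    """Build the array the way setup() does and return the strided view."""
    arr = np.ones(stride * N, dtype)
    return arr[::stride]


def time_ufunc(name, stride, dtype, number=200):
    """Return the mean seconds per call of np.<name> on the strided view."""
    func = getattr(np, name)
    view = strided_view(stride, dtype)
    return timeit.timeit(lambda: func(view), number=number) / number


if __name__ == "__main__":
    for stride in (1, 2, 4):
        view = strided_view(stride, "f")
        # Every view has N elements; only the byte step between them changes.
        print(stride, view.size, view.strides, time_ufunc("sqrt", stride, "f"))
```

Because the backing array is allocated with `stride*N` elements, every parameter combination processes the same number of elements, so timing differences between strides reflect the memory access pattern and not the amount of work.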
Is the compiler able to generate the avx code automatically if you use
We use this trick in all sorts of places today to encourage it to generate optimized code.
I tried several options with GCC 9.2 and found the following:
Any compiler-generated vectorized loop for floating point seems to require extra compiler options such as -ffast-math (see https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using). Here is the code for an example of the sqrt loop with and without this option. There are several problems with this path: (1) -ffast-math obviously should not be used as a global compile option, and (2) the code generated with this option ends up using a combination of `vrsqrt14ps` and `vmulps` instructions to compute the square root, which is neither accurate nor fast (`vrsqrt14ps` is only accurate up to the 6th decimal place, and I have no idea why even the latest GCC won't use a simple `vsqrtps` instruction instead!).
The other problem is that, no matter what option I try, I could not get GCC to vectorize the strided array case (see an example here). Even if we were somehow able to properly vectorize the case where stride = 1, as far as I know we cannot auto-vectorize for general strided arrays.
I finally learnt why GCC won't use `vsqrtps`! The `vrsqrt14ps` instruction takes 1-3 cycles, whereas `vsqrtps` takes > 14 cycles. So it's basically faster to compute `invsqrt`, multiply it by the input, and then correct it with one step of Newton-Raphson than to compute an accurate sqrt directly. -ffast-math obviously chooses speed over accuracy. This logic works for single precision and not for double precision, where it uses the `vsqrtpd` instruction (see code here) :)
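The estimate-then-refine idea described above can be sketched in plain Python. This is only a numerical model, not the instruction sequence GCC emits: `approx_rsqrt` is a hypothetical stand-in for the hardware estimate, its relative error of 2**-14 is an assumption suggested by the instruction name, and all arithmetic is done in double precision, so float32 rounding is not modelled.

```python
import math

# Assumed relative error of the hardware reciprocal-sqrt estimate (hypothetical).
ESTIMATE_REL_ERROR = 2.0 ** -14


def approx_rsqrt(x):
    """Stand-in for the fast estimate: 1/sqrt(x) with a small relative error.

    x must be positive.
    """
    return (1.0 / math.sqrt(x)) * (1.0 + ESTIMATE_REL_ERROR)


def refine_rsqrt(x, y):
    """One Newton-Raphson step for f(y) = 1/y**2 - x."""
    return y * (1.5 - 0.5 * x * y * y)


def sqrt_via_rsqrt(x, refine=True):
    """Approximate sqrt(x) as x * rsqrt(x), optionally refining the estimate."""
    y = approx_rsqrt(x)
    if refine:
        y = refine_rsqrt(x, y)
    return x * y


def rel_error(x, refine):
    """Relative error of the approximation against math.sqrt."""
    exact = math.sqrt(x)
    return abs(sqrt_via_rsqrt(x, refine) - exact) / exact


if __name__ == "__main__":
    for x in (2.0, 10.0, 12345.678):
        print(x, rel_error(x, refine=False), rel_error(x, refine=True))
```

In this model a single refinement step roughly squares the relative error of the estimate, which is why one cheap estimate plus a few multiplies can be an attractive trade under -ffast-math even though the result is still not correctly rounded.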