ENH: AVX support for exp/log for strided float32 arrays #13581
(1) AVX2 and AVX512F support the gather_ps instruction, which loads strided data into a register. Using it instead of falling back to scalar code gives a good speedup when the input array is strided. (2) Added some tests to validate the AVX-based algorithms for exp and log. The tests compare the float32 output against glibc's float64 implementation and verify the max ULP error.
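To make point (1) concrete, here is a minimal pure-Python model of what a gather load does for a strided input. It is only a sketch of the semantics, not the intrinsic or the code in this patch; the helper names `make_index_array` and `gather_load`, the buffer, and the choice of 8 lanes are all assumptions made for illustration.

```python
def make_index_array(stride, num_lanes):
    """Element offset of each lane relative to the current position."""
    return [lane * stride for lane in range(num_lanes)]


def gather_load(buf, start, index_array):
    """Model of a gather: read buf[start + offset] for every lane offset."""
    return [buf[start + offset] for offset in index_array]


# Hypothetical example: 8 lanes (one 256-bit register of float32), stride 3.
buf = [float(i) for i in range(32)]
idx = make_index_array(stride=3, num_lanes=8)
lanes = gather_load(buf, start=0, index_array=idx)

# The gathered lanes match what a strided slice of the buffer would yield.
assert lanes == buf[0:24:3]
```

A scalar fallback would perform the same reads one element at a time; the gather form expresses them as a single register-wide load driven by the index array.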
Seems like a random CI failure that has nothing to do with the patch:
numpy/core/src/umath/simd.inc.src
@vtype@ x = @isa@_masked_load(load_mask, ip);
num_lanes);
@vtype@ x;
if (stride == 1)
Always put { } around loops and conditions, even if the body is only one line.
fixed this.
Sorry, I should have been more specific. The NumPy style for conditionals and loops is:
if () {
}
else {
}
Ah, sorry about that. It should be more consistent now.
strides = np.random.randint(low=-100, high=100, size=100)
sizes = np.random.randint(low=1, high=2000, size=100)
for ii in sizes:
    x_f32 = np.float32(np.random.uniform(low=0.01,high=88.1,size=ii))
Denormal floats are excluded here. How is the accuracy for these?
These tests are meant as a sanity check and are not comprehensive. Testing a large sample of float32 values would be slow. But the max ULP errors of 2.6 and 3.9 hold even for denormals (I validated this separately by enumerating all float32 numbers).
ok thanks
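For readers unfamiliar with the metric discussed in this thread, the sketch below shows one common way to count the distance in ULPs between two float32 values, using only the Python standard library. It is not the method used by the patch's tests or by the separate full enumeration mentioned above; `f32_ordered` and `ulp_distance` are hypothetical names. Because it works on bit patterns, it handles denormals the same way as normal values.

```python
import struct


def f32_ordered(x):
    """Map a float (rounded to float32) to an integer that preserves ordering."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Negative floats are stored sign-magnitude; flip them so that the
    # integers increase monotonically with the float value.
    return -(bits & 0x7FFFFFFF) if bits & 0x80000000 else bits


def ulp_distance(a, b):
    """Number of float32 steps needed to get from a to b."""
    return abs(f32_ordered(a) - f32_ordered(b))


smallest_denormal = 2.0 ** -149   # smallest positive float32 denormal
smallest_normal = 2.0 ** -126     # smallest positive normal float32

assert ulp_distance(0.0, smallest_denormal) == 1
assert ulp_distance(smallest_normal, smallest_normal - smallest_denormal) == 1
assert ulp_distance(1.0, 1.0 + 2.0 ** -23) == 1
```

This integer count only compares two float32 values. Fractional figures such as 2.6 come from comparing against a higher-precision reference, which this sketch does not attempt.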
Looks good. I get very different performance numbers on my AMD Ryzen 7 1700X with glibc 2.27.
On an old Intel i5-4310M with the same OS it is significantly faster.
Force-pushed from 7fc34bb to 3251618.
numpy/core/src/umath/simd.inc.src
const npy_int num_lanes = @BYTES@/sizeof(npy_float);
npy_int indexarr[16];
for (npy_int ii = 0; ii < 16; ii++)
also add braces
for () {
}
fixed
numpy/core/src/umath/simd.inc.src
const npy_int num_lanes = @BYTES@/sizeof(npy_float);
npy_float xmax = 88.72283935546875f;
npy_float xmin = -87.3365478515625f;
npy_int indexarr[16];
for (npy_int ii = 0; ii < 16; ii++)
also add braces
for () {
}
fixed
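The `xmax` and `xmin` constants in the diff fragment quoted in this review thread are not explained here. A quick check, sketched below with only the Python standard library, suggests they lie close to the natural logarithms of the largest finite float32 and of the smallest normal float32. Reading them as overflow and underflow cut-offs for exp is an assumption on my part, not something the patch states; the uppercase names are hypothetical.

```python
import math

XMAX = 88.72283935546875    # value quoted in the diff
XMIN = -87.3365478515625    # value quoted in the diff

FLT_MAX = (2.0 - 2.0 ** -23) * 2.0 ** 127   # largest finite float32
FLT_MIN_NORMAL = 2.0 ** -126                # smallest positive normal float32

# Both differences are tiny compared with the magnitude of the constants.
print(XMAX - math.log(FLT_MAX))
print(XMIN - math.log(FLT_MIN_NORMAL))
```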
Force-pushed from 3251618 to 59b2a1d.
As a point of interest, are these functions currently working correctly in master?
Yes, this patch was only addressing performance issues for strided arrays.
It seems there is an issue with clang 7.0, see #13586