
ENH: AVX support for exp/log for strided float32 arrays #13581


Merged: 2 commits, May 19, 2019

Conversation

r-devulap (Member)

(1) AVX2 and AVX512F support the gather_ps instruction to load strided data
into a register. Rather than falling back to the scalar loop, using gathers
gives a good speedup when the input array is strided.

(2) Added some tests to validate the AVX-based algorithms for exp and log.
The tests compare the float32 output against glibc's float64
implementation and verify the max ULP error.
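
The gather idea, roughly: instead of issuing one scalar load per element, the strided elements are fetched with a single gather instruction. Below is a minimal standalone sketch assuming AVX2 and raw intrinsics; the function name is my own, and the PR's actual code goes through NumPy's templated @isa@ helpers (with an AVX512F variant and mask handling for the remainder).

#include <immintrin.h>

/* Sketch only: load 8 float32 values spaced `stride` elements apart,
 * starting at ip, with one AVX2 gather instead of 8 scalar loads. */
static __m256 load_strided_f32_avx2(const float *ip, int stride)
{
    /* element offsets 0, stride, 2*stride, ..., 7*stride */
    __m256i vindex = _mm256_setr_epi32(0, stride, 2*stride, 3*stride,
                                       4*stride, 5*stride, 6*stride, 7*stride);
    /* scale of 4 (sizeof(float)) converts element offsets to byte offsets */
    return _mm256_i32gather_ps(ip, vindex, 4);
}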

r-devulap (Member, Author)

function  stride  SIMD AVX512   SIMD AVX2     glibc scalar
exp       1       7.05±0.01us   10.9±0.02us   33.3±0.3us
exp       2       7.75±0.03us   12.7±0.03us   33.6±0.2us
exp       4       7.80±0.01us   12.7±0us      33.7±0.01us
exp       8       8.02±0.01us   13.1±0.03us   34.3±0.2us
exp       16      9.47±0.30us   14.2±0.2us    39.4±0.2us

function  stride  SIMD AVX512   SIMD AVX2     glibc scalar
log       1       7.04±0.01us   16.6±0.06us   38.1±0.2us
log       2       8.09±0.05us   18.7±0.2us    38.1±0.09us
log       4       8.12±0.05us   19.0±0.1us    38.3±0.03us
log       8       8.38±0.01us   19.8±0.4us    39.1±0.04us
log       16      9.68±0.08us   21.2±0.1us    42.9±0.6us

r-devulap (Member, Author)

Seems like a random CI failure, nothing to do with the patch:
curl: (56) GnuTLS recv error (-54): Error in the pull function. Unable to download 3.5 archive. The archive may not exist. Please consider a different version.

@vtype@ x = @isa@_masked_load(load_mask, ip);
num_lanes);
@vtype@ x;
if (stride == 1)
Contributor

Always put { } around loops and conditionals, even if the body is only one line.

Member Author

fixed this.

juliantaylor (Contributor), May 18, 2019

Sorry, I should have been more specific; the numpy style for conditionals and loops is:

if () {
}
else {
}

Member Author

Ah, sorry about that; it should be more consistent now.

strides = np.random.randint(low=-100, high=100, size=100)
sizes = np.random.randint(low=1, high=2000, size=100)
for ii in sizes:
x_f32 = np.float32(np.random.uniform(low=0.01,high=88.1,size=ii))
Contributor

Denormal floats are excluded here. How is the accuracy for these?

Member Author

These tests are meant as a sanity check and are not comprehensive at all; it would be slow to test a large sample of float32 values. But the max ULP errors of 2.6 and 3.9 hold even for denormals (something I validated separately by enumerating all float32 numbers).
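
For reference, the ULP metric being checked here can be computed with a bit-pattern trick; a rough sketch of that metric (my illustration, not the PR's test code, and valid only for same-sign finite values):

#include <stdint.h>
#include <string.h>

/* For same-sign finite IEEE-754 floats, adjacent representable values
 * differ by exactly 1 in their bit patterns, so the difference of the
 * bit patterns (taken as integers) is the ULP distance. */
static int32_t ulp_distance(float result, double reference)
{
    float ref32 = (float)reference;   /* round the float64 reference to float32 */
    int32_t a, b;
    memcpy(&a, &result, sizeof(a));
    memcpy(&b, &ref32, sizeof(b));
    return a > b ? a - b : b - a;
}

The maxulp numbers quoted above are this distance between the AVX float32 exp/log output and glibc's float64 exp/log evaluated at the same input.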

Contributor

ok thanks

juliantaylor (Contributor) commented May 18, 2019

looks good.

I get very different performance numbers on my AMD Ryzen 7 1700X with glibc 2.27

import numpy as np
d = np.ones(10000, dtype=np.float32) * 2.132

%timeit np.exp(d)
%timeit np.log(d)
function  avx2  glibc scalar
exp       31us  33us
log       33us  37us

On an old Intel i5-4310M with the same OS, the AVX2 path is significantly faster than glibc scalar:

function  avx2   glibc scalar
exp       196us  369us
log       240us  440us

r-devulap force-pushed the gather-for-avxsimd branch from 7fc34bb to 3251618 on May 18, 2019 16:17
const npy_int num_lanes = @BYTES@/sizeof(npy_float);
npy_int indexarr[16];
for (npy_int ii = 0; ii < 16; ii++)
Contributor

also add braces

for () {
}

Member Author

fixed

const npy_int num_lanes = @BYTES@/sizeof(npy_float);
npy_float xmax = 88.72283935546875f;
npy_float xmin = -87.3365478515625f;
npy_int indexarr[16];
for (npy_int ii = 0; ii < 16; ii++)
Contributor

also add braces

for () {
}

Member Author

fixed
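
In the snippet quoted above, indexarr presumably holds the per-lane element offsets (multiples of the stride) that feed the ISA's gather. A hypothetical standalone AVX512F illustration of that idea, with a load mask covering the final partial vector (the real code uses npy_int/npy_float and the templated @isa@ helpers rather than raw intrinsics):

#include <immintrin.h>
#include <stdint.h>

/* Sketch only: gather up to 16 strided float32 values; masked-off lanes
 * keep the zero value of src, mirroring a partial (masked) load. */
static __m512 gather_strided_f32_avx512(const float *ip, int32_t stride,
                                        __mmask16 load_mask)
{
    int32_t indexarr[16];
    for (int32_t ii = 0; ii < 16; ii++) {
        indexarr[ii] = ii*stride;                /* element offsets */
    }
    __m512i vindex = _mm512_loadu_si512(indexarr);
    /* scale of 4 (sizeof(float)) converts element offsets to byte offsets */
    return _mm512_mask_i32gather_ps(_mm512_setzero_ps(), load_mask,
                                    vindex, ip, 4);
}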

r-devulap force-pushed the gather-for-avxsimd branch from 3251618 to 59b2a1d on May 18, 2019 20:02
charris (Member) commented May 19, 2019

As a point of interest, are these functions currently working correctly in master?

r-devulap (Member, Author)

As a point of interest, are these functions currently working correctly in master?

Yes, this patch was only addressing performance issues for strided arrays.

mattip (Member) commented May 19, 2019

It seems there is an issue with clang 7.0, see #13586

rgommers added the label "component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)" on Jul 12, 2022
Labels: 01 - Enhancement, component: numpy._core, component: SIMD (Issues in SIMD (fast instruction sets) code or machinery)
5 participants