min/max and base math vectorization #3419
Conversation
Improves performance by ~1.5x/3.0x for float/double.
Improves performance by ~1.5x/3.0x for float/double for in-place or CPU-cached operations.
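A rough sketch of how a claim like this could be checked; the array size, iteration count and use of `timeit` are illustrative only and not taken from this PR:

```python
import timeit
import numpy as np

# Illustrative benchmark: in-place add on arrays small enough to stay in
# the CPU cache, which is where the vectorized loops are most visible.
for dtype in (np.float32, np.float64):
    a = np.ones(10000, dtype=dtype)
    b = np.ones(10000, dtype=dtype)
    t = timeit.timeit(lambda: np.add(a, b, out=a), number=10000)
    print(dtype.__name__, t)
```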
For what it's worth, sum and prod have never guaranteed operation order.
@@ -1488,6 +1494,11 @@ NPY_NO_EXPORT void
NPY_NO_EXPORT void
@TYPE@_square(char **args, npy_intp *dimensions, npy_intp *steps, void *NPY_UNUSED(data))
{
    char * margs[] = {args[0], args[0], args[1]};
I'm wondering if this is portable? Some compilers (SUN) would only allow initialization of structs with constants. SUN is history, but I'm not sure it's ancient history. Is it possible to just pass args and steps?
It also looks like this pattern would be a candidate for a macro, maybe something like SIMD_UNARY_LOOP?
Do we really need to support pre-C89 compilers?
It's no problem to do this in three steps, but at some point you have to draw the line on what you want to support.
Currently it's only used twice and I don't see the need to do it more often. It's just so that square and reciprocal are not slower than their explicit counterparts, which do the same thing when the input pointers are equal.
The functions are obsolete on amd64 now.
We could just give it a try and wait for complaints, if any. As you say, it isn't worth supporting obsolete stuff, and C89 isn't that new ;)
There are variations in floating point results anyway, especially on 32-bit Intel, depending on whether the compiler uses SSE or x87 extended precision registers and stores intermediate results in FPU registers or memory. So I'm not sure it is worth worrying about small changes in results; that's just floating point.
I agree we should probably not worry about it much, but one scipy test fails when run with reduction-vectorized numpy, as it expects smaller errors.
Avoids "declared but not defined" warnings.
Someone should update the Cython part of the nditer tutorial.
Performance improvements to base math, `sqrt`, `abs` and `min/max`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The base math (add, subtract, divide, multiply) and `sqrt`, `abs`, `min/max`
operations have been improved to make use of SSE2 CPU SIMD instructions.
That should be maximum/minimum.
Why? The functions are named min/max in the Python API.
The ufuncs that are modified are maximum/minimum, which are binary functions. The max/min equivalent methods are implemented in `numpy/core/_methods.py` using `maximum.reduce`/`minimum.reduce` and are accessed through `amin`/`amax` in `numpy/core/fromnumeric.py`. The Python max/min are different: they treat the array as an iterator.
It is a small point, agreed.
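A quick NumPy-level illustration of the distinction described above (not code from this PR):

```python
import numpy as np

a = np.array([3.0, 1.0, 2.0])

np.maximum(a, 2.0)     # binary ufunc modified here -> array([3., 2., 2.])
np.maximum.reduce(a)   # reduction over the array -> 3.0
np.amax(a)             # fromnumeric wrapper around the reduction -> 3.0
a.max()                # ndarray method, implemented via maximum.reduce
max(a)                 # Python builtin: iterates over the array instead
```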
Then `abs` should be `absolute` too?
Good point. `abs` is a type method defined in `number.c` and calls the `absolute` ufunc. It's a bit of a tangle.
fixed
@matthew-brett If you don't use gcc, could you check this PR for compiler errors on SPARC?
Sorry Chuck - it's one of Yarick Halchenko's Debian boxes - no Sun cc, only gcc 4.4.
Let's give it a shot. Thanks.
min/max and base math vectorization
This pull request includes the rest of the non-result-changing float vectorization.
min/max is a little ugly as it needs to propagate NaN efficiently. Are there platforms where FPU flag propagation is supported but NO_FLOATING_POINT_SUPPORT is set?
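For reference, the NaN propagation requirement seen from the Python level; this is illustrative only and not part of the patch:

```python
import numpy as np

np.maximum(1.0, np.nan)   # nan: maximum/minimum must propagate NaN
np.fmax(1.0, np.nan)      # 1.0: fmax/fmin ignore NaN instead
# A plain SIMD max/min comparison does not give the NaN-propagating
# behaviour for free, hence the extra handling mentioned above.
```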
The base math is lengthy but simple; all the special cases are there to achieve optimal performance for these very common operations.
Base math reductions are not vectorized as they change the results slightly (float add and multiply are not associative).
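A small example of the non-associativity that rules out reordering the reductions; the values are chosen only to make the rounding visible:

```python
a, b, c = 1e16, 1.0, 1.0

print((a + b) + c)   # 1e16: each 1.0 is lost to rounding separately
print(a + (b + c))   # 1.0000000000000002e16

# A vectorized reduction effectively regroups the sum into partial sums,
# so its result could differ slightly from a strictly sequential loop,
# which is why reductions are left unvectorized here.
```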