ENH: implement pairwise summation #3685
Conversation
Interestingly, it makes one test in scipy's test_lsmr.py fail.
To make sure I understand the API here: this just unconditionally uses pairwise summation for all add.reduce calls that go over float_add with IS_BINARY_REDUCE true?
Neat. I assume this is just as fast as the current naive algorithm?
It's even faster :) (about 30%).
while (i < n) {
    /* sum a block with an unrolled loop */
    @type@ r[4] = {0};
I'm not familiar with this idiom -- are you confident that it's standard compliant and supported by MSVC?
It's standard C89: if there are fewer initializers than members, the rest are initialized to the same value as objects with static storage duration, which is 0.
But I'll make it explicit since it's only 4 elements.
Yes, this is a familiar old ISO C idiom.
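For anyone unfamiliar with it, here is a minimal standalone illustration of the partial-initializer rule being discussed (plain C, not NumPy code): elements without an explicit initializer are zero-initialized.

```c
/* Illustration of the C89/C90 partial-initializer rule discussed above
 * (standalone example, not NumPy code): elements without an explicit
 * initializer are set to zero. */
#include <stdio.h>

int main(void)
{
    double r[4] = {0};   /* same as {0.0, 0.0, 0.0, 0.0} */
    double s[4] = {1.5}; /* same as {1.5, 0.0, 0.0, 0.0} */

    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
    printf("%g %g %g %g\n", s[0], s[1], s[2], s[3]);
    return 0;
}
```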
Looks good modulo comments. Some further questions:
And okay, this is just idle curiosity really, but: would it make any sense to do the same for …
Half and complex certainly should use it too; I didn't do them yet to simplify review. I'll add some more tests for sum to make sure it works. I'm not sure if it is applicable to other operations; I'll have to look it up. A big problem is that the reduce ufunc actually works in buffer-sized blocks, so the benefit of the pairwise summation is reduced to blocks of 8192 elements by default.
The inner loop buffering might be historical. @mwiebe Let's ask Mark if he recalls anything about that.
Note that there are good reasons to want to use a smaller buffer by default. I guess one cheap option for commutative ops would be to just always do …
@@ -1356,20 +1356,89 @@ NPY_NO_EXPORT void
 * #C = F, , L#
 */

/*
 * pairwise summation, rounding error O(log(n)) instead of O(n)
 * iterative version of:
Any reason to do this iteratively? A recursive version would be shorter and easier to read, without the arbitrary stack sizes.
Probably not, I did it like this for the exercise. Performance-wise it should not matter, as the recursion depth is logarithmic.
Note: simplified recursive version and unit test here.
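For readers following along, here is a minimal sketch of what such a recursive pairwise sum could look like for plain contiguous doubles. This is a hypothetical simplification, not the code in this PR; the actual loop is generated from @type@ templates, takes a stride, and uses a larger unrolled base case.

```c
/* Hypothetical sketch of recursive pairwise summation over contiguous
 * doubles.  Small blocks are summed naively; larger ranges are split in
 * half and the two partial sums are added, so the rounding error grows
 * as O(log n) and the recursion depth is O(log n) as well. */
#include <stddef.h>

#define PW_BLOCKSIZE 128  /* base-case size; the real code uses a similar cutoff */

static double
pairwise_sum(const double *a, size_t n)
{
    if (n <= PW_BLOCKSIZE) {
        double res = 0.0;
        size_t i;
        for (i = 0; i < n; i++) {
            res += a[i];
        }
        return res;
    }
    else {
        size_t n2 = n / 2;
        return pairwise_sum(a, n2) + pairwise_sum(a + n2, n - n2);
    }
}
```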
ENH: implement pairwise summation
Simple recursive implementation with unrolled base case. Also fixed signed/unsigned issues by making all indices signed. Added a unit test based on @juliantaylor's example. Performance seems unchanged: still about a third faster than before.
Fix missing stride accounting when calling the recursive function. Unroll 8 times to improve accuracy and allow vectorizing with AVX without changing the summation order. Add tests.
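To illustrate the point about the unroll, here is a hedged sketch of an 8-accumulator base case (plain doubles, n assumed to be a multiple of 8 for brevity; this mirrors the idea, not the actual templated diff). Because each accumulator only ever sees every eighth element, a compiler that keeps the eight accumulators in SIMD registers produces the same partial sums as the scalar loop, so vectorization does not change the summation order.

```c
#include <stddef.h>

/* 8-way unrolled block sum: r[k] accumulates the elements with index
 * k mod 8, and the partials are then combined pairwise, matching what a
 * vectorized version of the same loop would compute. */
static double
unrolled_block_sum(const double *a, size_t n)
{
    double r[8] = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
    size_t i;

    for (i = 0; i < n; i += 8) {
        r[0] += a[i + 0];
        r[1] += a[i + 1];
        r[2] += a[i + 2];
        r[3] += a[i + 3];
        r[4] += a[i + 4];
        r[5] += a[i + 5];
        r[6] += a[i + 6];
        r[7] += a[i + 7];
    }
    return ((r[0] + r[1]) + (r[2] + r[3])) +
           ((r[4] + r[5]) + (r[6] + r[7]));
}
```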
Speaking of... what's the status of this? A quick skim looks like it's a win, but it would be a bigger win if we had pairwise block reduction?
Yes, it's a win in accuracy and performance, but the former would be better if the blocks were added the same way.
Complex and half types added.
So now that you added those, you think this is ready to merge? (Even if …)
I think it's ready. I just pushed a small update which removed a no-op line and changed the tests to not use data with a zero in the first element, which improves their ability to find issues regarding the first element (which is already initialized by the iterator in these loops).
 * The recursion depth is O(lg n) as well.
 */
static @type@
pairwise_add_@TYPE@(@dtype@ *a, npy_uintp n, npy_intp stride)
I think we should use the word "sum" here, not "add".
done + updated some comments
Otherwise LGTM.
ENH: implement pairwise summation
Pairwise summation has an average error of O(log(n)) instead of the O(n)
of regular summation.
It is implemented as summing pairs of small blocks of regularly summed
values in order to achieve the same performance as the old sum.
An example of data which profits greatly is
d = np.ones(500000)
(d / 10.).sum() - d.size / 10.
An alternative to pairwise summation is Kahan summation, but in order to
have a low performance penalty one must unroll and vectorize it,
while pairwise summation has the same speed without any vectorization.
The better O(1) error bound is negligible in many cases.
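For comparison, here is a minimal sketch of Kahan (compensated) summation, the alternative mentioned above. This naive version makes no attempt at the unrolling/vectorization that would be needed to hide its extra floating point operations.

```c
#include <stddef.h>

/* Naive Kahan summation: c carries the low-order bits lost in each
 * addition, giving an O(1) error bound at the cost of roughly four
 * floating point operations per element. */
static double
kahan_sum(const double *a, size_t n)
{
    double sum = 0.0;
    double c = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        double y = a[i] - c;
        double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    return sum;
}
```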