
ENH: implement pairwise summation #3685


Merged: 4 commits into numpy:master on Dec 3, 2013

Conversation

juliantaylor
Contributor

Pairwise summation has an average error growth of O(log(n)) instead of the O(n) of regular summation.
It is implemented by pairwise summing small blocks of regularly summed values, in order to achieve the same performance as the old sum.

An example of data that benefits greatly:
d = np.ones(500000)
(d / 10.).sum() - d.size / 10.

An alternative to pairwise summation is Kahan summation, but to keep its performance penalty low one must unroll and vectorize it, whereas pairwise summation reaches the same speed without any vectorization.
Kahan's better error bound of O(1) is negligible in many cases.
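For illustration, a minimal sketch of the scheme in C (not the PR's actual code; the function name and the 128-element base block size are chosen for illustration):

#include <stddef.h>

/* Pairwise summation sketch: blocks of up to 128 elements are summed
 * regularly with four independent accumulators (which unrolls and
 * vectorizes well); larger ranges are split in half and the halves
 * are added pairwise, giving O(log(n)) average error growth. */
static double
pairwise_sum(const double *a, size_t n)
{
    if (n <= 128) {
        double r[4] = {0.0, 0.0, 0.0, 0.0};
        double res;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            r[0] += a[i + 0];
            r[1] += a[i + 1];
            r[2] += a[i + 2];
            r[3] += a[i + 3];
        }
        res = (r[0] + r[1]) + (r[2] + r[3]);
        for (; i < n; i++) {    /* remainder of the block */
            res += a[i];
        }
        return res;
    }
    else {
        size_t n2 = n / 2;      /* split in half, combine pairwise */
        return pairwise_sum(a, n2) + pairwise_sum(a + n2, n - n2);
    }
}

On the example above, a naive left-to-right sum of 500000 values of 0.1 accumulates a visibly larger error than this pairwise variant.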

@juliantaylor
Contributor Author

Interestingly, it makes one test in scipy's test_lsmr.py fail;
apparently it has a set of data where pairwise is slightly worse, by about 10e-6.

@njsmith
Member

njsmith commented Sep 4, 2013

To make sure I understand the api here: this just unconditionally uses
pairwise summation to implement np.add.reduce?

@juliantaylor
Contributor Author

It applies to all add.reduce calls that go through float_add with IS_BINARY_REDUCE true, so this also improves mean/std/var and anything else that uses sum.
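Roughly, the shape of that special case (a hedged sketch with simplified names; numpy's actual loop is @type@-templated in loops.c.src, and pairwise_sum_DOUBLE here stands in for the PR's helper):

/* sketch of the add inner loop: the reduce path sums the whole
 * block pairwise instead of accumulating element by element */
static void
DOUBLE_add(char **args, npy_intp *dimensions, npy_intp *steps)
{
    if (IS_BINARY_REDUCE) {
        /* args[0] is the scalar accumulator of the reduction */
        *(double *)args[0] += pairwise_sum_DOUBLE(args[1],
                                                  dimensions[0], steps[1]);
    }
    else {
        /* ordinary elementwise addition, unchanged by the PR */
        npy_intp i;
        for (i = 0; i < dimensions[0]; i++) {
            *(double *)(args[2] + i * steps[2]) =
                *(double *)(args[0] + i * steps[0]) +
                *(double *)(args[1] + i * steps[1]);
        }
    }
}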

@njsmith
Member

njsmith commented Sep 4, 2013

Neat. I assume this is just as fast as the current naive algorithm?

@juliantaylor
Contributor Author

It's even faster :) (about 30%), probably due to the unrolling.
And it can be vectorized easily.


    while (i < n) {
        /* sum a block with an unrolled loop */
        @type@ r[4] = {0};
Member

I'm not familiar with this idiom -- are you confident that it's standard compliant and supported by MSVC?

Contributor Author

It's standard C89: if there are fewer initializers than members, the rest are initialized as objects with static storage duration, i.e. to zero.
But I'll make it explicit, since it's only 4 elements.

Contributor

Yes, this is a familiar old ISO C idiom.
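For reference, a standalone demonstration of the idiom (any members without an explicit initializer are zero-initialized, as for static storage):

#include <stdio.h>

int main(void)
{
    double r[4] = {0};           /* r[0] explicit; r[1..3] implicitly 0 */
    double s[4] = {0, 0, 0, 0};  /* the fully explicit equivalent */

    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);  /* prints: 0 0 0 0 */
    printf("%g %g %g %g\n", s[0], s[1], s[2], s[3]);  /* prints: 0 0 0 0 */
    return 0;
}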

@njsmith
Member

njsmith commented Sep 5, 2013

Looks good modulo comments.

Some further questions:

  • It looks like this is only implemented for float/double/longdouble. What about half and complex types?
  • The loop unrolling and stack logic looks correct to me, but it's complicated enough, and with enough distinct execution paths, that I'd feel better knowing there were some careful tests checking lots of different array lengths. Do those exist?

And okay, this is just idle curiosity really, but: would it make any sense to do the same for *, and other commutative reduction operations? (I guess logaddexp.reduce would be a particularly obvious candidate, but that would of course require changes to the generic ufunc reduction loop. I guess the ufunc dispatch logic allows for types to override some reduction operations like add.reduce? We don't have two separate implementations of sum, do we?)

@juliantaylor
Contributor Author

Half and complex certainly should use it too; I didn't do it yet to simplify review.

I'll add some more tests for sum to make sure it works.

I'm not sure whether it is applicable to other operations; I'll have to look it up.

A big problem is that the reduce ufunc actually works in buffer-sized blocks, so the benefit of the pairwise summation is limited to blocks of 8192 elements by default; see the sketch below.
Inner-loop growing for reductions is disabled with a TODO.
I tried simply enabling the inner-loop growing in the iterator and all tests still passed; I'm not sure which additional checks mentioned in the TODO are required.
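A hedged sketch of that effect (hypothetical code, not numpy's actual reduction driver; it reuses the pairwise_sum sketch from above):

static double
buffered_sum(const double *a, size_t n)
{
    const size_t buffersize = 8192;
    double acc = 0.0;
    size_t i;
    for (i = 0; i < n; i += buffersize) {
        size_t len = (n - i < buffersize) ? n - i : buffersize;
        /* pairwise inside each buffer: O(log(buffersize)) error ... */
        acc += pairwise_sum(a + i, len);
        /* ... but a plain running sum across buffers, so the error
         * across buffers still grows like O(n / buffersize) */
    }
    return acc;
}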

@charris
Member

charris commented Sep 6, 2013

The inner loop buffering might be historical. @mwiebe Let's ask Mark if he recalls anything about that.

@njsmith
Member

njsmith commented Sep 7, 2013

Note that there are good reasons to want to use a smaller buffer by default, though: it reduces the memory overhead of casting, etc.

I guess one cheap option for commutative ops would be to just always do pairwise reductions when combining inner-loop results on different blocks. I can't see off the top of my head how this could ever systematically reduce accuracy, and it might even save copies compared to a strategy that inserts the result from the previous loop at the beginning of the next loop block.
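As a hypothetical illustration of that option (illustrative names; reuses the pairwise_sum sketch from above and assumes n fits in MAX_BLOCKS buffers):

/* keep one partial result per buffer and combine the partials
 * pairwise at the end, instead of carrying a running total */
static double
blockwise_pairwise_sum(const double *a, size_t n)
{
    enum { BUFSIZE = 8192, MAX_BLOCKS = 4096 };
    double partial[MAX_BLOCKS];
    size_t nblocks = 0;
    size_t i;
    for (i = 0; i < n && nblocks < MAX_BLOCKS; i += BUFSIZE) {
        size_t len = (n - i < BUFSIZE) ? n - i : BUFSIZE;
        partial[nblocks++] = pairwise_sum(a + i, len);
    }
    /* pairwise across blocks too: O(log(n)) error overall */
    return pairwise_sum(partial, nblocks);
}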

@@ -1356,20 +1356,89 @@ NPY_NO_EXPORT void
* #C = F, , L#
*/

/*
* pairwise summation, rounding error O(log(n)) instead of O(n)
* iterative version of:
Contributor

Any reason to do this iteratively? A recursive version would be shorter and easier to read, without the arbitrary stack sizes.

Contributor Author

Probably not; I did it like this for the exercise. Performance-wise it should not matter, as the recursion depth is logarithmic.

@larsmans
Contributor

Note: simplified recursive version and unit test here.

juliantaylor and others added 3 commits November 30, 2013 19:35
Pairwise summation has an average error growth of O(log(n)) instead of the O(n) of regular summation.
It is implemented by pairwise summing small blocks of regularly summed values, in order to achieve the same performance as the old sum.

An example of data that benefits greatly:
d = np.ones(500000)
(d / 10.).sum() - d.size / 10.

An alternative to pairwise summation is Kahan summation, but to keep its performance penalty low one must unroll and vectorize it, whereas pairwise summation reaches the same speed without any vectorization.
Kahan's better error bound of O(1) is negligible in many cases.
Simple recursive implementation with unrolled base case. Also fixed
signed/unsigned issues by making all indices signed.

Added a unit test based on @juliantaylor's example.

Performance seems unchanged: still about a third faster than before.
Fix missing stride accounting when calling the recursive function.
Unroll 8 times to improve accuracy and to allow vectorizing with AVX
without changing the summation order.
Add tests.
@njsmith
Member

njsmith commented Dec 2, 2013

Speaking of... what's the status of this? A quick skim suggests it's a win, but it would be a bigger win if we had pairwise block reduction?

@juliantaylor
Contributor Author

Yes, it's a win in accuracy and performance, but the former would be better if the blocks were added the same way.
I'm currently adding the pairwise sum for complex numbers; then I think it's ready to merge.

  • The blocking can be revisited later.
  • logaddexp needs a different approach (a new ufunc which uses the logsumexp scaling approach of scipy).
  • Multiplication might profit too, but I would need to read up on floating-point semantics (also, it's probably less important).
    • Vectorization can be done easily with how this code is written (but the gain is not so high with SSE, only about 20%); to be revisited later.

@juliantaylor
Contributor Author

Complex and half types added.

@njsmith
Member

njsmith commented Dec 2, 2013

So now that you added those, you think this is ready to merge? (Even if
there's still more to do later?)


@juliantaylor
Contributor Author

I think it's ready. I just pushed a small update which removed a no-op line and changed the tests so they don't use data with a zero in the first element, which improves their ability to find issues with the first element (it is already initialized by the iterator in these loops).

* The recursion depth is O(lg n) as well.
*/
static @type@
pairwise_add_@TYPE@(@dtype@ *a, npy_uintp n, npy_intp stride)
Member

I think we should use the word "sum" here, not "add".

Contributor Author

done + updated some comments

@njsmith
Member

njsmith commented Dec 2, 2013

otherwise LGTM

njsmith added a commit that referenced this pull request Dec 3, 2013
ENH: implement pairwise summation
@njsmith njsmith merged commit 05ab6f4 into numpy:master Dec 3, 2013
@argriffing argriffing mentioned this pull request May 13, 2014