ENH: implement pairwise summation #3685
Conversation
Interestingly, it makes one test in scipy's test_lsmr.py fail.
To make sure I understand the API here: this just unconditionally uses pairwise summation for all add.reduce calls that go over float_add with IS_BINARY_REDUCE true?
Neat. I assume this is just as fast as the current naive algorithm?
It's even faster :) (about 30%).
while (i < n) {
    /* sum a block with an unrolled loop */
    @type@ r[4] = {0};
I'm not familiar with this idiom -- are you confident that it's standard compliant and supported by MSVC?
It's standard C89: if there are fewer initializers than members, the rest are initialized to the same value as objects with static storage duration, which is 0.
But I'll make it explicit since it's only 4 elements.
Yes, this is a familiar old ISO C idiom.
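For anyone unfamiliar with it, here is a minimal standalone illustration of the partial-initializer rule being discussed (plain C, not NumPy code): elements without an explicit initializer are zero-initialized.

```c
/* Illustration of the C89/C90 partial-initializer rule discussed above
 * (standalone example, not NumPy code): elements without an explicit
 * initializer are set to zero. */
#include <stdio.h>

int main(void)
{
    double r[4] = {0};   /* same as {0.0, 0.0, 0.0, 0.0} */
    double s[4] = {1.5}; /* same as {1.5, 0.0, 0.0, 0.0} */

    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
    printf("%g %g %g %g\n", s[0], s[1], s[2], s[3]);
    return 0;
}
```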
Looks good modulo comments. Some further questions:
And okay, this is just idle curiosity really, but: would it make any sense to do the same for …
Half and complex certainly should use it too; I didn't do them yet to simplify review. I'll add some more tests for sum to make sure it works. I'm not sure if it is applicable to other operations; I'll have to look it up. A big problem is that the reduce ufunc actually works in buffer-sized blocks, so the benefit of the pairwise summation is reduced to blocks of 8192 elements by default.
The inner loop buffering might be historical. @mwiebe Let's ask Mark if he recalls anything about that.
Note that there are good reasons to want to use a smaller buffer by default. I guess one cheap option for commutative ops would be to just always do …
@@ -1356,20 +1356,89 @@ NPY_NO_EXPORT void
 * #C = F, , L#
 */

/*
 * pairwise summation, rounding error O(log(n)) instead of O(n)
 * iterative version of:
Any reason to do this iteratively? A recursive version would be shorter and easier to read, without the arbitrary stack sizes.
Probably not, I did it like this for the exercise. Performance-wise it should not matter, as the recursion depth is logarithmic.
Note: simplified recursive version and unit test here.
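For readers following along, here is a minimal sketch of what such a recursive pairwise sum could look like for plain contiguous doubles. This is a hypothetical simplification, not the code in this PR; the actual loop is generated from @type@ templates, takes a stride, and uses a larger unrolled base case.

```c
/* Hypothetical sketch of recursive pairwise summation over contiguous
 * doubles.  Small blocks are summed naively; larger ranges are split in
 * half and the two partial sums are added, so the rounding error grows
 * as O(log n) and the recursion depth is O(log n) as well. */
#include <stddef.h>

#define PW_BLOCKSIZE 128  /* base-case size; the real code uses a similar cutoff */

static double
pairwise_sum(const double *a, size_t n)
{
    if (n <= PW_BLOCKSIZE) {
        double res = 0.0;
        size_t i;
        for (i = 0; i < n; i++) {
            res += a[i];
        }
        return res;
    }
    else {
        size_t n2 = n / 2;
        return pairwise_sum(a, n2) + pairwise_sum(a + n2, n - n2);
    }
}
```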
ENH: implement pairwise summation
Simple recursive implementation with unrolled base case. Also fixed signed/unsigned issues by making all indices signed. Added a unit test based on @juliantaylor's example. Performance seems unchanged: still about a third faster than before.
Fix missing stride accounting when calling the recursive function. Unroll 8 times to improve accuracy and allow vectorizing with AVX without changing the summation order. Add tests.
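To illustrate the point about the unroll, here is a hedged sketch of an 8-accumulator base case (plain doubles, n assumed to be a multiple of 8 for brevity; this mirrors the idea, not the actual templated diff). Because each accumulator only ever sees every eighth element, a compiler that keeps the eight accumulators in SIMD registers produces the same partial sums as the scalar loop, so vectorization does not change the summation order.

```c
#include <stddef.h>

/* 8-way unrolled block sum: r[k] accumulates the elements with index
 * k mod 8, and the partials are then combined pairwise, matching what a
 * vectorized version of the same loop would compute. */
static double
unrolled_block_sum(const double *a, size_t n)
{
    double r[8] = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
    size_t i;

    for (i = 0; i < n; i += 8) {
        r[0] += a[i + 0];
        r[1] += a[i + 1];
        r[2] += a[i + 2];
        r[3] += a[i + 3];
        r[4] += a[i + 4];
        r[5] += a[i + 5];
        r[6] += a[i + 6];
        r[7] += a[i + 7];
    }
    return ((r[0] + r[1]) + (r[2] + r[3])) +
           ((r[4] + r[5]) + (r[6] + r[7]));
}
```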
Speaking of... what's the status of this? A quick skim looks like it's a win, but it would be a bigger win if we had pairwise block reduction?
Yes, it's a win in accuracy and performance, but the former would be better if the blocks were added the same way.
Complex and half types added.
So now that you added those, you think this is ready to merge? (Even if …)
I think it's ready. I just pushed a small update which removed a no-op line and changed the tests to not use data with a zero in the first element, which improves their ability to find issues regarding the first element (which is already initialized by the iterator in these loops).
 * The recursion depth is O(lg n) as well.
 */
static @type@
pairwise_add_@TYPE@(@dtype@ *a, npy_uintp n, npy_intp stride)
I think we should use the word "sum" here, not "add".
done + updated some comments
Otherwise LGTM.
ENH: implement pairwise summation
Pairwise summation has an average error of O(log(n)) instead of the O(n)
of regular summation.
It is implemented as summing pairs of small blocks of regularly summed
values in order to achieve the same performance as the old sum.
An example of data which profits greatly is
d = np.ones(500000)
(d / 10.).sum() - d.size / 10.
An alternative to pairwise summation is Kahan summation, but in order to
have a low performance penalty one must unroll and vectorize it,
while pairwise summation has the same speed without any vectorization.
The better O(1) error bound is negligible in many cases.
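For comparison, here is a minimal sketch of Kahan (compensated) summation, the alternative mentioned above. This naive version makes no attempt at the unrolling/vectorization that would be needed to hide its extra floating point operations.

```c
#include <stddef.h>

/* Naive Kahan summation: c carries the low-order bits lost in each
 * addition, giving an O(1) error bound at the cost of roughly four
 * floating point operations per element. */
static double
kahan_sum(const double *a, size_t n)
{
    double sum = 0.0;
    double c = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        double y = a[i] - c;
        double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    return sum;
}
```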