ENH/WIP: introduce a where keyword for reductions. #12635


Closed
wants to merge 4 commits

Conversation

@mhvk (Contributor) commented Dec 31, 2018

This introduces a where keyword for reductions, which would greatly help things like MaskedArray.

For lack of better ideas, the masking is done by setting elements that are not to be used to the identity. For this purpose, I have to ensure that the operand being reduced over is buffered, and this is currently done with the drastic hack of (ab)using NPY_ITER_READONLY|NPY_ITER_UPDATEIFCOPY; I could not find a better way to tell the iterator to always buffer.

Work in progress, posted mostly to request comments; surely this can be done better...
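
For illustration, the approach can be sketched at the Python level roughly as follows (a pure-NumPy emulation of the "fill excluded elements with the identity" idea; the helper name reduce_where is made up for this sketch and is not part of the PR):

import numpy as np

def reduce_where(ufunc, a, where, axis=None, identity=None):
    # Sketch: emulate ufunc.reduce(a, where=...) by substituting the
    # ufunc's identity for every element that is masked out.
    if identity is None:
        identity = ufunc.identity  # e.g. 0 for np.add, 1 for np.multiply
    filled = np.where(where, a, identity)
    return ufunc.reduce(filled, axis=axis)

a = np.array([1.0, np.nan, 2.0, 3.0])
print(reduce_where(np.add, a, ~np.isnan(a)))  # 6.0, same as np.nansum(a)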

mhvk added 4 commits December 30, 2018 15:53
The iterator masking is unsuited, as it prevents writing back an element
to an array, while what is needed is to skip even reading/operating on an
element.  Might need a double internal loop, where the operand is filled
in with an identity whenever the mask is set.
@mhvk (Contributor, Author) commented Dec 31, 2018

@seberg, @jaimefrio, since you are called the nditer experts in #12362: that issue really followed from this work - I'd like the iterator to always buffer. Can this be done more elegantly with the existing machinery than with the hack here?

@seberg (Member) commented Dec 31, 2018

Ah, copying things like that sounds like a pretty smart hack. I don't immediately know myself why buffering does not happen. The normal ufuncs must use the buffer for the write operand when where is used. But maybe that is the magic: that it is only used for the write operand.

The other thing is that possibly a trivial loop is triggered. But that would have to happen before nditer, so I doubt it.

@njsmith (Member) commented Dec 31, 2018

Do you have any plan for handling reductions that don't have an identity?

@seberg (Member) commented Dec 31, 2018

Yeah, that would be much nicer. I think it may be possible when the iteration is along the slow axis by adapting the current machinery (write back to the output array only those results that are not masked, but fetch all of them every time). Or you would have to avoid copying everything, but I am not sure that can be hacked easily, and it means the inner loop size changes (although I think it can change already).

@mhvk (Contributor, Author) commented Dec 31, 2018

@njsmith - for ufuncs with no identity, my current plan was to have the docs say that one has to supply one (via the recently added initial keyword); this is needed anyway for the case where no elements are selected at all.
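
As an illustration of the intended call pattern (using the where and initial keywords as proposed here, so it assumes a NumPy in which where= for reductions exists): np.minimum has no identity, so the caller supplies initial, which also serves as the result when nothing is selected.

import numpy as np

a = np.array([5.0, np.nan, 2.0, 7.0])
mask = ~np.isnan(a)

# np.minimum.reduce has no identity, so an explicit initial is needed
# alongside where=; np.inf acts as the identity for minimum.
print(np.minimum.reduce(a, where=mask, initial=np.inf))  # 2.0
# With nothing selected, the result falls back to initial:
print(np.minimum.reduce(a, where=np.zeros_like(mask), initial=np.inf))  # inf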

@mhvk (Contributor, Author) commented Dec 31, 2018

@seberg - I tried changing the inner loop size first, just removing elements of the input that were not selected, but found it didn't work, at least not simply, since the inner loop is not guaranteed to run over the reduction axis (i.e., the output stride is not always 0). In the end, my present solution seemed to rely less on iterator implementation details.

Of course, I could rely even less on the iterator by doing the buffering myself, but that seemed silly since in quite a number of cases the iterator will already buffer anyway.

@seberg (Member) commented Dec 31, 2018

It might be that we can make something work that uses a different mechanism depending on which axis the reduction is working on. When it is not along the reduction axis, I think the current where mechanism may actually (almost) work?

@mhvk (Contributor, Author) commented Dec 31, 2018

Hmm, you're right: when the inner loop is not along the reduction axis, simply not writing back the output would do the trick. But forcing that would make the 1-d case very slow. Perhaps when the inner loop is along the reduction axis, one could do the count-reduction instead.

Though perhaps this is still trying to force things too much: the broadcasting of the where mask is different for a reduction - it has to broadcast against the input, not the output.

@mhvk (Contributor, Author) commented Dec 31, 2018

> Though perhaps this is still trying to force things too much: the broadcasting of the where mask is different for a reduction - it has to broadcast against the input, not the output.

Then again, perhaps one can set the input as the masked array; then broadcasting will presumably be OK.
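
To make the shape difference concrete (written with the where keyword as proposed, so it again assumes a NumPy that has where= for reductions): the mask matches the input being reduced, not the reduced output.

import numpy as np

a = np.arange(12.0).reshape(3, 4)
mask = a % 2 == 0                 # shape (3, 4): broadcasts against the input

# The reduction over axis=1 has output shape (3,), but where= still
# takes the full (3, 4) shape of the input.
out = np.add.reduce(a, axis=1, where=mask)
print(out)        # [ 2. 10. 18.]
print(out.shape)  # (3,)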

@njsmith (Member) commented Dec 31, 2018

Not all reductions have an identity, though, so we'll eventually need some other way to handle them, and the no-elements-selected case. And doesn't initial= already have other semantics? E.g., for a silly example, np.add.reduce(arr, initial=10) is equivalent to np.add.reduce(arr) + 10, right? So I would expect np.add.reduce(arr, where=arr2, initial=10) to be equivalent to np.add.reduce(arr, where=arr2) + 10.

Maybe we want a masked buffering mode, where first we collect unmasked elements into a buffer, and then we run the operation over the dense buffer?
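
A pure-Python sketch of that masked-buffering idea, with boolean indexing standing in for the compaction that a C-level buffered iterator would do:

import numpy as np

a = np.array([1.0, np.nan, 2.0, 3.0])
mask = ~np.isnan(a)

# First gather the selected elements into a dense buffer, then run the
# ordinary reduction over that buffer; no identity is needed unless the
# buffer turns out to be empty.
buffer = a[mask]
print(np.add.reduce(buffer))  # 6.0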

No-elements-selected creates some tricky API design issues. We already have a strategy for handling no-identity empty reductions, which is to raise an error. And that's ok because right now, this can only happen when a dimension has size 0, which means that in a vectorized reduction all of the core reductions are empty. But with where=, we could have a mix of empty and non-empty core reductions.

Gufuncs have similar issues with reporting partial errors. Maybe it's time to come up with a standard numpy-wide convention for how to handle this. I don't think that needs to block where= support in reductions though; it's ok if in the first version you get an error if any core reductions are empty and the operation has no identity, and then we refine that later.

@mhvk (Contributor, Author) commented Dec 31, 2018

Good point about initial - it indeed only works when it is the identity, otherwise results would be more than a little surprising....

@mhvk (Contributor, Author) commented Jan 1, 2019

@seberg - currently, the iterator explicitly forbids a mask that has more elements than the output along a reduction axis.

But another question: can one force the iterator to put the inner loop along a reduction axis? That would allow the select-the-data-and-reduce-the-count method to be used always.
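
For reference, a Python sketch of the select-the-data-and-reduce-the-count idea for the case where the inner loop runs along the reduction axis (the function name is made up for this sketch; a C implementation would compact into the iterator's buffer instead of using boolean indexing):

import numpy as np

def reduce_rows_where(a, mask, initial=0.0):
    # Sketch: with the inner loop along the reduction axis, each row can
    # compact its selected elements ("select the data") and then run the
    # unchanged inner loop over the smaller count ("reduce the count").
    out = np.empty(a.shape[0])
    for i, (row, row_mask) in enumerate(zip(a, mask)):
        selected = row[row_mask]
        acc = initial
        for x in selected:
            acc += x
        out[i] = acc
    return out

a = np.arange(12.0).reshape(3, 4)
mask = a % 2 == 0
print(reduce_rows_where(a, mask))  # [ 2. 10. 18.]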

@seberg (Member) commented Jan 1, 2019

I do not think there is a flag to do that currently, though I am not 100% sure. In principle some ufuncs might even be better off doing this for reductions in any case (well, to be honest, the only ones are probably the float16 loops, which would then need to cast back and forth less often).

@mhvk (Contributor, Author) commented Jan 1, 2019

See #12640 for an alternative in which I force the iterator to give the external loop an axis one reduces over. An advantage of that approach is that initial will now be used properly as an initial value, making it more reasonable to ask users to pass it in for ufuncs that do not have an identity.
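
The difference can be sketched in plain Python: with that approach, initial seeds the accumulator once per reduction instead of being used to fill masked-out elements, so no identity is required (reduce_with_initial is a made-up name for this sketch):

def reduce_with_initial(data, mask, op, initial):
    # Sketch: the accumulator starts at `initial` and masked-out
    # elements are simply never touched, so the ufunc needs no identity.
    acc = initial
    for x, m in zip(data, mask):
        if m:
            acc = op(acc, x)
    return acc

print(reduce_with_initial([5.0, float('nan'), 2.0, 7.0],
                          [True, False, True, True],
                          min, float('inf')))  # 2.0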

@mhvk (Contributor, Author) commented Jan 1, 2019

Just as a note: this implementation is not exactly speedy. E.g.,

a = np.arange(100000.)
a[1000] = np.nan
np.add.reduce(a, where=~np.isnan(a)) == np.nansum(a)
# True
%timeit np.nansum(a)
# 10000 loops, best of 5: 110 µs per loop
%timeit np.add.reduce(a, where=~np.isnan(a))
# 10000 loops, best of 5: 156 µs per loop
m = ~np.isnan(a)
%timeit np.add.reduce(a, where=m)
# 10000 loops, best of 5: 126 µs per loop

The alternative implementation in #12640 is even slower, as it is moving data around (but can probably be sped up to be roughly equivalent).

I guess part of the problem is the buffering, especially as I use the casting machinery for that. Though nansum obviously does a copy too.

EDIT: with buffering turned off, this version is slightly faster than nansum (but it changes a in place...).

@seberg (Member) commented Jan 1, 2019

I would imagine that things get faster if you disable GROWINNER, at least when buffering is enabled.
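
For context: NPY_ITER_GROWINNER is the C-level flag that, as I understand it, lets a buffered iterator grow the inner-loop chunks beyond the buffer size when no copying is actually needed. A small Python-level illustration of the chunking that buffering otherwise imposes (the forced float32 cast is only there to make buffering necessary):

import numpy as np

a = np.arange(10000, dtype=np.float64)

# With 'buffered' and 'external_loop', the inner loop receives
# buffer-sized chunks rather than the whole array at once.
it = np.nditer(a, flags=['external_loop', 'buffered'],
               op_dtypes=[np.float32], casting='same_kind',
               buffersize=512)
print({chunk.size for chunk in it})  # e.g. {512, 272}: limited by buffersize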

@mhvk (Contributor, Author) commented Jan 2, 2019

Closing in favour of the simpler and faster solution in #12644, which doesn't require messing with the iterator.

@mhvk closed this Jan 2, 2019