
ENH: Adding __array_ufunc__ capability to MaskedArrays #16022


Merged
merged 3 commits into from
Jul 9, 2022

greglucas
Contributor

This enables any NumPy ufunc called on a MaskedArray to use the masked version of that function automatically, without needing to call np.ma.func() directly.

Example

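import numpy as np
# np.ma.array takes dtype as its second positional argument, so pass the mask by keyword
a = np.ma.array(np.arange(10), mask=[0, 1] * 5)
# Threshold of 5, which is what the outputs shown below correspond to
x = a < 5
# x.data == [True,  True,  True,  True,  True, False, False, False, False, False]
y = np.ma.less(a, 5)
# y.data == [True,  True,  True,  True,  True,  True, False,  True, False, True]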

Even though some data values in a are masked, the plain comparison still acts on and evaluates them, changing the underlying data. Calling the masked version, np.ma.less, by contrast does nothing with the masked data in the array. Note that this is a big change, but it affects a somewhat ambiguous case: what happens to masked values under function evaluation.
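Since values under the mask are the ambiguous part, a comparison between the two call styles is safer when it only looks at the mask and the unmasked values. A minimal sketch, assuming a threshold of 5 and illustrative variable names that are not part of the PR:

```python
import numpy as np

a = np.ma.array(np.arange(10), mask=[0, 1] * 5)

x = a < 5             # comparison operator
y = np.ma.less(a, 5)  # explicit masked ufunc

# The masks agree, and so do the values at the unmasked positions.
masks_match = np.array_equal(np.ma.getmaskarray(x), np.ma.getmaskarray(y))
visible_match = np.array_equal(x.compressed(), y.compressed())

# Whatever sits under the mask is an implementation detail, so replace it
# explicitly before comparing raw data.
x_filled = x.filled(False)
y_filled = y.filled(False)

print(masks_match, visible_match, np.array_equal(x_filled, y_filled))
```

Using `.compressed()` or `.filled(...)` keeps the comparison independent of how a given NumPy version treats masked entries.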

The real reason I began looking into this was to avoid evaluating the function at masked values, because I had already masked the locations that would trigger warnings. See #4959, where log10 was still evaluated at locations less than 0 that were under the mask, and so still threw warnings.

import numpy as np
A = np.arange(-2, 5) / 2
Am = np.ma.masked_less_equal(A, 0)
np.log10(Am) # Previously threw a warning, now works as expected
np.ma.log10(Am) # No warnings before or after
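
One way to check the "no warnings" claim for the masked function is to turn warnings into errors around the call. This is a sketch for `np.ma.log10` only; whether `np.log10(Am)` warns depends on the NumPy version in use, so it is left out here:

```python
import warnings

import numpy as np

A = np.arange(-2, 5) / 2
Am = np.ma.masked_less_equal(A, 0)

with warnings.catch_warnings():
    warnings.simplefilter("error")  # any warning raised here becomes an exception
    result = np.ma.log10(Am)

print(result)
print(result.mask)
```

The non-positive entries stay masked in the result, and the unmasked entries hold the ordinary base-10 logarithms.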

Implementation

I haven't added any documentation yet because I wanted to check and see if this was even the right way of approaching this idea. I struggled to figure out how to pass the out= object pointers around in a consistent manner, so I am definitely interested in suggestions on improving that. I had to update quite a few seemingly random areas of the Masked codebase to make that work everywhere. There are a few tests that I had to update because values under the mask were being compared, and those have now changed with this implementation applying the masked version of ufuncs everywhere.

This is a pretty major (underlying) API change, so I expect there to be quite a bit of discussion on the best approaches here.

Linked Issues

Fixes #4959
Closes #15200

@seberg
Member

seberg commented Apr 23, 2020

@ahaldane and @mhvk may be interested in this start. I like this. The main question is whether we should aim for creating a MaskedArray outside of NumPy instead of updating it here.
There was some consensus around that being the ideal scenario: A new masked array project outside of NumPy, that we can at some point consider a replacement for the one in NumPy. That said, if these are uncontroversial changes/improvement, nothing should stop it.

If you or anyone else wants to create such a project and it seems helpful, I would be happy for us to make a new repository in the numpy organization.

@jthielen

Glad to see someone taking this on and moving it along! I apologize for not getting any chance to address #15200 since I raised it in December.

Just one early note/request: it would be great to add tests for proper dispatching/deferral to make sure that this truly closes #15200. Right now, as far as I can tell, it doesn't look like the suggested adaptations to the internal arithmetic methods have been made yet, so I'd suspect that the ufunc dispatch hierarchy issue remains.

@greglucas
Contributor Author

Great observation, @jthielen! Thanks for catching that. I just changed those over to dispatch through the ufunc calls for me locally and your example still fails with this update. (So indeed doesn't close that issue yet) My guess is that I've probably got some bad re-casting going on somewhere? I agree that this definitely needs some tests and some documentation. If you want to try and add anything or build upon this, please feel free to.

@greglucas
Contributor Author

greglucas commented Jul 8, 2020

@jthielen I decided to look at your example again to see why the dispatching still wasn't working right with this implementation, and ran into some questions about what should actually be defined to happen in this situation. When I read the spec for __array_ufunc__, __array_priority__ only comes into play if one of the objects does not implement __array_ufunc__; if both objects implement __array_ufunc__, then I think the first one called takes precedence. I'm wondering if the __array_ufunc__ implementation needs to consider __array_priority__ internally? The reason this is an issue here is that your example has MaskedArray and WrappedArray both at the same class level. If we truly inherit from MaskedArray, so that WrappedArray is a level deeper in your example:
class WrappedArray(np.ma.MaskedArray, np.lib.mixins.NDArrayOperatorsMixin):
then all of the dispatching in the PR works as one would expect. I may also be missing something simple that will take care of this too, like maybe I need to call __array_wrap__ or __array_finalize__ somewhere to take care of the typecasting properly?

From another perspective, it is a bit ambiguous and I'm not clear if there should be commutativity between say a DaskArray and an Xarray object, or is it whichever library puts in the higher __array_priority__ to their object?

Edit: I was curious to see what that example would produce myself and it is non-commutative as well.

In [1]: import dask.array as da

In [2]: import xarray as xr

In [3]: d = da.array([1, 2])

In [4]: x = xr.DataArray([3, 4])

In [5]: type(d*x)
Out[5]: dask.array.core.Array

In [6]: type(x*d)
Out[6]: xarray.core.dataarray.DataArray

@mhvk
Contributor

mhvk commented Jul 8, 2020

@greglucas - __array_priority__ is indeed ignored when __array_ufunc__ is defined, and overriders get to try in order, but subclasses first. The idea very much is that if a class does not know the other class, it should generally return NotImplemented. Getting things to work automatically is nearly impossible. So, the equivalent of __array_priority__ would be to explicitly decide which of a pair of classes returns NotImplemented and which takes action.

For your example with dask: MaskedArray cannot generally know about dask, but dask relies on numpy and can thus know about MaskedArray, so I think MaskedArray should return NotImplemented.
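
The deferral rule described here can be sketched with two toy classes. `Inner` and `Outer` are hypothetical names invented for this illustration (they are not MaskedArray or dask): `Inner` returns NotImplemented for any type it does not know, while `Outer` knows about `Inner` and takes action, which makes the result type independent of operand order.

```python
import numbers

import numpy as np
from numpy.lib.mixins import NDArrayOperatorsMixin


class Inner(NDArrayOperatorsMixin):
    """Hypothetical low-level duck array; knows only ndarray and scalars."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Keep the sketch small: only plain calls without keyword arguments.
        if method != "__call__" or kwargs:
            return NotImplemented
        for x in inputs:
            if not isinstance(x, (Inner, np.ndarray, numbers.Number)):
                return NotImplemented  # unknown type: defer to it
        raw = [x.data if isinstance(x, Inner) else x for x in inputs]
        return Inner(ufunc(*raw))


class Outer(NDArrayOperatorsMixin):
    """Hypothetical wrapper that knows about Inner and takes action."""

    def __init__(self, data):
        self.data = data

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != "__call__" or kwargs:
            return NotImplemented
        # Unwrap Outer operands and let the wrapped types do the work.
        raw = [x.data if isinstance(x, Outer) else x for x in inputs]
        return Outer(ufunc(*raw))


inner = Inner([1, 2])
outer = Outer(np.array([3, 4]))
left = inner * outer   # Inner defers, so Outer handles the operation
right = outer * inner  # Outer is asked first and handles it directly
print(type(left).__name__, type(right).__name__)
```

Both products come back as `Outer` wrapping an `Inner`, which is the commutativity that explicit deferral is meant to provide.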

@jthielen

jthielen commented Jul 8, 2020

@greglucas To follow-up on @mhvk's points, I'd like to emphasize some of the existing documentation that may be helpful (it looks like you've already read through NEP 13/__array_ufunc__ docs, but the others may help round that out):

It is quite a bit to wrap one's head around at first, but what it ends up coming down to is ensuring commutativity through consistently deferring to upcast types (those above the respective type in the hierarchy). The main problem is that there is no "official" or "standard" type casting hierarchy accepted among libraries right now; instead it is an emergent phenomenon from library-by-library ad hoc decisions. The closest thing I know of to a standard is the graph depicted in Pint's documentation (which comes from discussions in the xarray, Pint, and Dask issue trackers), but that is by no means official. A continuously updated informational NEP (as I believe suggested by @shoyer) may be of use for this, but I don't know what the process for that would look like.

All that being said, MaskedArray's place in the hierarchy seems fairly clear: right above ndarray. So, I think MaskedArray should handle operations with ndarray (and subclasses), and defer to (that is, return NotImplemented) any other duck array.

For your example @greglucas, xarray and Dask are still not in the best place with respect to fully respecting the type casting hierarchy (see pydata/xarray#3950 and dask/dask#4583). Perhaps a better example of what should happen is with Pint and Sparse:

In [1]: import numpy as np

In [2]: from pint import UnitRegistry

In [3]: from sparse import COO

In [5]: ureg = UnitRegistry()

In [6]: q = [[1], [2]] * ureg.m

In [7]: s = COO(np.array([[0, 1, 0], [0, 25, 50]]), np.array([2, 4, 8]))

In [8]: q * s
Out[8]: <COO: shape=(2, 51), dtype=int64, nnz=3, fill_value=0> <Unit('meter')>

In [9]: s * q
Out[9]: <COO: shape=(2, 51), dtype=int64, nnz=3, fill_value=0> <Unit('meter')>

(both these are Pint Quantities wrapping Sparse COOs)

@greglucas
Contributor Author

Thanks for the details both of you! I think this all makes more sense for deferrals and I was missing a simple check at the top of the function:

        # Determine what class types we are compatible with and return
        # NotImplemented if we don't know how to handle them
        for arg in args + outputs:
            if not isinstance(arg, (ndarray, np.bool_, Number, str)):
                return NotImplemented

Of course, then some other tests fail due to mask reshaping/resizing, so it isn't that simple. I'll look into this more later this week. Thanks for the links!
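
To see which operands such a check lets through, the same `isinstance` test can be tried outside the class. This is a sketch only: `HANDLED_TYPES`, `can_handle`, and `DuckArray` are hypothetical names, and the tuple of types is copied from the snippet above rather than from the final implementation.

```python
from numbers import Number

import numpy as np

# Types the snippet above treats as "known"; anything else triggers deferral.
HANDLED_TYPES = (np.ndarray, np.bool_, Number, str)


def can_handle(args):
    """Return True if every operand is an instance of a handled type."""
    return all(isinstance(arg, HANDLED_TYPES) for arg in args)


class DuckArray:
    """Hypothetical unrelated array-like that MaskedArray knows nothing about."""


masked = np.ma.array([1, 2, 3], mask=[0, 1, 0])

print(can_handle((masked, 2.0)))          # True: ndarray subclass and a scalar
print(can_handle((masked, DuckArray())))  # False: would return NotImplemented
```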

@greglucas greglucas force-pushed the masked_ufunc branch 2 times, most recently from ae90141 to facb5ea (July 9, 2020 14:45)

@greglucas
Contributor Author

OK, I think I worked out the different resizing bug I was running into and pushed up some changes that all work locally for me. I also added a test for the WrappedArray example you gave in #15200.

Unfortunately, it looks like the macOS (Python 3.6) and PyPy builds don't like the changes, but the majority of systems do. It may be tricky to track down those specifics unless someone sees something obvious right away.

@ahaldane
Member

ahaldane commented Jul 19, 2020

Just to note for those interested, I have a new MaskedArray implementation using __array_ufunc__ and __array_function__ available here: https://github.com/ahaldane/ndarray_ducktypes

See the documentation at: https://github.com/ahaldane/ndarray_ducktypes/blob/master/doc/MaskedArray.md

I haven't made it more publicly available yet because it is still a work in progress; however, I would say it is 95% functional now (the last 5% is always the longest and hardest to implement). A TODO list is at ahaldane/ndarray_ducktypes#1. If you would like to try it out and comment, feel free to leave issues on GitHub. Also, I welcome PRs, and you should feel welcome to reuse anything there in these PRs here. That repository also has two other ndarray-ducktypes I would like to get closer to a working state.

Contributor Author

@greglucas greglucas left a comment

I'm coming back to this again here and cleaning it up a bit and I'm wondering if anyone in the thread had any thoughts on this PR, or anything else I'm overlooking in it.

@@ -5785,7 +5911,7 @@ def mini(self, axis=None):
"`mini` is deprecated; use the `min` method or "
"`np.ma.minimum.reduce instead.",
DeprecationWarning, stacklevel=2)
return minimum.reduce(self, axis)
Contributor Author

Do we want to just remove np.ma.mini()? It has been deprecated since 1.13.0 it looks like.

Member

Yes please. If you are concerned about the size of the PR, it would be fine to do this in a separate PR. Note there is a commented out deprecated code in numpy/ma/core.pyi as well.

@@ -740,7 +740,7 @@ def test_subclass(self):
 mask=[[False, False], [True, False],
 [False, True], [True, True], [False, False]])
 out = diff(x)
-assert_array_equal(out.data, [[1], [1], [1], [1], [1]])
+assert_array_equal(out.data, [[1], [4], [6], [8], [1]])
Contributor Author

It is ambiguous what should happen here. All of these values are under the mask.

@@ -3168,7 +3168,7 @@ def test_compress(self):
 assert_equal(b.fill_value, 9999)
 assert_equal(b, a[condition])

-condition = (a < 4.)
+condition = (a.data < 4.)
Contributor Author

This depended on evaluating values under the mask before.
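
When a plain boolean condition is needed from a masked comparison, one option that does not depend on values under the mask is to fill the masked entries explicitly. A minimal sketch with made-up data (the array below is illustrative and is not the one used in test_compress):

```python
import numpy as np

a = np.ma.array([1., 2., 3., 4., 5.], mask=[0, 0, 1, 0, 0])

# Decide explicitly what masked entries mean for the condition
# instead of relying on whatever data sits under the mask.
condition = (a < 4.).filled(False)

selected = a.compress(condition)
print(condition)
print(selected)
```

Here the masked entry is simply excluded; passing `True` to `filled` would include it instead.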

@@ -5233,7 +5233,7 @@ def test_ufunc_with_out_varied():
 a = array([ 1, 2, 3], mask=[1, 0, 0])
 b = array([10, 20, 30], mask=[1, 0, 0])
 out = array([ 0, 0, 0], mask=[0, 0, 1])
-expected = array([11, 22, 33], mask=[1, 0, 0])
+expected = array([1, 22, 33], mask=[1, 0, 0])
Contributor Author

Ambiguous data under the mask also changed here.

@TomNicholas

Just dropping in to say that it would be great to see this implemented because it would allow us to simplify our handling of masked arrays in xarray, and generalise it to handle masked duck arrays, which would be awesome!

@greglucas
Contributor Author

@TomNicholas, I think this is still ready for review and feedback, so thanks for adding in the voice of support of this from Xarray's perspective too :)

@rcomer
Contributor

rcomer commented Jul 28, 2021

If it helps, I'd also love to see this go in. It will make a workaround I've written for SciTools/iris' arithmetic redundant.

@seberg
Member

seberg commented Jul 28, 2021

I have not looked at this in depth. What would help me to get started would be to understand what/if behaviour is changing. I.e. are all/most changes here basically fixing arguably outright bugs, or are there more subtle changes that are more likely to break workflows? I like the approach of improving the current implementation, rather than aiming for replacement, but with masked arrays, it feels a bit doomed to be incomplete. (Which does not mean we should not do it; if anything, improving things will make the eventually necessary transition easier.)

@greglucas
Contributor Author

> What would help me to get started would be to understand what/if behaviour is changing

The biggest change here is the deferral mechanism to whichever object's priority takes precedence. I added a test to try to capture that by creating a small WrappedArray. I tried to keep behavior changes to a minimum where possible, so not many tests were updated. However, I did put some inline review notes about data under masks being different between the versions.

One area that I am still not clear about is using the out keyword during Masked computations. I think I am creating an extra array during the computation and then copying it back over, but I'm not clear if that was the way it was before or not, so I could use some help on that aspect during a review.

> Which does not mean we should not do it, if anything improving things will make the eventually necessary transition easier

I agree with this. I'm hoping that this can be a stepping stone to something like the new implementation mentioned in #16022 (comment)

@rcomer
Contributor

rcomer commented Jun 25, 2022

Hi, is this PR still under consideration?

greglucas and others added 3 commits June 29, 2022 19:17

This enables any ufunc numpy operations that are called on a
MaskedArray to use the masked version of that function automatically
without needing to resort to np.ma.func() calls.

This test makes sure that a MaskedArray defers properly to another
class if it doesn't know how to handle it. See numpy#15200.

@greglucas
Contributor Author

@rcomer, do you have a specific use case that this would help with? I just rebased this and it appears to still be ready to go.

@rcomer
Contributor

rcomer commented Jun 30, 2022

Thanks @greglucas. The problem that ultimately led me here was that when doing arithmetic with dask arrays, any mask the dask array has is lost:

import numpy
import numpy.ma as ma
import dask
import dask.array as da

print("numpy version:", numpy.__version__)
print("dask version:", dask.__version__)

my_numpy_array = ma.masked_array(range(4), mask=[0, 1, 0, 0])
my_dask_array = da.ma.masked_array(range(4), mask=[0, 0, 1, 0])

print(my_numpy_array * my_dask_array)

numpy version: 1.23.0
dask version: 2022.6.1
[0 -- 4 9]

The result should be a dask array which when computed gives

[0 -- -- 9]

It's easy enough to work around once you realise what's happening, and I have an open PR in SciTools/iris with a workaround that should solve it for Iris users. But a low-level fix is always better than a high-level workaround!
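
For comparison, the same arithmetic with two NumPy masked arrays (no dask involved) shows the expected mask behaviour: the result is masked wherever either operand is masked. A minimal sketch, with variable names chosen for illustration:

```python
import numpy.ma as ma

left = ma.masked_array(range(4), mask=[0, 1, 0, 0])
right = ma.masked_array(range(4), mask=[0, 0, 1, 0])

product = left * right

print(product)       # [0 -- -- 9]
print(product.mask)  # the element-wise OR of the two input masks
```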

# Test domained binary operations
assert_(isinstance(np.divide(wm, m), WrappedArray))
assert_(isinstance(np.divide(m, wm), WrappedArray))
assert_equal(np.divide(wm, m) * m, np.divide(m, m) * wm)
Member

Could you add some test where the operands are not the same shape and broadcasting must be done? That is tricky to get right with masks.

Are reductions tested elsewhere?

Contributor Author

There are some broadcasting tests of other functions, but it could likely be made more robust. I just added a quick test for the instance type after broadcasting in the new PR, good suggestion.
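
A broadcasting check along these lines could look like the sketch below. The shapes and values are assumptions made for illustration and are not the test added to the PR; the point is that the result mask should be the broadcast OR of the operand masks.

```python
import numpy as np

# Column of shape (3, 1) and row of shape (1, 4), each with one masked entry.
a = np.ma.array([[1], [2], [3]], mask=[[0], [1], [0]])
b = np.ma.array([[10, 20, 30, 40]], mask=[[0, 0, 1, 0]])

result = a + b

# Expected mask: broadcast the two masks against each other and OR them.
expected_mask = np.ma.getmaskarray(a) | np.ma.getmaskarray(b)

print(result.shape)
print(np.array_equal(np.ma.getmaskarray(result), expected_mask))
```

The second row and the third column come out fully masked, while the remaining entries hold ordinary sums.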

@mattip
Member

mattip commented Jul 5, 2022

I think we should try to get this in since it does not look like we will deprecate masked arrays soon. I do like the idea of having a separate repo like https://github.com/ahaldane/ndarray_ducktypes mentioned above for a new implementation of masked arrays; maybe we could make removing masked arrays part of NumPy 2.0.

@mattip mattip merged commit ab541f9 into numpy:main Jul 9, 2022
@mattip
Member

mattip commented Jul 9, 2022

Thanks @greglucas

@mattip
Member

mattip commented Jul 9, 2022

This changed the result of np.ma.average(x, axis=1, keepdims=True) where x = np.ma.arange(6.).reshape(3, 2). Previously the result had mask = False, now it has mask = [False, False, False]. I submitted #21960 to revert it.
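
Code that needs to be robust to this kind of difference (a scalar `False` mask versus an array of `False`) can normalise the mask with `np.ma.getmaskarray` before comparing. A minimal sketch, assuming NumPy 1.23 or later, where `np.ma.average` accepts `keepdims`:

```python
import numpy as np

x = np.ma.arange(6.).reshape(3, 2)
avg = np.ma.average(x, axis=1, keepdims=True)

# getmaskarray always returns a full boolean array with the data's shape,
# whether the underlying mask is the scalar nomask or an explicit array.
full_mask = np.ma.getmaskarray(avg)

print(avg.shape, full_mask.shape, full_mask.any())
```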

@mattip
Member

mattip commented Jul 9, 2022

Please resubmit this as a new PR since the merge was reverted.

@greglucas
Contributor Author

@mattip, my apologies for the silence. I was on vacation the past week, but I should be able to update this and resubmit in the next day or two. Thanks for taking care of the revert.

It may be this change: #16022 (comment)
which was needed when I first submitted this PR a couple years ago, but something else must have been updated during that time to no longer flatten the mask on output. I'm still unclear what a user would expect/want for "keepdims" on the mask...

9 participants