ENH: Adding __array_ufunc__ capability to MaskedArrays #16022
Conversation
@ahaldane and @mhvk may be interested in this start. I like this. The main question is whether we should aim for creating a MaskedArray outside of NumPy instead of updating it here. If you/anyone wants to create such a project and it seems helpful, I would be happy to make a new repository in the numpy organization.
Glad to see someone taking this on and moving it along! I apologize for not getting any chance to address #15200 since I raised it in December. Just one early note/request: it would be great to add tests for proper dispatching/deferral to make sure that this truly closes #15200. Right now, as far as I can tell, it doesn't look like the suggested adaptations to the internal arithmetic methods have been made yet, so I'd suspect that the ufunc dispatch hierarchy issue remains.
Great observation, @jthielen! Thanks for catching that. I just changed those over to dispatch through the ufunc calls locally, and your example still fails with that update (so it indeed doesn't close that issue yet). My guess is that I've probably got some bad re-casting going on somewhere. I agree that this definitely needs some tests and some documentation. If you want to try to add anything or build upon this, please feel free to.
@jthielen I decided to look at your example again to see why the dispatching still wasn't working right with this implementation, and ran into some questions about what should actually be defined to happen in this situation. When I read the spec for __array_ufunc__, it seems a bit ambiguous, and I'm not clear whether there should be commutativity between, say, a MaskedArray and another array-like class that wraps it. Edit: I was curious to see what that example would produce myself, and it is non-commutative as well:

In [1]: import dask.array as da
In [2]: import xarray as xr
In [3]: d = da.array([1, 2])
In [4]: x = xr.DataArray([3, 4])
In [5]: type(d*x)
Out[5]: dask.array.core.Array
In [6]: type(x*d)
Out[6]: xarray.core.dataarray.DataArray
@greglucas - For your example with
@greglucas To follow up on @mhvk's points, I'd like to emphasize some of the existing documentation that may be helpful (it looks like you've already read through NEP 13/
It is quite a bit to wrap one's head around at first, but what it ends up coming down to is ensuring commutativity through consistently deferring to upcast types (those above the respective type in the hierarchy). The main problem is that there is no "official" or "standard" type casting hierarchy accepted among libraries right now...instead it is an emergent phenomenon from library-by-library ad hoc decisions. The closest thing I know of to a standard is the graph depicted in Pint's documentation (which comes from discussions in the xarray, Pint, and Dask issue trackers), but that is by no means official. A continuously updated informational NEP (as I believe suggested by @shoyer) may be of use for this, but I don't know what the process for that would look like. All that being said, for your example @greglucas, xarray and Dask are still not in the best place with respect to fully respecting the type casting hierarchy (see pydata/xarray#3950 and dask/dask#4583). Perhaps a better example of what should happen is with Pint and Sparse:

In [1]: import numpy as np
In [2]: from pint import UnitRegistry
In [3]: from sparse import COO
In [5]: ureg = UnitRegistry()
In [6]: q = [[1], [2]] * ureg.m
In [7]: s = COO(np.array([[0, 1, 0], [0, 25, 50]]), np.array([2, 4, 8]))
In [8]: q * s
Out[8]: <COO: shape=(2, 51), dtype=int64, nnz=3, fill_value=0> <Unit('meter')>
In [9]: s * q
Out[9]: <COO: shape=(2, 51), dtype=int64, nnz=3, fill_value=0> <Unit('meter')>

(both of these are Pint Quantities wrapping Sparse COOs)
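To make the deferral pattern described above concrete, here is a minimal sketch with two entirely hypothetical wrapper classes (not part of this PR or of any of the libraries mentioned); the "lower" type returns NotImplemented whenever it sees the "upper" type, which is what keeps the result type commutative:

import numpy as np

# Purely illustrative classes: the "lower" type defers to the "upper" type
# by returning NotImplemented from __array_ufunc__.
class UpperArray(np.lib.mixins.NDArrayOperatorsMixin):
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        unwrapped = [x.data if isinstance(x, (UpperArray, LowerArray)) else x
                     for x in inputs]
        return UpperArray(getattr(ufunc, method)(*unwrapped, **kwargs))

class LowerArray(np.lib.mixins.NDArrayOperatorsMixin):
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Defer to the type sitting above us in the casting hierarchy.
        if any(isinstance(x, UpperArray) for x in inputs):
            return NotImplemented
        unwrapped = [x.data if isinstance(x, LowerArray) else x for x in inputs]
        return LowerArray(getattr(ufunc, method)(*unwrapped, **kwargs))

lo, up = LowerArray([1, 2]), UpperArray([3, 4])
print(type(lo * up).__name__, type(up * lo).__name__)  # UpperArray UpperArray

Whichever side of the operator the LowerArray sits on, NumPy ends up asking the UpperArray to handle the ufunc; that is the "lower" role MaskedArray would need to play when it meets a class it does not know how to handle.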
Thanks for the details, both of you! I think the deferral behaviour makes more sense to me now, and I was missing a simple check at the top of the function:

# Determine what class types we are compatible with and return
# NotImplemented if we don't know how to handle them
for arg in args + outputs:
    if not isinstance(arg, (ndarray, np.bool_, Number, str)):
        return NotImplemented

Of course, then some other tests fail due to mask reshaping/resizing, so it isn't that simple. I'll look into this more later this week. Thanks for the links!
OK, I think I worked out the resizing bug I was running into and pushed up some changes that all work locally for me. I also added a test for the WrappedArray example you gave in #15200. Unfortunately, it looks like the macOS Python 3.6 and PyPy CI jobs don't like the changes, but the majority of systems do. It may be tricky to track down those specifics unless someone sees something obvious right away.
Just to note for those interested, I have a new MaskedArray implementation in a separate repository. See the documentation at: https://github.com/ahaldane/ndarray_ducktypes/blob/master/doc/MaskedArray.md. I haven't made it more publicly available yet because it is still a work in progress; however, I would say it is 95% functional now (the last 5% is always the longest/hardest to implement). A TODO list is at ahaldane/ndarray_ducktypes#1. If you would like to try it out and comment, feel free to leave issues on GitHub. I also welcome PRs, and you should feel welcome to reuse anything there in these PRs here. That repository also has two other ndarray ducktypes I would like to get closer to a working state.
I'm coming back to this again and cleaning it up a bit, and I'm wondering whether anyone in the thread has any thoughts on this PR, or anything else I'm overlooking in it.
@@ -5785,7 +5911,7 @@ def mini(self, axis=None):
            "`mini` is deprecated; use the `min` method or "
            "`np.ma.minimum.reduce instead.",
            DeprecationWarning, stacklevel=2)
        return minimum.reduce(self, axis)
Do we want to just remove np.ma.mini()? It looks like it has been deprecated since 1.13.0.
Yes please. If you are concerned about the size of the PR, it would be fine to do this in a separate PR. Note there is commented-out deprecated code in numpy/ma/core.pyi as well.
@@ -740,7 +740,7 @@ def test_subclass(self):
              mask=[[False, False], [True, False],
                    [False, True], [True, True], [False, False]])
     out = diff(x)
-    assert_array_equal(out.data, [[1], [1], [1], [1], [1]])
+    assert_array_equal(out.data, [[1], [4], [6], [8], [1]])
It is ambiguous what should happen here. All of these values are under the mask.
@@ -3168,7 +3168,7 @@ def test_compress(self):
     assert_equal(b.fill_value, 9999)
     assert_equal(b, a[condition])

-    condition = (a < 4.)
+    condition = (a.data < 4.)
This depended on evaluating values under the mask before.
@@ -5233,7 +5233,7 @@ def test_ufunc_with_out_varied():
     a = array([ 1, 2, 3], mask=[1, 0, 0])
     b = array([10, 20, 30], mask=[1, 0, 0])
     out = array([ 0, 0, 0], mask=[0, 0, 1])
-    expected = array([11, 22, 33], mask=[1, 0, 0])
+    expected = array([1, 22, 33], mask=[1, 0, 0])
Ambiguous data under the mask also changed here.
Just dropping in to say that it would be great to see this implemented because it would allow us to simplify our handling of masked arrays in xarray, and generalise it to handle masked duck arrays, which would be awesome!
@TomNicholas, I think this is still ready for review and feedback, so thanks for adding the voice of support for this from Xarray's perspective too :)
If it helps, I'd also love to see this go in. It will make a workaround I've written for SciTools/iris' arithmetic redundant.
I have not looked at this in depth. What would help me to get started would be to understand what/if behaviour is changing. I.e. are all/most changes here basically fixing arguably outright bugs, or are there more subtle changes more likely to break workflows? I like the approach of improving the current implementation, rather than aiming for replacement, but with masked arrays it feels a bit doomed to be incomplete. (Which does not mean we should not do it; if anything, improving things will make the eventually necessary transition easier.)
The biggest change here is the deferral mechanism to whichever object's priority takes precedence. I added a test to try to capture that by creating a small WrappedArray class. One area that I am still not clear about is using the
I agree with this. I'm hoping that this can be a stepping stone to something like the new implementation mentioned in #16022 (comment)
Hi, is this PR still under consideration?
This enables any NumPy ufunc operations that are called on a MaskedArray to use the masked version of that function automatically, without needing to resort to np.ma.func() calls.
This test makes sure that a MaskedArray defers properly to another class if it doesn't know how to handle it. See numpy#15200.
@rcomer, do you have a specific use case that this would help with? I just rebased this and it appears to still be ready to go.
Thanks @greglucas. The problem that ultimately led me here was that when doing arithmetic with dask arrays, any mask the dask array has is lost:

import numpy
import numpy.ma as ma
import dask
import dask.array as da
print("numpy version:", numpy.__version__)
print("dask version:", dask.__version__)
my_numpy_array = ma.masked_array(range(4), mask=[0, 1, 0, 0])
my_dask_array = da.ma.masked_array(range(4), mask=[0, 0, 1, 0])
print(my_numpy_array * my_dask_array)
The result should be a dask array which when computed gives
It's easy enough to work around once you realise what's happening, and I have an open PR in SciTools/iris with a workaround that should solve it for Iris users. But a low-level fix is always better than a high-level workaround!
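For reference, one possible shape for such a workaround (a sketch only, and not necessarily what the Iris PR does) is to rebuild the NumPy masked array as a dask masked array before operating, so the product is computed dask-on-dask and neither mask is dropped:

import numpy.ma as ma
import dask.array as da

my_numpy_array = ma.masked_array(range(4), mask=[0, 1, 0, 0])
my_dask_array = da.ma.masked_array(range(4), mask=[0, 0, 1, 0])

# Wrap the NumPy masked array's data and mask in a dask masked array first,
# so the multiplication happens between two dask arrays and keeps both masks.
wrapped = da.ma.masked_array(my_numpy_array.data, mask=my_numpy_array.mask)
print((wrapped * my_dask_array).compute())  # masked where either input was masked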
# Test domained binary operations
assert_(isinstance(np.divide(wm, m), WrappedArray))
assert_(isinstance(np.divide(m, wm), WrappedArray))
assert_equal(np.divide(wm, m) * m, np.divide(m, m) * wm)
Could you add some test where the operands are not the same shape and broadcasting must be done? That is tricky to get right with masks.
Are reductions tested elsewhere?
There are some broadcasting tests of other functions, but it could likely be made more robust. I just added a quick test for the instance type after broadcasting in the new PR; good suggestion.
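For anyone following along, a check of roughly the kind being discussed could look like the following (illustrative only; the shapes and the use of a plain MaskedArray here are my own choices, not the exact test added in the PR):

import numpy as np

# A (2, 3) masked array broadcast against a plain (3,) array; the result
# should keep the MaskedArray type, shape, and mask placement.
m = np.ma.array(np.arange(6).reshape(2, 3), mask=[[0, 1, 0], [0, 0, 1]])
row = np.arange(3)

result = np.add(m, row)
assert isinstance(result, np.ma.MaskedArray)
assert result.shape == (2, 3)
assert result.mask[0, 1] and result.mask[1, 2]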
I think we should try to get this in, since it does not look like we will deprecate masked arrays soon. I do like the idea of having a separate repo like https://github.com/ahaldane/ndarray_ducktypes mentioned above for a new implementation of masked arrays; maybe we could make removing masked arrays part of NumPy 2.0.
Thanks @greglucas |
This changed the result of
Please resubmit this as a new PR since the merge was reverted.
@mattip, my apologies for the silence, I was on vacation the past week, but I should be able to update this and resubmit in the next day or two. Thanks for taking care of the revert. It may be this change: #16022 (comment)
This enables any NumPy ufunc operations that are called on a MaskedArray to use the masked version of that function automatically, without needing to resort to np.ma.func() calls directly.
Example
Even though data values in a are masked, they are still acted upon and evaluated, changing that underlying data. However, calling the masked (np.ma) version of the function does not touch the data under the mask. Note that this is a big change, but to a somewhat ambiguous case: what happens to masked values under function evaluation. The real reason I began looking into this was to avoid evaluating the function at the masked values, because I had already pre-masked the condition that would throw warnings at me. See #4959 for evaluating log10 at locations less than 0 that were under the mask, which still threw warnings.
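A minimal illustration of the behaviour described above (the values here are chosen for illustration rather than taken from the original example):

import numpy as np

# The invalid value is already masked, but the plain ufunc still evaluates it.
a = np.ma.array([-1.0, 10.0], mask=[True, False])

np.log10(a)     # evaluates log10(-1) on the underlying data -> RuntimeWarning
np.ma.log10(a)  # the np.ma version stays quiet and masks the invalid domain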
Implementation
I haven't added any documentation yet because I wanted to check and see if this was even the right way of approaching this idea. I struggled to figure out how to pass the out= object pointers around in a consistent manner, so I am definitely interested in suggestions on improving that. I had to update quite a few seemingly random areas of the masked-array codebase to make that work everywhere. There are a few tests that I had to update because values under the mask were being compared, and those have now changed with this implementation applying the masked version of ufuncs everywhere. This is a pretty major (underlying) API change, so I expect there to be quite a bit of discussion on the best approaches here.
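For comparison, the usual shape of out= handling in an __array_ufunc__ implementation is to unwrap the out arrays before calling the underlying ufunc and hand the original objects back afterwards. A condensed sketch, adapted from the pattern in NumPy's __array_ufunc__ documentation and using a hypothetical MaskedLike subclass rather than the code in this PR:

import numpy as np

class MaskedLike(np.ndarray):
    # Hypothetical minimal subclass, shown only to illustrate out= handling.

    def __array_ufunc__(self, ufunc, method, *inputs, out=None, **kwargs):
        # Unwrap our own instances so the base machinery sees plain ndarrays.
        args = tuple(i.view(np.ndarray) if isinstance(i, MaskedLike) else i
                     for i in inputs)
        outputs = out or (None,) * ufunc.nout
        if out is not None:
            kwargs['out'] = tuple(o.view(np.ndarray) if isinstance(o, MaskedLike)
                                  else o for o in out)

        results = super().__array_ufunc__(ufunc, method, *args, **kwargs)
        if results is NotImplemented:
            return NotImplemented
        if method == 'at':
            return None  # in-place operation, nothing to wrap

        if ufunc.nout == 1:
            results = (results,)
        # Hand back the original out objects where given, otherwise re-wrap.
        wrapped = tuple(o if o is not None else np.asarray(r).view(MaskedLike)
                        for r, o in zip(results, outputs))
        return wrapped[0] if len(wrapped) == 1 else wrapped

a = np.arange(3.0).view(MaskedLike)
out = np.empty(3).view(MaskedLike)
np.add(a, 1.0, out=out)  # out is filled in place and returned as MaskedLike

The key point is that the out array's memory is written through the plain ndarray view, while the originally supplied object is what gets returned to the caller.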
Linked Issues
Fixes #4959
Closes #15200