np.asarray(masked_array) should raise rather than silently dropping the mask #26669

Open
rgommers opened this issue Jun 12, 2024 · 8 comments
@rgommers
Member

rgommers commented Jun 12, 2024

There are a lot of issues, both in numpy itself and in downstream packages, where masked arrays silently do the wrong thing when they get passed to a function that isn't aware of masked arrays and starts off by calling np.asarray on its inputs. gh-26530 is a recent example. Across the main numpy namespace, almost all functions are not mask-aware, and for those that do the right thing it's mostly by accident. Masked arrays should only be passed to functions in the numpy.ma namespace. In SciPy it's the same: scipy.stats.mstats supports masked arrays, nothing else in SciPy does.

Masked arrays should be treated like sparse arrays or other array types with different semantics: conversion should not be done implicitly, because dropping a mask implicitly is almost never the correct thing to do.

gh-18675 basically came to this conclusion. Other related issues:

We should aim to add a FutureWarning first I think, since raising immediately may be disruptive.
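As a rough illustration of what a warning-first stage could look like from the outside, here is a minimal sketch. The helper name `asarray_with_warning` and the message text are assumptions made for this example; this is not how NumPy implements or plans to implement the deprecation.

```python
import warnings

import numpy as np


def asarray_with_warning(obj):
    """Hypothetical helper: behaves like np.asarray, but warns before dropping a mask."""
    if isinstance(obj, np.ma.MaskedArray):
        # Assumption: a FutureWarning is the first deprecation stage, as proposed above.
        warnings.warn(
            "implicitly converting a masked array to ndarray drops the mask; "
            "use np.ma.getdata(x) or x.filled(...) to convert explicitly",
            FutureWarning,
            stacklevel=2,
        )
    return np.asarray(obj)
```

Calling it on a plain ndarray or list stays silent; only masked-array inputs trigger the warning.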

EDIT: code example:

>>> import numpy as np
>>> x = np.arange(12).reshape((3, 4))
>>> x = np.ma.masked_less(x, 6)
>>> x
masked_array(
  data=[[--, --, --, --],
        [--, --, 6, 7],
        [8, 9, 10, 11]],
  mask=[[ True,  True,  True,  True],
        [ True,  True, False, False],
        [False, False, False, False]],
  fill_value=999999)
>>> np.asarray(x)  # mask gets dropped
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

So if dropping the mask is wrong and np.asarray(masked_array) starts raising, what if you do want the underlying ndarray/values? https://numpy.org/devdocs/reference/maskedarray.generic.html#accessing-the-data suggests that using either x.data or np.ma.getdata(x) is the way to do that.
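For reference, the explicit conversions mentioned above can be sketched as follows. `x.filled(...)` is included as a third option that replaces masked entries with a chosen value; the fill value `0` is an arbitrary choice for illustration.

```python
import numpy as np

x = np.ma.masked_less(np.arange(12).reshape((3, 4)), 6)

raw_attr = x.data            # underlying values, mask ignored
raw_func = np.ma.getdata(x)  # same values, also works on non-masked inputs
filled = x.filled(0)         # masked entries replaced by an explicit value

print(type(raw_func).__name__)
print(filled)
```

Note that the values stored under the mask in `x.data` are whatever happened to be there, so `filled` is usually the safer choice when those entries will be used numerically.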

@seberg
Member

seberg commented Jun 13, 2024

Note that to do this, you will need to refine the __array__ protocol to take precedence in subclass cases:

if type(subclass).__array__ != ndarray.__array__:
    arr = subclass.__array__()

(Strictly speaking, the check isn't necessary; it is an optimization to avoid unnecessary calls.)
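To make the rule concrete, here is a pure-Python sketch of that precedence check. It only emulates the idea outside NumPy: `asarray_sketch` and `NoImplicitConversion` are hypothetical names, and the real change would have to live in NumPy's conversion machinery.

```python
import numpy as np


class NoImplicitConversion(np.ndarray):
    """Hypothetical subclass that opts out of implicit conversion."""

    def __array__(self, dtype=None, copy=None):
        raise TypeError("implicit conversion to ndarray is not allowed")


def asarray_sketch(obj):
    """Stand-in for np.asarray that lets a subclass's own __array__ win."""
    if isinstance(obj, np.ndarray) and type(obj).__array__ is not np.ndarray.__array__:
        # The subclass overrides __array__, so defer to it (it may raise).
        return obj.__array__()
    # Plain ndarrays, non-overriding subclasses, and everything else.
    return np.asarray(obj)
```

With this rule, a subclass that does not override `__array__` converts as before, while one that does override it decides for itself whether conversion is allowed.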

Ping @mhvk since I think astropy might be the best "test bed" (or its downstream).

This would definitely be nice to do, and maybe it would be just OK. It is a bit unclear to me how pervasive the "well working" code paths are, i.e. the ones where things actually work out just fine.
(Yes, I realize it is terrible that there is no clear pattern of when things work and when they don't!)

I.e. I don't mind trying it, but I could imagine that we need some preparation to keep some things working (i.e. things that we would keep working in a replacement).
I think __array_function__ and __array_ufunc__ can be used for this, the problem with the last try of implementing them was that they tried to fix masked arrays, which is hard because there are small inconsistencies in many places.
OTOH, if the ambition is just to blocklist some functions and manually keep a few others working, that seems plausible to me.

@rgommers
Member Author

I added a code example. I haven't investigated what dunder method is best used/adapted here, that's an implementation detail at this point. Tests will show what is actually supported and should keep working.

OTOH, if the ambition is just to blocklist some functions and manually keep a few others working, that seems plausible to me.

I suspect that ufuncs and all other functions which operate element-wise and correctly return a masked array, because of __array_wrap__/__array_finalize__, are the only functions which may be considered as remaining supported. That said, I can also see that going away, because it's impossible to explain to a user that they should use np.ma.xxx functions, but np.xxx is also fine but only for some subset of functions and which ones they are cannot be deduced from function names.

Ideally, all functions in the main namespace should document whether they do or don't preserve subclasses, and that would be consistent for types of functions (element-wise, reductions, etc.) and tested. That was always the intent, and it mostly works due to __array_wrap__/__array_finalize__ and custom handling inside np.ma. It's a little fragile though, and when adding or changing any function in the main namespace that isn't using ufunc machinery, it's fairly likely that it's silently broken for masked array inputs.

Anyway, whether functions work is secondary to whether np.asarray called from outside of numpy works. The intended behavior is (with x a masked array instance):

  • np.asarray(x) and np.array(x) -> TypeError
  • np.asanyarray(x) -> works, preserves subclass
  • np.ma.getdata(x) -> use if the user intends to drop a mask from a masked array, and get an ndarray with all data values (behavior for previously-masked values is undefined)

If code outside numpy uses np.asarray instead of np.asanyarray, that should stop accepting masked arrays because that code is already broken.
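Until something like this lands in NumPy, downstream code can approximate the intended behavior with its own guard. The sketch below is only an illustration; `require_unmasked` is a hypothetical helper name, not an existing NumPy function.

```python
import numpy as np


def require_unmasked(obj):
    """Hypothetical downstream guard mirroring the intended np.asarray behavior."""
    if isinstance(obj, np.ma.MaskedArray):
        raise TypeError(
            "masked arrays are not supported here; convert explicitly with "
            "np.ma.getdata(x) or x.filled(...)"
        )
    return np.asarray(obj)


x = np.ma.masked_less(np.arange(6), 3)
kept = np.asanyarray(x)     # works today: subclass and mask are preserved
values = np.ma.getdata(x)   # explicit opt-in to dropping the mask
```

A function that starts with `require_unmasked(x)` instead of `np.asarray(x)` fails loudly on masked input rather than computing on the values hidden under the mask.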

@mhvk
Contributor

mhvk commented Jun 13, 2024

It's a bit of a dead horse I'm beating here, but really numpy should just use np.asanyarray everywhere, so subclasses are passed through... I don't really see that it is ever the right thing to coerce to the main class in a function; really all it helped for was not having to deal with matrix...

But the time to do that would have been numpy 2.0. Equally, though, I think the time for such a major break of what happens to np.ma.MaskedArray would have been numpy 2.0 as well.

More constructively, implementing __array_function__ to call the np.ma equivalent and bailing on a specific list of unsupported functions would seem the better option.

This perhaps goes more in the direction of how it would be approached with the array api, with __array_namespace__ giving np.ma.

@mhvk
Contributor

mhvk commented Jun 13, 2024

Now read the last comment: Why would np.ma.MaskedArray be special among subclasses? E.g., if one puts in a Quantity, the unit is stripped away, which is equally wrong (and was equally annoying until __array_function__ arrived).
Indeed, arguably raising on all subclasses is a bit more logical than just special-casing MaskedArray. It might be done by allowing a subok=None in addition to True/False in np.array and changing the default for np.asarray. But not advocating that...

What I do worry about is that raising in asarray is going to break a lot of code outside numpy that long ago worked around np.asarray stripping the mask (matplotlib? they definitely have some odd hard-coded stuff for MaskedArray, which we ran into because it was triggered by anything with a .mask attribute). Is it really worth it to force all those users to change their code? In contrast, __array_function__ would solve the problem in np.ma itself.

@seberg
Member

seberg commented Jun 13, 2024

Is it really worth it to force all those users to change their code? In contrast, __array_function__ would solve the problem in np.ma itself.

Right, this is what I was worried about, and I am not sure if I am underestimating the challenge when it comes to code like matplotlib.
Although, code that already checks for .mask should also be in a position to call some standardized normalization (like .data or .filled()).

In this case, also note that if you deprecate things you must also reject buffer protocol exports the same way.

Indeed, arguably raising on all subclasses is a bit more logical than just special-casing

Making masked arrays special wouldn't be great, which is why I brought up the logic of guaranteeing a call to __array__() which allows the subclass to opt-in to not allowing asarray().
Btw, just as important as asarray() is rejecting the buffer protocol export.

The question is really about downstream use of asarray() and less about NumPy use of it, since within NumPy you always have the choice to patch things up with __array_function__, and I don't know what difficulties there might be.


At some point, it might be a similar effort to create the MA alternative that isn't a subclass ;).

@rgommers
Member Author

rgommers commented Jun 13, 2024

Completely agree with the need/desire for better subclass support inside numpy. To me, that really is a much bigger and separate topic though than "make np.asarray(masked_array) raise".

Ideally I'd like this to be a two line patch (modulo tests) that accomplishes this, and changes nothing else. That would solve a host of silent correctness bugs.

Why would np.ma.MaskedArray be special among subclasses?

It is semantically wrong, because masked values may contain arbitrary values or mean "undefined value" (a la pandas.NA). Once they're masked and have gone through some other function calls, there are no guarantees.

Side note: pandas series/dataframes with NA-aware dtypes backed by numpy arrays are a close analogy here, calling np.asarray on those should also be raising rather than returning raw int/float/string values.

E.g., if one puts in a Quantity, the unit is stripped away, which is equally wrong.

I don't have much experience with Quantity in particular, so please correct me if I am wrong, but I don't think that that is true. Units seem more like metadata that annotates values, and maybe enforces checks for (e.g.) correctness of the physics behind the numerical operations (don't combine meters with seconds). However, anything that is metadata-only is like a type annotation - it's useful, but it doesn't invalidate what's done with numerical operations. If you feed a Quantity instance with a voltage unit into, say, a scipy.signal function, you lose the unit from the output (annoying indeed) but still get valid signal processing results back. With masked arrays, you don't.

@mhvk
Contributor

mhvk commented Jun 13, 2024

With Quantity the problem usually becomes apparent only if you have two: ignore the units in combining them and the result is wrong. Before __array_function__, this meant things like concatenate would silently create wrong results.

Note that at the level of just blacklisting numpy functions, the __array_function__ implementation is as easy as changing np.asarray:

    def __array_function__(self, function, types, args, kwargs):
        if function in UNSUPPORTED_FUNCTIONS:
            return NotImplemented

        return super().__array_function__(function, types, args, kwargs)

Deferring to np.ma functions is not much more work, but what is a ton of work is writing all the tests required to check that it does the right thing, that the signatures are consistent, etc.

Though it would probably help to start with stuff from astropy's Masked - see its __array_function__ and the class that tracks signature compatibility

p.s. I'd happily review PRs to adjust MaskedArray, but don't have the time to do it myself. Sadly, how MaskedArray treats subclasses is too fundamentally broken to be useful for us (hence our own class). Indeed, if it hadn't been, I would have added __array_ufunc__ and __array_function__ a long time ago.

EDIT (seberg): I have doubts super works, but _implementation() is definitely fine to keep behavior.
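A self-contained version of the blocklist idea might look like the sketch below. The subclass name `StrictMaskedArray` and the choice of `np.convolve` as the only blocked function are assumptions made for illustration; a real list would need to be built and tested function by function, and the fallback could use `function._implementation` instead of `super()` if the latter turns out not to work.

```python
import numpy as np

# Illustrative blocklist only; which functions belong here is an open question.
UNSUPPORTED_FUNCTIONS = {np.convolve}


class StrictMaskedArray(np.ma.MaskedArray):
    """Hypothetical subclass that refuses known mask-unaware functions."""

    def __array_function__(self, function, types, args, kwargs):
        if function in UNSUPPORTED_FUNCTIONS:
            # NumPy turns this into a TypeError for the caller.
            return NotImplemented
        # Everything else keeps its current behavior.
        return super().__array_function__(function, types, args, kwargs)


x = np.ma.masked_less(np.arange(6), 3).view(StrictMaskedArray)
total = np.sum(x)  # still works: sums only the unmasked entries 3, 4, 5
```

Calling `np.convolve(x, x)` on such an instance raises a `TypeError` instead of silently convolving the values stored under the mask.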

@rgommers
Member Author

With Quantity the problem usually becomes apparent only if you have two: ignore the units in combining them and the result is wrong. Before __array_function__, this meant things like concatenate would silently create wrong results.

I'm still not 100% sure I understand the answer. It matters whether the numerical values are correct or not. For concatenate that will clearly be the case - units can be wrong if concatenate doesn't know about units, but that's irrelevant to whether it's a proper subclass (Liskov substitutable).

If you have an ndarray subclass, it translates to: can function calls with two inputs be separated into "func(ndarray, ndarray) + separate handling of the extra bits". I.e. for Quantity, is the effect of units separable so that 2 * 2 == 4 always holds true, or are there things like nonlinear coordinate systems that affect the numerical behavior of *? (or the equivalent for some more elaborate function)

np.matrix and np.ma.MaskedArray are both not subclasses that have that substitution property. It's just a little more obvious for matrix, but non-element-wise functions like convolve will yield numerically wrong answers for masked arrays.
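A small example of the convolve case, comparing the main-namespace function with its np.ma counterpart. The input values are made up for illustration.

```python
import numpy as np

# The last entry is masked; its stored value (100) should not influence results.
x = np.ma.masked_array([1, 2, 100], mask=[False, False, True])
kernel = [1, 1]

plain = np.convolve(x, kernel)     # not mask-aware: the stored 100 leaks into the output
aware = np.ma.convolve(x, kernel)  # mask-aware: outputs touched by a masked input are masked

print(type(plain).__name__, plain)
print(np.ma.getmaskarray(aware))
```

The plain call returns an ordinary ndarray whose last two entries are computed from the masked value, with no indication that anything was masked; the np.ma version marks those same two entries as masked.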

Note that at the level of just blacklisting numpy functions, the __array_function__ implementation is as easy as changing np.asarray:

That does sound like a nice and simple implementation.

5 participants