
BUG: Masked division considers large float64 values as inf #22347


Open
davzaman opened this issue Sep 28, 2022 · 9 comments

Comments

@davzaman

davzaman commented Sep 28, 2022

Describe the issue:

I am testing my data pipeline, which uses SimpleImputer from sklearn. Their code uses masked arrays to compute the mean, for example. However, when the arrays contain large float64 values (so their means are also large, but still finite and below the float64 maximum), np.ma.mean produces the correct value (which is not nan or inf) but considers it masked.

My best guess is that np.ma.mean is somehow operating at float32 precision even though the array is float64; at float32 precision those values would be considered infinite.

Reproduce the code example:

"""This code is based on SimpleImputer from sklearn since this is what raised this bug for me."""
# %%
from sklearn.utils._mask import _get_mask
from numpy import array, finfo, float32, float64, nan, mean as npmean
import numpy as np
import numpy.ma as ma

# %%
# The last column does not contain any nans
X = array(
    [
        [nan, 2.00000000e000, nan, 2.00000000e000, 6.10351562e-005],
        [
            1.00000000e000,
            2.00000000e000,
            -3.40282347e038,
            1.00000000e000,
            1.79769313e308,
        ],
    ]
)
X.dtype
"""
>>> dtype('float64')
"""

# %%
missing_mask = _get_mask(X, nan)
"""
The last column is not flagged to have nans.
>>> array([[ True, False,  True, False, False],
       [False, False, False, False, False]])
"""

# %%
masked_X = ma.masked_array(X, mask=missing_mask)
"""
The last column is not flagged to be invalid.
>>> masked_array(
  data=[[--, 2.0, --, 2.0, 6.10351562e-05],
        [1.0, 2.0, -3.40282347e+38, 1.0, 1.79769313e+308]],
  mask=[[ True, False,  True, False, False],
        [False, False, False, False, False]],
  fill_value=1e+20)
"""

# %%
mean_masked = ma.mean(masked_X, axis=0)
"""
The last column is shown to have an invalid mean
>>> masked_array(data=[1.0, 2.0, -3.40282347e+38, 1.5, --],
             mask=[False, False, False, False,  True],
       fill_value=1e+20)
"""

# %%
mean = npmean(X, axis=0)
"""
The last column does not have a nan mean.
>>> array([            nan, 2.00000000e+000,             nan, 1.50000000e+000,
       8.98846565e+307])
"""

# %%
print(mean.dtype)
print(ma.getdata(mean_masked).dtype)
"""
>>> dtype('float64')
>>> dtype('float64')
"""

# %%
ma.getdata(mean_masked)
"""
>>> array([ 1.00000000e+000,  2.00000000e+000, -3.40282347e+038,
        1.50000000e+000,  1.79769313e+308])
"""

# %%
ma.getdata(mean_masked).astype(float32)
"""
>>> array([ 1.0000000e+00,  2.0000000e+00, -3.4028235e+38,  1.5000000e+00,
                  inf], dtype=float32)
"""

# %%
1.79769313e308 < finfo(float64).max
"""
>>> True
"""
# %%
8.98846565e+307 < np.finfo(np.float64).max
"""
Interestingly, this mean (8.98846565e+307) also doesn't match the value stored under the mask in mean_masked (1.79769313e+308).
>>> True
"""

Error message:

There is no immediate error; the behavior is just not what I expect, and downstream I end up with fewer columns than expected (shapes do not match).

NumPy/Python version information:

1.22.4
3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) \n[GCC 10.3.0]

Context for the issue:

In the data I work with it is not unreasonable to see large values, and taking the mean should work as expected even when masking other nans and with large float64 values, unless I am missing something. I was unable to find anything documented that explains this behavior.
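For anyone hitting this in the meantime: a minimal workaround sketch is to use np.nanmean, which skips the masked-array division path entirely (this is just my suggestion, not what sklearn does internally):

```python
import numpy as np

# Same data as in the repro above.
X = np.array(
    [
        [np.nan, 2.0, np.nan, 2.0, 6.10351562e-05],
        [1.0, 2.0, -3.40282347e38, 1.0, 1.79769313e308],
    ]
)

# np.nanmean ignores NaNs without going through np.ma, so the
# large-but-finite mean in the last column survives unmasked:
m = np.nanmean(X, axis=0)
print(np.isfinite(m[-1]))  # True: the mean is ~8.99e307, not inf
```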

@melissawm melissawm added the component: numpy.ma masked arrays label Sep 28, 2022
@seberg seberg changed the title BUG: np.ma.mean considers large float64 values as inf BUG: Masked division considers large float64 values as inf Sep 29, 2022
@seberg
Member

seberg commented Sep 29, 2022

The actual issue stems from the division internal to the mean operation, so I suspect that is at the core and requires thought.
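A minimal check (using the problematic value from the repro above) shows that the masked divide alone reproduces it:

```python
import numpy as np
import numpy.ma as ma

# Dividing the large-but-finite column sum by a count of 2:
res = ma.divide(ma.array([1.79769313e308]), [2.0])
print(bool(np.asarray(res.mask).any()))  # True: the finite quotient is masked

# An ordinary division is untouched:
ok = ma.divide(ma.array([4.0]), [2.0])
print(bool(np.asarray(ok.mask).any()))   # False
```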

Division uses the following snippet:

class _DomainSafeDivide:
    """
    Define a domain for safe division.

    """

    def __init__(self, tolerance=None):
        self.tolerance = tolerance

    def __call__(self, a, b):
        # Delay the selection of the tolerance to here in order to reduce numpy
        # import times. The calculation of these parameters is a substantial
        # component of numpy's import time.
        if self.tolerance is None:
            self.tolerance = np.finfo(float).tiny
        # don't call ma ufuncs from __array_wrap__ which would fail for scalars
        a, b = np.asarray(a), np.asarray(b)
        with np.errstate(invalid='ignore'):
            return umath.absolute(a) * self.tolerance >= umath.absolute(b)

Which indeed misfires here. I am a bit unsure what the logic is supposed to achieve to begin with though...
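Concretely, here is a sketch of why the check misfires on the numbers from the repro (numerator = column sum ~1.8e308, denominator = count 2):

```python
import numpy as np

tiny = np.finfo(float).tiny           # ~2.225e-308
a, b = 1.79769313e308, 2.0            # numerator (column sum) and denominator

# The check flags this division as a "domain error"...
print(abs(a) * tiny >= abs(b))        # True -> result gets masked

# ...even though the quotient is perfectly representable:
print(np.isfinite(a / b))             # True (~8.99e307)

# The check effectively masks any quotient larger than 1/tiny,
# which is well below the float64 maximum:
print(1 / tiny, np.finfo(float).max)  # ~4.49e307 vs ~1.80e308
```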

This feels a bit like it may be another deeper issue of masked arrays? Why do we guard against returning infinities? If the user does not want them, they should maybe call a .masked_nonfinite(inplace=True) method/function after the fact?

But even if we try to guard against it, maybe we should just do the cleanup for the user, if we really do that cleanup consistently? Note it also only guards correctly for double-precision inputs!

(I admit, I am not in the weeds of masked arrays, so maybe I am missing the point!)

@davzaman
Author

Thanks for the response! I figured it came from the core, but couldn't dig much further. I think you're right about this snippet. It seems strange that it preemptively decides to handle this case without warning the user or exposing the behavior as a flag (e.g., "do safe division" vs. "I don't care about infinities, I'll deal with them myself"). I'm also confused about what the snippet is trying to achieve. Is the tolerance meant to normalize the numerator?

@seberg
Member

seberg commented Sep 30, 2022

Well, I did the terrible thing and wandered into the darkness. 673de27 seems to be the first commit introducing that we just always ignore any floating point errors, which makes this domain check effectively meaningless.

Now 3ddc421 reintroduces it, because it was still meaningful in __array_wrap__ (when a ufunc is called directly, not via the masked array interface).
I suspect that was an oversight; at that point we cannot avoid any warnings. In a sense, though, that path does a better job: inf/NaN which was previously already in the array is preserved:

In [6]: np.divide(np.ma.array([np.nan]), [0])
Out[6]: 
masked_array(data=[nan],
             mask=False,
       fill_value=1e+20)

In [7]: np.ma.divide(np.ma.array([np.nan]), [0])
Out[7]: 
masked_array(data=[--],
             mask=[ True],
       fill_value=1e+20,
            dtype=float64)

(sqrt(inf) would be an example using infinity. Inf doesn't work in the divide example above, because the "domain" definition also flags the inf input.)

To me, the __array_wrap__ path is thus slightly more correct, but I still don't even see the point in auto-masking domain errors.

If you remember, the rules for "domain errors" fall into two categories:

  • An infinity is created from finite values ("division by zero", a misnomer as also noted by Kahan ;))
  • A new NaN is created from non-NaN input ("invalid value"; the warning is also used for order comparisons like NaN > 0 in principle, but NumPy ignores that and it doesn't matter here.)
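The two categories correspond to NumPy's divide/invalid floating-point error states; a quick sketch:

```python
import numpy as np

# "division by zero": an infinity created from finite operands
with np.errstate(divide="raise"):
    try:
        np.array(1.0) / np.array(0.0)
    except FloatingPointError as e:
        print("divide:", e)

# "invalid value": a new NaN created from non-NaN operands
with np.errstate(invalid="raise"):
    try:
        np.array(0.0) / np.array(0.0)
    except FloatingPointError as e:
        print("invalid:", e)
```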

Then the whole idea of calculating domains up-front seems like a lot of hassle for little pay-off when the domain is not as simple as sqrt's x >= 0.
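(For a domain that simple, the up-front check does behave as intended; a minimal sketch:)

```python
import numpy as np

# np.ma.sqrt pre-computes its domain and masks the out-of-domain
# input instead of letting it produce a NaN:
out = np.ma.sqrt(np.ma.array([4.0, -1.0]))
print(out.mask)  # [False  True]: only the negative entry is masked
print(out[0])    # 2.0
```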

In any case, unless we dig deep e.g. with what @greglucas once started (or a whole new start), I am not sure there is much to be done. But I do think we can remove the domain handling from that specific path completely without changing the behavior at all.

@seberg
Member

seberg commented Oct 5, 2022

@mhvk, @bsipocz do you have any input on how masked arrays should ideally behave and whether we should move forward here? Brigitta also had found the old PR gh-20551 and issue gh-20506, where @math2001 addresses this issue (partially, I think)!

The difference is that:

  • np.ma.ufunc (through ma namespace) will mask all infinities/NaNs in the result whether they were created by the call or not.
  • np.ufunc calls on masked arrays will attempt to mask all new infinities/NaNs (but do so somewhat sloppily). Those calls do not mask existing NaNs/infs, though.

If we can decide that np.ufunc should behave like np.ma.ufunc, I think we could simplify things (possibly using __array_ufunc__ but also even via __array_wrap__). We could also go the other way, though.

(I am not sure which way is better, especially considering the historic baggage, since we don't want to break users: In that regard I am OK if the answer is to not touch things now and hope for a replacement.)

@mhvk
Contributor

mhvk commented Oct 5, 2022

@seberg - this is partially why I just went with my own rewrite, though the main reason was to make masked quantity possible (https://docs.astropy.org/en/latest/utils/masked/index.html#utils-masked-vs-numpy-maskedarray)

A third option is to not touch np.ma.ufunc at all, so that existing code continues to work, but make np.ufunc not do anything with the mask (except propagate it), and then (eventually?) deprecate the ma.ufunc versions. But perhaps not touching things is best...

@seberg
Member

seberg commented Oct 5, 2022

@mhvk ok, I had lost track of the fact that this is actually in astropy! I do wonder a bit whether it wouldn't make sense as a stand-alone package...
There seem to be quite a lot of masked-array users out there, and in the long term keeping them on NumPy doesn't seem like doing anyone a favor.

@mhvk
Contributor

mhvk commented Oct 5, 2022

Indeed, it might make sense as an independent package, but I followed the easy route -- easy especially when one is still fixing bugs... The same holds for astropy.units: there is really no reason it isn't its own package (or perhaps we could use pint, carrying over features as needed). But the problem, as always, is time...

@seberg
Member

seberg commented Oct 5, 2022

Right, it seems right for you to push things where they are easiest to push. Decoupling from astropy mainly becomes interesting if others want to take over a reasonable amount of the maintenance burden.

For now I am happy to (again?) know where to point for the possible future :).

@seberg
Member

seberg commented Nov 10, 2022

Believe it or not, this makes nanmedian of datetimes rely on comparisons between integers and timedelta64.
Those comparisons don't really make sense: what is timedelta64(3, "s") > np.int64([1])? (I admit you can define it for 0, but everything else seems meeeh.)

There is more fun there: since I had fixed a bug, it actually only worked for arrays, because the value-based legacy path gets the promotion wrong (it drops the timedelta unit).

Sorry, ranting... This is a mess built on top of a mess :/ and it is muddying the waters when trying to clear up the lower mess a bit.
