
BUG: Masked division considers large float64 values as inf  #22347

Open
@davzaman

Description

Describe the issue:

I am testing a data pipeline that uses SimpleImputer from sklearn, whose code uses masked arrays to compute the mean. However, when an array contains large float64 values (so the column means are also large, but still finite and below the float64 maximum), np.ma.mean produces the correct value (which is neither nan nor inf) yet reports it as masked.

My best guess is that np.ma.mean is somehow operating at float32 precision even though the array's dtype is float64; at float32 those values would indeed be infinite.
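A quick sanity check of this hypothesis (my own speculation, not confirmed anywhere in the docs): the finite float64 mean in question is far above the float32 maximum, so any float32 code path would overflow it to inf.

```python
import numpy as np

# The masked mean value (~8.99e307) versus the float32 ceiling (~3.4e38):
print(8.98846565e307 > np.finfo(np.float32).max)  # True
print(np.float32(8.98846565e307))                 # inf (overflows float32)
```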

Reproduce the code example:

"""This code is based on SimpleImputer from sklearn, since that is what surfaced this bug for me."""
# %%
import numpy as np
import numpy.ma as ma
from numpy import array, nan, mean as npmean
from sklearn.utils._mask import _get_mask

# %%
# The last column does not contain any nans
X = array(
    [
        [nan, 2.00000000e000, nan, 2.00000000e000, 6.10351562e-005],
        [
            1.00000000e000,
            2.00000000e000,
            -3.40282347e038,
            1.00000000e000,
            1.79769313e308,
        ],
    ]
)
X.dtype
"""
>>> dtype('float64')
"""

# %%
missing_mask = _get_mask(X, nan)
"""
The last column is not flagged to have nans.
>>> array([[ True, False,  True, False, False],
       [False, False, False, False, False]])
"""

# %%
masked_X = ma.masked_array(X, mask=missing_mask)
"""
The last column is not flagged to be invalid.
>>> masked_array(
  data=[[--, 2.0, --, 2.0, 6.10351562e-05],
        [1.0, 2.0, -3.40282347e+38, 1.0, 1.79769313e+308]],
  mask=[[ True, False,  True, False, False],
        [False, False, False, False, False]],
  fill_value=1e+20)
"""

# %%
mean_masked = ma.mean(masked_X, axis=0)
"""
The last column is shown to have an invalid mean
>>> masked_array(data=[1.0, 2.0, -3.40282347e+38, 1.5, --],
             mask=[False, False, False, False,  True],
       fill_value=1e+20)
"""
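My best guess at the actual mechanism, after reading numpy/ma/core.py (this is an assumption, not confirmed documentation): ma.mean is implemented as sum / count, and ma's "safe" division masks a / b whenever |a| * finfo(float).tiny >= |b| (see `_DomainSafeDivide`). For the last column that condition holds, so the finite quotient gets masked.

```python
import numpy as np

# Reproduce the (assumed) domain check that ma's true_divide applies:
col_sum = 6.10351562e-05 + 1.79769313e308  # sum of the last column
count = 2.0                                 # number of unmasked entries
tiny = np.finfo(np.float64).tiny
# |sum| * tiny is ~4.0, which is >= count, so the division is masked:
print(abs(col_sum) * tiny >= abs(count))    # True -> the mean is masked
```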

# %%
mean = npmean(X, axis=0)
"""
The last column does not have a nan mean.
>>> array([            nan, 2.00000000e+000,             nan, 1.50000000e+000,
       8.98846565e+307])
"""

# %%
print(mean.dtype)
print(ma.getdata(mean_masked).dtype)
"""
>>> dtype('float64')
>>> dtype('float64')
"""

# %%
ma.getdata(mean_masked)
"""
>>> array([ 1.00000000e+000,  2.00000000e+000, -3.40282347e+038,
        1.50000000e+000,  1.79769313e+308])
"""

# %%
ma.getdata(mean_masked).astype(np.float32)
"""
>>> array([ 1.0000000e+00,  2.0000000e+00, -3.4028235e+38,  1.5000000e+00,
                  inf], dtype=float32)
"""

# %%
1.79769313e308 < np.finfo(np.float64).max
"""
>>> True
"""
# %%
8.98846565e+307 < np.finfo(np.float64).max
"""
Interestingly, the value stored under the mask (1.79769313e+308) does not match the true mean (8.98846565e+307) either.
>>> True
"""
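A stripped-down reproduction without sklearn (my own minimal example, assuming the behavior is the same as in the pipeline above): the column holding the large-but-finite float64 value contains no nan at all, yet its mean comes back masked.

```python
import numpy as np
import numpy.ma as ma

# Last column has no invalid entries, only large finite float64 values:
X = np.array([[np.nan, 2.0, 6.10351562e-05],
              [1.0,    2.0, 1.79769313e308]])
m = ma.mean(ma.masked_array(X, mask=np.isnan(X)), axis=0)
print(ma.getmaskarray(m))  # last column is unexpectedly masked
```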

Error message:

There is no immediate error; the behavior is simply not what I expect, and I end up with fewer columns than expected downstream (shapes do not match).

NumPy/Python version information:

NumPy 1.22.4
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) [GCC 10.3.0]

Context for the issue:

In the data I work with, large values are not unreasonable, and taking the mean should work as expected even when other nans are masked and large float64 values are present, unless I am missing something. I was unable to find any documentation explaining this behavior.
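For what it's worth, a possible workaround sketch for my pipeline (assuming nan is the only sentinel being masked, and that the nan-aware reductions do not route through ma's domain-checked division): skip numpy.ma entirely and use np.nanmean.

```python
import numpy as np

# Same array as in the repro above; np.nanmean ignores nans directly:
X = np.array([
    [np.nan, 2.0, np.nan, 2.0, 6.10351562e-05],
    [1.0, 2.0, -3.40282347e38, 1.0, 1.79769313e308],
])
print(np.nanmean(X, axis=0))
# the last entry is ~8.99e307: finite, as expected
```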
