Description
Describe the issue:
I am testing a data pipeline that uses SimpleImputer
from sklearn, which computes column means with masked arrays. However, when an array contains large float64 values (so the column means are also large, but still finite and below the float64 maximum), np.ma.mean computes the value (which is not nan or inf) but reports it as masked.
My best guess is that np.ma.mean
is somehow applying float32 limits even though the array dtype is float64; under float32 those values would be considered infinite.
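A quick check of that guess (my own snippet, not from SimpleImputer): the largest value in the array below is finite as float64 but overflows float32:

```python
import numpy as np

# The largest value in the test data is representable in float64...
big = np.float64(1.79769313e308)
print(np.isfinite(big))  # -> True

# ...but exceeds float32's range, so any float32 code path would see inf.
print(np.float32(big))   # -> inf
```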
Reproduce the code example:
"""This code is based on SimpleImputer from sklearn since this is what raised this bug for me."""
# %%
from sklearn.utils._mask import _get_mask
import numpy as np
from numpy import array, finfo, float32, float64, nan, mean as npmean
import numpy.ma as ma
# %%
# The last column does not contain any nans
X = array(
    [
        [nan, 2.00000000e000, nan, 2.00000000e000, 6.10351562e-005],
        [
            1.00000000e000,
            2.00000000e000,
            -3.40282347e038,
            1.00000000e000,
            1.79769313e308,
        ],
    ]
)
X.dtype
"""
>>> dtype('float64')
"""
# %%
missing_mask = _get_mask(X, nan)
"""
The last column is not flagged to have nans.
>>> array([[ True, False, True, False, False],
[False, False, False, False, False]])
"""
# %%
masked_X = ma.masked_array(X, mask=missing_mask)
"""
The last column is not flagged to be invalid.
>>> masked_array(
data=[[--, 2.0, --, 2.0, 6.10351562e-05],
[1.0, 2.0, -3.40282347e+38, 1.0, 1.79769313e+308]],
mask=[[ True, False, True, False, False],
[False, False, False, False, False]],
fill_value=1e+20)
"""
# %%
mean_masked = ma.mean(masked_X, axis=0)
"""
The last column's mean is reported as masked (invalid).
>>> masked_array(data=[1.0, 2.0, -3.40282347e+38, 1.5, --],
mask=[False, False, False, False, True],
fill_value=1e+20)
"""
# %%
mean = npmean(X, axis=0)
"""
The last column does not have a nan mean.
>>> array([ nan, 2.00000000e+000, nan, 1.50000000e+000,
8.98846565e+307])
"""
# %%
print(mean.dtype)
print(ma.getdata(mean_masked).dtype)
"""
>>> dtype('float64')
>>> dtype('float64')
"""
# %%
ma.getdata(mean_masked)
"""
>>> array([ 1.00000000e+000, 2.00000000e+000, -3.40282347e+038,
1.50000000e+000, 1.79769313e+308])
"""
# %%
ma.getdata(mean_masked).astype(float32)
"""
>>> array([ 1.0000000e+00, 2.0000000e+00, -3.4028235e+38, 1.5000000e+00,
inf], dtype=float32)
"""
# %%
1.79769313e308 < finfo(float64).max
"""
>>> True
"""
# %%
8.98846565e+307 < np.finfo(np.float64).max
"""
Interestingly, the value under the mask (1.79769313e+308) also differs from the np.mean result (8.98846565e+307).
>>> True
"""
Error message:
There is no immediate error; the behavior is simply not what I expect, and downstream I end up with fewer columns than expected (shapes do not match).
NumPy/Python version information:
NumPy 1.22.4
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) [GCC 10.3.0]
Context for the issue:
In the data I work with, large values are not unreasonable, and taking the mean should work as expected even when other nan values are masked and the array contains large float64 values, unless I am missing something. I was unable to find any documentation explaining this behavior.
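For what it's worth, one candidate explanation I found while poking around numpy's source (an assumption on my part, not a confirmed diagnosis): np.ma's division appears to guard against overflow via a domain check (np.ma.core._DomainSafeDivide) that masks a / b whenever |a| * tiny >= |b|. The column sum here is large enough to trip that condition for a count of 2:

```python
import numpy as np

# Hypothetical mechanism: ma.mean computes sum / count, and ma's
# safe-divide domain masks a / b whenever |a| * tiny(float64) >= |b|.
col_sum = 6.10351562e-005 + 1.79769313e308  # sum of the last column
count = 2.0

tiny = np.finfo(np.float64).tiny
print(abs(col_sum) * tiny)           # roughly 4.0
print(abs(col_sum) * tiny >= count)  # -> True, so the division is masked
```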