Thanks to visit codestin.com
Credit goes to github.com

Skip to content

BUG: Masked array default fill value can overflow #25677

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ngoldbaum opened this issue Jan 24, 2024 · 13 comments
Open

BUG: Masked array default fill value can overflow #25677

ngoldbaum opened this issue Jan 24, 2024 · 13 comments
Labels

Comments

@ngoldbaum
Copy link
Member

Describe the issue:

For both signed and unsigned integers the default fill value is 99999, while for floats it is 1e20.

This is problematic for (u)int[8,16] as well as half floats, which do not contain the default fill value in their valid range.

Reproduce the code example:

>>> arr = np.ma.array([1, 2, 3], mask=[1, 0, 1], dtype=np.int8)
>>> arr.filled()
array([63,  2, 63], dtype=int8)

>>> arr = np.ma.array([1, 2, 3], mask=[1, 0, 1], dtype=np.float16)
>>> arr.filled()
/Users/goldbaum/Documents/numpy/build-install/usr/lib/python3.11/site-packages/numpy/ma/core.py:3873: RuntimeWarning: overflow encountered in cast
  np.copyto(result, fill_value, where=m)
Out[16]: array([inf,  2., inf], dtype=float16)

Error message:

N/A

Python and NumPy Versions:

Numpy 2.0 dev on python 3.11.

Runtime Environment:

N/A

Context for the issue:

I don't think this is an urgent but didn't see an issue describing this behavior so I'm filing this for future searchers.

That said, this does complicate the NEP 50 implementation because we need to have a number of workarounds so that this continues to work. I think it would be better to choose a default fill value that fits in the range of the data ([i,f]info.max?) for these types, but I have no idea what that entails for backward compatibility.

@msavinash
Copy link
Contributor

msavinash commented Jan 25, 2024

If a specific fill value is absent in filled, it gets the default fill value for that dtype. The catch is that it gets the default fill value for that "kind" of dtype. below code from numpy.ma core code is given for reference:
default_filler = {'b': True, 'c': 1.e20 + 0.0j, 'f': 1.e20, 'i': 999999, 'O': '?', 'S': b'N/A', 'u': 999999, 'V': b'???', 'U': 'N/A' }

You can see that for any kind of float (float16, float32, float64) the same fill value is being used. For a fill value of "1.e20" you always get the runtime error while typecasting (say to float16). I tried the below code to reproduce that issue:
converted_value = np.array(1e+20, dtype=np.float64).astype(np.float16)

One workaround is to define fill values to each specific dtype. Another solution I tried that worked was to replace default value of float to 'inf', which did not raise any issue for any specific float dtypes.
fill_value = np.inf

@ngoldbaum
Copy link
Member Author

ngoldbaum commented Jan 25, 2024

The main issue here isn't so much the implementation of changing the default fill value, or even coming up with better default fill values, it's whether or not changing this will break user code. The default fill value is used in a number of places implicitly in operations so it's not clear to me if changing the default will have unintended consequences elsewhere.

@msavinash
Copy link
Contributor

Since the fill value is not determined by the user, we should be able to change the default values without major code breakage. But there are chances of user code that depends on the default values (can't think of an example right off the bat). Maybe change and document it?

@msavinash
Copy link
Contributor

Also this warning is not seen in earlier versions of Numpy (I tried with v1.23.5). So it can be changed, although I need to verify how it's being handled in earlier versions.

@BielStela
Copy link

Hello, I just want to add to the discussion that this issue is causing real problems in some cases. Since 2.0 does not accept out of bounds python ints this causes problems when serializing and deserializing masked arrays with default fill values. For example when multiprocessing with dask, it serializes the masked array by parts and deserializes the fill value to python integer causing an overflow exception when rebuilding the masked array. I can't tell what would be the sane default for fill_value but having something that is well behaved under any circumstance would be ideal.

Hope this helps and thank you!

@ngoldbaum
Copy link
Member Author

Hi, I don't think this interaction with dask has been reported. Can you make a short runnable example that demonstrates the issue?

@BielStela
Copy link

BielStela commented Jul 18, 2024

Sure,

In [1]: import numpy as np
   ...: import dask
   ...: import dask.distributed

In [2]: client = dask.distributed.Client()

In [3]: def f():
   ...:     return np.ma.masked_array([1,1], [0,1], dtype=np.uint8)
   ...:

In [4]: task = dask.delayed(f)()

In [5]: client.compute(task).result()
2024-07-18 14:48:24,570 - distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File ".venv/lib/python3.12/site-packages/numpy/ma/core.py", line 489, in _check_fill_value
    fill_value = np.asarray(fill_value, dtype=ndtype)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: Python integer 999999 out of bounds for uint8

using:
Python 3.12.4
numpy 2.0.0
dask 2024.7.0

@ngoldbaum
Copy link
Member Author

ping @seberg in case you weren't aware of this interaction between dask, masked arrays, and the NEP 50 changes

Maybe we should use the minimum or maximum int as the default fill values for ints that are too small to store 999999? Or whatever value it overflows to that we were accidentally using before?

@seberg
Copy link
Member

seberg commented Jul 19, 2024

I was faintly aware, but not that it was an issue here. A solution may be to make sure that the default fill value is typed to e.g. int64. That way NumPy will keep force-casting it. Not 100% sure if that has other fallout, but Ithink the fill-value is always an array so it should be fine.

@BielStela
Copy link

BielStela commented Jul 21, 2024

The solution we apply goes in that line: we force-cast the fill_value with numpy with something like this to force a controlled overflow

def f():
    arr = np.ma.masked_array([1,1], [0,1], dtype=np.uint8)
    arr.fill_value = arr.fill_value
    return arr

fill_value will only overflow if is the default value so It does the job avoiding serialization errors because it overflows in the setter, becoming 63~ and then it is safe to move it around. Not the prettiest thing I must say but it imitates the old (bad) behavior. Luckily in our case we can live with a trashy fill_value since we never use it to fill the array.

@fengluoqiuwu
Copy link
Contributor

With type np.uint8, default fill value can be fixed by calling _check_fill_value to force casting its type.
But np.float16 can't it will always get a overflow runtime warning when use the default_value.
And another issue is, _check_fill_value is define to get a (valid) fill value of a certain type, but it doesn't with np.float16.

>>> check = np.ma.core._check_fill_value
>>> test = check(None,np.float16)
>>> test
array(inf, dtype=float16)
>>> test = check(test,np.float16)
C:\Users\Eden_\Desktop\Coder\MachineLearning\my_numpy\numpy\numpy\ma\core.py:492: RuntimeWarning: overflow encountered in cast  fill_value = np.asarray(fill_value, dtype=ndtype)
>>> test
array(inf, dtype=float16)

@Kirill888
Copy link

Kirill888 commented Dec 3, 2024

This also breaks aggregation operations for masked arrays with small dtype:

this raises an error

import numpy as np
x = np.arange(10, dtype="uint8")
a = np.ma.MaskedArray(x, x & 1, fill_value=111)
a.max(keepdims=True)

Due to fill_value being too large, error is raised from here (line 6089):

numpy/numpy/ma/core.py

Lines 6082 to 6089 in 5047f7b

result = self.filled(fill_value).max(
axis=axis, out=out, **kwargs).view(type(self))
if result.ndim:
# Set the mask
result.__setmask__(newmask)
# Get rid of Infs
if newmask.ndim:
np.copyto(result, result.fill_value, where=newmask)

Essentially result array is produced with .view(type(self)), which ends up losing fill_value=111 parameter, so gets replaced by 99999, then it raises an error from np.copy as you can't fit that.

I think it should be

result = self.filled(fill_value).max( 
     axis=axis, out=out, **kwargs).view(type(self))
result.fill_value = self.fill_value  

@seberg
Copy link
Member

seberg commented Dec 3, 2024

@Kirill888 there is a hot-fix that will be in 2.1.x (soon) and 2.2. (also not far future).

Although, it would be nice to figure out a clean way to handle this. Maybe something like: unless explicitly set don't usually propagate mask values (right now "explicitly set" is not defined, because the value is cached and thus considered "set" on random operations).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants