DISCUSS: About issue of masked array #27588

Open
fengluoqiuwu opened this issue Oct 18, 2024 · 10 comments

Comments

@fengluoqiuwu
Contributor

fengluoqiuwu commented Oct 18, 2024

While reading the code and addressing some bugs related to numpy/ma, I encountered a few questions:

  1. Filling Values in Masked Arrays:
    Do we actually care about the exact fill value in masked arrays, given that they are masked by other values? If not, I believe I can resolve bug 27580 by simply removing the check for inf.

  2. Default Fill Value for Masked Arrays:
    Currently, the default fill value for masked arrays is defined by default_filler for Python data types. However, Python doesn’t have unsigned integer types, so for np.uint arrays, the default fill value is stored as np.int64(999999). This causes issues in operations like copyto(..., casting='same_kind'), as seen in bug 27580 and bug 27269. Should we consider using NumPy data types for the default fill value to ensure that the fill value matches the data type of the array (e.g., using a fill value that corresponds to integers or unsigned integers as appropriate)?

  3. Large Default Fill Values:
    Some default fill values seem quite large, such as 999999 for np.int8 and 1.e20 for np.float16. What would be an appropriate default fill value for masked arrays, particularly for small data types like int8 and float16? (bug25677)

  4. Reviewing copyto in Masked Arrays:
    Should we perform a comprehensive review of copyto functionality for masked arrays? It seems likely that similar bugs could exist due to the same root cause.

  5. Testing for Small Data Types:
    Should we extend the test suite to include small data types (e.g., int8 and float16) to ensure that functions handle these cases correctly?

  6. Checking Method Consistency:
    Should we check that methods behave consistently between an unmasked masked array (one with no masked elements) and an ndarray? Their methods and behaviors differ in places; for example, see bug27258.

  7. Making the Standard Clear:
    The standard for some methods is unclear. For example, should invalid results be masked automatically? Some functions (such as sqrt and std) do this, but others (such as median and mean) do not. Worse, the documentation for some functions does not mention it even though they change the mask automatically (sqrt, std), while for others it is mentioned but the mask is not changed (mean).
    Worse still, some important methods have no clear explanation in either the documentation or the docstring. For example, most ufunc calls go through __array_wrap__, and I think it might be the cause of bug25635.
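
As a small illustration of the auto-masking behavior raised in point 7, the sketch below applies np.ma.sqrt to input that has no masked elements but contains a negative value. The helper name masked_after is hypothetical and only exists for this example; the same check could be pointed at other np.ma functions to compare how they treat invalid results.

```python
import numpy as np


def masked_after(func, data):
    """Apply an np.ma function to unmasked input and return the result's mask.

    `masked_after` is a hypothetical helper for this illustration only.
    """
    result = func(np.ma.array(data))
    # getmaskarray always returns a full boolean array, even for "nomask".
    return np.ma.getmaskarray(result).tolist()


# The input has no mask, but -1.0 is outside the domain of sqrt.
sqrt_mask = masked_after(np.ma.sqrt, [-1.0, 4.0])
print(sqrt_mask)
```

The first element comes back masked even though the input carried no mask, which is the implicit mask change the comment describes.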

Since I'm not sure where to place these questions, I’ve marked this as a discussion for now.
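
To make points 2 and 3 concrete, here is a minimal sketch that compares the default fill value reported by np.ma.default_fill_value with the representable range of a few dtypes, and checks the 'same_kind' casting rule that copyto relies on. The helper names dtype_max and default_fill_fits are hypothetical; the fill values are whatever the installed NumPy reports, and the thread above cites 999999 for integers and 1.e20 for floats.

```python
import numpy as np


def dtype_max(dtype):
    """Largest finite value representable by an integer or float dtype."""
    dtype = np.dtype(dtype)
    info = np.finfo(dtype) if dtype.kind == "f" else np.iinfo(dtype)
    return info.max


def default_fill_fits(dtype):
    """True if the default fill value is within the dtype's range."""
    fill = np.ma.default_fill_value(np.dtype(dtype))
    return bool(fill <= dtype_max(dtype))


for dt in (np.int8, np.uint8, np.float16, np.int64, np.float64):
    name = np.dtype(dt).name
    fill = np.ma.default_fill_value(np.dtype(dt))
    print(name, fill, default_fill_fits(dt))

# A signed 64-bit fill value cannot be cast to an unsigned array under
# 'same_kind', whereas an unsigned fill value of any width could be.
signed_ok = np.can_cast(np.int64, np.uint8, casting="same_kind")
unsigned_ok = np.can_cast(np.uint64, np.uint8, casting="same_kind")
print(signed_ok, unsigned_ok)
```

The casting check suggests why storing the fill value in the array's own kind (unsigned for unsigned arrays) would sidestep the 'same_kind' failure, although whether that is the right fix is exactly the question being asked.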

@fengluoqiuwu
Contributor Author

As I delve deeper into the masked array code, my concerns continue to grow. There are numerous issues within it that could lead to serious consequences. Recently, I've been attempting to fix some bugs in this module; however, I find that the repairs I implement often merely suppress the bugs rather than eliminate them at their source.

To truly eradicate these bugs, we must first establish appropriate standards and then test and modify all methods to ensure they comply with these standards. This is an incredibly challenging task, particularly when it involves altering the API.

I believe that addressing these underlying issues is crucial for the long-term stability and usability of the masked array functionality. Without a thorough examination and a systematic approach to standardization, we may continue to face recurring problems that undermine the integrity of this component.

@fengluoqiuwu
Contributor Author

Any suggestions regarding the masked array section would be greatly appreciated. I am also seeking to collect all related bug reports that have been encountered. This month, I will be organizing the issues related to the ma package. If I conclude that API changes are necessary, I will write an email to the NumPy mailing list to explain the reasoning behind these changes after my review.

@fengluoqiuwu changed the title from "DISCUSS: About default filling value in masked array." to "DISCUSS: About issue of masked array" on Oct 20, 2024

@fancidev

fancidev commented Oct 21, 2024

I think that the fundamental issue with masked arrays is that np.ma.MaskedArray is a subclass of np.ndarray, but it alters the behavior of the base class. Consequently, any routine that works with np.ndarray will work with MaskedArray inputs formally, but may potentially break its semantics. To correctly maintain the semantics, every routine would need to test and specialize for the input type. This seems simply impossible.

Therefore, I suspect it is fundamentally impossible to ensure that MaskedArray works predictably in general.

@fengluoqiuwu
Contributor Author

I think that the fundamental issue with masked arrays is that np.ma.MaskedArray is a subclass of np.ndarray, but it alters the behavior of the base class. Consequently, any routine that works with np.ndarray will work with MaskedArray inputs formally, but may potentially break its semantics. To correctly maintain the semantics, every routine would need to test and specialize for the input type. This seems simply impossible.
Therefore, I suspect it is fundamentally impossible to ensure that MaskedArray works predictably in general.

I understand that it's not feasible to test all NumPy methods comprehensively for every scenario. However, I believe that, within the current architecture of masked arrays, it would still be valuable to focus on testing certain key areas. For instance, __array_wrap__ is a critical function since many of NumPy's general methods (such as mathematical functions) call it. Ensuring that it behaves correctly would be a good starting point.

Additionally, I think it’s important to verify that the functions in np.ma behave as expected. This makes sense because np.ma is specifically designed for masked arrays. And perhaps making sure that when a masked array without a mask is passed, the results align with those from NumPy's general functions.

Particularly, I believe that within the current masked array, we might want to focus on testing edge cases like np.uint8 (since it stores the fill value as np.int64(999999)) and np.float16 (where the default fill value may fall outside of the type's range). We should also look into testing methods that involve type conversions or affect fill_value. While I understand that exhaustive testing may not be practical, perhaps maintaining a minimal set of core tests would be sufficient. More comprehensive testing could be reserved for a specific file that isn’t run by default, allowing developers to use it when modifying functions.

That said, I do think it's important to ensure that most methods work reliably under most conditions, and that at least in a certain version, we have a set of tests that have passed.
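
The kind of minimal core test described above could look roughly like the following sketch. It compares a handful of reductions on small dtypes between a plain ndarray and a masked array with no masked elements. The names SMALL_DTYPES, REDUCTIONS and unmasked_mismatches are hypothetical, and the choice of reductions is only an example, not a proposed test list.

```python
import numpy as np

# Hypothetical selection of small dtypes and reductions for illustration.
SMALL_DTYPES = (np.int8, np.uint8, np.float16)
REDUCTIONS = ("sum", "min", "max", "mean")


def unmasked_mismatches(values=(1, 2, 3, 4)):
    """Return (dtype, function) pairs where np.ma and np disagree.

    The masked array is built without a mask, so both code paths
    should produce the same numbers.
    """
    mismatches = []
    for dtype in SMALL_DTYPES:
        plain = np.array(values, dtype=dtype)
        masked = np.ma.array(plain)
        for name in REDUCTIONS:
            expected = getattr(np, name)(plain)
            actual = getattr(np.ma, name)(masked)
            if not np.array_equal(np.asarray(actual), np.asarray(expected)):
                mismatches.append((np.dtype(dtype).name, name))
    return mismatches


print(unmasked_mismatches())
```

An empty list means the two code paths agree for these inputs; any entry points at a function and dtype worth a closer look.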

@jorenham
Member

jorenham commented Nov 4, 2024

To throw some oil on the fire; numpy.ma is also almost completely untyped, see #26404 (comment).

@mdhaber
Contributor

mdhaber commented Jan 28, 2025

You may be interested in marray, which adds mask support to all features of NumPy (and any other backend) that conform to the Python Array API Standard. I'm hoping that it will obviate the need for an overhaul/rewrite of NumPy masked arrays.

If there are other features of NumPy (beyond the Standard) that are needed, consider suggesting that they be added to the standard (if they are not efficient or easy to write in terms of Standard functionality). If that doesn't work, we could consider library-specific extensions.

@fengluoqiuwu
Contributor Author

You may be interested in marray, which adds mask support to all features of NumPy (and any other backend) that conform to the Python Array API Standard. I'm hoping that it will obviate the need for an overhaul/rewrite of NumPy masked arrays.

Thank you very much! marray is indeed very useful and can largely replace NumPy masked arrays. However, in linear algebra calculations, especially for large-scale computations, its acceleration seems insufficient. As demonstrated in the following code:

##########input##########
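import time

import numpy as np
import marray

mnp = marray._get_namespace(np)

# Two random 1000 x 1000 operands
A = np.random.rand(1000000).reshape((1000, 1000))
B = np.random.rand(1000000).reshape((1000, 1000))
# marray arrays
A0 = mnp.asarray(A)
B0 = mnp.asarray(B)
# NumPy masked arrays built with np.ma.asarray
A1 = np.ma.asarray(A)
B1 = np.ma.asarray(B)
# NumPy masked arrays built with the MaskedArray constructor
A2 = np.ma.MaskedArray(A)
B2 = np.ma.MaskedArray(B)

def caltime(A, B):
    # Time a single matrix multiplication
    tic = time.time()
    A @ B
    toc = time.time()
    print("time:", toc - tic)

caltime(A, B)    # plain ndarray
caltime(A0, B0)  # marray
caltime(A1, B1)  # np.ma.asarray
caltime(A2, B2)  # np.ma.MaskedArray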
##########output##########
time: 0.008999347686767578
time: 1.6152479648590088 <--- use marray
time: 0.00899648666381836
time: 0.008004426956176758

It seems like it is not utilizing acceleration. Moreover, the time difference appears to grow with the increase in dimensionality.

@mdhaber
Contributor

mdhaber commented Jan 28, 2025

Interesting! Well that's clearly something to fix. Fortunately it's just code, so there's always a way. Feel free to submit a bug report (no need); there's no reason it should be much slower than np.ma here or in general. I recently opened an issue to do some benchmarking so we can make things faster, but it's new, and I only paid attention to correctness initially.

If I were to guess, the slowdown could be due to use of @ on an integer array as an easy way to determine the correct mask of the result. That should be easy to improve. Yup, that's it. I will improve it today!
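
The general idea of deriving a result mask with a matrix product can be sketched as follows. This is an illustration only and not marray's code: the function names are hypothetical, and it assumes the convention that an output element is masked whenever any contributing input element is masked. Counting the valid products with a floating-point matmul instead of an integer one is one possible way to reach an accelerated code path, at the cost of being exact only while the inner dimension stays below 2**24 for float32.

```python
import numpy as np


def matmul_mask(a_mask, b_mask, dtype=np.float32):
    """Mask of `a @ b` for 2-D operands, computed with one matmul.

    Counts, for each output element, how many of the inner products use
    two valid inputs. The element is unmasked only if all of them do.
    Hypothetical helper, assuming the "any masked input masks the
    output" convention.
    """
    valid_a = (~a_mask).astype(dtype)
    valid_b = (~b_mask).astype(dtype)
    inner = a_mask.shape[-1]
    return (valid_a @ valid_b) < inner


def matmul_mask_reference(a_mask, b_mask):
    """Same result via explicit broadcasting, for checking only."""
    return (a_mask[:, :, None] | b_mask[None, :, :]).any(axis=1)


a_mask = np.array([[True, False, False],
                   [False, False, False]])
b_mask = np.array([[False, False],
                   [False, False],
                   [False, True]])

print(matmul_mask(a_mask, b_mask))
print(matmul_mask(a_mask, b_mask, dtype=np.int64))
```

Both dtypes produce the same mask here; only the speed of the underlying matmul differs.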

BTW the syntax can be a bit more convenient:

import numpy as np
from marray import numpy as mnp

Explicit use of _get_namespace (which might need to become public again in some form) is only needed if the array namespace is not itself directly importable. When you can import x, where x is an array API compatible array library, you can do from marray import x.

@mdhaber
Contributor

mdhaber commented Jan 28, 2025

mdhaber/marray#86 should mostly address this. Still not quite as fast as NumPy masked array (which has performance surprisingly close to regular array), but it was what I could do before heading to work!

@mdhaber
Contributor

mdhaber commented Feb 1, 2025

However, in linear algebra calculations, especially for large-scale computations, its acceleration seems insufficient.

I was wondering how even after mdhaber/marray#86, NumPy's masked array @ operation was so much faster than marray. I mean, look at your results above - it was as fast or faster than regular array @ without the mask! How could this be?

Simple: ignore the mask.

import numpy as np
A = np.ma.masked_array(np.ones((2, 2)), ~np.eye(2, dtype=bool))
print(A)
# [[1.0 --]
#  [-- 1.0]]
print(A @ A)
# [[2.0 --]
#  [-- 2.0]]

It seems to just @ the unmasked arrays and uses the logical_or of the masks as the result mask, so I take it that it's treating @ as if it were an elementwise operation. Same with np.matmul. (gh-14992)

Using np.ma.dot, which does the right thing, closes the performance gap.
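
For comparison, the same array can be passed to np.ma.dot, which takes the mask into account. The sketch below uses the documented strict parameter: with the default strict=False masked values contribute zero to the sums, and with strict=True masks are propagated to every result element that depends on a masked input. The variable names loose and strict are just labels for this example.

```python
import numpy as np

# Same array as above: ones on the diagonal, everything else masked.
A = np.ma.masked_array(np.ones((2, 2)), mask=~np.eye(2, dtype=bool))

# Default: masked entries are treated as zero when forming the sums.
loose = np.ma.dot(A, A)
# strict=True: any masked input masks the dependent output element.
strict = np.ma.dot(A, A, strict=True)

print(loose)
print(strict)
```

With masked entries treated as zero the diagonal stays at 1.0 rather than the 2.0 produced by @, and in strict mode every element is masked because each row and column of A contains a masked value.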
