BUG: Masked division considers large float64 values as inf #22347
Comments
The actual issue stems from the division internal to the Division uses the following snippet:

    class _DomainSafeDivide:
        """
        Define a domain for safe division.
        """

        def __init__(self, tolerance=None):
            self.tolerance = tolerance

        def __call__(self, a, b):
            # Delay the selection of the tolerance to here in order to reduce numpy
            # import times. The calculation of these parameters is a substantial
            # component of numpy's import time.
            if self.tolerance is None:
                self.tolerance = np.finfo(float).tiny
            # don't call ma ufuncs from __array_wrap__ which would fail for scalars
            a, b = np.asarray(a), np.asarray(b)
            with np.errstate(invalid='ignore'):
                return umath.absolute(a) * self.tolerance >= umath.absolute(b)

Which indeed misfires here. I am a bit unsure what the logic is supposed to achieve to begin with though... This feels a bit like it may be another deeper issue of masked arrays? Why do we guard against returning infinities? If the user does not want them, they should maybe call a But even if we try to guard against it, maybe we should just do the cleanup for the user, if we really do that cleanup consistently? This also only guards for double-precision inputs! (I admit, I am not in the weeds of masked arrays, so maybe I am missing the point!)
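To make the misfire concrete, here is a minimal sketch that restates the check from the snippet as a standalone function. The helper name `domain_check` and the sample values are assumptions chosen for illustration, and `np.absolute` stands in for the internal `umath.absolute`; this is not the code path `numpy.ma` actually runs.

```python
import numpy as np


def domain_check(a, b, tolerance=None):
    """Hypothetical standalone copy of the check quoted above."""
    if tolerance is None:
        tolerance = np.finfo(float).tiny  # about 2.2e-308 for float64
    a, b = np.asarray(a), np.asarray(b)
    with np.errstate(invalid="ignore"):
        # True means "outside the safe domain", i.e. the entry would be masked.
        return np.absolute(a) * tolerance >= np.absolute(b)


# Illustrative values: a large but finite numerator, an ordinary division,
# and a genuine division by zero.
numerator = np.array([1e308, 1.0, 1.0])
denominator = np.array([2.0, 2.0, 0.0])

flagged = domain_check(numerator, denominator)
with np.errstate(divide="ignore"):
    quotient = numerator / denominator

print(flagged)                # which entries the check rejects
print(np.isfinite(quotient))  # which quotients are actually finite
```

The first entry is rejected by the check because `1e308 * tiny` is roughly 2.2, which is not smaller than the denominator 2.0, even though `1e308 / 2.0` is a perfectly finite float64.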
Thanks for the response! I figured it came from the core, but couldn't dig much further. I think you're right about this snippet. I think it's strange that it preemptively decides to handle this case without a warning to the user or making the behavior a flag (e.g., user says do safe division, vs user says I don't care about infinities I will deal with them myself). I'm also confused as to what the snippet is trying to achieve. Is the tolerance meant to normalize the numerator?
Well, I did the terrible thing and wandered into darkness. 673de27 seems to be the first commit introducing that we just always ignore any floating point errors. It makes this effectively meaningless. Now 3ddc421 reintroduces it, because it was still meaningful in
( To me, the If you remember the rules for "domain errors" fall into two categories:
Then the whole idea of calculating domains up-front seems a lot of hassle for little pay-off when the domain is not as easy as for sqrt with In any case, unless we dig deep e.g. with what @greglucas once started (or a whole new start), I am not sure there is much to be done. But I do think we can remove the domain handling from that specific path completely without changing the behavior at all.
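As a contrast to calculating a domain up-front, one can compute the result first and mask whatever comes out non-finite. The sketch below assumes that `np.ma.masked_invalid` is an acceptable stand-in for that after-the-fact cleanup; the sample values are illustrative, and this is not a description of what `numpy.ma` does internally.

```python
import numpy as np

# Illustrative values: a large but finite numerator, an ordinary division,
# and a genuine division by zero.
numerator = np.array([1e308, 1.0, 1.0])
denominator = np.array([2.0, 2.0, 0.0])

# Compute first, silencing the floating point warnings...
with np.errstate(divide="ignore", invalid="ignore"):
    raw = numerator / denominator

# ...then mask only the entries that really ended up as inf or nan.
result = np.ma.masked_invalid(raw)

print(result)
print(np.ma.getmaskarray(result))
```

With this ordering the large but finite quotient stays unmasked, and only the division by zero is masked.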
@mhvk, @bsipocz do you have any input on how masked arrays should ideally behave and whether we should move forward here? Brigitta had also found the old PR gh-20551 and issue gh-20506, where @math2001 addresses this issue (partially, I think)! The difference is that:
If we can decide that (I am not sure which way is better, especially considering the historic baggage, since we don't want to break users: In that regard I am OK if the answer is to not touch things now and hope for a replacement.)
@seberg - this is partially why I just went with my own rewrite, though the main reason was to make masked quantity possible (https://docs.astropy.org/en/latest/utils/masked/index.html#utils-masked-vs-numpy-maskedarray) A third option is not to try to adjust
@mhvk ok, I had lost track of the fact that this is actually in astropy! I do wonder a bit if it wouldn't make sense as a stand-alone package...
Indeed, it might make sense as an independent package, but I followed the easy route -- easy especially when one is still solving bugs... The same holds for
Right, pushing things where it is easiest to push them seems like the right approach. Decoupling from astropy probably only becomes interesting if others want to take over a reasonable amount of the maintenance burden. For now I am happy to (again?) know where to point for the possible future :).
Believe it or not, this makes There is more fun there. Since I had fixed a bug, that actually only worked for arrays, because the value-based legacy path gets the promotion wrong (it drops the timedelta unit). Sorry, ranting... This is a mess, built on top of a mess :/ and it is muddying the waters trying to clear up the lower mess a bit.
Describe the issue:
I am testing my data pipeline that uses SimpleImputer from sklearn. Their code uses masked arrays to compute the mean, for example. However, when the arrays have large float64 values (and therefore their means are also large, but not infinite or larger than the max float64), np.ma.mean produces the value (which is not nan or inf) but considers it masked.
My best guess is that np.ma.mean is somehow going by float32 even though the array is type float64, under which those values would be considered infinite.
Reproduce the code example:
Error message:
There is no immediate error, the behavior is not what I expect and I end up with fewer columns than I expect downstream (shapes do not match).
NumPy/Python version information:
1.22.4
3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) \n[GCC 10.3.0]
Context for the issue:
In the data I work with it may not be unreasonable to see large values and taking the mean should work as expected even when masking other nans and with large float64 values, unless I am missing something. I was unable to find anything documented to explain this behavior.