Optimizing assert_all_finite check #11681
Comments
Because we have this check enabled by default, and it can be relatively slow (#11487 (comment)) in some cases, it would definitely be nice to make it faster. Ideally, it would be great to have such an improvement in numpy (with a possible temporary backport in scikit-learn); there is a related discussion in numpy/numpy#6909. Maybe they would be interested in such a contribution? Are you saying that the proposed implementation is faster than |
Generally it seems like a good idea to contribute this stuff back to NumPy. Not sure they would take this implementation, as it is customized for the case of searching the full array for the first invalid value and returning. Also, while they do have a little bit of Cython code, they might prefer C for something like this in a prominent code path. For small arrays (~1000 elements or fewer) it appears to be ~3.5x faster than the current scikit-learn implementation. As the array size gets larger, imagine other things like cache pressure, which this implementation currently does nothing about, become an issue, and then NumPy-based implementations take over. Maybe using NumPy iterators with buffering would improve the large array case. Generally open to suggestions here. This behaves more like |
@jakirkham : I would be +1 for inclusion of this in scikit-learn (and of course, encourage contributing upstream) |
A pull request here? |
Hey there @jakirkham, just wanted to check if you intended to open a PR for your proposed solution. |
Sorry for the slow reply @Micky774. Not atm, but am really happy to see you picking this up. Thanks for digging into this! 🙏 |
Now that #23197 is merged, should we open an issue upstream in numpy? |
It was already raised with NumPy. Please see issue (numpy/numpy#11622), which is cross-linked above. |
Thanks. I did not see it. |
Have done some work to come up with a Cython implementation of `_assert_all_finite` as the current check is rather slow. Have come up with this implementation. This simply uses the C `isfinite` (or `isinf` for `allow_nan=True`) function in a tight loop; returning `False` whenever `isfinite` returns `False` and returning `True` otherwise. As such it has a very small, constant memory overhead.

For smallish data (0-8000 element `float64` arrays), this is comparable to the naive NumPy implementation in time and faster than the current scikit-learn implementation. Should add that the Cython implementation uses considerably less memory than either implementation. For large sized arrays, scikit-learn's performance improves to ~2x faster than the NumPy or Cython implementation. Expect this is mostly due to how the caches are utilized. So this can be improved further.

Would be interested in getting some feedback about whether such a contribution would be interesting to scikit-learn. If so, can investigate further improvements to this implementation.
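
For readers who want to see the shape of the loop described above, here is a minimal pure-Python sketch. It is not the Cython implementation from this issue: the function name `all_finite` and its signature are assumptions made for illustration, and `math.isfinite` / `math.isinf` stand in for the C `isfinite` / `isinf` calls. It only shows the early-exit logic and the constant memory overhead, not the performance.

```python
import math
from typing import Iterable


def all_finite(values: Iterable[float], allow_nan: bool = False) -> bool:
    """Hypothetical helper: return False at the first invalid value.

    With allow_nan=False, both NaN and +/-inf count as invalid.
    With allow_nan=True, only +/-inf counts as invalid.
    """
    for v in values:
        if allow_nan:
            # NaN is tolerated here, so only infinities are rejected.
            if math.isinf(v):
                return False
        elif not math.isfinite(v):
            # isfinite is False for NaN and for +/-inf.
            return False
    # No invalid value was found; no temporary array was allocated.
    return True
```

For example, `all_finite([0.0, float("nan")])` stops at the second element and returns `False`, while the same input with `allow_nan=True` returns `True`. Because the loop returns as soon as it hits an invalid value, it never builds an intermediate boolean array, which is where the small, constant memory overhead comes from.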