Optimizing assert_all_finite check #11681

Closed
jakirkham opened this issue Jul 25, 2018 · 9 comments · Fixed by #23197

Comments

@jakirkham
Contributor

Have done some work to come up with a Cython implementation of _assert_all_finite as the current check is rather slow. Have come up with this implementation. This simply uses the C isfinite (or isinf for allow_nan=True) function in a tight loop; returning False whenever isfinite returns False and returning True otherwise. As such it has a very small, constant memory overhead.
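As a rough illustration of the loop described above, here is a pure-Python sketch of the early-exit logic. It is not the Cython implementation referenced in this issue: the function name `all_finite_scan` and its signature are made up for illustration, and `math.isfinite` / `math.isinf` stand in for the C functions.

```python
import math
from typing import Iterable


def all_finite_scan(values: Iterable[float], allow_nan: bool = False) -> bool:
    """Return False at the first unacceptable value, True otherwise.

    Hypothetical helper: mirrors the described loop, not scikit-learn's code.
    """
    for v in values:
        if allow_nan:
            # NaN is tolerated, so only +/-inf is rejected.
            if math.isinf(v):
                return False
        elif not math.isfinite(v):
            # Rejects NaN as well as +/-inf.
            return False
    return True
```

Only the loop variable is kept between iterations, which is where the small, constant memory overhead comes from.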

For smallish data (0-8000 element float64 arrays), this is comparable to the naive NumPy implementation in time and faster than the current scikit-learn implementation. Should add that the Cython implementation uses considerably less memory than either implementation. For large arrays, scikit-learn's performance improves to ~2x faster than the NumPy or Cython implementations. Expect this is mostly due to how the caches are utilized, so this can be improved further.

Would be interested in getting some feedback about whether such a contribution would be interesting to scikit-learn. If so, can investigate further improvements to this implementation.

@rth
Member

rth commented Jul 25, 2018

Because we have this check enabled by default, and it can be relatively slow (#11487 (comment)) in some cases, it would definitely be nice to make it faster.

Ideally, it would be great to have such an improvement in numpy (with a possible temporary backport in scikit-learn); there is a related discussion in numpy/numpy#6909. Maybe they would be interested in such a contribution?

Are you saying that the proposed implementation is faster than np.isfinite(X.sum())? That would not take too much memory, would it? If I understand correctly, things become slower when this evaluates to False (i.e. there are non-finite elements) and we also have to run np.isinf(X).any()?
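For reference, the sum-based check being discussed can be sketched roughly as follows. This is an approximation of the idea, not scikit-learn's actual code, and `all_finite_via_sum` is a hypothetical name. The fast path reduces the array to a single scalar; the element-wise fallback, which allocates a temporary boolean array, runs only when the sum is not finite.

```python
import numpy as np


def all_finite_via_sum(X: np.ndarray, allow_nan: bool = False) -> bool:
    """Hypothetical sketch of a sum-first finiteness check."""
    # Fast path: a finite sum implies there is no NaN or inf in X.
    with np.errstate(over="ignore", invalid="ignore"):
        if np.isfinite(X.sum()):
            return True
    # Slow path: the sum can be non-finite because of NaN/inf elements,
    # or because finite values overflowed, so inspect the elements.
    if allow_nan:
        return not np.isinf(X).any()
    return bool(np.isfinite(X).all())
```

Note that a non-finite sum does not by itself prove a bad element exists, since large finite values can overflow, which is why the slow path re-checks element-wise.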

@jakirkham
Contributor Author

Generally it seems like a good idea to contribute this stuff back to NumPy. Not sure they would take this implementation as it is customized for the case of searching the full array for the first invalid value and returning. Also while they do have a little bit of Cython code, they might prefer C for something like this in a prominent code path.

For small arrays (~1000 elements or less) it appears to be ~3.5x faster than the current scikit-learn implementation. As the array size gets larger, imagine other things like cache pressure, which this implementation does nothing with currently, become an issue and then NumPy-based implementations take over. Maybe using NumPy iterators with buffering would improve the large array case. Generally open to suggestions here.
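One way the buffered-iterator idea might look is sketched below. This is only a guess at the approach, not something benchmarked or taken from the issue: `all_finite_buffered` and the default `buffersize` are assumptions for illustration. It uses `np.nditer` with the `external_loop` and `buffered` flags so that each step yields a 1-D chunk rather than a single element.

```python
import numpy as np


def all_finite_buffered(X: np.ndarray, allow_nan: bool = False,
                        buffersize: int = 8192) -> bool:
    """Hypothetical sketch: check finiteness one buffered chunk at a time."""
    it = np.nditer(
        X,
        flags=["external_loop", "buffered", "zerosize_ok"],
        op_flags=["readonly"],
        buffersize=buffersize,
    )
    for chunk in it:
        # Each chunk is a 1-D array, so the temporary mask stays small
        # and the scan can stop at the first chunk with a bad value.
        bad = np.isinf(chunk) if allow_nan else ~np.isfinite(chunk)
        if bad.any():
            return False
    return True
```

Whether this actually helps with cache pressure on large arrays would need to be measured; the chunk size in particular is a tuning knob, not a recommendation.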

This behaves more like l.index(...) in the sense that it scans the array for a match to an unacceptable value (inf and possibly nan). If it finds a match, it returns False ending the scan. If it doesn't find a match, it will scan the full array and then return True. So the memory usage is constant and actually quite small (e.g. the index, the current value, length, step size, some pointers, some flags, etc.).

[Figure: is_finite_avg_runtime]

@GaelVaroquaux
Member

@jakirkham : I would be +1 for inclusion of this in scikit-learn (and of course, encourage contributing upstream)

@jnothman
Member

A pull request here?

@Micky774
Contributor

Hey there @jakirkham, just wanted to check if you intended to open a PR for your proposed solution.

@jakirkham
Contributor Author

Sorry for the slow reply @Micky774. Not atm, but am really happy to see you picking this up. Thanks for digging into this! 🙏

@lorentzenchr
Member

> I would be +1 for inclusion of this in scikit-learn (and of course, encourage contributing upstream)

Now that #23197 is merged, should we open an issue upstream in numpy?

@jakirkham
Contributor Author

It was already raised with NumPy. Please see issue numpy/numpy#11622, which is cross-linked above.

@lorentzenchr
Member

Thanks. I did not see it.
