Optimizing assert_all_finite check #11681

jakirkham · 2018-07-25T15:41:35Z

Have done some work to come up with a Cython implementation of _assert_all_finite as the current check is rather slow. Have come up with this implementation. This simply uses the C isfinite (or isinf for allow_nan=True) function in a tight loop; returning False whenever isfinite returns False and returning True otherwise. As such it has a very small, constant memory overhead.
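
As a rough illustration of the stop-on-first loop described above, here is a pure-Python sketch. It is not the Cython code referred to in this comment; the function name `all_finite` is hypothetical, and `math.isfinite` / `math.isinf` stand in for the C functions.

```python
import math
from typing import Iterable


def all_finite(values: Iterable[float], allow_nan: bool = False) -> bool:
    """Return False at the first disallowed value, True otherwise.

    Hypothetical helper for illustration; not scikit-learn's API.
    """
    # With allow_nan=True only +/-inf is rejected (isinf);
    # otherwise NaN is rejected as well (not isfinite).
    if allow_nan:
        is_bad = math.isinf
    else:
        is_bad = lambda v: not math.isfinite(v)

    for v in values:
        if is_bad(v):
            return False  # stop scanning at the first offending element
    return True
```

The loop keeps only the current value and the iterator state, which is where the small, constant memory overhead comes from.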

For smallish data (0-8000 element float64 arrays), this is comparable to the naive NumPy implementation in time and faster than the current scikit-learn implementation. Should add that the Cython implementation uses considerably less memory than either implementation. For large arrays, scikit-learn's performance improves to ~2x faster than the NumPy or Cython implementations. Expect this is mostly due to how the caches are utilized, so this can be improved further.

Would be interested in getting some feedback about whether such a contribution would be interesting to scikit-learn. If so, can investigate further improvements to this implementation.

rth · 2018-07-25T18:56:48Z

Because we have this check enabled by default, and it can be relatively slow (#11487 (comment)) in some cases, it would definitely be nice to make it faster.

Ideally, it would be great to have such an improvement in numpy (with a possible temporary backport in scikit-learn). There is a related discussion in numpy/numpy#6909. Maybe they would be interested in such a contribution?

Are you saying that the proposed implementation is faster than np.isfinite(X.sum())? That wouldn't take too much memory, would it? If I understand correctly, things become slower when this evaluates to False (e.g. there are non-finite elements) and we also have to run np.isinf(X).any()?
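
The sum-based check discussed in this comment can be sketched roughly as follows. This is a simplified illustration built from the two expressions named above, not scikit-learn's actual code; the function name `all_finite_via_sum` is hypothetical.

```python
import numpy as np


def all_finite_via_sum(X: np.ndarray, allow_nan: bool = False) -> bool:
    """Hypothetical sketch of a sum-based finiteness check."""
    # Fast path: any inf or NaN element makes the sum non-finite, so a
    # finite sum means every element is finite. No array-sized temporary.
    with np.errstate(over="ignore", invalid="ignore"):
        if np.isfinite(X.sum()):
            return True

    # Slow path: the sum was inf/NaN (possibly only because it overflowed),
    # so look at the elements. This allocates a boolean array the size of X.
    if allow_nan:
        return not np.isinf(X).any()
    return bool(np.isfinite(X).all())
```

Note that the slow path also runs when all elements are finite but their sum overflows, which is one reason the fast path alone cannot decide the answer.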

jakirkham · 2018-07-26T04:37:37Z

Generally it seems like a good idea to contribute this stuff back to NumPy. Not sure they would take this implementation as it is customized for the case of searching the full array for the first invalid value and returning. Also while they do have a little bit of Cython code, they might prefer C for something like this in a prominent code path.

For small arrays (~1000 elements or less) it appears to be ~3.5x faster than the current scikit-learn implementation. As the array size gets larger, imagine other things like cache pressure, which this implementation does nothing with currently, become an issue and then NumPy-based implementations take over. Maybe using NumPy iterators with buffering would improve the large array case. Generally open to suggestions here.

This behaves more like l.index(...) in the sense that it scans the array for a match to an unacceptable value (inf and possibly nan). If it finds a match, it returns False ending the scan. If it doesn't find a match, it will scan the full array and then return True. So the memory usage is constant and actually quite small (e.g. the index, the current value, length, step size, some pointers, some flags, etc.).
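
One way to combine the stop-on-first scan with the buffering idea floated above is to check fixed-size chunks with NumPy and return at the first chunk containing a bad value. This is only a sketch under assumptions: the name `all_finite_chunked` and the default `chunk_size` are hypothetical, and no timing claims are made for it.

```python
import numpy as np


def all_finite_chunked(
    X: np.ndarray, allow_nan: bool = False, chunk_size: int = 8192
) -> bool:
    """Hypothetical chunked scan that stops at the first bad chunk."""
    # ravel returns a view for contiguous input and a copy otherwise.
    flat = np.ravel(X)

    # With allow_nan=True only +/-inf is rejected; otherwise NaN is too.
    if allow_nan:
        has_bad = lambda a: np.isinf(a).any()
    else:
        has_bad = lambda a: (~np.isfinite(a)).any()

    for start in range(0, flat.size, chunk_size):
        # The boolean temporary is at most chunk_size elements long.
        if has_bad(flat[start:start + chunk_size]):
            return False  # later chunks are never touched
    return True
```

For contiguous input the extra memory is bounded by `chunk_size` rather than by the array size, at the cost of scanning up to one full chunk past the first bad value.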

GaelVaroquaux · 2018-07-29T12:28:25Z

@jakirkham : I would be +1 for inclusion of this in scikit-learn (and of course, encourage contributing upstream)

jnothman · 2018-07-31T08:51:05Z

A pull request here?

Micky774 · 2022-04-16T23:08:33Z

Hey there @jakirkham, just wanted to check if you intended to open a PR for your proposed solution.

jakirkham · 2022-05-26T18:18:12Z

Sorry for the slow reply @Micky774. Not atm, but am really happy to see you picking this up. Thanks for digging into this! 🙏

lorentzenchr · 2022-07-07T10:45:10Z

@GaelVaroquaux wrote: "I would be +1 for inclusion of this in scikit-learn (and of course, encourage contributing upstream)"

Now that #23197 is merged, should we open an issue upstream in numpy?

jakirkham · 2022-07-07T15:14:45Z

It was already raised with NumPy. Please see issue (numpy/numpy#11622), which is cross-linked above.

lorentzenchr · 2022-07-07T15:55:30Z

Thanks. I did not see it.

jakirkham mentioned this issue Jul 26, 2018

ENH: Create the ability for fused operations (fused ufuncs or map_reduce) style numpy/numpy#11622

Open

jakirkham mentioned this issue Aug 21, 2018

Dictionary learning is slower with n_jobs > 1 #4769

Open

thomasjpfan added the Enhancement label Oct 27, 2019

cmarmo added module:utils cython help wanted labels Jan 30, 2022

Micky774 moved this to In Progress🏗 in Quansight's scikit-learn Project Board Apr 16, 2022

Micky774 added this to Quansight's scikit-learn Project Board Apr 16, 2022

Micky774 moved this from In Progress🏗 to Todo📬 in Quansight's scikit-learn Project Board Apr 16, 2022

Micky774 moved this from Todo📬 to In Progress🏗 in Quansight's scikit-learn Project Board Apr 22, 2022

This was referenced Apr 22, 2022

ENH Cythonize _assert_all_finite using reduction scheme #23196

Closed

ENH Cythonize _assert_all_finite using stop-on-first strategy #23197

Merged

Micky774 removed this from Quansight's scikit-learn Project Board Apr 30, 2022

ogrisel closed this as completed in #23197 Jul 6, 2022
