metrics.confusion_matrix far too slow for Boolean cases #15388
Comments
Thanks for the report! What data size are you using? We can make a specialized faster solution for booleans, but even better would be to improve the current generic one. I have quickly looked at the code and couldn't see obvious bottlenecks; the list comprehension is not necessarily it, though maybe we can do the same things with numpy functions. Profiling the code would be a good start. Would you be interested in investigating? |
I was using it with batches of about 4096, but calling it a great many times as part of a modeling algorithm for generic detection of Boolean formulas (a sort of variant of a decision tree classifier). 10,000,000 calls behave quite differently from 1,000 or 10,000, since the bottleneck grows or shrinks with the number of repeated calls. I noticed that whenever I Ctrl+C'ed, execution was always sitting in confusion_matrix (a poor man's bottleneck finder while debugging), and always in a list comprehension. Seriously empirical data :). Anyway, I swapped it out with the code mentioned, and it dramatically improved the performance. We could compare, but without any doubt the numpy-based code is faster.

Of course generic functions are bound to be slower than specifically optimized variants. But since the data type is inspected, and the function seems to be doing too much work for this simple case, which it could easily detect, I raised the issue. Sending proper Boolean values feels like a primary use case, so it seemed not worth overlooking. That said, after studying the code I do not actually see where the work is being done in the function; I should have paid more attention to where that Ctrl+C was landing.

A couple of questions: what is the typical easy and efficient approach to performance profiling code in Python, beyond simplistic line-by-line time measurements? Are there any tool suggestions? Also, would contributing changes to scikit-learn require a virtual environment so as not to interfere with the primary installation? |
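For what it's worth, the standard-library cProfile is one common answer to the profiling question; a minimal illustrative sketch (data sizes and repeat counts are arbitrary):

```python
import cProfile
import pstats

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.random.rand(4096) > 0.5
y_pred = np.random.rand(4096) > 0.5

# Profile many repeated calls, mimicking the batch workload described above
cProfile.run(
    "for _ in range(1000): confusion_matrix(y_true, y_pred)",
    "confmat.prof",
)
pstats.Stats("confmat.prof").sort_stats("cumulative").print_stats(10)
```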
Update: the 1-logical-AND-with-3-sums version has been edited into the post above, since it's even more efficient; the execution time difference is also now known exactly.
Output:
Removing 3 logical ANDs, 4 logical negations and a sum operation hardly makes a difference in numpy. 2718/46=59 times faster... As for doing large batches:
I see:
So apparently it's faster in small batches than in large batches, but still many times slower than native numpy vectorized operations, as list comprehensions have no way to match such speed. |
Also consider, if y_true and y_pred are Boolean: confusion = np.bincount(y_true * 2 + y_pred, minlength=4).reshape(2, 2) |
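A runnable sketch of that one-liner on toy data (illustrative):

```python
import numpy as np

y_true = np.array([True, True, False, False, True])
y_pred = np.array([True, False, False, True, True])

# True/False cast to 1/0, so each (y_true, y_pred) pair maps to a unique bucket:
# 0 = TN, 1 = FP, 2 = FN, 3 = TP
confusion = np.bincount(y_true * 2 + y_pred, minlength=4).reshape(2, 2)
print(confusion)  # [[TN, FP], [FN, TP]], matching sklearn's label order
```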
Using Ctrl+C with:
So, based on this rather silly performance testing (nevertheless legitimate, and highly useful in the past for dramatically reducing bottlenecks), the list comprehension is one of the key bottlenecks compared to numpy arrays; so is iterating through the labels when they are known to be booleans. |
Consider using the IPython %timeit magic to benchmark. |
Thanks for the advice, will look into IPython. In the meantime:
|
But note that our API entails overhead in each call, so we will never reach the efficiency of your solution in any case.
|
Note also that breaking in a list comprehension does not mean it is the slowest part, only that it might be the slowest *interruptible* process. Many numpy operations can't be interrupted.
|
Yes, I realize that a generic solution cannot be as efficient as a purpose-built one. But since the data type is detected (at least it appeared to be in the code), I think this particular case could be handled differently using your optimized code. Also a good point, so more detailed profiling might be needed; still, anything that is not vectorizable and requires iteration is the most likely candidate for optimization. I confirmed that Boolean AND and multiplication have equal speed in numpy (see the check below). I am quite sure all boolean and basic algebraic operations have processor intrinsics for fast execution, though I am not sure that is how numpy does it. So the number of vector operations is the key metric here. |
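A minimal IPython check of that AND-versus-multiplication claim (sizes are arbitrary):

```python
In [1]: import numpy as np
In [2]: a = np.random.rand(10_000_000) > 0.5
In [3]: b = np.random.rand(10_000_000) > 0.5
In [4]: %timeit a & b   # boolean AND
In [5]: %timeit a * b   # multiplication on booleans
```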
For the curious, this scenario derives from a binary classification problem: I want to classify a Boolean matrix against a Boolean vector. Here is where it gets interesting. Obviously decision trees come to mind immediately, but the goal is not to strictly classify the data, as the dataset is far too chaotic for that (even GridSearchCV with DecisionTreeClassifier scores at best 52% when testing the model, even though you can of course overfit the whole dataset to 100% easily without testing the model). Rather, the goal is to find Boolean combinations (AND/OR) of columns in the matrix which can predict the positives at more than some percentage, say 75%, and another set of combinations to predict the negatives at more than 75%. Everything else is considered a gray area. It amounts to 3-label classification, but not quite, as the labels would not be known in advance. As far as I know, scikit-learn has no model to accomplish such a task. I simply started ANDing together columns to increase the positive rate, then ORing those together to maximize the total true positives; hence I decided to use the confusion matrix to calculate the positive rate along the way. Probably there is a better way to do this, and it would be highly useful to know, since it is painfully slow. |
I probably should open another issue since it's fundamentally different. But using pure Boolean values is pretty limiting when dealing with very large data due to memory waste: np.array uses 1 byte per Boolean, which although better than 4 or 8 is still 8 times wasteful. It is easy enough to pack a Boolean np.array into bytes (np.uint8) and then count the set bits; a sketch of both steps follows.
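A sketch of both steps, assuming np.packbits plus a 256-entry bit-count lookup table (naming is mine):

```python
import numpy as np

bools = np.random.rand(4096) > 0.5

# Pack 8 booleans per byte: 4096 bools -> 512 uint8 values
packed = np.packbits(bools)

# Lookup table: number of set bits for every possible byte value
popcount = np.array([bin(v).count("1") for v in range(256)], dtype=np.uint8)

# Two vectorized operations: a table lookup and a sum
n_true = popcount[packed].sum()
assert n_true == bools.sum()
```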
Now I am supposing that the table lookup (the same translation-table mechanism used heavily in image processing) is properly vectorized; I would imagine it is. This would be 2 vector operations (a table lookup and a sum). So if scikit-learn would dare to add a whole data type using packed byte representations of Booleans instead (which might be a major change to implement), it would decrease memory by 8 times and speed up vector operations by 8 times, except where bit twiddling as mentioned is needed, where it would still tend to be faster depending on the specific operation. I can see no reason why this would not be highly desirable for the library, especially since large datasets are pretty typical. I will open a feature request, I suppose, to continue this discussion. |
I don't think you're going to find that kind of optimisation on our roadmap, sorry. Such a data representation is not supported in numpy, apart from anything else.
|
Yes, the primary problem would be endless indexing oddities. I suppose this first belongs over at numpy as a feature request before it moves here :-). Thanks for the advice. Half of the operations are probably trivial, like the multiplication and addition shown, and a few would require some real thinking. Upon further thought, if numpy added the datatype, perhaps no work at all would be required on this side; I had not thought of it this way (now at: numpy/numpy#14821). It would make these Python libraries as flexibly scalable as C, which would be impressive. |
@jnothman Apparently we can do even better (at least twice better)... in fact milliseconds are almost no longer an appropriate unit of timing; microseconds might be better:
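A sketch of what such a numba kernel could look like (a reconstruction, not necessarily the exact code timed here):

```python
import numpy as np
from numba import njit

@njit(cache=True)
def confusion_bool(y_true, y_pred):
    # One pass, one multiply-add per element, no temporary arrays
    out = np.zeros(4, dtype=np.int64)
    for i in range(y_true.shape[0]):
        out[y_true[i] * 2 + y_pred[i]] += 1
    return out.reshape((2, 2))

y_true = np.random.rand(4096) > 0.5
y_pred = np.random.rand(4096) > 0.5
confusion_bool(y_true, y_pred)  # first call compiles; later calls are fast
```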
I cannot get frozen arguments to work, though.
I suppose numba is not used in scikit-learn currently; is there any good reason not to go this route? E.g. platform support (since it compiles to native code through a complicated LLVM optimization pipeline), or the extra dependency (though it already depends on numpy, I think). I am doubting further optimization is possible: a single multiplication and addition is really hard to beat unless there is some other Boolean comparison trick. |
Yes, a cython solution may also be fast, but this is really not the bottleneck in most people's machine learning pipelines.
|
Just for fun, I provide the best solution I could find for packed-byte confusion matrices, even though I understand such a format is not currently properly supported by numpy, where such a change would need to start. The first approach is similar to bincount but uses vectorized bit twiddling to achieve it; the other approach uses 3 vectorized bit sums and a bitwise AND operation. I am thinking of trying it again with a 64k byte-by-byte lookup table to 2 bits + 2 bits + 2 bits + 2 bits, though the whole idea of reducing memory consumption starts to get defeated with such strategies; that lookup also requires bit twiddling, unless making it 256k for bytes instead of 2-bit groupings. Update: the best approach I have found so far uses simple byte-to-bit-count lookup tables requiring only 256*8 = 2k bytes, and it takes half the time of all the twiddling.
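A sketch in that spirit, using a single 256-entry bit-count table rather than the exact tables described (my reconstruction of the "3 bit sums plus a bitwise AND" approach):

```python
import numpy as np

# Number of set bits for every possible byte value
POPCOUNT = np.array([bin(v).count("1") for v in range(256)], dtype=np.int64)

def confusion_packed(t_packed, p_packed, n):
    """Confusion matrix from np.packbits-packed boolean vectors of length n."""
    tp = POPCOUNT[t_packed & p_packed].sum()  # one bitwise AND + one bit sum
    p = POPCOUNT[t_packed].sum()              # actual positives
    pp = POPCOUNT[p_packed].sum()             # predicted positives
    # Zero padding bits added by packbits count toward none of tp, p, pp
    return np.array([[n - p - pp + tp, pp - tp],
                     [p - tp, tp]])

y_true = np.random.rand(4096) > 0.5
y_pred = np.random.rand(4096) > 0.5
confusion_packed(np.packbits(y_true), np.packbits(y_pred), y_true.size)
```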
So the excessive bit twiddling basically does not cancel the advantage of bincount plus a table lookup. Incidentally, the speed is comparable to the above result for the guvectorized bincount. The fastest known Pythonic approach for now, without direct use of C, is using bit lookup tables, unless I can find something faster :-). |
I am evaluating 3D ML-based segmentation predictions and was looking for a fast confusion matrix implementation. My observation was that sklearn.metrics.confusion_matrix is quite slow; in fact, it is slower than loading the data and doing inference with a UNet. I did a comparison of different ways to compute the confusion matrix:
The results are:
The timing for the numba implementation can be optimized further (by half) if |
@jeremiedbb do you think we should close this? As you established in #28578, using an alternative implementation for binary cases does not seem to be the way to improve performance here. |
good point, closing as duplicate of #26808 |
Description
When using metrics.confusion_matrix with np.bool_ inputs (i.e. only True/False values), it is far faster not to use list comprehensions as the current code does. numpy has sum and Boolean logic functions that deal with this very efficiently and scale far better.
The code in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/_classification.py uses list comprehensions and never checks for np.bool_ types, which would allow it to skip all the excessive work. This is a very common and reasonable use case in practice. (Assuming normalization and no sample weights, though those can also be dealt with efficiently.)
Steps/Code to Reproduce
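A minimal reproduction consistent with the description above (batch size 4096, many repeated calls; illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.random.rand(4096) > 0.5
y_pred = np.random.rand(4096) > 0.5

# Called many times in an inner loop, as in the modeling algorithm described
for _ in range(1000):
    confusion_matrix(y_true, y_pred)
```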
Expected Results
Fast execution time.
e.g. substituting confusion_matrix with this conf_mat (not as efficient as possible, but easier to read, and still more efficient than the current library even with 4 sums, 4 logical ANDs and 4 logical NOTs):
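A reconstruction matching that operation count (the exact posted code may have differed):

```python
import numpy as np

def conf_mat(y_true, y_pred):
    # 4 sums, 4 logical ANDs, 4 logical NOTs
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    return np.array([[tn, fp], [fn, tp]])
```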
Or even faster, with 1 logical AND and 3 sums:
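Again a reconstruction, deriving the remaining cells by scalar arithmetic from 3 sums and 1 logical AND:

```python
import numpy as np

def conf_mat(y_true, y_pred):
    tp = np.sum(y_true & y_pred)  # the single logical AND
    p = np.sum(y_true)            # actual positives
    pp = np.sum(y_pred)           # predicted positives
    n = y_true.shape[0]
    return np.array([[n - p - pp + tp, pp - tp],
                     [p - tp, tp]])
```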
Actual Results
Slow execution time, around 60 times slower than efficient code. The np.bool_ case could be identified and efficient code applied; otherwise, at serious scale, the current code is too slow to be practically usable.
Versions
All - including 0.21