Thanks to visit codestin.com
Credit goes to github.com

Skip to content

metrics.confusion_matrix far too slow for Boolean cases #15388

Closed as not planned
@GregoryMorse

Description

@GregoryMorse

Description

When using metrics.confusion_matrix with np.bool_ cases (e.g. with only True/False values), it is far more fast to not use list comprehensions as the current code does. numpy has sum and Boolean logic functions to deal with this very efficiently to scale far better.
Code found in: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/_classification.py which uses list comprehensions and never checks for np.bool_ types trying to avoid all the excessive code. This is a very common and reasonable use case in practice. Assuming normalization and no sample weights though those also can be efficiently dealt with.

Steps/Code to Reproduce

import numpy as np

N = 4096
p = 0.5
a = np.random.choice(a=[False, True], size=(N), p=[p, 1-p])
b = np.random.choice(a=[False, True], size=(N), p=[p, 1-p])
from sklearn.metrics import confusion_matrix
for i in range(1024): confusion_matrix(a, b)

Expected Results

Fast execution time.
e.g. substituting confusion_matrix with this conf_mat (not efficient as possible but easier to read and more efficient than current library even with 4 sums, 4 logical ANDs and 4 logical NOTs:

        def conf_mat(x, y):
            return np.array([[np.sum(~x & ~y), np.sum(~x & y)], #true negatives, false positives
                             [np.sum(x & ~y), np.sum(x & y)]]) #false negatives, true positives

Or even faster with 1 logical AND with 3 sums:

        def conf_mat(x, y):
            truepos, totalpos, totaltrue = np.sum(x & y), np.sum(y), np.sum(x)
            totalfalse = len(x) - totaltrue
            return np.array([[totalfalse - falseneg, totalpos - truepos], #true negatives, false positives
                             [totaltrue - truepos, truepos]]) #false negatives, true positives

Actual Results

Slow execution time around 60 times slower than efficient code. The np.bool_ case could be identified and efficient code applied otherwise in serious cases of scale, the current code is too slow to be practically usable.

Versions

All - including 0.21

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions