Issue in roc_auc_score which makes a wrong assumption about the positive class #17572

@glemaitre

Description

Currently, it is not possible to specify the positive class in plot_roc_curve, as already mentioned in #15573. The only workaround at the moment is to use roc_curve and roc_auc_score to get a similar result. However, there is something fishy with roc_auc_score:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

X, y = fetch_openml(
    name="blood-transfusion-service-center",
    as_frame=True, return_X_y=True,
)
# Make columns and classes more human-readable
X.columns = ["Recency", "Frequency", "Monetary", "Time"]
y = y.apply(
    lambda x: "donated" if x == "2" else "not donated"
).astype("category")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=True, random_state=0, test_size=0.5
)

pos_label = "donated"
classifier = LogisticRegression().fit(X_train, y_train)
# Column 0 of predict_proba corresponds to "donated", since classes_ is
# sorted alphabetically.
y_pred = classifier.predict_proba(X_test)

fpr, tpr, threshold = roc_curve(y_test, y_pred[:, 0], pos_label=pos_label)
# roc_auc_score has no pos_label parameter and has to infer the positive
# class on its own.
roc_auc = roc_auc_score(y_test, y_pred[:, 0])
plt.plot(fpr, tpr, ".-", label=f"ROC-AUC: {roc_auc:.2f}")
plt.xlabel(f"False Positive Rate\n (Positive label: {pos_label})")
plt.ylabel(f"True Positive Rate\n (Positive label: {pos_label})")
plt.legend()

[Figure: ROC curve for pos_label="donated" plotted from the roc_curve output, with the roc_auc_score value in the legend]
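To quantify the mismatch, one can compare the area under the curve computed from the roc_curve output with the value returned by roc_auc_score (a minimal sketch reusing the variables defined above; sklearn.metrics.auc just integrates the curve):

from sklearn.metrics import auc

# Integrate the curve returned by roc_curve (pos_label="donated") and
# compare with the value reported by roc_auc_score above.
auc_from_curve = auc(fpr, tpr)
print(f"auc(fpr, tpr) : {auc_from_curve:.3f}")
print(f"roc_auc_score : {roc_auc:.3f}")
# The two values disagree: roc_auc_score implicitly treats "not donated"
# (the second label in alphabetical order) as the positive class.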

So I quickly wrote a non-optimized piece of code that computes a ROC curve, to find out which of the two functions was lying:

import numpy as np

from sklearn.metrics import confusion_matrix

class_idx = 0
y_score = y_pred[:, class_idx]

pos_label = 'donated'
y_true = (y_test == pos_label).to_numpy()

decision_thresholds = np.unique(y_score)
decision_thresholds.sort()

# Sweep the decision thresholds and recompute TPR/TNR from the confusion
# matrix at each step.
all_tpr, all_tnr = [], []
for th in decision_thresholds:
    y_score_thresholded = y_score > th
    cm = confusion_matrix(y_true, y_score_thresholded)
    tpr = cm[1, 1] / (cm[1, 1] + cm[1, 0])
    tnr = cm[0, 0] / (cm[0, 0] + cm[0, 1])
    all_tpr.append(tpr)
    all_tnr.append(tnr)

all_tpr = np.array(all_tpr)
all_fpr = 1 - np.array(all_tnr)

plt.plot(all_fpr, all_tpr)
plt.xlabel("FPR")
plt.ylabel("TPR")

[Figure: ROC curve from the manual threshold sweep above]
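As a quick sanity check (a sketch reusing fpr and tpr from the first snippet), the two curves can be overlaid:

# Overlay sklearn's roc_curve output on the manual computation; the two
# curves coincide.
plt.figure()
plt.plot(fpr, tpr, label="sklearn roc_curve")
plt.plot(all_fpr, all_tpr, ".", label="manual computation")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend()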

So roc_curve is fine, but roc_auc_score does something wrong when it binarizes the target:

labels = np.unique(y_true)
y_true = label_binarize(y_true, classes=labels)[:, 0]

The labels are not encoded properly when they are strings and the positive class is not the second element returned by np.unique; in that case, y_true ends up encoded inversely.
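To illustrate, here is a minimal sketch (with a made-up y_example array) showing how label_binarize encodes the labels when np.unique puts the intended positive class first:

import numpy as np
from sklearn.preprocessing import label_binarize

y_example = np.array(["donated", "not donated", "donated"])
labels = np.unique(y_example)  # ['donated', 'not donated']
print(label_binarize(y_example, classes=labels)[:, 0])
# -> [0 1 0]: "donated" is mapped to 0 (the negative class), no matter
# which class the user actually considers positive.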

Since, in this case, we end up calling roc_curve inside _binary_roc_auc_score, I am wondering whether we should add a pos_label parameter to roc_auc_score and let roc_curve handle the label binarization instead of calling label_binarize beforehand.
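In the meantime, a possible workaround (just a sketch reusing the variables from the first snippet, not the proposed API change) is to binarize the target manually so that the intended positive class maps to 1 before calling roc_auc_score:

# Workaround sketch: encode the target so that pos_label maps to 1, then
# score the probability column of that same class.
y_test_encoded = (y_test == pos_label).astype(int)
roc_auc_fixed = roc_auc_score(y_test_encoded, y_pred[:, 0])
print(f"ROC-AUC with 'donated' as positive class: {roc_auc_fixed:.3f}")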

@thomasjpfan Any thoughts?
