

Bug in metrics.roc_auc_score #3864



Closed
madclam opened this issue Nov 19, 2014 · 18 comments · Fixed by #7353

Comments

@madclam

madclam commented Nov 19, 2014

from sklearn import metrics

pred = [1e-10, 0, 0]
sol = [1, 0, 0]
metrics.roc_auc_score(sol, pred)  # 0.5, wrong, 1 is correct

pred = [1, 0, 0]
sol = [1, 0, 0]
metrics.roc_auc_score(sol, pred)  # 1, correct

@ssaeger

ssaeger commented Nov 19, 2014

This is a result of the following code:

...
# We need to use isclose to avoid spurious repeated thresholds
# stemming from floating point roundoff errors.
distinct_value_indices = np.where(np.logical_not(isclose(np.diff(y_score), 0)))[0]

The numpy function isclose returns True here for 1e-10 and 0, so the difference between the two scores is treated as zero.
But I'm not sure what to do about this.
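With numpy's default tolerances (rtol=1e-05, atol=1e-08), that check collapses all the thresholds in the example from the original report; a quick demonstration:

import numpy as np

y_score = np.array([1e-10, 0, 0])

# A difference of 1e-10 is within the default atol=1e-08, so it is
# treated as zero...
print(np.isclose(np.diff(y_score), 0))   # [ True  True]

# ...and no distinct thresholds are found, which degrades the AUC to 0.5.
print(np.where(np.logical_not(np.isclose(np.diff(y_score), 0)))[0])   # []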

@hannes-brt

I have been affected by the same bug: I noticed that I got very different auROC values on the same data on two machines running two different versions of sklearn.

On my dataset, the auROC score changes from 0.9821 (old code) to 0.9764 (new code). I am making my data available at https://www.dropbox.com/s/7nbhw9nhyavxcdm/roc-test-data.pkl.gz?dl=0 so you can verify the bug. The file unpickles to a tuple of (y_true, y_predicted).
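A minimal sketch of how one could check this locally, assuming the file has been downloaded as roc-test-data.pkl.gz and unpickles to (y_true, y_predicted) as described:

import gzip
import pickle

from sklearn.metrics import roc_auc_score

# Load the (y_true, y_predicted) tuple from the gzipped pickle linked above.
with gzip.open("roc-test-data.pkl.gz", "rb") as f:
    y_true, y_predicted = pickle.load(f)

# Reported values: 0.9821 with the old code, 0.9764 with the new code.
print(roc_auc_score(y_true, y_predicted))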

@jnothman
Member

jnothman commented Dec 1, 2014

This behaviour is due to #3268. It is problematic to use differences within floating-point error as the basis of ranking; perhaps we should require a higher resolution, or allow the user to set the tolerance as a parameter (though the chances that someone will do that of their own accord seem slim).

@GaelVaroquaux
Member

It seems to me that we could decrease the tol down to 1e-12 (but maybe there is a specific reason not to). Beyond that, I consider numbers as being not trustworthy.
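For what it's worth, a tol of 1e-12 would already fix the example from the original report (a quick check with numpy's isclose and its atol keyword):

import numpy as np

diffs = np.diff([1e-10, 0, 0])

print(np.isclose(diffs, 0))              # [ True  True] -> thresholds collapsed
print(np.isclose(diffs, 0, atol=1e-12))  # [False  True] -> 1e-10 stays distinct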

@jnothman
Member

jnothman commented Dec 1, 2014

I consider numbers as being not trustworthy.

;) I'm inclined to agree, particularly under rank transformations.

@jnothman jnothman mentioned this issue Dec 10, 2014
@jnothman
Member

This has been reported again at #3950. Given that we may be dealing with probabilities, which are frequently very small (and in particular, our scorer implementation does not exploit predict_log_proba currently), we should probably remove or lower the tolerance by default, but perhaps make it configurable.

It is frustrating that such a widely-used metric is so brittle to numeric instability.
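To illustrate why very small probabilities interact badly with an absolute tolerance: two probabilities that differ by two orders of magnitude are still "close" under numpy's defaults, while their logs are clearly distinct (a quick check, not scikit-learn code):

import numpy as np

p_big, p_small = 1e-10, 1e-12   # differ by two orders of magnitude

# On the linear scale the absolute difference (~9.9e-11) is below the
# default atol=1e-08, so the two scores would be merged.
print(np.isclose(p_big, p_small))                   # True

# On the log scale the difference is ~4.6, so they stay distinct.
print(np.isclose(np.log(p_big), np.log(p_small)))   # False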

@GaelVaroquaux
Member

(and in particular, our scorer implementation does not exploit predict_log_proba currently)

Maybe we should use predict_log_proba when possible.

we should probably remove the tolerance by default, but perhaps make it configurable.

I am worried that if we do this, people will then complain about unstable results.

I am certainly in favor of decreasing the tol.

@jnothman
Member

Maybe we should use predict_log_proba when possible.

@mblondel, one for the scorer wishlist?

I am worried that if we do this, people will then complain about unstable results.

Someone did, in #3268, but that was long after the metric had been implemented and made available.

@mblondel
Member

@mblondel, one for the scorer wishlist?

Sounds reasonable!

@amueller
Member

amueller commented Jan 9, 2015

I think there was a discussion elsewhere about always providing a decision_function. That could default to predict_log_proba, with a special case for the two-class setting, and then we would avoid the problem.
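A rough sketch of the fallback described here; the helper name and structure are mine, not an existing scikit-learn API:

def continuous_scores(estimator, X):
    # Prefer decision_function when the estimator provides one.
    if hasattr(estimator, "decision_function"):
        return estimator.decision_function(X)
    # Otherwise fall back to log-probabilities, which spread out small
    # probabilities instead of bunching them near zero.
    log_proba = estimator.predict_log_proba(X)
    if log_proba.shape[1] == 2:
        # Binary special case: a single column is enough for ranking.
        return log_proba[:, 1]
    return log_proba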

@jnothman
Member

Duplicate issues in #4864, #6688, perhaps #6711. It seems that reverting #3268 might be the best solution (as discussed at #6693). As @jblackburne states there:

Looking back, I could have solved my problem more easily by just rounding my y_score before passing it into roc_curve(). It wasn't really fair to ask sklearn to solve what was essentially a problem in my client code.

This is only somewhat true, given that the main source of data for the metrics is the output of our estimators, with no opportunity for tweaking, and that, as @GaelVaroquaux says above:

Beyond that [i.e. 1e-12] I consider numbers as being not trustworthy.
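For reference, the rounding workaround @jblackburne describes amounts to something like this on the caller's side (a minimal sketch; the data and the number of decimals are chosen only for illustration):

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([1, 0, 0, 1])
y_score = np.array([0.70000001, 0.7, 0.1, 0.9])

# Merge differences below the chosen precision explicitly in client code,
# instead of relying on a tolerance inside roc_curve.
fpr, tpr, thresholds = roc_curve(y_true, np.round(y_score, decimals=6))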

@nielsenmarkus11
Contributor

I noticed the same thing. I checked how the roc_curve function works, and it appears that this was the issue for me: the function was returning fpr and tpr values much greater than 1, which doesn't make much sense.

@amueller
Member

amueller commented Sep 6, 2016

@nielsenmarkus11 can you provide code to reproduce that? Which version of scikit-learn are you using? Can you try using master?

@amueller
Member

amueller commented Sep 6, 2016

@jnothman I'm not on top of all the auc issues. Which do you think we can fix for 0.18-rc or 0.18?

@nielsenmarkus11
Contributor

nielsenmarkus11 commented Sep 6, 2016

I'm using scikit-learn 0.17.1.

So it appears that the problem is created when I add the actual=actual.to_sparse() line of code. I'm using a sparse DataFrame on my laptop to save disk space.

# Generate data
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

pred = np.random.beta(2, 10, 20000)
pred = np.append(pred, np.zeros(5000))


# Generate label and features, then put the data into a sparse DataFrame
import pandas as pd

actual = pd.DataFrame([np.random.binomial(1, p) for p in pred], columns=['resp'])

for i in range(10):
    actual['PRE_' + str(i)] = [np.random.binomial(1, p * i / 10) for p in pred]
actual = actual.to_sparse()

# Fit the model
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs=-1, warm_start=True, oob_score=True,
                            n_estimators=101, random_state=123)
rf.fit(actual.filter(regex=('PRE+.')), actual['resp'])


# AUC above 1, and tpr/fpr above 1
roc_auc_score(actual['resp'], rf.oob_decision_function_[:, 1])
roc_curve(actual['resp'], rf.oob_decision_function_[:, 1])


# AUC and ROC curve with 'nan's
roc_auc_score(actual['resp'], rf.predict_proba(actual.filter(regex=('PRE+.')))[:, 1])
roc_curve(actual['resp'], rf.predict_proba(actual.filter(regex=('PRE+.')))[:, 1])


# Check the lengths of the two predicted probability vectors:
(len(rf.oob_decision_function_[:, 1]),
 len(rf.predict_proba(actual.filter(regex=('PRE+.')))[:, 1]))

@jnothman
Member

jnothman commented Sep 6, 2016

@amueller, I'm personally in favour of removing the tolerance for small differences. It was only added recently, and I think it was a mistake. Such differences are just part and parcel of rank-based metrics.
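Concretely, dropping the tolerance means going back to an exact comparison when selecting distinct thresholds, which would restore the expected result for the example in the original report (a sketch based on the snippet quoted above, not the actual patch):

import numpy as np

y_score = np.array([1e-10, 0, 0])   # already in decreasing order

# With the tolerance: every difference is "close" to zero, so no distinct
# thresholds survive and the curve degenerates (AUC 0.5).
with_tol = np.where(np.logical_not(np.isclose(np.diff(y_score), 0)))[0]

# Without the tolerance: the 1e-10 score stays distinct (AUC 1.0).
without_tol = np.where(np.diff(y_score) != 0)[0]

print(with_tol, without_tol)   # [] [0]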

@jnothman
Member

jnothman commented Sep 6, 2016

I've not yet looked at the new issue here.

@jnothman
Member

jnothman commented Sep 6, 2016

@nielsenmarkus11, your issue is unrelated. See #7352
