

Bug in metrics.roc_auc_score #3864



Closed
madclam opened this issue Nov 19, 2014 · 18 comments · Fixed by #7353

Comments

@madclam

madclam commented Nov 19, 2014

from sklearn import metrics

pred = [1e-10, 0, 0]
sol = [1, 0, 0]
metrics.roc_auc_score(sol, pred)  # 0.5, wrong, 1 is correct

pred = [1, 0, 0]
sol = [1, 0, 0]
metrics.roc_auc_score(sol, pred)  # 1, correct

@ssaeger

ssaeger commented Nov 19, 2014

This is a result of the following code:

...
# We need to use isclose to avoid spurious repeated thresholds
# stemming from floating point roundoff errors.
distinct_value_indices = np.where(np.logical_not(isclose(np.diff(y_score), 0)))[0]

The numpy function isclose returns True here for 1e-10 and 0, so the difference between the two scores is treated as zero.
But I'm not sure what to do about this.
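With numpy's default tolerances (rtol=1e-05, atol=1e-08), that check collapses all the thresholds in the example from the original report; a quick demonstration:

import numpy as np

y_score = np.array([1e-10, 0, 0])

# A difference of 1e-10 is within the default atol=1e-08, so it is
# treated as zero...
print(np.isclose(np.diff(y_score), 0))   # [ True  True]

# ...and no distinct thresholds are found, which degrades the AUC to 0.5.
print(np.where(np.logical_not(np.isclose(np.diff(y_score), 0)))[0])   # []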

@hannes-brt

I have been affected by the same bug: I noticed that I got very different auROC values on the same data on two machines running two different versions of sklearn.

On my dataset, the auROC score changes from 0.9821 (old code) to 0.9764 (new code). I am making my data available at https://www.dropbox.com/s/7nbhw9nhyavxcdm/roc-test-data.pkl.gz?dl=0 so you can verify the bug. The file unpickles to a tuple of (y_true, y_predicted).
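A minimal sketch of how one could check this locally, assuming the file has been downloaded as roc-test-data.pkl.gz and unpickles to (y_true, y_predicted) as described:

import gzip
import pickle

from sklearn.metrics import roc_auc_score

# Load the (y_true, y_predicted) tuple from the gzipped pickle linked above.
with gzip.open("roc-test-data.pkl.gz", "rb") as f:
    y_true, y_predicted = pickle.load(f)

# Reported values: 0.9821 with the old code, 0.9764 with the new code.
print(roc_auc_score(y_true, y_predicted))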

@jnothman
Member

jnothman commented Dec 1, 2014

This behaviour is due to #3268. It is problematic to use differences within floating-point error as the basis of ranking; perhaps we should require a higher resolution, or allow the user to set the tolerance as a parameter (though the chances that someone will do that of their own accord seem slim).

@GaelVaroquaux
Member

It seems to me that we could decrease the tol down to 1e-12 (but maybe there is a specific reason not to). Beyond that, I consider numbers as being not trustworthy.
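For what it's worth, a tol of 1e-12 would already fix the example from the original report (a quick check with numpy's isclose and its atol keyword):

import numpy as np

diffs = np.diff([1e-10, 0, 0])

print(np.isclose(diffs, 0))              # [ True  True] -> thresholds collapsed
print(np.isclose(diffs, 0, atol=1e-12))  # [False  True] -> 1e-10 stays distinct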

@jnothman
Member

jnothman commented Dec 1, 2014

I consider numbers as being not trustworthy.

;) I'm inclined to agree, particularly under rank transformations.

@jnothman jnothman mentioned this issue Dec 10, 2014
@jnothman
Member

This has been reported again at #3950. Given that we may be dealing with probabilities, which are frequently very small (and in particular, our scorer implementation does not exploit predict_log_proba currently), we should probably remove or lower the tolerance by default, but perhaps make it configurable.

It is frustrating that such a widely-used metric is so brittle to numeric instability.
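To illustrate why very small probabilities interact badly with an absolute tolerance: two probabilities that differ by two orders of magnitude are still "close" under numpy's defaults, while their logs are clearly distinct (a quick check, not scikit-learn code):

import numpy as np

p_big, p_small = 1e-10, 1e-12   # differ by two orders of magnitude

# On the linear scale the absolute difference (~9.9e-11) is below the
# default atol=1e-08, so the two scores would be merged.
print(np.isclose(p_big, p_small))                   # True

# On the log scale the difference is ~4.6, so they stay distinct.
print(np.isclose(np.log(p_big), np.log(p_small)))   # False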

@GaelVaroquaux
Member

(and in particular, our scorer implementation does not exploit predict_log_proba currently)

Maybe we should use predict_log_proba when possible.

we should probably remove the tolerance by default, but perhaps make it configurable.

I am worried that if we do this, people will then complain about unstable results.

I am certainly in favor of decreasing the tol.

@jnothman
Member

Maybe we should use predict_log_proba when possible.

@mblondel, one for the scorer wishlist?

I am worried that if we do this, people will then complain about unstable results.

Someone did, in #3268, but that was long after the metric had been implemented and made available.

@mblondel
Member

@mblondel, one for the scorer wishlist?

Sounds reasonable!

@amueller
Member

amueller commented Jan 9, 2015

I think there was a discussion elsewhere about always providing a decision_function. That could default to predict_log_proba, with a special case for the two-class setting, and then we would avoid the problem.
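A rough sketch of the fallback described here; the helper name and structure are mine, not an existing scikit-learn API:

def continuous_scores(estimator, X):
    # Prefer decision_function when the estimator provides one.
    if hasattr(estimator, "decision_function"):
        return estimator.decision_function(X)
    # Otherwise fall back to log-probabilities, which spread out small
    # probabilities instead of bunching them near zero.
    log_proba = estimator.predict_log_proba(X)
    if log_proba.shape[1] == 2:
        # Binary special case: a single column is enough for ranking.
        return log_proba[:, 1]
    return log_proba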

@jnothman
Member

Duplicate issues in #4864, #6688, perhaps #6711. It seems that reverting #3268 might be the best solution (as discussed at #6693). As @jblackburne states there:

Looking back, I could have solved my problem more easily by just rounding my y_score before passing it into roc_curve(). It wasn't really fair to ask sklearn to solve what was essentially a problem in my client code.

This is only somewhat true, given that the main source of data for the metrics is the output of our estimators, with no opportunity for tweaking, and that, as @GaelVaroquaux says above:

Beyond that [i.e. 1e-12] I consider numbers as being not trustworthy.
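For reference, the rounding workaround @jblackburne describes amounts to something like this on the caller's side (a minimal sketch; the data and the number of decimals are chosen only for illustration):

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([1, 0, 0, 1])
y_score = np.array([0.70000001, 0.7, 0.1, 0.9])

# Merge differences below the chosen precision explicitly in client code,
# instead of relying on a tolerance inside roc_curve.
fpr, tpr, thresholds = roc_curve(y_true, np.round(y_score, decimals=6))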

@nielsenmarkus11
Contributor

I noticed the same thing. I checked how the roc_curve function works, and it appears that this was the issue for me: the function was returning fpr and tpr values much greater than 1, which doesn't make much sense.

@amueller
Member

amueller commented Sep 6, 2016

@nielsenmarkus11 can you provide code to reproduce that? Which version of scikit-learn are you using? Can you try using master?

@amueller
Member

amueller commented Sep 6, 2016

@jnothman I'm not on top of all the auc issues. Which do you think we can fix for 0.18-rc or 0.18?

@nielsenmarkus11
Contributor

nielsenmarkus11 commented Sep 6, 2016

I'm using scikit-learn 0.17.1.

So it appears that the problem is created when I add the actual=actual.to_sparse() line of code. I'm using a sparse DataFrame on my laptop to save disk space.

# Generate data
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

pred = np.random.beta(2, 10, 20000)
pred = np.append(pred, np.zeros(5000))


# Generate label and features, then put the data into a sparse DataFrame
import pandas as pd

actual = pd.DataFrame([np.random.binomial(1, p) for p in pred], columns=['resp'])

for i in range(10):
    actual['PRE_' + str(i)] = [np.random.binomial(1, p * i / 10) for p in pred]
actual = actual.to_sparse()

# Fit the model
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs=-1, warm_start=True, oob_score=True,
                            n_estimators=101, random_state=123)
rf.fit(actual.filter(regex=('PRE+.')), actual['resp'])


# AUC above 1, and tpr/fpr above 1
roc_auc_score(actual['resp'], rf.oob_decision_function_[:, 1])
roc_curve(actual['resp'], rf.oob_decision_function_[:, 1])


# AUC and ROC curve with 'nan's
roc_auc_score(actual['resp'], rf.predict_proba(actual.filter(regex=('PRE+.')))[:, 1])
roc_curve(actual['resp'], rf.predict_proba(actual.filter(regex=('PRE+.')))[:, 1])


# Check the lengths of the two predicted probability vectors:
(len(rf.oob_decision_function_[:, 1]),
 len(rf.predict_proba(actual.filter(regex=('PRE+.')))[:, 1]))

@jnothman
Member

jnothman commented Sep 6, 2016

@amueller, I'm personally in favour of removing the tolerance for small differences. It was only added recently, and I think it was a mistake. Such differences are just part and parcel of rank-based metrics.
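Concretely, dropping the tolerance means going back to an exact comparison when selecting distinct thresholds, which would restore the expected result for the example in the original report (a sketch based on the snippet quoted above, not the actual patch):

import numpy as np

y_score = np.array([1e-10, 0, 0])   # already in decreasing order

# With the tolerance: every difference is "close" to zero, so no distinct
# thresholds survive and the curve degenerates (AUC 0.5).
with_tol = np.where(np.logical_not(np.isclose(np.diff(y_score), 0)))[0]

# Without the tolerance: the 1e-10 score stays distinct (AUC 1.0).
without_tol = np.where(np.diff(y_score) != 0)[0]

print(with_tol, without_tol)   # [] [0]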

@jnothman
Member

jnothman commented Sep 6, 2016

I've not yet looked at the new issue here.

@jnothman
Member

jnothman commented Sep 6, 2016

@nielsenmarkus11, your issue is unrelated. See #7352
