[MRG+2] Nonrepeating ROC thresholds #3268
Conversation
Added a unit test to ensure that there are no spurious repeating values in the thresholds returned by roc_curve because of machine precision, and a quick stab at a fix.

Without the fix, the thresholds array produced in the unit test starts with the values [1, 0.99, 0.98, 0.98, ...]. There should be only one 0.98. The undesirable behavior stems from machine epsilon differences between predicted probabilities that are mathematically equivalent.

I am not sure that my suggested fix is the best approach, so I welcome comments.
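For illustration (this snippet is not part of the patch), machine-epsilon differences of this kind are easy to reproduce: reordering a mathematically identical sum changes the last bit of the result.

```python
>>> 0.1 + 0.2 + 0.68   # mathematically the same sum...
0.9800000000000001
>>> 0.68 + 0.2 + 0.1   # ...but the rounding differs with the order
0.98
```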
```python
# How well can the classifier predict whether a digit is less than 5?
# This task contributes floating point roundoff errors to the probabilities
probas_pred = clf.fit(X[::2], y[::2]).predict_proba(X[1::2])
```
Could you please use explicit train test splits instead of slices with steps?

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
```

I find the code more readable with explicit variable names instead of slices.
Using train_test_split would necessitate adding a new import to the test file, which I'd rather not do. What if I were to switch to named slice objects? That would provide the same readability.
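A minimal sketch of that named-slice alternative, reusing `clf`, `X`, and `y` from the test above (the variable names are illustrative, not the PR's exact code):

```python
# Named slice objects: every other sample for training, the rest for testing.
train = slice(None, None, 2)   # equivalent to X[::2], y[::2]
test = slice(1, None, 2)       # equivalent to X[1::2]

probas_pred = clf.fit(X[train], y[train]).predict_proba(X[test])
```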
…lds test. Added named slice variables to same, for readability.
… ROC thresholds. Added a comment explaining why the treatment is needed.
```diff
-distinct_value_indices = np.where(np.diff(y_score))[0]
+# We need to use isclose to avoid spurious repeated thresholds
+# stemming from floating point roundoff errors.
+distinct_value_indices = np.where(np.logical_not(np.isclose(
+    np.diff(y_score), 0)))[0]
```
Hm, np.isclose was added in numpy 1.7, so this will not work with earlier versions. I will look into fixing this tomorrow.
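One possible way to support older numpy is a small fallback that mirrors np.isclose's documented default tolerances. This is only a sketch of the idea, not necessarily what the PR ended up doing:

```python
import numpy as np

try:
    from numpy import isclose
except ImportError:  # np.isclose only exists in numpy >= 1.7
    def isclose(a, b, rtol=1e-5, atol=1e-8):
        # Same comparison and default tolerances as np.isclose.
        a, b = np.asarray(a), np.asarray(b)
        return np.abs(a - b) <= (atol + rtol * np.abs(b))
```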
I have reproduced your test case. From my understanding, the issue comes from floating point rounding in …

Furthermore, the AUC doesn't change much even with rounding to 2 digits.
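A sketch of that kind of check, reconstructing the test setup from the snippets above (the estimator and its parameters are assumptions, not the PR's exact code):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = load_digits(return_X_y=True)
clf = RandomForestClassifier(random_state=0)
probas = clf.fit(X[::2], y[::2]).predict_proba(X[1::2])

# Probability that a digit is less than 5: summing the per-class
# probabilities is where the roundoff differences creep in.
y_score = probas[:, :5].sum(axis=1)
y_true = y[1::2] < 5

print(roc_auc_score(y_true, y_score))               # full precision
print(roc_auc_score(y_true, np.round(y_score, 2)))  # rounded to 2 digits
```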
Arnaud, there is no issue in … Tonight I'll push a changeset that fixes the double use of …
Still, the issue is not in roc_curve, which worked as expected (168 unique values in, 168 thresholds out), but in the summation. On a side note, if this really matters, you could do that sort of summation using a Kahan summation algorithm or, even better, the Kahan-Babuška summation algorithm.
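For reference, a compact sketch of the Kahan-Babuška (Neumaier) variant mentioned above; the helper name is mine and this is generic, not tied to scikit-learn's code:

```python
def kahan_babuska_sum(values):
    """Neumaier's improved Kahan summation (Kahan-Babuška variant)."""
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            comp += (total - t) + x   # low-order bits of x were lost
        else:
            comp += (x - t) + total   # low-order bits of total were lost
        total = t
    return total + comp
```

For example, `kahan_babuska_sum([1e16, 1.0, -1e16])` returns 1.0, where the naive running sum returns 0.0.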
I see your point: 168 unique values in, 168 thresholds out, indeed. But in my opinion, a better algorithm for selecting ROC curve thresholds would be to use only the … However, if the prevailing opinion is not to modify the threshold selection algorithm, I will withdraw this pull request.
I also find duplicated thresholds at the machine precision level curious, and they might cause artifacts in a matplotlib display of the curve. On the other hand, I don't think it's worth slowing down the computation of …
Are there any further concerns or obstacles to merging this one?
This looks good to me. +1 on my side. @arjoly, have you changed your mind or not? Others' opinions? @GaelVaroquaux @larsmans @jnothman @mblondel @agramfort?
I haven't changed my mind, but it's ok if you find other people supporting this pull request.
I am fine with this. +1
(Although I do wonder whether the test is robust to changes in the RandomForestClassifier implementation.)
Merging then. If the random forest implementation changes in a way that makes this test fail, we can always rewrite it to use fixed values. However, I think it's best to show a real case that stems from the library.
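If that rewrite ever becomes necessary, a fixed-value variant could look roughly like this (the labels, scores, and assertion are hypothetical, chosen only to illustrate the non-repeating property):

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([1, 1, 0, 0])
# Two scores that are mathematically equal but differ by machine epsilon:
y_score = np.array([0.1 + 0.2 + 0.68, 0.68 + 0.2 + 0.1, 0.6, 0.4])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
# The fix should collapse the two near-identical scores into one threshold.
assert not np.any(np.isclose(np.diff(thresholds), 0))
```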