[MRG] Fix #2372: StratifiedKFold less impact on the original order of samples. #2450
model = SVC(C=10, gamma=0.005)
cv = cval.StratifiedKFold(y, 5)
assert cval.cross_val_score(model, X, y, cv=cv, n_jobs=-1).mean() < 0.91
I would rather not use n_jobs=-1 when not explicitly writing tests for parallel computing: it can currently fail on some platforms (for instance, if numpy is built against openblas), and we would have to protect the tests against such failures.
Thanks @dnouri for tackling this. +1 for merging once my comments are addressed.
I'm using a smaller number of examples from the digits dataset because that cuts test execution time for me from 17s to 4s and still yields very similar results.
It appears that the sorted() call is unnecessary here.
Thanks for addressing my comments, @dnouri! This looks good to me. +1 for merge on my side. Any other reviews, @agramfort @GaelVaroquaux @larsmans @mblondel @glouppe @arjoly? (I intentionally didn't ping Andy, to let him focus on his thesis :) @dnouri could you please add an entry to the
+1 for merge
Updated whats_new.rst. Thanks for reviewing!
Merged by rebase. Thanks @dnouri!
Actually, I reverted the merge because it caused a test failure under Python 3 (which could probably have been fixed). More importantly, the strategy used here is no longer real k-fold cross-validation, as some samples may never appear in the test set.
[1 4 6] [0 2 3 5]
[0 2 3 5] [1 4 6]
[2 4 5 6] [0 1 3]
[0 1 3 5] [2 4 6]
This is a regression: sample #5 is never part of the test set. I will try to come up with another way to do this.
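To make the regression concrete, here is a minimal sketch of a coverage check. It assumes that the last two rows of the output above are the new folds and that each row lists train indices followed by test indices; the helper names `count_test_appearances` and `never_tested` are hypothetical and not part of scikit-learn.

```python
from collections import Counter


def count_test_appearances(folds, n_samples):
    """Return, for each sample index, how many test sets it appears in."""
    counts = Counter()
    for _train, test in folds:
        counts.update(test)
    # Counter returns 0 for indices that were never seen.
    return [counts[i] for i in range(n_samples)]


def never_tested(folds, n_samples):
    """Return the sample indices that appear in no test set."""
    appearances = count_test_appearances(folds, n_samples)
    return [i for i, c in enumerate(appearances) if c == 0]


# Assumption: (train, test) pairs copied from the last two rows above.
folds = [
    ([2, 4, 5, 6], [0, 1, 3]),
    ([0, 1, 3, 5], [2, 4, 6]),
]

print(never_tested(folds, 7))  # [5]
```

In real k-fold cross-validation every sample should appear in exactly one test set, so `count_test_appearances` should return a list of ones.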
I came up with a working implementation in #2463. I also reused the updated tests from this PR and added more checks. Please let's continue the review over there.
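For readers following the discussion, the sketch below shows one possible way to stratify while keeping each class in its original order and placing every sample in exactly one test fold. It is an illustration under my own assumptions, not the code from #2463; `stratified_test_folds` and `iter_folds` are hypothetical names.

```python
from collections import defaultdict


def stratified_test_folds(y, n_folds):
    """Assign each sample to exactly one test fold, class by class.

    Within each class, samples keep their original order and are cut
    into n_folds contiguous chunks of nearly equal size.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    test_fold = [None] * len(y)
    for indices in by_class.values():
        base, extra = divmod(len(indices), n_folds)
        start = 0
        for fold in range(n_folds):
            # The first `extra` folds take one additional sample.
            size = base + (1 if fold < extra else 0)
            for idx in indices[start:start + size]:
                test_fold[idx] = fold
            start += size
    return test_fold


def iter_folds(y, n_folds):
    """Yield (train, test) index lists for each fold."""
    test_fold = stratified_test_folds(y, n_folds)
    for fold in range(n_folds):
        train = [i for i, f in enumerate(test_fold) if f != fold]
        test = [i for i, f in enumerate(test_fold) if f == fold]
        yield train, test


y = [0, 0, 0, 1, 1, 1, 1]
for train, test in iter_folds(y, 2):
    print(train, test)
```

Because every index receives exactly one fold label, the test sets partition the samples, which is the property the reverted change lost.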
See #2372 for motivation.