[MRG] FIX #2372: non-shuffling StratifiedKFold implementation #2463
Conversation
# by authors although we don't have any information on the groups
# segment locations for this data. We can highlight this fact be
# computing k-fold cross-validation with and without shuffling: we
# observer that the shuffling case makes the IID assumption and is
observer -> observe
Looks pretty good at first pass. It's a subtle point, and I'm glad you've addressed it! I've not run the tests yet, but I will do that soon.
Addressed your comments, thanks @jakevdp!
edit my mess
Hm sorry, I messed something up while I was toying with the code.
No, indeed those are not correct at all. Let me check my code again.
Ah ok, I am feeling reassured; I thought I had a bunch of tests covering such toy examples.
Now, I get
This looks better.
Yes, the test fold for the first iteration is quite unbalanced, as 3 samples per class is not divisible by 2. I think this is an artifact of the fact that you have an unusually large n_classes / n_samples ratio. For instance, if you triple the size of the dataset while still keeping the number of samples per class not divisible by 2, you will have:

In other words, the max difference between the size of the largest test folds and the smallest test folds is
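To make the divisibility effect concrete, here is a minimal sketch using the modern `sklearn.model_selection` API (the PR itself predates it, and the exact fold compositions depend on the scikit-learn version):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Two classes with 3 samples each: 3 is not divisible by n_splits=2,
# so the per-class remainders have to land in some fold.
y = np.array([0, 0, 0, 1, 1, 1])
X = np.zeros((len(y), 1))  # features are irrelevant to the split

for i, (train, test) in enumerate(StratifiedKFold(n_splits=2).split(X, y)):
    print(f"fold {i}: test size = {len(test)}, "
          f"class counts = {np.bincount(y[test])}")
```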
Quick question I need to think more about: is there any situation where assuring a balanced draw will lead to a bias? For example, if you have a two-class balanced sample with N1=100 and N2=100, a random split will have ~50 +/- 7 samples from each class. Could it be that forcing the sample to always be 50/50 might itself bias the CV results?
What do you mean by "forcing the sample to always be 50/50"? Do you mean having exactly 50% of each class in each train / test fold, rather than the same in expectation (as when you use non-stratified `KFold`)? If so, I don't think this bias would be detectable in practice. The variance of the validation scores across CV folds might be a bit lower, but I don't think the mean validation score is affected in expectation, in general, for reasonable models.
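As a hypothetical illustration of that distinction (not from the PR; modern API): non-stratified `KFold` with shuffling balances classes only in expectation, while `StratifiedKFold` balances them exactly:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.RandomState(0)
# Balanced two-class labels, N1 = N2 = 100, as in the example above.
y = rng.permutation(np.repeat([0, 1], 100))
X = np.zeros((len(y), 1))  # features are irrelevant to the split

for cv in (KFold(n_splits=4, shuffle=True, random_state=0),
           StratifiedKFold(n_splits=4)):
    # Per-fold test class counts: roughly 25/25 for shuffled KFold,
    # exactly 25/25 for StratifiedKFold.
    counts = [np.bincount(y[test]) for _, test in cv.split(X, y)]
    print(type(cv).__name__, counts)
```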
Sounds reasonable. I just wanted to make sure we weren't overlooking anything!
I get one test failure on this branch:
Nevermind, I get the same failure on master, so it doesn't seem to be related to this PR.
This all looks good to me. I'm +1 for merge.
Looks good to me. +1
Alright, I will squash the commits and merge by rebase then.
…nd updated tests
This is a refactoring of @dnouri's fix for issue #2372 in PR #2450. The goal is to make the `StratifiedKFold` CV scheme not underestimate overfitting too much on datasets with strong dependencies between samples. This is important, as `StratifiedKFold` is used by default by `cross_val_score` and `GridSearchCV` for classification problems.

The current implementation of `StratifiedKFold` in master shuffles the data before computing the splits, which hides potential violations of the IID assumption by the dataset. This is highlighted in a test on the digits dataset, which has strong dependencies between samples (co-authorship of consecutive samples).
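As an illustrative sketch of that digits comparison, written against the modern `sklearn.model_selection` API rather than the `sklearn.cross_validation` module this PR targets (the classifier choice and exact scores are assumptions for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
clf = SVC(gamma=0.001)

for shuffle in (False, True):
    cv = StratifiedKFold(n_splits=5, shuffle=shuffle,
                         random_state=0 if shuffle else None)
    scores = cross_val_score(clf, X, y, cv=cv)
    # Without shuffling, consecutive (correlated) samples tend to stay in
    # the same fold, so the estimate is less optimistic when the IID
    # assumption is violated.
    print(f"shuffle={shuffle}: mean accuracy = {scores.mean():.3f}")
```

With shuffling, samples from the same writer can land in both the training and test folds, which typically inflates the cross-validation score.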