StratifiedKFold should do its best to preserve the dataset dependency structure #2372
Comments
Also one should add a non-regression test based on the digits dataset as highlighted in the notebook. |
When you say that StratifiedKFold should preserve individual dependency, do you mean that each individual fold should contain the samples of each class in the same order as in the original set? If so, I think using a stable sort should solve the problem, that is, replacing the relevant line in cross_validation.py. The samples would be sorted by label, but the samples of each class would be taken in the same order as in the original set. Or do you mean to assess the dependency of individual classes in some more quantitative way? Thanks a lot! |
What I mean is that The current behavior is:
Ideally I would like to have:
I don't care much about preserving the ordering inside the folds (although it would be better if it is easy to implement). |
I think idx = np.argsort(self.y, kind='mergesort') should do the trick. |
No it does not, with
|
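To illustrate why a stable sort alone may not be enough, here is a minimal sketch. It assumes (this is not quoted anywhere in the thread) that test folds are taken as strided slices of the sorted indices; the helper name strided_folds_from_stable_sort is hypothetical, and the built-in sorted() stands in for np.argsort(y, kind='mergesort').

```python
def strided_folds_from_stable_sort(y, n_folds):
    """Hypothetical helper: stable sort by label, then strided fold assignment."""
    # sorted() is stable, so ties keep their original order,
    # just like np.argsort(y, kind='mergesort').
    idx = sorted(range(len(y)), key=lambda i: y[i])
    # ASSUMPTION: fold k receives every n_folds-th index of the sorted order.
    return [idx[k::n_folds] for k in range(n_folds)]


y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
for k, test_indices in enumerate(strided_folds_from_stable_sort(y, 2)):
    print(k, test_indices)
# 0 [0, 2, 4, 6, 8]
# 1 [1, 3, 5, 7, 9]
```

Under that assumption, neighbouring samples still end up in different folds even though the sort is stable: the interleaving comes from the strided assignment, not from the sort.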
I don't consider this a bug at all. We don't do structured learning, so we can assume there are no dependencies between samples. |
It's inconsistent with the behavior of |
But in real life there is always a bit of dependency, and if you are not able to quantify the impact of the IID breakage then you are just lying to yourself with overly optimistic scores. Furthermore we already provide |
Alright, I didn't fully grasp the issues. Never mind. |
…der of samples.
[MRG] FIX #2372: non-shuffling StratifiedKFold implementation
This line breaks for multilabel classifiers, since it fails both for arrays of tuples and for the 2D numpy array produced by MultiLabelBinarizer: label_test_folds = test_folds[y == label] |
|
Stratified cross-validation for the multilabel case is a much harder problem than for multiclass, and requires special algorithms. |
Couldn't we have just used a stable sort to do this? I know this is ancient, but the imbalanced test set sizes in #10154 make this look like a pretty strange strategy! |
As highlighted in this notebook, the current implementation of StratifiedKFold (which is used by default by cross_val_score and GridSearchCV for classification problems) breaks the dependency structure of the dataset by computing the folds based on the sorted labels.

Instead, one should probably write an implementation that performs an individual, dependency-preserving KFold for each possible label value and aggregates the resulting folds to get the final StratifiedKFold folds.

This might require refactoring to get rid of the _BaseKFold base class. It might also make it easier to implement a shuffle=True option for StratifiedKFold.
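The per-label KFold idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation that was eventually merged: the function name contiguous_stratified_test_folds is hypothetical, and the remainder handling (earlier folds receive one extra sample per class) is one possible choice among several.

```python
from collections import defaultdict


def contiguous_stratified_test_folds(y, n_folds):
    """Hypothetical sketch: return the test fold index of every sample in y."""
    # Group sample positions by label, keeping the original order.
    positions = defaultdict(list)
    for i, label in enumerate(y):
        positions[label].append(i)

    test_folds = [None] * len(y)
    for idx in positions.values():
        # Split this label's samples into n_folds contiguous chunks,
        # as an unshuffled KFold would.
        base, extra = divmod(len(idx), n_folds)
        start = 0
        for fold in range(n_folds):
            # ASSUMPTION: the first `extra` folds get one more sample each.
            size = base + (1 if fold < extra else 0)
            for i in idx[start:start + size]:
                test_folds[i] = fold
            start += size
    return test_folds


y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(contiguous_stratified_test_folds(y, 2))
# [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]
```

Within each class, consecutive samples stay in the same test fold, so the local dependency structure is kept while every fold still sees every class. One side effect of this particular remainder rule is that the extra samples of every class land in the earliest folds, so total fold sizes can differ by more than one.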