
StratifiedKFold should do its best to preserve the dataset dependency structure #2372


Closed
ogrisel opened this issue Aug 20, 2013 · 13 comments
Labels: Bug, Moderate (anything that requires some knowledge of conventions and best practices)

Comments

@ogrisel
Member

ogrisel commented Aug 20, 2013

As highlighted in this notebook the current implementation of StratifiedKFold (which is used by default by cross_val_score and GridSearchCV for classification problems) breaks the dependency structure of the dataset by computing the folds based on the sorted labels.

Instead, one should probably write an implementation that performs an individual, dependency-preserving KFold for each possible label value and aggregates the folds to get the final StratifiedKFold folds.

This might require refactoring to get rid of the _BaseKFold base class. It might also make it easier to implement a shuffle=True option for StratifiedKFold.

@ogrisel
Member Author

ogrisel commented Aug 20, 2013

Also one should add a non-regression test based on the digits dataset as highlighted in the notebook.

@kpysniak
Contributor

When you say that StratifiedKFold should preserve individual dependency, do you mean each individual fold should contain the samples of each class in the same order as in the original set? If so, I think using a stable sort would solve the problem, that is, replacing this line in cross_validation.py
from: idx = np.argsort(self.y)
to: idx = np.argsort(self.y, kind='mergesort')

The samples would be sorted, but the samples for each class would be taken in the same order as it is in the original set.
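To illustrate the stable-sort point: a mergesort argsort keeps equal keys in their original relative order. A minimal sketch (not the actual cross_validation.py code):

```python
import numpy as np

y = np.array([1, 0, 1, 0, 1])

# A stable sort (mergesort) keeps equal labels in their original order:
idx = np.argsort(y, kind='mergesort')
print(idx)  # [1 3 0 2 4] -- the 0s at positions 1, 3 first, then the 1s at 0, 2, 4
```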

Or do you mean to assess the dependency of individual classes in some more quantitative way?

Thanks a lot!

@ogrisel
Member Author

ogrisel commented Aug 25, 2013

What I mean is that StratifiedKFold should minimally impact the original order of the samples when computing the folds.

The current behavior is:

>>> from sklearn.cross_validation import StratifiedKFold
>>> for train, test in StratifiedKFold([0, 0, 0, 1, 1, 1, 1, 1]):
...     print "train={}, test={}".format(train, test)
...
train=[1 2 4 5 7], test=[0 3 6]
train=[0 2 3 5 6], test=[1 4 7]
train=[0 1 3 4 6 7], test=[2 5]

Ideally I would like to have:

train=[1 2 5 6 7], test=[0 3 4]
train=[0 2 3 4 7], test=[1 5 6]
train=[0 1 3 4 5 6], test=[2 7]

I don't care much about preserving the ordering inside the folds (although it would be better if it is easy to implement).
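The per-class aggregation idea from the issue description can be sketched as follows (a hypothetical order_preserving_stratified_kfold helper, not the scikit-learn implementation): run a plain contiguous KFold over each class's samples in their original dataset order, then merge the per-class folds.

```python
import numpy as np

def order_preserving_stratified_kfold(y, n_folds=3):
    """Sketch: split each class's samples (in original dataset order)
    into n_folds contiguous chunks, then merge the chunks across classes."""
    y = np.asarray(y)
    test_folds = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)  # this class's positions, original order
        for k, chunk in enumerate(np.array_split(idx, n_folds)):
            test_folds[chunk] = k         # contiguous chunk -> test fold k
    for k in range(n_folds):
        yield np.flatnonzero(test_folds != k), np.flatnonzero(test_folds == k)

for train, test in order_preserving_stratified_kfold([0, 0, 0, 1, 1, 1, 1, 1]):
    print("train={}, test={}".format(train, test))
# train=[1 2 5 6 7], test=[0 3 4]
# train=[0 2 3 4 7], test=[1 5 6]
# train=[0 1 3 4 5 6], test=[2 7]
```

This reproduces the desired folds above: each class contributes contiguous, order-preserving blocks to each test fold.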

@agramfort
Member

I think idx = np.argsort(self.y, kind='mergesort') should do the trick.

@ogrisel
Member Author

ogrisel commented Aug 25, 2013

No it does not; with kind='mergesort' you get:

>>> for train, test in StratifiedKFold([0, 0, 0, 1, 1, 1, 1, 1]):
...     print "train={}, test={}".format(train, test)
...
train=[1 2 4 5 7], test=[0 3 6]
train=[0 2 3 5 6], test=[1 4 7]
train=[0 1 3 4 6 7], test=[2 5]
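One way to see why sort stability cannot change anything for this example: the labels are already sorted, so argsort returns the identity permutation whichever kind is used. It is the fold construction applied after sorting that discards the original order.

```python
import numpy as np

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

# Already-sorted labels: both sort kinds yield the identity permutation,
# so switching to mergesort cannot affect the resulting folds here.
assert np.array_equal(np.argsort(y), np.arange(len(y)))
assert np.array_equal(np.argsort(y, kind='mergesort'), np.arange(len(y)))
```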

@larsmans
Member

I don't consider this a bug at all. We don't do structured learning, so we can assume there are no dependencies between samples.

@ogrisel
Member Author

ogrisel commented Sep 13, 2013

It's inconsistent with the behavior of KFold, and thus StratifiedKFold hides bad i.i.d. assumption problems, thereby misleading our users (myself included).

@ogrisel
Member Author

ogrisel commented Sep 13, 2013

We don't do structured learning, so we can assume there are no dependencies between samples.

But in real life there is always a bit of dependency, and if you are not able to quantify the impact of breaking the i.i.d. assumption then you are just lying to yourself with overly optimistic scores.

Furthermore we already provide LeaveOneLabelOut and LeavePLabelOut when the user has additional info on the dependency structure (e.g. experiments with distinct subjects).
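For reference, a minimal sketch of label-based splitting (at the time provided as LeaveOneLabelOut in sklearn.cross_validation; in modern scikit-learn the equivalent is LeaveOneGroupOut, used here since the old module is gone):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut  # modern name for LeaveOneLabelOut

X = np.zeros((4, 1))             # dummy features
groups = np.array([1, 1, 2, 2])  # e.g. one id per experimental subject

# Each group (subject) becomes the test set exactly once,
# so test scores reflect generalization to unseen subjects.
for train, test in LeaveOneGroupOut().split(X, groups=groups):
    print("train={}, test={}".format(train, test))
# train=[2 3], test=[0 1]
# train=[0 1], test=[2 3]
```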

@larsmans
Member

Alright, I didn't fully grasp the issues. Never mind.

dnouri added a commit to dnouri/scikit-learn that referenced this issue Sep 17, 2013
dnouri added a commit to dnouri/scikit-learn that referenced this issue Sep 20, 2013
ogrisel pushed a commit that referenced this issue Sep 20, 2013
ogrisel added a commit that referenced this issue Sep 20, 2013
ogrisel added a commit that referenced this issue Sep 20, 2013
@ogrisel ogrisel reopened this Sep 22, 2013
ogrisel added a commit that referenced this issue Sep 25, 2013
[MRG] FIX #2372: non-shuffling StratifiedKFold implementation
@adarob

adarob commented Jul 16, 2014

This line breaks for multilabel classifiers, since it fails both for arrays of tuples and for the 2D numpy array produced by MultiLabelBinarizer:

label_test_folds = test_folds[y == label]
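A minimal illustration of the failure mode, assuming a MultiLabelBinarizer-style 2D indicator matrix: comparing the 2D y against a label row broadcasts to a 2D boolean mask rather than a per-sample mask, so it cannot index a 1D test_folds array; a row-wise reduction would be needed instead.

```python
import numpy as np

y = np.array([[1, 0],
              [0, 1],
              [1, 1]])          # MultiLabelBinarizer-style indicator matrix
label = np.array([1, 0])

mask = (y == label)             # broadcasts to shape (3, 2): not one entry per sample
row_mask = (y == label).all(axis=1)  # per-sample comparison instead
print(row_mask)  # [ True False False]
```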

@arjoly
Member

arjoly commented Jul 16, 2014

StratifiedKFold doesn't work for multilabel data.
There is a PR that fixes the issue with cross_val_score.

@larsmans
Member

Stratified cross-validation for the multilabel case is a much harder problem than for multiclass, and requires special algorithms.

@jnothman
Member

What I mean is that StratifiedKFold should minimally impact the original order of the samples when computing the folds.

Couldn't we have just used a stable sort to do this? I know this is ancient, but the imbalanced test set sizes in #10154 make this look like a pretty strange strategy!


7 participants