
StratifiedKFold should do its best to preserve the dataset dependency structure #2372


Closed
ogrisel opened this issue Aug 20, 2013 · 13 comments
Labels: Bug, Moderate (anything that requires some knowledge of conventions and best practices)

Comments

@ogrisel
Member

ogrisel commented Aug 20, 2013

As highlighted in this notebook the current implementation of StratifiedKFold (which is used by default by cross_val_score and GridSearchCV for classification problems) breaks the dependency structure of the dataset by computing the folds based on the sorted labels.

Instead, one should probably write an implementation that performs an individual, dependency-preserving KFold for each possible label value and aggregates the folds to get the final StratifiedKFold folds.

This might require refactoring to get rid of the _BaseKFold base class. It might also make it easier to implement a shuffle=True option for StratifiedKFold.

@ogrisel
Member Author

ogrisel commented Aug 20, 2013

Also one should add a non-regression test based on the digits dataset as highlighted in the notebook.

@kpysniak
Contributor

When you say that StratifiedKFold should preserve individual dependency, do you mean each individual fold should contain the samples of each class in the same order as in the original set? If so, I think using a stable sort would solve the problem, that is, replacing this line in cross_validation.py
from: idx = np.argsort(self.y)
to: idx = np.argsort(self.y, kind='mergesort')

The samples would be sorted, but the samples for each class would be taken in the same order as it is in the original set.
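To illustrate the stable-sort point: a mergesort argsort keeps equal keys in their original relative order. A minimal sketch (not the actual cross_validation.py code):

```python
import numpy as np

y = np.array([1, 0, 1, 0, 1])

# A stable sort (mergesort) keeps equal labels in their original order:
idx = np.argsort(y, kind='mergesort')
print(idx)  # [1 3 0 2 4] -- the 0s at positions 1, 3 first, then the 1s at 0, 2, 4
```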

Or do you mean to assess the dependency of individual classes in some more quantitative way?

Thanks a lot!

@ogrisel
Member Author

ogrisel commented Aug 25, 2013

What I mean is that StratifiedKFold should minimally impact the original order of the samples when computing the folds.

The current behavior is:

>>> from sklearn.cross_validation import StratifiedKFold
>>> for train, test in StratifiedKFold([0, 0, 0, 1, 1, 1, 1, 1]):
...     print "train={}, test={}".format(train, test)
...
train=[1 2 4 5 7], test=[0 3 6]
train=[0 2 3 5 6], test=[1 4 7]
train=[0 1 3 4 6 7], test=[2 5]

Ideally I would like to have:

train=[1 2 5 6 7], test=[0 3 4]
train=[0 2 3 4 7], test=[1 5 6]
train=[0 1 3 4 5 6], test=[2 7]

I don't care much about preserving the ordering inside the folds (although it would be better if it is easy to implement).
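The per-class aggregation idea from the issue description can be sketched as follows (a hypothetical order_preserving_stratified_kfold helper, not the scikit-learn implementation): run a plain contiguous KFold over each class's samples in their original dataset order, then merge the per-class folds.

```python
import numpy as np

def order_preserving_stratified_kfold(y, n_folds=3):
    """Sketch: split each class's samples (in original dataset order)
    into n_folds contiguous chunks, then merge the chunks across classes."""
    y = np.asarray(y)
    test_folds = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)  # this class's positions, original order
        for k, chunk in enumerate(np.array_split(idx, n_folds)):
            test_folds[chunk] = k         # contiguous chunk -> test fold k
    for k in range(n_folds):
        yield np.flatnonzero(test_folds != k), np.flatnonzero(test_folds == k)

for train, test in order_preserving_stratified_kfold([0, 0, 0, 1, 1, 1, 1, 1]):
    print("train={}, test={}".format(train, test))
# train=[1 2 5 6 7], test=[0 3 4]
# train=[0 2 3 4 7], test=[1 5 6]
# train=[0 1 3 4 5 6], test=[2 7]
```

This reproduces the desired folds above: each class contributes contiguous, order-preserving blocks to each test fold.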

@agramfort
Member

I think idx = np.argsort(self.y, kind='mergesort') should do the trick.

@ogrisel
Member Author

ogrisel commented Aug 25, 2013

No it does not; with kind='mergesort' you get:

>>> for train, test in StratifiedKFold([0, 0, 0, 1, 1, 1, 1, 1]):
...     print "train={}, test={}".format(train, test)
...
train=[1 2 4 5 7], test=[0 3 6]
train=[0 2 3 5 6], test=[1 4 7]
train=[0 1 3 4 6 7], test=[2 5]
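One way to see why sort stability cannot change anything for this example: the labels are already sorted, so argsort returns the identity permutation whichever kind is used. It is the fold construction applied after sorting that discards the original order.

```python
import numpy as np

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

# Already-sorted labels: both sort kinds yield the identity permutation,
# so switching to mergesort cannot affect the resulting folds here.
assert np.array_equal(np.argsort(y), np.arange(len(y)))
assert np.array_equal(np.argsort(y, kind='mergesort'), np.arange(len(y)))
```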

@larsmans
Member

I don't consider this a bug at all. We don't do structured learning, so we can assume there are no dependencies between samples.

@ogrisel
Member Author

ogrisel commented Sep 13, 2013

It's inconsistent with the behavior of KFold, and thus StratifiedKFold hides bad i.i.d. assumption problems, thereby misleading our users (myself included).

@ogrisel
Member Author

ogrisel commented Sep 13, 2013

We don't do structured learning, so we can assume there are no dependencies between samples.

But in real life there is always a bit of dependency, and if you are not able to quantify the impact of breaking the i.i.d. assumption then you are just lying to yourself with overly optimistic scores.

Furthermore we already provide LeaveOneLabelOut and LeavePLabelOut when the user has additional info on the dependency structure (e.g. experiments with distinct subjects).
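For reference, a minimal sketch of label-based splitting (at the time provided as LeaveOneLabelOut in sklearn.cross_validation; in modern scikit-learn the equivalent is LeaveOneGroupOut, used here since the old module is gone):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut  # modern name for LeaveOneLabelOut

X = np.zeros((4, 1))             # dummy features
groups = np.array([1, 1, 2, 2])  # e.g. one id per experimental subject

# Each group (subject) becomes the test set exactly once,
# so test scores reflect generalization to unseen subjects.
for train, test in LeaveOneGroupOut().split(X, groups=groups):
    print("train={}, test={}".format(train, test))
# train=[2 3], test=[0 1]
# train=[0 1], test=[2 3]
```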

@larsmans
Member

Alright, I didn't fully grasp the issues. Never mind.

dnouri added a commit to dnouri/scikit-learn that referenced this issue Sep 17, 2013
dnouri added a commit to dnouri/scikit-learn that referenced this issue Sep 20, 2013
ogrisel pushed a commit that referenced this issue Sep 20, 2013
ogrisel added a commit that referenced this issue Sep 20, 2013
ogrisel added a commit that referenced this issue Sep 20, 2013
@ogrisel ogrisel reopened this Sep 22, 2013
ogrisel added a commit that referenced this issue Sep 25, 2013
[MRG] FIX #2372: non-shuffling StratifiedKFold implementation
@adarob

adarob commented Jul 16, 2014

This line breaks for multilabel classifiers, since it fails both for arrays of tuples and for the 2D numpy array produced by MultiLabelBinarizer:

label_test_folds = test_folds[y == label]
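A minimal illustration of the failure mode, assuming a MultiLabelBinarizer-style 2D indicator matrix: comparing the 2D y against a label row broadcasts to a 2D boolean mask rather than a per-sample mask, so it cannot index a 1D test_folds array; a row-wise reduction would be needed instead.

```python
import numpy as np

y = np.array([[1, 0],
              [0, 1],
              [1, 1]])          # MultiLabelBinarizer-style indicator matrix
label = np.array([1, 0])

mask = (y == label)             # broadcasts to shape (3, 2): not one entry per sample
row_mask = (y == label).all(axis=1)  # per-sample comparison instead
print(row_mask)  # [ True False False]
```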

@arjoly
Member

arjoly commented Jul 16, 2014

StratifiedKFold doesn't work for multilabel data.
There is a PR that fixes the issue with cross_val_score.

@larsmans
Member

Stratified cross-validation for the multilabel case is a much harder problem than for multiclass, and requires special algorithms.

@jnothman
Member

What I mean is that StratifiedKFold should minimally impact the original order of the samples when computing the folds.

Couldn't we have just used a stable sort to do this? I know this is ancient, but the imbalanced test set sizes in #10154 make this look like a pretty strange strategy!


7 participants