
[MRG] FIX #2372: non-shuffling StratifiedKFold implementation #2463


Merged: 1 commit into scikit-learn:master on Sep 25, 2013

Conversation

@ogrisel (Member) commented Sep 20, 2013

This is a refactoring of @dnouri's fix for issue #2372 in PR #2450. The goal is to keep the StratifiedKFold CV scheme from underestimating overfitting on datasets that have strong dependencies between samples. This is important, as StratifiedKFold is used by default by cross_val_score and GridSearchCV for classification problems.

The current implementation of StratifiedKFold in master shuffles the data before computing the splits, which hides potential violations of the IID assumption by the dataset. This is highlighted in a test on the digits dataset, which has strong dependencies between samples (consecutive samples were written by the same author).
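For illustration, here is a minimal sketch of that comparison on the digits dataset. It uses the modern scikit-learn model_selection API (which differs from the 2013-era cross_validation module used in this PR) and an SVC classifier chosen just for the sketch; the accuracy gap it prints is not a figure taken from this PR.

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Consecutive digits samples were written by the same authors, so shuffling
# before splitting mixes "related" samples across train and test folds.
X, y = load_digits(return_X_y=True)
clf = SVC(gamma=0.001)

for shuffle in (False, True):
    cv = StratifiedKFold(n_splits=5, shuffle=shuffle,
                         random_state=0 if shuffle else None)
    scores = cross_val_score(clf, X, y, cv=cv)
    print("shuffle=%s: mean CV accuracy %.3f" % (shuffle, scores.mean()))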

Review comment on the following quoted lines:

# by authors although we don't have any information on the groups
# segment locations for this data. We can highlight this fact be
# computing k-fold cross-validation with and without shuffling: we
# observer that the shuffling case makes the IID assumption and is

observer -> observe

@jakevdp (Member) commented Sep 22, 2013

Looks pretty good at first pass. It's a subtle point, and I'm glad you've addressed it! I've not run the tests yet, but I will do that soon.

@ogrisel (Member, Author) commented Sep 23, 2013

Addressed your comments, thanks @jakevdp!

@arjoly (Member) commented Sep 23, 2013

(edited to remove my mess)

@arjoly (Member) commented Sep 23, 2013

Hm, sorry, I messed something up while I was toying with the code.

@ogrisel (Member, Author) commented Sep 23, 2013

No, indeed those are not correct at all. Let me check my code again.

@ogrisel (Member, Author) commented Sep 23, 2013

Ah OK, I am feeling reassured; I thought I had a bunch of tests covering such toy examples.

@arjoly (Member) commented Sep 23, 2013

Now, I get

In [14]: v = StratifiedKFold(np.array([0, 1, 2, 0, 1, 2, 0, 1, 2]), 2)

In [15]: for fold, (train, test) in enumerate(v):
    print "fold %s" % fold
    print "train : "+ str(train)
    print "test : " + str(test)
   ....:     
fold 0
train : [6 7 8]
test : [0 1 2 3 4 5]
fold 1
train : [0 1 2 3 4 5]
test : [6 7 8]

In [16]: v = StratifiedKFold(np.array([0, 0, 0, 1, 1, 1, 2, 2, 2]), 2)

In [17]: for fold, (train, test) in enumerate(v):
    print "fold %s" % fold
    print "train : "+ str(train)
    print "test : " + str(test)
   ....:     
fold 0
train : [2 5 8]
test : [0 1 3 4 6 7]
fold 1
train : [0 1 3 4 6 7]
test : [2 5 8]

This looks better.

@ogrisel (Member, Author) commented Sep 23, 2013

Yes, the test fold for the first iteration is quite unbalanced, as the 3 samples per class are not divisible by 2. I think this is an artifact of the unusually large n_classes / n_samples ratio. For instance, if you triple the size of the dataset while still keeping the number of samples per class not divisible by 2, you get:

>>> list(StratifiedKFold([0, 1, 2, 0, 1, 2, 0, 1, 2] * 3, 2))
[(array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]),
  array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])),
 (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
  array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]))]

>>> list(StratifiedKFold([0] * 9 + [1] * 9 + [2] * 9, 2))
[(array([ 5,  6,  7,  8, 14, 15, 16, 17, 23, 24, 25, 26]),
  array([ 0,  1,  2,  3,  4,  9, 10, 11, 12, 13, 18, 19, 20, 21, 22])),
 (array([ 0,  1,  2,  3,  4,  9, 10, 11, 12, 13, 18, 19, 20, 21, 22]),
  array([ 5,  6,  7,  8, 14, 15, 16, 17, 23, 24, 25, 26]))]

@ogrisel (Member, Author) commented Sep 23, 2013

In other words, the max difference between the size of the largest test fold and the smallest test fold is n_classes (at most one additional sample per class).
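A minimal sketch checking that claim on one of the toy examples above, written against the modern scikit-learn API (which may distribute the per-class remainders slightly differently than the 2013 implementation shown in this thread):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# 3 classes of 9 samples each: 9 is not divisible by 2 folds.
y = np.array([0] * 9 + [1] * 9 + [2] * 9)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

test_sizes = [len(test) for _, test in StratifiedKFold(n_splits=2).split(X, y)]
print(test_sizes)
# The exact sizes depend on the scikit-learn version, but the spread should
# stay within one extra sample per class, i.e. at most n_classes.
assert max(test_sizes) - min(test_sizes) <= len(np.unique(y))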

@jakevdp (Member) commented Sep 23, 2013

Quick question I need to think more about: is there any situation where ensuring a balanced draw will lead to a bias? For example, if you have a two-class balanced sample with N1=100 and N2=100, a random split will have ~50 +/- 7 samples from each class. Could it be that forcing the sample to always be 50/50 might itself bias the CV results?

@ogrisel (Member, Author) commented Sep 23, 2013

What do you mean by "forcing the sample to always be 50/50"? Do you mean having exactly 50% of each class in each train / test fold, rather than only in expectation (as when you use non-stratified KFold with shuffle=True or ShuffleSplit)?

If so, I don't think this bias would be detectable in practice. The variance of the validation scores across CV folds might be a bit lower, but I don't think the mean validation score is affected in expectation for reasonable models.
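A minimal sketch of the distinction, using the modern model_selection API and synthetic labels made up for the sketch (not code from this PR): StratifiedKFold matches the class proportions essentially exactly in each test fold, while ShuffleSplit only matches them in expectation.

import numpy as np
from sklearn.model_selection import StratifiedKFold, ShuffleSplit

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)  # two roughly balanced classes
X = np.zeros((len(y), 1))        # features are irrelevant for the split itself

for name, cv in [("StratifiedKFold", StratifiedKFold(n_splits=4)),
                 ("ShuffleSplit", ShuffleSplit(n_splits=4, test_size=0.25,
                                               random_state=0))]:
    # fraction of class 1 in each test fold: nearly constant for
    # StratifiedKFold, fluctuating around the overall rate for ShuffleSplit
    fractions = [y[test].mean() for _, test in cv.split(X, y)]
    print(name, np.round(fractions, 2))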

@jakevdp (Member) commented Sep 23, 2013

Sounds reasonable. I just wanted to make sure we weren't overlooking anything!

@jakevdp (Member) commented Sep 23, 2013

I get one test failure on this branch:

[~]$ cd scikit-learn
[~]$ make test

<snip>

======================================================================
FAIL: sklearn.cluster.tests.test_spectral.test_spectral_clustering_sparse
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jakevdp/anaconda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/jakevdp/Opensource/scikit-learn/sklearn/cluster/tests/test_spectral.py", line 155, in test_spectral_clustering_sparse
    assert_greater(np.mean(labels == [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]), .89)
AssertionError: 0.59999999999999998 not greater than 0.89
    """Fail immediately, with the given message."""
>>  raise self.failureException('0.59999999999999998 not greater than 0.89')


----------------------------------------------------------------------
Ran 2666 tests in 104.923s

FAILED (SKIP=16, failures=1)
make: *** [test-code] Error 1

@jakevdp (Member) commented Sep 23, 2013

Nevermind, I get the same failure on master, so it doesn't seem to be related to this PR.

@jakevdp (Member) commented Sep 24, 2013

This all looks good to me. I'm +1 for merge.

@arjoly (Member) commented Sep 25, 2013

Looks good to me. +1

@ogrisel (Member, Author) commented Sep 25, 2013

Alright, I will squash the commits and merge by rebase then.

ogrisel added a commit referencing this pull request on Sep 25, 2013:
[MRG] FIX #2372: non-shuffling StratifiedKFold implementation
ogrisel merged commit 53788b0 into scikit-learn:master on Sep 25, 2013