
[MRG] FIX #2372: non-shuffling StratifiedKFold implementation #2463


Merged: 1 commit into scikit-learn:master on Sep 25, 2013

Conversation

@ogrisel (Member) commented Sep 20, 2013

This is a refactoring of @dnouri's fix for issue #2372 in PR #2450. The goal is to keep the StratifiedKFold CV scheme from underestimating overfitting on datasets that have strong dependencies between samples. This is important, as StratifiedKFold is used by default by cross_val_score and GridSearchCV for classification problems.

The current implementation of StratifiedKFold in master shuffles the data before computing the splits, which hides potential violations of the IID assumption by the dataset. This is highlighted in a test on the digits dataset, which has strong dependencies between samples (consecutive samples were written by the same author).
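For illustration, here is a minimal sketch of that comparison on the digits dataset. It uses the modern scikit-learn model_selection API (which differs from the 2013-era cross_validation module used in this PR) and an SVC classifier chosen just for the sketch; the accuracy gap it prints is not a figure taken from this PR.

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Consecutive digits samples were written by the same authors, so shuffling
# before splitting mixes "related" samples across train and test folds.
X, y = load_digits(return_X_y=True)
clf = SVC(gamma=0.001)

for shuffle in (False, True):
    cv = StratifiedKFold(n_splits=5, shuffle=shuffle,
                         random_state=0 if shuffle else None)
    scores = cross_val_score(clf, X, y, cv=cv)
    print("shuffle=%s: mean CV accuracy %.3f" % (shuffle, scores.mean()))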

Review comment on the following quoted lines:

# by authors although we don't have any information on the groups
# segment locations for this data. We can highlight this fact be
# computing k-fold cross-validation with and without shuffling: we
# observer that the shuffling case makes the IID assumption and is

observer -> observe

@jakevdp (Member) commented Sep 22, 2013

Looks pretty good at first pass. It's a subtle point, and I'm glad you've addressed it! I've not run the tests yet, but I will do that soon.

@ogrisel (Member, Author) commented Sep 23, 2013

Addressed your comments, thanks @jakevdp!

@arjoly (Member) commented Sep 23, 2013

(edited to remove my mess)

@arjoly (Member) commented Sep 23, 2013

Hm, sorry, I messed something up while I was toying with the code.

@ogrisel (Member, Author) commented Sep 23, 2013

No, indeed those are not correct at all. Let me check my code again.

@ogrisel (Member, Author) commented Sep 23, 2013

Ah OK, I am feeling reassured; I thought I had a bunch of tests covering such toy examples.

@arjoly (Member) commented Sep 23, 2013

Now, I get

In [14]: v = StratifiedKFold(np.array([0, 1, 2, 0, 1, 2, 0, 1, 2]), 2)

In [15]: for fold, (train, test) in enumerate(v):
    print "fold %s" % fold
    print "train : "+ str(train)
    print "test : " + str(test)
   ....:     
fold 0
train : [6 7 8]
test : [0 1 2 3 4 5]
fold 1
train : [0 1 2 3 4 5]
test : [6 7 8]

In [16]: v = StratifiedKFold(np.array([0, 0, 0, 1, 1, 1, 2, 2, 2]), 2)

In [17]: for fold, (train, test) in enumerate(v):
    print "fold %s" % fold
    print "train : "+ str(train)
    print "test : " + str(test)
   ....:     
fold 0
train : [2 5 8]
test : [0 1 3 4 6 7]
fold 1
train : [0 1 3 4 6 7]
test : [2 5 8]

This looks better.

@ogrisel (Member, Author) commented Sep 23, 2013

Yes, the test fold for the first iteration is quite unbalanced, as the 3 samples per class are not divisible by 2. I think this is an artifact of the unusually large n_classes / n_samples ratio. For instance, if you triple the size of the dataset while still keeping the number of samples per class not divisible by 2, you get:

>>> list(StratifiedKFold([0, 1, 2, 0, 1, 2, 0, 1, 2] * 3, 2))
[(array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]),
  array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])),
 (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
  array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]))]

>>> list(StratifiedKFold([0] * 9 + [1] * 9 + [2] * 9, 2))
[(array([ 5,  6,  7,  8, 14, 15, 16, 17, 23, 24, 25, 26]),
  array([ 0,  1,  2,  3,  4,  9, 10, 11, 12, 13, 18, 19, 20, 21, 22])),
 (array([ 0,  1,  2,  3,  4,  9, 10, 11, 12, 13, 18, 19, 20, 21, 22]),
  array([ 5,  6,  7,  8, 14, 15, 16, 17, 23, 24, 25, 26]))]

@ogrisel (Member, Author) commented Sep 23, 2013

In other words, the max difference between the size of the largest test fold and the smallest test fold is n_classes (at most one additional sample per class).
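A minimal sketch checking that claim on one of the toy examples above, written against the modern scikit-learn API (which may distribute the per-class remainders slightly differently than the 2013 implementation shown in this thread):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# 3 classes of 9 samples each: 9 is not divisible by 2 folds.
y = np.array([0] * 9 + [1] * 9 + [2] * 9)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

test_sizes = [len(test) for _, test in StratifiedKFold(n_splits=2).split(X, y)]
print(test_sizes)
# The exact sizes depend on the scikit-learn version, but the spread should
# stay within one extra sample per class, i.e. at most n_classes.
assert max(test_sizes) - min(test_sizes) <= len(np.unique(y))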

@jakevdp (Member) commented Sep 23, 2013

Quick question I need to think more about: is there any situation where ensuring a balanced draw will lead to a bias? For example, if you have a two-class balanced sample with N1=100 and N2=100, a random split will have ~50 +/- 7 samples from each class. Could it be that forcing the sample to always be 50/50 might itself bias the CV results?

@ogrisel (Member, Author) commented Sep 23, 2013

What do you mean by "forcing the sample to always be 50/50"? Do you mean having exactly 50% of each class in each train / test fold, rather than only in expectation (as when you use non-stratified KFold with shuffle=True or ShuffleSplit)?

If so, I don't think this bias would be detectable in practice. The variance of the validation scores across CV folds might be a bit lower, but I don't think the mean validation score is affected in expectation for reasonable models.
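A minimal sketch of the distinction, using the modern model_selection API and synthetic labels made up for the sketch (not code from this PR): StratifiedKFold matches the class proportions essentially exactly in each test fold, while ShuffleSplit only matches them in expectation.

import numpy as np
from sklearn.model_selection import StratifiedKFold, ShuffleSplit

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)  # two roughly balanced classes
X = np.zeros((len(y), 1))        # features are irrelevant for the split itself

for name, cv in [("StratifiedKFold", StratifiedKFold(n_splits=4)),
                 ("ShuffleSplit", ShuffleSplit(n_splits=4, test_size=0.25,
                                               random_state=0))]:
    # fraction of class 1 in each test fold: nearly constant for
    # StratifiedKFold, fluctuating around the overall rate for ShuffleSplit
    fractions = [y[test].mean() for _, test in cv.split(X, y)]
    print(name, np.round(fractions, 2))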

@jakevdp (Member) commented Sep 23, 2013

Sounds reasonable. I just wanted to make sure we weren't overlooking anything!

@jakevdp (Member) commented Sep 23, 2013

I get one test failure on this branch:

[~]$ cd scikit-learn
[~]$ make test

<snip>

======================================================================
FAIL: sklearn.cluster.tests.test_spectral.test_spectral_clustering_sparse
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jakevdp/anaconda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/jakevdp/Opensource/scikit-learn/sklearn/cluster/tests/test_spectral.py", line 155, in test_spectral_clustering_sparse
    assert_greater(np.mean(labels == [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]), .89)
AssertionError: 0.59999999999999998 not greater than 0.89
    """Fail immediately, with the given message."""
>>  raise self.failureException('0.59999999999999998 not greater than 0.89')


----------------------------------------------------------------------
Ran 2666 tests in 104.923s

FAILED (SKIP=16, failures=1)
make: *** [test-code] Error 1

@jakevdp (Member) commented Sep 23, 2013

Nevermind, I get the same failure on master, so it doesn't seem to be related to this PR.

@jakevdp (Member) commented Sep 24, 2013

This all looks good to me. I'm +1 for merge.

@arjoly (Member) commented Sep 25, 2013

Looks good to me. +1

@ogrisel (Member, Author) commented Sep 25, 2013

Alright, I will squash the commits and merge by rebase then.

ogrisel added a commit referencing this pull request on Sep 25, 2013:
[MRG] FIX #2372: non-shuffling StratifiedKFold implementation
ogrisel merged commit 53788b0 into scikit-learn:master on Sep 25, 2013