[MRG] Fix #2372: StratifiedKFold less impact on the original order of samples. #2450

dnouri · 2013-09-17T11:21:06Z

See #2372 for motivation.


ogrisel · 2013-09-19T08:24:38Z

sklearn/tests/test_cross_validation.py

+
+    model = SVC(C=10, gamma=0.005)
+    cv = cval.StratifiedKFold(y, 5)
+    assert cval.cross_val_score(model, X, y, cv=cv, n_jobs=-1).mean() < 0.91


I would rather not use n_jobs=-1 in tests that are not explicitly about parallel computing: it can currently fail on some platforms (for instance, if numpy is built against OpenBLAS), and we would have to protect the tests against such failures.

ogrisel · 2013-09-19T08:43:55Z

Thanks @dnouri for tackling this. +1 for merging once my comments are addressed.


ogrisel · 2013-09-19T17:16:05Z

Thanks for addressing my comments @dnouri! This looks good to me. +1 for merge on my side. Any other reviews, @agramfort @GaelVaroquaux @larsmans @mblondel @glouppe @arjoly? (I intentionally didn't ping Andy, to let him focus on his thesis :)

@dnouri could you please add an entry to the doc/whats_new.rst file to document this change?

agramfort · 2013-09-19T19:41:31Z

+1 for merge

dnouri · 2013-09-20T10:04:34Z

Updated whats_new.rst. Thanks for reviewing!

ogrisel · 2013-09-20T10:44:11Z

Merged by rebase. Thanks @dnouri!

ogrisel · 2013-09-20T11:38:23Z

Actually, I reverted the merge because it caused a test failure under Python 3 (which could probably have been fixed). More importantly, the strategy used here is no longer real k-fold cross-validation, as some samples may never appear in the test set.

ogrisel · 2013-09-20T11:39:25Z

doc/modules/cross_validation.rst

-  [1 4 6] [0 2 3 5]
-  [0 2 3 5] [1 4 6]
+  [2 4 5 6] [0 1 3]
+  [0 1 3 5] [2 4 6]


This is a regression: sample #5 is never part of the test set. I will try to come up with another way to do this.
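The regression can be checked directly from the two fold listings in the doctest diff above. The following is a minimal sketch, not code from this PR: the helper name `never_tested` is hypothetical, and the fold lists are copied by hand from the removed and added doctest lines (train indices first, test indices second).

```python
def never_tested(folds, n_samples):
    """Return the sorted sample indices that appear in no test set.

    `folds` is a list of (train_indices, test_indices) pairs. In real
    k-fold cross-validation the result should be empty, since every
    sample must be tested at least once.
    """
    tested = {i for _, test in folds for i in test}
    return sorted(set(range(n_samples)) - tested)


# Folds from the removed doctest lines (behaviour before this PR).
old_folds = [([1, 4, 6], [0, 2, 3, 5]), ([0, 2, 3, 5], [1, 4, 6])]
# Folds from the added doctest lines (behaviour with this PR).
new_folds = [([2, 4, 5, 6], [0, 1, 3]), ([0, 1, 3, 5], [2, 4, 6])]

print(never_tested(old_folds, 7))  # []
print(never_tested(new_folds, 7))  # [5]
```

A check of this kind, run over the splits produced for several label arrays, would likely have caught the problem before the merge.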

ogrisel · 2013-09-20T20:41:17Z

I came up with a working implementation in #2463. I also reused the updated tests from this PR and added more checks. Let's continue the review over there.

dnouri added 2 commits September 17, 2013 13:17

FIX scikit-learn#2372: StratifiedKFold less impact on the original or…

4f1d874

…der of samples.

Fix accidental doctest breakage.

9777eae

ogrisel reviewed Sep 19, 2013

dnouri added 3 commits September 19, 2013 14:43

Instead of linking to NB, explain the problem inside the test itself.

5358294

I'm using a smaller number of examples from the digits dataset because that cuts test execution time for me from 17s to 4s and still yields very similar results.

Avoid list, preallocate a numpy array for indices instead.

d9fa475

It appears that the sorted() call is unnecessary here.

Update comment with numbers for when we run with 800 samples.

5a084bd

dnouri added 2 commits September 20, 2013 11:51

Merge branch 'master' into bug-2372-stratifiedkfold-preserve-order

13bb962

Add entry for scikit-learn#2372 to whats_new.rst

f01dc18

ogrisel closed this Sep 20, 2013

ogrisel reopened this Sep 20, 2013

ogrisel reviewed Sep 20, 2013

ogrisel mentioned this pull request Sep 20, 2013

[MRG] FIX #2372: non-shuffling StratifiedKFold implementation #2463

Merged

ogrisel closed this Sep 20, 2013
