[MRG+1] fix StratifiedShuffleSplit train and test overlap #6379
Conversation
# We complete by randomly assigning the unassigned (missing) indexes
missing_idx = np.where(bincount(train + test,
                                minlength=len(self.y)) == 0,
                       )[0]
missing_idx = rng.permutation(missing_idx)
train.extend(missing_idx[:(self.n_train - len(train))])
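For context, here is a small standalone illustration (with made-up index lists, not the library's code path) of what this bincount idiom computes: the indices that ended up in neither train nor test.

```python
import numpy as np

# hypothetical per-class draws; sample 5 was assigned to neither side
train, test = [0, 2, 4], [1, 3]
n_samples = 6

# bincount over the union of both index lists is zero exactly at the
# positions that were never drawn
missing_idx = np.where(np.bincount(train + test, minlength=n_samples) == 0)[0]
print(missing_idx)  # [5]
```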
The crux of the problem is that self.n_train - len(train) can be negative if you are unlucky with the rounding, which causes a problem in the slicing below.
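To illustrate the failure mode, here is a standalone sketch with made-up numbers (not the actual split code): a negative slice bound in Python does not give an empty selection, it slices from the end, so the train side can silently pick up extra samples.

```python
import numpy as np

missing_idx = np.array([7, 3, 9])  # hypothetical leftover sample indices
n_train = 5
current_train_size = 6             # rounding already overshot by one sample

stop = n_train - current_train_size  # -1, not 0
extra = missing_idx[:stop]           # negative stop -> array([7, 3]), not empty
print(extra)

# With the unlucky rounding, train gets samples it should not, and the
# complementary slice used for the test side can then hand out some of the
# same indices again, which is the train/test overlap reported in #6121.
```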
LGTM
I think this should be included in the model_selection version as well.
Good point, the same code is in sklearn/model_selection/_split.py.
I wasn't aware of that. Just curious, are model_selection.StratifiedShuffleSplit and cross_validation.StratifiedShuffleSplit supposed to be just a copy and paste of each other?
For the most part yes!
Everything in cross_validation is deprecated. As we changed the API we had to duplicate the code for the deprecation period.
(force-pushed from 32b99c0 to 115517e)
OK I just pushed a commit with the fix for both model_selection.StratifiedShuffleSplit and cross_validation.StratifiedShuffleSplit.
This is slightly surprising as manually maintaining consistency between the two for two more major releases may be a bit burdensome. I guess this was discussed in detail though and it was deemed a better solution than, say, sharing the common code between the two.
(force-pushed from 115517e to 95c3a29)
Indeed ;( An attempt to reuse the code from model_selection has been made in #5568. Reviews would be welcome 😁
And thanks for extending the fix to model_selection as well.
@@ -1050,7 +1050,7 @@ class StratifiedShuffleSplit(BaseShuffleSplit):
     >>> y = np.array([0, 0, 1, 1])
     >>> sss = StratifiedShuffleSplit(n_iter=3, test_size=0.5, random_state=0)
     >>> sss.get_n_splits(X, y)
-    3
+    3n
typo?
I wonder why the docstring tests pass...
I fixed the typo and I'll take a closer look at why the docstring didn't seem to be run.
why the docstring didn't seem to be run.
Seems like nosetests ignores files starting with an underscore. Maybe we should use ignore-files=DONTIGNOREANYFILES, as was mentioned here.
For some reason you have to use ignore-files=setup.py, because trying to import sklearn/cluster/setup.py confuses the hell out of nose; not sure why.
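As a rough illustration of why the doctests were skipped (this assumes nose's commonly documented default ignore patterns; it is not scikit-learn's actual test configuration):

```python
import re

# nose's default file-ignore patterns: hidden files, underscore-prefixed
# files, and setup.py (the comments above discuss relaxing the first two
# while keeping setup.py excluded)
DEFAULT_IGNORE = [re.compile(r'^\.'), re.compile(r'^_'), re.compile(r'^setup\.py$')]

def would_be_collected(filename, ignore_patterns=DEFAULT_IGNORE):
    """Return True if a file with this name passes the ignore filter."""
    return not any(pattern.search(filename) for pattern in ignore_patterns)

print(would_be_collected("_split.py"))      # False: underscore prefix, doctests skipped
print(would_be_collected("setup.py"))       # False: explicitly ignored
print(would_be_collected("test_split.py"))  # True
```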
Besides the typo in the doctest and the fact that Travis is still happy about it, +1 for merge as well.
(force-pushed from 95c3a29 to 7470513)
Well, there will be no changes to the deprecated code, and hopefully there are not too many bugs to fix ;)
whatsnew entry?
The logic is kinda tricky now. Can you add comments about the one-letter variables, please?
I'll add a whatsnew entry. Just for the record, this overlap between train and test happens only in some unlikely edge cases. You have to be pretty unlucky to hit it. Basically you have to:
You also need 1. and 2. to coordinate in a specific fashion as well.
I don't think the PR makes the logic trickier than it already was.
I was thinking about whether there was a way to simplify the code, in particular by getting rid of the unassigned samples logic, but I wasn't able to find one right away. I can try to look at it in more detail and open a separate PR if I manage to get anywhere.
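Since the PR also adds a non-regression test, here is a rough sketch of what such a check could look like (hypothetical data and parameters, not the PR's actual test; the constructor argument is n_iter in the version shown in the diff above and was renamed n_splits in later releases):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# hypothetical class counts; uneven sizes make the rounding edge cases
# more likely to show up
y = np.array([0] * 7 + [1] * 4 + [2] * 5)
X = np.zeros((len(y), 1))  # features are irrelevant to the split itself

# n_iter in the 0.18-dev API quoted above; later versions call this n_splits
sss = StratifiedShuffleSplit(n_iter=20, test_size=0.33, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    overlap = np.intersect1d(train_idx, test_idx)
    assert overlap.size == 0, "train and test share samples: %r" % overlap
```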
Fix StratifiedShuffleSplit train and test overlap in some edge cases and added test. Fix was applied in both sklearn.model_selection and sklearn.cross_validation.
(force-pushed from 7470513 to 07728d9)
whatsnew entry added.
Which variables?
Merged branch …ain-test-overlap: [MRG+1] fix StratifiedShuffleSplit train and test overlap
I did mean
Maybe we can make that a programming challenge lol
I'll try to take a closer look at this in the next few days.
I think we could try to fix
Fix #6121.