[MRG+1] fix StratifiedShuffleSplit train and test overlap #6379

Conversation

lesteve
Member

@lesteve lesteve commented Feb 17, 2016

Fix #6121.

# We complete by affecting randomly the missing indexes
missing_idx = np.where(bincount(train + test,
                                minlength=len(self.y)) == 0,
                       )[0]
missing_idx = rng.permutation(missing_idx)
train.extend(missing_idx[:(self.n_train - len(train))])
Member Author

The crux of the problem is that self.n_train - len(train) can be negative if you are unlucky with the rounding, which causes a problem in the slicing below.
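
A minimal sketch of that failure mode (the counts below are hypothetical, chosen only to show the slicing behavior):

missing_idx = [3, 7, 9]          # indexes assigned to neither train nor test yet
n_train, len_train = 10, 11      # hypothetical: train already one sample too long
print(missing_idx[:n_train - len_train])  # missing_idx[:-1] -> [3, 7]
# With a negative bound the slice silently keeps all but the last index
# instead of keeping none, so the same indexes stay eligible for the test
# slice as well and a sample can end up in both train and test.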

@TomDLT
Member

TomDLT commented Feb 17, 2016

LGTM

@raghavrv
Member

I think this should be included in the model_selection module too... Thanks for the fix...

@TomDLT
Member

TomDLT commented Feb 17, 2016

good point. sklearn/model_selection/_split.py

@lesteve
Member Author

lesteve commented Feb 17, 2016

good point. sklearn/model_selection/_split.py

I wasn't aware of that. Just curious, are model_selection.StratifiedShuffleSplit and cross_validation.StratifiedShuffleSplit supposed to be just a copy and paste of each other?

@raghavrv
Member

For the most part yes!

@GaelVaroquaux
Member

model_selection.StratifiedShuffleSplit and cross_validation.StratifiedShuffleSplit

Everything in cross_validation is deprecated. As we changed the API, we had to
create a new import path in order not to break everything for people.

@lesteve lesteve force-pushed the fix-stratified-shuffle-split-train-test-overlap branch from 32b99c0 to 115517e Compare February 17, 2016 13:29
@lesteve
Member Author

lesteve commented Feb 17, 2016

OK I just pushed a commit with the fix for both model_selection.StratifiedShuffleSplit and cross_validation.StratifiedShuffleSplit.

Just curious, are model_selection.StratifiedShuffleSplit and cross_validation.StratifiedShuffleSplit supposed to be just a copy and paste of each other?

For the most part yes!

This is slightly surprising, as manually maintaining consistency between the two for two more major releases may be a bit burdensome. I guess this was discussed in detail, though, and it was deemed a better solution than, say, sharing the common code between the two.

@lesteve lesteve force-pushed the fix-stratified-shuffle-split-train-test-overlap branch from 115517e to 95c3a29 Compare February 17, 2016 13:44
@raghavrv
Member

may be a bit burdensome.

Indeed ;( An attempt to reuse the code from model_selection has been made at #5568. Reviews would be welcome 😁

@raghavrv
Member

And thanks for extending the fix to model_selection

@@ -1050,7 +1050,7 @@ class StratifiedShuffleSplit(BaseShuffleSplit):
     >>> y = np.array([0, 0, 1, 1])
     >>> sss = StratifiedShuffleSplit(n_iter=3, test_size=0.5, random_state=0)
     >>> sss.get_n_splits(X, y)
-    3
+    3n
Member

typo?

Member

I wonder why the docstring tests pass...

Member Author

I fixed the typo, and I'll take a closer look at why the doctests didn't seem to be run.

Member Author

why the doctests didn't seem to be run.

Seems like nosetests ignores files starting with an underscore. Maybe we should use ignore-files=DONTIGNOREANYFILES, as was mentioned here.

Member Author

For some reason you have to use ignore-files=setup.py, because trying to import sklearn/cluster/setup.py confuses the hell out of nose; not sure why.
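
For reference, a sketch of what such an override could look like in setup.cfg (the exact regex is an assumption, not necessarily what was committed):

[nosetests]
# Override nose's default ignore patterns (which skip files starting with
# "_" or "."), so that modules like _split.py get collected, while still
# ignoring the setup.py files that confuse nose's importer.
ignore-files=^setup\.py$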

@ogrisel
Member

ogrisel commented Feb 25, 2016

Besides the typo in the doctest, and the fact that Travis is still happy despite it, +1 for merge as well.

@ogrisel ogrisel changed the title [MRG] fix StratifiedShuffleSplit train and test overlap [MRG+1] fix StratifiedShuffleSplit train and test overlap Feb 25, 2016
@lesteve lesteve force-pushed the fix-stratified-shuffle-split-train-test-overlap branch from 95c3a29 to 7470513 Compare February 25, 2016 15:08
@amueller
Member

This is slightly surprising, as manually maintaining consistency between the two for two more major releases may be a bit burdensome

Well, there will be no changes to the deprecated code, and hopefully there are not too many bugs to fix ;)

@amueller
Member

whatsnew entry?

@amueller
Member

the logic is kinda tricky now. Can you add comments about the one-letter variables please?
Any ideas if this can be made more simple? (otherwise lgtm)

@lesteve
Member Author

lesteve commented Feb 26, 2016

whatsnew entry?

I'll add a whatsnew entry. Just for the record, this overlap between train and test happens only in some unlikely edge cases; you have to be pretty unlucky to hit it. Basically you have to:

  1. fall in the unassigned sample logic (n_train * p_i is of the form 2n + 0.5 with n being an integer)
  2. assign too many samples to train (n_train * p_i is of the form 2n + 1.5)

You also need 1. and 2. to coordinate in a specific fashion, as illustrated below.
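
A quick numeric illustration of those two cases (assuming the per-class train counts are computed with np.round, which in NumPy rounds halves to the nearest even integer):

import numpy as np

print(np.round(2.5))  # 2.0 -> a count of the form 2n + 0.5 is rounded down,
                      #        leaving a sample unassigned (case 1)
print(np.round(3.5))  # 4.0 -> a count of the form 2n + 1.5 is rounded up,
                      #        assigning one sample too many to train (case 2)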

the logic is kinda tricky now

I don't think the PR makes the logic trickier than it already was.

Can you add comments about the one-letter variables please?
Any ideas if this can be made more simple? (otherwise lgtm)

I was thinking about whether there was a way to simplify the code, in particular by getting rid of the unassigned-samples logic, but I wasn't able to find one right away. I can try to look at it in more detail and open a separate PR if I manage to get anywhere.

Commit message (truncated): "…in some edge cases and added test. Fix was applied in both sklearn.model_selection and sklearn.cross_validation."
@lesteve lesteve force-pushed the fix-stratified-shuffle-split-train-test-overlap branch from 7470513 to 07728d9 Compare February 26, 2016 07:25
@lesteve
Member Author

lesteve commented Feb 26, 2016

I'll add a whatsnew entry

whatsnew entry added.

@ogrisel
Member

ogrisel commented Feb 29, 2016

Can you add comments about the one-letter variables please?

Which variables?

@lesteve
Member Author

lesteve commented Feb 29, 2016

Which variables?

Not sure, but I am guessing @amueller meant p_i, n_i and t_i here. Since I am not changing this code, I propose to do that in a separate PR, especially if I can find a way to simplify the logic as I mentioned above.

amueller added a commit that referenced this pull request Feb 29, 2016

[MRG+1] fix StratifiedShuffleSplit train and test overlap
@amueller amueller merged commit 150afe6 into scikit-learn:master Feb 29, 2016
@amueller
Member

I did mean p_i, n_i and t_i. And yeah, the logic didn't become more complex. I was just hoping against hope it could be less complex ;)

@amueller
Member

Maybe we can make that a programming challenge lol

@lesteve lesteve deleted the fix-stratified-shuffle-split-train-test-overlap branch March 1, 2016 05:55
@lesteve
Member Author

lesteve commented Mar 1, 2016

I did mean p_i, n_i and t_i. And yeah, the logic didn't become more complex. I was just hoping against hope it could be less complex ;)

I'll try to take a closer look at this in the next few days.

@amueller
Member

amueller commented Mar 1, 2016

I think we could try to fix n_i and t_i before they are used, so that n_train - n_i.sum() == 0.
We "just" need to put some samples from class_counts - n_i - t_i into n_i and t_i.
I haven't found an idiomatic way to sample part of class_counts - n_i - t_i, though.

np.random.permutation(np.repeat(range(n_classes), class_counts - n_i - t_i)) is the best I can do.
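
To make that concrete, here is a hypothetical run of the expression above (the leftover counts are made up):

import numpy as np

# class_counts - n_i - t_i: per-class counts of leftover (unassigned)
# samples; here three classes with 2, 0 and 1 leftover samples each.
leftover = np.array([2, 0, 1])
n_classes = len(leftover)
# Repeat each class label once per leftover sample, then shuffle.
pool = np.random.permutation(np.repeat(range(n_classes), leftover))
print(pool)  # e.g. [0, 2, 0], a random ordering of the unassigned labels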
