[MRG+1] fix StratifiedShuffleSplit train and test overlap #6379
Conversation
# We complete by randomly assigning the unassigned (missing) indexes
missing_idx = np.where(bincount(train + test,
                                minlength=len(self.y)) == 0,
                       )[0]
missing_idx = rng.permutation(missing_idx)
train.extend(missing_idx[:(self.n_train - len(train))])
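For context, here is a small standalone illustration (with made-up index lists, not the library's code path) of what this bincount idiom computes: the indices that ended up in neither train nor test.

```python
import numpy as np

# hypothetical per-class draws; sample 5 was assigned to neither side
train, test = [0, 2, 4], [1, 3]
n_samples = 6

# bincount over the union of both index lists is zero exactly at the
# positions that were never drawn
missing_idx = np.where(np.bincount(train + test, minlength=n_samples) == 0)[0]
print(missing_idx)  # [5]
```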
The crux of the problem is that self.n_train - len(train) can be negative if you are unlucky with the rounding, which causes a problem in the slicing below.
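To illustrate the failure mode, here is a standalone sketch with made-up numbers (not the actual split code): a negative slice bound in Python does not give an empty selection, it slices from the end, so the train side can silently pick up extra samples.

```python
import numpy as np

missing_idx = np.array([7, 3, 9])  # hypothetical leftover sample indices
n_train = 5
current_train_size = 6             # rounding already overshot by one sample

stop = n_train - current_train_size  # -1, not 0
extra = missing_idx[:stop]           # negative stop -> array([7, 3]), not empty
print(extra)

# With the unlucky rounding, train gets samples it should not, and the
# complementary slice used for the test side can then hand out some of the
# same indices again, which is the train/test overlap reported in #6121.
```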
LGTM
I think this should be included in the model_selection version as well.
Good point, the same code is in sklearn/model_selection/_split.py.
I wasn't aware of that. Just curious, are model_selection.StratifiedShuffleSplit and cross_validation.StratifiedShuffleSplit supposed to be just a copy and paste of each other?
For the most part yes!
Everything in cross_validation is deprecated. As we changed the API we had to duplicate the code for the deprecation period.
(force-pushed from 32b99c0 to 115517e)
OK I just pushed a commit with the fix for both model_selection.StratifiedShuffleSplit and cross_validation.StratifiedShuffleSplit.
This is slightly surprising as manually maintaining consistency between the two for two more major releases may be a bit burdensome. I guess this was discussed in detail though and it was deemed a better solution than, say, sharing the common code between the two.
(force-pushed from 115517e to 95c3a29)
Indeed ;( An attempt to reuse the code from model_selection has been made in #5568. Reviews would be welcome 😁
And thanks for extending the fix to model_selection as well.
@@ -1050,7 +1050,7 @@ class StratifiedShuffleSplit(BaseShuffleSplit):
     >>> y = np.array([0, 0, 1, 1])
     >>> sss = StratifiedShuffleSplit(n_iter=3, test_size=0.5, random_state=0)
     >>> sss.get_n_splits(X, y)
-    3
+    3n
typo?
I wonder why the docstring tests pass...
I fixed the typo and I'll take a closer look at why the docstring didn't seem to be run.
why the docstring didn't seem to be run.
Seems like nosetests ignores files starting with an underscore. Maybe we should use ignore-files=DONTIGNOREANYFILES, as was mentioned here.
For some reason you have to use ignore-files=setup.py, because trying to import sklearn/cluster/setup.py confuses the hell out of nose; not sure why.
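As a rough illustration of why the doctests were skipped (this assumes nose's commonly documented default ignore patterns; it is not scikit-learn's actual test configuration):

```python
import re

# nose's default file-ignore patterns: hidden files, underscore-prefixed
# files, and setup.py (the comments above discuss relaxing the first two
# while keeping setup.py excluded)
DEFAULT_IGNORE = [re.compile(r'^\.'), re.compile(r'^_'), re.compile(r'^setup\.py$')]

def would_be_collected(filename, ignore_patterns=DEFAULT_IGNORE):
    """Return True if a file with this name passes the ignore filter."""
    return not any(pattern.search(filename) for pattern in ignore_patterns)

print(would_be_collected("_split.py"))      # False: underscore prefix, doctests skipped
print(would_be_collected("setup.py"))       # False: explicitly ignored
print(would_be_collected("test_split.py"))  # True
```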
Besides the typo in the doctest and the fact that Travis is still happy about it, +1 for merge as well.
(force-pushed from 95c3a29 to 7470513)
Well, there will be no changes to the deprecated code, and hopefully there are not too many bugs to fix ;)
whatsnew entry?
The logic is kinda tricky now. Can you add comments about the one-letter variables, please?
I'll add a whatsnew entry. Just for the record, this overlap between train and test happens only in some unlikely edge cases. You have to be pretty unlucky to hit it. Basically you have to:
You also need 1. and 2. to coordinate in a specific fashion as well.
I don't think the PR makes the logic trickier than it already was.
I was thinking about whether there was a way to simplify the code, in particular by getting rid of the unassigned samples logic, but I wasn't able to find one right away. I can try to look at it in more detail and open a separate PR if I manage to get anywhere.
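Since the PR also adds a non-regression test, here is a rough sketch of what such a check could look like (hypothetical data and parameters, not the PR's actual test; the constructor argument is n_iter in the version shown in the diff above and was renamed n_splits in later releases):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# hypothetical class counts; uneven sizes make the rounding edge cases
# more likely to show up
y = np.array([0] * 7 + [1] * 4 + [2] * 5)
X = np.zeros((len(y), 1))  # features are irrelevant to the split itself

# n_iter in the 0.18-dev API quoted above; later versions call this n_splits
sss = StratifiedShuffleSplit(n_iter=20, test_size=0.33, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    overlap = np.intersect1d(train_idx, test_idx)
    assert overlap.size == 0, "train and test share samples: %r" % overlap
```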
Fix StratifiedShuffleSplit train and test overlap in some edge cases and added test. Fix was applied in both sklearn.model_selection and sklearn.cross_validation.
(force-pushed from 7470513 to 07728d9)
whatsnew entry added.
Which variables?
Merged branch …ain-test-overlap: [MRG+1] fix StratifiedShuffleSplit train and test overlap
I did mean
Maybe we can make that a programming challenge lol
I'll try to take a closer look at this in the next few days.
I think we could try to fix
Fix #6121.