[MRG+1] fix sampling in stratified shuffle split #6472
Conversation
Ah, it's actually equivalent to `n_i = np.random.choice(n_classes, missing_train, replace=False, p=left_over_per_class / left_over_per_class.sum())`, I think.
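For reference, a standalone sketch of the call being discussed; the array values here are made up purely for illustration:

```python
import numpy as np

rng = np.random.RandomState(42)

left_over_per_class = np.array([2.0, 0.0, 1.0])  # hypothetical leftover counts
missing_train = 2                                # hypothetical shortfall in train

# Draw `missing_train` distinct class indices, weighted by how many
# leftover samples each class still has; a class with zero leftovers
# gets probability zero and can never be picked.
extra_classes = rng.choice(len(left_over_per_class), missing_train,
                           replace=False,
                           p=left_over_per_class / left_over_per_class.sum())
print(extra_classes)
```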
```python
n_i = np.round(n_train * p_i).astype(int)
# n_i = number of samples per class in training set
n_i = np.floor(n_train * p_i).astype(int)
# n_i = number of samples per class in test set
```
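As an aside (my own illustration, not from the diff): rounding each class count independently can overshoot the requested train size, which is exactly the "too many samples in train" problem noted below, while flooring can only undershoot:

```python
import numpy as np

class_counts = np.array([1, 1, 1])  # hypothetical: three singleton classes
p_i = class_counts / class_counts.sum()
n_train = 2

# Rounding each class independently can overshoot the requested size:
print(np.round(n_train * p_i).astype(int).sum())  # 3, although n_train == 2
# Flooring can only undershoot; the shortfall is then distributed
# across classes in a separate step:
print(np.floor(n_train * p_i).astype(int).sum())  # 0
```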
(Minor nitpick) `n_i` --> `t_i`

Is this a fix that should go back to
numpy.random.RandomState.choice was added in numpy 1.7, according to the numpy doc, so you get some errors like this:
I do agree there was a problem that train was getting too many samples, as you noted in the associated issue, and it's great that you managed to remove the dodgy second-pass "unassigned samples" logic.

Use `sklearn.utils.random.choice`. To find such a thing, don't hesitate to do a `git grep` in the scikit-learn codebase.
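A hedged sketch of what that swap might look like, assuming the backport mirrors `numpy.random.choice`'s `(a, size, replace, p)` signature (I have not verified the exact signature here):

```python
import numpy as np
from sklearn.utils.random import choice  # backport used when numpy < 1.7

left_over_per_class = np.array([2.0, 0.0, 1.0])  # hypothetical leftovers
missing_train = 2

# Same call as above, routed through the scikit-learn backport
# (signature assumed, not verified).
extra_classes = choice(len(left_over_per_class), missing_train, replace=False,
                       p=left_over_per_class / left_over_per_class.sum())
```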
@amueller What exactly should I be looking for here?

cc @MechCoder
hm maybe the best choice would be to remove the randomness in picking
ok so I can't spend that much more time on it, but I think the newest version without sampling is ok. What I'm doing now is computing the most likely draw from the original `class_counts`.

There is an alternative, which is "draw n_train points from class_counts, draw n_test points from class_counts, and make sure that they don't sum up to more than class_counts". While this behavior might be a bit more "intuitive", the "ensure that they don't sum up to more than class_counts" part is more or less what was buggy in #6121. So I'd rather stay with the simpler semantics of doing one draw and then the other draw from what's left over.

I have no idea what to do with the failing tests, though.
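A minimal standalone sketch of those semantics (my own illustration, not the PR's code; `most_likely_counts` is a hypothetical stand-in for the mode computation discussed further down):

```python
import numpy as np

def most_likely_counts(class_counts, n_draws):
    # Hypothetical stand-in for the mode computation (see the sketch
    # after a later comment): largest-remainder rounding of the
    # expected per-class counts.
    continuous = n_draws * class_counts / class_counts.sum()
    floored = np.floor(continuous).astype(int)
    need = int(n_draws - floored.sum())
    floored[np.argsort(continuous - floored)[::-1][:need]] += 1
    return floored

class_counts = np.array([4, 3, 5])               # hypothetical class sizes
n_i = most_likely_counts(class_counts, 8)        # draw the training counts first
t_i = most_likely_counts(class_counts - n_i, 4)  # then draw the test counts
                                                 # from what is left over
print(n_i, t_i)  # n_i + t_i cannot exceed class_counts by construction
```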
The distributions of the training and test parts are not the same to one digit.

ping @GaelVaroquaux who wrote the tests ;)
So indeed, the test that is failing passes on master with this configuration:

So it violated the n_train / n_test sizes and put one more in the training set than it should have, to balance the classes.
I can't think clearly enough on how to attain a balance between: a) maintaining the
It is also not clear to me how this block of the previous code preserves the class proportionality. Was it simply that it was not tested enough?
@MechCoder it doesn't. It just samples randomly. There are very strict tests that passed. My intuition of the problem was: n_train and n_test are specified directly by the user, so these are hard limits. The stratification is a "best effort" kind of thing. The best possible is to pick a mode of the multivariate hypergeometric, but doing that exactly would require quite a bit of code. So I do an approximation that might be off by one per class (I have not proven this bound).
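A hedged reconstruction of that approximation from this description (not the merged code verbatim): floor the expected per-class counts, then hand the remaining draws to the classes with the largest fractional remainders, breaking ties at random:

```python
import numpy as np

def approx_hypergeom_mode(class_counts, n_draws, rng):
    """Approximate the mode of the multivariate hypergeometric:
    the most likely per-class outcome of drawing n_draws samples
    without replacement from a population with class_counts."""
    continuous = n_draws * class_counts / class_counts.sum()
    floored = np.floor(continuous)
    need = int(n_draws - floored.sum())
    remainder = continuous - floored
    # Sort by remainder, largest first; the random permutation is a
    # secondary key, so ties among equal remainders break at random.
    order = np.lexsort((rng.permutation(len(class_counts)), -remainder))
    floored[order[:need]] += 1
    return floored.astype(int)

rng = np.random.RandomState(0)
print(approx_hypergeom_mode(np.array([2, 2, 2, 1]), 2, rng))
# two of the three tied classes get one draw each, e.g. [1 0 1 0]
```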
This should be good now. Reviews please?

CIs are not happy
Next try. Actually computing the mode (approximately) now, not sampling, to satisfy the proportion tests. But now breaking ties at random to satisfy the IID test. |
CI error is a server error in curl.
```
It is the most likely outcome of drawing n_draws many
samples from the population given by class_counts.
```
Could you please add a doctest that would serve both as an example to make it easier to understand what the function is doing and also serve as a unittest / sanity check on a couple simple cases?
are doctests run on private functions?
I don't know, but I would expect so. Have you checked?
I have, they are :)
Weird, the appveyor error report has happened on @agramfort's account instead of sklearn-ci. Anyway this is not a big deal and can be ignored. There are still some inline comments to tackle in the diff view, but other than that +1 as well.
this was reported before. Maybe I clicked on a wrong button on appveyor...
@ogrisel I added (hopefully helpful) doctests.
```
array([0, 1, 1, 0])
>>> _approximate_mode(class_counts=np.array([2, 2, 2, 1]),
...                   n_draws=2, rng=42)
array([1, 1, 0, 0])
```
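To unpack the second example (my own arithmetic): every expected count floors to zero, so both draws are assigned by fractional remainder, and the first three classes are tied, hence the rng-dependent tie-break:

```python
import numpy as np

class_counts = np.array([2, 2, 2, 1])
n_draws = 2

continuous = n_draws * class_counts / class_counts.sum()
print(continuous)            # approx. [0.571 0.571 0.571 0.286]
print(np.floor(continuous))  # [0. 0. 0. 0.] -> both draws are tie-broken
```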
Nice. To me those examples explain very intuitively what the function actually does.
LGTM, let me squash-merge that. Thanks for the fix @amueller.

thanks for the reviews and merge :)
Fix sampling in stratified shuffle split, break tests that test sampling.
Fixes #6471.
This breaks some tests for stratified shuffle split. From my understanding, it actually now has the correct statistics, and the tests are wrong.
I'll remove the asserts etc if we can agree on this implementation (and how the tests need to be changed).
This tries to adjust `t_i` and `n_i` before the sampling happens, and makes sure `sum(t_i) == n_test` and `sum(n_i) == n_train`. Also, it adds in the "missing" samples based on their class frequency. So on expectation, this should be doing exactly the right thing (I think, I didn't have any coffee today yet).

The ugly bit is

which is the only way I knew to say "sample `missing_train` many points from the classes in `left_over_per_class`". I feel there should be a better way to say that.

The first test that breaks checks that if something with `class_counts = [4, 3, 5]` is split into 8 training and test points, the class probabilities in the training and test set are the same. With the given random state, it is split into `[2, 1, 1]` and `[2, 2, 4]`, which I think is fine; the test disagrees.
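As a quick sanity check on the numbers quoted above (my own arithmetic, not from the PR):

```python
import numpy as np

class_counts = np.array([4, 3, 5])
train, test = np.array([2, 1, 1]), np.array([2, 2, 4])

print(train + test)                       # [4 3 5] == class_counts
print(train / train.sum())                # [0.5  0.25 0.25]
print(test / test.sum())                  # [0.25 0.25 0.5 ]
print(class_counts / class_counts.sum())  # approx. [0.333 0.25 0.417]
```

The two parts do add up to `class_counts`, but the training proportions drift noticeably from the overall class proportions, which is what the strict proportionality test objects to.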