
[MRG+1] fix sampling in stratified shuffle split #6472


Merged

Conversation

amueller
Member

@amueller amueller commented Mar 1, 2016

Fixes #6471.
This breaks some tests for stratified shuffle split. From my understanding, it actually now has the correct statistics, and the tests are wrong.

I'll remove the asserts etc. if we can agree on this implementation (and how the tests need to be changed).

This tries to adjust t_i and n_i before the sampling happens, and makes sure sum(t_i) == n_test and sum(n_i) == n_train.

Also, it adds in the "missing" samples based on their class frequency. So in expectation, this should be doing exactly the right thing (I think; I didn't have any coffee today yet).

The ugly bit is

np.bincount(rng.permutation(np.repeat(range(n_classes), left_over_per_class))[:missing_train], minlength=n_classes)

which is the only way I knew to say "sample missing_train many points from the classes in left_over_per_class". I feel there should be a better way to say that.
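
For illustration, here is a minimal, self-contained sketch of what that one-liner does; the values of left_over_per_class and missing_train are made up:

    import numpy as np

    rng = np.random.RandomState(0)
    n_classes = 3
    left_over_per_class = np.array([2, 1, 3])  # hypothetical leftover pool per class
    missing_train = 2

    # Build a pool holding each class index once per leftover sample,
    # shuffle it, take the first missing_train entries, and count how
    # many landed in each class.
    pool = np.repeat(np.arange(n_classes), left_over_per_class)
    extra = np.bincount(rng.permutation(pool)[:missing_train],
                        minlength=n_classes)
    # extra sums to missing_train and never exceeds left_over_per_class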

The first test that breaks checks that if something with class_counts = [4, 3, 5] is split into 8 training and 4 test points, the class probabilities in the training and test sets are the same.
With the given random state, it is split into [2, 1, 1] and [2, 2, 4], which I think is fine; the test disagrees.

@amueller amueller changed the title fix sampling in stratified shuffle split, break tests that test sampl… fix sampling in stratified shuffle split, break tests that test sampling Mar 1, 2016
@amueller
Member Author

amueller commented Mar 1, 2016

Ah, it's actually equivalent to

            to_n_i = np.random.choice(n_classes, missing_train, replace=False,
                                      p=left_over_per_class /
                                      left_over_per_class.sum())

I think

@amueller amueller force-pushed the stratified_shuffle_split_simplify branch 2 times, most recently from 39e7e1c to da032ec Compare March 1, 2016 23:07
    n_i = np.round(n_train * p_i).astype(int)
    # n_i = number of samples per class in training set
    n_i = np.floor(n_train * p_i).astype(int)
    # n_i = number of samples per class in test set
Member

(Minor nitpick) n_i --> t_i

@raghavrv
Member

raghavrv commented Mar 2, 2016

Is this a fix that should go back to cross_validation.py (and also be backported to the recently released version)?

@lesteve
Member

lesteve commented Mar 2, 2016

numpy.random.RandomState.choice was added in numpy 1.7, according to the numpy doc, so you get some errors like this:

    choice = rng.choice(n_classes, n_samples, replace=False,

AttributeError: 'mtrand.RandomState' object has no attribute 'choice'

@lesteve
Member

lesteve commented Mar 2, 2016

I do agree there was a problem with train getting too many samples, as you noted in the associated issue, and it's great that you managed to remove the dodgy second-pass "unassigned samples" logic.

@GaelVaroquaux
Member

numpy.random.RandomState.choice was added in numpy 1.7,

Use sklearn.utils.random.choice

To find such a thing, don't hesitate to do a git grep in the scikit-learn codebase.
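
A sketch of what that would look like here, assuming the backported helper mirrors numpy.random.choice's signature with an extra random_state argument (the surrounding variable values are made up):

    import numpy as np
    from sklearn.utils.random import choice  # backport for numpy < 1.7

    rng = np.random.RandomState(0)
    n_classes = 3
    missing_train = 2
    left_over_per_class = np.array([2.0, 1.0, 3.0])

    to_n_i = choice(n_classes, missing_train, replace=False,
                    p=left_over_per_class / left_over_per_class.sum(),
                    random_state=rng)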

@vighneshbirodkar
Contributor

@amueller What exactly should I be looking for here?

@amueller
Member Author

amueller commented Mar 2, 2016

cc @MechCoder

@amueller amueller force-pushed the stratified_shuffle_split_simplify branch from da032ec to 966399e Compare March 2, 2016 19:56
@amueller
Member Author

amueller commented Mar 2, 2016

hm, maybe the best choice would be to remove the randomness in picking t_i and n_i and use a mode of the multivariate hypergeometric instead of sampling from it. Not sure if it is easy to compute that. I asked on stackoverflow ^^

@amueller amueller force-pushed the stratified_shuffle_split_simplify branch from 966399e to d5063b6 Compare March 3, 2016 19:22
@amueller
Member Author

amueller commented Mar 3, 2016

ok so I can't spend that much more time on it, but I think the newest version without sampling is ok.
I'm using a bad approximation of the mode of the multivariate hypergeometric. Here "bad approximation" means that it can be off by one for each class. So that should not be that bad for us. This is all about how to do the rounding.

What I'm doing now is computing the most likely draw from the original class_counts with n_train many points. Then, from class_counts minus what is in the training part, I get the most likely outcome for drawing n_test many points.

There is an alternative, which is "draw n_train points from class_counts, draw n_test points from class_counts, and make sure that they don't sum up to more than class_counts". While this behavior might be a bit more "intuitive", the "ensure that they don't sum up to more than class_counts" part is more or less what was buggy in #6121.

So I'd rather stay with the simpler semantics of doing one draw and then the other draw from what's left over.
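
To make the two-stage rounding concrete, here is an editor's sketch; approximate_mode below is a hypothetical stand-in for the floor-then-distribute-the-remainder idea described above (the PR's actual helper additionally breaks ties at random):

    import numpy as np

    def approximate_mode(class_counts, n_draws):
        # Floor the continuous allocation, then give the remaining draws
        # to the classes with the largest fractional parts.
        continuous = n_draws * class_counts / class_counts.sum()
        floored = np.floor(continuous).astype(int)
        need = n_draws - floored.sum()
        if need > 0:
            remainder = continuous - floored
            floored[np.argsort(remainder)[::-1][:need]] += 1
        return floored

    class_counts = np.array([4, 3, 5])
    n_train, n_test = 8, 4
    n_i = approximate_mode(class_counts, n_train)        # -> [3, 2, 3]
    t_i = approximate_mode(class_counts - n_i, n_test)   # drawn from what's left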

I have no idea what to do with the failing tests, though.
The first test is now failing because of the following situation:

(Pdb) p np.bincount(y[test])
array([2, 1, 1, 1])
(Pdb) p np.bincount(y[train])
array([2, 3, 3, 2])
(Pdb) p np.bincount(y)
array([4, 4, 4, 3])

 x: array([ 0.2,  0.3,  0.3,  0.2])
 y: array([ 0.4,  0.2,  0.2,  0.2])
>>  raise AssertionError('\nArrays are not almost equal to 1 decimals\n\n(mismatch 25.0%)\n x: array([ 0.2,  0.3,  0.3,  0.2])\n y: array([ 0.4,  0.2,  0.2,  0.2])')

The distributions of the training and test parts are not the same to one decimal.
But that's impossible to achieve in this setup! I checked, and a training set of [2, 3, 3, 2] is actually the true mode. So we do find the most likely configuration of the training set. As we have n_train + n_test = n_samples in this example, we have no choice in how to create the test set.
I have no idea how to fix the test, or why it was passing before. Probably because of one of the bugs.
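
For reference, the failing assertion boils down to this hypothetical reproduction using the counts above:

    import numpy as np
    from numpy.testing import assert_array_almost_equal

    train_counts = np.array([2, 3, 3, 2])
    test_counts = np.array([2, 1, 1, 1])
    # 0.2 vs 0.4 in the first class differs by more than the one-decimal
    # tolerance, so this raises the AssertionError shown above.
    assert_array_almost_equal(train_counts / train_counts.sum(),
                              test_counts / test_counts.sum(), 1)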

@amueller
Member Author

amueller commented Mar 3, 2016

ping @GaelVaroquaux who wrote the tests ;)

@amueller amueller force-pushed the stratified_shuffle_split_simplify branch 3 times, most recently from 9772b4e to db39f25 Compare March 3, 2016 19:58
@amueller
Member Author

amueller commented Mar 3, 2016

So indeed, the test that is failing passes on master with this configuration:

ipdb> p np.bincount(y)
array([4, 4, 4, 3])
ipdb> p np.bincount(y[train])
array([3, 3, 3, 2])
ipdb> p np.bincount(y[test])
array([1, 1, 1, 1])

So it violated the n_train / n_test sizes and put one more in the training set than it should have, to balance the classes.
@GaelVaroquaux was that expected behavior?
I mean it kind of makes sense, but it is non-obvious to me.

@MechCoder
Member

I can't think clearly enough on how to attain a balance between:

a] maintaining the train_size and test_size
b] maintaining the proportionality of the class labels between the train and test sets :(

@MechCoder
Member

It is also not clear to me how this block of the previous code preserved the class proportionality. Was it simply that it was not tested enough?

        if len(train) + len(test) < n_train + n_test:
            # We complete by affecting randomly the missing indexes
            missing_indices = np.where(bincount(train + test,
                                                minlength=len(y)) == 0)[0]
            missing_indices = rng.permutation(missing_indices)
            n_missing_train = n_train - len(train)
            n_missing_test = n_test - len(test)

            if n_missing_train > 0:
                train.extend(missing_indices[:n_missing_train])
            if n_missing_test > 0:
                test.extend(missing_indices[-n_missing_test:])

@amueller
Member Author

amueller commented Mar 8, 2016

@MechCoder it doesn't. It just samples randomly. And yet there were very strict tests that passed.

My intuition of the problem was: n_train and n_test are specified directly by the user, so these are hard limits. The stratification is a "best effort" kind of thing. The best possible thing is to pick a mode of the multivariate hypergeometric, but doing that exactly would require quite a bit of code. So I do an approximation that might be off by one per class (I have not proven this bound).
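
A brute-force sanity check of that bound is easy on a tiny case; the snippet below (an editor's sketch, not part of the PR) enumerates all feasible allocations and finds the exact mode by maximizing the number of ways to realize each one:

    import numpy as np
    from itertools import product
    from scipy.special import comb

    class_counts = np.array([4, 3, 5])
    n_draws = 8

    best, best_ways = None, -1.0
    for alloc in product(*(range(c + 1) for c in class_counts)):
        if sum(alloc) != n_draws:
            continue
        # Number of ways to draw this allocation without replacement:
        # prod over classes of C(class_counts[i], alloc[i]).
        ways = np.prod(comb(class_counts, np.array(alloc)))
        if ways > best_ways:
            best, best_ways = alloc, ways
    print(best)  # exact mode; compare with the approximation per class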

@amueller amueller added this to the 0.18 milestone Mar 8, 2016
@amueller amueller force-pushed the stratified_shuffle_split_simplify branch from db39f25 to 55a79dc Compare August 25, 2016 21:33
@amueller amueller changed the title fix sampling in stratified shuffle split, break tests that test sampling [MRG] fix sampling in stratified shuffle split Aug 26, 2016
@amueller
Member Author

This should be good now. Reviews please?

@agramfort
Member

CIs are not happy

@amueller
Member Author

Next try. Actually computing the mode (approximately) now, not sampling, to satisfy the proportion tests. But now breaking ties at random to satisfy the IID test.

@amueller amueller force-pushed the stratified_shuffle_split_simplify branch from 96e902f to c865c83 Compare September 8, 2016 14:54
@amueller
Member Author

amueller commented Sep 8, 2016

CI error is server error in curl


It is the most likely outcome of drawing n_draws many
samples from the population given by class_counts.

Member

Could you please add a doctest that would serve both as an example, to make it easier to understand what the function is doing, and as a unittest / sanity check on a couple of simple cases?

Member Author

are doctests run on private functions?

Member

I don't know, but I would expect so. Have you checked?

Member Author

I have, they are :)

@ogrisel
Member

ogrisel commented Sep 8, 2016

Weird, the appveyor error report happened on @agramfort's account instead of sklearn-ci. Anyway, this is not a big deal and can be ignored.

There are still some inline comments to tackle in the diff view but other than that +1 as well.

@agramfort
Member

agramfort commented Sep 8, 2016 via email

@amueller
Member Author

amueller commented Sep 8, 2016

@ogrisel I added (hopefully helpful) doctests.

array([0, 1, 1, 0])
>>> _approximate_mode(class_counts=np.array([2, 2, 2, 1]),
...                   n_draws=2, rng=42)
array([1, 1, 0, 0])
Member

Nice. To me those examples explain very intuitively what the function actually does.

@ogrisel
Member

ogrisel commented Sep 9, 2016

LGTM, let me squash-merge that. Thanks for the fix @amueller.

@ogrisel ogrisel merged commit 31a4691 into scikit-learn:master Sep 9, 2016
@amueller
Member Author

amueller commented Sep 9, 2016

thanks for the reviews and merge :)

@amueller amueller deleted the stratified_shuffle_split_simplify branch September 9, 2016 17:14
yangarbiter added a commit to yangarbiter/scikit-learn that referenced this pull request Sep 10, 2016
Fix sampling in stratified shuffle split, break tests that test sampling.
rsmith54 pushed a commit to rsmith54/scikit-learn that referenced this pull request Sep 14, 2016
Fix sampling in stratified shuffle split, break tests that test sampling.
TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
Fix sampling in stratified shuffle split, break tests that test sampling.
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
Fix sampling in stratified shuffle split, break tests that test sampling.
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
Fix sampling in stratified shuffle split, break tests that test sampling.