[MRG+1] fix sampling in stratified shuffle split #6472
Conversation
Ah, it's actually equivalent to `n_i = np.random.choice(n_classes, missing_train, replace=False, p=left_over_per_class / left_over_per_class.sum())`, I think.
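For reference, a standalone sketch of the call being discussed; the array values here are made up purely for illustration:

```python
import numpy as np

rng = np.random.RandomState(42)

left_over_per_class = np.array([2.0, 0.0, 1.0])  # hypothetical leftover counts
missing_train = 2                                # hypothetical shortfall in train

# Draw `missing_train` distinct class indices, weighted by how many
# leftover samples each class still has; a class with zero leftovers
# gets probability zero and can never be picked.
extra_classes = rng.choice(len(left_over_per_class), missing_train,
                           replace=False,
                           p=left_over_per_class / left_over_per_class.sum())
print(extra_classes)
```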
```python
n_i = np.round(n_train * p_i).astype(int)
# n_i = number of samples per class in training set
n_i = np.floor(n_train * p_i).astype(int)
# n_i = number of samples per class in test set
```
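As an aside (my own illustration, not from the diff): rounding each class count independently can overshoot the requested train size, which is exactly the "too many samples in train" problem noted below, while flooring can only undershoot:

```python
import numpy as np

class_counts = np.array([1, 1, 1])  # hypothetical: three singleton classes
p_i = class_counts / class_counts.sum()
n_train = 2

# Rounding each class independently can overshoot the requested size:
print(np.round(n_train * p_i).astype(int).sum())  # 3, although n_train == 2
# Flooring can only undershoot; the shortfall is then distributed
# across classes in a separate step:
print(np.floor(n_train * p_i).astype(int).sum())  # 0
```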
(Minor nitpick) `n_i` --> `t_i`

Is this a fix that should go back to
numpy.random.RandomState.choice was added in numpy 1.7, according to the numpy doc, so you get some errors like this:
I do agree there was a problem that train was getting too many samples, as you noted in the associated issue, and it's great that you managed to remove the dodgy second-pass "unassigned samples" logic.

Use `sklearn.utils.random.choice`. To find such a thing, don't hesitate to do a `git grep` in the scikit-learn codebase.
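A hedged sketch of what that swap might look like, assuming the backport mirrors `numpy.random.choice`'s `(a, size, replace, p)` signature (I have not verified the exact signature here):

```python
import numpy as np
from sklearn.utils.random import choice  # backport used when numpy < 1.7

left_over_per_class = np.array([2.0, 0.0, 1.0])  # hypothetical leftovers
missing_train = 2

# Same call as above, routed through the scikit-learn backport
# (signature assumed, not verified).
extra_classes = choice(len(left_over_per_class), missing_train, replace=False,
                       p=left_over_per_class / left_over_per_class.sum())
```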
@amueller What exactly should I be looking for here?

cc @MechCoder
hm maybe the best choice would be to remove the randomness in picking
ok so I can't spend that much more time on it, but I think the newest version without sampling is ok. What I'm doing now is computing the most likely draw from the original `class_counts`.

There is an alternative, which is "draw n_train points from class_counts, draw n_test points from class_counts, and make sure that they don't sum up to more than class_counts". While this behavior might be a bit more "intuitive", the "ensure that they don't sum up to more than class_counts" part is more or less what was buggy in #6121. So I'd rather stay with the simpler semantics of doing one draw and then the other draw from what's left over.

I have no idea what to do with the failing tests, though.
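A minimal standalone sketch of those semantics (my own illustration, not the PR's code; `most_likely_counts` is a hypothetical stand-in for the mode computation discussed further down):

```python
import numpy as np

def most_likely_counts(class_counts, n_draws):
    # Hypothetical stand-in for the mode computation (see the sketch
    # after a later comment): largest-remainder rounding of the
    # expected per-class counts.
    continuous = n_draws * class_counts / class_counts.sum()
    floored = np.floor(continuous).astype(int)
    need = int(n_draws - floored.sum())
    floored[np.argsort(continuous - floored)[::-1][:need]] += 1
    return floored

class_counts = np.array([4, 3, 5])               # hypothetical class sizes
n_i = most_likely_counts(class_counts, 8)        # draw the training counts first
t_i = most_likely_counts(class_counts - n_i, 4)  # then draw the test counts
                                                 # from what is left over
print(n_i, t_i)  # n_i + t_i cannot exceed class_counts by construction
```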
The distributions of the training and test parts are not the same to one digit.

ping @GaelVaroquaux who wrote the tests ;)
So indeed, the test that is failing passes on master with this configuration:

So it violated the n_train / n_test sizes and put one more in the training set than it should have, to balance the classes.
I can't think clearly enough on how to attain a balance between: a) maintaining the
It is also not clear to me how this block of the previous code preserves the class proportionality. Was it simply that it was not tested enough?
@MechCoder it doesn't. It just samples randomly. There are very strict tests that passed. My intuition of the problem was: n_train and n_test are specified directly by the user, so these are hard limits. The stratification is a "best effort" kind of thing. The best possible is to pick a mode of the multivariate hypergeometric, but doing that exactly would require quite a bit of code. So I do an approximation that might be off by one per class (I have not proven this bound).
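A hedged reconstruction of that approximation from this description (not the merged code verbatim): floor the expected per-class counts, then hand the remaining draws to the classes with the largest fractional remainders, breaking ties at random:

```python
import numpy as np

def approx_hypergeom_mode(class_counts, n_draws, rng):
    """Approximate the mode of the multivariate hypergeometric:
    the most likely per-class outcome of drawing n_draws samples
    without replacement from a population with class_counts."""
    continuous = n_draws * class_counts / class_counts.sum()
    floored = np.floor(continuous)
    need = int(n_draws - floored.sum())
    remainder = continuous - floored
    # Sort by remainder, largest first; the random permutation is a
    # secondary key, so ties among equal remainders break at random.
    order = np.lexsort((rng.permutation(len(class_counts)), -remainder))
    floored[order[:need]] += 1
    return floored.astype(int)

rng = np.random.RandomState(0)
print(approx_hypergeom_mode(np.array([2, 2, 2, 1]), 2, rng))
# two of the three tied classes get one draw each, e.g. [1 0 1 0]
```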
This should be good now. Reviews please?

CIs are not happy
Next try. Actually computing the mode (approximately) now, not sampling, to satisfy the proportion tests. But now breaking ties at random to satisfy the IID test. |
CI error is a server error in curl.
```
It is the most likely outcome of drawing n_draws many
samples from the population given by class_counts.
```
Could you please add a doctest that would serve both as an example to make it easier to understand what the function is doing and also serve as a unittest / sanity check on a couple simple cases?
are doctests run on private functions?
I don't know, but I would expect so. Have you checked?
I have, they are :)
Weird, the appveyor error report has happened on @agramfort's account instead of sklearn-ci. Anyway this is not a big deal and can be ignored. There are still some inline comments to tackle in the diff view, but other than that +1 as well.
this was reported before. Maybe I clicked on a wrong button on appveyor...
@ogrisel I added (hopefully helpful) doctests.
```
array([0, 1, 1, 0])
>>> _approximate_mode(class_counts=np.array([2, 2, 2, 1]),
...                   n_draws=2, rng=42)
array([1, 1, 0, 0])
```
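To unpack the second example (my own arithmetic): every expected count floors to zero, so both draws are assigned by fractional remainder, and the first three classes are tied, hence the rng-dependent tie-break:

```python
import numpy as np

class_counts = np.array([2, 2, 2, 1])
n_draws = 2

continuous = n_draws * class_counts / class_counts.sum()
print(continuous)            # approx. [0.571 0.571 0.571 0.286]
print(np.floor(continuous))  # [0. 0. 0. 0.] -> both draws are tie-broken
```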
Nice. To me those examples explain very intuitively what the function actually does.
LGTM, let me squash-merge that. Thanks for the fix @amueller.

thanks for the reviews and merge :)
Fix sampling in stratified shuffle split, break tests that test sampling.
Fixes #6471.
This breaks some tests for stratified shuffle split. From my understanding, it actually now has the correct statistics, and the tests are wrong.
I'll remove the asserts etc if we can agree on this implementation (and how the tests need to be changed).
This tries to adjust `t_i` and `n_i` before the sampling happens, and makes sure `sum(t_i) == n_test` and `sum(n_i) == n_train`. Also, it adds in the "missing" samples based on their class frequency. So on expectation, this should be doing exactly the right thing (I think, I didn't have any coffee today yet).

The ugly bit is

which is the only way I knew to say "sample `missing_train` many points from the classes in `left_over_per_class`". I feel there should be a better way to say that.

The first test that breaks checks that if something with `class_counts = [4, 3, 5]` is split into 8 training and test points, the class probabilities in the training and test set are the same. With the given random state, it is split into `[2, 1, 1]` and `[2, 2, 4]`, which I think is fine; the test disagrees.
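As a quick sanity check on the numbers quoted above (my own arithmetic, not from the PR):

```python
import numpy as np

class_counts = np.array([4, 3, 5])
train, test = np.array([2, 1, 1]), np.array([2, 2, 4])

print(train + test)                       # [4 3 5] == class_counts
print(train / train.sum())                # [0.5  0.25 0.25]
print(test / test.sum())                  # [0.25 0.25 0.5 ]
print(class_counts / class_counts.sum())  # approx. [0.333 0.25 0.417]
```

The two parts do add up to `class_counts`, but the training proportions drift noticeably from the overall class proportions, which is what the strict proportionality test objects to.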