[MRG+2] switch to multinomial composition for mixture sampling #7702
Conversation
This may be a problem, not sure. This reminds me of the issue we had in StratifiedShuffleSplit #6472, where drawing samples for each group did not add up to the total number of samples. Although I didn't really understand the details, I believe @amueller used some kind of approximation to avoid randomly sampling.
It's not a problem. On my computer, for weights of shape `(1000000,)`:

```
%timeit np.random.multinomial(100000000, weights).astype(int)
10 loops, best of 3: 135 ms per loop
```
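To illustrate the underlying issue, here is a minimal NumPy sketch (not scikit-learn code): rounding each component's expected share can miss the requested total, while a multinomial draw always partitions it exactly.

```python
import numpy as np

rng = np.random.RandomState(0)

n_samples = 7
weights = np.array([1 / 3, 1 / 3, 1 / 3])

# Rounding each component's expected count: round(7/3) = 2 per
# component, so only 6 samples would be produced instead of 7.
rounded = np.round(weights * n_samples).astype(int)
print(rounded.sum())  # 6

# A multinomial draw always sums to exactly n_samples.
drawn = rng.multinomial(n_samples, weights)
print(drawn.sum())  # 7
```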
I think this is an acceptable use of `rng.multinomial`. Please add a test.
```diff
@@ -385,7 +385,7 @@ def sample(self, n_samples=1):
         _, n_features = self.means_.shape
         rng = check_random_state(self.random_state)
-        n_samples_comp = np.round(self.weights_ * n_samples).astype(int)
+        n_samples_comp = rng.multinomial(n_samples, self.weights_).astype(int)
```
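For context, a rough sketch of how the composition vector then drives sampling. The 1-D Gaussian parameters below are hypothetical, chosen for illustration; this is not the actual `BaseMixture` internals.

```python
import numpy as np

rng = np.random.RandomState(42)

# Hypothetical 1-D mixture parameters (not from a fitted model).
weights = np.array([0.2, 0.5, 0.3])
means = np.array([-5.0, 0.0, 5.0])
stds = np.array([1.0, 1.0, 1.0])

n_samples = 1000
# Composition vector: how many observations to draw per component.
n_samples_comp = rng.multinomial(n_samples, weights)

# Draw exactly that many from each component and stack the results.
X = np.concatenate([rng.normal(m, s, size=n)
                    for m, s, n in zip(means, stds, n_samples_comp)])
print(X.shape)  # (1000,)
```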
The `astype` should not be necessary.
Can you please add a non-regression test? Otherwise looks good. Thanks for the fix!
Yes, will do.
Force-pushed from aa9cae1 to 614dd4a
I don't think I have the correct permissions to restart the Travis build, but it should be passing. Can a maintainer trigger a rebuild?
```diff
@@ -956,6 +957,13 @@ def test_sample():
                              for k in range(n_features)])
     assert_array_almost_equal(gmm.means_, means_s, decimal=1)

+    # Check that sizes that are drawn match what is requested
+    assert_equal(X_s.shape, (n_samples, n_components))
+    for sample_size in [4, 101, 1004, 5051]:
```
Is there a particular reason to try with a large number? Would `for sample_size in range(50)` suffice?
I think either would be sufficient. Earlier in the test, the sample size is 20,000. It should be fine to do `range(1, k)`, too. Should I use that instead?
Force-pushed from 42d38eb to f25eacf
As long as the test runs quite quickly, this LGTM.
Please update what's new.
I just swapped to the `range(1, 50)` version.
Could you please try updating master and rebasing on it?
Force-pushed from f25eacf to 55672f9
Alright, I've rebased onto master and the tests are off to the races.
LGTM
```diff
@@ -956,6 +957,13 @@ def test_sample():
                              for k in range(n_features)])
     assert_array_almost_equal(gmm.means_, means_s, decimal=1)

+    # Check that sizes that are drawn match what is requested
+    assert_equal(X_s.shape, (n_samples, n_components))
+    for sample_size in range(1, 50):
```
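A standalone version of this check can be run against an installed scikit-learn that includes the fix; `GaussianMixture` stands in here for any `BaseMixture` subclass, and the synthetic three-blob data is just an assumption for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Three well-separated 2-D blobs as illustrative training data.
X = np.concatenate([rng.normal(loc, 1.0, size=(300, 2))
                    for loc in (-10.0, 0.0, 10.0)])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# With the multinomial composition, sample(n) returns exactly n
# observations even when n does not divide evenly among components.
for sample_size in range(1, 50):
    X_s, y_s = gmm.sample(sample_size)
    assert X_s.shape == (sample_size, 2)
```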
This test does not fail on master, so it looks like you are not testing the edge case you discovered.
I used `n_components=3` for the test to exercise the actual regression seen in #7701.
As part of this, it seems that `n_features` and `n_components` were swapped in a few places. @tguillemot, can you quickly check whether what I changed makes sense?
`n_components` and `n_features` were equal, and one was used for the other in some places.
Force-pushed from b4913d8 to 4e1c101
@lesteve Sorry for these mistakes.
LGTM
@lesteve nope, that's what they're there for!
AppVeyor is taking quite some time; this PR should be merged if it comes back green.
Merged, thanks a lot!
@lesteve for future reference, please use the squash-and-merge feature; that makes cherry-picking much simpler.
Oops, sorry for that. I always try to remember to do squash and merge, but it looks like I missed this one.
Looks like you can allow only squash and merge if you want to. The settings are available from https://github.com/scikit-learn/scikit-learn/settings. Should we do that?
@lesteve yes, I enabled that. Hm, though sometimes we have multiple authors? Well, let's see when the first person complains. We can always go back. Thanks for finding that.
Whoops, we'd forgotten to tag this for 0.18.1. Done now.
Hm, sorry, this didn't make it into 0.18.1. My bad :(
Reference Issue
Fixes #7701
What does this implement/fix? Explain your changes.
This changes the way mixture models construct the composition of new samples. Specifically, any subclass deriving from `BaseMixture.sample` is affected.

Instead of rounding the composition vector from `weights * n_samples`, this draws the composition vector from a multinomial distribution. Thus, the samples returned are guaranteed to have the number of observations requested by the user. However, the composition of the sample is now stochastic.

In addition, this adds tests to ensure that `n_samples` observations are returned when `mixture.sample(n_samples)` is called.

This may affect scaling for mixture models with a very large number of components, since the multinomial composition draw may be slow. But this composition draw only occurs once during sampling.
Any other comments?
None. Thanks for the great package!