[MRG+2] switch to multinomial composition for mixture sampling #7702

Merged
merged 3 commits into scikit-learn:master from the gaussmix_sampling branch on Oct 20, 2016

Conversation

ljwolf
Contributor

@ljwolf ljwolf commented Oct 19, 2016

Reference Issue

fixes #7701

What does this implement/fix? Explain your changes.

This changes the way mixture models construct the composition of new samples. Specifically, BaseMixture.sample, and any subclass deriving from it, is affected.

Instead of rounding the composition vector computed from weights * n_samples, this draws the composition vector from a multinomial distribution. Thus, the returned samples are guaranteed to have the number of observations requested by the user. However, the composition of the sample is now stochastic.
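To illustrate the difference (a small sketch with made-up weights, not code from the patch):

```python
import numpy as np

weights = np.array([0.14, 0.33, 0.53])   # made-up weights for a 3-component mixture
n_samples = 10

# Old behaviour: round the expected per-component counts; the total can drift.
old_composition = np.round(weights * n_samples).astype(int)
print(old_composition, old_composition.sum())      # [1 3 5] -> only 9 samples

# New behaviour: draw the composition from a multinomial; the total is exact.
rng = np.random.RandomState(0)
new_composition = rng.multinomial(n_samples, weights)
print(new_composition, new_composition.sum())      # always sums to 10
```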

In addition, this adds tests to ensure that exactly n_samples observations are returned when mixture.sample(n_samples) is called.

This may affect scaling for mixture models with a very large number of dimensions, since the multinomial composition draw may be slow. But this composition draw only occurs once during sampling.

Any other comments?

None. Thanks for the great package!

@tguillemot
Contributor

Can you add a test with the example you've put in #7701 to the function test_sample in mixture/test_gaussian_mixture.py?
@ljwolf Thanks.

@lesteve
Member

lesteve commented Oct 19, 2016

This may affect scaling for mixture models with a very large number of dimensions, since the multinomial composition draw may be slow.

This may be a problem, not sure.

This reminds me of the issue we had in StratifiedShuffleSplit (#6472), where the samples drawn for each group did not add up to the total number of samples. Although I didn't really understand the details, I believe @amueller used some kind of approximation to avoid randomly sampling.

@tguillemot
Contributor

This may affect scaling for mixture models with a very large number of dimensions, since the multinomial composition draw may be slow.

It's not a problem. The function sample is not used during the fitting process.
Moreover, as weights is linked to the number of components and GaussianMixture is not designed to deal with a lot of components (fewer than 1e6), the time is really OK.

On my computer, for weights of shape (1000000,):

%timeit np.random.multinomial(100000000, weights).astype(int)
10 loops, best of 3: 135 ms per loop

Member

@jnothman jnothman left a comment

I think this is an acceptable use of rng.multinomial. Please add a test.

```diff
@@ -385,7 +385,7 @@ def sample(self, n_samples=1):

         _, n_features = self.means_.shape
         rng = check_random_state(self.random_state)
-        n_samples_comp = np.round(self.weights_ * n_samples).astype(int)
+        n_samples_comp = rng.multinomial(n_samples, self.weights_).astype(int)
```
Member

the astype should not be necessary.
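A quick sanity check of that (assuming NumPy's RandomState API; the exact integer dtype is platform-dependent):

```python
import numpy as np

rng = np.random.RandomState(0)
draw = rng.multinomial(10, [0.2, 0.3, 0.5])
print(draw.dtype)   # already an integer dtype (e.g. int64), so .astype(int) is a no-op
```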

@amueller
Member

Can you please add a non-regression test? Otherwise looks good. Thanks for the fix!

@ljwolf
Contributor Author

ljwolf commented Oct 19, 2016

Yes, will do.

@ljwolf ljwolf force-pushed the gaussmix_sampling branch 2 times, most recently from aa9cae1 to 614dd4a on October 20, 2016 00:32
@ljwolf
Contributor Author

ljwolf commented Oct 20, 2016

I don't think I have the correct permissions to restart the Travis build, but it should be passing. Can a maintainer trigger a rebuild?

```diff
@@ -956,6 +957,13 @@ def test_sample():
                         for k in range(n_features)])
     assert_array_almost_equal(gmm.means_, means_s, decimal=1)

+    # Check that sizes that are drawn match what is requested
+    assert_equal(X_s.shape, (n_samples, n_components))
+    for sample_size in [4, 101, 1004, 5051]:
```
Member

Is there a particular reason to try with a large number? Would for sample_size in range(50) suffice?

Contributor Author

I think either would be sufficient. Earlier in the test, the sample size is 20,000. It should be fine to do range(1, k), too. Should I use that instead?
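Either way, the check boils down to something like this standalone sketch (the toy data and mixture configuration here are made up for illustration, not the exact test fixture):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: three well-separated blobs (invented for this sketch).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) + offset for offset in (-5, 0, 5)])
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# With the multinomial composition, every requested size comes back exactly.
for sample_size in range(1, 50):
    X_s, y_s = gmm.sample(sample_size)
    assert X_s.shape == (sample_size, 2)   # exactly as many rows as requested
    assert y_s.shape == (sample_size,)     # one component label per row
```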

@ljwolf ljwolf force-pushed the gaussmix_sampling branch 2 times, most recently from 42d38eb to f25eacf on October 20, 2016 01:08
@jnothman
Member

As long as the test runs quite quickly, this LGTM.

@jnothman
Member

Please update what's new

@jnothman jnothman changed the title switch to multinomial composition for mixture sampling [MRG+1] switch to multinomial composition for mixture sampling Oct 20, 2016
@ljwolf
Contributor Author

ljwolf commented Oct 20, 2016

I just swapped to the range(1, 50) construct and mentioned the added test in the original PR description. That Travis report looks pretty strange... tons of errors that look like build errors?

@jnothman
Member

Could you please try updating master and rebasing on it?

@ljwolf ljwolf force-pushed the gaussmix_sampling branch from f25eacf to 55672f9 on October 20, 2016 04:54
@ljwolf
Contributor Author

ljwolf commented Oct 20, 2016

Alright, I've rebased onto master & the tests are off to the races.

@tguillemot
Contributor

LGTM

```diff
@@ -956,6 +957,13 @@ def test_sample():
                         for k in range(n_features)])
     assert_array_almost_equal(gmm.means_, means_s, decimal=1)

+    # Check that sizes that are drawn match what is requested
+    assert_equal(X_s.shape, (n_samples, n_components))
+    for sample_size in range(1, 50):
```
Member

@lesteve lesteve Oct 20, 2016

This test does not fail on master so it looks like you are not testing the edge case you discovered.

Member

@lesteve lesteve left a comment

I used n_components=3 in the test to exercise the actual regression seen in #7701.

As part of this, it seems that n_features and n_components were swapped in a few places. @tguillemot, can you quickly check whether what I changed makes sense?
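For the record, here is why three roughly equal components expose the old rounding (an illustrative calculation, not code from the PR):

```python
import numpy as np

# With three (roughly) equal weights, the old rounding loses a sample:
weights = np.ones(3) / 3
print(np.round(weights * 10).astype(int).sum())   # 9, although 10 were requested
```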

@lesteve
Member

lesteve commented Oct 20, 2016

I used n_components=3 in the test to exercise the actual regression seen in #7701.

I forgot to say it explicitly: I used the recent GitHub feature that allows people with enough admin rights to push into PR branches. @ljwolf I hope you don't mind :-).

n_components and n_features were equal and one was used for the other in some places.
@tguillemot
Contributor

@lesteve Sorry for these mistakes. It's n_components indeed.

@amueller
Member

LGTM

@amueller amueller changed the title [MRG+1] switch to multinomial composition for mixture sampling [MRG+2] switch to multinomial composition for mixture sampling Oct 20, 2016
@ljwolf
Contributor Author

ljwolf commented Oct 20, 2016

@lesteve nope, that's what they're there for! :shipit:

@lesteve lesteve dismissed jnothman’s stale review October 20, 2016 16:54

Test has now been added

@lesteve
Member

lesteve commented Oct 20, 2016

AppVeyor is taking quite some time; this PR should be merged if it comes back green.

@lesteve lesteve merged commit ad6f094 into scikit-learn:master Oct 20, 2016
@lesteve
Member

lesteve commented Oct 20, 2016

Merged, thanks a lot!

@amueller
Member

@lesteve for future reference, please use the squash and merge feature; that makes cherry-picking much simpler.

@lesteve
Member

lesteve commented Oct 25, 2016

Oops sorry for that, I always try to remember to do squash and merge but it looks like I missed this one.

@lesteve
Member

lesteve commented Oct 25, 2016

Looks like you can allow only squash and merge if you want to:
https://github.com/blog/2141-squash-your-commits

The settings are available from: https://github.com/scikit-learn/scikit-learn/settings

Should we do that?

@amueller
Member

@lesteve yes. I enabled that. Hm though sometimes we have multiple authors? Well, let's see when the first person complains. We can always go back. Thanks for finding that.

@jnothman jnothman added this to the 0.18.1 milestone Nov 5, 2016
@jnothman
Member

jnothman commented Nov 5, 2016

Whoops, we'd forgotten to tag this for 0.18.1. Done now.

@amueller
Member

Hm, sorry, it didn't make it into 0.18.1. My bad :(

Successfully merging this pull request may close these issues:

GaussianMixture.sample(n) occasionally returns incorrectly-sized samples (#7701)