[MRG+2] switch to multinomial composition for mixture sampling #7702
Conversation
This may be a problem, not sure. This reminds me of the issue we had in StratifiedShuffleSplit #6472, where drawing samples for each group did not add up to the total number of samples. Although I didn't really understand the details, I believe @amueller used some kind of approximation to avoid randomly sampling.
It's not a problem. On my computer, for weights of shape `(1000000,)`:

```
%timeit np.random.multinomial(100000000, weights).astype(int)
10 loops, best of 3: 135 ms per loop
```
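To illustrate the underlying issue, here is a minimal NumPy sketch (not scikit-learn code): rounding each component's expected share can miss the requested total, while a multinomial draw always partitions it exactly.

```python
import numpy as np

rng = np.random.RandomState(0)

n_samples = 7
weights = np.array([1 / 3, 1 / 3, 1 / 3])

# Rounding each component's expected count: round(7/3) = 2 per
# component, so only 6 samples would be produced instead of 7.
rounded = np.round(weights * n_samples).astype(int)
print(rounded.sum())  # 6

# A multinomial draw always sums to exactly n_samples.
drawn = rng.multinomial(n_samples, weights)
print(drawn.sum())  # 7
```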
I think this is an acceptable use of `rng.multinomial`. Please add a test.
```diff
@@ -385,7 +385,7 @@ def sample(self, n_samples=1):
         _, n_features = self.means_.shape
         rng = check_random_state(self.random_state)
-        n_samples_comp = np.round(self.weights_ * n_samples).astype(int)
+        n_samples_comp = rng.multinomial(n_samples, self.weights_).astype(int)
```
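For context, a rough sketch of how the composition vector then drives sampling. The 1-D Gaussian parameters below are hypothetical, chosen for illustration; this is not the actual `BaseMixture` internals.

```python
import numpy as np

rng = np.random.RandomState(42)

# Hypothetical 1-D mixture parameters (not from a fitted model).
weights = np.array([0.2, 0.5, 0.3])
means = np.array([-5.0, 0.0, 5.0])
stds = np.array([1.0, 1.0, 1.0])

n_samples = 1000
# Composition vector: how many observations to draw per component.
n_samples_comp = rng.multinomial(n_samples, weights)

# Draw exactly that many from each component and stack the results.
X = np.concatenate([rng.normal(m, s, size=n)
                    for m, s, n in zip(means, stds, n_samples_comp)])
print(X.shape)  # (1000,)
```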
The `astype` should not be necessary.
Can you please add a non-regression test? Otherwise looks good. Thanks for the fix!
Yes, will do.
Force-pushed from aa9cae1 to 614dd4a
I don't think I have the correct permissions to restart the Travis build, but it should be passing. Can a maintainer trigger a rebuild?
```diff
@@ -956,6 +957,13 @@ def test_sample():
                              for k in range(n_features)])
     assert_array_almost_equal(gmm.means_, means_s, decimal=1)

+    # Check that sizes that are drawn match what is requested
+    assert_equal(X_s.shape, (n_samples, n_components))
+    for sample_size in [4, 101, 1004, 5051]:
```
Is there a particular reason to try with a large number? Would `for sample_size in range(50)` suffice?
I think either would be sufficient. Earlier in the test, the sample size is 20,000. It should be fine to do `range(1, k)`, too. Should I use that instead?
Force-pushed from 42d38eb to f25eacf
As long as the test runs quite quickly, this LGTM.
Please update what's new.
I just swapped to the `range(1, 50)` version.
Could you please try updating master and rebasing on it?
Force-pushed from f25eacf to 55672f9
Alright, I've rebased onto master and the tests are off to the races.
LGTM
```diff
@@ -956,6 +957,13 @@ def test_sample():
                              for k in range(n_features)])
     assert_array_almost_equal(gmm.means_, means_s, decimal=1)

+    # Check that sizes that are drawn match what is requested
+    assert_equal(X_s.shape, (n_samples, n_components))
+    for sample_size in range(1, 50):
```
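A standalone version of this check can be run against an installed scikit-learn that includes the fix; `GaussianMixture` stands in here for any `BaseMixture` subclass, and the synthetic three-blob data is just an assumption for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Three well-separated 2-D blobs as illustrative training data.
X = np.concatenate([rng.normal(loc, 1.0, size=(300, 2))
                    for loc in (-10.0, 0.0, 10.0)])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# With the multinomial composition, sample(n) returns exactly n
# observations even when n does not divide evenly among components.
for sample_size in range(1, 50):
    X_s, y_s = gmm.sample(sample_size)
    assert X_s.shape == (sample_size, 2)
```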
This test does not fail on master, so it looks like you are not testing the edge case you discovered.
I used `n_components=3` for the test to exercise the actual regression seen in #7701.
As part of this, it seems that `n_features` and `n_components` were swapped in a few places. @tguillemot, can you quickly check whether what I changed makes sense?
`n_components` and `n_features` were equal, and one was used for the other in some places.
Force-pushed from b4913d8 to 4e1c101
@lesteve Sorry for these mistakes.
LGTM
@lesteve nope, that's what they're there for!
AppVeyor is taking quite some time; this PR should be merged if it comes back green.
Merged, thanks a lot!
@lesteve for future reference, please use the squash-and-merge feature; that makes cherry-picking much simpler.
Oops, sorry for that. I always try to remember to do squash and merge, but it looks like I missed this one.
Looks like you can allow only squash and merge if you want to. The settings are available from https://github.com/scikit-learn/scikit-learn/settings. Should we do that?
@lesteve yes, I enabled that. Hm, though sometimes we have multiple authors? Well, let's see when the first person complains. We can always go back. Thanks for finding that.
Whoops, we'd forgotten to tag this for 0.18.1. Done now.
Hm, sorry, this didn't make it into 0.18.1. My bad :(
Reference Issue
Fixes #7701
What does this implement/fix? Explain your changes.
This changes the way mixture models construct the composition of new samples. Specifically, any subclass deriving from `BaseMixture.sample` is affected.

Instead of rounding the composition vector from `weights * n_samples`, this draws the composition vector from a multinomial distribution. Thus, the samples returned are guaranteed to have the number of observations requested by the user. However, the composition of the sample is now stochastic.

In addition, this adds tests to ensure that `n_samples` observations are returned when `mixture.sample(n_samples)` is called.

This may affect scaling for mixture models with a very large number of components, since the multinomial composition draw may be slow. But this composition draw only occurs once during sampling.
Any other comments?
None. Thanks for the great package!