GaussianMixture.sample(n) occasionally returns incorrectly-sized samples #7701

Closed
ljwolf opened this issue Oct 18, 2016 · 2 comments · Fixed by #7702

@ljwolf
Contributor

ljwolf commented Oct 18, 2016

Description

GaussianMixture.sample(n) occasionally returns incorrectly-sized samples, possibly due to rounding error.

Steps/Code to Reproduce

from sklearn.mixture import GaussianMixture
import numpy as np

np.random.seed(1011)

X = np.random.normal(size=(100, 20))  # 100 observations, 20 features

# fit a 10-component mixture; the sampling calls are shown under "Actual Results" below
mixture_model = GaussianMixture(n_components=10).fit(X)

Expected Results

I'd expect that if n_samples are requested, n_samples are returned, even when n_samples is not an even multiple of the number of mixture components.

Actual Results

Instead, it appears that my expectations are confounded by line 388 of mixture/base.py, since

np.round(self.weights_ * n_samples).astype(int)

will not necessarily sum to n_samples, depending on the weights.

samples = mixture_model.sample(100)
print(samples[0].shape)
# prints: (100, 20)
samples = mixture_model.sample(104)
print(samples[0].shape)
# prints: (103, 20) -- one sample short
samples = mixture_model.sample(1002)
print(samples[0].shape)
# prints: (1000, 20) -- two samples short
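
For illustration, here is a minimal sketch of the rounding shortfall using a made-up weight vector (not the fitted model above): np.round(weights * n_samples) need not sum to n_samples.

import numpy as np

# hypothetical weights for a 3-component mixture (not taken from the model above)
weights = np.array([0.35, 0.35, 0.30])
n_samples = 104

counts = np.round(weights * n_samples).astype(int)
print(counts, counts.sum())
# prints: [36 36 31] 103 -- one sample is lost to rounding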

Suggested Patch

I could be mistaken, but I think the correct way to draw n_samples from a mixture distribution with weight vector weights_ is to first draw a count vector with np.random.multinomial(n_samples, weights_), and then draw the corresponding number of samples from each component distribution.

This would involve a one-line change to mixture/base.py replacing line 388 referenced above with:

rng.multinomial(n_samples, self.weights_).astype(int)
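
As a quick sanity check on the proposed patch (again with a made-up weight vector, not the library code itself), a multinomial draw always produces per-component counts that sum exactly to n_samples:

import numpy as np

rng = np.random.RandomState(1011)
weights = np.array([0.35, 0.35, 0.30])  # hypothetical component weights
n_samples = 104

counts = rng.multinomial(n_samples, weights)
print(counts, counts.sum())
# the individual counts vary with the seed, but they always sum to exactly 104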

Versions

>>> import platform; print(platform.platform())
Linux-4.7.6-1-ARCH-x86_64-with-arch
>>> import sys; print("Python", sys.version)
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
>>> import numpy; print('NumPy', numpy.__version__)
NumPy 1.11.2
>>> import scipy; print('SciPy', scipy.__version__)
SciPy 0.18.1
>>> import sklearn; print('Scikit-Learn', sklearn.__version__)
Scikit-Learn 0.18
ljwolf changed the title from "GaussianMixture.sample(n) occasionally returns incorrectly-sized samples possibly due to rounding error." to "GaussianMixture.sample(n) occasionally returns incorrectly-sized samples" Oct 18, 2016
@lesteve
Member

lesteve commented Oct 19, 2016

Thanks a lot for the very detailed issue! I'm not an expert on Gaussian mixtures, but this does indeed look like a problem. The GaussianMixture.sample docstring does say that the returned array should have n_samples rows:

Generate random samples from the fitted Gaussian distribution.

Parameters
----------
n_samples : int, optional
    Number of samples to generate. Defaults to 1.

Returns
-------
X : array, shape (n_samples, n_features)
    Randomly generated sample
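
As a minimal check of that contract (a sketch, assuming the mixture_model fitted in the reproduction above), the following assertion fails on 0.18 for n_samples=104 because of the rounding issue:

X_sampled = mixture_model.sample(104)[0]  # first element of the returned tuple is the sample array
assert X_sampled.shape[0] == 104  # fails on 0.18: X_sampled has only 103 rows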

@tguillemot @ogrisel what do you think?

@tguillemot
Contributor

There is a problem indeed and your solution solves it.
Thanks @ljwolf.
