
sklearn.mixture.GaussianMixture doesn't sample properly plus a prob with fitting #7822


Closed
LuCeHe opened this issue Nov 3, 2016 · 16 comments

@LuCeHe

LuCeHe commented Nov 3, 2016

Description

The new sklearn.mixture.GaussianMixture has messed up several things. Among them:

  • IMPORTANT: why can't I sample from a GMM without fitting it? I want to define all the parameters myself and then be able to sample from that distribution.
    • the trick I use is to specify gmm.precisions_cholesky_, but I find it overcomplicated
  • why does it assume the covariance has to be a 2-D matrix, so that I have to specify "spherical" if I am in 1-D?
  • IMPORTANT: sampling gives back fewer samples than asked for, 665 instead of 1000. I think it's caused by line 384 in scikit-learn/sklearn/mixture/base.py:
n_samples_comp = np.round(self.weights_ * n_samples).astype(int)

Steps/Code to Reproduce

import numpy as np
from sklearn.mixture import GaussianMixture
np.random.seed(0)

# Define 10 Gaussians for the mixture
n_gaussians = 10
means = 10 * (2 * np.random.rand(n_gaussians, 1) - 1)
weights = (.4 * np.random.rand(n_gaussians, 1)) ** 2
to_normalize_weights = np.random.rand(n_gaussians, 1)
covars = to_normalize_weights / np.sum(to_normalize_weights)
precisions = np.array([x ** (-1) for x in covars])

gmm = GaussianMixture(n_gaussians, covariance_type='full')
gmm.means_ = means
gmm.weights_ = weights
gmm.covariances_ = covars
gmm.precisions_cholesky_ = precisions  # otherwise sklearn thinks the model is not fitted

samples = gmm.sample(n_samples=1000)
print(len(samples[0]), len(samples[1]))

Expected Results

1000, 1000

Actual Results

665, 665

Versions

@jnothman
Member

jnothman commented Nov 4, 2016

The last issue was fixed in #7701. The first we may not fix: generally fit() needs to be called before model-based operations can be performed.


@LuCeHe
Author

LuCeHe commented Nov 4, 2016

I updated sklearn and it doesn't seem to solve the sampling issue: asking for 3000 samples gives back 1177. But probably I have to update in a more careful way, and besides, this is an issue I have already found a way around.

The first issue is more annoying, because I think it's fundamental for people using GMMs to be able to compare proposed methods on new data. In my case I'm defining a GMM using a novelty-detection rule, so it's more complicated than just fitting to data.

@jnothman
Member

jnothman commented Nov 5, 2016

It's not in the current release, but it's included in the current master, and should be in 0.18.1 to be released shortly.

@jnothman
Member

jnothman commented Nov 5, 2016

Ping @tguillemot

@LuCeHe
Author

LuCeHe commented Nov 5, 2016

Thanks a lot.

In the end I decided to use another package instead of sklearn:

http://pypr.sourceforge.net/mog.html

I don't know if you can take inspiration from it, or how far-reaching you want your package to be. In my own interest I would want it to be as complete as possible, but it's clear that there are alternative solutions.

Best,
Luca

@tguillemot
Contributor

In your example:

In [15]: np.sum(weights)
Out[15]: 0.66475414826343626

So it's normal that you only get 665 samples.
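
To illustrate with made-up numbers (a sketch, not the exact weights produced by the script above): when the weights sum to roughly 0.665 instead of 1, the per-component counts computed by that np.round line sum to roughly 665 rather than the requested 1000.

import numpy as np

# Illustrative weights that sum to 0.665, mimicking the unnormalized
# weights in the reproduction script above.
weights = np.array([0.10, 0.08, 0.07, 0.09, 0.05,
                    0.06, 0.04, 0.08, 0.05, 0.045])
n_samples = 1000

# Essentially what sklearn/mixture/base.py did at the time:
n_samples_comp = np.round(weights * n_samples).astype(int)
print(weights.sum())          # ~0.665
print(n_samples_comp.sum())   # 665, not 1000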

I think it's caused by line 384 in scikit-learn/sklearn/mixture/base.py:

Indeed there was a problem, and it has been corrected by #7702.

IMPORTANT: why can't I sample from a GMM if I don't fit. I want to define all the parameters and to be able to sample from that distribution.

GaussianMixture is designed to fit a Gaussian mixture, and that is why it doesn't check that np.sum(weights) == 1, etc.

This issue is linked to #7701: I think what people want is a sample function outside GaussianMixture. I'm -1 on that, because it seems out of scope for sklearn.

@agramfort @raghavrv @TomDLT @jnothman What's your point of view on the last point?

@jnothman
Member

jnothman commented Nov 8, 2016

You mean people want to define a parametrised Gaussian mixture and sample from it? I'm ambivalent.

@tguillemot
Contributor

@jnothman Yes, that's what I mean.

As it's the second time someone has asked for something like this, here is the code to do the sampling.

For the full covariance case:

weights_ /= np.sum(weights_)
n_samples_comp = rng.multinomial(n_samples, weights_)
X = np.vstack([rng.multivariate_normal(mean, covariance, int(n_sample))
               for (mean, covariance, n_sample) in zip(means_, covariances_, n_samples_comp)])
y = np.concatenate([j * np.ones(n_sample, dtype=int)
                    for j, n_sample in enumerate(n_samples_comp)])

For the diagonal covariance case:

weights_ /= np.sum(weights_)
n_samples_comp = rng.multinomial(int(n_samples), weights_)
X = np.vstack([mean + rng.randn(n_sample, n_features) * np.sqrt(covariance)
               for (mean, covariance, n_sample) in zip(means_, covariances_, n_samples_comp)])
y = np.concatenate([j * np.ones(n_sample, dtype=int)
                    for j, n_sample in enumerate(n_samples_comp)])
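
For completeness, here is a minimal self-contained run of the full-covariance snippet. The rng, the mixture parameters, and n_samples below are made-up values chosen only to exercise the code, not anything prescribed by sklearn:

import numpy as np

rng = np.random.RandomState(0)

# A hypothetical 2-component, 2-feature mixture specified by hand.
means_ = np.array([[0.0, 0.0], [5.0, 5.0]])
covariances_ = np.array([np.eye(2), 0.5 * np.eye(2)])
weights_ = np.array([0.3, 0.7])
n_samples = 1000

weights_ /= np.sum(weights_)
n_samples_comp = rng.multinomial(n_samples, weights_)
X = np.vstack([rng.multivariate_normal(mean, covariance, int(n_sample))
               for (mean, covariance, n_sample) in zip(means_, covariances_, n_samples_comp)])
y = np.concatenate([j * np.ones(n_sample, dtype=int)
                    for j, n_sample in enumerate(n_samples_comp)])

print(X.shape, y.shape)   # (1000, 2) (1000,)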

@jnothman
Member

jnothman commented Nov 8, 2016

Yes, I realise it's no big deal.

@tguillemot
Contributor

@jnothman Indeed, sorry.

@jnothman
Member

jnothman commented Nov 8, 2016

I suppose you're right that just as we don't provide a separate function to perform linear model prediction ...

@amueller
Member

@ACTLA we definitely don't want to be as far-reaching as we possibly could be; we are already stretched very thin. Maybe also check out pomegranate: https://github.com/jmschrei/pomegranate

@amueller
Member

I vote to close. Not sure if pomegranate is where we want to send people for sampling, but it certainly addresses this use case.

@tguillemot
Contributor

+1 to close

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 21, 2016 via email

@AruniRC

AruniRC commented Jan 15, 2018

@LuCeHe The trick of setting precisions_cholesky_ is really useful. It allowed me to keep using sklearn and avoid rewriting a large codebase to switch to another package. IMHO, one should be able to predict posteriors (my use case) without fitting, if the user provides all the parameters at creation.
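
For later readers, a minimal sketch of the trick being discussed. It assumes a 'diag' covariance type, where precisions_cholesky_ is simply 1/sqrt(covariances_); the parameter values are invented for illustration, and the exact set of attributes sklearn expects may vary slightly between versions:

import numpy as np
from sklearn.mixture import GaussianMixture

n_components, n_features = 3, 2
gmm = GaussianMixture(n_components=n_components, covariance_type='diag')

# Hand-specified parameters (illustrative values only), set without calling fit().
gmm.weights_ = np.array([0.2, 0.3, 0.5])          # must sum to 1
gmm.means_ = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])
gmm.covariances_ = np.ones((n_components, n_features))
# For 'diag' covariances the Cholesky factor of the precisions is 1/sqrt(cov).
gmm.precisions_cholesky_ = 1.0 / np.sqrt(gmm.covariances_)

X, y = gmm.sample(n_samples=10)                    # sampling now works
proba = gmm.predict_proba(np.array([[0.0, 0.0], [3.0, 3.0]]))
print(proba.round(3))                              # posterior responsibilities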
