
sklearn.mixture.GaussianMixture doesn't sample properly plus a prob with fitting #7822


Closed
LuCeHe opened this issue Nov 3, 2016 · 16 comments

@LuCeHe

LuCeHe commented Nov 3, 2016

Description

The new sklearn.mixture.GaussianMixture has messed up several things. Among them:

  • IMPORTANT: why can't I sample from a GMM without fitting it? I want to define all the parameters myself and then be able to sample from that distribution.
    • the trick I use is to specify gmm.precisions_cholesky_, but I find it overcomplicated
  • why does it assume the covariance has to be a 2-D matrix, so that I have to specify "spherical" if I am in 1-D?
  • IMPORTANT: sampling gives back fewer samples than asked for, 665 instead of 1000. I think it's caused by line 384 in scikit-learn/sklearn/mixture/base.py:
n_samples_comp = np.round(self.weights_ * n_samples).astype(int)

Steps/Code to Reproduce

import numpy as np
from sklearn.mixture import GaussianMixture
np.random.seed(0)

# Define 10 Gaussians for the mixture
n_gaussians = 10
means = 10 * (2 * np.random.rand(n_gaussians, 1) - 1)
weights = (.4 * np.random.rand(n_gaussians, 1)) ** 2
to_normalize_weights = np.random.rand(n_gaussians, 1)
covars = to_normalize_weights / np.sum(to_normalize_weights)
precisions = np.array([x ** (-1) for x in covars])

gmm = GaussianMixture(n_gaussians, covariance_type='full')
gmm.means_ = means
gmm.weights_ = weights
gmm.covariances_ = covars
gmm.precisions_cholesky_ = precisions  # otherwise sklearn thinks the model is not fitted

samples = gmm.sample(n_samples=1000)
print(len(samples[0]), len(samples[1]))

Expected Results

1000, 1000

Actual Results

665, 665

Versions

@jnothman
Member

jnothman commented Nov 4, 2016

The last issue was fixed in #7701. The first we may not fix: generally fit() needs to be called before model-based operations can be performed.


@LuCeHe
Author

LuCeHe commented Nov 4, 2016

I updated sklearn and it doesn't seem to solve the sampling issue: asking for 3000 samples gives back 1177. But probably I have to update in a more careful way, and besides, this is an issue I have already found a way around.

The first issue is more annoying, because I think it's fundamental for people using GMMs to be able to compare proposed methods on new data. In my case I'm defining a GMM using a novelty-detection rule, so it's more complicated than just fitting to data.

@jnothman
Member

jnothman commented Nov 5, 2016

It's not in the current release, but it's included in the current master, and should be in 0.18.1 to be released shortly.

@jnothman
Member

jnothman commented Nov 5, 2016

Ping @tguillemot

@LuCeHe
Author

LuCeHe commented Nov 5, 2016

Thanks a lot.

In the end I decided to use another package instead of sklearn:

http://pypr.sourceforge.net/mog.html

I don't know if you can take inspiration from it, or how far-reaching you want your package to be. In my own interest I would want it to be as complete as possible, but it's clear that there are alternative solutions.

Best,
Luca

@tguillemot
Contributor

In your example:

In [15]: np.sum(weights)
Out[15]: 0.66475414826343626

So it's normal that you only get 665 samples.
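
To illustrate with made-up numbers (a sketch, not the exact weights produced by the script above): when the weights sum to roughly 0.665 instead of 1, the per-component counts computed by that np.round line sum to roughly 665 rather than the requested 1000.

import numpy as np

# Illustrative weights that sum to 0.665, mimicking the unnormalized
# weights in the reproduction script above.
weights = np.array([0.10, 0.08, 0.07, 0.09, 0.05,
                    0.06, 0.04, 0.08, 0.05, 0.045])
n_samples = 1000

# Essentially what sklearn/mixture/base.py did at the time:
n_samples_comp = np.round(weights * n_samples).astype(int)
print(weights.sum())          # ~0.665
print(n_samples_comp.sum())   # 665, not 1000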

I think it's caused by line 384 in scikit-learn/sklearn/mixture/base.py:

Indeed there was a problem, and it has been corrected by #7702.

IMPORTANT: why can't I sample from a GMM if I don't fit. I want to define all the parameters and to be able to sample from that distribution.

GaussianMixture is designed to fit a Gaussian mixture, and that is why it doesn't check that np.sum(weights) == 1, etc.

This issue is linked to #7701: I think what people want is a sample function outside GaussianMixture. I'm -1 on that, because it seems out of scope for sklearn.

@agramfort @raghavrv @TomDLT @jnothman What's your point of view on the last point?

@jnothman
Member

jnothman commented Nov 8, 2016

You mean people want to define a parametrised Gaussian mixture and sample from it? I'm ambivalent.

@tguillemot
Contributor

@jnothman Yes, that's what I mean.

As it's the second time someone has asked for something like this, here is the code to do the sampling.

For the full covariance case:

weights_ /= np.sum(weights_)
n_samples_comp = rng.multinomial(n_samples, weights_)
X = np.vstack([rng.multivariate_normal(mean, covariance, int(n_sample))
               for (mean, covariance, n_sample) in zip(means_, covariances_, n_samples_comp)])
y = np.concatenate([j * np.ones(n_sample, dtype=int)
                    for j, n_sample in enumerate(n_samples_comp)])

For the diagonal covariance case:

weights_ /= np.sum(weights_)
n_samples_comp = rng.multinomial(int(n_samples), weights_)
X = np.vstack([mean + rng.randn(n_sample, n_features) * np.sqrt(covariance)
               for (mean, covariance, n_sample) in zip(means_, covariances_, n_samples_comp)])
y = np.concatenate([j * np.ones(n_sample, dtype=int)
                    for j, n_sample in enumerate(n_samples_comp)])
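
For completeness, here is a minimal self-contained run of the full-covariance snippet. The rng, the mixture parameters, and n_samples below are made-up values chosen only to exercise the code, not anything prescribed by sklearn:

import numpy as np

rng = np.random.RandomState(0)

# A hypothetical 2-component, 2-feature mixture specified by hand.
means_ = np.array([[0.0, 0.0], [5.0, 5.0]])
covariances_ = np.array([np.eye(2), 0.5 * np.eye(2)])
weights_ = np.array([0.3, 0.7])
n_samples = 1000

weights_ /= np.sum(weights_)
n_samples_comp = rng.multinomial(n_samples, weights_)
X = np.vstack([rng.multivariate_normal(mean, covariance, int(n_sample))
               for (mean, covariance, n_sample) in zip(means_, covariances_, n_samples_comp)])
y = np.concatenate([j * np.ones(n_sample, dtype=int)
                    for j, n_sample in enumerate(n_samples_comp)])

print(X.shape, y.shape)   # (1000, 2) (1000,)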

@jnothman
Member

jnothman commented Nov 8, 2016

Yes, I realise it's no big deal.

@tguillemot
Contributor

@jnothman Indeed, sorry.

@jnothman
Member

jnothman commented Nov 8, 2016

I suppose you're right that just as we don't provide a separate function to perform linear model prediction ...

@amueller
Member

@ACTLA we definitely don't want to be as far-reaching as we possibly could be; we are already stretched very thin. Maybe also check out pomegranate: https://github.com/jmschrei/pomegranate

@amueller
Member

I vote to close. Not sure if pomegranate is where we want to send people for sampling, but it certainly addresses this use case.

@tguillemot
Contributor

+1 to close

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 21, 2016 via email

@AruniRC

AruniRC commented Jan 15, 2018

@LuCeHe The trick of setting precisions_cholesky_ is really useful. It allowed me to keep using sklearn and avoid rewriting a large codebase to switch to another package. IMHO, one should be able to predict posteriors (my use case) without fitting, if the user provides all the parameters at creation.
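
For later readers, a minimal sketch of the trick being discussed. It assumes a 'diag' covariance type, where precisions_cholesky_ is simply 1/sqrt(covariances_); the parameter values are invented for illustration, and the exact set of attributes sklearn expects may vary slightly between versions:

import numpy as np
from sklearn.mixture import GaussianMixture

n_components, n_features = 3, 2
gmm = GaussianMixture(n_components=n_components, covariance_type='diag')

# Hand-specified parameters (illustrative values only), set without calling fit().
gmm.weights_ = np.array([0.2, 0.3, 0.5])          # must sum to 1
gmm.means_ = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])
gmm.covariances_ = np.ones((n_components, n_features))
# For 'diag' covariances the Cholesky factor of the precisions is 1/sqrt(cov).
gmm.precisions_cholesky_ = 1.0 / np.sqrt(gmm.covariances_)

X, y = gmm.sample(n_samples=10)                    # sampling now works
proba = gmm.predict_proba(np.array([[0.0, 0.0], [3.0, 3.0]]))
print(proba.round(3))                              # posterior responsibilities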
