

Scaling features using MinMaxScaler makes DPGMM always have one cluster #6694


Closed
HTCode opened this issue Apr 21, 2016 · 8 comments

@HTCode

HTCode commented Apr 21, 2016

I have noticed that if I scale my dataset using MinMaxScaler(), then DPGMM always creates a single cluster (label), regardless of the value of alpha. This might be related to a numerical precision issue.

If I don't rescale the data, or if I use StandardScaler() instead of MinMaxScaler(), the problem does not occur (i.e., DPGMM creates more than one cluster).

Is this a bug in sklearn.mixture.DPGMM, or did I miss something?

API is here: http://scikit-learn.org/stable/modules/generated/sklearn.mixture.DPGMM.html#sklearn.mixture.DPGMM

I have also tried the artificial data from this example (on the official site): http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html#example-mixture-plot-gmm-py

It works, but if I rescale the generated dataset X by adding the following lines, then DPGMM creates only one cluster:

from sklearn.preprocessing import MinMaxScaler
X = MinMaxScaler().fit_transform(X) # added this after X ...
@amueller
Member

Yes, there are probably bugs in DPGMM. Search the issue tracker for DPGMM and you'll find a few.
They are hopefully fixed here: #6651

@tguillemot
Contributor

Yes, I've noticed this issue with the old DPGMM class.
It seems there is indeed a numerical problem.
The new code should solve it.
Thank you @HTCode for the report.

@ogrisel ogrisel added the Bug label Apr 25, 2016
@ogrisel
Member

ogrisel commented Apr 25, 2016

Thanks for the report. We should make this a non-regression test for the new implementation of DP Gaussian Mixtures.
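A non-regression test along those lines could look like the following sketch. It uses BayesianGaussianMixture, the class that replaced DPGMM; the synthetic two-blob data and the parameter values here are illustrative assumptions, not code from this thread:

```python
# Sketch of a non-regression check: MinMaxScaler-scaled data should still
# yield more than one cluster on clearly separated blobs.
# Assumes scikit-learn >= 0.18, where BayesianGaussianMixture replaced DPGMM.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(0)
# Two well-separated Gaussian blobs in 2D.
X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + 10.0])
X_scaled = MinMaxScaler().fit_transform(X)  # rescale each feature to [0, 1]

bgmm = BayesianGaussianMixture(
    n_components=5,  # upper bound; the DP prior prunes unused components
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X_scaled)

labels = bgmm.predict(X_scaled)
n_clusters = len(np.unique(labels))
print(n_clusters)  # expect more than one cluster, unlike the old DPGMM
```

With the old DPGMM, the reported bug would make `n_clusters` come out as 1 on the scaled data; the test passes only if the new implementation recovers multiple clusters.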

@fringsoo

@tguillemot Is there a usable new DPGMM version now? I tried your GSoC BayesianMixture version and it seems something is wrong with the E step (the lower bound does not update, so it always finishes after only one iteration).

@tguillemot
Contributor

tguillemot commented Aug 30, 2016

@fringsoo Not yet for the DPGMM, but we will merge BayesianMixture soon (#6651).
I haven't seen a problem with the update of the lower bound and MinMaxScaler.
Can you send me the code to reproduce your bug?

@fringsoo

@tguillemot Sorry, I forgot to mention that my data is quite high dimensional (around 300 features). In that case, if the number of samples is low, the DPGMM does not work well. The code itself has no problem. Thanks.

@tguillemot
Contributor

@fringsoo Ok

@amueller amueller modified the milestone: 0.19 Sep 29, 2016
@amueller
Member

Should be fixed with the new BayesianMixture. Please give it a shot @fringsoo.
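For anyone migrating from the deprecated class, a rough sketch of the mapping follows. The constructor arguments shown are assumptions about how the old DPGMM settings carry over (in particular, treating weight_concentration_prior as the analogue of DPGMM's alpha), not an official migration guide:

```python
# Hypothetical migration sketch: replacing the deprecated DPGMM with
# BayesianGaussianMixture (scikit-learn >= 0.18).
from sklearn.mixture import BayesianGaussianMixture

# Old (removed): sklearn.mixture.DPGMM(n_components=5, alpha=1.0)
model = BayesianGaussianMixture(
    n_components=5,  # upper bound on the number of components
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,  # plays the role of DPGMM's alpha
    random_state=0,
)
# model.fit(X) / model.predict(X) as with any scikit-learn mixture model
```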
