

Scaling features using MinMaxScaler makes DPGMM always have one cluster #6694


Closed
HTCode opened this issue Apr 21, 2016 · 8 comments

@HTCode

HTCode commented Apr 21, 2016

I have noticed that if I scale my dataset using MinMaxScaler(), then DPGMM always creates a single cluster (label), regardless of the value of alpha. This might be related to a numerical precision issue.

If I don't rescale the data, or if I use StandardScaler() instead of MinMaxScaler(), the problem does not occur (i.e., DPGMM creates more than one cluster).

Is this a bug in sklearn.mixture.DPGMM, or did I miss something?

API is here: http://scikit-learn.org/stable/modules/generated/sklearn.mixture.DPGMM.html#sklearn.mixture.DPGMM

I have also tried the artificial data from this example (on the official site): http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html#example-mixture-plot-gmm-py

It works, but if I rescale the generated dataset X by adding the following lines, then DPGMM creates only one cluster:

from sklearn.preprocessing import MinMaxScaler
X = MinMaxScaler().fit_transform(X) # added this after X ...
@amueller
Member

Yes, there are probably bugs in DPGMM. Search the issue tracker for DPGMM and you'll find a few.
They are hopefully fixed here: #6651

@tguillemot
Contributor

Yes, I've noticed this issue with the old DPGMM class.
It seems there is indeed a numerical problem.
The new code should solve it.
Thank you @HTCode for the report.

@ogrisel ogrisel added the Bug label Apr 25, 2016
@ogrisel
Member

ogrisel commented Apr 25, 2016

Thanks for the report. We should make this a non-regression test for the new implementation of DP Gaussian Mixtures.
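A non-regression test along those lines could look like the following sketch. It uses BayesianGaussianMixture, the class that replaced DPGMM; the synthetic two-blob data and the parameter values here are illustrative assumptions, not code from this thread:

```python
# Sketch of a non-regression check: MinMaxScaler-scaled data should still
# yield more than one cluster on clearly separated blobs.
# Assumes scikit-learn >= 0.18, where BayesianGaussianMixture replaced DPGMM.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(0)
# Two well-separated Gaussian blobs in 2D.
X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + 10.0])
X_scaled = MinMaxScaler().fit_transform(X)  # rescale each feature to [0, 1]

bgmm = BayesianGaussianMixture(
    n_components=5,  # upper bound; the DP prior prunes unused components
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X_scaled)

labels = bgmm.predict(X_scaled)
n_clusters = len(np.unique(labels))
print(n_clusters)  # expect more than one cluster, unlike the old DPGMM
```

With the old DPGMM, the reported bug would make `n_clusters` come out as 1 on the scaled data; the test passes only if the new implementation recovers multiple clusters.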

@fringsoo

@tguillemot Is there a usable new DPGMM version now? I tried your GSoC BayesianMixture version and it seems something is wrong with the E step (the lower bound does not update, so it always finishes after only one iteration).

@tguillemot
Contributor

tguillemot commented Aug 30, 2016

@fringsoo Not yet for the DPGMM, but we will merge BayesianMixture soon (#6651).
I haven't seen a problem with the update of the lower bound and MinMaxScaler.
Can you send me the code to reproduce your bug?

@fringsoo

@tguillemot Sorry, I forgot to mention that my data is quite high dimensional (around 300 features). In that case, if the number of samples is low, the DPGMM does not work well. The code itself has no problem. Thanks.

@tguillemot
Contributor

@fringsoo Ok

@amueller amueller modified the milestone: 0.19 Sep 29, 2016
@amueller
Member

Should be fixed with the new BayesianMixture. Please give it a shot @fringsoo.
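For anyone migrating from the deprecated class, a rough sketch of the mapping follows. The constructor arguments shown are assumptions about how the old DPGMM settings carry over (in particular, treating weight_concentration_prior as the analogue of DPGMM's alpha), not an official migration guide:

```python
# Hypothetical migration sketch: replacing the deprecated DPGMM with
# BayesianGaussianMixture (scikit-learn >= 0.18).
from sklearn.mixture import BayesianGaussianMixture

# Old (removed): sklearn.mixture.DPGMM(n_components=5, alpha=1.0)
model = BayesianGaussianMixture(
    n_components=5,  # upper bound on the number of components
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,  # plays the role of DPGMM's alpha
    random_state=0,
)
# model.fit(X) / model.predict(X) as with any scikit-learn mixture model
```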
