new feature: add clusterQR method to 'kmeans' and 'discretize' in spectral clustering #12164

Closed
lobpcg opened this issue Sep 26, 2018 · 10 comments · Fixed by #21148

@lobpcg
Contributor

lobpcg commented Sep 26, 2018

Description

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.spectral_clustering.html generates clustering labels using one of two methods, selected by assign_labels = 'kmeans' or 'discretize', from the embedding computed by diffusion_map in scikit-learn/sklearn/manifold/spectral_embedding_.py

There is a nice, simple new algorithm called clusterQR, described in https://github.com/asdamle/QR-spectral-clustering, that gives 100% correct results in https://doi.org/10.1109/HPEC.2017.8091045 or https://arxiv.org/abs/1708.07481. clusterQR costs about the same as or less than 'kmeans' and 'discretize', but may be expected to outperform both when the number of clusters is not small.

I suggest adding clusterQR to the scikit-learn code base. The function itself is <10 lines, plus a few changes in the documentation and in the spectral clustering function that calls it, so the extra maintenance effort is tiny. It may become the new default instead of kmeans, since it produces better quality partitions at a similar memory footprint and compute time.
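To make the "<10 lines" claim concrete, here is a minimal sketch of the clusterQR idea as I understand it from the referenced paper and repository: a column-pivoted QR factorization selects one representative point per cluster, and the polar factor of those points rotates the embedding so that each cluster aligns with a coordinate axis. The function name `cluster_qr_sketch` and the toy data are illustrative assumptions, not code from the proposed PR.

```python
import numpy as np
from scipy.linalg import qr, svd


def cluster_qr_sketch(vectors):
    """Assign a cluster label to each row of an (n_samples, k) embedding.

    Hypothetical helper for illustration only.
    """
    k = vectors.shape[1]
    # Column-pivoted QR of the transposed embedding picks k well-separated rows.
    _, _, piv = qr(vectors.T, pivoting=True)
    # Polar factor of the selected rows: the orthogonal matrix closest to them.
    ut, _, v = svd(vectors[piv[:k], :].T)
    # Rotate the embedding so each cluster lines up with one coordinate axis.
    rotated = np.abs(vectors @ (ut @ v.conj()))
    # The largest coordinate of each rotated row is its cluster label.
    return rotated.argmax(axis=1)


# Toy embedding (an assumption): 3 clusters of 10 points behind a random rotation.
rng = np.random.default_rng(0)
true_labels = np.repeat(np.arange(3), 10)
rotation, _ = np.linalg.qr(rng.standard_normal((3, 3)))
embedding = np.eye(3)[true_labels] @ rotation
embedding += 0.01 * rng.standard_normal(embedding.shape)

labels = cluster_qr_sketch(embedding)
```

The returned labels match the true grouping only up to a permutation of the label values, which is the usual convention for clustering output.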

Steps/Code to Reproduce

N/A

Expected Results

clusterQR available

Actual Results

clusterQR not available

Versions

the most recent

@jnothman
Member

This doesn't meet our basic criteria for inclusion of stable and mature algorithms. What makes you think it is worth our while to maintain an implementation of this? What are the chances that this will remain a canonical approach in 5 years' time?

@ogrisel
Member

ogrisel commented Sep 27, 2018

+1 for making a prototype Python implementation outside of the scikit-learn code base and running some benchmarks. If the results are as good as expected this could be contributed to http://contrib.scikit-learn.org/ with proper tests and documentation.

Then later once this method meets the scikit-learn basic criteria for inclusion, we can discuss merging it upstream into scikit-learn.

@lobpcg
Contributor Author

lobpcg commented Sep 30, 2018

OK, I have made all the needed changes in the fork https://github.com/lobpcg/scikit-learn/tree/clusterQR and opened PR #12316

The actual changes are just a few lines in only 3 core files: spectral.py, test_spectral.py, and plot_coin_segmentation.py. That hardly justifies creating a brand new separate project at http://contrib.scikit-learn.org/ ...

It appears that 'clusterQR', 'kmeans', and 'discretize' all work nearly the same way when the number of clusters is small, as in plot_cluster_comparison.py, but may differ considerably when the number of clusters is over 20, as in plot_coin_segmentation.py. I have also tested 'clusterQR' vs. 'kmeans' and 'discretize' in plot_cluster_comparison.py: the resulting clusters are essentially the same, so I am unsure whether it is worth adding this comparison to the scikit-learn code base, and I have not uploaded it to my fork.

In the plot_coin_segmentation.py example, the new method 'clusterQR' appears to give better segmentation than 'kmeans' and 'discretize':

[Images: sphx_glr_plot_coin_segmentation_001, sphx_glr_plot_coin_segmentation_002, sphx_glr_plot_coin_segmentation_003]
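A comparison of the label-assignment strategies on a small problem could be sketched as follows. The dataset, cluster centers, and scoring metric are assumptions chosen for illustration, not the coin segmentation example. I also assume that released scikit-learn versions spell the option 'cluster_qr' rather than the 'clusterQR' used in the fork, so the loop skips it if the installed version rejects that value.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Four well-separated blobs (an assumed toy dataset).
X, y = make_blobs(
    n_samples=200,
    centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for method in ("kmeans", "discretize", "cluster_qr"):
    model = SpectralClustering(
        n_clusters=4, affinity="rbf", assign_labels=method, random_state=0
    )
    try:
        predicted = model.fit_predict(X)
    except ValueError:
        # The installed version does not accept this assign_labels value.
        continue
    # Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
    scores[method] = adjusted_rand_score(y, predicted)

for method, score in scores.items():
    print(f"{method}: ARI = {score:.3f}")
```

With only a few clusters, the strategies are expected to agree closely, which is consistent with the observation above about plot_cluster_comparison.py; differences would only be expected to show up with many clusters.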

@lobpcg
Contributor Author

lobpcg commented Oct 12, 2018

@jnothman @ogrisel Coding for this issue is complete. Please see #12316 and respond.

@FTB-B

FTB-B commented Jan 6, 2021

I am using scikit-learn spectral clustering for my clustering problem, with the following configuration:

import psutil
import sklearn.cluster

clustering = sklearn.cluster.SpectralClustering(
    n_clusters=number_clusters,
    affinity="cosine",
    assign_labels="clusterQR",
    eigen_solver="lobpcg",
    n_jobs=psutil.cpu_count(),
).fit(embedding_matrix)

but I get the error:

  File "/data/fatemeh/mem2Vec/kym_meme/scikit-learn-clusterQR/sklearn/cluster/_spectral.py", line 559, in fit
    assign_labels=self.assign_labels)
  File "/data/fatemeh/mem2Vec/kym_meme/scikit-learn-clusterQR/sklearn/cluster/_spectral.py", line 301, in spectral_clustering
    eigen_tol=eigen_tol, drop_first=False)
  File "/data/fatemeh/mem2Vec/kym_meme/scikit-learn-clusterQR/sklearn/manifold/_spectral_embedding.py", line 339, in spectral_embedding
    largest=False, maxiter=2000)
  File "/home/ftahmas/venv/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/lobpcg/lobpcg.py", line 489, in lobpcg
    activeBlockVectorAR = A(activeBlockVectorR)
  File "/home/ftahmas/venv/lib/python3.6/site-packages/scipy/sparse/linalg/interface.py", line 387, in __call__
    return self*x
  File "/home/ftahmas/venv/lib/python3.6/site-packages/scipy/sparse/linalg/interface.py", line 390, in __mul__
    return self.dot(x)
  File "/home/ftahmas/venv/lib/python3.6/site-packages/scipy/sparse/linalg/interface.py", line 420, in dot
    % x)
ValueError: expected 1-d or 2-d array or matrix, got array(None, dtype=object)

When I use affinity="rbf", it works without error!

Any idea why?

@lobpcg
Contributor Author

lobpcg commented Jan 6, 2021

Yes, I have seen this error. Please make sure that your scipy is the latest stable version and let me know if the problem still persists.

@FTB-B

FTB-B commented Jan 6, 2021

Thanks for the reply. Yes, my scipy version is fine: it is 1.4.1. I uninstalled and reinstalled it, and I still see the error!

@lobpcg
Contributor Author

lobpcg commented Jan 6, 2021

The latest is 1.6.0 https://www.scipy.org/
Please make sure that this is what you run.

@FTB-B

FTB-B commented Jan 7, 2021

Thanks for your reply. I upgraded scipy to 1.6.0, but I still have the same issue. Somehow eigen_solver='lobpcg' doesn't work when affinity='cosine', and I don't know why. Each works with other options, but not with the other at the same time.

@lobpcg
Contributor Author

lobpcg commented Jan 7, 2021

I could investigate if you provide a reproducible example, please.

The issue should not be related to clusterQR, so please run with a different, already available labeling function and submit a formal bug report with a ping to me.

Cosine similarity may produce degenerate matrices with high-dimensional eigenspaces, which cause lobpcg to fail because it runs out of room to generate new approximations in the Krylov subspace. Just don't use cosine similarity - it is bad.
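One way such degeneracy can arise is sketched below. A cosine similarity matrix is the Gram matrix of the unit-normalized rows, so its rank cannot exceed the number of features, and a normalized Laplacian built from it has the eigenvalue 1 repeated many times. The data, sizes, and tolerance here are assumptions, and the Laplacian is built by hand, so it is not necessarily the exact matrix scikit-learn passes to lobpcg (which may, for instance, treat the diagonal differently).

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 50, 3
# Non-negative features keep every cosine similarity (and every degree) positive.
X = np.abs(rng.standard_normal((n_samples, n_features)))

# Cosine similarity is the Gram matrix of the unit-normalized rows,
# so its rank is at most n_features.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
affinity = X_unit @ X_unit.T
rank = np.linalg.matrix_rank(affinity)

# Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}.
inv_sqrt_deg = 1.0 / np.sqrt(affinity.sum(axis=1))
laplacian = np.eye(n_samples) - inv_sqrt_deg[:, None] * affinity * inv_sqrt_deg[None, :]

# Every direction in the null space of the low-rank term is an eigenvector
# of L with eigenvalue exactly 1, giving one large degenerate eigenspace.
eigenvalues = np.linalg.eigvalsh(laplacian)
n_repeated = int(np.sum(np.abs(eigenvalues - 1.0) < 1e-8))

print(f"rank of cosine affinity: {rank} (n_features = {n_features})")
print(f"eigenvalues equal to 1: {n_repeated} of {n_samples}")
```

This only illustrates the low-rank mechanism; it does not reproduce the ValueError reported above.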
