Description
Given a simple affinity matrix in sparse form (row_id, col_id, val) like this:
[ (1, 2, 100), (1, 3, 100), (2, 3, 100), (3, 4, 1), (4, 5, 100), (4, 6, 100), (5, 6, 100) ]
Note that this is only half of the full matrix, shown to illustrate the relationships between nodes. It is symmetrized before being used as input in the code.
This affinity matrix means there is strong affinity within the group (1, 2, 3) and within the group (4, 5, 6), and only a weak relationship between the two groups.
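To make the setup concrete, here is a small sketch that builds the full symmetric matrix from the half-matrix triplets above (the names triplets and affinity are mine, not from any library):

```python
import numpy as np

# Half-matrix triplets (row_id, col_id, val) from the issue, 1-based node ids.
triplets = [(1, 2, 100), (1, 3, 100), (2, 3, 100),
            (3, 4, 1), (4, 5, 100), (4, 6, 100), (5, 6, 100)]

affinity = np.zeros((6, 6))
for i, j, val in triplets:
    affinity[i - 1, j - 1] = val  # convert 1-based ids to 0-based indices
    affinity[j - 1, i - 1] = val  # mirror the entry to symmetrize the matrix
```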
So when using spectral clustering to cluster these 6 nodes into 2 groups, it is expected to return (1, 2, 3), (4, 5, 6) most of the time. However, that is not the case for the spectral clustering in sklearn: the result is very unstable.
I investigated its source code and found that it can be improved in two ways.
When calling spectral_embedding, the drop_first parameter is currently set to False:
scikit-learn/sklearn/cluster/spectral.py
Lines 260 to 266 in ec691e9
    # The first eigen vector is constant only for fully connected graphs
    # and should be kept for spectral clustering (drop_first = False)
    # See spectral_embedding documentation.
    maps = spectral_embedding(affinity, n_components=n_components,
                              eigen_solver=eigen_solver,
                              random_state=random_state,
                              eigen_tol=eigen_tol, drop_first=False)
I think drop_first should be True, because the first eigenvector corresponds to eigenvalue 0. This eigenvector is meaningless: in physics it represents translational motion of the whole system, and all of its elements are identical. Keeping it affects the result of the subsequent k-means.
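This claim can be checked directly on a small example. The sketch below (my own illustration, using the unnormalized Laplacian L = D - A of a hypothetical 3-node complete graph) shows that the eigenvector for eigenvalue 0 of a connected graph is constant, so it carries no cluster information for k-means:

```python
import numpy as np

# Hypothetical small connected graph: a 3-node complete graph.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)

# Unnormalized graph Laplacian L = D - A.
L = np.diag(A.sum(axis=1)) - A

eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
v0 = eigvecs[:, 0]  # eigenvector for the smallest eigenvalue (which is 0)
# For a connected graph, v0 has identical entries in every component,
# so including it only adds a constant, uninformative feature.
```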
By simply setting drop_first to True and using only 1 component, spectral clustering always gives the right answer:
labels = spectral_clustering(affinity, n_clusters=2, n_components=1)
However, that is not enough; the algorithm should be robust and give the right answer no matter how many components are used.
In the original spectral clustering, all eigenvectors have equal weight. However, each eigenvector can be viewed as a motion pattern in physics, with its eigenvalue related to the frequency. So the eigenvectors should be weighted like this:
lambdas, diffusion_map = eigsh(laplacian, k=n_components,
                               sigma=1.0, which='LM',
                               tol=0, v0=v0)
lambdas = lambdas[n_components::-1]
embedding = diffusion_map.T[n_components::-1]
# Scale each eigenvector by sqrt(1 / |eigenvalue|), so that low-frequency
# (small-eigenvalue) modes receive larger weight.  Requires `import math`.
for idx, ei_val in enumerate(lambdas):
    if abs(ei_val) > 1e-5:
        embedding[idx] = embedding[idx] * math.sqrt(abs(1 / ei_val))
The first eigenvector, with eigenvalue 0, will be dropped as mentioned before.
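Putting both suggestions together, here is a minimal self-contained NumPy sketch (my own code, not sklearn's: it uses a dense eigh on the unnormalized Laplacian instead of eigsh, and the function name weighted_embedding is hypothetical) that drops the trivial eigenvector and applies the proposed weighting, then recovers the two groups on the toy affinity from this issue:

```python
import numpy as np

def weighted_embedding(affinity, n_components):
    # Sketch of the proposal: unnormalized Laplacian, drop the trivial
    # (constant) eigenvector, weight the rest by sqrt(1 / |lambda|).
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # ascending eigenvalues
    embedding = []
    for k in range(1, n_components + 1):  # skip index 0: eigenvalue ~0
        vec = eigvecs[:, k].copy()
        if abs(eigvals[k]) > 1e-5:
            vec *= np.sqrt(1.0 / abs(eigvals[k]))
        embedding.append(vec)
    return np.array(embedding)  # shape (n_components, n_nodes)

# Toy affinity from the issue: two tight triangles joined by one weak edge.
triplets = [(1, 2, 100), (1, 3, 100), (2, 3, 100),
            (3, 4, 1), (4, 5, 100), (4, 6, 100), (5, 6, 100)]
affinity = np.zeros((6, 6))
for i, j, val in triplets:
    affinity[i - 1, j - 1] = affinity[j - 1, i - 1] = val

emb = weighted_embedding(affinity, 2)
labels = emb[0] > 0  # the sign of the Fiedler vector separates the groups
```

With this weighting the first (Fiedler) component dominates, so the sign split recovers (1, 2, 3) versus (4, 5, 6) regardless of how many components are kept.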
With these two improvements, spectral clustering now always gives the right answer in my tests.
However, I'm sorry that it may be beyond my ability to offer an ideal pull request for these improvements, so I am sharing my ideas as suggestions.