Two improvement suggestions for spectral clustering #10736

@sky88088

Description

Given a simple affinity matrix in sparse form (row_id, col_id, val) like this:
[ (1, 2, 100), (1, 3, 100), (2, 3, 100), (3, 4, 1), (4, 5, 100), (4, 6, 100), (5, 6, 100) ]

Note that this lists only half of the full matrix, to show the relationships between nodes; it is symmetrized before being used as input in the code.

This affinity matrix means that there is strong affinity within group (1, 2, 3) and within group (4, 5, 6), and only a weak relationship between the two groups.

So when using spectral clustering to cluster these 6 nodes into 2 groups, it is expected to give (1, 2, 3), (4, 5, 6) most of the time. However, that is not the case for the spectral clustering in sklearn: the result is quite unstable.
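For reproducibility, here is a minimal sketch of the setup (node ids shifted to 0-based indexing, triplets mirrored to form the symmetric matrix; `random_state` is pinned only so that this single run is deterministic):

```python
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.cluster import spectral_clustering

# The triplets from the issue, shifted to 0-based node ids.
triplets = [(1, 2, 100), (1, 3, 100), (2, 3, 100),
            (3, 4, 1), (4, 5, 100), (4, 6, 100), (5, 6, 100)]
rows, cols, vals = zip(*[(r - 1, c - 1, v) for r, c, v in triplets])

# Mirror the upper triangle so the affinity matrix is symmetric.
affinity = coo_matrix((vals + vals, (rows + cols, cols + rows)), shape=(6, 6))

labels = spectral_clustering(affinity, n_clusters=2, random_state=0)
print(labels)
```

Without a fixed `random_state`, repeated runs can assign the nodes differently, which is the instability described above.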

I investigated its source code and found that it can be improved in two ways.

When using spectral_embedding, the drop_first variable is currently set to False:

# The first eigen vector is constant only for fully connected graphs
# and should be kept for spectral clustering (drop_first = False)
# See spectral_embedding documentation.
maps = spectral_embedding(affinity, n_components=n_components,
                          eigen_solver=eigen_solver,
                          random_state=random_state,
                          eigen_tol=eigen_tol, drop_first=False)

I think drop_first should be True, because the first eigenvector corresponds to eigenvalue 0. This eigenvector is meaningless: in physics it represents translational motion of the whole system, and all of its elements are identical. Keeping it affects the result of the subsequent k-means.
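This claim is easy to check on the toy graph. A quick sketch (using the unnormalized graph Laplacian, for which eigenvalue 0 corresponds to a constant eigenvector on a connected graph):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import laplacian

triplets = [(1, 2, 100), (1, 3, 100), (2, 3, 100),
            (3, 4, 1), (4, 5, 100), (4, 6, 100), (5, 6, 100)]
rows, cols, vals = zip(*[(r - 1, c - 1, v) for r, c, v in triplets])
affinity = coo_matrix((vals + vals, (rows + cols, cols + rows)), shape=(6, 6))

# Unnormalized graph Laplacian of the (connected) toy graph above.
lap = laplacian(affinity).toarray().astype(float)
lambdas, vectors = np.linalg.eigh(lap)  # eigenvalues in ascending order

print(abs(lambdas[0]) < 1e-8)                          # smallest eigenvalue is 0
print(np.allclose(vectors[:, 0], vectors[0, 0]))       # its eigenvector is constant
```

Every entry of the first eigenvector is identical, so it carries no information for separating the two groups.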

By simply setting drop_first to True and using only 1 component, spectral clustering always gives the right answer:

labels = spectral_clustering(affinity, n_clusters=2, n_components=1)

However, that is not enough: the algorithm should be robust and give the right answer no matter how many components are used.

In the original spectral clustering, all eigenvectors have equal weight. However, each eigenvector can be viewed as a motion pattern in physics, and its eigenvalue is related to that pattern's frequency. So the eigenvectors should be given weights like this:

lambdas, diffusion_map = eigsh(laplacian, k=n_components,
                               sigma=1.0, which='LM',
                               tol=0, v0=v0)

lambdas = lambdas[n_components::-1]
embedding = diffusion_map.T[n_components::-1]
for idx, ei_val in enumerate(lambdas):
    if abs(ei_val) > 0.00001:
        embedding[idx] = embedding[idx] * math.sqrt(abs(1 / ei_val))

The first eigenvector, with eigenvalue 0, will be dropped, as I mentioned before.
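A self-contained sketch of both suggestions together on the toy graph (using the unnormalized Laplacian and a dense eigensolver for simplicity; this is an illustration, not sklearn's internal code):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

triplets = [(1, 2, 100), (1, 3, 100), (2, 3, 100),
            (3, 4, 1), (4, 5, 100), (4, 6, 100), (5, 6, 100)]
rows, cols, vals = zip(*[(r - 1, c - 1, v) for r, c, v in triplets])
affinity = coo_matrix((vals + vals, (rows + cols, cols + rows)), shape=(6, 6))

lap = laplacian(affinity).toarray().astype(float)
lambdas, vectors = np.linalg.eigh(lap)  # eigenvalues in ascending order

# Suggestion 1: drop the first eigenvector (eigenvalue 0).
# Suggestion 2: weight each remaining eigenvector by sqrt(1 / eigenvalue),
# so low-frequency "motion patterns" dominate the k-means step.
n_components = 3
kept = slice(1, 1 + n_components)
embedding = vectors[:, kept] * np.sqrt(1.0 / lambdas[kept])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
print(labels)
```

Here the weak-cut (Fiedler) eigenvector gets by far the largest weight, so k-means recovers the groups (1, 2, 3) and (4, 5, 6) regardless of how many extra components are kept.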

With these two improvements, spectral clustering always gives the right answer in my tests.

However, I'm sorry that it may be beyond my ability to offer a proper pull request for these improvements, so I am sharing my ideas as suggestions.
