Description
Given a simple affinity matrix in sparse form (row_id, col_id, val) like this:
[ (1, 2, 100), (1, 3, 100), (2, 3, 100), (3, 4, 1), (4, 5, 100), (4, 6, 100), (5, 6, 100) ]
Note that this is only half of the full matrix, shown to illustrate the relationships between nodes. It is symmetrized before being used as input in the code.
This affinity matrix means there is strong affinity within the group (1, 2, 3) and within the group (4, 5, 6), and only a weak relationship between the two groups.
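To make the setup concrete, here is a small sketch that builds the full symmetric matrix from the half-matrix triplets above (the names triplets and affinity are mine, not from any library):

```python
import numpy as np

# Half-matrix triplets (row_id, col_id, val) from the issue, 1-based node ids.
triplets = [(1, 2, 100), (1, 3, 100), (2, 3, 100),
            (3, 4, 1), (4, 5, 100), (4, 6, 100), (5, 6, 100)]

affinity = np.zeros((6, 6))
for i, j, val in triplets:
    affinity[i - 1, j - 1] = val  # convert 1-based ids to 0-based indices
    affinity[j - 1, i - 1] = val  # mirror the entry to symmetrize the matrix
```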
So when using spectral clustering to cluster these 6 nodes into 2 groups, it is expected to return (1, 2, 3), (4, 5, 6) most of the time. However, that is not the case for the spectral clustering in sklearn: the result is very unstable.
I investigated its source code and found that it can be improved in two ways.
When calling spectral_embedding, the drop_first parameter is currently set to False:
scikit-learn/sklearn/cluster/spectral.py
Lines 260 to 266 in ec691e9
    # The first eigen vector is constant only for fully connected graphs
    # and should be kept for spectral clustering (drop_first = False)
    # See spectral_embedding documentation.
    maps = spectral_embedding(affinity, n_components=n_components,
                              eigen_solver=eigen_solver,
                              random_state=random_state,
                              eigen_tol=eigen_tol, drop_first=False)
I think drop_first should be True, because the first eigenvector corresponds to eigenvalue 0. This eigenvector is meaningless: in physics it represents translational motion of the whole system, and all of its elements are identical. Keeping it affects the result of the subsequent k-means.
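This claim can be checked directly on a small example. The sketch below (my own illustration, using the unnormalized Laplacian L = D - A of a hypothetical 3-node complete graph) shows that the eigenvector for eigenvalue 0 of a connected graph is constant, so it carries no cluster information for k-means:

```python
import numpy as np

# Hypothetical small connected graph: a 3-node complete graph.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)

# Unnormalized graph Laplacian L = D - A.
L = np.diag(A.sum(axis=1)) - A

eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
v0 = eigvecs[:, 0]  # eigenvector for the smallest eigenvalue (which is 0)
# For a connected graph, v0 has identical entries in every component,
# so including it only adds a constant, uninformative feature.
```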
By simply setting drop_first to True and using only 1 component, spectral clustering always gives the right answer:
labels = spectral_clustering(affinity, n_clusters=2, n_components=1)
However, that is not enough; the algorithm should be robust and give the right answer no matter how many components are used.
In the original spectral clustering, all eigenvectors have equal weight. However, each eigenvector can be viewed as a motion pattern in physics, with its eigenvalue related to the frequency. So the eigenvectors should be weighted like this:
lambdas, diffusion_map = eigsh(laplacian, k=n_components,
                               sigma=1.0, which='LM',
                               tol=0, v0=v0)
lambdas = lambdas[n_components::-1]
embedding = diffusion_map.T[n_components::-1]
# Scale each eigenvector by sqrt(1 / |eigenvalue|), so that low-frequency
# (small-eigenvalue) modes receive larger weight.  Requires `import math`.
for idx, ei_val in enumerate(lambdas):
    if abs(ei_val) > 1e-5:
        embedding[idx] = embedding[idx] * math.sqrt(abs(1 / ei_val))
The first eigenvector, with eigenvalue 0, will be dropped as mentioned before.
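Putting both suggestions together, here is a minimal self-contained NumPy sketch (my own code, not sklearn's: it uses a dense eigh on the unnormalized Laplacian instead of eigsh, and the function name weighted_embedding is hypothetical) that drops the trivial eigenvector and applies the proposed weighting, then recovers the two groups on the toy affinity from this issue:

```python
import numpy as np

def weighted_embedding(affinity, n_components):
    # Sketch of the proposal: unnormalized Laplacian, drop the trivial
    # (constant) eigenvector, weight the rest by sqrt(1 / |lambda|).
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # ascending eigenvalues
    embedding = []
    for k in range(1, n_components + 1):  # skip index 0: eigenvalue ~0
        vec = eigvecs[:, k].copy()
        if abs(eigvals[k]) > 1e-5:
            vec *= np.sqrt(1.0 / abs(eigvals[k]))
        embedding.append(vec)
    return np.array(embedding)  # shape (n_components, n_nodes)

# Toy affinity from the issue: two tight triangles joined by one weak edge.
triplets = [(1, 2, 100), (1, 3, 100), (2, 3, 100),
            (3, 4, 1), (4, 5, 100), (4, 6, 100), (5, 6, 100)]
affinity = np.zeros((6, 6))
for i, j, val in triplets:
    affinity[i - 1, j - 1] = affinity[j - 1, i - 1] = val

emb = weighted_embedding(affinity, 2)
labels = emb[0] > 0  # the sign of the Fiedler vector separates the groups
```

With this weighting the first (Fiedler) component dominates, so the sign split recovers (1, 2, 3) versus (4, 5, 6) regardless of how many components are kept.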
With these two improvements, spectral clustering now always gives the right answer in my tests.
However, I'm sorry that it may be beyond my ability to offer an ideal pull request for these improvements, so I am sharing my ideas as suggestions.