[MRG] change spectral embedding eigen solver from amg to arpack #10720
Conversation
The kind of test you might consider: are there cases where solvers should be returning similar solutions and currently do not, but become similar with your patch?
@jnothman I did try such tests. However, spectral clustering in sklearn is currently unstable. I am rewriting it myself based on the sklearn version; that is why I investigated the source code and found these problems. Spectral embedding is the core part of spectral clustering. For example, given an affinity matrix like this: Indexing the nodes 1 to 6, it should obviously be divided into two clusters, (1, 2, 3) and (4, 5, 6), if you use spectral clustering.
However, sklearn's spectral clustering does not always give the right answer; the result depends on the input random_state. In my opinion, although random_state affects the k-means step of spectral clustering, the result should still be stable, because the affinity within (1, 2, 3) and within (4, 5, 6) is much stronger than the affinity between nodes from different clusters. Since I cannot get a stable result even for such a simple example, it is hard to write a standard test: I cannot tell whether a differing result indicates a bug. On the other hand, this mistake is not mainly a math problem, so I think it can be checked by logic, and a mathematical test may not be necessary.
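The matrix image from the comment above did not survive extraction, so the snippet below uses an assumed reconstruction of the 6-node example (zero-based node indices): strong weight-10 edges inside clusters (0, 1, 2) and (3, 4, 5), and one weak weight-1 bridge between nodes 2 and 3. It shows how the described instability would be exercised:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Assumed reconstruction of the affinity matrix described in the comment:
# two tightly connected triangles joined by a single weak edge (2 <-> 3).
affinity = np.array([
    [0, 10, 10, 0, 0, 0],
    [10, 0, 10, 0, 0, 0],
    [10, 10, 0, 1, 0, 0],
    [0, 0, 1, 0, 10, 10],
    [0, 0, 0, 10, 0, 10],
    [0, 0, 0, 10, 10, 0],
], dtype=float)

# The claim in the comment: the partition (0, 1, 2) vs (3, 4, 5) should come
# out the same for any random_state, since the structure dominates.
labels = SpectralClustering(
    n_clusters=2, affinity='precomputed', random_state=0
).fit_predict(affinity)
print(labels)
```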
@sky88088 what do you think about reorganizing the solver as in #10715 (comment)? We just need to be careful about when the diagonal is set in arpack if it is failing.
And we should add a couple of regression tests.
I took @jmargeta's code as a reference and refined the code in my PR. I don't have a clear view on whether we should separate solver selection from execution. Solver selection currently needs information such as n_nodes and n_components, and if a new solver is added in the future, other information may be needed as well. If solver selection stays in the execution part, it can always use all of that information, so I just left it as is. I'm not good at code design, so I did it in a simple way.
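Not advocating a particular design, but a minimal sketch of what separating selection from execution could look like; the function name and signature are hypothetical, and only the 5 * n_components threshold is taken from this thread:

```python
def select_eigen_solver(requested, n_nodes, n_components):
    # Hypothetical selection step, kept apart from execution. Whatever a
    # future solver needs (n_nodes, n_components, ...) is passed in here,
    # so adding a solver only extends this one function's inputs.
    if requested is None:
        requested = 'arpack'
    if requested == 'amg' and n_nodes < 5 * n_components:
        # amg is not worthwhile on very small graphs; fall back to arpack.
        return 'arpack'
    return requested
```

The execution code would then branch only on the returned name, never re-deciding midway.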
This pull request fixes 1 alert when merging 7f5e1df into e161700 - view on lgtm.com. (Comment posted by lgtm.com)
@sky88088
@sky88088 Are you going to make any changes, or should we ask a contributor to fix this PR?
@glemaitre I have little experience writing regression tests for Python projects, so it would be great if a contributor could help me fix this PR.
Are you able to put together some code that fails in master but would succeed in this PR? |
@jnothman The following example code meets the condition n_nodes < 5 * n_components, and I think it could serve as a test:

```python
from scipy.sparse import coo_matrix
from sklearn.manifold import spectral_embedding


def gen_input(size):
    # Two dense blocks of nodes joined by a single weak edge; the matrix is
    # made symmetric by the coo_matrix construction below.
    a = []
    b = []
    v = []
    for i in range(size):
        for j in range(i + 1, size):
            if i <= size // 2 - 1 and j <= size // 2 - 1:
                a.append(i)
                b.append(j)
                v.append(10)
            elif i >= size // 2 and j >= size // 2:
                a.append(i)
                b.append(j)
                v.append(1)
            elif i == size // 2 - 1 and j == size // 2:
                a.append(i)
                b.append(j)
                v.append(1)
    return coo_matrix((v + v, (a + b, b + a)), shape=(size, size))


if __name__ == '__main__':
    n_nodes = 6
    n_components = 4
    affinity = gen_input(n_nodes)
    print(affinity.todense())
    for eigen_solver in ('arpack', 'amg'):
        print('##### %s #####' % eigen_solver)
        print(spectral_embedding(affinity, n_components=n_components,
                                 eigen_solver=eigen_solver, drop_first=False))
```
Rather than calling arpack when n_nodes < 5 * n_components, it may be faster just to call the dense solver.
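For illustration, a hedged sketch of what that dense-solver path could look like; the helper name is hypothetical and this is not scikit-learn's actual implementation, just `scipy.linalg.eigh` on the normalized graph Laplacian:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian as csgraph_laplacian


def small_graph_embedding(affinity, n_components):
    """Hypothetical dense-solver embedding for small graphs."""
    lap, dd = csgraph_laplacian(affinity, normed=True, return_diag=True)
    if hasattr(lap, "todense"):        # accept sparse affinities too
        lap = np.asarray(lap.todense())
    # eigh returns eigenvalues in ascending order, so the first n_components
    # eigenvectors span the embedding (including the trivial first one).
    vals, vecs = eigh(lap)
    return vecs[:, :n_components]
```

For a handful of nodes, a full dense eigendecomposition avoids ARPACK's iteration and convergence issues entirely.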
scipy/scipy#9650 is now merged to master in scipy and added to the 1.3.0 milestone. It should take care of this issue, with no changes needed in sklearn. This issue can probably be closed.
@amueller No, it is not expected. Moreover, eigen_solver=lobpcg gives a correct result, different from eigen_solver=amg, although it runs the same code, just with an extra parameter. This requires investigation, but is surely not a good reason to change the default solver, since in this specific test amg is actually just calling eigh, since the problem size is way too small for real amg. This really looks like a silly bug in case
is surely wrong now on scipy 1.3.0. Someone should have a close look and fix it, adding a few unit tests to check the logic in case
@lobpcg thank you for your analysis. Yes, I wasn't really advocating for the solution proposed here, just wondering if there is still an issue. And it seems like there is still an issue in our code.
Just as a note for future self, an obvious mismatch is that
gives the expected result, but using
Maybe suggest it as a unit test for #13393?
OMG THERE IS A TERRIBLE BUG in the convoluted logic that you pointed out. |
Isn't that AMG issue already addressed, so that this can now be closed?
Fixes #10715
This patch changes the spectral embedding eigen_solver variable from amg to arpack when the number of nodes is low.
The original code called arpack to work around the amg bug, but did not change this variable. As a result, the arpack embedding was discarded and a new embedding was computed, still with the amg solver.
Since the Laplacian has already been transformed in the arpack branch, I think that new embedding is incorrect.
Please note that I haven't written a test, because the patch is simple and I don't know what the standard output of spectral embedding should be.
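The failure mode described in this PR can be reduced to a general pattern, sketched below as a toy model (not scikit-learn code): a fallback branch transforms shared state in place, but the solver variable is never reassigned, so the later branch re-solves on the mutated object. Here "shifting the diagonal" stands in for the transform the arpack branch applies to the Laplacian:

```python
import numpy as np


def buggy_pipeline(lap):
    solver = 'amg'
    if lap.shape[0] < 20:                            # small-graph fallback
        np.fill_diagonal(lap, lap.diagonal() + 1.0)  # in-place transform
        # BUG: this branch's result is discarded and `solver` stays 'amg'
    if solver == 'amg':
        return lap              # "re-solves" on the mutated Laplacian
    return lap


def fixed_pipeline(lap):
    solver = 'amg'
    if lap.shape[0] < 20:
        solver = 'arpack'       # the patch: reassign the solver variable
    if solver == 'amg':
        np.fill_diagonal(lap, lap.diagonal() + 1.0)
    return lap                  # small graphs reach arpack untransformed
```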