[MRG+1] Optimize sklearn.manifold._graph_is_connected #5443
_, node_to_add = np.where(graph[connected_components_matrix] != 0)
connected_components_matrix[node_to_add] = True
if last_num_component >= connected_components_matrix.sum():
nodes_to_add = np.zeros(shape=(graph.shape[0]), dtype=np.bool)
I am wondering whether the following would be slightly more efficient (it probably depends on the malloc used by numpy):
n_node = graph.shape[0]
nodes_to_add = np.empty(shape=n_node, dtype=np.uint8)
for i in range(n_node):
    nodes_to_add.fill(0)
    ...
I had exactly the same thought. My gut feeling is that it's better so I'll change it.
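To make the suggestion above concrete, here is a minimal sketch of the two allocation patterns side by side. The function names and the `rows` argument are hypothetical and exist only for illustration. The sketch only shows that both patterns compute the same thing; it makes no claim about which is faster, since that depends on the allocator. It uses the builtin `bool` as dtype because the `np.bool` alias is not available in every NumPy release.

```python
import numpy as np


def count_marks_fresh(rows, n_node):
    """Allocate a new zeroed buffer on every iteration (np.zeros in the loop)."""
    counts = []
    for row in rows:
        nodes_to_add = np.zeros(n_node, dtype=bool)
        nodes_to_add[row] = True
        counts.append(int(nodes_to_add.sum()))
    return counts


def count_marks_reused(rows, n_node):
    """Allocate the buffer once and reset it in place on every iteration."""
    nodes_to_add = np.empty(n_node, dtype=bool)  # contents undefined until fill()
    counts = []
    for row in rows:
        nodes_to_add.fill(False)  # reset without a new allocation
        nodes_to_add[row] = True
        counts.append(int(nodes_to_add.sum()))
    return counts
```

The second variant must call `fill` before every use, because `np.empty` leaves the buffer uninitialized.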
Other than that, the bench looks convincing. If the tests are green on both travis and appveyor, +1 on my side. @rudimeier can you please test this and tell us if it solves your original problem on your dataset?
Can you add an explicit test for graph_connect_component in manifold.tests.test_spectral_embedding.py:test_spectral_embedding_two_components? In this example you know the connected components, so you can test that the function works well.
@ogrisel gave his +1 and travis is happy. Merging!
…ected [MRG+1] Optimize sklearn.manifold._graph_is_connected
Did anyone test the runtime?
Fix #5024.
This is a naive fix where I use one temporary array to store the nodes to be explored and another to store the nodes to add in the current loop. This could probably be optimized by using a single integer array, but the code would be less intuitive and the memory saved would be very small.
I can switch to cython if needed, but the current proposal seems optimized enough. I did not add docs since the variable names are self-explanatory.
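The two-array approach described above can be sketched roughly as follows. This is an illustrative reconstruction for a dense NumPy adjacency matrix, not the code merged in this PR; the function and variable names are hypothetical, and sparse input is not handled.

```python
import numpy as np


def connected_component_sketch(graph, node_id):
    """Return a boolean mask of the nodes reachable from node_id.

    Assumes `graph` is a dense, symmetric 2-D array in which a nonzero
    entry graph[i, j] means that nodes i and j share an edge.
    """
    n_node = graph.shape[0]
    connected = np.zeros(n_node, dtype=bool)     # nodes reached so far
    to_explore = np.zeros(n_node, dtype=bool)    # frontier of the current round
    nodes_to_add = np.empty(n_node, dtype=bool)  # scratch buffer, reused each round
    to_explore[node_id] = True
    for _ in range(n_node):
        last_count = connected.sum()
        np.logical_or(connected, to_explore, out=connected)
        if last_count >= connected.sum():
            break  # the last round reached no new node
        nodes_to_add.fill(False)
        for i in np.flatnonzero(to_explore):
            # Collect every neighbour of the frontier node i.
            np.logical_or(nodes_to_add, graph[i] != 0, out=nodes_to_add)
        # The next frontier is the set of neighbours not yet reached.
        np.logical_and(nodes_to_add, ~connected, out=to_explore)
    return connected


def is_connected_sketch(graph):
    """A graph is connected if every node is reachable from node 0."""
    return bool(connected_component_sketch(graph, 0).sum() == graph.shape[0])
```

Memory use stays at a few boolean arrays of length `n_node`, regardless of how many edges the graph has.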
I tested it with a degenerate graph using the following code:
Before optimization, memory consumption looks like:


And after:
It is faster and consumes less memory.