Fix stopping criterion of _graph_connected_components #5713

Conversation

AlexandreAbraham
Contributor

The function didn't stop in the case of a cyclic graph. I restored the previous stopping criterion and kept the optimization, which only adds a small overhead on my box (50ms on 10 tries).

Related to #5639
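
For readers unfamiliar with the failure mode, here is a minimal sketch (not the code from this PR) of a frontier expansion whose stopping criterion is "no new node was added in this pass". The helper name `component_mask` and the small ring graph are hypothetical, chosen only for illustration, and a dense NumPy adjacency matrix is assumed.

```python
import numpy as np


def component_mask(graph, node_id):
    """Boolean mask of nodes reachable from node_id (illustrative helper)."""
    n_node = graph.shape[0]
    connected = np.zeros(n_node, dtype=bool)
    frontier = np.zeros(n_node, dtype=bool)
    frontier[node_id] = True
    for _ in range(n_node):
        before = connected.sum()
        connected |= frontier
        # Stop as soon as a pass adds no new node. On a cycle the frontier
        # in this sketch never becomes empty, so testing the frontier for
        # emptiness would not be a usable criterion here.
        if connected.sum() <= before:
            break
        # Next frontier: union of the neighbours of every frontier node.
        frontier = graph[frontier].any(axis=0)
    return connected


# Undirected 4-node ring 0-1-2-3-0 plus an isolated node 4.
ring = np.zeros((5, 5), dtype=bool)
for a, b in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    ring[a, b] = ring[b, a] = True

mask = component_mask(ring, 0)
```

In this sketch the loop is also bounded by `n_node` passes, so it terminates even if the break condition were never met.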

@giorgiop
Contributor

giorgiop commented Nov 5, 2015

Could you compare the runtime against master when you run manifold/tests/test_spectral_embedding and the whole code base?

@@ -47,17 +47,18 @@ def _graph_connected_component(graph, node_id):
nodes_to_explore = np.zeros(shape=(graph.shape[0]), dtype=np.bool)
nodes_to_explore[node_id] = True
n_node = graph.shape[0]
Contributor

Can this be moved to the first line, so that we avoid reading graph.shape again?

Contributor Author

Fixed.

@AlexandreAbraham
Contributor Author

Before any optimization:

$ nosetests sklearn/manifold/tests/test_spectral_embedding.py --pdb
/home/aa013911/mywork/scikit-learn/sklearn/__check_build/__init__.py:44: RuntimeWarning: compiletime version 2.6 of module 'sklearn.check_build._check_build' does not match runtime version 2.7
  from ._check_build import check_build
...S.....
----------------------------------------------------------------------
Ran 9 tests in 2.173s

OK (SKIP=1)

After first PR

$ nosetests sklearn/manifold/tests/test_spectral_embedding.py
/home/aa013911/mywork/scikit-learn/sklearn/__check_build/__init__.py:44: RuntimeWarning: compiletime version 2.6 of module 'sklearn.check_build._check_build' does not match runtime version 2.7
  from ._check_build import check_build
...S.....
----------------------------------------------------------------------
Ran 9 tests in 27.387s

OK (SKIP=1)

After fixing the stopping criterion

$ nosetests sklearn/manifold/tests/test_spectral_embedding.py
/home/aa013911/mywork/scikit-learn/sklearn/__check_build/__init__.py:44: RuntimeWarning: compiletime version 2.6 of module 'sklearn.check_build._check_build' does not match runtime version 2.7
  from ._check_build import check_build
...S......
----------------------------------------------------------------------
Ran 10 tests in 2.201s

OK (SKIP=1)

It is slightly slower, but it uses a fixed amount of memory, as opposed to the old version. I have tried using indices instead of boolean vectors and it is slightly slower (about 70ms), but I can switch to that option if needed.
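
The index-based variant mentioned above is not shown in the thread. The following is only a guess at the idea, sketched under the assumption of a dense NumPy adjacency matrix; `component_indices` and the toy graph are hypothetical names and data, not the code that was timed.

```python
import numpy as np


def component_indices(graph, node_id):
    """Indices of nodes reachable from node_id, with an index-array frontier."""
    n_node = graph.shape[0]
    visited = np.zeros(n_node, dtype=bool)
    visited[node_id] = True
    frontier = np.array([node_id])
    while frontier.size:
        # Neighbours of all frontier nodes, as a boolean mask.
        reached = (graph[frontier] != 0).any(axis=0)
        # Keep only nodes not seen before, as an index array.
        frontier = np.flatnonzero(reached & ~visited)
        visited[frontier] = True
    return np.flatnonzero(visited)


# Triangle 0-1-2 (a cycle) plus an isolated node 3.
g = np.zeros((4, 4))
for a, b in [(0, 1), (1, 2), (2, 0)]:
    g[a, b] = g[b, a] = 1

reachable_from_0 = component_indices(g, 0)
reachable_from_3 = component_indices(g, 3)
```

Here the loop ends when the frontier holds no unvisited node, so a cycle cannot keep it running: every pass either marks at least one new node or empties the frontier.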

@AlexandreAbraham
Contributor Author

@giorgiop this one should be good to merge.

@giorgiop
Contributor

giorgiop commented Nov 8, 2015

Did you try to see if we can completely avoid the allocation of the array nodes_to_explore?

@@ -42,23 +42,23 @@ def _graph_connected_component(graph, node_id):
belonging to the largest connected components of the given query
Contributor

connected_components in Returns

@giorgiop
Contributor

I am running this script to measure the gain in runtime:

import numpy as np
from time import time
from sklearn.manifold.spectral_embedding_ import _graph_connected_component


def perturb_graph(graph):
    m = graph.shape[0]
    graph[rng.randint(m, size=2 * m),
          rng.randint(m, size=2 * m)] = 1

n = 1000
rng = np.random.RandomState(12)

graph = np.zeros(shape=(n, n))

start = time()
for _ in range(20):
    perturb_graph(graph)  # perturb graph at random
    init = rng.randint(n)
    graph[0, init] = 1  # starting node
    _graph_connected_component(graph, init)
end = time() - start
print("time: %.3f" % end)
# Results:
#   on master:                                      time: 59.612
#   this branch:                                    time: 0.303
#   above + moving up the break condition:          time: 0.237
#   above + no fill `nodes_to_explore` with False:  time: 0.027

Here is the final code. I am still not sure whether we can completely avoid nodes_to_explore, but the speed-up should be fine already.

def _graph_connected_component(graph, node_id):
    n_node = graph.shape[0]
    connected_nodes = np.zeros(
        shape=(n_node), dtype=np.bool)
    nodes_to_explore = np.zeros(shape=(n_node), dtype=np.bool)
    nodes_to_explore[node_id] = True
    for _ in range(n_node):
        last_num_component = connected_nodes.sum()
        np.logical_or(connected_nodes,
                      nodes_to_explore,
                      out=connected_nodes)
        if last_num_component >= connected_nodes.sum():
            break
        indices = np.where(nodes_to_explore)[0]
        for i in indices:
            np.logical_or(np.zeros(shape=(n_node), dtype=np.bool), graph[i],
                          out=nodes_to_explore)
    return connected_nodes
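
One caveat when reading the snippet above: as posted, the inner loop appears to write `graph[i]` into `nodes_to_explore` on every iteration, so only the neighbours of the last index in `indices` survive the loop. A minimal sketch that accumulates the neighbours of all explored nodes instead is given below. It assumes a dense NumPy adjacency matrix and uses the builtin `bool` rather than the `np.bool` alias, which recent NumPy releases no longer provide. The name `graph_connected_component_sketch` and the test graph are hypothetical, and the timings reported above do not apply to this variant.

```python
import numpy as np


def graph_connected_component_sketch(graph, node_id):
    """Boolean mask of nodes reachable from node_id (illustrative variant)."""
    n_node = graph.shape[0]
    connected_nodes = np.zeros(n_node, dtype=bool)
    nodes_to_explore = np.zeros(n_node, dtype=bool)
    nodes_to_explore[node_id] = True
    for _ in range(n_node):
        last_num_component = connected_nodes.sum()
        np.logical_or(connected_nodes, nodes_to_explore, out=connected_nodes)
        # Stop when the last pass did not add any node.
        if last_num_component >= connected_nodes.sum():
            break
        indices = np.where(nodes_to_explore)[0]
        # Reset the frontier, then OR in the row of every explored node so
        # that no node's neighbours are overwritten by a later row.
        nodes_to_explore.fill(False)
        for i in indices:
            np.logical_or(nodes_to_explore, graph[i], out=nodes_to_explore)
    return connected_nodes


# Star centred on node 0 with leaves 1, 2, 3; leaf 1 has an extra
# neighbour 4; node 5 is isolated.
star = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (0, 3), (1, 4)]:
    star[a, b] = star[b, a] = 1

component = graph_connected_component_sketch(star, 0)
```

On this graph, node 4 is only reachable through leaf 1, which is not the last index explored in the second pass, so it is the kind of node an overwriting inner loop would miss.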

@giorgiop
Contributor

@ogrisel this one to close as well?

@amueller
Member

amueller commented Oct 8, 2016

closing as #6268 was merged.

@amueller amueller closed this Oct 8, 2016
5 participants