FIX Optics paper typo #13750


Merged: jnothman merged 8 commits into scikit-learn:master on Apr 30, 2019

Conversation

@qinhanmin2014 (Member, Author)

There's a typo in the OPTICS paper; see #13739 (comment), #13739 (comment).
Modify the existing test to serve as a regression test.
Can someone tell me how I can run tests on i686 with CI? (I hope this will fix the test failures.)
Remove test_cluster_hierarchy_ since it easily fails if we switch to another random_state, and I don't think it's reasonable.

@jnothman (Member) commented Apr 30, 2019 via email

@qinhanmin2014 (Member, Author)

I'd prefer to include this patch in the RC; not sure whether that's possible.

@adrinjalali (Member)

Any reason you're changing the tests? Are the old ones failing after this change?

    assert_array_equal(clust.labels_, expected_labels)


    def test_cluster_hierarchy_():
Member (inline review comment):

I actually think this is a test we need to have. We have nothing else testing the hierarchy.

@qinhanmin2014 (Member, Author):

Yes, but this test is not reasonable IMO. Some points in C2 will also satisfy -2 <= x <= 2 and -2 <= y <= 2, so I can't understand why the expected clusters should be np.array([[0, 99], [0, 199]]).
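
(For context: per the scikit-learn docs, each row of cluster_hierarchy_ is a [start, end] pair of inclusive indices into ordering_, so [[0, 99], [0, 199]] would denote a 100-point cluster nested inside a cluster covering all 200 ordered points.)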

@qinhanmin2014 (Member, Author):

Actually, the test easily fails if you change the random state when generating the dataset (the test doesn't set a random state, so it's difficult to reproduce).

@adrinjalali (Member)

I'm a -1 on touching the tests unless there's a good reason to do so. And those changes seem irrelevant to the purpose of this PR anyway.

@qinhanmin2014 (Member, Author)

> I'm a -1 on touching the tests unless there's a good reason to do so. And those changes seem irrelevant to the purpose of this PR anyway.

I need a regression test. This is why I'm modifying the test.
The original test (test_extract_xi) also passes.

@qinhanmin2014 (Member, Author)

I'm also trying to resolve the test failure here, e.g., using max_eps < np.inf when we try to detect outliers, and using v_measure_score instead of assert_array_equal when comparing clusters. I want to do my best to make i686 happy.
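
(For context: v_measure_score compares two clusterings up to a relabeling of the clusters, so it tolerates platform-dependent label ids where assert_array_equal would fail. A minimal illustration:

    import numpy as np
    from sklearn.metrics import v_measure_score

    a = np.array([0, 0, 1, 1, 2])
    b = np.array([2, 2, 0, 0, 1])  # same partition, different label ids
    # assert_array_equal(a, b) would fail, but as clusterings they agree:
    assert np.isclose(v_measure_score(a, b), 1)

)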

@qinhanmin2014 (Member, Author)

And I don't shuffle the data because the test will sometimes fail if we shuffle the data in a certain way, so shuffling seems redundant.

@qinhanmin2014 (Member, Author)

@adrinjalali I can revert all the changes in test_extract_xi if you dislike them, but I need to remove test_cluster_hierarchy_ (unless you prove that I'm wrong and can make the CIs happy). I'd prefer to hurry this change into the RC since it's not trivial IMO.

qinhanmin2014 added this to the 0.21 milestone on Apr 30, 2019
@adrinjalali (Member)

You can make the test "safer" by having a less sparse cluster as the outer one:

    import numpy as np
    from sklearn.cluster import OPTICS
    from sklearn.utils import shuffle

    rng = np.random.RandomState(0)
    n_points_per_cluster = 100
    C1 = [0, 0] + 2 * rng.randn(n_points_per_cluster, 2)
    C2 = [0, 0] + 50 * rng.randn(n_points_per_cluster, 2)
    X = np.vstack((C1, C2))
    X = shuffle(X, random_state=0)

    clusters = OPTICS(min_samples=20, xi=.1).fit(X).cluster_hierarchy_
    assert clusters.shape == (2, 2)
    diff = np.sum(clusters - np.array([[0, 99], [0, 199]]))
    assert diff / len(X) < 0.05

> And I don't shuffle the data because the test will sometimes fail if we shuffle the data in a certain way, so shuffling seems redundant.

Yes, that happens because the generated sample may end up with a different density than the background distribution used to generate it, and then the test may fail.

> Yes, but this test is not reasonable IMO. Some points in C2 will also satisfy -2 <= x <= 2 and -2 <= y <= 2, so I can't understand why the expected clusters should be np.array([[0, 99], [0, 199]]).

That's why I didn't make the test exact, and set a tolerance there.

@adrinjalali (Member)

> I'm also trying to resolve the test failure here, e.g., using max_eps < np.inf when we try to detect outliers, and using v_measure_score instead of assert_array_equal when comparing clusters. I want to do my best to make i686 happy.

Please revert those changes; we decided to skip that test instead of changing it to work around the issue. We can work on that issue later.

@qinhanmin2014 (Member, Author)

@adrinjalali do you have time to commit these changes? I'm offline and would like to hurry this one into the RC, thanks.

@adrinjalali (Member)

> @adrinjalali do you have time to commit these changes? I'm offline and would like to hurry this one into the RC, thanks.

will do :)

@qinhanmin2014 (Member, Author)

> The original test (test_extract_xi) also passes.

Apologies, the original tests don't pass; I set max_eps to a smaller value so the outliers can be correctly detected.

@jnothman (Member) left a review comment:

any way to test non-regression?

@@ -844,7 +844,7 @@ def _xi_cluster(reachability_plot, predecessor_plot, ordering, xi, min_samples,
             # Find the first index from the right side which is almost
             # at the same level as the beginning of the detected
             # cluster.
-            while (reachability_plot[c_end - 1] < D_max
+            while (reachability_plot[c_end - 1] > D_max
Member (inline review comment):

Can we have a comment "This corrects the paper, which uses >, to use < instead" or something?
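
(For context, a minimal sketch of how the corrected loop might read with such a comment. This is not scikit-learn's actual code: only reachability_plot, c_end, and D_max appear in the diff above; the U_start bound and the helper itself are assumptions for illustration.

    def trim_cluster_end(reachability_plot, c_end, U_start, D_max):
        # Find the first index from the right side which is almost at
        # the same level as the beginning of the detected cluster.
        # The fixed '>' trims points still *above* the start level
        # D_max; the paper's '<' would instead eat through the
        # low-reachability cluster interior. U_start is assumed to be
        # the start of the enclosing region.
        while (reachability_plot[c_end - 1] > D_max
               and c_end > U_start + 1):
            c_end -= 1
        return c_end

)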

@adrinjalali (Member)

@qinhanmin2014 yeah, I'm also working on it and I saw you did that. I suggest we remove the test_extract_xi test as is, and add a test which detects the outliers only (with max_eps=inf). As I see it, with your fix there are some points which are detected as outliers but aren't, which is kinda okay, but we should test that the true outliers are detected.

I know that sometimes those outliers are not detected correctly; that should be its own issue and be investigated IMO.

@adrinjalali (Member)

> any way to test non-regression?

The regression would be that test_extract_xi used to pass and now it doesn't :P (i.e., some non-outliers are labeled as -1).

@qinhanmin2014 (Member, Author)

> any way to test non-regression?

Previously I added such a test, but Adrin wanted the PR to be small so I removed it:

import numpy as np
from sklearn.cluster import OPTICS
from sklearn.metrics import v_measure_score
from sklearn.utils import shuffle

rng = np.random.RandomState(0)
n_points_per_cluster = 5
C1 = [-5, -2] + .8 * rng.randn(n_points_per_cluster, 2)
C2 = [4, -1] + .1 * rng.randn(n_points_per_cluster, 2)
C3 = [1, -2] + .2 * rng.randn(n_points_per_cluster, 2)
C4 = [-2, 3] + .3 * rng.randn(n_points_per_cluster, 2)
C5 = [3, -2] + .6 * rng.randn(n_points_per_cluster, 2)
C6 = [5, 6] + .2 * rng.randn(n_points_per_cluster, 2)

X = np.vstack((C1, C2, C3, C4, C5, np.array([[100, 100]]), C6))
expected_labels = np.r_[[0] * 5, [1] * 5, [2] * 5, [3] * 5, [2] * 5,
                        -1, [4] * 5]
X, expected_labels = shuffle(X, expected_labels, random_state=0)
clust = OPTICS(min_samples=3, min_cluster_size=3,
               max_eps=20, cluster_method='xi').fit(X)
assert np.isclose(v_measure_score(clust.labels_, expected_labels), 1)
assert np.array_equal(np.where(clust.labels_ == -1)[0],
                      np.where(expected_labels == -1)[0])
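
(A note on the two asserts: v_measure_score is invariant to relabeling and treats -1 as just another label, so the second check additionally pins down that exactly the expected points are flagged as noise.)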

@qinhanmin2014 (Member, Author)

It's not a trivial issue, because previously we erroneously detected many points in the last steep-up regions as outliers.

@qinhanmin2014 (Member, Author)

> @qinhanmin2014 yeah, I'm also working on it and I saw you did that. I suggest we remove the test_extract_xi test as is, and add a test which detects the outliers only (with max_eps=inf). As I see it, with your fix there are some points which are detected as outliers but aren't, which is kinda okay, but we should test that the true outliers are detected.

I'm confused by your comment; what do you mean?

> The regression would be that test_extract_xi used to pass and now it doesn't :P (i.e., some non-outliers are labeled as -1).

I don't think so. The issue is that now we're unable to label some outliers, because we no longer erroneously detect many points in the last steep-up regions as outliers (so we had to rely on max_eps).

@adrinjalali (Member)

I mean this:

import numpy as np
from sklearn.cluster import OPTICS
from sklearn.utils import shuffle
from sklearn.utils.testing import assert_array_equal

rng = np.random.RandomState(8)
n_points_per_cluster = 5

C1 = [-5, -2] + .8 * rng.randn(n_points_per_cluster, 2)
C2 = [4, -1] + .1 * rng.randn(n_points_per_cluster, 2)
C3 = [1, -2] + .2 * rng.randn(n_points_per_cluster, 2)
C4 = [-2, 3] + .3 * rng.randn(n_points_per_cluster, 2)
C5 = [3, -2] + .6 * rng.randn(n_points_per_cluster, 2)
C6 = [5, 6] + .2 * rng.randn(n_points_per_cluster, 2)

X = np.vstack((C1, C2, C3, C4, C5, np.array([[100, 100]]), C6))
expected_labels = np.r_[[2] * 5, [0] * 5, [1] * 5, [3] * 5, [1] * 5,
                        -1, [4] * 5]
X, expected_labels = shuffle(X, expected_labels, random_state=rng)

clust = OPTICS(min_samples=3, min_cluster_size=2,
               max_eps=np.inf, cluster_method='xi',
               xi=.3).fit(X)
print(clust.labels_[clust.ordering_])

gives this output:

[ 0  0  0  1  1  1  1  1  1  2  2  2  2  2  2 -1 -1 -1 -1 -1  3  3  3  3
  3  4  4  4  4  4 -1]

It's detecting the outlier as an outlier, but also some other points as outliers, which, as I understand it, is an effect of the change introduced in this PR. Also note that max_eps is inf in this example.

@adrinjalali (Member)

Anyhow, I'm happy to have this merged as is (the CI seems happy) and to open another issue regarding those outliers.

@jnothman (Member) left a review comment:

Can you explain why this PR introduces that change?

I had thought this would induce larger clusters: it requires the end point to be just under the start point in reachability, rather than being the minimum point in the identified region.

Is it possible that previously we were rejecting many clusters with the min_cluster_size criterion, and now we are not, so we are producing more small clusters?
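
(To make the direction of the change concrete, a toy sketch, assuming the loop shape shown in the diff above; not scikit-learn code, and U_start is an assumed name:

    from operator import gt, lt

    # Toy reachability plot: the cluster starts at level 0.9, its
    # interior has low reachability, and the last point (0.7) sits
    # just under the start level.
    reachability_plot = [0.9, 0.2, 0.2, 0.3, 0.5, 0.7]
    U_start, D_max = 0, reachability_plot[0]

    def trim(cmp, c_end):
        # Walk c_end leftwards while cmp(reachability, D_max) holds.
        while cmp(reachability_plot[c_end - 1], D_max) and c_end > U_start + 1:
            c_end -= 1
        return c_end

    print(trim(lt, len(reachability_plot)))  # paper's '<': 1, cluster collapses
    print(trim(gt, len(reachability_plot)))  # fixed '>': 6, full cluster kept

With '<', the loop trims through the whole low-reachability interior, consistent with the "undersized clusters" in the final commit title; with '>', the end point lands just under the start level.)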

@qinhanmin2014 (Member, Author)

> I had thought this would induce larger clusters.

@jnothman Yes, previously we erroneously detected many points in the last steep-up regions as outliers.

Regarding @adrinjalali's example, on master:

    [ 0  0  0  1  1  1  1  1  1  2  2  2  2 -1 -1 -1 -1 -1 -1 -1  3  3  3  3
     -1  4  4  4 -1 -1 -1]

and with this PR:

    [ 0  0  0  1  1  1  1  1  1  2  2  2  2  2  2 -1 -1 -1 -1 -1  3  3  3  3
      3  4  4  4  4  4 -1]

The result from this PR looks much better (the last point is an outlier).

@jnothman (Member)

Oh, I understood Adrin's comment as saying that with this PR there are more (spurious) outliers.

@jnothman (Member)

I'd be happy to see some kind of non-regression test, but this LGTM. The change seems logical.

@adrinjalali (Member) left a review.

@qinhanmin2014 (Member, Author)

> Oh, I understood Adrin's comment as saying that with this PR there are more (spurious) outliers.

Yes, there are still some wrong outliers sometimes (I don't have time to debug, but I think it's expected OPTICS behavior), but many fewer than before.

@qinhanmin2014 (Member, Author)

@adrinjalali Do you have time to look into your script and tell us why we're still getting some wrong outliers sometimes (e.g., maybe we can tune xi and get a reasonable result)? Thanks. (I'll do this after I return to school.)

@jnothman (Member)

Is it okay to title this commit "FIX Optics paper typo which resulted in undersized clusters"?

@qinhanmin2014 (Member, Author)

> Is it okay to title this commit "FIX Optics paper typo which resulted in undersized clusters"?

+1 IMO.

@adrinjalali (Member)

> @adrinjalali Do you have time to look into your script and tell us why we're still getting some wrong outliers sometimes (e.g., maybe we can tune xi and get a reasonable result)? Thanks. (I'll do this after I return to school.)

I don't think that's a quick one. I'll need to block out some good time to see what happens there, and I can't do that now. I'll be happy to check it after the release.

jnothman merged commit e7bd8a3 into scikit-learn:master on Apr 30, 2019
@jnothman (Member)

Thanks @qinhanmin2014. Good catch :)

@qinhanmin2014 (Member, Author)

> I don't think that's a quick one. I'll need to block out some good time to see what happens there, and I can't do that now. I'll be happy to check it after the release.

That's fine, it's not urgent IMO.
