FIX Optics paper typo #13750
Conversation
I prefer to include this patch in the rc, not sure whether it's possible.
Any reason you're changing the tests? Are the old ones failing after this change?
sklearn/cluster/tests/test_optics.py (outdated)

```python
    assert_array_equal(clust.labels_, expected_labels)


def test_cluster_hierarchy_():
```
I actually think this is a test we need to have. We have nothing else testing the hierarchy.
Yes, but this test is not reasonable IMO. Some points in C2 will also satisfy -2 <= x <= 2 and -2 <= y <= 2, so I can't understand why the expected clusters should be np.array([[0, 99], [0, 199]]).
Actually, the test easily fails if you change the random state when generating the dataset (you don't set a random state in the test, so it's difficult to reproduce).
I'm a -1 on touching the tests unless there's a good reason to do so. And those changes seem irrelevant to the purpose of this PR anyway.
I need a regression test. This is why I'm modifying the test.
I'm also trying to resolve the test failure here, e.g., use max_eps < np.inf when we try to detect outliers, and use v_measure_score instead of assert_array_equal when comparing clusters. I want to try my best to make i686 happy.
And I don't shuffle the data because tests will sometimes fail if we shuffle the data in a certain way, so it seems redundant.
@adrinjalali I can revert all the changes in test_extract_xi if you dislike them, but I need to remove test_cluster_hierarchy_ (unless you prove me wrong and can make the CIs happy). I prefer to hurry this change into the RC since it's not trivial IMO.
You can make the test "safer" by having a less sparse cluster as the outer one:

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.utils import shuffle

rng = np.random.RandomState(0)
n_points_per_cluster = 100
C1 = [0, 0] + 2 * rng.randn(n_points_per_cluster, 2)
C2 = [0, 0] + 50 * rng.randn(n_points_per_cluster, 2)
X = np.vstack((C1, C2))
X = shuffle(X, random_state=0)
clusters = OPTICS(min_samples=20, xi=.1).fit(X).cluster_hierarchy_
assert clusters.shape == (2, 2)
diff = np.sum(clusters - np.array([[0, 99], [0, 199]]))
assert diff / len(X) < 0.05
```
Yes, that happens, and it's because the generated sample may have a different density than the background distribution used to generate it, and the test may then fail.
That's why I didn't make the test exact and set a tolerance there.
Please revert those changes; we decided to skip that test instead of trying to change it to work around the issue. We can work on that issue later.
@adrinjalali do you have time to commit these changes? I'm offline and would like to hurry this one into the rc, thanks.
will do :)
Apologies, the original tests don't pass; I set max_eps to a smaller value so the outliers can be correctly detected.
Any way to test non-regression?
```diff
@@ -844,7 +844,7 @@ def _xi_cluster(reachability_plot, predecessor_plot, ordering, xi, min_samples,
         # Find the first index from the right side which is almost
         # at the same level as the beginning of the detected
         # cluster.
-        while (reachability_plot[c_end - 1] < D_max
+        while (reachability_plot[c_end - 1] > D_max
```
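As a hedged illustration of what the corrected condition does (a simplified sketch with a made-up helper name and toy data, not sklearn's actual `_xi_cluster` code), the right boundary of a detected cluster is shrunk while the reachability just inside it is still above the start level `D_max`:

```python
import numpy as np

# Simplified sketch (hypothetical helper, not sklearn's implementation):
# trim the right boundary of a detected cluster while the reachability
# value just inside the boundary is still above the cluster's start
# level D_max, mirroring the fixed `>` condition above.
def shrink_cluster_end(reachability_plot, c_start, c_end):
    D_max = reachability_plot[c_start]
    while reachability_plot[c_end - 1] > D_max and c_end > c_start + 1:
        c_end -= 1
    return c_end

reach = np.array([2.0, 0.5, 0.4, 0.5, 3.0, 4.0])
# the last two points (3.0, 4.0) sit above the start level 2.0 and are
# trimmed off, so the cluster's end boundary moves to index 4
print(shrink_cluster_end(reach, c_start=0, c_end=6))  # -> 4
```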
Can we have a comment "This corrects the paper, which uses >, to use < instead" or something?
@qinhanmin2014 yeah, I'm also working on it and I saw you did that. I suggest we remove the test_extract_xi test as is, and add a test which would detect the outliers only (with max_eps=inf). As I see it, with your fix there are some points which are detected as outliers but are not, which is kind of okay, but we should test that the true outliers are detected. I know that sometimes those outliers are not detected correctly, which should be its own issue and investigated IMO.
The regression would be that the …
Previously I added such a test, but Adrin wanted the PR to be small, so I removed it.
It's not a trivial issue, because without the fix we erroneously detect many points in the last steep up regions as outliers.
I'm confused by your comment, what do you mean?
I don't think so. The issue is that now we're unable to label some outliers, because we no longer erroneously detect many points in the last steep up regions as outliers (so we had to rely on max_eps).
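A hedged sketch (made-up data, not from this PR) of why a finite max_eps can stand in here: a point with no neighbors within max_eps keeps infinite reachability, so OPTICS leaves it as noise (-1) regardless of what the xi extraction does:

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.RandomState(0)
# one tight Gaussian blob plus a single far-away point (illustrative data)
X = np.vstack([rng.randn(30, 2), [[100.0, 100.0]]])

# with max_eps=10, the distant point has no neighbors within max_eps,
# so its reachability stays infinite and it is labeled as noise (-1)
clust = OPTICS(min_samples=5, max_eps=10, cluster_method='xi', xi=.05).fit(X)
print(clust.reachability_[-1])  # inf: unreachable within max_eps
print(clust.labels_[-1])        # noise label for the distant point
```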
I mean this:

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.utils import shuffle
from sklearn.utils.testing import assert_array_equal

rng = np.random.RandomState(8)
n_points_per_cluster = 5
C1 = [-5, -2] + .8 * rng.randn(n_points_per_cluster, 2)
C2 = [4, -1] + .1 * rng.randn(n_points_per_cluster, 2)
C3 = [1, -2] + .2 * rng.randn(n_points_per_cluster, 2)
C4 = [-2, 3] + .3 * rng.randn(n_points_per_cluster, 2)
C5 = [3, -2] + .6 * rng.randn(n_points_per_cluster, 2)
C6 = [5, 6] + .2 * rng.randn(n_points_per_cluster, 2)
X = np.vstack((C1, C2, C3, C4, C5, np.array([[100, 100]]), C6))
expected_labels = np.r_[[2] * 5, [0] * 5, [1] * 5, [3] * 5, [1] * 5,
                        -1, [4] * 5]
X, expected_labels = shuffle(X, expected_labels, random_state=rng)
clust = OPTICS(min_samples=3, min_cluster_size=2,
               max_eps=np.inf, cluster_method='xi',
               xi=.3).fit(X)
print(clust.labels_[clust.ordering_])
```

The printed labels show it's detecting the outlier as an outlier, but also some other points as outliers, which, as I understand it, is the effect of the change introduced in this PR. Also note that max_eps is inf in this example.
Anyhow, I'm happy to have this merged as is (the CI seems happy), and open another issue regarding those outliers.
Can you explain why this PR introduces that change?
I had thought this would induce larger clusters: it requires the end point to be just under the start point in reachability, rather than being the minimum point in the identified region.
Is it possible that previously we were rejecting many clusters with the min_cluster_size criterion; now we are not and so are producing more small clusters?
@jnothman Yes, previously we would erroneously detect many points in the last steep up regions as outliers. Regarding @adrinjalali's example:
Oh, I understood Adrin's comment as saying with this PR there are more (spurious) outliers. |
I'd be happy to see some kind of non-regression test, but this LGTM. The change seems logical.
Thanks @qinhanmin2014
Yes, there are still some wrong outliers sometimes (I don't have time to debug, but I think it's the expected behavior of OPTICS), though far fewer than before.
@adrinjalali Do you have time to look into your script and tell us why we are still getting some wrong outliers sometimes (e.g., maybe we can tune xi and get a reasonable result)? Thanks. (I'll do this after I return to school.)
Is it okay to title this commit as "FIX Optics paper typo which resulted in undersized clusters"?
+1 IMO.
I don't think that's a quick one. I'll need to block out some good time and see what happens there, and I can't do that now. I'll be happy to check it after the release.
Thanks @qinhanmin2014. Good catch :)
That's fine, it's not urgent IMO.
There's a typo in the OPTICS paper, see #13739 (comment), #13739 (comment).
Modify the existing test to serve as a regression test.
Can someone tell me how I can run tests on i686 with CI? (I hope this will fix the test failures.)
Remove test_cluster_hierarchy_ since it easily fails if we use another random_state, and I don't think it's reasonable.