FIX Optics paper typo #13750
Conversation
I prefer to include this patch in the rc, not sure whether it's possible.
Any reason you're changing the tests? Are the old ones failing after this change?
sklearn/cluster/tests/test_optics.py (outdated)

```python
    assert_array_equal(clust.labels_, expected_labels)


def test_cluster_hierarchy_():
```
I actually think this is a test we need to have. We have nothing else testing the hierarchy.
Yes, but this test is not reasonable IMO. Some points in C2 will also satisfy -2 <= x <= 2 and -2 <= y <= 2, so I can't understand why the expected clusters should be np.array([[0, 99], [0, 199]]).
Actually, the test easily fails if you change the random state when generating the dataset (you don't set a random state in the test, so it's difficult to reproduce).
I'm a -1 on touching the tests unless there's a good reason to do so. And those changes seem irrelevant to the purpose of this PR anyway.
I need a regression test. This is why I'm modifying the test.
I'm also trying to resolve the test failure here, e.g., use max_eps < np.inf when we try to detect outliers, and use v_measure_score instead of assert_array_equal when comparing clusters. I want to try my best to make i686 happy.
And I don't shuffle the data because tests will sometimes fail if we shuffle the data in a certain way, so it seems redundant.
@adrinjalali I can revert all the changes in test_extract_xi if you dislike them, but I need to remove test_cluster_hierarchy_ (unless you prove me wrong and can make the CIs happy). I prefer to hurry this change into the RC since it's not trivial IMO.
You can make the test "safer" by having a less sparse cluster as the outer one:

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.utils import shuffle

rng = np.random.RandomState(0)
n_points_per_cluster = 100
C1 = [0, 0] + 2 * rng.randn(n_points_per_cluster, 2)
C2 = [0, 0] + 50 * rng.randn(n_points_per_cluster, 2)
X = np.vstack((C1, C2))
X = shuffle(X, random_state=0)
clusters = OPTICS(min_samples=20, xi=.1).fit(X).cluster_hierarchy_
assert clusters.shape == (2, 2)
diff = np.sum(clusters - np.array([[0, 99], [0, 199]]))
assert diff / len(X) < 0.05
```
Yes, that happens, and it's because the generated sample may have a different density than the background distribution used to generate it, and the test may then fail.
That's why I didn't make the test exact and set a tolerance there.
Please revert those changes; we decided to skip that test instead of trying to change it to work around the issue. We can work on that issue later.
@adrinjalali do you have time to commit these changes? I'm offline and would like to hurry this one into the rc, thanks.
will do :)
Apologies, the original tests don't pass; I set max_eps to a smaller value so the outliers can be correctly detected.
Any way to test non-regression?
```diff
@@ -844,7 +844,7 @@ def _xi_cluster(reachability_plot, predecessor_plot, ordering, xi, min_samples,
         # Find the first index from the right side which is almost
         # at the same level as the beginning of the detected
         # cluster.
-        while (reachability_plot[c_end - 1] < D_max
+        while (reachability_plot[c_end - 1] > D_max
```
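As a hedged illustration of what the corrected condition does (a simplified sketch with a made-up helper name and toy data, not sklearn's actual `_xi_cluster` code), the right boundary of a detected cluster is shrunk while the reachability just inside it is still above the start level `D_max`:

```python
import numpy as np

# Simplified sketch (hypothetical helper, not sklearn's implementation):
# trim the right boundary of a detected cluster while the reachability
# value just inside the boundary is still above the cluster's start
# level D_max, mirroring the fixed `>` condition above.
def shrink_cluster_end(reachability_plot, c_start, c_end):
    D_max = reachability_plot[c_start]
    while reachability_plot[c_end - 1] > D_max and c_end > c_start + 1:
        c_end -= 1
    return c_end

reach = np.array([2.0, 0.5, 0.4, 0.5, 3.0, 4.0])
# the last two points (3.0, 4.0) sit above the start level 2.0 and are
# trimmed off, so the cluster's end boundary moves to index 4
print(shrink_cluster_end(reach, c_start=0, c_end=6))  # -> 4
```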
Can we have a comment "This corrects the paper, which uses >, to use < instead" or something?
@qinhanmin2014 yeah, I'm also working on it and I saw you did that. I suggest we remove the test_extract_xi test as is, and add a test which would detect the outliers only (with max_eps=inf). As I see it, with your fix there are some points which are detected as outliers but are not, which is kind of okay, but we should test that the true outliers are detected. I know that sometimes those outliers are not detected correctly, which should be its own issue and investigated IMO.
The regression would be that the …
Previously I added such a test, but Adrin wanted the PR to be small, so I removed it.
It's not a trivial issue, because without the fix we erroneously detect many points in the last steep up regions as outliers.
I'm confused by your comment, what do you mean?
I don't think so. The issue is that now we're unable to label some outliers, because we no longer erroneously detect many points in the last steep up regions as outliers (so we had to rely on max_eps).
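A hedged sketch (made-up data, not from this PR) of why a finite max_eps can stand in here: a point with no neighbors within max_eps keeps infinite reachability, so OPTICS leaves it as noise (-1) regardless of what the xi extraction does:

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.RandomState(0)
# one tight Gaussian blob plus a single far-away point (illustrative data)
X = np.vstack([rng.randn(30, 2), [[100.0, 100.0]]])

# with max_eps=10, the distant point has no neighbors within max_eps,
# so its reachability stays infinite and it is labeled as noise (-1)
clust = OPTICS(min_samples=5, max_eps=10, cluster_method='xi', xi=.05).fit(X)
print(clust.reachability_[-1])  # inf: unreachable within max_eps
print(clust.labels_[-1])        # noise label for the distant point
```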
I mean this:

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.utils import shuffle
from sklearn.utils.testing import assert_array_equal

rng = np.random.RandomState(8)
n_points_per_cluster = 5
C1 = [-5, -2] + .8 * rng.randn(n_points_per_cluster, 2)
C2 = [4, -1] + .1 * rng.randn(n_points_per_cluster, 2)
C3 = [1, -2] + .2 * rng.randn(n_points_per_cluster, 2)
C4 = [-2, 3] + .3 * rng.randn(n_points_per_cluster, 2)
C5 = [3, -2] + .6 * rng.randn(n_points_per_cluster, 2)
C6 = [5, 6] + .2 * rng.randn(n_points_per_cluster, 2)
X = np.vstack((C1, C2, C3, C4, C5, np.array([[100, 100]]), C6))
expected_labels = np.r_[[2] * 5, [0] * 5, [1] * 5, [3] * 5, [1] * 5,
                        -1, [4] * 5]
X, expected_labels = shuffle(X, expected_labels, random_state=rng)
clust = OPTICS(min_samples=3, min_cluster_size=2,
               max_eps=np.inf, cluster_method='xi',
               xi=.3).fit(X)
print(clust.labels_[clust.ordering_])
```

The printed labels show it's detecting the outlier as an outlier, but also some other points as outliers, which, as I understand it, is the effect of the change introduced in this PR. Also note that max_eps is inf in this example.
Anyhow, I'm happy to have this merged as is (the CI seems happy), and open another issue regarding those outliers.
Can you explain why this PR introduces that change?
I had thought this would induce larger clusters: it requires the end point to be just under the start point in reachability, rather than being the minimum point in the identified region.
Is it possible that previously we were rejecting many clusters with the min_cluster_size criterion; now we are not and so are producing more small clusters?
@jnothman Yes, previously we would erroneously detect many points in the last steep up regions as outliers. Regarding @adrinjalali's example:
Oh, I understood Adrin's comment as saying with this PR there are more (spurious) outliers. |
I'd be happy to see some kind of non-regression test, but this LGTM. The change seems logical.
Thanks @qinhanmin2014
Yes, there are still some wrong outliers sometimes (I don't have time to debug, but I think it's the expected behavior of OPTICS), though far fewer than before.
@adrinjalali Do you have time to look into your script and tell us why we are still getting some wrong outliers sometimes (e.g., maybe we can tune xi and get a reasonable result)? Thanks. (I'll do this after I return to school.)
Is it okay to title this commit as "FIX Optics paper typo which resulted in undersized clusters"?
+1 IMO.
I don't think that's a quick one. I'll need to block out some good time and see what happens there, and I can't do that now. I'll be happy to check it after the release.
Thanks @qinhanmin2014. Good catch :)
That's fine, it's not urgent IMO.
There's a typo in the OPTICS paper, see #13739 (comment), #13739 (comment).
Modify the existing test to serve as a regression test.
Can someone tell me how I can run tests on i686 with CI? (I hope this will fix the test failures.)
Remove test_cluster_hierarchy_ since it easily fails if we use another random_state, and I don't think it's reasonable.