Optics test fix (32 and 64 bit numerical stability) #12054
Conversation
Collapses the test to a simpler case: min_samples=2, which is equivalent to single linkage.
Have you tried it on a 32-bit system, and does it work?
0.62327168, 1.09937516, 0.64112772, 0.6290978, 0.51004568]
# we compare to truncated decimals, so use atol
assert_allclose(clust.reachability_, np.array(v), atol=1e-5)
any reason to increase the tolerance from the default?
The values are copy-and-pasted rather than actually calculated in the test. We can't run the other code directly in the test because the python chemometria port has a dependency on hcluster, which isn't importable in the test environment. Originally, I wrote out the chemometria values to a numpy array and included them in the test folder, but wasn't able to read them when the tests were run after pushing them to GitHub; hence the values are just copy-pasted. Python does its own rounding when printing numbers to screen, and will sometimes print a value like '0.4246995' (7 digits to the right of the decimal) instead of '0.64324688' (8 digits to the right of the decimal). I assume that the last digit has been rounded, hence the specified tolerance... that said, it may be possible to change it to 1e-6. The default is atol=1e-08, which won't work without formatting the test array output from the chemometria port differently.
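To illustrate the rounding issue, here is a minimal sketch with a hypothetical value (not one from the actual test): a copy of a number that was truncated when printed can fall outside assert_allclose's default tolerances but inside an explicit atol.

```python
import numpy as np
from numpy.testing import assert_allclose

# Hypothetical values, for illustration only: a full-precision result
# and the same value as copy-pasted from printed output, where the
# last decimal places were lost.
exact = 0.62327168
printed = 0.623272  # rounded to 6 decimal places when printed

# assert_allclose defaults to rtol=1e-7 and atol=0; the difference here
# (~3.2e-7) exceeds 1e-7 * exact (~6.2e-8), so the default comparison
# would fail. An explicit absolute tolerance covers truncation in the
# last printed decimal place:
assert_allclose(printed, exact, atol=1e-5)
```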
You also said the other code doesn't include the current point; does that mean that to get single linkage from the other code, we'd need to set min_samples to 1? If that's the case, I believe we have it according to the paper, and they have it inconsistent with the paper.
@adrinjalali yes, to get single linkage with the other code, we have to set min_samples=1. We have it consistent with the DBSCAN module, which includes the query point. Keeping them both to include the query point ensures that extraction using extract_dbscan gives identical or near-identical results to DBSCAN when called with the same input parameters. So I believe that we have it right both with regard to the paper and with regard to the sklearn API. I haven't tested this on 32-bit; I'm in the middle of migrating my primary laptop over to a new machine and haven't set up virtualbox or docker yet to do so. My impression from your tests earlier was that dropping the number of points per cluster fixed things... I don't see why changing min_samples would break them again, but I haven't checked to verify explicitly that it doesn't.
Why do we not also want a comparison with a higher min_samples?
I guess we could. Could you please add an example with a higher min_samples?

I tested on a 32-bit system, and the test passes. Thanks! There are also some PEP8 space issues in the example.
Since this test is related to the discrepancy between the methods regarding the min_samples parameter, I guess it's also apt to add some comments in the test, explaining the issue. I don't think it's necessary to put the explanation in the docs, since our implementation follows the convention in the papers I've read.
See #12090; I doubt whether the test is appropriate.
@adrinjalali @jnothman If we add additional tests at higher min_samples ...
I should think so
@espg yeah, probably 5 should be fine.
Closing; we now have a new test based on the ELKI dev version.
Thanks for some great work on this everyone!
This PR addresses #11916 and #11878. The issue, which has been discussed in #12036 (comment) and #11929, is that 32-bit and 64-bit results don't agree at high point density due to numerical imprecision. The suggested fix is to reduce the number of points per cluster to a lower density so that both architectures agree.
Initially, this fix appeared to cause divergence between the test case (derived from a chemometria implementation written in MATLAB by Michal Daszykowski and then ported to python by Brian Clowers), and the sklearn implementation of OPTICS. This divergence was present only when reducing the points to 50 per cluster (i.e., results matched at 250 points per cluster).
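The precision issue can be sketched numerically (with hypothetical random data, not the actual test fixture): the same pairwise squared distances computed at single vs. double precision differ by small amounts, and at high point density, neighbor distances that are near-tied within that gap can rank differently on 32-bit and 64-bit builds.

```python
import numpy as np

# Illustrative sketch only: hypothetical dense blob of points, not the
# data used in the OPTICS test.
rng = np.random.RandomState(0)
X64 = rng.rand(250, 2)          # 250 points at double precision
X32 = X64.astype(np.float32)    # same coordinates at single precision

def pairwise_sq(X):
    # Brute-force pairwise squared Euclidean distances via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)

d64 = pairwise_sq(X64)
d32 = pairwise_sq(X32).astype(np.float64)

# The two precisions do not agree exactly; any pair of neighbor
# distances that differ by less than this gap may be ordered
# differently at the two precisions, changing the OPTICS ordering.
max_gap = np.abs(d64 - d32).max()
```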
This pull request modifies the test to set the min_samples parameter to min_samples = 2, so as to ensure agreement between the chemometria implementation and this implementation.

Some background: the OPTICS algorithm collapses to single linkage in the specific case that min_samples = 2. Setting both implementations to this mode removes any ambiguity as to the correct output. Note that both implementations can be independently checked against single linkage if desired (although this PR only compares them against one another).

This provides a permanent fix to the testing issue, and provides a simpler unit test, which follows best practices for unit tests.
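The single-linkage claim can be checked independently with a small sketch (hypothetical data; assumes SciPy is available): with min_samples = 2, the reachability of each newly processed point reduces to its distance to the closest already-processed point, which is exactly Prim's minimum-spanning-tree construction, and single-linkage merge heights are the sorted MST edge weights.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# Hypothetical data, for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(30, 2)
D = squareform(pdist(X))
n = len(X)

# Prim's algorithm: grow a tree from point 0, always attaching the
# closest unvisited point, and record each attachment distance. With
# min_samples=2 this attachment distance plays the role of the
# reachability distance of the newly processed point.
best = D[0].copy()      # distance from the tree to each point
best[0] = np.inf
visited = [0]
mst_weights = []
for _ in range(n - 1):
    j = int(np.argmin(best))    # closest unvisited point
    mst_weights.append(best[j])
    visited.append(j)
    best = np.minimum(best, D[j])
    best[visited] = np.inf      # never revisit processed points

# Single-linkage dendrogram heights equal the sorted MST edge weights.
heights = linkage(pdist(X), method='single')[:, 2]
assert np.allclose(np.sort(mst_weights), heights)
```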