Optics test fix (32 and 64 bit numerical stability) #12054
Conversation
Collapses the test to a simpler case: min_samples=2, which is equivalent to single linkage.
Have you tried it on a 32-bit system, and does it work?
0.62327168, 1.09937516, 0.64112772, 0.6290978, 0.51004568]
# we compare to truncated decimals, so use atol
assert_allclose(clust.reachability_, np.array(v), atol=1e-5)
any reason to increase the tolerance from the default?
The values are copy-and-pasted rather than actually calculated in the test. We can't run the other code directly in the test because the python chemometria port has a dependency on hcluster, which isn't importable in the test environment. Originally, I wrote out the chemometria values to a numpy array and included them in the test folder, but wasn't able to read them when the tests were run after pushing them to GitHub; hence the values are just copy-pasted. Python does its own rounding when printing numbers to screen, and will sometimes print a value like '0.4246995' (7 digits to the right of the decimal) instead of '0.64324688' (8 digits to the right of the decimal). I assume that the last digit has been rounded, hence the specified tolerance... that said, it may be possible to change it to 1e-6. The default is atol=1e-08, which won't work without formatting the test array output from the chemometria port differently.
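To illustrate the rounding issue, here is a minimal sketch with a hypothetical value (not one from the actual test): a copy of a number that was truncated when printed can fall outside assert_allclose's default tolerances but inside an explicit atol.

```python
import numpy as np
from numpy.testing import assert_allclose

# Hypothetical values, for illustration only: a full-precision result
# and the same value as copy-pasted from printed output, where the
# last decimal places were lost.
exact = 0.62327168
printed = 0.623272  # rounded to 6 decimal places when printed

# assert_allclose defaults to rtol=1e-7 and atol=0; the difference here
# (~3.2e-7) exceeds 1e-7 * exact (~6.2e-8), so the default comparison
# would fail. An explicit absolute tolerance covers truncation in the
# last printed decimal place:
assert_allclose(printed, exact, atol=1e-5)
```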
You also said the other code doesn't include the current point; does that mean that to get single linkage from the other code, we'd need to set min_samples to 1? If that's the case, I believe we have it according to the paper, and they have it inconsistent with the paper.
@adrinjalali yes, to get single linkage with the other code, we have to set min_samples=1. We have it consistent with the DBSCAN module, which includes the query point. Keeping them both to include the query point ensures that extraction using extract_dbscan gives identical or near-identical results to DBSCAN when called with the same input parameters. So I believe that we have it right both with regard to the paper and with regard to the sklearn API. I haven't tested this on 32-bit; I'm in the middle of migrating my primary laptop over to a new machine and haven't set up virtualbox or docker yet to do so. My impression from your tests earlier was that dropping the number of points per cluster fixed things... I don't see why changing min_samples would break them again, but I haven't checked to verify explicitly that it doesn't.
Why do we not also want a comparison with a higher min_samples?
I guess we could. Could you please add an example with a higher min_samples?

I tested on a 32-bit system, and the test passes. Thanks! There are also some PEP8 space issues in the example.
Since this test is related to the discrepancy between the methods regarding the min_samples parameter, I guess it's also apt to add some comments in the test, explaining the issue. I don't think it's necessary to put the explanation in the docs, since our implementation follows the convention in the papers I've read.
See #12090; I doubt whether the test is appropriate.
@adrinjalali @jnothman If we add additional tests at higher min_samples ...
I should think so
@espg yeah, probably 5 should be fine.
Closing; we now have a new test based on the ELKI dev version.
Thanks for some great work on this everyone!
This PR addresses #11916 and #11878. The issue, which has been discussed in #12036 (comment) and #11929, is that 32-bit and 64-bit results don't agree at high point density due to numerical imprecision. The suggested fix is to reduce the number of points per cluster to a lower density so that both architectures agree.
Initially, this fix appeared to cause divergence between the test case (derived from a chemometria implementation written in MATLAB by Michal Daszykowski and then ported to python by Brian Clowers), and the sklearn implementation of OPTICS. This divergence was present only when reducing the points to 50 per cluster (i.e., results matched at 250 points per cluster).
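The precision issue can be sketched numerically (with hypothetical random data, not the actual test fixture): the same pairwise squared distances computed at single vs. double precision differ by small amounts, and at high point density, neighbor distances that are near-tied within that gap can rank differently on 32-bit and 64-bit builds.

```python
import numpy as np

# Illustrative sketch only: hypothetical dense blob of points, not the
# data used in the OPTICS test.
rng = np.random.RandomState(0)
X64 = rng.rand(250, 2)          # 250 points at double precision
X32 = X64.astype(np.float32)    # same coordinates at single precision

def pairwise_sq(X):
    # Brute-force pairwise squared Euclidean distances via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)

d64 = pairwise_sq(X64)
d32 = pairwise_sq(X32).astype(np.float64)

# The two precisions do not agree exactly; any pair of neighbor
# distances that differ by less than this gap may be ordered
# differently at the two precisions, changing the OPTICS ordering.
max_gap = np.abs(d64 - d32).max()
```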
This pull request modifies the test to set the min_samples parameter to min_samples = 2, so as to ensure agreement between the chemometria implementation and this implementation.

Some background: the OPTICS algorithm collapses to single linkage in the specific case that min_samples = 2. Setting both implementations to this mode removes any ambiguity as to the correct output. Note that both implementations can be independently checked against single linkage if desired (although this PR only compares them against one another).

This provides a permanent fix to the testing issue, and provides a simpler unit test, which follows best practices for unit tests.
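The single-linkage claim can be checked independently with a small sketch (hypothetical data; assumes SciPy is available): with min_samples = 2, the reachability of each newly processed point reduces to its distance to the closest already-processed point, which is exactly Prim's minimum-spanning-tree construction, and single-linkage merge heights are the sorted MST edge weights.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# Hypothetical data, for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(30, 2)
D = squareform(pdist(X))
n = len(X)

# Prim's algorithm: grow a tree from point 0, always attaching the
# closest unvisited point, and record each attachment distance. With
# min_samples=2 this attachment distance plays the role of the
# reachability distance of the newly processed point.
best = D[0].copy()      # distance from the tree to each point
best[0] = np.inf
visited = [0]
mst_weights = []
for _ in range(n - 1):
    j = int(np.argmin(best))    # closest unvisited point
    mst_weights.append(best[j])
    visited.append(j)
    best = np.minimum(best, D[j])
    best[visited] = np.inf      # never revisit processed points

# Single-linkage dendrogram heights equal the sorted MST edge weights.
heights = linkage(pdist(X), method='single')[:, 2]
assert np.allclose(np.sort(mst_weights), heights)
```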