
Optics test fix (32 and 64 bit numerical stability) #12054


Closed
wants to merge 3 commits

Conversation

espg
Contributor

@espg espg commented Sep 12, 2018

This PR addresses #11916 and #11878. The issue, discussed in #12036 (comment) and #11929, is that 32-bit and 64-bit results don't agree at high point density due to numerical imprecision. The suggested fix is to reduce the number of points per cluster to a lower density so that both architectures agree.

Initially, this fix appeared to cause divergence between the test case (derived from a chemometria implementation written in MATLAB by Michal Daszykowski and then ported to Python by Brian Clowers) and the sklearn implementation of OPTICS. This divergence was present only when the points were reduced to 50 per cluster (i.e., results matched at 250 points per cluster).

This pull request modifies the test to:

  • Use 50 points per cluster so that the 32-bit and 64-bit test cases agree
  • Reduce the min_samples parameter to min_samples = 2 so as to ensure agreement between the chemometria implementation and this implementation

Some background: the OPTICS algorithm collapses to single linkage in the specific case that min_samples = 2. Setting both implementations to this mode removes any ambiguity as to the correct output. Note that both implementations can be independently checked against single-linkage if desired (although this PR only compares them against one another).
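
For illustration, a minimal sketch (not part of this PR) of that single-linkage equivalence: with min_samples=2, each point's reachability is simply its distance to the nearest already-processed point, i.e. a minimum-spanning-tree edge, so the sorted reachability values should match scipy's single-linkage merge heights. The dataset below is an arbitrary stand-in, not the PR's test data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import OPTICS

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [10, 10]])  # two loose blobs

clust = OPTICS(min_samples=2).fit(X)
reach = clust.reachability_[np.isfinite(clust.reachability_)]   # drop the single inf (start point)

merge_heights = linkage(X, method="single")[:, 2]               # single-linkage dendrogram heights

assert np.allclose(np.sort(reach), np.sort(merge_heights))
```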

This provides a permanent fix to the testing issue and a simpler unit test, which follows best practices for unit tests.


Collapses test to a simpler case: min_samples=2, which is equivalent to single linkage.
@adrinjalali
Member

Have you tried it on a 32-bit system, and does it work?

0.62327168, 1.09937516, 0.64112772, 0.6290978 , 0.51004568]

# we compare to truncated decimals, so use atol
assert_allclose(clust.reachability_, np.array(v), atol=1e-5)
Member


any reason to increase the tolerance from the default?

Contributor Author


The values are copied and pasted rather than actually calculated in the test. We can't run the other code directly in the test because the Python chemometria port depends on hcluster, which isn't importable in the test environment. Originally, I wrote the chemometria values out to a numpy array and included them in the test folder, but I wasn't able to read them when the tests were run after pushing to GitHub; hence the values are just copy-pasted. Python does its own rounding when printing numbers to screen, and sometimes will print a value like '0.4246995' (7 digits to the right of the decimal) instead of '0.64324688' (8 digits to the right of the decimal). I assume that the last digit has been rounded, hence the specified tolerance... that said, it may be possible to change it to 1e-6. The default is atol=1e-08, which won't work without formatting the test array output from the chemometria port differently.
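
As a small illustration (with made-up numbers, not the real reference array) of why an explicit atol is safer here: the pasted reference values keep only 7-8 printed digits, so they can differ from the freshly computed float64 values by a few 1e-8, which atol=1e-5 absorbs with plenty of margin.

```python
import numpy as np
from numpy.testing import assert_allclose

computed = np.array([0.64324688123, 0.42469954321])  # hypothetical full-precision results
pasted = np.array([0.64324688, 0.4246995])           # the same values as printed, then copy-pasted

assert_allclose(computed, pasted, atol=1e-5)          # passes despite the truncated digits
```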

@adrinjalali
Member

You also said the other code doesn't include the current point. Does that mean that, to get single linkage from the other code, we'd need to set min_samples to 1? If that's the case, I believe we have it according to the paper, and they have it inconsistent with the paper.

@espg
Contributor Author

espg commented Sep 12, 2018

@adrinjalali yes, to get single linkage with the other code, we have to set min_samples=1. We have it consistent with the DBSCAN module, which includes the query point. Keeping both implementations set to include the query point ensures that extraction using extract_dbscan gives identical or near-identical results to DBSCAN when called with the same input parameters. So I believe that we have it right both with regard to the paper and with regard to the sklearn API.

I haven't tested this on 32-bit; I'm in the middle of migrating my primary laptop over to a new machine and haven't set up VirtualBox or Docker yet to do so. My impression from your earlier tests was that dropping the number of points per cluster fixed things... I don't see why changing min_samples would break them again, but I haven't verified explicitly that it doesn't.
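
A rough sketch (not from this PR) of the DBSCAN consistency described above, using the current sklearn API (cluster_optics_dbscan) rather than the extract_dbscan method discussed in this thread; the data, eps, and min_samples values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=1.0, random_state=0)
eps, min_samples = 2.0, 5

opt = OPTICS(min_samples=min_samples).fit(X)
optics_labels = cluster_optics_dbscan(
    reachability=opt.reachability_,
    core_distances=opt.core_distances_,
    ordering=opt.ordering_,
    eps=eps,
)
dbscan_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)

# Because both count the query point itself toward min_samples, the two
# partitions should agree up to label permutation (border points can differ).
assert adjusted_rand_score(optics_labels, dbscan_labels) > 0.95
```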

@jnothman
Member

Why do we not also want a comparison with a higher min_samples?

@adrinjalali
Member

I guess we could. Could you please add an example with a higher min_samples, @espg? I like that we're now testing the edge case, but we should also test the usual case.

I tested on a 32bit system, and the test passes. Thanks!

There are also some PEP8 space issues in the example.

@adrinjalali
Member

Since this test is related to the discrepancy between the methods regarding the min_samples parameter, I guess it's also apt to add some comments in the test, explaining the issue.

I don't think it's necessary to put the explanation in the docs, since our implementation follows the convention in the papers I've read.

@qinhanmin2014
Member

See #12090; I doubt whether the test is appropriate.

@espg
Contributor Author

espg commented Sep 16, 2018

@adrinjalali @jnothman If we add additional tests at higher min_samples, we will have to add different reference arrays, since the reachability graph changes each time we change min_samples. That's fine, but how many different min_samples values do we want to test? Is two sufficient (one at min_samples = 2, and one at some other value)?
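
A structural sketch (hypothetical, not from this PR) of how several min_samples values could share one parametrized test; the real test would compare against pasted reference arrays for each value, which are omitted here to avoid inventing numbers, so only a weak sanity property is checked.

```python
import numpy as np
import pytest
from sklearn.cluster import OPTICS

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [10, 10]])

@pytest.mark.parametrize("min_samples", [2, 5, 10])
def test_optics_reachability_basic(min_samples):
    reach = OPTICS(min_samples=min_samples).fit(X).reachability_
    assert reach.shape == (len(X),)
    assert np.isinf(reach).sum() == 1             # only the starting point is unreachable
    assert np.all(reach[np.isfinite(reach)] > 0)  # all other reachabilities are positive
```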

@jnothman
Member

jnothman commented Sep 16, 2018 via email

@adrinjalali
Member

@espg yeah, probably 5 should be fine.

@qinhanmin2014
Member

Closing; we now have a new test based on the ELKI dev version.

@jnothman
Member

jnothman commented Oct 14, 2018 via email
