geoffreydstewart · 2022-04-12T23:28:41Z

Description

This PR introduces a k-d tree implementation, which can be used for nearest neighbour queries, and integrates this feature with Hdbscan and KNN. Initial performance measurements demonstrate that this k-d tree implementation is, in general, much faster than using a brute-force approach for larger datasets.

After these changes have been reviewed, the Hdbscan tutorial will need a corresponding update.

Motivation

K-d trees can provide faster nearest neighbour queries compared to a brute-force technique. This results in faster model training times.
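
To make the comparison with brute force concrete, here is a minimal, self-contained sketch of a k-d tree nearest neighbour query. It is not the code from this PR: the class name `KDTreeSketch`, the `Node` type, and the helper methods are all hypothetical, and it assumes a simple median-by-position split that cycles through the dimensions.

```java
import java.util.Arrays;
import java.util.Comparator;

public class KDTreeSketch {
    /** One node per point; splitDim is the coordinate compared at this level. */
    static final class Node {
        final double[] point;
        final int splitDim;
        Node left, right;
        Node(double[] point, int splitDim) { this.point = point; this.splitDim = splitDim; }
    }

    /** Builds a tree over pts[lo, hi): sort on the cycling dimension, take the middle element. */
    static Node build(double[][] pts, int lo, int hi, int depth) {
        if (lo >= hi) return null;
        final int dim = depth % pts[0].length;
        Arrays.sort(pts, lo, hi, Comparator.comparingDouble((double[] p) -> p[dim]));
        int mid = (lo + hi) >>> 1;
        Node n = new Node(pts[mid], dim);
        n.left = build(pts, lo, mid, depth + 1);
        n.right = build(pts, mid + 1, hi, depth + 1);
        return n;
    }

    /** Squared Euclidean distance. */
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    /** Recursive nearest neighbour search; pass null as the initial best. */
    static double[] nearest(Node n, double[] q, double[] best) {
        if (n == null) return best;
        if (best == null || sqDist(q, n.point) < sqDist(q, best)) best = n.point;
        double diff = q[n.splitDim] - n.point[n.splitDim];
        Node near = diff < 0 ? n.left : n.right;
        Node far = diff < 0 ? n.right : n.left;
        best = nearest(near, q, best);
        // Only cross the splitting plane if a closer point could lie beyond it.
        if (diff * diff < sqDist(q, best)) best = nearest(far, q, best);
        return best;
    }

    /** Brute-force baseline: compare the query against every point. */
    static double[] bruteForce(double[][] pts, double[] q) {
        double[] best = pts[0];
        for (double[] p : pts) if (sqDist(q, p) < sqDist(q, best)) best = p;
        return best;
    }

    public static void main(String[] args) {
        double[][] pts = {{2, 3}, {5, 4}, {9, 6}, {4, 7}, {8, 1}, {7, 2}};
        Node root = build(pts, 0, pts.length, 0);
        double[] q = {9, 2};
        System.out.println(Arrays.toString(nearest(root, q, null)));
        System.out.println(Arrays.toString(bruteForce(pts, q)));
    }
}
```

Both calls should report a point at the same distance from the query. The tree version can skip any subtree whose splitting plane is further away than the best match found so far, which is where the speed-up over brute force on larger datasets comes from.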

Paper reference

This is the original paper which proposes the k-d tree:
J. L. Bentley, "Multidimensional Binary Search Trees Used for Associative Searching", Commun. ACM, Vol. 18, No. 9, Sept. 1975, pp. 509–517.

Craigacp

There are a bunch of small changes, and a few larger ones. I think the logic in KNNModel could probably do with drastic revision at this point, because the neighbour query infrastructure integrates poorly with KNNModel's built-in threading options; however, that's a problem for another PR.

I'm also interested in the behaviour of the kd-tree when working on binary/integer datasets, and also how the recursion behaves in real use (in case we need to flip it over to iteration like we did in the CART tree package).
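
For reference, this is roughly what flipping a recursive build over to iteration looks like. It is a sketch only, not the code in this PR or in the CART package: `IterativeBuildSketch`, `Task`, and `sizeAndHeight` are hypothetical names, and the median-by-position split is an assumption. Pending subranges are kept on an explicit `ArrayDeque`, so the Java call stack stays flat regardless of the tree's shape.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Deque;

public class IterativeBuildSketch {
    static final class Node {
        final double[] point;
        Node left, right;
        Node(double[] point) { this.point = point; }
    }

    /** Pending work: build the subtree over pts[lo, hi) and attach it to parent. */
    static final class Task {
        final int lo, hi, depth;
        final Node parent;
        final boolean isLeft;
        Task(int lo, int hi, int depth, Node parent, boolean isLeft) {
            this.lo = lo; this.hi = hi; this.depth = depth;
            this.parent = parent; this.isLeft = isLeft;
        }
    }

    /** Median-split build driven by an explicit work stack instead of recursion. */
    static Node build(double[][] pts) {
        Node root = null;
        Deque<Task> stack = new ArrayDeque<>();
        stack.push(new Task(0, pts.length, 0, null, false));
        while (!stack.isEmpty()) {
            Task t = stack.pop();
            if (t.lo >= t.hi) continue; // empty range, nothing to attach
            final int dim = t.depth % pts[0].length;
            Arrays.sort(pts, t.lo, t.hi, Comparator.comparingDouble((double[] p) -> p[dim]));
            int mid = (t.lo + t.hi) >>> 1;
            Node n = new Node(pts[mid]);
            if (t.parent == null) root = n;
            else if (t.isLeft) t.parent.left = n;
            else t.parent.right = n;
            // The two halves are disjoint, so the order they are processed in does not matter.
            stack.push(new Task(t.lo, mid, t.depth + 1, n, true));
            stack.push(new Task(mid + 1, t.hi, t.depth + 1, n, false));
        }
        return root;
    }

    /** Iterative traversal returning {node count, height}. */
    static int[] sizeAndHeight(Node root) {
        int size = 0, height = 0;
        Deque<Node> nodes = new ArrayDeque<>();
        Deque<Integer> depths = new ArrayDeque<>();
        if (root != null) { nodes.push(root); depths.push(1); }
        while (!nodes.isEmpty()) {
            Node n = nodes.pop();
            int d = depths.pop();
            size++;
            height = Math.max(height, d);
            if (n.left != null) { nodes.push(n.left); depths.push(d + 1); }
            if (n.right != null) { nodes.push(n.right); depths.push(d + 1); }
        }
        return new int[] {size, height};
    }

    public static void main(String[] args) {
        double[][] pts = {{2, 3}, {5, 4}, {9, 6}, {4, 7}, {8, 1}, {7, 2}};
        System.out.println(Arrays.toString(sizeAndHeight(build(pts))));
    }
}
```

If the tree is balanced the recursion depth is only logarithmic in the number of points, so the explicit stack mainly matters for deep or degenerate trees.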

Clustering/Hdbscan/src/main/java/org/tribuo/clustering/hdbscan/HdbscanTrainer.java

Common/NearestNeighbour/src/main/java/org/tribuo/common/nearest/KNNClassifierOptions.java

Math/src/test/java/org/tribuo/math/util/SGDVectorsFromCSV.java

Math/src/test/java/org/tribuo/math/neighbour/TestKDTree.java

Math/src/main/java/org/tribuo/math/neighbour/kdtree/KDTree.java

Craigacp · 2022-04-13T18:06:30Z

Math/src/main/java/org/tribuo/math/neighbour/kdtree/DimensionNode.java

How unbalanced can this tree get if we adversarially prepare a dataset? Can it pop the stack with the recursion in this call?

This algorithm always constructs a balanced tree, but I suppose some manually constructed datasets could cause it to perform poorly. With recursion, there is always the risk that a tree built from a huge dataset could overflow the stack, but I suspect we would also face memory issues in those scenarios. It sounds like you have more experience with this type of issue, so I'd be interested in hearing more.

So I agree that it is picking the median, but if the array is [0,1,1,1,1,1,1,1] then the invariants say that it should split that into [0] and [1,1,1,1,1,1,1], so I wonder what happens if you have a single data point that's all zeros and then every other data point is the same and all ones. We can decide that adversarial datasets are not our problem, but I'd like to have a rough idea how adversarial it has to be before it goes pop.

I may not be understanding the issue here. If I use this code to create a tree from the point (0,0,0,0) and 9 points like (1,1,1,1), I still get a balanced tree. The root node will be (1,1,1,1). Traversing the tree by following the "below" nodes arrives at the leaf point (0,0,0,0). All the other nodes in the tree are (1,1,1,1). I have added a test to the performance test branch I'm maintaining where this scenario can be executed in the debugger: adversarial test

Ok. I had a look through the test; when I run it the trees are balanced, but points with a value equal to the split point appear on the left. I thought the invariant was that this couldn't happen? I've left some comments on the new PR in the relevant places.
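
The scenario discussed in this thread can be reproduced with a small standalone sketch. This is not the PR's DimensionNode or KDTree code: `AdversarialSplitSketch` and its helpers are hypothetical, and the sketch assumes the split point is the middle position of the sorted range. It builds a tree from one all-zeros point plus identical all-ones points, then reports the tree depth and how many points in the root's left subtree share the root's split value.

```java
import java.util.Arrays;
import java.util.Comparator;

public class AdversarialSplitSketch {
    static final class Node {
        final double[] point;
        final int splitDim;
        Node left, right;
        Node(double[] point, int splitDim) { this.point = point; this.splitDim = splitDim; }
    }

    /** One all-zeros point followed by n - 1 identical all-ones points. */
    static double[][] adversarial(int n, int dims) {
        double[][] pts = new double[n][dims];
        for (int i = 1; i < n; i++) Arrays.fill(pts[i], 1.0);
        return pts;
    }

    /** Median-by-position build over pts[lo, hi), cycling through the dimensions. */
    static Node build(double[][] pts, int lo, int hi, int depth) {
        if (lo >= hi) return null;
        final int dim = depth % pts[0].length;
        Arrays.sort(pts, lo, hi, Comparator.comparingDouble((double[] p) -> p[dim]));
        int mid = (lo + hi) >>> 1;
        Node n = new Node(pts[mid], dim);
        n.left = build(pts, lo, mid, depth + 1);
        n.right = build(pts, mid + 1, hi, depth + 1);
        return n;
    }

    static int depth(Node n) {
        return n == null ? 0 : 1 + Math.max(depth(n.left), depth(n.right));
    }

    /** Number of points in root's left subtree whose coordinate equals root's split value. */
    static int tiesOnLeft(Node root) {
        return countEqual(root.left, root.splitDim, root.point[root.splitDim]);
    }

    private static int countEqual(Node n, int dim, double value) {
        if (n == null) return 0;
        int here = n.point[dim] == value ? 1 : 0;
        return here + countEqual(n.left, dim, value) + countEqual(n.right, dim, value);
    }

    public static void main(String[] args) {
        double[][] pts = adversarial(10, 4);
        Node root = build(pts, 0, pts.length, 0);
        System.out.println("depth = " + depth(root));
        System.out.println("ties on the left of the root = " + tiesOnLeft(root));
    }
}
```

In this sketch the tree stays balanced because the split is chosen by array position rather than by value, at the cost of points equal to the split value ending up on both sides of a node. Whether the PR's implementation behaves the same way depends on its own split and tie-handling rules.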

Clustering/Hdbscan/src/main/java/org/tribuo/clustering/hdbscan/HdbscanTrainer.java

Common/NearestNeighbour/src/main/java/org/tribuo/common/nearest/KNNModel.java

Common/NearestNeighbour/src/main/java/org/tribuo/common/nearest/KNNTrainer.java

Math/src/main/java/org/tribuo/math/neighbour/kdtree/DimensionNode.java

Math/src/main/java/org/tribuo/math/neighbour/kdtree/KDTree.java

Math/src/test/java/org/tribuo/math/neighbour/TestKDTree.java

Math/src/main/java/org/tribuo/math/neighbour/kdtree/KDTree.java

Craigacp

A few more small things. Also, could you rebase this branch on top of main? There's a conflict in TestHdbscan after I added the public cluster exemplar test.

oracle-contributor-agreement · 2022-04-19T18:15:31Z

Oracle requires that contributors to all of its open-source projects sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

PR author: @geoffreydstewart

In order to sign the OCA, you need to create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application by following the steps on the homepage.

When signing the OCA, please provide your GitHub username. By doing so, this PR will be automatically updated once the signed OCA is approved by Oracle.

geoffreydstewart added 2 commits April 12, 2022 15:06

Add a k-d tree implementation, and integrate it with Hdbscan and KNN.

e306056

minor javadoc updates

5a87514

Craigacp requested changes Apr 13, 2022

Craigacp added the Oracle employee (This PR is from an Oracle employee) and squash-commits (Squash the commits when merging this PR) labels Apr 15, 2022

These are changes to address the code review feedback

c8a076a

Craigacp reviewed Apr 17, 2022

Craigacp requested changes Apr 17, 2022

oracle-contributor-agreement bot added the OCA Required (At least one contributor does not have an approved Oracle Contributor Agreement) label Apr 19, 2022

geoffreydstewart closed this Apr 20, 2022

geoffreydstewart deleted the kd-tree branch April 20, 2022 00:59

geoffreydstewart mentioned this pull request Apr 20, 2022

Add a k-d tree implementation, and integrate it with Hdbscan and KNN. #231

Merged

Craigacp Apr 13, 2022

geoffreydstewart Apr 15, 2022

Craigacp Apr 17, 2022

geoffreydstewart Apr 20, 2022

Craigacp Apr 22, 2022

2 participants
