Add a k-d tree implementation, and integrate it with Hdbscan and KNN. #231
Craigacp merged 6 commits into oracle:main
Oracle requires that contributors to all of its open-source projects sign the Oracle Contributor Agreement (OCA).
To sign the OCA, you need to create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application by following the steps on the homepage. When signing the OCA, please provide your GitHub username. By doing so, this PR will be automatically updated once the signed OCA has been approved by Oracle.
Craigacp
left a comment
I'd like to properly nail down the semantics of the k-d tree building (which may just be documentation fixes now), and there are a couple of other small points.
Clustering/Hdbscan/src/test/java/org/tribuo/clustering/hdbscan/TestHdbscan.java
This javadoc is incorrect now as the default is BRUTE_FORCE.
@Test
public void knnClassificationSingleThreadedTest() {
    KNNTrainer<Label> classificationTrainer = new KNNTrainer<>(5, DistanceType.L2, 1, new VotingCombiner(), KNNModel.Backend.INNERTHREADPOOL);
    KNNTrainer<Label> classificationTrainer = new KNNTrainer<>(5, DistanceType.L2, 1,
Similarly to TestHdbscan, can one of these tests use the new constructor that accepts a NeighboursQueryFactory directly?
Math/src/main/java/org/tribuo/math/neighbour/NeighboursQueryFactory.java
Math/src/main/java/org/tribuo/math/neighbour/bruteforce/NeighboursBruteForceFactory.java
I think this should say "on the right" as the values less than are on the left.
int store = left;
for (int idx = left; idx < right; idx++) {
    if (compareByDimension(points[idx], pivot, dimension) <= 0) {
Shouldn't this be a strict less than? The docs say it's less than on the left and greater than or equal on the right.
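To make the question concrete, here is a minimal sketch of a Lomuto-style partition on plain `int` values that can run with either comparison. It is not the PR's code: the class name `PartitionSketch`, the `strict` flag, and the use of `int[]` in place of `IntAndVector` are all assumptions made for illustration. Only the `store`/`idx` loop shape is taken from the snippet above.

```java
import java.util.Arrays;

/** Hypothetical demo class; names are illustrative and not taken from Tribuo. */
final class PartitionSketch {

    /**
     * Lomuto-style partition of values[left..right] (inclusive) around the value at pivotIndex.
     * With strict == true the test is "< pivot", otherwise it is "<= pivot".
     * Returns the final index of the pivot.
     */
    static int partition(int[] values, int left, int right, int pivotIndex, boolean strict) {
        int pivot = values[pivotIndex];
        swap(values, pivotIndex, right);      // park the pivot at the end of the range
        int store = left;
        for (int idx = left; idx < right; idx++) {
            boolean goesLeft = strict ? values[idx] < pivot : values[idx] <= pivot;
            if (goesLeft) {
                swap(values, store, idx);
                store++;
            }
        }
        swap(values, store, right);           // move the pivot into its final slot
        return store;
    }

    static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    public static void main(String[] args) {
        int[] strict = {5, 3, 5, 1, 5};
        int[] nonStrict = strict.clone();
        int p1 = partition(strict, 0, 4, 0, true);      // duplicates of the pivot stay on the right
        int p2 = partition(nonStrict, 0, 4, 0, false);  // duplicates of the pivot move to the left
        System.out.println("strict:     pivot index " + p1 + " " + Arrays.toString(strict));
        System.out.println("non-strict: pivot index " + p2 + " " + Arrays.toString(nonStrict));
    }
}
```

With the strict test, values equal to the pivot end up at or after the returned index, which matches "less than on the left, greater than or equal on the right". With `<=`, the same duplicates end up before the pivot instead.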
/**
 * Set the median point for an array of {@link IntAndVector}s based, for a specific dimension, through recursive partitioning
 * ensuring that points before it (with lower index) will be <= median, although not sorted, and points after it
 * (with higher index) will be >= median, again not sorted. The order of the array will almost certainly be changed.
So this says that values equal to the median can be found on both sides of the tree, but that seems to conflict with the documented behaviour of the partitionOnIndex function (though not its actual behaviour).
Thanks for catching these documentation discrepancies. For logic like this, it's really important that the docs are perfectly clear.
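For reference, the median-selection contract quoted above (everything before the median is `<=` it, everything after is `>=` it, neither side sorted) can be sketched as a small quickselect over `int` values. This is an illustrative sketch, not the PR's implementation: `MedianSelectSketch`, `select`, and `medianInvariantHolds` are hypothetical names, and it uses an iterative loop with a midpoint pivot where the PR describes recursive partitioning.

```java
import java.util.Arrays;

/** Hypothetical demo class; names are illustrative and not taken from Tribuo. */
final class MedianSelectSketch {

    /** Rearranges values in place so that index k holds the element it would hold if sorted. */
    static void select(int[] values, int k) {
        int left = 0;
        int right = values.length - 1;
        while (left < right) {
            int p = partition(values, left, right, left + (right - left) / 2);
            if (p == k) {
                return;
            } else if (p < k) {
                left = p + 1;    // the k-th element lies to the right of the pivot
            } else {
                right = p - 1;   // the k-th element lies to the left of the pivot
            }
        }
    }

    /** Non-strict Lomuto partition: values <= pivot end up before the returned index. */
    static int partition(int[] values, int left, int right, int pivotIndex) {
        int pivot = values[pivotIndex];
        swap(values, pivotIndex, right);
        int store = left;
        for (int idx = left; idx < right; idx++) {
            if (values[idx] <= pivot) {
                swap(values, store, idx);
                store++;
            }
        }
        swap(values, store, right);
        return store;
    }

    static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    /** Checks the documented contract: lower indices are <= values[k], higher indices are >= values[k]. */
    static boolean medianInvariantHolds(int[] values, int k) {
        for (int i = 0; i < values.length; i++) {
            if (i < k && values[i] > values[k]) return false;
            if (i > k && values[i] < values[k]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        int[] values = {4, 2, 4, 9, 4, 1, 4};
        int k = values.length / 2;
        select(values, k);
        System.out.println("median = " + values[k] + ", array = " + Arrays.toString(values));
        System.out.println("invariant holds: " + medianInvariantHolds(values, k));
    }
}
```

Because the input contains several copies of the median value, copies can legitimately sit on either side of index `k` after selection. That is why the selection contract is stated with `<=` and `>=`, and why it needs to be reconciled with whatever the partition step documents.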
I think you missed converting this over to just returning Arrays.asList.
Description
This PR introduces a k-d tree implementation, which can be used for nearest neighbour queries, and integrates this feature with Hdbscan and KNN. Initial performance measurements demonstrate that this k-d tree implementation is, in general, much faster than using a brute-force approach for larger datasets.
Once these changes have been reviewed, a required change to the Hdbscan tutorial will be made.
This PR supersedes #230, which contains a lot of interesting code review discussion but was closed because the branch was recreated.
Motivation
K-d trees can provide faster nearest neighbour queries compared to a brute-force technique. This results in faster model training times.
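As a rough illustration of why this helps, here is a minimal, self-contained k-d tree sketch with a single nearest-neighbour query, checked against a brute-force scan. It is not the implementation in this PR: `KdTreeSketch` and its methods are hypothetical, it builds the tree by sorting each sub-range rather than by median selection, and it uses squared Euclidean distance on plain `double[]` points.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical demo class; names are illustrative and not taken from Tribuo. */
final class KdTreeSketch {

    private static final class Node {
        final double[] point;
        final Node left;
        final Node right;
        Node(double[] point, Node left, Node right) {
            this.point = point;
            this.left = left;
            this.right = right;
        }
    }

    private final Node root;
    private final int dims;

    KdTreeSketch(double[][] points) {
        this.dims = points[0].length;
        // Clone the outer array so the caller's ordering is left untouched.
        this.root = build(points.clone(), 0, points.length, 0);
    }

    /** Builds the subtree for pts[from..to) by splitting on the median along the current axis. */
    private Node build(double[][] pts, int from, int to, int depth) {
        if (from >= to) return null;
        final int axis = depth % dims;
        Arrays.sort(pts, from, to, (a, b) -> Double.compare(a[axis], b[axis]));
        int mid = from + (to - from) / 2;
        return new Node(pts[mid], build(pts, from, mid, depth + 1), build(pts, mid + 1, to, depth + 1));
    }

    double[] nearest(double[] query) {
        return nearest(root, query, 0, null);
    }

    private double[] nearest(Node node, double[] query, int depth, double[] best) {
        if (node == null) return best;
        if (best == null || sqDist(node.point, query) < sqDist(best, query)) best = node.point;
        int axis = depth % dims;
        double diff = query[axis] - node.point[axis];
        Node near = diff < 0 ? node.left : node.right;
        Node far = diff < 0 ? node.right : node.left;
        best = nearest(near, query, depth + 1, best);
        // Only visit the far side if the splitting plane is closer than the best point so far.
        if (diff * diff < sqDist(best, query)) {
            best = nearest(far, query, depth + 1, best);
        }
        return best;
    }

    static double[] bruteForce(double[][] points, double[] query) {
        double[] best = points[0];
        for (double[] p : points) {
            if (sqDist(p, query) < sqDist(best, query)) best = p;
        }
        return best;
    }

    static double sqDist(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[][] points = new double[1000][3];
        for (double[] p : points) {
            for (int i = 0; i < p.length; i++) p[i] = rng.nextDouble();
        }
        KdTreeSketch tree = new KdTreeSketch(points);
        boolean allMatch = true;
        for (int q = 0; q < 100; q++) {
            double[] query = {rng.nextDouble(), rng.nextDouble(), rng.nextDouble()};
            allMatch &= sqDist(tree.nearest(query), query) == sqDist(bruteForce(points, query), query);
        }
        System.out.println("k-d tree agrees with brute force: " + allMatch);
    }
}
```

The brute-force scan touches every point on every query, whereas the tree query can skip a whole subtree whenever the splitting plane is farther away than the best candidate found so far. The pruning becomes less effective as the number of dimensions grows, so any speed-up depends on the dataset.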
Paper reference
This is the original paper that proposes the k-d tree:
J.L. Bentley, "Multidimensional Binary Search Trees Used for Associative Searching", Commun. ACM, Vol. 18, Sept. 1975, pp. 509–517