Add a k-d tree implementation, and integrate it with Hdbscan and KNN.#230
geoffreydstewart wants to merge 3 commits into oracle:main
Craigacp left a comment
There are a bunch of small changes, and a few larger ones. I think the logic in KNNModel could probably do with drastic revision at this point, because the neighbour query infrastructure integrates poorly with KNNModel's built-in threading options; however, that's a problem for another PR.
I'm also interested in the behaviour of the kd-tree when working on binary/integer datasets, and also how the recursion behaves in real use (in case we need to flip it over to iteration like we did in the CART tree package).
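To make the recursion-versus-iteration question concrete, here is a minimal sketch of a median-split k-d tree build that keeps its pending work on a heap-allocated `ArrayDeque` instead of the call stack. It is not the code in this PR: the class `IterativeKDBuild`, the `Task` record of pending ranges, and the `below`/`above` field names are all assumptions made for illustration, and it re-sorts each range rather than using a selection algorithm.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Hypothetical sketch of a k-d tree build without recursion; not Tribuo's KDTree. */
final class IterativeKDBuild {
    static final class Node {
        final double[] point;
        Node below; // subtree built from the range before the median position
        Node above; // subtree built from the range after the median position
        Node(double[] point) { this.point = point; }
    }

    /** Pending work: build the subtree for pts[lo, hi) and attach it to parent. */
    private static final class Task {
        final int lo, hi, depth;
        final Node parent;
        final boolean isBelow;
        Task(int lo, int hi, int depth, Node parent, boolean isBelow) {
            this.lo = lo; this.hi = hi; this.depth = depth;
            this.parent = parent; this.isBelow = isBelow;
        }
    }

    static Node build(double[][] points) {
        double[][] pts = points.clone(); // reorder a copy of the row references only
        Node root = null;
        Deque<Task> stack = new ArrayDeque<>();
        stack.push(new Task(0, pts.length, 0, null, false));
        while (!stack.isEmpty()) {
            Task t = stack.pop();
            if (t.lo >= t.hi) continue; // empty range, nothing to attach
            int dim = t.depth % pts[0].length; // cycle through the dimensions
            Arrays.sort(pts, t.lo, t.hi, (a, b) -> Double.compare(a[dim], b[dim]));
            int mid = (t.lo + t.hi) >>> 1; // positional median
            Node node = new Node(pts[mid]);
            if (t.parent == null) root = node;
            else if (t.isBelow) t.parent.below = node;
            else t.parent.above = node;
            stack.push(new Task(t.lo, mid, t.depth + 1, node, true));
            stack.push(new Task(mid + 1, t.hi, t.depth + 1, node, false));
        }
        return root;
    }

    /** Counts tree levels with a breadth-first walk, again without recursion. */
    static int height(Node root) {
        int levels = 0;
        Deque<Node> queue = new ArrayDeque<>();
        if (root != null) queue.add(root);
        while (!queue.isEmpty()) {
            levels++;
            for (int i = queue.size(); i > 0; i--) {
                Node n = queue.remove();
                if (n.below != null) queue.add(n.below);
                if (n.above != null) queue.add(n.above);
            }
        }
        return levels;
    }

    public static void main(String[] args) {
        double[][] pts = new double[10][];
        pts[0] = new double[] {0, 0, 0, 0};
        for (int i = 1; i < pts.length; i++) pts[i] = new double[] {1, 1, 1, 1};
        System.out.println("height = " + height(build(pts)));
    }
}
```

With a positional median the height grows logarithmically in the number of points regardless of the values, so in this sketch the explicit stack stays small; the iterative form mainly matters if the split rule can produce long chains.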
Math/src/main/java/org/tribuo/math/neighbour/kdtree/KDTree.java
How unbalanced can this tree get if we adversarially prepare a dataset? Can it pop the stack with the recursion in this call?
This algorithm always constructs a balanced tree, but I suppose some hand-crafted datasets could still make it perform poorly. With recursion, there is always a risk that a tree built from a huge dataset overflows the stack, but I suspect we may also face memory issues in those scenarios. It sounds like you have more experience with this type of issue, so I'd be interested in hearing more.
So I agree that it is picking the median, but if the array is [0,1,1,1,1,1,1,1] then the invariants say it should be split into [0] and [1,1,1,1,1,1,1]. That makes me wonder what happens if a single data point is all zeros and every other data point is identical and all ones. We can decide that adversarial datasets are not our problem, but I'd like to have a rough idea of how adversarial a dataset has to be before it goes pop.
I may not be understanding the issue here. If I use this code to create a tree from the point (0,0,0,0) and 9 points like (1,1,1,1), I still get a balanced tree. The root node will be (1,1,1,1). Traversing the tree by following the below nodes arrives at the leaf point (0,0,0,0). All the other nodes in the tree are (1,1,1,1). I have added a test to the performance test branch I'm maintaining so that this scenario can be run in the debugger: adversarial test
Ok. I had a look through the test. When I run it, the trees are balanced, but points with a value equal to the split point appear on the left. I thought the invariant was that this couldn't happen? I've left some comments on the new PR in the relevant places.
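The two behaviours discussed in this thread can be compared on the example array directly. The sketch below is purely illustrative and does not reproduce the partitioning code in this PR: `SplitRules`, `positionalSplit`, `strictSplit`, and `tiesOnLeft` are hypothetical names. A positional median always yields balanced halves but can leave values equal to the split value on the left, whereas a strict rule (everything on the left is strictly smaller than the split value) keeps the invariant at the cost of balance when there are many duplicates.

```java
/** Hypothetical comparison of two split rules on a sorted 1-D column; not Tribuo code. */
final class SplitRules {
    /** Split at index n/2 regardless of ties. Returns {leftSize, rightSize}, excluding the split element. */
    static int[] positionalSplit(double[] sorted) {
        int mid = sorted.length / 2;
        return new int[] {mid, sorted.length - mid - 1};
    }

    /** Slide the split to the first occurrence of the median value so the left side is strictly smaller. */
    static int[] strictSplit(double[] sorted) {
        int mid = sorted.length / 2;
        while (mid > 0 && sorted[mid - 1] == sorted[mid]) {
            mid--;
        }
        return new int[] {mid, sorted.length - mid - 1};
    }

    /** Number of left-side values equal to the split value under the positional rule. */
    static int tiesOnLeft(double[] sorted) {
        int mid = sorted.length / 2;
        int ties = 0;
        for (int i = 0; i < mid; i++) {
            if (sorted[i] == sorted[mid]) ties++;
        }
        return ties;
    }

    public static void main(String[] args) {
        double[] column = {0, 1, 1, 1, 1, 1, 1, 1};
        int[] positional = positionalSplit(column);
        int[] strict = strictSplit(column);
        System.out.println("positional: left=" + positional[0] + " right=" + positional[1]
                + " tiesOnLeft=" + tiesOnLeft(column));
        System.out.println("strict:     left=" + strict[0] + " right=" + strict[1]);
    }
}
```

For [0,1,1,1,1,1,1,1] the positional rule gives a 4/3 split with three ties on the left, while the strict rule gives a 1/6 split. Applied recursively, the strict rule would keep peeling off lopsided subtrees on duplicate-heavy data, which is the scenario where recursion depth becomes a concern.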
Oracle requires that contributors to all of its open-source projects sign the Oracle Contributor Agreement (OCA).
In order to sign the OCA, you need to create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application by following the steps on the homepage. When signing the OCA, please provide your GitHub username. By doing so, this PR will be automatically updated once the signed OCA is approved by Oracle.
Description
This PR introduces a k-d tree implementation, which can be used for nearest neighbour queries, and integrates this feature with Hdbscan and KNN. Initial performance measurements demonstrate that this k-d tree implementation is, in general, much faster than using a brute-force approach for larger datasets.
After these changes have been reviewed, the Hdbscan tutorial will need a corresponding update, which will be made separately.
Motivation
K-d trees can provide faster nearest neighbour queries compared to a brute-force technique. This results in faster model training times.
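As a rough illustration of where the speed-up comes from, the sketch below contrasts a brute-force scan with a k-d tree query that skips any subtree lying entirely on the far side of a splitting plane that is already further away than the best match. It is a simplified single-nearest-neighbour version written for this description, not the implementation in this PR; `KDNearestSketch` and its methods are hypothetical names, and squared Euclidean distance is an assumption.

```java
import java.util.Arrays;

/** Hypothetical single-nearest-neighbour sketch; not the KDTree class added in this PR. */
final class KDNearestSketch {
    static final class Node {
        final double[] point;
        final int dim; // dimension this node splits on
        Node below, above;
        Node(double[] point, int dim) { this.point = point; this.dim = dim; }
    }

    /** Recursively builds a tree over pts[lo, hi), reordering pts in place. */
    static Node build(double[][] pts, int lo, int hi, int depth) {
        if (lo >= hi) return null;
        int dim = depth % pts[0].length;
        Arrays.sort(pts, lo, hi, (a, b) -> Double.compare(a[dim], b[dim]));
        int mid = (lo + hi) >>> 1;
        Node node = new Node(pts[mid], dim);
        node.below = build(pts, lo, mid, depth + 1);
        node.above = build(pts, mid + 1, hi, depth + 1);
        return node;
    }

    static double sqDist(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the stored point closest to q; pass null as the initial best. */
    static double[] nearest(Node node, double[] q, double[] best) {
        if (node == null) return best;
        if (best == null || sqDist(q, node.point) < sqDist(q, best)) best = node.point;
        double diff = q[node.dim] - node.point[node.dim];
        Node near = diff < 0 ? node.below : node.above;
        Node far = diff < 0 ? node.above : node.below;
        best = nearest(near, q, best);
        // Only visit the far side if the splitting plane is closer than the current best.
        if (diff * diff < sqDist(q, best)) best = nearest(far, q, best);
        return best;
    }

    /** Baseline: compare the query against every point. */
    static double[] bruteForce(double[][] pts, double[] q) {
        double[] best = pts[0];
        for (double[] p : pts) {
            if (sqDist(q, p) < sqDist(q, best)) best = p;
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] pts = {{2, 3}, {5, 4}, {9, 6}, {4, 7}, {8, 1}, {7, 2}};
        Node root = build(pts, 0, pts.length, 0);
        double[] q = {9, 2};
        System.out.println("k-d tree:    " + Arrays.toString(nearest(root, q, null)));
        System.out.println("brute force: " + Arrays.toString(bruteForce(pts, q)));
    }
}
```

Both methods return a point at the same minimum distance; the tree query simply inspects fewer points when the pruning test succeeds. The benefit depends on dataset size and dimensionality, so any concrete speed-up should be taken from measurements rather than from this sketch.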
Paper reference
This is the original paper which proposes a k-d tree:
J. L. Bentley, "Multidimensional Binary Search Trees Used for Associative Searching", Communications of the ACM, vol. 18, no. 9, Sept. 1975, pp. 509–517.