Add a k-d tree implementation, and integrate it with Hdbscan and KNN. #230

Closed
geoffreydstewart wants to merge 3 commits into oracle:main from geoffreydstewart:kd-tree

@geoffreydstewart
Member

Description

This PR introduces a k-d tree implementation, which can be used for nearest neighbour queries, and integrates this feature with Hdbscan and KNN. Initial performance measurements demonstrate that this k-d tree implementation is, in general, much faster than using a brute-force approach for larger datasets.

Once these changes have been reviewed, the Hdbscan tutorial will need a corresponding update, which I will make separately.

Motivation

K-d trees can provide faster nearest neighbour queries compared to a brute-force technique. This results in faster model training times.

Paper reference

This is the original paper which proposes a k-d tree:
J. L. Bentley, "Multidimensional Binary Search Trees Used for Associative Searching", Commun. ACM, vol. 18, Sept. 1975, pp. 509–517.

Member

@Craigacp left a comment

There are a bunch of small changes, and a few larger ones. I think the logic in KNNModel could probably do with drastic revision at this point, because the neighbour query infrastructure integrates poorly with KNNModel's built-in threading options; however, that's a problem for another PR.

I'm also interested in the behaviour of the kd-tree when working on binary/integer datasets, and also how the recursion behaves in real use (in case we need to flip it over to iteration like we did in the CART tree package).

Member

How unbalanced can this tree get if we adversarially prepare a dataset? Can it pop the stack with the recursion in this call?

Member Author

This algorithm always constructs a balanced tree, but I suppose some hand-crafted datasets could still cause it to perform poorly. With recursion, there is always a risk that building a tree from a huge dataset overflows the stack, but I suspect we would also face memory issues in those scenarios. It sounds like you have more experience with this type of issue, so I'd be interested in hearing more.
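To put a rough number on the stack-depth concern: if construction always splits at the median index, the recursion depth equals the tree height, which grows logarithmically with the number of points. The following is a minimal sketch of that arithmetic, not code from this PR. It assumes each node stores the median point and the remaining points are divided as evenly as possible between the two subtrees; the class and method names are hypothetical.

```java
final class BalancedDepthSketch {

    // Height of a tree built by storing the median at each node and splitting
    // the remaining n - 1 points as evenly as possible between two subtrees.
    // The larger subtree then holds floor(n / 2) points.
    static int height(long n) {
        int levels = 0;
        while (n > 0) {
            levels++;
            n = n / 2; // descend into the larger of the two subtrees
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(height(10L));            // 4
        System.out.println(height(1_000_000_000L)); // 30
    }
}
```

Under these assumptions even a billion points needs only about 30 nested calls during construction, so a stack overflow would require the tree to be far from balanced.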

Member

So I agree that it is picking the median, but if the array is [0,1,1,1,1,1,1,1] then the invariants say that it should split that into [0] and [1,1,1,1,1,1,1], so I wonder what happens if you have a single data point that's all zeros and then every other data point is the same and all ones. We can decide that adversarial datasets are not our problem, but I'd like to have a rough idea how adversarial it has to be before it goes pop.
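The concern can be made concrete with a sketch that partitions by value rather than by index. It assumes the invariant reads "values strictly smaller than the split go left, values equal to or greater than it go right", which is how the comment above describes it. That is an assumption about the invariant, not a statement of what the PR's code does, and the names below are hypothetical.

```java
import java.util.Arrays;

final class StrictSplitSketch {

    // Returns {leftSize, rightSize} when the values are split at the median
    // value and the invariant "left < split <= right" is enforced strictly.
    static int[] partitionSizes(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double split = sorted[sorted.length / 2];
        int left = 0;
        for (double v : sorted) {
            if (v < split) {
                left++;
            }
        }
        return new int[] {left, sorted.length - left};
    }

    public static void main(String[] args) {
        double[] values = {0, 1, 1, 1, 1, 1, 1, 1};
        System.out.println(Arrays.toString(partitionSizes(values))); // [1, 7]
    }
}
```

If a tree enforced this rule and each node stored only its own split point, the all-ones side could shed just one point per level, so the depth would grow linearly with the number of duplicates rather than logarithmically.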

Member Author

I may not be understanding the issue here. If I use this code to create a tree from the point (0,0,0,0) and 9 points like (1,1,1,1), I still get a balanced tree. The root node will be (1,1,1,1). Traversing the tree by following the below nodes arrives at the leaf point (0,0,0,0). All the other nodes in the tree are (1,1,1,1). I have added a test to the performance test branch I'm maintaining where this scenario can be executed in the debugger: adversarial test
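The scenario described above can be sketched as follows. This is not the PR's implementation; it is a minimal index-median construction (sort the sub-range on the current dimension, store the middle element, recurse on both halves) with hypothetical names, which is enough to show why the height stays logarithmic even when nine of the ten points are identical.

```java
import java.util.Arrays;
import java.util.Comparator;

final class MedianIndexKdTreeSketch {

    static final class Node {
        final double[] point;
        Node below; // built from the indices before the median
        Node above; // built from the indices after the median

        Node(double[] point) {
            this.point = point;
        }
    }

    // Builds the subtree for points[from, to), splitting on dimension depth % dims.
    // Note that this sorts the given array range in place.
    static Node build(double[][] points, int from, int to, int depth) {
        if (from >= to) {
            return null;
        }
        int dim = depth % points[from].length;
        Arrays.sort(points, from, to, Comparator.comparingDouble((double[] p) -> p[dim]));
        int mid = from + (to - from) / 2;
        Node node = new Node(points[mid]);
        node.below = build(points, from, mid, depth + 1);
        node.above = build(points, mid + 1, to, depth + 1);
        return node;
    }

    static int height(Node node) {
        return node == null ? 0 : 1 + Math.max(height(node.below), height(node.above));
    }

    // One all-zeros point followed by n - 1 all-ones points.
    static double[][] adversarial(int n, int dims) {
        double[][] points = new double[n][dims]; // row 0 stays all zeros
        for (int i = 1; i < n; i++) {
            Arrays.fill(points[i], 1.0);
        }
        return points;
    }

    public static void main(String[] args) {
        double[][] points = adversarial(10, 4);
        Node root = build(points, 0, points.length, 0);
        System.out.println(height(root));                       // 4
        System.out.println(Arrays.toString(root.point));        // [1.0, 1.0, 1.0, 1.0]
        System.out.println(Arrays.toString(root.below.point));  // [1.0, 1.0, 1.0, 1.0]
    }
}
```

In this sketch the split is chosen by position, so points whose coordinate equals the split value can end up on the below side of a node. A query over such a tree would have to treat the below side as "less than or equal" rather than "strictly less than".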

Member

Ok. I had a look through the test. When I run it the trees are balanced, but points with a value equal to the split point appear on the left. I thought the invariant was that this couldn't happen? I've left some comments on the new PR in the relevant places.

@Craigacp added the "Oracle employee" and "squash-commits" labels on Apr 15, 2022
Member

@Craigacp left a comment

A few more small things. Also could you rebase this branch on top of main? There's a conflict in TestHdbscan after I added the public cluster exemplar test.

@oracle-contributor-agreement bot added the "OCA Required" label on Apr 19, 2022
@oracle-contributor-agreement

Oracle requires that contributors to all of its open-source projects sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

To sign the OCA, you need to create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application by following the steps on the homepage.

When signing the OCA, please provide your GitHub username. By doing so, this PR will be automatically updated once the signed OCA is approved by Oracle.
