Add a k-d tree implementation, and integrate it with Hdbscan and KNN. #231

Merged
Craigacp merged 6 commits into oracle:main from geoffreydstewart:kd-tree on Apr 25, 2022

Member

geoffreydstewart commented Apr 20, 2022

Description

This PR introduces a k-d tree implementation, which can be used for nearest neighbour queries, and integrates this feature with Hdbscan and KNN. Initial performance measurements demonstrate that this k-d tree implementation is, in general, much faster than using a brute-force approach for larger datasets.

Once these changes have been reviewed, the Hdbscan tutorial will need a corresponding update.

This PR supersedes #230, which contains a lot of interesting code review discussion but was closed because the branch was recreated.

Motivation

K-d trees can provide faster nearest neighbour queries compared to a brute-force technique. This results in faster model training times.
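To make the motivation concrete, here is a minimal sketch of a k-d tree nearest-neighbour query next to a brute-force scan. It is not the implementation from this PR: the class name `KdTreeSketch`, its methods, and the choice to build by sorting each level are assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

/** Illustrative k-d tree for single nearest-neighbour queries (hypothetical, not the Tribuo code). */
final class KdTreeSketch {
    private static final class Node {
        final double[] point;
        final int axis;
        Node left;
        Node right;
        Node(double[] point, int axis) { this.point = point; this.axis = axis; }
    }

    private final Node root;

    KdTreeSketch(double[][] points) {
        // Clone the outer array so the caller's ordering is left untouched.
        this.root = build(points.clone(), 0, points.length, 0);
    }

    // Builds the subtree for pts[from, to), splitting at the median along the current axis.
    private static Node build(double[][] pts, int from, int to, int depth) {
        if (from >= to) return null;
        int axis = depth % pts[from].length;
        // Sorting every level is simple but not optimal; a selection algorithm would be cheaper.
        Arrays.sort(pts, from, to, Comparator.comparingDouble((double[] p) -> p[axis]));
        int mid = (from + to) >>> 1;
        Node node = new Node(pts[mid], axis);
        node.left = build(pts, from, mid, depth + 1);
        node.right = build(pts, mid + 1, to, depth + 1);
        return node;
    }

    /** Returns a stored point with minimal squared Euclidean distance to the query. */
    double[] nearest(double[] query) {
        return nearest(root, query, null);
    }

    private static double[] nearest(Node node, double[] query, double[] best) {
        if (node == null) return best;
        if (best == null || sqDist(node.point, query) < sqDist(best, query)) best = node.point;
        double diff = query[node.axis] - node.point[node.axis];
        Node near = diff < 0 ? node.left : node.right;
        Node far = diff < 0 ? node.right : node.left;
        best = nearest(near, query, best);
        // Visit the far side only if the splitting plane is closer than the best match so far.
        if (diff * diff < sqDist(best, query)) best = nearest(far, query, best);
        return best;
    }

    /** Brute-force baseline: scans every point. */
    static double[] bruteForceNearest(double[][] points, double[] query) {
        double[] best = points[0];
        for (double[] p : points) {
            if (sqDist(p, query) < sqDist(best, query)) best = p;
        }
        return best;
    }

    static double sqDist(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[][] points = new double[1000][3];
        for (double[] p : points) {
            for (int i = 0; i < p.length; i++) p[i] = rng.nextDouble();
        }
        double[] query = {0.5, 0.5, 0.5};
        KdTreeSketch tree = new KdTreeSketch(points);
        System.out.println(Arrays.toString(tree.nearest(query)));
        System.out.println(Arrays.toString(bruteForceNearest(points, query)));
    }
}
```

The saving comes from the pruning check in `nearest`: whole subtrees are skipped when the splitting plane is already further away than the best candidate, whereas the brute-force scan always touches every point.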

Paper reference

This is the original paper which proposes a k-d tree:
J.L. Bentley, "Multidimensional Binary Search Trees Used for Associative Searching", Commun. ACM, Vol. 18, Sept. 1975, pp. 509–517

oracle-contributor-agreement bot added the "OCA Required" label (at least one contributor does not have an approved Oracle Contributor Agreement) on Apr 20, 2022

@oracle-contributor-agreement

Oracle requires that contributors to all of its open-source projects sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

In order to sign the OCA, you need to create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application by following the steps on the homepage.

When signing the OCA, please provide your GitHub username. By doing so, this PR will be automatically updated once the signed OCA has been approved by Oracle.

Member

Craigacp left a comment

I'd like to properly nail down the semantics of the k-d tree building (which may just be documentation fixes now), and there are a couple of other small points.

Member

This javadoc is incorrect now as the default is BRUTE_FORCE.

    @Test
    public void knnClassificationSingleThreadedTest() {
-       KNNTrainer<Label> classificationTrainer = new KNNTrainer<>(5, DistanceType.L2, 1, new VotingCombiner(), KNNModel.Backend.INNERTHREADPOOL);
+       KNNTrainer<Label> classificationTrainer = new KNNTrainer<>(5, DistanceType.L2, 1,

Member

Similarly to TestHdbscan, can one of these tests use the new constructor that accepts a NeighboursQueryFactory directly?

Member

I think this should say "on the right" as the values less than are on the left.


int store = left;
for (int idx = left; idx < right; idx++) {
    if (compareByDimension(points[idx], pivot, dimension) <= 0) {

Member

Shouldn't this be a strict less than? The docs say it's less than on the left and greater than or equal on the right.
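As a hedged illustration of why the operator matters, the sketch below runs the same Lomuto-style partition with `<` and with `<=` and shows where values equal to the pivot end up. It works on plain `int` values rather than the PR's `IntAndVector` points, and `PartitionTiesSketch`, `partition`, and the `strict` flag are hypothetical names, not code from this PR.

```java
import java.util.Arrays;

/** Shows how the comparison in a Lomuto-style partition decides which side ties land on. */
final class PartitionTiesSketch {
    /**
     * Partitions values[left..right] around the value at pivotIndex and returns the pivot's final index.
     * strict == true:  only values < pivot move left, so ties stay on the right.
     * strict == false: values <= pivot move left, so ties end up on the left.
     */
    static int partition(int[] values, int left, int right, int pivotIndex, boolean strict) {
        int pivot = values[pivotIndex];
        swap(values, pivotIndex, right); // park the pivot at the end of the range
        int store = left;
        for (int idx = left; idx < right; idx++) {
            boolean goesLeft = strict ? values[idx] < pivot : values[idx] <= pivot;
            if (goesLeft) {
                swap(values, store, idx);
                store++;
            }
        }
        swap(values, store, right); // move the pivot into its final slot
        return store;
    }

    private static void swap(int[] values, int i, int j) {
        int tmp = values[i];
        values[i] = values[j];
        values[j] = tmp;
    }

    public static void main(String[] args) {
        int[] nonStrict = {5, 3, 5, 8, 5, 1};
        int[] strict = nonStrict.clone();
        int p1 = partition(nonStrict, 0, nonStrict.length - 1, 0, false);
        int p2 = partition(strict, 0, strict.length - 1, 0, true);
        System.out.println("<= : pivot index " + p1 + ", " + Arrays.toString(nonStrict));
        System.out.println("<  : pivot index " + p2 + ", " + Arrays.toString(strict));
    }
}
```

With `<=` the duplicates of the pivot are gathered to its left; with `<` they remain to its right, which is the behaviour the documentation describes.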

/**
 * Set the median point for an array of {@link IntAndVector}s based, for a specific dimension, through recursive partitioning
 * ensuring that points before it (with lower index) will be <= median, although not sorted, and points after it
 * (with higher index) will be >= median, again not sorted. The order of the array will almost certainly be changed.

Member

So this says that values equal to the median can be found on both sides of the tree, but that seems to conflict with the documented behaviour of the partitionOnIndex function (though not its actual behaviour).
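The point about duplicates can be checked with a small selection sketch. Assuming a partition that uses `<=` (as in the hunk discussed above) and, for brevity, an iterative loop over plain `int` values instead of the PR's recursive code over `IntAndVector`, values equal to the median can indeed remain on both sides of it. `MedianSelectSketch` and its methods are hypothetical names for illustration only.

```java
import java.util.Arrays;

/** Quickselect sketch: after select(values, k), values[k] is the k-th smallest element. */
final class MedianSelectSketch {
    /** Rearranges values in place and returns the k-th smallest element (0-based, 0 <= k < length). */
    static int select(int[] values, int k) {
        int left = 0;
        int right = values.length - 1;
        while (left < right) {
            int p = partition(values, left, right);
            if (p == k) break;
            if (p < k) left = p + 1; else right = p - 1;
        }
        return values[k];
    }

    // Lomuto partition with the last element as pivot; values <= pivot move to the left.
    private static int partition(int[] values, int left, int right) {
        int pivot = values[right];
        int store = left;
        for (int idx = left; idx < right; idx++) {
            if (values[idx] <= pivot) {
                swap(values, store, idx);
                store++;
            }
        }
        swap(values, store, right);
        return store;
    }

    private static void swap(int[] values, int i, int j) {
        int tmp = values[i];
        values[i] = values[j];
        values[j] = tmp;
    }

    public static void main(String[] args) {
        int[] values = {7, 5, 1, 5, 9, 5, 3};
        int median = select(values, values.length / 2);
        // Everything before the median index is <= median, everything after is >= median.
        System.out.println(median + " " + Arrays.toString(values));
    }
}
```

Each individual partition step sends ties to the left of its own pivot, but an earlier pivot that happens to equal the final median can already sit to the right of the median index, so the overall guarantee is only `<=` before and `>=` after.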

Member Author

Thanks for catching these documentation discrepancies. For logic like this, it's really important that the docs are perfectly clear.

Member

I think you missed converting this over to just returning Arrays.asList.

oracle-contributor-agreement bot removed the "OCA Required" label (at least one contributor does not have an approved Oracle Contributor Agreement) on Apr 22, 2022
Member

Craigacp left a comment

LGTM

Craigacp merged commit f4b4b5c into oracle:main on Apr 25, 2022
geoffreydstewart deleted the kd-tree branch on June 6, 2022 17:41