MAINT Introduce Pairwise Distances Reductions private submodule #22064
This introduces the necessary private implementations for a new private submodule, i.e.:
- DatasetsPair, an abstraction to wrap a pair of two datasets and compute their vectors pairwise distances
- DenseDenseDatasetsPair, a first implementation of DatasetsPair for a pair of two dense datasets
- PairwiseDistancesReduction, an abstraction allowing computing reductions efficiently in parallel and of
- PairwiseDistancesArgkmin, a first implementation of PairwiseDistancesReduction for k-Nearest Neighbors search
Co-authored-by: Thomas J. Fan <[email protected]>
A review round about _dist_metrics.pyx.
Review on setup:wink:
Co-authored-by: Christian Lorentzen <[email protected]>
A pass on `cdef class PairwiseDistancesReduction`.
Co-authored-by: Christian Lorentzen <[email protected]>
9ba06e4 to bd3f8fe
This was introduced once a long time ago for a failure which was happening in a single configuration (see the comments). Let's see if this has been fixed. Co-authored-by: Christian Lorentzen <[email protected]>
bd3f8fe to 5d7ea09
Co-authored-by: Christian Lorentzen <[email protected]> Co-authored-by: Thomas J. Fan <[email protected]>
Add "I-hope-useful" precision for some comments regarding parallel sections.
Here is a batch of suggestions to help with https://github.com/scikit-learn/scikit-learn/pull/22064/files#r774754046.
If the changes suggested to improve the docstring on the `strategy` parameter are accepted, the `_parallel_on_X` and `_parallel_on_Y` docstrings should be updated accordingly.
Co-authored-by: Olivier Grisel <[email protected]>
Another pass of review. Once addressed (especially the comment on the tests), I think this PR LGTM.
Minor comments, but I think this is ready to go.
There are two paths for follow up:
1. FastEuclideanPairwiseArgKmin
2. Interfacing implementation with scikit-learn interfaces
I am in favor of option 2 first, because with option 2 completed, we are merged directly into `main`. FastEuclideanPairwiseArgKmin would be a follow up performance improvement.
Co-authored-by: Thomas J. Fan <[email protected]> Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
To have argkmin_indices always be the first for consistency. Co-authored-by: Olivier Grisel <[email protected]>
This is more appropriate, especially for dynamic allocation. See: cython.readthedocs.io/en/latest/src/userguide/special_methods.html#initialisation-methods-cinit-and-init
"The __cinit__() method is where you should perform basic C-level initialisation of the object, including allocation of any C data structures that your object will own."
Co-authored-by: Thomas J. Fan <[email protected]>
It might actually be needed to avoid a performance regression on machines with few cores as the current numpy code already uses the GEMM trick.
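For readers unfamiliar with the GEMM trick mentioned here: squared Euclidean distances can be expanded as ||x - y||² = ||x||² - 2 x·y + ||y||², so the bulk of the work becomes one matrix product. The following is only a minimal NumPy sketch of that identity, not the scikit-learn code; the function name `sqeuclidean_gemm` is made up for illustration.

```python
import numpy as np


def sqeuclidean_gemm(X, Y):
    """Squared Euclidean distances via ||x||^2 - 2 x.y + ||y||^2 (illustrative helper)."""
    X_sq = np.einsum("ij,ij->i", X, X)[:, np.newaxis]  # squared row norms, shape (n_X, 1)
    Y_sq = np.einsum("ij,ij->i", Y, Y)[np.newaxis, :]  # squared row norms, shape (1, n_Y)
    D = X_sq - 2.0 * (X @ Y.T) + Y_sq  # X @ Y.T is the single matrix product (GEMM)
    # Floating-point rounding can leave tiny negative values; clip them to zero.
    np.maximum(D, 0.0, out=D)
    return D


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Y = rng.standard_normal((4, 3))

# Direct computation from the definition, for comparison.
naive = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
print(np.allclose(sqeuclidean_gemm(X, Y), naive))
```

The expansion trades some numerical accuracy (cancellation when two points are close) for the speed of an optimized matrix product, which is why the result is clipped at zero.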
+1 for merge once the tests have been updated: remove the translation invariance check, which is too complicated to get working in general, and replace it with a unit test that compares the ArgKMin implementation (with both parallelism strategies) against a naive implementation based on cdist/argpartition on small-ish random data, as discussed in https://github.com/scikit-learn/scikit-learn/pull/22064/files#r777541915
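A naive reference of the kind described in this comment could look roughly like the sketch below. It is an assumption about the shape of such a test helper, not the test that was merged: `naive_argkmin` is a hypothetical name, the return order is an arbitrary choice, and NumPy broadcasting stands in for `scipy.spatial.distance.cdist` to keep the example dependency-light.

```python
import numpy as np


def naive_argkmin(X, Y, k):
    """Reference k-nearest-neighbors search: full distance matrix, then top k."""
    # Full (n_X, n_Y) Euclidean distance matrix; cdist(X, Y) would give the same values.
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    # argpartition moves the k smallest entries of each row to the front, unordered.
    idx = np.argpartition(dist, k - 1, axis=1)[:, :k]
    # Sort those k candidates by distance so results can be compared exactly.
    part = np.take_along_axis(dist, idx, axis=1)
    order = np.argsort(part, axis=1)
    return np.take_along_axis(idx, order, axis=1), np.take_along_axis(part, order, axis=1)


rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
Y = rng.standard_normal((30, 4))
indices, distances = naive_argkmin(X, Y, k=5)
print(indices.shape, distances.shape)
```

A test would then run the implementation under review on the same `X`, `Y` and `k`, once per parallelism strategy, and compare its indices and distances with these reference arrays.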
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
So do I. 👍
@jjerphan here is what I had in mind for the test: jjerphan#7
A typo from an unfinished further review pass. Thanks @ogrisel and @thomasjpfan for finishing your reviews. I got busy these days.
@jjerphan The failing CI due to several |
Co-authored-by: Christian Lorentzen <[email protected]>
If there is a performance regression, then I agree that |
CI is blue! Merging!
…min` (feature branch) (#22134)
* MAINT Introduce Pairwise Distances Reductions private submodule (#22064)
* MAINT Introduce FastEuclideanPairwiseArgKmin (#22065)
* fixup! Merge branch 'main' into pairwise-distances-argkmin
  Remove duplicated Bunch
* MAINT Plug `PairwiseDistancesArgKmin` as a back-end (#22288)
* Forward pairwise_dist_chunk_size in the configuration
* Flip finalized results for PairwiseDistancesArgKmin
  The previous would have made the code more complex by introducing some boilerplate for the interface plugs. Having it this way actually simplifies the code. This also removes the haversine branch for test_pairwise_distances_argkmin
* Plug PairwiseDistancesArgKmin as a back-end
* Adapt test accordingly
* Add whats_new entry
* Change input validation order for kneighbors
* Remove duplicated test_neighbors_distance_metric_deprecation
* Adapt the documentation
* Add mahalanobis case to test fixtures
* Correct whats_new entry
* CLN Remove unneeded private metric attribute
  This was needed when 'fast_sqeuclidean' and 'fast_euclidean' were present to choose the best implementation based on the user specification. Those metric have been removed since then, making this attribute useless.
* TST Assert FutureWarning instead of DeprecationWarning in test_neighbors_metrics
* MAINT Add use_pairwise_dist_activate to scikit-learn config
* TST Add a test for the 'brute' backends' results' consistency
  Co-authored-by: Thomas J. Fan <[email protected]>
* fixup! MAINT Add use_pairwise_dist_activate to scikit-learn config
* fixup! fixup! MAINT Add use_pairwise_dist_activate to scikit-learn config
* TST Filter FutureWarning for WMinkowskiDistance
* MAINT pin numpydoc in arm for now (#22292)
* fixup! TST Filter FutureWarning for WMinkowskiDistance
* Revert keywords arguments removal for the GEMM trick for 'euclidean'
* MAINT pin max numpydoc for now (#22286)
* Add 'haversine' to CDIST_PAIRWISE_DISTANCES_REDUCTION_COMMON_METRICS
* fixup! Add 'haversine' to CDIST_PAIRWISE_DISTANCES_REDUCTION_COMMON_METRICS
* Apply suggestions from code review
* MAINT Document some config parameters for maintenance
  Also rename one of them.
* FIX Support and test one of 'sqeuclidean' specification
  Co-authored-by: Olivier Grisel <[email protected]>
* FIX Various typos fix and correct haversine
  'haversine' is not supported by cdist.
* Directly use get_config
* CLN Apply comments from review
* Motivate swapped returned values
* TST Remove mahalanobis from test fixtures
* MNT Add comment regaduction functions' signatures
* TST Complete test for `pairwise_distance_{argmin,argmin_min}` (#22371)
* DOC Add sub-pull requests to the whats_new entry
* DOC place comment inside functions
* DOC move up whatsnew entry
Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: Christian Lorentzen <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Jérémie du Boisberranger <[email protected]>
Reference Issues/PRs
Part of splitting #21462 (comment)
⚠ This targets the `upstream:pairwise-distances-argkmin` feature branch, not `upstream:main`.
What does this implement/fix? Explain your changes.
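This introduces the necessary private implementations for a new private submodule, i.e.:
- DatasetsPair, an abstraction to wrap a pair of two datasets and compute their vectors pairwise distances
- DenseDenseDatasetsPair, a first implementation of DatasetsPair for pair of two dense datasets
- PairwiseDistancesReduction, an abstraction allowing computing reductions efficiently in parallel and of
- PairwiseDistancesArgkmin, a first implementation of PairwiseDistancesReduction for k-Nearest Neighbors search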
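To picture how the abstractions described above relate to each other, here is a small pure-Python sketch. It does not mirror the Cython code in this PR: `DenseDensePair`, its methods, `argkmin`, and `chunk_size` are hypothetical names chosen for illustration, and the chunks run sequentially here, whereas a real implementation would presumably dispatch them to threads.

```python
import heapq
import math


class DenseDensePair:
    """Hypothetical stand-in for a pair of dense datasets (lists of points)."""

    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

    def n_samples_X(self):
        return len(self.X)

    def n_samples_Y(self):
        return len(self.Y)

    def dist(self, i, j):
        # Euclidean distance between the i-th row of X and the j-th row of Y.
        return math.dist(self.X[i], self.Y[j])


def argkmin(pair, k, chunk_size=2):
    """For each row of X, return the indices of its k closest rows of Y."""
    n_X, n_Y = pair.n_samples_X(), pair.n_samples_Y()
    indices = [None] * n_X
    # Chunks of X rows are independent of each other, so each chunk could be
    # processed by a separate worker; this sketch simply loops over them.
    for start in range(0, n_X, chunk_size):
        for i in range(start, min(start + chunk_size, n_X)):
            indices[i] = heapq.nsmallest(k, range(n_Y), key=lambda j: pair.dist(i, j))
    return indices


X = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
Y = [(0.0, 1.0), (9.0, 9.0), (5.0, 6.0), (100.0, 100.0)]
print(argkmin(DenseDensePair(X, Y), k=2))  # → [[0, 2], [1, 2], [2, 1]]
```

Keeping the distance computation behind the pair object means the reduction only ever asks for `dist(i, j)`, which is why a different pair type could be swapped in without touching the reduction logic.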