MAINT Introduce Pairwise Distances Reductions private submodule #22064

@jjerphan jjerphan commented Dec 23, 2021

Reference Issues/PRs

Part of splitting #21462 (comment)

⚠ This targets the upstream:pairwise-distances-argkmin feature branch, not upstream:main.

What does this implement/fix? Explain your changes.

This introduces the necessary private implementations for a new
private submodule, i.e.:

  • DatasetsPair, an abstraction to wrap a pair of datasets and
    compute the pairwise distances between their vectors
  • DenseDenseDatasetsPair, a first implementation of DatasetsPair
    for a pair of dense datasets
  • PairwiseDistancesReduction, an abstraction allowing computing
    reductions efficiently in parallel and of
  • PairwiseDistancesArgkmin, a first implementation of
    PairwiseDistancesReduction for k-Nearest Neighbors search


@lorentzenchr lorentzenchr left a comment

A review round about _dist_metrics.pyx.

@lorentzenchr lorentzenchr left a comment

Review on setup 😉

Co-authored-by: Christian Lorentzen

@lorentzenchr lorentzenchr left a comment

A pass on cdef class PairwiseDistancesReduction.

Co-authored-by: Christian Lorentzen
@jjerphan jjerphan force-pushed the pairwise-distances-argkmin-necessary-implementations branch from 9ba06e4 to bd3f8fe Compare December 23, 2021 16:39
This was introduced once a long time ago for a failure
which was happening in a single configuration (see the comments).

Let's see if this has been fixed.

Co-authored-by: Christian Lorentzen
@jjerphan jjerphan force-pushed the pairwise-distances-argkmin-necessary-implementations branch from bd3f8fe to 5d7ea09 Compare December 23, 2021 16:43

@jjerphan jjerphan left a comment

Add "I-hope-useful" clarifications to some comments regarding parallel sections.

@ogrisel ogrisel left a comment

Here is a batch of suggestions to help with https://github.com/scikit-learn/scikit-learn/pull/22064/files#r774754046.

If the changes suggested to improve the docstring on the strategy parameter are accepted, the _parallel_on_X and _parallel_on_Y docstrings should be updated accordingly.

@ogrisel ogrisel left a comment

Another pass of review. Once addressed (especially the comment on the tests), I think this PR LGTM.

@thomasjpfan thomasjpfan left a comment

Minor comments, but I think this is ready to go.

There are two paths for follow up:

  1. FastEuclideanPairwiseArgKmin
  2. Interfacing implementation with scikit-learn interfaces

I am in favor of option 2 first, because with option 2 completed, we are merged directly into main. FastEuclideanPairwiseArgKmin would be a follow up performance improvement.

jjerphan and others added 4 commits January 4, 2022 11:12
Co-authored-by: Thomas J. Fan
Co-authored-by: Olivier Grisel
To have argkmin_indices always be the first
for consistency.

Co-authored-by: Olivier Grisel
This is more appropriate, especially for dynamic allocation.

See:
cython.readthedocs.io/en/latest/src/userguide/special_methods.html#initialisation-methods-cinit-and-init

    The __cinit__() method is where you should perform basic C-level
    initialisation of the object, including allocation of any C data
    structures that your object will own.

Co-authored-by: Thomas J. Fan

ogrisel commented Jan 4, 2022

FastEuclideanPairwiseArgKmin would be a follow up performance improvement.

It might actually be needed to avoid a performance regression on machines with few cores as the current numpy code already uses the GEMM trick.
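
For readers unfamiliar with the GEMM trick mentioned here: it expands the squared Euclidean distance as ||x - y||² = ||x||² - 2 x·y + ||y||², so that the dominant cost becomes a single matrix multiplication handled by BLAS. The following is a minimal NumPy sketch of that identity only; the function name `sq_euclidean_gemm` is hypothetical and this is neither the existing scikit-learn code nor the planned FastEuclideanPairwiseArgKmin.

```python
import numpy as np


def sq_euclidean_gemm(X, Y):
    """Hypothetical helper: squared Euclidean distances between rows of X
    and rows of Y, computed via ||x||^2 - 2 x.y + ||y||^2."""
    X_sq = np.einsum("ij,ij->i", X, X)[:, None]  # squared row norms, shape (n_X, 1)
    Y_sq = np.einsum("ij,ij->i", Y, Y)[None, :]  # squared row norms, shape (1, n_Y)
    d2 = X_sq - 2.0 * (X @ Y.T) + Y_sq           # X @ Y.T is the GEMM call
    # Floating-point cancellation can yield tiny negative values; clip them.
    np.maximum(d2, 0.0, out=d2)
    return d2
```

The expansion trades some numerical accuracy for speed, which is why the result is clipped at zero before any square root is taken.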

@ogrisel ogrisel left a comment

+1 for merge once the tests have been updated: remove the translation invariance check, which is too complicated to get working in general, and replace it with a unit test that compares the ArgKMin implementation (with both parallelism strategies) against a naive implementation based on cdist/argpartition on small-ish random data, as discussed in https://github.com/scikit-learn/scikit-learn/pull/22064/files#r777541915
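
A naive reference of the kind described in this comment could look like the sketch below. It is an assumption about the shape of such a test helper, not the code that was eventually added to the PR; the name `naive_argkmin` is hypothetical. It relies only on `scipy.spatial.distance.cdist` and `numpy.argpartition`.

```python
import numpy as np
from scipy.spatial.distance import cdist


def naive_argkmin(X, Y, k, metric="euclidean"):
    """Hypothetical reference: indices and distances of the k nearest rows
    of Y for each row of X, using the full distance matrix (k <= len(Y))."""
    dist = cdist(X, Y, metric=metric)
    # argpartition places the k smallest entries first, in arbitrary order.
    part = np.argpartition(dist, k - 1, axis=1)[:, :k]
    part_dist = np.take_along_axis(dist, part, axis=1)
    # Sort those k candidates so results are ordered by increasing distance.
    order = np.argsort(part_dist, axis=1)
    indices = np.take_along_axis(part, order, axis=1)
    distances = np.take_along_axis(part_dist, order, axis=1)
    return indices, distances
```

A test would then draw small random `X` and `Y`, run the implementation under test with each parallelism strategy, and compare its outputs with this reference using an exact comparison for indices and a tolerance for distances.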

jjerphan commented Jan 5, 2022

Minor comments, but I think this is ready to go.

There are two paths for follow up:

1. FastEuclideanPairwiseArgKmin

2. Interfacing implementation with scikit-learn interfaces

I am in favor of option 2 first, because with option 2 completed, we are merged directly into main. FastEuclideanPairwiseArgKmin would be a follow up performance improvement.

So do I. 👍

ogrisel commented Jan 5, 2022

@jjerphan here is what I had in mind for the test: jjerphan#7

@lorentzenchr lorentzenchr left a comment

A typo from an unfinished further review pass. Thanks @ogrisel and @thomasjpfan for finishing your reviews. I got busy these days.

@lorentzenchr

@jjerphan The failing CI, due to several DeprecationWarnings, is maybe the missing piece before merge. Looking forward to it!

@thomasjpfan

It might actually be needed to avoid a performance regression on machines with few cores as the current numpy code already uses the GEMM trick.

If there is a performance regression, then I agree that FastEuclideanPairwiseArgKmin should come first.

ogrisel commented Jan 5, 2022

CI is blue! Merging!

@ogrisel ogrisel merged commit 0532aab into scikit-learn:pairwise-distances-argkmin Jan 5, 2022
@jjerphan jjerphan deleted the pairwise-distances-argkmin-necessary-implementations branch January 5, 2022 14:37
lorentzenchr added a commit that referenced this pull request Feb 17, 2022
…min` (feature branch) (#22134)

* MAINT Introduce Pairwise Distances Reductions private submodule  (#22064)

* MAINT Introduce FastEuclideanPairwiseArgKmin  (#22065)

* fixup! Merge branch 'main' into pairwise-distances-argkmin

Remove duplicated Bunch

* MAINT Plug `PairwiseDistancesArgKmin` as a back-end (#22288)

* Forward pairwise_dist_chunk_size in the configuration

* Flip finalized results for PairwiseDistancesArgKmin

The previous would have made the code more complex
by introducing some boilerplate for the interface plugs.

Having it this way actually simplifies the code.

This also removes the haversine branch for
test_pairwise_distances_argkmin

* Plug PairwiseDistancesArgKmin as a back-end

* Adapt test accordingly

* Add whats_new entry

* Change input validation order for kneighbors

* Remove duplicated test_neighbors_distance_metric_deprecation

* Adapt the documentation

* Add mahalanobis case to test fixtures

* Correct whats_new entry

* CLN Remove unneeded private metric attribute

This was needed when 'fast_sqeuclidean' and 'fast_euclidean'
were present to choose the best implementation based on the user
specification.

Those metrics have been removed since then, making this attribute
useless.

* TST Assert FutureWarning instead of DeprecationWarning in
test_neighbors_metrics

* MAINT Add use_pairwise_dist_activate to scikit-learn config

* TST Add a test for the 'brute' backends' results' consistency

Co-authored-by: Thomas J. Fan

* fixup! MAINT Add use_pairwise_dist_activate to scikit-learn config

* fixup! fixup! MAINT Add use_pairwise_dist_activate to scikit-learn config

* TST Filter FutureWarning for WMinkowskiDistance

* MAINT pin numpydoc in arm for now (#22292)

* fixup! TST Filter FutureWarning for WMinkowskiDistance

* Revert keywords arguments removal for the GEMM trick for 'euclidean'

* MAINT pin max numpydoc for now (#22286)

* Add 'haversine' to CDIST_PAIRWISE_DISTANCES_REDUCTION_COMMON_METRICS

* fixup! Add 'haversine' to CDIST_PAIRWISE_DISTANCES_REDUCTION_COMMON_METRICS

* Apply suggestions from code review

* MAINT Document some config parameters for maintenance

Also rename one of them.

* FIX Support and test one of 'sqeuclidean' specification

Co-authored-by: Olivier Grisel

* FIX Various typos fix and correct haversine

'haversine' is not supported by cdist.

* Directly use get_config

* CLN Apply comments from review

* Motivate swapped returned values

* TST Remove mahalanobis from test fixtures

* MNT Add comment regarding reduction functions' signatures

* TST Complete test for `pairwise_distance_{argmin,argmin_min}` (#22371)

* DOC Add sub-pull requests to the whats_new entry

* DOC place comment inside functions

* DOC move up whatsnew entry

Co-authored-by: Thomas J. Fan
Co-authored-by: Christian Lorentzen
Co-authored-by: Olivier Grisel
Co-authored-by: Jérémie du Boisberranger
thomasjpfan added a commit to thomasjpfan/scikit-learn that referenced this pull request Mar 1, 2022
…min` (feature branch) (scikit-learn#22134)

4 participants