FIX Support F-contiguous arrays for PairwiseDistancesReductions-backed estimators #23990

Merged on Aug 1, 2022 (14 commits)

jjerphan
Member

@jjerphan jjerphan commented Jul 25, 2022

Reference Issues/PRs

Fixes #23988.
Fixes #24013.

What does this implement/fix? Explain your changes.

Only C-contiguous arrays are supported by PairwiseDistancesReductions.
Yet, this is not specified, which makes user-facing estimators fail
when used with F-contiguous arrays.

This PR:

  • makes PairwiseDistancesReductions specify that they only support
    C-contiguous arrays
  • adds tests accordingly
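
For readers unfamiliar with the distinction, here is a minimal NumPy sketch of what C-contiguous and F-contiguous mean. It is only an illustration of the two memory layouts, not code from this PR; the variable names are made up for the example.

```python
import numpy as np

# Same values, two memory layouts.
X_c = np.arange(6, dtype=np.float64).reshape(2, 3)  # row-major (C order)
X_f = np.asfortranarray(X_c)                        # column-major (F order)

# The flags tell which layout an array uses.
print(X_c.flags.c_contiguous, X_c.flags.f_contiguous)  # True False
print(X_f.flags.c_contiguous, X_f.flags.f_contiguous)  # False True

# The values are identical; only the strides (bytes to step per axis) differ.
print(np.array_equal(X_c, X_f))  # True
print(X_c.strides, X_f.strides)  # (24, 8) (8, 16)
```

Code that walks raw memory row by row, as optimized Cython routines typically do, sees different data for the two arrays unless it checks the layout first.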

@jjerphan jjerphan marked this pull request as ready for review July 25, 2022 10:56
@thomasjpfan thomasjpfan added this to the 1.1.2 milestone Jul 25, 2022
@jeremiedbb
Member

With this approach, an estimator that doesn't convert to a C-contiguous array will fall back to the non-optimized path on Fortran arrays. I think we should instead do the conversion at _validate_data time in the estimator, setting order="C". This way we always use the optimized code, and it's the role of the estimator to convert the data appropriately.
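
A rough sketch of the idea of converting at validation time, using plain NumPy. The helper `validate_c_order` is hypothetical and only stands in for an estimator's validation step; it is not scikit-learn's `_validate_data`, and the float64 dtype is an assumption made for the example.

```python
import numpy as np

def validate_c_order(X):
    """Hypothetical stand-in for an estimator's input validation step."""
    # order="C" returns the input unchanged when it is already C-contiguous
    # with the right dtype, and a C-ordered copy otherwise.
    X = np.asarray(X, dtype=np.float64, order="C")
    if X.ndim != 2:
        raise ValueError(f"Expected a 2D array, got {X.ndim}D.")
    return X

X_f = np.asfortranarray(np.random.default_rng(0).random((5, 3)))
X_valid = validate_c_order(X_f)

print(X_valid.flags.c_contiguous)         # True
print(np.shares_memory(X_valid, X_f))     # False: the conversion made a copy
```

With such a step in place, downstream code can assume C-contiguity, at the price of one copy whenever the caller passes an F-ordered array.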

@jeremiedbb
Member

jeremiedbb commented Jul 25, 2022

I removed the "no changelog needed" tag because I think this change does need a changelog entry.

@jjerphan
Member Author

With this approach, an estimator that doesn't convert to a C-contiguous array will fall back to the non-optimized path on Fortran arrays. I think we should instead do the conversion at _validate_data time in the estimator, setting order="C". This way we always use the optimized code, and it's the role of the estimator to convert the data appropriately.

Yes indeed, I think this fix is independent of changing the validation steps and can be addressed in other PRs. What do you think? Should we instead address all the changes in this PR?

@jeremiedbb
Member

Yes indeed, I think this fix is independent of changing the validation steps and can be addressed in other PRs. What do you think? Should we instead address all the changes in this PR?

I'm OK with merging this fix first so as not to delay a bug-fix release. But once all concerned estimators convert appropriately, this check will always be True. I think you can add a comment to remove this check once all estimators provide C-contiguous arrays.
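
The kind of check discussed here can be sketched as a small dispatcher. All names below (`is_usable_for`, `_argmin_fast`, `_argmin_generic`, `pairwise_argmin`) are hypothetical and chosen for illustration; both backends are plain NumPy stand-ins, not the actual scikit-learn implementations.

```python
import numpy as np

def is_usable_for(X, Y):
    """Hypothetical check: the fast path needs float64, C-contiguous inputs."""
    return all(
        isinstance(A, np.ndarray)
        and A.dtype == np.float64
        and A.flags.c_contiguous
        for A in (X, Y)
    )

def _argmin_fast(X, Y):
    # Stand-in for the optimized backend; assumes C-contiguous inputs.
    d2 = (X**2).sum(axis=1)[:, None] - 2.0 * X @ Y.T + (Y**2).sum(axis=1)[None, :]
    return d2.argmin(axis=1)

def _argmin_generic(X, Y):
    # Stand-in for the previous backend; works with any memory layout.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def pairwise_argmin(X, Y):
    # This check could be removed once every caller provides C-contiguous arrays.
    if is_usable_for(X, Y):
        return _argmin_fast(X, Y)
    return _argmin_generic(X, Y)

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
Y = np.array([[0.0, 0.1], [4.0, 4.0], [1.0, 1.2]])
X_f = np.asfortranarray(X)

print(is_usable_for(X, Y), is_usable_for(X_f, Y))  # True False
print(pairwise_argmin(X, Y), pairwise_argmin(X_f, Y))  # [0 2 1] [0 2 1]
```

Both layouts give the same result; only the code path differs, which is why an F-ordered input silently ends up on the slower route.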

@jjerphan jjerphan changed the title MAINT Make PairwiseDistancesReduction usable only for C-contiguous arrays FIX Make PairwiseDistancesReduction usable only for C-contiguous arrays Jul 27, 2022
@jjerphan jjerphan changed the title FIX Make PairwiseDistancesReduction usable only for C-contiguous arrays FIX Support F-contiguous arrays for PairwiseDistancesReductions-backed estimators Jul 27, 2022
@jjerphan
Member Author

Actually, I've done all the changes in this PR since many estimators are affected and since the changes are relatively straightforward.

A really simple common test has been added, and we might want to extend the tests for F-contiguous arrays in the future, as is done in such generic tests.

Member

@thomasjpfan thomasjpfan left a comment

Is there a trade-off in terms of memory between using the new pairwise backend and the old one? Specifically, the new backend requires C-ordered arrays, which may require a copy, while the old backend can work with any array.

@jjerphan
Member Author

jjerphan commented Jul 27, 2022

Is there a trade-off in terms of memory between using the new pairwise backend and the old one? Specifically, the new backend requires C-ordered arrays, which may require a copy, while the old backend can work with any array.

Yes, I think we should weigh that properly (#23988 (comment)).

If the provided F-contiguous array is large, this fix can make users' workflows impracticable as a copy of the array needs to be made.
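
To make the memory cost concrete, here is a small NumPy sketch (array shape chosen arbitrarily for the example) showing that converting an F-contiguous array to C order allocates a second buffer of the same size, while an array that is already C-contiguous is passed through without a copy.

```python
import numpy as np

X_f = np.asfortranarray(np.zeros((1000, 50)))  # F-ordered float64 input
X_c = np.ascontiguousarray(X_f)                # C-ordered version

# The converted array lives in its own buffer: peak memory roughly doubles.
print(np.shares_memory(X_f, X_c))  # False
print(X_f.nbytes, X_c.nbytes)      # 400000 400000

# An already C-contiguous array is not copied again.
X_again = np.ascontiguousarray(X_c)
print(np.shares_memory(X_c, X_again))  # True
```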

An alternative is to maintain F-contiguity (hence falling back to the old back-end in this case) but warn users that providing C-contiguous arrays allows getting the best performance, and document it properly somewhere.

(What's a bit sad is that `pandas.DataFrame.values` arrays generally are F-contiguous.)

If needed, I am fine reverting the last changes in this PR to only properly fall back on the old back-end for F-contiguous arrays before we choose which solution to pick.

What are your thoughts and preferences?

@thomasjpfan
Member

thomasjpfan commented Jul 27, 2022

I answered your question in #23988 (comment). TL;DR:

If we want to quickly fix this bug, I prefer to revert to using the old backend for F-contiguous arrays. In this case, there is no extra memory consumption compared to 1.0, but we end up using the slower backend.

@jjerphan
Member Author

OK, so let's revert this PR to this simpler fix and potentially make arrays C-contiguous after cost analysis.

@jjerphan jjerphan added the Quick Review For PRs that are quick to review label Jul 29, 2022
Member

@thomasjpfan thomasjpfan left a comment

LGTM

Member

@glemaitre glemaitre left a comment

LGTM

@glemaitre glemaitre merged commit 8c83e61 into scikit-learn:main Aug 1, 2022
@glemaitre
Member

Thanks @jjerphan

@jjerphan jjerphan deleted the maint/pdr-c-contiguous-only branch August 1, 2022 11:28
@jjerphan
Member Author

jjerphan commented Aug 1, 2022

You're welcome, @glemaitre.

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022
glemaitre pushed a commit that referenced this pull request Aug 5, 2022
Successfully merging this pull request may close these issues.

  • ndarray is not C-contiguous error, when using KNeighborsRegressor
  • birch clustering does not work with version 1.1.1
4 participants