
ENH expose n_oversamples in PCA when using solver="randomized" #21109


Merged (66 commits, Nov 9, 2021)

Conversation

x-shadow-man
Contributor

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Fixes #20589: PCA returns highly inaccurate results when number of features is large

Any other comments?


The n_oversamples parameter of sklearn.utils.extmath.randomized_svd defaults to 10, and the calling function sklearn.decomposition.PCA.fit_transform provides no way to override it. So when the number of input features is much larger than 10, the SVD result can differ substantially from the exact decomposition, producing a large error.
To solve this problem, I provided a sampling ratio (n_oversamples_rate) as an input parameter, so the user can tune the acceptable error range to the needs of the project. With n_oversamples_rate=1, the SVD result matches scipy.linalg.svd and scipy.sparse.linalg.svds.
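To illustrate the effect described above, the sketch below (my illustration, not code from this PR) compares randomized_svd against the exact SVD on a matrix whose feature count dwarfs the requested rank; the matrix shape, oversampling values, and seed are arbitrary choices:

```python
import numpy as np
from scipy.linalg import svd
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)
# 500 features for a rank-10 approximation: the regime from the issue.
X = rng.randn(100, 500)

# Exact leading singular values for reference.
s_exact = svd(X, compute_uv=False)[:10]

# Default oversampling (10) versus a sketch covering the full rank.
_, s_default, _ = randomized_svd(X, n_components=10, n_oversamples=10,
                                 random_state=0)
_, s_large, _ = randomized_svd(X, n_components=10, n_oversamples=90,
                               random_state=0)

err_default = np.abs(s_default - s_exact).max()
err_large = np.abs(s_large - s_exact).max()
print(err_default, err_large)  # the larger sketch is markedly more accurate
```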

Member

@glemaitre glemaitre left a comment

@jeremiedbb you probably want to give your advice regarding which API to use for this parameter.

@glemaitre glemaitre changed the title Fix bug:PCA returns highly inaccurate results when number of features is large #20589 ENH expose n_oversamples in PCA when using solver="randomized" Sep 24, 2021
@glemaitre glemaitre added this to the 1.0.1 milestone Sep 24, 2021
@jeremiedbb
Member

@jeremiedbb you probably want to give your advice regarding which API to use for this parameter.

I'd keep the API of randomized_svd, i.e. an integer. I agree that it might be interesting to have it be a float representing a ratio instead of an absolute value, but then we should discuss changing it in randomized_svd directly.

@ChenBinfighting1

@jeremiedbb you probably want to give your advice regarding which API to use for this parameter.

I'd keep the API of randomized_svd, i.e. an integer. I agree that it might be interesting to have it be a float representing a ratio instead of an absolute value, but then we should discuss changing it in randomized_svd directly.

First, thank you for your reply.

  1. An integer is simpler, because we immediately know the dimension of the sketch, but a float may be more convenient in practice, because it lets us specify how much of the feature space to retain. Sometimes we do not care about the exact dimension; we care more about the feature retention. The two have complementary advantages and disadvantages, so we could keep both and let users choose according to their needs.
  2. However, whichever design is chosen, we have to expose this parameter to users so they can set it themselves.

Finally, because I am very interested in machine learning, I hope this work can help improve the method.
Thanks.
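If the float-ratio design were adopted, it could be layered on top of the existing integer API. The helper below is hypothetical (n_oversamples_from_rate is my illustration of the discussion above, not code from this PR); it maps a ratio of features to an integer oversampling count, so that a rate of 1.0 makes the sketch cover the full feature space:

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

def n_oversamples_from_rate(n_features, n_components, rate):
    """Hypothetical: map a sampling ratio to randomized_svd's integer API."""
    # Oversample so the total sketch size (n_components + n_oversamples)
    # covers `rate` of the features, never below the default of 10.
    return max(10, int(rate * n_features) - n_components)

rng = np.random.RandomState(0)
X = rng.randn(100, 500)

n_over = n_oversamples_from_rate(n_features=X.shape[1], n_components=10,
                                 rate=0.5)
U, s, Vt = randomized_svd(X, n_components=10, n_oversamples=n_over,
                          random_state=0)
print(n_over, s.shape)  # 240 (10,)
```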

Member

@jjerphan jjerphan left a comment

Just some tiny suggestions.

@jjerphan
Member

jjerphan commented Nov 8, 2021

@x-shadow-man: the CI fails because some changes made to the code have not been properly formatted.

Setting up pre-commit (see the optional step 9 of this section) will let you work more easily and will prevent problems in your code and in the jobs run on the CI. 🙂

Member

@jjerphan jjerphan left a comment

Some last tiny suggestions, and after fixing the code formatting issues, this will LGTM.

Member

@jjerphan jjerphan left a comment

LGTM. Thank you, @x-shadow-man.

Member

@jeremiedbb jeremiedbb left a comment

Just a few comments, otherwise LGTM. Thanks @x-shadow-man.

x-shadow-man added 2 commits November 8, 2021 19:36
@glemaitre glemaitre changed the title FIX expose n_oversamples in PCA when using solver="randomized" ENH expose n_oversamples in PCA when using solver="randomized" Nov 8, 2021
@glemaitre
Member

You should check that my suggestions work with black.

@jjerphan jjerphan requested a review from glemaitre November 9, 2021 07:38
@glemaitre glemaitre merged commit df2f0d0 into scikit-learn:main Nov 9, 2021
@glemaitre
Member

Thanks @x-shadow-man. Merging.

@ChenBinfighting1

Thanks for your help; I am also happy to have helped solve this problem!

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Nov 22, 2021
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
6 participants