
ENH expose n_oversamples in PCA when using solver="randomized" #21109


Merged (66 commits, Nov 9, 2021)

Conversation

x-shadow-man
Contributor

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Fixes #20589: PCA returns highly inaccurate results when number of features is large

Any other comments?


The n_oversamples parameter of sklearn.utils.extmath.randomized_svd defaults to 10, and the calling function sklearn.decomposition.PCA.fit_transform provides no way to override it. So when the number of input features is much larger than 10, the SVD result can differ substantially from the exact decomposition, producing a large error.
To solve this problem, I provided a sampling ratio (n_oversamples_rate) as an input parameter, so the user can tune the acceptable error range to the needs of the project. With n_oversamples_rate=1, the SVD result matches scipy.linalg.svd and scipy.sparse.linalg.svds.
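To illustrate the effect described above, the sketch below (my illustration, not code from this PR) compares randomized_svd against the exact SVD on a matrix whose feature count dwarfs the requested rank; the matrix shape, oversampling values, and seed are arbitrary choices:

```python
import numpy as np
from scipy.linalg import svd
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)
# 500 features for a rank-10 approximation: the regime from the issue.
X = rng.randn(100, 500)

# Exact leading singular values for reference.
s_exact = svd(X, compute_uv=False)[:10]

# Default oversampling (10) versus a sketch covering the full rank.
_, s_default, _ = randomized_svd(X, n_components=10, n_oversamples=10,
                                 random_state=0)
_, s_large, _ = randomized_svd(X, n_components=10, n_oversamples=90,
                               random_state=0)

err_default = np.abs(s_default - s_exact).max()
err_large = np.abs(s_large - s_exact).max()
print(err_default, err_large)  # the larger sketch is markedly more accurate
```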

Member

@glemaitre glemaitre left a comment

@jeremiedbb you probably want to give your advice regarding which API to use for this parameter.

@glemaitre glemaitre changed the title Fix bug:PCA returns highly inaccurate results when number of features is large #20589 ENH expose n_oversamples in PCA when using solver="randomized" Sep 24, 2021
@glemaitre glemaitre added this to the 1.0.1 milestone Sep 24, 2021
@jeremiedbb
Member

@jeremiedbb you probably want to give your advice regarding which API to use for this parameter.

I'd keep the API of randomized_svd, i.e. an integer. I agree that it might be interesting to have it be a float representing a ratio instead of an absolute value, but then we should discuss changing it in randomized_svd directly.

@ChenBinfighting1

@jeremiedbb you probably want to give your advice regarding which API to use for this parameter.

I'd keep the API of randomized_svd, i.e. an integer. I agree that it might be interesting to have it be a float representing a ratio instead of an absolute value, but then we should discuss changing it in randomized_svd directly.

First, thank you for your reply.

  1. An integer is simpler, because we immediately know the dimension of the sketch, but a float may be more convenient in practice, because it lets us specify how much of the feature space to retain. Sometimes we do not care about the exact dimension; we care more about the feature retention. The two have complementary advantages and disadvantages, so we could keep both and let users choose according to their needs.
  2. However, whichever design is chosen, we have to expose this parameter to users so they can set it themselves.

Finally, because I am very interested in machine learning, I hope this work can help improve the method.
Thanks.
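If the float-ratio design were adopted, it could be layered on top of the existing integer API. The helper below is hypothetical (n_oversamples_from_rate is my illustration of the discussion above, not code from this PR); it maps a ratio of features to an integer oversampling count, so that a rate of 1.0 makes the sketch cover the full feature space:

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

def n_oversamples_from_rate(n_features, n_components, rate):
    """Hypothetical: map a sampling ratio to randomized_svd's integer API."""
    # Oversample so the total sketch size (n_components + n_oversamples)
    # covers `rate` of the features, never below the default of 10.
    return max(10, int(rate * n_features) - n_components)

rng = np.random.RandomState(0)
X = rng.randn(100, 500)

n_over = n_oversamples_from_rate(n_features=X.shape[1], n_components=10,
                                 rate=0.5)
U, s, Vt = randomized_svd(X, n_components=10, n_oversamples=n_over,
                          random_state=0)
print(n_over, s.shape)  # 240 (10,)
```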

Member

@jjerphan jjerphan left a comment

Just some tiny suggestions.

@jjerphan
Member

jjerphan commented Nov 8, 2021

@x-shadow-man: the CI fails because some changes made to the code have not been properly formatted.

Setting up pre-commit (see the optional step 9 of this section) will let you work more easily and will prevent problems in your code and in the jobs run on the CI. 🙂

Member

@jjerphan jjerphan left a comment

Some last tiny suggestions, and after fixing the code formatting issues, this will LGTM.

Member

@jjerphan jjerphan left a comment

LGTM. Thank you, @x-shadow-man.

Member

@jeremiedbb jeremiedbb left a comment

Just a few comments, otherwise LGTM. Thanks @x-shadow-man.

x-shadow-man added 2 commits November 8, 2021 19:36
@glemaitre glemaitre changed the title FIX expose n_oversamples in PCA when using solver="randomized" ENH expose n_oversamples in PCA when using solver="randomized" Nov 8, 2021
@glemaitre
Member

You should check that my suggestions work with black.

@jjerphan jjerphan requested a review from glemaitre November 9, 2021 07:38
@glemaitre glemaitre merged commit df2f0d0 into scikit-learn:main Nov 9, 2021
@glemaitre
Member

Thanks @x-shadow-man. Merging.

@ChenBinfighting1

Thanks for your help; I am also happy to have helped solve this problem!

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Nov 22, 2021
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
6 participants