Remove unnecessary restriction on number of samples in IncrementalPCA #30224

ThomasGesseyJonesPX · 2024-11-05T13:56:17Z

Currently, when calling IncrementalPCA.partial_fit(), the number of samples in X always has to be greater than or equal to the number of PCA components because of this hardcoded check:

scikit-learn/sklearn/decomposition/_incremental_pca.py

Line 303 in 6e90391

elif not self.n_components <= n_samples:

However, this restriction only needs to apply to the first call to partial_fit(), which performs the initial SVD. Once this first call has established the initial $U$ and $\Sigma$, none of the algorithm's steps (see the original paper: https://www.cs.toronto.edu/~dross/ivt/RossLimLinYang_ijcv.pdf) place any size restriction on the subsequent batches passed to partial_fit(). The current check hence unnecessarily restricts the usage of IncrementalPCA, in particular in online applications where data might arrive in batches of different sizes.
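To illustrate why later batches need no minimum size, here is a minimal NumPy sketch of the update as I understand it. It is not the scikit-learn implementation, and the helper names `first_fit` and `update` are hypothetical. After the first call, the previous factorization contributes one row per retained component, so the matrix passed to the SVD has enough rows whatever the batch size.

```python
import numpy as np


def first_fit(X, n_components):
    """Initial step: SVD of the centered first batch (needs >= n_components rows)."""
    n_samples = X.shape[0]
    if n_samples < n_components:
        raise ValueError(
            f"first batch has {n_samples} samples, need at least {n_components}"
        )
    mean = X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return Vt[:n_components], S[:n_components], mean, n_samples


def update(components, singular_values, mean, n_seen, X):
    """Later step: fold a batch of any size (even a single row) into the state."""
    n_batch = X.shape[0]
    n_total = n_seen + n_batch
    batch_mean = X.mean(axis=0)
    # Extra row accounting for the shift between the old mean and the batch mean.
    correction = np.sqrt(n_seen * n_batch / n_total) * (mean - batch_mean)
    stacked = np.vstack(
        [
            singular_values[:, None] * components,  # k rows from the previous state
            X - batch_mean,  # n_batch rows
            correction[None, :],  # 1 row
        ]
    )
    # stacked has k + n_batch + 1 rows, so (given n_features >= k) the SVD
    # always yields at least k components, independent of n_batch.
    _, S, Vt = np.linalg.svd(stacked, full_matrices=False)
    k = components.shape[0]
    new_mean = (n_seen * mean + n_batch * batch_mean) / n_total
    return Vt[:k], S[:k], new_mean, n_total
```

In this sketch only `first_fit` needs a size check; `update` accepts batches smaller than the number of components, which is the situation this PR is meant to allow.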

This PR updates the aforementioned check so that it applies only to the first call to partial_fit() and adds a non-regression test for this issue.
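As a rough sketch of the intended behavior (not the actual diff), the check can be thought of as the hypothetical helper below. The name `check_batch_size`, the `first_call` flag, and the error message wording are assumptions made for illustration only.

```python
def check_batch_size(n_components, n_samples, first_call):
    """Raise only when the very first batch is smaller than n_components.

    Hypothetical helper: on later calls any batch size is accepted.
    """
    if first_call and n_components > n_samples:
        raise ValueError(
            f"n_components={n_components} must be less than or equal to the "
            f"number of samples in the first batch ({n_samples})."
        )
```

With this logic, `check_batch_size(5, 2, first_call=False)` passes silently, while the same sizes with `first_call=True` raise a `ValueError`.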

github-actions · 2024-11-05T13:57:30Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c752837. Link to the linter CI: here

ogrisel

Thanks for the PR. Some minor feedback below, but otherwise LGTM.

We probably need better tests (e.g. on non-spherical data) for this estimator, and we could also clarify what the expectations are in terms of accuracy between the full-batch and incremental versions of PCA, but this is beyond the scope of this PR.

sklearn/decomposition/tests/test_incremental_pca.py

sklearn/decomposition/_incremental_pca.py

ThomasGesseyJonesPX · 2024-11-05T18:12:38Z

Thanks for the feedback! I have implemented all of your suggestions.

github-actions bot added the module:decomposition label Nov 5, 2024

ogrisel approved these changes Nov 5, 2024

ThomasGesseyJonesPX added 8 commits November 7, 2024 09:47

Enforce minimum batch size only on first partial_fit call in Incremen…

1490969

…talPCA

Add non-regression test

ae39244

Linting

55ac444

Changelog

66c4a22

Correct changelog file name

b52e8f7

Use assert_allclose instead of assert_almost_equal

a96dd38

Use f strings in error messages

5996d43

Correct grammar in comment

c752837

ThomasGesseyJonesPX force-pushed the correct_errors_in_incremental_pca branch from d2d4f27 to c752837 Compare November 7, 2024 09:47

adrinjalali approved these changes Nov 7, 2024

adrinjalali merged commit 94f8875 into scikit-learn:main Nov 7, 2024
30 checks passed

ThomasGesseyJonesPX deleted the correct_errors_in_incremental_pca branch November 7, 2024 13:35
