Remove unnecessary restriction on number of samples in IncrementalPCA #30224
Thanks for the PR. Some minor feedback below, but otherwise LGTM.
We probably need better tests (e.g. on non-spherical data) for this estimator, and we could also clarify what accuracy to expect from the incremental version of PCA relative to the full-batch one, but this is beyond the scope of this PR.
Thanks for the feedback! I have implemented all of your suggestions.
Currently, when calling `IncrementalPCA.partial_fit()`, the number of samples in `X` always has to be greater than or equal to the number of PCA components, due to the hardcoded check at scikit-learn/sklearn/decomposition/_incremental_pca.py, line 303 in 6e90391.

However, this restriction only needs to apply to the first call to `partial_fit()`, which performs the initial SVD. Once this first call has established the initial $U$ and $\Sigma$, none of the algorithm's steps (see the original paper, https://www.cs.toronto.edu/~dross/ivt/RossLimLinYang_ijcv.pdf) require any restriction on the number of samples in the subsequent batches passed to `partial_fit`. The current error hence unnecessarily restricts the usage of IncrementalPCA, in particular in online applications where data might arrive in batches of different sizes.

This PR updates the aforementioned check to apply only to the first call of `partial_fit` and adds a non-regression test for this issue.
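
To illustrate why only the first batch needs at least as many samples as components, here is a minimal NumPy sketch of a stacked-SVD update in the spirit of the paper linked above. It is not the scikit-learn implementation: the function name `incremental_svd_update` and the `(mean, n_seen, S, Vt)` state tuple are hypothetical and chosen for this example only. After the first call, the previous `S * Vt` already contributes `n_components` rows to the matrix being decomposed, so a later batch may contain a single sample.

```python
import numpy as np


def incremental_svd_update(state, X_batch, n_components):
    """Fold one batch into a running (mean, n_seen, S, Vt) state (hypothetical helper)."""
    X_batch = np.asarray(X_batch, dtype=float)
    n_new = X_batch.shape[0]
    batch_mean = X_batch.mean(axis=0)

    if state is None:
        # First call: there is no previous basis to stack on, so the batch
        # alone must supply at least n_components rows.
        if n_new < n_components:
            raise ValueError("first batch needs n_samples >= n_components")
        mean, n_total = batch_mean, n_new
        stacked = X_batch - batch_mean
    else:
        old_mean, n_seen, S, Vt = state
        n_total = n_seen + n_new
        mean = (n_seen * old_mean + n_new * batch_mean) / n_total
        # Extra row accounting for the shift between the old and batch means.
        correction = np.sqrt(n_seen * n_new / n_total) * (old_mean - batch_mean)
        # S * Vt has n_components rows on its own, so the stacked matrix has
        # enough rows for the truncated SVD whatever the batch size is.
        stacked = np.vstack([S[:, None] * Vt, X_batch - batch_mean, correction])

    _, S_new, Vt_new = np.linalg.svd(stacked, full_matrices=False)
    return mean, n_total, S_new[:n_components], Vt_new[:n_components]


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
n_components = 3

state = None
# Batch sizes 5, 1 and 2: only the first one is >= n_components.
for batch in (X[:5], X[5:6], X[6:8]):
    state = incremental_svd_update(state, batch, n_components)

mean, n_seen, S, Vt = state
```

With `n_components` equal to the number of features nothing is truncated between calls, so the singular values in `S` should agree with those of the centered full data up to floating-point error. With fewer components the incremental result is only an approximation of full-batch PCA.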