[MRG] Adding explained variances to sparse pca #16255
Conversation
@jeremiedbb I just added the test discussed in the original PR.
Thanks for this PR, a few comments below.
this looks correct indeed. If we can test against the R implementation, that would be perfect (just daydreaming...)
@agramfort I just pushed in 28da58b the adjusted explained variance we discussed yesterday. I took the liberty of rewriting various parts of the original PR. https://gist.github.com/Batalex/6a116a5754ccb42ceca5313f62380e99 The results are pretty close (identical down to 4 digits); the difference first appears during fitting.

Edit: it seems like this new implementation does not like 2D arrays with a single value.
did you check using the R code that the values are correct?
# Variance in the original dataset
cov_mat = np.cov(X.T)
total_variance_in_x = np.trace(cov_mat) if cov_mat.size >= 2 \
    else cov_mat
how is it possible that cov_mat is not 2d here?
It happens in a test; I will give it a closer look.
Ok, the failing tests are test_estimators[SparsePCA()-check_fit2d_1feature]
and test_estimators[MiniBatchPCA()-check_fit2d_1feature].
With only one feature, the covariance matrix is in fact a scalar:
>>> np.cov([[1,1]])
array(0.)
>>> np.cov([[1, 1], [1, 1]])
array([[0., 0.],
[0., 0.]])
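A minimal sketch (not the code in this PR) of one way to guard against that single-feature case, using np.atleast_2d so np.trace works whether np.cov returns a scalar or a matrix:

import numpy as np

def total_variance(X):
    # For a single feature, np.cov returns a 0-d array; np.atleast_2d turns it
    # into a 1x1 matrix so that np.trace works in both cases.
    cov_mat = np.atleast_2d(np.cov(X.T))
    return np.trace(cov_mat)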
Thanks for the reminder @cmarmo! Code-wise this LGTM, and as far as I can tell all comments were addressed. However, I'm not too familiar with how the explained variance should be computed for sparse PCA, nor am I available at present to check the provided references in detail. Maybe @glemaitre would be able to have a look, since it's related to other PRs you worked on, I think? (We also need to move the what's new entry to 0.24.)
Hi @Batalex, thanks for your patience! Do you mind fixing the small conflict?
Good work! Just 2 questions:

- How expensive is the computation of the explained variance? Depending on that, would it be better to make it optional?
- What do you think about a test where explained_variance_ratio_ must be 1 when setting n_components=X.shape[1] and alpha tiny? (See the sketch after this list.)
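A rough sketch of the suggested test, assuming the explained_variance_ratio_ attribute this PR would add (the exact parameters and tolerance are guesses, not the test that ended up in the branch):

import numpy as np
from sklearn.decomposition import SparsePCA

def test_explained_variance_ratio_sums_to_one():
    rng = np.random.RandomState(0)
    X = rng.randn(50, 5)
    # With as many components as features and an almost-zero sparsity penalty,
    # the fitted components should capture (nearly) all of the variance in X.
    # Note: explained_variance_ratio_ only exists on this PR's branch.
    spca = SparsePCA(n_components=X.shape[1], alpha=1e-8, random_state=0).fit(X)
    assert np.isclose(spca.explained_variance_ratio_.sum(), 1.0, atol=1e-2)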
Parameters
----------
X : ndarray of shape (n_samples, n_features)
Does X need to be mean centered?
I would say no, because neither ridge regression nor covariance needs centered data.
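A quick illustrative check (not from the PR) that np.cov centers the data internally, so the total-variance computation above gives the same result whether or not X is mean centered beforehand:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(30, 4) + 5.0                 # deliberately not centered
X_centered = X - X.mean(axis=0)

# np.cov subtracts the column means itself, so both covariance matrices match.
print(np.allclose(np.cov(X.T), np.cov(X_centered.T)))  # True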
Thank you kindly for your review! On the performance side, with the following setup

import numpy as np
from sklearn.decomposition import SparsePCA

arr = np.random.random((1000, 50))

def main():
    SparsePCA().fit(arr)

I get

# Before
1 loop, best of 5: 1.23 sec per loop
# After
1 loop, best of 5: 1.26 sec per loop

IMO it can remain as it is.
> IMO it can remain as it is

I agree
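For anyone wanting to reproduce the micro-benchmark above: the "1 loop, best of 5" output format suggests the standard timeit tooling, though the author's exact command is not shown in the thread. A hedged way to re-run such a measurement:

import timeit
import numpy as np
from sklearn.decomposition import SparsePCA

arr = np.random.random((1000, 50))

def main():
    SparsePCA().fit(arr)

# Mimics `python -m timeit`: 5 repeats of a single loop, keep the best time.
best = min(timeit.repeat(main, number=1, repeat=5))
print(f"1 loop, best of 5: {best:.2f} sec per loop")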
Hello @Batalex, me again! :D
Gently pinging @GaelVaroquaux or @glemaitre for a second approval? Thanks!
I tried to add another test but it does not seem to work as expected. The observed discrepancy seems larger than just a rounding error:

>>> from sklearn.decomposition import SparsePCA
>>> import numpy as np
>>> X = np.random.randn(100, 20)
>>> spca = SparsePCA(n_components=3).fit(X)
>>> np.var(spca.transform(X), axis=0)
array([1.78457351, 1.69101422, 1.61273961])
>>> spca.explained_variance_
array([1.8025995 , 1.7071744 , 1.58828028])

Changing the number of degrees of freedom does not explain the discrepancy either:

>>> np.var(spca.transform(X), axis=0, ddof=1)
array([1.8025995 , 1.70809517, 1.62902991])

I also wanted to check the explained variance ratio, but since the raw variances are off, the ratio is too:

>>> np.var(spca.transform(X), axis=0) / np.var(X, axis=0).sum()
array([0.08786529, 0.08325881, 0.07940488])
>>> spca.explained_variance_ratio_
array([0.08786529, 0.08321393, 0.07741859])

@Batalex am I doing something wrong? Isn't this the right definition of "explained variance"?
Hum, I might have been mistaken: the definition of the component-wise explained variance above is probably wrong. Looking at the definition on the Sparse PCA Wikipedia page, I get the following values:

>>> from sklearn.decomposition import SparsePCA
>>> import numpy as np
>>> X = np.random.randn(100, 20)
>>> spca = SparsePCA(n_components=3).fit(X)
>>> X_c = X - X.mean(axis=0)
>>> ddof = 0
>>> C = X_c.T @ X_c / (X_c.shape[0] - ddof)
>>> ((spca.components_ @ C) * spca.components_).sum(axis=1)
array([1.75304619, 1.65854766, 1.48179409])

For this random data, this does not match the computed explained variances either:

>>> spca.explained_variance_
array([1.73544722, 1.61770256, 1.4562664 ])

Note: I have also tried with ddof=1; it does not match either.

Maybe this is because the scikit-learn Sparse PCA model is actually not the true Sparse PCA model (see #13127 (comment)), as the orthogonality constraint is not enforced in scikit-learn (as it should be):

>>> spca.components_[0].T @ spca.components_[1]
0.0054753074411247205
Thank you kindly for your help on this PR. I have been going back and forth between the original paper and #13127 since your first message, and my conclusions are the same as yours. Does a component-wise equality between the adjusted explained variance and the projected data variance need to be enforced here, or is the small reconstruction error enough to submit this work?
I think the priority would be to implement the true Sparse PCA model (with the orthogonality constraints) from the original paper by Zou et al. instead. It's weird to call something PCA without orthogonality.
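For context, the "adjusted" explained variance discussed above usually refers to the QR-based definition from Zou, Hastie & Tibshirani (2006), which removes the variance already accounted for by earlier, possibly correlated components. A rough sketch of that definition (not the code in this PR; the ddof choice is an assumption):

import numpy as np

def adjusted_explained_variance(X, components, ddof=1):
    # Scores on the (possibly non-orthogonal) sparse components.
    X_c = X - X.mean(axis=0)
    T = X_c @ components.T
    # QR decomposition of the scores: the squared diagonal of R gives the
    # variance of each component after removing what the previous components
    # already explain (the "adjusted total variance" of Zou et al., 2006).
    _, R = np.linalg.qr(T)
    return np.diag(R) ** 2 / (X.shape[0] - ddof)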
Given that scikit-learn does not enforce orthogonality between the components, the concept of component-wise explained variance is weird, because two components can share explained variance. That is, the sum of the component-wise explained variances will be higher than 100% of the input data variance. I therefore think we should close this PR. I will update #11512. I am very sorry for the confusion @Batalex.
No problem @ogrisel, I would rather see this PR go with a bang than stay stale.
You're really a good sport @Batalex! Thanks a lot for pushing this.
Sorry it took us so long to realize the fundamental issue.
Reference Issues/PRs
Continuation of #11527.
What does this implement/fix? Explain your changes.
This PR proposes a new implementation for the computation of the explained variance for the Sparse PCA.
As such, we add two more attributes to the SparsePCA & MiniBatchSparsePCA classes: explained_variance_ and explained_variance_ratio_.

Any other comments?
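A hypothetical usage sketch of the attributes this PR proposed (the PR was ultimately closed, so these attributes are not part of released scikit-learn):

import numpy as np
from sklearn.decomposition import SparsePCA

X = np.random.randn(100, 20)
spca = SparsePCA(n_components=3).fit(X)

# Attributes below exist only on this PR's branch, not in released scikit-learn.
print(spca.explained_variance_)        # variance captured by each sparse component
print(spca.explained_variance_ratio_)  # same, as a fraction of the total variance of X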