
[MRG] Adding explained variances to sparse pca #16255


Closed · Batalex wants to merge 39 commits

Conversation

@Batalex (Contributor) commented Jan 28, 2020

Reference Issues/PRs

Continuation of #11527.

What does this implement/fix? Explain your changes.

This PR proposes a new implementation for computing the explained variance of sparse PCA.
To that end, we add two attributes to the SparsePCA and MiniBatchSparsePCA classes: explained_variance_ and explained_variance_ratio_.
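A minimal sketch of the intended usage (the two attributes are the ones proposed by this PR; since the PR was ultimately closed, they are not part of released scikit-learn):

import numpy as np
from sklearn.decomposition import SparsePCA

X = np.random.RandomState(0).randn(100, 20)
spca = SparsePCA(n_components=3, random_state=0).fit(X)

# Attributes proposed by this PR:
print(spca.explained_variance_)        # variance captured by each component
print(spca.explained_variance_ratio_)  # same, as a fraction of the total variance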

Any other comments?

@Batalex (Author) commented Jan 28, 2020

@jeremiedbb I just added the test discussed in the original PR.

@rth (Member) requested changes Jan 29, 2020

Thanks for this PR, a few comments below.

@Batalex Batalex changed the title [MRG] Adding explained variances to sparse pca [WIP] Adding explained variances to sparse pca Jan 29, 2020
@agramfort (Member) commented Jan 29, 2020 via email

@Batalex (Author) commented Jan 30, 2020

> this looks correct indeed. If we can test against the R implementation, that would be perfect (just daydreaming...)

@agramfort I just pushed the adjusted explained variance we discussed yesterday in 28da58b. I took the liberty of rewriting various parts of the original PR.
Moreover, here are two gists: the reference implementation in R of the explained variance on the Boston housing dataset, and the new one in sklearn:

https://gist.github.com/Batalex/6a116a5754ccb42ceca5313f62380e99
https://gist.github.com/Batalex/8288b6cc30125db1cfbc4d1c6031cd18

The results are pretty close (identical down to 4 digits). The difference first appears during fitting.

Edit: it seems this new implementation does not handle a 2D array with a single value.

@Batalex Batalex changed the title [WIP] Adding explained variances to sparse pca [MRG] Adding explained variances to sparse pca Jan 30, 2020
@agramfort (Member) left a comment

Did you check, using the R code, that the values are correct?

# Variance in the original dataset
cov_mat = np.cov(X.T)
total_variance_in_x = (np.trace(cov_mat) if cov_mat.size >= 2
                       else cov_mat)
Member: How is it possible that cov_mat is not 2d here?

Author: It happens in a test; I will give it a closer look.

Author: OK, the failing tests are test_estimators[SparsePCA()-check_fit2d_1feature] and test_estimators[MiniBatchSparsePCA()-check_fit2d_1feature].

With only one feature, np.cov in fact returns a scalar (a 0-d array):

>>> np.cov([[1,1]])
array(0.)
>>> np.cov([[1, 1], [1, 1]])
array([[0., 0.],
       [0., 0.]])
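
One possible guard for this 0-d case (a minimal sketch, not necessarily the fix adopted in this PR):

import numpy as np

def total_variance(X):
    # np.cov returns a 0-d array when X has a single feature;
    # np.atleast_2d promotes it so that np.trace works in both cases.
    cov_mat = np.atleast_2d(np.cov(X.T))
    return np.trace(cov_mat)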

@Batalex Batalex changed the title [MRG] Adding explained variances to sparse pca [WIP] Adding explained variances to sparse pca Feb 6, 2020
@rth (Member) commented May 18, 2020

Thanks for the reminder @cmarmo! Code-wise this LGTM, and as far as I can tell all comments were addressed.

However, I'm not too familiar with how explained variance should be computed for sparse PCA, nor am I available at present to check the provided references in detail. Maybe @glemaitre would be able to have a look, since it's related to other PRs you worked on, I think?

(We also need to move the what's new entry to 0.24.)

@rth dismissed their stale review May 18, 2020 13:25

Comments were addressed, and code-wise LGTM.

@cmarmo (Contributor) commented Jun 12, 2020

Hi @Batalex, thanks for your patience! Do you mind fixing the small conflict?
@glemaitre or @GaelVaroquaux, a second approval? :) Thanks!

@lorentzenchr (Member) left a comment

Good work! Just 2 questions:

  • How expensive is the computation of the explained variance? Depending on that, would it be better to make it optional?
  • What do you think about a test where the explained_variance_ratio_ must be 1 when setting n_components=X.shape[1] and alpha tiny?
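
A hypothetical sketch of the test suggested in the second bullet (my own illustration, assuming the attributes proposed in this PR):

import numpy as np
from sklearn.decomposition import SparsePCA

def test_explained_variance_ratio_full_rank():
    rng = np.random.RandomState(0)
    X = rng.randn(50, 5)
    # With n_components == n_features and a tiny alpha, the fit should be
    # close to an ordinary PCA, so the ratios should account for ~all variance.
    spca = SparsePCA(n_components=X.shape[1], alpha=1e-8, random_state=0).fit(X)
    assert np.isclose(spca.explained_variance_ratio_.sum(), 1.0, atol=1e-2)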


Parameters
----------
X : ndarray of shape (n_samples, n_features)
Member: Does X need to be mean centered?

Author: I would say no, because neither ridge regression nor covariance needs centered data.

@Batalex (Author) commented Jun 28, 2020

@lorentzenchr

> Good work! Just 2 questions:
>
>   • How expensive is the computation of the explained variance? Depending on that, would it be better to make it optional?
>   • What do you think about a test where the explained_variance_ratio_ must be 1 when setting n_components=X.shape[1] and alpha tiny?

Thank you kindly for your review!

On the performance side, with the following setup:

import numpy as np
from sklearn.decomposition import SparsePCA

arr = np.random.random((1000, 50))

def main():
    SparsePCA().fit(arr)

I get

# Before
1 loop, best of 5: 1.23 sec per loop
# After
1 loop, best of 5: 1.26 sec per loop
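
(The timings are in timeit output format; an invocation along the lines of python -m timeit -s "from bench import main" "main()" would produce it, assuming the snippet above is saved as bench.py; the file name is hypothetical.)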

IMO it can remain as it is.

@GaelVaroquaux (Member) commented Jun 28, 2020 via email

@cmarmo (Contributor) commented Jul 6, 2020

Hello @Batalex, me again! :D
Do you mind re-triggering CircleCI by synchronizing with upstream? The failed check is independent of your changes. Thanks!

@cmarmo (Contributor) commented Jul 9, 2020

Gently pinging @GaelVaroquaux or @glemaitre for a second approval. Thanks!

@ogrisel (Member) commented Aug 5, 2020

I tried to add another test, but it does not seem to work as expected. The observed discrepancy seems larger than just a rounding error:

>>> from sklearn.decomposition import SparsePCA
>>> import numpy as np
>>> X = np.random.randn(100, 20)
>>> spca = SparsePCA(n_components=3).fit(X)
>>> np.var(spca.transform(X), axis=0)
array([1.78457351, 1.69101422, 1.61273961])
>>> spca.explained_variance_
array([1.8025995 , 1.7071744 , 1.58828028])

Changing the number of degrees of freedom does not explain the discrepancy either:

>>> np.var(spca.transform(X), axis=0, ddof=1)
array([1.8025995 , 1.70809517, 1.62902991])

I also wanted to check the explained variance ratio, but since the raw variances are off, the ratio is too:

>>> np.var(spca.transform(X), axis=0) / np.var(X, axis=0).sum()
array([0.08786529, 0.08325881, 0.07940488])
>>> spca.explained_variance_ratio_
array([0.08786529, 0.08321393, 0.07741859])

@Batalex am I doing something wrong? Isn't this the right definition of "explained variance"?

@ogrisel (Member) commented Aug 5, 2020

Hmm, I might have been mistaken; the definition of the component-wise explained variance above is probably wrong. Using the definition from the Sparse PCA Wikipedia page, I get the following values:

>>> from sklearn.decomposition import SparsePCA
>>> import numpy as np
>>> X = np.random.randn(100, 20)
>>> spca = SparsePCA(n_components=3).fit(X) 

>>> X_c = X - X.mean(axis=0)
>>> ddof = 0
>>> C = X_c.T @ X_c / (X_c.shape[0] - ddof)
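>>> # ((W @ C) * W).sum(axis=1) is diag(W @ C @ W.T): the variance of the
>>> # (centered) data projected onto each sparse component row w_i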
>>> ((spca.components_ @ C) * spca.components_).sum(axis=1)
array([1.75304619, 1.65854766, 1.48179409])

For this random data, this does not match the computed explained variances either:

>>> spca.explained_variance_
array([1.73544722, 1.61770256, 1.4562664 ])

Note: I have also tried with ddof=1, it does not match either.

Maybe this is because the scikit-learn Sparse PCA model is actually not the true Sparse PCA model (see #13127 (comment)), as the orthogonality constraint is not enforced in scikit-learn (as it should be):

>>> spca.components_[0].T @ spca.components_[1]
0.0054753074411247205

@Batalex (Author) commented Aug 5, 2020

@ogrisel

Thank you kindly for your help on this PR. I have been going back and forth between the original paper and #13127 since your first message, and my conclusions are the same as yours.

Does a component-wise equality between the adjusted explained variance and the projected data variance need to be enforced here, or is the small reconstruction error enough to submit this work?
Or should we try another method? https://rdrr.io/github/chavent/sparsePCA/man/explainedVar.html

@ogrisel (Member) commented Aug 5, 2020

I think the priority would be to implement the true Sparse PCA model (with the orthogonality constraints) from the original paper by Zou et al. instead.

It's weird to call something PCA without orthogonality.

@ogrisel (Member) commented Aug 11, 2020

Given that scikit-learn does not enforce orthogonality between the components, the concept of component-wise explained variance is weird, because two components can share explained variance. That is, the sum of the component-wise explained variances can exceed 100% of the input data variance.
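
A toy illustration of this double counting (my own sketch, not code from this PR):

import numpy as np

rng = np.random.RandomState(0)
# Anisotropic data: most of the variance lies along the first axis.
X = rng.randn(1000, 2) * np.array([3.0, 0.3])
X -= X.mean(axis=0)
C = X.T @ X / X.shape[0]

# Two nearly parallel (hence non-orthogonal) unit "components".
w1 = np.array([1.0, 0.0])
w2 = np.array([np.cos(0.1), np.sin(0.1)])

# Each captures almost all of the same variance, so the per-component
# variances sum to roughly twice the total variance of X.
print(w1 @ C @ w1 + w2 @ C @ w2)  # ~18
print(np.trace(C))                # ~9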

I therefore think we should close this PR. I will update #11512. I am very sorry for the confusion, @Batalex.

@Batalex (Author) commented Aug 11, 2020

No problem @ogrisel, I would rather see this PR go with a bang than stay stale.
See you on another one.

@Batalex Batalex closed this Aug 11, 2020
@GaelVaroquaux (Member) commented Aug 11, 2020 via email

@Batalex Batalex deleted the feat/sparse-pca-variance branch February 4, 2021 15:51