[MRG] Adding explained variances to sparse pca #16255
Conversation
@jeremiedbb I just added the test discussed in the original PR.
Thanks for this PR, a few comments below.
this looks correct indeed. If we can test against the R implementation, that would be perfect (just daydreaming...)
@agramfort I just pushed in 28da58b the adjusted explained variance we discussed yesterday. I took the liberty of rewriting various parts of the original PR. https://gist.github.com/Batalex/6a116a5754ccb42ceca5313f62380e99 The results are pretty close (identical down to 4 digits); the difference first appears during fitting.

Edit: it seems like this new implementation does not like 2D arrays with a single value.
did you check using the R code that the values are correct?
# Variance in the original dataset
cov_mat = np.cov(X.T)
total_variance_in_x = np.trace(cov_mat) if cov_mat.size >= 2 \
    else cov_mat
how is it possible that cov_mat is not 2d here?
It happens in a test; I will give it a closer look.
Ok, the failing tests are test_estimators[SparsePCA()-check_fit2d_1feature]
and test_estimators[MiniBatchPCA()-check_fit2d_1feature].
With only one feature, the covariance matrix is in fact a scalar:
>>> np.cov([[1,1]])
array(0.)
>>> np.cov([[1, 1], [1, 1]])
array([[0., 0.],
[0., 0.]])
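A minimal sketch (not the code in this PR) of one way to guard against that single-feature case, using np.atleast_2d so np.trace works whether np.cov returns a scalar or a matrix:

import numpy as np

def total_variance(X):
    # For a single feature, np.cov returns a 0-d array; np.atleast_2d turns it
    # into a 1x1 matrix so that np.trace works in both cases.
    cov_mat = np.atleast_2d(np.cov(X.T))
    return np.trace(cov_mat)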
Thanks for the reminder @cmarmo! Code-wise this LGTM, and as far as I can tell all comments were addressed. However, I'm not too familiar with how the explained variance should be computed for sparse PCA, nor am I available at present to check the provided references in detail. Maybe @glemaitre would be able to have a look, since it's related to other PRs you worked on, I think? (We also need to move the what's new entry to 0.24.)
Hi @Batalex, thanks for your patience! Do you mind fixing the small conflict?
Good work! Just 2 questions:

- How expensive is the computation of the explained variance? Depending on that, would it be better to make it optional?
- What do you think about a test where explained_variance_ratio_ must be 1 when setting n_components=X.shape[1] and alpha tiny? (See the sketch after this list.)
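A rough sketch of the suggested test, assuming the explained_variance_ratio_ attribute this PR would add (the exact parameters and tolerance are guesses, not the test that ended up in the branch):

import numpy as np
from sklearn.decomposition import SparsePCA

def test_explained_variance_ratio_sums_to_one():
    rng = np.random.RandomState(0)
    X = rng.randn(50, 5)
    # With as many components as features and an almost-zero sparsity penalty,
    # the fitted components should capture (nearly) all of the variance in X.
    # Note: explained_variance_ratio_ only exists on this PR's branch.
    spca = SparsePCA(n_components=X.shape[1], alpha=1e-8, random_state=0).fit(X)
    assert np.isclose(spca.explained_variance_ratio_.sum(), 1.0, atol=1e-2)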
Parameters
----------
X : ndarray of shape (n_samples, n_features)
Does X need to be mean centered?
I would say no, because neither ridge regression nor covariance needs centered data.
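A quick illustrative check (not from the PR) that np.cov centers the data internally, so the total-variance computation above gives the same result whether or not X is mean centered beforehand:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(30, 4) + 5.0                 # deliberately not centered
X_centered = X - X.mean(axis=0)

# np.cov subtracts the column means itself, so both covariance matrices match.
print(np.allclose(np.cov(X.T), np.cov(X_centered.T)))  # True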
Thank you kindly for your review! On the performance side, with the following setup

import numpy as np
from sklearn.decomposition import SparsePCA

arr = np.random.random((1000, 50))

def main():
    SparsePCA().fit(arr)

I get

# Before
1 loop, best of 5: 1.23 sec per loop
# After
1 loop, best of 5: 1.26 sec per loop

IMO it can remain as it is.
> IMO it can remain as it is

I agree
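For anyone wanting to reproduce the micro-benchmark above: the "1 loop, best of 5" output format suggests the standard timeit tooling, though the author's exact command is not shown in the thread. A hedged way to re-run such a measurement:

import timeit
import numpy as np
from sklearn.decomposition import SparsePCA

arr = np.random.random((1000, 50))

def main():
    SparsePCA().fit(arr)

# Mimics `python -m timeit`: 5 repeats of a single loop, keep the best time.
best = min(timeit.repeat(main, number=1, repeat=5))
print(f"1 loop, best of 5: {best:.2f} sec per loop")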
Hello @Batalex, me again! :D
Gently pinging @GaelVaroquaux or @glemaitre for a second approval? Thanks!
I tried to add another test but it does not seem to work as expected. The observed discrepancy seems larger than just a rounding error:

>>> from sklearn.decomposition import SparsePCA
>>> import numpy as np
>>> X = np.random.randn(100, 20)
>>> spca = SparsePCA(n_components=3).fit(X)
>>> np.var(spca.transform(X), axis=0)
array([1.78457351, 1.69101422, 1.61273961])
>>> spca.explained_variance_
array([1.8025995 , 1.7071744 , 1.58828028])

Changing the number of degrees of freedom does not explain the discrepancy either:

>>> np.var(spca.transform(X), axis=0, ddof=1)
array([1.8025995 , 1.70809517, 1.62902991])

I also wanted to check the explained variance ratio, but since the raw variances are off, the ratio is too:

>>> np.var(spca.transform(X), axis=0) / np.var(X, axis=0).sum()
array([0.08786529, 0.08325881, 0.07940488])
>>> spca.explained_variance_ratio_
array([0.08786529, 0.08321393, 0.07741859])

@Batalex am I doing something wrong? Isn't this the right definition of "explained variance"?
Hum, I might have been mistaken: the definition of the component-wise explained variance above is probably wrong. Looking at the definition on the Sparse PCA Wikipedia page, I get the following values:

>>> from sklearn.decomposition import SparsePCA
>>> import numpy as np
>>> X = np.random.randn(100, 20)
>>> spca = SparsePCA(n_components=3).fit(X)
>>> X_c = X - X.mean(axis=0)
>>> ddof = 0
>>> C = X_c.T @ X_c / (X_c.shape[0] - ddof)
>>> ((spca.components_ @ C) * spca.components_).sum(axis=1)
array([1.75304619, 1.65854766, 1.48179409])

For this random data, this does not match the computed explained variances either:

>>> spca.explained_variance_
array([1.73544722, 1.61770256, 1.4562664 ])

Note: I have also tried with ddof=1; it does not match either.

Maybe this is because the scikit-learn Sparse PCA model is actually not the true Sparse PCA model (see #13127 (comment)), as the orthogonality constraint is not enforced in scikit-learn (as it should be):

>>> spca.components_[0].T @ spca.components_[1]
0.0054753074411247205
Thank you kindly for your help on this PR. I have been going back and forth between the original paper and #13127 since your first message, and my conclusions are the same as yours. Does a component-wise equality between the adjusted explained variance and the projected data variance need to be enforced here, or is the small reconstruction error enough to submit this work?
I think the priority would be to implement the true Sparse PCA model (with the orthogonality constraints) from the original paper by Zou et al. instead. It's weird to call something PCA without orthogonality.
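For context, the "adjusted" explained variance discussed above usually refers to the QR-based definition from Zou, Hastie & Tibshirani (2006), which removes the variance already accounted for by earlier, possibly correlated components. A rough sketch of that definition (not the code in this PR; the ddof choice is an assumption):

import numpy as np

def adjusted_explained_variance(X, components, ddof=1):
    # Scores on the (possibly non-orthogonal) sparse components.
    X_c = X - X.mean(axis=0)
    T = X_c @ components.T
    # QR decomposition of the scores: the squared diagonal of R gives the
    # variance of each component after removing what the previous components
    # already explain (the "adjusted total variance" of Zou et al., 2006).
    _, R = np.linalg.qr(T)
    return np.diag(R) ** 2 / (X.shape[0] - ddof)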
Given that scikit-learn does not enforce orthogonality between the components, the concept of component-wise explained variance is weird, because two components can share explained variance. That is, the sum of the component-wise explained variances will be higher than 100% of the input data variance. I therefore think we should close this PR. I will update #11512. I am very sorry for the confusion @Batalex.
No problem @ogrisel, I would rather see this PR go with a bang than stay stale.
You're really a good sport @Batalex! Thanks a lot for pushing this.
Sorry it took us so long to realize the fundamental issue.
Reference Issues/PRs
Continuation of #11527.
What does this implement/fix? Explain your changes.
This PR proposes a new implementation for the computation of the explained variance for the Sparse PCA.
As such, we add two more attributes to the SparsePCA & MiniBatchSparsePCA classes: explained_variance_ and explained_variance_ratio_.

Any other comments?
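A hypothetical usage sketch of the attributes this PR proposed (the PR was ultimately closed, so these attributes are not part of released scikit-learn):

import numpy as np
from sklearn.decomposition import SparsePCA

X = np.random.randn(100, 20)
spca = SparsePCA(n_components=3).fit(X)

# Attributes below exist only on this PR's branch, not in released scikit-learn.
print(spca.explained_variance_)        # variance captured by each sparse component
print(spca.explained_variance_ratio_)  # same, as a fraction of the total variance of X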