Pca for sparse data #403

Koncopd · 2018-12-20T13:34:10Z

For this - #393

Benchmarks
https://github.com/Koncopd/anndata-scanpy-benchmarks/blob/master/pca_for_sparse.ipynb

falexwolf · 2018-12-26T21:09:02Z

This is nice! Thank you!

It appears to me that the benchmarks show that this only becomes relevant for very large data. So we need to be mindful to not break backward compatibility for all the small and medium-size datasets that people use (which we do by introducing the tiny difference). Don't you think that in the light of this, it would be better to leave the default as is (densifying) and have an option sparse_pca or something similar?

Koncopd · 2018-12-26T21:30:31Z

> It appears to me that the benchmarks show that this only becomes relevant for very large data.

Hm, even for my example it is 77.14 MiB vs 893.92 MiB, so 10 times difference. This seems large to me, no?

falexwolf · 2018-12-27T20:03:54Z

> Hm, even for my example it is 77.14 MiB vs 893.92 MiB, so 10 times difference. This seems large to me, no?

Yes, it's definitely large and it's awesome that you solved this problem! I just meant that it's not hitting the limits of people's computational resources: your example is 60K x 2K, so quite big already, and if you densify it you need 800MB, which is easily available even on a laptop. That's what I meant.

What do you think?
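
As a rough check of the numbers above, the dense footprint can be estimated from the shape and the element size. This is only a sketch: it assumes 8-byte float64 values and the rounded 60K x 2K shape quoted in the comment, and the helper name dense_mib is made up for illustration.

```python
def dense_mib(n_obs: int, n_vars: int, itemsize: int = 8) -> float:
    """Memory of a dense n_obs x n_vars array in MiB (itemsize bytes per element)."""
    return n_obs * n_vars * itemsize / 2**20


# Rounded shape from the discussion; the real benchmark shape may differ slightly.
print(f"{dense_mib(60_000, 2_000):.1f} MiB as float64")
print(f"{dense_mib(60_000, 2_000, itemsize=4):.1f} MiB as float32")
```

Under these assumptions the float64 estimate lands in the same ballpark as the 893.92 MiB measured in the benchmark, which is consistent with the densified copy dominating memory use.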

Koncopd · 2018-12-27T23:05:56Z

Yeah, this seems important only for large datasets in that sense. So I will add a sparse_pca option, with False as the default.

Koncopd · 2019-01-14T10:58:25Z

Not sure what kind of test to add for this...

falexwolf · 2019-01-21T10:54:42Z

As discussed, @Koncopd will try to integrate this into scikit-learn itself and not into Scanpy. 😄

Koncopd · 2019-02-04T12:54:24Z

A similar pull request already exists in scikit-learn:
scikit-learn/scikit-learn#12841
I will watch it.

flying-sheep

Sorry about causing merge conflicts. Squashing all your commits and then rebasing on master is probably going to be easier than rebasing all the commits individually.

flying-sheep · 2020-01-10T09:40:01Z

scanpy/preprocessing/pca_for_sparse.py

@@ -0,0 +1,70 @@
+


The file name needs to start with _, like the others.

flying-sheep · 2020-01-10T09:40:46Z

scanpy/preprocessing/pca_for_sparse.py

@@ -0,0 +1,70 @@
+
+import numpy as np
+from distutils.version import StrictVersion


distutils.version is broken. Please use from packaging import version and then version.parse. (I added packaging to the requirements on master)
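
A minimal sketch of the suggested replacement, assuming the packaging library is installed. The helper name at_least and the version strings are illustrative only and are not taken from the PR.

```python
from packaging import version


def at_least(installed: str, minimum: str) -> bool:
    """Return True if `installed` is the same as or newer than `minimum`."""
    return version.parse(installed) >= version.parse(minimum)


# Versions are compared component by component, not as plain strings,
# and development releases such as "0.21.dev0" are parsed as well.
print(at_least("0.20.1", "0.19"))
print(at_least("0.9", "0.19"))
print(at_least("0.21.dev0", "0.20"))
```

In the PR this kind of check would decide which scikit-learn helper to import, with sklearn.__version__ passed as the installed version.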

flying-sheep · 2020-01-10T09:41:54Z

scanpy/preprocessing/pca_for_sparse.py

+else:
+    from sklearn.utils.sparsefuncs_fast import csr_mean_variance_axis0 as mean_variance
+
+# need to pass issparse check


I don’t understand this comment.

flying-sheep · 2020-01-10T09:43:13Z

scanpy/preprocessing/pca_for_sparse.py

+        if k >= n_features:
+            raise ValueError('n_components must be < n_features;'
+                             ' got %d >= %d' % (k, n_features))
+        U, Sigma, VT = randomized_svd(C, k, n_iter=self.n_iter,


Please don’t use uppercase letters in variable names.

flying-sheep · 2020-01-10T09:44:49Z

scanpy/preprocessing/simple.py

    data: Union[AnnData, np.ndarray, spmatrix],
    n_comps: int = N_PCS,
    zero_center: Optional[bool] = True,
+    sparse_pca: Optional[bool] = False,


Maybe make it None by default, and, based on the benchmarks, use it when the data is sufficiently large.

In this case, make sure you document this in the parameter docs and log a debug message.
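
One way this could look, as a sketch only. The helper resolve_sparse_pca and the cutoff DENSE_ELEMENTS_LIMIT are hypothetical; the cutoff is a placeholder and would have to be chosen from the benchmarks.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

# Hypothetical cutoff on the number of matrix elements; not derived from the benchmarks.
DENSE_ELEMENTS_LIMIT = 100_000_000


def resolve_sparse_pca(
    sparse_pca: Optional[bool], n_obs: int, n_vars: int, is_sparse: bool
) -> bool:
    """Turn a None default into a concrete choice based on the data size."""
    if sparse_pca is not None:
        # An explicit user choice always wins.
        return sparse_pca
    use_sparse = is_sparse and n_obs * n_vars > DENSE_ELEMENTS_LIMIT
    logger.debug(
        "sparse_pca=None resolved to %s for data of shape (%d, %d)",
        use_sparse, n_obs, n_vars,
    )
    return use_sparse
```

Small and medium-size datasets keep the current densifying behaviour, so results only change for inputs above the cutoff or when the user opts in explicitly.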

gokceneraslan · 2020-01-11T12:56:29Z

Shall we replace sparse_pca with something else like implicit_centering or so?

sparse_pca might give rise to confusion since there is indeed a PCA technique called sparse PCA.

Koncopd · 2020-01-11T14:17:54Z

Hm, it was decided earlier to suspend this PR.
There is an analogous PR in scikit-learn, but I'm not sure it will go forward.
I'm not sure what to do with this PR...

flying-sheep · 2020-01-13T09:33:55Z

Ah, then we should leave it open and unchanged until there’s a decision on the sklearn PR. Sorry I missed the decision.

VolkerBergen · 2020-01-21T09:09:38Z

What's the current state here?

This just came out
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1900-3
favoring sklearn's PCA implementation while stating that it cannot yet handle sparse matrices. Is that still true?

Koncopd · 2020-01-22T16:56:44Z

Last time I checked it was true. I need to check again, but I bet it is still the same.

ivirshup · 2020-01-23T02:05:28Z

That's still the case (at least for the randomized PCA PR @Koncopd linked above), though it looks like there may be another path forward using other solvers: scikit-learn/scikit-learn#12794. Still needs an implementation though.

atarashansky · 2020-02-04T03:35:10Z

You could just add a sparse argument to pca. If True, just call this function instead of scikit-learn's PCA:

import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

def sparse_pca(X, npcs, mu=None):
    # X -- scipy sparse data matrix (observations x features)
    # npcs -- number of principal components
    # mu -- precomputed feature means. if None, calculates them from X.

    # compute mean of data features as a dense (1, n_features) row vector
    if mu is None:
        mu = X.mean(axis=0)
    mu = np.asarray(mu).reshape(1, -1)

    # dot product operator for the means
    mdot = mu.dot
    # dot product operator for the transposed means
    mhdot = mu.T.dot
    # dot product operator for the data
    Xdot = X.dot
    # dot product operator for the transposed data
    XHdot = X.T.conj().dot
    # dot product operator for a vector of ones
    ones = np.ones(X.shape[0])[None, :].dot

    # modify the matrix/vector dot products to subtract the means;
    # broadcasting lets the same functions handle vectors and matrices
    def matvec(x):
        return Xdot(x) - mdot(x)

    def rmatvec(x):
        return XHdot(x) - mhdot(ones(x))

    # construct the LinearOperator
    XL = LinearOperator(
        shape=X.shape,
        dtype=X.dtype,
        matvec=matvec,
        matmat=matvec,
        rmatvec=rmatvec,
        rmatmat=rmatvec,
    )

    u, s, v = svds(XL, solver='arpack', k=npcs)

    # sort the singular values in decreasing order
    idx = np.argsort(-s)
    S = np.diag(s[idx])
    # principal components
    pcs = u[:, idx].dot(S)
    # equivalent to PCA.components_ in sklearn
    components_ = v[idx, :]
    return pcs, components_

This only works for the arpack solver. It's a bit slower than PCA on dense matrices (since arpack is slower than randomized), but it's super memory efficient.
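
To make the implicit centering idea easier to check, here is a small self-contained sketch that compares it against explicit dense centering on random data. The helper name centered_operator, the matrix size, and the density are arbitrary choices for illustration, and this is a condensed variant of the function above rather than the code from the PR.

```python
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.linalg import LinearOperator, svds


def centered_operator(X):
    """Wrap sparse X so it acts like (X - column means) without densifying."""
    mu = np.asarray(X.mean(axis=0)).reshape(1, -1)  # (1, n_features)
    ones_row = np.ones((1, X.shape[0]))             # (1, n_obs)

    def matmat(v):
        # (X - 1 mu) v = X v - 1 (mu v); the subtraction broadcasts over rows
        return X @ v - mu @ v

    def rmatmat(u):
        # (X - 1 mu)^T u = X^T u - mu^T (1^T u)
        return X.T @ u - mu.T @ (ones_row @ u)

    return LinearOperator(
        shape=X.shape, dtype=np.float64,
        matvec=matmat, matmat=matmat, rmatvec=rmatmat, rmatmat=rmatmat,
    )


# Small random sparse matrix so the dense reference is cheap to compute.
X = sparse.random(200, 50, density=0.1, format="csr", random_state=0)

# Top singular values via the implicit operator (ARPACK is the default solver).
_, s_sparse, _ = svds(centered_operator(X), k=5)
s_sparse = np.sort(s_sparse)[::-1]

# Reference: explicit centering on the densified matrix.
dense = X.toarray()
s_dense = np.linalg.svd(dense - dense.mean(axis=0), compute_uv=False)[:5]

print(np.allclose(s_sparse, s_dense, atol=1e-6))
```

The operator only ever stores the sparse matrix and one mean vector, which is where the memory savings come from; the price is that only solvers accepting a LinearOperator can be used.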

Koncopd added 9 commits December 16, 2018 08:06

centered sparse

832a797

fix rmul

629ac6b

proper centered sparse class

6c77764

pca for centered sparse data added

7d16b3e

efficient mean

4516f2a

proper mean and variance calculation

0fda5a4

fix import

d81f411

same n_iter as in pca

29d3d2a

add n_iter to pca

2371cd0

Koncopd requested a review from falexwolf December 20, 2018 13:34

Koncopd added 2 commits January 5, 2019 02:22

add sparse_pca

6333bc4

change to built-in version comparison

a10a808

flying-sheep force-pushed the master branch 2 times, most recently from 3efb194 to fc84096 Compare February 12, 2019 11:38

falexwolf mentioned this pull request Mar 10, 2019

TODO: Backwards-compat breaking changes #453

Open

15 tasks

cyrus303 approved these changes Dec 19, 2019

falexwolf force-pushed the master branch from aa3acd7 to fd4bc99 Compare December 30, 2019 00:53

flying-sheep requested changes Jan 10, 2020

atarashansky mentioned this pull request Feb 20, 2020

PCA for sparse data (v2) #1066

Closed

Koncopd closed this Mar 1, 2022

flying-sheep deleted the pca_sparse branch October 30, 2023 13:23
