
Conversation

@Koncopd (Member) commented Dec 20, 2018

Koncopd requested a review from falexwolf on December 20, 2018 13:34
@falexwolf (Member)
This is nice! Thank you!

It appears to me that the benchmarks show that this only becomes relevant for very large data, so we need to be mindful not to break backward compatibility for the small and medium-size datasets that most people use (which we would by introducing the tiny numerical difference). Don't you think that, in light of this, it would be better to leave the default as is (densifying) and add an option like sparse_pca?
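A hypothetical call with such an opt-in flag could look like the following sketch (sparse_pca exists only on this PR's branch, not in released Scanpy, and the toy data is just for illustration):

import numpy as np
import scanpy as sc
from anndata import AnnData
from scipy import sparse

# toy sparse dataset; the benchmark in this PR used roughly 60k cells x 2k genes
adata = AnnData(sparse.random(1000, 200, density=0.1, format='csr', dtype=np.float32))

# current default behaviour: the sparse matrix is densified before PCA
sc.pp.pca(adata, n_comps=50)

# proposed opt-in flag from this PR (not available in released Scanpy)
sc.pp.pca(adata, n_comps=50, sparse_pca=True)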

@Koncopd (Member, Author) commented Dec 26, 2018

> It appears to me that the benchmarks show that this only becomes relevant for very large data.

Hm, even for my example it is 77.14 MiB vs 893.92 MiB, so more than a tenfold difference. This seems large to me, no?

@falexwolf (Member)
> Hm, even for my example it is 77.14 MiB vs 893.92 MiB, so more than a tenfold difference. This seems large to me, no?

Yes, it's definitely large, and it's awesome that you solved this problem! I just meant that it's not hitting people's computational resource limits: your example is 60K x 2K, so quite big already; if you densify you need about 800 MB, which is easily available even on a laptop. That's what I meant.

What do you think?
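As a rough sanity check on the sizes discussed above (a sketch; the dataset's density is an assumption, since it isn't stated here):

import numpy as np

n_obs, n_vars = 60_000, 2_000              # sizes of the example dataset
dense_mib = n_obs * n_vars * 8 / 2**20     # float64, 8 bytes per entry
print(dense_mib)                           # ~915 MiB, the order of the reported 893.92 MiB

# A CSR matrix only stores the non-zero entries (data + indices + indptr),
# so its footprint scales with nnz rather than with n_obs * n_vars.
density = 0.05                             # assumed density, not stated in the PR
nnz = int(n_obs * n_vars * density)
csr_mib = (nnz * (8 + 4) + (n_obs + 1) * 4) / 2**20
print(csr_mib)                             # ~69 MiB, the order of the reported 77.14 MiB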

@Koncopd (Member, Author) commented Dec 27, 2018

Yeah, in that sense this seems important only for large datasets. So I will add a sparse_pca option with False as the default.

@Koncopd (Member, Author) commented Jan 14, 2019

Not sure what kind of test to add for this...

@falexwolf (Member)
As discussed, @Koncopd will try to integrate this into scikit-learn itself and not into Scanpy. 😄

@Koncopd (Member, Author) commented Feb 4, 2019

A similar pull request already exists in scikit-learn: scikit-learn/scikit-learn#12841. I will watch it.

flying-sheep force-pushed the master branch 2 times, most recently from 3efb194 to fc84096, on February 12, 2019 11:38
@flying-sheep (Member) left a comment

Sorry about causing merge conflicts. Squashing all your commits and then rebasing on master is probably going to be easier than rebasing all the commits individually.

@@ -0,0 +1,70 @@

Review comment (Member): The file needs to be named starting with _ like the others.

@@ -0,0 +1,70 @@

import numpy as np
from distutils.version import StrictVersion

@flying-sheep (Member), Jan 10, 2020: distutils.version is broken. Please use from packaging import version and then version.parse. (I added packaging to the requirements on master.)
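A minimal sketch of the suggested replacement (the version threshold is purely illustrative):

from packaging import version
import sklearn

# packaging.version copes with pre-release tags such as '0.22.dev0',
# which make distutils' StrictVersion raise a ValueError
NEW_SKLEARN = version.parse(sklearn.__version__) >= version.parse('0.22')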

else:
    from sklearn.utils.sparsefuncs_fast import csr_mean_variance_axis0 as mean_variance

# need to pass issparse check

Review comment (Member): I don't understand this comment.

if k >= n_features:
    raise ValueError('n_components must be < n_features;'
                     ' got %d >= %d' % (k, n_features))
U, Sigma, VT = randomized_svd(C, k, n_iter=self.n_iter,

Review comment (Member): Please don't use uppercase letters in variable names.

data: Union[AnnData, np.ndarray, spmatrix],
n_comps: int = N_PCS,
zero_center: Optional[bool] = True,
sparse_pca: Optional[bool] = False,

Review comment (Member): Maybe make it None by default and, based on the benchmarks, use it when the data is sufficiently large. In that case, make sure you document this in the parameter docs and log a debug message.
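A minimal sketch of how a None default could be resolved, assuming a size threshold that is purely illustrative:

import logging
from scipy import sparse

# hypothetical cut-off; the real value would come from the benchmarks
SPARSE_PCA_MIN_ELEMENTS = 50_000 * 2_000

def _resolve_sparse_pca(X, sparse_pca=None):
    # decide whether to use the implicitly centered sparse code path
    if sparse_pca is None:
        sparse_pca = (
            sparse.issparse(X)
            and X.shape[0] * X.shape[1] >= SPARSE_PCA_MIN_ELEMENTS
        )
        logging.debug('sparse_pca resolved to %s for shape %s', sparse_pca, X.shape)
    return sparse_pca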

@gokceneraslan (Collaborator)
Shall we replace sparse_pca with something else, like implicit_centering?

sparse_pca might give rise to confusion since there is indeed a PCA technique called sparse PCA.

@Koncopd (Member, Author) commented Jan 11, 2020

Hm, it was decided earlier to suspend this PR.
There is an analogous PR in scikit-learn, but I'm not sure it will go forward.
I'm not sure what to do with this PR...

@flying-sheep (Member)
Ah, then we should leave it open and unchanged until there’s a decision on the sklearn PR. Sorry I missed the decision.

@VolkerBergen (Contributor)
What's the current state here?

This just came out: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1900-3. It favors sklearn's PCA implementation while stating that it cannot yet handle sparse matrices. Is that still true?

@Koncopd (Member, Author) commented Jan 22, 2020

Last time I checked it was true. I need to check again, but I bet it is still the same.

@ivirshup (Member)
That's still the case (at least for the randomized PCA PR @Koncopd linked above), though it looks like there may be another path forward using other solvers: scikit-learn/scikit-learn#12794. It still needs an implementation, though.

@atarashansky
You could just add a sparse argument to pca. If it is True, call this function instead of scikit-learn's PCA:

import numpy as np
import scipy as sp
import scipy.sparse.linalg  # makes sp.sparse.linalg available

def sparse_pca(X, npcs, mu=None):
    # X -- scipy sparse data matrix
    # npcs -- number of principal components
    # mu -- precomputed feature means; if None, they are computed from X

    # compute the mean of each feature (column) as a 1 x n_features array
    if mu is None:
        mu = X.mean(0).A.flatten()[None, :]

    # dot product operators for the means, the transposed means,
    # the data, the conjugate-transposed data, and a row vector of ones
    mdot = mmat = mu.dot
    mhdot = mhmat = mu.T.dot
    Xdot = Xmat = X.dot
    XHdot = XHmat = X.T.conj().dot
    ones = np.ones(X.shape[0])[None, :].dot

    # wrap the matrix/vector products so that the feature means are
    # subtracted implicitly, i.e. the centered matrix is never materialized
    def matvec(x):
        return Xdot(x) - mdot(x)

    def matmat(x):
        return Xmat(x) - mmat(x)

    def rmatvec(x):
        return XHdot(x) - mhdot(ones(x))

    def rmatmat(x):
        return XHmat(x) - mhmat(ones(x))

    # construct the LinearOperator representing the implicitly centered data
    XL = sp.sparse.linalg.LinearOperator(
        matvec=matvec, matmat=matmat,
        rmatvec=rmatvec, rmatmat=rmatmat,
        shape=X.shape, dtype=X.dtype,
    )

    u, s, v = sp.sparse.linalg.svds(XL, solver='arpack', k=npcs)

    # sort the singular values in decreasing order
    idx = np.argsort(-s)
    S = np.diag(s[idx])
    # principal components
    pcs = u[:, idx].dot(S)
    # equivalent to PCA.components_ in sklearn
    components_ = v[idx, :]
    return pcs, components_

This only works with the arpack solver. It's a bit slower than PCA on dense matrices (since ARPACK is slower than the randomized solver), but it's very memory efficient.
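For reference, a small usage sketch of the function above on a random sparse matrix (the sizes and density are arbitrary):

import numpy as np
import scipy as sp
import scipy.sparse

# random sparse test matrix
X = sp.sparse.random(5_000, 300, density=0.1, format='csr',
                     dtype=np.float64, random_state=0)

pcs, components_ = sparse_pca(X, npcs=50)
print(pcs.shape)           # (5000, 50)
print(components_.shape)   # (50, 300)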

Koncopd closed this on Mar 1, 2022
flying-sheep deleted the pca_sparse branch on October 30, 2023