
Conversation

@atarashansky

I know this (quite ancient) pull request has been open for a while (#403), but I wasn't sure of its status. I think the consensus was to wait for sklearn to integrate the necessary changes? If that's still the case, then please feel free to close this PR.

Here I make use of scipy's extremely nifty LinearOperator class to customize the dot product functions for an input sparse matrix. In this case, the 'custom' dot product performs implicit mean centering.
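Roughly, the trick looks like this (a minimal self-contained sketch with stand-in shapes, not the exact code in this PR):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

X = sp.random(1000, 200, density=0.1, format='csr', random_state=0)
mu = np.asarray(X.mean(axis=0)).ravel()  # per-feature means

def matvec(v):   # (X - mu) @ v == X @ v - (mu . v)
    v = np.ravel(v)
    return X @ v - mu @ v

def rmatvec(w):  # (X - mu).T @ w == X.T @ w - mu * sum(w)
    w = np.ravel(w)
    return X.T @ w - mu * w.sum()

# the centered products never require densifying X
centered = LinearOperator(X.shape, matvec=matvec, rmatvec=rmatvec, dtype=X.dtype)

u, s, vt = svds(centered, k=50)
idx = np.argsort(-s)  # svds returns singular values in ascending order
X_pca = (u * s)[:, idx]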

In my benchmarks, performing implicit mean centering this way does not affect the runtime at all. However, this approach has to go through svds, which does not implement randomized SVD, so we have to use 'arpack'. That can be significantly slower, but not intractably so: in my hands I could still run PCA on datasets of 200k+ cells in minutes, and it sure beats densifying the data. If you want more thorough benchmarks, I'm happy to generate them!

The way I incorporated this functionality into scanpy/preprocessing/_simple.py might be questionable, and I would love suggestions or advice on how to integrate it better if there is interest in pushing this PR through. Let me know!

@ivirshup
Member

ivirshup commented Feb 21, 2020

Thanks for the PR! This looks really interesting. I've got a couple questions:

  • Why not submit this to scikit-learn? In general I'd be more confident in their vetting.
  • This should work with other solvers from scipy, like lobpcg, right?
  • Could you provide some benchmarks on time, memory usage, and accuracy?

From a brief benchmark on my end, this looks very good from a memory usage perspective, with similar compute times. The components also seem highly correlated, but they are scaled differently. Would you mind commenting on that?


Edit: It seems like the factors are making our nearest neighbor network quite different. It also looks like the calculated variances are different.

@atarashansky
Author

atarashansky commented Feb 21, 2020

Great catch! I messed up and forgot to sort the singular values prior to scaling U. The components should be more or less the same now.

To answer your other questions,

  • Submitting this PR to scanpy seemed like lower-hanging fruit, since I'm much more familiar with your codebase. sklearn has also had a PR on this topic open for a long time and it just does not seem to budge; sparse support for PCA doesn't seem to be high on their priority list(?). If sklearn does eventually allow PCA on sparse inputs, it would be really easy to replace the call to my function with a call to their implementation instead.

  • This does work with lobpcg, but I'm a little confused about when lobpcg outperforms arpack (see the discussion in scikit-learn/scikit-learn#12794, "PCA on sparse, noncentered data"). There's some criterion relating the number of components to the size of the smallest dimension. In my hands, lobpcg is significantly slower.

  • Will do!

@atarashansky
Author

atarashansky commented Feb 21, 2020

Err, hopefully this isn't inconvenient. Here's a zip file containing the relevant notebook.
benchmarks_PR1066.zip

Benchmarking was done on the raw pbmc3k data.

Summary of the timing and memory results:

With pca_sparse=False,

%%memit
t=time.time()
sc.tl.pca(adata1,pca_sparse=False,svd_solver='arpack',random_state=0,zero_center=True)
print(str(time.time()-t)+' seconds')
6.122049570083618 seconds
peak memory: 1332.33 MiB, increment: 1047.04 MiB

With pca_sparse=True,

%%memit
t=time.time()
sc.tl.pca(adata2,pca_sparse=True,random_state=0)
print(str(time.time()-t)+' seconds')
2.373802423477173 seconds
peak memory: 401.17 MiB, increment: 56.26 MiB

There are very slight differences between the eigenvalues output by the different methods, which translates to slightly different cluster assignments when using euclidean distance (this is probably exacerbated by the fact that I am benchmarking on raw data). However, for correlation distance, the output is exactly the same. See the attached notebook for more details.

@ivirshup
Member

sklearn has also had a PR on this topic out for a long time and it just does not seem to budge. Allowing sparse support for PCA doesn't seem to be high on their priority list(?)

I've read that situation as that particular PR being stalled, but it also only covers the randomized solver. I think sklearn would really like to have this feature, and there's support for it from the community (the comment quoted below is yours):

The perfect implementation of implicit data centering must be solver agnostic, allowing any matrix-free sparse PCA and SVD solver from scipy and scikit to be used. E.g., adding support to call any matrix-free scikit SVD/PCA solver in #12794 (comment) would make it perfect PR for implicit data centering.

Do you think you could make a PR with this to sklearn? I'd like to see the response it gets, and judge based on that. My preference would be for this to go there, but I'm very open to having this in our codebase until it's in a sklearn release.

what's the best way of sharing the reproducing jupyter notebook with you?

Ha, that's actually a difficult question. I'm not quite sure; a zip file should be fine. Thanks for sharing!

Ideally what I'd like from a benchmark of performance would be time and memory usage for the product of these conditions:

  • Dataset size (one small, one large (>50k cells))
  • Implicit centering, densifying centering, no centering
  • single threaded, multi-threaded

I'd also lean towards making this the default for sparse data. But to do that, I will need to look a little more closely at correctness. For that, could you show the average residual, over a few runs with different seeds, between all output values of implicit vs explicit centering?
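Something like this untested sketch (using this PR's pca_sparse flag) is what I have in mind:

import numpy as np
import scanpy as sc

pbmc = sc.datasets.pbmc3k()
sc.pp.log1p(pbmc)

diffs = []
for seed in (0, 1, 2):
    implicit = sc.pp.pca(pbmc, pca_sparse=True, random_state=seed, copy=True)
    explicit = sc.pp.pca(pbmc, pca_sparse=False, random_state=seed, copy=True)
    # PCs are only defined up to sign, so align signs before comparing
    signs = np.sign((implicit.obsm["X_pca"] * explicit.obsm["X_pca"]).sum(axis=0))
    diffs.append(np.abs(implicit.obsm["X_pca"] * signs - explicit.obsm["X_pca"]).mean())

print("average residual:", np.mean(diffs))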

@ivirshup
Member

ivirshup commented Feb 21, 2020

Also, btw, I like memory-profiler's mprof sampling plots a lot for this kind of benchmarking. I took a look using this script:

`sparse_pca.py`
import scanpy as sc

pbmc = sc.datasets.pbmc3k()
sc.pp.log1p(pbmc)

# `@profile` is injected into builtins by memory_profiler when run under mprof
@profile
def implicit_mean_pca():
    sc.pp.pca(pbmc, pca_sparse=True)

@profile
def explicit_mean_pca():
    sc.pp.pca(pbmc)

@profile
def nomean_pca():
    sc.pp.pca(pbmc, zero_center=False)

if __name__ == "__main__":
    implicit_mean_pca()

    nomean_pca()

    explicit_mean_pca()

Run with

$ mprof run --interval=0.01 ./sparse_pca.py
...
$ mprof plot

Shows:

[figure: pca_mem_benchmark — mprof memory sampling plot]

So this is looking very good!

@atarashansky
Author

atarashansky commented Feb 21, 2020

Do you think you could make a PR with this to sklearn? I'd like to see the response it gets, and judge based on that. My preference would be for this to go there, but I'm very open to having this in our codebase until it's in a sklearn release.

I'll try to do that soon. For now, I'll focus on providing you with the benchmarks you requested!

  • Datasets size (one small, one large (>50k cells))
  • Implicit centering, densifying centering, no centering
  • single threaded, multi-threaded <---------

I could not find an n_jobs argument in scanpy.pp.pca. Can you elaborate a little on the single-threaded vs multi-threaded bit?

@atarashansky
Author

I used the 68k PBMC dataset from 10x Genomics for the large dataset.

Jupyter notebook with residuals:
benchmarks_PR1066_residuals.ipynb.zip

The memory and timing benchmarks:
[figures: memory and timing benchmark plots for the large and small datasets]

@atarashansky
Author

atarashansky commented Feb 21, 2020

By the way, I was curious why 'nomean' was so much faster than implicit mean centering. I noticed that if zero_center=False, the call to TruncatedSVD does not pass the svd_solver argument, so it defaults to the randomized solver (line 533):

pca_ = TruncatedSVD(n_components=n_comps, random_state=random_state)

So the speed difference is due to differences in the solvers (arpack vs randomized). Is the omission of svd_solver in the above line intended?
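If it isn't intended, presumably the fix is just to forward the solver; TruncatedSVD's keyword for it is algorithm. A rough, self-contained sketch:

import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

X = sp.random(500, 100, density=0.1, format='csr', random_state=0)
n_comps, svd_solver, random_state = 50, 'arpack', 0

# forward the requested solver instead of silently using 'randomized'
pca_ = TruncatedSVD(n_components=n_comps, algorithm=svd_solver,
                    random_state=random_state)
X_pca = pca_.fit_transform(X)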

@ivirshup
Member

I could not find a n_jobs argument in scanpy.pp.pca. Can you elaborate a little on the single threaded, multi-threaded bit?

The BLAS library used by numpy is multithreaded by default. You can change this by setting an environment variable, which might have to happen before numpy is imported. Here's how you'd do that:

import os
os.environ["MKL_NUM_THREADS"] = "1"  # if you're using MKL BLAS
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # if you're using OpenBLAS

Using sc.datasets.pbmc3k:

Single threaded
%time sc.pp.pca(pbmc, pca_sparse=True)                                                                                
CPU times: user 4.36 s, sys: 57.2 ms, total: 4.42 s
Wall time: 4.43 s

 %time sc.pp.pca(pbmc)                                                                                                 
CPU times: user 15.7 s, sys: 127 ms, total: 15.8 s
Wall time: 15.8 s

Multithreaded
%time sc.pp.pca(pbmc, pca_sparse=True)                                                                                 
CPU times: user 28.9 s, sys: 5.44 s, total: 34.4 s
Wall time: 2.39 s

%time sc.pp.pca(pbmc)                                                                                                  
CPU times: user 1min 37s, sys: 23.6 s, total: 2min 1s
Wall time: 9.92 s

I noticed that if zero_center=False then TruncatedSVD does not accept the svd_solver argument and defaults to the randomized solver

Good catch! I'm pretty sure the solver should be passed there.

@atarashansky
Author

atarashansky commented Feb 22, 2020

That's bizarre. Somehow, using a single thread on my system doesn't actually increase the runtime by much. I kept an eye on CPU usage to make sure I was only using one core. Implicit mean centering is actually faster on both the small and large datasets with a single thread.

Small dataset, one thread:
[figure: one_thread_small — benchmark plot]

Large dataset, one thread:
[figure: one_thread_large — benchmark plot]

@ivirshup
Member

Interesting... I know there can be some differences between the systems I use in how time is recorded, but I still wouldn't expect this. Either way, it looks like single-threaded performance is good, and multithreading adds surprisingly little for a lot of extra computation.

Once you've got the similarity measurements done, I think there's a little code organization to do, and this should be pretty much ready.

@atarashansky
Author

I included the notebook with the residuals above. I'll reattach it to this message:

benchmarks_PR1066_residuals.ipynb.zip

@ivirshup
Member

Ah, I had totally missed that, sorry!

Hm, it looks like the residual scales with the number of cells. I think this has to do with floating-point precision, since using 64-bit floats seems to remove the effect for me. Could you show comparisons between random states within the sparse and dense methods so we can be sure? I.e., if you run each method twice with different seeds, how different are the results?

Also, the numpy random state should be set (with the random_state argument) before _pca_with_sparse calls the svd solver.

return mean, var


def _pca_with_sparse(X, npcs, solver='arpack', mu=None):
Member

This should get a random_state argument. Make sure it will work if either a RandomState or int is passed to sc.pp.pca

Author

I don't think scipy.sparse.linalg.svds accepts a random state.

Member

It would probably be easiest to normalize the random state with:

random_state = sklearn.utils.check_random_state(random_state)

during argument handling for sc.pp.pca, then set np.random.set_state(random_state.get_state()) in _pca_with_sparse.
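I.e., something like this (untested sketch):

import numpy as np
from sklearn.utils import check_random_state

def _pca_with_sparse(X, npcs, solver='arpack', mu=None, random_state=None):
    # accepts None, an int seed, or a RandomState instance
    random_state = check_random_state(random_state)
    np.random.set_state(random_state.get_state())
    ...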

Author

Gotcha!

Member

Ah, no, you're right. I was getting the same result when I set the seed due to the conversion from float64 to float32, which made the values similar.

svds does use randomness to initialize, though; it would be nice if we could set that.

Author

atarashansky commented Feb 24, 2020

Also, lobpcg doesn't even use v0. It seems that it is deterministic.

Member

Doesn't it? From svds:

    if solver == 'lobpcg':

        if k == 1 and v0 is not None:
            X = np.reshape(v0, (-1, 1))
        else:
            X = np.random.RandomState(52).randn(min(A.shape), k)

        eigvals, eigvec = lobpcg(XH_X, X, tol=tol ** 2, maxiter=maxiter,
                                 largest=largest)

It just hard-codes the random state...

Author

Oh, right, I missed that. Okay, so it uses v0 when you compute a single singular vector, lol.

Contributor

@falexwolf, am I wrong in thinking we changed the default to arpack for reproducibility?

I believe so as well. I had issues reproducing my single-cell-tutorial workflow on different systems without using svd_solver='arpack'. It still wasn't 100% reproducible afterwards, but runs were much more similar. Alex recommended this option at the time, and I believe the defaults were changed as a result of that as well.

Author

In scikit-learn==0.22.1, TruncatedSVD does not use the random_state parameter if the solver is arpack, which is why you still weren't getting 100% reproducibility.

        X = check_array(X, accept_sparse=['csr', 'csc'],
                        ensure_min_features=2)
        random_state = check_random_state(self.random_state)

        if self.algorithm == "arpack":
            U, Sigma, VT = svds(X, k=self.n_components, tol=self.tol)
            # svds doesn't abide by scipy.linalg.svd/randomized_svd
            # conventions, so reverse its outputs.
            Sigma = Sigma[::-1]
            U, VT = svd_flip(U[:, ::-1], VT[::-1])
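
For what it's worth, svds does accept a starting vector, so reproducibility could presumably be recovered by passing v0 explicitly; a sketch (the uniform initialization here is my own choice, not sklearn's code):

import scipy.sparse as sp
from scipy.sparse.linalg import svds
from sklearn.utils import check_random_state

X = sp.random(300, 100, density=0.1, format='csr', random_state=0)
rs = check_random_state(0)

# a deterministic starting vector of length min(X.shape) pins down arpack's iteration
v0 = rs.uniform(-1, 1, size=min(X.shape))
u, s, vt = svds(X, k=20, v0=v0)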

@Koncopd
Member

Koncopd commented Feb 24, 2020

If we have arpack, I can also update the PR with the randomized SVD approach. Is it needed?

@atarashansky
Author

atarashansky commented Feb 24, 2020

I'm of the opinion that, until sklearn implements its own sparse-capable PCA (rendering the use of svds in this PR moot), the inclusion of randomized SVD should be an upstream PR to scipy.sparse.linalg.svds.

@atarashansky
Author

@ivirshup I think the benchmarks have shown satisfactory performance of this PR. Should we move on to polishing the code organization?

@ivirshup
Member

@atarashansky I'll have time to give a little more detailed notes this weekend.

  • Have you looked at submitting this in a PR to sklearn? I think they would be better at evaluating the stability. I'd be happy to help with this if you want.
  • At this point, I think PCA should go into its own file. Could you move the pca function and your sparse one into a scanpy/preprocessing/_pca.py file?
  • From the stability and performance checks, I think this could be similar enough to make it the default. I think this should just be the default behaviour when: 1) the data matrix is sparse, 2) zero_center=True, and 3) svd_solver is arpack or lobpcg. Any objections to this @Koncopd, @flying-sheep, @falexwolf?

@Koncopd, I don't think I've looked over your implementation much. Is it similar to the stalled sklearn PR? If so, do you have a sense of why the sklearn PR for the randomized solver is stalled?

@ivirshup
Member

@atarashansky, the "variance_ratio" entry in uns["pca"] seems to be off by about a factor of 3 from previous results.

@atarashansky
Author

atarashansky commented Feb 28, 2020

Oh crud, good catch. I was normalizing by the total variance of the top n principal components, instead of by the total variance. Easy fix, will commit it in a hot sec.

Edit: Scratch that, I lied. I'm confused now. svds only outputs the top k principal components -- how can I find out the total variance for all min(n,m) possible principal components?

Edit-edit: Apparently the total variance is just the sum of the variances of the original features.

Edit-edit-edit: Fixed 👯‍♂
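
For reference, the identity behind that last edit, as a quick self-contained check:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
Xc = X - X.mean(axis=0)

# the eigenvalues of the covariance matrix sum to its trace,
# i.e. the total variance equals the sum of the per-feature variances
s = np.linalg.svd(Xc, compute_uv=False)
assert np.isclose((s ** 2 / (X.shape[0] - 1)).sum(), X.var(axis=0, ddof=1).sum())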

@atarashansky
Author

atarashansky commented Mar 3, 2020

That's odd. sklearn calculates the explained variance and variance ratio as follows:

        # Calculate explained variance & explained variance ratio
        X_transformed = U * Sigma
        self.explained_variance_ = exp_var = np.var(X_transformed, axis=0)
        if sp.issparse(X):
            _, full_var = mean_variance_axis(X, axis=0)
            full_var = full_var.sum()
        else:
            full_var = np.var(X, axis=0).sum()
        self.explained_variance_ratio_ = exp_var / full_var

I do it in the same way:

    X_pca = (u * s)[:, idx] # sort PCs in decreasing order
    ev = X_pca.var(0)

    total_var = _get_mean_var(X)[1].sum()
    ev_ratio = ev / total_var

I'll investigate...

EDIT: Strange, your assertion error is not reproducible on my end. The code runs fine for me.

@ivirshup
Member

ivirshup commented Mar 3, 2020

I'm not sure we're looking at the same code. I was looking at this:

        self.n_samples_, self.n_features_ = n_samples, n_features
        self.components_ = V
        self.n_components_ = n_components

        # Get variance explained by singular values
        self.explained_variance_ = (S ** 2) / (n_samples - 1)
        total_var = np.var(X, ddof=1, axis=0)
        self.explained_variance_ratio_ = \
            self.explained_variance_ / total_var.sum()
        self.singular_values_ = S.copy()  # Store the singular values.

@atarashansky
Author

atarashansky commented Mar 3, 2020

Also, any thoughts on making a PR there?

Tbh, there are just fewer hoops to jump through writing a pca function that fits snugly into your preprocessing utils module, compared to writing a new class on top of their base class with transform, fit, fit_transform methods, and the whole shebang. I think the stark computational advantage of this method for sparse inputs justifies its immediate inclusion in scanpy (of course, at the discretion of the scanpy maintainers :D). Submitting a PR to sklearn is perhaps something I'd be willing to do later on. Also, as discussed previously, if sklearn ever does ship an implementation, it should be quite straightforward to swap my function call for theirs.

@atarashansky
Author

atarashansky commented Mar 3, 2020

I'm not sure we're looking at the same code. I was looking at this:

I was looking at the TruncatedSVD code. Either way, I'm not able to reproduce your assertion error.

@ivirshup
Member

ivirshup commented Mar 3, 2020

Here's the test I ran for commit 82e3a59

import scanpy as sc
import numpy as np

pbmc = sc.datasets.pbmc3k()
pbmc.X = pbmc.X.astype(np.float64)
sc.pp.log1p(pbmc)

implicit = sc.pp.pca(pbmc, pca_sparse=True, dtype=np.float64, copy=True)
explicit = sc.pp.pca(pbmc, pca_sparse=False, dtype=np.float64, copy=True)

assert not np.allclose(implicit.uns["pca"]["variance"], explicit.uns["pca"]["variance"])
assert not np.allclose(implicit.uns["pca"]["variance_ratio"], explicit.uns["pca"]["variance_ratio"])

@atarashansky
Author

As per your suggestion, I switched to calculating the eigenvalues from the singular values, as opposed to taking the variance of the principal components. Now the eigenvalues are almost exactly the same.
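
Concretely, the change is along these lines (a minimal sketch on uncentered data, just for illustration):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

X = sp.random(500, 100, density=0.1, format='csr', random_state=0)
u, s, vt = svds(X, k=30)
idx = np.argsort(-s)  # svds returns ascending order

# eigenvalues from the singular values, instead of np.var on u * s
ev = s[idx] ** 2 / (X.shape[0] - 1)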

@ivirshup
Member

ivirshup commented Mar 3, 2020

Part of why I would like this to be in sklearn is that it lessens our responsibility to maintain it, and simplifies our code. I think it'll be easiest to do this sooner, rather than later, since these things have a tendency to lose momentum.

For sklearn submission, I don't think you'd have to implement any classes. Your solution would just be what happened if someone passed a sparse matrix and solver="arpack" to PCA.fit, like what scikit-learn/scikit-learn#12841 does. Does this make it more appealing? If not, would you mind if I opened a PR to sklearn with this code (crediting you, of course)?


About this PR, could you add tests for:

  • The variance and variance explained entries being correct
  • Explicit and implicit centering returning equivalent results

After that and the code reorganization I mentioned above, this should be about ready to merge.

@atarashansky
Author

For sklearn submission, I don't think you'd have to implement any classes. Your solution would just be what happened if someone passed a sparse matrix and solver="arpack" to PCA.fit, like what scikit-learn/scikit-learn#12841 does.

That's fair! Doesn't seem like much work at all. I'll submit a PR to sklearn, then.

@ivirshup
Member

Hey @atarashansky, what's your status with this? We're going for a larger release soon, and I'd really like to have this PR in it!

@atarashansky
Author

atarashansky commented Mar 18, 2020

Sorry! I've been preoccupied with some stuff on my end (and also super distracted by this coronavirus hullabaloo). I'll add the requested tests soon.

EDIT: Done!

@ivirshup
Member

Great! Yeah, coronavirus is pretty distracting. This last week has definitely felt like a month for me.

So still to do:

  • This should be rebased on master
  • This should be the default, and this should be noted in the documentation
  • Add this to the release notes!

@ivirshup ivirshup self-requested a review April 13, 2020 05:15
@ivirshup ivirshup mentioned this pull request Apr 13, 2020
@ivirshup
Member

@atarashansky, I've merged this through #1162.

This is really great! Thanks for implementing this!

@ivirshup ivirshup closed this Apr 15, 2020
@dkobak

dkobak commented May 19, 2020

@ivirshup @atarashansky

Your solution would just be what happened if someone passed a sparse matrix and solver="arpack" to PCA.fit, like what scikit-learn/scikit-learn#12841 does. Does this make it more appealing? If not, would you mind if I opened a PR to sklearn with this code (crediting you, of course)?

That's fair! Doesn't seem like much work at all. I'll submit a PR to sklearn, then.

Has one of you opened this PR to sklearn? I just wanted to chime in and say that it'd be great if sklearn finally started to support this. Definitely worth trying to get it in there.

@ivirshup
Member

@atarashansky did you still want to do this? I’d be happy to give it a shot if not.

@atarashansky
Author

Sorry about going MIA! You can open the PR. I've been totally slammed working on a publication.

@gokceneraslan
Collaborator

Shall we close #393 and #403?

@ivirshup ivirshup mentioned this pull request May 28, 2020
@gokceneraslan
Collaborator

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>

~/.local/lib/python3.8/site-packages/scanpy/preprocessing/_pca.py in pca(data, n_comps, zero_center, svd_solver, random_state, return_info, use_highly_variable, dtype, copy, chunked, chunk_size)
    201             )
    202 
--> 203         output = _pca_with_sparse(X, n_comps, solver=svd_solver)
    204         # this is just a wrapper for the results
    205         X_pca = output['X_pca']

~/.local/lib/python3.8/site-packages/scanpy/preprocessing/_pca.py in _pca_with_sparse(X, npcs, solver, mu, random_state)
    293         return XHmat(x) - mhmat(ones(x))
    294 
--> 295     XL = LinearOperator(
    296         matvec=matvec,
    297         dtype=X.dtype,

TypeError: __init__() got an unexpected keyword argument 'rmatmat'

I got this error once with the new sparse PCA. @atarashansky, do we need to specify an explicit scipy version as a dependency? It might be something weird with my setup, too.
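(If it is a scipy version issue, a guard like this sketch would at least make the failure explicit; the exact minimum version here is a guess on my part:)

# hypothetical guard; the actual minimum scipy version needs checking
import scipy
from packaging.version import Version

if Version(scipy.__version__) < Version("1.4"):
    raise ImportError(
        "pca on sparse input needs a newer scipy "
        "(LinearOperator lacks the 'rmatmat' keyword in this version)"
    )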

@gokceneraslan
Collaborator

Ah ok @ivirshup already addressed this here, #1247.
