PCA for sparse data (v2) #1066

Conversation
Thanks for the PR! This looks really interesting. I've got a couple questions:
From a brief benchmark on my end, this looks very good from a memory usage perspective, with similar compute times. The components also seem highly correlated, but they are scaled differently. Would you mind commenting on that?
Edit: It seems like the scaling differences are making our nearest neighbor network quite different. It also looks like the calculated variances are different.
Great catch! I messed up and forgot to sort the singular values prior to scaling. To answer your other questions:
Err, hopefully this isn't inconvenient. Here's a zip file containing the relevant notebook. Benchmarking was done on the raw pbmc3k data; a summary of the timing and memory results is in the notebook. There are very slight differences between the eigenvalues output by the different methods, which translates to slightly different cluster assignments when using euclidean distance (this is probably exacerbated by the fact that I am benchmarking on raw data). However, for correlation distance, the output is exactly the same. See the attached notebook for more details.
I've read that situation as that particular PR being stalled, but it's also just for the random solver. I think sklearn would really like to have this feature. I think there's support for this from the community (where the referenced comment is yours):
Do you think you could make a PR with this to sklearn? I'd like to see the response it gets, and judge based on that. My preference would be for this to go there, but I'm very open to having this in our codebase until it's in a sklearn release.
Ha, that's actually a difficult question. I'm not quite sure; a zip file should be fine. Thanks for sharing! Ideally, what I'd like from a benchmark of performance would be time and memory usage for the product of these conditions:
I'd also lean towards making this the default for sparse data. But to do that, I will need to look a little closer at correctness. For that, could you show the average residual from a few runs (with different seeds) for all output values between implicit vs explicit centering?
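For concreteness, a comparison along those lines might look like the sketch below. The `pca_sparse` flag is the one added by this PR, but the helper function, the seeds, and the exact residual definition are just illustrative:

```python
import numpy as np
import scanpy as sc

pbmc = sc.datasets.pbmc3k()
sc.pp.log1p(pbmc)

def mean_residual(seed):
    # Run both centering strategies with the same seed and compare embeddings
    implicit = sc.pp.pca(pbmc, pca_sparse=True, random_state=seed, copy=True)
    explicit = sc.pp.pca(pbmc, pca_sparse=False, random_state=seed, copy=True)
    a, b = implicit.obsm["X_pca"], explicit.obsm["X_pca"]
    # PC signs are arbitrary, so align each component's sign before differencing
    signs = np.sign((a * b).sum(axis=0))
    return np.abs(a - b * signs).mean()

print([mean_residual(seed) for seed in (0, 1, 2)])
```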
Also, btw, I like the memory-profiler. `sparse_pca.py`:

```python
import scanpy as sc

pbmc = sc.datasets.pbmc3k()
sc.pp.log1p(pbmc)

# `@profile` is injected into builtins by memory-profiler when run via `mprof run`
@profile
def implicit_mean_pca():
    sc.pp.pca(pbmc, pca_sparse=True)

@profile
def explicit_mean_pca():
    sc.pp.pca(pbmc)

@profile
def nomean_pca():
    sc.pp.pca(pbmc, zero_center=False)

if __name__ == "__main__":
    implicit_mean_pca()
    nomean_pca()
    explicit_mean_pca()
```

Run with:

```
$ mprof run --interval=0.01 ./sparse_pca.py
...
$ mprof plot
```

Shows: [memory usage plot] So this is looking very good!
I'll try to do that soon. For now, I'll focus on providing you with the benchmarks you requested!
I could not find a
I used the 68k pbmc dataset from 10x Genomics for the large dataset. Jupyter notebook with residuals:
By the way, I was curious why ‘nomean’ was so much faster than implicit mean centering. I noticed that if
So the speed difference is due to differences in the solvers (arpack vs randomized). Is the omission of
The BLAS library used by numpy is multithreaded by default. You can change this by setting an environment variable. This might have to happen before numpy is imported. Here's how you'd do that:

```python
import os
os.environ["MKL_NUM_THREADS"] = "1"       # If you're using MKL BLAS
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # If you're using OpenBLAS
```

Using `sc.datasets.pbmc3k`:

Single threaded

```
%time sc.pp.pca(pbmc, pca_sparse=True)
CPU times: user 4.36 s, sys: 57.2 ms, total: 4.42 s
Wall time: 4.43 s

%time sc.pp.pca(pbmc)
CPU times: user 15.7 s, sys: 127 ms, total: 15.8 s
Wall time: 15.8 s
```

Multithreaded

```
%time sc.pp.pca(pbmc, pca_sparse=True)
CPU times: user 28.9 s, sys: 5.44 s, total: 34.4 s
Wall time: 2.39 s

%time sc.pp.pca(pbmc)
CPU times: user 1min 37s, sys: 23.6 s, total: 2min 1s
Wall time: 9.92 s
```
Good catch! I'm pretty sure that should be passed to the solver.
That's bizarre. Somehow, using a single thread on my system doesn't actually increase the runtime by that much. I kept an eye on the CPU usage to make sure that I was just using one core. It's actually faster to do implicit mean centering on the small and large datasets using a single thread.
Interesting... I know that there can be some differences between the systems I use in how time is being recorded, but I still don't think I'd expect this. Either way, it looks like single-threaded performance is good, and multithreading is adding surprisingly little for a lot of spent computation. Once you've got the similarity measurements done, I think there's a little code organization to do, and this should be pretty much ready.
I included the notebook with the residuals above. I'll reattach it to this message:
Ah, I had totally missed that, sorry! Hm, it looks like the residuals are scaling with the number of cells. I think this has to do with floating point precision, since using 64-bit floats seems to remove the effect for me. Could you show comparisons between random states within the sparse and dense methods so we can be sure? I.e. if you run each method twice with different seeds, how different are the results? Also, the numpy random state should be set (with the `random_state` argument).
scanpy/preprocessing/_utils.py (Outdated)

```python
    return mean, var
# ...
def _pca_with_sparse(X, npcs, solver='arpack', mu=None):
```
This should get a `random_state` argument. Make sure it will work if either a `RandomState` or an int is passed to `sc.pp.pca`.
I don't think scipy.sparse.linalg.svds accepts a random state.
It would probably be easiest to normalize the random state with `random_state = sklearn.utils.check_random_state(random_state)` during argument handling for `sc.pp.pca`, then setting `np.random.set_state(random_state.get_state())` in `_pca_with_sparse`.
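As a small self-contained illustration of that normalization (nothing scanpy-specific, just the sklearn/numpy calls mentioned above):

```python
import numpy as np
from sklearn.utils import check_random_state

# check_random_state accepts None, an int seed, or an existing RandomState
for seed in (None, 0, np.random.RandomState(0)):
    random_state = check_random_state(seed)      # always returns a RandomState
    np.random.set_state(random_state.get_state())
    # anything downstream that draws from np.random is now reproducible
    print(np.random.rand(3))
```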
Gotcha!
Ah, no, you're right. I was getting the same result when I set the seed due to the conversion from float64 -> float32, which made the values similar.
`svds` does use randomness to initialize, though; it would be nice if we could set that.
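One way we could set it, I think, is to hand `svds` an explicit starting vector `v0` drawn from a seeded RNG; a toy sketch (not this PR's code):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

X = sparse_random(200, 50, density=0.1, format="csr", random_state=0)

# v0 must have length min(X.shape); seeding it makes arpack start from the
# same initial vector on every run
rng = np.random.RandomState(42)
v0 = rng.uniform(-1, 1, size=min(X.shape))

u, s, vt = svds(X, k=10, v0=v0)
print(s)
```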
Also, lobpcg doesn't even use v0. It seems that it is deterministic.
Doesn't it? From `svds`:

```python
if solver == 'lobpcg':
    if k == 1 and v0 is not None:
        X = np.reshape(v0, (-1, 1))
    else:
        X = np.random.RandomState(52).randn(min(A.shape), k)
    eigvals, eigvec = lobpcg(XH_X, X, tol=tol ** 2, maxiter=maxiter,
                             largest=largest)
```

It just hard codes the random state...
Oh, right, I missed that. Okay, so it uses v0 when you compute a single singular vector, lol.
@falexwolf, am I wrong in thinking we changed the default to arpack for reproducibility?
I believe so as well. I had issues reproducing my single-cell-tutorial workflow on different systems without using svd_solver='arpack'. It still wasn't 100% reproducible afterwards, but much more similar between runs. Alex recommended this option at the time... and I believe the defaults were changed as a result of that as well.
In scikit-learn==0.22.1, `TruncatedSVD` does not use the `random_state` parameter if the solver is arpack, which is why you still weren't getting 100% reproducibility.

```python
X = check_array(X, accept_sparse=['csr', 'csc'],
                ensure_min_features=2)
random_state = check_random_state(self.random_state)

if self.algorithm == "arpack":
    U, Sigma, VT = svds(X, k=self.n_components, tol=self.tol)
    # svds doesn't abide by scipy.linalg.svd/randomized_svd
    # conventions, so reverse its outputs.
    Sigma = Sigma[::-1]
    U, VT = svd_flip(U[:, ::-1], VT[::-1])
```
If we have arpack, I can also update the PR with the randomized SVD approach. Is it needed?
I'm of the opinion that, until
@ivirshup I think the benchmarks have shown satisfactory performance of this PR. Should we move on to polishing the code organization?
@atarashansky I'll have time to give some more detailed notes this weekend.
@Koncopd, I don't think I've looked over your implementation much. Is it similar to the stalled sklearn PR? If so, do you have a sense of why the
@atarashansky, the
Oh crud, good catch. I was normalizing by the total variance of the top PCs.
Edit: Scratch that, I lied. I'm confused now.
Edit-edit: Apparently the total variance is equal to the total variance of all the original features.
Edit-edit-edit: Fixed 👯♂
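A quick numerical sanity check of that last edit, in plain numpy (nothing scanpy-specific): the covariance eigenvalues sum to the total per-feature variance, which is what makes normalizing by the summed feature variances give a proper ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Xc = X - X.mean(axis=0)

# Eigenvalues of the covariance matrix from the singular values of centered X
s = np.linalg.svd(Xc, compute_uv=False)
eigvals = s ** 2 / (X.shape[0] - 1)

# Their sum equals the summed per-feature (sample) variances of the data
assert np.allclose(eigvals.sum(), X.var(axis=0, ddof=1).sum())
```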
That's odd. sklearn calculates the explained variance and variance ratio as follows:

```python
# Calculate explained variance & explained variance ratio
X_transformed = U * Sigma
self.explained_variance_ = exp_var = np.var(X_transformed, axis=0)
if sp.issparse(X):
    _, full_var = mean_variance_axis(X, axis=0)
    full_var = full_var.sum()
else:
    full_var = np.var(X, axis=0).sum()
self.explained_variance_ratio_ = exp_var / full_var
```

I do it in the same way:

```python
X_pca = (u * s)[:, idx]  # sort PCs in decreasing order
ev = X_pca.var(0)
total_var = _get_mean_var(X)[1].sum()
ev_ratio = ev / total_var
```

I'll investigate...
EDIT: Strange, your assertion error is not reproducible on my end. The code runs fine for me.
I'm not sure we're looking at the same code. I was looking at this:

```python
self.n_samples_, self.n_features_ = n_samples, n_features
self.components_ = V
self.n_components_ = n_components

# Get variance explained by singular values
self.explained_variance_ = (S ** 2) / (n_samples - 1)
total_var = np.var(X, ddof=1, axis=0)
self.explained_variance_ratio_ = \
    self.explained_variance_ / total_var.sum()
self.singular_values_ = S.copy()  # Store the singular values.
```
Tbh, it's just fewer hoops to jump through to write a PCA function that can fit snugly in your preprocessing utils module compared to writing a new class on top of their base class with
I was looking at the TruncatedSVD code. Either way, I'm not able to reproduce your assertion error.
Here's the test I ran for commit 82e3a59:

```python
import scanpy as sc
import numpy as np

pbmc = sc.datasets.pbmc3k()
pbmc.X = pbmc.X.astype(np.float64)
sc.pp.log1p(pbmc)

implicit = sc.pp.pca(pbmc, pca_sparse=True, dtype=np.float64, copy=True)
explicit = sc.pp.pca(pbmc, pca_sparse=False, dtype=np.float64, copy=True)

assert not np.allclose(implicit.uns["pca"]["variance"], explicit.uns["pca"]["variance"])
assert not np.allclose(implicit.uns["pca"]["variance_ratio"], explicit.uns["pca"]["variance_ratio"])
```
As per your suggestion, I switched to calculating the eigenvalues from the singular values as opposed to taking the variance of the principal components. Now the eigenvalues are almost exactly the same.
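For reference, here's a small plain-numpy sketch (not the PR's code) of the relationship being used: in exact arithmetic the singular-value route and the variance-of-the-projection route agree, so the earlier discrepancy was presumably float32 round-off in the projected data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))
Xc = X - X.mean(axis=0)

u, s, vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues straight from the singular values of the centered matrix...
ev_from_s = s ** 2 / (X.shape[0] - 1)
# ...match the per-component variances of the projected data (u * s)
ev_from_proj = np.var(u * s, axis=0, ddof=1)

assert np.allclose(ev_from_s, ev_from_proj)
```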
Part of why I would like this to be in sklearn:
For sklearn submission, I don't think you'd have to implement any classes. Your solution would just be what happened if someone passed a sparse matrix and
About this PR, could you add tests for:
After that and the code reorganization I mentioned above, this should be about ready to merge.
That's fair! Doesn't seem like much work at all. I'll submit a PR to sklearn, then.
Hey @atarashansky, what's your status with this? We're going for a larger release soon, and I'd really like to have this PR in it!
Sorry! I've been preoccupied with some stuff on my end (and also super distracted by this coronavirus hullabaloo). I'll add the requested tests soon.
EDIT: Done!
Great! Yeah, coronavirus is pretty distracting. This last week has definitely felt like a month for me. So still to do:
@atarashansky, I've merged this through #1162. This is really great! Thanks for implementing this!
Has one of you opened this PR to sklearn? I just wanted to chime in and say that it'd be great if sklearn finally started to support this. Definitely worth trying to get it in there.
@atarashansky did you still want to do this? I’d be happy to give it a shot if not.
Sorry about going MIA! You can open the PR. I’ve been totally slammed working on a publication.
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>

~/.local/lib/python3.8/site-packages/scanpy/preprocessing/_pca.py in pca(data, n_comps, zero_center, svd_solver, random_state, return_info, use_highly_variable, dtype, copy, chunked, chunk_size)
    201         )
    202
--> 203         output = _pca_with_sparse(X, n_comps, solver=svd_solver)
    204         # this is just a wrapper for the results
    205         X_pca = output['X_pca']

~/.local/lib/python3.8/site-packages/scanpy/preprocessing/_pca.py in _pca_with_sparse(X, npcs, solver, mu, random_state)
    293         return XHmat(x) - mhmat(ones(x))
    294
--> 295     XL = LinearOperator(
    296         matvec=matvec,
    297         dtype=X.dtype,

TypeError: __init__() got an unexpected keyword argument 'rmatmat'
```

I got this error once with the new sparse PCA. @atarashansky, do we need to write an explicit scipy version as a dependency? It might be something weird with my setup too.
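If it does turn out to be a scipy version issue, a lower bound or an import-time guard like the hypothetical sketch below could make the failure clearer. The scipy>=1.4 figure is an assumption about when `LinearOperator` gained `rmatmat` support and would need checking against the scipy changelog:

```python
# Hypothetical guard; the minimum version is an assumption to verify.
import scipy
from packaging.version import Version

if Version(scipy.__version__) < Version("1.4"):
    raise ImportError(
        "sc.pp.pca(..., pca_sparse=True) needs a scipy whose LinearOperator "
        "accepts the 'rmatmat' keyword"
    )
```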





I know this (quite ancient) pull request has been open (#403), but I wasn't sure on its status. I think the consensus was to wait for sklearn to integrate the necessary changes? If that's still the case, then please feel free to remove this PR.
Here I make use of scipy's extremely nifty LinearOperator class to customize the dot product functions for an input sparse matrix. In this case, the 'custom' dot product performs implicit mean centering.
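Roughly, the idea looks like the sketch below. It is illustrative only: the function name and signature are mine, it hard-wires the arpack path, and the actual implementation also supplies `matmat`/`rmatmat` handlers rather than relying on the defaults:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

def centered_svd(X, n_comps, seed=0):
    """Truncated SVD of (X - column means) for a sparse X, without densifying X."""
    mu = np.asarray(X.mean(axis=0))    # row vector of column means, shape (1, n_features)
    ones = np.ones((1, X.shape[0]))

    def matvec(v):
        # (X - 1 mu) v = X v - mu v
        return X @ v - mu @ v

    def rmatvec(v):
        # (X - 1 mu)^T v = X^T v - mu^T (1^T v)
        return X.T @ v - mu.T @ (ones @ v)

    XL = LinearOperator(shape=X.shape, dtype=X.dtype,
                        matvec=matvec, rmatvec=rmatvec)

    # Seeded starting vector so arpack's iterations are reproducible
    v0 = np.random.RandomState(seed).uniform(-1, 1, min(X.shape))
    u, s, vt = svds(XL, k=n_comps, v0=v0)

    # svds returns singular values in ascending order; flip to descending
    idx = np.argsort(-s)
    return u[:, idx], s[idx], vt[idx]
```

Something like `u, s, vt = centered_svd(adata.X, 50)` then gives `X_pca = u * s`, matching how the components are used elsewhere in this thread.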
In my benchmarks, performing implicit mean centering in this way does not affect the runtime whatsoever. However, this approach has to use svds, for which randomized SVD is not implemented, so we have to use 'arpack', which can be significantly slower (but not intractably so; in my hands, I could still do PCA on datasets of 200k+ cells in minutes, and it sure beats densifying the data). If you want more thorough benchmarks, I am happy to generate them!
The way I incorporated this functionality into scanpy/preprocessing/_simple.py might be questionable, and I would love any suggestions or advice on how to better integrate this if there is interest in pushing this PR through. Let me know!