Pca for sparse data #403
|
This is nice! Thank you! It appears to me that the benchmarks show that this only becomes relevant for very large data. So we need to be mindful to not break backward compatibility for all the small and medium-size datasets that people use (which we do by introducing the tiny difference). Don't you think that in the light of this, it would be better to leave the default as is (densifying) and have an option |
Hm, even for my example it is 77.14 MiB vs 893.92 MiB, so more than a tenfold difference. This seems large to me, no? |
Yes, it's definitely large, and it's awesome that you solved this problem! I just meant that it's not hitting people's computational resource limits: your example is 60K x 2K, so quite big already, and if you densify you need 800MB, which is easily available even on a laptop. That's what I meant. What do you think? |
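For reference, the densified footprint can be estimated with a few lines of arithmetic. This is a minimal sketch, assuming float64 values, 32-bit CSR indices, and exactly 60,000 x 2,000 entries (the thread only says "60K x 2K"); `dense_mib` and `csr_mib` are hypothetical helper names, not part of Scanpy.

```python
import numpy as np


def dense_mib(n_obs, n_vars, dtype=np.float64):
    """Memory of a dense n_obs x n_vars array, in MiB."""
    return n_obs * n_vars * np.dtype(dtype).itemsize / 2**20


def csr_mib(n_obs, nnz, dtype=np.float64, index_dtype=np.int32):
    """Memory of a CSR matrix with nnz stored values, in MiB.

    Counts the three CSR arrays: data (nnz), indices (nnz), indptr (n_obs + 1).
    """
    value_bytes = nnz * np.dtype(dtype).itemsize
    index_bytes = (nnz + n_obs + 1) * np.dtype(index_dtype).itemsize
    return (value_bytes + index_bytes) / 2**20


# Assumed shape and dtype; the real dataset may differ slightly.
print(f"dense float64: {dense_mib(60_000, 2_000):.1f} MiB")
print(f"dense float32: {dense_mib(60_000, 2_000, np.float32):.1f} MiB")
```

Under these assumptions the float64 estimate is of the same order as the 893.92 MiB reported above; the sparse footprint depends on the number of nonzeros, which the thread does not state.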
|
Yeah, this seems important only for large datasets in that sense. So, I will add a `sparse_pca` option, `False` by default. |
|
Not sure what kind of test to add for this... |
|
As discussed, @Koncopd will try to integrate this into scikit-learn itself and not into Scanpy. 😄 |
|
A similar pull request already exists in sklearn.
Force-pushed: 3efb194 to fc84096
flying-sheep
left a comment
Sorry about causing merge conflicts. Squashing all your commits and then rebasing on master is probably going to be easier than rebasing all the commits individually.
@@ -0,0 +1,70 @@
The file name needs to start with `_`, like the others.
@@ -0,0 +1,70 @@
import numpy as np
from distutils.version import StrictVersion
`distutils.version` is broken. Please use `from packaging import version` and then `version.parse`. (I added `packaging` to the requirements on master.)
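A minimal sketch of the suggested replacement, assuming the `packaging` package is installed. The helper name `at_least` and the version strings are hypothetical and only illustrate how `version.parse` compares versions, including development releases.

```python
from packaging import version


def at_least(installed: str, required: str) -> bool:
    """Return True if `installed` is the same as or newer than `required`."""
    return version.parse(installed) >= version.parse(required)


# Hypothetical version strings, for illustration only.
print(at_least("0.20.0", "0.19"))     # a release compared with an older one
print(at_least("0.20.dev0", "0.20"))  # a dev release sorts before its final release
```

In the reviewed file, a check like this would decide which branch of the version-dependent import to take, with `sklearn.__version__` as the installed version.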
else:
    from sklearn.utils.sparsefuncs_fast import csr_mean_variance_axis0 as mean_variance

# need to pass issparse check
I don’t understand this comment.
if k >= n_features:
    raise ValueError('n_components must be < n_features;'
                     ' got %d >= %d' % (k, n_features))
U, Sigma, VT = randomized_svd(C, k, n_iter=self.n_iter,
Please don’t use uppercase letters in variable names.
    data: Union[AnnData, np.ndarray, spmatrix],
    n_comps: int = N_PCS,
    zero_center: Optional[bool] = True,
    sparse_pca: Optional[bool] = False,
Maybe make it `None` by default, and, based on the benchmarks, use it when the data is sufficiently large.
In this case, make sure you document this in the parameter docs and log a debug message.
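One way the `None` default could be resolved is sketched below. Everything here is an assumption for illustration: the helper `resolve_sparse_pca`, the constant `SPARSE_PCA_MIN_ELEMENTS`, and the threshold value are hypothetical and would have to be derived from the benchmarks; none of them exist in Scanpy.

```python
import logging
from typing import Optional

from scipy import sparse

logger = logging.getLogger(__name__)

# Hypothetical cutoff (number of matrix entries); not taken from the benchmarks.
SPARSE_PCA_MIN_ELEMENTS = 50_000_000


def resolve_sparse_pca(
    X,
    sparse_pca: Optional[bool] = None,
    min_elements: int = SPARSE_PCA_MIN_ELEMENTS,
) -> bool:
    """Decide whether to take the sparse code path.

    An explicit True/False from the caller always wins. With None, the sparse
    path is chosen only for sparse input with at least `min_elements` entries.
    """
    if sparse_pca is not None:
        return bool(sparse_pca)
    n_elements = X.shape[0] * X.shape[1]
    use_sparse = bool(sparse.issparse(X) and n_elements >= min_elements)
    logger.debug(
        "sparse_pca=None resolved to %s (sparse input: %s, %d entries, cutoff %d)",
        use_sparse, sparse.issparse(X), n_elements, min_elements,
    )
    return use_sparse
```

Keeping the decision in one small function makes the automatic behaviour easy to document in the parameter docs and easy to test in isolation.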
|
Shall we replace `sparse_pca` with something else like `implicit_centering` or so? `sparse_pca` might give rise to confusion since there is indeed a PCA technique called sparse PCA. |
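To illustrate what "implicit centering" can mean, here is a minimal sketch that applies the column means inside a SciPy `LinearOperator` so the sparse matrix is never densified. This is not the code in this PR (which calls `randomized_svd`); it swaps in `scipy.sparse.linalg.svds`, and the function name `pca_implicit_centering` is hypothetical.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import LinearOperator, svds


def pca_implicit_centering(X, n_comps):
    """PCA of a sparse matrix without materialising the centered matrix.

    The centered matrix X - 1 mu^T is represented only through its products
    with vectors, so memory stays proportional to the number of nonzeros.
    """
    X = sparse.csr_matrix(X, dtype=np.float64)
    n_obs, n_vars = X.shape
    mu = np.asarray(X.mean(axis=0)).ravel()  # column means, shape (n_vars,)

    def matvec(v):
        # (X - 1 mu^T) v = X v - (mu . v) 1
        v = np.asarray(v).ravel()
        return X @ v - np.full(n_obs, mu @ v)

    def rmatvec(u):
        # (X - 1 mu^T)^T u = X^T u - mu (1 . u)
        u = np.asarray(u).ravel()
        return X.T @ u - mu * u.sum()

    centered = LinearOperator(
        (n_obs, n_vars), matvec=matvec, rmatvec=rmatvec, dtype=np.float64
    )
    u, s, vt = svds(centered, k=n_comps)
    order = np.argsort(s)[::-1]  # sort components by decreasing singular value
    u, s, vt = u[:, order], s[order], vt[order]
    return u * s, vt  # (scores, components)
```

The signs of individual components are arbitrary, so comparisons against a dense PCA should be made up to sign.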
|
Hm, it was decided earlier to suspend this PR. |
|
Ah, then we should leave it open and unchanged until there’s a decision on the sklearn PR. Sorry I missed the decision. |
|
What's the current state here? This just came out |
|
Last time I checked it was true. Need to check again, but I bet it is still the same. |
|
That's still the case (at least for randomized PCA @Koncopd linked above), though it looks like there may be another path forward using other solvers: scikit-learn/scikit-learn#12794. Still needs an implementation though. |
|
You could just add a This only works for the |
For this - #393
Benchmarks
https://github.com/Koncopd/anndata-scanpy-benchmarks/blob/master/pca_for_sparse.ipynb