Description
Randomized sklearn.decomposition.PCA uses about 2*n_samples*n_features of memory (RAM), including the input samples themselves, while fbpca (https://github.com/facebook/fbpca) uses about half as much. Is this expected behaviour? (I understand that the sklearn version computes more things, like explained_variance_.)
Steps/Code to Reproduce
sklearn version:
import numpy as np
from sklearn.decomposition import PCA
samples = np.random.random((20000, 16384))
pca = PCA(copy=False, n_components=128, svd_solver='randomized', iterated_power=4)
pca.fit_transform(samples)
fbpca version:
import numpy as np
import fbpca
samples = np.random.random((20000, 16384))
(U, s, Va) = fbpca.pca(samples, k=128, n_iter=4)
Expected Results
Randomized sklearn.decomposition.PCA uses about n_samples*n_features + n_samples*n_components + <variance matrices etc.> memory (RAM).
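For concreteness, with the example above (float64), the expectation works out roughly as follows; the size of the extra variance-related buffers is left out as an unknown:

n_samples, n_features, n_components = 20000, 16384, 128
data_bytes = n_samples * n_features * 8           # input matrix, ~2.62 GB
transformed_bytes = n_samples * n_components * 8  # projected samples, ~0.02 GB
print(data_bytes / 1e9, transformed_bytes / 1e9)
# Observed peak (Actual Results below) is about 2 * n_samples * n_features * 8 ≈ 5.24 GB.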
Actual Results
Randomized sklearn.decomposition.PCA uses about 2*n_samples*n_features memory (RAM). We see the memory peaks at the transform step.
(Memory usage plot generated with memory_profiler and gnuplot.)
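For reference, the kind of curve we plotted can be reproduced along these lines, sampling the process memory with memory_profiler while the fit runs (the wrapper function name is just for illustration):

import numpy as np
from memory_profiler import memory_usage
from sklearn.decomposition import PCA

def run_sklearn_pca():
    samples = np.random.random((20000, 16384))
    pca = PCA(copy=False, n_components=128, svd_solver='randomized', iterated_power=4)
    return pca.fit_transform(samples)

# Sample the resident memory of this process every 0.1 s while the PCA runs (values in MiB).
usage = memory_usage((run_sklearn_pca, (), {}), interval=0.1)
print('peak MiB:', max(usage))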
Versions
Darwin-17.4.0-x86_64-i386-64bit
Python 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.14.3
SciPy 1.1.0
Scikit-Learn 0.19.1
(tested on different Linux machines as well)
P.S.
We are trying to perform PCA for large matrices (2m x 16k, ~110GB). IncrementalPCA is very slow for us. Randomized PCA is very fast, but we are trying to reduce memory consumption to use cheaper instances.
Thank you.
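P.P.S. For completeness, a sketch of the kind of lower-level workaround we have in mind (an untested assumption that it avoids the extra peak): centering the data in place and calling randomized_svd, the routine PCA's randomized solver is built on, directly.

import numpy as np
from sklearn.utils.extmath import randomized_svd

samples = np.random.random((20000, 16384))

# Center in place instead of letting the estimator handle it, to avoid an extra full-size buffer.
samples -= samples.mean(axis=0)

# n_iter plays the role of iterated_power in PCA(svd_solver='randomized').
U, S, VT = randomized_svd(samples, n_components=128, n_iter=4, random_state=0)
transformed = U * S  # same result (up to sign) as pca.fit_transform(samples)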