Randomized PCA.transform uses a lot of RAM #11102

Closed
@vortexdev

Description

Randomized sklearn.decomposition.PCA uses about 2 * n_samples * n_features memory (RAM), including the input samples themselves.
fbpca (https://github.com/facebook/fbpca) uses about half as much.

Is this expected behaviour?
(I understand that the sklearn version computes more things, like explained_variance_.)

Steps/Code to Reproduce

sklearn version:

import numpy as np
from sklearn.decomposition import PCA

# 20000 x 16384 float64 matrix, ~2.44 GiB
samples = np.random.random((20000, 16384))
pca = PCA(copy=False, n_components=128, svd_solver='randomized', iterated_power=4)
pca.fit_transform(samples)

fbpca version:

import numpy as np
import fbpca

# same 20000 x 16384 float64 matrix, ~2.44 GiB
samples = np.random.random((20000, 16384))
(U, s, Va) = fbpca.pca(samples, k=128, n_iter=4)

Expected Results

Randomized sklearn.decomposition.PCA uses about n_samples * n_features + n_samples * n_components + <variance matrices etc.> memory (RAM).
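For the reproduction above, that works out roughly as follows (a back-of-the-envelope sketch assuming float64 and ignoring the smaller variance buffers):

import numpy as np

# Rough memory arithmetic for the 20000 x 16384 float64 example above.
n_samples, n_features, n_components = 20000, 16384, 128
bytes_per_float = 8

data = n_samples * n_features * bytes_per_float            # ~2.44 GiB input matrix
transformed = n_samples * n_components * bytes_per_float   # ~0.02 GiB output
expected = data + transformed                               # ~2.46 GiB
observed = 2 * data                                         # ~4.88 GiB peak seen in practice

print(f"expected ~{expected / 2**30:.2f} GiB, observed ~{observed / 2**30:.2f} GiB")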

Actual Results

Randomized sklearn.decomposition.PCA uses about 2 * n_samples * n_features memory (RAM).
We see peaks at the transform step.
[Memory profile plot: pca_memory_test, generated with memory_profiler and gnuplot]
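For reference, a minimal way to reproduce the peak measurement (a sketch using memory_profiler's memory_usage; the gnuplot plotting step is omitted):

import numpy as np
from memory_profiler import memory_usage
from sklearn.decomposition import PCA

def run():
    samples = np.random.random((20000, 16384))
    pca = PCA(copy=False, n_components=128, svd_solver='randomized', iterated_power=4)
    pca.fit_transform(samples)

# Sample process memory every 0.1 s while run() executes (values in MiB).
usage = memory_usage((run, (), {}), interval=0.1)
print(f"peak: {max(usage):.0f} MiB")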

Versions

Darwin-17.4.0-x86_64-i386-64bit
Python 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.14.3
SciPy 1.1.0
Scikit-Learn 0.19.1
(tested on different Linux machines as well)

P.S.

We are trying to perform PCA on large matrices (2M x 16k, ~110 GB). IncrementalPCA is very slow for us. Randomized PCA is very fast, but we would like to reduce memory consumption so we can use cheaper instances.
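One workaround we are considering is a sketch along the following lines, assuming the extra copy comes from re-centering the data during fit/transform. It uses sklearn.utils.extmath.randomized_svd directly and centers in place, so it is not the exact code path PCA takes internally and skips things like explained_variance_:

import numpy as np
from sklearn.utils.extmath import randomized_svd

def randomized_pca_inplace(X, n_components, n_iter=4):
    """Sketch: randomized PCA that centers X in place to avoid a second
    n_samples x n_features copy. Mutates X; X must be float and writable."""
    mean = X.mean(axis=0)
    X -= mean                                # center in place, no extra copy
    U, S, Vt = randomized_svd(X, n_components=n_components, n_iter=n_iter)
    X_transformed = U * S                    # (n_samples, n_components) scores
    X += mean                                # restore the input if it is still needed
    return X_transformed, Vt, mean

samples = np.random.random((20000, 16384))
X_t, components, mean = randomized_pca_inplace(samples, n_components=128)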

Thank you.
