
Conversation

Contributor

@Charlie-XIAO Charlie-XIAO commented Jul 17, 2024

Fixes #28386, motivated by #18689.

  • Use the _implicit_column_offset operator as implemented in #18689 (an illustrative sketch of the idea follows this list).
  • Add an svd_solver parameter supporting "full" (default, original behavior) and "arpack" (truncated SVD).
  • Implement an _implicit_vstack operator to avoid densifying data in intermediate steps.
  • Add tests for _implicit_vstack.
  • Add tests for the IncrementalPCA with svd_solver="arpack".
  • Test performance improvement on fetch_20newsgroups_vectorized dataset and update changelog.
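For context, here is a minimal sketch of the implicit column-offset idea; it is illustrative only, and implicit_column_offset below is not the actual helper from #18689. The point is a SciPy LinearOperator that behaves like the centered dense matrix X - offset without ever materializing it, so ARPACK / svds can run on centered sparse data. The _implicit_vstack operator in this PR plays a similar role for stacking batches without densifying them.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds


def implicit_column_offset(X, offset):
    # LinearOperator equivalent to the dense matrix X - offset (row-wise),
    # built only from products with the sparse X and the offset vector.
    offset = np.asarray(offset).ravel()
    ones = np.ones(X.shape[0])
    return LinearOperator(
        shape=X.shape,
        matvec=lambda v: X @ v.ravel() - ones * (offset @ v.ravel()),
        rmatvec=lambda v: X.T @ v.ravel() - offset * v.sum(),
        matmat=lambda M: X @ M - np.outer(ones, offset @ M),
        dtype=X.dtype,
    )


X = sp.random(50, 20, density=0.1, random_state=0, format="csr")
mean = np.asarray(X.mean(axis=0)).ravel()
op = implicit_column_offset(X, mean)

# Truncated SVD of the centered data without ever densifying X.
U, S, Vt = svds(op, k=5)

# Sanity check against the explicit dense computation.
v = np.ones(20)
assert np.allclose(op.matvec(v), (X.toarray() - mean) @ v)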

Enhancement Overview

The following code uses the first 3000 entries from the 20 newsgroups training set, of shape (3000, 130107). When both routines use truncated SVD via ARPACK, the sparse routine is roughly 10x faster and peaks at roughly 3x less memory than the dense routine. Compared with the dense routine using full SVD (the original setup), it is roughly 70x faster and peaks at roughly 5x less memory.

Example code

import time
import tracemalloc
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.decomposition import IncrementalPCA


def measure_performance(func, *args, **kwargs):
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"time_s": elapsed, "peak_mb": peak / (1024**2)}, result


def sparse_ipca_arpack(X):
    ipca = IncrementalPCA(n_components=20, svd_solver="arpack")
    coords = ipca.fit_transform(X)
    return ipca, coords


def dense_ipca_arpack(X):
    X_dense = X.toarray()
    ipca = IncrementalPCA(n_components=20, svd_solver="arpack")
    coords = ipca.fit_transform(X_dense)
    return ipca, coords


def dense_ipca_full(X):
    X_dense = X.toarray()
    ipca = IncrementalPCA(n_components=20, svd_solver="full")
    coords = ipca.fit_transform(X_dense)
    return ipca, coords


def main():
    n_samples = 3000
    X, _ = fetch_20newsgroups_vectorized(return_X_y=True)
    X = X[:n_samples]

    methods = [
        ("Sparse ARPACK", sparse_ipca_arpack),
        ("Dense ARPACK", dense_ipca_arpack),
        ("Dense Full", dense_ipca_full),
    ]
    metrics = {}
    models = {}
    coords = {}

    print()
    print(f"\033[1mBenchmarking on {n_samples} samples...\033[0m")
    for name, func in methods:
        print(f"Running {name}...", end=" ", flush=True)
        stats, output = measure_performance(func, X)
        model, coord = output
        metrics[name] = stats
        models[name] = model
        coords[name] = coord
        print(f"Time = {stats['time_s']:.3f}s, Peak Memory = {stats['peak_mb']:.2f}MB")

    print()
    print("\033[1mVerifying results...\033[0m")
    base = "Dense Full"
    base_model = models[base]
    for name, _ in methods:
        if name == base:
            continue
        model = models[name]
        assert np.allclose(base_model.components_, model.components_)
        assert np.allclose(base_model.explained_variance_, model.explained_variance_)
        assert np.allclose(base_model.singular_values_, model.singular_values_)
        print(f"- {base} vs {name}: OK")
    print("All results are equivalent! ✅")

    print()
    print("\033[1mSummarizing performance and memory usage...\033[0m")
    base_stats = metrics[base]
    for name, _ in methods:
        if name == base:
            metrics[name]["speedup"] = 1.0
            metrics[name]["memory_saving"] = 1.0
        else:
            t = metrics[name]["time_s"]
            m = metrics[name]["peak_mb"]
            metrics[name]["speedup"] = base_stats["time_s"] / t
            metrics[name]["memory_saving"] = base_stats["peak_mb"] / m

    df = pd.DataFrame(metrics).T
    df = df[["time_s", "peak_mb", "speedup", "memory_saving"]]
    print(df.round(3))


if __name__ == "__main__":
    main()

Benchmarking on 3000 samples...
Running Sparse ARPACK... Time = 1.716s, Peak Memory = 3005.11MB
Running Dense ARPACK... Time = 18.594s, Peak Memory = 9320.87MB
Running Dense Full... Time = 122.849s, Peak Memory = 14960.27MB

Verifying results...
- Dense Full vs Sparse ARPACK: OK
- Dense Full vs Dense ARPACK: OK
All results are equivalent! ✅

Summarizing performance and memory usage...
                time_s    peak_mb  speedup  memory_saving
Sparse ARPACK    1.716   3005.110   71.586          4.978
Dense ARPACK    18.594   9320.873    6.607          1.605
Dense Full     122.849  14960.265    1.000          1.000

Additional Comments & Questions

About the new svd_solver parameter: This is added because I found no other way to support sparse input without densifying, and I think it is reasonable to add. "full" (the default) keeps the original behavior, where sparse data is densified in batches. "arpack" is the truncated-SVD version that does not densify sparse data. I did not add an "auto" option because ideally it should select "arpack" for sparse data, which is not the default behavior. Perhaps we can still add an "auto" option (not as the default) and make it the default some day?
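For concreteness, a minimal usage sketch assuming the parameter proposed in this PR (svd_solver is not part of the released IncrementalPCA API):

import scipy.sparse as sp
from sklearn.decomposition import IncrementalPCA

X = sp.random(1000, 500, density=0.01, format="csr", random_state=0)

# Proposed: truncated SVD via ARPACK, sparse batches are never densified.
ipca_arpack = IncrementalPCA(n_components=10, svd_solver="arpack").fit(X)

# Default: full SVD, each sparse batch is densified as before.
ipca_full = IncrementalPCA(n_components=10, svd_solver="full").fit(X)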

About sparse support: Previously the fit method accepted CSR, CSC, and LIL formats. This PR no longer supports the LIL format because the sparse version of _incremental_mean_and_var only supports CSR and CSC. We could convert LIL to CSR/CSC internally to keep supporting that format, but is this necessary? Maybe we can just add a note in the changelog, since it is very easy for users to do the conversion themselves (see the sketch below).
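For example, the user-side conversion is a one-liner (hypothetical LIL data, assuming the CSR/CSC-only sparse path of this PR):

import scipy.sparse as sp
from sklearn.decomposition import IncrementalPCA

X_lil = sp.lil_matrix(sp.random(200, 50, density=0.1, random_state=0))

# Convert once up front; everything downstream stays unchanged.
ipca = IncrementalPCA(n_components=5, svd_solver="arpack").fit(X_lil.tocsr())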

About testing: I currently extended most tests to cover both svd_solver values on dense data; do I need to extend them to dense and sparse containers as well? Currently the only test that uses sparse data with the ARPACK solver is test_incremental_pca_sparse, which performs some basic validation as before. Is this enough? (A possible parametrization is sketched below.)
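One possible parametrization, sketched with assumed names (test_incremental_pca_sparse_solvers is hypothetical and not in the test suite), if coverage over both solvers and both supported sparse containers is wanted:

import pytest
import scipy.sparse as sp
from sklearn.decomposition import IncrementalPCA


@pytest.mark.parametrize("svd_solver", ["full", "arpack"])
@pytest.mark.parametrize("container", [sp.csr_matrix, sp.csc_matrix])
def test_incremental_pca_sparse_solvers(svd_solver, container):
    # Small random sparse data; both solvers should fit and transform it.
    X = container(sp.random(60, 10, density=0.5, random_state=0))
    ipca = IncrementalPCA(n_components=3, svd_solver=svd_solver, batch_size=20)
    Xt = ipca.fit_transform(X)
    assert Xt.shape == (60, 3)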


github-actions bot commented Jul 17, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 294f615. Link to the linter CI: here

exp_var > self.noise_variance_,
exp_var_diff,
xp.asarray(0.0, device=device(exp_var)),
xp.asarray(1e-10, device=device(exp_var)),
Contributor Author

0.0 becomes nan when doing 1.0 / exp_var_diff later, and linalg.inv cannot then be taken of the result. I wonder whether giving it a small value instead is reasonable; otherwise, perhaps exp_var should theoretically always be greater than self.noise_variance_, which in turn would mean my implementation is incorrect somewhere?

Note: test_incremental_pca_sparse triggers the issue when n_components = X.shape[1] - 1.
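A NumPy-only sketch of the clamping discussed here (made-up values; the array-API namespace xp and the surrounding code are omitted): filling with a small epsilon instead of 0.0 keeps 1.0 / exp_var_diff finite so a later linalg.inv does not fail.

import numpy as np

exp_var = np.array([2.0, 1.0, 0.5])      # per-component explained variance
noise_variance = 0.5                     # estimated noise variance
exp_var_diff = exp_var - noise_variance  # 0.0 for the last component

# Fill with 1e-10 instead of 0.0 where exp_var <= noise_variance so that
# the reciprocal below has no inf/nan entries.
exp_var_diff = np.where(exp_var > noise_variance, exp_var_diff, 1e-10)
scale = 1.0 / exp_var_diff               # finite everywhere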

@Charlie-XIAO Charlie-XIAO changed the title from "ENH support partial fitting incremental PCA on sparse data" to "ENH add ARPACK solver to IncrementalPCA to avoid densifying sparse data" Jul 19, 2024
@Charlie-XIAO Charlie-XIAO marked this pull request as ready for review July 19, 2024 13:04

Development

Successfully merging this pull request may close these issues.

Proper sparse support in IncrementalPCA