HistGradientBoostingClassifier/Regressor 15x slowdown on small data problems compared to disabled OpenMP threading #30662

Open
@ogrisel

Description

This problem was first described as part of #14306, but I think it might make sense to open a dedicated issue for the particular problem of small data shapes.

The fundamental problem seems to be that the OpenMP threadpool overhead is frequently detrimental to performance on small problems. Note that OpenMP threading is enabled by default in scikit-learn.

Here are the relative durations of running cross-validation on this model on data of various sizes, with and without threading enabled:

[Figure: speedup vs. number of threads for n_features=10]

[Figure: speedup vs. number of threads for n_features=100]

[Figure: speedup vs. number of threads for n_features=1000]

Speed-ups are measured as the relative improvement in fit time compared to a sequential fit (multi-threading disabled).
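To make the metric concrete, here is a hypothetical worked example (the durations below are made up for illustration, not taken from the benchmark data):

```python
# Hypothetical durations illustrating the metric: speedup = t_sequential / t_parallel.
t_one_thread = 2.0      # seconds with multi-threading disabled (made-up value)
t_eight_threads = 30.0  # seconds with 8 threads on a tiny problem (made-up value)
speedup = t_one_thread / t_eight_threads
print(f"speedup: {speedup:.3f}x")  # 1/15 ≈ 0.067x, i.e. a 15x slowdown
```

Values below 1x in the plots therefore indicate that threading made the fit slower.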

This was collected on an Apple M1 CPU (4 performance cores and 4 efficiency cores) with the llvm-openmp libomp installed from conda-forge.

System:
    python: 3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:19:53) [Clang 18.1.8 ]
executable: /Users/ogrisel/miniforge3/envs/dev/bin/python
   machine: macOS-15.2-arm64-arm-64bit

Python dependencies:
      sklearn: 1.7.dev0
          pip: 24.3.1
   setuptools: 75.6.0
        numpy: 2.0.2
        scipy: 1.14.1
       Cython: 3.0.11
       pandas: 2.2.3
   matplotlib: 3.10.0
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/ogrisel/miniforge3/envs/dev/lib/libomp.dylib
        version: None

Here are my conclusions:

  • Using threads can cause a huge slowdown on the smallest problems (15x slower than running with threads disabled);
  • OpenMP threading becomes beneficial only on large datasets (more than 100k data points with hundreds of dimensions);
  • Using the 4 efficiency cores is almost always detrimental. For Intel/AMD x86 CPUs we already disable SMT/HyperThreading for a similar reason (see PERF set openmp to use only physical cores by default #26082), but I don't know how to do something similar for efficiency cores in a platform-agnostic way;
  • We should probably disable multi-threading, or at least reduce the number of threads with a heuristic that depends on the data shape such as:
    n_threads = min(
        min(
            max(n_samples * n_features // int(1e5), 1),
            n_features // 2,  # to avoid stragglers
        ),
        joblib.cpu_count(only_physical_cores=True)
    )
  • Implementing sample-wise parallelism could probably help a bit in some cases (when n_features << max(n_cores, n_samples)), but I still think OpenMP threading should be disabled on the smallest problems;
  • Results will probably also be impacted if we merge ENH HGBT histograms in blocks of features #27168.
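The heuristic above can be sketched as a standalone function. `pick_n_threads` is a hypothetical name; a real implementation would call `joblib.cpu_count(only_physical_cores=True)` rather than the `os.cpu_count()` fallback used here, and I added a floor of one thread on the `n_features // 2` term:

```python
import os


def pick_n_threads(n_samples, n_features, n_physical_cores=None):
    """Hypothetical helper sketching the data-shape-dependent heuristic above."""
    if n_physical_cores is None:
        # Stand-in for joblib.cpu_count(only_physical_cores=True).
        n_physical_cores = os.cpu_count() or 1
    return min(
        max(n_samples * n_features // int(1e5), 1),  # scale with workload size
        max(n_features // 2, 1),  # avoid stragglers, floored at one thread
        n_physical_cores,
    )


# Evaluated on a few of the benchmark shapes, assuming 4 physical cores:
for shape in [(100, 10), (10_000, 100), (1_000_000, 100)]:
    print(shape, "->", pick_n_threads(*shape, n_physical_cores=4))
# (100, 10) -> 1, (10000, 100) -> 4, (1000000, 100) -> 4
```

With this rule, the smallest problems fall back to sequential fits while large ones still use all physical cores.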

Benchmark result data

bench_num_threads.parquet

Benchmark code

# %%
from time import perf_counter

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, ShuffleSplit
from threadpoolctl import threadpool_limits
import pandas as pd

# Imported only to fail fast if no parquet engine is available for pandas:
import pyarrow  # noqa: F401
import fastparquet  # noqa: F401

# %%

records = []
data_shapes = [
    # n_features == 10
    (100, 10),
    (1_000, 10),
    (10_000, 10),
    (100_000, 10),
    (1_000_000, 10),
    # n_features == 100
    (100, 100),
    (1_000, 100),
    (10_000, 100),
    (100_000, 100),
    (1_000_000, 100),
    # n_features == 1000
    (100, 1_000),
    (1_000, 1_000),
]
# all_max_num_threads = [1, 2, 4, 8, None]  # None means using all available threads
all_max_num_threads = [1, 2, 4, 8]


for max_num_threads in all_max_num_threads:
    for n_samples, n_features in data_shapes:
        X, y = make_regression(
            n_samples=n_samples, n_features=n_features, random_state=0
        )
        with threadpool_limits(limits=max_num_threads):
            # Make sure we do enough CV splits to have non-trivial runtimes on
            # smaller problems.
            n_splits = max(int(1e5) // (n_samples * n_features), 5)
            start = perf_counter()
            scores = cross_val_score(
                HistGradientBoostingRegressor(),
                X,
                y,
                cv=ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=0),
            )
            end = perf_counter()
        record = {
            "n_samples": n_samples,
            "n_features": n_features,
            "n_splits": n_splits,
            "max_num_threads": max_num_threads,
            "duration": end - start,
        }
        print(record)
        records.append(record)

records = pd.DataFrame(records)
records.to_parquet("bench_num_threads.parquet")

Plotting code

# %%
import matplotlib.pyplot as plt
import pandas as pd

records = pd.read_parquet("bench_num_threads.parquet")
# %%
# For each (n_samples, n_features, n_splits) group, we compute the relative speed up
# with respect to running on a single thread. We append the result as a new "speedup" column
# to the records DataFrame.
# %%
speedups = []
for (n_samples, n_features, n_splits), group in records.groupby(
    ["n_samples", "n_features", "n_splits"]
):
    ref_duration = group[group["max_num_threads"] == 1]["duration"].values[0]
    group = group.copy()  # avoid SettingWithCopyWarning on the groupby slice
    group["speedup"] = (ref_duration / group["duration"]).round(2)
    speedups.append(group)

speedups = pd.concat(speedups)
speedups["slowdown"] = (1 / speedups["speedup"]).round(2)

# %%
speedups.sort_values("speedup", ascending=False).head(10)

# %%
speedups.sort_values("slowdown", ascending=False).head(10)


# %%
# We can now plot the speedup as a function of the number of threads used.
# %%
def plot_speed_curves(speedups, n_features=100):
    fig, ax = plt.subplots()
    for (n_samples, n_features, n_splits), group in speedups.query(
        "n_features == @n_features"
    ).groupby(["n_samples", "n_features", "n_splits"]):
        group.plot(
            x="max_num_threads",
            y="speedup",
            ax=ax,
            label=f"X.shape=({n_samples}, {n_features})",
        )
    ax.hlines(1, 0.95, 8.5, linestyles="--", color="gray")
    ax.set(
        xscale="log",
        xlabel="Number of threads",
        xticks=[1, 4, 8],
        xticklabels=["1", "4", "8"],
        yscale="log",
        ylabel="Speedup",
        yticks=[0.1, 0.2, 0.5, 1, 2, 5, 10],
        yticklabels=["0.1x", "0.2x", "0.5x", "1x", "2x", "5x", "10x"],
        title=f"Impact of threading on fit time (n_features={n_features})",
    )


plot_speed_curves(speedups, n_features=10)
plot_speed_curves(speedups, n_features=100)
plot_speed_curves(speedups, n_features=1000)
