Description
This problem was first described as part of #14306, but I think it might make sense to open a dedicated issue for the particular problem of small data shapes.
The fundamental problem seems to be that the OpenMP threadpool overhead is frequently detrimental to performance on small data shapes. Note that OpenMP threading is enabled by default in scikit-learn.
Here are the relative durations of running cross-validation on this model on data of various sizes, with and without threading enabled:
Speed-ups are measured as the relative improvement in fit speed compared to a sequential fit (multi-threading disabled).
This was collected on an Apple M1 CPU (4 performance cores and 4 efficiency cores) with llvm-openmp's libomp installed from conda-forge.
System:
python: 3.12.8 | packaged by conda-forge | (main, Dec 5 2024, 14:19:53) [Clang 18.1.8 ]
executable: /Users/ogrisel/miniforge3/envs/dev/bin/python
machine: macOS-15.2-arm64-arm-64bit
Python dependencies:
sklearn: 1.7.dev0
pip: 24.3.1
setuptools: 75.6.0
numpy: 2.0.2
scipy: 1.14.1
Cython: 3.0.11
pandas: 2.2.3
matplotlib: 3.10.0
joblib: 1.4.2
threadpoolctl: 3.5.0
Built with OpenMP: True
threadpoolctl info:
user_api: openmp
internal_api: openmp
num_threads: 8
prefix: libomp
filepath: /Users/ogrisel/miniforge3/envs/dev/lib/libomp.dylib
version: None
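The environment report above is the output of scikit-learn's built-in diagnostics helper, which can be reproduced on any machine (values will of course differ):

```python
import sklearn

# Print Python, platform, dependency, and OpenMP/threadpoolctl information,
# in the same format as the report shown above.
sklearn.show_versions()
```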
Here are my conclusions:
- Using threads can cause a huge slowdown on the smallest problems (up to 15x slower than with threading disabled);
- OpenMP threading becomes beneficial only with large datasets (more than 100k data points with hundreds of dimensions);
- Using the 4 efficiency cores is almost always detrimental. For Intel/AMD x86 CPUs we already disable SMT/HyperThreading for a similar reason (see PERF set openmp to use only physical cores by default #26082). But I don't know how to do something similar for efficiency cores in a platform-agnostic way;
- We should probably disable multi-threading, or at least reduce the number of threads, with a heuristic that depends on the data shape, such as:

      n_threads = min(
          max(n_samples * n_features // int(1e5), 1),
          n_features // 2,  # to avoid stragglers
          joblib.cpu_count(only_physical_cores=True),
      )
- Implementing sample-wise parallelism could probably help a bit in some cases (when n_features << max(n_cores, n_samples)), but I still think OpenMP threading should be disabled on the smallest problems;
- Results will probably also be impacted if we merge ENH HGBT histograms in blocks of features #27168.
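To make the proposed heuristic concrete, here is a hypothetical sketch; `pick_n_threads` and the `n_physical_cores` parameter are illustrative names (in practice the core count would come from `joblib.cpu_count(only_physical_cores=True)`), and an outer `max(1, ...)` is added so the heuristic never returns zero threads for very narrow data:

```python
def pick_n_threads(n_samples, n_features, n_physical_cores):
    """Hypothetical heuristic scaling the OpenMP thread count with data shape."""
    return max(
        1,  # guard: never return 0 threads, even when n_features < 2
        min(
            # roughly one thread per 100k cells of the data matrix
            max(n_samples * n_features // int(1e5), 1),
            n_features // 2,  # to avoid stragglers
            n_physical_cores,  # never oversubscribe physical cores
        ),
    )

print(pick_n_threads(100, 10, 4))         # tiny problem -> 1 thread
print(pick_n_threads(1_000_000, 100, 4))  # large problem -> all 4 cores
```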
Benchmark result data
Benchmark code
# %%
from time import perf_counter

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, ShuffleSplit
from threadpoolctl import threadpool_limits
import pandas as pd
import pyarrow  # noqa: F401, ensure a parquet engine is available
import fastparquet  # noqa: F401

# %%
records = []
data_shapes = [
    # n_features == 10
    (100, 10),
    (1_000, 10),
    (10_000, 10),
    (100_000, 10),
    (1_000_000, 10),
    # n_features == 100
    (100, 100),
    (1_000, 100),
    (10_000, 100),
    (100_000, 100),
    (1_000_000, 100),
    # n_features == 1000
    (100, 1_000),
    (1_000, 1_000),
]
# all_max_num_threads = [1, 2, 4, 8, None]  # None means using all available threads
all_max_num_threads = [1, 2, 4, 8]

for max_num_threads in all_max_num_threads:
    for n_samples, n_features in data_shapes:
        X, y = make_regression(
            n_samples=n_samples, n_features=n_features, random_state=0
        )
        with threadpool_limits(limits=max_num_threads):
            # Make sure we do enough CV splits to have non-trivial runtimes on
            # smaller problems.
            n_splits = max(int(1e5) // (n_samples * n_features), 5)
            start = perf_counter()
            scores = cross_val_score(
                HistGradientBoostingRegressor(),
                X,
                y,
                cv=ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=0),
            )
            end = perf_counter()
        record = {
            "n_samples": n_samples,
            "n_features": n_features,
            "n_splits": n_splits,
            "max_num_threads": max_num_threads,
            "duration": end - start,
        }
        print(record)
        records.append(record)

records = pd.DataFrame(records)
records.to_parquet("bench_num_threads.parquet")
Plotting code
# %%
import matplotlib.pyplot as plt
import pandas as pd

records = pd.read_parquet("bench_num_threads.parquet")

# %%
# For each (n_samples, n_features, n_splits) group, we compute the relative
# speed-up with respect to running on a single thread. We append the result as
# a new "speedup" column to the records DataFrame.

# %%
speedups = []
for (n_samples, n_features, n_splits), group in records.groupby(
    ["n_samples", "n_features", "n_splits"]
):
    group = group.copy()  # avoid mutating a view of the original DataFrame
    ref_duration = group[group["max_num_threads"] == 1]["duration"].values[0]
    group["speedup"] = (ref_duration / group["duration"]).round(2)
    speedups.append(group)

speedups = pd.concat(speedups)
speedups["slowdown"] = (1 / speedups["speedup"]).round(2)

# %%
speedups.sort_values("speedup", ascending=False).head(10)

# %%
speedups.sort_values("slowdown", ascending=False).head(10)

# %%
# We can now plot the speedup as a function of the number of threads used.

# %%
def plot_speed_curves(speedups, n_features=100):
    fig, ax = plt.subplots()
    for (n_samples, n_features, n_splits), group in speedups.query(
        "n_features == @n_features"
    ).groupby(["n_samples", "n_features", "n_splits"]):
        group.plot(
            x="max_num_threads",
            y="speedup",
            ax=ax,
            label=f"X.shape=({n_samples}, {n_features})",
        )
    ax.hlines(1, 0.95, 8.5, linestyles="--", color="gray")
    ax.set(
        xscale="log",
        xlabel="Number of threads",
        xticks=[1, 4, 8],
        xticklabels=["1", "4", "8"],
        yscale="log",
        ylabel="Speedup",
        yticks=[0.1, 0.2, 0.5, 1, 2, 5, 10],
        yticklabels=["0.1x", "0.2x", "0.5x", "1x", "2x", "5x", "10x"],
        title=f"Impact of threading on fit time (n_features={n_features})",
    )


plot_speed_curves(speedups, n_features=10)
plot_speed_curves(speedups, n_features=100)
plot_speed_curves(speedups, n_features=1000)