Description
This problem was first described as part of #14306, but I think it might make sense to open a dedicated issue for the particular problem of small data shapes.
The fundamental problem seems to be that the OpenMP threadpool overhead is frequently detrimental to performance on small data shapes. Note that OpenMP threading is enabled by default in scikit-learn.
Here are the relative durations of running cross-validation on this model on data of various sizes, with and without threading enabled:
Speed-ups are measured as the relative improvement in fit speed compared to a sequential fit (multi-threading disabled).
This was collected on an Apple M1 CPU (4 performance cores and 4 efficiency cores) with llvm-openmp's libomp installed from conda-forge.
System:
python: 3.12.8 | packaged by conda-forge | (main, Dec 5 2024, 14:19:53) [Clang 18.1.8 ]
executable: /Users/ogrisel/miniforge3/envs/dev/bin/python
machine: macOS-15.2-arm64-arm-64bit
Python dependencies:
sklearn: 1.7.dev0
pip: 24.3.1
setuptools: 75.6.0
numpy: 2.0.2
scipy: 1.14.1
Cython: 3.0.11
pandas: 2.2.3
matplotlib: 3.10.0
joblib: 1.4.2
threadpoolctl: 3.5.0
Built with OpenMP: True
threadpoolctl info:
user_api: openmp
internal_api: openmp
num_threads: 8
prefix: libomp
filepath: /Users/ogrisel/miniforge3/envs/dev/lib/libomp.dylib
version: None
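The environment report above is the output of scikit-learn's built-in diagnostics helper, which can be reproduced on any machine (values will of course differ):

```python
import sklearn

# Print Python, platform, dependency, and OpenMP/threadpoolctl information,
# in the same format as the report shown above.
sklearn.show_versions()
```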
Here are my conclusions:
- Using threads can cause a huge slowdown on the smallest problems (up to 15x slower than with threading disabled);
- OpenMP threading becomes beneficial only with large datasets (more than 100k data points with hundreds of dimensions);
- Using the 4 efficiency cores is almost always detrimental. For Intel/AMD x86 CPUs we already disable SMT/HyperThreading for a similar reason (see PERF set openmp to use only physical cores by default #26082). But I don't know how to do something similar for efficiency cores in a platform-agnostic way;
- We should probably disable multi-threading, or at least reduce the number of threads, with a heuristic that depends on the data shape, such as:

      n_threads = min(
          max(n_samples * n_features // int(1e5), 1),
          n_features // 2,  # to avoid stragglers
          joblib.cpu_count(only_physical_cores=True),
      )
- Implementing sample-wise parallelism could probably help a bit in some cases (when n_features << max(n_cores, n_samples)), but I still think OpenMP threading should be disabled on the smallest problems;
- Results will probably also be impacted if we merge ENH HGBT histograms in blocks of features #27168.
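To make the proposed heuristic concrete, here is a hypothetical sketch; `pick_n_threads` and the `n_physical_cores` parameter are illustrative names (in practice the core count would come from `joblib.cpu_count(only_physical_cores=True)`), and an outer `max(1, ...)` is added so the heuristic never returns zero threads for very narrow data:

```python
def pick_n_threads(n_samples, n_features, n_physical_cores):
    """Hypothetical heuristic scaling the OpenMP thread count with data shape."""
    return max(
        1,  # guard: never return 0 threads, even when n_features < 2
        min(
            # roughly one thread per 100k cells of the data matrix
            max(n_samples * n_features // int(1e5), 1),
            n_features // 2,  # to avoid stragglers
            n_physical_cores,  # never oversubscribe physical cores
        ),
    )

print(pick_n_threads(100, 10, 4))         # tiny problem -> 1 thread
print(pick_n_threads(1_000_000, 100, 4))  # large problem -> all 4 cores
```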
Benchmark result data
Benchmark code
# %%
from time import perf_counter

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, ShuffleSplit
from threadpoolctl import threadpool_limits
import pandas as pd
import pyarrow  # noqa: F401, ensure a parquet engine is available
import fastparquet  # noqa: F401

# %%
records = []
data_shapes = [
    # n_features == 10
    (100, 10),
    (1_000, 10),
    (10_000, 10),
    (100_000, 10),
    (1_000_000, 10),
    # n_features == 100
    (100, 100),
    (1_000, 100),
    (10_000, 100),
    (100_000, 100),
    (1_000_000, 100),
    # n_features == 1000
    (100, 1_000),
    (1_000, 1_000),
]
# all_max_num_threads = [1, 2, 4, 8, None]  # None means using all available threads
all_max_num_threads = [1, 2, 4, 8]

for max_num_threads in all_max_num_threads:
    for n_samples, n_features in data_shapes:
        X, y = make_regression(
            n_samples=n_samples, n_features=n_features, random_state=0
        )
        with threadpool_limits(limits=max_num_threads):
            # Make sure we do enough CV splits to have non-trivial runtimes on
            # smaller problems.
            n_splits = max(int(1e5) // (n_samples * n_features), 5)
            start = perf_counter()
            scores = cross_val_score(
                HistGradientBoostingRegressor(),
                X,
                y,
                cv=ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=0),
            )
            end = perf_counter()
        record = {
            "n_samples": n_samples,
            "n_features": n_features,
            "n_splits": n_splits,
            "max_num_threads": max_num_threads,
            "duration": end - start,
        }
        print(record)
        records.append(record)

records = pd.DataFrame(records)
records.to_parquet("bench_num_threads.parquet")
Plotting code
# %%
import matplotlib.pyplot as plt
import pandas as pd

records = pd.read_parquet("bench_num_threads.parquet")

# %%
# For each (n_samples, n_features, n_splits) group, we compute the relative
# speed-up with respect to running on a single thread. We append the result as
# a new "speedup" column to the records DataFrame.

# %%
speedups = []
for (n_samples, n_features, n_splits), group in records.groupby(
    ["n_samples", "n_features", "n_splits"]
):
    group = group.copy()  # avoid mutating a view of the original DataFrame
    ref_duration = group[group["max_num_threads"] == 1]["duration"].values[0]
    group["speedup"] = (ref_duration / group["duration"]).round(2)
    speedups.append(group)

speedups = pd.concat(speedups)
speedups["slowdown"] = (1 / speedups["speedup"]).round(2)

# %%
speedups.sort_values("speedup", ascending=False).head(10)

# %%
speedups.sort_values("slowdown", ascending=False).head(10)

# %%
# We can now plot the speedup as a function of the number of threads used.

# %%
def plot_speed_curves(speedups, n_features=100):
    fig, ax = plt.subplots()
    for (n_samples, n_features, n_splits), group in speedups.query(
        "n_features == @n_features"
    ).groupby(["n_samples", "n_features", "n_splits"]):
        group.plot(
            x="max_num_threads",
            y="speedup",
            ax=ax,
            label=f"X.shape=({n_samples}, {n_features})",
        )
    ax.hlines(1, 0.95, 8.5, linestyles="--", color="gray")
    ax.set(
        xscale="log",
        xlabel="Number of threads",
        xticks=[1, 4, 8],
        xticklabels=["1", "4", "8"],
        yscale="log",
        ylabel="Speedup",
        yticks=[0.1, 0.2, 0.5, 1, 2, 5, 10],
        yticklabels=["0.1x", "0.2x", "0.5x", "1x", "2x", "5x", "10x"],
        title=f"Impact of threading on fit time (n_features={n_features})",
    )


plot_speed_curves(speedups, n_features=10)
plot_speed_curves(speedups, n_features=100)
plot_speed_curves(speedups, n_features=1000)