Description
Describe the bug
This is perhaps not a bug but an opportunity for improvement. I've noticed that scikit-learn runs considerably faster if I happen to have import torch before any sklearn imports.
This first block of code runs much slower:
from sklearn.ensemble import HistGradientBoostingRegressor
import numpy as np
X = np.random.random(size=(50, 10000))
y = np.random.random(size=50)
estimator = HistGradientBoostingRegressor(verbose=True)
estimator.fit(X, y)
than this second block of code:
import torch # The only difference
from sklearn.ensemble import HistGradientBoostingRegressor
import numpy as np
X = np.random.random(size=(50, 10000))
y = np.random.random(size=50)
estimator = HistGradientBoostingRegressor(verbose=True)
estimator.fit(X, y)
Here are the run times over 6 runs each on my actual code, the only difference being an import of torch.
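Roughly, the comparison can be reproduced with a harness like this (a sketch, not my exact benchmarking code), timing each variant in a fresh interpreter so the import order is controlled every run:

# Sketch of a timing harness (not my original benchmark): run each variant
# in a separate interpreter so only the presence of the torch import differs.
import subprocess
import sys

SNIPPET = """
{maybe_torch}
import time
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

X = np.random.random(size=(50, 10000))
y = np.random.random(size=50)

start = time.perf_counter()
HistGradientBoostingRegressor().fit(X, y)
print(time.perf_counter() - start)
"""

for label, maybe_torch in [("without torch", ""), ("with torch", "import torch")]:
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET.format(maybe_torch=maybe_torch)],
        capture_output=True, text=True, check=True,
    )
    print(label, result.stdout.strip(), "seconds")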
I know it's confusing that I'm importing torch but not using it, so to be clear: I don't use the torch module in any way in the script. I just happened to stumble across the performance improvement at one point when I imported torch for some other purpose. It's literally just sitting there as an unused import, making my code run much faster.
I've tested with a few other regressors, including RandomForestRegressor and GradientBoostingRegressor, and I don't see any difference.
I compared os.environ in both cases and they're the same. I looked at sklearn.base.get_config() and it's identical in both cases too. I notice that torch sets OMP_NUM_THREADS to 10, while without the torch import this value is set to 20 (on my machine with 20 cores). But even manually setting it to 10 doesn't bridge the gap.
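For anyone debugging this, the loaded OpenMP/BLAS runtimes and their thread counts can be compared directly with threadpoolctl rather than via environment variables; something like this (just an inspection sketch) shows which libgomp ends up loaded before and after the torch import:

# Inspection sketch: compare thread-pool state before and after importing torch.
import os
from pprint import pprint

from threadpoolctl import threadpool_info

print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS"))
pprint(threadpool_info())  # one dict per loaded BLAS/OpenMP runtime

import torch  # noqa: F401  -- torch's bundled libgomp gets loaded here

pprint(threadpool_info())  # torch's libgomp now appears with its own num_threads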
I don't know enough about torch or sklearn to be able to work out what else is going on, but I'm guessing someone who's worked on HistGradientBoostingRegressor might know? It seems like there's a nice performance gain to be found somewhere in here.
Steps/Code to Reproduce
As above
Expected Results
Training should run at full speed regardless of whether torch has been imported.
Actual Results
Training is not at full speed unless I import torch first.
Also, as a general point, it would be nice to be able to pass n_jobs to the constructor; having the estimator use all 20 cores is not always the fastest option.
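In the meantime, as a workaround rather than a real n_jobs parameter, my understanding is that the OpenMP thread count used during fit can be capped with threadpoolctl; a sketch of what I mean (not an official sklearn API for this):

# Workaround sketch: cap the OpenMP threads used during fit() via threadpoolctl.
import numpy as np
from threadpoolctl import threadpool_limits
from sklearn.ensemble import HistGradientBoostingRegressor

X = np.random.random(size=(50, 10000))
y = np.random.random(size=50)

with threadpool_limits(limits=10, user_api="openmp"):
    HistGradientBoostingRegressor().fit(X, y)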
Versions
System:
python: 3.10.8 (main, Oct 12 2022, 19:14:26) [GCC 9.4.0]
executable: /home/davidg/.virtualenvs/learning/bin/python
machine: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python dependencies:
sklearn: 1.2.2
pip: 23.1.2
setuptools: 59.5.0
numpy: 1.24.3
scipy: 1.10.1
Cython: 0.29.33
pandas: 2.0.1
matplotlib: 3.7.0
joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
version: 0.3.21
threading_layer: pthreads
architecture: Haswell
num_threads: 20
user_api: openmp
internal_api: openmp
prefix: libgomp
filepath: /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/torch/lib/libgomp-a34b3233.so.1
version: None
num_threads: 10
user_api: openmp
internal_api: openmp
prefix: libgomp
filepath: /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
version: None
num_threads: 20
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
version: 0.3.18
threading_layer: pthreads
architecture: Haswell
num_threads: 20