diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst
index 382fa8938b5ca..97e3e2866083f 100644
--- a/doc/computing/parallelism.rst
+++ b/doc/computing/parallelism.rst
@@ -10,22 +10,29 @@ Parallelism, resource management, and configuration
 Parallelism
 -----------

-Some scikit-learn estimators and utilities can parallelize costly operations
-using multiple CPU cores, thanks to the following components:
+Some scikit-learn estimators and utilities parallelize costly operations
+using multiple CPU cores.

-- via the `joblib `_ library. In
-  this case the number of threads or processes can be controlled with the
-  ``n_jobs`` parameter.
-- via OpenMP, used in C or Cython code.
+Depending on the type of estimator and sometimes the values of the
+constructor parameters, this is done in one or several of the following ways:

-In addition, some of the numpy routines that are used internally by
-scikit-learn may also be parallelized if numpy is installed with specific
-numerical libraries such as MKL, OpenBLAS, or BLIS.
+- with higher-level parallelism via `joblib `_.
+- with lower-level parallelism via OpenMP, used in C or Cython code.
+- with lower-level parallelism via BLAS, used by NumPy and SciPy for generic operations
+  on arrays.

-We describe these 3 scenarios in the following subsections.
+The `n_jobs` parameter of estimators always controls the amount of parallelism
+managed by joblib (processes or threads depending on the joblib backend).
+The thread-level parallelism managed by OpenMP in scikit-learn's own Cython code
+or by BLAS & LAPACK libraries used by NumPy and SciPy operations in scikit-learn
+is always controlled by environment variables or `threadpoolctl` as explained below.
+Note that some estimators can leverage all three kinds of parallelism at different
+points of their training and prediction methods.

-Joblib-based parallelism
-........................
+We describe these 3 types of parallelism in the following subsections in more detail.
+
+Higher-level parallelism with joblib
+....................................

 When the underlying implementation uses joblib, the number of workers
 (threads or processes) that are spawned in parallel can be controlled via the
@@ -33,15 +40,16 @@ When the underlying implementation uses joblib, the number of workers
 .. note::

-    Where (and how) parallelization happens in the estimators is currently
-    poorly documented. Please help us by improving our docs and tackle `issue
-    14228 <https://github.com/scikit-learn/scikit-learn/issues/14228>`_!
+    Where (and how) parallelization happens in the estimators using joblib by
+    specifying `n_jobs` is currently poorly documented.
+    Please help us by improving our docs and tackling `issue 14228
+    <https://github.com/scikit-learn/scikit-learn/issues/14228>`_!

 Joblib is able to support both multi-processing and multi-threading. Whether
 joblib chooses to spawn a thread or a process depends on the **backend**
 that it's using.

-Scikit-learn generally relies on the ``loky`` backend, which is joblib's
+scikit-learn generally relies on the ``loky`` backend, which is joblib's
 default backend. Loky is a multi-processing backend. When doing
 multi-processing, in order to avoid duplicating the memory in each process
 (which isn't reasonable with big datasets), joblib will create a `memmap
@@ -70,40 +78,57 @@ that increasing the number of workers is always a good thing. In some cases
 it can be highly detrimental to performance to run multiple copies of some
 estimators or functions in parallel (see oversubscription below).

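+For instance, the following minimal sketch (the estimator, dataset and
+parameter grid are purely illustrative) shows two ways of requesting
+joblib-based parallelism: through the ``n_jobs`` constructor parameter, or
+externally through the ``joblib.parallel_backend`` context manager::
+
+    from joblib import parallel_backend
+
+    from sklearn.datasets import make_classification
+    from sklearn.linear_model import LogisticRegression
+    from sklearn.model_selection import GridSearchCV
+
+    X, y = make_classification(n_samples=1000, random_state=0)
+    param_grid = {"C": [0.1, 1.0, 10.0]}
+
+    # Request 2 joblib workers through the constructor parameter:
+    GridSearchCV(LogisticRegression(), param_grid, n_jobs=2).fit(X, y)
+
+    # Or leave n_jobs to its default (None) and configure joblib externally;
+    # None means 1 worker unless a parallel_backend context is active:
+    with parallel_backend("loky", n_jobs=2):
+        GridSearchCV(LogisticRegression(), param_grid).fit(X, y)
+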
-OpenMP-based parallelism
-........................
+Lower-level parallelism with OpenMP
+...................................

 OpenMP is used to parallelize code written in Cython or C, relying on
-multi-threading exclusively. By default (and unless joblib is trying to
-avoid oversubscription), the implementation will use as many threads as
-possible.
+multi-threading exclusively. By default, the implementations using OpenMP
+will use as many threads as possible, i.e. as many threads as logical cores.

-You can control the exact number of threads that are used via the
-``OMP_NUM_THREADS`` environment variable:
+You can control the exact number of threads that are used either:

-.. prompt:: bash $
+  - via the ``OMP_NUM_THREADS`` environment variable, for instance when
+    running a Python script:
+
+    .. prompt:: bash $
+
+        OMP_NUM_THREADS=4 python my_script.py

-    OMP_NUM_THREADS=4 python my_script.py
+  - or via `threadpoolctl` as explained by `this piece of documentation
+    `_.

-Parallel Numpy routines from numerical libraries
-................................................
+Parallel NumPy and SciPy routines from numerical libraries
+..........................................................

-Scikit-learn relies heavily on NumPy and SciPy, which internally call
-multi-threaded linear algebra routines implemented in libraries such as MKL,
-OpenBLAS or BLIS.
+scikit-learn relies heavily on NumPy and SciPy, which internally call
+multi-threaded linear algebra routines (BLAS & LAPACK) implemented in libraries
+such as MKL, OpenBLAS or BLIS.

-The number of threads used by the OpenBLAS, MKL or BLIS libraries can be set
-via the ``MKL_NUM_THREADS``, ``OPENBLAS_NUM_THREADS``, and
-``BLIS_NUM_THREADS`` environment variables.
+You can control the exact number of threads used by BLAS for each library
+using environment variables, namely:
+
+  - ``MKL_NUM_THREADS`` sets the number of threads MKL uses,
+  - ``OPENBLAS_NUM_THREADS`` sets the number of threads OpenBLAS uses,
+  - ``BLIS_NUM_THREADS`` sets the number of threads BLIS uses.
+
+Note that BLAS & LAPACK implementations can also be impacted by
+`OMP_NUM_THREADS`. To check whether this is the case in your environment,
+you can inspect how the number of threads effectively used by those libraries
+is affected when running the following command in a bash or zsh terminal
+for different values of `OMP_NUM_THREADS`:
+
+.. prompt:: bash $

-Please note that scikit-learn has no direct control over these
-implementations. Scikit-learn solely relies on Numpy and Scipy.
+    OMP_NUM_THREADS=2 python -m threadpoolctl -i numpy scipy

 .. note::

-    At the time of writing (2019), NumPy and SciPy packages distributed on
-    pypi.org (used by ``pip``) and on the conda-forge channel are linked
-    with OpenBLAS, while conda packages shipped on the "defaults" channel
-    from anaconda.org are linked by default with MKL.
+    At the time of writing (2022), NumPy and SciPy packages which are
+    distributed on pypi.org (i.e. the ones installed via ``pip install``)
+    and on the conda-forge channel (i.e. the ones installed via
+    ``conda install --channel conda-forge``) are linked with OpenBLAS, while
+    NumPy and SciPy packages shipped on the ``defaults`` conda
+    channel from Anaconda.org (i.e. the ones installed via ``conda install``)
+    are linked by default with MKL.
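+
+In addition to these environment variables, the number of threads used by
+OpenMP and BLAS can also be inspected and adjusted at runtime from Python with
+`threadpoolctl`, which scikit-learn itself uses internally (see below). The
+following is only a small sketch; refer to the `threadpoolctl` documentation
+for the full API::
+
+    import numpy as np
+    from threadpoolctl import threadpool_info, threadpool_limits
+
+    # List the native thread-pools (OpenMP, OpenBLAS, MKL, ...) loaded in the
+    # current process, with the number of threads each of them currently uses:
+    for pool in threadpool_info():
+        print(pool["user_api"], pool["num_threads"])
+
+    # Temporarily limit all supported libraries to 2 threads:
+    with threadpool_limits(limits=2):
+        a = np.random.rand(1000, 1000)
+        a @ a  # BLAS-backed matrix multiplication, now capped at 2 threads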

 Oversubscription: spawning too many threads
@@ -120,8 +145,8 @@ with ``n_jobs=8`` over a
 OpenMP). Each instance of
 :class:`~sklearn.ensemble.HistGradientBoostingClassifier` will spawn 8 threads
 (since you have 8 CPUs). That's a total of ``8 * 8 = 64`` threads, which
-leads to oversubscription of physical CPU resources and to scheduling
-overhead.
+leads to oversubscription of threads for physical CPU resources and thus
+to scheduling overhead.

 Oversubscription can arise in the exact same fashion with parallelized
 routines from MKL, OpenBLAS or BLIS that are nested in joblib calls.
@@ -146,38 +171,34 @@ Note that:
   only use ``_NUM_THREADS``. Joblib exposes a context manager for
   finer control over the number of threads in its workers (see joblib docs
   linked below).

-- Joblib is currently unable to avoid oversubscription in a
-  multi-threading context. It can only do so with the ``loky`` backend
-  (which spawns processes).
+- When joblib is configured to use the ``threading`` backend, there is no
+  mechanism to avoid oversubscription when calling into parallel native
+  libraries in the joblib-managed threads.
+- All scikit-learn estimators that explicitly rely on OpenMP in their Cython
+  code always use `threadpoolctl` internally to automatically adapt the number
+  of threads used by OpenMP and potentially nested BLAS calls so as to avoid
+  oversubscription.

 You will find additional details about joblib mitigation of oversubscription
 in `joblib documentation `_.

+You will find additional details about parallelism in numerical Python libraries
+in `this document from Thomas J. Fan `_.

 Configuration switches
 -----------------------

-Python runtime
-..............
+Python API
+..........

-:func:`sklearn.set_config` controls the following behaviors:
-
-`assume_finite`
-~~~~~~~~~~~~~~~
-
-Used to skip validation, which enables faster computations but may lead to
-segmentation faults if the data contains NaNs.
-
-`working_memory`
-~~~~~~~~~~~~~~~~
-
-The optimal size of temporary arrays used by some algorithms.
+:func:`sklearn.set_config` and :func:`sklearn.config_context` can be used to
+change configuration parameters that control aspects of parallelism.

 .. _environment_variable:

 Environment variables
-......................
+.....................

 These environment variables should be set before importing scikit-learn.

@@ -277,3 +298,14 @@ float64 data.
 When this environment variable is set to a non zero value, the `Cython`
 derivative, `boundscheck` is set to `True`. This is useful for finding
 segfaults.
+
+`SKLEARN_PAIRWISE_DIST_CHUNK_SIZE`
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This sets the size of the chunks to be used by the underlying
+`PairwiseDistancesReductions` implementations. The default value is `256`,
+which has been shown to be adequate on most machines.
+
+Users looking for the best performance might want to tune this variable using
+powers of 2 so as to get the best parallelism behavior for their hardware,
+especially with respect to their caches' sizes.
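+
+For instance, one way to evaluate the impact of this variable is to run the
+same workload with different chunk sizes and compare the timings
+(``my_benchmark.py`` below is a placeholder for your own benchmarking script):
+
+.. prompt:: bash $
+
+    SKLEARN_PAIRWISE_DIST_CHUNK_SIZE=256 python my_benchmark.py
+    SKLEARN_PAIRWISE_DIST_CHUNK_SIZE=512 python my_benchmark.py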
diff --git a/sklearn/metrics/pairwise.py b/sklearn/metrics/pairwise.py
index e6abd596b0000..1ccff8ae8c8b7 100644
--- a/sklearn/metrics/pairwise.py
+++ b/sklearn/metrics/pairwise.py
@@ -688,9 +688,12 @@ def pairwise_distances_argmin_min(
         values = values.flatten()
         indices = indices.flatten()
     else:
-        # TODO: once BaseDistanceReductionDispatcher supports distance metrics
-        # for boolean datasets, we won't need to fallback to
-        # pairwise_distances_chunked anymore.
+        # Joblib-based backend, which is used when a user-defined callable
+        # is passed for metric.
+
+        # This won't be used in the future once PairwiseDistancesReductions support:
+        #   - DistanceMetrics which work on boolean data
+        #   - CSR-dense and dense-CSR cases if 'euclidean' in metric.

         # Turn off check for finiteness because this is costly and because arrays
         # have already been validated.
@@ -800,9 +803,12 @@ def pairwise_distances_argmin(X, Y, *, axis=1, metric="euclidean", metric_kwargs
         )
         indices = indices.flatten()
     else:
-        # TODO: once BaseDistanceReductionDispatcher supports distance metrics
-        # for boolean datasets, we won't need to fallback to
-        # pairwise_distances_chunked anymore.
+        # Joblib-based backend, which is used when a user-defined callable
+        # is passed for metric.
+
+        # This won't be used in the future once PairwiseDistancesReductions support:
+        #   - DistanceMetrics which work on boolean data
+        #   - CSR-dense and dense-CSR cases if 'euclidean' in metric.

         # Turn off check for finiteness because this is costly and because arrays
         # have already been validated.
diff --git a/sklearn/neighbors/_base.py b/sklearn/neighbors/_base.py
index 07b4e4d5996ff..3b01824a3a73a 100644
--- a/sklearn/neighbors/_base.py
+++ b/sklearn/neighbors/_base.py
@@ -839,9 +839,13 @@ class from an array representing our data set and ask who's
             )

         elif self._fit_method == "brute":
-            # TODO: should no longer be needed once ArgKmin
-            # is extended to accept sparse and/or float32 inputs.
+            # Joblib-based backend, which is used when a user-defined callable
+            # is passed for metric.
+            # This won't be used in the future once PairwiseDistancesReductions
+            # support:
+            #   - DistanceMetrics which work on boolean data
+            #   - CSR-dense and dense-CSR cases if 'euclidean' in metric.

             reduce_func = partial(
                 self._kneighbors_reduce_func,
                 n_neighbors=n_neighbors,
@@ -1173,9 +1177,13 @@ class from an array representing our data set and ask who's
             )

         elif self._fit_method == "brute":
-            # TODO: should no longer be needed once we have Cython-optimized
-            # implementation for radius queries, with support for sparse and/or
-            # float32 inputs.
+            # Joblib-based backend, which is used when a user-defined callable
+            # is passed for metric.
+
+            # This won't be used in the future once PairwiseDistancesReductions
+            # support:
+            #   - DistanceMetrics which work on boolean data
+            #   - CSR-dense and dense-CSR cases if 'euclidean' in metric.

             # for efficiency, use squared euclidean distances
             if self.effective_metric_ == "euclidean":