Description
Several people have expressed a strong interest in talking about and working on (auto-)parallelization. Here is an attempt at summarizing this topic:
- current status
- auto-parallelization and nested parallelism
- limitations due to Python package distribution mechanisms
- the need for a better API pattern or library
Current status
Linear algebra libraries
The main accelerated linear algebra libraries in use (for CPU-based code) are
OpenBLAS and MKL. Both of these libraries auto-parallelize function calls.

OpenBLAS can be built with either its own pthreads-based thread pool or with
OpenMP support. The number of threads can be controlled with an environment
variable (OPENBLAS_NUM_THREADS or OMP_NUM_THREADS), or from Python via
threadpoolctl. The conda-forge OpenBLAS package uses OpenMP; the OpenBLAS
builds linked into NumPy and SciPy wheels on PyPI use pthreads.
MKL supports OpenMP and Intel TBB as its threading control mechanisms. The
number of threads can be controlled with an environment variable
(MKL_NUM_THREADS or OMP_NUM_THREADS), or from Python with threadpoolctl.
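For example, here is a minimal sketch of controlling this from Python with
threadpoolctl (assuming NumPy is linked against OpenBLAS or MKL and that the
threadpoolctl package is installed):

```python
# Minimal sketch; assumes NumPy is linked against OpenBLAS or MKL and that
# threadpoolctl is installed.
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# List the BLAS/OpenMP thread pools loaded into this process
print(threadpool_info())

a = np.random.default_rng(0).standard_normal((2000, 2000))

# Cap the number of BLAS threads for this block only
with threadpool_limits(limits=2, user_api="blas"):
    a @ a
```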
NumPy
NumPy itself does not parallelize computations, with the exception of linear
algebra routines, which inherit the auto-parallelization of the underlying
library (typically OpenBLAS or MKL). NumPy does, however, consistently release
the GIL where it can.
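Because the GIL is released, a plain Python thread pool can already run
NumPy-heavy work concurrently even though NumPy itself stays single-threaded;
a minimal sketch with synthetic data:

```python
# Minimal sketch: NumPy releases the GIL, so threads can overlap array work.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(0)
blocks = [rng.standard_normal((1000, 1000)) for _ in range(8)]

# Each matmul releases the GIL, so the four threads can run concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda x: x @ x, blocks))
```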
Scikit-learn
Scikit-learn provides an n_jobs keyword (defaulting to serial execution) in
many estimators and other functions to let users enable parallel execution.
This is done via the joblib library, which provides both process-based (the
default) and thread-based backends that can be selected with a context manager.
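A minimal sketch of both mechanisms, using a random-forest estimator on a
synthetic dataset (any estimator with n_jobs would do):

```python
# Minimal sketch of n_jobs plus joblib backend selection; the dataset is synthetic.
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# n_jobs=-1 asks scikit-learn (via joblib) to use all available cores
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X, y)

# The joblib context manager selects the threading backend for this block
with parallel_backend("threading", n_jobs=4):
    clf.fit(X, y)
```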
Scikit-learn also contains C and Cython code that uses OpenMP. OpenMP is
enabled both in the wheels on PyPI and in the conda-forge packages. The number
of threads used can be controlled with the OMP_NUM_THREADS environment
variable.
Scikit-learn has good documentation on parallelism and resource management.
SciPy
SciPy provides a workers=1 keyword in a (still limited) number of functions to
let users enable parallel execution. It is similar to scikit-learn's n_jobs
keyword, except that it also accepts a map-like callable (e.g.
multiprocessing.Pool.map) to allow using a custom pool. C++ code in SciPy uses
pthreads; the use of OpenMP was discussed and rejected.
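A minimal sketch using scipy.optimize.differential_evolution, one of the
functions that accepts workers either as an integer or as a map-like callable:

```python
# Minimal sketch of the workers= keyword; differential_evolution accepts either
# an integer or a map-like callable such as multiprocessing.Pool.map.
from multiprocessing import Pool

from scipy.optimize import differential_evolution, rosen

bounds = [(-5, 5)] * 4

if __name__ == "__main__":
    # Let SciPy manage two worker processes itself
    res = differential_evolution(rosen, bounds, workers=2,
                                 updating="deferred", seed=0)

    # Or hand it the map method of a custom pool
    with Pool(2) as pool:
        res = differential_evolution(rosen, bounds, workers=pool.map,
                                     updating="deferred", seed=0)
```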
scipy.linalg
also provides a Cython API for BLAS and LAPACK. This lets other
libraries use linear algebra routines without having to ship or build against
an accelerated linear algebra library directly. Scikit-learn, statsmodels and
other libraries do this - thereby again inheriting the auto-parallelization
behavior from OpenBLAS or MKL.
Deep learning frameworks
TensorFlow, PyTorch, MXNet and JAX all have auto-parallelization behavior.
Furthermore, they provide support for distributed computing (with the exception
of JAX). These frameworks are very performance-focused and aim to optimally
use all available hardware. They typically allow building with different
backends like NCCL or Gloo for distributed communication, and use OpenMP, MPI,
gRPC and more. The advantage these frameworks have is that users typically use
only that one framework for their whole program, so the parallelism can be
optimized without having to play well with other Python packages that also
execute code in parallel.
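As one illustration (PyTorch is used here; other frameworks have analogous
settings), the intra-op thread pool can be inspected and capped:

```python
# Minimal sketch of controlling intra-op parallelism in PyTorch.
import torch

print(torch.get_num_threads())  # typically defaults to the number of cores

torch.set_num_threads(2)        # cap intra-op parallelism for this process
x = torch.randn(2000, 2000)
y = x @ x                       # runs on at most 2 threads
```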
Dask
Dask provides parallel arrays, dataframes and machine learning algorithms with
APIs that match NumPy, Pandas and scikit-learn as much as possible. Dask is a
pure Python library and uses blocked algorithms; each block contains a single
NumPy array or Pandas dataframe. Scaling to hundreds of nodes is possible; Dask
is a good solution for obtaining distributed arrays. When used as a method to
obtain parallelism on a single node, however, it is not very efficient.
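A minimal sketch of the blocked-array model:

```python
# Minimal sketch: a Dask array mirrors the NumPy API but is split into blocks,
# each of which is a regular NumPy array processed by a separate task.
import dask.array as da

x = da.random.random((20000, 20000), chunks=(2000, 2000))  # 10 x 10 blocks
y = (x + x.T).mean(axis=0)  # builds a task graph; nothing is computed yet
result = y.compute()        # executes the graph, one task per block, in parallel
```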
Auto-parallelization and nested parallelism
Some libraries, like the deep learning frameworks, do auto-parallelization.
Most non-deep-learning libraries do not. When a single library or framework is
used to execute an end user's program, auto-parallelization is usually a good
thing to have: it uses all available hardware resources in an optimal fashion.
Problems can occur when multiple libraries are involved; what often happens is
oversubscription of resources. For example, if an end user writes code using
scikit-learn with n_jobs=-1, and NumPy auto-parallelizes operations, then
scikit-learn will use N processes (on an N-core machine) and NumPy will use N
threads per process - leading to N^2 threads being used. On machines with a
large number of cores, the overhead of this quickly becomes problematic. Given
that NumPy uses OpenBLAS or MKL, this problem already occurs today. For a while
Anaconda and Intel shipped a modified NumPy version that had
auto-parallelization behavior for functions other than linear algebra - and the
problem occurred more frequently.
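A minimal sketch of one way to mitigate this (assuming threadpoolctl is
installed): keep BLAS single-threaded inside each joblib worker so that N
processes do not each spawn N BLAS threads:

```python
# Minimal sketch: limit BLAS threads inside each worker to avoid N^2 threads.
import numpy as np
from joblib import Parallel, delayed
from threadpoolctl import threadpool_limits

def task(seed):
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((1000, 1000))
    # Keep the BLAS calls single-threaded inside this worker
    with threadpool_limits(limits=1, user_api="blas"):
        return np.linalg.eigvalsh(a @ a.T).max()

if __name__ == "__main__":
    # One worker per core; each worker's BLAS stays single-threaded
    results = Parallel(n_jobs=-1)(delayed(task)(i) for i in range(8))
```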
The paper Composable Multi-Threading and Multi-Processing for Numeric Libraries
by Malakhov et al. contains a good overview with examples and comparisons
between different parallelization methods. It uses NumPy, SciPy, Dask, and
Numba, and covers multiprocessing, concurrent.futures, OpenMP, Intel TBB
(Threading Building Blocks), and a custom library SMP (symmetric
multi-processing).
Limitations due to Python package distribution mechanisms
When one wants to use auto-parallelization, it's important to have control over
the complete set of packages a user installs on their machine. That way one can
ensure there's a single linear algebra library installed and a single OpenMP
runtime is used.
That control over the full set of packages is common in HPC-type situations,
where admins need to deal with build and install requirements to make libraries
work well together. Both package managers (e.g. apt in Debian) and Conda have
the ability to get this right as well - both because of dependency resolution
and because of a common build infrastructure.
A large fraction of Python users install packages from PyPI with pip, however.
The binary installers (wheels) on PyPI are not built on a common
infrastructure, and because there's no real support for non-Python
dependencies, libraries like OpenMP and OpenBLAS are bundled into the wheels
and installed into end-user environments multiple times. This makes it very
difficult to reliably use, e.g., OpenMP. For this reason SciPy uses custom
pthreads thread pools rather than OpenMP.
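threadpoolctl can show which (possibly duplicated) BLAS and OpenMP runtimes
have ended up loaded into a process, which makes the problem visible:

```python
# Minimal sketch: list the BLAS/OpenMP runtimes loaded into this process.
# Duplicate OpenMP entries are a symptom of the bundling issue described above.
from pprint import pprint

import numpy  # noqa: F401 -- importing loads its bundled BLAS (and possibly OpenMP)
import scipy.linalg  # noqa: F401

from threadpoolctl import threadpool_info

# One dict per loaded thread pool: library path, internal API, version, num_threads
pprint(threadpool_info())
```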
The need for a better API pattern or library
Given the status of the ecosystem today, the default behavior for libraries
like NumPy and SciPy should be single-threaded execution; otherwise they
compose badly with multiprocessing, scikit-learn (joblib), Dask, etc. However,
there's room for improvement here. Two things that could help improve the
coordination of parallelization behavior in a stack of Python libraries are:
- A common API pattern for enabling parallelism
- A common library providing a parallelization layer
A common API pattern is the simpler of the two options. It could be a keyword
like n_jobs or workers that gets used consistently between libraries, or a
context manager to achieve the same level of per-function or per-code-block
control.
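To make the idea concrete, here is a purely hypothetical sketch of what such a
context manager could look like; neither set_workers nor its adoption by any
library exists today:

```python
# Purely hypothetical sketch of a shared "how many workers" knob that libraries
# could consult; none of these names exist in any current library.
import threading
from contextlib import contextmanager

_state = threading.local()

@contextmanager
def set_workers(n):
    """Request that participating libraries use at most n workers in this block."""
    previous = getattr(_state, "workers", None)
    _state.workers = n
    try:
        yield
    finally:
        _state.workers = previous

def get_workers(default=1):
    """What a library would call when its workers/n_jobs argument is left unset."""
    workers = getattr(_state, "workers", None)
    return default if workers is None else workers

# Hypothetical usage:
# with set_workers(4):
#     scipy.fft.fft(x)        # would see workers=4
#     estimator.fit(X, y)     # would see n_jobs=4
```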
A common library would be more powerful and enable auto-parallelization rather
than giving the user control (which is what the API pattern does). From a
performance perspective, having arrays and dataframes auto-parallelize their
functions as much as possible over all cores on a single node, and then letting
a separate library like Dask deal with multi-node coordination, seems optimal.
Introducing a new dependency into multiple libraries at the core of the PyData
ecosystem is a nontrivial exercise, however.
The above attempts to summarize the state of affairs today. The topic of
parallelization is largely an implementation rather than an API question;
however, there is an API component to it with option (1) above. How to move
forward here is worth discussing.
Note: there's also a lot of room left in NumPy for optimizing single-threaded
performance. There is ongoing work on making better use of intrinsics (a large,
ongoing effort), and on using SLEEF for vector math (discussed in the past;
no one is currently working on it).