Description
Context
Pairwise distance computation is an essential part of many estimators in scikit-learn, and can take up a significant portion of run time in certain workflows. I believe that we may achieve significant performance gains in several (perhaps most) distance metric implementations by leveraging SIMD intrinsics.
Proof of Concept
I built a quick proof of concept to see what kind of performance gains we could observe with a potentially naive implementation using SIMD intrinsics. I chose to optimize the ManhattanDistance.dist function. This implementation uses intrinsics found in SSE{1,2,3}. To ensure that the instructions are supported, it checks for the presence of the SSE3 instruction set (SSE3 implies SSE{1,2}) and provides the optimized implementation if so. Otherwise it provides a dummy implementation just to appease Cython, and the main function falls back to the current implementation on main. Note that on most modern hardware, support for SSE3 is a reasonable expectation (indeed, numpy assumes it is always present when optimization is enabled). For the specific implementation referred to here, please take a look at this PR: Micky774#11
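To make the idea concrete, here is a minimal sketch in plain C of what an SSE-based Manhattan distance for float32 could look like. It is not the code from the PR. The function names (manhattan_dist, manhattan_dist_sse3, manhattan_dist_scalar) and the HAVE_SSE3_SKETCH macro are hypothetical, and the sketch assumes the SSE3 check is done at compile time via the __SSE3__ macro that GCC and Clang define, which may differ from how the PR performs the check.

```c
#include <stddef.h>

/* Assumption: compile-time detection through the compiler-defined __SSE3__
 * macro (GCC/Clang). The PR may detect SSE3 differently. */
#if defined(__SSE3__)
#include <pmmintrin.h> /* SSE3 intrinsics; also pulls in SSE and SSE2 */
#define HAVE_SSE3_SKETCH 1
#else
#define HAVE_SSE3_SKETCH 0
#endif

/* Scalar reference: sum of |x[i] - y[i]|. */
static float manhattan_dist_scalar(const float *x, const float *y, size_t n)
{
    float d = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float diff = x[i] - y[i];
        d += (diff < 0.0f) ? -diff : diff;
    }
    return d;
}

#if HAVE_SSE3_SKETCH
/* Hypothetical SIMD variant: handles 4 floats per iteration. */
static float manhattan_dist_sse3(const float *x, const float *y, size_t n)
{
    /* -0.0f has only the sign bit set; and-not with it clears the sign
     * bit of each lane, which is a branch-free absolute value. */
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Unaligned loads: no alignment assumption on the inputs. */
        __m128 diff = _mm_sub_ps(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i));
        acc = _mm_add_ps(acc, _mm_andnot_ps(sign_mask, diff));
    }

    /* Horizontal sum of the 4 lanes using the SSE3 hadd instruction. */
    acc = _mm_hadd_ps(acc, acc);
    acc = _mm_hadd_ps(acc, acc);

    /* Remaining 0 to 3 elements go through the scalar loop. */
    return _mm_cvtss_f32(acc) + manhattan_dist_scalar(x + i, y + i, n - i);
}
#endif

/* Entry point: SIMD path when available, scalar fallback otherwise. */
float manhattan_dist(const float *x, const float *y, size_t n)
{
#if HAVE_SSE3_SKETCH
    return manhattan_dist_sse3(x, y, n);
#else
    return manhattan_dist_scalar(x, y, n);
#endif
}
```

Because the SIMD path accumulates four partial sums and combines them at the end, its result can differ from the scalar loop in the last few bits for general floating-point inputs, so comparisons against the existing implementation would likely need a tolerance.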
Note that the full benefit of the intrinsics is gained when compiling with -march="native"; however, the benefit is still significant when compiling with -march="nocona", as is often the default (e.g. when following the scikit-learn development instructions on Linux).
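As a quick way to check what a given -march setting enables, the following sketch reports the highest SSE level the compiler has turned on. It relies on the __SSE__, __SSE2__ and __SSE3__ macros predefined by GCC and Clang; the function names sse_level and report_sse_level are hypothetical and are not part of scikit-learn or the PR.

```c
#include <stdio.h>

/* Returns the highest SSE generation enabled at compile time (0 if none).
 * Assumption: a GCC- or Clang-compatible compiler, which predefines these
 * macros according to the -march / -msse* flags in effect. */
int sse_level(void)
{
#if defined(__SSE3__)
    return 3;
#elif defined(__SSE2__)
    return 2;
#elif defined(__SSE__)
    return 1;
#else
    return 0;
#endif
}

/* Prints the detected level, e.g. for comparing builds with different flags. */
void report_sse_level(void)
{
    printf("highest SSE level enabled at compile time: %d\n", sse_level());
}
```

Calling report_sse_level() from builds compiled with different -march values shows whether the SSE3 code path would be compiled in for that configuration.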
Benchmarks
The following benchmarks were produced by this gist: https://gist.github.com/Micky774/567a5fa199c05d90c4c08625b077840e
Summary: The SIMD implementations are ~2x faster than the current implementation for float32 and 1.5x faster for float64.
Discussion
I haven't looked too deeply into this yet, as first I wanted to see whether there was interest in the venture. I would love to hear what the other maintainers' thoughts are regarding exploring this route in a bit more detail. Obviously SIMD implementations will bring with them added complexity, but the performance gains are pretty compelling. In my opinion, the tradeoff is worth it.
CC: @scikit-learn/core-devs