Introduce SIMD intrinsics for `_dist_metrics.pyx`

# Context
Pairwise distance computation is an essential part of many estimators in scikit-learn, and can take up a significant portion of run time in certain workflows. I believe that we may achieve significant performance gains in several (perhaps most) distance metric implementations by leveraging SIMD intrinsics.

# Proof of Concept

I built a quick proof of concept to see what kind of performance gains we could observe with a potentially naive implementation using SIMD intrinsics. I chose to optimize the `ManhattanDistance.dist` function. This implementation uses intrinsics found in `SSE{1,2,3}`. To ensure that the instructions are supported, it checks for the presence of the `SSE3` instruction set (`SSE3` implies `SSE{1,2}`) and provides the optimized implementation if so. Otherwise it provides a dummy implementation just to appease Cython, and the main function falls back to the current implementation on `main`. Note that on most modern hardware, support for `SSE3` is a reasonable expectation (indeed, NumPy assumes it is always present when optimization is enabled). For the specific implementation referred to here, please take a look at this PR: https://github.com/Micky774/scikit-learn/pull/11
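To make the idea concrete for readers who have not opened the PR, here is a minimal sketch in C of what an SSE3-guarded Manhattan distance could look like. It is not the code from the PR: the function names `manhattan_scalar_f32` and `manhattan_f32` are made up for illustration, and the compile-time `__SSE3__` check (a GCC/Clang predefined macro) is only an assumption about how the guard might be expressed.

```c
#include <stddef.h>

#if defined(__SSE3__)
#include <pmmintrin.h>  /* SSE3 header; it also pulls in the SSE and SSE2 headers */
#endif

/* Hypothetical scalar reference: sum of |x[i] - y[i]| over n elements. */
static float manhattan_scalar_f32(const float *x, const float *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float d = x[i] - y[i];
        acc += (d < 0.0f) ? -d : d;
    }
    return acc;
}

/* Hypothetical SIMD entry point: uses SSE intrinsics when the compiler
 * targets SSE3, and otherwise falls back to the scalar loop. */
float manhattan_f32(const float *x, const float *y, size_t n) {
#if defined(__SSE3__)
    /* -0.0f has only the sign bit set, so ANDNOT with it clears the sign. */
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;

    /* Process four floats per iteration using unaligned loads. */
    for (; i + 4 <= n; i += 4) {
        __m128 diff = _mm_sub_ps(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i));
        acc = _mm_add_ps(acc, _mm_andnot_ps(sign_mask, diff));
    }

    /* Two horizontal adds (SSE3) reduce the four lanes to one total. */
    acc = _mm_hadd_ps(acc, acc);
    acc = _mm_hadd_ps(acc, acc);
    float total = _mm_cvtss_f32(acc);

    /* Handle the remaining 0 to 3 elements with the scalar loop. */
    return total + manhattan_scalar_f32(x + i, y + i, n - i);
#else
    return manhattan_scalar_f32(x, y, n);
#endif
}
```

Note that the SIMD path sums in a different order than the scalar loop, so results may differ in the last few bits for general floating-point inputs.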

Note that the full benefit of the intrinsics is gained when compiling with `-march="native"`; however, the benefit is still significant when compiling with `-march="nocona"`, as is often the default (e.g. when following the scikit-learn development instructions on Linux).
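One way to see what a given `-march` setting actually enables is to inspect the feature macros that GCC and Clang predefine. The sketch below assumes one of those compilers (MSVC does not define `__SSE3__`), and the helper name `compiled_simd_level` is hypothetical, not part of the PR.

```c
/* Hypothetical helper: reports the widest x86 SIMD extension that the
 * compiler was allowed to target, based on GCC/Clang predefined macros. */
const char *compiled_simd_level(void) {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "baseline (no SSE2 or newer macros defined)";
#endif
}
```

Compiling this with different `-march` values and printing the result shows which instruction sets the compiler may use for that build, which is also what decides whether an `__SSE3__`-guarded code path is compiled in.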

# Benchmarks
The following benchmarks were produced by this gist: https://gist.github.com/Micky774/567a5fa199c05d90c4c08625b077840e

### **Summary: The SIMD implementations are ~2x faster than the current implementation for `float32` and 1.5x faster for `float64`.**

<details>
<summary>Plots</summary>

![f2b1f1e8-59b0-4ec5-b91c-fe1d19abd9ec](https://user-images.githubusercontent.com/34613774/228374474-99ee13b3-228d-4c53-9dcc-3a8f8b639a9f.png)
</details>

# Discussion
I haven't looked too deeply into this yet, as first I wanted to see whether there was interest in the venture. I would love to hear what the other maintainers' thoughts are regarding exploring this route in a bit more detail. Obviously SIMD implementations will bring with them added complexity, but the performance gains are pretty compelling. In my opinion, the tradeoff is worth it.

CC: @scikit-learn/core-devs
