Description
Currently the dtype of the distance matrix returned by pairwise_distances is not very consistent: it depends on the metric and on the value of n_jobs.
For float64 input, everything is consistent: the returned matrix is always in float64.
For mixed float64 X and float32 Y, the returned matrix is also always float64, which is what should be expected imo.
The troubles come when both X and Y are float32.
- for sklearn metrics:
  - euclidean and cosine: result is always float32
  - manhattan: result is float64 if n_jobs=1 and float32 otherwise
- for scipy metrics: result is float64 if n_jobs=1 and float32 otherwise
Note that scipy cdist/pdist always returns float64.
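A quick way to check the behaviour described above on a given install is to print the output dtype for each metric and n_jobs combination. This is only a sketch: report_dtypes is a hypothetical helper, chebyshev is picked here as one example of a scipy-backed metric, and the dtypes printed will depend on the scikit-learn version.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import pairwise_distances


def report_dtypes(metrics=("euclidean", "cosine", "manhattan", "chebyshev"),
                  n_jobs_values=(1, 2)):
    """Return {(metric, n_jobs): dtype name} for float32 X and Y.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    rng = np.random.RandomState(0)
    X = rng.random_sample((20, 5)).astype(np.float32)
    Y = rng.random_sample((10, 5)).astype(np.float32)
    return {
        (metric, n_jobs): pairwise_distances(
            X, Y, metric=metric, n_jobs=n_jobs
        ).dtype.name
        for metric in metrics
        for n_jobs in n_jobs_values
    }


if __name__ == "__main__":
    for (metric, n_jobs), dtype in report_dtypes().items():
        print(f"metric={metric!r:12} n_jobs={n_jobs} -> {dtype}")
    # For comparison: scipy's cdist called directly on float32 input.
    X32 = np.ones((3, 2), dtype=np.float32)
    print("cdist ->", cdist(X32, X32).dtype)
```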
Hence the question: should pairwise_distances preserve float32?
My opinion is that it should, since pairwise_distances can be used as an intermediate step during fit and since there's ongoing work towards preserving float32 in estimators (see #11000 for transformers for instance).
An argument against that could be the need to limit numerical instabilities. A potential solution could be to use float64 accumulators for the intermediate computations only, while still returning a float32 distance matrix. Note that with #23958 we might not need to use the scipy metrics anymore, in favor of the ones defined in dist_metrics, and using float64 accumulators would be easier to implement generally.
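To make the float64 accumulator idea concrete, here is a minimal NumPy sketch for the manhattan metric. It is not how dist_metrics or #23958 implement it; manhattan_float64_accumulator is a hypothetical helper, and it assumes X and Y are 2D NumPy float arrays with the same number of columns.

```python
import numpy as np


def manhattan_float64_accumulator(X, Y):
    """Manhattan distances summed in float64, returned in the input dtype.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    # float32 only if both inputs are float32, float64 otherwise.
    out_dtype = np.result_type(X.dtype, Y.dtype)
    # Pairwise absolute differences, shape (n_X, n_Y, n_features).
    diff = np.abs(X[:, None, :] - Y[None, :, :])
    # The reduction over features uses a float64 accumulator.
    dist = diff.sum(axis=-1, dtype=np.float64)
    # Cast back so that float32 input gives a float32 distance matrix.
    return dist.astype(out_dtype, copy=False)
```

Only the reduction runs in float64 here, so the returned matrix costs the same memory as a float32 result. The intermediate diff array is materialized in full, which is fine for a demonstration but would not scale to large inputs.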
Answering this question will help avoid going in the wrong direction in #23958.