RFC: Should pairwise_distances preserve float32? #24502

Open
@jeremiedbb

Currently the dtype of the distance matrix returned by pairwise_distances is inconsistent: it depends on the metric and on the value of n_jobs.

For float64 input, everything is consistent: the returned matrix is always in float64.
For mixed float64 X and float32 Y, the returned matrix is also always in float64, which is what should be expected imo.

The trouble comes when both X and Y are float32.

  • for sklearn metrics:
    • euclidean and cosine: result is always float32
    • manhattan: result is float64 if n_jobs=1 and float32 otherwise
  • for scipy metrics: result is float64 if n_jobs=1 and float32 otherwise
    Note that scipy cdist/pdist always returns float64.
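To check the behavior listed above on a given installation, a small script along these lines could be used. It is only a sketch: the helper name `report_dtypes`, the array shapes, and the choice of `chebyshev` as a representative scipy metric are assumptions made for illustration, and the dtypes it prints depend on the installed scikit-learn version.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import pairwise_distances


def report_dtypes(metrics=("euclidean", "cosine", "manhattan", "chebyshev"),
                  n_jobs_values=(1, 2)):
    """Return {(metric, n_jobs): dtype of the result} for float32 X and Y."""
    rng = np.random.RandomState(0)
    X = rng.random_sample((20, 5)).astype(np.float32)
    Y = rng.random_sample((10, 5)).astype(np.float32)

    results = {}
    for metric in metrics:
        for n_jobs in n_jobs_values:
            D = pairwise_distances(X, Y, metric=metric, n_jobs=n_jobs)
            results[(metric, n_jobs)] = D.dtype
    return results


# Print one line per (metric, n_jobs) combination.
for (metric, n_jobs), dtype in report_dtypes().items():
    print(f"metric={metric:<10} n_jobs={n_jobs} -> {dtype}")

# For comparison: scipy's cdist called directly on float32 inputs.
X32 = np.ones((3, 2), dtype=np.float32)
print("scipy cdist ->", cdist(X32, X32, metric="chebyshev").dtype)
```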

Hence the question: should pairwise_distances preserve float32?

My opinion is that it should, since pairwise_distances can be used as an intermediate step during fit and since there's ongoing work towards preserving float32 in estimators (see #11000 for transformers, for instance).

An argument against preserving float32 is that computing in float64 reduces numerical instabilities. A potential solution could be to use float64 accumulators for the intermediate computations only, while still returning a float32 distance matrix. Note that with #23958 we might not need to use the scipy metrics anymore, in favor of the ones defined in dist_metrics, and using float64 accumulators would then be easier to implement generally.
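The float64-accumulator idea can be illustrated with a minimal NumPy sketch. This is not how scikit-learn or dist_metrics implements it; the function name `manhattan_float64_accumulator` is hypothetical, and the Manhattan metric is chosen only because it is simple to write down. The sum over features is accumulated in float64, and the result is stored in float32 when both inputs are float32.

```python
import numpy as np


def manhattan_float64_accumulator(X, Y):
    """Manhattan distances with float64 accumulation (illustrative sketch)."""
    X = np.asarray(X)
    Y = np.asarray(Y)
    # Output is float32 only when both inputs are float32;
    # mixed float32/float64 inputs give float64.
    out_dtype = np.result_type(X.dtype, Y.dtype)
    dist = np.empty((X.shape[0], Y.shape[0]), dtype=out_dtype)
    for i in range(X.shape[0]):
        diff = np.abs(Y - X[i])
        # Accumulate the per-feature sum in float64; the assignment
        # casts back to the output dtype.
        dist[i] = diff.sum(axis=1, dtype=np.float64)
    return dist


rng = np.random.RandomState(0)
X = rng.random_sample((4, 1000)).astype(np.float32)
Y = rng.random_sample((3, 1000)).astype(np.float32)
D = manhattan_float64_accumulator(X, Y)
print(D.dtype, D.shape)
```

With this approach only one float64 value per output entry is live at a time, so the memory footprint of the returned matrix stays that of float32.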

Answering this question will help avoid going in the wrong direction in #23958.
