MAINT Create private _pairwise_distances_reductions submodule #23724

Merged
merged 23 commits into from
Jun 24, 2022
Commits
23c175b
MAINT Introduce interfaces for PairwiseDistancesReductions
jjerphan Jun 1, 2022
c772d48
DEBUG Propagate sort_results
jjerphan Jun 7, 2022
590a8b0
DOC Reword
jjerphan Jun 8, 2022
ec40fea
DOC Document dispatchers and implementations
jjerphan Jun 9, 2022
470f231
Merge branch 'main' into maint/pairwise-distances-reductions-interfaces
jjerphan Jun 9, 2022
cd49d8a
MAINT Simply use python class for dispatchers
jjerphan Jun 10, 2022
c7dc987
DOC Improve comments
jjerphan Jun 10, 2022
e6f9c9a
Apply review comments
jjerphan Jun 14, 2022
3a8eb53
MAINT Create private _pairwise_distances_reductions submodule
jjerphan Jun 9, 2022
5c855fe
MAINT Group dispatchers in a single file
jjerphan Jun 10, 2022
a8c7dfa
DOC Fix typo
jjerphan Jun 15, 2022
fd754d5
DOC Move comment where appropriate
jjerphan Jun 15, 2022
79c8188
MAINT Dispatch _sqeuclidean_row_norms
jjerphan Jun 15, 2022
8f2c106
Merge branch 'main' into maint/pdr-private-submodule
jjerphan Jun 22, 2022
91420d6
Correctly handle read-only datasets
jjerphan Jun 22, 2022
8f791b6
Apply review comment
jjerphan Jun 22, 2022
229758c
Apply review comments
jjerphan Jun 23, 2022
303aac6
Apply suggestions from code review
jjerphan Jun 23, 2022
3480955
Merge branch 'main' into maint/pdr-private-submodule
jjerphan Jun 23, 2022
94c76fb
TST Remove useless globam random seed
jjerphan Jun 23, 2022
dc26da0
CI Trigger some tests on supporting global random seed [all random se…
jjerphan Jun 23, 2022
b089de3
CI Trigger [all random seeds]
thomasjpfan Jun 24, 2022
0c01ea9
Merge branch 'main' into maint/pdr-private-submodule
thomasjpfan Jun 24, 2022
20 changes: 0 additions & 20 deletions sklearn/metrics/_dist_metrics.pxd.tp
@@ -101,23 +101,3 @@ cdef class DistanceMetric{{name_suffix}}:
cdef DTYPE_t _dist_to_rdist(self, {{DTYPE_t}} dist) nogil except -1

{{endfor}}

######################################################################
# DatasetsPair base class
cdef class DatasetsPair:
cdef DistanceMetric distance_metric

cdef ITYPE_t n_samples_X(self) nogil

cdef ITYPE_t n_samples_Y(self) nogil

cdef DTYPE_t dist(self, ITYPE_t i, ITYPE_t j) nogil

cdef DTYPE_t surrogate_dist(self, ITYPE_t i, ITYPE_t j) nogil


cdef class DenseDenseDatasetsPair(DatasetsPair):
cdef:
const DTYPE_t[:, ::1] X
const DTYPE_t[:, ::1] Y
ITYPE_t d
161 changes: 0 additions & 161 deletions sklearn/metrics/_dist_metrics.pyx.tp
@@ -32,7 +32,6 @@ implementation_specific_values = [

import numpy as np
cimport numpy as cnp
from cython cimport final

cnp.import_array() # required in order to use C-API

@@ -1171,163 +1170,3 @@ cdef class PyFuncDistance{{name_suffix}}(DistanceMetric{{name_suffix}}):
"vectors and return a float.")

{{endfor}}

######################################################################
# Datasets Pair Classes
cdef class DatasetsPair:
"""Abstract class which wraps a pair of datasets (X, Y).

This class allows computing distances between a single pair of rows
of X and Y at a time given the pair of their indices (i, j). This class is
specialized for each metric thanks to the :func:`get_for` factory classmethod.

The handling of parallelization over chunks to compute the distances
and aggregation for several rows at a time is done in dedicated
subclasses of PairwiseDistancesReduction that in turn rely on
subclasses of DatasetsPair for each pair of rows in the data. The goal
is to make it possible to decouple the generic parallelization and
aggregation logic from metric-specific computation as much as
possible.

X and Y can be stored as C-contiguous np.ndarrays or CSR matrices
in subclasses.

This class avoids the overhead of dispatching distance computations
to :class:`sklearn.metrics.DistanceMetric` based on the physical
representation of the vectors (sparse vs. dense). It makes use of
cython.final to remove the overhead of dispatching method calls.

Parameters
----------
distance_metric: DistanceMetric
The distance metric responsible for computing distances
between two vectors of (X, Y).
"""

@classmethod
def get_for(
cls,
X,
Y,
str metric="euclidean",
dict metric_kwargs=None,
) -> DatasetsPair:
"""Return the DatasetsPair implementation for the given arguments.

Parameters
----------
X : {ndarray, sparse matrix} of shape (n_samples_X, n_features)
Input data.
If provided as a ndarray, it must be C-contiguous.
If provided as a sparse matrix, it must be in CSR format.

Y : {ndarray, sparse matrix} of shape (n_samples_Y, n_features)
Input data.
If provided as a ndarray, it must be C-contiguous.
If provided as a sparse matrix, it must be in CSR format.

metric : str, default='euclidean'
The distance metric to compute between rows of X and Y.
The default metric is a fast implementation of the Euclidean
metric. For a list of available metrics, see the documentation
of :class:`~sklearn.metrics.DistanceMetric`.

metric_kwargs : dict, default=None
Keyword arguments to pass to specified metric function.

Returns
-------
datasets_pair: DatasetsPair
The suitable DatasetsPair implementation.
"""
cdef:
DistanceMetric distance_metric = DistanceMetric.get_metric(
metric,
**(metric_kwargs or {})
)

if not(X.dtype == Y.dtype == np.float64):
raise ValueError(
f"Only 64bit float datasets are supported at this time, "
f"got: X.dtype={X.dtype} and Y.dtype={Y.dtype}."
)

# Metric-specific checks that do not replace nor duplicate `check_array`.
distance_metric._validate_data(X)
distance_metric._validate_data(Y)

# TODO: dispatch to other dataset pairs for sparse support once available:
if issparse(X) or issparse(Y):
raise ValueError("Only dense datasets are supported for X and Y.")

return DenseDenseDatasetsPair(X, Y, distance_metric)

def __init__(self, DistanceMetric distance_metric):
self.distance_metric = distance_metric

cdef ITYPE_t n_samples_X(self) nogil:
"""Number of samples in X."""
# This is an abstract method.
# This _must_ always be overridden in subclasses.
# TODO: add "with gil: raise" here when supporting Cython 3.0
return -999

cdef ITYPE_t n_samples_Y(self) nogil:
"""Number of samples in Y."""
# This is an abstract method.
# This _must_ always be overridden in subclasses.
# TODO: add "with gil: raise" here when supporting Cython 3.0
return -999

cdef DTYPE_t surrogate_dist(self, ITYPE_t i, ITYPE_t j) nogil:
return self.dist(i, j)

cdef DTYPE_t dist(self, ITYPE_t i, ITYPE_t j) nogil:
# This is an abstract method.
# This _must_ always be overridden in subclasses.
# TODO: add "with gil: raise" here when supporting Cython 3.0
return -1

@final
cdef class DenseDenseDatasetsPair(DatasetsPair):
"""Compute distances between row vectors of two arrays.

Parameters
----------
X: ndarray of shape (n_samples_X, n_features)
Rows represent vectors. Must be C-contiguous.

Y: ndarray of shape (n_samples_Y, n_features)
Rows represent vectors. Must be C-contiguous.

distance_metric: DistanceMetric
The distance metric responsible for computing distances
between two row vectors of (X, Y).
"""

def __init__(self, X, Y, DistanceMetric distance_metric):
super().__init__(distance_metric)
# Arrays have already been checked
self.X = X
self.Y = Y
self.d = X.shape[1]

@final
cdef ITYPE_t n_samples_X(self) nogil:
return self.X.shape[0]

@final
cdef ITYPE_t n_samples_Y(self) nogil:
return self.Y.shape[0]

@final
cdef DTYPE_t surrogate_dist(self, ITYPE_t i, ITYPE_t j) nogil:
return self.distance_metric.rdist(&self.X[i, 0],
&self.Y[j, 0],
self.d)

@final
cdef DTYPE_t dist(self, ITYPE_t i, ITYPE_t j) nogil:
return self.distance_metric.dist(&self.X[i, 0],
&self.Y[j, 0],
self.d)
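
The DatasetsPair docstring above describes a split between metric-specific per-pair computation and generic reduction logic. The following is a minimal pure-Python sketch of that idea, not the Cython code moved by this PR. All names ending in `Sketch`, the `argmin_per_row` helper, and the hard-coded Euclidean metric are assumptions made for illustration; the real classes take a DistanceMetric and operate on C-contiguous float64 arrays without the GIL.

```python
import math
from abc import ABC, abstractmethod


class DatasetsPairSketch(ABC):
    """Hypothetical stand-in for DatasetsPair: wraps a pair of datasets (X, Y)."""

    @classmethod
    def get_for(cls, X, Y):
        # Factory in the spirit of get_for. The real method also resolves the
        # metric and checks dtypes; this sketch only checks the feature count.
        if len(X[0]) != len(Y[0]):
            raise ValueError("X and Y must have the same number of features.")
        return DenseDenseEuclideanPairSketch(X, Y)

    @abstractmethod
    def n_samples_X(self):
        """Number of samples in X."""

    @abstractmethod
    def n_samples_Y(self):
        """Number of samples in Y."""

    @abstractmethod
    def dist(self, i, j):
        """Distance between row i of X and row j of Y."""

    def surrogate_dist(self, i, j):
        # Default mirrors the base class above: fall back to the true distance.
        return self.dist(i, j)


class DenseDenseEuclideanPairSketch(DatasetsPairSketch):
    """Dense rows stored as lists of floats (assumption: Euclidean metric)."""

    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

    def n_samples_X(self):
        return len(self.X)

    def n_samples_Y(self):
        return len(self.Y)

    def surrogate_dist(self, i, j):
        # Squared Euclidean distance: cheaper and preserves the ranking.
        return sum((a - b) ** 2 for a, b in zip(self.X[i], self.Y[j]))

    def dist(self, i, j):
        return math.sqrt(self.surrogate_dist(i, j))


def argmin_per_row(pair):
    """Generic reduction: for each row of X, index of the closest row of Y.

    It only talks to the pair through its four methods, so it never needs
    to know which metric or storage format sits behind them.
    """
    closest = []
    for i in range(pair.n_samples_X()):
        js = range(pair.n_samples_Y())
        closest.append(min(js, key=lambda j: pair.surrogate_dist(i, j)))
    return closest
```

The reduction ranks candidates with `surrogate_dist` and would call `dist` only when true distances are needed, which is the reason the two methods are kept separate in the interface.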