FEA Allow string input for pairwise distances #27456

julibeg · 2023-09-23T22:44:51Z

Reference Issues/PRs

closes #15932
closes #17991
closes #24674
supersedes #17991
supersedes #24674

What does this implement/fix? Explain your changes.

This allows the user to compute pairwise distances using a custom metric for non-numeric data types (e.g. string or boolean).
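
To make the use case concrete, here is a minimal pure-Python sketch of what a pairwise computation with a custom metric on strings amounts to. It does not call scikit-learn; `pairwise_with_callable` and `length_difference` are hypothetical names chosen for illustration, not functions from the library.

```python
def length_difference(a, b):
    """Toy metric on strings: absolute difference in length."""
    return abs(len(a) - len(b))


def pairwise_with_callable(X, Y=None, *, metric):
    """Apply `metric` to every pair (x, y) with x from X and y from Y.

    If Y is omitted, distances are computed between the elements of X.
    The result is a nested list with len(X) rows and len(Y) columns.
    """
    if Y is None:
        Y = X
    return [[metric(x, y) for y in Y] for x in X]


words = ["a", "ab", "abc"]

# Square matrix: every word against every other word.
distances = pairwise_with_callable(words, metric=length_difference)

# Rectangular matrix: X and Y hold different numbers of elements.
rectangular = pairwise_with_callable(words, ["a", "ab"], metric=length_difference)
```

In this sketch the inputs are never converted to numbers; only the metric looks at the elements, which is the behavior the PR aims to permit for callable metrics.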

Any other comments?

This PR just implements the changes from #24674 (which has stalled), including the last round of suggestions, and changes the name of the new parameter from check_length_only to only_check_num_samples.

github-actions · 2023-09-23T22:46:45Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 3df1d5c.

julibeg · 2023-09-24T01:00:52Z

added an extra test for pairwise dists on boolean arrays

glemaitre · 2023-10-31T08:19:28Z

@julibeg I solved the conflict, I will do a review now.

glemaitre

LGTM. Thanks @julibeg

We will need a second reviewer.

sklearn/metrics/tests/test_pairwise.py

glemaitre

I just pushed a small change to make codecov happy.

adrinjalali

Other than the public API note, LGTM.

sklearn/metrics/pairwise.py

Co-authored-by: Adrin Jalali

julibeg · 2023-12-28T18:39:42Z

I changed the API to use ensure_2d=True instead of only_check_num_samples=False (as suggested by @adrinjalali). To make it work I had to add dtype=None to the signature of _pairwise_callable() and add another test to make codecov happy. @glemaitre, @adrinjalali, could both of you have another quick look? Thanks!

adrinjalali

It's looking quite good to me now.

adrinjalali · 2024-01-03T10:49:45Z

sklearn/metrics/pairwise.py

+    elif ensure_2d:
+        if X.shape[1] != Y.shape[1]:
+            raise ValueError(
+                "Incompatible dimension for X and Y matrices: "
+                "X.shape[1] == %d while Y.shape[1] == %d" % (X.shape[1], Y.shape[1])
+            )
+    else:
+        # the distances are neither pre-computed nor is the input expected to be 2D; we
+        # thus only check if the number of samples is the same in both arrays
+        if len(X) != len(Y):
+            raise ValueError(
+                f"Incompatible length for X and Y matrices: len(X) == {len(X)} while "
+                f" len(Y) == {len(Y)}"
+            )


seems like this can be replaced with:

Suggested change

-    elif ensure_2d:
-        if X.shape[1] != Y.shape[1]:
-            raise ValueError(
-                "Incompatible dimension for X and Y matrices: "
-                "X.shape[1] == %d while Y.shape[1] == %d" % (X.shape[1], Y.shape[1])
-            )
-    else:
-        # the distances are neither pre-computed nor is the input expected to be 2D; we
-        # thus only check if the number of samples is the same in both arrays
-        if len(X) != len(Y):
-            raise ValueError(
-                f"Incompatible length for X and Y matrices: len(X) == {len(X)} while "
-                f" len(Y) == {len(Y)}"
-            )
+    else:
+        check_consistent_length(X, Y)

jeremiedbb · 2024-03-08

check_consistent_length is not appropriate here since we want to ensure that both arrays have the same number of features. In fact, the added check is wrong because even when input is non-numeric, we don't want to force an equal number of samples for X and Y. I just removed it.

adrinjalali · 2024-01-03T10:55:40Z

sklearn/metrics/pairwise.py

+        if Y is not None:
+            y_dtype = Y.dtype if hasattr(Y, "dtype") else type(Y[0])
+            if dtype != y_dtype:
+                raise TypeError("X and Y have different dtypes.")


this seems to be a check that didn't exist before, do we really need it? The distance function is the one deciding whether it can handle mixed types or not I think.

jeremiedbb · 2024-03-08

I agree. I guess it was added to prevent check_pairwise_arrays from converting to float. But instead I removed this check and made check_pairwise_arrays preserve the dtype when using a custom metric.
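
As an illustration of leaving that decision to the metric, the following plain-Python sketch shows one callable that rejects mixed types itself and another that tolerates them. Both `strict_string_metric` and `lenient_metric` are hypothetical examples, not part of scikit-learn.

```python
def strict_string_metric(a, b):
    """Hypothetical metric that only accepts a pair of strings."""
    if not (isinstance(a, str) and isinstance(b, str)):
        raise TypeError(
            f"expected two strings, got {type(a).__name__} and {type(b).__name__}"
        )
    # Count positions at which the strings differ, plus the length gap.
    mismatches = sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def lenient_metric(a, b):
    """Hypothetical metric that tolerates mixed types by comparing str() forms."""
    return strict_string_metric(str(a), str(b))
```

With metrics written this way, a separate "X and Y have different dtypes" check in the validation layer would only get in the way of the lenient variant.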

doc/whats_new/v1.4.rst

jeremiedbb

Thanks for the PR @julibeg. While trying it out, I noticed an inconsistency between the dtype parameter of check_pairwise_arrays and that of check_array. I also found that one of the tests, which ensures that both arrays have the same length, is incorrect: different lengths are allowed, since this function computes distances between all pairs of elements from X and Y. I pushed fixes for these directly, along with some clean-ups.

jeremiedbb

LGTM. Wanna have another look @glemaitre or @adrinjalali ?

jeremiedbb · 2024-03-08T14:42:00Z

sklearn/metrics/pairwise.py

    *,
    precomputed=False,
-    dtype=None,
+    dtype="infer_float",


I changed the default dtype in check_pairwise_arrays, and also the behavior of dtype=None, because it was inconsistent with the behavior of dtype=None in check_array (i.e. preserve the dtype).

Now:
dtype="infer_float" behaves as dtype=None did previously, i.e. it converts to an appropriate float dtype.
dtype=None preserves the input dtype.
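
The difference between the two settings can be sketched without scikit-learn. The toy `coerce` function below is an assumption made purely for illustration: it mimics the described semantics on plain Python lists and is not how check_pairwise_arrays is implemented.

```python
def coerce(values, dtype="infer_float"):
    """Toy model of the two dtype settings described above.

    dtype="infer_float": convert every element to float.
    dtype=None: return the elements unchanged, whatever their type.
    """
    if dtype == "infer_float":
        return [float(v) for v in values]
    if dtype is None:
        return list(values)
    raise ValueError(f"unsupported dtype in this sketch: {dtype!r}")


numbers = coerce([1, 2, 3])                # elements become floats
strings = coerce(["a", "ab"], dtype=None)  # strings pass through untouched

try:
    coerce(["a", "ab"])                    # float("a") raises ValueError
    string_conversion_failed = False
except ValueError:
    string_conversion_failed = True
```

This is why string input needs the dtype-preserving setting: a forced float conversion fails before the custom metric ever sees the data.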

check_pairwise_arrays is private and I checked that scikit-learn doesn't use the dtype parameter anywhere other than in pairwise_distances. That's why I changed the default and the behavior directly. Let me know if you're okay with that.

jeremiedbb · 2024-03-18T17:42:04Z

Actually, I think that we don't need to clutter the public API with a new ensure_2d parameter that is only used if metric is a callable. Discussing it with @glemaitre, we think that we should instead always disable this kind of check when the metric is custom and leave responsibility to the user. I pushed c5849c5 to remove the addition of ensure_2d in pairwise_distances.

@glemaitre this is ready for a final review

adrinjalali

Otherwise LGTM.

adrinjalali · 2024-03-19T10:24:38Z

sklearn/metrics/pairwise.py

                "for %d indexed." % (X.shape[0], X.shape[1], Y.shape[0])
            )
-    elif X.shape[1] != Y.shape[1]:
+    elif X.ndim == 2 and X.shape[1] != Y.shape[1]:


I'm not sure why we wouldn't do this if X is one dimensional

Suggested change

-    elif X.ndim == 2 and X.shape[1] != Y.shape[1]:
+    elif _num_samples(X) != _num_samples(Y):

jeremiedbb · 2024-03-19

Because we're not comparing n_samples but n_features instead. To be able to compute pairwise distances both arrays need to have the same number of features.

We could use _num_features but it errors on 1d arrays so I wanted to keep it simple and just enable the check when we enforce 2d arrays. I can replace the if X.ndim == 2 by if ensure_2d if it makes the reason clearer but it's essentially the same.

adrinjalali · 2024-03-19

I think a comment would be enough there then.
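
The rule discussed in this thread can be sketched in plain Python. `ndim` and `check_n_features` are hypothetical helpers that operate on nested lists and exist only to illustrate the condition; they are not scikit-learn functions.

```python
def ndim(data):
    """Return 2 for a list of lists/tuples, 1 for a flat sequence of scalars."""
    first = data[0]
    return 2 if isinstance(first, (list, tuple)) else 1


def check_n_features(X, Y):
    """Compare the number of features only when X is 2D.

    For 1D input (e.g. a list of strings) there is no feature axis, so the
    check is skipped and the metric is trusted to handle the elements.
    The number of samples is never compared: X and Y may differ in length.
    """
    if ndim(X) == 2 and len(X[0]) != len(Y[0]):
        raise ValueError(
            "Incompatible dimension for X and Y matrices: "
            f"X.shape[1] == {len(X[0])} while Y.shape[1] == {len(Y[0])}"
        )


# 2D input, same number of features, different number of samples: accepted.
check_n_features([[0, 1], [1, 0], [1, 1]], [[1, 1]])

# 1D input of strings: the feature check is skipped entirely.
check_n_features(["a", "ab", "abc"], ["a"])
```

A 2D pair with mismatched feature counts, such as `[[0, 1]]` against `[[0, 1, 2]]`, raises ValueError in this sketch.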

julibeg added 4 commits September 23, 2023 23:16

changes from previous PR (#24674)

82b3c78

apply suggestions from previous PR (#24674)

725bab4

rename new param from 'check_length_only' to 'only_check_num_samples'

0ad2c87

formatting

72dc318

github-actions bot added the module:metrics label Sep 23, 2023

julibeg mentioned this pull request Sep 23, 2023

FEA Allow string input for pairwise distances #24674

Closed

julibeg added 3 commits September 23, 2023 23:49

add PR number to docs

047ac41

update comment

b79ca68

add test for pairwise distances on boolean array

941ba1b

julibeg and others added 4 commits September 24, 2023 20:55

add only_check_num_samples to @validate_params

962d0a1

fix whitespace

ac9ccd3

merge in main

07be233

Merge branch 'main' into allow-non-numeric-input-for-pairwise-distances

a86a954

glemaitre changed the title ~~Allow non numeric input for pairwise distances~~ FEA Allow string input for pairwise distances Oct 31, 2023

glemaitre self-requested a review October 31, 2023 08:15

Merge remote-tracking branch 'origin/main' into pr/julibeg/27456

a0d4205

glemaitre approved these changes Oct 31, 2023

glemaitre added this to the 1.4 milestone Oct 31, 2023

glemaitre reviewed Nov 2, 2023

sklearn/metrics/tests/test_pairwise.py

Update sklearn/metrics/tests/test_pairwise.py

8d7a266

glemaitre reviewed Nov 2, 2023

adrinjalali reviewed Dec 7, 2023

sklearn/metrics/pairwise.py

jeremiedbb modified the milestones: 1.4, 1.5 Dec 21, 2023

julibeg and others added 3 commits December 23, 2023 15:57

Add suggestion PR to sklearn/metrics/pairwise.py

90bb14c

Co-authored-by: Adrin Jalali

replace only_check_num_samples with ensure_2d

eb2bf17

Merge branch 'main' into allow-non-numeric-input-for-pairwise-distances

d5d3e0b

julibeg and others added 5 commits December 27, 2023 18:08

pass dtype to _pairwise_callable()

f4a9dc7

use hasattr to check for dtype instead of checking if input is ndarray

0421408

remove duplicated statement

b3c0d13

add test for pairwise distances between arrays with different dtypes

b4377c4

Merge branch 'main' into allow-non-numeric-input-for-pairwise-distances

e6e4dfa

adrinjalali reviewed Jan 3, 2024

doc/whats_new/v1.4.rst

jeremiedbb added 4 commits March 8, 2024 14:59

cln dype inconsistency + fix tests

daa1410

Merge remote-tracking branch 'upstream/main' into pr/julibeg/27456

733e8fc

target 1.5

d2ab406

cln

1e78d84

jeremiedbb reviewed Mar 8, 2024

jeremiedbb approved these changes Mar 8, 2024

jeremiedbb reviewed Mar 8, 2024

simplify: remove ensure_2d

c5849c5

jeremiedbb added 2 commits March 18, 2024 18:43

cln

227b639

cln

82cf8d7

adrinjalali reviewed Mar 19, 2024

comment on validation skip for non-2d arrays

3df1d5c

adrinjalali approved these changes Mar 19, 2024

adrinjalali enabled auto-merge (squash) March 19, 2024 14:08

adrinjalali merged commit 65a40e5 into scikit-learn:main Mar 19, 2024

This was referenced Apr 9, 2025

Array API support for pairwise kernels #29822

Merged

Inconsistent check_pairwise_arrays in pairwise_distances #31162

Closed
