Conversation

@julibeg
Contributor

@julibeg julibeg commented Sep 23, 2023

Reference Issues/PRs

closes #15932
closes #17991
closes #24674
supersedes #17991
supersedes #24674

What does this implement/fix? Explain your changes.

This allows the user to compute pairwise distances using a custom metric for non-numeric data types (e.g. string or boolean).

Any other comments?

This PR just implements the changes from #24674 (which has stalled), including the last round of suggestions, and changes the name of the new parameter from check_length_only to only_check_num_samples.
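
To make the goal concrete, here is a minimal pure-Python sketch of what "pairwise distances with a custom metric on non-numeric data" means. It is not scikit-learn's implementation: `hamming_str` and `pairwise_callable_sketch` are hypothetical names chosen for illustration, and the metric itself is an arbitrary example.

```python
# Illustrative sketch only, not scikit-learn's implementation.
def hamming_str(a: str, b: str) -> int:
    """Hypothetical custom metric: mismatching positions plus the length gap."""
    mismatches = sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def pairwise_callable_sketch(X, Y, metric):
    """Return a len(X) x len(Y) nested list holding metric(x, y) for every pair."""
    return [[metric(x, y) for y in Y] for x in X]


# X and Y hold strings and do not need to have the same number of samples.
X = ["kitten", "mitten"]
Y = ["kitten", "sitting", "bitten"]
D = pairwise_callable_sketch(X, Y, hamming_str)
```

The result has one row per element of `X` and one column per element of `Y`. Whether the metric can handle a given input type is decided entirely by the callable.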

@github-actions

github-actions bot commented Sep 23, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 3df1d5c. Link to the linter CI: here

@julibeg
Contributor Author

julibeg commented Sep 24, 2023

added an extra test for pairwise dists on boolean arrays

@glemaitre glemaitre changed the title Allow non numeric input for pairwise distances FEA Allow string input for pairwise distances Oct 31, 2023
@glemaitre glemaitre self-requested a review October 31, 2023 08:15
@glemaitre
Member

@julibeg I solved the conflict, I will do a review now.

Member

@glemaitre glemaitre left a comment

LGTM. Thanks @julibeg

We will need a second reviewer.

@glemaitre glemaitre added this to the 1.4 milestone Oct 31, 2023
Member

@glemaitre glemaitre left a comment

I just pushed a small change to make codecov happy.

Member

@adrinjalali adrinjalali left a comment

Other than the public API note, LGTM.

@jeremiedbb jeremiedbb modified the milestones: 1.4, 1.5 Dec 21, 2023
@julibeg
Copy link
Contributor Author

julibeg commented Dec 28, 2023

I changed the API to use ensure_2d=True in place of only_check_num_samples=False (as suggested by @adrinjalali). To make it work I had to add dtype=None to the signature of _pairwise_callable() and add another test to make codecov happy. @glemaitre, @adrinjalali, could both of you have another quick look? Thanks!

Member

@adrinjalali adrinjalali left a comment

It's looking quite good to me now.

Comment on lines 199 to 212
    elif ensure_2d:
        if X.shape[1] != Y.shape[1]:
            raise ValueError(
                "Incompatible dimension for X and Y matrices: "
                "X.shape[1] == %d while Y.shape[1] == %d" % (X.shape[1], Y.shape[1])
            )
    else:
        # the distances are neither pre-computed nor is the input expected to be 2D; we
        # thus only check if the number of samples is the same in both arrays
        if len(X) != len(Y):
            raise ValueError(
                f"Incompatible length for X and Y matrices: len(X) == {len(X)} while "
                f" len(Y) == {len(Y)}"
            )
Member

seems like this can be replaced with:

Suggested change
-    elif ensure_2d:
-        if X.shape[1] != Y.shape[1]:
-            raise ValueError(
-                "Incompatible dimension for X and Y matrices: "
-                "X.shape[1] == %d while Y.shape[1] == %d" % (X.shape[1], Y.shape[1])
-            )
-    else:
-        # the distances are neither pre-computed nor is the input expected to be 2D; we
-        # thus only check if the number of samples is the same in both arrays
-        if len(X) != len(Y):
-            raise ValueError(
-                f"Incompatible length for X and Y matrices: len(X) == {len(X)} while "
-                f" len(Y) == {len(Y)}"
-            )
+    else:
+        check_consistent_length(X, Y)

Member

check_consistent_length is not appropriate here since we want to ensure that both arrays have the same number of features. In fact, the added check is wrong because even when input is non-numeric, we don't want to force an equal number of samples for X and Y. I just removed it.

    if Y is not None:
        y_dtype = Y.dtype if hasattr(Y, "dtype") else type(Y[0])
        if dtype != y_dtype:
            raise TypeError("X and Y have different dtypes.")
Member

this seems to be a check that didn't exist before, do we really need it? The distance function is the one deciding whether it can handle mixed types or not I think.

Member

I agree. I guess it was added to prevent check_pairwise_arrays from converting to float. But instead I removed this check and made check_pairwise_arrays preserve the dtype when using a custom metric.

Member

@jeremiedbb jeremiedbb left a comment

Thanks for the PR @julibeg. While trying it out, I noticed an inconsistency between the dtype parameter of check_pairwise_arrays and that of check_array. I also found that one of the tests, which ensures that both arrays have the same length, is not correct: different lengths are allowed, since this function computes distances between all pairs of elements from X and Y. I pushed fixes for these directly, along with some clean-ups.

Member

@jeremiedbb jeremiedbb left a comment

LGTM. Wanna have another look @glemaitre or @adrinjalali ?

     *,
     precomputed=False,
-    dtype=None,
+    dtype="infer_float",
Member

I changed the default dtype in check_pairwise_arrays and also the behavior of dtype=None because it was inconsistent with the behavior of dtype=None in check_array (i.e. preserve dtype).

Now:
dtype="infer_float" has the same behavior that dtype=None had previously, that is, convert to an appropriate float dtype.
dtype=None preserves the input dtype.

check_pairwise_arrays is private and I checked that scikit-learn doesn't use the dtype parameter anywhere other than in pairwise_distances. That's why I directly changed the default and the behavior. Let me know if you're okay with that.
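
The two dtype modes described above can be sketched roughly as follows. This is a hedged illustration, not the actual code: `resolve_dtype_sketch` is a hypothetical helper, dtypes are modelled as plain strings, and the float32 rule is an assumption about how "appropriate float" is chosen.

```python
# Illustrative sketch only; dtypes are modelled as plain strings.
def resolve_dtype_sketch(x_dtype, y_dtype, dtype="infer_float"):
    """Return the dtype name to convert to, or None to keep the inputs as they are."""
    if dtype == "infer_float":
        # Assumed rule: stay in float32 only if both inputs already are float32.
        if x_dtype == "float32" and y_dtype == "float32":
            return "float32"
        return "float64"
    # dtype=None means "preserve the input dtype"; any other value is used as given.
    return dtype


default_mixed = resolve_dtype_sketch("int64", "float32")    # converted to a float
preserved = resolve_dtype_sketch("str", "str", dtype=None)  # no conversion requested
```

With this split, a custom metric over strings or booleans can request dtype=None so that its inputs are never coerced to float.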

@jeremiedbb
Member

Actually, I think that we don't need to clutter the public API with a new ensure_2d parameter that is only used if metric is a callable. Discussing it with @glemaitre, we think that we should instead always disable this kind of check when the metric is custom and leave responsibility to the user. I pushed c5849c5 to remove the addition of ensure_2d in pairwise_distances.

@glemaitre this is ready for a final review

Member

@adrinjalali adrinjalali left a comment

Otherwise LGTM.

"for %d indexed." % (X.shape[0], X.shape[1], Y.shape[0])
)
-    elif X.shape[1] != Y.shape[1]:
+    elif X.ndim == 2 and X.shape[1] != Y.shape[1]:
Member

I'm not sure why we wouldn't do this if X is one dimensional

Suggested change
-    elif X.ndim == 2 and X.shape[1] != Y.shape[1]:
+    elif _num_samples(X) != _num_samples(Y):

Member

@jeremiedbb jeremiedbb Mar 19, 2024

Because we're not comparing n_samples but n_features instead. To be able to compute pairwise distances both arrays need to have the same number of features.

We could use _num_features but it errors on 1d arrays so I wanted to keep it simple and just enable the check when we enforce 2d arrays. I can replace the if X.ndim == 2 by if ensure_2d if it makes the reason clearer but it's essentially the same.
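
As a rough illustration of the reasoning above, the sketch below compares the number of features only when the inputs are 2D and never compares the number of samples. It uses nested lists instead of arrays; `shape_of` and `check_n_features_sketch` are hypothetical helpers, and the extra guard on Y being 2D is an addition made so the sketch is safe on its own.

```python
# Illustrative sketch only, using nested lists instead of arrays.
def shape_of(a):
    """Shape of a list: (n,) for a flat list, (n, m) for a list of lists/tuples."""
    if a and isinstance(a[0], (list, tuple)):
        return (len(a), len(a[0]))
    return (len(a),)


def check_n_features_sketch(X, Y):
    """Raise ValueError if both inputs are 2D and their feature counts differ."""
    x_shape, y_shape = shape_of(X), shape_of(Y)
    # 1D inputs (e.g. lists of strings) have no feature axis, so the check is skipped.
    if len(x_shape) == 2 and len(y_shape) == 2 and x_shape[1] != y_shape[1]:
        raise ValueError(
            "Incompatible dimension for X and Y matrices: "
            f"X.shape[1] == {x_shape[1]} while Y.shape[1] == {y_shape[1]}"
        )


check_n_features_sketch([[0, 1], [2, 3]], [[4, 5]])  # different n_samples: allowed
check_n_features_sketch(["a", "bb"], ["ccc"])        # 1D input: check skipped
```

Passing X with two features and Y with three features per row would raise a ValueError.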

Member

I think a comment would be enough there then.

Development

Successfully merging this pull request may close these issues.

compute pairwise_distance with custom metric function for non-numeric data