-
-
Notifications
You must be signed in to change notification settings - Fork 26.4k
FEA Allow string input for pairwise distances #27456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEA Allow string input for pairwise distances #27456
Conversation
|
added an extra test for pairwise dists on boolean arrays |
|
@julibeg I solved the conflict, I will do a review now. |
glemaitre
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @julibeg
We will need a second reviewer.
glemaitre
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just pushed a small change to make codecov happy.
adrinjalali
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the public API note, LGTM.
|
I changed the API to use |
adrinjalali
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's looking quite good to me now.
sklearn/metrics/pairwise.py
Outdated
| elif ensure_2d: | ||
| if X.shape[1] != Y.shape[1]: | ||
| raise ValueError( | ||
| "Incompatible dimension for X and Y matrices: " | ||
| "X.shape[1] == %d while Y.shape[1] == %d" % (X.shape[1], Y.shape[1]) | ||
| ) | ||
| else: | ||
| # the distances are neither pre-computed nor is the input expected to be 2D; we | ||
| # thus only check if the number of samples is the same in both arrays | ||
| if len(X) != len(Y): | ||
| raise ValueError( | ||
| f"Incompatible length for X and Y matrices: len(X) == {len(X)} while " | ||
| f" len(Y) == {len(Y)}" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like this can be replaced with:
| elif ensure_2d: | |
| if X.shape[1] != Y.shape[1]: | |
| raise ValueError( | |
| "Incompatible dimension for X and Y matrices: " | |
| "X.shape[1] == %d while Y.shape[1] == %d" % (X.shape[1], Y.shape[1]) | |
| ) | |
| else: | |
| # the distances are neither pre-computed nor is the input expected to be 2D; we | |
| # thus only check if the number of samples is the same in both arrays | |
| if len(X) != len(Y): | |
| raise ValueError( | |
| f"Incompatible length for X and Y matrices: len(X) == {len(X)} while " | |
| f" len(Y) == {len(Y)}" | |
| ) | |
| else: | |
| check_consistent_length(X, Y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check_consistent_length is not appropriate here since we want to ensure that both arrays have the same number of features. In fact, the added check is wrong because even when input is non-numeric, we don't want to force an equal number of samples for X and Y. I just removed it.
sklearn/metrics/pairwise.py
Outdated
| if Y is not None: | ||
| y_dtype = Y.dtype if hasattr(Y, "dtype") else type(Y[0]) | ||
| if dtype != y_dtype: | ||
| raise TypeError("X and Y have different dtypes.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems to be a check that didn't exist before, do we really need it? The distance function is the one deciding whether it can handle mixed types or not I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I guess it was added to prevent check_pairwise_arrays from converting to float. But instead I removed this check and made check_pairwise_arrays preserve the dtype when using a custom metric.
jeremiedbb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @julibeg. While trying it out, I noticed an inconsistency between the dtype parameter of check_pairwise_arrays and the one of check_array. I also found that one of the test ensuring that both arrays have the same length is not correct: Different length are allowed, this function computes distances between all pairs of elements from X and Y. I directly pushed fixes for these along with some clean-ups.
jeremiedbb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Wanna have another look @glemaitre or @adrinjalali ?
| *, | ||
| precomputed=False, | ||
| dtype=None, | ||
| dtype="infer_float", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the default dtype in check_pairwaise_arrays and also the behavior of dtype=None because it was inconsistent with the behavior of dtype=None in check_array (i.e. preserve dtype).
Now:
dtype="infer_float" has the same behavior as previously dtype=None, that is convert to appropriate float.
dtype=None preserves the input dtype.
check_pairwaise_arrays is private and I checked that scikit-learn doesn't use the dtype parameter anywhere else than in pairwise_distances. That's why I directly changed the default and the behavior. Let me know if you're okay with that.
|
Actually, I think that we don't need to clutter the public API with a new @glemaitre this is ready for a final review |
adrinjalali
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM.
sklearn/metrics/pairwise.py
Outdated
| "for %d indexed." % (X.shape[0], X.shape[1], Y.shape[0]) | ||
| ) | ||
| elif X.shape[1] != Y.shape[1]: | ||
| elif X.ndim == 2 and X.shape[1] != Y.shape[1]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we wouldn't do this if X is one dimensional
| elif X.ndim == 2 and X.shape[1] != Y.shape[1]: | |
| elif _num_samples(X) != _num_samples(Y): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we're not comparing n_samples but n_features instead. To be able to compute pairwise distances both arrays need to have the same number of features.
We could use _num_features but it errors on 1d arrays so I wanted to keep it simple and just enable the check when we enforce 2d arrays. I can replace the if X.ndim == 2 by if ensure_2d if it makes the reason clearer but it's essentially the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a comment would be enough there then.
Reference Issues/PRs
closes #15932
closes #17991
closes #24674
supersedes #17991
supersedes #24674
What does this implement/fix? Explain your changes.
This allows the user to compute pairwise distance using a custom metric for non-metric data types (e.g. string or boolean).
Any other comments?
This PR just implements the changes from #24674 (which has stalled), including the last round of suggestions, and changes the name of the new parameter from
check_length_onlytoonly_check_num_samples.