FEA Add array API support to davies_bouldin_score (#32693)
Conversation
virchan left a comment
LGTM! Thanks, @jaffourt!
CUDA CI is green as well. @lucyleeow, @OmarManzoor, friendly ping: would you like to take a look?
OmarManzoor left a comment
LGTM. Thanks @jaffourt
```diff
-    intra_dists = np.zeros(n_labels)
-    centroids = np.zeros((n_labels, len(X[0])), dtype=float)
+    dtype = _max_precision_float_dtype(xp, device_)
```
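For context, a minimal sketch of what the array API-style allocations replacing the NumPy calls above might look like (names follow the diff; the PR's exact code may differ):

```python
# Hypothetical sketch, not the PR's verbatim code: allocate the intermediate
# arrays in the input's array namespace, on the input's device, using the
# highest float precision that device supports.
from sklearn.utils._array_api import (
    _max_precision_float_dtype,
    get_namespace_and_device,
)

def _allocate_intermediates(X, n_labels):
    xp, _, device_ = get_namespace_and_device(X)
    dtype = _max_precision_float_dtype(xp, device_)
    intra_dists = xp.zeros(n_labels, dtype=dtype, device=device_)
    centroids = xp.zeros((n_labels, X.shape[1]), dtype=dtype, device=device_)
    return intra_dists, centroids
```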
Just a quick question: why `_max_precision_float_dtype` and not `_find_matching_floating_dtype` here?
It's a good question and I don't have a solid answer. I have seen a number of different approaches for finding a common floating dtype in array API-compliant calculations, and it would be nice to have a more defined heuristic if needed :)
In this case, my reasoning for `_max_precision_float_dtype` was that `_find_matching_floating_dtype` finds a common float dtype among several existing arrays, whereas in this implementation we are building new floating arrays for intermediate calculations.
I.e., does it make sense to use the floating dtype returned from `_find_matching_floating_dtype(X, labels)`?
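For anyone following along, a quick illustrative contrast of the two private helpers in `sklearn.utils._array_api` (a sketch; exact behavior may vary by version):

```python
import numpy as np

from sklearn.utils._array_api import (
    _find_matching_floating_dtype,
    _max_precision_float_dtype,
    get_namespace_and_device,
)

X = np.asarray([[0.0, 1.0], [1.0, 0.0]], dtype=np.float32)
labels = np.asarray([0, 1])  # integer labels carry no floating dtype

xp, _, device_ = get_namespace_and_device(X, labels)

# Highest precision the device supports, regardless of input dtypes
# (float64 for NumPy on CPU):
print(_max_precision_float_dtype(xp, device_))

# Common floating dtype of the inputs (float32 here, taken from X):
print(_find_matching_floating_dtype(X, labels, xp=xp))
```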
Yes, we could have better consistency around the use of these two.
From my understanding, `_max_precision_float_dtype` is used when we require the highest precision, e.g. here:

scikit-learn/sklearn/metrics/_ranking.py, lines 969 to 974 at de38166:
```python
# Perform the weighted cumulative sum using float64 precision when possible
# to avoid numerical stability problem with tens of millions of very noisy
# predictions:
# https://github.com/scikit-learn/scikit-learn/issues/31533#issuecomment-2967062437
y_true = xp.astype(y_true, max_float_dtype)
tps = xp.cumulative_sum(y_true * weight, dtype=max_float_dtype)[threshold_idxs]
```
`cumulative_sum` is notorious for floating-point precision errors, so it's best to use the highest precision there.
I was just wondering if `intra_dists` had a particular requirement for higher precision.
Probably not worth worrying about here (I think?)
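As a side note, a tiny NumPy illustration (not from the thread) of the drift a plain float32 cumulative sum can accumulate:

```python
import numpy as np

x = np.full(10_000_000, 0.1, dtype=np.float32)

# Sequential float32 accumulation picks up rounding error at every step;
# the same sum carried out in float64 stays essentially exact (~1,000,000).
print(np.cumsum(x)[-1])
print(np.cumsum(x.astype(np.float64))[-1])
```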
Reference Issues/PRs
Towards #26024
What does this implement/fix? Explain your changes.
Add array API support to `davies_bouldin_score`.

Any other comments?
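For reference, a sketch of how the new dispatch path would typically be exercised (using `array_api_strict` as a stand-in for a GPU array library; this follows scikit-learn's general array API usage pattern, not code from this PR):

```python
import array_api_strict as xp

from sklearn import config_context
from sklearn.metrics import davies_bouldin_score

X = xp.asarray([[0.0, 1.0], [0.1, 1.1], [5.0, 5.0], [5.1, 4.9]])
labels = xp.asarray([0, 0, 1, 1])

# With dispatch enabled, the metric computes through the input's array
# namespace instead of converting to NumPy first.
with config_context(array_api_dispatch=True):
    score = davies_bouldin_score(X, labels)

print(float(score))
```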