ENH Enables array_api for LinearDiscriminantAnalysis #102
Conversation
I left some comments highlighting some decision points that differ from #99.
sklearn/discriminant_analysis.py
Outdated
if is_array_api:
    for i in range(classes.shape[0]):
        means[i, :] = np.mean(X[y == i], axis=0)
else:
    cnt = np.bincount(y)
    np.add.at(means, y, X)
    means /= cnt[:, None]
Since the array API does not have np.add.at, we use a for loop to do the same computation.
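For concreteness, a tiny self-contained illustration (toy data, not the PR's code) of how the two branches compute the same class means:

import numpy as np

X = np.asarray([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.asarray([0, 1, 0])  # integer-encoded class labels
means = np.zeros((2, 2))

# NumPy branch: unbuffered scatter-add, so rows sharing a label accumulate
np.add.at(means, y, X)            # means[y[i]] += X[i] for every i
means /= np.bincount(y)[:, None]  # divide each class row by its count
# the loop branch computes the same thing: means[i] = X[y == i].mean(axis=0)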
sklearn/discriminant_analysis.py
Outdated
if is_array_api:
    svd = np.linalg.svd
else:
    svd = scipy.linalg.svd
This one is a bit interesting. Since I already have a wrapper for numpy, _NumPyApiWrapper, it could make sense to make numpy.scipy == scipy to avoid this conditional.
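A hypothetical sketch of that idea (the class names echo the PR's _NumPyApiWrapper, but the bodies here are illustrative, not the PR's code): both wrappers expose a scipy attribute, so callers can unconditionally write xp.scipy.linalg.svd.

import types
import numpy
import scipy as _scipy
import scipy.linalg  # ensure scipy.linalg is loaded as an attribute

class _NumPyApiWrapper:
    scipy = _scipy  # real SciPy: LAPACK-backed svd for NumPy inputs

    def __getattr__(self, name):
        # everything else falls through to the top-level numpy namespace
        return getattr(numpy, name)

class _ArrayApiWrapper:
    def __init__(self, array_namespace):
        self._namespace = array_namespace
        # shim: point scipy.linalg at the namespace's own linalg
        self.scipy = types.SimpleNamespace(linalg=array_namespace.linalg)

    def __getattr__(self, name):
        return getattr(self._namespace, name)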
sklearn/utils/_array_api.py
Outdated
def astype(self, x, dtype, *args, **kwargs):
    # astype is not defined in the top-level numpy namespace
    return x.astype(dtype, *args, **kwargs)
I wrap numpy because I want np.astype to exist. In a sense, this is making NumPy more like the array API. The alternative is to add astype to the array_api.Array object, but that involves patching the object.
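For illustration, with the wrapper in place the calling code can stay backend-agnostic (assuming the get_namespace helper from this PR):

xp, is_array_api = get_namespace(X)
X = xp.astype(X, xp.float64)  # works for wrapped NumPy and array API arrays alike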
ys_labels = set(chain.from_iterable((i for i in _unique_labels(y)) for y in ys))
The array API does not go down this code path, because its arrays are not hashable. That said, using set + arrays feels like an anti-pattern.
sklearn/utils/_array_api.py
Outdated
def concatenate(self, arrays, *, axis=0, **kwargs):
    # ignore parameters that are not supported by the array API
    f = self._array_namespace.concat
    return f(arrays, axis=axis)
It's either this or what we see in https://github.com/scipy/scipy/pull/15395/files:
import numpy as np

def _concatenate(arrays, axis):
    # _get_namespace is the helper from the linked SciPy PR
    xp = _get_namespace(*arrays)
    if xp is np:
        return xp.concatenate(arrays, axis=axis)
    else:
        return xp.concat(arrays, axis=axis)
And importing _concatenate when needed.
sklearn/utils/_array_api.py
Outdated
@property
def VisibleDeprecationWarning(self):
    return DeprecationWarning
I'm still unsure how I feel about this workaround.
warnings.warn("The priors do not sum to 1. Renormalizing", UserWarning)
self.priors_ = self.priors_ / self.priors_.sum()

# TODO: implement isclose in wrapper?
Need to implement our own isclose.
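A minimal sketch of what that could look like, built only from operations the standard provides (tolerance defaults mirror numpy.isclose; xp is the array namespace):

def _isclose(xp, a, b, rtol=1e-05, atol=1e-08):
    # the same formula numpy.isclose documents: |a - b| <= atol + rtol * |b|
    return xp.abs(a - b) <= atol + rtol * xp.abs(b)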
sklearn/discriminant_analysis.py
Outdated
@@ -110,11 +113,17 @@ def _class_means(X, y):
    means : array-like of shape (n_classes, n_features)
        Class means.
    """
    np, is_array_api = get_namespace(X)
    classes, y = np.unique(y, return_inverse=True)
unique is not in the array API (this would be unique_inverse).
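For reference, the two spellings side by side (the array API form per the spec):

classes, y_encoded = np.unique(y, return_inverse=True)  # NumPy convention
classes, y_encoded = xp.unique_inverse(y)               # returns (values, inverse_indices)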
Oh I see, you are using a wrapper below. IMO it would be better to use the array API APIs in the wrapper and wrap non-compatible NumPy conventions to match them, not the other way around.
Yea, going the other direction also works. I'll try it the other way around and see how it compares.
My guess is that it's fine.
sklearn/discriminant_analysis.py
Outdated
if is_array_api:
    for i in range(classes.shape[0]):
        means[i, :] = np.mean(X[y == i], axis=0)
Should this be +=?
Using mean here avoids needing to divide by the count. (In the end, I think it's the same computation. It's basically a groupby + mean aggregation.)
Oh I see, you moved the aggregation from at into the mean calculation.
I think you could get rid of the loop with something like

np.sum(np.where(y[None] == np.arange(classes.shape[0])[:, None], X, np.asarray(0.)), axis=1) / cnt

except None indexing isn't in the spec yet (data-apis/array-api#360), so you'd have to use expand_dims. That's still not as "efficient" as the original because you are adding a lot of redundant 0s in the sum, so depending on how many classes there typically are, the loop version might be better anyway (at least in the sense that it's more readable). Another thing to note is that not all array API modules are guaranteed to have boolean indexing (https://data-apis.org/array-api/latest/API_specification/indexing.html#boolean-array-indexing).
Also, I think cnt can be gotten from unique_counts or unique_all in the array API (return_counts in NumPy).
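Putting those two suggestions together, a hypothetical spec-only version of the computation might look like this (untested sketch; xp is the array namespace, y is the integer-encoded label array):

labels = xp.arange(classes.shape[0])
# (n_classes, n_samples) membership mask, via expand_dims rather than None
mask = xp.expand_dims(y, axis=0) == xp.expand_dims(labels, axis=1)
# zero out samples belonging to other classes, then sum per class
masked = xp.where(xp.expand_dims(mask, axis=-1), X, xp.asarray(0.0))
_, cnt = xp.unique_counts(y)  # per-class counts
means = xp.sum(masked, axis=1) / xp.expand_dims(xp.astype(cnt, X.dtype), axis=-1)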
Another thing to note is that not all array API modules are guaranteed to have boolean indexing
Thanks for the information! Some functionality would be hard to reproduce without boolean indexing. Looking into "Data-dependent output shape" more, I see that unique_all may also not work. This is a significant barrier for scikit-learn. Pragmatically, I would restrict the first version of array API support in scikit-learn to array modules that support "Data-dependent output shape".
(I wanted to use nonzero to go from boolean index -> integer index -> take, but that also falls under "Data-dependent output shape".)
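That route would look roughly like the following sketch (assuming nonzero and a take-style gather are both available), which makes it clear it hits the same restriction:

idx = xp.nonzero(y == i)[0]    # data-dependent output shape, same problem
X_i = xp.take(X, idx, axis=0)  # rows of X belonging to class i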
Oh yeah of course unique wouldn't work in such APIs either. So there's no point in worrying about it here.
Another example of using array_api + scikit-learn's LinearDiscriminantAnalysis: there is a good 14x speed-up when using CuPy compared to NumPy, as seen in this gist.
Most workarounds are similar to the ones in #99.