Use array_api in Gaussian Mixture Models #99
base: main
Conversation
I left some comments about the kinds of workarounds needed to get Gaussian Mixture Models to support array_api.
sklearn/mixture/_base.py
Outdated
if is_array_api:
    log_resp = weighted_log_prob - np.reshape(log_prob_norm, (-1, 1))
else:
    with np.errstate(under="ignore"):
        # ignore underflow
        log_resp = weighted_log_prob - log_prob_norm[:, np.newaxis]
Workaround for no errstate in array_api.
I don't think floating point warnings will ever be portable. They're not even consistent in NumPy, and they're a constant source of pain. Maybe we need (later) a utility context manager errstate that is a no-op or delegates to a library-specific implementation, to remove the if-else here.
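Such a do-nothing-or-delegate context manager could be sketched as follows. This is a hypothetical helper (the name `errstate` and the dispatch-on-namespace signature are assumptions, not an existing API):

```python
import contextlib

import numpy as np


@contextlib.contextmanager
def errstate(xp, **kwargs):
    """Delegate to np.errstate when the namespace is NumPy; do nothing
    for namespaces without a floating point error state concept."""
    if xp is np:
        with np.errstate(**kwargs):
            yield
    else:
        yield
```

With this helper, both branches of the if-else collapse into a single `with errstate(xp, under="ignore"):` block.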
sklearn/mixture/_gaussian_mixture.py
Outdated
covariances[k].flat[:: n_features + 1] += reg_covar
diff = X - means[k, :]
covariances[k, :, :] = ((resp[:, k] * diff.T) @ diff) / nk[k]
np.reshape(covariances[k, :, :], (-1,))[:: n_features + 1] += reg_covar
Workaround for no flat in array_api.
Hard to read either way; I don't think this is a portable solution. covariances is not a 1-D array; wouldn't it be better to reshape the right-hand side here to match the shape of the left-hand side (or broadcast correctly)?
In this case, the right-hand side is a float. The line is adding a float to the diagonal, something like this:
import numpy as np
covariances = np.ones((3, 3))
reg_covar = 4.0
np.fill_diagonal(covariances, covariances.diagonal() + reg_covar)
With only array_api, I see two options:
- The current one with reshaping and slicing.
- Create the diagonal array on the right-hand side (which would allocate more memory compared to option 1):
import numpy.array_api as xp
covariances = xp.ones((3, 3))
reg_covar = 4.0
covariances += reg_covar * xp.eye(3)
I could be missing another way of "adding a scalar to the diagonal" using only array_api.
There's linalg.diagonal, which returns the matrix diagonal, but the standard doesn't specify whether it's a view or a copy. The numpy.array_api implementation wraps np.diagonal, which returns a non-writable view.
In [1]: import numpy.array_api as xp
In [2]: covariances = xp.ones((3, 3))
In [3]: diag = xp.linalg.diagonal(covariances)
In [4]: reg_covar = 4.0
In [5]: diag += reg_covar
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-dc9034854296> in <module>
----> 1 diag += reg_covar
~/.conda/envs/cupy-scipy/lib/python3.9/site-packages/numpy/array_api/_array_object.py in __iadd__(self, other)
739 if other is NotImplemented:
740 return other
--> 741 self._array.__iadd__(other._array)
742 return self
743
ValueError: output array is read-only
You can never assume something is a view because multiple libraries don't have such a concept. And the ones that do are inconsistent with each other. tl;dr relying on views is always a bug.
Here's some additional info on why relying on views is not portable:
https://data-apis.org/array-api/latest/design_topics/copies_views_and_mutation.html
This is again an example where the desire to simultaneously support efficient NumPy and portable Array API code leads to two code paths.
This discussion about JAX not having fill_diagonal is probably relevant: jax-ml/jax#2680. The portable solutions are (a) using eye, or (b) adding a for-loop for scalar in-place ops. It wouldn't surprise me if the for-loop is fast compared to the operation above, so it'd be fine and more readable. You could also special-case numpy.ndarray if desired.
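The for-loop option (b) can be sketched like this, with NumPy standing in for an Array API namespace. The helper name is hypothetical; it relies only on scalar indexing and item assignment, which the standard supports:

```python
import numpy as np


def add_scalar_to_diagonal(covariance, reg_covar):
    """Add a scalar to the diagonal of a square matrix in place, using
    only scalar indexing (no views, no flat, no fill_diagonal)."""
    n_features = covariance.shape[0]
    for i in range(n_features):
        covariance[i, i] += reg_covar
    return covariance
```

Unlike the eye-based option, this allocates no temporary matrix, at the cost of a Python-level loop over `n_features` elements.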
Athan pointed out that put will solve this, and should get into the standard soonish.
sklearn/mixture/_gaussian_mixture.py
Outdated
means = np.dot(resp.T, X) / nk[:, np.newaxis]
np, _ = get_namespace(X, resp)
nk = np.sum(resp, axis=0) + 10 * np.finfo(resp.dtype).eps
means = resp.T @ X / np.reshape(nk, (-1, 1))
Using @ to avoid using dot, which I find nicer.
sklearn/mixture/_gaussian_mixture.py
Outdated
if is_array_api:
    cholesky = np.linalg.cholesky
    solve = np.linalg.solve
else:
    cholesky = partial(scipy.linalg.cholesky, lower=True)
    solve = partial(scipy.linalg.solve_triangular, lower=True)
Need to use np.linalg.solve because array_api does not have solve_triangular.
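The general solver gives the same answer on a triangular system, just without exploiting the structure. A minimal NumPy check, comparing the spec's `linalg.solve` against forward substitution on a lower-triangular system:

```python
import numpy as np

# A small lower-triangular system L @ x = b.
L = np.array([[2.0, 0.0],
              [1.0, 3.0]])
b = np.array([2.0, 7.0])

# Portable: linalg.solve is part of the Array API spec.
x_general = np.linalg.solve(L, b)

# Forward substitution, i.e. what a dedicated triangular solver does
# in O(n^2) instead of the general solver's O(n^3) factorization.
x0 = b[0] / L[0, 0]
x1 = (b[1] - L[1, 0] * x0) / L[1, 1]
```

So the workaround is correct; the cost is the extra work of a general factorization on a matrix that is already triangular.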
Could be added in the future perhaps? @IvanYashchuk WDYT?
solve_triangular could be added to the Array API spec in the future. It wouldn't be terribly difficult to add it to numpy.array_api and cupy.array_api. This functionality is part of level-3 BLAS, and it's available in PyTorch (torch.linalg.solve_triangular), in CuPy (cupyx.scipy.linalg.solve_triangular), and I imagine in other libraries as well.
This is also the territory where a type dispatcher for scipy.linalg.solve_triangular would be handy. CuPy's SciPy should also start working with cupy.array_api inputs (it doesn't currently), and SciPy should work with numpy.array_api inputs.
Replacing scipy with scipy_dispatch (installed with python -m pip install git+https://github.com/IvanYashchuk/scipy-singledispatch.git@master) would give a working prototype:
In [1]: from scipy_dispatch import linalg
...: import cupy
In [2]: import cupy.array_api as xp
<ipython-input-2-23333abb466b>:1: UserWarning: The numpy.array_api submodule is still experimental. See NEP 47.
import cupy.array_api as xp
In [3]: a = cupy.array_api.asarray(cupy.random.random((3, 3)))
In [4]: b = cupy.array_api.asarray(cupy.random.random((3,)))
In [5]: import scipy_dispatch.cupy_backend.linalg # activate dispatching
C:\Users\Ivan\dev\scipy-singledispatch\scipy_dispatch\cupy_backend\linalg.py:12: UserWarning: The numpy.array_api submodule is still experimental. See NEP 47.
import numpy.array_api
In [6]: type(linalg.solve_triangular(a, b))
Out[6]: cupy.array_api._array_object.Array
At a glance, scipy_dispatch looks to be a simple wrapper around singledispatch and would cover a majority of what users want. If it's a CuPy array, use CuPy operators. If a user wants to use an Intel operator on their NumPy array, they can register a single dispatch on np.ndarray.
uarray adds multiple dispatch, but I do not know if there is a need for it. Is it enough to dispatch based on the first argument and then make sure that all other arguments are compatible?
I'm guessing this has been discussed at length somewhere. Is there a document that explains why multiple dispatch and uarray are required?
> If a user wants to use an Intel operator on their NumPy array, they can register a single dispatch on np.ndarray.
Right, regular SciPy functions can be registered to work for the typing.Any or object type. And an alternative implementation specific to NumPy could be registered using the np.ndarray type.
> Is it enough to dispatch based on the first argument and then make sure that all other arguments are compatible?
It might be enough in many cases.
> I'm guessing this has been discussed at length somewhere. Is there a document that explains why multiple dispatch and uarray are required?
The requirements are being discussed at https://discuss.scientific-python.org/t/requirements-and-discussion-of-a-type-dispatcher-for-the-ecosystem/157/34. A few reasons for uarray are listed in this post (https://discuss.scientific-python.org/t/a-proposed-design-for-supporting-multiple-array-types-across-scipy-scikit-learn-scikit-image-and-beyond/131). The main ones are: 1. the ability to specify locally, using context managers, which backend to use; 2. the ability to register a different backend for the same array type (most often for np.ndarray); 3. it's already used in scipy.fft and we have a PR for scipy.ndimage.
It may be the case that we don't actually need everything that uarray provides, and uarray looks scary enough to several people, on both the library side and the user side, that it might be worth considering implementing the dispatching with a simpler option first (that is, singledispatch, or Plum, which in my tests adds less overhead than singledispatch).
Full disclosure: I designed and implemented linalg.solve_triangular in PyTorch.
Just for reference, torch.linalg.solve_triangular diverges from scipy.linalg.solve_triangular in the following ways:
- PyTorch does not expose BLAS' trans parameter (which is a bit confusing when used with lower), but rather handles this internally by looking at the strides of the tensor.
- PyTorch has an upper keyword-only parameter without a default. SciPy has lower=False. We went with upper to be consistent with linalg.cholesky.
- PyTorch implements a left=True parameter that, when false, solves XA = B.
- SciPy's unit_diagonal parameter is called unitriangular in PyTorch.
- SciPy has a number of extra parameters, namely overwrite_b=False, debug=None, check_finite=True.
I like PyTorch's behaviour better when it comes to the first three points. I don't care about the naming of the parameters though.
> It may be the case that we don't actually need everything that uarray provides, and uarray looks scary enough to several people, on both the library side and the user side, that it might be worth considering implementing the dispatching with a simpler option first (that is, singledispatch, or Plum, which in my tests adds less overhead than singledispatch).
There are probably more reasons; see for example the discussion in scipy/scipy#14356 (review) and the following comments about conversion and supporting multiple types for the same parameter (union of ndarray and dtype, list/scalar/str inputs, etc.). That PR also adds some developer docs with discussion of this topic.
Unless it can be determined that either such things do not need to be supported in the future or there is a clean upgrade path later on, I don't think there's a point in using singledispatch.
sklearn/mixture/_gaussian_mixture.py
Outdated
for k, (mu, prec_chol) in enumerate(zip(means, precisions_chol)):
    y = np.dot(X, prec_chol) - np.dot(mu, prec_chol)
for k in range(n_components):
Cannot iterate over an array_api array, so we must iterate through the axis explicitly (which I prefer).
sklearn/utils/_array_api.py
Outdated
if xp is None:
    # Use numpy as default
    return np, False

return xp, True
Returns a boolean so the caller can easily tell if we are using the array_api namespace.
sklearn/utils/_array_api.py
Outdated
def logsumexp(a, axis=None, b=None, keepdims=False, return_sign=False):
    np, is_array_api = get_namespace(a)

    # Use SciPy if a is an ndarray
    if not is_array_api:
        return sp_logsumexp(
            a, axis=axis, b=b, keepdims=keepdims, return_sign=return_sign
        )
Hopefully this is not needed in the future.
This we should fix in the standard, I think. It should have logsumexp.
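Until then, a portable logsumexp can be written with only operations that are in the spec (max, exp, sum, log), using the usual max shift for numerical stability. A sketch, with a hypothetical helper name and `xp` standing for any Array-API-style namespace (NumPy here):

```python
import numpy as np


def xp_logsumexp(a, xp=np, axis=None):
    """log(sum(exp(a))) computed stably with spec-only operations.
    Subtracting the max before exponentiating avoids overflow; the
    reduced axes are kept (keepdims=True) for simplicity."""
    a_max = xp.max(a, axis=axis, keepdims=True)
    out = xp.log(xp.sum(xp.exp(a - a_max), axis=axis, keepdims=True))
    return out + a_max
```

The max shift is what lets inputs like `[1000.0, 1000.0]` go through without `exp` overflowing.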
sklearn/utils/validation.py
Outdated
if not is_array_api:
    X = np.asanyarray(X)
Another code path for array_api.
This is not a workaround for anything as far as I can tell - it's just that this function takes sequences/generators/etc when it probably shouldn't?
Looking it over, this asanyarray is most likely not needed.
sklearn/utils/validation.py
Outdated
array = np.asarray(array, order=order)
if not is_array_api:
    # array_api does not have order
    array = np.asarray(array, order=order)
No order support in array_api.
That's expected, should be fine. Not all array types have memory order support.
sklearn/utils/validation.py
Outdated
if np.may_share_memory(array, array_orig):
    array = np.array(array, dtype=dtype, order=order)
No may_share_memory in array_api.
That should stay that way - this would not make sense for libraries without the concept of a view.
sklearn/utils/_array_api.py
Outdated
if not get_config()["array_api_dispatch"]:
    return np, False
Global configuration option to control the dispatching.
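A minimal sketch of such a flag. The names `get_config`/`set_config`/`config_context` are hypothetical stand-ins for scikit-learn's configuration API, shown only to illustrate the opt-in mechanism:

```python
from contextlib import contextmanager

_config = {"array_api_dispatch": False}


def get_config():
    """Return a copy of the current configuration."""
    return dict(_config)


def set_config(array_api_dispatch=None):
    """Flip the global flag that get_namespace consults."""
    if array_api_dispatch is not None:
        _config["array_api_dispatch"] = array_api_dispatch


@contextmanager
def config_context(**new_config):
    """Temporarily override the configuration, restoring it on exit."""
    old_config = get_config()
    set_config(**new_config)
    try:
        yield
    finally:
        _config.update(old_config)
```

With dispatch off (the default), get_namespace always returns NumPy, so existing users see no behavior change; users opt in globally with set_config or locally with config_context.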
Nice, thanks @thomasjpfan!
Overall this looks pretty good I think, and a 6x speedup is nice to have. A few things to fix/improve upstream.
def may_share_memory(self, *args, **kwargs):
    # The safe choice is to return True all the time
    return True
Shouldn't this call np.may_share_memory for np.ndarray input?
If get_namespace returns the Array API wrapper, then I am assuming that the inputs are arrays from the Array API spec. This means that the implementation of these methods should only use functions in the spec.
We can special-case numpy.array_api.Array and actually call NumPy functions on it, but I feel like that defeats the purpose of the Array API spec.
Sorry for the ping, but do you know if this was upstreamed (or attempted to be upstreamed) to scikit-learn? I've been looking at adding Array API support to more estimators, and judging from the diff in the PR, this looks pretty possible.
@lithomas1 This was not upstreamed. Looking over the diff, it shouldn't be too hard to upstream now. You are welcome to upstream this PR.
Here is a short example comparing runtime between cupy.array_api and np.ndarray, which shows the speedup from using CuPy. Note that cupy was built with my fork, which includes this commit. (I opened a PR in cupy to upstream the fix.) This PR also has an example of a global configuration option to control the dispatching.