Use array_api in Gaussian Mixture Models #99
base: main
Conversation
I left some comments about the kinds of workarounds needed to get Gaussian Mixture Models to support array_api.
sklearn/mixture/_base.py
Outdated
if is_array_api:
    log_resp = weighted_log_prob - np.reshape(log_prob_norm, (-1, 1))
else:
    with np.errstate(under="ignore"):
        # ignore underflow
        log_resp = weighted_log_prob - log_prob_norm[:, np.newaxis]
Workaround for no errstate in array_api.
I don't think floating point warnings will ever be portable. They're not even consistent in NumPy, and they're a constant source of pain. Maybe we need (later) a utility context manager errstate that is a no-op or delegates to a library-specific implementation, to remove the if-else here.
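Such a do-nothing-or-delegate context manager could be sketched as follows. This is a hypothetical helper (the name `errstate` and the dispatch-on-namespace signature are assumptions, not an existing API):

```python
import contextlib

import numpy as np


@contextlib.contextmanager
def errstate(xp, **kwargs):
    """Delegate to np.errstate when the namespace is NumPy; do nothing
    for namespaces without a floating point error state concept."""
    if xp is np:
        with np.errstate(**kwargs):
            yield
    else:
        yield
```

With this helper, both branches of the if-else collapse into a single `with errstate(xp, under="ignore"):` block.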
sklearn/mixture/_gaussian_mixture.py
Outdated
covariances[k].flat[:: n_features + 1] += reg_covar
diff = X - means[k, :]
covariances[k, :, :] = ((resp[:, k] * diff.T) @ diff) / nk[k]
np.reshape(covariances[k, :, :], (-1,))[:: n_features + 1] += reg_covar
Workaround for no flat in array_api.
Hard to read either way; I don't think this is a portable solution. covariances is not a 1-D array; wouldn't it be better to reshape the right-hand side here to match the shape of the left-hand side (or broadcast correctly)?
In this case, the right-hand side is a float. The line is adding a float to the diagonal, something like this:
import numpy as np
covariances = np.ones((3, 3))
reg_covar = 4.0
np.fill_diagonal(covariances, covariances.diagonal() + reg_covar)
With only array_api, I see two options:
- The current one with reshaping and slicing.
- Create the diagonal array on the right-hand side (which would allocate more memory compared to option 1):
import numpy.array_api as xp
covariances = xp.ones((3, 3))
reg_covar = 4.0
covariances += reg_covar * xp.eye(3)
I could be missing another way of "adding a scalar to the diagonal" using only array_api.
There's linalg.diagonal, which returns the matrix diagonal, but the standard doesn't specify whether it's a view or a copy. The numpy.array_api implementation wraps np.diagonal, which returns a non-writable view.
In [1]: import numpy.array_api as xp
In [2]: covariances = xp.ones((3, 3))
In [3]: diag = xp.linalg.diagonal(covariances)
In [4]: reg_covar = 4.0
In [5]: diag += reg_covar
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-dc9034854296> in <module>
----> 1 diag += reg_covar
~/.conda/envs/cupy-scipy/lib/python3.9/site-packages/numpy/array_api/_array_object.py in __iadd__(self, other)
739 if other is NotImplemented:
740 return other
--> 741 self._array.__iadd__(other._array)
742 return self
743
ValueError: output array is read-only
You can never assume something is a view because multiple libraries don't have such a concept. And the ones that do are inconsistent with each other. tl;dr relying on views is always a bug.
Here's some additional info on why relying on views is not portable:
https://data-apis.org/array-api/latest/design_topics/copies_views_and_mutation.html
This is again an example where the desire to simultaneously support efficient NumPy and portable Array API code leads to two code paths.
This discussion about JAX not having fill_diagonal is probably relevant: jax-ml/jax#2680. The portable solutions are (a) using eye, or (b) adding a for-loop for scalar in-place ops. It wouldn't surprise me if the for-loop is fast compared to the operation above, so it'd be fine and more readable. You could also special-case numpy.ndarray if desired.
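The for-loop option (b) can be sketched like this, with NumPy standing in for an Array API namespace. The helper name is hypothetical; it relies only on scalar indexing and item assignment, which the standard supports:

```python
import numpy as np


def add_scalar_to_diagonal(covariance, reg_covar):
    """Add a scalar to the diagonal of a square matrix in place, using
    only scalar indexing (no views, no flat, no fill_diagonal)."""
    n_features = covariance.shape[0]
    for i in range(n_features):
        covariance[i, i] += reg_covar
    return covariance
```

Unlike the eye-based option, this allocates no temporary matrix, at the cost of a Python-level loop over `n_features` elements.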
Athan pointed out that put will solve this, and should get into the standard soonish.
sklearn/mixture/_gaussian_mixture.py
Outdated
means = np.dot(resp.T, X) / nk[:, np.newaxis]
np, _ = get_namespace(X, resp)
nk = np.sum(resp, axis=0) + 10 * np.finfo(resp.dtype).eps
means = resp.T @ X / np.reshape(nk, (-1, 1))
Using @ to avoid using dot, which I find nicer.
sklearn/mixture/_gaussian_mixture.py
Outdated
if is_array_api:
    cholesky = np.linalg.cholesky
    solve = np.linalg.solve
else:
    cholesky = partial(scipy.linalg.cholesky, lower=True)
    solve = partial(scipy.linalg.solve_triangular, lower=True)
Need to use np.linalg.solve because array_api does not have solve_triangular.
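The general solver gives the same answer on a triangular system, just without exploiting the structure. A minimal NumPy check, comparing the spec's `linalg.solve` against forward substitution on a lower-triangular system:

```python
import numpy as np

# A small lower-triangular system L @ x = b.
L = np.array([[2.0, 0.0],
              [1.0, 3.0]])
b = np.array([2.0, 7.0])

# Portable: linalg.solve is part of the Array API spec.
x_general = np.linalg.solve(L, b)

# Forward substitution, i.e. what a dedicated triangular solver does
# in O(n^2) instead of the general solver's O(n^3) factorization.
x0 = b[0] / L[0, 0]
x1 = (b[1] - L[1, 0] * x0) / L[1, 1]
```

So the workaround is correct; the cost is the extra work of a general factorization on a matrix that is already triangular.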
Could be added in the future perhaps? @IvanYashchuk WDYT?
solve_triangular could be added to the Array API spec in the future. It wouldn't be terribly difficult to add it to numpy.array_api and cupy.array_api. This functionality is part of level-3 BLAS, and it's available in PyTorch (torch.linalg.solve_triangular), in CuPy (cupyx.scipy.linalg.solve_triangular), and I imagine in other libraries as well.
This is also the territory where a type dispatcher for scipy.linalg.solve_triangular would be handy. CuPy's SciPy should also start working with cupy.array_api inputs (it doesn't currently), and SciPy should work with numpy.array_api inputs.
Replacing scipy with scipy_dispatch (installed with python -m pip install git+https://github.com/IvanYashchuk/scipy-singledispatch.git@master) would give a working prototype:
In [1]: from scipy_dispatch import linalg
...: import cupy
In [2]: import cupy.array_api as xp
<ipython-input-2-23333abb466b>:1: UserWarning: The numpy.array_api submodule is still experimental. See NEP 47.
import cupy.array_api as xp
In [3]: a = cupy.array_api.asarray(cupy.random.random((3, 3)))
In [4]: b = cupy.array_api.asarray(cupy.random.random((3,)))
In [5]: import scipy_dispatch.cupy_backend.linalg # activate dispatching
C:\Users\Ivan\dev\scipy-singledispatch\scipy_dispatch\cupy_backend\linalg.py:12: UserWarning: The numpy.array_api submodule is still experimental. See NEP 47.
import numpy.array_api
In [6]: type(linalg.solve_triangular(a, b))
Out[6]: cupy.array_api._array_object.Array
At a glance, scipy_dispatch looks to be a simple wrapper around singledispatch and would cover a majority of what users want. If it's a CuPy array, use CuPy operators. If a user wants to use an Intel operator on their NumPy array, they can register a single dispatch on np.ndarray.
uarray adds multiple dispatch, but I do not know if there is a need for it. Is it enough to dispatch based on the first argument and then make sure that all other arguments are compatible?
I'm guessing this has been discussed at length somewhere. Is there a document that explains why multiple dispatch and uarray are required?
> If a user wants to use an Intel operator on their NumPy array, they can register a single dispatch on np.ndarray.
Right, regular SciPy functions can be registered to work for the typing.Any or object type. And an alternative implementation specific to NumPy could be registered using the np.ndarray type.
> Is it enough to dispatch based on the first argument and then make sure that all other arguments are compatible?
It might be enough in many cases.
> I'm guessing this has been discussed at length somewhere. Is there a document that explains why multiple dispatch and uarray are required?
The requirements are being discussed at https://discuss.scientific-python.org/t/requirements-and-discussion-of-a-type-dispatcher-for-the-ecosystem/157/34. A few reasons for uarray are listed in this post (https://discuss.scientific-python.org/t/a-proposed-design-for-supporting-multiple-array-types-across-scipy-scikit-learn-scikit-image-and-beyond/131). The main ones are: 1. the ability to specify locally, using context managers, which backend to use; 2. the ability to register a different backend for the same array type (most often for np.ndarray); 3. it's already used in scipy.fft and we have a PR for scipy.ndimage.
It may be the case that we don't actually need everything that uarray provides, and uarray looks scary enough to several people, on both the library side and the user side, that it might be worth considering implementing the dispatching with a simpler option first (that is, singledispatch, or Plum, which in my tests adds less overhead than singledispatch).
Full disclosure: I designed and implemented linalg.solve_triangular in PyTorch.
Just for reference, torch.linalg.solve_triangular diverges from scipy.linalg.solve_triangular in the following ways:
- PyTorch does not expose BLAS' trans parameter (which is a bit confusing when used with lower), but rather handles this internally by looking at the strides of the tensor.
- PyTorch has an upper keyword-only parameter without a default. SciPy has lower=False. We went with upper to be consistent with linalg.cholesky.
- PyTorch implements a left=True parameter that, when false, solves XA = B.
- SciPy's unit_diagonal parameter is called unitriangular in PyTorch.
- SciPy has a number of extra parameters, namely overwrite_b=False, debug=None, check_finite=True.
I like PyTorch's behaviour better when it comes to the first three points. I don't care about the naming of the parameters though.
> It may be the case that we don't actually need everything that uarray provides, and uarray looks scary enough to several people, on both the library side and the user side, that it might be worth considering implementing the dispatching with a simpler option first (that is, singledispatch, or Plum, which in my tests adds less overhead than singledispatch).
There are probably more reasons; see for example the discussion in scipy/scipy#14356 (review) and the following comments about conversion and supporting multiple types for the same parameter (union of ndarray and dtype, list/scalar/str inputs, etc.). That PR also adds some developer docs with discussion of this topic.
Unless it can be determined that either such things do not need to be supported in the future or there is a clean upgrade path later on, I don't think there's a point in using singledispatch.
sklearn/mixture/_gaussian_mixture.py
Outdated
for k, (mu, prec_chol) in enumerate(zip(means, precisions_chol)):
    y = np.dot(X, prec_chol) - np.dot(mu, prec_chol)
for k in range(n_components):
Cannot iterate over an array_api array, so we must iterate through the axis explicitly (which I prefer).
sklearn/utils/_array_api.py
Outdated
if xp is None:
    # Use numpy as default
    return np, False

return xp, True
Returns a boolean so the caller can easily tell if we are using the array_api namespace.
sklearn/utils/_array_api.py
Outdated
def logsumexp(a, axis=None, b=None, keepdims=False, return_sign=False):
    np, is_array_api = get_namespace(a)

    # Use SciPy if a is an ndarray
    if not is_array_api:
        return sp_logsumexp(
            a, axis=axis, b=b, keepdims=keepdims, return_sign=return_sign
        )
Hopefully this is not needed in the future.
This we should fix in the standard, I think. It should have logsumexp.
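Until then, a portable logsumexp can be written with only operations that are in the spec (max, exp, sum, log), using the usual max shift for numerical stability. A sketch, with a hypothetical helper name and `xp` standing for any Array-API-style namespace (NumPy here):

```python
import numpy as np


def xp_logsumexp(a, xp=np, axis=None):
    """log(sum(exp(a))) computed stably with spec-only operations.
    Subtracting the max before exponentiating avoids overflow; the
    reduced axes are kept (keepdims=True) for simplicity."""
    a_max = xp.max(a, axis=axis, keepdims=True)
    out = xp.log(xp.sum(xp.exp(a - a_max), axis=axis, keepdims=True))
    return out + a_max
```

The max shift is what lets inputs like `[1000.0, 1000.0]` go through without `exp` overflowing.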
sklearn/utils/validation.py
Outdated
if not is_array_api:
    X = np.asanyarray(X)
Another code path for array_api.
This is not a workaround for anything as far as I can tell - it's just that this function takes sequences/generators/etc when it probably shouldn't?
Looking it over, this asanyarray is most likely not needed.
sklearn/utils/validation.py
Outdated
array = np.asarray(array, order=order)
if not is_array_api:
    # array_api does not have order
    array = np.asarray(array, order=order)
No order support in array_api.
That's expected, should be fine. Not all array types have memory order support.
sklearn/utils/validation.py
Outdated
if np.may_share_memory(array, array_orig):
    array = np.array(array, dtype=dtype, order=order)
No may_share_memory in array_api.
That should stay that way - this would not make sense for libraries without the concept of a view.
sklearn/utils/_array_api.py
Outdated
if not get_config()["array_api_dispatch"]:
    return np, False
Global configuration option to control the dispatching.
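A minimal sketch of such a flag. The names `get_config`/`set_config`/`config_context` are hypothetical stand-ins for scikit-learn's configuration API, shown only to illustrate the opt-in mechanism:

```python
from contextlib import contextmanager

_config = {"array_api_dispatch": False}


def get_config():
    """Return a copy of the current configuration."""
    return dict(_config)


def set_config(array_api_dispatch=None):
    """Flip the global flag that get_namespace consults."""
    if array_api_dispatch is not None:
        _config["array_api_dispatch"] = array_api_dispatch


@contextmanager
def config_context(**new_config):
    """Temporarily override the configuration, restoring it on exit."""
    old_config = get_config()
    set_config(**new_config)
    try:
        yield
    finally:
        _config.update(old_config)
```

With dispatch off (the default), get_namespace always returns NumPy, so existing users see no behavior change; users opt in globally with set_config or locally with config_context.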
Nice, thanks @thomasjpfan!
Overall this looks pretty good I think, and a 6x speedup is nice to have. A few things to fix/improve upstream.
def may_share_memory(self, *args, **kwargs):
    # The safe choice is to return True all the time
    return True
Shouldn't this call np.may_share_memory for np.ndarray input?
If get_namespace returns the Array API wrapper, then I am assuming that the inputs are arrays from the Array API spec. This means that the implementation of these methods should only use functions in the spec.
We can special-case numpy.array_api.Array and actually call NumPy functions on it, but I feel like that defeats the purpose of the Array API spec.
Sorry for the ping, but do you know if this was upstreamed (or attempted to be upstreamed) to scikit-learn? I've been looking at adding Array API support to more estimators, and judging from the diff in the PR, this looks pretty possible.
@lithomas1 This was not upstreamed. Looking over the diff, it shouldn't be too hard to upstream now. You are welcome to upstream this PR.
Here is a short example comparing runtime between cupy.array_api and np.ndarray, which shows the speedup from using CuPy. Note that cupy was built with my fork, which includes this commit. (I opened a PR in cupy to upstream the fix.) This PR also has an example of a global configuration option to control the dispatching.