ENH: add cartesian() #5874


Closed
wants to merge 3 commits into from

Conversation

sotte
Contributor

@sotte sotte commented May 14, 2015

Generate the cartesian product of input arrays.

This is the pull request that resulted from the discussion. First, thanks to everybody who participated in the discussion!

Here is a performance comparison of different implementations.

The PR contains the implementation, docs and unittests, but...

I'm not quite sure if arraysetops.py is the right place for this function.
"Set operations for 1D numeric arrays based on sorting." does not sound like cartesian.
cartesian is similar to prod, but fromnumeric.py does not seem like the right place either.
Maybe core/function_base.py next to linspace? Or linalg.py?
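[Editor's note] For readers of this thread, here is a minimal sketch of the 1-D behavior being proposed. It is illustrative only: the name mirrors the PR, but the body is a simplified rewrite, not the PR's actual code.

```python
import numpy as np

def cartesian(arrays):
    """Cartesian product of 1-D arrays, one row per combination (sketch)."""
    arrays = [np.asarray(a).ravel() for a in arrays]
    dtype = np.result_type(*arrays)
    n = np.prod([a.size for a in arrays])
    out = np.empty((len(arrays), n), dtype=dtype)
    for j, a in enumerate(arrays):
        n //= a.size
        # view row j as (repeats, a.size, n) and broadcast a into it, so each
        # element of a is repeated n times and the pattern tiles the row
        out[j].reshape(-1, a.size, n)[...] = a[np.newaxis, :, np.newaxis]
    return out.T

print(cartesian([np.array([1, 2]), np.array([3, 4, 5])]))
# [[1 3]
#  [1 4]
#  [1 5]
#  [2 3]
#  [2 4]
#  [2 5]]
```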

Open questions

  • where should we put it?
  • Jaime raised the following question: "should it work on flattened arrays?
    or should we give it an axis argument, and then "broadcast on the rest", a la
    generalized ufunc?" :

Generate the cartesian product of input arrays.
@pv
Member

pv commented May 14, 2015

shape_base.py seems to have somewhat similar functions.

out = out.T

for j, arr in enumerate(arrays):
    n /= arr.size

You need to do n //= arr.size to silence the deprecation warnings that are failing the tests.
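[Editor's note] For context (not part of the PR diff), the difference between the two operators: true division always produces a float, which NumPy had to deprecation-warn about when it was later used as a shape or index, while floor division keeps an integer.

```python
import numpy as np

n = 6
n /= np.arange(2).size   # true division: n is now 3.0, a Python float
assert isinstance(n, float)
# np.empty(n) raised a DeprecationWarning in NumPy at the time,
# and is a TypeError in modern NumPy

m = 6
m //= np.arange(2).size  # floor division: m stays an int
print(np.empty(m).shape)  # (3,)
```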

@jaimefrio
Member

Wherever it ends up, it will need to be mentioned in the release notes, and the docstring needs one of those .. versionadded:: tags somewhere too.

@shoyer
Member

shoyer commented May 14, 2015

I feel pretty strongly that automatically flattening input to 1D is a bad idea. Better to just raise an error (though I do like the idea of the broadcasting across other dimensions).

sotte added 2 commits May 14, 2015 19:49
- add versionadded
- improve doc
- refuse 2D input
- fix unittests
- more unittests
- add "unique" parameter
@sotte
Contributor Author

sotte commented May 15, 2015

The cartesian product is defined to work on sets. Following this definition we
could actually use np.unique() instead of np.ravel(), which would result in
this behavior:

>>> cartesian((np.array([1, 1]), np.array([1, 1])))
array([[1, 1]])

Would this confuse people? I don't know.

Alternatively one could argue that each row in a 2-D array is an element of the
set.

>>> cartesian((np.array([[1, 1]]), np.array([[1, 1]])))
array([[1, 1, 1, 1]])

Would we still use the unique rows then?

>>> cartesian((np.array([[1, 1], [1, 1]]), np.array([[1, 1], [2, 2]])))
array([[1, 1, 1, 1], [1, 1, 2, 2]])

Just as reference Python's itertools.product does not care about the
uniqueness of elements:

>>> list(itertools.product(range(2), range(2)))
[(0, 0), (0, 1), (1, 0), (1, 1)]

>>> list(itertools.product([1, 1], range(2)))
[(1, 0), (1, 1), (1, 0), (1, 1)]

If we refuse 2D input we still have to decide whether to use np.unique(), which
would result in an implementation closer to the definition, or the
itertools.product-like implementation.

Here is my suggestion which represents the current pull request:

  • Refuse 2D input so it's clear to the user what happens to the data.
  • Add an optional parameter unique. If unique is True pass the input through
    np.unique.

@shoyer
Member

shoyer commented May 15, 2015

I think an itertools.product-like implementation is the most obvious. I don't like automatically using only unique elements -- if users want that sort of behavior, they can call np.unique themselves. Likewise, I'm -1 on adding a unique parameter -- better to have simple, composable building blocks.
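[Editor's note] The composition described here needs no extra parameter. Using itertools.product as a stand-in for the proposed function, set-like behavior is one np.unique call away:

```python
import numpy as np
from itertools import product

a = np.array([1, 1, 2])
b = np.array([3, 4])

# itertools.product-like: the repeated 1 is kept
with_dups = np.array(list(product(a, b)))
# set semantics, composed by the user rather than via a unique= parameter
as_sets = np.array(list(product(np.unique(a), np.unique(b))))

print(with_dups.shape)  # (6, 2)
print(as_sets.shape)    # (4, 2)
```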

For the name of this function, how about using the most explicit cartesian_product?

@argriffing
Contributor

I agree with @shoyer. I would expect it to work like itertools.product and without anything related to uniqueness. I also prefer the more explicit cartesian_product name.

@sotte
Contributor Author

sotte commented May 15, 2015

OK. Sounds good to me. But we'll refuse 2D input, right?! It's easy to flatten the input.

@jaimefrio
Member

Here's a tentative version of what a "broadcast on the rest" implementation could look like. I have removed the unique keyword, since it really only applies to flattened inputs:

def cartesian_product(arrays, axis=None, out=None):
    if len(arrays) < 2:
        msg = "need at least two arrays to calculate the cartesian product"
        raise ValueError(msg)
    if axis is None:
        arrays = [np.asarray(arr).ravel() for arr in arrays]
        broadcast_shape = ()
        nd = 1  # everything is 1-D after ravelling
    else:
        arrays = [np.asarray(arr) for arr in arrays]
        nd = max(arr.ndim for arr in arrays)

        if axis < 0:
            axis += nd
        if axis < 0 or axis >= nd:
            raise ValueError("'axis' out of bounds")

        # Expand all arrays to the same number of dimensions
        for arr in arrays:
            arr.shape = (1,)*(nd - arr.ndim) + arr.shape

        # Move the axis to perform the cartesian product over to the
        # start of the shape
        arrays = [np.rollaxis(arr, axis, nd).T for arr in arrays]

        # Broadcast the arrays, minus the product axis, to a common shape
        broadcast_shape = np.broadcast(*[arr[0] for arr in arrays]).shape

    dtype = np.result_type(*arrays)
    n = np.prod([arr.shape[0] for arr in arrays])

    # For performance, the dimension of size n has to be contiguous in
    # memory. For convenient indexing, it is best to have the dimension
    # of size 'len(arrays)' first, then the one of length 'n', then all
    # the others
    if out is None:
        out_shape = broadcast_shape  + (len(arrays), n)
        out = np.empty(out_shape, dtype=dtype)
        # Move the last two dimensions to the front
        out = np.transpose(out, (-2, -1) + tuple(range(nd-1)))
    else:
        if out.shape != broadcast_shape + (n, len(arrays)):
            raise ValueError("Wrong shape for 'out'")
        # Move the last two dimensions to the front and flip them
        out = np.transpose(out, (-1, -2) + tuple(range(nd-1)))

    for j, arr in enumerate(arrays):
        n //= arr.shape[0]
        out.shape = (len(arrays), -1, arr.shape[0], n) + broadcast_shape
        out[j] = arr[np.newaxis, :, np.newaxis]
    out.shape = (len(arrays), -1) + broadcast_shape

    # Move the first and second axis to where they belong
    return np.transpose(out, tuple(range(2, nd+1)) + (1, 0))

You could then do sick things like computing the cartesian product over the last dimension for all pairs of subarrays, or something like that...

>>> a = np.arange(6).reshape(3, 2)
>>> b = np.arange(6).reshape(2, 3)
>>> a
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> b
array([[0, 1, 2],
       [3, 4, 5]])
>>> cartesian_product((a[:, None], b), axis=-1)
array([[[[0, 0],
         [0, 1],
         [0, 2],
         [1, 0],
         [1, 1],
         [1, 2]],

        [[2, 0],
         [2, 1],
         [2, 2],
         [3, 0],
         [3, 1],
         [3, 2]],

        [[4, 0],
         [4, 1],
         [4, 2],
         [5, 0],
         [5, 1],
         [5, 2]]],


       [[[0, 3],
         [0, 4],
         [0, 5],
         [1, 3],
         [1, 4],
         [1, 5]],

        [[2, 3],
         [2, 4],
         [2, 5],
         [3, 3],
         [3, 4],
         [3, 5]],

        [[4, 3],
         [4, 4],
         [4, 5],
         [5, 3],
         [5, 4],
         [5, 5]]]])

The logic to broadcast arrays excluding an axis probably deserves being turned into its own function, especially if we want the error message to have some relevance to the original shape of the arrays, rather than the remapped one passed into np.broadcast. And I think we also need an explicit check for empty arrays with a quick return, to avoid divisions by zero, regardless of whether we stick with the original implementation or go for the broadcasting.

@shoyer
Member

shoyer commented May 16, 2015

@jaimefrio I like the idea of copying generalized ufuncs, but this seems way too complicated. In particular, there are way too many ways to interpret the axis argument:

  • does axis refer to (1) the input arrays, (2) the arrays after broadcasting, or (3) the output array?
  • does axis refer to (a) the axis along which the cartesian product is taken, or (b) along which the arrays are stacked?

The implemented solution here is, I think, (2)a, but all these seem like reasonable guesses.

You also have axis=None triggering a whole different code path (with automatic flattening, no less), which seems like a bad idea for both the separate code path and automatic flattening.

I don't think there's a way to make the axis argument less confusing. So my inclination is to not include it, even if we allow for multi-dimensional input.


Unfortunately, even in that case the function signature (allowing for computed dimension sizes) is not entirely obvious.

Jaime seems to be suggesting something like (a)(b)->(a*b,2) as the core gufunc signature (with 2 arguments). But then, if I understand gufuncs correctly, that would mean input shaped like [(3, 2), (2, 3)] would raise an exception, because the first axis has a different size. So I think Jaime's example, which has output of shape (2, 3, 6, 2), is wrong. Instead, it should be something like [(2, 2), (2, 3)] -> (2, 6, 2), which is broadcasting along all the non-core dimensions.

Another way to vectorize across multiple dimensions is to do the cartesian product along all axes. That would imply: [(2, 2), (2, 3)] -> (4, 6, 2). This is a little more similar to the 1D behavior insofar as there are the same number of elements in the output array as there would be if all the input arrays were flattened, but the dimensionality of the stacked input arrays is preserved: [(4,), (6,)] -> (24, 2).

I think that treating the signature of the function for 1D input (as already documented) as the core signature for a gufunc is the most consistent approach to generalizing to multiple dimensions.
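[Editor's note] To make the two generalizations in this comment concrete, here is the shape arithmetic only (no product is actually computed); all names are illustrative.

```python
import numpy as np

a = np.empty((2, 2))
b = np.empty((2, 3))

# Treating "(a),(b)->(a*b,2)" as the core signature and broadcasting the
# leading (loop) dimensions, as gufuncs do:
loop_shape = np.broadcast(a[..., 0], b[..., 0]).shape   # (2,)
gufunc_shape = loop_shape + (a.shape[-1] * b.shape[-1], 2)
print(gufunc_shape)    # (2, 6, 2)

# Taking the product over *all* axes instead, so the element count matches
# the flattened inputs but each input keeps its own output dimension:
all_axes_shape = (a.size, b.size, 2)
print(all_axes_shape)  # (4, 6, 2)
```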

@shoyer
Member

shoyer commented May 16, 2015

One case in which an axis argument for cartesian_product would probably be reasonable is if it implements the exact same behavior as more generally implemented for all gufuncs (#5197). We got most of the way to a consensus on what that should look like on the mailing list, but it's still missing any implementation.

@njsmith
Member

njsmith commented May 16, 2015

I wonder if we should take these problems with computing output axis sizes
as a hint that the Cartesian product of arrays with shape (n, m) and (i, j)
should be (n, m, i, j). Then anyone who prefers some flattened version can
get whichever variant they like using our rich set of reshaping operators.
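[Editor's note] This unflattened variant, where arrays of shape (n, m) and (i, j) yield an (n, m, i, j, 2) result, can be sketched with plain broadcasting; the function name outer_cartesian is made up for illustration.

```python
import numpy as np

def outer_cartesian(arrays):
    """Sketch: combine inputs without flattening anything.

    For inputs with shapes s1, s2, ..., the result has shape
    s1 + s2 + ... + (len(arrays),), e.g. (n, m) and (i, j) give
    (n, m, i, j, 2).
    """
    arrays = [np.asarray(a) for a in arrays]
    full_shape = sum((a.shape for a in arrays), ())
    pieces = []
    pos = 0
    for a in arrays:
        # give each input its own block of output axes, size-1 elsewhere
        index = [np.newaxis] * len(full_shape)
        for k in range(a.ndim):
            index[pos + k] = slice(None)
        pieces.append(np.broadcast_to(a[tuple(index)], full_shape))
        pos += a.ndim
    return np.stack(pieces, axis=-1)

a = np.arange(6).reshape(3, 2)
b = np.arange(6).reshape(2, 3)
print(outer_cartesian([a, b]).shape)  # (3, 2, 2, 3, 2)
```

Any flattened layout is then one reshape away, e.g. `.reshape(-1, 2)` for the (n*m*i*j, 2) form.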

@jaimefrio
Member

@shoyer In my example, I am adding a size-1 dimension to the first array, so the dimensions of the inputs are (3, 1, 2) and (2, 3), hence the (3, 2, 6, 2) output.

@jaimefrio
Member

Regarding your suggestion of doing cartesian products over all axes, I think the example I chose shows that it can very easily be achieved with the proposed functionality by adding a few size-1 axes and a call to reshape. Plus it adds all the usual flexibility of doing many 1-D products at once. What you are proposing is a subset of what can be achieved with the usual broadcasting rules.

@jaimefrio
Member

And yes, adding an axis argument to a multi-operand function is tricky, and can become confusing. Perhaps this PR is not the best place to figure that out. We should revisit that proposal of yours of adding an axis argument to gufuncs and implement it in numpy, together with all the other goodness that has been discussed, like optional core dimensions, frozen size dimensions, calculated dimensions...

Until we sort that out, if we don't want to stall this, I think we have two options:

  1. For starters, have this function work only on 1-D arrays.
  2. For starters, have this function behave like a current gufunc, and work on last axes, broadcast on the rest.

@mhvk
Contributor

mhvk commented May 16, 2015

I think it would be nice to allow subclasses, at least in principle (via a subok keyword argument that defaults to False). For this to happen, np.asarray(...) should be changed into np.array(..., copy=False, subok=subok). A bit more tricky is the construction of the output array. I'd suggest using the class of the first array:

if out is None:
    # Construct output array using a copy of the first input array,
    # broadcast to the right dimensions. This preserves the class of the
    # input array if required, and already inserts its values where needed.
    n //= arrays[0].size
    out = np.broadcast_to(arrays[0][np.newaxis, :, np.newaxis],
                          (len(arrays), 1, arrays[0].size, n),
                          subok=subok).copy()
    inserts = arrays[1:]
    outs = out[1:]
else:
    outs = out.T
    inserts = arrays

for a, o in zip(inserts, outs):
    n //= a.size
    o.shape = (-1, a.size, n)
    o[...] = a[np.newaxis, :, np.newaxis]

@njsmith
Member

njsmith commented May 16, 2015

My main point was, why are we flattening the output? E.g. even in the 1d case we could return (n, m, 2) instead of (n * m, 2), and it would be much clearer how things were laid out.
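[Editor's note] The unflattened 1-D case described here is close to what np.meshgrid plus np.stack already gives (illustrative):

```python
import numpy as np

a = np.array([1, 2])
b = np.array([10, 20, 30])

# unflattened (n, m, 2): the layout is obvious from the shape
grid = np.stack(np.meshgrid(a, b, indexing="ij"), axis=-1)
print(grid.shape)   # (2, 3, 2)
print(grid[1, 2])   # [ 2 30]

# the (n * m, 2) form when you want it, via a trivial reshape
flat = grid.reshape(-1, 2)
print(flat.shape)   # (6, 2)
```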

@jaimefrio
Member

@mhvk Shouldn't __array_priority__ play some role in determining the subclass of the output if we were to do this right?

@njsmith If the inputs are all 1-D, then yes, reshaping to (-1, len(arrays)) is pretty much trivial, and could be left to the user if she needs a flattened output, e.g. to iterate over. It gets a little more involved if we allow broadcasting into the mix. Perhaps (yet another) keyword argument to choose behavior?

@njsmith
Member

njsmith commented May 16, 2015

Let's not start inventing new __array_priority__ special cases, there are
too many already and the goal is to deprecate it in favor of more powerful
and cleaner mechanisms like __numpy_ufunc__. Of course, then the question
becomes, what would that mechanism be... In a lot of cases the answer is we
should figure out how to turn the operation into a (g)ufunc and then reduce
to the previously solved case :-). But in this case it's not so obvious to
me, because this is basically a thin wrapper around concatenate /
broadcast_arrays / reshape, which are not very ufuncy.

I have no idea what to say about the full generality version with
broadcasting along some axes but not others and stuff, because I don't have
a clear idea of the use cases and what problem we're trying to solve. It's
honestly not 100% clear to me whether this needs its own function when we
already have the grid functions etc.


@mhvk
Contributor

mhvk commented May 17, 2015

@jaimefrio - In principle, one might want to have a np.result_class just as one has np.result_type. The logic implicit in my suggestion is the one I (eventually) would like to propose for concatenate as well, which does make the first array special. Here, I'm biased by my own uses for Quantity, where this would be fine: if I were to pass in arrays with units of, say, m, cm, and m, they would all be converted correctly (while if I put in an array with s, I'd get a UnitsError).

But your comment does make me realise that how to generalize this probably needs more thought. Since it only involves adding a subok keyword, which can be done later, it doesn't need to hold up this PR.

@sotte
Contributor Author

sotte commented May 18, 2015

My supposedly simple pull requests have the tendency to turn into bigger discussions :)

I agree with @jaimefrio that how to handle the axis argument for multi-operand functions should probably be discussed on the mailing list.

To proceed with the pull request:
I still can't wrap my head around @jaimefrio's axis implementation. Do we actually want the axis argument now? I'd be fine with the plain and easy-to-understand implementation that only deals with 1D input.

@homu
Contributor

homu commented Nov 14, 2016

☔ The latest upstream changes (presumably #7742) made this pull request unmergeable. Please resolve the merge conflicts.

@sotte
Contributor Author

sotte commented Nov 14, 2016

There does not seem to be any demand for this feature. I guess we can close this.

@sotte sotte closed this Nov 14, 2016
9 participants