Enable Awkward Array Support in _unpack_to_numpy for Improved Array-like Compatibility #29136

Thierno88 · 2024-11-13T13:54:42Z

PR Summary

Why is this change necessary?
The _unpack_to_numpy function currently supports a range of array-like objects but does not account for the awkward.Array type, which is widely used for data with irregular or nested structures. This change makes it easier for users to visualize data stored in awkward.Arrays without needing manual conversions, especially useful when subarray lengths vary.

What problem does it solve?
This PR modifies _unpack_to_numpy to recognize and handle awkward.Arrays, flattening them and converting them to numpy.ndarray format for direct compatibility with Matplotlib’s plotting functions. It addresses compatibility issues and prevents errors when attempting to plot irregular arrays.

Reasoning for Implementation
The enhancement involves:

1. Detecting if the input array is an awkward.Array.
2. Flattening nested or non-uniform awkward.Arrays using ak.flatten.
3. Converting the result to numpy.ndarray via ak.to_numpy().

This maintains backward compatibility and aligns with existing helper functions that aim to standardize various array-like formats for Matplotlib.

PR Checklist

"closes Document and test what "array like" means to Matplotlib #22879"
New and changed code is tested
New Features and API Changes are noted with a directive and release note

handle akward array data

tacaswell · 2024-11-13T14:25:01Z

I am 👍🏻 on this in principle; however, I have concerns about the implementation.

First, we cannot pick up an Awkward Array dependency in Matplotlib, nor should we import Awkward Array if the user has not already imported it. If we are to keep the type check like this, it needs to be hidden behind sys.modules inspection.

Second, I am not clear why flattening a ragged array is the right thing to do in most cases, as it seems to drop important structural information on the floor.

I would appreciate if @jpivarski could weigh in on this being the right approach.

jpivarski · 2024-11-13T15:23:40Z

Yeah, Matplotlib can't take on Awkward Array as a dependency. Matplotlib is a more fundamental library in the ecosystem.

It would be possible to determine if an object is of a particular type in a third-party module without loading that module, for instance with

def is_in_module_with_type(tpe: type, module: str, name: str | None = None) -> bool:
    module_name = tpe.__module__
    if module_name == module or module_name.startswith(module + "."):
        if name is None or name == tpe.__name__:
            return True
    mro = tpe.mro()
    if len(mro) <= 1:
        return False
    return is_in_module_with_type(mro[1], module, name)

but I don't think Matplotlib should get into the business of "knowing" about all the other libraries and their types. If it did, this section would fill up with checking for array types in libraries, and some of them would even be abandoned, making this not-quite-but-effectively dead code.
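For illustration only, an iterative variant of the check above could walk the whole `__mro__` instead of recursing on its second entry. The names `type_comes_from` and `MyMapping` are hypothetical, and `collections.OrderedDict` stands in for a third-party array type so the sketch runs without extra dependencies:

```python
from collections import OrderedDict


def type_comes_from(tpe, module, name=None):
    """Return True if tpe or any of its base classes is defined in `module`."""
    for base in tpe.__mro__:
        base_module = base.__module__
        # Match the module itself or any of its submodules.
        if base_module == module or base_module.startswith(module + "."):
            if name is None or name == base.__name__:
                return True
    return False


class MyMapping(OrderedDict):
    """Subclass defined elsewhere; still recognized through its base class."""


print(type_comes_from(MyMapping, "collections", "OrderedDict"))  # True
print(type_comes_from(int, "collections"))                       # False
```

The comparison uses only the `__module__` and `__name__` strings, so the target module never has to be imported.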

My original intention was to have Awkward Arrays not automatically flatten when passed to histogramming and other plotting code. This section from the user guide explains that choice: although flattening might be the most common thing to do, it's not the only way one would want to map the structure of their data onto a plot. (What about taking the mean of each list? The maximum of each list? The first? Actually drawing lists as little disconnected lines?) If it happened automatically, one might not realize it's happening and misinterpret their plot!
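To make the alternatives concrete, here is a small sketch using plain nested Python lists as a stand-in for a ragged array (no Awkward Array calls are used). Skipping empty lists in the per-list reductions is an assumption made for this example:

```python
from itertools import chain
from statistics import mean

ragged = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]

# Flattening: one value per innermost element.
flattened = list(chain.from_iterable(ragged))        # [1.1, 2.2, 3.3, 4.4, 5.5]

# Per-list reductions: one value per non-empty list.
per_list_max = [max(sub) for sub in ragged if sub]   # [3.3, 5.5]
per_list_first = [sub[0] for sub in ragged if sub]   # [1.1, 4.4]
per_list_mean = [mean(sub) for sub in ragged if sub]

print(len(flattened), len(per_list_max))
```

The flattened result has five entries while each reduction has two, so a histogram of one would look quite different from a histogram of the other.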

The one thing that I do think we need to avoid—and this happened in some older version of Matplotlib—is for the function call to proceed very slowly and then fail at the end. This happened because Matplotlib, recognizing that its argument was not an array but was iterable, proceeded to iterate over the megabytes of little lists and then conclude at the end that it couldn't plot lists, as lists are not numbers. It needs to fail-fast so that users get their error message right away. I remember that being fixed, though it was years ago. (It wasn't something we could do on our side: Awkward Arrays should have an __iter__ method, and it will be slow.)

I'll do a quick check:

import matplotlib.pyplot as plt

class SomeIterable:
    def __iter__(self):
        output = [["U"], ["can't"], ["plot"], ["this"]]
        for x in output:
            print(x)
            yield x

plt.plot(SomeIterable(), SomeIterable())

raises

['U']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
  File ".../site-packages/matplotlib/_api/__init__.py", line 91, in check_isinstance
    raise TypeError(
TypeError: 'value' must be an instance of str or bytes, not a __main__.SomeIterable

which is very good: it fails fast. As soon as it sees the first iterated object that it can't plot, it gives up, without iterating through the (possibly GB of) data in the iterable.
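The fail-fast idea can be sketched in isolation. This is not how Matplotlib implements its validation; `check_first_element` and `CountingIterable` are hypothetical names used only to show that a single element is inspected before giving up:

```python
from numbers import Number


def check_first_element(data):
    """Raise TypeError as soon as the first element is not a plain number."""
    for first in data:
        if not isinstance(first, Number):
            raise TypeError(
                f"cannot plot elements of type {type(first).__name__}; "
                "expected numbers"
            )
        break  # inspect only the first element, never the whole iterable


class CountingIterable:
    """Stand-in for a large iterable of little lists; counts what is consumed."""

    def __init__(self):
        self.consumed = 0

    def __iter__(self):
        for x in [["U"], ["can't"], ["plot"], ["this"]]:
            self.consumed += 1
            yield x
```

Calling `check_first_element(CountingIterable())` raises after one item has been produced, which is the behavior that keeps the error prompt even for very large inputs.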

So, users get an error message saying that this data can't be plotted, and I think that's ideal: it forces them to choose ak.flatten if that is actually what they want. Maybe the error message could hint, "Perhaps your data need to be flattened or destructured?" which is not Awkward-specific—it would be a problem with NumPy, too.

Oh, NumPy gives a different error message:

import numpy as np

bad_numpy = np.arange(2*3*5).reshape(2, 3, 5)
plt.plot(bad_numpy, bad_numpy)

ValueError: x and y can be no greater than 2D, but have shapes (2, 3, 5) and (2, 3, 5)

What we have with the data that needs to be flattened is similar to the NumPy case.

Hey, wait a minute: with 2D NumPy arrays, it draws

numpy2D_x = np.array([[0.0, 1.0, 2.0], [0.1, 1.1, 2.1], [0.2, 1.2, 2.2]])
numpy2D_y = np.array([[1.0, 0.9, 1.1], [2.0, 1.0, 0.0], [0.0, 0.5, 2.0]])
plt.plot(numpy2D_x, numpy2D_y)

The second dimension is assumed to range over different-colored lines. There could be a good application of Awkward or ragged¹ arrays here, which lets the line segments have different lengths. However, the interpretation of axes isn't the one we'd want: it would be natural for each different-length list to represent one line segment, and a direct generalization of the above wouldn't do that.

Also, I guess it would be a bad mistake to make if you, as a user, really intended for the data to be flattened before plotting. Making millions of Line2D objects would probably be resource-intensive, not easy to KeyboardInterrupt out of.

¹ Satisfying the Array API: you'd only have to check for a shape with None in a dimension.

tacaswell · 2024-11-13T16:41:11Z

> Hey, wait a minute: with 2D NumPy arrays, it draws

This is an artifact of the MATLAB legacy and "tables as 2D arrays", so the 2nd dimension is the "right" one to treat as connected.

jklymak · 2024-11-13T16:45:51Z

@jpivarski Thanks for your thoughts here.

I'm a little confused about the motivation for this PR. It looks like AwkwardArray has an __array__ method, so np.asarray(X) should already work as "expected", which includes an error for ragged arrays. I think both AwkwardArray and Matplotlib are doing what they can here, and if someone has a complicated ragged array, they are going to need to unpack it in the manner that makes the most sense for the underlying data structure.

Probably something for the data pipeline in the future, but Matplotlib has the concept of plotting the columns of arrays separately in plot and errorbar (and hist). It seems possible to me that we should be trying to make plot etc convert the input y and maybe the input x to a list of 1D arrays rather than a 2D array, which would allow ragged arrays.

jpivarski · 2024-11-13T16:57:57Z

Calling ak.flatten on a ragged array is a little more than calling __array__ on it:

>>> regular = ak.Array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])
>>> ak.flatten(regular)
<Array [1.1, 2.2, 3.3, 4.4, 5.5, 6.6] type='6 * float64'>
>>> np.asarray(regular)
array([[1.1, 2.2, 3.3],
       [4.4, 5.5, 6.6]])

>>> ragged = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> ak.flatten(ragged)
<Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>
>>> np.asarray(ragged)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
ValueError: cannot convert to RegularArray because subarray lengths are not regular (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-42/awkward-cpp/src/cpu-kernels/awkward_ListOffsetArray_toRegularArray.cpp#L22)

This error occurred while calling

    numpy.asarray(
        <Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
        dtype = None
    )

Flattening is the most common thing to want to do to a ragged array, but I didn't want it to happen by default because that would hide some structure-manipulation that should probably be explicit.

In a High Energy Physics (HEP) context, one is often wondering, "Is this a plot of events, a plot of particles, a plot of particle candidates?" (i.e. one collision event per data point or histogram entry, or one particle-in-an-event, or one candidate invisible particle, formed by trying all combinations of visible particles as decay products). These are different levels of nesting for Awkward/ragged arrays, so if flattening was automatic, the distinction between these cases would be blurred.

On the other hand, the frequency with which one wants to flatten, rather than anything else, might be in the high 90%'s. So I can see the motivation behind the PR.

timhoffm · 2024-11-14T13:46:29Z

> Flattening is the most common thing to want to do to a ragged array, but I didn't want it to happen by default because that would hide some structure-manipulation that should probably be explicit.

To me, this is the central sentence. Matplotlib's interfaces are built around the idea of "array-likes". To be able to handle third-party data structures, we need to convert them to an array. On a technical level, the __array__ protocol delegates this conversion to the data structure. We should not get into the business of making assumptions and doing implicit conversions that the original data structure intentionally did not put into __array__.

tacaswell · 2024-11-14T14:06:35Z

@Thierno88 Thank you for your work on this; however, I am going to close this, as I think there is a consensus that Matplotlib should not be adding any assumptions to implicit data structure conversion. Although we are not going to merge this PR, the discussion it provoked is valuable!

I think a reasonable follow-up PR would be to add logic that provides better error messages when we have been given a ragged Awkward Array, with suggestions as to how the user could split it.
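One possible shape for such a follow-up, sketched without NumPy or Awkward Array so it runs on its own. `RaggedStandIn` and `convert_with_hint` are hypothetical names; in real code the conversion would presumably go through `np.asarray`, and the hint wording is taken from the suggestion earlier in this thread:

```python
class RaggedStandIn:
    """Stand-in for an object whose __array__ refuses ragged data."""

    def __array__(self, dtype=None, copy=None):
        raise ValueError("subarray lengths are not regular")


def convert_with_hint(obj):
    """Delegate conversion to obj.__array__ and add a hint when it fails."""
    try:
        return obj.__array__()
    except ValueError as err:
        # Keep the original error as the cause so no information is lost.
        raise ValueError(
            f"{err}. Perhaps your data need to be flattened or "
            "destructured before plotting?"
        ) from err
```

The conversion itself stays with the data structure; only the message shown to the user changes.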

It might also be interesting to look at how Awkward Arrays play with LineCollection and what we would have to do to make it "just work" in that case.
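As a rough starting point for that exploration, the sketch below turns ragged x and y data into a list of per-line point sequences. It assumes plain nested lists as input (for example the output of `ak.to_list`), and the helper name `ragged_to_segments` is hypothetical. Dropping lists with fewer than two points is a choice made for this example:

```python
def ragged_to_segments(xs, ys):
    """Pair up ragged x and y lists into one sequence of (x, y) points per line."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must contain the same number of lists")
    segments = []
    for x_list, y_list in zip(xs, ys):
        if len(x_list) != len(y_list):
            raise ValueError("each x list must match its y list in length")
        if len(x_list) >= 2:  # a line needs at least two points
            segments.append(list(zip(x_list, y_list)))
    return segments


xs = [[0.0, 1.0, 2.0], [], [0.5, 1.5]]
ys = [[1.0, 0.9, 1.1], [], [2.0, 0.0]]
print(ragged_to_segments(xs, ys))
```

The result is a list of lines of differing lengths, which matches the form of the `segments` argument of `matplotlib.collections.LineCollection`. This has not been tested against Awkward Arrays directly.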

Update cbook.py

28479f9

handle akward array data

tacaswell closed this Nov 14, 2024
