Enable Awkward Array Support in _unpack_to_numpy for Improved Array-like Compatibility #29136

Closed
wants to merge 1 commit into from

Thierno88

PR Summary

Why is this change necessary?
The _unpack_to_numpy function currently supports a range of array-like objects but does not account for the awkward.Array type, which is widely used for data with irregular or nested structures. This change makes it easier for users to visualize data stored in awkward.Arrays without needing manual conversions, especially useful when subarray lengths vary.

What problem does it solve?
This PR modifies _unpack_to_numpy to recognize and handle awkward.Arrays, flattening them and converting them to numpy.ndarray format for direct compatibility with Matplotlib’s plotting functions. It addresses compatibility issues and prevents errors when attempting to plot irregular arrays.

Reasoning for Implementation
The enhancement involves:

1. Detecting if the input array is an awkward.Array.
2. Flattening nested or non-uniform awkward.Arrays using ak.flatten.
3. Converting the result to numpy.ndarray via ak.to_numpy().

This maintains backward compatibility and aligns with existing helper functions that aim to standardize various array-like formats for Matplotlib.

PR Checklist

handle akward array data
@tacaswell
Member

I am 👍🏻 on this in principle, however I have concerns about the implementation.

First, we cannot pick up an Awkward Array dependency in Matplotlib, nor should we import Awkward Array if the user has not already imported it. So if we are to keep the type check like this, it needs to be hidden behind sys.modules inspection.

Second, I am not clear why flattening a ragged array is the right thing to do in most cases, as it seems to be dropping important structural information on the floor.

I would appreciate if @jpivarski could weigh in on this being the right approach.
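As a rough illustration of the sys.modules inspection mentioned above, a type check could look the module up instead of importing it. This is only a sketch: the helper name `is_awkward_array` is hypothetical and is not part of Matplotlib or of this PR.

```python
import sys


def is_awkward_array(obj) -> bool:
    """Return True only if awkward is already imported and obj is an ak.Array.

    Hypothetical helper for illustration; not an existing Matplotlib function.
    """
    # Look the module up without importing it. If the user never imported
    # awkward, obj cannot be an awkward.Array, so there is nothing to check.
    ak = sys.modules.get("awkward")
    if ak is None:
        return False
    array_type = getattr(ak, "Array", None)
    return isinstance(array_type, type) and isinstance(obj, array_type)
```

With this shape, Matplotlib never triggers an import of Awkward Array itself; the check only does real work for users who have already loaded the library.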

@jpivarski

Yeah, Matplotlib can't take on Awkward Array as a dependency. Matplotlib is a more fundamental library in the ecosystem.

It would be possible to determine if an object is of a particular type in a third-party module without loading that module, for instance with

def is_in_module_with_type(tpe: type, module: str, name: str | None = None) -> bool:
    module_name = tpe.__module__
    if module_name == module or module_name.startswith(module + "."):
        if name is None or name == tpe.__name__:
            return True
    mro = tpe.mro()
    if len(mro) <= 1:
        return False
    return is_in_module_with_type(mro[1], module, name)

but I don't think Matplotlib should get into the business of "knowing" about all the other libraries and their types. If it did, this section would fill up with checking for array types in libraries, and some of them would even be abandoned, making this not-quite-but-effectively dead code.
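To make the name-based check concrete, here is a small variant that loops over the full method resolution order and is exercised on a standard-library type, so it runs without any third-party package. The helper name `comes_from_module` is hypothetical, and `fractions.Fraction` merely stands in for a third-party array type.

```python
from fractions import Fraction
from typing import Optional


def comes_from_module(obj, module: str, name: Optional[str] = None) -> bool:
    """Check whether obj's type, or any base class, is defined in `module`.

    Only __module__ and __name__ strings are compared, so `module` itself
    is never imported by this function.
    """
    for cls in type(obj).__mro__:
        mod = cls.__module__
        # Match the module itself or any of its submodules.
        if mod == module or mod.startswith(module + "."):
            if name is None or name == cls.__name__:
                return True
    return False


class MyFraction(Fraction):
    """A user-defined subclass; it is still recognized through its base class."""


print(comes_from_module(MyFraction(1, 2), "fractions", "Fraction"))  # True
print(comes_from_module(0.5, "fractions"))  # False
```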

My original intention was to have Awkward Arrays not automatically flatten when passed to histogramming and other plotting code. This section from the user guide explains that choice: although flattening might be the most common thing to do, it's not the only way one would want to map the structure of their data onto a plot. (What about taking the mean of each list? The maximum of each list? The first? Actually drawing lists as little disconnected lines?) If it happened automatically, one might not realize it's happening and misinterpret their plot!
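A tiny sketch of why the choice matters, using plain nested Python lists as a stand-in for an Awkward Array (no Awkward API is used here). Skipping empty lists in the per-list reductions is an assumption made for this illustration, and is itself one more decision that automatic flattening would hide.

```python
from statistics import fmean

# Three "events" with a varying number of entries each.
ragged = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]

# Option 1: flatten, one data point per inner entry.
flattened = [x for sub in ragged for x in sub]

# Options 2-4: one data point per non-empty list.
means = [fmean(sub) for sub in ragged if sub]
maxima = [max(sub) for sub in ragged if sub]
firsts = [sub[0] for sub in ragged if sub]

print(flattened)  # [1.1, 2.2, 3.3, 4.4, 5.5]
print(maxima)     # [3.3, 5.5]
print(firsts)     # [1.1, 4.4]
```

Each result would produce a different histogram from the same input, which is why picking one silently could mislead.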

The one thing that I do think we need to avoid (and this happened in some older version of Matplotlib) is for the function call to proceed very slowly and then fail at the end. This happened because Matplotlib, recognizing that its argument was not an array but was iterable, proceeded to iterate over the megabytes of little lists and then conclude at the end that it couldn't plot lists, as lists are not numbers. It needs to fail fast so that users get their error message right away. I remember that being fixed, though it was years ago. (It wasn't something we could do on our side: Awkward Arrays should have an __iter__ method, and it will be slow.)

I'll do a quick check:

import matplotlib.pyplot as plt

class SomeIterable:
    def __iter__(self):
        output = [["U"], ["can't"], ["plot"], ["this"]]
        for x in output:
            print(x)
            yield x

plt.plot(SomeIterable(), SomeIterable())

raises

['U']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
  File ".../site-packages/matplotlib/_api/__init__.py", line 91, in check_isinstance
    raise TypeError(
TypeError: 'value' must be an instance of str or bytes, not a __main__.SomeIterable

which is very good: it fails fast. As soon as it sees the first iterated object that it can't plot, it gives up, without iterating through the (possibly GB of) data in the iterable.

So, users get an error message saying that this data can't be plotted, and I think that's ideal: it forces them to choose ak.flatten if that is actually what they want. Maybe the error message could hint, "Perhaps your data need to be flattened or destructured?" which is not Awkward-specific: it would be a problem with NumPy, too.
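The fail-fast behavior with a hint could be sketched roughly as follows. This is not how Matplotlib implements its checks; `check_first_is_number` and `CountingLists` are hypothetical names used only to show that a single element is pulled from the iterable before the error is raised.

```python
from numbers import Real


def check_first_is_number(data):
    """Inspect only the first element of an iterable and fail fast.

    Hypothetical helper; note that it consumes one element, so a one-shot
    iterator would need to be re-created before actual use.
    """
    iterator = iter(data)
    try:
        first = next(iterator)
    except StopIteration:
        return  # empty input: nothing to reject
    if not isinstance(first, Real):
        raise TypeError(
            f"cannot plot elements of type {type(first).__name__}. "
            "Perhaps your data need to be flattened or destructured?"
        )


class CountingLists:
    """Iterable of little lists that records how many items were produced."""

    def __init__(self):
        self.yielded = 0

    def __iter__(self):
        for item in [["U"], ["can't"], ["plot"], ["this"]]:
            self.yielded += 1
            yield item
```

Calling `check_first_is_number(CountingLists())` raises `TypeError` after a single item has been produced, however long the underlying data is.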

Oh, NumPy gives a different error message:

bad_numpy = np.arange(2*3*5).reshape(2, 3, 5)
plt.plot(bad_numpy, bad_numpy)
ValueError: x and y can be no greater than 2D, but have shapes (2, 3, 5) and (2, 3, 5)

What we have with the data that needs to be flattened is similar to the NumPy case.

Hey, wait a minute: with 2D NumPy arrays, it draws

numpy2D_x = np.array([[0.0, 1.0, 2.0], [0.1, 1.1, 2.1], [0.2, 1.2, 2.2]])
numpy2D_y = np.array([[1.0, 0.9, 1.1], [2.0, 1.0, 0.0], [0.0, 0.5, 2.0]])
plt.plot(numpy2D_x, numpy2D_y)

image

The second dimension is assumed to range over different-colored lines. There could be a good application of Awkward or ragged[1] arrays here, which would let the line segments have different lengths. However, the interpretation of axes isn't the one we'd want: it would be natural for each different-length list to represent one line segment, and a direct generalization of the above wouldn't do that.

Also, I guess it would be a bad mistake to make if you, as a user, really intended for the data to be flattened before plotting. Making millions of Line2D objects would probably be resource-intensive, not easy to KeyboardInterrupt out of.

Footnotes

  1. satisfying the Array API: you'd only have to check for a shape with None in a dimension

@tacaswell
Member

> Hey, wait a minute: with 2D NumPy arrays, it draws

This is an artifact of the MATLAB legacy and "tables as 2D arrays", so the 2nd dimension is the "right" one to treat as connected.

@jklymak
Member

jklymak commented Nov 13, 2024

@jpivarski Thanks for your thoughts here.

I'm a little confused about the motivation for this PR. It looks like AwkwardArray has an __array__ method, so np.asarray(X) should already work as "expected", which includes an error for ragged arrays. I think both AwkwardArray and Matplotlib are doing what they can here, and if someone has a complicated ragged array, they are going to need to unpack it in the manner that makes the most sense for the underlying data structure.

Probably something for the data pipeline in the future, but Matplotlib has the concept of plotting the columns of arrays separately in plot and errorbar (and hist). It seems possible to me that we should be trying to make plot etc convert the input y and maybe the input x to a list of 1D arrays rather than a 2D array, which would allow ragged arrays.

@jpivarski

Calling ak.flatten on a ragged array is a little more than calling __array__ on it:

>>> regular = ak.Array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])
>>> ak.flatten(regular)
<Array [1.1, 2.2, 3.3, 4.4, 5.5, 6.6] type='6 * float64'>
>>> np.asarray(regular)
array([[1.1, 2.2, 3.3],
       [4.4, 5.5, 6.6]])

>>> ragged = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> ak.flatten(ragged)
<Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>
>>> np.asarray(ragged)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
ValueError: cannot convert to RegularArray because subarray lengths are not regular (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-42/awkward-cpp/src/cpu-kernels/awkward_ListOffsetArray_toRegularArray.cpp#L22)

This error occurred while calling

    numpy.asarray(
        <Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
        dtype = None
    )

Flattening is the most common thing to want to do to a ragged array, but I didn't want it to happen by default because that would hide some structure-manipulation that should probably be explicit.

In a High Energy Physics (HEP) context, one is often wondering, "Is this a plot of events, a plot of particles, a plot of particle candidates?" (i.e. one collision event per data point or histogram entry, or one particle-in-an-event, or one candidate invisible particle, formed by trying all combinations of visible particles as decay products). These are different levels of nesting for Awkward/ragged arrays, so if flattening was automatic, the distinction between these cases would be blurred.

On the other hand, the frequency with which one wants to flatten, rather than anything else, might be in the high 90%'s. So I can see the motivation behind the PR.

@timhoffm
Member

timhoffm commented Nov 14, 2024

> Flattening is the most common thing to want to do to a ragged array, but I didn't want it to happen by default because that would hide some structure-manipulation that should probably be explicit.

To me, this is the central sentence. Matplotlib's interfaces are built around the idea of "array-likes". To be able to handle third-party data structures, we need to convert them to an array. On a technical level, the __array__ protocol delegates this conversion to the data structure. We should not get into the business of making assumptions and doing implicit conversions that the original data structure intentionally did not put into __array__.

@tacaswell
Member

@Thierno88 Thank you for your work on this, however I am going to close this as I think there is a consensus that Matplotlib should not be adding any assumptions to implicit data structure conversion. Although we are not going to merge this PR, the discussion it provoked is valuable!


I think a reasonable follow-up PR would be to add logic to provide better error messages in the case where we have been given a ragged Awkward array, with suggestions as to how the user could split it.

It might also be interesting to look at how awkward arrays play with LineCollection and what would we have to do to make it "just work" in that case.

@tacaswell tacaswell closed this Nov 14, 2024
Development

Successfully merging this pull request may close these issues.

Document and test what "array like" means to Matplotlib
5 participants