Enable Awkward Array Support in _unpack_to_numpy for Improved Array-like Compatibility #29136
handle akward array data
I am 👍 on this in principle, however I have concerns about the implementation. First, we can not pick up an Awkward Array dependency in Matplotlib, nor should we import Awkward Array if the user has not already imported it, so if we are to keep the type check like this, it needs to be hidden behind
Second, I am not clear why flattening a ragged array is the right thing to do in most cases, as it seems to drop important structural information on the floor. I would appreciate it if @jpivarski could weigh in on whether this is the right approach.
Yeah, Matplotlib can't take on Awkward Array as a dependency. Matplotlib is a more fundamental library in the ecosystem. It would be possible to determine if an object is of a particular type in a third-party module without loading that module, for instance with

    def is_in_module_with_type(tpe: type, module: str, name: str | None = None) -> bool:
        module_name = tpe.__module__
        if module_name == module or module_name.startswith(module + "."):
            if name is None or name == tpe.__name__:
                return True
        mro = tpe.mro()
        if len(mro) <= 1:
            return False
        return is_in_module_with_type(mro[1], module, name)

but I don't think Matplotlib should get into the business of "knowing" about all the other libraries and their types. If it did, this section would fill up with checks for array types in libraries, and some of them would even be abandoned, making this not-quite-but-effectively dead code.

My original intention was to have Awkward Arrays not automatically flatten when passed to histogramming and other plotting code. This section from the user guide explains that choice: although flattening might be the most common thing to do, it's not the only way one would want to map the structure of their data onto a plot. (What about taking the mean of each list? The maximum of each list? The first? Actually drawing lists as little disconnected lines?) If it happened automatically, one might not realize it's happening and misinterpret their plot!

The one thing that I do think we need to avoid (and this happened in some older version of Matplotlib) is for the function call to proceed very slowly and then fail at the end. This happened because Matplotlib, recognizing that its argument was not an array but was iterable, proceeded to iterate over the megabytes of little lists and then conclude at the end that it couldn't plot lists, as lists are not numbers. It needs to fail fast so that users get their error message right away. I remember that being fixed, though it was years ago. (It wasn't something we could do on our side: Awkward Arrays should have an

I'll do a quick check:

    class SomeIterable:
        def __iter__(self):
            output = [["U"], ["can't"], ["plot"], ["this"]]
            for x in output:
                print(x)
                yield x

    plt.plot(SomeIterable(), SomeIterable())

raises
which is very good: it fails fast. As soon as it sees the first iterated object that it can't plot, it gives up, without iterating through the (possibly GB of) data in the iterable. So, users get an error message saying that this data can't be plotted, and I think that's ideal: it forces them to choose ak.flatten if that is actually what they want. Maybe the error message could hint, "Perhaps your data need to be flattened or destructured?" which is not Awkward-specific; it would be a problem with NumPy, too.

Oh, NumPy gives a different error message:

    bad_numpy = np.arange(2*3*5).reshape(2, 3, 5)
    plt.plot(bad_numpy, bad_numpy)

What we have with the data that needs to be flattened is similar to the NumPy case. Hey, wait a minute: with 2D NumPy arrays, it draws

    numpy2D_x = np.array([[0.0, 1.0, 2.0], [0.1, 1.1, 2.1], [0.2, 1.2, 2.2]])
    numpy2D_y = np.array([[1.0, 0.9, 1.1], [2.0, 1.0, 0.0], [0.0, 0.5, 2.0]])
    plt.plot(numpy2D_x, numpy2D_y)

The second dimension is assumed to range over different-colored lines. There could be a good application of Awkward or ragged[1] arrays here, which would let the line segments have different lengths. However, the interpretation of axes isn't the one we'd want: it would be natural for each different-length list to represent one line segment, and a direct generalization of the above wouldn't do that. Also, I guess it would be a bad mistake to make if you, as a user, really intended for the data to be flattened before plotting. Making millions of

Footnotes
This is an artifact of the MATLAB legacy and "tables as 2D arrays", so the 2nd dimension is the "right" one to treat as connected.
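To make the column convention concrete, here is a minimal sketch in plain Python. It does not call Matplotlib; it only mimics how 2D `x` and `y` inputs to `plt.plot` are split into one data set per column. The helper name `split_into_lines` is hypothetical, and nested lists stand in for the NumPy arrays from the comment above.

```python
# Plain nested lists standing in for the 2D NumPy arrays quoted above.
x_rows = [[0.0, 1.0, 2.0], [0.1, 1.1, 2.1], [0.2, 1.2, 2.2]]
y_rows = [[1.0, 0.9, 1.1], [2.0, 1.0, 0.0], [0.0, 0.5, 2.0]]


def split_into_lines(x_rows, y_rows):
    """Pair column j of x with column j of y: one (xs, ys) pair per line."""
    x_cols = list(zip(*x_rows))  # transpose rows into columns
    y_cols = list(zip(*y_rows))
    return list(zip(x_cols, y_cols))


lines = split_into_lines(x_rows, y_rows)
for xs, ys in lines:
    print(xs, ys)
```

Each printed pair corresponds to one colored line, which is why a ragged input, whose "rows" have different lengths, has no direct column-wise equivalent.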
@jpivarski Thanks for your thoughts here. I'm a little confused about the motivation for this PR. It looks like AwkwardArray has an Probably something for the data pipeline in the future, but Matplotlib has the concept of plotting the columns of arrays separately in
Calling ak.flatten on a ragged array is a little more than calling

    >>> regular = ak.Array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])
    >>> ak.flatten(regular)
    <Array [1.1, 2.2, 3.3, 4.4, 5.5, 6.6] type='6 * float64'>
    >>> np.asarray(regular)
    array([[1.1, 2.2, 3.3],
           [4.4, 5.5, 6.6]])
    >>> ragged = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
    >>> ak.flatten(ragged)
    <Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>
    >>> np.asarray(ragged)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      ...
    ValueError: cannot convert to RegularArray because subarray lengths are not regular (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-42/awkward-cpp/src/cpu-kernels/awkward_ListOffsetArray_toRegularArray.cpp#L22)
    This error occurred while calling
        numpy.asarray(
            <Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
            dtype = None
        )

Flattening is the most common thing to want to do to a ragged array, but I didn't want it to happen by default because that would hide some structure manipulation that should probably be explicit. In a High Energy Physics (HEP) context, one is often wondering, "Is this a plot of events, a plot of particles, a plot of particle candidates?" (i.e. one collision event per data point or histogram entry, or one particle-in-an-event, or one candidate invisible particle, formed by trying all combinations of visible particles as decay products). These are different levels of nesting for Awkward/ragged arrays, so if flattening were automatic, the distinction between these cases would be blurred.

On the other hand, the frequency with which one wants to flatten, rather than anything else, might be in the high 90%'s. So I can see the motivation behind the PR.
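The point that flattening is only one of several ways to map ragged data onto a plot can be illustrated with a small sketch. It deliberately uses plain Python lists rather than Awkward Arrays so that it runs without any third-party package; the helper names `flatten` and `per_list` are hypothetical and are not part of Awkward Array or Matplotlib.

```python
from statistics import mean

# Same ragged data as in the session above, as plain nested lists.
ragged = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]


def flatten(lists):
    """Drop the list boundaries: one value per innermost element."""
    return [value for sub in lists for value in sub]


def per_list(lists, reduce):
    """Apply `reduce` to each non-empty list; empty lists yield None."""
    return [reduce(sub) if sub else None for sub in lists]


flat = flatten(ragged)                       # one entry per value
maxima = per_list(ragged, max)               # one entry per list
firsts = per_list(ragged, lambda sub: sub[0])
means = per_list(ragged, mean)

print(len(flat), len(maxima))  # different numbers of plotted entries
```

The flattened result has five entries while every per-list reduction has three, so the choice changes what a histogram of the data counts. That is the ambiguity an implicit flatten would hide.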
To me, this is the central sentence. Matplotlib's interfaces are built around the idea of "array-likes". To be able to handle third-party data structures, we need to convert them to an array. On a technical level, the
@Thierno88 Thank you for your work on this, however I am going to close this as I think there is a consensus that Matplotlib should not be adding any assumptions to implicit data structure conversion. Although we are not going to merge this PR, the discussion it provoked is valuable! I think a reasonable followup PR would be to add logic to provide better error messages in the case where we have been given a ragged Awkward Array, with suggestions as to how the user could split it. It might also be interesting to look at how Awkward Arrays play with
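As a rough illustration of the suggested follow-up, the sketch below combines the fail-fast behavior described earlier in the thread with a hint in the error message. It is an assumption about what such logic might look like, not Matplotlib code: `check_plottable` is a hypothetical name, and the test for "is a number" is deliberately simplistic.

```python
def check_plottable(data):
    """Raise quickly, with a hint, if the first item of `data` is not a number.

    Only the first element is inspected, so a large iterable of lists is
    rejected without being traversed in full.
    """
    for first in data:
        if isinstance(first, (int, float)):
            return
        raise TypeError(
            f"cannot plot element {first!r} of type {type(first).__name__}; "
            "perhaps your data need to be flattened or destructured?"
        )


try:
    check_plottable([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
except TypeError as err:
    print(err)
```

The hint is worded generically on purpose: as noted above, the same situation arises with nested lists or higher-dimensional NumPy arrays, so it does not need to know about Awkward Array.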
PR Summary
Why is this change necessary?
The _unpack_to_numpy function currently supports a range of array-like objects but does not account for the awkward.Array type, which is widely used for data with irregular or nested structures. This change makes it easier for users to visualize data stored in awkward.Arrays without needing manual conversions, especially useful when subarray lengths vary.
What problem does it solve?
This PR modifies _unpack_to_numpy to recognize and handle awkward.Arrays, flattening them and converting them to numpy.ndarray format for direct compatibility with Matplotlib's plotting functions. It addresses compatibility issues and prevents errors when attempting to plot irregular arrays.
Reasoning for Implementation
The enhancement involves:
1. Detecting if the input array is an awkward.Array.
2. Flattening nested or non-uniform awkward.Arrays using ak.flatten.
3. Converting the result to numpy.ndarray via ak.to_numpy().
This maintains backward compatibility and aligns with existing helper functions that aim to standardize various array-like formats for Matplotlib.
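The three steps above can be sketched as follows. This is not the PR's actual diff: `unpack_awkward` is a hypothetical helper, and the lookup through `sys.modules` is an assumption added here to respect the review comment that Matplotlib must not import Awkward Array itself. It relies on `ak.flatten(..., axis=None)` and `ak.to_numpy` from Awkward Array.

```python
import sys


def unpack_awkward(x):
    """Return a flat NumPy array for an awkward.Array, else return x unchanged."""
    # Step 1: detect the type only if the user has already imported awkward.
    ak = sys.modules.get("awkward")
    if ak is None or not isinstance(x, ak.Array):
        return x
    # Step 2: axis=None removes all nesting, so list boundaries are discarded.
    flat = ak.flatten(x, axis=None)
    # Step 3: convert the now-regular array to numpy.ndarray.
    return ak.to_numpy(flat)


print(unpack_awkward([1, 2, 3]))  # non-Awkward input passes through untouched
```

Step 2 is where the structural information discussed in the conversation is lost, which is the reason the reviewers preferred an explicit `ak.flatten` call by the user.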
PR Checklist