Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Improve pandas/xarray/... conversion #22560

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 21, 2022
Merged

Conversation

oscargus
Copy link
Member

@oscargus oscargus commented Feb 24, 2022

PR Summary

See https://stackoverflow.com/questions/13187778/convert-pandas-dataframe-to-numpy-array/54508052#54508052 for motivation.

Related to #16402

PR Checklist

Tests and Styling

  • Has pytest style unit tests (and pytest passes).
  • Is Flake 8 compliant (install flake8-docstrings and run flake8 --docstring-convention=all).

Documentation

  • [N/A] New features are documented, with examples if plot related.
  • [N/A] New features have an entry in doc/users/next_whats_new/ (follow instructions in README.rst there).
  • [N/A] API changes documented in doc/api/next_api_changes/ (follow instructions in README.rst there).
  • [N/A] Documentation is sphinx and numpydoc compliant (the docs should build without error).

if hasattr(X, 'values'): # support pandas.Series
X = X.values
if hasattr(X, 'to_numpy'): # support pandas.Series
X = X.to_numpy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any libraries we need to worry about that implement X.values, but not X.to_numpy()? Maybe we also want to leave an elif with the X.values block as a fallback just in case...?

@QuLogic
Copy link
Member

QuLogic commented Feb 25, 2022

Perhaps should be coordinated with #22141?

@oscargus
Copy link
Member Author

oscargus commented Feb 25, 2022

Perhaps should be coordinated with #22141?

That's where I got the idea. As seen, I have not touched cbook.

Related to the question from @greglucas, I do not know. But I saw this code in cbook (which I didn't touch):

# unpack if we have a values or to_numpy method.
try:
X = X.to_numpy()
except AttributeError:
try:
if isinstance(X.values, np.ndarray):
X = X.values
except AttributeError:
pass

This is a bit more complex as it uses a try-except approach (not sure how much that affects things though), has a fallback on values and check that values actually returns an nparray. One can of course use a similar approach here (and in #22141). Possibly slightly improved as I do not know if there are cases if values actually is non-trivial, so no need to run it twice.

Edit: I have now touched this and use a similar approach, but with hasattr instead of try-except see https://stackoverflow.com/questions/903130/hasattr-vs-try-except-block-to-deal-with-non-existent-attributes as the assumption is that most often we do not pass pandas or xarray (or whatever) data.

@greglucas
Copy link
Contributor

I guess you could write a cbook._unpack_pandas(x) helper to use that in all of these areas for consistency? I would say it is safer to keep the .values fallback, and if you do want to get rid of it eventually, you could put in a deprecation warning if that branch is hit.

@oscargus
Copy link
Member Author

Yes, I was also thinking about a helper function (but didn't know where to place it, so great info that cbook is the place). I do not have any strong opinions as such, more that I read the link and it seemed like the right thing to do.

@oscargus oscargus marked this pull request as draft February 25, 2022 16:46
@oscargus oscargus marked this pull request as ready for review February 27, 2022 12:11
@oscargus oscargus mentioned this pull request Feb 27, 2022
6 tasks
@oscargus
Copy link
Member Author

Seems like the easiest way is to wait for #22141, add those conversions here as well, and then discuss the correct name for the function.

@oscargus
Copy link
Member Author

I'm thinking that maybe _unpack_to_numpy is a general enough name?

@oscargus
Copy link
Member Author

oscargus commented Feb 28, 2022

This is now updated:

  • renamed function to _unpack_to_numpy
  • added test for xarray (now an extra test dependency) (Removed extra dependency)
  • (removed duplicate test code for pandas) (No duplicate code to start with...)

@oscargus oscargus changed the title Improve Pandas conversion Improve pandas/xarray/... conversion Feb 28, 2022
@tacaswell tacaswell added this to the v3.5.2 milestone Mar 3, 2022
@tacaswell
Copy link
Member

Please update the lines touched by #22141

@oscargus oscargus force-pushed the pandasconversion branch 3 times, most recently from 19e86ca to f17f3af Compare March 3, 2022 22:13
@oscargus
Copy link
Member Author

oscargus commented Mar 3, 2022

There is also this function where it doesn't work to simply replace to_numpy with _unpack_to_numpy. No idea why. (One reason may be that there is no attribute error for to_numpy anymore, so some types of input data is handled well by _check_1d, has an index method, but no to_numpy method... Seems like plain Python lists can be a culprit here then...)

def index_of(y):
"""
A helper function to create reasonable x values for the given *y*.
This is used for plotting (x, y) if x values are not explicitly given.
First try ``y.index`` (assuming *y* is a `pandas.Series`), if that
fails, use ``range(len(y))``.
This will be extended in the future to deal with more types of
labeled data.
Parameters
----------
y : float or array-like
Returns
-------
x, y : ndarray
The x and y values to plot.
"""
try:
return y.index.to_numpy(), y.to_numpy()
except AttributeError:
pass
try:
y = _check_1d(y)
except (np.VisibleDeprecationWarning, ValueError):
# NumPy 1.19 will warn on ragged input, and we can't actually use it.
pass
else:
return np.arange(y.shape[0], dtype=float), y
raise ValueError('Input could not be cast to an at-least-1D NumPy array')

np.testing.assert_array_equal(Idx, IdxRef)


def test_index_of_xarray(xr):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does xarray get us more coverage here? They have a to_numpy() method the same as pandas I believe.
https://xarray.pydata.org/en/stable/generated/xarray.DataArray.to_numpy.html

So, it seems like a pretty heavy dependency to add for just this one test...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not so much for coverage as for actually testing using data of specified formats. With the discussion about which formats we support, it makes sense to test them as well. Right now some of these are tested in the plots, but it can possibly make sense to simply test them here as these are the core function used to get data that can be plotted.

If we claim (which we actually don't, maybe we should?) that we can plot xarray, we should probably test it as well. And other types that we may want to claim to support. Or maybe fork off a specific dependency test that is not executed on all platforms/version, including pandas (which is 11.7 MB, xarray is 870 kB).

(There is another xarray-test above, so two.)

I can of course remove them, but I think we should discuss if we want to support more formats than pandas and numpy (and Python list/tuple), and, if so, have explicit tests for them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, we should probably discuss what we want to support/test. To me, this doesn't seem to add a whole lot of value for adding a new dependency.

There was also a discussion around removing Scipy as a dependency in the docs: #22120

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the dependencies but kept the tests. Hence, they will run if xarray is available.

I also opened #22645 for discussions (probably should be discussed at a dev-call as well).

@tacaswell
Copy link
Member

I made an executive decision to install xarray on CI.

We already have all of its dependencies installed and it is a pure-python package.



def _unpack_to_numpy(x):
"""Internal helper to extract data from e.g. pandas and xarray objects."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document what we intend to support, i.e. everything with .to_numpy() or .values, and what types we expect to catch with it, e.g. values-> older pandas dataframes(?).

@jklymak jklymak merged commit 709fba8 into matplotlib:main Apr 21, 2022
@tacaswell
Copy link
Member

We discussed this on the call, I will write up an issues shortly about the documentation of the intention etc.

@QuLogic
Copy link
Member

QuLogic commented Apr 21, 2022

meeseeksdev backport to v3.5.x

meeseeksmachine pushed a commit to meeseeksmachine/matplotlib that referenced this pull request Apr 21, 2022
QuLogic added a commit that referenced this pull request Apr 29, 2022
…560-on-v3.5.x

Backport PR #22560 on branch v3.5.x (Improve pandas/xarray/... conversion)
mocquin added a commit to mocquin/physipy that referenced this pull request May 7, 2022
see matplotlib/matplotlib#22973, matplotlib/matplotlib#22879, and matplotlib/matplotlib#22560
It is not clear to me as to which is the standard interface for unit handling (for eg, hist still doesn't handle unit by default)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants