[Bug]: plt.pcolormesh crashes when called with Int64 nullable dtype #23991

Open
mojones opened this issue Sep 23, 2022 · 16 comments

Comments

@mojones

mojones commented Sep 23, 2022

Bug summary

When calling plt.pcolormesh with a dataframe containing either of the pandas nullable dtypes (Int64 or Float64) it crashes.

Code for reproduction

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'foo' : [1,2],
    'bar' : [3,np.nan]
    })

# works as expected
plt.pcolormesh(df)

# crashes
plt.pcolormesh(df.astype('Int64'))

Actual outcome

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [28], in <cell line: 13>()
     10 plt.pcolormesh(df)
     12 # crashes
---> 13 plt.pcolormesh(df.astype('Int64'))

File ~/.virtualenvs/teaching/lib/python3.10/site-packages/matplotlib/pyplot.py:2728, in pcolormesh(alpha, norm, cmap, vmin, vmax, shading, antialiased, data, *args, **kwargs)
   2723 @_copy_docstring_and_deprecators(Axes.pcolormesh)
   2724 def pcolormesh(
   2725         *args, alpha=None, norm=None, cmap=None, vmin=None,
   2726         vmax=None, shading=None, antialiased=False, data=None,
   2727         **kwargs):
-> 2728     __ret = gca().pcolormesh(
   2729         *args, alpha=alpha, norm=norm, cmap=cmap, vmin=vmin,
   2730         vmax=vmax, shading=shading, antialiased=antialiased,
   2731         **({"data": data} if data is not None else {}), **kwargs)
   2732     sci(__ret)
   2733     return __ret

File ~/.virtualenvs/teaching/lib/python3.10/site-packages/matplotlib/__init__.py:1414, in _preprocess_data.<locals>.inner(ax, data, *args, **kwargs)
   1411 @functools.wraps(func)
   1412 def inner(ax, *args, data=None, **kwargs):
   1413     if data is None:
-> 1414         return func(ax, *map(sanitize_sequence, args), **kwargs)
   1416     bound = new_sig.bind(ax, *args, **kwargs)
   1417     auto_label = (bound.arguments.get(label_namer)
   1418                   or bound.kwargs.get(label_namer))

File ~/.virtualenvs/teaching/lib/python3.10/site-packages/matplotlib/axes/_axes.py:6072, in Axes.pcolormesh(self, alpha, norm, cmap, vmin, vmax, shading, antialiased, *args, **kwargs)
   6068 C = C.ravel()
   6070 kwargs.setdefault('snap', rcParams['pcolormesh.snap'])
-> 6072 collection = mcoll.QuadMesh(
   6073     coords, antialiased=antialiased, shading=shading,
   6074     array=C, cmap=cmap, norm=norm, alpha=alpha, **kwargs)
   6075 collection._scale_norm(norm, vmin, vmax)
   6076 self._pcolor_grid_deprecation_helper()

File ~/.virtualenvs/teaching/lib/python3.10/site-packages/matplotlib/collections.py:2015, in QuadMesh.__init__(self, *args, **kwargs)
   2012 self._bbox.update_from_data_xy(self._coordinates.reshape(-1, 2))
   2013 # super init delayed after own init because array kwarg requires
   2014 # self._coordinates and self._shading
-> 2015 super().__init__(**kwargs)
   2016 self.mouseover = False

File ~/.virtualenvs/teaching/lib/python3.10/site-packages/matplotlib/collections.py:217, in Collection.__init__(self, edgecolors, facecolors, linewidths, linestyles, capstyle, joinstyle, antialiaseds, offsets, transOffset, norm, cmap, pickradius, hatch, urls, zorder, **kwargs)
    214 self._transOffset = transOffset
    216 self._path_effects = None
--> 217 self.update(kwargs)
    218 self._paths = None

File ~/.virtualenvs/teaching/lib/python3.10/site-packages/matplotlib/artist.py:1069, in Artist.update(self, props)
   1066             if not callable(func):
   1067                 raise AttributeError(f"{type(self).__name__!r} object "
   1068                                      f"has no property {k!r}")
-> 1069             ret.append(func(v))
   1070 if ret:
   1071     self.pchanged()

File ~/.virtualenvs/teaching/lib/python3.10/site-packages/matplotlib/collections.py:2077, in QuadMesh.set_array(self, A)
   2072     if faulty_data:
   2073         raise TypeError(
   2074             f"Dimensions of A {A.shape} are incompatible with "
   2075             f"X ({width}) and/or Y ({height})")
-> 2077 return super().set_array(A)

File ~/.virtualenvs/teaching/lib/python3.10/site-packages/matplotlib/cm.py:477, in ScalarMappable.set_array(self, A)
    475 A = cbook.safe_masked_invalid(A, copy=True)
    476 if not np.can_cast(A.dtype, float, "same_kind"):
--> 477     raise TypeError(f"Image data of dtype {A.dtype} cannot be "
    478                     "converted to float")
    480 self._A = A

TypeError: Image data of dtype object cannot be converted to float

Expected outcome

image

Additional information

Noticed via seaborn mwaskom/seaborn#3042.

Operating system

lubuntu

Matplotlib Version

3.5.3

Matplotlib Backend

module://matplotlib_inline.backend_inline

Python version

3.10.4

Jupyter version

6.4.12

Installation

pip

@jklymak
Member

jklymak commented Sep 23, 2022

At our root we are a numpy based library. If you can't cleanly convert your data to a numpy numerical array with np.asanyarray we probably can't plot it.

dd = df.astype('Int64')
print(np.asanyarray(dd).dtype)

returns an object array, which we obviously cannot pcolormesh because each element could be anything.

I would say this is an upstream issue, and pandas and numpy need to work out what they want to do here, versus us adding a conversion shim.
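To illustrate the point, here is a minimal sketch of the dtype check visible at the bottom of the traceback (`np.can_cast(A.dtype, float, "same_kind")`). The helper name `passes_float_gate` is hypothetical and not part of matplotlib, and the object array is built by hand with NumPy rather than taken from pandas, since what pandas returns for nullable dtypes may differ between versions.

```python
import numpy as np

def passes_float_gate(values):
    """Hypothetical helper: apply the same cast check the traceback shows."""
    arr = np.asanyarray(values)
    return np.can_cast(arr.dtype, float, "same_kind")

# Plain integer and float inputs are castable to float.
int_ok = passes_float_gate([[0, 1], [2, 3]])
float_ok = passes_float_gate([[0.0, np.nan]])

# An object array could hold anything, so the check rejects it.
object_ok = passes_float_gate(np.array([[1, 2], [3, None]], dtype=object))

print(int_ok, float_ok, object_ok)
```

Only the last call is rejected, which corresponds to the "Image data of dtype object cannot be converted to float" error in the report.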

@mwaskom

mwaskom commented Sep 23, 2022

I think the root issue here is that pd.NA cannot be cast to float:

float(pd.NA)
TypeError: float() argument must be a string or a real number, not 'NAType'

I do find this a little surprising; you might expect float(pd.NA) -> np.nan. And in fact, if you do df.astype("Int64").astype(float) it round trips back to np.nan. So I think it probably is fair to call this upstream of matplotlib too :/
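The round trip mentioned above can be used as a user-side workaround. This is only a sketch of the explicit cast, assuming the same two-column frame as in the report; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"foo": [1, 2], "bar": [3, np.nan]})
nullable = df.astype("Int64")  # 'bar' now holds pd.NA instead of NaN

# Explicit cast back to a NumPy float dtype; pd.NA becomes np.nan again.
as_float = nullable.astype(float)
values = np.asanyarray(as_float)

print(values.dtype)
```

The resulting float array (or `as_float` itself) is the kind of input `plt.pcolormesh` already accepts in the "works as expected" case of the report.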

@MarcoGorelli
Contributor

MarcoGorelli commented Sep 30, 2022

This would also require changing

C = np.asanyarray(args[0])

to

C = np.asanyarray(args[0], dtype=float)

Together with

In [1]: float(pd.NA)
Out[1]: nan

it looks like this solves the issue

Would that be acceptable?

The reason this is necessary is that a Series of type Int64, after to_numpy(), currently becomes of dtype object regardless of whether it contains missing values, see this comment

@jklymak
Member

jklymak commented Sep 30, 2022

I guess this is OK? It will mean that an integer array ax.pcolormesh([[0, 1], [2, 3]]) will get converted to float earlier in the pipeline, but I assume that is OK. I think other object arrays will still fail to convert.

Someone will have to go through and do this for all the methods that take a 2-D array (pcolormesh, imshow, contour, contourf, streamplot all pop to mind, but likely more).

I think it's a bit stubborn to not just convert the array to float, but I can see the idea that you want to preserve the ints as ints.

@mwaskom

mwaskom commented Sep 30, 2022

it looks like this solves the issue

I don't think so?

df = pd.DataFrame([[1, 2], [3, pd.NA]]).astype("Int64")
np.asanyarray(df, float)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [72], line 2
      1 df = pd.DataFrame([[1, 2], [3, pd.NA]]).astype("Int64")
----> 2 np.asanyarray(df, float)

File ~/miniconda/envs/py310/lib/python3.10/site-packages/pandas/core/generic.py:2069, in NDFrame.__array__(self, dtype)
   2068 def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
-> 2069     return np.asarray(self._values, dtype=dtype)

TypeError: float() argument must be a string or a real number, not 'NAType'

@MarcoGorelli
Contributor

MarcoGorelli commented Sep 30, 2022

@mwaskom sorry I meant with the float(pd.NA) returning nan change as well (I was trying this out here pandas-dev/pandas#48885)

image

I think its a bit stubborn to not just convert the array to float

Yeah TBH I'm not keen on the current solution of to_numpy() converting everything to object by default. I've opened pandas-dev/pandas#48891 to discuss

@jklymak
Member

jklymak commented Sep 30, 2022

Oh, sorry, I somehow thought it worked straight as np.asanyarray(df.astype('Int64'), dtype=np.float64) (I thought I'd tested it, but apparently my coffee hadn't kicked in). If that doesn't work, I'm not clear what you mean by

the float(pd.NA) returning nan change as well

@MarcoGorelli
Contributor

I meant changing pandas so that float(pd.NA) returns np.nan

If I make that change in pandas, and also change

C = np.asanyarray(args[0])

to

C = np.asanyarray(args[0], dtype=float)

then the snippet from the issue works

@tacaswell
Member

Shouldn't this be handled via the unit code?

@jklymak
Member

jklymak commented Sep 30, 2022

I don't think we pass any mappable data through a units machinery, despite some attempts/desire to do so. I guess if this was x or y data it could be handled via units, but I'm not aware of special units for pandas objects.

@tacaswell
Member

oh, fair point I missed that this was the color channel not the x/y 🐑

There are handlers registered somewhere (I think by pandas) to correctly handle their datetime types. I think it would make sense to also handle the nullable integers the same way (we treat them as a "unit" that needs casting to floats somehow).

@jklymak
Member

jklymak commented Sep 30, 2022

Agreed. OTOH asanyarray should work and all that is being suggested here is that we further cast to float, which makes sense to me. It may actually catch user type errors sooner in the pipeline.

@tacaswell
Member

fair, but explicit type casts is a bit of a code-smell to me (which may just be due to the persistent issues with the unit code, which is not currently used here).

@MarcoGorelli
Contributor

MarcoGorelli commented Oct 1, 2022

explicit type casts is a bit of a code-smell

Hi @tacaswell - would love your thoughts on pandas-dev/pandas#48891 if possible

In the case of e.g. pd.Series([1,2,3], dtype='Int64').to_numpy(), then it could be transformed to int64 and there'd be no need to type cast.
On the other hand, pd.Series([1,2,pd.NA], dtype='Int64').to_numpy() can't get converted to int64, and there's a desire in pandas to avoid value-dependent behaviour. So, this would raise, unless the user specifies something like pd.Series([1,2,pd.NA], dtype='Int64').to_numpy('int64', na_value=-1) or pd.Series([1,2,pd.NA], dtype='Int64').to_numpy('float64').
If you didn't want to cast by default in matplotlib, I think that might have to look like

try:
    c = np.asanyarray(args[0])
except ValueError:
    c = np.asanyarray(args[0], dtype='float64')

Could that be acceptable?
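For comparison, here is a rough sketch of a shim that avoids casting by default but does not rely on an exception being raised. The function name `to_plottable_array` is hypothetical and this is not matplotlib's implementation. It assumes the input either converts to a non-object NumPy array on its own, or exposes the pandas `to_numpy(dtype=..., na_value=...)` method.

```python
import numpy as np
import pandas as pd

def to_plottable_array(values):
    """Hypothetical shim: keep numeric dtypes, fall back to float for object arrays."""
    arr = np.asanyarray(values)
    if arr.dtype != object:
        return arr  # ints stay ints, floats stay floats
    if hasattr(values, "to_numpy"):
        # pandas containers accept an explicit dtype and a fill value for pd.NA
        return values.to_numpy(dtype="float64", na_value=np.nan)
    return arr.astype(float)  # raises for objects that are not numeric

nullable = pd.DataFrame({"foo": [1, 2], "bar": [3, pd.NA]}).astype("Int64")
out = to_plottable_array(nullable)
plain = to_plottable_array([[0, 1], [2, 3]])

print(out.dtype, plain.dtype)
```

With this approach an ordinary integer input such as `[[0, 1], [2, 3]]` is left as integers, and only inputs that would otherwise end up as object arrays are converted to float.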

@jklymak
Member

jklymak commented Oct 1, 2022

Well I'd argue the code smell here (had to look that phrase up) is that pandas didn't push their NA concept upstream to numpy.

From a practical point of view I was going to claim that we implicitly cast to float pretty early here. However, that is not true: we have BoundaryNorm, which maps to integers. So we have another example where things have developed too many paths to easily infer behaviour. In this case I think our best bet in the short term is to tell folks to explicitly cast their data to something that works.

Long term, I think the idea of unit conversion for mappables may be OK, but maybe via passing a converter in explicitly and/or allowing norms to handle the unit conversion. The problem, as always, is whether the units are triggered via the container object or the type of the elements in an array.

@dstansby
Member

Do we have any other methods that can successfully plot this data type? (e.g. plot, scatter)


6 participants