Plotting a pandas DataFrame with string MultiIndex #18371

Closed
kasra-keshavarz opened this issue Aug 28, 2020 · 14 comments · Fixed by #18374

Comments

@kasra-keshavarz

Bug report

I am trying to plot a violinplot of my DataFrame, whose elements are all of dtype np.float64. It also has a MultiIndex of mixed integer and string values. When the DataFrame is passed to matplotlib.pyplot.boxplot, the following error is raised, whereas in earlier versions of matplotlib this problem did not exist.

TL;DR: matplotlib fails to plot a DataFrame with a string index.

Python code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.DataFrame(np.random.randn(8, 5), index=index, columns=['un', 'deux', 'trois', 'quatre', 'cinq'])
s.unstack().boxplot()  # works perfectly fine
plt.boxplot(s.unstack())  # fails

Actual outcome

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-64fecc3e6540> in <module>
----> 1 plt.boxplot(s.unstack())

~\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\pyplot.py in boxplot(x, notch, sym, vert, whis, positions, widths, patch_artist, bootstrap, usermedians, conf_intervals, meanline, showmeans, showcaps, showbox, showfliers, boxprops, labels, flierprops, medianprops, meanprops, capprops, whiskerprops, manage_ticks, autorange, zorder, data)
   2510         whiskerprops=whiskerprops, manage_ticks=manage_ticks,
   2511         autorange=autorange, zorder=zorder,
-> 2512         **({"data": data} if data is not None else {}))
   2513 
   2514 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, data, *args, **kwargs)
   1436     def inner(ax, *args, data=None, **kwargs):
   1437         if data is None:
-> 1438             return func(ax, *map(sanitize_sequence, args), **kwargs)
   1439 
   1440         bound = new_sig.bind(ax, *args, **kwargs)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\axes\_axes.py in boxplot(self, x, notch, sym, vert, whis, positions, widths, patch_artist, bootstrap, usermedians, conf_intervals, meanline, showmeans, showcaps, showbox, showfliers, boxprops, labels, flierprops, medianprops, meanprops, capprops, whiskerprops, manage_ticks, autorange, zorder)
   3682 
   3683         bxpstats = cbook.boxplot_stats(x, whis=whis, bootstrap=bootstrap,
-> 3684                                        labels=labels, autorange=autorange)
   3685         if notch is None:
   3686             notch = rcParams['boxplot.notch']

~\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\cbook\__init__.py in boxplot_stats(X, whis, bootstrap, labels, autorange)
   1175 
   1176         # arithmetic mean
-> 1177         stats['mean'] = np.mean(x)
   1178 
   1179         # medians and quartiles

<__array_function__ internals> in mean(*args, **kwargs)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\fromnumeric.py in mean(a, axis, dtype, out, keepdims)
   3371 
   3372     return _methods._mean(a, axis=axis, dtype=dtype,
-> 3373                           out=out, **kwargs)
   3374 
   3375 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\_methods.py in _mean(a, axis, dtype, out, keepdims)
    158             is_float16_result = True
    159 
--> 160     ret = umr_sum(arr, axis, dtype, out, keepdims)
    161     if isinstance(ret, mu.ndarray):
    162         ret = um.true_divide(

TypeError: cannot perform reduce with flexible type

Expected outcome

image
The same plot that pandas.DataFrame.boxplot() produces.

Matplotlib version

  • Operating system: Windows 10
  • Matplotlib version: 3.3.1
  • Matplotlib backend: module://ipykernel.pylab.backend_inline
  • Python version: Python 3.7.7 (default, May 6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)]
  • Jupyter version (if applicable): 6.1.1
  • Other libraries: Pandas 1.1.1

I am using Anaconda and updated all my libraries to the latest versions with the 'conda update --update-all' command.

@kasra-keshavarz kasra-keshavarz changed the title Plotting a pandas DataFrame with string index Plotting a pandas DataFrame with string MultiIndex Aug 28, 2020
@jklymak
Member

jklymak commented Aug 28, 2020

This bisects to ac018af #17289 (@QuLogic)

However, I will point out that matplotlib doesn't magically work with every possible pandas object. We try to maintain some basic compatibility, but for a more complicated object like this, I am not surprised things broke. Like, what the heck is

<bound method DataFrame.unstack of                     un      deux     trois    quatre      cinq
first second
bar   one     0.973978 -1.033628  0.447611  0.551987 -1.169132
      two     0.701825  0.174550 -0.067057 -0.870809 -0.182306
baz   one     0.317420 -2.886704  2.002434  1.099354 -1.165119
      two     0.730384  0.161140 -2.744320  0.743795  0.090105
foo   one    -0.902822 -1.659980  0.548587  0.670274 -2.389515
      two    -0.018007 -1.723369 -0.653125 -0.511720  0.241532
qux   one    -0.208663 -0.050863  1.535365  1.356895  0.211385
      two     0.109886 -1.027877 -1.373521  1.512366 -0.028899>

and what is Matplotlib supposed to sensibly do with it? I'm not sure the fact that it worked before was anything more than a happy coincidence, rather than something we planned to support. @phobson are you the box plot guru and can you comment?

@kasra-keshavarz
Author

My understanding is that the DataFrame is still a bunch of vectors of data. At least, I suppose matplotlib should interpret it that way. The raised error appears to relate to the dtype of the array-like input data, which in this case is np.float64, so the error really doesn't make sense. However, matplotlib can plot the input data if it is converted to a numpy.array by simply getting the values of the DataFrame:

s.unstack().values

So, my understanding is that matplotlib raises a misleading error when plotting a DataFrame whose index and columns are of 'str' dtype.
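For readers who want to try the workaround described in this comment, here is a minimal sketch. It assumes a headless Agg backend and rebuilds the frame with MultiIndex.from_product for brevity; neither choice comes from the report itself.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
)
s = pd.DataFrame(
    np.random.randn(8, 5),
    index=index,
    columns=["un", "deux", "trois", "quatre", "cinq"],
)

wide = s.unstack()    # 4 rows, 10 columns labelled by (name, 'one'/'two') tuples
values = wide.values  # plain float64 ndarray; the string labels are dropped

fig, ax = plt.subplots()
artists = ax.boxplot(values)  # a 2D array is drawn as one box per column
n_boxes = len(artists["boxes"])
plt.close(fig)
```

Because the labels are discarded, the boxes are numbered by position rather than named after the columns.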

@jklymak
Member

jklymak commented Aug 28, 2020

Thanks, that's helpful. I guess boxplot needs to try to use values at some point in the process...

@jklymak
Member

jklymak commented Aug 28, 2020

The problem is that we now iterate over X instead of just calling np.asanyarray on it right away. However, if you iterate over s.unstack():

for xi in s.unstack():
    print(xi)

You get the column names

('un', 'one')
('un', 'two')
('deux', 'one')
('deux', 'two')
('trois', 'one')
('trois', 'two')
('quatre', 'one')
('quatre', 'two')
('cinq', 'one')
('cinq', 'two')

I don't know what to do about this in light of the changes in #17289
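The iteration behaviour described in this comment can be checked in isolation. The sketch below uses a small frame with a single level of string column labels (an assumption made for brevity, not the frame from the report) and contrasts iterating over the DataFrame with iterating over its array form.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 3), columns=["a", "b", "c"])

# Iterating over a DataFrame walks its column labels, much like a dict.
labels = [item for item in df]

# Converting to an array first and iterating over the transpose
# yields the numeric data of each column instead.
columns = [col for col in np.asarray(df).T]
```

With string labels, any numeric reduction applied to the iterated items operates on strings rather than on the float data, which is consistent with the TypeError in the traceback.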

@jklymak
Member

jklymak commented Aug 28, 2020

Certainly just putting

    # try to get the values from X:
    try:
        X = X.values
    except AttributeError:
        pass

at the start of _reshape_2D fixes the problem. Is there some reason we wouldn't want to always do that?
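To make the idea concrete outside of matplotlib, here is a rough sketch of coercing the input before iterating. The function columns_as_arrays is hypothetical and is not matplotlib's _reshape_2D; it only illustrates the "take the underlying array first" step under the assumption that the input is 1D or 2D.

```python
import numpy as np
import pandas as pd


def columns_as_arrays(X):
    """Hypothetical helper: return the datasets in X as a list of 1D arrays."""
    # Use the underlying array if the object exposes one (e.g. a DataFrame).
    try:
        X = X.values
    except AttributeError:
        pass
    X = np.asanyarray(X)
    if X.ndim == 1:
        return [X]
    # Treat each column of a 2D input as one dataset.
    return [X[:, i] for i in range(X.shape[1])]


df = pd.DataFrame(np.random.randn(100, 3), columns=["a", "b", "c"])
datasets = columns_as_arrays(df)
single = columns_as_arrays([1.0, 2.0, 3.0])
```

One caveat with a blanket attribute lookup like this: a plain dict also has a values attribute (a method), so a real implementation would need to be more careful about which objects it unwraps.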

@phobson
Member

phobson commented Aug 28, 2020

Even if it's simple to get the values out, I'm not terribly interested in supporting multi-index rows or columns in pandas DataFrames.

I would argue that out of the box, pandas doesn't do an awesome job of handling the labels:

image

@phobson
Member

phobson commented Aug 28, 2020

Potential scope creep of an API that handles hierarchical index objects:

  • default formatting (string joining) of the labels
  • grouped sections
  • mapping levels as facets (e.g., colors, symbols) of the plots.

AFAICT, seaborn went down this road and decided to not mess with hierarchical indexes.

A more sensible approach in pandas:

df.groupby(level='second').boxplot()

image

Here's how I'd do this in seaborn:

ax = (
    df.rename_axis(columns='third')
      .stack()
      .to_frame('value')
      .reset_index()
      .pipe(
          (seaborn.boxplot, 'data'),
          x='third', y='value', hue='second'
      )
)

image

And then here's good ol' Hobson-style horror code:

import numpy
import pandas
from matplotlib import pyplot
from matplotlib import cbook

arrays = [
    ['bar', 'baz', 'foo', 'qux'],
    ['one', 'two']
]
cols = ['un', 'deux', 'trois', 'quatre', 'cinq']
index = pandas.MultiIndex.from_product(arrays, names=['first', 'second'])
data = numpy.random.randn(8,5)

df = pandas.DataFrame(data, index=index, columns=cols)
stats = (
   df.rename_axis(columns='third')
     .stack()
     .to_frame('value')
     .reset_index()
     .groupby(['second', 'third'], sort=False)['value']
     .apply(lambda g: pandas.Series(cbook.boxplot_stats(g)).loc[0])
     .unstack(level=-1)
     .assign(label=lambda df: df.index.map(lambda x: '\n&\n'.join(x))) 
     .reset_index(drop=True)
     .pipe(lambda df: [row.to_dict() for _, row in df.iterrows()])
)

fig, ax = pyplot.subplots()
bp_artists = ax.bxp(stats)

image

@jklymak
Member

jklymak commented Aug 28, 2020

I agree somewhat - playing whack-a-mole w/ complex Pandas objects doesn't appeal.

OTOH, if a data object gives us a values attribute, should we try to use it at this stage?

@phobson
Member

phobson commented Aug 28, 2020

Perhaps, but there's still a question of what to do with the labels. There's literally no limit to how many levels there might be.

If we punt and tell the user to provide labels, there's no promise that the labels and columns of the array will line up as the user intends.

I'm not saying this isn't possible, or even that we shouldn't, but I think this is a complex topic and there are solutions in the ecosystem that already exist.

@jklymak
Member

jklymak commented Aug 28, 2020

I don't think this had labels before either (well, they were 1, 2, 3, 4).

@kasra-keshavarz
Author

My main issue was not how the labels are handled. I was mostly interested in knowing why the produced DataFrame is not being plotted by matplotlib in the first place.

@mwaskom

mwaskom commented Aug 29, 2020

DataFrame.values is actually not the canonical way to get a NumPy array from pandas. With some more complex types (e.g. categorical data) it gives you a different kind of object. You should call np.asarray (or on newer pandas, the .to_numpy method).

Agreed that hierarchical column indexes are something to discourage. If you give it to seaborn in wide-form mode, it will draw the plot, but it doesn't use the hierarchical information to do any grouping, so you get basically the same thing as the pandas plot.
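The point about .values can be illustrated with a small categorical Series. This is a sketch with made-up data ("low"/"high" are placeholder labels), showing that .values hands back a pandas object while np.asarray and .to_numpy() return an ndarray.

```python
import numpy as np
import pandas as pd

# Placeholder data; the labels are illustrative only.
cat = pd.Series(["low", "high", "low"], dtype="category")

via_values = cat.values        # a pandas Categorical, not an ndarray
via_asarray = np.asarray(cat)  # an ndarray holding the category labels
via_to_numpy = cat.to_numpy()  # also an ndarray
```

For an all-float DataFrame like the one in the report, the three approaches give the same array, so the difference only shows up with extension types such as this one.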

@mwaskom

mwaskom commented Aug 29, 2020

Also the multiindex is a red herring here. Matplotlib also fails on a "simple" dataframe that has string-type column labels:

import matplotlib.pyplot as plt, numpy as np, pandas as pd
plt.boxplot(pd.DataFrame(np.random.randn(100, 3), columns=["a", "b", "c"]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-58-9901a5f00145> in <module>
----> 1 plt.boxplot(pd.DataFrame(np.random.randn(100, 3), columns=["a", "b", "c"]))

~/miniconda3/envs/seaborn-py38-latest/lib/python3.8/site-packages/matplotlib/pyplot.py in boxplot(x, notch, sym, vert, whis, positions, widths, patch_artist, bootstrap, usermedians, conf_intervals, meanline, showmeans, showcaps, showbox, showfliers, boxprops, labels, flierprops, medianprops, meanprops, capprops, whiskerprops, manage_ticks, autorange, zorder, data)
   2498         whiskerprops=None, manage_ticks=True, autorange=False,
   2499         zorder=None, *, data=None):
-> 2500     return gca().boxplot(
   2501         x, notch=notch, sym=sym, vert=vert, whis=whis,
   2502         positions=positions, widths=widths, patch_artist=patch_artist,

~/miniconda3/envs/seaborn-py38-latest/lib/python3.8/site-packages/matplotlib/__init__.py in inner(ax, data, *args, **kwargs)
   1429     def inner(ax, *args, data=None, **kwargs):
   1430         if data is None:
-> 1431             return func(ax, *map(sanitize_sequence, args), **kwargs)
   1432 
   1433         bound = new_sig.bind(ax, *args, **kwargs)

~/miniconda3/envs/seaborn-py38-latest/lib/python3.8/site-packages/matplotlib/axes/_axes.py in boxplot(self, x, notch, sym, vert, whis, positions, widths, patch_artist, bootstrap, usermedians, conf_intervals, meanline, showmeans, showcaps, showbox, showfliers, boxprops, labels, flierprops, medianprops, meanprops, capprops, whiskerprops, manage_ticks, autorange, zorder)
   3681             bootstrap = rcParams['boxplot.bootstrap']
   3682 
-> 3683         bxpstats = cbook.boxplot_stats(x, whis=whis, bootstrap=bootstrap,
   3684                                        labels=labels, autorange=autorange)
   3685         if notch is None:

~/miniconda3/envs/seaborn-py38-latest/lib/python3.8/site-packages/matplotlib/cbook/__init__.py in boxplot_stats(X, whis, bootstrap, labels, autorange)
   1175 
   1176         # arithmetic mean
-> 1177         stats['mean'] = np.mean(x)
   1178 
   1179         # medians and quartiles

<__array_function__ internals> in mean(*args, **kwargs)

~/miniconda3/envs/seaborn-py38-latest/lib/python3.8/site-packages/numpy/core/fromnumeric.py in mean(a, axis, dtype, out, keepdims)
   3370             return mean(axis=axis, dtype=dtype, out=out, **kwargs)
   3371 
-> 3372     return _methods._mean(a, axis=axis, dtype=dtype,
   3373                           out=out, **kwargs)
   3374 

~/miniconda3/envs/seaborn-py38-latest/lib/python3.8/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
    158             is_float16_result = True
    159 
--> 160     ret = umr_sum(arr, axis, dtype, out, keepdims)
    161     if isinstance(ret, mu.ndarray):
    162         ret = um.true_divide(

TypeError: cannot perform reduce with flexible type

whereas this naive code (where the column labels are numeric [0, 1, ..]) "works":

plt.boxplot(pd.DataFrame(np.random.randn(100, 3)))

image

But see if you can figure out the problem with it.

@phobson
Member

phobson commented Aug 29, 2020

@mwaskom is that a boxplot of the column labels?
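One way to check that guess without drawing anything is to look at what iteration hands over when the column labels are the default integers. The sketch below assumes, as described earlier in the thread, that the input is iterated rather than converted to an array first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 3))  # default integer column labels 0, 1, 2

# What iteration yields: the labels, not the 100 values in each column.
iterated = list(df)

# Each "dataset" is then a single number, so its mean is just the label.
label_means = [float(np.mean(item)) for item in iterated]
```

Under that assumption no error is raised because the labels happen to be numeric, but the statistics describe the labels 0, 1 and 2 rather than the random data.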

4 participants