Plotting a pandas DataFrame with string MultiIndex #18371
Comments
This bisects to ac018af #17289 (@QuLogic). However, I will point out that matplotlib doesn't magically work with every possible pandas object. We try to maintain some basic compatibility, but for a more complicated object like this, I am not surprised things broke. Like, what the heck is
and what is Matplotlib supposed to sensibly do with it? I'm not sure the fact that it worked before was anything other than a happy coincidence rather than something we planned to support. @phobson are you the box plot guru and can you comment? |
My understanding is that the DataFrame is still a bunch of vectors of data, and I would expect matplotlib to interpret it that way. The raised error appears to relate to the dtype of the array-like input data, which in this case is np.float64, so the error really doesn't make sense. However, matplotlib can plot the input data if it is converted to a numpy.array by simply getting the values of the DataFrame:
So, my understanding is that matplotlib is raising the wrong error when plotting a DataFrame whose index and columns are of 'str' dtype. |
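The workaround described in this comment can be sketched as follows. This is a minimal illustration, not the commenter's original snippet: the DataFrame is rebuilt from the shapes and names used elsewhere in this thread, the `Agg` backend is chosen only so the sketch runs without a display, and setting the tick labels afterwards is an added assumption about what the user might want.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Rebuild a DataFrame like the one discussed in this thread (assumed shape).
index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
)
cols = ["un", "deux", "trois", "quatre", "cinq"]
df = pd.DataFrame(np.random.randn(8, 5), index=index, columns=cols)

# Hand matplotlib a plain 2D float array: one box per column,
# with the index and column labels dropped.
values = df.to_numpy()

fig, ax = plt.subplots()
artists = ax.boxplot(values)

# The labels are lost in the conversion, so restore them by hand.
ax.set_xticklabels(list(df.columns))
```

Note that the row MultiIndex plays no role here; all eight rows of each column go into a single box.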
Thanks, that's helpful. I guess |
The problem is that we now iterate over
    for xi in s.unstack():
        print(xi)
You get the column names
I don't know what to do about this in light of the changes in #17289 |
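The behaviour described above is plain pandas semantics and can be checked without matplotlib. The sketch below uses a small stand-in DataFrame (the names `df`, `iterated`, and `column_data` are illustrative, not from the thread) to show that iterating a DataFrame yields its column labels, and one way to get the column data instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 3), columns=["a", "b", "c"])

# Iterating a DataFrame yields the column labels, not the column data.
iterated = list(df)

# To get one array per column, index by each label explicitly.
column_data = [df[name].to_numpy() for name in df]
```

Any code that loops over its input expecting data vectors will therefore see the strings "a", "b", "c" when given a DataFrame directly.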
Certainly just putting
at the start of |
Potential scope creep of an API that handles hierarchical index objects:
AFAICT, seaborn went down this road and decided to not mess with hierarchical indexes.
A more sensible approach in pandas:
    df.groupby(level='second').boxplot()
Here's how I'd do this in seaborn:
    ax = (
        df.rename_axis(columns='third')
        .stack()
        .to_frame('value')
        .reset_index()
        .pipe(
            (seaborn.boxplot, 'data'),
            x='third', y='value', hue='second'
        )
    )
And then here's good ol' Hobson-style horror code:
    import numpy
    import pandas
    from matplotlib import pyplot
    from matplotlib import cbook

    arrays = [
        ['bar', 'baz', 'foo', 'qux'],
        ['one', 'two']
    ]
    cols = ['un', 'deux', 'trois', 'quatre', 'cinq']
    index = pandas.MultiIndex.from_product(arrays, names=['first', 'second'])
    data = numpy.random.randn(8,5)
    df = pandas.DataFrame(data, index=index, columns=cols)
    stats = (
        df.rename_axis(columns='third')
        .stack()
        .to_frame('value')
        .reset_index()
        .groupby(['second', 'third'], sort=False)['value']
        .apply(lambda g: pandas.Series(cbook.boxplot_stats(g)).loc[0])
        .unstack(level=-1)
        .assign(label=lambda df: df.index.map(lambda x: '\n&\n'.join(x)))
        .reset_index(drop=True)
        .pipe(lambda df: [row.to_dict() for _, row in df.iterrows()])
    )
    fig, ax = pyplot.subplots()
    bp_artists = ax.bxp(stats) |
I agree somewhat - playing whack-a-mole w/ complex Pandas objects doesn't appeal. OTOH, if a data object gives us a |
Perhaps, but there's still a question of what to do with the labels. There's literally no limit to how many levels there might be. If we punt and tell the user to provide labels, there's no promise that the labels and columns of the array will line up as the user intends. I'm not saying this isn't possible, or even that we shouldn't, but I think this is a complex topic and there are already solutions in the ecosystem |
I don't think this had labels before either (well, they were 1, 2, 3, 4). |
My main issue was not how the labels are handled. I was mostly interested in knowing why the produced DataFrame is not being plotted by matplotlib in the first place. |
Agreed that hierarchical column indexes are something to discourage. If you give it to seaborn in wide-form mode, it will draw the plot, but it doesn't use the hierarchical information to do any grouping, so you get basically the same thing as the pandas plot. |
Also the multiindex is a red herring here. Matplotlib also fails on a "simple" dataframe that has string-type column labels:
    import matplotlib.pyplot as plt, numpy as np, pandas as pd
    plt.boxplot(pd.DataFrame(np.random.randn(100, 3), columns=["a", "b", "c"]))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-58-9901a5f00145> in <module>
----> 1 plt.boxplot(pd.DataFrame(np.random.randn(100, 3), columns=["a", "b", "c"]))
~/miniconda3/envs/seaborn-py38-latest/lib/python3.8/site-packages/matplotlib/pyplot.py in boxplot(x, notch, sym, vert, whis, positions, widths, patch_artist, bootstrap, usermedians, conf_intervals, meanline, showmeans, showcaps, showbox, showfliers, boxprops, labels, flierprops, medianprops, meanprops, capprops, whiskerprops, manage_ticks, autorange, zorder, data)
2498 whiskerprops=None, manage_ticks=True, autorange=False,
2499 zorder=None, *, data=None):
-> 2500 return gca().boxplot(
2501 x, notch=notch, sym=sym, vert=vert, whis=whis,
2502 positions=positions, widths=widths, patch_artist=patch_artist,
~/miniconda3/envs/seaborn-py38-latest/lib/python3.8/site-packages/matplotlib/__init__.py in inner(ax, data, *args, **kwargs)
1429 def inner(ax, *args, data=None, **kwargs):
1430 if data is None:
-> 1431 return func(ax, *map(sanitize_sequence, args), **kwargs)
1432
1433 bound = new_sig.bind(ax, *args, **kwargs)
~/miniconda3/envs/seaborn-py38-latest/lib/python3.8/site-packages/matplotlib/axes/_axes.py in boxplot(self, x, notch, sym, vert, whis, positions, widths, patch_artist, bootstrap, usermedians, conf_intervals, meanline, showmeans, showcaps, showbox, showfliers, boxprops, labels, flierprops, medianprops, meanprops, capprops, whiskerprops, manage_ticks, autorange, zorder)
3681 bootstrap = rcParams['boxplot.bootstrap']
3682
-> 3683 bxpstats = cbook.boxplot_stats(x, whis=whis, bootstrap=bootstrap,
3684 labels=labels, autorange=autorange)
3685 if notch is None:
~/miniconda3/envs/seaborn-py38-latest/lib/python3.8/site-packages/matplotlib/cbook/__init__.py in boxplot_stats(X, whis, bootstrap, labels, autorange)
1175
1176 # arithmetic mean
-> 1177 stats['mean'] = np.mean(x)
1178
1179 # medians and quartiles
<__array_function__ internals> in mean(*args, **kwargs)
~/miniconda3/envs/seaborn-py38-latest/lib/python3.8/site-packages/numpy/core/fromnumeric.py in mean(a, axis, dtype, out, keepdims)
3370 return mean(axis=axis, dtype=dtype, out=out, **kwargs)
3371
-> 3372 return _methods._mean(a, axis=axis, dtype=dtype,
3373 out=out, **kwargs)
3374
~/miniconda3/envs/seaborn-py38-latest/lib/python3.8/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
158 is_float16_result = True
159
--> 160 ret = umr_sum(arr, axis, dtype, out, keepdims)
161 if isinstance(ret, mu.ndarray):
162 ret = um.true_divide(
TypeError: cannot perform reduce with flexible type
whereas this naive code (where the column labels are numeric) runs without raising:
    plt.boxplot(pd.DataFrame(np.random.randn(100, 3)))
But see if you can figure out the problem with it. |
@mwaskom is that a boxplot of the column labels? |
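A short sketch of why the numeric-label case can go wrong silently, assuming (as suggested earlier in this thread) that the plotting code iterates over the DataFrame. With default integer labels the iterated values are numbers, so numeric reductions succeed on them even though they are not the data. The variable names are illustrative only, and this checks pandas and NumPy behaviour, not any particular matplotlib version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 3))  # default integer column labels

# Iteration yields the labels 0, 1, 2 rather than the three data columns.
labels = list(df)

# A reduction over the labels succeeds, which is why nothing raises,
# but it summarises the labels instead of the 300 random values.
label_median = float(np.median(labels))
```

A guard such as calling `df.to_numpy()` before plotting avoids this ambiguity regardless of the label dtype.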
Bug report
I am trying to plot a violinplot of my DataFrame, whose elements all have dtype np.float64. It also has a MultiIndex of mixed integer and string values. The problem is that when the DataFrame is given to matplotlib.pyplot.boxplot, the following error is raised, while in earlier versions of matplotlib this problem did not exist.
TL;DR: matplotlib fails to plot a DataFrame with a string index.
Python code
Actual outcome
Expected outcome
Just like the boxplot method of pandas.DataFrame.boxplot()
Matplotlib version
I am using Anaconda and updated all my libraries to the latest version via the 'conda update --update-all' command.