cbook._reshape_2D flattens ndarray with 2 dims (rectangular ndarray) #8092

Open
lcapalleja opened this issue Feb 16, 2017 · 19 comments
Labels

  • API: consistency
  • keep (items to be ignored by the "Stale" GitHub Action)
  • status: inactive (marked by the "Stale" GitHub Action)

Comments

@lcapalleja

Bug report

boxplot throws the error below when x is an ndarray with len(x.shape) == 2 (i.e., when x is rectangular).

ValueError: List of boxplot statistics and 'positions' values must have same the length

Code for reproduction

import numpy as np
import matplotlib.pyplot as plt

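# dtype=object is needed on recent NumPy, which no longer builds ragged arrays implicitly
a = np.array([np.array([1, 2, 3, 4]), np.array([3, 2, 7, 4]), np.array([3, 9, 3, 1, 6])], dtype=object)
b = np.array([np.array([1, 2, 3, 4]), np.array([3, 2, 7, 4]), np.array([3, 9, 3, 1])])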

# Ragged ndarray works correctly
plt.boxplot(x=a, positions=range(len(a)))

# Rectangular ndarray throws error (above)
plt.boxplot(x=b, positions=range(len(b)))

Matplotlib version

  • matplotlib 1.5.3 np111py35_0
  • python 3.5.2
  • Windows Server 2012 R2 Standard
  • conda installation

Possible cause

I believe the issue is in cbook._reshape_2D in the line below. I am not sure why that logic is in place but assume it is for good reason.

X = [X[:, i] for i in xrange(ncols)]

Got there by looking in:

  1. _axes.boxplot
  2. _axes.bxp
  3. cbook.boxplot_stats

reference _reshape_2D:

def _reshape_2D(X):
    """
    Converts a non-empty list or an ndarray of two or fewer dimensions
    into a list of iterable objects so that in

        for v in _reshape_2D(X):

    v is iterable and can be used to instantiate a 1D array.
    """
    if hasattr(X, 'shape'):
        # one item
        if len(X.shape) == 1:
            if hasattr(X[0], 'shape'):
                X = list(X)
            else:
                X = [X, ]

        # several items
        elif len(X.shape) == 2:
            nrows, ncols = X.shape
            if nrows == 1:
                X = [X]
            elif ncols == 1:
                X = [X.ravel()]
            else:
                X = [X[:, i] for i in xrange(ncols)]
        else:
            raise ValueError("input `X` must have 2 or fewer dimensions")

    if not hasattr(X[0], '__len__'):
        X = [X]
    else:
        X = [np.ravel(x) for x in X]

    return X
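To make the mismatch concrete, here is a minimal sketch that ports the function quoted above to Python 3 (only `xrange` is replaced by `range`) and counts the datasets it returns for the two inputs. The name `reshape_2d_py3` and the variables `ragged` and `rect` are introduced here for illustration only; this is not the current Matplotlib implementation.

```python
import numpy as np


def reshape_2d_py3(X):
    """Python 3 port of the _reshape_2D quoted in the report (illustrative only)."""
    if hasattr(X, 'shape'):
        if len(X.shape) == 1:
            # 1D array: either an object array of arrays, or a single dataset
            if hasattr(X[0], 'shape'):
                X = list(X)
            else:
                X = [X, ]
        elif len(X.shape) == 2:
            nrows, ncols = X.shape
            if nrows == 1:
                X = [X]
            elif ncols == 1:
                X = [X.ravel()]
            else:
                # rectangular input: one dataset per COLUMN
                X = [X[:, i] for i in range(ncols)]
        else:
            raise ValueError("input `X` must have 2 or fewer dimensions")

    if not hasattr(X[0], '__len__'):
        X = [X]
    else:
        X = [np.ravel(x) for x in X]
    return X


# Ragged input, built explicitly as a 1D object array of three arrays
rows = [[1, 2, 3, 4], [3, 2, 7, 4], [3, 9, 3, 1, 6]]
ragged = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    ragged[i] = np.array(row)

# Rectangular input: a 2D array with shape (3, 4)
rect = np.array([[1, 2, 3, 4], [3, 2, 7, 4], [3, 9, 3, 1]])

n_ragged = len(reshape_2d_py3(ragged))  # one dataset per row
n_rect = len(reshape_2d_py3(rect))      # one dataset per column
print(n_ragged, n_rect)
```

With three positions passed in the reproduction, the rectangular case yields four datasets, which is consistent with the length mismatch reported in the error.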

Current Workaround
Convert the ndarray to a list of lists. Needing this workaround is surprising, since plotting works with a ragged (non-rectangular) ndarray but fails with a rectangular one.
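A minimal sketch of that workaround, assuming a recent Matplotlib and the headless "Agg" backend (the backend choice and the variable names are assumptions made for this example, not part of the original report):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Rectangular 3 x 4 input from the report
b = np.array([[1, 2, 3, 4], [3, 2, 7, 4], [3, 9, 3, 1]])

fig, ax = plt.subplots()
# list(b) is a plain list of three 1D row arrays, so each row is one dataset
result = ax.boxplot(list(b), positions=list(range(len(b))))
n_boxes = len(result["boxes"])
print(n_boxes)
```

Passing a list forces the "sequence of datasets" interpretation, so the number of boxes matches the number of positions.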

@anntzer
Contributor

anntzer commented Feb 19, 2017

This is due to the fundamentally broken (IMO) semantics of boxplot, as given in the docstring:

Make a box and whisker plot for each column of x or each vector in sequence x.

The ragged input is a 1D object array where each entry is an array, so each row becomes a box plot. The rectangular input is a 2D float array, so it is each column that becomes a box plot.
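The difference between the two inputs can be checked with NumPy alone. This is a small sketch; `ragged` and `rect` are names chosen for the example, and `dtype=object` is spelled out because recent NumPy versions refuse to build ragged arrays implicitly.

```python
import numpy as np

# Ragged: three arrays of lengths 4, 4 and 5 -> 1D array of objects
ragged = np.array(
    [np.array([1, 2, 3, 4]), np.array([3, 2, 7, 4]), np.array([3, 9, 3, 1, 6])],
    dtype=object,
)

# Rectangular: three arrays of length 4 -> NumPy stacks them into a 2D numeric array
rect = np.array(
    [np.array([1, 2, 3, 4]), np.array([3, 2, 7, 4]), np.array([3, 9, 3, 1])]
)

print(ragged.ndim, ragged.shape, ragged.dtype)
print(rect.ndim, rect.shape, rect.dtype)
```

Both calls look the same to the user, but only the second produces a genuinely two-dimensional array, which is why the two inputs take different code paths.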

The options I can think of are:

  • Silently plot rows of object arrays but columns of scalar arrays (the current behavior): confusing, as this issue shows.
  • Always plot rows: breaks backcompatibility (and I can't see a way to make the transition smooth), and inconsistent with plot.
  • Raise an error when an object array representing a ragged array is passed (current case for plot, too), suggesting that the user wraps it in a call to list (still awkward semantics but backcompatible).

@tacaswell tacaswell added this to the 2.1 (next point release) milestone Feb 20, 2017
@tacaswell
Member

attn @phobson

@phobson
Member

phobson commented Feb 21, 2017

I'm not sure that I'd go so far as to say that the semantics are fundamentally broken, but I can see what @anntzer is saying.

With the third option above, I'm not sure what the point would be in raising an error and telling the user to convert it to a list, since that's what we're already doing. Seems like a warning would suffice.

While we could add yet another boxplot option (e.g., axis) to toggle plotting columns vs rows, and slowly/annoyingly warn users that we'll be switching from columns to rows, I think that's more fundamentally confusing. As someone who often works with tabular data, I struggle to think of a situation where I'd want rows plotted.

My vote is that we keep the current behavior, improve the documentation around this, and maybe raise a warning.

@anntzer
Contributor

anntzer commented Feb 21, 2017

I probably don't use boxplot often enough to really argue about this, but I just had an unrelated case where I wanted to make a boxplot of data of the form

d = OrderedDict([(key, array), (key, array), ...])

in which case boxplot(list(d.values()), labels=list(d.keys())) just works fine, even if the arrays have different sizes. Can you even plot unequal-sized data using a 2D array as input? (other than padding with nans, I guess?)

The third option could be implemented as a warning too, I don't have a strong opinion there.

It makes sense for MATLAB's boxplot to plot columns because it uses FORTRAN order indexing. For matplotlib, not so much...

@NelleV
Member

NelleV commented Mar 15, 2017

FYI, I consider the semantics to be fundamentally broken.

@anntzer
Contributor

anntzer commented Mar 27, 2017

After discussion with @efiring, we see two possibilities to help fix the semantics.

  1. deprecate passing 2D ndarrays to boxplot and hist, and only support lists of arrays.
  2. add an "axis" keyword to boxplot and hist (axis=0: C order, axis=1: Fortran order) (but the ambiguity would remain).

@phobson
Member

phobson commented Mar 27, 2017

@anntzer I can add the first element to MEP28

To be clear: under the proposed semantics, 2D arrays get flattened into a single boxplot?

@anntzer
Contributor

anntzer commented Mar 27, 2017

If we were designing the system from scratch (without regard for backcompat), I think the expected semantics would be that 2D arrays would have each individual row plotted as a boxplot (similarly to lists of lists).

@phobson
Member

phobson commented Mar 27, 2017

I understand that (though my opinion differs). My question is: once 2D-arrays are fully deprecated in boxplots, should they raise an error or get reshaped?

@anntzer
Contributor

anntzer commented Mar 27, 2017

I would raise an error. See e.g. #7785 for issues with implicit reshaping.

@phobson
Member

phobson commented Mar 27, 2017

An edge-ier case: list of 2D arrays also raises?

@anntzer
Contributor

anntzer commented Mar 27, 2017

Yes. What's the alternative?

@phobson
Member

phobson commented Mar 27, 2017

I can imagine a world where flattening each array individually makes sense, but I don't think we should cover that.

@tacaswell tacaswell modified the milestones: 2.1 (next point release), 2.1.1 (next bug fix release), 2.2 (next next feature release) Sep 24, 2017
@github-actions

github-actions bot commented Apr 7, 2023

This issue has been marked "inactive" because it has been 365 days since the last comment. If this issue is still present in recent Matplotlib releases, or the feature request is still wanted, please leave a comment and this label will be removed. If there are no updates in another 30 days, this issue will be automatically closed, but you are free to re-open or create a new issue if needed. We value issue reports, and this procedure is meant to help us resurface and prioritize issues that have not been addressed yet, not make them disappear. Thanks for your help!

@github-actions github-actions bot added the status: inactive Marked by the “Stale” Github Action label Apr 7, 2023
@anntzer
Contributor

anntzer commented Apr 7, 2023

@timhoffm Do you want to comment as API lead on this (and the similar #12178)?

@github-actions github-actions bot removed the status: inactive Marked by the “Stale” Github Action label Apr 8, 2023
@github-actions github-actions bot added the status: inactive Marked by the “Stale” Github Action label Apr 15, 2024
@anntzer
Contributor

anntzer commented Apr 15, 2024

Repinging @timhoffm on the above, though feel free to just say if you don't really care either way :)

@timhoffm
Member

timhoffm commented Apr 15, 2024

IMHO the documented behavior is the reasonable one.

The input data. If a 2D array, a boxplot is drawn for each column in x. If a sequence of 1D arrays, a boxplot is drawn for each array in x.

In general, if you have a sequence of datasets [ds1, ..., dsN] you should get a box for each dataset in the sequence. There's one exception, though, for 2D arrays. Commonly, this is "tabulated data" where each dataset is a column (in ML/data science terminology, columns are features and rows are instances). This is also consistent with pandas and with what we use in plt.plot(x, Y), where Y is a 2D array.

So the distinction criteria here should be:

  • Is X a 2D array? -> datasets are columns
  • otherwise (X is 1D-like) datasets are elements of the outer sequence - no matter if it's a list of lists, list of ndarrays, ndarray of ndarrays.
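The two criteria above can be sketched as a small helper. `split_datasets` is a hypothetical name introduced for this example and is not a Matplotlib function; it assumes every element of a 1D-like container is itself a 1D array-like.

```python
import numpy as np


def split_datasets(x):
    """Hypothetical helper: return the list of 1D datasets implied by the two rules."""
    if isinstance(x, np.ndarray) and x.ndim == 2:
        # Rule 1: a true 2D array is tabular data, so datasets are columns
        return [x[:, j] for j in range(x.shape[1])]
    # Rule 2: anything 1D-like is a sequence of datasets (list of lists,
    # list of ndarrays, or a 1D object array of ndarrays)
    return [np.asarray(item).ravel() for item in x]


table = np.arange(12).reshape(3, 4)   # 2D array with 3 rows and 4 columns
nested = [[1, 2, 3], [4, 5]]          # ragged list of lists

# 1D object array whose three elements happen to share the same length
obj = np.empty(3, dtype=object)
for i in range(3):
    obj[i] = np.arange(4)

n_table = len(split_datasets(table))    # columns
n_nested = len(split_datasets(nested))  # outer elements
n_obj = len(split_datasets(obj))        # outer elements, not columns
print(n_table, n_nested, n_obj)
```

Under this rule only the dimensionality of the container decides the interpretation, so a 1D object array behaves like a list regardless of whether its elements have equal lengths.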

Edit: The correct way forward is to adapt the implementation. This will only affect the array-of-arraylike case where the container is a 1D-object array. There are two possible ways to do this:

  • Claim it's a bug and do a hard change. This can result in an unnoticed behavior change in very rare cases (prerequisite: all inner array-likes must have the same length, and either both dimensions have the same size or no other per-dataset parameters are passed, in particular not tick_labels (formerly labels)).
  • If that's too risky, we can catch the above case and (temporarily?) deprecate 1D-object arrays, suggesting x.tolist() as a replacement, which forces an opt-in into the sequence of datasets semantics.
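As a rough illustration of the second option, the check below flags 1D object arrays whose elements all share one length and points the user to `x.tolist()`. The function name `warn_on_object_array`, the warning text, and the use of `DeprecationWarning` are assumptions made for this sketch; the thread does not specify how such a check would be implemented.

```python
import warnings

import numpy as np


def warn_on_object_array(x):
    """Hypothetical check: True (with a warning) for equal-length 1D object arrays."""
    if isinstance(x, np.ndarray) and x.ndim == 1 and x.dtype == object:
        # Collect the distinct lengths of the inner array-likes
        lengths = {len(np.atleast_1d(item)) for item in x}
        if len(lengths) == 1:
            warnings.warn(
                "Passing a 1D object array is ambiguous; "
                "use x.tolist() to treat each element as a dataset.",
                DeprecationWarning,
                stacklevel=2,
            )
            return True
    return False


# 1D object array holding three arrays of equal length
obj = np.empty(3, dtype=object)
for i in range(3):
    obj[i] = np.arange(4)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = warn_on_object_array(obj)               # ambiguous container
    not_flagged = warn_on_object_array(obj.tolist())  # plain list: explicit opt-in
n_warnings = len(caught)
print(flagged, not_flagged, n_warnings)
```

Ragged object arrays are left alone in this sketch because their elements cannot be mistaken for the rows of a rectangular table.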

We may have similar issues in other functions that accept multiple datasets. Off the top of my head, we should check stackplot(), hist(), boxplot(), violinplot(), eventplot() and aim for consistency.

@github-actions github-actions bot added the status: inactive Marked by the “Stale” Github Action label Apr 16, 2025
@timhoffm timhoffm added the keep Items to be ignored by the “Stale” Github Action label Apr 16, 2025
6 participants