Make pcolor(mesh) preserve all data #9629
Conversation
I agree this behavior is... less bad 😃
ping @efiring because I remember he had an opinion about this (though I don't remember what it was exactly)....
Fails some image tests and a couple of x/y limits tests
Thanks for proposing this change. Until now, I have been using a helper function (similar to what you have proposed) to fix X and Y values before passing them to pcolormesh. I have added a suggestion to the _interp_grid function to correctly handle logarithmically spaced data.
if dropdata:
    C = C[:Ny - 1, :Nx - 1]
else:
    def _interp_grid(X):
def _interp_grid(X):
    # helper for below
    dX = np.diff(X, axis=1) / 2.
    if np.allclose(dX[:, :], dX[0, [0]]):
        X = np.hstack((X[:, [0]] - dX[:, [0]],
                       X[:, :-1] + dX,
                       X[:, [-1]] + dX[:, [-1]]))
    else:
        X_ratio = np.sqrt(X[:, 1:] / X[:, :-1])
        if np.allclose(X_ratio[:, :], X_ratio[0, [0]]):
            X = np.hstack((X[:, [0]] / X_ratio[:, [0]],
                           X[:, :-1] * X_ratio,
                           X[:, [-1]] * X_ratio[:, [-1]]))
        else:
            # Warn the user that the data is neither linearly
            # nor logarithmically spaced.
            pass
    return X
The changes I am suggesting take care of the case when X and/or Y are logarithmically spaced (e.g. visualizing a time series of fractional-octave-band sound pressure level data; see the following example).
import numpy as np
import matplotlib.pyplot as plt

time = np.array([0, 1, 2, 3])
freq = np.array([500, 1000, 2000, 4000])
time_mesh, freq_mesh = np.meshgrid(time, freq)
C = np.arange(16).reshape(4, 4)

fig, ax = plt.subplots()
im = ax.pcolormesh(time_mesh, freq_mesh, C, dropdata=False)
cbar = fig.colorbar(im)
cbar.set_label(r'Octave band SPL [dB$_{ref: 20 \mu Pa}$]')
ax.set_xticks(time)
ax.set_yscale('log')
ax.set_yticks(freq)
ax.set_yticklabels(freq)
ax.minorticks_off()
ax.set_xlabel('Time [s]')
ax.set_ylabel('Octave band center frequency [Hz]')
plt.show()
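For reference, the logic of the suggested helper can be exercised on its own. This 1-D sketch (the name `centers_to_edges` is hypothetical, not part of matplotlib) reproduces the two branches: arithmetic midpoints for linear spacing, geometric means for logarithmic spacing.

```python
import numpy as np

def centers_to_edges(x):
    """Infer N+1 cell edges from N center positions.

    Standalone 1-D sketch of the suggested _interp_grid logic: use
    arithmetic midpoints for linearly spaced centers and geometric means
    for logarithmically spaced ones, extrapolating half a step at each end.
    """
    x = np.asarray(x, dtype=float)
    dx = np.diff(x) / 2.0
    if np.allclose(dx, dx[0]):                     # linear spacing
        return np.concatenate(([x[0] - dx[0]], x[:-1] + dx,
                               [x[-1] + dx[-1]]))
    ratio = np.sqrt(x[1:] / x[:-1])
    if np.allclose(ratio, ratio[0]):               # logarithmic spacing
        return np.concatenate(([x[0] / ratio[0]], x[:-1] * ratio,
                               [x[-1] * ratio[-1]]))
    raise ValueError("centers are neither linearly nor logarithmically spaced")

print(centers_to_edges([0, 1, 2, 3]))
# edges at -0.5, 0.5, 1.5, 2.5, 3.5
print(centers_to_edges([500, 1000, 2000, 4000]))
# geometric-mean edges: 500/sqrt(2), 500*sqrt(2), ..., 4000*sqrt(2)
```

With the frequency centers from the example above, each edge sits at the geometric mean of its neighboring centers, which is exactly where the cell boundary should fall on a log-scaled axis.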
I like this logarithmic addition! The only thing I'd add is that you can still use the method and calculate dX's even if the points aren't evenly spaced, so the first if statement shouldn't be a requirement. It could just be the default after checking for the ratio.
I don't think this will apply generally enough to be a good approach. If someone needs precision in the cell edges, they should calculate them themselves as they need, not count on matplotlib to do it. This is just a rough fix which is not quite as rough as just dropping the data.
Yeah, there are a whole bunch of cases where this will fail/error out (e.g. x = [10, 9, 8]) that we could check for, but I'm not convinced it's worth the extra code.
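For illustration, here is how the suggested heuristic silently falls through for centers that are neither linearly nor logarithmically spaced: a sign change makes the geometric-mean branch produce NaNs, so neither branch fires. The array values here are made up for the demo.

```python
import numpy as np

x = np.array([[-1.0, 2.0, 8.0]])           # uneven spacing with a sign change
dx = np.diff(x, axis=1) / 2.0              # [[1.5, 3.0]] -> spacing not uniform
print(np.allclose(dx, dx[0, [0]]))         # False: linear branch rejected

with np.errstate(invalid="ignore"):        # sqrt of a negative ratio -> nan
    ratio = np.sqrt(x[:, 1:] / x[:, :-1])
print(ratio)                               # roughly [[nan, 2.0]]
print(np.allclose(ratio, ratio[0, [0]]))   # False: log branch rejected too
```

Since `np.allclose` treats NaN as unequal by default, such input falls into the final `else` without any edges being computed, which is why a warning (or error) there matters.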
I guess a possibility would be to do the grid interpolation in screen space? Something like... (assuming dropdata=False and the shapes are such that grid interp is necessary)
- convert the x values using the current scale
- if the scaled values are uniformly spaced, do the interp, convert back to data space, use the interp'd grid.
- if not (and an interpolation is necessary), emit a warning (so that users will know to call set_xscale("log") before passing their log-scaled x values)
(very rough sketch)
That seems way too fancy and hard to explain to the user. And I'm not even convinced it makes sense for curvilinear grids etc. Let's keep this simple and easy to explain. If the user wants something fancier they can calculate their own edges...
> I guess a possibility would be to do the grid interpolation in screen space? Something like... (assuming dropdata=False and the shapes are such that grid interp is necessary)
> - convert the x values using the current scale
> - if the scaled values are uniformly spaced, do the interp, convert back to data space, use the interp'd grid.
> - if not (and an interpolation is necessary), emit a warning (so that users will know to call set_xscale("log") before passing their log-scaled x values)

This process could be converted into a helper function which could be shown off in a gallery example/matplotlib blog. If the OP is fine with it, I am interested in working on a PR demonstrating the process outlined here. Does this sound like a good idea?
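For what it's worth, the screen-space idea can be sketched with NumPy alone. Here `np.log10` and its inverse stand in for the axis's forward/inverse scale transform, and `edges_in_screen_space` is a hypothetical helper name, not an existing matplotlib function.

```python
import numpy as np

def edges_in_screen_space(x, forward=np.log10, inverse=lambda v: 10.0 ** v):
    """Sketch of the screen-space proposal: map centers through the scale's
    forward transform, midpoint-interpolate there if the spacing is uniform,
    then map the resulting edges back to data space.

    forward/inverse stand in for the axis scale transform; log10 / 10**v
    mimic a log-scaled axis.
    """
    s = forward(np.asarray(x, dtype=float))
    ds = np.diff(s) / 2.0
    if not np.allclose(ds, ds[0]):
        raise ValueError("centers are not uniform under the current scale; "
                         "supply the edges explicitly")
    s_edges = np.concatenate(([s[0] - ds[0]], s[:-1] + ds, [s[-1] + ds[-1]]))
    return inverse(s_edges)

edges = edges_in_screen_space([500, 1000, 2000, 4000])
# midpoints in log space map back to geometric-mean edges in data space
```

Uniform spacing in the transformed space is exactly what "log-spaced centers" means, so this reproduces the geometric-mean edges without special-casing log data.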
@@ -5665,7 +5691,8 @@ def pcolormesh(self, *args, alpha=None, norm=None, cmap=None, vmin=None,
    allmatch = (shading == 'gouraud')
dropdata = kwargs.pop('dropdata', True)

Also, I think the method definition for _pcolorargs needs to be updated to include dropdata=True.
What about a different kwarg than dropdata? Maybe something like xyvertices=False, or xyedges. Something to insinuate that the current xy are grid points or vertices rather than edges.
Yeah, I see what you mean, but not sure what to do about this. We don't want to imply that we will do anything if len(x) == N+1. I'll propose we keep dropdata for now.
The problem I see with dropdata as the kwarg is that you're actually not just including all of the data, you're also changing the centers/edges of the cells by calculating a midpoint everywhere. In the examples above, the edges were at integers (0, 1, 2) and with this fix they are now at -0.5, 0.5, 1.5, 2.5. So the grid has shifted as well. To me, dropdata=False reads like you would have just extended the previous edges to include a new extrapolation point to capture all of the data (0, 1, 2, 3).

xy_centers, xy_midpoints?

If len(x) == N+1 and this kwarg is present, you could raise/warn a user that the combo doesn't make sense.
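The two readings of dropdata=False can be made concrete with a small sketch (numbers taken from the discussion above):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])     # coordinates supplied by the user

# Midpoint interpretation (this PR): x are cell centers,
# so the grid shifts onto half-integers.
midpoint_edges = np.concatenate(([x[0] - 0.5], (x[:-1] + x[1:]) / 2,
                                 [x[-1] + 0.5]))

# "Append one edge" interpretation: x remain edges and one
# extrapolated edge is added at the end.
appended_edges = np.concatenate((x, [x[-1] + 1.0]))

print(midpoint_edges)   # -0.5, 0.5, 1.5, 2.5, 3.5  (grid shifts)
print(appended_edges)   # 0, 1, 2, 3, 4              (grid extends)
```

Both produce N+1 edges for N values, but they place the cells half a step apart, which is why the kwarg name needs to say which interpretation is meant.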
> you're also changing the centers/edges of the cells by calculating a midpoint everywhere.

I'm not calculating the midpoint, I'm calculating the edges, assuming that the given co-ordinates are the midpoints, which I think is what 99.99% of users would expect. The fact is that I think this should be the default behaviour, but that's a whole back-compatibility concern.
I agree with what you've said (hence my xy_midpoints suggestion); I wasn't very clear in my description above. I also agree that this is what most users would expect. For backwards compatibility, I think it would be fine if this is just an extra option and the default is the previous behaviour.
I agree that the name should be changed. I disagree that this is what users want 99% of the time. Most of the time, imshow() is exactly what those users really need, and it is faster to boot. I have been in enough arguments with people about whether coordinates should represent vertices or midpoints to know that it is very dependent upon the particular field you are in and the standard data formats that field usually uses.
I think 0.001% of users expect data to be dropped. As for imshow, it is useless for unevenly spaced X or Y.
(force-pushed from 766c322 to 2ca4e28)
I don't understand the doc build failure. It says there are two "Notes" sections in
^ that error message arose because of an indentation error. 🙄
Some minor spelling corrections.
On line 5617 (I couldn't figure out how to comment on a line you didn't change)
if allmatch:
X, Y = np.meshgrid(np.arange(ncols), np.arange(nrows))
else:
X, Y = np.meshgrid(np.arange(ncols + 1), np.arange(nrows + 1))
do you want to worry about whether someone uses dropdata=False without any X/Y input? I.e., what if someone wanted to artificially shift their plotted mesh by using the dropdata kwarg to make the centers land on integer values, rather than the edges being on the integer values?
Right now you won't be able to do that and I'm not sure you even want to allow it, so just some food for thought. Otherwise, I think everything looks good right now, and the log-spacing and other enhancements could be additional PRs.
@greglucas Thanks for the comments. First, I'm still open to changing the name from dropdata.
Data doesn't get dropped when there are no x/y vectors given, so I'm not inclined to change that behaviour. Again, I'm not looking at this as a way to shift the cells for the user; I'm looking at this as a simple way to give the user what they meant by giving us N values for x and y.
I went and looked at MATLAB's documentation for pcolor and saw that they state X/Y/Z are supposed to be the same size (and then they drop the last row/column), so I'm not even sure if they allow you to make Z a different shape at all (I have not tested this myself). My thoughts in favor of preserving all data are that I'm usually plotting some function evaluated at grid cells. Here is an example demonstrating a few of the different use cases.
@greglucas Well observed that dropping data and the edge/midpoint semantics are two separate issues. BTW nice illustrative plots! It is an important point whether

Current state

For N patches, we need N+1 edges along a direction.

Possible changes

Alternatives for the last case: If we go with cases b) or c), we need some alternative method/parameter that actually supports X, Y, C of the same size, because center positions are a whole new feature for
We have two ways. The first case is unambiguous. The second case is formally ambiguous, but what do we think users actually mean when they pass this to pcolormesh? As for @greglucas's other point about
Naively, I would expect
(see lib/matplotlib/image.py, lines 955 to 958 at 745dcae)

Of course, in pcolormesh, the opposite arbitrary decision was chosen.
(see lib/matplotlib/axes/_axes.py, lines 5608 to 5611 at 745dcae)
It would be nice to be consistent, in my opinion... If there are going to be breaking changes, I think this would be a welcome one to tidy up as well.
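The inconsistency between the two defaults is easy to demonstrate. A minimal sketch (behavior as of the matplotlib versions current at the time of this discussion, rendered off-screen with the Agg backend):

```python
import matplotlib
matplotlib.use("Agg")                 # draw off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

C = np.arange(12).reshape(3, 4)       # 3 rows x 4 columns of data
fig, (ax1, ax2) = plt.subplots(1, 2)

# imshow with no extent: cell *centers* land on the integers,
# so the image spans x = -0.5 .. 3.5 (y flipped, default origin='upper').
im = ax1.imshow(C)

# pcolormesh with no X/Y: cell *edges* land on the integers,
# so the same data spans x = 0 .. 4.
ax2.pcolormesh(C)

print(im.get_extent())                # x from -0.5 to 3.5
print(ax2.dataLim.bounds)             # x from 0 to 4
```

So the two functions place the same same-sized C half a cell apart, which is the arbitrary-decision mismatch being pointed out.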
Ok but again that’s really orthogonal to this PR.
Let's disregard backward compatibility for the moment. What would be a clean design?

Option 1) only x ~ N+1, y ~ M+1 is accepted --> x, y define edges
Option 2) only x ~ N, y ~ M is accepted --> x, y define centers

One may be more intuitive than the other, but both are technically clean APIs, with different semantics. However, I don't think one should mix these implicitly. Deciding if values define edges or centers based on their shape is too much magic. If we want both behaviors in a single function, the clean way out is to let the user specify the meaning via an extra kwarg.
Part of the problem here is that the only perfectly general and unambiguous way to do a pcolor-type plot is to specify the edges. With an irregular grid and/or nonlinear scale, there is more than one way to generate a reasonable set of edges. In a clean API:

I think that cramming too much functionality into a single function is counterproductive. It can be done, and maybe it is what we will end up with, but I think using two functions would lead to simpler code (both at the user level and at our level) and less confusion.
I kind of feel like we are letting perfect get in the way of fixing something that is terrible. Right now we deal with x ~ N, y ~ M in the worst way possible, by dropping data, whereas I feel pretty strongly that most people who do this expect the data to be centred on x and y. I'm not convinced that making people call a helper function, or having yet another pcolormesh variant, is the right way forward.

I guess we could deprecate x ~ N. I guarantee we will get lots of feedback if we do that!
I'm very sympathetic with this PR, and with the "good" versus "perfect" argument; the data-dropping behavior is an unfortunate inheritance from Matlab, and certainly there are many cases where data are specified on centers. When the user actually wants the data-dropping behavior, the user could do it explicitly with simple indexing. I'm a little uncomfortable with the dropdata boolean kwarg and rcParam. How about setting out a strategy for getting rid of it:

Maybe that's too much fuss, but it would be nice to have a strategy for ending up with a cleaner API instead of just settling for another kwarg forever.
Use of a different function for center-specification was implicit in my comment above, so it doesn't address the desire to have one do-everything function that figures out what to do based partly on the dimensions of the inputs. The PR is certainly trying to be friendly, handling edge versus center independently for each dimension. That's not terribly magical; it's pretty easy to understand and explain. (The docstring modifications in this PR don't quite go far enough in describing the new situation, but there is no point in doing more work on them until we are sure of the API.)

We already have a "shading" kwarg, which actually affects the meaning of the X, Y locations: with gouraud shading, (X, Y) specify the corner colors, and the plot does not extend beyond the X, Y points. This is similar to X, Y specifying centers (as in imshow for uniform grids), the differences being (1) that the colors are blocks rather than gradients, and (2) the domain is expanded slightly so that for uniform grids the blocks are all the same size. This second difference is not essential; it would be possible to handle center-specification, with X, Y, C all of the same dimensions, using half-cells on the edges so that the domain would be the same as for gouraud shading. The advantage is that this interpolates but does not extrapolate. The disadvantage is that most often the user probably will want more image-like behavior.

Would it make sense to supersede "shading" with (perhaps) "mode", which would take values of

Or maybe just use the first 3 of these. And maybe leave the name as "shading" so as to minimize the changes. The advantage of keeping "edge" and "center" is code readability: it would make clear to the reader what sort of data input is required. If they are used, though, the kwarg name "shading" becomes problematic.
TL;DR I am 👎 on putting in logic to try and guess the cell edges; I think it opens up a can of worms we don't want to get into (as exemplified by the suggestion from @pharshalp for detecting log-spaced values). I am more amenable to changing the default behavior with no X or Y to match imshow (center the cells on the integers rather than anchor a corner on the integers), but that is an orthogonal discussion (and in that case the user probably should be using imshow). I am 👍 on including the centers -> edges code as a public helper in cbook, probably in both linear and log sampled flavors.

I swear this started as a short comment and then I went and wrote an essay...
I disagree that the current behavior is the worst possible solution; I think incorrectly inferring the edges would be worse behavior. "Dropping" the last row/column is consistent and correct in the sense that the patch defined by

If you take that as the invariant of

Thinking about how the shading works, this actually becomes even more defensible. In

I can think of stories where the "natural" representation in the user's code is either with matching or off-by-one sized arrays. As @greglucas pointed out, this kwarg is completely changing the meaning of the inputs. That needs to be called out much more clearly.

@greglucas: those are amazing examples; no matter how this PR ends up, I am in favor of those ending up in the documentation (as either an example or a tutorial) someplace.
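To make the cost of the current rule concrete, a small sketch (illustrative sizes) of how much of a same-sized C is silently discarded today:

```python
import numpy as np

# With same-sized X, Y, C (Ny x Nx vertex arrays), there are only
# (Ny-1) x (Nx-1) quads, so under the current rule the last row and
# column of C are never drawn.
Ny, Nx = 4, 4
C = np.arange(Ny * Nx).reshape(Ny, Nx)
C_drawn = C[:Ny - 1, :Nx - 1]      # what pcolor keeps today
print(C_drawn.shape)               # (3, 3)
print(C.size - C_drawn.size)       # 7 of the 16 values are dropped
```

The rule is consistent (quad [i, j] gets color C[i, j]), but nearly half the values of a small square array never reach the screen.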
This is one of the core reasons I want to push the API to be many top-level functions rather than as

As @efiring points out, once you have a non-uniform grid then you need to specify the edges of the cells, not just one point in them. If you have a uniform square grid of data (for example, from a pixelated detector), then to uniquely place the cells in data space you only need to know the bounding box of the whole image, the size of the input, and then the index of the cell in question (eliding the issues of up/down sampling here; let's stick with "nearest" lots of up-sampling for now). You could also specify an offset within the cell of where "the value is", but that is degenerate with translating the bounding box. Internally, we always think of the value as being on the corner (this ends up being more complicated than one expects, see https://matplotlib.org/3.1.1/tutorials/intermediate/imshow_extent.html). When the user does not specify the extent, we pick one that puts the center of the cells at the integers (which is the behavior you want when working with pixel detectors and images in general).

On the other hand, once you allow non-uniform spacing, many bets are off (we are still assuming a rectangular grid). We are still assuming that the data is contiguous so we can just specify the vertices and then construct the patches based on the neighbors, but this does lead to the off-by-one issue (which is a pretty common thing, see histograms). To go one assumption less (that the data is contiguous), you have to go to

As @WeatherGod points out, whether the user is naturally thinking in centers or edges depends on the user and the field. I think having helpers that do the conversion for the common cases is a good idea. If I am reading the edge-computing code correctly, I would not call that linear interpolation, as it is just re-using the width of the second-to-last cell for the width of the last cell. Is there prior art to reference on this?
I would not be surprised if this function already existed in scipy or skimage. This is an interesting partner to the discussion @anntzer was leading about making
Nicety, or footgun, depending on how you look at it. I personally think that silently dropping data is literally one of the worst things we can do as a library (insert rant about how each data point may be years in the life of a grad student). On the other hand, my personal solution is to use my own private wrappers to imshow() and friends to avoid that... (so I already moved past these discussions :-))
In a sense, supporting N/N+1 in step plots is similar: right now step plots "drop" the first or last point (but not completely, because there's still a vertical segment leading to it) and supporting N/N+1 would avoid that. Somewhat unrelatedly, in the axes-aligned case I guess an improvement would be to return a NonUniformImage, which is likely much faster to render than even a QuadMesh?
On 2020/01/03 12:15 AM, Antony Lee wrote:
> Somewhat unrelatedly, actually in the axes-aligned case I guess an improvement would be to return a NonUniformImage, which is likely much faster to render than even a QuadMesh?

See Axes.pcolorfast.
Thanks for the great comments. A couple of small ones for you to chew on:

If your data Z is collected on your grid points, and we want to think about this as interpolation between those points, this algorithm implements (linear) nearest-neighbor between the provided points, which is usually the lowest-order interpolation. The current behaviour implements nearest-to-the-left, which is just weird.

I agree with everything said above about how specifying the edges is the "unambiguous" thing to do. And I also agree that just dropping the data is unambiguous. But I've not seen any argument that this is what the naive user expects, and I think it's a terrible thing to do silently to someone's data set when they are not expecting it.

If the way forward is deprecating mismatched x, y, Z, and providing helpers, I'd be fine with that. But I think the first user request would be to wrap that in a kwarg to
Ah yes, sorry, I forgot that. Anyway, that discussion point is orthogonal to the main issue at hand here.
That is, if you are saying the point at

Where

The naive user also would not expect that a kwarg would completely change the semantics of the input and make the function a fundamentally different function. To be fair, we already have shades of that with the shading kwarg, but there you get to pick your poison of "nearest-up-and-left" interpolation or a semantics-changing kwarg. This also suggests that the core of the problem here is a documentation one, not a functionality one.

If that is the case, then this needs to be very clearly documented in the docstring: setting a kwarg will radically change the behavior of the function. Let's start with just a warning and see how much feedback we get. The warning should be something like: "We think you did this by accident; if you meant to do this please let us know why". I think we should also change the first line of the docstrings to make clearer that these functions take edges, not centers, as input.
I think we should
- move the interpolation function to cbook
- advertise said function in the docstring
- change the first line of the docstring to make clear that these are edge-specified functions
- add a warning if the dimensions all match with instructions to contact us if it actually was intentional
- put the examples from @greglucas in the docs someplace
@tacaswell, just to be clear, I'm not simply "saying" this: it's how almost every data set I've ever seen is organized: the data is collected at X[n, m] and Y[n, m], so those are the values stored. I think you'd be hard pressed to find a lot of data sets that are organized differently, unless they come from numerical models. If you want specific examples where X and Y have the same dimensions as C:

If you want prior art: xarray has an
Seaborn's
GMT assumes the data is centered on the grid for

We can do that (and add the helper), but many people are going to say "I didn't do this by accident, why doesn't it work?". I really think this is a super-common use case.
I agree completely with @jklymak--in physical oceanography and meteorology, gridded data products and numerical model output are usually provided with the center positions as the grid. Sometimes there are "boundary" grids, but the ones I have seen are (N, 2), not N+1. Being able to handle potentially non-uniform center grids easily and efficiently would be a real convenience, and an improvement over the present quick-and-dirty workaround of dropping a row and a column and ignoring the systematic half-grid position error. Which is probably what all of us do when we are in a hurry, which is most of the time.
Incidentally, the NonUniformImage class, which we have had since the early days but which is probably rarely used, already handles this centered-grid case. I modified it to make the PcolorImage class, which is used in pcolorfast to speed up the common case of a nonuniform but rectilinear set of boundaries.
I'd be happy to put the example in the docs. I am going to wait until the discussion settles on the implementation specifics (kwarg, new function, etc...). My personal opinion is that, yes, this is a highly desired option/default. Another common 'gotcha' here that this would fix is when I do animations and call

My preferences would be to:

There are currently too many functions to remember what they all do, in my opinion, to warrant yet another one. I often forget that pcolorfast and pcolorimage even exist, so adding another function would just add something else that I forget when working fast.
@tacaswell, should this be discussed at a weekly call? Are you on board with the solutions listed above, or should this just get the warning so we can see how much pushback there is?
Please see #16258 for a new version of this.
PR Summary
UPDATE: 17 Dec 2019
New docs: https://27917-1385122-gh.circle-artifacts.com/0/home/circleci/project/doc/build/html/api/_as_gen/matplotlib.axes.Axes.pcolormesh.html#matplotlib.axes.Axes.pcolormesh
Revamped example: https://27917-1385122-gh.circle-artifacts.com/0/home/circleci/project/doc/build/html/gallery/images_contours_and_fields/pcolormesh_levels.html#sphx-glr-gallery-images-contours-and-fields-pcolormesh-levels-py
Right now, if you supply a C = MxN matrix and x = N and y = M vectors, pcolor drops the last column and row of C.

Pluses: no data thrown out!
Minuses: need to add a kwarg to not lose the data.
PR Checklist