
Make pcolor(mesh) preserve all data #9629


Closed
jklymak wants to merge 1 commit

Conversation

@jklymak (Member) commented Oct 30, 2017

PR Summary

UPDATE: 17 Dec 2019

New docs: https://27917-1385122-gh.circle-artifacts.com/0/home/circleci/project/doc/build/html/api/_as_gen/matplotlib.axes.Axes.pcolormesh.html#matplotlib.axes.Axes.pcolormesh

Revamped example: https://27917-1385122-gh.circle-artifacts.com/0/home/circleci/project/doc/build/html/gallery/images_contours_and_fields/pcolormesh_levels.html#sphx-glr-gallery-images-contours-and-fields-pcolormesh-levels-py

Right now, if you supply C as an MxN matrix together with x of length N and y of length M, pcolor drops the last row and column of C.

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(4)
y = np.arange(4)
X, Y = np.meshgrid(x, y)
z = (X + 1) * (Y + 1)
vmin, vmax = z.min(), z.max()

fig, ax = plt.subplots(3, 2, sharex=True, sharey=True,
                       constrained_layout=True)
for nn, drop in enumerate([True, False]):
    ax[0, nn].pcolor(x, y, z, vmin=vmin, vmax=vmax, dropdata=drop)
    ax[1, nn].pcolor(X + 0.2*Y, Y, z, vmin=vmin, vmax=vmax, dropdata=drop)
    pc = ax[2, nn].pcolor(X + 0.2*Y, Y + 0.2*X, z,
                          vmin=vmin, vmax=vmax, dropdata=drop)

fig.colorbar(pc, ax=ax)
plt.show()

[image: pcolor output, dropdata=True (left column) vs dropdata=False (right column)]

Pluses: no data thrown out!
Minuses: need to add a kwarg to not lose the data.

PR Checklist

  • Has Pytest style unit tests
  • Code is PEP 8 compliant
  • New features are documented, with examples if plot related
  • Documentation is sphinx and numpydoc compliant
  • Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
  • Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way
  • TODO: Fix handling of dates in x/y axes...

@anntzer (Contributor) commented Oct 30, 2017

I agree this behavior is... less bad 😃

@jklymak (Member Author) commented Oct 30, 2017

ping @efiring because I remember he had an opinion about this (though I don't remember what it was exactly)....

@jklymak (Member Author) commented Oct 30, 2017

Only fails one image comparison in test_axes.py and I'd argue it now does the "right" thing except for the auto-scaling of the x and y limits:

Old: [image: pcolormesh-expected]

New: [image: pcolormesh]

Fails the tests with date numbers. That'll be a project for another day or evening... see the TODO above.

@jklymak jklymak changed the title Make pcolor(mesh) preserve all data WIP: Make pcolor(mesh) preserve all data Oct 30, 2017
@jklymak (Member Author) commented Oct 31, 2017

Fails some image tests and a couple of x/y limits tests

@jklymak jklymak changed the title WIP: Make pcolor(mesh) preserve all data Make pcolor(mesh) preserve all data Dec 9, 2017
@jklymak jklymak added this to the v2.2 milestone Dec 9, 2017
@jklymak jklymak modified the milestones: v2.2, v3.0 Jan 11, 2018
@jklymak jklymak modified the milestones: v3.0, v3.1 Jul 3, 2018
@pharshalp (Contributor) left a comment

Thanks for proposing this change. Until now, I have been using a helper function (similar to what you have proposed) to fix X and Y values before passing them to pcolormesh. I have added a suggestion to the _interp_grid function to correctly handle logarithmically spaced data.

if dropdata:
    C = C[:Ny - 1, :Nx - 1]
else:
    def _interp_grid(X):
@pharshalp (Contributor) commented Oct 21, 2018

Suggested change:

def _interp_grid(X):
    # helper for below
    dX = np.diff(X, axis=1) / 2.
    if np.allclose(dX[:, :], dX[0, [0]]):
        X = np.hstack((X[:, [0]] - dX[:, [0]],
                       X[:, :-1] + dX,
                       X[:, [-1]] + dX[:, [-1]]))
    else:
        X_ratio = np.sqrt(X[:, 1:] / X[:, :-1])
        if np.allclose(X_ratio[:, :], X_ratio[0, [0]]):
            X = np.hstack((X[:, [0]] / X_ratio[:, [0]],
                           X[:, :-1] * X_ratio,
                           X[:, [-1]] * X_ratio[:, [-1]]))
        else:
            pass
            # Warn the user that the data is neither linearly
            # nor logarithmically spaced.
    return X

The changes I am suggesting take care of the case when X and/or Y are logarithmically spaced (e.g. visualizing a time series of fractional-octave-band sound pressure level data; see the following example).

import numpy as np
import matplotlib.pyplot as plt

time = np.array([0, 1, 2, 3])
freq = np.array([500, 1000, 2000, 4000])
time_mesh, freq_mesh = np.meshgrid(time, freq)
C = np.arange(16).reshape(4, 4)

fig, ax = plt.subplots()
im = ax.pcolormesh(time_mesh, freq_mesh, C, dropdata=False)
cbar = fig.colorbar(im)
cbar.set_label(r'Octave band SPL [dB$_{ref: 20 \mu Pa}$]')

ax.set_xticks(time)
ax.set_yscale('log')
ax.set_yticks(freq)
ax.set_yticklabels(freq)
ax.minorticks_off()
ax.set_xlabel('Time [s]')
ax.set_ylabel('Octave band center frequency [Hz]')

plt.show()

[image: figure_1, the resulting log-scaled pcolormesh]

(Contributor):

I like this logarithmic addition! The only thing I'd add is that you can still use the method and calculate dX's even if the points aren't evenly spaced, so the first if statement shouldn't be a requirement. It could just be the default after checking for the ratio.

(Member Author):

I don't think this will apply generally enough to be a good approach. If someone needs precision in the cell edges, they should calculate them themselves as they need, not count on matplotlib to do it. This is just a rough fix which is not quite as rough as just dropping the data.
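
For instance, a minimal sketch of "calculating them yourself": build exact edges and pass unambiguous N+1 / M+1 arrays. The values here are made up, and the geometric-midpoint construction is just one reasonable choice.

import numpy as np
import matplotlib.pyplot as plt

x = np.array([500., 1000., 2000., 4000.])   # log-spaced centers
inner = np.sqrt(x[:-1] * x[1:])             # geometric midpoints
edges = np.concatenate([[x[0]**2 / inner[0]], inner,
                        [x[-1]**2 / inner[-1]]])
z = np.random.rand(3, 4)

fig, ax = plt.subplots()
ax.pcolormesh(edges, np.arange(4), z)       # N+1 / M+1 edges: unambiguous
ax.set_xscale('log')
plt.show()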

(Member Author):

Yeah, there are a whole bunch of cases where this will fail/error out (e.g. x = [10, 9, 8]) that we could check for, but I'm not convinced it's worth the extra code.

@anntzer (Contributor) commented Dec 18, 2019

I guess a possibility would be to do the grid interpolation in screen space? Something like... (assuming dropdata=False and the shapes are such that grid interp is necessary)

  • convert the x values using the current scale
  • if the scaled values are uniformly spaced, do the interp, convert back to data space, use the interp'd grid.
  • if not (and an interpolation is necessary), emit a warning (so that users will know to call set_xscale("log") before passing their log-scaled x values)

(very rough sketch)
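
A minimal sketch of that idea in code; the helper name and warning text are illustrative, not anything in matplotlib:

import warnings
import numpy as np

def interp_edges_in_scale_space(ax, x):
    # transform the centers with the axis' current scale (e.g. log)
    trans = ax.xaxis.get_transform()
    xs = trans.transform(np.asarray(x, dtype=float))
    dxs = np.diff(xs)
    if not np.allclose(dxs, dxs[0]):
        warnings.warn("x is not uniformly spaced under the current "
                      "x-scale; consider calling set_xscale first")
    # midpoint edges in scaled space, mapped back to data space
    edges = np.concatenate([[xs[0] - dxs[0] / 2],
                            xs[:-1] + dxs / 2,
                            [xs[-1] + dxs[-1] / 2]])
    return trans.inverted().transform(edges)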

(Member Author):

That seems way too fancy and hard to explain to the user. And I'm not even convinced it makes sense for curvilinear grids etc. Let's keep this simple and easy to explain. If the user wants something fancier they can calculate their own edges...

(Contributor):

I guess a possibility would be to do the grid interpolation in screen space? Something like... (assuming dropdata=False and the shapes are such that grid interp is necessary)

* convert the x values using the current scale

* if the scaled values are uniformly spaced, do the interp, convert back to data space, use the interp'd grid.

* if not (and an interpolation is necessary), emit a warning (so that users will know to call set_xscale("log") before passing their log-scaled x values)

This process could be converted into a helper function which could be shown off in a gallery example or matplotlib blog post. If the OP is fine with it, I am interested in working on a PR demonstrating the process outlined here. Does this sound like a good idea?

@@ -5665,7 +5691,8 @@ def pcolormesh(self, *args, alpha=None, norm=None, cmap=None, vmin=None,

allmatch = (shading == 'gouraud')

@pharshalp (Contributor) commented Oct 21, 2018

Suggested change:

dropdata = kwargs.pop('dropdata', True)

Also, I think the method definition for _pcolorargs needs to be updated to include dropdata=True.

(Contributor):

What about a different kwarg than dropdata? Maybe something like xyvertices=False or xyedges, something to insinuate that the current x/y are grid points or vertices rather than edges.

(Member Author):

Yeah, I see what you mean, but I'm not sure what to do about this. We don't want to imply that we will do anything if len(x) == N+1. I propose we keep dropdata for now.

(Contributor):

The problem I see with dropdata as the kwarg is that you're not just including all of the data, you're also changing the centers/edges of the cells by calculating a midpoint everywhere. In the examples above, the edges were at integers (0, 1, 2) and with this fix they are now at -0.5, 0.5, 1.5, 2.5. So the grid has shifted as well. To me, dropdata=False reads like you would have just extended the previous edges to include a new extrapolation point to capture all of the data (0, 1, 2, 3).

xy_centers, xy_midpoints?

If len(x)==N+1 and this kwarg is present, you could raise/warn a user that the combo doesn't make sense.
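
A tiny numeric illustration of that shift, using the positions from the example above:

import numpy as np

x = np.array([0, 1, 2])           # supplied positions
dx = np.diff(x) / 2
edges = np.concatenate([[x[0] - dx[0]], x[:-1] + dx, [x[-1] + dx[-1]]])
print(edges)                      # [-0.5  0.5  1.5  2.5]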

(Member Author):

you're also changing the centers/edges of the cells by calculating a midpoint everywhere.

I'm not calculating the midpoint, I'm calculating the edges, assuming that the given coordinates are the midpoints, which I think is what 99.99% of users would expect. In fact, I think this should be the default behaviour, but that's a whole backward-compatibility concern.

(Contributor):

I agree with what you've said (hence my xy_midpoints suggestion); I wasn't very clear in my description above. I also agree that this is what most users would expect. For backwards compatibility, I think it would be fine if this is just an extra option and the default is the previous behaviour.

(Member):

I agree that the name should be changed. I disagree that this is what users want 99% of the time. Most of the time, imshow() is exactly what those users really need, and is faster to boot. I have been in enough arguments with people about whether coordinates should represent vertices or midpoints to know that it is very dependent upon the particular field you are in and the standard data formats they are usually using most of the time.

(Member Author):

I think 0.001% of users expect data to be dropped. As for imshow, it is useless for unevenly spaced X or Y.

@greglucas (Contributor):

It looks like this PR may have gotten lost along the way, and I think it would be an excellent addition to matplotlib. I have also been writing wrapper functions to help out with data before passing it into pcolormesh and was about to contribute something, but then found this old PR out there which looks great!

In particular, often the data I'm working with from models are given at the center of a cell and frequently do not include the edge data of the grid cells, so this would simplify a lot of the workflow when making meshed plots.

One aspect that hasn't been brought up here is the missing cyclic point in longitude that you currently get when quickly plotting a meshgrid on a map. Here is a quick polar plot demonstrating how this new method would help on a regular angular grid (by eliminating the blank space found at the wraparound).
[image: polar pcolormesh, with and without the wraparound gap]

@jklymak (Member Author) commented Dec 18, 2019

I don't understand the doc build failure. It says there are two "Notes" sections in pcolor, but I didn't add a "Notes" section....

ValueError: The section Notes appears twice in the docstring of <function Axes.pcolor at 0x7fc6d690c670> in /home/circleci/project/lib/matplotlib/__init__.py.

@jklymak (Member Author) commented Dec 18, 2019

^ that error message arose because of an indentation error. 🙄

@greglucas (Contributor) left a comment

Some minor spelling corrections.

On line 5617 (I couldn't figure out how to comment on a line you didn't change)

            if allmatch:
                X, Y = np.meshgrid(np.arange(ncols), np.arange(nrows))
            else:
                X, Y = np.meshgrid(np.arange(ncols + 1), np.arange(nrows + 1))

do you want to worry about whether someone uses dropdata=False without any X/Y input? i.e. what if someone wanted to artificially shift their plotted mesh by using the dropdata kwarg to put the centers on integer values, rather than the edges being on the integer values?

Right now you won't be able to do that and I'm not sure you even want to allow it, so just some food for thought. Otherwise, I think everything looks good right now and the log-spacing and other enhancements could be additional PRs.

@jklymak (Member Author) commented Dec 18, 2019

@greglucas Thanks for the comments. First, I'm still open to changing the name from dropdata. Let's get some more opinions. I guess, psychologically, I want this to sound like something you don't want to do, even if the default is to do it when necessary.

do you want to worry about whether someone uses dropdata=False without any X/Y input?

Data doesn't get dropped when there are no x/y vectors given, so I'm not inclined to change that behaviour. Again, I'm not looking at this as a way to shift the cells for the user, I'm looking at this as a simple way to give the user what they meant by giving us N values for x when Z is MxN rather than just arbitrarily throwing out data.

@greglucas (Contributor):

I went and looked at MATLAB's documentation for pcolor and saw that they state X/Y/Z are supposed to be the same size (and then they drop the last row/column), so I'm not even sure they allow Z to be a different shape at all (I have not tested this myself).

My thoughts in favor of preserving all data are that I'm usually plotting some function evaluated at grid cells, which turns into z = f(x, y). All x/y/z have the same exact shape and I can scatter plot those values. Now, I want to plot a surface and fill the space between the contours, so I would prefer to center the x/y in the middle of the cell where the function was evaluated.

Here is an example demonstrating a few of the different use cases.
[image: four-panel comparison of pcolormesh and imshow calling conventions]

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
cmap = mpl.cm.get_cmap('viridis')
norm = mpl.colors.Normalize(0, 1)

x, y = np.arange(4), np.arange(4)
xx, yy = np.meshgrid(x, y)
z = (xx**2 + yy**2)/18
xedges, yedges = np.meshgrid(np.arange(5) - 0.5, np.arange(5) - 0.5)


def add_scatter(ax):
    ax.scatter(xx, yy, c=z, s=100, cmap=cmap, norm=norm, edgecolor='k')
    ax.set_xlim(-1, 4)
    ax.set_ylim(-1, 4)
    ax.set_aspect('equal')


fig, axarr = plt.subplots(1, 4, figsize=(8, 3), constrained_layout=True)

ax = axarr[0]
ax.set_title('pcolormesh(Z)')
ax.pcolormesh(z, cmap=cmap, norm=norm, edgecolor='w')
ax.quiver(xx, yy, np.ones(xx.shape), np.ones(yy.shape), scale=10, color='w', linewidth=2)
add_scatter(ax)

ax = axarr[1]
ax.set_title('pcolormesh(X, Y, Z)')
ax.pcolormesh(xx, yy, z, cmap=cmap, norm=norm, edgecolor='w')
ax.quiver(xx, yy, np.ones(xx.shape), np.ones(yy.shape), scale=10, color='w', linewidth=2)
add_scatter(ax)

ax = axarr[2]
ax.set_title('pcolormesh(X, Y, Z)\ndropdata=False')
ax.pcolormesh(xedges, yedges, z, cmap=cmap, norm=norm, edgecolor='w')
add_scatter(ax)

ax = axarr[3]
ax.set_title('imshow')
ax.imshow(z, origin='lower', cmap=cmap, norm=norm)
add_scatter(ax)

plt.show()

@timhoffm (Member) commented Jan 1, 2020

@greglucas Well observed that dropping data and the edge/midpoint semantics are two separate issues. BTW, nice illustrative plots!

It is an important point whether X, Y denote edges or center positions.

Current state

X, Y are always edges.

For N patches, we need N+1 edges along a direction.

  • If edges are not given, we can easily create them at positions range(N+1). --> Ok.
  • If X, Y have size N+1, we can just use them. --> Ok.
  • If X, Y have size N, we're missing the last edge.
    Current solution: Silently drop last value. --> May be problematic as some data is not visualized.

Possible changes

Alternatives for the last case:
a) Try to extrapolate an additional value (#9629 (comment)).
--> Could be constant or linearly extrapolated from the previous deltas. But might be too clever and give unexpected results in more complicated cases.
b) Just make it explicit, so that you need pcolor(X, Y, C, dropdata=True) to get the current behavior, possibly with a warning/exception (see c, d) if not given.
c) Warn about dropped data.
d) Fail with an exception.
--> Might actually be reasonable ("In the face of ambiguity, remove the temptation to guess.").

If we go with case b) or c), we need some alternative method/parameter that actually supports X, Y, C of the same size, because C = f(X, Y) is a common use case. If we do not want b) for that, center positions would come into play.

Center positions is a whole new feature for pcolor. Some things to consider:

  • Should be controlled by a new kwarg, proposal: xy_type.
  • It's a separate issue from dropping data, and only related in the aspect that it is an alternative way of plotting same-size X, Y, C, which may allow us to be more restrictive about the dropping.
  • Should be implemented for pcolor and pcolormesh. Also check the relation to pcolorfast.
  • For info: this is a bit similar to plt.step(where=...), https://matplotlib.org/devdocs/gallery/lines_bars_and_markers/step_demo.html. But I don't think we can imitate/learn from that; color patches are too different from a height profile.

@jklymak (Member Author) commented Jan 2, 2020

We have two ways pcolormesh can be called for Z of dimensions MxN:

  1. x has length N+1, y length M+1.
  2. x has length N, y length M.

The first case is unambiguous. x and y specify cell edges, and Z sets the color between the cell edges.

The second case is formally ambiguous, but what do we think users actually mean when they pass this to pcolormesh? I'd argue that 99.99% of the time they expect what dropdata=False gives us here: x, y specify where the data was collected, and they want a cell around that point colored by Z. Other meanings seem obscure enough to me that we should not complicate things by giving them an API. The only reason to provide a dropdata=True toggle is for back compatibility.
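
For concreteness, a small sketch of the two cases; dropdata is the kwarg proposed by this PR, so the second comment describes proposed behavior, not released API:

import numpy as np
import matplotlib.pyplot as plt

Z = np.random.rand(3, 4)                        # M = 3, N = 4
fig, (ax1, ax2) = plt.subplots(1, 2)

# case 1: N+1 / M+1 values are unambiguous cell edges
ax1.pcolormesh(np.arange(5), np.arange(4), Z)

# case 2: N / M values; currently Z[-1, :] and Z[:, -1] are dropped,
# with dropdata=False they would be treated as cell centers instead
ax2.pcolormesh(np.arange(4), np.arange(3), Z)
plt.show()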

As for @greglucas other point about pcolormesh(Z), that seems an orthogonal issue to me; if the user doesn't specify x, y they will get some undefined default behaviour, and the one given is as good as any other.

@greglucas (Contributor):

Naively, I would expect pcolormesh(Z) and imshow(Z) to produce the same image (albeit with a different return-type). Right now, panels 1 and 4 are off by a half-step. In the image code there was the arbitrary decision to make the extent be shifted by 0.5.

if self.origin == 'upper':
    return (-0.5, numcols-0.5, numrows-0.5, -0.5)
else:
    return (-0.5, numcols-0.5, -0.5, numrows-0.5)

Of course, in pcolormesh, the opposite arbitrary decision was chosen.

if allmatch:
    X, Y = np.meshgrid(np.arange(ncols), np.arange(nrows))
else:
    X, Y = np.meshgrid(np.arange(ncols + 1), np.arange(nrows + 1))

It would be nice to be consistent, in my opinion... If there are going to be breaking changes, I think this would be a welcome one to tidy up as well.

@jklymak (Member Author) commented Jan 2, 2020

OK, but again, that's really orthogonal to this PR.

@timhoffm (Member) commented Jan 2, 2020

Let's disregard backward compatibility for the moment. What would be a clean design given Z is MxN?

Option 1) only x ~ N+1, y ~ M+1 is accepted --> x, y define edges

Option 2) only x ~ N, y ~ M is accepted --> x, y define centers

One may be more intuitive than the other but both are technically clean APIs, with different semantics.

However, I don't think one should mix these implicitly. Deciding whether values define edges or centers based on their shape is too much magic. If we want both behaviors in a single function, the clean way out is to let the user specify the meaning explicitly via an extra xy_type (name up for discussion) kwarg.
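
A sketch of what the explicit variant could look like; xy_type and check_xy are illustrative names only:

import numpy as np

def check_xy(x, N, xy_type):
    # enforce the declared semantics instead of guessing from shapes
    x = np.asarray(x)
    if xy_type == 'edges' and len(x) != N + 1:
        raise ValueError(f"xy_type='edges' needs {N + 1} values, got {len(x)}")
    if xy_type == 'centers' and len(x) != N:
        raise ValueError(f"xy_type='centers' needs {N} values, got {len(x)}")
    return x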

@efiring (Member) commented Jan 2, 2020

Part of the problem here is that the only perfectly general and unambiguous way to do a pcolor-type plot is to specify the edges. With an irregular grid and/or nonlinear scale, there is more than one way to generate a reasonable set of edges.

In a clean API:

  1. Data would not be dropped, ever. If the dimensions don't match appropriately, raise a ValueError or TypeError.
  2. The basic routine (let's just keep the pcolormesh name for now) would require edges, which could be specified via 1-D or 2-D arrays.
  3. One or more routines would be supplied to generate edges from 1-D or 2-D arrays of centers.
  4. An additional convenience routine would combine the edge generation with pcolormesh. This function might include the gouraud shading option, or that might also be a separate function.

I think that cramming too much functionality into a single function is counterproductive. It can be done, and maybe it is what we will end up with, but I think using two functions would lead to simpler code (both at the user level and at our level) and less confusion.

@jklymak (Member Author) commented Jan 3, 2020

I kind of feel like we are letting perfect get in the way of fixing something that is terrible. Right now we deal with x ~ N, y ~ M in the worst way possible, by dropping data, whereas I feel pretty strongly that most people who do this expect the data to be centred on x and y. I'm not convinced that making people call a helper function, or adding yet another pcolormesh variant, is the right way forward.

The xy_type kwarg is basically the same as dropdata, unless you think this should have a meaning if x ~ N+1.

I guess we could deprecate x ~ N. I guarantee we will get lots of feedback if we do that!

@efiring (Member) commented Jan 3, 2020

I'm very sympathetic with this PR, and with the "good" versus "perfect" argument; the data-dropping behavior is an unfortunate inheritance from Matlab, and certainly there are many cases where data are specified on centers. When the user actually wants the data-dropping behavior, the user could do it explicitly with simple indexing. I'm a little uncomfortable with the dropdata boolean kwarg and rcParam. How about setting out a strategy for getting rid of it:

  1. Let it take None (use the default) or the string values, "allow", "warn", "error".
  2. Start with "allow" in the rcParams, but urge users to manually set the default to "error", noting that it will eventually go away.
  3. With successive releases, change the rcParams default to "warn" and then "error".
  4. Then remove it.

Maybe that's too much fuss, but it would be nice to have a strategy for ending up with a cleaner API instead of just settling for another kwarg forever.

@efiring (Member) commented Jan 3, 2020

Use of a different function for center-specification was implicit in my comment above, so it doesn't address the desire to have one do-everything function that figures out what to do based partly on the dimensions of the inputs. The PR is certainly trying to be friendly, handling edge versus center independently for each dimension. That's not terribly magical; it's pretty easy to understand and explain. (The docstring modifications in this PR don't quite go far enough in describing the new situation, but there is no point in doing more work on them until we are sure of the API.)

We already have a "shading" kwarg, which actually affects the meaning of the X, Y locations: with gouraud shading, (X, Y) specify the corner colors, and the plot does not extend beyond the X, Y points. This is similar to X, Y specifying centers (as in imshow for uniform grids), the differences being (1) that the colors are blocks rather than gradients, and (2) the domain is expanded slightly so that for uniform grids the blocks are all the same size. This second difference is not essential; it would be possible to handle center-specification, with X, Y, C all of the same dimensions, using half-cells on the edges so that the domain would be the same as for gouraud shading. The advantage is that this interpolates but does not extrapolate. The disadvantage is that most often the user probably will want more image-like behavior.

Would it make sense to supersede "shading" with (perhaps) "mode", which would take values of

  • "legacy" (or "dropdata"?)
  • "gouraud"
  • "auto" which would have the dropdata=False behavior of this PR
  • "edge" which would enforce edge specification
  • "center" enforcing center specification

Or maybe just use the first 3 of these. And maybe leave the name as "shading" so as to minimize the changes. The advantage of keeping "edge" and "center" is code readability: it would make clear to the reader what sort of data input is required. If they are used, though, the kwarg name "shading" becomes problematic.

@tacaswell (Member):

TL;DR I am 👎 on putting in logic to try and guess the cell edges, I think it opens up a can of worms we don't want to get into (as exemplified by the suggestion from @pharshalp for detecting log spaced values).

I am more amenable to changing the default behavior with no X or Y to match imshow (center the cells on the integers rather than anchor a corner on the integers), but that is an orthogonal discussion (and in that case the user probably should be using imshow anyway).

I am 👍 on including the centers -> edges code as a public helper in cbook, probably in both a linear and log sampled flavors.
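
A rough sketch of what such helpers might look like; the names and placement are hypothetical, not actual cbook API:

import numpy as np

def edges_from_centers(c):
    # linear flavor: midpoints between neighbours, mirrored end widths
    c = np.asarray(c, dtype=float)
    mid = (c[:-1] + c[1:]) / 2
    return np.concatenate([[2 * c[0] - mid[0]], mid, [2 * c[-1] - mid[-1]]])

def edges_from_centers_log(c):
    # log flavor: the same construction in log space
    return np.exp(edges_from_centers(np.log(c)))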

I swear this started as a short comment and then I went and wrote an essay...


Right now the way we deal with x ~ N, y ~ M in the worse way possible by dropping data

I disagree that the current behavior is the worst possible solution, I think incorrectly inferring the edges would be worse behavior.

"dropping" the last row / column is consistent and correct in the sense that the patch defined by X[n:n+1], Y[n:n+1] is colored by the value in C[n]. The docstring of pcolormesh has this diagram:

(X[i+1, j], Y[i+1, j])          (X[i+1, j+1], Y[i+1, j+1])
                      +--------+
                      | C[i,j] |
                      +--------+
    (X[i, j], Y[i, j])          (X[i, j+1], Y[i, j+1]),

If you take that as the invariant of pcolor*, then the behavior of dropping the last row/column of the color input is a justifiable nicety, saving your users from typing C[:-1, :-1] (or, in MATLAB, C(1:end-1, 1:end-1)).

Thinking about how the shading works, this actually becomes even more defensible. In pcolormesh, 'flat' shading can be thought of as 0-order interpolation: the input gives us the value of C at every vertex, and we use that value to color the cell whose lower-left corner (using the conventions of that diagram) is that vertex. Then we are not dropping data; we are just getting data from the user that incidentally does not participate in the interpolation. If you use gouraud shading, the color within the patch depends on the values at all 4 vertices. Looking at it from this point of view, accepting X/Y and C as off-by-one shapes is the convenience, as it does not require the user to provide "junk" data that we are never going to use.

I can think of stories where the "natural" representation in the users code is either with matching or off-by-one sized arrays.


As @greglucas pointed out this kwarg is completely changing the meaning of the inputs. That needs to be called out much more clearly.


@greglucas : those are amazing examples, no matter how this PR ends up, I am in favor of those ending up in the documentation (as either an example or a tutorial) someplace.


Maybe that's too much fuss, but it would be nice to have a strategy for ending up with a cleaner API instead of just settling for another kwarg forever.

This is one of the core reasons I want to push the API to be many top-level functions rather than Axes methods: being able to experiment with various APIs as stand-alone packages and having several variants co-exist cleanly. I prefer many simple functions over a smaller number of complex functions (which is consistent with the library point of view; from the application point of view, fewer, more complex functions may be better because (I am told) people find them easier to remember).


As @efiring points out, once you have a non-uniform grid then you need to specify the edges of the cells, not just one point in them.

If you have a uniform square grid of data (for example, from a pixelated detector), then to uniquely place the cells in data space you only need to know the bounding box of the whole image, the size of the input, and the index of the cell in question (eliding the issues of up/down sampling here; let's stick with "nearest" lots-of-up-sampling for now). You could also specify an offset within the cell of where "the value is", but that is degenerate with translating the bounding box. Internally, we always think of the value as being on the corner (this ends up being more complicated than one expects, see https://matplotlib.org/3.1.1/tutorials/intermediate/imshow_extent.html). When the user does not specify the extent, we pick one that puts the centers of the cells at the integers (which is the behavior you want when working with pixel detectors and images in general).

On the other hand, once you allow non-uniform spacing, many bets are off (we are still assuming a rectangular grid). We are still assuming that the data is contiguous so we can just specify the vertices and then construct the patches based on the neighbors, but this does lead to the off-by-one issue (which is a pretty common thing, see histograms).

To drop one more assumption (that the data is contiguous), you have to go to PatchCollection directly: you need 4 scalars per cell, and putting them in a rectangular array is no longer helpful (to Matplotlib; on the user side it might still be useful). A minimal sketch of that route is below.
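
This sketch uses PolyCollection to place fully independent quads; it is illustrative only, not part of this PR:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import PolyCollection

# four (x, y) corners per cell; the cells need not touch
verts = [[(0, 0), (1, 0), (1, 1), (0, 1)],
         [(1.5, 0), (2.5, 0), (2.5, 1), (1.5, 1)]]  # note the gap
pc = PolyCollection(verts, array=np.array([0.2, 0.8]), cmap='viridis')

fig, ax = plt.subplots()
ax.add_collection(pc)
ax.autoscale_view()
plt.show()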


As @WeatherGod points out, if the user is naturally thinking in centers or edges depends on the user and the field. I think having helpers that do the conversion for the common cases is a good idea.


If I am reading the edge computing code correctly, I would not call that linear interpolation as it is just re-using the width of the second to last cell for the width of the last cell. Is there prior art to reference on this? I would not be surprised if this function already existed in scipy or skimage.


This is an interesting partner to the discussion @anntzer was leading about making step be more forgiving about N / N+1 length inputs.

@anntzer (Contributor) commented Jan 3, 2020

the behavior of dropping the last row / column of the color input is justifiable nicety to save your users from typing C[:-1, :-1]

Nicety, or footgun, depending on how you look at it. I personally think that silently dropping data is literally one of the worst things we can do as a library (insert rant about how each data point may be years in the life of a grad student). On the other hand, my personal solution is to use my own private wrappers for imshow() and friends to avoid that... (so I already moved past these discussions :-))

This is an interesting partner to the discussion @anntzer was leading about making step be more forgiving about N / N+1 length inputs.

In a sense supporting N/N+1 in step plots is similar, right now step plots "drop" the first or last point (but not completely because there's still a vertical segment leading to it) and supporting N/N+1 would avoid that.


Somewhat unrelatedly, actually in the axes-aligned case I guess an improvement would be to return a NonUniformImage, which is likely much faster to render than even a QuadMesh?

@efiring (Member) commented Jan 3, 2020 via email

@jklymak (Member Author) commented Jan 3, 2020

Thanks for the great comments. A couple of small ones for you to chew on:

If your data Z is collected on your grid points, and we want to think about this as interpolation between those points, this algorithm implements (linear) nearest-neighbor between the provided points, which is usually the lowest-order interpolation. The current behaviour implements nearest-to-the-left, which is just weird.

I agree with everything said above about how specifying the edges is the "unambiguous" thing to do. And I also agree that just dropping the data is unambiguous. But I've not seen any argument that this is what the naive user expects, and I think it's a terrible thing to do silently to someone's data set when they are not expecting it.

If the way forward is deprecating mismatched x, y, Z, and providing helpers, I'd be fine with that. But I think the first user request would be to wrap that in a kwarg to pcolormesh, so...

@anntzer (Contributor) commented Jan 3, 2020

See Axes.pcolorfast.

Ah yes, sorry I forgot that. Anyways that discussion point is orthogonal to the main issue at hand here.

@tacaswell (Member):

The current behaviour implements nearest-to-the-left, which is just weird.

That is if you are saying the point at C[n,m] is associated with the (X[n,m], Y[n, m]). Instead it is associated with 4 points in the X/Y data as an "intercalated" data set. Something like

x x x x
 o o o
x x x x
 o o o
x x x x

Where x is the location data and o is the value data. Although we can't express it in strided arrays, it may be best to think of the value data as being at half-indexes.


But I've not seen any argument that this is what the naive user expects, and I think its a terrible thing to do silently to someone's data set when they are not expecting it.

The naive user also would not expect that a kwarg would completely change the semantics of the input and make the function a fundamentally different function.

To be fair, we already have shades of that with the shading kwarg, but there you get to pick your poison of "nearest-up-and-left" interpolation or semantic changing kwarg.

This also suggests that the core of the problem here is a documentation one not a functionality one.

But I think the first user request would be to wrap that in a kwarg to pcolormesh, so...

If that is a case then this needs to be very clearly documented in the docstring that setting a kwarg will radically change the behavior of the function.


Let's start with just a warning and see how much feedback we get. The warning should be something like: "We think you did this by accident, if you meant to do this please let us know why".


I think we should also change the first line of the docstrings to make clearer these functions take edges not centers as input.

@tacaswell (Member) left a comment

I think we should

  • move the interpolation function to cbook
  • advertise said function in the docstring
  • change the first line of the docstring to make clear that these are edge-specified functions
  • add a warning if the dimensions all match with instructions to contact us if it actually was intentional
  • put the examples from @greglucas in the docs someplace

@jklymak (Member Author) commented Jan 4, 2020

That is if you are saying the point at C[n,m] is associated with the (X[n,m], Y[n, m]). Instead it is associated with 4 points in the X/Y data as an "intercalated" data set.

@tacaswell, just to be clear, I'm not simply "saying" this: it's how almost every data set I've ever seen is organized. The data is collected at X[n, m] and Y[n, m], so those are the values stored. I think you'd be hard-pressed to find a lot of data sets that are organized differently, unless they come from numerical models. If you want specific examples where X and Y have the same dimensions as C:

If you want prior art, xarray has an infer_intervals that defaults to True (unless X and Y are matrices and we are dealing with a projection, in which case they don't do it). The logic is here: https://github.com/pydata/xarray/blob/db36c5c0cdee2f5313a81fdeca8a8ae5491d1c8f/xarray/plot/plot.py#L959

Seaborn's heatmap only works the way that this PR works.

GMT assumes the data is centered on the grid for grdimage.

Let's start with just a warning and see how much feedback we get. The warning should be something like: "We think you did this by accident, if you meant to do this please let us know why".

We can do that (and add the helper), but many people are going to say "I didn't do this by accident, why doesn't it work?". I really think this is a super-common use case.

@efiring (Member) commented Jan 4, 2020

I agree completely with @jklymak--in physical oceanography and meteorology, gridded data products and numerical model output are usually provided with the center positions as the grid. Sometimes there are "boundary" grids, but the ones I have seen are (N, 2), not N+1. Being able to handle potentially non-uniform center grids easily and efficiently would be a real convenience, and an improvement over the present quick-and-dirty workaround of dropping a row and a column and ignoring the systematic half-grid position error. Which is probably what all of us do when we are in a hurry, which is most of the time.
The question is therefore how to provide this major user convenience with a clean, simple, unambiguous, preferably explicit API. Hence my general suggestions: either use a separate function name for the center case, or use a kwarg design that is clear and readable. Then it all comes down to the problem of naming, which is hard. But maybe not impossible. And there is also the horrible backward-compatibility, slow deprecation mess to deal with. Putting the center-based version in its own function requires coming up with a name (I don't have a good suggestion so far), but apart from that it is likely the cleanest approach, with the simplest compatibility and deprecation pathway.

@efiring (Member) commented Jan 4, 2020

Incidentally, the NonUniformImage class, which we have had since early days but which is probably rarely used, already handles this centered-grid case. I modified it to make the PcolorImage class which is used in pcolorfast to speed up the common case of a nonuniform but rectilinear set of boundaries.

@greglucas (Contributor):

I'd be happy to put the example in the docs. I am going to wait until the discussion settles on the implementation specifics (kwarg, new function, etc...).

My personal opinion is that, yes, this is a highly desired option/default. Another common 'gotcha' that this would fix is when I do animations and call coll.set_array(Z) to update the data in the meshgrid, not realizing that the stored array is not the same size as the initial data I passed in... Again, that is another orthogonal issue to the main code here (updating the set_array machinery to have more helpful checks/warnings for users); a small demonstration is below.
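
This assumes the current (dropping) behavior; the exact stored shape is an internal detail and may differ across versions:

import numpy as np
import matplotlib.pyplot as plt

Z = np.random.rand(3, 4)
qm = plt.pcolormesh(np.arange(4), np.arange(3), Z)  # same-size x, y, Z
print(qm.get_array().shape)   # (3-1)*(4-1) = 6 kept cells, not Z.size == 12
qm.set_array(Z.ravel())       # mismatched update: 12 values for 6 cells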

My preferences would be to:

  • Move this 'calculate edges from centers' code to the cbook.
  • Add a new kwarg to pcolor* that would call that cbook helper.
  • Add a warning to the current pcolor* if the sizes of X/Y aren't N+1 and suggest calculating edges themselves or using the new kwarg.
  • Make the kwarg default in mpl 4+ to prevent dropping any data.

There are currently too many functions to remember what they all do, in my opinion, to warrant yet another one. I often forget that pcolorfast, pcolorimage even exist, so adding another function would just add something else that I forget when working fast.

@jklymak (Member Author) commented Jan 8, 2020

@tacaswell should this be discussed at a weekly call, are you on board with the solutions listed above, or should this just get the warning so we can see how much pushback there is?

@jklymak (Member Author) commented Jan 18, 2020

Please see #16258 for a new version of this.

@jklymak jklymak closed this Jan 18, 2020