[ENH]: plt.scatter() parameters are extremely confusing #27765

francescoboc opened this issue Feb 9, 2024 · 30 comments
@francescoboc

Problem

Every time I have to change the marker, size or color of a scatter plot I have to google the parameters because they are impossible to remember. Why is it that:

  • to change marker color I can use either the color or c parameter,
  • but to change the marker I can only use marker, while the shortened version m does not work,
  • and to change the marker size only the shortened parameter s works but not size !!?

So confusing!

image

Proposed solution

Define parameters to customize the scatter plot more consistently.

@timhoffm
Member

timhoffm commented Feb 9, 2024

Size s and color c are the properties you can set per data point: scatter(x, y, s=..., c=...). They are used quite often and it's a historic decision that they are single characters. Whether we like it or not, that's used too widely to be changed.

marker is a single configuration parameter, and written out consistently throughout the library (e.g. plot(..., marker='x')). It's not reasonable to add a shortcut m because that's redundant and not readable.

color is a bit special. It's a Collection property and we allow setting all such properties via keyword arguments (we basically get that automatically because scatter creates a Collection). Unfortunately, this clashes with the semantics of the explicitly introduced c, so we have to remap color to c.

The only thing one could debate here is whether one wants to add an additional alias size mapping to s. But I claim it's not good to have multiple parameter names for the same thing.

I see your issue but I don't see a way forward to make the API more clear while maintaining backward compatibility.
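
A minimal sketch of the keyword behaviour described above, assuming a recent Matplotlib release and the headless Agg backend (chosen here only so the snippet runs without a display). It checks that c and color give the same marker color, that supplying both is rejected, and that there is no size alias for s. The exact exception type for an unknown keyword is an assumption, so the sketch catches both likely candidates.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(5)
y = x ** 2

fig, ax = plt.subplots()

# `c` and `color` both set the marker color; `s` sets the marker size.
by_c = ax.scatter(x, y, s=40, c="red", marker="s")
by_color = ax.scatter(x, y, s=40, color="red", marker="s")
same_color = np.allclose(by_c.get_facecolor(), by_color.get_facecolor())

# Supplying both `c` and `color` in one call is rejected.
try:
    ax.scatter(x, y, c="red", color="blue")
    both_rejected = False
except ValueError:
    both_rejected = True

# There is no `size` alias for `s`.
try:
    ax.scatter(x, y, size=40)
    size_rejected = False
except (AttributeError, TypeError):
    size_rejected = True
```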

@francescoboc
Author

francescoboc commented Feb 9, 2024

I understand the motivations, historical and not, that you have provided me for the current situation. But this does not change the fact that the current situation is confusing. Am I supposed to think "oh ok, s and c are there for historic reasons, but wait c can also be set with color because scatter creates a Collection object, while for the markers I have to use marker because m is not there for historic reasons" every time I need to change properties of a scatter plot? It's impossible to remember!

"It's not reasonable to add a shortcut m because that's redundant and not readable." -> Well... also c is redundant, and both c and s are not readable, so I don't really see your point.

"I claim it's not good to have multiple parameter names for the same thing." -> But you already have that: color is already remapped to c so again, I don't really see your point.

The ideal solution for this messy situation would be to add m for marker and size for s, so that all 3 properties are consistently set either with their explicit parameter or with their shortened version. If that is not possible, then yes, at least add size for s, so that people don't have to remember that marker and color are set with marker and color, but size is special and only works with s...

I am sorry if I sound a little rough, but this problem frustrates me so much; I think I have googled how to set the properties of a scatter plot at least 100 times in my life.

@rcomer
Member

rcomer commented Feb 9, 2024

See also #1101.

@francescoboc
Author

See also #1101.

Indeed! The inconsistency of ms/markersize and s between plot and scatter also adds confusion to the situation. I didn't want to mention it in the first post because I wanted to focus on the scatter function, but that's also a problem that wasted a lot of my time for no reason.

@oscargus
Member

oscargus commented Feb 9, 2024

Isn't the real problem s? I mean, it should be OK to write it out? (That not all parameters have abbreviations is maybe more expected?)

@story645
Member

story645 commented Feb 9, 2024

See also #1101.

Indeed! The inconsistency of ms/markersize and s between plot and scatter also adds confusion to the situation.

Honestly also trips me up all the time that scatter doesn't have marker{size, facecolor, edgecolor} & I agree consistency on marker setting across the methods would be nice.

@jklymak
Member

jklymak commented Feb 9, 2024

The only reason to use scatter over plot is to have the marker size and color change on a per-point basis. If you want all your dots to have the same "size" and "color", then just use plot.

@story645
Member

story645 commented Feb 9, 2024

If you want all your dots to have the same "size" and "color", then just use plot.

We mention this as an optimization, but scatter-> scatter plots and plot-> line plots is a reasonable way to understand/use the API.

@jklymak
Member

jklymak commented Feb 9, 2024

but scatter-> scatter plots and plot-> line plots is a reasonable way to understand/use the API.

Sure, but it would also be misleading.

The scatter API has different names for s and c because they are meant to be vectors. If you find yourself saying "why can't I use markersize in scatter?", rather than saying "Matplotlib should make scatter more like plot" you should say "this is a situation where using plot is more appropriate than scatter".

@story645
Member

story645 commented Feb 9, 2024

If you find yourself saying "why can't I use markersize in scatter?", rather than saying "Matplotlib should make scatter more like plot" you should say "this is a situation where using plot is more appropriate than scatter".

A) I have lots of situations where I don't want vector size but want vector color (sometimes also the reverse)
B) plot really being plot + limited scatter to me is an argument for pulling the scatter/marker specific keywords back to scatter

@francescoboc
Author

If you want all your dots to have the same "size" and "color", then just use plot.

We mention this as an optimization, but scatter-> scatter plots and plot-> line plots is a reasonable way to understand/use the API.

This interpretation is also the one that the majority of new users have (me included).

@timhoffm
Member

timhoffm commented Feb 9, 2024

@francescoboc answering to #27765 (comment):

Of course you're not supposed to think these convoluted thoughts on every plot; my explanation was rather to motivate where the current state comes from. I advocate thinking like this:

  • Use plot() for any pure x/y data (i.e. no additional per-point information).
    Whether to connect the data points with lines or represent them by markers or both is a visualization detail. Note: it's generically named plot and explicitly does not encode the type (e.g. line or scatter/markers) in the name.
  • Use scatter() if you want to encode more information per data point (and this naturally only works through markers). We only support individual color (c) and size (s) for scatter markers.
    (Maybe as a mnemonic: like the single-char variables x and y, these encode primary data aspects.)

On a side note, but not something you have to know/remember:

  • further stylistic configuration applied to all data points is possible through various keyword arguments, including marker in both cases.
  • color is an alias for c in scatter for technical reasons, but you don't have to know/use this.
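
The data-based reading above can be sketched as follows. This is only an illustration under stated assumptions: the arrays population and temperature are hypothetical per-point quantities invented for the example, and the Agg backend is used so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.random((2, 30))
population = rng.uniform(10, 200, size=30)  # hypothetical per-point quantity
temperature = rng.random(30)                # hypothetical per-point quantity

fig, (ax_plot, ax_scatter) = plt.subplots(1, 2)

# Pure x/y data: every marker looks the same, so plot() is enough.
(line,) = ax_plot.plot(x, y, "o")

# Extra information per point: size and color vary, so use scatter().
points = ax_scatter.scatter(x, y, s=population, c=temperature)

n_sizes = len(points.get_sizes())    # one size per data point
n_values = len(points.get_array())   # one color-mapped value per data point
```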

scatter-> scatter plots and plot-> line plots is a reasonable way to understand/use the API.

This interpretation is also the one that the majority of new users have (me included).

I'm very cautious with general claims what our users think - we usually don't have reliable data about that.

I also think this is not a good interpretation and should not be advertised (@story645 I know you differ here). "plot -> line" is not obvious; it would be if the function were called line(), but it's not, and for a reason: the implementation is for markers and lines. I think the way of understanding should be as above: plot() for pure x/y data (represented through line and/or identical markers); scatter() if you need to encode more information per marker.

One may argue whether the visual-based interpretation (line/marker) is simpler than the data-based interpretation. But whether we like it or not, the implementation of the functions matches the data-based interpretation, and I think we're not doing the users a service when we try to retrofit the visual-based interpretation.

@anntzer
Contributor

anntzer commented Feb 9, 2024

scatter-> scatter plots and plot-> line plots is a reasonable way to understand/use the API.

This interpretation is also the one that the majority of new users have (me included).

I also think this is not a good interpretation and should not be advertised (@story645 I know you differ here).

I would go further and say this interpretation is completely wrong (even though it may be common) and we should actively write the docs in a way that goes against this misunderstanding.

@francescoboc
Author

francescoboc commented Feb 9, 2024

@timhoffm
I think it is not only a matter of misinterpretation, but also the fact that writing plt.scatter(x,y) is faster and cleaner than plt.plot(x,y,ls='') or plt.plot(x,y,lw=0), whenever I want to make a scatter plot without lines connecting data points.

As for the "reliable data on what other users think", ok I admit that I have not done an official survey, but at least this is my experience from talking to coworkers and colleagues in academia.

@jklymak
Member

jklymak commented Feb 10, 2024

@francescoboc there has to be a default, and plot happens to default to a line with no marker. plt.plot(x, y, 'o') is the idiomatic way to make just circular markers with no line (in fact plt.plot(x, y, ls='') will make an empty plot). It doesn't mean you can't do plt.scatter(x, y), just that they are two distinct APIs that are provided to do different things, but can be made to overlap. These idioms are very old (dating to early Matlab, so >40 years), so they are not likely to be changed in any major way. I also think trying to make the methods overlap more would be an API mistake and lead to even more confusion (for instance scatter(..., s=10) and plot(..., markersize=10) are different sizes!).

we should actively write the docs in a way that goes against this misunderstanding.

For sure, if there are places where we could differentiate better that would be welcome.

I'll also point out that it is quite easy to write wrappers around our API for your own API (my_homogenous_scatter(ax, x, y, ...)).
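
A minimal sketch of the idioms mentioned above, assuming default rcParams and the Agg backend. The wrapper borrows the name my_homogenous_scatter from the comment; its signature and body are hypothetical and only one of many ways such a helper could look.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np


def my_homogenous_scatter(ax, x, y, marker="o", **kwargs):
    """Hypothetical helper: identical markers with no connecting line."""
    kwargs.setdefault("linestyle", "none")
    return ax.plot(x, y, marker=marker, **kwargs)[0]


x = np.linspace(0, 1, 10)
y = x ** 2
fig, ax = plt.subplots()

(markers_only,) = ax.plot(x, y, "o")     # circular markers, no line
(nothing_drawn,) = ax.plot(x, y, ls="")  # no line and no marker: nothing visible
wrapped = my_homogenous_scatter(ax, x, y, marker="x", markersize=8)
```

The wrapper returns the Line2D artist, so the usual Line2D setters remain available on the result.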

@francescoboc
Author

@francescoboc there has to be a default, and plot happens to default to a line with no marker. plt.plot(x, y, 'o') is the idiomatic way to make just circular markers with no line (in fact plt.plot(x, y, ls='') will make an empty plot). It doesn't mean you can't do plt.scatter(x, y), just that they are two distinct APIs that are provided to do different things, but can be made to overlap. These idioms are very old (dating to early Matlab, so >40 years), so they are not likely to be changed in any major way. I also think trying to make the methods overlap more would be an API mistake and lead to even more confusion (for instance scatter(..., s=10) and plot(..., markersize=10) are different sizes!).

we should actively write the docs in a way that goes against this misunderstanding.

For sure, if there are places where we could differentiate better that would be welcome.

I'll also point out that it is quite easy to write wrappers around our API for your own API (my_homogenous_scatter(ax, x, y, ...)).

Yes, sorry I wrote it quickly and I forgot the marker parameter. What I would normally use is plt.plot(x, y, ls='', marker='o').
I just tried plt.plot(x, y, 'o'), and yes it does indeed produce a scatter plot with no lines, thank you for the suggestion! I will use this command from now on instead of plt.scatter.

The main confusion for me comes from the fact that, intuitively, if I want to make a simple scatter plot (with all points having the same properties) I use plt.scatter(x,y), but after this exchange I now understand that the correct function is plt.plot(x,y,'o'). This in my opinion is misleading and leads many users to misuse the scatter function.

@anntzer
Contributor

anntzer commented Feb 10, 2024

(Perhaps I can take advantage of this discussion to try and revive #14174, by the way?)

@story645
Member

story645 commented Feb 11, 2024

I would go further and say this interpretation is completely wrong (even though it may be common) and we should actively write the docs in a way that goes against this misunderstanding

Ok fine it's wrong but I don't think this is possible to correct in docs b/c line and scatter have fundamentally different semantics and our defaults highlight those different semantics

  • ax.plot(y) -> line plot
  • ax.scatter(x, y) -> scatter plot

So I don't see how we sell people on "so yes the default of plot is a line plot but use plot when trying to make a line plot or a very specific type of scatter plot, but use scatter when making every single other type of scatter plot and oh yeah it can make the specific type of scatter you're trying to make too."

For the record, it also frustrates me that stackplot is either an areaplot or a streamgraph, but that at least is b/c they're originating from the same paper.

(for instance scatter(..., s=10) and plot(..., markersize=10) are different sizes!).

Yeah, I think that's really terrible for consistency and we have #25259 for that reason

@rcomer
Member

rcomer commented Feb 11, 2024

It probably doesn't help that plot returns an object called Line2D, even if no line is drawn.

@rcomer
Member

rcomer commented Feb 11, 2024

Would it be worth adding a second entry for plot in plot types showing a scatter plot? The scatter entry already has random colours and sizes, so it could help illustrate the distinction between what to use for a simple scatter vs a more configured one.

@timhoffm
Member

Friends, the relation between plot and scatter is messy for historic reasons, partly originating even from MATLAB. Just complaining that it's unintuitive or insisting on one certain ill-fitting point of view (plot = line plot) does not make it better for our users. I'd like to have constructive suggestions on how to improve the situation.

We only have very limited possibilities to change the API and naming due to backward compatibility. IMHO we can help the users most by proper description in the documentation. In particular, that means not primarily associating plot with the visual (line vs. marker) but with the data. The docstring already does this: "Plot y versus x as lines and/or markers".

The plot types visual could simply be expanded to image

@timhoffm
Member

Would it be worth adding a second entry for plot in plot types showing a scatter plot?

Cross post 😄 . Yes! See my comment above.

@rcomer
Member

rcomer commented Feb 11, 2024

If the thumbnails look like this, it makes it really obvious how much the use-cases can overlap

[thumbnail images: plot, plot_scatter, scatter]

@story645
Member

story645 commented Feb 11, 2024

Just complaining that it's unintuitive or insisting on one certain ill-fitting point of view (plot = line plot) does not make it better for our user. I'd like to have constructive suggestions how to improve the situation

Frankly, I think the API mismatch issue between plot and scatter is even worse if we insist on recommending plot for uniform scatter because a very common exploratory viz workflow is layering in more encodings so we'd end up encouraging users to start with plot to make the scatter, then move to scatter to encode size and color and so they'd run into more issues w/ parameters not having the same name.

I'm not trying to complain, I'm just wondering why we're insisting on what we'd normally consider bad API:

  • plot -> line plot + one very specific type of scatter

  • scatter -> every type of scatter except uniform scatter, which this function can also handle just fine.

When we have the simple usability/less confusing out of recommending scatter for the general case and plot for the optimized case. ETA: Concretely and constructively, scatter for 0D discrete data, plot for samples of 1D continuous data b/c those are the underlying assumptions of each method (plot is drawing invisible lines between the markers).

Like I think @rcomer 's thumbnail example only highlights this issue of heavy somewhat confusing overlap. I think adding the line example makes more sense b/c it highlights what I think is the primary purpose of plot scatter function, which is annotate the line w/ markers -which is why there's also a markevery keyword on Line2D while scatter doesn't allow for that b/c every point needs to be shown. Which again boils down to the semantic assumptions baked in on what the data is supposed to be - the reason for markevery is the underlying assumption of continuity.
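
To illustrate the markevery point, here is a small sketch assuming the Agg backend. It draws a densely sampled line annotated with a marker on every 10th sample, and checks that scatter has no equivalent keyword. The exception type raised for the unknown keyword is an assumption, so both likely candidates are caught.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

fig, ax = plt.subplots()

# Line through all 100 samples, with a marker drawn on every 10th one.
(line,) = ax.plot(x, y, "-o", markevery=10)

# scatter() draws every point and does not accept `markevery`.
try:
    ax.scatter(x, y, markevery=10)
    scatter_accepts_markevery = True
except (AttributeError, TypeError):
    scatter_accepts_markevery = False
```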

@rcomer
Member

rcomer commented Feb 11, 2024

scatter -> every type of scatter except uniform scatter

Just to confuse things further, I recently changed some code from using scatter to using plot because I wanted half-filled markers, which I don't think scatter can do?

@story645
Member

story645 commented Feb 11, 2024

half-filled markers, which I don't think scatter can do?

You're right, and I think scatter should support the half-filled b/c I don't think we should have inconsistency in the markers we support - we've done the other way and allowed plot to take MarkerStyle methods.

ETA: w/ the caveat that I think the reason we don't support it has to do w/ the technical implementation of the markers such that I recognize it may be hard/technically impossible to implement half-filled and respect scatter semantics.

ETA2: Tried to sketch this out:
image

@jklymak
Member

jklymak commented Feb 11, 2024

This in my opinion is misleading and leads many users to misuse the scatter function.

I don't think it's a misuse of scatter, but it just has a different API, so needs to be used differently. markersize and color don't work because the scatter API allows vectors for s and c.

very common exploratory viz workflow is layering in more encodings so we'd end up encouraging users to start with plot to make the scatter, then move to scatter to encode size and color and so they'd run into more issues w/ parameters not having the same name.

I don't think anyone is encouraging that, necessarily. If they want to move from scatter(x, y, s=1) to scatter(x, y, s=z), that is great and perfectly within the API. But I'd argue scatter(x, y, markersize=1) to scatter(x, y, s=z) is worse, and more confusing. Particularly as s and markersize are different units: s=2 is the same as markersize=sqrt(2). Overloading to scatter(x, y, markersize=1) to scatter(x, y, markersize=z), would maybe be OK, but I'm still not a fan as it is a different markersize than for Line2D markers. I think it is more clear to keep it a distinct keyword particular to scatter.
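
The unit difference can be sketched like this, assuming default (non-classic) rcParams and the Agg backend: markersize is a length in points, while s is an area in points squared, so s = markersize ** 2. The helper markersize_to_s is a hypothetical name introduced only for this example. Edge widths may still differ between the two artists, so the rendered markers are only approximately the same size.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt


def markersize_to_s(markersize):
    """Hypothetical helper: convert a Line2D markersize (points) to scatter's s (points**2)."""
    return markersize ** 2


fig, ax = plt.subplots()

markersize = 10.0
(line,) = ax.plot([0], [0], "o", markersize=markersize)
points = ax.scatter([1], [0], s=markersize_to_s(markersize))

# With no `s` given, scatter falls back to rcParams["lines.markersize"] ** 2.
default_s = ax.scatter([2], [0]).get_sizes()[0]
```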

timhoffm added a commit to timhoffm/matplotlib that referenced this issue Feb 11, 2024
Inspired from the discussion in matplotlib#27765: We should visually communicate that `plot()` covers all three variants: markers only, line+markers, line-only. They are visually distinct enough that it's not possible to infer the variants if you see only one. In particular, it's important to communicate that you can draw markers only. We don't want to automatically drive people who want markers (e.g. some discrete measurements of a dependent variable y(x)) to scatter because that's the only one showing discrete markers in the overview.

@timhoffm
Member

While this is getting quite off-topic, I just want to note that you can have half-filled markers in scatter.

import matplotlib.pyplot as plt
from matplotlib.markers import MarkerStyle
import numpy as np

x, y = np.random.random((2, 20))
plt.scatter(x, y, s=500, marker=MarkerStyle('o', fillstyle='right'))
x, y = np.random.random((2, 20))
plt.scatter(x, y, s=500, marker=MarkerStyle('o', fillstyle='top'))

image

So the fundamental mechanism is in place. I suspect however that styling is limited, because the colors and linewidths are not exposed through kwargs.
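
For comparison, a sketch of the styling that plot() exposes for half-filled markers through Line2D keyword arguments (fillstyle, markerfacecolor, markerfacecoloralt, markeredgecolor). The colors and sizes are arbitrary choices for the example, and the Agg backend is assumed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.random((2, 20))

fig, ax = plt.subplots()

# Half-filled circles: left half in one color, the other half in the "alt" color.
(line,) = ax.plot(
    x, y, "o",
    markersize=20,
    fillstyle="left",
    markerfacecolor="tab:blue",
    markerfacecoloralt="orange",
    markeredgecolor="black",
)
```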

timhoffm added a commit to timhoffm/matplotlib that referenced this issue Feb 12, 2024
@story645
Member

Overloading to scatter(x, y, markersize=1) to scatter(x, y, markersize=z), would maybe be OK, but I'm still not a fan as it is a different markersize than for Line2D markers. I think it is more clear to keep it a distinct keyword particular to scatter.

So I kinda agree here in that I think if scatter were to get a markersize keyword, it should be in the same units as Line2D. Which, my primary reason for wanting the marker{face,alt,edge}color keywords is also b/c of wanting a consistent interface for markers. I get the argument that scatter doesn't need the marker preface b/c the only option is markers, but like it may not hurt to document that explicitly as a note or something.

@timhoffm
Member

timhoffm commented Feb 12, 2024

A fundamental issue with the marker handling in scatter() is that the underlying artist is a PathCollection and does not know anything about markers. We convert the marker to a Path in scatter(). While one could try and just monkey patch marker kwargs onto scatter, you won’t get full functionality: e.g. you still would not have set_markeredgecolor(), and also not the property aliases, e.g. mec.

I therefore recommend making a marker-aware subclass MarkerPathCollection/ScatterCollection if you want to improve marker handling in scatter() and handle all logic therein.

Note however, that even then there will remain some rough edges. For example, the rcParams for markers are in the “lines” subgroup.
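
The difference between the two artists can be inspected directly. This sketch, assuming the Agg backend, shows that plot() returns a Line2D with marker-specific setters, while scatter() returns a PathCollection that only offers the generic Collection setters, and that marker-related rcParams live under the "lines." prefix.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import PathCollection
from matplotlib.lines import Line2D

x = np.arange(5)
y = x ** 2
fig, ax = plt.subplots()

(line,) = ax.plot(x, y, "o")
points = ax.scatter(x, y)

line.set_markeredgecolor("black")  # Line2D has marker-specific setters
points.set_edgecolor("black")      # PathCollection: generic Collection setter only
has_marker_setter = hasattr(points, "set_markeredgecolor")

# Marker-related defaults are stored in the "lines" rcParams group.
marker_rc_keys = [key for key in plt.rcParams if key.startswith("lines.marker")]
```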

Impaler343 pushed a commit to Impaler343/matplotlib that referenced this issue Mar 8, 2024