Rethink categoricals. #9774

Closed
wants to merge 7 commits into from

anntzer
Contributor

@anntzer anntzer commented Nov 13, 2017

Don't support mixed type inputs.
Don't sort keys.

Edited: I accidentally relied on Py3.6's dict ordering behavior in the previous version :-)
Made private what can be.

@story645 @tacaswell

PR Summary

PR Checklist

  • Has Pytest style unit tests
  • Code is PEP 8 compliant
  • New features are documented, with examples if plot related
  • Documentation is sphinx and numpydoc compliant
  • Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
  • Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

@tacaswell
Member

I thought we agreed to support mixed types as distinct.

@anntzer
Contributor Author

anntzer commented Nov 13, 2017

Yes, we did. But after some reflection I really can't think of a way to support that without some refactoring of other places in the codebase: there are places that expect to be able to pass a raw number in and get the same number out. So these places would need to be fixed to use a separate API (which I am not going to cook up today). Feel free to come up with a better implementation...

Meanwhile, tests for this version are in.

The patch may look big but I think you should really look at categorical.py as a mostly new implementation rather than diffing it with the previous one.

@jklymak
Member

jklymak commented Nov 13, 2017

#9736 had this same problem, but I think I largely overcame it in a continuation of that PR. Is there interest in my sharing that, at least so you can see what I did?

@anntzer
Contributor Author

anntzer commented Nov 13, 2017

Sure, if you have a solution I'm all ears...

@jklymak jklymak mentioned this pull request Nov 13, 2017
7 tasks
@jklymak
Member

jklymak commented Nov 13, 2017

So what #9776 does is create a DefaultConverter that gets assigned when data is plotted that is float64, etc. That DefaultConverter will stop other converters from being used on the same axis. However, I turned off the converter assignment in non-plotting routines. i.e. the units aren't updated, and are just None...
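The locking behaviour described above can be illustrated with a toy model. This is only a sketch of the idea, not the code in #9776: `ToyAxis`, `pick_converter` and the converter names `"default"` and `"category"` are hypothetical, and the rule that strings mean categories is an assumption made for the example.

```python
class ToyAxis:
    """Hypothetical stand-in for an axis that locks to its first converter."""

    def __init__(self):
        self.converter = None  # nothing plotted yet, so nothing is locked

    @staticmethod
    def pick_converter(data):
        # Assumption: all-string data selects a categorical converter and
        # all-numeric data selects a default (plain number) converter.
        if all(isinstance(v, str) for v in data):
            return "category"
        if all(isinstance(v, (int, float)) for v in data):
            return "default"
        raise TypeError("mixed-type input is not supported")

    def update_units(self, data):
        wanted = self.pick_converter(data)
        if self.converter is None:
            self.converter = wanted  # the first plot call locks the axis
        elif self.converter != wanted:
            raise TypeError(
                f"axis is locked to {self.converter!r}, got {wanted!r} data")
        return self.converter


axis = ToyAxis()
axis.update_units([10, 20])        # locks the axis to "default"
try:
    axis.update_units(["a", "b"])  # categorical data is now rejected
except TypeError as exc:
    print(exc)
```

In this model the outcome no longer depends on which kind of data was plotted second: whichever kind arrives first wins, and anything else raises.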

@anntzer
Contributor Author

anntzer commented Nov 13, 2017

Although this could work for categoricals (well that's basically what I mentioned above, we need to be able to distinguish cases where a raw number is just a raw number and where a raw number is actually a category), I think this is problematic for e.g. dates, where you'd want to be able to do set_xlim(date1, date2) (in fact I think we don't want to support set_xlim(num1, num2) when the axes are using a custom converter; so perhaps that points to an API where the externally exposed set_xlim goes through the converter, but we internally have (and mostly use) a _set_xlim_preconverted which does not.)
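A minimal sketch of the split proposed above, assuming a toy axes whose x values are dates stored as ordinals. `ToyDateAxes` and `_convert_xunits` are hypothetical; `_set_xlim_preconverted` is the name floated in the comment and is not known to exist in Matplotlib.

```python
from datetime import date


class ToyDateAxes:
    """Hypothetical axes whose x data are dates stored as ordinals."""

    def __init__(self):
        self._xlim = (0.0, 1.0)

    @staticmethod
    def _convert_xunits(value):
        # Assumption: the public API only accepts dates on this axes.
        if not isinstance(value, date):
            raise TypeError(f"expected a date, got {type(value).__name__}")
        return float(value.toordinal())

    def set_xlim(self, left, right):
        """Public entry point: always goes through the converter."""
        self._set_xlim_preconverted(
            self._convert_xunits(left), self._convert_xunits(right))

    def _set_xlim_preconverted(self, left, right):
        """Internal entry point: stores already-converted floats as is."""
        self._xlim = (left, right)


ax = ToyDateAxes()
ax.set_xlim(date(2017, 1, 1), date(2017, 12, 31))  # user-facing call
ax._set_xlim_preconverted(0.0, 1.0)                # what internal code would use
```

With this layout a raw number passed to the public method is rejected rather than silently reinterpreted, while internal callers can still reset limits with plain floats.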

@jklymak
Member

jklymak commented Nov 13, 2017

Well, it doesn't let you set the limits before you plot some data. It also doesn't let you specify the limits as floats if you used dates for the data. But it works fine if you plot the data first and then set the xlimits.

@anntzer
Contributor Author

anntzer commented Nov 13, 2017

Will need to look more in depth at how your case works, likely I have missed something.

@jklymak
Member

jklymak commented Nov 13, 2017

Your idea of an internal set xlim is good. I think the place where this was being set was in our friend cla(). I’m not clear there are other places where it gets set outside the auto limit setting.

@anntzer
Contributor Author

anntzer commented Nov 13, 2017

Tests should catch that (also grep for set_xlim and set_ylim and convert_{x,y,}units through the code base...)

@jklymak
Member

jklymak commented Nov 13, 2017

How about a private (?) kwarg _converter=True for set_xlim() and set_ylim()?

@anntzer
Contributor Author

anntzer commented Nov 13, 2017

I'd rather just have another function (and have the "external, API" function call the converters and pass to the internal one).

@jklymak
Member

jklymak commented Nov 13, 2017

Fair enough. Going to do it the easy way first. If it works, I'll refactor ;-)

@efiring
Member

efiring commented Nov 13, 2017

I would like to be sure I understand the "mixed types" business. Is there a concise summary anywhere, including the argument for why their support is desired? And what does "support mixed types as distinct" mean?

@jklymak
Member

jklymak commented Nov 13, 2017

Someone correct me if I'm wrong, but I think it means:

x = [1.0, 'a', datetime.datetime(2017, 1, 1), 'Hi', '1.0']

would all be considered distinct "categories" and their tick labels would default to their string representation, but not their category value (because 1.0 != '1.0'). But I'm basing that off the previous discussion not the code here...
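One way to read "distinct categories with string labels" is sketched below. It is an illustration of the semantics described above, not the implementation in this PR; `map_categories` is a hypothetical helper, and keying on the value's type is an assumption made so that equal-looking values of different types stay apart.

```python
def map_categories(values):
    """Assign consecutive integer positions in first-appearance order."""
    positions = {}
    for value in values:
        # Key on (type, value) so that 1.0 and '1.0' stay distinct, and so
        # that 1 and 1.0 (equal, with equal hashes) would not be merged.
        key = (type(value), value)
        positions.setdefault(key, len(positions))
    # Tick labels are just the string form of each distinct value.
    labels = [str(value) for (_, value) in positions]
    return positions, labels


positions, labels = map_categories([1.0, "a", "Hi", "1.0"])
print(list(positions.values()))  # [0, 1, 2, 3]
print(labels)                    # ['1.0', 'a', 'Hi', '1.0']
```

Two categories end up with the same label `'1.0'` but different positions, which is the situation the comment describes.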

@anntzer
Contributor Author

anntzer commented Nov 14, 2017

The relevant discussion starts at #9340 (comment), I think @jklymak summarized it right.

Now that I think of it I still think mixed inputs have too poorly defined semantics to make them work. What's supposed to happen with

plt.plot([10, 20])
plt.plot([10, "a"])

? Does the second call reinterpret 10 as a category and thus also 20 as a category and remap them? Is the behavior supposed to be different from

plt.plot([10, "a"])
plt.plot([10, 20])

because of which converter gets locked in for the axis? Does that mean that if you want to use categorical inputs, "mixed types are OK but you must make sure there's at least one string"? (that would be awful IMO)

@jklymak
Member

jklymak commented Nov 14, 2017

#9776 would make the second call in both examples above raise an error, if we decide to go that route.

However #9776 has disadvantages - users can't override the data translation and just plot floats if an axis is locked down to another converter.

I think for categoricals, they are always lists, so list entries could be checked for multiple types and treated as categorical? You can't have ndarrays with different types can you?

@anntzer
Contributor Author

anntzer commented Nov 14, 2017

You can have object arrays with mixed types...

@efiring
Member

efiring commented Nov 14, 2017

Thank you, that is what I thought. I still fail to understand why there should be support for "mixed types". It doesn't make sense; I don't see any advantage or compelling use case; and it makes the code unnecessarily complicated.

@jklymak
Member

jklymak commented Nov 14, 2017

I think the idea is that Pandas allows mixed-type categories. I admit the practical use case is somewhat elusive...

@efiring
Member

efiring commented Nov 14, 2017

Yes, that sounds dimly familiar. I urge that we not slavishly follow Pandas, but instead do what we think makes sense.

@anntzer
Contributor Author

anntzer commented Nov 14, 2017

Another nonsensical example with mixed types:

from pylab import *
ax0, ax1 = gcf().subplots(2)

ax0.plot([10, "a"], "r")
ax0.plot([20, "a"], "g")
ax0.plot([10, 20], "b")

# Same plots, but in a different order (but labels first appearance order stays the same).
ax1.plot([10, "a"], "r")
ax1.plot([10, 20], "b")
ax1.plot([20, "a"], "g")

show()

gives, as of master:
[image: figure_1]

It's a bit less nonsensical with this PR: [10, 20] is always interpreted as numbers and [10/20, "a"] as categoricals, but the 10 and 20 "change meaning" between the calls, which is still non-ideal, but that should be fixed by #9776 (converter locking):
[image: figure_2]

@jklymak
Member

jklymak commented Nov 14, 2017

Hmm, but if I do

from pylab import *
ax0, ax1 = gcf().subplots(2)

ax0.plot([10, "10"], "r")

[image: fig1]

I thought the point was these would be two categories, though their label would be the same...

@anntzer
Contributor Author

anntzer commented Nov 14, 2017

This is actually because they get converted to strings even before hitting the unit conversion machinery, namely by https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/axes/_base.py#L240 which calls https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/cbook/__init__.py#L2015.

Basically, there are many places in the code base that assume that you can safely call np.asarray on the user input. But asarray will convert [10, "10"] to ["10", "10"] :-(

xref numpy/numpy#6550 and issues linked therein
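The coercion mentioned above, and the object-array alternative raised earlier in the thread, can be checked directly with NumPy. This is a small illustration, assuming only that NumPy is installed:

```python
import numpy as np

mixed = [10, "10"]

# Default conversion finds a common dtype, which here is a unicode string.
as_strings = np.asarray(mixed)
print(as_strings.dtype.kind)  # 'U'
print(as_strings.tolist())    # ['10', '10']

# Requesting dtype=object keeps each element as the original Python object.
as_objects = np.asarray(mixed, dtype=object)
print(as_objects.tolist())    # [10, '10']
```

Any code path that calls `np.asarray` without `dtype=object` therefore loses the distinction between `10` and `"10"` before a unit converter ever sees the data.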

@jklymak
Member

jklymak commented Nov 14, 2017

Hmmmm: I see, plot delays the unit conversion to Line2D, but then messes with the data in _check_1d().

My tendency would be to move all that messing into the if self.command != 'plot': block, and make Line2D handle any making things ndarrays and unit conversion itself. If it's really necessary for plot and Line2D to be so special, that is...

x = self.axes.convert_xunits(x)
if by:
    y = self.axes.convert_yunits(y)
if bx:
Member

OK, but this breaks line.get_xdata() which will return the original x-data, not the converted...

Contributor Author

I can manually reattach the original data to the lines, but of course fell into another dragonhole on the way... (#9784).

Contributor Author

Reattached the unitful data manually.

@anntzer anntzer force-pushed the categorical-take-2 branch from edabcad to 59887f5 Compare December 3, 2017 03:03
@tacaswell tacaswell modified the milestones: v2.1.1, v2.2 Dec 6, 2017
@jklymak
Member

jklymak commented Jan 3, 2018

Coming back to this, I'm starting to think that categorical support should be limited to some sort of containing class (pandas data frame?).

Based on user issues, it seems that all too often folks pass strings to the plotting functions, I suppose because their ASCII converters have given them a list of strings. This used to work because we used to cast to float. It no longer works because we now call this list of strings a bunch of categories.

I'm no expert on what container should be enforced for categorical conversion, but I think there should be one so that the rest of the codebase is not so adversely affected.

@story645
Member

story645 commented Jan 3, 2018

This used to work because we used to cast to float. It no longer works because we now call this list of strings a bunch of categories.

More to the point, this was explicitly taken out in discussion with @tacaswell and @mdboom because it broke the tests. But it's also problematic in an "it does magic" sort of way that could introduce weird side effects (I have developed a deep distrust of numpy casting over the course of working on categoricals). I'm now on board with the idea that categoricals should only be strings (not mixed type), which should at least reduce the potential pitfalls.

I don't like limiting categoricals to explicit containers, because that's not how any of the other units work, so special-casing categoricals is going to introduce a whole new range of bugs. The typical use case for categoricals is probably a dict of categorical (key, value) pairs, or lists.

@jklymak
Member

jklymak commented Jan 3, 2018

As discussed on the gitter channel, I think some sort of consideration should be given to users who want ['1', '3', '2'] to map as a numerical array rather than as a list of categories. Perhaps a warning is sufficient. I somewhat suspect there are more users who have lists of strings they think are numbers than who purposely want to use categoricals, but of course I have no proof of that 😉

@anntzer
Contributor Author

anntzer commented Jan 3, 2018

I would rather educate the users to have them actually supply numbers as, well, numbers instead of strings.

Now it may well be that some deprecation policy / warning / etc is the best way to do this.

@efiring
Member

efiring commented Jan 3, 2018

@jklymak, I'm puzzled: why do you think there is a large population of users who have been feeding sequences of string representations of numbers to plot commands, and will be severely confused and distressed when the strings are suddenly interpreted as categoricals?
I'm reluctant to see mpl code made any more complicated than necessary to handle this case, and I'm not convinced that any transition warnings or deprecations are needed. How would you implement such warnings or deprecations in a way that would not interfere with a legitimate use of categoricals in which each value is a string representation of a number?

@jklymak
Member

jklymak commented Jan 3, 2018

Well, I haven't made a comprehensive list, but there have been issues since 2.1 came out of folks doing just that. How "large" the population is, is open to debate.

I was thinking that if x was a list of strings, but np.asarray(x).astype(float) didn't raise a ValueError (or whatever the safe way of doing that is), then I'd do a

_log.info('Your list of strings can be converted to a list of floats, but MPL is treating them '
          'as categories.  If you want to plot as floats, cast them as such before passing to '
          'MPL')

(wordsmith as you will).
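The check suggested above could look roughly like the following pure-Python sketch. `strings_look_numeric` and `warn_if_numeric_strings` are hypothetical helpers, it uses the built-in `float()` instead of a NumPy cast, and where such a hook would be called from is left open.

```python
import logging

_log = logging.getLogger(__name__)


def strings_look_numeric(values):
    """Return True if every item is a str that float() can parse."""
    if not values or not all(isinstance(v, str) for v in values):
        return False
    try:
        for v in values:
            float(v)
    except ValueError:
        return False
    return True


def warn_if_numeric_strings(values):
    """Log a hint when categorical input looks like stringified numbers."""
    if strings_look_numeric(values):
        _log.info(
            "Your list of strings can be converted to a list of floats, but "
            "they are being treated as categories. To plot them as floats, "
            "convert them before plotting.")
        return True
    return False


print(strings_look_numeric(["1", "3", "2"]))  # True
print(strings_look_numeric(["a", "1"]))       # False
```

Note that `float()` also accepts strings such as `"nan"` and `"1e3"`, so a real check might want to be stricter. The data is still treated as categorical either way; the message is only a hint.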

@jklymak
Member

jklymak commented Jan 3, 2018

story645 added a commit to story645/matplotlib that referenced this pull request Feb 8, 2018
@anntzer
Contributor Author

anntzer commented Feb 15, 2018

Overridden by the other categorical PR, but I haven't really followed whether the improved unit-handling code in plot (discussed privately with @tacaswell) made it in that PR (in which case feel free to close this) or if this one should be kept around until that snippet is pulled in.

@story645
Member

It wasn't pulled into that PR cause of the side effects it was having on other tests. Was hoping it could get spun off into its own PR, but also wonder if the other proposed changes to units machinery would override those changes to plot.

@tacaswell
Member

@anntzer The changes to plot broke pandas datetime handling, but I did not have time to dig into that.

@efiring
Member

efiring commented Feb 25, 2018

@anntzer If I understand correctly, it would be good to close this and open a new PR with the part that you think is still relevant.

@anntzer
Contributor Author

anntzer commented Feb 25, 2018

When I discussed this with @tacaswell, he mentioned that if this PR doesn't go in he'd take care of picking apart the relevant chunks after the other categorical fixes go in. So I'll leave that to him :-)

@story645 story645 closed this Nov 14, 2019
@jklymak
Member

jklymak commented Nov 14, 2019

If you follow stackoverflow, the confusion about strings being categories persists, particularly for folks opening spreadsheets. I'd say there is a question about it every other day.

For Matplotlib 4.0, one thing to consider is changing the units system to require more explicit direction from the user, rather than us trying to guess what the user means, i.e. ax.plot(x, y, xunits='dates', yunits=MyCustomConverter('kg')). That way, if they pass ax.plot(list_of_strings, y) and don't specify xunits='categories', they will get a TypeError.
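The explicit-units idea can be sketched outside Matplotlib as follows. This is a toy model of the proposal only: `resolve_units` is a hypothetical helper, and `xunits=` / `yunits=` keywords are part of the suggestion above, not an existing Matplotlib API.

```python
def resolve_units(data, units=None):
    """Pick a converter name from an explicit request, never by guessing."""
    if units is not None:
        return units  # the user said what they meant
    if any(isinstance(v, str) for v in data):
        # No guessing: strings without an explicit request are an error.
        raise TypeError("string data requires an explicit units='categories'")
    return None  # plain numbers need no converter


print(resolve_units([1, 2, 3]))                           # None
print(resolve_units(["a", "b"], units="categories"))      # categories
try:
    resolve_units(["1", "3", "2"])                        # ambiguous input
except TypeError as exc:
    print(exc)
```

The trade-off is the one debated in the replies: a list of numeric-looking strings fails loudly instead of being plotted as categories, at the cost of extra typing for users who do want categories.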

@timhoffm
Member

Really? You mean plt.plot(['a', 'b', 'c'], [1, 2, 3]) shouldn't be allowed anymore and I have to add a lengthy xunits='categories'? IMHO that reduces usability (at least for people who know what they are doing).

@jklymak
Member

jklymak commented Nov 14, 2019

Understood. But I still wonder if it’s better than a mysterious system that registers converters that then try to guess what to do with no user intervention at all. I don’t even think we have a way of telling the user what converter was used on an artist so they can’t even sensibly debug an unexpected result. EDIT: Actually, ax.xaxis.get_units will give a (somewhat cryptic) indication.

@story645 story645 removed this from the future releases milestone Oct 6, 2022
7 participants