Rethink categoricals. #9774

anntzer · 2017-11-13T20:23:58Z

Don't support mixed type inputs.
Don't sort keys.

Edited: I accidentally relied on Py3.6's dict ordering behavior in the previous version :-)
Made private what can be.

@story645 @tacaswell

PR Summary

PR Checklist

Has Pytest style unit tests
Code is PEP 8 compliant
New features are documented, with examples if plot related
Documentation is sphinx and numpydoc compliant
Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

tacaswell · 2017-11-13T20:33:44Z

I thought we agreed to support mixed types as distinct.

anntzer · 2017-11-13T20:54:15Z

Yes, we did. But after some reflection I really can't think of a way to support that without some refactoring of other places in the codebase: there are places that expect to be able to pass a raw number in and get the same number out. So these places would need to be fixed to use a separate API (which I am not going to cook up today). Feel free to come up with a better implementation...

Meanwhile, tests for this version are in.

The patch may look big but I think you should really look at categorical.py as a mostly new implementation rather than diffing it with the previous one.

jklymak · 2017-11-13T21:11:02Z

#9736 had this same problem, but I think I largely overcame it in a continuation of that PR. Is there interest in hearing about that, at least so you can see what I did?

anntzer · 2017-11-13T21:27:16Z

Sure, if you have a solution I'm all ears...

jklymak · 2017-11-13T21:38:44Z

So what #9776 does is create a DefaultConverter that gets assigned when data is plotted that is float64, etc. That DefaultConverter will stop other converters from being used on the same axis. However, I turned off the converter assignment in non-plotting routines. i.e. the units aren't updated, and are just None...
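For readers following along, the locking behavior described above can be pictured with a toy model. This is only a sketch under assumptions: `ToyAxis`, `ConverterLockedError`, and the string converter names are hypothetical and invented for illustration; they are not the code in #9776 or any matplotlib API.

```python
class ConverterLockedError(TypeError):
    """Raised when data needs a converter other than the one locked in."""


class ToyAxis:
    """Hypothetical stand-in for an axis that locks in its first converter."""

    def __init__(self):
        self.converter = None  # nothing plotted yet, so nothing is locked

    @staticmethod
    def _pick_converter(data):
        # Toy rule: all-string data is categorical, everything else is "default".
        if all(isinstance(v, str) for v in data):
            return "category"
        return "default"

    def plot(self, data):
        wanted = self._pick_converter(data)
        if self.converter is None:
            self.converter = wanted  # the first plotted data locks the axis
        elif self.converter != wanted:
            raise ConverterLockedError(
                f"axis is locked to {self.converter!r}, got {wanted!r} data")
        return wanted


axis = ToyAxis()
axis.plot([10.0, 20.0])          # locks the axis to the default converter
try:
    axis.plot(["a", "b"])        # data needing a second converter is refused
    refused = False
except ConverterLockedError:
    refused = True
```

In this toy, whichever kind of data arrives first decides what the axis accepts afterwards, which is the order dependence discussed later in the thread.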

anntzer · 2017-11-13T22:11:04Z

Although this could work for categoricals (well that's basically what I mentioned above, we need to be able to distinguish cases where a raw number is just a raw number and where a raw number is actually a category), I think this is problematic for e.g. dates, where you'd want to be able to do set_xlim(date1, date2) (in fact I think we don't want to support set_xlim(num1, num2) when the axes are using a custom converter; so perhaps that points to an API where the externally exposed set_xlim goes through the converter, but we internally have (and mostly use) a _set_xlim_preconverted which does not.)
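The split between an external and an internal limit setter suggested above might look roughly like the following. It is a minimal sketch, assuming a toy date converter: `ToyDateAxis` and `_convert` are hypothetical names, and only `_set_xlim_preconverted` is taken from the comment itself.

```python
from datetime import datetime


class ToyDateAxis:
    """Hypothetical axis whose converter maps datetimes to ordinal floats."""

    def __init__(self):
        self.xlim = (0.0, 1.0)

    @staticmethod
    def _convert(value):
        # Toy converter: only datetimes are accepted from the outside.
        if not isinstance(value, datetime):
            raise TypeError("this axis expects datetime limits")
        return float(value.toordinal())

    def set_xlim(self, left, right):
        """Public entry point: always goes through the converter."""
        self._set_xlim_preconverted(self._convert(left), self._convert(right))

    def _set_xlim_preconverted(self, left, right):
        """Internal entry point: takes raw floats and skips the converter."""
        self.xlim = (left, right)


ax = ToyDateAxis()
ax.set_xlim(datetime(2017, 1, 1), datetime(2017, 1, 2))
public_limits = ax.xlim
ax._set_xlim_preconverted(0.0, 10.0)  # e.g. what autoscaling code would call
```

With this shape, user-facing calls cannot pass raw numbers by accident, while internal callers that already hold converted values avoid a second conversion.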

jklymak · 2017-11-13T22:13:07Z

Well, it doesn't let you set the limits before you plot some data. It also doesn't let you specify the limits as floats if you used dates for the data. But it works fine if you plot the data first and then set the xlimits.

anntzer · 2017-11-13T22:20:54Z

Will need to look more in depth at how your case works, likely I have missed something.

jklymak · 2017-11-13T22:25:25Z

Your idea of an internal set xlim is good. I think the place where this was being set was in our friend cla(). I’m not clear there are other places where it gets set outside the auto limit setting.

anntzer · 2017-11-13T22:40:31Z

Tests should catch that (also grep for set_xlim and set_ylim and convert_{x,y,}units through the code base...)

jklymak · 2017-11-13T22:49:26Z

How about a private (?) kwarg _converter=True for set_xlim() and set_ylim()?

anntzer · 2017-11-13T22:54:29Z

I'd rather just have another function (and have the "external, API" function call the converters and pass to the internal one).

jklymak · 2017-11-13T22:57:57Z

Fair enough. Going to do it the easy way first. If it works, I'll refactor ;-)

efiring · 2017-11-13T23:48:02Z

I would like to be sure I understand the "mixed types" business. Is there a concise summary anywhere, including the argument for why their support is desired? And what does "support mixed types as distinct" mean?

jklymak · 2017-11-13T23:55:56Z

Someone correct me if I'm wrong, but I think it means:

x = [1.0, 'a', datetime.datetime(2017, 1, 1), 'Hi', '1.0']

would all be considered distinct "categories" and their tick labels would default to their string representation, but not their category value (because 1.0 != '1.0'). But I'm basing that off the previous discussion not the code here...
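One way to read "distinct categories with string labels" is sketched below. It is an illustration only, not the PR's implementation: `map_categories` is a hypothetical helper, and keying on the pair (type, value) is an assumption made here so that values which compare equal across types stay separate.

```python
from datetime import datetime


def map_categories(values):
    """Assign each distinct value an integer position in first-seen order."""
    positions = {}
    for value in values:
        # Key on (type, value) so that 1 and 1.0, which compare equal,
        # would still count as different categories.
        key = (type(value), value)
        positions.setdefault(key, len(positions))
    # Tick labels default to the string representation of each value.
    labels = [str(value) for (_, value) in positions]
    return positions, labels


x = [1.0, 'a', datetime(2017, 1, 1), 'Hi', '1.0']
positions, labels = map_categories(x)
```

Here the float `1.0` and the string `'1.0'` get different positions even though their labels are identical, which is the situation the comment describes.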

anntzer · 2017-11-14T00:07:32Z

The relevant discussion starts at #9340 (comment), I think @jklymak summarized it right.

Now that I think of it I still think mixed inputs have too poorly defined semantics to make them work. What's supposed to happen with

plt.plot([10, 20])
plt.plot([10, "a"])

? Does the second call reinterpret 10 as a category and thus also 20 as a category and remap them? Is the behavior supposed to be different from

plt.plot([10, "a"])
plt.plot([10, 20])

because of which converter gets locked in for the axis? Does that mean that if you want to use categorical inputs, "mixed types are OK but you must make sure there's at least one string"? (that would be awful IMO)

jklymak · 2017-11-14T00:14:45Z

#9776 would make both the second calls above error if we decide to go that route.

However #9776 has disadvantages - users can't override the data translation and just plot floats if an axis is locked down to another converter.

I think for categoricals, they are always lists, so list entries could be checked for multiple types and treated as categorical? You can't have ndarrays with different types can you?

anntzer · 2017-11-14T00:23:02Z

You can have object arrays with mixed types...
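A short NumPy illustration of that point, with a hypothetical `has_mixed_types` helper (not part of this PR) showing how mixed element types could be detected in either a list or an array:

```python
import numpy as np

# With dtype=object the elements keep their original Python types.
mixed = np.array([10, "a"], dtype=object)
element_types = {type(v).__name__ for v in mixed}


def has_mixed_types(data):
    """Return True if the elements of `data` are not all of one type."""
    flat = np.asarray(data, dtype=object).ravel()
    return len({type(v) for v in flat}) > 1
```

So checking only for Python lists would miss mixed-type input that arrives as an object array.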

efiring · 2017-11-14T01:34:40Z

Thank you, that is what I thought. I still fail to understand why there should be support for "mixed types". It doesn't make sense; I don't see any advantage or compelling use case; and it makes the code unnecessarily complicated.

jklymak · 2017-11-14T01:44:41Z

I think the idea is that Pandas allows mixed-type categories. I admit the practical use case is somewhat elusive...

efiring · 2017-11-14T01:46:53Z

Yes, that sounds dimly familiar. I urge that we not slavishly follow Pandas, but instead do what we think makes sense.

anntzer · 2017-11-14T07:46:13Z

Another nonsensical example with mixed types:

from pylab import *
ax0, ax1 = gcf().subplots(2)

ax0.plot([10, "a"], "r")
ax0.plot([20, "a"], "g")
ax0.plot([10, 20], "b")

# Same plots, but in a different order (but labels first appearance order stays the same).
ax1.plot([10, "a"], "r")
ax1.plot([10, 20], "b")
ax1.plot([20, "a"], "g")

show()

gives, as of master

It's a bit less nonsensical with this PR: [10, 20] is always interpreted as numbers and [10/20, "a"] as categoricals, but the 10 and 20 "change meaning" between the calls, which is still non-ideal, but that should be fixed by #9776 (converter locking):

jklymak · 2017-11-14T18:34:26Z

Hmm, but if I do

from pylab import *
ax0, ax1 = gcf().subplots(2)

ax0.plot([10, "10"], "r")

I thought the point was these would be two categories, though their label would be the same...

anntzer · 2017-11-14T19:30:01Z

This is actually because they get converted to strings even before hitting the unit conversion machinery, namely by https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/axes/_base.py#L240 which calls https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/cbook/__init__.py#L2015.

Basically, there are many places in the code base that assume that you can safely call np.asarray on the user input. But asarray will convert [10, "10"] to ["10", "10"] :-(
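The coercion can be reproduced without matplotlib. The snippet below is a small sketch that contrasts the default `np.asarray` behavior with an explicit `dtype=object`, which keeps the original elements:

```python
import numpy as np

user_input = [10, "10"]

coerced = np.asarray(user_input)                  # NumPy picks a string dtype
preserved = np.asarray(user_input, dtype=object)  # original objects are kept

print(coerced.dtype.kind, coerced.tolist())
print([type(v).__name__ for v in preserved])
```

After the default conversion the integer and the string are indistinguishable, so any later categorical logic sees a single category.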

xref numpy/numpy#6550 and issues linked therein

jklymak · 2017-11-14T20:41:40Z

Hmmmm: I see, plot delays the unit conversion to Line2D, but then messes with the data in _check_1d().

My tendency would be to move all that messing into the if self.command != 'plot': block, and make Line2D handle any making things ndarrays and unit conversion itself. If it's really necessary for plot and Line2D to be so special, that is...

jklymak · 2017-11-14T21:26:05Z

lib/matplotlib/axes/_base.py

-                    x = self.axes.convert_xunits(x)
-                if by:
-                    y = self.axes.convert_yunits(y)
+            if bx:


OK, but this breaks line.get_xdata() which will return the original x-data, not the converted...

I can manually reattach the original data to the lines, but of course fell into another dragonhole on the way... (#9784).

Reattached the unitful data manually.

Note that this disables post-hoc unit changes on Line2D.

jklymak · 2018-01-03T17:56:21Z

Coming back to this, I'm starting to think that categorical support should be limited to some sort of containing class (pandas data frame?).

Based on user issues, it seems that all too often folks pass strings to the plotting functions, I suppose because their ASCII converters have given them a list of strings. This used to work because we used to cast to float. It no longer works because we now call this list of strings a bunch of categories.

I'm no expert on what container should be enforced for categorical conversion, but I think there should be one so that the rest of the codebase is not so adversely affected.

story645 · 2018-01-03T18:12:03Z

This used to work because we used to cast to float. It no longer works because we now call this list of strings a bunch of categories.

More to the point, this was explicitly taken out in discussion with @tacaswell and @mdboom because it broke the tests. But it's also problematic in a "it does magic" sort of way that could introduce weird side effects (I have developed a deep distrust of numpy casting over the course of working on categoricals). I'm now on board with the idea that categoricals should only be strings (not mixed type), and that should at least simplify the potential pitfalls in categoricals.

I don't like limiting categoricals to explicit containers because that's not how any of the other units work, so special-casing categoricals is gonna introduce a whole new range of bugs. Also, the typical use case for categoricals is probably a dict of categorical (key, value) pairs or lists.

jklymak · 2018-01-03T18:49:06Z

As discussed on the gitter channel, I think some sort of consideration should be given to users who want ['1', '3', '2'] to map as a numerical array versus list of Categories. Perhaps a warning is sufficient. I somewhat suspect there are more users who have lists of strings they think are numbers than who purposely want to use categoricals, but of course I have no proof of that 😉

anntzer · 2018-01-03T19:02:48Z

I would rather educate the users to have them actually supply numbers as, well, numbers instead of strings.

Now it may well be that some deprecation policy / warning / etc is the best way to do this.

efiring · 2018-01-03T21:33:35Z

@jklymak, I'm puzzled: why do you think there is a large population of users who have been feeding sequences of string representations of numbers to plot commands, and will be severely confused and distressed when the strings are suddenly interpreted as categoricals?
I'm reluctant to see mpl code made any more complicated than necessary to handle this case, and I'm not convinced that any transition warnings or deprecations are needed. How would you implement such warnings or deprecations in a way that would not interfere with a legitimate use of categoricals in which each value is a string representation of a number?

jklymak · 2018-01-03T21:41:11Z

Well, I haven't made a comprehensive list, but there have been issues since 2.1 came out of folks doing just that. How "large" the population is, is open to debate.

I was thinking that if x was a list of strings, but np.asarray(x).astype(float) didn't raise a ValueError (or whatever the safe way of doing that is), then I'd do a

_log.info('Your list of strings can be converted to a list of floats, but MPL is treating them '
                'as categories.  If you want to plot as floats, cast them as such before passing to '
                'MPL')

(wordsmith as you will).
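A pure-Python version of that check could look like the sketch below. The helper names `strings_look_numeric` and `note_if_numeric_strings` are hypothetical, and using `float()` per element instead of a NumPy cast is a choice made here for illustration, not something proposed in the thread.

```python
import logging

_log = logging.getLogger(__name__)


def strings_look_numeric(values):
    """Return True if every element is a str that float() can parse."""
    if not values or not all(isinstance(v, str) for v in values):
        return False
    try:
        for v in values:
            float(v)
    except ValueError:
        return False
    return True


def note_if_numeric_strings(values):
    """Log a hint when categorical strings could have been plotted as floats."""
    if strings_look_numeric(values):
        _log.info("These strings can be converted to floats but are being "
                  "treated as categories; cast them to float before plotting "
                  "if numeric positions are wanted.")
        return True
    return False
```

Note that this only informs the user; the strings would still be handled as categories, so a legitimate categorical use of numeric-looking strings keeps working.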

jklymak · 2018-01-03T21:51:01Z

Since 30 Dec:
https://stackoverflow.com/questions/48083936/y-axis-not-properly-sorted-matplotlib
https://stackoverflow.com/questions/48063998/matplotlib-numbers-in-xaxis-are-shown-disordered
https://stackoverflow.com/questions/48039312/matplotlib-order-of-the-x-axis-is-wrong
https://stackoverflow.com/questions/48031991/y-axis-not-aligned-by-their-values-in-matplotlib

We can argue about how hard these folks worked to solve their problem, but I think the current behaviour is mysterious w/o any error message...

anntzer · 2018-02-15T10:17:32Z

Overridden by the other categorical PR, but I haven't really followed whether the improved unit-handling code in plot (discussed privately with @tacaswell) made it in that PR (in which case feel free to close this) or if this one should be kept around until that snippet is pulled in.

story645 · 2018-02-15T16:12:09Z

It wasn't pulled into that PR cause of the side effects it was having on other tests. Was hoping it could get spun off into its own PR, but also wonder if the other proposed changes to units machinery would override those changes to plot.

tacaswell · 2018-02-15T18:49:35Z

@anntzer The changes to plot broke pandas datetime handling, but I did not have time to dig down into that.

efiring · 2018-02-25T22:03:24Z

@anntzer If I understand correctly, it would be good to close this and open a new PR with the part that you think is still relevant.

anntzer · 2018-02-25T22:41:03Z

When I discussed this with @tacaswell, he mentioned that if this PR doesn't go in he'd take care of picking apart the relevant chunks after the other categorical fixes go in. So I'll leave that to him :-)

jklymak · 2019-11-14T17:00:37Z

If you follow stackoverflow, the confusion about strings being categories persists, particularly for folks opening spreadsheets. I'd say there is a question about it every other day.

For Matplotlib 4.0, one thing to consider is changing the units system to require more explicit direction from the user, rather than us trying to guess what the user means, i.e. ax.plot(x, y, xunits='dates', yunits=MyCustomConverter('kg')). That way if they pass ax.plot(list_of_strings, y) and don't specify xunits='categories' they will get a TypeError.
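To make the proposal concrete, here is a toy dispatch function with that behavior. It is a sketch under assumptions: `resolve_units` and its `units` argument are hypothetical stand-ins for the suggested `xunits` keyword, and no such keyword exists in matplotlib's `plot` today.

```python
def resolve_units(data, units=None):
    """Convert `data` to positions using only an explicit `units` choice."""
    has_strings = any(isinstance(v, str) for v in data)
    if has_strings and units != "categories":
        # No guessing: strings are only accepted when the caller asks for it.
        raise TypeError("string data requires units='categories'")
    if units == "categories":
        # Map each distinct value to its first-seen integer position.
        order = {}
        for v in data:
            order.setdefault(v, len(order))
        return [order[v] for v in data]
    return [float(v) for v in data]
```

Under this scheme a list of strings read from a spreadsheet fails loudly instead of silently becoming categories, at the cost of the extra keyword for intentional categorical plots.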

timhoffm · 2019-11-14T19:21:51Z

Really? You mean plt.plot(['a', 'b', 'c'], [1, 2, 3]) shouldn't be allowed anymore and I have to add a lengthy xunits='categories'? IMHO that reduces usability (at least for people who know what they are doing).

jklymak · 2019-11-14T19:45:46Z

Understood. But I still wonder if it’s better than a mysterious system that registers converters that then try to guess what to do with no user intervention at all. ~~I don’t even think we have a way of telling the user what converter was used on an artist so they can’t even sensibly debug an unexpected result.~~ EDIT: Actually, ax.xaxis.get_units will give a (somewhat cryptic) indication.

anntzer force-pushed the categorical-take-2 branch from f1f1919 to 251587a Compare November 13, 2017 20:51

jklymak mentioned this pull request Nov 13, 2017

WIP: Lockout new converters Part 2 #9776

Closed

7 tasks

dstansby added the topic: categorical label Nov 14, 2017

story645 mentioned this pull request Nov 14, 2017

Categorical: Unsorted, String only, fix overwrite bug #9783

Merged

jklymak reviewed Nov 14, 2017

anntzer added 5 commits December 2, 2017 19:01

Actually we don't need to store the unit data ourselves.

6589adc

Force unit conversion before call to asarray().

5dfe4d1

Note that this disables post-hoc unit changes on Line2D.

Reattach the unitful data to Line2D.

beef2d3

Fixes based on phone discussion.

07cb765

Deprecate some more stuff we don't need.

59887f5

anntzer force-pushed the categorical-take-2 branch from edabcad to 59887f5 Compare December 3, 2017 03:03

tacaswell modified the milestones: v2.1.1, v2.2 Dec 6, 2017

story645 mentioned this pull request Jan 3, 2018

Error Handling of Non-Ints/Floats for postion of xticks #10147

Closed

story645 added a commit to story645/matplotlib that referenced this pull request Feb 8, 2018

units deprecated per matplotlib#9774

ff61b89

anntzer added the status: duplicate label Feb 15, 2018

anntzer removed the status: duplicate label Feb 15, 2018

story645 added the topic: units and array ducktypes label Feb 15, 2018

story645 closed this Nov 14, 2019

story645 removed this from the future releases milestone Oct 6, 2022
