Rethink categoricals. #9774
I thought we agreed to support mixed types as distinct. |
f1f1919 to 251587a
Yes, we did. But after some reflection I really can't think of a way to support that without some refactoring of other places in the codebase: there are places that expect to be able to pass a raw number in and get the same number out. So these places would need to be fixed to use a separate API (which I am not going to cook up today). Feel free to come up with a better implementation... Meanwhile, tests for this version are in. The patch may look big but I think you should really look at categorical.py as a mostly new implementation rather than diffing it with the previous one. |
#9736 had this same problem, but I think I largely overcame it in a continuation of that PR. Is there interest in my hearing that, at least so you can see what I did? |
Sure, if you have a solution I'm all ears... |
So what #9776 does is create a |
Although this could work for categoricals (well that's basically what I mentioned above, we need to be able to distinguish cases where a raw number is just a raw number and where a raw number is actually a category), I think this is problematic for e.g. dates, where you'd want to be able to do |
Well, it doesn't let you set the limits before you plot some data. It also doesn't let you specify the limits as floats if you used dates for the data. But it works fine if you plot the data first and then set the xlimits. |
Will need to look more in depth at how your case works, likely I have missed something. |
Your idea of an internal set xlim is good. I think the place where this was being set was in our friend cla(). I’m not clear there are other places where it gets set outside the auto limit setting. |
tests should catch that (also grep for |
How about a private (?) kwarg |
I'd rather just have another function (and have the "external, API" function call the converters and pass to the internal one). |
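The split suggested here (a public function that runs the converters and hands off to an internal one that takes raw numbers) could look roughly like the sketch below. `ToyAxis`, `set_lim`, and `_set_lim` are hypothetical names for illustration only, not Matplotlib's actual API.

```python
class ToyAxis:
    """Hypothetical stand-in for an axis; not Matplotlib's real API."""

    def __init__(self, converter=None):
        # converter: callable mapping a user-facing value to a number
        self._converter = converter
        self._limits = (0.0, 1.0)

    def set_lim(self, lo, hi):
        """Public entry point: convert user-facing values, then delegate."""
        if self._converter is not None:
            lo, hi = self._converter(lo), self._converter(hi)
        self._set_lim(lo, hi)

    def _set_lim(self, lo, hi):
        """Internal entry point: expects raw numbers and never converts."""
        self._limits = (float(lo), float(hi))


# Assumed category mapping, for illustration only.
categories = {"a": 0, "b": 1, "c": 2}
axis = ToyAxis(converter=categories.__getitem__)

axis.set_lim("a", "c")          # goes through the converter
external_limits = axis._limits

axis._set_lim(10, 20)           # internal callers pass raw numbers untouched
internal_limits = axis._limits
```

With this layout, internal code paths that need "raw number in, same number out" call `_set_lim` directly, so a raw number is never reinterpreted as a category.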
Fair enough. Going to do it the easy way first. If it works, I'll refactor ;-) |
I would like to be sure I understand the "mixed types" business. Is there a concise summary anywhere, including the argument for why their support is desired? And what does "support mixed types as distinct" mean? |
Someone correct me if I'm wrong, but I think it means: x = [1.0, 'a', datenum.datenum(2017, 1, 1), 'Hi', '1.0'] would all be considered distinct "categories" and their tick labels would default to their string representation, but not their category value (because 1.0 != '1.0'). But I'm basing that off the previous discussion not the code here... |
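A minimal sketch of that reading of "mixed types as distinct", in plain Python rather than the code in this PR. `map_categories` is a hypothetical helper, and `datetime.date` stands in for the date object in the example above.

```python
from datetime import date


def map_categories(values):
    """Assign each distinct value an integer position in first-seen order."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping


x = [1.0, "a", date(2017, 1, 1), "Hi", "1.0"]
mapping = map_categories(x)

# Tick labels default to the string representation of each category.
labels = [str(v) for v in mapping]
```

Here `1.0` and `'1.0'` end up as two separate categories because they compare unequal, even though both get the tick label `1.0`.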
The relevant discussion starts at #9340 (comment), I think @jklymak summarized it right. Now that I think of it I still think mixed inputs have too poorly defined semantics to make them work. What's supposed to happen with
? Does the second call reinterpret 10 as a category and thus also 20 as a category and remap them? Is the behavior supposed to be different from
because of which converter gets locked in for the axis? Does that mean that if you want to use categorical inputs, "mixed types are OK but you must make sure there's at least one string"? (that would be awful IMO) |
#9776 would make both the second calls above error if we decide to go that route. However #9776 has disadvantages - users can't override the data translation and just plot floats if an axis is locked down to another converter. I think for categoricals, they are always lists, so list entries could be checked for multiple types and treated as categorical? You can't have ndarrays with different types can you? |
You can have object arrays with mixed types... |
Thank you, that is what I thought. I still fail to understand why there should be support for "mixed types". It doesn't make sense; I don't see any advantage or compelling use case; and it makes the code unnecessarily complicated. |
I think the idea is that Pandas allows mixed-type categories. I admit the practical use case is somewhat elusive... |
Yes, that sounds dimly familiar. I urge that we not slavishly follow Pandas, but instead do what we think makes sense. |
Another nonsensical example with mixed types:
It's a bit less nonsensical with this PR: [10, 20] is always interpreted as numbers and [10/20, "a"] as categoricals, but the 10 and 20 "change meaning" between the calls, which is still non-ideal, but that should be fixed by #9776 (converter locking): |
This is actually because they get converted to strings even before hitting the unit conversion machinery, namely by https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/axes/_base.py#L240 which calls https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/cbook/__init__.py#L2015. Basically, there are many places in the code base that assume that you can safely call xref numpy/numpy#6550 and issues linked therein |
Hmmmm: I see, My tendency would be to move all that messing into the |
x = self.axes.convert_xunits(x)
if by:
y = self.axes.convert_yunits(y)
if bx:
OK, but this breaks line.get_xdata(), which will return the original x-data, not the converted...
I can manually reattach the original data to the lines, but of course fell into another dragonhole on the way... (#9784).
Reattached the unitful data manually.
Note that this disables post-hoc unit changes on Line2D.
edabcad to 59887f5
Coming back to this, I'm starting to think that categorical support should be limited to some sort of containing class (pandas data frame?). Based on user issues, it seems that all too often folks pass strings to the plotting functions, I suppose because their ASCII converters have given them a list of strings. This used to work because we used to cast to float. It no longer works because we now call this list of strings a bunch of categories. I'm no expert on what container should be enforced for categorical conversion, but I think there should be one so that the rest of the codebase is not so adversely affected. |
More to the point, this was explicitly taken out in discussion with @tacaswell and @mdboom because it broke the tests. But it's also problematic in a "it does magic" sort of way that could introduce weird side effects (I have developed a deep distrust of numpy casting over the course of working on categoricals). I'm now on board with the idea that categoricals should only be strings (not mixed type), and that should at least simplify the potential pitfalls in categoricals. I don't like limiting categoricals to explicit containers because that's not how any of the other units work, so special-casing categoricals is going to introduce a whole new range of bugs, and the typical use case for categoricals is probably a dict of categorical (key, value) pairs or lists. |
As discussed on the gitter channel, I think some sort of consideration should be given to users who want |
I would rather educate the users to have them actually supply numbers as, well, numbers instead of strings. Now it may well be that some deprecation policy / warning / etc is the best way to do this. |
@jklymak, I'm puzzled: why do you think there is a large population of users who have been feeding sequences of string representations of numbers to plot commands, and will be severely confused and distressed when the strings are suddenly interpreted as categoricals? |
Well, I haven't made a comprehensive list, but there have been issues since 2.1 came out of folks doing just that. How "large" the population is, is open to debate. I was thinking that if x was a list of strings, but _log.info('Your list of strings can be converted to a list of floats, but MPL is treating them '
'as categories. If you want to plot as floats, cast them as such before passing to '
'MPL') (wordsmith as you will). |
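One way the check behind such a message might be written, as a sketch only. `strings_look_numeric` and `note_numeric_strings` are hypothetical helpers, not functions that exist in Matplotlib; the message text is the one proposed in the comment above.

```python
import logging

_log = logging.getLogger(__name__)


def strings_look_numeric(values):
    """Return True if every item is a str that float() can parse."""
    if not values:
        return False
    for v in values:
        if not isinstance(v, str):
            return False
        try:
            float(v)
        except ValueError:
            return False
    return True


def note_numeric_strings(values):
    """Log a hint when categorical input could have been plotted as floats."""
    if strings_look_numeric(values):
        _log.info('Your list of strings can be converted to a list of floats, '
                  'but MPL is treating them as categories. If you want to plot '
                  'as floats, cast them as such before passing to MPL')
        return True
    return False
```

The data would still be treated as categories; the helper only reports that a float interpretation was available, which leaves the decision with the user.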
Since 30 Dec: We can argue about how hard these folks worked to solve their problem, but I think the current behaviour is mysterious w/o any error message... |
Overridden by the other categorical PR, but I haven't really followed whether the improved unit-handling code in |
It wasn't pulled into that PR cause of the side effects it was having on other tests. Was hoping it could get spun off into its own PR, but also wonder if the other proposed changes to units machinery would override those changes to plot. |
@anntzer The changes to |
@anntzer If I understand correctly, it would be good to close this and open a new PR with the part that you think is still relevant. |
When I discussed this with @tacaswell, he mentioned that if this PR doesn't go in he'd take care of picking apart the relevant chunks after the other categorical fixes go in. So I'll leave that to him :-) |
If you follow stackoverflow, the confusion about strings being categories persists, particularly for folks opening spreadsheets. I'd say there is a question about it every other day. For Matplotlib 4.0, one thing to consider is changing the units system to require more explicit direction from the user, rather than us trying to guess what the user means. ie. |
Really? you mean |
Understood. But I still wonder if it’s better than a mysterious system that registers converters that then try to guess what to do with no user intervention at all. |
Don't support mixed type inputs.
Don't sort keys.
Edited: I accidentally relied on Py3.6's dict ordering behavior in the previous version :-)
Made private what can be.
@story645 @tacaswell