Don't sort categorical keys. #9318

anntzer · 2017-10-08T07:10:45Z

PR Summary

I may have missed something in the categorical PR discussion but if I write

plt.bar(["foo", "bar", "quux"], [1, 2, 3])

I expect the categories to appear in the order I explicitly give, not in alphabetical order...

Milestoning this to 2.1.1 as, in case we agree that's the correct behavior, I'd like to minimize the time the sorting behavior is present in the wild...

attn @story645
xref #9312

PR Checklist

Has Pytest style unit tests
Code is PEP 8 compliant
New features are documented, with examples if plot related
Documentation is sphinx and numpydoc compliant
Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

story645 · 2017-10-09T04:57:25Z

lib/matplotlib/category.py

@@ -86,32 +83,28 @@ class UnitData(object):
    spdict = {'nan': -1.0, 'inf': -2.0, '-inf': -3.0}


This can be dropped too. It was sort of an artifact of trying to conform to how pandas unique handles categoricals and forces a sorting on 'nan', 'inf', and '-inf' that doesn't make sense if the aim is to be data agnostic.

story645 · 2017-10-09T04:59:10Z

lib/matplotlib/category.py

+
+    def update(self, data):
+        data = np.atleast_1d(shim_array(data))
+        sorted_unique = list(OrderedDict(zip(data, itertools.repeat(None))))


does this need to be cast as a list? Also, since you're just using OrderedDict for the uniqueness, the following also works (dunno which is more efficient though):

_, idx = np.unique(data, return_index=True)
sorted_unique = data[np.sort(idx)]

Another option that I like is:

OrderedDict.fromkeys(data)
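
For comparison, here is a minimal sketch of the two order-preserving options mentioned above next to plain np.unique, which sorts. The sample data is made up for illustration and the snippet is not taken from the PR itself.

```python
from collections import OrderedDict

import numpy as np

# Made-up sample with repeated categories.
data = np.array(["foo", "bar", "quux", "bar", "foo"])

# Hashing approach: keys keep the order in which they are first seen.
via_dict = list(OrderedDict.fromkeys(data.tolist()))

# numpy approach: np.unique sorts, so recover first-appearance order
# by sorting the indices of the first occurrences.
_, idx = np.unique(data, return_index=True)
via_numpy = data[np.sort(idx)].tolist()

# Plain np.unique returns the categories in sorted order.
sorted_unique = np.unique(data).tolist()

print(via_dict)       # ['foo', 'bar', 'quux']
print(via_numpy)      # ['foo', 'bar', 'quux']
print(sorted_unique)  # ['bar', 'foo', 'quux']
```

Both order-preserving variants agree; only the plain np.unique call reorders the categories alphabetically.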

story645 · 2017-10-09T05:02:00Z

lib/matplotlib/category.py

+        data = np.atleast_1d(shim_array(data))
+        sorted_unique = list(OrderedDict(zip(data, itertools.repeat(None))))
+        for s in sorted_unique:
+            if s in self.seq:


if spdict gets dropped, this all reduces to:

for s in sorted_unique:
    if s not in self.seq:
        self.seq.append(s)
        self.locs.append(next(self._counter))
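
As a rough illustration of how that reduced loop could sit inside a class, here is a self-contained sketch. The class name OrderedUnitData and its constructor are hypothetical and much simpler than the real UnitData; only seq, locs, and _counter come from the snippet above.

```python
import itertools
from collections import OrderedDict


class OrderedUnitData:
    """Hypothetical, simplified stand-in for UnitData (not matplotlib's code)."""

    def __init__(self, data=()):
        self.seq = []                      # category labels, in first-seen order
        self.locs = []                     # integer location for each label
        self._counter = itertools.count()  # next free location
        self.update(data)

    def update(self, data):
        # OrderedDict.fromkeys drops duplicates but keeps first-seen order.
        for s in OrderedDict.fromkeys(data):
            if s not in self.seq:
                self.seq.append(s)
                self.locs.append(next(self._counter))


units = OrderedUnitData(["foo", "bar", "quux"])
units.update(["bar", "baz"])  # known labels keep their locations
print(units.seq)   # ['foo', 'bar', 'quux', 'baz']
print(units.locs)  # [0, 1, 2, 3]
```

Under these assumptions an update only appends new labels, so existing locations stay stable across calls.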

anntzer · 2017-10-09T05:48:17Z

I was kind of confused why you would bother with spdict, but yes, if you're willing to drop it then things become even simpler. OrderedDict (which is basically O(n) due to hashing) should be at least asymptotically faster than np.unique (which uses sorting, so O(n log(n))) but in practice I don't think we'll ever have more than a few dozen categories so I doubt it's going to matter.
Casting to list is indeed unneeded.

Mostly I wanted to know whether you're fine with not sorting the categories (which I definitely think is the "correct" behavior, but perhaps you had a reason to sort them).

story645 · 2017-10-09T06:01:29Z

There was a ton of back and forth on sorting and honestly it may have boiled down to that being the default behavior for unique. Or trying to conform to pandas.factorize, which uses unique under the hood. The latter is definitely where spdict came from.

I'm pretty sure I wrote code both with and without sort - somewhere down the task list was an idea of giving users the option to choose whether they wanted the data sorted or not, but that doesn't really fit with how the data is handled and so I don't think it really makes sense anymore as a to do.

ETA: Also personal preference is for the numpy approach just 'cause I think that's clearer from a self documenting point of view whereas you're sorta using a side effect of OrderedDict. The OrderedDict approach I'm liking is OrderedDict.fromkeys(data)

anntzer · 2017-10-09T06:48:12Z

OK, I just bit the bullet and rewrote the whole thing. I dropped special support for nan/inf and the related tests.
It's backwards incompatible but I honestly think the whole thing is much clearer with this PR, so perhaps (if @story645 agrees with the general approach, of course) @tacaswell would be willing to consider putting this (or similar) in 2.1.1 breaking backcompat with 2.1.0 given that category support is brand new (plus, not sorting and not handling nan/inf specially is already backwards incompatible so...).

story645 · 2017-10-09T07:22:25Z

My major concern is that you seem to be trying to preserve numbers in the mapping, and I don't see how that's possible without then explicitly checking for conflicts in the mapping. And this doesn't work at all if you want to preserve locations on updates. Anything that hits this code needs to be treated as/converted into a string.

I'm also massively confused as to how a mapping dict is gonna yield the sequence and labels correctly.

anntzer · 2017-10-09T07:45:19Z

Ah. Well, convert must preserve numbers (and your original implementation does) because it will get called on the axes limits as well (you can check that). So I thought everything else also needed to, but I misunderstood it.

However, I doubt that "cast floats to string" is a really workable approach, exactly because of the confusion between an "actual" float (typically, an axis limit) that needs to be converted vs a float-that-needs-to-be-cast-to-string. For example, as of 2.1.0,

bar(["a", 1], [4, 6], width=1); gca().margins(0)

gives a blank plot (likely because of such a collision, although I haven't really investigated), whereas

bar(["a", 1], [4, 6], width=1.01); gca().margins(0)

or

bar(["a", 1], [4, 6], width=0.99); gca().margins(0)

work "just fine".

Frankly I'm tempted to say, you don't get to pass numbers as categories (too confusing), just strings (and possibly bytes).

story645 · 2017-10-09T08:04:56Z

> Frankly I'm tempted to say, you don't get to pass numbers as categories (too confusing), just strings (and possibly bytes).

What is a string? From the perspective of a user, "abc", 'abc', 🖌 are all strings. And the issue isn't numbers as categories, it's that it's really hard to check for mixed type inputs like ["ocean", "cloud", -99999]. It's a really painful use case (in fact, it's the only use case that triggers the numpy errors for which there are the crazy shims), but I think it's an important one.

My original implementation always does a lookup and then replaces the value with its lookup:

for lab, loc in vmap.items():
    vals[vals == lab] = loc

I agree with you there are likely better ways to do this - I tried the dict approach and for some reason it would constantly break somewhere down the line.

> likely because of such a collision, although I haven't really investigated

That seems like a bug worth fixing, but I don't know that it necessitates a complete reworking of categorical support.

anntzer · 2017-10-09T08:16:55Z

Your lookup approach effectively passes unknown float values through (they get converted to str, then they are never hit in the vmap.items() loop, and get converted back into floats).
As I mentioned, the axes limits will get passed through the converter (just add a print and check for yourself), and these should be passed through (otherwise you get problems like the one I mentioned above).
Another example would be

plot(["!", 0], [1, 2])

which plots a vertical line as of 2.1 whereas

plot(["a", 0], [1, 2])

"works" (plots a slanted line). This is because "!" comes before "0" in ASCII order, so you again have a key collision.
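
The collision can be reproduced outside matplotlib with a simplified model of the lookup-and-replace loop quoted earlier. The function sketch_convert below is hypothetical: it assumes every input is cast to str, that categories are sorted, and that the mapping is applied in that sorted order. It is a sketch of the mechanism, not the actual 2.1 converter.

```python
import numpy as np


def sketch_convert(values):
    """Hypothetical model of the lookup-and-replace conversion (not matplotlib's code)."""
    vals = np.array([str(v) for v in values])   # everything is cast to str
    seq = sorted(set(vals.tolist()))            # sorted categories (assumed)
    vmap = {lab: loc for loc, lab in enumerate(seq)}
    for lab, loc in vmap.items():
        # The location is written back into a string array as text, so a
        # later iteration can mistake it for a category label.
        vals[vals == lab] = str(loc)
    return vals.astype(float).tolist()


print(sketch_convert(["!", 0]))  # [1.0, 1.0]: both points land on the same x
print(sketch_convert(["a", 0]))  # [1.0, 0.0]: distinct x positions
```

In the first call "!" is replaced by the text "0", which the next iteration then treats as the category "0" and maps to 1. In the second call "0" sorts before "a" and happens to map to itself, so no collision shows up.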

I'm going to mark this as release critical because we need at least to make sure we can agree on well defined semantics before we start getting an avalanche of bug reports from some users while other users start relying on imprecise semantics that will take forever to deprecate.

story645 · 2017-10-09T14:26:38Z

So I'm a little bit confused as to why they're not hitting the lookups since they're encoded in the lookups:

ax.xaxis.unit_data.locs
ax.xaxis.unit_data.seq

prints:

[0, 1]
['!', '0']

this also doesn't work in the same way and really should:

ax.plot(["!", "0"], [1, 2])

The issue seems to be on numeric inputs, not on whether they're strings or not. @tacaswell is this possibly related to that "feature" we thought we had removed about automatically converting strings back to numbers?

And the mixed type categorical come from pandas, and I really think our semantics need to be consistent with theirs.

story645 · 2017-10-09T18:17:37Z

Going a bit further down this rabbit hole:

fig, ax = plt.subplots()
ax.set_xlim([-1,2])
ax.set_ylim([0,2])
ax.plot(["!", "0"], [1, 2])

yields:

So the labeling is in the right place, but the values aren't. Ok, now I get what you mean about the convertor ignoring values.

tacaswell · 2017-10-09T22:17:03Z

If I recall correctly, the semantics from pandas is that integers are supposed to map to the categorical that maps to them so plt.plot(['a', 0], [1, 2]) is the case that is buggy, that is in all cases it should yield a vertical line as ['a', 0] should be equivalent to ['a', 'a'] (but I am being told on gitter that this is wrong).

I am not sure about the sorting either. On one hand, if all we get from the user is a bunch of strings, there is no clear indication that they are ordered categoricals, so sorting them alphabetically seems like a sensible thing to do. This is particularly true for the use case of plt.plot(d.keys(), d.values()), whose order will vary from process to process on half of the versions of Python we support. On the other hand, the most common use case for this is probably one where the user is providing semantic information through the order in which the values come in, so we should not sort. Can we do unit dispatching based on the type of the container the data comes in in?

anntzer · 2017-10-09T22:19:09Z

I would honestly just say "if you pass in unordered data you get a plot in the order of whatever __iter__ gives"... a.k.a. don't pass a plain dict.

tacaswell · 2017-10-09T22:25:57Z

In either case, preserving np.nan as a way to mark invalid data seems useful.

Other than changing the behavior from sorted -> the order they are seen in I do not understand what back compatibility we have to break.

jklymak · 2017-10-09T22:27:18Z

The last makes the most sense to me. If you pass an unordered iterable (dictionary) you can't expect the output to be consistent. But I'd expect things to be plotted in the order I give them in other iterables.
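
One way a caller could make the order explicit before plotting, sketched under the assumption that categories are drawn in the order given. The dictionary contents are made up for illustration.

```python
from collections import OrderedDict

# Made-up data; in the Python versions where plain dicts are unordered,
# the iteration order of this mapping is not something to rely on.
counts = {"foo": 1, "bar": 2, "quux": 3}

# Fix the order explicitly (here: alphabetical by key) before plotting.
ordered = OrderedDict(sorted(counts.items()))
keys = list(ordered.keys())
values = list(ordered.values())

print(keys)    # ['bar', 'foo', 'quux']
print(values)  # [2, 1, 3]
```

The resulting keys and values lists could then be handed to a call such as plt.bar(keys, values), so the plot order no longer depends on how the mapping iterates.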

anntzer · 2017-10-09T22:46:46Z

nan is not used as a N/A marker right now, it's just the "nan" string (which is forced to be at the left).
plot(["a", "nan", "b"], [0, 1, 2])
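
To make the "forced to be at the left" behavior concrete, here is a hedged sketch of how a fixed table like spdict could pin those strings to negative locations while the other categories are numbered from zero in sorted order. The helper sketch_locations is hypothetical and only models the idea; it is not the code in category.py.

```python
# Same values as the spdict shown in the diff above.
SPDICT = {"nan": -1.0, "inf": -2.0, "-inf": -3.0}


def sketch_locations(categories):
    """Hypothetical model: special strings get fixed negative locations."""
    locs = {}
    next_loc = 0
    for label in sorted(set(categories)):
        if label in SPDICT:
            locs[label] = SPDICT[label]   # pinned to the left of 0
        else:
            locs[label] = next_loc        # ordinary categories count up from 0
            next_loc += 1
    return locs


locs = sketch_locations(["a", "nan", "b"])
print(locs)  # {'a': 0, 'b': 1, 'nan': -1.0}
```

In this model the string "nan" is an ordinary category label that happens to get location -1, which is why it shows up as a tick at the far left rather than as a dropped point.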

story645 · 2017-10-09T22:48:26Z

> Can we do unit dispatching based on the type of the container the data comes in in?

Kinda? When it's anything more structured than a list, it gets handled in preprocess_data.

We could in theory do something like plot(OrderedDict) and that could hit a convertor...but it would mean breaking the x, y semantics.

Preserving nan hopefully won't be difficult; what do you want to do about inf and -inf? Changing to preserving order should hopefully be trivial, I just want #9340 to go through first.

anntzer · 2017-10-09T22:50:59Z

I would not do anything about nan/inf.
To be honest I don't understand what would "missing" as a category mean, and thus why we should try to handle it specially (if the user has a category named "missing" next to one named "a", "b" and "c", they can easily put it themselves at the beginning or the end).

tacaswell · 2017-10-09T23:04:06Z

The same way we handle it with plot and friends which is to just drop that point, but given that it does not currently do that (I was confused 🐑 ) forget I said anything about it.

tacaswell · 2017-11-20T23:41:36Z

This is superseded by #9774 and/or #9783

@anntzer I am not going to look at this PR while reconciling all of the categorical PRs.

ImportanceOfBeingErnest · 2018-01-22T20:59:49Z

Is there any news on this?

I would argue that as long as one cannot specify the order of the categories on an axes, the whole categorical support is pretty much unusable in real world applications.

I think one can easily communicate to people that using nans or mixed type categories is not supported. However, it is rather hard to see why a list of strings would suddenly change its order when being plotted.

jklymak · 2018-01-22T21:02:45Z

See #10212 for the current state of this...

Don't sort categorical keys.

70df945

anntzer added the topic: categorical label Oct 8, 2017

anntzer added this to the 2.1.1 (next bug fix release) milestone Oct 8, 2017

story645 reviewed Oct 9, 2017

Rewrite category.py.

142b78b

anntzer force-pushed the dont-sort-categorical-keys branch from b068739 to 142b78b on October 9, 2017 06:44

WIP: try to make category support saner...

a607fe6

anntzer added the Release critical For bugs that make the library unusable (segfaults, incorrect plots, etc) and major regressions. label Oct 9, 2017

story645 mentioned this pull request Oct 9, 2017

Integer Categorical Values Not Getting Mapped Correctly #9336

Closed

tacaswell closed this Nov 20, 2017

anntzer deleted the dont-sort-categorical-keys branch November 20, 2017 23:44

anntzer restored the dont-sort-categorical-keys branch July 18, 2018 11:31

anntzer deleted the dont-sort-categorical-keys branch July 18, 2018 11:32
