Don't sort categorical keys. #9318
lib/matplotlib/category.py
Outdated
@@ -86,32 +83,28 @@ class UnitData(object):
spdict = {'nan': -1.0, 'inf': -2.0, '-inf': -3.0}
This can be dropped too. It was sort of an artifact of trying to conform to how pandas unique handles categoricals and forces a sorting on 'nan', 'inf', and '-inf' that doesn't make sense if the aim is to be data agnostic.
lib/matplotlib/category.py
Outdated
def update(self, data):
    data = np.atleast_1d(shim_array(data))
    sorted_unique = list(OrderedDict(zip(data, itertools.repeat(None))))
Does this need to be cast as a list? Also, since you're only using OrderedDict for uniqueness, the following also works (I don't know which is more efficient, though):
_, idx = np.unique(data, return_index=True)
sorted_unique = data[np.sort(idx)]
Another option that I like is:
OrderedDict.fromkeys(data)
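For reference, a minimal sketch comparing the two suggestions above on a small sample. The function names `unique_in_order_numpy` and `unique_in_order_dict` are made up for illustration and are not part of matplotlib.

```python
from collections import OrderedDict

import numpy as np


def unique_in_order_numpy(data):
    """Unique values in first-seen order, via np.unique's first-occurrence indices."""
    data = np.asarray(data)
    # np.unique sorts the values; return_index gives where each first appeared.
    _, idx = np.unique(data, return_index=True)
    # Sorting the indices restores the original encounter order.
    return data[np.sort(idx)].tolist()


def unique_in_order_dict(data):
    """Unique values in first-seen order, via OrderedDict key insertion order."""
    return list(OrderedDict.fromkeys(data))


sample = ["b", "a", "c", "a", "b"]
print(unique_in_order_numpy(sample))
print(unique_in_order_dict(sample))
```

Both return the categories in the order they were first seen rather than alphabetically.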
lib/matplotlib/category.py
Outdated
data = np.atleast_1d(shim_array(data))
sorted_unique = list(OrderedDict(zip(data, itertools.repeat(None))))
for s in sorted_unique:
    if s in self.seq:
If spdict gets dropped, this all reduces to:
for s in sorted_unique:
    if s not in self.seq:
        self.seq.append(s)
        self.locs.append(next(self._counter))
I was kind of confused why you would bother with spdict, but yes, if you're willing to drop it then things become even simpler. OrderedDict (which is basically O(n) due to hashing) should be at least asymptotically faster than np.unique (which uses sorting, so O(n log(n))) but in practice I don't think we'll ever have more than a few dozen categories so I doubt it's going to matter. Mostly I wanted to know whether you're fine with not sorting the categories (which I definitely think is the "correct" behavior, but perhaps you had a reason to sort them).
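To make the simplified update concrete, here is a minimal standalone sketch. `CategoryMap` is a hypothetical stand-in and not the real `UnitData` class; only the `seq`, `locs`, and `_counter` attribute names are taken from the snippet above, and the constructor is an assumption.

```python
import itertools
from collections import OrderedDict


class CategoryMap:
    """Hypothetical stand-in for UnitData: maps categories to locations in first-seen order."""

    def __init__(self, data=()):
        self.seq = []    # category labels, in the order first encountered
        self.locs = []   # integer location assigned to each label
        self._counter = itertools.count()
        self.update(data)

    def update(self, data):
        # OrderedDict.fromkeys removes duplicates while keeping encounter order.
        for s in OrderedDict.fromkeys(data):
            if s not in self.seq:
                self.seq.append(s)
                self.locs.append(next(self._counter))


cats = CategoryMap(["b", "a", "b"])
cats.update(["c", "a"])  # existing categories keep their locations
print(cats.seq, cats.locs)
```

Under these assumptions, later updates only append new categories, so previously assigned locations stay stable.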
There was a ton of back and forth on sorting and honestly it may have boiled down to that being the default behavior for unique. Or trying to conform to pandas.factorize, which uses unique under the hood. The latter is definitely where spdict came from. I'm pretty sure I wrote code both with and without sort - somewhere down the task list was an idea of giving users the option to choose whether they wanted the data sorted or not, but that doesn't really fit with how the data is handled and so I don't think it really makes sense anymore as a to-do. ETA: Also, my personal preference is for the numpy approach just 'cause I think that's clearer from a self-documenting point of view, whereas you're sorta using a side effect of OrderedDict. The OrderedDict approach I'm liking is
b068739 to 142b78b
OK, I just bit the bullet and rewrote the whole thing. I dropped special support for nan/inf and the related tests.
My major concern is that you seem to be trying to preserve numbers in the mapping, and I don't see how that's possible without then explicitly checking for conflicts in the mapping. And this doesn't work at all if you want to preserve locations on updates. Anything that hits this code needs to be treated as/converted into a string. I'm also massively confused as to how a mapping dict is gonna yield the sequence and labels correctly.
Ah, well. However, I doubt that "cast floats to string" is a really workable approach, exactly because of the confusion between an "actual" float (typically, an axis limit) that needs to be converted vs. a float that needs to be cast to string. For example, as of 2.1.0,
gives a blank plot (likely because of such a collision, although I haven't really investigated), whereas
or
work "just fine". Frankly I'm tempted to say, you don't get to pass numbers as categories (too confusing), just strings (and possibly bytes).
What is a string? From the perspective of a user, "abc", 'abc', 🖌 are all strings. And the issue isn't numbers as categories, it's that it's really hard to check for mixed-type inputs like ["ocean", "cloud", -99999]. It's a really painful use case (in fact, it's the only use case that triggers the numpy errors for which there are the crazy shims), but I think it's an important one. My original implementation always does a lookup and then replaces the value with its lookup:
for lab, loc in vmap.items():
    vals[vals == lab] = loc
I agree with you there are likely better ways to do this - I tried the dict approach and for some reason it would constantly break somewhere down the line.
That seems like a bug worth fixing, but I don't know that it necessitates a complete reworking of categorical support.
Your lookup approach effectively passes unknown float values through (they get converted to str, then they are never hit in the vmap.items() loop, and get converted back into floats).
which plots a vertical line as of 2.1 whereas
"works" (plots a slanted line). This is because "!" comes before "0" in ASCII order, so you again have a key collision. I'm going to mark this as release critical because we need at least to make sure we can agree on well-defined semantics before we start getting an avalanche of bug reports from some users while other users start relying on imprecise semantics that will take forever to deprecate.
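A standalone sketch of how such a collision can arise with the replace-in-place lookup. This is an approximation for illustration and not the actual matplotlib 2.1 code path: the helper names `lookup_by_replacement` and `lookup_direct`, the fixed-width string dtype, and the explicit `str(loc)` are all assumptions.

```python
import numpy as np


def lookup_by_replacement(data, vmap):
    """Replace each label with its location inside a string array, then cast to float."""
    vals = np.array([str(v) for v in data], dtype="U32")
    for lab, loc in vmap.items():
        # The location is stored in its string form, so it can itself be
        # matched by a later label such as "0" or "1".
        vals[vals == lab] = str(loc)
    return vals.astype(float)


def lookup_direct(data, vmap):
    """Look each value up exactly once, so locations are never re-matched."""
    return [float(vmap[str(v)]) for v in data]


data = ["!", "0"]
# Sorted (ASCII) order puts "!" before "0": {"!": 0, "0": 1}.
sorted_map = {lab: loc for loc, lab in enumerate(sorted(data))}

print(lookup_by_replacement(data, sorted_map))  # both points collapse to the same value
print(lookup_direct(data, sorted_map))          # distinct locations are kept

# Values that match no label pass straight through as floats.
print(lookup_by_replacement([2.5], sorted_map))
```

In this sketch "!" is first rewritten to "0", which the next iteration then treats as the category "0", so both points end up at the same location. A single-pass lookup avoids that.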
So I'm a little bit confused as to why they're not hitting the lookups since they're encoded in the lookups:
prints:
this also doesn't work in the same way and really should:
The issue seems to be on numeric inputs, not on whether they're strings or not. @tacaswell is this possibly related to that "feature" we thought we had removed about automatically converting strings back to numbers? And the mixed-type categoricals come from pandas, and I really think our semantics need to be consistent with theirs.
If I recall correctly, the semantics from pandas is that integers are supposed to map to the categorical that maps to them so I am not sure about the sorting either. On one hand, if all we get from the user is a bunch of strings, there is no clear indication that they are ordered categoricals, so sorting them alphabetically seems like a sensible thing to do. Particularly looking at the use case of
I would honestly just say "if you pass in unordered data you get a plot in the order of whatever
In either case, preserving Other than changing the behavior from sorted -> the order they are seen in I do not understand what back compatibility we have to break.
The last makes the most sense to me. If you pass an unordered iterable (dictionary) you can't expect the output to be consistent. But I'd expect things to be plotted in the order I give them in other iterables.
We could in theory do something like plot(OrderedDict) and that could hit a converter...but it would mean breaking the x, y semantics. Preserving nan hopefully won't be difficult; what do you want to do about inf and -inf? Changing to preserving order should hopefully be trivial, I just want #9340 to go through first.
I would not do anything about nan/inf.
The same way we handle it with
Is there any news on this? I would argue that as long as one cannot specify the order of the categories on an axes, the whole categorical support is pretty much unusable in real-world applications. I think one can easily communicate to people that using
See #10212 for the current state of this...
PR Summary
I may have missed something in the categorical PR discussion but if I write
I expect the categories to appear in the order I explicitly give, not in alphabetical order...
Milestoning this to 2.1.1 as, in case we agree that's the correct behavior, I'd like to minimize the time the sorting behavior is present in the wild...
attn @story645
xref #9312
PR Checklist