Don't sort categorical keys. #9318

Closed
anntzer wants to merge 3 commits from the dont-sort-categorical-keys branch

Conversation

anntzer
Contributor

@anntzer anntzer commented Oct 8, 2017

PR Summary

I may have missed something in the categorical PR discussion but if I write

plt.bar(["foo", "bar", "quux"], [1, 2, 3])

I expect the categories to appear in the order I explicitly give, not in alphabetical order...

Milestoning this to 2.1.1 as, in case we agree that's the correct behavior, I'd like to minimize the time the sorting behavior is present in the wild...

attn @story645
xref #9312

PR Checklist

  • Has Pytest style unit tests
  • Code is PEP 8 compliant
  • New features are documented, with examples if plot related
  • Documentation is sphinx and numpydoc compliant
  • Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
  • Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

@anntzer anntzer added this to the 2.1.1 (next bug fix release) milestone Oct 8, 2017
@@ -86,32 +83,28 @@ class UnitData(object):
spdict = {'nan': -1.0, 'inf': -2.0, '-inf': -3.0}
Member

@story645 story645 Oct 9, 2017

This can be dropped too. It was sort of an artifact of trying to conform to how pandas unique handles categoricals and forces a sorting on 'nan', 'inf', and '-inf' that doesn't make sense if the aim is to be data agnostic.


def update(self, data):
    data = np.atleast_1d(shim_array(data))
    sorted_unique = list(OrderedDict(zip(data, itertools.repeat(None))))
Member

does this need to be cast as a list? Also, since you're just using OrderedDict for the uniqueness, the following also works (dunno which is more efficient though):

_, idx = np.unique(data, return_index=True)
sorted_unique = data[np.sort(idx)]

Member

@story645 story645 Oct 9, 2017

Another option that I like is:

OrderedDict.fromkeys(data)
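
The two order-preserving deduplication options mentioned in this review can be compared side by side. This is only a minimal sketch with made-up sample data; `via_numpy`, `via_ordered_dict`, and `sorted_order` are illustrative names and do not come from the PR.

```python
from collections import OrderedDict

import numpy as np

# Sample input with repeated categories (illustrative data only).
data = np.array(["foo", "bar", "quux", "bar", "foo"])

# np.unique sorts its output, but return_index reports where each unique
# value first occurs; sorting those indices restores first-seen order.
_, idx = np.unique(data, return_index=True)
via_numpy = data[np.sort(idx)].tolist()

# OrderedDict.fromkeys keeps the first occurrence of each key in order.
via_ordered_dict = list(OrderedDict.fromkeys(data.tolist()))

# For contrast: plain np.unique gives the alphabetical order
# that this PR wants to avoid.
sorted_order = np.unique(data).tolist()
```

Both approaches should agree on first-seen order; they differ mainly in readability and in asymptotic cost.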

data = np.atleast_1d(shim_array(data))
sorted_unique = list(OrderedDict(zip(data, itertools.repeat(None))))
for s in sorted_unique:
    if s in self.seq:
Member

@story645 story645 Oct 9, 2017

if spdict gets dropped, this all reduces to:

for s in sorted_unique:
    if s not in self.seq:
        self.seq.append(s)
        self.locs.append(next(self._counter))
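
To see the reduced loop in context, here is a self-contained sketch. `UnitDataSketch` is a hypothetical stand-in written for illustration; it borrows the `seq`, `locs`, and `_counter` names from the snippet above but is not matplotlib's actual `UnitData` class, and it assumes the special-casing via `spdict` has been dropped.

```python
import itertools
from collections import OrderedDict


class UnitDataSketch:
    """Hypothetical stand-in for UnitData (not the real matplotlib class)."""

    def __init__(self, data=None):
        self.seq = []                      # category labels, first-seen order
        self.locs = []                     # integer location of each label
        self._counter = itertools.count()  # next free location
        if data is not None:
            self.update(data)

    def update(self, data):
        # Deduplicate while keeping first-seen order, then append only
        # labels that are new, so existing locations never change.
        for s in OrderedDict.fromkeys(data):
            if s not in self.seq:
                self.seq.append(s)
                self.locs.append(next(self._counter))


units = UnitDataSketch(["foo", "bar", "quux"])
units.update(["bar", "baz"])  # "bar" keeps location 1; "baz" gets 3
```

Because new labels are only appended, a later `update` call preserves the locations assigned earlier.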

@anntzer
Contributor Author

anntzer commented Oct 9, 2017

I was kind of confused why you would bother with spdict, but yes, if you're willing to drop it then things become even simpler. OrderedDict (which is basically O(n) due to hashing) should be at least asymptotically faster than np.unique (which uses sorting, so O(n log(n))) but in practice I don't think we'll ever have more than a few dozen categories so I doubt it's going to matter.
Casting to list is indeed unneeded.

Mostly I wanted to know whether you're fine with not sorting the categories (which I definitely think is the "correct" behavior, but perhaps you had a reason to sort them).

@story645
Member

story645 commented Oct 9, 2017

There was a ton of back and forth on sorting and honestly it may have boiled down to that being the default behavior for unique. Or trying to conform to pandas.factorize, which uses unique under the hood. The latter is definitely where spdict came from.

I'm pretty sure I wrote code both with and without sort - somewhere down the task list was an idea of giving users the option to choose whether they wanted the data sorted or not, but that doesn't really fit with how the data is handled and so I don't think it really makes sense anymore as a to do.

ETA: Also personal preference is for the numpy approach just 'cause I think that's clearer from a self documenting point of view whereas you're sorta using a side effect of OrderedDict. The OrderedDict approach I'm liking is OrderedDict.fromkeys(data)

@anntzer anntzer force-pushed the dont-sort-categorical-keys branch from b068739 to 142b78b Compare October 9, 2017 06:44
@anntzer
Contributor Author

anntzer commented Oct 9, 2017

OK, I just bit the bullet and rewrote the whole thing. I dropped special support for nan/inf and the related tests.
It's backwards incompatible but I honestly think the whole thing is much clearer with this PR, so perhaps (if @story645 agrees with the general approach, of course) @tacaswell would be willing to consider putting this (or similar) in 2.1.1 breaking backcompat with 2.1.0 given that category support is brand new (plus, not sorting and not handling nan/inf specially is already backwards incompatible so...).

@story645
Member

story645 commented Oct 9, 2017

My major concern is that you seem to be trying to preserve numbers in the mapping, and I don't see how that's possible without then explicitly checking for conflicts in the mapping. And this doesn't work at all if you want to preserve locations on updates. Anything that hits this code needs to be treated as/converted into a string.

I'm also massively confused as to how a mapping dict is gonna yield the sequence and labels correctly.

@anntzer
Contributor Author

anntzer commented Oct 9, 2017

Ah. Well, convert must preserve numbers (and your original implementation does) because it will get called on the axes limits as well (you can check that). So I thought everything else also needed to, but I misunderstood it.
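
The pass-through requirement can be illustrated with a small dict-based sketch. `convert_sketch` and `mapping` are hypothetical names used only for illustration; this is not matplotlib's converter, and it assumes categories are always strings while real numbers (such as axis limits) must come back unchanged.

```python
from numbers import Number


def convert_sketch(values, mapping):
    """Map category labels to locations; pass real numbers through."""
    out = []
    for v in values:
        if isinstance(v, Number):
            # Already a location (e.g. an axis limit): leave it alone.
            out.append(float(v))
        else:
            # A category label: look up its location.
            out.append(float(mapping[v]))
    return out


mapping = {"foo": 0, "bar": 1, "quux": 2}
labels_converted = convert_sketch(["bar", "quux"], mapping)
limits_converted = convert_sketch([-0.5, 2.5], mapping)
```

Under this assumption there is no ambiguity, because a number is never treated as a category label.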

However, I doubt that "cast floats to string" is a really workable approach, exactly because of the confusion between an "actual" float (typically, an axis limit) that needs to be converted vs a float-that-needs-to-be-cast-to-string. For example, as of 2.1.0,

bar(["a", 1], [4, 6], width=1); gca().margins(0)

gives a blank plot (likely because of such a collision, although I haven't really investigated), whereas

bar(["a", 1], [4, 6], width=1.01); gca().margins(0)

or

bar(["a", 1], [4, 6], width=0.99); gca().margins(0)

work "just fine".

Frankly I'm tempted to say, you don't get to pass numbers as categories (too confusing), just strings (and possibly bytes).
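
One way such a restriction could be enforced is sketched below. `check_categories` is a hypothetical helper invented for this illustration; it is not part of matplotlib and only shows the idea of rejecting non-string categories up front.

```python
def check_categories(values):
    """Return the values as a list, rejecting anything not str or bytes."""
    bad = [v for v in values if not isinstance(v, (str, bytes))]
    if bad:
        raise TypeError(
            "categories must be str or bytes, got: {!r}".format(bad))
    return list(values)


accepted = check_categories(["a", "b"])

try:
    check_categories(["a", 1])
    rejected = False
except TypeError:
    rejected = True
```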

@story645
Member

story645 commented Oct 9, 2017

> Frankly I'm tempted to say, you don't get to pass numbers as categories (too confusing), just strings (and possibly bytes).

What is a string? From the perspective of a user, "abc", 'abc', 🖌 are all strings. And the issue isn't numbers as categories, it's that it's really hard to check for mixed type inputs like ["ocean", "cloud", -99999]. It's a really painful use case (in fact, it's the only use case that triggers the numpy errors for which there are the crazy shims), but I think it's an important one.

My original implementation always does a lookup and then replaces the value with its lookup:

for lab, loc in vmap.items():
    vals[vals == lab] = loc

I agree with you there are likely better ways to do this - I tried the dict approach and for some reason it would constantly break somewhere down the line.

> likely because of such a collision, although I haven't really investigated

That seems like a bug worth fixing, but I don't know that it necessitates a complete reworking of categorical support.

@anntzer
Contributor Author

anntzer commented Oct 9, 2017

Your lookup approach effectively passes unknown float values through (they get converted to str, then they are never hit in the vmap.items() loop, and get converted back into floats).
As I mentioned the axes limits will get passed through the converter (just add a print and check by yourself), and these should be passed through (otherwise you get problems like the one I mentioned above).
Another example would be

plot(["!", 0], [1, 2])

which plots a vertical line as of 2.1 whereas

plot(["a", 0], [1, 2])

"works" (plots a slanted line). This is because "!" comes before "0" in the ascii order, so you again have a key collision.
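
The collision can be reproduced with a pure-Python model of a sequential replace-by-label lookup over string-cast values. This is a sketch under assumptions, not the actual 2.1 code: `sequential_lookup` is a hypothetical function, and it assumes that labels are sorted, that every value is first cast to a string, and that each location is written back as a string before the next label is processed.

```python
def sequential_lookup(values, categories):
    """Model of replacing each label with its location, one label at a time."""
    vals = [str(v) for v in values]
    labels = sorted(set(str(c) for c in categories))
    for loc, lab in enumerate(labels):
        # The location is written back as a string, so it can be mistaken
        # for a label that is processed later in the loop.
        vals = [str(loc) if v == lab else v for v in vals]
    return [float(v) for v in vals]


# "!" sorts before "0": "!" becomes "0", which is then hit again by the
# "0" label and becomes "1". Both points land on x = 1 (vertical line).
collided = sequential_lookup(["!", 0], ["!", 0])

# "0" sorts before "a": "0" maps to 0 first and nothing is re-hit.
slanted = sequential_lookup(["a", 0], ["a", 0])

# Values that match no label (e.g. axis limits) pass through unchanged.
limits = sequential_lookup([-0.5, 1.5], ["a", "b"])
```

In this model the outcome depends on where the numeric string falls in the sort order relative to the other labels, which is consistent with the behavior described above.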

I'm going to mark this as release critical because we need at least to make sure we can agree on well defined semantics before we start getting an avalanche of bug reports from some users while other users start relying on imprecise semantics that will take forever to deprecate.

@anntzer anntzer added the Release critical For bugs that make the library unusable (segfaults, incorrect plots, etc) and major regressions. label Oct 9, 2017
@story645
Member

story645 commented Oct 9, 2017

So I'm a little bit confused as to why they're not hitting the lookups since they're encoded in the lookups:

ax.xaxis.unit_data.locs
ax.xaxis.unit_data.seq

prints:

[0, 1]
['!', '0']

this also doesn't work in the same way and really should:

ax.plot(["!", "0"], [1, 2])

The issue seems to be on numeric inputs, not on whether they're strings or not. @tacaswell is this possibly related to that "feature" we thought we had removed about automatically converting strings back to numbers?

And the mixed-type categoricals come from pandas, and I really think our semantics need to be consistent with theirs.

@story645
Member

story645 commented Oct 9, 2017

Going a bit further down this rabbit hole:

fig, ax = plt.subplots()
ax.set_xlim([-1,2])
ax.set_ylim([0,2])
ax.plot(["!", "0"], [1, 2])

yields:
[image: index]

So the labeling is in the right place, but the values aren't. Ok, now I get what you mean about the convertor ignoring values.

@tacaswell
Member

If I recall correctly, the semantics from pandas is that integers are supposed to map to the categorical that maps to them so plt.plot(['a', 0], [1, 2]) is the case that is buggy, that is in all cases it should yield a vertical line as ['a', 0] should be equivalent to ['a', 'a'] (but I am being told on gitter that this is wrong).

I am not sure about the sorting either. On one hand, if all we get from the user is a bunch of strings, there is no clear indication that they are ordered categoricals, so sorting them alphabetically seems like a sensible thing to do, particularly for the use case of plt.plot(d.keys(), d.values()), whose order will vary from process to process on half of the versions of Python we support. On the other hand, the most common use case for this is probably one where the user is providing semantic information through the order the values arrive in, so we should not sort. Can we do unit dispatching based on the type of the container the data comes in in?

@anntzer
Contributor Author

anntzer commented Oct 9, 2017

I would honestly just say "if you pass in unordered data you get a plot in the order of whatever __iter__ gives"... a.k.a. don't pass a plain dict.
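
A short sketch of what "the order of whatever `__iter__` gives" means for a dict. Since Python 3.7 the language guarantees that a plain dict iterates in insertion order; on the older versions supported at the time of this discussion that was not guaranteed. The variable names are illustrative, and the explicit sort is just one way a caller could ask for alphabetical order.

```python
d = {"foo": 1, "bar": 2, "quux": 3}

# keys() and values() always iterate in matching order, so pairing them
# is safe; on Python 3.7+ that order is also the insertion order.
keys = list(d.keys())
values = list(d.values())
pairs = list(zip(keys, values))

# A caller who wants alphabetical order can request it explicitly.
sorted_keys = sorted(d)
sorted_values = [d[k] for k in sorted_keys]
```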

@tacaswell
Member

In either case, preserving np.nan as a way to mark invalid data seems useful.

Other than changing the behavior from sorted -> the order they are seen in I do not understand what back compatibility we have to break.

@jklymak
Member

jklymak commented Oct 9, 2017

The last makes the most sense to me. If you pass an unordered iterable (dictionary) you can’t expect the output to be consistent. But I’d expect things to be plotted in the order I give them in other iterables.

@anntzer
Contributor Author

anntzer commented Oct 9, 2017

nan is not used as a N/A marker right now, it's just the "nan" string (which is forced to be at the left).
plot(["a", "nan", "b"], [0, 1, 2])
[image: figure_1]

@story645
Member

story645 commented Oct 9, 2017

> Can we do unit dispatching based on the type of the container the data comes in in?
Kinda? When it's anything more structured than a list, it gets handled in preprocess_data.

We could in theory do something like plot(OrderedDict) and that could hit a convertor...but it would mean breaking the x, y semantics.

Preserving nan hopefully won't be difficult; what do you want to do about inf and -inf? Changing to preserving order should hopefully be trivial, I just want #9340 to go through first.

@anntzer
Contributor Author

anntzer commented Oct 9, 2017

I would not do anything about nan/inf.
To be honest I don't understand what "missing" as a category would mean, and thus why we should try to handle it specially (if the user has a category named "missing" next to ones named "a", "b" and "c", they can easily put it at the beginning or the end themselves).

@tacaswell
Member

The same way we handle it with plot and friends which is to just drop that point, but given that it does not currently do that (I was confused 🐑 ) forget I said anything about it.

@tacaswell
Member

This is superseded by #9774 and/or #9783

@anntzer I am not going to look at this PR while reconciling all of the categorical PRs.

@tacaswell tacaswell closed this Nov 20, 2017
@anntzer anntzer deleted the dont-sort-categorical-keys branch November 20, 2017 23:44
@ImportanceOfBeingErnest
Member

Is there any news on this?

I would argue that as long as one cannot specify the order of the categories on an axes, the whole categorical support is pretty much unusable in real world applications.

I think one can easily communicate to people that using nans or mixed type categories is not supported. However, it is rather hard to see why a list of strings would suddenly change its order when being plotted.

@jklymak
Member

jklymak commented Jan 22, 2018

See #10212 for the current state of this...

@anntzer anntzer restored the dont-sort-categorical-keys branch July 18, 2018 11:31
@anntzer anntzer deleted the dont-sort-categorical-keys branch July 18, 2018 11:32
Labels
Release critical For bugs that make the library unusable (segfaults, incorrect plots, etc) and major regressions. topic: categorical
5 participants