
bugfix/test for #9336 integer overwrite in categorical #9340


Closed
story645 wants to merge 5 commits

Conversation

story645
Member

@story645 story645 commented Oct 9, 2017

Categorical variables were being converted in place, and sometimes integer values would end up clobbering other values:

vals[vals == '!'] = 0  # the int 0 is cast to the string '0' in place
vals == '0'            # now also matches the entries that were '!'

This PR fixes that and adds a test for this case. I fully acknowledge that there's a longer discussion to be had about design philosophy and semantics of categoricals and all, but just wanted to get this patch in.

  • Has Pytest style unit tests
  • Code is PEP 8 compliant

@story645 story645 added the topic: categorical and Release critical labels Oct 9, 2017
@tacaswell tacaswell added this to the 2.1.1 (next bug fix release) milestone Oct 9, 2017
@story645 story645 mentioned this pull request Oct 9, 2017
@tacaswell
Member

A slightly more spelled-out example of @story645's explanation of the bug:

In [30]: t = np.array([1, 2, 'a'])

In [31]: t
Out[31]: 
array(['1', '2', 'a'], 
      dtype='<U21')

In [32]: t[t == 'a'] = 1

In [33]: t
Out[33]: 
array(['1', '2', '1'], 
      dtype='<U21')

Member

@tacaswell tacaswell left a comment

Independent of the discussion about how to order categoricals, this should go in.

@anntzer
Contributor

anntzer commented Oct 10, 2017

I don't understand why np.array([vmap[v] for v in values]) (or ... vmap[str(v)] ..., depending on whether you want to allow mixed type input or not) is not enough?
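(For illustration, a minimal sketch of that approach; the vmap below is hypothetical and stands in for the label-to-position mapping the categorical converter builds.)

import numpy as np

# Hypothetical label-to-position mapping; the converter builds the real one.
vmap = {'!': 0, '0.0': 1}
values = ['!', '0.0', '!']

# Plain list-comprehension lookup: every value must already be a key in vmap
# (use vmap[str(v)] instead to accept mixed-type input such as [0.0, '!']).
mapped = np.array([vmap[v] for v in values])
print(mapped)  # [0 1 0]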

@story645
Member Author

Hmm, the original idea was to try to preserve the input shape of values, because down the line the plan is to add support for imshow/pcolor, etc. (and yes, I know I can use ravel/reshape for that, but...).
In practical terms, it's mostly because this approach doesn't break the tests, and everything else seems to break them.

@anntzer
Contributor

anntzer commented Oct 10, 2017

Actually perhaps simplest would be np.vectorize(vmap.__getitem__)(values) or np.vectorize(lambda v: vmap.get(v, v))(values) (again depending on the exact semantics)?
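(A sketch of the two variants, again with a hypothetical vmap; the .get(v, v) form passes values that are not categories through unchanged.)

import numpy as np

# Hypothetical mapping; 2.5 stands in for a value that is not a category.
vmap = {'!': 0, '0.0': 1}
values = np.array(['!', '0.0', 2.5], dtype=object)

# Strict lookup: raises KeyError for values missing from vmap.
strict = np.vectorize(vmap.__getitem__, otypes=[float])
# Lenient lookup: values missing from vmap pass through unchanged.
lenient = np.vectorize(lambda v: vmap.get(v, v), otypes=[float])

print(lenient(values))  # [0.  1.  2.5]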

@anntzer
Contributor

anntzer commented Oct 10, 2017

And in fact this patch is still insufficient:

plot(["0.0", "!"], [1, 2]); gca().margins(0)

causes a collision:
[figure: resulting plot showing the collision]
(with or without the patch)

I don't want to sound overly negative but I think we're trying to paper over some fundamental problems with the design of categorical support.

@story645
Member Author

story645 commented Oct 10, 2017

Without margins it works, so please explain to me what margins does. Also, it's not causing a collision, it's flattening the y axis, which confuses me more since that shouldn't be hitting the converter.

I agree that the current design and implementation could use improvement, but I favor an incrementalist approach to this because of the complexity of the call stack.

ETA: np.vectorize(vmap.__getitem__)(values) breaks a ton of tests, so I love the idea, but want to punt getting it working to a later PR on supporting nans.

@anntzer
Contributor

anntzer commented Oct 10, 2017

Basically, categoricals map the string "!" to 0 and the string "0.0" to 1 (due to sorting, but even without sorting it's easy to generate input that maps "0.0" to whatever integer you want). Now, when we try to get the axes limits, given that I set the margins to zero, the xlims are 0.0 and 1.0 (if I had not set the margins to zero, the limits would be -0.05 and 1.05; just replace "0.0" by the corresponding value to get a similar bug). But 0.0 gets converted to a string ("0.0"), which itself is converted to 1. So we end up using 1, 1 as limits, which is blown up to 1-eps, 1+eps, thus we get a zoomed-in region around the "0.0" label (which is at x=1).

Similar case without calling margins: plot(["1.05", "a"], [1, 2]).
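(To trace the collision by hand, a small illustrative reconstruction of that round-trip, using the mapping described above.)

# Labels sorted, so '!' maps to 0 and '0.0' maps to 1.
vmap = {'!': 0, '0.0': 1}

xmin = 0.0             # numeric lower limit produced with margins(0)
as_label = str(xmin)   # '0.0' -- the float is pushed back through the converter
print(vmap[as_label])  # 1, so both limits land on x == 1 and the view collapses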

@story645
Member Author

but 0.0 gets converted to a string ("0.0"), which itself is converted to 1. So we end up using 1, 1 as limits, which is blown up to 1-eps, 1+eps, thus we get a zoomed-in region around the "0.0" label (which is at x=1).

So this is what confuses me: "!" gets mapped to 0, so under the hood the data being plotted should be [0, 1], and I don't get a) why the axes limits aren't updating, or b) how leaving "0.0" as a number but mapping "!" to an int (which I'd then have to ensure wasn't mapped to 0) resolves the issue. Basically, I don't get why this bug only happens when I'm plotting 0 and some character with an ASCII value less than 48, or why the integer case plt.plot(['0', '!']) now works and it's only the float case "0.0" that doesn't. Is that because float-to-string conversion is really persnickety?

@anntzer
Contributor

anntzer commented Oct 10, 2017

Basically str(0)=="0" but str(0.)=="0.0" so plot([0., "!"], [1, 2]); gca().margins(0) also fails (but as you said plot([0, "!"], [1, 2]); gca().margins(0) "works").

@tacaswell
Member

Actually perhaps simplest would be np.vectorize(vmap.__getitem__)(values) or np.vectorize(lambda v: vmap.get(v, v))(values) (again depending on the exact semantics)?

I do not think that is actually simpler. Fewer lines, but not simpler.

I don't want to sound overly negative but I think we're trying to paper over some fundamental problems with the design of categorical support.

This is a weird corner case where the categorical names happen to be floats that happen to hit the limits. I think the correct fix is that the converter should short-circuit if it gets a single number (like the dates converter does: https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/dates.py#L1608), which handles the case where the limits are being set post-conversion (as is happening with the margins). This is why it works with integer 0 or with plt.plot(["a", "!"], [1, 2]); plt.gca().margins(0).

If there is a fundamental design problem, it is with the unit support itself, and that is a much bigger project to redesign (which would likely require buy-in from JPL).
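(A rough sketch of that short-circuit, modelled on the dates converter's behavior; the function below is hypothetical, not the PR's actual code.)

import numpy as np
from numbers import Number

def convert(value, vmap):
    # Hypothetical sketch: limits set after conversion (e.g. by margins())
    # come back in as plain floats, so return them unchanged instead of
    # round-tripping them through str() and the vmap.
    if isinstance(value, Number):
        return value
    return np.vectorize(lambda v: vmap.get(v, v), otypes=[float])(value)

# convert(0.0, {'!': 0, '0.0': 1})          -> 0.0 (not 1)
# convert(['!', '0.0'], {'!': 0, '0.0': 1}) -> array([0., 1.])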

@anntzer
Contributor

anntzer commented Oct 10, 2017

Actually perhaps simplest would be np.vectorize(vmap.__getitem__)(values) or np.vectorize(lambda v: vmap.get(v, v))(values) (again depending on the exact semantics)?

I do not think that is actually simpler. Fewer lines, but not simpler.

Right now the PR has

        str_value = shim_array(value)
        mapped_value = str_value.copy()

        for lab, loc in vmap.items():
            mapped_value[str_value == lab] = loc

which only works because it passes through values that are not in the vmap (because they get cast to a string, are never replaced, and then get cast back to a float). This is quite non-obvious IMO.
Additionally, it is relatively inefficient, because you need to loop over the array once per label.
Using np.vectorize makes the expected semantics explicit (... it's a vectorized dict lookup...) and only requires a single iteration through the array.

This is a weird corner case where the categorical names happen to be floats that happen to hit the limits.

No, this is a corner case because we cast back and forth between floats and strings without paying attention to the fact that a caller that passes in a float and a caller that passes in a string are expecting different behavior -- and thus you get a collision when someone passes in a float expecting it to be treated as a float, but it gets treated as a string. (Of course, your proposed fix would also work.)

Again the fix is really simple. It is to require that categorical inputs be all strings (or all bytes, possibly). @story645 gave the example of someone passing in [1, 2, 3, "4+"]. In all honesty I think we should reject such input and request that ["1", "2", "3", "4+"] be passed in.

I can't say I'm really fond of the unit support machinery but AFAICT it can be made to work perfectly fine for this use case as long as the distinction between floats and strings is maintained. In the case of datetimes, we simply never had this problem because passing in a mix of datetimes and floats errors (as it should).

Coming back to dates, in fact:

from datetime import datetime as d, timedelta as t
from matplotlib import pyplot as plt
plt.plot([d.now(), d.now() + t(1)], [1, 2])
plt.plot([1, 2], [3, 4])
plt.show()

errors as it should, whereas

from matplotlib import pyplot as plt
plt.plot(["a", "b"], [1, 2])
plt.plot([1, 2], [3, 4])
plt.show()

gives a nonsensical result.
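(To make the all-strings requirement above concrete, a minimal sketch; the helper name and error message are illustrative, not part of the PR.)

def validate_categorical(values):
    # Hypothetical check: accept only all-str (or all-bytes) categorical input.
    if all(isinstance(v, str) for v in values):
        return list(values)
    if all(isinstance(v, bytes) for v in values):
        return [v.decode() for v in values]
    raise TypeError("categorical data must be all strings, e.g. "
                    "['1', '2', '3', '4+'] rather than [1, 2, 3, '4+']")

# validate_categorical(['1', '2', '3', '4+'])  -> ['1', '2', '3', '4+']
# validate_categorical([1, 2, 3, '4+'])        -> TypeError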

@story645
Member Author

story645 commented Oct 10, 2017

a caller that passes in a float and a caller that passes in a string are expecting different behavior

I struggle with understanding why/how numbers that don't need to be converted are hitting the conversion interface, and that's why it took me a while to wrap my head around adding the pass-through.

In all honesty I think we should reject such input and request that ["1", "2", "3", "4+"] be passed in

I don't disagree that this would probably be more optimal, but I think it's really important to maintain consistency with pandas on what is a categorical array. Since they allow mixed types, I think matplotlib kind of has to.

as long as the distinction between floats and strings is maintained.

I don't have an ideological objection to this, but I just don't see how to implement it and maintain consistency in location for graph updates and animations. Unless you just mean on the lookup side, and then the plan is to attempt that on the np.nan PR.

@anntzer
Contributor

anntzer commented Oct 10, 2017

I don't disagree that this would probably be more optimal, but I think it's really important to maintain consistency with pandas on what is a categorical array. Since they allow mixed types, I think matplotlib kind of has to.

But we are not maintaining consistency with pandas here: pandas will treat objects of different types, well, as different objects:

In [1]: s = pd.Series(["23", "23"], dtype="category")

In [2]: s.values[0] == s.values[1]
Out[2]: True

In [3]: s = pd.Series([23, "23"], dtype="category")

In [4]: s.values[0] == s.values[1]
Out[4]: False

so if anything we are making things more confusing, by explicitly making categories behave in a different way from pandas.

If we want to truly behave like pandas, we should just keep labels in an object array, and if one writes plt.plot([42, "42"]), they'll have two identically-looking labels.
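(A small sketch of what object-keyed categories would look like; the dict-building loop is only illustrative.)

# Hypothetical object-keyed mapping: 42 and "42" stay distinct categories,
# even though their tick labels would render identically.
vmap = {}
for v in [42, "42"]:
    vmap.setdefault(v, len(vmap))

print(vmap)                  # {42: 0, '42': 1}
print(str(42) == str("42"))  # True -- the two tick labels look the same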

@story645
Member Author

story645 commented Oct 10, 2017

I agree with you that the data shouldn't be cast, and plan to fix that. I just worry it'll be a rabbit hole and so want to keep that separate from this bugfix.

if one writes plt.plot([42, "42"]), they'll have two identically-looking labels.

I think you're correct that this should be the behavior and that's definitely a bug.

@anntzer
Contributor

anntzer commented Oct 10, 2017

OK, if we agree to go this way (which only slightly irks me because I don't think it's really reasonable input; on the other hand, the semantics are perfectly well defined, so I'm totally OK with that), I believe it's just a matter of not bothering with string arrays anywhere (bye-bye shims) and just keeping a mapping of objects to strings.

@tacaswell
Member

Closing this: as per the discussion, we will drop mixed-type support for string categoricals due to confusing behavior in cases like

import matplotlib.pyplot as plt

x = [753, 7, 'a', 'b']
y = range(4)
fig, (ax1, ax2) = plt.subplots(2)
ax1.plot(x, y, 'o')
ax2.plot(x[:2], y[:2], 'x')

The first call would trigger the categorical converter, the second would not, which could be very surprising. If we want to support mixed types, we need to write our own Categorical wrapper class or support pandas Categoricals.

@tacaswell tacaswell closed this Nov 21, 2017
@jklymak
Member

jklymak commented Nov 21, 2017

@tacaswell or have the first call set the Converter to Categorical, and then let the converter deal with the data in the second call.

@anntzer
Contributor

anntzer commented Nov 21, 2017

@jklymak But that would fail if you swap the order of the calls.

@jklymak
Member

jklymak commented Nov 21, 2017

Yeah, that's fine. We can't expect matplotlib to be prescient. Though I guess it'd be nice if there were a way to specify which converter you want rather than always trusting that we will figure it out.

That said, I'm not objecting to just saying categoricals need to be strings. I don't think it's a hard ask to expect the user to convert, and I guess that gets around the issues. But you could imagine a date converter dealing with subsequent float arguments differently than the None converter.

@story645 story645 deleted the category branch February 6, 2018 18:18