
bugfix/test for #9336 integer overwrite in categorical #9340


Closed
story645 wants to merge 5 commits

Conversation

story645
Member

@story645 story645 commented Oct 9, 2017

Categorical variables were being converted in place, and sometimes integer values would end up clobbering other values:

vals[vals == '!'] = 0  # the int 0 is cast to the string '0' in place
vals == '0'            # now also matches the entries that were '!'

This PR fixes that and adds a test for this case. I fully acknowledge that there's a longer discussion to be had about design philosophy and semantics of categoricals and all, but just wanted to get this patch in.

  • Has Pytest style unit tests
  • Code is PEP 8 compliant

@story645 story645 added the topic: categorical and Release critical labels Oct 9, 2017
@tacaswell tacaswell added this to the 2.1.1 (next bug fix release) milestone Oct 9, 2017
@story645 story645 mentioned this pull request Oct 9, 2017
@tacaswell
Member

A slightly more spelled-out example of @story645's explanation of the bug:

In [30]: t = np.array([1, 2, 'a'])

In [31]: t
Out[31]: 
array(['1', '2', 'a'], 
      dtype='<U21')

In [32]: t[t == 'a'] = 1

In [33]: t
Out[33]: 
array(['1', '2', '1'], 
      dtype='<U21')

Member

@tacaswell tacaswell left a comment

Independent of the discussion about how to order categoricals, this should go in.

@anntzer
Contributor

anntzer commented Oct 10, 2017

I don't understand why np.array([vmap[v] for v in values]) (or ... vmap[str(v)] ..., depending on whether you want to allow mixed type input or not) is not enough?
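(For illustration, a minimal sketch of that approach; the vmap below is hypothetical and stands in for the label-to-position mapping the categorical converter builds.)

import numpy as np

# Hypothetical label-to-position mapping; the converter builds the real one.
vmap = {'!': 0, '0.0': 1}
values = ['!', '0.0', '!']

# Plain list-comprehension lookup: every value must already be a key in vmap
# (use vmap[str(v)] instead to accept mixed-type input such as [0.0, '!']).
mapped = np.array([vmap[v] for v in values])
print(mapped)  # [0 1 0]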

@story645
Member Author

Hmm, the original idea was to try to preserve the input shape of values, because down the line the plan is to add support for imshow/pcolor, etc. (and yes, I know I can use ravel/reshape for that, but...).
In practical terms, it's mostly because this approach doesn't break the tests, and everything else seems to break them.

@anntzer
Contributor

anntzer commented Oct 10, 2017

Actually perhaps simplest would be np.vectorize(vmap.__getitem__)(values) or np.vectorize(lambda v: vmap.get(v, v))(values) (again depending on the exact semantics)?
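(A sketch of the two variants, again with a hypothetical vmap; the .get(v, v) form passes values that are not categories through unchanged.)

import numpy as np

# Hypothetical mapping; 2.5 stands in for a value that is not a category.
vmap = {'!': 0, '0.0': 1}
values = np.array(['!', '0.0', 2.5], dtype=object)

# Strict lookup: raises KeyError for values missing from vmap.
strict = np.vectorize(vmap.__getitem__, otypes=[float])
# Lenient lookup: values missing from vmap pass through unchanged.
lenient = np.vectorize(lambda v: vmap.get(v, v), otypes=[float])

print(lenient(values))  # [0.  1.  2.5]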

@anntzer
Contributor

anntzer commented Oct 10, 2017

And in fact this patch is still insufficient:

plot(["0.0", "!"], [1, 2]); gca().margins(0)

causes a collision:
[figure: resulting plot showing the collision]
(with or without the patch)

I don't want to sound overly negative but I think we're trying to paper over some fundamental problems with the design of categorical support.

@story645
Member Author

story645 commented Oct 10, 2017

Without margins it works, so please explain to me what margins does. Also, it's not causing a collision, it's flattening the y axis, which confuses me more since that shouldn't be hitting the converter.

I agree that the current design and implementation could use improvement, but I favor an incrementalist approach to this because of the complexity of the call stack.

ETA: np.vectorize(vmap.__getitem__)(values) breaks a ton of tests, so I love the idea, but want to punt getting it working to a later PR on supporting nans.

@anntzer
Contributor

anntzer commented Oct 10, 2017

Basically, categoricals map the string "!" to 0 and the string "0.0" to 1 (due to sorting, but even without sorting it's easy to generate input that maps "0.0" to whatever integer you want). Now, when we try to get the axes limits, given that I set the margins to zero, the xlims are 0.0 and 1.0 (if I had not set the margins to zero, the limits would be -0.05 and 1.05; just replace "0.0" by the corresponding value to get a similar bug). But 0.0 gets converted to a string ("0.0"), which itself is converted to 1. So we end up using 1, 1 as limits, which is blown up to 1-eps, 1+eps, thus we get a zoomed-in region around the "0.0" label (which is at x=1).

Similar case without calling margins: plot(["1.05", "a"], [1, 2]).
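(To trace the collision by hand, a small illustrative reconstruction of that round-trip, using the mapping described above.)

# Labels sorted, so '!' maps to 0 and '0.0' maps to 1.
vmap = {'!': 0, '0.0': 1}

xmin = 0.0             # numeric lower limit produced with margins(0)
as_label = str(xmin)   # '0.0' -- the float is pushed back through the converter
print(vmap[as_label])  # 1, so both limits land on x == 1 and the view collapses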

@story645
Member Author

but 0.0 gets converted to a string ("0.0"), which itself is converted to 1. So we end up using 1, 1 as limits, which is blown up to 1-eps, 1+eps, thus we get a zoomed-in region around the "0.0" label (which is at x=1).

So this is what confuses me: "!" gets mapped to 0, so under the hood the data being plotted should be [0, 1], and I don't get a) why the axes limits aren't updating, or b) how leaving "0.0" as a number but mapping "!" to an int (which I'd then have to ensure wasn't mapped to 0) resolves the issue. Basically, I don't get why this bug only happens when I'm plotting 0 and some character with an ASCII value less than 48, or why the integer case plt.plot(['0', '!']) now works and it's only the float case "0.0" that doesn't. Is that because float-to-string conversion is really persnickety?

@anntzer
Contributor

anntzer commented Oct 10, 2017

Basically str(0)=="0" but str(0.)=="0.0" so plot([0., "!"], [1, 2]); gca().margins(0) also fails (but as you said plot([0, "!"], [1, 2]); gca().margins(0) "works").

@tacaswell
Member

Actually perhaps simplest would be np.vectorize(vmap.__getitem__)(values) or np.vectorize(lambda v: vmap.get(v, v))(values) (again depending on the exact semantics)?

I do not think that is actually simpler. Fewer lines, but not simpler.

I don't want to sound overly negative but I think we're trying to paper over some fundamental problems with the design of categorical support.

This is a weird corner case where the categorical names happen to be floats that happen to hit the limits. I think the correct fix is that the converter should short-circuit if it gets a single number (like the dates converter does: https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/dates.py#L1608), which handles the case where the limits are being set post-conversion (as is happening with the margins). This is why it works with integer 0 or with plt.plot(["a", "!"], [1, 2]); plt.gca().margins(0).

If there is a fundamental design problem, it is with the unit support itself, and that is a much bigger project to redesign (which would likely require buy-in from JPL).
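(A rough sketch of that short-circuit, modelled on the dates converter's behavior; the function below is hypothetical, not the PR's actual code.)

import numpy as np
from numbers import Number

def convert(value, vmap):
    # Hypothetical sketch: limits set after conversion (e.g. by margins())
    # come back in as plain floats, so return them unchanged instead of
    # round-tripping them through str() and the vmap.
    if isinstance(value, Number):
        return value
    return np.vectorize(lambda v: vmap.get(v, v), otypes=[float])(value)

# convert(0.0, {'!': 0, '0.0': 1})          -> 0.0 (not 1)
# convert(['!', '0.0'], {'!': 0, '0.0': 1}) -> array([0., 1.])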

@anntzer
Contributor

anntzer commented Oct 10, 2017

Actually perhaps simplest would be np.vectorize(vmap.__getitem__)(values) or np.vectorize(lambda v: vmap.get(v, v))(values) (again depending on the exact semantics)?

I do not think that is actually simpler. Fewer lines, but not simpler.

Right now the PR has

        str_value = shim_array(value)
        mapped_value = str_value.copy()

        for lab, loc in vmap.items():
            mapped_value[str_value == lab] = loc

which only works because it passes through values that are not in the vmap (because they get cast to a string, are never replaced, and then get cast back to a float). This is quite non-obvious IMO.
Additionally, it is relatively inefficient, because you need to loop over the array once per label.
Using np.vectorize makes the expected semantics explicit (... it's a vectorized dict lookup...) and only requires a single iteration through the array.

This is a weird corner case where the categorical names happen to be floats that happen to hit the limits.

No, this is a corner case because we cast back and forth between floats and strings without paying attention to the fact that a caller that passes in a float and a caller that passes in a string are expecting different behavior -- and thus you get a collision when someone passes in a float expecting it to be treated as a float, but it gets treated as a string. (Of course, your proposed fix would also work.)

Again the fix is really simple. It is to require that categorical inputs be all strings (or all bytes, possibly). @story645 gave the example of someone passing in [1, 2, 3, "4+"]. In all honesty I think we should reject such input and request that ["1", "2", "3", "4+"] be passed in.

I can't say I'm really fond of the unit support machinery but AFAICT it can be made to work perfectly fine for this use case as long as the distinction between floats and strings is maintained. In the case of datetimes, we simply never had this problem because passing in a mix of datetimes and floats errors (as it should).

Coming back to dates, in fact:

from datetime import datetime as d, timedelta as t
from matplotlib import pyplot as plt
plt.plot([d.now(), d.now() + t(1)], [1, 2])
plt.plot([1, 2], [3, 4])
plt.show()

errors as it should, whereas

from matplotlib import pyplot as plt
plt.plot(["a", "b"], [1, 2])
plt.plot([1, 2], [3, 4])
plt.show()

gives a nonsensical result.
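(To make the all-strings requirement above concrete, a minimal sketch; the helper name and error message are illustrative, not part of the PR.)

def validate_categorical(values):
    # Hypothetical check: accept only all-str (or all-bytes) categorical input.
    if all(isinstance(v, str) for v in values):
        return list(values)
    if all(isinstance(v, bytes) for v in values):
        return [v.decode() for v in values]
    raise TypeError("categorical data must be all strings, e.g. "
                    "['1', '2', '3', '4+'] rather than [1, 2, 3, '4+']")

# validate_categorical(['1', '2', '3', '4+'])  -> ['1', '2', '3', '4+']
# validate_categorical([1, 2, 3, '4+'])        -> TypeError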

@story645
Member Author

story645 commented Oct 10, 2017

a caller that passes in a float and a caller that passes in a string are expecting different behavior

I struggle with understanding why/how numbers that don't need to be converted are hitting the conversion interface, and that's why it took me a while to wrap my head around adding the pass-through.

In all honesty I think we should reject such input and request that ["1", "2", "3", "4+"] be passed in

I don't disagree that this would probably be more optimal, but I think it's really important to maintain consistency with pandas on what is a categorical array. Since they allow mixed types, I think matplotlib kind of has to.

as long as the distinction between floats and strings is maintained.

I don't have an ideological objection to this, but I just don't see how to implement it and maintain consistency in location for graph updates and animations. Unless you just mean on the lookup side, and then the plan is to attempt that on the np.nan PR.

@anntzer
Contributor

anntzer commented Oct 10, 2017

I don't disagree that this would probably be more optimal, but I think it's really important to maintain consistency with pandas on what is a categorical array. Since they allow mixed types, I think matplotlib kind of has to.

But we are not maintaining consistency with pandas here: pandas will treat objects of different types, well, as different objects:

In [1]: s = pd.Series(["23", "23"], dtype="category")

In [2]: s.values[0] == s.values[1]
Out[2]: True

In [3]: s = pd.Series([23, "23"], dtype="category")

In [4]: s.values[0] == s.values[1]
Out[4]: False

so if anything we are making things more confusing, by explicitly making categories behave in a different way from pandas.

If we want to truly behave like pandas, we should just keep labels in an object array, and if one writes plt.plot([42, "42"]), they'll have two identically-looking labels.
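(A small sketch of what object-keyed categories would look like; the dict-building loop is only illustrative.)

# Hypothetical object-keyed mapping: 42 and "42" stay distinct categories,
# even though their tick labels would render identically.
vmap = {}
for v in [42, "42"]:
    vmap.setdefault(v, len(vmap))

print(vmap)                  # {42: 0, '42': 1}
print(str(42) == str("42"))  # True -- the two tick labels look the same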

@story645
Member Author

story645 commented Oct 10, 2017

I agree with you that the data shouldn't be cast, and plan to fix that. I just worry it'll be a rabbit hole and so want to keep that separate from this bugfix.

if one writes plt.plot([42, "42"]), they'll have two identically-looking labels.

I think you're correct that this should be the behavior and that's definitely a bug.

@anntzer
Contributor

anntzer commented Oct 10, 2017

OK, if we agree to go this way (which only slightly irks me because I don't think it's really reasonable input; on the other hand, the semantics are perfectly well defined, so I'm totally OK with that), I believe it's just a matter of not bothering with string arrays anywhere (bye-bye shims) and just keeping a mapping of objects to strings.

@tacaswell
Member

Closing this: as per the discussion, we will drop mixed-type support for string categoricals due to confusing behavior in cases like

import matplotlib.pyplot as plt

x = [753, 7, 'a', 'b']
y = range(4)
fig, (ax1, ax2) = plt.subplots(2)
ax1.plot(x, y, 'o')
ax2.plot(x[:2], y[:2], 'x')

The first call would trigger the categorical converter, the second would not, which could be very surprising. If we want to support mixed types, we need to write our own Categorical wrapper class or support pandas Categoricals.

@tacaswell tacaswell closed this Nov 21, 2017
@jklymak
Member

jklymak commented Nov 21, 2017

@tacaswell or have the first call set the Converter to Categorical, and then let the converter deal with the data in the second call.

@anntzer
Contributor

anntzer commented Nov 21, 2017

@jklymak But that would fail if you swap the order of the calls.

@jklymak
Member

jklymak commented Nov 21, 2017

Yeah, that's fine. We can't expect matplotlib to be prescient. Though I guess it'd be nice if there were a way to specify which converter you want rather than always trusting that we will figure it out.

That said, I'm not objecting to just saying categoricals need to be strings. I don't think it's a hard ask to expect the user to convert, and I guess that gets around the issues. But you could imagine a date converter dealing with subsequent float arguments differently than the None converter.

@story645 story645 deleted the category branch February 6, 2018 18:18