bugfix/test for #9336 integer overwrite in categorical #9340
A slightly more spelled-out example of @story645's explanation of the bug:
In [30]: t = np.array([1, 2, 'a'])
In [31]: t
Out[31]:
array(['1', '2', 'a'],
dtype='<U21')
In [32]: t[t == 'a'] = 1
In [33]: t
Out[33]:
array(['1', '2', '1'],
dtype='<U21')
Independent of the discussion about how to order categoricals, this should go in.
I don't understand why
Hmm, the original idea was on trying to preserve the input shape of values 'cause down the line the plan is to add support for imshow/pcolor etc... (and yes I know I can use ravel/reshape stuff for that, but....)
Actually perhaps simplest would be
Without margins it works, so please explain to me what margins does. Also it's not causing a collision, it's flattening the y axis, which confuses me more since that shouldn't be hitting the converter. I agree that the current design and implementation could use improvement, but I favor an incrementalist approach to this because of the complexity of the call stack. ETA:
Basically categoricals map the string "!" to 0 and the string "0.0" to 1 (due to sorting, but even without sorting it's easy to generate input that maps "0.0" to whatever integer you want). Now when we try to get the axes limits, given that I set the margins to zero, the xlims are 0.0 and 1.0 (if I did not set the margins to zero, then the limits would be -0.05 and 1.05, so just replace "0.0" by the correct value to get a similar bug); but 0.0 gets converted to a string ("0.0") which itself is converted to 1. So we end up using 1, 1 as limits, which is blown up to 1-eps, 1+eps, thus we get a zoomed in region around the "0.0" label (which is at x=1). Similar case without calling margins:
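To make the collision concrete, here is a minimal pure-Python sketch of the behavior described in this comment. The `vmap` dictionary and the `convert` function are hypothetical stand-ins for the categorical converter, not matplotlib's actual code; the only assumption is that values are cast to `str` before lookup and passed through when they are not in the map.

```python
# Hypothetical stand-in for the categorical converter described above;
# this is NOT matplotlib's implementation.
labels = ["!", "0.0"]
# Sorting puts "!" (ASCII 33) before "0.0" (ASCII 48 for "0").
vmap = {label: i for i, label in enumerate(sorted(labels))}


def convert(value):
    """Cast to str, look the string up, and pass through anything unmapped."""
    key = str(value)
    return float(vmap[key]) if key in vmap else float(value)


# With margins set to zero, the limits arrive as the floats 0.0 and 1.0.
xlims = (0.0, 1.0)
converted = tuple(convert(v) for v in xlims)
print(converted)   # the lower limit 0.0 became "0.0" and was mapped to 1

# The integer 0 does not collide because str(0) is "0", not "0.0".
print(convert(0))
```

Both limits end up at 1.0, which matches the zoomed-in region around the "0.0" label described above.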
so this is what confuses me since "!" gets mapped to 0 and so under the hood the data being plotted should be [0, 1] and so I don't get a) why the axes limits aren't updating, b) how leaving "0.0" as a number but mapping "!" to an int (which I'd then have to ensure wasn't mapped to 0) resolves the issue. Basically, I don't get why this bug only happens in the case where I'm plotting 0 and some character with an ASCII value less than 48. And why now the integer case
Basically str(0)=="0" but str(0.)=="0.0" so
I do not think that is actually simpler. Fewer lines, but not simpler.
This is a weird corner case where the categorical names happen to be floats that happen to hit the limits. I think the correct fix is that the converter should short-circuit if it gets a single number (like the dates converter does https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/dates.py#L1608 ), which covers the case where the limits are being set post-conversion (as is happening with the margins). This is why it works with integer 0 or with If there is a fundamental design problem, it is with the unit support itself, which is a much bigger project to redesign (and would likely require buy-in from JPL).
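A rough sketch of the short-circuit proposed in this comment might look like the following. The `convert` function and `vmap` are hypothetical names chosen for illustration; this is not the dates converter or the code in this PR, only the general idea of letting already-numeric values pass through untouched.

```python
from numbers import Number


def convert(value, vmap):
    """Hypothetical converter: numbers pass through, strings are looked up."""
    if isinstance(value, Number):
        # Already in axis coordinates, e.g. a limit set after conversion.
        return float(value)
    if isinstance(value, str):
        return float(vmap[value])
    # Otherwise assume an iterable of labels and/or numbers.
    return [convert(v, vmap) for v in value]


vmap = {"!": 0, "0.0": 1}
print(convert(["!", "0.0"], vmap))  # category labels are mapped
print(convert(0.0, vmap))           # a float limit is left alone
```

With this shape the float limit 0.0 is never stringified, so it cannot collide with the label "0.0".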
Right now the PR has
which only works because it passes through values that are not in the vmap (because they get cast to a string, are never replaced, and then get cast back to a float). This is quite non-obvious IMO.
No, this is a corner case because we cast back and forth between floats and strings without paying attention to the fact that a caller that passes in a float and a caller that passes in a string expect different behavior -- and thus you get a collision when someone passes in a float expecting it to be treated as a float, but it gets treated as a string. (Of course, your proposed fix would also work.) Again, the fix is really simple: require that categorical inputs be all strings (or all bytes, possibly). @story645 gave the example of someone passing in [1, 2, 3, "4+"]. In all honesty I think we should reject such input and request that ["1", "2", "3", "4+"] be passed in. I can't say I'm really fond of the unit support machinery but AFAICT it can be made to work perfectly fine for this use case as long as the distinction between floats and strings is maintained. In the case of datetimes, we simply never had this problem because passing in a mix of datetimes and floats errors (as it should). Coming back to dates, in fact:
errors as it should, whereas
gives a nonsensical result.
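The "all strings (or all bytes)" requirement suggested in this comment could be enforced with a check along these lines. `require_strings` is a hypothetical helper written for this illustration, not an existing matplotlib function.

```python
def require_strings(values):
    """Hypothetical validator: reject categorical input with non-string items."""
    values = list(values)
    for v in values:
        if not isinstance(v, (str, bytes)):
            raise TypeError(
                f"categorical input must be str or bytes, got {v!r} "
                f"of type {type(v).__name__}"
            )
    return values


print(require_strings(["1", "2", "3", "4+"]))  # accepted as-is

try:
    require_strings([1, 2, 3, "4+"])
except TypeError as err:
    print("rejected:", err)
```

Under this rule the caller has to decide up front whether 1 means the number or the label "1", so the converter never has to guess.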
I struggle with understanding why/how numbers that don't need to be converted are hitting the conversion interface, and that's why it took me a while to wrap my head around adding the pass through.
I don't disagree that this would probably be more optimal, but I think it's really important to maintain consistency with pandas on what is a categorical array. Since they allow mixed types, I think matplotlib kind of has to.
I don't have an ideological objection to this, but I just don't see how to implement this and maintain consistency in location for graph updates and animations. Unless you just mean on the lookup side, and then the plan is to attempt that on the np.nan pr.
But we are not maintaining consistency with pandas here: pandas will treat objects of different types, well, as different objects:
so if anything we are making things more confusing, by explicitly making categories behave in a different way from pandas. If we want to truly behave like pandas, we should just keep labels in an object array, and if one writes
I agree with you that the data shouldn't be cast, and plan to fix that. I just worry it'll be a rabbit hole and so want to keep that separate from this bugfix.
I think you're correct that this should be the behavior and that's definitely a bug.
OK, if we agree to go this way (which only slightly irks me because I don't think it's a really reasonable input; on the other hand the semantics are perfectly well defined so I'm totally OK with that) I believe it's just a matter of not bothering with string arrays anywhere (bye bye shims) and just keeping a mapping of objects to strings.
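As an illustration of the "mapping of objects to strings" idea, the sketch below keeps the original objects as dictionary keys so that the integer 1 and the string "1" remain separate categories. The names `positions` and `tick_labels` are made up for this example and do not come from the PR.

```python
# Labels are kept as the original Python objects, never cast to a string array.
labels = [1, "1", 2, "a"]

# Assign each distinct object its own axis position, in order of appearance.
positions = {}
for label in labels:
    positions.setdefault(label, len(positions))

# Strings are only produced for display, not for lookup.
tick_labels = {pos: str(label) for label, pos in positions.items()}

print(positions)    # 1 and "1" get different positions
print(tick_labels)  # both are displayed as "1"

# Caveat: Python dicts treat 1, 1.0 and True as the same key, so a real
# implementation would have to decide how to handle those.
```

This keeps the pandas-like semantics discussed above for int versus str, at the cost of two ticks that can look identical.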
Closing this: as per the discussion, we will drop mixed types for string categoricals due to confusing behavior in cases like
import matplotlib.pyplot as plt
x = [753, 7, 'a', 'b']
y = range(4)
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(x, y, 'o')
ax2.plot(x[:2], y[:2], 'x')
The first case would trigger categorical, the second would not, which could be very surprising. If we want to support mixed types we need to write our own
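The surprise described here comes from deciding per call, based only on the data passed in. A minimal sketch, assuming a hypothetical `looks_categorical` check that triggers whenever any element is a string (this is an illustration, not matplotlib's dispatch logic):

```python
def looks_categorical(values):
    """Hypothetical per-call check: categorical if any element is a string."""
    return any(isinstance(v, str) for v in values)


x = [753, 7, "a", "b"]

print(looks_categorical(x))      # True: the full list contains strings
print(looks_categorical(x[:2]))  # False: the slice is purely numeric
```

The same values 753 and 7 would therefore be treated as category labels in one call and as plain numbers in the other.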
@tacaswell or have the first call set the Converter to Categorical, and then let the converter deal with the data in the second call.
@jklymak But that would fail if you swap the order of the calls.
Yeah that’s fine. We can’t expect matplotlib to be prescient. Though I guess it’d be nice if there were a way to specify what converter you want rather than always trusting we will figure it out. Though I’m not objecting to just saying categoricals need to be strings. I don’t think it’s a hard ask to expect the user to convert. And I guess that gets around the issues. But you could imagine a dateconverter dealing with subsequent float arguments differently than the None converter.
Categorical variables were doing in-place conversions, and sometimes integer values would end up swamping other variables:
This PR fixes that and adds a test for this case. I fully acknowledge that there's a longer discussion to be had about design philosophy and semantics of categoricals and all, but just wanted to get this patch in.
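For readers who want to see how an in-place conversion can swamp other values, here is a pure-Python sketch. It assumes the conversion replaced one label at a time in a buffer of strings, as in the NumPy example in the conversation; the function names and the sample data are hypothetical and are not the code in this PR.

```python
values = ["a", "b", "1"]
vmap = {"a": 1, "b": 2, "1": 0}  # label -> integer position


def convert_sequential(values, vmap):
    """Replace one label at a time in a buffer of strings (the buggy pattern)."""
    out = list(values)
    for label, code in vmap.items():
        # Codes are written back as strings, so they can match later labels.
        out = [str(code) if v == label else v for v in out]
    return [float(v) for v in out]


def convert_single_pass(values, vmap):
    """Look each original value up exactly once."""
    return [float(vmap[v]) for v in values]


print(convert_sequential(values, vmap))   # "a" became "1" and was then remapped
print(convert_single_pass(values, vmap))  # each label keeps its own position
```

In the sequential version "a" is first rewritten to "1" and then caught by the later "1" to 0 replacement, so two different categories collapse onto the same position.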