BUG: truthiness of strings is arbitrary, context-dependent, and inconsistent #9875
Comments
Continues from #5967 |
Should we move forward with this? PR #9877 requests discussion take place here, but the issue does not seem to have garnered much attention. Edit: PR number |
Looking at that PR, what is the future path for doing the conversion it cites if some version of it gets approved?
So is the suggestion that future code going from an array of strings containing 0/1 to bool should typecast twice? |
@LevN0 it is the only consistent way and it is identical to Python: Now, there are some exceptions to this and I am not sure what they currently do. That is, text readers should behave differently, since they parse True/False. But I guess those must not have been using this anyway. |
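To make the "typecast twice" question concrete, here is a minimal pure-Python sketch of the two conversions for a single element. The helper names are hypothetical and only illustrate the distinction; the two-step spelling corresponds to arr.astype(int).astype(bool) as mentioned in the issue description.

```python
def python_truthiness(s: str) -> bool:
    """Plain Python rule: only the empty string is falsy."""
    return bool(s)


def parse_then_test(s: str) -> bool:
    """Two-step conversion: parse the text as an int, then test the number.

    Raises ValueError for strings that are not integer literals.
    """
    return bool(int(s))


digits = ["0", "1"]  # hypothetical input: strings holding 0/1

by_truthiness = [python_truthiness(s) for s in digits]
by_parsing = [parse_then_test(s) for s in digits]

print(by_truthiness)  # [True, True]
print(by_parsing)     # [False, True]
```

Under the plain truthiness rule "0" is truthy because it is non-empty, so code that wants the numeric meaning has to parse first.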
Can someone compute the diff between the table here and the one from 1.18.3? |
Well, I used master, but I doubt anything else changed this. Only the last column changed, and that of course changed quite a bit:
The NULL padding getting lost is a problem of the NumPy unicode scalar, though, not of the bool cast itself. |
As to the |
Jeez, I did not remember the craziness of ignoring all whitespace for the
See also gh-9877 |
Just to confirm, any changes here would eventually apply to
In [12]: np.argwhere(["a", " ", "c"])
Out[12]:
array([[0],
       [2]])
might change to count |
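As a rough illustration of what is at stake for the example above, the following pure-Python sketch lists the indices that count as truthy under two rules. It does not call NumPy; the whitespace-ignoring rule is a simplified assumption taken from this thread, not NumPy's actual implementation.

```python
items = ["a", " ", "c"]

# Assumed rule from the thread: whitespace-only strings count as falsy.
ignoring_whitespace = [i for i, s in enumerate(items) if s.strip()]

# Plain Python rule: any non-empty string is truthy.
python_rule = [i for i, s in enumerate(items) if s]

print(ignoring_whitespace)  # [0, 2]
print(python_rule)          # [0, 1, 2]
```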
Yeah, I think we should/have to do that... I am not quite sure yet how to do the transition (or let's say how nasty it will be). For current strings. I suppose What would this mean for pandas (and dask)? The current (whitespace-ignoring) way seems so strange to me from a modern Python perspective... |
Agreed. I only discovered this issue because of a test in dask. We currently implement This shouldn't really affect pandas users, since we convert incoming fixed-width string dtype to object-dtype. |
Right, the only real thing I would be worried about is if we manage to add reasonable warnings in our code paths and you all suddenly have to try and work around that. |
Ah, no, I don't think so (at least not for where) since we're using |
Since Chuck gave a ping on the related PR: do we think we could make a decision to move forward with this? I doubt it is much in the way of DTypes, just one more of those annoying things to look out for. The reason is that all this does is that Of course, this is still a major bug and trap, so I would still be happy to form a plan about just fixing it! |
We had discussed this, and we somewhat agreed that this is desirable, and should have A specific issue may be I hope to look at this, hopefully, |
The following table shows the extent of the problem:
| x | bool(x) | bool(sc) | bool(arr[()]) | bool(arr) | sc.astype(bool) |
|---|---|---|---|---|---|
| '' | False | False | False | False | ValueError |
| '\0' | True | True | False | False | ValueError |
| ' ' | True | True | True | False | ValueError |
| ' \0' | True | True | True | False | ValueError |
| '0' | True | True | True | True | False |
| '1' | True | True | True | True | True |
| '\0 ' | True | True | True | True | ValueError |
| ' \0 ' | True | True | True | True | ValueError |
Generated with...
The differences come down to:
- bool(arr[()]) - indexing string arrays causes trailing nulls to be truncated. This is unavoidable.
- bool(arr) - has special handling to treat spaces as falsy. This seems unpythonic, since that's not how str.__bool__ behaves - are we stuck with it?
- arr.astype(bool) - treated as arr.astype(int).astype(bool), which seems bizarre
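The first two differences can be modelled in plain Python to check them against the table. This is only a sketch: the function names are hypothetical, and the rule in truthy_array is inferred from the bool(arr) column of the table rather than taken from NumPy's source.

```python
def truthy_python(s: str) -> bool:
    """bool(x): only the empty string is falsy."""
    return bool(s)


def truthy_after_indexing(s: str) -> bool:
    """bool(arr[()]): trailing nulls are dropped before the test."""
    return bool(s.rstrip("\0"))


def truthy_array(s: str) -> bool:
    """bool(arr): rule inferred from the table (an assumption).

    Leading whitespace and nulls are skipped; any non-null character
    after a null, or any non-whitespace character, makes it truthy.
    """
    seen_null = False
    for ch in s:
        if ch == "\0":
            seen_null = True
        elif seen_null or not ch.isspace():
            return True
    return False


# Rows of the table: x, bool(x), bool(arr[()]), bool(arr)
table = [
    ("", False, False, False),
    ("\0", True, False, False),
    (" ", True, True, False),
    (" \0", True, True, False),
    ("0", True, True, True),
    ("1", True, True, True),
    ("\0 ", True, True, True),
    (" \0 ", True, True, True),
]

mismatches = [
    x
    for x, py, item, arr in table
    if (truthy_python(x), truthy_after_indexing(x), truthy_array(x))
    != (py, item, arr)
]
print(mismatches)  # [] when the model agrees with every row
```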