BUG: truthiness of strings is arbitrary, context-dependent, and inconsistent #9875

Closed
eric-wieser opened this issue Oct 17, 2017 · 15 comments · Fixed by #23898
Assignees
Labels
00 - Bug component: numpy._core triaged Issue/PR that was discussed in a triage meeting

Comments

@eric-wieser
Member

eric-wieser commented Oct 17, 2017

The following table shows the extent of the problem:

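x : str
sc = np.str_(x)
arr = np.array(x, np.str_)

| x       | bool(x) | bool(sc) | bool(arr[()]) | bool(arr) | sc.astype(bool) |
|---------|---------|----------|---------------|-----------|-----------------|
| ''      | False   | False    | False         | False     | ValueError      |
| '\0'    | True    | True     | False         | False     | ValueError      |
| ' '     | True    | True     | True          | False     | ValueError      |
| ' \0'   | True    | True     | True          | False     | ValueError      |
| '0'     | True    | True     | True          | True      | False           |
| '1'     | True    | True     | True          | True      | True            |
| '\0 '   | True    | True     | True          | True      | ValueError      |
| ' \0 '  | True    | True     | True          | True      | ValueError      |
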
Generated with:

import numpy as np

def call_or_exc(f):
    # Return the call's result, or the exception it raised.
    try:
        return f()
    except Exception as e:
        return e

def truthiness(x):
    x_arr = np.array(x)
    x_sc = np.str_(x)   # np.unicode_ was an alias of np.str_ and is gone in NumPy 2.0
    x_sc2 = x_arr[()]   # scalar obtained by indexing the 0-d array
    return bool(x), bool(x_sc), bool(x_sc2), bool(x_arr), call_or_exc(lambda: x_sc.astype(bool))

def fmt(x):
    if isinstance(x, str): return repr(x).replace(r'\x00', r'\0')
    if isinstance(x, Exception): return type(x).__name__
    return repr(x)

# Print the results as a markdown table.
print(' | '.join("`{}`".format(x) for x in ['x', 'bool(x)', 'bool(sc)', 'bool(arr[()])', 'bool(arr)', 'sc.astype(bool)']))
print(' | '.join(['--'] * 6))
for val in ['', '\0', ' ', ' \0', '0', '1', '\0 ', ' \0 ']:
    print(' | '.join("`{}`".format(fmt(x)) for x in (val,)+truthiness(val)))

The differences come down to:

  • bool(arr[()]) - indexing string arrays causes trailing nulls to be truncated. This is unavoidable.
  • bool(arr) - has special handling to treat spaces as falsy. This seems unpythonic, since that's not how str.__bool__ behaves - are we stuck with it?
  • arr.astype(bool) - treated as arr.astype(int).astype(bool), which seems bizarre
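
As a rough illustration of the first two bullets, the following pure-Python sketch reproduces the bool(arr[()]) and bool(arr) columns of the table above. The helper names are hypothetical, and the whitespace rule in legacy_array_truthiness is an assumption chosen because it matches the table; it is not confirmed here as NumPy's actual implementation.

```python
def strip_trailing_nulls(s):
    """Model of indexing a fixed-width string array: trailing NULs read as padding."""
    return s.rstrip("\0")


def legacy_array_truthiness(s):
    """Assumed rule for the bool(arr) column: NULs are skipped, whitespace is
    skipped until a NUL has been seen, and any other character is truthy."""
    seen_null = False
    for ch in s:
        if ch == "\0":
            seen_null = True
        elif seen_null or not ch.isspace():
            return True
    return False


samples = ["", "\0", " ", " \0", "0", "1", "\0 ", " \0 "]
rows = [(s, bool(strip_trailing_nulls(s)), legacy_array_truthiness(s)) for s in samples]
for s, indexed, array_like in rows:
    print(repr(s), indexed, array_like)
```

Under these assumptions, ' ' and ' \0' stay truthy after indexing but are falsy for the whole-array check, which is the inconsistency the table shows.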
@eric-wieser
Member Author

Continues from #5967

@mattip
Member

mattip commented Apr 29, 2019

Should we move forward with this? PR #9877 requests discussion take place here, but the issue does not seem to have garnered much attention.

Edit: PR number

@LevN0

LevN0 commented Apr 20, 2020

Looking at that PR, what is the future path for doing the conversion it cites if some version of it gets approved?

Previously this was interpreted as string.astype(int).astype(bool), which
interpreted '0' as False.

The behavior is now consistent with count_nonzero and nonzero, treating only
the empty string as False.

So is the suggestion for future code going from an array of strings containing 0/1 to bool to typecast twice?

@seberg
Member

seberg commented Apr 20, 2020

@LevN0 it is the only consistent way and it is identical to python: bool("0") vs. bool(int("0")). So yes, what to do right now is a bit trickier, unless we do a quick 1.18.4 release just for this. @charris ping just in case you did not see gh-16023 yet.
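
The plain-Python analogy mentioned here can be spelled out as a small sketch. The helper parse_flag is a hypothetical name used only for illustration; it shows the two-step conversion (parse as an integer, then take truthiness) next to direct string truthiness.

```python
flags = ["0", "1", "", " "]

# Direct truthiness: only the empty string is falsy, so "0" counts as True.
direct = [bool(s) for s in flags]


def parse_flag(s):
    """Hypothetical helper: two-step conversion, bool(int(s)).

    Raises ValueError for text that int() cannot parse, such as "" or " ".
    """
    return bool(int(s))


# Two-step conversion for strings that actually hold 0/1.
two_step = [parse_flag(s) for s in ["0", "1"]]

print(direct)    # [True, True, False, True]
print(two_step)  # [False, True]
```

For string arrays, the corresponding spelling in the discussion is the explicit double cast, astype(int) followed by astype(bool).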

Now, there are some exceptions to this and I am not sure what they currently do. That is, text readers should behave differently, since they parse True/False. But I guess those must not have been using this anyway.

@eric-wieser
Member Author

Can someone compute the diff between the table here and the one from 1.18.3?

@seberg
Member

seberg commented Apr 20, 2020

Well, I used master, but I doubt anything else changed this. Only the last column changed, and that of course changed quite a bit:

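| x       | bool(x) | bool(sc) | bool(arr[()]) | bool(arr) | sc.astype(bool) |
|---------|---------|----------|---------------|-----------|-----------------|
| ''      | False   | False    | False         | False     | False           |
| '\0'    | True    | True     | False         | False     | False           |
| ' '     | True    | True     | True          | False     | True            |
| ' \0'   | True    | True     | True          | False     | True            |
| '0'     | True    | True     | True          | True      | True            |
| '1'     | True    | True     | True          | True      | True            |
| '\0 '   | True    | True     | True          | True      | True            |
| ' \0 '  | True    | True     | True          | True      | True            |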

The NULL padding getting lost is a problem of the NumPy unicode scalar, though, not of the bool cast itself.

@seberg
Member

seberg commented Apr 20, 2020

As to the bool(sc) logic, it is mirrored by len(sc) since the NULL stripping is not consistent there.

@seberg
Member

seberg commented Apr 24, 2020

Jeesh, I did not remember the craziness of ignoring all whitespace for the nonzero implementation. I am not even sure we can put a warning into that easily, since it might be used all over without the GIL...
That is:

np.nonzero(np.array([" ", "\n", "\t", ""]))
# Should only consider the last "" as Falsy, but considers them all False.

See also gh-9877

@TomAugspurger

Just to confirm, any changes here would eventually apply to where and argwhere? So the output of

In [12]: np.argwhere(["a", " ", "c"])
Out[12]:
array([[0],
       [2]])

might change to count ' ' as true in the future?
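
To make the two candidate behaviours concrete, here is a pure-Python sketch that does not call NumPy. Both function names are hypothetical: the first models the whitespace-ignoring rule that yields the output shown above, and the second models the proposed rule where only the empty string is falsy.

```python
items = ["a", " ", "c"]


def argwhere_whitespace_falsy(values):
    """Model of the output shown above: whitespace-only strings are falsy."""
    return [[i] for i, s in enumerate(values) if s.strip() != ""]


def argwhere_empty_falsy(values):
    """Model of the proposed behaviour: only the empty string is falsy."""
    return [[i] for i, s in enumerate(values) if s != ""]


print(argwhere_whitespace_falsy(items))  # [[0], [2]]
print(argwhere_empty_falsy(items))       # [[0], [1], [2]]
```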

@seberg
Member

seberg commented May 1, 2020

Yeah, I think we should/have to do that... I am not quite sure yet how to do the transition (or let's say how nasty it will be). For current strings, I suppose arr == arr.dtype.type() is a way to opt in to the future behaviour (for most dtypes anyway, not for object).

What would this mean for pandas (and dask)? The current (white-space ignoring) way seems so strange to me from a modern Python perspective...

@TomAugspurger

The current (white-space ignoring) way seems so strange to me from a modern Python perspective

Agreed.

I only discovered this issue because of a test in dask. We currently implement argwhere via array.astype(bool), which does treat ' ' as True, so our results differ from NumPy. I'm probably OK with that difference.

This shouldn't really affect pandas users, since we convert incoming fixed-width string dtype to object-dtype.

@seberg
Member

seberg commented May 1, 2020

Right, the only real thing I would be worried about is if we manage to add reasonable warnings in our code paths and you all suddenly have to try and work around that.

@TomAugspurger

Ah, no, I don't think so (at least not for where) since we're using astype(bool) and masking, Dask users shouldn't see any warnings.

@seberg seberg added the triage review Issue/PR to be discussed at the next triage meeting label Dec 18, 2020
@seberg
Member

seberg commented Dec 18, 2020

Since Chuck gave a ping on the related PR: do we think we could make a decision to move forward with this?

I doubt it is much in the way of DTypes, just one more of those annoying things to look out for. The reason is that all this means is that nonzero can't be implemented in terms of .astype(bool), and .astype(bool) is hopelessly inconsistent for strings. That is all very annoying, but I am not sure it will float up somewhere as more than a simple special case.

Of course, this is still a major bug and trap, so I would still be happy to form a plan about just fixing it!

@seberg seberg self-assigned this Jan 27, 2021
@seberg
Member

seberg commented Jan 27, 2021

We had discussed this, and we somewhat agreed that this is desirable and should have FutureWarnings (or VisibleDeprecationWarning).

A specific issue may be np.loadtxt and genfromtxt, since using 0 and 1 as booleans might be a more common use case there, so they may need some special consideration.

I hope to look at this; hopefully, FutureWarnings (and opting in to future behaviour) will be reasonably possible.

@seberg seberg added triaged Issue/PR that was discussed in a triage meeting and removed triage review Issue/PR to be discussed at the next triage meeting labels Jan 27, 2021
5 participants