BUG: lib: Fix handling of usecols=[] in loadtxt. #16632
Before this change, because of statements such as `if usecols:`, `usecols=[]` was treated the same as `usecols=None`, and all the columns were used. With this change, `usecols=[]` means "read no columns", and an empty array is returned. The error handling is improved: now if `usecols` is given, and the number of columns in `usecols` does not equal the number of fields in a given structured dtype, an exception is raised with a message that explains the problem. Previously this mismatch would fail with an incidental IndexError.
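To illustrate the truthiness pitfall described above, here is a minimal sketch in plain Python. The helper names `pick_fields_buggy` and `pick_fields` are hypothetical and are not part of NumPy; they only mimic the difference between testing `if usecols:` and testing `if usecols is not None:`.

```python
def pick_fields_buggy(fields, usecols=None):
    # An empty list is falsy, so usecols=[] falls through to "use everything".
    if usecols:
        return [fields[i] for i in usecols]
    return list(fields)


def pick_fields(fields, usecols=None):
    # Only None means "use all columns"; [] means "use no columns".
    if usecols is not None:
        return [fields[i] for i in usecols]
    return list(fields)


row = ["1.0", "2.0", "3.0"]
print(pick_fields_buggy(row, usecols=[]))  # all three fields are returned
print(pick_fields(row, usecols=[]))        # an empty list is returned
print(pick_fields(row, usecols=None))      # all three fields are returned
```

The second helper distinguishes "no selection given" from "an empty selection", which is the distinction this PR is about.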
68d7e55 to fc8f6b3
The maintenance PR (#16633) is merged, and this PR now has just the changes related to fixing the handling of |
    if len(usecols) == 0:
        shp = (0, 0) if ndmin == 2 else (0,)
        return np.empty(shp, dtype=dtype)
I'd perhaps expect an `(N, 0)` array (or maybe `(0, N)`?) where `N` is the number of lines in the file.
I thought about doing that (specifically `(N, 0)`, or `(0, N)` if `unpack` is True). But then I wondered if that was a foolish consistency. However, if in fact most people would expect `(N, 0)`, then that's what it should do.
Is this early exit even needed? What happens if you let the rest of the function run, to avoid needing a special case?
I think we should go with `(N, 0)`; ideally it works without any special handling. Speed is not relevant for such a weird corner case anyway. It is a bit strange, but you could imagine reading `K` columns from multiple files and concatenating them or so, which only makes sense if the shape includes the `N` correctly.
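The concatenation argument can be sketched without NumPy, using lists of rows as a stand-in for 2-D arrays. The functions `read_columns` and `hstack_rows` below are hypothetical illustrations, not `loadtxt` internals; the point is only that an "N rows, 0 columns" result still lines up with data from another file, while an empty result that forgets `N` would not.

```python
def read_columns(lines, usecols=None):
    """Split each non-blank line on whitespace and keep the requested columns."""
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if usecols is None:
            rows.append([float(f) for f in fields])
        else:
            rows.append([float(fields[i]) for i in usecols])
    return rows


def hstack_rows(a, b):
    # Column-wise concatenation; both inputs must have the same number of rows.
    if len(a) != len(b):
        raise ValueError("row counts differ: %d vs %d" % (len(a), len(b)))
    return [ra + rb for ra, rb in zip(a, b)]


file_a = ["1 2 3", "4 5 6"]
file_b = ["7 8", "9 10"]

none_from_a = read_columns(file_a, usecols=[])      # 2 rows, 0 columns each
both_from_b = read_columns(file_b, usecols=[0, 1])  # 2 rows, 2 columns each
combined = hstack_rows(none_from_a, both_from_b)
print(len(none_from_a), combined)
```

Because the row count survives the empty column selection, the column-wise join succeeds and simply contributes no columns from the first file.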
I'm working on an update to return `(N, 0)`. I just found a bug in how `loadtxt` handles nested structured dtypes, so I'll try to fix that, and then get back to this PR.
For future reference, the bug that I encountered is #16678.
@@ -1071,6 +1071,15 @@ def read_data(chunk_size):

    dtype_types, packing = flatten_dtype_internal(dtype)

    if usecols is not None:
        if len(dtype_types) > 1 and len(usecols) != len(dtype_types):
perhaps clearer as:

    -    if len(dtype_types) > 1 and len(usecols) != len(dtype_types):
    +    if len(dtype_types) <= 1:
    +        pass  # this is ok because ???
    +    elif len(usecols) != len(dtype_types):
Although I understand that this should be a rare case, maybe it makes sense to make it work for `len(dtype_types) >= 1`, for one-element structured dtypes?
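For reference, the check under discussion can be sketched as a standalone function. The name `check_usecols`, its `n_fields` parameter, and the wording of the error message are assumptions made for illustration; they are not taken from the PR's code. The sketch keeps the `> 1` condition from the diff, so a dtype with a single field is not validated, which is the case the comment above asks about.

```python
def check_usecols(usecols, n_fields):
    """Raise a descriptive error when usecols and the dtype field count disagree."""
    if usecols is None:
        return  # no column selection was given, nothing to compare
    # Mirrors the condition in the diff: only multi-field dtypes are checked.
    if n_fields > 1 and len(usecols) != n_fields:
        raise ValueError(
            "usecols has %d column(s) but the dtype has %d field(s)"
            % (len(usecols), n_fields)
        )


check_usecols([0, 2], n_fields=2)      # lengths match, no error
check_usecols([0, 1, 2], n_fields=1)   # single field: skipped by the > 1 test

try:
    check_usecols([0], n_fields=3)
except ValueError as exc:
    print(exc)
```

Changing `n_fields > 1` to `n_fields >= 1` would extend the check to one-element structured dtypes, at the cost of also rejecting a plain scalar dtype combined with several columns unless the two cases are told apart some other way.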
Closing, the error is now (also coming from Warren via
Although, I guess we could also use |