BUG: lib: Fix handling of usecols=[] in loadtxt. #16632

Closed
wants to merge 1 commit into from

Member

@WarrenWeckesser WarrenWeckesser commented Jun 18, 2020

Before this change, because of statements such as `if usecols:`,
`usecols=[]` was treated the same as `usecols=None`, and all the
columns were used. With this change, `usecols=[]` means "read no
columns", and an empty array is returned.

The error handling is improved: now if `usecols` is given, and the
number of columns in `usecols` does not equal the number of fields
in a given structured dtype, an exception is raised with a message
that explains the problem. Previously this mismatch would fail
with an incidental IndexError.
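
The truthiness pitfall described above can be reproduced without NumPy. The following is a minimal sketch, not the actual loadtxt code; `pick_columns_buggy` and `pick_columns_fixed` are hypothetical helper names used only for illustration.

```python
def pick_columns_buggy(row, usecols=None):
    # `if usecols:` is False for both None and [], so an empty
    # list silently falls through to "use every column".
    if usecols:
        return [row[i] for i in usecols]
    return list(row)


def pick_columns_fixed(row, usecols=None):
    # Comparing against None keeps [] distinct: it selects no columns.
    if usecols is not None:
        return [row[i] for i in usecols]
    return list(row)


row = ["1.0", "2.0", "3.0"]
print(pick_columns_buggy(row, []))  # every field is returned
print(pick_columns_fixed(row, []))  # no fields are returned
```

The only difference between the two helpers is the test against `None`, which is the kind of change the description refers to.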

@WarrenWeckesser WarrenWeckesser marked this pull request as draft June 18, 2020 14:03
@WarrenWeckesser WarrenWeckesser marked this pull request as ready for review June 18, 2020 15:24
@WarrenWeckesser
Member Author

The maintenance PR (#16633) is merged, and this PR now has just the changes related to fixing the handling of usecols=[].

Comment on lines +1079 to +1081
if len(usecols) == 0:
    shp = (0, 0) if ndmin == 2 else (0,)
    return np.empty(shp, dtype=dtype)
Member

I'd perhaps expect an (N, 0) array (or maybe (0, N)?) where N is the number of lines in the file.

Member Author

I thought about doing that (specifically (N, 0), or (0, N) if unpack is True). But then I wondered if that was a foolish consistency. However, if in fact most people would expect (N, 0), then that's what it should do.

Member

Is this early exit even needed? What happens if you let the rest of the function run, to avoid needing a special case?

Member

I think we should go with (N, 0); ideally it works without any special handling. Speed is not relevant for such a weird corner case anyway. It is a bit strange, but you could imagine reading K columns from multiple files and concatenating them, which only makes sense if the shape includes the N correctly.
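
The concatenation argument can be checked directly with NumPy. This is a small sketch that uses hand-built arrays as stand-ins for loadtxt results (no file is read); the variable names are illustrative only.

```python
import numpy as np

n_rows = 3
from_file_a = np.ones((n_rows, 2))   # stand-in for K=2 columns read from one file
no_columns = np.empty((n_rows, 0))   # stand-in for a usecols=[] result of shape (N, 0)

# An (N, 0) block joins cleanly with an (N, K) block along the column axis.
combined = np.concatenate([no_columns, from_file_a], axis=1)
print(combined.shape)

# Transposing, as unpack=True would, gives the (0, N) shape mentioned earlier.
print(no_columns.T.shape)

# A (0, 0) result has lost N, so the same concatenation is rejected.
try:
    np.concatenate([np.empty((0, 0)), from_file_a], axis=1)
except ValueError as exc:
    print("ValueError:", exc)
```

Keeping N in the shape is what lets the empty selection behave like any other column block.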

Member Author

I'm working on an update to return (N, 0). I just found a bug in how loadtxt handles nested structured dtypes, so I'll try to fix that, and then get back to this PR.

Member Author

For future reference, the bug that I encountered is #16678.

@@ -1071,6 +1071,15 @@ def read_data(chunk_size):

    dtype_types, packing = flatten_dtype_internal(dtype)

    if usecols is not None:
        if len(dtype_types) > 1 and len(usecols) != len(dtype_types):
Member

perhaps clearer as:

Suggested change
-        if len(dtype_types) > 1 and len(usecols) != len(dtype_types):
+        if len(dtype_types) <= 1:
+            pass  # this is ok because ???
+        elif len(usecols) != len(dtype_types):

Member

Although I understand that this should be a rare case, maybe it makes sense to make it work for len(dtype_types) >= 1, for one-element structured dtypes?

Base automatically changed from master to main March 4, 2021 02:04
@seberg
Member

seberg commented Feb 8, 2022

Closing, the error is now (also coming from Warren via npreadtext I think):

TypeError: If a structured dtype is used, the number of columns in `usecols` must match the effective number of fields. But 2 usecols were given and the number of fields is 3.

Although, I guess we could also use ValueError as here.
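
For illustration, a length check of this kind might look like the sketch below. It is not the NumPy or npreadtext implementation: `check_usecols` and its `n_fields` parameter are hypothetical, the sketch assumes a structured dtype whose flattened field count is already known, and it raises ValueError only because that alternative is floated above (the quoted message comes from a TypeError).

```python
def check_usecols(usecols, n_fields):
    """Raise if usecols cannot be matched to a structured dtype's fields."""
    if usecols is None:
        return  # no column selection was requested, nothing to check
    if len(usecols) != n_fields:
        raise ValueError(
            f"{len(usecols)} usecols were given but the dtype has "
            f"{n_fields} fields"
        )


check_usecols([0, 2, 3], n_fields=3)  # matching lengths pass silently

try:
    check_usecols([0, 1], n_fields=3)
except ValueError as exc:
    print(exc)
```

Because the check compares against `None` and not truthiness, an empty `usecols` list is also validated instead of being skipped.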

@seberg seberg closed this Feb 8, 2022