BUG: interpret 'c' PEP3118/struct type as 'S1'. #7803

Merged
merged 1 commit into from
Jul 7, 2016

jzwinck
Contributor

@jzwinck jzwinck commented Jul 5, 2016

Before this, a 'c' type code in a PEP3118 buffer would result in
failure to construct a NumPy array. Now it's interpreted as a
single character, as in Python's struct module. This means
'4c' is an array of 4 strings of size 1, while '4s' is (as before)
a single string of size 4.
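
As a rough illustration of the distinction described above (not taken from the patch itself), the sketch below compares how Python's struct module treats '4c' and '4s', and shows the NumPy dtypes 'S1' and 'S4' that correspond to them. The `memoryview(...).cast("c")` step is just one assumed way to obtain a PEP 3118 buffer whose format is 'c'; the variable names are illustrative only.

```python
import struct

import numpy as np

raw = b"abcd"

# struct: '4c' is four separate 1-byte values, '4s' is one 4-byte string.
four_chars = struct.unpack("4c", raw)
one_string = struct.unpack("4s", raw)
print(four_chars)  # (b'a', b'b', b'c', b'd')
print(one_string)  # (b'abcd',)

# The corresponding NumPy dtypes: four 'S1' items versus one 'S4' item.
chars = np.frombuffer(raw, dtype="S1")
string = np.frombuffer(raw, dtype="S4")
print(chars.shape, string.shape)

# One way to get a PEP 3118 buffer with format 'c' (an assumption made
# for this example): cast a memoryview, then hand it to NumPy.
view = memoryview(raw).cast("c")
from_buffer = np.asarray(view)
print(view.format, from_buffer.dtype, from_buffer.shape)
```

With the change described here, the last line should report an 'S1' array of length 4 rather than failing.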

@ahaldane
Member

ahaldane commented Jul 6, 2016

LGTM, though we should be clear about the design decision being made here: We are assuming that numpy should do exactly what the struct module currently does.

As discussed in http://bugs.python.org/issue15622 there is some ambiguity in the docs. The struct module docs say one thing about c, while PEP3118 says something else. In that thread it was concluded that the struct module doc takes precedence. Therefore c is interpreted as a single character (byte in python3) by struct.

It makes most sense to me that numpy should follow suit. I'm following the same principle in #7798, and note that there has been recent related discussion on the Python dev list at https://bugs.python.org/issue26746 and https://bugs.python.org/issue3132.

I'd like to merge, but I'll wait a day or so to give others a chance to comment in case that doesn't sound right.

@jzwinck
Contributor Author

jzwinck commented Jul 6, 2016

Thanks for pointing that out, @ahaldane - I agree. This patch does implement a certain decision which may (depending on your reading) not be aligned with PEP 3118. However, PEP 3118 was never fully implemented, and as pointed out in some of your links, it was erroneous from the beginning, in the sense that it claimed to "add" the "c" typecode when in fact "c" existed in the struct module before the PEP was written.

Related: With or without this patch I would like a way to create a NumPy array from a buffer of UTF-8 strings without copying. I briefly hoped "c" could be that way, but there appears to be no single-byte unicode type in NumPy, so this seems impossible (without having to copy the 1-byte characters into 4-byte ones).
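
A small sketch of the size mismatch mentioned above, assuming plain ASCII input for simplicity: a buffer can be viewed as 'S1' without copying, but converting it to NumPy's unicode dtype produces a new array with 4 bytes per character. The variable names are illustrative.

```python
import numpy as np

raw = b"abcd"

# Zero-copy view of the buffer as single-byte strings.
as_bytes = np.frombuffer(raw, dtype="S1")

# Converting to the unicode dtype allocates a new array (4 bytes per char).
as_text = as_bytes.astype("U1")

print(as_bytes.itemsize, as_text.itemsize)            # 1 4
print(as_bytes.flags.owndata, as_text.flags.owndata)  # False True
```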

@ahaldane
Member

ahaldane commented Jul 6, 2016

Re: UTF-8, see this discussion: String type again.

The problem with UTF-8 is that it is a variable-width encoding, while numpy arrays have a fixed-width type, meaning you might get unexpected string truncation. There are various ideas in that thread to work around this problem.
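
To make the truncation concern concrete, here is a minimal sketch (the example string is an arbitrary choice): a 4-character word takes 5 bytes in UTF-8, so storing it in a 4-byte fixed-width field cuts a multi-byte character in half.

```python
import numpy as np

# "café": 4 characters, but 5 bytes once encoded as UTF-8.
encoded = "caf\u00e9".encode("utf-8")
print(len(encoded))  # 5

# A fixed-width 4-byte field silently drops the last byte.
arr = np.array([encoded], dtype="S4")
truncated = bytes(arr[0])
print(truncated)  # b'caf\xc3'

# The remaining bytes are no longer valid UTF-8.
try:
    truncated.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)  # False
```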

There is another discussion somewhere (can't find it now) about how UTF-32 was chosen because it is the only fixed-width encoding.

@jzwinck
Contributor Author

jzwinck commented Jul 6, 2016

@ahaldane That's interesting. For the purposes of this discussion (PEP 3118), truncation is not a concern because the data comes from elsewhere.

I do think that adding a mapping for the Python 3.3 'a' single-byte type (which I didn't know about until now) would be useful. There is plenty of existing data in the world which consists of columns of ASCII characters, and right now it's awkward to handle because in Python 3 these come out as bytes, which leads to unhappiness due to pandas-dev/pandas#9712 and multitudes of similar issues. A way to tell NumPy to view a column of char data from C as str in Python 3 would be useful. Cython has something similar already using "directives."
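
For context, a sketch of the current workaround rather than the proposed feature: a column of single ASCII characters arrives as bytes in Python 3, and getting str values means decoding into a new array, here with `np.char.decode`. The sample data is made up for the example.

```python
import numpy as np

# A column of single ASCII characters, as it might arrive from C.
column = np.frombuffer(b"YNYY", dtype="S1")
print(column.tolist())  # [b'Y', b'N', b'Y', b'Y']

# Decoding produces str values, but as a separate unicode array (a copy).
as_str = np.char.decode(column, "ascii")
print(as_str.tolist())  # ['Y', 'N', 'Y', 'Y']
print(as_str.dtype.kind)  # U
```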

@ahaldane
Member

ahaldane commented Jul 7, 2016

All right, no objections, so I'll merge. Thanks @jzwinck

@eric-wieser
Member

Fixes #4469
