Accessing elements of an array with dtype="U" may raise UnicodeDecodeError #15363

Closed
Zac-HD opened this issue Jan 21, 2020 · 9 comments · Fixed by #15385

@Zac-HD
Contributor

Zac-HD commented Jan 21, 2020

Reproducing code example:

import sys
import numpy as np

arr = np.zeros(shape=1, dtype="U1")
for i in range(1, sys.maxunicode + 1):
    arr[0] = chr(i)
    arr[0]

Error message:

In [35]: arr = np.zeros(shape=1, dtype="U1")
    ...: for i in range(1, sys.maxunicode + 1):
    ...:     arr[0] = chr(i)
    ...:     arr[0]
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-35-5a43f3fdb19c> in <module>
      2 for i in range(1, sys.maxunicode + 1):
      3     arr[0] = chr(i)
----> 4     arr[0]

UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: 
	code point in surrogate code point range(0xd800, 0xe000)

For some later code points the message is instead "code point not in range(0x110000)", which is false, since the problematic code points are all <= 0xdfff. For extra bonus fun, some affected code points are < 0xd800, and not all surrogate characters are affected.
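Both messages quoted above are reasons reported by CPython's strict UTF-32 decoder, so it can help to see what input produces each one. The following is a minimal sketch using only the standard library; `decode_reason` is a hypothetical helper, and it does not reproduce the NumPy bug itself. It only shows which 32-bit values trigger which complaint.

```python
def decode_reason(code_unit: int) -> str:
    """Return the strict UTF-32-LE decoder's complaint for one 32-bit value.

    Returns an empty string if the value decodes cleanly.
    """
    raw = code_unit.to_bytes(4, "little")
    try:
        raw.decode("utf-32-le")
    except UnicodeDecodeError as exc:
        return exc.reason
    return ""


# A lone surrogate (the first code point reported as failing in the array case).
surrogate_reason = decode_reason(0xD82D)

# A value just past the last valid code point, U+10FFFF.
out_of_range_reason = decode_reason(0x110000)

print(surrogate_reason)
print(out_of_range_reason)
```

Seeing the out-of-range reason for a code point that is actually <= 0xdfff suggests that the four bytes handed to the decoder were not the four bytes that were stored, although that reading is an inference rather than something stated in the report.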

Numpy/Python version information:

Numpy 1.17.4
Python 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)]

@eric-wieser
Member

eric-wieser commented Jan 21, 2020

Does it work without numpy?

import sys
for i in range(1, sys.maxunicode + 1):
    chr(i).encode('utf-32-le').decode('utf-32-le')

@Zac-HD
Contributor Author

Zac-HD commented Jan 21, 2020

      1 for i in range(1, sys.maxunicode + 1):
----> 2     chr(i).encode('utf-32-le').decode('utf-32-le')
      3

UnicodeEncodeError: 'utf-32-le' codec can't encode character '\ud800' in position 0:
    surrogates not allowed

But note that the array problem doesn't start until '\ud82d', so some surrogates clearly are allowed.

(it would also be handy if Numpy reported the problematic string in the error message)
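To make the contrast with the array behaviour concrete, the pure-Python loop can be extended to collect every code point the strict codec refuses instead of stopping at the first one. This is a sketch using only the standard library; the variable name `rejected` is illustrative.

```python
import sys

# Collect every code point that the strict UTF-32-LE encoder rejects.
rejected = []
for i in range(1, sys.maxunicode + 1):
    try:
        chr(i).encode("utf-32-le")
    except UnicodeEncodeError:
        rejected.append(i)

print(hex(rejected[0]), hex(rejected[-1]), len(rejected))
```

The strict codec rejects the whole surrogate block and nothing else, so an array failure that begins partway into that block and also touches code points below 0xd800 cannot be explained by the codec's strictness alone.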

@mattip
Member

mattip commented Jan 21, 2020

Are there open issues that point to this being a problem users have encountered? There are many dark corners in using characters/strings in NumPy arrays since there will always be an impedance mismatch between arrays and unicode. Perhaps we should leave them alone.

@Zac-HD
Contributor Author

Zac-HD commented Jan 21, 2020

If the impedance mismatch is so bad that users can't rely on being able to index their arrays, surely we should be deprecating unicode (and perhaps all string) arrays altogether? Object arrays of strs are generally easier to reason about, at least.

I haven't seen issues related to it, but I would expect this to arise in practice for users who are processing text files on Windows (utf-16), and that seems more and more likely with the Python 2 transition now happening even in large and slow institutions. String arrays, numpy, and memory-mapped columnar text files are otherwise a great fit!

@charris
Member

charris commented Jan 21, 2020

ISTR reading recent discussions of Python and unicode surrogates somewhere. My impression was that things were in flux.

@eric-wieser
Member

For reference this succeeds:

import sys
for i in range(1, sys.maxunicode + 1):
    chr(i).encode('utf-32-le', errors='surrogatepass').decode('utf-32-le', errors='surrogatepass')

@Zac-HD
Contributor Author

Zac-HD commented Jan 21, 2020

I think using errors='surrogatepass' when accessing elements of unicode arrays would resolve this issue 😄
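As a rough illustration of that suggestion, here is a sketch of reading one fixed-width cell of 4-byte code units with `errors='surrogatepass'`. The helper `read_cell` is hypothetical and is not NumPy's implementation. It assumes little-endian storage and NUL padding, which mirrors how a "U" dtype lays out its data.

```python
def read_cell(raw: bytes) -> str:
    """Decode one fixed-width UTF-32-LE cell, tolerating lone surrogates.

    Trailing NUL padding is stripped, on the assumption that short strings
    are padded with zeros up to the cell width.
    """
    text = raw.decode("utf-32-le", errors="surrogatepass")
    return text.rstrip("\x00")


# A one-character cell holding a lone surrogate.
surrogate_cell = "\ud82d".encode("utf-32-le", errors="surrogatepass")

# A two-character cell holding "a" plus one code unit of padding.
padded_cell = "a".encode("utf-32-le") + b"\x00" * 4

print(ascii(read_cell(surrogate_cell)))
print(ascii(read_cell(padded_cell)))
```

The same bytes fail under the default strict handler, which is why the choice of error handler matters whenever stored text may contain unpaired surrogates.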

@eric-wieser
Member

Solution here is to stop using our PyUCS2Buffer_AsUCS4 workaround and just use the Python 3.3 API. Will put out a patch in the next few days.

eric-wieser added a commit to eric-wieser/numpy that referenced this issue Feb 8, 2020
These APIs work with either UCS2 or UCS4, depending on the value of `Py_UNICODE_WIDE`.
After python 3.3, there's a better way to handle this type of thing, which means we no longer have to care about this.

Fixes gh-3258
Fixes gh-15363

@Zac-HD
Contributor Author

Zac-HD commented Feb 14, 2020

Awesome! Thanks @eric-wieser and all reviewers ❤️

Zac-HD added a commit to Zac-HD/numpy that referenced this issue Jun 30, 2020
The bug was reported in numpy#15363 and fixed in numpy#15385, before Numpy decided to allow Hypothesis in its own test suite. Since it does now, I thought it would be nice to include the test that found the bug as well as the more specific regression test I wrote.