Accessing elements of an array with `dtype="U"` may raise `UnicodeDecodeError` #15363

Zac-HD · 2020-01-21T02:23:00Z

Reproducing code example:

import sys
import numpy as np

arr = np.zeros(shape=1, dtype="U1")
for i in range(1, sys.maxunicode + 1):
    arr[0] = chr(i)
    arr[0]

Error message:

In [35]: arr = np.zeros(shape=1, dtype="U1")
    ...: for i in range(1, sys.maxunicode + 1):
    ...:     arr[0] = chr(i)
    ...:     arr[0]
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-35-5a43f3fdb19c> in <module>
      2 for i in range(1, sys.maxunicode + 1):
      3     arr[0] = chr(i)
----> 4     arr[0]

UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: 
	code point in surrogate code point range(0xd800, 0xe000)

For some later code points the message is instead `code point not in range(0x110000)`, which is false, as the problematic code points are all <= 0xdfff. For extra bonus fun, some affected code points are < 0xd800, and not all surrogate characters are affected.

Numpy/Python version information:

Numpy 1.17.4
Python 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)]

eric-wieser · 2020-01-21T02:37:21Z

Does it work without numpy?

import sys
for i in range(1, sys.maxunicode + 1):
    chr(i).encode('utf-32-le').decode('utf-32-le')

Zac-HD · 2020-01-21T02:53:42Z

      1 for i in range(1, sys.maxunicode + 1):
----> 2     chr(i).encode('utf-32-le').decode('utf-32-le')
      3

UnicodeEncodeError: 'utf-32-le' codec can't encode character '\ud800' in position 0:
    surrogates not allowed

But note that the array problem doesn't start until '\ud82d', so some surrogates clearly are allowed.

(it would also be handy if Numpy reported the problematic string in the error message)
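
One way to see exactly which code points fail, rather than stopping at the first exception, is to collect them all. The following is a minimal sketch using only the standard library. `failing_code_points` and `strict_utf32_roundtrip` are hypothetical helper names, and the round trip shown is the pure-Python codec check from the comment above, not NumPy's indexing path.

```python
import sys


def failing_code_points(roundtrip):
    """Return every code point whose one-character string makes `roundtrip` raise."""
    failures = []
    for i in range(1, sys.maxunicode + 1):
        try:
            roundtrip(chr(i))
        except UnicodeError:  # base class of UnicodeEncodeError and UnicodeDecodeError
            failures.append(i)
    return failures


def strict_utf32_roundtrip(s):
    # Hypothetical stand-in for the operation under test (strict codec, no NumPy).
    return s.encode("utf-32-le").decode("utf-32-le")


bad = failing_code_points(strict_utf32_roundtrip)
print(f"{len(bad)} failures, from {bad[0]:#x} to {bad[-1]:#x}")
# prints: 2048 failures, from 0xd800 to 0xdfff
```

With the strict codec, the failures are exactly the surrogate range. To inspect the array behaviour reported here, one could instead pass a function that assigns its argument to `arr[0]` and reads it back, then compare the two lists.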

mattip · 2020-01-21T03:04:48Z

Are there open issues that point to this being a problem users have encountered? There are many dark corners in using characters/strings in NumPy arrays since there will always be an impedance mismatch between arrays and unicode. Perhaps we should leave them alone.

Zac-HD · 2020-01-21T03:37:56Z

If the impedance mismatch is so bad that users can't rely on being able to index their arrays, surely we should be deprecating unicode (and perhaps all string) arrays altogether? object arrays of strs are generally easier to reason about, at least.

I haven't seen issues related to it, but I would expect this to arise in practice for users who are processing text files on Windows (utf-16), and that seems more and more likely with the Python 2 transition now happening even in large and slow institutions. String arrays, numpy, and memory-mapped columnar text files are otherwise a great fit!

charris · 2020-01-21T14:52:10Z

ISTR reading recent discussions of Python and unicode surrogates somewhere. My impression was that things were in flux.

eric-wieser · 2020-01-21T15:19:12Z

For reference this succeeds:

import sys
for i in range(1, sys.maxunicode + 1):
    chr(i).encode('utf-32-le', errors='surrogatepass').decode('utf-32-le', errors='surrogatepass')
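
To make the difference concrete for a single lone surrogate, here is a small sketch using only the standard library. It illustrates the codec behaviour only and makes no claim about how NumPy decodes array elements; the variable names are illustrative.

```python
lone = "\ud800"  # a lone surrogate, the first code point the strict codec rejects

# The strict encoder refuses surrogates.
try:
    lone.encode("utf-32-le")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# With surrogatepass the code point is written out as an ordinary 4-byte unit.
raw = lone.encode("utf-32-le", errors="surrogatepass")

# The strict decoder refuses those same bytes.
try:
    raw.decode("utf-32-le")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# surrogatepass on both sides round-trips the original string.
restored = raw.decode("utf-32-le", errors="surrogatepass")

print(strict_encode_ok, strict_decode_ok, raw, restored == lone)
# prints: False False b'\x00\xd8\x00\x00' True
```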

Zac-HD · 2020-01-21T22:00:14Z

I think using errors='surrogatepass' when accessing elements of unicode arrays would resolve this issue 😄

eric-wieser · 2020-01-22T10:29:09Z

Solution here is to stop using our PyUCS2Buffer_AsUCS4 workaround and just use the Python 3.3 API. Will put out a patch in the next few days.

These APIs work with either UCS2 or UCS4, depending on the value of `Py_UNICODE_WIDE`. After Python 3.3, there's a better way to handle this type of thing, which means we no longer have to care about this. Fixes gh-3258. Fixes gh-15363.
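
For background on the UCS2/UCS4 distinction: with 2-byte units (UTF-16), a code point above 0xFFFF is stored as a pair of surrogate code units, while with 4-byte units (UTF-32/UCS4) it is a single unit. The sketch below shows that standard UTF-16 arithmetic in Python. It is an illustration only, not NumPy's C implementation, and `combine_surrogates` is a hypothetical name.

```python
import struct


def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


ch = "\U0001F600"  # a code point outside the Basic Multilingual Plane

# In UTF-16 the character occupies two 2-byte code units, both surrogates.
high, low = struct.unpack("<2H", ch.encode("utf-16-le"))
print(hex(high), hex(low))                 # prints: 0xd83d 0xde00
print(hex(combine_surrogates(high, low)))  # prints: 0x1f600

# In UTF-32 the same character is one 4-byte unit.
print(len(ch.encode("utf-32-le")))         # prints: 4
```

A surrogate that appears on its own, outside such a pair, has no code point to map to, which is why strict codecs reject it.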

Zac-HD · 2020-02-14T02:40:26Z

Awesome! Thanks @eric-wieser and all reviewers ❤️

The bug was reported in numpy#15363 and fixed in numpy#15385, before Numpy decided to allow Hypothesis in its own test suite. Since it does now, I thought it would be nice to include the test that found the bug as well as the more specific regression test I wrote.

Zac-HD mentioned this issue Jan 21, 2020

Roll back 5.2.0 (numpy unicode fun) for now HypothesisWorks/hypothesis#2329

Merged

Zac-HD mentioned this issue Feb 2, 2020

BUG, MAINT: Stop using the error-prone deprecated Py_UNICODE apis #15385

Merged

eric-wieser mentioned this issue Feb 4, 2020

BUG: np.unicode_ scalars misbehave on narrow builds #3258

Closed

seberg closed this as completed in #15385 Feb 14, 2020

Zac-HD mentioned this issue Jun 2, 2020

Generate non-UTF-8 unicode for Numpy arrays HypothesisWorks/hypothesis#2455

Closed

flexatone mentioned this issue Jan 20, 2022

fillfalsy_backward raises UnicodeDecodeError when backward-filling strings on windows static-frame/static-frame#427

Closed

flexatone mentioned this issue Mar 4, 2022

Assigning NumPy string to Frame with "U6" arrays raises UnicodeDecodeError on display static-frame/static-frame#444

Closed

Zac-HD mentioned this issue Jul 3, 2022

Generate non-UTF-8 unicode for Numpy arrays HypothesisWorks/hypothesis#3394

Merged
