Accessing elements of an array with dtype="U" may raise UnicodeDecodeError #15363
Comments
Does it work without numpy?

    import sys
    for i in range(1, sys.maxunicode + 1):
        chr(i).encode('utf-32-le').decode('utf-32-le')
|
          1 for i in range(1, sys.maxunicode + 1):
    ----> 2     chr(i).encode('utf-32-le').decode('utf-32-le')
          3
    UnicodeEncodeError: 'utf-32-le' codec can't encode character '\ud800' in position 0: surrogates not allowed

But note that the array problem doesn't start until (it would also be handy if Numpy reported the problematic string in the error message) |
Are there open issues that point to this being a problem users have encountered? There are many dark corners in using characters/strings in NumPy arrays since there will always be an impedance mismatch between arrays and unicode. Perhaps we should leave them alone. |
If the impedance mismatch is so bad that users can't rely on being able to index their arrays, surely we should be deprecating unicode (and perhaps all string) arrays altogether? I haven't seen issues related to it, but I would expect this to arise in practice for users who are processing text files on Windows (utf-16), and that seems more and more likely with the Python 2 transition now happening even in large and slow institutions. String arrays, numpy, and memory-mapped columnar text files are otherwise a great fit! |
ISTR reading recent discussions of Python and unicode surrogates somewhere. My impression was that things were in flux. |
For reference this succeeds:

    import sys
    for i in range(1, sys.maxunicode + 1):
        chr(i).encode('utf-32-le', errors='surrogatepass').decode('utf-32-le', errors='surrogatepass')
|
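To make the difference between the two error handlers concrete, here is a minimal stdlib-only sketch that collects which code points fail the UTF-32-LE round trip under each handler. The helper name `roundtrips` is made up for illustration, and this only exercises Python's codec, not NumPy's indexing path.

```python
import sys

def roundtrips(ch, errors="strict"):
    """Return True if ch survives a UTF-32-LE encode/decode round trip."""
    try:
        encoded = ch.encode("utf-32-le", errors=errors)
        return encoded.decode("utf-32-le", errors=errors) == ch
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return False

code_points = range(1, sys.maxunicode + 1)

# Code points rejected by the default ("strict") handler.
strict_failures = [i for i in code_points if not roundtrips(chr(i))]

# Code points still rejected when surrogates are passed through.
lenient_failures = [i for i in code_points
                    if not roundtrips(chr(i), errors="surrogatepass")]

if strict_failures:
    print(f"strict: {len(strict_failures)} failures, "
          f"{strict_failures[0]:#x}..{strict_failures[-1]:#x}")
print(f"surrogatepass: {len(lenient_failures)} failures")
```

With the plain codec, the strict failures should be exactly the surrogate block, which is narrower than the set of code points the issue reports as affected inside arrays.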
I think using |
Solution here is to stop using our |
These APIs work with either UCS2 or UCS4, depending on the value of `Py_UNICODE_WIDE`. Since Python 3.3 there's a better way to handle this type of thing, which means we no longer have to care about this.
Fixes numpygh-3258
Fixes numpygh-15363
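As a rough Python-level illustration of why a fixed-width buffer does not need a codec at all: the sketch below stores each code point as one 32-bit little-endian unit (the layout assumed here for `dtype="U"` data) and rebuilds the string with `chr()`, which has no surrogate check. The helpers `pack_ucs4` and `unpack_ucs4` are hypothetical names, and this is not the C change made in the fix, only an analogy.

```python
import struct

def pack_ucs4(text):
    """Store each code point as one little-endian 32-bit unit."""
    return struct.pack(f"<{len(text)}I", *map(ord, text))

def unpack_ucs4(buf):
    """Rebuild a str by mapping each 32-bit unit straight to chr()."""
    count = len(buf) // 4
    return "".join(map(chr, struct.unpack(f"<{count}I", buf)))

text = "a\ud800b"          # contains a lone surrogate
buf = pack_ucs4(text)

# Reading the units directly preserves the lone surrogate.
restored = unpack_ucs4(buf)

# Going through the strict codec rejects the same bytes.
try:
    buf.decode("utf-32-le")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(restored == text, strict_ok)
```

The design point is that the buffer already holds code points, so any path that routes it through a strict UTF-32 decoder adds a validity check the stored data was never required to pass.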
Awesome! Thanks @eric-wieser and all reviewers ❤️ |
The bug was reported in numpy#15363 and fixed in numpy#15385, before NumPy decided to allow Hypothesis in its own test suite. Since it does now, I thought it would be nice to include the test that found the bug as well as the more specific regression test I wrote.
Reproducing code example:
Error message:
For some later codepoints the message is instead `code point not in range(0x110000)` (false, as the problematic codepoints are all `<= 0xdfff`). For extra bonus fun some affected code points are `< 0xd800`, and not all surrogate characters are affected.
Numpy/Python version information:
Numpy 1.17.4
Python 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)]