Commit da9ec99

Victor Stinner authored and committed

Issue #10783: struct.pack() doesn't encode implicitly unicode to UTF-8

* Replace "bytes" by "bytes object" in struct error messages
* Document the API change in What's new in Python 3.2
* Fix test_wave
* Remove also ugly implicit conversions in test_struct
1 parent e398da9 commit da9ec99

6 files changed

Lines changed: 83 additions & 107 deletions

Doc/library/struct.rst

Lines changed: 22 additions & 27 deletions
@@ -164,71 +164,66 @@ platform-dependent.
 +--------+--------------------------+--------------------+----------------+------------+
 | ``c`` | :c:type:`char` | bytes of length 1 | 1 | |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``b`` | :c:type:`signed char` | integer | 1 | \(1),\(4) |
+| ``b`` | :c:type:`signed char` | integer | 1 | \(1),\(3) |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``B`` | :c:type:`unsigned char` | integer | 1 | \(4) |
+| ``B`` | :c:type:`unsigned char` | integer | 1 | \(3) |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``?`` | :c:type:`_Bool` | bool | 1 | \(2) |
+| ``?`` | :c:type:`_Bool` | bool | 1 | \(1) |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``h`` | :c:type:`short` | integer | 2 | \(4) |
+| ``h`` | :c:type:`short` | integer | 2 | \(3) |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``H`` | :c:type:`unsigned short` | integer | 2 | \(4) |
+| ``H`` | :c:type:`unsigned short` | integer | 2 | \(3) |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``i`` | :c:type:`int` | integer | 4 | \(4) |
+| ``i`` | :c:type:`int` | integer | 4 | \(3) |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``I`` | :c:type:`unsigned int` | integer | 4 | \(4) |
+| ``I`` | :c:type:`unsigned int` | integer | 4 | \(3) |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``l`` | :c:type:`long` | integer | 4 | \(4) |
+| ``l`` | :c:type:`long` | integer | 4 | \(3) |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``L`` | :c:type:`unsigned long` | integer | 4 | \(4) |
+| ``L`` | :c:type:`unsigned long` | integer | 4 | \(3) |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``q`` | :c:type:`long long` | integer | 8 | \(3), \(4) |
+| ``q`` | :c:type:`long long` | integer | 8 | \(2), \(3) |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``Q`` | :c:type:`unsigned long | integer | 8 | \(3), \(4) |
+| ``Q`` | :c:type:`unsigned long | integer | 8 | \(2), \(3) |
 | | long` | | | |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``f`` | :c:type:`float` | float | 4 | \(5) |
+| ``f`` | :c:type:`float` | float | 4 | \(4) |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``d`` | :c:type:`double` | float | 8 | \(5) |
+| ``d`` | :c:type:`double` | float | 8 | \(4) |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``s`` | :c:type:`char[]` | bytes | | \(1) |
+| ``s`` | :c:type:`char[]` | bytes | | |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``p`` | :c:type:`char[]` | bytes | | \(1) |
+| ``p`` | :c:type:`char[]` | bytes | | |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``P`` | :c:type:`void \*` | integer | | \(6) |
+| ``P`` | :c:type:`void \*` | integer | | \(5) |
 +--------+--------------------------+--------------------+----------------+------------+

 Notes:

 (1)
-The ``c``, ``s`` and ``p`` conversion codes operate on :class:`bytes`
-objects, but packing with such codes also supports :class:`str` objects,
-which are encoded using UTF-8.
-
-(2)
 The ``'?'`` conversion code corresponds to the :c:type:`_Bool` type defined by
 C99. If this type is not available, it is simulated using a :c:type:`char`. In
 standard mode, it is always represented by one byte.

-(3)
+(2)
 The ``'q'`` and ``'Q'`` conversion codes are available in native mode only if
 the platform C compiler supports C :c:type:`long long`, or, on Windows,
 :c:type:`__int64`. They are always available in standard modes.

-(4)
+(3)
 When attempting to pack a non-integer using any of the integer conversion
 codes, if the non-integer has a :meth:`__index__` method then that method is
 called to convert the argument to an integer before packing.

 .. versionchanged:: 3.2
 Use of the :meth:`__index__` method for non-integers is new in 3.2.

-(5)
+(4)
 For the ``'f'`` and ``'d'`` conversion codes, the packed representation uses
 the IEEE 754 binary32 (for ``'f'``) or binary64 (for ``'d'``) format,
 regardless of the floating-point format used by the platform.

-(6)
+(5)
 The ``'P'`` format character is only available for the native byte ordering
 (selected as the default or with the ``'@'`` byte order character). The byte
 order character ``'='`` chooses to use little- or big-endian ordering based
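
The note on integer conversion codes in the hunk above says that a non-integer argument with an `__index__` method is converted through that method before packing. A minimal sketch of that behaviour, assuming Python 3.2 or later; the `Port` class is hypothetical and exists only for this illustration:

```python
import struct


class Port:
    """Hypothetical non-integer type that can stand in for an integer."""

    def __init__(self, number):
        self.number = number

    def __index__(self):
        # struct calls this when an integer conversion code receives
        # an object that is not already an int.
        return self.number


# 8080 == 0x1F90, packed as a big-endian unsigned short.
packed = struct.pack('>H', Port(8080))
(unpacked,) = struct.unpack('>H', packed)
print(packed, unpacked)
```

Unpacking always returns a plain `int`, so the round trip does not give back a `Port` instance.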
@@ -310,9 +305,9 @@ the result in a named tuple::
 The ordering of format characters may have an impact on size since the padding
 needed to satisfy alignment requirements is different::

->>> pack('ci', '*', 0x12131415)
+>>> pack('ci', b'*', 0x12131415)
 b'*\x00\x00\x00\x12\x13\x14\x15'
->>> pack('ic', 0x12131415, '*')
+>>> pack('ic', 0x12131415, b'*')
 b'\x12\x13\x14\x15*'
 >>> calcsize('ci')
 8
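
The documentation example above depends on the platform's native alignment and byte order. The sketch below contrasts native mode with a standard mode, where no padding is inserted. The sizes in the comments assume a typical platform with a 4-byte, 4-byte-aligned `int`; only the standard-mode values are guaranteed.

```python
import struct

# Native mode (the default, '@') pads fields the way the C compiler would.
native_ci = struct.calcsize('ci')   # char + padding + int, typically 8
native_ic = struct.calcsize('ic')   # int + char, typically 5

# Standard modes ('<', '>', '=', '!') use fixed sizes and no padding.
standard_ci = struct.calcsize('<ci')
standard_ic = struct.calcsize('<ic')

# Little-endian standard mode: one byte for 'c', then four for 'i'.
packed = struct.pack('<ci', b'*', 0x12131415)
print(native_ci, native_ic, standard_ci, standard_ic, packed)
```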

Doc/whatsnew/3.2.rst

Lines changed: 4 additions & 0 deletions
@@ -1705,3 +1705,7 @@ require changes to your code:

 (Contributed by Georg Brandl and Mattias Brändström;
 `appspot issue 53094 <http://codereview.appspot.com/53094>`_.)
+
+* :func:`struct.pack` doesn't encode implicitly unicode to UTF-8 anymore: use
+explicit conversion instead and replace unicode literals by bytes literals.
+
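
The porting note above amounts to two options for code that used to pass `str` to `struct.pack()`: encode explicitly, or switch to a bytes literal. A minimal sketch, assuming Python 3.2 or later; the choice of ASCII as the encoding is an assumption for this example, not something the commit prescribes:

```python
import struct

# A str argument for 's' is no longer encoded implicitly.
try:
    struct.pack('4s', 'WAVE')
except struct.error:
    rejected = True
else:
    rejected = False

# Option 1: encode explicitly, choosing the encoding yourself.
explicit = struct.pack('4s', 'WAVE'.encode('ascii'))

# Option 2: use a bytes literal in the first place.
literal = struct.pack('4s', b'WAVE')

print(rejected, explicit, literal)
```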

Lib/test/test_struct.py

Lines changed: 46 additions & 57 deletions
@@ -82,58 +82,52 @@ def test_new_features(self):
 # Test some of the new features in detail
 # (format, argument, big-endian result, little-endian result, asymmetric)
 tests = [
-('c', 'a', 'a', 'a', 0),
-('xc', 'a', '\0a', '\0a', 0),
-('cx', 'a', 'a\0', 'a\0', 0),
-('s', 'a', 'a', 'a', 0),
-('0s', 'helloworld', '', '', 1),
-('1s', 'helloworld', 'h', 'h', 1),
-('9s', 'helloworld', 'helloworl', 'helloworl', 1),
-('10s', 'helloworld', 'helloworld', 'helloworld', 0),
-('11s', 'helloworld', 'helloworld\0', 'helloworld\0', 1),
-('20s', 'helloworld', 'helloworld'+10*'\0', 'helloworld'+10*'\0', 1),
-('b', 7, '\7', '\7', 0),
-('b', -7, '\371', '\371', 0),
-('B', 7, '\7', '\7', 0),
-('B', 249, '\371', '\371', 0),
-('h', 700, '\002\274', '\274\002', 0),
-('h', -700, '\375D', 'D\375', 0),
-('H', 700, '\002\274', '\274\002', 0),
-('H', 0x10000-700, '\375D', 'D\375', 0),
-('i', 70000000, '\004,\035\200', '\200\035,\004', 0),
-('i', -70000000, '\373\323\342\200', '\200\342\323\373', 0),
-('I', 70000000, '\004,\035\200', '\200\035,\004', 0),
-('I', 0x100000000-70000000, '\373\323\342\200', '\200\342\323\373', 0),
-('l', 70000000, '\004,\035\200', '\200\035,\004', 0),
-('l', -70000000, '\373\323\342\200', '\200\342\323\373', 0),
-('L', 70000000, '\004,\035\200', '\200\035,\004', 0),
-('L', 0x100000000-70000000, '\373\323\342\200', '\200\342\323\373', 0),
-('f', 2.0, '@\000\000\000', '\000\000\000@', 0),
-('d', 2.0, '@\000\000\000\000\000\000\000',
-'\000\000\000\000\000\000\000@', 0),
-('f', -2.0, '\300\000\000\000', '\000\000\000\300', 0),
-('d', -2.0, '\300\000\000\000\000\000\000\000',
-'\000\000\000\000\000\000\000\300', 0),
-('?', 0, '\0', '\0', 0),
-('?', 3, '\1', '\1', 1),
-('?', True, '\1', '\1', 0),
-('?', [], '\0', '\0', 1),
-('?', (1,), '\1', '\1', 1),
+('c', b'a', b'a', b'a', 0),
+('xc', b'a', b'\0a', b'\0a', 0),
+('cx', b'a', b'a\0', b'a\0', 0),
+('s', b'a', b'a', b'a', 0),
+('0s', b'helloworld', b'', b'', 1),
+('1s', b'helloworld', b'h', b'h', 1),
+('9s', b'helloworld', b'helloworl', b'helloworl', 1),
+('10s', b'helloworld', b'helloworld', b'helloworld', 0),
+('11s', b'helloworld', b'helloworld\0', b'helloworld\0', 1),
+('20s', b'helloworld', b'helloworld'+10*b'\0', b'helloworld'+10*b'\0', 1),
+('b', 7, b'\7', b'\7', 0),
+('b', -7, b'\371', b'\371', 0),
+('B', 7, b'\7', b'\7', 0),
+('B', 249, b'\371', b'\371', 0),
+('h', 700, b'\002\274', b'\274\002', 0),
+('h', -700, b'\375D', b'D\375', 0),
+('H', 700, b'\002\274', b'\274\002', 0),
+('H', 0x10000-700, b'\375D', b'D\375', 0),
+('i', 70000000, b'\004,\035\200', b'\200\035,\004', 0),
+('i', -70000000, b'\373\323\342\200', b'\200\342\323\373', 0),
+('I', 70000000, b'\004,\035\200', b'\200\035,\004', 0),
+('I', 0x100000000-70000000, b'\373\323\342\200', b'\200\342\323\373', 0),
+('l', 70000000, b'\004,\035\200', b'\200\035,\004', 0),
+('l', -70000000, b'\373\323\342\200', b'\200\342\323\373', 0),
+('L', 70000000, b'\004,\035\200', b'\200\035,\004', 0),
+('L', 0x100000000-70000000, b'\373\323\342\200', b'\200\342\323\373', 0),
+('f', 2.0, b'@\000\000\000', b'\000\000\000@', 0),
+('d', 2.0, b'@\000\000\000\000\000\000\000',
+b'\000\000\000\000\000\000\000@', 0),
+('f', -2.0, b'\300\000\000\000', b'\000\000\000\300', 0),
+('d', -2.0, b'\300\000\000\000\000\000\000\000',
+b'\000\000\000\000\000\000\000\300', 0),
+('?', 0, b'\0', b'\0', 0),
+('?', 3, b'\1', b'\1', 1),
+('?', True, b'\1', b'\1', 0),
+('?', [], b'\0', b'\0', 1),
+('?', (1,), b'\1', b'\1', 1),
 ]

 for fmt, arg, big, lil, asy in tests:
-big = bytes(big, "latin-1")
-lil = bytes(lil, "latin-1")
 for (xfmt, exp) in [('>'+fmt, big), ('!'+fmt, big), ('<'+fmt, lil),
 ('='+fmt, ISBIGENDIAN and big or lil)]:
 res = struct.pack(xfmt, arg)
 self.assertEqual(res, exp)
 self.assertEqual(struct.calcsize(xfmt), len(res))
 rev = struct.unpack(xfmt, res)[0]
-if isinstance(arg, str):
-# Strings are returned as bytes since you can't know the
-# encoding of the string when packed.
-arg = bytes(arg, 'latin1')
 if rev != arg:
 self.assertTrue(asy)

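
Each row of the test table above pairs a format and argument with its expected big-endian and little-endian encodings. A minimal sketch of what one such row checks, using the `('h', 700, ...)` entry; the variable names are illustrative and not taken from the test suite:

```python
import struct

# 700 == 0x02BC, so the two bytes swap places between byte orders.
big = struct.pack('>h', 700)       # big-endian
network = struct.pack('!h', 700)   # network order, same as big-endian
little = struct.pack('<h', 700)    # little-endian

# Unpacking with the matching format restores the original value.
round_trip = struct.unpack('<h', little)[0]
print(big, network, little, round_trip)
```

The expected values in the table use octal escapes: `b'\002\274'` is the same two bytes as `b'\x02\xbc'`.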
@@ -334,15 +328,14 @@ def __int__(self):
 def test_p_code(self):
 # Test p ("Pascal string") code.
 for code, input, expected, expectedback in [
-('p','abc', '\x00', b''),
-('1p', 'abc', '\x00', b''),
-('2p', 'abc', '\x01a', b'a'),
-('3p', 'abc', '\x02ab', b'ab'),
-('4p', 'abc', '\x03abc', b'abc'),
-('5p', 'abc', '\x03abc\x00', b'abc'),
-('6p', 'abc', '\x03abc\x00\x00', b'abc'),
-('1000p', 'x'*1000, '\xff' + 'x'*999, b'x'*255)]:
-expected = bytes(expected, "latin-1")
+('p', b'abc', b'\x00', b''),
+('1p', b'abc', b'\x00', b''),
+('2p', b'abc', b'\x01a', b'a'),
+('3p', b'abc', b'\x02ab', b'ab'),
+('4p', b'abc', b'\x03abc', b'abc'),
+('5p', b'abc', b'\x03abc\x00', b'abc'),
+('6p', b'abc', b'\x03abc\x00\x00', b'abc'),
+('1000p', b'x'*1000, b'\xff' + b'x'*999, b'x'*255)]:
 got = struct.pack(code, input)
 self.assertEqual(got, expected)
 (got,) = struct.unpack(code, got)
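
The `p` ("Pascal string") rows above store a one-byte length prefix followed by the data, truncated or zero-padded to fit the field. A short sketch using values taken from that table:

```python
import struct

# '5p': 1 length byte + 4 data bytes, so b'abc' fits and is zero-padded.
padded = struct.pack('5p', b'abc')
(back_padded,) = struct.unpack('5p', padded)

# '3p': 1 length byte + 2 data bytes, so the input is truncated to b'ab'.
truncated = struct.pack('3p', b'abc')
(back_truncated,) = struct.unpack('3p', truncated)

print(padded, back_padded, truncated, back_truncated)
```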
@@ -401,15 +394,11 @@ def test_unpack_from(self):
 s = struct.Struct(fmt)
 for cls in (bytes, bytearray):
 data = cls(test_string)
-if not isinstance(data, (bytes, bytearray)):
-bytes_data = bytes(data, 'latin1')
-else:
-bytes_data = data
 self.assertEqual(s.unpack_from(data), (b'abcd',))
 self.assertEqual(s.unpack_from(data, 2), (b'cd01',))
 self.assertEqual(s.unpack_from(data, 4), (b'0123',))
 for i in range(6):
-self.assertEqual(s.unpack_from(data, i), (bytes_data[i:i+4],))
+self.assertEqual(s.unpack_from(data, i), (data[i:i+4],))
 for i in range(6, len(test_string) + 1):
 self.assertRaises(struct.error, s.unpack_from, data, i)
 for cls in (bytes, bytearray):

Lib/wave.py

Lines changed: 2 additions & 2 deletions
@@ -467,11 +467,11 @@ def _write_header(self, initlength):
 self._datalength = self._nframes * self._nchannels * self._sampwidth
 self._form_length_pos = self._file.tell()
 self._file.write(struct.pack('<l4s4slhhllhh4s',
-36 + self._datalength, 'WAVE', 'fmt ', 16,
+36 + self._datalength, b'WAVE', b'fmt ', 16,
 WAVE_FORMAT_PCM, self._nchannels, self._framerate,
 self._nchannels * self._framerate * self._sampwidth,
 self._nchannels * self._sampwidth,
-self._sampwidth * 8, b'data'))
+self._sampwidth * 8, b'data'))
 self._data_length_pos = self._file.tell()
 self._file.write(struct.pack('<l', self._datalength))
 self._headerwritten = True
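
The `wave.py` change above is the same fix applied to real code: the chunk identifiers passed to the `4s` fields become bytes literals. A standalone sketch of that `struct.pack()` call follows. The stream parameters (two channels, 16-bit samples, 44100 Hz, no frames yet) and the value `0x0001` for `WAVE_FORMAT_PCM` are assumptions made for this example, since the hunk does not show them.

```python
import struct

# Assumed values for illustration only; the diff does not define them.
WAVE_FORMAT_PCM = 0x0001
nchannels, sampwidth, framerate, nframes = 2, 2, 44100, 0
datalength = nframes * nchannels * sampwidth

header = struct.pack(
    '<l4s4slhhllhh4s',
    36 + datalength, b'WAVE', b'fmt ', 16,
    WAVE_FORMAT_PCM, nchannels, framerate,
    nchannels * framerate * sampwidth,   # bytes per second
    nchannels * sampwidth,               # block alignment
    sampwidth * 8, b'data')              # bits per sample, data chunk id

print(len(header), header[4:12])
```

Passing `'WAVE'` instead of `b'WAVE'` here would raise `struct.error` on Python 3.2 and later, which is why `test_wave` needed the fix.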

Misc/NEWS

Lines changed: 3 additions & 0 deletions
@@ -18,6 +18,9 @@ Core and Builtins
 Library
 -------

+- Issue #10783: struct.pack() doesn't encode implicitly unicode to UTF-8
+anymore.
+
 - Issue #10730: Add SVG mime types to mimetypes module.

 - Issue #10768: Make the Tkinter ScrolledText widget work again.

Modules/_struct.c

Lines changed: 6 additions & 21 deletions
@@ -462,14 +462,9 @@ np_ubyte(char *p, PyObject *v, const formatdef *f)
 static int
 np_char(char *p, PyObject *v, const formatdef *f)
 {
-if (PyUnicode_Check(v)) {
-v = _PyUnicode_AsDefaultEncodedString(v, NULL);
-if (v == NULL)
-return -1;
-}
 if (!PyBytes_Check(v) || PyBytes_Size(v) != 1) {
 PyErr_SetString(StructError,
-"char format requires bytes or string of length 1");
+"char format requires a bytes object of length 1");
 return -1;
 }
 *p = *PyBytes_AsString(v);
@@ -1345,7 +1340,7 @@ s_init(PyObject *self, PyObject *args, PyObject *kwds)
 if (!PyBytes_Check(o_format)) {
 Py_DECREF(o_format);
 PyErr_Format(PyExc_TypeError,
-"Struct() argument 1 must be bytes, not %.200s",
+"Struct() argument 1 must be a bytes object, not %.200s",
 Py_TYPE(o_format)->tp_name);
 return -1;
 }
@@ -1423,7 +1418,7 @@ s_unpack(PyObject *self, PyObject *input)
 return NULL;
 if (vbuf.len != soself->s_size) {
 PyErr_Format(StructError,
-"unpack requires a bytes argument of length %zd",
+"unpack requires a bytes object of length %zd",
 soself->s_size);
 PyBuffer_Release(&vbuf);
 return NULL;
@@ -1503,15 +1498,10 @@ s_pack_internal(PyStructObject *soself, PyObject *args, int offset, char* buf)
 if (e->format == 's') {
 int isstring;
 void *p;
-if (PyUnicode_Check(v)) {
-v = _PyUnicode_AsDefaultEncodedString(v, NULL);
-if (v == NULL)
-return -1;
-}
 isstring = PyBytes_Check(v);
 if (!isstring && !PyByteArray_Check(v)) {
 PyErr_SetString(StructError,
-"argument for 's' must be a bytes or string");
+"argument for 's' must be a bytes object");
 return -1;
 }
 if (isstring) {
@@ -1529,15 +1519,10 @@ s_pack_internal(PyStructObject *soself, PyObject *args, int offset, char* buf)
 } else if (e->format == 'p') {
 int isstring;
 void *p;
-if (PyUnicode_Check(v)) {
-v = _PyUnicode_AsDefaultEncodedString(v, NULL);
-if (v == NULL)
-return -1;
-}
 isstring = PyBytes_Check(v);
 if (!isstring && !PyByteArray_Check(v)) {
 PyErr_SetString(StructError,
-"argument for 'p' must be a bytes or string");
+"argument for 'p' must be a bytes object");
 return -1;
 }
 if (isstring) {
@@ -1691,7 +1676,7 @@ static struct PyMethodDef s_methods[] = {
 {NULL, NULL} /* sentinel */
 };

-PyDoc_STRVAR(s__doc__,
+PyDoc_STRVAR(s__doc__,
 "Struct(fmt) --> compiled struct object\n"
 "\n"
 "Return a new Struct object which writes and reads binary data according to\n"
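
Seen from Python, the C checks above surface as `struct.error`. The sketch below exercises the `c` check in `np_char` and the length check in `s_unpack`. The helper `raises_struct_error` is hypothetical, and the sketch deliberately does not compare message text, because the wording may differ between Python versions.

```python
import struct


def raises_struct_error(func, *args):
    """Hypothetical helper: True if func(*args) raises struct.error."""
    try:
        func(*args)
    except struct.error:
        return True
    return False


# np_char: the argument must be a bytes object of length exactly 1.
str_for_c = raises_struct_error(struct.pack, 'c', 'a')
too_long_for_c = raises_struct_error(struct.pack, 'c', b'ab')
valid_c = raises_struct_error(struct.pack, 'c', b'a')

# s_unpack: the buffer length must match the size implied by the format.
short_buffer = raises_struct_error(struct.unpack, '<l', b'\x00')

print(str_for_c, too_long_for_c, valid_c, short_buffer)
```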