Commit bff879c

This patch finalizes the move from UTF-8 to a default encoding in the Python Unicode implementation.

The internal buffer used for implementing the buffer protocol is renamed to defenc to make this change visible. It now holds the default encoded version of the Unicode object and is calculated on demand (NULL otherwise).

Since the default encoding defaults to ASCII, this will mean that Unicode objects which hold non-ASCII characters will no longer work on C APIs using the "s" or "t" parser markers. C APIs must now explicitly provide Unicode support via the "u", "U" or "es"/"es#" parser markers in order to work with non-ASCII Unicode strings.

(Note: this patch will also have to be applied to the 1.6 branch of the CVS tree.)
1 parent 2b83b46 commit bff879c
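
To make the behavioural change in the commit message concrete, here is a minimal sketch in present-day Python 3, not the C code touched by this commit. The helper names `as_default_encoded` and `as_explicitly_encoded` are hypothetical stand-ins for what the "s"/"t" markers and the "es" marker do with a Unicode argument, and `DEFAULT_ENCODING` is an assumed module-level model of the interpreter-wide default.

```python
# Model of the default-encoding rule described in the commit message.
# Assumption: DEFAULT_ENCODING stands in for the interpreter-wide default.
DEFAULT_ENCODING = "ascii"


def as_default_encoded(text: str) -> bytes:
    """Roughly what an "s"/"t" style conversion does: use the default encoding."""
    return text.encode(DEFAULT_ENCODING)


def as_explicitly_encoded(text: str, encoding: str) -> bytes:
    """Roughly what an "es" style conversion does: the caller names the encoding."""
    return text.encode(encoding)


# ASCII-only text converts under the ASCII default.
plain = as_default_encoded("abc")

# Non-ASCII text is rejected under the ASCII default ...
try:
    as_default_encoded("caf\u00e9")
    rejected = False
except UnicodeEncodeError:
    rejected = True

# ... but still converts when the encoding is given explicitly.
explicit = as_explicitly_encoded("caf\u00e9", "utf-8")
```

The point of the sketch is only the asymmetry: callers that rely on the implicit default lose non-ASCII support, while callers that name an encoding keep it.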

4 files changed

Lines changed: 109 additions & 65 deletions

Include/unicodeobject.h

Lines changed: 3 additions & 2 deletions
@@ -204,8 +204,9 @@ typedef struct {
     int length;                 /* Length of raw Unicode data in buffer */
     Py_UNICODE *str;            /* Raw Unicode buffer */
     long hash;                  /* Hash value; -1 if not set */
-    PyObject *utf8str;          /* UTF-8 encoded version as Python string,
-                                   or NULL */
+    PyObject *defenc;           /* (Default) Encoded version as Python
+                                   string, or NULL; this is used for
+                                   implementing the buffer protocol */
 } PyUnicodeObject;
 
 extern DL_IMPORT(PyTypeObject) PyUnicode_Type;

Misc/unicode.txt

Lines changed: 60 additions & 19 deletions
@@ -1,5 +1,5 @@
 =============================================================================
-Python Unicode Integration Proposal Version: 1.4
+Python Unicode Integration Proposal Version: 1.6
 -----------------------------------------------------------------------------
 
 
@@ -41,16 +41,52 @@ General Remarks:
 case-insensitive on input (they will be converted to lower case
 by all APIs taking an encoding name as input).
 
-Encoding names should follow the name conventions as used by the
+Encoding names should follow the name conventions as used by the
 Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is
 written as 'utf-16'.
 
-Codec modules should use the same names, but with hyphens converted
+Codec modules should use the same names, but with hyphens converted
 to underscores, e.g. utf_8, utf_16, iso_8859_1.
 
-· The <default encoding> should be the widely used 'utf-8' format. This
-is very close to the standard 7-bit ASCII format and thus resembles the
-standard used programming nowadays in most aspects.
+
+Unicode Default Encoding:
+-------------------------
+
+The Unicode implementation has to make some assumption about the
+encoding of 8-bit strings passed to it for coercion and about the
+encoding to as default for conversion of Unicode to strings when no
+specific encoding is given. This encoding is called <default encoding>
+throughout this text.
+
+For this, the implementation maintains a global which can be set in
+the site.py Python startup script. Subsequent changes are not
+possible. The <default encoding> can be set and queried using the
+two sys module APIs:
+
+  sys.setdefaultencoding(encoding)
+   --> Sets the <default encoding> used by the Unicode implementation.
+       encoding has to be an encoding which is supported by the Python
+       installation, otherwise, a LookupError is raised.
+
+       Note: This API is only available in site.py ! It is removed
+       from the sys module by site.py after usage.
+
+  sys.getdefaultencoding()
+   --> Returns the current <default encoding>.
+
+If not otherwise defined or set, the <default encoding> defaults to
+'ascii'. This encoding is also the startup default of Python (and in
+effect before site.py is executed).
+
+Note that the default site.py startup module contains disabled
+optional code which can set the <default encoding> according to the
+encoding defined by the current locale. The locale module is used to
+extract the encoding from the locale default settings defined by the
+OS environment (see locale.py). If the encoding cannot be determined,
+is unkown or unsupported, the code defaults to setting the <default
+encoding> to 'ascii'. To enable this code, edit the site.py file or
+place the appropriate code into the sitecustomize.py module of your
+Python installation.
 
 
 Unicode Constructors:
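
The locale fallback described in the new "Unicode Default Encoding" section can be sketched as follows. The disabled site.py code itself is not part of this diff, so this is only an illustration of the stated rule (use the locale's encoding if a codec exists for it, otherwise 'ascii'), written for present-day Python 3. `pick_default_encoding` is a hypothetical helper; `locale.getpreferredencoding()` and `codecs.lookup()` are the modern standard-library calls assumed here, and the sketch deliberately does not call `sys.setdefaultencoding()`.

```python
import codecs
import locale


def pick_default_encoding(candidate):
    """Return candidate if Python knows a codec for it, else 'ascii'."""
    if not candidate:
        return "ascii"          # encoding could not be determined
    try:
        codecs.lookup(candidate)
    except LookupError:
        return "ascii"          # unknown or unsupported encoding
    return candidate


# Ask the locale machinery for the user's preferred encoding and
# validate it before it would be installed as the default.
locale_encoding = locale.getpreferredencoding(False)
chosen = pick_default_encoding(locale_encoding)
```

Validating the name up front mirrors the text's requirement that an unsupported encoding must never become the default.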
@@ -159,8 +195,10 @@ other objects have been coerced to Unicode. For strings this means
 that they are interpreted as Unicode string using the <default
 encoding>.
 
-For the same reason, Unicode objects should return the same hash value
-as their UTF-8 equivalent strings.
+Unicode objects should return the same hash value as their ASCII
+equivalent strings. Unicode strings holding non-ASCII values are not
+guaranteed to return the same hash values as the default encoded
+equivalent string representation.
 
 When compared using cmp() (or PyObject_Compare()) the implementation
 should mask TypeErrors raised during the conversion to remain in synch
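
One way to see why the hash guarantee in this hunk is limited to ASCII: for ASCII text, the sequence of code points and the sequence of encoded bytes are the same numbers, so any hash computed over them agrees. The sketch below uses a hypothetical `toy_hash`, which is not CPython's hash function, and present-day Python 3 strings and bytes as stand-ins for Unicode objects and 8-bit strings.

```python
def toy_hash(values):
    """A simple order-sensitive hash over a sequence of small integers."""
    h = 0
    for v in values:
        h = (h * 31 + v) % (2 ** 32)
    return h


ascii_text = "spam"
code_points = [ord(ch) for ch in ascii_text]
ascii_bytes = list(ascii_text.encode("ascii"))

# For ASCII text the two sequences are identical, so any hash computed
# from them agrees.
same_input = code_points == ascii_bytes
same_hash = toy_hash(code_points) == toy_hash(ascii_bytes)

# For non-ASCII text the encoded bytes no longer match the code points
# (here under UTF-8), so equal hashes cannot be expected.
non_ascii = "\u00e9"
diverges = [ord(ch) for ch in non_ascii] != list(non_ascii.encode("utf-8"))
```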
@@ -661,11 +699,10 @@ to the compiler's wchar_t which can be 16 or 32 bit depending on the
 compiler/libc/platform being used.
 
 Unicode objects should have a pointer to a cached Python string object
-<defencstr> holding the object's value using the current <default
-encoding>. This is needed for performance and internal parsing (see
-Internal Argument Parsing) reasons. The buffer is filled when the
-first conversion request to the <default encoding> is issued on the
-object.
+<defenc> holding the object's value using the <default encoding>.
+This is needed for performance and internal parsing (see Internal
+Argument Parsing) reasons. The buffer is filled when the first
+conversion request to the <default encoding> is issued on the object.
 
 Interning is not needed (for now), since Python identifiers are
 defined as being ASCII only.
@@ -701,11 +738,11 @@ type).
 Buffer Interface:
 -----------------
 
-Implement the buffer interface using the <defencstr> Python string
+Implement the buffer interface using the <defenc> Python string
 object as basis for bf_getcharbuf (corresponds to the "t#" argument
 parsing marker) and the internal buffer for bf_getreadbuf (corresponds
 to the "s#" argument parsing marker). If bf_getcharbuf is requested
-and the <defencstr> object does not yet exist, it is created first.
+and the <defenc> object does not yet exist, it is created first.
 
 This has the advantage of being able to write to output streams (which
 typically use this interface) without additional specification of the
@@ -775,8 +812,8 @@ These markers are used by the PyArg_ParseTuple() APIs:
 
 "U": Check for Unicode object and return a pointer to it
 
-"s": For Unicode objects: auto convert them to the <default encoding>
-     and return a pointer to the object's <defencstr> buffer.
+"s": For Unicode objects: return a pointer to the object's
+     <defenc> buffer (which uses the <default encoding>).
 
 "s#": Access to the Unicode object via the bf_getreadbuf buffer interface
       (see Buffer Interface); note that the length relates to the buffer
@@ -785,8 +822,7 @@ These markers are used by the PyArg_ParseTuple() APIs:
 
 "t#": Access to the Unicode object via the bf_getcharbuf buffer interface
       (see Buffer Interface); note that the length relates to the buffer
-      length, not necessarily to the Unicode string length (this may
-      be different depending on the <default encoding>).
+      length, not necessarily to the Unicode string length.
 
 "es":
       Takes two parameters: encoding (const char *) and
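
The remark on "s#" and "t#" that the reported length is a buffer length rather than a character count is easy to illustrate. The sketch below is present-day Python 3; the encodings are chosen only for illustration, and using UTF-16-LE as a model of a 16-bit internal buffer is an assumption, not something this diff states.

```python
text = "na\u00efve"          # five characters, one of them non-ASCII

char_count = len(text)                        # Unicode string length
utf8_len = len(text.encode("utf-8"))          # encoded buffer, UTF-8
latin1_len = len(text.encode("latin-1"))      # encoded buffer, Latin-1
raw16_len = len(text.encode("utf-16-le"))     # model of a 16-bit raw buffer
```

Code that receives a pointer and a length from these markers should therefore treat the length as a byte count for that particular buffer.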
@@ -1007,6 +1043,11 @@ Encodings:
 
 History of this Proposal:
 -------------------------
+1.6: Changed <defencstr> to <defenc> since this is the name used in the
+     implementation. Added notes about the usage of <defenc> in the
+     buffer protocol implementation.
+1.5: Added notes about setting the <default encoding>. Fixed some
+     typos (thanks to Andrew Kuchling). Changed <defencstr> to <utf8str>.
 1.4: Added note about mixed type comparisons and contains tests.
      Changed treating of Unicode objects in format strings (if used
      with '%s' % u they will now cause the format string to be

Objects/unicodeobject.c

Lines changed: 40 additions & 40 deletions
@@ -165,9 +165,9 @@ int _PyUnicode_Resize(register PyUnicodeObject *unicode,
 
  reset:
     /* Reset the object caches */
-    if (unicode->utf8str) {
-        Py_DECREF(unicode->utf8str);
-        unicode->utf8str = NULL;
+    if (unicode->defenc) {
+        Py_DECREF(unicode->defenc);
+        unicode->defenc = NULL;
     }
     unicode->hash = -1;
 
@@ -243,7 +243,7 @@ PyUnicodeObject *_PyUnicode_New(int length)
     unicode->str[length] = 0;
     unicode->length = length;
     unicode->hash = -1;
-    unicode->utf8str = NULL;
+    unicode->defenc = NULL;
     return unicode;
 
  onError:
@@ -262,9 +262,9 @@ void _PyUnicode_Free(register PyUnicodeObject *unicode)
             unicode->str = NULL;
             unicode->length = 0;
         }
-        if (unicode->utf8str) {
-            Py_DECREF(unicode->utf8str);
-            unicode->utf8str = NULL;
+        if (unicode->defenc) {
+            Py_DECREF(unicode->defenc);
+            unicode->defenc = NULL;
         }
         /* Add to free list */
         *(PyUnicodeObject **)unicode = unicode_freelist;
@@ -273,7 +273,7 @@ void _PyUnicode_Free(register PyUnicodeObject *unicode)
     }
     else {
         PyMem_DEL(unicode->str);
-        Py_XDECREF(unicode->utf8str);
+        Py_XDECREF(unicode->defenc);
         PyObject_DEL(unicode);
     }
 }
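
The hunks above all follow one rule: whenever the character data changes or the object is recycled, the cached encoded string and the cached hash are dropped so they can never go stale. A minimal model of that invalidation rule in present-day Python 3, with a hypothetical `CachedText` class standing in for `PyUnicodeObject`:

```python
class CachedText:
    """Toy stand-in for an object with lazily computed caches."""

    def __init__(self, chars):
        self.chars = list(chars)
        self.defenc = None      # cached encoded form, or None
        self.hash = -1          # cached hash, -1 if not set

    def encoded(self):
        """Return the cached encoded form, computing it on first use."""
        if self.defenc is None:
            self.defenc = "".join(self.chars).encode("ascii")
        return self.defenc

    def resize(self, length):
        """Change the length in place and drop every derived cache."""
        del self.chars[length:]
        self.chars.extend("\0" * (length - len(self.chars)))
        self.defenc = None
        self.hash = -1


t = CachedText("hello")
before = t.encoded()     # fills the cache
t.resize(2)              # invalidates it
cache_after_resize = t.defenc
after = t.encoded()      # recomputed from the new contents
```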
@@ -529,6 +529,33 @@ PyObject *PyUnicode_AsEncodedString(PyObject *unicode,
     return NULL;
 }
 
+/* Return a Python string holding the default encoded value of the
+   Unicode object.
+
+   The resulting string is cached in the Unicode object for subsequent
+   usage by this function. The cached version is needed to implement
+   the character buffer interface and will live (at least) as long as
+   the Unicode object itself.
+
+   The refcount of the string is *not* incremented.
+
+   *** Exported for internal use by the interpreter only !!! ***
+
+*/
+
+PyObject *_PyUnicode_AsDefaultEncodedString(PyObject *unicode,
+                                            const char *errors)
+{
+    PyObject *v = ((PyUnicodeObject *)unicode)->defenc;
+
+    if (v)
+        return v;
+    v = PyUnicode_AsEncodedString(unicode, NULL, errors);
+    if (v && errors == NULL)
+        ((PyUnicodeObject *)unicode)->defenc = v;
+    return v;
+}
+
 Py_UNICODE *PyUnicode_AsUnicode(PyObject *unicode)
 {
     if (!PyUnicode_Check(unicode)) {
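
The control flow of `_PyUnicode_AsDefaultEncodedString()` has one subtle point: the result is stored in `defenc` only when `errors` is NULL, so a lossy conversion is never cached. The following is a rough Python 3 model of that logic, not a translation of the C API; `TextObject` and `default_encoded_cached` are hypothetical names, and the `encode_calls` counter exists only to make the caching observable.

```python
class TextObject:
    """Toy stand-in for a Unicode object with a defenc slot."""

    def __init__(self, value):
        self.value = value
        self.defenc = None       # cached default-encoded bytes, or None
        self.encode_calls = 0    # instrumentation for this sketch only


def default_encoded_cached(obj, errors=None, default_encoding="ascii"):
    # A cached result is returned as-is, whatever errors was passed.
    if obj.defenc is not None:
        return obj.defenc
    obj.encode_calls += 1
    encoded = obj.value.encode(default_encoding, errors or "strict")
    # Only the strict (errors is None) result is remembered.
    if errors is None:
        obj.defenc = encoded
    return encoded


obj = TextObject("abc")
first = default_encoded_cached(obj)
second = default_encoded_cached(obj)          # served from the cache

lossy = TextObject("caf\u00e9")
replaced = default_encoded_cached(lossy, errors="replace")
```

Caching only the strict result keeps the buffer interface from ever handing out a string produced with a caller-specific error policy.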
@@ -874,35 +901,6 @@ PyObject *PyUnicode_EncodeUTF8(const Py_UNICODE *s,
     return NULL;
 }
 
-/* Return a Python string holding the UTF-8 encoded value of the
-   Unicode object.
-
-   The resulting string is cached in the Unicode object for subsequent
-   usage by this function. The cached version is needed to implement
-   the character buffer interface and will live (at least) as long as
-   the Unicode object itself.
-
-   The refcount of the string is *not* incremented.
-
-   *** Exported for internal use by the interpreter only !!! ***
-
-*/
-
-PyObject *_PyUnicode_AsUTF8String(PyObject *unicode,
-                                  const char *errors)
-{
-    PyObject *v = ((PyUnicodeObject *)unicode)->utf8str;
-
-    if (v)
-        return v;
-    v = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(unicode),
-                             PyUnicode_GET_SIZE(unicode),
-                             errors);
-    if (v && errors == NULL)
-        ((PyUnicodeObject *)unicode)->utf8str = v;
-    return v;
-}
-
 PyObject *PyUnicode_AsUTF8String(PyObject *unicode)
 {
     PyObject *str;
@@ -911,7 +909,9 @@ PyObject *PyUnicode_AsUTF8String(PyObject *unicode)
         PyErr_BadArgument();
         return NULL;
     }
-    str = _PyUnicode_AsUTF8String(unicode, NULL);
+    str = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(unicode),
+                               PyUnicode_GET_SIZE(unicode),
+                               NULL);
     if (str == NULL)
         return NULL;
     Py_INCREF(str);
@@ -4519,7 +4519,7 @@ unicode_buffer_getcharbuf(PyUnicodeObject *self,
                         "accessing non-existent unicode segment");
         return -1;
     }
-    str = _PyUnicode_AsUTF8String((PyObject *)self, NULL);
+    str = _PyUnicode_AsDefaultEncodedString((PyObject *)self, NULL);
     if (str == NULL)
         return -1;
     *ptr = (void *) PyString_AS_STRING(str);
@@ -5130,7 +5130,7 @@ _PyUnicode_Fini(void)
         u = *(PyUnicodeObject **)u;
         if (v->str)
             PyMem_DEL(v->str);
-        Py_XDECREF(v->utf8str);
+        Py_XDECREF(v->defenc);
         PyObject_DEL(v);
     }
     unicode_freelist = NULL;

Python/getargs.c

Lines changed: 6 additions & 4 deletions
@@ -372,7 +372,7 @@ convertsimple(PyObject *arg, char **p_format, va_list *p_va, char *msgbuf)
 
 /* Internal API needed by convertsimple1(): */
 extern
-PyObject *_PyUnicode_AsUTF8String(PyObject *unicode,
+PyObject *_PyUnicode_AsDefaultEncodedString(PyObject *unicode,
                                   const char *errors);
 
 /* Convert a non-tuple argument. Return NULL if conversion went OK,
@@ -567,7 +567,8 @@ convertsimple1(PyObject *arg, char **p_format, va_list *p_va)
             if (PyString_Check(arg))
                 *p = PyString_AS_STRING(arg);
             else if (PyUnicode_Check(arg)) {
-                arg = _PyUnicode_AsUTF8String(arg, NULL);
+                arg = _PyUnicode_AsDefaultEncodedString(
+                        arg, NULL);
                 if (arg == NULL)
                     return "unicode conversion error";
                 *p = PyString_AS_STRING(arg);
@@ -612,7 +613,8 @@ convertsimple1(PyObject *arg, char **p_format, va_list *p_va)
             else if (PyString_Check(arg))
                 *p = PyString_AsString(arg);
             else if (PyUnicode_Check(arg)) {
-                arg = _PyUnicode_AsUTF8String(arg, NULL);
+                arg = _PyUnicode_AsDefaultEncodedString(
+                        arg, NULL);
                 if (arg == NULL)
                     return "unicode conversion error";
                 *p = PyString_AS_STRING(arg);
@@ -644,7 +646,7 @@ convertsimple1(PyObject *arg, char **p_format, va_list *p_va)
             /* Get 'e' parameter: the encoding name */
             encoding = (const char *)va_arg(*p_va, const char *);
             if (encoding == NULL)
-                return "(encoding is NULL)";
+                encoding = PyUnicode_GetDefaultEncoding();
 
             /* Get 's' parameter: the output buffer to use */
             if (*format != 's')
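
The last hunk changes what a NULL encoding means for the "es" marker: it used to be rejected with "(encoding is NULL)" and now selects the default encoding. A small Python 3 sketch of the new rule, using a hypothetical `encode_argument` helper and an assumed `DEFAULT_ENCODING` constant in place of `PyUnicode_GetDefaultEncoding()`:

```python
DEFAULT_ENCODING = "ascii"   # assumption: models the interpreter default


def encode_argument(text, encoding=None):
    """Encode text, falling back to the default when no encoding is given."""
    if encoding is None:
        encoding = DEFAULT_ENCODING
    return text.encode(encoding)


implicit = encode_argument("abc")                 # uses the default
explicit = encode_argument("\u00e9", "latin-1")   # caller's choice wins
```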
