11=============================================================================
2- Python Unicode Integration Proposal Version: 1.4
2+ Python Unicode Integration Proposal Version: 1.6
33-----------------------------------------------------------------------------
44
55
@@ -41,16 +41,52 @@ General Remarks:
4141 case-insensitive on input (they will be converted to lower case
4242 by all APIs taking an encoding name as input).
4343
44- Encoding names should follow the name conventions as used by the
44+ · Encoding names should follow the name conventions as used by the
4545 Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is
4646 written as 'utf-16'.
4747
48- Codec modules should use the same names, but with hyphens converted
48+ · Codec modules should use the same names, but with hyphens converted
4949 to underscores, e.g. utf_8, utf_16, iso_8859_1.
5050
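The naming rules in the bullets above can be sketched in a few lines of Python. This is only an illustration of the conventions as stated (lower case, spaces to hyphens, hyphens to underscores for codec module names); the helper names normalize_encoding_name() and codec_module_name() are hypothetical and are not part of the proposal.

```python
def normalize_encoding_name(name):
    """Lower-case the name and write spaces as hyphens ('UTF 16' -> 'utf-16')."""
    return name.strip().lower().replace(" ", "-")


def codec_module_name(name):
    """Derive the codec module name by turning hyphens into underscores."""
    return normalize_encoding_name(name).replace("-", "_")


if __name__ == "__main__":
    for raw in ("UTF 16", "utf-8", "ISO-8859-1"):
        print(raw, normalize_encoding_name(raw), codec_module_name(raw))
```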
51- · The <default encoding> should be the widely used 'utf-8' format. This
52- is very close to the standard 7-bit ASCII format and thus resembles the
53- standard used programming nowadays in most aspects.
51+
52+ Unicode Default Encoding:
53+ -------------------------
54+
55+ The Unicode implementation has to make some assumption about the
56+ encoding of 8-bit strings passed to it for coercion and about the
57+ encoding to use as default for conversion of Unicode to strings when no
58+ specific encoding is given. This encoding is called <default encoding>
59+ throughout this text.
60+
61+ For this, the implementation maintains a global which can be set in
62+ the site.py Python startup script. Subsequent changes are not
63+ possible. The <default encoding> can be set and queried using the
64+ two sys module APIs:
65+
66+ sys.setdefaultencoding(encoding)
67+ --> Sets the <default encoding> used by the Unicode implementation.
68+ encoding has to be an encoding which is supported by the Python
69+ installation; otherwise, a LookupError is raised.
70+
71+ Note: This API is only available in site.py! It is removed
72+ from the sys module by site.py after usage.
73+
74+ sys.getdefaultencoding()
75+ --> Returns the current <default encoding>.
76+
77+ If not otherwise defined or set, the <default encoding> defaults to
78+ 'ascii'. This encoding is also the startup default of Python (and in
79+ effect before site.py is executed).
80+
81+ Note that the default site.py startup module contains disabled
82+ optional code which can set the <default encoding> according to the
83+ encoding defined by the current locale. The locale module is used to
84+ extract the encoding from the locale default settings defined by the
85+ OS environment (see locale.py). If the encoding cannot be determined,
86+ is unknown or unsupported, the code defaults to setting the <default
87+ encoding> to 'ascii'. To enable this code, edit the site.py file or
88+ place the appropriate code into the sitecustomize.py module of your
89+ Python installation.
5490
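The locale-based fallback described above might look roughly like the following sketch. It is not the code shipped in site.py; the function name choose_default_encoding() is hypothetical, and the sketch only computes the encoding name instead of calling sys.setdefaultencoding(), which the text says is removed from the sys module once site.py has run.

```python
import codecs
import locale


def choose_default_encoding(candidate=None):
    """Return a supported encoding name, falling back to 'ascii'.

    If no candidate is given, ask the locale module for the encoding
    implied by the OS environment.
    """
    if candidate is None:
        try:
            candidate = locale.getpreferredencoding(False)
        except Exception:
            candidate = None  # encoding could not be determined
    if not candidate:
        return "ascii"
    try:
        codecs.lookup(candidate)  # raises LookupError if unsupported
    except LookupError:
        return "ascii"
    return candidate


if __name__ == "__main__":
    print(choose_default_encoding())
```

A site.py or sitecustomize.py following the proposal would pass the result to sys.setdefaultencoding(); that call is left out here because the API is only available during startup.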
5591
5692Unicode Constructors:
@@ -159,8 +195,10 @@ other objects have been coerced to Unicode. For strings this means
159195that they are interpreted as Unicode string using the <default
160196encoding>.
161197
162- For the same reason, Unicode objects should return the same hash value
163- as their UTF-8 equivalent strings.
198+ Unicode objects should return the same hash value as their ASCII
199+ equivalent strings. Unicode strings holding non-ASCII values are not
200+ guaranteed to return the same hash values as the default encoded
201+ equivalent string representation.
164202
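The reason for the hash rule is that a Unicode object and its equal 8-bit string must land in the same dictionary slot. The toy class below sketches that invariant under stated assumptions: UnicodeText is a hypothetical stand-in for the Unicode type, bytes stands in for the 8-bit string type, and the real implementation is in C and does not work this way internally.

```python
class UnicodeText:
    """Toy Unicode string that hashes like its ASCII-encoded equivalent."""

    def __init__(self, text):
        self.text = text

    def _ascii_bytes(self):
        # Return the ASCII form, or None if the text holds non-ASCII values.
        try:
            return self.text.encode("ascii")
        except UnicodeEncodeError:
            return None

    def __eq__(self, other):
        if isinstance(other, UnicodeText):
            return self.text == other.text
        if isinstance(other, bytes):
            return self._ascii_bytes() == other
        return NotImplemented

    def __hash__(self):
        raw = self._ascii_bytes()
        # ASCII-only values share the hash of the 8-bit string; others
        # carry no such guarantee.
        return hash(raw) if raw is not None else hash(self.text)


if __name__ == "__main__":
    table = {b"key": 1}
    print(table[UnicodeText("key")])
```

With this rule, a dictionary keyed by 8-bit strings can be queried with an equal ASCII-only Unicode object, which is the behaviour the paragraph above asks for.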
165203When compared using cmp() (or PyObject_Compare()) the implementation
166204should mask TypeErrors raised during the conversion to remain in synch
@@ -661,11 +699,10 @@ to the compiler's wchar_t which can be 16 or 32 bit depending on the
661699compiler/libc/platform being used.
662700
663701Unicode objects should have a pointer to a cached Python string object
664- <defencstr> holding the object's value using the current <default
665- encoding>. This is needed for performance and internal parsing (see
666- Internal Argument Parsing) reasons. The buffer is filled when the
667- first conversion request to the <default encoding> is issued on the
668- object.
702+ <defenc> holding the object's value using the <default encoding>.
703+ This is needed for performance and internal parsing (see Internal
704+ Argument Parsing) reasons. The buffer is filled when the first
705+ conversion request to the <default encoding> is issued on the object.
669706
670707Interning is not needed (for now), since Python identifiers are
671708defined as being ASCII only.
@@ -701,11 +738,11 @@ type).
701738Buffer Interface:
702739-----------------
703740
704- Implement the buffer interface using the <defencstr> Python string
741+ Implement the buffer interface using the <defenc> Python string
705742object as basis for bf_getcharbuf (corresponds to the "t#" argument
706743parsing marker) and the internal buffer for bf_getreadbuf (corresponds
707744to the "s#" argument parsing marker). If bf_getcharbuf is requested
708- and the <defencstr> object does not yet exist, it is created first.
745+ and the <defenc> object does not yet exist, it is created first.
709746
710747This has the advantage of being able to write to output streams (which
711748typically use this interface) without additional specification of the
@@ -775,8 +812,8 @@ These markers are used by the PyArg_ParseTuple() APIs:
775812
776813 "U": Check for Unicode object and return a pointer to it
777814
778- "s": For Unicode objects: auto convert them to the <default encoding>
779- and return a pointer to the object's <defencstr> buffer.
815+ "s": For Unicode objects: return a pointer to the object's
816+ <defenc> buffer (which uses the <default encoding>).
780817
781818 "s#": Access to the Unicode object via the bf_getreadbuf buffer interface
782819 (see Buffer Interface); note that the length relates to the buffer
@@ -785,8 +822,7 @@ These markers are used by the PyArg_ParseTuple() APIs:
785822
786823 "t#": Access to the Unicode object via the bf_getcharbuf buffer interface
787824 (see Buffer Interface); note that the length relates to the buffer
788- length, not necessarily to the Unicode string length (this may
789- be different depending on the <default encoding>).
825+ length, not necessarily to the Unicode string length.
790826
791827 "es":
792828 Takes two parameters: encoding (const char *) and
@@ -1007,6 +1043,11 @@ Encodings:
10071043
10081044History of this Proposal:
10091045-------------------------
1046+ 1.6: Changed <defencstr> to <defenc> since this is the name used in the
1047+ implementation. Added notes about the usage of <defenc> in the
1048+ buffer protocol implementation.
1049+ 1.5: Added notes about setting the <default encoding>. Fixed some
1050+ typos (thanks to Andrew Kuchling). Changed <defencstr> to <utf8str>.
101010511.4: Added note about mixed type comparisons and contains tests.
10111052 Changed treating of Unicode objects in format strings (if used
10121053 with '%s' % u they will now cause the format string to be