Commit 5cd2f0d

Updated according to the changes made to the "s#" parser marker
and bumped the version number to 1.7.
1 parent b425f5e commit 5cd2f0d

1 file changed

Lines changed: 27 additions & 20 deletions

Misc/unicode.txt

@@ -1,5 +1,5 @@
 =============================================================================
-Python Unicode Integration Proposal Version: 1.6
+Python Unicode Integration Proposal Version: 1.7
 -----------------------------------------------------------------------------
 
 
@@ -738,16 +738,26 @@ type).
 Buffer Interface:
 -----------------
 
-Implement the buffer interface using the <defenc> Python string
-object as basis for bf_getcharbuf (corresponds to the "t#" argument
-parsing marker) and the internal buffer for bf_getreadbuf (corresponds
-to the "s#" argument parsing marker). If bf_getcharbuf is requested
-and the <defenc> object does not yet exist, it is created first.
+Implement the buffer interface using the <defenc> Python string object
+as basis for bf_getcharbuf and the internal buffer for
+bf_getreadbuf. If bf_getcharbuf is requested and the <defenc> object
+does not yet exist, it is created first.
+
+Note that as special case, the parser marker "s#" will not return raw
+Unicode UTF-16 data (which the bf_getreadbuf returns), but instead
+tries to encode the Unicode object using the default encoding and then
+returns a pointer to the resulting string object (or raises an
+exception in case the conversion fails). This was done in order to
+prevent accidentely writing binary data to an output stream which the
+other end might not recognize.
 
 This has the advantage of being able to write to output streams (which
 typically use this interface) without additional specification of the
 encoding to use.
 
+If you need to access the read buffer interface of Unicode objects,
+use the PyObject_AsReadBuffer() interface.
+
 The internal format can also be accessed using the 'unicode-internal'
 codec, e.g. via u.encode('unicode-internal').
 
@@ -815,14 +825,11 @@ These markers are used by the PyArg_ParseTuple() APIs:
 "s": For Unicode objects: return a pointer to the object's
 <defenc> buffer (which uses the <default encoding>).
 
-"s#": Access to the Unicode object via the bf_getreadbuf buffer interface
-(see Buffer Interface); note that the length relates to the buffer
-length, not the Unicode string length (this may be different
-depending on the Internal Format).
+"s#": Access to the default encoded version of the Unicode object
+(see Buffer Interface); note that the length relates to the length
+of the default encoded string rather than the Unicode object length.
 
-"t#": Access to the Unicode object via the bf_getcharbuf buffer interface
-(see Buffer Interface); note that the length relates to the buffer
-length, not necessarily to the Unicode string length.
+"t#": Same as "s#".
 
 "es":
 Takes two parameters: encoding (const char *) and
@@ -934,14 +941,13 @@ Using "es#" with a pre-allocated buffer:
 File/Stream Output:
 -------------------
 
-Since file.write(object) and most other stream writers use the "s#"
-argument parsing marker for binary files and "t#" for text files, the
-buffer interface implementation determines the encoding to use (see
-Buffer Interface).
+Since file.write(object) and most other stream writers use the "s#" or
+"t#" argument parsing marker for querying the data to write, the
+default encoded string version of the Unicode object will be written
+to the streams (see Buffer Interface).
 
-For explicit handling of files using Unicode, the standard
-stream codecs as available through the codecs module should
-be used.
+For explicit handling of files using Unicode, the standard stream
+codecs as available through the codecs module should be used.
 
 The codecs module should provide a short-cut open(filename,mode,encoding)
 available which also assures that mode contains the 'b' character when
@@ -1043,6 +1049,7 @@ Encodings:
 
 History of this Proposal:
 -------------------------
+1.7: Added note about the changed behaviour of "s#".
 1.6: Changed <defencstr> to <defenc> since this is the name used in the
 implementation. Added notes about the usage of <defenc> in the
 buffer protocol implementation.
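
The new "s#" wording says the reported length is that of the default encoded string, not of the Unicode object. The Python sketch below illustrates the difference and also writes text through a stream codec from the codecs module. It is not part of this commit. It assumes modern Python 3, the helper names are hypothetical, and UTF-8 is only a stand-in for the proposal's <default encoding>.

```python
import codecs
import io


def encoded_length(text, encoding="utf-8"):
    """Return the byte length of text once encoded.

    Assumption: `encoding` stands in for the proposal's <default encoding>.
    """
    return len(text.encode(encoding))


def write_via_stream_codec(text, encoding="utf-8"):
    """Write text through a codecs StreamWriter and return the bytes produced."""
    raw = io.BytesIO()
    writer = codecs.getwriter(encoding)(raw)
    writer.write(text)
    return raw.getvalue()


sample = "caf\u00e9"
print(len(sample))                      # 4 characters in the Unicode object
print(encoded_length(sample))           # 5 bytes once encoded as UTF-8
print(len(sample.encode("utf-16-le")))  # 8 bytes of raw UTF-16 data
print(write_via_stream_codec(sample))   # b'caf\xc3\xa9'
```

If the chosen encoding cannot represent the text (for example "ascii" with the sample above), the encode step raises UnicodeEncodeError, which corresponds to the exception case the "s#" note describes.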
