|
1 | 1 | ============================================================================= |
2 | | - Python Unicode Integration Proposal Version: 1.6 |
| 2 | + Python Unicode Integration Proposal Version: 1.7 |
3 | 3 | ----------------------------------------------------------------------------- |
4 | 4 |
|
5 | 5 |
|
@@ -738,16 +738,26 @@ type). |
738 | 738 | Buffer Interface: |
739 | 739 | ----------------- |
740 | 740 |
|
741 | | -Implement the buffer interface using the <defenc> Python string |
742 | | -object as basis for bf_getcharbuf (corresponds to the "t#" argument |
743 | | -parsing marker) and the internal buffer for bf_getreadbuf (corresponds |
744 | | -to the "s#" argument parsing marker). If bf_getcharbuf is requested |
745 | | -and the <defenc> object does not yet exist, it is created first. |
| 741 | +Implement the buffer interface using the <defenc> Python string object |
| 742 | +as basis for bf_getcharbuf and the internal buffer for |
| 743 | +bf_getreadbuf. If bf_getcharbuf is requested and the <defenc> object |
| 744 | +does not yet exist, it is created first. |
| 745 | + |
| 746 | +Note that as a special case, the parser marker "s#" will not return raw |
| 747 | +Unicode UTF-16 data (which the bf_getreadbuf returns), but instead |
| 748 | +tries to encode the Unicode object using the default encoding and then |
| 749 | +returns a pointer to the resulting string object (or raises an |
| 750 | +exception in case the conversion fails). This was done in order to |
| 751 | +prevent accidentally writing binary data to an output stream which the |
| 752 | +other end might not recognize. |
746 | 753 |
|
747 | 754 | This has the advantage of being able to write to output streams (which |
748 | 755 | typically use this interface) without additional specification of the |
749 | 756 | encoding to use. |
750 | 757 |
|
| 758 | +If you need to access the read buffer interface of Unicode objects, |
| 759 | +use the PyObject_AsReadBuffer() interface. |
| 760 | + |
751 | 761 | The internal format can also be accessed using the 'unicode-internal' |
752 | 762 | codec, e.g. via u.encode('unicode-internal'). |
753 | 763 |
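
The difference between the raw read buffer and the default encoded string can be sketched in Python. This is an illustration only, not code from the proposal: a Python 3 str stands in for the Unicode object, UTF-16-LE stands in for the raw internal data, and UTF-8 is an assumed stand-in for the <default encoding>.

```python
# Illustration only: Python 3 str stands in for the proposal's Unicode object.
# UTF-8 is an ASSUMED stand-in for the <default encoding>; UTF-16-LE is an
# ASSUMED stand-in for the raw internal buffer.
text = "h\u00e9llo"  # five characters, one of them non-ASCII

# Roughly what the raw read buffer (bf_getreadbuf) would expose.
raw_utf16 = text.encode("utf-16-le")

# Roughly what the "s#" marker hands back after this change.
default_encoded = text.encode("utf-8")

# The three lengths differ, which is why "s#" reports the length of the
# encoded string rather than the Unicode object length.
lengths = (len(text), len(raw_utf16), len(default_encoded))
print(lengths)

# If the text cannot be represented in the target encoding, the conversion
# raises an exception instead of passing raw data through.
try:
    text.encode("ascii")
    conversion_failed = False
except UnicodeEncodeError:
    conversion_failed = True
print(conversion_failed)
```

The raw UTF-16 form contains zero bytes for every ASCII character, which is the kind of binary data a receiving stream might not recognize.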
|
@@ -815,14 +825,11 @@ These markers are used by the PyArg_ParseTuple() APIs: |
815 | 825 | "s": For Unicode objects: return a pointer to the object's |
816 | 826 | <defenc> buffer (which uses the <default encoding>). |
817 | 827 |
|
818 | | - "s#": Access to the Unicode object via the bf_getreadbuf buffer interface |
819 | | - (see Buffer Interface); note that the length relates to the buffer |
820 | | - length, not the Unicode string length (this may be different |
821 | | - depending on the Internal Format). |
| 828 | + "s#": Access to the default encoded version of the Unicode object |
| 829 | + (see Buffer Interface); note that the length relates to the length |
| 830 | + of the default encoded string rather than the Unicode object length. |
822 | 831 |
|
823 | | - "t#": Access to the Unicode object via the bf_getcharbuf buffer interface |
824 | | - (see Buffer Interface); note that the length relates to the buffer |
825 | | - length, not necessarily to the Unicode string length. |
| 832 | + "t#": Same as "s#". |
826 | 833 |
|
827 | 834 | "es": |
828 | 835 | Takes two parameters: encoding (const char *) and |
@@ -934,14 +941,13 @@ Using "es#" with a pre-allocated buffer: |
934 | 941 | File/Stream Output: |
935 | 942 | ------------------- |
936 | 943 |
|
937 | | -Since file.write(object) and most other stream writers use the "s#" |
938 | | -argument parsing marker for binary files and "t#" for text files, the |
939 | | -buffer interface implementation determines the encoding to use (see |
940 | | -Buffer Interface). |
| 944 | +Since file.write(object) and most other stream writers use the "s#" or |
| 945 | +"t#" argument parsing marker for querying the data to write, the |
| 946 | +default encoded string version of the Unicode object will be written |
| 947 | +to the streams (see Buffer Interface). |
941 | 948 |
|
942 | | -For explicit handling of files using Unicode, the standard |
943 | | -stream codecs as available through the codecs module should |
944 | | -be used. |
| 949 | +For explicit handling of files using Unicode, the standard stream |
| 950 | +codecs as available through the codecs module should be used. |
945 | 951 |
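
Explicit handling through the stream codecs might look like the following minimal sketch. It assumes a Python 3 interpreter, uses an in-memory io.BytesIO object as a hypothetical stand-in for a binary file, and picks UTF-8 purely for illustration; codecs.getwriter() and codecs.getreader() are the codecs module lookups for stream writer and reader classes.

```python
import codecs
import io

# In-memory binary stream used as a stand-in for a file opened in 'b' mode.
raw = io.BytesIO()

# Wrap the binary stream in a stream writer for an explicitly chosen
# encoding (UTF-8 is an assumption made for this example).
writer = codecs.getwriter("utf-8")(raw)
writer.write("Gr\u00fc\u00dfe\n")

# The underlying stream now holds encoded bytes, not raw internal data.
data = raw.getvalue()

# Reading back through the matching stream reader restores the text.
reader = codecs.getreader("utf-8")(io.BytesIO(data))
round_trip = reader.read()

print(data)
print(round_trip == "Gr\u00fc\u00dfe\n")
```

Choosing the encoding explicitly at the stream level avoids relying on whatever the default encoding happens to be when the data is written.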
|
946 | 952 | The codecs module should provide a short-cut open(filename,mode,encoding)
947 | 953 | which also assures that mode contains the 'b' character when
@@ -1043,6 +1049,7 @@ Encodings: |
1043 | 1049 |
|
1044 | 1050 | History of this Proposal: |
1045 | 1051 | ------------------------- |
| 1052 | +1.7: Added note about the changed behaviour of "s#". |
1046 | 1053 | 1.6: Changed <defencstr> to <defenc> since this is the name used in the |
1047 | 1054 | implementation. Added notes about the usage of <defenc> in the |
1048 | 1055 | buffer protocol implementation. |
|