Commit 9076f9e

Author: Victor Stinner

Merged revisions 81168 via svnmerge from
svn+ssh://[email protected]/python/branches/py3k

........
r81168 | victor.stinner | 2010-05-14 17:58:55 +0200 (Fri, 14 May 2010) | 10 lines

Issue #8711: Document PyUnicode_DecodeFSDefault*() functions

* Add paragraph titles to c-api/unicode.rst.
* Fix PyUnicode_DecodeFSDefault*() comment: it now uses the "surrogateescape" error handler (and not "replace")
* Remove "The function is intended to be used for paths and file names only during bootstrapping process where the codecs are not set up." from PyUnicode_FSConverter() comment: it is used after the bootstrapping and for other purposes than file names
........
1 parent 9d765bd commit 9076f9e

2 files changed

Lines changed: 101 additions & 47 deletions


Doc/c-api/unicode.rst

Lines changed: 89 additions & 39 deletions
@@ -10,11 +10,12 @@ Unicode Objects and Codecs
1010
Unicode Objects
1111
^^^^^^^^^^^^^^^
1212

13+
Unicode Type
14+
""""""""""""
15+
1316
These are the basic Unicode object types used for the Unicode implementation in
1417
Python:
1518

16-
.. % --- Unicode Type -------------------------------------------------------
17-
1819

1920
.. ctype:: Py_UNICODE
2021

@@ -89,12 +90,13 @@ access internal read-only data of Unicode objects:
8990
Clear the free list. Return the total number of freed items.
9091

9192

93+
Unicode Character Properties
94+
""""""""""""""""""""""""""""
95+
9296
Unicode provides many different character properties. The most often needed ones
9397
are available through these macros which are mapped to C functions depending on
9498
the Python configuration.
9599

96-
.. % --- Unicode character properties ---------------------------------------
97-
98100

99101
.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
100102

@@ -192,11 +194,13 @@ These APIs can be used for fast direct character conversions:
192194
Return the character *ch* converted to a double. Return ``-1.0`` if this is not
193195
possible. This macro does not raise exceptions.
194196

197+
198+
Plain Py_UNICODE
199+
""""""""""""""""
200+
195201
To create Unicode objects and access their basic sequence properties, use these
196202
APIs:
197203

198-
.. % --- Plain Py_UNICODE ---------------------------------------------------
199-
200204

201205
.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
202206

@@ -346,9 +350,47 @@ Python can interface directly to this type using the following functions.
346350
Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
347351
the system's :ctype:`wchar_t`.
348352

349-
.. % --- wchar_t support for platforms which support it ---------------------
353+
354+
File System Encoding
355+
""""""""""""""""""""
356+
357+
To encode and decode file names and other environment strings,
358+
:cdata:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
359+
``"surrogateescape"`` should be used as the error handler (:pep:`383`). To
360+
encode file names during argument parsing, the ``"O&"`` converter should be
361+
used, passing :func:`PyUnicode_FSConverter` as the conversion function:
362+
363+
.. cfunction:: int PyUnicode_FSConverter(PyObject* obj, void* result)
364+
365+
Convert *obj* into *result*, using :cdata:`Py_FileSystemDefaultEncoding`
366+
and the ``"surrogateescape"`` error handler. *result* must be the address of a
367+
``PyObject*`` variable; it receives a :func:`bytes` object, which must be
368+
released when it is no longer used.
369+
370+
.. versionadded:: 3.1
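
A rough Python-level sketch of the conversion described for ``PyUnicode_FSConverter`` may help when reading this hunk. It is not the C implementation: ``fs_convert`` is a hypothetical helper, and the UTF-8 default for ``encoding`` is an assumption standing in for ``Py_FileSystemDefaultEncoding``.

```python
def fs_convert(obj, encoding="utf-8"):
    """Hypothetical Python analogue of the converter described above.

    bytes objects are passed through unchanged; str objects are encoded
    with the "surrogateescape" error handler. The UTF-8 default is an
    assumption, standing in for the file system encoding.
    """
    if isinstance(obj, bytes):
        return obj
    if isinstance(obj, str):
        return obj.encode(encoding, "surrogateescape")
    raise TypeError("expected str or bytes, not %s" % type(obj).__name__)


print(fs_convert("caf\u00e9"))    # b'caf\xc3\xa9'
print(fs_convert(b"raw\xff"))     # b'raw\xff' (bytes are output as-is)
print(fs_convert("raw\udcff"))    # b'raw\xff' (escaped byte is restored)
```

The last call shows why the handler matters: a lone surrogate such as ``\udcff`` in a ``str`` file name maps back to the original undecodable byte instead of raising an error.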
371+
372+
.. cfunction:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)
373+
374+
Decode a string of *size* bytes using :cdata:`Py_FileSystemDefaultEncoding`
375+
and the ``"surrogateescape"`` error handler.
376+
377+
If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8.
378+
379+
Use :func:`PyUnicode_DecodeFSDefault` if the string is null-terminated and its length is not known.
380+
381+
.. cfunction:: PyObject* PyUnicode_DecodeFSDefault(const char *s)
382+
383+
Decode a null-terminated string using :cdata:`Py_FileSystemDefaultEncoding` and
384+
the ``"surrogateescape"`` error handler.
385+
386+
If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8.
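
The decoding behaviour documented for the two ``PyUnicode_DecodeFSDefault*`` functions can be illustrated from Python. This is a minimal sketch, not the C code: ``decode_fs_default`` is a hypothetical helper, and passing ``encoding=None`` to model an unset ``Py_FileSystemDefaultEncoding`` is an assumption made for the example.

```python
def decode_fs_default(data, encoding=None):
    """Hypothetical analogue of the decode functions described above.

    Decodes bytes with the "surrogateescape" error handler and falls
    back to UTF-8 when no encoding is given (modelling an unset
    file system encoding; this is an assumption of the sketch).
    """
    if encoding is None:
        encoding = "utf-8"
    return data.decode(encoding, "surrogateescape")


raw = b"report-\xe9.txt"          # Latin-1 byte, not valid UTF-8
name = decode_fs_default(raw)
print(ascii(name))                # 'report-\udce9.txt'

# Encoding with the same handler gives back the original bytes.
assert name.encode("utf-8", "surrogateescape") == raw
```

With the previous ``"replace"`` behaviour mentioned in the commit message, the invalid byte would have been replaced and the original file name could not be recovered; ``"surrogateescape"`` keeps the round trip lossless.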
350387

351388

389+
wchar_t Support
390+
"""""""""""""""
391+
392+
wchar_t support for platforms which support it:
393+
352394
.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
353395

354396
Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
@@ -395,9 +437,11 @@ built-in codecs is "strict" (:exc:`ValueError` is raised).
395437
The codecs all use a similar interface. For simplicity, only deviations from the
396438
following generic ones are documented.
397439

398-
These are the generic codec APIs:
399440

400-
.. % --- Generic Codecs -----------------------------------------------------
441+
Generic Codecs
442+
""""""""""""""
443+
444+
These are the generic codec APIs:
401445

402446

403447
.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)
@@ -426,9 +470,11 @@ These are the generic codec APIs:
426470
using the Python codec registry. Return *NULL* if an exception was raised by
427471
the codec.
428472

429-
These are the UTF-8 codec APIs:
430473

431-
.. % --- UTF-8 Codecs -------------------------------------------------------
474+
UTF-8 Codecs
475+
""""""""""""
476+
477+
These are the UTF-8 codec APIs:
432478

433479

434480
.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
@@ -458,9 +504,11 @@ These are the UTF-8 codec APIs:
458504
object. Error handling is "strict". Return *NULL* if an exception was
459505
raised by the codec.
460506

461-
These are the UTF-32 codec APIs:
462507

463-
.. % --- UTF-32 Codecs ------------------------------------------------------ */
508+
UTF-32 Codecs
509+
"""""""""""""
510+
511+
These are the UTF-32 codec APIs:
464512

465513

466514
.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
@@ -525,9 +573,10 @@ These are the UTF-32 codec APIs:
525573
Return *NULL* if an exception was raised by the codec.
526574

527575

528-
These are the UTF-16 codec APIs:
576+
UTF-16 Codecs
577+
"""""""""""""
529578

530-
.. % --- UTF-16 Codecs ------------------------------------------------------ */
579+
These are the UTF-16 codec APIs:
531580

532581

533582
.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
@@ -591,9 +640,11 @@ These are the UTF-16 codec APIs:
591640
order. The string always starts with a BOM mark. Error handling is "strict".
592641
Return *NULL* if an exception was raised by the codec.
593642

594-
These are the "Unicode Escape" codec APIs:
595643

596-
.. % --- Unicode-Escape Codecs ----------------------------------------------
644+
Unicode-Escape Codecs
645+
"""""""""""""""""""""
646+
647+
These are the "Unicode Escape" codec APIs:
597648

598649

599650
.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
@@ -615,9 +666,11 @@ These are the "Unicode Escape" codec APIs:
615666
string object. Error handling is "strict". Return *NULL* if an exception was
616667
raised by the codec.
617668

618-
These are the "Raw Unicode Escape" codec APIs:
619669

620-
.. % --- Raw-Unicode-Escape Codecs ------------------------------------------
670+
Raw-Unicode-Escape Codecs
671+
"""""""""""""""""""""""""
672+
673+
These are the "Raw Unicode Escape" codec APIs:
621674

622675

623676
.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
@@ -639,11 +692,13 @@ These are the "Raw Unicode Escape" codec APIs:
639692
Python string object. Error handling is "strict". Return *NULL* if an exception
640693
was raised by the codec.
641694

695+
696+
Latin-1 Codecs
697+
""""""""""""""
698+
642699
These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode
643700
ordinals and only these are accepted by the codecs during encoding.
644701

645-
.. % --- Latin-1 Codecs -----------------------------------------------------
646-
647702

648703
.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
649704

@@ -664,11 +719,13 @@ ordinals and only these are accepted by the codecs during encoding.
664719
object. Error handling is "strict". Return *NULL* if an exception was
665720
raised by the codec.
666721

722+
723+
ASCII Codecs
724+
""""""""""""
725+
667726
These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
668727
codes generate errors.
669728

670-
.. % --- ASCII Codecs -------------------------------------------------------
671-
672729

673730
.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
674731

@@ -689,9 +746,11 @@ codes generate errors.
689746
object. Error handling is "strict". Return *NULL* if an exception was
690747
raised by the codec.
691748

692-
These are the mapping codec APIs:
693749

694-
.. % --- Character Map Codecs -----------------------------------------------
750+
Character Map Codecs
751+
""""""""""""""""""""
752+
753+
These are the mapping codec APIs:
695754

696755
This codec is special in that it can be used to implement many different codecs
697756
(and this is in fact what was done to obtain most of the standard codecs
@@ -760,7 +819,9 @@ use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
760819
DBCS) is a class of encodings, not just one. The target encoding is defined by
761820
the user settings on the machine running the codec.
762821

763-
.. % --- MBCS codecs for Windows --------------------------------------------
822+
823+
MBCS codecs for Windows
824+
"""""""""""""""""""""""
764825

765826

766827
.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
@@ -790,20 +851,9 @@ the user settings on the machine running the codec.
790851
object. Error handling is "strict". Return *NULL* if an exception was
791852
raised by the codec.
792853

793-
For decoding file names and other environment strings, :cdata:`Py_FileSystemEncoding`
794-
should be used as the encoding, and ``"surrogateescape"`` should be used as the error
795-
handler. For encoding file names during argument parsing, the ``O&`` converter should
796-
be used, passsing PyUnicode_FSConverter as the conversion function:
797-
798-
.. cfunction:: int PyUnicode_FSConverter(PyObject* obj, void* result)
799-
800-
Convert *obj* into *result*, using the file system encoding, and the ``surrogateescape``
801-
error handler. *result* must be a ``PyObject*``, yielding a bytes or bytearray object
802-
which must be released if it is no longer used.
803-
804-
.. versionadded:: 3.1
805854

806-
.. % --- Methods & Slots ----------------------------------------------------
855+
Methods & Slots
856+
"""""""""""""""
807857

808858

809859
.. _unicodemethodsandslots:

Include/unicodeobject.h

Lines changed: 12 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1238,25 +1238,29 @@ PyAPI_FUNC(int) PyUnicode_EncodeDecimal(
12381238
/* --- File system encoding ---------------------------------------------- */
12391239

12401240
/* ParseTuple converter which converts a Unicode object into the file
1241-
system encoding, using the PEP 383 error handler; bytes objects are
1242-
output as-is. */
1241+
system encoding as a bytes object, using the "surrogateescape" error
1242+
handler; bytes objects are output as-is. */
12431243

12441244
PyAPI_FUNC(int) PyUnicode_FSConverter(PyObject*, void*);
12451245

1246-
/* Decode a null-terminated string using Py_FileSystemDefaultEncoding.
1246+
/* Decode a null-terminated string using Py_FileSystemDefaultEncoding
1247+
and the "surrogateescape" error handler.
12471248
1248-
If the encoding is supported by one of the built-in codecs (i.e., UTF-8,
1249-
UTF-16, UTF-32, Latin-1 or MBCS), otherwise fallback to UTF-8 and replace
1250-
invalid characters with '?'.
1249+
If Py_FileSystemDefaultEncoding is not set, fall back to UTF-8.
12511250
1252-
The function is intended to be used for paths and file names only
1253-
during bootstrapping process where the codecs are not set up.
1251+
Use PyUnicode_DecodeFSDefaultAndSize() if you have the string length.
12541252
*/
12551253

12561254
PyAPI_FUNC(PyObject*) PyUnicode_DecodeFSDefault(
12571255
const char *s /* encoded string */
12581256
);
12591257

1258+
/* Decode a string using Py_FileSystemDefaultEncoding
1259+
and the "surrogateescape" error handler.
1260+
1261+
If Py_FileSystemDefaultEncoding is not set, fall back to UTF-8.
1262+
*/
1263+
12601264
PyAPI_FUNC(PyObject*) PyUnicode_DecodeFSDefaultAndSize(
12611265
const char *s, /* encoded string */
12621266
Py_ssize_t size /* size */
