Commit 2151fc6

Browse files
committed
#4153: update Unicode howto for Python 3.3
* state that python3 source encoding is UTF-8, and give examples
* mention surrogateescape in the 'tips and tricks' section, and
  backslashreplace in the "Python's Unicode Support" section.
* Describe Unicode support provided by the re module.
* link to Nick Coghlan's and Ned Batchelder's notes/presentations.
* default filesystem encoding is now UTF-8, not ascii.
* Describe StreamRecoder class.
* update acks section
* remove usage of "I think", "I'm not going to", etc.
* various edits
* remove revision history and original outline
1 parent ce3dd0b commit 2151fc6

1 file changed

Lines changed: 162 additions & 95 deletions

File tree

Doc/howto/unicode.rst

@@ -28,15 +28,15 @@ which required accented characters couldn't be faithfully represented in ASCII.
 as 'naïve' and 'café', and some publications have house styles which require
 spellings such as 'coöperate'.)
 
-For a while people just wrote programs that didn't display accents.  I remember
-looking at Apple ][ BASIC programs, published in French-language publications in
-the mid-1980s, that had lines like these::
+For a while people just wrote programs that didn't display accents.
+In the mid-1980s an Apple II BASIC program written by a French speaker
+might have lines like these::
 
    PRINT "FICHIER EST COMPLETE."
    PRINT "CARACTERE NON ACCEPTE."
 
-Those messages should contain accents, and they just look wrong to someone who
-can read French.
+Those messages should contain accents (complété, caractère, accepté),
+and they just look wrong to someone who can read French.
 
 In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
 hold values ranging from 0 to 255.  ASCII codes only went up to 127, so some
@@ -69,9 +69,12 @@ There's a related ISO standard, ISO 10646.  Unicode and ISO 10646 were
 originally separate efforts, but the specifications were merged with the 1.1
 revision of Unicode.
 
-(This discussion of Unicode's history is highly simplified.  I don't think the
-average Python programmer needs to worry about the historical details; consult
-the Unicode consortium site listed in the References for more information.)
+(This discussion of Unicode's history is highly simplified.  The
+precise historical details aren't necessary for understanding how to
+use Unicode effectively, but if you're curious, consult the Unicode
+consortium site listed in the References or
+the `Wikipedia entry for Unicode <http://en.wikipedia.org/wiki/Unicode#History>`_
+for more information.)
 
 
 Definitions
@@ -216,10 +219,8 @@ Unicode character tables.
 
 Another `good introductory article <http://www.joelonsoftware.com/articles/Unicode.html>`_
 was written by Joel Spolsky.
-If this introduction didn't make things clear to you, you should try reading this
-alternate article before continuing.
-
-.. Jason Orendorff XXX http://www.jorendorff.com/articles/unicode/ is broken
+If this introduction didn't make things clear to you, you should try
+reading this alternate article before continuing.
 
 Wikipedia entries are often helpful; see the entries for "`character encoding
 <http://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
239240
characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
240241
rocks!'``, or the triple-quoted string syntax is stored as Unicode.
241242

242-
To insert a non-ASCII Unicode character, e.g., any letters with
243-
accents, one can use escape sequences in their string literals as such::
243+
The default encoding for Python source code is UTF-8, so you can simply
244+
include a Unicode character in a string literal::
245+
246+
try:
247+
with open('/tmp/input.txt', 'r') as f:
248+
...
249+
except IOError:
250+
# 'File not found' error message.
251+
print("Fichier non trouvé")
252+
253+
You can use a different encoding from UTF-8 by putting a specially-formatted
254+
comment as the first or second line of the source code::
255+
256+
# -*- coding: <encoding name> -*-
257+
258+
Side note: Python 3 also supports using Unicode characters in identifiers::
259+
260+
répertoire = "/tmp/records.log"
261+
with open(répertoire, "w") as f:
262+
f.write("test\n")
263+
264+
If you can't enter a particular character in your editor or want to
265+
keep the source code ASCII-only for some reason, you can also use
266+
escape sequences in string literals. (Depending on your system,
267+
you may see the actual capital-delta glyph instead of a \u escape.) ::
244268

245269
>>> "\N{GREEK CAPITAL LETTER DELTA}" # Using the character name
246270
'\u0394'
@@ -251,7 +275,7 @@ accents, one can use escape sequences in their string literals as such::
 
 In addition, one can create a string using the :func:`~bytes.decode` method of
 :class:`bytes`.  This method takes an *encoding* argument, such as ``UTF-8``,
-and optionally, an *errors* argument.
+and optionally an *errors* argument.
 
 The *errors* argument specifies the response when the input string can't be
 converted according to the encoding's rules.  Legal values for this argument are
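The three standard *errors* values described in this part of the HOWTO can be tried out directly; the following sketch is not part of the commit, and the byte ``0x80`` is chosen because it cannot begin a UTF-8 sequence:

```python
# Decoding invalid UTF-8 with the three standard error handlers.
data = b"\x80abc"  # 0x80 can't start a UTF-8 sequence

try:
    data.decode("utf-8")  # 'strict' is the default and raises
except UnicodeDecodeError as exc:
    print("strict raised:", exc.reason)

print(data.decode("utf-8", "replace"))  # U+FFFD replacement char, then 'abc'
print(data.decode("utf-8", "ignore"))   # the bad byte is silently dropped
```

`'replace'` yields `'\ufffdabc'` and `'ignore'` yields `'abc'`, matching the behavior the text describes.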
@@ -295,11 +319,15 @@ Converting to Bytes
 
 The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
 which returns a :class:`bytes` representation of the Unicode string, encoded in the
-requested *encoding*.  The *errors* parameter is the same as the parameter of
-the :meth:`~bytes.decode` method, with one additional possibility; as well as
-``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case inserts a
-question mark instead of the unencodable character), you can also pass
-``'xmlcharrefreplace'`` which uses XML's character references.
+requested *encoding*.
+
+The *errors* parameter is the same as the parameter of the
+:meth:`~bytes.decode` method but supports a few more possible handlers.  As well as
+``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
+inserts a question mark instead of the unencodable character), there is
+also ``'xmlcharrefreplace'`` (inserts an XML character reference) and
+``'backslashreplace'`` (inserts a ``\uNNNN`` escape sequence).
+
 The following example shows the different results::
 
    >>> u = chr(40960) + 'abcd' + chr(1972)
@@ -316,16 +344,15 @@ The following example shows the different results::
    b'?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    b'&#40960;abcd&#1972;'
+   >>> u.encode('ascii', 'backslashreplace')
+   b'\\ua000abcd\\u07b4'
 
-.. XXX mention the surrogate* error handlers
-
-The low-level routines for registering and accessing the available encodings are
-found in the :mod:`codecs` module.  However, the encoding and decoding functions
-returned by this module are usually more low-level than is comfortable, so I'm
-not going to describe the :mod:`codecs` module here.  If you need to implement a
-completely new encoding, you'll need to learn about the :mod:`codecs` module
-interfaces, but implementing encodings is a specialized task that also won't be
-covered here.  Consult the Python documentation to learn more about this module.
+The low-level routines for registering and accessing the available
+encodings are found in the :mod:`codecs` module.  Implementing new
+encodings also requires understanding the :mod:`codecs` module.
+However, the encoding and decoding functions returned by this module
+are usually more low-level than is comfortable, and writing new encodings
+is a specialized task, so the module won't be covered in this HOWTO.
 
 
 Unicode Literals in Python Source Code
@@ -415,25 +442,61 @@ These are grouped into categories such as "Letter", "Number", "Punctuation", or
 from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
 "Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
 other".  See
-<http://www.unicode.org/reports/tr44/#General_Category_Values> for a
+`the General Category Values section of the Unicode Character Database documentation <http://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a
 list of category codes.
 
+
+Unicode Regular Expressions
+---------------------------
+
+The regular expressions supported by the :mod:`re` module can be provided
+either as bytes or strings.  Some of the special character sequences such as
+``\d`` and ``\w`` have different meanings depending on whether
+the pattern is supplied as bytes or a string.  For example,
+``\d`` will match the characters ``[0-9]`` in bytes but
+in strings will match any character that's in the ``'Nd'`` category.
+
+The string in this example has the number 57 written in both Thai and
+Arabic numerals::
+
+   import re
+   p = re.compile(r'\d+')
+
+   s = "Over \u0e55\u0e57 57 flavours"
+   m = p.search(s)
+   print(repr(m.group()))
+
+When executed, ``\d+`` will match the Thai numerals and print them
+out.  If you supply the :const:`re.ASCII` flag to
+:func:`~re.compile`, ``\d+`` will match the substring "57" instead.
+
+Similarly, ``\w`` matches a wide variety of Unicode characters but
+only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
+and ``\s`` will match either Unicode whitespace characters or
+``[ \t\n\r\f\v]``.
+
+
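The added section's claim about :const:`re.ASCII` can be checked directly; this companion sketch (not part of the commit) runs the same search both ways:

```python
import re

s = "Over \u0e55\u0e57 57 flavours"

# Unicode-aware \d: the Thai digits come first in the string, so they win.
print(re.search(r"\d+", s).group())            # the two Thai numerals

# With re.ASCII, \d is restricted to [0-9], so the match is "57".
print(re.search(r"\d+", s, re.ASCII).group())  # '57'
```

The first search returns `'\u0e55\u0e57'` (THAI DIGIT FIVE and SEVEN), the second returns `'57'`.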
 References
 ----------
 
+.. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" first section?
+
+Some good alternative discussions of Python's Unicode support are:
+
+* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
+* `Pragmatic Unicode <http://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.
+
 The :class:`str` type is described in the Python library reference at
 :ref:`textseq`.
 
 The documentation for the :mod:`unicodedata` module.
 
 The documentation for the :mod:`codecs` module.
 
-Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and
-Unicode".  A PDF version of his slides is available at
-<http://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>, and is an
-excellent overview of the design of Python's Unicode features (based on Python
-2, where the Unicode string type is called ``unicode`` and literals start with
-``u``).
+Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides) <http://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
+EuroPython 2002.  The slides are an excellent overview of the design
+of Python 2's Unicode features (where the Unicode string type is
+called ``unicode`` and literals start with ``u``).
 
 
 Reading and Writing Unicode Data
@@ -512,7 +575,7 @@ example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on
 Windows, Python uses the name "mbcs" to refer to whatever the currently
 configured encoding is.  On Unix systems, there will only be a filesystem
 encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
-you haven't, the default encoding is ASCII.
+you haven't, the default encoding is UTF-8.
 
 The :func:`sys.getfilesystemencoding` function returns the encoding to use on
 your current system, in case you want to do the encoding manually, but there's
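If you do want to convert manually, :func:`os.fsencode` and :func:`os.fsdecode` (added in Python 3.2) apply the filesystem encoding for you; a quick sketch, not part of the commit:

```python
import os
import sys

print(sys.getfilesystemencoding())  # e.g. 'utf-8' on most modern systems

name = "caf\u00e9.txt"
encoded = os.fsencode(name)     # str -> bytes, using the filesystem encoding
decoded = os.fsdecode(encoded)  # bytes -> str, reversing the conversion
assert decoded == name
```

These helpers round-trip cleanly even for filenames that aren't valid in the filesystem encoding, because they use the ``surrogateescape`` error handler internally.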
@@ -527,13 +590,13 @@ automatically converted to the right encoding for you::
 Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
 filenames.
 
-Function :func:`os.listdir`, which returns filenames, raises an issue: should it return
+The :func:`os.listdir` function returns filenames and raises an issue: should it return
 the Unicode version of filenames, or should it return bytes containing
 the encoded versions?  :func:`os.listdir` will do both, depending on whether you
 provided the directory path as bytes or a Unicode string.  If you pass a
 Unicode string as the path, filenames will be decoded using the filesystem's
 encoding and a list of Unicode strings will be returned, while passing a byte
-path will return the bytes versions of the filenames.  For example,
+path will return the filenames as bytes.  For example,
 assuming the default filesystem encoding is UTF-8, running the following
 program::
@@ -548,13 +611,13 @@ program::
 will produce the following output::
 
    amk:~$ python t.py
-   [b'.svn', b'filename\xe4\x94\x80abc', ...]
-   ['.svn', 'filename\u4500abc', ...]
+   [b'filename\xe4\x94\x80abc', ...]
+   ['filename\u4500abc', ...]
 
 The first list contains UTF-8-encoded filenames, and the second list contains
 the Unicode versions.
 
-Note that in most occasions, the Unicode APIs should be used.  The bytes APIs
+Note that on most occasions, the Unicode APIs should be used.  The bytes APIs
 should only be used on systems where undecodable file names can be present,
 i.e. Unix systems.
 
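The str-path/bytes-path behavior can also be demonstrated without a pre-built directory; this sketch (not part of the commit, and assuming a UTF-8 filesystem encoding) creates a temporary directory holding one accented filename and lists it both ways:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Create a single file with a non-ASCII name.
    open(os.path.join(d, "caf\u00e9.txt"), "w").close()

    print(os.listdir(d))               # str path  -> str filenames
    print(os.listdir(os.fsencode(d)))  # bytes path -> bytes filenames

    # The bytes results decode back to the str results.
    assert [os.fsdecode(n) for n in os.listdir(os.fsencode(d))] == os.listdir(d)
```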
@@ -585,65 +648,69 @@ data also specifies the encoding, since the attacker can then choose a
 clever way to hide malicious text in the encoded bytestream.
 
 
+Converting Between File Encodings
+'''''''''''''''''''''''''''''''''
+
+The :class:`~codecs.StreamRecoder` class can transparently convert between
+encodings, taking a stream that returns data in encoding #1
+and behaving like a stream returning data in encoding #2.
+
+For example, if you have an input file *f* that's in Latin-1, you
+can wrap it with a :class:`StreamRecoder` to return bytes encoded in UTF-8::
+
+   new_f = codecs.StreamRecoder(f,
+       # en/decoder: used by read() to encode its results and
+       # by write() to decode its input.
+       codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),
+
+       # reader/writer: used to read and write to the stream.
+       codecs.getreader('latin-1'), codecs.getwriter('latin-1') )
+
+
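The wrapper in the added section can be exercised with an in-memory stream; in this sketch (not part of the commit) the Latin-1 "file" is an :class:`io.BytesIO` object:

```python
import codecs
import io

# A Latin-1 encoded "file" held in memory.
f = io.BytesIO("caf\u00e9".encode("latin-1"))

new_f = codecs.StreamRecoder(f,
    # en/decoder: used by read() to encode its results and
    # by write() to decode its input.
    codecs.getencoder("utf-8"), codecs.getdecoder("utf-8"),

    # reader/writer: used to read and write to the underlying stream.
    codecs.getreader("latin-1"), codecs.getwriter("latin-1"))

print(new_f.read())  # the same text, now as UTF-8 bytes: b'caf\xc3\xa9'
```

Each `read()` decodes the underlying Latin-1 bytes to text and then re-encodes that text as UTF-8, so the single-byte `0xe9` ('é') comes out as the two-byte UTF-8 sequence `0xc3 0xa9`.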
+Files in an Unknown Encoding
+''''''''''''''''''''''''''''
+
+What can you do if you need to make a change to a file, but don't know
+the file's encoding?  If you know the encoding is ASCII-compatible and
+only want to examine or modify the ASCII parts, you can open the file
+with the ``surrogateescape`` error handler::
+
+   with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
+       data = f.read()
+
+   # make changes to the string 'data'
+
+   with open(fname + '.new', 'w',
+             encoding="ascii", errors="surrogateescape") as f:
+       f.write(data)
+
+The ``surrogateescape`` error handler will decode any non-ASCII bytes
+as code points in a special range running from U+DC80 to U+DCFF
+(the low surrogate values).  These code points will then turn back
+into the same bytes when the ``surrogateescape`` error handler is
+used to encode the data and write it back out.
+
+
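The round-trip property described above can be verified with a bytes object instead of a file; a short sketch, not part of the commit:

```python
# Non-ASCII bytes survive a decode/encode round trip when
# surrogateescape is used on both sides.
raw = b"Hello \x80 world"

text = raw.decode("ascii", "surrogateescape")
print(repr(text))  # the 0x80 byte became the code point U+DC80

assert "\udc80" in text
assert text.encode("ascii", "surrogateescape") == raw
```

Note that `repr()` is used for printing because a lone surrogate such as U+DC80 cannot itself be encoded for terminal output with a strict handler.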
 References
 ----------
 
-The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
-Applications in Python" are available at
-<http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
-and discuss questions of character encodings as well as how to internationalize
+One section of `Mastering Python 3 Input/Output <http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_, a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.
+
+The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware Applications in Python" <http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
+discuss questions of character encodings as well as how to internationalize
 and localize an application.  These slides cover Python 2.x only.
 
+`The Guts of Unicode in Python <http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_ is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode representation in Python 3.3.
+
 
 Acknowledgements
 ================
 
-Thanks to the following people who have noted errors or offered suggestions on
-this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler,
-Marc-André Lemburg, Martin von Löwis, Chad Whitacre.
-
-.. comment
-   Revision History
-
-   Version 1.0: posted August 5 2005.
-
-   Version 1.01: posted August 7 2005.  Corrects factual and markup errors; adds
-   several links.
-
-   Version 1.02: posted August 16 2005.  Corrects factual errors.
-
-   Version 1.1: Feb-Nov 2008.  Updates the document with respect to Python 3 changes.
-
-   Version 1.11: posted June 20 2010.  Notes that Python 3.x is not covered,
-   and that the HOWTO only covers 2.x.
-
-.. comment Describe Python 3.x support (new section? new document?)
-.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter
-
-.. comment
-   Original outline:
-
-   - [ ] Unicode introduction
-       - [ ] ASCII
-       - [ ] Terms
-           - [ ] Character
-           - [ ] Code point
-       - [ ] Encodings
-           - [ ] Common encodings: ASCII, Latin-1, UTF-8
-   - [ ] Unicode Python type
-       - [ ] Writing unicode literals
-           - [ ] Obscurity: -U switch
-       - [ ] Built-ins
-           - [ ] unichr()
-           - [ ] ord()
-           - [ ] unicode() constructor
-       - [ ] Unicode type
-           - [ ] encode(), decode() methods
-   - [ ] Unicodedata module for character properties
-   - [ ] I/O
-       - [ ] Reading/writing Unicode data into files
-           - [ ] Byte-order marks
-       - [ ] Unicode filenames
-   - [ ] Writing Unicode programs
-       - [ ] Do everything in Unicode
-       - [ ] Declaring source code encodings (PEP 263)
+The initial draft of this document was written by Andrew Kuchling.
+It has since been revised further by Alexander Belopolsky, Georg Brandl,
+Andrew Kuchling, and Ezio Melotti.
+
+Thanks to the following people who have noted errors or offered
+suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
+Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
+Lemburg, Martin von Löwis, Terry J. Reedy, Chad Whitacre.
