   Unicode HOWTO
 *****************
 
-:Release: 1.11
+:Release: 1.12
 
-This HOWTO discusses Python 2.x's support for Unicode, and explains
+This HOWTO discusses Python support for Unicode, and explains
 various problems that people commonly encounter when trying to work
-with Unicode. (This HOWTO has not yet been updated to cover the 3.x
-versions of Python.)
-
+with Unicode.
 
 Introduction to Unicode
 =======================
@@ -44,14 +42,14 @@ In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
 hold values ranging from 0 to 255. ASCII codes only went up to 127, so some
 machines assigned values between 128 and 255 to accented characters. Different
 machines had different codes, however, which led to problems exchanging files.
-Eventually various commonly used sets of values for the 128-255 range emerged.
+Eventually various commonly used sets of values for the 128--255 range emerged.
 Some were true standards, defined by the International Standards Organization,
 and some were **de facto** conventions that were invented by one company or
 another and managed to catch on.
 
 255 characters aren't very many. For example, you can't fit both the accented
 characters used in Western Europe and the Cyrillic alphabet used for Russian
-into the 128-255 range because there are more than 127 such characters.
+into the 128--255 range because there are more than 127 such characters.
 
 You could write files using different codes (all your Russian files in a coding
 system called KOI8, all your French files in a different coding system called
@@ -64,8 +62,8 @@ bits means you have 2^16 = 65,536 distinct values available, making it possible
 to represent many different characters from many different alphabets; an initial
 goal was to have Unicode contain the alphabets for every single human language.
 It turns out that even 16 bits isn't enough to meet that goal, and the modern
-Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in
-base-16).
+Unicode specification uses a wider range of codes, 0 through 1,114,111 (0x10ffff
+in base 16).
 
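The code range described above can be checked from the interpreter. The following is a minimal sketch, assuming Python 3.3 or later (where ``sys.maxunicode`` is always 0x10ffff); the variable names are illustrative only.

```python
import sys

# Largest code point the interpreter accepts (0x10ffff on Python 3.3+).
highest = sys.maxunicode
print(hex(highest), highest)

# chr() accepts every integer from 0 through 0x10ffff ...
last_char = chr(0x10FFFF)
print(ord(last_char) == highest)

# ... and rejects anything past the end of the range.
try:
    chr(0x110000)
    rejected = False
except ValueError as exc:
    rejected = True
    print("out of range:", exc)
```
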
 There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
 originally separate efforts, but the specifications were merged with the 1.1
@@ -90,7 +88,7 @@ meanings.
 The Unicode standard describes how characters are represented by **code
 points**. A code point is an integer value, usually denoted in base 16. In the
 standard, a code point is written using the notation U+12ca to mean the
-character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
+character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot
 of tables listing characters and their corresponding code points::
 
     0061    'a'; LATIN SMALL LETTER A
@@ -117,10 +115,10 @@ Encodings
 ---------
 
 To summarize the previous section: a Unicode string is a sequence of code
-points, which are numbers from 0 to 0x10ffff. This sequence needs to be
-represented as a set of bytes (meaning, values from 0-255) in memory. The rules
-for translating a Unicode string into a sequence of bytes are called an
-**encoding**.
+points, which are numbers from 0 through 0x10ffff (1,114,111 decimal). This
+sequence needs to be represented as a set of bytes (meaning, values
+from 0 through 255) in memory. The rules for translating a Unicode string
+into a sequence of bytes are called an **encoding**.
 
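As a rough illustration of the summary above, the sketch below lists the code points of a short string and then the bytes produced by one encoding. UTF-8 is used here only as an example choice, and the variable names are assumptions made for this illustration.

```python
# A four-character string; U+00E9 is LATIN SMALL LETTER E WITH ACUTE.
text = "caf\u00e9"

# A Unicode string is a sequence of code points (plain integers).
code_points = [ord(ch) for ch in text]
print(code_points)            # [99, 97, 102, 233]

# An encoding turns that sequence into bytes, each in the range 0 through 255.
encoded = text.encode("utf-8")
byte_values = list(encoded)
print(byte_values)            # [99, 97, 102, 195, 169]

# Decoding with the same encoding recovers the original string.
round_trip = encoded.decode("utf-8")
print(round_trip == text)     # True
```

Note that the number of bytes (five) differs from the number of code points (four): the relationship between the two depends entirely on the encoding chosen.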
 The first encoding you might think of is an array of 32-bit integers. In this
 representation, the string "Python" would look like this::
@@ -164,7 +162,7 @@ encoding, for example, are simple; for each code point:
    case.)
 
 Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points
-0-255 are identical to the Latin-1 values, so converting to this encoding simply
+0--255 are identical to the Latin-1 values, so converting to this encoding simply
 requires converting code points to byte values; if a code point larger than 255
 is encountered, the string can't be encoded into Latin-1.
 
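The Latin-1 behaviour described above can be sketched as follows. This is only an illustration using the built-in ``'latin-1'`` codec; the sample strings and variable names are assumptions chosen for the example.

```python
# Every code point in this string is below 256, so each one becomes
# a single byte with the same numeric value.
text = "caf\u00e9"
encoded = text.encode("latin-1")
print(encoded)                                    # b'caf\xe9'
same_values = list(encoded) == [ord(ch) for ch in text]
print(same_values)                                # True

# U+20AC (EURO SIGN) is larger than 255, so Latin-1 cannot represent it.
error = None
try:
    "\u20ac".encode("latin-1")
except UnicodeEncodeError as exc:
    error = exc
    print("cannot encode:", exc)
```
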
@@ -226,8 +224,8 @@ Wikipedia entries are often helpful; see the entries for "character encoding"
 <http://en.wikipedia.org/wiki/UTF-8>, for example.
 
 
-Python 2.x's Unicode Support
-============================
+Python's Unicode Support
+========================
 
 Now that you've learned the rudiments of Unicode, we can look at Python's
 Unicode features.
@@ -265,7 +263,7 @@ Unicode result). The following examples show the differences::
     UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
                         unexpected code byte
     >>> b'\x80abc'.decode("utf-8", "replace")
-    '\ufffdabc'
+    '�abc'
     >>> b'\x80abc'.decode("utf-8", "ignore")
     'abc'
 
@@ -281,10 +279,10 @@ that contains the corresponding code point. The reverse operation is the
 built-in :func:`ord` function that takes a one-character Unicode string and
 returns the code point value::
 
-    >>> chr(40960)
-    '\ua000'
-    >>> ord('\ua000')
-    40960
+    >>> chr(57344)
+    '\ue000'
+    >>> ord('\ue000')
+    57344
 
 Converting to Bytes
 -------------------
@@ -326,7 +324,8 @@ Unicode Literals in Python Source Code
 
 In Python source code, specific Unicode code points can be written using the
 ``\u`` escape sequence, which is followed by four hex digits giving the code
-point. The ``\U`` escape sequence is similar, but expects 8 hex digits, not 4::
+point. The ``\U`` escape sequence is similar, but expects eight hex digits,
+not four::
 
     >>> s = "a\xac\u1234\u20ac\U00008000"
               ^^^^ two-digit hex escape
@@ -465,18 +464,17 @@ like those in string objects' :meth:`encode` and :meth:`decode` methods.
 
 Reading Unicode from a file is therefore simple::
 
-    f = open('unicode.rst', encoding='utf-8')
-    for line in f:
-        print(repr(line))
+    with open('unicode.rst', encoding='utf-8') as f:
+        for line in f:
+            print(repr(line))
 
 It's also possible to open files in update mode, allowing both reading and
 writing::
 
-    f = open('test', encoding='utf-8', mode='w+')
-    f.write('\u4500 blah blah blah\n')
-    f.seek(0)
-    print(repr(f.readline()[:1]))
-    f.close()
+    with open('test', encoding='utf-8', mode='w+') as f:
+        f.write('\u4500 blah blah blah\n')
+        f.seek(0)
+        print(repr(f.readline()[:1]))
 
 The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
 written as the first character of a file in order to assist with autodetection
@@ -513,14 +511,13 @@ usually just provide the Unicode string as the filename, and it will be
 automatically converted to the right encoding for you::
 
     filename = 'filename\u4500abc'
-    f = open(filename, 'w')
-    f.write('blah\n')
-    f.close()
+    with open(filename, 'w') as f:
+        f.write('blah\n')
 
 Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
 filenames.
 
-:func:`os.listdir`, which returns filenames, raises an issue: should it return
+Function :func:`os.listdir`, which returns filenames, raises an issue: should it return
 the Unicode version of filenames, or should it return byte strings containing
 the encoded versions? :func:`os.listdir` will do both, depending on whether you
 provided the directory path as a byte string or a Unicode string. If you pass a
@@ -569,14 +566,6 @@ strings, you will find your program vulnerable to bugs wherever you combine the
 two different kinds of strings. There is no automatic encoding or decoding if
 you do e.g. ``str + bytes``, a :exc:`TypeError` is raised for this expression.
 
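The ``str + bytes`` failure mentioned above, and the explicit conversions that avoid it, can be sketched like this. The sample values and variable names are assumptions made for illustration; which direction to convert in depends on whether the program needs text or bytes at that point.

```python
text = "caf\u00e9"          # a str (Unicode string)
data = b" au lait"          # a bytes object

# Mixing the two types directly raises TypeError; nothing is converted
# automatically.
mixed_failed = False
try:
    text + data
except TypeError as exc:
    mixed_failed = True
    print("TypeError:", exc)

# Decode the bytes explicitly to stay in the world of text ...
combined_text = text + data.decode("ascii")
print(combined_text)

# ... or encode the text explicitly to stay in the world of bytes.
combined_bytes = text.encode("utf-8") + data
print(combined_bytes)
```
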
-It's easy to miss such problems if you only test your software with data that
-doesn't contain any accents; everything will seem to work, but there's actually
-a bug in your program waiting for the first user who attempts to use characters
-> 127. A second tip, therefore, is:
-
-    Include characters > 127 and, even better, characters > 255 in your test
-    data.
-
 When using data coming from a web browser or some other untrusted source, a
 common technique is to check for illegal characters in a string before using the
 string in a generated command line or storing it in a database. If you're doing
@@ -594,8 +583,8 @@ this code::
         if '/' in filename:
             raise ValueError("'/' not allowed in filenames")
         unicode_name = filename.decode(encoding)
-        f = open(unicode_name, 'r')
-        # ... return contents of file ...
+        with open(unicode_name, 'r') as f:
+            # ... return contents of file ...
 
 However, if an attacker could specify the ``'base64'`` encoding, they could pass
 ``'L2V0Yy9wYXNzd2Q='``, which is the base-64 encoded form of the string
@@ -610,27 +599,30 @@ The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
 Applications in Python" are available at
 <http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
 and discuss questions of character encodings as well as how to internationalize
-and localize an application.
+and localize an application. These slides cover Python 2.x only.
 
 
-Revision History and Acknowledgements
-=====================================
+Acknowledgements
+================
 
 Thanks to the following people who have noted errors or offered suggestions on
 this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler,
 Marc-André Lemburg, Martin von Löwis, Chad Whitacre.
 
-Version 1.0: posted August 5 2005.
+.. comment
+   Revision History
+
+   Version 1.0: posted August 5 2005.
 
-Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds
-several links.
+   Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds
+   several links.
 
-Version 1.02: posted August 16 2005. Corrects factual errors.
+   Version 1.02: posted August 16 2005. Corrects factual errors.
 
-Version 1.1: Feb-Nov 2008. Updates the document with respect to Python 3 changes.
+   Version 1.1: Feb-Nov 2008. Updates the document with respect to Python 3 changes.
 
-Version 1.11: posted June 20 2010. Notes that Python 3.x is not covered,
-and that the HOWTO only covers 2.x.
+   Version 1.11: posted June 20 2010. Notes that Python 3.x is not covered,
+   and that the HOWTO only covers 2.x.
 
 .. comment Describe Python 3.x support (new section? new document?)
 .. comment Additional topic: building Python w/ UCS2 or UCS4 support