Commit 93a6b13

Issue #4153: Updated Unicode HOWTO.
1 parent b970142 commit 93a6b13

1 file changed

Lines changed: 47 additions & 55 deletions

Doc/howto/unicode.rst
@@ -4,13 +4,11 @@
 Unicode HOWTO
 *****************
 
-:Release: 1.11
+:Release: 1.12
 
-This HOWTO discusses Python 2.x's support for Unicode, and explains
+This HOWTO discusses Python support for Unicode, and explains
 various problems that people commonly encounter when trying to work
-with Unicode. (This HOWTO has not yet been updated to cover the 3.x
-versions of Python.)
-
+with Unicode.
 
 Introduction to Unicode
 =======================
@@ -44,14 +42,14 @@ In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
 hold values ranging from 0 to 255. ASCII codes only went up to 127, so some
 machines assigned values between 128 and 255 to accented characters. Different
 machines had different codes, however, which led to problems exchanging files.
-Eventually various commonly used sets of values for the 128-255 range emerged.
+Eventually various commonly used sets of values for the 128--255 range emerged.
 Some were true standards, defined by the International Standards Organization,
 and some were **de facto** conventions that were invented by one company or
 another and managed to catch on.
 
 255 characters aren't very many. For example, you can't fit both the accented
 characters used in Western Europe and the Cyrillic alphabet used for Russian
-into the 128-255 range because there are more than 127 such characters.
+into the 128--255 range because there are more than 127 such characters.
 
 You could write files using different codes (all your Russian files in a coding
 system called KOI8, all your French files in a different coding system called
@@ -64,8 +62,8 @@ bits means you have 2^16 = 65,536 distinct values available, making it possible
 to represent many different characters from many different alphabets; an initial
 goal was to have Unicode contain the alphabets for every single human language.
 It turns out that even 16 bits isn't enough to meet that goal, and the modern
-Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in
-base-16).
+Unicode specification uses a wider range of codes, 0 through 1,114,111 (0x10ffff
+in base 16).
 
 There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
 originally separate efforts, but the specifications were merged with the 1.1
@@ -90,7 +88,7 @@ meanings.
 The Unicode standard describes how characters are represented by **code
 points**. A code point is an integer value, usually denoted in base 16. In the
 standard, a code point is written using the notation U+12ca to mean the
-character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
+character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot
 of tables listing characters and their corresponding code points::
 
     0061 'a'; LATIN SMALL LETTER A
@@ -117,10 +115,10 @@ Encodings
 ---------
 
 To summarize the previous section: a Unicode string is a sequence of code
-points, which are numbers from 0 to 0x10ffff. This sequence needs to be
-represented as a set of bytes (meaning, values from 0-255) in memory. The rules
-for translating a Unicode string into a sequence of bytes are called an
-**encoding**.
+points, which are numbers from 0 through 0x10ffff (1,114,111 decimal). This
+sequence needs to be represented as a set of bytes (meaning, values
+from 0 through 255) in memory. The rules for translating a Unicode string
+into a sequence of bytes are called an **encoding**.
 
 The first encoding you might think of is an array of 32-bit integers. In this
 representation, the string "Python" would look like this::
@@ -164,7 +162,7 @@ encoding, for example, are simple; for each code point:
 case.)
 
 Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points
-0-255 are identical to the Latin-1 values, so converting to this encoding simply
+0--255 are identical to the Latin-1 values, so converting to this encoding simply
 requires converting code points to byte values; if a code point larger than 255
 is encountered, the string can't be encoded into Latin-1.
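
The Latin-1 rule quoted in this hunk can be checked with a minimal sketch, assuming Python 3. The sample strings and variable names are illustrative only and do not come from the HOWTO:

```python
# Code points 0-255 map directly onto single Latin-1 bytes.
text = "caf\u00e9"                      # U+00E9 is 233, inside the Latin-1 range
encoded = text.encode("latin-1")
same_values = [ord(ch) for ch in text] == list(encoded)

# A code point above 255 has no Latin-1 byte, so encoding fails.
try:
    "\u20ac".encode("latin-1")          # U+20AC (euro sign) is 8364
    euro_encodable = True
except UnicodeEncodeError:
    euro_encodable = False

print(same_values, euro_encodable)
```

The first check compares each character's code point with the byte produced for it; the second shows the failure case for a code point larger than 255.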
@@ -226,8 +224,8 @@ Wikipedia entries are often helpful; see the entries for "character encoding"
 <http://en.wikipedia.org/wiki/UTF-8>, for example.
 
 
-Python 2.x's Unicode Support
-============================
+Python's Unicode Support
+========================
 
 Now that you've learned the rudiments of Unicode, we can look at Python's
 Unicode features.
@@ -265,7 +263,7 @@ Unicode result). The following examples show the differences::
     UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
                         unexpected code byte
     >>> b'\x80abc'.decode("utf-8", "replace")
-    '\ufffdabc'
+    '�abc'
     >>> b'\x80abc'.decode("utf-8", "ignore")
     'abc'
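
The three error-handling modes shown in this hunk can be compared side by side. This is a minimal sketch assuming Python 3; the variable names are illustrative and the exact wording of the exception message varies between Python versions, so it is not reproduced here:

```python
data = b"\x80abc"   # 0x80 cannot start a UTF-8 sequence

# "strict" is the default handler and raises on invalid input.
try:
    data.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "error"

# "replace" substitutes U+FFFD; "ignore" drops the offending byte.
replaced = data.decode("utf-8", "replace")
ignored = data.decode("utf-8", "ignore")

print(strict_result, ascii(replaced), ascii(ignored))
```

Printing with `ascii()` shows the replacement character as an escape, which avoids depending on the terminal's ability to render U+FFFD.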
@@ -281,10 +279,10 @@ that contains the corresponding code point. The reverse operation is the
 built-in :func:`ord` function that takes a one-character Unicode string and
 returns the code point value::
 
-    >>> chr(40960)
-    '\ua000'
-    >>> ord('\ua000')
-    40960
+    >>> chr(57344)
+    '\ue000'
+    >>> ord('\ue000')
+    57344
 
 Converting to Bytes
 -------------------
326324

327325
In Python source code, specific Unicode code points can be written using the
328326
``\u`` escape sequence, which is followed by four hex digits giving the code
329-
point. The ``\U`` escape sequence is similar, but expects 8 hex digits, not 4::
327+
point. The ``\U`` escape sequence is similar, but expects eight hex digits,
328+
not four::
330329

331330
>>> s = "a\xac\u1234\u20ac\U00008000"
332331
^^^^ two-digit hex escape
@@ -465,18 +464,17 @@ like those in string objects' :meth:`encode` and :meth:`decode` methods.
465464

466465
Reading Unicode from a file is therefore simple::
467466

468-
f = open('unicode.rst', encoding='utf-8')
469-
for line in f:
470-
print(repr(line))
467+
with open('unicode.rst', encoding='utf-8') as f:
468+
for line in f:
469+
print(repr(line))
471470

472471
It's also possible to open files in update mode, allowing both reading and
473472
writing::
474473

475-
f = open('test', encoding='utf-8', mode='w+')
476-
f.write('\u4500 blah blah blah\n')
477-
f.seek(0)
478-
print(repr(f.readline()[:1]))
479-
f.close()
474+
with open('test', encoding='utf-8', mode='w+') as f:
475+
f.write('\u4500 blah blah blah\n')
476+
f.seek(0)
477+
print(repr(f.readline()[:1]))
480478

481479
The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
482480
written as the first character of a file in order to assist with autodetection
@@ -513,14 +511,13 @@ usually just provide the Unicode string as the filename, and it will be
 automatically converted to the right encoding for you::
 
     filename = 'filename\u4500abc'
-    f = open(filename, 'w')
-    f.write('blah\n')
-    f.close()
+    with open(filename, 'w') as f:
+        f.write('blah\n')
 
 Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
 filenames.
 
-:func:`os.listdir`, which returns filenames, raises an issue: should it return
+Function :func:`os.listdir`, which returns filenames, raises an issue: should it return
 the Unicode version of filenames, or should it return byte strings containing
 the encoded versions? :func:`os.listdir` will do both, depending on whether you
 provided the directory path as a byte string or a Unicode string. If you pass a
@@ -569,14 +566,6 @@ strings, you will find your program vulnerable to bugs wherever you combine the
 two different kinds of strings. There is no automatic encoding or decoding if
 you do e.g. ``str + bytes``, a :exc:`TypeError` is raised for this expression.
 
-It's easy to miss such problems if you only test your software with data that
-doesn't contain any accents; everything will seem to work, but there's actually
-a bug in your program waiting for the first user who attempts to use characters
-> 127. A second tip, therefore, is:
-
-    Include characters > 127 and, even better, characters > 255 in your test
-    data.
-
 When using data coming from a web browser or some other untrusted source, a
 common technique is to check for illegal characters in a string before using the
 string in a generated command line or storing it in a database. If you're doing
@@ -594,8 +583,8 @@ this code::
         if '/' in filename:
             raise ValueError("'/' not allowed in filenames")
         unicode_name = filename.decode(encoding)
-        f = open(unicode_name, 'r')
-        # ... return contents of file ...
+        with open(unicode_name, 'r') as f:
+            # ... return contents of file ...
 
 However, if an attacker could specify the ``'base64'`` encoding, they could pass
 ``'L2V0Yy9wYXNzd2Q='``, which is the base-64 encoded form of the string
@@ -610,27 +599,30 @@ The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
 Applications in Python" are available at
 <http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
 and discuss questions of character encodings as well as how to internationalize
-and localize an application.
+and localize an application. These slides cover Python 2.x only.
 
 
-Revision History and Acknowledgements
-=====================================
+Acknowledgements
+================
 
 Thanks to the following people who have noted errors or offered suggestions on
 this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler,
 Marc-André Lemburg, Martin von Löwis, Chad Whitacre.
 
-Version 1.0: posted August 5 2005.
+.. comment
+   Revision History
+
+   Version 1.0: posted August 5 2005.
 
-Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds
-several links.
+   Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds
+   several links.
 
-Version 1.02: posted August 16 2005. Corrects factual errors.
+   Version 1.02: posted August 16 2005. Corrects factual errors.
 
-Version 1.1: Feb-Nov 2008. Updates the document with respect to Python 3 changes.
+   Version 1.1: Feb-Nov 2008. Updates the document with respect to Python 3 changes.
 
-Version 1.11: posted June 20 2010. Notes that Python 3.x is not covered,
-and that the HOWTO only covers 2.x.
+   Version 1.11: posted June 20 2010. Notes that Python 3.x is not covered,
+   and that the HOWTO only covers 2.x.
 
 .. comment Describe Python 3.x support (new section? new document?)
 .. comment Additional topic: building Python w/ UCS2 or UCS4 support
