Commit 81b46ec ("#4153: merge with 3.2.")
2 parents: 35f8f37 + 410eee5

1 file changed: Doc/howto/unicode.rst (74 additions, 69 deletions)
@@ -44,7 +44,7 @@ machines assigned values between 128 and 255 to accented characters. Different
 machines had different codes, however, which led to problems exchanging files.
 Eventually various commonly used sets of values for the 128--255 range emerged.
 Some were true standards, defined by the International Standards Organization,
-and some were **de facto** conventions that were invented by one company or
+and some were *de facto* conventions that were invented by one company or
 another and managed to catch on.

 255 characters aren't very many. For example, you can't fit both the accented
@@ -62,8 +62,8 @@ bits means you have 2^16 = 65,536 distinct values available, making it possible
 to represent many different characters from many different alphabets; an initial
 goal was to have Unicode contain the alphabets for every single human language.
 It turns out that even 16 bits isn't enough to meet that goal, and the modern
-Unicode specification uses a wider range of codes, 0 through 1,114,111 (0x10ffff
-in base 16).
+Unicode specification uses a wider range of codes, 0 through 1,114,111 (
+``0x10FFFF`` in base 16).

 There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
 originally separate efforts, but the specifications were merged with the 1.1
@@ -87,9 +87,11 @@ meanings.

 The Unicode standard describes how characters are represented by **code
 points**. A code point is an integer value, usually denoted in base 16. In the
-standard, a code point is written using the notation U+12ca to mean the
-character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot
-of tables listing characters and their corresponding code points::
+standard, a code point is written using the notation ``U+12CA`` to mean the
+character with value ``0x12ca`` (4,810 decimal). The Unicode standard contains
+a lot of tables listing characters and their corresponding code points:
+
+.. code-block:: none

    0061 'a'; LATIN SMALL LETTER A
    0062 'b'; LATIN SMALL LETTER B
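The ``U+XXXX`` notation maps directly onto Python's built-ins. As a quick sketch (not part of the patch), :func:`ord` and :func:`chr` convert between a one-character string and its code point value:

```python
import unicodedata

# A code point is just an integer; chr() and ord() convert between a
# one-character string and its code point value.
wi = "\u12ca"                 # written U+12CA in the Unicode standard
print(ord(wi))                # 4810 (0x12CA in base 16)
print(chr(0x12CA) == wi)      # True
print(unicodedata.name(wi))   # ETHIOPIC SYLLABLE WI
```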
@@ -98,7 +100,7 @@ of tables listing characters and their corresponding code points::
    007B '{'; LEFT CURLY BRACKET

 Strictly, these definitions imply that it's meaningless to say 'this is
-character U+12ca'. U+12ca is a code point, which represents some particular
+character ``U+12CA``'. ``U+12CA`` is a code point, which represents some particular
 character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In
 informal contexts, this distinction between code points and characters will
 sometimes be forgotten.
@@ -115,13 +117,15 @@ Encodings
 ---------

 To summarize the previous section: a Unicode string is a sequence of code
-points, which are numbers from 0 through 0x10ffff (1,114,111 decimal). This
+points, which are numbers from 0 through ``0x10FFFF`` (1,114,111 decimal). This
 sequence needs to be represented as a set of bytes (meaning, values
 from 0 through 255) in memory. The rules for translating a Unicode string
 into a sequence of bytes are called an **encoding**.

 The first encoding you might think of is an array of 32-bit integers. In this
-representation, the string "Python" would look like this::
+representation, the string "Python" would look like this:
+
+.. code-block:: none

      P           y           t           h           o           n
   0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
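That hypothetical encoding is essentially UTF-32. As a sketch (not part of the patch), Python's ``utf-32-le`` codec produces exactly the 24-byte layout shown above:

```python
# Each character becomes four bytes: its code point in little-endian order,
# padded with zero bytes.
data = "Python".encode("utf-32-le")
print(data)       # b'P\x00\x00\x00y\x00\x00\x00t\x00\x00\x00h\x00\x00\x00o\x00\x00\x00n\x00\x00\x00'
print(len(data))  # 24, versus 6 bytes for the ASCII encoding
```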
@@ -133,10 +137,10 @@ problems.
 1. It's not portable; different processors order the bytes differently.

 2. It's very wasteful of space. In most texts, the majority of the code points
-   are less than 127, or less than 255, so a lot of space is occupied by zero
+   are less than 127, or less than 255, so a lot of space is occupied by ``0x00``
    bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
    ASCII representation. Increased RAM usage doesn't matter too much (desktop
-   computers have megabytes of RAM, and strings aren't usually that large), but
+   computers have gigabytes of RAM, and strings aren't usually that large), but
    expanding our usage of disk and network bandwidth by a factor of 4 is
    intolerable.

@@ -175,14 +179,12 @@ internal detail.

 UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode
 Transformation Format", and the '8' means that 8-bit numbers are used in the
-encoding. (There's also a UTF-16 encoding, but it's less frequently used than
-UTF-8.) UTF-8 uses the following rules:
+encoding. (There are also UTF-16 and UTF-32 encodings, but they are less
+frequently used than UTF-8.) UTF-8 uses the following rules:

-1. If the code point is <128, it's represented by the corresponding byte value.
-2. If the code point is between 128 and 0x7ff, it's turned into two byte values
-   between 128 and 255.
-3. Code points >0x7ff are turned into three- or four-byte sequences, where each
-   byte of the sequence is between 128 and 255.
+1. If the code point is < 128, it's represented by the corresponding byte value.
+2. If the code point is >= 128, it's turned into a sequence of two, three, or
+   four bytes, where each byte of the sequence is between 128 and 255.

 UTF-8 has several convenient properties:

@@ -192,8 +194,8 @@ UTF-8 has several convenient properties:
    processed by C functions such as ``strcpy()`` and sent through protocols that
    can't handle zero bytes.
 3. A string of ASCII text is also valid UTF-8 text.
-4. UTF-8 is fairly compact; the majority of code points are turned into two
-   bytes, and values less than 128 occupy only a single byte.
+4. UTF-8 is fairly compact; the majority of commonly used characters can be
+   represented with one or two bytes.
 5. If bytes are corrupted or lost, it's possible to determine the start of the
    next UTF-8-encoded code point and resynchronize. It's also unlikely that
    random 8-bit data will look like valid UTF-8.
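The encoding rules and the compactness and resynchronization properties can be checked directly; a sketch (not part of the patch):

```python
# Bytes used by UTF-8 for code points in each range:
for ch in ("a", "\u00e9", "\u20ac", "\U0001f600"):
    print("U+%06X -> %d byte(s)" % (ord(ch), len(ch.encode("utf-8"))))

# Property 3: ASCII text is already valid UTF-8.
print("abc".encode("utf-8") == b"abc")   # True

# Property 5: a stray continuation byte is rejected, not misinterpreted.
try:
    b"\x80\x80".decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```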
@@ -203,25 +205,25 @@ UTF-8 has several convenient properties:
 References
 ----------

-The Unicode Consortium site at <http://www.unicode.org> has character charts, a
+The `Unicode Consortium site <http://www.unicode.org>`_ has character charts, a
 glossary, and PDF versions of the Unicode specification. Be prepared for some
-difficult reading. <http://www.unicode.org/history/> is a chronology of the
-origin and development of Unicode.
+difficult reading. `A chronology <http://www.unicode.org/history/>`_ of the
+origin and development of Unicode is also available on the site.

-To help understand the standard, Jukka Korpela has written an introductory guide
-to reading the Unicode character tables, available at
-<http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
+To help understand the standard, Jukka Korpela has written `an introductory
+guide <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>`_ to reading the
+Unicode character tables.

-Another good introductory article was written by Joel Spolsky
-<http://www.joelonsoftware.com/articles/Unicode.html>.
+Another `good introductory article <http://www.joelonsoftware.com/articles/Unicode.html>`_
+was written by Joel Spolsky.
 If this introduction didn't make things clear to you, you should try reading this
 alternate article before continuing.

 .. Jason Orendorff XXX http://www.jorendorff.com/articles/unicode/ is broken

-Wikipedia entries are often helpful; see the entries for "character encoding"
-<http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
-<http://en.wikipedia.org/wiki/UTF-8>, for example.
+Wikipedia entries are often helpful; see the entries for "`character encoding
+<http://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
+<http://en.wikipedia.org/wiki/UTF-8>`_, for example.


 Python's Unicode Support
@@ -233,11 +235,11 @@ Unicode features.
 The String Type
 ---------------

-Since Python 3.0, the language features a ``str`` type that contain Unicode
+Since Python 3.0, the language features a :class:`str` type that contains Unicode
 characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
 rocks!'``, or the triple-quoted string syntax is stored as Unicode.

-To insert a Unicode character that is not part ASCII, e.g., any letters with
+To insert a non-ASCII Unicode character, e.g., any letters with
 accents, one can use escape sequences in their string literals as such::

    >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
@@ -247,15 +249,16 @@ accents, one can use escape sequences in their string literals as such::
    >>> "\U00000394"  # Using a 32-bit hex value
    '\u0394'

-In addition, one can create a string using the :func:`decode` method of
-:class:`bytes`. This method takes an encoding, such as UTF-8, and, optionally,
-an *errors* argument.
+In addition, one can create a string using the :func:`~bytes.decode` method of
+:class:`bytes`. This method takes an *encoding* argument, such as ``UTF-8``,
+and optionally, an *errors* argument.

 The *errors* argument specifies the response when the input string can't be
 converted according to the encoding's rules. Legal values for this argument are
-'strict' (raise a :exc:`UnicodeDecodeError` exception), 'replace' (use U+FFFD,
-'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the
-Unicode result). The following examples show the differences::
+``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
+``U+FFFD``, ``REPLACEMENT CHARACTER``), or ``'ignore'`` (just leave the
+character out of the Unicode result).
+The following examples show the differences::

    >>> b'\x80abc'.decode("utf-8", "strict")  #doctest: +NORMALIZE_WHITESPACE
    Traceback (most recent call last):
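The three *errors* values can also be compared side by side; a sketch (not part of the patch) using the same invalid input as the doctest above:

```python
data = b"\x80abc"            # a lone 0x80 byte is not valid UTF-8
try:
    data.decode("utf-8", "strict")
except UnicodeDecodeError as exc:
    print("strict raised:", exc)
print(data.decode("utf-8", "replace"))   # '\ufffdabc' (U+FFFD, then 'abc')
print(data.decode("utf-8", "ignore"))    # 'abc'
```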
@@ -273,8 +276,8 @@ a question mark because it may not be displayed on some systems.)
 Encodings are specified as strings containing the encoding's name. Python 3.2
 comes with roughly 100 different encodings; see the Python Library Reference at
 :ref:`standard-encodings` for a list. Some encodings have multiple names; for
-example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same
-encoding.
+example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
+the same encoding.

 One-character Unicode strings can also be created with the :func:`chr`
 built-in function, which takes integers and returns a Unicode string of length 1
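The alias claim is easy to verify with :func:`codecs.lookup`, which resolves any registered alias to its canonical codec; a sketch (not part of the patch):

```python
import codecs

# All three aliases resolve to one canonical codec name.
names = {codecs.lookup(alias).name for alias in ("latin-1", "iso_8859_1", "8859")}
print(names)              # a single-element set

# chr() builds the one-character string for any code point.
print(ord(chr(0x12CA)))   # 4810
```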
@@ -290,13 +293,14 @@ returns the code point value::
 Converting to Bytes
 -------------------

-Another important str method is ``.encode([encoding], [errors='strict'])``,
-which returns a ``bytes`` representation of the Unicode string, encoded in the
-requested encoding. The ``errors`` parameter is the same as the parameter of
-the :meth:`decode` method, with one additional possibility; as well as 'strict',
-'ignore', and 'replace' (which in this case inserts a question mark instead of
-the unencodable character), you can also pass 'xmlcharrefreplace' which uses
-XML's character references. The following example shows the different results::
+The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
+which returns a :class:`bytes` representation of the Unicode string, encoded in the
+requested *encoding*. The *errors* parameter is the same as the parameter of
+the :meth:`~bytes.decode` method, with one additional possibility; as well as
+``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case inserts a
+question mark instead of the unencodable character), you can also pass
+``'xmlcharrefreplace'`` which uses XML's character references.
+The following example shows the different results::

    >>> u = chr(40960) + 'abcd' + chr(1972)
    >>> u.encode('utf-8')
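The error handlers named in the paragraph above can be run against the same string as the doctest; a sketch (not part of the patch):

```python
u = chr(40960) + "abcd" + chr(1972)   # two characters ASCII cannot encode
print(u.encode("utf-8"))              # UTF-8 can encode any code point
try:
    u.encode("ascii")                 # 'strict' is the default
except UnicodeEncodeError as exc:
    print("strict raised:", exc)
print(u.encode("ascii", "ignore"))             # b'abcd'
print(u.encode("ascii", "replace"))            # b'?abcd?'
print(u.encode("ascii", "xmlcharrefreplace"))  # b'&#40960;abcd&#1972;'
```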
@@ -313,6 +317,8 @@ XML's character references. The following example shows the different results::
    >>> u.encode('ascii', 'xmlcharrefreplace')
    b'&#40960;abcd&#1972;'

+.. XXX mention the surrogate* error handlers
+
 The low-level routines for registering and accessing the available encodings are
 found in the :mod:`codecs` module. However, the encoding and decoding functions
 returned by this module are usually more low-level than is comfortable, so I'm
@@ -365,14 +371,14 @@ they have no significance to Python but are a convention. Python looks for
 ``coding: name`` or ``coding=name`` in the comment.

 If you don't include such a comment, the default encoding used will be UTF-8 as
-already mentioned.
+already mentioned. See also :pep:`263` for more information.


 Unicode Properties
 ------------------

 The Unicode specification includes a database of information about code points.
-For each code point that's defined, the information includes the character's
+For each defined code point, the information includes the character's
 name, its category, the numeric value if applicable (Unicode has characters
 representing the Roman numerals and fractions such as one-third and
 four-fifths). There are also properties related to the code point's use in
@@ -392,7 +398,9 @@ prints the numeric value of one particular character::
    # Get numeric value of second character
    print(unicodedata.numeric(u[1]))

-When run, this prints::
+When run, this prints:
+
+.. code-block:: none

    0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
    1 0bf2 No TAMIL NUMBER ONE THOUSAND
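The database also works in reverse; a sketch (not part of the patch) using :func:`unicodedata.lookup` on the second character from the output above:

```python
import unicodedata

# lookup() goes from a character's name back to the character itself.
ch = unicodedata.lookup("TAMIL NUMBER ONE THOUSAND")
print(hex(ord(ch)))               # 0xbf2
print(unicodedata.category(ch))   # No (Number, other)
print(unicodedata.numeric(ch))    # 1000.0
```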
@@ -413,7 +421,7 @@ list of category codes.
 References
 ----------

-The ``str`` type is described in the Python library reference at
+The :class:`str` type is described in the Python library reference at
 :ref:`textseq`.

 The documentation for the :mod:`unicodedata` module.
@@ -443,26 +451,26 @@ columns and can return Unicode values from an SQL query.

 Unicode data is usually converted to a particular encoding before it gets
 written to disk or sent over a socket. It's possible to do all the work
-yourself: open a file, read an 8-bit byte string from it, and convert the string
-with ``str(bytes, encoding)``. However, the manual approach is not recommended.
+yourself: open a file, read an 8-bit bytes object from it, and convert the string
+with ``bytes.decode(encoding)``. However, the manual approach is not recommended.

 One problem is the multi-byte nature of encodings; one Unicode character can be
 represented by several bytes. If you want to read the file in arbitrary-sized
-chunks (say, 1K or 4K), you need to write error-handling code to catch the case
+chunks (say, 1k or 4k), you need to write error-handling code to catch the case
 where only part of the bytes encoding a single Unicode character are read at the
 end of a chunk. One solution would be to read the entire file into memory and
 then perform the decoding, but that prevents you from working with files that
-are extremely large; if you need to read a 2Gb file, you need 2Gb of RAM.
+are extremely large; if you need to read a 2GB file, you need 2GB of RAM.
 (More, really, since for at least a moment you'd need to have both the encoded
 string and its Unicode version in memory.)

 The solution would be to use the low-level decoding interface to catch the case
 of partial coding sequences. The work of implementing this has already been
 done for you: the built-in :func:`open` function can return a file-like object
 that assumes the file's contents are in a specified encoding and accepts Unicode
-parameters for methods such as ``.read()`` and ``.write()``. This works through
+parameters for methods such as :meth:`read` and :meth:`write`. This works through
 :func:`open`\'s *encoding* and *errors* parameters which are interpreted just
-like those in string objects' :meth:`encode` and :meth:`decode` methods.
+like those in :meth:`str.encode` and :meth:`bytes.decode`.

 Reading Unicode from a file is therefore simple::

@@ -478,7 +486,7 @@ writing::
    f.seek(0)
    print(repr(f.readline()[:1]))

-The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
+The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
 written as the first character of a file in order to assist with autodetection
 of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
 present at the start of a file; when such an encoding is used, the BOM will be
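:func:`open`'s *encoding* parameter and the BOM handling described above can be seen together; a sketch (not part of the patch; the temporary file name is arbitrary) using the ``utf-8-sig`` codec, which writes the UTF-8 signature on output and strips it on input:

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "demo.txt")
    with open(path, "w", encoding="utf-8-sig") as f:  # writes a UTF-8 BOM first
        f.write("caf\u00e9")
    with open(path, "rb") as f:
        print(f.read()[:3] == codecs.BOM_UTF8)        # True: file starts with EF BB BF
    with open(path, encoding="utf-8-sig") as f:
        print(f.read())                               # 'café', BOM stripped
    with open(path, encoding="utf-8") as f:
        print(repr(f.read()))                         # '\ufeffcafé', BOM kept as U+FEFF
```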
@@ -520,12 +528,12 @@ Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
 filenames.

 Function :func:`os.listdir`, which returns filenames, raises an issue: should it return
-the Unicode version of filenames, or should it return byte strings containing
+the Unicode version of filenames, or should it return bytes containing
 the encoded versions? :func:`os.listdir` will do both, depending on whether you
-provided the directory path as a byte string or a Unicode string. If you pass a
+provided the directory path as bytes or a Unicode string. If you pass a
 Unicode string as the path, filenames will be decoded using the filesystem's
 encoding and a list of Unicode strings will be returned, while passing a byte
-path will return the byte string versions of the filenames. For example,
+path will return the bytes versions of the filenames. For example,
 assuming the default filesystem encoding is UTF-8, running the following
 program::

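The str-in/str-out versus bytes-in/bytes-out behaviour of :func:`os.listdir` can be demonstrated in isolation; a sketch (not part of the patch; the file name is arbitrary):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "example.txt"), "w").close()
    print(os.listdir(d))                # str path in, str filenames out
    print(os.listdir(os.fsencode(d)))   # bytes path in, bytes filenames out
```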
@@ -559,13 +567,13 @@ Unicode.

 The most important tip is:

-   Software should only work with Unicode strings internally, converting to a
-   particular encoding on output.
+   Software should only work with Unicode strings internally, decoding the input
+   data as soon as possible and encoding the output only at the end.

 If you attempt to write processing functions that accept both Unicode and byte
 strings, you will find your program vulnerable to bugs wherever you combine the
-two different kinds of strings. There is no automatic encoding or decoding if
-you do e.g. ``str + bytes``, a :exc:`TypeError` is raised for this expression.
+two different kinds of strings. There is no automatic encoding or decoding: if
+you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.

 When using data coming from a web browser or some other untrusted source, a
 common technique is to check for illegal characters in a string before using the
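The :exc:`TypeError` mentioned above, and the decode-at-the-boundary alternative it encourages; a sketch (not part of the patch):

```python
try:
    "one" + b"two"                      # mixing str and bytes never works
except TypeError as exc:
    print("TypeError:", exc)

# Decode at the boundary, then work purely with str:
print("one" + b"two".decode("ascii"))   # 'onetwo'
```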
@@ -610,7 +618,6 @@ Marc-André Lemburg, Martin von Löwis, Chad Whitacre.
    and that the HOWTO only covers 2.x.

 .. comment Describe Python 3.x support (new section? new document?)
-.. comment Additional topic: building Python w/ UCS2 or UCS4 support
 .. comment Describe use of codecs.StreamRecoder and StreamReaderWriter

 .. comment
@@ -640,5 +647,3 @@ Marc-André Lemburg, Martin von Löwis, Chad Whitacre.
    - [ ] Writing Unicode programs
    - [ ] Do everything in Unicode
    - [ ] Declaring source code encodings (PEP 263)
-   - [ ] Other issues
-   - [ ] Building Python (UCS2, UCS4)
