
Commit 92b9584

Merge the codecs doc refactoring with 3.2.

2 parents: 7a03f64 + fbb3981
1 file changed: 21 additions & 19 deletions

File: Doc/library/codecs.rst
@@ -809,35 +809,36 @@ e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
 Windows). There's a string constant with 256 characters that shows you which
 character is mapped to which byte value.
 
-All of these encodings can only encode 256 of the 65536 (or 1114111) codepoints
+All of these encodings can only encode 256 of the 1114112 codepoints
 defined in Unicode. A simple and straightforward way that can store each Unicode
-code point, is to store each codepoint as two consecutive bytes. There are two
-possibilities: Store the bytes in big endian or in little endian order. These
-two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
-disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
-will always have to swap bytes on encoding and decoding. UTF-16 avoids this
-problem: Bytes will always be in natural endianness. When these bytes are read
+code point, is to store each codepoint as four consecutive bytes. There are two
+possibilities: store the bytes in big endian or in little endian order. These
+two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
+disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
+will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
+problem: bytes will always be in natural endianness. When these bytes are read
 by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a UTF-16 byte sequence, there's the so
-called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
-This character will be prepended to every UTF-16 byte sequence. The byte swapped
-version of this character (``0xFFFE``) is an illegal character that may not
-appear in a Unicode text. So when the first character in an UTF-16 byte sequence
+be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
+there's the so called BOM ("Byte Order Mark"). This is the Unicode character
+``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
+byte sequence. The byte swapped version of this character (``0xFFFE``) is an
+illegal character that may not appear in a Unicode text. So when the
+first character in an ``UTF-16`` or ``UTF-32`` byte sequence
 appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
-Unfortunately upto Unicode 4.0 the character ``U+FEFF`` had a second purpose as
-a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
+Unfortunately the character ``U+FEFF`` had a second purpose as
+a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
 With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
 deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
-Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
+Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
 it's a device to determine the storage layout of the encoded bytes, and vanishes
 once the byte sequence has been decoded into a string; as a ``ZERO WIDTH
 NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
 
 There's another encoding that is able to encoding the full range of Unicode
 characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
 with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
-parts: Marker bits (the most significant bits) and payload bits. The marker bits
+parts: marker bits (the most significant bits) and payload bits. The marker bits
 are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
 encoded like this (with x being payload bits, which when concatenated give the
 Unicode character):
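The byte-order behaviour described in the hunk above can be observed directly in Python; a minimal sketch (the example string ``"ab"`` is arbitrary):

```python
import codecs

# Explicit-endianness codecs store the same code points in opposite
# byte orders, and neither one writes a BOM.
assert "ab".encode("utf-32-be") == b"\x00\x00\x00a\x00\x00\x00b"
assert "ab".encode("utf-32-le") == b"a\x00\x00\x00b\x00\x00\x00"

# The plain "utf-32" codec prepends a BOM in the machine's natural
# endianness, so the decoder can detect the byte order on its own.
data = "ab".encode("utf-32")
assert data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE))
assert data.decode("utf-32") == "ab"
```

The same round-trip works with ``utf-16``/``codecs.BOM_UTF16_LE``/``codecs.BOM_UTF16_BE``, only with two bytes per code point instead of four.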
@@ -876,13 +877,14 @@ map to
 | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
 | INVERTED QUESTION MARK
 
-in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
+in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
 correctly guessed from the byte sequence. So here the BOM is not used to be able
 to determine the byte order used for generating the byte sequence, but as a
 signature that helps in guessing the encoding. On encoding the utf-8-sig codec
 will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
-decoding utf-8-sig will skip those three bytes if they appear as the first three
-bytes in the file.
+decoding ``utf-8-sig`` will skip those three bytes if they appear as the first
+three bytes in the file. In UTF-8, the use of the BOM is discouraged and
+should generally be avoided.
 
 
 .. _standard-encodings:
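The ``utf-8-sig`` signature behaviour documented in this hunk is easy to check; a small sketch (the example character ``"¿"`` is arbitrary):

```python
# On encoding, utf-8-sig writes the UTF-8 encoded BOM (0xef 0xbb 0xbf)
# as a signature; on decoding it skips those bytes if they are present.
data = "¿".encode("utf-8-sig")
assert data[:3] == b"\xef\xbb\xbf"
assert data.decode("utf-8-sig") == "¿"

# Plain utf-8 decoding keeps the signature as an ordinary U+FEFF
# character, which illustrates why a BOM in UTF-8 is discouraged.
assert data.decode("utf-8") == "\ufeff¿"
```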
