
Commit fbb3981

Refactor a bit the codecs doc.
1 parent 963004d · commit fbb3981

1 file changed

Doc/library/codecs.rst

Lines changed: 21 additions & 19 deletions
@@ -810,35 +810,36 @@ e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
 Windows). There's a string constant with 256 characters that shows you which
 character is mapped to which byte value.
 
-All of these encodings can only encode 256 of the 65536 (or 1114111) codepoints
+All of these encodings can only encode 256 of the 1114112 codepoints
 defined in Unicode. A simple and straightforward way that can store each Unicode
-code point, is to store each codepoint as two consecutive bytes. There are two
-possibilities: Store the bytes in big endian or in little endian order. These
-two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
-disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
-will always have to swap bytes on encoding and decoding. UTF-16 avoids this
-problem: Bytes will always be in natural endianness. When these bytes are read
+code point, is to store each codepoint as four consecutive bytes. There are two
+possibilities: store the bytes in big endian or in little endian order. These
+two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
+disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
+will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
+problem: bytes will always be in natural endianness. When these bytes are read
 by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a UTF-16 byte sequence, there's the so
-called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
-This character will be prepended to every UTF-16 byte sequence. The byte swapped
-version of this character (``0xFFFE``) is an illegal character that may not
-appear in a Unicode text. So when the first character in an UTF-16 byte sequence
+be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
+there's the so called BOM ("Byte Order Mark"). This is the Unicode character
+``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
+byte sequence. The byte swapped version of this character (``0xFFFE``) is an
+illegal character that may not appear in a Unicode text. So when the
+first character in an ``UTF-16`` or ``UTF-32`` byte sequence
 appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
-Unfortunately upto Unicode 4.0 the character ``U+FEFF`` had a second purpose as
-a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
+Unfortunately the character ``U+FEFF`` had a second purpose as
+a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
 With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
 deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
-Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
+Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
 it's a device to determine the storage layout of the encoded bytes, and vanishes
 once the byte sequence has been decoded into a string; as a ``ZERO WIDTH
 NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
 
 There's another encoding that is able to encoding the full range of Unicode
 characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
 with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
-parts: Marker bits (the most significant bits) and payload bits. The marker bits
+parts: marker bits (the most significant bits) and payload bits. The marker bits
 are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
 encoded like this (with x being payload bits, which when concatenated give the
 Unicode character):
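
The byte-order behaviour described in this hunk is easy to check interactively. The following is a minimal illustrative sketch, not part of the commit; it uses only standard codec names, and the byte values in the comments are what CPython's ``utf-32``/``utf-16``/``utf-8`` codecs produce:

    # Illustrative sketch only -- not part of the commit.
    text = "a"

    # The endianness-specific codecs never write a BOM; the byte order is
    # fixed by the codec name itself.
    assert text.encode("utf-32-be") == b"\x00\x00\x00a"
    assert text.encode("utf-32-le") == b"a\x00\x00\x00"

    # Plain "utf-32" uses the machine's natural byte order but prepends the
    # BOM (U+FEFF) so a decoder can tell which order was used ...
    data = text.encode("utf-32")
    assert data[:4] in (b"\x00\x00\xfe\xff", b"\xff\xfe\x00\x00")

    # ... and on decoding the BOM selects the byte order and then vanishes.
    assert data.decode("utf-32") == "a"

    # With an explicit-endianness codec a leading U+FEFF is not a BOM: it is
    # decoded as an ordinary ZERO WIDTH NO-BREAK SPACE character.
    assert ("\ufeff" + text).encode("utf-16-be").decode("utf-16-be") == "\ufeff" + text

    # UTF-8 marker/payload bits: U+00DF (0b11011111) needs two bytes,
    # 0b110_00011 and 0b10_011111, i.e. 0xc3 0x9f.
    assert "\u00df".encode("utf-8") == b"\xc3\x9f"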
@@ -877,13 +878,14 @@ map to
 | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
 | INVERTED QUESTION MARK
 
-in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
+in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
 correctly guessed from the byte sequence. So here the BOM is not used to be able
 to determine the byte order used for generating the byte sequence, but as a
 signature that helps in guessing the encoding. On encoding the utf-8-sig codec
 will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
-decoding utf-8-sig will skip those three bytes if they appear as the first three
-bytes in the file.
+decoding ``utf-8-sig`` will skip those three bytes if they appear as the first
+three bytes in the file. In UTF-8, the use of the BOM is discouraged and
+should generally be avoided.
 
 
 .. _standard-encodings:
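
The ``utf-8-sig`` behaviour touched by the second hunk can be demonstrated the same way; again a minimal sketch rather than anything from the commit. The three signature bytes are simply U+FEFF encoded as UTF-8:

    # Illustrative sketch only -- not part of the commit.
    # The utf-8-sig signature is just U+FEFF encoded as UTF-8.
    assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"

    # On encoding, utf-8-sig writes those three bytes first ...
    data = "abc".encode("utf-8-sig")
    assert data == b"\xef\xbb\xbf" + b"abc"

    # ... and on decoding it skips them if present. Plain utf-8 keeps the
    # character instead, which is one reason a BOM is discouraged for UTF-8.
    assert data.decode("utf-8-sig") == "abc"
    assert data.decode("utf-8") == "\ufeffabc"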
