@@ -733,6 +733,107 @@ \subsection{Strings \label{strings}}
73373334
734734\end {verbatim }
735735
736+
737+ \subsection {Unicode Strings \label {unicodeStrings } }
738+ \sectionauthor {Marc-Andre Lemburg}{[email protected] }
739+
740+ Starting with Python 1.6, a new data type for storing text data is
741+ available to the programmer: the Unicode object. It can be used to
742+ store and manipulate Unicode data (see \url {http://www.unicode.org})
743+ and integrates well with the existing string objects, providing
744+ auto-conversions where necessary.
745+
746+ Unicode has the advantage of providing one ordinal for every character
747+ in every script used in modern and ancient texts. Previously, there
748+ were only 256 possible ordinals for script characters and texts were
749+ typically bound to a code page which mapped the ordinals to script
750+ characters. This led to a great deal of confusion, especially with respect
751+ to internationalization (usually written as \samp {i18n}: \character {i} +
752+ 18 characters + \character {n}) of software. Unicode solves these
753+ problems by defining one code page for all scripts.
754+
755+ Creating Unicode strings in Python is just as simple as creating
756+ normal strings:
757+
758+ \begin {verbatim }
759+ >>> u'Hello World !'
760+ u'Hello World !'
761+ \end {verbatim }
762+
763+ The small \character {u} in front of the quote indicates that a
764+ Unicode string is to be created. If you want to include
765+ special characters in the string, you can do so by using the Python
766+ \emph {Unicode-Escape } encoding. The following example shows how:
767+
768+ \begin {verbatim }
769+ >>> u'Hello\u0020World !'
770+ u'Hello World !'
771+ \end {verbatim }
772+
773+ The escape sequence \code {\e u0020} inserts the Unicode
774+ character with the hexadecimal ordinal 0x0020 (the space character) at the
775+ given position.
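
The effect of the escape can be checked from code. The following is a minimal sketch rather than part of the original examples; the variable names are chosen for illustration only.

```python
# A string written with the Unicode escape for the space character ...
escaped = u'Hello\u0020World !'
# ... and the same string written with a literal space.
literal = u'Hello World !'
print(escaped == literal)

# ord() returns the Unicode ordinal of the character the escape inserted.
space_ordinal = ord(escaped[5])
print(hex(space_ordinal))
```

Both spellings produce the same Unicode string, so the escape is only needed for characters that are awkward to type directly.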
776+
777+ Other characters are interpreted by using their respective ordinal
778+ values directly as Unicode ordinals. Because the lower 256
779+ Unicode ordinals are the same as the standard Latin-1 encoding used in many
780+ Western countries, the process of entering Unicode is greatly
781+ simplified.
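
The correspondence between Latin-1 and the lower 256 Unicode ordinals can be verified directly. This is a hedged sketch that assumes a more recent interpreter than the one described here (it relies on \code{bytearray} and on the \method{decode()} method of byte strings); the variable names are illustrative.

```python
# All 256 possible byte values, in ascending order.
all_bytes = bytes(bytearray(range(256)))

# Interpret those bytes as Latin-1 text.
decoded = all_bytes.decode('latin-1')

# Each character's Unicode ordinal equals the byte value it came from.
ordinals = [ord(ch) for ch in decoded]
print(ordinals == list(range(256)))
```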
782+
783+ For experts, there is also a raw mode just like the one for normal
784+ strings. You have to prefix the opening quote with 'ur' to have
785+ Python use the \emph {Raw-Unicode-Escape } encoding. It will only apply
786+ the above \code {\e uXXXX} conversion if there is an odd number of
787+ backslashes in front of the small 'u'.
788+
789+ \begin {verbatim }
790+ >>> ur'Hello\u0020World !'
791+ u'Hello World !'
792+ >>> ur'Hello\\u0020World !'
793+ u'Hello\\\\u0020World !'
794+ \end {verbatim }
795+
796+ The raw mode is most useful when you have to enter lots of backslashes,
797+ for example in regular expressions.
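
The rule about the number of backslashes can also be observed by applying the Raw-Unicode-Escape codec to a byte string. The sketch below assumes an interpreter that supports byte literals and the \method{decode()} method, neither of which appears in the examples above; the variable names are illustrative.

```python
# One backslash in front of the 'u' (an odd count): the escape is converted.
single = b'Hello\\u0020World !'.decode('raw_unicode_escape')
print(repr(single))

# Two backslashes in front of the 'u' (an even count): nothing is converted,
# and both backslashes stay in the result.
double = b'Hello\\\\u0020World !'.decode('raw_unicode_escape')
print(repr(double))
```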
798+
799+ Apart from these standard encodings, Python provides a whole set of
800+ other ways of creating Unicode strings on the basis of a known
801+ encoding.
802+
803+ The builtin \function {unicode()}\bifuncindex {unicode} provides access
804+ to all registered Unicode codecs (COders and DECoders). Some of the
805+ better-known encodings which these codecs can convert are
806+ \emph {Latin-1 }, \emph {ASCII }, \emph {UTF-8 } and \emph {UTF-16 }. The latter two
807+ are variable-length encodings which store each Unicode character
808+ in one or more units of 8 or 16 bits, respectively. Python uses UTF-8 as default encoding. This becomes
809+ noticeable when printing Unicode strings or writing them to files.
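
To see what variable length means in practice, one can count the bytes a few characters occupy in each encoding. This is a minimal sketch under the assumption of a recent interpreter; \code{encoded\_lengths} is a hypothetical helper written for this illustration, not a library function.

```python
def encoded_lengths(ch):
    """Return the byte counts of ch in UTF-8 and in UTF-16 as a tuple."""
    # 'utf-16-be' fixes the byte order, so no byte order mark is counted.
    return len(ch.encode('utf-8')), len(ch.encode('utf-16-be'))

# An ASCII letter, a Latin-1 letter, the euro sign, and a character
# outside the 16-bit range.
for ch in (u'a', u'\xe4', u'\u20ac', u'\U0001d11e'):
    print(repr(ch), encoded_lengths(ch))
```

The byte count grows with the ordinal in UTF-8, while UTF-16 uses two bytes for most characters and four for those beyond the 16-bit range.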
810+
811+ \begin {verbatim }
812+ >>> u"äöü"
813+ u'\344\366\374'
814+ >>> str(u"äöü")
815+ '\303\244\303\266\303\274'
816+ \end {verbatim }
817+
818+ If you have data in a specific encoding and want to produce a
819+ corresponding Unicode string from it, you can use the
820+ \function {unicode()} builtin with the encoding name as the second
821+ argument.
822+
823+ \begin {verbatim }
824+ >>> unicode('\303\244\303\266\303\274','UTF-8')
825+ u'\344\366\374'
826+ \end {verbatim }
827+
828+ To convert the Unicode string back into a string using the original
829+ encoding, Unicode objects provide an \method {encode()} method.
830+
831+ \begin {verbatim }
832+ >>> u"äöü".encode('UTF-8')
833+ '\303\244\303\266\303\274'
834+ \end {verbatim }
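
Encoding and decoding are inverse operations as long as the same encoding name is used in both directions. The following sketch illustrates the round trip; it assumes a recent interpreter, where byte strings offer a \method{decode()} method in place of the two-argument \function{unicode()} call, and the variable names are illustrative.

```python
# The three umlauts from the examples above, written as escapes.
text = u'\xe4\xf6\xfc'

# Encode to UTF-8 bytes, then decode those bytes again.
encoded = text.encode('utf-8')
restored = encoded.decode('utf-8')
print(restored == text)

# The same text needs fewer bytes in Latin-1 than in UTF-8.
latin1 = text.encode('latin-1')
print(len(encoded), len(latin1))
```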
835+
836+
736837\subsection {Lists \label {lists } }
737838
738839Python knows a number of \emph {compound } data types, used to group