Commit 9dc30bb
committed
Marc-Andre Lemburg <[email protected]>:
Tutorial information about Unicode strings in Python, with some markup adjustments from FLD.
1 parent a4cd261 commit 9dc30bb

1 file changed

Lines changed: 101 additions & 0 deletions

Doc/tut/tut.tex

@@ -733,6 +733,107 @@ \subsection{Strings \label{strings}}
34
\end{verbatim}

\subsection{Unicode Strings \label{unicodeStrings}}
\sectionauthor{Marc-Andre Lemburg}{[email protected]}

Starting with Python 1.6, a new data type for storing text data is
available to the programmer: the Unicode object. It can be used to
store and manipulate Unicode data (see \url{http://www.unicode.org})
and integrates well with the existing string objects, providing
auto-conversions where necessary.

Unicode has the advantage of providing one ordinal for every character
in every script used in modern and ancient texts. Previously, there
were only 256 possible ordinals for script characters, and texts were
typically bound to a code page which mapped the ordinals to script
characters. This led to a great deal of confusion, especially with
respect to internationalization (usually written as \samp{i18n}:
\character{i} + 18 characters + \character{n}) of software. Unicode
solves these problems by defining one code page for all scripts.
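
As a rough illustration of the code page problem, the sketch below
decodes the same single byte under three different 8-bit code pages and
obtains three different characters, each with its own Unicode ordinal.
It assumes a current Python 3 interpreter (this section describes
Python 1.6), where \code{bytes} holds encoded data and \code{str} holds
Unicode text. The byte value and the choice of code pages are only
illustrative.

```python
# One byte, three code pages: what ordinal 0xE4 means depends on the code page.
raw = b'\xe4'

latin1 = raw.decode('latin-1')       # Western European code page
cyrillic = raw.decode('iso8859-5')   # Cyrillic code page
greek = raw.decode('iso8859-7')      # Greek code page

for name, ch in [('latin-1', latin1), ('iso8859-5', cyrillic), ('iso8859-7', greek)]:
    # Each decoded character has its own, unambiguous Unicode ordinal.
    print('%-10s U+%04X' % (name, ord(ch)))
```

With Unicode, the three characters can appear side by side in one
string, because their ordinals no longer collide.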

Creating Unicode strings in Python is just as simple as creating
normal strings:

\begin{verbatim}
>>> u'Hello World !'
u'Hello World !'
\end{verbatim}

The small \character{u} in front of the quote indicates that a
Unicode string is to be created. If you want to include special
characters in the string, you can do so by using the Python
\emph{Unicode-Escape} encoding. The following example shows how:

\begin{verbatim}
>>> u'Hello\u0020World !'
u'Hello World !'
\end{verbatim}

The escape sequence \code{\e u0020} inserts the Unicode character with
the hexadecimal ordinal 0x0020 (the space character) at the given
position.
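
A small sketch of the same escape, assuming a current Python 3
interpreter, where every string literal is a Unicode string and the
\character{u} prefix is optional. The names \code{escaped} and
\code{plain} are chosen for this example only.

```python
# The escape is replaced by the character whose ordinal is 0x0020.
escaped = u'Hello\u0020World !'
plain = u'Hello World !'

print(escaped == plain)
# Show the ordinal of the character that the escape inserted.
print(hex(ord(escaped[5])))
```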

Other characters are interpreted by using their respective ordinal
values directly as Unicode ordinals. Because the lower 256 Unicode
characters are the same as the 256 characters of the standard Latin-1
encoding used in many western countries, the process of entering
Unicode is greatly simplified.
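
The correspondence with Latin-1 can be checked directly. The sketch
below assumes Python 3 and the standard \code{latin-1} codec; the
variable names are illustrative.

```python
# Decode each of the 256 possible byte values as Latin-1 and compare the
# Unicode ordinal of the resulting character with the byte value itself.
mismatches = [i for i in range(256)
              if ord(bytes([i]).decode('latin-1')) != i]
print(mismatches)

# The same holds in the other direction, here for the character U+00E4.
encoded = u'\u00e4'.encode('latin-1')
print(encoded)
```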

For experts, there is also a raw mode just like for normal
strings. You have to put a small \character{r} after the
\character{u} prefix (giving \code{ur}) to have Python use the
\emph{Raw-Unicode-Escape} encoding. It will only apply the above
\code{\e uXXXX} conversion if there is an odd number of backslashes in
front of the small \character{u}.

\begin{verbatim}
>>> ur'Hello\u0020World !'
u'Hello World !'
>>> ur'Hello\\u0020World !'
u'Hello\\\\u0020World !'
\end{verbatim}
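
Python 3 no longer accepts the \code{ur} prefix, but the
\emph{Raw-Unicode-Escape} rule can still be tried out through the codec
of that name. The following sketch assumes a Python 3 interpreter and
its standard \code{raw_unicode_escape} codec; the variable names are
illustrative.

```python
# A raw bytes literal keeps every backslash, so the codec sees them all.
odd = rb'Hello\u0020World !'     # one backslash in front of the u
even = rb'Hello\\u0020World !'   # two backslashes in front of the u

odd_text = odd.decode('raw_unicode_escape')
even_text = even.decode('raw_unicode_escape')

print(repr(odd_text))
print(repr(even_text))
```

Only the input with an odd number of backslashes has its escape
converted; the other one keeps both backslashes and the literal text
that follows them.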

The raw mode is most useful when you have to enter lots of
backslashes, for example in regular expressions.
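
To illustrate the point about regular expressions, the sketch below
writes the same pattern once without and once with raw mode. It assumes
Python 3 and the standard \module{re} module; the sample text and the
pattern are made up for this example.

```python
import re

text = u'C:\\temp\\notes.txt'   # contains two literal backslashes

# Without raw mode, a pattern matching one backslash needs four of them.
cooked = re.split('\\\\', text)
# With raw mode, two suffice; the regular expression engine sees the same pattern.
raw = re.split(r'\\', text)

print(cooked)
print(raw)
```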

Apart from these standard encodings, Python provides a whole set of
other ways of creating Unicode strings on the basis of a known
encoding.

The builtin \function{unicode()}\bifuncindex{unicode} provides access
to all registered Unicode codecs (COders and DECoders). Some of the
better-known encodings which these codecs can convert are
\emph{Latin-1}, \emph{ASCII}, \emph{UTF-8} and \emph{UTF-16}. The
latter two are variable-length encodings which store each Unicode
character in one or more units of 8 or 16 bits, respectively. Python
uses UTF-8 as its default encoding. This becomes noticeable when
printing Unicode strings or writing them to files.

\begin{verbatim}
>>> u"äöü"
u'\344\366\374'
>>> str(u"äöü")
'\303\244\303\266\303\274'
\end{verbatim}
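
A hedged sketch of what variable length means in practice, written for
Python 3 (an assumption), where \method{encode()} returns a
\code{bytes} object. The sample characters are arbitrary, and
\code{utf-16-le} is used instead of \code{utf-16} so that no byte order
mark is added to the counts.

```python
# Sample characters: letter a, a-umlaut, euro sign, musical G clef.
samples = [u'a', u'\u00e4', u'\u20ac', u'\U0001d11e']

sizes = {}
for ch in samples:
    utf8 = ch.encode('utf-8')
    utf16 = ch.encode('utf-16-le')
    # Record how many bytes each encoding needs for this one character.
    sizes[ch] = (len(utf8), len(utf16))
    print('U+%06X  utf-8: %d bytes  utf-16: %d bytes'
          % (ord(ch), len(utf8), len(utf16)))
```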

If you have data in a specific encoding and want to produce a
corresponding Unicode string from it, you can use the
\function{unicode()} builtin with the encoding name as the second
argument.

\begin{verbatim}
>>> unicode('\303\244\303\266\303\274','UTF-8')
u'\344\366\374'
\end{verbatim}
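
In Python 3 the \function{unicode()} builtin no longer exists; there,
\function{str()} accepts a \code{bytes} object and an encoding name in
the same argument order. A minimal sketch under that assumption, using
the same UTF-8 bytes:

```python
# The UTF-8 bytes for the three umlauts, written in octal notation.
data = b'\303\244\303\266\303\274'

text = str(data, 'UTF-8')     # same argument order as unicode(data, 'UTF-8')
same = data.decode('UTF-8')   # equivalent method form

print([hex(ord(ch)) for ch in text])
```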

To convert a Unicode string back into an ordinary string in a given
encoding, Unicode objects provide an \method{encode()} method.

\begin{verbatim}
>>> u"äöü".encode('UTF-8')
'\303\244\303\266\303\274'
\end{verbatim}
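
Encoding and decoding with the same codec should give back the original
text. The sketch below checks this for a few codecs and also shows what
happens when the target encoding cannot represent the characters. It
assumes Python 3; the list of codecs is only an example.

```python
text = u'\u00e4\u00f6\u00fc'

for name in ('utf-8', 'utf-16', 'latin-1'):
    encoded = text.encode(name)
    # Decoding with the same codec restores the original text.
    assert encoded.decode(name) == text

# An encoding that lacks these characters raises an error instead of guessing.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```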

\subsection{Lists \label{lists}}

Python knows a number of \emph{compound} data types, used to group
