
Commit 2548c73

Implement IDNA (Internationalized Domain Names in Applications).
1 parent 8d17a90 commit 2548c73

12 files changed

Lines changed: 1671 additions & 9 deletions

File tree

Doc/lib/lib.tex

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -112,6 +112,7 @@ \chapter*{Front Matter\label{front}}
112112
\input{libtextwrap}
113113
\input{libcodecs}
114114
\input{libunicodedata}
115+
\input{libstringprep}
115116

116117
\input{libmisc} % Miscellaneous Services
117118
\input{libpydoc}

Doc/lib/libcodecs.tex

Lines changed: 71 additions & 1 deletion
@@ -5,7 +5,7 @@ \section{\module{codecs} ---
55
\modulesynopsis{Encode and decode data and streams.}
66
\moduleauthor{Marc-Andre Lemburg}{[email protected]}
77
\sectionauthor{Marc-Andre Lemburg}{[email protected]}
8-
8+
\sectionauthor{Martin v. L\"owis}{[email protected]}
99

1010
\index{Unicode}
1111
\index{Codecs}
@@ -809,6 +809,11 @@ \subsection{Standard Encodings}
809809
{byte string}
810810
{Convert operand to hexadecimal representation, with two digits per byte}
811811

812+
\lineiv{idna}
813+
{}
814+
{Unicode string}
815+
{Implements \rfc{3490}. \versionadded{2.3}. See also \module{encodings.idna}}
816+
812817
\lineiv{mbcs}
813818
{dbcs}
814819
{Unicode string}
@@ -819,6 +824,11 @@ \subsection{Standard Encodings}
819824
{Unicode string}
820825
{Encoding of PalmOS 3.5}
821826

827+
\lineiv{punycode}
828+
{}
829+
{Unicode string}
830+
{Implements \rfc{3492}. \versionadded{2.3}}
831+
822832
\lineiv{quopri_codec}
823833
{quopri, quoted-printable, quotedprintable}
824834
{byte string}
@@ -865,3 +875,63 @@ \subsection{Standard Encodings}
865875
{Compress the operand using gzip}
866876

867877
\end{tableiv}
878+
879+
\subsection{\module{encodings.idna} ---
880+
Internationalized Domain Names in Applications}
881+
882+
\declaremodule{standard}{encodings.idna}
883+
\modulesynopsis{Internationalized Domain Names implementation}
884+
\moduleauthor{Martin v. L\"owis}
885+
886+
This module implements \rfc{3490} (Internationalized Domain Names in
887+
Applications) and \rfc{3491} (Nameprep: A Stringprep Profile for
888+
Internationalized Domain Names (IDN)). It builds upon the
889+
\code{punycode} encoding and \module{stringprep}. \versionadded{2.3}
890+
891+
These RFCs together define a protocol to support non-ASCII characters
892+
in domain names. A domain name containing non-ASCII characters (such
893+
as ``www.Alliancefran\c{c}aise.nu'') is converted into an
894+
ASCII-compatible encoding (ACE, such as
895+
``www.xn--alliancefranaise-npb.nu''). The ACE form of the domain name
896+
is then used in all places where arbitrary characters are not allowed
897+
by the protocol, such as DNS queries, HTTP \code{Host:} fields, and so
898+
on. This conversion is carried out in the application and, if possible,
898+
invisibly to the user: the application should transparently convert
899+
Unicode domain labels to ACE on the wire, and convert ACE labels back
900+
to Unicode before presenting them to the user.
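
The conversion described above can be sketched by hand with the \code{punycode} codec. This is only an illustration under stated assumptions: the helper \code{to_ace_label} is hypothetical, the input is assumed to be lower-cased and normalized already, and the nameprep step and label length checks required by \rfc{3490} are skipped. The example uses Python 3 syntax, where encoding yields a bytes object.

```python
ACE_PREFIX = b"xn--"  # prefix that marks an ACE label


def to_ace_label(label):
    """Hypothetical helper: convert one already-normalized label to ACE."""
    try:
        # Pure-ASCII labels are passed through unchanged.
        return label.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII labels are Punycode-encoded and given the ACE prefix.
        return ACE_PREFIX + label.encode("punycode")


name = "www.alliancefran\xe7aise.nu"
# The conversion is applied label by label, not to the whole name.
ace = b".".join(to_ace_label(part) for part in name.split("."))
print(ace)  # b'www.xn--alliancefranaise-npb.nu'
```

Real applications should prefer the \code{idna} codec, which also performs nameprep.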
902+
903+
Python supports this conversion in several ways: The \code{idna} codec
904+
converts between Unicode and the ACE. Furthermore, the
905+
\module{socket} module transparently converts Unicode host names to
906+
ACE, so that applications need not be concerned about converting host
907+
names themselves when they pass them to the socket module. On top of
908+
that, modules that have host names as function parameters, such as
909+
\module{httplib} and \module{ftplib}, accept Unicode host names
910+
(\module{httplib} then also transparently sends an IDNA hostname in
911+
the \code{Host:} field if it sends that field at all).
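
As a minimal sketch of the codec route, the example below encodes a host name with the \code{idna} codec. It is written in Python 3 syntax (plain string literals, bytes result), which is an assumption relative to the Python 2.3 release this text documents.

```python
# Mixed-case, non-ASCII host name; \xe7 is LATIN SMALL LETTER C WITH CEDILLA.
host = "www.Alliancefran\xe7aise.nu"

# The codec applies nameprep (which case-folds) and Punycode per label.
ace = host.encode("idna")
print(ace)  # b'www.xn--alliancefranaise-npb.nu'

# A name that is already ASCII passes through unchanged.
plain = "www.python.org".encode("idna")
print(plain)  # b'www.python.org'
```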
912+
913+
When receiving host names from the wire (such as in reverse name
914+
lookup), no automatic conversion to Unicode is performed: Applications
915+
wishing to present such host names to the user should decode them to
916+
Unicode.
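
Decoding for display might look like the sketch below. The byte string \code{received} stands in for a host name obtained from the network; it is a made-up value for illustration, not the result of a real lookup. Python 3 syntax is assumed.

```python
# Stand-in for a host name received from the wire (illustrative value).
received = b"www.xn--alliancefranaise-npb.nu"

# Decode explicitly before showing the name to the user.
display_name = received.decode("idna")
print(display_name)  # www.alliancefrançaise.nu
```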
917+
918+
The module \module{encodings.idna} also implements the nameprep
919+
procedure, which performs certain normalizations on host names, to
920+
achieve case-insensitivity of international domain names, and to unify
921+
similar characters. The nameprep functions can be used directly if
922+
desired.
923+
924+
\begin{funcdesc}{nameprep}{label}
925+
Return the nameprepped version of \var{label}. The implementation
926+
currently assumes query strings, so \code{AllowUnassigned} is
927+
true.
928+
\end{funcdesc}
929+
930+
\begin{funcdesc}{ToASCII}{label}
931+
Convert a label to ASCII, as specified in \rfc{3490}.
932+
\code{UseSTD3ASCIIRules} is assumed to be false.
933+
\end{funcdesc}
934+
935+
\begin{funcdesc}{ToUnicode}{label}
936+
Convert a label to Unicode, as specified in \rfc{3490}.
937+
\end{funcdesc}
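
The three functions operate on single labels rather than full domain names. The following sketch, in Python 3 syntax (an assumption relative to the release documented here), runs one label through each of them.

```python
from encodings import idna

label = "Alliancefran\xe7aise"

# nameprep case-folds and normalizes the label but keeps it as Unicode.
prepped = idna.nameprep(label)
print(prepped)  # alliancefrançaise

# ToASCII applies nameprep, then Punycode, then the ACE prefix.
ace_label = idna.ToASCII(label)
print(ace_label)  # b'xn--alliancefranaise-npb'

# ToUnicode reverses ToASCII; the original capitalization is not restored.
unicode_label = idna.ToUnicode(ace_label)
print(unicode_label)  # alliancefrançaise
```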

Doc/lib/libstringprep.tex

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
1+
\section{\module{stringprep} ---
2+
Internet String Preparation}
3+
4+
\declaremodule{standard}{stringprep}
5+
\modulesynopsis{String preparation, as per RFC 3454}
6+
\moduleauthor{Martin v. L\"owis}{[email protected]}
7+
\sectionauthor{Martin v. L\"owis}{[email protected]}
8+
9+
When identifying things (such as host names) on the Internet, it is
10+
often necessary to compare such identifications for
11+
``equality''. Exactly how this comparison is executed may depend on
12+
the application domain, e.g. whether it should be case-insensitive or
13+
not. It may also be necessary to restrict the possible
14+
identifications, to allow only identifications consisting of
15+
``printable'' characters.
16+
17+
\rfc{3454} defines a procedure for ``preparing'' Unicode strings in
18+
internet protocols. Before passing strings onto the wire, they are
19+
processed with the preparation procedure, after which they have a
20+
certain normalized form. The RFC defines a set of tables, which can be
21+
combined into profiles. Each profile must define which tables it uses,
22+
and what other optional parts of the \code{stringprep} procedure are
23+
part of the profile. One example of a \code{stringprep} profile is
24+
\code{nameprep}, which is used for internationalized domain names.
25+
26+
The module \module{stringprep} only exposes the tables from RFC
27+
3454. As these tables would be very large to represent as
28+
dictionaries or lists, the module uses the Unicode character database
29+
internally. The module source code itself was generated using the
30+
\code{mkstringprep.py} utility.
31+
32+
As a result, these tables are exposed as functions, not as data
33+
structures. There are two kinds of tables in the RFC: sets and
34+
mappings. For a set, \module{stringprep} provides the ``characteristic
35+
function'', i.e. a function that returns true if the parameter is part
36+
of the set. For mappings, it provides the mapping function: given the
37+
key, it returns the associated value. Below is a list of all functions
38+
available in the module.
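
The difference between the two kinds of functions can be seen in a short sketch. The sample characters are chosen for illustration, and Python 3 syntax is assumed.

```python
import stringprep

# Set tables: the characteristic function answers a membership question.
space_is_c11 = stringprep.in_table_c11(" ")       # True: U+0020 is in C.1.1
letter_is_c11 = stringprep.in_table_c11("a")      # False
nbsp_is_c12 = stringprep.in_table_c12("\xa0")     # True: NO-BREAK SPACE
soft_hyphen_b1 = stringprep.in_table_b1("\xad")   # True: mapped to nothing

# Mapping tables: the function returns the value associated with the key.
folded_a = stringprep.map_table_b2("A")           # 'a'
folded_sharp_s = stringprep.map_table_b2("\xdf")  # 'ss'

print(space_is_c11, letter_is_c11, nbsp_is_c12, soft_hyphen_b1)
print(folded_a, folded_sharp_s)
```

Note that a mapping may return a string of more than one character, as the last call shows.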
39+
40+
\begin{funcdesc}{in_table_a1}{code}
41+
Determine whether \var{code} is in table~A.1 (Unassigned code points
42+
in Unicode 3.2).
43+
\end{funcdesc}
44+
45+
\begin{funcdesc}{in_table_b1}{code}
46+
Determine whether \var{code} is in table~B.1 (Commonly mapped to
47+
nothing).
48+
\end{funcdesc}
49+
50+
\begin{funcdesc}{map_table_b2}{code}
51+
Return the mapped value for \var{code} according to table~B.2
52+
(Mapping for case-folding used with NFKC).
53+
\end{funcdesc}
54+
55+
\begin{funcdesc}{map_table_b3}{code}
56+
Return the mapped value for \var{code} according to table~B.3
57+
(Mapping for case-folding used with no normalization).
58+
\end{funcdesc}
59+
60+
\begin{funcdesc}{in_table_c11}{code}
61+
Determine whether \var{code} is in table~C.1.1
62+
(ASCII space characters).
63+
\end{funcdesc}
64+
65+
\begin{funcdesc}{in_table_c12}{code}
66+
Determine whether \var{code} is in table~C.1.2
67+
(Non-ASCII space characters).
68+
\end{funcdesc}
69+
70+
\begin{funcdesc}{in_table_c11_c12}{code}
71+
Determine whether \var{code} is in table~C.1
72+
(Space characters, union of C.1.1 and C.1.2).
73+
\end{funcdesc}
74+
75+
\begin{funcdesc}{in_table_c21}{code}
76+
Determine whether \var{code} is in table~C.2.1
77+
(ASCII control characters).
78+
\end{funcdesc}
79+
80+
\begin{funcdesc}{in_table_c22}{code}
81+
Determine whether \var{code} is in table~C.2.2
82+
(Non-ASCII control characters).
83+
\end{funcdesc}
84+
85+
\begin{funcdesc}{in_table_c21_c22}{code}
86+
Determine whether \var{code} is in table~C.2
87+
(Control characters, union of C.2.1 and C.2.2).
88+
\end{funcdesc}
89+
90+
\begin{funcdesc}{in_table_c3}{code}
91+
Determine whether \var{code} is in table~C.3
92+
(Private use).
93+
\end{funcdesc}
94+
95+
\begin{funcdesc}{in_table_c4}{code}
96+
Determine whether \var{code} is in table~C.4
97+
(Non-character code points).
98+
\end{funcdesc}
99+
100+
\begin{funcdesc}{in_table_c5}{code}
101+
Determine whether \var{code} is in table~C.5
102+
(Surrogate codes).
103+
\end{funcdesc}
104+
105+
\begin{funcdesc}{in_table_c6}{code}
106+
Determine whether \var{code} is in table~C.6
107+
(Inappropriate for plain text).
108+
\end{funcdesc}
109+
110+
\begin{funcdesc}{in_table_c7}{code}
111+
Determine whether \var{code} is in table~C.7
112+
(Inappropriate for canonical representation).
113+
\end{funcdesc}
114+
115+
\begin{funcdesc}{in_table_c8}{code}
116+
Determine whether \var{code} is in table~C.8
117+
(Change display properties or are deprecated).
118+
\end{funcdesc}
119+
120+
\begin{funcdesc}{in_table_c9}{code}
121+
Determine whether \var{code} is in table~C.9
122+
(Tagging characters).
123+
\end{funcdesc}
124+
125+
\begin{funcdesc}{in_table_d1}{code}
126+
Determine whether \var{code} is in table~D.1
127+
(Characters with bidirectional property ``R'' or ``AL'').
128+
\end{funcdesc}
129+
130+
\begin{funcdesc}{in_table_d2}{code}
131+
Determine whether \var{code} is in table~D.2
132+
(Characters with bidirectional property ``L'').
133+
\end{funcdesc}
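
To show how a profile might combine these functions, here is a deliberately simplified sketch. The function \code{prepare} and the tuple \code{PROHIBITED} are hypothetical names, the choice of prohibited tables is only illustrative, and the bidirectional checks based on tables D.1 and D.2 are omitted; it is not a complete implementation of any profile from \rfc{3454}. Python 3 syntax is assumed.

```python
import stringprep
import unicodedata

# Illustrative selection of "prohibited output" tables.
PROHIBITED = (
    stringprep.in_table_c12,
    stringprep.in_table_c22,
    stringprep.in_table_c3,
    stringprep.in_table_c4,
    stringprep.in_table_c5,
    stringprep.in_table_c6,
    stringprep.in_table_c7,
    stringprep.in_table_c8,
    stringprep.in_table_c9,
)


def prepare(text):
    """Hypothetical simplified profile: map, normalize, then prohibit."""
    # Map: drop characters from B.1 and case-fold the rest with B.2.
    mapped = "".join(
        stringprep.map_table_b2(ch)
        for ch in text
        if not stringprep.in_table_b1(ch)
    )
    # Normalize: NFKC, using the Unicode 3.2 data the tables are based on.
    normalized = unicodedata.ucd_3_2_0.normalize("NFKC", mapped)
    # Prohibit: reject any character found in one of the selected tables.
    for ch in normalized:
        if any(in_table(ch) for in_table in PROHIBITED):
            raise ValueError("prohibited character: %r" % ch)
    return normalized


print(prepare("Stra\xdfe"))        # strasse
print(prepare("soft\xadhyphen"))   # softhyphen
```

A private-use character such as U+E000 is in table C.3, so \code{prepare} raises \code{ValueError} for it.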
134+

Doc/whatsnew/whatsnew23.tex

Lines changed: 21 additions & 0 deletions
@@ -1791,6 +1791,27 @@ \section{New, Improved, and Deprecated Modules}
17911791

17921792
Any breakage caused by this change should be reported as a bug.
17931793

1794+
\item Support for internationalized domain names (RFCs 3454, 3490,
1795+
3491, and 3492) has been added. The ``idna'' encoding can be used
1796+
to convert between a Unicode domain name and the ASCII-compatible
1797+
encoding (ACE).
1798+
1799+
\begin{verbatim}
1800+
>>> u"www.Alliancefran\xe7aise.nu".encode("idna")
1801+
'www.xn--alliancefranaise-npb.nu'
1802+
\end{verbatim}
1803+
1804+
In addition, the \module{socket} module has been extended to transparently
1805+
convert Unicode hostnames to the ACE before passing them to the C
1806+
library. In turn, modules that pass hostnames ``through'' (such as
1807+
\module{httplib} and \module{ftplib}) also support Unicode host names
1808+
(\module{httplib} also sends ACE \code{Host:} headers). \module{urllib} supports
1809+
Unicode URLs with non-ASCII host names as long as the \code{path} part
1810+
of the URL is ASCII only.
1811+
1812+
To implement this change, the module \module{stringprep}, the tool
1813+
\code{mkstringprep} and the \code{punycode} encoding have been added.
1814+
17941815
\end{itemize}
17951816

17961817
