@@ -5,7 +5,7 @@ \section{\module{unicodedata} ---
55\modulesynopsis {Access the Unicode Database.}
66\moduleauthor {Marc-Andre Lemburg}{
[email protected] }
77\sectionauthor {Marc-Andre Lemburg}{
[email protected] }
8-
8+ \sectionauthor {Martin v. L\"owis}{[email protected] } 99
1010\index {Unicode}
1111\index {character}
@@ -14,10 +14,10 @@ \section{\module{unicodedata} ---
1414This module provides access to the Unicode Character Database which
1515defines character properties for all Unicode characters. The data in
1616this database is based on the \file {UnicodeData.txt} file version
17- 3.0.0 which is publically available from \url {ftp://ftp.unicode.org/}.
17+ 3.2.0 which is publicly available from \url {ftp://ftp.unicode.org/}.
1818
1919The module uses the same names and symbols as defined by the
20- UnicodeData File Format 3.0.0 (see
20+ UnicodeData File Format 3.2.0 (see
2121\url {http://www.unicode.org/Public/UNIDATA/UnicodeData.html}). It
2222defines the following functions:
2323
@@ -83,3 +83,37 @@ \section{\module{unicodedata} ---
8383 character \var {unichr} as string. An empty string is returned in case
8484 no such mapping is defined.
8585\end {funcdesc }
86+
87+ \begin {funcdesc }{normalize}{form, unistr}
88+
89+ Return the normal form \var {form} for the Unicode string \var {unistr}.
90+ Valid values for \var {form} are 'NFC', 'NFKC', 'NFD', and 'NFKD'.
91+
92+ The Unicode standard defines various normalization forms of a Unicode
93+ string, based on the definition of canonical equivalence and
94+ compatibility equivalence. In Unicode, several characters can be
95+ expressed in various ways. For example, the character U+00C7 (LATIN
96+ CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
97+ U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
98+
99+ For each character, there are two normal forms: normal form C and
100+ normal form D. Normal form D (NFD) is also known as canonical
101+ decomposition, and translates each character into its decomposed form.
102+ Normal form C (NFC) first applies a canonical decomposition, then
103+ composes pre-combined characters again.
104+
105+ In addition to these two forms, there are two additional normal forms
106+ based on compatibility equivalence. In Unicode, certain characters are
107+ supported which normally would be unified with other characters. For
108+ example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
109+ (LATIN CAPITAL LETTER I). However, it is supported in Unicode for
110+ compatibility with existing character sets (e.g. gb2312).
111+
112+ The normal form KD (NFKD) will apply the compatibility decomposition,
113+ i.e. replace all compatibility characters with their equivalents. The
114+ normal form KC (NFKC) first applies the compatibility decomposition,
115+ followed by the canonical composition.
116+
117+ \versionadded {2.3}
118+ \end {funcdesc }
119+
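The four normal forms described in the added \code{normalize} entry can be illustrated with a short sketch. It is not part of the patch. It assumes a Python 3 interpreter, where \code{str} is already a Unicode string; under the Python 2.3 release this patch targets, the literals would need a \code{u} prefix. The variable names are illustrative only.

```python
import unicodedata

# Canonical equivalence: one precomposed character versus a base letter
# followed by a combining mark.
composed = "\u00c7"          # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "\u0043\u0327"  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# NFD splits the precomposed character into base letter plus combining mark.
nfd = unicodedata.normalize("NFD", composed)

# NFC decomposes first, then recomposes, giving the single precomposed character.
nfc = unicodedata.normalize("NFC", decomposed)

# Compatibility equivalence: ROMAN NUMERAL ONE maps to LATIN CAPITAL LETTER I
# only under the K forms; the canonical forms leave it untouched.
roman_one = "\u2160"
nfkd = unicodedata.normalize("NFKD", roman_one)
nfkc = unicodedata.normalize("NFKC", roman_one)
nfc_roman = unicodedata.normalize("NFC", roman_one)

print([hex(ord(ch)) for ch in nfd])
print([hex(ord(ch)) for ch in nfc])
print(nfkd, nfkc, nfc_roman == roman_one)
```

Normalizing both sides to the same form before comparing strings is the usual reason to call this function: \code{composed} and \code{decomposed} compare unequal as raw strings but equal after either NFC or NFD.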