Unicode HOWTO
Release 3.11.4
Guido van Rossum and the Python development team
24, 2023
Python Software Foundation
Email: [email protected]
Contents
1 Introduction to Unicode
1.1 Definitions
1.2 Encodings
1.3 References
2 Python's Unicode Support
2.1 The String Type
2.2 Converting to Bytes
2.3 Unicode Literals in Python Source Code
2.4 Unicode Properties
2.5 Comparing Strings
2.6 Unicode Regular Expressions
2.7 References
3 Reading and Writing Unicode Data
3.1 Unicode filenames
3.2 Tips for Writing Unicode-aware Programs
3.3 References
4 Acknowledgements
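Release 1.12
This HOWTO discusses Python's support for the Unicode specification for representing textual data, and explains various problems that people commonly encounter when trying to work with Unicode.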
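1 Introduction to Unicode
1.1 Definitions
Today's programs need to handle a wide variety of characters. Applications are often internationalized to display messages and output in a variety of user-selectable languages, and web content can be written in any of them. Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters.
Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code. The Unicode specifications are continually revised and updated to add new languages and symbols.
A character is the smallest possible component of a text. 'A', 'B', 'C', etc., are all different characters. So are 'È' and 'Í'. Characters vary depending on the language or context you're talking about. For example, there's a character for "Roman Numeral One", 'Ⅰ', that's separate from the uppercase letter 'I'. They usually look the same, but they are two different characters with different meanings.
The Unicode standard describes how characters are represented by code points. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values; the number actually assigned is smaller). In the standard and in this document, a code point is written using the notation U+265E to mean the character with value 0x265e (9,822 in decimal).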
The Unicode standard contains a lot of tables listing characters and their corresponding code points:
0061    'a'; LATIN SMALL LETTER A
0062    'b'; LATIN SMALL LETTER B
0063    'c'; LATIN SMALL LETTER C
...
007B    '{'; LEFT CURLY BRACKET
...
2167    'Ⅷ'; ROMAN NUMERAL EIGHT
2168    'Ⅸ'; ROMAN NUMERAL NINE
...
265E    '♞'; BLACK CHESS KNIGHT
265F    '♟'; BLACK CHESS PAWN
...
1F600   '😀'; GRINNING FACE
1F609   '😉'; WINKING FACE
...
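Strictly, these definitions imply that it's meaningless to say "this is character U+265E". U+265E is a code point, which represents some particular character; in this case, it represents the character 'BLACK CHESS KNIGHT', '♞'. In informal contexts, this distinction between code points and characters will sometimes be forgotten.
A character is represented on a screen or on paper by a set of graphical elements that's called a glyph. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details depend on the font being used. Most Python code doesn't need to worry about glyphs; figuring out the correct glyph to display is generally the job of a GUI toolkit or a terminal's font renderer.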
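The relationship between the U+265E notation, the hexadecimal value 0x265e and the decimal value 9822 can be checked directly in Python. This is a minimal sketch; the variable names are illustrative and not part of the original text.

```python
import unicodedata

# U+265E, 0x265e and 9822 are three spellings of the same code point.
code_point = 0x265E
knight = chr(code_point)             # the character for that code point

print(code_point)                    # the decimal value
print(f"U+{code_point:04X}")         # the U+XXXX notation used in this document
name = unicodedata.name(knight)      # the character's name in the Unicode tables
print(name)
```

Whether the glyph itself displays correctly depends on the terminal and font, which is outside Python's control.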
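1.2 Encodings
To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence of code points needs to be represented in memory as a set of code units, and code units are then mapped to 8-bit bytes. The rules for translating a Unicode string into a sequence of bytes are called a character encoding, or just an encoding.
The first encoding you might think of is using 32-bit integers as the code unit, and then using the CPU's representation of 32-bit integers. In this representation, the string "Python" might look like this: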
   P           y           t           h           o           n
0x 50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
   0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23
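This representation is straightforward but using it presents a number of problems.
1. It's not portable; different processors order the bytes differently.
2. It's very wasteful of space. In most texts, the majority of the code points are less than 127, or less than 255, so a lot of space is occupied by 0x00 bytes. The above string takes 24 bytes compared to the 6 bytes needed for an ASCII representation. Increased RAM usage doesn't matter too much (desktop computers have gigabytes of RAM, and strings aren't usually that large), but expanding our usage of disk and network bandwidth by a factor of 4 is intolerable.
3. It's not compatible with existing C functions such as strlen(), so a new family of wide string functions would need to be used.
Therefore this encoding isn't used very much, and people instead choose other encodings that are more efficient and convenient, such as UTF-8.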
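The portability and space problems of the 32-bit representation can be observed with Python's own UTF-32 codecs. The sketch below assumes the codec names 'utf-32-le' and 'utf-32-be' (explicit little-endian and big-endian variants); the variable names are illustrative.

```python
text = "Python"

# The same six code points, laid out for the two possible byte orders.
little = text.encode("utf-32-le")
big = text.encode("utf-32-be")
ascii_bytes = text.encode("ascii")

print(little.hex(" "))                   # matches the byte table shown above
print(len(little), len(ascii_bytes))     # four bytes per character versus one
print(little == big)                     # the byte sequences differ by byte order
```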
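UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for "Unicode Transformation Format", and the '8' means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.) UTF-8 uses the following rules:
1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.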
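UTF-8 has several convenient properties:
1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes that contains embedded zero bytes only where they represent the null character (U+0000). This means that UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can't handle zero bytes for anything other than end-of-string markers.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8.
6. UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes. This avoids the byte-ordering issues that can occur with integer and word oriented encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending on the hardware on which the string was encoded.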
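The UTF-8 rules and properties listed above can be checked on a few sample characters. This is a small sketch; the sample characters and the names samples and lengths are chosen here for illustration only.

```python
# One sample for each possible UTF-8 length: 'a', 'é', '€', and an emoji.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    if ord(ch) >= 128:
        # Every byte of a multi-byte sequence lies between 128 and 255.
        assert all(byte >= 128 for byte in encoded)
    print(f"U+{ord(ch):04X} -> {encoded!r} ({len(encoded)} bytes)")

# ASCII text is unchanged when encoded as UTF-8.
ascii_roundtrip = "abc".encode("utf-8")
```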
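1.3 References
The Unicode Consortium site has character charts, a glossary, and PDF versions of the Unicode specification. Be prepared for some difficult reading. A chronology of the origin and development of Unicode (https://www.unicode.org/history/) is also available on the site.
On the Computerphile Youtube channel, Tom Scott briefly discusses the history of Unicode and UTF-8 (https://www.youtube.com/watch?v=MijmeoH9LT4) (9 minutes 36 seconds).
To help understand the standard, Jukka Korpela has written an introductory guide to reading the Unicode character tables.
Another good introductory article was written by Joel Spolsky (https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-set-no-excuses/). If this introduction didn't make things clear to you, you should try reading this alternate article before continuing.
Wikipedia entries are often helpful; see the entries for "character encoding" and UTF-8, for example.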
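2 Python's Unicode Support
Now that you've learned the rudiments of Unicode, we can look at Python's Unicode features.
2.1 The String Type
Since Python 3.0, the language's str type contains Unicode characters, meaning any string created using "unicode rocks!", 'unicode rocks!', or the triple-quoted string syntax is stored as Unicode.
The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal: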
try:
    with open('/tmp/input.txt', 'r') as f:
        ...
except OSError:
    # 'File not found' error message.
    print("Fichier non trouvé")
Side note: Python 3 also supports using Unicode characters in identifiers:
répertoire = "/tmp/records.log"
with open(répertoire, "w") as f:
    f.write("test\n")
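If you can't enter a particular character in your editor or want to keep the source code ASCII-only for some reason, you can also use escape sequences in string literals. (Depending on your system, you may see the actual capital-delta glyph instead of a \u escape.)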
>>> "\N{GREEK CAPITAL LETTER DELTA}" # Using the character name
'\u0394'
>>> "\u0394" # Using a 16-bit hex value
'\u0394'
>>> "\U00000394" # Using a 32-bit hex value
'\u0394'
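In addition, one can create a string using the decode() method of bytes. This method takes an encoding argument, such as UTF-8, and optionally an errors argument.
The errors argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), 'ignore' (just leave the character out of the Unicode result), or 'backslashreplace' (inserts a \xNN escape sequence). The following examples show the differences: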
>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'
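Encodings are specified as strings containing the encoding's name. Python comes with roughly 100 different encodings; see the Python Library Reference at standard-encodings for a list. Some encodings have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same encoding.
One-character Unicode strings can also be created with the chr() built-in function, which takes integers and returns a Unicode string of length 1 that contains the corresponding code point. The reverse operation is the built-in ord() function that takes a one-character Unicode string and returns the code point value: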
>>> chr(57344)
'\ue000'
>>> ord('\ue000')
57344
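2.2 Converting to Bytes
The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding.
The errors parameter is the same as the parameter of the decode() method but supports a few more possible handlers. As well as 'strict', 'ignore', and 'replace' (which in this case inserts a question mark instead of the unencodable character), there is also 'xmlcharrefreplace' (inserts an XML character reference), backslashreplace (inserts a \uNNNN escape sequence) and namereplace (inserts a \N{...} escape sequence).
The following example shows the different results: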
>>> u = chr(40960) + 'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'
>>> u.encode('ascii', 'backslashreplace')
b'\\ua000abcd\\u07b4'
>>> u.encode('ascii', 'namereplace')
b'\\N{YI SYLLABLE IT}abcd\\u07b4'
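The low-level routines for registering and accessing the available encodings are found in the codecs module. Implementing new encodings also requires understanding the codecs module. However, the encoding and decoding functions returned by this module are usually more low-level than is comfortable, and writing new encodings is a specialized task, so the module won't be covered in this HOWTO.
2.3 Unicode Literals in Python Source Code
In Python source code, specific Unicode code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects eight hex digits, not four: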
>>> s = "a\xac\u1234\u20ac\U00008000"
... #     ^^^^ two-digit hex escape
... #         ^^^^^^ four-digit Unicode escape
... #                     ^^^^^^^^^^ eight-digit Unicode escape
>>> [ord(c) for c in s]
[97, 172, 4660, 8364, 32768]
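Using escape sequences for code points greater than 127 is fine in small doses, but becomes an annoyance if you're using many accented characters, as you would in a program with messages in French or some other accent-using language. You can also assemble strings using the chr() built-in function, but this is even more tedious.
Ideally, you'd want to be able to write literals in your language's natural encoding. You could then edit Python source code with your favorite editor, which would display the accented characters naturally, and have the right characters used at runtime.
Python supports writing source code in UTF-8 by default, but you can use almost any encoding if you declare the encoding being used. This is done by including a special comment as either the first or second line of the source file: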
#!/usr/bin/env python
# -*- coding: latin-1 -*-
u = 'abcdé'
print(ord(u[-1]))
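The syntax is inspired by Emacs's notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports 'coding'. The -*- symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention. Python looks for coding: name or coding=name in the comment.
If you don't include such a comment, the default encoding used will be UTF-8 as already mentioned. See also PEP 263 for more information.
2.4 Unicode Properties
The Unicode specification includes a database of information about code points. For each defined code point, the information includes the character's name, its category, and the numeric value if applicable (for characters representing numeric concepts such as the Roman numerals or fractions). There are also display-related properties, such as how to use the code point in bidirectional text.
The following program displays some information about several characters, and prints the numeric value of one particular character: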
import unicodedata
u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)
for i, c in enumerate(u):
    print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
    print(unicodedata.name(c))
# Get numeric value of second character
print(unicodedata.numeric(u[1]))
When run, this prints:
0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND
2 0f84 Mn TIBETAN MARK HALANTA
3 1770 Lo TAGBANWA LETTER SA
4 33af So SQUARE RAD OVER S SQUARED
1000.0
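The category codes are abbreviations describing the nature of the character. These are grouped into categories such as "Letter", "Number", "Punctuation", or "Symbol", which in turn are broken up into subcategories. To take the codes from the above output, 'Ll' means "Letter, lowercase", 'No' means "Number, other", 'Mn' is "Mark, nonspacing", and 'So' is "Symbol, other". See the General Category Values section of the Unicode Character Database documentation (https://www.unicode.org/reports/tr44/#General_Category_Values) for a list of category codes.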
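As a further illustration of category codes, the sketch below looks up a handful of everyday characters. The sample characters and the name categories are chosen here for illustration; they do not come from the original program.

```python
import unicodedata

# A few common characters and one combining mark.
sample_chars = ["a", "A", "5", " ", "+", "\u0302"]

# Map each character to its two-letter general category code.
categories = {ch: unicodedata.category(ch) for ch in sample_chars}

for ch, code in categories.items():
    print(f"U+{ord(ch):04X} {code}")
```

The first letter of each code gives the major category (L for letter, N for number, Z for separator, S for symbol, M for mark) and the second letter gives the subcategory.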
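2.5 Comparing Strings
Unicode adds some complication to comparing strings, because the same set of characters can be represented by different sequences of code points. For example, a letter like 'ê' can be represented as a single code point U+00EA, or as U+0065 U+0302, which is the code point for 'e' followed by a code point for 'COMBINING CIRCUMFLEX ACCENT'. These will produce the same output when printed, but one is a string of length 1 and the other is of length 2.
One tool for a case-insensitive comparison is the casefold() string method that converts a string to a case-insensitive form following an algorithm described by the Unicode Standard. This algorithm has special handling for characters such as the German letter 'ß' (code point U+00DF), which becomes the pair of lowercase letters 'ss'.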
>>> street = 'Gürzenichstraße'
>>> street.casefold()
'gürzenichstrasse'
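A second tool is the unicodedata module's normalize() function that converts strings to one of several normal forms, where letters followed by a combining character are replaced with single characters. normalize() can be used to perform string comparisons that won't falsely report inequality if two strings use combining characters differently: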
import unicodedata
def compare_strs(s1, s2):
    def NFD(s):
        return unicodedata.normalize('NFD', s)
    return NFD(s1) == NFD(s2)
single_char = 'ê'
multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
print('length of first string=', len(single_char))
print('length of second string=', len(multiple_chars))
print(compare_strs(single_char, multiple_chars))
When run, this outputs:
$ python3 compare-strs.py
length of first string= 1
length of second string= 2
True
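The first argument to the normalize() function is a string giving the desired normalization form, which can be one of 'NFC', 'NFKC', 'NFD', and 'NFKD'.
The Unicode Standard also specifies how to do caseless comparisons: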
import unicodedata
def compare_caseless(s1, s2):
    def NFD(s):
        return unicodedata.normalize('NFD', s)
    return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())
# Example usage
single_char = 'ê'
multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
print(compare_caseless(single_char, multiple_chars))
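This will print True. (Why is NFD() invoked twice? Because there are a few characters that make casefold() return a non-normalized string, so the result needs to be normalized again. See section 3.13 of the Unicode Standard for a discussion and an example.)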
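To make the difference between the normalization forms more concrete, the sketch below applies several of them to 'ê' and to a ligature character. The ligature U+FB01 and all variable names are illustrative choices, not taken from the original text.

```python
import unicodedata

composed = "\u00ea"        # 'ê' as a single code point
decomposed = "e\u0302"     # 'e' followed by COMBINING CIRCUMFLEX ACCENT
ligature = "\ufb01"        # LATIN SMALL LIGATURE FI

# NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# The "K" forms additionally replace compatibility characters.
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(len(nfc), len(nfd))
print(repr(nfc_ligature), repr(nfkc_ligature))
```

The compatibility forms can change how text looks, so they are usually better suited to searching and matching than to storing text for display.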
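2.6 Unicode Regular Expressions
The regular expressions supported by the re module can be provided either as bytes or strings. Some of the special character sequences such as \d and \w have different meanings depending on whether the pattern is supplied as bytes or a string. For example, \d will match the characters [0-9] in bytes but in strings will match any character that's in the 'Nd' category.
The string in this example has the number 57 written in both Thai and Arabic numerals: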
import re
p = re.compile(r'\d+')
s = "Over \u0e55\u0e57 57 flavours"
m = p.search(s)
print(repr(m.group()))
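When executed, \d+ will match the Thai numerals and print them out. If you supply the re.ASCII flag to compile(), \d+ will match the substring "57" instead.
Similarly, \w matches a wide variety of Unicode characters but only [a-zA-Z0-9_] in bytes or if re.ASCII is supplied, and \s will match either Unicode whitespace characters or [ \t\n\r\f\v].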
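The three behaviours described above (a string pattern, a string pattern with re.ASCII, and a bytes pattern) can be compared side by side. This sketch reuses the example string from this section; the result variable names are illustrative.

```python
import re

s = "Over \u0e55\u0e57 57 flavours"

# String pattern: \d matches any character in the 'Nd' category.
unicode_digits = re.search(r"\d+", s).group()

# String pattern with re.ASCII: \d is restricted to [0-9].
ascii_digits = re.search(r"\d+", s, re.ASCII).group()

# Bytes pattern: \d is always [0-9]; the data must be bytes as well.
bytes_digits = re.search(rb"\d+", s.encode("utf-8")).group()

print(repr(unicode_digits), repr(ascii_digits), repr(bytes_digits))
```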
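2.7 References
Some good alternative discussions of Python's Unicode support are:
• Processing Text Files in Python 3, by Nick Coghlan.
• Pragmatic Unicode, a PyCon 2012 presentation by Ned Batchelder.
The str type is described in the Python library reference at textseq.
The documentation for the unicodedata module.
The documentation for the codecs module.
Marc-André Lemburg gave a presentation titled "Python and Unicode" (PDF slides: https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf) at EuroPython 2002. The slides are an excellent overview of the design of Python 2's Unicode features (where the Unicode string type is called unicode and literals start with u).
3 Reading and Writing Unicode Data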
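Once you've written some code that works with Unicode data, the next problem is input/output. How do you get Unicode strings into your program, and how do you convert Unicode into a form suitable for storage or transmission?
It's possible that you may not need to do anything depending on your input sources and output destinations; you should check whether the libraries used in your application support Unicode natively. XML parsers often return Unicode data, for example. Many relational databases also support Unicode-valued columns and can return Unicode values from an SQL query.
Unicode data is usually converted to a particular encoding before it gets written to disk or sent over a socket. It's possible to do all the work yourself: open a file, read an 8-bit bytes object from it, and convert the bytes with bytes.decode(encoding). However, the manual approach is not recommended.
One problem is the multi-byte nature of encodings; one Unicode character can be represented by several bytes. If you want to read the file in arbitrary-sized chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case where only part of the bytes encoding a single Unicode character are read at the end of a chunk. One solution would be to read the entire file into memory and then perform the decoding, but that prevents you from working with files that are extremely large; if you need to read a 2 GB file, you need 2 GB of RAM. (More, really, since for at least a moment you'd need to have both the encoded string and its Unicode version in memory.)
The solution would be to use the low-level decoding interface to catch the case of partial coding sequences. The work of implementing this has already been done for you: the built-in open() function can return a file-like object that assumes the file's contents are in a specified encoding and accepts Unicode parameters for methods such as read() and write(). This works through open()'s encoding and errors parameters, which are interpreted just like those in str.encode() and bytes.decode().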
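The chunk-boundary problem described above can be reproduced in a few lines. The sketch below uses codecs.getincrementaldecoder() as one example of a low-level decoding interface; the sample data and the two-chunk split are made up for illustration, and in ordinary code open() with an encoding argument does this work for you.

```python
import codecs

# U+4500 takes three bytes in UTF-8, so a split after two bytes cuts it in half.
data = "\u4500 blah".encode("utf-8")
chunks = [data[:2], data[2:]]

# Decoding a chunk on its own fails when it ends in the middle of a character.
try:
    chunks[0].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# An incremental decoder buffers the partial sequence until the rest arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b"", final=True))   # flush and check for leftovers
result = "".join(pieces)

print(naive_ok, repr(result))
```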
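Reading Unicode from a file is therefore simple:
with open('unicode.txt', encoding='utf-8') as f:
    for line in f:
        print(repr(line))
It's also possible to open files in update mode, allowing both reading and writing:
with open('test', encoding='utf-8', mode='w+') as f:
    f.write('\u4500 blah blah blah\n')
    f.seek(0)
    print(repr(f.readline()[:1]))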
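The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often written as the first character of a file in order to assist with autodetection of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be present at the start of a file; when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as 'utf-16-le' and 'utf-16-be' for little-endian and big-endian encodings, that specify one particular byte ordering and don't skip the BOM.
In some areas, it is also convention to use a "BOM" at the start of UTF-8 encoded files; the name is misleading since UTF-8 is not byte-order dependent. The mark simply announces that the file is encoded in UTF-8. For reading such files, use the 'utf-8-sig' codec to automatically skip the mark if present.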
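The BOM handling of these codecs can be seen on in-memory bytes without touching any files. This is a minimal sketch; the sample text "abc" and the variable names are illustrative.

```python
import codecs

text = "abc"

# 'utf-16' writes a BOM in the platform's byte order and drops it on decoding.
utf16 = text.encode("utf-16")
has_bom = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
utf16_roundtrip = utf16.decode("utf-16")

# 'utf-16-le' fixes the byte order, writes no BOM, and does not skip one either.
utf16_le = text.encode("utf-16-le")
kept_bom = (codecs.BOM_UTF16_LE + utf16_le).decode("utf-16-le")

# 'utf-8-sig' writes the UTF-8 signature and skips it when reading.
with_sig = text.encode("utf-8-sig")
plain_decode = with_sig.decode("utf-8")       # the mark survives as U+FEFF
sig_decode = with_sig.decode("utf-8-sig")     # the mark is skipped

print(has_bom, repr(kept_bom), repr(plain_decode), repr(sig_decode))
```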
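3.1 Unicode filenames
Most of the operating systems in common use today support filenames that contain arbitrary Unicode characters. Usually this is implemented by converting the Unicode string into some encoding that varies depending on the system. Today Python is converging on using UTF-8: Python on MacOS has used UTF-8 for several versions, and Python 3.6 switched to using UTF-8 on Windows as well. On Unix systems, there will only be a filesystem encoding if you've set the LANG or LC_CTYPE environment variables; if you haven't, the default encoding is again UTF-8.
The sys.getfilesystemencoding() function returns the encoding to use on your current system, in case you want to do the encoding manually, but there's not much reason to bother. When opening a file for reading or writing, you can usually just provide the Unicode string as the filename, and it will be automatically converted to the right encoding for you: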
filename = 'filename\u4500abc'
with open(filename, 'w') as f:
    f.write('blah\n')
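Functions in the os module such as os.stat() will also accept Unicode filenames.
The os.listdir() function returns filenames, which raises an issue: should it return the Unicode version of filenames, or should it return bytes containing the encoded versions? os.listdir() can do both, depending on whether you provided the directory path as bytes or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem's encoding and a list of Unicode strings will be returned, while passing a byte path will return the filenames as bytes. For example, assuming the default filesystem encoding is UTF-8, running the following program: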
fn = 'filename\u4500abc'
f = open(fn, 'w')
f.close()
import os
print(os.listdir(b'.'))
print(os.listdir('.'))
will produce the following output:
$ python listdir-test.py
[b'filename\xe4\x94\x80abc', ...]
['filename\u4500abc', ...]
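The first list contains UTF-8-encoded filenames, and the second list contains the Unicode versions.
Note that on most occasions, you can just stick with using Unicode with these APIs. The bytes APIs should only be used on systems where undecodable file names can be present; that's pretty much only Unix systems now.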
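3.2 Tips for Writing Unicode-aware Programs
This section provides some suggestions on writing software that deals with Unicode.
The most important tip is: software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.
If you attempt to write processing functions that accept both Unicode and byte strings, you will find your program vulnerable to bugs wherever you combine the two different kinds of strings. There is no automatic encoding or decoding: if you do e.g. str + bytes, a TypeError will be raised.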
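The "decode early, encode late" advice and the TypeError mentioned above can be sketched as follows. The sample bytes and variable names are made up for illustration.

```python
# Bytes as they might arrive from a file or a socket (UTF-8 encoded 'café').
raw = "caf\u00e9".encode("utf-8")

# Mixing str and bytes is never converted automatically.
try:
    "prefix: " + raw
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decode once at the boundary, work with str, encode once on the way out.
text = raw.decode("utf-8")
message = "prefix: " + text
output = message.encode("utf-8")

print(mixed_ok, message, output)
```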
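When using data coming from a web browser or some other untrusted source, a common technique is to check for illegal characters in a string before using the string in a generated command line or storing it in a database. If you're doing this, be careful to check the decoded string, not the encoded bytes data; some encodings may have interesting properties, such as not being bijective or not being fully ASCII-compatible. This is especially true if the input data also specifies the encoding, since the attacker can then choose a clever way to hide malicious text in the encoded bytestream.
Converting Between File Encodings
The StreamRecoder class can transparently convert between encodings, taking a stream that returns data in encoding #1 and behaving like a stream returning data in encoding #2.
For example, if you have an input file f that's in Latin-1, you can wrap it with a StreamRecoder to return bytes encoded in UTF-8: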
new_f = codecs.StreamRecoder(f,
    # en/decoder: used by read() to encode its results and
    # by write() to decode its input.
    codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),
    # reader/writer: used to read and write to the stream.
    codecs.getreader('latin-1'), codecs.getwriter('latin-1') )
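To try the StreamRecoder call above without a real file, an in-memory byte stream can stand in for f. In this sketch io.BytesIO and the sample text are assumptions made for illustration; the StreamRecoder arguments are the same as in the snippet above.

```python
import codecs
import io

# Stand-in for a Latin-1 input file: 'café' followed by a newline.
f = io.BytesIO("caf\u00e9\n".encode("latin-1"))

new_f = codecs.StreamRecoder(
    f,
    # en/decoder: used by read() to encode its results and
    # by write() to decode its input.
    codecs.getencoder("utf-8"), codecs.getdecoder("utf-8"),
    # reader/writer: used to read and write to the stream.
    codecs.getreader("latin-1"), codecs.getwriter("latin-1"),
)

# read() returns the same text, now as UTF-8 encoded bytes.
utf8_bytes = new_f.read()
print(utf8_bytes)
```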
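Files in an Unknown Encoding
What can you do if you need to make a change to a file, but don't know the file's encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler: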
with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()
# make changes to the string 'data'
with open(fname + '.new', 'w',
          encoding="ascii", errors="surrogateescape") as f:
    f.write(data)
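The surrogateescape error handler will decode any non-ASCII bytes as code points in a special range running from U+DC80 to U+DCFF. These code points will then turn back into the same bytes when the surrogateescape error handler is used to encode the data and write it back out.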
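The round trip performed by surrogateescape can be observed on a bytes object directly. This is a small sketch; the sample bytes (one Latin-1 byte, 0xE9, in otherwise ASCII data) and the variable names are illustrative.

```python
# ASCII-compatible data containing one non-ASCII byte, 0xE9.
raw = b"key=caf\xe9\n"

# The non-ASCII byte becomes a lone surrogate in the range U+DC80 to U+DCFF.
data = raw.decode("ascii", errors="surrogateescape")
escaped = data[7]

# Edit only the ASCII part of the string.
data = data.replace("key", "name")

# Encoding with the same handler restores the original byte.
out = data.encode("ascii", errors="surrogateescape")
print(f"U+{ord(escaped):04X}", out)
```

A string containing such surrogates cannot be encoded with the default 'strict' handler, so it should be written back with surrogateescape as well.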
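3.3 References
One section of Mastering Python 3 Input/Output, a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.
The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware Applications in Python" discuss questions of character encodings as well as how to internationalize and localize an application. These slides cover Python 2.x only.
The Guts of Unicode in Python is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode representation in Python 3.3.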
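4 Acknowledgements
The initial draft of this document was written by Andrew Kuchling. It has since been revised further by Alexander Belopolsky, Georg Brandl, Andrew Kuchling, and Ezio Melotti.
Thanks to the following people who have noted errors or offered suggestions on this article: Éric Araujo, Nicholas Bastin, Nick Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka, Eryk Sun, Chad Whitacre, Graham Wideman.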