Howto Unicode

Uploaded by

ryan suen


Unicode HOWTO

Release 3.11.4

Guido van Rossum and the Python development team

24, 2023
Python Software Foundation
Email: docs@python.org

Contents

1 Introduction to Unicode
1.1 Definitions
1.2 Encodings
1.3 References

2 Python's Unicode Support
2.1 The String Type
2.2 Converting to Bytes
2.3 Unicode Literals in Python Source Code
2.4 Unicode Properties
2.5 Comparing Strings
2.6 Unicode Regular Expressions
2.7 References

3 Reading and Writing Unicode Data
3.1 Unicode filenames
3.2 Tips for Writing Unicode-aware Programs
3.3 References

4 Acknowledgements

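Release 1.12

This HOWTO discusses Python's support for the Unicode specification for representing textual data, and explains various problems that people commonly encounter when trying to work with Unicode.

1 Introduction to Unicode

1.1 Definitions

Today's programs need to be able to handle a wide variety of characters. Applications are often internationalized to display messages and output in a variety of user-selectable languages; the same program might need to output an error message in English, French, Japanese, Hebrew, or Russian. Web content can be written in any of these languages and can also include a variety of emoji symbols. Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters.

Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code. The Unicode specifications are continually revised and updated to add new languages and symbols.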

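A character is the smallest possible component of a text. 'A', 'B', 'C', etc., are all different characters. So are 'È' and 'Í'. Characters vary depending on the language or context you're talking about. For example, there's a character for "Roman Numeral One", 'Ⅰ', that's separate from the uppercase letter 'I'. They'll usually look the same, but these are two different characters that have different meanings.

The Unicode standard describes how characters are represented by code points. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values; the number actually assigned is less than that). In the standard and in this document, a code point is written using the notation U+265E to mean the character with value 0x265e (9,822 in decimal).

The Unicode standard contains a lot of tables listing characters and their corresponding code points: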

0061    'a'; LATIN SMALL LETTER A
0062    'b'; LATIN SMALL LETTER B
0063    'c'; LATIN SMALL LETTER C
...
007B    '{'; LEFT CURLY BRACKET
...
2167    'Ⅷ'; ROMAN NUMERAL EIGHT
2168    'Ⅸ'; ROMAN NUMERAL NINE
...
265E    '♞'; BLACK CHESS KNIGHT
265F    '♟'; BLACK CHESS PAWN
...
1F600   '😀'; GRINNING FACE
1F609   '😉'; WINKING FACE
...

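Strictly, these definitions imply that it's meaningless to say "this is character U+265E". U+265E is a code point, which represents some particular character; in this case, it represents the character 'BLACK CHESS KNIGHT', '♞'. In informal contexts, this distinction between code points and characters will sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical elements that's called a glyph. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details will depend on the font being used. Most Python code doesn't need to worry about glyphs; figuring out the correct glyph to display is generally the job of a GUI toolkit or a terminal's font renderer.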
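1.2 Encodings

To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence of code points needs to be represented in memory as a set of code units, and code units are then mapped to 8-bit bytes. The rules for translating a Unicode string into a sequence of bytes are called a character encoding, or just an encoding.

The first encoding you might think of is using 32-bit integers as the code unit, and then using the CPU's representation of 32-bit integers. In this representation, the string "Python" might look like this:

   P           y           t           h           o           n
0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23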

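This representation is straightforward, but using it presents a number of problems.

1. It's not portable; different processors order the bytes differently.
2. It's very wasteful of space. In most texts, the majority of the code points are less than 127, or less than 255, so a lot of space is occupied by 0x00 bytes. The above string takes 24 bytes compared to the 6 bytes needed for an ASCII representation. Increased RAM usage doesn't matter too much (desktop computers have gigabytes of RAM, and strings aren't usually that large), but expanding our usage of disk and network bandwidth by a factor of 4 is intolerable.
3. It's not compatible with existing C functions such as strlen(), so a new family of wide string functions would need to be used.

Therefore this encoding isn't used very much, and people instead choose other encodings that are more efficient and convenient, such as UTF-8.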
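The space and byte-order problems listed above can be checked directly. This is a minimal sketch that assumes the standard "utf-32-le" and "utf-32-be" codecs stand in for "the CPU's representation of 32-bit integers"; the variable names are illustrative.

```python
text = "Python"

utf32 = text.encode("utf-32-le")    # little-endian, no byte-order mark
ascii_bytes = text.encode("ascii")

print(utf32.hex(" "))                 # one group of four bytes per character
print(len(utf32), len(ascii_bytes))   # 24 bytes versus 6 bytes

# Most of the 32-bit representation is zero padding.
zero_bytes = utf32.count(0)
print(zero_bytes)

# Byte order matters: the big-endian layout of the same string differs.
same_layout = text.encode("utf-32-be") == utf32
print(same_layout)
```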
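UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for "Unicode Transformation Format", and the '8' means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.) UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes that contains embedded zero bytes only where they represent the null character (U+0000). This means that UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can't handle zero bytes for anything other than end-of-string markers.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8.
6. UTF-8 is a byte-oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes. This avoids the byte-ordering issues that can occur with integer- and word-oriented encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending on the hardware on which the string was encoded.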
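The two encoding rules and a few of the properties above can be observed with str.encode(). The sample characters below are an arbitrary choice made for this sketch, picked so that each needs a different number of bytes.

```python
# U+0061, U+00E9, U+20AC and U+1F600 need 1, 2, 3 and 4 bytes respectively.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

lengths = [len(ch.encode("utf-8")) for ch in samples]

# Rule 2: every byte of a multi-byte sequence is in the range 128..255.
multibyte_high = all(
    b >= 128 for ch in samples[1:] for b in ch.encode("utf-8")
)

# Property 3: ASCII text is already valid UTF-8.
ascii_ok = "plain ASCII".encode("ascii") == "plain ASCII".encode("utf-8")

# Property 2: no zero bytes unless the text contains U+0000.
has_nul = 0 in "h\u00e9llo \u20ac".encode("utf-8")
print(lengths, multibyte_high, ascii_ok, has_nul)
```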

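1.3 References

The Unicode Consortium site has character charts, a glossary, and PDF versions of the Unicode specification. Be prepared for some difficult reading. A chronology of the origin and development of Unicode (https://www.unicode.org/history/) is also available on the site.

On the Computerphile Youtube channel, Tom Scott briefly discusses the history of Unicode and UTF-8 (https://www.youtube.com/watch?v=MijmeoH9LT4) (9 minutes 36 seconds).

To help understand the standard, Jukka Korpela has written an introductory guide to reading the Unicode character tables.

Another good introductory article was written by Joel Spolsky (https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-set-no-excuses/). If this introduction didn't make things clear to you, you should try reading this alternate article before continuing.

Wikipedia entries are often helpful; see the entries for "character encoding" and UTF-8, for example.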

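2 Python's Unicode Support

Now that you've learned the rudiments of Unicode, we can look at Python's Unicode features.

2.1 The String Type

Since Python 3.0, the language's str type contains Unicode characters, meaning any string created using "unicode rocks!", 'unicode rocks!', or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal: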

try:
    with open('/tmp/input.txt', 'r') as f:
        ...
except OSError:
    # 'File not found' error message.
    print("Fichier non trouvé")

Side note: Python 3 also supports using Unicode characters in identifiers:

répertoire = "/tmp/records.log"
with open(répertoire, "w") as f:
    f.write("test\n")

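If you can't enter a particular character in your editor or want to keep the source code ASCII-only for some reason, you can also use escape sequences in string literals. (Depending on your system, you may see the actual capital-delta glyph instead of a \u escape.)

>>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
'\u0394'
>>> "\u0394"                          # Using a 16-bit hex value
'\u0394'
>>> "\U00000394"                      # Using a 32-bit hex value
'\u0394'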

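In addition, one can create a string using the decode() method of bytes. This method takes an encoding argument, such as UTF-8, and optionally an errors argument.

The errors argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), 'ignore' (just leave the character out of the Unicode result), or 'backslashreplace' (insert a \xNN escape sequence). The following examples show the differences:

>>> b'\x80abc'.decode("utf-8", "strict")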
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'

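Encodings are specified as strings containing the encoding's name. Python comes with roughly 100 different encodings; see the Python Library Reference at standard-encodings for a list. Some encodings have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same encoding.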
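One way to confirm that several names refer to the same encoding is codecs.lookup(), which returns a CodecInfo object whose name attribute is the codec's canonical name. The following is a small sketch; the aliases list simply repeats the three names mentioned above.

```python
import codecs

aliases = ["latin-1", "iso_8859_1", "8859"]

# All three spellings should resolve to one and the same codec.
canonical = {codecs.lookup(name).name for name in aliases}
print(canonical)

# Decoding the same bytes under each alias gives the same string.
decoded = {b"caf\xe9".decode(name) for name in aliases}
print(decoded)
```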

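One-character Unicode strings can also be created with the chr() built-in function, which takes integers and returns a Unicode string of length 1 that contains the corresponding code point. The reverse operation is the built-in ord() function that takes a one-character Unicode string and returns the code point value:
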
>>> chr(57344)
'\ue000'
>>> ord('\ue000')
57344

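2.2 Converting to Bytes

The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding.

The errors parameter is the same as the parameter of the decode() method but supports a few more possible handlers. As well as 'strict', 'ignore', and 'replace' (which in this case inserts a question mark instead of the unencodable character), there is also 'xmlcharrefreplace' (inserts an XML character reference), backslashreplace (inserts a \uNNNN escape sequence) and namereplace (inserts a \N{...} escape sequence).

The following example shows the different results:

>>> u = chr(40960) + 'abcd' + chr(1972)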
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'
>>> u.encode('ascii', 'backslashreplace')
b'\\ua000abcd\\u07b4'
>>> u.encode('ascii', 'namereplace')
b'\\N{YI SYLLABLE IT}abcd\\u07b4'

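The low-level routines for registering and accessing the available encodings are found in the codecs module. Implementing new encodings also requires understanding the codecs module. However, the encoding and decoding functions returned by this module are usually more low-level than is comfortable, and writing new encodings is a specialized task, so the module won't be covered in this HOWTO.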

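2.3 Unicode Literals in Python Source Code

In Python source code, specific Unicode code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects eight hex digits, not four: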

>>> s = "a\xac\u1234\u20ac\U00008000"
... #     ^^^^ two-digit hex escape
... #         ^^^^^^ four-digit Unicode escape
... #                     ^^^^^^^^^^ eight-digit Unicode escape
>>> [ord(c) for c in s]
[97, 172, 4660, 8364, 32768]

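Using escape sequences for code points greater than 127 is fine in small doses, but becomes an annoyance if you're using many accented characters, as you would in a program with messages in French or some other accent-using language. You can also assemble strings using the chr() built-in function, but this is even more tedious.

Ideally, you'd want to be able to write literals in your language's natural encoding. You could then edit Python source code with your favorite editor, which would display the accented characters naturally, and have the right characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost any encoding if you declare the encoding being used. This is done by including a special comment as either the first or second line of the source file: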

#!/usr/bin/env python
# -*- coding: latin-1 -*-

u = 'abcdé'
print(ord(u[-1]))

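The syntax is inspired by Emacs's notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports 'coding'. The -*- symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention. Python looks for coding: name or coding=name in the comment.

If you don't include such a comment, the default encoding used will be UTF-8 as already mentioned. See also PEP 263 for more information.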
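To see which encoding declaration Python's tokenizer would pick up from a source file, the standard tokenize.detect_encoding() function can be applied to the raw bytes. This is only a sketch: declared_encoding() is a helper name invented here, and the two byte strings stand in for real source files. Note that detect_encoding() reports a normalized codec name rather than the exact spelling used in the comment.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the source encoding the tokenizer detects for these bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

# A file with a coding comment on its second line.
with_cookie = (
    b"#!/usr/bin/env python\n"
    b"# -*- coding: latin-1 -*-\n"
    b"u = 'abcd\xe9'\n"
)
# A file without any coding comment falls back to the default.
without_cookie = b"print('hello')\n"

print(declared_encoding(with_cookie))
print(declared_encoding(without_cookie))
```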

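2.4 Unicode Properties

The Unicode specification includes a database of information about code points. For each defined code point, the information includes the character's name, its category, and the numeric value if applicable (for characters representing numeric concepts such as the Roman numerals, fractions such as one-third and four-fifths, etc.). There are also display-related properties, such as how to use the code point in bidirectional text.

The following program displays some information about several characters, and prints the numeric value of one particular character: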

import unicodedata

u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

for i, c in enumerate(u):
    print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
    print(unicodedata.name(c))

# Get numeric value of second character
print(unicodedata.numeric(u[1]))

When run, this prints:

0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND
2 0f84 Mn TIBETAN MARK HALANTA
3 1770 Lo TAGBANWA LETTER SA
4 33af So SQUARE RAD OVER S SQUARED
1000.0

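The category codes are abbreviations describing the nature of the character. These are grouped into categories such as "Letter", "Number", "Punctuation", or "Symbol", which in turn are broken up into subcategories. To take the codes from the above output, 'Ll' means "Letter, lowercase", 'No' means "Number, other", 'Mn' is "Mark, nonspacing", and 'So' is "Symbol, other". See the General Category Values section of the Unicode Character Database documentation (https://www.unicode.org/reports/tr44/#General_Category_Values) for a list of category codes.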
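Because the first letter of a category code names the major category, characters can be grouped by it. The sketch below does this for a handful of characters chosen arbitrarily for illustration; by_major is a name introduced here.

```python
import unicodedata
from collections import defaultdict

# Group a few sample characters by the first letter of their category
# code: L = Letter, N = Number, P = Punctuation, S = Symbol, Z = Separator.
by_major = defaultdict(list)
for ch in "A\u00e95\u2167, \u20ac":
    category = unicodedata.category(ch)
    by_major[category[0]].append((ch, category))

for major, members in sorted(by_major.items()):
    print(major, members)
```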

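2.5 Comparing Strings

Unicode adds some complication to comparing strings, because the same set of characters can be represented by different sequences of code points. For example, a letter like 'ê' can be represented as a single code point U+00EA, or as U+0065 U+0302, which is the code point for 'e' followed by a code point for 'COMBINING CIRCUMFLEX ACCENT'. These will produce the same output when printed, but one is a string of length 1 and the other is of length 2.

One tool for a case-insensitive comparison is the casefold() string method that converts a string to a case-insensitive form following an algorithm described by the Unicode Standard. This algorithm has special handling for characters such as the German letter 'ß' (code point U+00DF), which becomes the pair of lowercase letters 'ss'.
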
>>> street = 'Gürzenichstraße'
>>> street.casefold()
'gürzenichstrasse'

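A second tool is the unicodedata module's normalize() function that converts strings to one of several normal forms, where letters followed by a combining character are replaced with single characters. normalize() can be used to perform string comparisons that won't falsely report inequality if two strings use combining characters differently: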

import unicodedata

def compare_strs(s1, s2):
    def NFD(s):
        return unicodedata.normalize('NFD', s)

    return NFD(s1) == NFD(s2)

single_char = 'ê'
multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
print('length of first string=', len(single_char))
print('length of second string=', len(multiple_chars))
print(compare_strs(single_char, multiple_chars))

When run, this outputs:

$ python3 compare-strs.py
length of first string= 1
length of second string= 2
True

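The first argument to the normalize() function is a string giving the desired normalization form, which can be one of 'NFC', 'NFKC', 'NFD', and 'NFKD'.

The Unicode Standard also specifies how to do caseless comparisons: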

import unicodedata

def compare_caseless(s1, s2):
    def NFD(s):
        return unicodedata.normalize('NFD', s)

    return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())

# Example usage
single_char = 'ê'
multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'

print(compare_caseless(single_char, multiple_chars))

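This will print True. (Why is NFD() invoked twice? Because there are a few characters that make casefold() return a non-normalized string, so the result needs to be normalized again. See section 3.13 of the Unicode Standard for a discussion and an example.)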
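The four normalization forms accepted by normalize() differ in two ways: whether they compose or decompose, and whether they also fold compatibility characters. The sketch below shows the difference on two inputs picked for illustration: the letter ê and the ligature U+FB01.

```python
import unicodedata

composed = "\u00ea"       # 'ê' as a single code point
decomposed = "e\u0302"    # 'e' followed by COMBINING CIRCUMFLEX ACCENT
ligature = "\ufb01"       # LATIN SMALL LIGATURE FI

# NFC composes, NFD decomposes; both preserve the ligature.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)

# The "K" (compatibility) forms additionally replace the ligature
# with the two plain letters it stands for.
forms = {
    form: unicodedata.normalize(form, ligature)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print({form: ascii(value) for form, value in forms.items()})
```

The compatibility forms lose information (the ligature cannot be recovered), so they suit searching and matching better than storage.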
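2.6 Unicode Regular Expressions

The regular expressions supported by the re module can be provided either as bytes or strings. Some of the special character sequences such as \d and \w have different meanings depending on whether the pattern is supplied as bytes or a string. For example, \d will match the characters [0-9] in bytes but in strings will match any character that's in the 'Nd' category.

The string in this example has the number 57 written in both Thai and Arabic numerals: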

import re
p = re.compile(r'\d+')

s = "Over \u0e55\u0e57 57 flavours"

m = p.search(s)
print(repr(m.group()))

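When executed, \d+ will match the Thai numerals and print them out. If you supply the re.ASCII flag to compile(), \d+ will match the substring "57" instead.

Similarly, \w matches a wide variety of Unicode characters but only [a-zA-Z0-9_] in bytes or if re.ASCII is supplied, and \s will match either Unicode whitespace characters or [ \t\n\r\f\v].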
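The difference between string patterns, the re.ASCII flag, and bytes patterns can be compared side by side. The sample text in this sketch is made up for illustration; it mixes accented Latin letters, Thai digits, and ASCII digits.

```python
import re

text = "na\u00efve caf\u00e9 \u0e55\u0e57 57"

# str pattern: \w covers Unicode letters and digits.
unicode_words = re.findall(r"\w+", text)

# str pattern with re.ASCII: \w is limited to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

# bytes pattern: \d only ever matches the ASCII digits [0-9].
byte_digits = re.findall(rb"\d+", text.encode("utf-8"))

print(unicode_words)
print(ascii_words)
print(byte_digits)
```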

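2.7 References

Some good alternative discussions of Python's Unicode support are:
• Processing Text Files in Python 3, by Nick Coghlan.
• Pragmatic Unicode, a PyCon 2012 presentation by Ned Batchelder.

The str type is described in the Python library reference at textseq.

The documentation for the unicodedata module.

The documentation for the codecs module.

Marc-André Lemburg gave a presentation titled "Python and Unicode" (PDF slides: https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf) at EuroPython 2002. The slides are an excellent overview of the design of Python 2's Unicode features (where the Unicode string type is called unicode and literals start with u).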

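3 Reading and Writing Unicode Data

Once you've written some code that works with Unicode data, the next problem is input/output. How do you get Unicode strings into your program, and how do you convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input sources and output destinations; you should check whether the libraries used in your application support Unicode natively. XML parsers often return Unicode data, for example. Many relational databases also support Unicode-valued columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets written to disk or sent over a socket. It's possible to do all the work yourself: open a file, read an 8-bit bytes object from it, and convert the bytes with bytes.decode(encoding). However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be represented by several bytes. If you want to read the file in arbitrary-sized chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case where only part of the bytes encoding a single Unicode character are read at the end of a chunk. One solution would be to read the entire file into memory and then perform the decoding, but that prevents you from working with files that are extremely large; if you need to read a 2 GB file, you need 2 GB of RAM. (More, really, since for at least a moment you'd need to have both the encoded string and its Unicode version in memory.)

The solution would be to use the low-level decoding interface to catch the case of partial coding sequences. The work of implementing this has already been done for you: the built-in open() function can return a file-like object that assumes the file's contents are in a specified encoding and accepts Unicode parameters for methods such as read() and write(). This works through open()'s encoding and errors parameters, which are interpreted just like those in str.encode() and bytes.decode().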
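For readers curious what the low-level interface looks like, here is a minimal sketch of chunked decoding with an incremental decoder from the codecs module. decode_in_chunks() is a helper invented for this example, and the tiny chunk size is chosen only so that chunk boundaries fall inside multi-byte characters; in practice open() with an encoding argument does this work for you.

```python
import codecs
import io

def decode_in_chunks(stream, encoding="utf-8", chunk_size=4096):
    """Yield decoded text from a binary stream, one chunk at a time."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # final=True raises UnicodeDecodeError if a character
            # was left incomplete at the end of the stream.
            tail = decoder.decode(b"", final=True)
            if tail:
                yield tail
            break
        # The decoder buffers a trailing partial character and
        # emits it once the remaining bytes arrive.
        text = decoder.decode(chunk)
        if text:
            yield text

raw = "a\u20acb\u20acc".encode("utf-8")   # each euro sign takes 3 bytes
pieces = list(decode_in_chunks(io.BytesIO(raw), chunk_size=3))
print([ascii(p) for p in pieces])
```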

Reading Unicode from a file is therefore simple:

with open('unicode.txt', encoding='utf-8') as f:
    for line in f:
        print(repr(line))

It's also possible to open files in update mode, allowing both reading and writing:

with open('test', encoding='utf-8', mode='w+') as f:
    f.write('\u4500 blah blah blah\n')
    f.seek(0)
    print(repr(f.readline()[:1]))

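The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often written as the first character of a file in order to assist with autodetection of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be present at the start of a file; when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as 'utf-16-le' and 'utf-16-be' for little-endian and big-endian encodings, that specify one particular byte ordering and don't skip the BOM.

In some areas, it is also convention to use a "BOM" at the start of UTF-8 encoded files; the name is misleading since UTF-8 is not byte-order dependent. The mark simply announces that the file is encoded in UTF-8. For reading such files, use the 'utf-8-sig' codec to automatically skip the mark if present.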
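The BOM behaviour of these codecs can be observed on in-memory bytes without touching a file. This is a small sketch using only standard codec names and the BOM constants from the codecs module; the variable names are illustrative.

```python
import codecs

# 'utf-8-sig' writes the mark when encoding and skips it when decoding.
data_with_sig = "abc".encode("utf-8-sig")
kept = data_with_sig.decode("utf-8")         # mark stays as U+FEFF
skipped = data_with_sig.decode("utf-8-sig")  # mark is dropped
print(data_with_sig, ascii(kept), ascii(skipped))

# Plain 'utf-16' prepends a BOM in the platform's byte order...
utf16 = "abc".encode("utf-16")
starts_with_bom = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# ...while the explicit-endian variants write no BOM at all.
utf16_le = "abc".encode("utf-16-le")
print(starts_with_bom, utf16_le)
```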

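3.1 Unicode filenames

Most of the operating systems in common use today support filenames that contain arbitrary Unicode characters. Usually this is implemented by converting the Unicode string into some encoding that varies depending on the system. Today Python is converging on using UTF-8: Python on MacOS has used UTF-8 for several versions, and Python 3.6 switched to using UTF-8 on Windows as well. On Unix systems, there will only be a filesystem encoding if you've set the LANG or LC_CTYPE environment variables; if you haven't, the default encoding is again UTF-8.

The sys.getfilesystemencoding() function returns the encoding to use on your current system, in case you want to do the encoding manually, but there's not much reason to bother. When opening a file for reading or writing, you can usually just provide the Unicode string as the filename, and it will be automatically converted to the right encoding for you: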

filename = 'filename\u4500abc'
with open(filename, 'w') as f:
    f.write('blah\n')

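Functions in the os module such as os.stat() will also accept Unicode filenames.

The os.listdir() function returns filenames, which raises an issue: should it return the Unicode version of filenames, or should it return bytes containing the encoded versions? os.listdir() can do both, depending on whether you provided the directory path as bytes or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem's encoding and a list of Unicode strings will be returned, while passing a byte path will return the filenames as bytes. For example, assuming the default filesystem encoding is UTF-8, running the following program: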

fn = 'filename\u4500abc'
f = open(fn, 'w')
f.close()

import os
print(os.listdir(b'.'))
print(os.listdir('.'))

will produce the following output:

$ python listdir-test.py
[b'filename\xe4\x94\x80abc', ...]
['filename\u4500abc', ...]

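The first list contains UTF-8-encoded filenames, and the second list contains the Unicode versions.

Note that on most occasions, you can just stick with using Unicode with these APIs. The bytes APIs should only be used on systems where undecodable file names can be present; that's pretty much only Unix systems now.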
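The conversion between the Unicode and bytes forms of a filename can also be done explicitly with os.fsencode() and os.fsdecode(), which apply the filesystem encoding and its error handler. The following sketch assumes a filesystem encoding, such as UTF-8, that can represent the sample name; no file is created.

```python
import os
import sys

name = "filename\u4500abc"

print(sys.getfilesystemencoding())

# str -> bytes using the filesystem encoding and error handler.
encoded = os.fsencode(name)
print(encoded)

# bytes -> str; this reverses fsencode() exactly.
round_trip = os.fsdecode(encoded)
print(round_trip == name)
```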

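3.2 Tips for Writing Unicode-aware Programs

This section provides some suggestions on writing software that deals with Unicode.

The most important tip is:

    Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte strings, you will find your program vulnerable to bugs wherever you combine the two different kinds of strings. There is no automatic encoding or decoding: if you do e.g. str + bytes, a TypeError will be raised.

When using data coming from a web browser or some other untrusted source, a common technique is to check for illegal characters in a string before using the string in a generated command line or storing it in a database. If you're doing this, be careful to check the decoded string, not the encoded bytes data; some encodings may have interesting properties, such as not being bijective or not being fully ASCII-compatible. This is especially true if the input data also specifies the encoding, since the attacker can then choose a clever way to hide malicious text in the encoded bytestream.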
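The "decode early, encode late" advice can be sketched as a function that touches bytes only at its boundaries. The shout() function, its parameters, and the choice of Latin-1 input and UTF-8 output are all assumptions made for this illustration.

```python
def shout(raw: bytes, in_encoding: str = "latin-1",
          out_encoding: str = "utf-8") -> bytes:
    """Upper-case some encoded text, working on str internally."""
    text = raw.decode(in_encoding)      # decode as soon as input arrives
    text = text.upper() + "!"           # all processing happens on str
    return text.encode(out_encoding)    # encode only at the output boundary

result = shout(b"caf\xe9")
print(result)

# Mixing the two kinds of strings is an error, never a silent conversion.
try:
    "abc" + b"def"
except TypeError as exc:
    mixing_error = exc
    print("TypeError:", exc)
```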

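Converting Between File Encodings

The StreamRecoder class can transparently convert between encodings, taking a stream that returns data in encoding #1 and behaving like a stream returning data in encoding #2.

For example, if you have an input file f that's in Latin-1, you can wrap it with a StreamRecoder to return bytes encoded in UTF-8:

import codecs

# f is assumed to be an already-open binary file containing Latin-1 data.
new_f = codecs.StreamRecoder(f,
    # en/decoder: used by read() to encode its results and
    # by write() to decode its input.
    codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

    # reader/writer: used to read and write to the stream.
    codecs.getreader('latin-1'), codecs.getwriter('latin-1') )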

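Files in an Unknown Encoding

What can you do if you need to make a change to a file, but don't know the file's encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler:

with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()

# make changes to the string 'data'

with open(fname + '.new', 'w',
          encoding="ascii", errors="surrogateescape") as f:
    f.write(data)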

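The surrogateescape error handler will decode any non-ASCII bytes as code points in a special range running from U+DC80 to U+DCFF. These code points will then turn back into the same bytes when the surrogateescape error handler is used to encode the data and write it back out.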
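The round trip performed by surrogateescape can be seen on a bytes object directly, without any files. The sample bytes and the edit made to them are invented for this sketch.

```python
# Two bytes that are not valid ASCII follow the ASCII text.
raw = b"key=value\xff\xfe\n"

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("ascii", errors="surrogateescape")
print(ascii(text))

# The ASCII part can be edited like any other string...
edited = text.replace("value", "other")

# ...and encoding with the same handler restores the original bytes.
back = edited.encode("ascii", errors="surrogateescape")
print(back)
```

Strings holding these surrogates cannot be encoded with the default strict handler, so they should be written back out with surrogateescape rather than printed or stored as ordinary text.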
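3.3 References

One section of Mastering Python 3 Input/Output, a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.

The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware Applications in Python" discuss questions of character encodings as well as how to internationalize and localize an application. These slides cover Python 2.x only.

The Guts of Unicode in Python is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode representation in Python 3.3.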

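4 Acknowledgements

The initial draft of this document was written by Andrew Kuchling. It has since been revised further by Alexander Belopolsky, Georg Brandl, Andrew Kuchling, and Ezio Melotti.

Thanks to the following people who have noted errors or offered suggestions on this article: Éric Araujo, Nicholas Bastin, Nick Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka, Eryk Sun, Chad Whitacre, Graham Wideman.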
