Topic 20: Huffman Coding
The author should gaze at Noah, and ...
learn, as they did in the Ark, to crowd a
great deal of matter into a very small
compass.
Sydney Smith, Edinburgh Review
Agenda
Encoding
Compression
Huffman Coding
Encoding
UT CS
85 84 32 67 83
01010101 01010100 00100000 01000011
01010011
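The mapping on this slide can be reproduced in a few lines. This is a minimal Python sketch (Python is chosen here only for illustration; the variable names are not from the course materials) showing that the text "UT CS" is stored as the numbers and bit patterns listed above.

```python
text = "UT CS"

# Each character maps to a number (its ASCII / Unicode code point) ...
codes = [ord(ch) for ch in text]

# ... and each number is stored as a pattern of 8 bits.
bits = [format(code, "08b") for code in codes]

print(codes)  # [85, 84, 32, 67, 83]
print(bits)   # ['01010101', '01010100', '00100000', '01000011', '01010011']
```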
what is a file? how do some OS use file extensions?
open a bitmap in a text editor
open a pdf in word
ASCII - UNICODE
Text File
Text File???
Bitmap File
Bitmap File????
JPEG File
JPEG VS BITMAP
JPEG File
Encoding Schemes
"It's all 1s and 0s"
What do the 1s and 0s mean?
50 121 109
ASCII -> 2ym
Red Blue Green ->
dark teal?
Agenda
Encoding
Compression
Huffman Coding
Compression
Compression: Storing the same information
but in a form that takes less memory
lossless and lossy compression
Recall:
Lossy Artifacts
Why Bother?
Is compression really necessary?
2 Terabytes
500 HD, 2 hour movies or 500,000 songs
Little Pipes and Big Pumps
Home Internet Access
40 Mbps roughly $40 per month.
12 months * 3 years * $40 = $1,440
10,000,000 bits / second = 1.25 * 10^6 bytes / sec
CPU Capability
$1,500 for a laptop or desktop
Intel i7 processor
Assume it lasts 3 years.
Memory bandwidth 25.6 GB / sec = 2.6 * 10^10 bytes / sec
on the order of 5.0 * 10^10 instructions / second
Mobile Devices?
Cellular Network
Your mileage may vary
Megabits per second
AT&T: 17 download, 7 upload
T-Mobile & Verizon: 12 download, 7 upload
17,000,000 bits per second = 2.125 x 10^6 bytes per second
http://tinyurl.com/q6o7wan
iPhone CPU
Apple A6 System on a Chip
Coy about IPS
2 cores
Rough estimates: 1 x 10^10 instructions per second
Little Pipes and Big Pumps
CPU
Data In
From Network
Compression - Why Bother?
Apostolos "Toli" Lerios
Facebook Engineer
Heads image storage group
jpeg images already
compressed
look for ways to compress even
more
1% less space = millions of
dollars in savings
Agenda
Encoding
Compression
Huffman Coding
Purpose of Huffman Coding
Proposed by Dr. David A. Huffman
Paper: "A Method for the Construction of Minimum-Redundancy Codes"
Written in 1952
Applicable to many forms of data transmission
Our example: text files
still used in fax machines, mp3 encoding, others
The Basic Algorithm
Huffman coding is a form of statistical coding
Not all characters occur with the same
frequency!
Yet in ASCII all characters are allocated the
same amount of space
1 char = 1 byte, be it e or x
The Basic Algorithm
Any savings in tailoring codes to
frequency of character?
Code word lengths are no longer fixed like
ASCII or Unicode
Code word lengths vary and will be
shorter for the more frequently used
characters
The Basic Algorithm
1. Scan text to be compressed and tally
occurrence of all characters.
2. Sort or prioritize characters based on
number of occurrences in text.
3. Build Huffman code tree based on
prioritized list.
4. Perform a traversal of tree to determine
all code words.
5. Scan text again and create new file
using the Huffman codes
Building a Tree
Scan the original text
Consider the following short text
Eerie eyes seen near lake.
Count up the occurrences of all characters in
the text
Building a Tree
Scan the original text
Eerie eyes seen near lake.
What characters are present?
E e r i space
y s n a l k .
Building a Tree
Scan the original text
Eerie eyes seen near lake.
What is the frequency of each character in the
text?
Char    Freq.    Char    Freq.    Char    Freq.
E       1        y       1        k       1
e       8        s       2        .       1
r       2        n       2
i       1        a       2
space   4        l       1
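The tally can be checked mechanically. This is a small sketch, assuming Python and its standard collections.Counter; it is one convenient way to count and not necessarily how the course assignment does it.

```python
from collections import Counter

text = "Eerie eyes seen near lake."

# Counter maps each distinct character to its number of occurrences.
freq = Counter(text)

for ch, n in sorted(freq.items()):
    label = "space" if ch == " " else ch
    print(label, n)

# The frequencies must add up to the length of the text.
print(sum(freq.values()) == len(text))
```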
Building a Tree
Prioritize characters
Create binary tree nodes with character
and frequency of each character
Place nodes in a priority queue
The lower the occurrence, the higher the
priority in the queue
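"Lower occurrence means higher priority" is exactly the behavior of a min-heap. This sketch uses Python's heapq on a few (frequency, character) pairs taken from the example. Storing nodes as tuples is an assumption made for brevity; the slides use binary tree nodes.

```python
import heapq

# (frequency, character) pairs for a few characters of the example text.
# Tuples compare by their first element first, so the heap orders by frequency.
queue = [(8, "e"), (1, "k"), (4, " "), (2, "a")]
heapq.heapify(queue)

order = []
while queue:
    order.append(heapq.heappop(queue))  # always removes the lowest frequency

print(order)  # [(1, 'k'), (2, 'a'), (4, ' '), (8, 'e')]
```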
Building a Tree
The queue after inserting all nodes
[Figure: the leaf nodes in the priority queue, lowest frequency at the front. Null pointers are not shown.]
Building a Tree
While priority queue contains two or more
nodes
Create new node
Dequeue node and make it left subtree
Dequeue next node and make it right subtree
Frequency of new node equals sum of frequency of
left and right children
Enqueue new node back into queue
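The loop above can be sketched as follows. This is an illustrative Python version, not the assignment's code: nodes are plain tuples, and ties between equal frequencies are broken by insertion order, which is an assumption. A different tie-breaking rule gives a different but equally good tree.

```python
import heapq
from collections import Counter
from itertools import count

def build_tree(text):
    """Return the root of a Huffman tree as nested tuples.

    A leaf is (freq, char); an internal node is (freq, left, right).
    """
    tie = count()  # assumption: equal frequencies leave the queue in insertion order
    queue = [(f, next(tie), (f, ch)) for ch, f in Counter(text).items()]
    heapq.heapify(queue)
    while len(queue) >= 2:
        f1, _, left = heapq.heappop(queue)    # first dequeue becomes the left subtree
        f2, _, right = heapq.heappop(queue)   # second dequeue becomes the right subtree
        parent = (f1 + f2, left, right)       # frequency is the sum of the children
        heapq.heappush(queue, (f1 + f2, next(tie), parent))
    return queue[0][2]

root = build_tree("Eerie eyes seen near lake.")
print(root[0])  # 26, the number of characters in the text
```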
[Figures: the queue after each step. The two lowest-frequency nodes are dequeued, joined under a new node, and the new node is enqueued.]
What is happening to the characters
with a low number of occurrences?
[Figures: merging continues until the subtrees of frequency 10 and 16 are joined under a new node of frequency 26.]
Building a Tree
After enqueueing this node there is only one node left in the priority queue.
Building a Tree
Dequeue the single node left in the queue.
This tree contains the new code words for each character.
Frequency of root node should equal number of characters in text.
Eerie eyes seen near lake. 4 spaces, 26 characters total
Final tree (left child listed first):
26
  10
    4
      2
        E 1
        i 1
      2
        k 1
        l 1
    6
      2
        y 1
        . 1
      sp 4
  16
    e 8
    8
      4
        a 2
        n 2
      4
        r 2
        s 2
Encoding the File
Traverse Tree for Codes
Perform a traversal of the tree
to obtain new code words
going left, append a 0 to code word
going right, append a 1 to code word
code word is only completed
when a leaf node is reached
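The traversal can be sketched recursively. In this Python sketch the tree from the slides is written by hand as nested pairs (an illustrative representation, not the assignment's node class), and the function name assign_codes is hypothetical.

```python
# The tree from the slides as nested pairs: (left, right) is an internal
# node and a plain string is a leaf. Frequencies are omitted because the
# traversal does not need them.
TREE = (
    ((("E", "i"), ("k", "l")), (("y", "."), " ")),
    ("e", (("a", "n"), ("r", "s"))),
)

def assign_codes(node, prefix="", table=None):
    """Walk the tree, appending 0 when going left and 1 when going right."""
    if table is None:
        table = {}
    if isinstance(node, str):        # leaf: the code word is complete
        table[node] = prefix
    else:
        left, right = node
        assign_codes(left, prefix + "0", table)
        assign_codes(right, prefix + "1", table)
    return table

codes = assign_codes(TREE)
print(codes["e"], codes[" "], codes["E"])  # 10 011 0000
```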
Encoding the File
Traverse Tree for Codes
Char    Code
E       0000
i       0001
k       0010
l       0011
y       0100
.       0101
space   011
e       10
a       1100
n       1101
r       1110
s       1111
Encoding the File
Rescan text and encode file
using new code words
Eerie eyes seen near lake.
000010111000011001110
010010111101111111010
110101111011011001110
011001111000010100101
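Encoding is a table lookup per character. This is a minimal Python sketch with the code table from the slides typed in as a dictionary; a real compressor would write packed bits to a file rather than build a string of "0" and "1" characters.

```python
# Code table copied from the slides.
CODES = {
    "E": "0000", "i": "0001", "k": "0010", "l": "0011", "y": "0100",
    ".": "0101", " ": "011", "e": "10", "a": "1100", "n": "1101",
    "r": "1110", "s": "1111",
}

def encode(text, codes):
    """Concatenate the code word of every character in text."""
    return "".join(codes[ch] for ch in text)

bits = encode("Eerie eyes seen near lake.", CODES)
print(bits)
```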
Encoding the File
Results
Have we made things any
better?
84 bits to encode the text
ASCII would take 8 * 26 =
208 bits
000010111000011001110
010010111101111111010
110101111011011001110
011001111000010100101
If a modified fixed-length code were used, 4 bits per
character would be needed (12 distinct characters). Total bits
4 * 26 = 104. Savings not as great.
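The three totals can be recomputed directly from the frequencies and the code word lengths. This is a short Python sketch with the lengths copied from the slides' code table; the variable names are illustrative.

```python
from collections import Counter

text = "Eerie eyes seen near lake."
# Code word lengths from the table in the slides.
code_len = {"E": 4, "i": 4, "k": 4, "l": 4, "y": 4, ".": 4,
            " ": 3, "e": 2, "a": 4, "n": 4, "r": 4, "s": 4}

freq = Counter(text)
huffman_bits = sum(freq[ch] * code_len[ch] for ch in freq)
ascii_bits = 8 * len(text)

# Shortest fixed-length code that can tell the 12 distinct characters apart.
fixed_width = (len(freq) - 1).bit_length()
fixed_bits = fixed_width * len(text)

print(huffman_bits, ascii_bits, fixed_bits)  # 84 208 104
```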
Decoding the File
How does receiver know what the codes are?
Option 1: tree constructed for each text file.
Considers frequency for each file
The tree must be sent along with the file: big hit on compression, especially for smaller files
Option 2: tree predetermined
based on statistical analysis of text files or file types
Decoding the File
Once receiver has tree it
scans incoming bit stream
0 go left
1 go right
101000100111100011111111011100001010
A. elk nay sir
B. eek a snake
C. eek kin sly
D. eek snarl nil
E. eel a snarl
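The "0 go left, 1 go right" rule can be sketched as a walk that restarts at the root after every leaf. This Python sketch writes the tree from the slides as nested pairs (an illustrative representation) and can be run on the bit stream above to check an answer.

```python
# The tree from the slides: (left, right) is an internal node, a string is a leaf.
TREE = (
    ((("E", "i"), ("k", "l")), (("y", "."), " ")),
    ("e", (("a", "n"), ("r", "s"))),
)

def decode(bits, root):
    """Follow one branch per bit; emit a character at each leaf."""
    out, node = [], root
    for bit in bits:
        node = node[0] if bit == "0" else node[1]   # 0 go left, 1 go right
        if isinstance(node, str):                    # leaf reached
            out.append(node)
            node = root                              # start over at the root
    return "".join(out)

stream = "1010001001111000111111" + "11011100001010"
print(decode(stream, TREE))
```

No separators are needed between code words because no code word is a prefix of another.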
Assignment Hints
reading chunks not chars
header format
the pseudo eof character
the GUI
Assignment Example
"Eerie eyes seen near lake." will result in different
codes than those shown in slides due to:
adding elements in order to PriorityQueue
required pseudo eof character (PEOF)
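Both effects can be seen in a small experiment. This Python sketch rests on two labeled assumptions: values are enqueued in ascending order, and nodes of equal frequency leave the queue in the order they were added. The PEOF is represented by the value 256, as in the code listing at the end of the assignment example. Under these assumptions the sketch produces the same codes as that listing.

```python
import heapq
from collections import Counter
from itertools import count

PEOF = 256  # pseudo-EOF value, one past the largest 8-bit value

def huffman_codes(text):
    freq = Counter(ord(ch) for ch in text)
    freq[PEOF] = 1                      # the PEOF occurs exactly once
    order = count()                     # insertion counter used to break ties
    # Assumption: values are enqueued in ascending order, and equal
    # frequencies are dequeued in the order they were enqueued.
    queue = [(f, next(order), value) for value, f in sorted(freq.items())]
    heapq.heapify(queue)
    while len(queue) > 1:
        f1, _, left = heapq.heappop(queue)
        f2, _, right = heapq.heappop(queue)
        heapq.heappush(queue, (f1 + f2, next(order), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):     # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                           # leaf: an integer value
            codes[node] = prefix

    walk(queue[0][2], "")
    return codes

codes = huffman_codes("Eerie eyes seen near lake.")
for value in sorted(codes):
    print(value, codes[value])
```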
Assignment Example
Char    Freq.    Char    Freq.    Char    Freq.
E       1        y       1        k       1
e       8        s       2        .       1
r       2        n       2        PEOF    1
i       1        a       2
space   4        l       1
Assignment Example
Initial queue, front first:
. 1, E 1, i 1, k 1, l 1, y 1, PEOF 1, a 2, n 2, r 2, s 2, SP 4, e 8
Assignment Example
[Figures: the queue after each dequeue, merge, and enqueue step, ending with a single tree whose root has frequency 27.]
Final tree (left child listed first):
27
  11
    4
      2
        i 1
        k 1
      2
        l 1
        y 1
    7
      3
        PEOF 1
        a 2
      SP 4
  16
    e 8
    8
      4
        n 2
        r 2
      4
        s 2
        2
          . 1
          E 1
Codes
value: 32, equivalent char:  , frequency: 4, new code 011
value: 46, equivalent char: ., frequency: 1, new code 11110
value: 69, equivalent char: E, frequency: 1, new code 11111
value: 97, equivalent char: a, frequency: 2, new code 0101
value: 101, equivalent char: e, frequency: 8, new code 10
value: 105, equivalent char: i, frequency: 1, new code 0000
value: 107, equivalent char: k, frequency: 1, new code 0001
value: 108, equivalent char: l, frequency: 1, new code 0010
value: 110, equivalent char: n, frequency: 2, new code 1100
value: 114, equivalent char: r, frequency: 2, new code 1101
value: 115, equivalent char: s, frequency: 2, new code 1110
value: 121, equivalent char: y, frequency: 1, new code 0011
value: 256, equivalent char: ?, frequency: 1, new code 0100
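The reason for the PEOF shows up once bits are packed into whole bytes: the last byte usually has to be padded, and the decoder needs to know where the real data stops. This Python sketch uses the codes listed above. The function names and the choice of zero bits for padding are assumptions for illustration, and no header is written.

```python
PEOF = 256
# Codes from the listing above, keyed by value (256 is the pseudo-EOF).
CODES = {32: "011", 46: "11110", 69: "11111", 97: "0101", 101: "10",
         105: "0000", 107: "0001", 108: "0010", 110: "1100", 114: "1101",
         115: "1110", 121: "0011", PEOF: "0100"}

def compress(text):
    bits = "".join(CODES[ord(ch)] for ch in text) + CODES[PEOF]
    bits += "0" * (-len(bits) % 8)            # pad the last byte with zeros
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def decompress(data):
    by_code = {code: value for value, code in CODES.items()}
    bits = "".join(format(byte, "08b") for byte in data)
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:
            value = by_code[current]
            if value == PEOF:                  # stop: whatever follows is padding
                break
            out.append(chr(value))
            current = ""
    return "".join(out)

data = compress("Eerie eyes seen near lake.")
print(len(data))         # 12 bytes: 90 bits of codes plus 6 bits of padding
print(decompress(data))
```

Without the PEOF check, the six padding zeros would start with 0000 and decode as an extra "i".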