07 DictionaryCoding

This document discusses dictionary-based coding techniques for lossless data compression. It describes the Lempel-Ziv 1977 (LZ77) algorithm, which constructs a dictionary during encoding and decoding by finding the longest matches between the search buffer and look-ahead buffer. Simple examples are provided to illustrate how LZ77 encoding and decoding works.

Dictionary-based Coding

Last Lecture

Last Lecture: Predictive Lossless Coding

Predictive Lossless Coding

Simple and effective way to exploit dependencies between neighboring symbols / samples
Optimal predictor: Conditional mean (requires storage of large tables)

Affine and Linear Prediction

Simple structure, low-complexity implementation possible
Optimal prediction parameters are given by solution of Yule-Walker equations
Works very well for real signals (e.g., audio, images, ...)

Efficient Lossless Coding for Real-World Signals

Affine/linear prediction (often: block-adaptive choice of prediction parameters)
Entropy coding of prediction errors (e.g., arithmetic coding)
Using marginal pmf often already yields good results
Can be improved by using conditional pmfs (with simple conditions)

Heiko Schwarz (Freie Universität Berlin) — Data Compression: Dictionary-based Coding 2 / 37

Dictionary-based Coding

Dictionary-Based Coding

Coding of Text Files

Very high amount of dependencies
Affine prediction does not work (requires linear dependencies)
Higher-order conditional coding should work well, but is way too complex (memory)
Alternative: Do not code single characters, but words or phrases

Example: English Texts

Oxford English Dictionary lists less than 230 000 words (including obsolete words)
On average, a word contains about 6 characters
Average codeword length per character (in bits) would be limited by
$\bar{\ell} < \frac{1}{6} \cdot \left\lceil \log_2 230\,000 \right\rceil \approx 3.0$
Including “phrases” would further increase coding efficiency


Lempel-Ziv Coding

Universal Algorithms for Lossless Data Compression

Based on the work of Abraham Lempel and Jacob Ziv
Basic idea: Construct dictionary during encoding and decoding

Two Basic Variants

LZ77: Based on [ Ziv, Lempel, “A Universal Algorithm for Sequential Data Compression”, 1977 ]
Lempel-Ziv-Storer-Szymanski (LZSS)
DEFLATE used in ZIP, gzip, PNG, TIFF, PDF, OpenDocument, ...
Lempel-Ziv-Markov Chain Algorithm (LZMA) used in 7zip, xz, lzip
...

LZ78: Based on [ Ziv, Lempel, “Compression of individual sequences via variable-rate coding”, 1978 ]
Lempel-Ziv-Welch (LZW) used in compress, GIF, optional support in PDF, TIFF
...


Dictionary-based Coding / The LZ77 Algorithm and Selected Variants / LZ77

The Lempel-Ziv 1977 Algorithm (LZ77)

[Figure: sliding window over the text "We know the past but cannot control it. We control the future but cannot know it. ···". The search buffer (N symbols, already coded) lies to the left of the cursor, the look-ahead buffer (L symbols, not yet coded) to the right. Distances are counted backwards from the cursor (1, 2, 3, ...). The marked match is coded as (d, ℓ, n) = (15, 7, 't').]

Basic Idea of the LZ77 Algorithm

Dictionary of variable-length sequences is given by the preceding N symbols (sliding window)
Find longest possible match for the sequence at the start of the look-ahead buffer
Message is coded as sequence of triples (d, ℓ, n):
d : distance of best match from next symbol to be coded
ℓ : length of matched phrase (match starts in search buffer but may reach into look-ahead buffer)
n : next symbol after matched sequence
If no match is found, then (1, 0, n) is coded (with n being the next symbol after the cursor)

Simplest Version: LZ77 Algorithm with Fixed-Length Coding

[Figure: sliding window with search buffer (N symbols) left of the cursor and look-ahead buffer (L symbols) right of it, over the text "We know the past but cannot control it. We control the future but cannot know it. ···"; the marked match is coded as (d, ℓ, n) = (15, 7, 't').]

How Many Bits Do We Need ?

Distance d : Can take values from 1 ... N (we could actually code d − 1)

Require $n_d = \log_2 N$ bits

Length ℓ : Can take values from 0 ... L − 1 (ℓ + 1 symbols must fit into look-ahead buffer)

Require $n_\ell = \log_2 L$ bits

Next symbol n : Can be any symbol of the alphabet A with size |A|

Require $n_n = \log_2 |A|$ bits (in most applications: 8 bits)

The sizes of both the search buffer and the look-ahead buffer should be integer powers of two!
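The bit counts above can be checked with a few lines of Python. This is only a sketch: the function name `triple_bits` and the default alphabet size of 256 are assumptions made for illustration, and the code rounds up when a size is not a power of two.

```python
def triple_bits(n_search, n_look, alphabet_size=256):
    """Bits per (d, l, n) triple when all three parts use fixed-length codes."""
    # (size - 1).bit_length() equals ceil(log2(size)) for size >= 1
    n_d = (n_search - 1).bit_length()       # d - 1 takes values 0 ... N - 1
    n_l = (n_look - 1).bit_length()         # l takes values 0 ... L - 1
    n_n = (alphabet_size - 1).bit_length()  # one symbol of the alphabet
    return n_d + n_l + n_n


if __name__ == "__main__":
    print(triple_bits(8, 4))  # N = 8, L = 4, byte alphabet
```

For N = 8 and L = 4 with a byte alphabet this gives 3 + 2 + 8 = 13 bits per triple.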

Toy Example: LZ77 Encoding

Message: Miss␣Mississippi

search buffer   look-ahead buffer   (d, ℓ, n)
                Miss                (1, 0, M)
M               iss␣                (1, 0, i)
Mi              ss␣M                (1, 0, s)
Mis             s␣Mi                (1, 1, ␣)
Miss␣           Miss                (5, 3, s)
iss␣Miss        issi                (3, 3, i)
Mississi        ppi                 (1, 0, p)
ississip        pi                  (1, 1, i)

Original message:
16 characters (8 bits per symbol)
128 bits (16 × 8 bits)

LZ77 configuration:
search buffer of N = 8 symbols
look-ahead buffer of L = 4 symbols

Coded representation (fixed-length):
8 triples (d, ℓ, n)
13 bits per triple (3 + 2 + 8 bits)
104 bits (19% bit savings)
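The encoding steps of this toy example can be reproduced with a brute-force encoder. The following is a minimal sketch, not an efficient implementation: the name `lz77_encode`, the exhaustive search over all distances, and the tie-breaking rule (smallest distance wins among equally long matches) are assumptions for illustration. The space character stands in for ␣.

```python
def lz77_encode(message, n_search=8, n_look=4):
    """Return the list of (d, length, next_symbol) triples for `message`."""
    triples = []
    pos = 0  # cursor: index of the next symbol to be coded
    while pos < len(message):
        best_d, best_len = 1, 0  # (1, 0, n) is coded if no match is found
        # A match covers at most L - 1 symbols, and one symbol must remain
        # to serve as the explicit next symbol n.
        max_len = min(n_look - 1, len(message) - pos - 1)
        for d in range(1, min(n_search, pos) + 1):
            length = 0
            # The match starts in the search buffer but may run into
            # the look-ahead buffer (pos - d + length can reach pos).
            while (length < max_len
                   and message[pos - d + length] == message[pos + length]):
                length += 1
            if length > best_len:
                best_d, best_len = d, length
        triples.append((best_d, best_len, message[pos + best_len]))
        pos += best_len + 1
    return triples


if __name__ == "__main__":
    for triple in lz77_encode("Miss Mississippi"):
        print(triple)
```

With N = 8 and L = 4 the sketch yields the same eight triples as the table above.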


Toy Example: LZ77 Decoding

Coded representation: (1, 0, M) (1, 0, i) (1, 0, s) (1, 1, ␣) (5, 3, s) (3, 3, i) (1, 0, p) (1, 1, i)

Decode message: Miss␣Mississippi

search buffer   (d, ℓ, n)    decoded phrase
                (1, 0, M)    M
M               (1, 0, i)    i
Mi              (1, 0, s)    s
Mis             (1, 1, ␣)    s␣
Miss␣           (5, 3, s)    Miss
iss␣Miss        (3, 3, i)    issi
Mississi        (1, 0, p)    p
ississip        (1, 1, i)    pi
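Decoding only needs to copy symbols, as the following sketch illustrates. The function name `lz77_decode` is an assumption; the copy is done symbol by symbol so that a match may overlap the part that is being written.

```python
def lz77_decode(triples):
    """Rebuild the message from a list of (d, length, next_symbol) triples."""
    out = []
    for d, length, next_symbol in triples:
        start = len(out) - d
        for i in range(length):
            # Symbol-by-symbol copy: out grows while we read from it,
            # which makes overlapping matches work.
            out.append(out[start + i])
        out.append(next_symbol)
    return "".join(out)


if __name__ == "__main__":
    coded = [(1, 0, "M"), (1, 0, "i"), (1, 0, "s"), (1, 1, " "),
             (5, 3, "s"), (3, 3, "i"), (1, 0, "p"), (1, 1, "i")]
    print(lz77_decode(coded))
```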

Coding Efficiency and Complexity of LZ77

Coding Efficiency
The LZ77 algorithm is asymptotically optimal (e.g., when using unary codes for d and ℓ)
$N \to \infty,\ L \to \infty \implies \bar{\ell} \to \bar{H}$
Proof can be found in [ Cover, Thomas, “Elements of Information Theory” ]
In practice: Require really large search buffer sizes N

Implementation Complexity
Decoder: Very low complexity (just copying characters)
Encoder: Highly depends on buffer size N and actual implementation
Use suitable data structures such as search trees, radix trees, hash tables
Not necessary to find the “best match” (note: shorter match can actually be more efficient)
There are very efficient implementations for rather large buffer sizes (e.g., N = 32 768)


LZ77 Variant: The Lempel-Ziv-Storer-Szymanski Algorithm (LZSS)

[Figure: sliding window with search buffer (N symbols, already coded) left of the cursor and look-ahead buffer (L symbols, not yet coded) right of it, over the text "We know the past but cannot control it. We control the future but cannot know it. ···"; distances are counted backwards from the cursor.]

Changes relative to LZ77 Algorithm

1 At first, code a single bit b to indicate whether a match is found
2 For matches, don’t transmit the following symbol

Message is coded as sequence of tuples (b, {d, ℓ} | n)

The indication bit b signals whether a match is found (b = 1 → match found)
If (b = 0), then code next symbol n as literal
If (b = 1), then code the match as distance-length pair {d, ℓ} (with d ∈ [1, N] and ℓ ∈ [1, L])

Toy Example: LZSS Encoding

Message: Miss␣Mississippi

search buffer   look-ahead buffer   (b, {d, ℓ} | n)
                Miss                (0, M)
M               iss␣                (0, i)
Mi              ss␣M                (0, s)
Mis             s␣Mi                (1, 1, 1)
Miss            ␣Mis                (0, ␣)
Miss␣           Miss                (1, 5, 4)
iss␣Miss        issi                (1, 3, 4)
Mississi        ppi                 (0, p)
ississip        pi                  (1, 1, 1)
ssissipp        i                   (1, 3, 1)

Original message:
16 characters (8 bits per symbol)
128 bits (16 × 8 bits)

LZSS configuration:
search buffer of N = 8 symbols
look-ahead buffer of L = 4 symbols

Coded representation (fixed-length):
5 literals (5 × 9 bits)
5 matches (5 × 6 bits)
75 bits (41% bit savings)
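The LZSS steps of this toy example can be sketched in the same brute-force style. The names `lzss_encode` and `coded_bits`, the exhaustive search, and the rule that any match of length 1 or more is coded as a match are assumptions chosen to mirror the table; practical encoders usually require a larger minimum match length. The space character stands in for ␣.

```python
def lzss_encode(message, n_search=8, n_look=4):
    """Return tokens (0, symbol) for literals and (1, d, length) for matches."""
    tokens = []
    pos = 0
    while pos < len(message):
        best_d, best_len = 0, 0
        max_len = min(n_look, len(message) - pos)  # length may be 1 ... L
        for d in range(1, min(n_search, pos) + 1):
            length = 0
            while (length < max_len
                   and message[pos - d + length] == message[pos + length]):
                length += 1
            if length > best_len:
                best_d, best_len = d, length
        if best_len >= 1:
            tokens.append((1, best_d, best_len))   # b = 1: match
            pos += best_len
        else:
            tokens.append((0, message[pos]))       # b = 0: literal
            pos += 1
    return tokens


def coded_bits(tokens, d_bits=3, len_bits=2, symbol_bits=8):
    """Fixed-length cost: 1 flag bit plus either a symbol or a (d, length) pair."""
    return sum(1 + (symbol_bits if t[0] == 0 else d_bits + len_bits)
               for t in tokens)


if __name__ == "__main__":
    tokens = lzss_encode("Miss Mississippi")
    print(tokens)
    print(coded_bits(tokens), "bits")
```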


Toy Example: LZSS Decoding

Coded representation: (0, M) (0, i) (0, s) (1, 1, 1) (0, ␣) (1, 5, 4) (1, 3, 4) (0, p) (1, 1, 1) (1, 3, 1)

Decode message: Miss␣Mississippi

search buffer   (b, {d, ℓ} | n)   decoded phrase
                (0, M)            M
M               (0, i)            i
Mi              (0, s)            s
Mis             (1, 1, 1)         s
Miss            (0, ␣)            ␣
Miss␣           (1, 5, 4)         Miss
iss␣Miss        (1, 3, 4)         issi   (note: copy symbol by symbol)
Mississi        (0, p)            p
ississip        (1, 1, 1)         p
ssissipp        (1, 3, 1)         i
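A decoder for these tuples can be sketched as follows. The function name `lzss_decode` and the token layout (a leading flag followed by either a symbol or a distance-length pair) are assumptions that mirror the table above.

```python
def lzss_decode(tokens):
    """Rebuild the message from (0, symbol) and (1, d, length) tokens."""
    out = []
    for token in tokens:
        if token[0] == 0:
            out.append(token[1])            # literal
        else:
            _, d, length = token
            start = len(out) - d
            for i in range(length):
                # Copy symbol by symbol: the match (1, 3, 4) reads a symbol
                # that was written earlier in this same loop.
                out.append(out[start + i])
    return "".join(out)


if __name__ == "__main__":
    coded = [(0, "M"), (0, "i"), (0, "s"), (1, 1, 1), (0, " "),
             (1, 5, 4), (1, 3, 4), (0, "p"), (1, 1, 1), (1, 3, 1)]
    print(lzss_decode(coded))
```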

The DEFLATE Algorithm: Combining LZSS with Huffman Coding

The Concept of DEFLATE

Pre-process message/file/symbol sequence using the LZSS algorithm (remove dependencies)
Entropy coding of tuples (b, {d, `} | n) using Huffman coding

Details of DEFLATE Format

Input is interpreted as sequence of bytes (alphabet size of 256)
LZSS configuration: Search buffer of N = 32 768, look-ahead buffer of L = 258
Input data are coded using variable-length blocks (for optimizing the Huffman coding)

3-bit block header (at start of each block)

1 bit    0    there are blocks that follow the current block
         1    this is the last block of the file / data stream
2 bits   00   uncompressed block (number of bytes in block is coded after block header, max. 65k)
         01   compressed block using pre-defined Huffman tables
         10   compressed block with transmitted Huffman tables (most frequently used type)
         11   reserved (forbidden)


The DEFLATE Format: Two Huffman Tables

Main Huffman table with 288 codewords

index n     meaning (additional codewords follow for n = 257 ... 285)
0 – 255     literal with byte value equal to n
256         end-of-block (last symbol of a block)
257 – 264   match with ℓ = (n − 254)
265 – 268   match with ℓ = 2 · (n − 260) + 1 + x    (1 extra bit for x)
269 – 272   match with ℓ = 4 · (n − 265) + 3 + x    (2 extra bits for x)
273 – 276   match with ℓ = 8 · (n − 269) + 3 + x    (3 extra bits for x)
277 – 280   match with ℓ = 16 · (n − 273) + 3 + x   (4 extra bits for x)
281 – 284   match with ℓ = 32 · (n − 277) + 3 + x   (5 extra bits for x)
285         match with ℓ = 258
286 – 287   reserved (forbidden codeword)

Note 1: The values for x are coded using fixed-length codes.
Note 2: The match size must be in range ℓ = 3 ... 258.

Huffman table for distance

n         distance d         bits for z
0 – 3     d = 1 + n          0
4         d = 5 + z          1
5         d = 7 + z          1
6         d = 9 + z          2
7         d = 13 + z         2
8         d = 17 + z         3
...       ...                ...
26        d = 8 193 + z      12
27        d = 12 289 + z     12
28        d = 16 385 + z     13
29        d = 24 577 + z     13
30 – 31   reserved

Note: The values for z are coded using fixed-length codes.
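The two tables can be turned into small lookup functions. This is a sketch under stated assumptions: the names `match_length` and `match_distance` are hypothetical, the length rows are transcribed from the main table, and the closed form used for the distance symbols 4 ... 29 (each pair of symbols shares a number of extra bits that grows by one per pair) is a compact restatement that agrees with the rows listed in the distance table and with the DEFLATE format specification.

```python
# Rows of the length table: (first n, last n, extra bits, scale, reference n, offset)
# so that length = scale * (n - reference) + offset + x.
LENGTH_ROWS = [
    (265, 268, 1, 2, 260, 1),
    (269, 272, 2, 4, 265, 3),
    (273, 276, 3, 8, 269, 3),
    (277, 280, 4, 16, 273, 3),
    (281, 284, 5, 32, 277, 3),
]


def match_length(n, x=0):
    """Match length for main-table symbol n and extra-bits value x."""
    if 257 <= n <= 264:
        return n - 254
    if n == 285:
        return 258
    for first, last, extra_bits, scale, ref, offset in LENGTH_ROWS:
        if first <= n <= last:
            if not 0 <= x < (1 << extra_bits):
                raise ValueError("x does not fit into the extra bits")
            return scale * (n - ref) + offset + x
    raise ValueError("n is not a match-length symbol")


def match_distance(n, z=0):
    """Distance for distance-table symbol n and extra-bits value z."""
    if 0 <= n <= 3:
        return 1 + n
    if 4 <= n <= 29:
        extra_bits = n // 2 - 1
        if not 0 <= z < (1 << extra_bits):
            raise ValueError("z does not fit into the extra bits")
        return 1 + ((2 + n % 2) << extra_bits) + z
    raise ValueError("n is reserved or out of range")


if __name__ == "__main__":
    print(match_length(265), match_length(284, 30), match_distance(29, 8191))
```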


The DEFLATE Algorithm in Practice

Encoding and Decoding

Decoding: Straightforward (follow format specification)
Encoding: Can trade off coding efficiency against complexity
Fixed pre-defined or dynamic Huffman tables
Determination of suitable block sizes
Simplified search for finding best matches
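The encoder-side trade-off can be observed with Python's standard `zlib` module, which exposes a compression level between 1 (fastest) and 9 (best compression). The sample data and the helper name `compressed_sizes` below are made up for illustration; the resulting sizes depend on the input, so none are stated here.

```python
import zlib


def compressed_sizes(data, levels=(1, 6, 9)):
    """Return {level: compressed size in bytes} and check lossless decoding."""
    sizes = {}
    for level in levels:
        packed = zlib.compress(data, level)
        # Decoding is the same for every level: just follow the format.
        assert zlib.decompress(packed) == data
        sizes[level] = len(packed)
    return sizes


if __name__ == "__main__":
    sample = b"We know the past but cannot control it. " * 200
    print(len(sample), compressed_sizes(sample))
```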

Applications
One of the most used algorithms in practice
Archive formats: Library zlib, ZIP, gzip, PKZIP, Zopfli, CAB
Lossless image coding: PNG, TIFF
Documents: OpenDocument, PDF
Cryptography: Crypto++
...

LZ77 Variant: Lempel-Ziv-Markov Chain Algorithm (LZMA)

The Concept of LZMA

Pre-process byte sequence using an LZ77 variant (similar to LZSS, but with special cases)
Entropy coding of resulting bit sequence using a range encoder (adaptive binary arithmetic coding)

Improvements over DEFLATE

Most important: Context-based adaptive binary arithmetic coding of bit sequences
Larger search buffer of up to N = 4 294 967 296 (32 bit), look-ahead buffer of L = 273
Special codes for using same distances as for one of the last four matches

Applications of LZMA
Next generation file compressors
7zip, xz, lzip, ZIPX


LZMA: Mapping of Byte Sequences to Bit Sequences

Code for single byte sequence (match or literal)

0 + (byte)    Direct encoding of next byte (no match)
10 + ℓ + d    Conventional match (followed by codes for length ℓ and distance d)
1100          Match of length ℓ = 1, distance d is equal to last used distance
1101 + ℓ      Match of length ℓ, distance d is equal to last used distance
1110 + ℓ      Match of length ℓ, distance d is equal to second last used distance
11110 + ℓ     Match of length ℓ, distance d is equal to third last used distance
11111 + ℓ     Match of length ℓ, distance d is equal to fourth last used distance

Code for length ℓ

0 + (3 bits)    Length in range ℓ = 2 ... 9
10 + (3 bits)   Length in range ℓ = 10 ... 17
11 + (8 bits)   Length in range ℓ = 18 ... 273

Code for distance d

6 bits for indicating “distance slot”, followed by 0–30 bits (depending on slot)


LZMA: Entropy Coding of Bit Sequence after LZ77 Variant

Entropy Coding of Bit Sequences

Context-based Adaptive Binary Arithmetic Coding (called range encoder)
Multiple adaptive binary probability models + bypass mode (probability 0.5)
Sophisticated context modeling: Probability model for next bit is chosen based on ...
type of bit, value of preceding byte, preceding bits of current byte,
type of preceding byte sequences, ...

Binary Arithmetic Coding Engine

11 bits of precision for binary probability masses (only store $p_0$, since $p_1 = 2^{11} - p_0$)
32 bits of precision for interval width
Probability models are updated according to
$$p_0 = \begin{cases} p_0 + \big( (2^{11} - p_0) \gg 5 \big) & : \text{bit} = 0 \\ p_0 - ( p_0 \gg 5 ) & : \text{bit} = 1 \end{cases}$$
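The probability update can be written with integer shifts. The sketch below assumes that the update moves the stored 11-bit value for p0 by 1/32 of its distance to the end of the range (a right shift by 5); the names `PROB_BITS`, `MOVE_BITS`, and `update_p0` are chosen for illustration.

```python
PROB_BITS = 11   # probabilities are stored as integers in [0, 2**11]
MOVE_BITS = 5    # adaptation speed: move by 1/32 of the remaining distance


def update_p0(p0, bit):
    """Return the adapted estimate for the probability of a 0 bit."""
    if bit == 0:
        return p0 + (((1 << PROB_BITS) - p0) >> MOVE_BITS)  # 0 seen: raise p0
    return p0 - (p0 >> MOVE_BITS)                           # 1 seen: lower p0


if __name__ == "__main__":
    p0 = 1 << (PROB_BITS - 1)  # start at probability 0.5
    for bit in [0, 0, 0, 1]:
        p0 = update_p0(p0, bit)
        print(bit, p0)
```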


Dictionary-based Coding / The LZ78 Algorithms and a Selected Variant / LZ78

The Lempel-Ziv 1978 Algorithm (LZ78)

Main Difference to LZ77
Dictionary is not restricted to preceding N symbols
Dictionary is constructed during encoding and decoding

The LZ78 Algorithm

Starts with an empty dictionary
Next variable-length symbol sequence is coded by tuple {k, n}
k : Index for best match in dictionary (or “0” if no match is found)
n : Next symbol (similar to LZ77)
After coding a tuple {k, n}, the represented phrase is added to the dictionary

Number of Bits for Dictionary Index

Number of bits $n_k$ for dictionary index depends on dictionary size
$n_k = \left\lceil \log_2 (1 + \text{dictionary size}) \right\rceil$

In practice: Dictionary is reset after it becomes too large


Toy Example: LZ78 Encoding

Message: thinking␣things␣through

phrase   output     bits   dictionary
t        (0, t)     8      1 : t
h        (0, h)     9      2 : h
i        (0, i)     10     3 : i
n        (0, n)     10     4 : n
k        (0, k)     11     5 : k
in       (3, n)     11     6 : in
g        (0, g)     11     7 : g
␣        (0, ␣)     11     8 : ␣
th       (1, h)     12     9 : th
ing      (6, g)     12     10 : ing
s        (0, s)     12     11 : s
␣t       (8, t)     12     12 : ␣t
hr       (2, r)     12     13 : hr
o        (0, o)     12     14 : o
u        (0, u)     12     15 : u
gh       (7, h)     12     16 : gh

Result:
Original message: 184 bits (23 bytes)
Required 177 bits in total

Remember: Number of bits for dictionary index k
$n_k = \left\lceil \log_2 (1 + \text{dictionary size}) \right\rceil$
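The table can be reproduced with a short encoder sketch. The name `lz78_encode` is an assumption, as is the choice to count 8 bits for every next symbol n and to always leave one symbol for n at the end of the message. The space character stands in for ␣.

```python
def lz78_encode(message, symbol_bits=8):
    """Return (list of (k, n) tuples, total number of bits)."""
    dictionary = {}   # phrase -> index, indices start at 1
    tuples = []
    total_bits = 0
    pos = 0
    while pos < len(message):
        k, end = 0, pos
        # Grow the match while the phrase is known and one symbol
        # is left over to serve as the explicit next symbol n.
        while end + 1 < len(message) and message[pos:end + 1] in dictionary:
            k = dictionary[message[pos:end + 1]]
            end += 1
        # size.bit_length() equals ceil(log2(1 + size)) for size >= 0
        total_bits += len(dictionary).bit_length() + symbol_bits
        tuples.append((k, message[end]))
        dictionary[message[pos:end + 1]] = len(dictionary) + 1
        pos = end + 1
    return tuples, total_bits


if __name__ == "__main__":
    coded, bits = lz78_encode("thinking things through")
    print(coded)
    print(bits, "bits")
```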

Toy Example: LZ78 Decoding

input      phrase   dictionary
(0, t)     t        1 : t
(0, h)     h        2 : h
(0, i)     i        3 : i
(0, n)     n        4 : n
(0, k)     k        5 : k
(3, n)     in       6 : in
(0, g)     g        7 : g
(0, ␣)     ␣        8 : ␣
(1, h)     th       9 : th
(6, g)     ing      10 : ing
(0, s)     s        11 : s
(8, t)     ␣t       12 : ␣t
(2, r)     hr       13 : hr
(0, o)     o        14 : o
(0, u)     u        15 : u
(7, h)     gh       16 : gh

Decoded Message: thinking␣things␣through
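The decoder rebuilds the same dictionary while decoding, which the following sketch makes explicit. The name `lz78_decode` is an assumption; index 0 is mapped to the empty phrase so that "no match" needs no special case.

```python
def lz78_decode(tuples):
    """Rebuild the message from a list of (k, n) tuples."""
    phrases = [""]    # index 0 means "no match": empty phrase
    out = []
    for k, n in tuples:
        phrase = phrases[k] + n
        phrases.append(phrase)   # same dictionary update as in the encoder
        out.append(phrase)
    return "".join(out)


if __name__ == "__main__":
    coded = [(0, "t"), (0, "h"), (0, "i"), (0, "n"), (0, "k"), (3, "n"),
             (0, "g"), (0, " "), (1, "h"), (6, "g"), (0, "s"), (8, "t"),
             (2, "r"), (0, "o"), (0, "u"), (7, "h")]
    print(lz78_decode(coded))
```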

LZ78 Variant: The Lempel-Ziv-Welch Algorithm (LZW)

Main Difference to LZ78

Dictionary is initialized with all strings of length one (i.e., all byte codes)
Next symbol is not included in code

The LZW Algorithm

Send code for dictionary entry that matches start of remaining sequence
After sending a code, a new dictionary entry is added that consists of
the phrase that was just coded followed by
the next symbol in the message

Applications using the LZW Algorithm

Unix file compression tool compress
Image coding format GIF
Optional compression mode in PDF and TIFF

Toy Example: LZW Encoding

Message: thinking␣things␣through

Pre-initialized dictionary:
All byte codes: <0> ... <255>

phrase   next   output   dictionary
t        h      <116>    256: th
h        i      <104>    257: hi
i        n      <105>    258: in
n        k      <110>    259: nk
k        i      <107>    260: ki
in       g      <258>    261: ing
g        ␣      <103>    262: g␣
␣        t      <32>     263: ␣t
th       i      <256>    264: thi
ing      s      <261>    265: ings
s        ␣      <115>    266: s␣
␣t       h      <263>    267: ␣th
h        r      <104>    268: hr
r        o      <114>    269: ro
o        u      <111>    270: ou
u        g      <117>    271: ug
g        h      <103>    272: gh
h               <104>    273: h

Result:
Original message: 184 bits (23 bytes)
Required 162 bits (18 × 9 bits)
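The encoding table can be reproduced with a compact encoder sketch. The name `lzw_encode` is an assumption, and the sketch works on bytes and lets the dictionary grow without limit, whereas practical implementations bound or reset it. The space character stands in for ␣.

```python
def lzw_encode(data):
    """Return the list of dictionary codes for the byte string `data`."""
    dictionary = {bytes([i]): i for i in range(256)}  # all byte codes
    codes = []
    phrase = b""
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate                     # keep extending the match
        else:
            codes.append(dictionary[phrase])       # send code of matched phrase
            dictionary[candidate] = len(dictionary)  # phrase + next symbol
            phrase = bytes([byte])
    if phrase:
        codes.append(dictionary[phrase])           # flush the last phrase
    return codes


if __name__ == "__main__":
    codes = lzw_encode(b"thinking things through")
    print(codes)
    print(len(codes) * 9, "bits with 9-bit codes")
```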

Toy Example: LZW Decoding

Message: thinking␣things␣through

Pre-initialized dictionary:
All byte codes: <0> ... <255>

input    output   dictionary    conjecture
<116>    t                      256: t?
<104>    h        256: th       257: h?
<105>    i        257: hi       258: i?
<110>    n        258: in       259: n?
<107>    k        259: nk       260: k?
<258>    in       260: ki       261: in?
<103>    g        261: ing      262: g?
<32>     ␣        262: g␣       263: ␣?
<256>    th       263: ␣t       264: th?
<261>    ing      264: thi      265: ing?
<115>    s        265: ings     266: s?
<263>    ␣t       266: s␣       267: ␣t?
<104>    h        267: ␣th      268: h?
<114>    r        268: hr       269: r?
<111>    o        269: ro       270: o?
<117>    u        270: ou       271: u?
<103>    g        271: ug       272: g?
<104>    h        272: gh       273: h?

LZW: The K-Omega-K Problem

Property of LZW Algorithm

Decoder is one step behind encoder in constructing dictionary
Encoder might send code for not yet completed dictionary entry

Example: Coding of sequence “...cXYZcXYZca...”

Encoder
phrase   next   output   dictionary
                         <300>: cXYZ
cXYZ     c      <300>    <400>: cXYZc
cXYZc    a      <400>    <401>: cXYZca

Decoder
input    output   dictionary     conjecture
                  <300>: cXYZ
<300>    cXYZ                    <400>: cXYZ?
<400>    cXYZ?    ( cXYZ? must be cXYZc )

How can the decoder correctly decode in such a case ?

Incomplete dictionary entry is last added entry
This entry is used only if the first symbol of new sequence is the last symbol of incomplete entry
Last symbol must be equal to first symbol! (in our example: “cXYZ?” = “cXYZc”)
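The special case can be handled in a decoder with one extra branch, as the following sketch suggests. The name `lzw_decode` is an assumption, and the dictionary grows without limit here. When a received code is exactly the index of the entry the decoder is about to create, the phrase must be the previous phrase followed by its own first symbol.

```python
def lzw_decode(codes):
    """Rebuild the byte string from a list of LZW codes."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}  # all byte codes
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # Code refers to the not yet completed entry:
            # its last symbol equals its first symbol.
            entry = prev + prev[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        # Complete the conjectured entry: previous phrase + first new symbol.
        dictionary[len(dictionary)] = prev + entry[:1]
        prev = entry
    return b"".join(out)


if __name__ == "__main__":
    print(lzw_decode([116, 104, 105, 110, 107, 258, 103, 32, 256, 261,
                      115, 263, 104, 114, 111, 117, 103, 104]))
    print(lzw_decode([97, 256, 97]))  # 256 is used before it is complete
```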
Data Compression using Block Sorting / Burrows-Wheeler Transform (BWT)

The Burrows-Wheeler Transform (BWT)

1 Create all rotations of the original message
2 Sort all rotations in lexicographical order
3 Output: Last column of the sorted block + index of original message (in sorted block)

Example: Message “BANANAMAN”

rotations    sorted rotations   index
BANANAMAN    AMANBANAN          0
ANANAMANB    ANAMANBAN          1
NANAMANBA    ANANAMANB          2
ANAMANBAN    ANBANANAM          3
NAMANBANA    BANANAMAN          4
AMANBANAN    MANBANANA          5
MANBANANA    NAMANBANA          6
ANBANANAM    NANAMANBA          7
NBANANAMA    NBANANAMA          8

Output: last column NNBMNAAAA, index = 4
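The three steps map directly onto a few lines of Python. This sketch (with the assumed name `bwt`) builds all rotations explicitly, which is fine for a toy message but far too costly for the block sizes used in practice.

```python
def bwt(message):
    """Return (last column of the sorted rotations, index of the original)."""
    # 1. all rotations, 2. sorted in lexicographical order
    rotations = sorted(message[i:] + message[:i] for i in range(len(message)))
    # 3. last column plus the row index of the original message
    last_column = "".join(rotation[-1] for rotation in rotations)
    return last_column, rotations.index(message)


if __name__ == "__main__":
    print(bwt("BANANAMAN"))
```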


BWT: The Inverse Transform (Can we reconstruct the original message?)

Given:
Last column of sorted block “N N B M N A A A A”
Index of original message in sorted block (4)

index   first column   last column
0       A              N
1       A              N
2       A              B
3       A              M
4       B              N
5       M              A
6       N              A
7       N              A
8       N              A

Decoding procedure
1 Create first column of sorted block (by sorting)
2 First symbol is given at transmitted index
3 Next symbol is obtained by
a Look for corresponding symbol in last column (i.e., same count of same letter)
b Next symbol is at same position in first column (since following symbol is in first column)
4 Continue procedure until all letters are decoded

Decoded message: BANANAMAN
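The decoding procedure can be sketched as follows. The name `inverse_bwt` and the helper tables are assumptions for illustration; "same count of same letter" is implemented by pairing the j-th occurrence of a symbol in the first column with its j-th occurrence in the last column.

```python
def inverse_bwt(last_column, index):
    """Rebuild the original message from the BWT output."""
    first_column = sorted(last_column)          # step 1
    # Rows in which each symbol occurs in the last column, top to bottom.
    rows_in_last = {}
    for row, symbol in enumerate(last_column):
        rows_in_last.setdefault(symbol, []).append(row)
    # jump[row]: row whose last column holds the same occurrence of the
    # symbol found in first_column[row] (step 3a).
    seen = {}
    jump = []
    for symbol in first_column:
        count = seen.get(symbol, 0)
        jump.append(rows_in_last[symbol][count])
        seen[symbol] = count + 1
    out = []
    row = index                                 # step 2
    for _ in range(len(last_column)):           # step 4
        out.append(first_column[row])
        row = jump[row]                         # step 3b: read first column there
    return "".join(out)


if __name__ == "__main__":
    print(inverse_bwt("NNBMNAAAA", 4))
```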


BWT: Why Is It Useful for Compression ?

Property of BWT (for large blocks)

Symbols on left side of sorted block are contexts (symbols that follow last column in message)
Block lines are sorted according to the contexts
Likely that same symbol (last column) precedes same context (source with memory: conditional pmf with high peak)
Last column contains long sequences of identical symbols

Sorted block for “BANANAMAN”:
AMANBANAN
ANAMANBAN
ANANAMANB
ANBANANAM
BANANAMAN
MANBANANA
NAMANBANA
NANAMANBA
NBANANAMA

How to exploit this property ?
In following processing steps
Example: Move-to-front transform (MTF)


The Move-To-Front Transform (MTF)

MTF: Map Symbol Sequences to Sequences of Unsigned Integers

1 Replace next symbol with its alphabet index
2 Update alphabet A by moving symbol to the front

Example: Sequence “NNBMNAAAA” (result of BWT for “BANANAMAN”)

symbol   output   alphabet before coding the symbol
N        13       A = {A B C D E F G H I J K L M N O P Q R S T U V W X Y Z}
N        0        A = {N A B C D E F G H I J K L M O P Q R S T U V W X Y Z}
B        2        A = {N A B C D E F G H I J K L M O P Q R S T U V W X Y Z}
M        13       A = {B N A C D E F G H I J K L M O P Q R S T U V W X Y Z}
N        2        A = {M B N A C D E F G H I J K L O P Q R S T U V W X Y Z}
A        3        A = {N M B A C D E F G H I J K L O P Q R S T U V W X Y Z}
A        0        A = {A N M B C D E F G H I J K L O P Q R S T U V W X Y Z}
A        0        A = {A N M B C D E F G H I J K L O P Q R S T U V W X Y Z}
A        0        A = {A N M B C D E F G H I J K L O P Q R S T U V W X Y Z}

Effect: Many small values for sequences with long repetitions (e.g., results of a BWT)
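Both directions of the transform fit into a few lines. The names `mtf_encode` and `mtf_decode` and the default alphabet (upper-case letters, as in the example) are assumptions; indices start at 0.

```python
import string


def mtf_encode(sequence, alphabet=string.ascii_uppercase):
    """Replace each symbol by its current alphabet index, then move it to the front."""
    table = list(alphabet)
    indices = []
    for symbol in sequence:
        i = table.index(symbol)
        indices.append(i)
        table.insert(0, table.pop(i))
    return indices


def mtf_decode(indices, alphabet=string.ascii_uppercase):
    """Invert mtf_encode by applying the same alphabet updates."""
    table = list(alphabet)
    symbols = []
    for i in indices:
        symbol = table.pop(i)
        symbols.append(symbol)
        table.insert(0, symbol)
    return "".join(symbols)


if __name__ == "__main__":
    coded = mtf_encode("NNBMNAAAA")
    print(coded, mtf_decode(coded))
```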

File Compression Utility BZIP2

Main Components for Compression

Run-length encoding of input data (special V2V code)
Block-wise Burrows-Wheeler Transform (BWT)
Move-To-Front Transform (MTF) of BWT result
Run-length encoding of MTF result
Dynamic Huffman coding

Some more details

Block size for BWT/MTF of up to 900 kBytes
Smart coding of Huffman tables
Up to 6 Huffman tables per block
Adaptive selection between Huffman tables (every 50 symbols)


Lossless Compression in Practice

Universal File Compressors

Marginal Huffman Coding

Very old Unix utility pack

Lempel-Ziv-Welch (LZW) Algorithm

Old Unix utility compress

DEFLATE: Lempel-Ziv-Storer-Szymanski (LZSS) + Huffman Coding

File compressors ZIP, gzip, PKZIP, Zopfli, CAB

Lempel-Ziv-Markov-Chain (LZMA) with binary arithmetic coding

File compressors 7zip, xz, lzip

Block Sorting: Burrows-Wheeler & Move-To-Front Transform

File compressor bzip2

Lossless Audio Coding: Free Lossless Audio Codec (FLAC)

Basic Source Codec

1 Decompose audio file into variable-size blocks
Block size determines capability for adaptation to signal statistics

2 Inter-channel decorrelation (invertible)

For example: Stereo is coded as
mid = (left + right)/2
side = (left − right)

3 Linear prediction (4 types)

a No prediction
b Prediction by a constant value
c Prediction using pre-defined linear predictor (order 1 to 4)
d Prediction using adaptive linear predictor (up to order 32)

4 Entropy coding of prediction error samples

Rice coding with adaptive Rice parameter selection
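The inter-channel decorrelation of step 2 remains invertible even though the mid signal drops a bit in the division by two. The sketch below assumes integer samples and a division that rounds down (a right shift); the function names are hypothetical. The dropped bit can be recovered because left + right and left − right always have the same parity.

```python
def to_mid_side(left, right):
    """Forward transform: mid loses the lowest bit of left + right."""
    return (left + right) >> 1, left - right


def from_mid_side(mid, side):
    """Inverse transform: the parity of side restores the lost bit."""
    total = (mid << 1) | (side & 1)   # equals left + right
    return (total + side) >> 1, (total - side) >> 1


if __name__ == "__main__":
    for left, right in [(5, 2), (-3, 4), (-3, -4)]:
        mid, side = to_mid_side(left, right)
        print((left, right), "->", (mid, side), "->", from_mid_side(mid, side))
```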

Lossless Image Coding: Portable Network Graphics (PNG)

Basic Source Codec

1 Separate Coding of Individual Color Planes
2 Prediction of Image Samples
Predictor is selected per image row
Five predictors are pre-defined (no adaptive prediction coefficients)

0   none      direct coding of image samples
1   left      prediction using left sample
2   above     prediction using above sample
3   average   prediction using rounded average of left and above sample
4   Paeth     non-linear prediction using left, above, and corner sample (most often used)

3 Entropy Coding of Prediction Error Samples

DEFLATE algorithm:
Lempel-Ziv-Storer-Szymanski (LZSS) algorithm for dependency removal
Huffman coding of LZSS output (adaptive Huffman tables)
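The Paeth predictor from step 2 can be sketched as follows, based on its definition in the PNG specification: it forms the initial estimate left + above − corner and returns whichever of the three neighbors is closest to it, preferring left, then above, then corner on ties. The function name `paeth` is an assumption.

```python
def paeth(left, above, corner):
    """Return the neighbor closest to the estimate left + above - corner."""
    estimate = left + above - corner
    d_left = abs(estimate - left)
    d_above = abs(estimate - above)
    d_corner = abs(estimate - corner)
    # Ties are resolved in the order left, above, corner.
    if d_left <= d_above and d_left <= d_corner:
        return left
    if d_above <= d_corner:
        return above
    return corner


if __name__ == "__main__":
    print(paeth(10, 20, 10), paeth(10, 20, 20), paeth(50, 60, 100))
```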

Lossless Image Coding: JPEG-LS (Joint Photographic Experts Group)

Basic Source Codec
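1 First prediction stage: LOCO Predictor

Neighboring samples of the current sample X:
C A D
L X

$$\hat{X} = \begin{cases} \min(L, A) & : C \ge \max(L, A) \\ \max(L, A) & : C \le \min(L, A) \\ L + A - C & : \text{otherwise} \end{cases}$$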

2 Second order prediction using conditional mean E{ x | g1 , g2 , g3 }

Given by clipped gradients (365 contexts after merging contexts with positive and negative signs)
g1 = max(−4, min(4, D − A))
g2 = max(−4, min(4, A − C ))
g3 = max(−4, min(4, C − L))

3 Entropy Coding of Prediction Error Samples

Rice codes
Optional: Run-length coding (for uniform areas)
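The first prediction stage and the context count of the second stage can be checked with a short sketch. The names `loco_predict`, `clipped_gradients`, and `context_count` are hypothetical. The count assumes that each clipped gradient takes one of the nine values −4 ... 4 and that a context (g1, g2, g3) is merged with its negation (−g1, −g2, −g3); only the all-zero context is its own negation.

```python
def loco_predict(left, above, corner):
    """First prediction stage: pick min, max, or the planar estimate."""
    if corner >= max(left, above):
        return min(left, above)
    if corner <= min(left, above):
        return max(left, above)
    return left + above - corner


def clipped_gradients(left, above, corner, right_above):
    """Gradients g1, g2, g3 clipped to the range -4 ... 4."""
    def clip(value):
        return max(-4, min(4, value))
    return (clip(right_above - above), clip(above - corner), clip(corner - left))


def context_count(levels=9):
    """Number of contexts after merging each context with its negation."""
    return (levels ** 3 + 1) // 2


if __name__ == "__main__":
    print(loco_predict(10, 20, 25), loco_predict(10, 20, 5), loco_predict(10, 20, 15))
    print(clipped_gradients(left=10, above=20, corner=15, right_above=40))
    print(context_count())
```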

Comparison: Universal vs Specialized Compressors

compression factor    text    images   audio
gzip (DEFLATE)        2.60    1.20     1.09
lzip (LZMA)           3.53    1.41     1.17
bzip2 (BWT+MTF)       3.55    1.39     1.15
PNG (prediction)              1.62
FLAC (prediction)                      1.82

Specialized Compressors achieve Higher Coding Efficiency

Summary

Summary of Lecture

Dictionary-based Coding
Lempel-Ziv 1977 and 1978 algorithms (LZ77, LZ78): Basis for many universal compressors
Lempel-Ziv-Storer-Szymanski (LZSS): Variant of LZ77
Lempel-Ziv-Welch (LZW): Variant of LZ78
DEFLATE: Combining LZSS with Huffman Coding
Lempel-Ziv-Markov Chain Algorithm (LZMA): LZ77 Variant with Binary Arithmetic Coding

Lossless Coding using Block Sorting

Burrows-Wheeler Transform (BWT)
Move-To-Front Transform (MTF)

Lossless Compression Applications

Universal File Compression: compress, gzip, bzip2, lzip
Lossless Audio Coding: FLAC
Lossless Image Coding: PNG, JPEG-LS
Exercises

Exercise: Lossless Image Compression Challenge (Part II)

Improve your codec for lossless coding of 8-bit color images

Try different things discussed in lectures and exercises

The following might be worth trying

Prediction
Simple prediction using left sample
Fixed non-linear predictor like LOCO or Paeth predictor
Line- or block-adaptive selection of predictor (e.g., between horizontal, vertical, ...)
Entropy Coding of Prediction Errors
Simple Rice codes (possibly with adaptive Rice parameter)
Arithmetic coding with adaptive marginal pmf
Arithmetic coding with conditional pmf (very simple conditions)

Measure and provide the compressed file sizes for the Kodak test set!
25 pages
Unit 1 Data Compression
No ratings yet
Unit 1 Data Compression
30 pages
Multimedia Data Compression Guide
No ratings yet
Multimedia Data Compression Guide
21 pages
Guidance Transcutaneous Electrical Stimulators
No ratings yet
Guidance Transcutaneous Electrical Stimulators
18 pages
Chapter Three
No ratings yet
Chapter Three
30 pages
DC 1
No ratings yet
DC 1
3 pages
Flexitallic Flexpro Brochure 11-30-2017
No ratings yet
Flexitallic Flexpro Brochure 11-30-2017
8 pages
Unit 5 Data Compression
No ratings yet
Unit 5 Data Compression
98 pages
Compression Methods: Huffman & LZ
100% (1)
Compression Methods: Huffman & LZ
26 pages
Attachment - 1
No ratings yet
Attachment - 1
2 pages
Lecture 13 - Delta Coding
No ratings yet
Lecture 13 - Delta Coding
41 pages
Property Dispute: No Forgery Found
No ratings yet
Property Dispute: No Forgery Found
1 page
Fastpath SAP Extractor
No ratings yet
Fastpath SAP Extractor
8 pages
Compression Techniques Explained
No ratings yet
Compression Techniques Explained
10 pages
ERW Boiler & Air Heater Tubes
No ratings yet
ERW Boiler & Air Heater Tubes
2 pages
Local Media7707301369137256841
No ratings yet
Local Media7707301369137256841
33 pages
HRM: Job Analysis Essentials
100% (1)
HRM: Job Analysis Essentials
11 pages
Compression: Author: Paul Penfield, Jr. Url: Toc
No ratings yet
Compression: Author: Paul Penfield, Jr. Url: Toc
5 pages
Reveiw. - Data Compressiondocx
No ratings yet
Reveiw. - Data Compressiondocx
1 page
GE2 - Exercise 2.1 Juvine Ramos
No ratings yet
GE2 - Exercise 2.1 Juvine Ramos
4 pages
Image Compression-2
No ratings yet
Image Compression-2
13 pages
Factors and Norms Influencing Unpaid Care Work
No ratings yet
Factors and Norms Influencing Unpaid Care Work
64 pages
I-Sem-Marketing Management
No ratings yet
I-Sem-Marketing Management
2 pages
Crime Mapping for Police Planning
No ratings yet
Crime Mapping for Police Planning
7 pages
Nature and Scope of Rural Development
No ratings yet
Nature and Scope of Rural Development
59 pages
EMA Literature Review Guide
No ratings yet
EMA Literature Review Guide
7 pages
Hydraulic Sealing Surface Insights
No ratings yet
Hydraulic Sealing Surface Insights
7 pages
Applied Energy Systems
No ratings yet
Applied Energy Systems
2 pages
Bengtech Metallurgy Extended
100% (1)
Bengtech Metallurgy Extended
2 pages
Blume Expando T
No ratings yet
Blume Expando T
24 pages
File 02 Ingles
No ratings yet
File 02 Ingles
30 pages
Chapter 7
No ratings yet
Chapter 7
70 pages
Itc 11
No ratings yet
Itc 11
11 pages
Planmeca
No ratings yet
Planmeca
27 pages
Testing - Document - Sourjyendra - Data Compression Techniques - Lecture 7 - Dictionary Compression (DCT2015-Lecture7Web)
No ratings yet
Testing - Document - Sourjyendra - Data Compression Techniques - Lecture 7 - Dictionary Compression (DCT2015-Lecture7Web)
40 pages
Bstm20oe201 2ND Sem Sy2024 2025
No ratings yet
Bstm20oe201 2ND Sem Sy2024 2025
1 page
Data Compression
No ratings yet
Data Compression
35 pages
Huffman and Arithmetic Coding
No ratings yet
Huffman and Arithmetic Coding
10 pages