TEOI-Capacity of Discrete Channels
Information Theory
Degree in Data Science and Engineering
Lesson 5: Capacity of discrete channels
2019/20 - Q1
Definition of communication
What is the origin of errors? Thermal noise in all electronic devices, and also:
dust or scratches in HDD and DVD, solar wind in satellite comms, nearby
transmissions in cellular communications, cosmic rays in solid state memories,
propellers and biological noise in underwater comms, black body radiation, ...
Goals
We want to elucidate...
Formal definitions
$$p(y^n \mid x^n) = \prod_{i=1}^{n} p(y_i \mid x_i)$$

The conditional probability of error given that index $i$ was sent is

$$\lambda_i = \Pr\left( g(Y^n) \neq i \mid x^n(i) \right) = \sum_{y^n \in \mathcal{Y}^n} p\left( y^n \mid x^n(i) \right) I\left( g(y^n) \neq i \right)$$
The maximal probability of error for an $(M, n)$ code is

$$\lambda^{(n)} = \max_{i \in \{1, 2, \ldots, M\}} \lambda_i$$

The average probability of error $P_e^{(n)}$ for an $(M, n)$ code is

$$P_e^{(n)} = \frac{1}{M} \sum_{i=1}^{M} \lambda_i$$

The rate $R$ of an $(M, n)$ code is $R = \frac{\log M}{n}$ bits per transmission.

A rate $R$ is said to be achievable if there exists a sequence of $(\lceil 2^{nR} \rceil, n)$ codes such that the maximal probability of error $\lambda^{(n)}$ tends to $0$ as $n \to \infty$.
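To make these definitions concrete, the following sketch (an illustration added here, not part of the original material) evaluates $\lambda_i$, $\lambda^{(n)}$, $P_e^{(n)}$ and $R$ for a length-3 repetition code used over a BSC with an assumed crossover probability $p = 0.1$ and majority-vote decoding.

```python
from itertools import product
from math import log2

p = 0.1                                # BSC crossover probability (assumed value)
codebook = [(0, 0, 0), (1, 1, 1)]      # M = 2 codewords, n = 3 (repetition code)
M, n = len(codebook), len(codebook[0])

def g(y):
    """Majority-vote decoder: returns the index of the decoded codeword."""
    return 0 if sum(y) < n / 2 else 1

def p_y_given_x(y, x):
    """Memoryless channel: p(y^n | x^n) = prod_i p(y_i | x_i)."""
    prob = 1.0
    for yi, xi in zip(y, x):
        prob *= (1 - p) if yi == xi else p
    return prob

# lambda_i = sum over all y^n of p(y^n | x^n(i)) * 1{g(y^n) != i}
lam = [sum(p_y_given_x(y, x) for y in product([0, 1], repeat=n) if g(y) != i)
       for i, x in enumerate(codebook)]

lam_max = max(lam)            # maximal probability of error lambda^(n)
Pe = sum(lam) / M             # average probability of error P_e^(n)
R = log2(M) / n               # rate in bits per transmission
print(lam, lam_max, Pe, R)    # each lambda_i = 3 p^2 (1-p) + p^3 ≈ 0.028, R = 1/3
```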
Range of application
But the model is also useful for transmission schemes where the modulator and demodulator are considered part of the channel.
Channel capacity
$$C = \max_{p(x)} I(X; Y)$$
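The maximization over $p(x)$ rarely has a closed form, but it can be carried out numerically. Below is a minimal sketch of the Blahut-Arimoto iteration (an addition for illustration; the function name, tolerance and test channel are assumptions, not part of the slides).

```python
import numpy as np

def channel_capacity(W, tol=1e-9, max_iter=10_000):
    """Blahut-Arimoto estimate of C = max_{p(x)} I(X;Y) in bits.

    W[x, y] = p(y|x) is the transition matrix of a DMC (rows sum to 1).
    Returns the estimated capacity and the maximizing input distribution.
    """
    nx = W.shape[0]
    r = np.full(nx, 1.0 / nx)                              # start from a uniform input
    for _ in range(max_iter):
        py = r @ W                                         # current output distribution
        q = (r[:, None] * W) / np.where(py > 0, py, 1.0)   # posterior p(x|y)
        # Multiplicative update: r(x) proportional to prod_y q(x|y)^W[x,y]
        r_new = np.exp(np.sum(np.where(W > 0, W * np.log(np.where(q > 0, q, 1.0)), 0.0), axis=1))
        r_new /= r_new.sum()
        done = np.max(np.abs(r_new - r)) < tol
        r = r_new
        if done:
            break
    py = np.where(r @ W > 0, r @ W, 1.0)
    logterm = np.where(W > 0, np.log2(np.where(W > 0, W, 1.0) / py), 0.0)
    return float(np.sum(r[:, None] * W * logterm)), r

# Sanity check on a BSC with crossover probability 0.1: C = 1 - H(0.1) ≈ 0.531 bits
print(channel_capacity(np.array([[0.9, 0.1], [0.1, 0.9]]))[0])
```

The iteration is known to converge to the capacity-achieving input distribution; here it simply serves as a numerical cross-check for the examples that follow.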
Noiseless channel
$$p(y_i \mid x_j) = \delta_{i,j}$$
Noisy typewriter

We can compute the capacity as follows. With a uniform input distribution, the posterior of the input given, for instance, output $B$ is

$$p(x \mid y = B) = \begin{cases} 1/2 & \text{if } x = A \\ 1/2 & \text{if } x = B \\ 0 & \text{otherwise} \end{cases}$$

so that

$$H(X \mid Y) = \sum_{y=A}^{Z} p(y) H(X \mid Y = y) = 1 \text{ bit}$$

and, since the uniform input maximizes $I(X;Y)$ by symmetry, $C = H(X) - H(X \mid Y) = \log 26 - 1 = \log 13 \approx 3.7$ bits per transmission.
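As a numerical cross-check (added here for illustration, not on the slides), the snippet below builds the $26 \times 26$ noisy-typewriter transition matrix, in which each letter is received either unchanged or shifted to the next letter with probability $1/2$, and evaluates $H(X \mid Y)$ and $I(X;Y)$ under a uniform input.

```python
import numpy as np

A = 26                               # alphabet size (A..Z)
W = np.zeros((A, A))                 # W[x, y] = p(y | x)
for x in range(A):
    W[x, x] = 0.5                    # received unchanged
    W[x, (x + 1) % A] = 0.5          # or shifted to the next letter

px = np.full(A, 1.0 / A)             # uniform input distribution
pxy = px[:, None] * W                # joint distribution p(x, y)
py = pxy.sum(axis=0)                 # output distribution (also uniform)
p_x_given_y = pxy / py               # posterior p(x | y): two entries of 1/2 per column

H_X_given_Y = -np.sum(pxy * np.log2(np.where(pxy > 0, p_x_given_y, 1.0)))
I = np.log2(A) - H_X_given_Y         # I(X;Y) = H(X) - H(X|Y)
print(H_X_given_Y, I, np.log2(13))   # 1.0, ~3.7004, ~3.7004
```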
For large block lengths, every channel looks like the noisy typewriter: any input is
likely to produce a channel output in a small subset of the output alphabet.
Then, capacity is obtained from the non-confusable subset of inputs that
produce disjoint output sequences (will be discussed later in this lesson in the
noisy-channel capacity theorem).
For the binary symmetric channel with crossover probability $p$, the capacity is

$$C = 1 - H(p) \text{ bits per transmission}$$

Note that the rate at which we can transmit information is not $(1 - p)$ bits per channel use, since the receiver does not know when an error occurs. In fact, if $p = \tfrac{1}{2}$, we cannot transmit any information at all!
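A quick numerical illustration of this formula (added here; the values of $p$ are arbitrary examples):

```python
from math import log2

def H2(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.01, 0.1, 0.25, 0.5):
    print(f"p = {p:<4}  C = 1 - H(p) = {1 - H2(p):.3f} bits/transmission")
# p = 0.5 gives C = 0: the output is then independent of the input.
```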
Let us compute

$$H(Y) = -\sum_{y \in \mathcal{Y}} p(y) \log p(y), \quad \text{where } p(y) = \sum_{x \in \mathcal{X}} p(y \mid x) p(x),$$

and maximize it with respect to $p(x)$.
$$C = \max_{\pi} (1 - \alpha) H(\pi) = 1 - \alpha$$

The intuition for this expression is the following: since a fraction $\alpha$ of the symbols is lost, we can only transmit a fraction $(1 - \alpha)$ of them; the maximum is attained at $\pi = 1/2$.
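This maximization can be checked numerically; the sketch below (illustrative, with an assumed $\alpha = 0.3$) evaluates $(1-\alpha)H(\pi)$ on a grid of input probabilities $\pi = \Pr(X = 1)$ and confirms that the maximum, $1 - \alpha$, is attained at $\pi = 1/2$.

```python
import numpy as np

alpha = 0.3                                    # erasure probability (assumed value)
pi = np.linspace(1e-9, 1 - 1e-9, 100_001)      # grid over Pr(X = 1)
H_pi = -pi * np.log2(pi) - (1 - pi) * np.log2(1 - pi)
I = (1 - alpha) * H_pi                         # I(X;Y) for the binary erasure channel
print(pi[np.argmax(I)], I.max())               # ~0.5 and ~0.7 = 1 - alpha
```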
Symmetric channel

Examples: $Y = (X + Z) \bmod c$, with the noise $Z$ independent of $X$.

$$I(X; Y) = H(Y) - H(Y \mid X) = H(Y) - H(\mathbf{q})$$

where $\mathbf{q}$ is a row of the transition matrix. With a uniform input, $p(y)$ is proportional to $c(y)$, where $c(y)$ is the sum of the elements in the $y$-th column of the transition matrix; $c(y)$ is constant for both symmetric and quasi-symmetric channels, so $H(Y)$ reaches $\log |\mathcal{Y}|$. Therefore the capacity is

$$C = \max_{p(x)} I(X; Y) = \log |\mathcal{Y}| - H(\mathbf{q})$$
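A small check of this formula (added for illustration; the noise distribution $\mathbf{q}$ below is an assumed example): for $Y = (X + Z) \bmod 3$ with $Z \sim \mathbf{q}$, the mutual information under a uniform input equals $\log_2 3 - H(\mathbf{q})$.

```python
import numpy as np

q = np.array([0.8, 0.1, 0.1])          # distribution of the noise Z (assumed, strictly positive)
c = len(q)
W = np.array([np.roll(q, x) for x in range(c)])   # row x: p(y|x) = q[(y - x) mod c]

px = np.full(c, 1.0 / c)               # uniform input (capacity-achieving)
py = px @ W                            # output is uniform, so H(Y) = log2(c)
H_Y = -np.sum(py * np.log2(py))
H_Y_given_X = -np.sum(px[:, None] * W * np.log2(W))   # every row has entropy H(q)
H_q = -np.sum(q * np.log2(q))
print(H_Y - H_Y_given_X, np.log2(c) - H_q)            # both ~0.663 bits
```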
Pattern recognition
Consider the problem of recognizing handwritten digits. In this case the input
to the channel is a decimal digit $X \in \mathcal{X} = \{0, 1, 2, \ldots, 9\}$. What comes out is
a pattern of ink on paper that can be represented as a vector $y$.
Natural evolution
Natural evolution can be considered as a channel that models how information
about the environment is transferred to the genome.
$$C = \max_{p(x)} I(X; Y)$$

1. $C \geq 0$, since $I(X; Y) \geq 0$.
2. $C \leq \log |\mathcal{X}|$, since $C = \max_{p(x)} I(X; Y) \leq \max_{p(x)} H(X) = \log |\mathcal{X}|$.
3. $C \leq \log |\mathcal{Y}|$, for the same reason.
4. $I(X; Y)$ is a continuous function of $p(x)$.
5. $I(X; Y)$ is a concave function of $p(x)$.
6. Reminder of the relations between information and entropy in graphic form (diagram omitted).
A magnetic hard disk drive (HDD) records data by magnetizing a thin film of
ferromagnetic material in flat circular disks. Bits are stored by changing the
direction of magnetization through a magnetic coil head. A reading head is
used to detect the magnetization of the material underneath.
For $P_e^N = 10^{-15}$, $N$ must be at least 68: we need that many parallel disks to achieve the target bit error probability!
Definition
The set $A_\epsilon^{(n)}$ of jointly typical sequences $(x^n, y^n)$ is the set of $n$-sequences with empirical entropies $\epsilon$-close to the true ones, that is,

$$A_\epsilon^{(n)} = \left\{ (x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \left| -\tfrac{1}{n} \log p(x^n) - H(X) \right| < \epsilon, \; \left| -\tfrac{1}{n} \log p(y^n) - H(Y) \right| < \epsilon, \; \left| -\tfrac{1}{n} \log p(x^n, y^n) - H(X, Y) \right| < \epsilon \right\}$$

where $p(x^n, y^n) = \prod_{i=1}^{n} p(x_i, y_i)$.
Example
Let us evaluate the joint typicality of two sequences of length $n = 100$: $x^n$, drawn i.i.d. with $p(x = 0) = 0.9$, was transmitted through a binary symmetric channel with crossover probability $0.2$, and $y^n$ was received.
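A minimal sketch of this check (added here; the sequences are drawn at random rather than copied from the slide, and the margin $\epsilon = 0.2$ is an assumed value):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 100, 0.2                     # block length and typicality margin (eps assumed)
p0, pflip = 0.9, 0.2                  # Pr(x = 0) and BSC crossover probability

x = (rng.random(n) > p0).astype(int)                 # x_i = 1 with probability 0.1
y = x ^ (rng.random(n) < pflip).astype(int)          # BSC output

px = np.array([p0, 1 - p0])
pyx = np.array([[1 - pflip, pflip], [pflip, 1 - pflip]])   # p(y|x)
pxy = px[:, None] * pyx                                    # joint p(x, y)
py = pxy.sum(axis=0)

H_X = -np.sum(px * np.log2(px))
H_Y = -np.sum(py * np.log2(py))
H_XY = -np.sum(pxy * np.log2(pxy))

# Empirical per-symbol log-likelihoods -(1/n) log2 p(.) of the observed sequences
e_x = -np.mean(np.log2(px[x]))
e_y = -np.mean(np.log2(py[y]))
e_xy = -np.mean(np.log2(pxy[x, y]))

typical = (abs(e_x - H_X) < eps) and (abs(e_y - H_Y) < eps) and (abs(e_xy - H_XY) < eps)
print((e_x, H_X), (e_y, H_Y), (e_xy, H_XY), typical)
```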
Theorem (5.2)

For sufficiently large $n$, $(1 - \epsilon)\, 2^{n(H(X,Y) - \epsilon)} \leq \left| A_\epsilon^{(n)} \right| \leq 2^{n(H(X,Y) + \epsilon)}$.

Theorem (5.3)

If $(\tilde{X}^n, \tilde{Y}^n) \sim p(x^n)\, p(y^n)$ (they are independent with the same marginals as $X^n$ and $Y^n$), then the probability of being jointly typical is upper bounded by

$$\Pr\left( (\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^{(n)} \right) \leq 2^{-n(I(X;Y) - 3\epsilon)}$$
Theorem (Channel coding theorem)

For a discrete memoryless channel, all rates below capacity are achievable: for every $R < C$ there exists a sequence of $(\lceil 2^{nR} \rceil, n)$ codes whose maximal probability of error $\lambda^{(n)} \to 0$. Conversely, any sequence of codes with $\lambda^{(n)} \to 0$ must have $R \leq C$.

Proof. See annex 2 for a separate proof of the achievability and converse parts.
Which rates are achievable beyond capacity if some error can be accepted?

Theorem (Rate distortion)

For a channel of capacity $C$, transmission rates up to

$$R = \frac{C}{1 - H(P_e)}$$

can be achieved at a probability of bit error $P_e$.
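For instance (a numeric illustration with assumed values): with $C = 0.5$ bits per transmission and a tolerated bit error probability $P_e = 0.1$, rates up to roughly $0.94$ bits per transmission become achievable.

```python
from math import log2

def H2(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)   # binary entropy in bits

C, Pe = 0.5, 0.1                 # assumed capacity and tolerated bit error probability
print(C / (1 - H2(Pe)))          # ≈ 0.94 bits per transmission
```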
For a discrete memoryless channel, feedback does not increase the capacity:

$$C_{FB} = C = \max_{p(x)} I(X; Y)$$
Example
Feedback cannot provide higher rates, but it helps in simplifying encoding and
decoding in practical systems.
As an example, let us use feedback in the binary erasure channel: transmit the same bit through the channel repeatedly until it is received without erasure, so that the information bit arrives correctly.
Can you compute the average number of uses of the channel it takes to
transmit an information bit through this channel?
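As a hint (an addition, not on the slide): each transmission succeeds independently with probability $1 - \alpha$, so the number of uses per bit is geometric with mean $1/(1 - \alpha)$. A minimal simulation with an assumed $\alpha$ confirms this.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, bits = 0.3, 200_000               # erasure probability (assumed) and number of bits sent

# For each information bit, the number of channel uses until the first non-erasure
uses = rng.geometric(p=1 - alpha, size=bits)
print(uses.mean(), 1 / (1 - alpha))      # both ~1.43 channel uses per information bit
```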
Conclusions
Way through...
Proof. We show that, with high probability, the sequences $(X^n, Y^n)$ of length $n$ are jointly typical. By the weak law of large numbers,

$$-\frac{1}{n} \log p(X^n) \to -E\left[ \log p(X) \right] = H(X),$$

and similarly for $-\frac{1}{n} \log p(Y^n)$ and $-\frac{1}{n} \log p(X^n, Y^n)$. Hence, given $\epsilon > 0$, there exist $n_1$, $n_2$, $n_3$ such that for all $n > n_1$, $n > n_2$, $n > n_3$, respectively,

$$\Pr\left( \left| -\frac{1}{n} \log p(X^n) - H(X) \right| \geq \epsilon \right) < \frac{\epsilon}{3},$$

$$\Pr\left( \left| -\frac{1}{n} \log p(Y^n) - H(Y) \right| \geq \epsilon \right) < \frac{\epsilon}{3},$$

$$\Pr\left( \left| -\frac{1}{n} \log p(X^n, Y^n) - H(X, Y) \right| \geq \epsilon \right) < \frac{\epsilon}{3}.$$

Then, choosing $n > \max(n_1, n_2, n_3)$ and using the union bound $\Pr(A \cup B \cup C) \leq \Pr(A) + \Pr(B) + \Pr(C)$, the probability that at least one of the three conditions fails is less than $\epsilon$. Hence, for $n$ sufficiently large, the probability of the set $A_\epsilon^{(n)}$ is greater than $1 - \epsilon$.
For the lower bound, take Theorem 5.1: if $n$ is sufficiently large,

$$1 - \epsilon < \Pr\left( A_\epsilon^{(n)} \right) \leq \sum_{(x^n, y^n) \in A_\epsilon^{(n)}} 2^{-n(H(X,Y) - \epsilon)} = 2^{-n(H(X,Y) - \epsilon)} \left| A_\epsilon^{(n)} \right|,$$

and hence $\left| A_\epsilon^{(n)} \right| \geq (1 - \epsilon)\, 2^{n(H(X,Y) - \epsilon)}$. The set may be small; its size depends on the joint entropy of $X$ and $Y$.
Proof. Under the conditions stated in the theorem, and using Theorem 5.2,

$$\Pr\left( (\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^{(n)} \right) = \sum_{(x^n, y^n) \in A_\epsilon^{(n)}} p(x^n)\, p(y^n) \leq 2^{n(H(X,Y) + \epsilon)}\, 2^{-n(H(X) - \epsilon)}\, 2^{-n(H(Y) - \epsilon)} = 2^{-n(I(X;Y) - 3\epsilon)}$$
$$\mathcal{C} = \begin{bmatrix} x^n(1) \\ \vdots \\ x^n(2^{nR}) \end{bmatrix} = \begin{bmatrix} x_1(1) & x_2(1) & \cdots & x_n(1) \\ \vdots & \vdots & \ddots & \vdots \\ x_1(2^{nR}) & x_2(2^{nR}) & \cdots & x_n(2^{nR}) \end{bmatrix}$$
ii. The random code used is revealed to both sender and receiver, who also know the channel transition matrix $p(y \mid x)$.
vi. The receiver guesses which message was sent according to the joint typicality criterion: $\hat{w}$ is declared to have been sent if the following conditions are satisfied:
- $(x^n(\hat{w}), y^n)$ is jointly typical, and
- there is no other index $w' \neq \hat{w}$ such that $(x^n(w'), y^n) \in A_\epsilon^{(n)}$;
otherwise, an error is declared.
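A schematic implementation of this decoding rule (an illustrative sketch only: the codebook, channel, block length and $\epsilon$ below are assumed toy values, and joint typicality is tested exactly as in the definition of $A_\epsilon^{(n)}$):

```python
import numpy as np

def jointly_typical(x, y, px, pyx, eps):
    """Test the three conditions defining the jointly typical set A_eps^(n)."""
    pxy = px[:, None] * pyx
    py = pxy.sum(axis=0)
    H_X = -np.sum(px * np.log2(px))
    H_Y = -np.sum(py * np.log2(py))
    H_XY = -np.sum(pxy * np.log2(pxy))
    return (abs(-np.mean(np.log2(px[x])) - H_X) < eps and
            abs(-np.mean(np.log2(py[y])) - H_Y) < eps and
            abs(-np.mean(np.log2(pxy[x, y])) - H_XY) < eps)

def typical_set_decoder(y, codebook, px, pyx, eps):
    """Return w_hat if exactly one codeword is jointly typical with y, else None (error)."""
    hits = [w for w, xw in enumerate(codebook) if jointly_typical(xw, y, px, pyx, eps)]
    return hits[0] if len(hits) == 1 else None

# Toy usage: a random codebook over a BSC, one transmitted codeword
rng = np.random.default_rng(2)
n, M, pflip, eps = 1000, 4, 0.1, 0.08
px = np.array([0.5, 0.5])
pyx = np.array([[1 - pflip, pflip], [pflip, 1 - pflip]])
codebook = rng.integers(0, 2, size=(M, n))
w = 2
y = codebook[w] ^ (rng.random(n) < pflip).astype(int)
print(typical_set_decoder(y, codebook, px, pyx, eps))   # decodes w = 2 with high probability
```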
Example of typical-set decoding for a codebook with 4 codewords: $y_a^n$ is not jointly typical with any codeword, $y_b^n$ is jointly typical with $x^n(3)$, $y_c^n$ is jointly typical with $x^n(4)$, and $y_d^n$ is jointly typical with more than one codeword.
A key point: we can make the last term in red smaller than $\epsilon$ by increasing $n$, provided that $R < I(X; Y) - 3\epsilon$, and then $\Pr(E) \leq 2\epsilon$.
c. Throw away the worst half of the codewords (so that we can bound the maximal probability of error $\lambda_{\max}(\mathcal{C}^*)$, not only the average):

$$\Pr(E \mid \mathcal{C}^*) = \frac{1}{2} \left[ \frac{1}{2^{nR-1}} \sum_{\text{best } i} \lambda_i(\mathcal{C}^*) + \frac{1}{2^{nR-1}} \sum_{\text{worst } i} \lambda_i(\mathcal{C}^*) \right] \leq 2\epsilon$$
A typical random code (left), where a small fraction of the codewords are sufficiently
close to each other that the probability of error when either codeword is transmitted is
not tiny. We obtain a new code by deleting all these confusable codewords (right).
The resulting code has fewer codewords, so it has a lower rate, and its maximal probability of error is greatly reduced.
The number of codewords has changed to $2^{nR-1}$, and therefore the rate is

$$\frac{1}{n} \log\left( 2^{nR-1} \right) = \frac{1}{n} (nR - 1) = R - \frac{1}{n}$$

In short: we have been able to turn a noisy channel into an essentially noiseless one, as long as the transmission rate is below the capacity, just by constructing a code of rate

$$R' = R - \frac{1}{n},$$

whose maximal probability of error is $\lambda^{(n)} \leq 4\epsilon$.
$$W \to X^n(W) \to Y^n \to \hat{W}$$

If $W$ has a uniform distribution, $\Pr(\hat{W} \neq W) = P_e^{(n)} = \frac{1}{2^{nR}} \sum_i \lambda_i$, and hence

$nR = H(W)$
$\quad = H(W \mid \hat{W}) + I(W; \hat{W})$  (entropy identity)
$\quad \leq 1 + P_e^{(n)} nR + I(W; \hat{W})$  (Fano's inequality, upper bounding $H(P_e^{(n)}) < 1$)
$\quad \leq 1 + P_e^{(n)} nR + I(X^n; Y^n)$  (data processing inequality)
$\quad \leq 1 + P_e^{(n)} nR + nC$  (repeated use of the channel does not increase capacity; see annex 6)

Dividing by $n$ we obtain

$$R \leq P_e^{(n)} R + C + \frac{1}{n}$$

For any sequence of codes with $P_e^{(n)} \to 0$ (that is, for any achievable rate), letting $n \to \infty$ gives $R \leq C$.
Since we are interested in lossy transmission, let us reverse the use of the encoder and decoder, i.e. the BSC decoder is used as a lossy encoder, whose input is a sequence of $n$ symbols and whose output is a codeword of length $k < n$; its rate is $R' = n/k = 1/(1 - H(q))$.
Proof (cont.). The lossy decoder will take a sequence of $k$ symbols and convert it into a sequence of $n$ symbols. Let us concatenate them with the capacity-$C$ channel together with its own optimum encoder/decoder as follows...

The lossy encoding is a surjective mapping and the lossy decoding is a bijective mapping. Both are designed using the joint typicality principle for the BSC, so $\hat{w}$ and $w$ will differ in about $nq$ symbols, hence $P_e = q$.

Now the rate of the transmission with errors is

$$R = \frac{k}{\#\text{transmissions}} \cdot \frac{n}{k} = C \cdot R' = \frac{C}{1 - H(P_e)}$$
$$nR \leq 1 + P_e^{(n)} nR + nC,$$

so when $n \to \infty$ (and $P_e^{(n)} \to 0$), $R \leq C$.
$$\Pr\left( v^n \neq \hat{v}^n \right) \leq \Pr\left( v^n \notin A_\epsilon^{(n)} \right) + \Pr\left( g(y^n) \neq v^n \mid v^n \in A_\epsilon^{(n)} \right),$$

and both terms can be made small for large $n$. Therefore we can reconstruct the original sequence with low probability of error if $H(V) \leq C$.
Proof that repeated use of the channel does not increase capacity, i.e. $I(X^n; Y^n) \leq nC$:

$I(X^n; Y^n) = H(Y^n) - H(Y^n \mid X^n)$
$\quad = H(Y^n) - \sum_{i=1}^{n} H(Y_i \mid Y_1, \ldots, Y_{i-1}, X^n)$  (chain rule of entropy)
$\quad = H(Y^n) - \sum_{i=1}^{n} H(Y_i \mid X_i)$  (definition of a memoryless channel)
$\quad \leq \sum_{i=1}^{n} H(Y_i) - \sum_{i=1}^{n} H(Y_i \mid X_i)$  (since $H(Y^n) \leq \sum_{i} H(Y_i)$, the independence bound on entropy)
$\quad = \sum_{i=1}^{n} I(X_i; Y_i)$  (definition of mutual information)
$\quad \leq nC$  (each $I(X_i; Y_i) \leq C$ by the definition of capacity)