

Information Theory
Degree in Data Science and Engineering
Lesson 5: Capacity of discrete channels

Jordi Quer, Josep Vidal

Mathematics Department, Signal Theory and Communications Department


{jordi.quer, josep.vidal}@upc.edu

2019/20 - Q1


Definition of communication

Communication between two points A and B is a procedure whereby physical acts in A induce a desired state in B.

Successful communication: when A and B agree on what was sent, in spite of the noise and imperfections of the signalling process that might induce errors.

Data communication systems are strongly related to the transmission medium and the design of the transceiver...


Some real communication channels

Example media: light on optical fiber, acoustic medium, electric signals on copper cable, electromagnetic waves, wireless optical, data storage devices.

What is the origin of errors? Thermal noise in all electronic devices, and also: dust or scratches in HDD and DVD, solar wind in satellite comms, nearby transmissions in cellular communications, cosmic rays in solid state memories, propellers and biological noise in underwater comms, black body radiation,...

Goals

We want to elucidate...

How can we design distinguishable codewords in B so that we can reconstruct the original sequence sent in A with arbitrarily low probability of error?
Channel capacity theorem
At which rate can these codewords transmit information?
Channel capacity theorem
Can feedback from the receiver improve the transmission rate?
Feedback capacity theorem
Is it efficient to separately design source and channel encoders?
Joint source-channel coding theorem


Formal definitions

Discrete noisy channel. It consists of an input alphabet X, an output alphabet Y and a set of transition probabilities p(y|x) accounting for the probability of observing the output symbol y when x was sent. Therefore, different input sequences may give rise to the same output sequence.

Memoryless channel. The current output y_i only depends on the input x_i, and is independent of past inputs and past outputs:

    p(y_i | x_i, x_{i-1}, ..., x_1, y_{i-1}, ..., y_1) = p(y_i | x_i)

Message. It is a random variable W taking values in a set {1, 2, ..., M}. The encoder selects a codeword X^n(W), which is received as Y^n. The decoder then guesses the index W by an appropriate decoding rule Ŵ = g(Y^n). The receiver declares an error if Ŵ ≠ W.


Formal definitions

If x^n ∈ X^n is the transmitted codeword, the probability of receiving y^n ∈ Y^n is

    p(y^n | x^n) = Π_{i=1}^{n} p(y_i | x_i)

The conditional probability of error given that index i was sent is

    λ_i = Pr(g(Y^n) ≠ i | x^n(i)) = Σ_{y^n ∈ Y^n} p(y^n | x^n(i)) I(g(y^n) ≠ i)


Formal definitions

The maximal probability of error λ^(n) for an (M, n) code is

    λ^(n) = max_{i ∈ {1, 2, ..., M}} λ_i

The average probability of error P_e^(n) for an (M, n) code is

    P_e^(n) = (1/M) Σ_{i=1}^{M} λ_i

The rate R of an (M, n) code is R = (log M)/n bits per transmission.

A rate R is said to be achievable if there exists a sequence of (⌈2^{nR}⌉, n) codes such that λ^(n) → 0 as n → ∞.


Range of application

The models used in the sequel do not include:

Discrete-time continuous-amplitude input/output channels
Channels with memory (i.e. frequency-selective channels, where outputs depend on previous inputs)
Multi-user transmissions (e.g. multiple-access channels, broadcast channels, interference channels)

But the model is useful for transmission schemes where modulator and demodulator are part of the channel.


Channel capacity

The capacity of the channel is defined as

    C = max_{p(x)} I(X; Y)

where the maximum is taken over all input distributions p(x).

It is measured in bits/transmission, or bits/channel use.
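As a numerical illustration (not part of the original slides), a minimal Python sketch that evaluates the definition directly, assuming the channel is given by a transition matrix Q[x, y] = p(y|x): it computes I(X; Y) for a fixed input distribution and approximates C for a binary-input channel by a grid search over p(x).

```python
import numpy as np

def mutual_information(px, Q):
    """I(X;Y) in bits for input pmf px and transition matrix Q[x, y] = p(y|x)."""
    px = np.asarray(px, dtype=float)
    Q = np.asarray(Q, dtype=float)
    py = px @ Q                                   # p(y) = sum_x p(x) p(y|x)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(Q > 0, Q / py, 1.0)      # p(y|x)/p(y), 1 where p(y|x) = 0
        terms = px[:, None] * Q * np.log2(ratio)
    return float(np.nansum(terms))

def capacity_binary_input(Q, grid=10001):
    """Brute-force C = max_{p(x)} I(X;Y) for a 2-input channel (grid over p(x=0))."""
    return max(mutual_information([p0, 1.0 - p0], Q)
               for p0 in np.linspace(0.0, 1.0, grid))

if __name__ == "__main__":
    p = 0.1                                       # BSC crossover probability
    Q_bsc = [[1 - p, p], [p, 1 - p]]
    print(capacity_binary_input(Q_bsc))           # ~0.531 = 1 - H(0.1), matching the BSC result later in the lesson
```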

The following duality should be noted:

Data compression - use a source encoder to remove redundancy and compress.
Data transmission - use a channel encoder to add redundancy and combat channel errors.


Examples of discrete channels

Noiseless channel

    p(y_i | x_j) = δ_{i,j}

    I(X; Y) = H(X) − H(X|Y) = H(p) − 0

    C = max_{p(x)} I(X; Y) = 1 bit/tr

Noisy channel with non-overlapping outputs

The channel is random, but the inputs can be reconstructed from the outputs without errors, so H(X|Y) = 0 and

    C = max_{p(x)} I(X; Y) = 1 bit/tr


Examples of discrete channels

Noisy typewriter

We can compute capacity as:

    p(x | y = B) = 1/2 if x = A,  1/2 if x = B,  0 otherwise

    H(X|Y) = Σ_{y=A}^{Z} p(y) H(X|Y = y) = 1 bit

    C = max_{p(x)} I(X; Y) = max_{p(x)} (H(X) − 1) = log 26 − 1 = log 13 bits/tr

Note that if a uniform distribution is used for X, we have a uniform distribution for Y.
The result shows that, using only the alternate input symbols, we can transmit 13 symbols without errors, so C = log 13.


Examples of discrete channels


Noisy typewriter (cont.)

Non-confusable set of inputs for a three-output noisy typewriter channel:

For large block lengths, every channel looks like the typewriter: any input is likely to produce a channel output in a small subset of the output alphabet. Then, capacity is obtained from the non-confusable subset of inputs that produce disjoint output sequences (this will be discussed later in this lesson in the noisy-channel capacity theorem).

Examples of discrete channels

Binary symmetric channel

    p(y|x) = 1 − p if y = x,  p if y ≠ x

    I(X; Y) = H(Y) − H(Y|X)
            = H(Y) − Σ_{x ∈ X} p(x) H(Y|X = x)
            = H(Y) + p log p + (1 − p) log(1 − p)
            = H(Y) − H(p)

The maximum is achieved when p(x = 1) = 1/2; then p(y = 1) = 1/2, so

    C = 1 − H(p) bits/tr

Note that the rate at which we can transmit information is not (1 − p) bits per channel use, since the receiver does not know when an error occurs. In fact, if p = 1/2, we cannot transmit any information at all!
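A quick numeric check of C = 1 − H(p) (a small sketch, not from the slides): the capacity is symmetric in p ↔ 1 − p and vanishes at p = 1/2.

```python
import math

def H2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.2, 0.5, 0.9):
    print(f"p = {p:.2f}   C = {1 - H2(p):.4f} bits/transmission")
```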


Examples of discrete channels

Binary erasure channel

Some bits are lost, rather than corrupted:

    p(y|x = 0) = 1 − α if y = 0,  α if y = e,  0 if y = 1
    p(y|x = 1) = 0 if y = 0,  α if y = e,  1 − α if y = 1

    C = max_{p(x)} I(X; Y) = max_{p(x)} (H(Y) − H(Y|X)) = max_{p(x)} H(Y) − H(α)

Let us compute

    H(Y) = − Σ_{y ∈ Y} p(y) log p(y)   where   p(y) = Σ_{x ∈ X} p(y|x) p(x)

and maximize it w.r.t. p(x).


Examples of discrete channels

Binary erasure channel (cont.)

By computing the probability of the output it is clear that, for an arbitrary value of α, we cannot make the Y symbols equiprobable, and H(Y) < log 3:

    H(Y) = H(α) + (1 − α) H(π)

where π = p(x = 1). Therefore,

    C = max_{π} (1 − α) H(π)

so the maximum is achieved at π = 1/2, and C = (1 − α) bits/tr.

The intuition for the expression is the following: since a fraction α of symbols is lost, we can only transmit a fraction (1 − α).
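A short sanity check (a sketch, not from the slides, with an arbitrarily chosen α): maximizing (1 − α)H(π) numerically over π recovers π = 1/2 and C = 1 − α.

```python
import numpy as np

def H2(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

alpha = 0.3
pi = np.linspace(0.001, 0.999, 999)        # pi = p(x = 1)
I = (1 - alpha) * H2(pi)                   # I(X;Y) = H(Y) - H(alpha) = (1 - alpha) H(pi)
print(pi[np.argmax(I)], I.max())           # maximum at pi = 0.5, value 1 - alpha = 0.7
```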


Examples of discrete channels

Symmetric channel

Weakly symmetric channel: in the matrix Q that contains the transition probabilities, rows are permutations of other rows, and all the column sums are equal.
Symmetric channel: in the matrix Q, rows are permutations of other rows, and columns are permutations of other columns.

Examples:

Weakly symmetric channel:
    Q = [ 1/3  1/6  1/2
          1/3  1/2  1/6 ]

Non-symmetric channel:
    Q = [ 0.6  0.2  0.2
          0.1  0.2  0.7 ]

Symmetric channel: Y = X + Z mod c, with X, Z ∈ {0, 1, ..., c − 1}, X and Z independent, and p(z) arbitrary. Determine Q as an exercise.

Examples of discrete channels

Symmetric channel (cont.)

Both for symmetric and weakly symmetric channels:

    I(X; Y) = H(Y) − H(Y|X) = H(Y) + Σ_{x ∈ X} p(x) Σ_{y ∈ Y} p(y|x) log p(y|x) = H(Y) − H(q)

where q is a row of Q. If p(x) is uniform,

    p(y) = Σ_{x ∈ X} p(y|x) p(x) = (1/|X|) Σ_{x ∈ X} p(y|x) = c(y)/|X|

where c(y) is the sum of the elements in the y-th column of the transition matrix, and it is constant for both symmetric and weakly symmetric channels. Therefore the capacity is

    C = max_{p(x)} I(X; Y) = log |Y| − H(q)
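A minimal sketch (not from the slides) applying the formula to the weakly symmetric example Q given earlier; since uniform input is optimal here, C = log|Y| − H(q) with q any row.

```python
import numpy as np

Q = np.array([[1/3, 1/6, 1/2],
              [1/3, 1/2, 1/6]])            # weakly symmetric: rows are permutations of each other,
                                           # and all column sums equal 2/3
q = Q[0]
H_q = -np.sum(q * np.log2(q))              # entropy of a row
C = np.log2(Q.shape[1]) - H_q              # C = log|Y| - H(q), achieved by uniform p(x)
print(C)                                   # ~0.126 bits/transmission
```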


Examples of discrete channels

Pattern recognition

Consider the problem of recognizing handwritten digits. In this case the input to the channel is a decimal digit X ∈ X = {0, 1, 2, ..., 9}. What comes out is a pattern of ink on paper that can be represented as a vector y.

If the ink pattern is digitized to 16 × 16 binary pixels, the output of the channel is a vector random variable Y ∈ {0, 1}^256 (unlike previous examples, where one scalar input produced one scalar output).

One strategy for pattern recognition (that is, decoding) is to build a model for p(y|x) and use it to infer X given Y using Bayes' theorem:

    x̂ = argmax_x p(x|y) = argmax_x p(y|x) p(x) / p(y)
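A minimal sketch of this decoding rule (not from the slides), assuming a hypothetical per-pixel Bernoulli model theta[x, j] = Pr(pixel j is on | digit x); a real recognizer would estimate these parameters from training data.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.05, 0.95, size=(10, 256))   # hypothetical per-pixel models p(y_j = 1 | x)
prior = np.full(10, 0.1)                          # uniform prior over the ten digits

def map_decode(y):
    """x_hat = argmax_x p(y|x) p(x), with p(y|x) = prod_j theta^y (1 - theta)^(1 - y)."""
    log_lik = y @ np.log(theta).T + (1 - y) @ np.log(1 - theta).T
    return int(np.argmax(log_lik + np.log(prior)))

y = rng.integers(0, 2, size=256)                  # an observed (here random) ink pattern
print(map_decode(y))
```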


Examples of discrete channels

Natural evolution
Natural evolution can be considered as a channel that models how information
about the environment is transferred to the genome.


Properties of channel capacity

    C = max_{p(x)} I(X; Y)

1. C ≥ 0 since I(X; Y) ≥ 0.
2. C ≤ log |X| since C = max_{p(x)} I(X; Y) ≤ max_{p(x)} H(X) = log |X|.
3. C ≤ log |Y| for the same reason.
4. I(X; Y) is a continuous function of p(x).
5. I(X; Y) is a concave function of p(x).
6. Reminder of relations between information and entropy in graphic form.

Rationale: the mutual information is the uncertainty at the channel input minus the remaining uncertainty when the channel output is observed.

Example: A code for reliable storage

A magnetic hard disk drive (HDD) records data by magnetizing a thin film of
ferromagnetic material in flat circular disks. Bits are stored by changing the
direction of magnetization through a magnetic coil head. A reading head is
used to detect the magnetization of the material underneath.

Hard disk drive and magnetic recording read/write head.

Watch this video on how an HDD works.


Example: A code for reliable storage


The reading/writing process can be modeled as the transmission of a sequence of bits through a BSC:

Assume that the crossover probability of the BSC is p = 0.1.
If we read/write 1 Gbyte of data per day for 10 years, the number of bits sent through the channel is 2.92 × 10^13 (see the short computation below).
A useful HDD should not deliver any erroneous bit in its entire life.
This requires a target bit error probability on the order of 10^−15, or even smaller.
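The bit count follows from simple arithmetic (a rough estimate, assuming 10^9 bytes per day and 365-day years):

```python
bits_per_day = 1e9 * 8                    # 1 Gbyte/day, 8 bits per byte
total_bits = bits_per_day * 365 * 10      # 10 years of daily use
print(f"{total_bits:.3g} bits")           # about 2.92e13
```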

An error-correcting code (channel encoder/decoder) is needed.



Example: A code for reliable storage


Let us try to reduce the probability of error in the BSC using a repetition code R_N, for N = 3 parallel disks:

    1 0 1 1 0 0 ... → 111 000 111 111 000 000 ...

The transmission rate is R = 1/N.



Example: A code for reliable storage

The optimum decoding strategy is deciding the transmitted bit by majority voting among the N received symbols. Then, the probability of error is:

    P_e^N = Σ_{n=(N+1)/2}^{N} (N choose n) p^n (1 − p)^{N−n} ≈ (4p(1 − p))^{N/2}

For P_e^N = 10^−15, N must be at least 68: we need that many parallel disks to achieve the target bit error probability!!
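A sketch (not on the slide) that evaluates both the exact majority-vote error probability and the (4p(1 − p))^{N/2} approximation, and searches for the smallest odd N meeting the target; the approximation is what yields the figure of about 68 disks.

```python
from math import comb

p, target = 0.1, 1e-15

def pe_exact(N):
    """Exact majority-vote error probability for an odd-length-N repetition code over a BSC(p)."""
    return sum(comb(N, k) * p**k * (1 - p)**(N - k) for k in range((N + 1) // 2, N + 1))

def pe_bound(N):
    """The (4p(1-p))^(N/2) approximation used on the slide."""
    return (4 * p * (1 - p)) ** (N / 2)

N = next(n for n in range(1, 201, 2) if pe_bound(n) <= target)
print(N, pe_bound(N), pe_exact(N))   # first odd N satisfying the bound (69); the slide quotes 68 for general N
```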

Clearly, better codes are needed.


Example: A code for reliable storage


Plot of P_e^N vs. bitrate for p = 0.1 and different codes.

What is the tradeoff between redundancy and error probability?
Is it possible to transmit at R > 0 with P_e^N = 0?

Better channel codes and a teaser

Before Shannon announced his theorem, it was believed that zero-error communication implied zero-rate transmission.

Shannon stated the tradeoff between Pe and R: there is a non-zero rate R ≤ C at which we can transmit information with Pe = 0.

The theorem proves that every channel behaves like a typewriter channel: a subset of inputs produces disjoint (and hence non-confusable) sequences at the output.

We need to evaluate how many of these sequences are possible and how to decode them.


Jointly typical sequences

Definition

The set A_ε^(n) of jointly typical sequences (x^n, y^n) is the set of n-sequences with empirical entropies ε-close to the true ones, that is

    A_ε^(n) = { (x^n, y^n) ∈ X^n × Y^n :
        | −(1/n) log p(x^n) − H(X) | < ε,
        | −(1/n) log p(y^n) − H(Y) | < ε,
        | −(1/n) log p(x^n, y^n) − H(X, Y) | < ε }

where p(x^n, y^n) = Π_{i=1}^{n} p(x_i, y_i).


Example

Let us evaluate the joint typicality of two sequences with n = 100: x^n (with p(x = 0) = 0.9) was transmitted at the input of a binary symmetric channel (with crossover probability 0.2), and y^n was received.

From the probabilities above, we can easily compute the joint distribution

    p(x = 0, y = 0) = 0.72,  p(x = 1, y = 0) = 0.02,
    p(x = 0, y = 1) = 0.18,  p(x = 1, y = 1) = 0.08

so that

    p(y) = Σ_{x=0}^{1} p(y|x) p(x) = 0.74 if y = 0,  0.26 if y = 1

and check that

    H(Y|X) = 0.7219,   H(X, Y) = H(X) + H(Y|X) = 1.1909

which exactly matches the empirical entropies computed on the sequences x^n and y^n, so the sequences are jointly typical. Now flip the last bit of y^n. To which tolerance ε are both sequences now jointly typical?
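A sketch of how such a check can be automated; since the slide's actual sequences are not reproduced here, the code below generates its own x^n and y^n under the same model (p(x = 0) = 0.9, crossover 0.2) and reports the smallest tolerance ε for which each typicality condition holds.

```python
import numpy as np

rng = np.random.default_rng(1)
n, px0, cross = 100, 0.9, 0.2
x = (rng.random(n) > px0).astype(int)            # input sequence, p(x = 0) = 0.9
y = x ^ (rng.random(n) < cross).astype(int)      # BSC(0.2) output

# model probabilities
p_x = np.array([px0, 1 - px0])
p_y_given_x = np.array([[1 - cross, cross], [cross, 1 - cross]])
p_xy = p_x[:, None] * p_y_given_x
p_y = p_xy.sum(axis=0)

def eps_needed(seq_logp, H):
    """Smallest tolerance for which |-(1/n) log2 p(seq) - H| < eps holds."""
    return abs(-seq_logp / n - H)

logp_x = np.sum(np.log2(p_x[x]))
logp_y = np.sum(np.log2(p_y[y]))
logp_xy = np.sum(np.log2(p_xy[x, y]))
H_X = -np.sum(p_x * np.log2(p_x))
H_Y = -np.sum(p_y * np.log2(p_y))
H_XY = -np.sum(p_xy * np.log2(p_xy))
print(eps_needed(logp_x, H_X), eps_needed(logp_y, H_Y), eps_needed(logp_xy, H_XY))
# to study the slide's final question, flip y[-1] and re-run
```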

Properties of jointly typical sequences

Theorem (5.1)

    Pr( (X^n, Y^n) ∈ A_ε^(n) ) > 1 − ε  as n → ∞

Theorem (5.2)

    For sufficiently large n,  (1 − ε) 2^{n(H(X,Y)−ε)} ≤ |A_ε^(n)| ≤ 2^{n(H(X,Y)+ε)}

Theorem (5.3)

If (X̃^n, Ỹ^n) ∼ p(x^n) p(y^n) (they are independent with the same marginals as X^n and Y^n), then the probability of being jointly typical is upper bounded by

    Pr( (X̃^n, Ỹ^n) ∈ A_ε^(n) ) ≤ 2^{−n(I(X;Y)−3ε)}

and for sufficiently large n, it is lower bounded by

    Pr( (X̃^n, Ỹ^n) ∈ A_ε^(n) ) ≥ (1 − ε) 2^{−n(I(X;Y)+3ε)}

Proofs. See annex 1.



The size of the jointly typical set


The outer box represents all conceivable input/output pairs of sequences. Each dot represents a jointly typical pair of sequences, whose total number is about 2^{nH(X,Y)} (they can be computed offline, using exhaustive search).

Note that for very large block lengths, every channel looks like the typewriter: any input is very likely to produce a channel output in a small subset of the output alphabet.

Transmitter operation: use a non-confusable subset of inputs that produce disjoint output sequences. This is the key idea behind the channel coding theorem. How can they be used?

Receiver operation: associate an observed channel output y^n to the i-th message if the transmitted codeword x^n(i) is jointly typical with y^n.

Channel coding theorem

Theorem (C. E. Shannon, 1948)

1. Achievability. In every discrete memoryless channel, for R ≤ C and n → ∞, there exists a (2^{nR}, n) code and a decoding algorithm for which the maximum probability of error is λ^(n) → 0.
2. Converse. For any λ^(n) → 0, transmission rates greater than C are not achievable.

Proof. See annex 2 for a separate proof of the achievability and converse parts.


Communication above capacity

Which rates are achievable beyond capacity if some error can be accepted?

Theorem (Rate-distortion)

For a channel of capacity C, transmission rates up to

    R = C / (1 − H(P_e))

can be achieved at a probability of bit error P_e.

Proof. See annex 3.


Communication above capacity

Bit error probability versus transmission rate for a channel of capacity 1 bit/transmission (the non-achievability region is not proved).
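A sketch (not from the slides) that traces this boundary numerically: for R > C the smallest achievable bit error probability solves H(P_e) = 1 − C/R, which can be inverted by bisection since the binary entropy is increasing on [0, 1/2].

```python
import math

def H2(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def min_pe(R, C=1.0):
    """Smallest bit error probability achievable at rate R > C: solves H(pe) = 1 - C/R."""
    target = 1.0 - C / R
    lo, hi = 0.0, 0.5                     # H2 is increasing on [0, 1/2]
    for _ in range(60):                   # bisection
        mid = (lo + hi) / 2
        if H2(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for R in (1.1, 1.5, 2.0, 4.0):
    print(R, round(min_pe(R), 4))
```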


Capacity with feedback


Let us assume that the receiver can send back immediately and noiselessly the received symbols to the transmitter, which can then decide what to do next. Can feedback from the receiver increase the channel capacity?

Let us define a (2^{nR}, n) feedback code as a sequence of mappings x_i(w, y^{i−1}), where each transmitted symbol is selected according to the message and the past received symbols.

Theorem (Feedback capacity)

    C_FB = C = max_{p(x)} I(X; Y)

Proof. See annex 4.



Example

Feedback cannot provide higher rates, but it helps in simplifying encoding and decoding in practical systems.

As an example, let us use feedback in the binary erasure channel: transmit the same bit through the channel until no erasure occurs and the information bit is received correctly.

Can you compute the average number of channel uses it takes to transmit an information bit through this channel?
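A small simulation sketch (with an arbitrarily chosen α) that estimates the answer empirically rather than stating it: each bit is simply retransmitted until it gets through.

```python
import random

def uses_until_success(alpha, rng=random.Random(0)):
    """Retransmit the same bit until it is not erased; count channel uses."""
    uses = 1
    while rng.random() < alpha:           # erasure with probability alpha
        uses += 1
    return uses

alpha, trials = 0.3, 200_000
avg = sum(uses_until_success(alpha) for _ in range(trials)) / trials
print(avg)                                # compare with the mean of a geometric distribution
```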


Joint source-channel coding


Now we shall combine the fundamental results of channel coding, R_c ≤ C, and source coding, R_s ≥ H: is the condition H ≤ C necessary and sufficient? If so, we can separately design a source encoder that encodes at H bits/symbol and a channel encoder that encodes at C bits/tr.

Theorem (Source-channel coding)

Let V = V_1, V_2, ..., V_n ∈ V^n be a finite-alphabet ergodic process that satisfies the AEP. Then

1. Achievability. If H(V) ≤ C there exists a source-channel code with vanishing probability of error: Pr(v̂^n ≠ v^n) → 0.
2. Converse. If H(V) > C the probability of error is bounded away from zero.

Proof. See annex 5.



Conclusions

The data compression theorem is based on the AEP: there exists a "small" subset (of size 2^{nH}) of all possible source sequences that contains most of the probability, and hence we can represent the source with a small probability of error using H bits/symbol.

The channel coding theorem is based on the joint AEP: for long block lengths the output sequence of the channel is very likely to be jointly typical with the input codeword, while any other input codeword is jointly typical with the observed output only with a small probability ≈ 2^{−nI}. Hence, we can use about 2^{nI} codewords and still have negligible probability of error for large n.

The source-channel separation theorem shows that we can design the source code and the channel code separately and use them together to achieve optimal performance as long as H ≤ C.


Way through...

Chapter 6 introduces practical channel codes that achieve low probability of error and approach capacity.

Other relevant aspects of capacity not considered in this course are:
- How the probability of error decreases as a function of n (i.e. error exponents, sphere-packing bounds)
- Optimization with respect to p(x) (e.g. using the Blahut-Arimoto algorithm; a sketch follows below)
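Blahut-Arimoto itself is outside the scope of this course, but a compact sketch of the standard iteration (alternating updates of the posterior q(x|y) and the input distribution) looks roughly as follows; stopping criteria and degenerate channels are not handled.

```python
import numpy as np

def blahut_arimoto(Q, iters=500):
    """Approximate capacity (bits) of a DMC with transition matrix Q[x, y] = p(y|x)."""
    Q = np.asarray(Q, dtype=float)
    nx = Q.shape[0]
    r = np.full(nx, 1.0 / nx)                       # start from the uniform input distribution
    for _ in range(iters):
        q = r[:, None] * Q                          # joint p(x, y)
        q /= q.sum(axis=0, keepdims=True)           # posterior q(x|y)
        logr = np.sum(Q * np.log(q, where=q > 0, out=np.zeros_like(q)), axis=1)
        r = np.exp(logr - logr.max())               # r(x) proportional to exp(sum_y p(y|x) log q(x|y))
        r /= r.sum()
    py = r @ Q
    mask = Q > 0
    return float(np.sum(r[:, None] * Q * np.log2(Q / py, where=mask, out=np.zeros_like(Q)),
                        where=mask))

print(blahut_arimoto([[0.6, 0.2, 0.2],
                      [0.1, 0.2, 0.7]]))            # the non-symmetric example Q given earlier in this lesson
```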


Annex 1: Proof of theorem 5.1

Proof. This proves that with high probability the sequences (X^n, Y^n) of length n are jointly typical. By the weak law of large numbers,

    −(1/n) log p(X^n) → −E[log p(X)] = H(X)

Hence, given ε > 0, there exist n_1, n_2, n_3 such that for all n > n_1, n > n_2, n > n_3 respectively,

    Pr( | −(1/n) log p(X^n) − H(X) | ≥ ε ) < ε/3
    Pr( | −(1/n) log p(Y^n) − H(Y) | ≥ ε ) < ε/3
    Pr( | −(1/n) log p(X^n, Y^n) − H(X, Y) | ≥ ε ) < ε/3

Then, by choosing n > max(n_1, n_2, n_3) and using the union bound Pr(A ∪ B) ≤ Pr(A) + Pr(B), the probability of the union of the three atypical events must be less than ε. Hence, for n sufficiently large, the probability of the set A_ε^(n) is greater than 1 − ε. ∎


Annex 1: Proof of theorem 5.2

Proof. For the upper bound,

    1 = Σ_{(x^n, y^n) ∈ X^n × Y^n} p(x^n, y^n) ≥ Σ_{(x^n, y^n) ∈ A_ε^(n)} p(x^n, y^n) ≥ 2^{−n(H(X,Y)+ε)} |A_ε^(n)|

For the lower bound, take theorem 5.1; if n is sufficiently large,

    1 − ε < Pr( A_ε^(n) ) ≤ Σ_{(x^n, y^n) ∈ A_ε^(n)} 2^{−n(H(X,Y)−ε)} = 2^{−n(H(X,Y)−ε)} |A_ε^(n)|  ∎

The set may be small; its size depends on the joint entropy of X and Y.


Annex 1: Proof of theorem 5.3

Proof. Under the conditions stated in the theorem, and using theorem 5.2,

    Pr( (X̃^n, Ỹ^n) ∈ A_ε^(n) ) = Σ_{(x^n, y^n) ∈ A_ε^(n)} p(x^n) p(y^n)
                                 ≤ 2^{n(H(X,Y)+ε)} 2^{−n(H(X)−ε)} 2^{−n(H(Y)−ε)}
                                 = 2^{−n(I(X;Y)−3ε)}

Similarly, also using theorem 5.2, we can prove

    Pr( (X̃^n, Ỹ^n) ∈ A_ε^(n) ) = Σ_{(x^n, y^n) ∈ A_ε^(n)} p(x^n) p(y^n)
                                 ≥ (1 − ε) 2^{n(H(X,Y)−ε)} 2^{−n(H(X)+ε)} 2^{−n(H(Y)+ε)}
                                 = (1 − ε) 2^{−n(I(X;Y)+3ε)}  ∎


Annex 2: Channel coding theorem. Achievability part

Proof of achievability. It is completed in two steps.

Can we define a suitable encoding/decoding scheme?

i. A random code of 2^{nR} sequences of length n is generated according to some pmf, such that p(x^n) = Π_{i=1}^{n} p(x_i), so the codebook can be described as

    C = [ x^n(1); ... ; x^n(2^{nR}) ],   where row w is  x^n(w) = ( x_1(w), x_2(w), ..., x_n(w) ),

and the probability of the code is

    Pr(C) = Π_{w=1}^{2^{nR}} Π_{i=1}^{n} p(x_i(w)).

ii. The random code used is revealed to sender and receiver, who also know the channel transition matrix p(y|x).


Annex 2: Channel coding theorem. Achievability part

iii. A message W is selected according to a uniform distribution

    Pr(W = w) = 2^{−nR},  w = 1, 2, ..., 2^{nR}.

iv. The w-th codeword x^n(w) is sent over the channel.

v. The receiver observes a sequence y^n according to the distribution

    p(y^n | x^n(w)) = Π_{i=1}^{n} p(y_i | x_i(w)).

vi. The receiver guesses which message was sent according to the joint typicality criterion: ŵ is declared to have been sent if the following conditions are satisfied:
- (x^n(ŵ), y^n) is jointly typical, and
- there is no other index w′ ≠ ŵ such that (x^n(w′), y^n) ∈ A_ε^(n);

otherwise, an error is declared.

vii. There is an error E if ŵ ≠ w.


Annex 2: Channel coding theorem. Achievability part

Example of typical-set decoding for a codebook with 4 codewords: y_a^n is not jointly typical with any codeword, y_b^n is jointly typical with x^n(3), y_c^n is jointly typical with x^n(4), and y_d^n is jointly typical with more than one codeword.

Annex 2: Channel coding theorem. Achievability part


Proof (cont.)

Does the probability of error of this code reduce to zero?

i. The average probability of error over all codewords in all possible codebooks is:

    Pr(E) = Σ_C Pr(C) P_e^(n)(C) = Σ_C Pr(C) (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C)
          = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C Pr(C) λ_w(C) = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Pr(E|w)

ii. Let us evaluate the probability of error for w = 1; it will be seen that it does not depend on w. Let us define E_i = { (x^n(i), y^n) ∈ A_ε^(n) } for i ∈ {1, 2, ..., 2^{nR}}. An error occurs if either:

- E_1^c = { (x^n(1), y^n) ∉ A_ε^(n) } occurs, or
- E_2 ∪ E_3 ∪ ... ∪ E_{2^{nR}} occurs (a wrong codeword is jointly typical).

Therefore, using the union bound, we can write...



Annex 2: Channel coding theorem. Achievability part

    Pr(E|w = 1) = Pr(E_1^c ∪ E_2 ∪ E_3 ∪ ... ∪ E_{2^{nR}} | w = 1)
                ≤ Pr(E_1^c | w = 1) + Σ_{i=2}^{2^{nR}} Pr(E_i | w = 1).

iii. By the joint AEP, Pr(E_1^c | w = 1) ≤ ε for n → ∞. By the code generation process, x^n(1) and x^n(i) are independent for i ≠ 1, and so are y^n and x^n(i); hence, by theorem 5.3 of the joint AEP:

    Pr(E) = Pr(E|w = 1) ≤ ε + Σ_{i=2}^{2^{nR}} 2^{−n(I(X;Y)−3ε)}
          = ε + (2^{nR} − 1) 2^{−n(I(X;Y)−3ε)}
          ≤ ε + 2^{3nε} 2^{−n(I(X;Y)−R)}

A key point: we can make the last term lower than ε by increasing n, provided that R < I(X;Y) − 3ε, and then Pr(E) ≤ 2ε.
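A tiny numeric illustration (with arbitrarily chosen values of I(X;Y), ε and R) of why this last term vanishes when R < I(X;Y) − 3ε:

```python
I_xy, eps, R = 0.5, 0.01, 0.4              # hypothetical mutual information, tolerance and rate
for n in (100, 500, 1000, 5000):
    bound = 2 ** (3 * n * eps) * 2 ** (-n * (I_xy - R))
    print(n, bound)                        # decays like 2^(-n (I - R - 3 eps)) since R < I - 3 eps
```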


Annex 2: Channel coding theorem. Achievability part


iv. For transmission we have to select one among all possible random codes. We can strengthen the conclusion about Pr(E) by wisely selecting the code:

a. Choose p(x) such that I(X;Y) is maximal, and hence R < C.

b. Get rid of the average over codes and keep just a good one. Since averaging over all codes gives Pr(E) ≤ 2ε, there must be one code C* (found by exhaustive search over all (2^{nR}, n) codes) such that

    Pr(E|C*) = (1/2^{nR}) Σ_{i=1}^{2^{nR}} λ_i(C*) ≤ 2ε.

c. Throw away the worst half of the codewords (so that we can bound the maximal probability of error λ_max(C*), not only the average):

    Pr(E|C*) = (1/2) [ (1/2^{nR−1}) Σ_{best i} λ_i(C*) + (1/2^{nR−1}) Σ_{worst i} λ_i(C*) ] ≤ 2ε

where the second (worst-half) term is loosely upper bounded by 4ε; consequently every codeword in the kept half has λ_i(C*) ≤ 4ε.


Annex 2: Channel coding theorem. Achievability part

A typical random code (left), where a small fraction of the codewords are sufficiently close to each other that the probability of error when either codeword is transmitted is not tiny. We obtain a new code by deleting all these confusable codewords (right). The resulting code has fewer codewords, so it has a lower rate, and its maximal probability of error is greatly reduced.


Annex 2: Channel coding theorem. Achievability part

The number of codewords has changed to 2^{nR−1} and therefore the rate is

    (1/n) log(2^{nR−1}) = (1/n)(nR − 1) = R − 1/n

In short: we have been able to turn a noisy channel into a noiseless channel, as long as the transmission rate is below the capacity, just by constructing a code of rate

    R′ = R − 1/n,

whose maximal probability of error is λ^(n) ≤ 4ε. ∎


Annex 2: Channel coding theorem. Converse part


Proof of converse. Assume we have a (2^{nR}, n) code with λ^(n) → 0 (this implies P_e^(n) → 0), an encoding rule X^n(·) and a decoding rule Ŵ = g(Y^n), so we can construct the Markov chain

    W → X^n(W) → Y^n → Ŵ.

If W has a uniform distribution, Pr(Ŵ ≠ W) = P_e^(n) = (1/2^{nR}) Σ_i λ_i, and hence

    nR = H(W)
       = H(W|Ŵ) + I(W; Ŵ)                    entropy identity
       ≤ 1 + P_e^(n) nR + I(W; Ŵ)            Fano's inequality (upper bounding H(P_e^(n)) < 1)
       ≤ 1 + P_e^(n) nR + I(X^n; Y^n)        data processing inequality
       ≤ 1 + P_e^(n) nR + nC                 repeated use of the channel does not increase capacity (see annex 6)

Dividing by n we obtain

    R ≤ P_e^(n) R + C + 1/n

For large n, we can build codes with P_e^(n) → 0, and hence R ≤ C. ∎

Annex 3: Communication above capacity

Proof. Take a channel of capacity C together with the corresponding capacity-achieving encoder/decoder that allows zero-error transmission, and take the capacity-achieving encoder/decoder designed for a BSC whose transition probability is q.

Since we are interested in lossy transmission, let us reverse the use of encoder and decoder, i.e. the BSC decoder is used as a lossy encoder, whose input is a sequence of n symbols and whose output is a codeword of length k < n; its rate is R′ = n/k = 1/(1 − H(q)).


Annex 3: Communication above capacity

Proof (cont.) The lossy decoder will take a sequence of k symbols and convert it back to a sequence of n symbols. Let us concatenate them with the capacity-C channel together with its own optimum encoder/decoder as follows...

The lossy encoding is a surjective mapping and the lossy decoding is a bijective mapping. Both are designed using the joint typicality principle for the BSC channel, so ŵ and w will differ in about nq symbols, hence P_e = q.

Now the rate of the transmission with errors is

    R = n / (# transmissions) = C · (n/k) = C · R′ = C / (1 − H(P_e))   ∎


Annex 4: Capacity with feedback

Proof. A non-feedback code is a special case, so C_FB ≥ C. Let us prove that C_FB ≤ C. Assume W is uniformly distributed over {1, 2, ..., 2^{nR}}, and hence Pr(W ≠ Ŵ) = P_e^(n). Then let us bound the rate as:

    nR = H(W) = H(W|Ŵ) + I(W; Ŵ)
       ≤ 1 + P_e^(n) nR + I(W; Ŵ)          Fano's inequality
       ≤ 1 + P_e^(n) nR + I(W; Y^n)        data processing inequality

The last term can be bounded as follows:

Annex 4: Capacity with feedback

    I(W; Y^n) = H(Y^n) − H(Y^n|W)
              = H(Y^n) − Σ_{i=1}^{n} H(Y_i | Y_1, Y_2, ..., Y_{i−1}, W)         chain rule of H
              = H(Y^n) − Σ_{i=1}^{n} H(Y_i | Y_1, Y_2, ..., Y_{i−1}, W, X_i)    since X_i = f(W, Y_{i−1}, ..., Y_1)
              = H(Y^n) − Σ_{i=1}^{n} H(Y_i | X_i) ≤ Σ_{i=1}^{n} H(Y_i) − Σ_{i=1}^{n} H(Y_i | X_i)
              = Σ_{i=1}^{n} I(X_i; Y_i) ≤ nC                                    repeated use of the channel does not increase capacity

Putting it all together we obtain

    nR ≤ 1 + P_e^(n) nR + nC,

so when n → ∞, R ≤ C. ∎


Annex 5: Joint source-channel coding

Proof of achievability. The ergodic source process V is mapped to codewords through an encoding rule x^n(v^n) and sent over the channel. The receiver observes y^n and makes a guess v̂^n. An error is declared if v̂^n ≠ v^n. We shall follow three steps:

i. The size of the typical set A_ε^n for V is at most 2^{n(H(V)+ε)}, so we need n(H(V) + ε) bits to encode it. Encoding non-typical sequences will entail an error ε → 0 (remember lecture 3).

ii. We can transmit the index into the typical set with arbitrarily low error if H(V) + ε = R_s ≤ C (according to the channel capacity theorem).

iii. The receiver can reconstruct v^n by enumerating the typical set and choosing the sequence that matches the transmitted one with high probability:

    Pr(v^n ≠ v̂^n) ≤ Pr(v^n ∉ A_ε^n) + Pr(g(y^n) ≠ v^n | v^n ∈ A_ε^n)

for large n. Therefore we can reconstruct the original sequence with low probability of error if H(V) ≤ C.


Annex 5: Joint source-channel coding

Proof of converse. Identify the Markov chain V^n → X^n → Y^n → V̂^n. Then, by Fano's inequality,

    H(V^n | V̂^n) ≤ 1 + Pr(v̂^n ≠ v^n) log |V^n| = 1 + Pr(v̂^n ≠ v^n) n log |V|

Let us apply it to bound the entropy rate:

    H(V) ≤ (1/n) H(V_1, ..., V_n)
         = (1/n) H(V^n | V̂^n) + (1/n) I(V^n; V̂^n)
         ≤ (1/n) (1 + Pr(v̂^n ≠ v^n) n log |V|) + (1/n) I(V^n; V̂^n)      Fano's inequality
         ≤ (1/n) (1 + Pr(v̂^n ≠ v^n) n log |V|) + (1/n) I(X^n; Y^n)       data processing inequality
         ≤ 1/n + Pr(v̂^n ≠ v^n) log |V| + C                               capacity of repeated use of the channel (see annex 6)

By letting n → ∞, if Pr(v̂^n ≠ v^n) → 0 then H(V) ≤ C. ∎


Annex 6: Repeated use of channel


Theorem (Capacity of repeated use of channel)

Let Y^n be the result of passing X^n through a discrete memoryless channel of capacity C. Then I(X^n; Y^n) ≤ nC for all p(x^n).

Proof.

    I(X^n; Y^n) = H(Y^n) − H(Y^n | X^n)
                = H(Y^n) − Σ_{i=1}^{n} H(Y_i | Y_1, ..., Y_{i−1}, X^n)     by the chain rule of entropy
                = H(Y^n) − Σ_{i=1}^{n} H(Y_i | X_i)                        by definition of a memoryless channel
                ≤ Σ_{i=1}^{n} H(Y_i) − Σ_{i=1}^{n} H(Y_i | X_i)            since H(Y^n) ≤ Σ_i H(Y_i)
                = Σ_{i=1}^{n} I(X_i; Y_i)                                  by definition of mutual information
                ≤ nC                                                       ∎
