IV111 Probability in CS
Lecture 1: Introduction and Basic Definitions
Vojtěch Řehák
based on slides of Jan Bouda
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 1 / 61
Part I
Basic Information
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 2 / 61
Course Information
• Lecture
▶ motivation, definitions, theorems, proofs, demonstration on examples
▶ these slides IV111 * handout.pdf + videos from fall 2020
▶ for later lectures there are old versions of the slides, they will be
gradually updated
• Seminars
▶ practicing on word problems, discussions
▶ all examples are in IV111 exercises for tutorials.pdf
▶ attendance at tutorials is compulsory and absence will be penalized
• Assessment
▶ written final exam for preliminary evaluation (see Sample Exam in IS)
▶ subsequent oral exam
• The course will be taught in English
see the Interactive syllabus for more detail
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 3 / 61
Course Topics in Exam Questions
Q1: Probability space, random variable
• probability space, events, conditional probability, independence
Q2: Markov and Chebyshev inequalities
• random variable expectation, moments, tail bounds
Q3: Laws of large numbers
—
Q4: Discrete-time and continuous-time Markov chains
• probabilistic processes, analysis, average and almost sure performance
Q5: Ergodic theorem for DTMC
—
Q6: Kraft and McMillan theorems, Huffman codes
• entropy, randomness, information, coding, compression
Q7: Channel coding theorem
• noisy channels, capacity and coding rates
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 4 / 61
Bibliography
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 5 / 61
Bibliography
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 6 / 61
Part II
Motivation
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 7 / 61
Randomness, Probability and Statistics
Probability
given model, predict data
Statistics
given data, predict model
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 8 / 61
Randomness, Probability and Statistics
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 9 / 61
Randomness, Probability and Statistics
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 10 / 61
Motivation for The Course
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 11 / 61
Applications of Probability Theory
• Machine learning
• Recommendation systems
• Spam filtering
• Computational finance
• Cryptography, Network security
• System biology
• DNA sequencing
• ...
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 12 / 61
Example: Algorithm for Verifying Matrix Multiplication
Given three n × n matrices A, B, and C, verify
AB = C.
Standard algorithm:
• perform matrix multiplication AB
• compare the result with C
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 13 / 61
Example: Algorithm for Verifying Matrix Multiplication
Given three n × n matrices A, B, and C, verify
AB = C.
Randomized algorithm (Freivalds' algorithm):
1. Choose uniformly at random a vector ⃗r = (r1 , r2 , . . . , rn ) ∈ {0, 1}n .
2. Compute vector ⃗vB⃗r = B⃗r .
3. Compute vector ⃗vAB⃗r = A⃗vB⃗r .
4. Compute vector ⃗vC⃗r = C⃗r .
5. If ⃗vAB⃗r ̸= ⃗vC⃗r return AB ̸= C, else return AB = C.
This algorithm takes Θ(n²) time.
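A minimal Python sketch of the procedure above (using NumPy; the function name `freivalds` and the `rounds` parameter are illustrative, not part of the slides):

```python
import numpy as np

def freivalds(A, B, C, rounds=1):
    """Randomized check whether A @ B == C.

    Each round picks a random 0/1 vector r and compares A(Br) with Cr,
    which costs only O(n^2) instead of a full matrix multiplication.
    A 'False' answer is always correct; a 'True' answer is wrong with
    probability at most (1/2)**rounds when AB != C.
    """
    n = C.shape[0]
    for _ in range(rounds):
        r = np.random.randint(0, 2, size=n)       # r chosen uniformly from {0,1}^n
        if not np.array_equal(A @ (B @ r), C @ r):
            return False                           # certainly AB != C
    return True                                    # probably AB == C

# tiny usage example
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(freivalds(A, B, A @ B, rounds=10))           # True
print(freivalds(A, B, A @ B + 1, rounds=10))       # almost surely False
```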
Questions:
1. Say that AB ̸= C, what is the probability of a wrong answer?
2. What is the probability of a wrong answer when executed 100 times?
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 13 / 61
Example: Algorithm for Verifying Matrix Multiplication
Given three n × n matrices A, B, and C, verify
AB = C.
Randomized algorithm (Freivalds' algorithm):
1. Choose uniformly at random a vector ⃗r = (r1 , r2 , . . . , rn ) ∈ {0, 1}n .
2. Compute vector ⃗vB⃗r = B⃗r .
3. Compute vector ⃗vAB⃗r = A⃗vB⃗r .
4. Compute vector ⃗vC⃗r = C⃗r .
5. If ⃗vAB⃗r ̸= ⃗vC⃗r return AB ̸= C, else return AB = C.
This algorithm takes Θ(n2 ) time.
Questions:
Answers:
1. For AB ̸= C and a vector ⃗r chosen uniformly at random, the
probability of a wrong answer is at most 1/2.
For a formal proof, see Wikipedia.
2. If the vectors are chosen independently, the probability of a wrong
answer is at most 1/2^100, which is smaller than the probability of a
hardware error.
Goals of Probability Theory
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 14 / 61
Examples of Random Experiments
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 15 / 61
Examples of Random Experiments
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 16 / 61
Examples of Random Experiments: Balls and Bins
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 17 / 61
Part III
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 18 / 61
Random Experiment
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 19 / 61
Sample Space
Definition
The (non-empty) set of the possible outcomes of a random experimenta
is called the sample spaceb of the experiment and it will be denoted S.
The outcomes of a random experiment (elements of the sample space) are
denoted sample pointsc .
a náhodný pokus
b základnı́ prostor
c základnı́/elementárnı́ body
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 20 / 61
Sample Space - Examples
Example
Let us consider the following experiment: we toss a coin until the first 'head'
appears. Possible outcomes of this experiment are H, TH, TTH, TTTH, . . .
We may also consider the possibility that ’head’ never occurs. In this case
we have to introduce an extra sample point denoted, e.g. ⊥.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 21 / 61
Sample Space - Examples
Idea
The sample space is not determined completely by the experiment.
It is partially determined by the purpose for which the experiment is
carried out.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 22 / 61
Sample Space Size
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 23 / 61
Event
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 24 / 61
Events in Discrete Space
Definition
Eventa of a random experiment with sample space S is any subset of S.
a jev
1 nemožný jev
2 jistý jev
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 25 / 61
Event
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 26 / 61
Example of Events: Three Balls in Three Bins
• The event A=’there is more than one ball in one of the bins’
corresponds to atomic outcomes 1-21. We say that the event A is an
aggregate of events 1-21.
• The event B=’ first bin is not empty ’ is an aggregate of sample
points 1,4-15,22-27.
• The event C is defined as ’ both A and B occur ’. It represents
sample points 1,4-15.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 27 / 61
Algebra of Events - Notation
Message
Events are sets.
Hence, set operations can be used to obtain a new event.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 28 / 61
Algebra of Events - Laws
E1 Commutative:
A ∪ B = B ∪ A, A ∩ B = B ∩ A
E2 Associative:
A ∪ (B ∪ C ) = (A ∪ B) ∪ C
A ∩ (B ∩ C ) = (A ∩ B) ∩ C
E3 Distributive:
A ∪ (B ∩ C ) = (A ∪ B) ∩ (A ∪ C )
A ∩ (B ∪ C ) = (A ∩ B) ∪ (A ∩ C )
E4 Identity:
A ∪ ∅ = A, A ∩ S = A
E5 Complement:
A ∪ \overline{A} = S, A ∩ \overline{A} = ∅
Any relation valid in the algebra of events can be proved using these
axioms.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 29 / 61
Algebra of Events - Some Relations
Using the previously introduced axioms we can derive, e.g.,
• Idempotent laws:
A ∪ A = A, A ∩ A = A
• Domination laws:
A ∪ S = S, A ∩ ∅ = ∅
• Absorption laws:
A ∪ (A ∩ B) = A, A ∩ (A ∪ B) = A
• de Morgan's laws:
\overline{A ∪ B} = \overline{A} ∩ \overline{B}, \overline{A ∩ B} = \overline{A} ∪ \overline{B}
• Double complement:
\overline{\overline{A}} = A
• Absorption with complement:
A ∪ (\overline{A} ∩ B) = A ∪ B
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 30 / 61
Probability - Intuition
Definition
Let us suppose that we perform n trials of a random experiment and we
obtain an outcome s ∈ S k-times. Then the relative frequency of the
outcome s is k/n.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 31 / 61
Definition of Discrete Probability
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 32 / 61
Some Relations
P(A1 ∪ A2 ∪ · · · ∪ An ) = ∑_{i} P(Ai ) − ∑_{i<j} P(Ai ∩ Aj ) + ∑_{i<j<k} P(Ai ∩ Aj ∩ Ak ) − · · · + (−1)^{n−1} P(A1 ∩ A2 ∩ · · · ∩ An ).
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 33 / 61
Some Relations
Are the following statements always true?
• P(∅) = 0
• P(\overline{A}) = 1 − P(A) for any event A
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any events A and B
• inclusion and exclusion:
P( ⋃_{i=1}^{n} Ai ) = P(A1 ∪ · · · ∪ An ) = ∑_{i} P(Ai ) − ∑_{1≤i<j≤n} P(Ai ∩ Aj ) + ∑_{1≤i<j<k≤n} P(Ai ∩ Aj ∩ Ak ) − · · · + (−1)^{n−1} P(A1 ∩ A2 ∩ · · · ∩ An ).
For the first statement:
1 = P(S) = P(S ∪ ∅) = P(S) + P(∅) = 1 + P(∅), hence P(∅) = 0.
For the third statement:
P(A ∪ B) = P(A ∪ (B ∖ (A ∩ B)))
         = P(A) + P(B ∖ (A ∩ B))
         = P(A) + 1 − P(\overline{B ∖ (A ∩ B)})
         = P(A) + 1 − P(\overline{B} ∪ (A ∩ B))
         = P(A) + 1 − (P(\overline{B}) + P(A ∩ B))
         = P(A) + 1 − (1 − P(B) + P(A ∩ B))
         = P(A) + P(B) − P(A ∩ B).
Probability - How To Assign
Idea
For finitely or countably many sample points, it is sufficient to assign
probabilities to elementary events.
Then, the probability of an event will be the finite or countable sum of
probabilities of the elementary events included in the event.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 34 / 61
Event collections
Definition
Events A1 , A2 , . . . , An are mutually exclusive or mutually disjointa if and only if
∀i ̸= j : Ai ∩ Aj = ∅.
a (vzájemně) disjunktnı́ nebo neslučitelné
Definition
Events A1 , A2 , . . . , An are collectively exhaustivea if and only if
A1 ∪ A2 ∪ · · · ∪ An = S.
a tvořı́ úplný systém
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 35 / 61
Partition of Sample Space
Definition
A mutually exclusive and collectively exhaustive list of events is called a partitiona of
the sample space S.
a rozklad
Example
The list of all elementary events {s}, for s ∈ S, is a partition of S.
Message
Every partition can be used as an alternative sample space.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 36 / 61
Designing a (discrete) random experiment
Idea
The design of a random experiment in a real situation is to follow this
procedure:
• Identify the sample space - set of mutually exclusive and collectively
exhaustive events. Choose the elements in the way that they cannot
be further subdivided. You can always define aggregate events.
• Assign probabilities to elements in S - probabilities are usually
either result of a careful theoretical analysis, or based on estimates
obtained from past experience.
• Identify events of interest - they are usually described by
statements and should be reformulated in terms of subsets of S.
• Compute desired probabilities - calculate the probabilities of
interesting events using sums of elementary event probabilities.
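A small Python sketch following these four steps for the three-balls-in-three-bins experiment mentioned earlier (the event A = 'some bin holds more than one ball' is used as the event of interest; the variable names are mine):

```python
from itertools import product
from fractions import Fraction

# Step 1: sample space - ball i is placed into bin s[i]; 27 equally likely outcomes.
sample_space = list(product(range(3), repeat=3))

# Step 2: assign probabilities (uniform here).
p = Fraction(1, len(sample_space))

# Step 3: identify the event of interest as a subset of the sample space.
A = [s for s in sample_space if max(s.count(b) for b in range(3)) > 1]

# Step 4: compute the desired probability as a sum of elementary probabilities.
print(len(A) * p)   # 7/9  (21 of the 27 outcomes)
```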
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 37 / 61
Balls and Bins Revisited - Aggregates
Let us return to the first example with three balls and three bins and
suppose that the balls are not distinguishable, implying e.g. that we do
not distinguish atomic events 4, 5 and 6. Atomic events in the new
experiment (placing three indistinguishable balls into three bins) are
1.[***][ ][ ] 6.[ * ][** ][ ]
2.[ ][***][ ] 7.[ * ][ ][** ]
3.[ ][ ][***] 8.[ ][** ][ * ]
4.[** ][ * ][ ] 9.[ ][ * ][** ]
5.[** ][ ][ * ] 10.[ * ][ * ][ * ]
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 39 / 61
Balls and Bins Revisited
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 40 / 61
Discrete Probability Space
Definition
A discrete probability spacea is a triple (S, F , P) where
• S is a sample spaceb , i.e. the set of all possible outcomes,
• F = 2S is a collection of all events, and
• P : F → [0, 1] is a probability functionc .
a pravděpodobnostnı́ prostor
b základnı́
prostor
c pravděpodobnostnı́ funkce
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 41 / 61
Continuous Sample Space - Examples
• What is the portion of a day spent on FB?
• Hitting an archery target
(areas - discrete vs. distance from the center - continuous).
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 42 / 61
Discrete vs. Continuous Sample Space
Idea
We need to measure the area of particular sample points.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 43 / 61
Counterexample
Imagine a uniform distribution on [0, 1].
Let Q be the set of all rational numbers and Q′ = Q ∩ [0, 1).
Let A ⊕ r = {(a + r ) mod 1 | a ∈ A} be a circular shift.
As ⊕ r preserves the size of the set, we can naturally expect that
P(A) = P(A ⊕ r ).
Define an equivalence relation on the reals: x ∼ y iff x − y ∈ Q.
Let H be a subset of [0, 1) consisting of precisely one element from each
equivalence class of ∼ (its existence follows from the Axiom of Choice).
Then
1 = P([0, 1)) = P( ⋃_{r∈Q′} (H ⊕ r ) ) = ∑_{r∈Q′} P(H ⊕ r ) = ∑_{r∈Q′} P(H),
which is impossible: the right-hand side is either 0 or ∞.
When the sample space is continuous, the power set F = 2^S is too large a
collection for probabilities to be assigned reasonably to all its members. It
is sufficient to take a subset F ⊆ 2^S that is a σ-field.
Definition
Let S be a sample space. Then F ⊆ 2^S is a σ-field if
• ∅ ∈ F ,
• if A1 , A2 , · · · ∈ F then ⋃_{i=1}^{∞} Ai ∈ F , and
• if A ∈ F then \overline{A} ∈ F .
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 45 / 61
Probability Space
Definition
A probability spacea is a triple (S, F , P) where
• S is a sample spaceb , i.e. the set of all possible outcomes,
• F ⊆ 2S is a σ -fieldc , i.e. a collection of sets representing the
allowable events, and
• P : F → [0, 1] is a probability functiond .
a pravděpodobnostnı́ prostor
b základnı́prostor
c jevové pole
d pravděpodobnostnı́ funkce
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 46 / 61
Part IV
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 47 / 61
Conditional Probability
Example
We throw a 6-sided die, but it lands so far away that we can only see that the result is ≥ 4.
What is the probability that it is a 6?
Contrary to the initial situation, we now know that some sample points have
zero probability, i.e., our probability distribution changes.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 48 / 61
Conditional Probability
Given that B occurred, we know that the outcomes outside B could not
occur (i.e., they have zero probability).
Let us set their probabilities to zero.
But the probabilities of all sample points have to sum to 1, and now they
sum only to P(B). So we also need to rescale them.
For every atomic outcome s, we derive
P(s|B) = P(s)/P(B)  if s ∈ B,
P(s|B) = 0          if s ∉ B.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 49 / 61
Conditional Probability
Definition
For P(B) ̸= 0, the conditional probabilitya of A given B is
P(A|B) = P(A ∩ B) / P(B).
If P(B) = 0, it is undefined.
a podmı́něná pravděpodobnost
Note: We sometimes wish to condition on null events; then, the approach is more
complicated. See J. Rosenthal: A First Look at Rigorous Probability Theory, Chapter 13.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 50 / 61
Independence of Events
Definition
Events A and B are said to be independenta if
P(A ∩ B) = P(A) · P(B).
Equivalently, P(A) = P(A|B) whenever P(B) > 0, which says that the probability of the
event A does not change depending on whether event B occurred.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 51 / 61
Independence of Events - Remarks
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 52 / 61
Independence of More Events
Definition
Events A1 , A2 , . . . An are (mutually) independenta if and only if for any set
{i1 , i2 , . . . , ik } ⊆ {1, . . . , n} (2 ≤ k ≤ n) of distinct indices it holds that
P(Ai1 ∩ Ai2 ∩ · · · ∩ Aik ) = P(Ai1 ) P(Ai2 ) · · · P(Aik ).
Definition
Events A1 , A2 , . . . An are pairwise independenta iff for all distinct indices
i, j ∈ {1 . . . n} it holds that P(Ai ∩ Aj ) = P(Ai )P(Aj ).
a po dvou nezávislé
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 53 / 61
Independence of More Events - Example
-----------
A | s1 /| s2 |
----------/-|
/ |/ |
/ /| |
B | s3 / | s4 | C
---/ ----
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 54 / 61
Law of Total Probability
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 55 / 61
Law of Total Probability (Proof)
P(A) = ∑_{i=1}^{n} P(A | Bi ) P(Bi )
Proof.
Let A be an event and {B1 , . . . , Bn } be an event space, i.e. mutually
exclusive and collectively exhaustive.
n
∑ P(A | Bi )P(Bi ) =
i=1
=
=
=
= P(A)
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 56 / 61
Law of Total Probability (Proof)
P(A) = ∑_{i=1}^{n} P(A | Bi ) P(Bi )
Proof.
Let A be an event and {B1 , . . . , Bn } be an event space, i.e. mutually
exclusive and collectively exhaustive.
∑_{i=1}^{n} P(A | Bi ) P(Bi ) = ∑_{i=1}^{n} P(A ∩ Bi )
                             = P( ⋃_{i=1}^{n} (A ∩ Bi ) )
                             = P( A ∩ ⋃_{i=1}^{n} Bi )
                             = P(A ∩ S)
                             = P(A)
Law of Total Probability (Example)
Example
Production on machines A, B, C . Products can be defective.
P(A) = 0.25 P(Defect | A) = 0.05
P(B) = 0.35 P(Defect | B) = 0.04
P(C ) = 0.40 P(Defect | C ) = 0.02
P(Defect) = ∑_{X ∈{A,B,C }} P(Defect | X ) P(X )
Idea
The theorem of total probability is useful when we know conditional
probabilities of a property in all (exhaustive) subcases and we are
interested in (general) probability of the property.
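A quick numeric check of the example above (plain Python; the dictionary names are mine):

```python
# Law of total probability: P(Defect) = sum over machines of P(Defect | machine) * P(machine)
p_machine = {"A": 0.25, "B": 0.35, "C": 0.40}
p_defect_given = {"A": 0.05, "B": 0.04, "C": 0.02}

p_defect = sum(p_defect_given[m] * p_machine[m] for m in p_machine)
print(p_defect)   # ≈ 0.0345
```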
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 57 / 61
Bayes’ Rule
Theorem (Bayes' Rule)
Let A and B be events such that P(B) > 0. Then
P(A | B) = P(B | A) · P(A) / P(B).
Proof.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 58 / 61
Bayes’ Rule
Theorem (Bayes' Rule)
Let A and B be events such that P(B) > 0. Then
P(A | B) = P(B | A) · P(A) / P(B).
Proof.
Traditional naming of the terms:
P(A | B) = (P(B | A) / P(B)) · P(A)   is read as   Posterior = (Likelihood / Evidence) · Prior.
We use the definition of the conditional distribution, P(A | B) = P(A ∩ B) / P(B), and the
commutativity of intersection. Hence
P(A | B) = P(A ∩ B) / P(B) = P(B | A) P(A) / P(B).
Bayes’ Rule - Conditional Variants
Theorem (Bayes' Rule)
Let A and B be events such that P(B) > 0. Then
P(A | B) = P(B | A) · P(A) / P(B).
Proof.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 59 / 61
Bayes’ Rule with Law of Total Probability
Theorem (Bayes' Rule & Law of Total Probability)
Let A be an event with P(A) > 0 and {B1 , . . . , Bn } be an event space.
Then for every Bj
P(Bj | A) = P(A | Bj ) P(Bj ) / ∑_{i=1}^{n} P(A | Bi ) P(Bi ).
Proof.
Bayes’ Rule with Law of Total Probability
Example
Production on machines A, B, C . Products can be defective.
P(A) = 0.25 P(Defect | A) = 0.05
P(B) = 0.35 P(Defect | B) = 0.04
P(C ) = 0.40 P(Defect | C ) = 0.02
Having a defective product, what is the probability A produced it?
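Continuing the machine example in Python (a sketch; the variable names are mine), Bayes' rule together with the law of total probability gives the posterior probability that machine A produced the observed defective product:

```python
p_machine = {"A": 0.25, "B": 0.35, "C": 0.40}
p_defect_given = {"A": 0.05, "B": 0.04, "C": 0.02}

# evidence P(Defect) via the law of total probability
p_defect = sum(p_defect_given[m] * p_machine[m] for m in p_machine)

# posterior P(A | Defect) = P(Defect | A) * P(A) / P(Defect)
posterior_A = p_defect_given["A"] * p_machine["A"] / p_defect
print(round(posterior_A, 4))   # ≈ 0.3623
```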
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 61 / 61
Lecture 2: Random Variables
Vojtěch Řehák
based on slides of Jan Bouda
October 3, 2024
Revision
Definition
A discrete probability space is a triple (S, F , P) where
• S is a sample space, i.e. the set of all possible outcomes,
• F = 2S is a collection of all events, and
• P : F → [0, 1] is a probability function.
Definition
A probability function P on a (discrete) sample space S with a set of all
events F = 2S is a function P : F → [0, 1] such that:
• P(S) = 1, and
• for any countable sequence of mutually exclusive events A1 , A2 , . . . ,
P( ⋃_{i=1}^{∞} Ai ) = ∑_{i=1}^{∞} P(Ai ).
Definition
For P(B) ̸= 0, the conditional probabilitya of A given B is
P(A|B) = P(A ∩ B) / P(B).
If P(B) = 0, it is undefined.
a podmı́něná pravděpodobnost
Definition
Events A and B are said to be independenta if
P(A ∩ B) = P(A) · P(B).
Definition
Events A1 , A2 , . . . An are pairwise independenta iff for all distinct indices
i, j ∈ {1 . . . n} it holds that P(Ai ∩ Aj ) = P(Ai )P(Aj ).
a po dvou nezávislé
-----------
A | s1 /| s2 |
----------/-|
/ |/ |
/ /| |
B | s3 / | s4 | C
---/ ----
P(A) = ∑_{i=1}^{n} P(A | Bi ) P(Bi )
Proof.
Let A be an event and {B1 , . . . , Bn } be an event space, i.e. mutually
exclusive and collectively exhaustive.
n
∑ P(A | Bi )P(Bi ) =
i=1
=
=
=
= P(A)
Proof.
Let A be an event and {B1 , . . . , Bn } be an event space, i.e. mutually
exclusive and collectively exhaustive.
∑_{i=1}^{n} P(A | Bi ) P(Bi ) = ∑_{i=1}^{n} P(A ∩ Bi )
                             = P( ⋃_{i=1}^{n} (A ∩ Bi ) )
                             = P( A ∩ ⋃_{i=1}^{n} Bi )
                             = P(A ∩ S)
                             = P(A)
Law of Total Probability (Example)
Example
Production on machines A, B, C . Products can be defective.
P(A) = 0.25 P(Defect | A) = 0.05
P(B) = 0.35 P(Defect | B) = 0.04
P(C ) = 0.40 P(Defect | C ) = 0.02
P(Defect) = ∑_{X ∈{A,B,C }} P(Defect | X ) P(X )
Idea
The theorem of total probability is useful when we know conditional
probabilities of a property in all (exhaustive) subcases and we are
interested in (general) probability of the property.
Bayes' Rule
Theorem (Bayes' Rule)
Let A and B be events such that P(B) > 0. Then
P(A | B) = P(B | A) · P(A) / P(B).
Proof.
Traditional naming of the terms:
P(A | B) = (P(B | A) / P(B)) · P(A)   is read as   Posterior = (Likelihood / Evidence) · Prior.
We use the definition of the conditional distribution, P(A | B) = P(A ∩ B) / P(B), and the
commutativity of intersection. Hence
P(A | B) = P(A ∩ B) / P(B) = P(B | A) P(A) / P(B).
Bayes' Rule - Conditional Variants
P(A | B) = P(B | A) · P(A) / P(B).
Proof.
Bayes’ Rule with Law of Total Probability
Example
Production on machines A, B, C . Products can be defective.
P(A) = 0.25 P(Defect | A) = 0.05
P(B) = 0.35 P(Defect | B) = 0.04
P(C ) = 0.40 P(Defect | C ) = 0.02
Having a defective product, what is the probability A produced it?
Vojtěch Řehák
based on slides of Jan Bouda
October 3, 2024
Definition
For a random variable X and a real number x, we define inverse image of
x to be the event
[X = x] = {s ∈ S | X (s) = x},
i.e. the set of all sample points from S to which X assigns the value x.
Definition
Probability mass functiona (or probability distribution) of a random
variable X is a function pX : R → [0, 1] given by pX (x) = P(X = x);
for any A ⊆ R we then have P(X ∈ A) = ∑_{x∈A} pX (x).
Definition
The cumulative distribution function (probability distribution
function or simply distribution functiona ) of a random variable X is a
function FX : R → [0, 1] given by
a distribučnı́ funkce
It follows that
Cumulative Distribution Function - Properties such that F (x) = 0 for all x < u and F (x) = 1 for all x ≥ v .
A note to (F2)
If x1 ≤ x2 then (−∞, x1 ] ⊆ (−∞, x2 ] and we have
P(X ≤ x1 ) ≤ P(X ≤ x2 )
Cumulative Distribution Function • Often the initial information is “we have a random variable X with
the probability distribution pX (x)” or “we have a random variable X
with the cumulative distribution function FX (x)”.
P(A) = ∑ pX (x).
x∈A
Note that the events of S are mutually exclusive and collectively exhaustive.
Part III
1 rozdělenı́
V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 25 / 52
Discrete Uniform Probability Distribution
[Figures: the probability mass function pX (x) and the step-shaped cumulative distribution function FX (x) of a discrete uniform distribution]
pX (0) =P(X = 0) = q
pX (1) =P(X = 1) = p = 1 − q
(Each trial has two outcomes and the probability of success is the same.)
pk = P(X = k) = pX (k) = \binom{n}{k} p^k (1 − p)^{n−k} for 0 ≤ k ≤ n, and 0 otherwise.
We can apply the binomial model when the following conditions hold:
• Each trial has exactly two mutually exclusive outcomes.
• The probability of ’success’ is constant on each trial.
• The outcomes of successive trials are mutually independent.
• The name ’binomial’ comes from the equation verifying that the
probabilities sum to 1
∑_{i=0}^{n} pi = ∑_{i=0}^{n} \binom{n}{i} p^i (1 − p)^{n−i} = [p + (1 − p)]^n = 1.
B(t; n, p) = FX (t) = ∑_{i=0}^{⌊t⌋} \binom{n}{i} p^i (1 − p)^{n−i} .
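A short Python sketch of the binomial pmf and cumulative distribution B(t; n, p) (the helper names are mine; it uses only math.comb):

```python
from math import comb, floor

def binom_pmf(k, n, p):
    # P(X = k) for X ~ B(n, p)
    return comb(n, k) * p**k * (1 - p)**(n - k) if 0 <= k <= n else 0.0

def binom_cdf(t, n, p):
    # B(t; n, p) = P(X <= t)
    return sum(binom_pmf(i, n, p) for i in range(0, floor(t) + 1))

print(sum(binom_pmf(k, 10, 0.3) for k in range(11)))  # ≈ 1.0 (probabilities sum to 1)
print(binom_cdf(3, 10, 0.3))                          # ≈ 0.6496
```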
Definition
The cumulative distribution function (probability distribution
function or simply distribution functiona ) of a random variable X is a
function FX : R → [0, 1] given by
Note that the values of the density fX are not probabilities; they range over all non-negative real
numbers.
[Figures: the density fX (x) and distribution function FX (x) of the uniform distribution on [a, b],
of the normal distribution, and of the exponential distribution]
V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 42 / 52
Cumulative Distribution Function - Why?
Question: Do we need cumulative distribution function?
Suggestion: Let us use either probability function or density function!
Answer: The suggestion is wrong!
Cumulative distribution function is the only way how to specify probability
when the discrete and continuous behavior is combined.
See, e.g., the following example where P(X = 0) = 1/2 and the remaining
probability is uniformly distributed on the interval (a, b).
[Figure: the CDF FX (x) with a jump of height 1/2 at x = 0 and a linear part on (a, b)]
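A minimal Python sketch of such a mixed CDF (assuming 0 < a < b; the concrete values a = 1, b = 3 are only for illustration):

```python
# Mixed CDF from the example: P(X = 0) = 1/2, remaining mass uniform on (a, b).
def F_X(x, a=1.0, b=3.0):
    if x < 0:
        return 0.0
    if x <= a:
        return 0.5                               # the jump at 0 contributes 1/2
    if x < b:
        return 0.5 + 0.5 * (x - a) / (b - a)     # continuous uniform part
    return 1.0

print(F_X(-1), F_X(0), F_X(2), F_X(5))           # 0.0 0.5 0.75 1.0
```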
V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 43 / 52
Part V
Definition
Let X1 , X2 , . . . , Xr be discrete random variables defined on a sample space
S.
The random vector X = (X1 , X2 , . . . , Xr ) is an r -dimensional vector-valued
function X : S → Rr with X(s) = x = (x1 , x2 , . . . , xr ), where
Definition
The joint (or compound) probability distributiona of a random vector X
is defined to be
Definition
The joint (or compound) distribution functiona of a random vector X is
defined to be
FX (x) = P(X1 ≤ x1 ∧ X2 ≤ x2 ∧ · · · ∧ Xr ≤ xr ).
a simultánnı́ nebo sdružená distribučnı́ funkce
In situation when we are examining more that one random variable, the
probability distribution of a single variable is referred to as marginal
probability distribution.
Definition
Let pX (x) be a joint probability distribution of a random variable
X = (X1 , X2 ). The marginal probability distributiona of X1 is defined as
a marginálnı́ distribuce
pX (n1 , n2 , . . . , nr ) = P(X1 = n1 , X2 = n2 , . . . , Xr = nr )
                          = \binom{n}{n1 , n2 , . . . , nr } p1^{n1} p2^{n2} · · · pr^{nr}   if ∑_{i=1}^{r} ni = n, and 0 otherwise,
where \binom{n}{n1 , n2 , . . . , nr } = n! / (n1 ! n2 ! · · · nr !) is the multinomial coefficient (the number of
permutations with repetition).
• If X and Y are two independent random variables, then for any two
subsets A, B ⊆ R the events X ∈ A and Y ∈ B are independent:
FX ,Y (x, y ) = FX (x)FY (y ).
V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 51 / 52
Independent Random Variables
Lecture 2: Random Variables Definition
2024-10-03
Two discrete random variables X and Y are independent provided
• If X and Y are two independent random variables, then for any two
subsets A, B ⊆ R the events X ∈ A and Y ∈ B are independent:
FX ,Y (x, y ) = FX (x)FY (y ).
To see this
P(X ∈ A ∧ Y ∈ B) = ∑_{x∈A} ∑_{y ∈B} pX ,Y (x, y )
                 = ∑_{x∈A} ∑_{y ∈B} pX (x) pY (y )
                 = ∑_{x∈A} pX (x) ∑_{y ∈B} pY (y )
                 = P(X ∈ A) · P(Y ∈ B).
Definition
Let X1 , . . . , Xr be discrete random variables with probability distributions
pX1 , . . . , pXr . These random variables are pairwise independent iff
∀1 ≤ i < j ≤ r , ∀xi ∈ Im(Xi ), xj ∈ Im(Xj ), pXi ,Xj (xi , xj ) = pXi (xi )pXj (xj ).
Definition
Let X1 , . . . , Xr be discrete random variables with probability distributions
pX1 , . . . , pXr . These random variables are mutually independent iff
∀2 ≤ k ≤ r , ∀{i1 , . . . , ik } ⊆ {1, . . . , r },
Note that the random variables are pairwise/mutually independent iff the
events [X1 = x1 ], . . . , [Xr = xr ] are so, for all x1 ∈ Im(X1 ), . . . , xr ∈ Im(Xr ).
Vojtěch Řehák
based on slides of Jan Bouda
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 1 / 47
Part I
Revision
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 2 / 47
Few Notes on Conditional Probability
Do not mix apples and oranges:
P(A|B) + P(C )     (crossed out: probabilities conditioned on different events cannot be added directly)
Conditioning is a parameter that changes (scales) the probability function.
To combine probabilities with different conditions, we need to rescale them first:
P(A|B) · P(B) + P(C )
Conditional probability P(A|B) under an additional condition X :
P(A|B ∩ X )
When X is known, we have a conditional probability function:
P(\overline{A}|X ) = 1 − P(A|X )
Bayes theorem: P(A|B ∩ X ) = P(B|A ∩ X ) · P(A|X ) / P(B|X )
Law of T.P.: P(A|X ) = ∑_{i} P(A|Bi ∩ X ) · P(Bi |X )
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 3 / 47
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 3 / 47
Few Notes on Conditional Independence
Let us have a six-sided die, i.e. the sample space is {1, 2, 3, 4, 5, 6} and the
probability mass function is uniform.
Let A = {1, 2, 5}, B = {2, 4, 6}, X = {1, 2, 3, 4}.
Note that A and B are not independent, as
P(A ∩ B) = 1/6 ̸= 1/2 · 1/2 = P(A) · P(B).
Given X , A and B are independent, as
P(A ∩ B|X ) = 1/4 = 1/2 · 1/2 = P(A|X ) · P(B|X ).
Given ¬X , A and B are not independent, as
P(A ∩ B|¬X ) = 0 ̸= 1/2 · 1/2 = P(A|¬X ) · P(B|¬X ).
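A quick sanity check of this example in Python (exact fractions; the helper P is mine, not from the slides):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A, B, X = {1, 2, 5}, {2, 4, 6}, {1, 2, 3, 4}

def P(event, given=S):
    # uniform distribution: P(event | given) = |event ∩ given| / |given|
    return Fraction(len(event & given), len(given))

print(P(A & B) == P(A) * P(B))                    # False: not independent
print(P(A & B, X) == P(A, X) * P(B, X))           # True:  independent given X
notX = S - X
print(P(A & B, notX) == P(A, notX) * P(B, notX))  # False: not independent given ¬X
```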
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 4 / 47
Few Notes on Conditional Independence
(Recall: P(A ∩ B) = P({2}) = 1/6 while P(A) · P(B) = 1/2 · 1/2 = 1/4, so A and B are not
independent; given X they are, since P(A ∩ B|X ) = 1/4 = P(A|X ) · P(B|X ).)
Definition
A (discrete) random variable X on a sample space S is a function
X : S → R that assigns a real number X (s) to each sample point s ∈ S.
Definition
Probability mass function of X is a function pX : R → [0, 1] given by pX (x) = P(X = x).
Definition
Cumulative distribution function of X is a function FX : R → [0, 1] given
by
FX (x) = P(X ≤ x) = ∑_{t≤x} pX (t).
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 5 / 47
Continuous Random Variable - Revision
Definition
A (continuous) random variable X on a sample space S is a function
X : S → R that assigns a real number X (s) to each sample point s ∈ S
such that
{s | X (s) ≤ r } ∈ F for each r ∈ R.
Definition
Cumulative distribution function of X is a function FX : R → [0, 1] given
by FX (x) = P(X ≤ x).
Definition
Probability density function fX of X is the derivative function of FX .
P(X ≤ x) = FX (x) = ∫_{−∞}^{x} fX (t) dt
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 6 / 47
Part II
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 7 / 47
Discrete Random Vectors - Motivation
Definition
Let X1 , X2 , . . . , Xr be discrete random variables defined on a sample space
S.
The random vector X = (X1 , X2 , . . . , Xr ) is an r -dimensional vector-valued
function X : S → Rr with X(s) = x = (x1 , x2 , . . . , xr ), where
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 8 / 47
Discrete Random Vectors
Definition
The joint (or compound) probability distributiona of a random vector X
is defined to be
Definition
The joint (or compound) distribution functiona of a random vector X is
defined to be
FX (x) = P(X1 ≤ x1 ∧ X2 ≤ x2 ∧ · · · ∧ Xr ≤ xr ).
a simultánnı́ nebo sdružená distribučnı́ funkce
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 9 / 47
Marginal Probability Distributions
In situation when we are examining more that one random variable, the
probability distribution of a single variable is referred to as marginal
probability distribution.
Definition
Let pX (x) be a joint probability distribution of a random variable
X = (X1 , X2 ). The marginal probability distributiona of X1 is defined as
a marginálnı́ distribuce
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 10 / 47
Multinomial Probability Distribution
pX (n1 , n2 , . . . , nr ) = P(X1 = n1 , X2 = n2 , . . . , Xr = nr )
                          = \binom{n}{n1 , n2 , . . . , nr } p1^{n1} p2^{n2} · · · pr^{nr}   if ∑_{i=1}^{r} ni = n, and 0 otherwise,
where \binom{n}{n1 , n2 , . . . , nr } = n! / (n1 ! n2 ! · · · nr !) is the multinomial coefficient (the number of
permutations with repetition).
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 11 / 47
Multinomial Probability Distribution
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 12 / 47
Part III
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 13 / 47
Independent Random Variables
Definition
Two discrete random variables X and Y are independent provided
pX ,Y (x, y ) = pX (x) · pY (y ) for all x, y ∈ R.
• If X and Y are two independent random variables, then for any two
subsets A, B ⊆ R the events X ∈ A and Y ∈ B are independent:
FX ,Y (x, y ) = FX (x)FY (y ).
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 14 / 47
Independent Random Variables
Lecture 3: Expectation and Markov Ineq. Definition
2024-10-16
Two discrete random variables X and Y are independent provided
• If X and Y are two independent random variables, then for any two
subsets A, B ⊆ R the events X ∈ A and Y ∈ B are independent:
FX ,Y (x, y ) = FX (x)FY (y ).
To see this
P(X ∈ A ∧ Y ∈ B) = ∑ ∑ pX ,Y (x, y )
x∈A y ∈B
= ∑ ∑ pX (x)pY (y )
x∈A y ∈B
= ∑ pX (x) ∑ pY (y )
x∈A y ∈B
Definition
Let X1 , . . . , Xr be discrete random variables with probability distributions
pX1 , . . . , pXr . These random variables are pairwise independent iff
∀1 ≤ i < j ≤ r , ∀xi ∈ Im(Xi ), xj ∈ Im(Xj ), pXi ,Xj (xi , xj ) = pXi (xi )pXj (xj ).
Definition
Let X1 , . . . , Xr be discrete random variables with probability distributions
pX1 , . . . , pXr . These random variables are mutually independent iff
∀2 ≤ k ≤ r , ∀{i1 , . . . , ik } ⊆ {1, . . . , r }, ∀xi1 ∈ Im(Xi1 ), . . . , xik ∈ Im(Xik ):
pXi1 ,...,Xik (xi1 , . . . , xik ) = pXi1 (xi1 ) · · · pXik (xik ).
Note that the random variables are pairwise/mutually independent iff the
events [X1 = x1 ], . . . , [Xr = xr ] are so, for all x1 ∈ Im(X1 ), . . . , xr ∈ Im(Xr ).
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 15 / 47
Lecture 3: Expectation and Markov Inequality
Vojtěch Řehák
based on slides of Jan Bouda
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 16 / 47
Part IV
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 17 / 47
Expectation
• Often we need a shorter description than PMF or CDF - single
number, or a few numbers.
• First such characteristic describing a random variable is the
expectation, also known as the mean value.
Definition
Expectationa of a random variable X is defined as
E (X ) = ∑_{x∈Im(X )} x · P(X = x)
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 18 / 47
Expectation
Lecture 3: Expectation and Markov Ineq. • Often we need a shorter description than PMF or CDF - single
number, or a few numbers.
2024-10-16
• First such characteristic describing a random variable is the
expectation, also known as the mean value.
Definition
Expectationa of a random variable X is defined as
E (X ) = ∑ x · P(X = x)
x∈Im(X )
Expectation provided the sum is absolutely convergent. In case the sum is convergent,
but not absolutely convergent, we say that no finite expectation exists. In
case the sum is not convergent, the expectation has no meaning.
a střednı́ hodnota
• The expectation of X is
E (X ) = ∑_{x∈Im(X )} x · P(X = x) = ∑_{x∈Im(X )} x · (1/n) = (1/n) ∑_{i=1}^{n} xi
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 19 / 47
Bernoulli Probability Distribution
pX (0) =P(X = 0) = 1 − p
pX (1) =P(X = 1) = p
• The expectation is
E (X ) = ∑ x · P(X = x) = 0 · (1 − p) + 1 · p = p
x∈Im(X )
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 20 / 47
Indicator Random Variable
Indicator random variable of an event A.
• The indicator of an event A is the random variable IA defined by
IA (s) = 1 if s ∈ A, and IA (s) = 0 otherwise.
• The expectation
E (X ) = 1 · c = c
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 22 / 47
Geometric Probability Distribution
Geometric probability distribution has one parameter p and models the
number of Bernoulli trials until the first ’success’ occurs.
• The probability function is pX (x) = P(X = x) = (1 − p)^{x−1} p for x ∈ {1, 2, . . . }, and 0 otherwise.
• The expectation is
E (X ) = ∑x∈Im(X ) x · P(X = x) =
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 23 / 47
Geometric Probability Distribution
Geometric probability distribution has one parameter p and models the
number of Bernoulli trials until the first 'success' occurs.
Expectation:
E (X ) = ∑_{x∈Im(X )} x · P(X = x) = ∑_{x∈Im(X )} x (1 − p)^{x−1} p.
Multiply by (1 − p):
(1 − p) E (X ) = ∑_{x∈Im(X )} x (1 − p)^x p.
Take the difference of the two equations:
E (X ) − (1 − p) E (X ) = ∑_{x∈Im(X )} ( x (1 − p)^{x−1} p − x (1 − p)^x p )
p E (X ) = p ∑_{x∈Im(X )} ( x (1 − p)^{x−1} − x (1 − p)^x )
E (X ) = ∑_{x∈Im(X )} ( x (1 − p)^{x−1} − x (1 − p)^x )
       = (1 + 2(1 − p) + 3(1 − p)^2 + · · · ) − ((1 − p) + 2(1 − p)^2 + 3(1 − p)^3 + · · · ).
Hence
E (X ) = 1 + (1 − p) + (1 − p)^2 + · · · = ∑_{x≥0} (1 − p)^x = 1 / (1 − (1 − p)) = 1/p.
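A short simulation (plain Python; the helper name is mine) that empirically confirms E(X) = 1/p:

```python
import random

def geometric_sample(p):
    """Number of Bernoulli(p) trials until the first success (support 1, 2, ...)."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

p = 0.25
samples = [geometric_sample(p) for _ in range(100_000)]
print(sum(samples) / len(samples))   # ≈ 1/p = 4
```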
Binomial Probability Distribution
• The expectation is
E (X ) = ∑x∈Im(X ) x · P(X = x) =
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 24 / 47
Binomial Probability Distribution
Binomial random variable X , denoted by B(n, p).
• The probability distribution is
pX (x) = P(X = x) = \binom{n}{x} p^x (1 − p)^{n−x} for 0 ≤ x ≤ n, and 0 otherwise.
(\binom{n}{x} is read aloud as "n choose x".)
• The expectation is
E (X ) = ∑_{x∈Im(X )} x P(X = x)
       = ∑_{x=0}^{n} x \binom{n}{x} p^x (1 − p)^{n−x}
       = ∑_{x=1}^{n} x \binom{n}{x} p^x (1 − p)^{n−x}
       = ∑_{x=1}^{n} x · n! / (x! (n − x)!) · p^x (1 − p)^{n−x}
       = ∑_{x=1}^{n} n · (n − 1)! / ((x − 1)! (n − x)!) · p^x (1 − p)^{n−x}
       = ∑_{x=1}^{n} n \binom{n−1}{x−1} p^x (1 − p)^{n−x}
       = np ∑_{x=1}^{n} \binom{n−1}{x−1} p^{x−1} (1 − p)^{(n−1)−(x−1)}
       = np ∑_{y=0}^{n−1} \binom{n−1}{y} p^y (1 − p)^{(n−1)−y}
       = np ∑_{y=0}^{m} \binom{m}{y} p^y (1 − p)^{m−y}          (with m = n − 1)
       = np (p + 1 − p)^m
       = np.
Alternatively, X is a sum of n independent Bernoulli trials, each with mean p, so E (X ) = np.
To use this easier way, we need to define independence of random variables; hence, we need
more random variables.
Part V
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 25 / 47
Sum of Independent Random Variables
Idea
Observe that:
• Sum of two Bernoulli distributions with probability of success p is a
binomial distribution B(2, p).
• Sum of two binomial distributions B(n1 , p) and B(n2 , p) is a binomial
distribution B(n1 + n2 , p).
Let Z = X + Y be a new random variable defined as the sum of X and Y . If X and Y are
non-negative integer valued, the probability distribution of Z is
pZ (t) = pX +Y (t) = ∑_{x=0}^{t} pX (x) pY (t − x).
Idea
Observe that:
• Sum of two Bernoulli distributions with probability of success p is a
binomial distribution B(2, p).
• Sum of two binomial distributions B(n1 , p) and B(n2 , p) is a binomial
distribution B(n1 + n2 , p).
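A small sketch of the convolution formula above in Python (the helper convolve is mine), confirming that the sum of two Bernoulli(p) variables has the B(2, p) distribution:

```python
def convolve(p_X, p_Y):
    """Distribution of X + Y for independent, non-negative integer valued X and Y,
    given as lists where p_X[x] = P(X = x)."""
    p_Z = [0.0] * (len(p_X) + len(p_Y) - 1)
    for x, px in enumerate(p_X):
        for y, py in enumerate(p_Y):
            p_Z[x + y] += px * py
    return p_Z

p = 0.3
bern = [1 - p, p]                 # Bernoulli(p)
print(convolve(bern, bern))       # [0.49, 0.42, 0.09] = B(2, 0.3)
```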
• The expectation is
E (X ) = ∑x∈Im(X ) xP(X = x) =
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 27 / 47
Negative Binomial Distribution
Negative binomial distribution has two parameters p and r and
expresses the number of Bernoulli trials up to the r-th success. Hence, it is a
convolution of r geometric distributions.
• The probability function is
pX (x) = P(X = x) = \binom{x−1}{r−1} p^r (1 − p)^{x−r} for x ≥ r , and 0 otherwise.
• The expectation is
E (X ) = ∑_{x∈Im(X )} x P(X = x) = r /p.
Why is the expectation r /p? It is a sum of r independent geometric variables, each with mean 1/p.
Is it true that the expectation of a convolution is always the sum of the expectations?
Linearity of Expectation
E (X + Y ) = E (X ) + E (Y ).
Proof.
E (X + Y ) =
= E (X ) + E (Y ).
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 28 / 47
Linearity of Expectation
Theorem (Expectation of sum)
Let X and Y be random variables. Then
E (X + Y ) = E (X ) + E (Y ).
Proof.
E (X + Y ) = ∑_{x∈Im(X )} ∑_{y ∈Im(Y )} (x + y ) · P(X = x, Y = y )
           = ∑_{x} ∑_{y} x · P(X = x, Y = y ) + ∑_{y} ∑_{x} y · P(X = x, Y = y )
           = ∑_{x} x ∑_{y} P(X = x, Y = y ) + ∑_{y} y ∑_{x} P(X = x, Y = y )
           = ∑_{x∈Im(X )} x · P(X = x) + ∑_{y ∈Im(Y )} y · P(Y = y )
           = E (X ) + E (Y ).
Hence,
E (cX ) = c E (X ).
In general, the above theorem implies the following result for any linear
combination of n random variables:
Corollary (Linearity of expectation)
Let X1 , X2 , . . . , Xn be random variables and c1 , c2 , . . . , cn ∈ R constants. Then
E ( ∑_{i=1}^{n} ci Xi ) = ∑_{i=1}^{n} ci E (Xi ).
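A tiny empirical illustration (plain Python; the die distribution is an arbitrary illustrative choice) that linearity of expectation needs no independence:

```python
import random

# X1 = X2 (fully dependent), yet E(X1 + X2) = E(X1) + E(X2).
trials = 100_000
total = 0.0
for _ in range(trials):
    x = random.randint(1, 6)
    total += x + x              # X1 + X2 with X2 = X1
print(total / trials)           # ≈ 3.5 + 3.5 = 7.0
```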
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 29 / 47
Expectation of Independent Random Variables
Theorem
If X and Y are independent random variables, then
E (XY ) = E (X )E (Y ).
Proof.
E (XY ) =
= E (X )E (Y ).
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 30 / 47
Expectation of Independent Random Variables
Theorem
If X and Y are independent random variables, then E (XY ) = E (X ) E (Y ).
Proof.
E (XY ) = ∑_{x∈Im(X )} ∑_{y ∈Im(Y )} x · y · P(X = x, Y = y )
        = ∑_{x∈Im(X )} ∑_{y ∈Im(Y )} x · y · P(X = x) · P(Y = y )
        = ∑_{x∈Im(X )} x · P(X = x) · ∑_{y ∈Im(Y )} y · P(Y = y )
        = E (X ) · E (Y ).
Observation
Theorem
If Im(X ) ⊆ N then E (X ) = ∑_{i=1}^{∞} P(X ≥ i).
Proof.
E (X ) =
=
=
=
∞
= ∑ P(X ≥ i).
i=1
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 31 / 47
Observation
Theorem
If Im(X ) ⊆ N then E (X ) = ∑_{i=1}^{∞} P(X ≥ i).
Proof.
E (X ) = ∑_{x∈Im(X )} x · P(X = x)
       = ∑_{x=1}^{∞} x · P(X = x)
       = ∑_{x=1}^{∞} ∑_{i=1}^{x} P(X = x)          (the summands are non-negative)
       = ∑_{i=1}^{∞} ∑_{x=i}^{∞} P(X = x)          (we can swap the sums due to the (Fubini-)Tonelli theorem)
       = ∑_{i=1}^{∞} P(X ≥ i).
For non-negative continuous variables, E (X ) = ∫_{0}^{∞} P(X ≥ t) dt.
Part VI
Conditional Expectation
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 32 / 47
Conditional probability
Definition
The conditional probability distribution of random variable X given
random variable Y (where pX ,Y (x, y ) is their joint distribution) is
pX |Y (x|y ) = P(X = x|Y = y ) = P(X = x, Y = y ) / P(Y = y ) = pX ,Y (x, y ) / pY (y )
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 33 / 47
Conditional expectation
We may consider Y |(X = x) to be a new random variable that is given by
the conditional probability distribution pY |X . Therefore, we can define its
mean.
Definition
The conditional expectation of Y given X = x is defined as
E (Y |X = x) = ∑_{y ∈Im(Y )} y · P(Y = y |X = x).
It then holds (theorem of total expectation) that
E (Y ) = ∑_{x} E (Y |X = x) pX (x).
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 34 / 47
Conditional expectation
We may consider Y |(X = x) to be a new random variable that is given by
the conditional probability distribution pY |X . Therefore, we can define its
mean.
∑_{x} E (Y |X = x) pX (x) = ∑_{y} ∑_{x} y · P(Y = y |X = x) · P(X = x)
                         = ∑_{y} y · ∑_{x} P(Y = y |X = x) · P(X = x)
                         = ∑_{y} y · P(Y = y )
                         = E (Y ).
Example: Random sums
Example
Let N, X1 , X2 , . . . be mutually independent random variables. Let us
suppose that X1 , X2 , . . . have identical probability distribution pX . Hence,
they have common mean E (X ). We also know the mean value E (N).
Compute the expectation of the random variable T defined as
T = X1 + X2 + · · · + XN .
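A short simulation of such a random sum (plain Python; the distributions of N and Xi are arbitrary illustrative choices, not from the slides), empirically confirming the result E(T) = E(N) · E(X) that this example leads to:

```python
import random

# Illustrative choices: N ~ B(10, 0.5) with E(N) = 5, Xi ~ Uniform{1,...,6} with E(X) = 3.5.
def sample_T():
    N = sum(random.random() < 0.5 for _ in range(10))
    return sum(random.randint(1, 6) for _ in range(N))

samples = [sample_T() for _ in range(100_000)]
print(sum(samples) / len(samples))   # ≈ E(N)·E(X) = 5 · 3.5 = 17.5
```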
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 35 / 47
Part VII
Markov Inequality
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 36 / 47
Markov Inequality
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 37 / 47
Markov Inequality
It is important to derive as much information as possible even from a
partial description of a random variable. The mean value already gives more
information than one might expect, as captured by the Markov inequality.
Theorem (Markov inequality)
Let X be a non-negative random variable and t > 0. Then P(X ≥ t) ≤ E (X )/t.
Proof.
Let t > 0. We define a random variable Yt (for fixed t) as
Yt = 0 if X < t, and Yt = t if X ≥ t.
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 38 / 47
Markov Inequality: Example
Example
Bound the probability of obtaining more than 75 heads in a sequence of
100 fair coin flips.
Let X = ∑_{i=1}^{100} Xi be the number of heads in 100 coin flips, where Xi indicates that
the i-th flip comes up heads. Note that E (Xi ) = 1/2, and E (X ) = 100/2 = 50.
Using the Markov inequality, we get
P(X ≥ 75) ≤ E (X )/75 = 50/75 = 2/3.
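For comparison, a short Python computation of the exact tail probability (using math.comb) against the Markov bound from the example:

```python
from math import comb

# Exact P(X >= 75) for X ~ B(100, 1/2) vs. the Markov bound E(X)/75 = 2/3.
p_exact = sum(comb(100, k) for k in range(75, 101)) / 2**100
print(p_exact)        # ≈ 3e-07 -- the Markov bound below is extremely loose here
print(50 / 75)        # 0.666...
```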
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 39 / 47
Markov Inequality: Example - parameterized
Example
Bound the probability of obtaining more than 34 n heads in a sequence of n
fair coin flips.
P(Yn ≥ (3/4) n) ≤ E (Yn ) / ((3/4) n) = (n/2) / (3n/4) = 2/3.
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 40 / 47
Markov Inequality: Example - parameterized
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 41 / 47
Expectation is not everything
But
• P(X ≥ 32) = 1 − ∑_{i=1}^{5} (1/2)^i = (1/2)^5 = 0.03125
• P(X ≥ 1024) = (1/2)^{10} = 0.000976562
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 42 / 47
Expectation is not everything
Lecture 3: Expectation and Markov Ineq. Example (St. Petersburg Lottery)
A casino offers a game of chance for a single player in which a fair coin is
tossed at each stage. The pot starts at $1 and is doubled every time a
head appears. The first time a tail appears, the game ends and the player
wins whatever is in the pot. What would be a fair price to pay for entering
the game?
But
E (X ) = ∑_{i=1}^{∞} (1/2)^i 2^{i−1} = ∑_{i=1}^{∞} 1/2 = ∞.
Example
Program A terminates in 1 hour on 99% of inputs and in 1000 hours on
1% of inputs.
Program B terminates in 10.99 hours on every input.
Which one would you prefer?
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 43 / 47
Part VIII
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 44 / 47
Moments
Definition
The k th moment of a random variable X is defined as E (X k ).
Theorem
If X and Y are random variables with matching corresponding moments of
all orders, i.e. ∀k E (X k ) = E (Y k ), then X and Y have the same
distributions.
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 45 / 47
Variance - the second central moment
Definition
The second central moment is known as the variancea of X and defined as
µ2 = E [X − E (X )]2 .
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 46 / 47
Variance
Theorem
Let Var (X ) be the variance of the random variable X . Then
Var (X ) = E (X 2 ) − [E (X )]2 .
Proof.
Var (X ) =
= E (X 2 ) − [E (X )]2 .
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 47 / 47
Variance
Lecture 3: Expectation and Markov Ineq. Theorem
2024-10-16
Let Var (X ) be the variance of the random variable X . Then
Var (X ) = E (X 2 ) − [E (X )]2 .
Proof.
Var (X ) =
Variance =
= E (X 2 ) − [E (X )]2 .
Var (X ) = E ([X − E (X )]^2 ) = E (X^2 − 2X E (X ) + [E (X )]^2 )
         = E (X^2 ) − E (2X E (X )) + [E (X )]^2
         = E (X^2 ) − 2[E (X )]^2 + [E (X )]^2
         = E (X^2 ) − [E (X )]^2 .
Lecture 4: Variance and Chebyshev Inequality
Vojtěch Řehák
based on slides of Jan Bouda
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 1 / 36
Part I
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 2 / 36
Expectation
E (aX + bY + c) = aE (X ) + bE (Y ) + c.
For a convex function f , E [f (X )] ≥ f (E (X )) (Jensen's inequality).
Theorem
If X and Y are independent random variables, then
E (XY ) = E (X )E (Y ).
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 3 / 36
Expectation
Lecture 4: Variance and Chebyshev Ineq. Theorem (Linearity of expectation)
E (aX + bY + c) = aE (X ) + bE (Y ) + c.
E [f (X )] ≥ f (E (X )).
Expectation Theorem
If X and Y are independent random variables, then
E (XY ) = E (X )E (Y ).
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 4 / 36
Markov Inequality
Theorem (Markov inequality)
Let X be a non-negative random variable and t > 0. Then P(X ≥ t) ≤ E (X )/t.
Proof.
Let t > 0. We define a random variable Yt (for fixed t) as
Yt = 0 if X < t, and Yt = t if X ≥ t.
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 5 / 36
Markov Inequality: Example
Example
Bound the probability of obtaining more than 75 heads in a sequence of
100 fair coin flips.
Let X = ∑_{i=1}^{100} Xi be the number of heads in 100 coin flips, where Xi indicates that
the i-th flip comes up heads. Note that E (Xi ) = 1/2, and E (X ) = 100/2 = 50.
Using the Markov inequality, we get
P(X ≥ 75) ≤ E (X )/75 = 50/75 = 2/3.
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 6 / 36
Markov Inequality: Example - parameterized
Example
Bound the probability of obtaining more than 34 n heads in a sequence of n
fair coin flips.
P(Yn ≥ (3/4) n) ≤ E (Yn ) / ((3/4) n) = (n/2) / (3n/4) = 2/3.
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 7 / 36
Markov Inequality: Example - parameterized
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 8 / 36
Expectation is not everything
But
• P(X ≥ 32) = 1 − ∑_{i=1}^{5} (1/2)^i = (1/2)^5 = 0.03125
• P(X ≥ 1024) = (1/2)^{10} = 0.000976562
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 9 / 36
Expectation is not everything
Lecture 4: Variance and Chebyshev Ineq. Example (St. Petersburg Lottery)
A casino offers a game of chance for a single player in which a fair coin is
tossed at each stage. The pot starts at $1 and is doubled every time a
head appears. The first time a tail appears, the game ends and the player
wins whatever is in the pot. What would be a fair price to pay for entering
the game?
But
E (X ) = ∑_{i=1}^{∞} (1/2)^i 2^{i−1} = ∑_{i=1}^{∞} 1/2 = ∞.
Example
Program A terminates in 1 hour on 99% of inputs and in 1000 hours on
1% of inputs.
Program B terminates in 10.99 hours on every input.
Which one would you prefer?
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 10 / 36
Part II
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 11 / 36
Moments
Definition
The k th moment of a random variable X is defined as E (X k ).
Theorem
If X and Y are random variables with matching corresponding moments of
all orders, i.e. ∀k E (X k ) = E (Y k ), then X and Y have the same
distributions.
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 12 / 36
Variance - the second central moment
Definition
The second central moment is known as the variancea of X and defined as
µ2 = E [X − E (X )]2 .
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 13 / 36
Variance
Theorem
Let Var (X ) be the variance of the random variable X . Then
Var (X ) = E (X 2 ) − [E (X )]2 .
Proof.
Var (X ) =
= E (X 2 ) − [E (X )]2 .
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 14 / 36
Variance
Lecture 4: Variance and Chebyshev Ineq. Theorem
2024-10-16
Let Var (X ) be the variance of the random variable X . Then
Var (X ) = E (X 2 ) − [E (X )]2 .
Proof.
Var (X ) =
Variance =
= E (X 2 ) − [E (X )]2 .
Var (X ) = E ([X − E (X )]^2 ) = E (X^2 − 2X E (X ) + [E (X )]^2 )
         = E (X^2 ) − E (2X E (X )) + [E (X )]^2
         = E (X^2 ) − 2[E (X )]^2 + [E (X )]^2
         = E (X^2 ) − [E (X )]^2 .
Variance - properties
Theorem
Let X be a random variable and a and b be real numbers. Then Var (aX + b) = a^2 Var (X ).
Proof.
Var (aX + b) =
= a2 Var (X ).
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 15 / 36
Variance - properties
Lecture 4: Variance and Chebyshev Ineq. Theorem
2024-10-16
Let X be a random variable and a and b be real numbers. Then
Proof.
Var (aX + b) =
Variance - properties =
= a2 Var (X ).
Var (aX + b) = E ([aX + b − aE (X ) − b]^2 )
             = E ([a(X − E (X ))]^2 )
             = E (a^2 [X − E (X )]^2 )
             = a^2 E ([X − E (X )]^2 )
             = a^2 Var (X ).
Variance of Some Distributions
Constant P(c) = 1 c 0
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 16 / 36
Variance of Some Distributions
Distribution                             E (X )       Var (X )
Uniform on a..b (n = b − a + 1)          (a + b)/2    (n^2 − 1)/12
Constant (P(c) = 1)                      c            0
Geometric (p-trials to success)          1/p          (1 − p)/p^2
Neg. binomial (p-trials to r successes)  r /p         r (1 − p)/p^2
Var (X + Y ) = E ((X + Y )^2 ) − (E (X + Y ))^2
             = E (X^2 + 2XY + Y^2 ) − (E (X ) + E (Y ))^2
             = E (X^2 ) + 2E (XY ) + E (Y^2 ) − ( (E (X ))^2 + 2E (X )E (Y ) + (E (Y ))^2 )
Definition
The quantity
E ([X − E (X )][Y − E (Y )]) = ∑_{i,j} p_{xi ,yj } [xi − E (X )] [yj − E (Y )]
is called the covariance Cov (X , Y ) of X and Y .
In particular, Cov (X , aX ) = a Var (X ).
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 17 / 36
Covariance
Definition
We define the correlation coefficient ρ(X , Y ) as the normalized
covariance, i.e.
ρ(X , Y ) = Cov (X , Y ) / √( Var (X ) Var (Y ) ).
Theorem
For any two random variables X and Y , it holds that |Cov (X , Y )| ≤ √( Var (X ) Var (Y ) ).
I.e., −1 ≤ ρ(X , Y ) ≤ 1.
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 18 / 36
Covariance
For the extreme values of ρ, consider Y = aX :
ρ(X , aX ) = Cov (X , aX ) / √( Var (X ) Var (aX ) )
           = a Var (X ) / √( Var (X ) a^2 Var (X ) )
           = a Var (X ) / ( |a| Var (X ) )
           = 1 for a > 0,   −1 for a < 0,   undefined for a = 0.
Covariance
Theorem
Let X and Y be random variables. Then the covariance
Cov (X , Y ) = E (XY ) − E (X )E (Y ).
Proof.
Cov (X , Y ) = E [X − E (X )][Y − E (Y )] =
Corollary
For independent random variables X and Y , it holds that Cov (X , Y ) = 0.
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 19 / 36
Covariance
Lecture 4: Variance and Chebyshev Ineq. Theorem
2024-10-16
Let X and Y be random variables. Then the covariance
Cov (X , Y ) = E (XY ) − E (X )E (Y ).
Proof.
Cov (X , Y ) = E [X − E (X )][Y − E (Y )] =
Covariance Corollary
For independent random variables X and Y , it holds that Cov (X , Y ) = 0.
Cov (X , Y ) = E ([X − E (X )][Y − E (Y )])
             = E (XY − Y E (X ) − X E (Y ) + E (X )E (Y ))
             = E (XY ) − E (X )E (Y ) − E (X )E (Y ) + E (X )E (Y )
             = E (XY ) − E (X )E (Y ).
Covariance
Hence,
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 20 / 36
Covariance
Lecture 4: Variance and Chebyshev Ineq.
2024-10-16 It may happen that X is completely dependent on Y and
yet the covariance is 0.
E.g., for X with Uniform({−1, 0, 1}) distribution and Y = X 2 .
Hence,
E (X ) = 0
E (Y ) = E (X 2 ) = 1/3 + 1/3 = 2/3
E (XY ) = E (X 3 ) = E (X ) = 0.
Hence,
Cov (X , Y ) = E (XY ) − E (X )E (Y ) = 0
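A two-line check of this example in Python (exact arithmetic with fractions):

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}, Y = X**2: fully dependent, yet Cov(X, Y) = 0.
xs = [-1, 0, 1]
p = Fraction(1, 3)

E_X  = sum(p * x    for x in xs)
E_Y  = sum(p * x**2 for x in xs)
E_XY = sum(p * x**3 for x in xs)   # XY = X * X**2 = X**3

print(E_XY - E_X * E_Y)            # 0
```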
Variance of Independent Variables
Theorem
If X and Y are independent random variables, then
Var(X + Y) = Var(X) + Var(Y).
Proof.
Var(X + Y) = E([(X + Y) − E(X + Y)]²) =
= E(X²) + 2E(XY) + E(Y²) − (E(X))² − 2E(X)E(Y) − (E(Y))² =
= Var(X) + Var(Y) + 2[E(XY) − E(X)E(Y)] = Var(X) + Var(Y),
since E(XY) = E(X)E(Y) for independent X and Y.
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 21 / 36
Variance
Theorem
If X and Y are (not necessarily independent) random variables, we obtain
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y).
More generally, for mutually independent random variables X_1, ..., X_n and real numbers a_1, ..., a_n, b,
Var(∑_{i=1}^n a_i X_i + b) = ∑_{i=1}^n a_i² Var(X_i).
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 22 / 36
Example
Let X1 , X2 be two independent identically distributed random variables.
Compute Var (X1 + X1 ) and Var (X1 + X2 ) and explain the difference
between X1 + X1 and X1 + X2 .
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 23 / 36
Solution: X1 + X1 = 2X1, hence Var(X1 + X1) = Var(2X1) = 4·Var(X1),
whereas, by independence, Var(X1 + X2) = Var(X1) + Var(X2) = 2·Var(X1).
Both sums have the same expectation 2·E(X1), but in X1 + X1 the two summands
are the same (fully correlated) random variable, while in X1 + X2 they are independent.
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 24 / 36
Conditional probability - recall
Definition
The conditional probability distribution of random variable Y given
random variable X (their joint distribution is pX ,Y (x, y )) is
p_{Y|X}(y|x) = P(Y = y | X = x) = P(Y = y, X = x) / P(X = x) = p_{X,Y}(x, y) / p_X(x)
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 25 / 36
Conditional expectation
Definition
The conditional expectation of Y given X = x is E(Y | X = x) = ∑_y y · p_{Y|X}(y|x).
Analogously, the conditional variance can be defined as
Var(Y | X = x) = E(Y² | X = x) − [E(Y | X = x)]².
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 26 / 36
Conditional expectation
We can derive the expectation of Y from the conditional expectations (the theorem of total moments):
E(Y) = ∑_x E(Y | X = x) p_X(x),
and, more generally, for the k-th moment,
E(Y^k) = ∑_x E(Y^k | X = x) p_X(x).
Indeed,
∑_x E(Y^k | X = x) p_X(x) = ∑_x ∑_y y^k · P(Y = y | X = x) · P(X = x)
= ∑_y y^k · ∑_x P(Y = y | X = x) · P(X = x)
= ∑_y y^k · P(Y = y)
= E(Y^k).
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 27 / 36
Example: Random sums
Example
Let N, X1 , X2 , . . . be mutually independent random variables. Let us
suppose that X1 , X2 , . . . have identical probability distribution pX , mean
E (X ), and variance Var (X ). We also know the values E (N) and Var (N).
Compute the expectation and variance of the random variable T defined as
T = X1 + X2 + · · · + XN .
Given N = n, T is a sum of n independent copies of X, hence
E(T | N = n) = n · E(X)  and  E(T² | N = n) = n · Var(X) + (n · E(X))².
By the theorem of total moments, E(T) = E(X)E(N).
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 29 / 36
Example: Random sums
Now, using the theorem of total moments, we get
E(T²) = ∑_n E(T² | N = n) · P(N = n)
= ∑_n [n·Var(X) + n²[E(X)]²] · P(N = n)
= Var(X) ∑_n n · P(N = n) + [E(X)]² ∑_n n² · P(N = n)
= Var(X)E(N) + [E(X)]²E(N²).
Finally, we obtain
Var(T) = E(T²) − [E(T)]² =
= Var(X)E(N) + [E(X)]²E(N²) − [E(X)]²[E(N)]² =
= Var(X)E(N) + [E(X)]²Var(N).
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 30 / 36
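As a sanity check of E(T) = E(X)E(N) and Var(T) = Var(X)E(N) + [E(X)]²Var(N), here is a small Monte Carlo sketch; the particular choices (X a fair die, N uniform on {1,...,10}) are ours, purely for illustration.

```python
import random
from statistics import mean, pvariance

def random_sum(rng):
    # N and the X_i are drawn independently; T = X_1 + ... + X_N
    n = rng.randint(1, 10)                              # N ~ Uniform{1,...,10}
    return sum(rng.randint(1, 6) for _ in range(n))     # X_i ~ Uniform{1,...,6}

rng = random.Random(1)
samples = [random_sum(rng) for _ in range(200_000)]

EX, VarX = 3.5, 35 / 12        # mean and variance of a fair die roll
EN, VarN = 5.5, 99 / 12        # mean and variance of Uniform{1,...,10}
print(mean(samples), EX * EN)                        # ~ E(T) = E(X)E(N)
print(pvariance(samples), VarX * EN + EX**2 * VarN)  # ~ Var(T)
```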
Chebyshev Inequality
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 31 / 36
Chebyshev Inequality
In case we know both the mean value and the variance of a random variable, we
can use a much more accurate estimate.
Theorem (Chebyshev inequality)
Let X be a random variable with finite variance. Then
P(|X − E(X)| ≥ t) ≤ Var(X)/t²,   t > 0.
Alternatively, substituting t = kσ_X = k√Var(X), we obtain
P(|X − E(X)| ≥ kσ_X) ≤ 1/k²,   k > 0,
or, alternatively, substituting X′ = X − E(X),
P(|X′| ≥ t) ≤ E(X′²)/t²,   t > 0.
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 32 / 36
Chebyshev Inequality
Proof.
We apply the Markov inequality to the nonnegative variable [X − E(X)]²
and we replace t by t² to get
P((X − E(X))² ≥ t²) ≤ E([X − E(X)]²)/t² = Var(X)/t².
We obtain the Chebyshev inequality using the fact that the events
[(X − E(X))² ≥ t²] and [|X − E(X)| ≥ t] are the same.
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 33 / 36
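To see how loose the bound can be in practice, here is a short empirical sketch (the distribution, a sum of fair coin flips, and all names are our own illustrative choices): it compares the observed tail probability with Var(X)/t².

```python
import random

def chebyshev_demo(n=100, t=10, trials=100_000, seed=0):
    """Compare P(|X - E(X)| >= t) with Var(X)/t^2 for X = #heads in n fair flips."""
    rng = random.Random(seed)
    mean, var = n / 2, n / 4
    hits = 0
    for _ in range(trials):
        x = sum(rng.random() < 0.5 for _ in range(n))
        hits += abs(x - mean) >= t
    return hits / trials, var / t**2

print(chebyshev_demo())   # empirical tail probability vs. the (much larger) Chebyshev bound
```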
Chebyshev Inequality: Example - parameterized
Let X_i be the indicator of a head in the i-th of n fair coin flips and Y_n = ∑_{i=1}^n X_i the number of heads. Then
E(X_i²) = E(X_i) = 1/2.
Then
Var(X_i) = E(X_i²) − [E(X_i)]² = 1/2 − 1/4 = 1/4
and using the independence we have
Var(Y_n) = n/4.
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 34 / 36
Chebyshev Inequality: Example - parameterized
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 35 / 36
Chebyshev Inequality: Example - parameterized
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 36 / 36
Lecture 5: Chernoff Bounds and
Laws of Large Numbers
Vojtěch Řehák
based on slides of Jan Bouda
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 1 / 29
Part I
Motivation Examples
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 2 / 29
Unknown Biased Coin
Example
Let us have a biased coin with an unknown probability p of tossing a head.
How can we estimate the probability p so that we are 99.9% sure that the
first k bits of the estimate are correct?
Repeat flips and count the frequency of heads A_n = ∑_{i=1}^n X_i / n.
We need n s.t.
P(|A_n − p| ≥ 1/2^k) ≤ 0.001.
Markov inequality: P(A_n ≥ t) ≤ E(A_n)/t
Expectation E(A_n) = ∑_{i=1}^n E(X_i)/n = E(X_i) = p.
Note that A_n is not symmetric around E(A_n) when p ≠ 1/2. Hence, we
can bound only one side, where moreover n disappears:
P((A_n − p) ≥ 1/2^k) = P(A_n ≥ 1/2^k + p) ≤ p·2^k / (1 + p·2^k)
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 3 / 29
Unknown Biased Coin
Var(A_n) = Var(∑_i X_i / n) = ∑_i Var(X_i)/n² = Var(X_i)/n = p(1 − p)/n
Chebyshev inequality: P(|A_n − E(A_n)| ≥ t) ≤ Var(A_n)/t²
P(|A_n − p| ≥ 1/2^k) ≤ 2^{2k} Var(A_n) = 2^{2k} p(1 − p)/n
We do not know p, but p(1 − p) ≤ 1/4, hence the bound is
≤ 2^{2k}/(4n) = 2^{2k−2}/n ≤ 0.001   and so   n ≥ 10³ · 2^{2k−2},
i.e. n grows exponentially in the number of digits/bits of both the probability bound
and the required precision.
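A tiny calculator for the Chebyshev-based sample size n ≥ 10³·2^{2k−2} derived above (the function name and the chosen values of k are ours).

```python
import math

def chebyshev_flips(k, error=0.001):
    """Smallest n with 2^(2k-2)/n <= error, i.e. n >= 2^(2k-2)/error."""
    return math.ceil(2 ** (2 * k - 2) / error)

for k in (1, 4, 8, 10):
    print(k, chebyshev_flips(k))   # grows like 1000 * 4^(k-1): exponential in k
```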
Unknown Biased Coin
Example
Let us have a biased coin with an unknown probability p of tossing a head.
How can we estimate the probability p so that we are 99.9% sure that the
first k significant bits of the estimate are correct?
Markov inequality:
P((A_n − p) ≥ p/2^k) = P(A_n ≥ p/2^k + p) ≤ p / (p + p/2^k) = 2^k / (1 + 2^k) < 1,
i.e. both n and p disappeared.
Chebyshev inequality: the bound now contains the factor
(1 − p)/p = 1/p − 1, which can be very large for small p.
So, we need to bound p from below.
Chernoff inequality: better bounds, let's try it.
Chernoff will also need a bound on p from below, but it is tighter.
Part II
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 5 / 29
Moments and Moment Generating Function
Idea
Markov inequality uses E (X ).
Chebyshev inequality uses Var (X ).
Chernoff bound will use moment generating functions.
Definition (Recall)
The k th moment of a random variable X is defined as E (X k ).
Definition
The moment generating function of a random variable X is
MX (t) = E (e tX ).
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 6 / 29
Moment Generating Function and Moments
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 7 / 29
Moment Generating Function and Moments
Theorem
Let M_X(t) be the moment generating function of X. Then M_X^(n)(0) = E(X^n),
i.e. the n-th derivative of M_X at t = 0 is the n-th moment of X.
Proof.
Assuming that exchanging the expectation and differentiation operands is
legitimate, we have
M_X^(n)(t) = (E(e^{tX}))^(n) = E((e^{tX})^(n)) = E(X^n e^{tX}).
OR
M_X(t) = E(e^{tX}) = E(1 + tX + (tX)²/2! + (tX)³/3! + ···).
Computing at t = 0, we get
M_X^(n)(0) = E(X^n).
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 8 / 29
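The relation M_X^(n)(0) = E(X^n) can be checked symbolically; below is a small sketch using sympy (our own choice of tool) for a Bernoulli(p) variable, whose MGF is 1 − p + p·e^t.

```python
import sympy as sp

t, p = sp.symbols('t p')
M = 1 - p + p * sp.exp(t)                     # MGF of a Bernoulli(p) random variable

first_moment = sp.diff(M, t, 1).subs(t, 0)    # E(X)   = p
second_moment = sp.diff(M, t, 2).subs(t, 0)   # E(X^2) = p
print(sp.simplify(first_moment), sp.simplify(second_moment))
```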
Moment Generating Function and Distributions
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 9 / 29
Moment Generating Function and Distributions
Theorem
If X and Y are independent random variables, then M_{X+Y}(t) = M_X(t) · M_Y(t).
Proof.
M_{X+Y}(t) = E(e^{t(X+Y)}) = E(e^{tX} e^{tY}) = E(e^{tX})E(e^{tY}) = M_X(t)M_Y(t),
using the independence of X and Y in the third equality.
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 10 / 29
Part III
Chernoff Bounds
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 11 / 29
Chernoff Bounds
The Chernoff bound for random variable X is obtained by applying the
Markov inequality to e tX for some suitably chosen t.
For any t > 0
E (e tX )
P(X ≥ a) = P(tX ≥ ta) = P(e tX ≥ e ta ) ≤ .
e ta
Similarly, for any t < 0
E (e tX )
P(X ≤ a) = P(tX ≥ ta) = P(e tX ≥ e ta ) ≤ .
e ta
While the value of t that minimizes E(e^{tX})/e^{ta} gives the best bound, in
practice we usually use the value of t that gives a convenient form.
Definition
Bounds derived using this approach are called the Chernoff bounds.
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 12 / 29
Chernoff Bound and a Sum of Poisson Trials
Poisson trials (do not confuse with Poisson random variables!!) are a
sequence of independent coin flips, but the probability of respective coin
flips differs. Bernoulli trials are a special case of the Poisson trials.
Example
Let X1 , . . . , Xn be independent Poisson trials with P(Xi = 1) = pi , and
X = ∑_{i=1}^n X_i their sum. For every δ > 0 we want to bound the probabilities
of the deviations P(X ≥ (1 + δ)E(X)) and P(X ≤ (1 − δ)E(X)).
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 13 / 29
Chernoff Bound and a Sum of Poisson Trials
We derive a bound on the moment generating function
M_{X_i}(t) = E(e^{tX_i}) = p_i e^{t·1} + (1 − p_i) e^{t·0}
= 1 + p_i(e^t − 1)
≤ e^{p_i(e^t − 1)},
using that 1 + y ≤ e^y for any real y.
The moment generating function of X is (due to the Theorem of slide 10)
E(e^{tX}) = M_X(t) = ∏_{i=1}^n M_{X_i}(t)
≤ ∏_{i=1}^n e^{p_i(e^t − 1)}
= e^{∑_{i=1}^n p_i(e^t − 1)} = e^{(e^t − 1)E(X)}
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 14 / 29
Chernoff Bound and a Sum of Poisson Trials
Theorem
Let X1 , . . . , Xn be independent Poisson trials with P(Xi = 1) = pi ,
X = ∑ni=1 Xi their sum and µ = E (X ). Then the following Chernoff
bounds hold:
1. for any δ > 0
   P(X ≥ (1 + δ)µ) ≤ (e^δ / (1 + δ)^{(1+δ)})^µ
2. for 0 < δ ≤ 1
   P(X ≥ (1 + δ)µ) ≤ e^{−µδ²/3}
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 15 / 29
Chernoff Bound and a Sum of Poisson Trials
Proof.
1. Using Markov inequality, we have that for any t > 0
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 16 / 29
Chernoff Bound and a Sum of Poisson Trials
Proof (Cont.)
2. We want to show that for any 0 < δ ≤ 1
   e^δ / (1 + δ)^{(1+δ)} ≤ e^{−δ²/3},
which will give us the result immediately. Taking the natural logarithm
of both sides we obtain the equivalent condition
   f(δ) := δ − (1 + δ) ln(1 + δ) + δ²/3 ≤ 0.
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 17 / 29
Chernoff Bound and a Sum of Poisson Trials
Proof (Cont.)
Note that f(0) = 0. We show that f is decreasing on [0, 1].
We calculate the first and second derivatives of f(δ):
f′(δ) = 1 − (1 + δ)/(1 + δ) − ln(1 + δ) + (2/3)δ = −ln(1 + δ) + (2/3)δ,
f″(δ) = −1/(1 + δ) + 2/3.
We see that f″(δ) < 0 for 0 ≤ δ < 1/2 and f″(δ) > 0 for δ > 1/2. Hence,
f′(δ) first decreases and then increases on [0, 1]. Since f′(0) = 0 and
f′(1) < 0, we see that f′(δ) ≤ 0 on [0, 1]. Since f(0) = 0, it follows that
f(δ) ≤ 0 on [0, 1] as well, which completes the proof.
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 18 / 29
Chernoff Bound and a Sum of Poisson Trials
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 19 / 29
Chernoff Bound and a Sum of Poisson Trials
Theorem
Let X1 , . . . , Xn be independent Poisson trials with P(Xi = 1) = pi ,
X = ∑ni=1 Xi their sum and µ = E (X ). Then for 0 < δ ≤ 1
1. P(X ≤ (1 − δ)µ) ≤ (e^{−δ} / (1 − δ)^{(1−δ)})^µ
2. P(X ≤ (1 − δ)µ) ≤ e^{−µδ²/2}
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 20 / 29
Chernoff Bound and a Sum of Poisson Trials
Corollary
Let X1 , . . . , Xn be independent Poisson trials and X = ∑ni=1 Xi . For
0 < δ ≤ 1,
P(|X − E(X)| ≥ δE(X)) ≤ 2e^{−E(X)δ²/3}
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 21 / 29
Chernoff Bound and a Sum of Poisson Trials
Back to the unknown biased coin (first k significant bits):
Chernoff bound: for δ = 1/2^k,
P(|S_n − E(S_n)| ≥ E(S_n)/2^k) ≤ 2e^{−pn·(1/2^{2k})·(1/3)}.
It is better than the Chebyshev bound 2^{2k}(1 − p)/(np), but again, the unknown probability p appears.
Without bounding p from below, we cannot compute any bound.
Explanation: imagine that no head appears after many trials; then, with
high probability, p is smaller than some bound, but we know nothing about how
small it is, and hence we do not know what the first significant bit of p is.
Chernoff Bound: Example - parameterized
Example (Coin flipping re-revisited)
Bound the probability of obtaining more than (3/4)n heads in a sequence of n
fair coin flips.
Using the corollary with δ = 1/2 and E(Y_n) = n/2,
P(Y_n ≥ 3n/4) ≤ P(|Y_n − n/2| ≥ n/4) ≤ 2e^{−n/24}.
Recall that from the Chebyshev inequality:
P(Y_n ≥ 3n/4) ≤ 4/n.
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 22 / 29
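To see how much tighter 2e^{−n/24} is than 4/n, here is a short numeric comparison together with a Monte Carlo estimate of the actual probability (a sketch; the parameter choices and names are ours).

```python
import math, random

def compare(n, trials=200_000, seed=0):
    rng = random.Random(seed)
    empirical = sum(
        sum(rng.random() < 0.5 for _ in range(n)) >= 3 * n / 4
        for _ in range(trials)
    ) / trials
    return empirical, 2 * math.exp(-n / 24), 4 / n   # empirical, Chernoff, Chebyshev

for n in (40, 80, 160):
    print(n, compare(n))
```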
Chernoff Bounds vs. Chebyshev Bounds
(Figure: plot comparing the Chernoff and Chebyshev tail bounds as functions of n.)
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 23 / 29
Part IV
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 24 / 29
(Weak) Law of Large Numbers
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 25 / 29
(Weak) Law of Large Numbers
Theorem (Weak law of large numbers)
Let X_1, X_2, ... be independent identically distributed random variables with mean µ.
Then for every ε > 0, P(|(X_1 + ··· + X_n)/n − µ| > ε) → 0 as n → ∞.
Proof.
We prove the theorem only for the special case when Var(X_k) = σ² exists.
Let A_n = (X_1 + ··· + X_n)/n. Then
P(|(X_1 + ··· + X_n)/n − µ| > ε) = P(|A_n − µ| > ε).
Hence, E(A_n) = E((1/n)∑_{i=1}^n X_i) = (1/n)E(∑_{i=1}^n X_i) = (1/n)∑_{i=1}^n E(X_i) = µ.
Moreover, Var(A_n) = Var((1/n)∑_{i=1}^n X_i) = (1/n²)Var(∑_{i=1}^n X_i) =
(1/n²)∑_{i=1}^n Var(X_i) = (n/n²)Var(X_k) = σ²/n.
The law of large numbers is a direct consequence of the Chebyshev
inequality for A_n:
P(|A_n − µ| > ε) ≤ Var(A_n)/ε² = σ²/(nε²).
Hence, as n → ∞, the right-hand side tends to 0, which gives the result.
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 26 / 29
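A short simulation sketch of the weak law: the sample mean A_n of fair die rolls concentrates around µ = 3.5, and the fraction of runs with |A_n − µ| > ε shrinks as n grows (all names and parameters here are our illustrative choices).

```python
import random

def deviation_freq(n, eps=0.1, runs=2000, seed=0):
    rng = random.Random(seed)
    bad = 0
    for _ in range(runs):
        a_n = sum(rng.randint(1, 6) for _ in range(n)) / n
        bad += abs(a_n - 3.5) > eps
    return bad / runs

for n in (10, 100, 1000, 10000):
    print(n, deviation_freq(n))   # tends to 0 as n grows
```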
Strong Law of Large Numbers
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 27 / 29
Weak vs. Strong Law of Large Numbers
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 28 / 29
Converging in probability but not a.s.
Example
Let us have a sequence of random variables Yn where
P(Y_n = 0) = 1 − 1/n   and   P(Y_n = n) = 1/n
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 29 / 29
Lectures 6:
Stochastic Processes and Markov Chains
Vojtěch Řehák
based on slides of Jan Bouda
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 1 / 29
Part I
Stochastic Processes
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 2 / 29
Stochastic Processes - Examples
Other examples:
temperature, memory consumption, queue length, program counter value,
configuration of a program execution, printer status, . . .
chemical reaction, car location, animal population, . . .
Classification:
• the time domain T can be either countable, or uncountable -
representing discrete-time or continuous-time process, resp.
• the state space of Xt can be finite, countable, or uncountable -
representing finite-state, discrete-state, or continuous-state
process.
How does the sample space look like? Are the r. v. independent?
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 4 / 29
Stochastic Process - Definition
Lecture 6: Stoch. Proc. and MC Sequence of experiments that watches “values” evolving in time.
2024-10-31
X0 . . . X1 . . . X2 . . . X3.54 . . . X5.123 . . . X5∗π . . .
Classification:
Stochastic Process - Definition • the time domain T can be either countable, or uncountable -
representing discrete-time or continuous-time process, resp.
• the state space of Xt can be finite, countable, or uncountable -
representing finite-state, discrete-state, or continuous-state
process.
How does the sample space look like? Are the r. v. independent?
Markov Chains
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 5 / 29
Markov Chain
Definition
A discrete-time stochastic process {X0, X1, X2, ...} is a Markov chain if
P(Xt = a | Xt−1 = b, Xt−2 = ct−2, ..., X0 = c0) = P(Xt = a | Xt−1 = b) = pb,a.
That is, the value of Xt may depend on the value of Xt−1, but does
not depend on the history of how we arrived at Xt−1.
This is called the Markov or memoryless property.
Note that, due to "= pb,a", it does not depend on the time t either.
Hence, a Markov chain can be drawn as an automaton (i.e., a state
diagram). Vertices are states (i.e. values) of the Markov chain and for
each non-zero pb,a there is an edge from b to a with label pb,a.
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 7 / 29
Examples
Example
Draw the automaton for St. Petersburg Lottery.
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 8 / 29
Examples
Example
Draw the automaton for a queue, when between every two subsequent
states a new customer comes with probability p and one customer
(of a non-empty queue) is served with probability q.
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 9 / 29
Markov Chain - Transition Matrix
P = [ 0    1/4  0    3/4 ]
    [ 1/2  0    1/3  1/6 ]
    [ 0    0    1    0   ]
    [ 0    1/2  1/4  1/4 ]
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 11 / 29
Markov Chain - Probability of a Run
Consider again the transition matrix
P = [ 0    1/4  0    3/4 ]
    [ 1/2  0    1/3  1/6 ]
    [ 0    0    1    0   ]
    [ 0    1/2  1/4  1/4 ]
and let p0 = P(X0 = 1). Then
P(X0 = 1, X1 = 4, X2 = 2, X3 = 3) =
P(X0 = 1)P(X1 = 4, X2 = 2, X3 = 3 | X0 = 1) =
p0 P(X1 = 4, X2 = 2, X3 = 3 | X0 = 1) =
p0 P(X1 = 4 | X0 = 1)P(X2 = 2, X3 = 3 | X1 = 4, X0 = 1) =
p0 p1,4 P(X2 = 2, X3 = 3 | X1 = 4) =
p0 p1,4 P(X2 = 2 | X1 = 4)P(X3 = 3 | X2 = 2, X1 = 4) =
p0 p1,4 p4,2 P(X3 = 3 | X2 = 2) =
p0 p1,4 p4,2 p2,3
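A minimal sketch computing the probability of a finite run from the transition matrix above (Fraction keeps the arithmetic exact; the helper function is our own illustration).

```python
from fractions import Fraction as F

P = [[F(0), F(1, 4), F(0), F(3, 4)],
     [F(1, 2), F(0), F(1, 3), F(1, 6)],
     [F(0), F(0), F(1), F(0)],
     [F(0), F(1, 2), F(1, 4), F(1, 4)]]

def run_probability(p0, path):
    """P(X0 = path[0], X1 = path[1], ...) with P(X0 = path[0]) = p0; states are 1-based."""
    prob = p0
    for a, b in zip(path, path[1:]):
        prob *= P[a - 1][b - 1]
    return prob

print(run_probability(F(1), [1, 4, 2, 3]))   # p0 * p_{1,4} * p_{4,2} * p_{2,3} = 1/8
```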
Distribution on States
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 12 / 29
Transition Matrix
• Let vector ⃗λ (t) = (λ1 (t), λ2 (t), . . . , λn (t)) denote the probability
distribution on states expressing where the process is at time t.
Note that λi (t) = P(Xt = i) and ⃗λ (0) is the initial distribution.
In other words,
⃗λ (0) · P = ⃗λ (1).
Theorem
Every (discrete-time finite-state) Markov chain can be alternatively
(uniquely) defined by an initial vector ⃗λ (0) and a transition matrix P.
Proof.
It follows from the observations mentioned above that
• λi (0) = P(X0 = i), for all i, and
• the matrix P represents the conditional distributions for all the
subsequent random variables.
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 14 / 29
Part III
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 15 / 29
Example: Bounded Reachability
P = [ 0    1/4  0    3/4 ]
    [ 1/2  0    1/3  1/6 ]
    [ 0    0    1    0   ]
    [ 0    1/2  1/4  1/4 ]
What is the probability that, starting in state 1, the chain is in state 4 after 3 steps?
The successful paths of length 3 are
1 − 2 − 1 − 4,  1 − 2 − 4 − 4,  1 − 4 − 2 − 4,  and  1 − 4 − 4 − 4.
The probability of success for each path is 3/32, 1/96, 1/16, and 3/64,
respectively. Summing up the probabilities, the total probability is 41/192.
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 16 / 29
Example: Bounded Reachability
• For any m, we define the m-step transition matrix P^(m) such that
  p^(m)_{i,j} = P(X_{t+m} = j | X_t = i),
  and it holds that P^(m) = P^m, the m-th power of the transition matrix.
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 17 / 29
Example: Bounded Reachability
Idea
How to solve reachability in arbitrary many steps?
Let us demonstrate the problem on an example.
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 18 / 29
Example: Bounded Reachability
Lecture 6: Stoch. Proc. and MC
2024-10-31
Alternatively, we can compute
P³ = [ 3/16  7/48   29/64    41/192 ]
     [ 5/48  5/24   79/144   5/36   ]
     [ 0     0      1        0      ]
     [ 1/16  13/96  107/192  47/192 ]
The entry P³[1,4] = 41/192 gives the correct answer.
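The same answer can be reproduced by raising the transition matrix to the third power; a small sketch with exact fractions (plain Python is used here just to avoid any numeric rounding).

```python
from fractions import Fraction as F

P = [[F(0), F(1, 4), F(0), F(3, 4)],
     [F(1, 2), F(0), F(1, 3), F(1, 6)],
     [F(0), F(0), F(1), F(0)],
     [F(0), F(1, 2), F(1, 4), F(1, 4)]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

P3 = matmul(matmul(P, P), P)
print(P3[0][3])   # Fraction(41, 192): probability of being in state 4 after 3 steps from state 1
```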
Definition
The hitting time of a subset A of states of a Markov chain is a random
variable H A : S → {0, 1, 2, . . . } ∪ {∞} given by
H A = inf {n ≥ 0 | Xn ∈ A}
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 19 / 29
Hitting Probability and Mean Hitting Time - Definitions
Definition
Starting from a state i, the probability of hitting A is
h_i^A = P(H^A < ∞ | X0 = i),
and the mean hitting time of A is
k_i^A = E(H^A | X0 = i)
      = ∑_{n<∞} n · P(H^A = n | X0 = i) + ∞ · P(H^A = ∞ | X0 = i),
where 0 · ∞ = 0.
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 20 / 29
Drunkard’s walk
What is the probability of reaching state 4 and what is the mean number
of steps to reach a ditch state, i.e. states 1 or 4?
(Chain: states 1, 2, 3, 4; states 1 and 4 are absorbing; from states 2 and 3 the walker moves left or right with probability 1/2 each.)
Drunkard’s walk
What is the probability of reaching state 4 and what is the mean number
of steps to reach a ditch state, i.e. states 1 or 4?
For the probability h_i of reaching state 4 from state i:
h1 = 0, h4 = 1,
h2 = 1/2·h1 + 1/2·h3 and
h3 = 1/2·h2 + 1/2·h4.
Hence,
h2 = 1/2·h3 = 1/2·(1/2·h2 + 1/2), i.e.
h2 = (4/3)·(1/4) = 1/3.
Theorem
The vector of hitting probabilities h^A = (h_1^A, h_2^A, ...) is the minimal
non-negative solution to the system of linear equations
h_i^A = 1                        for i ∈ A,
h_i^A = ∑_j p_{i,j} h_j^A        for i ∉ A.
Theorem
The vector of mean hitting times k^A = (k_1^A, k_2^A, ...) is the minimal
non-negative solution to the system of linear equations
k_i^A = 0                            for i ∈ A,
k_i^A = 1 + ∑_{j∉A} p_{i,j} k_j^A    for i ∉ A.
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 22 / 29
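The two systems can be solved mechanically; below is a sketch for the drunkard's walk above, using numpy to solve for the hitting probabilities of {4} and the mean hitting times of {1, 4}. The matrix setup is our own, not from the slides.

```python
import numpy as np

# Drunkard's walk: states 1,2,3,4; 1 and 4 absorbing; from 2 and 3 step left/right w.p. 1/2.
# Hitting probabilities of A = {4}: unknowns h2, h3 with h1 = 0, h4 = 1.
A = np.array([[1.0, -0.5],
              [-0.5, 1.0]])
b = np.array([0.0, 0.5])           # h2 = 1/2*h1 + 1/2*h3,  h3 = 1/2*h2 + 1/2*h4
print(np.linalg.solve(A, b))       # [1/3, 2/3]

# Mean hitting times of A = {1, 4}: unknowns k2, k3 with k1 = k4 = 0.
b_time = np.array([1.0, 1.0])      # k_i = 1 + sum_{j not in A} p_{i,j} k_j
print(np.linalg.solve(A, b_time))  # [2, 2]
```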
The Game - Example
• Consider two players, one has $ L1 and the other has $ L2 . Player 1
will continue to throw a fair coin, such that
– if head appears, he wins $ 1,
– if tails appears, he loses $ 1.
• Suppose the game is played until one player goes bankrupt. What is
the probability that Player 1 survives?
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 23 / 29
The Markov Chain Model
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 24 / 29
The Analysis
We can write the equations for h_j^{{L2}} (written h_j for simplicity) and solve them.
h−L1 = 0
h−L1 +1 = 1/2 · h−L1 +2 + 1/2 · h−L1
h−L1 +2 = 1/2 · h−L1 +3 + 1/2 · h−L1 +1
..
.
hL2 −1 = 1/2 · hL2 −2 + 1/2 · hL2
hL2 = 1
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 25 / 29
The Analysis (Another Solution)
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 26 / 29
The Analysis (Another Solution) cont.
• Now, let Wt denote the money Player 1 has won after t steps.
• Note that the expected value of Wt − Wt−1 is zero.
• By linearity of expectation,
E [Wt ] = 0.
• On the other hand,
  E[W_t] = ∑_j j · P_j^(t) = 0.
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 27 / 29
The Analysis (Another Solution) cont.
Let q denote the probability that Player 1 survives (i.e. wins $ L2 before going bankrupt). Then
0 = lim_{t→∞} E[W_t]
  = lim_{t→∞} ∑_j j · P_j^(t)
  = −L1 · (1 − q) + L2 · q,
and hence
q = L1 / (L1 + L2).
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 28 / 29
Markov Chain Analysis
• Transient analysis
▶ distribution after k-steps
the k-th power of the transition matrix
▶ reaching/hitting probability
equations for hitting probabilities hi
▶ (mean) hitting time
equations for hitting times ki
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 29 / 29
Lectures 7:
Long-Run Analysis of Discrete-Time Markov Chains
Vojtěch Řehák
based on slides of Jan Bouda
November 7, 2024
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 1/1
Part I
Revision
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 2/1
Definitions
Sequence of experiments that watches “values” evolving in time.
X0 . . . X1 . . . X2 . . . X3.54 . . . X5.123 . . . X5∗π . . .
Definition
A discrete-time stochastic process {X0 , X1 , X2 , . . . } is a Markov chain if
Theorem
Every (discrete-time finite-state) Markov chain can be alternatively
(uniquely) defined by an initial vector ~λ (0) and a transition matrix P.
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 3/1
Definitions
Lecture 7: Long-Run Analysis of DTMC Sequence of experiments that watches “values” evolving in time.
2024-11-06
X0 . . . X1 . . . X2 . . . X3.54 . . . X5.123 . . . X5∗π . . .
Definition
A discrete-time stochastic process {X0 , X1 , X2 , . . . } is a Markov chain if
Theorem
Every (discrete-time finite-state) Markov chain can be alternatively
(uniquely) defined by an initial vector ~λ (0) and a transition matrix P.
Definition
The hitting time of a subset A of states of a Markov chain is a random
variable H A : S → {0, 1, 2, . . . } ∪ {∞} given by H A = inf {n ≥ 0 | Xn ∈ A}
where S is the sample space of the Markov chain and ∞ is the infimum of the empty set ∅.
Definition
Starting from a state i, the probability of hitting a set of states A is
h_i^A = P(H^A < ∞ | X0 = i), and the mean hitting time of A is
k_i^A = E(H^A | X0 = i) = ∑_{n<∞} n · P(H^A = n | X0 = i) + ∞ · P(H^A = ∞ | X0 = i),
where 0 · ∞ = 0.
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 4/1
Markov Chain Analysis
• Transient analysis
I distribution after k-steps
the k-th power of the transition matrix
I reaching/hitting probability
equations for hitting probabilities hi
I (mean) hitting time
equations for hitting times ki
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 5/1
Part II
Long-Run Analysis
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 6/1
Markov Chain Analysis
• Long-run analysis
I probability of infinite hitting
I mean inter-visit time
I long-run limit distribution
I stationary (invariant) distribution
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 7/1
Transient Analysis
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 8/1
State Classification
Definition
A state of a Markov chain is said to be absorbing iff it cannot be left,
once it is entered, i.e. pi,i = 1.
Definition
A state i of a Markov chain is said to be recurrent iff, starting from state
i, the process eventually returns to state i with probability 1.
Definition
A state of a Markov chain is said to be transient (or non-recurrent) iff
there is a positive probability that the process will not return to this state.
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 9/1
Infinite Hitting - Transient States
Definition (recall)
A state i of a Markov chain is said to be transient (or non-recurrent) iff
there is a positive probability that the process will not return to this state,
i.e.
P(i =⇒+ i) < 1.
Theorem
Every transient state is visited finitely many times almost surely
(i.e. with probability 1).
Proof.
Theorem
Every transient state is visited finitely many times almost surely
Proof.
Let p be the probability of not returning to i from i, i.e. P(¬(i =⇒+ i)).
If i is transient, then p > 0.
Probability of finitely many visits of i equals
to prob. of exactly 1 visit + exactly 2 visits + . . .
= p + (1 − p) (p + (1 − p)(· · · )) = p(1 + (1 − p) + (1 − p)2 + · · · ).
The geometric series in the brackets equals to 1/(1 − (1 − p)) = 1/p.
Hence, the probability of finitely many visits = 1.
Czech proverb:
The pitcher goes so often to the well, that it is broken at last.
Infinite Hitting - Recurrent States
Definition (recall)
A state i of a Markov chain is said to be recurrent iff, starting from state
i, the process eventually returns to state i with probability 1, i.e.
P(i =⇒+ i) = 1.
Theorem
In a finite-state Markov chain, each recurrent state is almost surely either
not visited or visited infinitely many times.
Proof.
If it is visited, then it is revisited with probability one. Hence, in an infinite
run, it is visited infinitely many times with probability one.
Otherwise, it is not visited.
Theorem
In a finite-state Markov chain, a state is recurrent if and only if it is in a
bottom strongly connected component of the Markov chain graph
representation. All other states are transient.
Idea
For the sake of infinite behaviour, we will concentrate on bottom strongly
connected components only.
Definition
A Markov chain is said to be irreducible if every state can be reached from
every other state in a finite number of steps, i.e. P(i =⇒+ j) > 0 for all i, j.
Theorem
A Markov chain is irreducible if and only if its graph representation is a
single strongly connected component.
Corollary
All states of a finite-state irreducible Markov chain are recurrent.
(Example: a 3-state chain in which state 1 moves to 2 and state 2 moves to 3 with probability 1, and state 3 moves to 1 or 2 with probability 1/2 each.)
                  [ 0    1    0 ]
(0.2, 0.4, 0.4) · [ 0    0    1 ] = (0.2, 0.4, 0.4)
                  [ 1/2  1/2  0 ]
Stationary (Invariant) Distribution
Definition
Let P be the transition matrix of a Markov chain and λ be a probability
distribution on its states. If
λ P = λ,
then λ is called a stationary (invariant) distribution of the Markov chain.
Question:
How many stationary distributions can a Markov chain have?
Can it be more than one?
Can it be none?
(Recall the drunkard's walk with absorbing states 1 and 4 — it has more than one stationary distribution.)
Theorem
If a finite-state Markov chain is irreducible then there is a unique
stationary distribution.
Q: Can it be none?
Theorem
For each finite-state Markov chain, there is a stationary distribution.
Let us have the following Markov chain with its transition matrix.
(Two states s0 and s1; s0 moves to s1 with probability p and stays with probability 1 − p;
s1 moves to s0 with probability q and stays with probability 1 − q.)
P = [ 1−p   p  ]
    [  q   1−q ]
Solving π P = π yields the following system of equations:
π1(1 − p) + π2 q = π1
π1 p + π2(1 − q) = π2
π1 + π2 = 1
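A short sketch that solves π P = π together with ∑ πᵢ = 1 numerically and compares the answer with the closed form π = (q/(p+q), p/(p+q)) that this system yields; the function and parameter values are our own illustration.

```python
import numpy as np

def stationary(P):
    """Solve pi P = pi together with sum(pi) = 1 by least squares."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

p, q = 0.3, 0.2
P = np.array([[1 - p, p],
              [q, 1 - q]])
print(stationary(P), [q / (p + q), p / (p + q)])   # both give [0.4, 0.6]
```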
Theorem
Let P be a transition matrix of a finite-state Markov chain and
π = (π1, π2, ..., πn) be a stationary distribution corresponding to P.
For any state i of the Markov chain, we have
∑_{j≠i} πj Pj,i = ∑_{j≠i} πi Pi,j.
Proof.
Stationarity πP = π gives πi = ∑_j πj Pj,i = πi Pi,i + ∑_{j≠i} πj Pj,i.
Since ∑_j Pi,j = 1, we also have πi = πi Pi,i + ∑_{j≠i} πi Pi,j.
Comparing the two expressions yields ∑_{j≠i} πj Pj,i = ∑_{j≠i} πi Pi,j.
Stationary Distribution & Cut-sets
Another solution (using a cut-set formula):
Let us have a subset of states, e.g. {s1}.
(Two-state chain as above: s1 moves to s2 with probability p, s2 moves to s1 with probability q.)
Theorem
Let P be a transition matrix of a finite-state Markov chain. A distribution
π = (π1, π2, ..., πn) is a stationary distribution corresponding to P if and
only if for every subset A of states of the Markov chain it holds that
∑_{i∈A, j∉A} πi Pi,j = ∑_{i∈A, j∉A} πj Pj,i.
Proof.
"if": In particular, the condition holds for A = {i} for every state i:
∑_{j∉{i}} πi Pi,j = ∑_{j∉{i}} πj Pj,i,
which (adding πi Pi,i to both sides) is equivalent to πi = ∑_j πj Pj,i; hence the distribution is stationary.
"only if": If the distribution is stationary, the equality holds for singletons (due to the previous theorem); summing it over the singletons in A, the transitions inside A cancel, hence it also holds for larger sets.
Example - Finite Buffer Model
Example
Consider the following irreducible Markov chain model of a finite buffer.
Find a stationary distribution for its states.
Example
Consider the following irreducible Markov chain model of a finite buffer.
Find a stationary distribution for its states.
2024-11-06
Consider the following irreducible Markov chain model of a finite buffer.
Find a stationary distribution for its states.
What are the states with the highest and the lowest probability in the
stationary distribution?
Try to identify structural schema of the graph that increases the stationary
distribution?
Google PageRank
2024-11-06
Google PageRank
It is not a problem to find all web pages with a given keyword. The problem
is to identify what are to most important web pages. Here comes the story
of a random walk. Imagine that a user surfs webs by random clicks on
links that occur on pages. The higher number of visits, the higher rank
of the page. This corresponds to invariant distribution when we view the
pages as states of a MC and assign uniform distribution on the outgoing
links. To have the MC irreducible, there is a concept of restart (with a
magic probability, e.g. 1.6 % in this picture) to another state uniformly.
Mean Portion of Visited States and Inter Visit Time
Theorem (Expected long-run frequency)
Let us have a finite-state irreducible Markov chain and the unique
stationary distribution ~π . It holds that
πi = 1/mi
where
mi = E (number of steps of i =⇒+ i)
is the mean inter-visit time of state i (or expected return time to i).
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 25 / 1
Mean Portion of Visited States and Inter Visit Time
Lecture 7: Long-Run Analysis of DTMC Theorem (Expected long-run frequency)
2024-11-06
Let us have a finite-state irreducible Markov chain and the unique
stationary distribution ~π . It holds that
Time where
mi = E (number of steps of i =⇒+ i)
is the mean inter-visit time of state i (or expected return time to i).
{i}
Be careful: mi 6= ki .
and also
For example:
aperiodic periodic
Definition
A state j in a Markov chain is periodic if there exists an integer ∆ > 1
such that P(Xt+s = j | Xt = j) = 0 unless s is divisible by ∆.
A Markov chain is periodic if any state in the chain is periodic.
A state or chain that is not periodic is aperiodic.
~π = limn→∞~λ P n
Note that limn→∞~λ P n does not depend on ~λ . It is caused by the fact that
for finite-state aperiodic irreducible Markov chains, P n converges in n → ∞
to a matrix with equal rows.
Example
Classify the following infinite-state Markov chains for different values of p.
p p p p p
0 1 2 3 4 ···
1−p 1−p 1−p 1−p 1−p 1−p
p p p p p
··· -1 0 1 2 ···
1−p 1−p 1−p 1−p 1−p
1 1 1 1 1
0 1 2 3 4 ···
It is no longer true that all states are recurrent in irreducible Markov chain.
A state can be recurrent and the mean inter-visit time can be infinite.
1/2 1/2 1/2 1/2 1/2
··· -1 0 1 2 ···
1/2 1/2 1/2 1/2 1/2
Lecture 8: Continuous-Time Processes
Vojtěch Řehák
• In discrete time
X0 , X1 , X2 , X3 , X4 , X5 , . . .
I distribution on where we go in the next step
• In continuous time
Event-Driven System
• we are staying in a state and waiting for events
• the system changes its state when the first event occurs
• the subsequent state depends on the event that occurs first
newjob pjam
paper
idle printing
jam
done repaired
Markovian Property
• We would like to have a Markovian model - where the subsequent
behavior depends on the current state only.
I It depends neither on where we were going through nor when we did
the (last) step(s).
I It also does not depend on the time we have been waiting for the
next step.
• We are looking for a continuous-time variant of the geometric
distribution.
(Figures: the density f(t) and the cumulative distribution function F(t) of the exponential distribution with rate λ.)
Theorem
For an exponentially distributed random variable X and every t, t0 ≥ 0,
it holds that
P(X > t0 + t | X > t0 ) = P(X > t).
Proof.
P(X > t0 + t | X > t0) = P(X > t0 + t) / P(X > t0) = e^{−λ(t0+t)} / e^{−λt0} = e^{−λt} = P(X > t).
Definition
Continuous-Time Markov Chain (CTMC) is an event-driven system
with exponentially distributed events.
λ1 λ1 λ1 λ1
s0 s1 s2 s3 ···
λ2 λ2 λ2 λ2
CTMC Analysis
Definition
CTMC is defined by (an initial distribution λ (0) and) a rate matrix Q
where Q[i, j] is the rate of an event leading from state i to state j.
(Example: states s0 and s1 with an event of rate 3 from s0 to s1 and an event of rate 2 from s1 to s0.)
Q = [ 0  3 ]
    [ 2  0 ]
Definition
A distribution π on states of a CTMC is called stationary iff
π = λ (t), for every time t ≥ 0, when starting the CTMC in π = λ (0).
V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 15 / 30
Important Questions - Defined Formally
The important questions for a given CTMC with λ (0) and Q are:
• What is the distribution on states at a given time?
I It is to compute λ (t), for a given time t ≥ 0.
Idea
As the winning event does not depend on the winning time and vice versa,
we can separately choose the winner and the winning time.
(A state s with three competing events of rates λ1, λ2, λ3 is equivalent to a state s with a single exit rate λ and exit probabilities p1, p2, p3,)
where λ = λ1 + λ2 + λ3 and
p_i = λ_i / λ.
Definition
The rate λ = λ1 + λ2 + λ3 is called exit rate of the state s
and the probabilities p1 , p2 , p3 are called exit probabilities.
V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 17 / 30
Solution Technique 1 - Discretization
Discretization techniques
• transform the CTMC to a DTMC (e.g. based on exit probabilities)
• analyse the CTMC using the properties of the DTMC
• are useful for long-run average (steady-state) properties
1. discrete (time-abstract) frequency
I do not care about time; a move in CTMC is a step in the DTMC
I computed as the invariant distribution of the underlying DTMC
2. continuous (time) frequency
I discrete frequency weighted by the mean exit times
3. utilization and mean length of a queue
I occupancy of the queue multiplied by the (probability of) the
continuous (time) frequency
CTMC: s0 →(rate 5)→ s1 →(rate 5)→ s2,  s1 →(rate 7)→ s0,  s2 →(rate 14)→ s1.
DTMC (exit probabilities): s0 →(1)→ s1,  s1 →(5/12)→ s2,  s1 →(7/12)→ s0,  s2 →(1)→ s1.
Example (cont.)
What is the utilization (i.e. the mean queue length) of M/M/2/2 queue
with new-request rate 5 and service rate 7?
(The CTMC and its discretized DTMC are as above.)
Idea
Thanks to the memoryless property:
λ
λ1 λ1
λ2 λ2
s ≡ s
λ3 λ3
for any λ
Adding a self-loop does not change the behavior of a CTMC!
• Every CTMC can be transformed to the one with uniform exit rates.
(Find a state with the highest exit rate and added appropriate
self-loops to all others.)
• Uniform exit rates allow for easier discretization and enables
analysis of distribution on states at a given time.
V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 22 / 30
Example - Uniformization
Example
What is the utilization (i.e. the mean queue length) of M/M/2/2 queue
with new-request rate 5 and service rate 7?
The original CTMC: s0 →(5)→ s1 →(5)→ s2,  s1 →(7)→ s0,  s2 →(14)→ s1.
Its uniformized CTMC (uniform exit rate 14): add a self-loop of rate 9 to s0 and of rate 2 to s1.
The corresponding DTMC: s0 →(9/14)→ s0,  s0 →(5/14)→ s1,  s1 →(7/14)→ s0,  s1 →(2/14)→ s1,  s1 →(5/14)→ s2,  s2 →(1)→ s1.
It can be easily checked that (98/193, 70/193, 25/193) is the stationary
distribution for this DTMC. Thanks to the uniform exit rates, rebalancing by
waiting times is not needed: it is stationary also for the (uniformized) CTMC.
V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 23 / 30
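A quick check that (98/193, 70/193, 25/193) is indeed stationary for the uniformized DTMC above (transition probabilities as in the picture; exact fractions used only for clarity).

```python
from fractions import Fraction as F

P = [[F(9, 14), F(5, 14), F(0)],
     [F(7, 14), F(2, 14), F(5, 14)],
     [F(0),     F(1),     F(0)]]
pi = [F(98, 193), F(70, 193), F(25, 193)]

pi_P = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
print(pi_P == pi)   # True: pi is stationary
```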
Important Questions (recall)
PRISM - http://www.prismmodelchecker.org/
CADP - http://www.inrialpes.fr/vasy/cadp/
MRMC - http://www.mrmc-tool.org
Modest Toolset - http://www.modestchecker.net/
Storm - http://www.stormchecker.org/
PROPhESY - https://moves.rwth-aachen.de/research/tools/prophesy/
MATLAB extension - http://www.mathworks.com/products/matlab/
Lecture 9: Information Theory
Vojtěch Řehák
based on slides of Jan Bouda
H(1/n, ..., 1/n)  [n arguments]  ≤  H(1/(n+1), ..., 1/(n+1))  [n+1 arguments].
Definition
Let X be a random variable with a probability distribution
p(x) = P(X = x). Then the (Shannon) entropy of the random variable
X is defined as
H(X ) = − ∑ p(x) log p(x).
x∈Im(X )
Entropy as Expectation
Lemma
Let X be a random variable with a probability distribution
p(x) = P(X = x). Then
H(X) = E[log (1/p(X))] = −E[log p(X)].
Proof.
Note that log p(X) and log (1/p(X)) are used as random variables, and the expectation is taken with respect to p.
Lemma
Let X be a random variable with a probability distribution
p(x) = P(X = x). Then
H(X) ≥ 0.
Proof.
Since 0 ≤ p(x) ≤ 1, we have log (1/p(x)) ≥ 0 for every x ∈ Im(X), and hence H(X) = E[log (1/p(X))] ≥ 0.
Entropy
Example
What is the entropy and how should we send/encode the result of:
1. one toss of a fair coin?
2. one throw of a fair eight-sided die?
3. Horse race with winning probabilities of individual horses
1/2, 1/8, 1/8, 1/8, 1/8?
Answers:
1. H(1/2, 1/2) = 1/2 · log 2 + 1/2 · log 2 = 1 bit
2. H(1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8) = 8 · 1/8 · log 8 = 3 bits
3. H(1/2, 1/8, 1/8, 1/8, 1/8) = 1/2 · log 2 + 4 · 1/8 · log 8 = 2 bits
Let us assign shorter messages to the horses with higher winning
probability. If the assigned messages are 1, 000, 001, 010, and 011,
then the expected message length is 1/2 · 1 + 1/2 · 3 = 2 bits.
V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 11 / 36
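The three entropies in the example can be recomputed with a few lines of Python (the helper name is our own).

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(H([1/2, 1/2]))                    # 1.0 bit  (fair coin)
print(H([1/8] * 8))                     # 3.0 bits (fair eight-sided die)
print(H([1/2, 1/8, 1/8, 1/8, 1/8]))     # 2.0 bits (horse race)
```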
Part II
Definition
Let X and Y be random variables with a joint probability distribution
p(x, y ) = P(X = x, Y = y ).
We define the joint (Shannon) entropy of X and Y as
H(X, Y) = −∑_{x,y} p(x, y) log p(x, y)
or, alternatively,
H(X, Y) = −E_p[log p(X, Y)] = E_p[log (1/p(X, Y))].
Definition
Let X and Y be random variables and y ∈ Im(Y).
The conditional entropy of X given Y = y is
H(X | Y = y) = −∑_x p(x|y) log p(x|y).
Definition
Let X and Y be random variables with a joint probability distribution
p(x, y) = P(X = x, Y = y). Let us denote p(x|y) = P(X = x|Y = y).
The conditional entropy of X given Y is
H(X | Y) = ∑_y p(y) H(X | Y = y) = −∑_{x,y} p(x, y) log p(x|y) = −E_p[log p(X|Y)].
Note that the probability space in which we are computing the expectation
is based on the joint probability p(x, y)!
Theorem (Chain rule)
Let X and Y be random variables. Then
H(X, Y) = H(Y) + H(X|Y).
Proof.
H(X, Y) = −∑_{x,y} p(x, y) log p(x, y) = −∑_{x,y} p(x, y) log [p(y) p(x|y)] =
= −∑_{x,y} p(x, y) log p(y) − ∑_{x,y} p(x, y) log p(x|y) =
= H(Y) + H(X|Y).
Alternatively, we may use log p(X, Y) = log p(Y) + log p(X|Y) and take
the expectation (based on the joint probability distribution) on both sides
to get the desired result H(X, Y) = H(Y) + H(X|Y).
Conditioned Chain Rule of Conditional Entropy
Theorem
H(X, Y | Z) = H(Y | Z) + H(X | Y, Z).
Proof.
Similarly to the previous proof, we may use
log p(X, Y | Z) = log p(Y | Z) + log p(X | Y, Z)
and take the expectation (based on the joint probability distribution of all
X, Y, and Z) on both sides to get the desired result.
Definition
The relative entropy (Kullback–Leibler divergence) of distributions p and q is
D(p‖q) = ∑_x p(x) log (p(x)/q(x)) = E_p[log (p(X)/q(X))].
Idea
Wait!!!! What is the base of the expectations????
Definition
The cross-entropy of the distribution q relative to a distribution p
(on a common set of sample points Im(X)) is defined as
H_p(q) = ∑_x p(x) log (1/q(x)) = E_p[log (1/q(X))].
Theorem
I(X; Y) = H(X) − H(X|Y).
Proof.
I(X; Y) = ∑_{x,y} p(x, y) log [p(x, y) / (p(x)p(y))] = ∑_{x,y} p(x, y) log [p(x|y) / p(x)] =
= −∑_{x,y} p(x, y) log p(x) + ∑_{x,y} p(x, y) log p(x|y) =
= −∑_x p(x) log p(x) − (−∑_{x,y} p(x, y) log p(x|y)) =
= H(X) − H(X|Y).
Corollary (Inclusion-exclusion)
I (X ; Y ) = H(X ) + H(Y ) − H(X , Y ).
Proof.
We use repeated application of the chain rule for a pair of random variables
Definition
Let X , Y , and Z be random variables.
The conditional mutual information between X and Y given Z = z is
where the expectation is taken over the joint distribution p(x, y , z).
Definition
The conditional relative entropy is the average of the relative entropies
between the conditional probability distributions p_{Y|X} and q_{Y|X}, averaged
over the probability distribution p_X. Formally,
D(p_{Y|X} ‖ q_{Y|X}) = ∑_x p_X(x) ∑_y p_{Y|X}(y|x) log [p_{Y|X}(y|x) / q_{Y|X}(y|x)] = E_p[log (p_{Y|X} / q_{Y|X})].
Theorem (Chain rule for relative entropy)
D(p(x, y)‖q(x, y)) = D(p(x)‖q(x)) + D(p(y|x)‖q(y|x)).
Proof.
D(p(x, y)‖q(x, y)) = ∑_x ∑_y p(x, y) log [p(x, y) / q(x, y)] =
= ∑_x ∑_y p(x, y) log [p(x)p(y|x) / (q(x)q(y|x))] =
= ∑_{x,y} p(x, y) log [p(x) / q(x)] + ∑_{x,y} p(x, y) log [p(y|x) / q(y|x)] =
= D(p(x)‖q(x)) + D(p(y|x)‖q(y|x)).
Information inequalities
Theorem (Information inequality)
D(p‖q) ≥ 0, with equality if and only if p ≡ q.
During the proof we will use Jensen's inequality, stating for the logarithm
(a concave function) and a positive random variable X that
E[log X] ≤ log E[X].
Jensen's inequality is an equality iff the function is linear, or the function is applied to a constant random variable.
Proof.
−D(p‖q) = −E_p[log (p(x)/q(x))] = E_p[log (q(x)/p(x))]
(∗) ≤ log E_p[q(x)/p(x)] = log ∑_{x∈Im(p)} p(x) · (q(x)/p(x)) ≤ log ∑_{x∈Im(q)} q(x)
= log 1 = 0,
where (∗) is Jensen's inequality.
Corollary (Nonnegativity of mutual information)
For any two random variables X, Y,
I(X; Y) ≥ 0,
with equality if and only if X and Y are independent.
Corollary
D(p(y |x)kq(y |x)) ≥ 0
with equality if and only if p(y |x) = q(y |x) for all y and x with p(x) > 0.
Corollary
I (X ; Y |Z ) ≥ 0
with equality if and only if X and Y are conditionally independent given Z .
V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 33 / 36
Consequences of Information Inequality
Lecture 9: Information Theory Corollary (Nonnegativity of mutual information)
2024-11-21
For any two random variables X , Y
I (X ; Y ) ≥ 0
Corollary
D(p(y |x)kq(y |x)) ≥ 0
Consequences of Information Inequality with equality if and only if p(y |x) = q(y |x) for all y and x with p(x) > 0.
Corollary
I (X ; Y |Z ) ≥ 0
with equality if and only if X and Y are conditionally independent given Z .
Theorem
Let X be a random variable, then
H(X) ≤ log |Im(X)|,
with equality if and only if X is uniformly distributed over Im(X).
Proof.
Let u(x) = 1/|Im(X)| be a uniform probability distribution over Im(X) and
let p(x) be the probability distribution of X. Then
D(p‖u) = E_p[log (p(X)/u(X))] = E_p[log p(X)] + E_p[log (1/u(X))] = −H(X) + log (1/u(x)).
Due to 0 ≤ D(p‖u) and u(x) = 1/|Im(X)|: H(X) ≤ log (1/u(x)) = log |Im(X)|.
Note that 0 = D(p‖u) iff p ≡ u.
V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 34 / 36
Consequences of Information Inequality
Theorem (Conditioning reduces entropy)
H(X|Y) ≤ H(X),
with equality iff X and Y are independent.
Proof.
0 ≤ I(X; Y) = H(X) − H(X|Y).
Theorem (Independence bound on entropy)
H(X1, X2, ..., Xn) ≤ ∑_{i=1}^n H(Xi).
Proof.
We use the chain rule for entropy
H(X1, X2, ..., Xn) = ∑_{i=1}^n H(Xi | Xi−1, ..., X1) ≤ ∑_{i=1}^n H(Xi),
where the inequality follows directly from the previous theorem. We have
equality if and only if Xi is independent of all Xi−1 , . . . , X1 .
V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 36 / 36
Lecture 10: Codes for Data Compression
Vojtěch Řehák
based on slides of Jan Bouda
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 1 / 34
Part I
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 2 / 34
Intro
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 3 / 34
Message Source
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 4 / 34
Code
Definition
A code C for a random variable (memoryless source) X is a mapping
C : Im(X ) → D∗ , where D∗ is the set of all finite-length strings over the
alphabet D. With |D| = d, we say the code is d-ary.
C (x) is the codeword assigned to x and lC (x) denotes the length of C (x).
Definition
The expected length LC (X ) of a code C for a random variable X is given
by
LC (X ) = ∑ P(X = x)lC (x) = E [lC (X )] .
x∈Im(X )
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 5 / 34
Code
Example
Let X and C be given by the following probability distribution and
codeword assignment
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 6 / 34
Code
Example
Consider another example with
The entropy in this case is H(X ) = log2 3 ≈ 1.58 bits, but the expected
length is LC (X ) ≈ 1.66.
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 7 / 34
Non-singular Code
Definition
A code C is said to be non-singular if it maps every element in the range
of X to a different string in D∗, i.e., x ≠ x′ implies C(x) ≠ C(x′).
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 8 / 34
Uniquely Decodable Code
Let Im(X )+ denotes the set of all nonempty strings over the alphabet
Im(X ).
Definition
An extension C ∗ of a code C is the mapping from Im(X )+ to D∗ defined
by
C ∗ (x1 x2 . . . xn ) = C (x1 )C (x2 ) . . . C (xn ),
where C (x1 )C (x2 ) . . . C (xn ) denotes concatenation of corresponding
codewords.
Definition
A code is uniquely decodable iff its extension is non-singular.
In other words, a code is uniquely decodable if any encoded string has only
one possible source string.
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 9 / 34
Prefix Code
Definition
A code is called prefix code (or instantaneous code) if no codeword is a
prefix of any other codeword.
The advantage of prefix codes is not only their unique decodability, but
also the fact that a codeword can be decoded as soon as we read its last
symbol.
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 10 / 34
Part II
Kraft Inequality
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 11 / 34
Kraft Inequality
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 12 / 34
Kraft Inequality
Proof.
Consider a d–ary tree in which every inner node has d descendants. Each
edge represents a choice of a code alphabet symbol at a particular
position. For example, d edges emerging from the root represent d choices
of the alphabet symbol at the first position of different codewords. Each
codeword is represented by a node (some nodes are not codewords!)
and the path from the root to a particular node (codeword) specifies the
codeword symbols. The prefix condition implies that no codeword is an
ancestor of other codeword on the tree. Hence, each codeword
eliminates its possible descendants.
Let lmax = max{l1 , l2 , . . . , lm }. Consider all nodes of the tree at the level
lmax . Some of them are codewords, some of them are descendants of
codewords, some of them are neither.
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 13 / 34
Kraft Inequality
Proof (Cont.)
A codeword at level li has d lmax −li descendants at level lmax . Sets of
descendant of different codewords must be disjoint and the total number of
nodes in all these sets must be at most d lmax . Summing over all codewords
we have
m
max −li
∑ dl ≤ d lmax
i=1
and hence
m
∑ d −l i
≤ 1.
i=1
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 14 / 34
Kraft Inequality
Proof (Cont.)
Conversely, given lengths l1, l2, ..., lm satisfying the Kraft inequality, label the first node
of depth l1 as the codeword 1 and remove its descendants from the tree. Then mark the
first remaining node of depth l2 as the codeword 2.
In this way you can construct a prefix code with codeword lengths l1, l2, ..., lm.
We may observe easily that this construction does not violate the prefix property:
to do so, the new codeword would have to be placed either as an ancestor or a
descendant of an existing codeword, which is prevented by the construction.
It remains to show that there are always enough nodes.
Assume that for some i ≤ m there is no free node of level li when we want to add
a new codeword of length li. This, however, means that all nodes at level li are
either codewords, or descendants of a codeword, giving
∑_{j=1}^{i−1} d^{l_i − l_j} = d^{l_i},
and we have ∑_{j=1}^{i−1} d^{−l_j} = 1, and, finally, ∑_{j=1}^{i} d^{−l_j} > 1, violating the initial
assumption.
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 15 / 34
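The Kraft inequality and the tree construction from the proof can be mirrored in code; the sketch below checks ∑ d^{−l_i} ≤ 1 and greedily assigns binary codewords by length (our own illustrative implementation of the idea, for d = 2).

```python
from fractions import Fraction as F

def kraft_sum(lengths, d=2):
    return sum(F(1, d ** l) for l in lengths)

def prefix_code(lengths):
    """Greedy (canonical) binary prefix code for the given codeword lengths,
    assuming they satisfy the Kraft inequality."""
    assert kraft_sum(lengths) <= 1
    code, prev, words = 0, 0, []
    for l in sorted(lengths):
        code <<= (l - prev)                 # move to the next level of the tree
        words.append(format(code, f'0{l}b'))
        code += 1                           # skip the subtree below this codeword
        prev = l
    return words

print(kraft_sum([1, 2, 3, 3]))              # 1 -> a complete prefix code exists
print(prefix_code([1, 2, 3, 3]))            # ['0', '10', '110', '111']
```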
Part III
McMillan Inequality
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 16 / 34
McMillan Inequality
Kraft inequality holds also for codes with countably infinite number of
codewords, however, we omit the proof here. There exist uniquely
decodable codes that are not prefix codes, but, as established by the
following theorem, the Kraft inequality applies to general uniquely
decodable codes as well and, therefore, when searching for an optimal
code it suffices to concentrate on prefix codes. General uniquely decodable
codes offer no extra codeword lengths in contrast to prefix codes.
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 17 / 34
McMillan Inequality
Proof.
Consider the k-th extension C k of a code C . By the definition of the
unique decodability, C k is non-singular for any k.
Observe that lC k (x1 x2 · · · xk ) = ∑ki=1 lC (xi ). Let us calculate the sum for
the code extension C k .
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 18 / 34
McMillan Inequality
Proof (Cont.)
Another expression is obtained when we reorder the terms by word
lengths to get
∑_{x1,x2,...,xk ∈ Im(X)} d^{−l_{C^k}(x1 x2 ··· xk)} = ∑_{m=1}^{k·lmax} a(m) d^{−m},
where lmax is the maximum codeword length and a(m) is the number of
k character source strings mapped to a codeword of length m.
The code is uniquely decodable, i.e. there is at most one input being
mapped on each codeword (of length m). The total number of such inputs
is at most the same as the number of sequences of length m, i.e.
a(m) ≤ d m .
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 19 / 34
McMillan Inequality
Proof (Cont.)
Using a(m) ≤ d^m, we get
(∑_{x∈Im(X)} d^{−l_C(x)})^k = ∑_{m=1}^{k·lmax} a(m) d^{−m}
≤ ∑_{m=1}^{k·lmax} d^m d^{−m} = k·lmax,
implying
∑_i d^{−l_i} ≤ (k·lmax)^{1/k}.
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 20 / 34
McMillan Inequality
Proof (Cont.)
This inequality holds for any k and observing limk→∞ (klmax )1/k = 1 we
have
∑ d −li ≤ 1.
i
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 21 / 34
Part IV
Optimal Codes
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 22 / 34
Optimal Codes
Theorem
The expected length of any prefix d–ary code C for a random variable
X is greater than or equal to the entropy Hd (X ) (d is the base of the
logarithm), i.e.
LC (X ) ≥ Hd (X )
with equality iff for all xi P(X = xi ) = pi = d −li for some integer li .
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 23 / 34
Optimal Codes
Proof.
We write the difference between the expected length and the entropy as
L_C(X) − H_d(X) = ∑_i p_i l_i + ∑_i p_i log_d p_i =
= ∑_i p_i log_d d^{l_i} + ∑_i p_i log_d p_i =
= ∑_i p_i log_d (p_i / d^{−l_i}).
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 24 / 34
Optimal Codes
Proof (Cont.)
Having r_i = d^{−l_i}/c and c = ∑_j d^{−l_j}, we can continue by
L_C(X) − H_d(X) = ∑_i p_i log_d (p_i / d^{−l_i}) = ∑_i p_i log_d (p_i / (r_i · c))
= ∑_i p_i (log_d (p_i / r_i) + log_d (1/c)) = ∑_i p_i log_d (p_i / r_i) + ∑_i p_i log_d (1/c)
= D(p‖r) + log_d (1/c) ≥ 0
by the non-negativity of the relative entropy and the fact that c ≤ 1 (Kraft
inequality). Hence, LC (X ) ≥ Hd (X ) with equality if and only if c = 1 and,
for all i, pi = d −li , i.e. − logd pi is an integer.
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 25 / 34
Optimal Codes
Definition
A probability distribution is called d–adic if each of the probabilities is
equal to d −n for some integer n.
Due to the previous theorem, the expected length is equal to the entropy if
and only if the probability distribution of X is d–adic. The proof also
suggests a method to find a code with optimal length in case the
probability distribution is not d–adic.
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 26 / 34
Optimal Codes - Construction - Idea
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 28 / 34
Bounds on the Optimal Code Length
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 29 / 34
Bounds on the Optimal Code Length
Hd (X ) ≤ LC (X ) < Hd (X ) + 1. (1)
Theorem
Let l_1^∗, l_2^∗, ..., l_m^∗ be the optimal codeword lengths for a source distribution
{p_i}_i and a d-ary alphabet and let L^∗ be the associated expected length
of the optimal code, i.e. L^∗ = ∑_i p_i l_i^∗. Then
H_d(X) ≤ L^∗ < H_d(X) + 1.
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 30 / 34
Bounds on the Optimal Code Length
Proof.
Let l_i = ⌈log_d (1/p_i)⌉. Then the l_i satisfy the Kraft inequality and from (1) we
have, for the corresponding prefix code C,
H_d(X) ≤ L_C(X) = ∑_i p_i l_i < H_d(X) + 1.
Since the optimal code is at least as good as C and no code beats the entropy bound,
H_d(X) ≤ L^∗ ≤ L_C(X) < H_d(X) + 1.
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 31 / 34
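The existence part of the proof can be played with directly: choosing l_i = ⌈log_d(1/p_i)⌉ always satisfies the Kraft inequality and lands between H_d(X) and H_d(X) + 1. Below is a small numeric sketch (d = 2, distribution chosen arbitrarily by us).

```python
import math

def shannon_lengths(probs, d=2):
    return [math.ceil(math.log(1 / p, d)) for p in probs]

probs = [0.4, 0.3, 0.2, 0.1]
lengths = shannon_lengths(probs)
H = -sum(p * math.log2(p) for p in probs)
L = sum(p * l for p, l in zip(probs, lengths))
kraft = sum(2 ** -l for l in lengths)

print(lengths, kraft)   # the lengths satisfy Kraft: sum of 2^-l is <= 1
print(H, L, H + 1)      # H <= L < H + 1
```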
Bounds on the Optimal Code Length
Let us define L_n as the expected codeword length per input symbol, i.e.
L_n = (1/n) ∑ p(x1, x2, ..., xn) l_C(x1, x2, ..., xn) = (1/n) E[l_C(X1, X2, ..., Xn)].
Using the bounds derived above, we have
H(X) ≤ L_n < H(X) + 1/n.
Using large blocks thus allows us to approach the optimal length, the entropy, arbitrarily closely, as illustrated below.
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 32 / 34
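To see the blocking effect numerically, here is a small Python sketch (mine, not from the slides): it assigns lengths $\lceil \log_2 \frac{1}{p} \rceil$ to blocks of n i.i.d. symbols of a made-up source and reports the per-symbol expected length $L_n$, which stays below $H(X) + 1/n$ and approaches $H(X)$ as n grows.

# Illustration (my own sketch, not from the slides): per-symbol length of block codes.
import math
from itertools import product

p = {"a": 0.7, "b": 0.3}                        # made-up i.i.d. source
H = -sum(q * math.log2(q) for q in p.values())

for n in (1, 2, 4, 8):
    # probability of each block of n source symbols
    block_probs = [math.prod(p[s] for s in block) for block in product(p, repeat=n)]
    # per-symbol expected length when the block is coded with length ceil(log2(1/prob))
    Ln = sum(q * math.ceil(math.log2(1.0 / q)) for q in block_probs) / n
    print(f"n={n}: L_n={Ln:.4f}  (H(X)={H:.4f}, H(X)+1/n={H + 1/n:.4f})")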
Optimal Coding and Relative Entropy
This could serve as a motivation for the definitions of $D(p\|q)$ and $H_p(q)$.
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 33 / 34
Optimal Coding and Relative Entropy
Proof.
$$E_p[l_C(X)] = \sum_x p(x) \left\lceil \log \frac{1}{q(x)} \right\rceil < \sum_x p(x) \left( \log \frac{1}{q(x)} + 1 \right) = \sum_x p(x) \log \frac{1}{q(x)} + \sum_x p(x) \cdot 1 = H_p(q) + 1.$$
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 34 / 34
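A quick numeric illustration of this "wrong code" penalty (my own sketch, with made-up distributions p and q): Shannon lengths built for q, measured under the true p, cost about $D(p\|q)$ extra bits and stay below $H_p(q) + 1$.

# Illustration (my own sketch, not from the slides): cost of coding with the wrong distribution.
import math

p = [0.5, 0.25, 0.25]        # true distribution (made-up)
q = [0.25, 0.25, 0.5]        # distribution the code was designed for (made-up)

H_p = -sum(pi * math.log2(pi) for pi in p)
D_pq = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
cross = H_p + D_pq                                # H_p(q)

# Shannon lengths for q; the small epsilon guards against floating-point error
# when 1/q_i happens to be an exact power of two
lengths_for_q = [math.ceil(math.log2(1.0 / qi) - 1e-12) for qi in q]
exp_len = sum(pi * li for pi, li in zip(p, lengths_for_q))

print(f"H(p)={H_p:.3f}, D(p||q)={D_pq:.3f}, H_p(q)={cross:.3f}")
print(f"E_p[l_C(X)]={exp_len:.3f} < H_p(q)+1={cross + 1:.3f}")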
Lecture 11: Optimal Codes for Data Compression
Vojtěch Řehák
based on slides of Jan Bouda
December 5, 2024
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 1 / 38
Revision
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 2 / 38
Revision
The goal is to find an optimal code for a given probability distribution, i.e.
with the shortest expected length of a message.
We know that the expected length is bounded from below by the entropy
Hd and it can reach the bound iff the distribution is d-adic.
Theorem
The expected length of any prefix $d$-ary code $C$ for a random variable $X$ is greater than or equal to the entropy $H_d(X)$ ($d$ is the base of the logarithm), i.e.
$$L_C(X) \ge H_d(X)$$
with equality iff for all $x_i$, $P(X = x_i) = p_i = d^{-l_i}$ for some integer $l_i$.
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 3 / 38
Bounds on the Optimal Code Length
Theorem
Let $l_1^*, l_2^*, \ldots, l_m^*$ be the optimal codeword lengths for a source distribution $\{p_i\}_i$ and a $d$-ary alphabet and let $L^*$ be the associated expected length of the optimal code, i.e. $L^* = \sum_i p_i l_i^*$. Then
$$H_d(X) \le L^* < H_d(X) + 1.$$
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 4 / 38
Part II
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 5 / 38
Coding Based on Shannon Entropy
Example
We have two source symbols with probabilities 0.0001 and 0.9999. Using the entropy definition we get two codewords with lengths $1 = \lceil \log \frac{1}{0.9999} \rceil$ and $14 = \lceil \log \frac{1}{0.0001} \rceil$ bits. The optimal code obviously uses a 1-bit codeword for both symbols.
On the other hand, is it true that an optimal code always uses codewords of length not larger than $\lceil \log \frac{1}{p_i} \rceil$?
It is not.
Consider probabilities 1/3, 1/3, 1/4, 1/12. The expected length of its optimal code is 2 (as we will be able to show later). Hence, the code 0, 10, 110, 111 with lengths 1, 2, 3, 3 is an optimal code.
The third symbol has length 3, which is $> \lceil \log \frac{1}{p_3} \rceil = \lceil \log_2 4 \rceil = 2$.
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 6 / 38
Shannon-Fano Algorithm for Code Generation
Remark
Shannon-Fano coding is not optimal!
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 7 / 38
Shannon-Fano vs. Huffman Codes - Example
Example
Let us consider a random variable with outcome probabilities
0.25, 0.25, 0.2, 0.15, 0.15 and construct an optimal binary code.
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 8 / 38
Shannon-Fano vs. Huffman Codes - Example
Shannon-Fano code (split into groups of roughly equal probability):
• first split: {0.25, 0.25} vs. {0.2, 0.15, 0.15}
• second split: 0.25 | 0.25 and 0.2 | {0.15, 0.15}
• expected length: 2 · 0.7 + 3 · 0.3 = 2.3 bits.
The Huffman approach (repeatedly merge the two least likely probabilities; implemented in the sketch below):
0.25, 0.25, 0.2, 0.15, 0.15 → 0.25, 0.25, 0.2, 0.3 → 0.25, 0.45, 0.3 → 0.55, 0.45 → 1
Expected code length = 2.3 bits, while H(X) = 2.2854...
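For completeness, here is a short Python sketch of the binary Huffman procedure (my illustration, not part of the slides); on the distribution 0.25, 0.25, 0.2, 0.15, 0.15 it reproduces the codeword lengths 2, 2, 2, 3, 3 and the expected length 2.3 bits.

# Illustration (my own sketch, not from the slides): binary Huffman lengths via a heap.
import heapq

def huffman_lengths(probs):
    # heap items: (probability, tie-breaker, symbol indices in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)       # two least likely subtrees
        p2, t, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:                  # every symbol in the merged subtree goes one level deeper
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, t, syms1 + syms2))
    return lengths

probs = [0.25, 0.25, 0.2, 0.15, 0.15]
lengths = huffman_lengths(probs)
print("lengths:", lengths)                                              # [2, 2, 2, 3, 3]
print("expected length:", sum(p * l for p, l in zip(probs, lengths)))   # ~ 2.3 bits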
Shannon-Fano vs. Huffman Codes - Example
Example
Let us consider a random variable with outcome probabilities
0.38, 0.18, 0.16, 0.15, 0.13 and construct an optimal binary code.
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 9 / 38
Shannon-Fano vs. Huffman Codes - Example
Example
Let us consider a random variable with outcome probabilities
0.25, 0.25, 0.2, 0.15, 0.15 and construct an optimal ternary code.
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 10 / 38
Huffman Codes - Example
Example
Let us consider a random variable with outcome probabilities 0.25, 0.25, 0.2, 0.1, 0.1, 0.1 and construct a 3-ary code (a sketch of the d-ary construction follows below).
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 11 / 38
Huffman Codes - Example
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 12 / 38
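A possible way to build a d-ary Huffman code (my own sketch of the usual dummy-symbol approach, not taken from the slides) is to pad the alphabet with zero-probability symbols so that every step merges exactly d subtrees; on the example 0.25, 0.25, 0.2, 0.1, 0.1, 0.1 with d = 3 it gives an expected length of 1.7 ternary digits.

# Illustration (my own sketch, not from the slides): d-ary Huffman lengths with dummy symbols.
import heapq

def huffman_lengths_dary(probs, d=3):
    m = len(probs)
    n_dummy = 0
    while (m + n_dummy - 1) % (d - 1) != 0:      # pad so every merge combines exactly d subtrees
        n_dummy += 1
    items = list(probs) + [0.0] * n_dummy        # dummy symbols with probability 0
    heap = [(p, i, [i]) for i, p in enumerate(items)]
    heapq.heapify(heap)
    lengths = [0] * len(items)
    while len(heap) > 1:
        total, syms, tie = 0.0, [], None
        for _ in range(d):                       # merge the d least likely subtrees
            p, t, part = heapq.heappop(heap)
            total += p
            syms += part
            tie = t
        for s in syms:
            lengths[s] += 1
        heapq.heappush(heap, (total, tie, syms))
    return lengths[:m]                           # drop the dummy symbols

probs = [0.25, 0.25, 0.2, 0.1, 0.1, 0.1]
lengths = huffman_lengths_dary(probs, d=3)
print("lengths:", lengths)                       # [1, 1, 2, 3, 3, 2]
print("expected length:", sum(p * l for p, l in zip(probs, lengths)), "ternary digits")   # 1.7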
Part III
Proof of Optimality
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 13 / 38
Properties of Optimal Prefix Codes
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 14 / 38
Properties of Optimal Prefix Codes
Proof.
Let us consider an optimal code C .
1. Let us suppose that $p_j > p_k$. Consider $C'$ with the codewords $j$ and $k$ interchanged (compared to $C$). Then
$$L_{C'}(X) - L_C(X) = \sum_i p_i l_i' - \sum_i p_i l_i = p_j l_k + p_k l_j - p_j l_j - p_k l_k = (p_j - p_k)(l_k - l_j).$$
Since $C$ is optimal, this difference is non-negative, and as $p_j - p_k > 0$ we get $l_k \ge l_j$: a more likely symbol never gets a longer codeword.
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 15 / 38
Properties of Optimal Prefix Codes
Proof (Cont.)
2. If the two longest codewords had different lengths, then we could delete the last letter of the longer one to get a shorter code while preserving the prefix property, which contradicts our assumption that $C$ is optimal. Therefore, the two longest codewords have the same length.
3. If there is a codeword of maximal length without a sibling, then we can delete the last letter of the codeword and still maintain the prefix property, which contradicts the optimality of the code.
4. Again, if there were a longest codeword with a higher probability, we could swap two codewords and contradict the optimality of $C$. Formally, for all $j \in M$, $k \notin M$: $l_{\max} = l_j \ge l_k$, which by item 1 implies $p_j \le p_k$.
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 16 / 38
Properties of Optimal Prefix Codes
Proof.
Due to the previous lemma, we know that w and its siblings are in the
least-likely symbols of the longest codewords (i.e., Mw ⊆ M). The problem
is that in general Mw is not necessarily a set of indices of least-likely
symbols. To obtain the optimal code required here, we simply exchange
some of the longest codewords such that Mw is a set of least-likely
indices. Note that this modification preserves the assigned codeword
lengths (all the exchanged codewords have the maximal length).
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 17 / 38
Optimality of Huffman Codes
Theorem
Huffman codes are optimal, i.e. the codes obtained by the Huffman algorithm assign codewords of exactly the same lengths as optimal codes.
Proof.
This proof is limited to the case of a binary code; the general n-ary case is analogous (with a discussion about unused longest siblings).
The proof is done by induction on the number of codewords. If there are only two codewords, the Huffman-code lengths are 1, 1, which is optimal.
Now, let us assume that the statement holds for all codes with m − 1 codewords. Let $C_m$ be an optimal code with m codewords satisfying our least-likely-siblings property. Hence, there are two longest least-likely siblings, and we can define the 'merged' code $C_{m-1}$ of (m − 1) codewords by taking the common prefix of the two siblings and assigning it to a new symbol with probability $p_{m-1} + p_m$ (the two smallest probabilities).
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 18 / 38
Optimality of Huffman Codes
Proof (Cont.)
Let $l_i, p_i$ be the lengths and probabilities for code $C_m$, and $l_i', p_i'$ for $C_{m-1}$. Hence, $l_i = l_i'$ and $p_i = p_i'$ for $i = 1, \ldots, m-2$; and $p_{m-1} + p_m = p_{m-1}'$ and $l_m = l_{m-1} = l_{m-1}' + 1$. The expected length of the code $C_m$ is
$$\begin{aligned}
L_{C_m}(X) = \sum_{i=1}^{m} p_i l_i &= \sum_{i=1}^{m-2} p_i l_i + p_{m-1} l_{m-1} + p_m l_m \\
&= \sum_{i=1}^{m-2} p_i l_i' + p_{m-1}(l_{m-1}' + 1) + p_m (l_{m-1}' + 1) \\
&= \sum_{i=1}^{m-2} p_i l_i' + (p_{m-1} + p_m)(l_{m-1}' + 1) \\
&= \sum_{i=1}^{m-2} p_i' l_i' + p_{m-1}' l_{m-1}' + p_{m-1} + p_m \\
&= \sum_{i=1}^{m-1} p_i' l_i' + p_{m-1} + p_m = L_{C_{m-1}}(X) + p_{m-1} + p_m.
\end{aligned}$$
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 19 / 38
Optimality of Huffman Codes
Proof (Cont.)
It is important that the expected length of $C_m$ differs from the expected length of $C_{m-1}$ only by a fixed amount given by the probability distribution of the source. Therefore, $C_m$ is optimal iff $C_{m-1}$ is optimal. (If one of them were not optimal, i.e. there were a better one, the above-mentioned transformation would contradict the optimality of the other.)
Due to the induction hypothesis, $C_{m-1}$ is optimal iff its lengths can be obtained from the Huffman algorithm. As the transformation of $C_m$ to $C_{m-1}$ corresponds to one step of the Huffman algorithm, we have that the lengths of $C_m$ can be obtained from the Huffman algorithm iff those of $C_{m-1}$ can.
Remark
Shannon-Fano coding is not optimal!
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 20 / 38
Part IV
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 21 / 38
Huffman and Shannon-Fano Codes in Practice
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 22 / 38
Adaptive Coding
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 23 / 38
Lempel-Ziv Coding
The source sequence is parsed into strings that have not appeared before. For example, if the input is 1011010100010 . . . , it is parsed as 1, 0, 11, 01, 010, 00, 10, . . . ; each new phrase is the shortest string that has not appeared before. The coding follows:
• Parse the input sequence as above and count the number of phrases. This count determines the length of the bit string referring to a particular phrase.
• We code each phrase by specifying the index of its longest proper prefix (it certainly already appeared and was parsed) and the extra bit. The empty prefix is usually assigned index 0.
• Our example will be coded as (000, 1)(000, 0)(001, 1)(010, 1)(100, 0)(010, 0)(001, 0).
• The length of the code can be further optimized, e.g. at the beginning of the coding process the length of the bit string describing the prefix index can be shorter than at the end. Note that in fact we do not need commas and parentheses; it suffices to specify the length of the bit string identifying the prefix.
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 24 / 38
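The parsing and coding just described can be reproduced with a short Python sketch (mine, not from the slides); on the example input 1011010100010 it outputs exactly the pairs (000,1)(000,0)(001,1)(010,1)(100,0)(010,0)(001,0) listed above.

# Illustration (my own sketch, not from the slides): LZ78-style parsing of the example input.
def lz_parse(bits):
    # parse into phrases never seen before; code each phrase as
    # (index of its longest proper prefix, extra bit)
    phrase_index = {"": 0}                       # the empty prefix gets index 0
    pairs = []
    current = ""
    for b in bits:
        current += b
        if current not in phrase_index:          # shortest string not seen before
            phrase_index[current] = len(phrase_index)
            pairs.append((phrase_index[current[:-1]], current[-1]))
            current = ""
    return pairs

pairs = lz_parse("1011010100010")
width = max(i for i, _ in pairs).bit_length()    # bits needed for a phrase index (3 here)
print("".join(f"({i:0{width}b},{b})" for i, b in pairs))
# -> (000,1)(000,0)(001,1)(010,1)(100,0)(010,0)(001,0)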
Part V
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 25 / 38
Discrete Distribution and a Fair Coin
Example
Suppose we want to simulate a source described by a random variable X
with the distribution
$$X = \begin{cases} a & \text{with probability } \tfrac{1}{2} \\ b & \text{with probability } \tfrac{1}{4} \\ c & \text{with probability } \tfrac{1}{4} \end{cases}$$
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 26 / 38
Discrete Distribution and a Fair Coin
The solution is pretty easy: if the outcome of the first coin toss is 0, we set X = a; otherwise we perform another coin toss and set X = b if the outcomes are 10 and X = c if the outcomes are 11.
The average number of fair coin tosses is 1.5, which equals the entropy of X.
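A quick simulation of this procedure (my own sketch) confirms the numbers: the empirical distribution is close to (1/2, 1/4, 1/4) and the average number of fair coin tosses is close to 1.5 = H(X).

# Illustration (my own sketch, not from the slides): simulating X = a, b, c from fair coin flips.
import random

def sample_x(rng):
    # a on first toss 0; otherwise b on the continuation 10 and c on 11
    if rng.random() < 0.5:
        return "a", 1
    return ("b", 2) if rng.random() < 0.5 else ("c", 2)

rng = random.Random(0)
n = 100_000
samples = [sample_x(rng) for _ in range(n)]
print("average tosses:", sum(t for _, t in samples) / n)                  # close to 1.5 = H(X)
print({s: sum(1 for sym, _ in samples if sym == s) / n for s in "abc"})   # ~ 0.5, 0.25, 0.25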
Discrete Distribution and a Fair Coin
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 27 / 38
Discrete Distribution and a Fair Coin
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 28 / 38
Discrete Distribution and a Fair Coin
Example
Suppose we want to simulate a source described by a random variable X
with the distribution
$$X = \begin{cases} a & \text{with probability } \tfrac{2}{3} \\ b & \text{with probability } \tfrac{1}{3} \end{cases}$$
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 29 / 38
Discrete Distribution and a Fair Coin
Infinite tree: assign the leaves 1 → a, 01 → b, 001 → a, 0001 → b, 00001 → a, . . . , i.e. flip until the first 1; an odd number of flips yields a, an even number yields b.
Alternative construction: use two levels of the tree; of the four depth-2 leaves, the first two stand for "a", one for "b", and the last one means "repeat".
For the infinite tree, $E(T) = \sum_i i \cdot 2^{-i} = 2$, while $H(X) = 0.918\ldots$
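The infinite-tree construction can be simulated directly (my own sketch): flip a fair coin until the first 1, output a after an odd number of flips and b after an even number. Empirically $P(a) \approx 2/3$ and the average number of flips is about 2, in line with $E(T) = 2$ above.

# Illustration (my own sketch, not from the slides): the infinite-tree generator for P(a) = 2/3.
import random

def sample_ab(rng):
    # flip a fair coin until the first 1; an odd number of flips gives 'a', an even one gives 'b'
    flips = 0
    while True:
        flips += 1
        if rng.random() < 0.5:                   # this flip came up 1
            return ("a" if flips % 2 == 1 else "b"), flips

rng = random.Random(1)
n = 200_000
samples = [sample_ab(rng) for _ in range(n)]
print("P(a) ~", sum(1 for s, _ in samples if s == "a") / n)   # about 2/3
print("E(T) ~", sum(f for _, f in samples) / n)               # about 2 flips, vs. H(X) ~ 0.918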
Discrete Distribution and a Fair Coin
Lemma
Let $\mathcal{Y}$ denote the set of leaves of a full binary tree and $Y$ a random variable with distribution on $\mathcal{Y}$, where the probability of a leaf of depth $k$ is $2^{-k}$. The expected depth of this tree is equal to the entropy of $Y$.
Proof.
Let $k(y)$ denote the depth of $y$. The expected depth of the tree is
$$E(T) = \sum_{y \in \mathcal{Y}} k(y)\, 2^{-k(y)}.$$
The entropy of $Y$ is
$$H(Y) = -\sum_{y \in \mathcal{Y}} \frac{1}{2^{k(y)}} \log \frac{1}{2^{k(y)}} = \sum_{y \in \mathcal{Y}} 2^{-k(y)}\, k(y) = E(T).$$
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 30 / 38
Discrete Distribution and a Fair Coin
Theorem
For any algorithm generating $X$, the expected number of fair bits used is at least the entropy $H(X)$, i.e.
$$E(T) \ge H(X).$$
Proof.
Any algorithm generating $X$ from fair bits can be represented by a binary tree. Label all leaves by distinct symbols from a set $\mathcal{Y}$. The tree may be infinite. Consider the random variable $Y$ defined on the leaves of the tree such that for any leaf $y$ of depth $k$ the probability is $P(Y = y) = 2^{-k}$. By the previous lemma, we get $E(T) = H(Y)$. The random variable $X$ is a function of $Y$ and hence we have $H(X) \le H(Y)$; note that several leaves may be mapped to the same output value of $X$. Combining, we get that for any algorithm $H(X) \le E(T)$.
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 31 / 38
Dyadic Distribution and a Fair Coin
Theorem
Let X be a random variable with a dyadic (i.e. 2–adic) distribution.
The optimal algorithm to generate X from fair coin flips requires an
expected number of coin tosses equal to the entropy H(X ).
Proof.
The previous theorem shows that we need at least $H(X)$ bits to generate $X$. We use the Huffman code tree to generate the variable. For a dyadic distribution the Huffman code coincides with the Shannon-Fano code, and so it has codewords of length $\log \frac{1}{p(x)}$, and the probability of reaching such a codeword (leaf) is $2^{-\log \frac{1}{p(x)}} = p(x)$. Hence, the expected depth of the tree is $H(X)$.
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 32 / 38
Non-dyadic Example
Example
Suppose we want to simulate a source described by a random variable X
with the distribution
$$X = \begin{cases} a & \text{with probability } \tfrac{2}{3} \\ b & \text{with probability } \tfrac{1}{3} \end{cases}$$
Note that the binary expansion of $\tfrac{2}{3}$ is $0.10101010101\ldots$
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 33 / 38
Discrete Distribution and Fair Coin
where $p_i^{(j)}$ is either $2^{-j}$ or $0$. Now, we will assign to each nonzero $p_i^{(j)}$ a leaf of depth $j$ in a binary tree. Their depths satisfy the Kraft inequality, because $\sum_{i,j} p_i^{(j)} = 1$, and therefore we can always do this.
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 34 / 38
Discrete Distribution and Fair Coin
Theorem
The expected number of fair bits E (T ) required by the optimal algorithm
to generate a random variable X is bounded as H(X ) ≤ E (T ) < H(X ) + 2.
Proof.
The lower bound has been already proved, here we prove the upper bound.
Let us start with the initial distribution $(p_1, p_2, \ldots, p_m)$ and expand each of the probabilities using dyadic coefficients, i.e.
$$p_i = p_i^{(1)} + p_i^{(2)} + \cdots$$
with $p_i^{(j)} \in \{0, 2^{-j}\}$. Let us consider a new random variable $Y$ with the probability distribution $p_1^{(1)}, p_1^{(2)}, \ldots, p_2^{(1)}, p_2^{(2)}, \ldots, p_m^{(1)}, p_m^{(2)}, \ldots$. We construct the binary tree $T$ for this dyadic probability distribution of $Y$. Recall that the expected depth of $T$ is $H(Y)$.
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 35 / 38
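The dyadic expansion used in this proof can be computed explicitly. The sketch below (my addition, truncating the expansion at a fixed number of bits) lists, for the distribution (2/3, 1/3), the positions j with $p_i^{(j)} = 2^{-j} > 0$, i.e. the 1s in the binary expansions 0.1010... and 0.0101... .

# Illustration (my own sketch, not from the slides): dyadic expansion of the probabilities 2/3 and 1/3.
def dyadic_positions(p, max_bits=20):
    # positions j with p^(j) = 2^{-j} > 0, i.e. the 1s in the binary expansion of p
    positions = []
    for j in range(1, max_bits + 1):
        p *= 2
        if p >= 1:
            positions.append(j)
            p -= 1
    return positions

for pi in (2 / 3, 1 / 3):
    js = dyadic_positions(pi)
    print(f"p={pi:.4f}: nonzero p^(j) at j={js}, partial sum={sum(2.0 ** -j for j in js):.6f}")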
Discrete Distribution and Fair Coin
Proof (Cont.)
X is a function of Y. The contribution of the symbol $i$ to the expected depth of $T$ is
$$T_i = \sum_{j \,:\, p_i^{(j)} > 0} j\, 2^{-j}.$$
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 36 / 38
Discrete Distribution and Fair Coin
Proof (Cont.)
We can find $n$ such that $2^{-(n-1)} > p_i \ge 2^{-n}$. This is equivalent to $n - 1 < -\log p_i \le n$.
We have that $p_i^{(j)} > 0$ only if $j \ge n$, and we rewrite $T_i$ as
$$T_i = \sum_{j \,:\, j \ge n,\ p_i^{(j)} > 0} j\, 2^{-j}.$$
Recall that
$$p_i = \sum_{j \,:\, j \ge n,\ p_i^{(j)} > 0} 2^{-j}.$$
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 37 / 38
Discrete Distribution and Fair Coin
Proof (Cont.)
Next, we will show that $T_i < -p_i \log p_i + 2p_i$. Let us expand
$$T_i - (n+1)\,p_i = \sum_{j \,:\, j \ge n,\ p_i^{(j)} > 0} (j - n - 1)\, 2^{-j} = -2^{-n} + 0 + \sum_{j \,:\, j \ge n+2,\ p_i^{(j)} > 0} (j - n - 1)\, 2^{-j} = -2^{-n} + \sum_{k \,:\, k \ge 1,\ p_i^{(k+n+1)} > 0} k\, 2^{-(k+n+1)},$$
where $k = j - n - 1$.
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 38 / 38
Discrete Distribution and Fair Coin
Proof (Cont.)
Increasing the number of addends on the right-hand side, we get
$$T_i - (n+1)\,p_i \le -2^{-n} + \sum_{k \ge 1} k\, 2^{-(k+n+1)} = -2^{-n} + 2^{-(n+1)} \cdot 2 = 0,$$
hence $T_i \le (n+1)\,p_i < (-\log p_i + 2)\, p_i = -p_i \log p_i + 2p_i$, using $n - 1 < -\log p_i$.
Using $E(T) = \sum_i T_i$ and $T_i < -p_i \log p_i + 2p_i$, we obtain the desired result
$$E(T) = \sum_i T_i < -\sum_i p_i \log p_i + 2\sum_i p_i = H(X) + 2.$$
V. Řehák: IV111 Probability in CS L. 11: Optimal Codes for Data Compression December 5, 2024 39 / 38
Lecture 12: Channel Capacity
Vojtěch Řehák
based on slides of Jan Bouda
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 1 / 36
Part I
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 2 / 36
Communication System - Motivation
W → Encoder → X^n → Channel p(y | x) → Y^n → Decoder → Ŵ
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 3 / 36
Communication System - Definition
Definition
A discrete channel is a system (X , p(y | x), Y ) consisting of an input
alphabet X , output alphabet Y , and a probability transition matrix
p(y | x) specifying the probability we observe y ∈ Y when x ∈ X was sent.
Example
Binary symmetric channel preserves its input with probability 1 − p and it
outputs the negation of the input with probability p.
Transitions: 0 → 0 and 1 → 1 with probability 1 − p; 0 → 1 and 1 → 0 with probability p.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 4 / 36
Sequential Use of a Channel
Let $x^k = x_1, x_2, \ldots, x_k$, $X^k = X_1, X_2, \ldots, X_k$, and similarly for $y^k$ and $Y^k$.
Definition
A channel is said to be without feedback if the output distribution does not depend on past output symbols, i.e. $p(y_k \mid x^k, y^{k-1}) = p(y_k \mid x^k)$.
A channel is said to be memoryless if the output distribution depends only on the current input and is conditionally independent of previous channel inputs and outputs, i.e. $p(y_k \mid x^k, y^{k-1}) = p(y_k \mid x_k)$.
W → Encoder → X^n → Channel p(y | x) → Y^n → Decoder → Ŵ
Hence, the channel transition function for the n-th extension of a discrete memoryless channel reduces to
$$p(y^n \mid x^n) = \prod_{i=1}^{n} p(y_i \mid x_i).$$
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 6 / 36
(M, n) Codes
Definition
An (M, n) code for the channel (X , p(y | x), Y ) consists of the following:
1. A set of input messages {1, 2, . . . , M}.
2. An encoding function f : {1, 2, . . . , M} → X n , yielding codewords
f (1), f (2), . . . , f (M).
3. A decoding function $g : Y^n \to \{1, 2, \ldots, M\}$, which is a deterministic rule assigning a guess to each possible received vector.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 7 / 36
Example: Binary Symmetric Channel
Example
Binary symmetric channel preserves its input with probability 1 − p and it
outputs the negation of the input with probability p.
What is the probability of an incorrect decoding using the binary symmetric channel with the encoding function f and decoding function g below?
f(1) = 000, f(2) = 111
g(000) = g(001) = g(010) = g(100) = 1, g(011) = g(101) = g(110) = g(111) = 2
(BSC transitions: 0 → 0 and 1 → 1 with probability 1 − p; 0 → 1 and 1 → 0 with probability p.)
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 8 / 36
Example: Binary Symmetric Channel
Common discussion: the decoding is incorrect iff at least two of the three bits are flipped, which happens with probability $3p^2(1-p) + p^3 = 3p^2 - 2p^3$. Comparing with sending a single unencoded bit (error probability $p$):
$$3p^2 - 2p^3 < p \iff 0 < 2p^3 - 3p^2 + p = p(2p - 1)(p - 1).$$
Hence, the roots are 0, 1/2, and 1, and the repetition code helps whenever $0 < p < 1/2$.
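A small numeric check of this discussion (my own sketch): the exact error probability $3p^2(1-p) + p^3$ of majority decoding is compared against a Monte Carlo simulation and against the raw bit error probability p.

# Illustration (my own sketch, not from the slides): error probability of the (2,3) repetition code.
import random

def majority_error_exact(p):
    # decoding fails iff at least 2 of the 3 transmitted bits are flipped
    return 3 * p**2 * (1 - p) + p**3

def majority_error_simulated(p, trials=200_000, seed=0):
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        flips = sum(rng.random() < p for _ in range(3))   # number of flipped bits
        errors += flips >= 2
    return errors / trials

for p in (0.05, 0.1, 0.3):
    print(f"p={p}: exact={majority_error_exact(p):.4f}, "
          f"simulated={majority_error_simulated(p):.4f}, raw bit error={p}")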
Error Probability
Definition
The probability of an error for the code (M, n) and the channel $(X, p(y \mid x), Y)$ provided the $i$th message was sent is
$$\lambda_i = P(g(Y^n) \ne i \mid X^n = f(i)).$$
Definition
The maximal probability of an error for an (M, n) code is defined as
$$\lambda_{\max} = \max_{i \in \{1, 2, \ldots, M\}} \lambda_i$$
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 9 / 36
Average Error Probability
Definition
The (arithmetic) average probability of error for an (M, n) code is defined as
$$P_e^{(n)} = \frac{1}{M} \sum_{i=1}^{M} \lambda_i.$$
Note that $P_e^{(n)} = P(I \ne g(Y^n))$ if $I$ describes an index chosen uniformly from the set $\{1, 2, \ldots, M\}$.
Also note that $P_e^{(n)} \le \lambda_{\max}$. This is important as we will prove that $\lambda_{\max}$ can be pushed to 0.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 10 / 36
Code Rate
Definition
The rate R of an (M, n) code is
$$R = \frac{\log_2 M}{n} \text{ bits per transmission.}$$
Intuitively, the rate expresses the ratio between the number of message
bits and the number of channel uses, i.e. it expresses the number of
non-redundant bits per transmission.
Example
What is the rate of the code function f (1) = 000 and f (2) = 111?
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 11 / 36
Code Rate
The triple-repetition code is a (2, 3) code, hence its rate is $\log_2(2)/3 = 1/3$ of a bit per transmission.
The rate of an (8, 3) code is $\log_2(8)/3 = 3/3 = 1$ bit per transmission.
Part II
Channel Capacity
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 12 / 36
Channel capacity - motivation
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 13 / 36
Noiseless binary channel
Example
• Let us consider a channel with binary input that faithfully reproduces
its input on the output.
Transitions: 0 → a and 1 → b, each with probability 1.
• The channel is error-free and we can obviously transmit one bit per
channel use.
• The capacity is 1 bit.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 14 / 36
Noisy channel with non-overlapping outputs
Example
• This channel has two inputs and to each of them correspond two
possible outputs. Outputs for different inputs are different.
Transitions: 0 → a with probability p, 0 → b with probability 1 − p; 1 → c with probability p, 1 → d with probability 1 − p.
• This channel appears to be noisy, but in fact it is not. Every input
can be recovered from the output without error.
• Capacity of this channel is also 1 bit.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 15 / 36
Binary Symmetric Channel
Example
• Binary symmetric channel preserves its input with probability 1 − p
and it outputs the negation of the input with probability p.
Transitions: 0 → 0 and 1 → 1 with probability 1 − p; 0 → 1 and 1 → 0 with probability p.
• The capacity depends on the probability p.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 16 / 36
Channel capacity
Definition
The channel capacity of a discrete memoryless channel is
$$C = \max_{p_X} I(X; Y),$$
where the maximum is taken over all input distributions $p_X$.
Note that for a fixed channel $p(y \mid x)$, the mutual information $I(X;Y)$ is a function of the input distribution $p_X$ only.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 17 / 36
Noisy channel with non-overlapping outputs
Transitions: 0 → a with probability p, 0 → b with probability 1 − p; 1 → c with probability p, 1 → d with probability 1 − p.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 18 / 36
Noisy channel with non-overlapping outputs
Formally, let $P(X = 0) = q$; then the mutual information is
$$\begin{aligned}
I(X;Y) &= H(Y) - H(Y \mid X) \\
&= H(Y) - \big(P(X=0)\cdot H(Y \mid X=0) + P(X=1)\cdot H(Y \mid X=1)\big) \\
&= H(Y) - q \cdot H(p, 1-p) - (1-q)\cdot H(p, 1-p) \\
&= H(Y) - H(p, 1-p) \\
&= H(qp,\ q(1-p),\ (1-q)p,\ (1-q)(1-p)) - H(p, 1-p) \\
&= H(q, 1-q) + q\cdot H(p, 1-p) + (1-q)\cdot H(p, 1-p) - H(p, 1-p) \\
&= H(q, 1-q) = H(X).
\end{aligned}$$
Hence the capacity is $\max_q H(q, 1-q) = 1$ bit, attained by the uniform input distribution.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 19 / 36
Binary Symmetric Channel
Transitions: 0 → 0 and 1 → 1 with probability 1 − p; 0 → 1 and 1 → 0 with probability p.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 20 / 36
Binary Symmetric Channel
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 21 / 36
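As a numeric illustration (my own sketch, not from the slides) of the well-known BSC capacity formula $C = 1 - H(p, 1-p)$, the code below maximizes $I(X;Y) = H(Y) - H(p, 1-p)$ over a grid of input distributions; the maximum is attained at the uniform input. The flip probability p = 0.11 is a made-up example.

# Illustration (my own sketch, not from the slides): numerically maximizing I(X;Y) for a BSC.
import math

def h2(q):
    # binary entropy in bits
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def bsc_mutual_information(q, p):
    # I(X;Y) = H(Y) - H(Y|X) for flip probability p and input distribution P(X=1) = q
    r = q * (1 - p) + (1 - q) * p                # P(Y = 1)
    return h2(r) - h2(p)

p = 0.11                                         # made-up flip probability
best_q, best_I = max(((q / 1000, bsc_mutual_information(q / 1000, p)) for q in range(1001)),
                     key=lambda t: t[1])
print(f"max_q I(X;Y) ~ {best_I:.4f} at q = {best_q}")   # ~ 1 - H(p,1-p) ~ 0.50, at q = 0.5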
Noiseless binary channel
Example
What is the capacity of this channel?
Transitions: 0 → a and 1 → b, each with probability 1.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 22 / 36
Noiseless Quaternary channel
Example
What is the capacity of this channel?
Transitions: 0 → a, 1 → b, 2 → c, 3 → d, each with probability 1.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 23 / 36
Noisy Typewriter
• Let us suppose that the input alphabet has k letters (and the output
alphabet is the same here).
• Each symbol either remains unchanged (with probability 1/2) or it is
received as the next letter (with probability 1/2).
Transitions: each input letter is received either as itself or as the alphabetically next letter (cyclically, so Z → z or a), each with probability 1/2.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 24 / 36
Noisy Typewriter
Alternatively:
If we use only every other one of the 26 input symbols, the chosen 13 symbols can be transmitted faithfully, since their possible outputs do not overlap.
Therefore, this way we may transmit log 13 error-free bits per channel use.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 25 / 36
Binary erasure channel
Transitions: 0 → 0 and 1 → 1 with probability 1 − α; 0 → e and 1 → e (erasure) with probability α.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 26 / 36
Binary erasure channel
Note that, with $E$ indicating whether an erasure occurred:
$$H(E) = H(\alpha, 1 - \alpha),$$
$$H(Y \mid E) = P(E = 1)\cdot H(Y \mid E = 1) + P(E = 0)\cdot H(Y \mid E = 0) = \alpha \cdot 0 + (1 - \alpha)\cdot H(X).$$
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 27 / 36
Binary erasure channel
Hence, writing $\pi = P(X = 1)$,
$$C = \max_{\pi} I(X; Y) = \max_{\pi} (1 - \alpha)\, H(\pi, 1 - \pi) = 1 - \alpha.$$
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 28 / 36
Symmetric channels
with the entry in the $x$th row and $y$th column giving the probability that $y$ is received when $x$ is sent. All the rows are permutations of each other and the same holds for all columns. We say that such a channel is symmetric.
A symmetric channel may alternatively be specified by
$$Y = X + Z \mod c,$$
where the noise $Z$ is independent of the input $X$ and $c$ is the alphabet size.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 29 / 36
Symmetric channels
$$Y = X + Z \mod c$$
Note that an arbitrary row of the transition matrix is a distribution of $Z \bmod c$. Hence, $H(Y \mid X) = H(Z)$.
$$I(X; Y) = H(Y) - H(Y \mid X) = H(Y) - H(Z) \le \log c - H(Z)$$
with equality if the distribution of $Y$ is uniform. Note that the uniform input distribution $P(X = x) = \frac{1}{c}$ yields the uniform distribution of the output, since
$$P(Y = y) = \sum_x P(Y = y \mid X = x)\, P(X = x) = \frac{1}{c} \sum_x P(Y = y \mid X = x) = \frac{1}{c},$$
as the entries in each column of the probability transition matrix sum to 1. Therefore, symmetric channels have capacity
$$C = \max_{p_X} I(X; Y) = \log c - H(Z).$$
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 30 / 36
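As a numeric illustration of the formula $C = \log c - H(Z)$ (my own sketch, with a made-up symmetric transition matrix whose rows and columns are cyclic shifts of one another), the capacity can be read off any row of the matrix.

# Illustration (my own sketch, not from the slides): capacity of a symmetric channel.
import math

def symmetric_channel_capacity(row):
    # capacity log c - H(Z), where `row` is any row of the transition matrix
    c = len(row)
    H_Z = -sum(p * math.log2(p) for p in row if p > 0)
    return math.log2(c) - H_Z

# made-up symmetric channel: rows (and columns) are cyclic shifts of (0.7, 0.2, 0.1)
row = [0.7, 0.2, 0.1]
print(f"capacity = {symmetric_channel_capacity(row):.4f} bits per channel use")   # ~ 0.43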
Properties of Channel Capacity
• $C \ge 0$
▶ since $I(X;Y) \ge 0$
• $C \le \log |\mathrm{Im}(X)|$
▶ since $\max_{p_X} I(X;Y) = \max_{p_X} (H(X) - H(X \mid Y)) \le \max_{p_X} H(X) = \log |\mathrm{Im}(X)|$
• $C \le \log |\mathrm{Im}(Y)|$
▶ since $I(X;Y) = I(Y;X)$
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 31 / 36
Part III
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 32 / 36
Definition Recall
Definition
The probability of an error provided the $i$th message was sent is $\lambda_i = P(g(Y^n) \ne i \mid X^n = f(i))$.
Definition
The maximal probability of an error for an (M, n) code is defined as
$$\lambda_{\max} = \max_{i \in \{1, 2, \ldots, M\}} \lambda_i$$
Definition
The rate R of an (M, n) code is
$$R = \frac{\log_2 M}{n} \text{ bits per transmission.}$$
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 33 / 36
Channel Coding Theorem
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 34 / 36
Comments on the Channel Coding Theorem
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 35 / 36
Channel Coding Theorem and Typicality
$$\frac{2^{nH(Y)}}{2^{nH(Y \mid X)}} = 2^{nH(Y) - nH(Y \mid X)} = 2^{nI(X;Y)},$$
which establishes the approximate number of distinguishable sequences we can send.
V. Řehák: IV111 Probability in CS Lecture 12: Channel Capacity December 12, 2024 36 / 36