
Lecture 1: Introduction and Basic Definitions

Vojtěch Řehák
based on slides of Jan Bouda

Faculty of Informatics, Masaryk University

September 26, 2024

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 1 / 61
Part I

Basic Information

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 2 / 61
Course Information

• Lecture
▶ motivation, definitions, theorems, proofs, demonstration on examples
▶ these slides IV111 * handout.pdf + videos from fall 2020
▶ for later lectures there are old versions of the slides; they will be
gradually updated
• Seminars
▶ practicing on word problems, discussions
▶ all examples are in IV111 exercises for tutorials.pdf
▶ attendance at tutorials is compulsory and absence will be penalized
• Assessment
▶ written final exam for preliminary evaluation (see Sample Exam in IS)
▶ subsequent oral exam
• The course will be taught in English;
see the Interactive syllabus for more detail

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 3 / 61
Course Topics in Exam Questions
Q1: Probability space, random variable
• probability space, events, conditional probability, independence
Q2: Markov and Chebyshev inequalities
• random variable expectation, moments, tail bounds
Q3: Laws of large numbers

Q4: Discrete-time and continuous-time Markov chains
• probabilistic processes, analysis, average and almost sure performance
Q5: Ergodic theorem for DTMC

Q6: Kraft and McMillan theorems, Huffman codes
• entropy, randomness, information, coding, compression
Q7: Channel coding theorem
• noisy channels, capacity and coding rates
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 4 / 61
Bibliography

M. Mitzenmacher and E. Upfal


Probability and Computing
Cambridge University Press, 2005
G. Grimmett, D. Stirzaker
Probability and random processes
Oxford University Press, 2001
K.S. Trivedi
Probability and statistics with reliability, queuing, and computer
science applications.
New York: Wiley, 2002
W. Feller
An Introduction to Probability Theory and Its Applications
John Wiley & Sons, 1968

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 5 / 61
Bibliography

R.B. Ash and C.A. Doléans-Dade


Probability and Measure Theory
Harcourt Academic Press, 2000
R. Motwani and P. Raghavan
Randomized Algorithms
Cambridge University Press, 2000
T.M. Cover and J.A. Thomas
Elements of Information Theory
John Wiley & Sons, 2006
D. MacKay
Information Theory, Inference and Learning Algorithms
Cambridge University Press, 2003

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 6 / 61
Part II

Motivation

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 7 / 61
Randomness, Probability and Statistics

Probability: given the model, predict the data.
Statistics: given the data, predict the model.

[Diagram: "World or Model" → "Measured Data" (probability); "Measured Data" → "World or Model" (statistics).]

Probability deals with predicting the likelihood of future events,
while statistics involves the analysis of the frequency of past events.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 8 / 61
Motivation for The Course

• Probability is one of the central concepts of mathematics and also of


computer science.
• It allows us to model and study not only truly random processes,
but also situations where we have incomplete information about the
process.
• It is important when studying the average behavior of algorithms,
adversaries, . . .

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 11 / 61
Applications of Probability Theory

• Machine learning
• Recommendation systems
• Spam filtering
• Computational finance
• Cryptography, Network security
• Systems biology
• DNA sequencing
• ...

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 12 / 61
Example: Algorithm for Verifying Matrix Multiplication
Given three n × n matrices A, B, and C, verify

AB = C.
Standard algorithm:
• perform matrix multiplication AB
• compare the result with C

The standard matrix multiplication takes Θ(n^3) time.
A more sophisticated multiplication algorithm takes Θ(n^2.37) time.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 13 / 61
Example: Algorithm for Verifying Matrix Multiplication
Given three n × n matrices A, B, and C, verify

AB = C.
Randomized algorithm (Freivalds' algorithm):
1. Choose a vector r⃗ = (r_1, r_2, . . . , r_n) ∈ {0, 1}^n uniformly at random.
2. Compute the vector v⃗_{Br} = B r⃗.
3. Compute the vector v⃗_{ABr} = A v⃗_{Br}.
4. Compute the vector v⃗_{Cr} = C r⃗.
5. If v⃗_{ABr} ≠ v⃗_{Cr}, return AB ≠ C; else return AB = C.
This algorithm takes Θ(n^2) time.
Questions:
1. Suppose that AB ≠ C; what is the probability of a wrong answer?
2. What is the probability of a wrong answer when the algorithm is executed 100 times?

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 13 / 61
Answers:
1. If AB ≠ C and the vector r⃗ is chosen uniformly at random, the probability of a wrong answer is at most 1/2. (For a formal proof, see Wikipedia.)
2. If the vectors are chosen independently, the probability of a wrong answer is at most 1/2^100, which is smaller than the probability of a hardware error.
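The check above is easy to express in code. A minimal sketch in Python with NumPy (the function name freivalds and the repetition count k are our own illustrative choices, not part of the lecture):

import numpy as np

def freivalds(A, B, C, k=100):
    # Probabilistic check whether A @ B == C.
    # One trial errs with probability at most 1/2 when AB != C,
    # so k independent trials err with probability at most 2**-k.
    n = C.shape[0]
    for _ in range(k):
        r = np.random.randint(0, 2, size=n)        # uniform vector from {0,1}^n
        # three matrix-vector products, each Theta(n^2)
        if not np.array_equal(A @ (B @ r), C @ r):
            return False                            # certainly AB != C
    return True                                     # AB == C with high confidence

A = np.array([[2, 3], [3, 4]])
B = np.array([[1, 0], [1, 2]])
C = A @ B
print(freivalds(A, B, C))       # True
print(freivalds(A, B, C + 1))   # almost surely False

Note the one-sided error: a False answer is always correct; only a True answer can be wrong, with probability at most 2^-k.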
Goals of Probability Theory

• The main goal of probability theory is to study random experiments.
• A random experiment is a model of a physical or thought experiment in which we are uncertain about the outcomes, regardless of whether the uncertainty is due to objective coincidences or to our ignorance. We should be able to estimate how 'likely' the respective outcomes are.
• A random experiment is specified by the set of possible outcomes and the probabilities that each particular outcome occurs.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 14 / 61
Examples of Random Experiments

A typical example of a random experiment is tossing a coin.


Example
Possible outcomes of this experiment are head and tail.
Probability of each of these outcomes depends on physical properties of
the coin, on the way it is thrown, . . .
In case of a fair coin (and fair coin toss) the probability to obtain head
(tail) is 1/2.

Coin tossing is an important abstraction: the analysis of coin tossing can be applied to many other problems.

In the context of computer science, this experiment represents the value of


a randomly generated bit.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 15 / 61
Examples of Random Experiments

Another example of a random experiment is throwing a six-sided die.


Example
Possible outcomes of this experiment are the symbols '1'–'6', representing the respective faces of the die.
Assuming the die is unbiased, the probability that a particular outcome occurs is the same for all outcomes and equals 1/6.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 16 / 61
Examples of Random Experiments: Balls and Bins

• In this experiment we have three balls and three bins, and we sequentially and independently put each ball into one of the bins uniformly at random.
• The atomic outcomes of this experiment correspond to the positions of the respective balls. Let us denote the balls a, b, and c.
• All atomic outcomes of this experiment are:

1.[abc][ ][ ] 10.[a ][ bc][ ] 19.[ ][a ][ bc]


2.[ ][abc][ ] 11.[ b ][a c][ ] 20.[ ][ b ][a c]
3.[ ][ ][abc] 12.[ c][ab ][ ] 21.[ ][ c][ab ]
4.[ab ][ c][ ] 13.[a ][ ][ bc] 22.[a ][ b ][ c]
5.[a c][ b ][ ] 14.[ b ][ ][a c] 23.[a ][ c][ b ]
6.[ bc][a ][ ] 15.[ c][ ][ab ] 24.[ b ][a ][ c]
7.[ab ][ ][ c] 16.[ ][ab ][ c] 25.[ b ][ c][a ]
8.[a c][ ][ b ] 17.[ ][a c][ b ] 26.[ c][a ][ b ]
9.[ bc][ ][a ] 18.[ ][ bc][a ] 27.[ c][ b ][a ]

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 17 / 61
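The 27 atomic outcomes above can be enumerated mechanically. A small Python sketch (the helper name render is illustrative):

from itertools import product

balls, bins = "abc", range(3)

# an outcome assigns a bin (0, 1, or 2) to each of the balls a, b, c
outcomes = list(product(bins, repeat=len(balls)))
print(len(outcomes))            # 27 atomic outcomes

def render(outcome):
    # show the contents of each bin, e.g. (0, 1, 1) -> ['a', 'bc', '']
    return ["".join(b for b, k in zip(balls, outcome) if k == i) for i in bins]

print(render((0, 1, 1)))        # ['a', 'bc', '']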
Part III

Sample Space, Events, and Probability

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 18 / 61
Random Experiment

• The (idealized) random experiment is the central notion of probability theory.
• A random experiment is specified (from the mathematical point of view) by the set of possible outcomes and the probabilities assigned to each of these outcomes.
• A single execution of a random experiment is called a trial.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 19 / 61
Sample Space

Definition
The (non-empty) set of the possible outcomes of a random experiment (náhodný pokus) is called the sample space (základní prostor) of the experiment, and it will be denoted S.
The outcomes of a random experiment (elements of the sample space) are called sample points (základní/elementární body).

• Every thinkable/considered outcome of a random experiment is described by one, and only one, sample point.
• What is the thinkable/considered outcome?

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 20 / 61
Sample Space - Examples

The sample space of the


• ’coin tossing’ experiment is {head, tail}.
• ’throwing a six-sided die’ experiment is {1, 2, 3, 4, 5, 6}.

Example
Let us consider the following experiment: we toss the coin until the ’head’
appears. Possible outcomes of this experiment are

H, TH, TTH, TTTH, . . . .

We may also consider the possibility that ’head’ never occurs. In this case
we have to introduce an extra sample point denoted, e.g. ⊥.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 21 / 61
Sample Space - Examples

Flipping two coins: S = {(H, H), (H, T ), (T , H), (T , T )}


Flipping two coins (number of Hs): S = {(H, H), (H, T ), (T , T )}

Number of emails per day: S = N0


Sequence of spams/hams per day: S = {S, H}∗

Idea
The sample space is not determined completely by the experiment.
It is partially determined by the purpose for which the experiment is
carried out.

Do we need only the number of emails? Or email length?


Or sequence of spams/hams classification? Or classification by senders?
...

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 22 / 61
Sample Space Size

Coin flip: S = {H, T }


Roll of 6-sided die: S = {1, 2, 3, 4, 5, 6}

Number of coin flips to see head: S = N0 ∪ {∞}


Sequence of spams/hams per day: S = {S, H}∗

Portion of a day wasted on FB: S = (0, 1) ⊆ R


Rolling time of 6-sided die: S = R≥0

finite or countable → discrete sample space
uncountable → continuous sample space

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 23 / 61
Event

In addition to basic outcomes of a random experiment, we are often


interested in more complicated events that represent a number of
outcomes of a random experiment.

The event ’outcome is even’ in the ’throwing a six-sided die’ experiment


corresponds to atomic outcomes ’2’, ’4’, ’6’. Therefore, we represent an
event of a random experiment as a subset of its sample space.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 24 / 61
Events in Discrete Space

Definition
An event (jev) of a random experiment with sample space S is any subset of S.

• The event A ∼ 'throwing two independent dice results in the sum 6' is
A = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}.
• Similarly, the event B ∼ 'two odd faces' is B = {(1, 1), (1, 3), . . . , (5, 5)}.
• ∅ is a (null) event (nemožný jev).
• S is a (universal) event (jistý jev).
• {s} is an (elementary) event, for each sample point s ∈ S.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 25 / 61
Event

Note that occurrence of a particular outcome may imply a number of


events. In such a case these events occur simultaneously.
• The outcome (3, 3) implies event ’the sum is 6’ as well as the event
’two odd faces’.
• Events ’the sum is 6’ and ’two odd faces’ may occur simultaneously.
• Every compound event can be decomposed into atomic events
(sample points); a compound event is an aggregate of atomic events.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 26 / 61
Example of Events: Three Balls in Three Bins
• The event A = 'there is more than one ball in one of the bins' corresponds to atomic outcomes 1-21. We say that the event A is an aggregate of events 1-21.
• The event B = 'the first bin is not empty' is an aggregate of sample points 1, 4-15, 22-27.
• The event C is defined as 'both A and B occur'. It represents sample points 1, 4-15.

1. [abc][ ][ ] 10. [a ][ bc][ ] 19. [ ][a ][ bc]


2. [ ][abc][ ] 11. [ b ][a c][ ] 20. [ ][ b ][a c]
3. [ ][ ][abc] 12. [ c][ab ][ ] 21. [ ][ c][ab ]
4. [ab ][ c][ ] 13. [a ][ ][ bc] 22. [a ][ b ][ c]
5. [a c][ b ][ ] 14. [ b ][ ][a c] 23. [a ][ c][ b ]
6. [ bc][a ][ ] 15. [ c][ ][ab ] 24. [ b ][a ][ c]
7. [ab ][ ][ c] 16. [ ][ab ][ c] 25. [ b ][ c][a ]
8. [a c][ ][ b ] 17. [ ][a c][ b ] 26. [ c][a ][ b ]
9. [ bc][ ][a ] 18. [ ][ bc][a ] 27. [ c][ b ][a ]

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 27 / 61
Algebra of Events - Notation

• x ∈ A - a sample point x ∈ S is contained in event A.
• A = B - two events A and B contain the same sample points.
• A = ∅ - event A contains no sample points.
• \overline{A} := {x ∈ S | x ∉ A} - the event 'A does not occur' - the complementary (negative) event.
• A ∩ B - the event 'both A and B occur'.
• A ∪ B - the event 'either A or B or both occur'.
• A ⊆ B - the event 'A implies B'.

Message
Events are sets.
Hence, set operations can be used to obtain a new event.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 28 / 61
Algebra of Events - Laws
E1 Commutative:
A ∪ B = B ∪ A, A ∩ B = B ∩ A
E2 Associative:
A ∪ (B ∪ C ) = (A ∪ B) ∪ C
A ∩ (B ∩ C ) = (A ∩ B) ∩ C
E3 Distributive:
A ∪ (B ∩ C ) = (A ∪ B) ∩ (A ∪ C )
A ∩ (B ∪ C ) = (A ∩ B) ∪ (A ∩ C )
E4 Identity:
A ∪ ∅ = A, A ∩ S = A
E5 Complement:
A ∪ \overline{A} = S, A ∩ \overline{A} = ∅
Any relation valid in the algebra of events can be proved using these
axioms.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 29 / 61
Algebra of Events - Some Relations
Using the previously introduced axioms we can derive, e.g.,
• Idempotent laws:
A ∪ A = A, A ∩ A = A
• Domination laws:
A ∪ S = S, A ∩ ∅ = ∅
• Absorption laws:
A ∪ (A ∩ B) = A, A ∩ (A ∪ B) = A
• de Morgan's laws:
\overline{A ∪ B} = \overline{A} ∩ \overline{B},  \overline{A ∩ B} = \overline{A} ∪ \overline{B}
• Double complement:
\overline{\overline{A}} = A
• A ∪ (\overline{A} ∩ B) = A ∪ B
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 30 / 61
Probability - Intuition

Definition
Let us suppose that we perform n trials of a random experiment and we
obtain an outcome s ∈ S k-times. Then the relative frequency of the
outcome s is k/n.

• Intuitively, probability can be understood as the limit of the relative frequency (for n going to infinity).
• In many engineering applications, the relative frequency is used interchangeably with probability. This is not mathematically correct; however, we may use the relative frequency as an approximation of probability when we can perform a number of trials of the experiment but do not have a theoretical description of it.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 31 / 61
Definition of Discrete Probability

The formal mathematical definition of probability is:

Definition
A probability function P on a (discrete) sample space S with the set of all events F = 2^S is a function P : F → [0, 1] such that:
• P(S) = 1, and
• for any countable sequence of pairwise disjoint events A_1, A_2, . . . ,

  P(⋃_{i=1}^{∞} A_i) = ∑_{i=1}^{∞} P(A_i).

The second item is traditionally called the countable additivity axiom.
In particular, P(A ∪ B) = P(A) + P(B) for disjoint events A and B.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 32 / 61
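A discrete probability function can be modeled directly in code. A minimal Python sketch for one fair six-sided die (exact arithmetic via fractions, so the axioms check exactly; the helper name P is ours):

from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}                      # sample space of one fair die
p = {s: Fraction(1, 6) for s in S}          # elementary-event probabilities

def P(event):                               # event: any subset of S
    return sum(p[s] for s in event)

assert P(S) == 1                            # P(S) = 1
print(P({2, 4, 6}))                         # 1/2, the event 'outcome is even'
assert P({1, 2} | {5, 6}) == P({1, 2}) + P({5, 6})   # additivity for disjoint events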
Some Relations

Are the following statements always true?

• P(∅) = 0
• P(\overline{A}) = 1 − P(A) for any event A
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any events A and B
• inclusion and exclusion:

P(⋃_{i=1}^{n} A_i) = P(A_1 ∪ · · · ∪ A_n)
  = ∑_i P(A_i) − ∑_{1≤i<j≤n} P(A_i ∩ A_j) + ∑_{1≤i<j<k≤n} P(A_i ∩ A_j ∩ A_k) − · · · + (−1)^{n−1} P(A_1 ∩ A_2 ∩ · · · ∩ A_n)

for any events A_1, A_2, . . . , A_n.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 33 / 61
Notes (proofs of the relations above):

P(∅) = 0:  1 = P(S) = P(S ∪ ∅) = P(S) + P(∅) = 1 + P(∅), hence P(∅) = 0.

P(\overline{A}) = 1 − P(A):  1 = P(S) = P(A ∪ \overline{A}) = P(A) + P(\overline{A}).

Inclusion and exclusion for two events:

P(A ∪ B) = P(A ∪ (B ∖ (A ∩ B)))
         = P(A) + P(B ∖ (A ∩ B))                      (disjoint events)
         = P(A) + 1 − P(\overline{B ∖ (A ∩ B)})
         = P(A) + 1 − P(\overline{B} ∪ (A ∩ B))        (de Morgan)
         = P(A) + 1 − (P(\overline{B}) + P(A ∩ B))     (disjoint events)
         = P(A) + 1 − (1 − P(B) + P(A ∩ B))
         = P(A) + P(B) − P(A ∩ B).
Probability - How To Assign

For a sample point s, we sometimes use P(s) instead of the correct


P({s}) for the elementary event {s}.

What is the difference between a sample point and an elementary event?

Idea
For finitely or countably many sample points, it is sufficient to assign
probabilities to elementary events.
Then, the probability of an event will be the finite or countable sum of
probabilities of the elementary events included in the event.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 34 / 61
Event collections
Definition
Events A_1, A_2, . . . , A_n are mutually exclusive or mutually disjoint ((vzájemně) disjunktní nebo neslučitelné) if and only if

∀i ≠ j : A_i ∩ A_j = ∅.

Definition
Events A_1, A_2, . . . , A_n are collectively exhaustive (tvoří úplný systém) if and only if

A_1 ∪ A_2 ∪ · · · ∪ A_n = S.

A set of events can be collectively exhaustive, mutually exclusive, both, or neither.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 35 / 61
Partition of Sample Space

Definition
A mutually exclusive and collectively exhaustive list of events is called a partition (rozklad) of the sample space S.

Example
The list of all elementary events {s}, for s ∈ S, is a partition of S.

Message
Every partition can be used as an alternative sample space.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 36 / 61
Designing a (discrete) random experiment

Idea
To design a random experiment in a real situation, follow this procedure:
• Identify the sample space - set of mutually exclusive and collectively
exhaustive events. Choose the elements in the way that they cannot
be further subdivided. You can always define aggregate events.
• Assign probabilities to elements in S - probabilities are usually
either result of a careful theoretical analysis, or based on estimates
obtained from past experience.
• Identify events of interest - they are usually described by
statements and should be reformulated in terms of subsets of S.
• Compute desired probabilities - calculate the probabilities of
interesting events using sums of elementary event probabilities.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 37 / 61
Balls and Bins Revisited - Aggregates
Let us return to the first example with three balls and three bins and
suppose that the balls are not distinguishable, implying e.g. that we do
not distinguish atomic events 4, 5 and 6. Atomic events in the new
experiment (placing three indistinguishable balls into three bins) are
1.[***][ ][ ] 6.[ * ][** ][ ]
2.[ ][***][ ] 7.[ * ][ ][** ]
3.[ ][ ][***] 8.[ ][** ][ * ]
4.[** ][ * ][ ] 9.[ ][ * ][** ]
5.[** ][ ][ * ] 10.[ * ][ * ][ * ]

• It is irrelevant for our theory whether the real balls are indistinguishable or not.
• Even if they are distinguishable, we may decide to treat them as indistinguishable; it is often even preferable.
• Dice may be colored to make them distinguishable, but it depends
purely on our decision whether we use this possibility or not.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 38 / 61
Balls and Bins Revisited

We described the experiments in terms of bins and balls, but this


experiment can be equivalently applied to a number of practical situations.
The only difference is the verbal description.
• Birthdays: The possible configuration of birthdays of r people
corresponds to the random placement of r balls into n = 365 bins
(assuming every year has 365 days).
• Elevator: Elevator starts with r passengers and stops in n floors.
• Dice: A throw with r dice corresponds to placing r balls into 6 bins.
In the case of coin tosses we have n = 2 bins.
• IV111 exam: The exam from Probability in Computer Science
corresponds to placement of 94 (the number of students) balls into
n = 6 bins (A, B, C, D, E, F).

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 39 / 61
Balls and Bins Revisited

IV111 example comments:


• Each student is distinguishable, but to judge statistical outcomes of
the exam (such as probability distribution of marks) it is useless to
complicate the experiment by distinguishing respective students.
• The sample points of the experiment with indistinguishable balls correspond to aggregates in the experiment with distinguishable balls. In the example, the atomic event 4 corresponds to the aggregate event of sample points 4-6 in the original experiment.
The concrete situation dictates this choice. Our theory begins after the
proper model has been chosen.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 40 / 61
Discrete Probability Space

Definition
A discrete probability space (pravděpodobnostní prostor) is a triple (S, F, P) where
• S is a sample space (základní prostor), i.e. the set of all possible outcomes,
• F = 2^S is the collection of all events, and
• P : F → [0, 1] is a probability function (pravděpodobnostní funkce).

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 41 / 61
Continuous Sample Space - Examples
• What is the portion of a day spent on FB?
• Hitting an archery target
(areas - discrete vs. distance from the center - continuous).

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 42 / 61
Discrete vs. Continuous Sample Space

Discrete sample space:


We count/sum probabilities of elementary events.
Continuous sample space:
An event can be composed of uncountably many sample points.

Idea
We need to be able to measure the 'area' of particular sets of sample points.

For a discrete sample space, we can always use the σ-field F = 2^S.

Can we measure all events of 2^S?

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 43 / 61
Counterexample
Imagine a uniform distribution on [0, 1].
Let Q be the set of all rational numbers and Q′ the rationals in [0, 1).
Let A ⊕ r = {a + r mod 1 | a ∈ A} be a circular shift.
As ⊕ r preserves the size of the set, we can naturally expect that P(A) = P(A ⊕ r).
Define an equivalence relation on the reals: x ∼ y iff the difference x − y ∈ Q.
Let H be a subset of [0, 1) consisting of precisely one element from each equivalence class of ∼ (its existence follows from the Axiom of Choice).
The sets H ⊕ r for r ∈ Q′ are pairwise disjoint and cover [0, 1), so

1 = P([0, 1)) = P(⋃_{r∈Q′} (H ⊕ r)) = ∑_{r∈Q′} P(H ⊕ r) = ∑_{r∈Q′} P(H).

But a countable sum of the constant P(H) is either 0 or ∞, never 1. Hence we cannot assign any probability to H. For more info see:

Jeffrey S. Rosenthal: A First Look at Rigorous Probability Theory. World Scientific, 2006.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 44 / 61
Set of All Measurable Events

When the sample space is continuous, the power set F = 2^S is too large a collection for probabilities to be assigned reasonably to all its members. It is sufficient to take a subset F ⊆ 2^S that is a σ-field.
Definition
Let S be a sample space. Then F ⊆ 2^S is a σ-field if
• ∅ ∈ F,
• if A_1, A_2, · · · ∈ F then ⋃_{i=1}^{∞} A_i ∈ F, and
• if A ∈ F then \overline{A} ∈ F.

What is the smallest σ-field of S?

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 45 / 61
Probability Space

Definition
A probability space (pravděpodobnostní prostor) is a triple (S, F, P) where
• S is a sample space (základní prostor), i.e. the set of all possible outcomes,
• F ⊆ 2^S is a σ-field (jevové pole), i.e. a collection of sets representing the allowable events, and
• P : F → [0, 1] is a probability function (pravděpodobnostní funkce).

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 46 / 61
Part IV

Conditional Probability, Independent Events,


and Bayes’ Rule

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 47 / 61
Conditional Probability

Conditional probability arises when we obtain incomplete information about the result of our experiment.

Example
We have thrown two 6-sided dice and we see that the face of the first one is 2.
What is the probability that the sum of both is 3? And if the first one is 4?

Example
We are throwing a 6-sided die, but it is so far away that we can see only that the result is ≥ 4.
What is the probability that it is 6?

Contrary to the initial situation, we now know that some sample points have zero probability, i.e., our probability distribution changes.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 48 / 61
Conditional Probability

Given that B occurred, we know that the outcomes outside B could not occur (i.e., they have zero probability).
Let us reset their probabilities to zero.
But the probabilities of all sample points have to sum to 1, and now they sum only to P(B). So we also need to scale the remaining ones up.
For every atomic outcome s, we derive

P(s|B) = P(s) / P(B) if s ∈ B, and P(s|B) = 0 if s ∉ B.

In this way the probabilities assigned to points in B are scaled up by 1/P(B). We obtain

∑_{s∈B} P(s|B) = 1.

B is our 'sample space' now.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 49 / 61
Conditional Probability
Definition
For P(B) ≠ 0, the conditional probability (podmíněná pravděpodobnost) of A given B is

P(A|B) = P(A ∩ B) / P(B).

If P(B) = 0, it is undefined.

In this way, we directly obtain

P(A ∩ B) = P(B) P(A|B) if P(B) ≠ 0,
P(A ∩ B) = P(A) P(B|A) if P(A) ≠ 0,
P(A ∩ B) = 0 otherwise.

Note: We sometimes wish to condition on null events; then the approach is more complicated. See J. Rosenthal: A First Look at Rigorous Probability Theory, Chapter 13.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 50 / 61
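The two-dice example above can be checked by brute-force enumeration. A minimal Python sketch (the helper name cond is ours):

from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))          # two independent fair dice
P = lambda event: Fraction(len(event), len(S))   # uniform probability

def cond(A, B):                                  # P(A | B) = P(A ∩ B) / P(B)
    return P(A & B) / P(B)

sum_is_3   = {s for s in S if sum(s) == 3}
first_is_2 = {s for s in S if s[0] == 2}
first_is_4 = {s for s in S if s[0] == 4}
print(cond(sum_is_3, first_is_2))                # 1/6: the second die must show 1
print(cond(sum_is_3, first_is_4))                # 0: impossible once a 4 is seen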
Independence of Events

Definition
Events A and B are said to be independent (nezávislé) if

P(A ∩ B) = P(A) · P(B).

Intuition (assuming P(A) ≠ 0 ≠ P(B)):
Using the definition of conditional probability, we have

P(A ∩ B) = P(A|B) P(B).

Hence, for independent events we have

P(A) · P(B) = P(A ∩ B) = P(A|B) P(B),

i.e., P(A) = P(A|B), which says that the probability of event A does not change regardless of whether event B occurred.
V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 51 / 61
Independence of Events - Remarks

• Independence is symmetric - when A is independent of B, then also B is independent of A.
• Mutually exclusive events A and B are independent iff P(A) = 0 or P(B) = 0.
• If an event A is independent of itself, then P(A) = 0 or P(A) = 1.
• Null and universal events are independent of any other event.
• If A and B are independent, then so are A and \overline{B}, \overline{A} and B, and \overline{A} and \overline{B}.
• Independence is not transitive: A and B being independent and B and C being independent does not imply that A and C are independent.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 52 / 61
Independence of More Events
Definition
Events A_1, A_2, . . . , A_n are (mutually) independent ((vzájemně) nezávislé) if and only if for any set {i_1, i_2, . . . , i_k} ⊆ {1, . . . , n} (2 ≤ k ≤ n) of distinct indices it holds that

P(A_{i_1} ∩ A_{i_2} ∩ · · · ∩ A_{i_k}) = P(A_{i_1}) P(A_{i_2}) · · · P(A_{i_k})   (1)

• In Eq. (1) we can replace any occurrence of A_i by \overline{A_i}. Why?
• Compare the properties of P(A_1 ∩ A_2 ∩ · · · ∩ A_n) when A_1, A_2, . . . , A_n are mutually independent and when they are mutually exclusive.

Definition
Events A_1, A_2, . . . , A_n are pairwise independent (po dvou nezávislé) iff for all distinct indices i, j ∈ {1, . . . , n} it holds that P(A_i ∩ A_j) = P(A_i) P(A_j).

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 53 / 61
Independence of More Events - Example

An example of pairwise independent but not (mutually) independent


events.
Example
Let S = {s1 , s2 , s3 , s4 } with P(s1 ) = P(s2 ) = P(s3 ) = P(s4 ) = 1/4.
Then for A = {s1 , s2 }, B = {s2 , s3 }, C = {s2 , s4 }, it holds that A, B, C are
pairwise independent but not (mutually) independent events.

[Diagram: the four sample points s1, s2, s3, s4; each of A, B, C contains two of them, and all three events share the single point s2.]

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 54 / 61
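The example can be verified mechanically. A short Python sketch with the uniform probability on the four sample points:

from fractions import Fraction

P = lambda event: Fraction(len(event), 4)        # uniform on {s1, s2, s3, s4}
A, B, C = {"s1", "s2"}, {"s2", "s3"}, {"s2", "s4"}

# pairwise independence: P(X ∩ Y) = P(X) P(Y) = 1/4 for every pair
for X, Y in [(A, B), (A, C), (B, C)]:
    assert P(X & Y) == P(X) * P(Y)

# ... but not mutual independence:
print(P(A & B & C), P(A) * P(B) * P(C))          # 1/4 versus 1/8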
Law of Total Probability

• Any list of mutually exclusive and collectively exhaustive events forms an event space.
• E.g., note that the events B and \overline{B} partition the sample space S.
• We can define S′ = {B, \overline{B}}; using the probabilities P(B) and P(\overline{B}), the set S′ has properties similar to a sample space, except that there is a many-to-one correspondence between outcomes of the experiment and elements of S′.

Theorem (law of total probability)
Let A be an event and {B_1, . . . , B_n} be an event space. Then

P(A) = ∑_{i=1}^{n} P(A | B_i) P(B_i).

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 55 / 61
Law of Total Probability (Proof)

P(A) = ∑_{i=1}^{n} P(A | B_i) P(B_i)

Proof.
Let A be an event and {B_1, . . . , B_n} be an event space, i.e. mutually exclusive and collectively exhaustive. Then

∑_{i=1}^{n} P(A | B_i) P(B_i) = ∑_{i=1}^{n} P(A ∩ B_i)
  = P(⋃_{i=1}^{n} (A ∩ B_i))     (the events A ∩ B_i are mutually exclusive)
  = P(A ∩ ⋃_{i=1}^{n} B_i)       (distributivity)
  = P(A ∩ S)                     (collective exhaustiveness)
  = P(A).

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 56 / 61
Law of Total Probability (Example)

Example
Production on machines A, B, C . Products can be defective.
P(A) = 0.25 P(Defect | A) = 0.05
P(B) = 0.35 P(Defect | B) = 0.04
P(C ) = 0.40 P(Defect | C ) = 0.02
P(Defect) = ∑_{X∈{A,B,C}} P(Defect | X) P(X)

Idea
The theorem of total probability is useful when we know the conditional probabilities of a property in all (exhaustive) subcases and we are interested in the overall probability of the property.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 57 / 61
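Plugging in the numbers above: P(Defect) = 0.25 · 0.05 + 0.35 · 0.04 + 0.40 · 0.02 = 0.0125 + 0.0140 + 0.0080 = 0.0345.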
Bayes’ Rule

Theorem (Bayes' Rule)
Let A and B be events such that P(B) > 0. Then

P(A | B) = P(B | A) · P(A) / P(B).

Proof.
We use the definition of conditional probability and the commutativity of intersection:

P(A | B) P(B) = P(A ∩ B) = P(B ∩ A) = P(B | A) P(A),

hence P(A | B) = P(A ∩ B) / P(B) = P(B | A) P(A) / P(B).

Traditional naming of the terms:

P(A | B) = (P(B | A) / P(B)) · P(A)  is read as  Posterior = (Likelihood / Evidence) · Prior.

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 58 / 61
Bayes’ Rule - Conditional Variants

Theorem (Bayes' Rule)
Let A and B be events such that P(B) > 0. Then

P(A | B) = P(B | A) · P(A) / P(B).

Theorem (Bayes' Rule, conditional variants)
Let A, B, and C be events such that P(A ∩ C) · P(B ∩ C) > 0. Then

P(A | B ∩ C) = P(B ∩ C | A) · P(A) / P(B ∩ C) = P(B | A ∩ C) · P(A | C) / P(B | C).

Proof.
The first equality is Bayes' rule with the substitution B ⇝ B ∩ C. Intuitively, the second one is Bayes' theorem in the probability space conditioned by C; note that 'P(A | B) given C' is P(A | B ∩ C). Formally,

P(B | A ∩ C) · P(A | C) / P(B | C)
  = P(B | A ∩ C) · P(A | C) · P(C) / (P(B | C) · P(C))
  = P(B | A ∩ C) · P(A ∩ C) / P(B ∩ C)
  = P(A ∩ B ∩ C) / P(B ∩ C)
  = P(A | B ∩ C).

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 59 / 61
Bayes’ Rule with Law of Total Probability

Theorem (Bayes' Rule & Law of Total Probability)
Let A be an event with P(A) > 0 and {B_1, . . . , B_n} be an event space. Then for every B_j

P(B_j | A) = P(A | B_j) P(B_j) / P(A) = P(A | B_j) P(B_j) / ∑_{i=1}^{n} P(A | B_i) P(B_i).

Proof.
By the definition of conditional probability, P(B_j ∩ A) = P(A | B_j) P(B_j), and by the law of total probability, P(A) = ∑_{i=1}^{n} P(A | B_i) P(B_i). Hence

P(B_j | A) = P(B_j ∩ A) / P(A) = P(A | B_j) P(B_j) / ∑_{i=1}^{n} P(A | B_i) P(B_i).

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 60 / 61
Bayes’ Rule (Example)

P(B_j | A) = P(A | B_j) P(B_j) / P(A) = P(A | B_j) P(B_j) / ∑_{i=1}^{n} P(A | B_i) P(B_i).

Example
Production on machines A, B, C. Products can be defective.
P(A) = 0.25   P(Defect | A) = 0.05
P(B) = 0.35   P(Defect | B) = 0.04
P(C) = 0.40   P(Defect | C) = 0.02
Having a defective product, what is the probability that A produced it?

P(A | Defect) = P(Defect | A) P(A) / P(Defect) = P(Defect | A) P(A) / ∑_{X∈{A,B,C}} P(Defect | X) P(X)

V. Řehák: IV111 Probability in CS Lecture 1: Introduction and Basic Definitions September 26, 2024 61 / 61
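With P(Defect) = 0.0345 computed earlier by the law of total probability, P(A | Defect) = (0.05 · 0.25) / 0.0345 = 0.0125 / 0.0345 ≈ 0.36. So machine A makes only 25% of the products but accounts for roughly 36% of the defects.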
Lecture 2: Random Variables

Vojtěch Řehák
based on slides of Jan Bouda

Faculty of Informatics, Masaryk University

October 3, 2024

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 1 / 52


Part I

Revision

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 2 / 52


Discrete Probability Space

Definition
A discrete probability space is a triple (S, F, P) where
• S is a sample space, i.e. the set of all possible outcomes,
• F = 2^S is the collection of all events, and
• P : F → [0, 1] is a probability function.

Definition
A probability function P on a (discrete) sample space S with the set of all events F = 2^S is a function P : F → [0, 1] such that:
• P(S) = 1, and
• for any countable sequence of mutually exclusive events A_1, A_2, . . . ,

  P(⋃_{i=1}^{∞} A_i) = ∑_{i=1}^{∞} P(A_i).

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 3 / 52


Conditional Probability and Independence of Events

Definition
For P(B) ≠ 0, the conditional probability (podmíněná pravděpodobnost) of A given B is

P(A|B) = P(A ∩ B) / P(B).

If P(B) = 0, it is undefined.

Definition
Events A and B are said to be independent (nezávislé) if

P(A ∩ B) = P(A) · P(B).

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 4 / 52


Independence of Events
Definition
Events A_1, A_2, . . . , A_n are (mutually) independent ((vzájemně) nezávislé) if and only if for any set {i_1, i_2, . . . , i_k} ⊆ {1, . . . , n} (2 ≤ k ≤ n) of distinct indices it holds that

P(A_{i_1} ∩ A_{i_2} ∩ · · · ∩ A_{i_k}) = P(A_{i_1}) P(A_{i_2}) · · · P(A_{i_k})   (1)

• In Eq. (1) we can replace any occurrence of A_i by \overline{A_i}. Why?
• Compare the properties of P(A_1 ∩ A_2 ∩ · · · ∩ A_n) when A_1, A_2, . . . , A_n are mutually independent and when they are mutually exclusive.

Definition
Events A_1, A_2, . . . , A_n are pairwise independent (po dvou nezávislé) iff for all distinct indices i, j ∈ {1, . . . , n} it holds that P(A_i ∩ A_j) = P(A_i) P(A_j).

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 5 / 52


Independence of More Events - Example

An example of pairwise independent but not (mutually) independent


events.
Example
Let S = {s1 , s2 , s3 , s4 } with P(s1 ) = P(s2 ) = P(s3 ) = P(s4 ) = 1/4.
Then for A = {s1 , s2 }, B = {s2 , s3 }, C = {s2 , s4 }, it holds that A, B, C are
pairwise independent but not (mutually) independent events.

[Diagram: the four sample points s1, s2, s3, s4; each of A, B, C contains two of them, and all three events share the single point s2.]

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 6 / 52


Law of Total Probability

• Any list of mutually exclusive and collectively exhaustive events forms an event space.
• E.g., note that the events B and \overline{B} partition the sample space S.
• We can define S′ = {B, \overline{B}}; using the probabilities P(B) and P(\overline{B}), the set S′ has properties similar to a sample space, except that there is a many-to-one correspondence between outcomes of the experiment and elements of S′.

Theorem (law of total probability)
Let A be an event and {B_1, . . . , B_n} be an event space. Then

P(A) = ∑_{i=1}^{n} P(A | B_i) P(B_i).

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 7 / 52


Law of Total Probability (Proof)

P(A) = ∑_{i=1}^{n} P(A | B_i) P(B_i)

Proof.
Let A be an event and {B_1, . . . , B_n} be an event space, i.e. mutually exclusive and collectively exhaustive. Then

∑_{i=1}^{n} P(A | B_i) P(B_i) = ∑_{i=1}^{n} P(A ∩ B_i)
  = P(⋃_{i=1}^{n} (A ∩ B_i))     (the events A ∩ B_i are mutually exclusive)
  = P(A ∩ ⋃_{i=1}^{n} B_i)       (distributivity)
  = P(A ∩ S)                     (collective exhaustiveness)
  = P(A).

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 8 / 52


Law of Total Probability (Example)

Example
Production on machines A, B, C . Products can be defective.
P(A) = 0.25 P(Defect | A) = 0.05
P(B) = 0.35 P(Defect | B) = 0.04
P(C ) = 0.40 P(Defect | C ) = 0.02
P(Defect) = ∑_{X∈{A,B,C}} P(Defect | X) P(X)

Idea
The theorem of total probability is useful when we know the conditional probabilities of a property in all (exhaustive) subcases and we are interested in the overall probability of the property.

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 9 / 52


Bayes’ Rule

Theorem (Bayes' Rule)
Let A and B be events such that P(B) > 0. Then

P(A | B) = P(B | A) · P(A) / P(B).

Proof.
We use the definition of conditional probability and the commutativity of intersection:

P(A | B) P(B) = P(A ∩ B) = P(B ∩ A) = P(B | A) P(A),

hence P(A | B) = P(A ∩ B) / P(B) = P(B | A) P(A) / P(B).

Traditional naming of the terms:

P(A | B) = (P(B | A) / P(B)) · P(A)  is read as  Posterior = (Likelihood / Evidence) · Prior.

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 10 / 52


Bayes’ Rule - Conditional Variants

Theorem (Bayes' Rule)
Let A and B be events such that P(B) > 0. Then

P(A | B) = P(B | A) · P(A) / P(B).

Theorem (Bayes' Rule, conditional variants)
Let A, B, and C be events such that P(A ∩ C) · P(B ∩ C) > 0. Then

P(A | B ∩ C) = P(B ∩ C | A) · P(A) / P(B ∩ C) = P(B | A ∩ C) · P(A | C) / P(B | C).

Proof.
The first equality is Bayes' rule with the substitution B ⇝ B ∩ C. Intuitively, the second one is Bayes' theorem in the probability space conditioned by C; note that 'P(A | B) given C' is P(A | B ∩ C). Formally,

P(B | A ∩ C) · P(A | C) / P(B | C)
  = P(B | A ∩ C) · P(A | C) · P(C) / (P(B | C) · P(C))
  = P(B | A ∩ C) · P(A ∩ C) / P(B ∩ C)
  = P(A ∩ B ∩ C) / P(B ∩ C)
  = P(A | B ∩ C).

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 11 / 52


Bayes’ Rule with Law of Total Probability

Theorem (Bayes' Rule & Law of Total Probability)
Let A be an event with P(A) > 0 and {B_1, . . . , B_n} be an event space. Then for every B_j

P(B_j | A) = P(A | B_j) P(B_j) / P(A) = P(A | B_j) P(B_j) / ∑_{i=1}^{n} P(A | B_i) P(B_i).

Proof.
By the definition of conditional probability, P(B_j ∩ A) = P(A | B_j) P(B_j), and by the law of total probability, P(A) = ∑_{i=1}^{n} P(A | B_i) P(B_i). Hence

P(B_j | A) = P(B_j ∩ A) / P(A) = P(A | B_j) P(B_j) / ∑_{i=1}^{n} P(A | B_i) P(B_i).

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 12 / 52


Bayes’ Rule (Example)

P(B_j | A) = P(A | B_j) P(B_j) / P(A) = P(A | B_j) P(B_j) / ∑_{i=1}^{n} P(A | B_i) P(B_i).

Example
Production on machines A, B, C. Products can be defective.
P(A) = 0.25   P(Defect | A) = 0.05
P(B) = 0.35   P(Defect | B) = 0.04
P(C) = 0.40   P(Defect | C) = 0.02
Having a defective product, what is the probability that A produced it?

P(A | Defect) = P(Defect | A) P(A) / P(Defect) = P(Defect | A) P(A) / ∑_{X∈{A,B,C}} P(Defect | X) P(X)

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 13 / 52




Part II

Motivation and Definition

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 15 / 52


Random Variable - Motivation

• In many situations, outcomes of a random experiment are numbers.


• In other situations, we want to assign to each outcome a number (in
addition to probability).
• E.g., it may quantify the financial or energy cost of a particular outcome.
• We will define the random variable to develop methods for studying
random experiments with outcomes that can be described numerically.
• Almost all real probabilistic computation is done using random
variables.

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 16 / 52


Random Variable - Definition

A random variable is a rule that assigns a numerical value to each


outcome of an experiment.
Definition
A (discrete) random variable (náhodná proměnná) X on a sample space S is a function X : S → R that assigns a real number X(s) to each sample point s ∈ S.

• A random variable is a numerical property of sample points.


We define the image of a random variable X as the set
Im(X ) = {X (s) | s ∈ S}. Note that Im(X ) of a discrete variable X is a
countable set.

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 17 / 52


Random Variable - Definition

Definition
For a random variable X and a real number x, we define the inverse image of x to be the event
[X = x] = {s ∈ S | X (s) = x},
i.e. the set of all sample points from S to which X assigns the value x.

• Note that the sequence [X = x], x ∈ Im(X), forms an event space
(mutually exclusive and collectively exhaustive).
Following the definition of [X = x], we calculate its probability as

P([X = x]) = P({s ∈ S | X(s) = x}) = ∑_{s : X(s)=x} P(s).

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 18 / 52


Random Variable - Probability Mass Function

Definition
The probability mass function (pravděpodobnostní funkce nebo distribuce) (or probability distribution) of a random variable X is a function p_X : R → [0, 1] given by

p_X(x) = P(X = x) = ∑_{s : X(s)=x} P(s).

Note that p_X satisfies:
(p1) 0 ≤ p_X(x) ≤ 1 for all x ∈ R, and (p2) ∑_{x∈R} p_X(x) = 1.
A real-valued function p_X(x) defined on R is a probability mass function of some random variable if it satisfies properties (p1) and (p2).
When the random variable is clear from the context, we denote the probability distribution as p(x).
Do not confuse the probability distribution with the distribution function!
V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 19 / 52
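As a concrete illustration, the probability mass function of the sum of two fair dice can be computed by enumerating sample points. A minimal Python sketch:

from collections import Counter
from fractions import Fraction
from itertools import product

# X(s) = sum of two fair dice; p_X(x) sums P(s) over the sample points with X(s) = x
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
p_X = {x: Fraction(c, 36) for x, c in counts.items()}

print(p_X[7])                    # 1/6, the most likely sum
assert sum(p_X.values()) == 1    # property (p2)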
Notation

For some subset A ⊆ R, we would like to compute the probability of the


event {s | X (s) ∈ A}. We write

P({s | X (s) ∈ A}) = P([X ∈ A]) = P(X ∈ A).


When A is an interval, we write:
• P(a < X < b) instead of P(X ∈ (a, b)),
• P(a < X ≤ b) instead of P(X ∈ (a, b]), and
• P(X ≤ x) instead of P(X ∈ (−∞, x]).
We calculate the probability of A as

P(X ∈ A) = ∑_{x∈A} p_X(x).

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 20 / 52


Cumulative Distribution Function

Definition
The cumulative distribution function (probability distribution function, or simply distribution function; distribuční funkce) of a random variable X is a function F_X : R → [0, 1] given by

F_X(x) = P(X ≤ x) = ∑_{t≤x} p_X(t).

It follows that

P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a) = F (b) − F (a).

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 21 / 52
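For example, for one roll of a fair six-sided die, P(2 < X ≤ 5) = F(5) − F(2) = 5/6 − 2/6 = 1/2.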


Cumulative Distribution Function - Properties

(F1) 0 ≤ F (x) ≤ 1 for −∞ < x < ∞

(F2) F (x) is a monotone non-decreasing function of x,


i.e. x1 ≤ x2 =⇒ F (x1 ) ≤ F (x2 ).

(F3) lim_{x→−∞} F(x) = 0, and lim_{x→∞} F(x) = 1.


If the random variable X has a finite image, then there exist u, v ∈ R
such that F (x) = 0 for all x < u and F (x) = 1 for all x ≥ v .

(F4) F (x) is a right-continuous function,


i.e. for each x, F(x) = lim_{i→x⁺} F(i).

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 22 / 52


A note to (F2): if x_1 ≤ x_2 then (−∞, x_1] ⊆ (−∞, x_2], and we have P(X ≤ x_1) ≤ P(X ≤ x_2), giving F(x_1) ≤ F(x_2).


Cumulative Distribution Function

• Any function satisfying properties (F1)-(F4) is the distribution


function of some random variable.
• Note that P(X > x) = 1 − FX (x).
• In most cases, we simply forget the theoretical background (random
experiment, sample space, events,. . . ) and examine random variables,
probability distributions, and cumulative distribution functions.
• Often the initial information is “we have a random variable X with
the probability distribution pX (x)” or “we have a random variable X
with the cumulative distribution function FX (x)”.

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 23 / 52


Note: we can construct a probability space consistent with a given discrete random variable as follows.
Let S = {x | p_X(x) ≠ 0} ⊂ R, X(s) = s for s ∈ S, F = 2^S, and

P(A) = ∑_{x∈A} p_X(x).

Note that the elementary events of S are mutually exclusive and collectively exhaustive.
Part III

Examples of Probability Distributions

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 24 / 52


Examples of Probability Distributions (rozdělení)

In this part of the lecture, we introduce the most common probability distributions occurring in practical situations. In fact, we can always derive the distributions and all related results ourselves; however, it is useful to remember these distributions and the situations they describe, both as examples and to speed up our calculations. These probability distributions are so important that they have specific names and sometimes also notation.
V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 25 / 52
Discrete Uniform Probability Distribution

Discrete Uniform Probability Distribution (e.g. the value on a 6-sided die)

• Let X be a discrete random variable with a finite image {x_1, x_2, . . . , x_n}, and let us assign the same probability p_X(x_i) = p to all elements of the image.
• From the requirement that the probabilities must sum to 1 we have

  1 = ∑_{i=1}^{n} p_X(x_i) = ∑_{i=1}^{n} p = np,

  and the probability distribution is

  p_X(x_i) = 1/n for x_i ∈ Im(X), and p_X(x) = 0 otherwise.

  [Plot: p_X(x) has equal point masses of height 1/n at x_1, . . . , x_n.]

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 26 / 52


Discrete Uniform Probability Distribution (cont.)

• The cumulative distribution function (for x_1 < · · · < x_n) is

  F_X(x) = 0 for x < x_1;  F_X(x) = j/n for x_j ≤ x < x_{j+1} and j ∈ {1, . . . , n − 1};  F_X(x) = 1 for x_n ≤ x.

  [Plots: p_X(x), equal point masses of height 1/n; F_X(x), a staircase rising by 1/n at each x_i.]

Note that this concept cannot be extended to a random variable with a countably infinite image.

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 27 / 52


Constant Random Variable

• For c ∈ R, the function defined for all s ∈ S by X(s) = c is a discrete random variable with P(X = c) = 1.
• The probability distribution of this variable is

  p_X(x) = 1 if x = c, and p_X(x) = 0 otherwise.

• Such a random variable is called a constant random variable.
• The corresponding cumulative distribution function is

  F_X(x) = 0 for x < c, and F_X(x) = 1 for x ≥ c.

  [Plots: a single point mass of height 1 at c; F_X jumps from 0 to 1 at c.]

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 28 / 52


Indicator Random Variable
• Let us note that the event A and its complement partition the sample space S, i.e. A ∪ \overline{A} = S.
• The indicator of an event A is the random variable I_A defined by

  I_A(s) = 1 if s ∈ A, and I_A(s) = 0 if s ∉ A.

• The event A occurs if and only if I_A = 1.
• The probability distribution is

  p_{I_A}(0) = P(\overline{A}) = 1 − P(A),
  p_{I_A}(1) = P(A).

• The corresponding distribution function reads

  F_{I_A}(x) = 0 for x < 0;  1 − P(A) for 0 ≤ x < 1;  1 for x ≥ 1.

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 29 / 52


Bernoulli Probability Distribution

The Bernoulli probability distribution has one parameter p and models a single Bernoulli trial, or a (biased) coin toss.
• The only possible values of the random variable X are 0 and 1 (often called failure and success, respectively).
• The distribution is given by

  p_X(0) = P(X = 0) = q,
  p_X(1) = P(X = 1) = p = 1 − q.

• The corresponding cumulative distribution function is

  F(x) = 0 for x < 0;  q for 0 ≤ x < 1;  1 for x ≥ 1.

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 30 / 52


Geometric Probability Distribution

The geometric probability distribution has one parameter p and models the number of Bernoulli trials until the first 'success' occurs.
• The sample space consists of binary strings of the form S = {0^{i−1}1 | i ∈ N} (for simplicity we omit 0^ω, which is a null event).
• We define the random variable X : {0^{i−1}1 | i ∈ N} → R as X(0^{i−1}1) := i.
• X is the number of trials up to and including the first success.
• The outcome 0^{i−1}1 arises from a sequence of independent Bernoulli trials, thus we have

  p_X(i) = P(X = i) = (1 − p)^{i−1} p.   (2)

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 31 / 52


Geometric Probability Distribution
• We use the formula for the sum of geometric series to obtain (verify
property (p2))
∞ ∞
p p
∑ pX (i) = ∑ p(1 − p)i−1 = 1 − (1 − p) = p = 1.
i=1 i=1

• We require that p ̸= 0, since otherwise the probabilities do not sum to 1.


• A random variable with the image {1, 2 . . . } and the probability
distribution (2) is said to have the geometric distribution.
• The corresponding probability distribution function is defined by (for
t ≥ 0)
      FX (t) = P(X ≤ t) = ∑_{i=1}^{⌊t⌋} p(1 − p)^(i−1) = 1 − (1 − p)^⌊t⌋ .

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 32 / 52


Memoryless Property of Geometric Distribution

No matter how long we wait for a success, the distribution on time to


success is the same.
Theorem
Let X be a random variable with geometric probability distribution then
for every t, l ∈ N holds that

P(X > t) = P(X > t + l | X > l)

Proof.

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 33 / 52


Memoryless Property of Geometric Distribution (lecture notes, 2024-10-03)

Proof.
For t ∈ N we have

      P(X > t) = (1 − p)^⌊t⌋ = (1 − p)^t ,

i.e. t unsuccessful trials (an intersection of independent events, namely
the individual tosses). Hence

      P(X > t + l | X > l) = P(X > t + l ∧ X > l) / P(X > l)
                           = P(X > t + l) / P(X > l)
                           = (1 − p)^(t+l) / (1 − p)^l
                           = (1 − p)^t = P(X > t).
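The memoryless property can also be checked numerically. The following is
a minimal Python sketch (not part of the original slides; the parameters
p = 0.3, t = 2, l = 3 and the sample size are illustrative choices):

    import random

    random.seed(1)
    p, N = 0.3, 200_000

    def geometric(p):
        # number of Bernoulli(p) trials up to and including the first success
        x = 1
        while random.random() >= p:
            x += 1
        return x

    samples = [geometric(p) for _ in range(N)]
    t, l = 2, 3
    tail = sum(x > t for x in samples) / N
    conditioned = [x for x in samples if x > l]
    cond_tail = sum(x > t + l for x in conditioned) / len(conditioned)

    print(tail, cond_tail, (1 - p)**t)   # all three close to 0.49

Both empirical frequencies should agree with (1 − p)^t up to sampling noise.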
Binomial Probability Distribution
Binomial random variable X , denoted by B(n, p), has two parameters,
count n and probability p. It models the number of successes (outcomes
1) in n consecutive Bernoulli trials with probability p of success in each
trial.
• The domain of the random variable X is the set of all n-tuples of 0s and
  1s. The image is {0, 1, 2, . . . , n}.
• The probability distribution of X is

      pk = P(X = k) = pX (k) = (n choose k) p^k (1 − p)^(n−k)   for 0 ≤ k ≤ n
      pk = 0                                                    otherwise.

• The binomial distribution is often denoted as b(k; n, p) = pk and


represents the probability that there are k successes in a sequence of
n Bernoulli trials with probability of success p.
• For example, b(3; 5, 0.5) = (5 choose 3) (1/2)^3 (1/2)^2 = 0.3125


V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 34 / 52


Binomial Probability Distribution

After specifying the distribution of a random variable we should verify that


this function is a valid probability distribution, i.e. to verify properties (p1)
and (p2). While (p1) is usually clear (it is easy to see that the function is
nonnegative), the property (p2) may be not so straightforward and should
be verified explicitly.

We can apply the binomial model when the following conditions hold:
• Each trial has exactly two mutually exclusive outcomes.
• The probability of ’success’ is constant on each trial.
• The outcomes of successive trials are mutually independent.

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 35 / 52


Binomial Probability Distribution

• The name ’binomial’ comes from the equation verifying that the
probabilities sum to 1
      ∑_{i=0}^{n} pi = ∑_{i=0}^{n} (n choose i) p^i (1 − p)^(n−i) = [p + (1 − p)]^n = 1.

• The binomial distribution function, denoted by B(t; n, p) is given by

      B(t; n, p) = FX (t) = ∑_{i=0}^{⌊t⌋} (n choose i) p^i (1 − p)^(n−i) .

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 36 / 52
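Both properties (p1) and (p2) and the numeric example b(3; 5, 0.5) = 0.3125
can be verified directly; a small Python sketch (illustrative, not part of
the original slides):

    from math import comb

    def b(k, n, p):
        # binomial probability b(k; n, p) = (n choose k) p^k (1-p)^(n-k)
        return comb(n, k) * p**k * (1 - p)**(n - k)

    n, p = 5, 0.5
    probs = [b(k, n, p) for k in range(n + 1)]
    assert all(q >= 0 for q in probs)         # property (p1): nonnegativity
    assert abs(sum(probs) - 1.0) < 1e-12      # property (p2): sums to 1
    print(b(3, 5, 0.5))                       # 0.3125, as on the slide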


Part IV

Continuous Random Variables

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 37 / 52


Continuous Random Variable - Motivation

• Sometimes we study experiments with real values.


• Then the random variable could express real values,
e.g. temperature or portion of a day spent on FB.

• How can we specify probability distribution?


• How can we specify distribution function?

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 38 / 52


Continuous Random Variable - Definition

A random variable is a rule that assigns a numerical value to each


outcome of an experiment.
Definition
Given a probability space (S, F , P), a (continuous) random variablea X
on a sample space S is a function X : S → R such that

{s | X (s) ≤ r } ∈ F for each r ∈ R.


a (spojitá) náhodná proměnná

The additional condition is a technical requirement saying that the


function X must be measurable, i.e. the inverse images [X ≤ r ] are in F .
Hence, we can define the cumulative distribution function.

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 39 / 52


Cumulative Distribution Function

Definition
The cumulative distribution function (probability distribution
function or simply distribution functiona ) of a random variable X is a
function FX : R → [0, 1] given by

FX (x) = P(X ≤ x).


a distribučnı́ funkce

Note that for continuous random variables, the cumulative distribution


function is a continuous function.

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 40 / 52


Probability Density Function

What is the probability mass function? Are we able to express it?


Note that P(X = r ) = 0 for all r ∈ R.
But there should be some “more probable areas”.
Definition
Probability density functiona (or density) of a random variable X is a
derivative function of the cumulative distribution function.
a pravděpodobnostnı́ hustotnı́ funkce nebo hustota
Hence, FX (x) = ∫_{−∞}^{x} f (t) dt.

Note that the values of a density are not probabilities; a density ranges
over all non-negative real numbers and may exceed 1.

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 41 / 52


Continuous Probability Distributions

• Uniform Distribution      [figure: constant density fX (x) = 1/(b − a) on (a, b);
                             FX (x) rises linearly from 0 at a to 1 at b]
• Normal Distribution       [figure: bell-shaped density fX (x); S-shaped FX (x)]
• Exponential Distribution  [figure: decaying density fX (x); concave increasing FX (x)]
V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 42 / 52
Cumulative Distribution Function - Why?
Question: Do we need cumulative distribution function?
Suggestion: Let us use either probability function or density function!
Answer: The suggestion is wrong!
Cumulative distribution function is the only way how to specify probability
when the discrete and continuous behavior is combined.
See, e.g., the following example where P(X = 0) = 1/2 and the remaining
probability is uniformly distributed on the interval (a, b).

[figure: FX (x) jumps to 1/2 at x = 0 and then rises linearly to 1 on (a, b)]

This is neither a discrete nor a continuous random variable.
V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 43 / 52
Part V

Discrete Random Vectors

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 44 / 52


Discrete Random Vectors - Motivation

• Suppose we want to study relationship between two or more random


variables defined on a given sample space.
• E.g. we would like to know height, weight, waistline, and IQ of a
randomly chosen person.
• Hence, we define a random vector, i.e. a sequence of random
variables.

Definition
Let X1 , X2 , . . . , Xr be discrete random variables defined on a sample space
S.
The random vector X = (X1 , X2 , . . . , Xr ) is an r -dimensional vector-valued
function X : S → Rr with X(s) = x = (x1 , x2 , . . . , xr ), where

X1 (s) = x1 , X2 (s) = x2 , . . . , and Xr (s) = xr .

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 45 / 52


Discrete Random Vectors

Definition
The joint (or compound) probability distributiona of a random vector X
is defined to be

pX (x) = P(X = x) = P(X1 = x1 ∧ X2 = x2 ∧ · · · ∧ Xr = xr ).


a simultánnı́ nebo sdružená distribuce

Definition
The joint (or compound) distribution functiona of a random vector X is
defined to be

FX (x) = P(X1 ≤ x1 ∧ X2 ≤ x2 ∧ · · · ∧ Xr ≤ xr ).
a simultánnı́ nebo sdružená distribučnı́ funkce

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 46 / 52


Marginal Probability Distributions

When we are examining more than one random variable, the probability
distribution of a single variable is referred to as the marginal
probability distribution.
Definition
Let pX (x) be a joint probability distribution of a random variable
X = (X1 , X2 ). The marginal probability distributiona of X1 is defined as

      pX1 (x) = P(X1 = x) = ∑_{x2 ∈Im(X2 )} pX ((x, x2 )).

a marginálnı́ distribuce

Can we also calculate joint probability distribution from marginal


distributions?

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 47 / 52


Multinomial Probability Distribution

• A joint probability distribution. Consider n balls falling into r bins.


• Consider a sequence of n independent, identically distributed generalized
  Bernoulli trials with distribution p1 , p2 , . . . , pr on the r outcomes
  (i.e. ∑_{i=1}^{r} pi = 1).
• Let us define the random vector X = (X1 , X2 , . . . Xr ) such that Xi is
the number of trials that resulted in i th outcome, i.e. the number of
balls in the i th bin.
• Then the compound probability distribution of X is

      pX (n1 , n2 , . . . , nr ) = P(X1 = n1 , X2 = n2 , . . . , Xr = nr )
                                = (n choose n1 , n2 , . . . , nr ) p1^n1 p2^n2 . . . pr^nr   if ∑_{i=1}^{r} ni = n
                                = 0                                                          otherwise

  where (n choose n1 , n2 , . . . , nr ) = n! / (n1 ! n2 ! . . . nr !) is the multinomial coefficient.

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 48 / 52


Multinomial Probability Distribution

The marginal probability distribution of Xi may be computed by


 
      pXi (ni ) = ∑_{n1 ,...,nr : ∑_{j≠i} nj = n−ni} (n choose n1 , . . . , nr ) p1^n1 p2^n2 . . . pr^nr

                = ∑_{n1 ,...,nr : ∑_{j≠i} nj = n−ni} [ n! pi^ni / ((n − ni )! ni !) ] ·
                      [ (n − ni )! p1^n1 . . . pi−1^ni−1 pi+1^ni+1 . . . pr^nr / (n1 ! . . . ni−1 ! ni+1 ! . . . nr !) ]

                = (n choose ni ) pi^ni (p1 + · · · + pi−1 + pi+1 + · · · + pr )^(n−ni )

                = (n choose ni ) pi^ni (1 − pi )^(n−ni ) .

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 49 / 52
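The derivation can be cross-checked numerically by summing the joint
multinomial distribution over the other bins. A minimal Python sketch
(the parameters n = 6, r = 3 and the probabilities are illustrative):

    from math import comb, factorial

    def multinomial_pmf(counts, probs):
        # (n choose n1,...,nr) * p1^n1 * ... * pr^nr
        n = sum(counts)
        coef = factorial(n)
        for c in counts:
            coef //= factorial(c)
        result = float(coef)
        for c, p in zip(counts, probs):
            result *= p**c
        return result

    n, probs = 6, (0.2, 0.5, 0.3)
    ni = 2   # marginal of X1 at value 2

    marginal = sum(multinomial_pmf((ni, n2, n - ni - n2), probs)
                   for n2 in range(n - ni + 1))
    binomial = comb(n, ni) * probs[0]**ni * (1 - probs[0])**(n - ni)
    print(marginal, binomial)   # both 0.24576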


Part VI

Independent Random Variables

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 50 / 52


Independent Random Variables
Definition
Two discrete random variables X and Y are independent provided

pX ,Y (x, y ) = pX (x)pY (y ) for all x and y ,

i.e. their joint probability distribution is a product of the marginal


probability distributions.

• If X and Y are two independent random variables, then for any two
subsets A, B ⊆ R the events X ∈ A and Y ∈ B are independent:

P(X ∈ A ∧ Y ∈ B) = P(X ∈ A)P(Y ∈ B).

Theorem (an alternative definition)


Two random variables X , Y are independent if and only if

FX ,Y (x, y ) = FX (x)FY (y ).
V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 51 / 52
Independent Random Variables (lecture notes, 2024-10-03)

To see this:

      P(X ∈ A ∧ Y ∈ B) = ∑_{x∈A} ∑_{y∈B} pX,Y (x, y)
                       = ∑_{x∈A} ∑_{y∈B} pX (x) pY (y)
                       = ∑_{x∈A} pX (x) · ∑_{y∈B} pY (y)
                       = P(X ∈ A) P(Y ∈ B).

Let us assume that on a particular performance of a random experiment we
observe the event [Y = y ]. Since X and Y are independent, we get

      P(X = x | Y = y ) = P(X = x ∧ Y = y ) / pY (y ) = pX,Y (x, y ) / pY (y )
                        = pX (x) pY (y ) / pY (y ) = pX (x) = P(X = x).
Independent Random Variables

Definition
Let X1 , . . . , Xr be discrete random variables with probability distributions
pX1 , . . . , pXr . These random variables are pairwise independent iff

∀1 ≤ i < j ≤ r , ∀xi ∈ Im(Xi ), xj ∈ Im(Xj ), pXi ,Xj (xi , xj ) = pXi (xi )pXj (xj ).

Definition
Let X1 , . . . , Xr be discrete random variables with probability distributions
pX1 , . . . , pXr . These random variables are mutually independent iff
∀2 ≤ k ≤ r , ∀{i1 , . . . , ik } ⊆ {1, . . . , r },

pXi1 ,...,Xik (xi1 , . . . , xik ) = pXi1 (xi1 ) . . . pXik (xik ).

Note that the random variables are pairwise/mutually independent iff the
events [X1 = x1 ], . . . , [Xr = xr ] are so, for all x1 ∈ Im(X1 ), . . . , xr ∈ Im(Xr ).

V. Řehák: IV111 Probability in CS Lecture 2: Random Variables October 3, 2024 52 / 52
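A standard example separating the two notions (not from the slides) is two
fair bits together with their XOR: the three variables are pairwise
independent, but not mutually independent. A small Python sketch checking
both claims by enumeration:

    from itertools import product

    # uniform sample space of two fair bits; the third variable is their XOR
    space = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
    P = 1 / len(space)

    def prob(pred):
        return sum(P for s in space if pred(s))

    # pairwise independence: every pair of variables, every pair of values
    for i, j in ((0, 1), (0, 2), (1, 2)):
        for a, c in product((0, 1), repeat=2):
            joint = prob(lambda s: s[i] == a and s[j] == c)
            prod_ = prob(lambda s: s[i] == a) * prob(lambda s: s[j] == c)
            assert abs(joint - prod_) < 1e-12

    # mutual independence fails: P(X1=0, X2=0, X3=1) = 0, not 1/8
    print(prob(lambda s: s == (0, 0, 1)))   # 0.0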


Lecture 3: Expectation and Markov Inequality

Vojtěch Řehák
based on slides of Jan Bouda

Faculty of Informatics, Masaryk University

October 10, 2024

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 1 / 47
Part I

Revision

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 2 / 47
Few Notes on Conditional Probability
Do not mix apples and oranges:
      P(A|B) + P(C )        (crossed out on the slide: such a sum is meaningless)

Conditioning is a parameter that changes (scales) the probability function.
To combine probabilities with different conditions, we need to rescale them first:

      P(A|B) · P(B) + P(C )

Conditional probability P(A|B) under the additional condition X :

      P(A|B ∩ X )

When X is known, we have a conditional probability function:

      P(Ā|X ) = 1 − P(A|X )

Bayes theorem:   P(A|B ∩ X ) = P(B|A ∩ X ) · P(A|X ) / P(B|X )

Law of T.P.:   P(A|X ) = ∑_i P(A|Bi ∩ X ) · P(Bi |X )
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 3 / 47
Few Notes on Conditional Independence

Let us have a six-sided die, i.e. the sample space is {1, 2, 3, 4, 5, 6} and the
probability mass function is uniform.
Let A = {1, 2, 5}, B = {2, 4, 6}, X = {1, 2, 3, 4}.
Note that A and B are not independent, as

      P(A ∩ B) = 1/6 ≠ 1/2 · 1/2 = P(A) · P(B).

Given X , A and B are independent, as

      P(A ∩ B|X ) = 1/4 = 1/2 · 1/2 = P(A|X ) · P(B|X ).

Given ¬X , A and B are not independent, as

      P(A ∩ B|¬X ) = 0 ≠ 1/2 · 1/2 = P(A|¬X ) · P(B|¬X ).

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 4 / 47
Few Notes on Conditional Independence (lecture notes, 2024-10-16)

      P(A ∩ B) = P({2}) = 1/6
      P(A) · P(B) = 1/2 · 1/2 = 1/4

      P(A ∩ B|X ) = P({2}|{1, 2, 3, 4}) = 1/4
      P(A|X ) · P(B|X ) = P({1, 2}|{1, 2, 3, 4}) · P({2, 4}|{1, 2, 3, 4}) = 1/2 · 1/2 = 1/4

      P(A ∩ B|¬X ) = P(∅|{5, 6}) = 0
      P(A|¬X ) · P(B|¬X ) = P({5}|{5, 6}) · P({6}|{5, 6}) = 1/2 · 1/2 = 1/4
Discrete Random Variable - Revision

Definition
A (discrete) random variable X on a sample space S is a function
X : S → R that assigns a real number X (s) to each sample point s ∈ S.

Definition
Probability mass function of X is a function pX : R → [0, 1] given by

pX (x) = P(X = x) = ∑ P(s).


s:X (s)=x

Definition
Cumulative distribution function of X is a function FX : R → [0, 1] given
by
FX (x) = P(X ≤ x) = ∑ pX (t).
t≤x

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 5 / 47
Continuous Random Variable - Revision
Definition
A (continuous) random variable X on a sample space S is a function
X : S → R that assigns a real number X (s) to each sample point s ∈ S
such that
{s | X (s) ≤ r } ∈ F for each r ∈ R.

Definition
Cumulative distribution function of X is a function FX : R → [0, 1] given
by FX (x) = P(X ≤ x).

Definition
Probability density function fX of X is the derivative function of FX .

Z x
P(X ≤ x) = FX (x) = fX (t)dt
−∞
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 6 / 47
Part II

Discrete Random Vectors

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 7 / 47
Discrete Random Vectors - Motivation

• Suppose we want to study relationship between two or more random


variables defined on a given sample space.
• E.g. we would like to know height, weight, waistline, and IQ of a
randomly chosen person.
• Hence, we define a random vector, i.e. a sequence of random
variables.

Definition
Let X1 , X2 , . . . , Xr be discrete random variables defined on a sample space
S.
The random vector X = (X1 , X2 , . . . , Xr ) is an r -dimensional vector-valued
function X : S → Rr with X(s) = x = (x1 , x2 , . . . , xr ), where

X1 (s) = x1 , X2 (s) = x2 , . . . , and Xr (s) = xr .

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 8 / 47
Discrete Random Vectors

Definition
The joint (or compound) probability distributiona of a random vector X
is defined to be

pX (x) = P(X = x) = P(X1 = x1 ∧ X2 = x2 ∧ · · · ∧ Xr = xr ).


a simultánnı́ nebo sdružená distribuce

Definition
The joint (or compound) distribution functiona of a random vector X is
defined to be

FX (x) = P(X1 ≤ x1 ∧ X2 ≤ x2 ∧ · · · ∧ Xr ≤ xr ).
a simultánnı́ nebo sdružená distribučnı́ funkce

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 9 / 47
Marginal Probability Distributions

When we are examining more than one random variable, the probability
distribution of a single variable is referred to as the marginal
probability distribution.
Definition
Let pX (x) be a joint probability distribution of a random variable
X = (X1 , X2 ). The marginal probability distributiona of X1 is defined as

      pX1 (x) = P(X1 = x) = ∑_{x2 ∈Im(X2 )} pX ((x, x2 )).

a marginálnı́ distribuce

Can we also calculate joint probability distribution from marginal


distributions?

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 10 / 47
Multinomial Probability Distribution

• A joint probability distribution. Consider n balls falling into r bins.


• Consider a sequence of n independent, identically distributed generalized
  Bernoulli trials with distribution p1 , p2 , . . . , pr on the r outcomes
  (i.e. ∑_{i=1}^{r} pi = 1).
• Let us define the random vector X = (X1 , X2 , . . . Xr ) such that Xi is
the number of trials that resulted in i th outcome, i.e. the number of
balls in the i th bin.
• Then the compound probability distribution of X is

      pX (n1 , n2 , . . . , nr ) = P(X1 = n1 , X2 = n2 , . . . , Xr = nr )
                                = (n choose n1 , n2 , . . . , nr ) p1^n1 p2^n2 . . . pr^nr   if ∑_{i=1}^{r} ni = n
                                = 0                                                          otherwise

  where (n choose n1 , n2 , . . . , nr ) = n! / (n1 ! n2 ! . . . nr !) is the multinomial coefficient.

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 11 / 47
Multinomial Probability Distribution

The marginal probability distribution of Xi may be computed by


 
      pXi (ni ) = ∑_{n1 ,...,nr : ∑_{j≠i} nj = n−ni} (n choose n1 , . . . , nr ) p1^n1 p2^n2 . . . pr^nr

                = ∑_{n1 ,...,nr : ∑_{j≠i} nj = n−ni} [ n! pi^ni / ((n − ni )! ni !) ] ·
                      [ (n − ni )! p1^n1 . . . pi−1^ni−1 pi+1^ni+1 . . . pr^nr / (n1 ! . . . ni−1 ! ni+1 ! . . . nr !) ]

                = (n choose ni ) pi^ni (p1 + · · · + pi−1 + pi+1 + · · · + pr )^(n−ni )

                = (n choose ni ) pi^ni (1 − pi )^(n−ni ) .

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 12 / 47
Part III

Independent Random Variables

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 13 / 47
Independent Random Variables
Definition
Two discrete random variables X and Y are independent provided

pX ,Y (x, y ) = pX (x)pY (y ) for all x and y ,

i.e. their joint probability distribution is a product of the marginal


probability distributions.

• If X and Y are two independent random variables, then for any two
subsets A, B ⊆ R the events X ∈ A and Y ∈ B are independent:

P(X ∈ A ∧ Y ∈ B) = P(X ∈ A)P(Y ∈ B).

Theorem (an alternative definition)


Two random variables X , Y are independent if and only if

FX ,Y (x, y ) = FX (x)FY (y ).
V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 14 / 47
Independent Random Variables (lecture notes, 2024-10-16)

To see this:

      P(X ∈ A ∧ Y ∈ B) = ∑_{x∈A} ∑_{y∈B} pX,Y (x, y)
                       = ∑_{x∈A} ∑_{y∈B} pX (x) pY (y)
                       = ∑_{x∈A} pX (x) · ∑_{y∈B} pY (y)
                       = P(X ∈ A) P(Y ∈ B).

Let us assume that on a particular performance of a random experiment we
observe the event [Y = y ]. Since X and Y are independent, we get

      P(X = x | Y = y ) = P(X = x ∧ Y = y ) / pY (y ) = pX,Y (x, y ) / pY (y )
                        = pX (x) pY (y ) / pY (y ) = pX (x) = P(X = x).
Independent Random Variables

Definition
Let X1 , . . . , Xr be discrete random variables with probability distributions
pX1 , . . . , pXr . These random variables are pairwise independent iff

∀1 ≤ i < j ≤ r , ∀xi ∈ Im(Xi ), xj ∈ Im(Xj ), pXi ,Xj (xi , xj ) = pXi (xi )pXj (xj ).

Definition
Let X1 , . . . , Xr be discrete random variables with probability distributions
pX1 , . . . , pXr . These random variables are mutually independent iff
∀2 ≤ k ≤ r , ∀{i1 , . . . , ik } ⊆ {1, . . . , r },

pXi1 ,...,Xik (xi1 , . . . , xik ) = pXi1 (xi1 ) . . . pXik (xik ).

Note that the random variables are pairwise/mutually independent iff the
events [X1 = x1 ], . . . , [Xr = xr ] are so, for all x1 ∈ Im(X1 ), . . . , xr ∈ Im(Xr ).

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 15 / 47
Lecture 3: Expectation and Markov Inequality

Vojtěch Řehák
based on slides of Jan Bouda

Faculty of Informatics, Masaryk University

October 10, 2024

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 16 / 47
Part IV

Expectation of Random Variables

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 17 / 47
Expectation
• Often we need a shorter description than PMF or CDF - single
number, or a few numbers.
• First such characteristic describing a random variable is the
expectation, also known as the mean value.

Definition
Expectationa of a random variable X is defined as

      E (X ) = ∑_{x∈Im(X)} x · P(X = x)

provided the sum is absolutely convergent. In case the sum is convergent,


but not absolutely convergent, we say that no finite expectation exists. In
case the sum is not convergent, the expectation has no meaning.
a střednı́ hodnota

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 18 / 47
Expectation (lecture notes, 2024-10-16)

• E (X ) is a weighted average.
• For a continuous variable,

      E (X ) = ∫ x · fX (x) dx
Discrete Uniform Probability Distribution

Discrete Uniform Probability Distribution (e.g. value on 6-sided die)


• Let Im(X ) = {x1 , x2 , . . . , xn }
• The probability mass function

      pX (x) = P(X = x) = 1/n if x ∈ Im(X ), and pX (x) = 0 otherwise.

• The expectation of X is

      E (X ) = ∑_{x∈Im(X)} x · P(X = x) = ∑_{x∈Im(X)} x · (1/n) = (1/n) ∑_{i=1}^{n} xi

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 19 / 47
Bernoulli Probability Distribution

Bernoulli probability distribution has one parameter p


and models single Bernoulli trial or (biased) coin toss.
• The distribution is given by

pX (0) =P(X = 0) = 1 − p
pX (1) =P(X = 1) = p

• The expectation is

      E (X ) = ∑_{x∈Im(X)} x · P(X = x) = 0 · (1 − p) + 1 · p = p

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 20 / 47
Indicator Random Variable
Indicator random variable of an event A.
• The indicator of an event A is the random variable IA defined by

      IA (s) = 1 if s ∈ A and IA (s) = 0 if s ∉ A

• The probability distribution is

pIA (0) = P(A) = 1 − P(A)


pIA (1) = P(A).
• The distribution function reads

0
 for x < 0
FIA (x) = P(A) for 0 ≤ x < 1

1 for x ≥ 1.

• It is a Bernoulli distribution, and so E (IA ) = P(A).


V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 21 / 47
Constant Random Variable

Constant random variable


• For c ∈ R the function defined for all s ∈ S by X (s) = c.
• The probability distribution of this variable is
      pX (x) = 1 if x = c, and pX (x) = 0 otherwise.

• The cumulative distribution function is

      FX (x) = 0 for x < c, and FX (x) = 1 for x ≥ c.

• The expectation
E (X ) = 1 · c = c

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 22 / 47
Geometric Probability Distribution
Geometric probability distribution has one parameter p and models the
number of Bernoulli trials until the first ’success’ occurs.
• The probability function

      p(x) = P(X = x) = (1 − p)^(x−1) p.

• The probability distribution function (for x ≥ 0) is

      F (x) = P(X ≤ x) = 1 − (1 − p)^⌊x⌋ .

• The expectation is

E (X ) = ∑x∈Im(X ) x · P(X = x) =

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 23 / 47
Geometric Probability Distribution (lecture notes, 2024-10-16)

Expectation:

      E (X ) = ∑_{x∈Im(X)} x · P(X = x) = ∑_{x≥1} x (1 − p)^(x−1) p

Multiply by (1 − p):

      (1 − p) E (X ) = ∑_{x≥1} x (1 − p)^x p

Take the difference of the two equations:

      E (X ) − (1 − p) E (X ) = ∑_{x≥1} ( x (1 − p)^(x−1) p − x (1 − p)^x p )
      p E (X ) = p ∑_{x≥1} ( x (1 − p)^(x−1) − x (1 − p)^x )
      E (X ) = ∑_{x≥1} ( x (1 − p)^(x−1) − x (1 − p)^x )
             = ( 1 + 2(1 − p) + 3(1 − p)^2 + . . . )
               − ( (1 − p) + 2(1 − p)^2 + 3(1 − p)^3 + . . . )

hence

      E (X ) = 1 + (1 − p) + (1 − p)^2 + . . . = ∑_{x≥0} (1 − p)^x = 1 / (1 − (1 − p)) = 1/p
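As a quick numeric check of E (X ) = 1/p (an illustrative Python sketch
with p = 0.25), the partial sums of the defining series approach 4:

    p = 0.25
    partial = sum(x * (1 - p)**(x - 1) * p for x in range(1, 2000))
    print(partial, 1 / p)   # 3.999... vs 4.0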
Binomial Probability Distribution

Binomial random variable X , denoted by B(n, p).


• The probability distribution is
      pX (x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x)   for 0 ≤ x ≤ n
      pX (x) = 0                                           otherwise.

• The expectation is

E (X ) = ∑x∈Im(X ) x · P(X = x) =

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 24 / 47
Binomial Probability Distribution (lecture notes, 2024-10-16)

(n choose x) is read aloud as "n choose x".

      E (X ) = ∑_{x∈Im(X)} x P(X = x)
             = ∑_{x=0}^{n} x (n choose x) p^x (1 − p)^(n−x)
             = ∑_{x=1}^{n} x (n choose x) p^x (1 − p)^(n−x)
             = ∑_{x=1}^{n} x · n! / (x! (n − x)!) · p^x (1 − p)^(n−x)
             = ∑_{x=1}^{n} n · (n − 1)! / ((x − 1)! (n − x)!) · p^x (1 − p)^(n−x)
             = ∑_{x=1}^{n} n (n−1 choose x−1) p^x (1 − p)^(n−x)
             = np ∑_{x=1}^{n} (n−1 choose x−1) p^(x−1) (1 − p)^((n−1)−(x−1))
             = np ∑_{y=0}^{n−1} (n−1 choose y) p^y (1 − p)^((n−1)−y)
             = np (p + 1 − p)^(n−1)
             = np

Alternatively: X is a sum of n independent Bernoulli trials, so E (X ) = np.
To use this easier way, we need to define independence of random variables;
hence, we need more random variables.
Part V

Functions of Random Variables

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 25 / 47
Sum of Independent Random Variables

• Let X and Y be independent random variables, and let Z = X + Y be a
  new random variable defined as the sum of X and Y . If X and Y take
  non-negative integer values, the probability distribution of Z is
      pZ (t) = pX+Y (t) = ∑_{x=0}^{t} pX (x) pY (t − x).

Idea
Observe that:
• Sum of two Bernoulli distributions with probability of success p is a
binomial distribution B(2, p).
• Sum of two binomial distributions B(n1 , p) and B(n2 , p) is a binomial
distribution B(n1 + n2 , p).

What is a sum of two geometric distributions?


V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 26 / 47
Sum of Independent Random Variables (lecture notes, 2024-10-16)

Bernoulli + Bernoulli is Binomial
Binomial + Binomial is Binomial
Geometric + Geometric is Negative Binomial
Negative Binomial + Negative Binomial is Negative Binomial
Poisson + Poisson is Poisson
Exponential + Exponential is Erlang
Erlang + Erlang is Erlang
Negative Binomial Distribution

Negative binomial distribution has two parameters p and r and


expresses the number of Bernoulli trials to the r th success. Hence, it is a
convolution of r geometric distributions.
• The probability function
      pX (x) = P(X = x) = (x−1 choose r−1) p^r (1 − p)^(x−r)   for x ≥ r
      pX (x) = 0                                               otherwise.

• The expectation is

E (X ) = ∑x∈Im(X ) xP(X = x) =

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 27 / 47
Negative Binomial Distribution (lecture notes, 2024-10-16)

The probability distribution: p^r for the r successes, (1 − p)^(x−r) for the
failures; the last trial is a success, hence (x−1 choose r−1) counts the
placements of the remaining r − 1 successes among the first x − 1 trials.

Note that x (x−1 choose r−1) = x (x − 1)! / ((r − 1)! (x − r)!)
        = x! / ((r − 1)! (x − r)!) = r · x! / (r! (x − r)!) = r (x choose r).

      E (X ) = ∑_{x≥r} x (x−1 choose r−1) p^r (1 − p)^(x−r)
             = ∑_{x≥r} r (x choose r) p^r (1 − p)^(x−r)
             = (r/p) ∑_{x≥r} (x choose r) p^(r+1) (1 − p)^(x−r) .

Now the sum is a sum of negative binomial probabilities: substitute
y = x + 1 and s = r + 1 to get

      E (X ) = (r/p) ∑_{y≥s} (y−1 choose s−1) p^s (1 − p)^(y−s) .

As the sum is a probability distribution, it equals 1, so E (X ) = r/p.

Why is the expectation r/p? X is a sum of r independent geometric
variables, each with mean 1/p. Is it true that the expectation of a
convolution is always the sum of expectations?
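Since the negative binomial is a convolution of r geometric distributions,
the mean r/p can also be checked by simulation; a minimal Python sketch
(the parameters p = 0.4, r = 3 are illustrative):

    import random

    random.seed(2)
    p, r, N = 0.4, 3, 100_000

    def geometric(p):
        x = 1
        while random.random() >= p:
            x += 1
        return x

    # trials up to the r-th success = sum of r independent geometric variables
    samples = [sum(geometric(p) for _ in range(r)) for _ in range(N)]
    print(sum(samples) / N, r / p)   # ~7.5 vs 7.5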
Linearity of Expectation

Theorem (Expectation of sum)


Let X and Y be random variables. Then

E (X + Y ) = E (X ) + E (Y ).

Proof.

E (X + Y ) =

= E (X ) + E (Y ).

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 28 / 47
Linearity of Expectation (lecture notes, 2024-10-16)

      E (X + Y ) = ∑_{x∈Im(X)} ∑_{y∈Im(Y)} (x + y) P(X = x, Y = y)
                 = ∑_x ∑_y x · P(X = x, Y = y) + ∑_y ∑_x y · P(X = x, Y = y)
                 = ∑_x x ∑_y P(X = x, Y = y) + ∑_y y ∑_x P(X = x, Y = y)
                 = ∑_{x∈Im(X)} x P(X = x) + ∑_{y∈Im(Y)} y P(Y = y)
                 = E (X ) + E (Y ).

Using this, the expected value of the binomial distribution is easy to
prove: the binomial distribution is a convolution of n Bernoulli trials.
Linearity of Expectation (lecture notes, 2024-10-16, continued)

Another approach proving the same more formally: let Z = X + Y ; thanks to
P(Z = z) = ∑_{s : Z(s)=z} P(s), we have

      E (Z ) = ∑_{z∈Im(Z)} z · P(Z = z) = ∑_{z∈Im(Z)} z · ( ∑_{s : Z(s)=z} P(s) )
             = ∑_{z∈Im(Z)} ∑_{s : Z(s)=z} z · P(s) = ∑_{s∈S} Z (s) · P(s)

Hence,

      E (X + Y ) = ∑_{s∈S} (X + Y )(s) · P(s) = ∑_{s∈S} (X (s) + Y (s)) · P(s)
                 = ∑_{s∈S} X (s) · P(s) + ∑_{s∈S} Y (s) · P(s) = E (X ) + E (Y )
Linearity of Expectation

Theorem (Multiplication by a constant)


Let X be random variable and c be a real number. Then

E (cX ) = cE (X ).

In general, the above theorem implies the following result for any linear
combination of n random variables, i.e.
Corollary (Linearity of expectation)
Let X1 , X2 , . . . Xn be random variables and c1 , c2 , . . . cn ∈ R constants. Then
      E ( ∑_{i=1}^{n} ci Xi ) = ∑_{i=1}^{n} ci E (Xi ).

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 29 / 47
Expectation of Independent Random Variables

Theorem
If X and Y are independent random variables, then

E (XY ) = E (X )E (Y ).

Proof.

E (XY ) =

= E (X )E (Y ).

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 30 / 47
Expectation of Independent Random Variables (lecture notes, 2024-10-16)

      E (XY ) = ∑_{x∈Im(X)} ∑_{y∈Im(Y)} x · y · P(X = x, Y = y)      (due to independence)
              = ∑_x ∑_y x · y · P(X = x) · P(Y = y)
              = ∑_x x · P(X = x) · ∑_y y · P(Y = y)
              = E (X ) · E (Y ).
Observation

Theorem
If Im(X ) ⊆ N then E (X ) = ∑_{i=1}^{∞} P(X ≥ i).

Proof.

E (X ) =
=
=
=

= ∑_{i=1}^{∞} P(X ≥ i).

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 31 / 47
Observation (lecture notes, 2024-10-16)

      E (X ) = ∑_{x∈Im(X)} x · P(X = x)
             = ∑_{x=1}^{∞} x · P(X = x)
             = ∑_{x=1}^{∞} ∑_{i=1}^{x} P(X = x)      (the summands are non-negative)
             = ∑_{i=1}^{∞} ∑_{x=i}^{∞} P(X = x)      (we can swap the sums due to the (Fubini-)Tonelli theorem)
             = ∑_{i=1}^{∞} P(X ≥ i).

For non-negative continuous variables, E (X ) = ∫_0^∞ P(X ≥ t) dt.
Part VI

Conditional Expectation

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 32 / 47
Conditional probability

Using the derivation of conditional probability of two events, we can derive


conditional probability of (a pair of) random variables.

Definition
The conditional probability distribution of random variable X given
random variable Y (where pX ,Y (x, y ) is their joint distribution) is

      pX|Y (x|y ) = P(X = x|Y = y ) = P(X = x, Y = y ) / P(Y = y )
                  = pX,Y (x, y ) / pY (y )

provided the marginal probability pY (y ) ≠ 0.

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 33 / 47
Conditional expectation
We may consider Y |(X = x) to be a new random variable that is given by
the conditional probability distribution pY |X . Therefore, we can define its
mean.
Definition
The conditional expectation of Y given X = x is defined

      E (Y |X = x) = ∑_y y P(Y = y |X = x) = ∑_y y pY|X (y |x).

We can derive the expectation of Y from the conditional expectations.

Theorem (of total expectation)


Let X and Y be random variables, then

      E (Y ) = ∑_x E (Y |X = x) pX (x).

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 34 / 47
Conditional expectation (lecture notes, 2024-10-16)

      ∑_x E (Y |X = x) pX (x) = ∑_x ( ∑_y y · P(Y = y |X = x) ) · P(X = x)
                              = ∑_x ∑_y y · P(Y = y |X = x) · P(X = x)
                              = ∑_y y · ∑_x P(Y = y |X = x) · P(X = x)
                              = ∑_y y · P(Y = y )
                              = E (Y )
Example: Random sums

Example
Let N, X1 , X2 , . . . be mutually independent random variables. Let us
suppose that X1 , X2 , . . . have identical probability distribution pX . Hence,
they have common mean E (X ). We also know the mean value E (N).
Compute the expectation of the random variable T defined as

T = X1 + X2 + · · · + XN .

For a fixed value N = n, we can derive the conditional expectation of T by


      E (T |N = n) = E (X1 + X2 + · · · + Xn ) = ∑_{i=1}^{n} E (Xi ) = n E (X ).    (1)

Using the theorem of total expectation, we get

      E (T ) = ∑_n E (T |N = n) pN (n) = ∑_n n E (X ) pN (n) = E (X ) ∑_n n pN (n) = E (X ) E (N ).

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 35 / 47
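The identity E (T ) = E (X ) E (N ) can be checked by simulation; a minimal
Python sketch (the choices below are illustrative: N geometric with
parameter 1/2 and Xi Bernoulli with parameter 0.2, so E (T ) = 0.2 · 2 = 0.4):

    import random

    random.seed(3)
    runs, pN, pX = 100_000, 0.5, 0.2

    def geometric(p):
        x = 1
        while random.random() >= p:
            x += 1
        return x

    total = 0
    for _ in range(runs):
        n = geometric(pN)                                      # draw N
        total += sum(random.random() < pX for _ in range(n))   # X1 + ... + XN

    print(total / runs)   # ~0.4 = E(X) E(N)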
Part VII

Markov Inequality

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 36 / 47
Markov Inequality

It is important to derive as much information as possible even from a


partial description of random variable. The mean value already gives more
information than one might expect, as captured by Markov inequality.

Theorem (Markov inequality)


Let X be a nonnegative random variable with finite mean value E (X ).
Then for all t > 0 it holds that
      P(X ≥ t) ≤ E (X ) / t .

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 37 / 47
Markov Inequality (lecture notes, 2024-10-16)

1) Draw the function E (X )/t, where E (X ) is a constant. Explain why the
   case t < E (X ) is not important: there the bound exceeds 1 and tells us
   nothing about a probability.
2) For t = E (X ), 2E (X ), 3E (X ), show distributions with P(X ≥ t) = E (X )/t.
3) Show and explain the random variable Yt .
Markov Inequality

Proof.
Let t > 0. We define a random variable Yt (for fixed t) as
      Yt = 0 if X < t, and Yt = t if X ≥ t.

Then Yt is a discrete random variable with probability distribution


pYt (0) = P(X < t), pYt (t) = P(X ≥ t). We have

E (Yt ) = tP(X ≥ t).

The observation X ≥ Yt gives

E (X ) ≥ E (Yt ) = tP(X ≥ t),

what is the Markov inequality.

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 38 / 47
Markov Inequality: Example

Example
Bound the probability of obtaining more than 75 heads in a sequence of
100 fair coin flips.

Let X1 , X2 , . . . , X100 be random variables expressing


      Xi = 1 if the i-th coin flip is head, and Xi = 0 otherwise.

Hence, let X = ∑_{i=1}^{100} Xi be the number of heads in 100 coin flips. Note that
E (Xi ) = 1/2, and E (X ) = 100/2 = 50.
Using the Markov inequality, we get

      P(X ≥ 75) ≤ E (X ) / 75 = 50/75 = 2/3 .

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 39 / 47
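For comparison, the exact tail probability is tiny, so the Markov bound of
2/3 is valid but very loose. A short Python check using exact integer
arithmetic (illustrative, not part of the slides):

    from math import comb

    # exact P(X >= 75) for X ~ B(100, 1/2)
    exact = sum(comb(100, k) for k in range(75, 101)) / 2**100
    print(exact)   # on the order of 1e-07, far below the Markov bound 2/3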
Markov Inequality: Example - parameterized

Example
Bound the probability of obtaining more than 34 n heads in a sequence of n
fair coin flips.

Let X1 , X2 , . . . , Xn be random variables expressing


      Xi = 1 if the i-th coin flip is head, and Xi = 0 otherwise.

Hence, let Yn = ∑_{i=1}^{n} Xi be the number of heads in n coin flips. Note that
E (Xi ) = 1/2, and E (Yn ) = n/2.
Using the Markov inequality, we get

      P(Yn ≥ (3/4) n) ≤ E (Yn ) / ((3/4) n) = (n/2) / (3n/4) = 2/3 .

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 40 / 47
Markov Inequality: Example - parameterized

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 41 / 47
Expectation is not everything

Example (St. Petersburg Lottery)


A casino offers a game of chance for a single player in which a fair coin is
tossed at each stage. The pot starts at $1 and is doubled every time a
head appears. The first time a tail appears, the game ends and the player
wins whatever is in the pot. What would be a fair price to pay for entering
the game?

The expected value of winnings of one game is:

      E (X ) = (1/2)·1 + (1/4)·2 + (1/8)·4 + · · · = ∑_{i=1}^{∞} (1/2)^i 2^(i−1) = ∑_{i=1}^{∞} 1/2 = ∞

But
• P(X ≥ 32) = 1 − ∑_{i=1}^{5} (1/2)^i = (1/2)^5 = 0.03125
• P(X ≥ 1024) = (1/2)^10 = 0.000976562

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 42 / 47
Expectation is not everything (lecture notes, 2024-10-16)

The paradox is named from Daniel Bernoulli's presentation of the problem
and his solution, published in 1738 in the Commentaries of the Imperial
Academy of Science of Saint Petersburg. However, the problem was invented
by Daniel's cousin Nicolas Bernoulli, who first stated it in a letter to
Pierre Raymond de Montmort of September 9, 1713.

Expected values for bounded versions of the Saint Petersburg lottery:

Banker              Bankroll             Expected value of lottery
Friendly game       $100                 $4.28
Millionaire         $1,000,000           $10.95
Billionaire         $1,000,000,000       $15.93
Bill Gates (2008)   $58,000,000,000      $18.84
U.S. GDP (2007)     $13.8 trillion       $22.78
World GDP (2007)    $54.3 trillion       $23.77
Googolaire          $10^100              $166.50
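The first rows of the table can be reproduced directly; a short Python
sketch (assuming the banker simply caps the payout at the bankroll):

    def bounded_lottery_value(bankroll, k_max=400):
        # first tail on toss k has probability (1/2)^k and pays min(2^(k-1), bankroll)
        return sum(0.5**k * min(2**(k - 1), bankroll) for k in range(1, k_max))

    for w in (100, 10**6, 10**9):
        print(w, round(bounded_lottery_value(w), 2))   # 4.28, 10.95, 15.93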
Expectation is not everything

Example
Program A terminates in 1 hour on 99% of inputs and in 1000 hours on
1% of inputs.
Program B terminates in 10.99 hours on every input.
Which one would you prefer?

The expected values of termination times are the same.


      E (XA ) = 1 · 0.99 + 1000 · 0.01 = 10.99
      E (XB ) = 10.99

But the programs differ in how much they deviate from this average.

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 43 / 47
Part VIII

Moments and Deviations

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 44 / 47
Moments

Definition
The k-th moment of a random variable X is defined as E (X^k ).

• Note that for k = 1 we get the expectation of X .

Theorem
If X and Y are random variables with matching corresponding moments of
all orders, i.e. ∀k E (X^k ) = E (Y^k ), then X and Y have the same
distributions.

• Usually we center the expected value to 0 – we use moments of


Φ(X ) = X − E (X ).
• I.e. we define the k-th central moment of X as

      µk = E ( [X − E (X )]^k ) .

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 45 / 47
Variance - the second central moment

Definition
The second central moment is known as the variancea of X and defined as

      µ2 = E ( [X − E (X )]^2 ) .


The variance is usually denoted as Var (X ) or σX2 .


The square root of Var (X ) is known as the standard deviationb σX .
a rozptyl
b směrodatná odchylka

If variance is small, then X takes values close to E (X ) with high


probability. If the variance is large, then the distribution is more ’diffused’.

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 46 / 47
Variance

Theorem
Let Var (X ) be the variance of the random variable X . Then

      Var (X ) = E (X^2 ) − [E (X )]^2 .

Proof.

Var (X ) =

= E (X^2 ) − [E (X )]^2 .

V. Řehák: IV111 Probability in CS Lecture 3: Expectation and Markov Ineq. October 10, 2024 47 / 47
Variance (lecture notes, 2024-10-16)

      Var (X ) = E ( [X − E (X )]^2 ) = E ( X^2 − 2X E (X ) + [E (X )]^2 )
               = E (X^2 ) − E ( 2X E (X ) ) + [E (X )]^2
               = E (X^2 ) − 2[E (X )]^2 + [E (X )]^2
               = E (X^2 ) − [E (X )]^2 .
Lecture 4: Variance and Chebyshev Inequality

Vojtěch Řehák
based on slides of Jan Bouda

Faculty of Informatics, Masaryk University

October 17, 2024

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 1 / 36
Part I

Revision + something extra

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 2 / 36
Expectation

Theorem (Linearity of expectation)


Let X and Y be random variables. Then

E (aX + bY + c) = aE (X ) + bE (Y ) + c.

Theorem (Jensen’s inequality)


Let X be a random variable. If f is a convex function, then

E [f (X )] ≥ f (E (X )).

Theorem
If X and Y are independent random variables, then

E (XY ) = E (X )E (Y ).

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 3 / 36
Expectation (lecture notes, 2024-10-16)

Jensen's inequality, illustrated on a two-point distribution (for a convex
f the chord lies above the graph):

      E (f (X )) = p1 f (x1 ) + p2 f (x2 ) ≥ f (p1 x1 + p2 x2 ) = f (E (X ))
Markov Inequality

It is important to derive as much information as possible even from


a partial description of random variable. The mean value already gives
more information than one might expect, as captured by Markov inequality.

Theorem (Markov inequality)


Let X be a nonnegative random variable with finite mean value E (X ).
Then for all t > 0 it holds that
      P(X ≥ t) ≤ E (X ) / t .

Alternatively, for k > 0,

      P(X ≥ k · E (X )) ≤ 1/k .

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 4 / 36
Markov Inequality

Proof.
Let t > 0. We define a random variable Yt (for fixed t) as
      Yt = 0 if X < t, and Yt = t if X ≥ t.

Then Yt is a discrete random variable with probability distribution


pYt (0) = P(X < t), pYt (t) = P(X ≥ t). We have

E (Yt ) = tP(X ≥ t).

The observation X ≥ Yt gives

E (X ) ≥ E (Yt ) = tP(X ≥ t),

what is the Markov inequality.

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 5 / 36
Markov Inequality: Example

Example
Bound the probability of obtaining more than 75 heads in a sequence of
100 fair coin flips.

Let X1 , X2 , . . . , X100 be random variables expressing


      Xi = 1 if the i-th coin flip is head, and Xi = 0 otherwise.

Hence, let X = ∑_{i=1}^{100} Xi be the number of heads in 100 coin flips. Note that
E (Xi ) = 1/2, and E (X ) = 100/2 = 50.
Using the Markov inequality, we get

      P(X ≥ 75) ≤ E (X ) / 75 = 50/75 = 2/3 .

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 6 / 36
Markov Inequality: Example - parameterized

Example
Bound the probability of obtaining more than 34 n heads in a sequence of n
fair coin flips.

Let X1 , X2 , . . . , Xn be random variables expressing


      Xi = 1 if the i-th coin flip is head, and Xi = 0 otherwise.

Hence, let Yn = ∑_{i=1}^{n} Xi be the number of heads in n coin flips. Note that
E (Xi ) = 1/2, and E (Yn ) = n/2.
Using the Markov inequality, we get

      P(Yn ≥ (3/4) n) ≤ E (Yn ) / ((3/4) n) = (n/2) / (3n/4) = 2/3 .

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 7 / 36
Markov Inequality: Example - parameterized

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 8 / 36
Expectation is not everything

Example (St. Petersburg Lottery)


A casino offers a game of chance for a single player in which a fair coin is
tossed at each stage. The pot starts at $1 and is doubled every time a
head appears. The first time a tail appears, the game ends and the player
wins whatever is in the pot. What would be a fair price to pay for entering
the game?

The expected value of winnings of one game is:

      E (X ) = (1/2)·1 + (1/4)·2 + (1/8)·4 + · · · = ∑_{i=1}^{∞} (1/2)^i 2^(i−1) = ∑_{i=1}^{∞} 1/2 = ∞

But
• P(X ≥ 32) = 1 − ∑_{i=1}^{5} (1/2)^i = (1/2)^5 = 0.03125
• P(X ≥ 1024) = (1/2)^10 = 0.000976562

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 9 / 36
Expectation is not everything (lecture notes, 2024-10-16)

The paradox is named from Daniel Bernoulli's presentation of the problem
and his solution, published in 1738 in the Commentaries of the Imperial
Academy of Science of Saint Petersburg. However, the problem was invented
by Daniel's cousin Nicolas Bernoulli, who first stated it in a letter to
Pierre Raymond de Montmort of September 9, 1713.

Expected values for bounded versions of the Saint Petersburg lottery:

Banker              Bankroll             Expected value of lottery
Friendly game       $100                 $4.28
Millionaire         $1,000,000           $10.95
Billionaire         $1,000,000,000       $15.93
Bill Gates (2008)   $58,000,000,000      $18.84
U.S. GDP (2007)     $13.8 trillion       $22.78
World GDP (2007)    $54.3 trillion       $23.77
Googolaire          $10^100              $166.50
Expectation is not everything

Example
Program A terminates in 1 hour on 99% of inputs and in 1000 hours on
1% of inputs.
Program B terminates in 10.99 hours on every input.
Which one would you prefer?

The expected values of termination times are the same.


      E (XA ) = 1 · 0.99 + 1000 · 0.01 = 10.99
      E (XB ) = 10.99

But the programs differ in how much they deviate from this average.

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 10 / 36
Part II

Moments and Deviations

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 11 / 36
Moments

Definition
The k-th moment of a random variable X is defined as E (X^k ).

• Note that for k = 1 we get the expectation of X .

Theorem
If X and Y are random variables with matching corresponding moments of
all orders, i.e. ∀k E (X^k ) = E (Y^k ), then X and Y have the same
distributions.

• Usually, we center the expected value to 0 – we use moments of


Φ(X ) = X − E (X ).
• I.e., we define the k-th central moment of X as

      µk = E ( [X − E (X )]^k ) .

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 12 / 36
Variance - the second central moment

Definition
The second central moment is known as the variancea of X and defined as

      µ2 = E ( [X − E (X )]^2 ) .


The variance is usually denoted as Var (X ) or σX2 .


The square root of Var (X ) is known as the standard deviationb σX .
a rozptyl
b směrodatná odchylka

If variance is small, then X takes values close to E (X ) with high


probability. If the variance is large, then the distribution is more ’diffused’.

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 13 / 36
Variance
Theorem
Let Var (X ) be the variance of the random variable X . Then

      Var (X ) = E (X^2 ) − [E (X )]^2 .

Proof.

Var (X ) =

= E (X^2 ) − [E (X )]^2 .

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 14 / 36
Variance (lecture notes, 2024-10-16)

      Var (X ) = E ( [X − E (X )]^2 ) = E ( X^2 − 2X E (X ) + [E (X )]^2 )
               = E (X^2 ) − E ( 2X E (X ) ) + [E (X )]^2
               = E (X^2 ) − 2[E (X )]^2 + [E (X )]^2
               = E (X^2 ) − [E (X )]^2 .
Variance - properties
Theorem
Let X be a random variable and a and b be real numbers. Then

      Var (aX + b) = a^2 Var (X ) .

Proof.

Var (aX + b) =

= a^2 Var (X ) .

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 15 / 36
Variance - properties (lecture notes, 2024-10-16)

      Var (aX + b) = E ( [(aX + b) − E (aX + b)]^2 )
                   = E ( [aX + b − aE (X ) − b]^2 )
                   = E ( [a(X − E (X ))]^2 )
                   = E ( a^2 [X − E (X )]^2 )
                   = a^2 E ( [X − E (X )]^2 )
                   = a^2 Var (X ) .
Variance of Some Distributions

Distribution    Definition                  E (X )      Var (X )
Uniform a..b    n = b − a + 1 values        (a + b)/2   (n^2 − 1)/12
Constant        P(c) = 1                    c           0
Bernoulli 0,1   P(1) = p                    p           p(1 − p)
Binomial        n-times a p-trial           np          np(1 − p)
Geometric       p-trials to succ.           1/p         (1 − p)/p^2
Neg. binomial   p-trials to r succ.s        r/p         r(1 − p)/p^2

Is it true that Var (X + Y ) = Var (X ) + Var (Y )?

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 16 / 36
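The table entries can be checked by simulation; a minimal Python sketch for
the geometric row (illustrative p = 0.25), using Var (X ) = E (X^2 ) − [E (X )]^2:

    import random

    random.seed(4)
    p, N = 0.25, 200_000

    def geometric(p):
        x = 1
        while random.random() >= p:
            x += 1
        return x

    xs = [geometric(p) for _ in range(N)]
    mean = sum(xs) / N
    var = sum(x * x for x in xs) / N - mean**2   # Var(X) = E(X^2) - E(X)^2
    print(mean, 1 / p)           # ~4.0
    print(var, (1 - p) / p**2)   # ~12.0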
Variance of Some Distributions (lecture notes, 2024-10-16)

Motivation for covariance:

      Var (X + Y ) = E ( (X + Y )^2 ) − ( E (X + Y ) )^2
                   = E ( X^2 + 2XY + Y^2 ) − ( E (X ) + E (Y ) )^2
                   = E (X^2 ) + 2E (XY ) + E (Y^2 ) − [E (X )]^2 − 2E (X )E (Y ) − [E (Y )]^2
                   = Var (X ) + Var (Y ) + 2 ( E (XY ) − E (X )E (Y ) )
                   = Var (X ) + Var (Y ) + 2 Cov (X , Y ) .
Covariance

Definition
The quantity

      E ( [X − E (X )][Y − E (Y )] ) = ∑_{i,j} pX,Y (xi , yj ) [xi − E (X )][yj − E (Y )]

is called the covariance of X and Y and denoted Cov (X , Y ).

• Covariance measures linear dependence between two random


variables. It is positive if the variables are “correlated”, and negative
when “anticorrelated”.
• E.g. when Y = aX , using E (Y ) = aE (X ) we have

Cov (X , Y ) = aVar (X ).

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 17 / 36
Covariance

Definition
We define the correlation coefficient ρ(X , Y ) as the normalized
covariance, i.e.
      ρ(X , Y ) = Cov (X , Y ) / √( Var (X ) Var (Y ) ) .

Theorem
For any two random variables X and Y , it holds that

      0 ≤ (Cov (X , Y ))^2 ≤ Var (X ) Var (Y ) .

I.e., −1 ≤ ρ(X , Y ) ≤ 1.

For Y = aX , we have ρ(X , Y ) =

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 18 / 36
Covariance (lecture notes, 2024-10-16)

For Y = aX :

      ρ(X , Y ) = Cov (X , Y ) / √( Var (X ) Var (Y ) )
                = Cov (X , aX ) / √( Var (X ) Var (aX ) )
                = a Var (X ) / √( a^2 Var (X )^2 )
                = a Var (X ) / ( |a| Var (X ) )
                = 1 for a > 0,  −1 for a < 0,  undefined for a = 0.
Covariance

Theorem
Let X and Y be random variables. Then the covariance

Cov (X , Y ) = E (XY ) − E (X )E (Y ).

Proof.

Cov (X , Y ) = E [X − E (X )][Y − E (Y )] =

Corollary
For independent random variables X and Y , it holds that Cov (X , Y ) = 0.

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 19 / 36
Cov(X, Y) = E([X − E(X)][Y − E(Y)])
          = E(XY − Y·E(X) − X·E(Y) + E(X)E(Y))
          = E(XY) − E(X)E(Y) − E(X)E(Y) + E(X)E(Y)
          = E(XY) − E(X)E(Y).
Covariance

It may happen that X is completely dependent on Y and
yet the covariance is 0.
E.g., for X with Uniform({−1, 0, 1}) distribution and Y = X^2.

Hence,

X and Y independent ⇒ Cov(X, Y) = 0, but

Cov(X, Y) = 0 ⇏ X and Y independent.

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 20 / 36
E(X) = 0
E(Y) = E(X^2) = 1/3 + 1/3 = 2/3
E(XY) = E(X^3) = E(X) = 0.
Hence,
Cov(X, Y) = E(XY) − E(X)E(Y) = 0.
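A quick plain-Python check of this example (an illustrative addition; exact arithmetic via the standard fractions module):

from fractions import Fraction

# X uniform on {-1, 0, 1}; Y = X^2 is completely determined by X.
pX = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

EX  = sum(p * x         for x, p in pX.items())  # E(X)   = 0
EY  = sum(p * x * x     for x, p in pX.items())  # E(X^2) = 2/3
EXY = sum(p * x * x * x for x, p in pX.items())  # E(X^3) = 0

print(EXY - EX * EY)  # Cov(X, Y) = 0, although Y is a function of X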
Variance of Independent Variables

Theorem
If X and Y are independent random variables, then

Var (X + Y ) = Var (X ) + Var (Y ).

Proof.
Var(X + Y) = E([(X + Y) − E(X + Y)]^2) = · · · (expanded below)


V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 21 / 36
Var(X + Y) = E([(X + Y) − E(X + Y)]^2)
           = E([(X + Y) − E(X) − E(Y)]^2)
           = E([(X − E(X)) + (Y − E(Y))]^2)
           = E([X − E(X)]^2 + [Y − E(Y)]^2 + 2[X − E(X)][Y − E(Y)])
           = E([X − E(X)]^2) + E([Y − E(Y)]^2) + 2E([X − E(X)][Y − E(Y)])
           = Var(X) + Var(Y) + 2Cov(X, Y) = Var(X) + Var(Y).
Variance

Theorem
If X and Y are (not necessarily independent) random variables, we obtain

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y).

The proof is analogous to the previous one.

• The additivity of variance can be generalized to a set X_1, X_2, . . . , X_n of
  pairwise independent variables and constants a_1, a_2, . . . , a_n, b ∈ R as

  Var(∑_{i=1}^n a_i X_i + b) = ∑_{i=1}^n a_i^2 Var(X_i).

• When the variables X_1, X_2, . . . , X_n are not necessarily independent:

  Var(∑_{i=1}^n a_i X_i + b) = ∑_{i=1}^n a_i^2 Var(X_i) + 2 ∑_{1≤i<j≤n} a_i a_j Cov(X_i, X_j).

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 22 / 36
Why not mutually independent? Why is pairwise sufficient?

Var(∑_i X_i) = E([∑_i X_i − E(∑_i X_i)]^2) = E([∑_i (X_i − E(X_i))]^2)
             = E([∑_i (X_i − E(X_i))] · [∑_j (X_j − E(X_j))])
             = ∑_{i,j} Cov(X_i, X_j)
             = ∑_i Var(X_i) + ∑_{i≠j} Cov(X_i, X_j)
             = ∑_i Var(X_i) + 2 · ∑_{i<j} Cov(X_i, X_j)

Distributing the multiplication over the sums in the same way gives the more
general form:

Var(∑_{i=1}^n a_i X_i + b) = ∑_{i=1}^n a_i^2 Var(X_i) + 2 ∑_{1≤i<j≤n} a_i a_j Cov(X_i, X_j).

Only pairwise covariances appear, so pairwise independence already makes
all the Cov(X_i, X_j) terms vanish.
Example

Example
Let X1 , X2 be two independent identically distributed random variables.
Compute Var (X1 + X1 ) and Var (X1 + X2 ) and explain the difference
between X1 + X1 and X1 + X2 .

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 23 / 36
Example

Var(X1 + X1) = Var(2 · X1) = 4 · Var(X1)

Var(X1 + X2) = Var(X1) + Var(X2) = 2 · Var(X1)

X1 + X1 stands for the double of the gain of the first round.
X1 + X2 stands for the gain of the first two rounds.

X1 and X2 are independent;
X1 and X1 are dependent.

An alternative computation of Var(X1 + X1):

Var(X1 + X1) = Var(X1) + Var(X1) + 2 · Cov(X1, X1)
             = Var(X1) + Var(X1) + 2 · Var(X1) = 4 · Var(X1)
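A minimal simulation sketch (illustrative only, standard library; a fair six-sided die is an arbitrary choice for the common distribution of X1, X2) contrasting the two sums:

import random

random.seed(1)
N = 100_000

def var(samples):
    # population variance of a list of samples
    m = sum(samples) / len(samples)
    return sum((s - m) ** 2 for s in samples) / len(samples)

x1 = [random.randint(1, 6) for _ in range(N)]  # first round
x2 = [random.randint(1, 6) for _ in range(N)]  # independent second round

print(var(x1))                               # Var(X1) ~ 35/12 = 2.9166...
print(var([a + a for a in x1]))              # Var(X1 + X1) ~ 4 * Var(X1)
print(var([a + b for a, b in zip(x1, x2)]))  # Var(X1 + X2) ~ 2 * Var(X1)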
Part III

Conditional Distribution and Expectation –


Extended to Moments

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 24 / 36
Conditional probability - recall

Using the derivation of conditional probability of two events, we can derive


conditional probability of (a pair of) random variables.

Definition
The conditional probability distribution of random variable Y given
random variable X (their joint distribution is pX ,Y (x, y )) is

p_{Y|X}(y|x) = P(Y = y | X = x) = P(Y = y, X = x) / P(X = x)
             = p_{X,Y}(x, y) / p_X(x)

provided the marginal probability p_X(x) ≠ 0.

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 25 / 36
Conditional expectation

We may consider Y |(X = x) to be a new random variable that is given by


the conditional probability distribution pY |X . Therefore, we can define its
mean and moments.
Definition (recall)
The conditional expectation of Y given X = x is defined

E(Y | X = x) = ∑_y y · P(Y = y | X = x) = ∑_y y · p_{Y|X}(y|x).

Definition
Analogously, the conditional variance can be defined as

Var(Y | X = x) = E(Y^2 | X = x) − [E(Y | X = x)]^2.

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 26 / 36
Conditional expectation

We can derive the expectation of Y from the conditional expectations.

Theorem (of total expectation - recall)


Let X and Y be random variables, then

E(Y) = ∑_x E(Y | X = x) · p_X(x).

Theorem (of total moments)


Let X and Y be random variables, then

E(Y^k) = ∑_x E(Y^k | X = x) · p_X(x).

Note that we do not have such a theorem for central moments.

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 27 / 36
The proof is very similar to the one of total expectation.

∑_x E(Y^k | X = x) p_X(x) = ∑_x (∑_y y^k · P(Y = y | X = x)) · P(X = x)
                          = ∑_x ∑_y y^k · P(Y = y | X = x) · P(X = x)
                          = ∑_y y^k · ∑_x P(Y = y | X = x) · P(X = x)
                          = ∑_y y^k · P(Y = y)
                          = E(Y^k)
Example: Random sums
Example
Let N, X1 , X2 , . . . be mutually independent random variables. Let us
suppose that X1 , X2 , . . . have identical probability distribution pX , mean
E (X ), and variance Var (X ). We also know the values E (N) and Var (N).
Compute the expectation and variance of the random variable T defined as

T = X1 + X2 + · · · + XN .

From the previous lecture, we already know that E (T |N = n) = n · E (X )


and E (T ) = E (X ) · E (N).
Similarly, we can obtain

Var(T | N = n) = Var(X1 + X2 + · · · + Xn) = ∑_{i=1}^n Var(Xi) = n · Var(X)   (1)
since X1 , . . . , Xn are independent identically distributed.
But the theorem of total moments does not work for centralized moments
like variance, so we cannot easily obtain Var (T )!
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 28 / 36
Example: Random sums

It remains to derive the variance of T, i.e. Var(T) = E(T^2) − [E(T)]^2.

We already have E(T) and need E(T^2), for which we can use the theorem
of total moments. Hence, we need E(T^2 | N = n), the second moment of
(T | N = n) = X1 + · · · + Xn. We cannot compute this directly. But we can
express the definition of conditional variance of T given N = n:

Var(T | N = n) = E(T^2 | N = n) − [E(T | N = n)]^2   (2)

Note that we already have Var(T | N = n) = n · Var(X) and
E(T | N = n) = n · E(X). Hence,

E(T^2 | N = n) = n · Var(X) + (n · E(X))^2.

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 29 / 36
Example: Random sums
Now, using the theorem of total moments, we get

E(T^2) = ∑_n E(T^2 | N = n) · P(N = n)
       = ∑_n (n · Var(X) + n^2 [E(X)]^2) · P(N = n)
       = Var(X) (∑_n n · P(N = n)) + [E(X)]^2 (∑_n n^2 · P(N = n))
       = Var(X)E(N) + [E(X)]^2 E(N^2).

Finally, we obtain

Var(T) = E(T^2) − [E(T)]^2
       = Var(X)E(N) + [E(X)]^2 E(N^2) − [E(X)]^2 [E(N)]^2
       = Var(X)E(N) + [E(X)]^2 Var(N).

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 30 / 36
Var(T) = Var(X)E(N) + [E(X)]^2 Var(N)

Intuition:
Var(X)E(N) expresses the variance “caused by X”, i.e. when N has a
constant distribution;
[E(X)]^2 Var(N) expresses the variance “caused by N”, i.e. when X has a
constant distribution.

Compare it with a wrong solution:

Var(T) = ∑_n Var(T | N = n) p_N(n) = ∑_n n · Var(X) · p_N(n)
       = Var(X) · ∑_n n · p_N(n)
       = Var(X) · E(N)
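A Monte Carlo sketch (illustrative, standard library only) comparing the empirical variance of T with the formula just derived; the choices N ~ Uniform{0,…,9} and Xi ~ Bernoulli(0.3) are arbitrary:

import random

random.seed(2)
p, TRIALS = 0.3, 200_000

samples = []
for _ in range(TRIALS):
    n = random.randrange(10)  # N ~ Uniform{0,...,9}
    samples.append(sum(random.random() < p for _ in range(n)))  # T = X1+...+XN

m = sum(samples) / TRIALS
emp_var = sum((t - m) ** 2 for t in samples) / TRIALS

EN, VarN = 4.5, (10 ** 2 - 1) / 12          # mean and variance of Uniform{0..9}
EX, VarX = p, p * (1 - p)
print(emp_var, VarX * EN + EX ** 2 * VarN)  # both ~ 1.6875

Note that the wrong solution above would predict only VarX · EN = 0.945, visibly below the empirical value.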
Part IV

Chebyshev Inequality

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 31 / 36
Chebyshev Inequality
In case we know both the mean value and the variance of a random variable,
we can use a much more accurate estimate.
Theorem (Chebyshev inequality)
Let X be a random variable with finite variance. Then

P(|X − E(X)| ≥ t) ≤ Var(X)/t^2,   t > 0.

Alternatively, substituting t = kσ_X = k√(Var(X)), we obtain

P(|X − E(X)| ≥ kσ_X) ≤ 1/k^2,   k > 0,

or, alternatively, substituting X′ = X − E(X),

P(|X′| ≥ t) ≤ E(X′^2)/t^2,   t > 0.

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 32 / 36
Chebyshev Inequality

Proof.
We apply the Markov inequality to the nonnegative variable [X − E(X)]^2
and we replace t by t^2 to get

P((X − E(X))^2 ≥ t^2) ≤ E([X − E(X)]^2)/t^2 = Var(X)/t^2.

We obtain the Chebyshev inequality using the fact that the events
[(X − E(X))^2 ≥ t^2] = [|X − E(X)| ≥ t] are the same.

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 33 / 36
Chebyshev Inequality: Example - parameterized

Example (Coin flipping revisited)

Bound the probability of obtaining more than (3/4)n heads in a sequence of n
fair coin flips.

Again, Xi = 1 if the ith outcome is head and 0 if it is tail, and
Yn = ∑_{i=1}^n Xi. Let us calculate the variance of Yn:

E(Xi^2) = E(Xi) = 1/2.

Then
Var(Xi) = E(Xi^2) − [E(Xi)]^2 = 1/2 − 1/4 = 1/4

and using the independence we have

Var(Yn) = n/4.
V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 34 / 36
Chebyshev Inequality: Example - parameterized

We apply the Chebyshev bound to get

P(Yn ≥ 3n/4) = P(Yn − E(Yn) ≥ 3n/4 − n/2)   due to E(Yn) = n/2
             ≤ P(|Yn − E(Yn)| ≥ n/4)        as this bounds both sides
             ≤ Var(Yn)/(n/4)^2              due to Cheb. ineq.
             = (n/4)/(n/4)^2                thanks to Var(Yn) = n/4
             = 4/n.

V. Řehák: IV111 Probability in CS Lecture 4: Variance and Chebyshev Ineq. October 17, 2024 35 / 36
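To see how loose the bound is, here is an illustrative sketch (standard library only) comparing 4/n with the exact tail P(Yn ≥ 3n/4) of the Binomial(n, 1/2) distribution:

from math import comb

def exact_tail(n):
    # P(Y_n >= 3n/4) for Y_n ~ Binomial(n, 1/2)
    k0 = -(-3 * n // 4)  # ceil(3n/4)
    return sum(comb(n, k) for k in range(k0, n + 1)) / 2 ** n

for n in (20, 100, 400):
    print(n, exact_tail(n), 4 / n)  # the exact tail is far below the 4/n bound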
Lecture 5: Chernoff Bounds and
Laws of Large Numbers

Vojtěch Řehák
based on slides of Jan Bouda

Faculty of Informatics, Masaryk University

October 24, 2024

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 1 / 29
Part I

Motivation Examples

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 2 / 29
Unknown Biased Coin

Example
Let us have a biased coin with an unknown probability p of tossing head.
How can we determine the probability p such that we are 99.9% sure that
the first k bits are correct?

Repeat flips and count the frequency of heads A_n = ∑_{i=1}^n X_i / n.
We need n s.t.

P(|A_n − p| ≥ 1/2^k) ≤ 0.001.

Markov inequality: P(A_n ≥ t) ≤ E(A_n)/t
Expectation: E(A_n) = ∑_{i=1}^n E(X_i)/n = E(X_i) = p.
Note that A_n is not symmetric around E(A_n) when p ≠ 1/2. Hence, we
can bound only one side, where moreover n disappears:

P((A_n − p) ≥ 1/2^k) = P(A_n ≥ 1/2^k + p) ≤ p·2^k / (1 + p·2^k)

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 3 / 29
Var(A_n) = Var(∑_i X_i / n) = ∑_i Var(X_i)/n^2 = Var(X_i)/n = p(1 − p)/n

Chebyshev inequality: P(|A_n − E(A_n)| ≥ t) ≤ Var(A_n)/t^2

P(|A_n − p| ≥ 1/2^k) ≤ 2^{2k} Var(A_n) = 2^{2k} p(1 − p)/n

We do not know p, but p(1 − p) ≤ 1/4, hence the bound is

≤ 2^{2k}/(4n) = 2^{2k−2}/n ≤ 0.001, and so n ≥ 10^3 · 2^{2k−2},

i.e. n is exponential in the number of digits/bits of both the probability
bound and the result precision.
Unknown Biased Coin

Example
Let us have a biased coin with an unknown probability p of tossing head.
How can we determine the probability p such that we are 99.9% sure that
the first k significant bits are correct?

Here, we need n such that

P(|A_n − p| ≥ p/2^k) ≤ 0.001.

Markov inequality:
P(A_n ≥ t) ≤ E(A_n)/t
Hence

P((A_n − p) ≥ p/2^k) = P(A_n ≥ p/2^k + p) ≤ p · 2^k/(p + p·2^k) = 2^k/(1 + 2^k) < 1;

both n and p disappeared.
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 4 / 29
Chebyshev inequality:

P(|A_n − E(A_n)| ≥ t) ≤ Var(A_n)/t^2

P(|A_n − p| ≥ p/2^k) ≤ (2^{2k}/p^2) Var(A_n) = (2^{2k}/p^2) · p(1 − p)/n = (2^{2k}/n) · (1 − p)/p

The problem is that we cannot bound (1 − p)/p from above!
For very small p, it is

(1 − p)/p = 1/p − 1 ≈ very large.

So, we need to bound p from below.
Chernoff inequality: better bounds, let’s try it.
Chernoff will also need a bound on p from below, but it is tighter.
Part II

Moment Generating Function

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 5 / 29
Moments and Moment Generating Function

Idea
Markov inequality uses E (X ).
Chebyshev inequality uses Var (X ).
Chernoff bound will use moment generating functions.

Definition (Recall)
The kth moment of a random variable X is defined as E(X^k).

Definition
The moment generating function of a random variable X is

M_X(t) = E(e^{tX}).

We will be interested mainly in the properties of this function around t = 0.

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 6 / 29
Moment Generating Function and Moments

The moment generating function captures all moments:


Theorem
Let M_X(t) = E(e^{tX}) be a moment generating function of X. Assuming
that exchanging the expectation and differentiation operands is legitimate,
for all n ≥ 1 we have

E(X^n) = M_X^{(n)}(0),

where M_X^{(n)}(0) is the nth derivative of M_X(t) evaluated at 0.

The assumption “expectation and differentiation can be exchanged” holds
whenever the moment generating function is finite in a neighborhood of 0.

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 7 / 29
Moment Generating Function and Moments

Proof.
Assuming that exchanging the expectation and differentiation operands is
legitimate, we have

M_X^{(n)}(t) = (E(e^{tX}))^{(n)} = E((e^{tX})^{(n)}) = E(X^n e^{tX}),

or, expanding the exponential,
M_X(t) = E(e^{tX}) = E(1 + tX + (tX)^2/2! + (tX)^3/3! + · · · ).
Evaluating at t = 0, we get

M_X^{(n)}(0) = E(X^n).

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 8 / 29
Moment Generating Function and Distributions

Moment generating functions uniquely define the probability distribution:


Theorem
Let X and Y be two random variables and δ > 0.
If for all t ∈ (−δ, δ)
M_X(t) = M_Y(t) < ∞,
then X and Y have the same distributions.

Proof of this theorem is beyond the scope of this course.

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 9 / 29
Moment Generating Function and Distributions

This allows us, e.g., to calculate the probability distribution of a sum of
independent random variables:
Theorem
If X and Y are independent random variables, then

M_{X+Y}(t) = M_X(t) M_Y(t).   (1)

Proof.

M_{X+Y}(t) = E(e^{t(X+Y)}) = E(e^{tX} e^{tY}) = E(e^{tX}) E(e^{tY}) = M_X(t) M_Y(t),

using independence in the third equality.

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 10 / 29
Part III

Chernoff Bounds

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 11 / 29
Chernoff Bounds
The Chernoff bound for random variable X is obtained by applying the
Markov inequality to e^{tX} for some suitably chosen t.
For any t > 0

P(X ≥ a) = P(tX ≥ ta) = P(e^{tX} ≥ e^{ta}) ≤ E(e^{tX}) / e^{ta}.

Similarly, for any t < 0

P(X ≤ a) = P(tX ≥ ta) = P(e^{tX} ≥ e^{ta}) ≤ E(e^{tX}) / e^{ta}.

While the value of t that minimizes E(e^{tX})/e^{ta} gives the best bound, in
practice we usually use the value of t that gives a convenient form.
Definition
Bounds derived using this approach are called the Chernoff bounds.

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 12 / 29
Chernoff Bound and a Sum of Poisson Trials

Poisson trials (do not confuse with Poisson random variables!!) are a
sequence of independent coin flips where the probabilities of the respective
coin flips may differ. Bernoulli trials are a special case of Poisson trials.
Example
Let X1, . . . , Xn be independent Poisson trials with P(Xi = 1) = pi, and
X = ∑_{i=1}^n Xi their sum. For every δ > 0 we want to bound the probabilities

P(X ≥ (1 + δ)E(X)) and P(X ≤ (1 − δ)E(X)).

Note that the expected value is

E(X) = ∑_{i=1}^n E(Xi) = ∑_{i=1}^n pi.

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 13 / 29
Chernoff Bound and a Sum of Poisson Trials
We derive a bound on the moment generating function

M_{Xi}(t) = E(e^{tXi}) = pi e^{t·1} + (1 − pi) e^{t·0}
          = 1 + pi(e^t − 1)
          ≤ e^{pi(e^t − 1)}

using that 1 + y ≤ e^y for any y ≥ 0.
The moment generating function of X is (due to the Theorem of slide 10)

E(e^{tX}) = M_X(t) = ∏_{i=1}^n M_{Xi}(t)
          ≤ ∏_{i=1}^n e^{pi(e^t − 1)}
          = e^{∑_{i=1}^n pi(e^t − 1)} = e^{(e^t − 1)E(X)}
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 14 / 29
Chernoff Bound and a Sum of Poisson Trials

Theorem
Let X1, . . . , Xn be independent Poisson trials with P(Xi = 1) = pi,
X = ∑_{i=1}^n Xi their sum and µ = E(X). Then the following Chernoff
bounds hold:
1. for any δ > 0

   P(X ≥ (1 + δ)µ) ≤ (e^δ / (1 + δ)^{1+δ})^µ

2. for 0 < δ ≤ 1

   P(X ≥ (1 + δ)µ) ≤ e^{−µδ^2/3}

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 15 / 29
Chernoff Bound and a Sum of Poisson Trials

Proof.
1. Using the Markov inequality, we have that for any t > 0

   P(X ≥ (1 + δ)µ) = P(e^{tX} ≥ e^{t(1+δ)µ})
                   ≤ E(e^{tX}) / e^{t(1+δ)µ}
                   ≤ e^{(e^t − 1)µ} / e^{t(1+δ)µ} = (e^{e^t − 1} / e^{t(1+δ)})^µ.

   For any δ > 0, we can set t = ln(1 + δ) to get

   P(X ≥ (1 + δ)µ) ≤ (e^δ / (1 + δ)^{1+δ})^µ

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 16 / 29
Chernoff Bound and a Sum of Poisson Trials

Proof (Cont.)
2. We want to show that for any 0 < δ ≤ 1

   e^δ / (1 + δ)^{1+δ} ≤ e^{−δ^2/3},

   which will give us the result immediately. Taking the natural logarithm
   of both sides we obtain the equivalent condition

   f(δ) := δ − (1 + δ) ln(1 + δ) + δ^2/3 ≤ 0.

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 17 / 29
Chernoff Bound and a Sum of Poisson Trials

Proof (Cont.)
Note that f(0) = 0. We show that f is non-positive on [0, 1].
We calculate the first and second derivative of f(δ):

f′(δ) = 1 − (1 + δ)/(1 + δ) − ln(1 + δ) + (2/3)δ = −ln(1 + δ) + (2/3)δ
f″(δ) = −1/(1 + δ) + 2/3.

We see that f″(δ) < 0 for 0 ≤ δ < 1/2 and f″(δ) > 0 for δ > 1/2. Hence,
f′(δ) first decreases and then increases on [0, 1]. Since f′(0) = 0 and
f′(1) < 0, we see that f′(t) ≤ 0 on [0, 1]. Since f(0) = 0, it follows that
f(t) ≤ 0 on [0, 1] as well, which completes the proof.

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 18 / 29
Chernoff Bound and a Sum of Poisson Trials

Theorem
Let X1, . . . , Xn be independent Poisson trials with P(Xi = 1) = pi,
X = ∑_{i=1}^n Xi their sum and µ = E(X). Then for 0 < δ ≤ 1:
1. P(X ≤ (1 − δ)µ) ≤ (e^{−δ} / (1 − δ)^{1−δ})^µ
2. P(X ≤ (1 − δ)µ) ≤ e^{−µδ^2/2}

Proof: Analogous to the previous theorem, left as a home exercise. Hint:
start with any t < 0.

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 20 / 29
Chernoff Bound and a Sum of Poisson Trials

Corollary
Let X1, . . . , Xn be independent Poisson trials and X = ∑_{i=1}^n Xi. For
0 < δ ≤ 1,

P(|X − E(X)| ≥ δE(X)) ≤ 2e^{−E(X)δ^2/3}

Example (Our second motivation example)

Let us have a biased coin with an unknown probability p of tossing head.
How can we determine the probability p such that we are 99.9% sure that
the first k significant bits are correct?

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 21 / 29
Let S_n = ∑_{i=1}^n X_i and A_n = S_n/n.

E(S_n) = E(∑_{i=1}^n X_i) = n · E(X_1) = np

We would like to bound

P((A_n − p) ≥ p/2^k) = P((nA_n − np) ≥ np/2^k) = P((S_n − E(S_n)) ≥ E(S_n)/2^k)

Chernoff bound: for δ = 1/2^k

P(|S_n − E(S_n)| ≥ E(S_n)/2^k) ≤ 2e^{−pn·(1/2^k)^2·(1/3)}

It’s better than Chebyshev’s (2^{2k}/n)·(1 − p)/p, but again, there is the
probability p. Without bounding it from below, we cannot compute any bound.
Explanation: Imagine that no head appears after many trials; then, with
high probability, p is smaller than some bound, but we know nothing about
how small it is, and so what the first significant bit of p is.
Chernoff Bound: Example - parameterized
Example (Coin flipping re-revisited)
Bound the probability of obtaining more than (3/4)n heads in a sequence of n
fair coin flips.

Again, Xi = 1 if the ith outcome is head and 0 otherwise, and
X = ∑_{i=1}^n Xi. Moreover, E(X) = nE(Xi) = n/2.

Using the Chernoff bound P(|X − E(X)| ≥ δE(X)) ≤ 2e^{−E(X)δ^2/3}, we get

P(X ≥ 3n/4) ≤ P(|X − E(X)| ≥ n/4)
            ≤ 2e^{−(n/2)(1/2)^2(1/3)}
            = 2e^{−n/24}.

Recall that from the Chebyshev inequality:

P(Yn ≥ 3n/4) ≤ 4/n.
V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 22 / 29
Chernoff Bounds vs. Chebyshev Bounds

The Chernoff bound beats the Chebyshev bound for n > 92.

(Plot: the two bounds 4/n and 2e^{−n/24} as functions of n, for n roughly
between 70 and 120, crossing near n ≈ 92.)

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 23 / 29
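An illustrative numeric comparison of the two bounds (standard library only); the crossover sits near n ≈ 92, matching the plot:

from math import exp

for n in (50, 92, 150):
    cheb, chern = 4 / n, 2 * exp(-n / 24)  # Chebyshev vs. Chernoff bound
    winner = "Chernoff" if chern < cheb else "Chebyshev"
    print(n, cheb, chern, winner)  # Chebyshev wins for small n, Chernoff for large n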
Part IV

Laws of Large Numbers

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 24 / 29
(Weak) Law of Large Numbers

Theorem ((Weak) Law of Large Numbers)

If X1, X2, . . . is a sequence of independent identically distributed random
variables with expectation µ = E(Xk), then for every ε > 0

lim_{n→∞} P(|(X1 + · · · + Xn)/n − µ| > ε) = 0.

In words, the probability that the average of the Xi differs from the
expectation by more than an arbitrarily small ε goes to 0.
Idea
The Law of Large Numbers proves formally that the mathematically defined
notion of probability corresponds to our motivation based on
frequencies in repeated experiments, i.e. the expected value is really
“expected” as the average of the repeated experiments.

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 25 / 29
(Weak) Law of Large Numbers

Proof.
We prove the theorem only for the special case when Var(Xk) = σ^2 exists.
Let An = (X1 + · · · + Xn)/n. Then

P(|(X1 + · · · + Xn)/n − µ| > ε) = P(|An − µ| > ε).

Hence, E(An) = E((1/n)∑_{i=1}^n Xi) = (1/n)E(∑_{i=1}^n Xi) = (1/n)∑_{i=1}^n E(Xi) = µ.
Moreover, Var(An) = Var((1/n)∑_{i=1}^n Xi) = (1/n^2)Var(∑_{i=1}^n Xi) =
(1/n^2)∑_{i=1}^n Var(Xi) = (n/n^2)Var(Xk) = σ^2/n.
The law of large numbers is a direct consequence of the Chebyshev
inequality for An:

P(|An − µ| > ε) ≤ Var(An)/ε^2 = σ^2/(nε^2).

Hence, if n → ∞ the right-hand side tends to 0, which gives the result.

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 26 / 29
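An illustrative simulation (standard library only) of the statement for fair coin flips, estimating P(|An − µ| > ε) for growing n:

import random

random.seed(3)
mu, eps, RUNS = 0.5, 0.05, 2000

for n in (10, 100, 1000):
    # count the runs whose empirical average A_n deviates from mu by more than eps
    bad = sum(
        abs(sum(random.random() < mu for _ in range(n)) / n - mu) > eps
        for _ in range(RUNS)
    )
    print(n, bad / RUNS)  # fraction of runs with |A_n - mu| > eps shrinks with n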
Strong Law of Large Numbers

The (weak) law of large numbers implies that large errors in the
“experimentally obtained expectation” occur infrequently.
In many practical situations, we require the stronger statement that the
error remains small for all sufficiently large n, i.e. the infinite sequences
whose running average keeps deviating from the mean have zero probability.

Theorem (Strong Law of Large Numbers)

If X1, X2, . . . is a sequence of independent identically distributed random
variables with expectation µ = E(Xk), then

P(lim_{n→∞} (X1 + · · · + Xn)/n = µ) = 1.

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 27 / 29
Weak vs. Strong Law of Large Numbers

The weak law of large numbers says:

∀ε > 0 . ∀δ > 0 . ∃N > 0 . ∀n ≥ N .
    P(|(X1 + · · · + Xn)/n − µ| ≤ ε) ≥ 1 − δ

The strong law of large numbers says:

∀ε > 0 . ∀δ > 0 . ∃N > 0 .
    P(∀n ≥ N . |(X1 + · · · + Xn)/n − µ| ≤ ε) ≥ 1 − δ

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 28 / 29
Converging in probability but not a.s.

Example
Let us have a sequence of random variables Yn where

P(Yn = 0) = 1 − 1/n and P(Yn = n) = 1/n.

With increasing n, the probability P(Yn = 0) increases to 1.
Formally, ∀δ . ∃N > 0 . ∀n ≥ N . P(Yn = 0) = 1 − 1/n ≥ 1 − δ.
Hence, we say that Yn converges to 0 in probability.
Note that, on the contrary, E(Yn) = 0 · (1 − 1/n) + n · (1/n) = 1.

Yn does not converge to 0 with probability 1 (i.e. almost surely).

Note that ∀N > 0 . P(∀n ≥ N . Yn = 0) = ∏_{i=N}^∞ (1 − 1/i) = 0.

V. Řehák: IV111 Probability in CS Lecture 5: Chernoff Bounds and LoLN October 24, 2024 29 / 29
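The tail product above telescopes, which a few illustrative lines make visible (assuming the Yn are independent, as in the product formula; the start N = 100 is an arbitrary choice):

from math import prod

# prod_{i=N}^{M} (1 - 1/i) = (N - 1)/M, which tends to 0 as M grows:
# even starting from a late N, the run of zeros is eventually broken.
N = 100
for M in (10 ** 3, 10 ** 4, 10 ** 5):
    print(M, prod(1 - 1 / i for i in range(N, M + 1)), (N - 1) / M)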
Lectures 6:
Stochastic Processes and Markov Chains

Vojtěch Řehák
based on slides of Jan Bouda

Faculty of Informatics, Masaryk University

October 31, 2024

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 1 / 29
Part I

Stochastic Processes

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 2 / 29
Stochastic Processes - Examples

Other examples:
temperature, memory consumption, queue length, program counter value,
configuration of a program execution, printer status, . . .
chemical reaction, car location, animal population, . . .

What are the important questions we would like to answer?


What is the probability of reaching a value, the expected time to reach/leave a
(range of) value(s), the portion of time spent in/out of a range of values?
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 3 / 29
Stochastic Process - Definition
Sequence of experiments that watches “values” evolving in time.

X0 . . . X1 . . . X2 . . . X3.54 . . . X5.123 . . . X5∗π . . .

Definition (Stochastic Process)


A stochastic process is a collection of random variables X = {Xt | t ∈ T }.
The index t often represents time; Xt is called the state of X at time t.

Classification:
• the time domain T can be either countable, or uncountable -
representing discrete-time or continuous-time process, resp.
• the state space of Xt can be finite, countable, or uncountable -
representing finite-state, discrete-state, or continuous-state
process.

What does the sample space look like? Are the random variables independent?
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 4 / 29
Part II

Markov Chains

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 5 / 29
Markov Chain

Definition
A discrete-time stochastic process {X0 , X1 , X2 , . . . } is a Markov chain if

P(Xt = a | Xt−1 = b, Xt−2 = at−2 , . . . , X0 = a0 ) = P(Xt = a | Xt−1 = b) = pb,a .

That is, the value of Xt should depend on the value of Xt−1 , but does
not depend on the history of how we arrived at Xt−1 .
This is called Markov or memoryless property.

Note that, due to “= p_{b,a}”, it does not depend on the time t either¹.
Hence, a Markov chain can be drawn as an automaton (i.e., a state
diagram). Vertices are states (i.e. values) of the Markov chain and for
each non-zero pb,a there is an edge from b to a with label pb,a .

¹ This is, in fact, the so-called time-homogeneous Markov chain.


V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 6 / 29
Examples
Example
A gambler is playing a fair coin-flip game: wins $1 if head, loses $1 if tail.
• Let X0 denote a gambler’s initial money.
• Let Xt denote a gambler’s money after t flips.
Hence, {Xt | t ∈ {0, 1, 2, . . . . }} is a stochastic process. It is a Markov
chain. Draw the automaton.

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 7 / 29
Examples
Example
Draw the automaton for St. Petersburg Lottery.

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 8 / 29
Examples
Example
Draw the automaton for a queue, when between every two subsequent
states a new customer comes with probability p and one customer
(of a non-empty queue) is served with probability q.

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 9 / 29
Markov Chain - Transition Matrix

• Next, we focus our study on Markov chains whose state spaces


(the sets of values that Xt can take) are finite.
• So, without loss of generality, we label the states in the state space by
1, 2, . . . , n.
• Due to the definition of Markov chain, the probability
pi,j = P(Xt = j | Xt−1 = i) is the probability that the process moves
from the state i to the state j in one step. Hence, the matrix

    P = [ p_{i,j} | 1 ≤ i, j ≤ n ] =
        ⎡ p_{1,1}  p_{1,2}  · · ·  p_{1,n} ⎤
        ⎢ p_{2,1}  p_{2,2}  · · ·  p_{2,n} ⎥
        ⎢   ...      ...             ...   ⎥
        ⎣ p_{n,1}  p_{n,2}  · · ·  p_{n,n} ⎦

is the transition matrix of the automaton.

Question: What is the sum of the i-th row, i.e. ∑_{j=1}^n p_{i,j}?
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 10 / 29
Markov Chain - Probability of a Run

        ⎡  0    1/4   0    3/4 ⎤
        ⎢ 1/2    0   1/3   1/6 ⎥
    P = ⎢  0     0    1     0  ⎥
        ⎣  0    1/2  1/4   1/4 ⎦

Question: Starting in the state 1 with probability p0 , what is the


probability of executing a run with a prefix 1 − 4 − 2 − 3?

Recall P(A ∩ B) = P(B)P(A | B).

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 11 / 29
P(X0 = 1, X1 = 4, X2 = 2, X3 = 3) =
P(X0 = 1)P(X1 = 4, X2 = 2, X3 = 3 | X0 = 1) =
p0 P(X1 = 4, X2 = 2, X3 = 3 | X0 = 1) =
p0 P(X1 = 4 | X0 = 1)P(X2 = 2, X3 = 3 | X1 = 4, X0 = 1) =
p0 p1,4 P(X2 = 2, X3 = 3 | X1 = 4) =
p0 p1,4 P(X2 = 2 | X1 = 4)P(X3 = 3 | X2 = 2, X1 = 4) =
p0 p1,4 p4,2 P(X3 = 3 | X2 = 2) =
p0 p1,4 p4,2 p2,3
Distribution on States

Let us discuss distributions of Xk given X0 = 1.

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 12 / 29
Transition Matrix

• Let vector ⃗λ (t) = (λ1 (t), λ2 (t), . . . , λn (t)) denote the probability
distribution on states expressing where the process is at time t.
Note that λi (t) = P(Xt = i) and ⃗λ (0) is the initial distribution.

Question: How can we compute ⃗λ (1) from the transition matrix P


assuming we know the initial distribution ⃗λ (0)?

The value of λ_i(1) can be expressed² as

λi (1) := λ1 (0) · p1,i + λ2 (0) · p2,i + · · · + λn (0) · pn,i .

In other words,
⃗λ (0) · P = ⃗λ (1).

² Using the law of total probability P(A) = ∑_j P(B_j) P(A | B_j).


V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 13 / 29
Automata-based setting

Theorem
Every (discrete-time finite-state) Markov chain can be alternatively
(uniquely) defined by an initial vector ⃗λ (0) and a transition matrix P.

Proof.
It follows from the observations mentioned above that
• λi (0) = P(X0 = i), for all i, and
• the matrix P represents the conditional distributions for all the
subsequent random variables.

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 14 / 29
Part III

Reachability and Hitting Time

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 15 / 29
Example: Bounded Reachability

        ⎡  0    1/4   0    3/4 ⎤
        ⎢ 1/2    0   1/3   1/6 ⎥
    P = ⎢  0     0    1     0  ⎥
        ⎣  0    1/2  1/4   1/4 ⎦

What is the probability of being in state 4 after 3 steps from state 1?


From the graph, all possible paths are

1 − 2 − 1 − 4, 1 − 2 − 4 − 4, 1 − 4 − 2 − 4, and 1 − 4 − 4 − 4.

Probability of success for each path is: 3/32, 1/96, 1/16, and 3/64,
respectively. Summing up the probabilities, the total probability is 41/192.

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 16 / 29
P(1 − 2 − 1 − 4) + P(1 − 2 − 4 − 4) + P(1 − 4 − 2 − 4) + P(1 − 4 − 4 − 4) =


= 1/4 · 1/2 · 3/4 + 1/4 · 1/6 · 1/4 + 3/4 · 1/2 · 1/6 + 3/4 · 1/4 · 1/4 =
= 3/32 + 1/96 + 1/16 + 3/64 = 41/192
m-step Transition Matrix

• For any m, we define the m-step transition matrix P^{(m)} such that

  p^{(m)}_{i,j} = P(X_{t+m} = j | X_t = i),

  which is the probability that we move from state i to state j in
  exactly m steps.
• It is easy to check that P^{(2)} = P · P = P^2, P^{(3)} = P · P^{(2)} = P^3, and in
  general, P^{(m)} = P^m.

Thus, for any t ≥ 0 and m ≥ 1 we have

⃗λ(t + m) = ⃗λ(t) P^m

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 17 / 29
Example: Bounded Reachability

Alternatively, we can compute

          ⎡ 3/16   7/48   29/64   41/192 ⎤
          ⎢ 5/48   5/24   79/144   5/36  ⎥
    P^3 = ⎢   0      0       1       0   ⎥
          ⎣ 1/16  13/96  107/192  47/192 ⎦

The entry P^3_{1,4} = 41/192 gives the correct answer.

Idea
How to solve reachability in arbitrary many steps?
Let us demonstrate the problem on an example.

E.g. Throwing a six-sided die until 6 is reached.

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 18 / 29
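The same number can be checked mechanically; an illustrative sketch with exact fractions (states renumbered 0–3, standard library only):

from fractions import Fraction as F

P = [
    [F(0),    F(1, 4), F(0),    F(3, 4)],
    [F(1, 2), F(0),    F(1, 3), F(1, 6)],
    [F(0),    F(0),    F(1),    F(0)],
    [F(0),    F(1, 2), F(1, 4), F(1, 4)],
]

def matmul(A, B):
    # exact product of two 4x4 matrices over fractions
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

P3 = matmul(P, matmul(P, P))
print(P3[0][3])  # 41/192: probability of being in state 4 after 3 steps from state 1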
Hitting Time - Definition

Definition
The hitting time of a subset A of states of a Markov chain is a random
variable H^A : S → {0, 1, 2, . . . } ∪ {∞} given by

H^A = inf {n ≥ 0 | X_n ∈ A}

where S is the sample space of the Markov chain and inf ∅ = ∞.

Recall: What does the sample space S look like?
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 19 / 29
Hitting Probability and Mean Hitting Time - Definitions

Definition
Starting from a state i, the probability of hitting A is

h_i^A = P(H^A < ∞ | X0 = i)

and the mean time taken to reach a set of states A is

k_i^A = E(H^A | X0 = i)
      = ∑_{n<∞} n · P(H^A = n | X0 = i) + ∞ · P(H^A = ∞ | X0 = i)

where 0 · ∞ = 0.

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 20 / 29
Drunkard’s walk
What is the probability of reaching state 4 and what is the mean number
of steps to reach a ditch state, i.e. states 1 or 4?

(Chain: states 1, 2, 3, 4; states 1 and 4 have self-loops with probability 1;
from 2 the walk moves to 1 or 3 with probability 1/2 each; from 3 it moves
to 2 or 4 with probability 1/2 each.)

Let h_i = P( hit 4 | X0 = i ) and k_i = E( time to hit {1, 4} | X0 = i ).

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 21 / 29
h1 = 0, h4 = 1
h2 = 1/2·h1 + 1/2·h3 and
h3 = 1/2·h2 + 1/2·h4
Hence,
h2 = 1/2·h3 = 1/2·(1/2·h2 + 1/2), i.e.
h2 = (4/3) · (1/4) = 1/3.

For the hitting time:
k1 = k4 = 0,
k2 = 1 + 1/2·k1 + 1/2·k3, and
k3 = 1 + 1/2·k2 + 1/2·k4.
Hence,
k2 = 1 + 1/2·k3 = 1 + 1/2·(1 + 1/2·k2) = 3/2 + 1/4·k2, i.e.
k2 = (4/3) · (3/2) = 2.
Hitting Probability and Hitting Time - Theorems

Theorem
The vector of hitting probabilities h^A = (h_1^A, h_2^A, . . . ) is the minimal
non-negative solution to the system of linear equations

h_i^A = 1                          for i ∈ A
h_i^A = ∑_j p_{i,j} h_j^A          for i ∉ A

Theorem
The vector of mean hitting times k^A = (k_1^A, k_2^A, . . . ) is the minimal
non-negative solution to the system of linear equations

k_i^A = 0                          for i ∈ A
k_i^A = 1 + ∑_{j∉A} p_{i,j} k_j^A  for i ∉ A
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 22 / 29
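For the drunkard’s walk, the systems are small enough to solve by substitution; an illustrative sketch with exact fractions (standard library only):

from fractions import Fraction as F

# Hitting probabilities of {4}: h1 = 0, h4 = 1,
#   h2 = 1/2*h1 + 1/2*h3  and  h3 = 1/2*h2 + 1/2*h4.
# Substituting h2 = h3/2 into the second equation gives (3/4)*h3 = 1/2:
h3 = F(1, 2) / (1 - F(1, 4))
h2 = h3 / 2
print(h2, h3)  # 1/3 2/3

# Mean hitting times of {1, 4}: k1 = k4 = 0,
#   k2 = 1 + 1/2*k3  and  k3 = 1 + 1/2*k2.
# Substituting k3 = 1 + k2/2 gives (3/4)*k2 = 3/2:
k2 = (1 + F(1, 2)) / (1 - F(1, 4))
k3 = 1 + k2 / 2
print(k2, k3)  # 2 2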
The Game - Example

• Consider two players, one has $ L1 and the other has $ L2 . Player 1
will continue to throw a fair coin, such that
– if head appears, he wins $ 1,
– if tails appears, he loses $ 1.
• Suppose the game is played until one player goes bankrupt. What is
the probability that Player 1 survives?

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 23 / 29
The Markov Chain Model

The previous game can be modelled by the following Markov chain:
(states −L1, . . . , 0, . . . , L2 recording Player 1’s current winnings; from each
inner state the chain moves one step up or down with probability 1/2;
the states −L1 and L2 are absorbing.)
V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 24 / 29
The Analysis

We can write the equations for h_j^{{L2}} (written h_j for simplicity) and solve them.

h_{−L1} = 0
h_{−L1+1} = 1/2 · h_{−L1+2} + 1/2 · h_{−L1}
h_{−L1+2} = 1/2 · h_{−L1+3} + 1/2 · h_{−L1+1}
...
h_{L2−1} = 1/2 · h_{L2−2} + 1/2 · h_{L2}
h_{L2} = 1

Let us set h_{−L1+1} = r. Due to 2r = h_{−L1+2} + 0, we have h_{−L1+2} = 2r.
Due to 2 · 2r = h_{−L1+3} + r, we have h_{−L1+3} = 3r. Etc.
Hence, h_j = (j + L1)r. As h_{L2} = 1, we have r = 1/(L1 + L2).
This leads to h_0^{{L2}} = L1/(L1 + L2).

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 25 / 29
The Analysis (Another Solution)

• Initially, the chain is at state 0.
• Let P_j^{(t)} denote the probability the chain is at state j after t steps.
• Let q be the probability the game ends with Player 1 winning $ L2.
• We can see that
  (i)   lim_{t→∞} P_j^{(t)} = 0      for j ≠ −L1, L2
  (ii)  lim_{t→∞} P_j^{(t)} = 1 − q  for j = −L1
  (iii) lim_{t→∞} P_j^{(t)} = q      for j = L2

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 26 / 29
The Analysis (Another Solution) cont.

• Now, let W_t denote the money Player 1 has won after t steps.
• Note that the expected value of W_t − W_{t−1} is zero.
• By linearity of expectation,

  E[W_t] = 0.

• On the other hand,

  E[W_t] = ∑_j j · P_j^{(t)} = 0.

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 27 / 29
The Analysis (Another Solution) cont.

• By taking limits, we have

  0 = lim_{t→∞} E[W_t]
    = lim_{t→∞} ∑_j j · P_j^{(t)}
    = (−L1)(1 − q) + 0 + 0 + · · · + 0 + (L2)q.

• Re-arranging terms, we obtain

  q = L1/(L1 + L2).

– That is, the probability of winning (or losing) is proportional to the


amount of money a player is willing to lose (or win).

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 28 / 29
Markov Chain Analysis

• Transient analysis
▶ distribution after k-steps
the k-th power of the transition matrix
▶ reaching/hitting probability
equations for hitting probabilities hi
▶ (mean) hitting time
equations for hitting times ki

• Long-run analysis [the next lecture]


▶ probability of infinite hitting
▶ mean inter-visit time
▶ long-run limit distribution
▶ stationary (invariant) distribution

V. Řehák: IV111 Probability in CS Lecture 6: Stoch. Proc. and MC October 31, 2024 29 / 29
Lectures 7:
Long-Run Analysis of Discrete-Time Markov Chains

Vojtěch Řehák
based on slides of Jan Bouda

Faculty of Informatics, Masaryk University

November 7, 2024

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 1/1
Part I

Revision

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 2/1
Definitions
Sequence of experiments that watches “values” evolving in time.
X0 . . . X1 . . . X2 . . . X3.54 . . . X5.123 . . . X5∗π . . .

Definition (Stochastic Process)


A stochastic process is a collection of random variables X = {Xt | t ∈ T }.
The index t often represents time; Xt is called the state of X at time t.

Definition
A discrete-time stochastic process {X0 , X1 , X2 , . . . } is a Markov chain if

P(Xt = a | Xt−1 = b, Xt−2 = at−2 , . . . , X0 = a0 ) = P(Xt = a | Xt−1 = b) = pb,a .

Theorem
Every (discrete-time finite-state) Markov chain can be alternatively
(uniquely) defined by an initial vector ~λ (0) and a transition matrix P.
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 3/1
Draw a picture for a MC with two states.


Discuss the probability space.
Hitting Time - Definition

Definition
The hitting time of a subset A of states of a Markov chain is a random
variable H^A : S → {0, 1, 2, . . . } ∪ {∞} given by H^A = inf {n ≥ 0 | X_n ∈ A},
where S is the sample space of the Markov chain and inf ∅ = ∞.

Definition
Starting from a state i, the probability of hitting a set of states A is

h_i^A = P(H^A < ∞ | X0 = i)

and the mean time taken to reach A is

k_i^A = E(H^A | X0 = i) = ∑_{n<∞} n · P(H^A = n) + ∞ · P(H^A = ∞)

where 0 · ∞ = 0.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 4/1
Markov Chain Analysis

• Transient analysis
  ▶ distribution after k-steps
    the k-th power of the transition matrix
  ▶ reaching/hitting probability
    equations for hitting probabilities hi
  ▶ (mean) hitting time
    equations for hitting times ki

• Long-run analysis [this lecture]
  ▶ probability of infinite hitting
  ▶ mean inter-visit time
  ▶ long-run limit distribution
  ▶ stationary (invariant) distribution

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 5/1
Part II

Long-Run Analysis

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 6/1
Markov Chain Analysis

• Transient analysis [Lecture 6]
  ▶ distribution after k-steps
  ▶ reaching/hitting probability
  ▶ (mean) hitting time

• Long-run analysis
  ▶ probability of infinite hitting
  ▶ mean inter-visit time
  ▶ long-run limit distribution
  ▶ stationary (invariant) distribution

We write i =⇒∗ j as an abbreviation for H^{j} < ∞ given X0 = i,
i.e. starting from i, we eventually visit j.
We write i =⇒+ j when starting from i, we eventually visit j in a positive
number of steps. It is useful for i =⇒+ i.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 7/1
Transient Analysis

Let us discuss distributions of Xk given X0 = 1.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 8/1
State Classification

Definition
A state of a Markov chain is said to be absorbing iff it cannot be left,
once it is entered, i.e. pi,i = 1.

Definition
A state i of a Markov chain is said to be recurrent iff, starting from state
i, the process eventually returns to state i with probability 1.

Definition
A state of a Markov chain is said to be transient (or non-recurrent) iff
there is a positive probability that the process will not return to this state.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 9/1
Infinite Hitting - Transient States

Definition (recall)
A state i of a Markov chain is said to be transient (or non-recurrent) iff
there is a positive probability that the process will not return to this state,
i.e.
P(i =⇒+ i) < 1.

Theorem
Every transient state is visited finitely many times almost surely
(i.e. with probability 1).

Proof.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 10 / 1


Let p be the probability of not returning to i from i, i.e. P(¬(i =⇒+ i)).
If i is transient, then p > 0.
The probability of finitely many visits of i equals
the prob. of exactly 1 visit + exactly 2 visits + . . .
= p + (1 − p)(p + (1 − p)(· · · )) = p(1 + (1 − p) + (1 − p)^2 + · · · ).
The geometric series in the brackets equals 1/(1 − (1 − p)) = 1/p.
Hence, the probability of finitely many visits = 1.

Czech proverb:
The pitcher goes so often to the well, that it is broken at last.
Infinite Hitting - Recurrent States

Definition (recall)
A state i of a Markov chain is said to be recurrent iff, starting from state
i, the process eventually returns to state i with probability 1, i.e.

P(i =⇒+ i) = 1.

Theorem
In a finite-state Markov chain, each recurrent state is almost surely either
not visited or visited infinitely many times.

Proof.
If it is visited, then it is revisited with probability one. Hence, in an infinite
run, it is visited infinitely many times with probability one.
Otherwise, it is not visited.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 11 / 1


Transient vs. Recurrent States

Which states are transient? Which states are recurrent?


Idea
Decompose the graph representation into strongly connected components
(a strongly connected component is a maximal set of states with a path
between each pair of states in both directions).

Theorem
In a finite-state Markov chain, a state is recurrent if and only if it is in a
bottom strongly connected component of the Markov chain graph
representation. All other states are transient.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 12 / 1


Irreducible Markov Chain

Idea
For the sake of infinite behaviour, we will concentrate on bottom strongly
connected components only.

Definition
A Markov chain is said to be irreducible if every state can be reached from
every other state in a finite number of steps, i.e. P(i =⇒+ j) > 0 for all i, j.

Theorem
A Markov chain is irreducible if and only if its graph representation is a
single strongly connected component.

Corollary
All states of a finite-state irreducible Markov chain are recurrent.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 13 / 1


Stationary (Invariant) Distribution

Consider the following Markov chain and a distribution


~λ (t) = (0.2, 0.4, 0.4) on its states.

(Chain: states 1, 2, 3; transitions 1 → 2 and 2 → 3 with probability 1,
and 3 → 1, 3 → 2 with probability 1/2 each.)

Question: In this case, what will be the subsequent distribution ~λ (t + 1)?

We can see that this is an “equilibrium” distribution.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 14 / 1


                  ⎡  0    1    0 ⎤
(0.2, 0.4, 0.4) · ⎢  0    0    1 ⎥ = (0.2, 0.4, 0.4)
                  ⎣ 1/2  1/2   0 ⎦
Stationary (Invariant) Distribution

Definition
Let P be the transition matrix of a Markov chain and ~λ be a probability
distribution on its states. If
~λ P = ~λ

then ~λ is a stationary (or steady-state or invariant or equilibrium)


distribution of the Markov chain.

Question:
How many stationary distributions can a Markov chain have?
Can it be more than one?
Can it be none?

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 15 / 1


Stationary (Invariant) Distributions

Answer: There can be more than one stationary distribution.

For example, in the Drunkard’s walk
(states 1, 2, 3, 4; states 1 and 4 absorbing; from 2 and 3 the walk moves
left or right with probability 1/2 each)

both (1, 0, 0, 0) and (0, 0, 0, 1) are stationary distributions.

But, this is not an irreducible Markov chain.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 16 / 1


Stationary (Invariant) Distributions

Theorem
If a finite-state Markov chain is irreducible then there is a unique
stationary distribution.

Q: Can it be none?
Theorem
For each finite-state Markov chain, there is a stationary distribution.

How can we compute the stationary distribution of a finite-state


irreducible Markov chain?

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 17 / 1


Stationary (Invariant) Distribution & Cut-sets

Let us have the following Markov chain with its transition matrix.

(Chain: two states s1, s2; s1 → s2 with probability p, s2 → s1 with
probability q, self-loops with probabilities 1 − p and 1 − q.)

        ⎡ 1−p    p  ⎤
    P = ⎣  q    1−q ⎦

Solving ~π P = ~π yields the following system of equations

π1(1 − p) + π2 q = π1
π1 p + π2(1 − q) = π2
π1 + π2 = 1

For these equations, we find the second redundant. The solution is

π1 = q/(p + q) and π2 = p/(p + q).

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 18 / 1


Stationary (Invariant) Distribution & Cut-sets

Theorem
Let P be a transition matrix of a finite-state Markov chain and
~π = (π1, π2, . . . , πn) be a stationary distribution corresponding to P.
For any state i of the Markov chain, we have

∑_{j≠i} πj P_{j,i} = ∑_{j≠i} πi P_{i,j}.

That is, in the stationary distribution ~π = (π1, π2, . . . , πn), the probability
that a chain enters a state i equals the probability that it leaves the state i.

Proof. (see below)
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 19 / 1


∑_{j≠i} πj P_{j,i} = ∑_{j≠i} πi P_{i,j}

πi P_{i,i} + ∑_{j≠i} πj P_{j,i} = ∑_{j≠i} πi P_{i,j} + πi P_{i,i}      (adding self-loops)

∑_j πj P_{j,i} = ∑_j πi P_{i,j} = πi ∑_j P_{i,j} = πi

i.e. ~π P = ~π.
Stationary Distribution & Cut-sets
Another solution (using a cut-set formula):
Let us have a subset of states, e.g. {s1}.

(Chain: two states s1, s2; s1 → s2 with probability p, s2 → s1 with
probability q, self-loops with probabilities 1 − p and 1 − q.)

In the stationary distribution, the probability of leaving {s1} must equal
the probability of entering {s1} (i.e. the net probability flow crossing the
cut is zero).
Hence
π1 p = π2 q.
Using π1 + π2 = 1 yields

π1 = q/(p + q) and π2 = p/(p + q).

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 20 / 1


Stationary Distribution & Cut-sets
We can summarize this result in the following:
Theorem
Let P be a transition matrix of a finite-state Markov chain. A distribution
~π = (π1, π2, . . . , πn) is a stationary distribution corresponding to P if and
only if for every subset A of states of the Markov chain it holds that

∑_{i∈A, j∉A} πi P_{i,j} = ∑_{i∈A, j∉A} πj P_{j,i}.

Proof.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 21 / 1


“if”: Taking A = {i} for a state i, the cut-set equation reads

∑_{j∉{i}} πi P_{i,j} = ∑_{j∉{i}} πj P_{j,i}

πi P_{i,i} + ∑_{j≠i} πi P_{i,j} = πi P_{i,i} + ∑_{j≠i} πj P_{j,i}      (adding self-loops)

πi = ∑_j πj P_{j,i}

As this holds for every state i, the distribution is stationary.

“only if”: If the distribution is stationary, the equation holds for singletons
(due to the previous theorem); hence, it also holds for larger sets.
Example - Finite Buffer Model

Example
Consider the following irreducible Markov chain model of a finite buffer.
Find a stationary distribution for its states.

(Chain: states s0, s1, s2; s0 → s1 and s1 → s2 with probability p,
s1 → s0 and s2 → s1 with probability q, self-loops with probabilities
1 − p, 1 − p − q, and 1 − q, respectively.)
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 22 / 1


Example - Finite Buffer Model

Example
Consider the following irreducible Markov chain model of a finite buffer.
Find a stationary distribution for its states.

[diagram: s0 →1/3 s1 →1/3 s2, s2 →1/2 s1 →1/2 s0,
self-loops with probabilities 2/3, 1/6, 1/2]

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 23 / 1


What are the states with the highest and the lowest probability in the
stationary distribution?
Try to identify structural features of the graph that increase a state's
stationary probability.
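(Not on the slides: a Python sketch checking the answer by power iteration on
the concrete buffer chain; the cut-set equations π0/3 = π1/2 and π1/3 = π2/2
give π = (9/19, 6/19, 4/19).)

P = [[2/3, 1/3, 0.0],
     [1/2, 1/6, 1/3],
     [0.0, 1/2, 1/2]]

pi = [1.0, 0.0, 0.0]                  # any initial distribution works here
for _ in range(1000):                 # power iteration: pi <- pi P
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

print([round(x, 6) for x in pi])      # ~ [0.473684, 0.315789, 0.210526]
print([round(x, 6) for x in (9/19, 6/19, 4/19)])   # exact cut-set answer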
Google PageRank

MC stationary distribution identifies the importance of website pages.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 24 / 1


It is not a problem to find all web pages containing a given keyword; the problem
is to identify which of them are the most important. Here comes the story of a
random walk: imagine a user who surfs the web by randomly clicking links that
occur on the pages. The higher the number of visits, the higher the rank of the
page. This corresponds to the invariant distribution when we view the pages as
states of a MC and assign the uniform distribution to the outgoing links. To make
the MC irreducible, there is a concept of restart: with some small probability
(e.g. 1.6 %), the surfer jumps to a uniformly chosen page.
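(A minimal PageRank sketch along these lines; the link graph and the restart
probability 0.15 below are illustrative assumptions, not values from the lecture.)

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # page -> outgoing links
n, restart = 4, 0.15                          # illustrative restart probability

rank = [1.0 / n] * n
for _ in range(100):
    new = [restart / n] * n                   # uniform restart
    for page, outs in links.items():
        for target in outs:                   # uniform choice among the links
            new[target] += (1 - restart) * rank[page] / len(outs)
    rank = new

print([round(r, 4) for r in rank])            # higher rank = more important page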
Mean Portion of Visited States and Inter Visit Time
Theorem (Expected long-run frequency)
Let us have a finite-state irreducible Markov chain and the unique
stationary distribution ~π . It holds that

πi = limn→∞ E (number of visits of state i during the first n steps) / n .

Theorem (Mean inter-visit time)


Let us have a finite-state irreducible Markov chain and the unique
stationary distribution ~π . It holds that

πi = 1/mi

where
mi = E (number of steps of i =⇒+ i)
is the mean inter-visit time of state i (or expected return time to i).
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 25 / 1
Be careful: mi ≠ ki^{{i}}.
mi counts the steps of i =⇒+ i (at least one step is taken),
whereas ki^A counts the steps of i =⇒∗ A; hence ki^{{i}} = 0, as i ∈ {i}.
Ergodic Theorem

And now, something even stronger. . .

Theorem (Ergodic Theorem)


Let us have a finite-state irreducible Markov chain and its unique
stationary distribution ~π . It holds for every state i that
 
P ( πi = limn→∞ (number of visits of state i during the first n steps) / n ) = 1

and also

P (mi = the average distance between two visits of i) = 1.

I.e., frequencies of visits of individual states equal to the stationary


distribution almost surely.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 26 / 1


This theorem is proven using the strong law of large numbers.

The ergodic theorem can be rephrased as follows: it can indeed happen that you
spend the whole infinite run in a transient part of a MC, or visit only one single
state with a self-loop, but all of these abnormal cases together occur with
probability zero. On the contrary, with probability one, the infinite run you
observe has frequencies of visits of the particular states exactly corresponding
to the stationary distribution of the MC.
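(A simulation sketch of this statement; the two-state chain with p = 0.3 and
q = 0.2 is an illustrative choice, its stationary distribution being (0.4, 0.6).)

import random

random.seed(0)
p, q = 0.3, 0.2
state, visits, n = 0, [0, 0], 100_000
for _ in range(n):
    visits[state] += 1
    if state == 0:
        state = 1 if random.random() < p else 0
    else:
        state = 0 if random.random() < q else 1

print([v / n for v in visits])    # close to (q/(p+q), p/(p+q)) = (0.4, 0.6)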
Aperiodic Markov Chains

For example: [diagrams of an aperiodic and a periodic chain]
Definition
A state j in a Markov chain is periodic if there exists an integer ∆ > 1
such that P(Xt+s = j | Xt = j) = 0 unless s is divisible by ∆.
A Markov chain is periodic if any state in the chain is periodic.
A state or chain that is not periodic is aperiodic.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 27 / 1


Aperiodic Markov Chains - Convergence To Equilibrium

Theorem (Limit of transient distributions)


Let us have a finite-state aperiodic irreducible Markov chain and the
unique stationary distribution ~π . It holds that

~π = limn→∞~λ P n

where ~λ is an arbitrary distribution on states.

Note that limn→∞~λ P n does not depend on ~λ . It is caused by the fact that
for finite-state aperiodic irreducible Markov chains, P n converges, as n → ∞,
to a matrix with equal rows.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 28 / 1


Finite-state Markov Chains - Overview

For finite-state Markov chains:


• Vector of expected long-run frequencies is a stationary distribution.
• If it is irreducible:
• All states are recurrent.
• There is a unique stationary distribution.
• The stationary distribution is 1/(mean inter-visit time).
• The stationary distribution equals the frequency of visits
on an infinite run almost surely.
• If it is aperiodic:
− ~π is a limit of transient distributions.

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 29 / 1


Markov Chain Analysis
• Transient analysis [the previous lecture]
I distribution after k-steps
the k-th power of the transition matrix
I reaching/hitting probability
equations for hitting probabilities hi
I (mean) hitting time
equations for hitting times ki

• Long-run analysis [this lecture]


I probability of infinite hitting
transient vs. recurrent states (BSCC)
I mean inter-visit time, frequency of visits
equations, cut-sets, matrix power, simulation, . . .
I long-run limit distribution
equations, cut-sets, matrix power, simulation, . . .
I stationary (invariant) distribution
equations, cut-sets, matrix power, simulation, . . .
V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 30 / 1
Part III

Infinite-State Markov Chains

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 31 / 1


Infinite-State Markov Chains

Example
Classify the following infinite-state Markov chains for different values of p.
[diagrams: (a) a walk on 0, 1, 2, . . . with i → i+1 with probability p,
i → i−1 with probability 1−p, and a 1−p self-loop at 0;
(b) a walk on all integers with i → i+1 with probability p and
i → i−1 with probability 1−p]

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 32 / 1


Infinite-State Markov Chains
It is no longer true that each Markov chain has a stationary distribution.

[diagram: states 0, 1, 2, . . . with i → i+1 with probability 1]

It is no longer true that all states are recurrent in irreducible Markov chain.

[diagram: a walk on 0, 1, 2, . . . with i → i+1 with probability 2/3,
i → i−1 with probability 1/3, and a 1/3 self-loop at 0]

A state can be recurrent and the mean inter-visit time can be infinite.
[diagram: a walk on all integers with i → i+1 and i → i−1, each with
probability 1/2]

V. Řehák: IV111 Probability in CS Lecture 7: Long-Run Analysis of DTMC November 7, 2024 33 / 1


Lecture 8: Continuous-Time Processes

Vojtěch Řehák

Faculty of Informatics, Masaryk University

November 14, 2024

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 1 / 30


Part I

Motivation and Memoryless Distributions

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 2 / 30


Discrete-Time vs. Continuous-Time Processes

Recall, a stochastic process is a collection of random variables Xt for t ∈ T .


I.e., “Xt = a” stands for “at time t the value is a”.

• In discrete time

X0 , X1 , X2 , X3 , X4 , X5 , . . .
I distribution on where we go in the next step

• In continuous time

X0 . . . X0.5 . . . X2 . . . X3.54 . . . X5.123 . . . X5∗π . . .


I distribution on when we do the next step
I distribution on where we go in the next step

For continuous time, we need a continuous distribution on when.


V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 3 / 30
Event-Driven Systems and Markovian Property

Event-Driven System
• we are staying in a state and waiting for events
• the system changes its state when the first event occurs
• the subsequent state depends on the event that occurs first

[diagram: states idle, printing, and paper jam; events: newjob (idle → printing),
done (printing → idle), pjam (printing → paper jam), repaired (paper jam → idle)]

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 4 / 30


Event-Driven Systems and Markovian Property

Markovian Property
• We would like to have a Markovian model - where the subsequent
behavior depends on the current state only.
I It depends neither on where we were going through nor when we did
the (last) step(s).
I It also does not depend on the time we have been waiting for the
next step.
• We are looking for a continuous-time variant of the geometric
distribution.

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 5 / 30


(Memoryless =) Exponential Distribution
• one parameter λ (called rate); expected value 1/λ
• probability density function (PDF), for t ≥ 0:
f (t) = λ · e^{−λt}
• cumulative distribution function (CDF), for t ≥ 0:
F (t) = P(X ≤ t) = ∫_0^t λ · e^{−λx} dx = 1 − e^{−λt}

[plots: the density f (t) starts at λ and decays exponentially to 0;
the CDF F (t) rises from 0 towards 1]

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 6 / 30
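(Not on the slides: an empirical sketch of these formulas using inverse-transform
sampling; λ = 2.0 and t = 0.5 are illustrative choices.)

import math, random

random.seed(0)
lam, n, t = 2.0, 200_000, 0.5
samples = [-math.log(1 - random.random()) / lam for _ in range(n)]

print(sum(samples) / n)                  # ~ 1/lam = 0.5
print(sum(s <= t for s in samples) / n)  # empirical P(X <= t)
print(1 - math.exp(-lam * t))            # F(t) = 1 - e^(-lam t) ~ 0.632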




Memoryless Property of Exponential Distribution

Theorem
For an exponentially distributed random variable X and every t, t0 ≥ 0,
it holds that
P(X > t0 + t | X > t0 ) = P(X > t).

Proof.

P(X > t0 + t | X > t0 ) = P(X > t0 + t ∧ X > t0 ) / P(X > t0 )
                        = P(X > t0 + t) / P(X > t0 )
                        = e^{−λ(t0 +t)} / e^{−λ t0 }
                        = (e^{−λ t0 } · e^{−λ t }) / e^{−λ t0 }
                        = e^{−λ t } = P(X > t).

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 7 / 30
Part II

Continuous-Time Markov Chain

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 8 / 30


Continuous-Time Markov Chain (CTMC)

Definition
Continuous-Time Markov Chain (CTMC) is an event-driven system
with exponentially distributed events.

In each state, we are waiting for exponentially distributed events


(with arbitrary rates).

We can have some simple questions:


• What is the mean waiting time in a state?
• What is the probability that a particular event wins in a state?
• How does the waiting time depend on the winning event?
• How does the winning event depend on the waiting time?

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 9 / 30


Properties of Exponential Distribution
• waiting time: for exponentially distributed W1 , W2 with rates λ1 , λ2 ;
min(W1 , W2 ) is exponentially distributed with rate λ1 + λ2 .

• mean waiting time: 1/λ


for an exponentially distributed event with rate λ

• probability that W1 wins: P(W1 < W2 ) = λ1 /(λ1 + λ2 )


for exponential distributions W1 , W2 with rates λ1 , λ2 .

• probability that W1 wins for a given winning time:


∀t . P(W1 < W2 |min(W1 , W2 ) > t) = λ1 /(λ1 + λ2 )
for exponential distributions W1 , W2 with rates λ1 , λ2 .
V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 10 / 30
Derivations of the properties above:

waiting time distribution:
P(min(W1 , W2 ) > t) = P(W1 > t ∧ W2 > t) = P(W1 > t) · P(W2 > t) =
= e^{−λ1 t} · e^{−λ2 t} = e^{−(λ1 +λ2 )t}

mean waiting time (integration by parts with f = t, g = e^{−λt},
so f ′ = 1 and g ′ = −λ · e^{−λt}):
E (W ) = ∫_0^∞ t · λ · e^{−λt} dt = ∫_0^∞ −f · g ′ = ∫_0^∞ f ′ · g − ∫_0^∞ (f · g )′ =
= −(1/λ) · [e^{−λt}]_0^∞ − [t · e^{−λt}]_0^∞ = −(1/λ) · (0 − 1) − (0 − 0) = 1/λ

probability that W1 wins:
P(W1 < W2 ) = ∫_0^∞ λ1 e^{−λ1 t} · P(W2 > t) dt = ∫_0^∞ λ1 e^{−λ1 t} · e^{−λ2 t} dt =
= λ1 · ∫_0^∞ e^{−(λ1 +λ2 )t} dt = (λ1 /(λ1 +λ2 )) · ∫_0^∞ (λ1 +λ2 ) · e^{−(λ1 +λ2 )t} dt =
= λ1 /(λ1 +λ2 )

probability that W1 wins for a given winning time:
P(W1 < W2 | min(W1 , W2 ) > k) = P(W1 − k < W2 − k | W1 > k ∧ W2 > k) =
= P(W1 < W2 ), thanks to the memoryless property
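(A Monte Carlo sketch of the first and third property; the rates λ1 = 3 and
λ2 = 1 are illustrative.)

import math, random

random.seed(1)
l1, l2, n = 3.0, 1.0, 200_000

def exp_sample(lam):
    return -math.log(1 - random.random()) / lam

wins, mins = 0, 0.0
for _ in range(n):
    w1, w2 = exp_sample(l1), exp_sample(l2)
    wins += w1 < w2
    mins += min(w1, w2)

print(wins / n, l1 / (l1 + l2))    # ~ 0.75: P(W1 < W2)
print(mins / n, 1 / (l1 + l2))     # ~ 0.25: mean of min(W1, W2)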
Part III

Queues as Examples of CTMC

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 11 / 30


Kendall Notation of Queue Types
A/S/n/B/K queue where
A - inter-arrival time distribution (G - general, M - exponential, D -
deterministic. . . )
S - service time distribution (G - general, M - exponential, D -
deterministic. . . )
n - number of servers (1, 2, . . . , ∞ )
B - buffer size (1, 2, . . . , ∞), the default value is ∞
K - population size (1, 2, . . . , ∞), the default value is ∞

Draw, e.g. M/M/1, M/M/2, M/M/1/4, M/M/5/5.


[diagrams: birth-death chains of the listed queues; arrivals with rate λ1 move
right, services move left with rate λ2 scaled by the number of busy servers
(e.g. 2λ2 for M/M/2, up to 5λ2 for M/M/5/5)]

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 12 / 30


M/M/1 Queue - Example

• unbounded queue with one service unit (server)


• time to new request is exponentially distributed (with rate λ1 )
• time to finish service routine is exponentially distributed (with rate λ2 )

[diagram: s0 → s1 → s2 → s3 → · · · with rate λ1 ; si+1 → si with rate λ2 ]

Important difficult questions are:


• What is the distribution on states at a given time?
• Does the distribution converge to some stationary distribution?
• What is its continuous (time) frequency, fraction of time spent?
• What is its utilization (server usage) and mean queue length?

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 13 / 30


Part IV

CTMC Analysis

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 14 / 30


CTMC Formally
Let λ (t) be a distribution on the states at time t.
I.e., for every state i, P(Xt = i) = λi (t) where λ (t) = (λ1 (t), λ2 (t), . . . ).

Definition
CTMC is defined by (an initial distribution λ (0) and) a rate matrix Q
where Q[i, j] is the rate of an event leading from state i to state j.
Q = ( 0 3 )
    ( 2 0 )        [diagram: s0 → s1 with rate 3, s1 → s0 with rate 2]

Definition
A distribution π on states of a CTMC is called stationary iff
π = λ (t), for every time t ≥ 0, when starting the CTMC in π = λ (0).
V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 15 / 30
Important Questions - Defined Formally

The important questions for a given CTMC with λ (0) and Q are:
• What is the distribution on states at a given time?
I It is to compute λ (t), for a given time t ≥ 0.

• Does the distribution converge to a stationary distribution?


I Is limt→∞ λ (t) = π where π is a stationary distribution?

• What is the continuous (time) frequency, i.e. fraction of time spent?


I Fs = limt→∞ Ts (t)/t where Ts (t) is time spent in s from start to time t.

• What is the queue-server utilization and the mean queue length?


I The utilization is the probability that the server is in use, i.e. 1 − F0 , as
the server is idle only in state s0 .
The mean queue length is defined by ∑s s · Fs .

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 16 / 30


Dual View on CTMC

Idea
As the winning event does not depend on the winning time and vice versa,
we can separately choose the winner and the winning time.

[diagram: a state s with three competing events with rates λ1 , λ2 , λ3 is
equivalent to s with a single exit rate λ followed by a probabilistic choice
with probabilities p1 , p2 , p3 ]

where λ = λ1 + λ2 + λ3 and pi = λi /λ .

Definition
The rate λ = λ1 + λ2 + λ3 is called exit rate of the state s
and the probabilities p1 , p2 , p3 are called exit probabilities.
V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 17 / 30
Solution Technique 1 - Discretization

Discretization techniques
• transform the CTMC to a DTMC (e.g. based on exit probabilities)
• analyse the CTMC using the properties of the DTMC
• are useful for long-run average (steady-state) properties
1. discrete (time-abstract) frequency
I do not care about time; a move in CTMC is a step in the DTMC
I computed as the invariant distribution of the underlying DTMC
2. continuous (time) frequency
I discrete frequency weighted by the mean exit times
3. utilization and mean length of a queue
I occupancy of the queue multiplied by the (probability of) the
continuous (time) frequency

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 18 / 30


Example - Discretization
Example
What is the utilization (and the mean queue length) of an M/M/2/2 queue
with new-request rate 5 and service rate 7?

[diagrams: CTMC s0 →5 s1 →5 s2 , s1 →7 s0 , s2 →14 s1 ;
DTMC s0 →1 s1 ; s1 →5/12 s2 , s1 →7/12 s0 ; s2 →1 s1 ]

The time-abstract frequency is computed as the stationary distribution


for the DTMC. Doing so, we obtain (7/24, 1/2, 5/24), i.e. 1/2 of the
moves lead to state s1 , 7/24 to s0 , and 5/24 to s2 (almost surely).
We wait for a different time in every state.
The exit rates are 5, 5 + 7 = 12, and 14.
So, the mean waiting times to leave the state are 1/5, 1/12, and 1/14.
(If the rates are in Hz, the waiting times are in seconds.)
V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 19 / 30
Example - Discretization

Example (cont.)
What is the utilization (and the mean queue length) of an M/M/2/2 queue
with new-request rate 5 and service rate 7?

[diagrams: the same CTMC and DTMC as above]

The time-abstract frequency is (7/24, 1/2, 5/24).


The mean waiting times to leave the state are 1/5, 1/12, and 1/14.
Now, we need to express the proportions of time spent in the states.
The ratio is 7/24 · 1/5 : 1/2 · 1/12 : 5/24 · 1/14. To obtain a distribution
(summing up to 1), we divide all values by their sum 193/1680, and obtain
the time frequency (98/193, 70/193, 25/193).

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 20 / 30


Example - Discretization

Example (cont.)

[diagrams: the same CTMC and DTMC as above]

The time frequency is (98/193, 70/193, 25/193).


Recall that the state s0 represents an empty queue.
There is one request in the queue in state s1 .
And in state s2 , there are two requests queued.
Hence, the utilization is 1 − 98/193 ≈ 0.49,
i.e. the server is running in less than half of the time on average.
And the mean queue length is
0 · 98/193 + 1 · 70/193 + 2 · 25/193 = 120/193 ≈ 0.62 requests queued.

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 21 / 30
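(The whole computation fits in a few lines; a sketch reproducing the numbers
above with exact rational arithmetic.)

from fractions import Fraction as F

freq = [F(7, 24), F(1, 2), F(5, 24)]    # DTMC stationary distribution
wait = [F(1, 5), F(1, 12), F(1, 14)]    # mean waiting times 1/(exit rate)

weighted = [f * w for f, w in zip(freq, wait)]
time_freq = [x / sum(weighted) for x in weighted]

print(time_freq)                                     # [98/193, 70/193, 25/193]
print(float(1 - time_freq[0]))                       # utilization ~ 0.492
print(sum(i * f for i, f in enumerate(time_freq)))   # mean queue length 120/193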


Uniformization Technique

Idea
Thanks to the memoryless property:
[diagram: a state s with exit rates λ1 , λ2 , λ3 is equivalent to the same state
with an additional self-loop of rate λ , for any λ ]
Adding a self-loop does not change the behavior of a CTMC!

• Every CTMC can be transformed into one with uniform exit rates.
(Find the state with the highest exit rate and add appropriate
self-loops to all other states.)
• Uniform exit rates allow for easier discretization and enable
analysis of the distribution on states at a given time.
V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 22 / 30
Example - Uniformization
Example
What is the utilization (and the mean queue length) of an M/M/2/2 queue
with new-request rate 5 and service rate 7?

[diagrams: the original CTMC s0 →5 s1 →5 s2 , s1 →7 s0 , s2 →14 s1 ;
its uniformized CTMC adds self-loops with rates 9 at s0 and 2 at s1
(uniform exit rate 14); the resulting DTMC: s0 : self-loop 9/14, 5/14 to s1 ;
s1 : 7/14 to s0 , self-loop 2/14, 5/14 to s2 ; s2 : 1 to s1 ]
It can be easily checked that (98/193, 70/193, 25/193) is the stationary
distribution for the DTMC. Thanks to uniform exit rates, the rebalance by
waiting times is not needed. It is stationary also for the (uniformized) CTMC.
V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 23 / 30
Important Questions (recall)

For a given CTMC with λ (0) and Q:


• What is the distribution on states at a given time?
I TODO

• Does the distribution converge to a stationary distribution?


I TODO

• What is the continuous (time) frequency, i.e. fraction of time spent?


I Invariant distribution for DTMC rebalanced using waiting times.
I For uniformized CTMC, it is the invariant distribution of DTMC.

• What is the queue-server utilization and the mean queue length?


I Using stationary distribution π of the CTMC, the utilization is 1 − π0
and the mean queue length is defined by ∑s s · πs .

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 24 / 30


Distribution on States at a Given Time
We decompose the solution into disjoint cases depending on the number
of steps we did.

P(Xt = s) = ∑_{i=0}^{∞} P(Xt = s | [i steps in time t]) · P([i steps in time t])

where P(Xt = s | [i steps in time t]) is the distribution on states after i


steps (easily computable using the underlying DTMC).
For uniformized CTMC, the inter-step times are i.i.d. random variables,
and so P([i steps in time t]) can be computed using Poisson distribution.
Poisson distribution (λ , t)
• number of exponentially distributed arrivals with rate λ in time t
• pmf(i): P([i steps in time t]) = ((λt)^i / i!) · e^{−λt}
• CDF(i): P([≤ i steps in time t]) = e^{−λt} · ∑_{j=0}^{⌊i⌋} (λt)^j / j!
• its mean value is λ t and its variance is also λ t
V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 25 / 30
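(A sketch of this computation for the uniformized M/M/2/2 chain from the
previous example: λ(t) as a Poisson-weighted sum of DTMC step distributions;
t = 0.1 and the initial state s0 are illustrative choices.)

import math

q, t = 14.0, 0.1                      # uniform exit rate; illustrative time
P = [[9/14, 5/14, 0.0],               # DTMC of the uniformized chain
     [7/14, 2/14, 5/14],
     [0.0, 1.0, 0.0]]

step = [1.0, 0.0, 0.0]                # lambda(0): start in s0
lam_t = [0.0, 0.0, 0.0]
for i in range(60):                   # truncate the Poisson sum
    w = math.exp(-q * t) * (q * t) ** i / math.factorial(i)
    lam_t = [a + w * b for a, b in zip(lam_t, step)]
    step = [sum(step[k] * P[k][j] for k in range(3)) for j in range(3)]

print([round(x, 5) for x in lam_t])   # distribution on states at time t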
Important Questions - Summary
For a given CTMC with λ (0) and Q:
• What is the distribution on states at a given time?
I For uniformized CTMC, we use DTMC matrix and the Poisson
distribution to approximate it.

• Does the distribution converge to a stationary distribution?


I Yes, for a finite-state CTMC there is always a stationary
distribution π and the distribution λ (t) converges to π.
(Unlike in discrete time, there is no problem with periodicity.)

• What is the continuous (time) frequency, i.e. fraction of time spent?


I Invariant distribution for DTMC rebalanced using waiting times.
I For uniformized CTMC, it is the invariant distribution of DTMC.

• What is the queue-server utilization and the mean queue length?


I Using stationary distribution π of the CTMC, the utilization is 1 − π0
and the mean queue length is defined by ∑s s · πs .
Note the mean queue length is equal to limt→∞ E (Xt ).

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 26 / 30


Part V

Tools and Non-Markovian Models

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 27 / 30


Tool support

PRISM - http://www.prismmodelchecker.org/
CADP - http://www.inrialpes.fr/vasy/cadp/
MRMC - http://www.mrmc-tool.org
Modest Toolset - http://www.modestchecker.net/
Storm - http://www.stormchecker.org/
PROPhESY - https://moves.rwth-aachen.de/research/tools/prophesy/
MATLAB extension - http://www.mathworks.com/products/matlab/

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 28 / 30


Non-Markovian Models

Semi-Markov Processes (SMP)


• the events are arbitrarily distributed
• the first coming event restarts all other events

Generalized Semi-Markov Processes (GSMP)


• the events are arbitrarily distributed
• the first coming event does not have to restart all other events;
full concurrency of event generations

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 29 / 30


Phase-Type Approximation (PH)

Every non-exponential distribution can be approximated by a CTMC with


a dedicated absorbing state.
[figure: a NON-exponential distribution approximated by a small CTMC with
rates such as 2, 4.05, 15.6, 15.8, and 0.01, leading to a dedicated absorbing
state]

Theorem ([Neuts 81])


Any distribution on [0, ∞) can be approximated arbitrarily close by a
phase-type approximation.

If there is more than a single action in a state, we can do a parallel
composition of their phase-type models.

V. Řehák: IV111 Probability in CS Lecture 8: Continuous-Time Processes November 14, 2024 30 / 30


Lecture 9: Information Theory

Vojtěch Řehák
based on slides of Jan Bouda

Faculty of Informatics, Masaryk University

November 21, 2024

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 1 / 36


Part I

Uncertainty and Entropy

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 2 / 36


Motivation

We would like to measure information. But what is information?

Let us return to our examples with random experiments.
• uncertainty about an output of an experiment
I temperature in a fridge and outside
I president elections in Belarus
I ice hockey match Canada vs. Japan
I cricket match Czech Rep. vs. India
• imagine that we send the result of our experiment
• data transmission and compression
I losing quality when reducing video size for a low-speed connection
I Morse code

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 3 / 36


Uncertainty

• Given a random experiment it is natural to ask how uncertain we


are about an outcome of the experiment.
• E.g., tossing a fair coin vs. throwing a fair six-sided dice.
The first experiment attains two outcomes and the second experiment
has six possible outcomes. Both experiments have the uniform
probability distribution. Our intuition says that we are more uncertain
about an outcome of the second experiment.
• E.g., tossing a fair coin vs. a randomly generated bit.
Intuitively, we should expect that the uncertainty about an outcome
of each of these experiments is the same. Hence, the uncertainty
should be based only on the probability distribution and not on the
concrete sample space.
• Therefore, the uncertainty about a particular random experiment can
be specified as a function of the probability distribution
{p1 , p2 , . . . , pn } and we will denote it as H(p1 , p2 , . . . , pn ).
V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 4 / 36
Uncertainty - Requirements

1. Let us fix the number of outcomes of an experiment and compare the


uncertainty of different probability distributions. Natural requirement
is that the most uncertain is the experiment with the uniform
probability distribution, i.e. H(p1 , . . . pn ) is maximal for
p1 = · · · = pn = 1/n.
2. Permutation of probability distribution does not change the
uncertainty, i.e. for any permutation π : {1 . . . n} → {1 . . . n} it holds
that H(p1 , p2 , . . . , pn ) = H(pπ(1) , pπ(2) . . . , pπ(n) ).
3. Uncertainty should be nonnegative and equals to zero if and only if
we are sure about the outcome of the experiment.
H(p1 , p2 , . . . , pn ) ≥ 0 and it is equal if and only if pi = 1 for some i.
4. If we include into an experiment an outcome with zero probability,
this does not change our uncertainty, i.e.
H(p1 , . . . , pn , 0) = H(p1 , . . . , pn )

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 5 / 36


Uncertainty - Requirements

5. As justified before, having the uniform probability distribution on n


outcomes cannot be more uncertain than having the uniform
probability distribution on n + 1 outcomes, i.e.

H(1/n, . . . , 1/n) ≤ H(1/(n + 1), . . . , 1/(n + 1))    (n arguments on the left, n + 1 on the right).

6. H(p1 , . . . , pn ) is a continuous function of its parameters.


7. Uncertainty of an experiment consisting of a simultaneous throw of
m and n sided die is as uncertain as an independent throw of m and
n sided die implying
H(1/(mn), . . . , 1/(mn)) = H(1/m, . . . , 1/m) + H(1/n, . . . , 1/n)    (with mn, m, and n arguments, respectively).

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 6 / 36


Entropy and Uncertainty

8. Let us consider a random choice of one of m + n balls, m being red
and n being blue. Let p = ∑_{i=1}^{m} pi be the probability that a red ball is
chosen and q = ∑_{i=m+1}^{m+n} pi be the probability that a blue one is chosen.
Then the uncertainty which ball is chosen is the uncertainty whether
a red or a blue ball is chosen plus the weighted uncertainty that a
particular ball is chosen given that a red/blue ball was chosen. Formally,

H(p1 , . . . , pm , pm+1 , . . . , pm+n ) =
    = H(p, q) + p · H(p1 /p, . . . , pm /p) + q · H(pm+1 /q, . . . , pm+n /q).

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 7 / 36


Shannon Entropy
It can be shown that any function satisfying Axioms 1 − 8 is of the form
H(p1 , . . . , pm ) = − ∑_{i=1}^{m} pi loga pi = −(loga 2) ∑_{i=1}^{m} pi log2 pi .

I.e., the function is defined uniquely up to multiplication by a constant,


which effectively changes only the base of the logarithm. The base
specifies units in which we measure the entropy (for bits, we use log2 ).
Hence, in what follows, we use a logarithm without an explicit base.

Definition
Let X be a random variable with a probability distribution
p(x) = P(X = x). Then the (Shannon) entropy of the random variable
X is defined as
H(X ) = − ∑_{x∈Im(X )} p(x) log p(x).

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 8 / 36


Entropy as Expectation
Let φ : R → R be a function. Let us recall that the expectation of the
transformed random variable is E [φ (X )] = ∑x∈Im(X ) φ (x)P(X = x).

Lemma
Let X be a random variable with a probability distribution
p(x) = P(X = x). Then

H(X ) = −Ep [log p(X )] .

Proof.

H(X ) = − ∑_{x∈Im(X )} p(x) · log p(x) = −Ep [log p(X )] .

Note that log p(X ) and log (1/p(X )) are used here as random variables.

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 9 / 36
Entropy

Here, we prove that entropy is always non-negative.


Lemma
Let X be a random variable with a probability distribution
p(x) = P(X = x). Then
H(X ) ≥ 0.

Proof.
As p(x) is a probability distribution, we know that p(x) ≤ 1 for all x.
Then log p(X ) ≤ 0, and so Ep [log p(X )] ≤ 0. Hence,

H(X ) = −Ep [log p(X )] ≥ 0.

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 10 / 36
Examples

Example
What is the entropy and how should we send/encode the result of:
1. one toss of a fair coin?
2. one throw of a fair eight-sided die?
3. Horse race with winning probabilities of individual horses
1/2, 1/8, 1/8, 1/8, 1/8?

Answers:
1. H(1/2, 1/2) = 1/2 · log 2 + 1/2 · log 2 = 1 bit
2. H(1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8) = 8 · 1/8 · log 8 = 3 bits
3. H(1/2, 1/8, 1/8, 1/8, 1/8) = 1/2 · log 2 + 4 · 1/8 · log 8 = 2 bits
Let us assign shorter messages to the horses with higher winning
probability. If the assigned messages are 1, 000, 001, 010, and 011,
then the expected message length is 1/2 · 1 + 1/2 · 3 = 2 bits.
V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 11 / 36
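(A quick check of the horse-race numbers.)

import math

p = [1/2, 1/8, 1/8, 1/8, 1/8]
H = -sum(pi * math.log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, [1, 3, 3, 3, 3]))
print(H, L)   # 2.0 2.0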
Part II

Joint and Conditional Entropy

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 12 / 36


Joint entropy

In order to examine an entropy of more complex random experiments


described by correlated random variables we have to introduce the entropy
of a pair (or n–tuple) of random variables.

Definition
Let X and Y be random variables with a joint probability distribution
p(x, y ) = P(X = x, Y = y ).
We define the joint (Shannon) entropy of X and Y as

H(X , Y ) = − ∑_{x∈Im(X )} ∑_{y∈Im(Y )} p(x, y ) log p(x, y ),

or, alternatively,
 
H(X , Y ) = −Ep [log p(X , Y )] = Ep [log (1/p(X , Y ))] .

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 13 / 36


Conditional Entropy
Important question is how uncertain we are about an outcome of a random
variable X given a particular outcome y of a random variable Y .

Definition
Let X and Y be random variables and y ∈ Im(Y ).
The conditional entropy of X given Y = y is

H(X |Y = y ) = − ∑_{x∈Im(X )} P(X = x|Y = y ) log P(X = x|Y = y ).    (1)

When y is given, pY =y (x) = P(X = x|Y = y ) is a distribution of X .


Note that H(X |Y = y ) = −EpY =y [log pY =y (X )].
Pay attention to the base of the expectation!

The uncertainty about an outcome of X given an (unspecified)


outcome of Y is naturally defined as an average conditional entropy for
all y , i.e. the sum of equations (1) weighted according to P(Y = y ), i.e.
V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 14 / 36
Conditional Entropy

Definition
Let X and Y be random variables with a joint probability distribution
p(x, y ) = P(X = x, Y = y ). Let us denote p(x|y ) = P(X = x|Y = y ).
The conditional entropy of X given Y is

H(X |Y ) = ∑_{y∈Im(Y )} p(y ) H(X |Y = y ) =
         = − ∑_{y∈Im(Y )} p(y ) ∑_{x∈Im(X )} p(x|y ) log p(x|y ) =
         = − ∑_{x∈Im(X )} ∑_{y∈Im(Y )} p(x, y ) log p(x|y ) =
         = − Ep [log p(X |Y )].

Note that the probability space in which we are computing the expectation
is based on the joint probability p(x, y )!

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 15 / 36


Chain Rule of Conditional Entropy

Theorem (Chain rule of conditional entropy)


Let X and Y be random variables. Then
H(X , Y ) = H(Y ) + H(X |Y ).

Proof.

H(X , Y ) = − ∑_{x∈Im(X )} ∑_{y∈Im(Y )} p(x, y ) log p(x, y ) =
          = − ∑_{x,y} p(x, y ) log[p(y )p(x|y )] =
          = − ∑_{x,y} p(x, y ) log p(y ) − ∑_{x,y} p(x, y ) log p(x|y ) =
          = − ∑_{y∈Im(Y )} p(y ) log p(y ) − ∑_{x,y} p(x, y ) log p(x|y ) =
          = H(Y ) + H(X |Y ).

Alternatively, we may use log p(X , Y ) = log p(Y ) + log p(X |Y ) and take
the expectation (based on the joint probability distribution) of both sides
to get the desired result H(X , Y ) = H(Y ) + H(X |Y ).

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 16 / 36
Conditioned Chain Rule of Conditional Entropy

Corollary (Conditioned chain rule)


Let X , Y , and Z be random variables.

H(X , Y |Z ) = H(Y |Z ) + H(X |Y , Z )

Proof.
Similarly to the previous proof, we may use

log p(X , Y |Z ) = log p(Y |Z ) + log p(X |Y , Z )

and take the expectation (based on the joint probability distribution of all
X , Y , and Z ) on both sides to get the desired result.

Note that in general H(Y |X ) ≠ H(X |Y ).


On the other hand, H(X ) − H(X |Y ) = H(Y ) − H(Y |X ) is symmetric and
expresses the common information of X and Y .
V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 17 / 36
Part III

Cross-Entropy, Relative Entropy, and


Mutual Information

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 18 / 36


Relative Entropy - Motivation

Let us have two distributions p and q; say q is a slight modification of p.
How does the modification change the entropy?

H(q) − H(p) = −E [log q(X )] − (−E [log p(X )])
            = E [log p(X ) − log q(X )]
            = E [log (p(X )/q(X ))]

Idea
Wait!!!! What is the base of the expectations????

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 19 / 36


Cross-Entropy

Let us start with the definition of a cross-entropy, which measures


the “entropy” of q in the probability space defined by p.

Definition
The cross-entropy of the distribution q relative to a distribution p
(on a common set of sample points Im(X )) is defined as

H(pkq) = − ∑_{x∈Im(X )} p(x) log q(x) = −Ep [log q(X )] .

Better motivation for cross-entropy will come later in coding theory.

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 20 / 36


Relative entropy

The relative entropy measures the inefficiency of assuming that a given


distribution is q when the true distribution is p.
Definition
The relative entropy or Kullback-Leibler divergence between two
probability distributions p(x) and q(x) (on a common set of sample points
Im(X )) is defined as
 
D(pkq) = ∑_{x∈Im(X )} p(x) log (p(x)/q(x)) = Ep [log (p(X )/q(X ))] .

If p and q differ in their images, we use the conventions 0 log(0/q) = 0 and p log(p/0) = ∞.


Note that D(pkq) = H(pkq) − H(p).
It is not a distance in the mathematical sense since it is not symmetric
in its parameters and it does not satisfy the triangle inequality.

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 21 / 36
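(A numeric sketch, with arbitrarily chosen distributions p and q, showing
D(p‖q) = H(p‖q) − H(p) and the asymmetry of D.)

import math

p = [3/4, 1/8, 1/8]                   # arbitrary illustrative distributions
q = [1/3, 1/3, 1/3]

def H(r):                             # entropy
    return -sum(x * math.log2(x) for x in r)

def cross(r, s):                      # cross-entropy H(r||s)
    return -sum(x * math.log2(y) for x, y in zip(r, s))

print(cross(p, q) - H(p))             # D(p||q) ~ 0.524
print(cross(q, p) - H(q))             # D(q||p) ~ 0.553 (not symmetric!)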


Mutual information

Mutual information measures the shared information between two


random variables. It is the decrease of the uncertainty about an outcome
of a random variable given an outcome of another random variable.
Definition
Let X and Y be random variables with a joint distribution p.
The mutual information I (X ; Y ) is the relative entropy between
the joint distribution and the product of marginal distributions pX and pY
 
I (X ; Y ) = D(pk(pX · pY )) = Ep [log (p(X , Y ) / (pX (X ) pY (Y )))] .

Note that mutual information of independent random variables is zero.

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 22 / 36


Mutual Information and Entropy
Theorem
Let X and Y be random variables. Then

I (X ; Y ) = H(X ) − H(X |Y ).

Proof.

I (X ; Y ) = ∑_{x,y} p(x, y ) log (p(x, y ) / (p(x)p(y ))) = ∑_{x,y} p(x, y ) log (p(x|y ) / p(x)) =
          = − ∑_{x,y} p(x, y ) log p(x) + ∑_{x,y} p(x, y ) log p(x|y ) =
          = − ∑_x p(x) log p(x) − (− ∑_{x,y} p(x, y ) log p(x|y )) =
          = H(X ) − H(X |Y ).

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 23 / 36
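(A sketch computing I(X;Y) from a small joint table, once via the definition
and once via inclusion-exclusion; the joint distribution is an arbitrary
illustrative choice.)

import math

joint = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 0.0, (1, 1): 1/4}
px = {x: sum(v for (a, _), v in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (_, b), v in joint.items() if b == y) for y in (0, 1)}

I = sum(v * math.log2(v / (px[x] * py[y]))
        for (x, y), v in joint.items() if v > 0)

def H(vals):
    return -sum(v * math.log2(v) for v in vals if v > 0)

print(I)                                                    # ~ 0.3113
print(H(px.values()) + H(py.values()) - H(joint.values()))  # the same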


Mutual information
From the symmetry, we get I (X ; Y ) = H(Y ) − H(Y |X ). I.e., X says about
Y as much as Y says about X . Using H(X , Y ) = H(X ) + H(Y |X ), we get

Corollary (Inclusion-exclusion)
I (X ; Y ) = H(X ) + H(Y ) − H(X , Y ).

Note also that I (X ; X ) = H(X ) − H(X |X ) = H(X ).


V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 24 / 36
Part IV

Properties of Entropy and Mutual Information

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 25 / 36


General Chain Rule for Entropy
Theorem
Let X1 , X2 , . . . , Xn be random variables. Then

H(X1 , X2 , . . . , Xn ) = H(X1 ) + H(X2 |X1 ) + · · · + H(Xn |Xn−1 , . . . , X1 ).

Proof.
We use repeated application of the chain rule for a pair of random variables

H(X1 , X2 ) =H(X1 ) + H(X2 |X1 ),


H(X1 , X2 , X3 ) =H(X1 ) + H(X2 , X3 |X1 ) =
=H(X1 ) + H(X2 |X1 ) + H(X3 |X2 , X1 ),
..
.
H(X1 , X2 , . . . , Xn ) = H(X1 ) + H(X2 |X1 ) + · · · + H(Xn |Xn−1 , . . . , X1 ) =
                        = ∑_{i=1}^{n} H(Xi |Xi−1 , . . . , X1 ).
V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 26 / 36
Conditional Mutual Information

Definition
Let X , Y , and Z be random variables.
The conditional mutual information between X and Y given Z = z is

I (X ; Y |Z = z) = H(X |Z = z) − H(X |Y , Z = z).

The conditional mutual information between X and Y given Z is


 
I (X ; Y |Z ) = H(X |Z ) − H(X |Y , Z ) = E [log (p(X , Y |Z ) / (p(X |Z )p(Y |Z )))] ,

where the expectation is taken over the joint distribution p(x, y , z).

Theorem (Chain rule for mutual information)


I (X1 , X2 , . . . , Xn ; Y ) = ∑_{i=1}^{n} I (Xi ; Y |Xi−1 , . . . , X1 )

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 27 / 36


Conditional Relative Entropy

Definition
The conditional relative entropy is the average of the relative entropies
between the conditional probability distributions pY |X and qY |X averaged
over the probability distribution pX . Formally,

D(pY |X kqY |X ) = ∑_x pX (x) ∑_y pY |X (y |x) log (pY |X (y |x) / qY |X (y |x)) = Ep [log (pY |X / qY |X )]

where p in the base of the expectation is the joint distribution of X and Y .

The relative entropy between two joint distributions can be expanded as


the sum of a relative entropy and a conditional relative entropy.

Theorem (Chain rule for relative entropy)


D(p(x, y )kq(x, y )) = D(p(x)kq(x)) + D(p(y |x)kq(y |x)).

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 28 / 36


Chain Rule for Relative Entropy

Theorem (Chain rule for relative entropy)


D(p(x, y )kq(x, y )) = D(p(x)kq(x)) + D(p(y |x)kq(y |x)).

Proof.

D(p(x, y )kq(x, y )) = ∑_x ∑_y p(x, y ) log (p(x, y ) / q(x, y )) =
                     = ∑_x ∑_y p(x, y ) log (p(x)p(y |x) / (q(x)q(y |x))) =
                     = ∑_{x,y} p(x, y ) log (p(x)/q(x)) + ∑_{x,y} p(x, y ) log (p(y |x)/q(y |x)) =
                     = D(p(x)kq(x)) + D(p(y |x)kq(y |x)).

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 29 / 36


Part V

Information inequalities

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 30 / 36


Information Inequality

Theorem (Information inequality)


Let p(x) and q(x) be two probability distributions. Then

D(pkq) ≥ 0

with equality if and only if p(x) = q(x) for all x.

During the proof we will use Jensen’s inequality, stating for the logarithm
(a concave function) and a positive random variable X that

E [log X ] ≤ log E [X ] .

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 31 / 36


Information Inequality

Proof.
Jensen’s inequality is an equality iff the function is linear, or the function
is applied to a constant random variable.

−D(pkq) = −Ep [log (p(X )/q(X ))] = Ep [log (q(X )/p(X ))]
         ≤ log Ep [q(X )/p(X )]    (∗)
         = log ∑_{x∈Im(p)} p(x) · (q(x)/p(x)) = log ∑_{x∈Im(p)} q(x) ≤ log ∑_{x∈Im(q)} q(x)
         = log 1 = 0,

where (∗) follows from Jensen’s inequality.

Since log t is a strictly concave function (implying − log t is strictly


convex) of t, we have equality in (∗) iff q(x)/p(x) = 1 everywhere, i.e.
p(x) = q(x). Also, if p(x) = q(x) the second inequality becomes equality.

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 32 / 36


Consequences of Information Inequality

Corollary (Nonnegativity of mutual information)


For any two random variables X , Y

I (X ; Y ) ≥ 0

with equality if and only if X and Y are independent.

Corollary
D(p(y |x)kq(y |x)) ≥ 0
with equality if and only if p(y |x) = q(y |x) for all y and x with p(x) > 0.

Corollary
I (X ; Y |Z ) ≥ 0
with equality if and only if X and Y are conditionally independent given Z .
V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 33 / 36
Consequences of Information Inequality

Theorem
Let X be a random variable, then

H(X ) ≤ log |Im(X )|

with equality if and only if X has a uniform distribution over Im(X ).

Proof.
Let u(x) = 1/|Im(X )| be a uniform probability distribution over Im(X ) and
let p(x) be the probability distribution of X . Then D(pku) =
   
= Ep [log (p(X )/u(X ))] = Ep [log p(X )] + Ep [log (1/u(X ))] = −H(X ) + log (1/u(x)).

Due to 0 ≤ D(pku) and u(x) = 1/|Im(X )|, we get H(X ) ≤ log (1/u(x)) = log |Im(X )|.
Note that 0 = D(pku) iff p ≡ u.
V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 34 / 36
Consequences of Information Inequality

Theorem (Conditioning reduces entropy)


Let X and Y be random variables. Then

H(X |Y ) ≤ H(X )

with equality if and only if X and Y are independent.

Proof.
0 ≤ I (X ; Y ) = H(X ) − H(X |Y )
with equality iff X and Y are independent.

Previous theorem says that (on average) the knowledge of a random


variable Y reduces our uncertainty about other random variable X .
However, there may exist y such that H(X |Y = y ) > H(X )!

V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 35 / 36


An example for H(X |Y = y ) > H(X ): take Y uniform on {0, 1}; given Y = 0
put X = 0, and given Y = 1 let X be uniform on {0, 1}. Then P(X = 0) = 3/4
and H(X ) ≈ 0.811 bits, while H(X |Y = 1) = 1 bit > H(X ). Still, on average,
H(X |Y ) = 1/2 · 0 + 1/2 · 1 = 0.5 ≤ H(X ), as the theorem requires.
Consequences of Information Inequality

Theorem (Independence bound on entropy)


Let X1 , X2 , . . . , Xn be random variables. Then
H(X1 , X2 , . . . , Xn ) ≤ ∑_{i=1}^{n} H(Xi )

with equality if and only if Xi ’s are mutually independent.

Proof.
We use the chain rule for entropy
H(X1 , X2 , . . . , Xn ) = ∑_{i=1}^{n} H(Xi |Xi−1 , . . . , X1 ) ≤ ∑_{i=1}^{n} H(Xi ),

where the inequality follows directly from the previous theorem. We have
equality if and only if Xi is independent of all Xi−1 , . . . , X1 .
V. Řehák: IV111 Probability in CS Lecture 9: Information Theory November 21, 2024 36 / 36
Lecture 10: Codes for Data Compression

Vojtěch Řehák
based on slides of Jan Bouda

Faculty of Informatics, Masaryk University

November 28, 2024

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 1 / 34
Part I

Optimal Length of Code

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 2 / 34
Intro

In our following analysis, we will design various methods for compressing the
results of our experiment: one message for each possible result, and the
shorter, the better. Note that the length of the message is a random
variable. Hence, we would like to minimize the expected value of the
message length.

For a given experiment (with a set of results and a probability distribution),


the goal is to design an optimal code assigning a codeword to each result.

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 3 / 34
Message Source

Following this analysis, we model the source of information as a random


variable X with all possible messages equal to Im(X ). This source emits
the message x with the probability P(X = x). A sequence of messages is
created by a sequence of independent trials described by X and hence is
described by a random process X1 , X2 , . . . where Xi are independently and
identically distributed. Such a source is called a memoryless source.

We may naturally expect that a source has memory. This is modeled by a


random process X1 , X2 , . . . with Im(Xi ) = Im(Xj ), ∀i, j, but we require
neither independence nor identical distribution of Xi . In practice, this
means that probability of a particular message being emitted at particular
time depends on the history of the messages - it models a source with
memory.

In this course, we focus on memoryless sources.

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 4 / 34
Code

Definition
A code C for a random variable (memoryless source) X is a mapping
C : Im(X ) → D∗ , where D∗ is the set of all finite-length strings over the
alphabet D. With |D| = d, we say the code is d-ary.
C (x) is the codeword assigned to x and lC (x) denotes the length of C (x).

Definition
The expected length LC (X ) of a code C for a random variable X is given
by
LC (X ) = ∑_{x∈Im(X )} P(X = x) lC (x) = E [lC (X )] .

In what follows, we will assume (WLOG) that the alphabet is


D = {0, 1, . . . , d − 1}.

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 5 / 34
Code

Example
Let X and C be given by the following probability distribution and
codeword assignment

P(X = 1) = 1/2, codeword C (1) = 0


P(X = 2) = 1/4, codeword C (2) = 10
P(X = 3) = 1/8, codeword C (3) = 110
P(X = 4) = 1/8, codeword C (4) = 111

The entropy H(X ) = 1.75 bits and the expected length


LC (X ) = E [lC (X )] = 1.75, too.
Note that any encoded sequence (not an arbitrary one!) can be uniquely decoded
into symbols {1, 2, 3, 4}; try, e.g., 0110111100110.

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 6 / 34
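(A sketch of the unique-decoding claim: decoding the example string with this
prefix code, symbol by symbol.)

code = {"0": 1, "10": 2, "110": 3, "111": 4}    # the prefix code above

def decode(bits):
    out, word = [], ""
    for b in bits:
        word += b
        if word in code:              # prefix property: decode instantly
            out.append(code[word])
            word = ""
    assert word == "", "leftover bits"
    return out

print(decode("0110111100110"))        # [1, 3, 4, 2, 1, 3]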
Code

Example
Consider another example with

P(X = 1) = 1/3, codeword C (1) = 0


P(X = 2) = 1/3, codeword C (2) = 10
P(X = 3) = 1/3, codeword C (3) = 11

The entropy in this case is H(X ) = log2 3 ≈ 1.58 bits, but the expected
length is LC (X ) = 5/3 ≈ 1.67.

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 7 / 34
Non-singular Code

Definition
A code C is said to be non-singular if it maps every element in the range
of X to different string in D∗ , i.e.

∀x, y ∈ Im(X ) : x ≠ y ⇒ C (x) ≠ C (y ).

Non-singularity allows unique decoding of any single codeword,


however, in practice we send a sequence of codewords and require the
complete sequence to be uniquely decodable. We can use e.g. any
non-singular code and use an extra symbol # ∉ D as a codeword
separator. However, this is very inefficient and we can improve efficiency
by designing uniquely decodable or prefix code.

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 8 / 34
Uniquely Decodable Code

Let Im(X )+ denotes the set of all nonempty strings over the alphabet
Im(X ).

Definition
An extension C ∗ of a code C is the mapping from Im(X )+ to D∗ defined
by
C ∗ (x1 x2 . . . xn ) = C (x1 )C (x2 ) . . . C (xn ),
where C (x1 )C (x2 ) . . . C (xn ) denotes concatenation of corresponding
codewords.

Definition
A code is uniquely decodable iff its extension is non-singular.

In other words, a code is uniquely decodable if any encoded string has only
one possible source string.

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 9 / 34
Prefix Code

Definition
A code is called prefix code (or instantaneous code) if no codeword is a
prefix of any other codeword.

The advantage of prefix codes is not only their unique decodability, but
also the fact that a codeword can be decoded as soon as we read its last
symbol.

Example (Codes of different types)


X | Singular | Non-singular, but not uniquely decodable | Uniquely decodable, but not prefix | Prefix
1 |    0     |                   0                      |                 10                 |   0
2 |    0     |                  010                     |                 00                 |   10
3 |    0     |                  01                      |                 11                 |  110
4 |    0     |                  10                      |                110                 |  111

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 10 / 34
Part II

Kraft Inequality

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 11 / 34
Kraft Inequality

In this section we concentrate on prefix codes of minimal expected length.

Theorem (Kraft inequality)


For any prefix code over an alphabet of size d, the codeword lengths
(including multiplicities) l1 , l2 , . . . lm satisfy the inequality
∑_{i=1}^{m} d^{−li} ≤ 1.

Conversely, given a sequence of codeword lengths that satisfy this


inequality, there exists a prefix code with these codeword lengths.

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 12 / 34
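(A sketch of both directions of the theorem: checking the inequality for given
lengths and greedily building a binary prefix code from sorted lengths,
mirroring the tree construction in the proof that follows.)

def kraft_ok(lengths, d=2):
    return sum(d ** -l for l in lengths) <= 1

def prefix_code(lengths):             # greedy counter = the tree construction
    assert kraft_ok(lengths)
    words, word, prev = [], 0, 0
    for l in sorted(lengths):
        word <<= (l - prev)           # descend to depth l
        words.append(format(word, f"0{l}b"))
        word += 1                     # skip the subtree below this codeword
        prev = l
    return words

print(prefix_code([1, 2, 3, 3]))      # ['0', '10', '110', '111']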
Kraft Inequality

Proof.
Consider a d–ary tree in which every inner node has d descendants. Each
edge represents a choice of a code alphabet symbol at a particular
position. For example, d edges emerging from the root represent d choices
of the alphabet symbol at the first position of different codewords. Each
codeword is represented by a node (some nodes are not codewords!)
and the path from the root to a particular node (codeword) specifies the
codeword symbols. The prefix condition implies that no codeword is an
ancestor of any other codeword in the tree. Hence, each codeword
eliminates its possible descendants.
Let lmax = max{l1 , l2 , . . . , lm }. Consider all nodes of the tree at the level
lmax . Some of them are codewords, some of them are descendants of
codewords, some of them are neither.

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 13 / 34
Kraft Inequality

Proof (Cont.)
A codeword at level li has d^{lmax −li} descendants at level lmax . Sets of
descendants of different codewords must be disjoint and the total number of
nodes in all these sets must be at most d^{lmax} . Summing over all codewords
we have

∑_{i=1}^{m} d^{lmax −li} ≤ d^{lmax}

and hence

∑_{i=1}^{m} d^{−li} ≤ 1.

Conversely, given any set of codeword lengths l1 , l2 , . . . , lm satisfying the


Kraft inequality we can always construct a tree described above. We may
WLOG assume that l1 ≤ l2 ≤ · · · ≤ lm .

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 14 / 34
Kraft Inequality

Proof (Cont.)
Label the first node of depth l1 as the codeword 1 and remove its descendants
from the tree. Then mark the first remaining node of depth l2 as the codeword 2.
In this way you can construct a prefix code with codeword lengths l1 , l2 , . . . , lm .
We may observe easily that this construction does not violate the prefix property:
to do so, the new codeword would have to be placed either as an ancestor or as a
descendant of an existing codeword, which is prevented by the construction.
It remains to show that there are always enough nodes.
Assume that for some i ≤ m there is no free node of level li when we want to add
a new codeword of length li . This, however, means that all nodes at level li are
either codewords, or descendants of a codeword, giving

∑_{j=1}^{i−1} d^{li −lj} = d^{li} ,

so ∑_{j=1}^{i−1} d^{−lj} = 1, and, finally, ∑_{j=1}^{i} d^{−lj} > 1, violating the initial
assumption.
V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 15 / 34
Part III

McMillan Inequality

V. Řehák: IV111 Probability in CS Lecture 10: Codes for Data Compression November 28, 2024 16 / 34
McMillan Inequality
Kraft inequality holds also for codes with a countably infinite number of
codewords, however, we omit the proof here. There exist uniquely
decodable codes that are not prefix codes, but, as established by the
following theorem, the Kraft inequality applies to general uniquely
decodable codes as well and, therefore, when searching for an optimal
code it suffices to concentrate on prefix codes. General uniquely decodable
codes offer no extra codeword lengths in contrast to prefix codes.

Theorem (McMillan inequality)
The codeword lengths of any uniquely decodable code must satisfy the
Kraft inequality, i.e.

∑_i d^{−l_i} ≤ 1.

Conversely, given a set of codeword lengths that satisfy the inequality, it is
possible to construct a uniquely decodable code with these codeword lengths.

McMillan Inequality

Proof.
Consider the k-th extension C^k of a code C. By the definition of unique
decodability, C^k is non-singular for any k.
Observe that l_{C^k}(x_1 x_2 · · · x_k) = ∑_{i=1}^{k} l_C(x_i). Let us calculate the sum for
the code extension C^k:

∑_{x_1,...,x_k ∈ Im(X)} d^{−l_{C^k}(x_1 x_2 ··· x_k)} = ∑_{x_1,...,x_k ∈ Im(X)} d^{−l_C(x_1)} d^{−l_C(x_2)} · · · d^{−l_C(x_k)}
  = ( ∑_{x_1 ∈ Im(X)} d^{−l_C(x_1)} ) · ( ∑_{x_2 ∈ Im(X)} d^{−l_C(x_2)} ) · . . . · ( ∑_{x_k ∈ Im(X)} d^{−l_C(x_k)} )
  = ( ∑_{x ∈ Im(X)} d^{−l_C(x)} )^k.

McMillan Inequality

Proof (Cont.)
Another expression is obtained when we reorder the terms by codeword
lengths:

∑_{x_1,...,x_k ∈ Im(X)} d^{−l_{C^k}(x_1 x_2 ··· x_k)} = ∑_{m=1}^{k·l_max} a(m) d^{−m},

where l_max is the maximum codeword length and a(m) is the number of
k-character source strings mapped to a codeword of length m.
The code is uniquely decodable, i.e. at most one input is mapped to each
codeword (of length m). The total number of such inputs is therefore at most
the number of d-ary sequences of length m, i.e.

a(m) ≤ d^m.

McMillan Inequality

Proof (Cont.)
Using a(m) ≤ d^m, we get

( ∑_{x ∈ Im(X)} d^{−l_C(x)} )^k = ∑_{m=1}^{k·l_max} a(m) d^{−m} ≤ ∑_{m=1}^{k·l_max} d^m d^{−m} = k·l_max,

implying

∑_i d^{−l_i} ≤ (k·l_max)^{1/k}.

McMillan Inequality

Proof (Cont.)
This inequality holds for any k, and observing that lim_{k→∞} (k·l_max)^{1/k} = 1,
we have

∑_i d^{−l_i} ≤ 1.

The opposite implication follows from the Kraft inequality.

Part IV

Optimal Codes

Optimal Codes

In the previous part, we derived a necessary and sufficient condition on the
codeword lengths of prefix (and, more generally, uniquely decodable) codes.
We will use it to find a prefix code with the minimum expected length.

Theorem
The expected length of any prefix d–ary code C for a random variable X is
greater than or equal to the entropy H_d(X) (d is the base of the logarithm),
i.e.

L_C(X) ≥ H_d(X),

with equality iff, for all x_i, P(X = x_i) = p_i = d^{−l_i} for some integer l_i.

Optimal Codes

Proof.
We write the difference between the expected length and the entropy as

L_C(X) − H_d(X) = ∑_i p_i l_i + ∑_i p_i log_d p_i
  = ∑_i p_i log_d d^{l_i} + ∑_i p_i log_d p_i
  = ∑_i p_i log_d ( p_i / d^{−l_i} ).

Note that the final expression "looks like" a relative entropy.
It holds that 0 ≤ d^{−l_j} ≤ 1 for all j, but ∑_j d^{−l_j} is not necessarily 1.
Hence, let us normalize it to a distribution r defined by

r_i = d^{−l_i} / c, where c = ∑_j d^{−l_j}.

Optimal Codes

Proof (Cont.)
Having r_i = d^{−l_i}/c and c = ∑_j d^{−l_j}, we can continue:

L_C(X) − H_d(X) = ∑_i p_i log_d ( p_i / d^{−l_i} ) = ∑_i p_i log_d ( p_i / (r_i · c) )
  = ∑_i p_i ( log_d (p_i / r_i) + log_d (1/c) ) = ∑_i p_i log_d (p_i / r_i) + ∑_i p_i log_d (1/c)
  = D(p∥r) + log_d (1/c) ≥ 0

by the non-negativity of the relative entropy and the fact that c ≤ 1 (Kraft
inequality). Hence, L_C(X) ≥ H_d(X), with equality if and only if c = 1 and,
for all i, p_i = d^{−l_i}, i.e. − log_d p_i is an integer.

Optimal Codes

Definition
A probability distribution is called d–adic if each of the probabilities is
equal to d^{−n} for some integer n.

Due to the previous theorem, the expected length is equal to the entropy if
and only if the probability distribution of X is d–adic. The proof also
suggests a method to find a code with optimal length in case the
probability distribution is not d–adic.

Optimal Codes - Construction - Idea

Let us have a source X with a given distribution.
Our basic goal is to minimize ∑_i p_i l_i subject to the restriction ∑_i d^{−l_i} ≤ 1.
Idea:
1. Find the d–adic distribution that is closest (measured by relative
entropy) to the distribution of X, i.e. find a d–adic distribution
r_i = d^{−l_i} / ∑_j d^{−l_j} minimizing

L_C(X) − H_d(X) = D(p∥r) − log_d ( ∑_i d^{−l_i} ) ≥ 0.

This distribution defines the set of codeword lengths l_1, l_2, . . . , l_n.
2. Use the technique described in the proof of the Kraft inequality to
construct the code.
Note that this procedure is not easy, since the search for the closest
d–adic distribution is not obvious.
Part V

Shannon-Fano Coding and


Bounds on the Optimal Code Length

Bounds on the Optimal Code Length

Idea (Naive Shannon-Fano coding)
The choice of word lengths l_i = log_d (1/p_i) gives L = H_d(X). Since
log_d (1/p_i) may not be an integer, we round it up to get

l_i = ⌈ log_d (1/p_i) ⌉.

These lengths satisfy the Kraft inequality since

∑_i d^{−⌈log_d (1/p_i)⌉} ≤ ∑_i d^{−log_d (1/p_i)} = ∑_i p_i = 1.

The choice of codeword lengths satisfies

log_d (1/p_i) ≤ l_i < log_d (1/p_i) + 1.
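
As a quick numerical check (my own sketch, not part of the slides), the
following lines compute these lengths for d = 2 and verify both the Kraft
inequality and the bound derived on the next slide:

    from math import ceil, log2

    probs = [0.25, 0.25, 0.2, 0.15, 0.15]           # an example distribution
    lengths = [ceil(log2(1 / p)) for p in probs]    # naive Shannon-Fano lengths
    kraft = sum(2 ** -l for l in lengths)           # <= 1 by the argument above
    L = sum(p * l for p, l in zip(probs, lengths))  # expected codeword length
    H = -sum(p * log2(p) for p in probs)            # entropy H_2(X)
    assert kraft <= 1 and H <= L < H + 1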

Bounds on the Optimal Code Length

Taking the expectation over p_i on both sides, we get

H_d(X) ≤ L_C(X) < H_d(X) + 1.   (1)

The optimal code can only do better, and we have

Theorem
Let l*_1, l*_2, . . . , l*_m be the optimal codeword lengths for a source distribution
{p_i}_i and a d–ary alphabet, and let L* be the associated expected length
of the optimal code, i.e. L* = ∑_i p_i l*_i. Then

H_d(X) ≤ L* < H_d(X) + 1.

Bounds on the Optimal Code Length

Proof.
Let l_i = ⌈log_d (1/p_i)⌉. Then the l_i satisfy the Kraft inequality and from (1) we
have

H_d(X) ≤ L_C(X) = ∑_i p_i l_i < H_d(X) + 1.

But since our code is optimal, L* ≤ L_C(X) = ∑_i p_i l_i, and since L* ≥ H_d(X),
we have the result.

The non-integer expressions log_d (1/p_i) cause an overhead of at most 1 bit
per symbol in the previous theorem. We can further reduce this overhead by
spreading it over a number of symbols. Let us consider a system in which we
send a sequence of symbols emitted by a source X, where all symbols are
drawn independently according to an identical distribution. We can consider
n such symbols to be a supersymbol from the alphabet Im(X)^n.

Bounds on the Optimal Code Length

Let us define L_n as the expected codeword length per input symbol, i.e.

L_n = (1/n) ∑ p(x_1, x_2, . . . , x_n) l_C(x_1, x_2, . . . , x_n) = (1/n) E[l_C(X_1, X_2, . . . , X_n)].

Using the bounds derived above, we have

H(X_1, X_2, . . . , X_n) ≤ E[l_C(X_1, X_2, . . . , X_n)] < H(X_1, X_2, . . . , X_n) + 1.

Since X_1, X_2, . . . , X_n are independently and identically distributed, we have
H(X_1, X_2, . . . , X_n) = nH(X), and dividing by n, we get

H(X) ≤ L_n < H(X) + 1/n.

Using large blocks thus allows us to approach the optimal length (the
entropy) arbitrarily closely.

Optimal Coding and Relative Entropy

The relative entropy allows us to quantify the inefficiency caused by a wrong
estimate of the input probability distribution.

Theorem
Let X be a random variable with distribution p, and let q be another
distribution. The expected length under p of the code assignment that is
optimal for q, i.e. l_C(x) = ⌈log (1/q(x))⌉, satisfies

H(p) + D(p∥q) ≤ E_p[l_C(X)] < H(p) + D(p∥q) + 1.

Due to H(p) + D(p∥q) = H_p(q), we can reformulate this using the cross entropy:

H_p(q) ≤ E_p[l_C(X)] < H_p(q) + 1.

This could serve as a motivation for the definitions of D(p∥q) and H_p(q).
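
A small numerical illustration of the theorem (my own sketch; the function
names are made up):

    from math import ceil, log2

    def mismatched_length(p, q):
        """E_p of the lengths ceil(log2(1/q(x))): coding source p with a code for q."""
        return sum(pi * ceil(log2(1 / qi)) for pi, qi in zip(p, q))

    def cross_entropy(p, q):
        return -sum(pi * log2(qi) for pi, qi in zip(p, q))

    p, q = [0.5, 0.25, 0.25], [0.25, 0.25, 0.5]
    assert cross_entropy(p, q) <= mismatched_length(p, q) < cross_entropy(p, q) + 1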

Optimal Coding and Relative Entropy

Proof.

E_p[l_C(X)] = ∑_x p(x) ⌈log (1/q(x))⌉
  < ∑_x p(x) ( log (1/q(x)) + 1 )
  = ∑_x p(x) log (1/q(x)) + ∑_x p(x) · 1
  = H_p(q) + 1.

The lower bound can be proven analogously.

Lecture 11: Optimal Codes for Data Compression

Vojtěch Řehák
based on slides of Jan Bouda

Faculty of Informatics, Masaryk University

December 5, 2024

On Dec 5, 2024, there will be no lecture in A318.


Please, read this presentation and watch the COVID videos:
https://is.muni.cz/auth/el/fi/podzim2024/IV111/index.qwarp?prejit=14225206
Lect 11 Part 3 – Shannon-Fano and Huffman Codes
Lect 11 Part 4 – Proof of Optimality
Lect 11 Part 5 – Data Compression in Practice
Lect 11 Part 6 – Generating Discrete Distribution Using Fair-Coin Tosses
Part I

Revision

Revision

The goal is to find an optimal code for a given probability distribution, i.e.
one with the shortest expected message length.

Thanks to the Kraft and McMillan inequalities, it suffices to look for an
optimal prefix code.

We know that the expected length is bounded from below by the entropy H_d,
and that it reaches this bound iff the distribution is d-adic.

Theorem
The expected length of any prefix d–ary code C for a random variable X is
greater than or equal to the entropy H_d(X) (d is the base of the logarithm),
i.e.

L_C(X) ≥ H_d(X),

with equality iff, for all x_i, P(X = x_i) = p_i = d^{−l_i} for some integer l_i.

Bounds on the Optimal Code Length

Theorem
Let l*_1, l*_2, . . . , l*_m be the optimal codeword lengths for a source distribution
{p_i}_i and a d–ary alphabet, and let L* be the associated expected length
of the optimal code, i.e. L* = ∑_i p_i l*_i. Then

H_d(X) ≤ L* < H_d(X) + 1.

Idea (Naive Shannon-Fano coding)
The choice of word lengths l_i = log_d (1/p_i) gives L = H_d(X). Since
log_d (1/p_i) may not be an integer, we round it up to get

l_i = ⌈ log_d (1/p_i) ⌉.

Part II

Naive Shannon-Fano and Huffman Codes

Coding Based on Shannon Entropy
Example
We have two source symbols with probabilities 0.0001 and 0.9999. Using
the entropy-based lengths, we get two codewords of lengths 1 = ⌈log (1/0.9999)⌉
and 14 = ⌈log (1/0.0001)⌉ bits. The optimal code obviously uses a 1-bit
codeword for both symbols.

On the other hand, is it true that an optimal code always uses codewords
of length at most ⌈log (1/p_i)⌉?
It is not.
Consider the probabilities 1/3, 1/3, 1/4, 1/12. The expected length of its
optimal code is 2 (as we will be able to show later). Hence, the code
0, 10, 110, 111 with lengths 1, 2, 3, 3 is an optimal code.
The third symbol has length 3, which is > ⌈log (1/p_3)⌉ = ⌈log_2 4⌉ = 2.

Shannon-Fano Algorithm for Code Generation

Naive Shannon-Fano algorithm:
• top-down (from the root to the leaves) construction of the coding tree
For binary codes:
1. divide the probabilities into two parts (the sum of probabilities on the
left is as close as possible to the sum on the right)
2. the left part is assigned 0 as the first bit of its codewords,
the right part is assigned 1 as the first bit of its codewords
3. if the parts are not singletons, refine them analogously and assign the
next bits of the codewords
(a runnable sketch of this procedure follows below)

Remark
Shannon-Fano coding is not optimal!
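
A minimal Python sketch of this top-down splitting (my own illustration; it
assumes the probabilities are given in non-increasing order and uses the
common variant that splits the sorted list at the point best balancing the
two halves):

    def shannon_fano(probs):
        """Binary Shannon-Fano codewords for probabilities sorted descending."""
        codes = [""] * len(probs)

        def split(group):
            if len(group) <= 1:
                return
            total = sum(probs[i] for i in group)
            acc, best, cut = 0.0, float("inf"), 1
            for k in range(1, len(group)):
                acc += probs[group[k - 1]]
                if abs(2 * acc - total) < best:   # |left sum - right sum|
                    best, cut = abs(2 * acc - total), k
            left, right = group[:cut], group[cut:]
            for i in left:
                codes[i] += "0"
            for i in right:
                codes[i] += "1"
            split(left)
            split(right)

        split(list(range(len(probs))))
        return codes

    # shannon_fano([0.25, 0.25, 0.2, 0.15, 0.15]) -> ['00', '01', '10', '110', '111']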

Shannon-Fano vs. Huffman Codes - Example

Example
Let us consider a random variable with outcome probabilities
0.25, 0.25, 0.2, 0.15, 0.15 and construct an optimal binary code.

Shannon-Fano vs. Huffman Codes - Example

Shannon-Fano code:
split into 0.25, 0.25 and 0.2, 0.15, 0.15;
then into 0.25 and 0.25; and 0.2 and 0.15, 0.15.
Expected length: 2 · 0.7 + 3 · 0.3 = 2.3 bits.
The Huffman approach (repeatedly merging the two smallest probabilities):
0.25, 0.25, 0.2, 0.15, 0.15
0.25, 0.25, 0.2, 0.3
0.25, 0.45, 0.3
0.55, 0.45
1
Expected code length = 2.3 bits.
H(X) = 2.2854 . . .
Shannon-Fano vs. Huffman Codes - Example

Example
Let us consider a random variable with outcome probabilities
0.38, 0.18, 0.16, 0.15, 0.13 and construct an optimal binary code.

Shannon-Fano vs. Huffman Codes - Example

Shannon-Fano code:
split into 0.38, 0.13 and 0.18, 0.16, 0.15 (sums 0.51 and 0.49);
then into 0.38 and 0.13; and 0.18 and 0.16, 0.15.
Expected length: 2 · 0.69 + 3 · 0.31 = 1.38 + 0.93 = 2.31 bits.
The Huffman approach:
0.38, 0.18, 0.16, 0.15, 0.13
0.38, 0.18, 0.16, 0.28
0.38, 0.34, 0.28
0.38, 0.62
1
Expected code length = 1 · 0.38 + 3 · 0.62 = 0.38 + 1.86 = 2.24 bits.
Huffman Codes - Example

Example
Let us consider a random variable with outcome probabilities
0.25, 0.25, 0.2, 0.15, 0.15 and construct an optimal ternary code.

Huffman Codes - Example

The Huffman approach (merging the three smallest probabilities):
0.25, 0.25, 0.2, 0.15, 0.15
0.25, 0.25, 0.5
1
Expected code length = 1.5 ternary symbols.
H_3(X) = 1.4419 . . .

What if the number of outcomes ≠ 1 + k(D − 1)?
Huffman Codes - Example

Example
Let us consider a random variable with outcome probabilities
0.25, 0.25, 0.2, 0.1, 0.1, 0.1 and construct 3-ary code.

Huffman Codes - Example

Add a dummy outcome(s) to have 1 + k(D − 1) outcomes.


• We add 0 + 0.1 + 0.1 = 0.2 corresponding to 5, 6, r .
• We add 0.2 + 0.1 + 0.2 = 0.5 corresponding to (5, 6, r ), 4, 3.
• We add 0.5 + 0.25 + 0.25 = 1 corresponding to ((5, 6, r ), 4, 3), 2, 1.
• We assign ε to ((5, 6, r ), 4, 3), 2, 1.
• We assign 0; 1; 2 to ((5, 6, r ), 4, 3); 2; 1, respectively.
• We assign 00; 01; 02 to (5, 6, r ); 4; 3, respectively.
• We assign 000; 001; 002 to 5; 6; r , respectively.
Therefore, we end with the code 2, 1, 02, 01, 000, 001 (redundant symbol
would have the codeword 002).
Huffman Codes - Algorithm
Let us introduce the d–ary Huffman codes for the source described by a
random variable X with probability distribution p_1, p_2, . . . , p_m. The d–ary
Huffman code for X is constructed as follows (see the sketch after this list):
• Add redundant input symbols with probability 0 to the distribution so
that the distribution has 1 + k(d − 1) symbols for some k.
• Find the d smallest probabilities p_{i_1}, . . . , p_{i_d} and replace them with
p_{i_1,...,i_d} = ∑_{j=1}^{d} p_{i_j}.
• Repeat the previous step until we end with a probability distribution
having only a single nonzero probability, equal to 1.
To construct the code, we keep expanding the sums of probabilities and create
the codewords assigned to the probabilities, i.e.
• We assign ε, i.e. the empty codeword, to the probability p_{1,...,1+k(d−1)}.
• Let w be a codeword assigned to p_{i_1,...,i_d}. We assign the codewords
w0, w1, . . . , w(d−1) to the probabilities p_{i_1}, . . . , p_{i_d}, respectively.
• We keep expanding until we reach the original probability distribution.
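
A compact Python sketch of this construction (my own; it uses a heap for
the repeated merging and returns codeword strings aligned with the input
probabilities):

    import heapq
    from itertools import count

    def huffman_code(probs, d=2):
        """d-ary Huffman codewords over the digits 0..d-1 (a sketch)."""
        tie = count()                      # tie-breaker so trees are never compared
        heap = [(p, next(tie), i) for i, p in enumerate(probs)]
        pad = (1 - len(probs)) % (d - 1)   # dummy symbols: #leaves = 1 + k(d-1)
        heap += [(0.0, next(tie), None) for _ in range(pad)]
        heapq.heapify(heap)
        while len(heap) > 1:               # merge the d least likely nodes
            group = [heapq.heappop(heap) for _ in range(d)]
            total = sum(p for p, _, _ in group)
            heapq.heappush(heap, (total, next(tie), [n for _, _, n in group]))
        codes = {}
        def expand(node, word):            # assign w0, w1, ..., w(d-1) to children
            if node is None:               # dummy leaf: no codeword needed
                return
            if isinstance(node, list):
                for digit, child in enumerate(node):
                    expand(child, word + str(digit))
            else:
                codes[node] = word
        expand(heap[0][2], "")
        return [codes[i] for i in range(len(probs))]

For the examples above it reproduces the expected lengths: 2.3 bits for
(0.25, 0.25, 0.2, 0.15, 0.15) with d = 2, and 1.5 ternary symbols with d = 3.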

Part III

Proof of Optimality

Properties of Optimal Prefix Codes

Lemma (Optimal-code properties)
For any random variable X with P(X = x_i) = p_i and its optimal prefix
code C with lengths l_C(x_i) = l_i and maximal length l_max, it holds that:
1. The smaller the probability, the longer (or equal) the codeword,
i.e. if p_j > p_k then l_j ≤ l_k.
2. The two longest codewords have the same length.
3. Each longest codeword has a sibling.^a
4. The longest codewords correspond to the least likely symbols,
i.e. ∀j ∈ M, k ∉ M : p_j ≤ p_k where M = {i | l_i = l_max}.
^a Siblings are codewords that differ only in their last letter.

Properties of Optimal Prefix Codes

Proof.
Let us consider an optimal code C.
1. Suppose that p_j > p_k. Consider C′ with the codewords j and k
interchanged (compared to C). Then

L_{C′}(X) − L_C(X) = ∑_i p_i l′_i − ∑_i p_i l_i = p_j l_k + p_k l_j − p_j l_j − p_k l_k = (p_j − p_k)(l_k − l_j).

We know that p_j − p_k > 0, and since C is optimal, we have
L_{C′}(X) − L_C(X) ≥ 0. Hence, l_k ≥ l_j.

Properties of Optimal Prefix Codes

Proof (Cont.)
2. If the two longest codewords had different lengths, we could delete the
last letter of the longer one, getting a shorter code while preserving the
prefix property, which contradicts our assumption that C is optimal.
Therefore, the two longest codewords have the same length.
3. If there were a codeword of maximal length without a sibling, we could
delete the last letter of that codeword and still maintain the prefix
property, which contradicts the optimality of the code.
4. Again, if a longest codeword belonged to a symbol with higher probability,
we could swap two codewords and contradict the optimality of C. Formally,
for all j ∈ M, k ∉ M we have l_j = l_max > l_k, which by item 1 implies p_j ≤ p_k.

Properties of Optimal Prefix Codes

Lemma (Least-likely-siblings property)
For each optimal prefix code, there is an optimal prefix code C preserving
the same codeword lengths and having a longest codeword w such that w
together with its siblings is assigned to the least likely symbols, i.e.
M_w = {i | C(x_i) is w or a sibling of w} satisfies ∀j ∈ M_w, k ∉ M_w : p_j ≤ p_k.

Proof.
Due to the previous lemma, we know that w and its siblings are among the
longest codewords, which belong to least likely symbols (i.e., M_w ⊆ M). The
problem is that in general M_w is not necessarily a set of indices of the least
likely symbols. To obtain the optimal code required here, we simply exchange
some of the longest codewords so that M_w becomes a set of least likely
indices. Note that this modification preserves the assigned codeword
lengths (all the exchanged codewords have the maximal length).

Optimality of Huffman Codes
Theorem
Huffman codes are optimal, i.e. the codes obtained by the Huffman
algorithm assign codewords of exactly the same lengths as optimal codes.

Proof.
This proof is limited to the case of a binary code; the general d-ary case is
analogous (with a discussion of unused longest siblings).
The proof is done by induction on the number of codewords. If there are
only two codewords, the Huffman-code lengths are 1, 1, which is optimal.
Now let us assume that the statement holds for all codes with m − 1
codewords. Let C_m be an optimal code with m codewords satisfying our
Least-likely-siblings property. Hence, there are two longest least-likely
siblings, and we can define the 'merged' code C_{m−1} of m − 1 codewords
by taking the common prefix of the two siblings and assigning it to a new
symbol with probability p_{m−1} + p_m (the two smallest probabilities).
Optimality of Huffman Codes
Proof (Cont.)
Let l_i, p_i be the lengths and probabilities for the code C_m, and l′_i, p′_i for
C_{m−1}. Hence, l_i = l′_i and p_i = p′_i for i = 1 . . . m − 2; p′_{m−1} = p_{m−1} + p_m;
and l_m = l_{m−1} = l′_{m−1} + 1. The expected length of the code C_m is

L_{C_m}(X) = ∑_{i=1}^{m} p_i l_i = ∑_{i=1}^{m−2} p_i l_i + p_{m−1} l_{m−1} + p_m l_m
  = ∑_{i=1}^{m−2} p_i l′_i + p_{m−1} (l′_{m−1} + 1) + p_m (l′_{m−1} + 1)
  = ∑_{i=1}^{m−2} p_i l′_i + (p_{m−1} + p_m)(l′_{m−1} + 1)
  = ∑_{i=1}^{m−2} p′_i l′_i + p′_{m−1} l′_{m−1} + p_{m−1} + p_m
  = ∑_{i=1}^{m−1} p′_i l′_i + p_{m−1} + p_m = L_{C_{m−1}}(X) + p_{m−1} + p_m.
Optimality of Huffman Codes

Proof (Cont.)
It is important that the expected length of C_m differs from the expected
length of C_{m−1} only by a fixed amount given by the probability distribution
of the source. Therefore, C_m is optimal iff C_{m−1} is optimal. (If one of
them were not optimal, i.e. there were a better one, the above-mentioned
transformation would contradict the optimality of the other.)

Due to the induction hypothesis, C_{m−1} is optimal iff its lengths can be
obtained from the Huffman algorithm. As the transformation of C_m to
C_{m−1} corresponds to one step of the Huffman algorithm, we have that the
C_m lengths can be obtained from the Huffman algorithm iff the C_{m−1}
lengths can.

Remark
Shannon-Fano coding is not optimal!

Part IV

Data Compression in Practice

Huffman and Shannon-Fano Codes in Practice

• In practice, we have to consider a number of various problems.
• Usually, it is very hard to determine the probability distribution of the
source. Even when we determine it correctly, the actual sequence
generated can differ from what we expected.
• In case we want to compress, e.g., a general file, the strategy adopted
is to calculate probabilities as the relative frequencies of 'symbols' (e.g.
sequences of bytes) in the file. This assures optimal coding (relative
to the chosen set of symbols!), but we have to generate a codeword
table that has to be stored together with the compressed file.
• When measuring practical efficiency, we have to judge both the size of
the compressed file and the size of the table.

Adaptive Coding

• In the extreme, we can consider the whole file to be one symbol, and it
is then compressed to a single-bit message. However, the coding table
is then as long as the original file.
• Another restriction is that the symbols are fixed for the whole file
(message).
• A nice and elegant solution is adaptive coding, where the list of
symbols and codewords is generated 'on the fly', without the need to
store the codeword table.
• An asymptotically optimal coding is, e.g., the Lempel-Ziv coding.

Lempel-Ziv Coding
The source sequence is parsed into strings that have not appeared before.
For example, if the input is 1011010100010 . . . , it is parsed as
1, 0, 11, 01, 010, 00, 10, . . . . After determining each phrase, we take the
shortest string that has not appeared before. The coding follows (a sketch in
code appears below):
• Parse the input sequence as above and count the number of phrases.
This will be used to determine the length of the bit string referring to a
particular phrase.
• We code each phrase by specifying the id of its longest proper prefix (it
certainly already appeared and was parsed) and the extra bit. The empty
prefix is assigned index 0.
• Our example will be coded as
(000, 1)(000, 0)(001, 1)(010, 1)(100, 0)(010, 0)(001, 0).
• The length of the code can be further optimized, e.g. at the beginning of
the coding process the length of the bit string describing the phrase id
can be shorter than at the end. Note that in fact we do not need commas
and parentheses; it suffices to specify the length of the bit string
identifying the prefix.
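
A minimal Python sketch of the parsing and phrase encoding (my own
illustration; a trailing incomplete phrase, if any, is simply dropped):

    def lz_parse(bits):
        """Parse a 0/1 string into new phrases, coding each as (prefix id, last bit)."""
        phrases, cur, out = {}, "", []
        for b in bits:
            cur += b
            if cur not in phrases:
                phrases[cur] = len(phrases) + 1                  # id of the new phrase
                out.append((phrases.get(cur[:-1], 0), cur[-1]))  # empty prefix -> 0
                cur = ""
        return out

    # lz_parse("1011010100010")
    # -> [(0,'1'), (0,'0'), (1,'1'), (2,'1'), (4,'0'), (2,'0'), (1,'0')]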
Part V

Generating Discrete Distribution


Using Fair-Coin Tosses

Discrete Distribution and a Fair Coin

Example
Suppose we want to simulate a source described by a random variable X
with the distribution

X = a with probability 1/2, b with probability 1/4, c with probability 1/4,

using a sequence of fair coin tosses.

Discrete Distribution and a Fair Coin

The solution is pretty easy: if the outcome of the first coin toss is 0, we
set X = a; otherwise we perform another coin toss and set X = b if the
outcomes are 10 and X = c if the outcomes are 11.
The average number of fair coin tosses is 1.5, which equals the entropy
of X.
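
In code, this procedure is a short walk down the coding tree (a sketch of my
own; the function name is made up):

    import random

    def sample_abc(coin=lambda: random.randint(0, 1)):
        """Generate X over {a, b, c} with probabilities 1/2, 1/4, 1/4 from fair tosses."""
        if coin() == 0:
            return "a"                         # outcome 0 -> a (probability 1/2)
        return "b" if coin() == 0 else "c"     # 10 -> b, 11 -> c (1/4 each)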
Discrete Distribution and a Fair Coin

The general formulation of the problem is that we have a sequence of
fair coin tosses Z_1, Z_2, . . . and we want to generate a discrete random
variable X with the probability distribution p = (p_1, p_2, . . . , p_m). Let the
random variable T denote the number of coin flips used by the algorithm.

We can describe the algorithm mapping outcomes of Z_1, Z_2, . . . to
outcomes of X by a binary tree. Leaves of the tree are marked by
outcomes of X, and the path from the root to a particular leaf represents
the sequence of coin toss outcomes.

Discrete Distribution and a Fair Coin

The tree should satisfy:
1. It is full, i.e. every node is either a leaf or has two descendants in
the tree. The tree may be infinite.
2. The probability of a leaf at depth k is 2^{−k}. There can be more
leaves labeled by the same outcome of X; the sum of their
probabilities is the probability of this outcome.
3. The expected number of fair bits E(T) required to generate X is
equal to the expected depth of this tree.

Discrete Distribution and a Fair Coin

Example
Suppose we want to simulate a source described by a random variable X
with the distribution

X = a with probability 2/3, b with probability 1/3,

using a sequence of fair coin tosses.

Discrete Distribution and a Fair Coin

One solution is an infinite tree:
1 → a, 01 → b, 001 → a, 0001 → b, 00001 → a, . . .
Its expected depth is E(T) = ∑_{i≥1} i · 2^{−i} = 2.
An alternative construction uses two coin tosses: two of the four outcomes
yield "a", one yields "b", and the last one means "repeat".
H(X) = 0.918 . . .
Discrete Distribution and a Fair Coin
Lemma
Let Y denote the set of leaves of a full binary tree and Y a random
variable with a distribution on Y, where the probability of a leaf of depth
k is 2^{−k}. The expected depth of this tree is equal to the entropy of Y.

Proof.
Let k(y) denote the depth of y. The expected depth of the tree is

E(T) = ∑_{y ∈ Y} k(y) 2^{−k(y)}.

The entropy of Y is

H(Y) = − ∑_{y ∈ Y} (1/2^{k(y)}) log (1/2^{k(y)}) = ∑_{y ∈ Y} 2^{−k(y)} k(y) = E(T).

Discrete Distribution and a Fair Coin

Theorem
For any algorithm generating X, the expected number of fair bits used is
at least the entropy H(X), i.e.

E(T) ≥ H(X).

Proof.
Any algorithm generating X from fair bits can be represented by a binary
tree. Label all leaves by distinct symbols of Y. The tree may be infinite.
Consider the random variable Y defined on the leaves of the tree
such that for any leaf y of depth k the probability is P(Y = y) = 2^{−k}. By
the previous lemma, we get E(T) = H(Y). The random variable X is a
function of Y (several leaves may map to the same outcome of X), and
hence we have H(X) ≤ H(Y). Combining, we get that for any algorithm
H(X) ≤ E(T).

Dyadic Distribution and a Fair Coin

Theorem
Let X be a random variable with a dyadic (i.e. 2–adic) distribution.
The optimal algorithm to generate X from fair coin flips requires an
expected number of coin tosses equal to the entropy H(X).

Proof.
The previous theorem shows that we need at least H(X) bits to generate
X. We use the Huffman code tree to generate the variable. For a dyadic
distribution the Huffman code coincides with the Shannon-Fano code, so it
has codewords of length log (1/p(x)), and the probability of reaching such a
codeword is 2^{log p(x)} = p(x). Hence, the expected depth of the tree is H(X).

Non-dyadic Example

Example
Suppose we want to simulate a source described by a random variable X
with the distribution

X = a with probability 2/3, b with probability 1/3,

using a sequence of fair coin tosses.

Note that the binary expansion of 2/3 is 0.10101010101 . . . .

Discrete Distribution and Fair Coin

To deal with a general (non-dyadic) distribution, we have to find the
binary expansion of each probability, i.e.

p_i = ∑_{j≥0} p_i^{(j)},

where p_i^{(j)} is either 2^{−j} or 0. Now we assign to each nonzero p_i^{(j)} a
leaf of depth j in a binary tree. Their depths satisfy the Kraft inequality,
because ∑_{i,j} p_i^{(j)} = 1, and therefore we can always do this.
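
A tiny Python sketch (my own) extracting these depths, i.e. the positions of
the 1-bits in the binary expansion of p_i, truncated for non-terminating
expansions:

    def dyadic_leaves(p, max_depth=40):
        """Depths j with p^(j) = 2**-j > 0 in the binary expansion of p (a sketch)."""
        depths, frac = [], p
        for j in range(1, max_depth + 1):
            frac *= 2
            if frac >= 1:
                depths.append(j)
                frac -= 1
            if frac == 0:
                break
        return depths

    # dyadic_leaves(2/3) -> [1, 3, 5, ...]; dyadic_leaves(0.375) -> [2, 3]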

Discrete Distribution and Fair Coin

Theorem
The expected number of fair bits E(T) required by the optimal algorithm
to generate a random variable X is bounded as H(X) ≤ E(T) < H(X) + 2.

Proof.
The lower bound has already been proved; here we prove the upper bound.
Let us start with the initial distribution (p_1, p_2, . . . , p_m) and expand each of
the probabilities using the dyadic coefficients, i.e.

p_i = p_i^{(1)} + p_i^{(2)} + · · ·

with p_i^{(j)} ∈ {0, 2^{−j}}. Let us consider a new random variable Y with the
probability distribution p_1^{(1)}, p_1^{(2)}, . . . , p_2^{(1)}, p_2^{(2)}, . . . , p_m^{(1)}, p_m^{(2)}, . . . . We
construct the binary tree T for this dyadic probability distribution of Y.
Recall that the expected depth of T is H(Y).
Discrete Distribution and Fair Coin

Proof (Cont.)
X is a function of Y, giving

H(Y) = H(Y, X) = H(X) + H(Y | X).

It remains to show that H(Y | X) < 2.

Let us expand the entropy of Y as

H(Y) = − ∑_{i=1}^{m} ∑_{j≥1} p_i^{(j)} log p_i^{(j)} = ∑_{i=1}^{m} ∑_{j : p_i^{(j)} > 0} j 2^{−j}.

Let T_i denote the sum of the addends corresponding to p_i, i.e.

T_i = ∑_{j : p_i^{(j)} > 0} j 2^{−j}.

Discrete Distribution and Fair Coin

Proof (Cont.)
We can find n such that 2^{−(n−1)} > p_i ≥ 2^{−n}. This is equivalent to

n − 1 < − log p_i ≤ n.

We have p_i^{(j)} > 0 only if j ≥ n, and we rewrite T_i as

T_i = ∑_{j : j ≥ n, p_i^{(j)} > 0} j 2^{−j}.

Recall that

p_i = ∑_{j : j ≥ n, p_i^{(j)} > 0} 2^{−j}.

Discrete Distribution and Fair Coin

Proof (Cont.)
Next, we will show that T_i < −p_i log p_i + 2p_i. Let us expand

T_i + p_i log p_i − 2p_i < T_i − p_i (n − 1) − 2p_i = T_i − (n + 1) p_i
  = ∑_{j : j ≥ n, p_i^{(j)} > 0} j 2^{−j} − (n + 1) ∑_{j : j ≥ n, p_i^{(j)} > 0} 2^{−j}
  = ∑_{j : j ≥ n, p_i^{(j)} > 0} (j − n − 1) 2^{−j}
  = −2^{−n} + 0 + ∑_{j : j ≥ n+2, p_i^{(j)} > 0} (j − n − 1) 2^{−j}
  = −2^{−n} + ∑_{k : k ≥ 1, p_i^{(k+n+1)} > 0} k 2^{−(k+n+1)},

where k = j − n − 1.

Discrete Distribution and Fair Coin
Proof (Cont.)
Increasing the number of addends on the right-hand side, we get

−2^{−n} + ∑_{k : k ≥ 1, p_i^{(k+n+1)} > 0} k 2^{−(k+n+1)} ≤ −2^{−n} + ∑_{k ≥ 1} k 2^{−(k+n+1)}.

Due to the expected value of the geometric distribution (for p = 1/2), we get

−2^{−n} + 2^{−(n+1)} ∑_{k ≥ 1} k 2^{−k} = −2^{−n} + 2^{−(n+1)} · 2 = 0.

Using E(T) = ∑_i T_i and T_i < −p_i log p_i + 2p_i, we obtain the desired result:

E(T) = ∑_i T_i < − ∑_i p_i log p_i + 2 ∑_i p_i = H(X) + 2.

Lecture 12: Channel Capacity

Vojtěch Řehák
based on slides of Jan Bouda

Faculty of Informatics, Masaryk University

December 12, 2024

Part I

Channels, Codes, Rates, and Errors

Communication System - Motivation

Communication is a process transforming an input message W using an
encoder into a sequence of n input symbols of a channel. The channel then
transforms this sequence into a sequence of n output symbols. Finally, we
use a decoder to obtain an estimate Ŵ of the original message.

W → Encoder → X^n → Channel p(y | x) → Y^n → Decoder → Ŵ

The problem of data transmission (over a noisy channel) is dual to data
compression. During compression, we remove redundancy from the data.
During data transmission, we add redundancy in a controlled fashion to
fight errors in the channel.

Communication System - Definition
Definition
A discrete channel is a system (X , p(y | x), Y ) consisting of an input
alphabet X , output alphabet Y , and a probability transition matrix
p(y | x) specifying the probability we observe y ∈ Y when x ∈ X was sent.

Example
Binary symmetric channel preserves its input with probability 1 − p and it
outputs the negation of the input with probability p.

[diagram: 0 → 0 and 1 → 1 with probability 1 − p; 0 → 1 and 1 → 0 with probability p]

Sequential Use of a Channel
Let x^k = x_1, x_2, . . . , x_k and X^k = X_1, X_2, . . . , X_k, and similarly for y^k and Y^k.
Definition
A channel is said to be without feedback if the output distribution does
not depend on past output symbols, i.e. p(y_k | x^k, y^{k−1}) = p(y_k | x^k).
A channel is said to be memoryless if the output distribution depends
only on the current input and is conditionally independent of previous
channel inputs and outputs, i.e. p(y_k | x^k, y^{k−1}) = p(y_k | x_k).

W → Encoder → X^n → Channel p(y | x) → Y^n → Decoder → Ŵ

A message W ∈ {1, 2, . . . , M} is encoded into the signal X^n(W), which is
received as a random sequence Y^n ∼ p(y^n | x^n) by the receiver. The
receiver then guesses the message W by an appropriate decoding rule
Ŵ = g(Y^n). The receiver makes an error if W ≠ Ŵ.
Channel Extension

The following definition allows us to model a sequence of n channel uses.

Definition
The n-th extension of the discrete memoryless channel is the channel
(X^n, p(y^n | x^n), Y^n), where

p(y_k | x^k, y^{k−1}) = p(y_k | x_k) for k = 1, 2, . . . , n.

Hence, the channel transition function for the n-th extension of a discrete
memoryless channel reduces to

p(y^n | x^n) = ∏_{i=1}^{n} p(y_i | x_i).

(M, n) Codes

Definition
An (M, n) code for the channel (X , p(y | x), Y ) consists of the following:
1. A set of input messages {1, 2, . . . , M}.
2. An encoding function f : {1, 2, . . . , M} → X n , yielding codewords
f (1), f (2), . . . , f (M).
3. A decoding function g : Y^n → {1, 2, . . . , M}, which is a deterministic
rule assigning a guess to each possible received vector.

Set of M elements to be transferred by n channel uses each.


Example
We have proposed the code function f (1) = 000 and f (2) = 111 for the
binary symmetric channel.
What is the type of this code?

Example: Binary Symmetric Channel

Example
Binary symmetric channel preserves its input with probability 1 − p and it
outputs the negation of the input with probability p.
What is the probability of an incorrect decoding using binary symmetric
channel and encoding function f and decoding function g ?

[diagram: BSC with crossover probability p]

f(1) = 000, f(2) = 111
g(000) = g(001) = g(010) = g(100) = 1
g(011) = g(101) = g(110) = g(111) = 2

P_Error = P(W = 1, Ŵ = 2) + P(W = 2, Ŵ = 1)
  = P(Ŵ = 2 | W = 1) P(W = 1) + P(Ŵ = 1 | W = 2) P(W = 2).

Note that P(W = 1) = 1 − P(W = 2).
Due to symmetry, P(Ŵ = 2 | W = 1) = P(Ŵ = 1 | W = 2) = x, and so:

P_Error = x · (1 − P(W = 2)) + x · P(W = 2) = x = P(Ŵ = 2 | W = 1)
  = P(Y^3 ∈ {111, 110, 101, 011} | X^3 = 000) = p^3 + 3p^2(1 − p) = 3p^2 − 2p^3.

Common discussion: when is this better than sending a single bit?
3p^2 − 2p^3 < p ⟺ 0 < 2p^3 − 3p^2 + p = p(2p − 1)(p − 1).
Hence, the roots are 0, 1/2, and 1, and the repetition code is strictly better
for p ∈ (0, 1/2).
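
For reference, this decoding-error probability as a one-line Python function
(my own sketch):

    def triple_repetition_error(p):
        """P(majority vote fails) for the 3-bit repetition code over a BSC(p)."""
        return p**3 + 3 * p**2 * (1 - p)   # = 3p^2 - 2p^3, < p for p in (0, 1/2)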
Error Probability

Definition
The probability of an error for the code (M, n) and the channel
(X, p(y | x), Y), provided the ith message was sent, is

λ_i = P(g(Y^n) ≠ i | X^n = f(i)) = ∑_{y^n : g(y^n) ≠ i} p(y^n | f(i)).

Definition
The maximal probability of an error for an (M, n) code is defined as

λ_max = max_{i ∈ {1,2,...,M}} λ_i.

Average Error Probability

Definition
The (arithmetic) average probability of error for an (M, n) code is
defined as

P_e^{(n)} = (1/M) ∑_{i=1}^{M} λ_i.

Note that P_e^{(n)} = P(I ≠ g(Y^n)) if I describes an index chosen uniformly from
the set {1, 2, . . . , M}.
Also note that P_e^{(n)} ≤ λ_max. This is important, as we will prove that λ_max
can be pushed to 0.

Code Rate

Definition
The rate R of an (M, n) code is

R = (log_2 M) / n bits per transmission.

Intuitively, the rate expresses the ratio between the number of message
bits and the number of channel uses, i.e. it expresses the number of
non-redundant bits per transmission.
Example
What is the rate of the code function f (1) = 000 and f (2) = 111?

What is the rate of an (8, 3) code?

The triple code is a (2, 3) code, hence its rate is log_2(2)/3 = 1/3 of a bit.
The rate of an (8, 3) code is log_2(8)/3 = 3/3 = 1 bit.
Part II

Channel Capacity

Channel capacity - motivation

Intuitively, channel capacity is the noiseless throughput of the channel.

As we will show later, it specifies the highest rate (number of


non-redundant bits per channel use) at which information can be sent with
arbitrarily low error.

Noiseless binary channel

Example
• Let us consider a channel with binary input that faithfully reproduces
its input on the output.
[diagram: 0 → a, 1 → b, each with probability 1]
• The channel is error-free and we can obviously transmit one bit per
channel use.
• The capacity is 1 bit.

Noisy channel with non-overlapping outputs

Example
• This channel has two inputs and to each of them correspond two
possible outputs. Outputs for different inputs are different.
[diagram: input 0 → a (prob. p) or b (prob. 1 − p); input 1 → c (prob. p) or d (prob. 1 − p)]
• This channel appears to be noisy, but in fact it is not. Every input
can be recovered from the output without error.
• Capacity of this channel is also 1 bit.

Binary Symmetric Channel

Example
• Binary symmetric channel preserves its input with probability 1 − p
and it outputs the negation of the input with probability p.
[diagram: BSC with crossover probability p]
• The capacity depends on the probability p.

Channel capacity

Definition
The channel capacity of a discrete memoryless channel is

C = max_{p_X} I(X; Y),

where X is the input random variable, Y describes the output distribution,
and the maximum is taken over all possible input distributions p_X.

Note that

C = max_{p_X} I(X; Y) = max_{p_X} [H(Y) − H(Y | X)],

where for symmetric channels, H(Y | X) depends only on the channel
transition matrix p(y | x).
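
The maximization can also be approximated numerically. The following
Python sketch (my own; a crude grid search rather than, e.g., the
Blahut-Arimoto algorithm) computes I(X; Y) from an input distribution and a
transition matrix, and C for a binary-input channel:

    from math import log2

    def mutual_information(q, P):
        """I(X;Y) in bits: q is the input distribution, P[x][y] the channel matrix."""
        ny = len(P[0])
        py = [sum(q[x] * P[x][y] for x in range(len(q))) for y in range(ny)]
        return sum(q[x] * P[x][y] * log2(P[x][y] / py[y])
                   for x in range(len(q)) for y in range(ny)
                   if q[x] > 0 and P[x][y] > 0)

    def capacity_binary_input(P, steps=10000):
        """Grid search for C = max_q I(X;Y) over binary input distributions."""
        return max(mutual_information([a / steps, 1 - a / steps], P)
                   for a in range(steps + 1))

    # capacity_binary_input([[0.9, 0.1], [0.1, 0.9]]) ~ 1 - H(0.1) ~ 0.531 bits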

Noisy channel with non-overlapping outputs

[diagram: input 0 → a (prob. p) or b (prob. 1 − p); input 1 → c (prob. p) or d (prob. 1 − p)]

The capacity of this channel is 1 bit, which agrees with the quantity C;
the maximum is attained for the uniform input distribution.

Noisy channel with non-overlapping outputs
Formally, let P(X = 0) = q; then the mutual information is

I(X; Y) = H(Y) − H(Y | X)
  = H(Y) − (P(X = 0) · H(Y | X = 0) + P(X = 1) · H(Y | X = 1))
  = H(Y) − q · H(p, 1 − p) − (1 − q) · H(p, 1 − p)
  = H(Y) − H(p, 1 − p)
  = H(qp, q(1 − p), (1 − q)p, (1 − q)(1 − p)) − H(p, 1 − p)
  = H(q, 1 − q) + q · H(p, 1 − p) + (1 − q) · H(p, 1 − p) − H(p, 1 − p)
  = H(q, 1 − q) = H(X).

Note that we have used the grouping identity

H(p_1, . . . , p_m, p_{m+1}, . . . , p_{m+n}) = H(p, q) + p · H(p_1/p, . . . , p_m/p) + q · H(p_{m+1}/q, . . . , p_{m+n}/q).

Binary Symmetric Channel

The binary symmetric channel preserves its input with probability 1 − p and
outputs the negation of the input with probability p.

[diagram: 0 → 0 and 1 → 1 with probability 1 − p; 0 → 1 and 1 → 0 with probability p]

Binary Symmetric Channel

Mutual information is bounded by

I(X; Y) = H(Y) − H(Y | X) = H(Y) − ∑_x p(x) H(Y | X = x)
  = H(Y) − ∑_x p(x) H(p, 1 − p) = H(Y) − H(p, 1 − p)
  ≤ 1 − H(p, 1 − p).

Equality is achieved when the input distribution is uniform. Hence, the
information capacity of a binary symmetric channel with error probability p
is

C = 1 − H(p, 1 − p) bits.

Note that C = 0 for p = 1/2, and C = 1 for both p = 0 and p = 1.
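
As a small sketch (my own), the capacity as a function of p:

    from math import log2

    def h2(p):
        """Binary entropy H(p, 1 - p) in bits."""
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    def bsc_capacity(p):
        return 1 - h2(p)   # bsc_capacity(0.5) == 0.0, bsc_capacity(0.11) ~ 0.5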

Noiseless binary channel

Example
What is the capacity of this channel?

[diagram: 0 → a, 1 → b, each with probability 1]

• By the previous example with p = 0, the capacity is
C = max_{p_X} I(X; Y) = 1, attained for the uniform distribution on the input.

Noiseless Quaternary channel

Example
What is the capacity of this channel?

[diagram: 0 → a, 1 → b, 2 → c, 3 → d, each with probability 1]

• Similarly to the previous example, the capacity is
C = max_{p_X} I(X; Y) = 2, attained for the uniform distribution on the input.

Noisy Typewriter

• Let us suppose that the input alphabet has k letters (and the output
alphabet is the same here).
• Each symbol either remains unchanged (with probability 1/2) or is
received as the next letter (with probability 1/2).

[diagram: each input letter goes to itself or to the next letter, each with probability 1/2]

Noisy Typewriter

The channel capacity is

C = max_{p_X} I(X; Y) = max_{p_X} [H(Y) − H(Y | X)] = max_{p_X} H(Y) − 1
  = log 26 − 1 = log 26 − log 2 = log(26/2) = log 13,

since H(Y | X) = H(1/2, 1/2) = 1 is independent of p_X.

Alternatively: if we use every alternate symbol of the 26 input symbols, the
13 symbols can be transmitted faithfully. Therefore, this way we may
transmit log 13 error-free bits per channel use.

Binary erasure channel

The binary erasure channel either preserves the input faithfully or erases it
(with probability α). The receiver recognizes the bits that have been erased.
We model the erasure as a specific output symbol e.

[diagram: 0 → 0 and 1 → 1 with probability 1 − α; each input is erased to e with probability α]

Binary erasure channel

The capacity may be calculated as follows:

C = max_{p_X} I(X; Y) = max_{p_X} (H(Y) − H(Y | X)) = max_{p_X} H(Y) − H(α, 1 − α).

It remains to determine the maximum of H(Y). Let us define E by
E = 0 ⇔ Y = e and E = 1 otherwise. We use the expansion

H(Y) = H(Y, E) = H(E) + H(Y | E).

Note that:
H(E) = H(α, 1 − α),
H(Y | E) = P(E = 1) · H(Y | E = 1) + P(E = 0) · H(Y | E = 0) = (1 − α) · H(X) + α · 0.

Binary erasure channel

Hence,

C = max_{p_X} H(Y) − H(α, 1 − α)
  = max_{p_X} [(1 − α) H(X) + H(α, 1 − α)] − H(α, 1 − α)
  = max_π (1 − α) H(π, 1 − π) = 1 − α,

where the maximum is achieved for uniform p_X, i.e. π = 1/2.

In this case the interpretation is very intuitive: a fraction α of the symbols is
lost in the channel, so we can recover only a fraction 1 − α.

Symmetric channels

Let us consider a channel with the transition matrix

p(y | x) =
  ( 0.3 0.2 0.5 )
  ( 0.5 0.3 0.2 )
  ( 0.2 0.5 0.3 ),   (1)

with the entry in the xth row and yth column giving the probability that y is
received when x is sent. All the rows are permutations of each other, and
the same holds for all columns. We say that such a channel is symmetric.
A symmetric channel may alternatively be specified by

Y = X + Z mod c,

where Z is some distribution on the integers 0, 1, 2, . . . , c − 1, the input X has
the same alphabet as Z, and X and Z are independent.

Symmetric channels

Y = X + Z mod c

Note that an arbitrary row of the transition matrix is a (cyclically shifted)
distribution of Z. Hence, H(Y | X) = H(Z), and

I(X; Y) = H(Y) − H(Y | X) = H(Y) − H(Z) ≤ log c − H(Z),

with equality if the distribution of Y is uniform. Note that the uniform input
distribution P(X = x) = 1/c yields the uniform distribution of the output, since

P(Y = y) = ∑_x P(Y = y | X = x) P(X = x) = (1/c) ∑_x P(Y = y | X = x) = 1/c,

as the entries in each column of the probability transition matrix sum to 1.
Therefore, symmetric channels have capacity

C = max_{p_X} I(X; Y) = log c − H(Z).
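
In code (my own sketch), the capacity of a symmetric channel needs only one
row of the matrix:

    from math import log2

    def symmetric_capacity(row):
        """C = log2(c) - H(Z) for a symmetric channel whose rows permute `row`."""
        Hz = -sum(p * log2(p) for p in row if p > 0)
        return log2(len(row)) - Hz

    # symmetric_capacity([0.3, 0.2, 0.5]) ~ 0.0995 bits per channel use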

Properties of Channel Capacity

• C ≥ 0
  ▶ since I(X; Y) ≥ 0
• C ≤ log |Im(X)|
  ▶ since max_{p_X} I(X; Y) = max_{p_X} [H(X) − H(X | Y)] ≤ max_{p_X} H(X) = log |Im(X)|
• C ≤ log |Im(Y)|
  ▶ since I(X; Y) = I(Y; X)

Part III

Channel Coding Theorem

Definition Recall

Definition
The probability of an error provided the ith message was sent is λ_i.

Definition
The maximal probability of an error for an (M, n) code is defined as

λ_max = max_{i ∈ {1,2,...,M}} λ_i.

Definition
The rate R of an (M, n) code is

R = (log_2 M) / n bits per transmission.

Channel Coding Theorem

Theorem (The channel coding theorem)
Let C be the capacity of a communication channel and R a code rate.
• If R < C, then for any ε > 0 there exists a code with rate R (and large
enough block length n) whose error probability λ_max is less than ε.
• If R > C, the error probability λ_max of any code with rate R is
bounded away from zero.

The theorem was published in 1948 by Claude Shannon, who gave only an
outline of the proof. The first rigorous proof for the discrete case is due to
Amiel Feinstein in 1954. But there was no construction of the codes; an
explicit construction (named polar codes) was published by Erdal Arikan in
2009, i.e. 60 years later.

Comments on the Channel Coding Theorem

• In order to establish reliable transmission over a noisy channel, we
encode the message W into a string of n symbols from the channel
input alphabet.
• We do not use all possible n-symbol sequences as codewords.
• We want to select a subset C of n-symbol sequences such that for
any x_1^n, x_2^n ∈ C the possible channel outputs corresponding to x_1^n and
x_2^n are disjoint.
• Then the situation is analogous to the typewriter example, and we can
decode the original message faithfully.

Channel Coding Theorem and Typicality

• For a r.v. X and large n, there are 2^{nH(X)} distinct outputs of X^n a.s.
• For each (typical) input n-symbol sequence, there are approximately
2^{nH(Y|X)} possible output sequences, all of them roughly equally likely.
• We want to ensure that no two input sequences produce the same
output sequence.
• The total number of typical output sequences is approx. 2^{nH(Y)}.
• This gives that the total number of disjoint input sequences is

2^{nH(Y)} / 2^{nH(Y|X)} = 2^{nH(Y) − nH(Y|X)} = 2^{nI(X;Y)},

which establishes the approximate number of distinguishable sequences
we can send.

