Markov Processes Lecture Notes

Module MA2404/MA7404
Markov Processes

Edition 3, March 2022
Preface
Welcome to module MA2404/MA7404 Markov Processes.
The module has no formal prerequisites beyond familiarity with the standard
mathematics covered in first-year undergraduate courses.
An important thread through all actuarial and financial disciplines is the use
of appropriate models for various types of risks. The aim of this module is to
provide an introduction to risk modelling, with emphasis on Markov models.
We begin the module with a short review of probability theory, basic statistics,
and stochastic processes, and then study risk models, the theory of Markov
processes, and their applications to actuarial and financial modelling.
Dr Tetiana Grechuk
Contents

1 Probability theory and stochastic processes
1.1 Probability space
1.2 Random variables and their expectations
1.3 Variance, covariance, and correlation
1.4 Probability distribution
1.5 Examples of discrete probability distributions
1.6 Examples of continuous probability distributions
1.7 Independence
1.8 Conditional Probability and Expectation
1.9 Stochastic processes
1.10 Summary
1.11 Questions

4.5 Joint distributions and copulas
4.6 Dependence of distribution tails
4.7 Summary
4.8 Questions

5 Markov Chains
5.1 The Markov property
5.2 Definition of Markov Chains
5.3 The Chapman-Kolmogorov equations
5.4 Time dependency of Markov chains
5.5 Further applications
5.5.1 The simple (unrestricted) random walk
5.5.2 The restricted random walk
5.5.3 The modified NCD model
5.5.4 A model of accident proneness
5.5.5 A model for credit rating dynamics
5.5.6 General principles of modelling using Markov chains
5.6 Stationary distributions
5.7 The long-term behaviour of Markov chains
5.8 Summary
5.9 Questions

7.5 Stages of analysis in Machine Learning
7.6 Summary
7.7 Questions
Chapter 1
Probability theory and stochastic processes
The aim of this chapter is to give a review of probability theory and basic
statistics, and a very short introduction to stochastic processes. This
background material is necessary for understanding the models that will be
developed in later chapters. While the rigorous development of probability
theory and stochastic processes can be very technical, we have attempted to
avoid unnecessary technicalities in this text. The focus is on understanding
the material at an intuitive level, while also developing enough technical
knowledge to solve quantitative problems when necessary.
$$\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\},$$
which we will define the probability (such subsets we will call "events"), and
subsets for which the probability is undefined. The set of all events will be
denoted $\mathcal{F}$. We assume that $\mathcal{F}$ satisfies the following properties:

1. $\Omega \in \mathcal{F}$;
2. if $A_1, A_2, \dots \in \mathcal{F}$, then $\bigcup_{n=1}^{\infty} A_n \in \mathcal{F}$ and $\bigcap_{n=1}^{\infty} A_n \in \mathcal{F}$;
3. if $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$.

Here $\cup$ denotes the union of sets, $\cap$ denotes the intersection, and $A^c = \{\omega \in \Omega \mid \omega \notin A\}$ denotes the complement of the set $A$.
1. $P(\Omega) = 1$;
2. $P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n)$, provided $A_1, A_2, \dots \in \mathcal{F}$ is a collection of pairwise disjoint sets, i.e. $A_i \cap A_j = \emptyset$ for all $i \neq j$.

Then we call such a $P$ a probability measure and the triple $(\Omega, \mathcal{F}, P)$ a probability space.
An important example of a probability space is the so-called standard probability space $([0,1], \mathcal{B}, \lambda)$, where $\mathcal{B}$ is the smallest set satisfying the properties 1-3 above and containing all intervals of the type $(a,b)$, $(a,b]$, $[a,b)$, $[a,b] \subset [0,1]$, and $\lambda$ denotes the Lebesgue measure, that is, the natural extension of the functional which assigns to each interval its length:
$$\lambda\big((a,b]\big) = b - a.$$
$$\xi : \Omega \to \mathbb{R}$$
We may interpret a random variable as a number produced by an experiment, such as the temperature or the oil price tomorrow. The mathematical formalism allows us to be precise. In particular, the property of measurability is important to make sense of the probability of the event that a random variable does not exceed a threshold: $P(\xi \le r) := P(\{\omega : \xi(\omega) \le r\})$.
For example, consider the standard probability space $([0,1], \mathcal{B}, \lambda)$ and a function $\xi$ such that $\xi(\omega) = a$ for $\omega \le p$, and $\xi(\omega) = b$ for $\omega > p$, where $p \in (0,1)$ and $a \neq b$. Then $\xi$ is a random variable assuming just two different values $a$ and $b$ with probabilities $p \in (0,1)$ and $q := 1-p$ respectively. It is called a Bernoulli random variable. If $a = 1$ and $b = 0$, $\xi$ is called a standard Bernoulli variable.
Another important example is the random variable
$$\xi(\omega) = a + (b-a)\omega \qquad (1)$$
defined on the standard probability space. This is the random variable whose possible values are all real numbers in the interval $[a,b]$.
The expectation of a random variable is, informally, the average of its possible values weighted according to their probabilities. Sometimes in the literature the expectation is also called the mathematical expectation or the mean. For example, in the experiment of throwing a die, the possible outcomes are $\Omega = \{1, 2, 3, 4, 5, 6\}$ with equal probabilities, and the average outcome is $(1+2+3+4+5+6)/6 = 3.5$. More generally, if the possible values of $\xi$ are $x_1, x_2, \ldots, x_n$, and they happen with probabilities $p_1, p_2, \ldots, p_n$, respectively, then
$$E[\xi] = \sum_{i=1}^{n} x_i p_i. \qquad (2)$$
Similarly, if $\xi$ takes infinitely many values
$$x_1, x_2, \ldots, x_n, \ldots$$
with probabilities
$$p_1, p_2, \ldots, p_n, \ldots,$$
respectively, where each $p_n \ge 0$ and $\sum_{n=1}^{\infty} p_n = 1$, then
$$E[\xi] = \sum_{i=1}^{\infty} x_i p_i. \qquad (3)$$
To calculate the average of possible values for a general random variable (for example, one that may take any value from some interval, such as (1)), the summation in (2) and (3) is replaced by integration. By definition, the expectation of a random variable is its integral over $\Omega$ with respect to $P$:
$$E[\xi] \equiv \int_{\Omega} \xi \, dP \equiv \int_{\Omega} \xi(\omega)\, dP(\omega).$$
It is easy to check that for a random variable taking finitely many values this integral reduces to (2). For example, for the Bernoulli random variable,
$$E[\xi] = \int_0^1 \xi(\omega)\, d\omega = \int_0^p a\, d\omega + \int_p^1 b\, d\omega = ap + b(1-p).$$
If $\xi$ measures some real-life phenomenon, such as the remaining lifetime of an individual, $E[\xi]$ indicates how big $\xi$ is expected to be on average, and may serve as a forecast of how long the individual is expected to survive. Variance measures how big (the square of) the difference $\xi - E[\xi]$ is, and therefore indicates how close the prediction $E[\xi]$ is to reality. Mean and variance are therefore two fundamental quantities associated with a random variable.
Not every ξ ∈ L1 has a finite variance. The class of random variables with
finite variance is denoted by L2 (Ω, F, P ) or just L2 . A random variable with
finite variance is called square-integrable. Variance is always non-negative
and is equal to zero only for constants.
The square root of the variance, $\sigma(\xi) = \sqrt{\mathrm{Var}(\xi)}$, is called the standard deviation of $\xi$.
Example 1.1. For $\xi$ defined in (1),
$$E[\xi^2] = \int_0^1 (a + (b-a)\omega)^2\, d\omega = (a^2 + ab + b^2)/3,$$
and
$$\sigma(\xi) = \sqrt{\mathrm{Var}(\xi)} = \sqrt{(a^2+ab+b^2)/3 - (a+b)^2/4} = (b-a)/\sqrt{12}.$$
The Cauchy-Schwarz inequality says that
$$\big(E[\xi\eta]\big)^2 \le E[\xi^2]\, E[\eta^2] \quad \text{for all } \xi, \eta \in L^2,$$
and one can easily deduce that
$$-1 \le \mathrm{Corr}(\xi, \eta) \le 1 \quad \text{for all } \xi, \eta \in L^2.$$
The correlation equals 1 if and only if $\xi = a\eta$ for some constant $a > 0$, and equals $-1$ if and only if $\xi = a\eta$ for some $a < 0$. If $\xi$ and $\eta$ are independent, then the covariance $\mathrm{Cov}(\xi,\eta)$ and the correlation $\mathrm{Corr}(\xi,\eta)$ are 0. If $\mathrm{Corr}(\xi,\eta) > 0$, the random variables $\xi$ and $\eta$ are called positively correlated, and the intuition is that higher values of $\xi$ are an indication of higher values of $\eta$. If $\mathrm{Corr}(\xi,\eta) < 0$, $\xi$ and $\eta$ are called negatively correlated. For example, if $\xi$ is the outdoor temperature and $\eta$ is the number of elderly people who die on the street, then $\xi$ and $\eta$ may be positively correlated during the summer (higher temperatures mean a hotter summer, and elderly people may be badly affected by very hot weather) but negatively correlated during the winter (higher temperatures mean a less cold winter).
Example 1.2. An $n$-sided die has numbers $1, 2, \ldots, n$ on its sides, each of which shows on its upper surface with the same probability $1/n$ when the die is thrown or rolled. If $\xi$ is the corresponding random variable, it is discretely distributed, and its CDF $F_\xi(x)$ is a piecewise constant function given by
$$F_\xi(x) = \begin{cases} 0 & \text{if } x < 1, \\ i/n & \text{if } i \le x < i+1, \ i = 1, 2, \ldots, n-1, \\ 1 & \text{if } n \le x. \end{cases}$$
By (2),
$$E[\xi] = \frac{1}{n}\sum_{i=1}^{n} i = \frac{1}{n}\cdot\frac{n(n+1)}{2} = \frac{n+1}{2},$$
and
$$E[\xi^2] = \frac{1}{n}\sum_{i=1}^{n} i^2 = \frac{1}{n}\cdot\frac{n(n+1)(2n+1)}{6} = \frac{(n+1)(2n+1)}{6},$$
hence
$$\mathrm{Var}[\xi] = E[\xi^2] - (E[\xi])^2 = \frac{n^2-1}{12}. \qquad (4)$$
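As a quick sanity check of (4), here is a minimal simulation sketch (Python; $n = 6$ is an arbitrary choice).

```python
import numpy as np

# Monte Carlo check of Example 1.2 for a fair n-sided die.
rng = np.random.default_rng(1)
n = 6
rolls = rng.integers(1, n + 1, size=1_000_000)

print(rolls.mean())  # close to (n + 1) / 2 = 3.5
print(rolls.var())   # close to (n^2 - 1) / 12 ≈ 2.9167
```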
Example 1.3. For the random variable $\xi$ defined in (1),
$$F_\xi(x) = P(\xi \le x) = 0 \quad \text{if } x < a,$$
$$F_\xi(x) = P(\xi \le x) = \lambda(\omega : a + (b-a)\omega \le x) = \frac{x-a}{b-a} \quad \text{if } a \le x \le b,$$
where $\lambda$ denotes the length of the interval, and
$$F_\xi(x) = P(\xi \le x) = 1 \quad \text{if } b < x.$$
It is easy to check that
$$F_\xi(x) = \int_{-\infty}^{x} \rho_\xi(z)\, dz$$
for the function $\rho_\xi(z) = 1/(b-a)$ for $z \in [a,b]$ (and $\rho_\xi(z) = 0$ for $z \notin [a,b]$). Hence, $\xi$ is continuously distributed with PDF $\rho_\xi$. In fact, a random variable $\xi$ with this density is called uniformly distributed on $[a,b]$. In this case we usually write $\xi \sim U(a,b)$.
We have
$$E[\xi] = \int_{-\infty}^{\infty} x\rho_\xi(x)\, dx = \int_a^b \frac{x}{b-a}\, dx = \frac{a+b}{2},$$
and
$$\mathrm{Var}(\xi) = \int_{-\infty}^{\infty} \big(x - E[\xi]\big)^2 \rho_\xi(x)\, dx = \int_a^b \frac{(x-E[\xi])^2}{b-a}\, dx = \frac{(b-a)^2}{12}.$$
If the joint CDF of random variables $\xi_1, \ldots, \xi_d$ can be represented as
$$F_{\xi_1,\ldots,\xi_d}(x_1,\ldots,x_d) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_d} \rho_{\xi_1,\ldots,\xi_d}(z_1,\ldots,z_d)\, dz_1 \ldots dz_d$$
for some non-negative function $\rho_{\xi_1,\ldots,\xi_d}(z_1,\ldots,z_d)$, the latter is called the joint PDF of $\xi_1, \ldots, \xi_d$.
If for some random variables $\xi_1, \ldots, \xi_d$ and function $f : \mathbb{R}^d \to \mathbb{R}$ the expectation $E[f(\xi_1,\ldots,\xi_d)]$ exists, then
$$E\big[f(\xi_1,\ldots,\xi_d)\big] = \int_{\mathbb{R}^d} f(x_1,\ldots,x_d)\, \rho_{\xi_1,\ldots,\xi_d}(x_1,\ldots,x_d)\, dx_1 \ldots dx_d.$$
and
$$\mathrm{Var}[\xi] = E[\xi^2] - (E[\xi])^2 = (a-b)^2 p(1-p).$$
If $a = 1$ and $b = 0$, $\xi$ is called a standard Bernoulli variable. In this case, the binomial distribution $\mathrm{Bin}(1,p)$ describes a single trial, hence in this case $\xi$ is just a standard Bernoulli variable. In general, if $\xi_1, \ldots, \xi_n$ is a sequence of i.i.d. standard Bernoulli variables with parameter $p$, then
$$\xi = \xi_1 + \xi_2 + \cdots + \xi_n$$
has the binomial distribution $\mathrm{Bin}(n,p)$. In particular,
$$E[\xi] = \sum_{k=1}^{n} E[\xi_k] = \sum_{k=1}^{n} p = np,$$
and
$$\mathrm{Var}[\xi] = \sum_{k=1}^{n} \mathrm{Var}[\xi_k] = \sum_{k=1}^{n} p(1-p) = np(1-p).$$
• $\xi$ follows the geometric distribution with parameter $p \in (0,1)$ if it takes values $0, 1, 2, \ldots$ with probabilities
$$p_n = (1-p)^n p, \qquad n = 0, 1, 2, \ldots$$
Differentiating the geometric series identity $\sum_{n=0}^{\infty}(1-p)^n = \frac{1}{p}$ with respect to $p$ gives
$$\sum_{n=1}^{\infty} -n(1-p)^{n-1} = -\frac{1}{p^2},$$
from which one finds $E[\xi] = \sum_{n=1}^{\infty} n(1-p)^n p = \frac{1-p}{p}$.
• $\xi$ follows the negative binomial distribution with parameters $k$ (a positive integer) and $p$ (a real number in $[0,1]$) if it takes values $0, 1, 2, \ldots$ with probabilities
$$p_n = \frac{(k+n-1)!}{n!(k-1)!}\, p^k (1-p)^n, \qquad n = 0, 1, 2, \ldots$$
In this case, we write $\xi \sim NB(k,p)$. For example, if $k = 1$, then
$$p_n = \frac{(1+n-1)!}{n!(1-1)!}\, p(1-p)^n = p(1-p)^n, \qquad n = 0, 1, 2, \ldots,$$
hence $NB(1,p)$ is just the geometric distribution. In general, if $\xi_1, \ldots, \xi_k$ is a sequence of i.i.d. geometric variables with parameter $p$, then
$$\xi = \xi_1 + \xi_2 + \cdots + \xi_k$$
has the negative binomial distribution $NB(k,p)$. In particular,
$$E[\xi] = \sum_{i=1}^{k} E[\xi_i] = \sum_{i=1}^{k} \frac{1-p}{p} = \frac{k(1-p)}{p},$$
and
$$\mathrm{Var}[\xi] = \sum_{i=1}^{k} \mathrm{Var}[\xi_i] = \sum_{i=1}^{k} \frac{1-p}{p^2} = \frac{k(1-p)}{p^2}.$$
The negative binomial distribution can serve as a model for the total number of claims if this number is not bounded from above.
• $\xi$ follows the Poisson distribution with parameter $\lambda > 0$ if it takes values $0, 1, 2, \ldots$ with probabilities
$$p_n = \frac{\lambda^n e^{-\lambda}}{n!}.$$
Using the series expansion of the exponential function, $e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}$, one can easily compute the expectation of the Poisson distribution:
$$E[\xi] = \sum_{n=1}^{\infty} n \cdot p_n = \sum_{n=1}^{\infty} n\,\frac{\lambda^n e^{-\lambda}}{n!} = \lambda e^{-\lambda}\sum_{n=1}^{\infty} \frac{\lambda^{n-1}}{(n-1)!} = \lambda e^{-\lambda} e^{\lambda} = \lambda.$$
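For the Poisson distribution the variance also equals $\lambda$ (a known property that can be derived from the same series expansion). The sketch below (Python; $\lambda = 4$ is an arbitrary choice) checks both facts empirically.

```python
import numpy as np

# Empirical check that a Poisson(lam) sample has mean ≈ variance ≈ lam.
rng = np.random.default_rng(2)
lam = 4.0
sample = rng.poisson(lam, size=1_000_000)
print(sample.mean(), sample.var())  # both close to 4
```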
1.6 Examples of continuous probability distributions
One example (the uniform probability distribution) was studied in Example 1.3. This section provides more examples, with emphasis on those useful in insurance modelling.
Example 1.4. Question: We say that a random variable $\xi$ has the exponential distribution with parameter $\lambda > 0$, and write $\xi \sim \mathrm{Exp}(\lambda)$, if its probability density function is
$$\rho_\xi(x) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0, \qquad (5)$$
and $\rho_\xi(x) = 0$ for $x < 0$. Calculate $F_\xi(x)$, $F_\xi^{-1}(\alpha)$, $E[\xi]$ and $\mathrm{Var}(\xi)$.
Answer: For any $x \ge 0$,
$$F_\xi(x) = \int_{-\infty}^{x} \rho_\xi(z)\, dz = \int_0^x \lambda e^{-\lambda z}\, dz = 1 - e^{-\lambda x}.$$
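Solving $1 - e^{-\lambda x} = \alpha$ gives the quantile function $F_\xi^{-1}(\alpha) = -\log(1-\alpha)/\lambda$, which is the basis of inverse-transform sampling. A minimal sketch (Python; the rate $\lambda = 0.5$ is an arbitrary choice):

```python
import numpy as np

# Inverse-transform sampling for Exp(lam) using F^{-1}(alpha) = -log(1 - alpha) / lam.
rng = np.random.default_rng(3)
lam = 0.5
u = rng.uniform(size=1_000_000)
x = -np.log(1.0 - u) / lam

print(x.mean())  # close to E[xi] = 1/lam = 2
print(x.var())   # close to Var(xi) = 1/lam^2 = 4
```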
The mean and variance of the lognormal distribution are
$$E[\xi] = e^{\mu + \sigma^2/2}, \qquad \mathrm{Var}(\xi) = (e^{\sigma^2} - 1)e^{2\mu + \sigma^2}.$$
The mean of the Pareto distribution (9) is
$$E[\xi] = \frac{\lambda}{\alpha - 1}, \qquad \alpha > 1,$$
and is infinite if $\alpha \le 1$. The variance of $\xi$ is
$$\mathrm{Var}(\xi) = \frac{\alpha\lambda^2}{(\alpha - 1)^2(\alpha - 2)}, \qquad \alpha > 2,$$
and is infinite if $\alpha \le 2$.
• $\xi$ has the Burr distribution with parameters $\alpha > 0$, $\lambda > 0$, and $\gamma > 0$ if it has PDF
$$\rho_\xi(x) = \frac{\gamma\alpha\lambda^\alpha x^{\gamma-1}}{(\lambda + x^\gamma)^{\alpha+1}}, \qquad x > 0. \qquad (10)$$
In this case we write $\xi \sim \mathrm{Burr}(\alpha, \lambda, \gamma)$. The Burr distribution has CDF
$$F_\xi(x) = 1 - \left(\frac{\lambda}{\lambda + x^\gamma}\right)^{\alpha}, \qquad x > 0.$$
By solving the equation $1 - \left(\frac{\lambda}{\lambda + x^\gamma}\right)^{\alpha} = \beta$, we find the quantile function of the Burr distribution:
$$F_\xi^{-1}(\beta) = \left(\lambda(1-\beta)^{-1/\alpha} - \lambda\right)^{1/\gamma}, \qquad 0 < \beta < 1.$$
The Pareto distribution is the special case of the Burr distribution with $\gamma = 1$.
• $\xi$ has the generalized Pareto distribution with parameters $\alpha > 0$, $\delta > 0$, and $k > 0$ if it has PDF
$$\rho_\xi(x) = \frac{\Gamma(\alpha+k)}{\Gamma(\alpha)\Gamma(k)} \cdot \frac{\delta^\alpha x^{k-1}}{(\delta + x)^{\alpha+k}}, \qquad x > 0. \qquad (11)$$
The mean of $\xi$ is
$$E[\xi] = \frac{\delta\,\Gamma(\alpha-1)\Gamma(k+1)}{\Gamma(\alpha)\Gamma(k)}, \qquad \alpha > 1,$$
and is infinite if $\alpha \le 1$. The variance exists if $\alpha > 2$, and equals $E[\xi^2] - (E[\xi])^2$, where
$$E[\xi^2] = \frac{\delta^2\,\Gamma(\alpha-2)\Gamma(k+2)}{\Gamma(\alpha)\Gamma(k)}, \qquad \alpha > 2.$$
• $\xi$ has the Weibull distribution with parameters $c > 0$ and $\gamma > 0$ if it has PDF
$$\rho_\xi(x) = c\gamma x^{\gamma-1} e^{-cx^\gamma}, \qquad x > 0. \qquad (12)$$
In this case we write $\xi \sim W(c, \gamma)$. The Weibull distribution has CDF
$$F_\xi(x) = 1 - e^{-cx^\gamma}, \qquad x > 0.$$
If $0 < \gamma < 1$, the upper tail of the Weibull distribution, $P[\xi > x] = e^{-cx^\gamma}$, decays faster than that of the Pareto distribution (for which $P[\xi > x] = \left(\frac{\lambda}{\lambda+x}\right)^{\alpha}$) but slower than that of the exponential distribution (for which $P[\xi > x] = e^{-\lambda x}$). This makes the Weibull distribution very flexible, and it is extensively used as a model for losses in insurance.
By solving the equation $1 - e^{-cx^\gamma} = \alpha$, we find the quantile function
$$F_\xi^{-1}(\alpha) = \left(-\frac{\log(1-\alpha)}{c}\right)^{1/\gamma}, \qquad 0 < \alpha < 1.$$
The mean of $\xi$ is
$$E[\xi] = c^{-1/\gamma}\,\Gamma\!\left(\frac{1+\gamma}{\gamma}\right).$$
The variance is $E[\xi^2] - (E[\xi])^2$, where
$$E[\xi^2] = c^{-2/\gamma}\,\Gamma\!\left(\frac{2+\gamma}{\gamma}\right).$$
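The Weibull quantile function above makes sampling straightforward. A minimal sketch (Python; $c = 2$ and $\gamma = 0.5$ are arbitrary illustrative choices) compares the empirical mean with the formula $c^{-1/\gamma}\Gamma((1+\gamma)/\gamma)$.

```python
import numpy as np
from math import gamma

# Inverse-transform sampling from the Weibull W(c, gam) of (12).
rng = np.random.default_rng(4)
c, gam = 2.0, 0.5
u = rng.uniform(size=1_000_000)
x = (-np.log(1.0 - u) / c) ** (1.0 / gam)  # quantile function applied to U(0,1)

print(x.mean())                                    # empirical mean
print(c ** (-1.0 / gam) * gamma((1 + gam) / gam))  # theoretical mean, = 0.5 here
```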
1.7 Independence
The notion of independence is one of the most important in Probability
Theory. Intuitively we would like to call two events or random variables
independent if there is no mutual dependency. For example, if we toss a coin (or roll a die) twice, the outcomes are, intuitively, independent of each other.
Two events A and B are called independent if
P (A ∩ B) = P (A)P (B).
Definition 1.3. Random variables $\xi_1, \ldots, \xi_n$ are called mutually independent if the events
$$A_1 = \{\omega \in \Omega : a_1 < \xi_1(\omega) < b_1\},\ \ldots,\ A_n = \{\omega \in \Omega : a_n < \xi_n(\omega) < b_n\}$$
are mutually independent for all real numbers $\{a_k, b_k\}_{k=1}^{n}$. An infinite set of random variables $\{\xi_\alpha\}$ is called mutually independent if any finite subset $\{\xi_{\alpha_1}, \ldots, \xi_{\alpha_n}\}$ is mutually independent.
If $\xi$ and $\eta$ are independent, then
$$E[f(\xi)g(\eta)] = E[f(\xi)]\,E[g(\eta)]$$
for any functions $f, g : \mathbb{R} \to \mathbb{R}$ (such that $f(\xi)$ and $g(\eta) \in L^1$). Indeed,
$$E[f(\xi)g(\eta)] = \int_{\mathbb{R}^2} f(x)g(y)\rho_{\xi,\eta}(x,y)\, dx\, dy = \int_{\mathbb{R}^2} f(x)g(y)\rho_\xi(x)\rho_\eta(y)\, dx\, dy$$
$$= \int_{-\infty}^{+\infty} f(x)\rho_\xi(x)\, dx \int_{-\infty}^{+\infty} g(y)\rho_\eta(y)\, dy = E[f(\xi)]\,E[g(\eta)]$$
for independent $\xi$ and $\eta$. So the covariance could serve as a proxy measure for the degree of independence: the closer the covariance (or correlation) is to zero, the less dependent the random variables are. However, it is not a very good measure, since there exist dependent random variables with zero covariance.
For any random variables $\xi, \eta \in L^2$ such that $\mathrm{Cov}(\xi,\eta) = 0$ (so, in particular, for any independent $\xi$ and $\eta$) we have $\mathrm{Var}(\xi + \eta) = \mathrm{Var}(\xi) + \mathrm{Var}(\eta)$. More generally,
$$\mathrm{Var}\left(\sum_{k=1}^{n} \xi_k\right) = \sum_{k=1}^{n} \mathrm{Var}(\xi_k)$$
In the example above, with $B = \{\text{a customer is a smoker}\}$, we have $P(A \cap B) = \frac{10}{100} = 0.1$, $P(B) = \frac{30}{100} = 0.3$, and
$$P[A|B] = \frac{P(A \cap B)}{P(B)} = \frac{0.1}{0.3} = \frac{1}{3}.$$
Events $A$ and $B$ are independent if and only if
$$P[A|B] = \frac{P[A \cap B]}{P[B]} = \frac{P[A]P[B]}{P[B]} = P[A].$$
Intuitively, this means that information about event $B$ does not have any influence on the probability of event $A$.
Knowing $P[A]$, $P[B]$, and $P[A|B]$, one can compute $P[B|A]$ as follows:
$$P[B|A] = \frac{P[A \cap B]}{P[A]} = \frac{P[A|B]P[B]}{P[A]},$$
which is called Bayes' theorem.
If $\{B_n, n = 1, 2, \ldots\}$ is a finite or countably infinite partition of $\Omega$ (that is, $B_i \cap B_j = \emptyset$ for $i \neq j$ and $\bigcup_n B_n = \Omega$), then, for any event $A$,
$$P[A] = \sum_n P[A \cap B_n] = \sum_n P[A|B_n]P[B_n]. \qquad (13)$$
(here N and E are the events that the driver is "new" and "experienced", respectively).
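As a small numerical illustration of Bayes' theorem and the law of total probability (13) in this driver setting, here is a sketch with made-up numbers (the proportions and accident probabilities below are hypothetical, not from the example).

```python
# Hypothetical inputs: 20% of drivers are new (N), 80% experienced (E);
# accident probabilities P[A|N] = 0.2 and P[A|E] = 0.05.
p_new, p_exp = 0.2, 0.8
p_acc_given_new, p_acc_given_exp = 0.2, 0.05

# Law of total probability (13): P[A] = P[A|N]P[N] + P[A|E]P[E]
p_acc = p_acc_given_new * p_new + p_acc_given_exp * p_exp

# Bayes' theorem: P[N|A] = P[A|N]P[N] / P[A]
p_new_given_acc = p_acc_given_new * p_new / p_acc
print(p_acc, p_new_given_acc)  # 0.08 and 0.5
```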
where $p_X(x)$ and $p_Y(y)$ are the densities of $X$ and $Y$, $p_{X,Y}(x,y)$ is the joint density of $X$ and $Y$, and $p_{X|Y}(x \mid y) := \frac{p_{X,Y}(x,y)}{p_Y(y)}$ is called the conditional density of $X$ given $Y$.
Let $A$ be any event with positive probability. Let $I_A$ be the indicator function of $A$, that is, $I_A(\omega) = 1$ if $\omega \in A$ and $I_A(\omega) = 0$ otherwise. The conditional expectation of a random variable $X$ given $A$ is denoted $E(X|A)$ and defined as
$$E(X|A) := \frac{E(I_A X)}{P(A)}.$$
Intuitively, E(X|A) is the average value of X given that event A happened.
The conditional expectation of $X$ given the event $\{Y = y\}$ is
$$E(X|Y=y) := \frac{E(I_{Y=y}X)}{P(Y=y)},$$
where
$$\mathrm{Var}(X|Y) = E(X^2|Y) - (E(X|Y))^2,$$
and
$$\mathrm{Var}(X) = \mathrm{Var}(Y) = (0^2 \cdot 0.5 + 1^2 \cdot 0.5) - (0.5)^2 = 0.25.$$
Now assume that we know that $Y = 0$. Then
$$P[X=0|Y=0] = \frac{P[X=0, Y=0]}{P[Y=0]} = \frac{0.4}{0.5} = 0.8,$$
$$P[X=1|Y=0] = \frac{P[X=1, Y=0]}{P[Y=0]} = \frac{0.1}{0.5} = 0.2,$$
and
$$E(X|Y=0) = 0 \cdot 0.8 + 1 \cdot 0.2 = 0.2.$$
Similarly,
$$E(X|Y=1) = 0.8.$$
Hence,
$$E(E[X|Y]) = 0.2 \cdot 0.5 + 0.8 \cdot 0.5 = 0.5 = E[X],$$
in agreement with the law of total expectation (15).
Similarly we can calculate that
$$\mathrm{Var}(E[X|Y]) = 0.2^2 \cdot 0.5 + 0.8^2 \cdot 0.5 - (0.2 \cdot 0.5 + 0.8 \cdot 0.5)^2 = 0.09,$$
hence
$$E[\mathrm{Var}(X|Y)] + \mathrm{Var}(E[X|Y]) = 0.16 + 0.09 = 0.25 = \mathrm{Var}(X),$$
in agreement with the law of total variance (16).
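The whole example can be verified mechanically from the joint distribution of $(X, Y)$ implied by the conditional probabilities above; a minimal sketch:

```python
# Joint pmf implied by the computations above:
# P(0,0)=0.4, P(1,0)=0.1, P(0,1)=0.1, P(1,1)=0.4.
p = {(0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.1, (1, 1): 0.4}

p_y = {y: p[(0, y)] + p[(1, y)] for y in (0, 1)}        # marginal of Y
e_x_given_y = {y: p[(1, y)] / p_y[y] for y in (0, 1)}   # E[X|Y=y] for 0/1-valued X

e_x = p[(1, 0)] + p[(1, 1)]                             # E[X]
tot_exp = sum(e_x_given_y[y] * p_y[y] for y in (0, 1))  # E[E[X|Y]]
var_e = sum((e_x_given_y[y] - tot_exp) ** 2 * p_y[y] for y in (0, 1))
print(tot_exp, e_x)  # 0.5 0.5 -- law of total expectation
print(var_e)         # 0.09   -- matches Var(E[X|Y]) above
```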
In many applications we study a quantity that evolves randomly in time; we model its value at time $t$ by a random variable $X_t$, where $T$ is the set of all times we are interested in. We will call such a family of random variables a stochastic process (or random process).
Formally, a stochastic process is a family of random variables {Xt : t ∈ T },
where T is an arbitrary index set. For example, any random variable is a
stochastic process with one-element set T . But typically parameter t repre-
sents time, and the most common examples for T are T = {0, 1, 2, . . . } and
T = R (or [0, ∞)). In the first case the stochastic process is called discrete
time, and is actually just a sequence of random variables; in the second case
the random process is called continuous time.
There are many classifications of stochastic processes. One of the most basic
is to classify them with respect to the time (index) set T and with respect to
the state space. By definition the state space S is the set of possible values
of a random process Xt .
Discrete-time processes arise naturally when a quantity such as a price or temperature is measured only at certain time intervals (days, months, quarters, years). For example, if we do not care about the intra-day changes of the GBP/USD exchange rate, then we can consider it as a discrete-time process $\{X_0, X_1, X_2, \ldots\}$, where $X_i$ is the exchange rate on the morning of the $i$-th day.
Describing a stochastic process. To describe the stochastic process $\{X_t : t \in T\}$, we need to specify the joint distributions of $X_{t_1}, X_{t_2}, \ldots, X_{t_n}$ for all $t_1, t_2, \ldots, t_n$ in $T$ and all integers $n$. The collection of the joint distributions above is called the family of finite dimensional probability distributions (f.f.d. for short). To describe a stochastic process in practice, we will rarely give exact formulas for its f.f.d., but will rather use some indirect intuitive descriptions. For example, take the familiar Bernoulli trials of consecutive tosses of a fair coin. A sequence of i.i.d. Bernoulli variables $(\xi_t)_{t=1}^{\infty}$ is a stochastic process, and its f.f.d. is fully determined by this description. Indeed, for any times $t_1, t_2, \ldots, t_n$ in $T = \{0, 1, 2, \ldots\}$ and "results" $x_1, x_2, \ldots, x_n$ in $S$, we are able to compute the probability $P(\xi_{t_1} = x_1, \xi_{t_2} = x_2, \ldots, \xi_{t_n} = x_n)$, and it is equal to $2^{-n}$.
the above forecast is very optimistic; if, however, $X_0 = 120$, it is pessimistic. What we are really interested in is the price dynamics: whether the price increases or decreases, and by how much.
Formally, an increment of the stochastic process $\{X_t : t \in T\}$ is the quantity $X_{t+u} - X_t$, $u > 0$. Many processes are most naturally defined through their increments. For example, let $X_t$ be the total amount of money in the bank account of a person A on the first day of month $t$. Assume that the monthly salary of A is a fixed amount $C$, and the monthly expenses $Y_t$ are random. Then the stochastic process $X_t$ is naturally defined through its increments $X_{t+1} - X_t = C - Y_t$.
In the above example, the process $X_t$ is not stationary (even weakly) unless $Y_t \equiv C$ for all $t$. For example, if $EY_t < C$ for all $t$, the total amount of money in the bank account increases (on average) with time. However, if the $Y_t$ are identically distributed, the rate of growth of $X_t$ remains unchanged over time. Such processes are said to have stationary increments. If, moreover, the monthly expenses $Y_t$ are (jointly) independent, the rate of growth of $X_t$ does not depend on its history, and we say that $X_t$ has independent increments.
Formally, a stochastic process $\{X_t : t \in T\}$ has stationary (or time-homogeneous) increments if for every $u > 0$ the increment $Z_t = X_{t+u} - X_t$ is a stationary process; a process $\{X_t : t \in T\}$ is said to have independent increments if for any $a, b, c, d \in T$ such that $a < b < c < d$, the random variables $X_b - X_a$ and $X_d - X_c$ are independent.
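A short simulation sketch of the bank account example (Python; the salary $C$, the expense distribution and the initial balance are arbitrary illustrative choices): the process is defined purely through its increments.

```python
import numpy as np

# X_{t+1} - X_t = C - Y_t with i.i.d. expenses Y_t, so the increments are
# stationary and independent; since E[Y_t] < C here, X_t drifts upward.
rng = np.random.default_rng(5)
C, months, x0 = 2000.0, 120, 500.0
expenses = rng.uniform(1500.0, 2400.0, size=months)  # Y_t
X = x0 + np.cumsum(C - expenses)                      # X_t built from its increments

print(X[:5])   # first few months of the balance
print(X[-1])   # after 10 years; on average about x0 + 120 * 50 = 6500
```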
1.10 Summary

A random variable is a measurable function $\xi : \Omega \to \mathbb{R}$ from a probability space $\Omega$ to the real line.

For a random variable $\xi$, its cumulative distribution function (cdf) is $F_\xi(x) := P(\xi \le x)$. If $F_\xi(x)$ can be represented as
$$F_\xi(x) = \int_{-\infty}^{x} \rho_\xi(z)\, dz$$
for some non-negative function $\rho_\xi$, the latter is called the probability density function (pdf) of $\xi$.

The expectation of a random variable $\xi$ is defined as $E[\xi] \equiv \int_\Omega \xi\, dP$. It can be calculated as
$$E[\xi] = \int_{-\infty}^{\infty} x\, dF_\xi(x) = \int_{-\infty}^{\infty} x\rho_\xi(x)\, dx.$$
• Mixed processes.
1.11 Questions

(a) $p(a_1) = 0.3$; $p(a_2) = 0.2$; $p(a_3) = 0.1$; $p(a_4) = 0.1$; $p(a_5) = 0.1$;
(b) $p(a_1) = 0.4$; $p(a_2) = 0.3$; $p(a_3) = 0.1$; $p(a_4) = 0.1$; $p(a_5) = 0.1$;
(c) $p(a_1) = 0.4$; $p(a_2) = 0.3$; $p(a_3) = 0.2$; $p(a_4) = -0.1$; $p(a_5) = 0.2$.

Show that the correlation between the returns on asset A and asset B is equal to $-0.3830$.

Prove that the pairs $(A, B)$, $(A, C)$ and $(B, C)$ are independent, but the triple $(A, B, C)$ is not mutually independent according to Definition 1.2.
Chapter 2
Claim size estimation in insurance and reinsurance
2.1 Basic principles of insurance risk modelling
In general, for a risk to be insurable, the following conditions must be satisfied:

However, the desire for income means that an insurer will usually be found to provide cover even when these ideal criteria are not met.
Other characteristics that most general insurance products share are:
• Cover is normally for a fixed period, most commonly one year, after
which it has to be renegotiated. There is normally no obligation on
insurer or insured to continue the arrangement thereafter, although in
most cases a need for continuing cover may be assumed to exist.
• Claims are not of fixed amounts, and the amount of the loss, as well as the fact of the loss, needs to be proved before a claim can be settled.
• Claims may occur at any time during the policy period.

• The policy lasts for a fixed, and relatively short, period of time, typically one year.

• In return, the insurer pays claims that arise during the term of the policy.
At the end of the policy’s term the policyholder may or may not renew
the policy; if it is renewed, the premium payable by the policyholder may or
may not be the same as in the previous period.
The insurer may choose to pass part of the premium to a reinsurer; in
return, the reinsurer will reimburse the insurer for part of the cost of the
claims during the policy’s term according to some agreed formula.
An important feature of a short-term insurance contract is that the premium is set at a level to cover claims arising during the (short) term of the policy only. This contrasts with life assurance policies, where mortality rates increasing with age mean that the (level) annual premium in the early years would be more than sufficient to cover the expected claims in the early years. The excess amount would then be accumulated as a reserve to be used in the later years, when the premium on its own would be insufficient to meet the expected cost of claims.
The profit of any company, including an insurance company, during a certain time period, e.g. a month, can be calculated as the income of the company during this time period minus its expenses/losses.

The income of an insurance company comes from the premiums paid by its customers. At the beginning of a month, the company knows the number of customers and what premiums they are paying. Of course, the company cannot predict the number of new customers arriving during the next month, or the number of customers who stop paying premiums. However, because the premium paid by each individual customer is usually small, and the number of customers does not change much during a month, these are minor issues. Hence, the company can estimate its income for the next month with good accuracy.

The expenses/losses of an insurance company can be divided into two parts: expenses to cover claims, and other expenses. Other expenses, such as taxes, staff salaries, etc., can also be predicted. The main problem for any insurance company is to estimate the expenses/losses to cover the claims.
If there are $N$ claims during a month with sizes $X_1, X_2, \ldots, X_N$, then the total losses to cover all claims are
$$X_1 + X_2 + \cdots + X_N.$$
To estimate this quantity, the company may collect a sequence
$$x_1, x_2, \ldots, x_n$$
of past claims and use this sequence to estimate the claim distribution. In most cases, this process works as follows.
1. The company assumes that the claim distribution belongs to a certain family, but with unknown parameters. For example, it may assume that claims follow a normal distribution with unknown mean and variance.
The third moment of $\frac{X-m}{\sigma}$ is called the skewness of $X$, while the fourth moment
$$E\left[\left(\frac{X-m}{\sigma}\right)^4\right]$$
is known as the kurtosis of $X$.
Sometimes it is convenient to calculate moments using the moment generating function. For a random variable $X$, its moment generating function is
$$M_X(t) := E\left[e^{tX}\right].$$
If $X$ and $Y$ are independent random variables and $S = X + Y$, then
$$M_S(t) = M_X(t) \cdot M_Y(t),$$
which is a very convenient property.
If $X$ belongs to a certain family of distributions with $r$ parameters $a_1, a_2, \ldots, a_r$, its $j$-th moment can be explicitly calculated as a function of the parameters, that is,
$$E[X^j] = f_j(a_1, a_2, \ldots, a_r), \qquad j = 1, 2, \ldots$$
If past data $x_1, x_2, \ldots, x_n$ of i.i.d. realizations of $X$ are available, the $j$-th moment can also be estimated from the data as
$$E[X^j] \approx \frac{1}{n}\sum_{i=1}^{n} x_i^j, \qquad j = 1, 2, \ldots$$
The method of moments selects the parameters so that these two estimates agree:
$$f_j(a_1, a_2, \ldots, a_r) = \frac{1}{n}\sum_{i=1}^{n} x_i^j, \qquad j = 1, 2, \ldots, r. \qquad (17)$$
If we denote by
$$m_j = \frac{1}{n}\sum_{i=1}^{n} x_i^j, \qquad j = 1, 2, \ldots, r,$$
the moments estimated from the data, then (17) simplifies to
$$m_j = f_j(a_1, a_2, \ldots, a_r), \qquad j = 1, 2, \ldots, r. \qquad (18)$$
This is a system of $r$ equations with $r$ unknowns, which often has a unique solution.
We next consider concrete examples with specific families of distributions.
• Assume that $X$ follows the exponential distribution with parameter $\lambda$; see (5). In this case we have only one parameter, so it suffices to consider the first moment only, that is, the expectation. The expectation $E[X]$ of the exponential distribution is $1/\lambda$, and (17) reduces to
$$\frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{\lambda},$$
from which $\lambda = n / \sum_{i=1}^{n} x_i$.
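A minimal sketch of this method-of-moments fit (Python; the simulated data and the true parameter are arbitrary):

```python
import numpy as np

# Method of moments for Exp(lam): lambda_hat = n / sum(x_i) = 1 / sample mean.
rng = np.random.default_rng(6)
claims = rng.exponential(scale=100.0, size=5_000)  # true lambda = 1/100 = 0.01

lam_hat = 1.0 / claims.mean()
print(lam_hat)  # close to 0.01
```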
• Assume that $X$ follows a normal distribution with parameters $\mu$ and $\sigma$; see (6). Because we have two parameters, it suffices to consider two moments. The first two moments of the normal distribution are $E[X] = \mu$ and $E[X^2] = \sigma^2 + \mu^2$. Hence, the parameters $\mu$ and $\sigma$ can be found from the system of equations
$$\frac{1}{n}\sum_{i=1}^{n} x_i = \mu, \qquad \frac{1}{n}\sum_{i=1}^{n} x_i^2 = \sigma^2 + \mu^2.$$
The solution is
$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2}.$$
where $m_1$ and $m_2$ are defined in (19). The solution is
$$\alpha = \frac{2m_2 - 2m_1^2}{m_2 - 2m_1^2}, \qquad \lambda = \frac{m_1 m_2}{m_2 - 2m_1^2}.$$
For other families of distributions, like the Burr distribution (10), the generalized Pareto distribution (11) or the Weibull distribution (12), explicit expressions for the moments may be too complicated to solve system (18) analytically. However, it may be solved numerically using appropriate computer software.
$$p \cdot p \cdot (1-p) \cdot p \cdot (1-p) \cdot p \cdot p \cdot p \cdot (1-p) \cdot p = p^7(1-p)^3,$$
or $p = 0.7$. With this parameter, the result of the experiment which we actually observed is as likely as it possibly can be. This is the idea of the method of maximum likelihood.
More generally, let $X_1, X_2, \ldots, X_n$ be a sequence of i.i.d. random variables whose distribution belongs to some family of discrete distributions with parameter $\theta$. Given the historical data $x_1, x_2, \ldots, x_n$ we actually obtained, we can ask how "likely" it was to get this data, and the answer is
$$L(\theta) = \prod_{i=1}^{n} P(X_i = x_i \mid \theta).$$
The optimal $\hat\theta$ maximizing this function can be found from the same system of equations (20).
We next consider concrete examples with specific families of distributions.
• Assume that $X$ follows the exponential distribution with parameter $\lambda$, that is, has density given by (5). Then
$$l(\lambda) = \sum_{i=1}^{n} \log[f(x_i \mid \lambda)] = \sum_{i=1}^{n} \log[\lambda e^{-\lambda x_i}] = \sum_{i=1}^{n} (\log[\lambda] - \lambda x_i) = n\log[\lambda] - \lambda\sum_{i=1}^{n} x_i,$$
and (20) reduces to
$$\frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0,$$
from which we find $\lambda = n / \sum_{i=1}^{n} x_i$. Note that in this case the result is the same as with the method of moments.
and
$$\frac{d}{d\sigma^2}\, l(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n} (x_i - \mu)^2 = 0,$$
from which we find that
$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2,$$
Then
$$\frac{d}{d\lambda}\, l(\alpha, \lambda) = \frac{n\alpha}{\lambda} - \sum_{i=1}^{n} x_i = 0,$$
and
$$\frac{d}{d\alpha}\, l(\alpha, \lambda) = n\log[\lambda] - n\frac{d}{d\alpha}\log[\Gamma(\alpha)] + \sum_{i=1}^{n} \log[x_i] = 0.$$
From the first equation, $\lambda = n\alpha / \sum_{i=1}^{n} x_i$. We can substitute this into the second equation and solve it for $\alpha$ numerically. We remark that in this case the solution from the maximum likelihood method is different from the one from the method of moments.
For other families of distributions the method is the same, and the resulting equations, if impossible to solve analytically, can be solved numerically using appropriate computer software.
If there are $r$ parameters, let us select $r$ different numbers $0 < \alpha_1 < \alpha_2 < \cdots < \alpha_r < 1$ and require that $F^{-1}(\alpha_i, \lambda)$ "agree with the data", that is,
$$F^{-1}(\alpha_i, \lambda) = q_i, \qquad i = 1, 2, \ldots, r,$$
where
$$q_i = q(\alpha_i, x_1, \ldots, x_n), \qquad i = 1, 2, \ldots, r, \qquad (21)$$
Applying the function $F$ to both sides of these equations, we get
$$\alpha_1 = F(q_1, \lambda) = 1 - e^{-\lambda q_1},$$
where $q_1 = q(\alpha_1, x_1, \ldots, x_n)$. We get
$$\lambda = -\frac{\log(1-\alpha_1)}{q_1}.$$
• Assume that $X$ follows the Pareto distribution (9) with parameters $\alpha > 0$ and $\lambda > 0$. Because we have two parameters, we need to select $0 < \alpha_1 < \alpha_2 < 1$, for example, $\alpha_1 = 1/4$, $\alpha_2 = 3/4$. Then we estimate
$$q_1 = q(\alpha_1, x_1, \ldots, x_n), \qquad q_2 = q(\alpha_2, x_1, \ldots, x_n).$$

• For the Weibull distribution (12) with parameters $c > 0$ and $\gamma > 0$, the same approach leads to the system
$$\alpha_1 = 1 - \exp(-cq_1^\gamma), \qquad \alpha_2 = 1 - \exp(-cq_2^\gamma).$$
Taking logarithms, it can be rewritten as
$$\log(1-\alpha_1) = -cq_1^\gamma, \qquad \log(1-\alpha_2) = -cq_2^\gamma,$$
hence
$$\gamma = \log\!\left(\frac{\log(1-\alpha_1)}{\log(1-\alpha_2)}\right) \Big/ \log\!\left(\frac{q_1}{q_2}\right), \qquad (23)$$
and then
$$c = -\frac{\log(1-\alpha_1)}{q_1^\gamma}. \qquad (24)$$
2.5 Reinsurance
To protect itself from large claims, an insurance company, let us call it I (the insurer), may in turn take out an insurance policy with another company, which we call R (the reinsurer). Such a policy is called a reinsurance policy. Insurance company I receives premiums from a client C and pays part of these premiums to R. Then, if client C makes a claim, part of it may be covered by R, in accordance with the contract between I and R. In this section we consider reinsurance contracts of two very simple types: proportional reinsurance and individual excess of loss reinsurance.
In proportional reinsurance the insurer I pays a fixed proportion α of the
claim, 0 < α < 1, whatever the size of the claim, and the reinsurer R pays
the remaining proportion 1 − α of the claim. In other words, if claim amount
is X, then I pays αX and R pays (1 − α)X. The parameter α is known as
the retained proportion or retention level.
Imagine that claims arrive independently and identically distributed from some unknown distribution, and insurance company I has a historical record of the sizes $y_1, y_2, \ldots, y_n$ of the payments they made. Then they may easily recover the actual sizes of past claims: they are $y_1/\alpha, y_2/\alpha, \ldots, y_n/\alpha$. Similarly, if reinsurance company R has a historical record of the sizes $z_1, z_2, \ldots, z_n$ of the payments they made, then the actual sizes of past claims are $z_1/(1-\alpha), z_2/(1-\alpha), \ldots, z_n/(1-\alpha)$. These data may be used to estimate the claim distribution using the methods described in sections 2.2-2.4.
In excess of loss reinsurance, the insurer I will pay any claim in full up to a certain amount $M$, which is called the retention level; any amount above $M$ will be paid by the reinsurer R. Note that the term "retention level" is used in both proportional reinsurance and excess of loss reinsurance, but it has a completely different meaning!
With an excess of loss reinsurance contract, if a claim for amount $X$ arrives, then the insurer will pay $Y$, where
$$Y = \begin{cases} X & \text{if } X \le M, \\ M & \text{if } X > M. \end{cases}$$
The reinsurer pays the amount
$$Z = X - Y = \begin{cases} 0 & \text{if } X \le M, \\ X - M & \text{if } X > M. \end{cases} \qquad (25)$$
This is not the case for excess of loss reinsurance if the retention level $M$ is fixed. In this case, the amount for insurer I to pay after inflation becomes
$$Y = \begin{cases} kX & \text{if } kX \le M, \\ M & \text{if } kX > M. \end{cases}$$
if some data are missing or incomplete, we say that we have a "censored sample". So, with a censored sample as above, can we still use the methods described in sections 2.2-2.4 to estimate the parameters of the unknown claim distribution?

• The method of moments, see Section 2.2, is not available, because the moments cannot be reliably estimated from the censored sample.

• The method of percentiles, see Section 2.4, can be used without modification, provided that all $q_i$ in (21) are less than $M$. This is the case if the retention level $M$ is high, so that only the few highest claims are unknown. For example, let $M = 1000$, $n = 9$, and let the data be
500, 300, 1000, 100, 800, 500, 1000, 700, 300,
with percentile levels as in Section 2.4, $\alpha_1 = 1/4$, $\alpha_2 = 3/4$. Then the lowest integers greater than $n\alpha_1 = 9/4$ and $n\alpha_2 = 27/4$ are 3 and 7, respectively. If we sort the data in non-decreasing order,
100, 300, 300, 500, 500, 700, 800, 1000, 1000,
the third and seventh terms are $q_1 = 300$ and $q_2 = 800$, respectively. These values of $q_1$ and $q_2$ are all that we need to proceed with the method of percentiles, and unknown claims over 1000 have no influence on this calculation. This is the big advantage of the method of percentiles.
• The method of maximum likelihood, described in Section 2.3, can be used but requires modification. Let $J$ be the set of claims less than $M$ for which information is available. The contribution of these data to the likelihood function is exactly as in Section 2.3:
$$\prod_{i \in J} f(x_i \mid \theta).$$
Each of the $m$ claims exceeding $M$ is known only to exceed $M$, and contributes a factor $1 - F(M \mid \theta)$ to the likelihood, so its logarithm is
$$l(\theta) = \log(L(\theta)) = \sum_{i \in J} \log[f(x_i \mid \theta)] + m\log(1 - F(M \mid \theta)).$$
The optimal $\hat\theta$ maximizing this function can be found from the same system of equations (20).
From the reinsurer's perspective, the available data are the $k$ non-zero payments
$$w_1, w_2, \ldots, w_k,$$
which are realizations of the conditional random variable
$$W = X - M \mid X > M.$$
Let us express the cdf $G(w)$ and pdf $g(w)$ of $W$ using the cdf $F(x)$ and pdf $f(x)$ of the original claim size distribution $X$. For any $w \ge 0$,
$$G(w) = P[W \le w] = P[X - M \le w \mid X > M] = \frac{P[X \le w+M \text{ and } X > M]}{P[X > M]} = \frac{P[M < X \le w+M]}{1 - P[X \le M]} = \frac{F(w+M) - F(M)}{1 - F(M)}.$$
Differentiating with respect to $w$, we get
$$g(w) = G'(w) = \frac{f(w+M)}{1 - F(M)}.$$
Example 2.1. Assume that the claim size $X$ has the Pareto distribution (9) with parameters $\alpha > 2$ and $\lambda > 0$. Then the pdf of $W$ is
$$g(w) = \frac{f(w+M)}{1-F(M)} = \frac{\alpha\lambda^\alpha}{(\lambda+w+M)^{\alpha+1}} \Big/ \left(1 - \left(1 - \left(\frac{\lambda}{\lambda+M}\right)^{\alpha}\right)\right) = \frac{\alpha\lambda^\alpha}{(\lambda+w+M)^{\alpha+1}} \cdot \left(\frac{\lambda+M}{\lambda}\right)^{\alpha} = \frac{\alpha(\lambda+M)^\alpha}{(\lambda+w+M)^{\alpha+1}}.$$
Let us use, for example, the method of moments to estimate the parameters. It states that
$$\frac{1}{k}\sum_{i=1}^{k} w_i = E[W] = \frac{\lambda+M}{\alpha-1}$$
and
$$\frac{1}{k}\sum_{i=1}^{k} w_i^2 - \left(\frac{1}{k}\sum_{i=1}^{k} w_i\right)^2 = \mathrm{Var}[W] = \frac{\alpha(\lambda+M)^2}{(\alpha-1)^2(\alpha-2)}.$$
This gives a system of two equations to find the two unknown parameters $\lambda$ and $\alpha$.
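Since $W$ is itself Pareto with parameters $\alpha$ and $\lambda + M$, the system can be solved in closed form exactly as in Section 2.2; a sketch with simulated data (all numerical choices below are illustrative):

```python
import numpy as np

# Method-of-moments fit for Example 2.1: W ~ Pareto(alpha, lam + M).
rng = np.random.default_rng(8)
M, alpha_true, lam_true = 1000.0, 3.0, 2000.0
u = rng.uniform(size=10_000)
w = (lam_true + M) * ((1 - u) ** (-1 / alpha_true) - 1)  # Pareto sample via inverse cdf

m1, m2 = w.mean(), (w ** 2).mean()
alpha_hat = (2 * m2 - 2 * m1 ** 2) / (m2 - 2 * m1 ** 2)  # Pareto MoM (Section 2.2)
lam_hat = m1 * m2 / (m2 - 2 * m1 ** 2) - M               # subtract M to recover lambda
print(alpha_hat, lam_hat)  # close to 3 and 2000
```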
2.7 Summary

In this chapter we focus on modelling the claim size distribution. We assume that the claim distribution belongs to a certain family, but with unknown parameters. The company estimates the unknown parameters $a_1, a_2, \ldots, a_r$ to fit the data of past claims $x_1, x_2, \ldots, x_n$ as well as possible.

The method of moments suggests selecting the parameters in such a way that the first $r$ moments estimated from the data match the first $r$ moments $f_j(a_1, a_2, \ldots, a_r)$, $j = 1, 2, \ldots, r$, computed from the formulas for the distribution, that is,
$$m_j = f_j(a_1, a_2, \ldots, a_r), \qquad j = 1, 2, \ldots, r,$$
where $m_j = \frac{1}{n}\sum_{i=1}^{n} x_i^j$, $j = 1, 2, \ldots, r$.
The method of maximum likelihood suggests maximizing the log-likelihood $l(\theta)$ of the observed data. The optimal $\hat\theta = (\hat a_1, \hat a_2, \ldots, \hat a_r)$ can be found from the system of equations
$$\frac{d}{da_i}\, l(\hat\theta) = 0, \qquad i = 1, 2, \ldots, r.$$
The method of percentiles suggests solving the system
$$\alpha_i = F(q_i, \lambda), \qquad i = 1, 2, \ldots, r,$$
where $F$ is a cdf which depends on the parameters, $0 < \alpha_1 < \alpha_2 < \cdots < \alpha_r < 1$ are some pre-specified numbers, and $q_i$ is the estimate of the percentile at level $\alpha_i$ based on the data $x_1, x_2, \ldots, x_n$. To find it, we first find the smallest integer $j$ greater than $n\alpha_i$, then sort the sequence $x_1, x_2, \ldots, x_n$ in non-decreasing order; the $j$-th smallest number is then $q_i$.
To protect itself from large claims, an insurance company may in turn take
out an insurance policy in another company, called reinsurer. We consider
reinsurance contracts of two very simple types: proportional reinsurance and
individual excess of loss reinsurance. In proportional reinsurance, if claim
amount is X, then insurer pays αX and reinsurer pays (1 − α)X, where α is
the parameter known as the retention level. With an excess of loss reinsurance contract, if a claim for amount $X$ arrives, then the insurer will pay
$$Y = \begin{cases} X & \text{if } X \le M, \\ M & \text{if } X > M. \end{cases}$$
2.8 Questions

4. Assume that the history of claim sizes is the same as in the previous question, but the company orders a reinsurance policy with excess of loss reinsurance above the level $M = 2000$.
(a) Write down the history of expenses of the reinsurer;
(b) Assuming that the original claim size distribution is a Pareto distribution with parameters $\alpha > 2$ and $\lambda > 0$, estimate the unknown parameters using the method of moments with the data available to the reinsurer.
(c) Comment on whether you think the Pareto distribution is a good model to fit these data.
Chapter 3
Estimation of aggregate claim distribution
3.1 The collective risk model
As discussed in the previous chapter, if there are $N$ claims during a month (or week, or year, or other fixed period of time) with sizes $X_1, X_2, \ldots, X_N$, then the total losses to cover all claims are
$$S = X_1 + X_2 + \cdots + X_N,$$

• ...

• ...

Hence, by the law of total probability (13),
$$G(x) = P(S \le x) = \sum_{n=0}^{\infty} P(S \le x \text{ and } N = n) = \sum_{n=0}^{\infty} P(N = n) \cdot P(S \le x \mid N = n). \qquad (26)$$
The term
$$P(S \le x \mid N = n) = P(X_1 + X_2 + \cdots + X_n \le x)$$
If independent random variables $X$ and $Y$ have densities $f$ and $g$, the density $h$ of their sum $X + Y$ is given by
$$h(x) = \int_{-\infty}^{\infty} f(t)g(x-t)\, dt,$$
and the cdf of $X + Y$ is $H(x) = \int_{-\infty}^{x} h(t)\, dt$. The $n$-fold convolution of any continuous distribution $F$ can be computed by applying these formulas $n$ times. The $n$-fold convolution of a discrete distribution $F$ can be computed by definition, as demonstrated in the Example below.
For three claims each equal to 0 or 1 with probability 1/2,
$$P(S = 0) = P(X_1 = X_2 = X_3 = 0) = 1/8.$$
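Convolutions like this are easy to do mechanically; the sketch below reproduces this claim distribution by repeated discrete convolution.

```python
import numpy as np

# 3-fold convolution of the pmf P(X=0) = P(X=1) = 1/2.
pmf = np.array([0.5, 0.5])
conv = np.array([1.0])             # 0-fold convolution: point mass at 0
for _ in range(3):
    conv = np.convolve(conv, pmf)  # add one more independent claim

print(conv)  # [0.125 0.375 0.375 0.125]; P(S = 0) = 1/8 as above
```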
For example, if the $X_i$ are discrete random variables taking values in the non-negative integers only, then, for every non-negative integer $x$,
$$P(S = x) = G(x) - G(x-1) = \sum_{n=0}^{\infty} P(N = n) \cdot (F^{n*}(x) - F^{n*}(x-1)).$$
In the example above,
$$P(S = 2) = \sum_{n=0}^{3} P(N = n) \cdot (F^{n*}(2) - F^{n*}(1)) = \frac{1}{4}\sum_{n=0}^{3} (F^{n*}(2) - F^{n*}(1)).$$
or, in other words, $E(S|N) = NE(X)$. Hence, by the law of total expectation (15),
$$\mu_S = E(S) = E[E(S|N)] = E[N]E[X] = \mu_N\mu_X, \qquad (28)$$
or $\mathrm{Var}(S|N) = N\,\mathrm{Var}(X)$. Then by the law of total variance (16),
$$\sigma_S^2 = \mathrm{Var}(S) = E[\mathrm{Var}(S|N)] + \mathrm{Var}[E(S|N)] = E[N\,\mathrm{Var}(X)] + \mathrm{Var}[NE(X)],$$
hence
$$\sigma_S^2 = \mu_N\sigma_X^2 + \sigma_N^2\mu_X^2. \qquad (29)$$
But
$$E(e^{tS} \mid N = n) = E\big[e^{t(X_1 + X_2 + \cdots + X_n)}\big] = \prod_{i=1}^{n} E[e^{tX_i}] = (M_X(t))^n.$$
Hence,
$$M_S(t) = M_N(\log M_X(t)). \qquad (30)$$
Example 3.3. Consider the special case when all claims are for the same fixed amount $B$. That is, $P(X_i = B) = 1$ for all $i$. Then
$$S = X_1 + X_2 + \cdots + X_N = B + B + \cdots + B = NB.$$

Suppose now that $N$ has the Poisson distribution with parameter $\lambda$:
$$P[N = n] = \frac{\lambda^n e^{-\lambda}}{n!}.$$
This is a natural model if we assume that claims arrive "uniformly at rate $\lambda$", as we explain in later chapters.
Using the series expansion of the exponential function, $e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}$, one can compute the moment generating function of the Poisson distribution:
$$M_N(t) = E[e^{tN}] = \sum_{n=0}^{\infty} e^{tn} P[N = n] = \sum_{n=0}^{\infty} e^{tn}\,\frac{\lambda^n e^{-\lambda}}{n!} = e^{-\lambda}\sum_{n=0}^{\infty} \frac{(\lambda e^t)^n}{n!} = e^{-\lambda} e^{\lambda e^t}.$$
For the Poisson distribution, $\mu_N = \sigma_N^2 = \lambda$, so by (28),
$$\mu_S = \mu_N \cdot \mu_X = \lambda\mu_X. \qquad (31)$$
By (29),
$$\sigma_S^2 = \mu_N\sigma_X^2 + \sigma_N^2\mu_X^2 = \lambda(\sigma_X^2 + \mu_X^2) = \lambda E[X^2]. \qquad (32)$$
Also, by (30),
$$M_S(t) = M_N(\log M_X(t)) = \exp(\lambda M_X(t) - \lambda). \qquad (33)$$
The formulas for $\mu_S$ and $\sigma_S^2$ above can also be derived by differentiating $M_S(t)$ once and twice. By differentiating it three times we can also derive the skewness:
$$E\big[(S - \mu_S)^3\big] = \lambda E[X^3],$$
hence
$$\mathrm{skew}[S] := E\left[\left(\frac{S - \mu_S}{\sigma_S}\right)^3\right] = \frac{\lambda E[X^3]}{(\lambda E[X^2])^{3/2}}. \qquad (34)$$
Because the claims $X_i$ are positive random variables, $E[X^3] > 0$, hence $S$ is positively skewed even if the distributions of the $X_i$ are negatively skewed. Also note that $\lim_{\lambda\to\infty} \mathrm{skew}[S] = 0$, hence the distribution of $S$ is almost symmetric if $\lambda$ is large.
Example 3.4. Assume that the number of claims during a year has the Poisson distribution with parameter $\lambda$, and the size of each claim is a random variable uniformly distributed on $[a,b]$. All claim sizes are independent. What are the mean and variance of the cumulative size of the claims from all policies?
Answer: If $X$ is the size of a claim, then
$$E[X] = \frac{1}{b-a}\int_a^b x\, dx = \frac{a+b}{2}; \qquad E[X^2] = \frac{1}{b-a}\int_a^b x^2\, dx = \frac{a^2+ab+b^2}{3},$$
which gives
$$E[S] = \lambda E[X] = \lambda(a+b)/2,$$
and
$$\mathrm{Var}[S] = \lambda E[X^2] = \lambda(a^2+ab+b^2)/3.$$
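A Monte Carlo check of Example 3.4 (Python; $\lambda = 10$, $a = 0$, $b = 2000$ are arbitrary illustrative values):

```python
import numpy as np

# Simulate the compound Poisson aggregate claim S and compare with (31)-(32).
rng = np.random.default_rng(9)
lam, a, b, trials = 10.0, 0.0, 2000.0, 100_000

n_claims = rng.poisson(lam, size=trials)
totals = np.array([rng.uniform(a, b, size=n).sum() for n in n_claims])

print(totals.mean())  # close to lam*(a+b)/2 = 10000
print(totals.var())   # close to lam*(a^2+ab+b^2)/3 ≈ 1.333e7
```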
$$S_i = X_1 + X_2 + \cdots + X_{N_i}, \qquad i = 1, 2, \ldots, n,$$
where $N_i$ has the Poisson distribution with parameter $\lambda_i$, and let
$$S = S_1 + S_2 + \cdots + S_n.$$
Let $S'$ be a compound Poisson sum with Poisson parameter $\lambda = \lambda_1 + \cdots + \lambda_n$,
$$S' = Y_1 + Y_2 + \cdots + Y_N,$$
where each claim $Y_j$ is drawn from the $i$-th claim size distribution with probability $\lambda_i/\lambda$.
We want to prove that $S$ and $S'$ have the same distribution. Because there is a one-to-one relationship between distributions and moment generating functions, it is sufficient to prove that $S$ and $S'$ have the same moment generating functions.
We first calculate the MGF $M_S(t)$ of $S$. By the independence of the $S_i$,
$$M_S(t) = E[e^{tS}] = E[e^{t(S_1 + S_2 + \cdots + S_n)}] = \prod_{i=1}^{n} E[e^{tS_i}].$$
Now, by (33),
$$E[e^{tS_i}] = \exp(\lambda_i M_i(t) - \lambda_i),$$
where $M_i(t)$ is the MGF of the claims of the $i$-th sum, hence
$$M_S(t) = \exp\left(\sum_{i=1}^{n} \lambda_i M_i(t) - \sum_{i=1}^{n} \lambda_i\right).$$
The claim MGF of $S'$ is $M_Y(t) = \frac{1}{\lambda}\sum_{i=1}^{n}\lambda_i M_i(t)$. Hence,
$$M_{S'}(t) = \exp(\lambda M_Y(t) - \lambda) = \exp\left(\sum_{i=1}^{n} \lambda_i M_i(t) - \sum_{i=1}^{n} \lambda_i\right) = M_S(t).$$
This implies that $S$ and $S'$ have the same distribution and finishes the proof.
As mentioned in Chapter 1, this is the case if
$$N = N_1 + N_2 + \cdots + N_n,$$
where the $N_i$ are i.i.d. standard Bernoulli variables (that is, they take value 1 with probability $p$ and 0 otherwise). This is a natural model if we assume that the company covers $n$ independent policies such that each one may issue a claim with the same probability $p$.
Because $E[N_i] = p$, $\mathrm{Var}[N_i] = p(1-p)$, and the moment generating function $M_i(t)$ of $N_i$ is
$$M_i(t) = pe^t + 1 - p,$$
we have $E[N] = np$, $\mathrm{Var}[N] = np(1-p)$, and $M_N(t) = (pe^t + 1 - p)^n$.
By (28),
$$\mu_S = E[S] = \mu_N \cdot \mu_X = np\mu_X. \qquad (35)$$
By (29),
$$\sigma_S^2 = \mu_N\sigma_X^2 + \sigma_N^2\mu_X^2 = np(\sigma_X^2 + (1-p)\mu_X^2) = np(E[X^2] - p(E[X])^2). \qquad (36)$$
Also, by (30),
$$M_S(t) = M_N(\log M_X(t)) = (pe^{\log M_X(t)} + 1 - p)^n = (pM_X(t) + 1 - p)^n, \qquad (37)$$
hence
$$\mathrm{skew}[S] = \frac{npE[X^3] - 3np^2E[X^2]E[X] + 2np^3(E[X])^3}{(npE[X^2] - np^2(E[X])^2)^{3/2}}.$$
We can see that $S$ can be positively or negatively skewed, depending on the parameters.
Example 3.5. Assume that all claims are for the same amount $B$. Then $E[X^k] = B^k$, $k = 1, 2, 3$, and
$$\mathrm{skew}[S] = \frac{npB^3(1-p)(1-2p)}{(npB^2(1-p))^{3/2}} = \frac{1-2p}{\sqrt{np(1-p)}}.$$
In particular, $\mathrm{skew}[S] > 0$ if $p < 0.5$ but $\mathrm{skew}[S] < 0$ if $p > 0.5$.
Example 3.6. Assume that the number of claims during a year has the binomial distribution with parameters $n$ and $p$, and the size of each claim is a random variable $X$ uniformly distributed on $[a,b]$. All claim sizes are independent. What are the mean and variance of the cumulative size of the claims from all policies?
Answer:
$$E[S] = npE[X] = np(a+b)/2,$$
and, by (36),
$$\mathrm{Var}[S] = np(E[X^2] - p(E[X])^2) = np\left(\frac{a^2+ab+b^2}{3} - p\,\frac{(a+b)^2}{4}\right).$$
In the compound negative binomial model, the number of claims $N$ follows the negative binomial distribution $NB(k,p)$. Writing
$$N = N_1 + N_2 + \cdots + N_k,$$
where the $N_i$ are i.i.d. geometric variables with parameter $p$, we have
$$E[N] = \frac{k(1-p)}{p}, \qquad \mathrm{Var}[N] = \frac{k(1-p)}{p^2}.$$
You may note that $\mathrm{Var}[N] > E[N]$, while for the Poisson distribution $\mathrm{Var}[N] = E[N]$. This is the advantage of the negative binomial distribution: it can better fit the data if the sample variance is greater than the sample mean, which is often the case in practice.
By (28),
$$\mu_S = E[S] = \mu_N \cdot \mu_X = \frac{k(1-p)}{p}\,\mu_X. \qquad (38)$$
By (29),
$$\sigma_S^2 = \mu_N\sigma_X^2 + \sigma_N^2\mu_X^2 = \frac{k(1-p)}{p}E[X^2] + \frac{k(1-p)^2}{p^2}(E[X])^2. \qquad (39)$$
Also, by (30),
$$M_S(t) = M_N(\log M_X(t)) = \frac{p^k}{(1 - (1-p)M_X(t))^k}. \qquad (40)$$
Differentiating $M_S(t)$ three times, we can derive the third central moment:
$$E\big[(S-\mu_S)^3\big] = \frac{3k(1-p)^2E[X]E[X^2]}{p^2} + \frac{2k(1-p)^3(E[X])^3}{p^3} + \frac{k(1-p)E[X^3]}{p}.$$
Because all terms are positive, the compound negative binomial distribution is always positively skewed.
In the presence of reinsurance, the insurer's and the reinsurer's aggregate claims are
$$S_I = Y_1 + Y_2 + \cdots + Y_N, \qquad S_R = Z_1 + Z_2 + \cdots + Z_N. \qquad (41)$$
Example 3.7. The number $N$ of claims has the Poisson distribution with parameter $\lambda = 10$. Individual claim amounts are uniformly distributed on $(0, 2000)$. The insurer of this risk has effected excess of loss reinsurance with retention level 1600. Calculate the mean, variance and coefficient of skewness of both the insurer's and the reinsurer's aggregate claims under this reinsurance arrangement.

Answer: In this case, $X_i \sim U(0, 2000)$ and $M = 1600$. As usual, denote $Y_i = \min(X_i, M)$ and $Z_i = X_i - Y_i = \max(0, X_i - M)$. Then
$$E[Y_i] = \int_0^M xf(x)\, dx + M \cdot P(X_i > M) = \frac{0.0005(M^2 - 0^2)}{2} + 0.2M = 960.$$
Hence, by (31),
$$E[S_I] = \lambda E[Y_i] = 10 \cdot 960 = 9600.$$
Further,
$$E[Y_i^2] = \int_0^M x^2 f(x)\, dx + M^2 \cdot P(X_i > M) = 1{,}194{,}666.7,$$
and by (32),
$$\mathrm{Var}[S_I] = \lambda E[Y_i^2] = 11{,}946{,}667.$$
Next,
$$E[Y_i^3] = \int_0^M x^3 f(x)\, dx + M^3 \cdot P(X_i > M) = 1{,}638{,}400{,}000,$$
hence by (34),
$$\mathrm{skew}[S_I] = \frac{\lambda E[Y_i^3]}{(\lambda E[Y_i^2])^{3/2}} \approx 0.397.$$
Let us now do a similar calculation for the reinsurer. Because $X_i \sim U(0, 2000)$, we have $E[X_i] = 1000$, hence
$$E[Z_i] = E[X_i] - E[Y_i] = 1000 - 960 = 40,$$
and by (31),
$$E[S_R] = \lambda E[Z_i] = 10 \cdot 40 = 400.$$
Further,
$$E[Z_i^2] = \int_M^{2000} (x-M)^2 f(x)\, dx = \frac{0.0005(2000-M)^3}{3} \approx 10{,}666.7,$$
and by (32),
$$\mathrm{Var}[S_R] = \lambda E[Z_i^2] = 106{,}667.$$
Next,
$$E[Z_i^3] = \int_M^{2000} (x-M)^3 f(x)\, dx = \frac{0.0005(2000-M)^4}{4} = 3{,}200{,}000,$$
and by (34),
$$\mathrm{skew}[S_R] = \frac{\lambda E[Z_i^3]}{(\lambda E[Z_i^2])^{3/2}} \approx 0.919.$$
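The figures in Example 3.7 can be checked by simulation; a minimal sketch:

```python
import numpy as np

# Monte Carlo check of Example 3.7: lambda = 10, X ~ U(0, 2000), M = 1600.
rng = np.random.default_rng(10)
lam, M, trials = 10.0, 1600.0, 100_000

s_i = np.empty(trials)
s_r = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0.0, 2000.0, size=rng.poisson(lam))
    s_i[t] = np.minimum(x, M).sum()        # insurer pays Y_i = min(X_i, M)
    s_r[t] = np.maximum(x - M, 0.0).sum()  # reinsurer pays Z_i = max(0, X_i - M)

def skew(s):
    return ((s - s.mean()) ** 3).mean() / s.std() ** 3

print(s_i.mean(), s_i.var(), skew(s_i))  # ≈ 9600, 1.19e7, 0.40
print(s_r.mean(), s_r.var(), skew(s_r))  # ≈ 400, 1.07e5, 0.92
```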
Alternatively, the reinsurer's aggregate claims can be written as
$$S_R = W_1 + W_2 + \cdots + W_{N_R}, \qquad (42)$$
where the $W_i$ have density
$$g(w) = \frac{f_X(w+M)}{1 - F_X(M)}, \qquad w > 0,$$
where $f_X$ and $F_X$ are the density and cdf of the original claim size distribution. To find the distribution of $N_R$, note that
$$N_R = I_1 + I_2 + \cdots + I_N,$$
where $I_j$ is the indicator of whether the $j$-th claim leads to a payment made by the reinsurer. Denote by $\pi$ the probability that $X_j > M$. Since $I_j$ takes the value 1 only if $X_j > M$, we have
$$M_I(t) = E[e^{tI_j}] = e^t\pi + 1 - \pi,$$
and
$$E[N_R] = \pi E[N].$$
Example 3.8. In Example 3.7 above, we can use formula (42) to analyse the reinsurance aggregate claim $S_R$. In this case, $N_R$ follows the Poisson distribution with parameter $10 \cdot 0.2 = 2$, and the individual claims $W_i$ have density function
$$g(w) = \frac{f_X(w+M)}{1 - F_X(M)} = \frac{0.0005}{0.2} = 0.0025, \qquad 0 < w < 400,$$
that is, $W_i \sim U(0, 400)$. Then $E[W_i] = 200$, $E[W_i^2] = 53{,}333.33$ and $E[W_i^3] = 16{,}000{,}000$, giving the same results for the mean, variance, and skewness of $S_R$ as above.
Thus, there are two ways to specify and evaluate the distribution of $S_R$.
• claim amounts from these risks are not identically distributed random
variables; and
• the number of risks does not change over the period of insurance cover.
As before, aggregate claims from this portfolio are denoted by S. So:
S = Y1 + Y2 + · · · + Yn ,
where Yj denotes the claim amount under the j-th risk and n denotes the
number of risks. It is possible that some risks will not give rise to claims.
Thus, some of the observed values of Yj may be 0.
For each risk, the following assumptions are made:
If a claim occurs under the j-th risk, the claim amount is denoted by the
random variable Xj . Let Fj (x), µj and σj2 denote the distribution function,
mean and variance of Xj , respectively.
Assumption (a) is very restrictive. It means that a maximum of one claim from each risk is allowed for in the model. This includes risks such as one-year term assurance, but excludes many types of general insurance policy.
For example, there is no restriction on the number of claims that could be
made in a policy year under household contents insurance.
There are three important differences between this model and the collective risk model:
• The number of risks in the portfolio has been specified. In the collective
risk model, this number N was not specified and was modelled as a
random variable.
• The number of claims from each individual risk has been restricted.
There was no such restriction in the collective risk model.
Assumptions (a) and (b) say that the $N_j$ are standard Bernoulli variables, or, equivalently, $N_j \sim \mathrm{Bin}(1, q_j)$. Thus, the distribution of $Y_j$ is compound binomial, with individual claim amount random variable $X_j$. From formulas (35) and (36) it follows that
$$E[Y_j] = q_j\mu_j$$
and
$$\mathrm{Var}[Y_j] = q_j\sigma_j^2 + q_j(1-q_j)\mu_j^2.$$
The aggregate claim amount $S$ is the sum of $n$ independent compound binomial random variables. It is easy to find the mean and variance of $S$:
$$E[S] = E\left[\sum_{j=1}^{n} Y_j\right] = \sum_{j=1}^{n} E[Y_j] = \sum_{j=1}^{n} q_j\mu_j, \qquad (43)$$
and
$$\mathrm{Var}[S] = \mathrm{Var}\left[\sum_{j=1}^{n} Y_j\right] = \sum_{j=1}^{n} \mathrm{Var}[Y_j] = \sum_{j=1}^{n} \big(q_j\sigma_j^2 + q_j(1-q_j)\mu_j^2\big). \qquad (44)$$
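Formulas (43) and (44) are direct to evaluate for a concrete portfolio; in the sketch below all portfolio numbers are made up for illustration.

```python
import numpy as np

# Individual risk model: q[j] = claim probability, mu[j], sig2[j] = claim-size
# mean and variance for risk j (hypothetical values).
q = np.array([0.01, 0.02, 0.015])
mu = np.array([5000.0, 3000.0, 4000.0])
sig2 = np.array([1.0e6, 4.0e5, 6.0e5])

e_s = np.sum(q * mu)                              # (43)
var_s = np.sum(q * sig2 + q * (1 - q) * mu ** 2)  # (44)
print(e_s, var_s)
```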
Example 3.9. Question: Suppose that the Poisson parameters $\lambda_i$ of the policies are not known, but are equally likely to be 0.1 or 0.3. Let $n$ be the number of policies and $m_1$ and $m_2$ be the first and second moments of the claim size $X$.
(i) Find the mean and variance of the aggregate claim $S_i$ from a random policy $i$;

Explanation: The situation described in this example may arise in, for example, motor insurance. It may be that there are $n$ drivers insured, and some of them are "good" drivers and some are "bad" drivers. The individual claim amount distribution is the same for all drivers, but "good" drivers make fewer claims (0.1 p.a. on average) than "bad" drivers (0.3 p.a. on average). It is assumed to be known, possibly from national data, that a policyholder is equally likely to be a "good" driver or a "bad" driver.

Answer: (i) Let us choose policy $i$ at random. From the problem formulation, $P[\lambda_i = 0.1] = P[\lambda_i = 0.3] = 0.5$. Hence,
$$E[\lambda_i] = (0.1 + 0.3)\cdot 0.5 = 0.2, \quad \text{and} \quad \mathrm{Var}[\lambda_i] = (0.1^2 + 0.3^2)\cdot 0.5 - 0.2^2 = 0.01,$$
and, by the laws of total expectation and total variance,
$$E[S_i] = E[\lambda_i]m_1 = 0.2m_1, \qquad \mathrm{Var}[S_i] = E[\lambda_i]m_2 + \mathrm{Var}[\lambda_i]m_1^2 = 0.2m_2 + 0.01m_1^2.$$

Question: Suppose now that the Poisson parameter $\lambda$ is common to all policies, but is itself random, equally likely to be 0.1 or 0.3.
(i) Find the mean and variance of the aggregate claim $S_i$ from a random policy $i$;

Explanation: The situation described in this example may arise in, for example, buildings insurance in a certain area. The number of claims could depend on, among other factors, the weather during the year; an unusually high number of storms results in a high expected number of claims (i.e. a high value of $\lambda$), and vice versa, for all the policies together.

Answer: (i) For a random policy $i$, exactly the same calculation as in Example 3.9 gives the same result:
$$E[S_i] = 0.2m_1 \quad \text{and} \quad \mathrm{Var}[S_i] = 0.2m_2 + 0.01m_1^2.$$

(ii) Even when the $S_i$ are not independent, the expectation of a sum is still the sum of expectations, so $E[S] = \sum_{i=1}^{n} E[S_i] = 0.2nm_1$.
However, for dependent $S_i$, the variance of the sum also involves the covariances between the policies.

$$P[N_i = m] = \frac{\delta^\alpha\,\Gamma(m+\alpha)}{\Gamma(\alpha)\, m!\, (\delta+1)^{m+\alpha}}.$$
One can check that if $\alpha = k$ is a positive integer, then this formula reduces to the formula for the negative binomial distribution with parameters $\alpha$ and $\frac{\delta}{\delta+1}$.
3.8 Summary

Let $N$ be a (random) number of claims during some fixed period of time. If the sizes of the claims are $X_1, X_2, \ldots, X_N$, then the total cost to cover all claims is
$$S = X_1 + X_2 + \cdots + X_N,$$
and $S = 0$ if $N = 0$. If all the $X_i$ are independent, identically distributed, and also independent of $N$, then we say that $S$ has a compound distribution.

The cdf of $S$ is given by
$$G(x) = P(S \le x) = \sum_{n=0}^{\infty} P(N = n) \cdot F^{n*}(x),$$
where $F(x)$ is the cdf of an individual claim and $F^{n*}(x)$ is its $n$-fold convolution.

We denote by $\mu_X$, $\mu_N$, $\mu_S$ the means of the random variables $X_i$, $N$, $S$, and by $\sigma_X^2$, $\sigma_N^2$, $\sigma_S^2$ the corresponding variances. Then
$$\mu_S = \mu_N \cdot \mu_X, \qquad \sigma_S^2 = \mu_N\sigma_X^2 + \sigma_N^2\mu_X^2.$$
For example, for the compound Poisson distribution with parameter $\lambda$ this gives $\mu_S = \lambda\mu_X$ and $\sigma_S^2 = \lambda E[X^2]$.
In proportional reinsurance the insurer pays $Y_i = \alpha X_i$, while in excess of loss reinsurance with retention level $M$, it is $Y_i = \min(X_i, M)$. In both cases, the aggregate claim for the insurer is
$$S_I = Y_1 + Y_2 + \cdots + Y_N.$$

In the individual risk model, the aggregate claim is
$$S = Y_1 + Y_2 + \cdots + Y_n,$$
3.9 Questions
Chapter 4
Tails and dependence analysis of claims distributions
4.1 How likely are very large claims to occur?
Low frequency events involving large losses can have a devastating impact
on companies and investment funds. The financial crisis that started in 2007
was an example of this. It generated more extreme movements in share prices
than had been seen for over 20 years previously.
So, it is important to ensure that we model the form of the distribution
in the tails correctly. However, the low frequency of these events also means
that there is relatively little data to model their effects accurately.
Many types of financial data tend to be much more narrowly peaked in the centre of the distribution and to have fatter tails than the normal distribution. This shape of distribution is known as leptokurtic. For example, when share prices are modelled, large price movements occur more frequently than predicted by the normal distribution. So, the normal distribution may be unsuitable for modelling the large movements in the tails. One reason for these fat tails is that the volatility of financial variables does not remain constant but varies stochastically over time. This property is known as heteroscedasticity.
Even if we select an appropriate form of fat-tailed distribution, if we
attempt to fit the distribution using the whole of our dataset, this is unlikely
to result in a good model for the tails, since the parameter estimates will be
heavily influenced by the main bulk of the data in the central part of the
distribution.
Fortunately, better modelling of the tails of the data can be done through
the application of extreme value theory. The key idea of extreme value theory
is that the asymptotic behaviour of the tails of most distributions can be
accurately described by certain families of distributions. More specifically,
the maximum values of a distribution (when appropriately standardised)
and the values exceeding a specified threshold (called threshold exceedances)
converge to two families of distributions as the sample size increases.
There are a number of measures we can use to quantify the tail weight
of a particular distribution, that is, how likely very large values are to occur.
Depending on the context, an exponential, normal or log-normal distribution
may be a suitable baseline to use for comparison.
• Existence of moments.
Recall that the $k$-th moment of a continuous positive-valued distribution with density function $f(x)$ is
$$M_k = \int_0^{\infty} x^k f(x)\, dx.$$
The hazard rate is the analogue of the force of mortality, which you study in a parallel module. If the force of mortality increases as a person's age increases, relatively few people will live to old age (corresponding to a light tail). If, on the other hand, the force of mortality decreases as the person's age increases, there is the potential to live to a very old age (corresponding to a heavier tail).
For example, for the exponential distribution with parameter $\lambda > 0$ we have
$$f(x) = \lambda e^{-\lambda x} \quad \text{and} \quad F(x) = 1 - e^{-\lambda x}, \qquad x > 0,$$
hence the hazard rate
$$h(x) = \frac{\lambda e^{-\lambda x}}{1 - (1 - e^{-\lambda x})} = \lambda$$
is a constant. The exponential distribution corresponds to the $\alpha = 1$ case of the gamma distribution (8). Numerical calculations show that, for the gamma distribution, the hazard rate is decreasing if $\alpha < 1$ (which indicates a heavier tail than that of the exponential distribution) and increasing if $\alpha > 1$ (which indicates a lighter tail than that of the exponential distribution).
For the Pareto distribution (9), we find that the hazard rate is always a
decreasing function of x (see the end of chapter question for the proof),
confirming that it has a heavy tail.
• Mean residual life.
The mean residual life of a distribution with density function $f(x)$ and cumulative distribution function $F(x)$ is defined as
$$e(x) = \frac{\int_x^{\infty} (y-x)f(y)\, dy}{\int_x^{\infty} f(y)\, dy} = \frac{\int_x^{\infty} (1 - F(y))\, dy}{1 - F(x)}.$$
For the Pareto distribution, we find that the mean residual life is an
increasing function of x (see the end of chapter question for the proof),
confirming that it has a heavy tail.
The parameter $\beta$ is called the scale parameter, while $\gamma$ is called the shape parameter. When $\gamma = 0$, $G(x)$ reduces to the exponential distribution with parameter $\lambda = \frac{1}{\beta}$.

Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with cdf $F$, and let
$$M_n = \max\{X_1, X_2, \ldots, X_n\}.$$
By independence,
$$P(M_n \le x) = P(X_1 \le x, X_2 \le x, \ldots, X_n \le x) = (F(x))^n.$$
where $\alpha$, $\beta > 0$, and $\gamma$ are some real parameters, and $x$ is such that $1 + \frac{\gamma(x-\alpha)}{\beta} > 0$. If $1 + \frac{\gamma(x-\alpha)}{\beta} \le 0$, (46) is undefined, and we set $F(x) = 0$ if $\gamma > 0$ and $F(x) = 1$ if $\gamma < 0$. With this convention, $F(x)$ becomes the cdf of some distribution, and this distribution is known as the Generalized Extreme Value (GEV) distribution. The parameter $\alpha$ is called the location parameter, $\beta > 0$ is the scale parameter, and $\gamma$ is known as the shape parameter.
The parameters $\alpha$ and $\beta$ just rescale (shift and stretch) the GEV distribution (46), in a similar way as changing the mean and standard deviation shifts and stretches the normal distribution. The parameter $\gamma$ determines the overall shape of the distribution (analogous to the skewness), and its sign (positive, negative or zero) results in three differently shaped distributions.
• When γ < 0, the GEV distribution (46) reduces to

F(x) = exp( −(1 + γ(x − α)/β)^{−1/γ} ) if x < α − β/γ, and F(x) = 1 if x ≥ α − β/γ.
• When γ > 0, the GEV distribution (46) reduces to

F(x) = 0 if x ≤ α − β/γ, and F(x) = exp( −(1 + γ(x − α)/β)^{−1/γ} ) if x > α − β/γ.
• Underlying distributions that have finite upper limits (e.g. the uniform
distribution) are of the Weibull type (which also has a finite upper
limit).
• “Light tail” distributions that have finite moments of all orders (e.g.
exponential, normal, log-normal) are typically of the Gumbel type.
probability of one claim is 10^{−3}, the company estimated the chance that all 10 claims happen as about (10^{−3})^{10} = 10^{−30}. This is so tiny that it can be ignored, and the company felt completely safe.
After several months, a large fire happened in one of the houses. The fire quickly spread to the neighbouring houses, destroying them all. All 10 houses were affected, the company received 10 large claims and became bankrupt.
The company's mistake is that the formula (10^{−3})^{10} = 10^{−30} works only if all fires are independent. However, they are not. Because the houses were on one street, a fire in one of them could cause a fire in the others. Even if the houses were on different streets, a fire in one of them could be the result of, for example, extremely hot weather, which would also increase the chance of another fire. So, understanding the dependency between different events and random variables is of fundamental importance.
A popular way to calculate "how much" two random variables are dependent, or correlated, is the Pearson correlation coefficient

Corr(X, Y) := Cov(X, Y) / ( √Var(X) · √Var(Y) ).
Let us think about this in our example. Consider two scenarios, w1 and w2. We see that X is higher in scenario w2 while Y is higher in w1. So, in this case we have converse dependence (the higher X, the lower Y). We call such a pair of scenarios "discordant".
Now, consider another pair of scenarios, w1 and w3. In this case, X is higher in scenario w3, and Y is higher in w3 as well. So, we have the "the higher X, the higher Y" case. We call such a pair of scenarios "concordant".
Similarly, the pair of scenarios w2 and w3 is "concordant". Because there are more concordant pairs than discordant ones, we conclude that the dependence between X and Y is (mostly) "direct".
In the general case, assume that there are n scenarios on which random variables X and Y assume different values. Then we can consider all pairs of scenarios (there are n(n − 1)/2 pairs), and count how many concordant and discordant pairs there are. If there are C concordant pairs and D discordant ones, the ratio

τ = (C − D)/(C + D) = (C − D)/(n(n − 1)/2)

is called the Kendall coefficient of concordance. If τ > 0, this means that the dependence "the higher X, the higher Y" is observed, and the closer τ is to its maximal value 1, the stronger this dependence is. Conversely, if τ < 0, then the dependence "the higher X, the lower Y" is observed, and the closer τ is to its minimal value −1, the stronger is this dependence.
In the special example considered above, n = 3, n(n − 1)/2 = 3, C = 2, D = 1, and

τ = (2 − 1)/3 = 1/3.
There are other ways to measure concordance, for example, the Spearman coefficient (which we will not define here). If X and Y are independent, then their concordance is 0. However, as with the Pearson correlation coefficient, the implication does not work in the opposite direction: we may have a Kendall (or Spearman) coefficient equal to 0 even if the random variables are dependent. This just means that "positive" and "negative" dependences happen equally often and on average compensate each other. A pair-counting computation of τ is sketched below.
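The pair-counting definition above translates directly into code. The following Python sketch computes the Kendall coefficient from the definition; the values of X and Y in the three scenarios are hypothetical numbers chosen to be consistent with the example (C = 2, D = 1).

def kendall_tau(x, y):
    # Count concordant and discordant pairs of scenarios.
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [1.0, 2.0, 3.0]   # X in scenarios w1, w2, w3 (hypothetical values)
y = [2.0, 1.0, 3.0]   # Y in scenarios w1, w2, w3 (hypothetical values)
print(kendall_tau(x, y))  # 1/3, as computed in the text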
joint cdf is

F_{X,Y}(x, y) = P[X ≤ x, Y ≤ y].

If X and Y are independent, then F_{X,Y}(x, y) = F_X(x) F_Y(y).
• C(u, v) = min(u, v) is called co-monotonic (or minimum) copula, and
represents the dependence in the form “the higher X, the higher Y ”.
The three copulas listed above are called fundamental copulas. They are specific cases of a more general family of copulas called the Fréchet–Hoeffding copulas.
Another important example is:
• the Gaussian copula

C(u, v) = Φρ(Φ^{−1}(u), Φ^{−1}(v)),

where Φ is the standard normal cdf and Φρ is the joint cdf of two standard normal random variables with correlation coefficient ρ; or equivalently,

C(u, v) = ∫_0^u Φ( (Φ^{−1}(v) − ρ Φ^{−1}(t)) / √(1 − ρ²) ) dt.
Many of the commonly used copulas are special cases of the important family of copulas which is called the Archimedean family. Let ψ : (0, 1] → [0, ∞) be a continuous, strictly decreasing, convex function with ψ(1) = 0. Properties of ψ imply that it has an inverse function ψ^{−1} : [0, L) → (0, 1], where L = lim_{t→0} ψ(t). If L < ∞, we also define, by convention, ψ^{−1}(x) = 0 for x ≥ L, so that ψ^{−1} is defined everywhere on [0, ∞).
Then the Archimedean family of copulas are all copulas of the form

C(u, v) = ψ^{−1}( ψ(u) + ψ(v) ).    (51)
• For

ψ(t) = −ln t, 0 < t ≤ 1,

the C(u, v) in (51) reduces to

C(u, v) = uv.
• For

ψ(t) = (−ln t)^α, 0 < t ≤ 1,

where α ≥ 1 is a parameter, the C(u, v) in (51) reduces to

C(u, v) = exp{ −((−ln u)^α + (−ln v)^α)^{1/α} }.

This copula is known as the Gumbel copula.
• For

ψ(t) = −ln( (e^{−αt} − 1)/(e^{−α} − 1) ), 0 < t ≤ 1,

where α ≠ 0 is a parameter, the C(u, v) in (51) reduces to

C(u, v) = −(1/α) ln( 1 + (e^{−αu} − 1)(e^{−αv} − 1)/(e^{−α} − 1) ).

This copula is known as the Frank copula.
• For

ψ(t) = (1/α)(t^{−α} − 1), 0 < t ≤ 1,

where α ≠ 0 is a parameter, the C(u, v) in (51) reduces to

C(u, v) = ( max(u^{−α} + v^{−α} − 1, 0) )^{−1/α},

the Clayton copula.
See the end of chapter questions for the details of all calculations. In all cases, the value of the parameter α represents the strength of the dependency between the variables.
Sklar's theorem also works for any number of random variables. It states that the joint distribution of n random variables is always a function of the individual cumulative distribution functions:

F(x1, x2, ..., xn) = C(F_{X1}(x1), F_{X2}(x2), ..., F_{Xn}(xn)),

for some function C which depends on n variables, and is called the (n-variable) copula. For example,

C(x1, x2, ..., xn) = x1 · x2 · ... · xn

is the n-variable independence copula, and

C(x1, x2, ..., xn) = min(x1, x2, ..., xn)

is the n-variable co-monotonic copula.

The coefficient of lower tail dependence of two random variables X and Y is defined as

λL = lim_{u→0+} P(X ≤ F_X^{−1}(u) | Y ≤ F_Y^{−1}(u)) = lim_{u→0+} C(u, u)/u.
To define the upper tail dependence, we need to look at the opposite end of the marginal distributions. For any r.v. X, let

F̄_X(x) = P(X > x) = 1 − F_X(x)

be its survival function. Then for any two r.v.s X and Y and any two numbers x and y, we have

P(X > x, Y > y) = 1 − F_X(x) − F_Y(y) + P(X ≤ x, Y ≤ y),

hence the survival copula C̄ (the copula of the survival functions) satisfies

C̄(1 − u, 1 − v) = 1 − u − v + C(u, v).

The last equation allows us to easily compute the survival copula C̄ if we know the (usual) copula C, and vice versa.
We can then define the coefficient of upper tail dependence as:

λU = lim_{u→1−} P(X > F_X^{−1}(u) | Y > F_Y^{−1}(u)) = lim_{u→0+} C̄(u, u)/u.
The tail dependence can take values between 0 (no dependence) and 1
(full dependence).
Different copulas result in different levels of tail dependence. For example, the Frank copula and the Gaussian copula have zero dependence in both tails, while the Gumbel copula with parameter α has zero lower tail dependence but upper tail dependence of 2 − 2^{1/α}. The Clayton copula, on the other hand, has zero upper tail dependence but lower tail dependence of 2^{−1/α}.
See the end of chapter questions for the details of all calculations.
Hence, the Gumbel copula, with an appropriate value of the parameter α, might be a suitable copula to use when modelling large general insurance claims resulting from a common underlying cause; a numerical check of its upper tail dependence is sketched below.
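The upper tail dependence of the Gumbel copula can be checked numerically. The sketch below evaluates C̄(1 − u, 1 − u)/(1 − u) = (1 − 2u + C(u, u))/(1 − u) for u close to 1 and compares it with 2 − 2^{1/α}; α = 2 is an illustrative choice.

import numpy as np

def gumbel_copula(u, v, alpha):
    # C(u, v) = exp{-((-ln u)^alpha + (-ln v)^alpha)^(1/alpha)}
    s = (-np.log(u)) ** alpha + (-np.log(v)) ** alpha
    return np.exp(-s ** (1 / alpha))

alpha = 2.0
for u in [0.9, 0.99, 0.999, 0.9999]:
    lam_u = (1 - 2 * u + gumbel_copula(u, u, alpha)) / (1 - u)
    print(u, lam_u)
print("theory:", 2 - 2 ** (1 / alpha))  # 2 - 2^(1/alpha), about 0.5858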
4.7 Summary
The "heavier" the tail of the claim distribution, the more likely very large claims are to occur. We can "quantify" how heavy the tails are by analysing the following:
• Existence of moments Mk = ∫_0^∞ x^k f(x) dx, where f(x) is a density of a non-negative r.v.
• Limiting density ratio: lim_{x→∞} f(x)/g(x), where f(x) and g(x) are two densities.
• Hazard rate h(x) = f(x)/(1 − F(x)), where f(x) is the density and F(x) is the cdf.
• Mean residual life e(x) = ∫_x^∞ (y − x) f(y) dy / ∫_x^∞ f(y) dy.
For a r.v. X with cdf F, the distribution of exceedances over a threshold u is

Fu(x) = P(X − u ≤ x | X > u) = (F(x + u) − F(u)) / (1 − F(u)).

Then

lim_{u→∞} Fu(x) = G(x),

where G is the limiting distribution with scale parameter β and shape parameter γ described above (the Generalized Pareto distribution). If the limit of the distribution of the (appropriately standardised) maxima Mn exists and depends only on x, then it must follow the Generalized Extreme Value (GEV) distribution. It has 3 parameters: location parameter α, scale parameter β > 0, and shape parameter γ.
If γ < 0, γ = 0, and γ > 0, the GEV distribution reduces to the Weibull distribution, the Gumbel distribution, and the Fréchet distribution, respectively.
For 2 random variables X and Y, the "concordance" measures to what extent we have direct dependence of the form "the higher X, the higher Y" (this corresponds to positive concordance), and to what extent we have the opposite dependence "the higher X, the lower Y" (corresponding to negative concordance). If X and Y are independent, the concordance is 0.
There are several ways to measure concordance; one of them is the Kendall coefficient

τ = (C − D)/(C + D),

where C and D are the numbers of concordant and discordant pairs of scenarios, respectively.
The joint cdf F of n random variables can be written in the form

F(x1, ..., xn) = C(F_{X1}(x1), ..., F_{Xn}(xn)),

where C is a copula. For example,

C(x1, x2, ..., xn) = x1 · x2 · ... · xn

is the independence copula, which implies that all n random variables are jointly independent, while

C(x1, x2, ..., xn) = min(x1, x2, ..., xn)

is the co-monotonic copula, which corresponds to the case when all variables are directly dependent (the higher is one variable, the higher are all). For n = 2,

C(x1, x2) = max(x1 + x2 − 1, 0)

is called the counter-monotonic copula, which represents the inverse dependence of the form "the higher one variable, the lower another one".
The coefficients of lower and upper tail dependence of 2 random variables are

λL = lim_{u→0+} C(u, u)/u and λU = lim_{u→0+} C̄(u, u)/u,

where C̄ is the survival copula defined by C̄(1 − u, 1 − v) = 1 − u − v + C(u, v).
4.8 Questions
Chapter 5
Markov Chains
The Markov property is used extensively in actuarial mathematics to develop two-state and multi-state Markov models of mortality and other decrements. The rest of this course is devoted to a thorough description of the Markov property in a general context and its applications to actuarial modelling.
We will distinguish between two types of stochastic process that possess the
Markov property: Markov chains and Markov jump processes. Both have a
discrete state space, but Markov chains have a discrete time set and Markov
jump processes have a continuous time set.
We begin with Markov chains and discuss the mathematical formulation of such processes, leading to one important actuarial application: the no-claims-discount process used in motor insurance. We then move on to Markov jump processes.
The practical considerations of applying these models in actuarial mathematics will be discussed in detail in later sections. In this chapter we focus on the mathematical development of Markov models without reference to their calibration to real data.
P[Xt = a | Xs1 = x1, Xs2 = x2, ..., Xsn = xn, Xs = x] = P[Xt = a | Xs = x],    (53)

for all s1 < s2 < · · · < sn < s < t and all states a, x1, x2, ..., xn, x in S.
An important result is that any process with independent increments has the
Markov property.
Example 5.1. Question: Prove that any process with independent incre-
ments has the Markov property.
Answer: We begin with equation (53) and use the fact that Xt = Xt − Xs + x to introduce an increment:

P[Xt = a | Xs1 = x1, ..., Xsn = xn, Xs = x]
= P[Xt − Xs = a − x | Xs1 = x1, ..., Xsn = xn, Xs = x]
= P[Xt − Xs = a − x | Xs = x]
= P[Xt = a | Xs = x];

the second equality arises from the definition of independent increments and the fact that x is known.
A Markov process with a discrete state space and a discrete time set is called a Markov chain; these are considered in this chapter. A Markov process with a discrete state space and a continuous time set is called a Markov jump process; these are considered in the next chapter.
Note that P(n, n+1) is a finite matrix in the case of a finite number of states,
and an infinite matrix in the case of an infinite number of states.
3. If one or more claims are made the policyholder moves down one level, or remains at the 0% level.
The insurance company believes that the chance of claiming each year is independent of the current discount level and has a probability of 1/4. Why can this process be modelled as a Markov chain? Give the state space and transition matrix.
Answer: The model can be considered as a Markov chain since the future
discount depends only on the current level, not the entire history. The state
space is S = {0%, 30%, 60%}, which is convenient to denote as S = {0, 1, 2}
(where state 0 is the 0% state, 1 is the 30% state and 2 is the 60% state).
The transition probability matrix between two states in a unit time is given by

P = ( 1/4  3/4  0   )
    ( 1/4  0    3/4 )    (57)
    ( 0    1/4  3/4 )
A clear way of representing Markov chains is by a transition graph. The
states are represented by circles linked by arrows indicating each possible
transition. Next to each arrow is the corresponding transition probability.
Example 5.3. Question: Draw the transition graph for the NCD system
defined in Example 5.2.
Answer: See Figure 1
Figure 1: Transition graph for the NCD system of Example 5.2. Reproduced
with permission of the Faculty and Institute of Actuaries.
Equation (55) defines the probabilities of transition over a single time step. Similarly, the n-step transition probabilities pij(m, m + n) denote the probability that a process in state i at time m will be in state j at time m + n. That is:

pij(m, m + n) = P[Xm+n = j | Xm = i].

These probabilities satisfy the Chapman–Kolmogorov equations

pij(m, n) = Σ_{k∈S} pik(m, l) pkj(l, n),

for all states i, j ∈ S and all integer times m < l < n. This can be expressed in terms of n-step stochastic matrices as
P(m, n) = P(m, l) P(l, n).    (58)
Example 5.4. Question: Prove equation (58).
Answer: We use the Markov property and the law of total probability.
This should be intuitively clear but a formal proof is left as a question at the
end of the chapter.
The value of t can represent many factors such as time of year, age of pol-
icyholder or the length of time the policy has been in force. For example,
young drivers and very old drivers may have more accidents than middle-
aged drivers and therefore t might represent the age or age group of the
driver purchasing a motor insurance policy.
Although time-inhomogeneous models are important in practical modelling,
a further analysis is beyond the scope of this course.
A Markov chain is called time homogeneous if transition probabilities do not
depend on time. This is a significant simplification to any Markov-chain
model. In particular, for a time-homogeneous Markov chain, equation (59)
becomes
P[Xn = jn, n = 0, 1, 2, ..., N] = P[X0 = j0] ∏_{n=0}^{N−1} p_{jn jn+1}.    (60)

Moreover, the n-step transition matrix of a time-homogeneous Markov chain is the n-th power of its 1-step transition matrix:

P(n) = P^n.
Example 5.5. Question: Calculate the 2-step transition matrix for the NCD system from Example 5.2 and confirm that it is a stochastic matrix.
Answer: The 1-step transition matrix is given by equation (57) and so we can compute that

P(2) = ( 1/4  3/4  0   ) ( 1/4  3/4  0   )          ( 4  3  9  )
       ( 1/4  0    3/4 ) ( 1/4  0    3/4 )  = 1/16  ( 1  6  9  )
       ( 0    1/4  3/4 ) ( 0    1/4  3/4 )          ( 1  3  12 )

We note that the two conditions for P(2) to be a stochastic matrix are satisfied.
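This computation is easy to reproduce with numpy (a quick check, not part of the original notes):

import numpy as np

P = np.array([[1/4, 3/4, 0],
              [1/4, 0,   3/4],
              [0,   1/4, 3/4]])

P2 = P @ P  # equivalently np.linalg.matrix_power(P, 2)
print(P2 * 16)  # rows [4, 3, 9], [1, 6, 9], [1, 3, 12]
# Stochastic-matrix checks: non-negative entries, rows summing to 1.
print((P2 >= 0).all(), np.allclose(P2.sum(axis=1), 1))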
Example 5.6. Question: Using the 2-step transition matrix from Example 5.5, state the probability that
(b) a policyholder initially in the 60%-state is in the 30%-state after 2 years.
Answer: Reading the relevant entry of P(2), the probability in (b) is p21(2) = 3/16.
The simple (unrestricted) random walk is the process

Xn = Y1 + Y2 + · · · + Yn,

where Y1, Y2, ... are i.i.d. random variables taking the value 1 with probability p and −1 with probability 1 − p.
The simple random walk has the Markov property, that is:
P (Xm+n = j| X1 = i1 , X2 = i2 , . . . , Xm = i) ,
= P (Xm + Ym+1 + Ym+2 + · · · + Ym+n = j| X1 = i1 , X2 = i2 , . . . , Xm = i) ,
= P (Ym+1 + Ym+2 + · · · + Ym+n = j − i) ,
= P (Xm+n = j| Xm = i) .
Hence a simple random walk is a time-homogeneous Markov chain with tran-
sition probabilities:
pij = p if j = i + 1, 1 − p if j = i − 1, and 0 otherwise.
equations for r and l gives

r = (n + j − i)/2 and l = (n − j + i)/2.

From this we can see that the n-step transition probabilities are

pij^(n) = C(n, (n + j − i)/2) p^{(n+j−i)/2} (1 − p)^{(n−j+i)/2},

where C(n, r) denotes the binomial coefficient, the number of possible paths with r positive steps, each of which occurs with probability p^r (1 − p)^{n−r}. The expression arises since the distribution of the number of positive steps in n steps is Binomial with parameters n and p. Since r and l must be non-negative integers, it follows that both n + j − i and n − j + i must be non-negative even numbers.
In addition to being time-homogeneous, a simple random walk is spatially homogeneous, that is,

pij^(n) = P(Xn = j | X0 = i) = P(Xn = j + r | X0 = i + r).

A simple random walk with p = q = 1/2 is called a symmetric simple random walk.
In other words, once state b is reached, the random walk stops and remains
in this state thereafter.
A reflecting barrier is a value c such that:
P (Xn+1 = c + 1| Xn = c) = 1.
In other words, once state c is reached, the random walk is “pushed away”.
A mixed barrier is a value d such that:

P(Xs+1 = d | Xs = d) = α and P(Xs+1 = d + 1 | Xs = d) = 1 − α,

for all s > 0 and α ∈ [0, 1]. In other words, once state d is reached, the random walk remains in this state with probability α or moves to the neighbouring state d + 1 with probability 1 − α, i.e. it is an absorbing barrier with probability α and a reflecting barrier with probability 1 − α.
If, in the example above, the man does not take his rich friend, he will continue to gamble until either his money reaches the target N or he runs out of money. In each case reaching the boundary means that the wealth will remain there forever; the barriers therefore become absorbing barriers.
The transition graph for the general case of a restricted random walk with
two mixed barriers is given in Figure 3. The special cases of reflecting and
absorbing boundary conditions are obtained by taking α or β equal to 0 or
1.
Figure 3: Transition graph for the restricted random walk with mixed bound-
ary conditions. Reproduced with permission of the Faculty and Institute of
Actuaries.
The 1-step transition matrix is given by

P = ( α    1−α                     )
    ( 1−p  0    p                  )
    (      1−p  0    p             )
    (           ...  ...  ...      )
    (                1−p  0    p   )
    (                     1−β  β   )

where all blank entries are zero.
Note that the matrix is finite, which is in contrast to the transition matrix
for the unrestricted random walk.
The simple NCD model given in Example 5.2 is a practical example of a
restricted random walk.
• 2+ : 40% discount and no claim in the previous year, that is, the state
corresponding to {Xn = 2, Xn−1 = 1}.
• 2− : 40% discount and claim in the previous year, that is, the state
corresponding to {Xn = 2, Xn−1 = 3}.
Assuming that the probability of making no claims in any year is still 3/4, the
Markov chain on the modified state space S 0 = {0, 1, 2+ , 2− , 3} has transition
graph given by Figure 4, and 1-step transition matrix given by
P = ( 1/4  3/4  0    0    0   )
    ( 1/4  0    3/4  0    0   )
    ( 0    1/4  0    0    3/4 )
    ( 1/4  0    0    0    3/4 )
    ( 0    0    0    1/4  3/4 )
Figure 4: Transition graph for the modified NCD process. Reproduced with
permission of the Faculty and Institute of Actuaries.
In particular, P[Xn+1 = 1 | Xn = 2+] = 1/4, while

P[Xn+1 = 1 | Xn = 2−] = 0.
5.5.4 A model of accident proneness
An insurance company may want to use the whole history of the claims from a given driver to estimate his/her accident proneness. Let Yi be the number of claims during period i. In the simplest model, we may assume that there can be no more than 1 claim per period, so Yi is either 0 or 1. By time t a driver has a history of the form Y1 = y1, Y2 = y2, ..., Yt = yt, where yi ∈ {0, 1}, i = 1, ..., t. Based on this history, the probability of a future claim can be estimated, say, as

P[Yt+1 = 1 | Y1 = y1, Y2 = y2, ..., Yt = yt] = f(y1 + y2 + · · · + yt)/g(t),

where f, g are two given increasing functions satisfying 0 ≤ f(m) ≤ g(m) for all m.
The stochastic process {Yt, t = 0, 1, 2, ...} does not have the Markov property (54). However, the cumulative number of claims from the driver, given by

Xt = Σ_{i=1}^{t} Yi,

does. Indeed,

P[Xt+1 = 1 + xt | X1 = x1, X2 = x2, ..., Xt = xt]
= P[Yt+1 = 1 | Y1 = x1, Y2 = x2 − x1, ..., Yt = xt − xt−1] = f(xt)/g(t),

which does not depend on the past history x1, x2, ..., xt−1. Thus, {Xt, t = 0, 1, 2, ...} is a Markov chain.
Example 5.7. Assume that there are just 2 ratings for a bond B: I -
investment grade, and J - junk grade. This can be modelled as a Markov
chain with states I, J, and D - default. Assume that the transition matrix
is given by
0.90 0.05 0.05
P = 0.10 0.80 0.10
0 0 1
Assume that bond B returns a yearly profit of 9% of the investment in state I, and 10% of the investment in state J. However, in case of default you will be able to get back only 40% of your investment. Assume also that the risk-free rate (in a bank) is 2%.
In this case, investing capital C in a bond in state I, we will get back 1.09C in case we stay at I or move to J, and 0.4(1.09C) if the process moves to state D. Hence, our expected profit is 0.90(1.09C) + 0.05(1.09C) + 0.05(0.4 · 1.09C) − C = 0.0573C, that is, 5.73%. This is higher than the risk-free rate of 2%. The difference is called the premium for risk.
A risk neutral probability measure is an "adjusted" probability measure such that the expected profit (with respect to this measure) from a bond is the same as the profit from the bank. The "adjusted" transition probabilities from state I to states I, J, and D are given by 1 − 0.1πI, 0.05πI, 0.05πI, where πI is the adjustment coefficient. By the definition of a risk neutral probability measure, (1 − 0.10πI)(1.09C) + 0.05πI(1.09C) + 0.05πI(0.4 · 1.09C) − C = 0.02C, from which we can find πI (see the sketch below). Similarly, the "adjusted" transition probabilities from state J to states I, J, and D are given by 0.10πJ, 1 − 0.20πJ, 0.10πJ, where πJ can be found from a similar argument (this is left as an exercise).
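Since the risk-neutral condition is linear in πI, it can be solved in one line; the following snippet just automates the arithmetic of the example.

# (1 - 0.10*pi)*1.09 + 0.05*pi*1.09 + 0.05*pi*0.4*1.09 = 1.02
# Collecting the terms in pi: 1.09 - 0.0327*pi = 1.02.
pi_I = (1.09 - 1.02) / (0.109 - 0.0545 - 0.0218)
print(pi_I)  # approximately 2.14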
• Step 2. Estimating transition probabilities: Once the state space is determined, the Markov model must be fitted to the data by estimating the transition probabilities. In the NCD model (Example 5.2) we have just claimed that "the company believes that the chance of claiming each year ... has a probability of 1/4". In practice, however, the transition probabilities should be estimated from the data. Naturally, the probability pij of transition from state i to state j should be estimated as the number of transitions from i to j, divided by the total number of transitions from state i. More formally, let x1, x2, ..., xN be a set of available observations, ni be the number of times t (1 ≤ t ≤ N − 1) such that xt = i, and nij be the number of times t (1 ≤ t ≤ N − 1) such that xt = i and xt+1 = j. Then the best estimate for pij is p̂ij = nij/ni, and the 95% confidence interval can be approximated as

( p̂ij − 1.96 √(p̂ij(1 − p̂ij)/ni), p̂ij + 1.96 √(p̂ij(1 − p̂ij)/ni) ).

This follows from the fact that the conditional distribution of Nij given Ni is binomial with parameters Ni and pij. (A small simulation illustrating this estimation is sketched at the end of this list.)
• Step 3. Checking the Markov property: Once the state space and transition probabilities are found, the model is fully determined. But, to ensure that the fit of the model to the data is adequate, we need to check that the Markov property seems to hold. In practice, it is often considered sufficient to look at triplets of successive observations. For a set of observations x1, x2, ..., xN, let nijk be the number of times t (1 ≤ t ≤ N − 2) such that xt = i, xt+1 = j, and xt+2 = k. If the Markov property holds, nijk is an observation from a Binomial distribution with parameters nij and pjk. An effective test to check this is a χ² test: the statistic

X² = Σ_i Σ_j Σ_k (nijk − nij p̂jk)² / (nij p̂jk)

should then approximately follow a χ² distribution.
• Step 4. Using the model: Once fitted and checked, the model can be used to estimate different quantities of interest. In particular, we have used the Markov model of Example 5.2 to address questions like "What is the probability that a policyholder initially in the 0%-state is in the 0%-state after 2 years?" (see Example 5.6). If the Markov model is too complicated to answer questions of this type analytically, we can use Monte-Carlo simulation (see chapter 2). Simulating a time-homogeneous Markov chain is relatively straightforward. In addition to commercial simulation packages, even standard spreadsheet software can easily cope with the practical aspects of estimating transition probabilities and performing a simulation.
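The following Python sketch illustrates Steps 2 and 4 together: it simulates a long path of the NCD chain of Example 5.2 and then re-estimates the transition probabilities, with the approximate 95% confidence intervals from Step 2. The path length N is an illustrative choice.

import numpy as np

rng = np.random.default_rng(0)
P = np.array([[1/4, 3/4, 0], [1/4, 0, 3/4], [0, 1/4, 3/4]])

# Step 4: simulate a path of length N, starting in state 0.
N = 10_000
x = np.empty(N, dtype=int)
x[0] = 0
for t in range(1, N):
    x[t] = rng.choice(3, p=P[x[t - 1]])

# Step 2: estimate p_ij = n_ij / n_i and the approximate 95% CI.
for i in range(3):
    n_i = np.sum(x[:-1] == i)
    for j in range(3):
        n_ij = np.sum((x[:-1] == i) & (x[1:] == j))
        p_hat = n_ij / n_i
        half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n_i)
        print(f"p_{i}{j}: {p_hat:.3f} +/- {half_width:.3f}")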
P(Xn = j | X0 = i) → πj,    (63)

as n → ∞. A probability distribution π = (πj, j ∈ S) is called a stationary distribution of a Markov chain with transition matrix P if π = πP, i.e. πj = Σ_{i∈S} πi pij for all j ∈ S. If X0 has distribution π, then X1 has distribution πP = π.
Hence if the initial distribution for a Markov chain is a stationary distribu-
tion, then Xn has the same probability distribution for all n.
A general Markov chain does not necessarily have a stationary probability
distribution, and if it does it need not be unique. For instance, the unre-
stricted random walk discussed in §5.5 has no stationary distribution, and
the uniqueness of the stationary distribution in the restricted random walk
depends on the parameters α and β.
However it is known that a Markov chain with finite state space has at least
one stationary probability distribution. This is stated without proof.
Whether the stationary distribution is unique is more subtle and requires that we consider only irreducible chains. Irreducibility is defined by the property that any state j can be reached from any other state i in a finite number of steps. In other words, a chain is irreducible if for any pair of states i and j there exists an integer n such that pij^(n) > 0. It is often sufficient to view the transition graph to determine whether a Markov chain is irreducible or not.
Example 5.8. Question: Are the simple NCD, modified NCD, unrestricted and restricted random walk processes irreducible?
Answer: It is clear from Figures 1, 2 & 4 that both NCD processes and the unrestricted random walk are irreducible, as all states have a non-zero probability of being reached from any other state in a finite number of steps. For the restricted random walk, Figure 3 shows that it is irreducible unless either boundary is absorbing, i.e. it is irreducible if α ≠ 1 and β ≠ 1.
An irreducible Markov chain with a finite state space has a unique stationary
probability distribution. This is stated without proof.
Example 5.10. Question: Compute the stationary distribution for the
modified NCD model defined in §5.5.
Answer: The conditions for a stationary distribution defined above lead to the following expressions:

π0 = (1/4)π0 + (1/4)π1 + (1/4)π2−,
π1 = (3/4)π0 + (1/4)π2+,
π2+ = (3/4)π1,
π2− = (1/4)π3,
π3 = (3/4)π2+ + (3/4)π2− + (3/4)π3.

This system of equations is not linearly independent, since adding all the equations results in an identity. This is a general feature of π = πP, due to the property Σ_{j∈S} pij = 1.
We therefore discard one of the equations (discarding the last one will simplify the system) and work in terms of a working variable, say π1. A numerical solution of the full system is sketched below.
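Numerically, the system π = πP together with the normalisation Σ πj = 1 can be solved directly; the sketch below replaces the redundant equation by the normalisation condition.

import numpy as np

# States ordered as {0, 1, 2+, 2-, 3}.
P = np.array([[1/4, 3/4, 0,   0,   0],
              [1/4, 0,   3/4, 0,   0],
              [0,   1/4, 0,   0,   3/4],
              [1/4, 0,   0,   0,   3/4],
              [0,   0,   0,   1/4, 3/4]])

A = P.T - np.eye(5)   # pi P = pi  is equivalent to  (P^T - I) pi^T = 0
A[-1, :] = 1          # replace one redundant equation by sum(pi) = 1
b = np.array([0, 0, 0, 0, 1.0])
pi = np.linalg.solve(A, b)
print(np.round(pi, 4), np.allclose(pi @ P, pi))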
5.7 The long-term behaviour of Markov chains
It is natural to expect the distribution of a Markov chain to tend to the
stationary distribution π for large times if π exists. However, certain phe-
nomena can complicate this. For example, a state i is said to be periodic
with period d > 1 if a return to i is possible only in a number of steps that
is a multiple of d. More specifically, pii^(n) = 0 unless n = md for some integer m.
Any periodic behaviour is usually evident from the transition graph. For example, both NCD models considered above are aperiodic; the unrestricted random walk has period 2, and the restricted random walk is aperiodic unless α and β are either 0 or 1.
We state the following result about convergence of a Markov chain without
proof:
Let pij^(n) be the n-step transition probability of an irreducible aperiodic Markov chain on a finite state space. Then lim_{n→∞} pij^(n) = πj for each i and j.
References
The following texts were used in the preparation of this chapter and you are
referred there for further reading if required.
5.8 Summary
For discrete state spaces the Markov property is written as

P[Xt = a | Xs1 = x1, Xs2 = x2, ..., Xsn = xn, Xs = x] = P[Xt = a | Xs = x],

for all s1 < s2 < · · · < sn < s < t and all states a, x1, x2, ..., xn, x in S.
Any process with independent increments has the Markov property.
Markov chains are discrete-time and discrete-state-space stochastic processes
satisfying the Markov property. You should be familiar with the simple
NCD, modified NCD, unrestricted random walk and restricted random walk
processes.
In general, the n-step transition probabilities pij (m, m + n) denote the prob-
ability that a process in state i at time m will be in state j at time m + n.
The transition probabilities of a Markov process satisfy the Chapman–Kolmogorov equations:

pij(m, n) = Σ_{k∈S} pik(m, l) pkj(l, n),

for all states i, j ∈ S and all integer times m < l < n. This can be expressed in terms of n-step stochastic matrices as P(m, n) = P(m, l) P(l, n). A stationary probability distribution π satisfies π = πP(n) for every n.
5.9 Questions
i. one year.
ii. two years.
iii. three years.
(iii) Explain whether the chain is irreducible and/or aperiodic.
(iv) Does this Markov chain converge to a stationary distribution?
(v) Calculate the long-run probability that a policyholder is in dis-
count level 2.
Chapter 6
Markov Jump Processes
A Markov jump process is a stochastic process with a discrete state space and a continuous time set, which has the Markov property.
The mathematical development of Markov jump processes is similar to Markov
chains considered in the previous chapter. For example, the Chapman–
Kolmogorov equations have the same format. However, Markov jump pro-
cesses are in continuous time and so the notion of a one-step transition prob-
ability does not exist and we are forced to consider time intervals of arbi-
trarily small length. Taking the limit of these intervals to zero leads to the
reformulation of the Chapman–Kolmogorov equations in terms of differential
equations.
We begin by discussing the Poisson process, which is the simplest example of a Markov jump process. In doing so we will encounter some general features of Markov jump processes.
The counting process {Nt}t∈[0,∞) is said to be a Poisson process with rate λ > 0, if
1. N0 = 0;
2. {Nt} has independent increments;
3. P(Nt+h − Nt = 1) = λh + o(h);
4. P(Nt+h − Nt ≥ 2) = o(h).
Write pn(t) = P(Nt = n); we will show that

pn(t) = e^{−λt}(λt)^n / n!,    (64)

that is, Nt has the Poisson distribution with parameter λt. Indeed,
p0 (t + h) = P (Nt+h = 0),
= P (Nt = 0, Nt+h − Nt = 0),
= P (Nt = 0)P (Nt+h − Nt = 0),
= p0 (t)(1 − λh + o(h)).
Rearranging and taking the limit as h → 0 gives the differential equation

dp0(t)/dt = −λp0(t),

with the initial condition p0(0) = 1. It is clear that this has solution

p0(t) = e^{−λt}.    (65)
Similarly, for n ≥ 1;
pn (t + h) = P (Nt+h = n),
= P (Nt = n, Nt+h − Nt = 0) + P (Nt = n − 1, Nt+h − Nt = 1) + o(h),
= P (Nt = n)P (Nt+h − Nt = 0) + P (Nt = n − 1)P (Nt+h − Nt = 1) + o(h),
= pn (t)p0 (h) + pn−1 (t)p1 (h) + o(h),
= (1 − λh)pn (t) + λhpn−1 (t) + o(h).
Rearranging this, and again taking the limit as h → 0, we obtain the differential equation

dpn(t)/dt = −λpn(t) + λp_{n−1}(t),    (66)
for n = 1, 2, 3, .... It can be shown by mathematical induction, or by using generating functions, that the solution of the differential equations (66), with initial conditions pn(0) = 0, yields equation (64), as required.
A Poisson process takes non-negative integer values and can jump at any time t ∈ [0, ∞). However, since time is continuous, the probability of a jump at any specific time point t is zero. The process can be pictured as the "upwards staircase" shown in Figure 5.
Figure 5: Sample Poisson process. Horizontal distance is time.
The sequence {τn}n≥1 is called the sequence of interarrival times (or holding times). These are the horizontal distances between each step in Figure 5. The random variables τ1, τ2, ... are i.i.d., each having the exponential distribution with parameter λ. They therefore each have the density function

f(t) = λe^{−λt}, t > 0.
To demonstrate this for general τn, first consider τ1 and note that the event τ1 > t occurs if and only if there are zero events of the Poisson process in the fixed interval (0, t], that is,

P(τ1 > t) = P(Nt = 0) = e^{−λt}, hence P(τ1 ≤ t) = 1 − e^{−λt},
The same argument can be repeated for τ3 , τ4 , . . . leading to the conclusion
that the interarrival times are i.i.d. random variables that are exponentially
distributed with parameter λ.
Further, it can be shown using similar arguments that if N̂t and Ñt are two independent Poisson processes with parameters λ1 and λ2 respectively, then their sum Nt = N̂t + Ñt is a Poisson process with parameter λ1 + λ2. This result follows immediately from our intuitive interpretation of a Poisson process: assume that male customers are arriving uniformly with rate λ1, and female customers are arriving independently and uniformly with rate λ2. Then N̂t describes the cumulative number of male customers and Ñt the cumulative number of female customers, thus Nt = N̂t + Ñt is the total number of customers, which clearly also arrive uniformly with rate λ1 + λ2.
This can be extended to the sum of any number of Poisson processes and is a very useful result. A simulation based on the interarrival-time description is sketched below.
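The interarrival-time characterisation gives a simple way to simulate a Poisson process. The sketch below generates many paths by accumulating exponential waiting times and checks that the number of events in [0, T] has mean and variance close to λT; λ, T and the number of runs are illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
lam, T, n_runs = 2.0, 10.0, 20_000

counts = []
for _ in range(n_runs):
    t, n = 0.0, 0
    while True:
        t += rng.exponential(1 / lam)  # interarrival time ~ Exp(lambda)
        if t > T:
            break
        n += 1
    counts.append(n)

# N_T is Poisson with parameter lam*T, so both moments should be near 20.
print(np.mean(counts), np.var(counts))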
The compound Poisson process is defined as

Xt = Σ_{i=1}^{Nt} Yi,    (68)

where {Nt}t∈[0,∞) is a Poisson process with rate λ, and {Yi, i ≥ 1} are independent and identically distributed random variables, with distribution function F, which are also independent of Nt.
The expected value and variance of the compound Poisson process are
given by
E[Xt ] = λtE[Y ], V ar[Xt ] = λtE[Y 2 ], (69)
where Y is a random variable with distribution function F .
Example 6.4. In Example 6.3 assume that the size of each claim is a random variable uniformly distributed on [a, b]. All claim sizes are independent. What are the mean and variance of the cumulative size of the claims from all policies during 3 years?
Answer: The cumulative size of the claims is the compound Poisson process Xt = Σ_{i=1}^{Nt} Yi, where Nt is the number of claims from all policies, which is the Poisson process with parameter λ = 10,000q, and Yi is the size of claim i. Then

E Yi = (1/(b − a)) ∫_a^b x dx = (a + b)/2, and E Yi² = (1/(b − a)) ∫_a^b x² dx = (a² + ab + b²)/3,

which gives

E[X3] = 3λE[Yi] = 30,000q · (a + b)/2 = 15,000q(a + b),

and

Var[X3] = 3λE[Yi²] = 30,000q · (a² + ab + b²)/3 = 10,000q(a² + ab + b²).

A Monte-Carlo check of these formulas is sketched below.
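Formulas (69) are easy to confirm by simulation. The sketch below uses illustrative values of q, a and b; it draws the number of claims over 3 years from a Poisson distribution and the claim sizes from the uniform distribution on [a, b].

import numpy as np

rng = np.random.default_rng(2)
q, a, b, t = 0.01, 100.0, 500.0, 3.0
lam = 10_000 * q

n_runs = 20_000
totals = np.empty(n_runs)
for k in range(n_runs):
    n_claims = rng.poisson(lam * t)
    totals[k] = rng.uniform(a, b, size=n_claims).sum()

print(np.mean(totals), 15_000 * q * (a + b))              # E[X_3]
print(np.var(totals), 10_000 * q * (a**2 + a*b + b**2))   # Var[X_3]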
Assume that a company has initial capital u, premium rate c, and the cumulative claims size Xt given by (68). Then the basic problem in risk theory is to estimate the probability of ruin at time t > 0, that is, the probability that u + cs − Xs < 0 for some s ∈ (0, t].

For a general Markov jump process, the transition probabilities are

pij(s, t) = P[Xt = j | Xs = i], where pij(s, t) ≥ 0 and s < t.    (71)
These satisfy the Chapman–Kolmogorov equations, which in matrix form are expressed as

P(t1, t3) = P(t1, t2) P(t2, t3), t1 < t2 < t3.    (72)

The proof of these is analogous to that for equation (58) in discrete time, and is left as a question at the end of the chapter.
We require that the transition probabilities satisfy the continuity condition

lim_{t→s+} pij(s, t) = δij = 1 if i = j, and 0 if i ≠ j.    (73)

This condition means that as the time difference between the two observations approaches zero, the probability that the process remains in its current state approaches one.
It is easy to see that this condition is consistent with the Chapman–Kolmogorov equation. Indeed, taking the limits t2 → t3− or t2 → t1+ in equation (72) we obtain an identity.
However, this condition does not follow from the Chapman–Kolmogorov equations. For example, pij(s, t) = 1/2 for i, j = 1, 2 satisfies equation (72), since

( 1/2  1/2 )   ( 1/2  1/2 )   ( 1/2  1/2 )
( 1/2  1/2 ) = ( 1/2  1/2 ) · ( 1/2  1/2 ),

but does not satisfy (73).
It follows from equation (74) that the quotients in it approach certain limits as h → 0. In particular, we define

qjj(t) := lim_{h→0} (pjj(t, t + h) − 1)/h,
qkj(t) := lim_{h→0} pkj(t, t + h)/h, for k ≠ j.    (75)
The quantities qjj (t), qkj (t) are called transition rates. They correspond to
the rate of transition from state k to state j in a small time interval h, given
that state k is occupied at time t.
Transition probabilities pkj(t, t + h) can be expressed through the transition rates as

pkj(t, t + h) = h qkj(t) + o(h) for k ≠ j, and pjj(t, t + h) = 1 + h qjj(t) + o(h).    (76)
Kolmogorov's forward equations then read

∂pij(s, t)/∂t = Σ_{k∈S} pik(s, t) qkj(t).    (77)

In matrix form,

∂P(s, t)/∂t = P(s, t) Q(t),

where Q(t) is called the generator matrix, with entries qij(t).
Repeating the procedure but differentiating with respect to s, we obtain

∂pij(s, t)/∂s = −Σ_{k∈S} qik(s) pkj(s, t),    (78)
and we see that the derivative with respect to s can also be expressed in
terms of the transition rates. The differential equations (78) are called Kol-
mogorov’s backward equations. In matrix form these are written as
∂P(s, t)/∂s = −Q(s) P(s, t).
Therefore if transition probabilities pij (s, t) for t > s have derivatives with
respect to t and s, transition rates are well-defined and given by equation
(75).
Alternatively, if we can assume the existence of transition rates, then it
follows that transition probabilities pij (s, t) for t > s have derivatives with
respect to t and s, given by equations (77) and (78). These equations are
compatible, and we may ask whether we can find transition probabilities,
given transition rates, by solving equations (77) and (78).
It can be shown that each row of the generator matrix Q(s) has zero sum. That is,

qii(s) = −Σ_{j≠i} qij(s).
The residual holding time for a general Markov jump process is denoted Rs. This is the random amount of time between time s and the next jump:

Rs = inf{u > 0 : Xs+u ≠ Xs}.

Similarly, the current holding time is denoted Ct. This is the time between the last jump and time t:

Ct = t − sup{u < t : Xu ≠ Xt}.
We will not study these questions further for general Markov processes, but
will investigate such and related questions for time-homogeneous Markov
processes below.
6.4 Time-homogeneous Markov jump processes
Just as we defined time-homogeneous Markov chains (equation (60)), we can
define time-homogeneous Markov jump processes.
Consider the transition probabilities for a Markov process given by equation (71). A Markov process in continuous time is called time-homogeneous if the transition probabilities satisfy pij(s, t) = pij(0, t − s) for all i, j ∈ S and s, t > 0.
In other words, a Markov process in continuous time is called time-homogeneous if the probability P(Xt = j | Xs = i) depends only on the time interval t − s. In this case we can write

pij(t − s) := pij(s, t).

Here, for example, pij(s) form a stochastic matrix for every s, that is,

pij(s) ≥ 0 and Σ_{j∈S} pij(s) = 1.
Also, pij(s) satisfy the Chapman–Kolmogorov equations, which, for a time-homogeneous Markov process, take the form

pij(t + s) = Σ_{k∈S} pik(t) pkj(s).    (79)
The argument for different values of n is similar. So, if for some t we had pii(t) = 0, this would imply pii(t/n) = 0 for all n, a contradiction with (73).
The following properties of transition functions and transition rates for a
time-homogeneous process are stated without proof:
1. Transition rates qij = dpij(t)/dt |_{t=0} = lim_{h→0} (pij(h) − δij)/h exist for all i, j. Equivalently, as h → 0, h > 0,

pij(h) = h qij + o(h) for i ≠ j, and pii(h) = 1 + h qii + o(h).    (81)
Comparing this to equation (76) we see that the only difference between
the time-homogeneous and time-inhomogeneous cases is that the tran-
sition rates qij are not allowed to change over time.
2. Transition rates are non-negative and finite for i ≠ j, and are non-positive when i = j, that is,

qii = −Σ_{j≠i} qij.

3. The transition matrix P(t) satisfies

dP(t)/dt = QP(t).
Note that since qii = −Σ_{j≠i} qij, each row of the matrix Q has zero sum.
Example 6.5. Consider the Poisson process again. The rate at which events occur is a constant λ, leading to

qij = λ if j = i + 1, −λ if j = i, and 0 otherwise.    (82)

The forward equations then read

dpi0(t)/dt = −λpi0(t),
dpij(t)/dt = −λpij(t) + λp_{i,j−1}(t),

with pij(0) = δij. These equations are essentially the same as equations (65) and (66).
The backward equations are

dpij(t)/dt = −λpij(t) + λp_{i+1,j}(t).

A numerical check via the matrix exponential is sketched below.
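For a time-homogeneous Markov jump process, P(t) = exp(Qt) solves dP(t)/dt = QP(t) with P(0) = I. The sketch below checks this for the Poisson generator (82) on a state space truncated at M; the truncation is a computational device, so only entries far from the boundary are accurate.

import numpy as np
from scipy.linalg import expm
from scipy.stats import poisson

lam, M, t = 1.5, 30, 2.0

# Generator of the Poisson process, truncated to states {0, 1, ..., M}.
Q = np.zeros((M + 1, M + 1))
for i in range(M):
    Q[i, i] = -lam
    Q[i, i + 1] = lam

P_t = expm(Q * t)
# p_{0j}(t) should match the Poisson(lam*t) probabilities of equation (64).
print(np.round(P_t[0, :6], 4))
print(np.round(poisson.pmf(np.arange(6), lam * t), 4))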
6.5 Applications
In this section we briefly discuss a number of applications of Markov jump
processes to actuarial modelling. In each case the models can be made time-
homogeneous by insisting that the transition rates are independent of time.
A more detailed discussion of the survival model is postponed to the next
chapters.
Figure 6: Transition graph for the survival model. Reproduced with permis-
sion of the Faculty and Institute of Actuaries.
Figure 7: Transition graph for the sickness-death model. Reproduced with
permission of the Faculty and Institute of Actuaries.
• the probability that an individual who is sick at time s will still be sick at time t.
These are expressed in terms of the residual holding times as

P(Rs > t − s | Xs = H) = exp( −∫_s^t (σ(u) + µ(u)) du )

and

P(Rs > t − s | Xs = S) = exp( −∫_s^t (ρ(u) + ν(u)) du ),

respectively.
We note that transition probabilities can be related to each other. For example, the probability of a transition from state H at time s to S at time t would be

pHS(s, t) = ∫_0^{t−s} exp( −∫_s^{s+w} (σ(u) + µ(u)) du ) σ(s + w) pSS(s + w, t) dw.
This is interpreted as "the individual remains in the healthy state from time s to time s + w, then jumps to the sick state at time s + w, and is sick at time t". The derivation of this equation is beyond the scope of the course; however, similar expressions can be written down intuitively.
This sickness-death model can be extended to include the length of time an
individual has been in state S. This leads to the so-called long term care
model where the rate of transition out of state S will depend on the current
holding time in state S.
Figure 8: Transition graph for the marriage model. Reproduced with permis-
sion of the Faculty and Institute of Actuaries.
This mathematical statement can be read as “the individual is in state B at
time s where he either remains until time (t − v), or jumps to states W or D
by time (t − v). At time (t − v) he then jumps to state M and remains there
until time t”.
References
The following texts were used in the preparation of this chapter and you are
referred there for further reading if required.
6.6 Summary
Markov jump processes are continuous-time and discrete-state-space stochas-
tic processes satisfying the Markov property. You should be familiar with
the Poisson, survival, sickness-death and marriage models.
The Poisson process is a simple Markov jump process. It is time-homogeneous, with stationary independent increments: the increment over an interval of length t is Poisson distributed with mean λt, where λ > 0 is the rate. Waiting times between jumps are exponentially distributed with mean 1/λ.
As with Markov chains, transition probabilities exist for a general Markov
jump process
pij (s, t) = P [Xt = j|Xs = i] , where pij (s, t) ≥ 0 and s < t,
which must also satisfy the Chapman-Kolmogorov equations.
The quantities qjj(t), qkj(t) are the transition rates, such that

qjj(t) := lim_{h→0} (pjj(t, t + h) − 1)/h, and qkj(t) := lim_{h→0} pkj(t, t + h)/h for k ≠ j.
Kolmogorov's forward and backward equations are, respectively,

∂pij(s, t)/∂t = Σ_{k∈S} pik(s, t) qkj(t) and ∂pij(s, t)/∂s = −Σ_{k∈S} qik(s) pkj(s, t).
6.7 Questions
(a) Calculate the probability that there will be fewer than 1 claim on
a given day.
(b) Estimate the probability that another claim will be reported dur-
ing the next hour. State all assumptions made.
(c) If there have not been any claims for over a week, calculate the
expected time before a new claim occurs.
Chapter 7
Machine Learning
7.1 A motivating example
Machine learning can be defined as the study of systems and algorithms that improve their performance with experience. Machine learning methods have become increasingly popular in recent decades, because of increases both in computing power and in the amount of data available. You use machine learning systems every day, even without noticing. For example, you may use the Google Translate website to translate text, and the current best methods in automated translation use machine learning. As another example, your mailbox most probably has some kind of spam filter, and most spam filters nowadays are built using machine learning technology.
Below is an example of a report from one popular spam filter, SpamAssassin.
• 0.6 HTML IMAGE RATIO 02 BODY: HTML has a low ratio of text
to image area
examples of spam e-mails and mark them as "spam". We can also collect a large number of normal e-mails and mark them as "not spam". We can then use these e-mails as a "training set", and find weights such that as many as possible of these e-mails are classified correctly.
• (a) an e-mail with a low ratio of text to image area, but not containing a URL from the blacklist. In other words, x1 = 1 but x2 = 0. The e-mail is marked as "not spam".
• (b) an e-mail with a low ratio of text to image area, and with a URL from the blacklist. In other words, x1 = 1 and x2 = 1. The e-mail is marked as "spam".
• (c) an e-mail with a bad URL only: x1 = 0 but x2 = 1. The e-mail is marked as "not spam".
For the three e-mails this gives the conditions:

(a) w1 x1 + w2 x2 = w1 · 1 + w2 · 0 = w1 < 5,
(b) w1 x1 + w2 x2 = w1 · 1 + w2 · 1 = w1 + w2 ≥ 5,
(c) w1 x1 + w2 x2 = w1 · 0 + w2 · 1 = w2 < 5.
So, in this example, w1 and w2 can be any numbers less than 5 with sum
at least 5. For example, w1 = w2 = 3 works.
We can represent the analysis in Example 7.1 geometrically, by introducing a coordinate plane with coordinates (x1, x2). Then the three e-mails in the training set are points with coordinates (1, 0), (1, 1), and (0, 1), which we denote A, B, and C, respectively. We can draw the points (A and C) corresponding to non-spam e-mails in blue, and the point (B) representing the spam e-mail in red. The "spam region" w1 x1 + w2 x2 ≥ 5 is a half-plane whose boundary is the line w1 x1 + w2 x2 = 5. So, geometrically, our task was to draw a line on the plane which separates the blue and red points. One possible such line is the one with equation 3x1 + 3x2 = 5, corresponding to the solution w1 = w2 = 3 we chose in Example 7.1.
If we used three factors to make a decision (for example, the above two plus whether the sender address is in the auto white-list), then the third factor could be represented as a coordinate x3, the training set could be depicted as a set of points in 3-dimensional space, and the problem would be to separate red and blue points by a plane with equation w1 x1 + w2 x2 + w3 x3 = 5. In general, we may have n factors and m e-mails in the training set, which, geometrically, corresponds to a set of m red and blue points in n-dimensional space. We then need to separate the points by the (n − 1)-dimensional set of points satisfying the equation Σ_{i=1}^{n} wi xi = 5 (or any other constant instead of 5). The set of points satisfying this equation is called a hyperplane.
In addition to finding the "best" weights for the given list of factors, the system may learn which new factors are worth including. For example, it can compute the frequency of various words in e-mails and find that the word "lottery" appears in a lot of spam e-mails and in almost no good e-mails. Based on this statistic, the system may introduce a new, (n + 1)-th factor indicating whether an e-mail contains the word "lottery" or not. The red and blue points then get a new coordinate x_{n+1} (with x_{n+1} = 1 for e-mails with the word "lottery" and x_{n+1} = 0 for all other e-mails), and then we separate these points in (n + 1)-dimensional space and find weights w1, w2, ..., wn, w_{n+1}, including the weight w_{n+1} of the new factor.
We can see that this problem (finding the weights based on training data) is just a problem in algebra (find wi from some system of inequalities), or, equivalently, in geometry (separate red and blue points by a hyperplane). However, we call the branch of science studying this and similar problems "machine learning", because it allows the system to improve with experience. In this specific example with the spam filter, it may initially work badly for my specific mailbox, because I, for example, may:
- receive a lot of good e-mails with a "low ratio of text to image area" because of the nature of my work, but
Because of this, I may initially see some spam messages in the main mailbox, and, conversely, some good e-mails in the spam folder. However, I can then mark the spam e-mails from the main mailbox as "spam" and the good e-mails in the spam folder as "not spam", which provides new training data for the spam filter, specific to my particular situation. The filter will then find new weights based on the new data and can quickly "learn" that, in my particular case, a "low ratio of text to image area" is not an indication of spam, so the corresponding weight should be low, or zero, or even negative. Instead, it can put large positive weights on some other factors which my spammers often use. In this way, the system may quickly learn and adapt itself to the needs of each particular user.
Moreover, if spammers are smart enough, they may try to develop e-mails which avoid typical features of spam messages, such as the word "lottery" or a low ratio of text to image area. This may help them to pass the spam filters, but only temporarily. After users mark these e-mails as "spam", the system will use this to update the weights, "understand" what these new spam messages look like, and filter them out next time.
may want the spam filter to automatically assign to every e-mail its "importance", so that we can sort e-mails by it and answer the most important e-mails first. Clear spam e-mails can have a zero or negative score, the "borderline" e-mails for which the filter is unsure whether they are spam or not can have a small positive score, then work e-mails may be prioritised over the personal ones, etc. Similarly, for a self-driving car, instead of classifying detected objects into classes, it may be more convenient to just assign to every object a score indicating how dangerous/important it is, so that the higher the score, the more urgently we need to stop. This task is called regression. Mathematically, the (linear) regression problem typically reduces to the following task: given m points in the coordinate space R^{n+1} with coordinates x1, x2, ..., xn, y, approximate the points by a linear function y = w0 + w1 x1 + w2 x2 + · · · + wn xn as well as possible. Typically, the quality of approximation is measured as the sum of squares of differences between the y coordinates of the data points and the function.
Example 7.2. Approximate the points (0, 0), (1, 1), (2, 3), (3, 3) by a line y = ax + b to minimise the sum of squares error.
Solution. For x = 0, y = a · 0 + b = b, and the data point is (0, 0), so the (squared) error is (b − 0)². Similarly, for x = 1, y = a + b, and the data point is (1, 1), so the squared error is (a + b − 1)². Continuing this way, we write down the error as

e(a, b) = b² + (a + b − 1)² + (2a + b − 3)² + (3a + b − 3)².

In optimality,

∂e(a, b)/∂a = 2(a + b − 1) + 2(2a + b − 3) · 2 + 2(3a + b − 3) · 3 = 0

and

∂e(a, b)/∂b = 2b + 2(a + b − 1) + 2(2a + b − 3) + 2(3a + b − 3) = 0.

This simplifies to the system 7a + 3b − 8 = 0 and 6a + 4b − 7 = 0, whose solution is a = 1.1, b = 0.1, so the best-fit line is y = 1.1x + 0.1; a numerical check is sketched below.
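As a quick check, numpy's least-squares solver gives the same coefficients (not part of the original notes):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 3.0, 3.0])

A = np.column_stack([x, np.ones_like(x)])  # columns correspond to a and b
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # [1.1, 0.1], i.e. y = 1.1x + 0.1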
"Customers who viewed this item also viewed", followed by a list of similar books. There can also be another list entitled "customers who bought this also bought". Also, when you read information about any film, you often get a list of suggestions for similar films you may be interested in, etc. To create such lists, the program should learn which books/products/films are "similar" or "associated" with each other, and then use these associations to provide you with the best recommendations.
When recommending the best film for you, the system often looks for hidden or latent variables, that is, some structure which helps to understand your choices. For example, you may assign scores to the films you saw, and these scores may look unstructured even to you. However, automatic analysis of your scores may reveal the "structure" that you typically assign a high score to films of a particular genre that have at least one of three particular actors.
Machine learning can also be used in a variety of other applications, for example, in playing games, from games like poker, chess, or Go, to real-life applications which resemble games, like developing optimal strategies for trading on financial markets.
In all applications, it is important to understand how to evaluate the performance of a machine learning system, e.g. of the algorithm for calculating weights for a spam filter. If we have N e-mails marked as spam or not, we can use all N e-mails to find the weights, but it might be that these weights are good only for these specific N e-mails and will not work well for any other e-mails. The situation when the weights are so adapted to the specific training set that the system fails to do anything else is called overfitting, and is one of the most serious problems in the area. To test that it did not occur, one may divide our N e-mails into two groups, and use one group (say, 90% of the e-mails) for training, and the remaining 10% of the e-mails for testing. At the test stage, we may count the number of e-mails (both spam and non-spam) which are classified correctly and divide it by the total number of e-mails; this ratio is called the accuracy of the classifier.
The set of 10% of e-mails we select for testing is called the test set, and it is usually selected randomly. However, as in any random experiment, we may get a very different result each time we repeat the procedure. One time the test set may contain some "typical" e-mails for which the filter performs well, while another time it may contain many "atypical" e-mails (e.g. good e-mails which happen to have many spam-like features) and the filter may make a lot of errors. Because of this, it is a good idea to repeat the test several times and average the results. For example, we may randomly divide our initial set of N e-mails into 10 groups, use one of them as the test set and the remaining 9 as the training set, and then repeat this procedure 10 times, with each group being the test set once. The final accuracy is the average of the 10 accuracies obtained. This procedure is an example of cross validation; a minimal sketch of it is given below.
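A minimal sketch of this 10-fold procedure follows; train_and_classify is a hypothetical stand-in for whatever training and prediction steps the filter uses, taking a training set and returning predictions for the test set.

import numpy as np

def cross_validate(X, y, train_and_classify, k=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))        # random split into k groups
    folds = np.array_split(idx, k)
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        y_pred = train_and_classify(X[train], y[train], X[test])
        accuracies.append(np.mean(y_pred == y[test]))
    return np.mean(accuracies)           # average of the k accuracies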
Sometimes the set of data used for training is further divided into two groups: a training data set and a validation data set. The training set is used to estimate the parameters of the model (such as the weights wi in Example 7.1), while the validation set is used to decide some more global questions, such as the number of parameters to consider, the number of categories to classify (should it be just spam and non-spam, or maybe 3 or more categories), the rate at which the model should learn from the data, etc. Such "global parameters" are called hyper-parameters.
In some applications, the "accuracy" as defined above (the total number of correct classifications divided by the sum of the correct and incorrect ones) is not the best way to evaluate the system performance, because there are two very different types of errors. The first mistake is when the e-mail is spam but the system does not recognise it and puts it in the main folder. This is called a False Negative (FN). The second type of mistake is when a good e-mail is put into the spam folder; this is called a False Positive (FP). If you never check the spam folder, then an FN is not a big problem, while an FP may be a big issue. We can also define a True Positive (TP), when the e-mail is spam and it is in the spam folder, and a True Negative (TN), if the e-mail is non-spam and it goes to the main folder. Then we can consider the following measures of system performance:
• Precision: Pre = TP/(TP + FP);
• Recall: Rec = TP/(TP + FN);
• F1 score: F1 = 2 · Pre · Rec/(Pre + Rec);
• False Positive Rate: FPR = FP/(TN + FP).
In many cases, there is a trade-off between recall and false positive rate: if we improve one of them, the other one may become worse. All four measures are easy to compute from the four counts, as sketched below.
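A small Python sketch of these measures (the counts are illustrative):

def classification_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    false_positive_rate = fp / (tn + fp)
    return precision, recall, f1, false_positive_rate

print(classification_metrics(tp=80, fp=10, tn=100, fn=20))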
the training set is just a set of e-mails, and it is unknown which e-mails are spam. Can we program a computer to learn something from this data? In fact, we can! The computer can still note that some e-mails are somewhat similar (for example, they use similar words like "lottery" or have a low ratio of text to image area), and "guess" that maybe this group of similar e-mails are the spam ones. This is an example of unsupervised learning.
In Example 7.1, we can find the weights algebraically, from a system of inequalities, or geometrically, as a line which separates the red and blue points on the plane. More generally, we are looking for a hyperplane which separates points in n-dimensional space, where n is the number of factors to be weighted. This n-dimensional space is called the instance space, and the whole method is known as a geometric model.
A hyperplane that perfectly separates the data may not exist, and, even if it exists, its coefficients are not always easy to compute. Here is a method which is very easy to compute. We can calculate the mean of all red points (call it point A), the mean of all blue points (call it point B), and then select the hyperplane which is perpendicular to the line segment AB and crosses it at its midpoint. This hyperplane is known as the basic linear classifier, and may separate red and blue points quite well, although not perfectly.
In Example 7.1, there are many possible solutions, for example, w1 = w2 = 3, or w1 = w2 = 4, etc. Geometrically, one can draw many possible lines which separate the red and blue points. Some of such lines may be very close to the blue points, some to the red ones. Intuitively, one may prefer a line which separates the points with maximal margin, that is, as far from the data points as possible. This idea forms the basis of the method called support vector machines. In Example 7.1, such a "best" line corresponds to the solution w1 = w2 = 10/3.
In many applications, the key task is similarity testing: is this e-mail similar to the spam e-mails in the database? Which films are most similar to the one a customer likes and rated highly? If every e-mail (or film, or any other object) is described by n parameters and is represented as a point in n-dimensional space, one easy way to define "similarity" is to say that objects are similar if and only if the corresponding points X = (x1, ..., xn) and Y = (y1, ..., yn) are close in the usual Euclidean distance

d(X, Y) = √( Σ_{i=1}^{n} (xi − yi)² ).
A very simple classification algorithm is, for every new point A to be classified, to find an already classified point B at the smallest distance from A, and then assign A the same class as B. This method is known as the nearest-neighbour classifier. In particular, imagine that in Example 7.1 a new e-mail arrives with no bad url and no high image to text area ratio. It corresponds to the new point (0, 0) on the plane. The distance from it to the “spam” point (1, 1) is

√((0 − 1)² + (0 − 1)²) = √2,

while the distances to both “non-spam” points are

√((0 − 0)² + (0 − 1)²) = √((0 − 1)² + (0 − 0)²) = 1 < √2.

Hence, the nearest-neighbour classifier would classify this new point as non-spam.
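The same computation in Python (a minimal sketch of the nearest-neighbour classifier on the three points of Example 7.1):

```python
import math

# Labelled points from Example 7.1: (1, 1) is spam, the others are not.
labelled = {(1, 1): "spam", (0, 1): "non-spam", (1, 0): "non-spam"}

def nearest_neighbour(point):
    # return the labelled point at the smallest Euclidean distance
    return min(labelled, key=lambda p: math.dist(p, point))

print(labelled[nearest_neighbour((0, 0))])   # non-spam, as computed above
```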
Another distance-based method can be used for the clustering task. Imagine that we need to cluster the data into K clusters, and we have some initial guess how to do this. For each cluster i = 1, 2, ..., K, we can calculate the mean Mi of all points in cluster i. After this, for each point X, we can compute the distances from X to M1, M2, ..., MK, select the minimal of these distances, and re-assign X to the corresponding cluster. We can then repeat this procedure until no point is re-assigned. This method is called K-means, and it is a very popular and powerful method for clustering.
Example 7.3. Consider 3 points A, B, C on the plane with coordinates
(0, 0), (0, 1), (4, 0), initially classified such that A and C are red while B is
blue. Then the mean of red cluster is M1 = (2, 0), while the mean of blue
cluster is M2 = B = (0, 1). The distances from A to Mi are
d(A, M1 ) = 2, d(A, M2 ) = 1,
hence M2 is the nearest one, and A moves to the blue cluster. Similarly,
d(B, M1) = √5, d(B, M2) = 0, d(C, M1) = 2, d(C, M2) = √17,

hence B and C stay in the blue and red clusters, respectively.
We now repeat the procedure: the new cluster means are M1 = C = (4, 0)
and M2 = (0, 0.5), respectively. It is easy to check that the closest Mi is M2
for A and B and M1 for C, hence all points stay in the same class and the
algorithm terminates.
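Example 7.3 can be reproduced with a short Python sketch of the K-means iteration (not part of the original notes):

```python
import numpy as np

# Points A, B, C of Example 7.3; cluster 0 is "red", cluster 1 is "blue".
points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0]])
labels = np.array([0, 1, 0])      # initially A and C red, B blue

while True:
    # means of the current clusters
    means = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])
    # distance from every point to every cluster mean
    dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)   # re-assign to the nearest mean
    if (new_labels == labels).all():
        break                           # no point moved: terminate
    labels = new_labels

print(labels)   # [1 1 0]: A and B end up blue, C red, as in Example 7.3
```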
Of course, for the nearest-neighbour classifier, K-means, and other related methods, it is not necessary to use the Euclidean distance. For example, we may instead use the Manhattan distance between points X = (x1, ..., xn) and Y = (y1, ..., yn), given by

Σ_{i=1}^n |xi − yi|.
7.4 Probabilistic analysis
In the classification task, models using probabilistic analysis may be useful. For example, assume that we try to decide whether an e-mail is spam or not based on the information whether it contains the words “lottery” and “tablet”.
• 15 e-mails contain the word “lottery” only; 10 of them are spam, and 5 are not;
• 24 e-mails contain the word “tablet” only; 20 of them are spam, and 4 are not;
In a similar way, we can apply the MAP decision rule to all possible combinations of values of L and T, and write a computer program which gives the answer in every possible case. Such a computer program is an example of a decision tree. In general, a decision tree may contain any number of these nested “if... then” structures. Instead of a final decision (SPAM or NON-SPAM), the program may return, for example, a real number which represents the probability that an e-mail is spam.
In Example 7.4, suppose that part of the e-mail is encoded and the filter cannot read it. In the open part, the filter sees the word “lottery” but not “tablet”, and it is not clear whether “tablet” is present in the encoded part or not. In this case, the MAP decision rule classifies the e-mail as spam if P(S = 1|L = 1) > 0.5. This probability can be estimated from the law of total probability (13), or directly from the data as

P(S = 1|L = 1) = (10 + 0)/(15 + 1) = 10/16 = 0.625,

because in total there are 15 + 1 = 16 e-mails with the word “lottery”, and 10 of them are spam.
In fact, in situations like Example 7.4, statisticians often make decisions based on different conditional probabilities, such as P(T = L = 0|S = 1) and P(T = L = 0|S = 0), which are examples of a likelihood function. The logic is that one asks: how likely would I be to find an e-mail looking like this (in our case, with neither the word “lottery” nor “tablet”) in the spam folder? And how likely would I be to find it in the non-spam folder? In our example, there are 50 spam e-mails, 20 of them with T = L = 0, and also 50 non-spam e-mails, 40 of them with T = L = 0, hence

P(T = L = 0|S = 1) = 20/50 < 40/50 = P(T = L = 0|S = 0).

Thus, observing an e-mail like this in the spam folder is about half as likely as finding such an e-mail in the non-spam folder. Hence, the e-mail should be classified as non-spam.
The two methods above (the MAP and likelihood methods) are related by Bayes’ theorem

P(B|A) = P(A ∩ B)/P(A) = P(A|B)P(B)/P(A),

which in our case says that

P(S = 1|T = L = 0) = P(T = L = 0|S = 1)P(S = 1)/P(T = L = 0).

The same formula works for all other possible values of T and L, for example

P(S = 1|T = L = 1) = P(T = L = 1|S = 1)P(S = 1)/P(T = L = 1).
In fact, our data-based estimate P(S = 1|T = L = 1) = P(T = L = 1|S = 1) = 0 is unjustified, because there is just one e-mail with T = L = 1, and it is strange to calculate any probabilities based on a sample with ONE experiment. This situation is very typical: there may be very few e-mails with the word pattern exactly as prescribed, or, in the general case, very few objects whose parameter values exactly equal some prescribed values. An alternative way to estimate probabilities like P(T = L = 1|S = 1) is to assume that the words occur independently (or at least independently conditional on the event S = 1), and then

P(S = 1|T = L = 1) = P(T = 1|S = 1) · P(L = 1|S = 1) · P(S = 1)/P(T = L = 1).

The classification based on probabilities calculated in this way is called naive Bayes classification, with the word “naive” reflecting the assumption of independence.
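Here is a minimal Python sketch of this naive Bayes estimate, using the counts given in the text for Example 7.4 (50 spam and 50 non-spam e-mails; the single e-mail with both words is non-spam):

```python
# Counts from Example 7.4: L = "lottery" present, T = "tablet" present.
n_spam, n_ham = 50, 50
L_spam, L_ham = 10, 6   # e-mails with "lottery": 10 spam, 5 + 1 non-spam
T_spam, T_ham = 20, 5   # e-mails with "tablet":  20 spam, 4 + 1 non-spam
p_spam = p_ham = 0.5    # prior probabilities

# "Naive" joint likelihoods, assuming the words occur independently.
num_spam = (T_spam / n_spam) * (L_spam / n_spam) * p_spam   # 0.4 * 0.2 * 0.5
num_ham  = (T_ham / n_ham) * (L_ham / n_ham) * p_ham        # 0.1 * 0.12 * 0.5

# Posterior probability that an e-mail containing both words is spam.
print(num_spam / (num_spam + num_ham))   # ~0.870, unlike the raw estimate 0
```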
Another probabilistic learning method which has recently been used with dramatic success is reinforcement learning. In reinforcement learning, the learner is not given a target output in the same way as with supervised learning. The learner uses the input data to choose some output, and is then told how well it is doing, or how close the chosen output is to the desired output. The learner can then use this information, as well as the input data, to choose another hypothesis. This method has recently been applied to playing games, such as Chess and Go, in which a machine first plays against itself at random, and then automatically learns from experience, increasing the chance of selecting moves similar to those that led to success in previous games. A generic program, Alpha Zero, designed to play many games with the same algorithm, having just the rules of the games as input, quickly learned from self-play and beat all human players and all the specially designed programs which humans had developed over many decades!
7.5 Stages of analysis in Machine Learning

A Machine Learning analysis can often be broken down into a series of stages.

• Collecting data

The data must be assembled in a form suitable for analysis using computers. Several different tools are useful for achieving this: a spreadsheet may be used, or a database such as Microsoft Access. Data may come from a variety of sources, including sample surveys, population censuses, company administration systems, and databases constructed for specific purposes (such as the Human Mortality Database, www.mortality.org). During the last 20-30 years the size of datasets available for analysis by actuaries and other researchers has increased enormously. Datasets, such as those on purchasing behaviour collected by supermarkets, relate to millions of transactions.
• Feature scaling

Some Machine Learning techniques will only work effectively if the variables are of similar scale. If, for example, one variable (say x1) is measured in kilometres, and the other variables are in centimetres, the value of x1 will be 100,000 times smaller than it would be with the same data if it, too, were measured in centimetres. This may lead to inadequate results in a number of machine learning methods, such as linear regression.
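A minimal Python sketch of one common form of feature scaling (standardisation to zero mean and unit variance; the data below are hypothetical):

```python
import numpy as np

# Hypothetical data: column 1 in kilometres, column 2 in centimetres.
X = np.array([[1.2, 310.0],
              [0.8, 270.0],
              [1.5, 390.0]])

# Standardise each column: subtract its mean, divide by its std deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0))   # ~[0, 0]
print(X_scaled.std(axis=0))    # [1, 1]
```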
• Fitting a model

This involves choosing a suitable Machine Learning algorithm using a subset of the data. The algorithm will typically represent the data as a model, and the model will have parameters which need to be estimated from the data. This stage is analogous to the process of fitting a model to data when using linear regression and generalised linear models.
– Any modifications to the data (e.g. recoding or transformation of variables, or computation of new variables) should be clearly described, ideally with the computer code used. In Machine Learning this is often called “feature engineering”, whereby combinations of features are used to create something more meaningful.
– The selection of the algorithm and the development of the model should be described, again with the computer code being made available. This should include the parameters of the model and how and why they were chosen.
7.6 Summary
Machine learning can be defined as the study of systems and algorithms that
improve their performance with experience.
Machine learning is typically used to solve classification problems, clustering, regression problems, analysis of association rules, discovering hidden or latent variables, etc.

One of the main problems in machine learning is overfitting, when the system works perfectly on the data it was trained on but performs badly on any other data. To check that this has not occurred, one may divide the data into two groups, and use one group for training and the other for testing. We can then exchange the roles of the training and testing data; this procedure is called cross-validation. In fact, the data for training can be further divided into two groups: a training data set and a validation data set, the first used to find model parameters, and the second to find hyper-parameters. The training-validation-test data proportion may be, for example, 60%-20%-20%.
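A minimal Python sketch of such a 60%-20%-20% split on a hypothetical dataset of 100 examples:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
indices = rng.permutation(100)      # shuffle the example indices first

# split at positions 60 and 80: 60% training, 20% validation, 20% test
train, validation, test = np.split(indices, [60, 80])
print(len(train), len(validation), len(test))   # 60 20 20
```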
In the yes-no classification problem, the correct outcomes are true positive (TP) and true negative (TN), while the incorrect ones are false positive (FP) and false negative (FN). These can be used to calculate various measures of performance, such as Precision, Recall, F1 score, and False Positive Rate.
You should be able to define and understand the following terms and
methods:
• Supervised learning
• Unsupervised learning
• Linear classifier
• Nearest-neighbour classifier
• K-means algorithm
• Decision tree
• Likelihood function
• Reinforcement learning
Machine Learning tasks can often be broken down into a series of steps:
• Collecting data
• Feature scaling
• Splitting the data into the training, validation and testing data sets
7.7 Questions
1. Approximate points (0, 0, 0), (1, 0, 2), (0, 1, 3), (1, 1, 4) in the coordinate
space (x1 , x2 , y) by a plane y = ax1 + bx2 + c to minimize the sum of
squares error.
2. There are two points marked on the plane: red point A with coordinates (0, 0) and blue point B with coordinates (10, 10). Then 4 points C, D, E, F arrive in order, and each is coloured in the same way as its nearest neighbour. The coordinates of F are (10, 8). Give examples of coordinates of points C, D, E such that point F will be coloured red.
Solutions of end-of-chapter questions
8.1 Chapter 1 solutions
1. A sample space consists of five elements Ω = {a1 , a2 , a3 , a4 , a5 }. For which
of the following sets of probabilities does the corresponding triple (Ω, A, P )
become a probability space? Why?
(a) p(a1 ) = 0.3; p(a2 ) = 0.2; p(a3 ) = 0.1; p(a4 ) = 0.1; p(a5 ) = 0.1;
(b) p(a1 ) = 0.4; p(a2 ) = 0.3; p(a3 ) = 0.1; p(a4 ) = 0.1; p(a5 ) = 0.1;
(c) p(a1 ) = 0.4; p(a2 ) = 0.3; p(a3 ) = 0.2; p(a4 ) = −0.1; p(a5 ) = 0.2;
Answer: Since Ω is finite, we may assume that A is the set of all subsets
of Ω. So we only have to look at the point probabilities p(ai ) = P ({ai }) for
i = 1, . . . , 5. From the definition of a discrete probability distribution, we
know that the sum of all point probabilities must be equal to 1, i.e. here we
must have
p(a1 ) + p(a2 ) + · · · + p(a5 ) = 1.
In part (a), the value of this sum is equal to 0.8, which means that P is not a probability distribution and therefore (Ω, A, P) is not a probability space.
We further know that probabilities can never be negative. In part (c), we
have p(a4 ) = −0.1, which means that (Ω, A, P ) is not a probability space.
In part (b), (Ω, A, P ) is indeed a probability space, since here all require-
ments are met.
2. The CDF is given by

FX(x) = ∫_{−∞}^x ρX(z) dz = { 0, if x ≤ 1/2;  ∫_{1/2}^x 2 dz = 2x − 1, if x ∈ (1/2, 1);  1, if x ≥ 1. }

Since E(X) = 3/4, the variance is

Var(X) = ∫_{1/2}^1 2(x − 3/4)² dx = (2/3)[(1/4)³ + (1/4)³] = 1/(3 × 16) = 1/48 = 0.020833...
3. Assets A and B have the following distribution of returns in various
states:
State Asset A Asset B Probability
1 10% −2% 0.2
2 8% 15% 0.2
3 25% 0% 0.3
4 −14% 6% 0.3
Show that the correlation between the returns on asset A and asset B is equal
to −0.3830.
Answer: We use the formula

Corr(RA, RB) = Cov(RA, RB)/√(Var(RA) Var(RB)),

so we first need the means and variances of the returns RA and RB. We have
E(RA) = (10 × 0.2 + 8 × 0.2 + 25 × 0.3 + (−14) × 0.3)% = 6.9%,

Var(RA) = E(RA²) − (E(RA))² = (10² × 0.2 + 8² × 0.2 + 25² × 0.3 + (−14)² × 0.3 − 6.9²) %% = (15.2148)² %%,

√Var(RA) = 15.2148%,

E(RB) = (−2 × 0.2 + 15 × 0.2 + 0 × 0.3 + 6 × 0.3)% = 4.4%,

Var(RB) = E(RB²) − (E(RB))² = ((−2)² × 0.2 + 15² × 0.2 + 0² × 0.3 + 6² × 0.3 − 4.4²) %% = (6.1025)² %%,

√Var(RB) = 6.1025%,

E(RA RB) = (10 × (−2) × 0.2 + 8 × 15 × 0.2 + 25 × 0 × 0.3 + (−14) × 6 × 0.3) %% = −5.2 %%.
Note that % and %% stand for 1/100 and 1/100², respectively. Using the values above, we obtain

Corr(RA, RB) = (−5.2/100² − 6.9 × 4.4/100²)/(0.152148 × 0.061025) = −0.3830,

as required.
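This value can be double-checked numerically with a short Python sketch:

```python
import numpy as np

p  = np.array([0.2, 0.2, 0.3, 0.3])       # state probabilities
ra = np.array([10.0, 8.0, 25.0, -14.0])   # returns on asset A, in %
rb = np.array([-2.0, 15.0, 0.0, 6.0])     # returns on asset B, in %

mean_a, mean_b = p @ ra, p @ rb
cov  = p @ (ra * rb) - mean_a * mean_b
corr = cov / np.sqrt((p @ ra**2 - mean_a**2) * (p @ rb**2 - mean_b**2))
print(round(corr, 4))   # -0.383
```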
4. Formalise Example 8.5 as Ω = {ω1 , ω2 , ω3 , ω4 }, P ({ω1 }) = P ({ω2 }) =
P ({ω3 }) = P ({ω4 }) = 1/4 and
A := {ω1 , ω4 }, B := {ω2 , ω4 }, C := {ω3 , ω4 }.
Prove that the pairs (A, B), (A, C) and (B, C) are independent, but the triple
(A, B, C) is not mutually independent according to Definition 8.2.
Answer: We have

P(A) = P(B) = P(C) = 1/4 + 1/4 = 1/2,

P(A ∩ B) = P({ω4}) = 1/4 = P(A)P(B),
P(A ∩ C) = P({ω4}) = 1/4 = P(A)P(C),
P(B ∩ C) = P({ω4}) = 1/4 = P(B)P(C),
which shows that the pairs (A, B), (A, C) and (B, C) are independent. However,

P(A ∩ B ∩ C) = P({ω4}) = 1/4 ≠ 1/8 = P(A)P(B)P(C).

So the triple (A, B, C) is not mutually independent.
8.2 Chapter 2 solutions
1. The number of claims a company received during the last 12 months are
Then

l′(p) = 12/p − (1/(1 − p)) Σ_{i=1}^{12} ki = 0

if

p = 12/(12 + Σ_{i=1}^{12} ki) = 12/132 = 1/11.
and

c = −log(1 − α1)/q1^γ ≈ −log(1 − 0.25)/4010.8037 ≈ 0.002327.
4. Assume that the history of claim sizes is the same as in the previous question, but the company orders a reinsurance policy with excess of loss reinsurance above the level M = 2000.
(a) Write down the history of expenses of the reinsurer;
(b) Assuming that the original claim size distribution is a Pareto distribution with parameters α > 2 and λ > 0, estimate the unknown parameters using the method of moments with the data available to the reinsurer;
(c) Comment on whether you think the Pareto distribution is a good model for these data.
Answer: (a) The history of reinsurer expenses is 1500, 5970, 1076, 837, 560, 950.
(b) Using the formulas from Example 2.1 with k = 6 and wi as above, we get

(1/k) Σ_{i=1}^k wi = (1/6)(1500 + 5970 + 1076 + 837 + 560 + 950) = 1815.5

and

(1/k) Σ_{i=1}^k wi² − ((1/k) Σ_{i=1}^k wi)² = (1/6)(1500² + 5970² + 1076² + 837² + 560² + 950²) − 1815.5² = 3531517.25.
(c) The resulting parameters do not look realistic. One would expect much smaller values, such as α ≈ 3 or α ≈ 4. We may conclude that the Pareto distribution is not the best model to fit these data.
8.3 Chapter 3 solutions
1. Assume that the number N of claims can be any integer from 1 to 100 with equal chances, and the claim sizes X1, ..., XN are i.i.d. from a Pareto distribution with parameters α = 3 and λ = 2. Estimate the mean and variance of the aggregate claim S.
μX = λ/(α − 1) = 1 and σX² = αλ²/((α − 1)²(α − 2)) = 3;

see (4) for the derivation of the variance. Hence, the mean and variance of S are

μS = μN · μX = 50.5 · 1 = 50.5

and

σS² = μN σX² + σN² μX² = 50.5 · 3 + 833.25 · 1² = 984.75.
P(N = 0) = e^{−45/7} (45/7)^0/0! ≈ 0.0016,

P(N = 1) = e^{−45/7} (45/7)^1/1! ≈ 0.0104,

P(N = 2) = e^{−45/7} (45/7)^2/2! ≈ 0.0334,

P(N ≥ 3) = 1 − P(N = 0) − P(N = 1) − P(N = 2) ≈ 1 − 0.0016 − 0.0104 − 0.0334 = 0.9546.
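These Poisson probabilities are easy to check with a few lines of Python:

```python
import math

lam = 45 / 7   # Poisson parameter
p = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(3)]
print([round(x, 4) for x in p])   # [0.0016, 0.0104, 0.0334]
print(round(1 - sum(p), 4))       # 0.9546
```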
(b) Let Y be a claim size, uniformly distributed on [a, b], where a = 1,000 and b = 2,000. Then the density of Y is f(y) = 1/(b − a), a ≤ y ≤ b, and

E[Y] = ∫_a^b y/(b − a) dy = (1/(b − a)) (y²/2) |_a^b = (a + b)/2 = 1,500,

E[Y²] = ∫_a^b y²/(b − a) dy = (1/(b − a)) (y³/3) |_a^b = (a² + ab + b²)/3 = (7/3) · 10⁶.

Then for the aggregate claim size S,

E[S] = λE[Y] = (45/7) · 1,500 ≈ 9,643,

Var[S] = λE[Y²] = (45/7) · (7/3) · 10⁶ = 15 · 10⁶.
(c) If there are at least 3 claims, then the total size will be at least 3,000. If there are 2 claims, there is a 50% chance that the total size reaches 3,000. If there is 1 claim or no claims, the total size will surely be less than 3,000. Hence, the answer is

P(N ≥ 3) + (1/2) P(N = 2) ≈ 0.9546 + 0.0167 = 0.9713.

and

Var[X|λ] = (e^{σ²} − 1) e^{2μ+σ²} = (e^{σ²} − 1) λ².
By the law of total expectation,
= (e^{σ²} − 1)(p² + s²) + s² = e^{σ²}(p² + s²) − p².
By (39),

σS² = (k(1 − p)/p) E[X²] + (k(1 − p)²/p²) (E[X])² = (20(1 − 0.25)/0.25) · 80,000 + (20(1 − 0.25)²/0.25²) · 200² = 12,000,000.

Hence,

P(S ≤ 20,000) = P( (S − 12,000)/√12,000,000 ≤ (20,000 − 12,000)/√12,000,000 ) ≈ P(Z ≤ 2.3),
8.4 Chapter 4 solutions
1. (a) Calculate the Hazard rate of the Pareto distribution. Check if it is an
increasing or decreasing function.
(b) Calculate the Mean residual life of the Pareto distribution. Check if
it is an increasing or decreasing function.
(c) What conclusion about the tails of the Pareto distribution can we make based on items (a) and (b)?
Answer: The Pareto distribution with parameters α > 0 and λ > 0 has PDF

f(x) = αλ^α/(λ + x)^{α+1}, x > 0,

and CDF

F(x) = 1 − (λ/(λ + x))^α.

(a) The hazard rate is

h(x) = f(x)/(1 − F(x)) = (αλ^α/(λ + x)^{α+1}) ÷ (λ^α/(λ + x)^α) = α/(λ + x).

The derivative

h′(x) = −α/(λ + x)² < 0

is negative, hence h(x) is a decreasing function.
(b) The mean residual life is

e(x) = ( ∫_x^∞ (1 − F(y)) dy )/(1 − F(x)) = ((λ + x)/λ)^α ∫_x^∞ (λ/(λ + y))^α dy = ((λ + x)/λ)^α · λ^α (λ + x)^{1−α}/(α − 1) = (λ + x)/(α − 1),

which is an increasing function of x provided that α > 1.
(c) Because the hazard rate h(x) is a decreasing function and the mean residual life e(x) is an increasing function, this is an indication that the Pareto distribution has a heavy tail. If claims follow this distribution, we can expect some very large claims with a probability that is not very small.
2. Prove the formula for F(x) in Example 4.1.

Answer: Substituting FX(x) = 1 − exp(−λx), an = (1/λ) ln n and βn = 1/λ into (45), we get

F(x) = lim_{n→∞} (1 − exp(−λ((1/λ) ln n + (1/λ) x)))^n = lim_{n→∞} (1 − exp(−x − ln n))^n = lim_{n→∞} (1 − e^{−x}/n)^n = e^{−e^{−x}}.
Answer: (a) Let

ψ(t) = (− ln t)^α, 0 < t ≤ 1,

where α ≥ 1 is a parameter. Then

ψ′(t) = α(− ln t)^{α−1} · (−1/t) < 0

and

ψ″(t) = α(α − 1)(− ln t)^{α−2} · (1/t²) + α(− ln t)^{α−1} · (1/t²) > 0,

hence ψ(t) is strictly decreasing and convex. We also have ψ(1) = (− ln 1)^α = 0. To find the inverse function, we solve the equation ψ(t) = (− ln t)^α = x to get t = exp(−x^{1/α}). Then by (51),

C(u, v) = ψ^{−1}(ψ(u) + ψ(v)) = exp{ −((− ln u)^α + (− ln v)^α)^{1/α} },

which is the Gumbel copula.
(b) Let

ψ(t) = − ln((e^{−αt} − 1)/(e^{−α} − 1)), 0 < t ≤ 1,

where α ≠ 0 is a parameter. Then

ψ′(t) = −((e^{−α} − 1)/(e^{−αt} − 1)) · (−α e^{−αt}/(e^{−α} − 1)) = α e^{−αt}/(e^{−αt} − 1).

If α > 0 then e^{−αt} − 1 < 0, while if α < 0 then e^{−αt} − 1 > 0. In any case, ψ′(t) < 0.

Next, solving ψ(t) = x gives e^{−αt} − 1 = e^{−x}(e^{−α} − 1), hence t = −(1/α) ln(e^{−x}(e^{−α} − 1) + 1). For x = ψ(u) + ψ(v),

exp(−x) = exp( ln((e^{−αu} − 1)/(e^{−α} − 1)) + ln((e^{−αv} − 1)/(e^{−α} − 1)) ) = (e^{−αu} − 1)(e^{−αv} − 1)/(e^{−α} − 1)²,

and by (51),

C(u, v) = ψ^{−1}(x) = −(1/α) ln( 1 + (e^{−αu} − 1)(e^{−αv} − 1)/(e^{−α} − 1) ),

which is the Frank copula.
(c) Let

ψ(t) = (1/α)(t^{−α} − 1), 0 < t ≤ 1,

where α ≠ 0 is a parameter. Then

ψ′(t) = −t^{−α−1} < 0

and

ψ″(t) = −(−α − 1) t^{−α−2} = (α + 1) t^{−α−2},

hence ψ(t) is strictly decreasing for all α ≠ 0 and convex for α > −1. We also have ψ(1) = (1/α)(1^{−α} − 1) = 0. To find the inverse function, we solve the equation ψ(t) = (1/α)(t^{−α} − 1) = x to get t = (αx + 1)^{−1/α}. Then by (51),

C(u, v) = ψ^{−1}(ψ(u) + ψ(v)) = (u^{−α} + v^{−α} − 1)^{−1/α},

which is the Clayton copula.
(a) For the independence copula C(u, v) = uv, we have C(u, u) = u², so

λL = lim_{u→0+} C(u, u)/u = lim_{u→0+} u²/u = 0.

(b) For the Gumbel copula, C(u, u) = u^β with β = 2^{1/α}, so

λL = lim_{u→0+} C(u, u)/u = lim_{u→0+} u^{β−1} = 0.

Next, C̄(u, u) = −1 + 2u + C(1 − u, 1 − u) = −1 + 2u + (1 − u)^β, and

λU = lim_{u→0+} C̄(u, u)/u = 2 + lim_{u→0+} ((1 − u)^β − 1)/u = 2 − β = 2 − 2^{1/α}.
(c) For the Frank copula

C(u, v) = −(1/α) ln( 1 + (e^{−αu} − 1)(e^{−αv} − 1)/(e^{−α} − 1) ),

we have

C(u, u) = −(1/α) ln( 1 + (e^{−αu} − 1)²/(e^{−α} − 1) ).

As u → 0+, e^{−αu} − 1 ≈ −αu, and

C(u, u) ≈ −(1/α) ln( 1 + α²u²/(e^{−α} − 1) ) ≈ −(1/α) · α²u²/(e^{−α} − 1) = −αu²/(e^{−α} − 1),

hence

λL = lim_{u→0+} C(u, u)/u = −(α/(e^{−α} − 1)) lim_{u→0+} u = 0.
Next,

C̄(u, u) = −1 + 2u + C(1 − u, 1 − u) = −1 + 2u − (1/α) ln( 1 + (e^{−α(1−u)} − 1)²/(e^{−α} − 1) ),

and

λU = lim_{u→0+} C̄(u, u)/u = 0.
(d) For the Clayton copula C(u, v) = (u^{−α} + v^{−α} − 1)^{−1/α}, we have

C(u, u) = (2u^{−α} − 1)^{−1/α} ≈ 2^{−1/α} u as u → 0+,

and

λL = lim_{u→0+} C(u, u)/u = 2^{−1/α}.

Also,

λU = lim_{u→0+} C̄(u, u)/u = lim_{u→0+} (−1 + 2u + 1 − 2u + O(u²))/u = 0.
8.5 Chapter 5 solutions
1. Consider a Markov chain with state space S = {0, 1, 2} and transition matrix

P = ( p        q     0
      1/4      0     3/4
      p − 1/2  7/10  1/5 ).

(a) Calculate values for p and q.
(b) Draw the transition graph for the process.
(c) Calculate the transition probabilities p_{i,j}^{(3)}.
(d) Find any stationary distributions for the process.

Answer: (a) The sum of all entries in the last row must be equal to 1, as a consequence of which p = 1 − 1/5 − 7/10 + 1/2 = 3/5. In view of the first row, we see that q = 2/5.
(b) [Transition graph: three nodes 0, 1, 2, with arrows 0 → 0 (3/5), 0 → 1 (2/5), 1 → 0 (1/4), 1 → 2 (3/4), 2 → 0 (1/10), 2 → 1 (7/10), 2 → 2 (1/5).]
(d) It can be shown that the only stationary distribution is given by

π = (π1, π2, π3) = (55/179, 64/179, 60/179) ≈ (0.30726, 0.35754, 0.33520).

Indeed, this follows if we solve the linear equations πP = π for π1, π2, π3 ∈ [0, 1] with π1 + π2 + π3 = 1. More precisely, we have

(3/5)π1 + (1/4)π2 + (1/10)π3 = π1,   (83)
(2/5)π1 + 0π2 + (7/10)π3 = π2,       (84)
0π1 + (3/4)π2 + (1/5)π3 = π3,        (85)
π1 + π2 + π3 = 1.                    (86)

From (85) it follows that π3 = (15/16)π2. Using this in (84), we see that π2 = (64/55)π1 and, in turn, π3 = (12/11)π1. In view of (86), we then get π1(1 + 64/55 + 12/11) = 1, i.e. π1 = 55/179. From the above, we then obtain the remaining values π2 and π3 as indicated. We did not use (83). This equation must be valid, since P is a stochastic matrix. Therefore, (83) can be used to check our solution.
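The stationary distribution can also be checked numerically with a short Python sketch:

```python
import numpy as np

P = np.array([[3/5, 2/5, 0.0],
              [1/4, 0.0, 3/4],
              [1/10, 7/10, 1/5]])

# Solve pi P = pi together with pi1 + pi2 + pi3 = 1; the stacked system
# is consistent, so the least-squares solution is the exact one.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)   # [0.30726... 0.35754... 0.33519...] = (55/179, 64/179, 60/179)
```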
Suppose the equation is true for an N ∈ ℕ. Then, using the Markov property of {Xk}, we get

P(X0 = j0, X1 = j1, ..., X_{N+1} = j_{N+1})
= P(X_{N+1} = j_{N+1} | X0 = j0, X1 = j1, ..., X_N = j_N) × P(X0 = j0, X1 = j1, ..., X_N = j_N)
= P(X_{N+1} = j_{N+1} | X_N = j_N) × P(X0 = j0) ∏_{n=0}^{N−1} p_{j_n, j_{n+1}}(n, n + 1)
= p_{j_N, j_{N+1}}(N, N + 1) × P(X0 = j0) ∏_{n=0}^{N−1} p_{j_n, j_{n+1}}(n, n + 1)
= P(X0 = j0) ∏_{n=0}^{N} p_{j_n, j_{n+1}}(n, n + 1),
Level 1: 0% discount;
• Following a year with one claim, move to the next lower level, or remain
at level 1.
• Following a year with two or more claims, move down two levels, or
move to level 1 (from level 2) or remain at level 1.
(i) Explain why Xt is a Markov chain. Write down the transition matrix
of this chain.
(ii) Calculate the probability that a policyholder who is currently at level
2 will be at level 2 after:
Answer:
(i) It is clear that X(t) is a Markov chain; knowing the present state, any
additional information about the past is irrelevant for predicting the
next transition.
Then the transition matrix is given by

P = ( 0.15  0.85  0     0
      0.15  0     0.85  0
      0.03  0.12  0     0.85
      0     0.03  0.12  0.85 ).
(ii) (a) For the one year transition, p22 = 0, since with probability 1 the chain will leave state 2.

(b) The second order transition matrix is given by

P^{(2)} = P · P = ( 0.15    0.1275  0.7225  0
                    0.048   0.2295  0       0.7225
                    0.0225  0.051   0.204   0.7225
                    0.0081  0.0399  0.1275  0.8245 ),

so the required probability is p22^{(2)} = 0.2295.
(iii) The chain is irreducible, as any state is reachable from any other state. It is also aperiodic. For states 1 and 4, the chain can simply remain there. This is not the case for states 2 and 3. However, these are also aperiodic, since starting from 2 the chain can return to 2 in two and in three transitions (from the previous part of the question), and gcd(2, 3) = 1. Similarly, the chain started at 3 can return to 3 in two steps (look at P²) and in three steps.
(iv) The chain is irreducible and has a finite state space and thus has a
unique stationary distribution.
(v) To find the long run probability that the chain is at level 2, we need to calculate the unique stationary distribution π. This amounts to solving the matrix equation πP = π. This is a system of 4 equations in 4 unknowns, solved together with the normalisation

π1 + π2 + π3 + π4 = 1.

Let p_{ij}^{(n)} be the n-step transition probability of an irreducible aperiodic Markov chain on a finite state space. Then lim_{n→∞} p_{ij}^{(n)} = πj for each i and j. Thus the long run probability that the chain is in state 2 is given by π2 = 0.05269.
8.6 Chapter 6 solutions
1. Claims are known to follow a Poisson process with a uniform rate of 3 per
day.
(a) Calculate the probability that there will be fewer than 1 claim on a
given day.
(b) Estimate the probability that another claim will be reported during the
next hour. State all assumptions made.
(c) If there have not been any claims for over a week, calculate the expected
time before a new claim occurs.
Answer: Let {Nt }t∈[0,∞) denote our Poisson process with rate λ = 3, where
the time is measured in days.
(a) We have to evaluate P (Nt+1 − Nt < 1) for a fixed t ≥ 0. But this is
equal to
P (N1 = 0) = e−λ = e−3 = 0.04979.
(b) We look for the probability that, during the time interval (t, t + 1/24] for a fixed t, at least one claim will be reported, i.e., assuming the claim rate is uniform over the day,

P(N_{t+1/24} − N_t ≥ 1) = 1 − e^{−3/24} = 1 − e^{−1/8} ≈ 0.1175.
where i, j ∈ S and 0 ≤ t1 < t2 < t3 < ∞. We have
The individual remains in the healthy state from time s to time s+w and then
jumps to the state dead (where he remains) or to the state sick (where he
jumps to state dead by time t). Note that here pDD (s+w, t) = 1. Further note
that the formula for pHS(s, t) does not contain the term pDS(s + w, t)μ(s + w), since the probability of jumping from dead to sick is equal to zero.

(b) Solve Kolmogorov's forward equations for this Markov jump process to find all transition probabilities.
(d) What is the probability that the process will be in state 0 in the long
term? Does it depend on the initial state?
Answer: (b) Kolmogorov's forward equations are

dp00(t)/dt = −α p00(t) + β p01(t),
dp01(t)/dt = α p00(t) − β p01(t),
dp10(t)/dt = −α p10(t) + β p11(t),
dp11(t)/dt = α p10(t) − β p11(t).

Substituting p01(t) = 1 − p00(t) into the first equation, we get the equation

dp00(t)/dt = −α p00(t) + β(1 − p00(t)) = −(α + β) p00(t) + β,

which has the general solution

p00(t) = β/(α + β) + C e^{−(α+β)t}.

The initial condition p00(0) = 1 leads to C = α/(α + β), so finally we get

p00(t) = β/(α + β) + (α/(α + β)) e^{−(α+β)t}.

The transition probabilities p01(t), p10(t), and p11(t) can be found similarly. They are:

p01(t) = α/(α + β) − (α/(α + β)) e^{−(α+β)t};
p10(t) = β/(α + β) − (β/(α + β)) e^{−(α+β)t};
p11(t) = α/(α + β) + (β/(α + β)) e^{−(α+β)t}.
(c) For a time-homogeneous Markov process, the Chapman–Kolmogorov equations take the form

pij(t + s) = Σ_{k∈S} pik(t) pkj(s).

In our case, S = {0, 1}, thus there are 4 equations. For example, for i = j = 0 we get

p00(t + s) = p00(t) p00(s) + p01(t) p10(s).

So we should check that

β/(α + β) + (α/(α + β)) e^{−(α+β)(t+s)}
= ( β/(α + β) + (α/(α + β)) e^{−(α+β)t} )( β/(α + β) + (α/(α + β)) e^{−(α+β)s} )
+ ( α/(α + β) − (α/(α + β)) e^{−(α+β)t} )( β/(α + β) − (β/(α + β)) e^{−(α+β)s} ),
8.7 Chapter 7 solutions
1. Approximate points (0, 0, 0), (1, 0, 2), (0, 1, 3), (1, 1, 4) in the coordinate
space (x1 , x2 , y) by a plane y = ax1 + bx2 + c to minimize the sum of squares
error.
Answer: We minimise the sum of squared errors

e(a, b, c) = (0 − c)² + (2 − a − c)² + (3 − b − c)² + (4 − a − b − c)².

At optimality,

∂e(a, b, c)/∂a = 2(−6 + 2a + b + 2c) = 0,
∂e(a, b, c)/∂b = 2(−7 + a + 2b + 2c) = 0,
∂e(a, b, c)/∂c = 2(−9 + 2a + 2b + 4c) = 0,

and the solution is a = 3/2, b = 5/2, c = 1/4.
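The same coefficients can be obtained with numpy's least-squares solver:

```python
import numpy as np

# Design matrix with columns x1, x2 and a constant term.
X = np.array([[0, 0, 1],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]], dtype=float)
y = np.array([0.0, 2.0, 3.0, 4.0])

(a, b, c), *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round([a, b, c], 4))   # [1.5  2.5  0.25]
```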
2. There are two points marked on the plane: red point A with coordinates (0, 0) and blue point B with coordinates (10, 10). Then 4 points C, D, E, F arrive in order, and each is coloured in the same way as its nearest neighbour. The coordinates of F are (10, 8). Give examples of coordinates of points C, D, E such that point F will be coloured red.

Answer: For example, C may have coordinates (5, 4), D the coordinates (8, 6), and E the coordinates (10, 7). Then

|CA| = √((5 − 0)² + (4 − 0)²) < √((5 − 10)² + (4 − 10)²) = |CB|,

so C is coloured red. Similarly, |DC| = √13 < √20 = |DB| and |ED| = √5 < 3 = |EB|, so D and E are also coloured red. Finally,

|FE| = 1 < 2 = |FB|,

hence F will be coloured the same as E: in red.
3. Consider 4 points A, B, C, D on the plane with coordinates (0, 0), (1, 0), (0, 1), and (2, 1), initially classified such that A and D are red while B and C are blue. What would be the outcome of the K-means algorithm (with K = 2) applied to these initial data?

Answer: The mean MR of the red points is ((0, 0) + (2, 1))/2 = (1, 0.5). The mean MB of the blue points is ((1, 0) + (0, 1))/2 = (0.5, 0.5). Now, for point A,

|AMB| = √((0.5)² + (0.5)²) < √((0.5)² + 1²) = |AMR|,

hence point A becomes blue. By a similar argument, we deduce that C stays blue, B becomes red, and D stays red. After this, the mean MR of the red points is ((1, 0) + (2, 1))/2 = (1.5, 0.5), while the mean MB of the blue points is ((0, 0) + (0, 1))/2 = (0, 0.5). Now, for point A,

|AMB| = √(0² + (0.5)²) < √((0.5)² + (1.5)²) = |AMR|,

hence point A stays blue. By a similar argument, we deduce that all points keep the same colour, and the algorithm terminates. Answer: A and C are blue, B and D are red.
4. A filter should classify e-mails into 3 categories: personal, work, and spam. The statistics show that approximately 30% of e-mails are personal, 50% are work ones, and 20% are spam. They also show that the word “friend” is included in 20% of personal e-mails, 5% of work e-mails, and 30% of spam e-mails. In addition, the word “profit” is included in 5% of personal e-mails, 30% of work e-mails, and 25% of spam e-mails. A new e-mail arrives which contains both words “friend” and “profit”. Use naive Bayes classification to decide whether this e-mail is more likely to be personal, work, or spam.
Answer: Let X be a random variable equal to X = 1, X = 2, or X = 3 if the random e-mail is personal, work, or spam, respectively. Let F be a random variable such that F = 1 if the e-mail contains the word “friend” (and F = 0 otherwise). Similarly, let R be a random variable such that R = 1 if the e-mail contains the word “profit” (and R = 0 otherwise). We need to compare three conditional probabilities

P(X = 1|R = F = 1), P(X = 2|R = F = 1), P(X = 3|R = F = 1),

and select the maximal one. For each i = 1, 2, 3, the naive Bayes estimate for P(X = i|R = F = 1) is

P(X = i|R = F = 1) = P(R = 1|X = i) · P(F = 1|X = i) · P(X = i)/P(R = F = 1).

As we can see, the denominators are the same for all i, so it suffices to compare the numerators. We have

0.05 · 0.20 · 0.3 = 0.003 for personal e-mails, 0.30 · 0.05 · 0.5 = 0.0075 for work e-mails, and 0.25 · 0.30 · 0.2 = 0.015 for spam,

hence the e-mail is most likely to be spam.
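The same comparison in Python (a short check of the arithmetic above):

```python
priors   = {"personal": 0.30, "work": 0.50, "spam": 0.20}
p_friend = {"personal": 0.20, "work": 0.05, "spam": 0.30}
p_profit = {"personal": 0.05, "work": 0.30, "spam": 0.25}

# naive Bayes numerators P(R=1|X) * P(F=1|X) * P(X) for each category
numerators = {c: p_profit[c] * p_friend[c] * priors[c] for c in priors}
print(numerators)                           # 0.003, 0.0075, 0.015
print(max(numerators, key=numerators.get))  # spam
```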