
Lecture Notes

Module MA2404/MA7404
Markov Processes

Edition 3, March 2022

© University of Leicester 2022

All rights reserved. No part of the publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording or otherwise, without the prior written consent of the University of Leicester.

Preface
Welcome to module MA2404/MA7404 Markov Processes.

The module has no formal prerequisites beyond familiarity with the standard mathematics seen in first-year undergraduate courses.

An important thread through all actuarial and financial disciplines is the use
of appropriate models for various types of risks. The aim of this module is to
provide an introduction to risk modelling, with emphasis on Markov models.

We begin the module with a short review of probability theory, basic statistics, and stochastic processes, then study risk models, the theory of Markov processes, and their application to actuarial and financial modelling.

Dr Tetiana Grechuk

Contents

1 Probability theory and stochastic processes
  1.1 Probability space
  1.2 Random variables and their expectations
  1.3 Variance, covariance, and correlation
  1.4 Probability distribution
  1.5 Examples of discrete probability distributions
  1.6 Examples of continuous probability distributions
  1.7 Independence
  1.8 Conditional Probability and Expectation
  1.9 Stochastic processes
  1.10 Summary
  1.11 Questions

2 Claim size estimation in insurance and reinsurance
  2.1 Basic principles of insurance risk modelling
  2.2 Method of moments
  2.3 Method of maximum likelihood
  2.4 Method of percentiles
  2.5 Reinsurance
  2.6 Claim size estimation with excess of loss reinsurance
  2.7 Summary
  2.8 Questions

3 Estimation of aggregate claim distribution
  3.1 The collective risk model
  3.2 The compound Poisson distribution
  3.3 The compound binomial distribution
  3.4 The compound negative binomial distribution
  3.5 Aggregate claim distribution under reinsurance
  3.6 The individual risk model
  3.7 Aggregate claim estimation under uncertainty in parameters
  3.8 Summary
  3.9 Questions

4 Tails and dependence analysis of claims distributions
  4.1 How likely very large claims to occur?
  4.2 The distribution of large claims
  4.3 The distribution of maximal claim
  4.4 Dependence, correlation, and concordance
  4.5 Joint distributions and copulas
  4.6 Dependence of distribution tails
  4.7 Summary
  4.8 Questions

5 Markov Chains
  5.1 The Markov property
  5.2 Definition of Markov Chains
  5.3 The Chapman-Kolmogorov equations
  5.4 Time dependency of Markov chains
  5.5 Further applications
    5.5.1 The simple (unrestricted) random walk
    5.5.2 The restricted random walk
    5.5.3 The modified NCD model
    5.5.4 A model of accident proneness
    5.5.5 A model for credit rating dynamics
    5.5.6 General principles of modelling using Markov chains
  5.6 Stationary distributions
  5.7 The long-term behaviour of Markov chains
  5.8 Summary
  5.9 Questions

6 Markov Jump Processes
  6.1 Poisson process
    6.1.1 Interarrival times
    6.1.2 Compound Poisson process
  6.2 The time-inhomogeneous Markov jump process
  6.3 Transition rates
  6.4 Time-homogeneous Markov jump processes
  6.5 Applications
    6.5.1 Survival model
    6.5.2 Sickness-death model
    6.5.3 Marriage model
  6.6 Summary
  6.7 Questions

7 Machine Learning
  7.1 A motivating example
  7.2 The problems machine learning can solve
  7.3 Models, methods, and techniques
  7.4 Probabilistic analysis
  7.5 Stages of analysis in Machine Learning
  7.6 Summary
  7.7 Questions

8 Solutions of end-of-chapter questions
  8.1 Chapter 1 solutions
  8.2 Chapter 2 solutions
  8.3 Chapter 3 solutions
  8.4 Chapter 4 solutions
  8.5 Chapter 5 solutions
  8.6 Chapter 6 solutions
  8.7 Chapter 7 solutions

The following book has been used as the basis for the lecture notes:

Faculty and Institute of Actuaries, CS2 Core Reading
Chapter 1
Probability theory and stochastic processes
The aim of this chapter is to give a review of probability theory and basic statistics, and a very short introduction to stochastic processes. This background material is necessary for understanding the models that will be developed in later chapters. While a rigorous development of probability theory and stochastic processes can be very technical, we have attempted to avoid unnecessary technicalities in this text. The focus is on understanding the material at an intuitive level, while also having enough technical knowledge to solve quantitative problems when necessary.

1.1 Probability space


Intuitively, a random variable is a function with random values. For example, it may be the lifetime of an individual, or the number of car accidents in the next year. To model such randomness mathematically it is convenient to assume that a random variable is a function “defined somewhere” on the space Ω of all possible “states of the world” ω. The randomness comes from the fact that we do not know exactly which ω we are in.
As the simplest example, consider tossing a fair coin (i.e. the probability of a head is equal to the probability of a tail, so both probabilities are equal to 1/2). Then we can take Ω = {H, T} and P(H) = P(T) = 1/2, where P denotes the probability. In an experiment with throwing a dice, we can take Ω = {1, 2, 3, 4, 5, 6}, and the probability of each outcome is 1/6. In general, any experiment with a finite number of possible outcomes can be studied using the probability space

Ω = {ω1, ω2, . . . , ωn},

with probabilities P(ωi) ≥ 0, i = 1, . . . , n, and Σ_{i=1}^n P(ωi) = 1. Any subset A ⊂ Ω is called an event, and the probability of A is P(A) = Σ_{ω∈A} P(ω). For example, in the experiment with the dice, A = {2, 4, 6} corresponds to the event “dice shows an even number”, and P(A) = P(2) + P(4) + P(6) = 3/6 = 1/2.
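As an illustration, this finite probability space can be written out directly; the following is a minimal Python sketch (the dictionary P and set A are our own illustrative names, not notation from the notes):

P = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}  # probabilities of the six outcomes
A = {2, 4, 6}                                          # event "dice shows an even number"
print(sum(P[w] for w in A))                            # P(A) = 0.5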
However, some random quantities, such as the temperature outside tomorrow at 2pm (measured in some units) or the lifetime of an individual (measured, for example, in years), may take any real number from some interval. To model such random quantities, we need to consider a probability space Ω with arbitrary, possibly infinitely many, elements. In this case, some subsets of Ω have a very complicated structure, and the probability for such subsets cannot even be defined. Hence, we divide the subsets of Ω into two groups: subsets for which we will define the probability (such subsets we will call “events”), and subsets for which the probability is undefined. The set of all events will be denoted F. We assume that F satisfies the following properties:

1. Ω ∈ F

2. If A1, A2, · · · ∈ F, then ∪_{n=1}^∞ An ∈ F and ∩_{n=1}^∞ An ∈ F

3. If A ∈ F then A^c ∈ F

Here, ∪ denotes the union of sets, ∩ denotes the intersection, and A^c = {ω ∈ Ω | ω ∉ A} denotes the complement of the set A.

We now define a probability as a function P : F → [0, 1] which satisfies the following properties:

1. P(Ω) = 1

2. P(∪_{n=1}^∞ An) = Σ_{n=1}^∞ P(An), provided A1, A2, · · · ∈ F is a collection of pairwise disjoint sets, i.e. Ai ∩ Aj = ∅ for all i ≠ j.

Then we call such a P a probability measure and the triple (Ω, F, P) a probability space.
An important example of a probability space is the so-called standard probability space ([0, 1], B, λ), where B is the smallest collection of subsets satisfying the properties 1-3 above and containing all intervals of the type (a, b), (a, b], [a, b), [a, b] ⊂ [0, 1], and λ denotes the Lebesgue measure, that is, the natural extension to the whole of B of the functional which assigns to each interval its length:

λ((a, b]) = b − a.

1.2 Random variables and their expectations


Given a probability space (Ω, F, P) we call a function

ξ : Ω → R

a random variable if it is measurable, i.e. satisfies

{ω : ξ(ω) ≤ r} ∈ F for all r ∈ R.

We may interpret a random variable as a number produced by an experiment, such as the temperature or the oil price tomorrow. The mathematical formalism allows us to be precise. In particular, the property of measurability is important to make sense of the probability of the event that a random variable does not exceed a threshold: P(ξ ≤ r) := P({ω : ξ(ω) ≤ r}).

For example, consider the standard probability space ([0, 1], B, λ), and a function ξ such that ξ(ω) = a for ω ≤ p, and ξ(ω) = b for ω > p, where p ∈ (0, 1) and a ≠ b. Then ξ is a random variable assuming just two different values a and b with probabilities p ∈ (0, 1) and q := 1 − p respectively. It is called a Bernoulli random variable. If a = 1 and b = 0, ξ is called a standard Bernoulli variable.

Another example is the linear function

ξ(ω) = a + (b − a)ω, a < b, (1)

on the standard probability space. This is a random variable whose possible values are all real numbers in the interval [a, b].

The expectation of a random variable is, informally, the average of its possible values weighted according to their probabilities. Sometimes in the literature, the expectation is also called the mathematical expectation or the mean. For example, in the experiment with throwing a dice, the possible outcomes are Ω = {1, 2, 3, 4, 5, 6} with equal probabilities, and the average outcome is (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5. More generally, if the possible values of ξ are x1, x2, . . . , xn, and they occur with probabilities p1, p2, . . . , pn, respectively, then

E[ξ] = Σ_{i=1}^n xi pi. (2)

For example, if ξ is a Bernoulli random variable with outcomes a, b with probabilities p, q = 1 − p, then

E[ξ] = ap + bq = ap + b(1 − p).

Similarly, if ξ can take infinitely many possible values

x1, x2, . . . , xn, . . .

with probabilities

p1, p2, . . . , pn, . . . ,

respectively, where each pn ≥ 0 and Σ_{n=1}^∞ pn = 1, then

E[ξ] = Σ_{i=1}^∞ xi pi. (3)

To calculate the average of possible values for a general random variable (for example, one that may take any value from some interval, such as (1)), the summation in (2) and (3) is replaced by integration. By definition, the expectation of a random variable is its integral over Ω with respect to P, so that

E[ξ] ≡ ∫_Ω ξ dP ≡ ∫_Ω ξ(ω) dP(ω).

In particular, on the standard probability space

E[ξ] = ∫_0^1 ξ(ω) dω.

For example, for ξ defined in (1),

E[ξ] = ∫_0^1 (a + (b − a)ω) dω = a + (b − a)/2 = (a + b)/2.

It is easy to check that for a random variable taking finitely many values this integral reduces to (2). For example, for a Bernoulli random variable,

E[ξ] = ∫_0^1 ξ(ω) dω = ∫_0^p a dω + ∫_p^1 b dω = ap + b(1 − p).
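The identification of random variables with functions on Ω can also be checked numerically. Below is a minimal Monte Carlo sketch in Python (assuming NumPy is available): we draw many points ω uniformly from [0, 1], i.e. from the standard probability space, apply ξ, and average; the sample means approximate the integrals above.

import numpy as np

rng = np.random.default_rng(0)
omega = rng.uniform(0.0, 1.0, size=1_000_000)  # draws from the standard probability space

a, b, p = 1.0, 0.0, 0.3
xi_bernoulli = np.where(omega <= p, a, b)      # xi(omega) = a if omega <= p, else b
print(xi_bernoulli.mean())                     # close to a*p + b*(1-p) = 0.3

a, b = 2.0, 5.0
xi_linear = a + (b - a) * omega                # the linear variable (1)
print(xi_linear.mean())                        # close to (a + b)/2 = 3.5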

For some functions, for example ξ(ω) = 1/ω, 0 < ω ≤ 1, the integral ∫_0^1 ξ(ω) dω is not finite. If E[|ξ|] exists and is finite, then ξ is called integrable. The class of all integrable random variables is denoted by L¹(Ω, F, P) or just L¹.
The key property of the expectation is linearity:

E[aξ + bη] = aE[ξ] + bE[η], for all ξ, η ∈ L¹ and constants a, b ∈ R.

1.3 Variance, covariance, and correlation


The variance of a random variable ξ is defined by

Var(ξ) := E[(ξ − E[ξ])²] ≡ E[ξ²] − (E[ξ])².

If ξ measures some real-life phenomenon, such as the remaining lifetime of an individual, E[ξ] indicates how big ξ is expected to be on average, and may serve as a forecast of how long the individual is expected to survive. The variance measures how big the (square of the) difference ξ − E[ξ] is, and therefore indicates how close the prediction E[ξ] is to reality. Therefore the mean and variance are two fundamental quantities associated with a random variable.
Not every ξ ∈ L¹ has a finite variance. The class of random variables with finite variance is denoted by L²(Ω, F, P) or just L². A random variable with finite variance is called square-integrable. Variance is always non-negative and is equal to zero only for constants.
The square root of the variance, σ(ξ) = √Var(ξ), is called the standard deviation of ξ.
Example 1.1. For ξ defined in (1),

E[ξ²] = ∫_0^1 (a + (b − a)ω)² dω = (a² + ab + b²)/3,

and

σ(ξ) = √Var(ξ) = √((a² + ab + b²)/3 − (a + b)²/4) = (b − a)/√12.

Note that

Var(aξ) = a² Var(ξ), σ(aξ) = |a| σ(ξ), for all ξ ∈ L² and constant a ∈ R.

For two random variables ξ, η ∈ L² their covariance is defined by

Cov(ξ, η) := E[(ξ − E[ξ])(η − E[η])] ≡ E[ξη] − E[ξ]E[η].

For a sum of two random variables we have

Var(ξ + η) = E[((ξ − E[ξ]) + (η − E[η]))²]
= E[(ξ − E[ξ])²] + 2E[(ξ − E[ξ])(η − E[η])] + E[(η − E[η])²]
= Var(ξ) + 2Cov(ξ, η) + Var(η).

Sometimes it is convenient to normalise the covariance. For non-constant random variables ξ and η, define the correlation of ξ and η by

Corr(ξ, η) := Cov(ξ, η) / (√Var(ξ) √Var(η)).

The Cauchy-Schwarz inequality says

(E[ξη])² ≤ E[ξ²]E[η²] for all ξ, η ∈ L²,

and one can easily deduce that

−1 ≤ Corr(ξ, η) ≤ 1 for all ξ, η ∈ L².

The correlation equals 1 if and only if ξ = aη + b for some constants a > 0 and b, and equals −1 if and only if ξ = aη + b with a < 0. If ξ and η are independent, then the covariance Cov(ξ, η) and correlation Corr(ξ, η) are 0. If Corr(ξ, η) > 0, the random variables ξ and η are called positively correlated, and the intuition is that higher values of ξ are an indication of higher values of η. If Corr(ξ, η) < 0, ξ and η are called negatively correlated. For example, if ξ is the temperature outside and η is the number of old people who died on the street, then ξ and η may be positively correlated during the summer (the higher the temperature, the hotter the summer, and old people may be affected by very hot weather) but negatively correlated during the winter (the higher the temperature, the less cold the winter).
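A small simulation illustrates these definitions; the following Python sketch (assuming NumPy; the coefficient 0.5 is an arbitrary illustrative choice) builds a positively correlated pair and computes covariance and correlation directly from the formulas above.

import numpy as np

rng = np.random.default_rng(1)
xi = rng.normal(size=1_000_000)
eta = 0.5 * xi + rng.normal(size=1_000_000)           # eta depends positively on xi

cov = np.mean((xi - xi.mean()) * (eta - eta.mean()))  # Cov(xi, eta), close to 0.5
corr = cov / (xi.std() * eta.std())                   # close to 0.5/sqrt(1.25) ≈ 0.447
print(cov, corr)                                      # the correlation lies in [-1, 1]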

1.4 Probability distribution


Given a random variable ξ we define its cumulative distribution function (or CDF for short) by

Fξ(x) := P(ξ ≤ x).

Clearly the CDF is non-decreasing in x, and lim_{x→−∞} Fξ(x) = 0, lim_{x→+∞} Fξ(x) = 1.

A random variable ξ is called discretely distributed if it can take values x1, x2, . . . , xn, . . . with probabilities p1, p2, . . . , pn, . . . , respectively. The CDF of such a random variable is a piecewise constant function.

Example 1.2. An n-sided dice has the numbers 1, 2, . . . , n on its sides; each can be shown on its upper surface with the same probability 1/n when the dice is thrown or rolled. If ξ is the corresponding random variable, it is discretely distributed, and its CDF Fξ(x) is a piecewise constant function given by

Fξ(x) = 0 if x < 1; Fξ(x) = i/n if i ≤ x < i + 1, i = 1, 2, . . . , n − 1; Fξ(x) = 1 if n ≤ x.

By (2),

E[ξ] = (1/n) Σ_{i=1}^n i = (1/n) · n(n + 1)/2 = (n + 1)/2,

and

E[ξ²] = (1/n) Σ_{i=1}^n i² = (1/n) · n(n + 1)(2n + 1)/6 = (n + 1)(2n + 1)/6,

hence

Var[ξ] = E[ξ²] − (E[ξ])² = (n² − 1)/12. (4)
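Formula (4) is easy to verify by simulation; here is a minimal sketch (Python with NumPy, n = 6 chosen to match the ordinary dice):

import numpy as np

n = 6
rng = np.random.default_rng(2)
rolls = rng.integers(1, n + 1, size=1_000_000)  # throws of the n-sided dice
print(rolls.mean(), (n + 1) / 2)                # both close to 3.5
print(rolls.var(), (n**2 - 1) / 12)             # both close to 35/12 ≈ 2.917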

A random variable ξ is called continuously distributed if its CDF can be represented as

Fξ(x) = ∫_{−∞}^x ρξ(z) dz,

for some non-negative function ρξ, which is called the probability density function (or PDF for short) of ξ.
The expectation of a random variable can be calculated using its CDF or its PDF (if the latter exists) by

E[ξ] = ∫_{−∞}^∞ x dFξ(x) = ∫_{−∞}^∞ x ρξ(x) dx.

More generally, if for some random variable ξ and function f : R → R the expectation E[f(ξ)] exists, then

E[f(ξ)] = ∫_{−∞}^∞ f(x) dFξ(x) = ∫_{−∞}^∞ f(x) ρξ(x) dx.

In particular, for the variance we have

Var(ξ) = ∫_{−∞}^∞ (x − E[ξ])² dFξ(x) = ∫_{−∞}^∞ (x − E[ξ])² ρξ(x) dx.

Example 1.3. For ξ defined in (1),

Fξ(x) = P(ξ ≤ x) = 0 if x < a,

Fξ(x) = P(ξ ≤ x) = λ(ω : a + (b − a)ω ≤ x) = (x − a)/(b − a) if a ≤ x ≤ b,

where λ denotes the length of the interval, and

Fξ(x) = P(ξ ≤ x) = 1 if b < x.

It is easy to check that

Fξ(x) = ∫_{−∞}^x ρξ(z) dz

for the function ρξ(z) = 1/(b − a), z ∈ [a, b] (and ρξ(z) = 0, z ∉ [a, b]). Hence, ξ is continuously distributed with PDF ρξ. In fact, a random variable ξ with this density is called uniformly distributed on [a, b]. In this case we usually write ξ ∼ U(a, b).
We have

E[ξ] = ∫_{−∞}^∞ x ρξ(x) dx = ∫_a^b x/(b − a) dx = (a + b)/2,

and

Var(ξ) = ∫_{−∞}^∞ (x − E[ξ])² ρξ(x) dx = ∫_a^b (x − E[ξ])²/(b − a) dx = (b − a)²/12,

which coincides with the formulas obtained in Example 1.1.


Note that the random variable η(ω) = b + (a − b)ω has the same CDF (and PDF) as the random variable ξ in Example 1.3, despite the fact that ξ ≠ η. If random variables ξ and η have identical CDFs Fξ = Fη, we say that they are identically distributed (i.d.). It can be shown that the random variables ξ and η are identically distributed if and only if E[f(ξ)] = E[f(η)] for every function f : R → R for which these expectations exist.
The inverse function to Fξ(x), defined as

Fξ⁻¹(α) = inf{x | Fξ(x) > α}, 0 < α < 1,

is called the quantile function of ξ, and Fξ⁻¹(α) is the α-quantile of ξ. For example, the random variable η(ω) = b + (a − b)ω has the CDF described in Example 1.3, and its inverse can be found by solving the equation (x − a)/(b − a) = α, resulting in x = a + (b − a)α. Hence, Fη⁻¹(α) = a + (b − a)α.
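A quick numerical check of this quantile formula (a Python sketch assuming NumPy; a = 2, b = 5 are arbitrary illustrative values):

import numpy as np

a, b = 2.0, 5.0
rng = np.random.default_rng(3)
eta = b + (a - b) * rng.uniform(size=1_000_000)          # eta(omega) = b + (a - b)omega
for alpha in (0.1, 0.5, 0.9):
    print(np.quantile(eta, alpha), a + (b - a) * alpha)  # empirical quantile vs formula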
For several random variables ξ1, . . . , ξd their joint CDF is defined as

Fξ1,...,ξd(x1, . . . , xd) = P[ξ1 ≤ x1, ξ2 ≤ x2, . . . , ξd ≤ xd].

If it can be represented as

Fξ1,...,ξd(x1, . . . , xd) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xd} ρξ1,...,ξd(z1, . . . , zd) dz1 . . . dzd,

for some non-negative function ρξ1,...,ξd(z1, . . . , zd), the latter is called the joint PDF of ξ1, . . . , ξd.
If for some random variables ξ1, . . . , ξd and function f : R^d → R the expectation E[f(ξ1, . . . , ξd)] exists, then

E[f(ξ1, . . . , ξd)] = ∫_{R^d} f(x1, . . . , xd) ρξ1,...,ξd(x1, . . . , xd) dx1 . . . dxd.

1.5 Examples of discrete probability distributions


One example of a discrete probability distribution was studied in Example 1.2. This section provides more examples, with emphasis on ones useful in insurance modelling. The most common use of discrete probability distributions in insurance is to count the number of claims. The following discrete distributions are often used for this (a short simulation sketch checking their means and variances appears after the list).

• ξ is a Bernoulli variable if it takes values a and b with probabilities p and 1 − p, respectively, where a and b are arbitrary real numbers and p ∈ [0, 1]. By (2),

E[ξ] = ap + b(1 − p), E[ξ²] = a²p + b²(1 − p),

and

Var[ξ] = E[ξ²] − (E[ξ])² = (a − b)²p(1 − p).

If a = 1 and b = 0, ξ is called a standard Bernoulli variable. In this case,

E[ξ] = p, Var[ξ] = p(1 − p).

A standard Bernoulli variable is used to model the situation when at most one claim can happen within a policy, and the probability of this claim is p.

• ξ follows the Binomial distribution with parameters n (non-negative integer) and p (real number in [0, 1]) if it takes values 0, 1, 2, . . . , n with probabilities

p_k = n!/(k!(n − k)!) · p^k (1 − p)^{n−k}, k = 0, 1, . . . , n.

In this case, we write ξ ∼ Bin(n, p). For example, if n = 1, then

p_0 = 1!/(0!(1 − 0)!) · p^0 (1 − p)^{1−0} = 1 − p, p_1 = 1!/(1!(1 − 1)!) · p^1 (1 − p)^{1−1} = p,
hence in this case ξ is just a standard Bernoulli variable. In general, if ξ1, . . . , ξn is a sequence of i.i.d. standard Bernoulli variables with parameter p, then

ξ = ξ1 + ξ2 + · · · + ξn

has the binomial distribution Bin(n, p). In particular,

E[ξ] = Σ_{k=1}^n E[ξ_k] = Σ_{k=1}^n p = np,

and

Var[ξ] = Σ_{k=1}^n Var[ξ_k] = Σ_{k=1}^n p(1 − p) = np(1 − p).

The Binomial distribution arises when there are n independent policies such that each can produce a claim with the same probability p. Then the total number of claims from all policies is Bin(n, p).

• ξ follows the geometric distribution with parameter p, 0 ≤ p ≤ 1, if it takes values 0, 1, 2, . . . with probabilities

p_n = (1 − p)^n p, n = 0, 1, 2, . . .

If ξ1, ξ2, . . . is an infinite sequence of i.i.d. standard Bernoulli variables with parameter p, then the number of 0-s in this sequence before the first 1 follows the geometric distribution. This is a valid distribution because

Σ_{n=0}^∞ p_n = p Σ_{n=0}^∞ (1 − p)^n = p · 1/(1 − (1 − p)) = 1.
Differentiating both sides of the equation Σ_{n=0}^∞ (1 − p)^n = 1/p with respect to p, we get

−Σ_{n=1}^∞ n(1 − p)^{n−1} = −1/p².

This can be used to calculate the expectation of the geometric distribution:

E[ξ] = Σ_{n=1}^∞ n · p_n = p(1 − p) Σ_{n=1}^∞ n(1 − p)^{n−1} = p(1 − p) · 1/p² = (1 − p)/p.

By a similar (but more involved) calculation, we can get

Var[ξ] = (1 − p)/p².

• ξ follows the negative Binomial distribution with parameters k (positive integer) and p (real number in [0, 1]) if it takes values 0, 1, 2, . . . with probabilities

p_n = (k + n − 1)!/(n!(k − 1)!) · p^k (1 − p)^n, n = 0, 1, 2, . . .

In this case, we write ξ ∼ NB(k, p). For example, if k = 1, then

p_n = (1 + n − 1)!/(n!(1 − 1)!) · p^1 (1 − p)^n = p(1 − p)^n, n = 0, 1, 2, . . . ,

hence NB(1, p) is just the geometric distribution. In general, if ξ1, . . . , ξ_k is a sequence of i.i.d. geometric variables with parameter p, then

ξ = ξ1 + ξ2 + · · · + ξ_k

has the negative binomial distribution NB(k, p). In particular,

E[ξ] = Σ_{i=1}^k E[ξ_i] = Σ_{i=1}^k (1 − p)/p = k(1 − p)/p,

and

Var[ξ] = Σ_{i=1}^k Var[ξ_i] = Σ_{i=1}^k (1 − p)/p² = k(1 − p)/p².

The negative Binomial distribution can serve as a model for the total number of claims if this number is not bounded from above.
• ξ follows the Poisson distribution with parameter λ > 0 if it takes values 0, 1, 2, . . . with probabilities

p_n = λ^n e^{−λ}/n!.

Using the series expansion for the exponential function, e^x = Σ_{n=0}^∞ x^n/n!, one can easily compute the expectation of the Poisson distribution:

E[ξ] = Σ_{n=1}^∞ n · p_n = Σ_{n=1}^∞ n λ^n e^{−λ}/n! = λe^{−λ} Σ_{n=1}^∞ λ^{n−1}/(n − 1)! = λe^{−λ} e^λ = λ.

A similar calculation shows that E[ξ²] = λ² + λ, hence

Var[ξ] = E[ξ²] − (E[ξ])² = (λ² + λ) − λ² = λ.

The Poisson distribution is a natural model for the total number of claims if the claims arrive at a uniform rate of λ claims per unit of time. We will study this model in detail later in this course.
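The means and variances listed above can all be checked empirically. A minimal sketch (Python with NumPy; note that NumPy's geometric sampler counts the trial on which the first success occurs, so we subtract 1 to match the convention of counting failures used here; the parameter values are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
N = 1_000_000

n, p = 10, 0.3
binom = rng.binomial(n, p, size=N)
print(binom.mean(), n * p, binom.var(), n * p * (1 - p))     # Bin(n, p)

geom = rng.geometric(p, size=N) - 1                          # failures before first success
print(geom.mean(), (1 - p) / p, geom.var(), (1 - p) / p**2)  # geometric

lam = 2.5
pois = rng.poisson(lam, size=N)
print(pois.mean(), lam, pois.var(), lam)                     # Poisson: mean = variance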

1.6 Examples of continuous probability distributions
One example (the uniform probability distribution) was studied in Example 1.3. This section provides more examples, with emphasis on ones useful in insurance modelling.

Example 1.4. Question: We say that a random variable ξ has the exponential distribution with parameter λ > 0, and write ξ ∼ Exp(λ), if its probability density function is

ρξ(x) = λe^{−λx} for x ≥ 0, (5)

and ρξ(x) = 0 for x < 0. Calculate Fξ(x), Fξ⁻¹(α), E[ξ] and Var(ξ).

Answer: For any x ≥ 0,

Fξ(x) = ∫_{−∞}^x ρξ(z) dz = ∫_0^x λe^{−λz} dz = 1 − e^{−λx}.

The equation 1 − e^{−λx} = α has the solution x = −ln(1 − α)/λ, hence Fξ⁻¹(α) = −ln(1 − α)/λ.

The expectation is calculated as

E[ξ] = ∫_0^∞ xλe^{−λx} dx = 1/λ.

The variance is calculated as

Var(ξ) = ∫_0^∞ (x − 1/λ)² λe^{−λx} dx = 1/λ².

Some elementary calculus was required in each case.
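The quantile function just derived gives a standard way to simulate Exp(λ): if U ∼ U(0, 1), then Fξ⁻¹(U) = −ln(1 − U)/λ has the Exp(λ) distribution (the inverse transform method). A minimal sketch (Python with NumPy; λ = 2 is an arbitrary illustrative value):

import numpy as np

lam = 2.0
rng = np.random.default_rng(5)
u = rng.uniform(size=1_000_000)
xi = -np.log(1 - u) / lam   # F^{-1}(u) = -ln(1 - u)/lambda, so xi ~ Exp(lambda)
print(xi.mean(), 1 / lam)   # both close to 0.5
print(xi.var(), 1 / lam**2) # both close to 0.25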

Other examples of important probability distributions include the following (a simulation sketch for the Pareto case appears after the list):


• ξ has a normal distribution with parameters µ and σ² if it has PDF

ρξ(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}. (6)

In that case we write ξ ∼ N(µ, σ²). The mean and variance of ξ are E[ξ] = µ and Var(ξ) = σ².

• ξ has a lognormal distribution if log(ξ) has a normal distribution, that is, log(ξ) ∼ N(µ, σ²). In this case, we write ξ ∼ LogN(µ, σ²). Equivalently, ξ has a lognormal distribution if its PDF is

ρξ(x) = (1/(x√(2πσ²))) e^{−(log(x)−µ)²/(2σ²)}, x > 0. (7)

The mean and variance of the lognormal distribution are

E[ξ] = e^{µ+σ²/2}, Var(ξ) = (e^{σ²} − 1)e^{2µ+σ²}.

• ξ has a gamma distribution with parameters α > 0 and λ > 0 if it has PDF

ρξ(x) = (λ^α/Γ(α)) x^{α−1} e^{−λx}, x > 0, (8)

where Γ(α) = ∫_0^∞ x^{α−1}e^{−x} dx, α > 0, is the gamma function. In this case we write ξ ∼ Ga(α, λ). The mean and variance of ξ are

E[ξ] = α/λ, Var(ξ) = α/λ².
• ξ has the Pareto distribution with parameters α > 0 and λ > 0 if it has PDF

ρξ(x) = αλ^α/(λ + x)^{α+1}, x > 0. (9)

In this case we write ξ ∼ Pa(α, λ). The Pareto distribution has CDF

Fξ(x) = ∫_{−∞}^x ρξ(z) dz = ∫_0^x αλ^α/(λ + z)^{α+1} dz = 1 − (λ/(λ + x))^α.

By solving the equation 1 − (λ/(λ + x))^α = β, we find its quantile function

Fξ⁻¹(β) = λ((1 − β)^{−1/α} − 1), 0 < β < 1.

The mean of ξ is

E[ξ] = λ/(α − 1), α > 1,

and is infinite if α ≤ 1. The variance of ξ is

Var(ξ) = αλ²/((α − 1)²(α − 2)), α > 2,

and is infinite if α ≤ 2.
• ξ has the Burr distribution with parameters α > 0, λ > 0, and γ > 0 if it has PDF

ρξ(x) = γαλ^α x^{γ−1}/(λ + x^γ)^{α+1}, x > 0. (10)

In this case we write ξ ∼ Burr(α, λ, γ). The Burr distribution has CDF

Fξ(x) = 1 − (λ/(λ + x^γ))^α, x > 0.

By solving the equation 1 − (λ/(λ + x^γ))^α = β, we find the quantile function of the Burr distribution

Fξ⁻¹(β) = (λ(1 − β)^{−1/α} − λ)^{1/γ}, 0 < β < 1.

The Pareto distribution is the special case of the Burr distribution with γ = 1.
• ξ has the generalized Pareto distribution with parameters α > 0, δ > 0, and k > 0 if it has PDF

ρξ(x) = (Γ(α + k)/(Γ(α)Γ(k))) · δ^α x^{k−1}/(δ + x)^{α+k}, x > 0. (11)

The mean of ξ is

E[ξ] = δΓ(α − 1)Γ(k + 1)/(Γ(α)Γ(k)), α > 1,

and is infinite if α ≤ 1. The variance exists if α > 2, and is equal to E[ξ²] − (E[ξ])², where

E[ξ²] = δ²Γ(α − 2)Γ(k + 2)/(Γ(α)Γ(k)), α > 2.
• ξ has the Weibull distribution with parameters c > 0 and γ > 0 if it has PDF

ρξ(x) = cγx^{γ−1} e^{−cx^γ}, x > 0. (12)

In this case we write ξ ∼ W(c, γ). The Weibull distribution has CDF

Fξ(x) = 1 − e^{−cx^γ}, x > 0.

If 0 < γ < 1, the upper tail of the Weibull distribution, P[ξ > x] = e^{−cx^γ}, decays faster than that of the Pareto distribution (for which P[ξ > x] = (λ/(λ + x))^α) but slower than that of the exponential distribution (for which P[ξ > x] = e^{−λx}). This makes the Weibull distribution very flexible, and it is extensively used as a model for losses in insurance.

By solving the equation 1 − e^{−cx^γ} = α, we find the quantile function

Fξ⁻¹(α) = (−log(1 − α)/c)^{1/γ}, 0 < α < 1.

The mean of ξ is

E[ξ] = c^{−1/γ} Γ((1 + γ)/γ).

The variance is E[ξ²] − (E[ξ])², where

E[ξ²] = c^{−2/γ} Γ((2 + γ)/γ).
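The inverse transform idea used for the exponential distribution in Example 1.4 works for any of the distributions above whose quantile function is explicit. A minimal sketch for the Pareto case (Python with NumPy; α = 3, λ = 2 are arbitrary values with α > 2, so both mean and variance exist):

import numpy as np

alpha, lam = 3.0, 2.0
rng = np.random.default_rng(6)
u = rng.uniform(size=1_000_000)
xi = lam * ((1 - u) ** (-1 / alpha) - 1)  # Pareto quantile function applied to uniform draws
print(xi.mean(), lam / (alpha - 1))       # both close to 1.0
print(xi.var(), alpha * lam**2 / ((alpha - 1)**2 * (alpha - 2)))  # both close to 3.0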

1.7 Independence
The notion of independence is one of the most important in probability theory. Intuitively we would like to call two events or random variables independent if there is no mutual dependency. For example, if we toss a coin (or roll a die) twice, the outcomes are, intuitively, independent of each other.
Two events A and B are called independent if

P (A ∩ B) = P (A)P (B).

This property is consistent with the intuition of independence.


For random variables we have the following definition.

Definition 1.1. Random variables ξ and η are called independent if the events

A = {ω ∈ Ω : a < ξ(ω) < b},
B = {ω ∈ Ω : c < η(ω) < d}

are independent for all real numbers a, b, c, d.

However, it is not sufficient to define independence only for pairs of events or


random variables. It is possible that all the pairs of events (A, B), (A, C) and
(B, C) are independent, but the triple (A, B, C) is not mutually independent.

Example 1.5. Consider a non-traditional dice with four faces. We number the first three faces with the numbers 1, 2 and 3 respectively, and on the fourth face we put all three numbers 1, 2 and 3. Now let us throw the dice, and let

A be the event that 1 is on the face down;
B be the event that 2 is on the face down;
C be the event that 3 is on the face down.

Simple logic says that A depends on (B, C), since if B and C happen simultaneously then A happens too with probability one. But one can easily check that all the pairs (A, B), (A, C) and (B, C) are independent.

Definition 1.2. A collection of events is called mutually independent if for every finite subset A1, . . . , An of the collection we have

P(A1 ∩ · · · ∩ An) = Π_{i=1}^n P(Ai).

Similarly, we may define mutual independence for random variables.

Definition 1.3. Random variables ξ1, . . . , ξn are called mutually independent if the events

A1 = {ω ∈ Ω : a1 < ξ1(ω) < b1},
. . . ,
An = {ω ∈ Ω : an < ξn(ω) < bn}

are mutually independent for all real numbers {a_k, b_k}_{k=1}^n. An infinite set of random variables {ξ_α} is called mutually independent if any finite subset {ξ_{α1}, . . . , ξ_{αn}} is mutually independent.

This definition can be reformulated in terms of the joint probability distribution. The random variables ξ1, . . . , ξd are mutually independent if and only if their joint distribution (i.e. the distribution of the random vector (ξ1, . . . , ξd)) is the product of the distribution functions of the ξ_k's:

Fξ1,...,ξd(x1, . . . , xd) = Π_{k=1}^d Fξ_k(x_k).

If the variables ξ1, . . . , ξd have probability densities, then they are mutually independent if and only if the random vector (ξ1, . . . , ξd) also has a density, which can be factorised as

ρξ1,...,ξd(x1, . . . , xd) = Π_{k=1}^d ρξ_k(x_k).
For independent random variables ξ and η we have

E[f(ξ)g(η)] = E[f(ξ)]E[g(η)]

for any functions f and g : R → R (such that f(ξ) and g(η) ∈ L¹). Indeed,

E[f(ξ)g(η)] = ∫_{R²} f(x)g(y)ρ_{ξ,η}(x, y) dx dy = ∫_{R²} f(x)g(y)ρξ(x)ρη(y) dx dy
= ∫_{−∞}^{+∞} f(x)ρξ(x) dx ∫_{−∞}^{+∞} g(y)ρη(y) dy = E[f(ξ)]E[g(η)].

More generally, for mutually independent ξ1, . . . , ξn we have

E[Π_{k=1}^n f_k(ξ_k)] = Π_{k=1}^n E[f_k(ξ_k)]

for any functions f_k : R → R (such that f_k(ξ_k) ∈ L¹).


This is a very convenient property. For instance, it allows us to prove that

Cov(ξ, η) := E[(ξ − E[ξ])(η − E[η])] = E[ξ − E[ξ]] E[η − E[η]] = 0

for independent ξ and η. So the covariance could serve as a proxy measure for a degree of independence: the closer the covariance (or correlation) is to zero, the less dependent are the random variables. However, it is not a very good measure, since there exist dependent random variables with zero covariance.
For any random variables ξ, η ∈ L² such that Cov(ξ, η) = 0 (so, in particular, for any independent ξ and η) we have

Var(ξ + η) = Var(ξ) + 2Cov(ξ, η) + Var(η) = Var(ξ) + Var(η).

More generally,

Var(Σ_{k=1}^n ξ_k) = Σ_{k=1}^n Var(ξ_k)

for all ξ1, . . . , ξn ∈ L² provided Cov(ξ_j, ξ_k) = 0 for all j ≠ k (in particular, if the random variables are mutually independent). Random variables with zero covariance are called uncorrelated.
Note that random variables that are independent and identically distributed are often denoted i.i.d.
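A small simulation illustrating both facts at once, namely that independent variables are uncorrelated and that the variance of their sum is the sum of the variances (a Python sketch assuming NumPy; the two distributions are arbitrary choices):

import numpy as np

rng = np.random.default_rng(7)
xi = rng.exponential(2.0, size=1_000_000)          # independent of eta by construction
eta = rng.uniform(0.0, 1.0, size=1_000_000)
print(np.cov(xi, eta)[0, 1])                       # close to 0
print(np.var(xi + eta), np.var(xi) + np.var(eta))  # the two values nearly coincide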

1.8 Conditional Probability and Expectation


Sometimes, while estimating the probability of some event A, we can use the
information that some other event B happened.

Example 1.6. An insurance company, which provides regular payments while a client is ill and unable to work, needs to estimate the probability of the event A = {a new customer will be ill during the next year}. It has statistics according to which, out of 100 current customers, 20 were ill at least once during the last year (10 of them smoke) and 80 were not ill during the last year (20 of them smoke). Based on this, the company can estimate P(A) = 20/100 = 0.2 (the ratio of the number of customers that were ill to the total number of customers). However, if it knows that the new customer smokes, it can estimate the probability in question as 10/(10 + 20) = 1/3 > 0.2 (the ratio of the number of smokers that were ill to the total number of smokers).

In general, the probability of event A, when event B is known to occur, is called the conditional probability of A given B, denoted P[A|B], and can be evaluated as

P[A|B] = P[A ∩ B]/P[B].

In the example above, with B = {a customer is a smoker}, P(A ∩ B) = 10/100 = 0.1, P(B) = 30/100 = 0.3, and P[A|B] = P(A ∩ B)/P(B) = 0.1/0.3 = 1/3.
Events A and B are independent if and only if

P[A|B] = P[A ∩ B]/P[B] = P[A]P[B]/P[B] = P[A].

Intuitively, this means that information about event B does not have any influence on the probability of the event A.
Knowing P[A], P[B], and P[A|B], one can estimate P[B|A] as follows:

P[B|A] = P[A ∩ B]/P[A] = P[A|B]P[B]/P[A],

which is called Bayes' theorem.
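For instance, with the numbers of Example 1.6, where P[A|B] = 1/3, P[B] = 0.3 and P[A] = 0.2, Bayes' theorem gives P[B|A] = ((1/3) · 0.3)/0.2 = 0.5: half of the customers who fall ill are smokers.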
If {Bn, n = 1, 2, . . . } is a finite or countably infinite partition of Ω (that is, Bi ∩ Bj = ∅ for i ≠ j, and ∪_n Bn = Ω), then, for any event A,

P[A] = Σ_n P[A ∩ Bn] = Σ_n P[A|Bn]P[Bn]. (13)

The last relation is called the law of total probability.

Example 1.7. A car insurance company, which classifies drivers as “new” and “experienced”, wants to estimate the probability of the event A = {a randomly selected driver will cause a car accident during the next year}. From its past data, it estimates that the probability of A is 0.3 for “new” drivers, and 0.05 for “experienced” ones. If 40% of the current customers are new, then

P[A] = P[A|N]P[N] + P[A|E]P[E] = 0.3 · 0.4 + 0.05 · 0.6 = 0.15

(here N and E are the events that the driver is “new” and “experienced”, respectively).

A continuous version of the law of total probability (13) is

p_X(x) = ∫ p_{X,Y}(x, y) dy = ∫ p_{X|Y}(x | y) p_Y(y) dy, (14)

where p_X(x) and p_Y(y) are the densities of X and Y, p_{X,Y}(x, y) is the joint density of X and Y, and p_{X|Y}(x | y) := p_{X,Y}(x, y)/p_Y(y) is called the conditional density of X given Y.

Let A be any event with positive probability. Let I_A be the indicator function of A, that is, I_A(ω) = 1 if ω ∈ A and I_A(ω) = 0 otherwise. The conditional expectation of a random variable X given A is denoted E(X|A) and defined as

E(X|A) := E(I_A X)/P(A).

Intuitively, E(X|A) is the average value of X given that event A happened.

For example, if Y is a discrete random variable which takes some value y with non-zero probability, then

E(X|Y = y) := E(I_{Y=y} X)/P(Y = y).

We also define E(X|Y) as a random variable such that

E(X|Y)(ω) = E(X|Y = Y(ω)), ∀ω ∈ Ω.

Fundamental formulas involving conditional expectation are the law of total expectation

E(X) = E[E(X|Y)], (15)

and the law of total variance

Var(X) = E[Var(X|Y)] + Var(E(X|Y)), (16)

where

Var(X|Y) = E(X²|Y) − (E(X|Y))².

Example 1.8. Let X and Y be random variables taking values 0 and 1, such that

P[X = Y = 1] = P[X = Y = 0] = 0.4

and

P[X = 1, Y = 0] = P[X = 0, Y = 1] = 0.1.

Then P[X = 0] = P[X = 1] = P[Y = 0] = P[Y = 1] = 0.5, hence

E(X) = E(Y) = 0 · 0.5 + 1 · 0.5 = 0.5

and

Var(X) = Var(Y) = (0² · 0.5 + 1² · 0.5) − (0.5)² = 0.25.

Now assume that we know that Y = 0. Then

P[X = 0|Y = 0] = P[X = Y = 0]/P[Y = 0] = 0.4/0.5 = 0.8,

P[X = 1|Y = 0] = P[X = 1, Y = 0]/P[Y = 0] = 0.1/0.5 = 0.2,

and

E[X|Y = 0] = 0 · P[X = 0|Y = 0] + 1 · P[X = 1|Y = 0] = 0.2.

Similarly,

E[X|Y = 1] = 0 · P[X = 0|Y = 1] + 1 · P[X = 1|Y = 1] = 0.8.

Hence,

E(E[X|Y]) = 0.2 · 0.5 + 0.8 · 0.5 = 0.5 = E[X],

in agreement with the law of total expectation (15).

Similarly we can calculate that

Var(X|Y = 0) = E(X²|Y = 0) − (E(X|Y = 0))² = 0.2 − 0.2² = 0.16,

Var(X|Y = 1) = E(X²|Y = 1) − (E(X|Y = 1))² = 0.8 − 0.8² = 0.16,

hence

E[Var(X|Y)] = 0.16.

Also,

Var(E[X|Y]) = 0.2² · 0.5 + 0.8² · 0.5 − (0.2 · 0.5 + 0.8 · 0.5)² = 0.09,

hence

E[Var(X|Y)] + Var(E(X|Y)) = 0.16 + 0.09 = 0.25 = Var(X),

in agreement with the law of total variance (16).
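Both laws can also be verified by simulating the joint distribution of Example 1.8. A minimal sketch (Python with NumPy):

import numpy as np

rng = np.random.default_rng(8)
pairs = np.array([(1, 1), (0, 0), (1, 0), (0, 1)])   # possible values of (X, Y)
probs = np.array([0.4, 0.4, 0.1, 0.1])               # their probabilities
idx = rng.choice(4, size=1_000_000, p=probs)
X, Y = pairs[idx, 0], pairs[idx, 1]

print(X[Y == 0].mean(), X[Y == 1].mean())  # close to E[X|Y=0] = 0.2, E[X|Y=1] = 0.8
cond_mean = np.where(Y == 0, 0.2, 0.8)     # the random variable E(X|Y)
print(cond_mean.mean())                    # close to E[E(X|Y)] = E[X] = 0.5
print(0.16 + cond_mean.var())              # E[Var(X|Y)] + Var(E(X|Y)), close to 0.25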

1.9 Stochastic processes


A random variable is a suitable model for a number produced by a single experiment at a specified moment of time, for example “the temperature tomorrow at 14.00”. If we are interested in how the temperature changes over time, we actually need to consider a whole collection of random variables {X_t : t ∈ T}, where X_t is the (random) temperature at a specified moment t, and T is the set of all times we are interested in. We will call such a family of random variables a stochastic process (or random process).
Formally, a stochastic process is a family of random variables {X_t : t ∈ T}, where T is an arbitrary index set. For example, any random variable is a stochastic process with a one-element set T. But typically the parameter t represents time, and the most common examples of T are T = {0, 1, 2, . . . } and T = R (or [0, ∞)). In the first case the stochastic process is called discrete time, and is actually just a sequence of random variables; in the second case the random process is called continuous time.
There are many classifications of stochastic processes. One of the most basic
is to classify them with respect to the time (index) set T and with respect to
the state space. By definition the state space S is the set of possible values
of a random process Xt .

Discrete state spaces with discrete time changes. Most typically T is


{0, 1, 2, . . . } in this case and the state space S is a discrete set. For example,
a motor insurance company reviews the status of each customer yearly with
respect to three possible levels of discount S = {0, 10%, 25%}.
It is not necessary that S should be a set of numbers. For example, it may be
the credit rating, or information from an applicant form. Typical examples
of discrete state processes with discrete time change are Markov chains which
are discussed later in this course.

Discrete state spaces with continuous time changes. In this case


typically T = [0, ∞) and the state space S is a discrete set. For instance, an
individual insurance policy holder can be classified as healthy, sick or dead.
So S = {healthy, sick, dead}. It is natural to take the time set as continuous
as illness or death can happen at any time.
An important special case of this class are so-called counting processes. A pro-
cess (Nt )t∈[0,∞) is counting, if it is increasing and takes values in {0, 1, 2, . . . }.
For instance, it can be the cumulative number of claims reported at random
times.

Continuous state spaces with continuous time changes. Typically in


this case T is [0, ∞) or R and S = [0, ∞), (0, ∞) or R. For instance, it is
natural to consider the exchange rate GBP/USD as a random process with
the state space (0, ∞) and continuous time.

Continuous state spaces with discrete time changes. The typical example of these is when an essentially continuously valued process such as price or temperature is measured only at certain time intervals (days, months, quarters, years). For example, if we do not care about intra-day changes of the GBP/USD exchange rate, then we can consider it as a discrete-time process {X0, X1, X2, . . . } where X_i indicates the exchange rate on the morning of the i-th day.

Mixed type processes. There are special types of continuous-time pro-


cesses, with continuous or discrete state spaces, which have some specifically
structured changes at predetermined times. For example, the market price
of a coupon-paying bond changes at the deterministic times of the coupon
payments, but it also changes randomly all the time before its maturity, due
to the current situation in the market.

For every real-life process to be analysed, it is important to establish whether


it is most usefully modelled using a discrete, a continuous, or a mixed time
domain. Usually the choice of state space will be clear from the nature of the
process being studied (as, for example, with the Healthy-Sick-Dead model),
but whether a continuous or discrete time set is used will often depend on
the specific aspects of the process which are of interest, and upon practical
issues like the time points for which data are available. Continuous time
and continuous state space stochastic processes, although conceptually more
difficult than discrete ones, are often more convenient to deal with, in the
same way as it is often easier to calculate an integral than to sum an infinite
series.
Sample paths. To determine a particular value for a general stochastic process {X_t : t ∈ T}, we need to specify a time t ∈ T and a particular realization ω ∈ Ω. From this perspective we can interpret a stochastic process as a function of two variables: t and ω. If we fix a particular state of nature ω ∈ Ω, we get a particular realization of the stochastic process, which is a deterministic function from T to the state space S. This function is called a sample path of the process.
For example, consider the exchange rate GBP/USD during a particular
month. In advance, it is hard to predict the exact exchange rate at every
moment of time during this month, so it is natural to model it as a random
process. During the month, we can observe a realization of this process as
a function of time, which is an example of a sample path. If we use the
model of continuous time process, a sample path is a continuous function,
defined for every moment t during the month. If we are interested only in
exchange rates at (say) 9.00 every day, the suitable model is a discrete-time
process, and the sample path is just a sequence of numbers. In this case we will sometimes refer to it as the sample sequence of the discrete-time process.

Describing a stochastic process. To describe the stochastic process {X_t : t ∈ T}, we need to specify the joint distributions of X_{t1}, X_{t2}, . . . , X_{tn} for all t1, t2, . . . , tn in T and all integers n. The collection of the joint distributions above is called the family of finite dimensional probability distributions (f.f.d. for short). To describe a stochastic process in practice, we will rarely give exact formulas for its f.f.d., but will rather use some indirect intuitive description. For example, take the familiar Bernoulli trials of consecutive tosses of a fair coin. A sequence of i.i.d. Bernoulli variables (ξ_t)_{t=1}^∞ is a stochastic process, and its f.f.d. is fully determined by this description. Indeed, for any sequence of times t1, t2, . . . , tn in T = {1, 2, . . . } and “results” x1, x2, . . . , xn in S, we are able to compute the probability P(ξ_{t1} = x1 ∩ ξ_{t2} = x2 ∩ · · · ∩ ξ_{tn} = xn), and it is equal to 2^{−n}.

Stationarity. In the example above, the probability to “meet” any sequence of results x1, x2, . . . , xn does not depend on the times t1, t2, . . . , tn. This means that the statistical properties of the process remain unchanged over time, which is intuitively obvious for tosses of a fair coin. If, however, a stochastic process describes tomorrow's temperature, it would be reasonable to expect a lower temperature during the morning than at noon.
Formally, a stochastic process {X_t : t ∈ T} is said to be stationary, or strictly stationary, if for all integers n and all t, t1, t2, . . . , tn in T the joint distributions of X_{t1}, X_{t2}, . . . , X_{tn} and X_{t+t1}, X_{t+t2}, . . . , X_{t+tn} coincide. Substituting n = 1, we can see that, in particular, all distribution functions {F_{X_t}(x) : t ∈ T} are the same for all t. Consequently, all parameters depending only on the distribution (such as mean and variance), if they exist, also do not change over time.
Strict stationarity is a strong requirement which may be difficult to test fully in real life. Actually, a much weaker condition, known as weak stationarity, is often already very useful for applications. A stochastic process {X_t : t ∈ T} is said to be weakly stationary if the mean of the process, m(t) = E[X_t], is constant, and the covariance of the process, C(s, t) = E[(X_s − m(s))(X_t − m(t))], depends only on the time difference t − s. Obviously, any strictly stationary stochastic process with finite mean and variance is also weakly stationary.

Increments. If a stochastic process {X_t : t ∈ T} describes the temperature, we are interested in the value of X_t itself. Sometimes, however, the dynamics of how the value changes over time is much more interesting. For example, if X_t is the price of a stock share, a “forecast” X_t = 100 provides almost no information by itself. If the current stock price is X0 = 60, the above forecast is very optimistic; if, however, X0 = 120, it is pessimistic. What we are really interested in is the price dynamics: whether it increases or decreases and by how much.
Formally, an increment of the stochastic process {X_t : t ∈ T} is the quantity X_{t+u} − X_t, u > 0. Many processes are most naturally defined through their increments. For example, let X_t be the total amount of money in the bank account of a person A on the first day of month t. Assume that the monthly salary of A is a fixed amount C, and the monthly expenses Y_t are random. Then the stochastic process X_t is naturally defined through its increments X_{t+1} − X_t = C − Y_t.
In the above example, the process X_t is not stationary (even weakly) unless Y_t ≡ C, ∀t. For example, if E[Y_t] < C, ∀t, the total amount of money in the bank account increases (on average) with time. However, if the Y_t are identically distributed, the rate of growth of X_t remains unchanged over time. Such processes are said to have stationary increments. If, moreover, the monthly expenses Y_t are (jointly) independent, the rate of growth of X_t does not depend on its history, and we say that X_t has independent increments.
Formally, a stochastic process {X_t : t ∈ T} has stationary (or time-homogeneous) increments if for every u > 0 the increment Z_t = X_{t+u} − X_t is a stationary process; a process {X_t : t ∈ T} is said to have independent increments if for any a, b, c, d ∈ T such that a < b < c < d, the random variables X_b − X_a and X_d − X_c are independent.
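The bank-account example is easy to simulate. A minimal sketch (Python with NumPy; the salary, the expenses distribution and the initial balance are arbitrary illustrative values chosen so that E[Y_t] < C, giving an upward drift):

import numpy as np

rng = np.random.default_rng(9)
T, C, X0 = 120, 2000.0, 500.0
Y = rng.uniform(1500.0, 2300.0, size=T)  # i.i.d. monthly expenses, E[Y_t] = 1900 < C
X = X0 + np.cumsum(C - Y)                # increments X_{t+1} - X_t = C - Y_t
print(X[:5], X[-1])                      # one sample path; drifts up by about 100 per month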

1.10 Summary
A random variable is a measurable function ξ : Ω → R from a probability space Ω to the real line.
For a random variable ξ, its cumulative distribution function (cdf) is Fξ(x) := P(ξ ≤ x). If Fξ(x) can be represented as

Fξ(x) = ∫_{−∞}^x ρξ(z) dz

for some non-negative function ρξ, the latter is called the probability density function (pdf) of ξ.
The expectation of a random variable ξ is defined as E[ξ] ≡ ∫_Ω ξ dP. It can be calculated as

E[ξ] = ∫_{−∞}^∞ x dFξ(x) = ∫_{−∞}^∞ x ρξ(x) dx.

The variance of a random variable ξ is defined by Var(ξ) := E[(ξ − E[ξ])²] ≡ E[ξ²] − (E[ξ])². For two random variables ξ, η, their covariance is defined by Cov(ξ, η) := E[(ξ − E[ξ])(η − E[η])] ≡ E[ξη] − E[ξ]E[η].
Two events A and B are called independent if P(A ∩ B) = P(A)P(B). Random variables ξ and η are called independent if the events A = {ω ∈ Ω : a < ξ(ω) < b} and B = {ω ∈ Ω : c < η(ω) < d} are independent for all real numbers a, b, c, d. If ξ and η are independent, then Cov(ξ, η) = 0, but the converse is not always true.
The conditional probability of A given B, denoted P[A|B], can be evaluated as P[A|B] = P[A ∩ B]/P[B].
A stochastic process is a family of random variables indexed in time, {Xt : t ∈ T }.
The time set T can be discrete or continuous, as can the state space S in
which the variables take their values.
Stochastic processes can be roughly classified into the following groups:

• Discrete S and discrete T ;

• Continuous S and discrete T ;

• Discrete S and continuous T ;

• Continuous S and continuous T ; or

• Mixed processes.

A stochastic process {X_t : t ∈ T} is said to be stationary if for all integers n and all t, t1, t2, . . . , tn in T the joint distributions of X_{t1}, X_{t2}, . . . , X_{tn} and X_{t+t1}, X_{t+t2}, . . . , X_{t+tn} coincide. It is called weakly stationary if the mean of the process, m(t) = E[X_t], is constant, and the covariance of the process, C(s, t) = E[(X_s − m(s))(X_t − m(t))], depends only on the time difference t − s.
An increment of the stochastic process {X_t : t ∈ T} is the quantity X_{t+u} − X_t, u > 0. A stochastic process has stationary (or time-homogeneous) increments if for every u > 0 the increment Z_t = X_{t+u} − X_t is a stationary process; a process {X_t : t ∈ T} is said to have independent increments if for any a, b, c, d ∈ T such that a < b < c < d, the random variables X_b − X_a and X_d − X_c are independent.

1.11 Questions

1. A sample space consists of five elements Ω = {a1 , a2 , a3 , a4 , a5 }. For


which of the following sets of probabilities does the corresponding triple
(Ω, A, P ) become a probability space? Why?

(a) p(a1) = 0.3; p(a2) = 0.2; p(a3) = 0.1; p(a4) = 0.1; p(a5) = 0.1;
(b) p(a1) = 0.4; p(a2) = 0.3; p(a3) = 0.1; p(a4) = 0.1; p(a5) = 0.1;
(c) p(a1) = 0.4; p(a2) = 0.3; p(a3) = 0.2; p(a4) = −0.1; p(a5) = 0.2.

2. Let X be a random variable from the continuous uniform distribution,


X ∼ U (0.5, 1.0). Starting with the probability density function, derive
expressions for the cumulative distribution function, expectation and
variance of X.

3. Assets A and B have the following distribution of returns in various


states:

State Asset A Asset B Probability


1 10% -2% 0.2
2 8% 15% 0.2
3 25% 0% 0.3
4 -14% 6% 0.3

Show that the correlation between the returns on asset A and asset B
is equal to -0.3830.

4. Formalise Example 1.5 as Ω = {ω1 , ω2 , ω3 , ω4 }, P (ω1 ) = P (ω2 ) =


P (ω3 ) = P (ω4 ) = 1/4 and

A := {ω1 , ω4 }, B := {ω2 , ω4 }, C := {ω3 , ω4 }.

Prove that the pairs (A, B), (A, C) and (B, C) are independent, but
the triple (A, B, C) is not mutually independent according to Definition
1.2.

5. You intend to model the maximum daily temperature in your office as


a stochastic process. What time set and state space would you use?

Chapter 2
Claim size estimation in insurance and reinsurance
2.1 Basic principles of insurance risk modelling
In general, for a risk to be insurable, the following conditions must be satis-
fied:

• The policyholder must have an interest in the risk being insured, to distinguish between insurance and a bet; and

• a risk must be of a financial and reasonably quantifiable nature.

In addition, the following conditions are desirable:

• Individual risk events should be independent of each other.

• The probability of the event should be relatively small. In other words,


an event that is nearly certain to occur is not conducive to insurance.

• Large numbers of potentially similar risks should be pooled to reduce


the variance and achieve more certainty.

• There should be an ultimate limit on the liability undertaken by the


insurer.

• Moral hazards should be eliminated as far as possible because these are


difficult to quantify, result in selection against the insurer and lead to
unfairness in treatment between one policyholder and another.

However, the desire for income means that an insurer will usually be
found to provide cover when these ideal criteria are not met.
Other characteristics that most general insurance products share are:

• Cover is normally for a fixed period, most commonly one year, after
which it has to be renegotiated. There is normally no obligation on
insurer or insured to continue the arrangement thereafter, although in
most cases a need for continuing cover may be assumed to exist.

• Claims are not of fixed amounts, and the amount of loss as well as the
fact needs to be proved before a claim can be settled.

• A claim occurring does not bring the policy to an end.

• Claims may occur at any time during the policy period.

Although there is normally a contractual obligation on the policyholder


to report a claim to the insurer as quickly as possible, notification may take
some time if the loss is not evident immediately. Settlement of the claim
may take a long time if protracted legal proceedings are needed or if it is
not straightforward to determine the extent of the loss. However, from the
moment of the event giving rise to the claim the ultimate settlement amount
is a liability of the insurer. Estimating the amounts of money that need to
be reserved to settle these liabilities is one of the most important areas of
actuarial involvement in general insurance. Classes of insurance in which
claims tend to take a long time to settle are known as long-tail. Those which
tend to take a short time to settle are known as short-tail, although the
dividing line between the two categories is not always distinct.
Many forms of non-life insurance can be regarded as short-term contracts, for example motor insurance. Some forms of life insurance also fall into this category, for example group life and one-year term assurance policies.
A short-term insurance contract can be defined as having the following
attributes:

• The policy lasts for a fixed, and relatively short, period of time, typi-
cally one year.

• The insurance company receives from the policyholder(s) a premium.

• In return, the insurer pays claims that arise during the term of the
policy.

At the end of the policy’s term the policyholder may or may not renew
the policy; if it is renewed, the premium payable by the policyholder may or
may not be the same as in the previous period.
The insurer may choose to pass part of the premium to a reinsurer; in
return, the reinsurer will reimburse the insurer for part of the cost of the
claims during the policy’s term according to some agreed formula.
An important feature of a short-term insurance contract is that the pre-
mium is set at a level to cover claims arising during the (short) term of the
policy only. This contrasts with life assurance policies, where mortality rates
increasing with age mean that the (level) annual premium in the early years
would be more than sufficient to cover the expected claims in the early years.
The excess amount would then be accumulated as a reserve to be used in the
later years, when the premium on its own would be insufficient to meet the
expected cost of claims.

The profit of any company, including an insurance company, during a certain time period, e.g. a month, can be calculated as the income of the company during this time period minus its expenses/losses.
The income of an insurance company comes from the premiums paid by its customers. At the beginning of a month, the company knows the number of customers and what premiums they are paying. Of course, the company cannot predict the number of new customers arriving during the next month, nor the number of customers who will stop paying premiums. However, because the premium paid by any individual customer is usually small, and the number of customers does not change much during a month, these are minor issues. Hence, the company can estimate its income for the next month with good accuracy.
The expenses/losses of an insurance company can be divided into two parts: expenses to cover claims, and other expenses. Other expenses, such as taxes, staff salaries, etc., can also be predicted. The main problem for any insurance company is to estimate the expenses/losses needed to cover the claims.
If there are N claims during a month with sizes X1, X2, . . . , XN, then
the total losses to cover all claims are

X1 + X2 + · · · + XN .

Here, the number N of claims is unpredictable, hence it is modelled as a


random variable. The sizes X1 , X2 , . . . , XN of claims are random variables
as well. Our first key assumption is that the number of claims and the sizes
of claims are independent random variables and can be studied separately.
To justify this assumption, let us consider motor insurance as an example. A
prolonged spell of bad weather may have a significant effect on claim numbers
but little or no effect on the distribution of individual claim amounts. On
the other hand, inflation may have a significant effect on the cost of repairing
cars, and hence on the distribution of individual claim amounts, but little or
no effect on claim numbers.
Our second key assumption is that the sizes X1, X2, . . . , XN of claims are independent and identically distributed (i.i.d). If the distribution of Xi is known, the model is complete and can be used to answer various questions of central importance for the insurance company, for example, the probability that claims above a certain level arrive.
In practice, however, the claim distribution is rarely known. The insurance company usually has a sequence

x1 , x2 , . . . , xn

of past claims and uses this sequence to estimate the claim distribution. In
most cases, this process works as follows.

1. The company assumes that the claim distribution belongs to a certain family, but with unknown parameters. For example, it may assume that claims follow a normal distribution with unknown mean and variance.

2. The company estimates the unknown parameters to fit the data of past claims x1, x2, . . . , xn as well as possible.

3. Steps 1-2 can then be repeated for different families of distributions. There are "goodness of fit" tests in statistics, for example the χ2 test, which allow one to select the family which fits the data best.

We next focus on step 2: estimating unknown parameters of some given


family of distributions.

2.2 Method of moments


For a random variable X and positive integer j, the j-th moment of X is E[X^j]. For example, the first moment (j = 1) is just the expectation m = E[X]. The difference X − m has expectation E[X − m] = E[X] − m = 0 and is called the centralized random variable. The second moment E[(X − m)^2] of the centralized random variable X − m is just the variance of X, usually denoted σ^2, where σ is the standard deviation. The ratio (X − m)/σ has mean 0 and standard deviation 1 and is called the standardized random variable. The third moment
\[ E\left[\left(\frac{X-m}{\sigma}\right)^3\right] \]
of (X − m)/σ is called the skewness of X, while the fourth moment
\[ E\left[\left(\frac{X-m}{\sigma}\right)^4\right] \]
is known as the kurtosis of X.
Sometimes it is convenient to calculate moments using the moment generating function. For a random variable X, its moment generating function is
\[ M_X(t) := E\left[e^{tX}\right]. \]
If we have M_X(t), we can find the n-th moment of X as the n-th derivative of M_X(t) at 0:
\[ E[X^n] = \frac{d^n}{dt^n} M_X(t)\Big|_{t=0}. \]

If X and Y are independent random variables and S = X + Y , then
MS (t) = MX (t) · MY (t),
which is a very convenient property.
If X belongs to a certain family of distributions with r parameters a1, a2, . . . , ar, its j-th moment can be explicitly calculated as a function of the parameters, that is,
\[ E[X^j] = f_j(a_1, a_2, \dots, a_r), \qquad j = 1, 2, \dots \]
If past data x1, x2, . . . , xn of i.i.d. realizations of X are available, the j-th moment can also be estimated from the data as
\[ E[X^j] \approx \frac{1}{n}\sum_{i=1}^n x_i^j, \qquad j = 1, 2, \dots \]
Now, the method of moments suggests selecting the parameters a1, a2, . . . , ar in such a way that the first r moments estimated from the data match the first r moments computed from the formulas for the distribution, that is,
\[ \frac{1}{n}\sum_{i=1}^n x_i^j = f_j(a_1, a_2, \dots, a_r), \qquad j = 1, 2, \dots, r. \quad (17) \]
If we denote by
\[ m_j = \frac{1}{n}\sum_{i=1}^n x_i^j, \qquad j = 1, 2, \dots, r, \]
the moments estimated from the data, then (17) simplifies to
\[ m_j = f_j(a_1, a_2, \dots, a_r), \qquad j = 1, 2, \dots, r. \quad (18) \]
This is a system of r equations with r unknowns which often has a unique
solution.
We next consider concrete examples with specific families of distributions.
• Assume that X follows the exponential distribution with parameter λ, see (5). In this case, we have only 1 parameter, so it suffices to consider the 1st moment only, that is, the expectation. The expectation E[X] of the exponential distribution is 1/λ, and (17) reduces to
\[ \frac{1}{n}\sum_{i=1}^n x_i = \frac{1}{\lambda}, \]
which results in the estimate
\[ \lambda = \frac{n}{\sum_{i=1}^n x_i}. \]

• Assume that X follows a normal distribution with parameters µ and σ, see (6). Because we have 2 parameters, it suffices to consider 2 moments. The first 2 moments of the normal distribution are E[X] = µ and E[X^2] = σ^2 + µ^2. Hence, the parameters µ and σ can be found from the system of equations
\[ \frac{1}{n}\sum_{i=1}^n x_i = \mu, \qquad \frac{1}{n}\sum_{i=1}^n x_i^2 = \sigma^2 + \mu^2. \]
The solution is
\[ \mu = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2 - \left(\frac{1}{n}\sum_{i=1}^n x_i\right)^2}. \]

• If X follows a log-normal distribution with parameters µ and σ, that is, log(X) ∼ N(µ, σ^2), and past data x1, . . . , xn are available, then the logarithms yi = log xi of the past data are an i.i.d. sample from the normal distribution N(µ, σ^2), and its parameters can be estimated from these data exactly as above:
\[ \mu = \frac{1}{n}\sum_{i=1}^n y_i, \qquad \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n y_i^2 - \left(\frac{1}{n}\sum_{i=1}^n y_i\right)^2}. \]

• If X follows a gamma distribution (8) with parameters α > 0 and λ > 0, we need r = 2 moments, estimated from the data as
\[ m_1 = \frac{1}{n}\sum_{i=1}^n x_i, \qquad m_2 = \frac{1}{n}\sum_{i=1}^n x_i^2. \quad (19) \]
Then (18) reduces to
\[ m_1 = \frac{\alpha}{\lambda}, \qquad m_2 = \frac{\alpha}{\lambda^2} + \left(\frac{\alpha}{\lambda}\right)^2, \]
and the solution is
\[ \alpha = \frac{m_1^2}{m_2 - m_1^2}, \qquad \lambda = \frac{m_1}{m_2 - m_1^2}. \]

• If X follows a Pareto distribution (9) with parameters α > 0 and λ > 0, then the first two moments exist only if α > 2, and, in this case, system (18) reduces to
\[ m_1 = \frac{\lambda}{\alpha-1}, \qquad m_2 = \frac{\alpha\lambda^2}{(\alpha-1)^2(\alpha-2)} + m_1^2, \]
where m1 and m2 are defined in (19). The solution is
\[ \alpha = \frac{2m_2 - 2m_1^2}{m_2 - 2m_1^2}, \qquad \lambda = \frac{m_1 m_2}{m_2 - 2m_1^2}, \]
provided that m2 − 2m1^2 > 0.

For other families of distributions, like Burr distribution (10), the gener-
alized Pareto distribution (11) or Weibull distribution (12), explicit expres-
sions for moments may be too complicated to solve system (18) analytically.
However, it may be solved numerically using appropriate computer software.
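As a concrete illustration, the following is a minimal sketch (in Python, using the numpy and scipy packages) of how system (18) might be solved numerically for the Weibull distribution (12), whose j-th moment is E[X^j] = c^{-j/γ} Γ(1 + j/γ). The function name and the starting point (1, 1) are illustrative choices, not part of the method.

import numpy as np
from scipy.optimize import fsolve
from scipy.special import gamma as Gamma

def fit_weibull_by_moments(x):
    # Sample moments m1, m2 estimated from the data x1, ..., xn.
    x = np.asarray(x, dtype=float)
    m1, m2 = x.mean(), (x**2).mean()
    # For the Weibull distribution with F(x) = 1 - exp(-c*x**g),
    # the j-th moment is E[X^j] = c**(-j/g) * Gamma(1 + j/g).
    def equations(params):
        c, g = params
        return (c**(-1/g) * Gamma(1 + 1/g) - m1,
                c**(-2/g) * Gamma(1 + 2/g) - m2)
    # Solve the 2x2 system (18) numerically, starting from (1, 1).
    c, g = fsolve(equations, x0=(1.0, 1.0))
    return c, g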

2.3 Method of maximum likelihood


The method of moments is not always appropriate because, for some families of distributions, the moments may not exist for some parameter values. An alternative, and even more intuitive, method is the method of maximum likelihood. For simplicity, we first introduce this method with a very simple example.
Imagine a coin which can show 2 outcomes, head H and tail T , with proba-
bilities p and 1 − p, respectively, but assume that parameter p is unknown.
To estimate p, we have tossed this coin 10 times, and the outcomes are
H, H, T, H, T, H, H, H, T, H. The method of moments is not applicable here,
because the outcomes are not even numerical values. Instead, let us just
calculate the probability of getting exactly this sequence. The probability of
getting the first head is p, the second head is p, the next tail is 1 − p, and
so on. All 10 tosses are independent, hence the probability of getting this
sequence is the product of the corresponding probabilities
\[ p \cdot p \cdot (1-p) \cdot p \cdot (1-p) \cdot p \cdot p \cdot p \cdot (1-p) \cdot p = p^7(1-p)^3. \]
For example, if p = 0 or p = 1, then p^7(1−p)^3 = 0, hence we would never get this sequence. If p = 1/2, then p^7(1−p)^3 = 1/1024, so getting this sequence is unlikely but possible. However, can we do better? Does there exist a p for which this sequence is not so unlikely? After all, why not select the p for which this sequence is as likely as possible? In other words, we want to find the p which maximizes the function L(p) = p^7(1−p)^3. To find such p, we can differentiate and solve the equation L'(p) = 0.
In fact, the following trick can be used to simplify calculations. Because the logarithm is an increasing function, maximizing L(p) is equivalent to maximizing the logarithm log(L(p)). In our case, log(L(p)) = 7 log p + 3 log(1−p). The derivative is 7/p − 3/(1−p), which is equal to 0 if 7/p = 3/(1−p), or 7(1−p) = 3p,

or p = 0.7. With this parameter, the result of the experiment which we actually observed is as likely as it can possibly be. This is the idea of the method of maximum likelihood.
More generally, let X1, X2, . . . , Xn be a sequence of i.i.d. random variables whose distribution belongs to some family of discrete distributions with parameter θ. Given the historical data x1, x2, . . . , xn we actually obtained, we can ask how "likely" it was to get these data, and the answer is
\[ L(\theta) = \prod_{i=1}^n P(X_i = x_i \mid \theta), \]
where P(Xi = xi | θ) is the conditional probability of the event Xi = xi given θ. We then find the θ which maximizes L(θ), or, equivalently, maximizes its logarithm
\[ l(\theta) = \log(L(\theta)) = \sum_{i=1}^n \log[P(X_i = x_i \mid \theta)]. \]
The optimal θ̂ can be found from the equation
\[ \frac{d}{d\theta}\, l(\hat{\theta}) = 0. \quad (20) \]
In fact, θ can be a vector of r parameters. Then d/dθ in (20) should be understood as r partial derivatives, and (20) reduces to a system of r equations with r unknowns.
Given a sample x1, x2, . . . , xn from a continuous distribution with density f(x | θ) which depends on a vector θ of parameters, the likelihood function L(θ) takes the form
\[ L(\theta) = \prod_{i=1}^n f(x_i \mid \theta), \]
and its logarithm is
\[ l(\theta) = \log(L(\theta)) = \sum_{i=1}^n \log[f(x_i \mid \theta)]. \]
The optimal θ̂ maximizing this function can be found from the same system of equations (20).
We next consider concrete examples with specific families of distributions.
• Assume that X follows the exponential distribution with parameter λ, that is, has density given by (5). Then
\[ l(\lambda) = \sum_{i=1}^n \log[f(x_i \mid \lambda)] = \sum_{i=1}^n \log[\lambda e^{-\lambda x_i}] = \sum_{i=1}^n (\log\lambda - \lambda x_i) = n\log\lambda - \lambda\sum_{i=1}^n x_i, \]
and (20) reduces to
\[ \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0, \]
from which we find
\[ \lambda = \frac{n}{\sum_{i=1}^n x_i}. \]
Note that in this case the result is the same as with the method of moments.

• Assume that X follows a normal distribution with parameters µ and σ^2, see (6). Then
\[ l(\mu, \sigma^2) = \sum_{i=1}^n \log\left[\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}\right] = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2, \]
and (20) reduces to
\[ \frac{d}{d\mu}\, l(\mu, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) = 0 \]
and
\[ \frac{d}{d\sigma^2}\, l(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i-\mu)^2 = 0, \]
from which we find that
\[ \mu = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2, \]
the same solution as with the method of moments.

• If X follows a log-normal distribution with parameters µ and σ^2, then the logarithms yi = log xi are an i.i.d. sample from the normal distribution N(µ, σ^2), hence:
\[ \mu = \frac{1}{n}\sum_{i=1}^n \log x_i, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^n (\log x_i - \mu)^2. \]

• If X follows a gamma distribution (8) with parameters α > 0 and λ > 0, then
\[ l(\alpha, \lambda) = \sum_{i=1}^n \log\left[\frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x_i^{\alpha-1} e^{-\lambda x_i}\right], \]
or
\[ l(\alpha, \lambda) = n\alpha\log\lambda - n\log\Gamma(\alpha) + (\alpha-1)\sum_{i=1}^n \log x_i - \lambda\sum_{i=1}^n x_i. \]
Then
\[ \frac{d}{d\lambda}\, l(\alpha, \lambda) = \frac{n\alpha}{\lambda} - \sum_{i=1}^n x_i = 0, \]
and
\[ \frac{d}{d\alpha}\, l(\alpha, \lambda) = n\log\lambda - n\frac{d}{d\alpha}\log\Gamma(\alpha) + \sum_{i=1}^n \log x_i = 0. \]
From the first equation, λ = nα / Σ_{i=1}^n x_i. We can substitute this into the second equation and solve it for α numerically, as in the sketch below. We remark that in this case the solution from the maximum likelihood method is different from the one from the method of moments.
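A minimal sketch of this numerical step in Python, using scipy's root-finder brentq and the digamma function ψ(α) = (d/dα) log Γ(α); the bracketing interval below is an arbitrary illustrative choice.

import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma

def fit_gamma_mle(x):
    x = np.asarray(x, dtype=float)
    # Substituting lambda = alpha / mean(x) into the equation for alpha gives
    #   log(alpha) - log(mean(x)) - digamma(alpha) + mean(log(x)) = 0.
    mean_x, mean_log_x = x.mean(), np.log(x).mean()
    def score(alpha):
        return np.log(alpha) - np.log(mean_x) - digamma(alpha) + mean_log_x
    alpha = brentq(score, 1e-6, 1e6)  # assumes the root lies in this interval
    return alpha, alpha / mean_x      # the pair (alpha, lambda)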

For other families of distributions the method is the same, and the result-
ing equations, if impossible to solve analytically, can be solved numerically
using appropriate computer software.

2.4 Method of percentiles


Let x1, x2, . . . , xn be an i.i.d. sample from some distribution. Given α ∈ (0, 1), we would like to find an estimate for the number x such that FX(x) = P[X ≤ x] = α, where X is a random variable with this distribution. This can be done using the following procedure. First, let j be the smallest integer greater than nα. Then sort the sequence x1, x2, . . . , xn in non-decreasing order; the j-th number in the sorted sequence is the answer. We denote this answer by q(α, x1, . . . , xn).
Example. Let the data be 1, 5, 6, 4, 3, 5, 6 and α = 1/4. Then n = 7,
nα = 7/4, and the smallest integer greater than 7/4 is j = 2. Then sort the
data in non-decreasing order to get 1, 3, 4, 5, 5, 6, 6. The 2-nd number in this
sequence is 3, and this is the answer.
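This procedure translates directly into code; a minimal sketch in Python (the function name is an illustrative choice):

import math

def empirical_percentile(alpha, data):
    # j is the smallest integer strictly greater than n*alpha;
    # the answer is the j-th smallest observation (1-based).
    n = len(data)
    j = math.floor(n * alpha) + 1
    return sorted(data)[j - 1]

# Reproduces the example above:
# empirical_percentile(0.25, [1, 5, 6, 4, 3, 5, 6]) returns 3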
Now consider a family of distributions which depends on a vector of parameters λ and has cumulative distribution function F(x, λ). We assume that F is strictly increasing in x, and hence has an inverse function defined on (0, 1):
\[ F^{-1}(\alpha, \lambda) = \inf\{x \mid F(x, \lambda) > \alpha\}, \qquad 0 < \alpha < 1. \]
If there are r parameters, let us select r different numbers 0 < α1 < α2 < · · · < αr < 1 and require that F^{-1}(αi, λ) "agree with the data", that is,
\[ F^{-1}(\alpha_i, \lambda) = q_i, \qquad i = 1, 2, \dots, r, \]
where
\[ q_i = q(\alpha_i, x_1, \dots, x_n), \qquad i = 1, 2, \dots, r. \quad (21) \]

Applying function F to both sides of these equations, we get

αi = F (qi , λ), i = 1, 2, . . . , r. (22)

Finding parameters from this system is called the method of percentiles.


Some examples of its application are presented below.

• Assume that X follows the exponential distribution (5) with parameter λ. Because we have just 1 parameter, it suffices to choose one α1 ∈ (0, 1), for example, α1 = 0.5. Then, given data x1, . . . , xn, we need to estimate
\[ q_1 = q(\alpha_1, x_1, \dots, x_n) \]
as explained above. Finally, λ is found from the equation
\[ \alpha_1 = F(q_1, \lambda) = 1 - e^{-\lambda q_1}. \]
We get
\[ \lambda = -\frac{\log(1-\alpha_1)}{q_1}. \]
• Assume that X follows the Pareto distribution (9) with parameters α > 0 and λ > 0. Because we have 2 parameters, we need to select 0 < α1 < α2 < 1, for example, α1 = 1/4, α2 = 3/4. Then we estimate
\[ q_1 = q(\alpha_1, x_1, \dots, x_n), \qquad q_2 = q(\alpha_2, x_1, \dots, x_n) \]
from the data. Then system (22) becomes
\[ \alpha_1 = 1 - \left(\frac{\lambda}{\lambda+q_1}\right)^{\alpha}, \qquad \alpha_2 = 1 - \left(\frac{\lambda}{\lambda+q_2}\right)^{\alpha}, \]
and can be solved numerically to find α and λ.

• Assume that X follows the Weibull distribution (12) with parameters c > 0 and γ > 0. Because we have 2 parameters, we need to select 0 < α1 < α2 < 1, for example, α1 = 1/4, α2 = 3/4. Then we estimate
\[ q_1 = q(\alpha_1, x_1, \dots, x_n), \qquad q_2 = q(\alpha_2, x_1, \dots, x_n) \]
from the data. Then system (22) becomes
\[ \alpha_1 = 1 - \exp(-c q_1^{\gamma}), \qquad \alpha_2 = 1 - \exp(-c q_2^{\gamma}). \]
It can be rewritten as
\[ -c q_1^{\gamma} = \log(1-\alpha_1), \qquad -c q_2^{\gamma} = \log(1-\alpha_2). \]
Dividing the first equation by the second one, we get
\[ \left(\frac{q_1}{q_2}\right)^{\gamma} = \frac{\log(1-\alpha_1)}{\log(1-\alpha_2)}, \]
hence
\[ \gamma = \log\left(\frac{\log(1-\alpha_1)}{\log(1-\alpha_2)}\right) \Big/ \log\left(\frac{q_1}{q_2}\right), \quad (23) \]
and then
\[ c = -\frac{\log(1-\alpha_1)}{q_1^{\gamma}}. \quad (24) \]
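Formulas (23) and (24) give the Weibull parameters in closed form; a minimal sketch in Python, reusing the empirical_percentile function from the earlier sketch (an assumption of this sketch):

import math

def fit_weibull_by_percentiles(data, a1=0.25, a2=0.75):
    q1 = empirical_percentile(a1, data)
    q2 = empirical_percentile(a2, data)
    # Formula (23):
    g = math.log(math.log(1 - a1) / math.log(1 - a2)) / math.log(q1 / q2)
    # Formula (24):
    c = -math.log(1 - a1) / q1**g
    return c, g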

2.5 Reinsurance
To protect itself from large claims, an insurance company, let us call it I (insurer), may in turn take out an insurance policy with another company, which we call R (reinsurer). Such a policy is called a reinsurance policy. Insurance company I receives premiums from a client C, and pays part of these premiums to R. Then, if client C makes a claim, part of it may be covered by R, in accordance with the contract between I and R. In this section we consider reinsurance contracts of two very simple types: proportional reinsurance and individual excess of loss reinsurance.
In proportional reinsurance the insurer I pays a fixed proportion α of the
claim, 0 < α < 1, whatever the size of the claim, and the reinsurer R pays
the remaining proportion 1 − α of the claim. In other words, if claim amount
is X, then I pays αX and R pays (1 − α)X. The parameter α is known as
the retained proportion or retention level.
Imagine that claims arrive independently and identically distributed from some unknown distribution, and insurance company I has a historical record of the sizes y1, y2, . . . , yn of the payments it made. Then it can easily recover the actual sizes of the past claims: they are y1/α, y2/α, . . . , yn/α. Similarly, if the reinsurance company R has a historical record of the sizes z1, z2, . . . , zn of the payments it made, then the actual sizes of the past claims are z1/(1−α), z2/(1−α), . . . , zn/(1−α). These data may be used to estimate the claim distribution using the methods described in Sections 2.2-2.4.
In excess of loss reinsurance, the insurer I will pay any claim in full up to a certain amount M, which is called the retention level; any amount above M will be paid by the reinsurer R. Note that the term "retention level" is used in both proportional reinsurance and excess of loss reinsurance, but it has a completely different meaning in each case!
With an excess of loss reinsurance contract, if a claim for amount X arrives, then the insurer will pay Y, where
\[ Y = \begin{cases} X & \text{if } X \le M, \\ M & \text{if } X > M. \end{cases} \]
The reinsurer pays the amount
\[ Z = X - Y = \begin{cases} 0 & \text{if } X \le M, \\ X - M & \text{if } X > M. \end{cases} \quad (25) \]
Because Z ≥ 0, it is clear that E[Z] ≥ 0, or, equivalently, E[Y] ≤ E[X]. In fact, from the formula for Z it is clear that E[Z] > 0 if P[X > M] > 0. Hence, in this case, E[Y] < E[X]. With a bit more work, one may prove a similar inequality for the variance as well. In conclusion,
\[ E[Y] \le E[X] \quad \text{and} \quad Var[Y] \le Var[X], \]
and both inequalities are strict if P[X > M] > 0. This means that, for the insurer, both the mean amount paid and the variance of the amount paid are reduced.
If X is a non-negative continuous random variable with density function f(x), then
\[ E[X] = \int_0^{\infty} x f(x)\,dx. \]
This is the mean amount the insurance company I would pay without reinsurance. With reinsurance, the mean amount for I to pay is
\[ E[Y] = \int_0^{M} x f(x)\,dx + M\,P(X > M), \]
while the mean amount for the reinsurer R to pay is
\[ E[Z] = \int_M^{\infty} (x - M) f(x)\,dx. \]

Without reinsurance, if the claim amount is inflated by a factor of k, then so is the mean amount for the insurer to pay:
\[ E[kX] = kE[X]. \]
This is not the case for excess of loss reinsurance if the retention level M is fixed. In this case, the amount for insurer I to pay after inflation becomes
\[ Y' = \begin{cases} kX & \text{if } kX \le M, \\ M & \text{if } kX > M, \end{cases} \]
and the mean amount is
\[ E[Y'] = \int_0^{M/k} kx f(x)\,dx + M\,P(X > M/k). \]
One can easily check that in general E[Y'] ≠ kE[Y].
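A quick Monte Carlo illustration of this effect, with exponentially distributed claims (the distribution, parameter values and sample size here are arbitrary choices for the sketch):

import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1000.0, size=1_000_000)  # claim sizes, mean 1000
M, k = 1500.0, 1.1                                 # retention level, inflation factor

Y = np.minimum(X, M)            # insurer's payment before inflation
Y_infl = np.minimum(k * X, M)   # insurer's payment after inflation

# E[Y'] < k*E[Y]: with M fixed, the insurer's mean cost grows less than
# proportionally, since the cap M absorbs more of the inflated claims.
print(k * Y.mean(), Y_infl.mean())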


Many insurance companies, especially those working in motor insurance
and many kinds of property and accident insurance, offer policies which re-
quire from clients to cover their loss themselves up to some limit L, called
the excess. If the amount X of loss is less than L, the claim cannot happen,
and if X is greater than L, then the client will claim only for X − L. This
policy is called “policy with excess”. If Y is the amount actually paid by the
insurer, then (
0 if X ≤ L
Y =
X − L if X > L
Note that this equation completely coincides with (25), just M is replaced
by L. Hence, the position of the insurer in a policy with excess is exactly
the same as that of an reinsurer with an excess of loss reinsurance contract.
Similarly, the position of the policyholder/client buying policy with excess
is exactly the same as that of an insurer with an excess of loss reinsurance
contract.

2.6 Claim size estimation with excess of loss reinsurance
With excess of loss reinsurance, the problem of estimating an unknown claim distribution may be challenging due to incomplete data. Imagine that claims arrive independently and identically distributed from some unknown distribution, and insurance company I has a historical record of the sizes of the payments it made. A typical record has the form
x1, x2, M, x4, x5, x6, M, x8, . . .
In this example, the actual sizes of the 3rd and 7th claims are not recorded; the company only knows that they are greater than or equal to M. In general, if some data are missing or incomplete, we say that we have a "censored sample". So, with a censored sample as above, can we still use the methods described in Sections 2.2-2.4 to estimate the parameters of the unknown claim distribution?
• The method of moments, see Section 2.2, is not available, because the
moments cannot be reliably estimated from the censored sample;
• The method of percentiles, see Section 2.4, can be used without modification, provided that all qi in (21) are less than M. This is the case if the retention level M is high, so that only the few highest claims are unknown. For example, let M = 1000, n = 9, let the data be
500, 300, 1000, 100, 800, 500, 1000, 700, 300,
and let the percentile levels in Section 2.4 be α1 = 1/4, α2 = 3/4. Then the lowest integers greater than nα1 = 9/4 and nα2 = 27/4 are 3 and 7, respectively. If we sort the data in non-decreasing order,
100, 300, 300, 500, 500, 700, 800, 1000, 1000,
the third and seventh terms are q1 = 300 and q2 = 800, respectively. These values of q1 and q2 are all that we need to proceed with the method of percentiles, and the unknown claims over 1000 have no influence on this calculation. This is the big advantage of the method of percentiles.
• The method of maximum likelihood, described in Section 2.3, can be used but requires modification. Let J be the set of claims less than M for which full information is available. The contribution of these data to the likelihood function is exactly as in Section 2.3:
\[ \prod_{i \in J} f(x_i \mid \theta), \]
where f(x | θ) is the density function of the claim distribution. Next, assume that m other claims are referred to the reinsurer, and the insurer only knows that they are greater than M. These censored claims contribute to the likelihood function a factor of
\[ [P(X > M)]^m. \]
If F(x | θ) is the cumulative distribution function of the claim distribution, then P(X > M) = 1 − P(X ≤ M) = 1 − F(M | θ). Hence, the complete likelihood function is
\[ L(\theta) = \prod_{i \in J} f(x_i \mid \theta) \cdot (1 - F(M \mid \theta))^m, \]
and its logarithm is
\[ l(\theta) = \log(L(\theta)) = \sum_{i \in J} \log[f(x_i \mid \theta)] + m\log(1 - F(M \mid \theta)). \]
The optimal θ̂ maximizing this function can be found from the same system of equations (20).
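For exponentially distributed claims (an illustrative choice), this censored likelihood can even be maximized in closed form: l(λ) = |J| log λ − λ(Σ_{i∈J} xi + mM), which is maximal at λ̂ = |J| / (Σ_{i∈J} xi + mM). A minimal sketch in Python:

import numpy as np

def exp_mle_censored(payments, M):
    # payments: the insurer's records, where a recorded value of M means
    # the true claim was censored at the retention level M.
    payments = np.asarray(payments, dtype=float)
    observed = payments[payments < M]   # fully observed claims (the set J)
    m = int(np.sum(payments >= M))      # number of censored claims
    # l(lam) = len(J)*log(lam) - lam*(sum over J + m*M), maximised at:
    return len(observed) / (observed.sum() + m * M)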

Let us consider the same problem of estimating an unknown claim distribution from the point of view of the reinsurer. The historical records
w1, w2, . . . , wk
of the reinsurer's expenses consist of the differences wi = xi − M for claims whose size xi is greater than M. For claims with xi ≤ M, the reinsurer does not even know that such claims occurred. This implies that, in fact, w1, w2, . . . , wk is an i.i.d. sample from the random variable
\[ W = X - M \mid X > M. \]

Let us express the cdf G(w) and pdf g(w) of W using the cdf F(x) and pdf f(x) of the original claim size distribution X. For any w ≥ 0,
\[ G(w) = P[W \le w] = P[X-M \le w \mid X > M] = \frac{P[X \le w+M \text{ and } X > M]}{P[X > M]} = \frac{P[M < X \le w+M]}{1 - P[X \le M]} = \frac{F(w+M) - F(M)}{1 - F(M)}. \]
Differentiating with respect to w, we get
\[ g(w) = G'(w) = \frac{f(w+M)}{1 - F(M)}. \]

If X belongs to some distribution family with unknown parameters, we can use these formulas to express the cdf and pdf of W as functions of these parameters, and then use the methods from Sections 2.2-2.4 to estimate the parameters based on the data w1, w2, . . . , wk.

Example 2.1. Assume that the claim size X has a Pareto distribution (9) with parameters α > 2 and λ > 0. Then the pdf of W is
\[ g(w) = \frac{f(w+M)}{1-F(M)} = \frac{\alpha\lambda^{\alpha}}{(\lambda+w+M)^{\alpha+1}} \Big/ \left(\frac{\lambda}{\lambda+M}\right)^{\alpha} = \frac{\alpha(\lambda+M)^{\alpha}}{(\lambda+w+M)^{\alpha+1}}, \]
that is, W again has a Pareto distribution, with parameters α and λ + M. Let us use, for example, the method of moments to estimate the parameters. It states that
\[ \frac{1}{k}\sum_{i=1}^k w_i = E[W] = \frac{\lambda+M}{\alpha-1} \]
and
\[ \frac{1}{k}\sum_{i=1}^k w_i^2 - \left(\frac{1}{k}\sum_{i=1}^k w_i\right)^2 = Var[W] = \frac{\alpha(\lambda+M)^2}{(\alpha-1)^2(\alpha-2)}. \]
This gives a system of two equations from which the two unknown parameters λ and α can be found.

2.7 Summary
In this chapter we focused on modelling the claim size distribution. We assume that the claim distribution belongs to a certain family, but with unknown parameters. The company estimates the unknown parameters a1, a2, . . . , ar to fit the data of past claims x1, x2, . . . , xn as well as possible.

The method of moments suggests selecting the parameters in such a way that the first r moments estimated from the data match the first r moments f_j(a1, a2, . . . , ar), j = 1, 2, . . . , r, computed from the formulas for the distribution, that is,
\[ m_j = f_j(a_1, a_2, \dots, a_r), \qquad j = 1, 2, \dots, r, \]
where m_j = \frac{1}{n}\sum_{i=1}^n x_i^j, j = 1, 2, . . . , r.

The method of maximum likelihood suggests selecting the vector of parameters θ = (a1, a2, . . . , ar) to maximize (the logarithm of) the likelihood function
\[ l(\theta) = \log(L(\theta)) = \sum_{i=1}^n \log[f(x_i \mid \theta)]. \]
The optimal θ̂ = (â1, â2, . . . , âr) can be found from the system of equations
\[ \frac{d}{da_i}\, l(\hat{\theta}) = 0, \qquad i = 1, 2, \dots, r. \]

The method of percentiles suggests finding the vector of parameters λ = (a1, a2, . . . , ar) from the system of equations
\[ \alpha_i = F(q_i, \lambda), \qquad i = 1, 2, \dots, r, \]
where F is a cdf which depends on the parameters, 0 < α1 < α2 < · · · < αr < 1 are some pre-specified numbers, and qi is the estimate of the percentile at level αi based on the data x1, x2, . . . , xn. To find it, we first find the smallest integer j greater than nαi, then sort the sequence x1, x2, . . . , xn in non-decreasing order; the j-th smallest number is then qi.

To protect itself from large claims, an insurance company may in turn take out an insurance policy with another company, called the reinsurer. We consider reinsurance contracts of two very simple types: proportional reinsurance and individual excess of loss reinsurance. In proportional reinsurance, if the claim amount is X, then the insurer pays αX and the reinsurer pays (1 − α)X, where α is a parameter known as the retention level. With an excess of loss reinsurance contract, if a claim for amount X arrives, then the insurer will pay
\[ Y = \begin{cases} X & \text{if } X \le M, \\ M & \text{if } X > M, \end{cases} \]
and the reinsurer pays Z = X − Y.

2.8 Questions

1. The numbers of claims a company received during each of the last 12 months are
10, 8, 15, 10, 7, 3, 20, 14, 5, 12, 8, 8.
Assuming that these numbers are i.i.d. realizations of
(a) a Poisson distribution with parameter λ,
(b) a negative binomial distribution with parameters p and k,
use the method of moments to estimate the unknown parameters.

2. Assume that the same data as in question 1 are i.i.d. realizations of a geometric distribution with parameter p. Use the method of maximum likelihood to estimate p.

3. The history of the n = 18 most recent claim sizes (rounded to an integer number of pounds) is

937, 342, 150, 1080, 401, 3500, 7970, 1400, 530,

1106, 847, 899, 3076, 2837, 315, 2560, 390, 2950.


Assuming that these are i.i.d. data from a Weibull distribution, use the method of percentiles with α1 = 1/4 and α2 = 3/4 to estimate the parameters of the distribution.

4. Assume that the history of claim sizes is the same as in the previous question, but the company orders a reinsurance policy with excess of loss reinsurance above the level M = 2000.
(a) Write down the history of expenses of the reinsurer;
(b) Assuming that the original claim size distribution is a Pareto distribution with parameters α > 2 and λ > 0, estimate the unknown parameters using the method of moments with the data available to the reinsurer.
(c) Comment on whether you think the Pareto distribution is a good model for these data.

Chapter 3
Estimation of aggregate claim distribution
3.1 The collective risk model
As discussed in the previous chapter, if there are N claims during a month (or week, or year, or other fixed period of time) with sizes X1, X2, . . . , XN, then the total losses to cover all claims are

S = X 1 + X2 + · · · + XN ,

and S = 0 if N = 0. If all Xi are independent, identically distributed, and


also independent of N , then we say that S has compound distribution.
In the previous chapter we focused on estimating the cumulative distri-
bution function F (x) of individual claims Xi based on historical data.
In this chapter we focus on estimating the cumulative distribution function G(x) of the total claim size S. By definition, G(x) is equal to the probability of the event {S ≤ x}. This event can happen if either
• {S ≤ x and N = 0}, that is, no claims occurred, or

• {S ≤ x and N = 1}, that is, one claim of amount ≤ x occurred, or

• {S ≤ x and N = 2}, that is, 2 claims of total amount ≤ x occurred, or

• ...

• {S ≤ x and N = n}, that is, n claims of total amount ≤ x occurred,


or

• ...
Hence, by the law of total probability (13),
\[ G(x) = P(S \le x) = \sum_{n=0}^{\infty} P(S \le x \text{ and } N = n) = \sum_{n=0}^{\infty} P(N = n) \cdot P(S \le x \mid N = n). \quad (26) \]
The term
\[ P(S \le x \mid N = n) = P(X_1 + X_2 + \cdots + X_n \le x) \]
represents the distribution function of the sum of n i.i.d. random variables, each with distribution function F(x). This function is known as the n-fold convolution of F and is denoted by F^{n*}(x).

If independent random variables X and Y have densities f and g, the density h of their sum X + Y is given by
\[ h(x) = \int_{-\infty}^{\infty} f(t)\,g(x-t)\,dt, \]
and the cdf of X + Y is H(x) = \int_{-\infty}^x h(t)\,dt. The n-fold convolution of any continuous distribution F can be computed by applying these formulas n times. The n-fold convolution of a discrete distribution F can be computed by definition, as demonstrated in the Example below.

Example 3.1. Let F be the distribution of a discrete random variable taking values 0 and 1 with equal chances. Calculate F^{3*}(x).
Answer: Let S = X1 + X2 + X3, where each Xi is 0 or 1 with equal chances. Then S can take the values 0, 1, 2, and 3 with probabilities
\[ P(S = 0) = P(X_1 = X_2 = X_3 = 0) = 1/8, \]
\[ P(S = 1) = P(X_1 = 1, X_2 = X_3 = 0) + P(X_2 = 1, X_1 = X_3 = 0) + P(X_3 = 1, X_1 = X_2 = 0) = 3/8, \]
and similarly
\[ P(S = 2) = 3/8, \qquad P(S = 3) = 1/8. \]
Hence,
\[ F^{3*}(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1/8 & \text{if } 0 \le x < 1, \\ 1/2 & \text{if } 1 \le x < 2, \\ 7/8 & \text{if } 2 \le x < 3, \\ 1 & \text{if } 3 \le x. \end{cases} \]
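Convolutions of discrete distributions on {0, 1, 2, . . .} are convenient to compute with probability vectors; a minimal sketch in Python using numpy's convolve, reproducing the answer of Example 3.1:

import numpy as np

def n_fold_convolution(p, n):
    # p[k] = P(X = k) for a distribution on {0, 1, 2, ...};
    # returns the probability vector of X1 + ... + Xn.
    out = np.array([1.0])   # n = 0: point mass at 0
    for _ in range(n):
        out = np.convolve(out, p)
    return out

print(n_fold_convolution([0.5, 0.5], 3))
# [0.125 0.375 0.375 0.125], i.e. the jumps of F^{3*} in Example 3.1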
We remark that F^{1*}(x) = F(x). For convenience, we also introduce the notation
\[ F^{0*}(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1 & \text{if } 0 \le x. \end{cases} \]
With the notation F^{n*}(x), equation (26) becomes
\[ G(x) = P(S \le x) = \sum_{n=0}^{\infty} P(N = n) \cdot F^{n*}(x). \quad (27) \]

For example, if the Xi are discrete random variables taking values in the non-negative integers only, then, for every non-negative integer x,
\[ P(S = x) = G(x) - G(x-1) = \sum_{n=0}^{\infty} P(N = n) \cdot (F^{n*}(x) - F^{n*}(x-1)). \]

Example 3.2. Let S = X1 + X2 + · · · + XN, where the Xi are i.i.d. random variables taking values 0 or 1 with equal chances, and N can be 0, 1, 2, or 3 with equal chances. Find the probability that S = 2.
Answer:
\[ P(S = 2) = \sum_{n=0}^{3} P(N = n) \cdot (F^{n*}(2) - F^{n*}(1)) = \frac{1}{4}\sum_{n=0}^{3} (F^{n*}(2) - F^{n*}(1)). \]
In Example 3.1, we calculated that
\[ F^{3*}(2) - F^{3*}(1) = \frac{7}{8} - \frac{1}{2} = \frac{3}{8}. \]
A similar calculation shows that F^{2*}(2) − F^{2*}(1) = 1 − 3/4 = 1/4, and F^{1*}(2) − F^{1*}(1) = F^{0*}(2) − F^{0*}(1) = 1 − 1 = 0. Hence,
\[ P(S = 2) = \frac{1}{4}\left(0 + 0 + \frac{1}{4} + \frac{3}{8}\right) = \frac{5}{32}. \]
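The same answer can be obtained programmatically from formula (27); a short continuation of the previous sketch, reusing n_fold_convolution:

import numpy as np

p_claim = [0.5, 0.5]             # P(X = 0) = P(X = 1) = 1/2
p_N = [0.25, 0.25, 0.25, 0.25]   # P(N = n), n = 0, 1, 2, 3

pS = np.zeros(4)                 # S takes values in {0, 1, 2, 3}
for n, pn in enumerate(p_N):
    conv = n_fold_convolution(p_claim, n)
    pS[:len(conv)] += pn * conv

print(pS[2])   # 0.15625 = 5/32, matching Example 3.2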

We now find the mean and variance of S = X1 + X2 + · · · + XN, where X1, . . . , XN are i.i.d. copies of some random variable X. We denote by µ_X, µ_N, µ_S the means of the random variables X, N, S and by σ_X^2, σ_N^2, σ_S^2 the corresponding variances. We have

E(S|N = n) = E(X1 + X2 + · · · + Xn ) = nE(X),

or, in other words, E(S|N ) = N E(X). Hence, by the law of total expectation
(15)

µS = E(S) = E[E(S|N )] = E[N E(X)] = E[N ]E[X] = µN · µX . (28)

In words, the expected size of total claim is equal to expected number of


claims times the average size of one claim.
Similarly,

V ar(S|N = n) = V ar(X1 + X2 + · · · + Xn ) = nV ar(X),

or Var(S|N) = N Var(X). Then, by the law of total variance (16),
\[ \sigma_S^2 = Var(S) = E[Var(S|N)] + Var[E(S|N)] = E[N\,Var(X)] + Var[N\,E(X)], \]
hence
\[ \sigma_S^2 = E[N]\,Var(X) + Var[N]\,(E(X))^2 = \mu_N \sigma_X^2 + \sigma_N^2 \mu_X^2. \quad (29) \]

We can also calculate the moment generating function M_S(t) of S. By the law of total expectation (15),
\[ M_S(t) = E(e^{tS}) = E[E(e^{tS} \mid N)]. \]
But
\[ E(e^{tS} \mid N = n) = E[e^{t(X_1 + X_2 + \cdots + X_n)}] = \prod_{i=1}^n E[e^{tX_i}] = (M_X(t))^n. \]
Hence,
\[ M_S(t) = E[(M_X(t))^N] = E[e^{N\log M_X(t)}] = M_N(\log M_X(t)). \quad (30) \]

Example 3.3. Consider the special case when all claims are for the same fixed amount B. That is, P(Xi = B) = 1 for all i. Then
\[ S = X_1 + X_2 + \cdots + X_N = B + B + \cdots + B = NB. \]
Hence, E[S] = E[NB] = B E[N] and Var[S] = Var[NB] = B^2 Var[N]. Because µ_X = B and σ_X^2 = 0, the same expressions follow from formulas (28) and (29).
In the next sections we consider compound distributions S for various
models for the distribution of the number of claims N .

3.2 The compound Poisson distribution


In this section we assume that the number of claims N follows a Poisson distribution with parameter λ, that is,
\[ P[N = n] = \frac{\lambda^n e^{-\lambda}}{n!}, \qquad n = 0, 1, 2, \dots \]
This is a natural model if we assume that claims arrive "uniformly at rate λ", as we explain in later chapters.

Using the series expansion for the exponential function, e^x = \sum_{n=0}^{\infty} x^n/n!, one can compute the moment generating function of the Poisson distribution:
\[ M_N(t) = E[e^{tN}] = \sum_{n=0}^{\infty} e^{tn} P[N = n] = \sum_{n=0}^{\infty} e^{tn} \frac{\lambda^n e^{-\lambda}}{n!} = e^{-\lambda} \sum_{n=0}^{\infty} \frac{(\lambda e^t)^n}{n!} = e^{-\lambda} e^{\lambda e^t} = \exp(\lambda(e^t - 1)). \]
Differentiating it, we can find the moments:
\[ \frac{dM_N(t)}{dt} = \exp(\lambda(e^t - 1)) \cdot \lambda e^t, \qquad \frac{d^2 M_N(t)}{dt^2} = \exp(\lambda(e^t - 1))\left((\lambda e^t)^2 + \lambda e^t\right), \]
so
\[ \mu_N = E[N] = \frac{dM_N(t)}{dt}\Big|_{t=0} = \lambda, \qquad E[N^2] = \frac{d^2 M_N(t)}{dt^2}\Big|_{t=0} = \lambda^2 + \lambda, \]
hence
\[ \sigma_N^2 = E[N^2] - (E[N])^2 = \lambda^2 + \lambda - \lambda^2 = \lambda. \]
Now we can use the formulas derived in the previous section to calculate the mean and variance of S = X1 + X2 + · · · + XN. By (28),
\[ \mu_S = E[S] = \mu_N \cdot \mu_X = \lambda\mu_X. \quad (31) \]
By (29),
\[ \sigma_S^2 = \mu_N \sigma_X^2 + \sigma_N^2 \mu_X^2 = \lambda(\sigma_X^2 + \mu_X^2) = \lambda E[X^2]. \quad (32) \]
Also, by (30),
\[ M_S(t) = M_N(\log M_X(t)) = \exp(\lambda(e^{\log M_X(t)} - 1)) = \exp(\lambda(M_X(t) - 1)). \quad (33) \]

The formulas for µ_S and σ_S^2 above can also be derived by differentiating M_S(t) once and twice. By differentiating it three times we can also derive the skewness:
\[ E\left[(S - \mu_S)^3\right] = \lambda E[X^3], \]
hence
\[ \mathrm{skew}[S] := E\left[\left(\frac{S-\mu_S}{\sigma_S}\right)^3\right] = \frac{\lambda E[X^3]}{(\lambda E[X^2])^{3/2}}. \quad (34) \]
Because the claims Xi are positive random variables, E[X^3] > 0, hence S is positively skewed even if the distribution of the Xi is negatively skewed. Also note that \lim_{\lambda\to\infty} \mathrm{skew}[S] = 0, hence the distribution of S is almost symmetric if λ is large.

Example 3.4. Assume that the number of claims during a year has the Poisson distribution with parameter λ and the size of each claim is a random variable uniformly distributed on [a, b]. All claim sizes are independent. What are the mean and variance of the cumulative size of the claims from all policies?
Answer: If X is the size of a claim, then
\[ E[X] = \frac{1}{b-a}\int_a^b x\,dx = \frac{a+b}{2}, \qquad E[X^2] = \frac{1}{b-a}\int_a^b x^2\,dx = \frac{a^2+ab+b^2}{3}, \]
which gives
\[ E[S] = \lambda E[X] = \lambda(a+b)/2, \]
and
\[ Var[S] = \lambda E[X^2] = \lambda(a^2+ab+b^2)/3. \]
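Formulas (31) and (32) are easy to check by simulation; a minimal Monte Carlo sketch in Python (the values λ = 5, a = 100, b = 500 are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
lam, a, b = 5.0, 100.0, 500.0

N = rng.poisson(lam, size=100_000)
S = np.array([rng.uniform(a, b, size=n).sum() for n in N])

print(S.mean(), lam * (a + b) / 2)              # both about 1,500
print(S.var(), lam * (a**2 + a*b + b**2) / 3)   # both about 516,667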

Now assume that we have n types of claims. The number Ni of claims of the i-th type has a Poisson distribution with parameter λi, and the sizes of such claims are i.i.d. with cdf Fi(x). The total size of such claims is
\[ S_i = X_1 + X_2 + \cdots + X_{N_i}, \qquad i = 1, 2, \dots, n, \]
and the total size of all claims is then
\[ S = S_1 + S_2 + \cdots + S_n. \]
We claim that S has a compound Poisson distribution with parameter λ = λ1 + λ2 + · · · + λn and cdf of individual claim amounts
\[ F(x) = \frac{1}{\lambda}\sum_{i=1}^n \lambda_i F_i(x). \]
This result is very important because it reduces a sum of n independent compound Poisson distributions to just one compound Poisson distribution with different parameters.
Let us prove this result. Let N follow a Poisson distribution with parameter λ, let Y1, Y2, . . . , YN be i.i.d. random variables with cdf F, and let
\[ S' = Y_1 + Y_2 + \cdots + Y_N. \]
We want to prove that S and S' have the same distribution. Because there is a one-to-one relationship between distributions and moment generating functions, it is sufficient to prove that S and S' have the same moment generating functions.
We first calculate the moment generating function M_S(t) of S. By the independence of the Si,
\[ M_S(t) = E[e^{tS}] = E[e^{t(S_1 + S_2 + \cdots + S_n)}] = \prod_{i=1}^n E[e^{tS_i}]. \]
Now, by (33),
\[ E[e^{tS_i}] = \exp(\lambda_i(M_i(t) - 1)), \qquad i = 1, 2, \dots, n, \]
where Mi(t) is the moment generating function which corresponds to the cdf Fi(x). Then
\[ M_S(t) = \prod_{i=1}^n \exp(\lambda_i(M_i(t) - 1)) = \exp\left[\sum_{i=1}^n \lambda_i(M_i(t) - 1)\right]. \]
Now let us calculate the moment generating function M'(t) of S'. By (33),
\[ M'(t) = \exp(\lambda(M_Y(t) - 1)), \]
where M_Y(t) is the moment generating function which corresponds to the cdf F(x). By definition, it is equal to
\[ M_Y(t) = \int_{-\infty}^{\infty} e^{tx}\,dF(x) = \frac{1}{\lambda}\sum_{i=1}^n \lambda_i \int_{-\infty}^{\infty} e^{tx}\,dF_i(x) = \frac{1}{\lambda}\sum_{i=1}^n \lambda_i M_i(t). \]
Hence,
\[ M'(t) = \exp(\lambda M_Y(t) - \lambda) = \exp\left(\sum_{i=1}^n \lambda_i M_i(t) - \sum_{i=1}^n \lambda_i\right) = M_S(t). \]
This implies that S and S' have the same distribution, which finishes the proof.

3.3 The compound binomial distribution


In this section we assume that the number of claims N follows a binomial distribution with parameters n and p, that is,
\[ P[N = k] = \frac{n!}{k!(n-k)!}\, p^k (1-p)^{n-k}, \qquad k = 0, 1, \dots, n. \]

As mentioned in Chapter 1, this is the case if
\[ N = N_1 + N_2 + \cdots + N_n, \]
where the Ni are i.i.d. standard Bernoulli variables (that is, they take the value 1 with probability p and 0 otherwise). This is a natural model if we assume that the company covers n independent policies, each of which may issue a claim with the same probability p.
Because E[Ni] = p, Var[Ni] = p(1 − p), and the moment generating function of Ni is
\[ E[e^{tN_i}] = e^{t\cdot 0}(1-p) + e^{t\cdot 1}p = pe^t + 1 - p, \]
we have E[N] = np, Var[N] = np(1 − p), and M_N(t) = (pe^t + 1 − p)^n.
By (28),
\[ \mu_S = E[S] = \mu_N \cdot \mu_X = np\,\mu_X. \quad (35) \]
By (29),
\[ \sigma_S^2 = \mu_N \sigma_X^2 + \sigma_N^2 \mu_X^2 = np(\sigma_X^2 + (1-p)\mu_X^2) = np(E[X^2] - p(E[X])^2). \quad (36) \]
Also, by (30),
\[ M_S(t) = M_N(\log M_X(t)) = (pe^{\log M_X(t)} + 1 - p)^n = (pM_X(t) + 1 - p)^n. \quad (37) \]

Differentiating M_S(t) three times, we can derive the skewness:
\[ E\left[(S - \mu_S)^3\right] = npE[X^3] - 3np^2 E[X^2]E[X] + 2np^3(E[X])^3, \]
hence
\[ \mathrm{skew}[S] = \frac{npE[X^3] - 3np^2 E[X^2]E[X] + 2np^3(E[X])^3}{(npE[X^2] - np^2(E[X])^2)^{3/2}}. \]
We can see that S can be positively or negatively skewed, depending on the parameters.

Example 3.5. Assume that all claims are for the same amount B. Then E[X^k] = B^k, k = 1, 2, 3, and
\[ \mathrm{skew}[S] = \frac{npB^3 - 3np^2B^3 + 2np^3B^3}{(npB^2 - np^2B^2)^{3/2}} = \frac{1-2p}{\sqrt{np(1-p)}}. \]
In particular, skew[S] > 0 if p < 0.5 but skew[S] < 0 if p > 0.5.

Example 3.6. Assume that the number of claims during a year has the binomial distribution with parameters n and p, and the size of each claim is a random variable X uniformly distributed on [a, b]. All claim sizes are independent. What are the mean and variance of the cumulative size of the claims from all policies?
Answer:
\[ E[S] = npE[X] = np(a+b)/2, \]
and
\[ Var[S] = npE[X^2] - np^2(E[X])^2 = np(a^2+ab+b^2)/3 - np^2(a+b)^2/4. \]

3.4 The compound negative binomial distribution


In this section we assume that the number of claims N follows a negative binomial distribution with parameters k and p, that is,
\[ P[N = n] = \frac{(k+n-1)!}{n!(k-1)!}\, p^k (1-p)^n, \qquad n = 0, 1, 2, \dots \]
As mentioned in Chapter 1, this is the case if
\[ N = N_1 + N_2 + \cdots + N_k, \]
where the Ni are i.i.d. variables with geometric distribution.


Because E[Ni] = (1 − p)/p, Var[Ni] = (1 − p)/p^2, and the moment generating function of Ni is
\[ E[e^{tN_i}] = \sum_{n=0}^{\infty} e^{tn}(1-p)^n p = \frac{p}{1 - (1-p)e^t}, \]
we have
\[ E[N] = k(1-p)/p, \qquad Var[N] = k(1-p)/p^2, \qquad M_N(t) = p^k(1 - (1-p)e^t)^{-k}. \]

You may note that Var[N] > E[N], while for the Poisson distribution Var[N] = E[N]. This is the advantage of the negative binomial distribution: it can better fit the data if the sample variance is greater than the sample mean, which is often the case in practice.
By (28),
\[ \mu_S = E[S] = \mu_N \cdot \mu_X = \frac{k(1-p)}{p}\,\mu_X. \quad (38) \]
By (29),
\[ \sigma_S^2 = \mu_N \sigma_X^2 + \sigma_N^2 \mu_X^2 = \frac{k(1-p)}{p}E[X^2] + \frac{k(1-p)^2}{p^2}(E[X])^2. \quad (39) \]

Also, by (30),
\[ M_S(t) = M_N(\log M_X(t)) = \frac{p^k}{(1 - (1-p)M_X(t))^k}. \quad (40) \]
Differentiating M_S(t) three times, we can derive the third centralized moment:
\[ E\left[(S - \mu_S)^3\right] = \frac{3k(1-p)^2 E[X]E[X^2]}{p^2} + \frac{2k(1-p)^3(E[X])^3}{p^3} + \frac{k(1-p)E[X^3]}{p}. \]
Because all terms are positive, the compound negative binomial distribution is always positively skewed.

3.5 Aggregate claim distribution under reinsurance


Under proportional reinsurance, the insurer and reinsurer each pays a defined
proportion of every claim, and therefore their aggregate claim is proportional
to the aggregate claim with no reinsurance, whose distribution has been
derived in the previous sections. For a retention level α (0 ≤ α ≤ 1), the
i-th individual claim amount for the insurer is αXi and for the reinsurer is
(1 − α)Xi . The aggregate claims amounts are αS and (1 − α)S, respectively.
Under the excess of loss reinsurance with retention level M , the amount
that an insurer pays on the i-th claim is Yi = min(Xi , M ), while the amount
that the reinsurer pays is Zi = Xi −Yi = max(0, Xi −M ). Thus, the insurer’s
aggregate claims net of reinsurance can be represented as:

SI = Y1 + Y2 + · · · + YN ,

and the reinsurer's aggregate claim is:

SR = Z1 + Z2 + · · · + ZN . (41)

We remark that if Xi < M , which is usually the case, then Zi = 0, and, in


reality, the reinsurer will not even see this claim. However, this formula for
SR with lots of zero terms, while somewhat artificial, is convenient for calcu-
lations. In particular, we can use formulas (28), (29), and (30) to estimate
mean, variance, and moment generating function for SI and SR .

Example 3.7. The number N of claims has Poisson distribution with pa-
rameter λ = 10. Individual claim amounts are uniformly distributed on
(0, 2000). The insurer of this risk has effected excess of loss reinsurance with
retention level 1600. Calculate the mean, variance and coefficient of skewness

of both the insurer’s and reinsurer’s aggregate claims under this reinsurance
arrangement.
Answer: In this case, Xi ∼ U(0, 2000) and M = 1600. As usual, denote Yi = min(Xi, M) and Zi = Xi − Yi = max(0, Xi − M). Then
\[ E[Y_i] = \int_0^M x f(x)\,dx + M \cdot P(X_i > M), \]
where f(x) = 0.0005 is the U(0, 2000) density. This gives
\[ E[Y_i] = \frac{0.0005(M^2 - 0^2)}{2} + 0.2M = 960. \]
Hence, by (31),
\[ E[S_I] = \lambda E[Y_i] = 10 \cdot 960 = 9600. \]
Further,
\[ E[Y_i^2] = \int_0^M x^2 f(x)\,dx + M^2 \cdot P(X_i > M) = 1{,}194{,}666.7, \]
and by (32),
\[ Var[S_I] = \lambda E[Y_i^2] = 11{,}946{,}667. \]
Next,
\[ E[Y_i^3] = \int_0^M x^3 f(x)\,dx + M^3 \cdot P(X_i > M) = 1{,}638{,}400{,}000, \]
hence by (34),
\[ \mathrm{skew}[S_I] = \frac{\lambda E[Y_i^3]}{(\lambda E[Y_i^2])^{3/2}} = \frac{16{,}384{,}000{,}000}{(11{,}946{,}667)^{3/2}} \approx 0.397. \]

Let us now do a similar calculation for the reinsurer. Because Xi ∼ U(0, 2000), we have E[Xi] = 1000, hence
\[ E[Z_i] = E[X_i - Y_i] = E[X_i] - E[Y_i] = 1000 - 960 = 40, \]
and by (31),
\[ E[S_R] = \lambda E[Z_i] = 10 \cdot 40 = 400. \]
Further,
\[ E[Z_i^2] = \int_M^{2000} (x-M)^2 f(x)\,dx = \frac{0.0005(2000-M)^3}{3} \approx 10{,}666.7, \]
and by (32),
\[ Var[S_R] = \lambda E[Z_i^2] = 106{,}667. \]
Next,
\[ E[Z_i^3] = \int_M^{2000} (x-M)^3 f(x)\,dx = \frac{0.0005(2000-M)^4}{4} = 3{,}200{,}000, \]
and by (34),
\[ \mathrm{skew}[S_R] = \frac{\lambda E[Z_i^3]}{(\lambda E[Z_i^2])^{3/2}} = \frac{32{,}000{,}000}{(106{,}667)^{3/2}} \approx 0.92. \]

The reinsurer's aggregate claim can alternatively be represented as
\[ S_R = W_1 + W_2 + \cdots + W_{N_R}, \quad (42) \]
where N_R is the number of actual (non-zero) claims to the reinsurer, and the Wi are the sizes of these claims. For example, if in the example above there are 8 claims of sizes
403, 1490, 1948, 443, 1866, 1704, 1221, 823,
then (41) reduces to
\[ S_R = 0 + 0 + 348 + 0 + 266 + 104 + 0 + 0, \]
while (42) reduces to the more natural expression
\[ S_R = 348 + 266 + 104. \]

The random variables Wi in (42) have density function
\[ g(w) = \frac{f_X(w+M)}{1 - F_X(M)}, \qquad w > 0, \]
where f_X and F_X are the density and cdf of the original claim size distribution.
To find the distribution of N_R, note that
\[ N_R = I_1 + I_2 + \cdots + I_N, \]
where N is the total number of claims, and Ij is an indicator random variable which takes the value 1 if the reinsurer makes a (non-zero) payment on the j-th claim and takes the value 0 otherwise. Thus, N_R gives the number of payments made by the reinsurer. Denote by π the probability that Xj > M. Since Ij takes the value 1 only if Xj > M, we have
\[ P(I_j = 1) = P(X_j > M) = \pi, \quad \text{and} \quad P(I_j = 0) = 1 - \pi. \]
Hence, E[Ij] = π, E[Ij^2] = π, and the moment generating function is
\[ M_I(t) = E[e^{tI_j}] = e^t\pi + 1 - \pi. \]
By formulas (28), (29), and (30), this implies that
\[ E[N_R] = \pi E[N], \]
\[ Var[N_R] = E[N](\pi - \pi^2) + Var[N]\,\pi^2, \]
and
\[ M_{N_R}(t) = M_N(\log M_I(t)) = M_N(\log(e^t\pi + 1 - \pi)). \]

Example 3.8. In Example 3.7 above, we can use formula (42) to analyse the reinsurer's aggregate claim S_R. In this case, N_R follows a Poisson distribution with parameter 10 · 0.2 = 2, and the individual claims Wi have density function
\[ g(w) = \frac{f_X(w+M)}{1 - F_X(M)} = \frac{0.0005}{0.2} = 0.0025, \qquad 0 < w < 400, \]
that is, Wi ∼ U(0, 400). Then E[Wi] = 200, E[Wi^2] = 53,333.33 and E[Wi^3] = 16,000,000, giving the same results for the mean, variance, and skewness of S_R as above.

Thus, there are two ways to specify and evaluate the distribution of SR .
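Both representations are easy to compare by simulation; a minimal Monte Carlo sketch in Python for the numbers of Examples 3.7-3.8, using representation (41):

import numpy as np

rng = np.random.default_rng(2)
lam, M = 10.0, 1600.0

N = rng.poisson(lam, size=100_000)
SR = np.array([np.maximum(rng.uniform(0, 2000, size=n) - M, 0).sum()
               for n in N])

print(SR.mean(), SR.var())   # about 400 and 106,667, as computed above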

3.6 The individual risk model


Under this model a portfolio consisting of a fixed number of risks is consid-
ered. It will be assumed that

• these risks are independent;

• claim amounts from these risks are not identically distributed random
variables; and

• the number of risks does not change over the period of insurance cover.

As before, aggregate claims from this portfolio are denoted by S. So:

S = Y1 + Y2 + · · · + Yn ,

where Yj denotes the claim amount under the j-th risk and n denotes the
number of risks. It is possible that some risks will not give rise to claims.
Thus, some of the observed values of Yj may be 0.
For each risk, the following assumptions are made:

(a) The number of claims from the j-th risk, Nj , is either 0 or 1.

(b) The probability of a claim from the j-th risk is qj .

If a claim occurs under the j-th risk, the claim amount is denoted by the
random variable Xj . Let Fj (x), µj and σj2 denote the distribution function,
mean and variance of Xj , respectively.
Assumption (a) is very restrictive. It means that a maximum of one claim
from each risk is allowed for in the model. This includes risks such as one-
year term assurance, but excludes many types of general insurance policy.
For example, there is no restriction on the number of claims that could be
made in a policy year under household contents insurance.
There are three important differences between this model and the collec-
tive risk model:

• The number of risks in the portfolio has been specified. In the collective
risk model, this number N was not specified and was modelled as a
random variable.

• The number of claims from each individual risk has been restricted.
There was no such restriction in the collective risk model.

• It is assumed that individual risks are independent but not necessarily


identically distributed. In the collective risk model, individual claim
amounts are independent and identically distributed.

Assumptions (a) and (b) say that the Nj are standard Bernoulli variables, or, equivalently, Nj ∼ Bin(1, qj). Thus, the distribution of Yj is compound binomial, with individual claim amount random variable Xj. From formulas (35) and (36) it follows that
\[ E[Y_j] = q_j \mu_j \]
and
\[ Var[Y_j] = q_j \sigma_j^2 + q_j(1-q_j)\mu_j^2. \]

The aggregate claim amount S is the sum of n independent compound binomial random variables. It is easy to find the mean and variance of S:
\[ E[S] = E\left[\sum_{j=1}^n Y_j\right] = \sum_{j=1}^n E[Y_j] = \sum_{j=1}^n q_j \mu_j, \quad (43) \]
and
\[ Var[S] = Var\left[\sum_{j=1}^n Y_j\right] = \sum_{j=1}^n Var[Y_j] = \sum_{j=1}^n (q_j\sigma_j^2 + q_j(1-q_j)\mu_j^2). \quad (44) \]

The distribution of S can be computed only under certain conditions, for example, when the compound binomial variables Yj are identically distributed. Consider the special case when for each policy the values of qj, µj and σj^2 are identical, say q, µ and σ^2. Since Fj(x) is then independent of j, we can denote it simply F(x). Hence, S is compound binomial, with binomial parameters n and q, and individual claims have distribution function F(x). In this special case, the model reduces to the collective risk model, and it can be seen from (43) and (44) that
\[ E[S] = nq\mu, \qquad Var[S] = nq\sigma^2 + nq(1-q)\mu^2. \]
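Formulas (43) and (44) translate directly into code; a minimal sketch in Python (the function name is an illustrative choice):

import numpy as np

def individual_risk_moments(q, mu, sigma2):
    # q[j], mu[j], sigma2[j]: claim probability, mean claim amount and
    # claim variance for the j-th risk; implements (43) and (44).
    q, mu, sigma2 = (np.asarray(v, dtype=float) for v in (q, mu, sigma2))
    mean = np.sum(q * mu)
    var = np.sum(q * sigma2 + q * (1 - q) * mu**2)
    return mean, var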

3.7 Aggregate claim estimation under uncertainty in parameters
In the previous sections we have assumed that the distributions of the number
of claims and of the amounts of individual claims are known with certainty.
In general, parameters of these distributions are not known and should be
estimated from appropriate sets of data. In this section we will see how
the models introduced earlier can be extended to allow for parameter uncer-
tainty/variability. We will do this by looking at a series of examples. All the
examples will be based on claim numbers having a Poisson distribution.
In our first Example, we assume that the insurance company has n independent policies,
\[ S = S_1 + S_2 + \cdots + S_n, \]
and the aggregate claim Si from the i-th policy has a compound Poisson distribution with parameter λi, where the cdf of the individual claim amounts distribution is F(x). We assume that F(x) is fixed and identical for all policies, but the λi are not known with certainty. Instead, we assume that λ1, λ2, . . . , λn is an i.i.d. sample from some known distribution. If we then choose a policy i at random, we can make probability statements about λi, such as "there is a 50% chance that λi lies between 3 and 5", etc.

Example 3.9. Question: Suppose that the Poisson parameters λi of
policies are not known but are equally likely to be 0.1 or 0.3. Let n be the
number of policies and m1 and m2 be the first and second moments of the
claim size X.

(i) Find the mean and variance of the aggregate claim Si from a random
policy i;

(ii) Find the mean and variance of the aggregate claim S.

Explanation: The situation described in this Example may arise in, for
example, motor insurance. It may be that there are n drivers insured, and
some of them are “good” drivers and some are “bad” drivers. The individual
claim amount distribution is the same for all drivers but “good” drivers make
fewer claims (0.1 p.a. on average) than “bad” drivers (0.3 p.a. on average). It
is assumed that it is known, possibly from national data, that a policyholder
is equally likely to be a “good” driver or a “bad” driver.
Answer: (i) Let us choose a policy i at random. From the problem formulation,
\[ P[\lambda_i = 0.1] = 0.5 \quad \text{and} \quad P[\lambda_i = 0.3] = 0.5. \]
Hence,
\[ E[\lambda_i] = (0.1 + 0.3)\cdot 0.5 = 0.2, \quad \text{and} \quad Var[\lambda_i] = (0.1^2 + 0.3^2)\cdot 0.5 - 0.2^2 = 0.01. \]

The conditional distribution of Si if λi is known is compound Poisson, hence


formulas (31) and (32) imply

E[Si |λi ] = λi m1 , and V ar[Si |λi ] = λi m2 .

Hence, by the law of total expectation (15)

E[Si ] = E[E[Si |λi ]] = E[λi m1 ] = m1 E[λi ] = 0.2m1 .

Similarly, by the law of total variance (16),
\[ Var[S_i] = E[Var(S_i|\lambda_i)] + Var(E(S_i|\lambda_i)) = E[\lambda_i m_2] + Var(\lambda_i m_1) = m_2 E[\lambda_i] + m_1^2 Var[\lambda_i] = 0.2m_2 + 0.01m_1^2. \]


(ii) Because Si are independent and identically distributed,

E[S] = E[S1 + S2 + · · · + Sn ] = nE[Si ] = 0.2nm1 ,

and
\[ Var[S] = Var[S_1 + S_2 + \cdots + S_n] = n\,Var[S_i] = 0.2nm_2 + 0.01nm_1^2. \]

In the previous example, we assumed that all Si are independent. We now consider a modification of this example where the number of claims from any policy i follows a Poisson distribution with the same but unknown parameter λ. In this case, a higher S1 is an indication of a higher λ and therefore of a higher S2, hence it is not reasonable to assume that the Si are independent. In this situation, the correct assumption is that the conditional random variables Si|λ are independent (and identically distributed).

Example 3.10. Question: Suppose that the Poisson parameter λ is


unknown but is the same for all policies and is equally likely to be 0.1 or 0.3.
Let Si |λ be i.i.d, and let everything else be as in Example 3.9.

(i) Find the mean and variance of the aggregate claim Si from a random
policy i;

(ii) Find the mean and variance of the aggregate claim S.

Explanation: The situation described in this Example may arise in, for
example, buildings insurance in a certain area. The number of claims could
depend on, among other factors, the weather during the year; an unusually
high number of storms resulting in a high expected number of claims (i.e. a
high value of λ) and vice versa for all the policies together.
Answer: (i) For a random policy i,

P [λ = 0.1] = 0.5 and P [λ = 0.3] = 0.5,

and exactly the same calculation as in Example 3.9 gives the same result:

E[Si |λ] = λm1 , and V ar[Si |λ] = λm2 ,

and
E[Si] = 0.2m1 and Var[Si] = 0.2m2 + 0.01m1^2.
(ii) Even when Si are not independent, expectation of a sum is still the
sum of expectations, and

E[S] = E[S1 + S2 + · · · + Sn ] = nE[Si ] = 0.2nm1 .

However, for dependent Si,
\[ Var[S] = Var[S_1 + S_2 + \cdots + S_n] \ne n\,Var[S_i]. \]
Instead, we should use the independence of the Si|λ to get
\[ E[S|\lambda] = nE[S_i|\lambda] = n\lambda m_1, \quad \text{and} \quad Var[S|\lambda] = n\,Var[S_i|\lambda] = n\lambda m_2, \]
and then the law of total variance (16) implies that
\[ Var[S] = E[Var(S|\lambda)] + Var[E(S|\lambda)] = E[n\lambda m_2] + Var[n\lambda m_1] = 0.2nm_2 + 0.01n^2m_1^2. \]


We remark that this variance is greater than that in Example 3.9.

Example 3.11. Question: Suppose that the Poisson parameters λi are drawn from a gamma distribution with known parameters α and δ. Find the distribution of the number of claims Ni from a random policy i.
Answer: By the problem formulation, the conditional distribution Ni|λi is a Poisson distribution with parameter λi, while λi follows Ga(α, δ), that is, has density f(λ) = \frac{\delta^{\alpha}}{\Gamma(\alpha)}\lambda^{\alpha-1}e^{-\delta\lambda}, see (8). Then, by (14), for any m = 0, 1, 2, . . . ,
\[ P[N_i = m] = \int_0^{\infty} e^{-\lambda}\frac{\lambda^m}{m!} \cdot \frac{\delta^{\alpha}}{\Gamma(\alpha)}\lambda^{\alpha-1}e^{-\delta\lambda}\,d\lambda = \frac{\delta^{\alpha}}{\Gamma(\alpha)\,m!}\int_0^{\infty} e^{-\lambda(\delta+1)}\lambda^{m+\alpha-1}\,d\lambda. \]
The integrand is proportional to the density of the gamma distribution with parameters m + α and δ + 1; since that density integrates to 1, the integral equals Γ(m + α)/(δ + 1)^{m+α}. We get
\[ P[N_i = m] = \frac{\delta^{\alpha}\,\Gamma(m+\alpha)}{\Gamma(\alpha)\,m!\,(\delta+1)^{m+\alpha}}. \]
One can check that if α = k is a positive integer, then this formula reduces to the formula for the negative binomial distribution with parameters k and δ/(δ + 1).
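This Poisson-gamma mixture is easy to verify by simulation; a minimal sketch in Python comparing the simulated mixture with scipy's negative binomial pmf (the values α = 3 and δ = 2 are arbitrary illustrative choices; note that rate δ corresponds to scale 1/δ in numpy's gamma sampler):

import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(3)
alpha, delta = 3.0, 2.0

lam = rng.gamma(shape=alpha, scale=1/delta, size=500_000)
N = rng.poisson(lam)

p = delta / (delta + 1)
# Empirical P(N = 2) vs the negative binomial pmf with k = alpha:
print(np.mean(N == 2), nbinom.pmf(2, alpha, p))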

3.8 Summary
Let N be a (random) number of claims during some fixed period of time. If
the sizes of claims are X1 , X2 , . . . , XN , then the total cost to cover all claims
is
S = X 1 + X2 + · · · + XN ,
and S = 0 if N = 0. If all Xi are independent, identically distributed, and
also independent of N , then we say that S has compound distribution.
The cdf of S is given by
\[ G(x) = P(S \le x) = \sum_{n=0}^{\infty} P(N = n) \cdot F^{n*}(x), \]
where F(x) is the cdf of an individual claim and F^{n*}(x) is its n-fold convolution.
We denote by µ_X, µ_N, µ_S the means of the random variables Xi, N, S and by σ_X^2, σ_N^2, σ_S^2 the corresponding variances. Then
\[ \mu_S = \mu_N \cdot \mu_X, \quad \text{and} \quad \sigma_S^2 = \mu_N \sigma_X^2 + \sigma_N^2 \mu_X^2. \]

For example,

• If N follows a Poisson distribution with parameter λ, then S is called compound Poisson and
\[ \mu_S = \lambda\mu_X, \quad \text{and} \quad \sigma_S^2 = \lambda E[X^2]. \]

• If N follows a binomial distribution with parameters n and p, then S is called compound binomial and
\[ \mu_S = np\,\mu_X, \quad \text{and} \quad \sigma_S^2 = npE[X^2] - np^2\mu_X^2. \]

• If N follows a negative binomial distribution with parameters k and p, then S is called compound negative binomial and
\[ \mu_S = \frac{k(1-p)}{p}\,\mu_X, \quad \text{and} \quad \sigma_S^2 = \frac{k(1-p)}{p}E[X^2] + \frac{k(1-p)^2}{p^2}\,\mu_X^2. \]

Under proportional reinsurance with retention level α, the i-th individual


claim amount for the insurer is Yi = αXi . Under the excess of loss reinsurance

with retention level M , it is Yi = min(Xi , M ). In both cases, the aggregate
claim for the insurer is

SI = Y1 + Y2 + · · · + YN ,

and all the formulas above work with Yi in place of Xi .


For the reinsurer, the individual claim amount is Zi = Xi − Yi = (1 − α)Xi
under proportional reinsurance and Zi = Xi − Yi = max(0, Xi − M ) under
the excess of loss reinsurance. In both cases, the aggregate claim for the
reinsurer is
SR = Z1 + Z2 + · · · + ZN ,
and all the formulas above work with Zi in place of Xi .
In the individual risk model, the aggregate claim is
\[ S = Y_1 + Y_2 + \cdots + Y_n, \]
where Yj = Nj Xj, Nj ∼ Bin(1, qj) is the number of claims from the j-th policy, and Xj is the individual claim amount with mean µj and variance σj^2. Then
\[ E[S] = \sum_{j=1}^n q_j\mu_j, \quad \text{and} \quad Var[S] = \sum_{j=1}^n (q_j\sigma_j^2 + q_j(1-q_j)\mu_j^2). \]

3.9 Questions

1. Assume that the number N of claims can be any integer from 1 to


100 with equal chances, and the claim sizes X1 , . . . , XN are i.i.d. from
Pareto distribution with parameters α = 3 and λ = 2. Estimate mean
and variance of the aggregate claim S.

2. The number of claims insurance company receives in April follows Pois-


son distribution with λ = 457
, the sizes of claims are i.i.d. and follow a
uniform distribution on [1, 000; 2, 000].
(a) Estimate the probability that the company will receive at least 3
claims in April.
(b) Estimate the mean and variance for the total size of all April’s
claims.
(c) Estimate the probability that the total size of all April’s claims will
be strictly less than 3, 000.

3. The claim size X follows a log-normal distribution with parameters µ and σ, where σ is known but µ is not. Instead, we model µ as another random variable such that λ = e^{µ+σ^2/2} has mean p and variance s^2. Estimate the mean and variance of X.

4. The number N of claims to be received by an insurance company next year follows a negative binomial distribution with parameters k = 20 and p = 0.25. The claim sizes X1, . . . , XN are i.i.d. and follow an exponential distribution with parameter λ = 0.005. Assuming that the aggregate claim size S is approximately normally distributed, estimate the probability that S will not exceed 20,000.

Chapter 4
Tails and dependence analysis of claims distributions
4.1 How likely are very large claims to occur?
Low frequency events involving large losses can have a devastating impact
on companies and investment funds. The financial crisis that started in 2007
was an example of this. It generated more extreme movements in share prices
than had been seen for over 20 years previously.
So, it is important to ensure that we model the form of the distribution
in the tails correctly. However, the low frequency of these events also means
that there is relatively little data to model their effects accurately.
Many types of financial data tend to be much more narrowly peaked in
the centre of the distribution and to have fatter tails than the normal dis-
tribution. This shape of distribution is known as leptokurtic. For example,
when share prices are modelled, large price movements occur more frequently
than predicted by the normal distribution. So, the normal distribution may
be unsuitable for modelling the large movements in the tails. One reason
for these fat tails is that the volatility of financial variables does not re-
main constant but varies stochastically over time. This property is known as
heteroscedasticity.
Even if we select an appropriate form of fat-tailed distribution, if we
attempt to fit the distribution using the whole of our dataset, this is unlikely
to result in a good model for the tails, since the parameter estimates will be
heavily influenced by the main bulk of the data in the central part of the
distribution.
Fortunately, better modelling of the tails of the data can be done through
the application of extreme value theory. The key idea of extreme value theory
is that the asymptotic behaviour of the tails of most distributions can be
accurately described by certain families of distributions. More specifically,
the maximum values of a distribution (when appropriately standardised)
and the values exceeding a specified threshold (called threshold exceedances)
converge to two families of distributions as the sample size increases.
There are a number of measures we can use to quantify the tail weight
of a particular distribution, that is, how likely very large values are to occur.
Depending on the context, an exponential, normal or log-normal distribution
may be a suitable baseline to use for comparison.

• Existence of moments.

Recall that the k-th moment of a continuous positive-valued distribution with density function f(x) is

Mk = ∫₀^∞ x^k f(x) dx.

For some distributions, like normal distribution or Gamma distribution,


the density f (x) decreases so fast that this integral exists and is finite for
every k. In other words, the k-th moment exists for all values of k. This
is an indication of a relatively light tail, meaning that large claims are
unlikely to happen.
However, for some distributions, the integral ∫₀^∞ x^k f(x) dx does not converge (becomes infinite) for all k greater than or equal to some value k₀. In this case, we say that the k-th moment does not exist for k ≥ k₀.
For example, for the Pareto distribution with density function (9) the
k-th moment only exists when k < α. So a Pareto distribution with
a low value of the parameter α has a much thicker tail. If claims fol-
low this distribution, very large claims may occur with non-negligible
probability.
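
The divergence of the moments is easy to observe numerically. The following sketch, assuming the Pareto density f(x) = αλ^α/(λ + x)^{α+1} from (9) with illustrative parameters α = 3, λ = 2, computes truncated moment integrals: they stabilise as the truncation point grows when k < α, but keep growing without bound when k ≥ α.

    from scipy import integrate

    alpha, lam = 3.0, 2.0   # illustrative Pareto parameters
    density = lambda x: alpha * lam**alpha / (lam + x)**(alpha + 1)

    for k in (1, 2, 3, 4):
        for upper in (1e2, 1e4, 1e6):
            # truncated k-th moment: converges for k < alpha, diverges for k >= alpha
            val, _ = integrate.quad(lambda x: x**k * density(x), 0, upper, limit=200)
            print(f"k={k}, truncated at {upper:.0e}: {val:.3f}")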
• Limiting density ratios.
We can compare the thickness of the tail of two distributions by cal-
culating the relative values of their density functions at the far end of
the upper tail. For example, if we compare the Pareto distributions (9)
with parameters α = 2 and α = 3 both with the same value of λ, we
find that:
lim_{x→∞} fα=2(x)/fα=3(x) = lim_{x→∞} [2λ²/(λ + x)³ : 3λ³/(λ + x)⁴] = lim_{x→∞} (2/(3λ))(λ + x) = ∞.
This confirms that the distribution with α = 2 has a much thicker tail.
If we compare the gamma distribution with the Pareto distribution,
we find that the presence of the exponential factor in the gamma den-
sity results in a limiting density ratio of zero, which confirms that the
gamma distribution has a lighter tail.
• Hazard rate.
The hazard rate of a distribution with density function f (x) and cu-
mulative distribution function F (x) is defined as:
h(x) = f(x)/(1 − F(x))

The hazard rate is the analogue of the force of mortality which you study in a parallel module. If the force of mortality increases as a person's age
increases, relatively few people will live to old age (corresponding to a
light tail). If, on the other hand, the force of mortality decreases as the
person’s age increases, there is the potential to live to a very old age
(corresponding to a heavier tail).
For example, for the exponential distribution with parameter λ > 0 we have

f(x) = λe^{−λx}, and F(x) = 1 − e^{−λx}, x > 0,

hence the hazard rate

h(x) = λe^{−λx}/(1 − (1 − e^{−λx})) = λ
is a constant. The exponential distribution corresponds to the α = 1 case of the gamma distribution (8). Numerical calculations show that, for the gamma distribution, the hazard rate is decreasing if α < 1 (which indicates a heavier tail than that of the exponential distribution) and increasing if α > 1 (which indicates a lighter tail than that of the exponential distribution).
For the Pareto distribution (9), we find that the hazard rate is always a
decreasing function of x (see the end of chapter question for the proof),
confirming that it has a heavy tail.
• Mean residual life.
The mean residual life of a distribution with density function f (x) and
cumulative distribution function F (x) is defined as:
e(x) = ∫ₓ^∞ (y − x) f(y) dy / ∫ₓ^∞ f(y) dy = ∫ₓ^∞ (1 − F(y)) dy / (1 − F(x)).

Again, we can interpret this in terms of mortality as the expected future


lifetime. If the expected future lifetime decreases with age, relatively
few people will live to old age (corresponding to a light tail), but if it
increases, there is the potential to live to a very old age (corresponding
to a heavier tail).
For the gamma distribution, we find that, if α = 1 (exponential dis-
tribution), the mean residual life is constant, but if α < 1, it is an
increasing function of x (indicating a heavier tail than the exponential
distribution) and if α > 1, it is a decreasing function (indicating a
lighter tail than the exponential distribution).

For the Pareto distribution, we find that the mean residual life is an
increasing function of x (see the end of chapter question for the proof),
confirming that it has a heavy tail.

4.2 The distribution of large claims


As an alternative to focusing on “heavy” or “light” tail, we can consider the
distribution of all the values of random variable X that exceed some (large)
specified threshold, e.g. all claims exceeding 1 million pounds. As we will
see, for large samples and large thresholds, the distribution of these extreme
values converges to the Generalised Pareto Distribution. This enables us to
model the tail of a distribution by selecting a suitably high threshold and
then fitting a generalised Pareto distribution to the observed values in excess
of that threshold.
Let X be a random variable with cumulative distribution function F ,
u ∈ R be a threshold, and let

Fu (x) = P (X − u ≤ x|X > u)

denote the conditional probability that the threshold exceedance X − u is


at most x under the condition that the claim X exceeded the threshold u.
Then
Fu(x) = P(X ≤ x + u, X > u)/P(X > u) = (F(x + u) − F(u))/(1 − F(u)).
For example, if the individual losses are distributed exponentially with
F (x) = 1 − e−λx , we have:

Fu(x) = [(1 − e^{−λ(x+u)}) − (1 − e^{−λu})] / [1 − (1 − e^{−λu})] = [e^{−λu} − e^{−λ(x+u)}] / e^{−λu} = 1 − e^{−λx} = F(x)
So, in this case, the threshold exceedances follow the same exponential dis-
tribution as X, irrespective of the threshold we choose.
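
This memoryless property is easy to confirm by simulation; a minimal sketch with arbitrary λ and threshold u:

    import numpy as np

    rng = np.random.default_rng(0)
    lam, u = 0.5, 4.0                 # arbitrary rate and threshold
    x = rng.exponential(1 / lam, size=1_000_000)

    exc = x[x > u] - u                # threshold exceedances X - u, given X > u
    # if F_u = F, the exceedances are again Exponential(lam)
    print("mean exceedance:", exc.mean(), " vs 1/lam =", 1 / lam)
    print("P(exceedance > 1):", (exc > 1).mean(), " vs e^{-lam} =", np.exp(-lam))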
In general, there is a theorem stating that, whatever the underlying dis-
tribution F is, the distribution of the threshold exceedances will converge to
a Generalised Pareto distribution as the threshold u increases, that is,

lim Fu (x) = G(x),


u→∞

where G(x) belongs to a two-parameter family of Generalised Pareto distributions

G(x) = 1 − (1 + x/(γβ))^{−γ} if γ ≠ 0, and G(x) = 1 − exp(−x/β) if γ = 0.

The parameter β is called the scale parameter, while γ is called the shape parameter. When γ = 0, G(x) reduces to the exponential distribution with parameter λ = 1/β.

4.3 The distribution of maximal claim


In the previous section we studied the tails of an individual random variable. In this section we assume that we have a sequence of claim sizes X1, X2, . . . , Xn, . . . , which are independent and identically distributed (i.i.d.), and we are interested in estimating the maximal claim out of the first n:

Mn = max{X1 , X2 , . . . , Xn }.

This problem is of significant practical interest. For example, if a company receives on average 3 claims per day, it might expect to receive about 3 · 365 = 1095 claims per year, and then M1095 gives an idea of how large a maximal claim it should expect in a year.
For any real x,

P(Mn ≤ x) = P(X1 ≤ x, X2 ≤ x, . . . , Xn ≤ x) = P(X1 ≤ x)P(X2 ≤ x) · · · P(Xn ≤ x) = P(X1 ≤ x)^n = [FX(x)]^n,


where we have used that X1, X2, . . . , Xn, . . . are i.i.d. and denoted by FX(x) their
common cdf. This formula can be used to compute the distribution of Mn
directly for small values of n.
However, if n is large, which is often the case in applications, [FX(x)]^n is essentially 0 unless x is such that FX(x) is very close to 1, and for such values of x it is difficult to estimate FX(x).
Fortunately, for large n we can apply the famous Fisher-Tippett-Gnedenko theorem, also known as the "extreme value theorem", which describes the distribution of Mn in the limit as n → ∞. Namely, it states that if a1, a2, . . . , an, . . . and b1, b2, . . . , bn, . . . are two sequences of real numbers such that the limit

F(x) = lim_{n→∞} P((Mn − an)/bn ≤ x) = lim_{n→∞} [FX(an + bn x)]^n    (45)

exists and depends only on x, then F(x) must be of the form

F(x) = exp(−(1 + γ(x − α)/β)^{−1/γ}) if γ ≠ 0, and F(x) = exp(−exp(−(x − α)/β)) if γ = 0,    (46)

where α, β > 0, and γ are some real parameters, and x is such that 1 + γ(x − α)/β > 0. If 1 + γ(x − α)/β ≤ 0, (46) is undefined, and we set F(x) = 0 if γ > 0 and F(x) = 1 if γ < 0. With this convention, F(x) becomes a cdf of some distribution, and this distribution is known as the Generalized Extreme Value (GEV) distribution. The parameter α is called the location parameter, β > 0 is the scale parameter, and γ is known as the shape parameter.

Example 4.1. Assume that X1, X2, . . . , Xn, . . . are i.i.d. r.v.s which follow the exponential distribution with parameter λ. Then, for the limit in (45) to exist, we should choose an = (1/λ) ln n and bn = 1/λ. In this case,

F(x) = e^{−e^{−x}},

which corresponds to (46) with parameters α = γ = 0, β = 1.

The details for this example are given in an end-of-chapter question.
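
The convergence in Example 4.1 can also be observed numerically. The following minimal sketch (sample sizes chosen purely for illustration) simulates maxima of n exponentials, standardises them with an = ln(n)/λ and bn = 1/λ, and compares the empirical cdf of the result with the Gumbel cdf exp(−e^{−x}).

    import numpy as np

    rng = np.random.default_rng(0)
    lam, n, n_sims = 1.0, 1_000, 10_000

    M = rng.exponential(1 / lam, size=(n_sims, n)).max(axis=1)
    Z = (M - np.log(n) / lam) * lam   # standardised maxima (M_n - a_n)/b_n

    for x in (-1.0, 0.0, 1.0, 2.0):
        print(f"x={x}: empirical {np.mean(Z <= x):.4f}, "
              f"Gumbel {np.exp(-np.exp(-x)):.4f}")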

The parameters α and β just rescale (shift and stretch) the GEV dis-
tribution (46), in a similar way as changing mean and standard deviation
shifts and stretches the normal distribution. The parameter γ determines
the overall shape of the distribution (analogous to the skewness) and its sign
(positive, negative or zero) results in three different shaped distributions.
• When γ < 0, the GEV distribution (46) reduces to

F(x) = exp(−(1 + γ(x − α)/β)^{−1/γ}) if x < α − β/γ, and F(x) = 1 if x ≥ α − β/γ,

and it is known as the Weibull distribution. For example, if the un-


derlying distribution (the distribution FX in (45)) follows the uniform
distribution, or beta distribution, or triangular distribution, then the
extreme value distribution (45) has the form (46) with γ < 0, that is,
it is a Weibull distribution.
• When γ = 0, the GEV distribution reduces to

F(x) = exp(−exp(−(x − α)/β)),
and it is known as the Gumbel distribution. As demonstrated in Exam-
ple 4.1, the Gumbel distribution arises as the extreme value distribution
when the underlying distribution (the distribution of individual claims)
is exponential. Similarly, one can show that if the underlying distribu-
tion (the distribution FX of Xi in (45)) is Chi-square, or Gamma, or
Log-normal, or Normal, or Weibull, then the extreme value distribution
(45) has the form (46) with γ = 0, that is, it is a Gumbel distribution.

• When γ > 0, the GEV distribution (46) reduces to

F(x) = 0 if x ≤ α − β/γ, and F(x) = exp(−(1 + γ(x − α)/β)^{−1/γ}) if x > α − β/γ,

and it is known as the Frechet distribution. For example, if the underlying distribution (the distribution FX in (45)) is the Burr distribution, or F-distribution, or Log-gamma, or Pareto, or t-distribution, then the extreme value distribution (45) has the form (46) with γ > 0, that is, it is a Frechet distribution.

If we know the form of the underlying distribution, it is possible to work


out the limiting distribution of the maximum value. We can then use the
appropriate member of the GEV family to model the tail of the distribution.
The underlying distribution will determine which of the three different types
of GEV distribution will arise. Mathematicians have determined criteria that
can be used to predict which family a distribution belongs to. As a rough
guide:

• Underlying distributions that have finite upper limits (e.g. the uniform
distribution) are of the Weibull type (which also has a finite upper
limit).

• “Light tail” distributions that have finite moments of all orders (e.g.
exponential, normal, log-normal) are typically of the Gumbel type.

• “Heavy tail” distributions whose higher moments can be infinite are of


the Frechet type.

4.4 Dependence, correlation, and concordance


In the previous section we assumed that the claim sizes X1, X2, . . . , Xn, . . . are independent. However, this is not always an adequate and realistic assumption.
Imagine that an insurance company sells policies against fires. Most claims come from small fires (say, affecting just one room in a house), and the probability of a large claim, arising from a fire which completely destroys the house, is estimated as about 10^{−3} for every individual policy. The company sold 10 such policies for houses on the same street, and has reserved enough capital to pay at most 9 large claims. If all 10 policies make a large claim, the company becomes bankrupt. However, because the

probability of one claim is 10^{−3}, the company estimated that the chance that all 10 claims happen is about (10^{−3})^{10} = 10^{−30}. This is so tiny that it can be ignored, and the company felt completely safe.
After several months, a large fire happened in one of the houses. The fire quickly spread to the neighbouring houses, destroying them all. All 10 houses were affected, the company received 10 large claims and became bankrupt. The company's mistake is that the formula (10^{−3})^{10} = 10^{−30} works only if all fires are independent. However, they are not. Because the houses were on one street, a fire in one of them could cause fires in the others. Even if the houses were on different streets, a fire in one of them could be the result of, for example, extremely hot weather, which could also increase the chance of another fire. So, understanding the dependency between different events and random variables is of fundamental importance.
A popular way to calculate “how much” two random variables are depen-
dent, or correlated, is the Pearson correlation coefficient

Corr(X, Y) := Cov(X, Y)/(√Var(X) · √Var(Y)),

where Cov(X, Y ) = E[XY ] − E[X]E[Y ]. In particular, if X and Y are


independent, then Cov(X, Y ) = 0 and therefore Corr(X, Y ) = 0.
However, one disadvantage of Corr(X, Y ) is that the converse does not
hold: we may have Corr(X, Y) = 0 even if X and Y depend on each
other in a strong way. For example, assume that there are 3 equiprobable
possibilities:
X = −3 and Y = 1
or
X = 1 and Y = −5
or
X = 2 and Y = 4
We denote these scenarios as w1, w2, w3, respectively. Then E[X] = (1/3)(−3 + 1 + 2) = 0, E[Y] = (1/3)(1 − 5 + 4) = 0, and

E[XY] = (1/3)(−3 · 1 + 1 · (−5) + 2 · 4) = 0,
hence Corr(X, Y) = 0. However, the random variables X and Y clearly depend on each other. For example, if we know that X = 1 then we can conclude that Y = −5.
In applications, it is crucial to understand whether the dependence is
direct (the higher X the higher Y ) or converse (the higher X, the lower Y ).

Let us think about this in our example. Consider 2 scenarios, w1 and w2 .
We see that X is higher in scenario w2 while Y is higher in w1 . So, in this
case we have converse dependence (the higher X, the lower Y ). We call such
pair of scenario “discordant”.
Now, consider other pair of scenarios, w1 and w3 . In this case, X is higher
in scenario w3 , and Y is higher in w3 as well. So, we have “the higher X the
higher Y ” case. We call such pair of scenario “concordant”.
Similarly, pair of scenarios w2 and w3 is “concordant”. Because there are
more concordant pairs than discordant ones, we conclude that the depen-
dence between X and Y is (mostly) “direct”.
In the general case, assume that there are n scenarios on which the random variables X and Y assume different values. Then we can consider all pairs of scenarios (there are n(n − 1)/2 pairs), and count how many concordant and discordant pairs there are. If there are C concordant pairs and D discordant ones, the ratio

τ = (C − D)/(C + D) = (C − D)/(n(n − 1)/2)
is called the Kendall coefficient of concordance. If τ > 0, this means that the dependence "the higher X the higher Y" is observed, and the closer τ is to its maximal value 1, the stronger this dependence is. Conversely, if τ < 0, then the dependence "the higher X the lower Y" is observed, and the closer τ is to its minimal value −1, the stronger is this dependence.
In the special example considered above, n = 3, n(n − 1)/2 = 3, C = 2, D = 1, and

τ = (2 − 1)/3 = 1/3.
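
For illustration, here is a short sketch computing C, D and τ for the three scenarios above; a pair of scenarios is concordant when the differences in X and in Y have the same sign.

    from itertools import combinations

    scenarios = [(-3, 1), (1, -5), (2, 4)]   # the (X, Y) values in w1, w2, w3

    C = D = 0
    for (x1, y1), (x2, y2) in combinations(scenarios, 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            C += 1      # concordant pair
        elif s < 0:
            D += 1      # discordant pair

    tau = (C - D) / (C + D)
    print(C, D, tau)    # prints 2 1 0.333...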
There are other ways to measure concordance, for example, Spearman
coefficient (which we will not define here). If X and Y are independent, then
their concordance is 0. However, as with Pearson correlation coefficient, the
problem is in the converse direction: we may have the Kendall (or Spearman) coefficient equal to 0 even if the random variables are dependent. This just means that "positive" and "negative" dependences happen equally often and on average compensate each other.

4.5 Joint distributions and copulas


To fully understand if random variables X and Y are dependent or not,
and if so, how they are dependent, we need to look at their joint distribution
function. Recall that for random variable (r.v.) X its cumulative distribution
function (cdf) is FX (x) = P [X ≤ x], while for pair of r.v.s X and Y their

joint cdf is
FX,Y (x, y) = P [X ≤ x, Y ≤ y].
If X and Y are independent, then

FX,Y (x, y) = P [X ≤ x, Y ≤ y] = P [X ≤ x]·P [Y ≤ y] = FX (x)·FY (y). (47)

In the context of joint distribution functions, the individual distribution of


each of the variables in isolation is known as its marginal distributions. (47)
states that, for independent r.v.s, their joint distribution is the product of
marginal distributions.
In contrast, imagine X and Y are dependent in the strongest possible
way: the higher X, the higher Y (for all pairs of scenarios!). This means
that Y = f (X) for some strictly increasing function f . Then

FX,Y (x, y) = P [X ≤ x, Y ≤ y] = P [f (X) ≤ f (x), Y ≤ y] = P [Y ≤ min(f (x), y)] =

= min(P [Y ≤ f (x)], P [Y ≤ y]) = min(P [X ≤ x], P [Y ≤ y]) = min(FX (x), FY (y)).


Hence,
FX,Y (x, y) = min(FX (x), FY (y)). (48)
Now imagine the converse dependence: the higher X, the lower Y , that is,
Y = f (X) for some strictly decreasing function f . Then

FX,Y (x, y) = P [X ≤ x, Y ≤ y] = P [f (X) ≥ f (x), Y ≤ y] = P [f (x) ≤ Y ≤ y]

= max(P [Y ≤ y] − P [Y ≤ f (x)], 0) = max(P [Y ≤ y] − P [X ≥ x], 0).


Hence,
FX,Y (x, y) = max(FY (y) + FX (x) − 1, 0). (49)
We can see that in all cases FX,Y (x, y) is some function of FX (x) and
FY (y):
FX,Y (x, y) = C(FX (x), FY (y)), (50)
where C(u, v) = u · v in (47), C(u, v) = min(u, v) in (48), and C(u, v) =
max(u + v − 1, 0) in (49).
This is not a coincidence. The famous Sklar's theorem states that, for any pair of r.v.s X and Y, (50) holds for some function C. This function is called a copula, and represents the way X and Y depend on each other. Specifically, the copula shows how the joint distribution depends on the marginal distributions.
In particular,

• C(u, v) = u · v is called independence copula, which represents no de-


pendence whatsoever.

• C(u, v) = min(u, v) is called co-monotonic (or minimum) copula, and
represents the dependence in the form “the higher X, the higher Y ”.

• C(u, v) = max(u + v − 1, 0) is called counter-monotonic, or maximum


copula, which represents the dependence of the form “the higher X,
the lower Y ”.

Three copulas listed above are called fundamental copulas. They are
specific cases of a more general family of copulas called Frechet-Hoeffding
copulas.
Another important example is:

• Gaussian copula

C(u, v) = Φρ (Φ−1 (u), Φ−1 (v)),

where Φ is the distribution function of the standard normal distribution


and Φρ is the distribution function of a bivariate normal distribution
with correlation ρ. Applying this Gaussian copula to normal marginal
distributions will result in a bivariate normal distribution with corre-
lation ρ. The Gaussian copula can be written in the integral form

C(u, v) = ∫_{−∞}^{Φ⁻¹(u)} ∫_{−∞}^{Φ⁻¹(v)} (1/(2π√(1 − ρ²))) exp(−(s² + t² − 2ρst)/(2(1 − ρ²))) ds dt,

or equivalently,

C(u, v) = ∫₀^u Φ((Φ⁻¹(v) − ρΦ⁻¹(t))/√(1 − ρ²)) dt.

Many of the commonly used copulas are special cases of the important
family of copulas which is called the Archimedean family. Let ψ : (0, 1] →
[0, ∞) be a continuous, strictly decreasing, convex function with ψ(1) = 0. Properties of ψ imply that it has an inverse function ψ⁻¹ : [0, L) → (0, 1], where L = lim_{t→0+} ψ(t). If L < ∞, we also define, by convention, ψ⁻¹(x) = 0 for x ≥ L, so that ψ⁻¹ is defined everywhere on [0, ∞).
Then the Archimedean family of copulas are all copulas of the form

C(u, v) = ψ −1 (ψ(u) + ψ(v)). (51)

Several examples are presented below.

• For
ψ(t) = − ln t, 0 < t ≤ 1,
the C(u, v) in (51) reduces to

C(u, v) = uv.

Hence, in this case C(u, v) is the independence copula.

• For
ψ(t) = (− ln t)α , 0 < t ≤ 1,
where α ≥ 1 is a parameter, the C(u, v) in (51) reduces to
C(u, v) = exp{−((− ln u)^α + (− ln v)^α)^{1/α}}.

This copula is known as the Gumbel copula.

• For
ψ(t) = − ln((e^{−αt} − 1)/(e^{−α} − 1)), 0 < t ≤ 1,
where α 6= 0 is a parameter, the C(u, v) in (51) reduces to

(e−αu − 1)(e−αv − 1)
 
1
C(u, v) = − ln 1 + .
α (e−α − 1)
This copula is known as the Frank copula.

• For
ψ(t) = (1/α)(t^{−α} − 1), 0 < t ≤ 1,
where α 6= 0 is a parameter, the C(u, v) in (51) reduces to

C(u, v) = (u^{−α} + v^{−α} − 1)^{−1/α}.

This copula is known as the Clayton copula.

See the end of chapter questions for the details of all calculations.
In all cases, the value of parameter α represents the strength of the de-
pendency between the variables.
Sklar’s theorem also works for any number of random variables. It states
that the joint distribution of n random variables is always a function of
individual cumulative distribution functions:

F (x1 , x2 , . . . , xn ) = C(F1 (x1 ), . . . , Fn (xn )),

for some function C which depends on n variables, and is called the (n-
variable) copula. For example,

C(x1 , x2 , . . . , xn ) = x1 · x2 · · · · · xn

implies that all n random variables are jointly independent, while

C(x1 , x2 , . . . , xn ) = min(x1 , x2 , . . . , xn )

is the n-variable version of the co-monotonic copula, and corresponds to the


case when all variables are directly dependent (the higher is one variable, the
higher are all).
The Archimedean family (51) can also be extended to the n-variable case in a straightforward way:

C(x1 , x2 , . . . , xn ) = ψ −1 (ψ(x1 ) + ψ(x2 ) + · · · + ψ(xn )), (52)

where ψ : (0, 1] → [0, ∞) is a function satisfying the same conditions as


above (continuous, strictly decreasing, convex function with ψ(1) = 0). Substituting different examples of ψ in (52), one can produce as many examples of n-variable copulas as desired.
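
For illustration, the following sketch builds a two-variable copula directly from a generator ψ via (51) and checks it against the closed-form Clayton copula; α = 2 is an arbitrary choice.

    # build C(u, v) = psi_inv(psi(u) + psi(v)) from the Clayton generator
    alpha = 2.0
    psi = lambda t: (t**(-alpha) - 1) / alpha
    psi_inv = lambda x: (1 + alpha * x)**(-1 / alpha)

    archimedean = lambda u, v: psi_inv(psi(u) + psi(v))
    clayton = lambda u, v: (u**(-alpha) + v**(-alpha) - 1)**(-1 / alpha)

    for u, v in [(0.3, 0.7), (0.5, 0.5), (0.9, 0.2)]:
        print(archimedean(u, v), clayton(u, v))  # the two values agree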

4.6 Dependence of distribution tails


It is often the case in insurance and investment applications that large losses
tend to occur together. For example, a hurricane could result in large losses
on several different property insurance policies sold by the same company, or
a stock market crash could lead to large losses on a number of investments
in the same investment portfolio. So, the relationships between the variables
at the extremes of the distributions, i.e. in the upper and lower tails, are
of particular importance. These can be measured using the coefficients of
upper and lower tail dependence, which can be calculated directly from the
copula function.
The coefficient of lower tail dependence is defined as:

λL = lim_{u→0+} P(X ≤ FX⁻¹(u) | Y ≤ FY⁻¹(u)) = lim_{u→0+} C(u, u)/u.
To define the upper tail dependence, we need to look at the opposite end
of the marginal distributions. For any r.v. X, let

F̄X (x) = P [X > x] = 1 − P [X ≤ x] = 1 − FX (x).

Then for any two r.v.s X and Y and any two numbers x and y, we have

P[X > x, Y > y] = C̄(F̄X(x), F̄Y(y)),

for some function C̄ which is called a survival copula function.


By the principle of inclusion/exclusion, we have:

1−P [X > x, Y > y] = P [X ≤ x or Y ≤ y] = P [X ≤ x]+P [Y ≤ y]−P [X ≤ x, Y ≤ y],

hence,

1 − C̄(F̄X(x), F̄Y(y)) = FX(x) + FY(y) − C(FX(x), FY(y)),

or with notation u = FX (x), v = FY (y),

C̄(1 − u, 1 − v) = 1 − u − v + C(u, v).

The last equation allows us to easily compute the survival copula C̄ if we know the (usual) copula C, and vice versa.
We can then define the coefficient of upper tail dependence as:

λU = lim_{u→1−} P(X > FX⁻¹(u) | Y > FY⁻¹(u)) = lim_{u→0+} C̄(u, u)/u.
The tail dependence can take values between 0 (no dependence) and 1
(full dependence).
Different copulas result in different levels of tail dependence. For example,
the Frank copula and the Gaussian copula have zero dependence in both tails,
while the Gumbel copula with parameter α has zero lower tail dependence
but upper tail dependence of 2 − 21/α . The Clayton copula, on the other
hand, has zero upper tail dependence but lower tail dependence of 2−1/α .
See the end of chapter question for the details of all calculations.
Hence, the Gumbel copula, with an appropriate value for the parameter
α, might be a suitable copula to use when modelling large general insurance
claims resulting from a common underlying cause.
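
These tail-dependence values are easy to check numerically. A minimal sketch for the Gumbel copula with the arbitrary choice α = 2, using C̄(u, v) = u + v − 1 + C(1 − u, 1 − v):

    import math

    alpha = 2.0
    gumbel = lambda u, v: math.exp(-(((-math.log(u))**alpha
                                      + (-math.log(v))**alpha)**(1 / alpha)))
    survival = lambda u, v: u + v - 1 + gumbel(1 - u, 1 - v)

    for u in (1e-2, 1e-4, 1e-6):
        print(u, survival(u, u) / u)       # approaches 2 - 2**(1/alpha)
    print("limit 2 - 2**(1/alpha) =", 2 - 2**(1 / alpha))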

4.7 Summary
The heavier the tail of the claim distribution, the more likely very large claims are to occur. We can quantify how heavy the tails are by analysing the following:
• Existence of moments Mk = ∫₀^∞ x^k f(x) dx, where f(x) is a density of a non-negative r.v.

• Limiting density ratio: lim_{x→∞} f(x)/g(x), where f(x) and g(x) are two densities.

• Hazard rate h(x) = f(x)/(1 − F(x)), where f(x) is the density and F(x) is the cdf.

• Mean residual life e(x) = ∫ₓ^∞ (y − x) f(y) dy / ∫ₓ^∞ f(y) dy.

Let X be a random variable with cumulative distribution function F ,


u ∈ R be a threshold, and let

Fu(x) = P(X − u ≤ x | X > u) = (F(x + u) − F(u))/(1 − F(u)).

Then
lim Fu (x) = G(x),
u→∞

where G(x) belongs to a two-parameter family of Generalised Pareto distri-


butions.
Let

Mn = max{X1, X2, . . . , Xn},

where X1, X2, . . . , Xn are i.i.d. random variables. The extreme value theorem states that if a1, a2, . . . , an, . . . and b1, b2, . . . , bn, . . . are two sequences of real numbers such that the limit

F(x) = lim_{n→∞} P((Mn − an)/bn ≤ x)

exists and depends only on x, then F (x) must follow the Generalized Extreme
Value (GEV) distribution. It has 3 parameters: location parameter α, scale
parameter β > 0, and shape parameter γ.
If γ < 0, γ = 0, and γ > 0, the GEV distribution reduces to the Weibull distribution, the Gumbel distribution, and the Frechet distribution, respectively.
For 2 random variables X and Y, the "concordance" measures to what extent we have direct dependence of the form "the higher X the higher Y" (this corresponds to positive concordance), and to what extent we have the opposite dependence "the higher X the lower Y" (corresponding to negative concordance). If X and Y are independent, the concordance is 0.
There are several ways to measure concordance; one of them is the Kendall coefficient

τ = (C − D)/(C + D),

where C and D are the numbers of concordant and discordant pairs of scenarios, respectively.
The joint cdf F of n random variables can be written in the form

F (x1 , x2 , . . . , xn ) = C(F1 (x1 ), F2 (x2 ), . . . , Fn (xn )),

where Fi (xi ) are cdfs of individual random variables (known as marginal


distributions), and function C is called a copula and represents the way
random variables depend on each other.
The Archimedean family is the family of copulas in the form:

C(x1 , x2 , . . . , xn ) = ψ −1 (ψ(x1 ) + ψ(x2 ) + · · · + ψ(xn )),

where ψ : (0, 1] → [0, ∞) is a continuous, strictly decreasing, convex function


with ψ(1) = 0.
For example,

C(x1 , x2 , . . . , xn ) = x1 · x2 · · · · · xn

is the independence copula, which implies that all n random variables are
jointly independent, while

C(x1 , x2 , . . . , xn ) = min(x1 , x2 , . . . , xn )

is the co-monotonic copula, which corresponds to the case when all variables
are directly dependent (the higher is one variable, the higher are all). For
n = 2,
C(x1 , x2 ) = max(x1 + x2 − 1, 0)
is called counter-monotonic copula, which represents the inverse dependence
of the form “the higher one variable, the lower another one”.
The coefficients of lower and upper tail dependence of 2 random variables are

λL = lim_{u→0+} C(u, u)/u, and λU = lim_{u→0+} C̄(u, u)/u,

where C̄ is the survival copula defined by C̄(1 − u, 1 − v) = 1 − u − v + C(u, v).

4.8 Questions

1. (a) Calculate the hazard rate of the Pareto distribution. Check if it is an increasing or decreasing function.
(b) Calculate the mean residual life of the Pareto distribution. Check if it is an increasing or decreasing function.
(c) What conclusion about the tails of the Pareto distribution can we make based on items (a) and (b)?

2. Prove the formula for F(x) in Example 4.1.

3. (a) Check that function ψ(t) = − ln t, 0 < t ≤ 1, is continuous, strictly


decreasing, convex, and ψ(1) = 0;
(b) Find its inverse function ψ −1 ;
(c) Find C(u, v) = ψ −1 (ψ(u) + ψ(v)), see (51).

4. Repeat the previous question for


(a) ψ(t) = (− ln t)α , 0 < t ≤ 1, where α ≥ 1 is a parameter;
(b) ψ(t) = − ln((e^{−αt} − 1)/(e^{−α} − 1)), 0 < t ≤ 1, where α ≠ 0 is a parameter;
(c) ψ(t) = (1/α)(t^{−α} − 1), 0 < t ≤ 1, where α ≠ 0 is a parameter.

5. Calculate the coefficients of lower and upper tail dependence of two


random variables with
(a) the independence copula,
(b) the Gumbel copula with α ≥ 1,
(c) the Frank copula with α ≠ 0,
(d) the Clayton copula with α > 0.

Chapter 5
Markov Chains
The Markov property is used extensively in the Actuarial Mathematics to
develop two-state and multi-state Markov models of mortality and other
decrements. The rest of this course is devoted to a thorough description of
the Markov property in a general context and its applications to actuarial
modelling.

We will distinguish between two types of stochastic process that possess the
Markov property: Markov chains and Markov jump processes. Both have a
discrete state space, but Markov chains have a discrete time set and Markov
jump processes have a continuous time set.
We begin with Markov chains and discuss the mathematical formulation of
such process, leading to one important actuarial application: the no-claims-
discount process used in motor insurance. We then move onto Markov jump
processes.
The practical considerations of applying these models in the Actuarial Math-
ematics will be discussed in detail in later sections. In this chapter we focus
on the mathematical development of Markov models without reference to
their calibration to real data.

5.1 The Markov property


A major simplification to the general stochastic processes discussed in Section
1.9 occurs if the development of a process can be predicted from its current
state, i.e. without any reference to its past history. This is the Markov
property.
In this chapter we are concerned with the stochastic process {Xt } defined on
a state space S and time set t ≥ 0.
The Markov property can be stated mathematically as

P[Xt ∈ A | Xs1 = x1, Xs2 = x2, . . . , Xsn = xn, Xs = x] = P[Xt ∈ A | Xs = x],    (53)

for all s1 < s2 < · · · < sn < s < t in the time set, all states x1, x2, . . . , xn, x in the state space S and all subsets A of S.
Working with subsets A ⊆ S is necessary so that the above definition of
the Markov property covers both the discrete and continuous state spaces.
Recall from Chapter 1 that in the continuous case the probability that Xt is
a particular value is zero, and so it is necessary to work with probabilities of
Xt lying in some subset of the state space in any general definition.
Although we are entirely concerned with discrete state spaces in this chapter,
it is important to realise that the Markov property can be possessed by
general stochastic processes.

For discrete state spaces the Markov property is written as

P [ Xt = a| Xs1 = x1 , Xs2 = x2 , . . . , Xsn = xn , Xs = x] = P [Xt = a|Xs = x], (54)

for all s1 < s2 < · · · < sn < s < t and all states a, x1 , x2 , . . . , xn , x in S.
An important result is that any process with independent increments has the
Markov property.

Example 5.1. Question: Prove that any process with independent incre-
ments has the Markov property.
Answer: We begin with equation (53) and use the fact that Xt = Xt −Xs +x
to introduce an increment

P [ Xt ∈ A| Xs1 = x1 , Xs2 = x2 , . . . , Xsn = xn , Xs = x],


=P [Xt − Xs + x ∈ A| Xs1 = x1 , Xs2 = x2 , . . . , Xsn = xn , Xs = x] ,
=P [Xt − Xs + x ∈ A| Xs = x] = P [Xt ∈ A| Xs = x] ,

the second equality arises from the definition of independent increments and
the fact that x is known.
A Markov process with a discrete state space and a discrete time set is called a Markov chain; these are considered in this chapter. A Markov process with a discrete state space and a continuous time set is called a Markov jump process; these are considered in the next chapter.

5.2 Definition of Markov Chains


A Markov chain is a discrete-time stochastic process with a countable state
space S, obeying the Markov property. It is therefore a sequence of random
variables {Xt } with the property given by equation (54) which we rewrite for
notational convenience as

P [Xn = j| X0 = i0 , X1 = i1 , . . . , Xm−1 = im−1 , Xm = i] = P [Xn = j|Xm = i],

for all integer times n > m and states {i0 , i1 , . . . , im−1 , i, j} ∈ S.


We define the transition probabilities as
 
pij (n, n + 1) = P Xn+1 = j| Xn = i . (55)

Therefore, pij (n, n + 1) is the probability of being in state j at time n + 1,


having been in state i at time n.
For each fixed n = 0, 1, . . . , we can form a matrix of transition probabilities
from time n to the next time step n + 1:
 
P(n, n + 1) = pij (n, n + 1) i,j∈S . (56)

Note that P(n, n+1) is a finite matrix in the case of a finite number of states,
and an infinite matrix in the case of an infinite number of states.

Example 5.2. Question: Consider a no claims discount (NCD) model


for car-insurance premiums. The insurance company offers discounts of 0%,
30% and 60% of the full premium, determined by the following rules:

1. All new policyholders start at the 0% level.

2. If no claim is made during the current year the policyholder moves up


one discount level, or remains at the 60% level.

3. If one or more claims are made the policyholder moves down one level,
or remains at the 0% level.

The insurance company believes that the chance of claiming each year is
independent of the current discount level and has a probability of 1/4. Why
can this process be modeled as a Markov chain? Give the state space and
transition matrix.
Answer: The model can be considered as a Markov chain since the future
discount depends only on the current level, not the entire history. The state
space is S = {0%, 30%, 60%}, which is convenient to denote as S = {0, 1, 2}
(where state 0 is the 0% state, 1 is the 30% state and 2 is the 60% state).
The transition probability matrix between two states in a unit time is given
by

P =
[ 1/4  3/4   0  ]
[ 1/4   0   3/4 ]
[  0   1/4  3/4 ]    (57)

A matrix A is called a stochastic matrix if

1. All its entries are non-negative, and

2. The sum of entries in any row is one.

It is clear that the transition matrix in Example 5.2 is a stochastic matrix


by this definition. More generally, every transition matrix P(n, n + 1) of a
Markov chain is a stochastic matrix. Indeed, allX the transition probabilities
pij (n, n + 1) are by definition non-negative, and pij = 1 for all i since the
j∈S
system must move to some state from any state i.

A clear way of representing Markov chains is by a transition graph. The
states are represented by circles linked by arrows indicating each possible
transition. Next to each arrow is the corresponding transition probability.

Example 5.3. Question: Draw the transition graph for the NCD system
defined in Example 5.2.
Answer: See Figure 1

Figure 1: Transition graph for the NCD system of Example 5.2. Reproduced
with permission of the Faculty and Institute of Actuaries.

5.3 The Chapman-Kolmogorov equations

Equation (55) defines the probabilities of transition over a single time step.
Similarly, the n-step transition probabilities pij (m, m + n) denote the prob-
ability that a process in state i at time m will be in state j at time m + n.
That is:
pij (m, m + n) = P [Xm+n = j| Xm = i] .

The transition probabilities of a Markov process satisfy the system of equa-


tions called the Chapman–Kolmogorov equations
pij(m, n) = ∑_{k∈S} pik(m, l) pkj(l, n),    (58)

for all states i, j ∈ S and all integer times m < l < n. This can be expressed
in terms of n-step stochastic matrices as

P(m, n) = P(m, l)P(l, n),

where P(m, l)P(l, n) is the product of matrices in the usual sense.

Example 5.4. Question: Prove equation (58).
Answer: We use the Markov property and the law of total probability.

pij(n, m) = P(Xm = j | Xn = i)
= ∑_{k∈S} P(Xm = j, Xl = k | Xn = i)
= ∑_{k∈S} P(Xm = j | Xl = k, Xn = i) P(Xl = k | Xn = i)
= ∑_{k∈S} P(Xm = j | Xl = k) P(Xl = k | Xn = i),

which is the required result.

The Chapman–Kolmogorov equations provide a method for computing the


n-step transition probabilities from the one-step transition probabilities. The
distribution of a Markov chain is therefore fully determined once the following
are specified:
• The one-step transition probabilities pij (n, n + 1).

• The initial probability distribution αj0 = P (X0 = j0 ).


The probability of any path can then be determined from
P[X0 = j0, X1 = j1, . . . , XN = jN] = αj0 ∏_{n=0}^{N−1} pjn,jn+1(n, n + 1).    (59)

This should be intuitively clear but a formal proof is left as a question at the
end of the chapter.

5.4 Time dependency of Markov chains


For a time-inhomogeneous Markov chain, the transition probabilities pij (t, t + 1)
change with time t. The transition probabilities therefore form a sequence of stochastic matrices denoted by P(t):
 
P(t) = (pij(t, t + 1))_{i,j∈S} =
[ p00(t, t + 1)  p01(t, t + 1)  . . . ]
[ p10(t, t + 1)  p11(t, t + 1)  . . . ]
[      . . .          . . .          ]

The value of t can represent many factors such as time of year, age of pol-
icyholder or the length of time the policy has been in force. For example,

young drivers and very old drivers may have more accidents than middle-
aged drivers and therefore t might represent the age or age group of the
driver purchasing a motor insurance policy.
Although time-inhomogeneous models are important in practical modelling,
a further analysis is beyond the scope of this course.
A Markov chain is called time homogeneous if transition probabilities do not
depend on time. This is a significant simplification to any Markov-chain
model. In particular, for a time-homogeneous Markov chain, equation (59)
becomes
P[Xn = jn, n = 0, 1, 2, . . . , N] = P[X0 = j0] ∏_{n=0}^{N−1} pjn jn+1.    (60)

It is therefore clear that the matrix of the n-step transition probabilities is


the n-th power of the matrix of 1-step transition probabilities {pij }:
P[Xn+m = j | Xm = i] := pij^{(n)} = ∑_{k1,k2,...,kn−1} pik1 pk1k2 · · · pkn−2kn−1 pkn−1j  (a product of n terms in each summand).

If we let P(n) denote the n-step transition matrix, then

P(n) = P^n,

where P is the one-step matrix of transition probabilities.

Example 5.5. Question: Calculate the 2-step transition matrix for the
NCD system from Example 5.2 and confirm that it is a stochastic matrix.
Answer: The 1-step transition matrix is given by equation (57) and so we
can compute that

P(2) =
[ 1/4  3/4   0  ]   [ 1/4  3/4   0  ]           [ 4  3   9 ]
[ 1/4   0   3/4 ] × [ 1/4   0   3/4 ]  = (1/16) [ 1  6   9 ]
[  0   1/4  3/4 ]   [  0   1/4  3/4 ]           [ 1  3  12 ].

We note that the two conditions for P(2) to be a stochastic matrix are satis-
fied.
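
The same computation takes a couple of lines with a numerical library; a quick sketch:

    import numpy as np

    P = np.array([[1/4, 3/4, 0],
                  [1/4, 0,   3/4],
                  [0,   1/4, 3/4]])

    P2 = np.linalg.matrix_power(P, 2)    # the 2-step transition matrix
    print(P2 * 16)                       # compare with the matrix above (times 1/16)
    print(P2.sum(axis=1))                # each row sums to 1: stochastic matrix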

Example 5.6. Question: Using the 2-step transition matrix from Exam-
ple 5.5 state the probabilities that

(a) A policyholder initially in the 0%-state is in the 60%-state after 2 years.

(b) A policyholder initially in the 60%-state is in the 30%-state after 2
years.

(c) A policyholder initially in the 0%-state is in the 0%-state after 2 years.

Answer:

(a) Element (P(2) )1,3 gives the required probability, 9/16.


Note that this is consistent with the path 0% → 30% → 60%, i.e. no
claims for two years, therefore the probability is 3/4 × 3/4 = 9/16.

(b) (P(2) )3,2 = 3/16.


Note that this is consistent with the path 60% → 60% → 30%, there-
fore the probability is 3/4 × 1/4 = 3/16.

(c) (P(2) )1,1 = 4/16.


Note that this is consistent with either path 0% → 0% → 0% or 0% →
30% → 0%, therefore the probability is 1/4 × 1/4 + 3/4 × 1/4 = 4/16.

5.5 Further applications


The simple NCD system of Example 5.2 gives a practical example of a time-
homogeneous Markov chain. We now consider three further examples.

5.5.1 The simple (unrestricted) random walk


A simple random walk is a stochastic process {Xt } with state space S = Z
i.e. the integers. The process is defined by

Xn = Y1 + Y2 + · · · + Yn ,

where Y1 , Y2 , . . . are a sequence of i.i.d. Bernoulli variables such that

P (Yi = 1) = p and P (Yi = −1) = 1 − p.

The simple random walk has the Markov property, that is:

P (Xm+n = j| X1 = i1 , X2 = i2 , . . . , Xm = i) ,
= P (Xm + Ym+1 + Ym+2 + · · · + Ym+n = j| X1 = i1 , X2 = i2 , . . . , Xm = i) ,
= P (Ym+1 + Ym+2 + · · · + Ym+n = j − i) ,
= P (Xm+n = j| Xm = i) .

Hence a simple random walk is a time-homogeneous Markov chain with tran-
sition probabilities:

pij = p if j = i + 1,  pij = 1 − p if j = i − 1,  and pij = 0 otherwise.

Since we are considering an unrestricted simple random walk, the transition


graph (Figure 2) and 1-step transition matrix are infinite.
In particular the 1-step transition matrix is the infinite matrix

P =
[ ...   ...   ...                      ]
[ ...    0     p                       ]
[       1−p    0     p                 ]
[             1−p    0     p           ]
[                   1−p    0    ...    ]
[                         ...   ...    ]

with zeros on the main diagonal, p on the superdiagonal, 1 − p on the subdiagonal, and 0 elsewhere.

It is clear that this is a stochastic matrix.

Figure 2: Transition graph for the unrestricted random walk. Reproduced


with permission of the Faculty and Institute of Actuaries.

To determine the n-step transition probabilities, consider moving from state


i to state j in n steps. Let the number of positive steps be r, (that is, r is
the total number of steps where Xi+1 − Xi = 1), and the number of negative
steps be l, (that is, l is the total number of steps where Xi+1 − Xi = −1).
Since there are n steps in total, it follows that r + l = n and that r − l = j − i,
the excess of positive steps over negative steps. Solving these simultaneous

equations for r and l gives
r = (1/2)(n + j − i) and l = (1/2)(n − j + i).
From this we can see that the n-step transition probabilities are

pij^{(n)} = C(n, r) p^{(n+j−i)/2} (1 − p)^{(n−j+i)/2}, with r = (n + j − i)/2,

where C(n, r) = n!/(r!(n − r)!) is the number of possible paths with r positive steps, each of which occurs with probability p^r (1 − p)^{n−r}. The expression arises since the distribution of the number of positive steps in n steps is Binomial with parameters
bution of the number of positive steps in n steps is Binomial with parameters
n and p. Since r and l must be non-negative integers, it follows that both
n + j − i and n − j + i must be non-negative even numbers.
In addition to being time-homogeneous, a simple random walk is spatially-
homogeneous, that is
pij^{(n)} = P(Xn = j | X0 = i) = P(Xn = j + r | X0 = i + r).

A simple random walk with p = q = 1/2, where q = 1 − p, is called a symmetric simple random walk.

5.5.2 The restricted random walk


We introduce the restricted random walk with an example:
A man needs to raise £N to fund a specific project and asks his very rich
friend to accompany him to a casino where he hopes to win this money. The
man plays the following game: A fair coin is tossed. If it lands heads-up
the man wins £1; if it lands tails-up the man loses £1. If he loses all his
money he will borrow £1 from his friend and continue to play until he has
the required £N . Once he has accumulated £N he will stop playing the
game.
The restricted random walk is therefore a simple random walk with boundary
conditions. In this example the boundary conditions are specified at 0 and
N . At N the barrier is an absorbing barrier, while at 0 it is called a reflecting
barrier.
More formally, an absorbing barrier is a value b such that:

P (Xn+s = b| Xn = b) = 1 for all s > 0.

In other words, once state b is reached, the random walk stops and remains
in this state thereafter.
A reflecting barrier is a value c such that:

P (Xn+1 = c + 1| Xn = c) = 1.

In other words, once state c is reached, the random walk is “pushed away”.
A mixed barrier is a value d such that:

P (Xn+1 = d| Xn = d) = α and P (Xn+1 = d + 1| Xn = d) = 1 − α,

for some α ∈ [0, 1]. In other words, once state d is reached, the
random walk remains in this state with probability α or moves to the neigh-
boring state d + 1 with probability 1 − α, i.e. it is an absorbing barrier with
probability α and a reflecting barrier with probability 1 − α.
If, in the example above, the man does not take his rich friend, he will continue to gamble until either his money reaches the target N or he runs out of money. In each case reaching the boundary means that the wealth will remain there forever; the barriers therefore become absorbing barriers.
The transition graph for the general case of a restricted random walk with
two mixed barriers is given in Figure 3. The special cases of reflecting and
absorbing boundary conditions are obtained by taking α or β equal to 0 or
1.

Figure 3: Transition graph for the restricted random walk with mixed bound-
ary conditions. Reproduced with permission of the Faculty and Institute of
Actuaries.

The 1-step transition matrix is given by
 
P =
[  α   1−α                             ]
[ 1−p   0     p                        ]
[      ...   ...   ...                 ]
[            1−p    0     p            ]
[                  1−p    0      p     ]
[                        1−β     β     ]

Note that the matrix is finite, which is in contrast to the transition matrix
for the unrestricted random walk.
The simple NCD model given in Example 5.2 is a practical example of a
restricted random walk.
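
Probabilities for restricted random walks are easy to estimate by simulation. Below is a minimal sketch with both barriers absorbing and a fair coin; N and the starting state are arbitrary choices, and for p = 1/2 the probability of reaching N before 0 from state i is known to be i/N.

    import numpy as np

    rng = np.random.default_rng(0)
    N, start, p, n_sims = 10, 3, 0.5, 20_000

    hits_top = 0
    for _ in range(n_sims):
        x = start
        while 0 < x < N:                      # 0 and N are absorbing barriers
            x += 1 if rng.random() < p else -1
        hits_top += (x == N)

    print("P(reach N before 0):", hits_top / n_sims, " theory:", start / N)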

5.5.3 The modified NCD model


The simple NCD model given in Example 5.2 can be modified with a number
of improvements. One such improvement is to have the following states:
State 0: no discount
State 1: 25% discount
State 2: 40% discount
State 3: 60% discount
The transition rules are as before except that when there is a claim during the
current year, the discount status moves down either two levels if there was a
claim in the previous year, or one level if the previous year was claim-free.
It is clear that the discount status of a policyholder at time n, Xn , does not
form a Markov chain since the future discount status does not only depend
on the current status but also on the previous year’s status.
For example
P [Xn+1 = 1| Xn = 2, Xn−1 = 1] > 0, (61)
whereas
P [Xn+1 = 1| Xn = 2, Xn−1 = 3] = 0. (62)

A Markov chain can be constructed from this non-Markov chain by splitting


state 2 into two states defined as:

• 2+ : 40% discount and no claim in the previous year, that is, the state
corresponding to {Xn = 2, Xn−1 = 1}.

• 2− : 40% discount and claim in the previous year, that is, the state
corresponding to {Xn = 2, Xn−1 = 3}.

Assuming that the probability of making no claims in any year is still 3/4, the
Markov chain on the modified state space S 0 = {0, 1, 2+ , 2− , 3} has transition
graph given by Figure 4, and 1-step transition matrix given by
 
P =
[ 1/4  3/4   0    0    0  ]
[ 1/4   0   3/4   0    0  ]
[  0   1/4   0    0   3/4 ]
[ 1/4   0    0    0   3/4 ]
[  0    0    0   1/4  3/4 ]

Figure 4: Transition graph for the modified NCD process. Reproduced with
permission of the Faculty and Institute of Actuaries.

Note that a policyholder can only be in state 2+ by moving up from state


1; and in state 2− by moving down from state 3. Hence equations (61) and
(62) become
P[Xn+1 = 1 | Xn = 2+] = 1/4,

and

P[Xn+1 = 1 | Xn = 2−] = 0,

respectively. The transition probabilities are now determined by the current


discount status only and the process is Markov.

5.5.4 A model of accident proneness
An insurance company may want to use the whole history of the claims from
a given driver to estimate his/her accident proneness. Let Yi be the number of claims during period i. In the simplest model, we may assume that there can be no more than 1 claim per period, so Yi is either 0 or 1. By time
t a driver has a history of the form Y1 = y1 , Y2 = y2 , . . . , Yt = yt , where
yi ∈ {0, 1}, i = 1, . . . , t. Based on this history, the probability of a future claim can be estimated, say, as

P[Yt+1 = 1 | Y1 = y1, Y2 = y2, . . . , Yt = yt] = f(y1 + y2 + · · · + yt)/g(t),

where f, g are two given increasing functions satisfying 0 ≤ f (m) ≤ g(m), ∀m.
The stochastic process {Yt , t = 0, 1, 2 . . . } does not have the Markov property
(54). However, the cumulative number of claims from the driver, given by
Xt = ∑_{i=1}^t Yi

satisfies (54). Indeed,

P[Xt+1 = 1 + xt | X1 = x1, X2 = x2, . . . , Xt = xt]
= P[Yt+1 = 1 | Y1 = x1, Y2 = x2 − x1, . . . , Yt = xt − xt−1] = f(xt)/g(t),

which does not depend on the past history x1 , x2 , . . . , xt−1 . Thus, {Xt , t =
0, 1, 2 . . . } is a Markov chain.

5.5.5 A model for credit rating dynamics


Credit rating dynamics of a bond is often represented by a Markov chain.
There are finitely many possible ratings, such as AAA, AA, A, BBB, etc.,
with AAA being the highest rating, corresponding to the lowest probability
of default, AA is the next one, and so on. There may be subdivisions inside
any class, like AA (high), AA, and AA (low), etc. In any case, we can
associate a state of Markov chain to every rating, and there is also a special
state (say, D), corresponding to the default. The default state is absorbing,
that is, the transition probability from D to any state i ≠ D is 0.

Example 5.7. Assume that there are just 2 ratings for a bond B: I -
investment grade, and J - junk grade. This can be modelled as a Markov

chain with states I, J, and D - default. Assume that the transition matrix
is given by

P =
[ 0.90  0.05  0.05 ]
[ 0.10  0.80  0.10 ]
[  0     0     1   ]
Assume that bond B returns yearly profit 9% of the investment in state I,
and 10% of the investment in state J. However, in case of default you will
be able to get back only 40% of your investment. Assume also that risk-free
rate (in a bank) is 2%.
In this case, investing capital C in a bond in state I, we will get back
1.09C in case we stay at I or move to J, and 0.4(1.09C) if the process moves
to state D. Hence, our expected profit is 0.90(1.09C)+0.05(1.09C)+0.05(0.4·
1.09C) − C = 0.0573C, that is, 5.73%. This is higher than risk-free rate 2%.
The difference is called premium for risk.
A risk neutral probability measure is an “adjusted” probability measure
such that the expected profit (with respect to this measure) from a bond is
the same as the profit from the bank. The “adjusted” transition probabilities
from state I to states I, J, and D are given by 1−0.1πI , 0.05πI , 0.05πI , where
πI is the adjustment coefficient. By definition of risk neutral probability mea-
sure, (1 − 0.10πI )(1.09C) + 0.05πI (1.09C) + 0.05πI (0.4 · 1.09C) − C = 0.02C,
from which we can find πI . Similarly, “adjusted” transition probabilities from
state J to states I, J, and D are given by 0.10πJ , 1 − 0.20πJ , 0.10πJ , where
πJ can be found from a similar argument (this is left as an exercise).
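
Since the risk-neutral condition above is linear in πI, it can be solved in a couple of lines; a quick sketch:

    # solve (1 - 0.10*pi)*1.09 + 0.05*pi*1.09 + 0.05*pi*(0.4*1.09) - 1 = 0.02
    coeff = (-0.10 + 0.05 + 0.05 * 0.4) * 1.09   # coefficient of pi_I
    pi_I = (1.02 - 1.09) / coeff
    print("pi_I =", pi_I)                        # approx 2.14

    # adjusted (risk-neutral) transition probabilities from state I
    print("I->I:", 1 - 0.10 * pi_I,
          "I->J:", 0.05 * pi_I,
          "I->D:", 0.05 * pi_I)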

5.5.6 General principles of modelling using Markov chains


In this section we summarise the examples above and identify the key step in
modelling real-life situations using Markov chains. For simplicity, we discuss
only time-homogeneous models here.

• Step 1. Setting up the state space: The most natural choice


is usually to identify the state space of the Markov chain with a set
of observations. For example, this is the case for the NCD system
from Example 5.2, where we set up the state space S = {0, 1, 2}
in correspondence with possible discounts {0%, 30%, 60%}. However,
as we saw in the example with the modified NCD system (see section
5.5.3), such natural state space may be not suitable to form a Markov
chain, because the Markov property may fail. In the modified NCD
example, a small modification of the state space allowed us to construct
a Markov chain.

• Step 2. Estimating transition probabilities: Once the state space
is determined, the Markov model must be fitted to the data by estimat-
ing the transition probabilities. In the NCD model (Example 5.2) we
have just claimed that “the company believes that the chance of claim-
ing each year ... has a probability of 1/4”. In practice, however, the
transition probabilities should be estimated from the data. Naturally,
the probability pij of transition from state i to state j should be esti-
mated as number of transitions from i to j, divided by the total number
of transitions from state i. More formally, let x1 , x2 , . . . , xN be a set of
available observations, ni be the number of times t (1 ≤ t ≤ N − 1)
such that xt = i, and nij be the number of times t (1 ≤ t ≤ N − 1) such
that xt = i and xt+1 = j. Then the best estimate for pij is p̂ij = nij/ni, and the 95% confidence interval can be approximated as

(p̂ij − 1.96 √(p̂ij(1 − p̂ij)/ni), p̂ij + 1.96 √(p̂ij(1 − p̂ij)/ni)).

A short numerical sketch of Steps 2 and 3 is given after this list.

This follows from the fact that the conditional distribution of Nij given
Ni is binomial with parameters Ni and pij .
• Step 3. Checking the Markov property: Once the state space and
transition probabilities are found, the model is fully determined. But,
to ensure that the fit of the model to the data is adequate, we need to
check that the Markov property seems to hold. In practice, it is often
considered sufficient to look at triplets of successive observations. For
a set of observations x1 , x2 , . . . , xN , let nijk be the number of times t
(1 ≤ t ≤ N −2) such that xt = i, xt+1 = j, and xt+2 = k. If the Markov
property holds, nijk is an observation from a Binomial distribution with
parameters nij and pjk . An effective test to check this is a χ2 test: the
statistic
X X X (nijk − nij pˆjk )2
X2 =
i j k
nij pˆjk

should approach the χ2 distribution with r = |S|3 degrees of freedom.


For example, if |S| = 4, the statistic X 2 does not exceed the criti-
cal level 83.675 with probability 95%. Thus, exceeding this level is a
strong indication that the Markov property does not hold. The Chi-
square distribution table up to the level r = 1000 can be found, say, at
http://www.medcalc.org/manual/chi-square-table.php
• Step 4. Using the model: Once the model parameters are de-
termined, and Markov property checked, we can use the established

model to estimate different quantities of interest. In particular, we
have used the Markov model for the Example 5.2 to address questions
like “What is the probability that a policyholder initially in the 0%-
state is in the 0%-state after 2 years?” (see Example 5.6). If the Markov
model is too complicated to answer questions of this type analytically,
we can use Monte-Carlo simulation (see chapter 2). Simulating a time-
homogeneous Markov chain is relatively straightforward. In addition
to commercial simulation packages, even standard spreadsheet software
can easily cope with the practical aspects of estimating transition prob-
abilities and performing a simulation.
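
A minimal numerical sketch of Steps 2 and 3 follows; the observation sequence below is hypothetical toy data (a real χ² test would need far more observations), and the states are labelled 0, . . . , |S| − 1 so that they can index the estimated matrix directly.

    import numpy as np
    from collections import Counter

    obs = [0, 1, 2, 2, 1, 0, 0, 1, 2, 2, 2, 1, 2, 2, 1, 0, 1, 2, 2, 2]
    states = sorted(set(obs))

    # Step 2: estimate p_ij = n_ij / n_i from observed one-step transitions
    n_ij = Counter(zip(obs[:-1], obs[1:]))
    n_i = Counter(obs[:-1])
    p_hat = np.array([[n_ij[(i, j)] / n_i[i] for j in states] for i in states])
    print(p_hat)

    i, j = 1, 2   # 95% confidence interval for one entry, p_12
    se = np.sqrt(p_hat[i, j] * (1 - p_hat[i, j]) / n_i[i])
    print("CI:", (p_hat[i, j] - 1.96 * se, p_hat[i, j] + 1.96 * se))

    # Step 3: chi-square statistic over observed triplets
    n_ijk = Counter(zip(obs[:-2], obs[1:-1], obs[2:]))
    m_ij = Counter(zip(obs[:-2], obs[1:-1]))
    X2 = sum((n_ijk[(i, j, k)] - m_ij[(i, j)] * p_hat[j, k])**2
             / (m_ij[(i, j)] * p_hat[j, k])
             for (i, j) in m_ij for k in states if p_hat[j, k] > 0)
    print("X2 =", X2)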

5.6 Stationary distributions


In many cases the distribution of Xn converges to a limit π in the sense that

P (Xn = j| X0 = i) → πj , (63)

and the limit is the same regardless of the starting point.


The distribution {πj }j∈S is said to be a stationary distribution of a Markov
chain with transition matrix P if
1. πj = Σ_{i∈S} πi pij for all j, which can be expressed as π = πP, where π is
   a row vector and πP is the usual vector-matrix product; and

2. πj ≥ 0 for all j and Σ_{j∈S} πj = 1.

The interpretation of π is that, if the initial probability distribution is π, i.e.
πi = P (X0 = i), then at time 1 the probability distribution of X1 is again
given by π. Mathematically,

    P (X1 = j) = Σ_{i∈S} P (X1 = j| X0 = i) P (X0 = i) = Σ_{i∈S} πi pij = πj .

By induction we have that

    P (Xn = j) = Σ_{i∈S} P (Xn = j| Xn−1 = i) P (Xn−1 = i) = Σ_{i∈S} πi pij = πj .
Hence if the initial distribution for a Markov chain is a stationary distribu-
tion, then Xn has the same probability distribution for all n.
A general Markov chain does not necessarily have a stationary probability
distribution, and if it does it need not be unique. For instance, the unre-
stricted random walk discussed in §5.5 has no stationary distribution, and
the uniqueness of the stationary distribution in the restricted random walk
depends on the parameters α and β.
However it is known that a Markov chain with finite state space has at least
one stationary probability distribution. This is stated without proof.
Whether the stationary distribution is unique is more subtle and requires
that we consider only irreducible chains. Irreducibility is defined by the
property that any state j can be reached from any other state i in a finite
number of steps. In other words, a chain is irreducible if for any pair of states
i and j there exists an integer n such that pij^(n) > 0. It is often sufficient to
view the transition graph to determine whether a Markov chain is irreducible
or not.

Example 5.8. Question: Are the simple NCD, modified NCD, unre-
stricted and restricted random walk processes irreducible?
Answer: It is clear from Figures 1, 2 & 4 that both NCD processes and
the unrestricted random walk are irreducible, as all states have a non-zero
probability of being reached from any other state in a finite number of steps.
For the restricted random walk, Figure 3 shows that it is irreducible unless
either boundary is absorbing, i.e. it is irreducible for α ≠ 1 and β ≠ 1.

An irreducible Markov chain with a finite state space has a unique stationary
probability distribution. This is stated without proof.

Example 5.9. Question: Do the simple NCD, modified NCD, unre-


stricted and restricted random walk processes have a unique stationary dis-
tribution?
Answer: The simple NCD process is irreducible and has a finite state space.
It therefore has a unique stationary distribution.
The modified NCD process is irreducible and has a finite state space. It
therefore has a unique stationary distribution.
The unrestricted random walk is irreducible but does not have a finite state
space, so the result above does not apply; in fact, as noted earlier, it has no
stationary distribution at all.
The restricted random walk has a finite state space and is irreducible for
α ≠ 1 and β ≠ 1. It therefore has a unique stationary distribution for α ≠ 1
and β ≠ 1.

Example 5.10. Question: Compute the stationary distribution for the
modified NCD model defined in §5.5.
Answer: The conditions for a stationary distribution defined above lead to
the following expressions:

    π0  = (1/4)π0 + (1/4)π1 + (1/4)π2− ,
    π1  = (3/4)π0 + (1/4)π2+ ,
    π2+ = (3/4)π1 ,
    π2− = (1/4)π3 ,
    π3  = (3/4)π2+ + (3/4)π2− + (3/4)π3 .
This system of equations is not linearly independent, since adding all the
equations results in an identity. This is a general feature of π = πP due to
the property Σ_{j∈S} pij = 1.

We therefore discard one of the equations (discarding the last one will simplify
the system) and work in terms of a working variable, say π1 :

    3π0 − π2− = π1 ,      3π0 + π2+ = 4π1 ,
    π2+ = (3/4)π1 ,       4π2− − π3 = 0.

This system is solved by

    π2+ = (3/4)π1 ,    π0 = (13/12)π1 ,
    π2− = (9/4)π1 ,    π3 = 9π1 .

Using the requirement that Σ_{j∈S} πj = 1, we arrive at the stationary
distribution

    π = ( 13/169, 12/169, 9/169, 27/169, 108/169 ).
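We can check this computation numerically. The Python sketch below solves
π = πP for the transition matrix implied by the balance equations above,
with the states ordered (0, 1, 2+, 2−, 3); one redundant balance equation is
replaced by the normalisation condition:

import numpy as np

# Transition matrix implied by the balance equations in Example 5.10,
# with states ordered (0, 1, 2+, 2-, 3)
P = np.array([[1/4, 3/4, 0,   0,   0  ],
              [1/4, 0,   3/4, 0,   0  ],
              [0,   1/4, 0,   0,   3/4],
              [1/4, 0,   0,   0,   3/4],
              [0,   0,   0,   1/4, 3/4]])

# pi = pi P is equivalent to (P^T - I) pi^T = 0; drop one redundant
# equation and append the normalisation sum(pi) = 1
n = P.shape[0]
A = np.vstack([(P.T - np.eye(n))[:-1], np.ones(n)])
b = np.zeros(n)
b[-1] = 1.0
pi = np.linalg.solve(A, b)
print(pi)                                    # [0.0769 0.0710 0.0533 0.1598 0.6391]
print(np.array([13, 12, 9, 27, 108]) / 169)  # agrees with the answer above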

5.7 The long-term behaviour of Markov chains
It is natural to expect the distribution of a Markov chain to tend to the
stationary distribution π for large times if π exists. However, certain phe-
nomena can complicate this. For example, a state i is said to be periodic
with period d > 1 if a return to i is possible only in a number of steps that
is a multiple of d. More specifically, pii^(n) = 0 unless n = md for some
integer m.
Any periodic behaviour is usually evident from the transition graph. For
example, both NCD models considered above are aperiodic; the unrestricted
random walk has period 2, and the restricted random walk is aperiodic unless
α and β are either 0 or 1.
We state the following result about convergence of a Markov chain without
proof:
Let pij^(n) be the n-step transition probability of an irreducible aperiodic Markov
chain on a finite state space. Then lim_{n→∞} pij^(n) = πj for each i and j.

Example 5.11. Question: An insurance company has 10,000 policyhold-


ers on the modified NCD system defined in §5.5. Estimate the number of
policyholders on each discount rate.
Answer: The model is irreducible and aperiodic, therefore, assuming that
the policies have been held for a sufficient length of time, the distribution of
policyholders amongst states is given by the stationary distribution computed
in Example 5.10. We would therefore expect the following distribution:
State 0: no discount      10,000 × 13/169 ≈ 769
State 1: 25% discount     10,000 × 12/169 ≈ 710
State 2: 40% discount     10,000 × (9/169 + 27/169) ≈ 2,130
State 3: 60% discount     10,000 × 108/169 ≈ 6,391

References
The following texts were used in the preparation of this chapter and you are
referred there for further reading if required.

• Faculty and Institute of Actuaries, CT4 Core Reading;

• D. R. Cox & H. D. Miller, The Theory of Stochastic Processes;

• S. Ross, Stochastic Processes.

5.8 Summary
For discrete state spaces the Markov property is written as

P [ Xt = a| Xs1 = x1 , Xs2 = x2 , . . . , Xsn = xn , Xs = x] = P [Xt = a|Xs = x],

for all s1 < s2 < · · · < sn < s < t and all states a, x1 , x2 , . . . , xn , x in S.
Any process with independent increments has the Markov property.
Markov chains are discrete-time and discrete-state-space stochastic processes
satisfying the Markov property. You should be familiar with the simple
NCD, modified NCD, unrestricted random walk and restricted random walk
processes.
In general, the n-step transition probabilities pij (m, m + n) denote the prob-
ability that a process in state i at time m will be in state j at time m + n.
The transition probabilities of a Markov process satisfy the Chapman–Kolmogorov
equations:

    pij (m, n) = Σ_{k∈S} pik (m, l) pkj (l, n),

for all states i, j ∈ S and all integer times m < l < n. This can be expressed
in terms of n-step stochastic matrices as

P(m, n) = P(m, l)P(l, n).

An irreducible time-homogeneous Markov chain with a finite state space has
a unique stationary probability distribution, π, such that

    π = πP(n)   for every n.

Aperiodic processes will converge to the stationary distribution as n → ∞.

5.9 Questions

1. Consider a Markov chain with state space S = {0, 1, 2} and transition
   matrix

           (    p        q      0  )
       P = (   1/4       0     3/4 ) .
           ( p − 1/2   7/10   1/5  )

   (a) Calculate values for p and q.
   (b) Draw the transition graph for the process.
   (c) Calculate the transition probabilities pij^(3).
(d) Find any stationary distributions for the process.
2. Prove equation (59) relating the probability of a particular path occur-
ring in a Markov chain.
3. A No-Claims Discount system operated by a motor insurer has the
following four levels:
Level 1: 0% discount;
Level 2: 25% discount;
Level 3: 40% discount;
Level 4: 60% discount.
The rules for moving between these levels are as follows:
• Following a year with no claims, move to the next higher level, or
remain at level 4.
• Following a year with one claim, move to the next lower level, or
remain at level 1.
• Following a year with two or more claims, move down two levels,
or move to level 1 (from level 2) or remain at level 1.
For a given policyholder in a given year the probability of no claims is
0.85 and the probability of making one claim is 0.12. Xt denotes the
level of the policyholder in year t.
(i) Explain why Xt is a Markov chain. Write down the transition
matrix of this chain.
(ii) Calculate the probability that a policyholder who is currently at
level 2 will be at level 2 after:

i. one year.
ii. two years.
iii. three years.
(iii) Explain whether the chain is irreducible and/or aperiodic.
(iv) Does this Markov chain converge to a stationary distribution?
(v) Calculate the long-run probability that a policyholder is in dis-
count level 2.

Chapter 6
Markov Jump Processes
A Markov jump process is a stochastic process with discrete state space and
continuous time set, which has Markov property.
The mathematical development of Markov jump processes is similar to Markov
chains considered in the previous chapter. For example, the Chapman–
Kolmogorov equations have the same format. However, Markov jump pro-
cesses are in continuous time and so the notion of a one-step transition prob-
ability does not exist and we are forced to consider time intervals of arbi-
trarily small length. Taking the limit of these intervals to zero leads to the
reformulation of the Chapman–Kolmogorov equations in terms of differential
equations.
We begin by discussing the Poisson process, which is the simplest example of
a Markov jump process. In doing so we will encounter some general features
of Markov jump processes.

6.1 Poisson process


The Poisson process {Nt }t∈[0,∞) , is an example of a counting process. That
is, it has state space S = {0, 1, 2, . . . , n, . . . } corresponding to the number
of occurrences of some event. The events occur singly and can occur at
any time. Counting processes are useful in modelling customers in a queue,
insurance claims or car accidents, for example.
Informally, a counting process (counting, for example, customers in a
queue) is a Poisson process if customers arrive independently and “uni-
formly” in time, i.e. with a constant rate of λ customers per time unit. Thus,
in a time interval of length h we would expect on average about λh customers.
To make the above intuition more formal, let us assume that the time interval
(t, t + h) is very short, such that the probability of two or more events during
this interval can be neglected. In this case the expected number of events
(which should be about λh by the intuition above) is 0 · (1 − p) + 1 · p = p,
where p is the probability that an event occurs.
Formally, the probability of an event in any short time interval (t, t + h) is
λh + o(h), where a function f is said to be o(h) if

    lim_{h→0} f(h)/h = 0.

The Poisson process can then be defined as follows:

The counting process {Nt }t∈[0,∞) is said to be a Poisson process with rate
λ > 0, if
1. N0 = 0;

2. The process has stationary and independent increments.

3. P (Nt+h − Nt = 1) = λh + o(h);

4. P (Nt+h − Nt > 1) = o(h).

Example 6.1. Question: Prove that a Poisson process is a Markov jump


process.
Answer: A Poisson process has independent increments (property 2),
therefore it has the Markov property. The state space S = {0, 1, 2, . . . , n, . . . }
of the process is discrete, and the time set t ∈ [0, ∞) is continuous, thus it is
a Markov jump process by definition.
It is possible to show that the Poisson process defined above coincides with
the other standard definition of the Poisson process, that is, a process having
independent stationary Poisson-distributed increments. Or more formally,
for any t > 0, Nt follows a Poisson distribution with parameter λt, that is,

    P (Nt = n) = e^{−λt} (λt)^n / n! ,   for any n = 0, 1, 2, . . .    (64)
More generally, for any t, s > 0, Nt+s − Ns has the same probability distri-
bution as Nt .

Example 6.2. Question: Prove that the two definitions of a Poisson


process are consistent.
Answer: Define the probability that there have been n events by time t as
pn (t) = P (Nt = n). Then,

p0 (t + h) = P (Nt+h = 0),
= P (Nt = 0, Nt+h − Nt = 0),
= P (Nt = 0)P (Nt+h − Nt = 0),
= p0 (t)(1 − λh + o(h)).

Rearranging this equation and dividing by h yields

    [p0 (t + h) − p0 (t)]/h = −λp0 (t) + (o(h)/h) p0 (t).

Taking the limit as h → 0 leads to the differential equation

    dp0 (t)/dt = −λp0 (t),

with the initial condition p0 (0) = 1. It is clear that this has solution

    p0 (t) = e^{−λt}.    (65)

Similarly, for n ≥ 1:

pn (t + h) = P (Nt+h = n),
= P (Nt = n, Nt+h − Nt = 0) + P (Nt = n − 1, Nt+h − Nt = 1) + o(h),
= P (Nt = n)P (Nt+h − Nt = 0) + P (Nt = n − 1)P (Nt+h − Nt = 1) + o(h),
= pn (t)p0 (h) + pn−1 (t)p1 (h) + o(h),
= (1 − λh)pn (t) + λhpn−1 (t) + o(h).

Rearranging this for pn (t+h), and again taking the limit as h → 0, we obtain
the differential equation

    dpn (t)/dt = −λpn (t) + λpn−1 (t),    (66)
for n = 1, 2, 3, . . . .
It can be shown by mathematical induction, or using generating functions,
that the solution to the differential equation (66), with initial condition
pn (0) = 0, yields equation (64), as required.

A Poisson process has positive integer values and can jump at any time
t ∈ [0, ∞). However, since time is continuous, the probability of a jump is
zero at specific time point t. The process can be pictured as an “upwards
staircase” shown in Figure 5.

6.1.1 Interarrival times


Since the Poisson process changes only by unit upward jumps, its sample
paths are fully characterised by the times at which the jumps take place.
Consider a Poisson process and let τ1 be the time at which the first event
occurs and let τn for n > 1 denote the time between the (n − 1)th and the
nth event. It is clear that τn for n ≥ 1 is a continuous random variable which
takes values in the range [0, ∞).

Figure 5: Sample Poisson process. Horizontal distance is time.

The sequence {τn }n≥1 is called the sequence of interarrival times (or holding
times). These are the horizontal distances between each step in Figure 5.
The random variables τ1 , τ2 , . . . are i.i.d., each having the exponential distri-
bution with parameter λ. They therefore each have the density function

fτ (t) = λe−λt for t > 0. (67)

To demonstrate this for general τn , first consider τ1 and note that the event
τ1 > t occurs if and only if there are zero events of the Poisson process in the
fixed interval (0, t], that is

P (τ1 > t) = P (Nt = 0) = e−λt .

The distribution function of τ1 is therefore

P (τ1 ≤ t) = 1 − e−λt ,

and so τ1 is exponentially distributed with parameter λ.


Now consider the distribution of τ2 conditional on τ1 :

P (τ2 > t|τ1 = s) = P (0 events in (s, s + t]|τ1 = s),


= P (Nt+s − Ns = 0|τ1 = s),
= P (Nt+s − Ns = 0), (by independent increments)
= p0 (t) = e−λt .

Therefore τ2 is independent of τ1 and has the same exponential distribution


as τ1 .

The same argument can be repeated for τ3 , τ4 , . . . leading to the conclusion
that the interarrival times are i.i.d. random variables that are exponentially
distributed with parameter λ.
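This characterisation gives a simple way to simulate a Poisson process: gen-
erate i.i.d. exponential interarrival times and accumulate them. A minimal
Python sketch (the rate λ and the horizon T are illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
lam, T = 3.0, 1.0          # illustrative rate and time horizon

def count_events():
    # Sum exponential interarrival times until the horizon T is passed
    t, n = 0.0, 0
    while True:
        t += rng.exponential(1 / lam)
        if t > T:
            return n
        n += 1

samples = [count_events() for _ in range(100_000)]
# N_T should be Poisson(lam * T): mean and variance both close to 3
print(np.mean(samples), np.var(samples))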
Further, it can be shown using similar arguments that if N̂t and Ñt are
two independent Poisson processes with parameters λ1 and λ2 respectively,
then their sum Nt = N̂t + Ñt is a Poisson process with parameter λ1 + λ2 .
This result follows immediately from our intuitive interpretation of a Poisson
process: assume that male customers arrive uniformly with rate λ1 , and
female customers arrive independently and uniformly with rate λ2 . Then N̂t
describes the cumulative number of male customers and Ñt the cumulative
number of female customers, thus Nt = N̂t + Ñt is the total number of
customers, which clearly also arrive uniformly with rate λ1 + λ2 .
This can be extended to the sum of any number of Poisson processes and is
a very useful result.

Example 6.3. An insurance company assumes that the number of claims on


an individual motor insurance policy in a year is a Poisson random variable
with parameter q. Claims in successive time intervals are assumed to be
independent. The company holds 10,000 such motor insurance policies, which
are assumed to be independent.
For 10,000 independent policies, the total number of claims in any year will
therefore be Poisson with mean 10,000q.
The total number of claims on a policy in a two-year period is a Poisson
random variable with mean 2q.

6.1.2 Compound Poisson process


The Poisson process {Nt }t∈[0,∞) is a natural model for counting the number
of claims reaching an insurance company during the time period [0, t]. In
practice, however, the cumulative size of the claims is more important. If Yi
is the size of claim i, the cumulative size is given by Xt = Σ_{i=1}^{Nt} Yi . The
simplest model is to assume that all claims Yi are independent and identically
distributed. In this case the stochastic process {Xt }t∈[0,∞) is called a
compound Poisson process.
Formally, a compound Poisson process with rate λ > 0 and jump size
distribution F is a continuous-time stochastic process given by

    Xt = Σ_{i=1}^{Nt} Yi    (68)

where {Nt }t∈[0,∞) is a Poisson process with rate λ, and {Yi , i ≥ 1} are
independent and identically distributed random variables, with distribution
function F , which are also independent of Nt .
The expected value and variance of the compound Poisson process are
given by

    E[Xt ] = λt E[Y ],    Var[Xt ] = λt E[Y²],    (69)
where Y is a random variable with distribution function F .

Example 6.4. In Example 6.3 assume that the size of each claim is a random
variable uniformly distributed on [a, b]. All claim sizes are independent.
What is the mean and variance of the cumulative size of the claims from all
policies during 3 years?
Answer: The cumulative size of the claims is the compound Poisson process
Xt = Σ_{i=1}^{Nt} Yi , where Nt is the number of claims from all policies, which
is a Poisson process with parameter λ = 10,000q, and Yi is the size of claim i.
Then

    E[Yi ] = (1/(b − a)) ∫_a^b x dx = (a + b)/2,
    E[Yi²] = (1/(b − a)) ∫_a^b x² dx = (a² + ab + b²)/3,

which gives

    E[X3 ] = 3λ E[Yi ] = 30,000q · (a + b)/2 = 15,000q(a + b),

and

    Var[X3 ] = 3λ E[Yi²] = 30,000q · (a² + ab + b²)/3 = 10,000q(a² + ab + b²).

Assume that a company has initial capital u, premium rate c, and the cumu-
lative claims size Xt is given by (68). Then the basic problem in risk theory
is to estimate the probability of ruin at time t > 0, defined as

Φt (u) = P [u + ct − Xt < 0]. (70)
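Since Φt (u) involves the value of Xt at the single time t only, it can be
estimated by straightforward Monte-Carlo simulation of the compound Pois-
son process. A minimal Python sketch, with all parameter values hypothetical
and chosen purely for illustration:

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical parameters: initial capital u, premium rate c, horizon t,
# claim rate lam, claim sizes uniform on [a, b]
u, c, t = 10.0, 5.0, 3.0
lam, a, b = 1.0, 0.0, 8.0

def ruined():
    n = rng.poisson(lam * t)                # number of claims in [0, t]
    X_t = rng.uniform(a, b, size=n).sum()   # compound Poisson value at t
    return u + c * t - X_t < 0              # the event in equation (70)

M = 100_000
print(np.mean([ruined() for _ in range(M)]))  # estimate of Phi_t(u)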

6.2 The time-inhomogeneous Markov jump process


Similar to the Markov chain, we introduce transition probabilities for a gen-
eral Markov jump process

pij (s, t) = P [Xt = j|Xs = i] , where pij (s, t) ≥ 0 and s < t. (71)

The transition probabilities must also satisfy the Chapman–Kolmogorov
equations

    pij (t1 , t3 ) = Σ_{k∈S} pik (t1 , t2 ) pkj (t2 , t3 ),   for t1 < t2 < t3 .    (72)

In matrix form, these are expressed as

P(t1 , t3 ) = P(t1 , t2 )P(t2 , t3 ).

The proof of these is analogous to that for equation (58) in discrete time,
and is left as a question at the end of the chapter.
We require that the transition probabilities satisfy the continuity condition

    lim_{t→s+} pij (s, t) = δij = { 1, i = j;  0, i ≠ j }    (73)

This condition means that, as the time difference between two observations
approaches zero, the probability that the process does not change its state
approaches one.
It is easy to see that this condition is consistent with the Chapman–Kolmogorov
equations. Indeed, taking the limits t2 → t3− or t2 → t1+ in equation (72),
we obtain an identity.
However, this condition does not follow from the Chapman–Kolmogorov
equations. For example, pij (s, t) = 1/2 for i, j = 1, 2 satisfies equation (72),
since

    ( 1/2  1/2 )   ( 1/2  1/2 )   ( 1/2  1/2 )
    ( 1/2  1/2 ) = ( 1/2  1/2 ) · ( 1/2  1/2 ),

but it violates the continuity condition (73).

6.3 Transition rates


Let us assume that the transition probabilities pij (s, t) for t > s have deriva-
tives with respect to t and s. Also, assume for simplicity that the state space
S is finite. Then by the standard definition of a derivative we have

    ∂pij (s, t)/∂t = lim_{h→0} [pij (s, t + h) − pij (s, t)]/h
                  = lim_{h→0} [Σ_k pik (s, t) pkj (t, t + h) − pij (s, t)]/h
                  = lim_{h→0} ( Σ_{k≠j} pik (s, t) pkj (t, t + h)/h
                              + pij (s, t) [pjj (t, t + h) − 1]/h )
                  =: lim_{h→0} αij (h).    (74)

It follows from equation (74) that the ratios appearing in αij (h) approach
certain limits as h → 0. In particular, we define

    lim_{h→0} [pjj (t, t + h) − 1]/h := qjj (t),
    lim_{h→0} pkj (t, t + h)/h := qkj (t),   for k ≠ j.    (75)

The quantities qjj (t), qkj (t) are called transition rates. They correspond to
the rate of transition from state k to state j in a small time interval h, given
that state k is occupied at time t.
Transition probabilities pkj (t, t + h) can be expressed through the transition
rates as

    pkj (t, t + h) = { h qkj (t) + o(h),      k ≠ j;
                       1 + h qjj (t) + o(h),  k = j }    (76)

It follows from equation (74) that

    ∂pij (s, t)/∂t = Σ_{k∈S} pik (s, t) qkj (t).    (77)

These differential equations are called Kolmogorov’s forward equations. In
matrix form, they can be written as

    ∂P(s, t)/∂t = P(s, t) Q(t),

where Q(t) is called the generator matrix, with entries qij (t).
Repeating the procedure but differentiating with respect to s, we have

    ∂pij (s, t)/∂s = lim_{h→0} [pij (s + h, t) − pij (s, t)]/h
                  = lim_{h→0} [pij (s + h, t) − Σ_k pik (s, s + h) pkj (s + h, t)]/h
                  = − lim_{h→0} ( Σ_{k≠i} [pik (s, s + h)/h] pkj (s + h, t)
                              + [pii (s, s + h) − 1]/h · pij (s + h, t) ).

Therefore

    ∂pij (s, t)/∂s = − Σ_{k∈S} qik (s) pkj (s, t),    (78)

and we see that the derivative with respect to s can also be expressed in
terms of the transition rates. The differential equations (78) are called
Kolmogorov’s backward equations. In matrix form these are written as

    ∂P(s, t)/∂s = −Q(s) P(s, t).

Therefore if transition probabilities pij (s, t) for t > s have derivatives with
respect to t and s, transition rates are well-defined and given by equation
(75).
Alternatively, if we can assume the existence of transition rates, then it
follows that transition probabilities pij (s, t) for t > s have derivatives with
respect to t and s, given by equations (77) and (78). These equations are
compatible, and we may ask whether we can find transition probabilities,
given transition rates, by solving equations (77) and (78).
It can be shown that each row of the generator matrix Q(s) has zero sum.
That is,

    qii (s) = − Σ_{j≠i} qij (s).

The residual holding time for a general Markov jump process is denoted Rs .
This is the random amount of time between time s and the next jump:

{Rs > w, Xs = i} = {Xu = i, s ≤ u ≤ s + w}.

It can be proved that

    P (Rs > w| Xs = i) = e^{∫_s^{s+w} qii (t) dt}.

Similarly, the current holding time is denoted Ct . This is the time between
the last jump and time t:

{Ct ≥ w, Xt = i} = {Xu = i, t − w ≤ u ≤ t}.

We will not study these questions further for general Markov processes, but
will investigate such and related questions for time-homogeneous Markov
processes below.

6.4 Time-homogeneous Markov jump processes
Just as we defined time-homogeneous Markov chains (equation (60)), we can
define time-homogeneous Markov jump processes.
Consider the transition probabilities for a Markov process given by equation
(71). A Markov process in continuous time is called time-homogeneous if the
transition probabilities satisfy pij (s, t) = pij (0, t − s) for all i, j ∈ S and
s, t > 0.
In other words, a Markov process in continuous time is called time-homogeneous
if the probability P (Xt = j| Xs = i) depends only on the time interval t − s.
In this case we can write

pij (s, t) = P (Xt = j| Xs = i) = pij (t − s),


pij (t, t + s) = P (Xt+s = j| Xt = i) = pij (s),
pij (0, t) = P (Xt = j| X0 = i) = pij (t).

Here, for example, the pij (s) form a stochastic matrix for every s, that is,

    pij (s) ≥ 0   and   Σ_{j∈S} pij (s) = 1,

and is assumed to satisfy the continuity condition at s = 0:

    lim_{s→0+} pij (s) = pij (0) = δij = { 1, i = j;  0, i ≠ j }

Also, the pij (s) satisfy the Chapman–Kolmogorov equations, which, for a
time-homogeneous Markov process, take the form

    pij (t + s) = Σ_{k∈S} pik (t) pkj (s).    (79)

In matrix form, the Chapman–Kolmogorov equations become

P(t + s) = P(t)P(s). (80)

Note that P(0) = I is the identity matrix.


If a time-homogeneous Markov process is currently in state i, it follows from
equation (73) that the probability of remaining in i is non-zero for all t, that
is, pii (t) > 0. Indeed, from equation (79) it follows that pii (t) ≥ pii (t/n)^n
for any integer n. For example,

    pii (t) = Σ_{k∈S} pik (t/2) pki (t/2) ≥ pii (t/2) pii (t/2).

The argument for different values of n is similar. So, if for some t we had
pii (t) = 0, this would imply pii (t/n) = 0 for all n, a contradiction with
(73).
The following properties of transition functions and transition rates for a
time-homogeneous process are stated without proof:
1. Transition rates qij = dpij (t)/dt |_{t=0} = lim_{h→0} [pij (h) − δij ]/h exist
   for all i, j. Equivalently, as h → 0, h > 0,

       pij (h) = { h qij + o(h),      i ≠ j;
                   1 + h qii + o(h),  i = j }    (81)

Comparing this to equation (76) we see that the only difference between
the time-homogeneous and time-inhomogeneous cases is that the tran-
sition rates qij are not allowed to change over time.

2. Transition rates are non-negative and finite for i ≠ j, and are non-
   positive when i = j, that is,

       qij ≥ 0 for i ≠ j,   but   qii ≤ 0.

   Differentiating Σ_{j∈S} pij (t) = 1 with respect to t at t = 0 yields

       qii = − Σ_{j≠i} qij .

3. If the set of states S is finite, all transition rates are finite.

Kolmogorov’s forward equations for a time-homogeneous process take the
form

    dpij (t)/dt = Σ_{k∈S} pik (t) qkj ,

and in matrix form

    dP(t)/dt = P(t) Q,

where Q is the generator matrix with entries qkj .
Similarly, Kolmogorov’s backward equations are

    dP(t)/dt = Q P(t).
Note that since qii = − Σ_{j≠i} qij , each row of the matrix Q has zero sum.
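For a finite state space, the forward equation dP(t)/dt = P(t)Q with P(0) = I
has the explicit solution P(t) = e^{Qt}, the matrix exponential. A minimal
Python sketch for a two-state generator (the rates α and β are illustrative
values; compare question 4 at the end of this chapter):

import numpy as np
from scipy.linalg import expm

alpha, beta = 2.0, 1.0     # illustrative transition rates q01, q10

# Generator matrix: each row sums to zero
Q = np.array([[-alpha, alpha],
              [beta, -beta]])

# P(t) = exp(Qt) solves both the forward and the backward equations
print(expm(Q * 0.5))       # transition probabilities over time 0.5
print(expm(Q * 50.0))      # rows converge to (beta, alpha)/(alpha + beta)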

Example 6.5. Consider the Poisson process again. The rate at which
events occur is a constant λ, leading to

    qij = { λ,   j = i + 1;
            −λ,  j = i;
            0,   otherwise }    (82)

and pij (t) = P (Nt+s = j| Ns = i).


The Kolmogorov forward equations are

    dpi0 (t)/dt = −λpi0 (t),
    dpij (t)/dt = −λpij (t) + λpi,j−1 (t),
with pij (0) = δij . These equations are essentially the same as equations (65)
and (66).
The backward equations are

    dpij (t)/dt = −λpij (t) + λpi+1,j (t).

6.5 Applications
In this section we briefly discuss a number of applications of Markov jump
processes to actuarial modelling. In each case the models can be made time-
homogeneous by insisting that the transition rates are independent of time.
A more detailed discussion of the survival model is postponed to the next
chapters.

6.5.1 Survival model


Consider a two-state model where the two states are alive and dead, i.e.
transition is in one direction only, from the state alive (A) to the state dead
(D) with transition rate µ(t). This is the survival model and has discrete
state space S={A,D}. The transition graph is given in Figure 6.
In actuarial notation, the transition rate µ(t) is identified with the force
of mortality at age t.

Figure 6: Transition graph for the survival model. Reproduced with permis-
sion of the Faculty and Institute of Actuaries.

It is clear that the generator matrix Q(t) is given by

    Q(t) = ( −µ(t)  µ(t) )
           (   0      0  ).

The Kolmogorov forward equations therefore become

    ∂pAA (s, t)/∂t = −µ(t) pAA (s, t),

and it is clear that the solution corresponding to the initial condition
pAA (s, s) = 1 is

    pAA (s, t) = e^{−∫_s^t µ(x) dx}.

Note that pAA (s, t) is the probability that an individual alive at time (age) s
will still be alive at time (age) t.
Equivalently, consider the probability that an individual now aged s will
survive until at least age s + w, denoted w_p_s in the standard mortality
notation:

    w_p_s = pAA (s, s + w) = e^{−∫_s^{s+w} µ(x) dx} = e^{−∫_0^w µ(s+u) du}.
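Numerically, w_p_s can be computed by integrating any given force of mor-
tality. A minimal Python sketch, assuming (purely for illustration) a Gom-
pertz form µ(x) = B c^x with hypothetical parameter values:

import numpy as np
from scipy.integrate import quad

B, c = 5e-5, 1.09          # illustrative Gompertz parameters

def mu(x):
    return B * c ** x      # assumed force of mortality at age x

def survival_prob(s, w):
    # w_p_s = exp(- integral of mu from s to s + w)
    integral, _ = quad(mu, s, s + w)
    return np.exp(-integral)

print(survival_prob(40.0, 10.0))   # P(life aged 40 survives to age 50)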

6.5.2 Sickness-death model


The survival model can be extended to include the state of health of an
individual. In this so-called sickness-death model, the state of an individual
is described as being healthy (H), sick (S), or dead (D). The discrete state
space is therefore S={H,S,D}.
An individual in state H can jump to either state S or state D. Similarly, an in-
dividual in state S can jump to either state H or state D. Time-inhomogeneity
arises through the following age-dependent transition rates:
H → S : σ(t)
H → D : µ(t)
S → H : ρ(t)
S → D : ν(t)

Figure 7: Transition graph for the sickness-death model. Reproduced with
permission of the Faculty and Institute of Actuaries.

The transition graph is given in Figure 7.


The generator matrix is:

    Q(t) = ( −(σ(t) + µ(t))        σ(t)           µ(t) )
           (      ρ(t)       −(ρ(t) + ν(t))       ν(t) )
           (       0                0               0  ).

Under this formulation it is possible to calculate probabilities such as


• the probability that an individual who is healthy at time s will still be
healthy at time t; or

• the probability that an individual who is sick at time s will still be sick
at time t.
These are given in terms of the residual holding times as

    P (Rs > t − s| Xs = H) = e^{−∫_s^t (σ(u)+µ(u)) du},

and

    P (Rs > t − s| Xs = S) = e^{−∫_s^t (ρ(u)+ν(u)) du},

respectively.
We note that transition probabilities can be related to each other. For ex-
ample, the probability of a transition from state H at time s to S at time t
is

    pHS (s, t) = ∫_0^{t−s} e^{−∫_s^{s+w} (σ(u)+µ(u)) du} σ(s + w) pSS (s + w, t) dw.

This is interpreted as “the individual remains in the healthy state from time
s to time s + w, then jumps to the sick state at time s + w, and is found in
the sick state again at time t”. The derivation of this equation is beyond the
scope of the course; however, similar expressions can be written down
intuitively.
This sickness-death model can be extended to include the length of time an
individual has been in state S. This leads to the so-called long term care
model where the rate of transition out of state S will depend on the current
holding time in state S.

6.5.3 Marriage model


A further example of a time-inhomogeneous model is the marriage model
under which an individual can be either never married (B), married (M),
divorced (D), widowed (W) or dead (∆). A Markov jump process can be
formulated on the state space S ={B, M, D, W, ∆}.
The transition graph is given in Figure 8, where we can see that the death
rate has been taken to be independent of the marital status for simplicity.

Figure 8: Transition graph for the marriage model. Reproduced with permis-
sion of the Faculty and Institute of Actuaries.

Example 6.6. Question: State an expression for the probability of being


married at time t and of having been so for at least w given that you have
never been married at time s (w < t − s).
Answer: If Ct is the current holding time, we have

    P [Xt = M, Ct > w| Xs = B] = ∫_w^{t−s} [ pBB (s, t − v) α(t − v)
        + pBW (s, t − v) r(t − v) + pBD (s, t − v) ρ(t − v) ]
        × e^{−∫_{t−v}^{t} (µ(u)+ν(u)+d(u)) du} dv.

This mathematical statement can be read as “the individual is in state B at
time s, where he either remains until time (t − v) or moves to state W or D
by time (t − v); at time (t − v) he then jumps to state M and remains there
until time t”.

References
The following texts were used in the preparation of this chapter and you are
referred there for further reading if required.

• Faculty and Institute of Actuaries, CT4 Core Reading;

• D. R. Cox & H. D. Miller, The Theory of Stochastic Processes;

• S. Ross, Stochastic Processes.

6.6 Summary
Markov jump processes are continuous-time and discrete-state-space stochas-
tic processes satisfying the Markov property. You should be familiar with
the Poisson, survival, sickness-death and marriage models.
The Poisson process is a simple Markov jump process. It is time-homogeneous
with stationary increments that are Poisson distributed with mean λ > 0.
Waiting times between jumps are exponentially distributed with mean 1/λ.
As with Markov chains, transition probabilities exist for a general Markov
jump process
pij (s, t) = P [Xt = j|Xs = i] , where pij (s, t) ≥ 0 and s < t,
which must also satisfy the Chapman-Kolmogorov equations.
The quantities qjj (t), qkj (t) are the transition rates, such that

    lim_{h→0} [pjj (t, t + h) − 1]/h := qjj (t),
    lim_{h→0} pkj (t, t + h)/h := qkj (t),   for k ≠ j.
Kolmogorov’s forward and backward equations are, respectively,

    ∂pij (s, t)/∂t = Σ_{k∈S} pik (s, t) qkj (t)   and
    ∂pij (s, t)/∂s = − Σ_{k∈S} qik (s) pkj (s, t).

These can be written in matrix form as

    ∂P(s, t)/∂t = P(s, t) Q(t)   and   ∂P(s, t)/∂s = −Q(s) P(s, t),

where Q(t) is the generator matrix with entries qij (t).
In time-homogeneous models the time-dependence of the transition proba-
bilities and transition rates (therefore generator matrices) is removed.
The residual holding time Rs is the random amount of time between time s
and the next jump:

    {Rs > w, Xs = i} = {Xu = i, s ≤ u ≤ s + w}.

It can be proved that

    P (Rs > w| Xs = i) = e^{∫_s^{s+w} qii (t) dt}.

6.7 Questions

1. Claims are known to follow a Poisson process with a uniform rate of 3


per day.

(a) Calculate the probability that there will be fewer than 1 claim on
a given day.
(b) Estimate the probability that another claim will be reported dur-
ing the next hour. State all assumptions made.
(c) If there have not been any claims for over a week, calculate the
expected time before a new claim occurs.

2. Prove equation (72) which gives the Chapman–Kolmogorov equations


for a Markov jump process.

3. Consider the sickness-death model given in Figure 7, write down an


integral expression for pHD (s, t).

4. Let {Xt , t ≥ 0} be a time-homogeneous Markov process with state


space S = {0, 1} and transition rates q01 = α, q10 = β.

(a) Write down the generator matrix for this process.


(b) Solve the Kolmogorov’s forward equations for this Markov jump
process to find all transition probabilities.
(c) Check that the Chapman–Kolmogorov equations hold.
(d) What is the probability that the process will be in state 0 in the
long term? Does it depend on the initial state?

Chapter 7
Machine Learning
7.1 A motivating example
Machine learning can be defined as the study of systems and algorithms
that improve their performance with experience. Machine learning methods
have become increasingly popular in recent decades because of increases both
in computing power and in the amount of data available. You use machine
learning systems every day, even without noticing. For example, you may use
the Google Translate website to translate text, and the current best methods
in automated translation use machine learning. As another example, your
mailbox most probably has some kind of spam filter, and most of today’s
spam filters are built using machine learning technology.
Below is an example of a report from one popular spam filter, SpamAssas-
sin.

• 0.6 HTML IMAGE RATIO 02 BODY: HTML has a low ratio of text
to image area

• 0.0 HTML MESSAGE BODY: HTML included in message

• 2.0 URIBL BLACK Contains an URL listed in the URIBL blacklist

• -0.9 AWL AWL: From: address is in the auto white-list

This is a typical report from analysing an e-mail. The system noticed
that it has a “low ratio of text to image area”, which is typical for spam e-
mails. It assigned to this factor a positive but not very high weight of 0.6.
The system also noticed that some HTML is included in the message, but
assigned weight 0 to this fact. Most importantly, the e-mail contains a URL
listed in some blacklist, and this gives a positive weight of 2.0. On the other
hand, the message came from a trusted sender (included in the auto white-
list), which is an indication that it is not spam, and therefore contributes a
negative weight of −0.9. The total weight is 0.6 + 0 + 2.0 − 0.9 = 1.7.
SpamAssassin classifies as “spam” e-mails with total weight at least 5, so
this particular e-mail is classified as a good one.
But how did SpamAssassin form the list of criteria/factors to look at, and
how did it assign specific weights to these factors? As you can imagine, the
number of factors is large, and it would be very difficult for the developers of
a spam filter to select weights “by common sense”. Instead, they programme
the filter to “learn” weights from data. We can collect a large number of
examples of spam e-mails and mark them as “spam”. We can also collect a
large number of normal e-mails and mark them as “not spam”. We can then
use these e-mails as a “training set”, and find weights such that as many as
possible of these e-mails would be classified correctly.

Example 7.1. As an oversimplified example, imagine that we have just
two factors: “low ratio of text to image area” and “containing a URL from
the blacklist”. Let us introduce variables x1 and x2 , and, for every e-mail,
write xi = 1 if the corresponding factor is true and xi = 0 otherwise. Imagine
that we have a training set with 3 e-mails:

• (a) e-mail with low ratio of text to image area, but not containing
URL from the blacklist. In other words, x1 = 1 but x2 = 0. The e-mail
marked as “not spam”.

• (b) e-mail with low ratio of text to image area, and with URL from the
blacklist. In other words, x1 = 1 and x2 = 1. The e-mail marked as
“spam”.

• (c) e-mail with bad URL only: x1 = 0 but x2 = 1. The e-mail marked
as “not spam”.

Assume that we classify e-mail as spam if and only if w1 x1 + w2 x2 ≥ 5.


Then, to classify e-mail (a) correctly, we should have

w1 x1 + w2 x2 = w1 · 1 + w2 · 0 = w1 < 5.

To classify e-mail (b) correctly, we should have

w1 x1 + w2 x2 = w1 · 1 + w2 · 1 = w1 + w2 ≥ 5.

And finally, to classify e-mail (c) correctly, we should have

w1 x1 + w2 x2 = w1 · 0 + w2 · 1 = w2 < 5.

So, in this example, w1 and w2 can be any numbers less than 5 with sum
at least 5. For example, w1 = w2 = 3 works.
We can represent the analysis in Example 7.1 geometrically, by introduc-
ing a coordinate plane with coordinates (x1 , x2 ). Then the three e-mails in
the training set are points with coordinates (1, 0), (1, 1), and (0, 1), which
we denote A, B, and C, respectively. We can draw the points (A and C)
corresponding to non-spam e-mails as blue, and the point (B) representing
the spam e-mail in red. The “spam region” w1 x1 + w2 x2 ≥ 5 is a half-plane

whose boundary is the line w1 x1 + w2 x2 = 5. So, geometrically, our task
was to draw a line in the plane which separates the blue and red points. One
possible such line is the one with equation 3x1 + 3x2 = 5, corresponding to
the solution w1 = w2 = 3 we chose in Example 7.1.
If we were to use three factors to make a decision (for example, the above
two plus whether the sender address is in the auto white-list), then the
third factor can be represented as a coordinate x3 , the training set could
be depicted as a set of points in 3-dimensional space, and the problem
would be to separate the red and blue points by a plane with equation
w1 x1 + w2 x2 + w3 x3 = 5. In general, we may have n factors and m e-mails
in the training set, which, geometrically, corresponds to a set of m red and
blue points in n-dimensional space. We then need to separate the points by
the (n − 1)-dimensional set of points satisfying the equation Σ_{i=1}^n wi xi = 5
(or any other constant instead of 5). The set of points satisfying this equation
is called a hyperplane.
In addition to finding the “best” weights for the given list of factors, the
system may learn which new factors it is good to include. For example, it
can compute the frequency of various words in e-mails and find that the
word “lottery” appears in a lot of spam e-mails and in almost no good
e-mails. Based on this statistic, the system may introduce a new, (n + 1)-
th factor indicating whether an e-mail contains the word “lottery” or not.
The red and blue points then get a new coordinate xn+1 (with xn+1 = 1
for e-mails with “lottery” word and xn+1 = 0 for all other e-mails), and
then we separate these points in (n + 1)-dimensional space and find weights
w1 , w2 . . . , wn , wn+1 , including the weight wn+1 of the new factor.
We can see that this problem (finding the weights based on training data)
is just a problem in algebra (find the wi from some system of inequalities),
or, equivalently, in geometry (separate red and blue points by a hyperplane).
However, we call the branch of science studying this and similar problems
“machine learning”, because it allows the system to improve with experience.
In this specific example with a spam filter, it may initially work badly for my
specific mailbox, because I, for example, may:

- receive a lot of good e-mails with “low ratio of text to image area”
because of nature of my work, but

- receive a lot of spam messages with no images because my e-mail may


be in the database of spammers who send such spam messages.

Because of this, I initially may see some spam messages in the main mailbox,
and, conversely, some good e-mails in the spam folder. However, I can then
mark the spam e-mails from main mailbox as “spam” and the good e-mails

in the spam folder as “not spam”, which provides new training data for the
spam filter, which are specific for my particular situation. The filter will
then find new weights based on new data and can quickly “learn” that, in
my particular case, “low ratio of text to image area” is not an indication of
spam, so the corresponding weight should be low, or zero, or even negative.
Instead, it can put large positive weights on some other factors which my
spammers often use. In this way, the system may quickly learn and adapt
itself to the needs of each particular user.
Moreover, if spammers are smart enough, they may try to develop e-mails
which avoid typical features of spam messages, such as the word “lottery” or
a low ratio of text to image area. This may help them to pass the spam
filters, but only temporarily. After the user marks those e-mails as “spam”,
the system will use this to update the weights, “understand” what these new
spam messages look like, and filter them out next time.

7.2 The problems machine learning can solve


In the previous section we studied one specific example of a problem which
is solved by machine learning techniques: spam filtering. The filter divides
all e-mails into two classes: spam and non-spam, and therefore this problem
is an example of a classification problem. More generally, one may want to
automatically classify all e-mails into more than two groups: spam, work
e-mails, private e-mails, etc., which is an example of multi-class classifica-
tion. As another example of the same problem, we may want to develop an
image recognition technology which classifies images detected by a web-cam
as “human”, “animal”, “bird”, etc. Such image recognition is critical for, for
example, self-driving cars, because it helps to estimate the level of danger
and select an action (if we see a human ahead on the road, stop urgently;
but if it is a bird, the car may continue to move, possibly with decreased speed).
In some applications of multi-class classification (like spam filtering) it
may be clear in advance what classes to consider (spam, non-spam, etc.).
However, in some other applications this is unclear. For example, in the
image recognition problem, it is difficult to list in advance all possible “ob-
jects” the webcam can detect. In this case, we may ask the system to do
this automatically. It may represent all observed objects as points in some
n-dimensional space, and then automatically detect that these points form
m groups, or classes. After this, every new object detected is classified into
one of these classes. “Similar” objects go to one class, and “dissimilar” ones
into different classes. This problem is called clustering.
Sometimes, we would prefer not to classify objects into a finite number of
“classes” but rather give them a “score”, a real number. For example, we

may want the spam filter to automatically assign to every e-mail its “im-
portance”, so that we can sort e-mails by it and answer the most important
e-mails first. Clear spam e-mails can have a zero or negative score, the
“borderline” e-mails for which the filter is unsure whether they are spam or
not can have a small positive score, then work e-mails may be prioritised
over personal ones, etc. Similarly, for a self-driving car, instead of classifying
detected objects into classes, it may be more convenient to just assign to
every object a score indicating how dangerous/important it is, so that the
higher the score, the more urgently we need to stop. This task is called
regression. Mathematically, the (linear) regression problem typically reduces
to the following task: given m points in the coordinate space R^{n+1} with
coordinates x1 , x2 , . . . , xn , y, approximate the points by a linear function
y = w0 + w1 x1 + w2 x2 + · · · + wn xn as well as possible. Typically, the quality
of approximation is measured by the sum of squares of differences between
the y coordinates of the data points and the function.

Example 7.2. Approximate the points (0, 0), (1, 1), (2, 3), (3, 3) by a line
y = ax + b to minimize the sum of squares error.
Solution. For x = 0, y = a · 0 + b = b, and the data point is (0, 0), so the
(squared) error is (b − 0)². Similarly, for x = 1, y = a + b, and the data point
is (1, 1), so the squared error is (a + b − 1)². Continuing this way, we write
down the error as

    e(a, b) = (b − 0)² + (a + b − 1)² + (2a + b − 3)² + (3a + b − 3)².

At the optimum,

    ∂e(a, b)/∂a = 2(a + b − 1) + 2 · 2(2a + b − 3) + 2 · 3(3a + b − 3) = 0

and

    ∂e(a, b)/∂b = 2b + 2(a + b − 1) + 2(2a + b − 3) + 2(3a + b − 3) = 0.

This simplifies to

    4(7a + 3b − 8) = 0   and   2(6a + 4b − 7) = 0,

and the solution is a = 11/10, b = 1/10.
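We can check this answer with a few lines of Python; np.polyfit performs
exactly this least-squares fit:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 3.0, 3.0])

# Least-squares fit of a degree-1 polynomial y = a*x + b
a, b = np.polyfit(x, y, deg=1)
print(a, b)   # 1.1 and 0.1, matching a = 11/10, b = 1/10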

Another example is the analysis of association rules, patterns which are


popular in marketing applications. For example, when you view any popular
book on the Amazon website, you can see a list at the bottom of the page
entitled “Customers who viewed this item also viewed”, followed by a list of
similar books. There can also be another list entitled “customers who bought
this also bought”. Also, when you read information about any film, you often
get a list of suggestions for similar films you may be interested in, etc. To
create such lists, the program should learn what books/products/films are
“similar” or “associated” with each other, and then use these associations to
provide you with the best recommendations.
When recommending the best film for you, the system often looks for
hidden or latent variables, that is, some structure which helps to under-
stand your choice. For example, you may assign scores to the films you saw,
and these scores may look unstructured even to you. However, automatic
analysis of your scores may reveal the “structure” that you typically assign a
high score to films of a particular genre that have at least one of three
particular actors.
Machine learning can also be used in a variety of other applications,
for example, in playing games, from board games like poker, Chess, or Go,
to real-life applications which “resemble” a game, like developing optimal
strategies for trading on financial markets.
In all applications, it is important to understand how to evaluate the
performance of a machine learning system, e.g. of an algorithm for calculating
weights for a spam filter. If we have N e-mails marked as spam or not, we
can use all N e-mails to find the weights, but it might be that these weights
are good only for these specific N e-mails and will not work well for any
other e-mails. The situation when the weights are so adapted to the specific
training data that the system fails on anything else is called overfitting, and
is one of the most serious problems in the area. To test whether it has
occurred, one may divide our N e-mails into two groups, and use one group
(say, 90% of the e-mails) for training and the remaining 10% of the e-mails
for testing. At the test stage, we may count the number of e-mails (both
spam and non-spam) which are classified correctly and divide it by the total
number of e-mails; this ratio is called the accuracy of the classifier.
The set of 10% of e-mails we select for testing is called the test set, and
it is usually selected randomly. However, as in any random experiment, we
may get a very different result each time we repeat the procedure. One time
the test set may contain some “typical” e-mails for which the filter performs
well, while another time it may contain many “atypical” e-mails (e.g. good
e-mails which happen to have many spam-like features) and the filter may
make a lot of errors. Because of this, it is a good idea to repeat the test
several times and average the results. For example, we may randomly divide
our initial set of N e-mails into 10 groups, use one of them as the test set
and the remaining 9 as the training set, and then repeat this procedure 10
times, with each group being the test set once. The final accuracy is the
average of the 10 accuracies obtained. This procedure is an example of cross
validation.
Sometimes the set of data used for training is further divided into two
groups: a training data set and validation data set. The training set is
used to estimate the parameters of the model (such as the weights wi in
Example 7.1), while the validation set is used to decide some more global
questions, such as the number of parameters to consider, the number of
categories to classify (should it be just spam and non-spam, or maybe 3 or
more categories), the rate at which the model should learn from the data,
etc. Such “global parameters” are called hyper-parameters.
In some applications, the “accuracy” as defined above (the total number
of correct classifications divided by the sum of the correct and incorrect ones)
is not the best way to evaluate the system performance, because there are
two very different types of errors. The first mistake is when the e-mail is
spam but the system does not recognise it and puts it in the main folder.
This is called a false negative (FN). The second type of mistake is when a
good e-mail is put into the spam folder, and this is called a false positive
(FP). If you never check the spam folder, then FN is not a big problem,
while FP may be a big issue. We can also define a true positive (TP), when
the e-mail is spam and it is in the spam folder, and a true negative (TN), if
the e-mail is non-spam and it goes to the main folder. Then we can consider
the following measures of system performance:
• Precision: Pre = TP / (TP + FP);

• Recall: Rec = TP / (TP + FN);

• F1 score = (2 · Pre · Rec) / (Pre + Rec);

• False Positive Rate = FP / (TN + FP).

In many cases, there is a trade-off between recall and false positive rate. If
we improve one of these, the other may become worse.
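These measures are straightforward to compute from the four counts; here
is a short Python sketch with hypothetical counts, chosen for illustration
only:

# Hypothetical confusion-matrix counts, for illustration only
TP, FP, FN, TN = 40, 5, 10, 945

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = FP / (TN + FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)

print(precision, recall, f1, false_positive_rate, accuracy)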

7.3 Models, methods, and techniques


In Example 7.1, the e-mails in the training set were marked as “spam” and
“non-spam”, and this information was used to calculate the weights. We
assumed that this classification of the training set was done by a human.
This approach is called supervised learning.
However, in reality we can have millions of e-mails in the training set,
and it may take months of hard work for a human to classify them. What if
our training set is just a set of e-mails, and it is unknown which e-mails are
spam? Can we program the computer to learn something from this data? In
fact, we can! The computer can still note that some e-mails are somewhat
similar (for example, use similar words like “lottery” or have a low ratio of
text to image area), and “guess” that maybe this group of similar e-mails
are the spam ones. This is an example of unsupervised learning.
In Example 7.1, we can find the weights algebraically, from a system of
inequalities, or geometrically, as a line which separates the red and blue
points in the plane. More generally, we are looking for a hyperplane which
separates points in n-dimensional space, where n is the number of factors to
be weighted. This n-dimensional space is called the instance space, and the
whole approach is known as the geometric model.
A hyperplane that perfectly separates the data may not exist, and, even
if it exists, its coefficients are not always easy to compute. Here is a method
which is very easy to compute. We can calculate the mean of all red points
(call it point A), the mean of all blue points (call it point B), and then select
the hyperplane which is perpendicular to the line segment AB and crosses it
at its midpoint. This hyperplane is known as the basic linear classifier, and
may separate red and blue points quite well, although not perfectly.
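A minimal Python sketch of the basic linear classifier (the toy data points
are chosen purely for illustration); assigning a point to the class with the
nearer mean is exactly the perpendicular-bisector rule described above:

import numpy as np

# Toy data: rows are points, with hypothetical red/blue labels
red = np.array([[1.0, 1.0], [2.0, 1.5]])
blue = np.array([[1.0, 0.0], [0.0, 1.0]])

A = red.mean(axis=0)    # mean of the red points
B = blue.mean(axis=0)   # mean of the blue points

def classify(x):
    # Side of the perpendicular bisector of the segment AB;
    # equivalent to assigning x to the class with the nearer mean
    w = A - B                       # normal vector of the hyperplane
    threshold = w @ (A + B) / 2     # hyperplane passes through the midpoint
    return "red" if w @ x > threshold else "blue"

print(classify(np.array([2.0, 2.0])))   # red
print(classify(np.array([0.0, 0.0])))   # blue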
In Example 7.1, there are many possible solutions, for example, w1 =
w2 = 3, or w1 = w2 = 4, etc. Geometrically, one can draw many possible
lines which separate the red and blue points. Some of such lines may be very
close to the blue points, some to the red ones. Intuitively, one may prefer a
line which separates the points with maximal margin, that is, as far from the
data points as possible. This idea forms the basis of the method called
support vector machines. In Example 7.1, such a “best” line corresponds
to the solution w1 = w2 = 10/3.
In many applications, the key task is similarity testing: is this e-mail
similar to the spam e-mails in the database? What are the films most similar
to one a customer likes and has rated highly? If every e-mail (or film, or
any other object) is described by n parameters and is represented as a point
in n-dimensional space, one easy way to define “similarity” is to say that
objects are similar if and only if the corresponding points X = (x1 , . . . , xn )
and Y = (y1 , . . . , yn ) are close in the usual Euclidean distance

    d(X, Y ) = √( Σ_{i=1}^n (xi − yi )² ).

A very simple classification algorithm is, for every new point A to be classi-
fied, find an already classified point B at the smallest distance from A, and
then assign A the same class as B. This method is known as the nearest-
neighbour classifier. In particular, imagine that in Example 7.1 a new e-mail
arrives with no bad URL and no low text to image area ratio. It corresponds
to the new point (0, 0) in the plane. The distance from it to the “spam”
point (1, 1) is

    √((0 − 1)² + (0 − 1)²) = √2,

while the distances to both “non-spam” points are

    √((0 − 0)² + (0 − 1)²) = √((0 − 1)² + (0 − 0)²) = 1 < √2.

Hence, the nearest-neighbour classifier would classify this new point as non-
spam.
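The nearest-neighbour rule on the three points of Example 7.1 takes only a
few lines of Python:

import numpy as np

# Training points from Example 7.1: (x1, x2) and their labels
points = np.array([[1, 0], [1, 1], [0, 1]], dtype=float)
labels = ["non-spam", "spam", "non-spam"]

def nearest_neighbour(x):
    # Assign x the label of the closest training point (Euclidean)
    distances = np.linalg.norm(points - x, axis=1)
    return labels[int(np.argmin(distances))]

print(nearest_neighbour(np.array([0.0, 0.0])))   # non-spam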
Another distance-based method can be used for the clustering task. Imagine
that we need to cluster the data into K clusters, and we have some initial
guess how to do this. For each cluster i = 1, 2, . . . , K, we can calculate
the mean Mi of all points in cluster i. After this, for each point X, we can
compute the distances from X to M1 , M2 , . . . , MK , select the minimal one
of these distances, and re-assign X to the corresponding cluster. We can
then repeat this procedure until no point is re-assigned. This method is
called K-means, and it is a very popular and powerful method for clustering.
Example 7.3. Consider 3 points A, B, C in the plane with coordinates
(0, 0), (0, 1), (4, 0), initially classified such that A and C are red while B is
blue. Then the mean of the red cluster is M1 = (2, 0), while the mean of the
blue cluster is M2 = B = (0, 1). The distances from A to the Mi are

    d(A, M1 ) = 2,   d(A, M2 ) = 1,

hence M2 is the nearest one, and A moves to the blue cluster. Similarly,

    d(B, M1 ) = √5,  d(B, M2 ) = 0,  d(C, M1 ) = 2,  d(C, M2 ) = √17,

hence B and C stay in the blue and red clusters, respectively.
We now repeat the procedure: the new cluster means are M1 = C = (4, 0)
and M2 = (0, 0.5), respectively. It is easy to check that the closest Mi is M2
for A and B and M1 for C, hence all points stay in the same class and the
algorithm terminates.
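The K-means iteration described above can be sketched in a few lines of
Python; running it on the points of Example 7.3 reproduces the computation:

import numpy as np

points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0]])  # A, B, C
assign = np.array([0, 1, 0])   # initial clusters: A, C red (0); B blue (1)

while True:
    # Means of the current clusters
    means = np.array([points[assign == k].mean(axis=0) for k in (0, 1)])
    # Re-assign each point to the cluster with the nearest mean
    distances = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
    new_assign = distances.argmin(axis=1)
    if np.array_equal(new_assign, assign):
        break                  # no point moved: the algorithm terminates
    assign = new_assign

print(assign)   # [1 1 0]: A and B blue, C red, as in Example 7.3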
Of course, for the nearest-neighbour classifier, K-means, and other related
methods, it is not necessary to use the Euclidean distance. For example, we
may instead use the Manhattan distance between points X = (x1 , . . . , xn )
and Y = (y1 , . . . , yn ), given by

    Σ_{i=1}^n |xi − yi |.

7.4 Probabilistic analysis
In the classification task, models using probabilistic analysis may be useful.
For example, assume that we try to decide whether an e-mail is spam or not
based on the information whether it contains the words “lottery” and “tablet”.

Example 7.4. Imagine that we have the following data:

• 60 e-mails without these words. 20 of them are spam, and 40 are not;

• 15 e-mails with the word “lottery” only. 10 of them are spam, and 5 are not;

• 24 e-mails with the word “tablet” only. 20 of them are spam, and 4 are not;

• 1 e-mail with both of these words, and it is not spam.

Based on this data we can calculate various conditional probabilities. For
example, if an e-mail contains no “lottery” and no “tablet”, we may estimate
that it has about a 20/60 = 1/3 chance of being a spam e-mail and about a
40/60 = 2/3 chance of being non-spam. In notation, let S be a random
variable indicating that the e-mail is spam if S = 1 and non-spam if S = 0,
let L be a random variable such that L = 1 if this random e-mail contains
the word “lottery” (and L = 0 if not), and let T be a similar random variable
for the word “tablet”. Then

    P (S = 1| T = L = 0) = 1/3,    P (S = 0| T = L = 0) = 2/3.

Because P (S = 0| T = L = 0) > P (S = 1| T = L = 0), we classify an e-mail
with T = L = 0 as non-spam. This method is called the maximum a
posteriori (MAP) decision rule.

In a similar way, we can apply MAP decision rule to all possible combi-
nations of value of L and T , and write a computer program which gives the
answers in all possible cases:

• If T = 0 (no “tablet” word in e-mail), then:

– If L = 0 (no “lottery” word in e-mail), then: NON-SPAM


– If L = 1 (there is a “lottery” word in e-mail), then: SPAM

• If T = 1 (there is a “tablet” word in e-mail), then:

– If L = 0 (no “lottery” word in e-mail), then: SPAM


– If L = 1 (there is a “lottery” word in e-mail), then: NON-SPAM
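
In Python, this set of rules could be coded, for instance, as follows (a sketch; the function name and calling convention are ours):

def classify_email(T, L):
    # T = 1 if the e-mail contains "tablet", L = 1 if it contains "lottery".
    if T == 0:
        if L == 0:
            return "NON-SPAM"
        else:
            return "SPAM"
    else:
        if L == 0:
            return "SPAM"
        else:
            return "NON-SPAM"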

This computer program is an example of a decision tree. In general, a decision tree may contain any number of these nested “if... then” structures. Instead of a final decision (SPAM or NON-SPAM), the program may return, for example, a real number which represents the probability that an e-mail is spam.
In Example 7.4, suppose that part of an e-mail is encoded and the filter cannot read it. In the open part, the filter sees the word “lottery” but not “tablet”, and it is not clear whether “tablet” is present in the encoded part or not. In this case, the MAP decision rule classifies the e-mail as spam if P(S = 1 | L = 1) > 0.5. This probability can be estimated from the law of total probability (13), or directly from the data as

P(S = 1 | L = 1) = (10 + 0)/(15 + 1) = 10/16 = 0.625,

because in total there are 15 + 1 = 16 e-mails with the word “lottery”, and 10 of them are spam.
In fact, in situations like Example 7.4, statisticians often make decisions based on different conditional probabilities, such as P(T = L = 0 | S = 1) and P(T = L = 0 | S = 0), which are examples of a likelihood function. The logic is that one asks: how likely would I be to find an e-mail looking like this (in our case, with neither the word “lottery” nor “tablet”) in the spam folder? And how likely would I be to find it in the non-spam folder? In our example, there are 50 spam e-mails, 20 of them with T = L = 0, and also 50 non-spam e-mails, 40 of them with T = L = 0, hence

P(T = L = 0 | S = 1) = 20/50 < 40/50 = P(T = L = 0 | S = 0).

Thus, observing an e-mail like this in the spam folder is about half as likely as finding such an e-mail in the non-spam folder. Hence, the e-mail should be classified as non-spam.
The two methods above (MAP and likelihood methods) are related by Bayes’ theorem

P[B | A] = P[A ∩ B] / P[A] = P[A | B] P[B] / P[A],

which in our case says that

P(S = 1 | T = L = 0) = P(T = L = 0 | S = 1) P(S = 1) / P(T = L = 0).

The same formula works for all other possible values of T and L, for example

P(S = 1 | T = L = 1) = P(T = L = 1 | S = 1) P(S = 1) / P(T = L = 1).
In fact, our data-based estimate P(S = 1 | T = L = 1) = P(T = L = 1 | S = 1) = 0 is unjustified, because there is just one e-mail with T = L = 1, and it makes little sense to estimate probabilities from a sample consisting of ONE experiment. This situation is very typical: there may be very few e-mails with the word pattern exactly as prescribed, or, in the general case, very few objects with values of the parameters exactly equal to some given values. An alternative way to estimate probabilities like P(T = L = 1 | S = 1) is to assume that the words occur independently (or at least independently conditional on the event S = 1), so that

P(T = L = 1 | S = 1) = P(T = 1 | S = 1) · P(L = 1 | S = 1),

and then

P(S = 1 | T = L = 1) = P(T = 1 | S = 1) · P(L = 1 | S = 1) · P(S = 1) / P(T = L = 1).
The classification based on probabilities calculated in this way is called naive
Bayes classification, with the word “naive” reflecting the assumption of in-
dependence.
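
For instance, with the counts from Example 7.4, the naive Bayes numerators can be computed as follows (a sketch, ours; since the denominator P(T = L = 1) is common to both classes, only the numerators need comparing):

# Counts from Example 7.4, split as [neither, "lottery" only, "tablet" only, both].
spam = [20, 10, 20, 0]       # 50 spam e-mails in total
non_spam = [40, 5, 4, 1]     # 50 non-spam e-mails in total

def numerator(counts, total=100):
    # P(T=1|class) * P(L=1|class) * P(class), using the independence assumption.
    n = sum(counts)
    p_tablet = (counts[2] + counts[3]) / n
    p_lottery = (counts[1] + counts[3]) / n
    return p_tablet * p_lottery * n / total

print(numerator(spam), numerator(non_spam))  # 0.04 > 0.006, so classify as spam

Note that, unlike the raw data-based estimate of 0 above, the naive Bayes classifier therefore classifies an e-mail with T = L = 1 as spam.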
Another probabilistic learning method which has recently been used with dramatic success is reinforcement learning. In reinforcement learning the learner is not given a target output in the same way as with supervised learning. The learner uses the input data to choose some output, and is then told how well it is doing, or how close the chosen output is to the desired output. The learner can then use this information, as well as the input data, to choose another hypothesis. This method has recently been applied to playing games, such as Chess and Go, in which a machine first plays against itself at random, and then automatically learns from experience, increasing the chance of selecting moves similar to ones that led to success in previous games. Alpha Zero, a generic program designed to play many games with the same algorithm, having just the rules of the games as input, quickly learned from self-play and beat all human players and all the specialised programs which humans had developed over many decades!
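
A toy illustration of this feedback loop (ours, not part of the original notes) is an ε-greedy bandit: a learner repeatedly chooses one of two actions with unknown win probabilities, observes only win/lose feedback, and gradually favours the action that has worked better so far:

import random

def bandit(win_prob, n_rounds=10000, eps=0.1):
    # Play a two-action game repeatedly; the only feedback is win/lose.
    wins, plays = [0, 0], [0, 0]
    for _ in range(n_rounds):
        if random.random() < eps or 0 in plays:
            a = random.randrange(2)  # explore: try a random action
        else:
            # exploit: pick the action with the best observed win rate
            a = 0 if wins[0] / plays[0] >= wins[1] / plays[1] else 1
        plays[a] += 1
        wins[a] += random.random() < win_prob[a]
    return plays

print(bandit([0.4, 0.6]))  # the second action ends up played far more often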

7.5 Stages of analysis in Machine Learning


Machine Learning tasks can often be broken down into a series of steps.

• Collecting data
The data must be assembled in a form suitable for analysis using
computers. Several different tools are useful for achieving this: a
spreadsheet may be used, or a database such as Microsoft Access.
Data may come from a variety of sources, including sample surveys,
population censuses, company administration systems, databases con-
structed for specific purposes (such as the Human Mortality Database,
www.mortality.org). During the last 20-30 years the size of datasets
available for analysis by actuaries and other researchers has increased
enormously. Datasets, such as those on purchasing behaviour collected
by supermarkets, relate to millions of transactions.

• Exploring and preparing the data


The data need to be prepared in such a way that a computer is able
to access it and apply a range of algorithms. If the data are already in
a spreadsheet, this may be a simple matter of importing the data into
whatever computer package is being used to develop the algorithms.
If the data are stored in complex file formats, it will be useful to con-
vert the data to rectangular format, with one line per case and one
column per variable. It is also important here to recognise the nature
of the variables being analysed: are they nominal, ordinal or continu-
ous? Next we should clean the data, which includes replacing missing values and checking the data for obvious errors.

• Feature scaling
Some Machine Learning techniques will only work effectively if the variables are of similar scale. If, for example, one variable (say x1 ) is measured in kilometres, and other variables are in metres, the value of x1 will be 1000 times smaller than it would be with the same data if it were measured in metres as well. This may lead to inadequate results in a number of machine learning methods such as linear regression. A common remedy is to rescale each variable, as in the sketch below.
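
A minimal Python sketch of min-max scaling, one common choice (the function name is ours):

def min_max_scale(values):
    # Rescale a list of numbers linearly onto [0, 1] (assumes not all values are equal).
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([2, 4, 10]))  # -> [0.0, 0.25, 1.0]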

• Splitting the data into training, validation and testing data sets
A typical split might be to use 60% of the data for training, 20% for validation and 20% for testing. However, the appropriate split depends on the problem and not just on the amount of data. A guide might be to select enough data for the validation data set and the testing data set so that the validation and testing processes can function, and to allocate the rest of the data to the training data set. In practice, this often leads to around a 60%/20%/20% split, as in the sketch below.
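
A possible Python sketch of such a split (ours; it assumes a simple random shuffle is appropriate for the data at hand):

import random

def split_data(data, seed=0):
    # Shuffle, then split into 60% training, 20% validation, 20% testing.
    data = list(data)
    random.Random(seed).shuffle(data)
    n = len(data)
    return data[:int(0.6 * n)], data[int(0.6 * n):int(0.8 * n)], data[int(0.8 * n):]

train, valid, test = split_data(range(100))
print(len(train), len(valid), len(test))  # -> 60 20 20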

• Training a model on the data

This involves choosing a suitable Machine Learning algorithm using a
subset of the data. The algorithm will typically represent the data as a
model and the model will have parameters which need to be estimated
from the data. This stage is analogous to the process of fitting a model
to data when using linear regression and generalised linear models.

• Validation and testing


The model should then be validated using the 20% of the data set
aside for this purpose. This should indicate, for example, whether we
are at risk of over-fitting our data. The results of the validation exercise
may mean that further training is required. Once the model has been
trained on a set of data, its performance should be evaluated. How
this is done may depend on the purpose of the analysis. If the aim
is prediction, then one obvious approach is to test the model on a set
of data different from the one used for development. If the aim is to
identify hidden patterns in the data, other measures of performance
may be needed.

• Improving model performance


We can measure the performance of the model by testing it on the 20%
of the data we have reserved for this purpose. The hope is that the
performance on the “test” data set is similar to that achieved on the
training data set. This amounts to stating that the difference between
the in-sample error and the out-of-sample error will be generally small.
If the performance of the model is not sufficient for the task at hand,
it may be possible to improve its performance. Sometimes the combi-
nation of several different algorithms applied to the same data set will
produce a performance which is substantially better than any individ-
ual model. In other cases, the use of more data might provide a boost
to performance. However, except when considering very simple com-
binations of models, care should be taken not to overfit the evaluation
set.

• The reproducibility of research


It is important that data analysis be reproducible. This means that
someone else can take the same data, analyse it in the same way, and
obtain the same results. In order that an analysis be reproducible the
following criteria are necessary:

– The data used should be fully described and available to other


researchers.

– Any modifications to the data (e.g. recoding or transformation of
variables, or computation of new variables) should be clearly de-
scribed, ideally with the computer code used. In Machine Learn-
ing this is often called “feature engineering”, whereby combina-
tions of features are used to create something more meaningful.
– The selection of the algorithm and the development of the model
should be described, again with computer code being made avail-
able. This should include the parameters of the model and how
and why they were chosen.

There is an inherent problem with reproducing stochastic models, in that these by necessity have a random element. Of course, details of the
random number generator seeds chosen, and the precise command and
package used to generate any randomness, could be presented. How-
ever, since stochastic models are typically run many times to produce a
distribution of results, it normally suffices that the distribution of the
results is reproducible.

7.6 Summary
Machine learning can be defined as the study of systems and algorithms that
improve their performance with experience.
Machine learning is typically used to solve classification problems, clustering, regression problems, analysis of association rules, discovering hidden or latent variables, etc.
One of the main problems in machine learning is overfitting, when the system works perfectly on the data it was trained on but performs badly on any other data. To test that this did not occur, one may divide the data into two groups, and use one group for training and the other for testing. We can then exchange the roles of the training and testing data; this procedure is called cross-validation. In fact, the data for training can be further divided into two groups: a training data set and a validation data set; the first one is used to find model parameters, and the second one to find hyper-parameters. The training-validation-test data proportion may be, for example, 60%−20%−20%.
In the yes-no classification problem, the correct outcomes are true positive (TP) and true negative (TN), while the incorrect ones are false positive (FP) and false negative (FN). These counts can be used to calculate various measures of performance, such as Precision, Recall, F1 score, and False Positive Rate.
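
For reference, these measures can be computed from the four counts as follows (a short sketch using the standard definitions; the numbers in the example call are ours):

def scores(TP, TN, FP, FN):
    # Standard performance measures for a yes-no classifier.
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = FP / (FP + TN)  # false positive rate
    return precision, recall, f1, fpr

print(scores(TP=40, TN=45, FP=5, FN=10))  # (0.888..., 0.8, 0.842..., 0.1)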
You should be able to define and understand the following terms and
methods:

• Supervised learning

• Unsupervised learning

• Linear classifier

• Support vector machines

• Nearest-neighbour classifier

• K-means algorithm

• Maximum a posteriori (MAP) decision rule

• Decision tree

• Likelihood function

• Naive Bayes classifier

• Reinforcement learning

Machine Learning tasks can often be broken down into a series of steps:

• Collecting data

• Exploring and preparing the data

• Feature scaling

• Splitting the data into the training, validation and testing data sets

• Training a model on the data

• Validation and testing

• Improving model performance

• The reproducibility of research

7.7 Questions
1. Approximate points (0, 0, 0), (1, 0, 2), (0, 1, 3), (1, 1, 4) in the coordinate
space (x1 , x2 , y) by a plane y = ax1 + bx2 + c to minimize the sum of
squares error.

2. There are two points marked on the plane: red point A with coordinates (0, 0) and blue point B with coordinates (10, 10). Then 4 points C, D, E, F arrive in order, and each is coloured in the same way as its nearest neighbour. The coordinates of F are (10, 8). Give examples of coordinates of points C, D, E such that point F will be coloured red.

3. Consider 4 points A, B, C, D on the plane with coordinates (0, 0),


(1, 0), (0, 1), and (2, 1), initially classified such that A and D are red
while B and C are blue. What would be the outcome of the K-means
algorithm (with K = 2) applied to these initial data?

4. A filter should classify e-mails into 3 categories: personal, work, and


spam. The statistics shows that approximately 30% of e-mails are
personal, 50% are work ones, and 20% are spam. It also shows that
word “friend” is included into 20% of personal e-mails, 5% of work
e-mails, and 30% of spam e-mails. In addition, the word “profit” is
included into 5% of personal e-mails, 30% of work e-mails, and 25%
of spam e-mails. A new e-mail arrives which contains both words “friend” and “profit”. Use naive Bayes classification to decide whether this e-mail is more likely to be personal, work, or spam.

Solutions of end-of-chapter questions
8.1 Chapter 1 solutions
1. A sample space consists of five elements Ω = {a1 , a2 , a3 , a4 , a5 }. For which
of the following sets of probabilities does the corresponding triple (Ω, A, P )
become a probability space? Why?

(a) p(a1 ) = 0.3; p(a2 ) = 0.2; p(a3 ) = 0.1; p(a4 ) = 0.1; p(a5 ) = 0.1;

(b) p(a1 ) = 0.4; p(a2 ) = 0.3; p(a3 ) = 0.1; p(a4 ) = 0.1; p(a5 ) = 0.1;

(c) p(a1 ) = 0.4; p(a2 ) = 0.3; p(a3 ) = 0.2; p(a4 ) = −0.1; p(a5 ) = 0.2;

Answer: Since Ω is finite, we may assume that A is the set of all subsets
of Ω. So we only have to look at the point probabilities p(ai ) = P ({ai }) for
i = 1, . . . , 5. From the definition of a discrete probability distribution, we
know that the sum of all point probabilities must be equal to 1, i.e. here we
must have
p(a1 ) + p(a2 ) + · · · + p(a5 ) = 1.
In part (a), the value of this sum is equal to 0.8, which means that P is not
a probability distribution and therefore (Ω, A, P ) is not a probability space.
We further know that probabilities can never be negative. In part (c), we
have p(a4 ) = −0.1, which means that (Ω, A, P ) is not a probability space.
In part (b), (Ω, A, P ) is indeed a probability space, since here all require-
ments are met.

2. Let X be a random variable from the continuous uniform distribution,


X ∼ U (0.5, 1.0). Starting with the probability density function, derive ex-
pressions for the cumulative distribution function, expectation and variance
of X.

Answer: The probability density function ρX (x) of X is a positive constant


on the interval (0.5, 1.0) and zero otherwise. Further, it has integral 1, so the right choice is given by

ρ_X(x) = 2 · I_(1/2,1)(x) = { 2, if x ∈ (1/2, 1);  0, otherwise }.

Here I_A(x) denotes the indicator function of a set A, which is equal to 1 if x ∈ A and equal to 0 otherwise. The cumulative distribution function is given by

F_X(x) = ∫_{−∞}^{x} ρ_X(z) dz = { 0, if x ≤ 1/2;  2x − 1, if x ∈ (1/2, 1);  1, if x ≥ 1 }.

Hence the expectation of X is

E[X] = ∫_{−∞}^{∞} x ρ_X(x) dx = ∫_{1/2}^{1} 2x dx = x² |_{x=1/2}^{x=1} = 1 − 1/4 = 3/4 = 0.75.

The variance of X can be calculated like this:

Var(X) = ∫_{−∞}^{∞} (x − 3/4)² ρ_X(x) dx = ∫_{1/2}^{1} 2(x − 3/4)² dx = (2/3)(x − 3/4)³ |_{x=1/2}^{x=1}
= (2/3)[(1/4)³ + (1/4)³] = 1/(3 × 16) = 1/48 = 0.020833...
3. Assets A and B have the following distribution of returns in various
states:
State Asset A Asset B Probability
1 10% −2% 0.2
2 8% 15% 0.2
3 25% 0% 0.3
4 −14% 6% 0.3
Show that the correlation between the returns on asset A and asset B is equal
to −0.3830.

Answer: Let RA and RB be the returns on assets A and B, respectively.


Then the correlation between RA and RB is given by

Corr(R_A, R_B) = Cov(R_A, R_B) / √(Var(R_A) Var(R_B)),

where Cov(R_A, R_B) = E(R_A R_B) − E(R_A)E(R_B) is the covariance between R_A and R_B. We have

E(R_A) = (10 × 0.2 + 8 × 0.2 + 25 × 0.3 + (−14) × 0.3)% = 6.9%,
Var(R_A) = E(R_A²) − (E(R_A))² = (10² × 0.2 + 8² × 0.2 + 25² × 0.3 + (−14)² × 0.3 − 6.9²)%% = (15.2148)² %%,
√Var(R_A) = 15.2148%,
E(R_B) = (−2 × 0.2 + 15 × 0.2 + 0 × 0.3 + 6 × 0.3)% = 4.4%,
Var(R_B) = E(R_B²) − (E(R_B))² = ((−2)² × 0.2 + 15² × 0.2 + 0² × 0.3 + 6² × 0.3 − 4.4²)%% = (6.1025)² %%,
√Var(R_B) = 6.1025%,
E(R_A R_B) = (10 × (−2) × 0.2 + 8 × 15 × 0.2 + 25 × 0 × 0.3 + (−14) × 6 × 0.3)%% = −5.2%%.

Note that % and %% stand for 1/100 and 1/100², respectively. Using the values above, we obtain

Corr(R_A, R_B) = (−5.2/100² − 6.9 × 4.4/100²) / (0.152148 × 0.061025) = −0.3830,
as required.
4. Formalise Example 8.5 as Ω = {ω1 , ω2 , ω3 , ω4 }, P ({ω1 }) = P ({ω2 }) =
P ({ω3 }) = P ({ω4 }) = 1/4 and
A := {ω1 , ω4 }, B := {ω2 , ω4 }, C := {ω3 , ω4 }.
Prove that the pairs (A, B), (A, C) and (B, C) are independent, but the triple
(A, B, C) is not mutually independent according to Definition 8.2.
Answer: We have
P(A) = P(B) = P(C) = 1/4 + 1/4 = 1/2,
P(A ∩ B) = P({ω4}) = 1/4 = P(A)P(B),
P(A ∩ C) = P({ω4}) = 1/4 = P(A)P(C),
P(B ∩ C) = P({ω4}) = 1/4 = P(B)P(C),

which shows that the pairs (A, B), (A, C) and (B, C) are independent. However,

P(A ∩ B ∩ C) = P({ω4}) = 1/4 ≠ 1/8 = P(A)P(B)P(C).
So the triple (A, B, C) is not mutually independent.

5. You intend to model the maximum daily temperature in your office as a


stochastic process. What time set and state space would you use?

Answer: It is reasonable to use a suitable discrete time set such as T =


{0, 1, 2, . . . } and a continuous state space such as S = R.

8.2 Chapter 2 solutions
1. The number of claims a company received during the last 12 months are

10, 8, 15, 10, 7, 3, 20, 14, 5, 12, 8, 8.

Assuming that these numbers are i.i.d. realizations of


(a) Poisson distribution with parameter λ
(b) negative binomial distribution with parameters p and k,
use method of moments to estimate unknown parameters.

Answer: Because the Poisson distribution has only 1 parameter, it suffices to consider only the first moment (expectation). The expectation of the Poisson distribution is λ. From the data, the estimate for the expectation is

(1/12)(10 + 8 + 15 + 10 + 7 + 3 + 20 + 14 + 5 + 12 + 8 + 8) = 10.

Hence, λ ≈ 10.
For the negative binomial distribution, the mean and variance are k(1−p)/p and k(1−p)/p², respectively, and the corresponding estimates from the data are 10 and

(1/12)(10² + 8² + 15² + 10² + 7² + 3² + 20² + 14² + 5² + 12² + 8² + 8²) − 10² = 20.

Hence, the system of equations is

k(1 − p)/p = 10,  k(1 − p)/p² = 20.

The solution to this system is p = 1/2, k = 10.

2. Assume that the same data as in question 1 are i.i.d. realizations of


geometric distribution with parameter p. Use method of maximum likelihood
to estimate p.

Answer: Denote the data as k1 , k2 , . . . , k12 . The logarithm of the likelihood function is

l(p) = Σ_{i=1}^{12} log[(1 − p)^{k_i} p] = 12 log(p) + log(1 − p) Σ_{i=1}^{12} k_i.

Then

l′(p) = 12/p − (1/(1 − p)) Σ_{i=1}^{12} k_i = 0

if

p = 12 / (12 + Σ_{i=1}^{12} k_i) = 12/132 = 1/11.

3. The history of n = 18 most recent claim sizes (rounded to integer


number of pounds) are

937, 342, 150, 1080, 401, 3500, 7970, 1400, 530,

1106, 847, 899, 3076, 2837, 315, 2560, 390, 2950.


Assuming that these are i.i.d data from Weibull distribution, use the method
of percentiles with α1 = 1/4 and α2 = 3/4 to estimate parameters of the
distribution.

Answer: The sorted data are

150, 315, 342, 390, 401, 530, 847, 899, 937,
1080, 1106, 1400, 2560, 2837, 2950, 3076, 3500, 7970.

The smallest integers greater than α1 n = 18/4 = 4.5 and α2 n = 13.5 are 5 and 14, respectively. The 5-th and 14-th data points in the sorted sequence are q1 = 401 and q2 = 2837, respectively. Hence, by (23) and (24), the parameters of the Weibull distribution are

γ = log( log(1 − α1)/log(1 − α2) ) / log( q1/q2 ) = log( log(1 − 0.25)/log(1 − 0.75) ) / log( 401/2837 ) ≈ 0.8037,

and

c = −log(1 − α1)/q1^γ ≈ −log(1 − 0.25)/401^{0.8037} ≈ 0.002327.

4. Assume that the history of claim sizes is the same as in the previous question, but the company orders a reinsurance policy with excess of loss reinsurance above the level M = 2000.
(a) Write down the history of expenses of the reinsurer;
(b) Assuming that the original claim size distribution is the Pareto distribution with parameters α > 2 and λ > 0, estimate the unknown parameters using the method of moments with the data available to the reinsurer.
(c) Comment on whether you think the Pareto distribution is a good model to fit these data.
Answer: (a) The history of reinsurer expenses is

1500, 5970, 1076, 837, 560, 950.

(b) Using formulas from Example 2.1 with k = 6 and wi as above, we get

(1/k) Σ_{i=1}^{k} w_i = (1/6)(1500 + 5970 + 1076 + 837 + 560 + 950) = 1815.5

and

(1/k) Σ_{i=1}^{k} w_i² − ( (1/k) Σ_{i=1}^{k} w_i )² = (1/6)(1500² + 5970² + 1076² + 837² + 560² + 950²) − 1815.5² = 3531517.25,

so the system of equations to find the parameters becomes

1815.5 = (λ + 2000)/(α − 1)

and

3531517.25 = α(λ + 2000)² / ((α − 1)²(α − 2)).

Squaring both sides of the first equation and dividing by the second one, we get

(1815.5)²/3531517.25 = ((λ + 2000)/(α − 1))² : ( α(λ + 2000)²/((α − 1)²(α − 2)) ) = (α − 2)/α.

Hence, (α − 2)/α ≈ 0.9333 and α ≈ 2/(1 − 0.9333) ≈ 30. Then

λ = 1815.5(α − 1) − 2000 ≈ 50649.5.

(c) The resulting parameters do not look realistic. One would expect much smaller values, such as α ≈ 3 or α ≈ 4. We may conclude that the Pareto distribution is not the best model to fit these data.

8.3 Chapter 3 solutions
1. Assume that the number N of claims can be any integer from 1 to 100
with equal chances, and the claim sizes X1 , . . . , XN are i.i.d. from Pareto
distribution with parameters α = 3 and λ = 2. Estimate mean and variance
of the aggregate claim S.

Answer: The mean and variance of the Pareto distribution Pa(3, 2) are

µ_X = λ/(α − 1) = 1,  and  σ_X² = αλ²/((α − 1)²(α − 2)) = 3.

The mean and variance of N are

µ_N = (1/100) Σ_{i=1}^{100} i = 50.5,  and  σ_N² = (1/100) Σ_{i=1}^{100} (i − µ_N)² = (100² − 1)/12 = 833.25,

see (4) for the derivation of the variance. Hence, the mean and variance of S are

µ_S = µ_N · µ_X = 50.5 · 1 = 50.5

and

σ_S² = µ_N σ_X² + σ_N² µ_X² = 50.5 · 3 + 833.25 · 1² = 984.75.
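
As a quick sanity check (ours, not part of the original solution), these two values can be confirmed by Monte Carlo simulation, sampling the Pareto claims by the inverse transform X = λ(U^{−1/α} − 1):

import numpy as np

rng = np.random.default_rng(0)
n_sims = 100_000
counts = rng.integers(1, 101, size=n_sims)  # N is uniform on {1, ..., 100}
# Pareto(alpha=3, lambda=2) via inverse transform, with U in (0, 1].
totals = np.array([np.sum(2.0 * ((1.0 - rng.random(n)) ** (-1 / 3) - 1))
                   for n in counts])
print(totals.mean(), totals.var())  # close to 50.5 and 984.75, respectively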

2. The number of claims an insurance company receives in April follows a Poisson distribution with λ = 45/7, the sizes of claims are i.i.d. and follow a uniform distribution on [1,000; 2,000].
(a) Estimate the probability that the company will receive at least 3
claims in April.
(b) Estimate the mean and variance for the total size of all April’s claims.
(c) Estimate the probability that the total size of all April’s claims will
be strictly less than 3, 000.

Answer: (a) Let N be the number of April claims. Then P(N = n) = e^{−λ} λⁿ/n!. Hence

P(N = 0) = e^{−45/7} (45/7)⁰/0! ≈ 0.0016,
P(N = 1) = e^{−45/7} (45/7)¹/1! ≈ 0.0104,
P(N = 2) = e^{−45/7} (45/7)²/2! ≈ 0.0334,

P(N ≥ 3) = 1 − P(N = 0) − P(N = 1) − P(N = 2) ≈ 1 − 0.0016 − 0.0104 − 0.0334 = 0.9546.

(b) Let Y be a claim size, uniformly distributed on [a, b], where a = 1,000, b = 2,000. Then the density of Y is f(y) = 1/(b − a), a ≤ y ≤ b, and

E[Y] = ∫_a^b y/(b − a) dy = (1/(b − a)) (y²/2)|_a^b = (a + b)/2 = 1,500,
E[Y²] = ∫_a^b y²/(b − a) dy = (1/(b − a)) (y³/3)|_a^b = (a² + ab + b²)/3 = (7/3) · 10⁶.

Then for the aggregate claim size S

E[S] = λE[Y] = (45/7) · 1,500 ≈ 9,643,
Var[S] = λE[Y²] = (45/7) · (7/3) · 10⁶ = 15 · 10⁶.
(c) If there are at least 3 claims, then the total size will be at least 3,000. If there are 2 claims, there is a 50% chance that the total size is less than 3,000 (the sum of two such claims is symmetric about 3,000). If there is 1 claim or no claim, the total size will surely be less than 3,000. Hence, the answer is

P(N = 0) + P(N = 1) + P(N = 2)/2 ≈ 0.0016 + 0.0104 + 0.0334/2 = 0.0287.

3. The claim size X follows a log-normal distribution with parameters


µ and σ, where σ is known but µ is not. Instead, we model µ as another random variable such that λ = e^{µ+σ²/2} has mean p and variance s². Estimate the mean and variance of X.

Answer: By the properties of the log-normal distribution,

E[X | λ] = e^{µ+σ²/2} = λ,

and

Var[X | λ] = (e^{σ²} − 1) e^{2µ+σ²} = (e^{σ²} − 1) λ².

By the law of total expectation,

E[X] = E[E[X | λ]] = E[λ] = p.

By the law of total variance,

Var(X) = E[Var(X | λ)] + Var(E(X | λ)) = E[(e^{σ²} − 1) λ²] + Var(λ)
= (e^{σ²} − 1)(p² + s²) + s² = e^{σ²}(p² + s²) − p².

4. The number N of claims to be received by an insurance company next year follows a negative binomial distribution with parameters k = 20 and p = 0.25. The claim sizes X1 , . . . , XN are i.i.d. and follow an exponential distribution with parameter λ = 0.005. Assuming that the aggregate claim size S is approximately normally distributed, estimate the probability that S will not exceed 20,000.

Answer: If X follows an exponential distribution with λ = 0.005, then µ_X = E[X] = 1/λ = 200 and Var[X] = 1/λ² = 40,000. Hence, E[X²] = Var[X] + (E[X])² = 80,000.
By (38),

µ_S = E[S] = (k(1 − p)/p) µ_X = (20(1 − 0.25)/0.25) · 200 = 12,000.

By (39),

σ_S² = (k(1 − p)/p) E[X²] + (k(1 − p)²/p²) (E[X])²
= (20(1 − 0.25)/0.25) · 80,000 + (20(1 − 0.25)²/0.25²) · (200)² = 12,000,000.

Hence,

P(S ≤ 20,000) = P( (S − 12,000)/√12,000,000 ≤ (20,000 − 12,000)/√12,000,000 ) ≈ P(Z ≤ 2.3),

where Z follows the standard normal distribution. From tables, P(Z ≤ 2.3) ≈ 0.99.

8.4 Chapter 4 solutions
1. (a) Calculate the Hazard rate of the Pareto distribution. Check if it is an
increasing or decreasing function.
(b) Calculate the Mean residual life of the Pareto distribution. Check if
it is an increasing or decreasing function.
(c) What conclusions about the tails of the Pareto distribution can we make based on parts (a) and (b)?
Answer: The Pareto distribution with parameters α > 0 and λ > 0 has PDF

f(x) = αλ^α/(λ + x)^{α+1},  x > 0,

and CDF

F(x) = 1 − (λ/(λ + x))^α.

(a) The Hazard rate is

h(x) = f(x)/(1 − F(x)) = ( αλ^α/(λ + x)^{α+1} ) : ( λ^α/(λ + x)^α ) = α/(λ + x).

The derivative

h′(x) = −α/(λ + x)² < 0

is negative, hence h(x) is a decreasing function.
(b) The Mean residual life is

e(x) = ∫_x^∞ (1 − F(y)) dy / (1 − F(x)) = ((λ + x)/λ)^α ∫_x^∞ (λ/(λ + y))^α dy
= ((λ + x)/λ)^α · λ^α (λ + x)^{1−α}/(α − 1) = (λ + x)/(α − 1),

which is an increasing function of x provided that α > 1.
(c) Because the Hazard rate h(x) is a decreasing function and the Mean residual life is an increasing function, this is an indication that the Pareto distribution has a heavy tail. If the claims follow this distribution, we can expect occasional very large claims with a probability that is not very small.
2. Prove the formula for F(x) in Example 4.1.
Answer: Substituting F_X(x) = 1 − exp(−λx), a_n = (1/λ) ln n and β_n = 1/λ into (45), we get

F(x) = lim_{n→∞} [1 − exp(−λ((1/λ) ln n + (1/λ) x))]ⁿ = lim_{n→∞} (1 − exp(−x − ln n))ⁿ
= lim_{n→∞} (1 − e^{−x}/n)ⁿ = e^{−e^{−x}}.

3. (a) Check that function ψ(t) = − ln t, 0 < t ≤ 1, is continuous, strictly


decreasing, convex, and ψ(1) = 0;
(b) Find its inverse function ψ −1 ;
(c) Find C(u, v) = ψ −1 (ψ(u) + ψ(v)), see (51).

Answer: (a) The continuity of ψ(t) is obvious. The derivative ψ′(t) = −1/t is negative, hence ψ(t) is strictly decreasing. The second derivative ψ″(t) = 1/t² is positive, hence ψ(t) is a convex function. Finally, ψ(1) = −ln 1 = 0.
(b) To find inverse function we solve equation ψ(t) = − ln t = x to get
answer t = e−x , hence ψ −1 (x) = exp(−x).
(c) By (51),

C(u, v) = ψ −1 (ψ(u) + ψ(v)) = exp(−(− ln u − ln v)) = exp(ln(uv)) = uv.

Hence, in this case C(u, v) is the independence copula.

4. Repeat the previous question for
(a) ψ(t) = (−ln t)^α, 0 < t ≤ 1, where α ≥ 1 is a parameter;
(b) ψ(t) = −ln( (e^{−αt} − 1)/(e^{−α} − 1) ), 0 < t ≤ 1, where α ≠ 0 is a parameter;
(c) ψ(t) = (1/α)(t^{−α} − 1), 0 < t ≤ 1, where α ≠ 0 is a parameter.

Answer: (a)

ψ(t) = (−ln t)^α,  0 < t ≤ 1,

where α ≥ 1 is a parameter. Then

ψ′(t) = α(−ln t)^{α−1} (−1/t) < 0

and

ψ″(t) = α(α − 1)(−ln t)^{α−2} (1/t²) + α(−ln t)^{α−1} (1/t²) > 0,

hence ψ(t) is strictly decreasing and convex. We also have ψ(1) = (−ln 1)^α = 0. To find the inverse function we solve the equation ψ(t) = (−ln t)^α = x to get the answer t = exp(−x^{1/α}). Then by (51),

C(u, v) = ψ^{−1}(ψ(u) + ψ(v)) = exp{ −((−ln u)^α + (−ln v)^α)^{1/α} },

which is exactly the Gumbel copula.

(b) Let

ψ(t) = −ln( (e^{−αt} − 1)/(e^{−α} − 1) ),  0 < t ≤ 1,

where α ≠ 0 is a parameter. Then

ψ′(t) = −( (e^{−α} − 1)/(e^{−αt} − 1) ) · ( −αe^{−αt}/(e^{−α} − 1) ) = αe^{−αt}/(e^{−αt} − 1).

If α > 0 then e^{−αt} − 1 < 0, while if α < 0 then e^{−αt} − 1 > 0. In either case, ψ′(t) < 0.
Next,

ψ″(t) = ( −α²e^{−αt}(e^{−αt} − 1) − αe^{−αt}(−αe^{−αt}) ) / (e^{−αt} − 1)² = α²e^{−αt}/(e^{−αt} − 1)² > 0,

hence ψ(t) is strictly decreasing and convex. We also have ψ(1) = −ln( (e^{−α} − 1)/(e^{−α} − 1) ) = 0, so all the conditions are satisfied. To find the inverse function we solve the equation ψ(t) = −ln( (e^{−αt} − 1)/(e^{−α} − 1) ) = x. Equivalently, (e^{−αt} − 1)/(e^{−α} − 1) = e^{−x}, or e^{−αt} − 1 = e^{−x}(e^{−α} − 1), hence t = −(1/α) ln( e^{−x}(e^{−α} − 1) + 1 ). For x = ψ(u) + ψ(v),

exp(−x) = exp( ln( (e^{−αu} − 1)/(e^{−α} − 1) ) + ln( (e^{−αv} − 1)/(e^{−α} − 1) ) ) = (e^{−αu} − 1)(e^{−αv} − 1)/(e^{−α} − 1)²,

and by (51),

C(u, v) = ψ^{−1}(x) = −(1/α) ln( 1 + (e^{−αu} − 1)(e^{−αv} − 1)/(e^{−α} − 1) ),

which is the Frank copula.
(c) Let

ψ(t) = (1/α)(t^{−α} − 1),  0 < t ≤ 1,

where α ≠ 0 is a parameter. Then

ψ′(t) = −t^{−α−1} < 0

and

ψ″(t) = (α + 1)t^{−α−2},

hence ψ(t) is strictly decreasing for all α ≠ 0 and convex for α > −1. We also have ψ(1) = (1/α)(1^{−α} − 1) = 0. To find the inverse function we solve the equation ψ(t) = (1/α)(t^{−α} − 1) = x to get the answer t = (αx + 1)^{−1/α}. Then by (51),

C(u, v) = ψ^{−1}(ψ(u) + ψ(v)) = (αψ(u) + αψ(v) + 1)^{−1/α} = (u^{−α} + v^{−α} − 1)^{−1/α},
which is the Clayton copula.

5. Calculate the coefficients of lower and upper tail dependence of two


random variables with
(a) the independence copula,
(b) the Gumbel copula with α ≥ 1,
(c) the Frank copula with α 6= 0,
(d) the Clayton copula with α > 0.

Answer: (a) For the independence copula C(u, v) = uv, we have

C(u, u) = u²,

λ_L = lim_{u→0+} C(u, u)/u = lim_{u→0+} u²/u = 0,

C̄(u, u) = −1 + u + u + C(1 − u, 1 − u) = −1 + 2u + (1 − u)² = u²,

λ_U = lim_{u→0+} C̄(u, u)/u = lim_{u→0+} u²/u = 0.
(b) For the Gumbel copula C(u, v) = exp{ −((−ln u)^α + (−ln v)^α)^{1/α} }, we have

C(u, u) = exp{ −(2(−ln u)^α)^{1/α} } = exp(2^{1/α} ln u) = u^β,

where β = 2^{1/α}. Because α ≥ 1, we have 1 < β ≤ 2. Then

λ_L = lim_{u→0+} C(u, u)/u = lim_{u→0+} u^{β−1} = 0.

Next,

C̄(u, u) = −1 + 2u + C(1 − u, 1 − u) = −1 + 2u + (1 − u)^β,

and

λ_U = lim_{u→0+} C̄(u, u)/u = 2 + lim_{u→0+} ((1 − u)^β − 1)/u = 2 − β = 2 − 2^{1/α}.
(c) For the Frank copula

C(u, v) = −(1/α) ln( 1 + (e^{−αu} − 1)(e^{−αv} − 1)/(e^{−α} − 1) ),

we have

C(u, u) = −(1/α) ln( 1 + (e^{−αu} − 1)²/(e^{−α} − 1) ).

As u → 0+, e^{−αu} − 1 ≈ −αu, and

C(u, u) ≈ −(1/α) ln( 1 + α²u²/(e^{−α} − 1) ) ≈ −(1/α) · α²u²/(e^{−α} − 1) = −αu²/(e^{−α} − 1),

hence

λ_L = lim_{u→0+} C(u, u)/u = −( α/(e^{−α} − 1) ) lim_{u→0+} u = 0.

Next,

C̄(u, u) = −1 + 2u + C(1 − u, 1 − u) = −1 + 2u − (1/α) ln( 1 + (e^{−α(1−u)} − 1)²/(e^{−α} − 1) ),

and

λ_U = lim_{u→0+} C̄(u, u)/u = 0.
(d) For the Clayton copula C(u, v) = (u−α + v −α − 1)−1/α , we have

C(u, u) = (2u−α − 1)−1/α .

If α > 0 and u → 0, then u−α → ∞, and therefore 2u−α − 1 ≈ 2u−α . Hence,

C(u, u) ≈ (2u−α )−1/α = 2−1/α · u,

and

λ_L = lim_{u→0+} C(u, u)/u = 2^{−1/α}.
Also,

C̄(u, u) = −1 + 2u + C(1 − u, 1 − u) = −1 + 2u + (2(1 − u)−α − 1)−1/α .

If u → 0, (1 + Au)B ≈ 1 + ABu for any constants A, B, hence

(2(1 − u)−α − 1)−1/α ≈ (2(1 + αu) − 1)−1/α = (1 + 2αu)−1/α ≈ 1 − 2u,

where ≈ is up to the terms of order u2 . Hence,

λ_U = lim_{u→0+} C̄(u, u)/u = lim_{u→0+} (−1 + 2u + 1 − 2u + O(u²))/u = 0.

8.5 Chapter 5 solutions
1. Consider a Markov chain with state space S = {0, 1, 2} and transition
matrix

P =
( p        q     0   )
( 1/4      0     3/4 )
( p − 1/2  7/10  1/5 ).

(a) Calculate values for p and q.
(b) Draw the transition graph for the process.
(c) Calculate the transition probabilities p_{i,j}^{(3)}.
(d) Find any stationary distributions for the process.
Answer: (a) The sum of all entries in the last row must be equal to 1, as a consequence of which p = 1 − 1/5 − 7/10 + 1/2 = 3/5. In view of the first row, we see that q = 2/5.
(b) [Transition graph: three nodes 0, 1, 2, with arrows 0→1 labelled 2/5, 1→0 labelled 1/4, 1→2 labelled 3/4, 2→1 labelled 7/10, 2→0 labelled 1/10, and self-loops 0→0 labelled 3/5 and 2→2 labelled 1/5.]
(c) With the help of (a) we get

P =
( 3/5   2/5   0   )
( 1/4   0     3/4 )
( 1/10  7/10  1/5 ),

P² =
( 23/50   6/25  3/10    )
( 9/40    5/8   3/20    )
( 51/200  9/50  113/200 ),

P³ =
( 183/500   197/500  6/25    )   ( 0.366    0.394   0.24    )
( 49/160    39/200   399/800 ) = ( 0.30625  0.195   0.49875 )
( 509/2000  199/400  31/125  )   ( 0.2545   0.4975  0.248   ).

The values p_{i,j}^{(3)} are the entries of P³, i.e. P³ = (p_{i,j}^{(3)})_{i,j∈S}. For example, we have p_{1,2}^{(3)} = 0.49875. It should be mentioned that higher powers of P can be evaluated using the property P^{k+ℓ} = P^k P^ℓ, (k, ℓ ∈ N). E.g. the calculation of P⁴ = (P²)² does not require the calculation of P³.
(d) It can be shown that the only stationary distribution is given by

π = (π1, π2, π3) = (55/179, 64/179, 60/179) ≈ (0.30726, 0.35754, 0.33520).

Indeed this follows, if we solve the linear equations πP = π for π1, π2, π3 ∈ [0, 1] with π1 + π2 + π3 = 1. More precisely, we have

(3/5)π1 + (1/4)π2 + (1/10)π3 = π1,   (83)
(2/5)π1 + 0π2 + (7/10)π3 = π2,   (84)
0π1 + (3/4)π2 + (1/5)π3 = π3,   (85)
π1 + π2 + π3 = 1.   (86)

From (85) it follows that π3 = (15/16)π2. Using this in (84), we see that π2 = (64/55)π1 and, in turn, π3 = (12/11)π1. In view of (86), we then get π1(1 + 64/55 + 12/11) = 1, i.e. π1 = 55/179. From the above, we then obtain the remaining values π2 and π3 as indicated. We did not use (83). This equation must be valid, since P is a stochastic matrix. Therefore, (83) can be used to check our solution.
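
As a numerical cross-check (ours, not part of the original solution), the stationary distribution can also be obtained by solving πP = π together with π1 + π2 + π3 = 1 in numpy:

import numpy as np

P = np.array([[3/5, 2/5, 0], [1/4, 0, 3/4], [1/10, 7/10, 1/5]])
# pi (P - I) = 0 plus the normalisation row, solved as a least-squares system.
A = np.vstack([(P - np.eye(3)).T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi = np.linalg.lstsq(A, b, rcond=None)[0]
print(pi)  # [0.30726... 0.35754... 0.33519...] = (55/179, 64/179, 60/179)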

2. Prove equation (28) relating the probability of a particular path occuring


in a Markov chain.

Answer: We have to show that, if {Xk }k∈Z+ (Z+ = {0, 1, 2, . . . }) is a Markov


chain, then
P(X0 = j0, X1 = j1, . . . , XN = jN) = P(X0 = j0) Π_{n=0}^{N−1} p_{j_n, j_{n+1}}(n, n + 1)

for N ∈ N and states j0 , . . . , jN ∈ S. Note that we do not assume a time-


homogeneous chain, as a consequence of which the one-step transition prob-
abilities pi,j (n, n + 1) also depend on time n. The above equation can be
proved by induction over N . Indeed, if N = 1, then the equation can be
shown this way:

P (X0 = j0 , X1 = j1 ) = P (X0 = j0 )P (X1 = j1 | X0 = j0 )


= P (X0 = j0 )pj0 ,j1 (0, 1).

Suppose the equation is true for a N ∈ N, then, using the Markov property

of {Xk }, we get

P(X0 = j0, X1 = j1, . . . , XN+1 = jN+1)
= P(XN+1 = jN+1 | X0 = j0, X1 = j1, . . . , XN = jN) × P(X0 = j0, X1 = j1, . . . , XN = jN)
= P(XN+1 = jN+1 | XN = jN) × P(X0 = j0) Π_{n=0}^{N−1} p_{j_n, j_{n+1}}(n, n + 1)
= p_{j_N, j_{N+1}}(N, N + 1) × P(X0 = j0) Π_{n=0}^{N−1} p_{j_n, j_{n+1}}(n, n + 1)
= P(X0 = j0) Π_{n=0}^{N} p_{j_n, j_{n+1}}(n, n + 1),

which completes our induction proof.

3. A No-Claims Discount system operated by a motor insurer has the fol-


lowing four levels:

Level 1: 0% discount;

Level 2: 25% discount;

Level 3: 40% discount;

Level 4: 60% discount.

The rules for moving between these levels are as follows:

• Following a year with no claims, move to the next higher level, or


remain at level 4.

• Following a year with one claim, move to the next lower level, or remain
at level 1.

• Following a year with two or more claims, move down two levels, or
move to level 1 (from level 2) or remain at level 1.

For a given policyholder in a given year the probability of no claims is 0.85


and the probability of making one claim is 0.12. Xt denotes the level of the
policyholder in year t.

(i) Explain why Xt is a Markov chain. Write down the transition matrix
of this chain.

(ii) Calculate the probability that a policyholder who is currently at level
2 will be at level 2 after:

(a) one year.


(b) two years.
(c) three years.

(iii) Explain whether the chain is irreducible and/or aperiodic.

(iv) Does this Markov chain converge to a stationary distribution?

(v) Calculate the long-run probability that a policyholder is in discount


level 2.

Answer:
(i) It is clear that X(t) is a Markov chain; knowing the present state, any
additional information about the past is irrelevant for predicting the
next transition.
Then the transition matrix is given by

P =
( 0.15  0.85  0     0    )
( 0.15  0     0.85  0    )
( 0.03  0.12  0     0.85 )
( 0     0.03  0.12  0.85 ).

(ii) (a) For the one year transition p22 = 0, since with probability 1, the chain will leave the state 2.
(b) The second order transition matrix is given by

P^(2) = P ∗ P =
( 0.15    0.1275  0.7225  0      )
( 0.048   0.2295  0       0.7225 )
( 0.0225  0.051   0.204   0.7225 )
( 0.0081  0.0399  0.1275  0.8245 ),

and thus p22^(2) = 0.2295.
(c) The relevant entry from the third order transition matrix is .062475.

(iii) The chain is irreducible as any state is reachable from any other state. It is also aperiodic. For states 1 and 4 the chain can simply remain there. This is not the case for states 2 and 3. However these are also aperiodic, since starting from 2 the chain can return to 2 in both 2 and 3 transitions, by the previous part of the question. Similarly the chain started at 3 can return to 3 in two steps (look at P²) and in three steps.

(iv) The chain is irreducible and has a finite state space and thus has a
unique stationary distribution.

(v) To find the long run probability that the chain is at level 2 we need to
calculate the unique stationary distribution π. This amounts to solving
the matrix equation πP = π. This is a system of 4 equations in 4
unknowns given by

π1 = .15π1 + .15π2 + .03π3 (87)


π2 = .85π1 + .12π3 + .03π4 (88)
π3 = .85π2 + .12π4 (89)
π4 = .85π3 + .85π4 . (90)

We discard the first equation and replace it by

π1 + π2 + π3 + π4 = 1.

Using substitutions or any other method we solve the system to obtain

(π1 , π2 , π3 , π4 ) = (.01424, .05269, .13996, .79311).

Let p_{ij}^{(n)} be the n-step transition probability of an irreducible aperiodic Markov chain on a finite state space. Then lim_{n→∞} p_{ij}^{(n)} = π_j for each i and j. Thus the long run probability that the chain is in state 2 is given by
π2 = .05269.

8.6 Chapter 6 solutions
1. Claims are known to follow a Poisson process with a uniform rate of 3 per
day.

(a) Calculate the probability that there will be fewer than 1 claim on a
given day.

(b) Estimate the probability that another claim will be reported during the
next hour. State all assumptions made.

(c) If there have not been any claims for over a week, calculate the expected
time before a new claim occurs.

Answer: Let {Nt }t∈[0,∞) denote our Poisson process with rate λ = 3, where
the time is measured in days.
(a) We have to evaluate P (Nt+1 − Nt < 1) for a fixed t ≥ 0. But this is
equal to
P (N1 = 0) = e−λ = e−3 = 0.04979.
(b) We look for the probability that, during the time interval (t, t + 1/24] for a fixed t, at least one claim will be reported, i.e.

P(N_{t+1/24} − N_t ≥ 1) = P(N_{1/24} ≥ 1) = 1 − P(N_{1/24} = 0) = 1 − e^{−λ/24} = 1 − e^{−1/8} = 0.11750.

(c) Conditional on N7 = 0, we can assume that {N_{t+7}}_{t∈[0,∞)} behaves like a Poisson process (Ñ_t)_{t∈[0,∞)} with parameter λ = 3. But here the first jump (claim) occurs at a random time τ̃1, which has an exponential distribution with parameter λ. It is well-known that the expectation is E(τ̃1) = 1/λ = 1/3.

2. Prove equation (38), which gives the Chapman-Kolmogorov equations for


a Markov jump process.

Answer: Let {Xt }t∈[0,∞) be a (not necessarily time-homogeneous) Markov


process with discrete state space S and transition probabilities pi,j (s, t) =
P (Xt = j | Xs = i) where i, j ∈ S, 0 ≤ s < t < ∞ and we assume that
P (Xs = i) > 0. We have to show that
p_{i,j}(t1, t3) = Σ_{k∈S} p_{i,k}(t1, t2) p_{k,j}(t2, t3),

where i, j ∈ S and 0 ≤ t1 < t2 < t3 < ∞. We have

p_{i,j}(t1, t3) = P(X_{t3} = j | X_{t1} = i)
= Σ_{k∈S} P(X_{t3} = j, X_{t2} = k | X_{t1} = i)
= Σ_{k∈S} P(X_{t3} = j | X_{t1} = i, X_{t2} = k) P(X_{t2} = k | X_{t1} = i)
= Σ_{k∈S} P(X_{t3} = j | X_{t2} = k) P(X_{t2} = k | X_{t1} = i)
= Σ_{k∈S} p_{i,k}(t1, t2) p_{k,j}(t2, t3).

3. Consider the sickness-death model given in Figure 9, write down an


integral expression for pHD (s, t).
Answer: We have

p_{HD}(s, t) = ∫_0^{t−s} ( p_{SD}(s + w, t)σ(s + w) + p_{DD}(s + w, t)µ(s + w) ) × exp( −∫_s^{s+w} (σ(u) + µ(u)) du ) dw
= ∫_0^{t−s} ( p_{SD}(s + w, t)σ(s + w) + µ(s + w) ) × exp( −∫_s^{s+w} (σ(u) + µ(u)) du ) dw.

The individual remains in the healthy state from time s to time s+w and then
jumps to the state dead (where he remains) or to the state sick (where he
jumps to state dead by time t). Note that here pDD (s+w, t) = 1. Further note
that the formula for pHS (s, t) did not contain the term pDS (s + w, t)µ(s + w),
since the probability to jump from dead to sick is equal to zero.

4. Let {Xt , t ≥ 0} be a time-homogeneous Markov process with state space


S = {0, 1} and transition rates q01 = α, q10 = β.

(a) Write down the generator matrix for this process.

(b) Solve the Kolmogorov’s forward equations for this Markov jump process
to find all transition probabilies.

(c) Check that the Chapman–Kolmogorov equations hold.

(d) What is the probability that the process will be in state 0 in the long
term? Does it depend on the initial state?

Answer: (a) The generator matrix Q is given by

Q =
( −α  α  )
( β   −β ).

(b) The Kolmogorov forward equations dP(t)/dt = P(t)Q therefore become

dp00(t)/dt = −αp00(t) + βp01(t),
dp01(t)/dt = αp00(t) − βp01(t),
dp10(t)/dt = −αp10(t) + βp11(t),
dp11(t)/dt = αp10(t) − βp11(t).

Substituting p01(t) = 1 − p00(t) into the first equation, we get the equation

dp00(t)/dt = −αp00(t) + β(1 − p00(t)) = −(α + β)p00(t) + β,

which has a general solution

p00(t) = β/(α + β) + Ce^{−(α+β)t}.

The initial condition p00(0) = 1 leads to C = α/(α + β), so finally we get

p00(t) = β/(α + β) + (α/(α + β)) e^{−(α+β)t}.

The transition probabilities p01(t), p10(t), and p11(t) can be found similarly. They are:

p01(t) = α/(α + β) − (α/(α + β)) e^{−(α+β)t};
p10(t) = β/(α + β) − (β/(α + β)) e^{−(α+β)t};
p11(t) = α/(α + β) + (β/(α + β)) e^{−(α+β)t}.

(c) For a time-homogeneous Markov process the Chapman–Kolmogorov equations take the form

p_{ij}(t + s) = Σ_{k∈S} p_{ik}(t) p_{kj}(s).

In our case, S = {0, 1}, thus there are 4 equations. For example, for i = j = 0 we get

p00(t + s) = p00(t) p00(s) + p01(t) p10(s).

So we should check that

β/(α + β) + (α/(α + β)) e^{−(α+β)(t+s)}
= ( β/(α + β) + (α/(α + β)) e^{−(α+β)t} ) ( β/(α + β) + (α/(α + β)) e^{−(α+β)s} )
+ ( α/(α + β) − (α/(α + β)) e^{−(α+β)t} ) ( β/(α + β) − (β/(α + β)) e^{−(α+β)s} ),

which is a straightforward exercise. The three other Chapman–Kolmogorov equations can be checked similarly.
(d)

lim_{t→∞} p00(t) = lim_{t→∞} p10(t) = β/(α + β),

so the probability that the process will be in state 0 in the long term is β/(α + β), and it does not depend on the initial state.
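
As a numerical cross-check (ours, assuming scipy is available), the transition matrix P(t) solving the forward equations is the matrix exponential exp(Qt); here with sample values α = 2, β = 1, t = 0.5:

import numpy as np
from scipy.linalg import expm

alpha, beta, t = 2.0, 1.0, 0.5
Q = np.array([[-alpha, alpha], [beta, -beta]])
print(expm(Q * t))  # the matrix of transition probabilities p_ij(t)
# Closed-form p00(t) for comparison:
print(beta / (alpha + beta) + alpha / (alpha + beta) * np.exp(-(alpha + beta) * t))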

8.7 Chapter 7 solutions
1. Approximate points (0, 0, 0), (1, 0, 2), (0, 1, 3), (1, 1, 4) in the coordinate
space (x1 , x2 , y) by a plane y = ax1 + bx2 + c to minimize the sum of squares
error.

Answer: For (x1 , x2 ) = (0, 0), y = a · 0 + b · 0 + c = c, and the data point


is (0, 0, 0) so the (squared) error is (c − 0)2 . Similarly, for (x1 , x2 ) = (1, 0),
y = a + c, and the data point is (1, 0, 2), the squared error is (a + c − 2)2 .
Continuing this way, we write down the error as

e(a, b, c) = (c − 0)2 + (a + c − 2)2 + (b + c − 3)2 + (a + b + c − 4)2 .

At the optimum,

∂e(a, b, c)/∂a = 2(−6 + 2a + b + 2c) = 0,
∂e(a, b, c)/∂b = 2(−7 + a + 2b + 2c) = 0,
∂e(a, b, c)/∂c = 2(−9 + 2a + 2b + 4c) = 0,

and the solution is a = 3/2, b = 5/2, c = 1/4.
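
The same coefficients can be obtained numerically with a least-squares solver; a short sketch (ours):

import numpy as np

A = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1]])  # columns: x1, x2, constant
y = np.array([0, 2, 3, 4])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
print(coef)  # [1.5 2.5 0.25], i.e. a = 3/2, b = 5/2, c = 1/4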

2. There are two points marked on the plane: red point A with coordinates (0, 0) and blue point B with coordinates (10, 10). Then 4 points C, D, E, F arrive in order, and each is coloured in the same way as its nearest neighbour. The coordinates of F are (10, 8). Give examples of coordinates of points C, D, E such that point F will be coloured red.

Answer: For example, C may have coordinates (5, 4), D coordinates (8, 6), and E coordinates (10, 7). Then

|CA| = √((5 − 0)² + (4 − 0)²) < √((5 − 10)² + (4 − 10)²) = |CB|,

hence C will be coloured the same as A, that is, in red. Next,

|DC| = √((8 − 5)² + (6 − 4)²) < √((8 − 10)² + (6 − 10)²) = |DB|,

hence D will be coloured the same as C, that is, in red. Further,

|ED| = √((10 − 8)² + (7 − 6)²) < √((10 − 10)² + (7 − 10)²) = |EB|,

hence E will be coloured the same as D, that is, in red. Finally,

|FE| = 1 < 2 = |FB|,

hence F will be coloured the same as E: in red.
3. Consider 4 points A, B, C, D on the plane with coordinates (0, 0),
(1, 0), (0, 1), and (2, 1), initially classified such that A and D are red while
B and C are blue. What would be the outcome of the K-means algorithm
(with K = 2) applied to these initial data?
Answer: The mean M_R of the red points is ((0, 0) + (2, 1))/2 = (1, 0.5). The mean M_B of the blue points is ((1, 0) + (0, 1))/2 = (0.5, 0.5). Now, for point A,

|AM_B| = √((0.5)² + (0.5)²) < √((0.5)² + 1²) = |AM_R|,

hence point A becomes blue. By a similar argument, we deduce that C stays blue, B becomes red, and D stays red. After this, the mean M_R of the red points is ((1, 0) + (2, 1))/2 = (1.5, 0.5), while the mean M_B of the blue points is ((0, 0) + (0, 1))/2 = (0, 0.5). Now, for point A,

|AM_B| = √(0² + (0.5)²) < √((0.5)² + (1.5)²) = |AM_R|,

hence point A stays blue. By a similar argument, we deduce that all points stay the same colour, and the algorithm terminates. Answer: A and C are blue, B and D are red.
4. A filter should classify e-mails into 3 categories: personal, work, and
spam. The statistics shows that approximately 30% of e-mails are personal,
50% are work ones, and 20% are spam. It also shows that word “friend” is
included into 20% of personal e-mails, 5% of work e-mails, and 30% of spam
e-mails. In addition, the word “profit” is included into 5% of personal e-mails,
30% of work e-mails, and 25% of spam e-mails. A new e-mail arrives which contains both words “friend” and “profit”. Use naive Bayes classification to decide whether this e-mail is more likely to be personal, work, or spam.
Answer: Let X be a random variable equal to X = 1, X = 2, or X = 3 if the random e-mail is personal, work, or spam, respectively. Let F be a random variable such that F = 1 if the e-mail contains the word “friend” (and F = 0 otherwise). Similarly, let R be a random variable such that R = 1 if the e-mail contains the word “profit” (and R = 0 otherwise). We need to compare three conditional probabilities

P(X = 1 | R = F = 1),  P(X = 2 | R = F = 1),  P(X = 3 | R = F = 1),

and select the maximal one. For each i = 1, 2, 3, the naive Bayes estimate for P(X = i | R = F = 1) is

P(X = i | R = F = 1) = P(R = 1 | X = i) · P(F = 1 | X = i) · P(X = i) / P(R = F = 1).

As we can see, the denominators are the same for all i, so it suffices to compare the numerators. We have

P(R = 1 | X = 1) · P(F = 1 | X = 1) · P(X = 1) = 0.05 · 0.2 · 0.3 = 0.003,
P(R = 1 | X = 2) · P(F = 1 | X = 2) · P(X = 2) = 0.3 · 0.05 · 0.5 = 0.0075,
P(R = 1 | X = 3) · P(F = 1 | X = 3) · P(X = 3) = 0.25 · 0.3 · 0.2 = 0.015.

Hence, the e-mail is most likely to be spam.
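
A short check of these three numerators in Python (ours):

priors = {"personal": 0.3, "work": 0.5, "spam": 0.2}
p_friend = {"personal": 0.2, "work": 0.05, "spam": 0.3}
p_profit = {"personal": 0.05, "work": 0.3, "spam": 0.25}
for c in priors:
    # numerator of the naive Bayes estimate for P(X = c | R = F = 1)
    print(c, p_friend[c] * p_profit[c] * priors[c])
# personal 0.003, work 0.0075, spam 0.015 -> classify as spam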
