Prediction Error Methods
David Di Ruscio
Telemark University College
N-3414 Porsgrunn, Norway
Contents
1 Introduction
6 Matlab implementation
  6.1 Tutorial: SS-PEM Toolbox for MATLAB
1 Introduction
An overview of the Prediction Error Method (PEM) for system identification is
given. Furthermore, a description of the common discrete time input and output
polynomial model structures, which are frequently used in prediction error system
identification methods, is presented in this note. In particular, the relationship between
the state space model and the input-output model is pointed out.
where ∆ = E(ek eTk ) is the covariance matrix of the innovations process and x̄1 is
the initial predicted state. Suppose now that the model is parameterized, i.e. so
that the free parameters in the model matrices (A, B, K, D, E, x1 ) are organized
into a parameter vector θ. The problem is to identify the ”best” parameter vector
from known output and input data matrices (Y, U ). The optimal predictor, i.e.
the optimal prediction, ȳk , for the output yk , is then of the form
with initial predicted state x̄1 . ȳk (θ) is the prediction of the output yk given
inputs u up to time k, outputs y up to time k − 1 and the parameter vector θ.
Note that if E = 0_{m×r}, then inputs u only up to time k − 1 are needed. The free
parameters in the system matrices are mapped into the parameter vector (or vice
versa). Note that the predictor ȳ_k(θ) is (only) optimal for the parameter vector
θ which minimizes some specified criterion. This criterion is usually a function of
the prediction errors. Note also that it is common to use the notation ȳ_{k|θ} for the
prediction. Hence, ȳ_{k|θ} = ȳ_k(θ).
Define the Prediction Error (PE)
$$\varepsilon_k(\theta) = y_k - \bar{y}_k(\theta). \qquad (5)$$
A good model is a model for which the model parameters θ result in a "small"
PE. Hence, it makes sense to use a PE criterion which measures the size of the
PE. A PE criterion is usually, in one way or another, defined as a scalar
function of the following important expression and definition
$$R_\varepsilon(\theta) = \frac{1}{N}\sum_{k=1}^{N} \varepsilon_k(\theta)\,\varepsilon_k(\theta)^T \in \mathbb{R}^{m\times m}, \qquad (6)$$
which is the sample covariance matrix of the PE. This means that we want to find
the parameter vector which makes R_ε as small as possible. Hence, it makes sense
to use a PE criterion which measures the size of the sample covariance matrix of
the PE. A common scalar PE criterion for multivariable output systems is thus
$$V_N(\theta) = \frac{1}{N}\sum_{k=1}^{N} \mathrm{tr}\big(\varepsilon_k(\theta)\,\varepsilon_k(\theta)^T\big) = \frac{1}{N}\sum_{k=1}^{N} \varepsilon_k(\theta)^T \varepsilon_k(\theta). \qquad (7)$$
Note that the trace of a matrix is equal to the sum of its diagonal elements and
that tr(AB) = tr(BA) for two matrices A and B of appropriate dimensions. We
will in the following give a discussion of the PE criterion as well as some variants
of it. Define a PE criterion as follows
$$V_N(\theta) = \frac{1}{N}\sum_{k=1}^{N} \ell\big(\varepsilon_k(\theta)\big), \qquad (8)$$
where `(·) is a scalar valued function, e.g. the Euclidean l2 norm, i.e.
$$\ell\big(\varepsilon_k(\theta)\big) = \|\varepsilon_k(\theta)\|_2^2 = \varepsilon_k(\theta)^T \varepsilon_k(\theta), \qquad (9)$$
or a quadratic function
$$\ell\big(\varepsilon_k(\theta)\big) = \varepsilon_k(\theta)^T \Lambda\, \varepsilon_k(\theta) = \mathrm{tr}\big(\Lambda\, \varepsilon_k(\theta)\,\varepsilon_k(\theta)^T\big), \qquad (10)$$
for some weighting matrix Λ. A common criterion for multiple output systems
(with weights) is thus also
$$V_N(\theta) = \frac{1}{N}\sum_{k=1}^{N} \varepsilon_k(\theta)^T \Lambda\, \varepsilon_k(\theta) = \mathrm{tr}\Big(\Lambda\, \frac{1}{N}\sum_{k=1}^{N} \varepsilon_k(\theta)\,\varepsilon_k(\theta)^T\Big), \qquad (11)$$
where we usually simply use Λ = I. The weighting matrix Λ is usually a
diagonal and positive definite matrix.
One should note that there exists an optimal weighting matrix Λ, but that this
matrix is difficult to define a-priori. The optimal weighting (when the number of
observations N is large) is defined from knowledge of the innovations noise
covariance matrix of the system, i.e., Λ = ∆^{-1} where ∆ = E(e_k e_k^T). An interesting
solution to this would be to define the weighting matrix from the subspace system
identification method DSR or DSR_e and use ∆ = F F^T, where F is computed by
the DSR algorithm.
The sample covariance matrix of the PE is positive semi-definite, i.e. R_ε ≥ 0.
The PE may be zero for deterministic systems; however, for combined deterministic
and stochastic systems we usually have that R_ε > 0, i.e. positive definite. In any
case R_ε is, of course, a symmetric matrix. The eigenvalues of a symmetric matrix
are all real. Define λ_1, . . . , λ_m as the eigenvalues of R_ε(θ) for use in the following
discussion.
Hence, a good parameter vector, θ, is one for which the sample covariance matrix
R_ε is small. The trace operator in the PE criterion (11) is a measure of the size of
the matrix R_ε(θ): the trace of a matrix is equal to the sum of its diagonal elements,
and for a symmetric matrix it is also equal to the sum of the eigenvalues, i.e.
tr(R_ε(θ)) = λ_1 + · · · + λ_m.
An alternative PE criterion which often is used is the determinant, i.e.,
$$V_N(\theta) = \det\big(R_\varepsilon(\theta)\big) = \lambda_1 \lambda_2 \cdots \lambda_m,$$
since the determinant of a matrix is equal to the product of its eigenvalues.
This leads us to a third alternative, which is to use the maximum eigenvalue of
R_ε(θ) as a measure of its size, i.e., we may use the PE criterion V_N(θ) = max_i λ_i(R_ε(θ)).
The parameter estimate is then defined as
$$\hat{\theta} = \hat{\theta}_N = \arg\min_{\theta \in D_M} V_N(\theta), \qquad (16)$$
where arg min denotes the operator which returns the argument that minimizes
the function. The subscript N is often omitted, hence θ̂ = θ̂_N and V(θ) = V_N(θ).
The definition (16) is a standard optimization problem. A simple solution is then
to use a software optimization algorithm which depends only on function
evaluations, i.e., where the user only has to define the PE criterion V(θ).
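As an illustration, the following MATLAB sketch minimizes a user-defined PE criterion using only function evaluations (here the built-in simplex search fminsearch). The criterion, the simulated data and the simple ARX(1,1) predictor inside it are illustrative placeholders, not part of the original text; only the overall pattern matters.

% Minimal sketch: minimize a PE criterion V(theta) using only function
% evaluations. The criterion is a stand-in: a scalar ARX(1,1) predictor
% y_bar(k) = theta(1)*y(k-1) + theta(2)*u(k-1).
N = 200;
u = randn(N,1);                                       % some input sequence
y = filter([0 0.5], [1 -0.8], u) + 0.05*randn(N,1);   % illustrative data

pe_criterion = @(theta) mean( ( y(2:N) ...
    - ( theta(1)*y(1:N-1) + theta(2)*u(1:N-1) ) ).^2 );   % V_N(theta)

theta0    = [0; 0];                                   % initial parameter vector
theta_hat = fminsearch(pe_criterion, theta0);         % function evaluations only
V_min     = pe_criterion(theta_hat);                  % criterion value at the minimum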
We will in the following give a discussion of how this may be done. The
minimizing parameter vector θ̂ = θ̂_N ∈ R^p has to be searched for in the parameter
space D_M by some iterative non-linear optimization method. Optimization
methods are usually constructed as variations of the Gauss-Newton method and
the Newton-Raphson method, i.e.,
$$\theta_{i+1} = \theta_i - \alpha H_i^{-1}(\theta_i)\, g_i(\theta_i), \qquad (17)$$
where α is a line search (or step length) scalar parameter chosen to ensure
convergence (i.e. chosen to ensure that V(θ_{i+1}) < V(θ_i)), and where the
gradient, g_i(θ_i), is
$$g_i(\theta_i) = \frac{dV(\theta_i)}{d\theta_i} \in \mathbb{R}^p, \qquad (18)$$
and
$$H_i(\theta_i) = \frac{dg_i(\theta_i)}{d\theta_i^T} = \frac{d}{d\theta_i^T}\Big(\frac{dV(\theta_i)}{d\theta_i}\Big) = \frac{d^2 V(\theta_i)}{d\theta_i^T d\theta_i} \in \mathbb{R}^{p\times p}, \qquad (19)$$
is the Hessian matrix. Remark that the Hessian is a symmetric matrix and that
it is positive definite at the minimum, i.e.,
$$H(\hat{\theta}) = \frac{d^2 V(\hat{\theta})}{d\hat{\theta}^T d\hat{\theta}} > 0, \qquad (20)$$
where θ̂ = θ̂N is the minimizing parameter vector. Note that the iteration scheme
(17) is identical to the Newton-Raphson method when α = 1. In practice one
often has to use a variable step length parameter α, both in order to stabilize
the algorithm and to improve the rate of convergence far from the minimum.
Once the gradient, g_i(θ_i), and the Hessian matrix H_i(θ_i) (or an approximation of
the Hessian) have been computed, we can choose the line search parameter, α, as
$$\alpha = \arg\min_\alpha V_N\big(\theta_{i+1}(\alpha)\big) = \arg\min_\alpha V_N\big(\theta_i - \alpha H_i^{-1}(\theta_i)\, g_i(\theta_i)\big). \qquad (21)$$
Equation (21) is an optimization problem for the line search parameter α. Once
α has been determined from the scalar optimization problem (21), the new
parameter vector θ_{i+1} is determined from (17).
The iteration process (17) must be initialized with an initial parameter vector,
θ_1. This was earlier (before the subspace methods) a problem. However, a good
solution is to use the parameters from a subspace identification method. Equation
(17) can then be implemented in a while or for loop, i.e., Equation (17) is iterated
for i = 1, . . . until convergence, i.e., until the gradient is sufficiently close to zero,
g_i(θ_i) ≈ 0, for some i ≥ 1. Only such a parameter vector is our estimate, i.e.
θ̂ = θ̂_N = θ_i when g(θ_i) ≈ 0.
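A minimal MATLAB sketch of the iteration (17) with a simple backtracking line search is given below. To keep the sketch self-contained and certainly correct, an illustrative quadratic test criterion with exact gradient and Hessian is used instead of a real PE criterion; the names and numbers are assumptions for illustration only.

% Sketch of the Newton-Raphson / Gauss-Newton type iteration (17) with a
% simple step-halving line search. The criterion is an illustrative
% quadratic test function V(theta) = 0.5*(theta - t0)'*Q*(theta - t0).
t0 = [1; -2];  Q = [4 1; 1 3];                 % "true" minimizer and curvature
V    = @(th) 0.5*(th - t0)'*Q*(th - t0);       % PE criterion stand-in
grad = @(th) Q*(th - t0);                      % gradient g(theta)
hess = @(th) Q;                                % Hessian H(theta)

theta = [10; 10];                              % initial parameter vector theta_1
for i = 1:50
    g = grad(theta);  H = hess(theta);
    if norm(g) < 1e-8, break, end              % stop when the gradient is ~ 0
    d = -H\g;                                  % Newton direction -H^{-1} g
    alpha = 1;                                 % try the full Newton step first
    while V(theta + alpha*d) >= V(theta) && alpha > 1e-8
        alpha = alpha/2;                       % halve the step until V decreases
    end
    theta = theta + alpha*d;                   % update (17)
end
theta_hat = theta;                             % here equal to t0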
The iteration equation (17) can be deduced from the fact that at the minimum
the gradient, g, is zero, i.e.,
$$g(\hat{\theta}) = \frac{dV(\hat{\theta})}{d\hat{\theta}} = 0. \qquad (22)$$
We can now use the Newton-Raphson method, which can be deduced as follows.
An expression for g(θ̂) is obtained from a Taylor series expansion of g around θ, i.e.,
$$0 = g(\hat{\theta}) \approx g(\theta) + \frac{dg(\theta)}{d\theta^T}\Big|_{\theta} (\hat{\theta} - \theta). \qquad (23)$$
Solving for θ̂ gives
$$\hat{\theta} = \theta - \Big(\frac{dg(\theta)}{d\theta^T}\Big|_{\theta}\Big)^{-1} g(\theta). \qquad (24)$$
This equation is the background for the iteration scheme (17), i.e., putting θ := θi
and θ̂ := θi+1 . Hence,
$$\theta_{i+1} = \theta_i - \Big(\frac{dg(\theta)}{d\theta^T}\Big|_{\theta_i}\Big)^{-1} g(\theta_i). \qquad (25)$$
Note that the parameter vector, θ, the gradient, g, and the Hessian matrix have
structures as follows
$$\theta = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_p \end{bmatrix} \in \mathbb{R}^p, \qquad (27)$$
$$g(\theta) = \frac{dV(\theta)}{d\theta} = \begin{bmatrix} g_1 \\ \vdots \\ g_p \end{bmatrix} = \begin{bmatrix} \frac{dV}{d\theta_1} \\ \vdots \\ \frac{dV}{d\theta_p} \end{bmatrix} \in \mathbb{R}^p, \qquad (28)$$
$$H = \frac{dg(\theta)}{d\theta^T} = \frac{d^2 V(\theta)}{d\theta^T d\theta} = \begin{bmatrix} \frac{dg_1}{d\theta_1} & \cdots & \frac{dg_1}{d\theta_p} \\ \vdots & \ddots & \vdots \\ \frac{dg_p}{d\theta_1} & \cdots & \frac{dg_p}{d\theta_p} \end{bmatrix} = \begin{bmatrix} \frac{d^2 V}{d\theta_1^2} & \cdots & \frac{d^2 V}{d\theta_p d\theta_1} \\ \vdots & \ddots & \vdots \\ \frac{d^2 V}{d\theta_1 d\theta_p} & \cdots & \frac{d^2 V}{d\theta_p^2} \end{bmatrix} \in \mathbb{R}^{p\times p}. \qquad (29)$$
The gradient and the Hessian can be computed numerically. However, it is usually
more efficient to compute them analytically if possible. Remark that a common
notation for the Hessian matrix is
$$H = \frac{d^2 V(\theta)}{d\theta^2}, \qquad (30)$$
and that the elements of the Hessian are given by
$$h_{ij} = \frac{\partial^2 V(\theta)}{\partial\theta_i \partial\theta_j}, \qquad (31)$$
7
with Λ = ∆^{-1} where ∆ = E(e_k e_k^T), has the same asymptotic covariance matrix
as the parameter estimate.
It can also be shown that the PEM parameter estimate for Gaussian distributed
disturbances, e_k, (33), or equivalently the PEM estimate (32) with the optimal
weighting, is identical to the Maximum Likelihood (ML) parameter estimate. The
PEM estimates (32) and (33) are both statistically optimal. The only drawback of
using (33), i.e., minimizing the determinant PE criterion V_N(θ) = det(R_ε(θ)), is
that it requires more numerical calculations than the trace criterion. On the
other hand, the evaluation of the PE criterion (32) with the optimal weighting
matrix Λ = ∆^{-1} is not (directly) realistic since the exact covariance matrix ∆
is not known in advance. However, note that an estimate of ∆ can be built up
during the iteration (optimization) process.
The parameter estimate θ̂_N is a random vector, i.e., Gaussian distributed
when e_k is Gaussian. This means that θ̂_N has a mean and a variance. We want
the mean to be as close to the true parameter vector as possible and the variance
to be as small as possible. We can show that the PEM estimate θ̂_N is consistent,
i.e., the mean of the parameter estimate, θ̂_N, converges to the true parameter
vector, θ_0, as N tends to infinity. In other words we have that E(θ̂_N) = θ_0.
Consider g(θ̂_N) = 0 expressed as a Taylor series expansion of g around the
true parameter vector θ_0, i.e.
Hence,
E(θ̂N ) = θ0 , (38)
because θ0 is deterministic.
The parameter estimates (33) and (32) with optimal weighting are efficient,
i.e., they ensure that the parameter covariance matrix
where ∆ = E(e_k e_k^T) = E(e_k^2) in the single output case, and with
$$\psi_k(\theta_0) = \frac{d\bar{y}_k(\theta)}{d\theta}\Big|_{\theta_0} \in \mathbb{R}^{p\times m}. \qquad (41)$$
Loosely speaking, Equation (40) states that the variance, P, of the parameter
estimate, θ̂_N, is "small" if the covariance matrix of ψ_k(θ_0) is large. This covariance
matrix is large if the predictor ȳ_k(θ) is "very" sensitive to (perturbations in) the
parameter vector θ.
The derivative of the scalar valued function ℓ = ℓ(ε_k(θ)) with respect to the
parameter vector θ can be expressed from the chain rule
2.3 Least Squares and the prediction error method
The optimal predictor, ȳ_k(θ), is generally a non-linear function of the unknown
parameter vector, θ. The PEM estimate can in this general case not be found
analytically. However, in some simple and special cases the predictor is a linear
function of the parameter vector. This problem has an analytical solution and the
corresponding PEM is known as the Least Squares (LS) method. We will in this
section give a short description of the solution to this problem.
$$y_k = E u_k + e_k, \qquad (46)$$
$$y_k = \varphi_k^T \theta + e_k, \qquad (47)$$
where
$$y_k = \varphi_k^T \theta_0 + e_k, \qquad (50)$$
Here φ_k is a vector of known variables (or quantities); these variables are
often called regression variables or regressors. The output variable y_k is also
called the regressed variable.
A natural predictor is as usual
$$\bar{y}_k(\theta) = \varphi_k^T \theta. \qquad (51)$$
We will in the following find the parameter estimate, θ̂N , which minimizes the
PE criterion
$$V_N(\theta) = \frac{1}{N}\sum_{k=1}^{N} \varepsilon_k(\theta)^T \Lambda\, \varepsilon_k(\theta), \qquad (52)$$
where ε_k = y_k − φ_k^T θ is the Prediction Error (PE). The gradient matrix, ψ_k(θ),
defined in (44) is in this case given by
$$\psi_k(\theta) = \frac{\partial \bar{y}_k(\theta)}{\partial\theta} = \varphi_k \in \mathbb{R}^{p\times m}. \qquad (53)$$
Using this and the expression (45) for the gradient, g(θ), of the PE criterion,
VN (θ), gives
$$g(\theta) = \frac{\partial V_N(\theta)}{\partial\theta} = -2\,\frac{1}{N}\sum_{k=1}^{N} \varphi_k \Lambda (y_k - \varphi_k^T\theta) = -2\,\frac{1}{N}\sum_{k=1}^{N} \big(\varphi_k \Lambda y_k - \varphi_k \Lambda \varphi_k^T \theta\big). \qquad (54)$$
The OLS and the PEM estimates are here simply given by solving g(θ) = 0, i.e.,
$$\hat{\theta}_N = \Big(\sum_{k=1}^{N} \varphi_k \Lambda \varphi_k^T\Big)^{-1} \sum_{k=1}^{N} \varphi_k \Lambda y_k, \qquad (55)$$
if the indicated inverse (of the Hessian) exists. Note that the Hessian matrix in
this case is simply given by
$$H(\theta) = \frac{\partial g(\theta)}{\partial\theta^T} = \frac{2}{N}\sum_{k=1}^{N} \varphi_k \Lambda \varphi_k^T. \qquad (56)$$
The (symmetric) Hessian matrix should be positive definite, H(θ) > 0, for Eq.
(55) to be a minimum. This indicates choosing a positive definite weighting matrix Λ > 0.
The gradient vector in Eq. (54) may also be derived using the chain rule as
follows
$$g(\theta) = \frac{\partial V_N(\theta)}{\partial\theta} = \frac{1}{N}\sum_{k=1}^{N} \frac{\partial \varepsilon_k}{\partial\theta}\, \frac{\partial\big(\varepsilon_k(\theta)^T \Lambda\, \varepsilon_k(\theta)\big)}{\partial \varepsilon_k} = \frac{-2}{N}\sum_{k=1}^{N} \varphi_k \Lambda (y_k - \varphi_k^T\theta). \qquad (57)$$
2.3.3 Matrix derivation of the least squares method
For practical reasons when computing the least squares solution as well as for
the purpose of analyzing the statistical properties of the estimate it may be
convenient to write the linear regression (47) in vector/matrix form as follows
Y = Φθ0 + e, (58)
as the matrix of prediction errors. Hence we have that the PE criterion is given
by
$$V_N(\theta) = \frac{1}{N}\sum_{k=1}^{N} \varepsilon_k(\theta)^T \Lambda\, \varepsilon_k(\theta) = \frac{1}{N}\,\varepsilon^T \tilde{\Lambda}\, \varepsilon, \qquad (61)$$
The nice thing about (61) is that the summation is not present in the last
expression of the PE criterion. Hence we can obtain a more direct derivation of the
parameter estimate. The gradient in (54) can in this case simply be expressed as
$$g(\theta) = \frac{\partial V_N(\theta)}{\partial\theta} = -\frac{2}{N}\,\Phi^T \tilde{\Lambda}\,(Y - \Phi\theta).$$
The Hessian is given by
$$H(\theta) = \frac{\partial g(\theta)}{\partial\theta^T} = \frac{2}{N}\,\Phi^T \tilde{\Lambda}\,\Phi. \qquad (64)$$
Solving g(θ) = 0 gives the following expression for the estimate
$$\hat{\theta}_N = (\Phi^T \tilde{\Lambda}\,\Phi)^{-1}\Phi^T \tilde{\Lambda}\, Y, \qquad (65)$$
which is identical to (55). It can be shown that the optimal weighting is given by
Λ = ∆−1 . The solution (55) or (65) with the optimal weighting matrix Λ = ∆−1 is
known in the literature as the Best Linear Unbiased Estimate (BLUE). Choosing
Λ = Im in (55) or (65) gives the OLS solution.
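A minimal MATLAB sketch of the matrix form estimate (65); the data and the weighting below are illustrative placeholders, and W = I gives the OLS solution.

% Weighted least squares estimate (65): theta = (Phi' W Phi)^{-1} Phi' W Y.
% Single-output example with an illustrative regressor matrix.
N    = 100;
Phi  = [randn(N,1), ones(N,1)];        % regressor matrix (slope and intercept)
th0  = [2; -1];                        % "true" parameter vector
Y    = Phi*th0 + 0.1*randn(N,1);       % regressed variable with noise

W    = eye(N);                         % weighting (identity => OLS); choosing
                                       % W = inv(Delta_tilde) gives the BLUE
theta_hat = (Phi'*W*Phi) \ (Phi'*W*Y); % estimate (65), without forming an explicit inverse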
A short analysis of the parameter estimate is given in the following. Substi-
tuting (58) into (65) gives an expression of the difference between the parameter
estimate and the true parameter vector, i.e.,
θ̂N − θ0 = (ΦT Λ̃Φ)−1 ΦT Λ̃e. (66)
The parameter estimate (65) is an unbiased estimate since
E(θ̂N ) = θ0 + (ΦT Λ̃Φ)−1 ΦT Λ̃E(e) = θ0 . (67)
The covariance matrix of the parameter estimate is given by
P = E((θ̂N − θ0 )(θ̂N − θ0 )T ) = (ΦT Λ̃Φ)−1 ΦT Λ̃E(eeT )Λ̃Φ(ΦT Λ̃Φ)−1 . (68)
Suppose first that we choose the weighting matrix Λ̃ = ∆̃^{-1}, where ∆̃ = E(ee^T).
It should be noted that ∆̃ is a block diagonal matrix with ∆ = E(e_i e_i^T) on the
block diagonal. Then we have
$$P_{\mathrm{BLUE}} = E\big((\hat{\theta}_N - \theta_0)(\hat{\theta}_N - \theta_0)^T\big) = \Big(\sum_{k=1}^{N} \varphi_k \Delta^{-1} \varphi_k^T\Big)^{-1} = (\Phi^T \tilde{\Delta}^{-1}\Phi)^{-1}. \qquad (69)$$
2.3.4 Alternative matrix derivation of the least squares method
Another linear regression model formulation which is frequently used, e.g., in the
Chemometrics literature, is given by
$$Y = XB + E, \qquad (73)$$
with the OLS solution B_OLS = (X^T X)^{-1} X^T Y, where X^T X is non-singular.
Alternative approximations to the inverse of X^T X are obtained by reduced rank
analysis and the Singular Value Decomposition (SVD); a popular such approximation
is the Partial Least Squares Regression (PLS) method. In the next subsection we
mention the Ridge regression method.
When K = a this gives an ARX or linear regression model of the form
$$y_k = \varphi_k^T \theta + e_k, \qquad (78)$$
where
$$\varphi_k^T = \begin{bmatrix} y_{k-1} & u_{k-1} \end{bmatrix}, \qquad \theta = \begin{bmatrix} a \\ b \end{bmatrix}. \qquad (79)$$
This may also be written as a linear regression model as in Eq. (74) with
$$Y = \begin{bmatrix} y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix}, \qquad X = \begin{bmatrix} y_1 & u_1 \\ y_2 & u_2 \\ \vdots & \vdots \\ y_{N-1} & u_{N-1} \end{bmatrix}, \qquad B = \begin{bmatrix} a \\ b \end{bmatrix}. \qquad (80)$$
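A minimal MATLAB sketch of forming the data matrices in (80) from recorded sequences y and u and computing the OLS estimate; the simulated first-order system and noise level are illustrative assumptions.

% Build Y and X as in (80) for a first-order ARX model and estimate [a; b]
% by ordinary least squares. The data generation is illustrative.
N  = 300;
a  = 0.8;  b = 0.5;                      % "true" parameters
u  = randn(N,1);
y  = zeros(N,1);
e  = 0.05*randn(N,1);
for k = 1:N-1
    y(k+1) = a*y(k) + b*u(k) + e(k+1);   % y_k = a*y_{k-1} + b*u_{k-1} + e_k
end

Y = y(2:N);                              % left hand side in (80)
X = [y(1:N-1), u(1:N-1)];                % regressor matrix in (80)
B_hat = X \ Y;                           % OLS estimate of [a; b]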
The acronym ARMAX comes from the fact that the term A(q)y_k is defined as
an Auto Regressive (AR) part, the term C(q)e_k is defined as a Moving Average
(MA) part, and the part B(q)u_k represents eXogenous (X) inputs. Note in
connection with this that an Auto Regressive (AR) model is of the form A(q)y_k =
e_k, i.e. with additive equation noise. The noise term, C(q)e_k, in a so-called ARMA
model A(q)y_k = C(q)e_k represents a moving average of the white noise e_k.
In general we have that the polynomials in Eq. (81) may be expressed as
when y_k, u_k and e_k are scalar signals and where na, nb and nc are the orders
of the A(q), B(q) and C(q) polynomials, respectively. Note also that the
polynomials may also be written as A(q) = A(q^{-1}), B(q) = B(q^{-1}) and C(q) =
C(q^{-1}), where q^{-1} is the shift operator such that
$$q^{-1} y_k = y_{k-1}. \qquad (85)$$
The ARMAX model structure can be deduced from a general linear SISO
state space model. MIMO systems are not considered in this section. This will be
illustrated in the following example.
Example 3.1 (Estimator canonical form to polynomial form) Consider the
following single input and single output discrete time state space model on
estimator canonical form
$$x_{k+1} = \begin{bmatrix} -a_1 & 1 \\ -a_0 & 0 \end{bmatrix} x_k + \begin{bmatrix} b_1 \\ b_0 \end{bmatrix} u_k + \begin{bmatrix} c_1 \\ c_0 \end{bmatrix} v_k \qquad (86)$$
$$y_k = \begin{bmatrix} 1 & 0 \end{bmatrix} x_k + e\, u_k + f\, v_k \qquad (87)$$
where u_k is a known deterministic input signal, v_k is an unknown white noise
process and x_k = [x_k^1 \; x_k^2]^T is the state vector.
An input output model formulation can be derived as follows
$$x_{k+1}^1 = -a_1 x_k^1 + x_k^2 + b_1 u_k + c_1 v_k \qquad (88)$$
$$x_{k+1}^2 = -a_0 x_k^1 + b_0 u_k + c_0 v_k \qquad (89)$$
$$y_k = x_k^1 + e\, u_k + f\, v_k \qquad (90)$$
Express Equation (88) with k := k + 1 and substitute for x_{k+1}^2 as given by Equation
(89). This gives an equation in terms of the first state x_k^1. Finally, eliminate x_k^1 by
using Equation (90). This gives
$$\begin{bmatrix} 1 & a_1 & a_0 \end{bmatrix} \begin{bmatrix} y_k \\ y_{k-1} \\ y_{k-2} \end{bmatrix} = \begin{bmatrix} e & b_1 + a_1 e & b_0 + a_0 e \end{bmatrix} \begin{bmatrix} u_k \\ u_{k-1} \\ u_{k-2} \end{bmatrix} + \begin{bmatrix} f & c_1 + a_1 f & c_0 + a_0 f \end{bmatrix} \begin{bmatrix} v_k \\ v_{k-1} \\ v_{k-2} \end{bmatrix} \qquad (91)$$
Let us introduce the backward shift operator q^{-1} such that q^{-1} u_k = u_{k-1}. This
gives the following polynomial (or transfer function) model
$$\underbrace{(1 + a_1 q^{-1} + a_0 q^{-2})}_{A(q)}\, y_k = \underbrace{\big(e + (b_1 + a_1 e) q^{-1} + (b_0 + a_0 e) q^{-2}\big)}_{B(q)}\, u_k + \underbrace{\big(f + (c_1 + a_1 f) q^{-1} + (c_0 + a_0 f) q^{-2}\big)}_{C(q)}\, v_k \qquad (92)$$
which is equal to an ARMAX model structure.
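As a quick numerical check of the conversion in Example 3.1, the following MATLAB sketch simulates the state space model (86)-(87) and the difference equation (91) with the same input and noise sequences and compares the outputs; the numeric parameter values are arbitrary illustrations, not from the text.

% Compare the state space model (86)-(87) with the input-output form (91)/(92).
a1 = -1.2; a0 = 0.5;  b1 = 0.4; b0 = 0.3;  c1 = 0.2; c0 = 0.1;  e = 0; f = 1;
A = [-a1 1; -a0 0];  B = [b1; b0];  C = [c1; c0];  D = [1 0];

N = 50;
u = randn(N,1);  v = 0.1*randn(N,1);

% Simulate the state space model
x = [0; 0];  y_ss = zeros(N,1);
for k = 1:N
    y_ss(k) = D*x + e*u(k) + f*v(k);
    x = A*x + B*u(k) + C*v(k);
end

% Simulate the difference equation (91): y_k = -a1*y_{k-1} - a0*y_{k-2} + ...
y_io = zeros(N,1);
for k = 1:N
    ym1 = 0; ym2 = 0; um1 = 0; um2 = 0; vm1 = 0; vm2 = 0;
    if k > 1, ym1 = y_io(k-1); um1 = u(k-1); vm1 = v(k-1); end
    if k > 2, ym2 = y_io(k-2); um2 = u(k-2); vm2 = v(k-2); end
    y_io(k) = -a1*ym1 - a0*ym2 + e*u(k) + (b1+a1*e)*um1 + (b0+a0*e)*um2 ...
              + f*v(k) + (c1+a1*f)*vm1 + (c0+a0*f)*vm2;
end

max(abs(y_ss - y_io))   % should be (numerically) zero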
Example 3.2 (Observability canonical form to polynomial form)
Consider the following single input and single output discrete time state space
model on observability canonical form
$$x_{k+1} = \begin{bmatrix} 0 & 1 \\ -a_1 & -a_0 \end{bmatrix} x_k + \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} u_k + \begin{bmatrix} c_0 \\ c_1 \end{bmatrix} v_k, \qquad (93)$$
$$y_k = \begin{bmatrix} 1 & 0 \end{bmatrix} x_k + e\, u_k + f\, v_k, \qquad (94)$$
where u_k is a known deterministic input signal, v_k is an unknown white noise
process and x_k = [x_k^1 \; x_k^2]^T is the state vector. An input and output model
can be derived by eliminating the states in (93) and (94). We have
Solve (95) for x_k^2 and substitute into (96). This gives an equation in x_k^1, i.e.,
Solve (97) for x_k^1 and substitute into (98). This gives
3.2 ARX model structure
An Auto Regressive (AR) model A(q)y_k = e_k with eXtra (or eXogenous) inputs,
i.e. an ARX model, can be expressed as follows
$$A(q)\, y_k = B(q)\, u_k + e_k, \qquad (105)$$
where the term A(q)y_k is the Auto Regressive (AR) part and the term B(q)u_k
represents the part with the eXtra (X) inputs. The eXtra variables u_k are called
eXogenous in econometrics. Note also that the ARX model is known as an
equation error model because of the additive error or white noise term e_k.
It is important to note that the parameters in an ARX model, e.g. as defined
in Equation (105) where e_k is a white noise process, can be identified directly
by the Ordinary Least Squares (OLS) method, if the inputs, u_k, are informative
enough. A problem with the ARX structure is of course that the noise model
(additive noise) is too simple in many practical cases.
Example 3.3 (ARX and State Space model structure) Comparing the ARX
model structure (105) with the more general model structure given by (92) we find
that the state space model (86) and (87) is equivalent to an ARX model if
c_1 = −a_1 f and c_0 = −a_0 f, so that C(q)v_k = f v_k = e_k. The resulting ARX model,
from the above discussion, has the following state space model equivalent
$$x_{k+1} = \begin{bmatrix} -a_1 & 1 \\ -a_0 & 0 \end{bmatrix} x_k + \begin{bmatrix} b_1 \\ b_0 \end{bmatrix} u_k + \begin{bmatrix} -a_1 \\ -a_0 \end{bmatrix} f\, v_k \qquad (108)$$
$$y_k = \begin{bmatrix} 1 & 0 \end{bmatrix} x_k + e\, u_k + f\, v_k \qquad (109)$$
It is at first sight not easy to see that this state space model is equivalent to an
ARX model. It is also important to note that the ARX model has a state space
model equivalent. The noise term e_k = f v_k is white noise if v_k is white noise. The
noise e_k appears as both process (state equation) noise and measurement (output
equation) noise. The noise is therefore filtered through the process dynamics.
3.3 OE model structure
An Output Error (OE) model structure can be represented as the following poly-
nomial model
$$y_k = \frac{B(q)}{A(q)}\, u_k + e_k. \qquad (111)$$
Example 3.4 (OE and State Space model structure) From (92) we find that
the state space model (86) and (87) is equivalent to an OE model if
$$c_1 = 0, \quad c_0 = 0, \quad f = 1, \quad v_k = e_k. \qquad (112)$$
With these values the process noise vanishes from the state equation, and the state
space model equivalent becomes
$$x_{k+1} = \begin{bmatrix} -a_1 & 1 \\ -a_0 & 0 \end{bmatrix} x_k + \begin{bmatrix} b_1 \\ b_0 \end{bmatrix} u_k,$$
$$y_k = \begin{bmatrix} 1 & 0 \end{bmatrix} x_k + e\, u_k + e_k. \qquad (115)$$
Note that the noise term e_k appears in the state space model as an equivalent
measurement noise term or output error term.
or for single output systems
$$y_k = \frac{B(q)}{F(q)}\, u_k + \frac{C(q)}{D(q)}\, e_k. \qquad (117)$$
In the BJ model the moving average noise (coloured noise) term C(q)e_k is filtered
through the dynamics represented by the polynomial D(q). Similarly, the dynamics
in the path from the inputs u_k are represented by the polynomial F(q).
However, it is important to note that the BJ model structure can be repre-
sented by an equivalent state space model structure.
3.5 Summary
The family of polynomial model structures:
BJ: Box-Jenkins
ARMAX: Auto Regressive Moving Average with eXtra inputs
ARX: Auto Regressive with eXtra inputs
ARIMAX: Auto Regressive Integrating Moving Average with eXtra inputs
OE: Output Error
(118)
When using the prediction error methods for system identification, the model
structure and the orders of the polynomials need to be specified. It is important
that the prediction error which is to be minimized is a function of as few unknown
parameters as possible. Note also that all of the model structures discussed above
are linear models. However, the optimization problem of computing the unknown
parameters is highly non-linear. Hence, we can run into numerical problems.
This is especially the case for multiple output and MIMO systems.
The subspace identification methods, e.g. DSR, are based on the general state
space model defined by (119) and (120). The subspace identification methods
are therefore flexible enough to identify systems described by all the polynomial
model structures in (118). Note also that not even the system order n needs to be
specified beforehand when using the subspace identification method.
4 Optimal one-step-ahead predictions
4.1 State Space Model
Consider the innovations model
where K is the Kalman filter gain matrix. A predictor ȳ_k for y_k can be defined
as the first two terms on the right hand side of (122), i.e.
The equations for the (optimal) predictor (Kalman filter) are therefore given by
Hence, the optimal prediction, ȳ_k, of the output y_k can simply be obtained by
simulating (126) and (127) with a specified initial predicted state x̄_1. The result
is known as the one-step-ahead prediction. The name one-step-ahead predictor
comes from the fact that the prediction of y_{k+1} is based upon all outputs up to
time k as well as all relevant inputs. This can be seen by writing (126) and (127)
as
where
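A minimal MATLAB sketch of simulating the one-step-ahead predictor written on prediction form (cf. (150)-(151) in Section 6): x̄_{k+1} = A x̄_k + B u_k + K(y_k − D x̄_k − E u_k), ȳ_k = D x̄_k + E u_k. The system matrices and data below are illustrative placeholders.

% One-step-ahead predictions from a Kalman filter (innovations) model.
% Illustrative second order single-output example.
A = [0.7 0.2; 0 0.5];  B = [1; 0.5];  D = [1 0];  E = 0;  K = [0.3; 0.1];

N  = 100;
u  = randn(N,1);
e  = 0.1*randn(N,1);

% Generate data from the innovations model
x = [0; 0];  y = zeros(N,1);
for k = 1:N
    y(k) = D*x + E*u(k) + e(k);
    x    = A*x + B*u(k) + K*e(k);
end

% One-step-ahead predictor driven by the measured outputs
xb = [0; 0];  yb = zeros(N,1);              % initial predicted state x_bar_1
for k = 1:N
    yb(k) = D*xb + E*u(k);                  % prediction of y_k
    xb    = A*xb + B*u(k) + K*(y(k) - yb(k));   % predicted state update
end

innovations = y - yb;                       % close to e (here identical by construction)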
4.2 Input-output model
The linear system can be expressed as the following input and output polynomial
model
$$y_k = G(q)\, u_k + H(q)\, e_k. \qquad (132)$$
The noise term e_k can be expressed as
$$H^{-1}(q)\, y_k = H^{-1}(q) G(q)\, u_k + e_k. \qquad (133)$$
Adding y_k to both sides and rearranging gives
$$y_k = \big(I - H^{-1}(q)\big) y_k + H^{-1}(q) G(q)\, u_k + e_k. \qquad (134)$$
The prediction of the output is given by the first two terms on the right hand
side of (134) since e_k is white and therefore cannot be predicted, i.e.,
$$\bar{y}_k(\theta) = \big(I - H^{-1}(q)\big) y_k + H^{-1}(q) G(q)\, u_k. \qquad (135)$$
Loosely speaking, the optimal prediction of y_k is given by the predictor, ȳ_k, such
that a measure of the difference e_k = y_k − ȳ_k(θ), e.g., the variance, is minimized
with respect to the parameter vector θ over the data horizon. We also assume
that a sufficient model structure is used, and that the unknown parameters are
collected in θ. Ideally, the prediction error e_k will be white.
The (white) noise vectors e_{k+1} and e_{k+2} are in the future, and they cannot
be predicted because they are white. The best prediction of y_{k+2} is then obtained
by putting e_{k+1} = 0 and e_{k+2} = 0. Hence, the predictor for j = 2 is
This can simply be generalized for j > 2 as presented in the following lemma.
where the initial state x_{k+1} = x̄_{k+1} is given from the Kalman filter state equation
where the initial state x̄_1 is known and specified. This means that all outputs up
to time k, and all relevant inputs, are used to predict the outputs at times k + 1,
. . ., k + M.
Given the Kalman filter matrices (A, B, D, K) of a strictly proper system, and
the initial predicted state x̄_k, the predictions ȳ_{k+1}, ȳ_{k+2} and ȳ_{k+3} can be written
in compact form as follows
$$\begin{bmatrix} \bar{y}_{k+1} \\ \bar{y}_{k+2} \\ \bar{y}_{k+3} \end{bmatrix} = \begin{bmatrix} D \\ DA \\ DA^2 \end{bmatrix}(A - KD)\,\bar{x}_k + \begin{bmatrix} D \\ DA \\ DA^2 \end{bmatrix} K\, y_k + \begin{bmatrix} DB & 0 & 0 \\ DAB & DB & 0 \\ DA^2B & DAB & DB \end{bmatrix} \begin{bmatrix} u_k \\ u_{k+1} \\ u_{k+2} \end{bmatrix}. \qquad (147)$$
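A minimal MATLAB sketch of the compact form (147), checked against running the predictor recursion directly; the matrices and signals are illustrative placeholders.

% Multi-step predictions (147) for a strictly proper system (E = 0).
A = [0.7 0.2; 0 0.5];  B = [1; 0.5];  D = [1 0];  K = [0.3; 0.1];
xb = [0.2; -0.1];                       % current predicted state x_bar_k
yk = 1.3;  u = [0.5; -0.2; 0.8];        % y_k and u_k, u_{k+1}, u_{k+2}

Gamma = [D; D*A; D*A^2];                % extended observability matrix
Hd    = [D*B      0     0;
         D*A*B    D*B   0;
         D*A^2*B  D*A*B D*B];           % lower triangular Toeplitz matrix in (147)
ybar  = Gamma*(A - K*D)*xb + Gamma*K*yk + Hd*u;   % [y_bar_{k+1}; y_bar_{k+2}; y_bar_{k+3}]

% Check against the predictor recursion: one Kalman update, then pure simulation
x  = (A - K*D)*xb + B*u(1) + K*yk;      % x_bar_{k+1}
y1 = D*x;  x = A*x + B*u(2);            % y_bar_{k+1}, then x_bar_{k+2}
y2 = D*x;  x = A*x + B*u(3);            % y_bar_{k+2}, then x_bar_{k+3}
y3 = D*x;                               % y_bar_{k+3}
max(abs(ybar - [y1; y2; y3]))           % should be (numerically) zero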
6 Matlab implementation
We will in this section illustrate a simple but general MATLAB implementation
of a Prediction Error Method (PEM).
>> [A,B,D,E,K,x1]=sspem(Y,U,n);
A typical drawback with PEMs is that the system order, n, has to be specified
beforehand. A good solution is to analyze and identify the system order, n, by
the subspace algorithm DSR. The identified model is represented in observability
canonical form, i.e. a state space realization with as few free parameters as
possible. This realization can be illustrated as follows. Consider a system with
m = 2 outputs, r = 2 inputs and n = 3 states, which can be represented by
a linear model. The resulting model from sspem.m will be on Kalman filter
innovations form, i.e.,
xk+1 = Axk + Buk + Kek , (148)
yk = Dxk + Euk + ek , (149)
with initial predicted state x1 given, or on prediction form, i.e.,
xk+1 = Axk + Buk + K(yk − ȳk ), (150)
ȳk (θ) = Dxk + Euk , (151)
and where the model matrices (A, B, D, E, K) and the initial predicted state x1
are given by
$$A = \begin{bmatrix} 0 & 0 & 1 \\ \theta_1 & \theta_3 & \theta_5 \\ \theta_2 & \theta_4 & \theta_6 \end{bmatrix}, \quad B = \begin{bmatrix} \theta_7 & \theta_{10} \\ \theta_8 & \theta_{11} \\ \theta_9 & \theta_{12} \end{bmatrix}, \quad K = \begin{bmatrix} \theta_{13} & \theta_{16} \\ \theta_{14} & \theta_{17} \\ \theta_{15} & \theta_{18} \end{bmatrix}, \qquad (152)$$
$$D = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \quad E = \begin{bmatrix} \theta_{19} & \theta_{21} \\ \theta_{20} & \theta_{22} \end{bmatrix}, \quad x_1 = \begin{bmatrix} \theta_{23} \\ \theta_{24} \\ \theta_{25} \end{bmatrix}. \qquad (153)$$
This model is on a so-called observability canonical form, i.e. the third order
model has as few free parameters as possible. The number of free parameters
is in this example 25. The observability canonical form is such that the D
matrix is filled with ones and zeros, D = [I_m  0_{m×(n−m)}]. Furthermore,
the concatenation of the D matrix and the first n − m rows of the A matrix is
the identity matrix, i.e.
$$\begin{bmatrix} D \\ A(1:n-m,\,:) \end{bmatrix} = I_n.$$
This may be used as a rule when writing up the first n − m rows of the A matrix.
The rest of the A matrix as well as the matrices B, K, E and x_1 are filled with
free parameters. However, the user does not have to deal with the construction of
the canonical form when using the SSPEM toolbox.
The total number of free parameters in a linear state space model, and hence in
the parameter vector θ ∈ R^p, is in general
$$p = mn + nr + nm + mr + n = (2m + r + 1)n + mr. \qquad (154)$$
An existing DSR model can also be refined as follows. Suppose now that a
state space model, i.e. the matrices (A, B, D, E, K, x1) are identified by the DSR
algorithm as follows
>> [A,B,D,E,C,F,x1]=dsr(Y,U,L);
>> K=C*inv(F);
This model can be transformed to observability canonical form and the free pa-
rameters in this model can be mapped into the parameter vector θ1 by the func-
tion ss2thp.m, as follows
>> th_1=ss2thp(A,B,D,E,K,x1);
The model parameter vector θ1 can be further refined by using these parameters
as initial values to the PEM method sspem.m. E.g. as follows
>> [A,B,D,E,K,x1,V,th]=sspem(Y,U,n,th_1);
The value of the prediction error criterion, V (θ), is returned in V . The new and
possibly better (more optimal) parameter vector is returned in th.
Note that for a given parameter vector, the state space model matrices
can be constructed by
>> [A,B,D,E,K,x1]=thp2ss(th,n,m,r);
It is also worth noting that the value, V(θ), of the prediction error criterion
is evaluated by the MATLAB function vfun_mo.m. The data matrices Y and
U and the system order, n, must first be defined as global variables, i.e.
>> global Y U n
>> V=vfun_mo(th);
where th is the parameter vector. Note that the parameter vector must be of
length p as explained above.
See also the function ss2cf.m in the D-SR Toolbox for MATLAB which re-
turns an observability form state space realization from a state space model.
7 Recursive ordinary least squares method
We start this section by a simple example of how the mean of a variable, say yk ,
may be recursively estimated.
Example 7.1
The mean of a variable y_k at present time t may be expressed as
$$\bar{y}_t = \frac{1}{t}\sum_{k=1}^{t} y_k, \qquad (155)$$
which may be computed recursively as ȳ_t = ȳ_{t−1} + (1/t)(y_t − ȳ_{t−1}).
The weighted least squares estimate (55) based on data up to time t is
$$\hat{\theta}_t = \Big(\sum_{k=1}^{t} \varphi_k \Lambda \varphi_k^T\Big)^{-1} \sum_{k=1}^{t} \varphi_k \Lambda y_k. \qquad (159)$$
Define
$$P_t = \Big(\sum_{k=1}^{t} \varphi_k \Lambda \varphi_k^T\Big)^{-1}. \qquad (160)$$
From this definition we have that
$$P_t = \Big(\sum_{k=1}^{t-1} \varphi_k \Lambda \varphi_k^T + \varphi_t \Lambda \varphi_t^T\Big)^{-1}, \qquad (161)$$
which gives
$$P_t^{-1} = P_{t-1}^{-1} + \varphi_t \Lambda \varphi_t^T. \qquad (162)$$
Step 1 Initial values for Pt=0 and θ̂t=0 . It is common practice to take Pt=0 = ρIp
with ρ a ”large” constant and θ̂t=0 = 0 without any a-priori information.
and
where
$$A = \begin{bmatrix} 0 & 1 \\ a_1 & a_2 \end{bmatrix}, \quad B = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}, \quad C = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}, \quad D = \begin{bmatrix} 1 & 0 \end{bmatrix}. \qquad (177)$$
$$y_{k+2} = a_2 y_{k+1} + a_1 y_k + b_1 u_{k+1} + (b_2 - a_2 b_1) u_k + e_{k+2} + (c_1 - a_2) e_{k+1} + (c_2 - a_2 c_1 - a_1) e_k \qquad (178)$$
This can be written as a linear regression model
$$y_t = \underbrace{\begin{bmatrix} y_{t-1} & y_{t-2} & u_{t-1} & u_{t-2} \end{bmatrix}}_{\varphi_t^T}\, \underbrace{\begin{bmatrix} a_2 \\ a_1 \\ b_1 \\ b_2 - a_2 b_1 \end{bmatrix}}_{\theta} + e_t. \qquad (180)$$
An ROLS algorithm for the estimation of the parameter vector is implemented
in the following MATLAB script.
N=200;
rand('state',0), randn('seed',0)
u=randn(N,1);
u=prbs1(N,10,40);
randn('seed',0)
Th(t,:)=th'; % Store the parameter estimates.
end
Tm(i,:)=th';
end
th
th0=[a2;a1;b1;b2-a2*b1]
figure(1)
subplot(411), plot(Th(:,1)), ylabel('a_2'), title('ROLS example')
subplot(412), plot(Th(:,2)), ylabel('a_1')
subplot(413), plot(Th(:,3)), ylabel('b_1')
subplot(414), plot(Th(:,4)), ylabel('b_2-a_2b_1')
xlabel('Discrete time: t')
figure(2)
subplot(211), plot(u), ylabel('u_k'), title('ROLS example')
subplot(212), plot(y), ylabel('y_k')
xlabel('Discrete time: t')
Using (186) in (185) gives
θ̂t = θ̂t−1 + Kt (yt − ϕTt θ̂t−1 ). (191)
Hence, θ̄_t = θ̂_{t−1}. Finally, the comparison between ROLS and the Kalman filter
also shows that
$$\Lambda = W^{-1}. \qquad (192)$$
Hence, the optimal weighting matrix in the OLS algorithm is the inverse of the
measurement noise covariance matrix.
The formulation of the Kalman filter gain presented in (184) is slightly different
from the more common formulation, viz.
$$K_t = \bar{X}_t D_t^T (W + D_t \bar{X}_t D_t^T)^{-1}. \qquad (193)$$
In order to prove that (184) and (193) are equivalent we substitute X̂_t into the
first expression in (184), i.e.,
$$\begin{aligned} K_t &= \hat{X}_t D_t^T W^{-1} \\ &= \big(\bar{X}_t - \bar{X}_t D_t^T (W + D_t \bar{X}_t D_t^T)^{-1} D_t \bar{X}_t\big) D_t^T W^{-1} \\ &= \bar{X}_t D_t^T W^{-1} - \bar{X}_t D_t^T (W + D_t \bar{X}_t D_t^T)^{-1} D_t \bar{X}_t D_t^T W^{-1} \\ &= \bar{X}_t D_t^T \big(W^{-1} - (W + D_t \bar{X}_t D_t^T)^{-1} D_t \bar{X}_t D_t^T W^{-1}\big) \\ &= \bar{X}_t D_t^T (W + D_t \bar{X}_t D_t^T)^{-1}\big((W + D_t \bar{X}_t D_t^T) W^{-1} - D_t \bar{X}_t D_t^T W^{-1}\big) \\ &= \bar{X}_t D_t^T (W + D_t \bar{X}_t D_t^T)^{-1}. \end{aligned} \qquad (194)$$
where we simply have set N = t, omitted the averaging factor 1/t, and included
a forgetting factor λ in order to weight the newest data more than old data. We
typically have 0 < λ ≤ 1, often λ = 0.99.
The least squares estimate is given by
$$\hat{\theta}_t = P_t \sum_{k=1}^{t} \lambda^{t-k} \varphi_k \Lambda y_k, \qquad (196)$$
where
$$P_t = \Big(\sum_{k=1}^{t} \lambda^{t-k} \varphi_k \Lambda \varphi_k^T\Big)^{-1}. \qquad (197)$$
7.2.1 Recursive computation of Pt
Let us derive a recursive formulation for the covariance matrix Pt . We have from
the definition of Pt that
$$P_t = \Big(\sum_{k=1}^{t} \lambda^{t-k} \varphi_k \Lambda \varphi_k^T\Big)^{-1} = \Big(\sum_{k=1}^{t-1} \lambda^{t-k} \varphi_k \Lambda \varphi_k^T + \varphi_t \Lambda \varphi_t^T\Big)^{-1} = \Big(\lambda \underbrace{\sum_{k=1}^{t-1} \lambda^{t-1-k} \varphi_k \Lambda \varphi_k^T}_{P_{t-1}^{-1}} + \varphi_t \Lambda \varphi_t^T\Big)^{-1}. \qquad (198)$$
Using that
$$P_{t-1} = \Big(\sum_{k=1}^{t-1} \lambda^{t-1-k} \varphi_k \Lambda \varphi_k^T\Big)^{-1} \qquad (199)$$
gives
$$P_t = (\lambda P_{t-1}^{-1} + \varphi_t \Lambda \varphi_t^T)^{-1}. \qquad (200)$$
Using the matrix inversion lemma (173) we have the more common covariance
update equation
$$\lambda P_t = P_{t-1} - P_{t-1}\varphi_t (\lambda \Lambda^{-1} + \varphi_t^T P_{t-1}\varphi_t)^{-1} \varphi_t^T P_{t-1}. \qquad (201)$$
Substituting (203) into (202) gives
$$\hat{\theta}_t = P_t (\lambda P_{t-1}^{-1} \hat{\theta}_{t-1} + \varphi_t \Lambda y_t). \qquad (204)$$
From (200) we have that
$$\lambda P_{t-1}^{-1} = P_t^{-1} - \varphi_t \Lambda \varphi_t^T. \qquad (205)$$
Substituting this into (204) gives
$$\hat{\theta}_t = P_t\big((P_t^{-1} - \varphi_t \Lambda \varphi_t^T)\hat{\theta}_{t-1} + \varphi_t \Lambda y_t\big) = \hat{\theta}_{t-1} + P_t \varphi_t \Lambda y_t - P_t \varphi_t \Lambda \varphi_t^T \hat{\theta}_{t-1}, \qquad (206)$$
hence,
$$\hat{\theta}_t = \hat{\theta}_{t-1} + P_t \varphi_t \Lambda (y_t - \varphi_t^T \hat{\theta}_{t-1}). \qquad (207)$$
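A minimal MATLAB sketch of the recursion defined by the covariance update (201) and the parameter update (207) for a scalar output with Λ = 1; the simulated ARX data, the forgetting factor and the initialization P_0 = ρI are illustrative choices.

% Recursive (weighted) least squares with forgetting factor lambda,
% implementing the updates (201) and (207) for a scalar output (Lambda = 1).
N      = 400;
a      = 0.8;  b = 0.5;                        % "true" ARX(1,1) parameters
u      = randn(N,1);
y      = zeros(N,1);
for k = 2:N
    y(k) = a*y(k-1) + b*u(k-1) + 0.05*randn;   % illustrative data generation
end

lambda = 0.99;                                 % forgetting factor
p      = 2;
P      = 1e4*eye(p);                           % P_0 = rho*I with rho "large"
th     = zeros(p,1);                           % theta_0 = 0
Th     = zeros(N,p);                           % storage for the estimates

for t = 2:N
    phi = [y(t-1); u(t-1)];                    % regressor phi_t
    % Covariance update (201): lambda*P_t = P_{t-1} - ...
    P   = ( P - (P*phi)*(phi'*P) / (lambda + phi'*P*phi) ) / lambda;
    % Parameter update (207)
    th  = th + P*phi*( y(t) - phi'*th );
    Th(t,:) = th';
end
th                                             % should be close to [a; b]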
1. Identify a higher order ARX model using the OLS approach. The order of
the ARX model should be chosen large enough, such that the residual is
approximately white noise. One has to ensure that the input or reference
signal experiment is rich enough with perturbations in order for the OLS
problem to be well defined.
2. Form a higher order state space model from the ARX model parameters
and then form the necessary number of impulse response matrices. The
impulse response matrices may also be formed directly from the ARX model
parameters.
3. Use Hankel matrix realization theory to compute a reduced order state space
model of correct order. A sketch of this procedure is given below.
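A minimal MATLAB sketch of the three steps above for a SISO system: a high order ARX model estimated by OLS, impulse responses of the deterministic part, and a Hankel matrix (Ho-Kalman/Kung type) realization of reduced order n. All data, orders and dimensions are illustrative, and the realization step is one standard choice, not necessarily the exact variant intended in the text.

% Step 0: illustrative data from a 2nd order system
N = 2000;  u = randn(N,1);
atrue = [1 -1.5 0.7];  btrue = [0 0.5 0.2];
y = filter(btrue, atrue, u) + 0.05*randn(N,1);

% Step 1: higher order ARX model by OLS
na = 10;  nb = 10;                                   % "high enough" ARX orders
Phi = [];
for i = 1:na, Phi = [Phi, y(na+1-i:N-i)]; end        % past outputs
for i = 1:nb, Phi = [Phi, u(na+1-i:N-i)]; end        % past inputs
th   = Phi \ y(na+1:N);                              % OLS estimate
ahat = [1; -th(1:na)]';  bhat = [0; th(na+1:na+nb)]';

% Step 2: impulse responses of the deterministic part G(q) = B(q)/A(q)
L = 40;
h = filter(bhat, ahat, [1; zeros(2*L,1)]);           % h(1)=h_0=0, h(2)=h_1, ...

% Step 3: Hankel matrix realization of order n
n  = 2;
H1 = hankel(h(2:L+1), h(L+1:2*L));                   % Hankel matrix of h_1 ... h_{2L-1}
H2 = hankel(h(3:L+2), h(L+2:2*L+1));                 % shifted Hankel matrix
[U,S,V] = svd(H1);
Un = U(:,1:n); Sn = S(1:n,1:n); Vn = V(:,1:n);
Ob = Un*sqrtm(Sn);  Co = sqrtm(Sn)*Vn';              % observability/controllability factors
A  = Ob \ (H2 / Co);                                 % reduced order system matrix
B  = Co(:,1);                                        % first column of the controllability factor
D  = Ob(1,:);                                        % first row of the observability factor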
8.1 Miscellaneous examples
Example 8.1 (Higher order ARX model)
Given a 1st order state space model in the form
xk+1 = axk + buk + cek , (208)
yk = xk + ek . (209)
This can be written as the following ARMAX model
yk + a1 yk−1 = b1 uk−1 + ek + c1 ek−1 , (210)
where a1 = −a, b1 = b and c1 = c − a.
Let us now investigate how well the parameters can be estimated by a third
order ARX model.
Putting k := k + 1 in (210) and substituting for e_k, solved from (210), into this
equation gives a second order difference equation. Repeating this gives the
following third order difference model
$$y_{k+2} + (a_1 - c_1) y_{k+1} + (c_1^2 - c_1 a_1) y_k + c_1^2 a_1 y_{k-1} = b_1 u_{k+1} - c_1 b_1 u_k + c_1^2 b_1 u_{k-1} + e_{k+2} + c_1^3 e_{k-1}. \qquad (211)$$
This can be written as the approximate linear regression model
$$y_k = \underbrace{\begin{bmatrix} y_{k-1} & y_{k-2} & y_{k-3} & u_{k-1} & u_{k-2} & u_{k-3} \end{bmatrix}}_{\varphi_k^T}\, \underbrace{\begin{bmatrix} c_1 - a_1 \\ c_1 a_1 - c_1^2 \\ -c_1^2 a_1 \\ b_1 \\ -c_1 b_1 \\ c_1^2 b_1 \end{bmatrix}}_{\theta} + e_k + c_1^3 e_{k-3}. \qquad (212)$$
The parameters in this model can be approximately estimated via an ARX model
solved by the Ordinary Least Squares (OLS) method when c_1^3 = (c − a)^3 ≈ 0. Note
also that the predictor (Kalman filter) on innovations form is stable. This means
that the magnitude of the eigenvalue of the predictor, a − cd, is less than one.
Hence, c_1^3 = (c − a)^3 ≈ 0 is a good assumption.
Note also that the noise term may be expressed as
Putting k := k − 1 in (216) and substituting into (215) gives the 2nd order ARX
model approximation
when (c − a)^2 ≈ 0. Similarly, expressing e_{k−2} from (216) and substituting into
(217) gives the 3rd order ARX model approximation
when (c − a)^3 ≈ 0.
$$\ln(p) = a - \frac{b}{T + c}. \qquad (221)$$
Multiplying (221) with T + c gives
and from eq. (222) we find the linear regression model eq. (220).
Example 9.2 (Parameters in Antoine's equation)
Given the Antoine equation for the relationship between temperature T and vapor
pressure p in pure components, such as e.g. saturated steam, as also discussed in
Example 9.1, i.e.
$$p = e^{a - \frac{b}{T + c}}. \qquad (223)$$
i     p_i [bar]    T_i [°C]
1     1            99.1
2     2            119.6
3     3            132.9
4     4            142.9
5     5            151.1
6     6            158.1
7     7            164.2
8     8            169.6
9     9            174.5
10    10           179.0
Using the results from Example 9.1 we form the linear regression matrix equa-
tion
Y = XB, (224)
or equivalently Y = Φθ.
Using eq. (220) then we may form the known data matrices Y and X as
follows
$$Y = \begin{bmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_{10} \end{bmatrix} = \begin{bmatrix} T_1 \ln(p_1) \\ \vdots \\ T_i \ln(p_i) \\ \vdots \\ T_{10}\ln(p_{10}) \end{bmatrix} = \begin{bmatrix} 0 \\ 82.90 \\ 146.01 \\ 198.13 \\ 243.19 \\ 283.24 \\ 319.52 \\ 352.67 \\ 383.42 \\ 412.16 \end{bmatrix}, \qquad (225)$$
$$X = \Phi = \begin{bmatrix} \varphi_1^T \\ \vdots \\ \varphi_i^T \\ \vdots \\ \varphi_{10}^T \end{bmatrix} = \begin{bmatrix} -\ln(p_1) & T_1 & 1 \\ \vdots & \vdots & \vdots \\ -\ln(p_i) & T_i & 1 \\ \vdots & \vdots & \vdots \\ -\ln(p_{10}) & T_{10} & 1 \end{bmatrix} = \begin{bmatrix} 0 & 99.10 & 1 \\ -0.69 & 119.60 & 1 \\ -1.10 & 132.90 & 1 \\ -1.39 & 142.92 & 1 \\ -1.61 & 151.10 & 1 \\ -1.79 & 158.08 & 1 \\ -1.95 & 164.20 & 1 \\ -2.08 & 169.60 & 1 \\ -2.20 & 174.50 & 1 \\ -2.30 & 179.00 & 1 \end{bmatrix}. \qquad (226)$$
Then we find the ordinary least squares estimate of the parameter vector θ (or
equivalently the coefficient matrix B) as follows
$$B_{\mathrm{OLS}} = \hat{\theta} = (X^T X)^{-1} X^T Y = \begin{bmatrix} c \\ a \\ ac - b \end{bmatrix} = \begin{bmatrix} 226.37 \\ 11.68 \\ -1157.23 \end{bmatrix}. \qquad (227)$$
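A MATLAB sketch that reproduces the computation in Example 9.2 from the tabulated (p_i, T_i) data; the back-substitution to (a, b, c) at the end is an added illustration, and the numbers should agree with (227) up to rounding.

% OLS fit of the linearized Antoine equation,
% T*ln(p) = -c*ln(p) + a*T + (a*c - b), using the data of Example 9.2.
p = (1:10)';                                            % pressure [bar]
T = [99.1 119.6 132.9 142.9 151.1 158.1 164.2 169.6 174.5 179.0]';  % temperature [C]

Y = T.*log(p);                                          % regressed variable, eq. (225)
X = [-log(p), T, ones(10,1)];                           % regressor matrix, eq. (226)

B = X \ Y;                                              % OLS estimate, eq. (227): [c; a; a*c - b]
c = B(1);  a = B(2);  b = a*c - B(3);                   % recover the Antoine parameters

p_model = exp(a - b./(T + c));                          % fitted vapor pressures (illustration)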