MSc Applied Econometrics — Time Series
Estimation, Specification and Testing
F Blasques, C Francq, SJ Koopman, JM Zakoian
Lancaster University
Timberlake Consultants
January/February 2023
ARMA models: estimation, specification and testing
Program today:
• Estimation via Regression
• Maximum Likelihood
• Specification
• General to specific
• Residuals
• Testing and Diagnostic Checking
• Model Validation
The ARMA Coefficients
Let {Xt } be generated by an ARMA(p, q),
Xt = φ1 Xt−1 + . . . + φp Xt−p + εt + θ1 εt−1 + . . . + θq εt−q
with {εt } ∼ W N (0, σε2 ). Then,
• The stationarity properties of {Xt } depend on the values of
the parameters φ1 , ..., φp .
• Unconditional moments and dynamic structure (the ACF of
{Xt }) depend on the parameters φ1 , ..., φp , θ1 , ..., θq and σε2 .
• Forecasted paths and forecasting accuracy depend on
φ1 , ..., φp , θ1 , ..., θq and σε2 .
Conclusion: We have to learn about these parameters! We
have to estimate them!
Estimation of ARMA Coefficients
The usual estimators can help us! We just need to be careful
with the fact that the data is not IID.
Estimators that you know:
• Maximum likelihood (maximize likelihood)
• Method of moments (minimize distance between moments)
• Least squares (minimize squared residuals)
Here we focus on:
• Maximum likelihood
• based on joint distribution
• based on prediction error decomposition
• Least squares
The Maximum Likelihood Estimator
Definition: Given a sample XT := (X1 , ..., XT ), with joint pdf
f (XT ; ψ) depending on some vector of parameters ψ, the
Maximum Likelihood (ML) estimator ψ̂T is defined as,
ψ̂T = arg max_ψ f(XT; ψ)
⇒ Given a realized sample xT := x1 , ..., xT , the ML estimate
maximizes the realized likelihood function f (xT ; ψ).
⇒ The ML estimator ψ̂T selects the value for ψ that is the most
likely given the observed data xT .
⇒ The value of ψ that maximizes f(XT; ψ) is the same as the value of ψ that maximizes log f(XT; ψ).
Maximum Likelihood Estimation
Important: Independence is needed to factorize the likelihood,

f(xT; ψ) = ∏_{t=1}^{T} ft(xt; ψ),

and identical distributions are needed to obtain ft = f ∀ t,

∏_{t=1}^{T} ft(xt; ψ) = ∏_{t=1}^{T} f(xt; ψ).

Important: When X1, ..., XT are not independent, THEN, we CANNOT factorize the joint likelihood,

f(xT; ψ) ≠ ∏_{t=1}^{T} ft(xt; ψ).
Joint Likelihood of T -Period Sequence from ARMA(p, q)
Let the sample XT := (X1 , ..., XT ) be a subset of a time-series
{XT } generated by a stationary ARMA(p, q) model,
Xt = φ1 Xt−1 + φ2 Xt−2 + . . . + φp Xt−p + εt + θ1 εt−1 + . . . + θq εt−q ,
where εt ∼ NID(0, σε²), with non-zero coefficients φ1, ..., φp, θ1, ..., θq and σε² > 0.
Then, the elements X1 , ..., XT are not independent and the
likelihood cannot be factorized!
However, XT is jointly Gaussian and hence we can still
compute the ML estimator!
Multivariate Normal Distribution
Assume XT := (X1, ..., XT) is jointly Gaussian (that is, jointly normally distributed) with distribution XT ∼ N(µ, Γ). The log of the joint probability density function (pdf) is given by

log f(XT; µ, Γ) = −(T/2) log 2π − (1/2) log |Γ| − (1/2)(XT − µ)′ Γ⁻¹ (XT − µ).

In case XT is a T-period sequence from a stationary ARMA(p, q) process, we have µ = 0 and Γ is a Toeplitz matrix of autocovariances.
Multivariate Normal Likelihood of ARMA Model
If XT := (X1 , ..., XT ) is jointly Gaussian, we can define the ML
estimator of the vector of parameters
ψ := (φ1 , ..., φp , θ1 , ..., θq , σε2 ) as,
ψ̂T = arg max_ψ [ −(T/2) log 2π − (1/2) log |Γ(ψ)| − (1/2) XT′ Γ(ψ)⁻¹ XT ]

where Γ(ψ) is the variance-covariance matrix of the Gaussian vector (X1, ..., XT), which depends on the ARMA parameters,

Γ(ψ) = [ γ(0)      γ(1)      γ(2)      ...   γ(T−1)
         γ(1)      γ(0)      γ(1)      ...   γ(T−2)
         γ(2)      γ(1)      γ(0)      ...   γ(T−3)
         ...       ...       ...       ...   ...
         γ(T−1)    γ(T−2)    γ(T−3)    ...   γ(0)   ].
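To make this concrete, here is a minimal numerical sketch (Python/NumPy; the variable names, seed and settings are illustrative and not part of the slides) that builds Γ(ψ) for an AR(1) as a Toeplitz matrix, using the AR(1) autocovariances γ(k) = σε² φ^k / (1 − φ²) from last week (also shown on the next slide), and evaluates the exact Gaussian log-likelihood at a candidate parameter value.

import numpy as np
from scipy.linalg import toeplitz

def ar1_exact_loglik(x, phi, sigma2):
    """Exact Gaussian log-likelihood of an AR(1) sample, built from the
    Toeplitz autocovariance matrix Gamma(psi); feasible for moderate T only."""
    T = len(x)
    # autocovariances gamma(k) = sigma2 * phi**k / (1 - phi**2), k = 0, ..., T-1
    gamma = sigma2 * phi ** np.arange(T) / (1.0 - phi ** 2)
    Gamma = toeplitz(gamma)
    _, logdet = np.linalg.slogdet(Gamma)
    quad = x @ np.linalg.solve(Gamma, x)          # x' Gamma^{-1} x
    return -0.5 * (T * np.log(2.0 * np.pi) + logdet + quad)

rng = np.random.default_rng(0)
T, phi0 = 200, 0.8
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi0 * x[t - 1] + rng.standard_normal()

print(ar1_exact_loglik(x, 0.8, 1.0))              # log-likelihood at the true parameters

For large T one would avoid forming and inverting the full T × T matrix, which is exactly the motivation for the prediction error decomposition below.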
Variance-Covariance Matrix of AR(1)
Let X1 , ..., XT be a subset of a time-series generated by a
stationary Gaussian AR(1) model,
Xt = φXt−1 + εt , {εt } ∼ NID(0, σε2 ).
Then, using the results from last week:
Γ(ψ) = Γ(φ, σε²) = (σε² / (1 − φ²)) × [ 1          φ          φ²         ...   φ^{T−1}
                                         φ          1          φ          ...   φ^{T−2}
                                         φ²         φ          1          ...   φ^{T−3}
                                         ...        ...        ...        ...   ...
                                         φ^{T−1}    φ^{T−2}    φ^{T−3}    ...   1       ].
Variance-Covariance Matrix of MA(1)
Let X1 , ..., XT be a subset of a time-series generated by a
stationary Gaussian MA(1) model,
Xt = εt + θεt−1 , {εt } ∼ NID(0, σε2 ).
Then, using the results from last week:
Γ(θ, σε²) = σε² × [ 1 + θ²    θ         0         ...   0
                    θ         1 + θ²    θ         ...   0
                    0         θ         1 + θ²    ...   0
                    ...       ...       ...       ...   ...
                    0         0         0         ...   1 + θ² ].
Problems with Multivariate Distribution Likelihood
The idea is straightforward... but the practice is difficult!
Practical problems:
• Maximizing the likelihood function is involved!
• Inverting the matrix Γ(ψ) is challenging for large sample
sizes T .
• Computing the determinant |Γ(ψ)| is also challenging for
large sample sizes T .
• We leave it to clever algorithms for the computer!
Important: We can simplify things by using prediction error
decomposition to re-write the joint likelihood as a product of
densities!
Prediction Error Decomposition of ARMA Likelihood
Recall: When observations are dependent, then,
f(x1, ..., xT; ψ) ≠ f(x1; ψ) × ... × f(xT; ψ).

However: We can always factorize the joint distribution function as a product of conditional times marginal:

f(x1, x2; ψ) = f(x1; ψ) × f(x2 | x1; ψ),

f(x1, x2, x3; ψ) = f(x2, x1; ψ) × f(x3 | x2, x1; ψ)
                 = f(x1; ψ) × f(x2 | x1; ψ) × f(x3 | x2, x1; ψ),

f(x1, ..., xT; ψ) = f(x1; ψ) × ∏_{t=2}^{T} f(xt | xt−1, ..., x1; ψ).
Prediction Error Decomposition of ARMA Likelihood
Important: We can always write the loglikelihood function as
a sum of conditional densities!
log L(xT; ψ) = ∑_{t=1}^{T} log f(xt | Dt−1; ψ),
where Dt denotes the set x1 , . . . , xt and f (x1 |D0 ; ψ) = f (x1 ; ψ).
If we know the conditional distributions f (xt |Dt−1 ; ψ), then we
can write the likelihood and maximize it!
What is the distribution of Xt conditional on past values?
Likelihood Function of AR(1)
Let X1 , ..., XT be a subset of a time-series generated by a
stationary Gaussian AR(1) model,
Xt = φXt−1 + εt , {εt } ∼ NID(0, σε2 ).
Then, we have
X2 |X1 ∼ N(φX1 , σε2 )
X3 |X2 , X1 = X3 |X2 ∼ N(φX2 , σε2 )
and in general Xt |Xt−1 ∼ N(φXt−1 , σε2 ).
⇒ In the stationary Gaussian AR(1), every f (Xt |Dt−1 ) is a
normal density with different mean but same variance!
Likelihood Function of AR(1)
If X1 , ..., XT is a subset of a stationary Gaussian AR(1) process,
then, ψ = (φ, σε2 ) and,
ψ̂T = arg max_ψ f(xT; ψ) = arg max_ψ ∏_{t=2}^{T} f(xt | Dt−1; ψ)

   = arg max_ψ ∏_{t=2}^{T} (1 / √(2πσε²)) exp[ −(xt − φxt−1)² / (2σε²) ]

   = arg max_ψ ∑_{t=2}^{T} [ −log √(2πσε²) − (xt − φxt−1)² / (2σε²) ].
Note: We start at t = 2 and simply treat x1 as a fixed value, since x0 is unknown. We call this conditional ML by prediction error decomposition. If we instead assume a distribution for x1, then we call it ML by prediction error decomposition with exact initialization.
ML Estimator of an AR(1) with NID(0,1) Innovations
Let X1 , ..., XT be a subset of a time-series generated by the
following stationary AR(1) model,
Xt = φXt−1 + εt , {εt } ∼ NID(0, 1).
⇒ Stability requires |φ| < 1 and hence, the parameter space is
the interval (−1, 1).
⇒ ML estimator is given by,
φ̂T = arg max_φ ∑_{t=2}^{T} [ −log √(2π) − (xt − φxt−1)² / 2 ].
Since the likelihood is differentiable w.r.t. φ we can find the
maximum by setting the derivative equal to zero!
We ignore further details such as checking second derivatives
and asymptotes of the likelihood function.
ML Estimator of an AR(1) with NID(0,1) Innovations
Since   log L(φ) = ∑_{t=2}^{T} [ −log √(2π) − (xt − φxt−1)² / 2 ],

we have   ∂ log L(φ) / ∂φ = ∑_{t=2}^{T} (xt − φxt−1) xt−1,

and by construction, φ̂T satisfies,

∂ log L(φ̂T) / ∂φ = 0   ⇔   ∑_{t=2}^{T} (xt − φ̂T xt−1) xt−1 = 0

⇔   ∑_{t=2}^{T} xt xt−1 = φ̂T ∑_{t=2}^{T} (xt−1)²   ⇔   φ̂T = ( ∑_{t=2}^{T} xt xt−1 ) / ( ∑_{t=2}^{T} (xt−1)² ).
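A minimal sketch of this closed-form conditional ML / least-squares expression (Python/NumPy; the simulated sample and seed are illustrative):

import numpy as np

rng = np.random.default_rng(1)
T, phi0 = 500, 0.8
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi0 * x[t - 1] + rng.standard_normal()

# conditional ML / least-squares estimator: sum of x_t * x_{t-1} over sum of x_{t-1}^2
phi_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
print(phi_hat)                                    # should be close to 0.8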
Least Squares Estimator
Definition: Given a subset XT := (X1 , ..., XT ) of a process
{Xt } generated by an AR(1) model,
Xt = φXt−1 + εt
the least-squares (LS) estimator of φ is defined as
φ̂T = arg min_φ ∑_{t=2}^{T} (Xt − φXt−1)².
⇒ Least squares estimator φ̂T for AR(1) selects the value for φ
that gives the best squared error fit given the observed data xT .
⇒ LS estimator φ̂T is equivalent to (conditional) ML estimator.
Maximum Likelihood Estimation in ACTION
To illustrate:
• Simulate a long sequence from an AR(1) process with
coefficients
φ = 0.8, σε2 = 1.
• Select a T -period sequence from long sequence, T = 200.
• Numerically maximize log-likelihood w.r.t. φ and σε2 .
• There are efficient search methods for finding the
maximum.
• The ML estimates are φ̂T = 0.8009 and σ̂²ε,T = 0.992406.
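A sketch of such an experiment (Python with NumPy/SciPy; the optimizer, seed and parameterization are assumptions, not the exact setup behind the numbers above): simulate an AR(1) and numerically maximize the conditional Gaussian log-likelihood over (φ, σε²).

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
T, phi0, sigma0 = 200, 0.8, 1.0
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi0 * x[t - 1] + sigma0 * rng.standard_normal()

def neg_loglik(params, x):
    """Minus the conditional Gaussian log-likelihood of an AR(1)."""
    phi, log_sigma2 = params                      # log-variance keeps sigma2 > 0 in the search
    sigma2 = np.exp(log_sigma2)
    e = x[1:] - phi * x[:-1]                      # one-step prediction errors
    return 0.5 * np.sum(np.log(2.0 * np.pi * sigma2) + e ** 2 / sigma2)

res = minimize(neg_loglik, x0=[0.0, 0.0], args=(x,), method="BFGS")
phi_hat, sigma2_hat = res.x[0], np.exp(res.x[1])
print(phi_hat, sigma2_hat)                        # estimates close to 0.8 and 1.0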
Simulated AR(1) process with φ = 0.8
[Figure: time plot of the simulated AR(1) series.]
Maximum Likelihood Estimation of AR(1) with φ = 0.8
[Figure: log-likelihood value plotted against the φ coefficient; the curve peaks close to φ = 0.80.]
ML and LS Estimator Properties
Important: We now know how to obtain point estimates for
the autoregressive parameter φ.
Question: How reliable are those estimates? Can we calculate confidence intervals? Can we test hypotheses about φ?
Yes! We can study both small-sample and large-sample properties:
• Monte Carlo simulations reveal small sample properties
like bias, variance and RMSE.
• Asymptotic theory establishes large sample properties
like consistency and asymptotic normality.
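A minimal Monte Carlo sketch (Python/NumPy; the number of replications and sample size are illustrative, not those of the figure below) that estimates the small-sample bias and RMSE of the LS / conditional-ML estimator of φ:

import numpy as np

def simulate_ar1(T, phi, rng):
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

rng = np.random.default_rng(3)
phi0, T, n_rep = 0.8, 100, 5000
estimates = np.empty(n_rep)
for r in range(n_rep):
    x = simulate_ar1(T, phi0, rng)
    estimates[r] = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

bias = estimates.mean() - phi0
rmse = np.sqrt(np.mean((estimates - phi0) ** 2))
print(bias, rmse)          # the estimator of phi is biased downward in small samples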
Simulated Small-Sample Properties
Figure: Distribution obtained from 9000 Monte Carlo draws of LS and
ML estimator of φ in AR(1) with known variance.
[Figure: six histogram panels, for φ = 0.8 and φ = 0.97 with T = 100, 250, 1000; the sampling distributions tighten around the true value as T increases.]
Model Specification: Selecting p and q
⇒ The properties of the ML and OLS estimators are typically derived under the assumption of a correctly specified ARMA(p, q) model!
⇒ The forecasts and impulse response functions depend on the orders p and q of the ARMA and ADL models.
Question: How can we select p and q? How can we test for correct
specification?
Answer: Using the general-to-specific approach (based on
autocorrelation tests and parameter significance or some information
criteria).
Determine p and q in ARMA model
Selecting p and q as large as possible is not a good strategy:
• Estimation uncertainty increases a lot!
• Forecasting uncertainty increases due to parameter
estimation uncertainty!
Definition: The ‘General-to-Specific’ estimation strategy for
ARMA models begins with the estimation of an ARMA(p, q)
with large p and q, and reduces the size of the model by
eliminating lags that are insignificant.
Alternative: G2S selection can also be based on minimizing an information criterion such as Akaike's Information Criterion (AIC), which adds a penalty to the log-likelihood that discourages too many parameters.
Akaike’s Information Criterion (AIC)
AIC = 2k − 2 log L(ψ̂), where k = number of parameters in the model
⇒ For a given k, the model with the largest likelihood L has the lowest AIC
⇒ Likelihood can always be improved by adding parameters to
the model!
⇒ AIC penalizes models with more parameters!
Strategy: model with smallest AIC is best!
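A sketch of AIC-based order selection for a pure AR model (Python/NumPy; the conditional Gaussian likelihood without an intercept and the convention k = p + 1, counting the error variance, are assumptions):

import numpy as np

def fit_ar_aic(x, p):
    """Fit an AR(p) without intercept by OLS on lagged values and return its AIC
    (conditional Gaussian likelihood, k = p + 1 parameters incl. the error variance)."""
    T = len(x)
    Y = x[p:]
    X = np.column_stack([x[p - j:T - j] for j in range(1, p + 1)])   # lags 1, ..., p
    phi, *_ = np.linalg.lstsq(X, Y, rcond=None)
    e = Y - X @ phi
    sigma2 = np.mean(e ** 2)
    loglik = -0.5 * len(Y) * (np.log(2.0 * np.pi * sigma2) + 1.0)
    k = p + 1
    return 2 * k - 2 * loglik

# choose the AR order with the smallest AIC (x is the observed series):
# aics = {p: fit_ar_aic(x, p) for p in range(1, 9)}
# p_best = min(aics, key=aics.get)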
Correct Specification in ARMA model
Let XT := (X1 , ..., XT ) be a subset of a process {Xt } generated
by an ARMA(p, q) model,
Xt = φ1 Xt−1 + ... + φp Xt−p + εt + θ1 εt−1 + ... + θq εt−q.
Then: Estimation of an ARMA(p∗ , q ∗ ) model with p∗ < p or
q ∗ < q introduces autocorrelation in the residuals!
Example:
• DGP is AR(2) Xt = φ1 Xt−1 + φ2 Xt−2 + εt .
• Estimated model is AR(1) Xt = φ1 Xt−1 + ut .
⇒ ut = εt + φ2 Xt−2 has autocorrelation because {Xt } has
autocorrelation!
Correct Specification in ARMA model
Conclusion: Autocorrelation in the residuals is a sign of
misspecification of the ARMA model! (p or q might be too
low!)
Conclusion: Testing for autocorrelation in the residuals is
important!
Residual Sample Autocorrelation
Note: A host of other autocorrelation tests can be performed!
You already know these! Just recap from previous courses!
Autoregressive Modelling Tips
Let XT := (X1 , ..., XT ) be a subset of a process {Xt } generated
by an AR(p) model,
Xt = φ1 Xt−1 + ... + φp Xt−p + εt .
General-to-Specific:
(1) Estimate AR(p) for large p.
(2) Check for autocorrelation (or other signs of incorrect
specification!)
(3) Eliminate insignificant lag (large p-values! what is large?).
(4) Re-estimate and repeat steps (2) and (3) until all remaining lags appear significant, or until significant autocorrelation appears in the residuals (a sign the model has become too small).
ADL(p, q) Model Specification
Important: Let the DGP of {(Yt , Xt )} be given by an
ADL(p, q) model. Then, in general, estimation of an
ADL(p∗ , q ∗ ) with p∗ < p or q ∗ < q will result in a regression
with autocorrelation in the residuals.
Conclusion: Autocorrelation in the ADL regression residuals
is a sign of model misspecification!
Regression Residuals and ADL(p, q)
Let the dynamics of {Yt } be given by an ADL(1, 0) model
Yt = α + φYt−1 + βXt + εt
Suppose that we regress Yt on Xt ,
Yt = α + βXt + ut
The static regression model is misspecified and
ut = φYt−1 + εt is correlated with its lag
ut−1 = φYt−2 + εt−1 .
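A small simulation sketch of this point (Python/NumPy; the parameter values are illustrative): data are generated by an ADL(1,0), Yt is then regressed on Xt only, and the Durbin-Watson statistic of the static-regression residuals falls well below 2.

import numpy as np

rng = np.random.default_rng(4)
T, alpha, phi, beta = 500, 1.0, 0.7, 0.5
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = alpha + phi * y[t - 1] + beta * x[t] + rng.standard_normal()

# misspecified static regression of y on a constant and x only
Z = np.column_stack([np.ones(T), x])
b, *_ = np.linalg.lstsq(Z, y, rcond=None)
u = y - Z @ b

dw = np.sum(np.diff(u) ** 2) / np.sum(u ** 2)
print(dw)        # well below 2: positive residual autocorrelation from the omitted lag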
Innovation Properties: Normality tests
⇒ The ML estimator for the AR(1) was derived under the
assumption that the innovations are Gaussian!
⇒ The forecast confidence bounds are derived assuming that
the innovations are Gaussian!
Question: Can we test if the innovations are Gaussian?
Answer: Yes! Using the Jarque-Bera statistic:

JB = ((T − k + 1) / 6) [ (μ̂3)² + (1/4)(μ̂4 − 3)² ]

where k = number of regressors, μ̂3 is the sample skewness and μ̂4 is the sample kurtosis.

H0: data is normally distributed,   H1: data is not normally distributed

JB ∼ χ²(2) under H0
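A sketch of the statistic (Python/NumPy/SciPy; it follows the slide's normalization (T − k + 1)/6, uses moment-based sample skewness and kurtosis, and the function name is hypothetical):

import numpy as np
from scipy.stats import chi2

def jarque_bera(residuals, k=0):
    """Jarque-Bera normality test on residuals from a regression with k regressors."""
    u = residuals - residuals.mean()
    s = u.std()                                   # population (ddof = 0) standard deviation
    skew = np.mean(u ** 3) / s ** 3               # sample skewness, mu_3 hat
    kurt = np.mean(u ** 4) / s ** 4               # sample kurtosis, mu_4 hat
    T = len(u)
    jb = (T - k + 1) / 6.0 * (skew ** 2 + 0.25 * (kurt - 3.0) ** 2)
    p_value = chi2.sf(jb, df=2)                   # JB ~ chi-squared(2) under H0
    return jb, p_value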
Eviews Output: Residuals of ADL
Eviews Output: Jarque-Bera Test
Eviews Output: Jarque-Bera Test
Innovation Properties: White Noise
⇒ The point forecasts and IRFs of ARMA and ADL models
were derived under the assumption that the innovations are
white noise!
Question: Can we test if the innovations are white noise? Can
we test if the innovations are uncorrelated and have fixed
variance (mean zero is ensured by the intercept!)?
Answer: Yes! Using autocorrelation tests and heteroskedasticity tests!
Autocorrelation and Heteroskedasticity Tests

Durbin-Watson: d = ∑_t (ut − ut−1)² / ∑_t (ut)²
DW ≈ 2 : no autocorrelation
DW < 2 : positive autocorrelation

Ljung-Box Q-test statistic: Q = T(T + 2) ∑_{k=1}^{m} (ρ̂k)² / (T − k)
H0: no autocorrelation   H1: autocorrelation

Breusch-Godfrey test: ut = α0 + α1 Xt + ρ1 ut−1 + ... + ρp ut−p + vt
H0: no autocorrelation   H1: autocorrelation
Test statistic: T × R² ∼ χ²(p) under H0

Breusch-Pagan test: (ut)² = α0 + α1 Xt + ... + αp Xt−p + vt
H0: homoskedasticity   H1: heteroskedasticity
Test statistic: F test for α1 = ... = αp = 0
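In practice these tests are available in standard libraries. A hedged sketch using statsmodels (assuming its OLS and diagnostic APIs; the data-generating process with AR(1) errors is illustrative, chosen so that the autocorrelation tests should reject):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox, acorr_breusch_godfrey, het_breuschpagan

rng = np.random.default_rng(5)
T = 300
x = rng.standard_normal(T)
u = np.zeros(T)
for t in range(1, T):                             # AR(1) errors: the static regression is misspecified
    u[t] = 0.6 * u[t - 1] + rng.standard_normal()
y = 1.0 + 0.5 * x + u

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

# Ljung-Box Q-test on the residuals (autocorrelation up to lag 10)
print(acorr_ljungbox(res.resid, lags=[10]))

# Breusch-Godfrey LM test: T * R^2 ~ chi-squared(p) under H0
lm_stat, lm_pval, f_stat, f_pval = acorr_breusch_godfrey(res, nlags=4)
print(lm_stat, lm_pval)

# Breusch-Pagan heteroskedasticity test on the same residuals
bp_lm, bp_lm_pval, bp_f, bp_f_pval = het_breuschpagan(res.resid, X)
print(bp_lm, bp_lm_pval)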
Eviews Output: Autocorrelation Test (Q-Stat)
Eviews Output: Autocorrelation Test (BG LM)
Eviews Output: Heteroskedasticity Test (BPG)
Model Validation
Important: time-series models can be evaluated and compared
by their forecasting accuracy
Problem: in-sample fit is not a good measure due to the
problem of over-fitting
Important idea: split the sample in two parts. Use the first
for estimation and the second for evaluating the forecasting
performance of the model... this is called sub-sample validation!
Example: Suppose you have a sample of T observations
DT = X1 , ..., XT
1. use Dn = X1 , ..., Xn for estimation (n < T )
2. use Xn+1 , Xn+2 ..., XT for validation!
Sub-sample validation with repeated static forecasts
Validation in practice: We have a sample DT = X1, ..., XT
1. use Dn = X1 , ..., Xn for estimation (n < T )
2. produce a 1-step-ahead (i.e. static) forecast X̂n+1 using Dn .
Compare against true value Xn+1
3. produce a 1-step-ahead (i.e. static) forecast X̂n+2 using
Dn+1 . Compare against true value Xn+2
4. continue until you use the entire sample!
Note: the forecast is called dynamic when we use past forecasts as
regressors for obtaining the next forecast.
Sub-sample validation with repeated static forecasts
Validation: the forecast root mean squared error (FRMSE) and the forecast mean absolute error (FMAE) can be used to evaluate the quality of the forecasts!
FRMSE = [ (1 / (T − n)) ∑_{t=n+1}^{T} (Xt − X̂t)² ]^{1/2}    (strong penalty for outliers)

FMAE = (1 / (T − n)) ∑_{t=n+1}^{T} |Xt − X̂t|    (smaller penalty for outliers)
                      Model A   Model B   Model C
Total sample size        276       276       276
Estimation sample        230       230       230
Validation sample         46        46        46
FRMSE                   27.1      28.9      18.4
FMAE                    22.2      21.8      16.1
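A sketch of the repeated static (1-step-ahead) forecast exercise for an AR(1) (Python/NumPy; the split T = 276, n = 230 mirrors the table above, but the series itself is simulated): parameters are estimated once on the first n observations, each forecast conditions on the latest observed value, and FRMSE/FMAE are computed over the validation sample.

import numpy as np

rng = np.random.default_rng(6)
T, n, phi0 = 276, 230, 0.8
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi0 * x[t - 1] + rng.standard_normal()

# estimate phi on the estimation sample X_1, ..., X_n only
phi_hat = np.sum(x[1:n] * x[:n - 1]) / np.sum(x[:n - 1] ** 2)

# repeated static (1-step-ahead) forecasts over the validation sample:
# X_hat_t = phi_hat * X_{t-1} for t = n+1, ..., T
forecasts = phi_hat * x[n - 1:T - 1]
actuals = x[n:]

frmse = np.sqrt(np.mean((actuals - forecasts) ** 2))
fmae = np.mean(np.abs(actuals - forecasts))
print(frmse, fmae)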