Estimation
Prerequisites
• Linear regression.
Objectives
• To derive the sample autocovariance of a time series, and show that this is a positive
definite sequence.
• To show that the variance of the sample covariance involves fourth order cumulants,
which can be unwieldy to estimate in practice. But under linearity the expression
for the variance greatly simplifies.
• To show that under linearity the correlation does not involve the fourth order cumulant.
This is the Bartlett formula.
• To use the above results to construct a test for uncorrelatedness of a time series (the
Portmanteau test), and understand how this test may be useful for testing for inde-
pendence in various different settings. Also understand situations where the test may
fail.
• To be able to derive the Yule-Walker and least squares estimator of the AR parameters.
• To understand what the quasi-Gaussian likelihood for the estimation of ARMA models
is, and how the Durbin-Levinson algorithm is useful in obtaining this likelihood (in
practice). Also how we can approximate it by using approximations of the predictions.
• Understand that there exist alternative methods for estimating the ARMA parame-
ters, which exploit the fact that the ARMA process can be written as an AR(∞).
We now consider various methods for estimating the parameters in a stationary time
series. We first consider estimation of the mean and covariance and then look at estimation
of the parameters of an AR and ARMA process.
The eagle-eyed amongst you may wonder why we don't use $\frac{1}{n-|k|}\sum_{t=1}^{n-|k|} X_t X_{t+|k|}$, and indeed
$\hat c_n(k)$ is more biased than $\frac{1}{n-|k|}\sum_{t=1}^{n-|k|} X_t X_{t+|k|}$. However $\hat c_n(k)$ has some very nice properties
which are discussed in the remark below.
then {ĉn (k)} is a positive definite sequence. Therefore, using Lemma 1.1.1 there exists a
stationary time series {Zt } which has the covariance ĉn (k).
There are various ways to show that $\{\hat c_n(k)\}$ is a positive definite sequence. One method
uses the fact that the corresponding spectral density is non-negative; we give this proof in Section 5.3.1.
An alternative proof uses the definition of positive definiteness: for any vector $a = (a_1,\ldots,a_n)'$
we have
$$a'\begin{pmatrix}
\hat c_n(0) & \hat c_n(1) & \hat c_n(2) & \cdots & \hat c_n(n-1)\\
\hat c_n(1) & \hat c_n(0) & \hat c_n(1) & \cdots & \hat c_n(n-2)\\
\vdots & \vdots & \ddots & & \vdots\\
\hat c_n(n-1) & \hat c_n(n-2) & \cdots & \cdots & \hat c_n(0)
\end{pmatrix} a \ge 0,$$
noting that $\hat c_n(k) = \frac{1}{n}\sum_{t=1}^{n-k} X_t X_{t+k}$. However, $\hat c_n(k)$ has a very interesting
construction: it can be shown that the above covariance matrix satisfies $\hat C_n = \frac{1}{n}\mathcal{X}_n\mathcal{X}_n'$, where $\mathcal{X}_n$ is
the $n \times (2n-1)$ matrix
$$\mathcal{X}_n = \begin{pmatrix}
0 & 0 & \cdots & 0 & X_1 & X_2 & \cdots & X_{n-1} & X_n\\
0 & 0 & \cdots & X_1 & X_2 & \cdots & X_{n-1} & X_n & 0\\
\vdots & & & & & & & & \vdots\\
X_1 & X_2 & \cdots & X_{n-1} & X_n & 0 & \cdots & \cdots & 0
\end{pmatrix},$$
hence $a'\hat C_n a = \frac{1}{n}\|\mathcal{X}_n' a\|^2 \ge 0$.
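To make this construction concrete, here is a small numerical sketch (Python with numpy; the function and variable names are our own, and checking positive definiteness via the eigenvalues is just an illustration):

```python
# A minimal numerical check of the construction above.
import numpy as np

def sample_acov(x, k):
    """hat c_n(k) = (1/n) * sum_{t=1}^{n-|k|} x_t x_{t+|k|} (mean assumed zero)."""
    n = len(x)
    k = abs(k)
    return np.sum(x[:n - k] * x[k:]) / n

rng = np.random.default_rng(0)
n = 20
x = rng.normal(size=n)

# Toeplitz matrix of sample autocovariances, (C)_{ij} = hat c_n(i - j)
C = np.array([[sample_acov(x, i - j) for j in range(n)] for i in range(n)])

# The shifted-data matrix: row i has (n-1-i) leading zeros, then x, then i trailing zeros
X = np.zeros((n, 2 * n - 1))
for i in range(n):
    X[i, n - 1 - i : 2 * n - 1 - i] = x

print(np.allclose(C, X @ X.T / n))            # True: hat C_n = (1/n) X_n X_n'
print(np.linalg.eigvalsh(C).min() >= -1e-12)  # all eigenvalues are (numerically) non-negative
```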
this will not be the case, and the variance will involve a large (indeed infinite) number of parameters
which are not straightforward to estimate. Later in this section we show that the variance
of the sample covariance can be extremely complicated; however, the variance
of the sample correlation (under linearity) is, in comparison, extremely simple. This is
known as Bartlett's formula (you may have come across Maurice Bartlett before; besides his
fundamental contributions to time series he is well known for proposing the famous Bartlett
correction). This example demonstrates how the assumption of linearity can really simplify
problems in time series, and also how we can circumvent certain problems which arise in an
estimator by making slight modifications to it, such as going from the covariance to the correlation.
The following theorem gives the asymptotic sampling properties of the covariance esti-
mator (4.1). The proof of the result can be found in Brockwell and Davis (1998), Chapter 8,
and in Fuller (1995), but it goes back to Bartlett (indeed it is called Bartlett's formula). We prove
the result in Section 4.1.4.
where $\sum_j |\psi_j| < \infty$ and $\{\varepsilon_t\}$ are iid random variables with $E(\varepsilon_t^4) < \infty$. Suppose we observe
$\{X_t : t = 1,\ldots,n\}$ and use (4.1) as an estimator of the covariance $c(k) = \mathrm{cov}(X_0, X_k)$.
Define $\hat\rho_n(r) = \hat c_n(r)/\hat c_n(0)$ as the sample correlation. Then for each $h \in \{1,\ldots,n\}$
$$\sqrt{n}\big(\hat{\boldsymbol\rho}_n(h) - \boldsymbol\rho(h)\big) \overset{D}{\to} N(0, W_h) \qquad (4.3)$$
where $\hat{\boldsymbol\rho}_n(h) = (\hat\rho_n(1),\ldots,\hat\rho_n(h))$, $\boldsymbol\rho(h) = (\rho(1),\ldots,\rho(h))$ and
$$(W_h)_{ij} = \sum_{k=-\infty}^{\infty}\{\rho(k+i)+\rho(k-i)-2\rho(i)\rho(k)\}\{\rho(k+j)+\rho(k-j)-2\rho(j)\rho(k)\}.$$
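To see what these asymptotic variances look like in practice, the following sketch (Python/numpy; the AR(1) correlation $\rho(k) = \phi^{|k|}$ and the truncation point of the infinite sum are our own illustrative choices) evaluates $W_h$ numerically and converts its diagonal into approximate standard errors for the sample correlations:

```python
# Sketch: evaluate the Bartlett matrix W_h for an AR(1) with rho(k) = phi^|k|.
import numpy as np

def bartlett_W(rho, h, K=1000):
    """(W_h)_{ij}, with the sum over k truncated to |k| <= K."""
    W = np.zeros((h, h))
    ks = np.arange(-K, K + 1)
    for i in range(1, h + 1):
        for j in range(1, h + 1):
            a = rho(ks + i) + rho(ks - i) - 2 * rho(i) * rho(ks)
            b = rho(ks + j) + rho(ks - j) - 2 * rho(j) * rho(ks)
            W[i - 1, j - 1] = np.sum(a * b)
    return W

phi = 0.5
rho = lambda k: phi ** np.abs(k)
W = bartlett_W(rho, h=5)

# Approximate standard error of hat rho_n(r) for sample size n: sqrt((W_h)_{rr} / n)
n = 200
print(np.sqrt(np.diag(W) / n))
```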
• Given a data set, we need to check whether there is dependence; if there is, we need to
analyse it in a different way.
• Suppose we fit a linear regression. We may need to check whether the residuals are actu-
ally uncorrelated, else the standard errors based on the assumption of uncorrelatedness
would be unreliable.
• We need to check whether a time series model is the appropriate model. To do this
we fit the model and estimate the residuals; if the residuals appear to be uncorrelated, it
would seem likely that the model is correct. If they are correlated, then the model is
inappropriate. For example, we may fit an AR(1) to the data and estimate the residuals $\varepsilon_t$;
if there is still correlation in the residuals, then the AR(1) was not the correct model,
since $X_t - \phi X_{t-1}$ still contains information about the other residuals.
Suppose $\{X_t\}$ are iid random variables, and we use (4.1) as an estimator of the autocovari-
ances. Recalling that if $\{X_t\}$ are iid then $\rho(k) = 0$ for $k \ne 0$, using this and (4.3) we see that the
asymptotic distribution of $\hat{\boldsymbol\rho}_n(h)$ in this case is
$$\sqrt{n}\big(\hat{\boldsymbol\rho}_n(h) - \boldsymbol\rho(h)\big) \overset{D}{\to} N(0, W_h)$$
where
$$(W_h)_{ij} = \begin{cases} 1 & i = j\\ 0 & i \ne j.\end{cases}$$
In other words $\sqrt{n}(\hat{\boldsymbol\rho}_n(h) - \boldsymbol\rho(h)) \overset{D}{\to} N(0, I)$. Hence the sample autocorrelations at different
lags are asymptotically uncorrelated. This allows us to easily construct confidence intervals for the auto-
correlations under the assumption that the observations are iid. If the vast majority of the sample
autocorrelations lie inside the confidence interval, there is not enough evidence to reject the hypothesis that the data
is a realisation of iid random variables (often called a white noise process). An example of
the empirical ACF and the CI constructed under the assumption of independence is given
in Figure 4.1. We see that the empirical autocorrelations of the realisation from iid random
variables all lie within the CI. The same cannot be said for the empirical correlations of a
dependent time series. Of course, doing the check by eye means that we may encounter
multiple testing problems, since even under independence, some sample correlations may
lie above the line. To counter this problem, we should construct a test statistic for testing
uncorrelatedness. Since under the null $\sqrt{n}(\hat{\boldsymbol\rho}_n(h) - \boldsymbol\rho(h)) \overset{D}{\to} N(0, I)$, one method of testing
is to use the squared correlations
$$S_h = n\sum_{r=1}^{h}|\hat\rho_n(r)|^2;$$
under the null it will asymptotically have a $\chi^2$-distribution with $h$ degrees of freedom, while under the alternative
it will be a non-central (generalised) chi-squared. The non-centrality is what makes us reject
the null if the alternative of correlatedness is true. This is known as a Portmanteau test.
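A minimal implementation of this test is sketched below (Python with numpy and scipy; this is the classical Box-Pierce form of the statistic, without the finite-sample Ljung-Box correction, and the simulated examples are our own):

```python
# Sketch of the Portmanteau (Box-Pierce) test: S_h = n * sum_{r=1}^h hat rho_n(r)^2,
# compared against a chi-squared distribution with h degrees of freedom.
import numpy as np
from scipy import stats

def sample_acf(x, h):
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    c0 = np.sum(x * x) / n
    return np.array([np.sum(x[:n - r] * x[r:]) / (n * c0) for r in range(1, h + 1)])

def portmanteau_test(x, h):
    n = len(x)
    S_h = n * np.sum(sample_acf(x, h) ** 2)
    p_value = stats.chi2.sf(S_h, df=h)   # survival function = 1 - CDF
    return S_h, p_value

rng = np.random.default_rng(1)
print(portmanteau_test(rng.normal(size=500), h=10))   # iid data: large p-value expected
ar = rng.normal(size=500)
for t in range(1, 500):                               # introduce AR(1)-type dependence
    ar[t] += 0.6 * ar[t - 1]
print(portmanteau_test(ar, h=10))                     # dependent data: tiny p-value expected
```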
Remark 4.1.2 (Long range dependence versus changes in the mean) We first note
that a process is said to have long range dependence if the covariances are not absolutely
summable, that is $\sum_k |c(k)| = \infty$. From a practical point of view, data is said to exhibit long range dependence
[Figure 4.1 appears here: two empirical ACF plots (Series ACF1, top; Series ACF2, bottom), ACF plotted against lags 0 to 20.]
Figure 4.1: The top plot is the empirical ACF of iid data and the lower plot is the
empirical ACF of a realisation from the AR(2) model defined in (2.19).
if the autocovariances do not decay very fast to zero as the lag increases. We now demonstrate
that one must be careful in the diagnosis of long range dependence, because a slow decay of
the autocovariance could also be due to a change in the mean which has not been corrected for. This
was shown in Bhattacharya et al. (1983), and applied to econometric data in Mikosch and
Stărică (2000) and Mikosch and Stărică (2003). A test for distinguishing between long range
dependence and change points is proposed in Berkes et al. (2006).
Suppose that $Y_t$ satisfies
$$Y_t = \mu_t + \varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid random variables and the mean $\mu_t$ depends on $t$. We observe $\{Y_t\}$ but
do not know that the mean is changing. We want to evaluate the autocovariance function, hence
we estimate the autocovariance at lag $k$ using
$$\hat c_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}(Y_t - \bar Y_n)(Y_{t+|k|} - \bar Y_n).$$
Observe that $\bar Y_n$ is not really estimating the mean but the average mean! If we plotted the
empirical ACF $\{\hat c_n(k)\}$ we would see that the covariances do not decay with the lag. However
the true ACF would be zero at all lags but lag zero. The reason the empirical ACF does not
decay to zero is because we have not corrected for the correct mean. Indeed it can be shown that
for large lags $\hat c_n(k) \approx \frac{1}{n^2}\sum_{s<t}(\mu_s - \mu_t)^2$. Hence, because we are not correcting for the time-varying mean,
this term remains in the autocovariance.
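This effect is easy to reproduce by simulation. The sketch below (Python/numpy; the mean shift of size 2 and the sample size are arbitrary choices of ours) generates independent noise with a single change in mean and computes the empirical autocorrelations, which stay large at long lags even though the true ACF is zero at all non-zero lags:

```python
# Sketch: iid noise with a mean shift looks like "long memory" in the empirical ACF.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
eps = rng.normal(size=n)
mu = np.where(np.arange(n) < n // 2, 0.0, 2.0)   # mean jumps from 0 to 2 halfway through
y = mu + eps

def sample_acf(x, h):
    x = np.asarray(x, dtype=float) - np.mean(x)  # centres at the *average* mean
    n = len(x)
    c0 = np.sum(x * x) / n
    return np.array([np.sum(x[:n - r] * x[r:]) / (n * c0) for r in range(1, h + 1)])

print(np.round(sample_acf(y, 20), 2))
# The autocorrelations stay close to 0.5 out to lag 20, despite the data being independent.
```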
It should be noted that if you study a realisation of a time series with a large amount of depen-
dence, it is unclear whether what you see is actually a stochastic time series or an underlying
trend. This makes disentangling a trend from data with a large amount of correlation ex-
tremely difficult.
To prove the result we first evaluate the variance of $\hat c_n(r)$. We start by obtaining an expression
under strict stationarity of the time series, and then simplify it under linearity. A simple
expansion shows that
$$\begin{aligned}
\mathrm{var}(\hat c_n(r)) &= \frac{1}{n^2}\sum_{t,\tau=1}^{n-|r|}\mathrm{cov}(X_t X_{t+r}, X_\tau X_{\tau+r})\\
&= \frac{1}{n^2}\sum_{t,\tau=1}^{n-|r|}\big[\mathrm{cov}(X_t, X_\tau)\mathrm{cov}(X_{t+r}, X_{\tau+r}) + \mathrm{cov}(X_t, X_{\tau+r})\mathrm{cov}(X_{t+r}, X_\tau) + \mathrm{cum}(X_t, X_{t+r}, X_\tau, X_{\tau+r})\big]\\
&= \frac{1}{n^2}\sum_{t,\tau=1}^{n-|r|}\big[c(t-\tau)^2 + c(t-\tau-r)c(t+r-\tau) + \kappa_4(r, \tau-t, \tau+r-t)\big]\\
&:= I + II + III,
\end{aligned}$$
where the above is due to the strict stationarity of the time series. We analyse the above term
by term. By changing variables and the limits of summation, under the assumption that $\sum_r |r\,c(r)| < \infty$
it can be shown that
$$I = \frac{1}{n}\sum_{k=-\infty}^{\infty} c(k)^2 + o\Big(\frac{1}{n}\Big),$$
similarly
$$II = \frac{1}{n}\sum_{k=-\infty}^{\infty} c(k)c(k-r) + o\Big(\frac{1}{n}\Big).$$
To deal with the fourth order cumulant term we use the condition that $\sum_{t_1,t_2,t_3}|t_i\,\kappa_4(t_1,t_2,t_3)| <
\infty$ (which is not as strong a condition as you may think: if the time series is linear, mixing, etc., then
under certain conditions on these rates we obtain this bound). Using this gives
$$III = \frac{1}{n}\sum_{k}\kappa_4(r, k, k+r) + o\Big(\frac{1}{n}\Big).$$
Therefore altogether we have
$$\mathrm{var}(\hat c_n(r)) = \frac{1}{n}\sum_{k=-\infty}^{\infty} c(k)^2 + \frac{1}{n}\sum_{k=-\infty}^{\infty} c(k)c(k-r) + \frac{1}{n}\sum_{k}\kappa_4(r, k, k+r) + o\Big(\frac{1}{n}\Big).$$
We observe that the variance of the covariance estimator contains both covariance and fourth order
cumulant terms. Thus if we need to estimate them, for example to construct confidence
intervals, this can be extremely difficult. However, under linearity the above fourth order
cumulant term has a simpler form.
where $\{\varepsilon_t\}$ are iid, $E(\varepsilon_t) = 0$, $\mathrm{var}(\varepsilon_t) = 1$ and $\kappa_4 = \mathrm{cum}_4(\varepsilon_t)$. Then the third (fourth order
cumulant) term III reduces to
$$III = \frac{1}{n}\sum_{k=-\infty}^{\infty}\mathrm{cum}\Big(\sum_{j_1=-\infty}^{\infty}\psi_{j_1}\varepsilon_{-j_1},\; \sum_{j_2=-\infty}^{\infty}\psi_{j_2}\varepsilon_{r_1-j_2},\; \sum_{j_3=-\infty}^{\infty}\psi_{j_3}\varepsilon_{k-j_3},\; \sum_{j_4=-\infty}^{\infty}\psi_{j_4}\varepsilon_{k+r_2-j_4}\Big) + o\Big(\frac{1}{n}\Big).$$
Noting that if $\{\varepsilon_t\}$ are iid, then the fourth order cumulant will be zero unless all four of its
arguments are the same, the above reduces to
$$\begin{aligned}
III &= \frac{\kappa_4}{n}\sum_{k=-\infty}^{\infty}\sum_{j=-\infty}^{\infty}\psi_j\psi_{j-r_1}\psi_{j-k}\psi_{j-r_2-k} + o\Big(\frac{1}{n}\Big)\\
&= \frac{\kappa_4}{n}\Big(\sum_{j=-\infty}^{\infty}\psi_j\psi_{j-r_1}\Big)\Big(\sum_{i}\psi_i\psi_{i-r_2}\Big) + o\Big(\frac{1}{n}\Big) = \frac{1}{n}\kappa_4 c(r_1)c(r_2) + o\Big(\frac{1}{n}\Big),
\end{aligned}$$
Thus in the case of linearity our expression for the variance is much nicer, and the only difficult
parameter to estimate is $\kappa_4$, which can be done by various means (see later in this course).
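As a quick sanity check of this expression, the following Monte Carlo sketch (Python/numpy; the Gaussian AR(1) model, for which $\kappa_4 = 0$, and the truncation of the infinite sums are our own choices) compares the simulated variance of $\hat c_n(r)$ with $\frac{1}{n}\sum_k c(k)^2 + \frac{1}{n}\sum_k c(k)c(k-r)$:

```python
# Monte Carlo check of var(hat c_n(r)) for a Gaussian AR(1) (so kappa_4 = 0).
import numpy as np

rng = np.random.default_rng(3)
phi, n, r, reps = 0.5, 400, 2, 2000

def acov_ar1(k):
    return phi ** np.abs(k) / (1 - phi ** 2)   # true autocovariance of the AR(1)

def chat(x, r):
    return np.sum(x[:len(x) - r] * x[r:]) / len(x)   # no mean correction: mean known to be 0

sims = []
for _ in range(reps):
    eps = rng.normal(size=n + 200)
    x = np.zeros(n + 200)
    for t in range(1, n + 200):
        x[t] = phi * x[t - 1] + eps[t]
    sims.append(chat(x[200:], r))                    # discard burn-in

ks = np.arange(-200, 201)
theory = (np.sum(acov_ar1(ks) ** 2) + np.sum(acov_ar1(ks) * acov_ar1(ks - r))) / n
print(np.var(sims), theory)                          # the two should be close
```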
The variance of the correlation under linearity
There is a surprising trick which completely removes the cumulant term: consider the
correlation rather than the covariance. The sample correlation is
$$\hat\rho_n(r) = \frac{\hat c_n(r)}{\hat c_n(0)}.$$
Lemma 4.1.1 (Bartlett's formula) Suppose $\{X_t\}$ is a linear time series. Then the vari-
ance of the asymptotic distribution of $\sqrt{n}(\hat\rho_n(r) - \rho(r))$ is
$$\sum_{k=-\infty}^{\infty}\{\rho(k+r)+\rho(k-r)-2\rho(r)\rho(k)\}^2.$$
PROOF. Expanding the ratio $\hat c_n(r)/\hat c_n(0)$ about $(c(r), c(0))$ gives
$$\hat\rho_n(r) - \rho(r) = \frac{1}{c(0)}\big(\hat c_n(r) - c(r)\big) - \frac{c(r)}{c(0)^2}\big(\hat c_n(0) - c(0)\big) + O_p\Big(\frac{1}{n}\Big).$$
Therefore the variance of $\hat\rho_n(r)$ is, to leading order,
$$\frac{1}{c(0)^2}\mathrm{var}(\hat c_n(r)) - 2\frac{c(r)}{c(0)^3}\mathrm{cov}(\hat c_n(r), \hat c_n(0)) + \frac{c(r)^2}{c(0)^4}\mathrm{var}(\hat c_n(0)) + O\Big(\frac{1}{n^2}\Big).$$
Now substituting (4.4) in the above gives the result, and importantly you will see that the
fourth order cumulants cancel. □
The proof of Theorem 4.1.1 is identical to the above (though we still need to show
asymptotic normality).
Remark 4.1.3 The above would appear to be a nice trick, but there are two major factors
that lead to the cancellation of the fourth order cumulant term:
• Linearity.
• The ratio.
Indeed this is not a chance result; in fact there is a logical reason why this result is true (and
it is true for many statistics which have a similar form, commonly called ratio statistics). It
is most easily explained in the Fourier domain. If the estimator can be written as
$$\frac{\frac{1}{n}\sum_{k=1}^{n}\phi(\omega_k)I_n(\omega_k)}{\frac{1}{n}\sum_{k=1}^{n}I_n(\omega_k)},$$
where $I_n(\omega)$ is the periodogram and $\{X_t\}$ is a linear time series, then we will show later
that the asymptotic distribution of the above has a variance which is in terms of the
covariances only, not higher order cumulants.
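To make the form of such a ratio statistic concrete, the sketch below (Python/numpy) computes the periodogram with the FFT and evaluates the ratio with $\phi(\omega) = \cos(r\omega)$; this particular choice of $\phi$ is ours, and (up to an end effect) it reproduces the sample autocorrelation at lag $r$:

```python
# Sketch of a ratio statistic  sum_k phi(omega_k) I_n(omega_k) / sum_k I_n(omega_k),
# with the periodogram I_n(omega_k) = (1/n) |sum_t X_t exp(-i t omega_k)|^2 computed by FFT.
import numpy as np

rng = np.random.default_rng(4)
n = 512
x = rng.normal(size=n)
x = x - x.mean()

omega = 2 * np.pi * np.arange(n) / n          # Fourier frequencies omega_k = 2 pi k / n
I_n = np.abs(np.fft.fft(x)) ** 2 / n          # periodogram

r = 1
ratio = np.sum(np.cos(r * omega) * I_n) / np.sum(I_n)

# Compare with the direct sample autocorrelation at lag r
rho_hat = np.sum(x[:n - r] * x[r:]) / np.sum(x * x)
print(ratio, rho_hat)                          # close; they differ only by a circular end effect
```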
where $E(\varepsilon_t) = 0$, $\mathrm{var}(\varepsilon_t) = \sigma^2$ and the roots of the characteristic polynomial $1-\sum_{j=1}^{p}\phi_j z^j$
lie outside the unit circle. Our aim in this section is to construct estimators of the AR
parameters $\{\phi_j\}$. We will show that in the case that $\{X_t\}$ has an AR(p) representation
the estimation is relatively straightforward, and the estimation methods all have properties
which are asymptotically equivalent to those of the Gaussian maximum likelihood estimator.
This estimation scheme stems from the following observation. Suppose the
AR(p) time series $\{X_t\}$ is causal (that is, the roots of the characteristic polynomial lie outside
the unit circle, hence it has an MA(∞) representation). Then we can multiply $X_t$ by $X_{t-i}$
for $1 \le i \le p$; since the process is causal, $\varepsilon_t$ and $X_{t-i}$ are uncorrelated for $i > 0$. Therefore taking expectations we have
for all $i > 0$
$$E(X_t X_{t-i}) = \sum_{j=1}^{p}\phi_j E(X_{t-j}X_{t-i}) \quad\Rightarrow\quad c(i) = \sum_{j=1}^{p}\phi_j c(i-j). \qquad (4.5)$$
Recall these are the Yule-Walker equations we considered in Section 2.5.3. Putting the cases
$1 \le i \le p$ together we can write the above as
$$\boldsymbol\gamma_p = \Gamma_p\boldsymbol\phi_p, \qquad (4.6)$$
4.2.1 The Yule-Walker estimator
The Yule-Walker equations inspire the method of moments estimator often called the Yule-
Walker estimator. We use (4.6) as the basis of the estimator. It is clear that γ̂ p and Γ̂p are
estimators of γ p and Γp where (Γ̂p )i,j = ĉn (i − j) and (γ̂ p )i = ĉn (i). Therefore we can use
$$\hat{\boldsymbol\phi}_p = \hat\Gamma_p^{-1}\hat{\boldsymbol\gamma}_p, \qquad (4.7)$$
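A direct implementation of (4.7) is sketched below (Python with numpy and scipy; the simulated AR(2) coefficients are arbitrary choices of ours):

```python
# Sketch of the Yule-Walker estimator: hat phi_p = hat Gamma_p^{-1} hat gamma_p,
# with (hat Gamma_p)_{ij} = hat c_n(i - j) and (hat gamma_p)_i = hat c_n(i).
import numpy as np
from scipy.linalg import toeplitz

def sample_acov(x, max_lag):
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    return np.array([np.sum(x[:n - k] * x[k:]) / n for k in range(max_lag + 1)])

def yule_walker(x, p):
    c = sample_acov(x, p)
    Gamma = toeplitz(c[:p])          # Toeplitz matrix of sample autocovariances
    gamma = c[1:p + 1]
    return np.linalg.solve(Gamma, gamma)

# Simulated AR(2): X_t = 1.5 X_{t-1} - 0.75 X_{t-2} + eps_t
rng = np.random.default_rng(5)
eps = rng.normal(size=2200)
x = np.zeros(2200)
for t in range(2, 2200):
    x[t] = 1.5 * x[t - 1] - 0.75 * x[t - 2] + eps[t]
print(yule_walker(x[200:], p=2))     # should be close to (1.5, -0.75)
```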
doing the estimation. They can have a different distribution; the only difference is that the
estimator may be less efficient (it will not attain the Cramér-Rao lower bound).
Suppose we observe $\{X_t; t = 1,\ldots,n\}$ where $X_t$ are observations from an AR(1) process.
To construct the MLE, we use the fact that the joint distribution of $\{X_t\}$ is the product of the
conditional distributions. Hence we need an expression for the conditional distribution (in
terms of the densities). Let $F_\varepsilon$ and $f_\varepsilon$ be the distribution function and the density function of $\varepsilon_t$
respectively. We first note that the AR(p) process is p-Markovian, that is
$$P(X_t \le x | X_{t-1}, X_{t-2},\ldots) = P(X_t \le x | X_{t-1},\ldots,X_{t-p}) \;\Rightarrow\; f_a(X_t|X_{t-1},X_{t-2},\ldots) = f_a(X_t|X_{t-1},\ldots,X_{t-p}), \qquad (4.8)$$
where $f_a$ is the conditional density of $X_t$ given the past, and the distribution is
derived as if $a$ were the true AR(p) parameter vector.
Remark 4.2.1 To understand why (4.8) is true consider the simple case that p = 1 (AR(1)).
Studying the conditional probability gives
Usually we ignore the initial distribution log fa (X1 , . . . , Xp ) and maximise the conditional
likelihood to obtain the estimator. In the case that the sample size is large ($n \gg p$), the
contribution of log fa (X1 , . . . , Xp ) is minimal and the conditional likelihood and likelihood
are asymptotically equivalent.
We note that in the case that $f_\varepsilon$ is Gaussian, the conditional log-likelihood is $-nL_n(a)$, where
$$L_n(a) = \log\sigma^2 + \frac{1}{n\sigma^2}\sum_{t=p+1}^{n}\Big(X_t - \sum_{j=1}^{p} a_j X_{t-j}\Big)^2.$$
Therefore the estimator of the AR(p) parameters is $\tilde{\boldsymbol\phi}_p = \arg\min_a L_n(a)$. It is clear that $\tilde{\boldsymbol\phi}_p$
is the least squares estimator and can be explicitly obtained using
$$\tilde{\boldsymbol\phi}_p = \tilde\Gamma_p^{-1}\tilde{\boldsymbol\gamma}_p,$$
where $(\tilde\Gamma_p)_{i,j} = \frac{1}{n-p}\sum_{t=p+1}^{n} X_{t-i}X_{t-j}$ and $(\tilde{\boldsymbol\gamma}_p)_i = \frac{1}{n-p}\sum_{t=p+1}^{n} X_t X_{t-i}$.
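A sketch of this least squares estimator, computed as a lagged regression rather than by forming $\tilde\Gamma_p$ explicitly, is given below (Python/numpy; the simulated model is again an arbitrary choice of ours):

```python
# Sketch of the conditional least squares estimator of the AR(p) parameters,
# computed by regressing X_t on (X_{t-1}, ..., X_{t-p}).
import numpy as np

def ar_least_squares(x, p):
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Design matrix: one row per t, containing the p lagged values of X_t
    Z = np.column_stack([x[p - j:n - j] for j in range(1, p + 1)])
    y = x[p:]
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    sigma2 = np.mean((y - Z @ coef) ** 2)     # residual variance estimate
    return coef, sigma2

rng = np.random.default_rng(6)
eps = rng.normal(size=2200)
x = np.zeros(2200)
for t in range(2, 2200):
    x[t] = 1.5 * x[t - 1] - 0.75 * x[t - 2] + eps[t]
coef, sigma2 = ar_least_squares(x[200:], p=2)
print(coef, sigma2)                            # close to (1.5, -0.75) and 1
```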
4.3.1 The Hannan and Rissanen AR(∞) expansion method
We first describe an easy method to estimate the parameters of an ARMA process. These
estimators may not be necessarily ‘efficient’ (we define this term later) but they have an
explicit form and can be easily obtained. Therefore they are a good starting point, and can
be used as the initial value when using the Gaussian maximum likelihood to estimate the
parameters (as described below). The method was first proposed in Hannan and Rissanen
(1982) and An et al. (1982), and we describe it below. It is worth bearing in mind that
currently the ‘large p small n problem’ is a hot topic. These are generally regression problems
where the sample size n is quite small but the number of regressors p is quite large (usually
model selection is of importance in this context). The method proposed by Hannan and Rissanen involves
expanding the ARMA process (assuming invertibility) as an AR(∞) process and estimating
the parameters of the AR(∞) process. In some sense this can be considered as a regression
problem with an infinite number of regressors. Hence there are some parallels between the
estimation described below and the ‘large p, small n problem’.
As we mentioned in Lemma 2.4.1, if an ARMA process is invertible it can be represented
as
$$X_t = \sum_{j=1}^{\infty} b_j X_{t-j} + \varepsilon_t. \qquad (4.10)$$
The idea behind Hannan’s method is to estimate the parameters {bj }, then estimate the
innovations εt , and use the estimated innovations to construct a multiple linear regression
estimator of the ARMA parameters $\{\theta_i\}$ and $\{\phi_j\}$. Of course in practice we cannot estimate
all parameters {bj } as there are an infinite number of them. So instead we do a type of sieve
estimation where we only estimate a finite number and let the number of parameters to be
estimated grow as the sample size increases. We describe the estimation steps below:
(i) Suppose we observe $\{X_t\}_{t=1}^{n}$. Recalling (4.10), we will estimate the first $p_n$ parameters
$\{b_j\}_{j=1}^{p_n}$. We will suppose that $p_n \to \infty$ as $n \to \infty$ and $p_n \ll n$ (we will state the rate below).
We use least squares to estimate $\{b_j\}_{j=1}^{p_n}$ and define
$$\hat{\mathbf{b}}_n = (\hat b_{1,n},\ldots,\hat b_{p_n,n})' = \hat R_n^{-1}\hat r_n,$$
where, writing $\underline{X}_{t-1} = (X_{t-1},\ldots,X_{t-p_n})'$,
$$\hat R_n = \sum_{t=p_n+1}^{n}\underline{X}_{t-1}\underline{X}_{t-1}' \qquad \hat r_n = \sum_{t=p_n+1}^{n} X_t\underline{X}_{t-1}.$$
(ii) Having estimated the first $p_n$ coefficients $\{b_j\}_{j=1}^{p_n}$, we estimate the residuals with
$$\tilde\varepsilon_t = X_t - \sum_{j=1}^{p_n}\hat b_{j,n} X_{t-j}.$$
(iii) Now use the estimated residuals in place of the unobserved innovations and estimate the ARMA
parameters by least squares, regressing $X_t$ on $\tilde Y_t = (X_{t-1},\ldots,X_{t-p},\tilde\varepsilon_{t-1},\ldots,\tilde\varepsilon_{t-q})'$:
$$(\tilde{\boldsymbol\phi}_n, \tilde{\boldsymbol\theta}_n) = \tilde R_n^{-1}\tilde s_n$$
where
$$\tilde R_n = \frac{1}{n}\sum_{t=\max(p,q)}^{n}\tilde Y_t\tilde Y_t' \quad\text{and}\quad \tilde s_n = \frac{1}{n}\sum_{t=\max(p,q)}^{n}\tilde Y_t X_t.$$
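A sketch of the whole procedure for an ARMA(1,1) is given below (Python/numpy; the choice $p_n = 10$, the simulated parameters and the function names are ours, and this is the plain three-step scheme without any later refinements):

```python
# Sketch of the Hannan-Rissanen procedure for an ARMA(1,1):
#   X_t = phi X_{t-1} + eps_t + theta eps_{t-1}.
import numpy as np

def fit_long_ar(x, p):
    """Step (i): least squares fit of a long AR(p) (the AR(infinity) approximation)."""
    n = len(x)
    Z = np.column_stack([x[p - j:n - j] for j in range(1, p + 1)])
    b, *_ = np.linalg.lstsq(Z, x[p:], rcond=None)
    return b

def hannan_rissanen_arma11(x, p_n=10):
    x = np.asarray(x, dtype=float)
    n = len(x)
    b = fit_long_ar(x, p_n)
    # Step (ii): estimated innovations eps~_t = X_t - sum_j b_j X_{t-j} (set to 0 for t <= p_n)
    Z = np.column_stack([x[p_n - j:n - j] for j in range(1, p_n + 1)])
    eps_tilde = np.concatenate([np.zeros(p_n), x[p_n:] - Z @ b])
    # Step (iii): regress X_t on (X_{t-1}, eps~_{t-1}) to estimate (phi, theta)
    Y = np.column_stack([x[p_n:-1], eps_tilde[p_n:-1]])
    coef, *_ = np.linalg.lstsq(Y, x[p_n + 1:], rcond=None)
    return coef   # (phi_tilde, theta_tilde)

# Simulated ARMA(1,1) with phi = 0.7, theta = 0.4
rng = np.random.default_rng(7)
N = 5200
eps = rng.normal(size=N)
x = np.zeros(N)
for t in range(1, N):
    x[t] = 0.7 * x[t - 1] + eps[t] + 0.4 * eps[t - 1]
print(hannan_rissanen_arma11(x[200:]))   # should be roughly (0.7, 0.4)
```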
be the best linear predictor of $X_{t+1}$ given $X_t,\ldots,X_1$ and the ARMA parameters $\phi$ and $\theta$,
which are used to calculate the covariances in the prediction. Let $r_{t+1}(\sigma,\phi,\theta)$ be the one-
step ahead mean squared error $E(X_{t+1} - X_{t+1|t}^{(\phi,\theta)})^2$. By using the Cholesky decomposition it can
be shown that
$$L_n(\phi,\theta,\sigma) = \frac{1}{n}\sum_{t=1}^{n-1}\log r_{t+1}(\sigma,\phi,\theta) + \frac{1}{n}\sum_{t=1}^{n-1}\frac{(X_{t+1} - X_{t+1|t}^{(\phi,\theta)})^2}{r_{t+1}(\sigma,\phi,\theta)}.$$
We see that we have avoided inverting the matrix $\Gamma(\phi,\theta,\sigma)$. The GMLE is the parameter
$(\hat{\boldsymbol\phi}_n, \hat{\boldsymbol\theta}_n)$ which minimises $L_n(\phi,\theta,\sigma)$. We note that the one-step ahead predictor $X_{t+1|t}^{(\phi,\theta)}$ can
be obtained using the Durbin-Levinson algorithm.
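To make this concrete, the sketch below (Python/numpy; it is our own illustration, and it takes as input the autocovariance sequence implied by the candidate parameters, with $\sigma^2$ folded into the autocovariances) runs the Durbin-Levinson recursion to produce the one-step predictors $X_{t+1|t}$ and their mean squared errors $r_{t+1}$, and then evaluates $L_n(\phi,\theta,\sigma)$:

```python
# Sketch: Durbin-Levinson recursion for the one-step predictors and their MSEs,
# plugged into the Gaussian likelihood L_n.  `acov` holds c(0), ..., c(n-1) implied
# by the candidate parameters (sigma^2 included), however it has been computed.
import numpy as np

def one_step_predictors(x, acov):
    """Return the one-step predictors X_{t+1|t} (t = 1, ..., n-1) and their MSEs r_{t+1}."""
    n = len(x)
    phi = np.array([acov[1] / acov[0]])        # phi_{1,1}
    v = acov[0] * (1 - phi[0] ** 2)            # r_2
    preds, mses = [phi[0] * x[0]], [v]         # X_{2|1} and r_2
    for t in range(2, n):
        k = (acov[t] - np.dot(phi, acov[t - 1:0:-1])) / v   # partial correlation phi_{t,t}
        phi = np.concatenate([phi - k * phi[::-1], [k]])    # phi_{t,1}, ..., phi_{t,t}
        v = v * (1 - k ** 2)                                # r_{t+1}
        preds.append(np.dot(phi, x[t - 1::-1]))             # X_{t+1|t} = sum_j phi_{t,j} X_{t+1-j}
        mses.append(v)
    return np.array(preds), np.array(mses)

def L_n(x, acov):
    preds, r = one_step_predictors(x, acov)
    n = len(x)
    return np.sum(np.log(r)) / n + np.sum((x[1:] - preds) ** 2 / r) / n

# Example: evaluate L_n at AR(1) candidate parameters, where c(k) = sigma^2 phi^|k| / (1 - phi^2)
rng = np.random.default_rng(8)
eps = rng.normal(size=300)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.5 * x[t - 1] + eps[t]
phi0, sigma0 = 0.5, 1.0
acov = sigma0 ** 2 * phi0 ** np.abs(np.arange(300)) / (1 - phi0 ** 2)
print(L_n(x, acov))
```

In practice this evaluation would be placed inside a numerical optimiser over $(\phi,\theta,\sigma)$, with the autocovariances recomputed at each candidate parameter value.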
It is possible to obtain an approximation of $L_n(\phi,\theta,\sigma)$ which is simple to evaluate.
However, this approximation only really makes sense when the sample size $n$ is large. It is,
however, useful when obtaining the asymptotic sampling properties of the GMLE.
To motivate the approximation, consider the one-step ahead prediction error considered
in Section 3.3. We have shown in Proposition 3.3.1 that for large $t$, $\tilde X_{t+1|t,\ldots,0} \approx X_{t+1|t}$ $(=
P_{X_t,X_{t-1},\ldots}(X_{t+1}))$ and $\sigma^2 \approx E(X_{t+1} - X_{t+1|t})^2$. Now define
$$\tilde X_{t+1|t,\ldots,0}^{(\phi,\theta)} = \sum_{j=1}^{t} b_j(\phi,\theta) X_{t+1-j}. \qquad (4.13)$$
We now replace, in $L_n(\phi,\theta,\sigma)$, $X_{t+1|t}^{(\phi,\theta)}$ with $\tilde X_{t+1|t,\ldots,0}^{(\phi,\theta)}$ and $r_{t+1}(\sigma,\phi,\theta)$ with $\sigma^2$ to obtain
$$\tilde L_n(\phi,\theta,\sigma) = \log\sigma^2 + \frac{1}{n\sigma^2}\sum_{t=1}^{n-1}\big(X_{t+1} - \tilde X_{t+1|t,\ldots,0}^{(\phi,\theta)}\big)^2.$$
We show in Section 8 that $L_n(\phi,\theta,\sigma)$ and $\tilde L_n(\phi,\theta,\sigma)$ are asymptotically equivalent.
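For a concrete instance, the sketch below (Python/numpy) evaluates $\tilde L_n$ for an ARMA(1,1), using the AR(∞) coefficients $b_j(\phi,\theta) = (\phi+\theta)(-\theta)^{j-1}$ obtained by inverting the MA polynomial; treat this, and the simulated example, as our own illustration rather than part of the notes:

```python
# Sketch: the approximate likelihood tilde L_n for an ARMA(1,1),
#   X_t = phi X_{t-1} + eps_t + theta eps_{t-1},
# whose AR(infinity) coefficients are b_j(phi, theta) = (phi + theta) * (-theta)^(j-1), j >= 1.
import numpy as np

def approx_likelihood_arma11(params, x):
    phi, theta, sigma2 = params
    n = len(x)
    L = np.log(sigma2)
    for t in range(1, n):                        # predict X_{t+1} (0-indexed x[t]) from x[0..t-1]
        j = np.arange(1, t + 1)
        b = (phi + theta) * (-theta) ** (j - 1)  # b_1, ..., b_t
        pred = np.dot(b, x[t - 1::-1])           # sum_j b_j X_{t+1-j}
        L += (x[t] - pred) ** 2 / (n * sigma2)
    return L

# Example: evaluate at the true parameters of a simulated ARMA(1,1)
rng = np.random.default_rng(9)
n = 500
eps = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + eps[t] + 0.4 * eps[t - 1]
print(approx_likelihood_arma11((0.7, 0.4, 1.0), x))
# In practice tilde L_n would be minimised over (phi, theta, sigma^2), e.g. with scipy.optimize.
```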