Biometrika (2005), 92, 3, pp. 519–528
© 2005 Biometrika Trust
Printed in Great Britain
A note on composite likelihood inference and model selection
BY CRISTIANO VARIN
Department of Statistics, University of Padova, via C. Battisti 241, I-35121 Padova, Italy
[email protected]

AND PAOLO VIDONI

Department of Statistics, University of Udine, via Treppo 18, I-33100 Udine, Italy

[email protected]
SUMMARY
A composite likelihood consists of a combination of valid likelihood objects, usually
related to small subsets of data. The merit of composite likelihood is to reduce the com-
putational complexity so that it is possible to deal with large datasets and very complex
models, even when the use of standard likelihood or Bayesian methods is not feasible. In
this paper, we aim to suggest an integrated, general approach to inference and model
selection using composite likelihood methods. In particular, we introduce an information
criterion for model selection based on composite likelihood. We also describe applications
to the modelling of time series of counts through dynamic generalised linear models and to
the analysis of the well-known Old Faithful geyser dataset.
Some key words: Dynamic generalised linear model; Hidden Markov model; Information criterion; Old Faithful
geyser data; Pairwise likelihood.
1. INTRODUCTION
In a number of applications, the presence of large correlated datasets or the specification
of highly structured statistical models makes infeasible the computation of the likelihood
function. An alternative to ordinary likelihood methods or Bayesian strategies is to adopt
simpler pseudolikelihoods, like those belonging to the composite likelihood class (Lindsay,
1988). A composite likelihood consists of a combination of valid likelihood objects, usually
related to small subsets of data. It has good theoretical properties and it behaves well
in many applications concerning, for example, spatial statistics (Hjort & Omre, 1994;
Heagerty & Lele, 1998; Varin et al., 2005), multivariate survival analysis (Parner, 2001),
generalised linear mixed models (Renard et al., 2004), frailty models (Henderson &
Shimakura, 2003) and genetics (Fearnhead & Donnelly, 2002). In this paper, we develop
and justify an integrated, general approach for inference and model selection, using com-
posite likelihood methods. In particular, we focus on a new information criterion for
model selection, which is the composite likelihood counterpart of Takeuchi's information criterion (Takeuchi, 1976; Shibata, 1989, p. 222; Burnham & Anderson, 2002).
2. INFERENCE AND MODEL SELECTION VIA COMPOSITE LIKELIHOOD
The term composite likelihood (Lindsay, 1988) denotes a rich class of pseudolikelihoods
based on likelihood-type objects.
DEFINITION 1. Let $\{f(y; \theta),\, y \in \mathcal{Y},\, \theta \in \Theta\}$ be a parametric statistical model, with $\mathcal{Y} \subseteq \mathbb{R}^n$, $\Theta \subseteq \mathbb{R}^d$, $n \geq 1$ and $d \geq 1$. Consider a set of events $\{A_i : A_i \in \mathcal{F},\, i \in I\}$, where $I \subseteq \mathbb{N}$ and $\mathcal{F}$ is some sigma algebra on $\mathcal{Y}$. A composite likelihood is defined as
$$L_C(\theta; y) = \prod_{i \in I} f(y \in A_i; \theta)^{w_i},$$
where $f(y \in A_i; \theta) = f(\{y_j \in \mathcal{Y} : y_j \in A_i\}; \theta)$, with $y = (y_1, \ldots, y_n)$, while $\{w_i,\, i \in I\}$ is a set of suitable weights. The associated composite loglikelihood is $\ell_C(\theta; y) = \log L_C(\theta; y)$.
With the full likelihood $L(\theta; y)$ viewed as a special case, we can group composite
likelihoods into two general classes. The first includes ‘subsetting methods’ and it
contains, for example, the pairwise likelihood (Cox & Reid, 2004), which is based on
marginal events related to pairs of observations. Analogously, we may define the tripletwise
likelihood and so on. These composite likelihoods will be considered in the applications
of § 4. The second class is based on ‘omission methods’, since composite likelihoods
are obtained by omitting components of the full likelihood. The mth-order likelihood
(Azzalini, 1983), $f(y_1) \prod_{i=2}^{n} f(y_i \mid y_{i-m}^{i-1})$, with $y_{i-m}^{i-1} = (y_{i-m}, \ldots, y_{i-1})$ and $m \in \{1, \ldots, n-1\}$
fixed, belongs to this class. Other examples are the Besag (1974) pseudolikelihood and
the partial likelihood (Cox, 1975).
Since each component of $L_C(\theta; y)$ is a likelihood object, it is almost immediate to state that the estimating equation $\nabla \ell_C(\theta; y) = 0$ is unbiased, under the usual regularity conditions. The associated maximum composite likelihood estimator $\hat\theta_{MCL} = \hat\theta_{MCL}(Y)$ is consistent and asymptotically normally distributed, with mean $\theta$ and variance matrix $H(\theta)^{-1} J(\theta) H(\theta)^{-T}$. Here, $J(\theta) = \mathrm{var}_f\{\nabla \ell_C(\theta; Y)\}$ and $H(\theta) = E_f\{\nabla^2 \ell_C(\theta; Y)\}$, where the expectations are with respect to $f(y; \theta)$. Although a deep analysis of efficiency issues is still lacking, some useful results may be found in Lindsay (1988) and Cox & Reid (2004).
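As a concrete illustration of this machinery (our sketch, not the authors' code), consider a directly observed stationary Gaussian AR(1) process and the pairwise likelihood built from consecutive pairs with unit weights $w_i$; the bivariate marginals are exact, so $\ell_C$ can be written down and maximised directly. The model and all tuning values below are illustrative assumptions.

```python
# A minimal sketch (ours): maximum pairwise likelihood for a directly
# observed stationary Gaussian AR(1) process, consecutive pairs, w_i = 1.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

def simulate_ar1(n, lam, sigma):
    x = np.empty(n)
    x[0] = rng.normal(0.0, sigma / np.sqrt(1 - lam**2))  # stationary start
    for i in range(1, n):
        x[i] = lam * x[i - 1] + rng.normal(0.0, sigma)
    return x

def pairwise_loglik(theta, y):
    """l_P: sum of the bivariate Gaussian logdensities of (y_{i-1}, y_i)."""
    lam, sigma = theta
    if not (-1 < lam < 1) or sigma <= 0:
        return -np.inf
    v = sigma**2 / (1 - lam**2)                 # stationary variance
    cov = np.array([[v, lam * v], [lam * v, v]])
    pairs = np.column_stack([y[:-1], y[1:]])
    return multivariate_normal(mean=[0, 0], cov=cov).logpdf(pairs).sum()

y = simulate_ar1(500, lam=0.35, sigma=1.0)
fit = minimize(lambda th: -pairwise_loglik(th, y), x0=[0.0, 0.5],
               method="Nelder-Mead")
print("maximum pairwise likelihood estimate:", fit.x)  # near (0.35, 1.0)
```

The sandwich matrices $H(\theta)$ and $J(\theta)$ for such an estimator are taken up in § 3.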
In this paper, we aim to emphasise the use of composite likelihood methods both for
making inference and for model selection purposes. We shall introduce a predictive model
selection procedure based on the following generalisation of the Kullback–Leibler
information.
DEFINITION 2. Given a random variable $Z = (Z_1, \ldots, Z_n)$, with density $g(z)$, the composite Kullback–Leibler divergence of a density $h(z)$ relative to $g(z)$ is
$$I_C(g, h) = E_{g(z)}\left[\log \frac{L_C(g; Z)}{L_C(h; Z)}\right] = \sum_{i \in I} E_{g(z)}\{\log g(Z \in A_i) - \log h(Z \in A_i)\}\, w_i,$$
where $L_C(g; Z) = \prod_{i \in I} g(Z \in A_i)^{w_i}$ and $L_C(h; Z) = \prod_{i \in I} h(Z \in A_i)^{w_i}$.
Note that $I_C(g, h)$ is defined as the expectation, with respect to the true density $g(z)$,
of the difference between the composite loglikelihoods associated with g(z) and h(z),
respectively. Moreover, it is a linear combination of the ordinary Kullback–Leibler
divergences, corresponding to the likelihood objects forming the composite likelihood
function.
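As an illustration of Definition 2 (our example, not from the paper), when $g$ and $h$ are zero-mean multivariate normal densities and the events $A_i$ are the coordinate pairs with unit weights, each term of $I_C(g, h)$ is an ordinary bivariate Gaussian Kullback–Leibler divergence, available in closed form:

```python
# A small sketch (ours): composite Kullback-Leibler divergence between two
# zero-mean trivariate normal densities, events = coordinate pairs, w_i = 1.
import numpy as np
from itertools import combinations

def gaussian_kl(S0, S1):
    """KL(N(0, S0) || N(0, S1)) for d-dimensional covariance matrices."""
    d = S0.shape[0]
    inv1 = np.linalg.inv(S1)
    return 0.5 * (np.trace(inv1 @ S0) - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def composite_kl(Sg, Sh):
    """I_C(g, h): sum over coordinate pairs of the marginal KL divergences."""
    d = Sg.shape[0]
    return sum(gaussian_kl(Sg[np.ix_(p, p)], Sh[np.ix_(p, p)])
               for p in combinations(range(d), 2))

Sg = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.5], [0.2, 0.5, 1.0]])
Sh = np.eye(3)                   # h wrongly assumes independence
print(composite_kl(Sg, Sh))      # positive, and zero when Sh equals Sg
```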
Consider the sample $Y = (Y_1, \ldots, Y_n)$ and a parametric statistical model specified by the family of density functions $\{f(y; \theta),\, y \in \mathcal{Y},\, \theta \in \Theta\}$, with respect to a common dominating
measure. There might be several plausible statistical models for Y , which may or may not
contain the true g(y). We would like to choose the model which offers the most satisfactory
predictive description of the observed data y. To be more precise, if Z is a future random
variable, defined as an independent copy of Y , we are interested in the choice of the ‘best’
model for forecasting Z, given a realisation of Y , using composite likelihood methods.
As usual for an information criterion, model selection can be approached on the basis
of the expected composite Kullback–Leibler information between the true density g(z)
and the estimated density $\hat f(z) = f(z; \hat\theta_{MCL})$, under the assumed statistical model: we select the model which minimises $E_{g(y)}\{I_C(g, \hat f)\}$ or, equivalently, which maximises
$$Q(g, f) = \sum_{i \in I} E_{g(y)}[E_{g(z)}\{\log f(Z \in A_i; \hat\theta_{MCL})\}]\, w_i. \tag{1}$$
Equation (1) defines a theoretical criterion for predictive model selection, using composite
likelihood. However, it requires the knowledge of the true density g(z). Thus, in practice
we should maximise a selection statistic $\hat Q(g, f)$, defined as a suitable estimator for $Q(g, f)$,
based on Y . In particular, we look for estimators that are unbiased, either exactly or to
the relevant order of approximation. A natural estimator is
$$\ell_C(\hat\theta_{MCL}; Y) = \log L_C(\hat\theta_{MCL}; Y) = \sum_{i \in I} \log f(Y \in A_i; \hat\theta_{MCL})\, w_i,$$
which is the sample counterpart of (1) and corresponds to the maximised composite loglikelihood. In the following section, we prove that $\ell_C(\hat\theta_{MCL}; Y)$ is biased and we introduce a modification which corrects the first-order bias.
3. A COMPOSITE-LIKELIHOOD INFORMATION CRITERION
Since standard likelihood theory under misspecification (White, 1994) can usually be applied, with modest changes, within the composite likelihood context, we state the following usual regularity assumptions.
Assumption 1. The parameter space $\Theta$ is a compact subset of $\mathbb{R}^d$ ($d \geq 1$) and, for every fixed $y \in \mathcal{Y}$, $L_C(\theta; y)$ is twice differentiable, with continuity, with respect to $\theta$.

Assumption 2. The estimator $\hat\theta_{MCL}$ is defined as a solution to the composite likelihood equation and there exists a vector $\theta_* \in \mathrm{int}(\Theta)$ such that, exactly or with an error term that is negligible as $n \to +\infty$, $E_{g(y)}\{\nabla \ell_C(\theta_*; Y)\} = 0$.

Assumption 3. The estimator $\hat\theta_{MCL}$ is consistent for $\theta_*$ and asymptotically normally distributed.
The quantity $\theta_* \in \mathrm{int}(\Theta)$ is a pseudo-true parameter value, such that the composite Kullback–Leibler divergence between $g(y)$ and $f(y; \theta)$ is minimal. If the model is correctly specified for $Y$, $g(y) = f(y; \theta_0)$, for some $\theta_0 \in \mathrm{int}(\Theta)$, which is the true parameter value.
Note that, since we are concerned with model selection problems, two potential sources
of misspecification are involved: the first one reflects the fact that the true distribution
may not belong to the working family of distributions and the second is related to the
use of the composite likelihood instead of the full likelihood function. As emphasised in
the following, when using composite likelihood methods, we have to consider likelihood
theory under misspecification even if the true model for the data is taken into account.
The following lemmas motivate the main result of the paper. The arguments involve
expansions that are standard in full likelihood considerations and are similar to those
leading to Takeuchi’s information criterion, discussed below. However, the results are
presented here in a more general context, using composite likelihood.
LEMMA 1. Under Assumptions 1–3, we have that
$$Q(g, f) = E_{g(y)}\{\ell_C(\theta_*; Y)\} + \tfrac{1}{2}\, \mathrm{tr}\{J(\theta_*) H(\theta_*)^{-1}\} + o(1),$$
with
$$J(\theta_*) = \mathrm{var}_{g(y)}\{\nabla \ell_C(\theta_*; Y)\}, \qquad H(\theta_*) = E_{g(y)}\{\nabla^2 \ell_C(\theta_*; Y)\}, \tag{2}$$
where the expectations are with respect to $g(y)$.

Proof. Consider the stochastic Taylor expansion for $\ell_C\{\hat\theta_{MCL}(Y); Z\}$ around $\hat\theta_{MCL}(Y) = \theta_*$:
$$\ell_C\{\hat\theta_{MCL}(Y); Z\} = \ell_C(\theta_*; Z) + \{\hat\theta_{MCL}(Y) - \theta_*\}^{T} \nabla \ell_C(\theta_*; Z) + \tfrac{1}{2}\{\hat\theta_{MCL}(Y) - \theta_*\}^{T}\{\nabla^2 \ell_C(\theta_*; Z)\}\{\hat\theta_{MCL}(Y) - \theta_*\} + o_p(1).$$
Taking expectations term by term, with respect to the true distribution of $Z$, since $Y$ and $Z$ are independent and identically distributed, and since Assumption 2 holds, we have
$$E_{g(z)}[\ell_C\{\hat\theta_{MCL}(Y); Z\}] = E_{g(y)}\{\ell_C(\theta_*; Y)\} + \tfrac{1}{2}\{\hat\theta_{MCL}(Y) - \theta_*\}^{T} E_{g(z)}\{\nabla^2 \ell_C(\theta_*; Z)\}\{\hat\theta_{MCL}(Y) - \theta_*\} + o_p(1).$$
Moreover, the mean value of the above expansion, with respect to the true distribution of $Y$, gives
$$Q(g, f) = E_{g(y)}\{\ell_C(\theta_*; Y)\} + \tfrac{1}{2}\, \mathrm{tr}\{H(\theta_*) V(\theta_*)\} + o(1), \tag{3}$$
where $V(\theta_*) = E_{g(y)}[\{\hat\theta_{MCL}(Y) - \theta_*\}\{\hat\theta_{MCL}(Y) - \theta_*\}^{T}]$. By means of standard asymptotic arguments concerning the variance matrix of $\hat\theta_{MCL}(Y)$, we obtain
$$V(\theta_*) = H(\theta_*)^{-1} J(\theta_*) H(\theta_*)^{-1} + o(n^{-1}). \tag{4}$$
Plugging (4) into (3) completes the proof. □
LEMMA 2. Under Assumptions 1–3, we have that
$$E_{g(y)}[\ell_C\{\hat\theta_{MCL}(Y); Y\}] = E_{g(y)}\{\ell_C(\theta_*; Y)\} - \tfrac{1}{2}\, \mathrm{tr}\{J(\theta_*) H(\theta_*)^{-1}\} + o(1),$$
with $J(\theta_*)$ and $H(\theta_*)$ given by (2).

Proof. Consider the stochastic Taylor expansion for $\ell_C\{\hat\theta_{MCL}(Y); Y\}$ around $\hat\theta_{MCL}(Y) = \theta_*$:
$$\ell_C\{\hat\theta_{MCL}(Y); Y\} = \ell_C(\theta_*; Y) + \{\hat\theta_{MCL}(Y) - \theta_*\}^{T} \nabla \ell_C(\theta_*; Y) + \tfrac{1}{2}\{\hat\theta_{MCL}(Y) - \theta_*\}^{T} \nabla^2 \ell_C(\theta_*; Y)\{\hat\theta_{MCL}(Y) - \theta_*\} + o_p(1).$$
Since, by standard asymptotic arguments, $\nabla \ell_C(\theta_*; Y)$ may be approximated by
$$-\{\hat\theta_{MCL}(Y) - \theta_*\}^{T} \nabla^2 \ell_C(\theta_*; Y),$$
we obtain
$$\ell_C\{\hat\theta_{MCL}(Y); Y\} = \ell_C(\theta_*; Y) - \tfrac{1}{2}\{\hat\theta_{MCL}(Y) - \theta_*\}^{T} \nabla^2 \ell_C(\theta_*; Y)\{\hat\theta_{MCL}(Y) - \theta_*\} + o_p(1).$$
Taking expectations, with respect to the true distribution of $Y$, and using relationship (4) and $\nabla^2 \ell_C(\theta_*; Y) = E_{g(y)}\{\nabla^2 \ell_C(\theta_*; Y)\} + o_p(n)$, we complete the proof. □
From these lemmas, it is immediate to see that $\ell_C(\hat\theta_{MCL}; Y)$ is biased and that, under
standard regularity conditions, the following information criterion is a first-order unbiased
estimator for Q(g, f ).
DEFINITION 3. Consider a random sample $Y$, as previously defined. The composite likelihood information criterion selects the model maximising
$$\ell_C(\hat\theta_{MCL}; Y) + \mathrm{tr}\{\hat J(Y)\, \hat H(Y)^{-1}\}, \tag{5}$$
where $\hat J(Y)$ and $\hat H(Y)$ are consistent, first-order unbiased estimators for $J(\theta_*)$ and $H(\theta_*)$, respectively.
The statistic (5) is a generalisation of Takeuchi's information criterion, which takes the same form but with the parameter estimator being the ordinary maximum likelihood estimator $\hat\theta_{ML}$ and the quantities $\hat J(Y)$ and $\hat H(Y)$ computed from the ordinary likelihood function. In that setting, when the candidate model includes the true one, the information identity $J(\theta) = -H(\theta)$ holds and the selection statistic reduces to the more familiar Akaike (1973) criterion $\ell(\hat\theta_{ML}; Y) - d$. In the present setting this simplification will never occur, since the true model for the data does not play the role of the true model for the composite likelihood; that is, the information identity does not hold.
Finally, we briefly mention two further important points. The first concerns the choice
of the weights in the composite likelihood. Typically, with regard to the pairwise likelihood,
the weights are chosen in order to eliminate non-neighbour pairs of observations, which
should be less informative; see for example Nott & Rydén (1999).
The second point concerns the efficient estimation of $J(\theta)$ and $H(\theta)$. The latter does not pose difficulties and, under standard regularity conditions, a consistent estimator is $\hat H\{\hat\theta_{MCL}(Y)\} = \nabla^2 \ell_C\{\hat\theta_{MCL}(Y); Y\}$. Much more difficult is the estimation of $J(\theta)$, since the associated naive estimator $\hat J(\theta) = \nabla \ell_C(\theta; Y)\, \nabla \ell_C(\theta; Y)^{T}$ vanishes when evaluated at $\theta = \hat\theta_{MCL}(Y)$. If independent observations of the random vector $Y$ are available, $J(\theta)$ can be estimated by the sample variance of the associated individual contributions to the composite score function. On the other hand, if we observe only a single replicate of $Y$, as in the case of time series data, a different strategy is needed.
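For the independent-replicates case just described, here is a minimal sketch of the criterion (5) (ours, not the authors' code): $J$ is estimated from the replicate-level score contributions, $H$ by finite differences; the exchangeable trivariate normal model, the numerical derivatives and all tuning values are illustrative assumptions.

```python
# A sketch (ours) of criterion (5) with N iid replicates of a trivariate
# normal with exchangeable correlation rho, fitted by pairwise likelihood.
import numpy as np
from itertools import combinations
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
d, N, rho_true = 3, 200, 0.4
S = (1 - rho_true) * np.eye(d) + rho_true * np.ones((d, d))
Y = rng.multivariate_normal(np.zeros(d), S, size=N)
PAIRS = [list(p) for p in combinations(range(d), 2)]

def loglik_rows(rho):
    """Replicate-level contributions to the pairwise loglikelihood."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=cov)
    return sum(mvn.logpdf(Y[:, p]) for p in PAIRS)

fit = minimize(lambda r: -loglik_rows(r[0]).sum(), x0=[0.0],
               method="Nelder-Mead")
rho_hat, eps = fit.x[0], 1e-4

# J: N times the sample variance of the per-replicate scores (central diff.)
scores = (loglik_rows(rho_hat + eps) - loglik_rows(rho_hat - eps)) / (2 * eps)
J_hat = np.atleast_2d(N * scores.var())
# H: finite-difference second derivative of the total loglikelihood
H_hat = np.atleast_2d((loglik_rows(rho_hat + eps).sum()
                       - 2 * loglik_rows(rho_hat).sum()
                       + loglik_rows(rho_hat - eps).sum()) / eps**2)

clic = loglik_rows(rho_hat).sum() + np.trace(J_hat @ np.linalg.inv(H_hat))
print(rho_hat, clic)   # tr(J H^{-1}) < 0 here, since H is negative definite
```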
In this case, one might estimate $J(\theta)$ by defining a partition of the sample $Y$ such that the associated contributions to the composite score function are approximately uncorrelated. The sample variance matrix of these contributions may define an estimator for $J(\theta)$. However, the specification of this partition can be problematic for time series data, especially in the case of long-range dependence. As an alternative, a parametric bootstrap
procedure may be defined. However, in this case, the model has to be considered as
correctly specified and, if the data are high-dimensional, the computation can be very
time-consuming. A further possibility for time series and spatial data that do not seriously
depart from the condition of stationarity is to consider a resampling procedure called
window subsampling; see Heagerty & Lumley (2000) and references therein. This method
has been developed in the context of pairwise likelihood for binary spatial data by
Heagerty & Lele (1998). The idea behind window subsampling is to define suitable over-
lapping subseries of the original data that may be viewed as independent, identically
distributed, replicated observations. If we consider all the overlapping subseries of
dimension $m$, a suitable estimator for $J(\theta)$ is
$$\hat J_m(\theta) = \frac{1}{n-m+1} \sum_{i=1}^{n-m+1} \frac{m}{n}\, \nabla \ell_C(\theta; Y_i^{i+m})\, \nabla \ell_C(\theta; Y_i^{i+m})^{T},$$
evaluated at $\theta = \hat\theta_{MCL}(Y)$, where $Y_i^{i+m} = (Y_i, \ldots, Y_{i+m})$. More refined estimators, and useful comments on the choice of the window dimension $m$, may be found in Heagerty & Lumley (2000).
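A sketch of the window-subsampling estimator just displayed (ours, not the authors' code): for simplicity the subseries is taken as the length-$m$ window starting at $i$, and the user-supplied function score_fn, which returns the composite score of a (sub)series, is an assumption of this illustration.

```python
# A sketch (ours) of the window-subsampling estimator of J for a single
# time series: average of outer products of subseries composite scores,
# scaled by m/n as in the displayed formula.
import numpy as np

def J_window(score_fn, theta_hat, y, m):
    """score_fn(theta, y) is assumed to return the composite score vector
    of the (sub)series y; windows are the overlapping length-m subseries."""
    y = np.asarray(y)
    n = len(y)
    terms = [np.outer(u, u)
             for i in range(n - m + 1)
             for u in [np.atleast_1d(score_fn(theta_hat, y[i:i + m]))]]
    return (m / n) * np.mean(terms, axis=0)
```

In § 4·1 below, score_fn would be the score of the pairwise likelihood (6) and the window dimension would be $m = 50$.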
4. APPLICATIONS
4·1. Testing overdispersion in dynamic generalised linear models
When modelling count data, usually by means of the Poisson distribution, one often
needs to account for overdispersion. Here, by means of a simulation study, we discuss the
use of composite likelihood methods for testing the presence of overdispersion with
regard to time series of count data. We compare two dynamic generalised linear models
(West & Harrison, 1997, p. 521) based respectively on the Poisson and on the negative
binomial distribution.
Given count data $y = (y_1, \ldots, y_n)$, consider a Poisson-AR(1) model, which is a partially observable stochastic process $\{X_i, Y_i\}_{i \geq 1}$. The unobservable part $\{X_i\}_{i \geq 1}$ is an AR(1) process and the observations $Y_i$ ($i \geq 1$) are conditionally independent given $\{X_i\}_{i \geq 1}$ and follow a Poisson distribution. More precisely, $Y_i \mid X_i = x_i \sim \mathrm{Po}(e^{x_i})$, for $i \geq 1$, and $X_i = \lambda X_{i-1} + \varepsilon_i$, for $i \geq 2$, with $\varepsilon_i \sim N(0, \sigma^2)$, independently for each $i$. We assume that $|\lambda| < 1$, so that the latent model is stationary, and we set $X_1 \sim N\{0, \sigma^2/(1-\lambda^2)\}$. An alternative model, useful for describing overdispersion, is obtained by substituting the Poisson distribution with a negative binomial distribution with mean $\mu_i = e^{x_i}$ and an additional size parameter $k > 0$, such that, for $i \geq 1$,
$$f(y_i \mid x_i; k) = \frac{\Gamma(k^{-1} + y_i)}{\Gamma(k^{-1})\, y_i!} \left(\frac{k\mu_i}{1 + k\mu_i}\right)^{y_i} \left(\frac{1}{1 + k\mu_i}\right)^{1/k} \quad (y_i \in \mathbb{N}),$$
where $\Gamma(\cdot)$ is the gamma function. Note that these two models are nested, since the second tends to the first as $k \to 0$. Although these models look simple, their analysis using standard likelihood-based procedures is problematic, since the computation of the likelihood function requires the evaluation of an intractable $n$-dimensional integral. Although approximate solutions involving simulation-based methods are possible (Durbin & Koopman, 2001, Ch. 11), we pursue a different strategy that relies on the pairwise likelihood based on consecutive pairs of observations:
$$L_P(\theta; y) = \prod_{i=2}^{n} \int\!\!\int f(y_i \mid x_i; \theta)\, f(y_{i-1} \mid x_{i-1}; \theta)\, f(x_i \mid x_{i-1}; \theta)\, f(x_{i-1}; \theta)\, dx_i\, dx_{i-1}. \tag{6}$$
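As noted above, the negative binomial model tends to the Poisson model as $k \to 0$; a quick numerical check (ours), assuming that scipy's parameterisation nbinom(n = 1/k, p = 1/(1 + k*mu)) matches the density displayed above:

```python
# A quick check (ours) that the negative binomial density tends to the
# Poisson density as k -> 0, using scipy's nbinom parameterisation.
from scipy.stats import nbinom, poisson

mu, y = 3.0, 4
for k in [1.0, 0.1, 0.01, 0.001]:
    print(k, nbinom.pmf(y, 1 / k, 1 / (1 + k * mu)), poisson.pmf(y, mu))
```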
In the first simulation study, we generated 500 datasets with $n = 300$ observations from the Poisson-AR(1) model with $\lambda = 0{\cdot}35$ and $\sigma = 1$. The two-dimensional integrals forming the pairwise likelihood (6) were approximated by means of double Gauss–Hermite quadrature with 10 nodes for each dimension. Using adaptive quadrature gave similar results. The sample means of the simulated pairwise likelihood estimators for $\lambda$ and $\sigma$ were respectively 0·3500 and 0·9820, with sample standard deviations 0·1128 and 0·0831. The good performance of the composite likelihood estimators also held for other values of $\lambda$ and $\sigma$.
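The quadrature step just described can be sketched as follows (our illustration, not the authors' code): writing $X_{i-1}$ in terms of its stationary law and $X_i \mid X_{i-1}$ as a Gaussian shift, each bivariate integral in (6) reduces to a double Gauss–Hermite sum.

```python
# A sketch (ours) of one pairwise term of (6) for the Poisson-AR(1) model
# via double Gauss-Hermite quadrature: substitute x_{i-1} = sqrt(2)*s*z1
# (s = stationary sd) and x_i = lam*x_{i-1} + sqrt(2)*sigma*z2.
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.stats import poisson

def pair_term(y_prev, y_curr, lam, sigma, nodes=10):
    z, w = hermgauss(nodes)
    s_st = sigma / np.sqrt(1 - lam**2)       # stationary standard deviation
    total = 0.0
    for z1, w1 in zip(z, w):
        x_prev = np.sqrt(2) * s_st * z1
        for z2, w2 in zip(z, w):
            x_curr = lam * x_prev + np.sqrt(2) * sigma * z2
            total += w1 * w2 * (poisson.pmf(y_prev, np.exp(x_prev))
                                * poisson.pmf(y_curr, np.exp(x_curr)))
    return total / np.pi                     # Gauss-Hermite normalisation

# the pairwise loglikelihood sums log pair_term over consecutive pairs
```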
The second simulation study deals with model selection. We generated 100 datasets with $n = 300$ observations from the negative binomial-AR(1) model with $\lambda = 0{\cdot}35$, $\sigma = 0{\cdot}5$ and various values for the size parameter $k$, namely 1, 1/2, 1/4, 1/8 and 0; $k = 0$ indicates that the simulations are from the Poisson-AR(1) model. We compared the two alternative models using the composite likelihood criterion (5). The matrix $J(\theta)$ was estimated with a window subsampling procedure with window dimension $m = 50$. Table 1(a) gives the frequencies of correct model selection over the 100 simulated datasets, for different values of $k$.
Table 1. Frequencies of correct model selection over the 100 simulated datasets with $k = 1, 1/2, 1/4, 1/8, 0$ and (a) $\lambda = 0{\cdot}35$, $\sigma = 0{\cdot}5$, and (b) $\lambda = -0{\cdot}6$, $\sigma = 0{\cdot}7$; $k = 0$ indicates a true Poisson-AR(1) model

                 (a)                                     (b)
  k=1   k=1/2   k=1/4   k=1/8   k=0      k=1   k=1/2   k=1/4   k=1/8   k=0
  100    90      64      59     55        99    82      64      45     58
Whenever $k \geq 1/2$, the criterion (5) almost always indicated the true model, that is the negative binomial one. Note that, as expected, as $k$ approaches zero we choose the wrong model more often, since the overdispersion in the observations is then only slight and the two models provide almost equivalent data descriptions, as detected by the composite Kullback–Leibler divergence. When $k$ is less than 1/4, the values of the selection statistics are usually very similar, as confirmed by the analysis of the magnitudes of the observed differences. Similar results are obtained with $\lambda = -0{\cdot}6$ and $\sigma = 0{\cdot}7$; see Table 1(b).
4·2. The Old Faithful geyser data
Here we present an application to the Old Faithful geyser dataset discussed in Azzalini
& Bowman (1990). We find that composite likelihood methods based on the tripletwise
likelihood perform well for parameter estimation but fail to distinguish between the two
models considered.
The data consist of a binary version of the time series of the duration of the successive eruptions of the Old Faithful geyser in the Yellowstone National Park in the period from 1 to 15 August 1985. The short and long eruptions, based on a threshold of 3 minutes, are labelled as 0 and 1, respectively. The random variables $N_r$, for $r = 0, 1$, indicate the corresponding observed numbers of eruptions, and $N_0 = 105$ and $N_1 = 194$. Also the one-step observed transitions $N_{rs}$, for $r, s = 0, 1$, from state $r$ to state $s$, are $N_{00} = 0$, $N_{10} = 105$, $N_{01} = 104$ and $N_{11} = 89$. Since $N_{00} = 0$, only five two-step observed transitions are nonnull: $N_{010} = 69$, $N_{110} = 35$, $N_{011} = 35$, $N_{101} = 104$ and $N_{111} = 54$. In order to find a
plausible model for these data, MacDonald & Zucchini (1997; § 4.2) compare a suitable
hidden Markov model based on the binomial distribution with the second-order Markov
chain model proposed by Azzalini & Bowman (1990). A hidden Markov model is a partially observable stochastic process $\{X_i, Y_i\}_{i \geq 1}$, such that the unobserved part, $\{X_i\}_{i \geq 1}$, is a Markov chain. The evaluation of the likelihood function requires $O(w^n)$ computations, with $w$ the number of states of the hidden chain, but MacDonald & Zucchini (1997) consider a suitable rearrangement of the terms involved, thereby significantly reducing the computational burden.
An alternative to the full likelihood may be found within the class of composite likelihoods. The simplest candidate is the pairwise likelihood, based on pairs of consecutive observations. However, for the two models under consideration, this is not useful since the associated likelihood equation has an infinite number of solutions. We therefore define a tripletwise likelihood based on triplets of consecutive observations.
We consider the two competing models. For the two-state second-order Markov chain, the tripletwise likelihood is easily computed and it involves $\mathrm{pr}(Y_{i-2} = r, Y_{i-1} = s, Y_i = t)$, for $i > 2$. In order to calculate these probabilities it is convenient to consider
$$D_{(sr)(ts)} = \mathrm{pr}(Y_{i-1} = s, Y_i = r \mid Y_{i-2} = t, Y_{i-1} = s) = \mathrm{pr}(Y_i = r \mid Y_{i-1} = s, Y_{i-2} = t),$$
for $r, s, t = 0, 1$, $i > 2$. It is easy to see that $D_{(10)(01)} = b$, $D_{(10)(11)} = c$ and $D_{(01)(10)} = 1$, with $b, c \in (0, 1)$ as unknown parameters. Here $\theta = (b, c)$ and the maximum tripletwise likelihood estimates are found to be $\hat b_{MTL} = 0{\cdot}6634$ and $\hat c_{MTL} = 0{\cdot}3932$, which equal the maximum
likelihood estimates. Graphical inspection of Fig. 1 shows, as expected, that the ordinary
loglikelihood has a more peaked form; indeed, the contour plots show different gradient
directions. The maximised tripletwise loglikelihood is $\ell_T(\hat b_{MTL}, \hat c_{MTL}; y) = -451{\cdot}5889$. Note that the maximum tripletwise likelihood estimates allow for equality between the estimated and observed frequencies for the five triplets of potential observations: the estimate for $\mathrm{pr}(Y_{i-2} = 0, Y_{i-1} = 1, Y_i = 0)$, based on $\hat b_{MTL}$ and $\hat c_{MTL}$, equals $N_{010}/(n-2)$, and similarly for the remaining triplets. We can therefore say that this model reaches a sort of 'best' possible fit, as detected by the tripletwise likelihood.
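Since, as just stated, the fitted triplet probabilities equal the observed relative frequencies, the tripletwise estimates and the maximised tripletwise loglikelihood can be recovered directly from the published counts; a small sketch (ours, under that assumption):

```python
# A sketch (ours): tripletwise fit of the second-order chain from the
# published two-step transition counts N_rst (r = Y_{i-2}, s = Y_{i-1},
# t = Y_i).  b and c are conditional relative frequencies, and the
# maximised tripletwise loglikelihood is sum N_rst * log(N_rst / (n-2)).
import numpy as np

N = {'010': 69, '110': 35, '011': 35, '101': 104, '111': 54}
total = sum(N.values())                       # n - 2 = 297

b_hat = N['010'] / (N['010'] + N['011'])      # pr(0 | Y_{i-1}=1, Y_{i-2}=0)
c_hat = N['110'] / (N['110'] + N['111'])      # pr(0 | Y_{i-1}=1, Y_{i-2}=1)
l_T = sum(n_rst * np.log(n_rst / total) for n_rst in N.values())
print(b_hat, c_hat, l_T)   # about 0.6634, 0.3932 and -451.5889
```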
Fig. 1. Contour plots of (a) the ordinary loglikelihood and (b) the tripletwise loglikelihood for the
second-order Markov chain model fitted to the Old Faithful geyser data.
With regard to the second model, we recall that, for a hidden Markov model, the tripletwise likelihood is
$$L_T(\theta; y) = \prod_{i=3}^{n} \sum_{x_{i-2}, x_{i-1}, x_i} f(x_{i-2}, x_{i-1}, x_i; \theta)\, f(y_{i-2} \mid x_{i-2}; \theta)\, f(y_{i-1} \mid x_{i-1}; \theta)\, f(y_i \mid x_i; \theta),$$
where $f(x_{i-2}, x_{i-1}, x_i; \theta)$ is the joint probability function of $(X_{i-2}, X_{i-1}, X_i)$ and the summation is over all the triplets of subsequent latent observations. In this case,
$$L_T(\theta; y) = \prod_{r, s, t \in \{0,1\}} \mathrm{pr}(Y_{i-2} = r, Y_{i-1} = s, Y_i = t)^{N_{rst}}.$$
We assume that the transition probabilities of the hidden Markov chain are
$$\mathrm{pr}(X_{i+1} = 1 \mid X_i = 0) = 1, \qquad \mathrm{pr}(X_{i+1} = 0 \mid X_i = 1) = a,$$
for $i \geq 1$, and $\mathrm{pr}(Y_i = y \mid X_i = 0) = \rho^{y}(1-\rho)^{1-y}$, $y = 0, 1$, $\mathrm{pr}(Y_i = 1 \mid X_i = 1) = 1$, for $i \geq 1$, with $a \in (0, 1)$ and $\rho \in (0, 1)$ as unknown parameters. If we assume stationarity it is not difficult to compute the relevant triplet probabilities. Here $\theta = (a, \rho)$ and the maximum tripletwise likelihood estimates are $\hat a_{MTL} = 0{\cdot}8948$ and $\hat\rho_{MTL} = 0{\cdot}2584$, while the maximum likelihood estimates, though not the same, are very similar; that is, $\hat a_{ML} = 0{\cdot}827$ and $\hat\rho_{ML} = 0{\cdot}225$.
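A companion sketch (ours, not the authors' code) for the hidden Markov model: the triplet probabilities follow from the stationary law of the hidden chain and the emission probabilities stated above, and numerical maximisation of the tripletwise loglikelihood should land near the published estimates.

```python
# A sketch (ours): tripletwise fit of the two-state hidden Markov model,
# with hidden transitions P = [[0, 1], [a, 1-a]], stationary law
# pi = (a, 1)/(1+a), and emissions pr(Y=y|X=0) = rho^y (1-rho)^(1-y),
# pr(Y=1|X=1) = 1.
import numpy as np
from itertools import product
from scipy.optimize import minimize

N = {'010': 69, '110': 35, '011': 35, '101': 104, '111': 54}

def triplet_prob(r, s, t, a, rho):
    P = np.array([[0.0, 1.0], [a, 1.0 - a]])
    pi = np.array([a, 1.0]) / (1.0 + a)
    def emit(y, x):                           # pr(Y = y | X = x)
        return rho**y * (1 - rho)**(1 - y) if x == 0 else float(y == 1)
    return sum(pi[x0] * emit(r, x0) * P[x0, x1] * emit(s, x1)
               * P[x1, x2] * emit(t, x2)
               for x0, x1, x2 in product([0, 1], repeat=3))

def neg_l_T(theta):
    a, rho = theta
    if not (0 < a < 1 and 0 < rho < 1):
        return np.inf
    return -sum(n * np.log(triplet_prob(int(k[0]), int(k[1]), int(k[2]),
                                        a, rho))
                for k, n in N.items())

fit = minimize(neg_l_T, x0=[0.5, 0.5], method="Nelder-Mead")
print(fit.x, -fit.fun)   # about (0.8948, 0.2584) and -451.5889
```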
Graphical inspection of the contour plots of the ordinary loglikelihood and of the
tripletwise loglikelihood leads to conclusions similar to those emphasised in the previous
case.
We find that the maximised tripletwise loglikelihood $\ell_T(\hat a_{MTL}, \hat\rho_{MTL}; y) = -451{\cdot}5889$ coincides with that obtained for the second-order Markov chain. This is a consequence of the perfect match between the estimated and the observed frequencies for the five triplets of potential observations, which holds for the two-state hidden Markov model as well. Indeed, the computation, using Monte Carlo simulation, of the bias correction term $\mathrm{tr}\{\hat J(Y)\, \hat H(Y)^{-1}\}$ specifying the composite likelihood information criterion also gives the same approximate value of 4·65 for the two models. Thus, in this case, the tripletwise
likelihood, though useful for inferential purposes, does not detect the high-order structural
differences between the two estimated models. This conclusion emphasises that a careful
choice of the composite likelihood is necessary, both for inference and model selection,
with the aim of balancing the improved computational facility and the reduced descriptive
ability that characterise pseudolikelihood procedures.
ACKNOWLEDGEMENT
The authors would like to thank Prof. A. Azzalini, Dr M. Chiogna and the associate
editor for helpful comments. This research is partially supported by grants from the
Ministry of Education and University, Italy.
REFERENCES
AKAIKE, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proc. Second International Symposium on Information Theory, Ed. B. N. Petrov and F. Csaki, pp. 267–81. Budapest: Akademiai Kiado.
AZZALINI, A. (1983). Maximum likelihood estimation of order m for stationary stochastic processes. Biometrika 70, 381–7.
AZZALINI, A. & BOWMAN, A. W. (1990). A look at some data on the Old Faithful geyser. Appl. Statist. 39, 357–65.
BESAG, J. E. (1974). Spatial interaction and the statistical analysis of lattice systems (with Discussion). J. R. Statist. Soc. B 36, 192–236.
BURNHAM, K. P. & ANDERSON, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd ed. New York: Springer-Verlag.
COX, D. R. (1975). Partial likelihood. Biometrika 62, 269–76.
COX, D. R. & REID, N. (2004). A note on pseudolikelihood constructed from marginal densities. Biometrika 91, 729–37.
DURBIN, J. & KOOPMAN, S. J. (2001). Time Series Analysis by State Space Methods. Oxford: Oxford University Press.
FEARNHEAD, P. & DONNELLY, P. (2002). Approximate likelihood methods for estimating local recombination rates. J. R. Statist. Soc. B 64, 657–80.
HEAGERTY, P. J. & LELE, S. R. (1998). A composite likelihood approach to binary spatial data. J. Am. Statist. Assoc. 93, 1099–111.
HEAGERTY, P. J. & LUMLEY, T. (2000). Window subsampling of estimating functions with application to regression models. J. Am. Statist. Assoc. 95, 197–211.
HENDERSON, R. & SHIMAKURA, S. (2003). A serially correlated gamma frailty model for longitudinal count data. Biometrika 90, 355–66.
HJORT, N. L. & OMRE, H. (1994). Topics in spatial statistics. Scand. J. Statist. 21, 289–357.
LINDSAY, B. (1988). Composite likelihood methods. In Statistical Inference from Stochastic Processes, Ed. N. U. Prabhu, pp. 221–39. Providence, RI: American Mathematical Society.
MACDONALD, I. L. & ZUCCHINI, W. (1997). Hidden Markov and Other Models for Discrete-valued Time Series. London: Chapman and Hall.
NOTT, D. J. & RYDÉN, T. (1999). Pairwise likelihood methods for inference in image models. Biometrika 86, 661–76.
PARNER, E. T. (2001). A composite likelihood approach to multivariate survival data. Scand. J. Statist. 28, 295–302.
RENARD, D., MOLENBERGHS, G. & GEYS, H. (2004). A pairwise likelihood approach to estimation in multilevel probit models. Comp. Statist. Data Anal. 44, 649–67.
SHIBATA, R. (1989). Statistical aspects of model selection. In From Data to Model, Ed. J. Willems, pp. 215–40. New York: Springer-Verlag.
TAKEUCHI, K. (1976). Distribution of information statistics and criteria for adequacy of models (in Japanese). Math. Sci. 153, 12–8.
VARIN, C., HØST, G. & SKARE, Ø. (2005). Pairwise likelihood inference in spatial generalized linear mixed models. Comp. Statist. Data Anal. To appear.
WEST, M. & HARRISON, J. (1997). Bayesian Forecasting and Dynamic Models, 2nd ed. New York: Springer-Verlag.
WHITE, H. (1994). Estimation, Inference and Specification Analysis. New York: Cambridge University Press.
[Received November 2003. Revised December 2004]