
Gibbs Variable Selection Using BUGS

Ioannis Ntzoufras∗
Department of Business Administration
University of the Aegean, Chios, Greece
e-mail: [email protected]

Abstract

In this paper we discuss and present in detail the implementation of Gibbs variable selection as defined by Dellaportas et al. (2000, 2002) using the BUGS software (Spiegelhalter et al., 1996a,b,c). The specification of the likelihood, prior and pseudo-prior distributions of the parameters as well as the prior term and model probabilities are described in detail. Guidance is also provided for the calculation of the posterior probabilities within the BUGS environment when the number of models is limited. We illustrate the application of this methodology in a variety of problems including linear regression, log-linear and binomial response models.

Keywords: Logistic regression; Linear regression; MCMC; Model selection.

1 Introduction

In Bayesian model averaging or model selection we focus on the calculation of posterior model probabilities, which involve integrals that are analytically tractable only in certain restricted cases. This obstacle has been overcome via the construction of efficient MCMC algorithms for model and variable selection problems.
A variety of MCMC methods have been proposed for variable selection, including the Stochastic Search Variable Selection (SSVS) of George and McCulloch (1993), the reversible jump Metropolis by Green (1995), the model selection approach of Carlin and Chib (1995), the variable selection sampler of Kuo and Mallick (1998) and the Gibbs variable selection (GVS) by Dellaportas et al. (2000, 2002).
The primary aim of this paper is to clearly illustrate how we can utilize BUGS (Spiegelhalter et al., 1996a, see also www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml) for the implementation of variable selection methods. We concentrate on Gibbs variable selection introduced by Dellaportas et al. (2000, 2002) with independent prior distributions. Extension to other Gibbs samplers such as the George and McCulloch (1993) SSVS and the Kuo and Mallick (1998) sampler is straightforward; see, for example, Dellaportas et al. (2000). Finally, application of the Carlin and Chib (1995) algorithm is also illustrated using BUGS by Spiegelhalter et al. (1996c).
∗ Journal of Statistical Software, Volume 7, Issue 7, available from www.jstatsoft.org

The paper is organised into three sections in addition to this introductory one. Section 2 briefly describes the general Gibbs variable selection algorithm as introduced by Dellaportas et al. (2002), Section 3 provides detailed guidance for implementation in BUGS, and Section 4 presents three illustrative examples.

2 Gibbs Variable Selection

Many statistical models may be represented naturally as (s, γ) ∈ S × {0, 1}^p, where the indicator vector γ identifies which of the p possible sets of covariates are present in the model and s denotes other structural properties of the model. For example, for a generalised linear model, s may describe the distribution, link function and variance function, and the linear predictor may be written as

η = Σ_{j=1}^p γj Xj βj    (1)

where Xj is the design matrix and βj the parameter vector related to the jth term. In the following, we restrict attention to variable selection aspects assuming that s is known, and we concentrate on the estimation of the posterior distribution of γ.
We denote the likelihood of each model by f(y|β, γ) and the prior by f(β, γ) = f(β|γ)f(γ), where f(β|γ) is the prior of the parameter vector β conditional on the model structure γ and f(γ) is the prior of the corresponding model. Moreover, β can be partitioned into two vectors βγ and β\γ corresponding to the parameters of variables included in or excluded from the model. Under this approach the prior can be rewritten as

f(β, γ) = f(βγ|γ) f(β\γ|βγ, γ) f(γ)

while, since we are using the linear predictor (1), the likelihood simplifies to

f(y|β, γ) = f(y|βγ, γ).

From the above it is clear that the components of the vector β\γ do not affect the model likelihood, and hence the posterior distribution within each model γ is given by

f(β|γ, y) = f(βγ|γ, y) × f(β\γ|βγ, γ)

where f(βγ|γ, y) is the actual posterior of the parameters of model γ and f(β\γ|βγ, γ) is the conditional prior distribution of the parameters not included in the model γ. We may now interpret f(βγ|γ) as the actual prior of the model, while the distribution f(β\γ|βγ, γ) may be called the 'pseudoprior', since the parameter vector β\γ does not gain any information from the data and does not influence the actual posterior of the parameters of each model, f(βγ|γ, y). Although this pseudoprior does not influence the posterior distributions of interest, it influences the performance of the MCMC algorithm and hence it should be specified with caution.
The sampling procedure is summarised by the following steps:

1. Sample the parameters included in the model from the posterior

f(βγ|β\γ, γ, y) ∝ f(y|β, γ) f(βγ|γ) f(β\γ|βγ, γ)    (2)

2. Sample the parameters excluded from the model from the pseudoprior

f(β\γ|βγ, γ, y) ∝ f(β\γ|βγ, γ)    (3)

3. Sample each variable indicator γj from a Bernoulli distribution with success probability Oj/(1 + Oj), where Oj is given by

Oj = [f(y|β, γj = 1, γ\j) f(β|γj = 1, γ\j) f(γj = 1, γ\j)] / [f(y|β, γj = 0, γ\j) f(β|γj = 0, γ\j) f(γj = 0, γ\j)].    (4)

The selection of priors and pseudopriors is a very important aspect of model selection. Here we briefly present the simplest approach, where f(β|γ) is given as a product of independent prior and pseudoprior densities: f(β|γ) = Π_{j=1}^p f(βj|γj). In such a case, a usual and simple choice of f(βj|γj) is given by

f(βj|γj) = (1 − γj) f(βj|γj = 0) + γj f(βj|γj = 1)    (5)

resulting in the actual prior distribution f(βγ|γ) = Π_{j: γj=1} f(βj|γj) and pseudoprior f(β\γ|βγ, γ) = Π_{j: γj=0} f(βj|γj).
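Under the independent prior (5), the odds (4) simplify considerably, since every factor f(βk|γk) with k ≠ j appears identically in the numerator and denominator of the prior ratio and cancels:

Oj = [f(y|β, γj = 1, γ\j) / f(y|β, γj = 0, γ\j)] × [f(βj|γj = 1) / f(βj|γj = 0)] × [f(γj = 1, γ\j) / f(γj = 0, γ\j)].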
Note that the above prior can be efficiently used in any model selection problem if we orthogonalize the data matrix and then perform model choice using the new transformed data (see Clyde et al., 1996). If orthogonalization is undesirable then we may need to construct more sophisticated and flexible algorithms such as reversible jump MCMC; see Green (1995) and Dellaportas et al. (2002).
The simplified prior (5) and a model formulation such as (1) result in the following full conditional posterior

f(βj|γ, β\j, y) ∝ f(y|βγ, γ) Π_{k=1}^p f(βk|γk) ∝ f(y|γ, β) f(βj|γj = 1) if γj = 1, and ∝ f(βj|γj = 0) if γj = 0,    (6)

indicating that the pseudoprior f(βj|γj = 0) does not affect the posterior distribution of each model coefficient.
Similarly to George and McCulloch (1993), we use a mixture of normal distributions for the model parameters:

f(βj|γj = 1) ≡ N(0, Σj)  and  f(βj|γj = 0) ≡ N(µ̄j, Sj).    (7)

The hyperparameters µ̄j and Sj are parameters of the pseudoprior distribution; therefore their choice is only relevant to the behaviour of the MCMC chain and does not affect the posterior distribution. Ideal choices of these parameters are the maximum likelihood or pilot run estimators of the full model (as, for example, in Dellaportas and Forster, 1999). However, in the experimental process, we noted that the automatic selection µ̄j = 0 and Sj = Σj/k² with k = 10 has also proved to be an adequate choice; for more details see Ntzoufras (1999). This 'automatic selection' uses the properties of the prior distributions with 'large' and 'small' variance introduced in SSVS by George and McCulloch (1993). The parameter k is now only a pseudoprior parameter and therefore it does not affect the posterior distribution. Suitable calibration of this parameter assists the chain to move better (or worse) between different models.
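In BUGS notation (anticipating Section 3, where t[j] stands for the prior precision Σj^{-1} and g[j] for the indicator γj), this automatic pseudoprior amounts to the one-line sketch

tprior[j] <- t[j]*pow(100, 1-g[j])   # k^2 = 100: precision t[j] if g[j]=1, 100*t[j] if g[j]=0

so that an excluded coefficient is sampled around zero with variance Σj/100; this is the form used in the Appendix codes.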

The prior proposed by Dellaportas and Forster (1999) for contingency tables is also adopted here for logistic regression models with categorical explanatory variables (see Dellaportas et al., 2000). Alternatively, for generalized linear models, Raftery (1996) has proposed selecting the prior covariance matrix using elements of the data matrix multiplied by a hyperparameter. The latter is selected in such a way that the effect of the prior distribution on the posterior odds becomes minimal.
When no restrictions on the model space are imposed, a common prior for the term indicators γj is f(γj) = Bernoulli(1/2), whereas in other cases (for example, hierarchical or graphical log-linear models) it is required that f(γj|γ\j) depends on γ\j; for more details see Chipman (1996) and Section 3.4.
Other Gibbs samplers for model selection have also been proposed by George and McCulloch (1993), Carlin and Chib (1995) and Kuo and Mallick (1998). Detailed comparison and discussion of the above methods is given by Dellaportas et al. (2000, 2002). Implementation of the Carlin and Chib methodology in BUGS is illustrated by Spiegelhalter et al. (1996c, page 47), while an additional simple example of Gibbs variable selection methods is provided by Dellaportas et al. (2000).

3 Applying Gibbs Variable Selection Using BUGS

In this section we provide detailed guidance for implementing Gibbs variable selection using the BUGS software. It is divided into five sub-sections covering the definition of the model likelihood f(y|β, γ), the specification of the prior distributions f(β|γ) and f(γ), the implementation of related Gibbs samplers, and, finally, the direct calculation of posterior model probabilities using BUGS.

3.1 Definition of Likelihood

The linear predictor of type (1) used in Gibbs variable selection and in the Kuo and Mallick sampler can easily be incorporated in BUGS using the following code

for (i in 1:N) { for (j in 1:p) { z[i,j] <- x[i,j]*b[j]*g[j] } }
for (i in 1:N) {
    eta[i] <- sum(z[i,]);
    y[i] ~ distribution[parameter1, parameter2] }

where

• N denotes the sample size,

• p the total number of variables under consideration,

• x[i,j] is the (i, j) element of the data or design matrix X,

• y[i] is the ith element of the response vector y,

• b[j] is the jth element of the parameter vector β,

• g[j] is the inclusion indicator for the jth term, i.e. the jth element of γ,

• z[i,j] is a matrix used to simplify calculations,

• eta[i] is the ith element of the linear predictor vector η and should be substituted by the corresponding link function, for example logit(p[i]) in binomial logistic regression,

• distribution should be substituted by the appropriate BUGS command for the distribution that the user prefers (for example dnorm for the normal distribution),

• parameter1, parameter2 should be substituted according to the distribution chosen; for example, for the normal distribution with mean µi and variance τ^{-1} we may use mu[i], tau.

For the usual normal, binomial and Poisson models the model formulations are given by the following lines of BUGS code

Normal: for (i in 1:N) { mu[i] <- sum(z[i,]);
                         y[i] ~ dnorm(mu[i], tau) }
    where mu[i] is the expected value for the ith observation and tau is the precision of the regression model.

Poisson: for (i in 1:N) { log(lambda[i]) <- sum(z[i,]);
                          y[i] ~ dpois(lambda[i]) }
    where lambda[i] is the Poisson mean for the ith observation.

Binomial: for (i in 1:N) { logit(p[i]) <- sum(z[i,]);
                           y[i] ~ dbin(p[i], n[i]) }
    where p[i] is the probability of success and n[i] is the total number of Bernoulli trials for the ith binomial experiment. Alternative link functions may be used by substituting logit(p[i]) with probit(p[i]) or cloglog(p[i]), for Φ^{-1}(p) and log(−log(1 − p)) respectively, where Φ is the standard normal cumulative distribution function.
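For concreteness, the pieces above can be assembled into a complete program. The following is only a sketch for a normal linear regression with p = 3 candidate covariates: the data file names and the prior precision 0.001 for an included term are hypothetical, the intercept is omitted for brevity, and the pseudoprior is the automatic choice with k = 10 discussed in Section 2:

model gvslinreg;
const
  N = 21,   # sample size
  p = 3;    # number of candidate covariates
var
  y[N], x[N,p], z[N,p], mu[N], b[p], g[p], tprior[p], tau;
data y, x in "data.dat";   # hypothetical file names
inits in "data.in";
{
  for (i in 1:N) {
    for (j in 1:p) { z[i,j] <- x[i,j]*b[j]*g[j] }   # g[j]=0 drops term j
    mu[i] <- sum(z[i,]);
    y[i] ~ dnorm(mu[i], tau);
  }
  for (j in 1:p) {
    tprior[j] <- 0.001*pow(100, 1-g[j]);   # automatic pseudoprior with k=10
    b[j] ~ dnorm(0.0, tprior[j]);
    g[j] ~ dbern(0.5);                     # uniform prior over the 2^p models
  }
  tau ~ dgamma(1.0E-3, 1.0E-3);
}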

3.2 Definition of Prior Distribution of Parameter Vector

When we use independent priors as given by (5) and each covariate parameter vector is univariate, the definition of the prior is straightforward. Our prior is a mixture of independent normal distributions

βj ∼ γj N(0, Σj) + (1 − γj) N(µ̄j, Sj),  j = 1, 2, . . . , p,    (8)

where µ̄j and Sj are the mean and variance, respectively, of the corresponding pseudoprior distribution and Σj is the prior variance when the jth term is included in the model. In order to use (8) in BUGS we write

• b[j] ~ dnorm(bpriorm[j], tprior[j]) denoting βj ∼ N(mj, τj^{-1}),

• bpriorm[j] <- (1-g[j])*mean[j] denoting mj = (1 − γj) µ̄j,

• tprior[j] <- g[j]*t[j]+(1-g[j])*pow(se[j],-2) denoting τj = γj Σj^{-1} + (1 − γj) Sj^{-1},

for j = 1, 2, . . . , p; where mj and τj are the prior mean and precision for βj depending on γj, and t[j], se[j], mean[j], bpriorm[j], tprior[j] are the BUGS variables for Σj^{-1}, √Sj, µ̄j, mj and τj, respectively.
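In a complete program these statements are simply placed inside a loop over the p terms; a sketch:

for (j in 1:p) {
    bpriorm[j] <- (1-g[j])*mean[j];                  # pseudoprior mean when g[j]=0
    tprior[j] <- g[j]*t[j]+(1-g[j])*pow(se[j],-2);   # prior or pseudoprior precision
    b[j] ~ dnorm(bpriorm[j], tprior[j]);
}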
When we consider a categorical explanatory variable j with J > 2 categories, the corresponding parameter vector βj will be multivariate with dimension dj = J − 1. In such cases we denote by p and d (> p) the dimensions of γ and of the full parameter vector β, respectively. Therefore, we need one variable to facilitate the association between these two vectors. This vector is denoted by the BUGS variable pos. The pos vector, which has dimension equal to the dimension of β, takes values in 1, 2, . . . , p and denotes that the kth element of the parameter vector β is associated with the binary indicator γpos[k], for all k = 1, 2, . . . , d.
For illustration, let us consider an ANOVA model with two categorical variables X1 and X2 with 3 and 4 categories respectively. Then the terms under consideration are X0, X1, X2 and X12, where X0 denotes the constant term and X12 the interaction between the terms X1 and X2. The corresponding dimensions are dX0 = 1, dX1 = 2, dX2 = 3 and dX12 = dX1 × dX2 = 6. Then we set the pos vector equal to

pos <- c( 1, 2,2, 3,3,3, 4,4,4,4,4,4 )

to state that the first parameter corresponds to the first term (X0), parameters 2-3 correspond to the second term (X1), parameters 4-6 correspond to the third term (X2) and parameters 7-12 correspond to the fourth term (X12). Finally, we use another vector called gtemp of dimension d which is given by

gtemp[i] <- g[ pos[i] ]

for all i = 1, . . . , d. The vector gtemp is used in the likelihood instead of the g vector, as sketched below. For details see example 1 and the associated BUGS code in the Appendix.
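For the ANOVA illustration above (d = 12, p = 4), the mapping and its use in the likelihood would read, in sketch form:

for (i in 1:12) { gtemp[i] <- g[pos[i]] }   # expand 4 term indicators to 12 coefficients
for (i in 1:N) {
    for (j in 1:12) { z[i,j] <- x[i,j]*b[j]*gtemp[j] }   # as in Section 3.1, with gtemp in place of g
}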
Moreover, the definition of the prior distribution when factors or terms with many parameters are considered is more complicated. For example, a mixture of multivariate normal prior distributions as given by (5) can be expressed as a multivariate normal distribution on the 'full' parameter vector β. Therefore we may write in BUGS

• b[ ] ~ dmnorm(bpriorm[ ], Tau[ , ]) denoting β ∼ Nd(m, T^{-1}),

• bpriorm[k] <- (1-g[pos[k]])*mean[k] denoting mk = (1 − γpos[k]) µ̄k,

• Tau[k,l] <- g[pos[k]]*g[pos[l]]*t[k,l] + (1-g[pos[k]]*g[pos[l]])*equals(k,l)*pow(se[k],-2) denoting that

  Tkl = [Σ^{-1}]kl   when γpos[k] = γpos[l] = 1,
  Tkl = se_k^{-2}    when k = l and γpos[k] = 0,
  Tkl = 0            otherwise,

for k, l = 1, 2, . . . , d; where Nd is the d-dimensional normal distribution; m^T = (m1, m2, . . . , md) and T are the prior mean vector and precision matrix depending on γ; µ̄k is the corresponding pilot run estimate for the kth element of the model parameter vector β; Σ is the prior covariance matrix constructed for the whole parameter vector β when we use for each βj the multivariate extension of the prior distribution (8); Tkl and [Σ^{-1}]kl are the (k, l) elements of the matrices T and Σ^{-1}, respectively; and Tau[,], t[,] are the BUGS matrices for T and Σ^{-1}, respectively. An illustration of the use of such a prior distribution is given in example 1.

3.3 Implementation of Other Gibbs Samplers for Variable Selection

SSVS and the Kuo and Mallick sampler can easily be applied with minor modifications of the above code. In SSVS the prior (8) is used with µ̄j = 0 and Sj = Σj/kj², where kj² should be large enough so that βj will be close to zero when γj = 0. For the selection of the prior parameters in SSVS see the semiautomatic prior selection of George and McCulloch (1993). The above restriction can easily be applied in BUGS by

bpriorm[j] <- 0
tprior[j] <- t[j]*g[j]+(1-g[j])*t[j]*pow(k[j],2)

Moreover, the likelihood in SSVS should be slightly modified by substituting the first line of the code in Section 3.1 with

for (i in 1:N) { for (j in 1:p) { z[i,j] <- x[i,j]*b[j] } }

The Kuo and Mallick sampler uses a prior on β that does not depend on the model indicator γ. Therefore the specification of the prior is the same as in standard modelling with BUGS, while the likelihood definition is the same as in Gibbs variable selection, described in Section 3.1.
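A sketch of the corresponding prior specification, where the precision 1.0E-3 is an assumed 'non-informative' value:

for (j in 1:p) {
    b[j] ~ dnorm(0.0, 1.0E-3);   # prior free of g[j]: no pseudoprior parameters are needed
}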

3.4 Definition of Prior Term Probabilities

In order to apply any variable selection method in BUGS we need to define the prior probabilities f(γ). When we are vague about models we may set f(γ) = 1/M, where M is the number of all models under consideration. When the explanatory variables do not involve interactions (for example in linear regression), the number of models under consideration is 2^p. In these situations the latent variables γj can be treated as a priori independent and therefore we set in BUGS

• g[j] ~ dbern(0.5) denoting that γj ∼ Bernoulli(0.5)

for all j = 1, 2, . . . , p. This prior results in f(γ) = 2^{-p} for all γ ∈ {0, 1}^p. When we are dealing with models using categorical explanatory variables with interaction terms, such as ANOVA or log-linear models, we usually want to restrict attention to hierarchical models. The conditional distributions f(γj|γ\j) need to be specified in such a way that f(γ) = 1/M when γ corresponds to a hierarchical model and f(γ) = 0 otherwise.
For example, in a two-way ANOVA we have three terms under consideration: the main effects A, B and the interaction AB. There are eight possible models, of which only five are hierarchical (constant, [A], [B], [A][B] and [AB]). Therefore, we wish to specify f(γ) = 0.20 for the above five models and f(γ) = 0 for the rest. This can be achieved by setting in BUGS

• g[3] ~ dbern(0.2) denoting that γAB ∼ Bernoulli(0.2),

• pi <- g[3]+0.5*(1-g[3]) denoting that π = γAB + 0.5 (1 − γAB),

• for (i in 1:2) { g[i] ~ dbern(pi) } denoting that γi|γAB ∼ Bernoulli(π) for all i ∈ {A, B}.

From the above it is evident that

f([AB]) = f(γAB = 1) f(γA = 1|γAB = 1) f(γB = 1|γAB = 1) = 0.2 × 1 × 1 = 0.2,
f([A][B]) = f(γAB = 0) f(γA = 1|γAB = 0) f(γB = 1|γAB = 0) = 0.8 × 0.5 × 0.5 = 0.2.

Using similar calculations we find that f(γ) = 0.2 for all five models under consideration. For further relevant discussion and application see Chipman (1996). For implementation in BUGS see examples 1 and 3.

3.5 Calculating Model Probabilities in BUGS

In order to directly calculate the posterior model probabilities in BUGS and avoid saving large MCMC output, we may use a vector variable with dimension equal to the number of models. Using a simple coding such as mdl = 1 + Σ_{j=1}^p γj 2^{j−1}, we transform the vector γ into a unique model index (denoted by mdl) for which pmdl[mdl] = 1 and pmdl[j] = 0 for all j ≠ mdl. For example, with p = 3 the model containing the first and third covariates (γ = (1, 0, 1)) gets index mdl = 1 + 1 + 4 = 6. The above statements can be written in BUGS with the code

for (j in 1:p) { index[j] <- pow(2,j-1) }
mdl <- 1+inprod(g[ ], index[ ])
for (m in 1:models) { pmdl[m] <- equals(m,mdl) }

where models is the total number of models, 2^p. Then, using the command stats(pmdl) in the BUGS environment (or cmd file), we can monitor the posterior model probabilities. This is feasible only if the number of models is limited and therefore applicable only in some simple problems.

4 Examples

Three illustrative examples are briefly presented. The first example is a 3 × 2 × 4 contingency table used to illustrate how to handle factors with more than two levels. Example 2 provides model selection details in a regression-type problem involving several different error distributions, while example 3 is a simple logistic regression problem with random effects. In all examples posterior probabilities are presented, while the associated BUGS codes are provided in the appendix. Additional details (for example, convergence plots) are omitted since our aim is simply to illustrate how to use BUGS for variable selection.

4.1 Example 1: 3 × 2 × 4 Contingency Table

This example is a 3 × 2 × 4 contingency table presented by Knuiman and Speed (1988) in which 491 individuals are classified by three categorical variables: obesity (O: low, average, high), hypertension (H: yes, no) and alcohol consumption (A: 0, 1–2, 3–5, 6+ drinks per day); see Table 1.

Alcohol Intake
Obesity High BP 0 1-2 3-5 6+
Low Yes 5 9 8 10
No 40 36 33 24
Average Yes 6 9 11 14
No 33 23 35 30
High Yes 9 12 19 19
No 24 25 28 29

Table 1: 3 × 2 × 4 Contingency Table: Knuiman and Speed (1988) Dataset.

The full model is given by

nilk ∼ Poisson(λilk),  log(λilk) = m + oi + hl + ak + ohil + oaik + halk + ohailk,

for i = 1, 2, 3, l = 1, 2, k = 1, 2, 3, 4. The above model can be rewritten with likelihood given by (1), where β can be divided into sub-vectors βj with j ∈ {∅, O, H, OH, A, OA, HA, OHA}; here β∅ = m, βO^T = [o2, o3], βH = h2, βOH^T = [oh22, oh32], βA^T = [a2, a3, a4], βOA^T = [oa22, oa23, oa24, oa32, oa33, oa34], βHA^T = [ha22, ha23, ha24] and βOHA^T = [oha222, oha223, oha224, oha322, oha323, oha324]. Each βj is a multivariate vector and therefore each prior distribution involves a mixture of multivariate normal distributions. We use sum-to-zero constraints and prior variance Σj as in Dellaportas and Forster (1999). We restrict attention to hierarchical models always including the main effects, since we are mainly interested in the relationships between the categorical factors. Under these restrictions, nine models are under consideration. In order to forbid moves to non-hierarchical models we use the following BUGS code to define the prior model probabilities:

• g[8] ~ dbern(0.1111) for γOHA ∼ Bernoulli(1/9),

• pi <- g[8]+0.5*(1-g[8]) for π = γOHA + 0.5 (1 − γOHA),

• for (i in 5:7) { g[i] ~ dbern(pi) } for γi|γOHA ∼ Bernoulli(π), for all i ∈ {OH, OA, HA},

• for (j in 1:4) { g[j] ~ dbern(1) } for γj ∼ Bernoulli(1), for all j ∈ {constant, O, H, A}.

These priors result in a prior probability of 1/9 for each hierarchical model and zero otherwise.
Results using both the pilot run pseudoprior and the automatic pseudoprior with k = 10 are summarised in Table 2. The data give 'strong' evidence in favour of the model of independence. Model [OH][A], in which obesity and hypertension are dependent on each other given the level of alcohol consumption, is the model with the second highest posterior probability. All the other models have probability lower than 1%.

                    Posterior Model Probabilities (%)
Pseudopriors           k=10                   Pilot Run
Burn-in           1,000    10,000        1,000    10,000
Iterations        1,000    10 × 10,000   1,000    10 × 10,000
Models
[O][H][A]         62.80    68.87         65.20    67.80
[OH][A]           36.90    30.53         34.40    31.63
[O][HA]            0.20     0.40          0.10     0.43
[OH][HA]           0.10     0.20          0.30     0.14
Terms
γOH = 1           37.00    30.63         34.70    31.77
γHA = 1            0.30     0.20          0.40     0.57

Table 2: 3 × 2 × 4 Contingency Table: Posterior Model Probabilities Using BUGS.

4.2 Example 2: Stacks Dataset

This example involves the stack-loss data analysed by Spiegelhalter et al. (1996b, page 27) using Gibbs sampling. The dataset features 21 daily responses of stack loss (y), which measures the amount of ammonia escaping, with covariates the air flow (x1), temperature (x2) and acid concentration (x3). Spiegelhalter et al. (1996b) consider regression models with four different error structures (normal, double exponential, logistic and Student's t4 distributions). They also consider the cases of ridge and simple independent regression models. We extend their work by applying Gibbs variable selection in all these eight cases.
The full model is

yi ∼ D(µi, τ),  µi = β0 + β1 zi1 + β2 zi2 + β3 zi3,  i = 1, . . . , 21,

where D(µi, τ) is the distribution of the errors with mean µi and variance τ^{-1}, here assumed to be normal, double exponential, logistic or t4, and zij = (xij − x̄j)/sd(xj) are the standardised covariates. The ridge regression approach assumes the further restriction that the βj for j = 1, 2, 3 are exchangeable (Lindley and Smith, 1972) and therefore βj ∼ N(0, φ^{-1}). We use 'non-informative' priors with prior precision equal to 10^{-3} for the independent regression, and for φ in ridge regression we use a gamma prior with both parameters equal to 10^{-3}. Since we do not wish to impose any restriction on the model space, we use the prior probabilities γj ∼ Bernoulli(1/2) for j = 1, 2, 3, which results in a prior probability of 1/8 for each possible model. For the pilot run pseudoprior parameters we use the posterior values given by Spiegelhalter et al. (1996b).
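For instance, the GVS prior for the ridge case is coded in the Appendix (Example 2) essentially as follows, where mean[j] and se[j] hold the pilot run pseudoprior parameters:

phi ~ dgamma(1.0E-3, 1.0E-3);
for (j in 1:p) {
    bprior[j] <- (1-g[j])*mean[j];                  # pseudoprior mean when excluded
    tprior[j] <- g[j]*phi+(1-g[j])/(se[j]*se[j]);   # common 'ridge' precision phi when included
    beta[j] ~ dnorm(bprior[j], tprior[j]);
}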
Tables 3 and 4 provide the results for all eight distinct cases using pilot run pseudopriors. In all cases the air flow (z1) has a posterior probability of inclusion higher than 99%. The temperature (z2) also seems to be an important term, with a posterior probability of inclusion varying from 39% to 96%. The last term (z3), which measures the acid concentration in the air, has low posterior probabilities of inclusion: less than 5% for the simple independence models and less than 20% for the 'ridge' regression models.

Independence Regression
Models Normal D.Exp. Logistic t4
Constant 0.00 0.00 0.00 0.00
z1 14.12 58.48 41.19 56.46
z2 0.56 0.01 0.02 0.00
z1 + z2 81.25 38.64 55.25 40.46
z3 0.00 0.00 0.00 0.00
z1 + z3 0.63 1.75 1.35 1.82
z2 + z3 0.05 0.00 0.00 0.00
z1 + z2 + z3 3.39 1.11 2.18 1.26
Terms
γz1 = 1 99.30 99.98 99.97 100.00
γz2 = 1 84.90 39.76 57.45 41.72
γz3 = 1 4.30 2.86 3.53 3.08

Table 3: Stacks Dataset: Posterior Model Probabilities in Independence Regression (burn-in 10,000, samples of 10 × 10,000, with pilot run pseudopriors).

                Ridge Regression
Models Normal D.Exp. Logistic t4
Constant 0.00 0.00 0.00 0.00
z1 3.26 22.54 14.42 13.30
z2 0.05 0.00 0.00 0.00
z1 + z2 79.79 65.00 73.32 70.92
z3 0.00 0.00 0.00 0.00
z1 + z3 0.44 1.74 1.32 1.86
z2 + z3 0.00 0.00 0.00 0.00
z1 + z2 + z3 16.46 10.72 11.01 13.92
Terms
γz1 = 1 100.00 100.00 100.00 100.00
γz2 = 1 96.50 75.72 84.33 84.84
γz3 = 1 16.10 12.46 12.33 15.78

Table 4: Stacks Dataset: Posterior Model Probabilities in Ridge Regression (burn-in 10,000, samples of 10 × 10,000, with pilot run pseudopriors).

4.3 Example 3: Seeds Dataset, Logistic Regression with Random Effects

This example involves the examination of the proportion of seeds that germinated on each of 21 plates. For each plate we have recorded the seed type (bean or cucumber) and the type of root extract. This data set is analysed by Spiegelhalter et al. (1996b, page 10) using BUGS; for more details see the references therein. The model is a logistic regression with 2 categorical explanatory variables and random effects. The full model is

yilk ∼ Bin(nilk, pilk),  log( pilk / (1 − pilk) ) = m + ai + bl + abil + wk,

for i, l = 1, 2 and k = 1, . . . , 21; where yilk and nilk are the number of seeds germinated and the total number of seeds, respectively, for seed type i, root extract type l and plate k; and wk is the random effect for the kth plate.
We use sum-to-zero constraints for both fixed and random effects. Following Dellaportas and Forster (1999) we use prior variance Σ = 4 × 2 = 8 for the fixed effects. The prior for the precision of the random effects is a gamma distribution with both parameters equal to 10^{-3}. The pseudoprior parameters were taken from a pilot chain of the saturated model. Ten models are under consideration. The prior term probabilities for the fixed effects are assigned as in the two-way ANOVA example of Section 3.4, while for the random effects term indicator we have γw ∼ Bernoulli(0.5).

            Fixed Effects       Random Effects
Models       k=10     Pilot     k=10     Pilot
Constant      0.00     0.00      1.21     0.99
[A]           0.00     0.00      0.22     0.07
[B]          32.34    32.07     50.61    50.75
[A][B]        3.78     3.84      7.24     7.60
[AB]          2.80     2.83      1.80     1.85
Total        38.92    38.74     61.08    61.26

Table 5: Seeds Dataset: Posterior Model Probabilities Using BUGS (burn-in 10,000, samples of 10 × 10,000).

Table 5 provides the calculated posterior model probabilities. We used both pilot run proposals and the automatic pseudoprior with k = 10. Both chains gave the same results, as expected, and the type of root extract (B) is the only factor that influences the proportion of germinated seeds. The corresponding models with random and fixed effects have posterior probabilities equal to 51% and 32%, respectively. The marginal posterior probability of the random effects term is 61%, which is about 56% higher than the posterior probability of the fixed-effects-only models.

5 Appendix: BUGS Codes

BUGS code and all associated data files are freely available in electronic form at the Journal of Statistical Software web site www.jstatsoft.org/v07/i07/ or by electronic mail request.

5.1 Example 1
model log-linear;
#
# 3x2x4 LOG-LINEAR MODEL SELECTION WITH BUGS (GVS)
# (c) OCTOBER 1996
# (c) REVISED OCTOBER 1997
#
const
terms=8, # number of terms
N = 24; # number of Poisson cells
var
include, # conditional prior probability for gi
pmdl[9], # model indicator vector
mdl, # code of model
b[N], # model coefficients
mean[N], # proposal mean used in pseudoprior
se[N], # proposal standard deviation used in
# pseudoprior
bpriorm[N], # prior mean for b depending on g
Tau[N,N], # model coefficients precision
tprior[N,N],# prior value for Tau when all terms
# are included in model
x[N,N], # design matrix
z[N,N], # matrix with z_ij=x_ij b_j g_j, used in
# likelihood
n[N], # Poisson cells
pos[N], # position of each parameter
lambda[N], # Poisson mean for each cell
gtemp[N], # temporary term indicator vector
g[terms]; # term indicator vector
data pos,n in "ex2.dat", x in "ex2des.dat",
mean, se in "prop2.dat", tprior in "cov.dat";
inits in "ex2.in";
{
#
# associate g[i] with coefficients.
#
for (i in 1:N) {
gtemp[i]<-g[pos[i]];
}
#
# calculation of the z matrix used in likelihood
#
for (i in 1:N) {
for (j in 1:N) {
z[i,j]<-x[i,j]*b[j]*gtemp[j]
}
}
#
# model configuration
for (i in 1:N) {
log(lambda[i])<-sum(z[i,])
n[i]~dpois(lambda[i]);
}
# defining model code
# 0 for independence model [A][B][C], 1 for [AB][C],

# 2 for [AC][B], 3 for [AB][AC], 4 for [BC][A],
# 5 for [AB][BC], 6 for [AC][BC], 7 for [AB][AC][BC],
# 15 for [ABC].
#
mdl<-g[5]+2*g[6]+4*g[7]+8*g[8];
for (i in 0:7) {
pmdl[i+1]<-equals(mdl,i)
}
pmdl[9]<-equals(mdl,15)
#
# Prior for b model coefficient
# Mixture normal depending on current status of g[i]
#
for (i in 1:N) { for (j in 1:N) {
#
# GVS using se,mean from pilot run
# ********************************
#
Tau[i,j]<-0+tprior[i,j]*(gtemp[i]*gtemp[j])+
(1-gtemp[i]*gtemp[j])*equals(i,j)/(se[i]*se[i]);
#
# Automatic proposal using prior similar to SSVS
# with k=10
# ************************************************
# Tau[i,j]<-tprior[i,j]*pow(100,1-gtemp[i]*gtemp[j]);
#
# Kuo and Mallick proposal is independent of g[i]
# [tau[i]=1/2 and bpriorm[i]=0]
# ***********************************************
#
#
# Tau[i,j]<-tprior[i,j];
#
}
#
# GVS PRIOR M FROM PILOT RUN
# **************************
bpriorm[i]<-mean[i]*(1-gtemp[i]);
#
# PRIOR MEAN THAT DOES NOT DEPEND ON G.
# *************************************
# bpriorm[i]<-0.0;
}
b[]~dmnorm(bpriorm[],Tau[,]);
#
# defining prior information for gi to
# allow only hierarchical models with equal probability.
# We also ignore models nested to the model of
# independence [A][B][C] since we are interested in
# associations between factors.
g[8]~dbern(0.1111111);
include<-(1-g[8])*0.5+g[8]*1.0;
g[7]~dbern(include);
g[6]~dbern(include);
g[5]~dbern(include);
for (i in 1:4) {
g[i]~dbern(1.0);
}
}

5.2 Example 2

model stacks;
#
# LINEAR REGRESSION VARIABLE SELECTION WITH BUGS (GVS)
# BUGS EXAMPLE: STACKS, see BUGS examples vol.1
#
# (c) OCTOBER 1997
#
const
p = 3, # number of covariates
N = 21, # number of observations
models=8, # number of models under consideration, 2^3
PI = 3.141593;
var
x[N,p], # raw covariates
z[N,p] , # standardised covariates
Y[N],mu[N], # data and expectations
stres[N], # standardised residuals
outlier[N], # indicator if |stan res| > 2.5
beta0,beta[p], # standardised intercept, coefficients
b0,b[p], # unstandardised intercept, coefficients
phi, # prior precision of standardised coef.
tau,sigma,d, # precision, sd and d.f. of t distribution
g[p], # variable indicators
mdl, # Model index
pmdl[models], # Vector with model indicators
mean[p],se[p], # pseudoprior mean and se error
bprior[p], # Conditional to model Prior prior mean
tprior[p]; # Conditional to model Prior prior precision
data Y,x in "STACKS.DAT",
# files with proposed values
mean,se in "pnorm.dat"; # Normal distribution
#mean,se in "pdexp.dat"; # Double exponential distribution
#mean,se in "plogist.dat";# Logistic distribution
#mean,se in "pt4.dat"; # Student(4) distribution
inits in "STACKS.IN";
{
# Standardise x’s and coefficients
for (j in 1:p) {
b[j] <- beta[j]/sd(x[,j]) ;
for (i in 1:N) {
z[i,j] <- (x[i,j] - mean(x[,j]))/sd(x[,j]) ;
}
}
b0<-beta0-b[1]*mean(x[,1])-b[2]*mean(x[,2])-b[3]*mean(x[,3]);
# Model
d <- 4; # degrees of freedom for t
for (i in 1:N) {
#
# Normal Distribution
# -------------------
Y[i] ~ dnorm(mu[i],tau);
#
# Double Exponential Distribution
# -------------------------------
# Y[i] ~ ddexp(mu[i],tau);
#
# Logistic Distribution
# ----------------------
# Y[i] ~ dlogis(mu[i],tau);
#
# Student t4 Distribution
# -----------------------

# Y[i] ~ dt(mu[i],tau,d);
#
mu[i] <- beta0 + g[1]*beta[1]*z[i,1]+g[2]*beta[2]*z[i,2]
+ g[3]*beta[3]*z[i,3];
stres[i] <- (Y[i] - mu[i])/sigma;
#
# if standardised residual is greater than 2.5 then outlier
outlier[i]<-step(stres[i] -2.5) + step(-(stres[i]+2.5) );
}
#
# Defining Model Code
mdl<- 1+g[1]*1+g[2]*2+g[3]*4
#
# defining vector with model indicators
for (j in 1:models){
pmdl[j]<-equals(mdl,j);}
# Priors
beta0 ~ dnorm(0,.00001);
for (j in 1:p) {
#
# ******** GVS PRIORS FOR INDEPENDENCE REGRESSION ********
#
# GVS priors with proposals from pilot run
# bprior[j]<-(1-g[j])*mean[j];
# tprior[j] <-g[j]*0.001+(1-g[j])/(se[j]*se[j]);
#
# GVS priors with proposals a mixture of Normals(0,c^2t^2)
bprior[j]<-0.0;
tprior[j] <-pow(100,1-g[j])*0.001;
#
# ******** GVS PRIORS FOR RIDGE REGRESSION ********
#
# GVS priors with proposals from pilot run
# bprior[j]<-(1-g[j])*mean[j];
# tprior[j] <-g[j]*phi+(1-g[j])/(se[j]*se[j]);
#
# GVS priors with proposals a mixture of Normals(0,c^2t^2)
# bprior[j]<-0.0;
# tprior[j] <-pow(100,1-g[j])*phi;
beta[j] ~ dnorm(bprior[j],tprior[j]); # coefs independent
}
tau ~ dgamma(1.0E-3,1.0E-3);
#
# phi ~ dgamma(1.0E-3,1.0E-3);
#
# when we use pilot run based pseudopriors bugs was unable
# to select update method. Therefore we use an upper limit
# which makes bugs work with Metropolis instead of Gibbs
#
# phi ~ dgamma(1.0E-3,1.0E-3)I(0,10000);
# standard deviation of error distribution
sigma <- sqrt(1/tau); # normal errors
# sigma <- sqrt(2)/tau; # double exponential errors
# sigma <- sqrt(pow(PI,2)/3)/tau ; # logistic errors
# sigma <- sqrt(d/(tau*(d-2))); # errors of t with d d.f.
#
#
# Priors for variable indicators
for (j in 1:p) { g[j]~ dbern(0.5);}
}

5.3 Example 3
model seedszrogvs;
#
# LOGISTIC REGRESSION VARIABLE AND
# RANDOM EFFECTS SELECTION WITH BUGS (GVS)
#
# BUGS EXAMPLE: SEEDS, see BUGS examples vol.1
#
# (c) OCTOBER 1997
#
const
terms=4, # Number of terms under consideration
models=16,# number of models under consideration 2^4
N = 21; # number of samples
var
alpha0, alpha1, alpha2, alpha12, # model coefficients
tau, sigma, # precision and sd of random effects (tau=1/sigma^2)
x1[N], x2[N], # Design Column for factor a1 and a2
# here we used the STZ constraints
p[N], # Success probability for Binomial
n[N], # Total number of trials for Binomial
r[N], # Binomial data
b[N], # Random effects (standardised)
c[N], # Random effects c[i] (unstandardised)
include, # conditional model probability for
# main effects
g[terms], # terms indicator vector
mdl, # model index
pmdl[models], # model indicator
mean[terms-1], # proposal mean
se[terms-1], # proposal se
bprior[terms-1],# prior mean for model coefficients
tprior[terms-1];# prior precision for model coefficients
data r,n,x1,x2 in "seeds.dat", mean,se in "prop6.dat";
inits in "seeds.in";
{
alpha0 ~ dnorm(0.0,1.0E-6); # intercept
for (j in 1:(terms-1)) {
# ******** GVS PRIORS ***********
#
# GVS priors with proposals from pilot run
bprior[j]<-(1-g[j])*mean[j];
tprior[j] <-g[j]/8+(1-g[j])/(se[j]*se[j]);
#
# GVS priors with proposals a mixture of Normals(0,c^2t^2)
# bprior[j]<-0.0;
# tprior[j] <-pow(100,1-g[j])/8;
}
#
#
alpha1 ~ dnorm(bprior[1],tprior[1]); # seed coeff
alpha2 ~ dnorm(bprior[2],tprior[2]); # extract coeff
alpha12 ~ dnorm(bprior[3],tprior[3]);
tau ~ dgamma(1.0E-3,1.0E-3); # 1/sigma^2
for (i in 1:N) {
c[i] ~ dnorm(0.0,tau);
b[i] <- c[i] - mean(c[]); # make sure b’s add to zero
logit(p[i]) <-alpha0+g[1]*alpha1*x1[i]+g[2]*alpha2*x2[i]
+g[3]*alpha12*x1[i]*x2[i]+g[4]*b[i];
r[i] ~ dbin(p[i],n[i]);
}

sigma <- 1.0/sqrt(tau);
#
# Defining Model Code
mdl<- 1+g[1]*1+g[2]*2+g[3]*4+g[4]*8
#
# defining vector with model indicators
for (j in 1:models){
pmdl[j]<-equals(mdl,j);}
# Priors for variable indicators
g[4]~ dbern(0.50);
g[3]~ dbern(0.20);
include<-g[3]+(1-g[3])*0.5
g[2]~ dbern(include);
g[1]~ dbern(include);
}

References
Carlin, B.P. and Chib, S. (1995). 'Bayesian Model Choice via Markov Chain Monte Carlo Methods', Journal of the Royal Statistical Society B, 57, 473–484.
Chipman, H. (1996). ‘Bayesian Variable Selection with Related Predictors’, Cana-
dian Journal of Statistics, 24, 17–36.
Clyde, M., DeSimone, H. and Parmigiani, G. (1996). ‘Prediction via Orthogonalized
Model Mixing’, Journal of the American Statistical Association, 91, 1197–1208.
Dellaportas, P. and Forster, J.J. (1999). ‘Markov Chain Monte Carlo Model Deter-
mination for Hierarchical and Graphical Log-linear Models’, Biometrika, 86,
615–633.
Dellaportas, P., Forster, J.J. and Ntzoufras, I. (2000). ‘Bayesian Variable Selection
Using the Gibbs Sampler’, Generalized Linear Models: A Bayesian Perspective
(D. K. Dey, S. Ghosh, and B. Mallick, eds.). New York: Marcel Dekker, 271–
286.
Dellaportas, P., Forster, J.J. and Ntzoufras, I. (2002). ‘On Bayesian Model and
Variable Selection Using MCMC’, Statistics and Computing, 12, 27–36.
George, E.I. and McCulloch, R.E. (1993). ‘Variable Selection via Gibbs Sampling’,
Journal of the American Statistical Association, 88, 881–889.
Green, P. (1995). ‘Reversible Jump Markov Chain Monte Carlo Computation and
Bayesian Model Determination’, Biometrika, 82, 711–732.
Kuo, L. and Mallick, B. (1998). ‘Variable Selection for Regression Models’,
Sankhya B, 60, 65–81.
Knuiman, M.W. and Speed, T.P. (1988). ‘Incorporating Prior Information Into the
Analysis of Contingency Tables’, Biometrics, 44, 1061–1071.
Lindley, D.V. and Smith, A.F.M. (1972). ‘Bayes Estimates for the Linear Model’
(with discussion). Journal of the Royal Statistical Society B, 34, 1–41.
Ntzoufras, I. (1999). ‘Aspects of Bayesian Model and Variable Selection Using
MCMC’, Unpublished Ph.D. Thesis, Department of Statistics, Athens Uni-
versity of Economics and Business, Athens, Greece.
Raftery, A.E. (1996). ‘Approximate Bayes Factors and Accounting for Model Un-
certainty in Generalized Linear Models’, Biometrika, 83, 251–266.
Spiegelhalter, D., Thomas, A., Best, N. and Gilks, W. (1996a). BUGS 0.5: Bayesian Inference Using Gibbs Sampling Manual, MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK. Available from www.mrc-bsu.cam.ac.uk/bugs/documentation/bugs05/manual05.html.
Spiegelhalter, D., Thomas, A., Best, N. and Gilks, W. (1996b). BUGS 0.5: Examples Volume 1, MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK. Available from www.mrc-bsu.cam.ac.uk/bugs/documentation/exampVol1/bugs.html.
Spiegelhalter, D., Thomas, A., Best, N. and Gilks, W. (1996c). BUGS 0.5: Examples Volume 2, MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK. Available from www.mrc-bsu.cam.ac.uk/bugs/documentation/exampVol2/vol 2.html.
