bayesmh — Bayesian models using Metropolis–Hastings algorithm+
+This command includes features that are part of StataNow.
Description
bayesmh fits a variety of Bayesian models using an adaptive Metropolis–Hastings (MH) algorithm.
It provides various likelihood models and prior distributions for you to choose from. Likelihood models
include univariate normal linear and nonlinear regressions, multivariate normal linear and nonlinear
regressions, generalized linear models such as logit and Poisson regressions, multiple-equations linear
and nonlinear models, multilevel models, and more. Prior distributions include continuous distributions
such as uniform, Jeffreys, normal, gamma, multivariate normal, and Wishart and discrete distributions
such as Bernoulli and Poisson. You can also program your own Bayesian models; see [BAYES] bayesmh
evaluators.
Also see [BAYES] Bayesian estimation for a list of Bayesian regression models that can be fit
more conveniently with the bayes prefix ([BAYES] bayes).
Quick start
Bayesian normal linear regression of y1 on x1 with flat priors for coefficient on x1 and the intercept
and with a Jeffreys prior on the variance parameter {var}
bayesmh y1 x1, likelihood(normal({var})) ///
prior({y1: x1 _cons}, flat) prior({var}, jeffreys)
Add binary variable a using factor-variable notation
bayesmh y1 x1 i.a, likelihood(normal({var})) ///
prior({y1: x1 i.a _cons}, flat) prior({var}, jeffreys)
Same as above
bayesmh y1 x1 i.a, likelihood(normal({var})) ///
prior({y1:}, flat) prior({var}, jeffreys)
Specify a different prior for a = 1
bayesmh y1 x1 i.a, likelihood(normal({var})) ///
prior({y1:x1 _cons}, flat) prior({y1: 1.a}, normal(0,100)) ///
prior({var}, jeffreys)
Specify a starting value of 1 for parameter {var}
bayesmh y1 x1 i.a, likelihood(normal({var})) ///
prior({y1:}, flat) prior({var}, jeffreys) initial({var} 1)
Same as above
bayesmh y1 x1 i.a, likelihood(normal({var=1})) ///
prior({y1:}, flat) prior({var}, jeffreys)
A normal prior with µ = 2 and σ² = 0.5 for the coefficient on x1, a normal prior with µ = −40 and
σ² = 100 for the intercept, and an inverse-gamma prior with shape parameter of 0.1 and scale
parameter of 1 for {var}
bayesmh y1 x1, likelihood(normal({var})) ///
prior({y1:x1}, normal(2,.5)) ///
prior({y1:_cons}, normal(-40,100)) ///
prior({var}, igamma(0.1,1))
Place {var} into a separate block
bayesmh y1 x1, likelihood(normal({var})) ///
prior({y1:x1}, normal(2,.5)) ///
prior({y1:_cons}, normal(-40,100)) ///
prior({var}, igamma(0.1,1)) block({var})
Same as above, but simulate four chains
bayesmh y1 x1, likelihood(normal({var})) ///
prior({y1:x1}, normal(2,.5)) ///
prior({y1:_cons}, normal(-40,100)) ///
prior({var}, igamma(0.1,1)) block({var}) ///
nchains(4)
Zellner’s g prior to allow {y1:x1} and {y1:_cons} to be correlated, specifying 2 dimensions,
df = 30, µ = 2 for {y1:x1}, µ = −40 for {y1:_cons}, and variance parameter {var}
bayesmh y1 x1, likelihood(normal({var})) ///
prior({var}, igamma(0.1,1)) ///
prior({y1:}, zellnersg(2,30,2,-40,{var}))
Model for dichotomous dependent variable y2 regressed on x1 with a logit likelihood
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100))
Same as above, and save model results to simdata.dta, and store estimates in memory as m1
bayesmh y2 x1, likelihood(logit) prior({y2:}, ///
normal(0,100)) saving(simdata.dta)
estimates store m1
Same as above, but save the results on replay
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100))
bayesmh, saving(simdata.dta)
estimates store m1
Show model summary without performing estimation
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100)) dryrun
Fit model without showing model summary
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100)) ///
nomodelsummary
Same as above, and specify the random-number seed for reproducibility
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100)) ///
rseed(1234)
Same as above (set seed method useful only for a single chain)
set seed 1234
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100))
Specify 20,000 MCMC samples, and set length of the burn-in period to 5,000
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100)) ///
mcmcsize(20000) burnin(5000)
Specify that only observations 1 + 5k, for k = 0, 1, ..., be saved to the MCMC sample
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100)) ///
thinning(5)
Set the maximum number of adaptive iterations of the MCMC procedure to 30, and specify that
adaptation of the MCMC procedure be attempted every 25 iterations
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100)) ///
adaptation(maxiter(30) every(25))
Request that a dot be displayed every 100 simulations
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100)) ///
dots(100)
Also request that an iteration number be displayed every 1,000 iterations
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100)) ///
dots(100, every(1000))
Same as above
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100)) ///
dots
Request that the 90% equal-tailed credible interval be displayed
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100)) ///
clevel(90)
Request that the default 95% highest posterior density credible interval be displayed
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100)) hpd
Use the batch-means estimator of MCSE with the length of the block of 5
bayesmh y2 x1, likelihood(logit) prior({y2:}, normal(0,100)) ///
batch(5)
Multivariate normal regression of y1 and y3 on x1 and x2, using normal priors with µ = 0 and
σ² = 100 for the regression coefficients and intercepts, an inverse-Wishart prior for the covariance
matrix parameter {S, matrix} of dimension 2, df = 100, and an identity scale matrix
bayesmh y1 y3 = x1 x2, likelihood(mvnormal({S, matrix})) ///
prior({y1:} {y3:}, normal(0,100)) ///
prior({S, matrix}, iwishart(2,100,I(2)))
Same as above, but use abbreviated declaration for the covariance matrix
bayesmh y1 y3 = x1 x2, likelihood(mvnormal({S,m})) ///
prior({y1:} {y3:}, normal(0,100)) ///
prior({S,m}, iwishart(2,100,I(2)))
Same as above, and specify starting values for matrix {S,m} using previously defined matrix W
bayesmh y1 y3 = x1 x2, likelihood(mvnormal({S,m})) ///
prior({y1:} {y3:}, normal(0,100)) ///
prior({S,m}, iwishart(2,100,I(2))) initial({S,m} W)
Menu
Statistics > Bayesian analysis > General estimation and regression
Syntax
Linear models
Univariate linear regression
bayesmh depvar [indepvarspec] [if] [in] [weight],
likelihood(modelspec) prior(priorspec) [options]
Multivariate normal linear regression with common regressors
bayesmh depvars = [indepvarspec] [if] [in] [weight],
likelihood(mvnormal(...)) prior(priorspec) [options]
Multivariate normal regression with outcome-specific regressors
bayesmh ([eqname1:]depvar1 [indepvarspec1])
([eqname2:]depvar2 [indepvarspec2]) [...] [if] [in] [weight],
likelihood(mvnormal(...)) prior(priorspec) [options]
Nonlinear models
Univariate nonlinear regression
bayesmh nleqspec [if] [in] [weight],
likelihood(modelspec) prior(priorspec) [options]
Multivariate normal nonlinear regression
bayesmh (nleqspec1) (nleqspec2) [...] [if] [in] [weight],
likelihood(mvnormal(...)) prior(priorspec) [options]
Multilevel models
Any model can be fit as a multilevel model by including at least one random-effects term respec,
such as random intercepts U[id] at the level variable id, in indepvarspec, indepvarspec#, nlspec, or
nlspec#; see Random effects.
Multiple-equation models
bayesmh (eqspec) [(eqspec)] [...] [if] [in] [weight], prior(priorspec) [options]
Probability distributions
Univariate distributions
bayesmh depvar [if] [in] [weight],
likelihood(distribution) prior(priorspec) [options]
Multiple-equation distribution specifications
bayesmh (deqspec) [(deqspec) ...] [if] [in] [weight],
prior(priorspec) [options]
respec includes an optional list of independent variables indepvars and at least one random-effects
term, such as random intercepts U[id] at the level variable id. For instance, respec can be x1 x2
U[id]; see Random effects.
The syntax of nleqspec is depvar = (subexprspec), where subexprspec is either subexpr or resubexpr.
resubexpr is a substitutable expression that contains model parameters and random effects specified
in braces, {}, as in exp({b}+{U[id]}); see Random effects for details.
The syntax of eqspec is one of the following:
for linear models
varspec [if] [in] [weight], likelihood(modelspec) [noconstant]
for nonlinear models
nlspec [if] [in] [weight], likelihood(modelspec)
The syntax of varspec is one of the following:
for single outcome
[eqname:]depvar [indepvarspec]
for multiple outcomes with common regressors
depvars = [indepvarspec]
for multiple outcomes with outcome-specific regressors
([eqname1:]depvar1 [indepvarspec1])
([eqname2:]depvar2 [indepvarspec2]) [...]
The syntax of nlspec is nleqspec for a single outcome or (nleqspec1) (nleqspec2) . . .
for multiple outcomes.
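As a minimal sketch of the nleqspec form depvar = (subexprspec), the following simulates a small dataset and fits an exponential mean function with a normal likelihood. The variable names y and x, the parameter names {a}, {b}, and {var}, the simulated values, and the prior settings are all illustrative assumptions rather than part of this entry.

```stata
* simulate hypothetical data: y = 2*exp(0.5*x) + noise
clear
set seed 12345
set obs 200
generate x = runiform()
generate y = 2*exp(0.5*x) + rnormal(0, 0.5)

* nonlinear specification: parameters appear in braces inside the expression
bayesmh y = ({a}*exp({b}*x)), likelihood(normal({var})) ///
    prior({a} {b}, normal(0,100))                       ///
    prior({var}, igamma(0.1,1)) rseed(12345)
```

Adding dryrun to the command displays the model summary without running the simulation, which is a convenient way to check how the braces were interpreted.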
model Description
Model
normal(var) normal regression with variance var
t(sigma2, df ) t regression with squared scale sigma2 and degrees of freedom df
lognormal(var) lognormal regression with variance var
lnormal(var) synonym for lognormal()
exponential exponential regression
asymlaplaceq(sigma, tau)+ asymmetric Laplace (quantile) regression with scale sigma and
quantile tau
mvnormal(Sigma) multivariate normal regression with covariance matrix Sigma
probit probit regression
logit logistic regression
logistic logistic regression; synonym for logit
binomial(n) binomial regression with logit link and number of trials n
binlogit(n) synonym for binomial()
oprobit ordered probit regression
ologit ordered logistic regression
poisson Poisson regression
stexponential exponential survival regression
stgamma(lns) gamma survival regression with log-scale parameter lns
stloglogistic(lns) loglogistic survival regression with log-scale parameter lns
stlognormal(lnstd) lognormal survival regression with log-standard-deviation
parameter lnstd
stweibull(lnp) Weibull survival regression with log-shape parameter lnp
llf(subexpr) substitutable expression for observation-level log-likelihood
function
+These features are part of StataNow.
A distribution argument is a number for scalar arguments such as var; a variable name, varname (except for matrix
arguments); a matrix for matrix arguments such as Sigma; a model parameter, paramspec; an expression, expr;
or a substitutable expression, subexpr or resubexpr. See Specifying arguments of likelihood models and prior
distributions. For survival models, stmodel, a distribution argument can be only a scalar argument.
modelopts Description
Model
offset(varnameo ) include varnameo in model with coefficient constrained to 1;
not allowed with normal() and mvnormal()
exposure(varnamee ) include ln(varnamee ) in model with coefficient constrained to 1;
allowed only with poisson
survivalopts options for survival models
survivalopts are allowed only with survival models stexponential, stgamma(), stloglogistic(), stlognormal(),
and stweibull().
survivalopts Description
Model
[no]logparam fit survival model using a scale, variance, or shape parameter
in a log (the default) or original metric
ph proportional hazards parameterization; default with survival
models stexponential and stweibull()
aft accelerated failure-time parameterization; default with survival
models other than stexponential and stweibull()
time synonym for aft
failure(varname) indicator for failure event
ltruncated(varname | #) lower limit for left-truncation
distribution Description
Model
dexponential(beta) exponential distribution with scale parameter beta
dbernoulli(p) Bernoulli distribution with success probability p
dbinomial(p,n) binomial distribution with success probability p and
number of trials n
dpoisson(mu) Poisson distribution with mean mu
A distribution argument is a model parameter, paramspec, or a substitutable expression, subexpr or resubexpr, containing
model parameters. An n argument may be a number; an expression, expr; or a variable name, varname. See
Specifying arguments of likelihood models and prior distributions.
priordist Description
Model
normal(mu,var) normal with mean mu and variance var
t(mu,sigma2,df) location–scale t with mean mu, squared scale sigma2, and
degrees of freedom df
lognormal(mu,var) lognormal with mean mu and variance var
lnormal(mu,var) synonym for lognormal()
uniform(a,b) uniform on (a, b)
gamma(alpha,beta) gamma with shape alpha and scale beta
igamma(alpha,beta) inverse gamma with shape alpha and scale beta
exponential(beta) exponential with scale beta
beta(a,b) beta with shape parameters a and b
laplace(mu,beta) Laplace with mean mu and scale beta
cauchy(loc,beta) Cauchy with location loc and scale beta
halfcauchy(loc,beta)+ half-Cauchy with location loc and scale beta
chi2(df) central χ² with degrees of freedom df
rayleigh(beta)+ Rayleigh distribution with scale beta
pareto(alpha,beta) Pareto with shape alpha and scale beta
jeffreys Jeffreys prior for variance of a normal distribution
mvnormal(d,mean,Sigma) multivariate normal of dimension d with mean vector mean and
covariance matrix Sigma; mean can be a matrix name or a list
of d means separated by comma: mu1 , mu2 , . . ., mud
mvnormal0(d,Sigma) multivariate normal of dimension d with zero mean vector and
covariance matrix Sigma
mvn0(d,Sigma) synonym for mvnormal0()
mvnexchangeable(d,mean,var,rho)
multivariate normal of dimension d with means mean and
exchangeable covariance matrix with diagonal var and
off-diagonal var×rho
mvn0exchangeable(d,var,rho) as mvnexchangeable() but with zero mean vector
mvnindependent(d,mean,vars) multivariate normal of dimension d with means mean and
diagonal covariance matrix; vars can be a Stata vector of
dimension d with fixed variances or a list of d variances
(parameters or fixed values) separated by comma:
var1 , var2 , . . ., vard
mvn0independent(d,vars) as mvnindependent() but with zero mean vector
mvnidentity(d,mean,var) multivariate normal of dimension d with means mean and
identity covariance matrix with equal variances var
mvn0identity(d,var) as mvnidentity() but with zero mean vector
mvnscaled(d,mean,A,{var}) multivariate normal of dimension d with mean vector mean and
covariance matrix ({var}A); mean can be a matrix name or a list
of d means separated by a comma: mu1 , mu2 , . . ., mud ;
A is a positive-definite scale matrix; {var} is a variance
parameter
mvn0scaled(d,A,{var}) as mvnscaled() but with zero mean vector
options Description
Model
noconstant suppress constant term; not allowed with ordered models,
nonlinear models, and probability distributions
* likelihood(lspec) distribution for the likelihood model
* prior(priorspec) prior for model parameters; this option may be repeated
dryrun show model summary without estimation
Model 2
define(label:resubexpr) defines a function of model parameters; this option may be repeated
Simulation
nchains(#) number of chains; default is to simulate one chain
mcmcsize(#) MCMC sample size; default is mcmcsize(10000)
burnin(#) burn-in period; default is burnin(2500)
thinning(#) thinning interval; default is thinning(1)
rseed(#) random-number seed
exclude(paramref ) specify model parameters to be excluded from the simulation results
Blocking
block(paramref[, blockopts]) specify a block of model parameters; this option may be repeated
blocksummary display block summary
Initialization
initial(initspec) specify initial values for model parameters with a single chain
init#(initspec) specify initial values for #th chain; requires nchains()
initall(initspec) specify initial values for all chains; requires nchains()
nomleinitial suppress the use of maximum likelihood estimates as starting values
initrandom specify random initial values
initsummary display initial values used for simulation
Adaptation
adaptation(adaptopts) control the adaptive MCMC procedure
scale(#) initial multiplier for scale factor; default is scale(2.38)
covariance(cov) initial proposal covariance; default is the identity matrix
Reporting
clevel(#) set credible interval level; default is clevel(95)
hpd display HPD credible intervals instead of the default equal-tailed
credible intervals
eform[(string)] report exponentiated coefficients and, optionally, label as string
remargl compute log marginal-likelihood for multilevel models
batch(#) specify length of block for batch-means calculations;
default is batch(0)
saving(filename[, replace]) save simulation results to filename.dta
nomodelsummary suppress model summary
noexpression suppress output of expressions from model summary
chainsdetail display detailed simulation summary for each chain
[no]dots suppress dots or display dots every 100 iterations and iteration
numbers every 1,000 iterations; default is nodots
dots(#[, every(#)]) display dots as simulation is performed
[no]show(paramref) specify model parameters to be excluded from or included in
the output
showreffects[(reref)] specify that all or a subset of random-effects parameters be included
in the output
notable suppress estimation table
noheader suppress output header
title(string) display string as title above the table of parameter estimates
display_options control spacing, line width, and base and empty cells
Advanced
search(search_options) control the search for feasible initial values
corrlag(#) specify maximum autocorrelation lag; default varies
corrtol(#) specify autocorrelation tolerance; default is corrtol(0.01)
* Options likelihood() and prior() are required. prior() must be specified for all model parameters.
Options prior() and block() may be repeated.
indepvars and paramref may contain factor variables; see [U] 11.4.3 Factor variables.
indepvars and paramref may contain time-series operators; see [U] 11.4.4 Time-series varlists.
With multiple-equations specifications, a local if specified within an equation is applied together with the global if
specified with the command.
collect is allowed; see [U] 11.1.10 Prefix commands.
Only fweights are allowed; see [U] 11.1.6 weight.
With multiple-equations specifications, local weights (weights specified within an equation) override global weights
(weights specified with the command).
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
blockopts Description
gibbs requests Gibbs sampling; available for selected models only and
not allowed with scale(), covariance(), or adaptation()
split requests that all parameters in a block be treated as separate blocks
reffects requests that all parameters in a block be treated as random-effects
parameters
scale(#) initial multiplier for scale factor for current block; default is
scale(2.38); not allowed with gibbs
covariance(cov) initial proposal covariance for the current block; default is the
identity matrix; not allowed with gibbs
adaptation(adaptopts) control the adaptive MCMC procedure of the current block;
not allowed with gibbs
adaptopts Description
every(#) adaptation interval; default is every(100)
maxiter(#) maximum number of adaptation loops; default is maxiter(25) or
max{25, floor(burnin()/every())} whenever default values
of these options are modified
miniter(#) minimum number of adaptation loops; default is miniter(5)
alpha(#) parameter controlling acceptance rate (AR); default is alpha(0.75)
beta(#) parameter controlling proposal covariance; default is beta(0.8)
gamma(#) parameter controlling adaptation rate; default is gamma(0)
* tarate(#) target acceptance rate (TAR); default is parameter specific
* tolerance(#) tolerance for AR; default is tolerance(0.01)
* Only starred options may be specified in the adaptation() option specified within block().
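To make the maxiter() default concrete, the arithmetic below evaluates max{25, floor(burnin()/every())} for a few settings. The particular burnin() and every() values are assumptions chosen only for illustration.

```stata
* burnin(5000) with the default every(100): floor(5000/100) = 50 loops
display max(25, floor(5000/100))

* burnin(2500) with every(50): floor(2500/50) = 50 loops
display max(25, floor(2500/50))

* burnin(1000) with every(100): floor(1000/100) = 10, so the minimum of 25 applies
display max(25, floor(1000/100))
```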
Options
Model
noconstant suppresses the constant term (intercept) from the regression model. By default, bayesmh
automatically includes a model parameter {depname:_cons} in all regression models except ordered
and nonlinear models. Excluding the constant term may be desirable when there is a factor variable,
the base level of which absorbs the constant term in the linear combination.
likelihood(lspec) specifies the distribution of the data. This option specifies the likelihood portion
of the Bayesian model. This option is required. lspec is one of modelspec or distribution.
modelspec specifies one of the supported likelihood distributions for regression models. A location
parameter of these distributions is automatically parameterized as a linear combination of the
specified independent variables and need not be specified. Other parameters may be specified as
arguments to the distribution separated by commas. Each argument may be a real number (#), a
variable name (except for matrix parameters), a predefined matrix, a model parameter specified in
{}, a Stata expression, or a substitutable expression containing model parameters and, optionally,
random effects; see Declaring model parameters and Specifying arguments of likelihood models
and prior distributions. For survival models, a distribution argument may be only a real number
or a model parameter. For the parameterization of the asymlaplaceq() likelihood, see Methods
and formulas of [BAYES] bayes: qreg.
distribution specifies one of the supported distributions for modeling the dependent variable. A
distribution argument must be a model parameter specified in {} or a substitutable expression
containing model parameters and, optionally, random effects; see Declaring model parameters and
Specifying arguments of likelihood models and prior distributions. A number of trials, n, of the
binomial distribution may be a real number (#), a Stata expression, or a variable name. For an
example of modeling outcome distributions directly, see Beta-binomial model.
For some regression models, option likelihood() provides suboptions subopts in
likelihood(. . . , subopts). subopts are offset(), exposure(), and, for survival models,
survivalopts.
offset(varnameo ) specifies that varnameo be included in the regression model with the coefficient
constrained to be 1. This option is available with probit, logit, binomial(), binlogit(),
oprobit, ologit, and poisson.
exposure(varnamee ) specifies a variable that reflects the amount of exposure over which the
depvar events were observed for each observation; ln(varnamee ) with coefficient constrained
to be 1 is entered into the log-link function. This option is available with poisson.
survivalopts are logparam, nologparam, ph, aft, time (synonym for aft), failure(varname),
and ltruncated(varname | #).
logparam and nologparam specify the estimation metric for the auxiliary model parameter.
logparam specifies that the survival model be fit using the log of the parameter controlling the
shape of the distribution—scale for stgamma() and stloglogistic(), standard deviation
for stlognormal(), and shape for stweibull(). This is the default. nologparam specifies
that the model be fit using the parameter in the original metric. Which metric to use may
depend on the desired prior distribution for the auxiliary parameter.
ph, aft, failure(), ltruncated(); see survival options in [SEM] gsem family-and-link
options.
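The likelihood(..., subopts) form can be sketched as follows, with one Poisson model using exposure() and one Weibull survival model using failure(). The data are simulated, and the variable names y, x1, n, t, and died, the parameter name {lnp}, and the prior settings are hypothetical choices made for this illustration.

```stata
clear
set seed 12345
set obs 200
generate x1 = rnormal()

* Poisson regression with an exposure variable (hypothetical variable n)
generate n = 1 + floor(10*runiform())
generate y = rpoisson(n*exp(0.2 + 0.5*x1))
bayesmh y x1, likelihood(poisson, exposure(n)) ///
    prior({y:}, normal(0,100)) rseed(12345)

* Weibull survival regression with a failure indicator (hypothetical died);
* {lnp} is the log-shape parameter because logparam is the default
generate t = -ln(runiform())*exp(-0.5*x1)
generate died = runiform() < 0.8
bayesmh t x1, likelihood(stweibull({lnp}), failure(died)) ///
    prior({t:}, normal(0,100)) prior({lnp}, normal(0,10)) rseed(12345)
```

Because the auxiliary parameter is fit in the log metric by default, a prior with support on the whole real line, such as the normal prior assumed here, is a natural choice for {lnp}.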
prior(priorspec) specifies a prior distribution for model parameters. This option is required and
may be repeated. A prior must be specified for each model parameter. Model parameters may
be scalars or matrices, but both types may not be combined in one prior statement. If multiple
scalar parameters are assigned a single univariate prior, they are considered independent, and the
specified prior is used for each parameter. You may assign a multivariate prior of dimension d to d
scalar parameters. Also see Referring to model parameters and Specifying arguments of likelihood
models and prior distributions.
All likelihood() and prior() combinations are allowed, but they are not guaranteed to correspond
to proper posterior distributions. You need to think carefully about the model you are building and
evaluate its convergence thoroughly; see Convergence of MCMC.
dryrun shows the summary of the model that would be fit without actually fitting the
model. This option is recommended for checking specifications of the model before fitting the
model. The model summary reports the information about the likelihood model and about priors
for all model parameters.
Model 2
define(name:resubexpr) is for use with nonlinear models. It defines a function of model parameters,
resubexpr, and labels it as name. This option can be repeated to define multiple functions. The
define() option is useful for expressions that appear multiple times in the main nonlinear
specification: you define the expression once and then simply refer to it by using {name:} in
the nonlinear specification. This option can also be used for notational convenience. See Random
effects for how to specify resubexpr.
Simulation
nchains(#) specifies the number of Markov chains to simulate. If you specify this option, # must be
at least 2; by default, only one chain is produced. Simulating multiple chains is useful for convergence
diagnostics and to improve precision of parameter estimates. Four chains are often recommended
in the literature, but you can specify more or fewer depending on your objective. The reported
estimation results are based on all chains. You can use bayesstats summary with option
sepchains to see the results for each chain. The reported acceptance rate, efficiencies, and log
marginal-likelihood are averaged over all chains. You can use option chainsdetail to see these
simulation summaries for each chain. Also see Convergence diagnostics using multiple chains and
Gelman–Rubin convergence diagnostic in [BAYES] bayesstats grubin.
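A minimal sketch of a multiple-chain workflow is shown below, using the auto dataset shipped with Stata. The choice of model, of three chains, and of the seed are assumptions made only for illustration.

```stata
sysuse auto, clear

* simulate three chains; rseed() makes the run reproducible
bayesmh foreign mpg, likelihood(logit) prior({foreign:}, normal(0,100)) ///
    nchains(3) rseed(1234)

* posterior summaries reported separately for each chain
bayesstats summary, sepchains

* Gelman-Rubin convergence diagnostic across the chains
bayesstats grubin
```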
mcmcsize(#) specifies the target MCMC sample size. The default MCMC sample size is
mcmcsize(10000). The total number of iterations for the MH algorithm equals the sum of the burn-in
iterations and the MCMC sample size in the absence of thinning. If thinning is present, the total
number of MCMC iterations is computed as burnin() + (mcmcsize() − 1) × thinning() + 1.
Computation time of the MH algorithm is proportional to the total number of iterations. The MCMC
sample size determines the precision of posterior summaries, which may be different for different
model parameters and will depend on the efficiency of the Markov chain. With multiple chains,
mcmcsize() applies to each chain. Also see Burn-in period and MCMC sample size.
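The total-iteration formula burnin() + (mcmcsize() − 1) × thinning() + 1 can be checked with simple arithmetic. The second set of option values below is an assumption chosen for illustration.

```stata
* defaults burnin(2500), mcmcsize(10000), thinning(1): 12,500 iterations
display 2500 + (10000 - 1)*1 + 1

* burnin(5000), mcmcsize(20000), thinning(5): 104,996 iterations
display 5000 + (20000 - 1)*5 + 1
```

With thinning(1), the formula reduces to burnin() + mcmcsize(), which matches the no-thinning case described above.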
burnin(#) specifies the number of iterations for the burn-in period of MCMC. The values of parameters
simulated during burn-in are used for adaptation purposes only and are not used for estimation.
The default is burnin(2500). Typically, burn-in is chosen to be as long as or longer than the
adaptation period. With multiple chains, burnin() applies to each chain. Also see Burn-in period
and MCMC sample size and Convergence of MCMC.
thinning(#) specifies the thinning interval. Only simulated values from every (1 + k × # )th iteration
for k = 0, 1, 2, . . . are saved in the final MCMC sample; all other simulated values are discarded.
The default is thinning(1); that is, all simulation values are saved. Thinning greater than one
is typically used for decreasing the autocorrelation of the simulated MCMC sample. With multiple
chains, thinning() applies to each chain.
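To see which iterations the (1 + k × #) rule retains, the loop below lists the first five saved iteration numbers for thinning(5). The thinning interval and the number of values listed are illustrative assumptions.

```stata
* thinning(5): the first five saved iterations are 1, 6, 11, 16, and 21
forvalues k = 0/4 {
    display 1 + `k'*5
}
```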
rseed(#) sets the random-number seed. This option can be used to reproduce results. With one
chain, rseed(#) is equivalent to typing set seed # prior to calling bayesmh; see [R] set seed.
With multiple chains, you should use rseed() for reproducibility; see Reproducing results.
exclude(paramref ) specifies which model parameters should be excluded from the final MCMC
sample. These model parameters will not appear in the estimation table, and postestimation
features for these parameters and log marginal-likelihood will not be available. This option is
useful for suppressing nuisance model parameters. For example, if you have a factor predictor
variable with many levels but you are only interested in the variability of the coefficients associated
with its levels, not their actual values, then you may wish to exclude this factor variable from the
simulation results. If you simply want to omit some model parameters from the output, see the
noshow() option. paramref can include individual random-effects parameters.
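The difference between exclude() and noshow() can be sketched with the auto dataset shipped with Stata. The model, the priors, and the choice of rep78 as the nuisance factor variable are assumptions made for illustration.

```stata
sysuse auto, clear

* exclude(): rep78 coefficients are simulated but dropped from the saved MCMC sample
bayesmh mpg weight i.rep78, likelihood(normal({var}))        ///
    prior({mpg:}, normal(0,100)) prior({var}, igamma(0.1,1)) ///
    exclude({mpg:i.rep78}) rseed(1234)

* noshow(): rep78 coefficients stay in the sample but are hidden in the output
bayesmh mpg weight i.rep78, likelihood(normal({var}))        ///
    prior({mpg:}, normal(0,100)) prior({var}, igamma(0.1,1)) ///
    noshow({mpg:i.rep78}) rseed(1234)
```

After the second command, postestimation features remain available for the hidden coefficients, whereas after the first they do not.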
Blocking
block(paramref[, blockopts]) specifies a group of model parameters for the blocked MH algorithm.
By default, all parameters except matrices are treated as one block, and each matrix parameter
is viewed as a separate block. You can use the block() option to separate scalar parameters in
multiple blocks. Technically, you can also use block() to combine matrix parameters in one block,
but this is not recommended. The block() option may be repeated to define multiple blocks.
Different types of model parameters, such as scalars and matrices, may not be specified in one
block(). Parameters within one block are updated simultaneously, and each block of parameters
is updated in the order it is specified; the first specified block is updated first, the second is updated
second, and so on. See Improving efficiency of the MH algorithm—blocking of parameters.
blockopts include gibbs, split, reffects, scale(), covariance(), and adaptation().
gibbs specifies that Gibbs sampling be used to update parameters in the block. This option is allowed
only for specific combinations of likelihood models and prior distributions; see Gibbs sampling
for some likelihood-prior and prior-hyperprior configurations. For more information, see Gibbs
and hybrid MH sampling. In the presence of multiple random effects, you may combine
options gibbs and split to perform Gibbs sampling separately for each set of random-
effects parameters. gibbs may not be combined with reffects, scale(), covariance(),
or adaptation().
split specifies that all parameters in a block are treated as separate blocks. This may be useful for
levels of factor variables. Option split is convenient in combination with option gibbs with
multiple random effects to perform Gibbs sampling separately for each set of random-effects
parameters.
reffects specifies that the parameters associated with the levels of a factor variable included in
the likelihood specification be treated as random-effects parameters. Random-effects parameters
must be included in one prior statement and are assumed to be conditionally independent
across levels of a grouping variable given all other model parameters. reffects requires that
parameters be specified as {depvar:i.varname}, where i.varname is the corresponding factor
variable in the likelihood specification, and may not be combined with block()’s suboptions
gibbs and split. This option was useful for fitting hierarchical or multilevel models in
previous versions and is now provided for historical reasons. See Random effects for how to
fit multilevel models.
scale(#) specifies an initial multiplier for the scale factor corresponding to the specified block.
The initial scale factor is computed as #/√(n_p) for continuous parameters and as #/n_p for discrete
parameters, where n_p is the number of parameters in the block. The default is scale(2.38).
If specified, this option overrides the respective setting from the scale() option specified with
the command. scale() may not be combined with gibbs.
covariance(matname) specifies a scale matrix matname to be used to compute an initial
proposal covariance matrix corresponding to the specified block. The initial proposal covariance
is computed as rho×Sigma, where rho is a scale factor and Sigma = matname. By default,
Sigma is the identity matrix. If specified, this option overrides the respective setting from the
covariance() option specified with the command. covariance() may not be combined with
gibbs.
adaptation(tarate()) and adaptation(tolerance()) specify block-specific TAR and
acceptance tolerance. If specified, they override the respective settings from the adaptation()
option specified with the command. adaptation() may not be combined with gibbs.
blocksummary displays the summary of the specified blocks. This option is useful when block()
is specified.
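As a minimal sketch of the block() suboptions described above, the following do-file fragment places the regression coefficients in separate blocks with split and updates the variance by Gibbs sampling. The auto dataset, the particular priors, and the seed are illustrative assumptions, not part of the original text; gibbs is used here with a normal likelihood and an inverse-gamma prior on the variance, which is assumed to be one of the supported likelihood-prior combinations.

```stata
* Illustrative only: dataset, priors, and seed are assumptions
sysuse auto, clear

* {mpg:weight} and {mpg:_cons} are sampled as separate MH blocks (split);
* {var} is updated by Gibbs sampling; blocksummary lists the resulting blocks
bayesmh mpg weight, likelihood(normal({var}))        ///
    prior({mpg:}, normal(0, 100))                     ///
    prior({var}, igamma(0.1, 0.1))                    ///
    block({mpg:}, split)                              ///
    block({var}, gibbs)                               ///
    blocksummary rseed(12345)
```

Because gibbs may not be combined with scale(), covariance(), or adaptation(), those suboptions are left out of the Gibbs block.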
Initialization
initial(initspec) specifies initial values for the model parameters to be used in the simulation.
With multiple chains, this option is equivalent to specifying option init1(). You can specify a
parameter name, its initial value, another parameter name, its initial value, and so on. For example,
to initialize a scalar parameter alpha to 0.5 and a 2x2 matrix Sigma to the identity matrix I(2),
you can type
bayesmh . . ., initial({alpha} 0.5 {Sigma,m} I(2)) . . .
You can also specify a list of parameters using any of the specifications described in Referring to
model parameters. For example, to initialize all regression coefficients from equations y1 and y2
to zero, you can type
bayesmh . . ., initial({y1:} {y2:} 0) . . .
The general specification of initspec is
paramref initval paramref initval . . .
where initval is a number, a Stata expression that evaluates to a number, or a Stata matrix for
initialization of matrix parameters.
Curly braces may be omitted for scalar parameters but must be specified for matrix parameters.
Initial values declared using this option override the default initial values or any initial values
declared during parameter specification in the likelihood() option. See Specifying initial values
for details.
init#(initspec) specifies initial values for the model parameters for the #th chain. This option requires
option nchains(). init1() overrides the default initial values for the first chain, init2() for
the second chain, and so on. You specify initial values in init#() just like you do in option
initial(). See Specifying initial values for details.
initall(initspec) specifies initial values for the model parameters for all chains. This option requires
option nchains(). You specify initial values in initall() just like you do in option initial().
You should avoid specifying fixed initial values in initall() because then all chains will use the
same initial values. initall() is useful to specify random initial values when you define your
own priors within prior()’s density() and logdensity() suboptions. See Specifying initial
values for details.
nomleinitial suppresses using maximum likelihood estimates (MLEs), or linear programming
estimates for bayes: qreg, as starting values for model parameters. With multiple chains, this
option and discussion below apply only to the first chain. By default, when no initial values are
specified, MLE values (when available) are used as initial values. If nomleinitial is specified
and no initial values are provided, the command uses ones for positive scalar parameters, zeros
for other scalar parameters, and identity matrices for matrix parameters. nomleinitial may be
useful for providing an alternative starting state when checking convergence of MCMC. This option
cannot be combined with initrandom.
initrandom specifies that the model parameters be initialized randomly. Random initial values are
generated from the prior distributions of the model parameters. If you want to use fixed initial
values for some of the parameters, you can specify them in the initial() option or during
parameter declarations in the likelihood() option. Random initial values are not available for
parameters with flat, jeffreys, density(), logdensity(), and jeffreys() priors; you
must provide your own initial values for such parameters. This option cannot be combined with
nomleinitial. See Specifying initial values for details.
initsummary specifies that the initial values used for simulation be displayed.
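The initialization options above can be sketched as follows. The dataset, priors, starting values, and seed are illustrative assumptions; only options named in this section are used.

```stata
* Illustrative only: dataset, priors, starting values, and seed are assumptions
sysuse auto, clear

* Single chain: fixed starting values for all coefficients and the variance
bayesmh mpg weight, likelihood(normal({var}))        ///
    prior({mpg:}, normal(0, 100))                     ///
    prior({var}, igamma(0.1, 0.1))                    ///
    initial({mpg:} 0 {var} 10) initsummary rseed(12345)

* Three chains: a different starting value of {var} for each chain
bayesmh mpg weight, likelihood(normal({var}))        ///
    prior({mpg:}, normal(0, 100))                     ///
    prior({var}, igamma(0.1, 0.1))                    ///
    nchains(3) init1({var} 10) init2({var} 20) init3({var} 40) ///
    initsummary rseed(12345)
```

initsummary is included so that the starting values actually used by each chain can be inspected in the output.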
Adaptation
adaptation(adaptopts) controls adaptation of the MCMC procedure. Adaptation takes place every
prespecified number of MCMC iterations and consists of tuning the proposal scale factor and
proposal covariance for each block of model parameters. Adaptation is used to improve sampling
efficiency. Provided defaults are based on theoretical results and may not be sufficient for all
applications. See Adaptation of the MH algorithm for details about adaptation and its parameters.
adaptopts are any of the following options:
every(#) specifies that adaptation be attempted every #th iteration. The default is every(100).
To determine the adaptation interval, you need to consider the maximum block size specified
in your model. The update of a block with k model parameters requires the estimation
of a k × k covariance matrix. If the adaptation interval is not sufficient for estimating the
k(k + 1)/2 elements of this matrix, the adaptation may be insufficient.
maxiter(#) specifies the maximum number of adaptive iterations. Adaptation includes tuning
of the proposal covariance and of the scale factor for each block of model parameters.
Once the TAR is achieved within the specified tolerance, the adaptation stops. However, no
more than # adaptation steps will be performed. The default is variable and is computed as
max{25, floor(burnin()/adaptation(every()))}.
maxiter() is usually chosen to be no greater than (mcmcsize() + burnin())/
adaptation(every()).
miniter(#) specifies the minimum number of adaptive iterations to be performed regardless of
whether the TAR has been achieved. The default is miniter(5). If the specified miniter()
is greater than maxiter(), then miniter() is reset to maxiter(). Thus, if you specify
maxiter(0), then no adaptation will be performed.
alpha(#) specifies a parameter controlling the adaptation of the AR. alpha() should be in
[0, 1]. The default is alpha(0.75).
beta(#) specifies a parameter controlling the adaptation of the proposal covariance matrix.
beta() must be in [0,1]. The closer beta() is to zero, the less adaptive the proposal
covariance. When beta() is zero, the same proposal covariance will be used in all MCMC
iterations. The default is beta(0.8).
gamma(#) specifies a parameter controlling the adaptation rate of the proposal covariance
matrix. gamma() must be in [0,1]. The larger the value of gamma(), the less adaptive the
proposal covariance. The default is gamma(0).
tarate(#) specifies the TAR for all blocks of model parameters; this is rarely used. tarate()
must be in (0,1). The default TAR is 0.234 for blocks containing multiple continuous parameters,
0.44 for blocks with one continuous parameter, and 1/n_maxlev for blocks with discrete
parameters, where n_maxlev is the maximum number of levels for a discrete parameter in
the block.
tolerance(#) specifies the tolerance criterion for adaptation based on the TAR. tolerance()
should be in (0,1). Adaptation stops whenever the absolute difference between the current
AR and TAR is less than tolerance(). The default is tolerance(0.01).
scale(#) specifies an initial multiplier for the scale factor for all blocks. The initial scale factor is
computed as #/√n_p for continuous parameters and #/n_p for discrete parameters, where n_p is the
number of parameters in the block. The default is scale(2.38).
covariance(cov) specifies a scale matrix cov to be used to compute an initial proposal covariance
matrix. The initial proposal covariance is computed as ρ × Σ, where ρ is a scale factor and
Σ = cov. By default, Σ is the identity matrix. Partial specification of Σ is also allowed.
The rows and columns of cov should be named after some or all model parameters. According
to some theoretical results, the optimal proposal covariance is the posterior covariance matrix of
model parameters, which is usually unknown. This option does not apply to the blocks containing
random-effects parameters.
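The formulas in the adaptation and scale() descriptions can be checked with a few display commands, followed by a hedged example of overriding the defaults. The burn-in of 2,500 used in the first line is an assumption about the default burnin() setting, which is not stated in this section; the dataset, priors, and option values in the bayesmh call are likewise illustrative.

```stata
* Default maxiter() = max{25, floor(burnin()/adaptation(every()))}
display max(25, floor(2500/100))    // 25, assuming burnin(2500) and every(100)
display max(25, floor(5000/50))     // 100, for burnin(5000) and every(50)

* Initial scale factor for a block of 3 continuous parameters under scale(2.38)
display 2.38/sqrt(3)

* Distinct elements of a k x k covariance matrix for k = 4: k(k+1)/2
display 4*(4+1)/2                   // 10

* Illustrative fit with nondefault adaptation settings
sysuse auto, clear
bayesmh mpg weight, likelihood(normal({var}))        ///
    prior({mpg:}, normal(0, 100))                     ///
    prior({var}, igamma(0.1, 0.1))                    ///
    burnin(5000) scale(1)                             ///
    adaptation(every(50) maxiter(100) tarate(0.3) tolerance(0.02)) ///
    rseed(12345)
```

The // comments assume the commands are run from a do-file. With three parameters in the largest block, each 50-iteration adaptation interval has to estimate only 3(3+1)/2 = 6 covariance elements.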
Reporting
clevel(#) specifies the credible level, as a percentage, for equal-tailed and HPD credible intervals.
The default is clevel(95) or as set by [BAYES] set clevel.
hpd displays the HPD credible intervals instead of the default equal-tailed credible intervals.
eform and eform(string) specify that the coefficient table be displayed in exponentiated form and
that exp(b) and string, respectively, be used to label the exponentiated coefficients in the table.
remargl specifies to compute the log marginal-likelihood for panel-data and multilevel models. It
is not reported by default for these models. Bayesian panel-data and multilevel models contain
many parameters because, in addition to regression coefficients and variance components, they also
estimate individual random effects. The computation of the log marginal-likelihood involves the
inverse of the determinant of the sample covariance matrix of all parameters and loses its accuracy
as the number of parameters grows. For high-dimensional models such as multilevel models, the
computation of the log marginal-likelihood can be time consuming, and its accuracy may become
unacceptably low. Because it is difficult to assess the level of accuracy of the computation for
all panel-data and multilevel models, the log marginal-likelihood is not reported by default. For
models containing a small number of random effects, you can use the remargl option to compute
and display the log marginal-likelihood.
batch(#) specifies the length of the block for calculating batch means and an MCSE using batch
means. The default is batch(0), which means no batch calculations. When batch() is not
specified, the MCSE is computed using effective sample sizes instead of batch means. batch()
may not be combined with corrlag() or corrtol().
saving(filename[, replace]) saves simulation results in filename.dta. The replace option
specifies to overwrite filename.dta if it exists. If the saving() option is not specified, bayesmh
saves simulation results in a temporary file for later access by postestimation commands. This
temporary file will be overwritten every time bayesmh is run and will also be erased if the current
estimation results are cleared. saving() may be specified during estimation or on replay.
The saved dataset has the following structure. Variable _chain records chain identifiers. Variable
_index records iteration numbers. bayesmh saves only states (sets of parameter values) that are
different from one iteration to another and the frequency of each state in variable _frequency.
(Some states may be repeated for discrete parameters.) As such, _index may not necessarily
contain consecutive integers. Remember to use _frequency as a frequency weight if you need to
obtain any summaries of this dataset. Values for each parameter are saved in a separate variable
in the dataset. Variables containing values of parameters without equation names are named
eq0_p#, following the order in which parameters are declared in bayesmh. Variables containing
values of parameters with equation names are named eq#_p#, again following the order in which
parameters are defined. Parameters with the same equation names will have the same variable
prefix eq#. For example,
. bayesmh y x1, likelihood(normal({var})) saving(mcmc) ...
will create a dataset, mcmc.dta, with variable names eq1_p1 for {y:x1}, eq1_p2 for {y:_cons},
and eq0_p1 for {var}. Also see macros e(parnames) and e(varnames) for the correspondence
between parameter names and variable names.
In addition, bayesmh saves variable _loglikelihood to contain values of the log likelihood
from each iteration and variable _logposterior to contain values of the log posterior from each
iteration.
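A short sketch of working with the saved simulation dataset follows. The dataset, priors, file name, and seed are illustrative assumptions. The variable names used in summarize (eq1_p1 and the frequency variable _frequency) are written with underscores on the assumption that this is how they appear in the saved file; describe and the e() macros are included so the actual names can be confirmed before use.

```stata
* Illustrative only: dataset, priors, file name, and seed are assumptions
sysuse auto, clear
bayesmh mpg weight, likelihood(normal({var}))        ///
    prior({mpg:}, normal(0, 100))                     ///
    prior({var}, igamma(0.1, 0.1))                    ///
    saving(mcmc, replace) rseed(12345)

* Correspondence between parameter names and saved variable names
display "`e(parnames)'"
display "`e(varnames)'"

* Load the MCMC sample and confirm the variable names
use mcmc, clear
describe

* Only distinct states are stored, so weight by the state frequency
summarize eq1_p1 [fweight = _frequency]
```

Omitting the frequency weight would give repeated states too little influence and bias the summary toward accepted moves.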
nomodelsummary suppresses the detailed summary of the specified model. The model summary is
reported by default.
noexpression suppresses the output of expressions from the model summary. Expressions (when
specified) are reported by default.
chainsdetail specifies that acceptance rates, efficiencies, and log marginal-likelihoods be reported
separately for each chain. By default, the header reports these statistics averaged over all chains.
This option requires option nchains().
nodots, dots, and dots(#) specify to suppress or display dots during simulation. With multiple
chains, these options affect all chains. dots(#) displays a dot every # iterations. During the
adaptation period, a symbol a is displayed instead of a dot. If dots(. . ., every(#)) is specified,
then an iteration number is displayed every #th iteration instead of a dot or a. dots(, every(#)) is
equivalent to dots(1, every(#)). dots displays dots every 100 iterations and iteration numbers
every 1,000 iterations; it is a synonym for dots(100, every(1000)). By default, no dots are
displayed (nodots or dots(0)).
show(paramref ) or noshow(paramref ) specifies a list of model parameters to be included in the
output or excluded from the output, respectively. By default, all model parameters (except random-
effects parameters) are displayed. Do not confuse noshow() with exclude(), which excludes
the specified parameters from the MCMC sample. When the noshow() option is specified, for
computational efficiency, MCMC summaries of the specified parameters are not computed or stored
in e(). paramref can include individual random-effects parameters.
showreffects and showreffects(reref ) are used with multilevel models and specify that all or
a list reref of random-effects parameters be included in the output in addition to other model
parameters. By default, all random-effects parameters are excluded from the output as if you
have specified the noshow() option. This option computes, displays, and stores in e() MCMC
summaries for the random-effects parameters.
notable suppresses the estimation table from the output. By default, a summary table is displayed
containing all model parameters except those listed in the exclude() and noshow() options.
Regression model parameters are grouped by equation names. The table includes six columns
and reports the following statistics using the MCMC simulation results: posterior mean, posterior
standard deviation, MCMC standard error or MCSE, posterior median, and credible intervals.
noheader suppresses the output header either at estimation or upon replay.
title(string) specifies an optional title for the command that is displayed above the table of the
parameter estimates. The default title is specific to the specified likelihood model.
display_options: vsquish, noemptycells, baselevels, allbaselevels, nofvlabel,
fvwrap(#), fvwrapon(style), and nolstretch; see [R] Estimation options.
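Several of the reporting options can be combined in one call, as in the sketch below. The dataset, priors, option values, and seed are illustrative assumptions.

```stata
* Illustrative only: dataset, priors, option values, and seed are assumptions
sysuse auto, clear

* 90% HPD intervals, a dot every 500 iterations with the iteration number
* printed every 2,500 iterations, and no model summary
bayesmh mpg weight, likelihood(normal({var}))        ///
    prior({mpg:}, normal(0, 100))                     ///
    prior({var}, igamma(0.1, 0.1))                    ///
    clevel(90) hpd dots(500, every(2500))             ///
    nomodelsummary rseed(12345)

* Replay the stored results without the output header
bayesmh, noheader
```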
Advanced
search(search_options) searches for feasible initial values. search_options are on, repeat(#),
and off.
search(on) is equivalent to search(repeat(500)). This is the default.
search(repeat(k)), k > 0, specifies the number of random attempts to be made to find
a feasible initial-value vector, or initial state. The default is repeat(500). An initial-value
vector is feasible if it corresponds to a state with positive posterior probability. If feasible initial
values are not found after k attempts, an error will be issued. repeat(0) (rarely used) specifies
that no random attempts be made to find a feasible starting point. In this case, if the specified
initial vector does not correspond to a feasible state, an error will be issued.
search(off) prevents the command from searching for feasible initial values. We do not
recommend specifying this option.
corrlag(#) specifies the maximum autocorrelation lag used for calculating effective sample sizes. The
default is min{500, mcmcsize()/2}. The total autocorrelation is computed as the sum of all lag-k
autocorrelation values for k from 0 to either corrlag() or the index at which the autocorrelation
becomes less than corrtol() if the latter is less than corrlag(). Options corrlag() and
batch() may not be combined.
corrtol(#) specifies the autocorrelation tolerance used for calculating effective sample sizes. The
default is corrtol(0.01). For a given model parameter, if the absolute value of the lag-k
autocorrelation is less than corrtol(), then all autocorrelation lags beyond the kth lag are
discarded. Options corrtol() and batch() may not be combined.
Using bayesmh
The bayesmh command for Bayesian analysis includes three functional components: setting up
a posterior model, performing MCMC simulation, and summarizing and reporting results. The first
component, the model-building step, requires some experience in the practice of Bayesian statistics
and, like any modeling task, is probably the most demanding. You should specify a posterior model
that is statistically correct and that represents the observed data. Another important aspect is the
computational feasibility of the model in the context of the MH MCMC procedure implemented in
bayesmh. The provided MH algorithm is adaptive and, to a degree, can accommodate various statistical
models and data structures. However, careful model parameterization and well-specified initial values
and MCMC sampling scheme are crucial for achieving a fast-converging Markov chain and consequently
good results. Simulation of MCMC must be followed by a thorough investigation of the convergence
of the MCMC algorithm. Once you are satisfied with the convergence of the simulated chains, you
may proceed with posterior summaries of the results and their interpretation. Below we discuss the
three major steps of using bayesmh and provide recommendations.
Likelihood model
The likelihood model describes the data. You build your likelihood model the same way you would
in a frequentist likelihood-based analysis.
The bayesmh command provides various likelihood models, which are specified in the
likelihood() option. For a univariate response, there are normal models, generalized linear models
for binary and count response, and more. For a multivariate model, you may choose between a
multivariate normal model with covariates common to all variables and with covariates specific to
each variable. You can also build likelihood models for multiple variables by specifying a distribution
and a regression function for each variable by using bayesmh’s multiple-equations specification.
bayesmh is primarily designed for fitting regression models. As we said above, you specify the
likelihood or outcome distribution in the likelihood() option. The regression specification of the
model is the same as for other regression commands. For a univariate response, you specify the
dependent and all independent variables following the command name. (Here we also include the
prior() option that specifies prior distributions to emphasize that it is required in addition to
likelihood(). See the next subsection for details about this option.)
. bayesmh y x1 x2, likelihood() prior() ...
For a multivariate response, you separate the dependent variables from the independent variables
with the equal sign.
. bayesmh y1 y2 = x1 x2, likelihood(mvnormal(. . .)) prior() ...
With multiple-equations specification, you follow the syntax for the univariate response, but you
specify each equation in parentheses and you specify the likelihood() option within each equation.
. bayesmh (y1 x1, likelihood()) (y2 x2, likelihood()), prior() ...
In the above models, the regression function is modeled using a linear combination of the specified
independent variables and regression coefficients. The constant is included by default, but you can
specify the noconstant option to omit it from the linear predictor.
bayesmh also allows you to model the regression function as a nonlinear function of independent
variables and regression parameters. In this case, you must use the equal sign to separate the dependent
variable from the expression and specify the expression in parentheses:
. bayesmh y = ({a}+{b}*x^{c}), likelihood(normal()) prior() ...
. bayesmh (y1 = ({a1}+{b1}*x^{c1})) ///
(y2 = ({a2}+{b2}*x^{c2})), likelihood(mvnormal()) prior() ...
You can fit linear and nonlinear multilevel models by including random-effects terms in your
regression specifications.
. bayesmh y x1 x2 U[id], likelihood() prior() ...
. bayesmh y = ({a}+{b}*x^{c}+{U[id]}), likelihood() prior() ...
Finally, you can model an outcome distribution directly by specifying one of the supported
probability distributions.
For an unsupported or nonstandard likelihood, you can use the llf() option within
likelihood() to specify a generic expression for the observation-level likelihood function; see Substitutable
expressions. When you use the llf() option, it is your responsibility to ensure that the provided
expression corresponds to a valid density. For more complicated Bayesian models, you may consider
writing your own likelihood or posterior function evaluators; see [BAYES] bayesmh evaluators.
Prior distributions
In addition to the likelihood, you must also specify prior distributions for all model parameters in a
Bayesian model (except random effects). Prior distributions or priors are key components in a Bayesian
model specification and should be chosen carefully. They are used to quantify some expert knowledge
or existing information about model parameters. For example, priors can be used for constraining
the domain of some parameters to localize values that we think are more probable for reasons that
are not considered in the likelihood specification. Improper priors (priors with densities that do not
integrate to finite numbers) are also allowed, as long as they yield valid posterior distributions. Priors
are often categorized as informative (subjective) or noninformative (objective). Noninformative priors
are also known as vague priors. Uniform distributions are often used as noninformative priors and
can even be applied to parameters with unbounded domains, in which case they become improper
priors. Normal and gamma distributions with very large variances relative to the expected values
of the parameters are also used as noninformative priors. Another family of noninformative priors,
often chosen for their invariance under reparameterization, are so-called Jeffreys priors, named after
Harold Jeffreys (Jeffreys 1946). For example, the bayesmh command provides built-in Jeffreys priors
for the normal family of distributions. Jeffreys priors are usually improper. As discussed by many
researchers, however, the overuse of noninformative priors contradicts the principles of the Bayesian
approach: analysis of a posterior model with noninformative priors would be close to one based on
the likelihood only. Noninformative priors may also negatively influence the MCMC convergence. It
is thus important to find good priors based on earlier studies and use them in the model as well as
perform sensitivity analysis for competing priors. A good choice of prior should minimize the MCMC
standard errors of the parameter estimates.
As for likelihoods, the bayesmh command provides several priors you can choose from by
specifying the prior() options. For example, continuous univariate priors include normal, lognormal,
uniform, inverse gamma, and exponential; discrete priors include Bernoulli and Poisson; multivariate
priors include multivariate normal and inverse Wishart. There are also special priors: jeffreys and
jeffreys(#), which specify Jeffreys priors for the variance of the normal and multivariate normal
distributions, and zellnersg() and zellnersg0(), which specify multivariate priors for regression
coefficients (Zellner and Revankar 1969).
The prior() option is required and may be repeated. You can use the prior() option for each
parameter or you can combine multiple parameters in one prior() specification.
For example, we can specify different priors for parameters {y:x} and {y:_cons} by
. bayesmh y x, . . . prior({y:x}, normal(10,100)) prior({y:_cons}, normal(20,200)) . . .
or the same univariate prior using one prior() statement, using
. bayesmh y x, . . . prior({y:x _cons}, normal(10,100)) . . .
or a multivariate prior with zero mean and fixed variance–covariance S, as follows:
. bayesmh y x, . . . prior({y:x _cons}, mvnormal0(2,S)) . . .
In the prior() option, we list model parameters following any of the specifications described in
Referring to model parameters and then, following the comma, we specify one of the prior distributions
priordist.
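The three ways of assigning priors shown above can be made concrete as follows. The dataset, the prior hyperparameters, the matrix S, and the seed are illustrative assumptions; the inverse-gamma prior on {var} is added only so that each model is fully specified.

```stata
* Illustrative only: dataset, hyperparameters, matrix S, and seed are assumptions
sysuse auto, clear

* Separate univariate priors for the slope and the intercept
bayesmh mpg weight, likelihood(normal({var}))        ///
    prior({mpg:weight}, normal(0, 1))                 ///
    prior({mpg:_cons}, normal(20, 200))               ///
    prior({var}, igamma(0.1, 0.1)) rseed(12345)

* The same univariate prior for both coefficients in one prior() statement
bayesmh mpg weight, likelihood(normal({var}))        ///
    prior({mpg:weight _cons}, normal(0, 100))         ///
    prior({var}, igamma(0.1, 0.1)) rseed(12345)

* A zero-mean bivariate normal prior with fixed covariance matrix S
matrix S = 100*I(2)
bayesmh mpg weight, likelihood(normal({var}))        ///
    prior({mpg:weight _cons}, mvnormal0(2, S))        ///
    prior({var}, igamma(0.1, 0.1)) rseed(12345)
```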
If you want to specify a nonstandard prior or if the prior you need is not supported, you can use
the density() or logdensity() option within the prior() option to specify an expression for
a generic density or log density of the prior distribution; see Substitutable expressions. When you
use the density() or logdensity() option, it is your responsibility to ensure that the provided
expression corresponds to a valid density. For a complicated Bayesian model, you may consider
writing your own posterior function evaluator; see [BAYES] bayesmh evaluators.
Sometimes, you may need to specify a flat prior (a prior with the density equal to one) for some
of the parameters. This is often needed when specifying a noninformative prior. You can specify the
flat option instead of the prior distribution in the prior() option to request the flat prior. This
option is equivalent to specifying density(1) or logdensity(0) in prior().
With multilevel models, random-effects parameters, such as random intercepts {U[id]} at the id
levels, are assigned default normal priors with zero mean and an unknown variance, that is, {var_U}.
You must, however, specify the priors for the unknown variance components. For instance, if we
include random intercepts {U[id]} in our model, we will need to specify the prior for {var_U}.
You can use the prior() option to change the default priors for random effects, prior({U}, . . .).
See Random effects.
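A minimal sketch of a random-intercept specification with a prior on the variance component follows. The dataset, the use of rep78 as a stand-in grouping variable, the priors, and the seed are illustrative assumptions, and the example presumes a Stata version that supports the U[id] random-effects syntax.

```stata
* Illustrative only: dataset, grouping variable, priors, and seed are assumptions
sysuse auto, clear

* Random intercepts U[rep78] receive the default normal prior with variance
* {var_U}; a prior for {var_U} itself must still be supplied
bayesmh mpg weight U[rep78], likelihood(normal({var}))   ///
    prior({mpg:weight _cons}, normal(0, 100))              ///
    prior({var}, igamma(0.1, 0.1))                         ///
    prior({var_U}, igamma(0.1, 0.1))                       ///
    rseed(12345)
```

Observations with missing rep78 are excluded from the estimation sample. The individual random effects are not displayed by default; showreffects adds them to the output.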
The specified likelihood model for the data and prior distributions for the parameters are not
guaranteed to result in proper posterior distributions of the parameters. Therefore, unless you are
using one of the standard Bayesian models, you should always check the validity of the posterior
model you specified.
For a multivariate normal linear regression, in addition to four regression parameters declared
automatically by bayesmh: {y1:x}, {y1:_cons}, {y2:x}, and {y2:_cons}, we may also declare
a parameter for the variance–covariance matrix:
. bayesmh y1 y2 = x, likelihood(mvnormal({Sigma, matrix})) ...
or abbreviate matrix to m for short:
. bayesmh y1 y2 = x, likelihood(mvnormal({Sigma, m})) ...
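As a hedged illustration of declaring and then referring to a matrix parameter, the sketch below fits a bivariate normal regression and assigns an inverse-Wishart prior to {Sigma, m}. The dataset, the choice of outcomes, the degrees of freedom, the scale matrix S, and the seed are all illustrative assumptions, and the default MH sampling of the covariance matrix may mix slowly.

```stata
* Illustrative only: dataset, outcomes, prior settings, and seed are assumptions
sysuse auto, clear
matrix S = I(2)

* {Sigma, m} is declared in likelihood() and referred to again in prior()
bayesmh mpg headroom = weight, likelihood(mvnormal({Sigma, m}))  ///
    prior({mpg:} {headroom:}, normal(0, 100))                     ///
    prior({Sigma, m}, iwishart(2, 10, S))                         ///
    rseed(12345)
```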
For a two-level random-intercept model,
. bayesmh y x U[id], ...
in addition to regression coefficients {y:x} and {y:_cons}, bayesmh creates a variance component
{var_U} associated with the included random effects {U[id]}. See Random effects for details.
After a model parameter is declared, we may need to refer to it in our further model specification.
We will definitely need to refer to it when we specify its prior distribution. We may also need to use
it as an argument in the prior distributions of other parameters or need to specify it in the block()
option for blocking of model parameters; see Improving efficiency of the MH algorithm—blocking
of parameters.
To refer to one parameter, we simply use its definition: {param}, {eqname:param}, {param,
matrix}, or {eqname:param, matrix}. There are several ways in which you can refer to multiple
parameters. You can refer to multiple model parameters in the parameter specification paramref of the
prior(paramref, . . .) option, of the block(paramref, . . .) option, or of the initial(paramref
#) option.
The most straightforward way to refer to multiple scalar model parameters is to simply list them
individually, as follows:
{param1} {param2} ...
but there are shortcuts.
For example, the alternative to the above is
{param1 param2} ...
where we simply list the names of all parameters inside one set of curly braces.
If parameters have the same equation name, you can refer to all the parameters with that equation
name as follows. Suppose that we have three parameters with the same equation name eqname; then
the specification
{eqname:param1} {eqname:param2} {eqname:param3}
is the same as the specification
{eqname:}
or the specification
{eqname:param1 param2 param3}
The above specification is useful if we want to refer to a subset of parameters with the same
equation name. For example, in the above, if we wanted to refer to only param1 and param2, we
could type
{eqname:param1 param2}
If a factor variable is used in the specification of the regression function, you can use the same
factor-variable specification within paramref to refer to the coefficients associated with the levels of
that factor variable; see [U] 11.4.3 Factor variables.
You can mix and match all the specifications above in one parameter specification, paramref.
To refer to multiple matrix model parameters, you can use {paramlist, matrix} to refer to matrix
parameters with names paramlist and {eqname:paramlist, matrix} to refer to matrix parameters
with names in paramlist and with equation name eqname.
For example, the specification
{eqname:Sigma1,m} {eqname:Sigma2,m} {Sigma3,m} {Sigma4,m}
is the same as the specification
{eqname:Sigma1 Sigma2,m} {Sigma3 Sigma4,m}
Substitutable expressions
You may use substitutable expressions in bayesmh to define nonlinear expressions subexpr,
arguments of outcome distributions in option likelihood(), observation-level log likelihood in
option llf(), arguments of prior distributions in option prior(), and generic prior distributions in
prior()’s suboptions density() and logdensity(). Substitutable expressions are just like any
other mathematical expression in Stata, except that they may include model parameters. Substitutable
expressions may contain factor variables and time-series operators; see [U] 11.4.3 Factor variables
and [U] 11.4.4 Time-series varlists.
To specify a substitutable expression in your bayesmh model, you must comply with the following
rules:
1. Model parameters are bound in braces: {mu}, {var:sigma2}, {Sigma, matrix}, and
{Cov:Sigma, matrix}.
2. Linear combinations can be specified using the notation
{[eqname:]varlist[, xb noconstant]}
The xb option is used to distinguish between the linear combination that contains one variable
and a free parameter that has the same name as the variable and the same group name
as the linear combination. For example, {lc:weight, xb} is equivalent to {lc:_cons}
+ {lc:weight}*weight, whereas {lc:weight} refers to either a free parameter weight
with a group name lc or the coefficient of the weight variable, if {lc:} has been previously
defined in the expression as a linear combination that involves variable weight. Thus the xb
option indicates that the specification is a linear combination rather than a single parameter
to be estimated.
When you define a linear combination, a constant term is included by default. The
noconstant option suppresses the constant.
See Linear combinations in [ME] menl for details about specifying linear combinations.
3. Initial values are given by including an equal sign and the initial value inside the braces,
for example, {b1=1.267}, {gamma=3}, etc. If you do not specify an initial value, that
parameter is initialized to one for positive scalar parameters and to zero for other scalar
parameters, or it is initialized to its MLE, if available. The initial() option overrides initial
values provided in substitutable expressions. Initial values for matrices must be specified in
the initial() option. By default, matrix parameters are initialized with identity matrices.
Specifying linear combinations. We can use substitutable expressions to specify linear combinations.
For example, consider a normal linear regression:
. bayesmh y x1 x2, likelihood(normal(1)) prior({y:}, normal(0,100))
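As a hedged sketch of the same idea, the regression above might equivalently be written with a linear combination inside a substitutable expression. The equation label xb is an arbitrary name chosen here for illustration; under this specification the coefficients would be referred to as {xb:x1}, {xb:x2}, and {xb:_cons}.

```stata
* Same normal linear regression, written as a substitutable expression.
* {xb:x1 x2} stands for {xb:_cons} + {xb:x1}*x1 + {xb:x2}*x2;
* the label xb is illustrative and is not required by bayesmh.
bayesmh y = ({xb:x1 x2}), likelihood(normal(1)) prior({xb:}, normal(0,100))
```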
Including random effects. Substitutable expressions may also contain random effects; see Random
effects.
If you wish to constrain a coefficient to a specific value, you can specify the @ symbol immediately
after the variable whose coefficient is being constrained and then type the value. For instance,
. bayesmh y x1 x2@1, ...
will constrain the coefficient parameter {y:x2} to 1, which means that this parameter is a constant
and will not be sampled.
You can also constrain a coefficient to a symbol, which is equivalent to renaming the corresponding
parameter. For instance,
. bayesmh y x1 x2@a, ...
will replace {y:x2} with the free parameter {a}. This feature may be useful with multiple-equations
models when we want the variable used in several linear combinations to have the same coefficient.
For instance,
. bayesmh (y1 x1 x2@a, . . .) (y2 x1 x2@a, . . .)
will replace the parameters {y1:x2} and {y2:x2} with {a}, thus constraining the two original
coefficients to be the same.
Random effects
You can include random effects in your bayesmh specifications to fit multilevel models. Examples
of random effects specified within the bayesmh syntax are U1[id], U2[id1>id2], U3[id1#id3],
c.x1#U4[id], and 2.f1#U5[id], to name a few. These represent a random intercept at the id level,
a random intercept at the id2-within-id1 level, a random interaction between the crossed levels id1
and id3, a random slope for the continuous variable x1, and a random slope associated with the
second level of the factor variable f1, respectively. See the general syntax for the random-effects
terms below.
To fit linear multilevel models, you include random-effects terms just as you include covariates—you
simply list them following the dependent variable. For instance,
. bayesmh y x1 x2 U[id], ...
. bayesmh y x1 x2 U0[id] c.x1#U1[id], ...
In multiple-equations models, there are equation-specific coefficients associated with each random-
effect term. The coefficient of the random effect in the first equation in which it appears is constrained
to 1. For example,
. bayesmh (y1 x1 U[id1], . . .) (y2 x1 U[id1] V[id2], . . .)
constrains {y1:U} and {y2:V} to 1 because their associated random effects, {U[id1]} and {V[id2]},
appear for the first time in equations {y1:} and {y2:}, respectively. {y2:U} will be sampled because
the associated random effect, {U[id1]}, had already appeared in the first equation.
The coefficients are constrained to 1 for the purpose of identifiability because you cannot identify
both the coefficients and the variance component, which is introduced automatically by bayesmh, for
each random effect. (Technically, you could identify both parameters with Bayesian models if you
specify strong informative priors for them.)
You can override the coefficient constraints by using @value immediately following the random-
effects term. For example,
. bayesmh (y1 x1 U[id1], . . .) (y2 x1 U[id1]@1 V[id2], . . .)
constrains {y2:U} to 1 and lets {y1:U} be sampled. You may also constrain a random effect to a
symbol as follows:
. bayesmh (y1 x1 U[id1]@y1_U, . . .) (y2 x1 U[id1] V[id2], . . .)
Here both equations will contain coefficient parameters for U[id1]: {y1_U} will be the coefficient
in the first equation, and {y2:U} will continue to be the coefficient in the second equation. Notice that
{y1_U} will be treated by bayesmh as a free parameter rather than as a native regression coefficient.
The above specification is useful when you want to constrain a variance component instead of one
of the coefficients.
You can also include random effects in nonlinear models. You do this by creating a so-called
random-effects substitutable expression—a substitutable expression that contains random effects.
When you include random effects in substitutable expressions, you must enclose them in {}, just as
you do with other model parameters. For instance,
. bayesmh y = (({b1}+{U[id]})/(1+exp(-(x-{b2})/{b3}))), . . .
. bayesmh y = (1/({b0}+{b1}*x1+{b2}*x2+{U0[id]}+{c.x1#U1[id]})), . . .
The previous bayesmh model can be specified more elegantly by using a linear-combination
specification within a substitutable expression:
. bayesmh y = (1/({xb:x1 x2 U0[id] c.x1#U1[id]})), . . .
When random effects are specified within a linear-combination specification, as in the above exam-
ple, the curly braces around each random effect are not needed. See Random-effects substitutable
expressions in [ME] menl for examples of substitutable expressions containing random effects.
The general syntax for specifying random-effects terms, reterm, is provided below.
reterm                             Description
{rename[levelspec]}                Random intercepts rename at hierarchy levelspec
{c.varname#rename[levelspec]}      Random coefficients rename for continuous variable varname
{#.fvvarname#rename[levelspec]}    Random coefficients rename for the #th level of
                                   factor variable fvvarname
rename is a random-effects name. It is a Stata name that starts with a capital letter. levelspec defines
the level of hierarchy and is described below.
levelspec                  Description
levelvar                   variable identifying the group structure for the random effect at that level
lv2 > lv1                  two-level nesting: levels of variable lv1 are nested within lv2
lv3 > lv2 > lv1            three-level nesting: levels of variable lv1 are nested within lv2,
                           which is nested within lv3
. . . > lv3 > lv2 > lv1    higher-level nesting
lv1#lv2                    two-way interaction between crossed levels lv1 and lv2
lv1#lv2#lv3                three-way interaction between crossed levels lv1, lv2, and lv3
lv1#lv2#lv3#. . .          higher-order interactions between crossed levels
_all                       treat entire dataset as one big group
_n                         treat each observation as its own group; defines a latent variable
You can equivalently specify levels in the opposite order, from the lowest level to the highest; for example, lv1 < lv2
< lv3, but they will be displayed in the canonical order, from the highest level to the lowest.
After you define a random-effects term once using its full specification rename[levelspec], you
can refer to it further simply by name rename, or you can continue using the full name.
When you include a random effect in your regression model, bayesmh creates a parameter for each
level of the grouping variable. For example, if you include U[id]—the random intercepts by level
variable id that contains levels 1 through 10—bayesmh will create a separate scalar parameter for
each level of id: {U[1.id]}, {U[2.id]}, . . . , {U[10.id]}. These scalar parameters are sampled
in one block using the sampling algorithm described in Adaptive MH algorithm for random effects
in Methods and formulas.
When you use random effects with user-specified log-likelihood and log-posterior evaluators, they
are sampled by default in one block as regular scalar parameters.
When you refer to random-effects parameters in bayesmh’s specifications, you typically refer to
them as a group. For example, suppose that you included random intercepts by level variable id in
your model as U[id]. To specify a prior distribution for these random intercepts, you can refer to
them by using the full definition {U[id]} or simply by name {U}. In postestimation commands or,
for instance, in the showreffects() option, you may want to refer to individual random-effects
parameters such as {U[1.id]} and {U[1]} or to the subsets of them such as {U[(1/5).id]} and
{U[1/5]}. See Different ways of specifying model parameters in [BAYES] Bayesian postestimation
for other ways of referring to individual random-effects parameters.
For each random effect {rename[levelspec]} you include in the model, bayesmh automatically
assigns it a normal prior with zero mean and variance component {var_rename}. But it is your
responsibility to specify a prior for each variance component {var_rename}. You can also use the
prior() option to change the default prior for random effects. This is particularly useful for specifying
a multivariate normal prior with an unstructured covariance matrix for correlated random effects; see
example 25.
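As a minimal sketch, assuming a dataset with outcome y, covariate x1, and grouping variable id (all hypothetical names), a random-intercept model with an explicit prior for its variance component could look as follows. The inverse-gamma hyperparameters are illustrative choices, not recommendations.

```stata
* Random intercepts U[id] receive the default normal(0, {var_U}) prior;
* the prior for the variance component {var_U} is supplied by the user.
bayesmh y x1 U[id], likelihood(normal({var}))   ///
    prior({y:x1 _cons}, normal(0,100))          ///
    prior({var}, igamma(0.01,0.01))             ///
    prior({var_U}, igamma(0.01,0.01))
```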
With multiple-equations models, you must specify a prior for each equation-specific coefficient
associated with a random effect as long as the coefficient is not constrained. For example, if we write
. bayesmh (y1 x1 U[id1], . . .) (y2 x1 U[id1] V[id2], . . .)
then a prior must be specified for coefficient {y2:U} but not for coefficients {y1:U} and {y2:V}
because these are constrained to 1.
Specifying a Bayesian model may be a tedious task when there are many model parameters and
possibly hyperparameters. It is thus essential to verify model specification before starting a potentially
time-consuming estimation.
bayesmh displays the summary of the specified model as a part of its standard output. You can
use the dryrun option to obtain the model summary without estimation or simulation. Once you are
satisfied with the specified model, you can use the nomodelsummary option to suppress a potentially
long model summary during estimation. Even if you specify nomodelsummary during estimation,
you will still be able to see the model summary, if desired, by simply replaying the results:
. bayesmh
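The workflow described above can be sketched as three steps in a do-file. The variables y, x1, and x2 and the prior settings are hypothetical and serve only to illustrate the dryrun and nomodelsummary options.

```stata
* Step 1: display the model summary only; no MCMC simulation is run.
bayesmh y x1 x2, likelihood(normal({var}))      ///
    prior({y:}, normal(0,100))                  ///
    prior({var}, igamma(0.01,0.01)) dryrun

* Step 2: fit the model, suppressing the model summary in the output.
bayesmh y x1 x2, likelihood(normal({var}))      ///
    prior({y:}, normal(0,100))                  ///
    prior({var}, igamma(0.01,0.01)) nomodelsummary

* Step 3: replay the results to display the model summary again.
bayesmh
```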
Reproducing results
Because bayesmh uses MCMC simulation—a stochastic procedure for sampling from a complicated
and possibly nontractable distribution—it will produce different results each time you run the command.
If the MCMC algorithm converged, the results should not change drastically. To obtain reproducible
results, you must specify the random-number seed.
To specify a random-number seed, you can use bayesmh’s rseed() option. With a single chain,
you can instead use set seed # prior to calling bayesmh; see [R] set seed. With multiple chains,
you should use rseed() for reproducibility because, as we explain later, using set seed is no longer
sufficient.
With a single chain, if you forgot to specify the random-number seed before calling bayesmh, you
can retrieve the random-number state used by the command from e(rngstate) and use it later with
set rngstate. With multiple chains, reproducing results after the simulation without specifying the
seed is more difficult. We strongly recommend that you specify the rseed() option with bayesmh
when simulating multiple chains.
When you specify the nchains() option to simulate multiple chains, each chain uses its own
stream of random numbers; see [R] set rngstream. This is important to ensure that the chains are
independent. To reproduce the simulation results, a random-number seed must be used for each stream.
This is why using set seed prior to calling bayesmh will not be sufficient to reproduce results from
multiple chains—set seed will affect only the first random-number stream. bayesmh’s rseed()
option, however, will use the specified random-number seed with each stream. If you forgot to specify
the seed with multiple chains, you can retrieve chain-specific random-number states from stored scalars
e(rngstate1), e(rngstate2), etc. and use them with chain-specific random-number streams; see
[R] set rngstream and set rngstate in [R] set seed. For example, suppose you simulated two
chains and forgot to specify the random-number seed:
. bayesmh . . ., nchains(2) . . .
You can type the following directly after the simulation to reproduce the results:
. set rng mt64s
. set rngstate `e(rngstate2)'
. set rngstate `e(rngstate1)'
. bayesmh . . ., nchains(2) . . .
Stata’s default random-number generator is mt64; see [R] set rng. To simulate multiple chains, the
nchains() option temporarily switches to the stream random-number generator mt64s. To manually
reproduce the results from multiple chains, you need to use mt64s, but we recommend that you switch
back to mt64 for the rest of your analysis. The set rngstate command sets the corresponding
stream automatically; you do not need to use set rngstream to do this yourself. It is important,
however, that you set the state of the first chain last, just before the next call to bayesmh, so that
the stream used by the first chain is the current stream. Although you can reproduce results after
estimation, we strongly recommend that you use the rseed() option during estimation if you want
reproducibility.
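For comparison, a minimal sketch of the recommended approach is to supply rseed() at estimation time, so that no random-number state has to be recovered afterward. The model, variable names, and seed value below are hypothetical.

```stata
* Three chains; rseed() seeds every chain-specific random-number stream,
* so rerunning this command reproduces all three chains.
bayesmh y x, likelihood(normal({var}))          ///
    prior({y:}, normal(0,100))                  ///
    prior({var}, igamma(0.01,0.01))             ///
    nchains(3) rseed(16)
```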
bayesmh has the default burn-in period of 2,500 iterations and the default MCMC sample size of
10,000 iterations. That is, the first 2,500 iterations of the MCMC sampler are discarded and the next
10,000 iterations are used to form the MCMC samples of values of model parameters. You can change
these numbers by specifying options burnin() and mcmcsize().
The burn-in period must be long enough for the algorithm to reach convergence or, in other words,
for the Markov chain to reach its stationary distribution or the desired posterior distribution of model
parameters. The sample size for the MCMC sample is typically determined based on the autocorrelation
present in the MCMC sample. The higher the autocorrelation, the larger the MCMC sample should be
to achieve the same precision of the parameter estimates as obtained from the chain with low or
negligible autocorrelation. Because of the nature of the sampling algorithm, all MCMC samples exhibit some
autocorrelation, and thus MCMC samples tend to be large.
The defaults provided by bayesmh may not be sufficient for all Bayesian models and data types.
You will need to explore the convergence of the MCMC algorithm for your particular data problem
and modify the settings, if needed.
After the burn-in period, bayesmh includes every iteration in the MCMC sample. You can specify
the thinning(#) option to store results from a subset of iterations. This option is useful if you want
to subsample the chain to decrease autocorrelation in the final MCMC sample. If you use this option,
bayesmh will perform a total of thinning() × (mcmcsize() − 1) + 1 iterations, excluding burn-in
iterations, to obtain an MCMC sample of size mcmcsize().
When you specify the nchains() option to produce multiple chains, the mcmcsize(), burnin(),
and thinning() options apply to each chain.
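To make the iteration count concrete, the following sketch evaluates the formula for thinning(5) with the default mcmcsize(10000), which gives 49,996 post-burn-in iterations, and then shows a matching call. The model, variable names, and burn-in length are hypothetical.

```stata
* Post-burn-in iterations for thinning(5) and the default mcmcsize(10000)
display 5*(10000 - 1) + 1

* Corresponding call: keep every 5th iteration after a burn-in of 5,000
bayesmh y x, likelihood(normal({var}))          ///
    prior({y:}, normal(0,100))                  ///
    prior({var}, igamma(0.01,0.01))             ///
    burnin(5000) thinning(5) rseed(16)
```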
Even when {a} and {b} have independent prior specifications, the location parameters {a} and {b}
are expected to be correlated a posteriori because of their common dependence on y. Alternatively, if
the variance parameter {var} is independent of {a} and {b} a priori, it is generally less correlated
with the location parameters a posteriori. A good blocking scheme is to use options block({a} {b})
and block({var}) with bayesmh. We can also reparameterize our model to reduce the correlation
between {a} and {b} by recentering. To center the slope parameter, we replace {b} with {b} − #,
where # is a constant close to the mean of {b}. Now {a} and {b} − # can also be placed in separate
blocks. See, for example, Thompson (2014) for more discussion related to model parameterization.
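The blocking scheme described above might be specified as in the following sketch, which assumes a simple model with mean {a} + {b}*x and variance {var}. The covariate name x and the priors are hypothetical.

```stata
* {a} and {b} are updated jointly in one block because they are
* expected to be correlated a posteriori; {var} gets its own block.
bayesmh y = ({a} + {b}*x), likelihood(normal({var}))    ///
    prior({a} {b}, normal(0,100))                       ///
    prior({var}, igamma(0.01,0.01))                     ///
    block({a} {b}) block({var})
```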
Other options that control MCMC sampling efficiency are scale(), covariance(), and
adaptation(); see Adaptation of the MH algorithm for details.
With multiple chains, the block() option and other options that control MCMC sampling efficiency
apply to all chains.
Because {y:_cons} and {var} are approximately independent a posteriori, we specified them in
separate blocks.
Gibbs sampling can be applied to hyperparameters, which are not directly involved in the likelihood
specification of the model. For example, we can use Gibbs sampling for the covariance matrix of
regression coefficients.
. bayesmh y x, likelihood(normal({var}))
> prior({var}, igamma(10,1))
> prior({y:_cons x}, mvnormal(2,1,0,{Sigma,m}))
> prior({Sigma,m}, iwishart(2,10,V))
> block({Sigma,m}, gibbs)
In the next example, the matrix parameter {Sigma,m} specifies the covariance matrix in the
multivariate normal prior for a pair of model parameters, {y:1.cat} and {y:2.cat}. {Sigma,m} is
a hyperparameter—it is not a model parameter of the likelihood but a parameter of a prior distribution,
and it has an inverse-Wishart hyperprior distribution, which is a semiconjugate prior with respect to
the multivariate normal prior distribution. Therefore, we can request a Gibbs sampler for {Sigma,m}.
. bayesmh y x i.cat, likelihood(probit)
> prior({y:x _cons}, normal(0, 1000))
> prior({y:1.cat 2.cat}, mvnormal0(2,{Sigma,m}))
> prior({Sigma,m}, iwishart(2,10,V))
> block({Sigma,m}, gibbs)
In general, Gibbs sampling, when available, is useful for covariance matrices because MH sam-
pling has low efficiency for sampling positive-definite symmetric matrices. In a multivariate normal
regression, the inverse Wishart distribution is a conjugate prior for the covariance matrix and thus
inverse Wishart is the most common prior specification for a covariance matrix parameter. If an
inverse-Wishart prior (iwishart()) is used for a covariance matrix, you can specify Gibbs sampling
for the covariance matrix. You can do so by placing the matrix in a separate block and specifying
the gibbs suboption in that block, as we showed above. Using Gibbs sampling for the covariance
matrix usually greatly improves the sampling efficiency.
matrix can be modified using the scale() and covariance() options. In the presence of blocks of
parameters, these options can be specified separately for each block within the block() option. At each
adaptation step, a new scale matrix is formed as a mixture (a linear combination) of the previous scale
matrix and the current empirical covariance matrix of model parameters. The mixture of the two matrices
is controlled by option adaptation(beta()). A positive adaptation(beta()) is recommended to
have a more stable scale matrix between adaptation periods. The adaptation lasts until the maximum
number adaptation(every())×adaptation(maxiter()) of adaptive iterations is reached or
until adaptation(tarate()) is reached within the adaptation(tolerance()) limit. The default
for maxiter() depends on the specified burn-in and adaptation(every()) and is computed as
max{25, floor(burnin()/adaptation(every()))}. The default for adaptation(every()) is
100. If you change the default values of these parameters, you may want to increase the burnin()
to be as long as the specified adaptation period so that adaptation is finished before the final
simulated sample is obtained. (There are adaptation regimes in which adaptation is performed during
the simulation phase as well, such as continuous adaptation.) Two additional adaptation options,
adaptation(alpha()) and adaptation(gamma()), control the acceptance rate (AR) and the adaptation rate. For
a detailed description of the adaptation process, see Adaptive random-walk Metropolis–Hastings in
[BAYES] Intro and Adaptive MH algorithm in Methods and formulas.
With multiple chains, adaptation options apply to all chains.
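As a worked illustration of the defaults above, the first two lines below evaluate the default maxiter() formula, which gives 25 under the default burn-in of 2,500 and 50 under a burn-in of 5,000. The final command is a sketch of explicit settings in which the burn-in matches the maximum adaptation period. The model and variable names are hypothetical.

```stata
* Default maxiter() under the default burnin(2500) and every(100)
display max(25, floor(2500/100))

* With burnin(5000) and every(100), the default maxiter() becomes 50
display max(25, floor(5000/100))

* Explicit settings: at most 50 x 100 = 5,000 adaptive iterations,
* matched by a burn-in of the same length
bayesmh y x, likelihood(normal({var}))          ///
    prior({y:}, normal(0,100))                  ///
    prior({var}, igamma(0.01,0.01))             ///
    adaptation(every(50) maxiter(100)) burnin(5000)
```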
When exploring convergence of MCMC, it may be useful to try different initial values to verify
that the convergence is unaffected by starting values. Using different initial values is also essential
for multiple chains. We first describe how to specify initial values for a single chain and later for
multiple chains.
Single chain. There are two different ways to specify initial values of model parameters in bayesmh
for a single chain. First is by specifying an initial value when declaring a model parameter. Second
is by specifying an initial value in the initial() option. Initial values for matrix model parameters
may be specified only in the initial() option.
For example, below we initialize variance parameter {var} with a value of 1 using two equivalent
ways, as follows:
. bayesmh y x, likelihood(normal({var=1})) ...
or
. bayesmh y x, likelihood(normal({var})) initial({var} 1) ...
If both initial-value specifications are used, initial values specified in the initial() option override
any initial values specified during parameter declaration for the corresponding parameters.
You can initialize multiple parameters with the same value by supplying a list of parameters
by using any of the specifications described in Referring to model parameters to initial(). For
example, to initialize all regression coefficients from equations y1 and y2 to zero, you can type
. bayesmh . . ., initial({y1:} {y2:} 0) . . .
Stata expressions that evaluate to a number can also be used to specify initial values for scalar
parameters. One particularly useful application of this is specifying random initial values using Stata’s
random-number functions; see [FN] Random-number functions. For example, we can generate
random initial values for parameters {y1:} from a normal distribution with mean 0 and standard
deviation 10 and for parameters {y2:} from a uniform on (0, 1) distribution as follows:
. bayesmh . . ., initial({y1:} rnormal(0,10) {y2:} runiform(0,1)) . . .
You may also specify the initrandom option to request random initial values for all model
parameters. In that case, initial values are generated from the prior distributions of the parameters,
except for parameters that are assigned flat, jeffreys, density(), logdensity(), or jeffreys()
prior distributions. For such parameters, you must specify your own initial values, or bayesmh will
issue an error message.
Multiple chains. In the presence of multiple chains, you can use the init#() options to specify
initial values for each chain: the init1() option specifies initial values for the first chain, init2()
for the second chain, and so on. You specify initial values within the init#() options just as you
do within initial() for a single chain. (With multiple chains, initial() is synonymous with
init1().)
For example,
. bayesmh y x, likelihood(normal({var})) nchains(2) init1({var} 1) init2({var} 10) ...
You can use the initall() option to specify initial values for all chains. This is useful, for
instance, when you want to generate random initial values from the same distribution for all chains.
You should avoid specifying fixed initial values within initall() because then all chains will use
the same starting values.
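A minimal sketch of random starting values shared across chains through initall() follows. The model, the seed, and the distributions used to generate the starting values are hypothetical choices for illustration.

```stata
* Each chain draws its own starting values from the same distributions:
* normal(0, sd 10) for the coefficients, uniform on (1,10) for {var}.
bayesmh y x, likelihood(normal({var}))          ///
    prior({y:}, normal(0,100))                  ///
    prior({var}, igamma(0.01,0.01))             ///
    nchains(3) rseed(16)                        ///
    initall({y:} rnormal(0,10) {var} runiform(1,10))
```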
Default initial values. By default, if no initial value is specified and option nomleinitial is
not used, bayesmh uses MLEs, whenever available, as starting values for model parameters for a
single chain. For random-effects parameters, bayesmh uses zeros as initial values and ones for their
respective variance components. You can specify the initsummary option to see the default initial
values used by bayesmh.
For example, for the previous regression model, bayesmh uses regression coefficients and mean
squared error from linear regression regress y x as the respective starting values for the regression
model parameters and variance parameter {var}.
If MLE is not available and an initial value is not provided, then a scalar model parameter is
initialized with 1 for positive parameters and 0 for other parameters, and a matrix model parameter is
initialized with an identity matrix. Note, however, that this default initialization is not guaranteed to
correspond to the feasible state for the specified posterior model; that is, posterior probability of the
initial state can be 0. When initial values are not feasible, bayesmh makes 500 random attempts to
find a feasible initial-value vector. An initial-value vector is feasible if it corresponds to a state with
positive posterior probability. If feasible initial values are not found after 500 attempts, bayesmh will
issue the following error:
could not find feasible initial state
r(498);
You may use the search() option to modify the default settings for finding feasible initial values.
In the presence of multiple chains, each chain uses a different set of initial values for model
parameters. The above description of default initial values applies to the first chain only. The subsequent
chains use random initial values, which generally are generated from the prior distributions.
For improper priors flat, jeffreys, and jeffreys(#), bayesmh cannot draw random initial
values directly from these priors. Doing so would typically produce extreme values for model
parameters for which log likelihood would be missing. Instead, the command generates initial values
from a normal distribution centered at the initial values of the first chain with standard deviations
proportional to the magnitudes of the respective initial estimates. This approach is also used to generate
default initial values with user-defined priors density() and logdensity().
Random initial values may not always be feasible. Extreme values may be produced for model
parameters for some prior distributions, which may lead to missing log-likelihood values. bayesmh
will attempt to generate several different sets of initial values before terminating the simulation of
a particular chain and issuing a warning message. In this case, you must specify your own initial
values for that chain.
Default initial values are provided for convenience! To detect nonconvergence, overdispersed
initial values should be used with multiple chains. Randomly generated default initial values are not
guaranteed to produce overdispersed initial values for all chains. To fully explore convergence, we
recommend that you specify your own initial values with multiple chains, especially with improper
or noninformative priors.
See Convergence diagnostics using multiple chains for an example of specifying initial values with
multiple chains.
You can use the initsummary option to see the initial values used for simulation. The initial
values are also stored in the e(init) matrix after estimation.
By default, all model parameters are saved in the dataset. If desired, you can exclude some of the
parameters from the dataset by specifying the exclude() option. Beware that you will not be able
to obtain posterior summaries for these parameters or use them in any way in your analysis, because
no simulation results will be available for them. Also, the Laplace–Metropolis approximation for the
log marginal-likelihood will not be available because its computation requires simulation results for
all model parameters.
When fitting multilevel models containing many random effects, if you are interested only in the
estimates of regression coefficients and variance components, you may consider using the exclude()
option to exclude saving MCMC estimates of random-effects parameters to save time. If you do this,
beware that some of the Bayesian postestimation features may not be available.
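As a rough sketch, excluding random intercepts from the saved MCMC sample might look as follows. The variable names and priors are hypothetical, and the sketch assumes that exclude() accepts the group reference {U} in the same way that prior() does.

```stata
* Random intercepts U[id] are sampled but not saved in the MCMC sample.
* Assumption: exclude() accepts the group reference {U}.
bayesmh y x1 U[id], likelihood(normal({var}))   ///
    prior({y:x1 _cons}, normal(0,100))          ///
    prior({var} {var_U}, igamma(0.01,0.01))     ///
    exclude({U})
```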
Convergence of MCMC
As we discuss in Convergence diagnostics of MCMC in [BAYES] Intro, checking convergence is
an essential step of any MCMC simulation. Bayesian inference based on an MCMC sample is only valid
if the Markov chain has converged and the sample is drawn from the desired posterior distribution.
It is important to emphasize that we need to verify the convergence for all model parameters and
not only for a subset of parameters of interest. Another difficulty in assessing convergence of MCMC
is the lack of a single conclusive convergence criterion. The diagnostic usually involves checking
for several necessary (but not necessarily sufficient) conditions for convergence. In general, the more
aspects of the MCMC sample you inspect, the more reliable your results are.
An MCMC chain is said to have converged if it has reached its stationary distribution. In the Bayesian context,
the stationary distribution is the true posterior distribution of model parameters. Provided that the
considered Bayesian model is well specified (that is, it defines a proper posterior distribution of model
parameters), the convergence of MCMC is determined by the properties of its sampling algorithm.
The main component of the MH algorithm, or any MCMC algorithm, is the number of iterations
it takes for the chain to approach its stationary distribution or for the MCMC sample to become
representative of a sample from the true posterior distribution of model parameters. The period during
which the chain is converging to its stationary distribution from its initial state is called the burn-in
period. The iterations of the burn-in period are discarded from the MCMC sample used for analysis.
Another complication is that adjacent observations from the MCMC sample tend to be positively
correlated; that is, autocorrelation is typically present in MCMC samples. In theory, this should not be
a problem provided that the MCMC sample size is sufficiently large. In practice, the autocorrelation in
the MCMC sample may be so high that obtaining a sample of the necessary size becomes infeasible
and finding ways to reduce autocorrelation becomes important.
Two aspects of the MH algorithm that affect the length of the burn-in (and convergence) are the
starting values of model parameters or, in other words, a starting state and a proposal distribution.
bayesmh has the default burn-in of 2,500 iterations, but you can change it by specifying the burnin()
option. For the proposal distribution, bayesmh uses a Gaussian distribution with a zero mean and a covariance matrix that
is updated with current sample values during the adaptation period. You can control the proposal
distribution by changing the initial scale factor in option scale() and an initial scale matrix in option
covariance(); see Adaptation of the MH algorithm.
For the starting values of a single chain, bayesmh uses MLEs whenever available, but you can
specify your own initial values in option initial(); see Specifying initial values. Good initial values
help to achieve fast convergence of MCMC and bad initial values may slow convergence down. A
common approach for eliminating the dependence of the chain on the initial values is to discard an
initial part of the simulated sample: a burn-in period. The burn-in period must be sufficiently large
for a chain to “forget” its initial state and approach its stationary distribution or the desired posterior
distribution.
There are some researchers (for example, Geyer [2011]) who advocate that any starting point in
the posterior domain is equally good and there should be no burn-in. While this is a sensible approach
for a fixed, nonadaptive MH algorithm, it may not be as sensible for an adaptive MH algorithm because
the proposal distribution is changing (possibly drastically) during the adaptation period. Therefore,
adaptive iterations are better discarded from the analysis MCMC sample, and thus it is recommended
that the burn-in period be at least as long as the adaptation period. (There are adaptive regimes such
as continuous adaptation in which adaptation continues after the burn-in period as well.)
In addition to fast convergence, an “ideal” MCMC chain will also have good mixing (or low
autocorrelation). A good mixing can be viewed as a rapid movement of the chain around the parameter
space. High autocorrelation in MCMC and consequently low efficiencies are usually indications of bad
mixing. To improve the mixing of the chain, you may need to improve the efficiency of the algorithm
(see Improving efficiency of the MH algorithm—blocking of parameters) or sometimes reparameterize
your model. In the presence of high autocorrelation, you may also consider subsampling or thinning
the chain, option thinning(), to reduce autocorrelation, but this may not always be the best approach.
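The trade-off behind thinning can be illustrated outside Stata. The following Python sketch is not bayesmh code: the AR(1) series and the names autocorr() and ar1_chain() are hypothetical stand-ins for a slowly mixing chain. It estimates the lag-1 autocorrelation before and after keeping every 10th draw.

```python
import random

def autocorr(x, lag):
    """Sample autocorrelation of sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

def ar1_chain(n, rho, seed):
    """Hypothetical stand-in for a slowly mixing chain: x[t] = rho*x[t-1] + noise."""
    rng = random.Random(seed)
    x, chain = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain

chain = ar1_chain(20000, rho=0.9, seed=14)
thinned = chain[::10]  # keep every 10th draw

print(len(chain), round(autocorr(chain, 1), 2))
print(len(thinned), round(autocorr(thinned, 1), 2))
```

The retained draws are much less correlated, but 90% of the simulated values are thrown away, which is one reason thinning may not always be the best approach.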
Even when the chain appears to have converged and has good mixing, you may still have a case
of pseudoconvergence, which is common for multimodal posterior distributions. Specifying different
sets of initial values may help detect pseudoconvergence.
Multiple chains are often used to assess the convergence of MCMC; see Convergence diagnostics
using multiple chains and Balov (2016c). For more information about convergence of MCMC and
its diagnostics, see Convergence diagnostics of MCMC in [BAYES] Intro, [BAYES] bayesgraph,
[BAYES] bayesstats ess, and [BAYES] bayesstats grubin.
In what follows, we concentrate on demonstrating various specifications of bayesmh, which may
not always correspond to the optimal Bayesian analysis for the considered problem. In addition,
although we skip checking convergence for some of our models to keep the exposition short, it is
important that you always check the convergence of all parameters in your model in your analysis
before you make any inferential conclusions. If you are also interested in any functions of model
parameters, you must check convergence of those functions as well.
Video examples
Introduction to Bayesian statistics, part 1: The basic concepts
Introduction to Bayesian statistics, part 2: MCMC and the Metropolis–Hastings algorithm
Likelihood:
mpg ~ normal({mpg:_cons},36)
Prior:
{mpg:_cons} ~ 1 (flat)
Equal-tailed
mpg Mean Std. dev. MCSE Median [95% cred. interval]
bayesmh first reports the summary of the model. The likelihood model specified for mpg is normal
with mean {mpg:_cons} and fixed variance of 36. The prior for {mpg:_cons} is flat or completely
noninformative.
Our model is very simple, so its summary is very short. For other models, the model summary
may get very long. You can use the nomodelsummary option to suppress it from the output.
It is useful, however, to review the model summary before estimation for models with many
parameters and complicated specifications. You can use the dryrun option to see the model summary
without estimation. Once you verified the correctness of your model specification, you can specify
nomodelsummary during estimation.
Next, bayesmh reports the header including the title for the fitted model, the used MCMC
algorithm, and various numerical summaries of the sampling procedure. bayesmh performed 12,500
MCMC iterations, of which 2,500 were discarded as burn-in iterations and the next 10,000 iterations
were kept in the final MCMC sample. The overall AR is 0.42, meaning that 42% of the 10,000 proposed
parameter values were accepted by the algorithm. This is a good AR for the MH algorithm. Values
below 10% may be a cause for concern and may indicate problems with convergence of MCMC. Very
low ARs may also mean high autocorrelation. The efficiency is 0.23 and is also considered good for
the MH algorithm. Efficiencies below 1% should be investigated further and would require further
tuning of the algorithm and possibly revisiting the considered model.
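To make the acceptance rate concrete, here is a minimal random-walk Metropolis sketch in Python. It is not the adaptive algorithm that bayesmh implements: the proposal scale is fixed, the data are synthetic, and the function name rw_metropolis() is hypothetical. The target mimics the model above, a normal mean with known variance 36 and a flat prior.

```python
import math
import random

def rw_metropolis(log_post, start, scale, n_iter, burn_in, seed):
    """Random-walk Metropolis for one parameter with a fixed proposal scale.

    Returns the retained draws and the acceptance rate after burn-in.
    """
    rng = random.Random(seed)
    current, current_lp = start, log_post(start)
    draws, accepted = [], 0
    for it in range(n_iter):
        proposal = current + rng.gauss(0.0, scale)
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
            if it >= burn_in:
                accepted += 1
        if it >= burn_in:
            draws.append(current)
    return draws, accepted / (n_iter - burn_in)

# Synthetic data (an assumption for illustration, not the auto dataset).
data_rng = random.Random(1)
y = [data_rng.gauss(21.0, 6.0) for _ in range(74)]
ybar = sum(y) / len(y)

def log_post(mu):
    # Flat prior, so the log posterior is the log likelihood up to a constant.
    return -sum((v - mu) ** 2 for v in y) / (2 * 36.0)

draws, ar = rw_metropolis(log_post, start=ybar, scale=1.7,
                          n_iter=12500, burn_in=2500, seed=14)
post_mean = sum(draws) / len(draws)
print(round(ar, 2), round(post_mean, 2), round(ybar, 2))
```

A much smaller scale would push the acceptance rate toward 1 with tiny moves, and a much larger one would push it toward 0. Both extremes tend to produce high autocorrelation.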
Finally, bayesmh reports an estimation table that includes the posterior mean, posterior standard
deviation, MCMC standard error (MCSE), posterior median, and the 95% credible interval.
The estimated posterior mean for {mpg:_cons} is 21.298 with a posterior standard deviation of
0.70. The efficiency of the estimator of the posterior mean is about 23%, which is relatively high
for the random-walk MH sampling. In general, you should expect to see lower efficiencies from this
algorithm for models with more parameters. The MCSE, which is an approximation of the error in
estimating the true posterior mean, is about 0.015. Therefore, provided that the MCMC simulation has
converged, the posterior mean of the constant is accurate to 1 decimal place, 21.3. If you want an
estimation precision of, say, 2 decimal places, you may need to increase the MCMC sample size
10 times; that is, use mcmcsize(100000).
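The numbers in this paragraph can be checked with a short calculation. The Python sketch below assumes that the MCSE is approximately the posterior standard deviation divided by the square root of the effective sample size, with the effective sample size taken as efficiency times MCMC sample size. The function name mcse() is hypothetical.

```python
import math

def mcse(post_sd, efficiency, mcmc_size):
    """Approximate MCSE as posterior sd / sqrt(effective sample size)."""
    ess = efficiency * mcmc_size
    return post_sd / math.sqrt(ess)

current = mcse(0.70, 0.23, 10_000)    # values reported in this example
larger = mcse(0.70, 0.23, 100_000)    # same chain quality, 10 times more draws
print(round(current, 4), round(larger, 4), round(current / larger, 2))
```

Under these assumptions, the first value is close to the reported 0.015, and a tenfold increase in sample size shrinks the MCSE by a factor of about 3.16 (the square root of 10), to roughly 0.005.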
The estimated posterior mean and median are very close, suggesting that the posterior distribution
of {mpg:_cons} may be symmetric. In fact, the posterior distribution of a mean in this model is
known to be a normal distribution.
According to the reported 95% credible interval, the probability that the mean of mpg is between
19.9 and 22.7 is about 0.95. You can use the clevel() option to change the default credible level;
also see [BAYES] set clevel.
Because we used a completely noninformative prior, our results should be the same as frequentist
results. In this Bayesian model, the posterior distribution of the constant parameter is known to be
normal with a mean equal to the sample average. In the frequentist domain, the MLE of the constant
is also the sample average, so the posterior mean estimate and the MLE should be the same in this
model.
The sample average of mpg is 21.2973. Our posterior mean estimate is 21.298, which is very close.
The two are not exactly the same because we estimated the posterior mean of the constant based
on an MCMC sample simulated from its posterior distribution instead of using the known formula.
Closed-form expressions for posterior mean estimators are available only for some Bayesian models.
In general, posterior distributions of parameters are unknown and posterior summaries may only be
estimated from the MCMC samples of parameters.
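For this particular model, the known formula is easy to evaluate. With a flat prior and known variance, the posterior of the mean is normal with mean equal to the sample average and variance equal to the known variance divided by the number of observations. The Python sketch below uses the sample average 21.2973 and variance 36 from this example, and assumes n = 74, the number of observations reported in output elsewhere in this entry.

```python
import math
from statistics import NormalDist

n, ybar, sigma2 = 74, 21.2973, 36.0

post_sd = math.sqrt(sigma2 / n)                 # exact posterior standard deviation
posterior = NormalDist(mu=ybar, sigma=post_sd)  # exact posterior of the mean
lower = posterior.inv_cdf(0.025)                # equal-tailed 95% credible interval
upper = posterior.inv_cdf(0.975)
print(round(post_sd, 3), round(lower, 2), round(upper, 2))
```

The exact values agree with the MCMC estimates above: a posterior standard deviation of about 0.70 and a 95% credible interval of roughly 19.9 to 22.7.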
In practice, we must verify the convergence of MCMC before making any inferential conclusions
about the obtained results.
We start by looking at various graphical diagnostics as produced by bayesgraph diagnostics.
. bayesgraph diagnostics {mpg:_cons}
(graph omitted: diagnostic plots for mpg:_cons showing the trace against iteration number, histogram,
autocorrelation against lag, and kernel density for the full sample, first half, and second half)
This is a “perfect” trace plot. It does not exhibit any trends, and it traverses the
distribution quickly. The chain is centered around 21.3, but also explores the portions of the distribution
where the density is low, which is indicative of good mixing of the chain. The autocorrelation dies
off very quickly. The posterior distribution looks normal. The kernel density estimates based on the
first and second halves of the sample are very similar to each other and are close to the overall
density estimate. We can see that MCMC converged and mixes well. See [BAYES] bayesgraph for
details about this command.
See Convergence diagnostics using multiple chains for an example of using multiple chains to assess
convergence. Also see Convergence diagnostics of MCMC for more discussion about convergence of
MCMC.
Likelihood:
mpg ~ normal({mpg:_cons},36)
Prior:
{mpg:_cons} ~ normal(25,10)
Equal-tailed
mpg Mean Std. dev. MCSE Median [95% cred. interval]
Compared with example 1, our results change only slightly: the estimated posterior mean is 21.48
with a posterior standard deviation of 0.68. The 95% credible interval is [20.1, 22.82].
The reason we obtained such similar results is that our specified prior is in close agreement with
what we observed in this sample. The prior mean of 25 with a standard deviation of $\sqrt{10} \approx 3.16$
overlaps greatly with what we observe for {mpg:_cons} in the data.
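Because the normal prior is conjugate here, the posterior can also be computed directly, which makes the small effect of this prior visible. The Python sketch below applies the standard precision-weighted update for a normal mean with known variance. The function name is hypothetical, and n = 74 is assumed from output shown elsewhere in this entry.

```python
def normal_known_variance_posterior(prior_mean, prior_var, ybar, n, sigma2):
    """Posterior mean and variance of a normal mean with known variance sigma2."""
    prior_prec = 1.0 / prior_var   # precision contributed by the prior
    data_prec = n / sigma2         # precision contributed by the data
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * ybar)
    return post_mean, post_var

post_mean, post_var = normal_known_variance_posterior(
    prior_mean=25.0, prior_var=10.0, ybar=21.2973, n=74, sigma2=36.0)
print(round(post_mean, 2), round(post_var ** 0.5, 2))
```

The exact posterior mean is about 21.47 with a standard deviation of about 0.68, in line with the MCMC estimates reported above.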
If we place a very strong prior on the value for the mean by, for example, substantially decreasing
the variance of the normal prior distribution,
. set seed 14
. bayesmh mpg, likelihood(normal(36)) prior({mpg:_cons}, normal(25,0.1))
Burn-in ...
Simulation ...
Model summary
Likelihood:
mpg ~ normal({mpg:_cons},36)
Prior:
{mpg:_cons} ~ normal(25,0.1)
Equal-tailed
mpg Mean Std. dev. MCSE Median [95% cred. interval]
we obtain very different results. Now the posterior mean and standard deviation estimates are very
close to their prior values, as one would expect with such strong prior information.
Which results are correct? The answer depends on how confident we are in our prior knowledge.
If we previously observed many samples in which the average mileage for the considered population
of cars was essentially 25, our last results are consistent with this and the information about the
mean of {mpg:_cons} contained in the observed sample was not enough to counteract our belief.
If, on the other hand, we had no prior information about the mean mileage, then we would use a
noninformative or mildly informative prior in our Bayesian analysis. Also, if we believe that our
observed data should have more weight in our analysis, we would not specify a very strong prior.
Example 3: Noninformative normal prior for the mean when variance is known
In example 1, we used a completely noninformative, flat prior for {mpg:_cons}. In example 2,
we considered a conjugate normal prior for {mpg:_cons}. We also saw that by varying the variance
of the normal prior distribution, we could control the “informativeness” of our prior. The larger the
variance, the less informative the prior. In fact, if we let the variance approach infinity, we will arrive
at the same posterior distribution of the constant as with the flat prior.
Likelihood:
mpg ~ normal({mpg:_cons},36)
Prior:
{mpg:_cons} ~ normal(0,1000000)
Equal-tailed
mpg Mean Std. dev. MCSE Median [95% cred. interval]
we will obtain results that are very similar to the results from example 1 with the flat prior.
We do not need to use such an extreme value of the variance for the results to become less sensitive
to the prior specification. As we saw in example 2, using the variance of 10 in that example resulted
in very little impact of the prior on the results.
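One way to see why a variance of 10 already has little impact is to compute the weight that the prior mean receives in the posterior mean. In the conjugate known-variance model, this weight is the prior precision divided by the sum of the prior and data precisions. The Python sketch below evaluates it for the three normal priors used in examples 2 and 3, again assuming n = 74 and the sample average 21.2973. The helper prior_weight() is hypothetical.

```python
n, ybar, sigma2 = 74, 21.2973, 36.0

def prior_weight(prior_var):
    """Share of the posterior mean contributed by the prior mean."""
    prior_prec, data_prec = 1.0 / prior_var, n / sigma2
    return prior_prec / (prior_prec + data_prec)

# (prior mean, prior variance) pairs taken from examples 2 and 3
priors = [(25.0, 0.1), (25.0, 10.0), (0.0, 1_000_000.0)]
weights = [prior_weight(v) for _, v in priors]
post_means = [w * m + (1.0 - w) * ybar for (m, _), w in zip(priors, weights)]

for (m, v), w, pm in zip(priors, weights, post_means):
    print(m, v, round(w, 6), round(pm, 3))
```

The strong prior with variance 0.1 carries most of the weight, the prior with variance 10 carries only a few percent, and the prior with variance 1,000,000 carries essentially none, so its posterior mean coincides with the sample average to several decimal places.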
Example 4: Noninformative Jeffreys prior when mean and variance are unknown
A noninformative prior commonly used for the normal model with unknown mean and variance
is the Jeffreys prior, under which the prior for the mean is flat and the prior for the variance is
the reciprocal of the variance. We use the same flat prior for {mpg:_cons} as in example 1 and
specify the jeffreys prior for {var} using a separate prior() statement.
. set seed 14
. bayesmh mpg, likelihood(normal({var}))
> prior({mpg:_cons}, flat) prior({var}, jeffreys)
Burn-in ...
Simulation ...
Model summary
Likelihood:
mpg ~ normal({mpg:_cons},{var})
Priors:
{mpg:_cons} ~ 1 (flat)
{var} ~ jeffreys
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
_cons 21.29222 .6828864 .021906 21.27898 19.99152 22.61904
Because we used a noninformative prior, our results should be similar to the frequentist results apart
from simulation uncertainty. Compared with example 1, the average efficiency of the MH algorithm
decreased to 10%, as is expected with more parameters, but is still considered a good efficiency for
the MH algorithm.
The posterior mean estimate of {mpg:_cons} is close to the OLS estimate of 21.297, and the
posterior standard deviation is close to the standard error of the OLS estimate, 0.673. MCSE is slightly
larger than in example 1 because we have lower efficiency. If we wanted to make MCSE smaller, we
could increase our MCMC sample size. The posterior mean estimate of {var} agrees with the MLE
of the variance, 33.02, but we would not expect the two to be necessarily the same. We estimated the
posterior mean of {var}, not the posterior mode, and because the posterior distribution of {var} is not
symmetric, the two estimates may not be the same.
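The asymmetry can be quantified with a standard conjugate-analysis result that is not stated in this entry and is used here as an assumption: under the flat prior for the mean and the Jeffreys prior for the variance, the marginal posterior of the variance is inverse gamma with shape (n − 1)/2 and scale SS/2, where SS is the sum of squared deviations from the sample mean. The Python sketch below takes the MLE of 33.02 quoted above and assumes n = 74.

```python
n, var_mle = 74, 33.02

ss = n * var_mle                     # the MLE of the variance is SS / n
shape, scale = (n - 1) / 2, ss / 2   # assumed inverse-gamma posterior of the variance
post_mean = scale / (shape - 1)      # inverse-gamma mean, SS / (n - 3)
post_mode = scale / (shape + 1)      # inverse-gamma mode, SS / (n + 1)
print(round(post_mode, 2), var_mle, round(post_mean, 2))
```

Under this assumption, the posterior mode falls slightly below the MLE and the posterior mean somewhat above it, which is the pattern expected from a right-skewed posterior.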
Again, as with any MCMC analysis, we must verify the convergence of our MCMC sample before
we can trust our results.
. bayesgraph diagnostics _all
(graph omitted: diagnostic plots for mpg:_cons showing the trace against iteration number, histogram,
autocorrelation against lag, and kernel density for the full sample, first half, and second half)
(graph omitted: the same four diagnostic plots for var)
Graphical diagnostic plots do not show any signs of nonconvergence for either of the parameters. We
can also check convergence more formally using multiple chains; see [BAYES] bayesstats grubin and
Convergence diagnostics using multiple chains.
Recall that to assess convergence of MCMC, we must explore convergence for all model parameters.
Example 5: Informative conjugate prior when mean and variance are unknown
For a normal distribution with unknown mean and variance, the informative conjugate prior is a
normal prior for the mean and an inverse-gamma prior for the variance. Specifically, if $y \sim N(\mu, \sigma^2)$,
then the informative conjugate prior for the parameters is
$$\mu \mid \sigma^2 \sim N(\mu_0, \sigma^2)$$
$$\sigma^2 \sim \mathrm{InvGamma}(\nu_0/2,\ \nu_0\sigma_0^2/2)$$
where $\mu_0$ is the prior mean of the normal distribution and $\nu_0$ and $\sigma_0^2$ are the prior degrees of freedom
and prior variance for the inverse-gamma distribution. Let’s assume $\mu_0 = 25$, $\nu_0 = 10$, and $\sigma_0^2 = 30$.
Notice that in the specification of the prior for {mpg:_cons}, we specify the parameter {var}
as the variance of the normal distribution. We use igamma(5,150) as the prior for the variance
parameter {var}.
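The arguments of igamma(5,150) follow from the assumed prior degrees of freedom and prior variance. The Python sketch below spells out the mapping; the helper name igamma_params() is hypothetical. It also reports the implied prior mean of the variance, using the inverse-gamma mean scale/(shape − 1), which is defined for shape greater than 1.

```python
def igamma_params(prior_df, prior_var):
    """Map (nu0, sigma0^2) to the shape and scale of InvGamma(nu0/2, nu0*sigma0^2/2)."""
    return prior_df / 2, prior_df * prior_var / 2

shape, scale = igamma_params(prior_df=10, prior_var=30)
prior_mean_of_var = scale / (shape - 1)
print(shape, scale, prior_mean_of_var)
```

With 10 prior degrees of freedom and a prior variance of 30, the shape is 5 and the scale is 150, and the implied prior mean of the variance is 37.5.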
. set seed 14
. bayesmh mpg, likelihood(normal({var}))
> prior({mpg:_cons}, normal(25,{var}))
> prior({var}, igamma(5,150))
Burn-in ...
Simulation ...
Model summary
Likelihood:
mpg ~ normal({mpg:_cons},{var})
Priors:
{mpg:_cons} ~ normal(25,{var})
{var} ~ igamma(5,150)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
_cons 21.314 .6639278 .02097 21.29516 20.08292 22.63049
Compared with example 4, the variance is slightly smaller, but the results are still very similar.
Example 6: Noninformative inverse-gamma prior when mean and variance are unknown
The Jeffreys prior for the variance from example 4 can be viewed as a limiting case of an
inverse-gamma distribution with the degrees of freedom approaching zero.
Indeed, if we replace the jeffreys prior in example 4 with an inverse-gamma distribution with
very small degrees of freedom,
. set seed 14
. bayesmh mpg, likelihood(normal({var}))
> prior({mpg:_cons}, flat)
> prior({var}, igamma(0.0001,0.0001))
Burn-in ...
Simulation ...
Model summary
Likelihood:
mpg ~ normal({mpg:_cons},{var})
Priors:
{mpg:_cons} ~ 1 (flat)
{var} ~ igamma(0.0001,0.0001)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
_cons 21.29223 .6828811 .021905 21.27899 19.99154 22.61903
we obtain results that are very close to the results from example 4.
We will have three model parameters: the slope and the intercept for the linear predictor and the
variance parameter for the error term. Regression parameters, {mpg:weight} and {mpg:_cons},
will be declared implicitly by bayesmh, but we will need to explicitly specify the variance parameter
{var}. We will also need to assign appropriate priors for all parameters.
Likelihood:
mpg ~ normal(xb_mpg,{var})
Priors:
{mpg:weight _cons} ~ 1 (flat) (1)
{var} ~ jeffreys
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.6019838 .0512557 .001817 -.6018433 -.7015638 -.5021532
_cons 39.47227 1.589082 .058601 39.49735 36.26465 42.43594
Our model summary shows the likelihood model for mpg, flat priors for the two regression coefficients,
and a Jeffreys prior for the variance parameter. Now that we have a covariate in the model, the mean
of the normal distribution is labeled as xb_mpg to emphasize that it is now a linear combination of
independent variables. Regression coefficients involved in the linear predictor are marked with (1)
on the right.
The results are again very similar to the frequentist results. Posterior mean estimates of the
coefficients are very similar to the OLS estimates obtained by using regress below. Posterior
standard deviations are similar to the standard errors from regress.
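$$\beta_{\text{weight}} \mid \sigma^2 \sim N(\mu_{\text{weight}}, \sigma^2)$$
$$\beta_{\text{cons}} \mid \sigma^2 \sim N(\mu_{\text{cons}}, \sigma^2)$$
$$\sigma^2 \sim \mathrm{InvGamma}(\nu_0/2,\ \nu_0\sigma_0^2/2)$$
where regression coefficients have different means but equal variances. $\mu_{\text{weight}}$ and $\mu_{\text{cons}}$ are the
prior means of the normal distributions, and $\nu_0$ and $\sigma_0^2$ are the prior degrees of freedom and prior
variance for the inverse-gamma distribution. Let’s assume $\mu_{\text{weight}} = -0.5$, $\mu_{\text{cons}} = 40$, $\nu_0 = 10$,
and $\sigma_0^2 = 10$.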
. set seed 14
. bayesmh mpg weight, likelihood(normal({var}))
> prior({mpg:weight}, normal(-0.5,{var}))
> prior({mpg:_cons}, normal(40,{var}))
> prior({var}, igamma(5,50))
Burn-in ...
Simulation ...
Model summary
Likelihood:
mpg ~ normal(xb_mpg,{var})
Priors:
{mpg:weight} ~ normal(-0.5,{var}) (1)
{mpg:_cons} ~ normal(40,{var}) (1)
{var} ~ igamma(5,50)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.6074375 .0480685 .001916 -.6078379 -.6991818 -.5119767
_cons 39.65274 1.499741 .05696 39.63501 36.59486 42.47547
For this mildly informative prior, our regression coefficients are still very similar to the results obtained
using the noninformative prior in example 7, but the variance estimate is slightly smaller.
. set seed 14
. bayesmh mpg weight, likelihood(normal({var}))
> prior({mpg:}, zellnersg(2,30,-0.5,40,{var}))
> prior({var}, igamma(5,50))
Burn-in ...
Simulation ...
Model summary
Likelihood:
mpg ~ normal(xb_mpg,{var})
Priors:
{mpg:weight _cons} ~ zellnersg(2,30,-0.5,40,{var}) (1)
{var} ~ igamma(5,50)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.6004123 .0510882 .001595 -.5998094 -.7040552 -.5058665
_cons 39.55017 1.590016 .050051 39.49377 36.56418 42.79701
The results are now closer to the results obtained with the noninformative prior in example 7 because
we are introducing some information from the observed data by using $(X'X)^{-1}$.
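To show where the data enter this prior, the Python sketch below builds the covariance of a Zellner-type g-prior, g times the error variance times $(X'X)^{-1}$, for a design matrix with one covariate and a constant. The covariate values, the error variance of 10, and the helper xtx_inverse() are hypothetical and chosen only for illustration; g = 30 matches the second argument of zellnersg() above.

```python
def xtx_inverse(x_values):
    """Inverse of X'X for a design matrix with columns [x, 1]."""
    n = len(x_values)
    sx = sum(x_values)
    sxx = sum(v * v for v in x_values)
    det = sxx * n - sx * sx   # determinant of [[sxx, sx], [sx, n]]
    return [[n / det, -sx / det], [-sx / det, sxx / det]]

weights = [29.3, 33.5, 26.4, 32.5, 40.8]   # hypothetical covariate values
g, sigma2 = 30.0, 10.0                     # sigma2 is a hypothetical error variance

S = xtx_inverse(weights)
prior_cov = [[g * sigma2 * s for s in row] for row in S]
for row in prior_cov:
    print([round(v, 3) for v in row])
```

Because the covariance is built from the covariates themselves, the prior inherits the correlation structure of the least-squares estimates, including the negative covariance between the slope and the intercept.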
Then, we can use the multivariate normal prior mvnormal() with the variance specified as an
expression 30*{var}*S.
. set seed 14
. bayesmh mpg weight, likelihood(normal({var}))
> prior({mpg:}, mvnormal(2,-0.5,40,30*{var}*S))
> prior({var}, igamma(5,50))
Burn-in ...
Simulation ...
Model summary
Likelihood:
mpg ~ normal(xb_mpg,{var})
Priors:
{mpg:weight _cons} ~ mvnormal(2,-0.5,40,30*{var}*S) (1)
{var} ~ igamma(5,50)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.6004123 .0510882 .001595 -.5998094 -.7040552 -.5058665
_cons 39.55017 1.590016 .050051 39.49377 36.56418 42.79701
Then, we use the mvnscaled() prior with mean values −0.5 and 40, scale matrix A, and variance
parameter {var}.
. set seed 14
. bayesmh mpg weight, likelihood(normal({var}))
> prior({mpg:}, mvnscaled(2,-0.5,40,A,{var}))
> prior({var}, igamma(5,50))
Burn-in ...
Simulation ...
Model summary
Likelihood:
mpg ~ normal(xb_mpg,{var})
Priors:
{mpg:weight _cons} ~ mvnscaled(2,-0.5,40,A,{var}) (1)
{var} ~ igamma(5,50)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.6004123 .0510882 .001595 -.5998094 -.7040552 -.5058665
_cons 39.55017 1.590016 .050051 39.49377 36.56418 42.79701
In this section, we demonstrate how one can improve efficiency of the MH algorithm by using
blocking of parameters and Gibbs sampling, whenever available. We continue with our simple linear
regression of mpg on rescaled weight from Simple linear regression, but we use different values for
the parameters of prior distributions. We also assume that regression coefficients and the variance
parameter are independent a priori. We use the blocksummary option to include a summary about
each block.
Our first simulation is performed using the default settings for the algorithm. Specifically, all three
model parameters are placed in one simulation block and are updated simultaneously, as our block
summary indicates.
. use https://www.stata-press.com/data/r18/auto
(1978 automobile data)
. replace weight = weight/100
variable weight was int now float
(74 real changes made)
. set seed 14
. bayesmh mpg weight, likelihood(normal({var}))
> prior({mpg:}, normal(0,100))
> prior({var}, igamma(10,10)) blocksummary
Burn-in ...
Simulation ...
Model summary
Likelihood:
mpg ~ normal(xb_mpg,{var})
Priors:
{mpg:weight _cons} ~ normal(0,100) (1)
{var} ~ igamma(10,10)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.5759855 .0471288 .001569 -.5750919 -.6676517 -.4868595
_cons 38.65481 1.468605 .048784 38.70029 35.88062 41.49839
The mean estimates based on the simulated sample are {mpg:weight} = −0.58, {mpg:_cons}
= 38.65, and {var} = 9.8. The MH algorithm achieves an overall AR of 24% and an average
efficiency of about 8%.
Our next step is to perform a visual inspection of the convergence of the chain.
. bayesgraph diagnostics {var}
(graph omitted: diagnostic plots for var showing the trace against iteration number, histogram,
autocorrelation against lag, and kernel density for the full sample, first half, and second half)
A graphical summary for the {var} parameter does not show any obvious problems. The trace plot
reveals a good coverage of the domain of the marginal distribution, while the histogram and kernel
density plots resemble the shape of an expected inverse-gamma distribution. The autocorrelation dies
off after about lag 20.
(graph omitted: scatterplot matrix of mpg:weight, mpg:_cons, and var)
The scatterplots reveal high correlation between {mpg:weight} and {mpg:_cons}. On the other
hand, there is no significant correlation between {var} and the other two parameters.
In cases like this, we can expect higher sampling efficiency if we place {var} in a separate block.
We can do this by including the option block({var}). The other two parameters, {mpg:weight}
and {mpg:_cons}, will be automatically considered as a second block.
. set seed 14
. bayesmh mpg weight, likelihood(normal({var}))
> prior({mpg:}, normal(0,100))
> prior({var}, igamma(10,10))
> block({var}) blocksummary
Burn-in ...
Simulation ...
Model summary
Likelihood:
mpg ~ normal(xb_mpg,{var})
Priors:
{mpg:weight _cons} ~ normal(0,100) (1)
{var} ~ igamma(10,10)
1: {var}
2: {mpg:weight _cons}
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.5744536 .0450094 .001484 -.576579 -.663291 -.4853636
_cons 38.59206 1.397983 .04654 38.63252 35.80229 41.32773
In this second run, we achieve higher simulation efficiency, about 12% on average. The MCSE for
{var} is 0.034 and is about half the value of 0.058 from example 11, which leads to twice as much
accuracy in the estimation of the posterior mean of {var}.
Again, we can verify the convergence of the MCMC run for {var} by inspecting the bayesgraph
diagnostics plot.
. bayesgraph diagnostics {var}
(graph omitted: diagnostic plots for var showing the trace against iteration number, histogram,
autocorrelation against lag, and kernel density for the full sample, first half, and second half)
The improved sampling efficiency for {var} is evident by observing that the autocorrelation becomes
negligible after about lag 10. The trace plot reveals more rapid traversing of the marginal posterior
domain as well.
Likelihood:
mpg ~ normal(xb_mpg,{var})
Priors:
{mpg:weight _cons} ~ normal(0,100) (1)
{var} ~ igamma(10,10)
1: {var} (Gibbs)
2: {mpg:weight _cons}
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.5764752 .0457856 .001324 -.5764938 -.6654439 -.486788
_cons 38.64148 1.438705 .04259 38.6177 35.82136 41.38734
The average efficiency is now 0.33 with the maximum of 0.74 corresponding to the variance parameter.
(graph omitted: diagnostic plots for var showing the trace against iteration number, histogram,
autocorrelation against lag, and kernel density for the full sample, first half, and second half;
autocorrelations stay within about ±0.02 at all lags)
                     ESS   Corr. time   Efficiency
mpg
      weight     1195.57         8.36       0.1196
       _cons     1141.12         8.76       0.1141
For example, diagnostic plots for {mpg:weight} do not look as good as diagnostic plots for
the variance parameter in example 13.
. bayesgraph diagnostics {mpg:weight}
(graph omitted: diagnostic plots for mpg:weight showing the trace against iteration number, histogram,
autocorrelation against lag, and kernel density for the full sample, first half, and second half)
Further improvement of the mixing can be achieved by requesting Gibbs sampling for the two
blocks of parameters: regression coefficients and variance. Again, this is possible only because
{mpg:weight}, {mpg:_cons}, and {var} have normal and inverse-gamma priors, respectively, which are
independent and are semiconjugate in this model.
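The idea behind Gibbs sampling with semiconjugate priors can be sketched for a simpler, intercept-only version of this model. The Python code below is an illustration under stated assumptions, not a description of how bayesmh implements its Gibbs updates: the data are synthetic, the function name gibbs_normal() is hypothetical, and only the priors normal(0,100) and igamma(10,10) are taken from this example. Each block is drawn exactly from its full conditional distribution, so no proposals are rejected.

```python
import random

def gibbs_normal(y, m0, v0, a, b, n_iter, burn_in, seed):
    """Two-block Gibbs sampler for y ~ N(mu, var) with independent priors
    mu ~ N(m0, v0) and var ~ InvGamma(a, b)."""
    rng = random.Random(seed)
    n, ybar = len(y), sum(y) / len(y)
    mu, var = ybar, 1.0
    mu_draws, var_draws = [], []
    for it in range(n_iter):
        # Block 1: mu given var is normal with precision-weighted mean.
        prec = 1.0 / v0 + n / var
        mean = (m0 / v0 + n * ybar / var) / prec
        mu = rng.gauss(mean, prec ** -0.5)
        # Block 2: var given mu is inverse gamma; draw it as scale / Gamma(shape, 1).
        shape = a + n / 2.0
        scale = b + sum((v - mu) ** 2 for v in y) / 2.0
        var = scale / rng.gammavariate(shape, 1.0)
        if it >= burn_in:
            mu_draws.append(mu)
            var_draws.append(var)
    return mu_draws, var_draws

# Synthetic data (an assumption for illustration, not the auto dataset).
data_rng = random.Random(1)
y = [data_rng.gauss(21.0, 6.0) for _ in range(74)]

mu_draws, var_draws = gibbs_normal(y, m0=0.0, v0=100.0, a=10.0, b=10.0,
                                   n_iter=12500, burn_in=2500, seed=14)
mu_mean = sum(mu_draws) / len(mu_draws)
var_mean = sum(var_draws) / len(var_draws)
print(round(mu_mean, 2), round(var_mean, 2))
```

Because every draw comes from an exact conditional distribution, successive draws in a model like this are typically close to uncorrelated, which is the source of the efficiency gain.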
To request Gibbs sampling for the regression coefficients, we must place them in a separate block.
. set seed 14
. bayesmh mpg weight, likelihood(normal({var}))
> prior({mpg:}, normal(0,100))
> prior({var}, igamma(10,10))
> block({var}, gibbs)
> block({mpg:}, gibbs) blocksummary
Burn-in ...
Simulation ...
Model summary
Likelihood:
mpg ~ normal(xb_mpg,{var})
Priors:
{mpg:weight _cons} ~ normal(0,100) (1)
{var} ~ igamma(10,10)
1: {var} (Gibbs)
2: {mpg:weight _cons} (Gibbs)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.5751071 .0467837 .000468 -.5757037 -.6659412 -.4823263
_cons 38.61033 1.459511 .014595 38.61058 35.79156 41.45336
Now we have nearly perfect sampling efficiency (with an average of 0.98) with essentially no autocorrelation.
The estimators of posterior means have the lowest MCSEs among the four simulations.
For example, diagnostic plots for {mpg:weight} now look noticeably better.
. bayesgraph diagnostics {mpg:weight}
(graph omitted: diagnostic plots for mpg:weight showing the trace against iteration number, histogram,
autocorrelation against lag, and kernel density for the full sample, first half, and second half;
autocorrelations stay within about ±0.02 at all lags)
You can verify that the diagnostic plots of all parameters demonstrate almost perfect mixing as
well.
. bayesgraph diagnostics _all
(output omitted )
Let’s continue with the Bayesian multiple linear regression model from example 11. We specify
the nchains(4) option to simulate four Markov chains of default size 10,000. We use the rseed()
option to ensure reproducibility when running multiple chains. Specifying set seed is not sufficient
in this case; see Reproducing results. We also use nomodelsummary to suppress the output of the
model summary.
. bayesmh mpg weight, likelihood(normal({var}))
> prior({mpg:}, normal(0,100)) prior({var}, igamma(10,10))
> nomodelsummary nchains(4) rseed(16)
Chain 1
Burn-in ...
Simulation ...
Chain 2
Burn-in ...
Simulation ...
Chain 3
Burn-in ...
Simulation ...
Chain 4
Burn-in ...
Simulation ...
Bayesian normal regression Number of chains = 4
Random-walk Metropolis--Hastings sampling Per MCMC chain:
Iterations = 12,500
Burn-in = 2,500
Sample size = 10,000
Number of obs = 74
Avg acceptance rate = .2275
Avg efficiency: min = .07897
avg = .08265
max = .08827
Avg log marginal-likelihood = -226.73271 Max Gelman--Rubin Rc = 1.002
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.5749136 .0463642 .000816 -.5760212 -.6649088 -.4847602
_cons 38.59661 1.447703 .025758 38.62636 35.7311 41.40999
The important change in the output header of bayesmh with multiple chains is the presence of
the maximum Gelman–Rubin convergence statistic, Max Gelman--Rubin Rc. This is the maximum
value of the statistics across all model parameters. A convergence rule often used in practice is to
declare convergence when convergence statistics of all model parameters are less than 1.1. In our
example, the maximum statistic of 1.002 is less than 1.1, so the convergence rule is satisfied. See
[BAYES] bayesstats grubin for details. Of course, it is important to also inspect convergence visually,
as we demonstrate later in this example.
Because there are multiple simulation chains, bayesmh reports the simulation summaries averaged
over the chains such as the average acceptance rate, average efficiencies, and the average log
marginal-likelihood. You can use the chainsdetail option to see those summaries separately for
each chain.
The average simulation efficiency for all chains is above 8% and seems adequate. The Gelman–Rubin
convergence rule is met. There is no indication of convergence problems. Nevertheless, inspecting the
simulation chains visually can provide additional reassurance. For instance, by comparing the trace
plots of different simulation sequences for a model parameter, we can detect convergence irregularities
and assess the overlap of the simulated marginal distributions for this parameter. If Markov chains
have converged, we should not observe substantial differences between the trace plots or between the
sampled marginal distributions.
For a single chain, we used bayesgraph diagnostics to explore the convergence of MCMC
visually. We can use this command with multiple chains as well. Let’s plot graphical summaries for
the variance parameter {var}.
. bayesgraph diagnostics {var}
(graph omitted: diagnostic plots for var with chains 1/4 overlaid, showing the trace against iteration
number, histogram, autocorrelation against lag, and kernel density for the full sample, first half,
and second half)
Graphical diagnostics look somewhat messy for multiple chains, but the main takeaway from this
graph is that the results of the chains do not look drastically different. The trace plots overlap, the
autocorrelations die off, and the histograms and density plots are similar for all chains. If desired, you
can produce separate plots or graphs for each chain using bayesgraph’s bychain() or sepchains
option; see [BAYES] bayesgraph.
You can also focus separately on each type of plot. For instance, let’s look more closely at the
trace and density plots.
. bayesgraph trace {var}
(graph omitted: trace of var against iteration number, chains 1/4 overlaid)
The bayesgraph trace command overlays the traces of the simulated chains for convenient visual
comparison of the chains. The trace plots are similar in terms of coverage and variation.
The overlaid density plots shown by bayesgraph kdensity provide another aspect of comparing
multiple simulation sequences.
. bayesgraph kdensity {var}
(graph omitted: density of var, chains 1/4 overlaid)
The density plots of {var} from all chains mostly overlap with some variations about the marginal
mode.
Similarly, we can explore the MCMC convergence visually for other parameters. For example, we
can draw the trace plots for the coefficient parameters {mpg:_cons} and {mpg:weight} and use
bayesgraph’s byparm option to place plots of both parameters on one graph.
. bayesgraph trace {mpg:}, byparm
(graph omitted: trace plots of mpg:weight and mpg:_cons against iteration number, graphed by
parameter, chains 1/4 overlaid)
Again, the overlaid trace plots of {mpg:_cons} and {mpg:weight} do not show any substantial
differences and indicate good mixing of the chains.
We can use the bayesstats grubin command to compute Gelman–Rubin convergence diagnostics
using multiple chains.
. bayesstats grubin
Gelman--Rubin convergence diagnostic
Number of chains = 4
MCMC size, per chain = 10,000
Max Gelman--Rubin Rc = 1.002068
Rc
mpg
weight 1.000783
_cons 1.000557
var 1.002068
Estimates of convergence statistics, Rc, larger than 1.2 indicate possible nonconvergence. In our case,
the Rc estimates for all parameters are very close to 1 and do not raise any convergence concerns.
Note that the largest estimate, 1.002, as reported by bayesmh, corresponds to parameter {var}.
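For orientation, the textbook form of the Gelman–Rubin statistic for M chains of length T compares within-chain and between-chain variability. This is a sketch of the general idea only. The symbols below are introduced here for illustration, and the Rc reported by bayesstats grubin may include further adjustments; see [BAYES] bayesstats grubin for the formula Stata actually uses.

```latex
\begin{aligned}
W &= \frac{1}{M}\sum_{m=1}^{M} s_m^2, &
B &= \frac{T}{M-1}\sum_{m=1}^{M}\bigl(\bar\theta_m - \bar\theta\bigr)^2, \\
\widehat{V} &= \frac{T-1}{T}\,W + \frac{1}{T}\,B, &
R_c &\approx \sqrt{\widehat{V}/W}
\end{aligned}
```

Here $s_m^2$ and $\bar\theta_m$ are the sample variance and mean of chain m, and $\bar\theta$ is the mean of the chain means. When all chains sample the same distribution, B/T is small relative to W and the ratio is close to 1, which is the pattern seen in the table above.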
Once MCMC convergence is established, we can proceed with our estimation results. We replay
them here for your convenience (without the table header information).
. bayesmh, noheader
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.5749136 .0463642 .000816 -.5760212 -.6649088 -.4847602
_cons 38.59661 1.447703 .025758 38.62636 35.7311 41.40999
The summary results in the estimation table are based on all chains. Because we used more chains,
our results are now more precise (have smaller MCSEs) compared with example 11.
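The gain in precision can be checked with a rough calculation. The sketch below assumes that the MCSE is approximately the posterior standard deviation divided by the square root of the effective sample size, and that each of the M chains contributes a similar effective sample size; both are simplifying assumptions made here.

```latex
\mathrm{MCSE} \approx \frac{\mathrm{sd}}{\sqrt{\mathrm{ESS}}},
\qquad
\frac{\mathrm{MCSE}_{\text{one chain}}}{\mathrm{MCSE}_{M\ \text{chains}}} \approx \sqrt{M}
```

For {mpg:weight}, the MCSE for chain 1 alone (0.001611, reported by bayesstats summary, sepchains) divided by the pooled MCSE (0.000816) is about 1.97, close to the value of 2 expected for four chains.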
To inspect posterior summaries of each chain, we can use the bayesstats summary command
with the sepchains option.
. bayesstats summary, sepchains
Posterior summary statistics
Chain 1 MCMC sample size = 10,000
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.5736929 .0458934 .001611 -.5745238 -.6629738 -.4877666
_cons 38.5649 1.425768 .052564 38.60731 35.75694 41.37725
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.5747026 .0456178 .001699 -.5759074 -.6618918 -.4851731
_cons 38.59502 1.441276 .053339 38.57138 35.72466 41.40902
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.5740745 .0468218 .00169 -.576532 -.6631272 -.4817094
_cons 38.57018 1.469792 .053026 38.62822 35.68724 41.37469
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.5771844 .0470114 .001543 -.5773599 -.6678485 -.4862513
_cons 38.65634 1.451485 .047729 38.69004 35.82901 41.49365
The results from all chains are similar. The differences between posterior means, for instance, are
within the ranges of the MCMC standard errors of the estimates.
In the presence of multiple chains, bayesmh displays a note beneath the estimation table about
default initial values being used for the chains. The default initial values are provided for convenience,
and often you may want to specify your own; see Specifying initial values for details. Also see Multiple
chains using overdispersed initial values next.
We continue with our multiple-chains example from Multiple chains using default initial values,
but here we simulate Markov chains using overdispersed initial values. We specify random initial
values manually using the init#() options.
For simplicity, we use only two chains. We generate initial values that are highly overdispersed
and are far away from the maximum-likelihood estimates of model parameters. For the first chain, we
generate initial values for the regression coefficients from the normal distribution with mean 10 and
standard deviation 10 and for the variance from the gamma distribution with shape 1 and scale 50.
For the second chain, we use the same distributions but different parameters, except for the standard
deviation: we use the mean of −10, the standard deviation of 10, the shape of 50, and the scale of 1.
We use the init1() and init2() options, respectively, to specify these initial values. To see the
initial values used, we also specify the initsummary option.
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.5334204 .0939955 .002271 -.5468147 -.6670521 -.3335525
_cons 37.27179 2.977634 .067 37.70683 30.95118 41.41418
Note: There is a high autocorrelation after 500 lags in at least one of the
chains.
The reported maximum Gelman–Rubin convergence statistic, 42.57, is very high and is much larger
than 1. A note beneath the table reports high autocorrelation in one of the chains. Clearly, we have
a problem.
We check the sampling efficiency of the parameters for each chain separately:
. bayesstats ess, sepchains
Efficiency summaries
Chain 1 MCMC sample size = 10,000
Efficiency: min = .07407
avg = .07956
max = .08962
ESS Corr. time Efficiency
mpg
weight 749.91 13.33 0.0750
_cons 740.66 13.50 0.0741
mpg
weight 963.73 10.38 0.0964
_cons 1234.44 8.10 0.1234
The {var} parameter in the second chain has the lowest ESS of 12.53.
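The three numeric columns reported by bayesstats ess (ESS, correlation time, and efficiency) are tied together by simple identities. In the sketch below, T is the MCMC sample size per chain, $\rho_k$ is the lag-k autocorrelation, and K is the number of lags used; the notation is introduced here for illustration.

```latex
\mathrm{ESS} = \frac{T}{1 + 2\sum_{k=1}^{K}\rho_k},
\qquad
\text{Corr. time} = \frac{T}{\mathrm{ESS}},
\qquad
\text{Efficiency} = \frac{\mathrm{ESS}}{T}
```

For {mpg:weight} in chain 1, 749.91/10,000 = 0.0750 and 10,000/749.91 is about 13.33, matching the row above. By the same identities, an ESS of 12.53 corresponds to an efficiency of about 0.0013 and a correlation time of roughly 800 iterations.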
Let’s check the Gelman–Rubin convergence statistics for all parameters.
. bayesstats grubin
Gelman--Rubin convergence diagnostic
Number of chains = 2
MCMC size, per chain = 10,000
Max Gelman--Rubin Rc = 42.57122
Rc
mpg
weight 1.622996
_cons 1.665635
var 42.57122
The Rc estimates for all three parameters exceed the 1.2 threshold, indicating nonconvergence, but
{var} has a particularly large value of the convergence statistic, 42.57.
To investigate the convergence problem further visually, we inspect the trace plots of the {var}
parameter from each chain.
. bayesgraph trace {var}
[Figure: trace of var by iteration number, two chains overlaid; vertical axis from 10 to 50.]
The two trace plots are completely separated and show that the chains explore different domains of
the posterior distribution. The trace plot of the second chain, shown in red, has a mean value of about
45. Given a large initial value for {var} and the stochastic nature of the algorithm, the second chain
did not converge by the default number of 2,500 burn-in iterations.
[Figure: diagnostic plots for var, chain 2. Panels: trace (by iteration number), histogram, autocorrelation (by lag), and density (all, 1-half, 2-half); values range from about 44 to 46.]
In the diagnostic plots of {var} for the second chain, we notice that the autocorrelation stays close
to 1 and the trace plot exhibits slow random-walk behavior, failing to stabilize in a particular region.
When you specify overdispersed initial values, you should give the chains enough time to converge.
This second chain simply has not run long enough to converge to the domain with a high posterior
density. To fix this, we can use a longer burn-in of 10,000, burnin(10000), and longer adaptation
by lowering the adaptation tolerance to 0.002, adaptation(tolerance(0.002)).
. bayesmh mpg weight, likelihood(normal({var}))
> prior({mpg:}, normal(0,100)) prior({var}, igamma(10,10))
> nomodelsummary nchains(2) rseed(16)
> init1({mpg:} rnormal( 10, 10) {var} rgamma(50, 1))
> init2({mpg:} rnormal(-10, 10) {var} rgamma(1, 50))
> burnin(10000) adapt(tolerance(0.002))
Chain 1
Burn-in ...
Simulation ...
Chain 2
Burn-in ...
Simulation ...
Bayesian normal regression Number of chains = 2
Random-walk Metropolis--Hastings sampling Per MCMC chain:
Iterations = 20,000
Burn-in = 10,000
Sample size = 10,000
Number of obs = 74
Avg acceptance rate = .296
Avg efficiency: min = .08096
avg = .09193
max = .1002
Avg log marginal-likelihood = -226.70215 Max Gelman--Rubin Rc = 1.001
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
weight -.5759702 .0461691 .001061 -.5772111 -.665917 -.4826217
_cons 38.64229 1.440565 .032185 38.66686 35.73169 41.42428
The maximum Gelman–Rubin statistic is now only 1.001. We use bayesstats grubin for details.
. bayesstats grubin
Gelman--Rubin convergence diagnostic
Number of chains = 2
MCMC size, per chain = 10,000
Max Gelman--Rubin Rc = 1.001315
Rc
mpg
weight 1.001315
_cons 1.00095
var 1.000061
Bayesian predictions
Bayesian predictions provide a powerful set of tools for model evaluation and assessing goodness
of fit, in addition to predicting future observations; see Overview of Bayesian predictions in
[BAYES] bayespredict for details. You can use bayespredict, bayesreps, and bayesstats ppvalues
to obtain Bayesian predictions and perform model checks. Here we illustrate some of the
features of Bayesian predictions, which are available after fitting a model using bayesmh. We continue
with the Bayesian multiple linear regression model from example 11.
As a quick model check, we can explore the distribution of the replicated outcomes and compare
it with the observed outcome distribution. Replicated outcomes are new outcome values simulated
from the posterior predictive distribution conditional on the observed set of covariates. Generally,
replicated outcomes form a sample of T observations, one for each MCMC replicate, and n variables,
one for each observation in the original data. The entire prediction sample is rarely needed.
Often, it is sufficient to explore a small random subset of the T MCMC replicates. We
can use bayesreps to generate such a subset and save the generated replicates as new variables in
our dataset.
To use bayesreps and bayespredict, we must first save the simulation results from bayesmh.
Let’s refit the linear regression model and save the simulation results in linregsim.dta. We suppress
the output with quietly.
. quietly bayesmh mpg weight, likelihood(normal({var}))
> prior({mpg:}, normal(0,100)) prior({var}, igamma(10,10))
> saving(linregsim) rseed(16)
We can now use bayesreps to generate the replicated outcomes for variable mpg. These will
be samples from the posterior predictive distribution of mpg conditioned on the observed set of
explanatory variables, weight. Each replication sample will be of the same size, 74, as the original
outcome mpg. Let’s generate 5 replication samples and save them in the original dataset as new
variables, mpgrep1 through mpgrep5, specified as the stub mpgrep*.
. bayesreps mpgrep*, nreps(5) rseed(16)
Computing predictions ...
We can visually inspect the histograms of the replicated samples and compare them with the
histogram for the observed mpg.
. quietly histogram mpg, name(hist0) nodraw
. local histlist hist0
. forvalues i = 1/5 {
2. quietly histogram mpgrep`i', name(hist`i') nodraw
3. local histlist `histlist' hist`i'
4. }
. graph combine `histlist'
[Figure: combined density histograms. Top row: Mileage (mpg), Replicate 1 for mpg, Replicate 2 for mpg. Bottom row: Replicate 3 for mpg, Replicate 4 for mpg, Replicate 5 for mpg.]
The histogram of mpg (top, left) looks different from those of the replications. All of them cover the
range of (10, 30), but the observed mpg is skewed to the right and has heavier tails. The normal model
does not appear to capture the observed distribution well. After these initial checks, we proceed with
a more quantitative assessment of model fit.
A posterior predictive check is one of the main applications of Bayesian predictions. It starts with
defining test statistics that represent different aspects of the outcome distribution. Then, these test
statistics are computed using the observed and replicated outcomes, and their values are compared.
For example, the mean, minimum, and maximum statistics can be used for assessing how well the
model represents the outcome distribution with respect to its center and extremes.
We can simulate the mean, minimum, and maximum statistics using bayespredict, which
supports the use of Mata functions to compute functions of simulated outcomes. Thus, we can use
Mata functions mean(), min(), and max() to compute the desired statistics. We specify the argument
{_ysim} with the functions to request statistics of the simulated outcomes (we can also use {_resid}
for residuals). We save the prediction results in mpgsim.dta. See [BAYES] bayespredict for details
about the specification.
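A specification along these lines might look like the following sketch. The labels prmean, prmin, and prmax are chosen to match the names used with bayesstats ppvalues later in this example; the data preparation (sysuse auto, weight divided by 100) and the replace suboptions are assumptions added here so that the block can be rerun on its own.

```stata
* Sketch only: data preparation is an assumption for illustration
sysuse auto, clear
replace weight = weight/100

* bayespredict requires that the simulation results be saved
quietly bayesmh mpg weight, likelihood(normal({var}))          ///
    prior({mpg:}, normal(0,100)) prior({var}, igamma(10,10))   ///
    saving(linregsim, replace) rseed(16)

* Apply Mata functions to the simulated outcomes {_ysim} and
* save the resulting test statistics in mpgsim.dta
bayespredict (prmean:@mean({_ysim})) (prmin:@min({_ysim}))     ///
    (prmax:@max({_ysim})), saving(mpgsim, replace) rseed(16)
```

Each labeled expression yields one value per MCMC replicate, so the saved prediction dataset holds a simulated sample for each of the three statistics.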
We can now access the prediction results within other Bayesian postestimation commands such as
bayesstats summary and bayesstats ppvalues.
Let’s compare the agreement for the mean, minimum, and maximum between the replicated data
and observed data. The bayesstats ppvalues command makes such comparisons easy. It reports
the proportion of cases when the simulated statistics are greater than or equal to the observed values
of statistics, which is an estimate of the so-called posterior predictive p-value.
. bayesstats ppvalues {prmean} {prmin} {prmax} using mpgsim
Posterior predictive summary MCMC sample size = 10,000
The posterior predictive p-value is 0.45 for the mean statistic, 0.03 for the minimum, and less than
0.001 for the maximum. Our normal model captures the center of the distribution of mpg well but
fails to capture the extreme values. The posterior predictive p-value for the maximum statistic is
particularly small, which agrees with our earlier conclusion based on the histograms that the maximum
values are not well represented by the model. If we believe that the extremely large observations
of mpg are not aberrant outliers, we may need to look for a better-fitting likelihood model than the
normal model.
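The quantity reported by bayesstats ppvalues, as described above, can be written compactly. In this sketch, D(·) denotes a test statistic such as the mean, minimum, or maximum, $y^{\mathrm{rep}}_t$ is the replicated outcome vector from MCMC replicate t, and T is the MCMC sample size; the notation is introduced here for illustration.

```latex
p \;=\; \Pr\bigl\{ D(y^{\mathrm{rep}}) \ge D(y^{\mathrm{obs}}) \,\big|\, y^{\mathrm{obs}} \bigr\}
\;\approx\; \frac{1}{T}\sum_{t=1}^{T}
\mathbf{1}\bigl\{ D(y^{\mathrm{rep}}_t) \ge D(y^{\mathrm{obs}}) \bigr\}
```

Values near 0 or 1 mean that the observed statistic lies in the tail of its posterior predictive distribution, whereas values near 0.5 indicate agreement between the model and the data for that statistic.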
As the final step, we remove the files generated by bayesmh and bayespredict because we no
longer need them.
. erase linregsim.dta
. erase mpgsim.dta
. erase mpgsim.ster
See [BAYES] bayespredict and [BAYES] bayesstats ppvalues for more examples.
Our goal is to investigate the relationship between the presence of a heart disease and covariates
restecg, isfbs, age, and male.
First, we fit a standard logistic regression model using the logit command.
. logit disease restecg isfbs age male
note: restecg != 0 predicts success perfectly;
restecg omitted and 17 obs not used.
note: isfbs != 0 predicts success perfectly;
isfbs omitted and 3 obs not used.
note: male != 1 predicts success perfectly;
male omitted and 2 obs not used.
Iteration 0: Log likelihood = -4.2386144
Iteration 1: Log likelihood = -4.2358116
Iteration 2: Log likelihood = -4.2358076
Iteration 3: Log likelihood = -4.2358076
Logistic regression Number of obs = 26
LR chi2(1) = 0.01
Prob > chi2 = 0.9403
Log likelihood = -4.2358076 Pseudo R2 = 0.0007
restecg 0 (omitted)
isfbs 0 (omitted)
age -.0097846 .1313502 -0.07 0.941 -.2672263 .2476572
male 0 (omitted)
_cons 3.763893 7.423076 0.51 0.612 -10.78507 18.31285
We encounter collinearity and dropping of observations because of perfect prediction. As a result, the
regression coefficients corresponding to restecg, isfbs, and male are essentially excluded from
the model. The standard logistic analysis is limited because of the small size of the dataset.
Next we consider Bayesian analysis of the same data. We fit the same logistic regression model
using bayesmh and apply fairly noninformative normal priors N(0, 1e4) for all regression parameters.
. set seed 14
. bayesmh disease restecg isfbs age male, likelihood(logit)
> prior({disease:}, normal(0,10000))
Burn-in ...
Simulation ...
Model summary
Likelihood:
disease ~ logit(xb_disease)
Prior:
{disease:restecg isfbs age male _cons} ~ normal(0,10000) (1)
Equal-tailed
disease Mean Std. dev. MCSE Median [95% cred. interval]
Indeed, if we decrease the standard deviation of the priors to 10, we observe that the scale of the
estimates decreases by the same order of magnitude.
. set seed 14
. bayesmh disease restecg isfbs age male, likelihood(logit)
> prior({disease:}, normal(0,100))
Burn-in ...
Simulation ...
Model summary
Likelihood:
disease ~ logit(xb_disease)
Prior:
{disease:restecg isfbs age male _cons} ~ normal(0,100) (1)
Equal-tailed
disease Mean Std. dev. MCSE Median [95% cred. interval]
We can, therefore, conclude that the regression parameters are highly sensitive to the choice of
priors and their scale cannot be determined by the data alone; that is, it cannot be determined by
the likelihood of the model. In other words, these model parameters are not identifiable from the
likelihood alone. This conclusion is in agreement with the results of the logit command.
We may consider applying an informative prior. We can use information from other heart disease
studies from Lichman (2013). For example, we use a subset of the Hungarian data created by Andras
Janosi, M.D. of Hungarian Institute of Cardiology in Budapest, Hungary. hearthungary.dta contains
the same attributes as in heartswitz.dta but from a Hungarian population.
We fit bayesmh with noninformative priors to hearthungary.dta and obtain the following
posterior mean estimates for the regression parameters:
. use https://www.stata-press.com/data/r18/hearthungary
(Subset of Hungarian heart disease data from UCI Machine Learning Repository)
. set seed 14
. bayesmh disease restecg isfbs age male, likelihood(logit)
> prior({disease:}, normal(0,1000))
Burn-in ...
Simulation ...
Model summary
Likelihood:
disease ~ logit(xb_disease)
Prior:
{disease:restecg isfbs age male _cons} ~ normal(0,1000) (1)
Equal-tailed
disease Mean Std. dev. MCSE Median [95% cred. interval]
With this additional information, we can form more informative priors for the five parameters of
interest: we center {disease:restecg} and {disease:age} at 0, {disease:isfbs} and {disease:male}
at 1, and {disease:_cons} at −4, and we use a prior variance of 10 for all coefficients.
. use https://www.stata-press.com/data/r18/heartswitz
(Subset of Switzerland heart disease data from UCI Machine Learning Repository)
. set seed 14
. bayesmh disease restecg isfbs age male, likelihood(logit)
> prior({disease:restecg age}, normal( 0,10))
> prior({disease:isfbs male}, normal( 1,10))
> prior({disease:_cons}, normal(-4,10))
Burn-in ...
Simulation ...
Model summary
Likelihood:
disease ~ logit(xb_disease)
Priors:
{disease:restecg age} ~ normal(0,10) (1)
{disease:isfbs male} ~ normal(1,10) (1)
{disease:_cons} ~ normal(-4,10) (1)
Equal-tailed
disease Mean Std. dev. MCSE Median [95% cred. interval]
We now obtain more reasonable results that also agree with the Hungarian results. For the final
analysis, we may consider other heart disease datasets to verify the reasonableness of our prior
specifications and to check the sensitivity of the parameters to other prior specifications.
the 10 to 40 range. We assign an N(0, 1) prior to the regression coefficients. To monitor the progress,
we specify dots to request that bayesmh display a dot every 100 iterations and the iteration number
every 1,000 iterations.
. use https://www.stata-press.com/data/r18/fullauto
(Automobile models)
. replace length = length/10
variable length was int now float
(74 real changes made)
. set seed 14
. bayesmh rep77 foreign length mpg, likelihood(oprobit)
> prior({rep77: foreign length mpg}, normal(0,1))
> prior({rep77:_cut1 _cut2 _cut3 _cut4}, exponential({lambda=30}))
> prior({lambda}, uniform(10,40)) block(lambda) dots
Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 10000 .........1000.........2000.........3000.........4000.........
> 5000.........6000.........7000.........8000.........9000.........10000 done
Model summary
Likelihood:
rep77 ~ oprobit(xb_rep77,{rep77:_cut1 ... _cut4})
Priors:
{rep77:foreign length mpg} ~ normal(0,1) (1)
{rep77:_cut1 ... _cut4} ~ exponential({lambda})
Hyperprior:
{lambda} ~ uniform(10,40)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
rep77
foreign 1.338071 .3750768 .022296 1.343838 .6331308 2.086062
length .3479392 .1193329 .00787 .3447806 .1277292 .5844067
mpg .1048089 .0356498 .002114 .1022382 .0373581 .1761636
_cut1 7.204502 2.910222 .197522 7.223413 1.90771 13.07034
_cut2 8.290923 2.926149 .197229 8.258871 2.983281 14.16535
_cut3 9.584845 2.956191 .197144 9.497836 4.23589 15.52108
_cut4 10.97314 3.003014 .192244 10.89227 5.544563 17.06189
When we specify dots or dots(), bayesmh displays dots as simulation is performed. The burn-in and
simulation iterations are displayed separately. During the adaptation period, iterations are displayed
with a symbol a instead of a dot. This indicates the period during which the proposal distribution is
still changing and thus may not be suitable for sampling from yet. Typically, adaptation is performed
during the burn-in period, the iterations of which are discarded from the MCMC sample. You should
pay closer attention to your results if you see adaptive iterations during the simulation period. This
may happen, for example, if you increase adaptation(maxiter()) without increasing burnin()
correspondingly. In this case, you may need to perform additional checks to verify that the part of
the MCMC sample corresponding to the adaptation period is similar to the rest of the sample.
Posterior credible intervals suggest that foreign, length, and mpg are among the explanatory
factors for rep77. Based on MCSEs, their posterior mean estimates are fairly precise. The posterior
mean estimates of cutpoints, as expected, are not as precise. The estimated posterior mean for
{lambda} is 18.52.
We placed the hyperparameter {lambda} in a separate block because we wanted to sample this
nuisance parameter independently from the other model parameters. Based on the bivariate scatterplots,
this parameter does appear to be independent of other model parameters a posteriori.
. bayesgraph matrix {rep77:foreign} {rep77:length} {rep77:mpg} {lambda}
[Figure: bivariate scatterplot matrix of rep77:foreign, rep77:length, rep77:mpg, and lambda.]
As with any MCMC analysis, we should verify convergence of all of our parameters. Here we show
diagnostic plots only for {lambda}.
. bayesgraph diagnostics {lambda}
[Figure: diagnostic plots for lambda. Panels: trace (by iteration number), histogram, autocorrelation (by lag), and density (all, 1-half, 2-half); values range from 10 to 40.]
Beta-binomial model
bayesmh is a regression command, which models the mean of the outcome distribution as a
function of predictors. There are cases when we do not have any predictors and want to model the
outcome distribution directly. For example, we may want to fit a Poisson distribution or a binomial
distribution to our outcome. We can do this by specifying one of the four distributions supported
by bayesmh in the likelihood() option: dexponential(), dbernoulli(), dbinomial(), or
dpoisson().
Let’s revisit the example from What is Bayesian analysis? in [BAYES] Intro, originally from Hoff
(2009, 3), of estimating the prevalence of a rare infectious disease in a small city. The outcome
variable y is the number of infected subjects in a city of 20 subjects, and our data consist of only
one observation, y = 0. We assume a binomial distribution for the outcome y, Binom(20,θ), where
the infection probability θ is a parameter of interest. Based on some previous studies, the model
parameter θ is assigned a Beta(2, 20) prior. For this model, the posterior distribution of θ is known
to be Beta(2, 40).
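The stated posterior follows from the conjugacy of the beta prior with the binomial likelihood. The short derivation below uses only the quantities given above: n = 20 subjects, y = 0 infected, and a Beta(2, 20) prior.

```latex
\begin{aligned}
p(\theta \mid y) &\propto \theta^{y}(1-\theta)^{20-y}\;\times\;\theta^{2-1}(1-\theta)^{20-1} \\
&= \theta^{(2+y)-1}\,(1-\theta)^{(20+20-y)-1}
\;\;\Longrightarrow\;\;
\theta \mid y \sim \mathrm{Beta}(2+y,\; 40-y)
\end{aligned}
```

With y = 0, this is Beta(2, 40), whose mean is 2/(2 + 40), or about 0.0476.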
Likelihood:
y ~ binomial({theta},20)
Prior:
{theta} ~ beta(2,20)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
The estimated posterior mean for {theta} is 0.0468, which is close to the theoretical value of
2/(2 + 40) = 0.0476 and is within the range of the MCSE of 0.0008.
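A minimal sketch of how such a model could be set up and checked against the analytic Beta(2, 40) posterior is shown below. The one-observation dataset follows the description above; the seed and the starting value given in initial() are assumptions made for illustration and may differ from those used to produce the reported output.

```stata
* One observation: y = 0 infected out of 20 subjects
clear
set obs 1
generate y = 0

* Binomial likelihood with a Beta(2,20) prior on {theta};
* the seed and the starting value are illustrative assumptions
set seed 14
bayesmh y, likelihood(dbinomial({theta},20))                   ///
    prior({theta}, beta(2,20)) initial({theta} 0.01)

* Analytic Beta(2,40) posterior summaries for comparison
display "posterior mean = " 2/(2+40)
display "2.5% quantile  = " invibeta(2, 40, 0.025)
display "97.5% quantile = " invibeta(2, 40, 0.975)
```

Comparing the simulated mean and credible interval with these closed-form values is a convenient check that the sampler is working, because few models have a posterior that is known exactly.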
Multivariate regression
We consider a simple multivariate normal regression model without covariates. We use auto.dta,
and we fit a multivariate normal distribution to variables mpg, weight, and length.
We rescale these variables to have approximately equal ranges. Equalizing the range of model
variables is always recommended, because this makes the model computationally more stable.
. use https://www.stata-press.com/data/r18/auto, clear
(1978 automobile data)
. quietly replace weight = weight/1000
. quietly replace length = length/100
. quietly replace mpg = mpg/10
Example 15: Default MH sampling with inverse-Wishart prior for the covariance
For a multivariate normal distribution, an inverse-Wishart prior is commonly used as a prior for
the covariance matrix. Let’s fit our multivariate model using bayesmh.
We specify the multivariate normal likelihood likelihood(mvnormal({Sigma,m})) for the three
variables mpg, weight, and length, where {Sigma,m} is a matrix parameter for the covariance
matrix. We use vague normal priors normal(0,100) for all three means of the variables. For a
covariance matrix {Sigma,m}, which is of dimension three, we specify an inverse-Wishart prior with
the identity scale matrix. We also specify the mean parameters and the covariance parameter in two
separate blocks. To monitor the simulation process, we specify dots.
. set seed 14
. bayesmh (mpg) (weight) (length), likelihood(mvnormal({Sigma,m}))
> prior({mpg:_cons} {weight:_cons} {length:_cons}, normal(0,100))
> prior({Sigma,m}, iwishart(3,100,I(3)))
> block({mpg:_cons} {weight:_cons} {length:_cons})
> block({Sigma,m}) dots
Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 10000 .........1000.........2000.........3000.........4000.........
> 5000.........6000.........7000.........8000.........9000.........10000 done
Model summary
Likelihood:
mpg weight length ~ mvnormal(3,{mpg:},{weight:},{length:},{Sigma,m})
Priors:
{mpg:_cons} ~ normal(0,100)
{weight:_cons} ~ normal(0,100)
{length:_cons} ~ normal(0,100)
{Sigma,m} ~ iwishart(3,100,I(3))
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
_cons 2.13089 .0455363 .001763 2.129007 2.04435 2.223358
weight
_cons 3.018691 .0671399 .00212 3.020777 2.880051 3.149828
length
_cons 1.879233 .0210167 .00063 1.879951 1.837007 1.920619
In this first run, we do not achieve good mixing of the MCMC chain. bayesmh issues a note about
significant autocorrelation of the simulated parameters.
A closer inspection of the ESS table reveals very low sampling efficiencies for the elements of the
covariance matrix {Sigma}.
. bayesstats ess
Efficiency summaries MCMC sample size = 10,000
Efficiency: min = .001396
avg = .04166
max = .1111
mpg
_cons 667.48 14.98 0.0667
weight
_cons 1002.92 9.97 0.1003
length
_cons 1111.14 9.00 0.1111
For example, the diagnostic plots for {Sigma_2_2} provide visual confirmation of the convergence
issues: a very poorly mixing trace plot, high autocorrelation, and a bimodal posterior distribution.
. bayesgraph diagnostics Sigma_2_2
[Figure: diagnostic plots for Sigma_2_2. Panels: trace (by iteration number), histogram, autocorrelation (by lag), and density (all, 1-half, 2-half); values range from about .315 to .34.]
Here, we see a general problem associated with the simulation of covariance matrices. The
random-walk MH algorithm is not well suited for sampling positive-definite matrices. This is why
even an adaptive version of the MH algorithm, as implemented in bayesmh, may not achieve good mixing.
Example 16: Adaptation of MH sampling with inverse-Wishart prior for the covariance
Continuing example 15, we can specify longer adaptation and burn-in periods to improve
convergence.
. set seed 14
. bayesmh (mpg) (weight) (length), likelihood(mvnormal({Sigma,m}))
> prior({mpg:_cons} {weight:_cons} {length:_cons}, normal(0,100))
> prior({Sigma,m}, iwishart(3,100,I(3)))
> block({mpg:_cons} {weight:_cons} {length:_cons})
> block({Sigma,m}) dots burnin(5000) adaptation(maxiter(50))
Burn-in 5000 aaaaaaaaa1000aaaaaaaaa2000aaaaaaaaa3000aaaa.....4000.........5000
> done
Simulation 10000 .........1000.........2000.........3000.........4000.........
> 5000.........6000.........7000.........8000.........9000.........10000 done
Model summary
Likelihood:
mpg weight length ~ mvnormal(3,{mpg:},{weight:},{length:},{Sigma,m})
Priors:
{mpg:_cons} ~ normal(0,100)
{weight:_cons} ~ normal(0,100)
{length:_cons} ~ normal(0,100)
{Sigma,m} ~ iwishart(3,100,I(3))
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
_cons 2.13051 .0475691 .001809 2.13263 2.038676 2.220953
weight
_cons 3.017943 .0626848 .00234 3.016794 2.898445 3.143252
length
_cons 1.878912 .019905 .000769 1.878518 1.840311 1.918476
There is no note about high autocorrelation, and the average efficiency increases slightly from 4% to
5%.
mpg
_cons 691.54 14.46 0.0692
weight
_cons 717.82 13.93 0.0718
length
_cons 670.63 14.91 0.0671
[Figure: diagnostic plots for Sigma_2_2. Panels: trace (by iteration number), histogram, autocorrelation (by lag), and density (all, 1-half, 2-half); values range from about .26 to .36.]
Likelihood:
mpg weight length ~ mvnormal(3,{mpg:},{weight:},{length:},{Sigma,m})
Priors:
{mpg:_cons} ~ normal(0,100)
{weight:_cons} ~ normal(0,100)
{length:_cons} ~ normal(0,100)
{Sigma,m} ~ iwishart(3,100,I(3))
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
_cons 2.128801 .0457224 .00164 2.128105 2.041016 2.215
weight
_cons 3.020533 .0609036 .002328 3.021561 2.908383 3.143715
length
_cons 1.880409 .0197061 .000725 1.881133 1.843106 1.918875
Compared with example 15, the results improved substantially. Compared with example 16, the
minimum efficiency increases from about 3% to 7% and the average efficiency from 5% to 67%.
MCSEs of posterior mean estimates, particularly for elements of {Sigma}, are lower.
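A Gibbs-sampling specification consistent with the model summary above might look like the following sketch. It reuses the model from examples 15 and 16 and requests Gibbs updates for the covariance block with the gibbs suboption of block(), as example 18 does; the exact command and seed behind the reported output are assumptions here.

```stata
* Rescale the variables as described at the start of this section
use https://www.stata-press.com/data/r18/auto, clear
quietly replace weight = weight/1000
quietly replace length = length/100
quietly replace mpg = mpg/10

* Same model as in example 15, with Gibbs updates for {Sigma,m}
set seed 14
bayesmh (mpg) (weight) (length), likelihood(mvnormal({Sigma,m}))      ///
    prior({mpg:_cons} {weight:_cons} {length:_cons}, normal(0,100))   ///
    prior({Sigma,m}, iwishart(3,100,I(3)))                            ///
    block({mpg:_cons} {weight:_cons} {length:_cons})                  ///
    block({Sigma,m}, gibbs) dots

* Compare sampling efficiencies with the MH runs
bayesstats ess
```

Gibbs updates draw the covariance matrix directly from its full conditional distribution, which avoids the difficulty that random-walk proposals have with positive-definite matrices.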
The diagnostic plots for {Sigma_2_2}, for example, also indicate very good convergence.
. bayesgraph diagnostics Sigma_2_2
[Figure: diagnostic plots for Sigma_2_2. Panels: trace (by iteration number), histogram, autocorrelation (by lag, within about ±0.02), and density (all, 1-half, 2-half); values range from about .2 to .45.]
Example 18: Gibbs sampling of a covariance matrix with the Jeffreys prior
In this example, we perform a sensitivity analysis of the model by replacing the inverse-Wishart
prior for the covariance matrix with a Jeffreys prior.
. set seed 14
. bayesmh (mpg) (weight) (length), likelihood(mvnormal({Sigma,m}))
> prior({mpg:} {weight:} {length:}, normal(0,100))
> prior({Sigma,m}, jeffreys(3))
> block({mpg:} {weight:} {length:})
> block({Sigma,m}, gibbs) dots
Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 10000 .........1000.........2000.........3000.........4000.........
> 5000.........6000.........7000.........8000.........9000.........10000 done
Model summary
Likelihood:
mpg weight length ~ mvnormal(3,{mpg:},{weight:},{length:},{Sigma,m})
Priors:
{mpg:_cons} ~ normal(0,100)
{weight:_cons} ~ normal(0,100)
{length:_cons} ~ normal(0,100)
{Sigma,m} ~ jeffreys(3)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
mpg
_cons 2.130704 .0709095 .002185 2.129449 1.989191 2.267987
weight
_cons 3.019323 .0950116 .003245 3.019384 2.834254 3.208017
length
_cons 1.879658 .0271562 .000892 1.879859 1.827791 1.933834
Compared with example 17, the estimates of the means of the multivariate distribution do not change
much, but the estimates of the elements of the covariance matrix do change. The estimates for
{Sigma,m} obtained using the Jeffreys prior are approximately twice as big as the estimates obtained
using the inverse-Wishart prior. If we compute correlation matrices corresponding to {Sigma,m} from
the two models, they will be similar. This can be explained by the fact that both the Jeffreys prior and
the inverse-Wishart prior with identity scale matrix are not informative for the correlation structure
because they only depend on the determinant and the trace of {Sigma,m} whereas the correlation
structure is determined by the data alone.
[Figure: diagnostic plots for Sigma_2_2 under the Jeffreys prior (panels: Trace against iteration number, Histogram, Autocorrelation against lag, and Density for all, 1-half, and 2-half of the sample)]
Ruppert, Wand, and Carroll (2003) and Diggle et al. (2002) analyzed a longitudinal dataset
consisting of weight measurements of 48 pigs on 9 successive weeks. Pigs were identified by the
group variable id.
The following two-level model was considered:
id: Identity
var(_cons) 14.81751 3.124225 9.801716 22.40002
LR test vs. linear model: chibar2(01) = 472.65 Prob >= chibar2 = 0.0000
The model has four main parameters of interest: regression coefficients $\beta_0$ and $\beta_1$ and variance
components $\sigma_0^2$ and $\sigma_u^2$. The pig random effects $u_j$ are considered nuisance parameters. We use
normal priors for the regression coefficients and random effects and inverse-gamma priors for the
variance parameters. The chosen priors are fairly noninformative, so we would expect results to be
similar to the frequentist results.
To fit this model using bayesmh, we need to include random effects for pig in our regression
model. This can be done simply by adding the random-effects term U[id] to the list of variables.
In addition to two regression coefficients and two variance components, we have 48 random-effects
parameters. As for other models, bayesmh will automatically create parameters of the regression
function: {weight:week} for the regression coefficient of week and {weight:_cons} for the
constant term. It will also create random-effects parameters {U:1.id}, {U:2.id}, ..., {U:48.id}
and the corresponding variance component {var_U}. So, we only need to create one remaining
parameter for the error variance. We will use {var_0} to match our math notation.
We will perform five simulations for the specified Bayesian model to illustrate some common
difficulties in applying MH MCMC to multilevel models.
Likelihood:
weight ~ normal(xb_weight,{var_0})
Priors:
{weight:_cons week} ~ normal(0,100) (1)
{U[id]} ~ normal(0,{var_U}) (1)
{var_0} ~ igamma(0.001,0.001)
Hyperprior:
{var_U} ~ igamma(0.001,0.001)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
weight
week 6.214207 .038642 .002359 6.213394 6.139342 6.289956
_cons 19.32073 .4780961 .095658 19.33685 18.36352 20.16849
bayesmh reports results that are similar to those from mixed, but the low minimum efficiency of
0.005 may indicate problems with MCMC convergence for some of the parameters. bayesmh does
not report the estimates of random effects by default, but you can use the showreffects option to
display them.
We use bayesstats ess to identify the main model parameter that has the lowest efficiency.
. bayesstats ess
Efficiency summaries MCMC sample size = 5,000
Efficiency: min = .004996
avg = .03269
max = .05366
weight
week 268.29 18.64 0.0537
_cons 24.98 200.16 0.0050
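The three numeric columns in the bayesstats ess output above are linked by simple arithmetic: efficiency is the effective sample size (ESS) divided by the MCMC sample size, and the correlation time is the reciprocal of efficiency. The column meanings are assumed here because the table header did not survive above. The Python sketch below is not part of Stata; it only rechecks the reported figures.

```python
# Recheck the bayesstats ess figures reported above (values copied from the output).
# Assumption: the three columns are ESS, correlation time, and efficiency.
MCMC_SIZE = 5000

ess = {"week": 268.29, "_cons": 24.98}


def efficiency(ess_value, mcmc_size):
    """Fraction of the MCMC sample that is effectively independent."""
    return ess_value / mcmc_size


def correlation_time(ess_value, mcmc_size):
    """Approximate number of draws needed per effectively independent draw."""
    return mcmc_size / ess_value


for name, value in ess.items():
    eff = efficiency(value, MCMC_SIZE)
    lag = correlation_time(value, MCMC_SIZE)
    print(f"{name:>6}: efficiency = {eff:.4f}, correlation time = {lag:.2f}")
```

Read this way, the constant term contributes only about 25 effectively independent draws out of 5,000, which is why it drives the minimum efficiency.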
[Figure: diagnostic plots for weight:_cons (panels: Trace against iteration number, Histogram, Autocorrelation against lag, and Density for all, 1-half, and 2-half of the sample)]
The trace plot for {weight:_cons} exhibits some trend and does not show good mixing, and the
autocorrelation is high. Our MCMC does not seem to have converged, and thus we should be cautious
about the obtained results.
We can also look at the trace and autocorrelation plots of all main parameters.
. bayesgraph trace _all, byparm(cols(2))
[Figure: trace plots by parameter for weight:week, weight:_cons, var_0, and var_U against iteration number]
The trace plots of all parameters other than the constant do not appear to have any trend.
. bayesgraph ac _all, byparm
[Figure: autocorrelation plots by parameter for weight:week, weight:_cons, var_0, and var_U against lag]
The autocorrelations for the constant {weight:_cons} and the variance component {var_U} are high.
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
weight
week 6.215408 .0381479 .002808 6.214654 6.140876 6.293443
_cons 19.41979 .5741026 .11524 19.46862 18.24166 20.44603
Blocking certainly improved efficiencies: the average efficiency is now 0.08, but the minimum efficiency
is still low.
The trace and autocorrelation plots below have improved for variance components but not for
regression coefficients.
. bayesgraph trace _all, byparm(cols(2))
[Figure: trace plots by parameter for weight:week, weight:_cons, var_0, and var_U against iteration number]
[Figure: autocorrelation plots by parameter for weight:week, weight:_cons, var_0, and var_U against lag]
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
weight
week 6.211245 .0394854 .001513 6.211084 6.136556 6.290471
_cons 19.10077 .5413931 .085962 19.0496 18.20506 20.29911
The average efficiency increased dramatically to 0.31 but the minimum efficiency is still low.
[Figure: trace plots by parameter for weight:week, weight:_cons, var_0, and var_U against iteration number]
[Figure: autocorrelation plots by parameter for weight:week, weight:_cons, var_0, and var_U against lag]
In these plots, all parameters but the constant term show nearly perfect mixing.
For linear multilevel models, we can further improve mixing by specifying Gibbs sampling also
for random effects.
. bayesmh weight week U[id], likelihood(normal({var_0}))
> prior({weight:_cons}, normal(0, 100))
> prior({weight:week}, normal(0, 100))
> prior({var_0}, igamma(0.001, 0.001))
> prior({var_U}, igamma(0.001, 0.001))
> block({weight:} {var_0 var_U}, split gibbs)
> block({U}, gibbs)
> mcmcsize(5000) dots rseed(14) nomodelsummary
Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 5000 .........1000.........2000.........3000.........4000.........
> 5000 done
Bayesian normal regression MCMC iterations = 7,500
Gibbs sampling Burn-in = 2,500
MCMC sample size = 5,000
Number of obs = 432
Acceptance rate = 1
Efficiency: min = .02462
avg = .4626
Log marginal-likelihood max = .8788
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
weight
week 6.212522 .0391656 .001618 6.212953 6.135002 6.287983
_cons 19.17706 .527013 .047497 19.19138 18.0913 20.1664
The minimum efficiency has now increased to 0.025, and the diagnostic plots for the constant term
look much better:
. bayesgraph trace _all, byparm(cols(2))
[Figure: trace plots by parameter for weight:week, weight:_cons, var_0, and var_U against iteration number]
[Figure: autocorrelation plots by parameter for weight:week, weight:_cons, var_0, and var_U against lag]
For example, instead of using Gibbs sampling for the random effects (as in example 21), we use
block()’s suboption split for the random-effects parameters {U[id]}.
. bayesmh weight week U[id], likelihood(normal({var_0}))
> prior({weight:_cons}, normal(0, 100))
> prior({weight:week}, normal(0, 100))
> prior({var_0}, igamma(0.001, 0.001))
> prior({var_U}, igamma(0.001, 0.001))
> block({weight:} {var_0 var_U}, split gibbs)
> block({U}, split)
> mcmcsize(5000) dots rseed(14) nomodelsummary
Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 5000 .........1000.........2000.........3000.........4000.........
> 5000 done
Bayesian normal regression MCMC iterations = 7,500
Metropolis--Hastings and Gibbs sampling Burn-in = 2,500
MCMC sample size = 5,000
Number of obs = 432
Acceptance rate = .8455
Efficiency: min = .007933
avg = .3116
Log marginal-likelihood max = .6695
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
weight
week 6.211245 .0394854 .001513 6.211084 6.136556 6.290471
_cons 19.10077 .5413931 .085962 19.0496 18.20506 20.29911
The average sampling efficiency, 31%, is lower than with the full Gibbs sampling in example 21 but
is higher than that of the model that did not use Gibbs sampling for random effects. For models
that do not support Gibbs sampling, splitting on random effects may be a good alternative.
Here, the constant term is absorbed into the prior for the random effects $\tau_j$, which have a mean
of $\beta_0$ instead of zero, as for the random effects $u_j$.
To specify the above model with bayesmh, we need to use the noconstant option, and we need
to specify the prior for random effects manually.
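As a sketch of the reparameterization just described, the two forms of the model can be written side by side. This is assembled from the priors listed in the model summary and uses the symbols from the text ($\beta_0$, $\beta_1$, $u_j$, $\tau_j$, $\sigma_0^2$, $\sigma_u^2$); the error term $\epsilon_{ij}$ and the exact layout are assumptions, not a reproduction of the original display.

```latex
\begin{align*}
\text{weight}_{ij} &= \beta_0 + \beta_1\,\text{week}_{ij} + u_j + \epsilon_{ij},
  & u_j &\sim N(0, \sigma_u^2) \\
                   &= \beta_1\,\text{week}_{ij} + \tau_j + \epsilon_{ij},
  & \tau_j = \beta_0 + u_j &\sim N(\beta_0, \sigma_u^2) \\
\epsilon_{ij} &\sim N(0, \sigma_0^2)
\end{align*}
```

In the second line the regression has no separate intercept, which is why noconstant is needed, and $\beta_0$ enters only as the prior mean of $\tau_j$.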
Likelihood:
weight ~ normal(xb_weight,{var_0})
Priors:
{weight:week} ~ normal(0,100) (1)
{U[id]} ~ normal({weight:_cons},{var_U}) (1)
{var_0} ~ igamma(0.001,0.001)
{weight:_cons} ~ normal(0,100)
Hyperprior:
{var_U} ~ igamma(0.001,0.001)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
weight
week 6.210628 .0389494 .001632 6.21117 6.133097 6.286066
_cons 19.28477 .607197 .012616 19.28279 18.10872 20.50361
The average efficiency increased dramatically to 60%, and the minimum efficiency is now 11%.
The diagnostic plots now show perfect mixing for all main model parameters:
. bayesgraph trace _all, byparm(cols(2))
[Figure: trace plots by parameter for weight:week, var_0, weight:_cons, and var_U against iteration number]
[Figure: autocorrelation plots by parameter for weight:week, var_0, weight:_cons, and var_U against lag]
All estimates are very close to the MLEs obtained earlier with the mixed command.
where $u_{0j}$ is the random effect for pig and $u_{1j}$ is the pig-specific random coefficient on week for
$j = 1, \ldots, 48$ and $i = 1, \ldots, 9$.
id: Independent
var(week) .3680668 .0801181 .2402389 .5639103
var(_cons) 6.756364 1.543503 4.317721 10.57235
LR test vs. linear model: chi2(2) = 764.42 Prob > chi2 = 0.0000
Note: LR test is conservative and provided only for reference.
$$\text{weight}_{ij} = \beta_0 + \beta_1\,\text{week}_{ij} + u_{0j} + u_{1j}\,\text{week}_{ij} + \epsilon_{ij} = \tau_{0j} + \tau_{1j}\,\text{week}_{ij} + \epsilon_{ij},$$
The model has five main parameters of interest: regression coefficients $\beta_0$ and $\beta_1$ and variance
components $\sigma_0^2$, $\sigma_{\tau_0}^2$, and $\sigma_{\tau_1}^2$. $\beta_0$ and $\beta_1$ are technically hyperparameters because they are specified
as mean parameters of the prior distributions for random effects $\tau_{0j}$ and $\tau_{1j}$, respectively. Random
effects $\tau_{0j}$ and $\tau_{1j}$ are considered nuisance parameters. We again use normal priors for the regression
coefficients and random effects and inverse-gamma priors for the variance parameters. We specify
fairly noninformative priors.
To fit this model using bayesmh, we include random effects for pig and their interaction with week
in our regression model. Following Random effects, we add random intercepts for the id variable as
T0[id], and we include random coefficients on week as c.week#T1[id], where T0 and T1 stand
for $\tau_0$ and $\tau_1$.
We fit our model using bayesmh. Following example 21, we perform blocking of parameters and
use Gibbs sampling for the blocks. For brevity, we also combine the same prior specifications in one
statement but use prior()’s split suboption to continue treating the parameters from the same
prior() statement as separate blocks during simulation.
. bayesmh weight T0[id] c.week#T1[id], likelihood(normal({var_0})) noconstant
> prior({T0[id]}, normal({weight:_cons}, {var_T0}))
> prior({T1[id]}, normal({weight:week}, {var_T1}))
> prior({weight:week _cons}, normal(0, 1e2) split)
> prior({var_0 var_T0 var_T1}, igamma(0.001, 0.001) split)
> block({var_0 var_T0 var_T1}, gibbs split)
> block({weight:}, gibbs split)
> block({T0}, gibbs) block({T1}, gibbs)
> mcmcsize(5000) rseed(17) dots notable
Burn-in 2500 .........1000.........2000..... done
Simulation 5000 .........1000.........2000.........3000.........4000.........
> 5000 done
Model summary
Likelihood:
weight ~ normal(xb_weight,{var_0})
Priors:
{T0[id]} ~ normal({weight:_cons},{var_T0}) (1)
{T1[id]} ~ normal({weight:week},{var_T1}) (1)
{var_0} ~ igamma(0.001,0.001)
{weight:week _cons} ~ normal(0,1e2)
Hyperprior:
{var_T0 var_T1} ~ igamma(0.001,0.001)
Our acceptance rate (AR) is good and efficiencies are high. We do not have a reason to suspect nonconvergence.
Nevertheless, it is important to perform graphical convergence diagnostics to confirm this. We used
the notable option to suppress the estimation summary so that we can focus on checking MCMC convergence
first and later redisplay the coefficients in the same order as in mixed.
Let’s look at diagnostic plots. We show only diagnostic plots for the mean of random intercepts,
but convergence should be established for all parameters before any inference can be made. We leave
it to you to verify convergence of the remaining parameters.
[Figure: diagnostic plots for weight:_cons (panels: Trace, Histogram, Autocorrelation against lag, and Density for all, 1-half, and 2-half of the sample)]
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
weight
week 6.213062 .0950649 .001621 6.213753 6.029047 6.401924
_cons 19.31661 .4041825 .007445 19.32041 18.54005 20.13218
id: Unstructured
var(week) .3715251 .0812958 .2419532 .570486
var(_cons) 6.823363 1.566194 4.351297 10.69986
cov(week,_cons) -.0984378 .2545767 -.5973991 .4005234
LR test vs. linear model: chi2(3) = 764.58 Prob > chi2 = 0.0000
Note: LR test is conservative and provided only for reference.
We modify the previous Bayesian model to account for the correlation between the random effects:
The elements $\sigma_{\tau_0}^2$ and $\sigma_{\tau_1}^2$ of $\Sigma$ represent the variances of the $\tau_{0j}$'s and $\tau_{1j}$'s, respectively, while $\sigma_{21}$
is the covariance between them. We apply a weakly informative inverse-Wishart prior with 3 degrees of
freedom and an identity scale matrix.
Gibbs sampling is not available in bayesmh for the mean parameters ({weight:_cons} and
{weight:week}) of the multivariate normal distribution with an unstructured covariance. We thus
remove gibbs from the corresponding block() option.
. bayesmh weight T0[id] c.week#T1[id], likelihood(normal({var_0})) noconstant
> prior({T0 T1}, mvnormal(2, {weight:_cons}, {weight:week}, {Sigma,m}))
> prior({weight:week _cons}, normal(0, 1e2) split)
> prior({var_0}, igamma(0.001,0.001))
> prior({Sigma,m}, iwishart(2,3,I(2)))
> block({var_0} {Sigma,m}, gibbs split)
> block({weight:}, split)
> block({T0}, gibbs) block({T1}, gibbs)
> mcmcsize(5000) rseed(17) dots
Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 5000 .........1000.........2000.........3000.........4000.........
> 5000 done
Model summary
Likelihood:
weight ~ normal(xb_weight,{var_0})
Priors:
{var_0} ~ igamma(0.001,0.001)
{T0[id] T1[id]} ~ mvnormal(2,{weight:_cons},{weight:week},{Sigma,m}) (1)
{weight:week _cons} ~ normal(0,1e2)
Hyperprior:
{Sigma,m} ~ iwishart(2,3,I(2))
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
weight
_cons 19.32651 .3922638 .013186 19.32816 18.54339 20.11928
week 6.207807 .0986948 .003086 6.20779 6.009859 6.402211
The average sampling efficiency is about 40% with no indications of convergence problems. The
posterior mean estimates of the main model parameters are close to the maximum likelihood results
from mixed. For example, the estimates of variance components $\sigma_{\tau_0}^2$, $\sigma_{21}$, and $\sigma_{\tau_1}^2$ are 6.85, −0.095,
and 0.40, respectively, from bayesmh and 6.82, −0.098, and 0.37, respectively, from mixed.
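To put the covariance estimates on a more interpretable scale, one can convert them to a correlation between the random intercepts and random slopes. The Python sketch below, which is not part of Stata, plugs in the rounded point estimates quoted above. Note that the correlation computed from posterior means is only a rough summary and is not the posterior mean of the correlation itself.

```python
import math


def correlation(var_a, var_b, cov_ab):
    """Correlation implied by two variances and their covariance."""
    return cov_ab / math.sqrt(var_a * var_b)


# Rounded point estimates quoted in the text: var(intercept), var(slope), covariance.
corr_bayesmh = correlation(6.85, 0.40, -0.095)
corr_mixed = correlation(6.82, 0.37, -0.098)

print(f"bayesmh: {corr_bayesmh:.3f}")
print(f"mixed:   {corr_mixed:.3f}")
```

Both values come out near −0.06, so the two fits agree that the random intercepts and slopes are only weakly correlated.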
Likelihood:
c_use ~ logit(xb_c_use)
Priors:
{c_use:urban age i.children _cons} ~ normal(0,100) (1)
{U[district]} ~ normal(0,{var_U}) (1)
Hyperprior:
{var_U} ~ igamma(0.01,0.01)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
c_use
urban .7364239 .1120843 .007943 .7393282 .4993958 .9511179
age -.0262663 .0076378 .00056 -.02666 -.0418213 -.0116904
children
1 child 1.129249 .1530869 .010718 1.127919 .8263055 1.432189
2 children 1.368097 .1678695 .01045 1.361876 1.040911 1.690345
3 or more.. 1.340399 .1773981 .009683 1.337075 .9809634 1.692562
Although the average efficiency of 0.03 is not that high, there are no indications of convergence
problems. (We can verify this by looking at convergence diagnostics using bayesgraph diagnostics.)
Our estimates of the main regression parameters are close to those obtained with the melogit
command. The posterior mean estimate of variance parameter {var_U}, 0.23, is slightly larger than
the corresponding estimate of 0.22 from melogit.
Sorted by:
The expected glucose level is analyzed according to a model proposed in Hand and
Crowder (1996). It is a three-level nonlinear model that includes subject-level random effects
U1[subject] and U2[subject] and guar-within-subject level random effects UU1[subject>guar]
and UU2[subject>guar]. See example 20 for a full description of the model. We consider the model
from that example in which the pairs U1 and U2, and UU1 and UU2, are assumed to be independent.
We fit a Bayesian version of the model using bayesmh. The likelihood specification is similar to the
one used by the menl command, but with bayesmh, we also specify the prior distributions for the model
parameters. Random effects are assigned normal priors by default with the corresponding variance
components {var_U1}, {var_U2}, {var_UU1}, and {var_UU2}. The parameters {phi1:_cons},
{phi2:_cons}, and {phi3} are assigned normal(0, 100) priors, and all variance components
are assigned igamma(0.01, 0.01) priors. Gibbs sampling is used for variance components, and
{phi1:_cons}, {phi2:_cons}, and {phi3} are sampled in separate blocks. We use the define()
option to define parameters {phi1:} and {phi2:} as a linear combination of the corresponding
random effects, including the constant term.
We suppress the estimation table and redisplay results later by using bayesstats summary to
match the output from menl more closely. The model contains many parameters, so it takes about a
minute to run.
. bayesmh glucose = ({phi1:} + {phi2:}*c.time#c.time#c.time*exp(-{phi3}*time)),
> likelihood(normal({var}))
> define(phi1: U1[subject] UU1[subject>guar])
> define(phi2: U2[subject] UU2[subject>guar])
> prior({phi1:_cons} {phi2:_cons} {phi3}, normal(0, 100) split)
> prior({var var_U1 var_UU1 var_U2 var_UU2}, igamma(0.01, 0.01) split)
> block({phi1:_cons} {phi2:_cons}, split)
> block({var var_U1 var_UU1 var_U2 var_UU2}, gibbs split)
> mcmcsize(5000) rseed(17) notable
Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 5000 .........1000.........2000.........3000.........4000.........
> 5000 done
Model summary
Likelihood:
glucose ~ normal(xb_phi1 + xb_phi2*c.time#c.time#c.time*exp(-{phi3}*time),{v
ar})
Priors:
{var} ~ igamma(0.01,0.01)
{phi3} ~ normal(0,100)
{phi1:_cons} ~ normal(0,100)
{phi2:_cons} ~ normal(0,100)
Hyperpriors:
{var_U1 var_UU1 var_U2 var_UU2} ~ igamma(0.01,0.01)
{U1[subject]} ~ normal(0,{var_U1})
{UU1[subject>guar]} ~ normal(0,{var_UU1})
{U2[subject]} ~ normal(0,{var_U2})
{UU2[subject>guar]} ~ normal(0,{var_UU2})
The bayesmh command reports a reasonable average sampling efficiency of about 12% but the minimum
efficiency is below 1%, so we may look into improving sampling efficiency for some parameters.
There is no obvious indication of nonconvergence, but it is important to assess MCMC convergence
visually by using, for instance, bayesgraph diagnostics or more formally by running multiple
chains and evaluating the Gelman–Rubin statistics; see Convergence diagnostics using multiple chains.
Let’s look at the results and compare them with the results reported by the menl command. We
report variance components as standard deviations to more easily match the results from menl.
. bayesstats summary {phi1:_cons} {phi2:_cons} {phi3}
> (sd_U1:sqrt({var_U1})) (sd_U2:sqrt({var_U2}))
> (sd_UU1:sqrt({var_UU1})) (sd_UU2:sqrt({var_UU2}))
> (sd:sqrt({var}))
Posterior summary statistics MCMC sample size = 5,000
sd_U1 : sqrt({var_U1})
sd_U2 : sqrt({var_U2})
sd_UU1 : sqrt({var_UU1})
sd_UU2 : sqrt({var_UU2})
sd : sqrt({var})
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
phi1
_cons 3.675754 .1233928 .013441 3.675342 3.426524 3.933746
phi2
_cons .4454892 .075955 .01358 .443041 .2921755 .6014314
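Expressions such as sqrt({var_U1}) are evaluated on every MCMC draw and then summarized, so the reported posterior mean of a standard deviation is generally not the square root of the posterior mean of the variance. The Python sketch below illustrates the gap on simulated draws. The draws are hypothetical (reciprocals of gamma variates standing in for a right-skewed variance posterior) and are not output from this model.

```python
import math
import random

random.seed(17)

# Hypothetical stand-in for 5,000 MCMC draws of a variance parameter.
var_draws = [1.0 / random.gammavariate(3.0, 1.0) for _ in range(5000)]

# Summarize the transformed draws, as an expression like (sd: sqrt({var})) would.
mean_of_sd = sum(math.sqrt(v) for v in var_draws) / len(var_draws)

# Transform the summary instead: square root of the mean variance.
sd_of_mean_var = math.sqrt(sum(var_draws) / len(var_draws))

print(f"mean of sqrt(draws): {mean_of_sd:.4f}")
print(f"sqrt of mean(draws): {sd_of_mean_var:.4f}")
```

Because the square root is concave, the first quantity can never exceed the second. The difference grows with the spread of the draws, which matters when comparing against standard deviations reported by another command.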
The posterior mean estimates for the coefficients {phi1:_cons}, {phi2:_cons}, and {phi3} and
the residual standard deviation are close to the estimates from menl. The Bayesian estimates of
variance components are higher. In particular, the posterior means for the standard deviations of {U2}
and {UU1} are not only higher but also more concentrated with 95% credible intervals of [0.06, 0.30]
and [0.07, 0.37]. In comparison, the corresponding 95% confidence intervals from menl are rather
wide, [0.0003, 6.3] and [0.0007, 6], which indicates less reliable estimates.
To improve sampling efficiency in this example, we can reparameterize the model by recentering
the random effects U1 and U2 around constants {phi1:_cons} and {phi2:_cons} so that these
constants become the prior means for the random effects U1 and U2. This will allow us to use Gibbs
sampling for {phi1:_cons} and {phi2:_cons}.
We fit the reparameterized model using bayesmh with the Gibbs sampling for the prior means.
. bayesmh glucose = ({phi1:} + {phi2:}*c.time#c.time#c.time*exp(-{phi3}*time)),
> likelihood(normal({var}))
> define(phi1: U1[subject] UU1[subject>guar], noconstant)
> define(phi2: U2[subject] UU2[subject>guar], noconstant)
> prior({U1[subject]}, normal({phi1:_cons}, {var_U1}))
> prior({U2[subject]}, normal({phi2:_cons}, {var_U2}))
> prior({phi1:_cons} {phi2:_cons} {phi3}, normal(0, 100) split)
> prior({var var_U1 var_UU1 var_U2 var_UU2}, igamma(0.01, 0.01) split)
> block({phi1:_cons} {phi2:_cons}, gibbs split)
> block({var var_U1 var_UU1 var_U2 var_UU2}, gibbs split)
> mcmcsize(5000) rseed(17) notable
Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 5000 .........1000.........2000.........3000.........4000.........
> 5000 done
Model summary
Likelihood:
glucose ~ normal(xb_phi1 + xb_phi2*c.time#c.time#c.time*exp(-{phi3}*time),{v
ar})
Priors:
{var} ~ igamma(0.01,0.01)
{phi3} ~ normal(0,100)
{phi1:_cons} ~ normal(0,100)
{phi2:_cons} ~ normal(0,100)
Hyperpriors:
{U1[subject]} ~ normal({phi1:_cons},{var_U1})
{U2[subject]} ~ normal({phi2:_cons},{var_U2})
{var_U1 var_UU1 var_U2 var_UU2} ~ igamma(0.01,0.01)
{UU1[subject>guar]} ~ normal(0,{var_UU1})
{UU2[subject>guar]} ~ normal(0,{var_UU2})
The minimum efficiency is now increased to about 2%, but the maximum efficiency is decreased. On
average, we are still at 12%.
. bayesstats summary {phi1:_cons} {phi2:_cons} {phi3}
> (sd_U1:sqrt({var_U1})) (sd_U2:sqrt({var_U2}))
> (sd_UU1:sqrt({var_UU1})) (sd_UU2:sqrt({var_UU2}))
> (sd:sqrt({var}))
Posterior summary statistics MCMC sample size = 5,000
sd_U1 : sqrt({var_U1})
sd_U2 : sqrt({var_U2})
sd_UU1 : sqrt({var_UU1})
sd_UU2 : sqrt({var_UU2})
sd : sqrt({var})
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
phi1
_cons 3.668967 .1514235 .00922 3.671296 3.361262 3.968073
phi2
_cons .4433111 .0754776 .005946 .4447485 .2940002 .5930835
Survival models
bayesmh provides several likelihood models (stexponential, stgamma(), stloglogistic(),
stlognormal(), and stweibull()) in the likelihood() option to analyze survival-time or
failure-time data. Also see [BAYES] bayes: streg and [BAYES] bayes: mestreg.
You can use these models to analyze failures-only data as well as to account for right-censoring
when you specify the failure() suboption within likelihood() and for left-truncation when you
specify the ltruncated() suboption. You can also choose between the proportional hazards (PH)
and accelerated failure-time (AFT) parameterizations with stexponential and stweibull() via
suboptions ph (the default) and aft.
When fitting survival models, you have two options for the metric of the ancillary parameters of the
survival distributions. For instance, for the Weibull distribution, you can model the shape parameter p
in the log metric by using likelihood(stweibull(lnp)) or likelihood(stweibull(lnp), logparam)
(the default) or in the original metric by using likelihood(stweibull(p), nologparam).
Similarly, for the lognormal distribution, you can model the log standard-deviation by using
likelihood(stlognormal(lnstd)) (the default) or the variance by using likelihood(stlognormal(var),
nologparam), and so on. Which parameterization to use for the ancillary parameters often depends
on the chosen priors. For example, in a Weibull model, we may use a normal prior for the log-shape
parameter lnp and a uniform prior for the shape parameter p.
Let’s look at a couple of examples below.
Consider cancer.dta, which records patient survival in a cancer drug trial. Of the 48 participants,
20 receive a placebo (drug = 1), 14 receive one type of treatment (drug = 2), and 14 receive another
type of treatment (drug = 3). We want to analyze time until death, measured in months (variable
studytime), as a function of treatment adjusted for age. The died variable records the failure status
for each subject, where died = 1 means a subject died and died = 0 means a subject is still alive
and is thus considered right-censored.
Initially, let’s ignore the failure status died and assume that studytime records failure times for
all subjects.
For reference, let’s first fit a classical Weibull regression model by using streg.
. use https://www.stata-press.com/data/r18/cancer
(Patient survival in drug trial)
. stset studytime
Survival-time data settings
Failure event: (assumed to fail at time=studytime)
Observed time interval: (0, studytime]
Exit on or before: failure
48 total observations
0 exclusions
drug
Other .3979255 .1428204 -2.57 0.010 .1969223 .8040971
NA .1526351 .0595183 -4.82 0.000 .0710785 .3277712
We now fit a Bayesian Weibull model by using bayesmh. To compare results with streg, we
use vague priors for model parameters and specify the eform() option to report hazard ratios
(exponentiated coefficients) instead of the coefficients reported by default by bayesmh. We also
sample the shape parameter separately from the coefficients to improve efficiency.
Likelihood:
studytime ~ stweibull(xb_studytime,{lnp})
Priors:
{studytime:i.drug age _cons} ~ normal(0,10000) (1)
{lnp} ~ normal(0,10000)
Equal-tailed
Haz. ratio Std. dev. MCSE Median [95% cred. interval]
studytime
drug
Other .4093515 .1455973 .008398 .3880567 .1930648 .7578985
NA .1586529 .0625765 .004121 .1507637 .0661176 .305668
The results from bayesmh and streg are similar, as expected with vague priors.
By default, bayesmh fits a Weibull model by using the log of the shape parameter. We can use
bayesstats summary to display this parameter in the original metric and also to report its reciprocal.
. bayesstats summary (p:exp({lnp})) (reciprocal: 1/exp({lnp}))
Posterior summary statistics MCMC sample size = 10,000
p : exp({lnp})
reciprocal : 1/exp({lnp})
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
Depending on the data and desired prior, we may want to parameterize the model to use the shape
parameter in the original metric. We can do this by specifying the nologparam suboption within
likelihood().
Let’s refit the above model by using the direct parameterization of the shape parameter and specify
a uniform prior for it.
. bayesmh studytime i.drug age, likelihood(stweibull({p}), nologparam)
> prior({studytime:}, normal(0,10000)) prior({p}, uniform(0,10))
> rseed(17) eform(Haz. ratio) block({p}) initial({p} 1)
Burn-in ...
Simulation ...
Model summary
Likelihood:
studytime ~ stweibull_nolog(xb_studytime,{p})
Priors:
{studytime:i.drug age _cons} ~ normal(0,10000) (1)
{p} ~ uniform(0,10)
Equal-tailed
Haz. ratio Std. dev. MCSE Median [95% cred. interval]
studytime
drug
Other .4254684 .1642118 .011746 .4001081 .1856402 .7999705
NA .1571577 .0637717 .005037 .1477305 .0634229 .3087045
48 total observations
0 exclusions
drug
Other .1705633 .0831449 -3.63 0.000 .0656067 .4434277
NA .0782594 .0402588 -4.95 0.000 .0285532 .2144953
With bayesmh, we specify the failure indicator in the failure() suboption within likelihood().
. bayesmh studytime i.drug age, likelihood(stweibull({lnp}), failure(died))
> prior({studytime:} {lnp}, normal(0,1000))
> rseed(17) eform(Haz. ratio)
Burn-in ...
Simulation ...
Model summary
Likelihood:
studytime ~ stweibull(xb_studytime,{lnp})
Priors:
{studytime:i.drug age _cons} ~ normal(0,1000) (1)
{lnp} ~ normal(0,1000)
Equal-tailed
Haz. ratio Std. dev. MCSE Median [95% cred. interval]
studytime
drug
Other .1812423 .0873363 .004128 .1646181 .0552102 .3888732
NA .0862965 .0467029 .001991 .0761287 .023666 .2074524
The results are again similar to those from streg after accounting for right-censoring.
As with right-censoring, we can account for left-truncation by specifying the ltruncated()
suboption within likelihood(). We can also specify the aft suboption to fit a Weibull (or exponential)
model using the AFT parameterization instead of the default PH parameterization.
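For a Weibull model, the PH and AFT parameterizations describe the same distribution, and their coefficients are linked through the shape parameter p by the standard identity beta_AFT = −beta_PH / p. The Python sketch below, which is not part of Stata, applies that identity. It assumes bayesmh's aft suboption follows the usual Weibull convention, and the hazard ratio and shape value used are hypothetical, not estimates from the models above.

```python
import math


def ph_to_aft(beta_ph, shape_p):
    """Convert a Weibull PH coefficient to the AFT metric: beta_aft = -beta_ph / p."""
    if shape_p <= 0:
        raise ValueError("the Weibull shape parameter must be positive")
    return -beta_ph / shape_p


# Hypothetical inputs for illustration only.
hazard_ratio = 0.18   # exponentiated PH coefficient
shape_p = 1.5         # Weibull shape parameter

beta_ph = math.log(hazard_ratio)       # coefficient in the PH metric
beta_aft = ph_to_aft(beta_ph, shape_p)
time_ratio = math.exp(beta_aft)        # exponentiated AFT coefficient

print(f"PH coefficient:  {beta_ph:.4f}")
print(f"AFT coefficient: {beta_aft:.4f}")
print(f"time ratio:      {time_ratio:.4f}")
```

A hazard ratio below 1 maps to a time ratio above 1: a covariate that lowers the hazard lengthens expected survival time.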
coal.dta contains 112 observations, and it includes the variables id, which records observation
identifiers; count, which records the number of coal mining disasters involving 10 or more deaths;
and year, which records the years corresponding to the disasters.
. use https://www.stata-press.com/data/r18/coal
(British coal-mining disaster data, 1851-1962)
. describe
Contains data from https://www.stata-press.com/data/r18/coal.dta
Observations: 112 British coal-mining disaster
data, 1851-1962
Variables: 3 5 Feb 2022 18:03
(_dta has notes)
Sorted by:
The figures below suggest a fairly abrupt decrease in the rate of disasters around the 1887–1895
period, possibly because of the decline in labor productivity in coal mining (Raftery and Akman 1986).
The line plot of count versus year is shown in the left pane and its smoothed version in the right
pane.
[Figure: left pane, line plot of the number of disasters per year against year of disasters; right pane, median spline of the same series]
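To find the change-point parameter (cp) in the rate of disasters, we apply the following Bayesian
model with noninformative priors for the parameters (accounting for the restricted range of cp):
$$\text{count}_i \sim \text{Poisson}(\mu_1), \quad \text{if } \text{year}_i < cp$$
$$\text{count}_i \sim \text{Poisson}(\mu_2), \quad \text{if } \text{year}_i \ge cp$$
$$\mu_1 \sim 1, \quad \mu_2 \sim 1, \quad cp \sim \text{Uniform}(1851, 1962)$$
The model has three parameters: $\mu_1$, $\mu_2$, and cp, which we will declare as {mu1}, {mu2}, and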
{cp} with bayesmh. One interesting feature of this model is the specification of a mixture distribution
for count. To accommodate this, we specify the substitutable expression
({mu1}*sign(year<{cp})+{mu2}*sign(year>={cp}))
as the mean of a Poisson distribution dpoisson(). To ensure the feasibility of the initial state,
we specify the desired initial values in option initial(). Because of high autocorrelation in the
MCMC chain, we increase the MCMC size to achieve higher precision of our estimates. We change
the default title to the title specific to our analysis. To monitor the progress of simulation, we request
that bayesmh display a dot every 500 iterations and an iteration number every 5,000 iterations.
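As a quick check of how the substitutable expression selects a mean, the do-file sketch below evaluates it by hand for one year on each side of a change point. The values 3, 1, and 1890 for {mu1}, {mu2}, and {cp} are hypothetical stand-ins chosen for illustration, not estimates.

```stata
* Hypothetical values: mu1 = 3, mu2 = 1, cp = 1890
* A relational expression returns 1 (true) or 0 (false), and sign() keeps it as is
display 3*sign(1880 <  1890) + 1*sign(1880 >= 1890)   // year before cp
display 3*sign(1900 <  1890) + 1*sign(1900 >= 1890)   // year after cp
```

The first expression evaluates to 3 (the mean before the change point) and the second to 1 (the mean after it), so exactly one of the two terms is active for each observation.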
. set seed 14
. bayesmh count,
> likelihood(dpoisson({mu1}*sign(year<{cp})+{mu2}*sign(year>={cp})))
> prior({mu1} {mu2}, flat)
> prior({cp}, uniform(1851,1962))
> initial({mu1} 1 {mu2} 1 {cp} 1906)
> mcmcsize(40000) title(Change-point analysis) dots(500, every(5000))
Burn-in 2500 a.... done
Simulation 40000 .........5000.........10000.........15000.........20000.......
> ..25000.........30000.........35000.........40000 done
Model summary
Likelihood:
count ~ poisson({mu1}*sign(year<{cp})+{mu2}*sign(year>={cp}))
Priors:
{mu1 mu2} ~ 1 (flat)
{cp} ~ uniform(1851,1962)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
According to our results, the change occurred in the first half of 1890. The drop of the disaster rate
was significant, from an estimated average of 3.2 to 0.9.
The diagnostic plots, for example, for {cp} do not indicate any convergence problems. (This is
also true for other parameters.)
. bayesgraph diagnostics {cp}
[Figure: bayesgraph diagnostics for cp, with trace, histogram, autocorrelation, and density (all, 1-half, 2-half) panels]
The simulated marginal density of {cp} shown in the bottom-right panel provides more details. Apart
from the main peak, there are two smaller bumps around the years 1886 and 1896, which correspond
to local peaks in the number of disasters in those years: 4 in 1886 and 3 in 1896.
We may be interested in estimating the ratio between the two means. We can use bayesstats
summary to estimate this ratio.
. bayesstats summary (ratio:{mu1}/{mu2})
Posterior summary statistics MCMC sample size = 40,000
ratio : {mu1}/{mu2}
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
The posterior mean estimate of the ratio and its 95% credible interval confirm the change between
the two means. After 1890, the mean number of disasters decreased by a factor of about 3.4 with a
95% credible range of [2.5, 4.6].
Remember that convergence must be verified not only for all model parameters but also for the
functions of interest. The diagnostic plots for ratio look good.
. bayesgraph diagnostics (ratio:{mu1}/{mu2})
[Figure: bayesgraph diagnostics for ratio ({mu1}/{mu2}), with trace, histogram, autocorrelation, and density (all, 1-half, 2-half) panels]
$$y_{i(jk)} = \mu + (-1)^{j-1}\frac{\phi}{2} + (-1)^{k-1}\frac{\pi}{2} + d_i + \epsilon_{i(jk)} = \mu_{i(jk)} + \epsilon_{i(jk)}$$
bioequiv.dta has four main variables: subject identifier id from 1 to 10, treatment identifier
treat containing values 1 or 2, period identifier period containing values 1 or 2, and outcome y
measuring log concentration for the two tablets.
. use https://www.stata-press.com/data/r18/bioequiv
(Bioequivalent study of Carbamazepine tablets)
. describe
Contains data from https://www.stata-press.com/data/r18/bioequiv.dta
Observations: 20 Bioequivalent study of
Carbamazepine tablets
Variables: 5 5 Feb 2022 23:45
(_dta has notes)
The outcome is assumed to be normally distributed with mean $\mu_{i(jk)}$ and variance $\sigma^2$. To
accommodate the specific structure of the regression function, we use a nonlinear specification of
bayesmh. We specify the expression for the mean function $\mu_{i(jk)}$ as a nonlinear expression following
the outcome y. We include subject-specific random effects $d_i$ as {D[id]} in the nonlinear expression.
We specify noninformative priors for parameters and use Gibbs sampling for variance components
{tau} and {var}. To improve convergence, we increase the burn-in period to 5,000. We also specify
the showreffects option to display the estimates of subject-specific effects {D[id]}.
. bayesmh y = ({mu}+(-1)^(treat-1)*{phi}/2+(-1)^(period-1)*{pi}/2+{D[id]}),
> likelihood(normal({var}))
> prior({D[id]}, normal(0,{tau}))
> prior({tau}, igamma(0.001,0.001))
> prior({var}, igamma(0.001,0.001))
> prior({mu} {phi} {pi}, normal(0,1e6))
> block({tau}, gibbs)
> block({var}, gibbs)
> burnin(5000) rseed(17) showreffects
Burn-in 5000 aaaaaaaaa1000aaaaaaaaa2000aaaaaaaaa3000aaaaaaaaa4000aaaaaaaaa5000
> done
Simulation 10000 .........1000.........2000.........3000.........4000.........
> 5000.........6000.........7000.........8000.........9000.........10000 done
Model summary
Likelihood:
y ~ normal({mu}+(-1)^(treat-1)*{phi}/2+(-1)^(period-1)*{pi}/2+{D[id]},{var})
Priors:
{var} ~ igamma(0.001,0.001)
{D[id]} ~ normal(0,{tau})
{mu phi pi} ~ normal(0,1e6)
Hyperprior:
{tau} ~ igamma(0.001,0.001)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
D[id]
1 .0744192 .0831627 .004779 .074302 -.0849912 .2504312
2 .1364082 .0882816 .00521 .1365127 -.0359345 .3141966
3 .0640035 .0843961 .005008 .0596878 -.0939025 .2507555
4 .0708824 .0797542 .004431 .067086 -.0787817 .2440256
5 .1828674 .0937784 .005368 .184261 .0040691 .3700767
6 -.1694658 .0876467 .006416 -.1729349 -.3306482 .0033349
7 -.1212957 .0836953 .005709 -.1226434 -.2772058 .0448479
8 -.0603565 .0796002 .005112 -.0613437 -.218101 .1017121
9 -.0769446 .0800835 .00564 -.0762672 -.2324788 .088155
10 -.0076075 .0778637 .004483 -.0097928 -.1540721 .1496486
Sampling efficiencies look reasonable considering the number of model parameters. The diagnostic
plots of the main model parameters (not shown here) look reasonable, except there is a high
autocorrelation in the MCMC for {mu}, so you may consider increasing the MCMC size or using
thinning.
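One way to act on this suggestion is to refit the same model with a larger MCMC sample and a thinning interval and then recheck {mu}. The sketch below reuses the specification above; the values mcmcsize(20000) and thinning(5) are arbitrary illustrations, not recommendations, and the run takes correspondingly longer.

```stata
* Same model as above, with a larger MCMC sample and thinning (illustrative values)
bayesmh y = ({mu}+(-1)^(treat-1)*{phi}/2+(-1)^(period-1)*{pi}/2+{D[id]}), ///
    likelihood(normal({var}))                  ///
    prior({D[id]}, normal(0,{tau}))            ///
    prior({tau}, igamma(0.001,0.001))          ///
    prior({var}, igamma(0.001,0.001))          ///
    prior({mu} {phi} {pi}, normal(0,1e6))      ///
    block({tau}, gibbs)                        ///
    block({var}, gibbs)                        ///
    burnin(5000) mcmcsize(20000) thinning(5) rseed(17)

* Recheck autocorrelation and effective sample size for {mu}
bayesgraph ac {mu}
bayesstats ess {mu}
```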
Parameter θ = exp(φ) is commonly used as a measure of bioequivalence. Bioequivalence is
declared whenever θ lies in the interval [0.8, 1.2] with a high posterior probability.
We use bayesstats summary to calculate this probability and to also display other main parameters.
. bayesstats summary {mu} {phi} {pi} {tau} {var}
> (theta:exp({phi})) (equiv:exp({phi})>0.8 & exp({phi})<1.2)
Posterior summary statistics MCMC sample size = 10,000
theta : exp({phi})
equiv : exp({phi})>0.8 & exp({phi})<1.2
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
Sorted by:
The estimates of log odds-ratios and their squared standard errors are recorded in variables D and var,
respectively. They are computed from variables deaths0, total0, deaths1, and total1 based on
empirical logits; see Carlin [1992, eq. (3) and (4)]. The study variable records study identifiers.
In a normal–normal model, we assume a random-effects model for estimates of log odds-ratios
with normally distributed errors and normally distributed random effects. Specifically,
$$D_i = d + u_i + \varepsilon_i = d_i + \varepsilon_i$$
where $\varepsilon_i \sim N(0, \text{var}_i)$ and $d_i \sim N(d, \sigma^2)$. Errors $\varepsilon_i$'s represent uncertainty about estimates of log
odds-ratios in each study $i$ and are assumed to have known study-specific variances, $\text{var}_i$'s. Random
effects $d_i$'s represent differences in estimates of log odds-ratios from study to study. The estimates of
their mean and variance are of interest in meta-analysis: $d$ estimates a true effect, and $\sigma^2$ estimates
variation in estimating this effect across studies. Small values of $\sigma^2$ imply that the estimates of a true
effect agree among studies.
In Bayesian analysis, we additionally specify prior distributions for $d$ and $\sigma^2$. Following Carlin (1992),
we use noninformative priors for these parameters: normal with large variance for $d$ and
inverse gamma with very small degrees of freedom for $\sigma^2$.
$$d \sim N(0, 1000)$$
$$\sigma^2 \sim \text{InvGamma}(0.001, 0.001)$$
Likelihood:
D ~ normal(xb_D,var)
Prior:
{D[study]} ~ normal({d},{sig2}) (1)
Hyperpriors:
{d} ~ normal(0,1000)
{sig2} ~ igamma(0.001,0.001)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
Our posterior mean estimates d and sig2 of mean $d$ and variance $\sigma^2$ are −0.25 and 0.019, respectively,
with posterior standard deviations of 0.06 and 0.02. The estimates are close to those reported by
Carlin (1992). Considering the number of parameters, the AR and efficiency summaries look good.
We can obtain the efficiencies for the main parameters by using bayesstats ess.
. bayesstats ess {d} {sig2}
Efficiency summaries MCMC sample size = 10,000
Efficiency: min = .02206
avg = .02348
max = .02491
The efficiencies are acceptable, but based on the correlation times, the autocorrelation becomes small
only after lag 40 or so. The precision of the mean and variance estimates is comparable with those
based on 249 independent observations for the mean and 220 independent observations for the variance.
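The link between these numbers can be checked by hand, assuming the usual definitions: efficiency is the effective sample size (ESS) divided by the MCMC sample size, and correlation time is the reciprocal of efficiency. The sketch below pairs the maximum efficiency with {d} and the minimum with {sig2}, which is an inference from the figures quoted above rather than something shown in the output.

```stata
* ESS = efficiency x MCMC sample size (10,000 here)
display "ESS, larger efficiency:  " 0.02491 * 10000
display "ESS, smaller efficiency: " 0.02206 * 10000

* Correlation time = 1/efficiency, the lag scale over which draws stay correlated
display "Correlation time: " 1/0.02491
display "Correlation time: " 1/0.02206
```

The first two lines give roughly 249 and 221, and the correlation times come out at about 40 and 45 iterations, which agrees with the remark that autocorrelation becomes small only after lag 40 or so.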
We explore convergence visually.
. bayesgraph diagnostics {d} {sig2}
[Figure: bayesgraph diagnostics for d, with trace, histogram, autocorrelation, and density (all, 1-half, 2-half) panels]
[Figure: bayesgraph diagnostics for sig2, with trace, histogram, autocorrelation, and density (all, 1-half, 2-half) panels]
The diagnostic plots look reasonable for both parameters, but autocorrelation is high. You may consider
increasing the default MCMC size to obtain more precise estimates of posterior means.
$$\text{logit}(p_i^C) = \mu_i$$
$$\text{logit}(p_i^T) = \mu_i + d_i$$
where $\mu_i$ is log odds of success in the control group in study $i$ and $\mu_i + d_i$ is log odds of success in
the treatment group. $d_i$'s are viewed as random effects and are assumed to be normally distributed as
$$d_i \sim \text{i.i.d. } N(d, \sigma^2)$$
$$y_i^C \sim \text{Binomial}(p_i^C, n_i^C)$$
$$y_i^T \sim \text{Binomial}(p_i^T, n_i^T)$$
$$d_i \sim \text{i.i.d. } N(d, \sigma^2)$$
where $d$ is the population effect and is the main parameter of interest in the model and $\sigma^2$ is its
variability across trials.
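We can rewrite the model above assuming the data are in long form as
$$y_i \sim \text{Binomial}(p_i, n_i)$$
$$\text{logit}(p_i) = \mu_i + (T_i == 1)\, d_i$$
$$d_i \sim \text{i.i.d. } N(d, \sigma^2)$$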
where $T_i$ is a binary treatment with $T_i = 0$ for the control group and $T_i = 1$ for the treatment group.
In Bayesian analysis, we additionally specify prior distributions for $\mu_i$, $d$, and $\sigma^2$. We use
noninformative priors.
$$\mu_i \sim 1$$
$$d \sim N(0, 1000)$$
$$\sigma^2 \sim \text{InvGamma}(0.001, 0.001)$$
We continue our analysis of beta-blockers data. The analysis of these data using a binomial-normal
model is also provided as an example in OpenBUGS (Thomas et al. 2006).
For this analysis, we use the beta-blockers data in long form.
. use https://www.stata-press.com/data/r18/betablockers_long
(Beta-blockers data in long form)
. describe
Contains data from https://www.stata-press.com/data/r18/betablockers_long.dta
Observations: 44 Beta-blockers data in long form
Variables: 4 5 Feb 2022 19:02
(_dta has notes)
Variable treat records the binary treatment: treat==0 identifies the control group, and treat==1
identifies the treatment group.
We include M[study] to specify the random effects $\mu_i$'s and 1.treat#D[study] for the random
effects $(T_i == 1)d_i$'s. We use a binomial() likelihood model for the number of deaths. We split
the hyperparameters and random effects {D[study]} into separate blocks and request Gibbs sampling
for sig2 to improve efficiency of the algorithm.
. bayesmh deaths M[study] 1.treat#D[study], likelihood(binomial(total))
> noconstant
> prior({M[study]}, flat)
> prior({D[study]}, normal({d},{sig2}))
> prior({d}, normal(0,1000))
> prior({sig2}, igamma(0.001,0.001))
> block({D[study]}, split)
> block({d sig2}, gibbs split)
> rseed(17)
Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 10000 .........1000.........2000.........3000.........4000.........
> 5000.........6000.........7000.........8000.........9000.........10000 done
Model summary
Likelihood:
deaths ~ binlogit(xb_deaths,total)
Priors:
{M[study]} ~ 1 (flat) (1)
{D[study]} ~ normal({d},{sig2}) (1)
Hyperpriors:
{d} ~ normal(0,1000)
{sig2} ~ igamma(0.001,0.001)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
This model has 22 more parameters than the model in example 26. The posterior mean estimates
d and sig2 of mean $d$ and variance $\sigma^2$ are −0.25 and 0.019, respectively, with posterior standard
deviations of 0.07 and 0.02. The estimates of the mean and variance are again close to the ones
reported by Carlin (1992).
Compared with example 26, the efficiencies and other statistics for the main parameters are similar.
. bayesstats ess {d} {sig2}
Efficiency summaries MCMC sample size = 10,000
Efficiency: min = .01025
avg = .01398
max = .01771
[Figure: bayesgraph diagnostics for d, with trace, histogram, autocorrelation, and density (all, 1-half, 2-half) panels]
[Figure: bayesgraph diagnostics for sig2, with trace, histogram, autocorrelation, and density (all, 1-half, 2-half) panels]
where a is a discrimination parameter. According to this likelihood model, the probability of success
increases with the subject ability and decreases with item difficulty. The discrimination parameter
a represents the slope of the item characteristic curves. The subject abilities are assumed to be
standardized so that
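$$\theta_j \sim \text{i.i.d. } N(0, 1)$$
The discrimination parameter $a$ can be absorbed into $\theta_j$ and $b_i$ so that the model is reparameterized
as
$$\text{logit}\{\Pr(y_{ij} = 1 \mid \tilde b_i, \tilde\theta_j)\} = \tilde\theta_j + \tilde b_i \qquad (1)$$
where $\tilde\theta_j = a\theta_j \sim N(0, \sigma^2)$ with $\sigma = a$, and $\tilde b_i = -a b_i$. In addition to the above, a Bayesian formulation of the model requires
prior specifications for parameters $\sigma^2$ and $\tilde b_i$. In the following example, we use
$$\sigma^2 \sim \text{InvGamma}(0.01, 0.01)$$
$$\tilde b_i \sim N(0, 10)$$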
To fit this model using bayesmh, we first need to reshape the data from example 1 of [IRT] irt
1pl in long format so that the answers to the nine questions are represented by the response variable
y, while the item and id variables encode the questions and students, respectively.
. use https://www.stata-press.com/data/r18/masc1, clear
(Data from De Boeck & Wilson (2004))
. generate id = _n
. quietly reshape long q, i(id) j(item)
. rename q y
The Rasch likelihood model can be specified with bayesmh using y as a dependent variable and
U[item] and V[id] as crossed random effects. We use the noconstant option in the likelihood
specification to include all levels of U[item] and V[id]. The random-effects parameters {V[id]}
are assigned a zero-mean normal prior with variance {var} [$\sigma^2$ in model specification (1)]. The
parameter {var} is assigned a noninformative inverse-gamma prior with shape 0.01 and scale 0.01,
whereas the parameters {U[item]} [$\tilde b_i$'s in model (1)] are assigned ad hoc informative normal(0,10)
priors.
. bayesmh y U[item] V[id], noconstant likelihood(logit)
> prior({U[item]}, normal(0, 10))
> prior({V[id]}, normal(0, {var}))
> prior({var}, igamma(0.01,0.01))
> block({var}) rseed(17) showreffects(U[item])
Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaa.. done
Simulation 10000 .........1000.........2000.........3000.........4000.........
> 5000.........6000.........7000.........8000.........9000.........10000 done
Model summary
Likelihood:
y ~ logit(xb_y)
Priors:
{U[item]} ~ normal(0,10) (1)
{V[id]} ~ normal(0,{var}) (1)
Hyperprior:
{var} ~ igamma(0.01,0.01)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
U[item]
1 .6027924 .0848417 .002727 .6033436 .4383438 .7676613
2 .1047865 .0817006 .002411 .1017675 -.0494946 .2691851
3 1.551305 .0953048 .002574 1.549129 1.362338 1.745973
4 -.2759237 .0791898 .002193 -.2752539 -.4319626 -.121707
5 -1.408907 .0940374 .002999 -1.40848 -1.590385 -1.224282
6 -.5913131 .0837824 .002701 -.5902511 -.7540854 -.431315
7 -1.128982 .0921381 .002597 -1.129163 -1.311912 -.9454393
8 2.054062 .1130098 .003294 2.052132 1.842889 2.278157
9 1.018282 .091037 .002634 1.015498 .8454456 1.195609
In the simulation summary, bayesmh reports a modest average efficiency of about 11% with no
indication of any convergence problems. We could have omitted the prior specification for {V[id]},
in which case bayesmh would have labeled the variance component as {var_V}.
To match the discrimination and question difficulty parameters of the irt 1pl command, we can
apply the following transformation to the bayesmh model parameters. The common discrimination
parameter equals the square root of {var}, and the individual question difficulties equal the negatives
of the {U[item]} parameters, divided by the common discrimination. We can obtain estimates of
these parameters using the bayesstats summary command.
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
We observe that the reported posterior means for the common discrimination and question difficulties
are close to those obtained with irt 1pl, within the limits of the MCMC standard errors.
In this example, we fit the Rasch model and use transformation to estimate parameters of the
corresponding 1PL IRT model. To avoid reparameterization, we could have fit the 1PL model directly
using a nonlinear specification of bayesmh, as we demonstrate in example 29 for the 2PL IRT model.
The shortcoming of the nonlinear specification is slower execution.
$$\Pr(Y_{ij} = 1) = \frac{\exp\{a_i(\theta_j - b_i)\}}{1 + \exp\{a_i(\theta_j - b_i)\}}$$
where $a_i$'s and $b_i$'s are discrimination and difficulty parameters and $\theta_j$'s are subject abilities. This
is a logistic regression model with probability of success modeled using the linear form $a_i(\theta_j - b_i)$.
We assume that the probability of success increases with subject ability, which implies $a_i > 0$.
Subject ability parameters are assumed independent and distributed according to the standard normal
distribution
$$\theta_j \sim N(0, 1)$$
$$\ln(a_i) \sim N(\mu_a, \sigma_a^2)$$
$$b_i \sim N(\mu_b, \sigma_b^2)$$
$$\mu_a, \mu_b \sim N(0, 1)$$
$$\sigma_a^2, \sigma_b^2 \sim \text{Gamma}(1, 1)$$
In the absence of prior knowledge about parameters $a_i$'s and $b_i$'s, we want to specify proper priors
that are not subjective. Because $a_i$'s must be positive, a common choice is to assume that $\ln(a_i)$'s
are normally distributed with mean $\mu_a$ and variance $\sigma_a^2$. We assume that $b_i$'s are normally distributed
with mean $\mu_b$ and variance $\sigma_b^2$. Our prior assumption is that the questions in the study are fairly
balanced in terms of discrimination and difficulty, and we express this expectation by specifying
$N(0, 1)$ hyperpriors for $\mu_a$ and $\mu_b$; that is, we assume that $\mu_a$ and $\mu_b$ are not that different from zero.
We also put a slight prior constraint on the variability of the discrimination and difficulty parameters
by assigning a gamma distribution with shape 1 and scale 1 as hyperprior distributions for $\sigma_a^2$ and
$\sigma_b^2$. To demonstrate a Bayesian 2PL model, we again use the mathematics and science dataset masc1,
reshaped in long format as in example 28.
. bayesmh y = ({Discr[item]}*({V[id]}-{Diff[item]})), likelihood(logit)
> prior({V[id]}, normal(0, 1))
> prior({Discr[item]}, lognormal({mua}, {vara}))
> prior({Diff[item]}, normal({mub}, {varb}))
> prior({vara varb}, gamma(1, 1))
> prior({mua mub}, normal(0, 1))
> ...
To specify the 2PL model likelihood in bayesmh, we need to use a nonlinear specification
to accommodate the varying coefficients $a_i$'s. For masc1.dta, we have 9 items, where
$i = 1, \dots, 9$, and 800 subjects, where $j = 1, \dots, 800$. A straightforward nonlinear specification is
({Discr[item]}*({V[id]}-{Diff[item]})), where random effects Discr[item], Diff[item],
and V[id] represent discrimination, item difficulty, and student ability, respectively.
To achieve better sampling efficiency, we place the hyperparameters {mua}, {mub}, {vara},
and {varb} into separate blocks using the block()’s suboption split. We also initialize the
discrimination and difficulty random effects with 1 because the default 0s result in an invalid initial
state. Because the random effects are not shown by default, we use the showreffects() option to
display the discrimination and difficulty parameters.
Likelihood:
y ~ logit({Discr[item]}*({V[id]}-{Diff[item]}))
Priors:
{V[id]} ~ normal(0,1)
{Discr[item]} ~ lognormal({mua},{vara})
{Diff[item]} ~ normal({mub},{varb})
Hyperpriors:
{vara varb} ~ gamma(1,1)
{mua mub} ~ normal(0,1)
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
Discr[item]
1 1.474051 .226756 .016747 1.461149 1.085353 1.977109
2 .6710171 .1110106 .004925 .6675754 .4590724 .8893063
3 .9238635 .1454797 .011848 .9209288 .6422116 1.217656
4 .8076416 .1221467 .006042 .8019258 .5810136 1.057661
5 .8825339 .1445803 .011687 .8722941 .6319481 1.197729
6 .9497897 .1401296 .007687 .944759 .6944811 1.236898
7 .4846824 .0881389 .006968 .4791858 .3258165 .6695858
8 1.353603 .219108 .023569 1.362743 .9303272 1.772465
9 .6649918 .1198973 .01178 .6650413 .444871 .90068
Diff[item]
1 -.5069895 .0818094 .004323 -.5031544 -.6849757 -.3521039
2 -.1502343 .121276 .003424 -.1455632 -.407207 .0784043
3 -1.742259 .2430085 .019752 -1.706428 -2.331342 -1.357637
4 .3328318 .1101783 .003805 .3282234 .1280959 .555568
5 1.638084 .2356449 .018557 1.616757 1.247654 2.160822
6 .6465024 .116495 .005363 .6380789 .4409175 .8947524
7 2.158884 .4045901 .031847 2.101079 1.528233 3.101399
8 -1.779656 .2166062 .022939 -1.742365 -2.300026 -1.453126
9 -1.490028 .2781509 .025778 -1.451536 -2.13252 -1.065914
bayesmh reports an acceptable average efficiency of about 4%. A close inspection of the estimation
table shows that the posterior mean estimates for item discrimination and difficulty are similar to the
MLE estimates obtained with the irt 2pl command; see example 1 in [IRT] irt 2pl.
$$\text{lncrime}_i = I + iS + \epsilon$$
where $I$ and $S$ are latent variables and $\epsilon$ is a vector of error terms that are normally distributed
with mean zero and variance $\sigma^2$. The coefficients for the random intercepts are fixed to 1, and the
coefficients for the slopes are fixed to 0, 1, 2, and 3, corresponding to the 4 quarters. I and S are
assumed to be correlated.
. use https://www.stata-press.com/data/r18/sem_lcm
. describe
Contains data from https://www.stata-press.com/data/r18/sem_lcm.dta
Observations: 359
Variables: 4 25 May 2022 11:08
(_dta has notes)
Sorted by:
To fit the model using bayesmh, we need to specify four normal likelihood equations, one for each
crime-rate variable, that include latent variables {I[_n]} and {S[_n]} (see Random effects). The
error variance $\sigma^2$ is given by the parameter {var}. As in a classical SEM model, the latent variables are
assumed to have a bivariate normal distribution, which we will model using the mvnormal() prior with
means {meani} and {means} and variance–covariance matrix {Sigma,m}. In a Bayesian model, we
additionally specify prior distributions for all other model parameters. Specifically, the error variance
is assigned the inverse-gamma prior, igamma(1, 1). The hyperparameters {meani} and {means} are
assigned normal(0, 100) priors. And the covariance {Sigma,m} matrix hyperparameter is assigned
an inverse-Wishart prior, iwishart(2,3,I(2)).
We place parameters in separate blocks and use Gibbs sampling for the covariance {Sigma,m}.
To do this, we must specify each parameter in separate prior() and block() options. More
conveniently, we can use prior()'s and block()'s split suboptions to combine similar parameters
in one prior() and one block() specification.
Likelihood:
lncrime0 ~ normal(xb_lncrime0,{var})
lncrime1 ~ normal(xb_lncrime1,{var})
lncrime2 ~ normal(xb_lncrime2,{var})
lncrime3 ~ normal(xb_lncrime3,{var})
Priors:
{var} ~ igamma(1,1)
{I[_n] S[_n]} ~ mvnormal(2,{meani},{means},{Sigma,m}) (1)
Hyperpriors:
{meani means} ~ normal(0,100)
{Sigma,m} ~ iwishart(2,3,I(2))
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
lncrime0
I 1 0 0 1 1 1
S 0 0 0 0 0 0
lncrime1
I 1 0 0 1 1 1
S 1 0 0 1 1 1
lncrime2
I 1 0 0 1 1 1
S 2 0 0 2 2 2
lncrime3
I 1 0 0 1 1 1
S 3 0 0 3 3 3
The average sampling efficiency is about 6% with no signs of convergence problems. The posterior
mean estimates are similar to the maximum likelihood estimates reported by the sem command.
As expected, there is a negative correlation between the latent variables I and S of about −0.32.
. bayesstats summary (corr:{Sigma_1_2}/sqrt({Sigma_1_1}*{Sigma_2_2}))
Posterior summary statistics MCMC sample size = 10,000
corr : {Sigma_1_2}/sqrt({Sigma_1_1}*{Sigma_2_2})
Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]
Because the linear growth model assumes that the slope coefficients are constrained to 0, 1, 2, and
3, it may be interesting to check how well the observed average quarterly crime rates are explained
by the model. We can formally address this question by simulating the posterior predictive crime-
rate means from the model and comparing them with the observed quarterly averages. We use the
bayespredict command to simulate the expected outcomes from the posterior predictive distribution.
For example, in the specification below, the first expected outcome is obtained by applying the mean
function to {_ysim1}, pmean0:@mean({_ysim1}), and saving it as {pmean0} in a new prediction
dataset predmeans.dta. Once {pmean0}, {pmean1}, {pmean2}, and {pmean3} are simulated, we
use the bayesstats ppvalues command to compute the corresponding posterior predictive p-values
to check model fit. Before using bayespredict, however, we must save our simulation results in a
permanent Stata dataset.
. bayesmh, saving(semex18sim)
note: file semex18sim.dta saved.
. bayespredict (pmean0:@mean({_ysim1})) (pmean1:@mean({_ysim2}))
> (pmean2:@mean({_ysim3})) (pmean3:@mean({_ysim4})),
> saving(predmeans) rseed(17) dots
Computing predictions 10000 .........1000.........2000.........3000.........
> 4000.........5000.........6000.........7000.........8000.........9000.........
> 10000 done
file predmeans.dta saved.
file predmeans.ster saved.
. bayesstats ppvalues {pmean0} {pmean1} {pmean2} {pmean3} using predmeans
Posterior predictive summary MCMC sample size = 10,000
All expected quarterly crime rates except the second one are consistent with the observed data. For
the second-quarter crime rate, we have a low posterior p-value of 3%. We could relax the assumption
of a linear growth for the second quarter and check whether this improves model fit.
Stored results
bayesmh stores the following in e():
Scalars
e(N) number of observations
e(N_sub) number of subjects (only with survival models)
e(N_fail) number of failures (only with survival models)
e(risk) total time at risk (only with survival models)
e(k) number of parameters
e(k_sc) number of scalar parameters
e(k_mat) number of matrix parameters
e(n_eq) number of equations
e(nchains) number of MCMC chains
e(mcmcsize) MCMC sample size
e(burnin) number of burn-in iterations
e(mcmciter) total number of MCMC iterations
e(thinning) thinning interval
e(arate) overall AR
e(eff_min) minimum efficiency
e(eff_avg) average efficiency
e(eff_max) maximum efficiency
e(Rc_max) maximum Gelman–Rubin convergence statistic (only with nchains())
e(clevel) credible interval level
e(hpd) 1 if hpd is specified; 0 otherwise
e(batch) batch length for batch-means calculations
e(corrlag) maximum autocorrelation lag
e(corrtol) autocorrelation tolerance
e(dic) deviance information criterion
e(lml_lm) log marginal-likelihood using Laplace–Metropolis method
e(scale) initial multiplier for scale factor; scale()
e(block#_gibbs) 1 if Gibbs sampling is used in #th block, 0 otherwise
e(block#_reffects) 1 if the parameters in #th block are random effects, 0 otherwise
e(block#_scale) #th block initial multiplier for scale factor
e(block#_tarate) #th block target adaptation rate
e(block#_tolerance) #th block adaptation tolerance
e(adapt_every) adaptation iterations, adaptation(every())
e(adapt_maxiter) maximum number of adaptive iterations, adaptation(maxiter())
e(adapt_miniter) minimum number of adaptive iterations, adaptation(miniter())
e(adapt_alpha) adaptation parameter, adaptation(alpha())
e(adapt_beta) adaptation parameter, adaptation(beta())
e(adapt_gamma) adaptation parameter, adaptation(gamma())
e(adapt_tolerance) adaptation tolerance, adaptation(tolerance())
e(repeat) number of attempts used to find feasible initial values
Macros
e(cmd) bayesmh
e(cmdline) command as typed
e(method) sampling method
e(depvars) names of dependent variables
e(eqnames) names of equations
e(likelihood) likelihood distribution (one equation)
e(likelihood#) likelihood distribution for #th equation
e(prior) prior distribution
e(prior#) prior distribution, if more than one prior() is specified
e(priorparams) parameter specification in prior()
e(priorparams#) parameter specification from #th prior(), if more than one prior() is specified
e(parnames) names of model parameters except exclude()
e(postvars) variable names corresponding to model parameters in e(parnames)
e(subexpr) substitutable expression
e(subexpr#) substitutable expression, if more than one
e(wtype) weight type (one equation)
e(wtype#) weight type for #th equation
e(wexp) weight expression (one equation)
Matrices
e(mean) posterior means
e(sd) posterior standard deviations
e(mcse) MCSE
e(median) posterior medians
e(cri) credible intervals
e(Cov) variance–covariance matrix of parameters
e(ess) effective sample sizes
e(init) initial values vector
e(dic_chains) deviance information criterion for each chain (only with nchains())
e(arate_chains) acceptance rate for each chain (only with nchains())
e(eff_min_chains) minimum efficiency for each chain (only with nchains())
e(eff_avg_chains) average efficiency for each chain (only with nchains())
e(eff_max_chains) maximum efficiency for each chain (only with nchains())
e(lml_lm_chains) log marginal-likelihood for each chain (only with nchains())
Functions
e(sample) mark estimation sample
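As an illustration of how these results might be inspected, the do-file sketch below lists and displays a few of them after a fit. It assumes a bayesmh model has just been estimated in the current session and uses only names from the list above.

```stata
* Run after any bayesmh fit
ereturn list                 // all stored scalars, macros, and matrices

* A few scalars from the simulation summary
display e(mcmcsize)          // MCMC sample size
display e(arate)             // overall acceptance rate
display e(eff_avg)           // average efficiency

* Posterior summaries stored as matrices
matrix list e(mean)          // posterior means
matrix list e(cri)           // credible intervals
```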
Adaptive MH algorithm
The bayesmh command implements an adaptive random-walk Metropolis–Hastings algorithm with
optional blocking of parameters. Providing an efficient MH procedure for simulating from a general
posterior distribution is a difficult task, and various adaptive methods have been proposed (Haario,
Saksman, and Tamminen 2001; Giordani and Kohn 2010; Roberts and Rosenthal 2009; Andrieu and
Thoms 2008). The essence of the problem is in choosing an optimal proposal covariance matrix and
a scale for parameter updates. Below we describe the implemented adaptation algorithm, assuming
one block of parameters. In the presence of multiple blocks, the adaptation is applied to each block
independently. The adaptation() option of bayesmh controls all the tuning parameters for the
adaptation algorithm.
Let $\theta$ be a vector of $d$ scalar model parameters. Let $T_0$ be the length of a burn-in period
(iterations that are discarded) as specified in burnin() and $T$ be the size of the MCMC sample
(iterations that are retained) as specified in mcmcsize(). The total number of MCMC iterations is
then $T_{\text{total}} = T_0 + (T - 1) \times \text{thinning()} + 1$. Also, let ALEN be the length of the adaptation
interval (option adaptation(every())) and AMAX be the maximum number of adaptations (option
adaptation(maxiter())).
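The iteration count above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `total_iterations` is hypothetical and not part of Stata, and the argument values are arbitrary examples rather than documented defaults.

```python
def total_iterations(burnin: int, mcmcsize: int, thinning: int = 1) -> int:
    """Total MCMC iterations: T_total = T0 + (T - 1) * thinning + 1."""
    if burnin < 0 or mcmcsize < 1 or thinning < 1:
        raise ValueError("burnin must be >= 0; mcmcsize and thinning must be >= 1")
    return burnin + (mcmcsize - 1) * thinning + 1


print(total_iterations(2500, 10000, 1))  # 12500
print(total_iterations(2500, 10000, 5))  # 52496
```

With a thinning interval of 1, the total is simply the burn-in length plus the MCMC sample size; larger thinning intervals multiply the cost of the retained sample but not of the burn-in.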
The steps of the adaptive MH algorithm are the following. At $t = 0$, we initialize $\theta_t = \theta_0^f$, where
$\theta_0^f$ is the initial feasible state, and we set adaptation counter $k = 1$ and initialize $\rho_0 = 2.38/\sqrt{d}$,
where $d$ is the number of considered parameters. $\Sigma_0$ is the identity matrix. For $t = 1, \dots, T_{\text{total}}$, do
the following:
1. Generate proposal parameters: $\theta_* = \theta_{t-1} + e$, $e \sim N(0, \rho_k^2 \Sigma_k)$, where $\rho_k$ and $\Sigma_k$ are current
values of the proposal scale and covariance for adaptation iteration $k$.
2. Calculate the acceptance probability using
$$\alpha(\theta_*|\theta_{t-1}) = \min\left\{\frac{p(\theta_*|y)}{p(\theta_{t-1}|y)}, 1\right\}$$
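The proposal and acceptance-probability computations in steps 1 and 2 can be sketched in Python for a single scalar parameter. Everything here is an assumption made for illustration: the function names are hypothetical, the target `log_posterior` is an unnormalized standard normal chosen only so the example runs, and the ratio is evaluated on the log scale for numerical stability. This is not bayesmh's internal code.

```python
import math
import random


def log_posterior(theta: float) -> float:
    """Hypothetical unnormalized log posterior (standard normal target)."""
    return -0.5 * theta * theta


def propose(theta_old: float, rho: float, sigma2: float, rng: random.Random) -> float:
    """Step 1 for d = 1: theta* = theta_old + e, with e ~ N(0, rho^2 * sigma2)."""
    return theta_old + rng.gauss(0.0, rho * math.sqrt(sigma2))


def acceptance_probability(theta_new: float, theta_old: float) -> float:
    """Step 2: min{ p(theta_new | y) / p(theta_old | y), 1 }, via log densities."""
    log_ratio = log_posterior(theta_new) - log_posterior(theta_old)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


rng = random.Random(1)
theta_old = 0.5
theta_new = propose(theta_old, rho=2.38, sigma2=1.0, rng=rng)
alpha = acceptance_probability(theta_new, theta_old)
print(0.0 <= alpha <= 1.0)
```

Because only the ratio of posterior densities enters the formula, the normalizing constant of the posterior never needs to be computed.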
Adaptation. The adaptation step is performed as follows. At each adaptive iteration k of the
tth MCMC iteration, the proposal covariance Σk and scale ρk are tuned to achieve an optimal AR.
Some asymptotic results (for example, Gelman, Gilks, and Roberts [1997]) show that the optimal
AR, hereafter referred to as a TAR, for a single model parameter is 0.44 and is 0.234 for a block of
multiple parameters.
Adaptation is performed periodically after a constant number of iterations as specified by the
adaptation(every()) option. At least adaptation(miniter()) adaptive iterations are performed, but
no more than adaptation(maxiter()). bayesmh does not perform adaptation if the absolute difference
between the current AR and TAR is within the tolerance given by adaptation(tolerance()).
The bayesmh command allows you to control the calculation of AR through the
adaptation(alpha()) option with the default of 0.75, as follows,
$$AR_k = (1 - \alpha) AR_{k-1} + \alpha \widehat{AR}_k$$
where $\widehat{AR}_k$ is the expected acceptance probability, which is computed as the average of the acceptance
probabilities, $\alpha(\theta_*|\theta_{t-1})$, since the last adaptive iteration (for example, Andrieu and Thoms [2008]),
and $AR_0$ is defined as described in the adaptation(tarate()) option. Choosing $\alpha \in (0, 1)$ allows
for smoother change in the current AR between adaptive iterations.
The tuning of the proposal scale $\rho$ is based on results in Gelman, Gilks, and Roberts (1997),
Roberts and Rosenthal (2001), and Andrieu and Thoms (2008). The initial $\rho_0$ is set to $2.38/\sqrt{d}$,
where $d$ is the number of parameters in the considered block. Then, $\rho_k$ is updated according to
$$\rho_k = \rho_{k-1} e^{\beta_k \{\Phi^{-1}(AR_k/2) - \Phi^{-1}(TAR/2)\}} \qquad (2)$$
where $\Phi(\cdot)$ is the standard normal cumulative distribution function and $\beta_k$ is defined below.
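The smoothed acceptance rate and the scale update in (2) can be illustrated numerically. The Python sketch below is a hedged illustration only: the function names are hypothetical, the starting value AR_0 is assumed to equal the TAR for the example, and the restriction of AR_k and TAR to (0, 1] is a simplification of this sketch, not a documented property of bayesmh.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the inverse CDF


def smooth_ar(ar_prev: float, ar_hat: float, alpha: float = 0.75) -> float:
    """AR_k = (1 - alpha) * AR_{k-1} + alpha * AR_hat_k."""
    return (1.0 - alpha) * ar_prev + alpha * ar_hat


def update_scale(rho_prev: float, ar_k: float, tar: float, beta_k: float) -> float:
    """rho_k = rho_{k-1} * exp(beta_k * (Phi^{-1}(AR_k/2) - Phi^{-1}(TAR/2)))."""
    if not (0.0 < ar_k <= 1.0 and 0.0 < tar <= 1.0):
        raise ValueError("this sketch assumes AR_k and TAR lie in (0, 1]")
    z_ar = _STD_NORMAL.inv_cdf(ar_k / 2.0)
    z_tar = _STD_NORMAL.inv_cdf(tar / 2.0)
    return rho_prev * math.exp(beta_k * (z_ar - z_tar))


d = 3                                # block of three parameters (example value)
tar = 0.234                          # target acceptance rate for a multi-parameter block
rho0 = 2.38 / math.sqrt(d)           # initial scale
ar1 = smooth_ar(ar_prev=tar, ar_hat=0.50)       # recent acceptance was too high
rho1 = update_scale(rho0, ar1, tar, beta_k=0.8)
print(rho1 > rho0)                   # scale grows when AR exceeds TAR
```

The direction of the update follows from the monotonicity of $\Phi^{-1}$: an acceptance rate above the target enlarges the proposal scale, one below the target shrinks it, and the scale is unchanged when the two coincide.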
The adaptation of the covariance matrix is performed when multiple parameters are in the block
and is based on Andrieu and Thoms (2008). You may specify an initial proposal covariance matrix $\Sigma_0$
in covariance() or use the identity matrix by default. Then, at time of adaptation $k$, the proposal
covariance $\Sigma_k$ is recomputed according to the formula
$$\Sigma_k = (1 - \beta_k)\Sigma_{k-1} + \beta_k \widehat{\Sigma}_k, \qquad \beta_k = \frac{\beta_0}{k^{\gamma}} \qquad (3)$$
where $\widehat{\Sigma}_k = (\Theta_{t_k} - \mu_{k-1})(\Theta_{t_k} - \mu_{k-1})'/(t_k - t_{k-1})$ is the empirical covariance of the recent
MCMC sample $\Theta_{t_k} = \{\theta_s\}_{s=t_{k-1}}^{t_k}$ and $t_{k-1}$ is the MCMC iteration corresponding to the adaptive
iteration $k - 1$ or 0 if adaptation did not take place. $\mu_k$ is defined as
$$\mu_k = \mu_{k-1} + \beta_k(\bar{\Theta}_{t_k} - \mu_{k-1}), \quad k > 1$$
and $\mu_1 = \bar{\Theta}_{t_k}$, where $\bar{\Theta}_{t_k}$ is the sample mean of $\Theta_{t_k}$.
The constants β0 ∈ [0, 1] and γ ∈ [0, 1] in (3) are specified in the options adaptation(beta())
and adaptation(gamma()), respectively. The default values are 0.8 and 0, respectively. When
γ > 0, we have a diminishing adaptation regime, which means that Σk is not changing much from
one adaptive iteration to another. Random-walk Metropolis–Hastings algorithms with diminishing
adaptation are shown to preserve the ergodicity of the Markov chain (Roberts and Rosenthal 2007;
Andrieu and Moulines 2006; Atchadé and Rosenthal 2005).
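The role of β0 and γ in (3) can be seen in a small Python sketch. The function names are hypothetical and matrices are plain nested lists so the example stays self-contained. The empirical covariance is taken as given rather than computed from a chain, so this illustrates only the blending step, not the full adaptation.

```python
def beta_k(beta0: float, gamma: float, k: int) -> float:
    """Adaptation weight beta_k = beta0 / k**gamma for adaptive iteration k >= 1."""
    return beta0 / (k ** gamma)


def blend_covariance(sigma_prev, sigma_hat, b):
    """Sigma_k = (1 - b) * Sigma_{k-1} + b * Sigma_hat_k, element by element."""
    return [
        [(1.0 - b) * p + b * h for p, h in zip(row_prev, row_hat)]
        for row_prev, row_hat in zip(sigma_prev, sigma_hat)
    ]


def update_mean(mu_prev, sample_mean, b):
    """mu_k = mu_{k-1} + b * (sample_mean - mu_{k-1})."""
    return [m + b * (s - m) for m, s in zip(mu_prev, sample_mean)]


identity = [[1.0, 0.0], [0.0, 1.0]]
empirical = [[2.0, 0.5], [0.5, 1.0]]   # made-up empirical covariance

# gamma = 0: the weight stays at beta0 for every adaptive iteration
print([beta_k(0.8, 0.0, k) for k in (1, 2, 10)])
# gamma = 1: the weight shrinks with k (diminishing adaptation)
print([beta_k(0.8, 1.0, k) for k in (1, 2, 10)])
print(blend_covariance(identity, empirical, beta_k(0.8, 0.0, 1)))
```

With γ = 0 each adaptation gives the recent sample the same weight β0, whereas with γ > 0 later adaptations move the proposal covariance less and less.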
The above algorithm is also used for discrete parameters, but discretization is used to obtain
samples of discrete values. The default initial scale factor ρ0 is set to 2.38/d for a block of d
discrete parameters. The default TAR for discrete parameters with priors bernoulli() and index()
is max{0.1353, 1/nmaxbins }, where nmaxbins is the maximum number of discrete values a parameter
can take among all the parameters specified in the same block. Blocks containing a mixture of
continuous and discrete parameters are not allowed.
$$\pi(\eta_1, \dots, \eta_m|\theta) = \prod_{j=1}^{m} \pi(\eta_j|\theta)$$
$$\Pr(\eta_1, \dots, \eta_m, \theta|Y) = \pi(\theta) \prod_{j=1}^{m} \pi(\eta_j|\theta) \prod_{i=1}^{n_j} \Pr(y_{ij}|\eta_j, \theta)$$
where $\pi(\theta)$ is the prior distribution of $\theta$. This form of the posterior allows the parameters $\eta_j$ to be
placed in one block and steps 1, 2, and 3 of the adaptive MH algorithm to be performed for all of
them simultaneously in a vector form, as if they were a single scalar parameter.
To request the random-effects MH algorithm in bayesmh, use block’s suboption reffects. The
same algorithm is used if one includes the random effects in the model. A random-effects block of
parameters has a default acceptance rate of 0.44, performs adaptation of the scale ρk according to
(2), but uses a fixed identity matrix for the proposal covariance Σk .
Likelihood-prior configurations
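$$y_i|\theta_b, \sigma^2 \sim N(\theta_b, \sigma^2), \quad i = 1, 2, \dots, n$$
$$\theta_b \sim N(\mu_0, \tau_0^2)$$
$$\theta_b|\sigma^2, y \sim F_b = N(\mu_n, \tau_n^2)$$
where $\mu_0$ and $\tau_0^2$ are hyperparameters (prior mean and prior variance) of a normal prior distribution
for $\theta_b$ and
$$\mu_n = \left(\mu_0 \tau_0^{-2} + \sigma^{-2} \sum y_i\right) \tau_n^2$$
$$\tau_n^2 = (\tau_0^{-2} + n\sigma^{-2})^{-1}$$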
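The posterior mean and variance above are simple to evaluate. The following Python sketch computes them for a made-up data vector; the function name and all numeric inputs are hypothetical and serve only to show how the two formulas combine.

```python
def normal_normal_posterior(y, sigma2, mu0, tau0_sq):
    """Full conditional N(mu_n, tau_n^2) of a normal mean under a normal prior.

    tau_n^2 = (tau0^-2 + n * sigma^-2)^-1
    mu_n    = (mu0 * tau0^-2 + sigma^-2 * sum(y)) * tau_n^2
    """
    n = len(y)
    tau_n_sq = 1.0 / (1.0 / tau0_sq + n / sigma2)
    mu_n = (mu0 / tau0_sq + sum(y) / sigma2) * tau_n_sq
    return mu_n, tau_n_sq


mu_n, tau_n_sq = normal_normal_posterior([1.0, 2.0, 3.0], sigma2=1.0, mu0=0.0, tau0_sq=1.0)
print(mu_n, tau_n_sq)  # 1.5 0.25
```

The posterior mean is a precision-weighted average of the prior mean and the data, which is why three observations with sample mean 2 pull the prior mean of 0 only as far as 1.5 here.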
$$y_i|\theta_b, \sigma^2 \sim N(x_i'\beta, \sigma^2), \quad i = 1, 2, \dots, n$$
$$\theta_k^b \sim \text{i.i.d. } N(\beta_0, \tau_0^2), \quad k = 1, 2, \dots, p_1$$
$$\theta_b|\sigma^2, y \sim F_b = \text{MVN}(\mu_n, \Lambda_n)$$
where $\beta_0$ and $\tau_0^2$ are hyperparameters (prior regression coefficient and prior variance) of normal
prior distributions for $\theta_k^b$ and
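$$y_i|\mu, \theta_b \sim N(\mu, \theta_b), \quad i = 1, 2, \dots, n$$
$$\theta_b \sim \text{InvGamma}(\alpha, \beta)$$
$$\theta_b|\mu, y \sim F_b = \text{InvGamma}\left(\alpha + n/2, \; \beta + \sum_{i=1}^{n} (y_i - \mu)^2/2\right)$$
where $\alpha$ and $\beta$ are hyperparameters (prior shape and prior scale) of an inverse-gamma prior
distribution for $\theta_b$.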
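A Gibbs update for this configuration amounts to computing the two posterior parameters and drawing from the resulting inverse-gamma distribution. The Python sketch below uses hypothetical function names and made-up data. The draw relies on the standard fact that the reciprocal of a Gamma(shape, scale 1/b) variate is InvGamma(shape, scale b). It is an illustration, not the sampler bayesmh uses internally.

```python
import random


def inv_gamma_posterior(y, mu, alpha, beta):
    """Posterior shape and scale: (alpha + n/2, beta + sum((y_i - mu)^2) / 2)."""
    n = len(y)
    sum_sq = sum((yi - mu) ** 2 for yi in y)
    return alpha + n / 2.0, beta + sum_sq / 2.0


def draw_inv_gamma(shape, scale, rng):
    """Draw from InvGamma(shape, scale) as 1 / Gamma(shape, scale=1/scale)."""
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


shape_n, scale_n = inv_gamma_posterior([1.0, 2.0, 3.0], mu=2.0, alpha=2.0, beta=1.0)
print(shape_n, scale_n)  # 3.5 2.0

rng = random.Random(2024)
variance_draw = draw_inv_gamma(shape_n, scale_n, rng)
print(variance_draw > 0.0)
```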
$$y_i|\mu, \Theta_b \sim \text{MVN}(\mu, \Theta_b), \quad i = 1, 2, \dots, n$$
$$\Theta_b \sim \text{InvWishart}(\nu, \Psi)$$
$$\Theta_b|\mu, Y \sim F_b = \text{InvWishart}\left(n + \nu, \; \Psi + \sum_{i=1}^{n} (y_i - \mu)(y_i - \mu)'\right)$$
where $\nu$ and $\Psi$ are hyperparameters (prior degrees of freedom and prior scale matrix) of an
inverse-Wishart prior distribution for $\Theta_b$.
6. Multivariate-normal–Jeffreys model: $\Theta_b$ is a $d \times d$ covariance matrix of a multivariate normal
distribution of y's with a mean vector $\mu$; mean and covariance are independent a priori,
$$y_i|\mu, \Theta_b \sim \text{MVN}(\mu, \Theta_b), \quad i = 1, 2, \dots, n$$
$$\Theta_b \sim |\Theta_b|^{-\frac{d+1}{2}} \quad \text{(multivariate Jeffreys)}$$
$$\Theta_b|\mu, Y \sim F_b = \text{InvWishart}\left(n - 1, \; \sum_{i=1}^{n} (y_i - \mu)(y_i - \mu)'\right)$$
$$\theta_b|\sigma^2 \sim \text{MVN}(\mu_0, \Lambda_0 = \sigma^2 A)$$
$$\theta_b|\sigma^2, y \sim F_b = \text{MVN}(\mu_n, \Lambda_n = \sigma^2 B)$$
where
$$\mu_n = B(X'y + A^{-1}\mu_0)$$
$$\Lambda_n = \sigma^2 B = \sigma^2 (X'X + A^{-1})^{-1}$$
where
$$\mu_n = \Lambda_n (X'y_* + \Lambda_0^{-1}\mu_0)$$
$$\Lambda_n = (X'X + \Lambda_0^{-1})^{-1}$$
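$$y_i|\theta_b, \sigma^2 \sim N(x_i'\beta, \sigma^2), \quad i = 1, 2, \dots, n$$
$$\theta_k^b \sim \text{i.i.d. Laplace}(\mu_0, \lambda_0), \quad k = 1, 2, \dots, p_1$$
$$\theta_b|\tau_1^{-2}, \dots, \tau_{p_1}^{-2}, \sigma^2, y \sim F_b = \text{MVN}(\mu_n, \Lambda_n)$$
where
$$\mu_n = \Lambda_n (\mu_0 T_{p_1} + \sigma^{-2} X_b' y)$$
$$\Lambda_n = (\text{diag}(T_{p_1}) + \sigma^{-2} X_b' X_b)^{-1}$$
$$T_{p_1} = (\tau_1^{-2}, \dots, \tau_{p_1}^{-2})'$$
and the conditional distributions of the auxiliary parameters are independent inverse-Gaussian
distributions,
$$\tau_k^{-2}|\theta_k^b \sim \text{InvGaussian}(\lambda_0^{-1}|\theta_k^b - \mu_0|^{-1}, \lambda_0^{-2})$$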
Prior-hyperprior configurations
Suppose that a prior distribution of a parameter of interest $\theta$ has hyperparameters $\theta_h$ for which a
prior distribution is specified. We refer to the latter prior distribution as a hyperprior. You can also
request Gibbs sampling for the following prior-hyperprior combinations.
We use $\theta_h^b$ and $\boldsymbol{\theta}_h^b$ to refer to the hyperparameters from the block b.
1. Normal–normal model: $\theta_h^b$ is a mean of a normal prior distribution of $\theta$ with a variance $\sigma_h^2$;
mean and variance are independent a priori,
where $\mu_0$ and $\tau_0^2$ are the prior mean and prior variance of a normal hyperprior distribution for $\theta_h^b$
and
$$\mu_1 = (\mu_0 \tau_0^{-2} + \theta \sigma_h^{-2}) \tau_1^2$$
where $\alpha$ and $\beta$ are the prior shape and prior scale, respectively, of an inverse-gamma hyperprior
distribution for $\theta_h^b$.
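$$\theta|\theta_h^b \sim \text{Bernoulli}(\theta_h^b)$$
$$\theta_h^b \sim \text{Beta}(\alpha, \beta)$$
$$\theta_h^b|\theta \sim F_b = \text{Beta}(\alpha + \theta, \beta + 1 - \theta)$$
where $\alpha$ and $\beta$ are the prior shape and prior scale, respectively, of a beta hyperprior distribution
for $\theta_h^b$.
4. Poisson–gamma model: $\theta_h^b$ is a mean of a Poisson prior distribution of $\theta$,
$$\theta|\theta_h^b \sim \text{Poisson}(\theta_h^b)$$
$$\theta_h^b \sim \text{Gamma}(\alpha, \beta)$$
$$\theta_h^b|\theta \sim F_b = \text{Gamma}(\alpha + \theta, \beta/(\beta + 1))$$
where $\alpha$ and $\beta$ are the prior shape and prior scale, respectively, of a gamma hyperprior distribution
for $\theta_h^b$.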
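The Bernoulli–beta and Poisson–gamma updates above reduce to arithmetic on the hyperprior parameters followed by a single draw. The Python sketch below shows both; the function names and the numeric inputs are hypothetical, and the gamma distribution is parameterized by shape and scale, matching the "prior shape and prior scale" wording above.

```python
import random


def bernoulli_beta_posterior(theta, alpha, beta):
    """Full conditional Beta(alpha + theta, beta + 1 - theta) for theta in {0, 1}."""
    return alpha + theta, beta + 1 - theta


def poisson_gamma_posterior(theta, alpha, beta):
    """Full conditional Gamma(shape = alpha + theta, scale = beta / (beta + 1))."""
    return alpha + theta, beta / (beta + 1.0)


rng = random.Random(7)

a, b = bernoulli_beta_posterior(theta=1, alpha=2.0, beta=3.0)
print(a, b)                                   # 3.0 3.0
prob_draw = rng.betavariate(a, b)             # one Gibbs draw of the Bernoulli mean

shape, scale = poisson_gamma_posterior(theta=3, alpha=2.0, beta=1.0)
print(shape, scale)                           # 5.0 0.5
rate_draw = rng.gammavariate(shape, scale)    # one Gibbs draw of the Poisson mean
print(0.0 < prob_draw < 1.0, rate_draw > 0.0)
```

In the Poisson–gamma case the scale β/(β + 1) is the reciprocal of the updated rate 1/β + 1, which is the usual conjugate update for a single Poisson observation.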
5. Multivariate-normal–multivariate-normal model: $\boldsymbol{\theta}_h^b$ is a mean vector of a multivariate normal
prior distribution of $\theta$ with a $d \times d$ covariance matrix $\Sigma_h$; mean and covariance are independent
a priori,
$$\theta|\boldsymbol{\theta}_h^b, \Sigma_h \sim \text{MVN}(\boldsymbol{\theta}_h^b, \Sigma_h)$$
$$\boldsymbol{\theta}_h^b \sim \text{MVN}(\mu_0, \Lambda_0)$$
$$\boldsymbol{\theta}_h^b|\Sigma_h, \theta \sim F_b = \text{MVN}(\mu_1, \Lambda_1)$$
where $\mu_0$ and $\Lambda_0$ are the prior mean vector and prior covariance of a multivariate normal hyperprior
distribution for $\boldsymbol{\theta}_h^b$ and
$$\mu_1 = \Lambda_1 \Lambda_0^{-1} \mu_0 + \Lambda_1 \Sigma_h^{-1} \theta$$
$$\Lambda_1 = (\Lambda_0^{-1} + \Sigma_h^{-1})^{-1}$$
where $\nu$ and $\Psi$ are the prior degrees of freedom and prior scale matrix of an inverse-Wishart
hyperprior distribution for $\Theta_h^b$.
7. Normal–Laplace model (StataNow): $\theta_h^b$ is a mean of a normal prior distribution of $\theta$ with a
variance $\sigma_h^2$; mean and variance are independent a priori,
where $\mu_0$ and $\lambda_0$ are the location and scale of a Laplace hyperprior distribution for $\theta_h^b$. The Gibbs
sampling of $\theta_h^b$ employs an auxiliary parameter $\tau^{-2}$. The conditional posterior distribution for $\theta_h^b$
with respect to $\tau^{-2}$ and $\theta$ is
where
$$\mu_1 = (\mu_0 \tau^{-2} + \theta \sigma_h^{-2}) \sigma_1^2$$
$$\tau^{-2}|\theta_h^b \sim \text{InvGaussian}(\lambda_0^{-1}|\theta_h^b - \mu_0|^{-1}, \lambda_0^{-2})$$
Marginal likelihood
The marginal likelihood is defined as
$$m(y) = \int p(y|\theta)\pi(\theta)\,d\theta$$
where $p(y|\theta)$ is the probability density of data $y$ given $\theta$ and $\pi(\theta)$ is the density of the prior
distribution for $\theta$.
Marginal likelihood m(y), being the denominator term in the posterior distribution, has a major
role in Bayesian analysis. It is sometimes referred to as “model evidence”, and it is used as a
goodness-of-fit criterion. For example, marginal likelihoods are used in calculating Bayes factors for
the purpose of model comparison; see Methods and formulas in [BAYES] bayesstats ic.
The simplest approximation to $m(y)$ is provided by the Monte Carlo integration,
$$\widehat{m}_p = \frac{1}{M} \sum_{s=1}^{M} p(y|\theta_s)$$
where $\{\theta_s\}_{s=1}^{M}$ is an independent sample from the prior distribution $\pi(\theta)$. This estimation is very
inefficient, however, because of the high variance of the likelihood function. MCMC samples are not
independent and cannot be used directly for calculating $\widehat{m}_p$.
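The prior-sampling estimator can be tried on a toy model where the marginal likelihood is known in closed form. The Python sketch below assumes a single observation y ~ N(θ, σ²) with prior θ ~ N(μ0, τ0²), for which m(y) is the N(μ0, σ² + τ0²) density at y. The model, function names, seed, and sample size are all choices made for this illustration.

```python
import math
import random


def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mc_marginal_likelihood(y, prior_mean, prior_var, sigma2, n_draws, rng):
    """Average the likelihood p(y | theta_s) over independent prior draws theta_s."""
    total = 0.0
    for _ in range(n_draws):
        theta = rng.gauss(prior_mean, math.sqrt(prior_var))
        likelihood = 1.0
        for yi in y:
            likelihood *= normal_pdf(yi, theta, sigma2)
        total += likelihood
    return total / n_draws


rng = random.Random(12345)
estimate = mc_marginal_likelihood([0.5], 0.0, 1.0, 1.0, n_draws=20000, rng=rng)
exact = normal_pdf(0.5, 0.0, 2.0)   # closed form for one observation
print(abs(estimate - exact) < 0.01)
```

The estimator behaves well here only because there is a single observation and the prior overlaps the likelihood; with many observations the likelihood concentrates, most prior draws contribute almost nothing, and the variance of the average grows sharply.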
An improved estimation of the marginal likelihood can be obtained by using importance sampling.
For a sample $\{\theta_t\}_{t=1}^{T}$, not necessarily independent, from the posterior distribution, the harmonic
mean of the likelihood values,
$$\widehat{m}_h = \left\{ \frac{1}{T} \sum_{t=1}^{T} p(y|\theta_t)^{-1} \right\}^{-1}$$
$$\widehat{m}_l = (2\pi)^{p/2} |-\widetilde{H}|^{-1/2} p(y|\widetilde{\theta})\pi(\widetilde{\theta})$$
where $p$ is the number of parameters (or dimension of $\theta$), $\widetilde{\theta}$ is the posterior mode, and $\widetilde{H}$ is the
Hessian matrix of $l(\theta) = p(y|\theta)\pi(\theta)$ calculated at the mode $\widetilde{\theta}$.
Using the fact that the posterior sample covariance matrix, which we denote as $\widehat{\Sigma}$, is
asymptotically equal to $(-\widetilde{H})^{-1}$, Raftery (1996) proposed what he called the Laplace–Metropolis estimator
(implemented by bayesmh):
$$\widehat{m}_{lm} = (2\pi)^{p/2} |\widehat{\Sigma}|^{1/2} p(y|\widetilde{\theta})\pi(\widetilde{\theta})$$
Raftery (1996) recommends that a robust and consistent estimator be used for the posterior covariance
matrix.
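The Laplace–Metropolis formula can be evaluated on the log scale once the determinant of the posterior covariance and the log likelihood and log prior at the mode are available. The Python sketch below is an illustration under stated assumptions: the function name is hypothetical, and the example uses a one-observation normal model with a normal prior, where the exact posterior variance and mode are plugged in instead of estimates from an MCMC sample. In that conjugate case the posterior is exactly normal, so the approximation reproduces the exact marginal likelihood.

```python
import math


def log_laplace_metropolis(p, det_sigma_hat, log_lik_at_mode, log_prior_at_mode):
    """log m_lm = (p/2) log(2 pi) + (1/2) log|Sigma_hat| + log p(y|mode) + log pi(mode)."""
    return (
        0.5 * p * math.log(2.0 * math.pi)
        + 0.5 * math.log(det_sigma_hat)
        + log_lik_at_mode
        + log_prior_at_mode
    )


def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


# Toy model: y ~ N(theta, 1) with prior theta ~ N(0, 1) and one observation y = 0.5.
y = 0.5
post_var = 1.0 / (1.0 + 1.0)      # exact posterior variance, 0.5
post_mode = y * post_var          # exact posterior mean and mode, 0.25

log_m_lm = log_laplace_metropolis(
    p=1,
    det_sigma_hat=post_var,
    log_lik_at_mode=log_normal_pdf(y, post_mode, 1.0),
    log_prior_at_mode=log_normal_pdf(post_mode, 0.0, 1.0),
)
log_m_exact = log_normal_pdf(y, 0.0, 2.0)   # marginally, y ~ N(0, 2)
print(abs(log_m_lm - log_m_exact) < 1e-12)
```

Working with logarithms avoids underflow when the likelihood at the mode is very small, which is the typical situation with more than a handful of observations.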
Estimation of the log marginal-likelihood becomes unstable for high-dimensional models such as
multilevel models and may result in a missing value.
With multiple chains, an average of the log-marginal-likelihood values over the chains is reported.
Nicholas Constantine Metropolis (1915–1999) was born in Chicago. He completed his PhD in
experimental physics at the University of Chicago in 1941. In 1943, Metropolis moved to Los
Alamos, where he spent much of his time working on computers and computational algorithms.
He first worked with analog and then IBM punch card machines. Beginning in 1948, he helped
design the MANIAC I computer, one of the first digital computers. He later oversaw the construction
of the MANIAC II and MANIAC III. He collaborated with Stanislaw Ulam to develop the Monte
Carlo method, and he coauthored a paper in 1953 introducing the Monte Carlo algorithm. The
algorithm would later be extended to general cases by W. K. Hastings and would be known as
the Metropolis–Hastings algorithm. In 1957, Metropolis returned to the University of Chicago,
where he taught physics and helped found the Institute for Computer Research.
The American Physical Society elected Metropolis as a fellow in 1953 and created an award in his
honor that recognizes extraordinary work in computational physics. Also, in 1984, the Institute
of Electrical and Electronics Engineers (IEEE) awarded him the Computer Pioneer Award. In his
late 70s, Metropolis appeared in a Woody Allen film, portraying a scientist.
Wilfred Keith Hastings (1930–2016) was born in Toronto, Ontario, Canada. He studied applied
mathematics at the University of Toronto, obtaining his bachelor's degree in 1953 and later working
as a computer applications consultant. In this position, he was exposed to statistics and gained
experience with simulations. In 1962, he obtained his PhD, also from the University of Toronto.
His dissertation was on fiducial distributions, but after attending a statistics conference, he learned
that people were abandoning the study of fiducial probability. Shortly after graduation, he joined
the faculty at the University of Canterbury for two years and then worked at the research company
Bell Labs for two years as well. In 1966, he became an associate professor at his alma mater,
and three years later he published his work on the Markov chain Monte Carlo (MCMC) method.
His publication on Monte Carlo sampling methods was an extension of the algorithm introduced
in the 1953 publication by Nicholas Metropolis et al. The idea originated from his interactions
and consultations with the chemistry department’s application of the Metropolis algorithm to
estimating the energy of particles. Hastings’s publication was cited over 2,000 times and gave
rise to the Metropolis–Hastings algorithm. After this publication, Hastings served as a professor
at the University of Victoria for 21 years and conducted research with multiple grants from the
Natural Sciences and Engineering Research Council of Canada (NSERC).
Harold Jeffreys (1891–1989) was born near Durham, England, and spent more than 75 years
studying and working at the University of Cambridge, principally on theoretical and observational
problems in geophysics, astronomy, mathematics, and statistics. He developed a systematic
Bayesian approach to inference in his monograph Theory of Probability.
References
Andrieu, C., and É. Moulines. 2006. On the ergodicity properties of some adaptive MCMC algorithms. Annals of
Applied Probability 16: 1462–1505. https://doi.org/10.1214/105051606000000286.
Andrieu, C., and J. Thoms. 2008. A tutorial on adaptive MCMC. Statistics and Computing 18: 343–373.
https://doi.org/10.1007/s11222-008-9110-y.
Atchadé, Y. F., and J. S. Rosenthal. 2005. On adaptive Markov chain Monte Carlo algorithms. Bernoulli 11: 815–828.
https://doi.org/10.3150/bj/1130077595.
Balov, N. 2016a. Bayesian binary item response theory models using bayesmh. The Stata Blog: Not Elsewhere
Classified. https://blog.stata.com/2016/01/18/bayesian-binary-item-response-theory-models-using-bayesmh/.
Balov, N. 2016b. Fitting distributions using bayesmh. The Stata Blog: Not Elsewhere Classified.
https://blog.stata.com/2016/03/30/fitting-distributions-using-bayesmh/.
Balov, N. 2016c. Gelman–Rubin convergence diagnostic using multiple chains. The Stata Blog: Not Elsewhere Classified.
https://blog.stata.com/2016/05/26/gelman-rubin-convergence-diagnostic-using-multiple-chains/.
Balov, N. 2020. Bayesian inference using multiple Markov chains. The Stata Blog: Not Elsewhere Classified.
https://blog.stata.com/2020/02/24/bayesian-inference-using-multiple-markov-chains/.
Balov, N. 2022. Bayesian threshold autoregressive models. The Stata Blog: Not Elsewhere Classified.
https://blog.stata.com/2022/05/18/bayesian-threshold-autoregressive-models/.
Birnbaum, A. 1968. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories
of Mental Test Scores, ed. F. M. Lord and M. R. Novick, 395–479. Reading, MA: Addison–Wesley.
Carlin, B. P., A. E. Gelfand, and A. F. M. Smith. 1992. Hierarchical Bayesian analysis of changepoint problems.
Journal of the Royal Statistical Society, Series C 41: 389–405. https://doi.org/10.2307/2347570.
Carlin, J. B. 1992. Meta-analysis for 2×2 tables: A Bayesian approach. Statistics in Medicine 11: 141–158.
https://doi.org/10.1002/sim.4780110202.
De Boeck, P., and M. Wilson, ed. 2004. Explanatory Item Response Models: A Generalized Linear and Nonlinear
Approach. New York: Springer.
Diggle, P. J., P. J. Heagerty, K.-Y. Liang, and S. L. Zeger. 2002. Analysis of Longitudinal Data. 2nd ed. Oxford:
Oxford University Press.
Gelfand, A. E., S. E. Hills, A. Racine-Poon, and A. F. M. Smith. 1990. Illustration of Bayesian inference
in normal data models using Gibbs sampling. Journal of the American Statistical Association 85: 972–985.
https://doi.org/10.2307/2289594.
Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. 2014. Bayesian Data Analysis.
3rd ed. Boca Raton, FL: Chapman and Hall/CRC.
Gelman, A., W. R. Gilks, and G. O. Roberts. 1997. Weak convergence and optimal scaling of random walk Metropolis
algorithms. Annals of Applied Probability 7: 110–120. https://doi.org/10.1214/aoap/1034625254.
Geweke, J. 1989. Bayesian inference in econometric models using Monte Carlo integration. Econometrica 57:
1317–1339. https://doi.org/10.2307/1913710.
Geyer, C. J. 2011. Introduction to Markov chain Monte Carlo. In Handbook of Markov Chain Monte Carlo, ed. S. P.
Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, 3–48. Boca Raton, FL: Chapman and Hall/CRC.
Giordani, P., and R. J. Kohn. 2010. Adaptive independent Metropolis–Hastings by fast estimation of mixtures of
normals. Journal of Computational and Graphical Statistics 19: 243–259. https://doi.org/10.1198/jcgs.2009.07174.
Grant, R. L., B. Carpenter, D. C. Furr, and A. Gelman. 2017a. Introducing the StataStan interface for fast, complex
Bayesian modeling using Stan. Stata Journal 17: 330–342.
Grant, R. L., B. Carpenter, D. C. Furr, and A. Gelman. 2017b. Fitting Bayesian item response models in Stata and Stan. Stata Journal 17: 343–357.
Haario, H., E. Saksman, and J. Tamminen. 2001. An adaptive Metropolis algorithm. Bernoulli 7: 223–242.
https://doi.org/10.2307/3318737.
Hand, D. J., and M. J. Crowder. 1996. Practical Longitudinal Data Analysis. Boca Raton, FL: Chapman and Hall.
Hoff, P. D. 2009. A First Course in Bayesian Statistical Methods. New York: Springer.
Huber, C. 2016a. Introduction to Bayesian statistics, part 1: The basic concepts. The Stata Blog: Not Elsewhere
Classified. https://blog.stata.com/2016/11/01/introduction-to-bayesian-statistics-part-1-the-basic-concepts/.
Huber, C. 2016b. Introduction to Bayesian statistics, part 2: MCMC and the Metropolis–Hastings algorithm. The Stata Blog:
Not Elsewhere Classified.
https://blog.stata.com/2016/11/15/introduction-to-bayesian-statistics-part-2-mcmc-and-the-metropolis-hastings-algorithm/.
Huq, N. M., and J. Cleland. 1990. Bangladesh Fertility Survey 1989 (Main Report). National Institute of Population
Research and Training.
Jarrett, R. G. 1979. A note on the intervals between coal-mining disasters. Biometrika 66: 191–193.
https://doi.org/10.2307/2335266.
Jeffreys, H. 1946. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society
of London, A ser., 186: 453–461. https://doi.org/10.1098/rspa.1946.0056.
Lichman, M. 2013. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml.
Maas, B., W. R. Garnett, I. M. Pellock, and T. J. Comstock. 1987. A comparative bioavailability study of Carbamazepine
tablets and chewable formulation. Therapeutic Drug Monitoring 9: 28–33.
https://doi.org/10.1097/00007691-198703000-00006.
Maguire, B. A., E. S. Pearson, and A. H. A. Wynn. 1952. The time intervals between industrial accidents. Biometrika
39: 168–180. https://doi.org/10.2307/2332475.
Marchenko, Y. V. 2015. Bayesian modeling: Beyond Stata’s built-in models. The Stata Blog: Not Elsewhere Classified.
https://blog.stata.com/2015/05/26/bayesian-modeling-beyond-statas-built-in-models/.
Raftery, A. E. 1996. Hypothesis testing and model selection. In Markov Chain Monte Carlo in Practice, ed. W. R.
Gilks, S. Richardson, and D. J. Spiegelhalter, 163–187. Boca Raton, FL: Chapman and Hall.
Raftery, A. E., and V. E. Akman. 1986. Bayesian analysis of a Poisson process with a change-point. Biometrika 73:
85–89. https://doi.org/10.2307/2336274.
Rasch, G. 1960. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute of
Educational Research.
Roberts, G. O., and J. S. Rosenthal. 2001. Optimal scaling for various Metropolis–Hastings algorithms. Statistical
Science 16: 351–367. https://doi.org/10.1214/ss/1015346320.
Roberts, G. O., and J. S. Rosenthal. 2007. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability
44: 458–475. https://doi.org/10.1239/jap/1183667414.
Roberts, G. O., and J. S. Rosenthal. 2009. Examples of adaptive MCMC. Journal of Computational and Graphical Statistics 18: 349–367.
https://doi.org/10.1198/jcgs.2009.06134.
Ruppert, D., M. P. Wand, and R. J. Carroll. 2003. Semiparametric Regression. Cambridge: Cambridge University
Press.
Thomas, A., B. O’Hara, U. Ligges, and S. Sturtz. 2006. Making BUGS Open. R News 6: 12–17.
Thompson, J. 2014. Bayesian Analysis with Stata. College Station, TX: Stata Press.
Yusuf, S., R. Simon, and S. S. Ellenberg. 1987. Proceedings of the workshop on methodological issues in overviews
of randomized clinical trials, May 1986. In Statistics in Medicine, vol. 6.
Zellner, A. 1986. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In
Vol. 6 of Bayesian Inference and Decision Techniques: Essays in Honor of Bruno De Finetti (Studies in Bayesian
Econometrics and Statistics), ed. P. K. Goel and A. Zellner, 233–343. Amsterdam: North-Holland.
Zellner, A., and N. S. Revankar. 1969. Generalized production functions. Review of Economic Studies 36: 241–250.
https://doi.org/10.2307/2296840.
Also see
[BAYES] Bayesian postestimation — Postestimation tools after Bayesian estimation
[BAYES] bayesmh evaluators — User-defined evaluators with bayesmh
[BAYES] bayes — Bayesian regression models using the bayes prefix+
[BAYES] bayesselect — Bayesian variable selection for linear regression+
[BAYES] Bayesian commands — Introduction to commands for Bayesian analysis
[BAYES] Bayesian estimation — Bayesian estimation commands
[BAYES] Intro — Introduction to Bayesian analysis
[BAYES] Glossary
[BMA] bmaregress — Bayesian model averaging for linear regression
Stata, Stata Press, and Mata are registered trademarks of StataCorp LLC. Stata and
Stata Press are registered trademarks with the World Intellectual Property Organization
of the United Nations. StataNow and NetCourseNow are trademarks of StataCorp
LLC. Other brand and product names are registered trademarks or trademarks of their
respective companies. Copyright © 1985–2023 StataCorp LLC, College Station, TX,
USA. All rights reserved.
For suggested citations, see the FAQ on citing Stata documentation.