Bayesian Modelling With Stan

DOI 10.3758/s13428-016-0746-9

Abstract  When evaluating cognitive models based on fits to observed data (or, really, any model that has free parameters), parameter estimation is critically important. Traditional techniques like hill climbing by minimizing or maximizing a fit statistic often result in point estimates. Bayesian approaches instead estimate parameters as posterior probability distributions, and thus naturally account for the uncertainty associated with parameter estimation; Bayesian approaches also offer powerful and principled methods for model comparison. Although software applications such as WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, Statistics and Computing, 10, 325–337, 2000) and JAGS (Plummer, 2003) provide "turnkey"-style packages for Bayesian inference, they can be inefficient when dealing with models whose parameters are correlated, which is often the case for cognitive models, and they can impose significant technical barriers to adding custom distributions, which is often necessary when implementing cognitive models within a Bayesian framework. A recently developed software package called Stan (Stan Development Team, 2015) can solve both problems, as well as provide a turnkey solution to Bayesian inference. We present a tutorial on how to use Stan and how to add custom distributions to it, with an example using the linear ballistic accumulator model (Brown & Heathcote, Cognitive Psychology, 57, 153–178. doi:10.1016/j.cogpsych.2007.12.002, 2008).

Keywords  Bayesian inference . Stan . Linear ballistic accumulator . Probabilistic programming

Electronic supplementary material  The online version of this article (doi:10.3758/s13428-016-0746-9) contains supplementary material, which is available to authorized users.

* Jeffrey Annis
[email protected]

1 Vanderbilt University, 111 21st Ave S., 301 Wilson Hall, Nashville, TN 37240, USA

The development and application of formal cognitive models in psychology has played a crucial role in theory development. Consider, for example, the near ubiquitous applications of accumulator models of decision making, such as the diffusion model (see Ratcliff & McKoon, 2008, for a review) and the linear ballistic accumulator model (LBA; Brown & Heathcote, 2008). These models have provided theoretical understanding of such constructs as aging and intelligence (e.g., Ratcliff, Thapar, & McKoon, 2010) and have been used to understand and interpret data from functional magnetic resonance imaging (Turner, Van Maanen, & Forstmann, 2015; Van Maanen et al., 2011), electroencephalography (Ratcliff, Philiastides, & Sajda, 2009), and neurophysiology (Palmeri, Schall, & Logan, 2015; Purcell, Schall, Logan, & Palmeri, 2012). Nearly all cognitive models have free parameters. In the case of accumulator models, these include the rate of evidence accumulation, the threshold level of evidence required to make a response, and the time for mental processes not involved in making the decision. Unlike general statistical models of observed data, the parameters of cognitive models usually have well-defined psychological interpretations. This makes it particularly important that the parameters be estimated properly, including not just their most likely value, but also the uncertainty in their estimation.

Traditional methods of parameter estimation minimize or maximize a fit statistic (e.g., SSE, χ2, ln L) using various hill-climbing methods (e.g., simplex or Hooke and Jeeves). The result is usually point estimates of parameter values, possibly followed by techniques such as parametric or nonparametric bootstrapping to obtain indices of the uncertainty of those estimates (Lewandowsky & Farrell, 2011). By contrast,
Bayesian approaches to parameter estimation naturally treat model parameters as full probability distributions (Gelman, Carlin, Stern, Dunson, Vehtari, & Rubin, 2013; Kruschke, 2011; Lee & Wagenmakers, 2014). By so doing, the uncertainty over the range of potential parameter values is also estimated, rather than a single point estimate.

Whereas a traditional parameter estimation method might find some vector of parameters, θ, that maximizes the likelihood of the data, D, given those parameters [P(D | θ)], a Bayesian method will find the entire posterior probability distribution of the parameters given the data, P(θ | D), by a conceptually straightforward application of Bayes's rule: P(θ | D) = P(D | θ) P(θ) / P(D). A virtue—though some argue it is a curse—of Bayesian methods is that they allow the researcher to express their a priori beliefs (or lack thereof) about the parameter values, as a prior distribution P(θ). If a researcher thinks all values are equally likely, they might choose a uniform or otherwise flat distribution to represent that belief; alternatively, if a researcher has reason to believe that some parameters might be more likely than others, that knowledge can be embodied in the prior as well. Bayes provides the mechanism to combine the prior on parameter values, P(θ), with the likelihood of the data given certain parameter values, P(D | θ), resulting in the posterior distribution of the parameters given the data, P(θ | D).

Bayes is completely generic. It could be used with a model having one parameter or one having dozens or hundreds of parameters. It could rely on a likelihood based on a well-known probability distribution, like a normal or a Gamma distribution, or it could rely on a likelihood of response times predicted by a cognitive model like the LBA.

Although the application of Bayes is conceptually straightforward, its application to real data and real models is anything but. For one thing, the calculation of the probability of the data term in the denominator, P(D), involves a multivariate integral, which can be next to impossible to solve using traditional techniques for all but the simplest models. For small models with only one or two parameters, the posterior distribution can sometimes be calculated directly using calculus or can be reasonably estimated using numerical methods. However, as the number of parameters in the model grows, direct mathematical solutions using calculus become scarce, and traditional numerical methods quickly become intractable. For more sophisticated models, a technique called Markov chain Monte Carlo (MCMC) was developed (Brooks, Gelman, Jones, & Meng, 2011; Gelman et al., 2013; Gilks, Richardson, & Spiegelhalter, 1996; Robert & Casella, 2004). MCMC is a class of algorithms that utilize Markov chains to allow one to approximate the posterior distribution. In short, a given MCMC algorithm can take the prior distribution and likelihood as input and generate random samples from the posterior distribution without having to have a closed-form solution or numeric estimate for the desired posterior distribution.

The first MCMC algorithm was the Metropolis–Hastings algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953; Hastings, 1970), and it is still popular today as a default MCMC method. On each step of the algorithm a proposal sample is generated. If the proposal sample has a higher probability than the current sample, then the proposal is accepted as the next sample; otherwise, the acceptance rate is dependent upon the ratio of the posterior probabilities of the proposal sample and the current sample. The "magic" of MCMC algorithms like Metropolis–Hastings is that they do not require calculating the nasty P(D) term in Bayes's rule. Instead, by relying on ratios of the posterior probabilities, the P(D) term cancels out, so the decision to accept or reject a new sample is based solely on the prior and the likelihood, which are given. The proposal step is generated via a random process that must be "tuned" so that the algorithm efficiently samples from the posterior. If the proposals are wildly different from or too similar to the current sample, the sampling process can become very inefficient. Poorly tuned MCMC algorithms can lead to samples that fail to meet minimum standards for approximating the posterior distribution or to Markov chain lengths that become computationally intractable on even the most powerful computer workstations.

A different type of MCMC algorithm that largely does away with the difficulty of sampler tuning is Gibbs sampling. Several software applications have been built around this algorithm (WinBUGS: Lunn, Thomas, Best, & Spiegelhalter, 2000; JAGS: Plummer, 2003; OpenBUGS: Thomas, O'Hara, Ligges, & Sturtz, 2006). These applications allow the user to easily define their model in a specification language and then generate the posterior distributions for the respective model parameters. The fact that the Gibbs sampler does not require tuning makes these applications effectively "turnkey" methods for Bayesian inference. These applications can be used for a wide variety of problems and include a number of built-in distributions; one must only specify, for example, that the prior on a certain parameter is distributed uniformly or that the likelihood of the data given the parameter is normally distributed. Although these programs provide dozens of built-in distributions, researchers inevitably will discover that some particular distribution they are interested in is not built into the application. This will often be the case for specialized cognitive models whose distributions are not part of the standard suites of built-in distributions that come with these applications. Thus, it is necessary for the researcher who wishes to use one of these Bayesian inference applications to add a custom distribution to the application's distribution library. This process, however, can be technically challenging using most of the applications listed above (see Wabersich & Vandekerckhove, 2014, for a recent tutorial with JAGS).
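To make the proposal-and-accept cycle concrete, here is a minimal R sketch of a Metropolis–Hastings sampler for a single rate parameter with an exponential likelihood and a Gamma(1, 1) prior. It is purely illustrative and is not the authors' own benchmark implementation; the simulated data, the starting value, and the proposal standard deviation are arbitrary choices.

# Illustrative Metropolis-Hastings sampler for the rate of an exponential
# model with a Gamma(1, 1) prior (R)
set.seed(1)
y <- rexp(500, rate = 1)                    # simulated data
log_post <- function(lambda) {              # log prior + log likelihood
  if (lambda <= 0) return(-Inf)             # respect the lower bound on lambda
  dgamma(lambda, 1, 1, log = TRUE) + sum(dexp(y, lambda, log = TRUE))
}
n_iter <- 5000
chain <- numeric(n_iter)
chain[1] <- 0.5                             # arbitrary starting value
for (i in 2:n_iter) {
  proposal <- rnorm(1, chain[i - 1], 0.05)  # proposal SD is the tuning parameter
  # Accept with probability given by the ratio of unnormalized posteriors;
  # the P(D) term cancels, so only prior and likelihood are needed.
  if (log(runif(1)) < log_post(proposal) - log_post(chain[i - 1])) {
    chain[i] <- proposal
  } else {
    chain[i] <- chain[i - 1]
  }
}
mean(chain)                                 # should be close to the generating value of 1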
In addition to the technical challenges of adding custom distributions, both the Gibbs and Metropolis–Hastings algorithms often do not sample efficiently from posterior distributions with correlated parameters (Hoffman & Gelman, 2014; Turner, Sederberg, Brown, & Steyvers, 2013). Some MCMC algorithms (e.g., MCMC-DE; Turner et al., 2013) are designed to solve this problem, but these algorithms often require careful tuning of the MCMC algorithm parameters to ensure efficient sampling of the posterior. In addition, implementing models that use such algorithms can be more difficult than implementing models in turnkey applications, because the user must work at the implementation level of the MCMC algorithm.

Recently, a new type of MCMC application has emerged, called Stan (Hoffman & Gelman, 2014; Stan Development Team, 2015). Stan uses the No-U-Turn Sampler (NUTS; Hoffman & Gelman, 2014), which extends a type of MCMC algorithm known as Hamiltonian Monte Carlo (HMC; Duane, Kennedy, Pendleton, & Roweth, 1987; Neal, 2011). NUTS requires no tuning parameters and can efficiently sample from posterior distributions with correlated parameters. It is therefore an example of a turnkey Bayesian inference application that allows the user to work at the level of the model without having to worry about the implementation of the MCMC algorithm itself. In this article, we provide a brief tutorial on how to use the Stan modeling language to implement Bayesian models in Stan using both built-in and user-defined distributions; we do assume that readers have some prior programming experience and some knowledge of probability theory.

Our first example is the exponential distribution. The exponential is built into the Stan application. We will first define the model statistically, and then outline how to implement a Bayesian model based on the exponential distribution using Stan. Following this implementation, we will show how to run the model and collect samples from the posterior. As we will see, one way that this can be done is by interfacing with Stan via another programming language, such as R. The command to run the Stan model is sent from R, and then the samples are sent back to the R workspace for further analysis.

Our second example will again consider the exponential distribution, but this time instead of using the built-in exponential distribution, we will explicitly define the likelihood function of the exponential distribution using the tools and techniques that allow the user to define distributions in Stan.

Our third example will illustrate how to implement a more complicated user-defined function in Stan—the LBA model (Brown & Heathcote, 2008). We will then show how to extend this model to situations with multiple participants and multiple conditions.

Throughout the tutorial, we will benchmark the results from Stan against a conventional Metropolis–Hastings algorithm. As we will see, Stan performs as well as Metropolis–Hastings for the simple exponential model, but much better for more complex models with correlated dimensions, such as the LBA. We are quite certain that suitably tuned versions of MCMC-DE (Turner et al., 2013) and other more sophisticated methods would perform at least as well as Stan. The goal here was not to make fine distinctions between alternative successful applications, but to illustrate how to use Stan as an application that perhaps may be adopted more easily by some researchers.

Built-in distributions in Stan

In Stan, a Bayesian model is implemented by defining its likelihood and priors. This is accomplished in a Stan program with a set of variable declarations and program statements that are displayed in this article using Courier font. Stan supports a range of standard variable types, including integers, real numbers, vectors, and matrices. Stan statements are processed sequentially and allow for standard control flow elements, such as for and while loops, and conditionals such as if-then and if-then-else.

Variable definitions and program statements are placed within what are referred to in Stan as code blocks. Each code block has a particular function within a Stan program. For example, there is a code block for user-defined functions, and others for data, parameters, model definitions, and generated quantities. Our tutorial will introduce each of these code blocks in turn.

To make the most out of this tutorial, it will be necessary to install both Stan (http://mc-stan.org/) and R (https://cran.r-project.org/), as well as the RStan package (http://mc-stan.org/interfaces/rstan.html) so that R can interface with Stan. Step-by-step instructions for how to do all of this can be found online (http://mc-stan.org/interfaces/rstan.html).

An example with the exponential distribution

In this section, we provide a simple example of how to use Stan to implement a Bayesian model using built-in distributions. For simplicity, we will use a one-parameter distribution: the exponential. To begin, suppose that we have some data (y) that appear to be exponentially distributed. We can write this in more formal terms with the following definition:

y ~ Exponential(λ).   (1)

This asserts that the data points (y) are assumed to come from an exponential distribution, which has a single parameter called the rate parameter, λ. Using traditional parameter-fitting methods, we might find the value of λ that maximized the likelihood of observed data that we thought followed an exponential. Because here we are using a Bayesian approach, we can conceive of our parameters as probability distributions. What distribution should we choose as the prior on λ? The rate parameter of the exponential distribution is bounded between zero and infinity, so we should choose a distribution with the same bounds. One distribution that fits this criterion is the Gamma distribution. Formally, then, we can write

λ ~ Gamma(α, β).   (2)
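As an aside (a standard conjugacy result, not part of the original tutorial), this particular combination of likelihood and prior has a closed-form posterior, which provides a convenient check on any MCMC output: P(λ | y) ∝ λ^(α+N−1) e^(−(β+Σy)λ), so that λ | y ~ Gamma(α + N, β + Σy), where N is the number of data points and Σy their sum.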
The Gamma distribution has two parameters, referred to as the shape (α) and rate (β) parameters; to represent our prior beliefs about what parameter values are more likely than others, we chose weakly informative shape and rate parameters of 1. So, Eq. 1 specifies our likelihood, and Eq. 2 specifies our prior. This completes the specification of the Bayesian model in mathematical terms. The next section shows how to easily implement the exponential model in the Stan modeling language.

Stan code  To implement this model in Stan, we first open a new text file in any basic text editor. Note that the line numbers in the code-text boxes in this article are for reference and are not part of the actual Stan code. The body of every code block is delimited using curly braces {}; Stan programming statements are placed within these curly braces. All statements must be followed by a semicolon.

In the text file, we first declare a data block as is shown in Box 1. The data code block stores all of the to-be-modeled variables containing the user's data. In this example, we can see that the data we will pass to the Stan program are contained in a vector of size LENGTH. It is important to note that the data are not explicitly defined in the Stan program itself. Rather, the Stan program is interfaced via the command line or an alternative method (like RStan), and the data are passed to the Stan program in that way. We will describe this procedure in a later section.

Box 1  Stan code for the exponential model (in all Boxes, line numbers are included for illustration only)

The second block of code is the parameters block, in which the model parameter lambda is declared. Because the rate parameter is bounded between zero and infinity, we must constrain the variable within that range in Stan. We do this by adding the <lower=0> constraint as part of its definition.

The third block of code is the model block, in which the Bayesian model is defined. The model described by Eqs. 1 and 2 is easily implemented, as is shown in Box 1. First, the variables of the Gamma prior, alpha and beta, are defined as real numbers of type real, and both are assigned our chosen values of 1.0. Note that unlike the variables in the data and parameters blocks, variables defined in the model block are local variables. This means that their scope does not extend beyond the block in which they are defined; in less technical terms, other blocks do not "know" about variables initialized in the model block.

After having defined these local variables, the next part of the model block defines a sampling statement. The sampling statement lambda ~ gamma(alpha,beta) indicates that the prior on lambda is sampled from a Gamma distribution with shape and rate parameters alpha and beta, respectively. Note that sampling statements contain a tilde character (~), distinct from the assignment character (<-) in Stan. The next statement, Y ~ exponential(lambda), is also a sampling statement and indicates that the data Y are exponentially distributed with rate parameter lambda.

The final block of code in the Stan file is the generated quantities block. This block can be used, for example, to perform what is referred to as a posterior predictive check. The purpose of the check is to determine whether the model accurately fits the data in question; in other words, this lets us compare model predictions with the observed data. Box 1 shows how this is accomplished in Stan. First, a real-number variable named pred is created. This variable will contain the predictions of the model. Next, the exponential random number generator (RNG) function exponential_rng takes as input the posterior samples of lambda and outputs the posterior prediction. The posterior prediction is returned from Stan and can be used outside of Stan—for example, to compare the predictions to the actual data in order to assess visually how well the model fits the data.

This completes the Stan model. When all the code has been entered into the text file, we save the file as exponential.stan. Stan requires that the extension .stan be used for all Stan model files.
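The listing in Box 1 itself is not reproduced above. The following is a minimal sketch, consistent with the description just given, of what such an exponential.stan file looks like; the line numbering and exact declarations in the original Box 1 may differ.

data {
  int<lower=1> LENGTH;          // number of data points
  vector[LENGTH] Y;             // the to-be-modeled data
}
parameters {
  real<lower=0> lambda;         // exponential rate, constrained to (0, infinity)
}
model {
  real alpha;                   // local variables holding the Gamma prior's
  real beta;                    // shape and rate
  alpha <- 1.0;
  beta <- 1.0;
  lambda ~ gamma(alpha, beta);  // prior
  Y ~ exponential(lambda);      // likelihood
}
generated quantities {
  real pred;                    // posterior predictive sample
  pred <- exponential_rng(lambda);
}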
In this section, we will be first simulating data and then fitting the model to those simulated data. This is in contrast to most real-world applications, in which models are fit to the actual observed data from an experiment. It is good practice before fitting a model to real data to fit the model to simulated data with known parameter values and try to recover those values. If the model cannot recover the known parameter values of the model that generated the simulated data, then it will never be able to be fitted with any confidence to real observed data. This type of exercise is usually referred to as parameter recovery. Here, this also serves us well in a tutorial capacity.

Box 2 shows the R code that will run the parameter recovery example. The first three lines of the R code clear the workspace (line 1), set the working directory (line 2), and load the RStan library (line 3). Then we generate some simulated data, drawing 500 exponentially distributed samples (line 5) assuming a rate parameter, lambda, equal to 1. These simulated data, dat, will then be fed into Stan to obtain parameter estimates for lambda. If the Stan implementation is working correctly, we should obtain a posterior distribution of λ that is centered over 1. So far, all of this is just standard R code.

The Stan model described earlier (exponential.stan) is run via the stan function. This is the way that R "talks" to Stan, tells it what to run, and gets back the results of the Stan run. The first argument of the function, file, is a character string that defines the location and name of the Stan model file. This is simply the Stan file from Box 1. The data argument is a list containing the data to be passed to the Stan program. Stan will be expecting a variable named Y to be holding the data (see line 3 of the Stan code in Box 1). We assign the variable dat (in this case, our simulated draws from an exponential) the name Y in line 8 so that Stan knows that these are the data. Stan also expects a variable named LENGTH to be holding the length of the data vector Y. We assign the variable len (the length of dat computed in line 6) the name LENGTH. This is the way that R feeds data into Stan. Next, the warmup argument defines the number of steps used to automatically tune the sampler in which Stan optimizes the HMC algorithm. These samples can be discarded afterward and are referred to as warmup samples. The iter argument defines the total number of iterations the algorithm
will run. Choosing the number of iterations and warmup steps usually proceeds by starting with relatively small numbers and then doubling them, each time checking for convergence (discussed below). It is recommended that warmup be half of iter. The chains argument defines the number of independent chains that will be consecutively run. Usually, at least three chains are run. After running the model, the samples are returned and assigned to the fit object.

A summary of the parameter distributions can be obtained by using print(fit)² (line 10), which provides posterior estimates for each of the parameters in the model. Before any inferences can be made, however, it is critically important to determine whether the sampling process has converged to the posterior distribution. Convergence can be diagnosed in several different ways. One way is to look at convergence statistics such as the potential scale reduction factor, R̂ (Gelman & Rubin, 1992), and the effective number of samples, Neff (Gelman et al., 2013), both of which are output in the summary statistics with print(fit). A rule of thumb is that when R̂ is less than 1.1, convergence has been achieved; otherwise, the chains need to be run longer. The Neff statistic gives the number of independent samples represented in the chain. For example, a chain may contain 1,000 samples, but this may be equivalent to having drawn very few independent samples from the posterior. The larger the effective sample size, the greater the precision of the MCMC estimate. To give an estimate of an acceptable effective sample size, Gelman et al. (2013) recommended an Neff of 100 for most applications. Of course, the target Neff can be set higher if greater precision is desired.

² The print function behaves differently given different classes of objects in R. For the Stan fit object, it prints a summary table. The RStan library must be loaded for this behavior to occur (the library defines the fit object in R).

Both the R̂ and Neff statistics are influenced by what is referred to as autocorrelation. To give an example, adjacent samples usually have some amount of correlation, due to the way that MCMC algorithms work. However, as the samples become more distant from each other in the chain, this correlation should decrease quickly. The distance between successive samples is usually referred to as the lag. The autocorrelation function (ACF) relates correlation and lag. The values of the ACF should quickly decrease with increasing lag; ACFs that do not decrease quickly with lag often indicate that the sampler is not exploring the posterior distribution efficiently and result in increased R̂ values and decreased Neff values.

The ACF can easily be plotted in R on lines 12 and 14. The separate chains are first collapsed into a single chain with as.matrix(fit) (the as.matrix function is part of the base package in R), and the ACF of lambda is plotted with acf(mcmc_chain[,'lambda']), shown in Fig. 1. The acf function is part of the stats package, a base package in R that is loaded automatically when R is opened. The left panel of Fig. 1 shows that the autocorrelation drops to values close to zero at around lags of six for the samples returned by Stan. The Metropolis–Hastings algorithm has slightly higher autocorrelation but is still reasonable in this example.

High autocorrelation indicates that the sampler is not efficiently exploring the posterior distribution. This can be overcome by simply running longer chains. By running longer chains, the sampler is given the chance to explore more of the distribution. The technique of running longer chains, however, is sometimes limited by memory and data storage constraints. One way to run very long chains and reduce memory overhead is to use a technique called thinning, which is done by saving every nth posterior sample from the chain and discarding the rest. Increasing n reduces autocorrelation as well as the resulting size of the chain. Although thinning can reduce autocorrelation and chain length, it leads to a linear increase in computational cost with increases in n. For example, if one could only save 1,000 samples, but needed to run a chain of 10,000 samples to effectively explore the posterior, one could thin by ten steps, and those 1,000 samples would have lower autocorrelation than if 1,000 samples were generated without thinning. If memory constraints are not an issue, however, it is advised to save the entire chain (Link & Eaton, 2011).

Another diagnostic test that should always be performed is to plot the chains themselves (i.e., the posterior sample at each iteration of the MCMC). This can be used to determine whether the sampling process has converged to the posterior distribution; it is easily performed in R using the traceplot function (part of the RStan package) on line 16. The left panel of Fig. 2 shows the samples from Stan, and the right panel shows the samples from Metropolis–Hastings. The researcher can use a few criteria to diagnose convergence. First, as an initial visual diagnostic, one can determine whether the chains look "like a fuzzy caterpillar" (Lee & Wagenmakers, 2014)—do they have a strong central tendency with evenly distributed aberrations? This indicates that the samples are not highly correlated with one another and that the sampling algorithm is exploring the posterior distribution efficiently. Second, the chains should also not drift up or down, but should remain around a constant value. Lastly, it should be difficult to distinguish between individual chains. Both panels of Fig. 2 clearly demonstrate all of these criteria, suggesting convergence to the posterior distribution.

Once it is determined that the sampling process has converged to the posterior, we can then move on to analyzing the parameter estimates themselves and determining whether the
model can fit the observed (in this case, simulated) data. Lines 18 through 20 of the R code (Box 2) show how the posterior predictive can be obtained by plotting the data (dat) as a histogram and then overlaying the density of predicted values, pred. The left panel of Fig. 3 shows that the posterior predictive density (solid line) fits the data (histogram bars) quite well. Lastly, line 22 of the R code shows how to plot the posterior distribution of λ with the following command: plot(density(mcmc_chain[,'lambda'])). There are two steps to this command. First, the density function, a base function in R, is called. This function estimates the density of the posterior distribution of λ from the MCMC samples held in mcmc_chain[,'lambda']. Second, the plot function, also part of the base R distribution, is called, which outputs a plot of that estimated density. It is also possible to call the histogram function, hist, and plot the samples as a histogram.

The right panel of Fig. 3 shows the posterior distribution of λ. The 95 % highest density interval (HDI) is also depicted. The HDI is the smallest interval that can be obtained in which 95 % of the mass of the distribution rests; this interval can be obtained from the summary statistics output by print(fit). The HDI is different from a confidence interval because values closer to the center of the HDI are "more credible" than values farther from the center (e.g., Kruschke, 2011). As the HDI increases, uncertainty about the parameter value also increases. As the HDI decreases, the range of credible values also decreases, thereby decreasing the uncertainty. Figure 3 shows that 95 % of the mass of λ is between 0.92 and 1.09, indicating that parameter recovery was successful, since the simulated data were generated with λ = 1.
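The R script in Box 2 is not reproduced above. The following sketch matches the line-by-line description in this section; the exact line numbering, the working-directory path, and the warmup, iter, and chains values (not stated for this example) are assumptions modeled on the LBA example in Box 6.

rm(list=ls())                         # clear the workspace
setwd("~/exponential/")               # set the working directory (path is arbitrary)
library(rstan)                        # load the RStan library
# simulate 500 exponentially distributed data points with rate lambda = 1
dat <- rexp(500, rate = 1)
len <- length(dat)
# run the Stan model in exponential.stan
fit <- stan(file = 'exponential.stan',
            data = list(Y = dat, LENGTH = len),
            warmup = 500,
            iter = 1000,
            chains = 3)
print(fit)                            # summary table, including Rhat and n_eff
mcmc_chain <- as.matrix(fit)          # collapse the chains into a single matrix
acf(mcmc_chain[,'lambda'])            # autocorrelation function of lambda
traceplot(fit)                        # trace plots of the chains
# posterior predictive check: data as a histogram, predictions as a density
hist(dat, freq = FALSE)
lines(density(mcmc_chain[,'pred']))
# posterior distribution of lambda
plot(density(mcmc_chain[,'lambda']))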
As we have just seen, the Stan model successfully recovered the single parameter value of λ that was used to generate the exponentially distributed data. Oftentimes, parameter recovery is more rigorous, testing recovery over a range of parameter values. The parameter recovery process can be repeated many times, each time storing the actual and recovered parameter values. A plot can then be made of the actual parameter values as a function of the recovered parameter values. The values should fall close to the diagonal (i.e., the recovered parameters should be close to the actual parameters). This also lets us explore how well Stan does over a range of parameterizations of the exponential.

To better test the Stan model in this way, we simulated 200 sets of data over a range of values of λ. The λ parameter values were drawn from a truncated normal distribution with mean 2.5 and standard deviation 0.25. Each data set contained 500 data points. The Stan model was fit to each data set, saving the mean of the posterior of lambda for each fit. The left panel of Fig. 4 shows the parameter recovery for the exponential distribution implemented in Stan. We can see that the parameter recovery was successful, since most values fall close to the diagonal. If we use the classic Metropolis–Hastings algorithm, we can see in the right panel of Fig. 4 that its performance is very similar to our parameter recovery in Stan.

User-defined distributions in Stan

Thus far we have implemented an exponential model in Stan using built-in probability distributions for the likelihood and the prior. Although there are dozens of built-in probability distributions in Stan (as in other Bayesian applications), sometimes the user requires a distribution that might not already be implemented. This will often be the case for specialized distributions of the kind assumed in many cognitive models. But before moving on to complicated cognitive models, we first want to present an example using the exponential model, but without the benefit of using Stan's built-in probability distribution function.
An example with the exponential distribution, redux

The exponential distribution is a built-in distribution in Stan, and therefore it is not necessary to implement it as a user-defined function. We do so here for tutorial purposes.

To begin, the likelihood function of the exponential distribution is

P(y | λ) = ∏_{i=1}^{N} λ e^(−λ y_i),   (3)

where λ is the rate parameter of the exponential, y is the vector of data points, each data point y_i ∈ [0, ∞), and N is the number of data points in y. Stan requires the log likelihood, so we simply take the log of Eq. 3 in our Stan implementation.

Having mathematically defined the (log) likelihood function, we can now implement it in Stan. Once we implement the user-defined function, we can then call it just as we would call a built-in function. In this example, we will replace the built-in distribution, exponential, in the sampling statement Y ~ exponential(lambda) in line 14 of Box 1 with a user-defined exponential distribution.

Box 3 shows how this is accomplished. To add a user-defined function, it is first necessary to define a functions code block. The functions block must come before all other blocks of Stan code, but when there are no user-defined functions this block may be omitted entirely.
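Box 3 itself is not reproduced here. A minimal sketch of the functions block it describes (using the newexp_log name and the x, lam, prob, and lprob variables discussed below) might look like the following, with the remaining blocks identical to Box 1 except that the likelihood statement becomes Y ~ newexp(lambda); the exact listing in the original Box may differ.

functions {
  real newexp_log(vector x, real lam) {
    vector[num_elements(x)] prob;       // densities for each data point (Eq. 3)
    real lprob;
    for (i in 1:num_elements(x))
      prob[i] <- lam * exp(-lam * x[i]);
    lprob <- sum(log(prob));            // log likelihood: sum of log densities
    return lprob;
  }
}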
When dealing with functions that implement probability distributions, three important rules must be considered. First, Stan requires the name of any function that implements a probability distribution to end with _log;³ the _log suffix permits access to the increment_log_prob function (an internal function that can be ignored for the purposes of this tutorial), where the total log probability of the model is computed. Second, when calling such defined functions, the _log suffix must be dropped. Lastly, when naming a user-defined function, the name must be different from any built-in function when defining it in the functions block, and it must be different from any built-in function when the _log suffix is dropped. For example, suppose that we named our user-defined function exp_log. When called, this would be different from the built-in exponential() function, but unfortunately it would conflict with another built-in function, exp, and result in an error. With these rules in mind, we can now properly name our user-defined exponential likelihood function. Line 2 of Box 3 shows that we have named the exponential likelihood function newexp_log. This name works because there is no built-in function called newexp and no built-in distribution newexp_log.

³ This naming convention holds for user-defined as well as built-in functions. For example, in line 14 of Box 1, in the sampling statement Y ~ exponential(lambda) we are actually calling the built-in function exponential_log by dropping the _log suffix.

The newexp_log function returns a real number of type real. The first argument of the function is the data vector x, and the second argument is the rate parameter lam. Note that the scope of the variables within each function is local. Within the function itself, another local variable is defined called prob, which is a vector of the same length as the data vector x. For each element in the data vector, a probability density will be computed and stored in prob, implementing the elements in Eq. 3. As we noted previously, Stan requires the log likelihood, so instead of multiplying the probability densities, we take the natural logs and sum them. The sum of the log densities will be assigned to the variable lprob.⁴ The lprob value—representing the log likelihood of the exponential distribution (log of Eq. 3)—is then returned by the function. After this, lines 12 through 30 are identical to the code shown in Box 1 lines 1 through 19, with the exception of the call to newexp (Box 3 line 25) instead of the built-in Stan exponential distribution (Box 1 line 14).

⁴ Stan includes C++ libraries designed for efficient vector and matrix operations, and therefore it is often more efficient to use the vectorized form of a function. For example, the log likelihood can be computed in a single line with lprob <- sum(log(lam) - x*lam);. For simplicity, we do not consider vectorization any further, and instead refer readers to the Stan manual.

We found that this implementation produced exactly the same results as the implementation using the built-in distribution (Box 1). This is not surprising, given that the built-in and user-defined exponential distributions are mathematically equivalent.

An example with the LBA model

In this section, we briefly describe the LBA model and how it can be utilized in a Bayesian framework, before describing how it can be implemented in Stan. Accumulator models attempt to describe how the evidence for one or more decision alternatives is accumulated over time. LBA predicts response probabilities as well as distributions of the response times, much like other accumulator models. Unlike some models, which assume a noisy accumulation of evidence to threshold within a trial, LBA instead assumes a linear and continuous accumulation to threshold—hence, the "ballistic" in LBA. LBA assumes that the variabilities in response probabilities and response times are determined by between-trial variability in the accumulation rate and other parameters.

Fig. 5  Graphical depiction of the linear ballistic accumulator (LBA) model (Brown & Heathcote, 2008)

LBA assumes a separate accumulator for each response alternative i. A response is made when the evidence accrued for one of these alternatives exceeds some predetermined threshold, b. The rate of accumulation of evidence is referred to as the drift rate. The LBA model assumes that the drift rate, di, is sampled on each trial from a normal distribution with mean vi and standard deviation s. Figure 5 illustrates an example in which the drift for response m1 is greater than that
for response m2. In this example trial, the participant will make response m1 because that accumulator reaches its threshold, b, before the other accumulator.

Each accumulator starts with some a priori amount of evidence. This start point is assumed to vary across trials. The start-point variability is assumed to be uniformly distributed between 0 and A (A must be less than the threshold b).

Like other accumulator models, LBA also makes the assumption that there is a period of nondecision time, τ, that occurs before evidence begins to accumulate (as well as after, leading to whatever motor response is required). In this implementation, as in some other LBA implementations, we assume that the nondecision time is fixed across trials.

The following equations are a formalization of these processing assumptions, showing the likelihood function for the LBA (see Brown & Heathcote, 2008, for derivations). Given the processing assumptions of the LBA, the response time, t, on trial j is given by

t_j = τ + min_i [(b − a_i) / d_i].   (4)

Let us assume that θ is the full set of LBA parameters θ = {v1, v2, b, A, s, τ}. Then the joint probability density function of making response m1 at time t (referred to as the defective PDF) is

LBA(m1, t | θ) = f(t − τ | v1, b, A, s) [1 − F(t − τ | v2, b, A, s)],   (5)

and the joint density for making response m2 at time t is

LBA(m2, t | θ) = f(t − τ | v2, b, A, s) [1 − F(t − τ | v1, b, A, s)].   (6)

… Stan. If we consider the vector of binary responses, R, and response times, T, for each trial i and a total of N trials, the likelihood function is given by

P(T, R | θ) = ∏_{i=1}^{N} LBA_trunc(T_i, R_i | θ).   (8)

To implement the model in a Bayesian framework, priors are placed on each of the parameters of the LBA model. We chose priors that one might encounter in real-world applications and based them on Turner et al. (2013). First, to make the model identifiable, we set s to a constant value (Donkin, Brown, & Heathcote, 2009). Here, we fix s at 1. We then assume that the priors for the drift rates are truncated normal distributions:

v_i ~ Normal(2, 1) ∈ (0, ∞).   (9)

We assume a uniform prior on nondecision time:

τ ~ Uniform(0, 1).   (10)

The prior for the maximum starting evidence A is a truncated normal distribution:

A ~ Normal(.5, 1) ∈ (0, ∞).   (11)

To ensure that the threshold, b, is always greater than the starting point a, we reparameterize the model by shifting b by k units away from A. We refer to k as the relative threshold. Thus, we do not model b directly, but as the sum of k and A, and assume that the prior for k is a truncated normal:

k ~ Normal(.5, 1) ∈ (0, ∞).   (12)
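In Stan, truncated-normal priors such as Eqs. 9, 11, and 12 are expressed by combining a lower-bound declaration with a truncated sampling statement. A minimal illustration follows (the full LBA model file is discussed in the next section):

parameters {
  real<lower=0> k;        // relative threshold (Eq. 12)
  real<lower=0> A;        // maximum starting evidence (Eq. 11)
}
model {
  k ~ normal(.5, 1)T[0,];
  A ~ normal(.5, 1)T[0,];
}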
First, the local variables to be used in the function are defined (lines 105–111 in Box 4). Then, to obtain the decision threshold b, k is added to A. On each iteration of the for loop, the decision time t is obtained by subtracting the nondecision time tau from the response time RT. If the decision time is greater than zero, then the defective PDF is computed as in Eqs. 5 and 6, and the CDF and PDF functions described earlier accordingly are called on lines 120 and 122 (see the Appendix for the Stan implementation of each). The defective PDF associated with each row in RT is stored in the prob array. If the value of the defective PDF is less than 1 × 10⁻¹⁰, then the value stored in prob is set to 1 × 10⁻¹⁰; this is to avoid underflow problems arising from taking the natural logarithm of extremely small values of the defective PDF. Once all of the densities are computed, the likelihood is obtained by taking the sum of the natural logarithms of the densities in prob and returning the result.

Box 5 continues the code from Box 4 and, as in our earlier example, shows the data block defining the data variables that are to be modeled. The LENGTH variable defines the number of rows in RT, whose first column contains response times and whose second column contains responses. A response coded as 1 corresponds to the first accumulator finishing first, and a response coded as 2 corresponds to the second accumulator finishing first. One of the advantages of the LBA is that it can be applied to tasks with more than two choices. The NUM_CHOICES variable defines the number of choices in the task and must be equal to the length of the drift rate vector defined in the parameters block.

The parameters block shows that the parameters are all real numbers of type real and include the relative threshold k, the maximum starting evidence A, the nondecision time tau, and the vector of drift rates v. All parameters have normal priors truncated at zero, and therefore are constrained with <lower=0>.

The Bayesian LBA model is implemented in the model block, which shows that the priors for the relative threshold k and the maximum starting evidence A are both assumed to be normally distributed with a mean of .5 and standard deviation 1. The prior for nondecision time is assumed to be normally distributed with a mean of .5 and standard deviation .5, and the priors for drift rates are distributed normally with means of 2 and standard deviations of 1. The data, RT, are assumed to be distributed according to the LBA distribution, lba.

The implementation of the generated quantities block for the LBA uses a user-defined function called lba_rng, which generates random simulated samples from the LBA model given the posterior parameter estimates. The function is also defined in the functions block, along with all of the other user-defined functions, but has been omitted in Box 4 for brevity. The code and explanation for this function can be found in the Appendix. This code is based on the "rtdists" package for R (Singmann et al., 2016), which can be found online (https://cran.r-project.org/web/packages/rtdists/index.html). We note that porting code from R to Stan is relatively straightforward, as they both are geared toward vector and matrix operations and transformations.

R code  Box 6 shows the R code that runs the LBA model in Stan. This should look very similar to the code we used for the simple exponential example earlier. We again begin by clearing the workspace, setting the working directory, and loading the RStan library. After simulating the data from the LBA distribution using a file called "lba.r" (see the website listed above for the code), which contains the rlba function that generates random samples drawn from the LBA distribution, the model is then run on line 10.
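Boxes 4 and 5 themselves are not reproduced here. The following sketch of the lba.stan file is consistent with the description above; the bodies of the lba_log and lba_rng functions are omitted (see the Appendix), and the exact declarations in the original Boxes may differ.

functions {
  // lba_log(RT, k, A, v, s, tau) and lba_rng(k, A, v, s, tau) are defined here
  // (implementations omitted; see Box 4 and the Appendix)
}
data {
  int<lower=1> LENGTH;                 // number of trials (rows of RT)
  int<lower=2> NUM_CHOICES;            // number of response alternatives
  matrix[LENGTH, 2] RT;                // column 1: response times; column 2: responses (1 or 2)
}
parameters {
  real<lower=0> k;                     // relative threshold
  real<lower=0> A;                     // maximum starting evidence
  real<lower=0> tau;                   // nondecision time
  vector<lower=0>[NUM_CHOICES] v;      // mean drift rates
}
model {
  k ~ normal(.5, 1)T[0,];
  A ~ normal(.5, 1)T[0,];
  tau ~ normal(.5, .5)T[0,];
  for (n in 1:NUM_CHOICES)
    v[n] ~ normal(2, 1)T[0,];
  RT ~ lba(k, A, v, 1, tau);           // s fixed at 1
}
generated quantities {
  vector[2] pred;                      // simulated response and response time
  pred <- lba_rng(k, A, v, 1, tau);
}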
1 rm(list=ls())
2 setwd("~/LBA/")
3 source('lba.r')
4 library(rstan)
5 #make simulated data
6 out = rlba(500,1,1,c(2,1),1,.5)
7 rt = cbind(out$rt,out$resp)
8 len = length(rt[,1])
9 #run the Stan model
10 fit <- stan(file = 'lba.stan',
data = list(RT=rt,LENGTH=len,NUM_CHOICES=2),
warmup = 750,
iter = 1500,
chains = 3)
As we noted in our earlier example, in real-world applications of the model, the data would not be simulated but would be collected from a behavioral experiment. We use simulated data here for convenience of the tutorial and because we are interested in determining whether the Bayesian model can recover the known parameters used to generate the simulated data (parameter recovery). With just some minor modification, the code we provide using simulated data can be generalized to an application to real data. For example, real data stored in a text file or spreadsheet can be read into R and then formatted and coded in the same way as the simulated data.

The Bayesian LBA model can be validated in a fashion similar to that for the Bayesian exponential model. Figure 6 shows the autocorrelation function for each parameter. For Stan, autocorrelation across all parameters became undetectable after approximately 15 iterations. The right panels show that the Metropolis–Hastings algorithm had high autocorrelation for long lags, indicating that the sampler was not taking independent samples from the posterior distribution. This high autocorrelation leads to lower numbers of effective samples and longer convergence times. The Neff values returned by Metropolis–Hastings across all chains were on average 27 for each parameter. This means that running 4,500 iterations (three chains of 1,500 samples) is equivalent to drawing only 27 independent samples. On the other hand, Stan returned on average 575 effective samples for each parameter after 4,500 iterations. In addition, R̂ for all parameters was above 1.1 for Metropolis–Hastings, and below 1.1 for Stan, indicating that the chains converged for Stan but not for Metropolis–Hastings.

The deleterious effect of the high autocorrelation of Metropolis–Hastings in comparison to the low autocorrelation of Stan is apparent in Fig. 7. The left panels show the chains produced by Stan, and the right panels show the chains produced by Metropolis–Hastings. In the left panel, the Stan chains show good convergence: They look like "fuzzy caterpillars," it is difficult to distinguish one chain from the others, and the chains do not drift up and down. In the right panels of Fig. 7, the Metropolis–Hastings chains clearly do not meet any of the necessary criteria for convergence. The only way we found to correct for this was to thin by at least 50 or more steps.

Figure 8 shows the results of a larger parameter recovery study for Stan and the Metropolis–Hastings algorithm. In this exploration, 200 simulated data sets containing 500 data points each were generated, each with a different set of parameter values. The parameter values were drawn randomly from a truncated normal with a lower bound of 0, a mean of 1, and a standard deviation of 1. The Stan model was fit to each data set, and the resulting mean of the posterior distribution for each parameter was saved. Figure 8 shows the actual parameter values plotted against the recovered parameter values. For Stan, most of the points fall along the diagonal, indicating that parameter recovery was successful. For Metropolis–Hastings, it is clear visually that parameter recovery is poorer—this is due to the aforementioned difficulty this algorithm has with the inherent correlations between parameters in the LBA model.

We note that parameter recovery in sequential-sampling models is often difficult if the experimental design is unconstrained, like the one we present here, which benefited from a large number of data points for each data set as well as priors that were similar to the actual parameters that had generated the data. We present this parameter recovery as a sanity check to ensure that Stan can recover sensible parameter values under optimal conditions. Obviously, this will not be the case in real-world applications, and therefore, great care must be taken when designing experiments to test sequential-sampling models like the LBA.
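For readers who want to inspect the diagnostics reported here for their own fits, one way (assuming the fit object returned by the stan call in Box 6) is to pull them out of the RStan summary table:

fit_summary <- summary(fit)$summary        # matrix of posterior summaries
fit_summary[, c("n_eff", "Rhat")]          # effective sample size and R-hat per parameter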
[Fig. 6 appears here: autocorrelation functions (ACF) for each LBA parameter (v1, v2, b, A, τ) plotted against lag, with panels for the Stan and Metropolis–Hastings samplers]
Better Metropolis–Hastings sampling might be achieved by careful adjustment and experimentation with the proposal step process. Here, the proposal step was generated by sampling from a normal distribution with a mean equal to the current sample and a standard deviation of .05. Increasing the standard deviation increases the average distance between the current sample and the proposal, but decreases the probability of accepting the proposal. We found that different settings of the standard deviation largely led to autocorrelations similar to those we have presented here. The only thing we found that led to improvements in autocorrelation was thinning. Thinning by 50 steps led to autocorrelation dropping to nonsignificant values at around lags of 40. At 75 steps, Metropolis–Hastings's performance was similar to that of Stan, resulting in similar autocorrelation, Neff, and R̂ values.

A reason behind the poor sampling of Metropolis–Hastings is the correlated parameters of the LBA. The half below the diagonal of Fig. 9 shows the joint posterior distribution for each parameter pair of the LBA for a given set of simulated data, and the half above the diagonal gives the corresponding correlations. Each point in each panel in the lower half of the grid represents a posterior sample from the joint posterior probability distribution of a particular parameter pair for the LBA model. For example, the bottom left corner panel shows the joint posterior probability distribution between τ and v1. We can see that this distribution has negatively correlated parameters. The upper right corner panel of the grid confirms this, showing the correlation between τ and v1 to be –.45. Five joint distributions have correlations with absolute values well above .50 (v1–v2, v1–b, v2–b, v2–τ, and A–τ).
[Fig. 7 appears here: trace plots of the MCMC chains for each LBA parameter (v1, v2, b, A, τ) across 1,500 iterations, for the Stan sampler (left panels) and the Metropolis–Hastings sampler (right panels)]
These correlations in parameter values cause some MCMC algorithms, such as Gibbs sampling and Metropolis–Hastings, to perform poorly. On the other hand, Stan does not drastically suffer from the model's correlated parameters.

In summary, the Stan implementation of the LBA model shows successful parameter recovery and efficient sampling of the posterior distribution when compared to Metropolis–Hastings, due in large part to the correlated parameters of the LBA model. Whereas Stan was designed with the intention to handle these situations properly, standard MCMC techniques such as the Metropolis–Hastings algorithm were not, and they do not converge to the posterior distribution in any sort of reliable manner.

Fitting multiple subjects in multiple conditions: a hierarchical extension of the LBA model

The simple LBA model just described was designed for a single subject in a single condition. This is never the case in any real-world application of the LBA model. In this section, we describe and implement an LBA model that is designed for multiple subjects in multiple conditions. The model will include parameters that model performance at both the group and individual levels. The Bayesian approach allows both the group- and individual-level parameters to be estimated simultaneously. This type of model is called a Bayesian hierarchical model (e.g., Kruschke, 2011; Lee & Wagenmakers, 2014). In a Bayesian hierarchical model, the parameters for individual participants are informed by the group parameter estimates. This reduces the potential for the individual parameter estimates to be sensitive to outliers, and decreases the overall number of participants necessary to achieve reliable parameter estimates.

The model we consider assumes that the vector of response times for each participant i in condition j is distributed according to the LBA:

RT_i,j ~ LBA(k_i, A_i, v1_i,j, v2_i,j, s, τ_i),   (13)

where, as before, the responses are coded as 1 and 2, corresponding to each accumulator; k_i is the relative threshold; A_i is the maximum starting evidence; v1_i,j and v2_i,j are the mean drift rates for each accumulator; s is the standard deviation, held constant across participants and conditions; and τ_i is the nondecision time. As before, we assume that s is fixed at 1.0 and that the prior on each parameter follows a truncated normal distribution with its own group mean μ and standard deviation σ:

k_i ~ Normal(μ_k, σ_k) ∈ (0, ∞)   (14)

A_i ~ Normal(μ_A, σ_A) ∈ (0, ∞)   (15)

v1_i,j ~ Normal(μ_v1,j, σ_v1,j) ∈ (0, ∞)   (16)

v2_i,j ~ Normal(μ_v2,j, σ_v2,j) ∈ (0, ∞)   (17)
1  model {
2
3    k_mu ~ normal(.5,1)T[0,];
4    A_mu ~ normal(.5,1)T[0,];
5    tau_mu ~ normal(.5,.5)T[0,];
6
7    k_sigma ~ gamma(1,1);
8    A_sigma ~ gamma(1,1);
9    tau_sigma ~ gamma(1,1);
10
11   for (j in 1:NUM_COND){
12     for (n in 1:NUM_CHOICES){
13       v_mu[j,n] ~ normal(2,1)T[0,];
14       v_sigma[j,n] ~ gamma(1,1);
15     }
16   }
17
18   for (i in 1:NUM_SUBJ){
19     k[i] ~ normal(k_mu,k_sigma)T[0,];
20     A[i] ~ normal(A_mu,A_sigma)T[0,];
21     tau[i] ~ normal(tau_mu,tau_sigma)T[0,];
22     for(j in 1:NUM_COND){
23       for(n in 1:NUM_CHOICES){
24         v[i,j,n] ~ normal(v_mu[j,n],v_sigma[j,n])T[0,];
25       }
26       RT[i,j] ~ lba(k[i],A[i],v[i,j],1,tau[i]);
27     }
28   }
29 }
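The data and parameters blocks that accompany this model block are not shown in the text. The following declarations are a sketch consistent with how the variables are used above; the array shapes, and the assumption of an equal number of trials per subject and condition, are ours.

data {
  int<lower=1> NUM_SUBJ;                          // number of participants
  int<lower=1> NUM_COND;                          // number of conditions
  int<lower=2> NUM_CHOICES;                       // number of response alternatives
  int<lower=1> LENGTH;                            // trials per participant and condition
  matrix[LENGTH, 2] RT[NUM_SUBJ, NUM_COND];       // response times and responses
}
parameters {
  real<lower=0> k_mu;                             // group-level means
  real<lower=0> A_mu;
  real<lower=0> tau_mu;
  real<lower=0> k_sigma;                          // group-level standard deviations
  real<lower=0> A_sigma;
  real<lower=0> tau_sigma;
  real<lower=0> v_mu[NUM_COND, NUM_CHOICES];
  real<lower=0> v_sigma[NUM_COND, NUM_CHOICES];
  real<lower=0> k[NUM_SUBJ];                      // subject-level parameters
  real<lower=0> A[NUM_SUBJ];
  real<lower=0> tau[NUM_SUBJ];
  vector<lower=0>[NUM_CHOICES] v[NUM_SUBJ, NUM_COND];
}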
To test whether the model could successfully recover the parameters, we simulated 20 subjects, each with 100 responses and response times. Each simulated subject's parameters were drawn from a group-level distribution. Specifically, the maximum starting evidence parameter, A, relative threshold parameter, k, and nondecision time parameter, τ, were drawn from a truncated normal distribution with a mean of .5, standard deviation of .5, and lower bound of 0. We then varied the drift rates across three conditions. The drift rates of the first accumulator were drawn from a truncated normal with means of 2 (Condition 1), 3 (Condition 2), and 4 (Condition 3), respectively, all with standard deviations of 1 and lower bounds of 0. The mean drift rate of the second accumulator for all three conditions was drawn from a truncated normal with a mean of 2 and standard deviation of 1. In applications to real data, the distribution of the drift rates corresponding to the incorrect choice will usually have a lower mean and larger standard deviation than the distribution of the drift rates corresponding to the correct choice.

We then fit the hierarchical LBA model to the simulated data. The group-level parameter estimates are shown in Fig. 10. For the panels plotting v1 and v2, solid lines indicate Condition 1, dotted lines indicate Condition 2, and dashed lines indicate Condition 3. All other parameters were held constant across conditions. The group-level parameter estimates for the hierarchical model shown in Fig. 10 closely align with the group-level distribution parameters used to generate the simulated data.

To further illustrate the advantages of the hierarchical LBA model, we also fit the nonhierarchical LBA model shown in Box 5 to the same set of simulated data. The nonhierarchical model assumed that for each subject, the priors on k and A were normally distributed with a mean of .5 and standard deviation of 1. The prior on τ for each subject was normal with mean .5 and standard deviation .5. Lastly, the prior on the drift rate for each accumulator was drawn from a normal distribution with a mean of 2 and standard deviation 1. Thus, the priors on the parameters for each subject in the nonhierarchical model mirrored the group-level priors in the hierarchical model.
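A sketch in R of the subject-level parameter draws just described (the rtnorm helper, which draws from a normal truncated below at zero by simple rejection, is our own and is not part of the original code):

# draw parameters for 20 simulated subjects from the group-level distributions
rtnorm <- function(n, mean, sd) {                 # normal truncated below at 0
  x <- rnorm(n, mean, sd)
  while (any(x <= 0)) x[x <= 0] <- rnorm(sum(x <= 0), mean, sd)
  x
}
n_subj <- 20
A_i   <- rtnorm(n_subj, .5, .5)                   # maximum starting evidence
k_i   <- rtnorm(n_subj, .5, .5)                   # relative threshold
tau_i <- rtnorm(n_subj, .5, .5)                   # nondecision time
v1_ij <- sapply(c(2, 3, 4), function(m) rtnorm(n_subj, m, 1))  # drift 1, Conditions 1-3
v2_ij <- matrix(rtnorm(3 * n_subj, 2, 1), n_subj, 3)           # drift 2, all conditions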
The computation involved in NUTS is fairly expensive and can be slow for complex models. It should be noted that this lowered speed is, by design, traded off for greater effective sample rates. Other techniques, such as MCMC-DE (Turner et al., 2013), which approximate some of the more expensive computations involved in NUTS, may offer an alternative if the sampling rate becomes an issue.

Although Bayesian parameter estimation has many advantages over traditional methods, implementing the MCMC algorithm can be technically challenging. Turnkey Bayesian inference applications allow the researcher to work at the level of the model and not of the sampler, but they are likewise not without issues. Stan is a viable alternative to other applications that do automatic Bayesian inference, especially when the researcher is interested in distributions that are uncommon and require user implementation or when the model's parameters are correlated.

Author note  This work was supported by Grant Nos. NEI R01-EY021833 and NSF SBE-1257098, the Temporal Dynamics of Learning Center (NSF SMA-1041755), and the Vanderbilt Vision Research Center (NEI P30-EY008126).

Appendix

The Stan implementations of the PDF and CDF of the LBA are given in Boxes A1 and A2, respectively. These functions are used in the calculation of the likelihood function of the LBA given in Box 4 of the main text and are nothing more than implementations of the equations that Brown and Heathcote (2008) provided. Here, we simply note some implementation details of each.
The first thing to note is that both are real-valued functions of type real. They both take as arguments the decision time, t, the decision threshold, b, the maximum starting evidence, A, the drift rate, v, and the standard deviation, s. Lines 4 through 10 in Box A1 and lines 24 through 31 in Box A2 define all local variables that will be used in each computation. After defining the local variables, the PDF or CDF is computed and the result is returned. Some built-in functions allow for an easier computation of the PDF and CDF. The Phi function is a built-in Stan function that implements the normal cumulative distribution function. The exp function is the exponential function, and the normal_log function is the natural logarithm of the PDF of the normal distribution, where the last two arguments are the mean and standard deviation, respectively.

Box A3 implements the LBA model in Stan. This code is based on the "rtdists" package for R (Singmann et al., 2015), which can be found online (https://cran.r-project.org/web/packages/rtdists/index.html). All Stan functions that generate samples from a given distribution are called random number generators (RNGs). To distinguish between functions, RNGs must contain the _rng suffix. For example, the RNG for the exponential distribution is called exponential_rng. Here, we name the function that generates samples from the LBA model lba_rng. After defining the local variables, the drift rates for each accumulator are drawn from the normal distribution. Negative drift rates result in negative response times, and drift rates of zero result in undefined response times. The LBA model that we implement assumes that at least one accumulator has a positive drift rate, and therefore no negative or undefined response times. This is achieved in the while loop beginning on line 64, in which drift rates are drawn from the normal distribution until at least one drift rate is positive. The loop terminates after a maximum of 1,000 iterations if at least one positive drift rate has not been drawn. If this is the case, a negative value is returned, denoting an undefined response time (lines 79 and 80). In practice, we have found this works very well and will not return negative or undefined response times given a reasonable model and data. After drawing the drift rates, the start points for each accumulator are drawn (line 84). The finishing times of each accumulator are computed according to the processing assumptions of the LBA on line 85. Lastly, the response alternative and lowest positive response time are stored in the pred vector and returned.
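To make the sampling scheme of lba_rng (and of the rlba function used in Box 6) concrete, here is a small R sketch of the same logic for a single trial; it is illustrative only and is not the original Stan or rtdists code.

# simulate one LBA trial: returns c(response, response time)
lba_trial <- function(b, A, v, s, tau) {
  d <- rnorm(length(v), v, s)              # drift rates for each accumulator
  iter <- 0
  while (all(d <= 0) && iter < 1000) {     # redraw until at least one is positive
    d <- rnorm(length(v), v, s)
    iter <- iter + 1
  }
  if (all(d <= 0)) return(c(NA, -1))       # give up: undefined response time
  start  <- runif(length(v), 0, A)         # uniform start points
  finish <- (b - start) / d                # linear rise to threshold b
  finish[d <= 0] <- Inf                    # accumulators with negative drift never finish
  resp <- which.min(finish)                # first accumulator to reach threshold
  c(resp, tau + finish[resp])              # add nondecision time
}
lba_trial(b = 1.5, A = 1, v = c(2, 1), s = 1, tau = .5)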