Bayesian Randomized Item Response Theory Models for Sensitive Measurements
CONTENTS
1.1 Introduction
1.2 Presentation of the Models
1.2.1 Randomized IRT Models
1.1 Introduction
Research on behavior and attitudes typically relies on self-reports, especially when
the infrequency of behavior and research costs make it impractical to do otherwise.
However, many studies have shown that self-reports can be highly unreliable and
biased toward the researcher's expectations and/or what reflects positively on the
respondents' behavior. Thus, sensitivity of the questions easily leads to misreporting
(i.e., under- or overreporting), even
when anonymity and confidentiality of the responses is guaranteed. This observation
is supported by considerable empirical evidence. For instance, survey respondents
underreported socially undesirable behavior such as use of illicit drugs (Anglin, Hser,
& Chou, 1993), the number of sex partners (Tourangeau & Smith, 1996), desires for
adult entertainment (De Jong, Pieters, & Fox, 2010), welfare fraud (van der Heij-
den, van Gils, Bouts, & Hox, 2000), and alcohol abuse and related problems (Fox &
Wyrick, 2008).
To overcome respondents' tendencies to report inaccurately, or even to refuse to
provide any response at all, several strategies have been developed. Typically,
anonymity of the respondents is guaranteed and explicit assurances are given that
each of their answers will remain completely confidential. In addition, questions are
often phrased such that tendencies to provide socially desirable answers are
diminished. Furthermore, respondents are motivated to provide accurate answers
by stressing the importance of the research study.
Other ways of avoiding response tendencies to report inaccurately are based on
innovative data collection methods that make it impossible to infer any identifying
information from the response data. A general class of such methods for sensitive
surveys is based on the randomized response technique (RRT) (Fox & Tracy, 1986),
which involves the use of a randomizing device to mask individual responses. RRT
has originated from Warner (1965), who developed a randomized response (RR) data
collection procedure, where respondents are confronted with two mutually exclusive
questions, for instance, “I belong to Group A,” and “I do not belong to Group A.” A
choice is made between the two statements using a randomizing device (e.g., tossing
of a die or use of a spinner). The randomization is performed by the respondent
and the outcome is not revealed to the interviewer. The respondent then answers
the question selected by the randomizing device. The interviewer only knows the
response, not the question.
Because of this setup, the RR technique encourages greater cooperation from
respondents and reduces socially desirable response behavior. The properties of the
randomizing device are known, which still allows for population estimates of the
sensitive behavior, for instance, proportions of the population engaging in a particular
kind of behavior or, more generally, membership of Group A. Further analysis of the
univariate RR data is limited to inferences at this aggregate data level.
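Such aggregate-level inference can be sketched in a few lines. Assuming a forced response design with known device probabilities (the function name and numbers below are ours, purely for illustration), the population prevalence follows from inverting the mixture of honest and forced answers:

```python
# Moment estimator of the population prevalence under a forced response
# design: p_yes = phi1 * pi + (1 - phi1) * phi2, solved for pi.
def prevalence_estimate(p_yes, phi1, phi2):
    # p_yes: observed proportion of "yes" answers
    # phi1:  probability that the sensitive question is answered honestly
    # phi2:  probability that a forced answer is "yes"
    return (p_yes - (1 - phi1) * phi2) / phi1

# With 38% observed "yes", 60% honest answers, and forced "yes" with
# probability .20, the estimated prevalence is .50.
estimate = prevalence_estimate(0.38, 0.60, 0.20)
```

Note that only the known device probabilities are needed; no individual answer can be decoded.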
A well-known variant is the unrelated question design (Greenberg, Abul-Ela, Simmons,
& Horwitz, 1969), in which the randomizing device selects between a sensitive
question and an irrelevant unrelated question.
Edgell, Himmelfarb, and Duchan (1982) generalized the procedure by introduc-
ing an additional randomizing device to generate the answer on the unrelated ques-
tion. The responses are then completely protected since it becomes impossible to
infer whether they are answers to the sensitive question or forced answers generated
by the randomizing device. Let the randomizing device select the sensitive question
with probability φ1 and a forced response with probability 1 − φ1 . The latter is sup-
posed to be a success with probability φ2 . Let U pi denote the randomized response of
person p = 1, . . . , P to item i, 1, . . . , I. Consider a success a positive response (score
RA
one) to a question and a failure a negative response (score zero). Then, the probability
of a positive randomized response is represented by
P{U_pi = 1; φ_1, φ_2} = φ_1 P{Ũ_pi = 1} + (1 − φ_1)φ_2, (1.1)
where Ũ_pi is the underlying response, which is referred to as the true response of
person p to item i when directly and honestly answering the question.
For a polytomous randomized response, let φ2 (a) denote the probability of a
forced response in category a for a = 1, . . . , Ai such that the number of response cat-
egories may vary over items. The probability of a randomized response of individual
p in category a of item i is given by

P{U_pi = a; φ_1, φ_2} = φ_1 P{Ũ_pi = a} + (1 − φ_1)φ_2(a). (1.2)
The probability of a positive true response is modeled by a normal-ogive IRT model,

π_pi = P{Ũ_pi = 1; θ_p, a_i, b_i} = Φ(a_i(θ_p − b_i)), (1.3)

where Φ is the standard normal distribution function and a_i and b_i are the
discrimination and difficulty parameters of item i. For polytomous items, the
threshold parameters satisfy the order restriction b_i1 < . . . < b_iA for response
alternatives a = 1, . . . , A (for more on this type of graded response model, see
Samejima, vol. 1, chap. 6).
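As a minimal sketch of how the ordered thresholds generate category probabilities (our own notation, with the normal ogive computed via the error function):

```python
import math

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def graded_probs(theta, a, b):
    """Category probabilities pi(c) = Phi(a(theta - b_{c-1})) - Phi(a(theta - b_c)),
    with b the ordered thresholds b_1 < ... < b_{A-1}; b_0 and b_A sit at -inf, +inf."""
    cum = [1.0] + [Phi(a * (theta - bk)) for bk in b] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]
```

Because the thresholds are ordered, the cumulative terms decrease, so the category probabilities are nonnegative and sum to one.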
The last model can be extended to deal with questionnaires with items that mea-
sure multiple sensitive constructs. Let the multidimensional vector θ_p of dimension
D denote these constructs. Then, the probability of a true response in category a is

π_pi(a) = P{Ũ_pi = a; θ_p, a_i, b_i} = Φ(a_i′θ_p − b_i,a−1) − Φ(a_i′θ_p − b_i,a).
Some respondents may not follow the randomized response instructions but instead
systematically give the least self-incriminating response (Clark & Desharnais, 1998).
Let G_pi = 1 indicate such a noncompliant (self-protective) response of person p to
item i and G_pi = 0 a compliant response. The probability of a zero response is then

P{U_pi = 0} = P{G_pi = 0} P{U_pi = 0; θ_p, a_i, b_i} + P{G_pi = 1} I(U_pi = 0),
where I (U pi = 0) equals one when the answer to item i of respondent p is zero and
equals zero otherwise. This mixture model consists of a randomized item response
model for the compliant class but a different model for the noncompliant class. Infer-
ences are made from the responses by the compliant class, which requires informa-
tion about the behavior of the respondents. That is, the assumption of an additional
response model for G pi is required (e.g., De Jong et al., 2010; Fox, 2010).
1.2.3 Structural Models for Sensitive Constructs
Respondents are usually independently sampled from a population, and a normal
distribution is often used to describe the distribution of the latent variable. If so, the
population model for the latent person variable is
θ_p ∼ N(μ_θ, σ_θ²).
For more complex sampling designs, respondents can be clustered, and the model
for the population distribution needs to account for the dependencies between respon-
dents in the same cluster. As described, among others, by Fox (2010) and Fox and
Glas (2001; vol. 1, chap. 24), a multilevel population distribution for the latent
person parameters needs to be defined. Let θ_pj denote the latent parameter of person p
in group j ( j = 1, . . . , J). The population distribution becomes
θ_pj ∼ N(β_j, σ_θ²),
β_j ∼ N(μ_θ, τ_00²).
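A small simulation sketch (all settings are ours, chosen for illustration) shows this two-level structure; the marginal variance of θ_pj is σ_θ² + τ_00²:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_theta, tau00, sigma_theta = 0.0, 0.5, 1.0   # illustrative values
J, P = 200, 50                                 # groups, persons per group

beta = rng.normal(mu_theta, tau00, size=J)               # group-level means
theta = rng.normal(beta[:, None], sigma_theta, (J, P))   # person parameters
```

The sample variance of the simulated person parameters should be close to σ_θ² + τ_00² = 1.25 under these settings.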
For multidimensional latent variables, the population model becomes multivariate normal,

θ_p ∼ N(μ_θ, Σ_θ).
This multidimensional model can also be extended to include a multilevel setting, but
this case will not be discussed. To explain variation between persons in latent
sensitive measurements, explanatory variables at the level of persons and/or groups
can be included. Finally, variation in item parameters can also be modeled as
described in De Boeck and Wilson (2004; vol. 1, chap. 33) and De Jong et al. (2010).
Inverse gamma priors are typically specified for the variance components. An inverse
Wishart prior is specified for the covariance matrix.
Normal and lognormal priors are specified for the difficulty and discrimination pa-
rameters, respectively. A uniform prior is specified for the threshold parameters while
accounting for the order constraint.
Following the MCMC sampling procedure for item randomized-response data in
Fox (2005, 2010), Fox and Wyrick (2008), and De Jong et al. (2010), a fully Gibbs
sampling procedure is developed which consists of a complex data augmentation
scheme: (i) sampling of latent true responses, Ũ; (ii) sampling latent continuous
response data, Z; and (iii) sampling latent class memberships, G. The item response
model parameters and structural model parameters are sampled in a straightforward
way given the continuous augmented data, as described by Fox (2010) and Johnson
and Albert (2001).
Omitting conditioning on G pi = 0 for notational convenience, the procedure is de-
scribed for latent response data generated only for responses belonging to the com-
pliant class. A probabilistic relationship needs to be defined between the observed
randomized response data and the true response data. To do so, define H_pi = 1 when
the randomizing device determines that person p answers item i truthfully and H_pi = 0
when a forced response is generated. It follows that the conditional distribution of a
true response a′ given a randomized response a is

P{Ũ_pi = a′ | U_pi = a} = P{Ũ_pi = a′, U_pi = a} / P{U_pi = a}
                        = ∑_{l∈{0,1}} P{Ũ_pi = a′, U_pi = a | H_pi = l} P{H_pi = l} /
                          ∑_{l∈{0,1}} P{U_pi = a | H_pi = l} P{H_pi = l},
where a, a′ take values in {0, 1} for binary and in {1, 2, . . . , A_i} for polytomous
responses, respectively.
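For a binary item under the forced response design, summing over the two device outcomes gives a closed form for this conditional probability. A sketch (the function name is ours; π_pi denotes the true-response success probability):

```python
def true_response_prob(u, pi, phi1, phi2):
    """P(U~ = 1 | U = u): probability of a positive true response given the
    observed randomized response u, marginalizing over the device outcome H."""
    if u == 1:
        num = pi * (phi1 + (1 - phi1) * phi2)            # U~=1 and U=1 (honest or forced 1)
        den = phi1 * pi + (1 - phi1) * phi2              # P(U = 1)
    else:
        num = pi * (1 - phi1) * (1 - phi2)               # U~=1 but a forced 0 was observed
        den = phi1 * (1 - pi) + (1 - phi1) * (1 - phi2)  # P(U = 0)
    return num / den
```

As a sanity check, when φ_1 = 1 the device always selects the sensitive question, and an observed positive response implies a positive true response with probability one.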
For binary responses, π_pi in Equation 1.3 defines the probability of a success.
Following the data augmentation procedure of Johnson and Albert (2001) and Fox
(2010), latent continuous response data are sampled given the augmented dichotomous or
polytomous true response data.
The latent class memberships, G pi , are generated from a Bernoulli distribution.
Let Y_pi = 0 denote the least self-incriminating response; then the success probability
of the Bernoulli distribution can be expressed as

P{G_pi = 1} I(Y_pi = 0) / [ P{G_pi = 0} P{Y_pi = 0 | θ_p, a_i, b_i} + P{G_pi = 1} I(Y_pi = 0) ],
where a Bernoulli prior is usually specified for the class membership variable G pi .
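This Gibbs step can be sketched as follows (the function name and inputs are ours; in the full model the prior probability of noncompliance would itself be a sampled parameter):

```python
def noncompliance_prob(y, prior_g1, p_y0_compliant):
    """Conditional probability of G_pi = 1 given response y; p_y0_compliant
    is P(Y_pi = 0 | theta_p, a_i, b_i) under the compliant response model."""
    if y != 0:
        return 0.0  # a non-zero response can only come from the compliant class
    return prior_g1 / ((1 - prior_g1) * p_y0_compliant + prior_g1)
```

A new class membership G_pi is then drawn from a Bernoulli distribution with this success probability.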
Given the augmented data (class memberships, true responses, and latent continuous
responses), all other model parameters can be sampled using a full Gibbs sampling
algorithm. The full conditionals can be found in the MCMC literature for IRT (e.g.,
Junker, Patz, & Vanhoudnos, vol. 2, chap. 15).
Bayesian residual analysis for these models has been discussed by De Jong et al.
(2010), Fox (2010), Geerlings, Glas, and van der Linden (2011),
and Johnson and Albert (2001). Posterior distributions of the residuals can be used
to evaluate their magnitude and make probability statements about them. Bayesian
residuals are easily computed as by-products of the MCMC algorithm, and they can
be summarized to provide information about specific model violations. For instance,
sums of squared residuals can be used as a discrepancy measure for evaluating person
or item fit. The extremeness of the observed discrepancy measure can be evaluated
using replicated data generated under the posterior predictive distribution. Likewise,
the assumptions of local independence and unidimensionality can be checked
using appropriate discrepancy measures. For an introduction to posterior predictive
checks, see Sinharay (vol. 2, chap. 19). Studies of different posterior predictive
checks for Bayesian IRT models are reported in Glas and Meijer (2003), Levy, Mis-
levy, and Sinharay (2009), Sinharay, Johnson, and Stern (2006), and Sinharay (2006).
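A posterior predictive check of the kind described above can be sketched as follows (the observed data and posterior draws below are synthetic stand-ins, not from the example):

```python
import numpy as np

rng = np.random.default_rng(42)
n_items, n_draws = 20, 500
y_obs = rng.binomial(1, 0.4, size=n_items)                 # stand-in observed data
pi_draws = rng.uniform(0.3, 0.5, size=(n_draws, n_items))  # stand-in posterior draws

def discrepancy(y, pi):
    """Sum of squared standardized residuals."""
    return np.sum((y - pi) ** 2 / (pi * (1 - pi)))

extreme = 0
for pi in pi_draws:
    y_rep = rng.binomial(1, pi)                            # replicated data
    if discrepancy(y_rep, pi) >= discrepancy(y_obs, pi):
        extreme += 1

ppp = extreme / n_draws  # posterior predictive p-value
```

Posterior predictive p-values near 0 or 1 signal that the observed discrepancy is extreme relative to data generated under the model.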
The study investigated whether the randomized response technique improved the
accuracy of the self-reports obtained by direct questions.
A multidimensional RIRT model was also used to jointly analyze the CAPS and AEQ
data for the relationships between the multiple factors they measure. Finally, the
effects of the randomized response technique on the measurement of these factors
were analyzed jointly.
1.5.2 Data
A total of seven hundred ninety-three students from four local colleges/universities,
Elon University (N=495), Guilford Technical Community College (N=66), Univer-
sity of North Carolina (N=166), and Wake Forest University (N=66), participated in
the survey study in 2002. Both the CAPS and AEQ items were administered to them,
and their age, gender, and ethnicity were recorded. It was logistically not possible to
randomly assign students to the direct questioning (DQ) or the randomized response
(RR) condition. However, it was possible to randomly assign classes of five to ten
participants to one of the conditions.
A total of 351 students were assigned to the DQ condition. They served as the control
group and were instructed to answer the questionnaire as they normally would. A
total of 442 students in the RR condition received a spinner to assist them in complet-
ing the questionnaire. For each item, the spinner was used as a randomizing device
which determined whether to answer honestly or to give a forced response. Accord-
ing to a forced response design, the properties of the spinner were set such that an
honest answer was requested with a probability of .60 and a forced response with a
probability of .40. When a forced response was to be given, each of the five possible
responses had a probability of .20.
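The spinner mechanism just described can be simulated directly (a sketch with our variable names; the true responses below are random stand-ins for real answers):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
honest = rng.random(n) < 0.60             # spinner selects an honest answer
true_answers = rng.integers(1, 6, n)      # stand-in true responses on a 1-5 scale
forced = rng.integers(1, 6, n)            # forced response, uniform over categories
observed = np.where(honest, true_answers, forced)
```

Over many items and respondents, roughly 40% of the observed responses are forced, which is exactly the noise the model in Section 1.5.3 corrects for.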
1.5.3 Model Specification
The following multidimensional randomized item response model was used to ana-
lyze the data,
P{Y_pi = a | θ_p, a_i, b_i} = φ_1 π_pi(a) + (1 − φ_1)φ_2(a),
π_pi(a) = Φ(a_i′θ_p − b_i,a−1) − Φ(a_i′θ_p − b_i,a), (1.7)
θ_p ∼ N(μ_θ,p, Σ_θ),
μ_θ,p = β_0 + β_1 RR_p.
The MCMC algorithm was used to estimate all model parameters simultaneously,
using 50,000 iterations with a burn-in period of 10,000 iterations.
1.5.4 Results
In Table 1.1, the estimated factor loadings for a three-factor solution of the multidimensional
RIRT model in (1.7) are given. The factor loadings were standardized by dividing
each of them by the average item loading. Furthermore, for each factor the sign of
the loadings was set such that a higher latent score corresponded to a higher observed
score. To avoid label switching, Items 1, 5, and 14 were each allowed one free non-zero
loading, so that each of these items represented one factor.
Items 1-4, 6, 8, and 9 were positively associated with the first factor and had factor
loadings higher than .60. This first factor represents drinking-related socio-emotional
problems, including depression, anxiety, and troubles with family. These problems
increased with alcohol consumption. Some of the items also loaded on the two other
factors.
The second factor (community problems) covered Items 5, 7, and 10-13, with
loadings higher than .60, except for Item 12. In the literature, Item 12 has been
associated with the factor community problems, but in our analysis the item also related to
the other factors, most strongly to the third. This second factor covers acute
physiological effects of drunkenness together with illegal and potentially dangerous activities
(e.g., driving under the influence).
As expected, Items 14-17 were associated with a third factor, which represented
alcohol-related sexual enhancement expectancies. These expectancies increased with
alcohol consumption but, given their negative loadings on the other two factors,
slightly reduced the socio-emotional and community problems.
The multivariate latent factor model was extended with an explanatory variable
denoted as RR, which indicated whether a student was assigned to the RR (RR=1) or
the DQ condition (RR=0). In addition, an indicator variable was included, which
was set equal to one when the respondent was a female. Both explanatory variables
were used for each factor. The RIRT model was further extended with a multivariate
population model for all factors.
In Table 1.2, the parameter estimates of the three-factor and a two-factor model
are given. For the latter, the loadings of Items 1 and 14 were fixed to identify two
factors, with one factor representing a composite measure of alcohol-related problems
(i.e., socio-emotional and community problems) and the other alcohol-related sexual
enhancement expectancies. A moderate positive correlation of .65 between the two
factors was found.
The students in the RR condition scored significantly higher on both factors. For
the RR group, the average latent scores were .20 and .22 on the composite problem
and the alcohol-related expectancy factors, respectively, but both were equal to zero
for the DQ group. The RR effect was slightly smaller than that of .23 reported by
Fox and Wyrick (2008), who performed a unidimensional RIRT analysis using the
CAPS items only. A comparable effect was found for the AEQ scale. Females and
males showed comparable scores on both factors.
TABLE 1.1
CAPS-EAQ Scale: Weighted factor loadings for the three-component analysis.

Subscale / Items                             Factor 1  Factor 2  Factor 3
Socio-emotional problems
 1  Feeling sad, blue, or depressed            1.00       .00       .00
 2  Nervousness or irritability                1.00       .01      -.03
 3  Hurt another person emotionally             .96       .27       .10
 4  Family problems related to drinking         .82       .56       .14
 6  Badly affected friendship                   .85       .46       .27
 8  Others criticized your behavior             .77       .50       .41
 9  Nausea or vomiting                          .70       .39       .60
Community problems
 5  Spent too much money on drugs               .00      1.00       .00
 7  Hurt another person physically              .48       .84       .26
 10 Drove under the influence                   .43       .74       .53
(Loadings of the remaining items were not recoverable from the source.)
In the three-factor model, with the estimated loadings given in Table 1.1, the
problems associated with drinking were represented by two factors (i.e., socio-
emotional and community problems) and sexual enhancement expectancies by an-
other factor. The randomized response effects were significantly different from zero
for all three factors, while the effect on the factor representing community problems
related to alcohol use was approximately .32. This was slightly higher than the ef-
fects of the other components, which were around .21. It seemed as if the students
were less willing to admit to alcohol-related community problems and gave more
socially desirable responses than for the other factors.
The male students scored significantly higher than the female students on the
factor representing community problems related to alcohol use. That is, male students
reported more alcohol-related community problems than female students did. Positive
expectancies stimulate drinking experiences, which in turn lead to more positive
expectancies. Here, an increased expectancy of sexual enhancement stimulates alcohol
use, which leads to more socio-emotional and community problems.
1.6 Discussion
Response bias is a serious threat to any research that uses self-report measures. Sub-
jects are often not willing to cooperate or to provide honest answers to personal, sen-
sitive questions. The general idea is that by offering confidentiality, respondents will
become more willing to respond truthfully. Warner’s (1965) randomized response
technique was developed to ensure such confidentiality.
Our multivariate extension of the technique still masks the responses to the items
but enables us to estimate item characteristics and measure individual differences
in sensitive behavior. The models can handle both dichotomous and polytomous re-
sponses to measure both unidimensional or multidimensional sensitive constructs. In
the empirical example above, a forced randomized response design was used to col-
lect the data, but other options are available. Our RIRT models are easily adapted to
a specific choice of response design.
In order to improve the cooperation of the respondents, both from an ethical and
professional point of view, they should be informed about the levels of information
that can and cannot be inferred from randomized item responses. The outcome of the
randomization device is only known to the respondent, which protects them at the
level of the individual items.
The randomized response technique also has some disadvantages. The use of a
randomization device makes the procedure more costly, and respondents have to trust
the device. Respondents also have to understand the procedure to recognize and
appreciate the anonymity it guarantees. Recently, Jann, Jerke, and Krumpal (2012),
Tan, Tian, and Tang (2009), and Coutts and Jann (2011) proposed nonrandomized
response techniques to overcome the inadequacies of the randomized response tech-
nique and tested their proposals empirically. The main idea of their so-called trian-
gular and crosswise techniques is to ask respondents a sensitive and a nonsensitive
question and let them indicate whether the answers to the questions are the same
(both ’Yes’ or both ’No’) or different (one ’Yes’ and the other ’No’). Such a joint an-
swer to both questions does not reveal the respondent’s true status. The distribution
of answers to the nonsensitive question has to be known and supports the measure-
ment of the population prevalence on the sensitive question. These nonrandomized
methods are designed to make inferences at an aggregate data level. Extensions are
required to collect multivariate sensitive item responses that will support the mea-
surement of sensitive constructs. In fact, more research is needed to explore the full
potential of nonrandomized response techniques for the analysis of individual sensi-
tive constructs.
TABLE 1.2
CAPS-EAQ scale: Parameter estimates of the two- and three-component randomized
item-response model.

                                      Two Factor       Three Factor
Parameter                             Mean    SD       Mean    SD
Fixed Effects
Socio-Emotional/Community
 γ11 (RR)                              .20     .09      .21     .10
 γ21 (Female)                          .01     .06      .05     .07
Sexual enhancement expectancy
 γ12 (RR)                              .22     .06      .21     .07
 γ22 (Female)                          .03     .04      .06     .05
Community
 γ13 (RR)                                               .32     .10
 γ23 (Female)                                          −.30     .09
Information Criteria
 -2log-likelihood                    20622            19625
Acknowledgement
The author thanks Cheryl Haworth Wyrick for providing the data from the study on
alcohol use and abuse by college students.
References
Anglin, D., Hser, Y., & Chou, C. (1993). Reliability and validity of retrospective
behavioral self-report by narcotics addicts. Evaluation Review, 17, 91-108.
Béguin, A. A., & Glas, C. A. W. (2001). MCMC estimation of multidimensional
IRT models. Psychometrika, 66, 541-562.
Brown, S. A., Christiansen, B. A., & Goldman, A. (1987). The alcohol expectancy
questionnaire: An instrument for the assessment of adolescent and adult alcohol
expectancies. @Author: add journal title@, 48, 483-491.
Clark, S. J. & Desharnais, R. A. (1998). Honest answers to embarrassing questions:
Detecting cheating in the randomized response model. Psychological Methods,
3, 160-168.
Cruyff, M. J. L. F., van den Hout, A., van der Heijden, P. G. M., & Böckenholt, U.
(2007). Log-linear randomized-response models taking self-protective response
behavior into account. Sociological Methods and Research, 36, 266-282.
De Boeck, P., & Wilson, M. (2004). Explanatory item response models: A general-
ized linear and nonlinear approach. New York: Springer.
De Jong, M. G., Pieters, R. & Fox, J.-P. (2010). Reducing social desirability bias
through item randomized response: An application to measure underreported de-
sires. Journal of Marketing Research, 47, 14-27.
Edgell, S. E., Himmelfarb, S., & Duchan, K. L. (1982). Validity of forced responses
in a randomized response model. Sociological Methods and Research, 11, 89-
100.
Fox, J. A., & Tracy, P. E. (1986). Randomized response: A method for sensitive
surveys. London: Sage.
Fox, J.-P. (2005). Randomized item response theory models. Journal of Educational
and Behavioral Statistics, 30, 189-212.
Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications. New
York: Springer.
Fox, J.-P. & Glas, C. A. W. (2001). Bayesian estimation of a multilevel IRT model
using Gibbs sampling. Psychometrika, 66, 269-286.
Fox, J.-P., & Wyrick, C. (2008). A mixed effects randomized item response model.
Journal of Educational and Behavioral Statistics, 33, 389-415.
Geerlings, H., Glas, C. A. W., & van der Linden, W. J. (2011). Modeling rule-based
item generation. Psychometrika, 76, 337-359.
Glas, C. A. W., & Meijer, R. R. (2003). A Bayesian approach to person fit analysis
in item response theory models. Applied Psychological Measurement, 27, 217-
233.
Greenberg, B. G., Abul-Ela, A.-L. A., Simmons, W. R., & Horwitz, D. G. (1969).
The unrelated question randomized response model: Theoretical framework.
Journal of the American Statistical Association, 64, 520-539.
Jann, B., Jerke, J., & Krumpal, I. (2012). Asking sensitive questions using the
crosswise model: An experimental survey measuring plagiarism. Public Opinion
Quarterly, 76, 1-18.
Johnson, V. E. & Albert, J. H. (2001). Ordinal data modeling. New York: Springer.
Levy, R., Mislevy, R. J., & Sinharay, S. (2009). Posterior predictive model check-
ing for multidimensionality in item response theory. Applied Psychological Mea-
surement, 33, 519-537.
O’Hare, T. (1997). Measuring problem drinking in first time offenders: Develop-
ment and validation of the college alcohol problem scale (CAPS). Journal of
Substance Abuse Treatment, 14, 383-387.
Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response
theory models.
Tan, M. T., Tian, G.-L., & Tang, M.-L. (2009). Sample surveys with sensitive ques-
tions: A nonrandomized response approach. The American Statistician, 63, 9-16.
Tourangeau, R., Rips, L. J., & Rasinski, K. (2000). The psychology of survey re-
sponse. New York: Cambridge University Press.
Tourangeau, R., & Smith, T. W. (1996). Asking sensitive questions: The impact of
data collection, question format, and question technique. Public Opinion Quar-
terly, 60, 275-304.
van der Heijden, P. G. M., van Gils, G., Bouts, J., & Hox, J. J. (2000). A compar-
ison of randomized response, computer-assisted self-interview, and face-to-face
direct questioning: Eliciting sensitive information in the context of welfare and
unemployment benefit. Sociological Methods & Research, 28, 505-537.
Warner, S. L. (1965). Randomized response: A survey technique for eliminating
evasive answer bias. Journal of the American Statistical Association, 60, 63-69.