Chapter 5

The Calibration Model


5.0 Chapter Overview

The aim of this chapter is to describe a way of connecting the scored outcomes
that resulted from the items design and the outcome space back to the construct that was
the original inspiration for the items themselves—the way these connect is through the
calibration model1. There have been many calibration models proposed and used in the
last 100 years. In this book, the main approach taken is to explain and to use just one such
model, the Rasch model. Nevertheless, it is useful to know something of the historical
background, as that gives context to the basic ideas and terms used in measurement, and
also because the general ideas that one first has when finding out about a topic will be
influenced by the common general vocabulary possessed by professionals in that area.

Hence, this first section of the chapter will discuss two basic approaches to
measurement, with the aim of demonstrating how the construct modeling approach can be a
way to reconcile them. The second section explains how the calibration model can be
seen as an embodiment of the logic of the construct map and exemplifies how to interpret
the empirical results in terms of the Wright Map, which is discussed in some detail here.
This is followed by a section illustrating these ideas for the case of the PF-10 example
(Example 6). The account is didactic in nature rather than an attempt to present an
exhaustive historical analysis—at the same time as the researchers mentioned below were
working, there were others working on similar ideas and similar approaches—not
discussing them here is not intended to slight their contributions.

Key concepts: item-focused approach to measurement, score-focused approach to measurement, Rasch model, Wright map, item response function (IRF).

5.1 Combining the Two Approaches to Measurement

If you ask a person who is a non-professional in measurement:


“What is the relation between what we want to measure and the item responses?”
their answers will usually be spoken from one of two different viewpoints.
One viewpoint focuses on the items: e.g., in the context of the PF-10, “If a patient says
that their vigorous activities are “limited a lot,” then that means they have less
physical functioning,” or “If someone can’t walk one block then they are clearly
in poor health.”

Footnote 1: Note that the calibration model is often referred to in the psychometrics literature as the “measurement model.” I see this as a misleading label—good measurement requires more than just a statistical model—this is one of the main messages of the BAS (and other Principled Assessment Design approaches such as evidence centered design (ECD) too—see Chapter 9). See further views on this in Mari et al. (2021).

A second point of view will consider ways of summarizing the respondents’ responses:
e.g., “If someone answers ‘limited a lot’ to most of the questions, then they have
poor physical capabilities”, or “A person who scores high on the test is in good
physical health.”
Usually, in this latter case, the idea of a “score” is the same as what people became
accustomed to when they were students in school, where the individual item scores are
added to give a total (which might then be presented as a percentage instead, in which
case, the total score is divided by the maximum score to give the percentage).

These two types of answer are indicative of two different approaches to


measurement frequently expressed by novice measurers. Thus, the first approach focuses
on the items, and their relationship to the construct: I will call this the item-focused
approach. The second approach focuses on the respondents’ scores, and their relationship
to the construct: I will call this the score-focused approach. The two different points of
view have different histories2—a very brief sketch of each is given below.

The item-focused approach. Parts of the history of the item-focused approach have
already been described in foregoing chapters. The item-focused approach was made
formal by Guttman (1944, 1950) in his Guttman Scaling technique as described in the
previous chapter (see Section 4.3.3). From this it should be clear that the item-focused
approach has been the driving force behind the first three building blocks—the construct
map, the items design and the outcome space. However, the story does not end there.
While the logic of Guttman-style items leads to a straightforward relationship between
the two sides of the construct map, as shown in Section 4.3.3, the use of Guttman scales
has been found to be severely compromised by the problem that in practice there are
almost always large numbers of response patterns in the data that do not conform to the
Guttman requirements. For example, here is what Irene Kofsky had to say, drawing on
extensive experience with using the Guttman scaling approach in the area of child development
psychology:
… the scalogram model may not be the most accurate picture of
development, since it is based on the assumption that an individual can be
placed on a continuum at a point that discriminates the exact [emphasis
added] skills he has mastered from those he has never been able to
perform. ... A better way of describing individual growth sequences might
employ probability statements about the likelihood of mastering one task
once another has been or is in the process of being mastered. (Kofsky,
1966, pp. 202-203)
Thus, to successfully integrate the two aspects of the construct map, the issue of response
patterns that are not strictly in the Guttman format must be addressed, and Kofsky gives
us a hint—what about using probability as a “smoothing agent”?

The score-focused approach. The intuitive foundation of the score-focused


approach is what might be called intuitive test theory (Braun & Mislevy, p. 6), where there is an understanding that there needs to be some sort of aggregation of information across the items, but the means of aggregation is either left vague or assumed, on the basis of historical precedent, to be the summation of item scores. This simple score theory is more like a folk theory, but it nevertheless exerts a powerful influence on intuitive interpretations.

Footnote 2: The account below is very much focused on the points I want to make that illuminate the “two approaches” mentioned in the section title. Perforce, these are very much gross summaries of perspectives and debates that have embroiled measurement in the social sciences over the last 150 years. For a very thorough and educative recent account of this history, read Briggs (2021).

The simple score theory approach has been formalized as classical test theory
(CTT—also known as true score theory), and intuitive test theory has effectively been subsumed into
it. The statistical aspects of this approach have their foundation in earlier
statistical ideas from astronomical data analysis, and were laid out by Edgeworth (1888,
1892) and Spearman (1904, 1907) in a series of papers. They borrowed a perspective
from the fledgling statistical science of the time, and posited that an observed score on
the instrument, X, was composed of the sum of a “true score” T, and an “error” E:

X = T + E, (5.1)

where the true score is the “signal” in this context. One way to think of T is as part of a
thought-experiment where it would be a long-term average score that the respondent
would get over many re-takings of the instrument assuming the respondent could be
“brainwashed” to forget all the preceding ones3. The “error” is not seen as something
inherently wrong (as implied by the term itself!) but simply as what is left over after
taking out the true score. Thus, the error is what is not represented by T— in this
approach it is “noise.” Formally it is assumed that (a) the error is normally distributed
with a mean of zero, and (b) that different errors are independent of one another, as well
as of the true score, T (e.g., Finch & French, 2019, pp. 29-42; Nunnally & Bernstein, pp. 215-
247). This was found to explain a phenomenon that had been observed over many
empirical studies: some sets of items seemed to give more consistent results than other
sets of items, specifically, that larger sets of items (e.g., longer tests) tend to give greater
consistency (as embodied in the Spearman-Brown Prophecy Formula; Brown (1910),
Spearman (1910)). The explanation that Spearman found for the phenomenon has to do
with what is called the reliability coefficient, essentially the relation between two forms
of the instrument constructed to be equivalent (this is discussed in Chapter 6). The
introduction of an error term, E, also allows for a quantification of inconsistency in
observed scores4.
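To make the CTT decomposition concrete, here is a minimal simulation sketch in Python (not from the original text; the sample size, true-score distribution, and error standard deviation are made-up values). It treats each item score as a noisy copy of the true score and shows that averaging over more items yields a more reliable observed score, which is the intuition behind the Spearman-Brown result.

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons = 5000

# Hypothetical true scores T; CTT says each observed score is X = T + E.
T = rng.normal(loc=50, scale=10, size=n_persons)

def observed_scores(true_scores, n_items, error_sd=20.0):
    """Average of n_items noisy replications of the true score.

    Each item contributes independent error; averaging over more items
    shrinks the error variance, which is the intuition behind the
    Spearman-Brown result that longer tests are more consistent."""
    errors = rng.normal(0.0, error_sd, size=(len(true_scores), n_items))
    return (true_scores[:, None] + errors).mean(axis=1)

for n_items in (5, 20, 80):
    # Reliability estimated as the squared correlation between X and T
    # (possible here only because the simulation knows T).
    x = observed_scores(T, n_items)
    reliability = np.corrcoef(x, T)[0, 1] ** 2
    print(f"{n_items:3d} items: estimated reliability = {reliability:.2f}")
```

Running this shows reliabilities rising from roughly 0.6 with 5 items toward 0.95 with 80 items, even though each individual item is equally noisy.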

However, these advantages of the instrument-focused approach come at a price:


The items have disappeared from the calibration model—look at Equation 5.1: there are no items present in this equation! If we think of CTT as a model for the scoring, then it is a model that ignores the fundamental architecture of the instrument as being composed of items. Moreover, the use of the raw score as the framing for the true score, T, means that every time we make a change in the item set used (adding or deleting an item, for instance), there is a new scale for the true score! (This can be worked around technically by equating, but it imposes a heavy burden, including gathering large data sets for each new item set.)

Footnote 3: This is similar to another thought experiment that should be familiar to students of statistics—a frequentist definition of a population mean value is that it would be the mean over all possible samples from the population (Efron, 2005)—no need for brainwashing there, as statistics do not remember!

Footnote 4: The error term in the CTT model effectively introduces probability into the situation (although this is not explicit in Equation 5.1). The reader will recall that Kofsky suggested introducing probability as a potential solution for the problem of Guttman scaling. However, the probability here is a different one, being associated with the probability of a score from the whole instrument (i.e., score-focused), whereas Kofsky was referring to the probability of the score for a particular item, which (of course) is the focus of the item-focused approach.

One might ask, with this obvious limitation, why is the use of the CTT approach
so common? There are two main answers to this question. The first is that, since the
pioneering work described in the previous paragraphs, measurement experts have been
developing numerous “work-arounds” that essentially add extra assumptions and data
manipulations into the CTT framework. These allow, among other extensions, ways to
add item-wise perspectives, such as was initially proposed by Spearman and Brown. The
shortcoming of this approach is that just about every such extension makes its own set of
extra assumptions, so that real applications where several extensions will likely be
employed end up with a mess of assumptions that are very seldom considered in the
application. The second, and more problematic answer, is that we in the social sciences
have accustomed ourselves to using a methodological approach that does not make
explicit the connections between the content of the instruments (as usually expressed in a
traditional instrument blueprint) and the statistical model used to analyze the resulting
data. The connection is lost when the item responses are summed up into the raw scores
—there is no ready way to track them back empirically. This is aided and abetted by the
Spearman-Brown formula itself, which tells us that we can get higher reliabilities by
adding more items from the same general topic—hence the measurement developer using
the CTT approach can usually attain a “high enough” reliability just by adding more
generic items without having to worry about the items’ detailed relationships with the
underlying construct.

In contrast, the item-focused approach has been the driving force behind the first
three building blocks. Hence, if we adhered to the framing provided by the classical test
theory approach, the efforts that have been expended on the first three building blocks
might be in vain.

In summary, each of the approaches can be seen to have its virtues:


(a) Guttman scaling focuses attention on the meaningfulness of the results from the
instrument by focusing on the meaningful way that the items might match to the
construct map—i.e., its validity; while
(b) classical test theory models the statistical nature of the scores and focuses attention on
the consistency of the results from the instrument—i.e., what we will define
below as its reliability.
There has been a long history of attempts to reconcile these two approaches. One notable
early approach is that of Louis Thurstone (1925)—he clearly saw the need to have a
measurement model that combined the virtues of both, and sketched out an early solution,
illustrated in Figure 5.1. In this Figure, the curves show the cumulative percentage of
students who got each item correct in a test, plotted against the chronological ages of the
students (in years). To see this, select any item represented by one of the curves. Notice

how the percent of students getting it correct increases with age. Now select a
chronological age. Notice how the percent of students answering correctly differs by
item. Easier items at that age will have a higher percent and more difficult ones will have
a lower percent.

The ordering of the curves (i.e., the curves for the different items) in Figure 5.1 is
essentially the ordering that Guttman was looking for, but with one exception—
Thurstone was using chronological age as a stand-in for the respondent’s location on the
construct map (i.e., as stand-in for the construct itself). Note though, that the fact that
they are curves rather than vertical lines (which would be related to the sort of abrupt
transitions that Guttman envisioned) corresponds to a probabilistic way of thinking about
the relationship between the score and success on this construct, and this can be taken as
a response to Kofsky’s suggestion about using a probability-based approach.
Unfortunately, this early reconciliation remained an isolated inspired moment for many
years. Thurstone also went beyond this to outline a further pair of requirements for a
measurement model:

A measuring instrument must not be seriously affected in its measuring


function by the object of measurement. To the extent that its measuring
function is so affected, the validity of the instrument is impaired or
limited. If a yardstick measured differently because of the fact that it was a
rug, a picture, or a piece of paper that was being measured, then to that
extent the trustworthiness of that yardstick as a measuring device would
be impaired. Within the range of objects for which the measuring
instrument is intended, its function must be independent of the object of
measurement. (Thurstone, 1928, p. 547)

This too was an important contribution—demanding that the scale must function
similarly regardless of the sample being measured.

This observation was generalized by Georg Rasch (1961), who added a


similar requirement for the items:

The comparison between two stimuli [items] should be independent of


which particular individuals [respondents] were instrumental for the
comparison ...Symmetrically, a comparison between two individuals
should be independent of which particular stimuli within the class
considered were instrumental for the comparison. (Rasch, 1961, pp. 331-
332)

He referred to these as requirements for specific objectivity and made that the
fundamental principle of his approach to measurement.

The Bear Assessment System (BAS) approach adopted in this book is intended as
a reconciliation of these two basic historical tendencies. The statistical formulation for
the calibration is founded on the work of Georg Rasch (1960/80) who was the first to

point out the important qualities of the model that bears his name, the Rasch (statistical)
model (which is described in the next section). The practical usefulness of this model for
measuring was extended by Benjamin Wright (1968, 1977) and Gerhard Fischer (see
Fischer & Molenaar (1995) for a thorough summary of Fischer’s contributions)5.

The focus for this book is on developing an introductory understanding of the


purpose and mechanics of a measurement model. With that goal in mind, construct
modeling has been chosen as a good starting point, and the Rasch calibration model has
been chosen due to how well it supports the implementation of the construct map idea.
Note that it is not intended that the measurer will learn all that is needed to appreciate the
wide-ranging debates concerning the respective models and calibration procedures by
merely reading this book—this book is an introduction and the responsible measurer will
need to go further (see Chapters 9 and 10).

Figure 5.1 Thurstone’s graph of student success on specific items from a test versus
chronological age (from Thurstone, 1925, p. 444)

5.2 The Construct Map and the Rasch Calibration Model

The calibration model is the fourth building block in the Bear Assessment System
(BAS). It has already been introduced, lightly, in Chapter 1, and its relationship to the
other building blocks was illustrated there too—see Figure 5.2. In this chapter, it is the
main focus.

Figure 5.2. The four building blocks in the BEAR Assessment System (BAS)
Footnote 5: Other researchers have developed similar lines of research in what is usually termed “item response theory” (IRT), such as Fred Lord (1952, 1980), Alan Birnbaum (see chapter 16 of Lord and Novick (1968), and also Darrell Bock and Lyle Jones (1968), Fumiko Samejima (1969) and Erling B. Andersen (1973) for original accounts). See the next section and Section 6.1 for a brief account of the relationship between the Rasch model and other IRT models.

In metrological terms, calibration is defined as:

An empirical and informational process performed on a measuring instrument for


establishing a functional relation between the local scale of the instrument [i.e.,
the respondents’ item-scores and sum-scores] and a public scale [e.g., the logit
scale for the Wright map in Figure 1.10] (Mari et al., 2021, p. 271)

Thus, the calibration process is effectively a mapping from the local judgements involved
in the coding (scoring) of item responses to the outcome estimates of respondent
locations (on the Wright map). If there were scientific models of this (internal to the
instrument) relationship, then those models could be used to establish the mapping. But
in the social sciences, instead we gather samples of empirical data to help us establish
regularities of the instrumentation. Hence, statistical estimation equations are usually used for this mapping in the social sciences. Thus, I will call these
statistical models calibration models (or sometimes statistical calibration models to
emphasize the statistical nature of their functioning).

Thus, the purpose of the calibration model is to relate the scored data back to the
underlying construct continuum, and hence to the construct map. The special advantages of the Rasch model in implementing a construct map approach are the focus of this section.
The account proceeds first by considering how a construct map and the Rasch model can
be combined, resulting in what was called a "Wright map" in Chapter 1, and then
considers several advantages of doing so. The Rasch model relates to dichotomous items,
that is, items scored into just two categories. In attitude instruments, this might be
“Agree” versus “Disagree,” in achievement testing, this might be “Right” versus
“Wrong,” or, more broadly, in surveys, it might be “Yes” versus “No”—of course, there
are many other dichotomous labels for the categories. In this chapter, without loss of
generality, we will assume that they are scored into “1” and “0”—and that the scores for
item stems that are negatively related to the construct map have been reverse-coded.
There are many situations where dichotomous coding is not suitable, and the Rasch
model can be extended to take advantage of that—this will be explored in the next
chapter.

5.2.1 The Wright Map

The formulation of the Rasch model differs from that of classical test theory in
several critical ways. First, the Rasch model is expressed at both the item level and the instrument level, not just the instrument level, which, as was noted above, is the case for classical test theory. That is, in the CTT equation (Equation 5.1), the total score on the
instrument, X, was expressed in terms of T and E. In contrast, in the Rasch model, it is
the item response for Item i, Xi (pronounced “X-sub i”) that will be modelled. Second, the
Rasch model focuses attention on modeling the probability of the observed responses
rather than on modeling the sum of the responses as is the case for CTT. This second
point is also to be contrasted with the earlier Guttman formulation which was expressed
deterministically—a respondent would change their response immediately upon moving
up the Guttman (ordinal) scale.

Hence, in the Rasch model, the form of the relationship is that the probability of
the item response for item i, Xi, is modeled as a function of the respondent location θ (Greek “theta”) and the item location δi (Greek “delta”), where the location is
conceptualized as being along the common scale of ability (or, attitude, etc., for
respondents) and difficulty (for items). Thus, the concern about the limitations of total
raw scores, mentioned above, is avoided. In achievement and ability applications, the
respondent location will usually be termed the “respondent ability” and the item location
will be termed the item “difficulty.” In attitude applications, these terms are not
appropriate, so terms such as “attitude towards … (something)” or “propensity to
endorse …” (for respondent location) and “item scale value” or “difficulty to agree
with ...” (for item response location) are sometimes used. In order to be neutral to areas
of application, the terms used here in this section are “respondent location” and “item
location”—this is also helpful in reminding the reader that these parameters will have
certain graphical interpretations in terms of the construct map.

To make this more specific, suppose that the item has been scored dichotomously
as “1” or “0” (“Right”/“Wrong,” “Agree”/“Disagree,” etc.), that is Xi = 1 or 0. The logic
of the Rasch model is that the respondent has a certain “amount” of the construct,
indicated by , and that an item also has a certain “amount” of the construct, indicated by
i. But the way the amounts interplay is in opposite directions—hence the difference
between them,  - i, is what counts: one can imagine that the respondent’s amount of q
must be compared to the items amount of d in order to find the probability of a “1”
response (as opposed to a “0” response). We can consider three situations (see Figure
5.3).
In example (a), when the amounts (theta and delta) are equal (e.g., at the same point on the Wright map in Figure 5.3), responses of 0 and 1 are equiprobable, and hence the probability of a response of “1” is 0.50. For instance, for an attitude question, the respondent is equally likely to agree as to disagree with the item; for an achievement question, they are equally likely to get it right as to get it wrong.
In example (b), when the respondent has more of the construct than the item has (i.e., θ > δi), the probability of a “1” is greater than 0.50. Here the respondent is more likely to agree (for an attitude question) or get it right (for an achievement question).
In example (c), when the item has more of the construct than the respondent has (i.e., θ < δi), then the probability of a “1” is less than 0.50. Here the respondent is more likely to disagree (for an attitude question) or get it wrong (for an achievement question).
To reiterate, in the context of achievement testing, for these three examples, we would
say that the “ability” of the respondent is (a) equal to or (b) greater than or (c) less than
the “difficulty” of the item. In the context of attitude measurement, we would say that (a)
the respondent and the statement are equally positive, (b) the respondent is more positive
than the item, and (c) the respondent is more negative than the item. Similar expressions
would be appropriate in other contexts.

Figure 5.3 Representation of three possible relationships between respondent location


and the location of an item

(a) In (a), the item location (delta) is the same as the respondent location (theta), meaning that the respondent’s ability is equal to the item’s difficulty.
(b) In (b), the item location (delta) is lower than the respondent location (theta), meaning that the respondent’s ability is greater than the item’s difficulty.
(c) In (c), the item location (delta) is higher than the respondent location (theta), meaning that the respondent’s ability is less than the item’s difficulty.

Note that these three situations depicted in Figure 5.3—
(a) θ = δi, (b) θ > δi, and (c) θ < δi—
correspond to the relationships
(a) θ − δi = 0, (b) θ − δi > 0, and (c) θ − δi < 0,
respectively. This allows one to think of the relationship between the respondent and
item locations as points on a line, where the difference between them is what matters. It is
just one step beyond that to interpret that the probability of a particular response is a
function of the distance between the respondent and item locations. In the specific case of
the Rasch model, the probability of response Xi = 1 is:

Probability(Xi = 1 | θ, δi) = f(θ − δi)    (5.2)

where f is a function that will be defined in the next few paragraphs, and we have included θ and δi on the left-hand side to emphasize that the probability depends on both.

Graphically, we can picture the relationship between location and probability as in


Figure 5.4: The respondent locations, θ, are plotted on the vertical axis, and the probability of the response “1” is given on the horizontal axis. To make it concrete, it is assumed in Figure 5.4 that the item location is δi = 1.0. Thus, at θ = 1.0, the respondent and item locations are the same, and the probability is 0.50 (check it in the Figure yourself). As the respondent location moves above 1.0, i.e., for θ > 1.0, the probability increases above 0.50; as the respondent location moves below 1.0, i.e., for θ < 1.0, the probability decreases below 0.50. At the extremes, the relationship gets closer and closer to the limits of probability: As the respondent location moves way above 1.0, i.e., for θ >> 1.0, the probability increases to approach 1.0; and as the respondent location moves way below 1.0, i.e., for θ << 1.0, the probability decreases to approach 0.0. We assume
that it never actually reaches these extremes. Mathematically speaking, the curve is
asymptotic to 1.0 at “plus infinity” and asymptotic to 0.0 at “minus infinity.” In the
context of achievement testing, we would say that we can never be 100% sure that the
respondent will get the item right no matter how high her ability; in the context of attitude
measurement, we would say that we can never be 100% sure that the respondent will
agree with the statement no matter how positive her attitude (and similar statements hold
at the lower end).

This type of curve shown in Figure 5.4 is customarily called an item response
function (IRF), because it describes how different respondents (ranging from those with
very low locations to those with very high locations) respond to an item6. Those who
have some experience in this area will perhaps be more familiar with an alternative
orientation to the figure, with the respondent locations shown along the horizontal axis,
as shown in Figure 5.5. The orientation used in Figure 5.4 will be used in many places in
this book, even though it is not the most common, because it corresponds to the
orientation of the construct map, and this will make the development of useful intuitions
easier.
Figure 5.4 Relationship between respondent location (θ) and probability of a response of “1” for an item with difficulty 1.0

Footnote 6: Other common terms for the same curve are “item characteristic curve” (ICC) and “item response curve” (IRC).

Figure 5.5 Figure 5.4 re-oriented so that respondent location is on the horizontal axis

The complete equation for the Rasch model7 is:

Probability(Xi = 1 | θ, δi) = e^(θ − δi) / [1 + e^(θ − δi)]    (5.3)

Notice that although the expression on the right-hand side is somewhat complex,
it is indeed a function of “θ − δi” as in Equation 5.2. This makes it a rather simple model
conceptually, and hence a good starting point for the calibration model. The units in
which the respondent and item locations are measured in this expression are termed
“logits”: the “log of the odds”—see Textbox 5.1 for information on interpreting these
logit units. Remember, the probability of “success” (i.e., Xi =1) in this model is seen as a
function of the difference between the respondent parameter and the item parameter, i.e.,
the difference between the person location and the item location. This is consistent with
the particularly intuitive interpretation on the construct map given above, in Figures 5.4
and 5.5—the difference between a respondent’s location and the item difficulty will
govern the probability that the respondent will make that response. In particular, if the
respondent is above the item difficulty (so that the difference is positive), they are more
than 50% likely to succeed, and if they are below the item difficulty, (so that the
difference is negative), then they are less than 50% likely to succeed.

Footnote 7: The name for the function of θ − δi in Equation 5.3 is the “logistic function.”
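As a computational companion to Equation 5.3, the short sketch below (my illustration; the function name rasch_prob and the example differences are not from the text) evaluates the Rasch item response function for a few respondent-minus-item differences, which can be compared against the logit-to-probability guidelines in Textbox 5.1.

```python
import math

def rasch_prob(theta, delta):
    """Rasch model (Equation 5.3): probability of a response of 1
    for a respondent at theta facing an item at delta (both in logits)."""
    diff = theta - delta
    return math.exp(diff) / (1.0 + math.exp(diff))

# Probabilities for selected respondent-minus-item differences (in logits).
for diff in (-3, -2, -1, 0, 1, 2, 3):
    print(f"theta - delta = {diff:+d} logits -> P(X=1) = {rasch_prob(diff, 0.0):.2f}")
```

A difference of 0 logits gives 0.50, +1 logit gives about 0.73, and −1 logit gives about 0.27, matching the worked examples later in the chapter.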

Figure 5.6 Item response functions for three items

Figure 5.7 A generic Wright map. (Note: “X” = 1 respondent.)

<Insert Textbox 5.1 about here>

Equation 5.3, and the conceptualization of the relationship between the


respondent and the item as a “distance,” allows one to make the connection between the
construct maps used in previous chapters and the equations of this chapter. To gain a
representation of an entire instrument, one would wish to stack the item response
functions for all the items in an instrument on top of one another on the Figure. This has
been done for an additional two items in Figure 5.6. The full information one needs to
relate respondent location to the item can be found in a figure like this. But the problem

is that, for an instrument of the usual length, even 10 or 20 items, the equivalent figure
will become cluttered with item response functions and will be uninterpretable.

Note that in Figure 5.6 the “stacked” curves are all the same shape. This is not a
mere observation, but rather, is directly due to the simplicity of Equation 5.3: the
equations for each of the items differ by just one thing, the δs, and hence the curves are just translated up or down according to the δs (i.e., the item difficulties). Moreover, note that the distances of the translations are the differences between the successive δs.

Thus, in the case of the item response functions (IRFs; as in Figure 5.4) for the
Rasch model, there is a practical solution to how to indicate where the item is on the
construct map. We take advantage of the fact that the curves are all the same shape and
show them on the construct map using only one point for each item—I will call this the
critical point. For example, look at Figure 5.7: three items, indicated by the letters “i,” “j”
and “k” are shown on the map, located on the right-hand side, under “Item Responses.”
We will assume, to facilitate the illustration, that these are three dichotomous
achievement items, scored as “correct” (1) or “incorrect” (0). The letter for each item
shows this critical point where the probability of getting it correct for that item is exactly
0.50. So, for Item i, which the reader will recall had a difficulty of δi = 1.0 from Figure
5.4, the symbol “i” is shown at the “critical” point, to the right of 1.0 logits. To the left of
1.0 logits, the symbol “X” indicates that there is a respondent there at the same point.
Thus, for this respondent (respondent “X”), the probability of getting it correct is 0.50.

For this same item, Item i, we can also interpret the probabilities for the other
respondents—for example, the respondents marked as “Y” are below the location of Item
i, so for them the probability of correctly responding will be less than 0.50, and for the
respondents marked as “Z,” these are above Item i, so the probability of responding correctly will be above 0.50. The exact probabilities depend on the difference between the respondent location, θ, and the item difficulty, δi: θ − δi. This relationship is illustrated
using the curve in Figure 5.4 (and 5.5) for Item i, and it is also shown for selected values
in the Table in Textbox 5.1.

In just the same way, we can interpret differences in locations of items in terms of
probability of responses. For example, considering again the respondent at 1.0 logits
(“X”), the probability of getting Item k correct would be less than 0.50, while the
probability of getting Item j would be greater than 0.50. This combination of the
construct map idea with the Rasch model using these “critical” points, has created a
particularly powerful means of interpreting measurements, which will be evidenced in the
following chapters. This combination is called a Wright Map in this book, in honor of its
creator: Benjamin D. Wright of the University of Chicago (see Wilson (2017) for further
information on this).

There are a number of other features of Figure 5.7 that are worth emphasizing. The central line is marked out in “logits”—as noted above, how they are related to probabilities is explained in Textbox 5.1. On the left-hand side, under “Respondents,” are noted several respondents’ locations—this is in the shape of an on-its-side histogram (with larger numbers of respondents, each symbol may indicate more than one respondent). Under some circumstances, the symbols may be used to distinguish interesting subgroups (such as males (e.g., “M”) and females (e.g., “F”), etc.); under other circumstances, the symbols may be replaced by histogram bars. The logits which mark out the units of the construct are given immediately to the right of the central line. (Note that, in general, the respondents would not necessarily be located at round numbers of logits as they are in Figure 5.7—they have been put there in this hypothetical example to make for simple calculations.) On the right-hand side of Figure 5.7, under “Item Responses,” the locations of the item responses are shown. Note that the items may be denoted in other ways than that indicated here. For example, the label could be changed to something more meaningful, such as an item name, or even colored shapes. Examples that use these different possibilities will be shown below and in the following chapters. In the discussion of Figure 5.7, it was assumed that the context was a dichotomous achievement test: similar interpretations could be made, for example, for a dichotomous attitude scale where, instead of saying the respondent had responded “correctly,” one could have interpreted it as responding “positively” (as opposed to negatively), etc.

Armed with the Rasch model equation (Equation 5.3), the relationship between the logits and the probabilities of response shown in Textbox 5.1 can be made explicit.
For the respondent at 1.0 logits in Figure 5.7 (and 5.4), the probability of responding
correctly on Item i should be, as noted above, 0.50 (because the respondent and the item
are at the same location). Check this by substituting these values into Equation 5.3:

Pr(Xi = 1 | θ = 1.0, δi = 1.0) = e^(θ − δi) / [1 + e^(θ − δi)]
    = e^(1.0 − 1.0) / [1 + e^(1.0 − 1.0)]
    = e^0.0 / (1 + e^0.0)
    = 1 / (1 + 1)
    = 0.50.    (5.4)

Compare this to the graph in Figure 5.4—note where the IRF (item response function) curve intersects the horizontal arrow drawn at θ = 1.0: this is the graphical expression of the equations in Equation 5.4.

Similarly, for the four respondents located at 2.0 logits in Figure 5.7 (“Z”), the probability of a correct response on Item i will be greater than 0.50, because the respondents are above the item. To be exact, the probability will be determined by inserting θ = 2.0 and δi = 1.0 into Equation 5.3:

Pr(Xi = 1 | θ = 2.0, δi = 1.0) = e^(2.0 − 1.0) / [1 + e^(2.0 − 1.0)]
    = e^1.0 / (1 + e^1.0)
    = 2.718 / (1 + 2.718)    (recall that e is approximately 2.718)
    = 0.73.    (5.5)

Again, compare this to the graph in Figure 5.4—this time note where the IRF curve intersects θ = 2.0 logits.

Several other observations can be made, following on from this calculation:

(a) the probability of an incorrect response on Item i for the same respondent would be 1.0 − 0.73 = 0.27 (i.e., because there are only two responses, so the two probabilities must add to 1.0);
(b) note in the calculation (second step) that the probability depends only on the difference between the locations of the respondent and the item, not the specific values of either (hence the results would have been the same for θ = 3.0, δi = 2.0, and θ = −1.0, δi = −2.0, etc.).

Similarly, for the respondents at 0.0 logits, the probability of a correct response will be less than 0.50, because the respondent is lower than Item i. To be exact, the probability will be e^(−1.0)/(1 + e^(−1.0)) = 0.27. In a similar fashion, the probabilities of the respondents at 0.0 making the response “1” to items k and j are 0.10 and 0.50, respectively (the items are at 2.2 and 0.0 logits, respectively). Again, compare this to Figure 5.6—note where the IRF curves for those items intersect θ = 0.0. To summarize, on the Wright Map (vertical) distances relate to probabilities. It turns out that this will be extremely helpful to the measurer in many ways—this will be discussed in the following sections and elaborated in the next two chapters.
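The worked values above (0.50, 0.73, and 0.27 for Item i, and 0.10 and 0.50 for items k and j) can be checked with a few lines of code. This is a sketch that simply re-implements Equation 5.3 using the hypothetical item difficulties from the text (1.0, 2.2, and 0.0 logits); the function and variable names are mine.

```python
import math

def rasch_prob(theta, delta):
    """Equation 5.3: probability of a '1' response (equivalent logistic form)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

item_difficulties = {"i": 1.0, "k": 2.2, "j": 0.0}   # hypothetical deltas from the text

for theta in (1.0, 2.0, 0.0):                        # respondents "X", "Z", and those at 0.0
    probs = {name: rasch_prob(theta, delta) for name, delta in item_difficulties.items()}
    formatted = ", ".join(f"P(item {n}) = {p:.2f}" for n, p in probs.items())
    print(f"respondent at {theta:+.1f} logits: {formatted}")
```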

5.2.2 Modeling the Response Vector

In the previous section, the expression for the probability of a correct response to
an item was shown for the Rasch model. In this section, that will be extended to show the
probability for the entire set of responses by a respondent, which is commonly called the
response vector. Now, in a dichotomous situation, a response vector might include
responses of “0” as well as “1”, so we need to be able to express that too. This is
relatively easy, because the probabilities for Xi being 0 and 1 must sum to 1.0 (i.e.,
because there are only two possible outcomes, 0 and 1, and hence their probabilities must
add to one). Hence, Equation 5.3 implies that
Probability(Xi = 0 | θ, δi) = 1 − e^(θ − δi) / [1 + e^(θ − δi)]
    = 1 / [1 + e^(θ − δi)].    (5.6)

With this understood, the way the Rasch model works at the instrument level can be
made concrete.

The way that the probability of a response vector is calculated in the Rasch model
(and other item response models) is to assume that once you know the respondent’s
location and an item’s parameters, then each item’s information contributes to the
probability of the response vector as though the items were statistically independent, that
is, you just calculate the product of the item probabilities. This is called the conditional
independence assumption. As with many concepts, it becomes clearer when one
considers non-cases. A situation where one would suspect that conditional independence
might not hold is where certain items share a common stimulus material. This is common
in instruments such as reading comprehension tests, where a set of items will ask
comprehension questions about a single reading passage. Note that, if all the items in the
instrument share the same stimulus material, that does not raise the same issues.

As an illustration, consider a case where conditional independence does hold: suppose that the three items used in the previous section were used as an instrument, that a respondent’s response vector was (1, 1, 0) (for items k, i, and j, respectively), and that the respondent is located at θ = 0.0. Then, under the assumption of conditional independence, the probability of this particular response vector would be the product of the three probabilities:

Pr(X = (1, 1, 0) | θ, δk, δi, δj) = Pr(X1 = 1 | θ, δk) Pr(X2 = 1 | θ, δi) Pr(X3 = 0 | θ, δj).

Substituting in Equations 5.3 and 5.6, that becomes

    = [e^(0.0 − 2.2) / (1 + e^(0.0 − 2.2))] × [e^(0.0 − 1.0) / (1 + e^(0.0 − 1.0))] × [1 / (1 + e^(0.0 − 0.0))]
    = [e^(−2.2) / (1 + e^(−2.2))] × [e^(−1.0) / (1 + e^(−1.0))] × [1 / (1 + e^0.0)]
    = (0.10)(0.27)(0.50)
    ≈ 0.013.    (5.7)

As noted above, this is called “conditional independence”8 because all of the probabilities above can only be calculated if you know the appropriate values of θ and δi. That is, the probabilities are “conditional” on knowing the relevant parameters—in this case θ and the δs. This assumption means that the measurer believes that, when calculating the probability of the whole response vector, the probability of each individual item response is simply multiplied by all the others (other ways would be needed if, say, two of these items were in a bundle9).
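The conditional independence assumption is easy to express in code. The sketch below (mine, not part of BASS or any published software) reproduces the calculation in Equation 5.7 for the hypothetical response vector (1, 1, 0) on items k, i, and j and a respondent at θ = 0.0.

```python
import math

def rasch_prob(theta, delta, response):
    """Probability of the observed response (1 or 0) under the Rasch model
    (Equations 5.3 and 5.6)."""
    p_one = math.exp(theta - delta) / (1.0 + math.exp(theta - delta))
    return p_one if response == 1 else 1.0 - p_one

theta = 0.0
items = [("k", 2.2, 1), ("i", 1.0, 1), ("j", 0.0, 0)]   # (label, difficulty, response)

# Conditional independence: the probability of the whole response vector is
# the product of the individual item-response probabilities.
likelihood = 1.0
for label, delta, response in items:
    p = rasch_prob(theta, delta, response)
    print(f"item {label}: P(X={response}) = {p:.3f}")
    likelihood *= p

print(f"P(response vector (1, 1, 0)) = {likelihood:.3f}")
```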

Where do the locations come from? The equations given above for the Rasch
model are not directly solvable for the θs and δs. Therefore, they are estimated using one
of several statistical estimation approaches. The software used in the estimations for this
book is called the Berkeley Assessment System Software (BASS) (Fisher & Wilson,
2019; Wilson, Scalise, & Gochyyev, 2019), and has been chosen because it can
carry out most of the statistical calculations needed in the following chapters. Discussion
of estimation is beyond the scope of this book. Interested readers should consult the
references noted at the end of section 5.1; another useful source on estimation for the
Rasch model is Fischer and Molenaar (1995).
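Although estimation is beyond the scope of this book, a rough sense of where the respondent locations come from can be conveyed by a small sketch. Assuming the item difficulties are already known, a maximum likelihood estimate of θ for one respondent can be found numerically; the code below is a simplified illustration (not the procedure used by BASS), and it would fail for a response vector of all 0s or all 1s, for which no finite maximum exists.

```python
import math

def estimate_theta(responses, deltas, n_iter=20):
    """Maximum likelihood estimate of theta for one respondent,
    given dichotomous responses and known item difficulties (in logits).

    Uses Newton-Raphson: the score function is sum(x_i - p_i) and the
    information is sum(p_i * (1 - p_i))."""
    theta = 0.0
    for _ in range(n_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - d))) for d in deltas]
        score = sum(x - p for x, p in zip(responses, probs))
        info = sum(p * (1.0 - p) for p in probs)
        theta += score / info
    return theta

# Hypothetical example: the three items from the text (deltas 2.2, 1.0, 0.0)
# and the response vector (1, 1, 0).
print(round(estimate_theta([1, 1, 0], [2.2, 1.0, 0.0]), 2))
```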

5.3 The PF-10 Example (Example 6)

The PF-10 scale has already been introduced and described in Chapter 2 above.
As described in Chapter 2, there are 3 categories of response in the PF-10, but it can also
make sense to consider just two, where the first two categories are collapsed together, and
the third is left as it is—this makes the data dichotomous10. Explicitly, this means coding
the responses “Limited a lot” and “Limited a little” to “0”, and recoding “Not limited at
all” as “1.” A large data set of patients’ responses to these items has been collected
(McHorney et al., 1994), and a Wright map calibrated from those data, dichotomized as
described, is displayed in the BASS Wright Map report shown in Figure 5.8. Note that
the full item set for the PF-10 is shown again in Table 5.1 (it was shown previously in
Table 2.1), along with an indication of the mapping of the items to the waypoints given
previously in Section 2.2.6. Table 5.1 also shows the abbreviations that are used in the
Figures and the text for each item.

Footnote 8: Or sometimes it is referred to as “local independence.”

Footnote 9: See Wilson and Adams (1995) for examples of how to calculate a probability in this situation.

Footnote 10: Note that this manipulation is carried out here strictly for illustrative purposes. No general endorsement of the collapsing of categories is intended. See Armstrong & Sloan (1989) and Kaiser and Wilson (2000) for interestingly different perspectives on this tactic.
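A sketch of the recoding just described (the category labels are written out in full here for illustration; the actual layout of the McHorney et al. (1994) data file may differ, and the function name is mine):

```python
# Collapse the three PF-10 response categories into two, as described in the text:
# "Limited a lot" and "Limited a little" become 0, "Not limited at all" becomes 1.
DICHOTOMOUS_CODES = {
    "Limited a lot": 0,
    "Limited a little": 0,
    "Not limited at all": 1,
}

def dichotomize(response_vector):
    """Recode one respondent's PF-10 responses (a list of category labels)."""
    return [DICHOTOMOUS_CODES[response] for response in response_vector]

# Hypothetical respondent:
print(dichotomize(["Limited a lot", "Limited a little", "Not limited at all"]))
# -> [0, 0, 1]
```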

Table 5.1
Items in the PF-10, indicating waypoint bands

Item Number | Item Label | Item

Vigorous Activities
1 | VigAct | Vigorous activities, such as running, lifting heavy objects, participating in strenuous sports

Moderate Activities
2 | ModAct | Moderate activities, such as moving a table, pushing a vacuum cleaner, bowling, or playing golf
3 | Lift | Lifting or carrying groceries
4 | SevStair | Climbing several flights of stairs
6 | Bend | Bending, kneeling, or stooping
7 | WalkMile | Walking more than a mile
8 | WalkBlks | Walking several blocks

Easy Activities
5 | OneStair | Climbing one flight of stairs
9 | WalkOne | Walking one block
10 | Bath | Bathing or dressing yourself

The description in this section of the PF-10 Wright map follows quite closely the
description of the MoV Wright map in Figure 1.11. If the reader is familiar with such
Wright map layouts, then they can skim the portions of the next paragraph offering
guidance on how to “read” the map itself. Nevertheless, the specific results shown in the
Wright map are different, so it is worthwhile to skim through and read the interpretations
in the two paragraphs that follow.

The locations of the PF-10 item thresholds are graphically summarized in the
Wright map in Figure 5.8, simultaneously showing estimates for both the respondents and the
items on the same (logit) scale. (The item threshold estimates are shown in Appendix
5B.) Moving across the columns from left to right in Figure 5.8, one can see the
following.
(a) The logit scale.
(b) A histogram (on its side) of the respondents’ estimated locations, including the
number of respondents represented by each bar of the histogram.
(c) The location of respondents at each raw score.
(d) A set of columns, one for each waypoint on the construct map, with the labels for
each waypoint printed at the bottom. In this case, the measurer has used labels
“L1,” “L2,” and “L3” for the three waypoints of PF-10 shown in Table 5.1. The
location of the threshold for each item within each column is represented by a pair
of symbols, “i.k,” where “i” indicates the item number and “k” specifies the item score, so that, for example, “9.1” is the first threshold location for Item 9—that is, the threshold between the scores 0 and 1. Note that, for this PF-10 data, each item has just one threshold (i.1), as the data have been dichotomized.
(e) A column indicating the bands for each of the waypoints in the construct map.
(f) The logit scale (again, for convenience).
Note also that the legend for the item labels is shown at the bottom: so, for example, Item
9 is named as “SevStair” in the legend.

There are several interesting features of the Wright map.


(a) Comparing this Wright map in Figure 5.8 to the construct map that was shown earlier
in Figure 2.12, we can notice several ways in which they differ. First, this map is
not just a sketch of the idea of the construct, it is an empirical map, based on
respondents’ self-reports.
(b) As noted above, a histogram (on its side) of the responses is shown on the left-hand
side of the map. What is unusual for a histogram is that the spaces between the
bars of the histogram are not evenly spaced. That is because the locations of the
bars are the estimated locations of the respondents11.
(c) There is one bar for each score on the PF-10 12: but some bars are not labelled with
scores, and these, generally, will correspond to response vectors with missing
data. Each bar corresponds to a particular response vector—they are generally not
located at integer values as are raw scores.
(d) The units for the continuous scale are shown on the far left-hand and far right-hand
sides, in the columns labelled “Logits” (see Textbox 5.1). The respondents range
from those that are “less limited” at the top, to those that are “more limited” at the
bottom.
(e) The right-hand side of the map in Figure 5.8 shows the calibrated item locations—
these are the estimated values for the s in Equations 5.3, 5.5, etc.
Tables showing the exact values of the respondent and item locations are included in
Appendix 5A.

Footnote 11: The estimation method used for the respondent locations is maximum likelihood estimation (MLE; see Sijtsma & van der Ark (2020, pp. 189-193)).

Footnote 12: The general form of the MLE algorithm does not provide a finite estimate for a zero or perfect score, but the BASS program applies a commonly used work-around to provide estimates for these scores (Wu et al., 2007, p. 70).

Figure 5.8 The Wright Map for the dichotomized PF-10 instrument

To begin interpreting Figure 5.8, consider first the case of a single item, say, the “SevStair” item mentioned above. This item’s threshold appears in the column labelled L2; corresponding to the items design, it is sensitive in the middle part of the construct map. Consider a respondent located at the same point as the threshold for this item—the histogram bar here is quite small, with 20 patients at the same location. A respondent at this location would be approximately 50% likely to respond “Not limited at all” to the SevStair item, which is somewhat towards the top of the moderate activities band on the construct map.

Notice, for instance, that there is a group of 240 respondents who have a score of
6 and are at almost the same point on the map as the “Bend” item (7.1). This means that
these respondents have approximately13 a 0.50 probability of responding “Not limited at
all” to that item. Noting that “SevStair” (9.1) is about 1 logit (by eye) above this location,
we can see that for those same respondents (i.e., those with a score of 6) the probability
of getting the most positive response to that item is approximately .27 (see discussion
after Equation 5.5 to see where this probability comes from). And, noting that
“WalkBlks” (4.1) is about 1 logit (by eye) below this location, we can say that those
respondents have an approximate probability of 0.73 (i.e., Equation 5.5 again) of giving
the most positive response (“Not limited at all”) to that item. The reader can get a better
feeling for the relationship between logits and these probabilities by using the logits-to-
probability guidelines in Textbox 5.1. Exact probabilities can readily be worked out using
a calculator to implement Equation 5.3 using the patient and item estimates in Appendix
5A.

As was described in Section 1.6.1 for the MoV Wright Map, the measurement
developers applied the construct mapping technique to examine (a) the separation of the
item locations into waypoint bands, (b) the relative ordering and overlaps of those bands,
and (c) how to interpret the match of the bands to the PF-10 construct map (as shown in
Figure 2.12). Now, as mentioned in Chapter 2 (Section 2.3.6), the PF-10 was formed
after an extended period of surveying and winnowing, including calibration of items, so
that one would expect a relatively straightforward implementation of construct mapping
here based on earlier results. Indeed, examining the right-hand side of Figure 5.8, one can
see that the three sets of items (indicated by the waypoint labels L1, L2 and L3) are well-
separated, and are in the expected order. In this case, the banding was accomplished
simply by locating the boundaries of the respective bands (2 boundaries between 3 bands)
at the mid-point between each successive set of items. The results of the construct
mapping are indicated by the two horizontal lines, indicating which band each item and
respondent belongs to.

Thus, one can see that respondents who have scores ranging from 0 to 3 are in the
band where they are likely to report being “not limited at all” to the three easy activities;
respondents with scores ranging from 4 to 8 are in the band where they are likely to
report being “not limited at all” to the six moderate activities; while respondents with

13
These probabilities are approximate because in this exercise, I am illustrating how one would use the map
as a basis for interpretation—for more exact probabilities, use the results in Appendix 5A to get values for
 and .

24
scores above 9 are in the band where they likely to report being “not limited at all” to
“Vigorous Activities.” We see that there are relatively more respondents in each of the
lower two bands than in the highest band, which might well be ascribed to the location
where the survey took place, of patients in and around a hospital. Note also that the item
“Moderate Activities” is located about midway in the “Moderate Activities” band, which
would seem to affirm the labelling of that band. Thus, one can see that the intended
construct map, as embodied in the waypoints, has been empirically matched for this data
set—this will be interpreted as a crucial element of validity evidence in Chapter 8.

5.4 Reporting Measurements

The purpose of the Wright map is to help interpret the locations of the respondents
and the item thresholds on the construct. But the purpose of the instrument (usually) is to
measure the respondents. On the map, each respondent can be located at a particular
point on the logit scale—these will be referred to in this book as the respondent
“locations” or, when emphasizing the statistical aspects of the measurement, their
“estimates.” In Figure 5.8, for example, respondents who scored 1 are located at -3.97
logits14, while those who scored 9 are located at 1.85 logits. As mentioned above, these
logits are often translated into different units before they are communicated to
consumers. Any linear transformation will preserve the probability interpretation of the
logits given above. Some consumers (e.g., parents receiving their student’s test report)
prefer not to have to deal with negative numbers, and some do not like decimal fractions.
Hence a common translation is to make the new mean 500 and the new standard
deviation 100—for most instruments, that will keep all numbers positive, and eliminate
the need for decimals1. All that is needed is a modified version of Table T5.1 that uses the
desired units rather than logits.
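For instance, a sketch of such a translation (assuming the measurer chooses a reporting scale with mean 500 and standard deviation 100, as in the text; the logit values below are placeholders rather than actual PF-10 estimates, and the function name is mine):

```python
from statistics import mean, stdev

def to_reporting_scale(logits, new_mean=500.0, new_sd=100.0):
    """Linearly transform logit estimates to a reporting scale.

    A linear transformation preserves the ordering and the relative
    distances (and hence the probability interpretations) of the logits."""
    m, s = mean(logits), stdev(logits)
    return [new_mean + new_sd * (x - m) / s for x in logits]

# Hypothetical respondent locations in logits:
locations = [-3.97, -0.47, 0.0, 1.85]
print([round(v) for v in to_reporting_scale(locations)])
```

In practice the mean and standard deviation used in the transformation would be fixed from the calibration sample, so that the same conversion is applied to every new respondent.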

5.4.1 Interpretation and Errors

The use of the item threshold locations to help with the interpretation of the
measures has been discussed and illustrated above. This framework for making the
location estimates meaningful is one of the most important features of the construct
modeling approach to measurement—in fact one might even say that it is the purpose of a
construct modeling approach. Further features of the item calibration and the person
measures are described in the following chapters (Chapters 7 and 8—on reliability and
validity)—several of these can also be used to help interpret the meaning of respondents’
locations. For further exemplary maps, see the Examples Archive on the Constructing
Measures website: ??).

Recall that each location is in fact a statistical estimate. That means that it is
subject to a degree of uncertainty. This uncertainty is often characterized using the
standard error of the location—the so-called standard error of measurement (sem). The
standard error of measurement is calculated as a part of the estimation procedure, and its
statistical formulation is discussed in a paper by Raymond Adams and myself (Adams &
14
This estimate, and other respondent estimates are based on the weighted likelihood estimator (WLE)
approach (Adams et al., 2020).

25
Wilson, 1996, p. 156); just as for the estimation algorithm itself, I will not go into detail
about how this is calculated in this book (see later volume). However, one important
statistical feature of the item response model formulation used here is that the standard
errors of measurement are not assumed to have a constant value for all respondents. This
uniform error assumption is an inherent aspect in the Classical Test Theory approach to
measurement, and some, immersed in that approach have found it hard to expand their
thinking beyond this assumption. Nevertheless, it should be seen as a limitation—there is
no a priori reason to expect that the estimates for all respondents would have the same
uncertainty.

In fact, in typical situations, there is a decided pattern to the standard errors that
are calculated, and, in general, they vary depending on the estimated location of the
respondents. For example, see the chart of estimated standard errors for the PF-10
respondents plotted against the respondents’ locations shown in Figure 5.9. What we see
is a “U” shape, which is very commonly observed—the standard errors are largest at the
extremes, and smallest towards the middle of the locations. This pattern is discussed and
explained in Section 7.1. In recognition of this dependence, the standard error of measurement is often denoted as a function of the respondent location, θ: as sem(θ)—i.e., this is the standard error of measurement in the logit (θ) metric.

Figure 5.9 Standard errors of measurement for the PF-10

Consideration of sem(θ) helps the measurer understand how accurate each estimated location is. For example, if a respondent scored, say, a middling value of 6 on the
PF-10, then their location is -0.47 logits and the standard error of the respondent’s
location is 0.78 logits. This is usually interpreted by saying that the measurer is uncertain
about the exact location of the respondent, but that it is centered approximately on -0.47
logits, and distributed around there with an approximately normal or Gaussian
distribution with standard deviation of approximately 0.78 logits. Hence, in this case, the measurer can say that the (approximate) 67% confidence interval is calculated as: -
0.47±0.78, or (-1.25, 0.31) in logits. Alternatively, the 95% confidence interval is
calculated as: -0.47±(1.96 x 0.78) or (-2.00, 1.06) in logits.
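
A minimal sketch of this interval arithmetic, using the location and sem for a raw score of 6 from Table 5A.2 (the ±1 sem interval is the approximate 67% interval, and ±1.96 sem gives the approximate 95% interval):

```python
# Sketch of the interval arithmetic for a respondent scoring 6 on the PF-10
# (location -0.47 logits, sem 0.78 logits, from Table 5A.2).
theta_hat, sem = -0.47, 0.78

ci67 = (theta_hat - sem, theta_hat + sem)                 # approx. 67% interval
ci95 = (theta_hat - 1.96 * sem, theta_hat + 1.96 * sem)   # approx. 95% interval

print(f"67% CI: ({ci67[0]:.2f}, {ci67[1]:.2f}) logits")   # (-1.25, 0.31)
print(f"95% CI: ({ci95[0]:.2f}, {ci95[1]:.2f}) logits")   # (-2.00, 1.06)
```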

Calculation of confidence intervals for respondents can often surprise nascent measurers unfamiliar with the impact of uncertainty on their measurements. For instance,
considering just the 67% confidence interval, the score range is quite wide—it is 1.56
logits, which shows that although the respondent scored 6 on the instrument, the range of
their locations might be anywhere from the equivalent logits for about a score of 5 to 7 (the reader can check this visually in Figure 5.7, or by using the estimates in Table 5A.2).
Alternatively, the 95% confidence interval ranges from a location above a score of 8 to
below a score of 4. Although this is indeed quite wide (it is 40% of the score range), it is
still an improvement on what one would know if one had no data at all about the
respondent. To see this, assume that a person without an observed score must fall
somewhere in the range from the minimum to maximum score—the full range of the
respondent locations is 9.14 logits, so the 95% confidence interval for a person scoring 6 (which is about 3.06 logits wide) is about one third of that range. Hence one could say, with 95% confidence, that the measurer is better off by a factor of approximately three than if she had had no data on the
respondent (i.e., compared to knowing only that it was reasonable to use the instrument
for this person). Of course, this may be an underestimate, as the measurer may not have
reasonably known that the respondent could be expected to fall within the active range of
the instrument. Quality control indices for the instrument, based on the ideas introduced in this paragraph, are given in much greater scope and detail in Chapter 7.
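
The comparison made in this paragraph can be sketched in the same way; the span of 9.14 logits is the difference between the locations for scores 0 and 10 in Table 5A.2.

```python
# Sketch of the comparison above: the width of the 95% interval for a raw
# score of 6, relative to the full span of the locations in Table 5A.2.
sem = 0.78                                   # sem for a raw score of 6
full_range = 3.57 - (-5.57)                  # locations for scores 10 and 0: 9.14 logits

ci95_width = 2 * 1.96 * sem                  # about 3.06 logits
print(f"95% CI width:           {ci95_width:.2f} logits")
print(f"fraction of full range: {ci95_width / full_range:.2f}")  # about one third
print(f"improvement factor:     {full_range / ci95_width:.1f}")  # about 3
```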

Similarly, the item locations shown on the Wright map in Figure 5.7 also have
uncertainty. In typical measurement situations, where there are more respondents than
items, the item standard errors are quite a lot smaller than the respondent standard errors.
For example, the standard error of the item difficulty of VigAct is 0.04 logits. In many
applications, the item standard errors are small enough to ignore when interpreting the
respondent locations. However, it is important to keep in mind that they are estimates
subject to error, just as are the respondent location estimates. One situation that requires
use of the item standard error is the calculation of item fit statistics, discussed in Section
6.3.2.

5.4.2 The PF-10 Example (Example 6), continued

The banding shown on the Wright Map in Figure 5.8 offers considerable help in “going beyond the numbers” when reporting and interpreting the measurements. For example, consider the BASS Group Report shown in Figure 5.10. This Figure shows the estimated locations for a group of 22 respondents; the estimated location is shown by the black dot for each respondent, and the “wings” indicate the standard error. These might, for instance, be a group gathering together for a physical therapy session. The bands for each waypoint are indicated by the colors shown in the legend in the lower part of the Figure (these colors are generated automatically from an internet palette by the BASS system): gray-mauve for Waypoint 1, mauve for Waypoint 2, and magenta for Waypoint 3. The respondents range across Waypoints 1 to 3, and are fairly evenly distributed across all three, with the detail that there is a small
group that is hovering around the boundary between Waypoints 1 and 2 (i.e., respondents 186 and 2451; note that numbers have been substituted for respondents’ names, to preserve anonymity), and one respondent hovering at the higher boundary (i.e.,
respondent 2356). In considering the best choices of activities for these groups, the
trainer can use the information in this Figure to plan out 3 levels of vigorousness for the
activities, and assign respondents to each, but needs to bear in mind that some patients
are not so well-matched to the waypoints, so that extra attention should be paid to them.

One aspect of Figure 5.10 that might surprise some readers is the variation in the lengths of the error bars around the respondent locations, with the smallest bars being in the middle and the largest at the extremes. However, this is consistent with the pattern shown in Figure 5.9, and is hence a re-expression of the point made above (in the previous section) that the standard errors depend on how close the respondent’s location is to the item locations (with the respondents in the middle being, on average, closer, and those at the extremes being, on average, further away).

Figure 5.10 A Group Report for the PF-10

Figure 5.11 The Individual Scores report for Respondent 2659

More detail is provided by the BASS Individual Score reports. For example,
consider the BASS Individual Score report for respondent 2029 shown in Figure 5.11.
This uses the same color-coding as for Figure 5.10 (gray-mauve for Waypoint 1, mauve
for Waypoint 2, and magenta for Waypoint 3), with the addition of an off-white color for the “below Waypoint 1” band. For each item (shown in the rows), the active score possibilities are shown as highlighted rectangles, and the observed category for the respondent is indicated by the use of the corresponding Waypoint’s color rather than light gray. So, for
example, for SevStair, the options are Waypoints 2 and 3, and patient 2029 has responded
at Waypoint 2 (i.e., “Limited a lot” or “Limited a little” in this dichotomized data set).
This respondent was located towards the upper boundary of Waypoint 2 in the Group
Report (Figure 5.10). Looking at Figure 5.11, note that of the 10 rows for items, 8 of the pairs of boxes are colored on the right-hand side (indicating a “1,” or “Not limited at all”). Thus, this respondent scored 8 out of the total of 10 on the PF-10. Their individual item scores are at the maximum possible for all of the items except SevStair and VigAct, so one can interpret that they are performing quite well for a hospital patient, with just some more progress needed to reach a very healthy level of physical performance.

5.5 Resources

The account of the calibration model given here is intended to develop an intuitive understanding of the way that a calibration model, in particular the Rasch model,
functions within the measurement process, and within the process of developing the
measuring instrument. Specifically, the account has taken care to point out and make
explicit the links back into the logic of the development process as embodied in the other
three building blocks. Classic accounts such as those in Wright and Stone (1979) and Wright
and Masters (1981) are based on much the same logic, as is the account in Bond and Fox
(2007).

This is quite different from the description of calibration models, including the
Rasch model, in more standard accounts by psychometricians. Most such accounts
emphasize the statistical nature of the calibration models, making clear the relations
among the several different item response models, and enlarging upon the (crucial) issues
of parameter fit and estimation. These topics will be touched upon in the next chapter and
will be the subject of much greater focus in a follow-on volume to this one. Typical
accounts are given in Lord and Novick (1968, Chapters 16-18), Hambleton et al. (1991),
and Embretson and Reise (2000). A more recent such account is by Sijtsma and van der
Ark (2020).

5.6 Exercises and Activities

(following on from the exercises and activities in Chapters 1-4)

1. Read one of the classical works referred to in this chapter and summarize how it
helped you understand one of the points made in the chapter (or, any other point you
think is interesting).

2. Check that the numerical probabilities mentioned in Section 5.2.1 are accurate using
your own calculator.

3. Using the Wright map in Figure 5.8, estimate approximate probabilities of a selected
score level for several items.

4. Consider the spread of the item difficulties in the Wright map in Figure 5.8. Suppose
that the spread was much narrower—what would be the consequences of that?

5. In the text it was suggested that one might use a reference value other than 0.50 for
calibrating the respondent and items (e.g., 0.80 for mastery learning contexts). Can you
think of an application in your typical application area where this (a value other than
0.50) might be appropriate? What would be the effects of this on the Wright map? (E.g.,
how might the two Wright maps—one for 0.50 and one for the new value you have
chosen—look different?)

6. Using the sample data supplied at the BASS website, calibrate the items and patients in
the PF-10 Example, and generate the Wright map—check to see if it is consistent with
the one shown in Figure 5.8. (Note that in BASS you have options about what is to be
included in the Wright map.)

7. Imagine what you expect the Wright Map will look like for your own instrument—
sketch it out along the lines of Figure 5.7. Are there any specific details of the context
that should be included here?

8. Try calibrating the items and respondents for your own data set and generate the
Wright map corresponding to that. How do the results compare to what you generated in
Question 7?

9. Think through the steps outlined above in the context of developing your own
instrument and write down notes about your plans.

10. Share your plans and progress with others—discuss what you and they are
succeeding on, and what problems have arisen.

Appendix 5A

Results for the PF-10 Dichotomous Analysis

Table 5A.1 Item difficulty estimates

Item    Difficulty    Standard Error
1 -4.28 0.08
2 -3.01 0.07
3 -1.99 0.06
4 -1.32 0.06
5 -1.39 0.06
6 -0.81 0.06
7 -0.46 0.06
8 0.04 0.05
9 0.43 0.05
10 2.37 0.06

Table 5A.2 Standard errors of measurement

Score    Location    sem
0 -5.57 1.79
1 -3.97 1.13
2 -2.97 0.94
3 -2.22 0.84
4 -1.60 0.79
5 -1.03 0.77
6 -0.47 0.78
7 0.13 0.83
8 0.84 0.93
9 1.85 1.15
10 3.57 1.84

Textbox 5.1
Making sense of logits

To get a feel for how the logits are related to probabilities, look at Table T5.1, which has been calculated using Equation 5.3. To use the Table, first find the difference between the respondent location (θ) and the item location (δ)—that is the column labelled “θ − δ” in the Table—then check the probability on the right-hand side. If the values in the Table aren’t convenient, use a calculator on your phone or laptop to calculate the probability directly, using Equation 5.3.
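
For readers who prefer to script the calculation rather than use a hand calculator, here is a minimal sketch that assumes Equation 5.3 is the Rasch probability e^(θ − δ)/(1 + e^(θ − δ)), as in the derivation below; it prints the probability, the odds (Equation T5.1), and the log-odds for a few values of θ − δ.

```python
# Sketch of computing Table T5.1-style values from Equation 5.3: probability,
# odds (Equation T5.1), and log-odds (the "logit") for a given theta - delta.
import math

def rasch(diff):
    """diff is theta - delta, in logits."""
    p = math.exp(diff) / (1.0 + math.exp(diff))   # Equation 5.3
    odds = p / (1.0 - p)                          # ratio of the two probabilities
    return p, odds, math.log(odds)                # the log-odds recovers diff

for diff in (-2.0, -1.0, 0.0, 1.0, 2.0):
    p, odds, logit = rasch(diff)
    print(f"theta - delta = {diff:5.1f}   P = {p:.2f}   odds = {odds:.2f}   logit = {logit:.1f}")
```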

Table T5.1
Logit differences and probabilities
for the Rasch model

To see why the units for θ − δ are interpreted as the log of the odds, recall that the odds are given by the ratio of the probabilities:

\[ \text{Odds}(X_i = 1 \mid \theta, \delta_i) = \frac{\text{Probability}(X_i = 1 \mid \theta, \delta_i)}{\text{Probability}(X_i = 0 \mid \theta, \delta_i)}. \tag{T5.1} \]

Then, using Equation 5.3 and the fact that the two probabilities in Equation T5.1 must add to one (i.e., because Xi can only be 0 or 1; the derivation of the probability for Xi = 0 is left to the reader),

\[ \text{Odds}(X_i = 1 \mid \theta, \delta_i) = \frac{e^{(\theta - \delta_i)}}{1 + e^{(\theta - \delta_i)}} \bigg/ \frac{1}{1 + e^{(\theta - \delta_i)}} = e^{(\theta - \delta_i)}. \]

Hence the log of the odds, referred to as the “logit,” is

\[ \log\left(\text{Odds}(X_i = 1 \mid \theta, \delta_i)\right) = \log\left(e^{(\theta - \delta_i)}\right) = \theta - \delta_i. \]
