Chapter 7

Trustworthiness, Precision and Reliability


7.0 Chapter Overview

This chapter begins with an overview of the topic of trustworthiness in measurement. The
rest of this chapter and the whole of the next chapter are both devoted to describing the sorts of investigations that need to be carried out to try to establish trust in one's measurements. The
topic of trust is articulated via the reduction in the uncertainty associated with measurement—
under the assumption that one can never have complete certainty. Then, the important question
is: how can this uncertainty be reduced to a point where it does not interfere with the use of the
measurement?

The aim of the remainder of this chapter is to describe how to investigate "[t]he degree to which test scores [are] … dependable and consistent for an individual test taker" (AERA/APA/NCME, 2014, p. 222). That is, is the expected noisiness in the measurements of the respondents acceptable—i.e., is the measurement error small enough? Traditionally, measurement texts have cast this as being represented by measurement error and instrument reliability; further, it has been seen as a quality of the instrument separate from validity (i.e., does it measure what it is intended to measure?). But here reliability and validity are seen as integral parts of the argument for the trustworthiness of measurement. This chapter covers the issues related to measurement error and reliability, which affect several components of validity (the latter is the focus of the next chapter).

Key concepts: trustworthiness, object-relatedness, intersubjectivity, precision, trueness, accuracy, uncertainty, measurement error, validity, standard error of measurement, reliability, internal consistency reliability, test-retest reliability, alternate forms reliability, inter-rater reliability.

7.1 Trustworthiness in Measurement


What is required of the measurement process in order for it to be trusted?

The approach to measurement laid out in this book is based on an attempt to attain
trustworthiness in the process and the products of measurement. As mentioned in Chapter 1, we
see this trustworthiness as being composed of two important facets: object relatedness and
subject independence. These have been defined as follows in Mari et al. (2021):

Object relatedness (“objectivity” for short) is the extent to which the conveyed
information is about the property under measurement and nothing else. ...

Subject independence (“intersubjectivity” for short) refers to the goal that the conveyed
information be interpretable in the same way by different persons in different places and
times. (pp. 233-4)

The development process described in Chapters 1 through 6 has been designed such that it is
possible to build both objectivity and intersubjectivity into the instrument and its results—in fact,
not just the possibility, but also the reasonable likelihood that this will be the case. Now, having
gathered data about the behavior of the instrument, the measurement developer has the
opportunity (and the challenge) to investigate the extent to which the development process has
been successful in establishing the trustworthiness of the measurement. This investigative
process is the topic for this chapter and the next.

The trustworthiness of the measurement and the results are related to features of the
measurement instrument used to produce the measurement information. In the physical sciences,
these features are usually divided into two distinct parts: (a) the precision of the instrument, which is its ability to produce measured values that are close to one another in a situation where the measurements are replicated (Mari et al., 2021); and (b) the trueness of the instrument, which, in the presence of accepted reference values, is the closeness of agreement between the average value of a large series of measured values and an accepted reference value (Mari et al., 2021). In the physical sciences, these two together determine the accuracy of the instrument (Footnote 1) (Mari et al., 2021).

In social science measurement, these aspects of instrument quality are classified somewhat differently, and we will follow that different method of organization in this book, as it is so predominant in our field. First, the above-mentioned aspect of instrument "precision"
from the physical sciences is matched reasonably closely by the concepts of measurement error
and reliability in the social sciences, and precision has also recently been promoted as a useful
term in the most recent “Standards” (AERA/APA/NCME, 2014). Measurement error has
already been discussed and exemplified above in Section 5.4.1. (If the contents of that section are
no longer fresh in the reader's mind, then it would probably be valuable for the reader to look over that section at this point.) Reliability is an alternative way to express measurement error, as
a scale-free summary statistic. These concepts will be expanded upon in the sections below, and,
in addition, the summary statistic of reliability will be a point of focus.

Second, the label "trueness" is not used in the social sciences; rather, the analogue of trueness is conceptualized in a broader way and has its own label—"validity." Thus, validity will be defined and discussed in Chapter 8.

Now, returning to the topic of measurement error, the account below starts off by
addressing the effects of the estimation of the outcomes of measurement through the statistical
calibration of the data resulting from administering the instrument to a sample of respondents.
The first focus is on the uncertainty that then pertains to the measurement of an individual
respondent—commonly labelled as the measurement error. The second focus is on how to
assemble the measurement error results into a single index (reliability) that is taken to indicate
overall results regarding uncertainty based on the sample. This is followed by a focus on a
Footnote 1: Threats to accuracy are detailed through a variety of uncertainties, including measurand definitional uncertainty, unit definitional uncertainty, interaction uncertainty, calibration uncertainty, instrumental uncertainty and target uncertainty (Mari et al., 2021). These can be used to design a very interesting and informative series of investigations into instrument accuracy, but they will not be pursued here, as their relationship to the traditional aspects of measurement quality in psychosocial measurement is, as yet, not fully articulated.

special circumstance that does indeed distinguish social science measurements from physical
science measurement—the necessity for human raters to be a part of the measurement process in
many contexts.

7.2 Measurement Error—Precision


How does one deal with random variations in individual respondent measurements?

In developing/revealing a construct, and realizing it through an instrument, the measurer has assumed that each respondent who might be measured indeed has some amount of that construct. This is what was symbolized by θ in Chapters 5 and 6, the respondent's location on the left-hand side of the Wright map. The measurer is also assuming that that amount is measurable to a sufficient degree of accuracy to be useful. Now, when a respondent gives a set of responses (i.e., a response vector), and that response vector is scored, there will be many influences on those scores besides θ itself. All of these influences together mean that the "estimated θ," labeled as θ^, will differ from the underlying θ for an individual. That difference is the measurement error—let's label it ε (i.e., Greek e, "epsilon"). Then we can write:

\hat{\theta} = \theta + \varepsilon,  (7.1)

which is analogous to the expression "X = T + E" from classical test theory. It is tempting to then write

\varepsilon = \hat{\theta} - \theta,

and say that we are "done," but that will not work, as, in practice, we never do know θ, and hence this expression will not help us to calculate ε. Nevertheless, it is a conceptually helpful move to think of error (Footnote 2) as expressed in Equation 7.1, even though it is implicitly confounded as a definition.

There are many possible sources of measurement error, and each of them invariably
generates both noise and bias. Here we are primarily interested in the noise (the random variation), but in fact, the definition of error inherent in Equation 7.1 means that the noise and the
bias are inevitably confounded for a particular respondent. Some potential sources of such
variation are in the following list:
(a) there are influences associated with the individual respondent, such as their interest in the
topic of the instrument, their mood, and/or their health;
(b) there are influences associated with the conditions under which the instrument is being
responded to, such as the temperature of the room, the noisiness of the environment, and
the time of day;

Footnote 2: Linguistic note: there is nothing inherently wrong with these "errors" (even though the term "error" would seem to imply that). This phenomenon is a normal and expected part of measuring—the term "error" is a jargon term originating in statistics, particularly in the sense of prediction (i.e., in statistical regression). Perhaps it is better to think of it as similar to the idea of a "residual" (cf. Equation 6.14), so that here this residual is what would be left after taking into account the (unknown) value of θ. Note that this is inherently tricky—just as in Equation 7.1, we can only know one term, θ^. Even though it is to be expected that there will always be such "error," the measurer does want to avoid having a lot of this sort of variation in the results.

(c) there are influences associated with the specifics of the instrument itself, such as the selection
of items, and the style of presentation; and
(d) there are influences associated with scoring, such as the training of the raters, and the
variability of the raters.
There is no exhaustive and final way to classify all these potential errors—that is their nature,
because they are, by definition, whatever is not being modeled, and hence are not completely
classifiable. Nevertheless, investigating their influence is very important, as an instrument with
little or no consistency across the different conditions mentioned above will generally not be
useful no matter how sound are the other parts of the argument for its validity.

One way to conceptualize measurement error is to carry out a thought experiment, sometimes called the "brainwashing analogy":
(a) Imagine that the respondent forgets that they have responded to each item immediately after making a response (that's the "brainwashing"),
(b) repeat this for all the items in the instrument, and estimate the person's location from that set of responses, then
(c) repeat this for the whole set of potential instruments (sampling across the universe of items) and obtain their estimates under all possible combinations of the varied conditions, such as those listed in the previous paragraph.
(d) One could then take the mean across all these estimates (i.e., the many θ^s) as giving the best location for that respondent—i.e., θ̄^ (where the bar over the "θ^" indicates that it is a long-term average, though still a fiction in this thought experiment).
(e) Then, the variance of the distribution of the observed locations, the variance of the θ^s, would be the variance of the measurement errors ε.
(f) Note that the difference θ̄^ − θ^ would be an estimate of the bias for any given θ^, but the total bias is still not available, as θ is still not known.
It is, of course, unlikely that a respondent would actually forget their responses. This is just a thought experiment. Nevertheless, it provides one way to interpret θ and ε and their relationship to θ^ in Equation 7.1.
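The thought experiment can also be mimicked in a small simulation. The sketch below is illustrative only: it assumes a dichotomous Rasch model with made-up item difficulties (not the PF-10 calibration, which is polytomous), repeatedly generates "brainwashed" response vectors for a single respondent with a known θ, re-estimates θ^ each time, and then inspects the long-run mean and variance of the θ^s.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)

# Hypothetical item difficulties (logits) -- illustrative only, not the PF-10 calibration.
deltas = np.array([-2.5, -1.5, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
theta_true = -1.0   # the respondent's location (unknowable in practice)

def simulate_responses(theta, deltas, rng):
    """One 'brainwashed' administration: independent dichotomous Rasch responses."""
    p = 1.0 / (1.0 + np.exp(-(theta - deltas)))
    return (rng.random(deltas.size) < p).astype(int)

def mle_theta(responses, deltas):
    """Maximum-likelihood estimate of theta from one response vector (interior scores only)."""
    score = responses.sum()
    if score == 0 or score == deltas.size:
        return np.nan   # extreme scores have no finite MLE; ignored in this sketch
    # Solve: observed score = sum of expected item scores.
    f = lambda t: score - np.sum(1.0 / (1.0 + np.exp(-(t - deltas))))
    return brentq(f, -10, 10)

theta_hats = np.array([mle_theta(simulate_responses(theta_true, deltas, rng), deltas)
                       for _ in range(5000)])
theta_hats = theta_hats[~np.isnan(theta_hats)]

print("mean of the theta-hats (long-run average):", round(float(theta_hats.mean()), 2))
print("variance of the theta-hats (error variance):", round(float(theta_hats.var()), 2))
```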

As each location (θ^) is a statistical estimate, it is subject to a degree of statistical estimation uncertainty. This uncertainty is usually characterized using the standard error of the estimated location—the so-called standard error of measurement (sem(θ^)) (Footnote 3). This value gives the measurer an indication of how precise each estimate is. For example, if a respondent scored 13 on the PF-10, then their location is -1.31 logits (see Table 7A.1) and the standard error of the respondent's location is 0.59 logits. This is usually interpreted by saying that the measurer is uncertain about the exact location of the respondent, but that it is centered approximately on -1.31 logits, and distributed around there with an approximately normal or Gaussian distribution (Footnote 4)

Footnote 3: Jargon warning: In CTT, the same term is used (sem), but it is somewhat differently understood, as it is assumed to be a constant. In the item response model estimation procedure, it varies (e.g., see Figure 7.2). Some people committed to the CTT perspective want to emphasize this difference and hence refer to the sem we will be using here as the "conditional" sem ("csem"). To me this seems more complicated than it needs to be, so I will stick to calling both sem, but add in the term for the estimated location, to indicate that I see it as varying.
Footnote 4: In truth, there is no evidence that this should be modeled as any specific distribution. However, as the concept of this distribution is, in most social science applications, really just a hypothesis, the usual practice is to use a distribution that is familiar to most people.

with a standard deviation of 0.59 logits. Hence, in this case, the measurer can say that the (approximate) 67% confidence interval is -1.31 ± 1.0*0.59 logits, or (-1.90, -0.72). Alternatively, the 95% confidence interval is -1.31 ± 1.96*0.59 logits = (-2.47, -0.15).
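The interval arithmetic can be checked directly. This minimal sketch simply plugs in the estimate and sem quoted above (from Table 7A.1) for a score of 13:

```python
# Confidence intervals for a respondent scoring 13 on the PF-10(poly),
# using the estimate and sem from Table 7A.1.
theta_hat = -1.31   # logits
sem = 0.59          # logits

ci67 = (theta_hat - 1.0 * sem, theta_hat + 1.0 * sem)     # approx. 67% interval (plus/minus 1 sem)
ci95 = (theta_hat - 1.96 * sem, theta_hat + 1.96 * sem)   # approx. 95% interval (plus/minus 1.96 sem)

print(f"67% CI: ({ci67[0]:.2f}, {ci67[1]:.2f})")   # -> (-1.90, -0.72)
print(f"95% CI: ({ci95[0]:.2f}, {ci95[1]:.2f})")   # -> (-2.47, -0.15)
```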

Figure 7.1 A BASS report for a respondent with a score of 13 on the PF-10(poly).

See Figure 7.1 for an illustration of this for a respondent with a score of 13 on the PF-10(poly)—the black dot indicates the estimated location, and the 67% confidence interval (CI) (Footnote 5) is indicated by the "wings" around the black dot. Note that many readers will likely be more familiar with use of a 95% confidence interval, but the 67% interval is also often used in individual measurement contexts and in other application contexts in the social sciences. One way to motivate this is to note that (if all assumptions are true) a 67% confidence interval is twice as likely to include θ as not.

If the goal of the measurer were to classify/interpret people using the PF-10(poly) waypoints, then this particular respondent's classification/interpretation is somewhat uncertain. In contrast, a different person at a different location might have a confidence interval that falls entirely within a single zone. Thus, improving sem(θ^) would lead to fewer uncertain classifications, but there would always be some locations that span two adjacent bands no matter the precision of the instrument. The confidence interval can also be compared to the width of the bands in the Wright map. This is a somewhat limited comparison in this instance, as there is only one finite band, L2—in other circumstances, with more waypoints, there would be more bands to compare with the confidence interval. The L2 band is 2.81 logits wide, which compares favorably with the 67% confidence interval for this patient (1.18 logits), and this comparison is clearly seen in Figure 7.1. It can also be expressed as a ratio: for this respondent, it is 2.38 (i.e., the number of respondent CIs that fit within the L2 band). If the ratio were, say, 1.0, then only a patient right in the middle of L2 would clearly be interpreted as being at that waypoint.

The 95% confidence interval would be about twice as wide (Footnote 6) as the one illustrated in Figure 7.1. This second range is quite wide—it is 2.32 logits, showing that although the respondent scored 13 on the instrument, the range of their locations might be anywhere from the equivalent logits for a score of about 10 to a score of about 16 (Footnote 7). This can be checked by looking
Footnote 5: The use of 67% confidence intervals in educational measurement, and educational assessment in particular, is not to be encouraged. It's standard practice (and that is why BASS makes it available, although the 95% confidence interval is also available), but it is generally not used elsewhere. Its historical use is likely due to educational test publishers being uncomfortable with how large 95% confidence intervals are. But that's just the common limitation of social science measurement – the precision is usually uncomfortably poor because social scientists are stuck with relatively small numbers of items. It's better to use the 95% confidence intervals and not inadvertently mislead non-technical people into thinking the precision is better than it is. (That is why the next paragraph utilizes the longer confidence intervals.)
Footnote 6: Assuming a normal or Gaussian distribution, 1.96 is the more accurate multiplier here.
Footnote 7: In comparison, the 67% confidence interval around a score of 13 ranges approximately from score 11 to 14.

into Table 7A.1. Despite this increased uncertainty, it is important to note that it is still an
improvement on not having any data at all about the respondent. To see this, assume that a
person without an observed score must fall somewhere in the range from the minimum to
maximum score (i.e., from a score of 0 to a score of 20). Since the full range of the respondent
locations is 10.29 logits, the 95% confidence interval for a respondent scoring 13 is about (2.32/10.29 =) 23% of that range. Hence one could say, with 95% confidence, that knowing that
a respondent had scored 13, the measurer would be better off by a factor of about 4 than if they
had had no data on the respondent (i.e., compared to knowing only that it was reasonable to use
the instrument for this person). Looking back at the L2 band, as in the previous paragraph, the
ratio (i.e., number of respondent CIs in the L2 band) is 1.21—a less clear interpretation except
for respondents in the middle of L2.

Alternatively, a visual representation of the range can also be gained by noting the estimated locations for scores of 10 and 16 in the Wright map for the PF-10(poly) in
Figure 6.6. In this Figure, one can add some extra interpretation to the 95% confidence interval
by noting that it ranges approximately from the location of the first threshold for SevStairs to
approximately the location of the second threshold for WalkMile.

Similarly, the item locations also have a standard error. In typical measurement situations, the item standard errors are much smaller than the respondent standard errors (i.e., because there are almost always more respondents in the sample than there are items in the instrument). For example, the standard
error of the first item step of OneStair is 0.10 (see Table 6A.2). In many applications, the item
standard errors are small enough that one can ignore them when interpreting the respondent
locations. However, it is important to keep in mind that they are also estimates and hence
subject to error, just as are the respondent location estimates. One situation that does require the
use of the item standard error is the management of linkage across different forms, which is a
topic beyond the scope of this book (for a summary see Finch & French, 2019, pp. 370-373).

The sem(θ^) varies systematically depending on the respondent's location. This is displayed for the PF-10 example in Figure 7.2. The relationship is typically a "U"-shape as it is here, with the minimum near the mean of the item thresholds, and the value increasing towards the extremes. One way to understand this pattern is to use the following rule-of-thumb: The closer the respondent is to an item, the more the item can contribute to the estimation of the respondent's location (Footnote 8). Now, apply this general rule to the situation for a typical instrument like the PF-10 (as shown in the Wright map in Figure 6.5). The respondents in the middle will always have more items near them than those at the extremes, hence the sem(θ^) will tend to be smaller in the middle than at either of the extremes. Note that this logic applies whenever the item thresholds are distributed somewhere between a bell-shaped and a uniform distribution over the θ^ locations. In contrast, if the item threshold distribution is more complicated, then the pattern may be different. For example, if the item threshold distribution is bi-modal (i.e., twin-
pattern may be different. For example, if the item threshold distribution is bi-modal (i.e., twin-
peaked), with a large distance between those two modes, then the relationship between the

Footnote 8: The full justification of this rule is beyond the scope of this volume, but a hint for this can be seen by looking back at the IRF in Figure 5.4 and noting that the steeper the tangent to the IRF is at a particular point, the more the item can contribute to finding the respondent's location, and that the IRF is steepest at the item's location (i.e., where the probability of response is 0.50).

respondent location and the sem(θ^ ) will be more like an inverted “U”! As it is usually presented,
classical test theory assumes that the equivalent graph of Figure 7.2 would be a horizontal
straight line—and this would perhaps be a somewhat reasonable approximation towards the
center of the score distribution (say, approximately -4.0 logits to 2.0 logits), but would be quite
misleading beyond that.
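As an illustration of this point, the sketch below computes sem(θ^) over a grid of locations for two hypothetical dichotomous Rasch item sets (the difficulties are made up, not the PF-10 thresholds), using the information relationship that is formalized in Equations 7.2-7.4 below: a clustered set gives the usual "U," while a widely separated bimodal set gives locally the opposite pattern.

```python
import numpy as np

def rasch_sem(theta, deltas):
    """sem at theta for a set of dichotomous Rasch items: 1/sqrt(sum of item informations)."""
    p = 1.0 / (1.0 + np.exp(-(theta - deltas)))
    return 1.0 / np.sqrt(np.sum(p * (1.0 - p)))

grid = np.linspace(-4, 4, 9)

# Hypothetical item difficulty sets (logits) -- illustrative, not the PF-10 thresholds.
clustered = np.linspace(-2, 2, 10)                                # spread around the middle
bimodal = np.concatenate([np.full(5, -3.0), np.full(5, 3.0)])     # two widely separated clumps

for label, deltas in [("clustered", clustered), ("bimodal", bimodal)]:
    sems = [round(float(rasch_sem(t, deltas)), 2) for t in grid]
    print(label, sems)
# The clustered set shows the usual "U" (smallest sem near the middle); the bimodal set
# shows larger sem between the two clumps than near them (locally an inverted "U").
```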

An interestingly different insight can be gained by plotting instead the values of sem(θ^ )
for respondents with response vectors that include missing items. The resulting graph is shown
in Figure 7.3. In this figure, the curve shown in Figure 7.2 can still be seen as the lower edge of
the dots, but there are lots of dots above that curve—these are the sem(θ^ )s for the response
vectors with missing data. Thus, sem(θ^ ) also varies systematically (and inversely) with the
number of items that have missing data. By definition, they all have higher sem(θ^ )s than the
non-missing data response vectors (i.e., because there are fewer items and hence fewer pieces of
information available for them). Note that some respondents (near the middle of the θ^ range) have sem(θ^)s that are four times larger than those for respondents with complete response vectors (i.e., full information). Not only do these cases have larger sem(θ^)s, but in addition, they can seriously affect overall summaries of error, which are discussed in the next section. In general, the classical test theory assumption of a constant sem ignores this issue.

Figure 7.2 The standard errors of measurement for the PF-10(poly)—with no missing data (each
dot represents a score).

Figure 7.3 The standard error of measurement for the PF-10(poly)—including missing data (each
dot represents a unique response vector).

Another way to express this relationship between the location estimates and the error is to use the concept of information (abbreviated as inf(θ^)), which is the reciprocal of the square of sem(θ^) (Lord, 1980, p. 71):

\mathrm{inf}(\hat{\theta}) = 1/\mathrm{sem}(\hat{\theta})^2.  (7.2)

The equivalent graph to Figure 7.2 for the information is shown in Figure 7.4 (Footnote 9). Clearly, in and of itself, there is no "added value" for this index over sem(θ^), but the interesting point is that the information can be used to calculate the sem(θ^) for hypothetical instruments based on sets of pre-calibrated items (Footnote 10). This use capitalizes on the feature that the information for the whole instrument is the sum of the information for each item (Lord, 1980, p. 71):

\mathrm{inf}(\hat{\theta}) = \sum_{i=1}^{I} \mathrm{inf}_i(\hat{\theta}),  (7.3)

where inf_i(θ^) is the contribution to the information of item i. The information for item i depends on the calibration model. For the Rasch model (dichotomous data), the expression is (Sijtsma & van der Ark, 2020, p. 194):

\mathrm{inf}_i(\hat{\theta}) = \frac{\exp(\hat{\theta} - \delta_i)}{[1 + \exp(\hat{\theta} - \delta_i)]^2}.  (7.4)

Thus, for pre-calibrated items, these three equations can be used to calculate respondent standard errors for any combination of the items. For prediction purposes, so long as one has at least a reasonable sample of items already calibrated, one can use Equation 7.3 to calculate a "typical" item information contribution (i.e., as the mean per item of the instrument information),

\overline{\mathrm{inf}}(\hat{\theta}) = \mathrm{inf}(\hat{\theta})/I,  (7.5)

so that a predicted sem(θ^) for a "typical" item can then be calculated via Equation 7.2:

\mathrm{sem}(\hat{\theta}) = [\sqrt{\mathrm{inf}(\hat{\theta})}]^{-1}.  (7.6)

Footnote 9: The equivalent of Figure 7.3 is not provided—the reader may make this figure for themselves in BASS.
Footnote 10: In this usage, it is the functional equivalent of the Spearman-Brown prophecy formula commonly used in CTT (e.g., Cronbach, 1990; Nunnally & Bernstein, 1994).
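A minimal sketch of Equations 7.2 through 7.6 for dichotomous Rasch items is given below. The item difficulties are hypothetical placeholders, and note that the PF-10 itself is polytomous, for which the item information expression differs from Equation 7.4.

```python
import numpy as np

def item_info(theta_hat, delta_i):
    """Equation 7.4: information for a dichotomous Rasch item at theta_hat."""
    e = np.exp(theta_hat - delta_i)
    return e / (1.0 + e) ** 2

def test_info(theta_hat, deltas):
    """Equation 7.3: the test information is the sum of the item informations."""
    return sum(item_info(theta_hat, d) for d in deltas)

def sem_from_info(info):
    """Equations 7.2 / 7.6: sem(theta_hat) = 1 / sqrt(information)."""
    return 1.0 / np.sqrt(info)

# Hypothetical pre-calibrated item difficulties (logits) -- illustrative values only.
deltas = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
theta_hat = 0.0

total = test_info(theta_hat, deltas)
typical = total / len(deltas)                     # Equation 7.5: mean information per item
print("test information:", round(float(total), 2))
print("sem at theta_hat:", round(float(sem_from_info(total)), 2))
# Predicted sem for a hypothetical longer form built from twice as many "typical" items:
print("predicted sem, 2x items:", round(float(sem_from_info(2 * len(deltas) * typical)), 2))
```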

Figure 7.4 The information amounts for the PF-10(poly), with no missing data (each dot
represents a score).

Graphs of the standard errors of measurement and the information like those shown in
Figures 7.2 and 7.4 are useful in (re-)designing an instrument. In the case of the PF-10 scale,
Figure 7.2 shows that the most sensitive part of the instrument is approximately from –5.0 to
+1.0 logits (i.e., assuming a context where a standard error of approximately 0.75 logits is sufficiently precise). One way to evaluate whether this range of most sensitive measurement is appropriate for the intended usage is to consider the items design—does this range of most sensitive measurements correspond to the range of the item thresholds? This is a qualitative interpretation of the range that was (perhaps implicitly) intended by the instrument developers. In the case of the PF-10, looking back at the Wright map (Figure 6.5), this corresponds to approximately the range of all the first thresholds (0 vs. 1&2) for all the items except for one (i.e., Bath) and all of
the second thresholds too (0&1 vs. 2). Thus, the instrument’s range of sensitivity does make
general sense with respect to the item response categories.

A second way to evaluate the range of sensitivity is by comparing it to the distribution of the respondents. Examining the Wright map (Figure 6.5) one can see that many respondents in
this sample are estimated to be above 2.0 logits, hence the instrument is not functioning
optimally for a reasonably large proportion of this sample. Of course, this interpretation depends
on the ultimate purpose of the instrument. If it is to be used on similar samples to this one, then it
probably should be augmented with some more items up at the VigAct end. If it is intended for a
sample that is generally lower in physical functioning than the current sample (e.g., say, among

patients at a hospital), then the current set of items will likely suffice. If the measurer wanted to
look carefully at a sample with very low functioning, then it would be best to add new items at
the low end (near Bath). If an important decision has to be made at a certain cutoff point on the
Wright map (e.g., whether a patient should be considered for placement into a rehabilitation program in the case of the PF-10), the design should focus on having many items around that cut-off point.

The shape of the graph is not the only important feature of Figures 7.2 and 7.4: so too is
the average height of the curve. By implementing strategies to change that height (i.e., down for Figure 7.2 and up for Figure 7.4), one can increase consistency (i.e., lower sem(θ^) and higher
inf(θ^ ), respectively). The most general way to accomplish this is to increase the number of items
(assuming they are of a similar nature as the existing ones). This will almost always decrease the
sem(θ^ ). The only situation where the measurer might expect this to result in less consistency
would be if the new items were of a very diverse nature (i.e., the items were designed to address
distinct constructs using different response modes). One useful way to roughly approximate the
hypothetical effect of adding similar items is to use Equations 7.2 and 7.5 in a slightly different
way than was shown above. The steps are:
(a) choose a location that makes a convenient reference point (e.g., the point with the
lowest sem(θ^ )),
(b) use Equations 7.2 and 7.5 to estimate the information contribution of a typical item in
the current instrument,
(c) adjust the predicted whole instrument information estimate using Equation 7.5, and
(d) convert that back to the standard error of measurement using Equation 7.2.

Using the PF-10 as an example, pretend that the measurer wished to know how much the sem(θ^) could be reduced by tripling the number of items from 10 to 30. The minimum standard
error of measurement is 0.56 (hard to judge from Figure 7.2 but see Table 7A.1 for precise
values), so the maximum test information is 3.19 based on the existing set of 10 items. Thus, the
typical information contribution by an item is 0.32 (=3.19/10). Hence, for 30 “similar” items, the
maximum information would be approximately 9.60 (=30×0.32). Then the minimum standard
error of measurement would be predicted to be approximately 0.32 logits, or a bit more than a
half (0.56 logits) of the current minimum. Because of the nature of the relationship, there will
generally be “diminishing returns” on investment in administering more items (as is the case in
CTT where the Spearman-Brown prophecy formula gives the same pattern of prediction). In this
case tripling the number of items is predicted to cut the sem(θ^ ) to about a half of what it was
originally (as opposed to decreasing it to a third).
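The arithmetic of this worked example can be reproduced in a few lines; the 0.56 value is the minimum sem from Table 7A.1, and everything else follows from Equations 7.2, 7.5 and 7.6:

```python
import math

# Predict the minimum sem if the PF-10's 10 items were tripled to 30 "similar" items.
min_sem_10 = 0.56                          # minimum sem for the current 10 items (Table 7A.1)
max_info_10 = 1 / min_sem_10**2            # Equation 7.2 -> about 3.19
info_per_item = max_info_10 / 10           # Equation 7.5 -> about 0.32
max_info_30 = 30 * info_per_item           # about 9.60
min_sem_30 = 1 / math.sqrt(max_info_30)    # Equation 7.6 -> about 0.32 logits

print(round(max_info_10, 2), round(info_per_item, 2), round(max_info_30, 2), round(min_sem_30, 2))
# Tripling the items roughly halves the minimum sem rather than cutting it to a third.
```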

A second way to decrease sem(θ^) is to increase the standardization of the conditions under which the instrument is delivered. The likelihood of increasing consistency with this strategy, which is historically quite a common tactic (Footnote 11), must be balanced against the possibility of
decreasing the validity of the instrument by narrowing the descriptive and construct-reference
components of the items design. A notorious example of the perils of this strategy arose in the
area of writing assessment. Here it was discovered that one could increase the consistency of
scores on a writing test by adding multiple choice items at the expense of decreasing the actual
Footnote 11: The use of a uniform set of options across all items, such as in a Likert-style survey (e.g., "Strongly Agree" to "Strongly Disagree"), is an example of this.

writing that respondents did. The logical conclusion of that observation is to eliminate the
writing from the writing test and use only multiple-choice items. This was indeed what
happened—at one point there were prominent writing tests that had no request for writing in
them whatsoever (Behizadeh & Engelhard, 2011)! The response from educators who teach writing was one of horror—students could pass those tests without even writing a single
sentence (Yancey, 1999)! After considerable debate, the situation has swung back to a point
where some writing tests now include only a single essay and deliver only a single score, which
risks taking a student’s measure on a sample of topics of size one. Unfortunately, this is not a
good situation either—the best resolution lies in finding balance among the competing validity
demands—as discussed in the next chapter.

7.3 Summaries of Measurement Error

To develop quality-control indices to help with the overall evaluation of the calibration uncertainty of an instrument, the traditional approach has been to compare the measurements of the respondents in different ways. These are usually referred to as reliability
coefficients, and commonly when discussing this issue, one refers to the “reliability” of an
instrument (Footnote 12). There are several ways that one can conceptualize such comparisons, for example,
in terms of:
(a) the consistency of the item responses across the set of items in the instrument—the internal
consistency perspective;
(b) the consistency over repeated administrations of the instrument—the test-retest perspective;
and
(c) the consistency over different versions of the instrument (i.e., over different subsets of the items, also known as different "forms" of the instrument)—the alternate forms perspective.
The various summaries of measurement error are outlined in Table 7.1. Another consistency-
related issue that may arise is the consistency between raters—this is discussed in Section 7.4.

7.3.1 Internal Consistency Coefficients

The consistency coefficients described in this section are termed internal consistency
coefficients. This is because the basis for their calculation is the information about variability
that is contained in the data from a single administration of the instrument—they focus on the
question of how consistently the respondents respond to the set of items in the instrument. In
considering this question, one quickly realizes that there is a practical question of how to
conceptualize the comparisons, as there are many such comparisons—each pairwise comparison
among items, each triple against each other triple of items, and many more. Historically, this
troubled measurement experts, and the resolution of this dilemma is one of the important
achievements of CTT (Briggs, 2022). What was shown was that, with the simplified assumptions
of CTT, the mean correlation of all such comparisons can be expressed using a specific formula

Footnote 12: Note that in the current issue of the "Standards" (AERA/APA/NCME, 2014), the concept of reliability is matched to precision. In my reading of the metrology literature, the specific estimators used for precision are more akin to standard error estimators, and hence I have linked these two together rather than what is done by the authors of the "Standards."

which can also be thought of as the proportion of variance in the observed item scores accounted
for by the respondent’s raw score (Kuder & Richardson, 1937). This “variance explained”
formulation is familiar to many through its use in analysis of variance and regression methods in
statistics. The formulae, as commonly used in CTT, are the well-known Kuder-Richardson 20 and 21 (Kuder & Richardson, 1937) for dichotomous responses and
coefficient alpha (Cronbach, 1951) for polytomous responses. The same logic as used there is
also directly applicable in the item response model approach adopted here: The difference is that
the formulae are applied to the respondent estimates rather than to the respondent raw scores.

Thus, the separation reliability (Wright & Masters, 1981, p. 106), rIC, is the proportion of the variance explained by the model compared to the total variance. Here the numerator (i.e., the variance explained by the model) is indirectly calculated as the total variance, Var(θ^), minus the variance accounted for by the errors, Var(ε). To obtain this, note first that the observed total variance of the estimated locations, Var(θ^), is given by the usual formula:

\mathrm{Var}(\hat{\theta}) = \frac{1}{N-1} \sum_{n=1}^{N} (\hat{\theta}_n - \bar{\hat{\theta}})^2,  (7.7)

where N is the number of respondents, θ^_n is the estimated location for respondent n, and θ̄^ is the mean of the estimated locations over all the respondents. Second, the variance accounted for by the errors can be calculated as the mean square of the standard errors of measurement, sem(θ^):

\mathrm{MSE} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{sem}^2(\hat{\theta}_n).  (7.8)

Thus, the proportion of variance accounted for by the model, rIC, is given by

r_{IC} = \frac{\mathrm{Var}(\hat{\theta}) - \mathrm{MSE}}{\mathrm{Var}(\hat{\theta})}.  (7.9)

In the PF-10 example, the total variance (of the θ^s) is calculated to be 4.47, and the mean square error is calculated to be 0.67. Thus, for the PF-10 example, the variance explained by the model works out to be 3.79, which gives an internal consistency reliability coefficient of 0.85 for the PF-10 scale. Note that this is not the only way to calculate a reliability estimate for these data: other possibilities will be discussed in Chapter 9.
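A minimal sketch of this calculation is given below; the function assumes one already has the estimated locations and their sems from a calibration, and the final line simply reproduces the PF-10 summary values quoted above:

```python
import numpy as np

def separation_reliability(theta_hats, sems):
    """r_IC = (Var(theta_hat) - MSE) / Var(theta_hat), as in Equations 7.7-7.9."""
    theta_hats = np.asarray(theta_hats, dtype=float)
    sems = np.asarray(sems, dtype=float)
    var_total = theta_hats.var(ddof=1)      # Equation 7.7 (divides by N - 1)
    mse = np.mean(sems ** 2)                # Equation 7.8
    return (var_total - mse) / var_total    # Equation 7.9

# Plugging in the PF-10 summary values reported above (total variance 4.47, MSE 0.67)
# reproduces the coefficient of about 0.85:
print(round((4.47 - 0.67) / 4.47, 2))   # -> 0.85
```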

Of course, one must then interpret this obtained value of 0.85. Despite their almost
universal use (and similar to the case for most “variance explained” indexes), there is no
immediate criterion for what is acceptable or not. It is certainly true that a value of 0.85 is better
than 0.80, but not as good as 0.90. But, at what point should one accept or reject the use of the
instrument? There are “industry standards” in some areas of application. For example, the
Department of Education of the State of California at one point endorsed a reliability coefficient
of 0.90 as a minimum for achievement tests used in schools for the testing of individual students,
but this level has not been consistently applied, even in California state testing! One reason that

it is difficult to set a single uniform acceptable standard is that instruments are used for multiple
purposes. A better approach is to consider each type of application individually and develop
specific standards based on the context. For example, the importance of having a high level of
reliability is somewhat less in areas such as basic research, where the focus is usually on the
results for groups rather than on the result for an individual. Indeed, Jum Nunnally and Ira
Bernstein (1994) have recommended that a reliability of 0.70 would be sufficient for “early
stages” of research, and that 0.80 would be sufficient for “basic research.” According to these
standards, the reliability of the PF-10 would be quite acceptable. However, they go on to
recommend a “bare minimum” of 0.90 for important decisions about individuals, and that 0.95 is
“more desirable.” According to these standards, the PF-10 does not make the cut for this use.
This is a good illustration of how standards must be sensitive to usage.

7.3.2 Test-Retest Coefficients

As described in Section 7.2, there are many sources of measurement error that lie outside
a single administration of an instrument. Each such source could be the basis for calculating a
different reliability coefficient. One type of coefficient that is commonly used is the test-retest
reliability coefficient. In a test-retest reliability coefficient, the measurer first arranges to have
the same respondents give responses to the exact same questions twice, then the reliability
coefficient is calculated simply as the correlation between the two sets of estimated respondent
locations. In the classical test theory approach, the same approach is applied to the raw scores.

Bearing in mind the “brainwashing” analogy, the test and retest should be so far apart that
it is reasonable to assume that the respondents are not answering the second time by
remembering the first but that the respondents are genuinely responding to each item anew (that
is, it should be reasonable to assume that a sort of natural “brainwashing” has taken place).
Thus, one would expect that the further apart in time the two measurements occur, the better the estimation of reliability will be, from this point of view. However, as the aim is to
investigate variation in the locations due to the instrument rather than due to real change in
respondent’s locations, the measurements should be close enough together for it to be reasonable
to assume that there has been little real change. Hence, estimating this sort of reliability depends
on being able to come up with a “goldilocks” time difference, and judging where such a sweet-
spot might be for any given construct may be quite difficult. In addition, it may be difficult to
ever achieve forgetfulness for some sorts of complex items, which may be quite memorable (Footnote 13).
Hence one can conclude that this form of the reliability coefficient will work better in a situation
where a stable construct is being measured with forgettable items, as compared to a less stable
construct being measured with memorable items.
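Computationally, the coefficient itself is just the correlation between the two sets of estimated respondent locations; the sketch below uses hypothetical values for illustration:

```python
import numpy as np

# Test-retest reliability: correlate the estimated locations from the two administrations
# of the same instrument to the same respondents (values here are hypothetical).
theta_time1 = np.array([-1.3, 0.2, 1.1, -0.5, 2.0, 0.7])
theta_time2 = np.array([-1.0, 0.4, 0.9, -0.8, 1.7, 1.0])

r_test_retest = np.corrcoef(theta_time1, theta_time2)[0, 1]
print(round(r_test_retest, 2))
```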

7.3.3 Alternate Forms Coefficients

Another type of reliability coefficient is the alternate forms reliability coefficient. With
this coefficient, the measurer arranges to develop two sets of items for the instrument, each
following the same series of steps through the 4 building blocks as in Chapters 2 through 5 in

Footnote 13: For example, I can still recall details of some items that were used on a test I took in 10th Grade—it was designed to assess students for a scholarship. Of course, that may indicate only my own odd school-boy interests, but perhaps also the artfulness of the item writers.

this volume (and, as much as possible, matching the pairs of steps together). The two alternate
copies of the instrument are administered and calibrated, then the two sets of estimated
respondent locations are correlated to produce the alternate forms reliability coefficient. When
equated using the linking technique mentioned in Chapter 9, the validity results can be visually
compared using a pair of Wright maps.

This approach can be used for more than just calculating a reliability coefficient. For
example, it can be useful as a way to check that the use of the four building blocks in Chapters 2
through 5 has indeed resulted in an instrument that represents the construct in a content-stable
way (i.e., what one might call “developmental” validity evidence—see Chapter 8).

7.3.4 Other Reliability Coefficients

Other classical consistency indices have also been developed within CTT applications,
and they have their equivalents in the BAS approach. For example, the so-called “split-halves”
reliability coefficient is a sort of “lazy-man’s” alternate forms coefficient. Here the instrument is
split into two different (non-intersecting) but similar parts, with the advice being most often
given to choose the odd items (by their order) versus the even items. (Of course, by the logic
given above, it would make more sense to use what was known about the nature of the items in
order to make the two sets as alike as possible.) Then the correlation between the respondent
estimates from the two subsets of items is used as a reliability coefficient. This coefficient needs
adjustment with a factor that attempts to predict what the reliability would be if there were the
same number of items as in the original version (i.e., twice as many items as in each half). For a
CTT context, this adjustment would be based on the Spearman-Brown formula (which some
might still use in the current context), but indeed, a better approach would be to use the technique based on the information values, as described above in Section 7.2.
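A minimal sketch of the classical split-halves procedure with the Spearman-Brown adjustment is shown below; the half-test estimates are hypothetical, and, as noted above, the information-based technique of Section 7.2 would be the preferred adjustment in the present approach:

```python
import numpy as np

# Split-halves reliability: correlate the respondent estimates from the two halves,
# then adjust to full instrument length with the Spearman-Brown formula (a CTT-style step).
theta_odd = np.array([-1.2, 0.3, 1.0, -0.4, 1.8, 0.6])    # hypothetical half-test estimates
theta_even = np.array([-0.9, 0.1, 1.3, -0.6, 1.5, 0.9])

r_half = np.corrcoef(theta_odd, theta_even)[0, 1]
r_full = 2 * r_half / (1 + r_half)    # Spearman-Brown adjustment to the full-length instrument
print(round(r_half, 2), round(r_full, 2))
```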

The reliability coefficients described in this section are calculated separately, and the
results will be quite useful for understanding the consistency of the instrument’s measures for
each of the different circumstances. In practice, such influences will occur simultaneously, and
it would be better to have ways of investigating the influences simultaneously also. Such
methods have indeed been developed, including:
(a) generalizability theory (e.g., Shavelson & Webb, 1991) is an expansion of the analysis of
variance approach mentioned above;
(b) facets analysis (Linacre, 1991; Sijtsma & van der Ark, 2020), also known as the linear logistic test model (LLTM; Fisher, 2005), is an expansion of the item response modeling
approach that is the focus of this volume, and
(c) the generalizability approach and the item response modeling approach can also be combined
(Choi & Wilson, 2019).

There have been other indexes developed that summarize the amount of error in an
instrument that are not based on correlation coefficients (or proportion of variance accounted
for), but rather on the concept of how the signal relates to the noise (Shannon, 1949). One way
to think of this is as attempting to estimate how many distinct locations the instrument could identify within the operational range of the instrument—the sense of "distinct" being based on the idea that their confidence intervals should not overlap. One might calculate this as a

proportion by comparing the confidence interval around a typical respondent with the range of
the estimated respondent locations. Let us use the PF-10 as an example. As noted above, its
minimum standard error is 0.56 logits. Using a 67% confidence interval—1.12 logits—and an estimated respondent range of 10.29, the ratio is 9.19 (=10.29/1.12). Interpreted in terms of "distinct" values, there are, at best, about 9 different locations (Footnote 14). Certainly, a tougher standard
would be to consider 95% confidence intervals. Here the width of the confidence intervals would
be 2.20 logits, which would translate to about 4 to 5 different locations (i.e., 4.69 to be more
accurate).
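The "distinct locations" arithmetic for the PF-10 can be laid out as follows, using the minimum sem and respondent range quoted above:

```python
# How many "distinct" locations can the instrument support?
# (Non-overlapping confidence intervals across the estimated respondent range.)
min_sem = 0.56         # minimum sem for the PF-10 (Table 7A.1)
theta_range = 10.29    # range of estimated respondent locations (logits)

ci67_width = 2 * 1.0 * min_sem     # 1.12 logits
ci95_width = 2 * 1.96 * min_sem    # about 2.20 logits

print(round(theta_range / ci67_width, 2))   # -> about 9.19 distinct locations (at best)
print(round(theta_range / ci95_width, 2))   # -> about 4.69 distinct locations (at best)
```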

It is also interesting to consider these results in terms of the construct map for the PF-10
construct. There are three waypoints postulated (see Table 5.2), so this is comfortably within the
realm of possibility even at the 95% confidence level. In a case where there were fewer such
distinct locations than waypoints, this result would have to constitute a fairly strong criticism of
the success of the instrument development project. Of course, there are several complicating
factors that could be brought into this calculation to increase the sophistication of the index, such
as using some sort of weighting according to how many respondents are at different locations, as
well as the different standard errors at those locations. In recognition of this, an alternative based
on a normal (Gaussian) distribution of respondents was proposed by Wright and Stone (1979).

Footnote 14: "At best" because the minimum standard error value was used for this calculation.

Table 7.1
Outline of various measurement error summaries (aka reliability coefficients)
_______________________________________________________________________

Name                          Function
_______________________________________________________________________

Internal Consistency coefficients

Kuder-Richardson 20/21        Used in the CTT approach (raw scores).
                              Used for dichotomous responses.

Cronbach's Alpha              Used in the CTT approach (raw scores).
                              Used for polytomous responses.

Separation Reliability        Used in the item response modeling approach (estimated scores).
                              Used for dichotomous and polytomous responses.

Test-Retest coefficient

Test-retest correlation       Used when respondents are measured twice.

Alternate Forms coefficient

Alternate forms correlation   Used when there are two sets of items designed to be similar
                              and which have a similar structure.

Inter-Rater Consistency coefficient

Exact agreement proportion    Used when a rater's scores are compared to a reference score.
_______________________________________________________________________

7.4 Inter-Rater Consistency
How can one examine the consistency of raters?

Where the respondents' responses are to be scored by raters, another source of uncertainty occurs—inconsistencies between the raters. There are many forms that rater
inconsistency can take, such as:
(a) there are raters who do not fully absorb the training, and hence never apply the scoring
guides in a correct way;
(b) there are differences in rater severity, that is, some raters will tend to score the same
responses higher or lower than others (i.e., some raters are more lenient and some are
more harsh);
(c) there are differences in raters’ use of the score categories, such as raters who use the extremes
more often than others, or not as often as others, as well as more complex patterns of
score category usage;
(d) there are raters who exhibit “halo effects”, that is, their scores are affected by recent scores
they have given (usually to the same respondent);
(e) there are raters who drift in their severity, their tendency to use extreme scores, etc.; and
(f) there are raters who are inconsistent with themselves for a variety of reasons.

The most important way to reduce rater inconsistency is to institute a program of sound
rater training and also a monitoring system that helps both the administrators and the raters
themselves know that they are keeping on track. A good training program will include the
following:
(i) background information on the concepts involved in the construct itself, which in the situation
described in this volume would include the sorts of construct map documentation
included in the examples in this volume,
(ii) material ready at-hand to support the raters’ judgements, including sample items with
annotated sample responses (and especially, of course, the specific items that the raters
will rate),
(iii) an opportunity for the raters to examine and rate a large number and wide range of
responses, including both examples that are clearly within a category, and examples that
are not (sometimes known as “fence-sitters”),
(iv) opportunities for the raters to discuss their ratings on specific pieces of work, and
justifications for those ratings, with their fellow raters and with other experts, including
professionals experienced in the content area associated with the construct (e.g., teachers
in the case of educational applications);
(v) systematic feedback to the raters telling them how well they are rating pre-judged responses and responses that have been rated by other raters; and
(vi) a system of rater calibration steps that either results in a rater being accepted as "calibrated," or passes them back for further training and/or support.

Although a system like that just described will constitute a sound foundation for raters, it
has been found that they can soon drift away from even a very sound beginning (see, for
example, Wilson & Case, 2000). To deal with this problem, it is important to have a monitoring
program in place also. There are essentially three ways to monitor the work of the raters:
(i) scatter pre-judged responses randomly among them;

18
(ii) re-rate (by experts) some of their ratings; and
(iii) compare the records of each rater's ratings to the ratings of all of the raters.
Discussing these in any detail is beyond the scope of this volume. See, for example, Wilson and
Case (2000) for some specific procedures. A good general reference for the sorts of systems
mentioned in this paragraph, as well as the extension of the BAS approach to incorporate raters
into the calibration model in terms of harshness, is Engelhard and Wind (2018).

Once the ratings have been made, they need to be summarized in ways that help the
measurer see how consistent the raters have been. There are ways to carry this out using the
construct modeling approach (see, for example, Wilson and Case, 2000; Engelhard and Wind, 2018), and also by using generalizability theory (see, for example, Shavelson & Webb, 1991), but these methods are beyond the scope of this volume, so more elementary methods are described
here. To apply these more elementary methods, the first step is to gather a sample of ratings
based on the same responses for the raters under investigation. Then they are either (a)
compared to the ratings of an expert (or panel of experts), or, when that is not available, (b)
compared to the mean ratings for the group. In either case, these comparison ratings will be
referred to as the “reference” ratings.

A comprehensive way to display the results of this is shown in Table 7.2. In this
hypothetical example, there are four score levels possible. The ratings for the raters are specified
(i.e., 0 to 3) in the first column, and the reference ratings are displayed at the heads of the next
four columns. The number of cases of each possible pair is recorded in the main body of the
table, where nst is the number of responses scored s by the raters and t by the reference rating.
The appropriate marginals are also recorded and labelled in the Table using “.” to indicate
whether the row or column (or both) are summed. This table can be assembled based on just a
single rater’s ratings or the ratings of a set of raters—but the former is interesting only when the
single rater has made sufficient ratings for which reference ratings exist. A directly interpretable
index of agreement based on this table is the proportion of exact agreement—the proportion of
responses in the leading diagonal of entries n_ss:

\text{Proportion exact} = \frac{1}{n_{\cdot\cdot}} \sum_{s=0}^{3} n_{ss}.  (7.10)

In cases where one wants to control for the possibility that the matching scores might have arisen by chance, a better index, called "Cohen's kappa," is available (Cohen, 1960). A less
rigorous index of agreement is the proportion of responses in the same or adjacent categories.
This is not recommended when the number of categories is small (as is the case in Table 7.2), as
it can lead to over-positive interpretations. The table can also be examined for various patterns:
(a) asymmetry of the diagonals would indicate differences in severity, and (b) relatively larger or
smaller numbers at either end could indicate a tendency to the extremes or the middle. The table
can also be examined with chi-square methods or log-linear analysis (see, for example, Agresti,
1984) to test for independence and other patterns. Note that a correlation coefficient can be a misleading way to examine the consistency between ratings, as the standardization that is a part
of it will potentially disguise differences in harshness between the raters.

Table 7.2
Layout of data for checking rater consistency.

Raters'              Reference Ratings
Ratings      0       1       2       3       Total
0            n_00    n_01    n_02    n_03    n_0∙
1            n_10    n_11    n_12    n_13    n_1∙
2            n_20    n_21    n_22    n_23    n_2∙
3            n_30    n_31    n_32    n_33    n_3∙
Total        n_∙0    n_∙1    n_∙2    n_∙3    n_∙∙
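Given a count table laid out like Table 7.2, the proportion of exact agreement and Cohen's kappa can be computed directly; the counts below are hypothetical, for illustration only:

```python
import numpy as np

# Rater-consistency indices from a count table laid out like Table 7.2
# (rows = raters' scores 0-3, columns = reference ratings 0-3; counts are hypothetical).
counts = np.array([
    [20,  4,  1,  0],
    [ 3, 25,  5,  1],
    [ 0,  6, 22,  4],
    [ 0,  1,  3, 15],
])

n_total = counts.sum()
p_exact = np.trace(counts) / n_total    # Equation 7.10: proportion of exact agreement

# Cohen's kappa: exact agreement corrected for the agreement expected by chance.
row_marg = counts.sum(axis=1) / n_total
col_marg = counts.sum(axis=0) / n_total
p_chance = np.sum(row_marg * col_marg)
kappa = (p_exact - p_chance) / (1 - p_chance)

print(round(float(p_exact), 2), round(float(kappa), 2))
```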

7.5 Resources

The topic of trustworthiness is a central idea in the account of measurement philosophy in Mari et al. (2021), which gives a very comprehensive background in the foundational concepts of measurement that underlie this book.

Further discussion of the interpretation of errors under the item response modeling
approach can be found in Lord (1980), Wright and Stone (1979) and Wright and Masters (1981).
The BASS software will report standard errors for estimates based on raw scores (for respondents with no missing data; e.g., see Table 7A.1) and for individual respondents (see Appendix A).

A well-rounded introduction to the CTT perspective on measurement error and reliability can be found in Cronbach (1990). Included there are numerical examples of the correlation-based reliability coefficients such as internal consistency, test-retest and alternate forms, as well as an explanation of how to calculate a correlation coefficient, and a discussion of its interpretation. The BASS software includes features that allow the calculation of the internal consistency, test-retest and alternate forms of reliability coefficients (see Appendix A). As mentioned, a recent publication that is a comprehensive guide to measurement with raters is Engelhard and Wind (2018).

7.6 Exercises and Activities

(following on from the exercises and activities in Chapters 1-6)

1. Use the BASS software (see Appendix A) to generate output for the PF-10 data. Check the
standard errors of measurement for the respondents. Do they display the “U-shape” pattern
mentioned above? Are they sufficiently small? (I.e., for this you will need to hypothesize a use
for the measurements.)

2. Locate the separation reliability and Cronbach’s alpha in the BASS output. How do they
compare?

3. Locate a data set containing either test-retest or alternate forms data and calculate a correlation
coefficient to interpret as a reliability coefficient using the BASS software. Is it satisfactory for
the purpose for which the data set was collected?

4. Estimate how many items would need to be added to the PF-10 to reach Nunnally &
Bernstein’s recommendations of 0.90 and 0.95 internal consistency reliability coefficients.
(Hint: adapt the technique described at the end of Section 7.2.)

5. Write down your plan for collecting reliability information about your instrument.

6. Think through the steps outlined above in the context of developing your instrument and write
down notes about your plans.

7. Share your plans and progress with others and discuss what you and they are succeeding on,
and what problems have arisen.

Appendix 7A

Results from the PF-10 Analysis

Table 7A.1 Raw Scores, Estimates and sem(θ^) for the PF-10 (polytomous)

Raw Score    Estimate    sem(θ^)
0            -7.02       1.55
1            -5.78       0.94
2            -5.12       0.77
3            -4.63       0.68
4            -2.70       0.83
5            -3.85       0.60
6            -3.51       0.58
7            -3.19       0.57
8            -2.88       0.56
9            -2.57       0.56
10           -2.26       0.56
11           -1.95       0.56
12           -1.64       0.57
13           -1.31       0.59
14           -0.96       0.61
15           -0.59       0.63
16           -0.18       0.67
17           0.29        0.74
18           0.89        0.84
19           1.74        1.05
20           3.27        1.73

