Reliability
ROBERT THORNDIKE.
A test provides a limited sample from some domain of behavior. With respect to any test, one should raise two basic questions. The first is how faithfully the domain defined by the test represents the domain in which we are really interested. Do the test tasks match the attribute or the domain of knowledge that we wish to assess? This is the issue of validity, one that is absolutely central to any given use of a testing procedure. Validity is not, however, the issue of primary concern in this article. The second question is how accurately and precisely the test score assesses the domain from which the test does in fact draw a sample. This estimate of precision determines the test's reliability, our concern in this article.

Precision can be seen in either relative or absolute terms. Relative precision asks the question: How accurately does the test determine the individual's standing in some group? The answer is typically expressed as a correlation coefficient, providing an estimate of the relationship of a score on one testing to a score on another actual or hypothetical testing. Absolute precision asks the question: How much can the individual be expected to vary, in terms of some meaningful score scale, from one testing to another? The answer to this question has typically been expressed as a standard error of measurement, indicating the standard deviation of a person's hypothetical series of measures (under the assumption that the person remains unchanged by the testing operation).
Both types of indices are useful, each serving somewhat different purposes. Correlational indices of relative precision are useful for comparing different tests (either of the same attribute or of different attributes) when scores are reported on scales that are not directly comparable, but data are available for the same or comparable groups. For example, the Verbal scale of the Cognitive Abilities Test shows a reliability coefficient of .945 for a sixth-grade group, whereas the Nonverbal scale shows a reliability coefficient of .930 for the same group. These numbers indicate that people can be expected to shift in their standing from one testing to another less on the Verbal score than on the Nonverbal score. An absolute index such as the standard error of measurement is useful when one wishes to state the actual score limits within which (at some specified level of confidence) an individual's domain score can be expected to fall. Thus, if the standard errors of measurement for the Verbal and Nonverbal scales referred to above are 3.75 and 4.25, respectively, we can expect repeated Verbal measures to fall within a range of 7.5 points (representing plus or minus one standard error of measurement) about two-thirds of the time. It would take a range of 8.5 points to include this same fraction of the Nonverbal scores.
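In computational terms, the band arithmetic just illustrated amounts to the following minimal Python sketch; the observed score of 100 is hypothetical, and the standard errors of measurement are the values quoted above.

```python
def score_band(observed_score, sem, z=1.0):
    """Return the score limits observed_score +/- z * SEM.

    With z = 1 and normally distributed errors of measurement, the band
    can be expected to contain the examinee's domain score about
    two-thirds of the time; z = 2 raises that to roughly 19 times in 20.
    """
    return observed_score - z * sem, observed_score + z * sem

# SEM values quoted in the text; the observed score of 100 is illustrative.
print(score_band(100, 3.75))  # Verbal: (96.25, 103.75), a 7.5-point range
print(score_band(100, 4.25))  # Nonverbal: (95.75, 104.25), an 8.5-point range
```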
As the article has implied up to this point, it must be expected that any measure based on a sampling from a domain will be less than perfectly accurate as an estimate of an individual's "true" standing within that domain. If I weigh myself on a dozen different occasions, I will not always get the same value for my weight. Depending upon the time of day, on how much I ate at my last meal, on whether I have just been jogging, even on how I stand on the scales and how I look at the dial, the reading may vary by a pound or more from one weighing to another. Similarly, test results will show variation. In testing we can identify several main sources of variation, each of which can be further subdivided. We may profitably consider four:

Variation over time. The examinee may display day-to-day, week-to-week, or month-to-month changes in health, effort, interest, or even acquired knowledge that will affect test scores. We must decide whether the domain of interest to us is the behavior of the individual at one specific point in time or the broader domain of the individual over some more extended period of time. The procedures that we select for collecting and analyzing data must correspond to our particular interests.

Variation from task to task. A test is made up of a series of separate test exercises whose properties vary somewhat from one to the next. Although the exercises are chosen in hopes that each will tap the general domain that the test is designed to assess, each will also to some degree involve specific learning and specific experiences. Consequently, tasks included in any one test form will call upon a different set of specific learnings than will tasks in another form.

Variations in judgments by an appraiser. Whenever a behavior sample must be judged by individuals, there will be variation from one judge to another (and even in the same judge at different times) in how the sample is perceived. Should "something you put on a dog" get credit as a definition of muzzle? Should "never gets tired of something" get credit as a definition of untiring? Scorers will disagree on such decisions even when fairly comprehensive scoring guides have been provided.

Moment-to-moment variations in the examinee and sheer "chance." Memory is a tricky thing, and we will fail to dredge up at moment X some element of past learning that will pop up without effort at moment Y. How well an individual performs at moment X depends on what turn the memory search or problem-solving attack takes; thus, there will be unpredictable variations in the responses of an individual.

A consideration of these sources of variation becomes important (a) as we decide what domain it is that we consider our test to be sampling, and (b) as we try to select an appropriate procedure for gathering and analyzing the data that are to provide an estimate of the test's precision in appraising that domain. If we think of the relevant domain as being the individual as he or she exists at one particular point in time, then we can afford to ignore variations over time. If we think of the domain of interest as comprising only those tasks appearing in our test, then we can afford to ignore variations from task to task. If the responses are completely objective, then we can afford to ignore variations in judgments by an appraiser. Otherwise, we will need to gather the evidence on test precision in a way that allows variation attributable to each of these sources to show up as inconsistency in test performance.
The classical procedure for estimating test reliability has been to administer two equivalent forms of the test (i.e., two forms independently prepared but conforming to the same detailed specifications) on two separate occasions and to correlate the resulting two sets of scores. (If the test is in a free-response format, it would be appropriate to have each form scored by a different judge.) This procedure permits all of the categories of variation considered above to produce differences between the pair of scores for an individual. It also provides an estimate of the precision with which one sample from the domain obtained at one point in time will match a different sample from the domain obtained at some different point in time. This procedure provides a sound and conservative estimate of reliability.
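As a sketch of the computation this classical procedure implies, the following Python fragment simply correlates the two sets of form scores; the eight pairs of scores are invented for illustration (statistics.correlation requires Python 3.10 or later).

```python
import statistics

def alternate_forms_reliability(form_a, form_b):
    """Pearson correlation between scores on two equivalent forms,
    taken as the estimate of the test's reliability."""
    return statistics.correlation(form_a, form_b)

# Hypothetical scores for eight examinees on two forms of a test.
form_a = [52, 47, 63, 58, 41, 70, 55, 49]
form_b = [50, 45, 66, 55, 44, 68, 57, 46]
print(round(alternate_forms_reliability(form_a, form_b), 3))
```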
If one feels that the sample of test tasks is the whole domain or that variation from one task to another is of no consequence, one can appropriately estimate reliability from retesting with the same set of test items at some later date. The interval between testings should equal the interval over which one wishes to make statements about test precision. The interval should also be at least long enough to prevent examinees from remembering their responses on the first test and using them as a guide in the second test.
If one is not concerned with stability over time and wishes only to know how precisely the test describes the individuals on the date on which they are tested, alternate forms can be given in immediate succession (unless, as would often be the case, giving the two tests on the same day would be an excessive burden on the energy and goodwill of the examinees). When one adds to this consideration the fact that a second form of the test may not have been constructed, it is easy to see why it is tempting to try to generate an estimate of precision from the administration of a single test form. Two approaches have been devised for doing so. The first is to divide the test into two or more presumably equivalent fractions and correlate the scores. The second is to analyze the internal consistency of the individual test items.
The most common procedure for subdividing a test has been to put the odd-numbered items in one half of the test and the even-numbered items in the other. The resulting two scores are then correlated. Forming half-length tests in this way is likely to yield two shorter tests that are equivalent in content and difficulty. If items from a given content area or a given format are grouped together, the odd-even procedure will assign them about equally to each of the two half-tests. Also, if the items have been arranged to increase in difficulty from beginning to end, the two half-tests will be about equal in difficulty. It is always a good idea, however, to look at both the structure of the original test and the means and standard deviations of the half-tests to get some reassurance of their equivalence.
Because each half-test is a smaller sample of behavior than is the complete test, the half-scores can be expected to be less reliable than those of the full-length test. Thus, the obtained half-test correlations are not useful as they stand. But if the two halves can be assumed to be equivalent to a reasonable approximation, then the correlation that would be obtained between full-length tests can be estimated by the long-known Spearman-Brown prophecy formula:

$$ r_{tt} = \frac{2 r_{hh}}{1 + r_{hh}} $$

where $r_{tt}$ = reliability of the full-length test, and $r_{hh}$ = correlation of the two half-length tests.
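The odd-even procedure and the Spearman-Brown step can be sketched together in a few lines of Python; the small matrix of 0/1 item responses is invented for illustration.

```python
import statistics

def split_half_reliability(item_scores):
    """Odd-even split-half reliability, stepped up by Spearman-Brown.

    item_scores: one row per examinee, one column per item (item 1 first).
    """
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_hh = statistics.correlation(odd, even)        # half-test correlation
    return 2 * r_hh / (1 + r_hh)                    # Spearman-Brown step-up

# Hypothetical responses (1 = passed, 0 = failed) for six examinees.
responses = [
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 0, 1, 0],
]
print(round(split_half_reliability(responses), 3))
```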
The other approach to estimating reliability from a single test administration is based on the assumption that, to a reasonable approximation, the items in the test are homogeneous in what they measure. This means that a trait or attribute common to any one pair of items will be common to any other pair, and that aside from this common factor each item depends on factors affecting only it. When this is the case, a sound estimate of the test's reliability as a measure of the common attribute underlying all items is provided by an index known as coefficient alpha. The formula for coefficient alpha is:

$$ r_{tt} = \frac{n}{n - 1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_t^2}\right) $$

where $r_{tt}$ = estimated reliability of the test, $\sigma_t^2$ = variance of the test, $\sigma_i^2$ = variance of item $i$, $n$ = number of items in the test, and $\sum_i$ = the sum taken over all items.

This index provides an estimate of the correlation that would be obtained between scores based on two samples of items drawn from the same homogeneous domain.
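A direct transcription of the alpha formula into Python might look as follows; population variances are used for both items and total so that numerator and denominator are on the same footing, and the small ratings matrix is hypothetical.

```python
import statistics

def coefficient_alpha(item_scores):
    """Coefficient alpha for a matrix with one row per examinee and
    one column per item."""
    n_items = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    total_var = statistics.pvariance(totals)
    item_vars = [statistics.pvariance(col) for col in zip(*item_scores)]
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings (not just 0/1 -- alpha handles any scoring).
ratings = [
    [4, 3, 5, 4],
    [2, 2, 3, 1],
    [5, 4, 4, 5],
    [3, 3, 2, 2],
    [4, 5, 5, 4],
]
print(round(coefficient_alpha(ratings), 3))
```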
When items are scored simply as 1 (passed) or 0 (failed), coefficient alpha becomes the frequently encountered Kuder-Richardson Formula 20. The formula for K-R 20 is:

$$ r_{tt} = \frac{n}{n - 1}\left(1 - \frac{\sum_i p_i q_i}{\sigma_t^2}\right) $$

where $p_i$ = proportion passing item $i$, and $q_i$ = proportion failing item $i$.
A further simplification is achieved if one assumes that all of the items are of the same difficulty. In this case Kuder-Richardson Formula 21 would be employed. The formula for K-R 21 is:

$$ r_{tt} = \frac{n}{n - 1}\left(1 - \frac{M(n - M)}{n \sigma_t^2}\right) $$

where $M$ = the mean of the test, and the rest of the notation is as noted earlier.

K-R 21 always provides a smaller value than does Formula 20, but with tests of 50 or more items the difference is usually small. Thus, the easy-to-calculate Formula 21 often furnishes a serviceable and conservative estimate of the value that would be obtained by using Formula 20.
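Both Kuder-Richardson formulas reduce to a few lines of Python when item scores are 0 or 1. The sketch below uses a hypothetical response matrix; note that, as the text states, K-R 21 comes out below K-R 20.

```python
import statistics

def kr20(item_scores):
    """Kuder-Richardson Formula 20 for a 0/1 item-score matrix."""
    n = len(item_scores[0])
    n_people = len(item_scores)
    totals = [sum(row) for row in item_scores]
    var_t = statistics.pvariance(totals)
    pq_sum = 0.0
    for col in zip(*item_scores):
        p = sum(col) / n_people        # proportion passing the item
        pq_sum += p * (1 - p)          # p_i * q_i
    return (n / (n - 1)) * (1 - pq_sum / var_t)

def kr21(n, mean, var_t):
    """Kuder-Richardson Formula 21 from test length, mean, and variance;
    assumes all items are of equal difficulty."""
    return (n / (n - 1)) * (1 - mean * (n - mean) / (n * var_t))

responses = [  # hypothetical 0/1 responses, six examinees by six items
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 0, 1, 0],
]
totals = [sum(row) for row in responses]
print(round(kr20(responses), 3))
print(round(kr21(6, statistics.mean(totals), statistics.pvariance(totals)), 3))
```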
In addition to the assumption of equivalence of the domain sampled by the part-tests in the split-test procedure and by each of the items in the approach through item statistics, there is one other condition that must be satisfied if use of these single-test procedures is to be appropriate. This condition is that the test not be significantly speeded. In a highly speeded test, the number of items failed (in large part those that examinees did not have time to attempt) depends primarily on speed of work, and this is fixed for a single testing. The result is that the obtained correlation is spuriously high. In a test that depends solely on speed and in which examinees very rarely make errors, the correlation between odd-items score and even-items score will necessarily approach 1.00 and will be essentially meaningless. Furthermore, in this case consistency from item to item will depend only on whether both items have been reached and answered. For such a test, two separately timed samples of behavior are essential if one is to have a good idea of the precision with which each person's speed of work is estimated.
That a test has a time limit need not mean that score depends significantly on speed of work. If items are arranged in order of difficulty, a reasonable time limit can permit most examinees to attempt all of the items that they are capable of solving, so that additional time will result in little or no increase in score. A procedure has been developed (Cronbach & Warrington, 1951) that provides an estimate of the maximum amount that coefficient alpha might be inflated by a speed factor. This procedure provides a lower bound for alpha, allowing for the effect of speeding. The lower bound can then be compared with alpha computed by the standard formula. If the lower bound differs only slightly from alpha, one can relax and feel confident that the standard formula does not give a seriously inflated reliability estimate.
The procedures that have been described so far yield a correlation coefficient to express relative precision: the extent to which an individual maintains a stable position relative to a group. The standard error of measurement, which estimates the dispersion of a set of scores for a single individual, is obtained from the reliability coefficient by the formula:

$$ \mathrm{SEM} = \sigma_t \sqrt{1 - r_{tt}} $$

With the assumption of a normal distribution of values for a set of measures of the same individual, one can expect a sample value to fall within ±1 standard error of the total domain value about two times out of three and to fall within ±2 standard errors about 19 times out of 20.
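The conversion from a reliability coefficient to a standard error of measurement is a one-line computation. The sketch below uses the standard-score deviation of 16 and the reliability coefficients quoted earlier for the Verbal and Nonverbal scales, and it recovers values close to the standard errors cited above.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = sd * sqrt(1 - r_tt)."""
    return sd * math.sqrt(1 - reliability)

# Standard age scores have SD = 16; reliabilities are those quoted earlier.
print(round(standard_error_of_measurement(16, 0.945), 2))  # Verbal, ~3.75
print(round(standard_error_of_measurement(16, 0.930), 2))  # Nonverbal, ~4.23
```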
In manuals for tests, one often finds only a single value reported for the standard error of the instrument. When this is the case, it is assumed that the measuring device has equal precision throughout the range of the underlying trait that is to be assessed. This is rarely true, and at the extremes it is always manifestly false. A test made up of items designed for kindergartners is not accurate in identifying differences in ability among high school students, and vice versa. Any test is most precise over some limited range of the attribute that it is designed to assess. One would hope that each test would provide a table showing the standard error of measurement at different levels of raw score on the test.
Calculation of standard error by score level is quite straightforward if one has try-out data on a large enough sample to provide stable results. If two forms of the test have been administered, one groups individuals in strata on the basis of their average score on the two forms (e.g., all cases with average raw scores of 50-54.5). One then prepares a distribution of differences in score between the two forms and calculates the standard deviation of that distribution. The standard error of measurement at that level is equal to the above standard deviation divided by the square root of two.
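A sketch of the stratified computation just described: pair each examinee's scores on the two forms, group on the average of the pair, and divide the standard deviation of the within-stratum differences by the square root of two. The paired scores and the stratum width of 5 raw-score points are invented for illustration.

```python
import math
import statistics
from collections import defaultdict

def sem_by_level(form_a, form_b, stratum_width=5):
    """Standard error of measurement at each score level.

    Examinees are grouped into strata on the average of their two form
    scores; within each stratum, the SD of the score differences divided
    by sqrt(2) estimates the SEM at that level.
    """
    strata = defaultdict(list)
    for a, b in zip(form_a, form_b):
        stratum = int((a + b) / 2 // stratum_width) * stratum_width
        strata[stratum].append(a - b)
    return {level: statistics.pstdev(diffs) / math.sqrt(2)
            for level, diffs in sorted(strata.items()) if len(diffs) > 1}

# Hypothetical paired scores on two forms of a test.
form_a = [52, 47, 63, 58, 41, 70, 55, 49, 61, 44]
form_b = [50, 45, 66, 55, 44, 68, 57, 46, 58, 47]
print(sem_by_level(form_a, form_b))
```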
The above is the standard error of measurement in raw score units. Because raw scores are usually converted into some type of standard score units, however, the raw score standard error should also be converted into those units, taking into account the conversion that applies in that specific part of the range of raw scores. Typically, an increase of one point of raw score as one approaches a perfect score on a test will represent substantially more converted score units than will one point in the middle score range. Thus, a table of standard errors in converted score units will give quite a different picture of a test's precision than will one expressed in raw score units. Paradoxically, the same "ceiling effect" that limits a test's actual precision (expressed in equal units) will tend to make it show more uniform results when expressed in raw scores. We can illustrate this phenomenon with data from one level of the Verbal Cognitive Abilities Test (see Table 1). Data on this 100-item test are shown for raw scores and for standard age scores (i.e., converted scores designed to represent equal units with a standard deviation of 16 at each age level). Note the increase in the standard error of measurement for standard age scores at the extreme raw score values.

TABLE 1. Sample of Raw and Standard Age Scores for Cognitive Abilities Test, Verbal (columns: raw score; raw score with its standard error; standard age score with its standard error).
Unfortunately, norming data samples are often not large enough to produce even reasonably stable values for the needed number of slices of raw score data. But developments in item response theory (Lord, 1980; Wright & Stone, 1979) make another approach to estimation of precision possible through what is called the information function. The information function of an item indicates, for each point on the scale of the underlying trait, how much that item contributes to delimiting our estimate of the ability of an examinee. Then, assuming a homogeneous set of items, the information function of the test can be obtained by summing the information functions of the single items (see Thorndike, 1982, pp. 82-87). The standard error of measurement at a given ability level is simply the reciprocal of the square root of the information function at that level.
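As a concrete illustration of this last statement, the sketch below assumes a two-parameter logistic item model (the article does not commit to a particular model); under that assumption item information is a_i^2 P_i (1 - P_i), test information is the sum over items, and the standard error at each ability level is the reciprocal of its square root.

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under a 2PL item model
    (the specific model here is an illustrative assumption)."""
    return 1 / (1 + math.exp(-a * (theta - b)))

def test_sem(theta, items):
    """SEM at ability theta: 1 / sqrt(sum of item information),
    with item information a^2 * P * (1 - P)."""
    info = sum(a * a * p_2pl(theta, a, b) * (1 - p_2pl(theta, a, b))
               for a, b in items)
    return 1 / math.sqrt(info)

# Hypothetical (discrimination, difficulty) pairs for a short test.
items = [(1.2, -1.0), (0.8, -0.5), (1.5, 0.0), (1.0, 0.5), (1.3, 1.0)]
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(test_sem(theta, items), 2))
```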
An elaboration of the author's views on the practical issues in obtaining and interpreting reliability data can be found in Thorndike and Hagen's Measurement and Evaluation in Psychology and Education (1977). The author's thoughts on the theory and techniques for appraising reliability have been set forth in Applied Psychometrics (1982).

REFERENCES

Cronbach, L.J., & Warrington, W.J. (1951). Time limit tests: Estimating their reliability and degree of speeding. Psychometrika, 16, 167-188.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Thorndike, R.L. (1982). Applied psychometrics. Boston: Houghton Mifflin.
Thorndike, R.L., & Hagen, E.P. (1977). Measurement and evaluation in psychology and education (4th ed.). New York: Wiley.
Wright, B.D., & Stone, M.H. (1979). Best test design. Chicago: MESA Press.

Robert Thorndike is a professor emeritus, Teachers College, Columbia University, New York City.