STANDARDIZATION
DEVELOPMENTAL NORMS
One way in which meaning can be attached to test scores is to indicate
how far along the normal developmental path the individual has progressed. Thus an 8-year-old who performs as well as the average 10-year-old on an intelligence test may be described as having a mental age of 10; a mentally retarded adult who performs at the same level would likewise be
assigned an MA of 10. In a different context, a fourth-grade child may be characterized as reaching
the sixth-grade norm in a reading test and the third-grade norm in an arithmetic test. Other
developmental systems utilize more highly qualitative descriptions of behavior in specific functions, ranging from sensorimotor activities to concept formation.
However expressed, scores based on developmental norms tend to be psychometrically crude and do not lend themselves well to precise statistical treatment. Nevertheless, they have considerable appeal for descriptive purposes, especially in the intensive clinical study of individuals and for
certain research purposes.
Mental Age
In Chapter 1 it was noted that the term “mental age” was widely popularized through the various
translations and adaptations of the Binet-Simon scales, although Binet himself had employed the
more neutral term “mental level.” In age scales such as the Binet and its revisions, items are grouped
into year levels. For example, those items passed by the majority of 7-year-olds in the standardization sample are placed in the 7-year level, those passed by the majority of 8-year-olds are assigned to the 8-year level, and so forth. A child’s score on the test will then correspond to the
highest year level that he can successfully complete. In actual practice, the individual’s performance
shows a certain amount of scatter. In other words, the subject fails some tests below his mental age
level and passes some above it. For this reason, it is customary to compute the basal age, i.e., the
highest age at and below which all tests are passed. Partial credits, in months, are then added to this
basal age for all tests passed at higher year levels. The child’s mental age on the test is the sum of the
basal age and the additional months of credit earned at higher age levels. Mental age norms have also been employed with tests that are not divided into year levels. In such a case, the subject’s raw
score is first determined. Such a score may be the total number of correct items on the whole test; or
it may be based on time, on number of errors, or on some combination of such measures. The mean
raw scores obtained by the children in each year group within the standardization sample constitute
the age norms for such a test. The mean raw score of the 8-year-old children, for example, would represent the 8-year norm. If an individual’s raw score is equal to the mean 8-year-old raw score,
then his mental age on the test is 8 years. All raw scores on such a test can be transformed in a
similar manner by reference to the age norms. It should be noted that the mental age unit does not
remain constant with age, but tends to shrink with advancing years. For example, a child who is one year retarded at age 4 will be approximately three years retarded at age 12. One year of mental growth from ages 3 to 4 is equivalent to three years of growth from ages 9 to 12. Since intellectual development progresses more rapidly at the earlier ages and gradually decreases as the individual approaches his mature limit, the mental age unit shrinks correspondingly with age. This relationship
may be more readily visualized if we think of the individual’s height as being expressed in terms of “height age.” The difference, in inches, between a height age of 3 and 4 years would be greater than that between a height age of 10 and 11. Owing to the progressive shrinkage of the MA unit, one year of acceleration or retardation at, let us say, age 5 represents a larger deviation from the norm than does one year of acceleration or retardation at age 10.
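The basal-age arithmetic described above can be sketched in a few lines. The year levels, the six tests per level, the two-month credit per test, and the child's pass/fail record below are hypothetical illustrations, not figures from any particular scale.

```python
# A sketch of the basal-age method: basal age is the highest year level at
# and below which every test is passed; months of credit are then added for
# each test passed above it. All data here are hypothetical.

def mental_age(results, months_per_test=2):
    """results maps each year level to a list of pass (True) / fail (False)
    outcomes, one per test at that level. Returns (years, months)."""
    basal = 0
    for level in sorted(results):
        if all(results[level]):
            basal = level          # every test at and below here passed
        else:
            break                  # scatter begins; stop raising the basal age
    # partial credit, in months, for tests passed above the basal age
    extra = months_per_test * sum(
        sum(outcomes) for level, outcomes in results.items() if level > basal)
    return basal + extra // 12, extra % 12

# A child passing all tests through year 7, with scattered successes above:
child = {6: [True] * 6, 7: [True] * 6,
         8: [True, True, True, False, True, False],
         9: [True, False, False, False, False, False]}
print(mental_age(child))   # (7, 10): basal age 7, plus 5 tests x 2 months
```

The scatter in the hypothetical record shows why the basal age is needed: the child fails some tests below his final mental age level and passes some above it.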
Grade Equivalents
Scores on educational achievement tests are often interpreted in terms of grade equivalents. This
practice is understandable because the tests are employed within an academic setting. To describe a
pupil’s achievement as equivalent to seventh-grade performance in spelling, eighth-grade in reading,
and fifth-grade in arithmetic has the same popular appeal as the use of mental age in the traditional
intelligence tests. Grade norms are found by computing the mean raw score obtained by children in each grade. Thus, if the average number of problems solved correctly on an arithmetic test by the fourth graders in the standardization sample is 23, then a raw score of 23 corresponds to a grade equivalent of 4. Intermediate grade equivalents, representing fractions of a grade, are usually found by interpolation, although they can also be obtained directly by testing children at different times within the school year. Because the school year covers ten months, successive months can be
expressed as decimals. For example, 4.0 refers to average performance at the beginning of the fourth
grade (September testing), 4.5 refers to average performance at the middle of the grade (February
testing), and so forth. Despite their popularity, grade norms have several shortcomings. First, the
content of instruction varies somewhat from grade to grade. Hence, grade norms are appropriate
only for common subjects taught throughout the grade levels covered by the test. They are not generally applicable at the high school level, where many subjects may be studied for only one or two years. Even with subjects taught in each grade, however, the emphasis placed on different
subjects may vary from grade to grade, and progress may therefore be more rapid in one subject
than in another during a particular grade. In other words, grade units are obviously unequal and
these inequalities occur irregularly in different subjects. Grade norms are also subject to
misinterpretation unless the test user keeps firmly in mind the manner in which they were derived. For example, if a fourth-grade child obtains a grade equivalent of 6.9 in arithmetic, it does not mean
that he has mastered the arithmetic processes taught in the sixth grade. He undoubtedly obtained
his score largely by superior performance in fourth-grade
arithmetic. It certainly could not be assumed that he has the prerequisites for seventh-grade
arithmetic. Finally, grade norms tend to be incorrectly regarded as performance standards. A sixth-grade teacher, for example, may assume that all pupils in her class should fall at or close to the sixth-grade norm in achievement tests. This misconception is certainly not surprising when grade norms
are used. Yet individual differences within any one grade are such that the range of achievement test
scores will inevitably extend over several grades.
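The interpolation of intermediate grade equivalents can be illustrated briefly. The grade-4 mean of 23 echoes the arithmetic example above; the grade-3 and grade-5 means are invented, and each mean is assumed to be anchored at the beginning (x.0, September) of its grade.

```python
# Hypothetical grade norms: mean raw scores at the beginning of each grade.
# Only the grade-4 value of 23 comes from the text's example.
norms = {3: 15, 4: 23, 5: 30}

def grade_equivalent(raw):
    """Interpolate linearly between successive grade means."""
    grades = sorted(norms)
    if raw <= norms[grades[0]]:
        return float(grades[0])
    for lo, hi in zip(grades, grades[1:]):
        if norms[lo] <= raw <= norms[hi]:
            fraction = (raw - norms[lo]) / (norms[hi] - norms[lo])
            return round(lo + fraction, 1)
    return float(grades[-1])

print(grade_equivalent(23))     # 4.0 -- beginning of fourth grade
print(grade_equivalent(26.5))   # 4.5 -- middle of fourth grade (February)
```

The linear interpolation is only a convenience: as the text notes, grade units are unequal and progress within a grade need not be uniform across subjects.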
Ordinal Scales
Another approach to developmental norms derives from research in child psychology. Empirical
observation of behavior development in infants and young children led to the description of behavior typical of successive ages in such functions as locomotion, sensory discrimination, linguistic
communication, and concept formation. An early example is provided by the work of Gesell and his
associates at Yale (Ames, 1937; Gesell et al., 1940; Gesell & Amatruda, 1947; Halverson, 1933). The
Gesell Developmental Schedules show the approximate developmental level in months that the child
has attained in each of four major areas of behavior, namely, motor, adaptive, language, and
personal-social. These levels are found by comparing the child’s behavior with that typical of eight
key ages, ranging from 4 weeks to 36 months. Gesell and his co-workers emphasized the sequential
patterning of early behavior development. They cited extensive evidence of uniformities of
developmental sequences and an orderly progression of behavior changes. For example, the child’s
reactions toward a small object placed in front of him exhibit a characteristic chronological sequence
in visual fixation and in hand and finger movements. Use of the entire hand in crude attempts at
palmar prehension occurs at an earlier age than use of the thumb in opposition to the palm; this
type of prehension is in turn followed by use of the thumb and index finger in a more efficient
pincer-like grasp of the object. Such sequential patterning was likewise observed in walking, stair climbing, and most of the sensorimotor development of the first few years. The scales developed
within this framework are ordinal in the sense that developmental stages follow in a constant order,
each stage presupposing mastery of prerequisite behavior characteristic of earlier stages.¹

¹ This usage of the term “ordinal scale” differs from that in statistics, in which an ordinal scale is simply one that permits a rank-ordering of individuals without knowledge about amount of difference between them; in the statistical sense, ordinal scales are contrasted to equal-unit interval scales. Ordinal scales of child development are actually designed on the model of a Guttman scale or simplex, in which successful performance at one level implies success at all lower levels (Guttman, 1944).

Since the 1960s, there has been a sharp upsurge of
interest in the developmental theories of the Swiss child psychologist, Jean Piaget (see Flavell, 1963;
Ginsburg & Opper, 1969; Green, Ford, & Flamer, 1971). Piaget's research has focused on the
development of cognitive processes from infancy to the midteens. He is concerned with specific
concepts rather than broad abilities. An example of such a concept, or schema, is object
permanence, whereby the child is aware of the identity and continuing existence of objects when they are seen from different angles or are out of sight. Another widely studied concept is conservation, or the recognition that an attribute remains constant over changes in perceptual appearance, as when the same quantity of liquid is poured into
when rods of the same length are placed in different spatial arrangements. Piagetian tasks have been
used widely in research by developmental psychologists and some have been organized into
standardized scales, to be discussed in Chapters 10 and 14 (Goldschmid & Bentler, 1968b; Loretan,
1966; Pinard & Laurendeau, 1964; Uzgiris & Hunt, 1975). In accordance with Piaget’s approach,
these instruments are ordinal scales, in which the attainment of one stage is contingent upon
completion of the earlier stages in the development of the concept. The tasks are designed to reveal
the dominant aspects of each developmental stage; only later are empirical data gathered regarding
the ages at which each stage is typically reached. In this respect, the procedure differs from that followed in constructing age scales, in which items are selected in the first place on the basis of their
differentiating between successive ages. In summary, ordinal scales are designed to identify the stage
reached by the child in the development of specific behavior functions. Although scores may be
reported in terms of approximate age levels, such scores are secondary to a qualitative description of
the child’s characteristic behavior. The ordinality of such scales refers to the uniform progression of
development through successive stages. Insofar as these scales typically provide information about
what the child is actually able to do (e.g., climbs stairs without assistance; recognizes identity in
quantity of liquid when poured into differently shaped containers), they share important features
with the criterion-referenced tests to be discussed in a later section of this chapter.
WITHIN-GROUP NORMS
Nearly all standardized tests now provide some form of within-group norms. With such norms, the
individual’s performance is evaluated in terms of the performance of the most nearly comparable
standardization group, as when comparing a child’s raw score with that of children of the same
chronological age or in the same school grade. Within-group scores have a uniform and clearly
defined quantitative meaning and can be appropriately employed in most types of statistical analysis.
Percentiles
Percentile scores are expressed in terms of the percentage of persons in the standardization sample
who fall below a given raw score. For example, if 28 percent of the persons obtain fewer than 15
problems correct on an arithmetic reasoning test, then a raw score of 15 corresponds to the 28th
percentile (P₂₈). A percentile indicates the individual’s relative position in the standardization sample.
Percentiles can also be regarded as ranks in a group of 100, except that in ranking it is customary to
start counting at the top, the best person in the group receiving a rank of one. With percentiles, on
the other hand, we begin counting at the bottom, so that the lower the percentile, the poorer the
individual’s standing.
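The definition just given translates directly into a computation. The sample of 100 scores below is hypothetical, contrived so that 28 examinees fall below a raw score of 15, matching the example above.

```python
# Percentile rank as defined above: the percentage of persons in the
# standardization sample who fall below a given raw score.
def percentile_rank(raw, sample):
    below = sum(1 for score in sample if score < raw)
    return 100.0 * below / len(sample)

# 100 hypothetical examinees: 28 score below 15, 30 score exactly 15.
sample = [10] * 28 + [15] * 30 + [20] * 42
print(percentile_rank(15, sample))   # 28.0 -- a raw score of 15 is P28
```

Note that, unlike ranks, counting starts from the bottom: a lower percentile means poorer standing.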
standard scores.
Current tests are making increasing use of standard scores, which are the most satisfactory type of
derived score from most points of view. Standard scores express the individual’s distance from the
mean in terms of the standard deviation of the distribution. Standard scores may be obtained by
either linear or nonlinear transformations of the original raw scores. When found by a linear transformation, they retain the exact numerical relations of the original raw scores, because they are
computed by subtracting a constant from each raw score and then dividing the result by another
constant. The relative magnitude of differences between standard scores derived by such a linear
transformation corresponds exactly to that between the raw scores. All properties of the original
distribution of raw scores are duplicated in the distribution of these standard scores. For this reason,
any computations that can be carried out with the original raw scores can also be carried out with
linear standard scores, without any distortion of results. Linearly derived standard scores are often
designated simply as “standard scores” or “z scores.” To compute a z score, we find the difference between the individual’s raw score and the mean of the normative group and then divide this
difference by the SD of the normative group. Table 3 shows the computation of z scores for two
individuals, one of whom falls 1 SD above the group mean, the other .40 SD below the mean. Any
raw score that is exactly equal to the mean is equivalent to a z score of zero. It is apparent that such
a procedure will yield derived scores that have a negative sign for all subjects falling below the mean.
Moreover, because the total range of most groups extends no farther than about 3 SD’s above and
below the mean, such standard scores will have to be reported to at least one decimal place in order
to provide sufficient differentiation among individuals.
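The z-score computation described above, and illustrated in Table 3 for one score falling 1 SD above the mean and one falling .40 SD below it, can be sketched as follows; the normative mean of 60 and SD of 10 are hypothetical values chosen to reproduce those two cases.

```python
# Linear standard (z) scores: the individual's distance from the group mean
# expressed in SD units. Mean and SD here are hypothetical normative values.
def z_score(raw, mean, sd):
    return (raw - mean) / sd

print(z_score(70, 60, 10))   # 1.0  -- one SD above the mean
print(z_score(56, 60, 10))   # -0.4 -- .40 SD below the mean
print(z_score(60, 60, 10))   # 0.0  -- a raw score exactly at the mean
```

The second case shows the two inconveniences noted in the text: scores below the mean are negative, and a decimal place is needed to differentiate individuals within the roughly six-SD range of most groups.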
CRITERION-REFERENCED TESTING
nature and uses.
An approach to testing that has aroused a surge of activity, particularly in education, is generally
designated as “criterion-referenced testing.” First proposed by Glaser (1963), this term is still used
somewhat loosely and its definition varies among different writers. Moreover, several alternative
terms are in common use, such as content-, domain-, and objective-referenced. These terms are
sometimes employed as synonyms for criterion-referenced and sometimes with slightly different
connotations. “Criterion-referenced,” however, seems to have gained ascendancy, although it is not
the most appropriate term. Typically, criterion-referenced testing uses as its interpretive frame of
reference a specified content domain rather than a specified population of persons. In this respect, it
has been contrasted with the usual norm-referenced testing, in which an individual’s score is interpreted by comparing it with the scores obtained by others on the same test. In criterion-referenced testing, for example, an examinee’s test performance may be reported in terms of the
specific kinds of arithmetic operations he has mastered, the estimated size of his vocabulary, the
difficulty level of reading matter he can comprehend (from comic books to literary classics), or the
chances of his achieving a designated performance level on an external criterion (educational or
vocational). Thus far, criterion-referenced testing has found its major applications in several recent
innovations in education. Prominent among these are computer-assisted, computer-managed, and
other individualized, self-paced instructional systems. In all these systems, testing is closely integrated with instruction, being introduced before, during, and after completion of each instructional unit to check on prerequisite skills, diagnose possible learning difficulties, and prescribe subsequent instructional procedures. The previously cited Project PLAN and IPI are examples of such programs.
From another angle, criterion-referenced tests are useful in broad surveys of educational
accomplishment, such as the National Assessment of Educational Progress (Womer, 1970), and in
meeting demands for educational accountability (Gronlund, 1974). From still another angle, testing
for the attainment of minimum requirements, as in qualifying for a driver’s license or a pilot’s license,
illustrates criterion-referenced testing. Finally, familiarity with the concepts of criterion-referenced
testing can contribute to the improvement of the traditional, informal tests prepared by teachers for
classroom use. Gronlund (1973) provides a helpful guide for this purpose, as well as a simple and
well-balanced introduction to criterion-referenced testing. A brief but excellent discussion of the chief limitations of criterion-referenced tests is given by Ebel (1972b).
content meaning.
The major distinguishing feature of criterion-referenced testing (however defined and whether
designated by this term or by one of its synonyms) is its interpretation of test performance in terms
of content meaning. The focus is clearly on what the person can do and what he knows, not on how
he compares with others. A fundamental requirement in constructing this type of test is a clearly
defined domain of knowledge or skills to be assessed by the test. If scores on such a test are to have
communicable meaning, the content domain to be sampled must be widely recognized as important.
The selected domain must then be subdivided into small units defined in performance terms. In an
educational context, these units correspond to behaviorally defined instructional objectives, such as
“multiplies three-digit by two-digit numbers” or “identifies the misspelled word in which the final e is retained when adding -ing.” In the programs prepared for individualized instruction, these objectives
run to several hundred for a single school subject. After the instructional objectives have been
formulated, items are prepared to sample each objective. This procedure is admittedly difficult and
time consuming. Without such careful specification and control of content, however, the results of
criterion-referenced testing could degenerate into an idiosyncratic and uninterpretable jumble.
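The kind of objective-by-objective reporting described here might be sketched as follows. The objective names, item results, and the 80 percent mastery cutoff are illustrative assumptions, not drawn from any published program.

```python
# A sketch of criterion-referenced score reporting: results are stated per
# behaviorally defined instructional objective, not relative to other
# examinees. All names, data, and the cutoff below are hypothetical.
def mastery_report(item_results, cutoff=0.8):
    """item_results maps each objective to its pass (1) / fail (0) items;
    returns each objective's proportion passed and a mastery decision."""
    report = {}
    for objective, items in item_results.items():
        proportion = sum(items) / len(items)
        report[objective] = (proportion, proportion >= cutoff)
    return report

results = {
    "multiplies three-digit by two-digit numbers": [1, 1, 1, 0, 1],
    "retains final e when adding a suffix": [1, 0, 0, 1, 0],
}
for objective, (p, mastered) in mastery_report(results).items():
    print(f"{objective}: {p:.0%}", "mastered" if mastered else "not mastered")
```

Such a report says what the examinee can do; it carries no information about his standing relative to a normative group.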
When strictly applied, criterion-referenced testing is best adapted for testing basic skills (as in
reading and arithmetic) at elementary levels. In these areas, instructional objectives can also be
arranged in an ordinal hierarchy, the acquisition of more elementary skills being prerequisite to the
acquisition of higher-level skills.6 It is impracticable and probably undesirable, however, to formulate
highly specific objectives for advanced levels of knowledge in less highly structured subjects. At
these levels, both the content and sequence of learning are likely to be much more flexible. On the
other hand, in its emphasis on content meaning in the interpretation of test scores, criterion-referenced testing may exert a salutary effect on testing in general. The interpretation of intelligence
test scores, for example, would benefit from this approach. To describe a child’s intelligence test
performance in terms of the specific intellectual skills and knowledge it represents might help to
counteract the confusions and misconceptions that have become attached to the IQ. When stated in
these general terms, however, the criterion-referenced approach is equivalent to interpreting test
scores in the light of the demonstrated validity of the particular test, rather than in terms of vague
underlying entities. Such an interpretation can certainly be combined with norm-referenced scores