TEST DEVELOPMENT
5 STAGES OF TEST DEVELOPMENT
Test Conceptualization
Test Construction
Test Tryout
Item Analysis
Test Revision
1. TEST CONCEPTUALIZATION
The beginnings of any
published test can probably be
traced to thoughts – self talk, in
behavioral terms.
SOME PRELIMINARY QUESTIONS
What is the test designed to measure?
What is the objective of the test?
Is there a need for this test?
Who will use this test?
Who will take this test?
What content will the test cover?
SOME PRELIMINARY QUESTIONS
How will the test be administered?
What is the ideal format of the test?
Should more than one form of the test be developed?
What special training will be required of test users for
administering or interpreting the test?
What type of responses will be required of test takers?
Who benefits from an administration of this test?
SOME PRELIMINARY QUESTIONS
Is there any potential harm as the
result of an administration of this test?
How will meaning be attributed to
scores on this test?
2. TEST CONSTRUCTION
2A. SCALING
The process of setting rules for
assigning numbers in measurement
The process by which a measuring
device is designed and calibrated
and by which numbers – scale values
– are assigned to different amounts of
the trait, attribute, or characteristic
being measured.
TYPES OF SCALES
If the test taker’s test performance as a function of
age is of critical interest, then the test might be
referred to as an age-based scale.
If the test taker’s test performance as a function of
grade is of critical interest, then the test might be
referred to as a grade-based scale.
If all raw scores on the test are to be transformed into
scores that can range from 1 to 9, then the test might
be referred to as a stanine scale.
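As an illustration of the stanine idea (not from the slides), raw scores are typically converted to stanines by mapping percentile ranks onto the conventional stanine bands of 4%, 7%, 12%, 17%, 20%, 17%, 12%, 7%, and 4%. A minimal Python sketch, using hypothetical scores and a simplified ranking that ignores ties:

# Illustrative sketch (hypothetical scores): map percentile ranks onto the
# conventional stanine bands; ties are ignored for simplicity.
def to_stanines(raw_scores):
    n = len(raw_scores)
    order = sorted(range(n), key=lambda i: raw_scores[i])     # lowest score first
    cumulative = [0.04, 0.11, 0.23, 0.40, 0.60, 0.77, 0.89, 0.96, 1.00]
    stanines = [0] * n
    for rank, idx in enumerate(order):
        percentile = (rank + 1) / n
        # first stanine band whose cumulative proportion covers this percentile
        stanines[idx] = next(s for s, cut in enumerate(cumulative, start=1)
                             if percentile <= cut)
    return stanines

print(to_stanines([55, 62, 70, 71, 80, 85, 90, 95, 99, 100]))  # each raw score mapped to a stanine from 1 to 9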
SCALING METHODS
Generally speaking, a test taker is presumed to have
more or less of the characteristic measured by a
(valid) test as a function of the test score.
The higher or lower the score, the more or less the
characteristic the test taker presumably possesses.
But how are numbers assigned to responses so that a
test score can be calculated? This is done through
scaling the test items, using any one of several
available methods.
SCALING METHODS
When the final test score is obtained by summing the
ratings across all the items, it is termed a summative
scale.
One type of summative rating scale, the Likert scale is
used extensively in psychology, usually to scale
attitudes.
Each item presents the test taker with five alternative
responses (sometimes seven), usually on an agree-
disagree or approve-disapprove continuum.
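A minimal Python sketch of summative (Likert-type) scoring, using hypothetical items and ratings; items assumed to be worded in the opposite direction are reverse-scored before summing:

# Minimal sketch (hypothetical items and ratings): a summative Likert-type scale.
responses = {"item1": 4, "item2": 2, "item3": 5, "item4": 1}   # ratings on a 1-5 agree-disagree continuum
reverse_keyed = {"item2", "item4"}                             # items worded in the opposite direction

def summative_score(responses, reverse_keyed, min_rating=1, max_rating=5):
    total = 0
    for item, rating in responses.items():
        if item in reverse_keyed:
            rating = (max_rating + min_rating) - rating        # reverse-score the item
        total += rating
    return total

print(summative_score(responses, reverse_keyed))               # higher total = stronger attitude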
SCALING METHODS
Another scaling method that produces ordinal data is
the method of paired comparisons.
Test takers are presented with pairs of stimuli, which
they are asked to compare.
They must select one of the stimuli according to some
rule; for example, the rule that they agree more with
one statement than the other, or the rule that they
find one stimulus more appealing than the other.
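One way the method of paired comparisons can yield ordinal data is to count how often each stimulus is selected over the alternatives. A small Python sketch with hypothetical stimuli and choices:

# Minimal sketch (hypothetical stimuli): each tuple is (pair presented, stimulus chosen).
from collections import Counter

choices = [
    (("A", "B"), "A"),
    (("A", "C"), "C"),
    (("B", "C"), "C"),
]

counts = Counter({"A": 0, "B": 0, "C": 0})                     # start every stimulus at zero
counts.update(selected for _pair, selected in choices)         # times each stimulus was preferred
ranking = [stimulus for stimulus, _n in counts.most_common()]  # ordinal preference ranking
print(counts, ranking)                                         # C preferred most, then A, then B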
SCALING METHODS
Another way of deriving ordinal information through
a scaling system entails sorting tasks.
One method of sorting, comparative scaling, entails
judgments of a stimulus in comparison with every other
stimulus on the scale.
Example: Test takers would be asked to sort the cards
from most justifiable to least justifiable.
SCALING METHODS
Another scaling system that relies on sorting is
categorical scaling
Stimuli are placed into one of two or more alternative
categories that differ quantitatively with respect to some
continuum.
Example: test takers would be asked to sort the cards
into three piles: those behaviors that are never
justified, those that are sometimes justified, and those
that are always justified.
SCALING METHODS
A Guttman scale is yet another scaling method that yields ordinal-level
measures.
Items on it range sequentially from weaker to stronger expressions of the
attitude, belief, or feeling being measured.
A feature of Guttman scales is that all respondents who agree with the
stronger statements of the attitude will also agree with milder statements.
The resulting data are then analyzed by means of scalogram analysis, an
item-analysis procedure and approach to test development that involves a
graphic mapping of a test taker's responses.
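A quick Python sketch of the cumulative-pattern logic behind a Guttman scale, with hypothetical 0/1 endorsements ordered from mildest to strongest item; scalogram analysis proper goes further, but the core check looks like this:

# Minimal sketch: 0/1 endorsements ordered from mildest to strongest item.
def is_guttman_pattern(responses):
    # A scalable pattern is a run of 1s followed only by 0s, e.g. 1, 1, 1, 0, 0.
    seen_zero = False
    for endorsed in responses:
        if endorsed == 0:
            seen_zero = True
        elif seen_zero:            # endorsing a stronger item after rejecting a milder one
            return False
    return True

print(is_guttman_pattern([1, 1, 1, 0, 0]))   # True: fits the cumulative ordering
print(is_guttman_pattern([1, 0, 1, 0, 0]))   # False: a scalogram "error"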
SCALING METHODS
All the foregoing methods yield
ordinal data.
The method of equal-appearing
intervals, first described by Thurstone, is
one scaling method used to obtain
data that are presumed to be interval
in nature.
2B. WRITING ITEMS
The prospective test developer or item writer
immediately faces three questions related to the test
blueprint:
1. What range of content should the items cover?
2. Which of the many different types of item formats
should be employed?
3. How many items should be written in total and for
each content area covered?
2B. WRITING ITEMS
When devising a standardized test using a
multiple-choice format, it is usually advisable
that the first draft contain approximately twice
the number of items that the final version will
contain.
An item pool is the reservoir or well from which
items will or will not be drawn for the final version of the
test.
ITEM FORMAT
Variables such as the form, plan, structure,
arrangement, and layout of individual test
items
Selected response format – requires test takers
to select a response from a set of alternative
responses
Constructed response format – requires test
takers to supply or create the correct answer,
not merely to select it.
TYPES OF SELECTED
RESPONSE FORMAT
Multiple choice
Matching
True-false
MULTIPLE CHOICE
Has 3 elements:
1. Stem
2. A correct alternative or
option
3. Several incorrect
alternatives or options
MULTIPLE CHOICE
Advantages:
Can sample a great deal of content
in relatively short time
Allows for precise interpretation and
little “bluffing” other than guessing
May be machine or computer
scored
MULTIPLE CHOICE
Disadvantages:
Does not allow for expression of
original or creative thought
Not all subject matter lends itself to
reduction to one and only one
answer keyed correct
May be time-consuming to construct
series of good items
MATCHING ITEM
The test taker is presented
with two columns: premises
(left) and responses (right)
The test taker’s task is to
determine which response is
best associated with which
premise.
MATCHING ITEM
Providing more options than needed minimizes the possibility
of the test taker getting a perfect score even though the
test taker did not actually know all the answers
Another way to lessen the probability that guessing is a
factor in the test score is to state in the
directions that each response may be a correct answer
once, more than once, or not at all.
The wording of the premise and the responses should be
fairly short and to the point
No more than a dozen or so premises should be included
The lists of premises and responses should both be
homogeneous
MATCHING ITEM
Advantages:
Can effectively and efficiently be used to
evaluate a test taker's recall of related facts
Particularly useful when there are a large
number of facts on a single topic
Can be fun or game-like for test taker
(especially the well prepared test taker)
May be machine or computer scored
MATCHING ITEM
Disadvantages:
One of the choices may
help eliminate one of
the other choices as the
correct response
TRUE-FALSE ITEM
A multiple choice item that contains only two possible
responses is called a binary-choice item
Perhaps the most familiar binary-choice item is the true-
false item
This type of selected-response item usually takes the form
of a sentence that requires the test taker to indicate
whether the statement is or is not a fact.
Other varieties of binary-choice items include sentences to
which the test taker responds with one of two responses,
such as agree or disagree, yes or no, right or wrong, or
fact or opinion.
TRUE-FALSE ITEM
A good binary-choice item
contains a single idea,
is not excessively long,
and is not subject to
debate
TRUE-FALSE ITEM
Advantages:
Can sample a great deal of
content in relatively short time
Test consisting of such items is
relatively easy to construct
May be machine or computer
scored
TRUE-FALSE ITEM
Disadvantages:
Susceptibility to guessing,
especially for “test wise”
students who may detect
cues to reject one choice
or the other
TYPES OF CONSTRUCTED
RESPONSE FORMAT
Completion items
Short answer
essay
COMPLETION ITEM
Requires the examinee to
provide a word or phrase that
completes a sentence
A good completion item
should be worded so that the
correct answer is specific
SHORT-ANSWER ITEM
A completion item may also be
referred to as a short-answer item.
It is desirable for completion or
short-answer items to be written
clearly enough that the test
taker can respond succinctly –
that is, with a short answer
COMPLETION/SHORT
ANSWER ITEM
Advantages:
A wide content area, particularly
one involving questions that require
factual recall, can be sampled in a
relatively brief amount of time
Relatively easy to construct
COMPLETION/SHORT
ANSWER ITEM
Disadvantages:
May demonstrate only recall of
circumscribed facts or bits of knowledge
Potential for inter-scorer reliability
problems when the test is scored by more
than one person
May not be machine or computer scored
ESSAY ITEM
Requires a test taker to
respond to a question by
writing a composition,
typically one that
demonstrates recall of facts,
understanding, analysis
and/or interpretation
ESSAY ITEM
Advantages:
Useful in measuring responses that require
complex, imaginative, or original solutions,
applications, or demonstrations
Useful in measuring how well test taker is
able to communicate ideas in writing
Requires the test taker to generate the entire
response, not merely recognize it or
supply a word or two.
ESSAY ITEM
Disadvantages:
May not sample as wide a content area as other item formats do
Test taker with limited knowledge can attempt to bluff with confusing,
sometimes long and elaborate writing designed to be as broad and
ambiguous as possible
Scoring can be time consuming and fraught with pitfalls
When more than one person is scoring, inter-scorer reliability issues may
be raised
May rely too heavily on writing skills, even to the point of confounding
writing ability with what is purportedly being measured
May not be machine or computer scored
WRITING ITEMS FOR COMPUTER
ADMINISTRATION
A number of widely available computer
programs are designed to facilitate the
construction of tests as well as their
administration, scoring, and interpretation
These programs typically make use of two
advantages of digital media: the ability to
store items in an item bank and the ability to
individualize testing through a technique
called item branching.
COMPUTER ADAPTIVE TESTING
refers to an interactive, computer-administered test taking process
wherein items presented to the test taker are based in part on the
test taker’s performance on previous items.
Has been found to reduce the number of test items that need to be
administered by as much as 50% while simultaneously reducing
measurement error by 50%
Tends to reduce floor effects and ceiling effects
Floor effect refers to the diminished utility of an assessment tool for
distinguishing test takers at the low end of the ability, trait, or other
attribute being measured
Ceiling effect refers to the diminished utility of an assessment tool for
distinguishing test takers at the high end of the ability, trait, or other
attribute being measured.
ITEM BRANCHING
The ability of the computer
to tailor the content and
order of presentation of test
items on the basis of
responses to previous items
ITEM BRANCHING
A computer that has stored a bank of achievement test items of
different difficulty levels can be programmed to present items
according to an algorithm or rule.
For example, one rule might be, “don’t present an item of the next
difficulty level until two consecutive items of the current difficulty level
are answered correctly.”
Another rule might be “terminate the test when five consecutive
items of a given level of difficulty have been answered incorrectly”
Alternatively, the pattern of items to which the test taker is exposed
may be based not on the test taker’s response to preceding items
but on a random drawing from the total pool of test items.
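A minimal Python sketch of how the two rules quoted above might be coded; the item-bank structure, levels, and function names are hypothetical, not taken from any particular testing system:

# Illustrative sketch only; the item bank, difficulty levels, and answer_fn are hypothetical.
def adaptive_session(item_bank, answer_fn, start_level=1):
    # item_bank: dict mapping difficulty level -> list of items
    # answer_fn(item) -> True (correct) or False (incorrect), simulating the test taker
    level = start_level
    consecutive_correct = 0
    consecutive_wrong = 0
    administered = []
    while level in item_bank and item_bank[level]:
        item = item_bank[level].pop(0)
        administered.append(item)
        if answer_fn(item):
            consecutive_correct += 1
            consecutive_wrong = 0
            if consecutive_correct == 2:      # rule 1: advance a level after two in a row correct
                level += 1
                consecutive_correct = 0
        else:
            consecutive_wrong += 1
            consecutive_correct = 0
            if consecutive_wrong == 5:        # rule 2: terminate after five in a row incorrect
                break
    return administered

bank = {1: ["easy1", "easy2", "easy3"], 2: ["med1", "med2", "med3"], 3: ["hard1", "hard2"]}
print(adaptive_session(bank, answer_fn=lambda item: not item.startswith("hard")))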
ITEM BRANCHING
Item branching technology may
be applied when constructing
tests not only of achievement
but also of personality.
May be used in personality tests
to recognize nonpurposive or
inconsistent responding
2C. SCORING ITEMS
The most commonly used model, owing in part to its simplicity and logic, is the
cumulative method
Typically, the rule in a cumulatively scored test is that the higher the score on
the test, the higher the test taker is on the ability, trait or other characteristic
that the test purports to measure
In tests that employ class or category scoring, test taker responses earn
credit toward placement in a particular class or category with other test
takers whose pattern of responses is presumably similar in some way.
This approach is used by some diagnostic systems wherein individuals must
exhibit a certain number of symptoms to qualify for a specific diagnosis
A third scoring method, ipsative scoring, departs radically in rationale from
either the cumulative or class models
A typical objective in ipsative scoring is comparing a test taker's score on
one scale within a test to the test taker's score on another scale within that same test.
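A short Python sketch contrasting the cumulative and ipsative ideas on hypothetical data (class/category scoring is omitted):

# Minimal sketch (hypothetical data).
# Cumulative model: sum the item scores; the higher the total, the more of the
# characteristic the test taker presumably possesses.
item_scores = [1, 0, 1, 1, 1, 0, 1]                       # 1 = keyed response endorsed
cumulative_score = sum(item_scores)

# Ipsative model: interpret each scale score relative to the same test taker's
# other scale scores, not relative to other test takers.
scale_scores = {"dominance": 18, "affiliation": 12, "autonomy": 15}
person_mean = sum(scale_scores.values()) / len(scale_scores)
ipsative_profile = {scale: round(score - person_mean, 2)
                    for scale, score in scale_scores.items()}

print(cumulative_score, ipsative_profile)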
3. TEST TRYOUT
The test should be tried out on people who are similar in critical respects to
the people for whom the test was designed
An informal rule of thumb is that there should be no fewer than five subjects
and preferably as many as ten for each item on the test
A definite risk in using too few subjects during test tryout comes during factor
analysis of the findings, when what we might call phantom factors – factors
that actually are just artifacts of the small sample size – may emerge.
The test tryout should be executed under conditions as identical as possible
to the conditions under which the standardized test will be administered
WHAT IS A GOOD ITEM?
A good test item is
reliable and valid
A good item helps to
discriminate among test takers
ITEM ANALYSIS
Item difficulty index
Item reliability index
Item validity index
Item discrimination index
ITEM DIFFICULTY INDEX
If everyone gets the item right then the item is too easy; if everyone gets the
item wrong, the item is too difficult
An index of an item’s difficulty is obtained by calculating the proportion of
the total number of test takers who answered the item correctly
A lowercase italic “p” (p) is used to denote item difficulty and a subscript
refers to the item number.
The value of an item difficulty index can theoretically range from 0 (no one
got the item right) to 1 (if everyone got the item right)
Example: if 50 out of the 100 examinees answer item 2 correctly, then the
item difficulty index for this item would be equal to 50 divided by 100, or .5
The statistic referred to as an item difficulty index in the context of
achievement testing may be referred to as an item endorsement index in other
contexts, such as personality testing
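A minimal Python sketch computing p for each item from a hypothetical scored response matrix (1 = correct, 0 = incorrect):

# Minimal sketch (hypothetical data): rows are test takers, columns are items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]

n_takers = len(responses)
p = [sum(column) / n_takers for column in zip(*responses)]   # proportion correct per item
print(p)   # [0.75, 0.75, 0.25]: item 3 is the most difficult of the three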
ITEM RELIABILITY INDEX
Provides an indication of the internal consistency of a test
This index is equal to the product of the item score standard deviation (s)
and the correlation (r) between the item score and the total test score
A statistical tool useful in determining whether items on a test appear to
measure the same thing(s) is factor analysis.
Through the judicious use of factor analysis, items that do not “load on” the
factor that they were written to tap can be revised or eliminated
If too many items are tapping a particular area, the weakest of such items
can be eliminated
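A minimal Python sketch of the item reliability index as the product of the item-score standard deviation and the item-total correlation, computed on hypothetical data:

# Minimal sketch (hypothetical data): item reliability index = s * r, where s is the
# item-score standard deviation and r is the item-total correlation.
import statistics

def pearson_r(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

item_scores = [1, 0, 1, 1, 0, 1]             # one item, scored for six test takers
total_scores = [48, 30, 45, 50, 28, 41]      # each test taker's total test score

s = statistics.pstdev(item_scores)           # item score standard deviation
r = pearson_r(item_scores, total_scores)     # item-total correlation
print(round(s * r, 3))                       # the item reliability index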
ITEM VALIDITY INDEX
A statistic designed to
provide an indication of the
degree to which a test is
measuring what it purports
to measure
ITEM DISCRIMINATION INDEX
Measures of item discrimination
indicate how adequately an
item separates or discriminates
between high scorers and low
scorers on an entire test
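The slide does not give a formula, but one common formulation is d = (U - L) / n, where U and L are the numbers of test takers in the upper and lower scoring groups who answered the item correctly and n is the size of either group. A minimal Python sketch with hypothetical counts:

# Hedged sketch (hypothetical counts): d = (U - L) / n, one common formulation.
def discrimination_index(upper_correct, lower_correct, group_size):
    return (upper_correct - lower_correct) / group_size

# 27 test takers in each extreme scoring group; 24 high scorers and 9 low scorers
# answered the item correctly.
d = discrimination_index(upper_correct=24, lower_correct=9, group_size=27)
print(round(d, 2))   # 0.56: the item separates high from low scorers reasonably well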
QUALITATIVE ITEM ANALYSIS
Techniques of data generation
and analysis that rely primarily on
verbal rather than mathematical
or statistical procedures
4. TEST REVISION
The development process that the test
undergoes as it is modified and revised
When the item analysis of data derived from a
test administration indicates that the test is not
yet in finished form, the steps of revision, tryout,
and item analysis are repeated until the test is
satisfactory and standardization can occur
TEST REVISION IN THE LIFE
OF AN EXISTING TEST
Many tests are deemed to be due for revision when
any of the following conditions exist:
The stimulus materials look dated and current test
takers cannot relate to them
The verbal content of the test, including the
administration instructions and the test items, contains
dated vocabulary that is not readily understood by
current test takers
TEST REVISION IN THE LIFE
OF AN EXISTING TEST
Many tests are deemed to be due for revision when any of the
following conditions exist:
As popular culture changes and words take on new meanings,
certain words or expressions in the test items or directions may
be perceived as inappropriate or even offensive to a particular
group and must therefore be changed
The test norms are no longer adequate as a result of group
membership changes in the population of potential test takers
The test norms are no longer adequate as a result of age-
related shifts in the abilities measured over time, and so an age
extension of the norms is necessary
TEST REVISION IN THE LIFE
OF AN EXISTING TEST
Many tests are deemed to be due for revision when
any of the following conditions exist:
The reliability or the validity of the test, as well as the
effectiveness of individual test items, can be
significantly improved by a revision
The theory on which the test was originally based has
been improved significantly, and these changes
should be reflected in the design and content of the
test
CROSS VALIDATION
The revalidation of a test on a
sample of test takers other than
those on whom test performance
was originally found to be a valid
predictor of some criterion.
Validity shrinkage: the decrease in item validities
that inevitably occurs after cross-validation of findings
CO-VALIDATION
A test validation process conducted
on two or more tests using the same
sample of test takers
When used in conjunction with the
creation of norms or the revision of
existing norms, this process may also
be referred to as co-norming