TEST DEVELOPMENT

5 STAGES OF TEST DEVELOPMENT

Test Conceptualization
Test Construction
Test Tryout
Item Analysis
Test Revision
1. TEST CONCEPTUALIZATION
The beginnings of any published test can probably be traced to thoughts – self-talk, in behavioral terms.
SOME PRELIMINARY QUESTIONS

 What is the test designed to measure?
 What is the objective of the test?
 Is there a need for this test?
 Who will use this test?
 Who will take this test?
 What content will the test cover?
SOME PRELIMINARY QUESTIONS

 How will the test be administered?
 What is the ideal format of the test?
 Should more than one form of the test be developed?
 What special training will be required of test users for administering or interpreting the test?
 What type of responses will be required of test takers?
 Who benefits from an administration of this test?
SOME PRELIMINARY QUESTIONS

 Is there any potential harm as the result of an administration of this test?
 How will meaning be attributed to scores on this test?
2. TEST CONSTRUCTION
2A. SCALING
The process of setting rules for assigning numbers in measurement
The process by which a measuring device is designed and calibrated and by which numbers – scale values – are assigned to different amounts of the trait, attribute, or characteristic being measured.
TYPES OF SCALES
 If the test taker’s test performance as a function of
age is of critical interest, then the test might be
referred to as an age-based scale.
 If the test taker’s test performance as a function of
grade is of critical interest, then the test might be
referred to as a grade-based scale.
 If all raw scores on the test are to be transformed into
scores that can range from 1 to 9, then the test might
be referred to as a stanine scale.
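As a rough illustration of the stanine idea, here is a minimal Python sketch that rescales hypothetical raw scores to a 1-9 range using a simple linear approximation (mean 5, SD 2, rounded and clipped). Published stanine scales typically assign fixed percentages of scores to each stanine, so this is only a simplification.

```python
import statistics

def to_stanine(raw_scores):
    """Convert raw scores to stanines (1-9) via a linear approximation:
    standardize, rescale to mean 5 and SD 2, round, and clip to 1-9."""
    mean = statistics.mean(raw_scores)
    sd = statistics.stdev(raw_scores)
    stanines = []
    for x in raw_scores:
        z = (x - mean) / sd                  # standardize the raw score
        s = round(5 + 2 * z)                 # rescale to mean 5, SD 2
        stanines.append(min(9, max(1, s)))   # clip to the 1-9 range
    return stanines

# Hypothetical raw scores for nine test takers
print(to_stanine([12, 15, 18, 22, 25, 28, 31, 35, 40]))
```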
SCALING METHODS
 Generally speaking, a test taker is presumed to have
more or less of the characteristic measured by a
(valid) test as a function of the test score.
 The higher or lower the score, the more or less the
characteristic the test taker presumably possesses.
 But how are numbers assigned to responses so that a
test score can be calculated? This is done through
scaling the test items, using any one of several
available methods.
SCALING METHODS
 When the final test score is obtained by summing the ratings across all the items, it is termed a summative scale.
 One type of summative rating scale, the Likert scale is
used extensively in psychology, usually to scale
attitudes.
 Each item presents the test taker with five alternative
responses (sometimes seven), usually on an agree-
disagree or approve-disapprove continuum.
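A minimal Python sketch of summative (Likert-type) scoring follows. The ratings and the set of reverse-keyed items are hypothetical; the total is simply the sum of the (re-keyed) item ratings.

```python
# Hypothetical 5-point agree-disagree ratings for one respondent
# across six attitude items (1 = strongly disagree, 5 = strongly agree).
ratings = [4, 5, 3, 4, 2, 5]

# Hypothetical indices of items worded in the opposite direction,
# which are reverse-keyed before summing.
reverse_keyed = {4}

def summative_score(ratings, reverse_keyed, scale_max=5):
    total = 0
    for i, r in enumerate(ratings):
        if i in reverse_keyed:
            r = (scale_max + 1) - r   # e.g., a 2 on a 5-point item becomes a 4
        total += r
    return total

print(summative_score(ratings, reverse_keyed))  # higher total = stronger attitude
```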
SCALING METHODS
 Another scaling method that produces ordinal data is
the method of paired comparisons.
 Test takers are presented with pairs of stimuli, which
they are asked to compare.
 They must select one of the stimuli according to some rule; for example, the rule that they agree more with one statement than the other, or the rule that they find one stimulus more appealing than the other.
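One simple way to turn paired-comparison choices into an ordinal ranking is to give a stimulus a point each time it is selected. The sketch below uses hypothetical statements and a simulated test taker; actual scoring schemes may instead award points only when the selected stimulus matches a judges' keying.

```python
from itertools import combinations

# Hypothetical stimuli to be compared in pairs.
statements = ["A", "B", "C", "D"]

def paired_comparison_scores(choices):
    """`choices` maps each (stimulus_1, stimulus_2) pair to the stimulus the
    test taker selected; each selection earns that stimulus one point."""
    scores = {s: 0 for s in statements}
    for pair in combinations(statements, 2):
        scores[choices[pair]] += 1
    return scores

# Simulated test taker who always prefers the alphabetically earlier statement.
choices = {pair: min(pair) for pair in combinations(statements, 2)}
print(paired_comparison_scores(choices))  # {'A': 3, 'B': 2, 'C': 1, 'D': 0}
```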
SCALING METHODS
 Another way of deriving ordinal information through a scaling system entails sorting tasks.
 One method of sorting, comparative scaling, entails
judgments of a stimulus in comparison with every other
stimulus on the scale.
 Example: Test takers would be asked to sort the cards
from most justifiable to least justifiable.
SCALING METHODS
 Another scaling system that relies on sorting is categorical scaling.
 Stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum.
 Example: test takers would be asked to sort the cards
into three piles: those behaviors that are never
justified, those that are sometimes justified, and those
that are always justified.
SCALING METHODS
 A Guttman scale is yet another scaling method that yields ordinal-level measures.
 Items on it range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured.
 A feature of Guttman scales is that all respondents who agree with the stronger statements of the attitude will also agree with milder statements.
 The resulting data are then analyzed by means of scalogram analysis, an item-analysis procedure and approach to test development that involves a graphic mapping of a test taker's responses.
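A minimal sketch of the Guttman property: with items ordered from mildest to strongest, a response pattern is perfectly scalable only if the endorsements form an unbroken run starting at the mildest item. The 0/1 data are hypothetical.

```python
def guttman_consistent(responses):
    """`responses` holds 0/1 agreements with items ordered from the mildest to
    the strongest expression of the attitude. The pattern fits a perfect
    Guttman scale if agreement with a stronger item implies agreement with
    every milder item, i.e. the 1s form an unbroken run at the start."""
    pattern = "".join(str(r) for r in responses)
    return "01" not in pattern   # a 0 followed by a 1 would be a violation

print(guttman_consistent([1, 1, 1, 0, 0]))  # True: agrees up to a point, then stops
print(guttman_consistent([1, 0, 1, 0, 0]))  # False: endorses a stronger but not a milder item
```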
SCALING METHODS
All the foregoing methods yield ordinal data.
The method of equal-appearing intervals, first described by Thurstone, is one scaling method used to obtain data that are presumed to be interval in nature.
2B. WRITING ITEMS
The prospective test developer or item writer
immediately faces three questions related to the test
blueprint:
1. What range of content should the items cover?
2. Which of the many different types of item formats
should be employed?
3. How many items should be written in total and for each content area covered?
2B. WRITING ITEMS
 When devising a standardized test using a
multiple-choice format, it is usually advisable
that the first draft contain approximately twice
the number of items that the final version will
contain.
 An item pool is the reservoir or well from which items will or will not be drawn for the final version of the test.
ITEM FORMAT
 Variables such as the form, plan, structure,
arrangement, and layout of individual test
items
 Selected response format – requires test takers
to select a response from a set of alternative
responses
 Constructed response format – requires test
takers to supply or create the correct answer,
not merely to select it.
TYPES OF SELECTED
RESPONSE FORMAT
Multiple choice
Matching
True-false
MULTIPLE CHOICE
Has 3 elements:
1. Stem
2. A correct alternative or
option
3. Several incorrect
alternatives or options
MULTIPLE CHOICE
Advantages:
Can sample a great deal of content
in relatively short time
Allows for precise interpretation and
little “bluffing” other than guessing
May be machine or computer
scored
MULTIPLE CHOICE
Disadvantages:
Does not allow for expression of
original or creative thought
Not all subject matter lends itself to reduction to one and only one answer keyed correct
May be time-consuming to construct
series of good items
MATCHING ITEM
The test taker is presented
with two columns: premises
(left) and responses (right)
The test taker’s task is to
determine which response is
best associated with which
premise.
MATCHING ITEM
 Providing more options than needed minimizes the possibility of the test taker getting a perfect score even though the test taker did not actually know all the answers
 Another way to lessen the probability that guessing is a factor in the test score is to state in the directions that each response may be a correct answer once, more than once, or not at all.
 The wording of the premises and the responses should be fairly short and to the point
 No more than a dozen or so premises should be included
 The lists of premises and responses should both be homogeneous
MATCHING ITEM
Advantages:
Can effectively and efficiently be used to
evaluate test taker’s recall of related facts
Particularly useful when there are a large
number of facts on a single topic
Can be fun or game-like for test taker
(especially the well prepared test taker)
May be machine or computer scored
MATCHING ITEM
Disadvantages:
One of the choices may
help eliminate one of
the other choices as the
correct response
TRUE-FALSE ITEM
 A multiple choice item that contains only two possible
responses is called a binary-choice item
 Perhaps the most familiar binary-choice item is the true-
false item
 This type of selected-response item usually takes the form
of a sentence that requires the test taker to indicate
whether the statement is or is not a fact.
 Other varieties of binary-choice items include sentences to
which the test taker responds with one of two responses,
such as agree or disagree, yes or no, right or wrong, or
fact or opinion.
TRUE-FALSE ITEM
A good binary-choice item contains a single idea, is not excessively long, and is not subject to debate.
TRUE-FALSE ITEM
Advantages:
Can sample a great deal of
content in relatively short time
Test consisting of such items is
relatively easy to construct
May be machine or computer
scored
TRUE-FALSE ITEM
Disadvantages:
Susceptibility to guessing,
especially for “test wise”
students who may detect
cues to reject one choice
or the other
TYPES OF CONSTRUCTED
RESPONSE FORMAT
Completion items
Short answer
Essay
COMPLETION ITEM
Requires the examinee to
provide a word or phrase that
completes a sentence
A good completion item
should be worded so that the
correct answer is specific
SHORT-ANSWER ITEM
A completion item may also be referred to as a short-answer item.
it is desirable for completion or
short answer items to be written
clearly enough that the test
taker can respond succinctly –
that is with a short answer
COMPLETION/SHORT
ANSWER ITEM
Advantages:
Wide content area, particularly
of questions that require factual
recall, can be sampled in
relatively brief amount of time
Relatively easy to construct
COMPLETION/SHORT
ANSWER ITEM
Disadvantages:
May demonstrate only recall of
circumscribed facts or bits of knowledge
Potential for inter-scorer reliability problems when test is scored by more than one person
May not be machine or computer scored
ESSAY ITEM
Requires a test taker to
respond to a question by
writing a composition,
typically one that
demonstrates recall of facts,
understanding, analysis
and/or interpretation
ESSAY ITEM
Advantages:
Useful in measuring responses that require
complex, imaginative, or original solutions,
applications, or demonstrations
Useful in measuring how well test taker is
able to communicate ideas in writing
Requires test taker to generate entire response, not merely recognize it or supply a word or two.
ESSAY ITEM
Disadvantages:
 May not sample as wide a content area as other types of test items do
 Test taker with limited knowledge can attempt to bluff with confusing,
sometimes long and elaborate writing designed to be as broad and
ambiguous as possible
 Scoring can be time consuming and fraught with pitfalls
 When more than one person is scoring, inter-scorer reliability issues may
be raised
 May rely too heavily on writing skills, even to the point of confounding
writing ability with what is purportedly being measured
 May not be machine or computer scored
WRITING ITEMS FOR COMPUTER
ADMINISTRATION
 A number of widely available computer
programs are designed to facilitate the
construction of tests as well as their
administration, scoring, and interpretation
 These programs typically make use of two
advantages of digital media: the ability to
store items in an item bank and the ability to
individualize testing through a technique
called item branching.
COMPUTER ADAPTIVE TESTING
 refers to an interactive, computer-administered test taking process
wherein items presented to the test taker are based in part on the
test taker’s performance on previous items.
 Has been found to reduce the number of test items that need to be
administered by as much as 50% while simultaneously reducing
measurement error by 50%
 Tends to reduce floor effects and ceiling effects
 Floor effect refers to the diminished utility of an assessment tool for
distinguishing test takers at the low end of the ability, trait, or other
attribute being measured
 Ceiling effect refers to the diminished utility of an assessment tool for
distinguishing test takers at the high end of the ability, trait, or other
attribute being measured.
ITEM BRANCHING
The ability of the computer
to tailor the content and
order of presentation of test
items on the basis of
responses to previous items
ITEM BRANCHING
 A computer that has stored a bank of achievement test items of
different difficulty levels can be programmed to present items
according to an algorithm or rule.
 For example, one rule might be, “don’t present an item of the next
difficulty level until two consecutive items of the current difficulty level
are answered correctly.”
 Another rule might be “terminate the test when five consecutive
items of a given level of difficulty have been answered incorrectly”
 Alternatively, the pattern of items to which the test taker is exposed
may be based not on the test taker’s response to preceding items
but on a random drawing from the total pool of test items.
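A minimal Python sketch of the two example rules above, using a hypothetical item bank organized by difficulty level; a real computerized adaptive test would also track which items have already been administered and use a more sophisticated item-selection and stopping rule.

```python
import random

# Hypothetical item bank: achievement items grouped by difficulty level
# (1 = easiest). In this simplified sketch items may repeat, because we
# sample from each level with replacement.
item_bank = {
    1: ["1a", "1b", "1c"],
    2: ["2a", "2b", "2c"],
    3: ["3a", "3b", "3c"],
}

def administer(answer_item):
    """Present items according to the two example rules above.
    `answer_item(item_id)` returns True if the item is answered correctly."""
    level = 1
    consecutive_correct = 0
    consecutive_wrong = 0
    while level <= max(item_bank):
        item = random.choice(item_bank[level])
        if answer_item(item):
            consecutive_correct += 1
            consecutive_wrong = 0
            # Rule 1: present the next difficulty level only after two
            # consecutive items at the current level are answered correctly.
            if consecutive_correct == 2:
                level += 1
                consecutive_correct = 0
        else:
            consecutive_wrong += 1
            consecutive_correct = 0
            # Rule 2: terminate when five consecutive items have been
            # answered incorrectly at a given difficulty level.
            if consecutive_wrong == 5:
                break
    return level  # highest difficulty level reached

# Simulated test taker who answers everything except level-3 items correctly.
print(administer(lambda item: not item.startswith("3")))
```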
ITEM BRANCHING
Item branching technology may
be applied when constructing
tests not only of achievement
but also of personality.
May be used in personality tests to recognize nonpurposive or inconsistent responding
2C. SCORING ITEMS
 The most commonly used model, in part due to its simplicity and logic, is the cumulative model
 Typically, the rule in a cumulatively scored test is that the higher the score on the test, the higher the test taker is on the ability, trait, or other characteristic that the test purports to measure
 In tests that employ class or category scoring, test taker responses earn credit toward placement in a particular class or category with other test takers whose pattern of responses is presumably similar in some way.
 This approach is used by some diagnostic systems wherein individuals must exhibit a certain number of symptoms to qualify for a specific diagnosis
 A third scoring method, ipsative scoring, departs radically in rationale from either the cumulative or class models
 A typical objective in ipsative scoring is comparing a test taker's score on one scale within a test with his or her score on another scale within that same test.
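To illustrate the ipsative rationale, the sketch below (hypothetical scale names and scores) expresses each scale score relative to the test taker's own mean, so comparisons are made within the person rather than against other test takers.

```python
# Hypothetical scale scores for a single test taker on one multi-scale test.
scales = {"dominance": 18, "affiliation": 24, "autonomy": 12, "achievement": 21}

def ipsative_profile(scales):
    """Express each scale score as a deviation from the test taker's own mean,
    so scales are compared within the person rather than across people."""
    mean = sum(scales.values()) / len(scales)
    return {name: score - mean for name, score in scales.items()}

print(ipsative_profile(scales))
# Affiliation is this test taker's relatively strongest scale, regardless of norms.
```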
3. TEST TRYOUT
 The test should be tried out on people who are similar in critical respects to
the people for whom the test was designed
 An informal rule of thumb is that there should be no fewer than five subjects
and preferably as many as ten for each item on the test
 A definite risk in using too few subjects during test tryout comes during factor
analysis of the findings, when what we might call phantom factors – factors
that actually are just artifacts of the small sample size – may emerge.
 The test tryout should be executed under conditions as identical as possible
to the conditions under which the standardized test will be administered
WHAT IS A GOOD ITEM?
A good test item is
reliable and valid
A good item helps to
discriminate test takers
ITEM ANALYSIS
Item difficulty index
Item reliability index
Item validity index
Item discrimination index
ITEM DIFFICULTY INDEX
 If everyone gets the item right then the item is too easy; if everyone gets the
item wrong, the item is too difficult
 An index of an item’s difficulty is obtained by calculating the proportion of
the total number of test takers who answered the item correctly
 A lowercase italic “p” (p) is used to denote item difficulty and a subscript
refers to the item number.
 The value of an item difficulty index can theoretically range from 0 (no one
got the item right) to 1 (if everyone got the item right)
 Example: if 50 out of the 100 examinees answer item 2 correctly, then the
item difficulty index for this item would be equal to 50 divided by 100, or .5
 The statistic referred to as an item difficulty index in the context of
achievement testing may be an item endorsement index in other contexts
such as personality testing
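The calculation is simply a proportion; the short sketch below reproduces the example above of 50 out of 100 examinees answering item 2 correctly.

```python
def item_difficulty(responses):
    """Item difficulty index p: the proportion of test takers who answered
    the item correctly. `responses` is a list of 1s (correct) and 0s (incorrect)."""
    return sum(responses) / len(responses)

# 50 of 100 examinees answer item 2 correctly, as in the example above.
item_2 = [1] * 50 + [0] * 50
print(item_difficulty(item_2))  # 0.5
```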
ITEM RELIABILITY INDEX
 Provides an indication of the internal consistency of a test
 This index is equal to the product of the item score standard deviation (s)
and the correlation (r) between the item score and the total test score
 A statistical tool useful in determining whether items on a test appear to
measure the same thing(s) is factor analysis.
 Through the judicious use of factor analysis, items that do not “load on” the
factor that they were written to tap can be revised or eliminated
 If too many items are tapping a particular area, the weakest of such items
can be eliminated
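A minimal Python sketch of the index as defined above (item-score standard deviation multiplied by the item-total correlation), using hypothetical item and total scores. Note that statistics.correlation requires Python 3.10+, and the sample standard deviation is used here.

```python
import statistics

def item_reliability_index(item_scores, total_scores):
    """Item-score standard deviation (s) multiplied by the correlation (r)
    between the item scores and the total test scores."""
    s = statistics.stdev(item_scores)                        # sample SD of the item
    r = statistics.correlation(item_scores, total_scores)    # Pearson r (Python 3.10+)
    return s * r

# Hypothetical dichotomous item scores and total test scores for six examinees.
item = [1, 0, 1, 1, 0, 1]
total = [48, 35, 52, 45, 30, 50]
print(round(item_reliability_index(item, total), 3))
```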
ITEM VALIDITY INDEX
A statistic designed to
provide an indication of the
degree to which a test is
measuring what it purports
to measure
ITEM DISCRIMINATION INDEX

 Measures of item discrimination indicate how adequately an item separates or discriminates between high scorers and low scorers on an entire test
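One common version of the index, d (not spelled out in the slide above), is the difference between the number of high scorers and the number of low scorers answering the item correctly, divided by the size of each group (the extreme groups are often the top and bottom 27% of total scores). The counts in the sketch below are hypothetical.

```python
def item_discrimination(upper_correct, lower_correct, group_size):
    """d = (U - L) / n: U and L are the numbers of high and low scorers who
    answered the item correctly; n is the number of test takers per group."""
    return (upper_correct - lower_correct) / group_size

# Hypothetical: 32 examinees in each extreme group; 20 high scorers and
# 8 low scorers answer the item correctly.
print(item_discrimination(20, 8, 32))  # 0.375
```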
QUALITATIVE ITEM ANALYSIS
 Techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures
4. TEST REVISION
 The development process that the test
undergoes as it is modified and revised
 When the item analysis of data derived from a test administration indicates that the test is not yet in finished form, the steps of revision, tryout, and item analysis are repeated until the test is satisfactory and standardization can occur
TEST REVISION IN THE LIFE
OF AN EXISTING TEST
Many tests are deemed to be due for revision when any of the following conditions exist:
 The stimulus materials look dated and current test
takers cannot relate to them
 The verbal content of the test, including the
administration instructions and the test items, contains
dated vocabulary that is not readily understood by
current test takers
TEST REVISION IN THE LIFE
OF AN EXISTING TEST
Many tests are deemed to be due for revision when any of the following conditions exist:
 As popular culture changes and words take on new meanings,
certain words or expressions in the test items or directions may
be perceived as inappropriate or even offensive to a particular
group and must therefore be changed
 The test norms are no longer adequate as a result of group
membership changes in the population of potential test takers
 The test norms are no longer adequate as a result of age-
related shifts in the abilities measured over time, and so an age
extension of the norms is necessary
TEST REVISION IN THE LIFE
OF AN EXISTING TEST
Many tests are deemed to be due for revision when any of the following conditions exist:
 The reliability or the validity of the test, as well as the
effectiveness of individual test items, can be
significantly improved by a revision
 The theory on which the test was originally based has
been improved significantly, and these changes
should be reflected in the design and content of the
test
CROSS VALIDATION
The revalidation of a test on a sample of test takers other than those on whom test performance was originally found to be a valid predictor of some criterion.
Validity shrinkage refers to the decrease in item validities that inevitably occurs after cross-validation.
CO-VALIDATION
A test validation process conducted
on two or more tests using the same
sample of test takers
When used in conjunction with the
creation of norms or the revision of
existing norms, this process may also
be referred to as co-norming