TEST DEVELOPMENT
The process of developing a test occurs in five stages (CCTAR):
A. Test conceptualization
B. Test construction
C. Test tryout
D. Item analysis
E. Revision
A. TEST CONCEPTUALIZATION
The beginning of any published test
Defining the SCOPE, PURPOSE, and LIMITS of a test
An emerging social phenomenon or pattern of behavior may serve as the stimulus for developing a new test
SOME PRELIMINARY QUESTIONS
1. What is the test designed to measure? (construct of interest)
2. What is the objective of the test? (Goal and use of test)
3. Is there a need for this test? (advantages)
4. Who will use this test? (test users and their purpose of use)
5. Who will take this test? (specific details of testtakers)
6. What content will the test cover? (scope)
7. How will the test be administered? (individually or in groups; paper-and-pencil or by computer)
8. What is the ideal format of the test? (true-false, multiple choice, etc.)
9. Should more than one form of the test be developed?
10. What special training will be required of test users for administering or interpreting the test? (background and qualifications of test users)
11. What types of responses will be required of testtakers?
12. Who benefits from the administration of this test?
13. Is there any potential for harm as the result of an administration of this test? (ethics)
14. How will meaning be attributed to scores on this test? (score meaning and scoring procedures)
Another question to consider:
“Should the test be NORM-REFERENCED or CRITERION-REFERENCED?”
PILOT WORK
Pilot work, pilot study, and pilot research refer to the preliminary research surrounding the
creation of a prototype of the test.
Test items may be pilot studied to evaluate whether they should be included in the final
form of the instrument.
The test developer typically attempts to determine how best to measure a targeted construct
Once pilot work has been completed, the process of test construction begins
B. TEST CONSTRUCTION
Pilot work is a necessity when constructing tests or other measuring instruments for publication
and wide distribution
Scaling is the process of:
designing and calibrating a measuring device
assigning numbers (scale values) to different amounts of the trait, attribute, or
characteristic being measured
TYPES OF SCALES
1. Age−based scale – used when the test taker's performance as a function of age is of critical interest
2. Grade−based scale – used when the test taker's performance as a function of grade is of critical interest
3. Stanine scale – all raw scores on the test are transformed into scores that can range
from 1 to 9
4. Unidimensional scale - only one dimension is presumed to underlie the ratings
5. Multidimensional scale - more than one dimension is thought to guide the test taker’s
responses
6. Paired comparisons – test takers are presented with pairs of stimuli (e.g., two photos, two
objects, or two statements) from which they must select one according to some
rule (e.g., "Which of the two statements do you agree with more?")
7. Comparative scaling – a method of sorting that entails judging a stimulus in comparison
with every other stimulus on the scale. The test taker compares an item (person/object/statement)
with every other item on the list and sorts them into a rank order
- E.g., Arrange the following statements from most important to least important.
8. Categorical scaling – stimuli are sorted into one of two or more alternative categories that differ quantitatively with
respect to some continuum
- For example, test takers may be asked to sort cards into three piles: those behaviors that
are never justified, those that are sometimes justified, and those that are always justified.
9. Guttman scales – items range sequentially from weaker to stronger expressions of the attitude, belief, or
feeling being measured. The scales are developed through the administration of a number
of items to a target group.
- Scalogram analysis – an item-analysis procedure and approach to test development that
involves a graphic mapping of a test taker’s responses
10. Rating scale – grouping of statements or symbols on which judgments of the strength of a
particular trait or emotion are indicated by the test taker.
11. Summative scale – the final test score is obtained by summing the ratings across all the items
12. Likert scale – one type of summative rating scale, usually used to scale attitudes; it typically has five
alternative responses (sometimes seven) and yields ordinal-level data. (A scoring sketch follows this list.)
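To make the summative (Likert-style) scoring in items 11 and 12 concrete, here is a minimal Python sketch. The item names, ratings, and the set of reverse-keyed items are hypothetical; the point is only that ratings are summed after reverse-keyed items are flipped.

```python
# Minimal sketch of summative (Likert) scoring with hypothetical data.
# Ratings run 1-5; reverse-keyed items are flipped so that a higher score
# always indicates more of the attitude being measured.

responses = {"item1": 4, "item2": 2, "item3": 5, "item4": 1}  # one testtaker's ratings
reverse_keyed = {"item2", "item4"}                            # hypothetical reverse-worded items
SCALE_MIN, SCALE_MAX = 1, 5

def keyed_rating(item, rating):
    """Flip the rating for reverse-keyed items (e.g., 2 becomes 4 on a 1-5 scale)."""
    return SCALE_MIN + SCALE_MAX - rating if item in reverse_keyed else rating

summative_score = sum(keyed_rating(item, r) for item, r in responses.items())
print(summative_score)  # 4 + 4 + 5 + 5 = 18
```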
WRITING ITEMS
What range of content should the items cover?
Which of the many different types of item formats should be employed?
How many items should be written in total and for each content area covered?
Item Pool - refers to the reservoir or well from which items will or will not be drawn for the final version of
the test.
In writing an item pool, it is advisable that the first draft contain approximately twice the
number of items that the final version of the test will contain
If an item is poorly written, the test developer should either rewrite it (sampling the same content) or create a new item
HOW TO DEVELOP AN ITEM POOL?
Write a large number of items from personal experience
Ask for help from others, including experts
Conduct interviews (to get insights that could assist in item writing)
Search through the academic research literature and other databases
Item format
Variables such as the form, plan, structure, arrangement and layout of test items.
TYPES OF ITEM FORMAT
Selected−response format
o Requires testtakers to select a response from a set of alternative responses
3 types: multiple choice, matching, binary choice
Constructed−response format
o Requires testtakers to supply or to create the correct answer, not merely to select it
3 types: completion item, short answer, essay
3 TYPES OF SELECTED−RESPONSE FORMAT
MULTIPLE CHOICE FORMAT
Has three elements
1. A stem
2. A correct alternative or option
3. Distractors or foils
BINARY CHOICE ITEM – only 2 possible responses
Varieties of binary-choice format:
agree or disagree
yes or no
right or wrong
fact or opinion
true or false
ADVANTAGE: typically easier to write than multiple−choice items because they do not
contain distractors and can be written relatively quickly
DISADVANTAGE: there is a 50% probability of obtaining a correct answer through chance
(guessing)
MATCHING ITEM
The test taker is presented with two columns: PREMISES on the left and RESPONSES
on the right.
3 TYPES OF CONSTRUCTED−RESPONSE ITEM
COMPLETION ITEM
o Requires the examinee to provide a word or phrase that completes a sentence
o Example: The standard deviation is generally the most useful measure of ____.
o A good completion item should be worded so that the correct answer is specific
SHORT−ANSWER ITEM
o Another form of completion item but much shorter and more specific
o Example: What term refers to the incorrect response options in a multiple−choice item? (Answer: distractors)
ESSAY ITEM - requires the test taker to respond to a question by writing a composition, typically
one that demonstrates recall of facts, understanding, analysis, and/or interpretation
o The answer is usually in sentence or paragraph form and shows depth of knowledge
o DISADVANTAGES: it covers a more limited content area than a series of selected-response
or completion items could cover in the same amount of time, AND scoring is subjective,
with inter-scorer differences.
WRITING ITEMS FOR COMPUTER ADMINISTRATION
Item bank - a relatively large and easily accessible collection of test questions
Item Branching - the ability of the computer to tailor the content and order of presentation of test items on
the basis of responses to previous items (a minimal sketch of this branching logic follows this section)
− Computerized Adaptive Testing (CAT)
o An interactive, computer−administered test−taking process wherein items presented to
the testtaker are based in part on the testtaker’s performance on previous items
o The computer may not permit the testtaker to continue with the test until the practice
items have been responded to in a satisfactory manner and the test taker has
demonstrated an understanding of the test procedure
− CAT tends to reduce the FLOOR EFFECTS and CEILING EFFECTS
Floor effects (TOO HARD)
o Refers to the diminished utility of an assessment tool for distinguishing test takers at
the low end of the ability, trait, or other attribute being measured
Ceiling effects (TOO EASY)
o Refers to the diminished utility of an assessment tool for distinguishing test takers at
the high end of the ability, trait, or other attribute being measured.
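Below is a minimal sketch of the item-branching logic described above, using a hypothetical item bank ordered by difficulty: a correct response branches to a harder item, an incorrect response to an easier one. This illustrates only the branching idea, not a full IRT-based CAT.

```python
# Minimal item-branching sketch (hypothetical item bank ordered from easiest to hardest).
item_bank = [("i1", 0.2), ("i2", 0.4), ("i3", 0.5), ("i4", 0.7), ("i5", 0.9)]

def next_index(current, was_correct):
    """Move to a harder item after a correct response, an easier one after an error."""
    step = 1 if was_correct else -1
    return max(0, min(len(item_bank) - 1, current + step))

def administer(answers_correctly, start=2, n_items=4):
    """Present n_items adaptively; answers_correctly(difficulty) simulates the testtaker."""
    index, record = start, []
    for _ in range(n_items):
        item_id, difficulty = item_bank[index]
        correct = answers_correctly(difficulty)
        record.append((item_id, correct))
        index = next_index(index, correct)
    return record

# Hypothetical testtaker who answers correctly whenever item difficulty is below 0.6.
print(administer(lambda difficulty: difficulty < 0.6))
```

A real CAT would also avoid re-presenting items and would estimate ability with an IRT model; this sketch keeps only the branching rule.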
SCORING ITEMS
Models of Test Scoring:
1. Cumulative Scoring - the higher the score on the test, the higher the testtaker is
on the ability, trait, or other characteristic that the test purports to measure
2. Class scoring (Category scoring) - test taker responses earn credit toward
placement in a particular class or category with other test takers whose pattern
of responses is presumably similar in some way
E.g. MBTI
3. Ipsative Scoring - Comparing a test taker’s score on one scale within a test to another
scale within that same test
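To contrast cumulative and ipsative scoring, here is a small sketch with hypothetical items keyed to two scales (A and B): the cumulative score sums everything, while the ipsative comparison looks at one scale relative to another within the same testtaker.

```python
# Cumulative vs. ipsative scoring sketch (hypothetical items and scale keys).
item_scores = {"q1": 3, "q2": 5, "q3": 2, "q4": 4}           # one testtaker's item scores
scale_key   = {"q1": "A", "q2": "A", "q3": "B", "q4": "B"}   # which scale each item belongs to

# Cumulative scoring: the higher the total, the more of the measured attribute.
cumulative_score = sum(item_scores.values())

# Ipsative scoring: compare one scale with another *within the same testtaker*.
scale_totals = {}
for item, score in item_scores.items():
    scale = scale_key[item]
    scale_totals[scale] = scale_totals.get(scale, 0) + score
ipsative_contrast = scale_totals["A"] - scale_totals["B"]

print(cumulative_score, scale_totals, ipsative_contrast)  # 14 {'A': 8, 'B': 6} 2
```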
C. TEST TRYOUT
a. The part of test development in which the test developer will try out the test
b. In test tryout:
i. The test should be tried out on people who are similar in critical
aspects to the people for whom the test was designed
ii. There should be no fewer than 5 subjects, and preferably as many as 10,
for each item on the test (see the quick calculation after this list)
iii. The more subjects employed, the weaker the role of chance in
subsequent data analysis
iv. The more the merrier
v. It should be executed under conditions as identical as possible to the
conditions under which the standardized test will be administered
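The 5-to-10-subjects-per-item rule of thumb in point (ii) translates directly into a target sample size; a quick illustration with a hypothetical item count:

```python
# Rule-of-thumb tryout sample size: 5-10 subjects per test item (hypothetical item count).
n_items = 60
minimum_n, preferred_n = 5 * n_items, 10 * n_items
print(f"Try out the {n_items}-item draft on at least {minimum_n}, ideally {preferred_n}, people.")
```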
D. ITEM ANALYSIS
WHAT IS A GOOD ITEM?
a. Must be reliable and valid
b. A good item helps discriminate between test takers: the high scorers and the low scorers
ITEM ANALYSIS
c. A statistical technique which is used for selecting and rejecting the items of the
test on the basis of
i. An index of the item's difficulty
ii. An index of the item's reliability
iii. An index of the item's validity
iv. An index of item discrimination
ITEM−DIFFICULTY INDEX
- An index of an item’s difficulty is obtained by calculating the proportion of the total number
of testtakers who answered the item correctly.
- Can be referred to as an item−endorsement index in other contexts, such as personality testing
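A minimal sketch of the item-difficulty index p for a single item, using hypothetical 0/1 item scores (1 = correct): p is simply the proportion of testtakers who got the item right.

```python
# Item-difficulty index: proportion of testtakers answering the item correctly.
item_responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # hypothetical scored responses (1 = correct)

p = sum(item_responses) / len(item_responses)
print(p)  # 0.7 -> 70% answered correctly; a higher p means an easier item
```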
ITEM−RELIABILITY INDEX
- The item−reliability index provides an indication of the internal consistency of a test; the
higher this index, the greater the test’s internal consistency.
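As the notes present it, the index reflects an item's contribution to internal consistency; in common textbook treatments it is computed as the item-score standard deviation multiplied by the item-total correlation. A sketch with hypothetical data, under that assumption:

```python
# Item-reliability index sketch: item standard deviation x item-total correlation
# (hypothetical data; this formula is the common textbook definition, assumed here).
from statistics import mean, pstdev

item_scores  = [1, 0, 1, 1, 0, 1, 0, 1]            # one item, scored 0/1 per testtaker
total_scores = [42, 30, 38, 45, 28, 40, 33, 44]    # total test scores, same testtakers

def pearson_r(x, y):
    """Pearson correlation using population statistics."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

item_reliability_index = pstdev(item_scores) * pearson_r(item_scores, total_scores)
print(round(item_reliability_index, 3))
```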
ITEM−DISCRIMINATION INDEX
The item−discrimination index is a measure of the difference between the proportion of high
scorers answering an item correctly and the proportion of low scorers answering the item
correctly.
the higher the value of d, the greater the number of high scorers (relative to low scorers) answering the item correctly; a negative d means more low scorers than high scorers answered it correctly
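A minimal sketch of d using hypothetical upper and lower scoring groups (often the top and bottom 27% on total score): d is the difference between the two groups' proportions of correct answers on the item.

```python
# Item-discrimination index d = (proportion correct in upper group) - (proportion correct in lower group).
upper_group = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]  # item scores for the highest total scorers (hypothetical)
lower_group = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]  # item scores for the lowest total scorers (hypothetical)

d = sum(upper_group) / len(upper_group) - sum(lower_group) / len(lower_group)
print(d)  # 0.8 - 0.3 = 0.5; a negative d would flag an item that low scorers pass more often
```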
ITEM−CHARACTERISTIC CURVES
A graphic representation of item difficulty and discrimination
The steeper the slope, the greater the item discrimination. An item may also vary in terms
of its difficulty level
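The notes do not name a specific model, so purely as an illustration, here is an item-characteristic curve drawn from a two-parameter logistic function, where a controls the slope (discrimination) and b the difficulty:

```python
# Illustrative item-characteristic curve using a two-parameter logistic (2PL) function;
# the choice of model is an assumption made only for this sketch.
import math

def icc(theta, a, b):
    """Probability of a correct response given ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

for theta in (-2, -1, 0, 1, 2):
    # A steep item (a = 2.0) discriminates more sharply around b than a flat item (a = 0.5).
    print(theta, round(icc(theta, 2.0, 0.0), 2), round(icc(theta, 0.5, 0.0), 2))
```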
OTHER CONSIDERATIONS IN ITEM ANALYSIS
1) GUESSING
2) ITEM FAIRNESS
a. A biased test item is an item that favors one particular group of examinees in relation
to another when differences in group ability are controlled
b. item−characteristic curves can be used to identify biased items
c. Choice of item-analysis method may affect determinations of item bias
3) SPEED TEST - Item analyses of tests taken under speed conditions yield misleading or
uninterpretable results. The closer an item is to the end of the test, the more difficult it may
appear to be. This is because testtakers simply may not get to items near the end of the test
before time runs out
QUALITATIVE ITEM ANALYSIS - A general term for various nonstatistical procedures designed to
explore how individual test items work
- Techniques of data collection and analysis that rely primarily on verbal procedures (interviews,
observations, open−ended questionnaires) rather than on mathematical or statistical
procedures
A. “THINK ALOUD” TEST ADMINISTRATION - Qualitative research tool designed to shed light on
the testtaker’s thought processes during the administration of a test. Useful in assessing
why and how testtakers are misinterpreting a particular item
B. EXPERT PANELS - May also provide qualitative analyses of test items
- A sensitivity review is a study of test items, typically conducted during the test
development process, in which items are examined for fairness to all prospective
testtakers and for the presence of offensive language, stereotypes, or situations
E. TEST REVISION AS A STAGE IN NEW TEST DEVELOPMENT
Some ways of approaching test revision:
- Characterize each item according to its strengths and weaknesses.
Items with numerous weaknesses, or items that are too hard or too easy, must be
eliminated or revised.
- Balance strengths and weaknesses across items.
If the test is too easy, add difficult items or revise some items.
TEST REVISION IN THE LIFE CYCLE OF AN EXISTING TEST
When to revise an existing test?
1. The stimulus materials look dated and current test takers cannot relate to them
2. The verbal content of the test, including the administration instructions and the test
items, contains dated vocabulary that is not readily understood by current test takers
3. As popular culture changes and words take new meanings, certain words or expressions
in the test items or directions may be perceived as inappropriate or even offensive to a
particular group and must therefore be changed
4. The test norms are no longer adequate as a result of group membership changes in the
population of potential test takers
5. The test norms are no longer adequate as a result of age−related shifts in the abilities
measured over time, and so an age extension of the norms (upward, downward, or in
both directions) is necessary.
6. The reliability or the validity of the test, as well as the effectiveness of the individual test
items, can be significantly improved by a revision
7. The theory on which the test was originally based has been improved significantly, and
these changes should be reflected in the design and the content of the test
CROSS−VALIDATION AND CO−VALIDATION
A. Cross-validation – refers to the revalidation of a test on a sample of testtakers other than
those on whom test performance was originally found to be a valid predictor of some
criterion.
Validity shrinkage - the decrease in item validities that inevitably occurs after cross-validation of findings
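Here is a small sketch of cross-validation with hypothetical data: the test-criterion correlation (validity coefficient) is computed on the original sample, then recomputed on a new sample of testtakers; the drop between the two illustrates validity shrinkage.

```python
# Cross-validation sketch with hypothetical data: revalidate the test-criterion
# correlation on a fresh sample and compare it with the original coefficient.
from statistics import mean, pstdev

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# Original (derivation) sample
test_orig, criterion_orig = [55, 60, 48, 70, 65, 52], [3.1, 3.4, 2.8, 3.9, 3.6, 3.0]
# New (cross-validation) sample
test_new,  criterion_new  = [58, 62, 50, 68, 63, 54], [3.0, 3.2, 3.1, 3.5, 3.3, 2.9]

print(round(pearson_r(test_orig, criterion_orig), 3))  # original validity coefficient
print(round(pearson_r(test_new,  criterion_new),  3))  # cross-validated coefficient (lower here)
```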
B. Co-validation - defined as a test validation process conducted on two or more tests using the
same sample of test takers
Co-norming – the process of conducting co-validation in conjunction with the creation of norms or the
revision of existing norms.