Test Development Process

Test Development
➢ The process of developing a test proceeds through five stages:
  → Test Conceptualization
  → Test Construction
  → Test Tryout
  → Item Analysis
  → Test Revision

Test Conceptualization
➢ "There ought to be a test designed to measure [fill in the blank] in [such and such] way."

SOME PRELIMINARY QUESTIONS
• What is the test designed to measure?
• What is the objective of the test?
• Is there a need for this test?
• Who will use this test?
• Who will take this test?
• What content will the test cover?
• How will the test be administered?
• What is the ideal format of the test?
• Should more than one form of the test be developed?
• What special training will be required of test users for administering or interpreting the test?
• What types of responses will be required of test takers?
• Who benefits from an administration of this test?
• Is there any potential for harm as the result of an administration of this test?
• How will meaning be attributed to scores on this test?

Norm-Referenced vs. Criterion-Referenced Tests: Item Development Issues
➢ Norm-Referenced Test
➢ Criterion-Referenced Test
TYPES OF SCALES:
➢ Age-Based Scale
➢ Grade-Based Scale
➢ Stanine Scale
➢ Unidimensional Scale
➢ Multidimensional Scale

PILOT WORK
➢ Pilot Work, Pilot Study, and Pilot Research

Test Construction

SCALING
➢ Scaling
  → scale values
➢ L.L. Thurstone
  → absolute scaling

SCALING METHODS:
➢ Rating Scale
  → Summative Rating Scale (scoring sketched after this list)
  → Likert Scale
➢ Method of Paired Comparison
➢ Method of Equal-Appearing Intervals
  → Objective
  → Advantage
➢ Sorting Task
  → Comparative Scale
  → Categorical Scale
➢ Guttman Scale
  → Scalogram Analysis
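The summative (Likert) rating scale listed above is scored by adding the ratings across items. A minimal sketch of that scoring, assuming hypothetical five-point agreement items and one assumed negatively worded (reverse-scored) item; the item set and responses are invented for illustration:

    # Minimal sketch of summative (Likert) scoring.
    # Assumptions: five hypothetical items rated 1-5 (strongly disagree .. strongly agree);
    # the item at index 2 is assumed to be negatively worded and is reverse-scored.
    SCALE_MIN, SCALE_MAX = 1, 5
    REVERSE_SCORED = {2}  # zero-based indices of assumed negatively worded items

    def summative_score(responses):
        """Return one respondent's total scale score (higher = more of the trait)."""
        total = 0
        for i, rating in enumerate(responses):
            if i in REVERSE_SCORED:
                rating = SCALE_MIN + SCALE_MAX - rating  # 5 -> 1, 4 -> 2, ...
            total += rating
        return total

    print(summative_score([4, 5, 2, 3, 4]))  # 4 + 5 + 4 + 3 + 4 = 20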
WRITING ITEMS
➢ What range of content should the items cover?
➢ Which of the many different types of item formats should be employed?
➢ How many items should be written in total and for each content area?
➢ Item Pool

Item Format
➢ Selected-Response Format
➢ Constructed-Response Format

Types of Selected-Response Item Format:
➢ Multiple-Choice Format
➢ Matching Item
➢ Binary-Choice Format
  → True-False Item
  → Other Variety of Binary-Choice Format
  → Disadvantage

Types of Constructed-Response Item Format:
➢ Completion Item
  → Disadvantage
➢ Short-Answer Item
➢ Essay Item
  → Drawback

Writing Items for Computer Administration
➢ Advantages
➢ Item Bank
➢ Item Branching
  → E.g., if a respondent answers an item in a way that suggests he or she is depressed, the computer might automatically probe for depression-related symptoms and behavior (a sketch of this branching logic follows this list).
➢ Computerized Adaptive Testing (CAT)
  → Advantages:
    → Floor Effect
    → Ceiling Effect
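A minimal sketch of the item-branching e.g. above; the screening threshold, rating scale, and follow-up item IDs are invented for illustration and are not from the source. Computerized adaptive testing applies the same kind of logic continuously, choosing each next item from the item bank on the basis of the testtaker's performance so far.

    # Minimal sketch of item branching for computer administration.
    # Assumption: responses are ratings 0-4, and a rating at or above the
    # threshold on the screening item triggers the depression-related probes.
    DEPRESSION_PROBES = ["dep_symptoms_01", "dep_symptoms_02", "dep_behavior_01"]

    def branch(screening_rating, threshold=3):
        """Return the follow-up items to administer next, if any."""
        if screening_rating >= threshold:
            return DEPRESSION_PROBES   # probe depression-related symptoms and behavior
        return []                      # otherwise skip the probes

    print(branch(4))  # ['dep_symptoms_01', 'dep_symptoms_02', 'dep_behavior_01']
    print(branch(1))  # []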
SCORING ITEMS
➢ Cumulative Model
➢ Class Scoring or Category Scoring
➢ Ipsative Scoring
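A brief sketch contrasting two of the scoring models named above, using made-up scores: under the cumulative model, item scores are simply summed; under ipsative scoring, a testtaker's scale scores are compared with one another (here, as deviations from that person's own mean) rather than with other testtakers. The scale names and numbers are illustrative only.

    # Minimal sketch: cumulative vs. ipsative scoring (illustrative data only).
    # Cumulative model: the higher the total of item scores, the more of the
    # measured attribute the testtaker is presumed to have.
    item_scores = [1, 0, 1, 1, 1, 0, 1]          # 1 = keyed response, 0 = not
    cumulative_score = sum(item_scores)           # -> 5

    # Ipsative scoring: scales interpreted relative to the same person's other scales.
    scale_scores = {"achievement": 18, "affiliation": 12, "autonomy": 15}
    person_mean = sum(scale_scores.values()) / len(scale_scores)          # 15.0
    ipsative_profile = {name: s - person_mean for name, s in scale_scores.items()}

    print(cumulative_score)    # 5
    print(ipsative_profile)    # {'achievement': 3.0, 'affiliation': -3.0, 'autonomy': 0.0}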
Test Tryout
➢ phantom factors (a risk associated with too small a tryout sample)

WHAT IS A GOOD ITEM?
➢ A good test item is reliable and valid, and it helps to discriminate testtakers.

Item Analysis
➢ Items may be scrutinized both quantitatively (statistically) and qualitatively.
➢ Statistical procedures include indices of:
  → the item's difficulty
  → the item's reliability
  → the item's validity
  → item discrimination

ITEM-DIFFICULTY INDEX
Item's Difficulty
➢ p: the proportion of the total number of testtakers who answered the item correctly (the subscript identifies the item, so p1 is the item-difficulty index for item 1)
  → If 50 of the 100 examinees answered item 2 correctly, then p2 = 50/100 = .50.
  → If 75 of the 100 examinees answered item 3 correctly, then p3 = 75/100 = .75; we could say that item 3 was easier than item 2.
➢ Item-Endorsement Index: the counterpart of the item-difficulty index for personality test items
➢ average p = Σp / n (the sum of the item-difficulty indices divided by the number of items)
➢ Ideal p = (chance of success proportion + 1.00) / 2
  → For a true-false item (chance = .50), ideal p = 0.75.
  → For a five-option multiple-choice item (chance = .20), ideal p = 0.6.
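A minimal sketch of the item-difficulty computations above (p, average p, and the ideal p adjusted for chance success). Items 2 and 3 use the worked numbers from the notes; the other counts are assumed for illustration.

    # Item-difficulty index: p = proportion of testtakers answering the item correctly.
    n_examinees = 100
    correct_counts = {"item1": 40, "item2": 50, "item3": 75, "item4": 62, "item5": 88}

    p = {item: count / n_examinees for item, count in correct_counts.items()}
    print(p["item2"], p["item3"])            # 0.5 0.75 -> item 3 is easier than item 2

    # Average item difficulty for the whole test: average p = (sum of p) / (number of items).
    average_p = sum(p.values()) / len(p)
    print(round(average_p, 2))               # 0.63

    # Ideal p lies midway between the chance success proportion and 1.00.
    def ideal_p(chance_success):
        return (chance_success + 1.00) / 2

    print(ideal_p(0.50))   # 0.75 (true-false item: 1 chance in 2 of guessing correctly)
    print(ideal_p(0.20))   # 0.6  (five-option multiple-choice item)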
ITEM-RELIABILITY INDEX
Item-Reliability Index
➢ provides an indication of a test's internal consistency
➢ equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score
➢ Factor Analysis: used to determine whether items on the test appear to be measuring the same thing(s)

ITEM-VALIDITY INDEX
Item-Validity Index
➢ provides an indication of the degree to which a test measures what it purports to measure; it is based on two statistics:
  → [1] The item-score standard deviation (s1), which can be calculated from the item-difficulty index (p1): s1 = √(p1(1 − p1))
  → [2] The correlation (r1C) between the item score and the criterion score
➢ Item-validity index = s1 r1C
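A minimal sketch of these computations: the item-score standard deviation is derived from the item-difficulty index, then multiplied by the relevant correlation. The p and correlation values below are assumed for illustration.

    import math

    # Item-score standard deviation from the item-difficulty index: s1 = sqrt(p1 * (1 - p1))
    def item_score_sd(p):
        return math.sqrt(p * (1 - p))

    p1 = 0.60                 # assumed item-difficulty index for item 1
    s1 = item_score_sd(p1)    # ~0.49

    r1C = 0.30                # assumed correlation between item score and criterion score
    r1T = 0.45                # assumed correlation between item score and total test score

    item_validity_index = s1 * r1C      # s1 * r1C -> ~0.15
    item_reliability_index = s1 * r1T   # s1 * r1T -> ~0.22

    print(round(s1, 2), round(item_validity_index, 2), round(item_reliability_index, 2))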
ITEM-DISCRIMINATION INDEX
Item-Discrimination Index
➢ indicates how adequately an item separates or discriminates between high scorers and low scorers on the test as a whole
  → E.g., a multiple-choice item on an achievement test is a good item if most of the high scorers answer correctly and most of the low scorers answer incorrectly.
  → An item on an achievement test is not doing its job if it is answered correctly by respondents who least understand the subject matter.
  → An item on a test purporting to measure a particular personality trait is not doing its job if responses indicate that people who score very low on the test as a whole (indicating absence or low levels of the trait in question) tend to score very high on the item (indicating that they are very high on the trait in question), contrary to what the test as a whole indicates.
➢ d: compares the performance of the upper group (U) of scorers with the performance of the lower group (L) of scorers on the item
  → d = (U − L) / n
    where U = the number of testtakers in the upper group who answered the item correctly, L = the number of testtakers in the lower group who answered the item correctly, and n = the number of testtakers in each (equally sized) group
  → d = +1.00 → all members of the upper group (U) answered the item correctly and all members of the lower group (L) answered it incorrectly
  → d = 0 → the upper and lower groups answered the item correctly in equal numbers
  → d = –1.00 → all members of the lower group (L) answered the item correctly and all members of the upper group (U) answered it incorrectly

ANALYSIS OF ITEM ALTERNATIVES.
➢ The distribution of responses of the upper group (U) and the lower group (L) across the answer alternatives of five items (∙ marks the keyed correct alternative):

                Alternatives
               ∙A    B    C    D    E
    Item 1  U  24    3    2    0    3
            L  10    5    6    6    5

                Alternatives
                A    B    C    D   ∙E
    Item 2  U   2   13    3    2   12
            L   6    7    5    7    7

                Alternatives
                A    B   ∙C    D    E
    Item 3  U   0    0   32    0    0
            L   3    2   22    2    3

                Alternatives
                A   ∙B    C    D    E
    Item 4  U   5   15    0    5    7
            L   4    5    4    4   14

                Alternatives
                A    B    C   ∙D    E
    Item 5  U  14    0    0    5   13
            L   7    0    0   16    9
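A minimal sketch applying d = (U − L) / n to the keyed alternative of each tabled item, assuming 32 testtakers in each of the upper and lower groups; the counts are taken from the tables above.

    # Item-discrimination index d = (U - L) / n for the keyed alternative of each item.
    n = 32
    # (upper-group count, lower-group count) choosing the keyed alternative:
    keyed_counts = {
        "Item 1": (24, 10),   # keyed A
        "Item 2": (12, 7),    # keyed E
        "Item 3": (32, 22),   # keyed C
        "Item 4": (15, 5),    # keyed B
        "Item 5": (5, 16),    # keyed D
    }

    for item, (upper, lower) in keyed_counts.items():
        d = (upper - lower) / n
        print(item, round(d, 2))
    # Item 1 0.44, Item 2 0.16, Item 3 0.31, Item 4 0.31, Item 5 -0.34
    # A negative d (Item 5) signals a problem: more low scorers than high scorers
    # chose the keyed answer, so the item (or its key) needs review.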
ITEM-CHARACTERISTIC CURVES
Item-Characteristic Curves
➢ a graphic representation of item difficulty and discrimination: testtaker ability is plotted on the horizontal axis and the probability of a correct (or keyed) response on the vertical axis
➢ For Discriminability Level:
  → the steeper the slope of the curve, the greater the item's discrimination
➢ For Difficulty Level:
  → the farther along the ability axis the curve rises, the more difficult the item
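The notes do not give a formula for these curves; as an illustration, a sketch of the two-parameter logistic form commonly used for item-characteristic curves, where the b parameter shifts the curve along the ability axis (difficulty) and the a parameter controls its steepness (discrimination). The parameter values are assumed.

    import math

    def icc(theta, a, b):
        """Two-parameter logistic item-characteristic curve: probability of a correct
        response at ability level theta, with discrimination a (slope) and difficulty b."""
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    # An easier, highly discriminating item vs. a harder, flatter item (assumed parameters).
    for theta in (-2, -1, 0, 1, 2):
        print(theta, round(icc(theta, a=1.5, b=-0.5), 2), round(icc(theta, a=0.7, b=1.0), 2))
    # The first item's curve rises earlier and more steeply than the second's.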
OTHER CONSIDERATIONS IN ITEM ANALYSIS
➢ GUESSING.
➢ ITEM FAIRNESS.
➢ SPEED TESTS.

QUALITATIVE ITEM ANALYSIS
➢ Qualitative Methods
➢ Qualitative Item Analysis
➢ "Think Aloud" Test Administration
➢ Expert Panels
  → Sensitivity Review
➢ One cautionary note:

Test Revision

TEST REVISION AS A STAGE IN NEW TEST DEVELOPMENT
TEST REVISION IN THE LIFE CYCLE OF AN EXISTING TEST
➢ An existing test should be kept in its present form as long as it remains "useful," but it should be revised "when significant changes in the domain represented, or new conditions of test use and interpretation, make the test inappropriate for its intended use."
➢ Cross-Validation
  → validity shrinkage (illustrated in the sketch after this list)
➢ Co-Validation
  → co-norming
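A small simulation sketch of validity shrinkage under cross-validation: item weights fit on one sample typically predict the criterion less well in a fresh sample. All data are randomly generated for illustration, and NumPy is assumed to be available.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 60, 10                                  # testtakers per sample, number of items
    true_w = rng.normal(size=k)

    def make_sample():
        X = rng.normal(size=(n, k))                        # item scores
        y = X @ true_w + rng.normal(scale=4.0, size=n)     # criterion with noise
        return X, y

    X_dev, y_dev = make_sample()                   # derivation (original) sample
    X_val, y_val = make_sample()                   # cross-validation sample

    w_hat, *_ = np.linalg.lstsq(X_dev, y_dev, rcond=None)  # weights fit on derivation sample

    r_dev = np.corrcoef(X_dev @ w_hat, y_dev)[0, 1]
    r_val = np.corrcoef(X_val @ w_hat, y_val)[0, 1]
    print(round(r_dev, 2), round(r_val, 2))        # validity is typically lower on the new sample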
THE USE OF IRT IN BUILDING AND REVISING TESTS
➢ [1] Evaluating the properties of existing tests and guiding test revision.
➢ [2] Determining measurement equivalence across testtaker populations.
  → Differential Item Functioning (DIF)
  → DIF Analysis (a rough screening sketch follows this list)
  → DIF Items
➢ [3] Developing item banks.
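The notes name DIF analysis without detail; as a rough, simplified screen (not the Mantel-Haenszel or IRT-based procedures used in practice), one can match reference- and focal-group testtakers on total score and compare how they perform on the studied item within matched score bands. The records and score bands below are invented for illustration.

    # Rough DIF screen: within each total-score band, compare the proportion of
    # reference-group vs. focal-group testtakers who answered the studied item correctly.
    def dif_screen(records, bands):
        """records: list of (group, total_score, item_correct) tuples, group is 'ref' or 'focal'.
        bands: list of (low, high) total-score ranges used to match the two groups.
        Returns the average within-band difference in proportion correct (ref minus focal)."""
        diffs = []
        for low, high in bands:
            in_band = [r for r in records if low <= r[1] <= high]
            ref = [c for g, _, c in in_band if g == "ref"]
            focal = [c for g, _, c in in_band if g == "focal"]
            if ref and focal:
                diffs.append(sum(ref) / len(ref) - sum(focal) / len(focal))
        return sum(diffs) / len(diffs) if diffs else 0.0

    records = [
        ("ref", 12, 1), ("ref", 13, 1), ("ref", 20, 1), ("ref", 21, 1),
        ("focal", 12, 0), ("focal", 14, 1), ("focal", 20, 0), ("focal", 22, 1),
    ]
    print(dif_screen(records, bands=[(10, 15), (16, 25)]))
    # 0.5 -> even when matched on total score, the reference group does better on
    # this item, which flags it for closer DIF analysis.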