Test Theory

Lecture 1

- Latent (non observable) properties

1. O
2. S
3. O
4. O
5. S
6. S
7. O
8. O
9. O


- Qual answers→quant measures; item scores


- Item scores→test scores

- Measure spread of scores


- s2=SS/N
- Divide by N: population
- inferential statistics (sample→population)
- Population variance
standardized/z score
- z = (X − mean) / SD
- Distance from mean big/small
- Standardized dev score; diffs on a 0-1/10/100 scale comparable as distance from mean
- Z: mean 0, SD 1

- Covariance: Shared variance


Ch 1
- Theoretical constructs: H constructs/latent variables
- Operational measures
Psych tests
- Systematic procedure comp behavior 2+ppl
- Behavioral samples
- Behavioral samples collected systematically
- Purpose of tests to comp behaviors 2+ppl
- Inter v intraindivid diffs
Test types
- Vary in
- Content
- Response required
- Administration methods
- Intended purpose
- Criterion/domain referenced: decision abt persons skill level
- Cutoff level established→ppl divided in
- Reference sample: representative of certain well def
population→persons score comp w ref sample scores
- Normative sample
- "norm referenced"
- Speeded tests
- Time limited
- Entire test not completed
- Nr Qs answered in allotted time; As correct
- Power tests
- No time limit
- Answer all Qs
- Nr correct items
Psychometrics
- Eval attributes of psych tests
1. Type of info generated by use of psych tests
2. Reliability of data from tests
3. Issues abt validity of data from tests
Challenges to measurement
- Participant reactivity
- Demand characteristics
- Social desirability
- Malingering
- observer/scorer bias
- Composite scores
- Score sensitivity
- Lack psychometric info
Ch 3 individ diffs & correlation
Variability
- Interindivid diffs: bw ppl
- DV
- Intraindivid diffs: diffs in 1 person over time/under diff circumstances
Individ diffs importance
- Diffs in scores: variability
- Distribution: diff pts of individs
central tendency
- Mean median mode
Variability
- Variance
- SD
- Reflect variation in distribution deviation
- SS: sum of squared deviations about the mean

1. Dev from mean scores; degree score above/below mean
2. Squared deviation
3. Mean of squared dev

- SD reflects variability in terms of size of raw dev scores, variance reflects variability in
terms of squared dev scores
- Larger variance & SD→greater distribution variability
- No neg value SD/s2!! Minimum is 0
- SD/s2 not large/small
- Require context
- N not N-1 !!
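A minimal Python sketch of these steps with made-up raw scores; note the division by N, not N−1, as the notes stress:

```python
# Sum of squares, variance (divide by N), SD and z-scores for a small score set.
scores = [4, 6, 7, 9, 14]          # hypothetical raw test scores
N = len(scores)
mean = sum(scores) / N

deviations = [x - mean for x in scores]          # step 1: deviation scores
squared_devs = [d ** 2 for d in deviations]      # step 2: squared deviations
SS = sum(squared_devs)                           # sum of squared deviations about the mean
variance = SS / N                                # step 3: mean squared deviation (divide by N)
sd = variance ** 0.5

z_scores = [(x - mean) / sd for x in scores]     # standardized scores: mean 0, SD 1
print(mean, variance, sd)
print([round(z, 2) for z in z_scores])
```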
Distribution
- Normal distribution
- Neg skewed
- Pos skewed
2 variable association
- Association direction +/-
- Magnitude -ll-
- Consistency; strong association
Covariance
- Degree 2 variables covary; association bw variability in 2 score distributions
1. Dev scores (dev from mean)
2. Cross products; multiplying dev scores
a. Neg cross product; scores inconsistent w EO; above mean on V1 (pos dev), below mean on V2 (neg dev)
3. Mean of cross products

Cov size affected by


1. Association strength
a. Large values→strong association
2. Scale of measurement; large-scale variables→larger covariance than small-scale variables
Variance-covariance matrix
- Each variable has a row & column
- Variances on the diagonal (eg 166.67 & .39; IQxIQ)
- Off diagonal: covariances bw variables
Correlation
- Direction of association bw 2 variables
- Magnitude

- rXY = covariance XY / (SD X × SD Y)
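A short sketch of the covariance and correlation steps above, with two hypothetical score lists; covariance is the mean cross product, correlation divides it by both SDs:

```python
# Covariance as the mean of cross products; correlation = covariance / (SD_x * SD_y).
x = [2, 4, 6, 8]                     # hypothetical scores on variable 1
y = [1, 3, 2, 6]                     # hypothetical scores on variable 2
N = len(x)
mx, my = sum(x) / N, sum(y) / N

cross_products = [(xi - mx) * (yi - my) for xi, yi in zip(x, y)]  # step 2
cov_xy = sum(cross_products) / N                                  # step 3: mean cross product

sd_x = (sum((xi - mx) ** 2 for xi in x) / N) ** 0.5
sd_y = (sum((yi - my) ** 2 for yi in y) / N) ** 0.5
r_xy = cov_xy / (sd_x * sd_y)                                     # correlation

print(round(cov_xy, 3), round(r_xy, 3))
```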
Composite variables

- S2 variance
- Rij: correlation bw 2 scores
- S: SD

- Correlation bw 2 composite scores


Binary items
- yes/no 1/0
- Avg responses across items

- Mean: proportion of pos valenced As
- p = mean for a binary item
- q = proportion of neg valenced responses; q = 1 − p
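A tiny sketch with made-up yes/no responses: for a binary (0/1) item the mean is p, q = 1 − p, and the item variance is p·q:

```python
# p, q and variance for one binary item (1 = positively valenced answer).
responses = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # hypothetical yes/no answers
p = sum(responses) / len(responses)          # proportion of positive answers
q = 1 - p
item_variance = p * q                        # variance of a binary item (N in the denominator)
print(p, q, item_variance)
```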

Tests score interpretation
- Raw scores
- Avg→comp below/above
- Degree of diff; how far from avg, variability
- Z scores
- above/below mean
- Distance from mean score
Z scores; standard scores

- Test score med high low


- Degree above/below mean
- Diff from mean/SD
- Eg 0.5 SD above mean
- Larger score→more extreme value
- Score in rel to entire score distribution
Converted standard scores (standardized scores)
- Z scores converted into values easier to understand

- Converted score = new mean + z × new SD (eg IQ metric: 100 + 15z)
Percentile ranks
- % of scores below specific test score
- Direct:
- Raw scores below→divide by nr ppl took the test→x100
- Mean & SD known
- Standard score computed→table→link percentile
- Score bw mean & z→add 0.5 & x100%
Normalized scores
- normalization/area transformations
1. Percentile ranks from raw test scores; each raw score→% rank
2. % ranks→standard (z) scores
3. Z score→converted standard score onto wanted metric
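A sketch of the whole area-transformation chain (raw score → percentile rank → normalized z → converted score). The score distribution is made up, and the final metric (mean 100, SD 15) is just one example choice; scipy's normal quantile function supplies step 2:

```python
import numpy as np
from scipy.stats import norm

raw = np.array([10, 12, 15, 15, 18, 20, 22, 25, 30, 31])   # hypothetical raw scores

def percentile_rank(score, scores):
    # % of scores below the given score (ties counted as half)
    below = np.sum(scores < score) + 0.5 * np.sum(scores == score)
    return 100 * below / len(scores)

ranks = np.array([percentile_rank(s, raw) for s in raw])    # step 1: percentile ranks
z_norm = norm.ppf(ranks / 100)                              # step 2: percentile ranks -> z
converted = 100 + 15 * z_norm                               # step 3: onto the wanted metric

print(np.round(ranks, 1))
print(np.round(converted, 1))
```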
Test norms
- Reference sample

Ch 2
Property of identity
- Identical w respect to feature reflected by category
- mutually exclusive & Exhaustive: falls only to 1 cat
- similar/diff
- Nrs no signif
Property of order
- Rel amt of attribute ppl possess
- Numerals w property of order→rank order of ppl rel to EO
- Numerals indicate order→labels
- Doesn't tell abt actual degree of diffs
Property of quant
- Magnitude of diffs bw ppl
- Numerals reflect real nrs
- Quant values: continuous
Nr 0
- Absolute 0: ratio
- Arbitrary 0: interval
Units of measurement
- Arbitrary
- Unit size
- Unit of meas not tied to any 1 type of obj;
- Take phys form; some UoM can't be used to measure diff features of obj
- UoM: standard measures
Additivity & counting
Additivity
- Additivity: a 1-unit inc at one pt of the measurement scale is the same as at any other pt
- In psych, the meas unit is often not constant in terms of the underlying attribute it's intended to reflect
- →additivity then not justified
Counts
- counting = measurement when counting is used to reflect the amt of a feature/attribute of an obj
4 meas scales
Nominal scales
- numerals= property of identity
- Categories
- Identify groups sharing common attribute
Ordinal
- numerals=
- Property of order
Interval
- Arbitrary zero
- Property of quant
- Size of meas constant & additive, scale no multiplicative interpretations
- 80C not twice as hot as 40C
Ratio
- Absolute zero
- Property of quant
- Additivity & multiplicative interpretations; 80km = 2x40km

L2
Psych testing
- Cronbach: systematic procedure for comp behavior 2+ppl
- 3 properties
- Observable: aimed at measuring behavior
- Obj: systematic
- Comparative: comp of diff ppl
Test for max vs typical performance
- Max performance: skills/aptitude
- Typical performance: personality traits, attitudes, disorders
- Big diffs in approach of test devel
- Few diffs in statistical analysis of tests scores
Max performance tests
- Power
- Skill wo time pressure
- Most skilled→more correct A
- Speed
- Performance under time pressure
- More skilled→more A wn time limit
- Q difficulty trivial
Norm referenced/criterion referenced; what to do w test score
- Norm referenced
- Comp ppl w rest of population
- Good norm data on population of great importance
- Eg IQ SAT
- Criterion referenced
- Comp ppl w absolute standard
- Test inference not tied to performance level in population
- Eg exam Test Theory criterion referenced
Psych tests contain
- Test material
- Test forms
- Test manual
- Precise test instructions
- Score processing procedure
- Norm tables
- Discussion of scientific qual
Properties of test scores
- Test score generally sum of item scores
- Most important outcome of test used
- Test manual instructions on how to interpret score
- W norm referenced tests, norm table needs to be consulted
- Eg 30% of boys age 3 have a score lower than 3 (30th percentile)
Measurement level tests score
- Test score a nr
- Interpretation of nr depends on level of measurement of test score
- Nominal: eg personality types
- Ordinal: short likert scales
- Interval: eg long likert scale
- Ratio: eg bourdon dot test
Test scores w interval meas level
- Scores only interval(or ratio) level of measurement if they are quant
- Inc 1pt on scale→reflects same specific inc in property you're measuring
- Person A B C w introversion scores 10 20 30
- Score diff bw A & B and B & C equal size
- Not obvious that diffs in introversion comparable
- Test scores usually sum of item scores
- Item scores evidently ordinal
- Tests scores formally ordinal
- For practical/statistical purposes we often act as if test scores are interval level of meas; only justifiable for long tests w wide range of scores
Variation
- Test score intended to reveal diffs bw ppl
- Only possible if ppl differ in test scores
- High degree of variation in test scores desirable
- Bc test score constructed out of item scores
- High variance on item scores also desirable
- High covariance bw item scores desirable
Variation test scores
- Eg test score x constructed out of item scores x1 & x2
- X=X1+X2
- What infl the test score variance s2x

- s2X = s2X1 + s2X2 + 2×cov(X1,X2)
- P values arnd .5→most variance; 50% right 50% wrong
- q=1-p
- Freq of use of each A option→insight into how items function
- Proportion of ppl that choose specific wrong alt=a value
- Distractor: wrong A option in multiple choice
- q=a1+a2+a3
Preliminary study of multiple choice items
- Bc ppl dont know the answer can guess
- P value higher than every a value
- Ideally all wrong A options (distractors) chosen equally often
- A1~a2~a3
- Ideally high item score variance where
- P~q
Polytomous items
- Lots of variance ideal
- Each A options chosen ~equally/with lots of variance
- What is ideal differs for dichotomous vs polytomous items
Test obj
- MC easy to score objectively; not so for
- Open ended
- Behavioral observations
- Projective tests
Interrater reliability
- Diff assessors/test givers need to agree
Kappa
- Agreement bw raters corrected for chance agreement

Lecture 4
CI
- Standard error: SD/√N
- X +/- 1.96xSE

Reliability
- Degree to which test scores remain the same when test administered min twice & under equal cond to the same person
Administration
1. Same respondents twice
a. Useful: bp, height, abilities/skills; cond less/more equal
b. Not useful: behavioral tests, performance level tests; interference caused by
memory/learning process
2. Equal cond:
a. Test cond: item, instruction, space, time
b. Relev psych properties of respondents
c. Physiological cond of respondent
Replicability
- Test score = systematic part + random infl
- Systematic part: part that doesn't vary across replications; avg performance on test if test taken infinitely many times
- Random infl: do vary bw independent replications
- Random infl→determine reliability
Formulas
- X=T+E
- Book notation;
- Xo=Xt+Xe
- Xo = X observed
- Xt true score
- Systematic part separated from random part of observation

- Eij=Xij-Ti (tautology)
- True score of a person is their avg test score across a large amt of replications
- Ti = avg of Xij across replications j
Properties of reliable scores & meas error in population for single test administration
- Assumption 1: E=0,
- consequence X(mean)=T
- Assumption 2: rEY=0
- Consequence rET=0
- Def reliability
- Rxx = s2T / s2X = s2T / (s2T + s2E)
- Neg reliability theoretically impossible
- Reliability: ratio true score variance/total variance of observed score
- Perfect reliability; observed=true score
SE of meas of test score

SD of error score
- SEM = SD(X) × √(1 − Rxx)
- Reliability: ratio true score variance/total variance observed score
SE of meas of test scores
- CI for individ true scores T
- Xi ± 1.96 × SEM
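A small sketch of the standard error of measurement and the 95% CI for an individual score; the SD, reliability and observed score are made-up numbers:

```python
# SEM = SD * sqrt(1 - reliability); 95% CI: X_i +/- 1.96 * SEM
sd_x = 15.0          # SD of the test scores (hypothetical)
reliability = 0.90   # Rxx (hypothetical)
x_i = 110            # observed score of one person

sem = sd_x * (1 - reliability) ** 0.5
ci = (x_i - 1.96 * sem, x_i + 1.96 * sem)
print(round(sem, 2), [round(b, 1) for b in ci])
```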

Lecture 3
Transformed scores & norms
- Raw score: test score not meaningful on its own
- Translated→transformed score
- Correct exam answers→exam grade
- Norms: reference frame for eval of raw scores, based on properties of distribution of raw
scores in population
1. Comparison w absolute standard (criterion referenced)
a. Mathematics testimonium: absolute standard min 18 correct

2. Comparison of norms based on ranking


a. Percentile rank: p30 30% of freq distribution below this pt

Linear transformations
Y=a+bX
- For dev score: a = −X̄ (mean), b = 1
Correlation

Transformed scores & norms


3. Comp of norms based on avg & variation
i. Linear standard scores (z scores)

- z = (X − X̄) / SX
- Linear transformation
- Shape of freq distribution maintained
- Correlation w other test scores maintained
ii. Normalized z scores
- z=calculated using tables standard normal distribution
- Non linear transformation
- Shape of freq distribution not maintained
- Correlation w other test scores needs to be calculated again
- More convenient, only defensible if property normally distributed
Transformation standard scores

- Change X score→z score→sth else
- X=z x SX + Xmean
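A minimal sketch of going from a raw score to a z score, on to a converted standard score, and back; the mean/SD and the IQ-style target metric are assumptions:

```python
# raw -> z -> converted standard score, and back: X = z * S_X + X_mean
mean_x, sd_x = 26.0, 4.0        # hypothetical distribution of raw scores
raw = 30.0

z = (raw - mean_x) / sd_x       # linear standard score
iq_style = 100 + 15 * z         # converted standard score on a mean-100/SD-15 metric
back_to_raw = z * sd_x + mean_x # X = z * S_X + X_mean

print(z, iq_style, back_to_raw)
```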
Stanines: standard nines, intervals numbered 1-9
- Stanines 2-8: ½ SD width each
- Stanine 5: the middle interval, centered on the mean

Lecture 5
- Reliability: degree to which test scores remain the same when the test is administered 2+ times to a person under the same cond
SE of meas

CI interval for T 95%

Determining reliability
- T unknown→reliability can't be computed directly
- Estimation methods needed to assess reliability

Test retest method


- 2 independent replications (same test administered under same cond)

- Rxx estimated by rX1X2; correlation bw the 2 administrations


- Independent replications almost always impossible in psych
- Replications not independent; memory effects rx1x2 higher/lower than actual reliability
- Method not for practice
Parallel forms method (alt forms reliability)
- Parallel if
- 2 tests interchangeable, not identical→administering the same-ish test twice
- Same true score! (Ti1 = Ti2)
- Same meas error; same variance of the error score
- Problems
- Hard to construct parallel tests
- Not possible to empirically test that true scores Ti1=Ti2

Tests parallel?
- A & b easy to satisfy, raw test score→z scores
- C most important: tests parallel?

- Tests not parallel:


- Eg R=.91, rx1x2=.8→ test 1/2 worse than other one
Internal consistency method
- IC estimate reliability of test based on only 1 test administration
- Assume all test parts measure same thing
- Test part: halves, each individ item etc
- Degree of shared variance bw test parts: internal consistency of the test
- Underlying idea: consistency bw test parts provides insight into reliability of the test as a whole
- Shared variance: both parts succeed in measuring the same thing/measuring what they're meant to measure
- Most used ones
- Split half method
- Cronbach's alpha
- Standardized alpha
- KR 20
Split half method
1. Test divided into 2 halves w equal nr of items→for both halves calculate test scores X1 & X2

2. Calculate Rhh: reliability of test half


a. Reliability of half a test w parallel forms method→want to know reliability of the entire test→corrected by
3. Split half Spearman-Brown formula
- Rxx = 2·rhh / (1 + rhh); correlation bw 2 test halves→stepped up to full length


Split half method for GLT
1. Divide test into 2 halves

a.
2. rhh w rx1x2=.843
3. Correction spearman brown formula

a.
- Assumption: halves parallel→otherwise split half reliability too low
- Issue in practice: diff ways of splitting the test in 2 give diff reliability estimates→halves not parallel for every split
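A sketch of the split-half procedure: correlate the two half-test scores and step the result up with the Spearman-Brown formula. The item data are made up, and the odd/even split shown is just one possible way to form halves:

```python
import numpy as np

# rows = persons, columns = items (hypothetical 0/1 item scores)
items = np.array([
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
])

half1 = items[:, 0::2].sum(axis=1)     # score on odd-numbered items
half2 = items[:, 1::2].sum(axis=1)     # score on even-numbered items

r_hh = np.corrcoef(half1, half2)[0, 1]       # reliability of half a test
r_xx = 2 * r_hh / (1 + r_hh)                 # Spearman-Brown step-up to full length
print(round(r_hh, 3), round(r_xx, 3))
```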
Linear combinations
- Eg X=X1+X2+X3
- Avgs = .40, .50, .60; variances: .24, .25, .24, covariances .10

- Avg:
- Variance: s2x: sum of all elements covariance matrix =1.33
Raw Cronbach's alpha

- Diagonal (blue on slide): variances; item covarying w itself
- Off diagonal (green on slide): covariances of items
- Eg covariance matrix of test w 4 items

- Raw alpha = (k2 × avg covariance bw diff items) / s2X
- k: nr of items
- cii': covariance bw 2 diff items
- Population: alpha ≤ Rxx
- Underestimates reliability: value obtained lower than actual test reliability
- Sample statistic; distribution varies from sample to sample
- Cronbach's alpha differs bw sample & population, like sample mean vs population mean
- READ MORE ABT IN THE BOOK
- Sample size not an issue w reliability
- Cronbach alpha underestimation of reliability
- Concluded test less reliable than actually is
- Range 0-1, theoretically can be neg
- Alpha avg split half reliability for all possible ways of splitting
- Generalizes split half method (no arbitrary splitting)
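A sketch of raw alpha computed straight from the item covariance matrix (using the N-denominator covariances, per the notes' convention; item data hypothetical). Both equivalent forms are shown:

```python
import numpy as np

# rows = persons, columns = items (hypothetical item scores)
X = np.array([
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
    [1, 2, 2, 1],
    [3, 3, 4, 4],
], dtype=float)

k = X.shape[1]
C = np.cov(X, rowvar=False, ddof=0)          # item variance-covariance matrix (divide by N)

var_total = C.sum()                          # variance of the sum score = sum of all elements
sum_item_var = np.trace(C)                   # sum of the item variances (diagonal)
alpha = (k / (k - 1)) * (1 - sum_item_var / var_total)

# equivalent form: k^2 * mean inter-item covariance / total variance
mean_cov = (C.sum() - np.trace(C)) / (k * (k - 1))
alpha_alt = k ** 2 * mean_cov / var_total
print(round(alpha, 3), round(alpha_alt, 3))
```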
Standardized Cronbach's alpha
- Raw alpha abt consistency of raw item scores
- Preference working w standardized item scores
- Items differ in variance, not relevance
- Standardization allows every item to contribute equally to the measure
- Raw alpha for standardized items = standardized alpha
- Raw alpha based on covariances bw item scores
- Standardized alpha based on correlations bw item scores

- Standardized alpha = (k × rii') / (1 + (k − 1) × rii')
- rii': avg correlation bw item scores
- Item scores of comparable variance→both alphas virtually the same
- Using standardized item scores→using standardized cronbachs alpha
Kuder richardson 20 KR20
- Coefficient KR20 reliability coefficient for dichotomous data ; 0 & 1

-
- Value KR20=rat alpha
- Only reason tow ork w KR20; calculations easier
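A sketch of standardized alpha (from the average inter-item correlation) and KR-20, using hypothetical 0/1 item data; raw alpha is computed as well to show that KR-20 reproduces it for dichotomous items:

```python
import numpy as np

# hypothetical dichotomous item scores (rows = persons, columns = items)
X = np.array([
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
], dtype=float)
k = X.shape[1]

# standardized alpha: based on the average correlation between different items
R = np.corrcoef(X, rowvar=False)
r_bar = (R.sum() - k) / (k * (k - 1))              # mean off-diagonal correlation
alpha_std = k * r_bar / (1 + (k - 1) * r_bar)

# raw alpha from the covariance matrix (divide by N)
C = np.cov(X, rowvar=False, ddof=0)
alpha_raw = (k / (k - 1)) * (1 - np.trace(C) / C.sum())

# KR-20: raw alpha written with p*q as the item variances
p = X.mean(axis=0)
q = 1 - p
total_var = X.sum(axis=1).var(ddof=0)              # variance of the test score
kr20 = (k / (k - 1)) * (1 - np.sum(p * q) / total_var)

print(round(alpha_std, 3), round(alpha_raw, 3), round(kr20, 3))  # alpha_raw == kr20
```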
IC methods & dimensionality of test
- IC methods assume items measure the exact same psych property: unidimensionality
- Test unidimensional→good indication of reliability
- Test multidimensional→underestimates reliability
- High coefficients dont prove test is unidimensional
- Dimensionality tested better w factor analysis
Meas accuracy
- Notation:
- Standard method
Comparison test scores
- SE equal for everyone (assumption of classical test T)
- CIs don't overlap→test scores differ signif
- Overlapping intervals→test not sufficiently reliable to distinguish bw the scores, even though the observed scores differ
L6
Infl reliability
- Qual of items (internal consistency)
- Dimensionality
- Nr of items in the test
- Heterogeneity in population
- Type of score investigated
- Score on single test (standard)
- Difference score: diff bw 2 test scores
Qual of items
- Standardized alpha higher if inter item correlations higher

- Correlations preferably as high as possible
- Inter item covariances higher→raw alpha higher
- Covariances higher if correlation grows larger
- →more variance in true scores (s2T)
- Impact of meas error (s2E) remains the same→reliability inc
- →correlation bw item scores as high as possible
Dimensionality test
- Pos manifold: all correlation pos
- Reliability higher for higher avg item score correlation rii’
- Test unidimensional→pos correlation bw all test items
- Test multidimensional→measures multiple constructs
- Items measure 1 construct→don’t correlate w items measuring other constructs
- Rii’ lower→Rxx lower
- IC method assumes unidimensionality
- Multidimensional tests highly underestimate Rxx
- Solution. Method for multidimensional tests
- Stratified alpha
- Calculates raw alpha for subtests
- Combines these into 1 reliability coefficient
Stratified alpha
- Combines subtest alphas & subtest variances into 1 reliability coefficient
- So:
- αstrat (=.75) > raw α (=.50) calculated for the test as a whole
- For multidimensional test astrat→better indication of true reliability
- Adding variances in variance covariance matrix→SE
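A sketch of the usual stratified-alpha calculation (1 minus the summed subtest error variances over the total test-score variance), assuming the subtest alphas, subtest variances and total variance are already known; all numbers are hypothetical:

```python
# stratified alpha = 1 - sum_j [ s2_j * (1 - alpha_j) ] / s2_total
subtests = [
    {"alpha": 0.80, "variance": 9.0},    # hypothetical subtest 1
    {"alpha": 0.70, "variance": 6.0},    # hypothetical subtest 2
]
total_variance = 20.0                    # variance of the whole test score (hypothetical)

error_part = sum(st["variance"] * (1 - st["alpha"]) for st in subtests)
alpha_strat = 1 - error_part / total_variance
print(round(alpha_strat, 3))
```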
Reliability & test length
- Inc test length: general Spearman-Brown formula
- Rxx-new = (n × Rxx-original) / (1 + (n − 1) × Rxx-original)
- Eg old test 20 items, Rxx-original = .80, new test 40 items → n = 40/20 = 2 (inc by factor 2)
- Rxx-new = 2 × .80 / (1 + 1 × .80) = .89
- Assumption: added/removed items parallel
- Spearman-Brown formula also used when shortening a test
- 0 < n < 1
- Eg old test 20 items, Rxx-original = .80, new test 10 items; n = 10/20 = .5 (shortening by 50%)
- Rxx-new = .5 × .80 / (1 + (.5 − 1) × .80) = .67
- Low reliability of .4 w 20 items→w 200 items still not .9 reliable
- Inc test length esp useful if original test not too unreliable & test contains only few items
Determining desired nr of test items
- Rewritten version of the Spearman-Brown formula for desired test length
- Now we determine the factor n by which the test needs to be lengthened, given Rxx-original & Rxx-revised
- n = (Rxx-revised × (1 − Rxx-original)) / (Rxx-original × (1 − Rxx-revised))
- Lengthening factor
- →multiply by test items
- →new nr of items needed to obtain x reliability
- Never round down
- Rxx-revised desired minimum reliability of the test
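Both uses of the Spearman-Brown formula in one sketch: predicting reliability after lengthening/shortening by factor n, and solving for the factor needed to reach a target reliability (values taken from the lecture examples):

```python
import math

def sb_new_reliability(r_orig, n):
    """Reliability after changing test length by factor n (parallel items assumed)."""
    return n * r_orig / (1 + (n - 1) * r_orig)

def sb_required_factor(r_orig, r_target):
    """Lengthening factor n needed to reach the target reliability."""
    return (r_target * (1 - r_orig)) / (r_orig * (1 - r_target))

print(round(sb_new_reliability(0.80, 2), 3))    # 20 -> 40 items: ~0.89
print(round(sb_new_reliability(0.80, 0.5), 3))  # 20 -> 10 items: ~0.67
n = sb_required_factor(0.80, 0.90)              # factor needed to reach Rxx = .90
print(n, math.ceil(n * 20))                     # never round the item count down
```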
Heterogeneity in population
- Reliability:
- Rxx = s2T / (s2T + s2E)
- W equal meas error variance s2E, reliability fully depends on the variance in the true score s2T
- Low variation in true scores (homogeneous group)→low reliability
- High reliability easier to achieve in heterogeneous groups (=more variation in true score)
- Reliability not purely a property of the test
- Variation in true scores dec→reliability dec
- Eg IQ test: reliability higher when administered to all 21-yr-olds than to only 21-yr-olds at university
- reliability=property of test in specific population
Reliability for diff scores
- Until now talked abt reliability of single test scores
- In some situations the diff bw test scores is the focal pt
- Pre (Y) & post (X) treatment meas of depression
- Progress in school across yrs
- Diff bw 2 test scores D=X-Y
- Diff score D reliable?
- Diff scores often less reliable comp regular test scores
- Test X & Y have meas error→cumulates in diff score
- X & Y often strongly correlated→diff score D little variance across ppl
- Meas change→eg pre & post measurement of depression
- If pre & post rank orders correlate perfectly→all patients get the same diff score; who improved most & which treatment is thus most effective indistinguishable
- Reliability of diff scores mainly important for determining individ diffs in change

- RD = (s2X×RXX + s2Y×RYY − 2×rXY×SX×SY) / (s2X + s2Y − 2×rXY×SX×SY)
- s2X×RXX: reliability multiplied by variance
- 2×rXY×SX×SY: correlation XY multiplied by SD X & SD Y
- Diff score RD high if
- Tests scores X & Y high reliability
- Low correlation bw X & Y
- Pre & post meas ideally not highly correlated
- Often not realistic in practice→low RD
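A sketch of the difference-score reliability formula above, with hypothetical reliabilities, SDs, and pre-post correlations; the second call shows how a high r_XY drags R_D down:

```python
def difference_score_reliability(sx, sy, rxx, ryy, rxy):
    # R_D = (s2x*Rxx + s2y*Ryy - 2*rxy*sx*sy) / (s2x + s2y - 2*rxy*sx*sy)
    shared = 2 * rxy * sx * sy
    return (sx**2 * rxx + sy**2 * ryy - shared) / (sx**2 + sy**2 - shared)

# hypothetical pre/post depression measures
print(round(difference_score_reliability(sx=5, sy=5, rxx=0.85, ryy=0.85, rxy=0.50), 3))  # 0.70
print(round(difference_score_reliability(sx=5, sy=5, rxx=0.85, ryy=0.85, rxy=0.80), 3))  # 0.25
```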
L 7 improving reliability & consequences
Devel a good test
- Test ideally large nr of items
- Measures 1 H construct
- Inc test length→good idea (spearman brown)
- Items differ in qual→remove weak items
- Determining which items keep/remove?
Item selection
- Selection based on indicator of item discrimination
- Diff indicators, depend on use of test
- Item rest score correlation
- Dichotomized item rest score correlation
- Item discrimination index D
- Item discrimination parameter in item response model
Item rest score correlation (standard power tests)
- Rest score
- R(−i) = X − Xi = sum score − item score
- Rest score for item 1: sum all items except item 1→rest score calculated & possibly diff for every item
- h: item index; h ≠ i; sum over items h other than i
- Item rest correlation = SPSS "corrected item-total correlation"

- Selection rule
- Measure of item discrimination; measures the extent to which the item distinguishes ppl w high vs low rest scores
- Item-true score correlation (unobserved; what we'd use if we had access to it) & item rest score correlation should be similar
- Usually .30 rather than .40 used for the selection rule
- True score & rest score match; w poor items lower correlation
- Eliminate worst item→recalculate rest scores→look at item rest score correlations→remove worst→recalculate, until no candidates for removal→in the end all item rest correlations >.3; items not removed in one go (table on slide demonstrates this)
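A sketch of the item-rest correlation for every item (the rest score leaves the item itself out of the sum), flagged against the .30 rule; item data are hypothetical. In practice the worst item would be removed and the correlations recomputed, one item at a time:

```python
import numpy as np

# rows = persons, columns = items (hypothetical item scores)
X = np.array([
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
], dtype=float)

total = X.sum(axis=1)
for i in range(X.shape[1]):
    rest = total - X[:, i]                         # rest score: sum of all other items
    r_ir = np.corrcoef(X[:, i], rest)[0, 1]        # item-rest correlation
    flag = "" if r_ir >= 0.30 else "  <- candidate for removal"
    print(f"item {i + 1}: r(item, rest) = {r_ir:.2f}{flag}")
```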
Groninger height test

SPSS
- Inter item correlation; some item correlations not pos/less than .3
- Item total statistics; cronbachs alpha if item deleted
- Estimated cronbach→item lowers it→removing it inc alpha
- SPSS reliability analysis; item, scale, scale if item del, covariance

Correlation for dichotomous variables


- Investigating association bw 2 dichotomous variables X & Y

- Correlation calculated w phi coefficient


- Abt observed agreement bw 2 variables
- Score 1 on both X & Y = agreement score 0 on both X & Y = agreement
- Every1 placed in 1 of 4 cells of a 2x2 freq table
- 2 cells agree ( B & C), 2 don’t agree ( A & D)
- Disagree: A: Y 1 & X 0
- Disagree: D: Y 0 & X1
- Agree: X & Y 1
- Agree: X & Y 0
- Match scatter plot; 4 quadrants below next to phi formula
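Phi is simply the Pearson correlation between two 0/1 variables, so a sketch can compute it either from the raw vectors or from the 2x2 cell counts; the data are made up:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])   # hypothetical dichotomous variable X
y = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])   # hypothetical dichotomous variable Y

# phi as the ordinary correlation between two 0/1 variables
phi_from_corr = np.corrcoef(x, y)[0, 1]

# phi from the 2x2 table of counts
n11 = np.sum((x == 1) & (y == 1))   # agreement: both 1
n00 = np.sum((x == 0) & (y == 0))   # agreement: both 0
n10 = np.sum((x == 1) & (y == 0))   # disagreement
n01 = np.sum((x == 0) & (y == 1))   # disagreement
num = n11 * n00 - n10 * n01
den = ((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)) ** 0.5
phi_from_table = num / den

print(round(phi_from_corr, 3), round(phi_from_table, 3))   # identical values
```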
Dichotomized item rest score correlation ( selection test)
- Tests to select certain % of respondents w highest /lowest true scores
- Calculate R(-i) & dichotomized rest score R(-i)dich

- Calculate for items


- Select items w highest ^
- Rest score not a continuous variable; never getting a clean cut bw eg 30 & 70%
Construction of selection tests

Item discrimination index D (dichotomous items)


- Items need to effectively differentiate bw ppl that score high vs low on construct
- High item discrimination desirable
- For dichotomous items you want
- Lot of ppl that have high rest score to answer Q correctly
- Lot of ppl that have low rest score to answer Q incorrectly
- Comparable to phi, but not for selection goals
- Divide ppl into 3 groups
- Low middle high
- Based on rest score (30% highest/lowest)
- Want to see diff in performance on item bw high & low
- High proportion correct (p value) desired in high group
- Low proportion correct desired in low group
- D= Phigh-Plow
Item discrimination index D (dichotomous items)
- D preferably as high as possible
- Higher D→item discriminates better
- Better discriminating items inc reliability
- Better ways to determine item discrimination
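A sketch of the discrimination index D for one dichotomous item: split respondents into low/high groups by rest score (the 30% tails) and compare the proportions correct. The rest scores and item responses are simulated, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical data: 20 persons, their rest scores and scores on one dichotomous item
rest = rng.integers(0, 15, size=20)
item = (rest + rng.normal(0, 3, size=20) > 7).astype(int)   # item loosely related to rest score

n_tail = int(0.30 * len(rest))              # size of the 30% groups
order = np.argsort(rest)
low_group = item[order[:n_tail]]            # 30% lowest rest scores
high_group = item[order[-n_tail:]]          # 30% highest rest scores

D = high_group.mean() - low_group.mean()    # D = p_high - p_low
print(round(D, 2))
```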
Consequences of low reliability
- Reliability never 1
- Low test reliability neg impact on
- Correlation test score w other constructs
- Effect size estimates
- Power of statistical tests
Attenuation of correlations
- Every observation X: certain amt of meas error E
- Meas error E completely random, doesn't correlate w anything else
- Meas error confounds the measurement
- Meas construct T can strongly correlate w smth else Y
- Observed score X will correlate less strongly bc confounded (noisy) representation of T
- Higher meas error→more confounding→lower correlation bw X & Y
- Low reliability attenuates observed correlations
- Suppose TX & Y correlate perfectly: rTxY = 1
- Correlation bw X & Y then only rXY = rTxY × √RXX = √RXX
- Lower Rxx=lower max correlation bw X & any other variable
Attenuation bc X unreliable

Attenuation of correlations
- X & Y meas w not perfectly reliable tests
- Eg intelligence test X & performance review Y
- Both meas error in X & Y attenuates correlation

- rXY = rTxTy × √(RXX × RYY)
- Correlation bw 2 constructs (rTxTy) higher comp correlation bw test scores meas those
constructs rxy
Correcting for attenuation in X
- rTxY = rXY / √RXX
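A sketch of the attenuation formula and the disattenuation correction; the true correlation and reliabilities are made-up numbers:

```python
# attenuation: r_xy = r_TxTy * sqrt(Rxx * Ryy)
# correction:  r_TxTy = r_xy / sqrt(Rxx * Ryy); correcting for X only: r_TxY = r_xy / sqrt(Rxx)
r_true = 0.60        # correlation between the constructs (hypothetical)
rxx, ryy = 0.70, 0.80

r_observed = r_true * (rxx * ryy) ** 0.5          # what we expect to see with unreliable tests
r_corrected = r_observed / (rxx * ryy) ** 0.5     # disattenuated estimate, back to 0.60
r_corrected_x_only = r_observed / rxx ** 0.5      # correcting for unreliability in X only

print(round(r_observed, 3), round(r_corrected, 3), round(r_corrected_x_only, 3))
```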

L8 construct validity
2 concepts of validity
- Most important general def
- validity=degree test serves its purpose
- Validity depends on purpose
- Use of test can be valid, but the test itself can be not valid
- When concluded test performs sufficiently well
Validity 2 types
- Construct validity: what extent H construct responsible for test score (psych meaning)
- Criterion (predictive) validity: how well test predicts behavior/performance outside the test situation (criterion in present, past, future)
- Construct validity:
- What exactly does the test measure
- Focal point wn scientific research
- Criterion validity
- Can the test be used to predict smth else
- Focal point for practical use test
- Construct validity / criterion validity rel
Relationship bw construct & criterion validity
- Wo construct validity no criterion validity
- Only reason that tests predict smth is bc it measures smth relev
- Wo criterion validity no construct validity
- If test measures smth relev it also has to be able to predict smth
- Some psychologists criterion validity seen as one aspect of construct validity
- Separating the 2 types of validity more convenient

-
- Bullseye construct validity
5 aspects of construct validity
- Pentagon of construct validity indicates what we need to pay attention to when
determining construct validity of a test
- Content of a test
- Test component association
- Response processes
- Test use consequences
- Association w other constructs
1 content of the test
- Abt content validity
- Item content should rel to constructs you want to measure
- Content of items should not rel to any other constructs
- Item sets together need to sufficiently cover the constructs
- All important aspects of the constructs needs to be covered sufficiently
- Balance needs to be in order
- Aspects of the construct studied should be known; eg extraversion has diff aspects; not only abt smalltalk/partying
1 content validity v face validity
- Face validity rel to content validity
- Face validity ~ content validity as assessed by laymen
- Not important for psychometric qual of test (bc a layman is not an expert)
- Sometimes important for practical use
- Results of test w low face validity accepted less often in practice
- Good content validity often leads to face validity, not the other way around
2. Association of test components
- All items measure same property→pos manifold expected
- Pos correlation bw all items
- Important for both reliability of test & construct validity
- We want a unidimensional test
- Multidimensional test also possible, if rel to T (pt 1)
- Dimensionality of test examined using factor analysis
- Not for exam!!
- Examining dimensionality of test crucial for validating the test
- Unexpected multidimensionality could be at the expense of fairness
- Internal consistency coefficients eg cronbach's alpha dont give an indication of nr of
dimensions
- Examining association bw items of extra importance for multidimensional tests
- Multidimensional tests often work w rel subconstructs
- Rel to eg diff aspects of intelligence (RAKIT)
- Does item measure its intended sub constructs
- Do items measure other sub constructs beyond the intended one
- Key pt: do individ items measure what they need to measure?
3 response processes for max performance tests
- Items formulated to elicit certain response processes
- Max performance test
1. Max effort to solve problem
2. often certain intended path toward the solution
3. Following the right path→correct answer
- If assumptions dont hold→expense of validity of instrument
Max effort to solve problem
- Assumption that max performance test everybody puts in max effort
- In that case performance hopefully→good indication of what you’re capable of
- Not every1 puts in max effort→expense of validity of measurement
- Some ppl score low bc not that skilled
- Some ppl score low bc have low motivation
- If possible to examine partly w process data abt eg response times
1 path to solution
- Performances comparable only if ppl try to do the same thing (=go through the same
response process)
- 17 x 99 = ?
1. Resp A: multiplication rules
2. Resp B 17 x 100 - 17 = ?
3. Resp C memorized all multiplication tables & knows the answer
- Do we measure exactly same skill for everyone
Multiple solutions possible?
- Items aim to measure same construct as the rest of the test
- More skilled respondents should always have higher chance to solve item correctly
- Problematic if there are multiple possible solutions (or if the wrong option is counted as the correct one)
- Detect using item analysis
- Discrimination index D (Lecture 7)
- Item response theory analyses LECTURE 11 & 12
Response processes for typical performance tests
- Typical performance items meant to gain insight into someones true attitude/personality
- In practice many response styles can threaten validity
1. Social desirability
2. Acquiescence
3. Extreme v mild answer tendency
- Distorts the measure of intended property
Social desirability
- For typical performance items no right/wrong answers
- In practice some answers more desirable
- Pos image of yourself toward others
- Pos image of yourself toward yourself
- Respondents can take this into account in answering behavior
- Social desirability tests exist, but correction difficult
- Anti social response style (provoking) also possible
Acquiescence
- Some ppl have tendency to agree w statements rather than disagree
- Social desirability (otherwise disagreeing w researcher)
- Cogn biases
- Causes the measurement to be distorted
- If only indicative items used→overestimation
- Solution: balance indicative & contra indicative items
Extreme v mild answering
- Some ppl quick to claim an extreme position
- For likert Qs often pick the most extreme options
- Leads to overestimation of extremeness of their position/personality
- Counterpart: mild response style
- Ppl choose neutral option independent of content
- Solution: advanced statistical methods
Infl test use on validity
- Validity: degree to which test serves its purpose
- Validity not separate from how test used
- improper /unfair use of test→not valid
- Debatable whether this should fall under construct validity
- Improper use could be the users fault (improper use)
- Problem is in this case not measurement itself but what is done w it
Association w other constructs
- Construct validity abt the Q to what extent the test measures the construct of interest
- From psych T we know how this construct relates to other constructs
- Association bw constructs & their corresponding tests captured in nomological network
- Then you examine if the scores on the test are indeed correlated w these constructs
- Empirical validation research

Nomological network: series of constructs, some rel some unrel


Empirically researching nomological network
- If the test measures what its supposed to measure→test score..
1. Correlates strongly w scores on tests that measure the same construct (=convergent
validity)
2. Correlates w scores on tests that measure related constructs (=convergent validity)
3. Doesn't correlate w scores on tests that measure unrelated constructs (=discriminant validity)
- Here we do assume that the other tests are reliable & valid
- Resembles criterion validity, but now there is no emphasis on any particular criterion
Convergent validity GLT
- Examining association bw tests that measure the same thing (using simple linear regression)
- Examining association bw test that measure related constructs
Discriminant validity
- Examining association bw tests that measure unrelated constructs
Multitrait multimethod research
- Systematic research on convergent & discriminant validity using multitrait multimethod
research
- Research on multiple constructs (=multitrait)
- Every trait measured using multiple methods (=multimethod)
- Correlation bw every trait method combination w every other combination is determined
Eg MTMM research

- Diagonal: correlation of the test w itself but diff replication


You hope to find
- High correlation bw measurement of same constructs based on diff methods
(=convergent validity)
- low correlation bw measurement of diff constructs based on diff methods (discriminant
validity)
- Low correlation bw measurement of diff constructs based on same method (discriminant
validity & absence method effects)
Method effects
- Association bw measurements of unrelated constructs due to the fact that the same measurement method has been used
- Undesirable bc the constructs are not correlated
- Results in spurious correlations
- Finally: reliability of every test placed on diagonal in the table
Convergent validity: smaller diagonals
- Divided into 9 3x3 tables; smaller matrices
- Zoom into 1 smaller matrix→info abt convergent validity
- Peer v self assessed friendliness
Discriminant validity
- Close to zero correlations
- Teacher dominance v self friendliness→ low
Method effect
- Outside diagonal in main box
L 9 criterion validity
Using test in practice
- Test can be reliable & have good construct validity
- Does not yet tell us whether or not the test is practically relev
- Can we use the test to make predictions abt whether the examined persons will satisfy a
certain relev criterion
- Does the use of the test add to the qual of decisions we take abt these persons
- The question of criterion validity
Logic of using tests for decisions
- Often we want to take decisions using a criterion that isn't available
- Often impossible to measure the criterion before you have made the decision
- Study success criterion for admission to school
- Effect psych treatment on level of depression client
- By measuring a relev property you hope to predict the criterion
Logic of using tests for decisions
- Test functions as stand in for the not observed criterion
- Test & criterion not the same: correlation <1
- Decision taken based on test score instead of the observed criterion→decision becomes worse
- Is a decision based on the test score better than a decision wo test score? & how much better?
- Degree to which test score helps w making correct decision will depend on the
association bw test scores & criterion
Examining criterion validity
- Criterion validity determined by association bw test score & criterion
- Normally criterion wont be available when you make the decision
- Determining criterion validity requires dedicated research
- Large representative samples
- Everyone takes the tests
- Criterion measured (later) for everyone as well
- Once both test score X & criterion Y are determined you can study validity
- 1st step: examining correlation bw the 2 scores (rxy)
- Correlation of 1: test is a perfect stand in for criterion
- Correlation of 0: test completely unrelated to criterion→doesn't add anything to the decision making process
- Correlation called predictive validity
Predictive validity in practice
- In practice low correlation often found bw test score & criterion
- Estimated correlation infl by
1. Restriction of range for test score X (=design error research)
2. Restriction range for criterion Y
3. Nonlinear association test score & criterion
4. Heteroscedasticity
Restriction of range for test score X

-
- Admitted group (X>1) is more homogeneous comp group as whole→lower estimated
correlation
Restriction of range for the criterion
- Same problem can also play a role for the criterion
- Not every1 that the test was administered to is available later on to measure the criterion
- If attrition depends on the criterion→distorted image
- Often occurs w selection tests: poorly performing individs don't survive until the moment the criterion is measured
- Here it is also the case that the remaining group is more homogenous comp group as a
whole→lower correlation

-
- Remaining group (Y>1.5) is even more homogeneous than group that we selected (X>1)
→estimated correlation practically 0
Nonlinear association test score & criterion

- Strength & direction of association bw test score & criterion now depend on the test score
- X<0 then r= .-68; X>0 then r= -67
Heteroscedasticity

- Strength association bw test score & criterion now depends on test score
- X<0 then r= .78; X>/=0 r= .44
- Low motivation→guaranteed won’t pass, high motivation→other factors expl whether
pass
- Usefulness of the test depends on the test score
Predictive validity often low
- Even if these 4 problems don’t play a role predictive validity often low
- Multiple possible reasons
1. Measurement of criterion Y unreliable
2. Measurement of criterion not valid
Reliability & max predictive validity
- Predictive validity measured by rxy
- We still know from lecture 7

-
- rxy still low if RXX or RYY low, even w high rTxTy
- Reliability of the measurement of criterion very important but often overlooked
Validity of criterion measurement
- Idea: test predicts actual criterion
- If we measure this criterion incorrectly→infl rxy
- rxy can be lower than the real association bw test score & actual criterion
- Intelligence test possibly good predictor of actual performance
- If performance assessment produces a non valid measurement of actual
performance the correlation rxy will fall short
Criterion validity for dichotomous decisions
- Tests often used for making dichotomous decisions
- accept/reject
- treat/dont treat
- Treatment A/B
- Most important: classify as accurately as possible
- Not the same as high linear association rxy
- Test score X approximately continuous
- Dichotomous decision based on cutoff score Xcrit
- X<Xcrit→reject (0); X>Xcrit→accept (1)
- Continuous criterion must also be dichotomized
- Y<Ycrit→doesn't satisfy criterion; Y>Ycrit→does
- Place everyone in a 2x2 freq table (incorrect-correct table like w rest scores)

-
- B & C correct; choose & reject right ones
- A & D wrong ones; reject & choose wrong ones
Eg criterion validity & decisions
- Students wo math background need to pass testimonium mathematics bf start of study
- Idea that wo this knowledge chance of successfully studying is low
- Study success (meas based on credits obtained in yr 1; Y) not known at start of study
- Test score testimonium (X) therefore stand in for the not observed criterion Y
Test use for dichotomous decisions
- A: pos misses/false negatives: 100 students unjustly rejected
- B: pos hits/true positives: 196 students justly not rejected
- C: neg hits/true negatives: 84 students justly rejected
- D: neg misses/false positives: 20 students unjustly not rejected
- Selection rate: proportion accepted students
- Selection rate = (B + D)/N = 216/400 = .54
- Base rate: prop students >30 credits (who satisfy the criterion), regardless of the test
- Base rate = (A + B)/N = 296/400 = .74
- Base rate goes down if inc criterion
- Low base rate→filtering out ppl not passing criterion
- Selection rate: who passes the test
- Base rate: who passes the criterion
- Sensitivity/success rate: proportion accepted students justly accepted
- Success rate = B/(B + D) = 196/216 = .91
- Specificity: proportion rejected students justly rejected
- Specificity = C/(A + C) = 84/184 = .46
- High success & specificity rate ideal
- Tradeoff; high specificity→lower success; more ppl admitted unjustly
- High success→low specificity; more ppl rejected unjustly
- Validity: correlation bw dichotomized test & criterion score (=phi)

-
- Same calc as for determining the correlation bw item score & rest score
-
- Phi correlation coefficient
- Success rate/sensitivity dependent on
- 1. Validity phi
- If phi larger→B & C larger, A & D smaller→B/(B+D) larger

-
- 2. Selection rate
- Rejecting more ppl→ B larger comp B+D

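A sketch that reproduces the testimonium example's rates from the 2x2 cell counts (A, B, C, D as defined above); the phi value is just what those counts imply:

```python
# A: unjustly rejected, B: justly accepted, C: justly rejected, D: unjustly accepted
A, B, C, D = 100, 196, 84, 20
N = A + B + C + D

selection_rate = (B + D) / N           # proportion accepted by the test
base_rate = (A + B) / N                # proportion that satisfies the criterion
success_rate = B / (B + D)             # of the accepted, how many satisfy the criterion
specificity = C / (A + C)              # of the rejected, how many indeed fail the criterion

# phi between the dichotomized test and criterion, from the 2x2 counts
num = B * C - A * D
den = ((A + B) * (C + D) * (A + C) * (B + D)) ** 0.5
phi = num / den

print(selection_rate, base_rate, round(success_rate, 2), round(specificity, 2), round(phi, 2))
```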
Test use for dichotomous decisions
1. Problem: many of hired candidates unqualified
a. Cause: low validity test
b. Low base rate
c. High selection rate (many ppl need to be admitted→most ppl admitted,
unqualified hard to filter out)
2. Problem: optimal balance pos/neg misses
- Depends on situation
- Neg misses D: how bad is hiring unqualified person; non sick person treated
- Pos misses A: how bad not hiring qualified person; sick person not treated
- Stricter selection→ D smaller but A larger
- More lenient selection→ A smaller but D larger
3. Relationship bw success rate, validity, base rate & selection rate: Taylor-Russell tables
a. Base rate = .60; validity = .20 (low); selection rate = .10 (strict)→success rate = .73; why is the success rate high for a test w low validity?
b. Selection rate inc→success rate close to base rate
c. validity=0→success rate=base rate
d. Very large/very small base rate→selection virtually pointless

L 11 introduction to item response T (as opposed to classical test T)


Classical test T
- Reliable measurement Rxx & SE
- How to measure H constructs eg intelligence? True score T estimated using test score X
- Disadvantages
- T & X dependent on both respondent & test
- No control over the model (no way to check whether X=T+E is correct)
- Interval measurement level of X uncertain
- Implausible assumption: accuracy of measurement SE same for everyone; measuring w same degree of precision
Item response T
- Alt: IRT item response T
- Also called modern test T
- Statistical model for expl diffs in item & test scores based on H construct
Beforehand models
- Describe phenomenon
- Simplified representation of reality
- Fit reality to higher/lesser degree
Cond probability

- Uncond probability P(MTO-C = P ) = .40


- Uncond joint probability P(MTO-C = P, TT = P) = .30
- Random person passes both
- Cond probability P(MTO-C=P | TT=P) = .30/.70 = .43
- Passing CRM given already passed Test Theory
- Top row: passed TT = .70
- .30 of the top row also pass CRM
- Cond probability P(TT=P | MTO-C=P) = .30/.40 = .75
- Passing Test Theory given already passed CRM
- .40 pass CRM (right vertical margin)
- Test Theory pass & CRM pass = .30
- Passing 1 exam→inc probability of passing the other exam
- Conditional probability higher given passing the other exam, comp uncond probability
- Uncond joint: random person passes both; cond: given person passes 1 exam→probability they pass the other
Item probabilities for dichotomous items
- p value: proportion of ppl that passed the item
- Uncond probability!!
- P(Xi=1) = p; probability that random respondent answers i correctly
- But: not everyone has the exact same probability of answering the Q correctly
- High skilled respondents: probability>p
- Low skilled respondents : probability’<p
Item probabilities & skill levels
- To determine the discrimination index D we looked at phigh & plow
- Cond probabilities!!
- Phigh = probability Q correct given respondent belongs to best 30%
- Phigh = P(Xi = 1 | respondent in best 30%)
- Plow = probability Q correct given respondent belongs to worst 30%
- Plow = P(Xi = 1 | respondent in worst 30%)
- Problem: best 30% of ppl still differ among themselves in probability to correctly answer the item
- Further split group→0-10, 10-20, 20-30
- Can always be split up further; problem
- What we want is probability of item correct given your exact skill level
- Capturing this probability focus of IRT item response T; personalizing test response
Latent trait
- Latent: cannot be observed directly
- Skill level in IRT indicated w θ (theta)
- θ often called latent trait (not always a skill)
- Eg θ for topographical knowledge
- Comparable w true score in classical test T
- Everyone has their own value on θ
- Eg θjohn = 2.0, θina = −.5
- John 2 SD above the mean, Ina .5 SD below the mean
- Assumption: θ standard normally distributed (mean = 0, SD = 1)
IRT & item characteristic function
- IRT abt determining probability of correctly answering a question given skill level
- In other words: what is P(Xi=1 | θ) for item i and every possible θ
- The description of the item probability as a function of θ is called the item characteristic function
- Item characteristic function shows how much your skill matters for the probability of
correctly answering the question
Item characteristic function
- Item characteristic function ICF diff for every item
- Not all items are equally difficult
- Not all items discriminate equally well
- If you know the ICF & θ, you can determine the probability of answering the Q correctly
- Can be inferred from the figure
- Figure also shows if the item functions well

Easy item vs difficult item (figure)
- Straight line: poorly discriminating item
- Mastery level→doesn't change chances of getting the item right
- S shaped→well discriminating item
- Decreasing curve→smth wrong; eg using wrong answer key, mistake in coding data
Item characteristic functions & IRT models
- How can you determine what the ICF is
- Use an IRT model
- Model makes assumptions abt the shape of the ICF
- Based on the data you can estimate the ICF for every item
- Done by statistical software (not SPSS)
- We need to choose an IRT model
e & exponents
- e: base of the natural logarithm (≈ 2.718)
- Exponentiation
Logistic regression for item responses
- Simple logistic regression is abt predicting a dichotomous outcome based on a predictor
- P(Y=1 | X) = e^(b0 + b1·X) / (1 + e^(b0 + b1·X))
- Y & X can be anything, so we can also fill in
- Y = Xi (item score)
- X = θ
- IRT models: logistic regression for item scores
- Simplest model = Rasch model (set b1 to 1 and rewrite b0 = −β)
- P(Xi=1 | θ) = e^(θ − βi) / (1 + e^(θ − βi))
Rasch model = one parameter logistic model
- Only 1 item parameter βi: item difficulty
- Rasch model easier to write as
- P(Xi=1 | θ) = exp(θ − βi) / (1 + exp(θ − βi))
Rasch model

- Items can only differ from EO in difficulty


- ICF looks diff for every item
- ICFs differ only in the location of the S curve
- βi therefore also called location parameter; where the curve is located
- βi: the value of θ for which P(Xi=1 | θ) = .5
- Skill level = item difficulty→50% chance of getting the answer right
- Item difficulty = θ where ppl of that skill level have a 50% chance of getting it right
- Difficulty level = ability level w 50% chance of getting it right
- In bw getting it right/wrong; difficulty level beyond which >50% chance of getting it right
- Every item has its own ICF & own curve; each item diff probability of getting it right
- Avg respondent: θ = 0 ability level; 0 SD from the mean
Item info for rasch model
- Rasch items differ in difficulty β
- Items provide lot of info abt θ values close to item location β
- Little info abt θ values far from there
- For θ >> βi: almost everyone gets Q correct
- For θ << βi: almost everyone gets Q wrong
Item info function
- Function Ii(θ): if higher→θ measured more accurately
- Eg Rasch item β = 0; persons A (θ = −2), B (θ = 0), C (θ = 3)
- θ of B measured more accurately by this item than θ of A or C
- Ii: item information function; how much info the item gives abt the ability level
- Item info always highest arnd the item difficulty; larger diff bw item difficulty & ability level→lower item info
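A sketch of the Rasch item characteristic function and its item information (for a Rasch item the information is P·(1−P)); the difficulty and θ values are made up:

```python
import numpy as np

def rasch_p(theta, beta):
    """P(X_i = 1 | theta) under the Rasch model."""
    return np.exp(theta - beta) / (1 + np.exp(theta - beta))

def rasch_info(theta, beta):
    """Item information of a Rasch item: P * (1 - P)."""
    p = rasch_p(theta, beta)
    return p * (1 - p)

thetas = np.array([-2.0, -0.5, 0.0, 2.0])
beta = 0.0                                   # hypothetical item difficulty
print(np.round(rasch_p(thetas, beta), 3))    # exactly .5 at theta == beta
print(np.round(rasch_info(thetas, beta), 3)) # info peaks at theta == beta
```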
Steps IRT analysis
- Select IRT model
- Draw large sample from population
- Estimate item parameters (Rasch: item difficulty β)
- →use test to estimate θ for ppl
- θ not observed, therefore we get an estimate (θ̂)


Determining a person's estimate of θ
- For each person we know which Qs they got right/wrong
- We have observed their response pattern on the test
- Based on the response pattern some values of θ are more realistic than others
- Sm1 w θ = −2 will not often answer Qs w β = 1 correctly
- Software finds the optimal estimate of θ
- Nr of items that the person answers correctly (=test score X) determines the estimated θ
- Just like classical test T: higher X→higher estimated ability
- Difference is that we estimate θ, not T
- Concept comparable; more Qs correct→higher estimated ability
- Most important difference: in IRT the standard error of meas is not equal for everyone
Model according to rasch
- Population independence of ppl & items
- Difficult math test D & easy math test E
- CTT: TjohnE > TinaD does not say anything abt numeracy of John rel to Ina
- Rasch model: θjohnE = θjohnD = θjohn; θinaE = θinaD = θina
- Thus we can compare θjohnE w θinaD
- Rasch model: we can also always compare items βi & βh, even if items i & h are not in the same test
The model according to Rasch
- Strict: all functions discriminate to same degree; no random guessing
- Therefore not very flexible item characteristic functions
- Consequence→model often doesnt fit
- One parameter logistic model bc 1 item parameter
- Often IRT models more flexible (=more item parameters) but also more complex
- Besides rasch only discussed 2 parameter logistic model
Birnbaums 2 parameters logistic model

- P(Xi=1 | θ) = exp(ai·(θ − βi)) / (1 + exp(ai·(θ − βi)))
- Rasch model extended w item discrimination parameter ai
- ai indicates how well item i distinguishes bw ppl based on their level on the latent trait
- Items now differ in how well they discriminate, however always the case that ai > 0
- Higher ai the better; leads to steeper item characteristic functions

Item info for 2PLM


- 2PLM items differ in both difficulty βi & discrimination ai
- Similar to Rasch model
- Items provide most info abt θ close to βi
- Diff from Rasch model
- Items differ in how well they discriminate
- Higher ai→item provides more info
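A sketch of the two-parameter logistic ICF and its item information (for the 2PLM the information is a²·P·(1−P)); the parameter values are hypothetical:

```python
import numpy as np

def icc_2pl(theta, a, beta):
    """P(X_i = 1 | theta) under Birnbaum's 2PLM."""
    return 1 / (1 + np.exp(-a * (theta - beta)))

def info_2pl(theta, a, beta):
    """Item information for a 2PL item: a^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, beta)
    return a ** 2 * p * (1 - p)

thetas = np.linspace(-3, 3, 7)
weak_item = icc_2pl(thetas, a=0.5, beta=0.0)     # flat curve: discriminates poorly
strong_item = icc_2pl(thetas, a=2.0, beta=0.0)   # steep curve: discriminates well
print(np.round(weak_item, 2))
print(np.round(strong_item, 2))
print(np.round(info_2pl(thetas, a=2.0, beta=0.0), 2))   # more info near beta, more for high a
```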
L12 IRT in practice
Rasch model 1PLM

- Depends on item difficulty & ability level


- 1st item easiest; higher chance getting 1st right>2nd>3rd; items ordered
- Population independence of items
Birnbaum 2PLM

- Depends on item difficulty, discrimination parameter & ability level


- Item side: some items more difficult, some discriminate better
- Own side: ability level θ
- ICFs intersect; different steepness→always will intersect
Purpose item response T
1. Test construction
a. Estimate item characteristic function
b. Select best items
2. Test administration
a. Estimate θ for all respondents based on item scores & ICF
b. Derive accuracy w which θ is estimated
Accuracy of estimation of T in CTT
- Classical test T: T estimated by X; SE equal for everyone
- Assumption: CIs equally wide; everyone measured w the same precision
Accuracy of estimation of θ
- Item info Ii(θ): higher→θ measured more accurately
- Rasch item β = 0; persons A (θ = −2), B (θ = 0), C (θ = 3)
- θ̂ more accurate for B than for A/C

Test info function
- Test info function: Itest(θ) = Σ Ii(θ), the sum of the item info functions
- Test w 2 Rasch items
- θ̂ more accurate for θ = −2 than for θ = 0 or θ = 3; smallest CI arnd θ̂
- Peak at the item difficulty; peak higher for a well discriminating item
Accuracy of meas
- SE of meas of θ for the test (ie SEtest(θ)) determined by the test info & therefore depends on θ
- SEtest(θ) = 1 / √Itest(θ)
- 95% CI for θ: θ̂ ± 1.96 × SEtest(θ)
- Higher test info for θ→smaller CI for θ
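A sketch of the test information function as the sum of the item informations, with SE(θ) = 1/√I_test(θ) and the resulting 95% CI; the item difficulties are made up and Rasch items are assumed:

```python
import numpy as np

def rasch_p(theta, beta):
    return np.exp(theta - beta) / (1 + np.exp(theta - beta))

def test_info(theta, betas):
    """Test information = sum of the item informations (Rasch: P*(1-P) per item)."""
    p = rasch_p(theta, np.asarray(betas))
    return np.sum(p * (1 - p))

betas = [-1.5, -0.5, 0.0, 0.5, 1.0]       # hypothetical item difficulties
for theta in (-2.0, 0.0, 3.0):
    info = test_info(theta, betas)
    se = 1 / np.sqrt(info)                # SE_test(theta)
    ci = (theta - 1.96 * se, theta + 1.96 * se)
    print(theta, round(info, 2), round(se, 2), np.round(ci, 2))
```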
Test construction based on an item bank
- Item bank: rel large collection of easily accessible items, ICF & item description known
- Select exactly those items that contribute well to what you want to measure w the test
- What is the desired accuracy of θ̂ for all diff possible values of θ

-
Test construction based on item bank
- IRT population independent measuring possible
- W item banks diff tests can be composed; depends on the purpose of the test
- Target information function

- Figure legend: target information = uninterrupted green; test info = striped pink; info of new item = striped blue; info of previous items = dotted black

- Lacks on high end; how highest performing ppl differ

- W 6 items also discriminate between highest performing


- Abt Constructing a test!!
- Item bank bigger than actual test
Test construction based on an item bank
- IRT allows for custom made test
- Diff purpose→diff target info function→diff test composition
- This way we can devel tests that accurately measure θ in a specific subpopulation (eg high/low skilled group)
- Population independent→results remain comparable
- Taken a step further: everyone gets a personalized test
Adaptive tests
- Every respondent receives a unique custom made test
- Items selected so that their level matches the respondents ability
- More efficient measurement
- Item bank w items that satisfy IRT model necessary
- Almost always computer based : CAT computer adaptive testing
Adaptive tests eg numeracy
- Step 0: start w θ̂ = 0, SEtest(θ) = infinite
- Step 1: select the most informative item for θ̂ = 0 from the item bank
- Administer the item, re-estimate θ̂
- Continue until θ estimated accurately enough (ie until SEtest(θ) small enough); see the sketch below
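A toy sketch of that adaptive loop: pick the most informative remaining item for the current θ̂, "administer" it, update θ̂, and stop once SE is small enough. The update here is a crude grid-based maximum-likelihood step just to show the flow; the item bank, the simulated examinee, and the stopping rule are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
grid = np.linspace(-4, 4, 161)                      # candidate theta values
bank = list(np.linspace(-2.5, 2.5, 41))             # hypothetical Rasch item bank (difficulties)
true_theta = 1.2                                    # simulated examinee

def p_correct(theta, beta):
    return 1 / (1 + np.exp(-(theta - beta)))

theta_hat, se = 0.0, np.inf
answers = []                                        # (beta, response) pairs
while se > 0.40 and bank:
    beta = min(bank, key=lambda b: abs(b - theta_hat))      # most informative item at theta_hat
    bank.remove(beta)
    response = rng.random() < p_correct(true_theta, beta)   # simulate the answer
    answers.append((beta, int(response)))

    # grid-based ML update of theta_hat from the responses so far
    loglik = np.zeros_like(grid)
    for b, x in answers:
        p = p_correct(grid, b)
        loglik += np.log(p) if x else np.log(1 - p)
    theta_hat = grid[np.argmax(loglik)]

    # SE from the test information at the current estimate
    info = sum(p_correct(theta_hat, b) * (1 - p_correct(theta_hat, b)) for b, _ in answers)
    se = 1 / np.sqrt(info)

print(len(answers), round(theta_hat, 2), round(se, 2))
```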
Adaptive test: + & -
+ Accurate measurement for everyone
+ Adjusted to individ
+ Objective
+ Short test time
+ Fast feedback possible
+ Performance on diff tests can be compared
- IRT models restrictive;
- eg don't allow for guessing
- High costs
- Often hard to construct lots of items for 1 construct
- Multistage testing: simple form of adaptive testing
Multi stage testing
- Everyone receives the same first part of the test
- Based on performance on 1st part→assign next part, eg
- Score below avg→easy 2nd part
- Score min avg→difficult 2nd part
- Possible w 2 stages but also w more stages
- Less efficient than adaptive testing but easier to do
Comparison classical & item response T
IRT
- Provides us w a way to test our measurement model (testing model assumptions)
- Recognizes that accuracy of measurement depends on ability level (SEtest(θ) varies w θ)
- Allows for population independent comparisons of ppl & items
- Allows for adaptive testing
CTT
- Is simpler than IRT
- Also offers tools to set up a good test (reliability, validity)
- Results often dont differ a lot
- Can be found in more standard software
True score in IRT
- In CTT: true score = avg score of a person over replications
- True score = expected score
- Expected score easily determined in IRT
- Expected item score given θ: P(Xi=1 | θ)
- Expected test score given θ: sum of the expected item scores

True score vs θ in IRT

- If we can determine T in IRT, then why do we work w θ
- θ not dependent on test choice
- θ has interval measurement level, T does not
- θ value directly interpretable, bc we know
- mean θ = 0, SD θ = 1
- θ interpreted as z score
- θ: ability level, normally distributed, applies to whatever test we're using to measure it
- For the true score we don't know whether it's of interval measurement level
Use of θ in practice
- You can treat θ as a z score
- With that you can translate θ into a converted standard score
Illustration: IRT for polytomous items
- 2+ score categories, eg likert scales
- Probability of each response option depends on θ, eg extraversion
- Eg probability of answering "somewhat agree" for introverted ppl
L 12 test bias & fairness
Types of test use
1. Meas psych construct→construct validity
2. Predicting criterion→criterion validity
2 types of validity
1. Construct validity; X rel to construct
2. Criterion validity; X predicts criterion
construct validity
1. Every1 w same skill level→same expected test score
2. Not the case→construct bias; bias in test score
Criterion validity
1. Ppl w same test score→same predicted value for criterion
2. Not→predictive bias; bias in predicted criterion
Construct bias
- Bias test score
- Identical response processes: 1 of the 5 construct validity pillars
- Ppl don’t go through same response process→results not comp
- Language problems
- Fear of failure
- Lacking background knowledge
Construct bias dichotomous items
- When no construct bias?
- For every item, everyone w the same skill level/level on the trait→same probability of answering the Q correctly/pos
- Equal probabilities only given the same skill level
- John lower p comp Ina to answer some Qs correctly→construct bias if John has the same skill level as Ina
- IRT to investigate
Construct bias dichotomous items
- Item functions well if only θ is a predictor of p of success
- If besides θ other factors play a role→construct bias
- Stereotype threat
- Boys & girls w same level of math skill→equal item p
- Stereotype threat neg infl performance girls difficult items
- Construct bias in math test
1. Administer to large sample from population
2. Hypothesis: suspect items (eg difficult items, βi > 1) show an effect
3. Estimate item characteristic functions for every item, separately for boys & girls
4. Examine if the item characteristic functions of boys & girls differ on the suspect items

Uniform item bias


- Bias→item more difficult, given θ, for 1 group than another
- Meaning: item has higher βi for one group comp another
- Uniform item bias
- Uniform = over the entire range of θ, 1 group finds the Q more difficult (=lower p) comp the other group
- Item characteristic function of 1 group always higher than that of the other group
Non uniform item bias
- Item simply functions worse in subgroup
- Item discriminates worse in 1 group
- Discrimination parameter differs bw groups
- Item characteristic functions not equally steep→intersect
- Non uniform item bias: for some values of θ the item offers an advantage & for other values a disadvantage
Dealing w item bias
- During testing item bias→
1. Remove item/ignore in analysis
2. Correct your estimate of skill/ability for detected bias
- Option 2 dangerous/sometimes unethical
- Same test performance→diff concl for diff ppl
- Smth wrong w items/test where construct bias detected
Predictive bias
- We want to use tests to predict smth→predictive bias possible
- Predicted value depends not only on test score but possibly also on group membership
- Ppl w same test score don’t have the same expected score on criterion
- Can also occur wo construct bias
- Predictive bias abt relationship bw test score & criterion
Eg: administering math skill test at start of the Test Theory course
- High score→good performance later in course
- High math skill not the only infl on performance
- C: important for keeping up w course material
- Ppl w same math skill but diff level of C→diff performance
Intercept bias; predictive bias
- Test measures only math skill, not C
- Suppose: men on avg less C than women
- →man & woman w same test score (&same math skill) expected score on criterion will
differ
- →intercept bias:
- To correctly predict criterion we need to add some pts for women (&subtract pts
from men)
- Ignore this diff→predictive bias

Slope bias: predictive bias


- Possible that both math skill & C necessary for good performance
- Only w high level both math skills & C → high performance during course
- Interaction effect; math skills & C
- Men less C→performance inc less when math skill inc comp women
- →mostly matters for ppl highly skilled in math whether male/female
- →diff slopes for 2 groups→slope bias
What to do if predictive bias?
- In principle can work w diff regression lines for diff groups
- Corrects for group diffs
- Often not ethical; equal score→unequal decision
- Not desirable; group membership not the real cause
- Not sex that expl the diffs but diffs on avg in C
- Better option: search for the property that expl the diffs→measure it→incl that meas in the prediction/decision

Construct & predictive bias
- Construct bias abt relationship bw item/test score & the construct measured
- Predictive bias abt relationship bw test score & criterion
- Both cases bias plausible
- Rating of some ppl too high/others too low
- →unfair & nonvalid test use
- Important to test for biases→aware of them→deal w them