Test Theory
1. O
2. S
3. O
4. O
5. S
6. S
7. O
8. O
9. O
Z scores & variability
- z = (X − mean) / SD
- Indicates how far a score lies from the mean (big/small distance)
- Standardized deviation score: differences on 0–1, 0–10, or 0–100 scales become comparable in terms of how far a score is from the mean
- Z scores have mean 0 and SD 1
Variance & SD
1. Deviation-from-mean scores: degree a score lies above/below the mean
2. Square the deviations
3. Take the mean of the squared deviations
- SD reflects variability in terms of the size of the raw deviation scores; variance reflects variability in terms of the squared deviation scores
- Larger variance & SD → greater variability in the distribution
- SD and s² can never be negative; 0 means no variability at all
- SD/s² are not "large" or "small" in themselves; they require context
- Divide by N, not N−1!!
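A minimal Python sketch of the steps above (the raw scores are made up, not from the lecture), using the N denominator as the notes stress:

```python
import math

scores = [4, 6, 7, 9, 14]                      # hypothetical raw scores
mean = sum(scores) / len(scores)

dev = [x - mean for x in scores]               # 1. deviation scores
sq_dev = [d ** 2 for d in dev]                 # 2. squared deviations
variance = sum(sq_dev) / len(scores)           # 3. mean of squared deviations (N, not N-1)
sd = math.sqrt(variance)

z = [(x - mean) / sd for x in scores]          # z scores: mean 0, SD 1
print(round(variance, 2), round(sd, 2), [round(v, 2) for v in z])
```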
Distribution
- Normal distribution
- Negatively skewed
- Positively skewed
Association between 2 variables
- Direction of association: +/−
- Magnitude of association
- Consistency → strong association
Covariance
- Degree to which 2 variables covary; association between the variability in 2 score distributions
1. Deviation scores (deviation from the mean) for each variable
2. Cross products: multiply the two deviation scores
   a. Negative cross product: scores inconsistent with each other; above the mean on V1 (positive deviation), below the mean on V2 (negative deviation)
3. Mean of the cross products
- cov_xy = Σ(dev_x × dev_y) / N
Correlation
- Direction of association between 2 variables
- Magnitude of association
- Standardized covariance
- r_xy = cov_xy / (SD_x × SD_y)
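A small sketch of the covariance steps and the correlation formula above; the x and y values are hypothetical and population (N) formulas are used throughout:

```python
import math

x = [2, 4, 6, 8]
y = [1, 3, 2, 6]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

dev_x = [a - mx for a in x]                         # 1. deviation scores
dev_y = [b - my for b in y]
cross = [dx * dy for dx, dy in zip(dev_x, dev_y)]   # 2. cross products
cov_xy = sum(cross) / n                             # 3. mean of cross products

sd_x = math.sqrt(sum(d ** 2 for d in dev_x) / n)
sd_y = math.sqrt(sum(d ** 2 for d in dev_y) / n)
r_xy = cov_xy / (sd_x * sd_y)                       # correlation
print(round(cov_xy, 2), round(r_xy, 2))
```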
Composite variables
- Variance of a composite: s²_composite = Σ s²_i + 2 Σ r_ij s_i s_j (second sum over all item pairs i < j)
- s²: item variance
- r_ij: correlation between 2 item scores
- s: item SD
Binary items
- Mean = proportion of positively valenced answers
- p = mean for a binary test item
- q = proportion of negatively valenced responses; q = 1 − p
- Variance of a binary item = p·q
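A hypothetical sketch of the composite-variance formula and of p, q and p·q for one binary item; the item SDs, correlations and responses below are assumed, not taken from the course:

```python
# Variance of a composite from item SDs and inter-item correlations (assumed numbers).
sds = [1.0, 1.2, 0.8]                            # item SDs
r = {(0, 1): .30, (0, 2): .20, (1, 2): .25}      # inter-item correlations

var_composite = sum(s ** 2 for s in sds)
for (i, j), rij in r.items():
    var_composite += 2 * rij * sds[i] * sds[j]   # 2 * r_ij * s_i * s_j per item pair

item = [1, 0, 1, 1, 0, 1]                        # binary item scores
p = sum(item) / len(item)                        # proportion of positive answers
q = 1 - p
print(round(var_composite, 3), round(p, 3), round(q, 3), round(p * q, 3))  # item variance = p*q
```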
Test score interpretation
- Raw scores
  - Compare with the average → below/above
  - Degree of difference: how far from the average, given the variability
- Z scores
  - Above/below the mean
  - Distance from the mean score
Z scores: standard scores
- z = (X − mean) / SD
Percentile ranks
- % of scores below a specific test score
- Direct:
  - Number of raw scores below the score → divide by the number of people who took the test → × 100
- Mean & SD known:
  - Compute the standard (z) score → normal table → corresponding percentile
  - Proportion of scores between the mean and z → add 0.5 → × 100%
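A sketch of both percentile-rank routes above, with made-up norm data; the normal CDF stands in for the printed z table:

```python
import math

def normal_cdf(z):                     # area below z in the standard normal distribution
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

scores = [10, 12, 13, 15, 15, 18, 20, 22, 25, 30]   # hypothetical norm data
x = 20

# Direct: count of scores below x / number of test takers * 100
direct = sum(s < x for s in scores) / len(scores) * 100

# Via mean & SD: z score -> area between mean and z, + 0.5, * 100
mean = sum(scores) / len(scores)
sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
z = (x - mean) / sd
via_z = normal_cdf(z) * 100            # equals (area between mean and z + 0.5) * 100 for z > 0
print(round(direct, 1), round(z, 2), round(via_z, 1))
```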
Normalized scores
- normalization/area transformations
1. Percentile ranks from raw test scores; each raw score→% rank
2. % ranks→standard (z) scores
3. Z score → converted standard score on the desired metric
Test norms
- Reference sample
-
Ch 2
Property of identity
- Identical with respect to the feature reflected by the category
- Categories mutually exclusive & exhaustive: each person falls into only 1 category
- Tells only whether people are similar/different
- Numbers have no numerical significance
Property of order
- Relative amount of the attribute people possess
- Numerals with the property of order → rank order of people relative to each other
- Numerals indicate order → function as labels for rank positions
- Doesn't tell about the actual degree of differences
Property of quantity
- Magnitude of differences between people
- Numerals reflect real numbers
- Quantitative values: continuous
The number 0
- Absolute 0: ratio
- Arbitrary 0: interval
Units of measurement
- Arbitrary
- Unit size
- A unit of measurement is not tied to any 1 type of object
- Some units take a physical form; some units of measurement can't be used to measure different features of objects
- UoM: standard measures
Additivity & counting
Additivity
- A 1-unit increase at one point of the measurement scale means the same as a 1-unit increase at any other point
- If the measurement unit is not constant in terms of the underlying attribute it is intended to reflect → additivity does not hold and should not be assumed
Counts
- Counting = measurement when one counts to reflect the amount of a feature/attribute of an object
4 meas scales
Nominal scales
- numerals= property of identity
- Categories
- Identify groups sharing common attribute
Ordinal
- Numerals have the property of order
Interval
- Arbitrary zero
- Property of quant
- Measurement unit constant in size & additive; the scale allows no multiplicative interpretations
- 80 °C is not twice as hot as 40 °C
Ratio
- Absolute zero
- Property of quant
- Additivity & multiplicative interpretations: 80 km is 2 × 40 km
L2
Psych testing
- Cronbach: a test is a systematic procedure for comparing the behavior of 2+ people
- 3 properties
  - Observable: aimed at measuring behavior
  - Objective: systematic
  - Comparative: comparison of different people
Test for max vs typical performance
- Max performance: skills/aptitude
- Typical performance: personality traits, attitudes, disorders
- Big diffs in approach of test devel
- Few diffs in statistical analysis of tests scores
Max performance tests
- Power
- Skill wo time pressure
- Most skilled→more correct A
- Speed
- Performance under time pressure
- More skilled → more answers within the time limit
- Question difficulty trivial (the items themselves are easy)
Norm referenced / criterion referenced: what to do with the test score
- Norm referenced
- Comp ppl w rest of population
- Good norm data on population of great importance
- Eg IQ SAT
- Criterion referenced
- Comp ppl w absolute standard
- Test inference not tied to performance level in population
- Eg exam Test Theory criterion referenced
Psych tests contain
- Test material
- Test forms
- Test manual
- Precise test instructions
- Score processing procedure
- Norm tables
- Discussion of scientific qual
Properties of test scores
- Test score generally sum of item scores
- Most important outcome of test used
- Test manual instructions on how to interpret score
- With norm-referenced tests, a norm table needs to be consulted
- Eg 30% of boys age 3 have a score lower than 3 (30th percentile)
Measurement level tests score
- Test score a nr
- Interpretation of nr depends on level of measurement of test score
- Nominal: eg personality types
- Ordinal: e.g. short Likert scales
- Interval: eg long likert scale
- Ratio: eg bourdon dot test
Test scores w interval meas level
- Scores only interval(or ratio) level of measurement if they are quant
- A 1-point increase on the scale → reflects the same specific increase in the property you're measuring
- Person A B C w introversion scores 10 20 30
- Score diff bw A & B and B & C equal size
- Not obvious that diffs in introversion comparable
- Test scores usually sum of item scores
- Item scores evidently ordinal
- Test scores therefore formally ordinal
- For practical/statistical purposes we often act as if test scores are at the interval level of measurement; only justifiable for long tests with a wide range of scores
Variation
- Test score intended to reveal diffs bw ppl
- Only possible if ppl differ in test scores
- High degree of variation in test scores desirable
- Bc test score constructed out of item scores
- High variance on item scores also desirable
- High covariance bw item scores desirable
Variation test scores
- Eg test score x constructed out of item scores x1 & x2
- X=X1+X2
- What influences the test score variance s²_X?
- s²_X = s²_X1 + s²_X2 + 2·cov(X1, X2)
- p values around .5 → most variance: 50% right, 50% wrong
- q = 1 − p
- Frequency of use of each answer option → insight into how items function
- a value = proportion of people that choose a specific wrong alternative
- Distractor: wrong answer option in a multiple-choice item
- q = a1 + a2 + a3
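A sketch of these quantities for hypothetical data: p and the distractor a-values for one multiple-choice item, and the variance of a two-item sum score:

```python
# All answers and item scores below are made up; answer key assumed to be "B".
answers = list("ABBCABDBBA")                       # chosen options for one MC item
n = len(answers)
p = answers.count("B") / n                         # proportion correct
a = {opt: answers.count(opt) / n for opt in "ACD"} # a-value per distractor
q = sum(a.values())                                # q = a1 + a2 + a3 = 1 - p

x1 = [1, 0, 1, 1, 0]                               # dichotomous item scores
x2 = [1, 1, 1, 0, 0]
m1, m2 = sum(x1) / 5, sum(x2) / 5
var1 = sum((v - m1) ** 2 for v in x1) / 5
var2 = sum((v - m2) ** 2 for v in x2) / 5
cov12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2)) / 5
var_x = var1 + var2 + 2 * cov12                    # s2_X = s2_1 + s2_2 + 2*cov(1,2)
print(round(p, 2), {k: round(v, 2) for k, v in a.items()}, round(q, 2), round(var_x, 2))
```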
Preliminary study of multiple choice items
- Because people who don't know the answer can still guess
- p value expected to be higher than every a value
- Ideally all wrong answer options (distractors) are chosen equally often
  - a1 ≈ a2 ≈ a3
- Ideally high item score variance, where
  - p ≈ q
Polytomous items
- Lots of variance ideal
- Each A options chosen ~equally/with lots of variance
- What is ideal differs between dichotomous and polytomous items
Test objectivity
- Multiple choice easy to score objectively; less so for:
- Open ended
- Behavioral observations
- Projective tests
Interrater reliability
- Different assessors/test givers need to agree
Kappa
- Agreement between raters corrected for chance agreement
Lecture 4
CI
- Standard error: SE = SD / √N
- CI: X̄ ± 1.96 × SE
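A quick sketch of these two bullets with assumed sample values:

```python
import math

mean, sd, n = 100.0, 15.0, 36          # hypothetical sample mean, SD, sample size
se = sd / math.sqrt(n)                 # standard error = SD / sqrt(N)
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(round(se, 2), [round(b, 1) for b in ci])   # SE = 2.5, CI ~ (95.1, 104.9)
```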
Reliability
- Degree to which test scores remain the same when the test is administered at least twice, under equal conditions, to the same person
Administration
1. Same respondents twice
a. Useful: blood pressure, height, abilities/skills; conditions more or less equal
b. Not useful: behavioral tests, performance level tests; interference caused by
memory/learning process
2. Equal cond:
a. Test cond: item, instruction, space, time
b. Relev psych properties of respondents
c. Physiological cond of respondent
Replicability
- Test score= systematic part+random infl
- Systematic part: the part that doesn't vary across replications; average performance on the test if the test were taken infinitely many times
- Random influences: do vary between independent replications
- Size of the random influences → determines the reliability
Formulas
- X=T+E
- Book notation;
- Xo=Xt+Xe
- Xo = X observed
- Xt true score
- Systematic part separated from random part of observation
- Eij=Xij-Ti (tautology)
- True score of a person is their average test score across a large number of replications
- T_i = average of X_ij over the replications j
Properties of reliable scores & meas error in population for single test administration
- Assumption 1: mean error score = 0
  - Consequence: mean observed score = mean true score
- Assumption 2: r_EY = 0 (error uncorrelated with any other variable Y)
  - Consequence: r_ET = 0
- Consequence: s²_X = s²_T + s²_E
- Definition of reliability:
  - R_XX = s²_T / s²_X
- Negative reliability theoretically impossible
- Reliability: ratio of true score variance to the total variance of the observed scores
- Perfect reliability: observed score = true score
SE of measurement of a test score
- SD of the error score
- SE_m = SD_X × √(1 − R_XX)
- Reliability: ratio of true score variance to total variance of the observed scores
SE of measurement of test scores
- CI for an individual's true score T
- X_i ± 1.96 × SE_m
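A sketch of the SE of measurement and the resulting CI for a true score, with assumed SD, reliability and observed score:

```python
import math

sd_x, rxx, x_i = 15.0, 0.91, 110       # hypothetical test SD, reliability, observed score
se_m = sd_x * math.sqrt(1 - rxx)       # SE_m = SD_X * sqrt(1 - R_xx)
ci = (x_i - 1.96 * se_m, x_i + 1.96 * se_m)
print(round(se_m, 2), [round(b, 1) for b in ci])   # SE_m = 4.5, CI ~ (101.2, 118.8)
```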
Lecture 3
Transformed scores & norms
- Raw score: test score not meaningful on its own
- Translated→transformed score
- Correct exam answers→exam grade
- Norms: reference frame for eval of raw scores, based on properties of distribution of raw
scores in population
1. Comparison w absolute standard (criterion referenced)
a. Mathematics testimonium: absolute standard min 18 correct
b.
Linear transformations
Y=a+bX
- For deviation scores: a = −X̄, b = 1
i. Linear z scores
- z = (X − X̄) / S_X
- Linear transformation
- Shape of the frequency distribution maintained
- Correlation with other test scores maintained
ii. Normalized z scores
- z calculated via percentile ranks using tables of the standard normal distribution
- Non linear transformation
- Shape of freq distribution not maintained
- Correlation w other test scores needs to be calculated again
- More convenient, only defensible if property normally distributed
Transformation standard scores
- Change X score → z score → score on another standard metric
- New score = z × S_new + mean_new
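A sketch of the transformation chain raw score → z → another standard metric; the IQ (mean 100, SD 15) and T-score (mean 50, SD 10) metrics are standard examples, while the raw-score values are assumed:

```python
raw, mean_x, sd_x = 26, 20.0, 4.0      # hypothetical raw score and norm-group mean/SD
z = (raw - mean_x) / sd_x              # z score
iq = z * 15 + 100                      # IQ metric
t = z * 10 + 50                        # T-score metric
print(z, iq, t)                        # 1.5, 122.5, 65.0
```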
Stanines: standard nines, intervals numbered 1–9
- Stanines 2–8: each ½ SD wide
- Stanine 5: the middle interval, centered on the mean
Lecture 5
- Reliability: degree to which test scores remain the same when the test is administered 2+ times to a person under the same conditions
SE of meas
Determining reliability
- T unknown → reliability can't be determined directly
- Estimation methods needed to assess reliability
Parallel tests
- 2 tests interchangeable, not identical → administering (nearly) the same test twice
- Parallel tests assume:
  - Same true score!
  - Same measurement error: same variance of the error scores
- Problems
- Hard to construct parallel tests
- Not possible to empirically test that true scores Ti1=Ti2
Tests parallel?
- Conditions: (a) equal means, (b) equal SDs, (c) the tests really measure the same true score
- a & b easy to satisfy: convert raw test scores to z scores
- c most important: are the tests really parallel?
Split-half method
1. Split the test into two halves and treat the halves as parallel tests
2. Correlate the two half scores: r_hh, e.g. r_x1x2 = .843
3. Correct with the Spearman–Brown formula
   a. R_XX = 2·r_hh / (1 + r_hh)
- Assumption: the halves (items) are parallel → otherwise the split-half reliability is too low
- Issue in practice: different ways of splitting the test in 2 give different reliability estimates → the halves are not parallel for every split
Linear combinations
- Eg X=X1+X2+X3
- Avgs = .40, .50, .60; variances: .24, .25, .24, covariances .10
- Average: X̄ = .40 + .50 + .60 = 1.50 (mean of a sum = sum of the item means)
- Variance: s²_X = sum of all elements of the variance–covariance matrix = .24 + .25 + .24 + 6 × .10 = 1.33
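A quick check of the worked example above in Python: summing all elements of the variance–covariance matrix reproduces 1.33:

```python
cov = [[.24, .10, .10],
       [.10, .25, .10],
       [.10, .10, .24]]                # item variances on the diagonal, covariances off it
var_x = sum(sum(row) for row in cov)   # variance of X = X1 + X2 + X3
mean_x = .40 + .50 + .60               # mean of a sum = sum of the item means
print(round(var_x, 2), mean_x)         # 1.33, 1.5
```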
Raw Cronbach's alpha
- α = k²·c̄_ij / s²_X (equivalently: α = k/(k−1) × (1 − Σ s²_i / s²_X))
- k: number of items
- c̄_ij: average covariance between 2 different items
- In the population: α ≤ R_XX
- Underestimates reliability: the value obtained is lower than the actual test reliability
- Sample statistic: its value varies from sample to sample
- Cronbach's alpha differs between sample and population, just like a sample mean differs from the population mean
- READ MORE ABT IN THE BOOK
- Sample size is not an issue for reliability itself
- Cronbach's alpha is an underestimation of reliability
  - Conclusion: the test seems less reliable than it actually is
- Range 0–1; theoretically the sample value can be negative
- Alpha = average split-half reliability over all possible ways of splitting
- Generalizes the split-half method (no arbitrary splitting)
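A minimal sketch of raw alpha computed from an item-score matrix; the data are made up and population variances (divide by N) are used, consistent with the rest of these notes:

```python
def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

items = [                               # rows = persons, columns = k items (hypothetical)
    [1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0],
]
k = len(items[0])
cols = list(zip(*items))                # item-score columns
total = [sum(row) for row in items]     # sum scores

# raw alpha = k/(k-1) * (1 - sum of item variances / variance of the sum score)
alpha = k / (k - 1) * (1 - sum(var(c) for c in cols) / var(total))
print(round(alpha, 3))
```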
Standardized Cronbach's alpha
- Raw alpha is about the consistency of the raw item scores
- Preference for working with standardized item scores when
  - Items differ in variance but not in relevance
  - Standardization lets every item contribute equally to the measure
- Raw alpha computed for standardized items = standardized alpha
- Raw alpha is based on the covariances between item scores
- Standardized alpha is based on the correlations between item scores
- Standardized α = k·r̄_ii' / (1 + (k − 1)·r̄_ii')
- r̄_ii': average correlation between item scores
- Item scores with comparable variances → both alphas virtually the same
- Using standardized item scores → use standardized Cronbach's alpha
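A one-line sketch of standardized alpha from an assumed number of items and average inter-item correlation:

```python
k, r_bar = 10, 0.25                    # hypothetical number of items and mean inter-item r
alpha_std = k * r_bar / (1 + (k - 1) * r_bar)
print(round(alpha_std, 2))             # 0.77
```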
Kuder–Richardson 20 (KR20)
- KR20: reliability coefficient for dichotomous data (0 & 1)
- KR20 = k/(k−1) × (1 − Σ p_i·q_i / s²_X)
- Value of KR20 = raw alpha (for dichotomous items)
- Only reason to work with KR20: the calculations are easier
IC methods & dimensionality of test
- IC methods assume the items measure exactly the same psychological property: unidimensionality
- Test unidimensional → alpha gives a good indication of reliability
- Test multidimensional → alpha underestimates reliability
- High coefficients dont prove test is unidimensional
- Dimensionality tested better w factor analysis
Meas accuracy
- Notation: SE_m = S_X × √(1 − R_XX)
- Standard method: 95% CI = X_i ± 1.96 × SE_m
Comparison of test scores
- SE equal for everyone (assumption of classical test theory)
- CIs don't overlap → the test scores differ significantly
- Overlapping intervals → the test is not sufficiently reliable to differentiate between these scores
- Observed scores differ
L6
Influences on reliability
- Qual of items (internal consistency)
- Dimensionality
- Nr of items in the test
- Heterogeneity in population
- Type of score investigated
- Score on single test (standard)
- Difference score: diff bw 2 test scores
Qual of items
- Standardized alpha higher if the inter-item correlations are higher
  - Correlations preferably as high as possible
- Raw alpha higher if the inter-item covariances are higher
  - Covariances become higher as the correlations grow larger
- → more variance in the true scores (s²_T)
- Impact of measurement error (s²_E) remains the same → reliability increases
- → correlations between item scores as high as possible
Dimensionality test
- Pos manifold: all correlation pos
- Reliability higher for higher avg item score correlation rii’
- Test unidimensional→pos correlation bw all test items
- Test multidimensional→measures multiple constructs
- Items measure 1 construct→don’t correlate w items measuring other constructs
- Rii’ lower→Rxx lower
- IC method assumes unidimensionality
- IC methods strongly underestimate R_XX for multidimensional tests
- Solution: a method for multidimensional tests
  - Stratified alpha
  - Calculates raw alpha for the subtests
  - Combines these into 1 reliability coefficient
Stratified alpha
- Combines the subtest alphas & subtest variances into 1 reliability coefficient
- So:
  - α_strat (= .75) > raw α (= .50) calculated for the test as a whole
  - For a multidimensional test, α_strat → better indication of the true reliability
- Adding variances in variance covariance matrix→SE
Reliability & test length
- Increasing test length: general Spearman–Brown formula
- R_XX-revised = n·R_XX-original / (1 + (n − 1)·R_XX-original)
- E.g. old test 20 items, R_XX-original = .80, new test 40 items: n = 40/20 = 2 (increase by factor 2)
  - R_XX-revised = 2 × .80 / (1 + 1 × .80) = .89
- Assumption: the added/removed items are parallel
- The Spearman–Brown formula is also used when shortening a test
  - 0 < n < 1
- E.g. old test 20 items, R_XX-original = .80, new test 10 items: n = 10/20 = .5 (shortening by 50%)
  - R_XX-revised = .5 × .80 / (1 + (.5 − 1) × .80) = .67
- Low reliability of .40 with 20 items → even 200 items (n = 10) is still not .90 reliable
- Increasing test length especially useful if the original test is not too unreliable & contains only a few items
Determining the desired number of test items
- Rewritten version of the Spearman–Brown formula for the desired test length
- Now we determine the factor n by which the test needs to be lengthened, given R_XX-original & R_XX-revised
- n = [R_XX-revised × (1 − R_XX-original)] / [R_XX-original × (1 − R_XX-revised)]
- Lengthening factor n
  - → multiply by the number of test items
  - → new number of items needed to obtain the desired reliability
  - Never round down
- R_XX-revised = desired minimum reliability of the test
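A sketch of both Spearman–Brown uses above, reproducing the 20→40, 20→10 and desired-reliability examples; the helper names are my own:

```python
import math

def sb_reliability(r_orig, n):
    """Reliability after lengthening/shortening by factor n."""
    return n * r_orig / (1 + (n - 1) * r_orig)

def sb_factor(r_orig, r_desired):
    """Lengthening factor n needed to reach a desired reliability."""
    return (r_desired * (1 - r_orig)) / (r_orig * (1 - r_desired))

print(round(sb_reliability(.80, 2), 3))    # 20 -> 40 items: .889
print(round(sb_reliability(.80, .5), 3))   # 20 -> 10 items: .667
n = sb_factor(.80, .90)                    # factor needed to go from .80 to .90
print(round(n, 2), math.ceil(n * 20))      # 2.25 -> 45 items (never round down)
```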
Heterogeneity in the population
- Reliability:
- R_XX = s²_T / s²_X
- With equal measurement error variance s²_E, reliability fully depends on the variance in the true scores s²_T
- Low variation in true scores (homogeneous group) → low reliability
- High reliability is easier to achieve in heterogeneous groups (= more variation in true scores)
- Reliability is not purely a property of the test
  - Variation in true scores decreases → reliability decreases
  - An IQ test administered to all 21-year-olds → more reliable than when administered only to 21-year-olds at university
- Reliability = property of the test in a specific population
Reliability for diff scores
- Until now we talked about the reliability of single test scores
- In some situations the difference between test scores is the focal point
  - Pre (Y) & post (X) treatment measurement of depression
  - Progress in school across years
- Difference between 2 test scores: D = X − Y
- Is the difference score D reliable?
- Difference scores are often less reliable compared to regular test scores
  - Tests X & Y both have measurement error → it accumulates in the difference score
  - X & Y often strongly correlated → difference score D has little variance across people
- Measuring change (e.g. pre & post depression measurement): if the pre & post rank orders are identical
  - Perfect correlation → all patients get (nearly) the same difference score → who improved most, and thus which treatment is most effective, becomes indistinguishable
- Reliability of difference scores is mainly important for determining individual differences in change
- R_D = (s²_X·R_XX + s²_Y·R_YY − 2·r_XY·s_X·s_Y) / (s²_X + s²_Y − 2·r_XY·s_X·s_Y)
  - Reliability multiplied by variance: s²_X·R_XX
  - 2·r_XY·s_X·s_Y: correlation between X and Y multiplied by SD of X & SD of Y
- Difference-score reliability R_D is high if
  - Test scores X & Y have high reliability
  - Low correlation between X & Y
- Pre & post measurements should ideally not be highly correlated
  - Often not realistic in practice → low R_D
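A sketch of the difference-score reliability formula with assumed pre/post SDs, reliabilities and correlation, illustrating how a high pre–post correlation drags R_D down:

```python
sx, sy = 10.0, 10.0                     # hypothetical SDs of post (X) and pre (Y)
rxx, ryy = .85, .85                     # hypothetical reliabilities
rxy = .70                               # hypothetical pre-post correlation

num = sx**2 * rxx + sy**2 * ryy - 2 * rxy * sx * sy
den = sx**2 + sy**2 - 2 * rxy * sx * sy
print(round(num / den, 2))              # R_D = 0.5, much lower than .85
```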
L 7 improving reliability & consequences
Developing a good test
- Test ideally has a large number of items
- Measures 1 (hypothetical) construct
- Increasing test length → good idea (Spearman–Brown)
- Items differ in quality → remove weak items
- How to determine which items to keep/remove?
Item selection
- Selection based on indicator of item discrimination
- Diff indicators, depend on use of test
- Item-rest score correlation
- Dichotomized item rest score correlation
- Item discrimination index D
- Item discrimination parameter in an item response model
Item-rest score correlation (standard for power tests)
- Rest score
  - Rest score = sum score − item score
  - Rest score for item 1: sum of all items except item 1 → the rest score is calculated separately (and may differ) for each item
  - h: item index; h ≠ i; the rest score for item i sums the other items h
- Item-rest correlation = "corrected item-total correlation" in SPSS
- Selection rule
  - Measure of item discrimination: the extent to which the item distinguishes between people with high and low rest scores
  - The correlation we would really want is the one with the true score (unobserved); the item-rest score correlation should be similar to it
  - Usually .4 is too strict to use; .3 is used for the selection rule
  - True score & rest score should largely match; poor items show a lower correlation
- Procedure: eliminate the worst item → recalculate the rest scores → look at the items based on their item-rest correlations → remove the next worst → recalculate, until there are no more candidates for removal → in the end all item-rest correlations > .3 → no more removals; items are not removed in one go
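A small sketch of the item-rest correlations for a hypothetical 4-item, 6-person data set; in practice the weakest item would be removed and everything recalculated, one item at a time:

```python
import math

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

data = [[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 0],   # rows = persons, columns = items
        [1, 1, 0, 1], [0, 0, 0, 1], [1, 1, 1, 1]]
k = len(data[0])
for i in range(k):
    item = [row[i] for row in data]
    rest = [sum(row) - row[i] for row in data]      # rest score = sum score - item score
    print(f"item {i + 1}: r_item-rest = {corr(item, rest):.2f}")
```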
Groninger height test
SPSS
- Inter item correlation; some item correlations not pos/less than .3
- Item total statistics; cronbachs alpha if item deleted
- Estimated Cronbach's alpha → if an item lowers it → removing that item increases alpha
- SPSS reliability analysis; item, scale, scale if item del, covariance
- Maximum possible correlation between X and another variable Y: r_XY ≤ √(R_XX · R_YY)
- Lower R_XX = lower maximum correlation between X & any other variable
Attenuation because X is unreliable
Attenuation of correlations
- X & Y measured with not perfectly reliable tests
  - E.g. intelligence test X & performance review Y
- Measurement error in both X & Y attenuates the correlation
- r_XY = r_TxTy × √(R_XX · R_YY)
- The correlation between the 2 constructs (r_TxTy) is higher than the correlation between the test scores measuring those constructs (r_XY)
Correcting for attenuation in X
- r_TxY = r_XY / √R_XX
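A sketch of the attenuation relations above with assumed values for r_XY, R_XX and R_YY:

```python
import math

r_xy, rxx, ryy = .30, .70, .60                        # hypothetical observed r and reliabilities
corrected_x = r_xy / math.sqrt(rxx)                   # correction for unreliability in X only
corrected_xy = r_xy / math.sqrt(rxx * ryy)            # correction for X and Y together
print(round(corrected_x, 2), round(corrected_xy, 2))  # 0.36, 0.46
```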
L8 construct validity
2 concepts of validity
- Most important general definition
  - Validity = the degree to which a test serves its purpose
  - Validity depends on the purpose
- It is a particular use of a test that is valid or not, rather than the test itself
- A test is called valid when it is concluded that it performs its purpose sufficiently well
Validity 2 types
- Construct validity: what extent H construct responsible for test score (psych meaning)
- Criterion (predictive) validity: how well the test predicts behavior/performance outside the test situation (criterion in the present, past, or future)
- Construct validity:
- What exactly does the test measure
- Focal point wn scientific research
- Criterion validity
- Can the test be used to predict something else
- Focal point for practical use test
- Construct validity / criterion validity rel
Relationship bw construct & criterion validity
- Wo construct validity no criterion validity
- Only reason that tests predict smth is bc it measures smth relev
- Wo criterion validity no construct validity
- If test measures smth relev it also has to be able to predict smth
- Some psychologists criterion validity seen as one aspect of construct validity
- Separating the 2 types of validity more convenient
-
- Bullseye construct validity
5 aspects of construct validity
- Pentagon of construct validity indicates what we need to pay attention to when
determining construct validity of a test
- Content of a test
- Association of test components
- Response processes
- Test use consequences
- Association w other constructs
1 content of the test
- Abt content validity
- Item content should rel to constructs you want to measure
- Content of items should not rel to any other constructs
- Item sets together need to sufficiently cover the constructs
- All important aspects of the constructs needs to be covered sufficiently
- Balance needs to be in order
- The different aspects of the construct studied should be known; e.g. for extraversion the different aspects should be known: not only about small talk/partying
1 content validity v face validity
- Face validity rel to content validity
- Face validity ~ content validity as assessed by laymen
- Not important for the psychometric quality of the test (because a layman is not an expert)
- Sometimes important for practical use
- Results of test w low face validity accepted less often in practice
- Good content validity often leads to face validity, not the other way around
2. Association of test components
- All items measure same property→pos manifold expected
- Pos correlation bw all items
- Important for both reliability of test & construct validity
- We want a unidimensional test
- A multidimensional test is also possible, if this is in line with the theory (point 1)
- Dimensionality of a test is examined using factor analysis
- Not for exam!!
- Examining dimensionality of test crucial for validating the test
- Unexpected multidimensionality could be at the expense of fairness
- Internal consistency coefficients eg cronbach's alpha dont give an indication of nr of
dimensions
- Examining association bw items of extra importance for multidimensional tests
- Multidimensional tests often work with related subconstructs
- Rel to eg diff aspects of intelligence (RAKIT)
- Does item measure its intended sub constructs
- Do items measure other sub constructs beyond the intended one
- Key pt: do individ items measure what they need to measure?
3 Response processes for max performance tests
- Items formulated to elicit certain response processes
- Max performance test
1. Max effort to solve problem
2. often certain intended path toward the solution
3. Following the right path→correct answer
- If assumptions dont hold→expense of validity of instrument
Max effort to solve problem
- Assumption that max performance test everybody puts in max effort
- In that case performance hopefully→good indication of what you’re capable of
- Not every1 puts in max effort→expense of validity of measurement
- Some ppl score low bc not that skilled
- Some people score low because they have low motivation
- If possible, examine this partly with process data, e.g. response times
1 path to solution
- Performances comparable only if ppl try to do the same thing (=go through the same
response process)
- 17 x 99 = ?
1. Respondent A: applies the multiplication rules
2. Resp B 17 x 100 - 17 = ?
3. Resp C memorized all multiplication tables & knows the answer
- Do we measure exactly same skill for everyone
Multiple solutions possible?
- Items aim to measure the same construct as the rest of the test
- More skilled respondents should always have a higher chance to solve the item correctly
- Problematic if there are multiple possible solutions (or if the wrong option is counted as the correct one)
- Detect using item analysis
  - Discrimination index D (LECTURE 7)
- Item response theory analyses LECTURE 11 & 12
Response processes for typical performance tests
- Typical performance items meant to gain insight into someones true attitude/personality
- In practice many response styles can threaten validity
1. Social desirability
2. Acquiescence
3. Extreme v mild answer tendency
- Distorts the measure of intended property
Social desirability
- For typical performance items there are no right/wrong answers
- In practice some answers are more desirable
- Pos image of yourself toward others
- Pos image of yourself toward yourself
- Respondents can take this into account in answering behavior
- Social desirability tests exist, but correction difficult
- Anti social response style (provoking) also possible
Acquiescence
- Some people have a tendency to agree with statements rather than disagree
  - Social desirability (otherwise you are disagreeing with the researcher)
  - Cognitive biases
- Causes the measurement to be distorted
- If only indicative items used→overestimation
- Solution: balance indicative & contra indicative items
Extreme v mild answering
- Some people are quick to claim an extreme position
- For Likert questions they often pick the most extreme options
- Leads to overestimation of the extremeness of their position/personality
- Counterpart: mild response style
- Ppl choose neutral option independent of content
- Solution: advanced statistical methods
Infl test use on validity
- Validity. Degree to which tests serves its purpose
- Validity not separate from how test used
- improper /unfair use of test→not valid
- Debatable whether this should fall under construct validity
- Improper use could be the users fault (improper use)
- Problem is in this case not measurement itself but what is done w it
Association w other constructs
- Construct validity abt the Q to what extent the test measures the construct of interest
- From psychological theory we know how this construct relates to other constructs
- Associations between constructs & their corresponding tests are captured in a nomological network
- Then you examine if the scores on the test are indeed correlated with these other constructs
- Empirical validation research
Restriction of range for the predictor
- The admitted group (X > 1) is more homogeneous than the group as a whole → lower estimated correlation
Restriction of range for the criterion
- The same problem can also play a role for the criterion
- Not everyone that the test was administered to is available later on to measure the criterion
- If attrition depends on the criterion → distorted image
- Often occurs with selection tests: poorly performing individuals don't survive until the moment the criterion is measured
- Here too, the remaining group is more homogeneous than the group as a whole → lower correlation
-
- The remaining group (Y > 1.5) is even more homogeneous than the group that was selected (X > 1) → estimated correlation practically 0
Nonlinear association test score & criterion
- The strength & direction of the association between test score & criterion now depend on the test score
- X < 0: r = .68; X > 0: r = −.67
Heteroscedasticity
- The strength of the association between test score & criterion now depends on the test score
- X < 0: r = .78; X ≥ 0: r = .44
- Low motivation → guaranteed not to pass; high motivation → other factors explain whether someone passes
- The usefulness of the test depends on the test score
Predictive validity often low
- Even if these 4 problems don’t play a role predictive validity often low
- Multiple possible reasons
1. Measurement of criterion Y unreliable
2. Measurement of criterion not valid
Reliability & max predictive validity
- Predictive validity measured by r_XY
- We know from lecture 7:
- r_XY = r_TxTy × √(R_XX · R_YY)
- r_XY remains low if R_XX or R_YY is low, even with a high r_TxTy
- Reliability of the measurement of criterion very important but often overlooked
Validity of the criterion measurement
- Idea: the test predicts the actual criterion
- If we measure this criterion incorrectly → influences r_XY
- r_XY can be lower than the real association between test score & actual criterion
- Intelligence test possibly good predictor of actual performance
- If performance assessment produces a non valid measurement of actual
performance the correlation rxy will fall short
Criterion validity for dichotomous decisions
- Tests often used for making dichotomous decisions
- accept/reject
- treat/dont treat
- Treatment A/B
- Most important: classify as accurately as possible
- Not the same as high linear association rxy
- Test score X approximately continuous
- Dichotomous decision based on cutoff score Xcrit
- X<Xcrit→reject (0); X>Xcrit→accept (1)
- The continuous criterion must also be dichotomized
  - Y < Y_crit → doesn't satisfy the criterion; Y > Y_crit → does
- Place everyone in a 2×2 frequency table (correct/incorrect decisions)
- B & C are correct decisions: the right people accepted & rejected
- A & D are wrong decisions: the wrong people rejected & accepted
Eg criterion validity & decisions
- Students without a math background need to pass the mathematics testimonium before the start of the study
- Idea: without this knowledge the chance of successfully completing the study is low
- Study success (measured based on credits obtained in year 1; Y) is not known at the start of the study
- Test score on the testimonium (X) therefore stands in for the not-yet-observed criterion Y
Test use for dichotomous decisions
- A: positive misses / false negatives: 100 students unjustly rejected
- B: positive hits / true positives: 196 students justly accepted
- C: negative hits / true negatives: 84 students justly rejected
- D: negative misses / false positives: 20 students unjustly accepted
- Selection rate: proportion of accepted students
  - Selection rate = (B + D) / N = 216/400 = .54
- Base rate: what you would get by chance: proportion of students with > 30 credits
  - Base rate = (A + B) / N = 296/400 = .74
- Base rate goes down if inc criterion
- Low base rate→filtering out ppl not passing criterion
- Selection rate: who passes the test
- Base rate: who passes the criterion
- Sensitivity / success rate: proportion of accepted students who were justly accepted
  - Success rate = B / (B + D) = 196/216 ≈ .91
- Specificity: proportion of rejected students who were justly rejected
  - Specificity = C / (A + C) = 84/184 ≈ .46
- High success & specificity rate ideal
- Tradeoff; high specificity→lower success; more ppl admitted unjustly
- High success→low specificity; more ppl rejected unjustly
- Validity: correlation between the dichotomized test & criterion score (= phi)
- phi = (B·C − A·D) / √[(A + B)(C + D)(A + C)(B + D)]
- Same calculation as for determining the correlation between an item score & rest score
- Phi correlation coefficient
- Success rate / sensitivity depends on
  - 1. Validity (phi)
    - If phi is larger → B & C larger, A & D smaller → B/(B + D) larger
  - 2. Selection rate
    - Rejecting more people → B becomes larger relative to B + D
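A sketch computing the rates and phi for the testimonium example above (A = 100, B = 196, C = 84, D = 20); the phi formula follows the cell labels used in these notes:

```python
import math

A, B, C, D = 100, 196, 84, 20           # pos misses, pos hits, neg hits, neg misses
N = A + B + C + D
selection_rate = (B + D) / N            # proportion accepted
base_rate = (A + B) / N                 # proportion satisfying the criterion
success_rate = B / (B + D)              # accepted students who were justly accepted
specificity = C / (A + C)               # rejected students who were justly rejected
phi = (B * C - A * D) / math.sqrt((A + B) * (C + D) * (A + C) * (B + D))
print(round(selection_rate, 2), round(base_rate, 2),
      round(success_rate, 2), round(specificity, 2), round(phi, 2))
```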
Test use for dichotomous decisions
1. Problem: many of hired candidates unqualified
a. Cause: low validity test
b. Low base rate
c. High selection rate (many ppl need to be admitted→most ppl admitted,
unqualified hard to filter out)
2. Problem: optimal balance pos/neg misses
- Depends on situation
- Neg misses D: how bad is hiring unqualified person; non sick person treated
- Pos misses A: how bad not hiring qualified person; sick person not treated
- Stricter selection→ D smaller but A larger
- More lenient selection→ A smaller but D larger
3. Relationship between success rate, validity, base rate & selection rate: Taylor–Russell tables
   a. Base rate = .60; validity = .20 (low); selection rate = .10 (strict) → success rate = .73; why is the success rate high for a test with low validity?
b. Selection rate inc→success rate close to base rate
c. validity=0→success rate=base rate
d. Very large/very small base rate→selection virtually pointless
-
Easy item vs difficult item
- Straight (flat) line: poorly discriminating item
  - Mastery level → doesn't change the chances of getting the item right
- S-shaped curve → well discriminating item
- Decreasing curve → e.g. using the wrong answer key; mistake in coding the data
Item characteristic functions & IRT models
- How can you determine what the ICF is?
- Use an IRT model
- The model makes assumptions about the shape of the ICF
- Based on the data you can estimate the ICF for every item
- Done by statistical software (not SPSS)
- We need to choose an IRT model
e & exponents
- The number e: base of the natural logarithm (≈ 2.718)
- Exponentiation: e^x
Logistic regression for item responses
- Simple logistic regression is about predicting a dichotomous outcome Y from a predictor X
- P(Y = 1 | X) = e^(b0 + b1·X) / (1 + e^(b0 + b1·X))
- Y & X can be anything, so we can also fill in
  - Y = X_i (the item score)
  - X = θ (the latent trait / ability level)
- IRT models*: logistic regression for item scores
- Simplest model = Rasch model (set b1 to 1 and rewrite b0 = −β_i)
- P(X_i = 1 | θ) = e^(θ − β_i) / (1 + e^(θ − β_i))
Rasch model = one parameter logistic model
-
Rasch model
-
- Every item own ICF & own curve; each item diff probability of getting it
right
-
- Average respondent: θ = 0 ability level; 0 SDs from the mean
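A minimal sketch of the Rasch item characteristic function for two items with assumed difficulties:

```python
import math

def rasch_p(theta, beta):
    """P(X_i = 1 | theta) = exp(theta - beta) / (1 + exp(theta - beta))."""
    return math.exp(theta - beta) / (1 + math.exp(theta - beta))

beta_easy, beta_hard = -1.0, 1.5        # hypothetical item difficulties
for theta in (-2, 0, 2):
    print(theta, round(rasch_p(theta, beta_easy), 2), round(rasch_p(theta, beta_hard), 2))
# An average respondent (theta = 0) has p = .73 on the easy item and p = .18 on the hard one.
```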
Item info for the Rasch model
- Rasch items differ in difficulty β
- Items provide a lot of info about θ values close to the item location β
- Little info about θ values far from there
- For θ >> β_i: almost everyone gets the question correct
- For θ << β_i: almost everyone gets the question wrong
Item information function
- Function I_i(θ): where it is higher → θ is measured more accurately
- E.g. Rasch item with β = 0, and persons A (θ = −2), B (θ = 0), C (θ = 3)
- I_i(θ) = P_i(θ) × (1 − P_i(θ))
- B is measured more accurately by this item than A or C
- I: item information function; how much info the item gives about the ability level
- Item info is always highest around the item difficulty; the larger the difference between item difficulty & ability level → the lower the item info
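A sketch of the Rasch item information I(θ) = P(θ)·(1 − P(θ)) for the β = 0 item and the three ability levels mentioned above:

```python
import math

def rasch_p(theta, beta=0.0):
    return math.exp(theta - beta) / (1 + math.exp(theta - beta))

for label, theta in (("A", -2), ("B", 0), ("C", 3)):
    p = rasch_p(theta)
    print(label, theta, round(p * (1 - p), 3))   # information is highest at theta = beta
```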
Steps IRT analysis
- Select IRT model
- draw large sample from population
- Estimate item parameters (Rasch item difficulty)
- → use the test to estimate θ for people
Two-parameter logistic model (2PL)
- Rasch model extended with an item discrimination parameter a_i
- a_i indicates how well item i distinguishes between people based on their level on the latent trait
- Items now differ in how well they discriminate; however it is always the case that a_i > 0
- The higher a_i the better; leads to steeper item characteristic functions
- P(X_i = 1 | θ) = e^(a_i·(θ − β_i)) / (1 + e^(a_i·(θ − β_i)))
- θ̂ more accurate for θ = −2 than for θ = 0 or θ = 3; smallest CI around θ̂ there
- Information peaks at the item difficulty; the peak is higher for a well discriminating item
Accuracy of measurement
- The SE of measurement for the test, SE_test(θ), is determined by the test information and therefore depends on θ
- 95% CI for θ: θ̂ ± 1.96 × SE_test(θ)
- Higher test info at θ → smaller CI for θ
Test construction based on an item bank
- Item bank: relatively large collection of easily accessible items, with ICF & item description known
- Select exactly those items that contribute well to what you want to measure with the test
- What is the desired accuracy of θ̂ for all the different possible values of θ?
Test construction based on item bank
- IRT population independent measuring possible
- With item banks, different tests can be composed; depends on the purpose of the test
- Target information function
- [Figure legend: target information (solid green), test info so far (striped pink), info of the new item (striped blue), info of previously selected items (dotted black)]
- Continue until θ is estimated accurately enough (i.e. until SE_test(θ) is small enough)
Adaptive test: + & -
+ Accurate measurement for everyone
+ Adjusted to individ
+ Objective
+ Short test time
+ Fast feedback possible
+ Performance on diff tests can be compared
- IRT models restrictive;
- eg don't allow for guessing
- High costs
- Often hard to construct lots of items for 1 construct
- Multistage testing: simple form of adaptive testing
Multi stage testing
- Everyone receives the same first part of the test
- Based on performance on the 1st part → assign the next part, e.g.
- Score below avg→easy 2nd part
- Score min avg→difficult 2nd part
- Possible w 2 stages but also w more stages
- Less efficient than adaptive testing but easier to do
Comparison classical & item response T
IRT
- Provides us with a way to test our measurement model (testing model assumptions)
- Recognizes that the accuracy of measurement depends on the ability level (SE_test(θ) varies with θ)
- Allows for population independent comparisons of ppl & items
- Allows for adaptive testing
CTT
- Is simpler than IRT
- Also offers tools to set up a good test (reliability, validity)
- Results often dont differ a lot
- Can be found in more standard software
True score in IRT
- In CTT: true score = average score of a person over replications
- True score = expected score
- The expected score is easily determined in IRT
  - Expected item score given θ: P(X_i = 1 | θ)
  - Expected test score given θ: sum of the expected item scores
- E.g. expected scores depend on the extraversion level θ
  - E.g. the probability that introverted people answer "somewhat agree"
L 12 test bias & fairness
Types of test use
1. Meas psych construct→construct validity
2. Predicting criterion→criterion validity
2 types of validity
1. Construct validity: X related to the construct
2. Criterion validity: X predicts the criterion
Construct validity
1. Everyone with the same skill level → same expected test score
2. Not the case → construct bias; bias in the test score
Criterion validity
1. People with the same test score → same predicted value for the criterion
2. Not the case → predictive bias; bias in the predicted criterion
Construct bias
- Bias test score
- Identical response processes: 1 of the 5 construct validity pillars
- Ppl don’t go through same response process→results not comp
- Language problems
- Fear of failure
- Lacking background knowledge
Construct bias dichotomous items
- When no construct bias?
- For every item, everyone with the same skill level / level on the trait → the same probability of answering the question correctly/positively
- Equal probabilities only given the same skill level
  - John has a lower p than Ina of answering some questions correctly → construct bias if John has the same skill level as Ina
- IRT to investigate
Construct bias dichotomous items
- An item functions well if only θ is a predictor of the probability of success
- If other factors besides θ play a role → construct bias
- Stereotype threat
  - Boys & girls with the same level of math skill → equal item p
  - Stereotype threat negatively influences the performance of girls on difficult items
  - Construct bias in the math test
1. Administer the test to a large sample from the population
2. Hypothesize which items the effect may apply to, e.g. difficult items with β_i > 1
3. Estimate the item characteristic functions for every item, separately for boys & girls
4. Examine if the item characteristic functions of boys & girls differ on the suspect items
-
Construct & predictive bias
- Construct bias is about the relationship between the item/rest score & the construct measured
- Predictive bias is about the relationship between the test score & the criterion
- Both cases bias plausible
- Rating of some ppl too high/others too low
- →unfair & nonvalid test use
- Important to test for biases→aware of them→deal w them