
Article

An Examination of Item Response Theory Item Fit Indices for the Graded Response Model

Organizational Research Methods
14(1) 10-23
© The Author(s) 2011
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/1094428109350930
http://orm.sagepub.com

David M. LaHuis,1 Patrick Clark,1 and Erin O'Brien1

1 Department of Psychology, Wright State University, Dayton, Ohio

Corresponding Author:
Department of Psychology, Wright State University, 335 Fawcett Hall, Dayton, OH 45435
Email: [email protected]

Abstract
The current study examined the Type I error rates and power of several item response theory (IRT) item fit indices used in conjunction with the graded response model (GRM). Specifically, S − χ², χ²*, and adjusted χ² degrees of freedom ratios (χ²/dfs) were examined. Model misfit was introduced by manipulating item parameters and by using a different IRT model to generate item data. Results indicated lower than expected Type I error rates for S − χ² and χ²*. Adjusted χ²/dfs resulted in large Type I error rates when used with cross validation and very low Type I error rates when used without cross validation. χ²* and adjusted χ²/dfs without cross validation were the most powerful overall.

Keywords
item response theory and quantitative research, computer simulation procedures (e.g., Monte
Carlo, Bootstrapping) and quantitative research, measurement models

Item response theory (IRT) models attempt to specify the relationship between individuals’ under-
lying trait levels and the probability of endorsing an item using item and person characteristics.
These models offer several advantages over classical test theory such as the ability to develop
shorter, more efficient tests, use more powerful tests of differential item functioning (DIF), and
implement computer adaptive testing. As such, the use of IRT models is becoming more common
in organizational research. For example, IRT models have been used to explore different
response processes (Stark, Chernyshenko, Drasgow, & Williams, 2006), develop personality scales
(Chernyshenko, Stark, Drasgow, & Roberts, 2007), compare computer and paper-and-pencil scales
(Donovan, Drasgow, & Probst, 2000), and examine applicant faking in selection contexts (Robie,
Zickar, & Schmit, 2001).
As with any model-based statistical technique, the use of IRT models is contingent on the model
fitting the data. As such, a number of item fit indices have been proposed for examining the
appropriateness of IRT models. Traditional item fit indices include Bock's χ² (Bock, 1972), Yen's Q statistic (Yen, 1981), and G² (McKinley & Mills, 1985). Alternative item fit indices have also been proposed. These include S − χ² (Orlando & Thissen, 2000), Stone's (2000) scaling-corrected fit statistic (χ²*), and adjusted χ² degrees of freedom ratios (χ²/dfs) proposed by Drasgow, Levine, Tsien, Williams, and Mead (1995). All these indices are computed by grouping examinees by their levels of ability and comparing observed response option frequencies with expected frequencies based on the estimated IRT model. They differ in terms of how individuals are grouped, how observed and expected frequencies are calculated, and how degrees of freedom are computed.

There have been relatively few studies examining the performance of these item fit indices. Some studies have shown that both Bock's χ² and G² result in too many Type I errors when used with dichotomously scored items (Orlando & Thissen, 2000; Stone & Zhang, 2003). Stone and Zhang also found that both S − χ² and χ²* result in reasonable Type I errors but require larger sample sizes to achieve adequate power to detect item misfit. Kang and Chen (2008) found similar results for S − χ² using several polytomous IRT models. However, several questions about the performance of these item fit indices remain. For example, we are not aware of any research comparing adjusted χ²/dfs with the other indices. In addition, there is very little research examining the use of the fit indices with the graded response model (GRM; Samejima, 1969). The GRM has been used frequently to analyze personality data in organizational settings (Chernyshenko, Stark, Chan, Drasgow, & Williams, 2001; LaHuis & Copeland, 2009; Robie et al., 2001).
In the current study, we used Monte Carlo simulations to compare three alternative IRT item fit indices in terms of Type I errors and power using the GRM. Specifically, we investigated the performance of S − χ², χ²*, and adjusted χ²/dfs across a number of different conditions. We examined the power of these indices by manipulating item parameters and introducing model misspecification.

GRM
The GRM has often been used to analyze personality data (Embretson & Reise, 2000). It assumes that an item has m ordered categories. The GRM estimates m − 1 boundary response functions (BRFs) that represent the cumulative probability of selecting a response option greater than the option of interest. For example, there would be four BRFs for a Likert-type scale with five response options. The first BRF would be defined as the probability of choosing the lowest response option versus choosing one of the other four options. The second BRF would reflect the probability of choosing the lowest two response options versus the other three. The equation for a BRF is similar to the two-parameter logistic model equation for dichotomous data:

$$P^{*}_{ik}(\theta) = \frac{\exp[a_i(\theta - b_{ik})]}{1 + \exp[a_i(\theta - b_{ik})]}. \qquad (1)$$

In this equation, $a_i$ is the item discrimination parameter for item i, $b_{ik}$ is a threshold parameter for option k, and $\theta$ is an individual's estimated trait level. There are m − 1 threshold parameters. For example, the $b_{ik}$ threshold parameter for choosing greater than the lowest option would represent the point on the θ scale where there is 50% probability that the response is greater than the lowest option.
The GRM has been called an indirect IRT model because the probability of endorsing a response option is not calculated directly from Equation 1 (Embretson & Reise, 2000). Instead, the probability of responding to each of the five categories can be calculated using Equations 2 through 6:

$$P_{i1}(\theta) = 1 - P^{*}_{i1}(\theta), \qquad (2)$$
$$P_{i2}(\theta) = P^{*}_{i1}(\theta) - P^{*}_{i2}(\theta), \qquad (3)$$
$$P_{i3}(\theta) = P^{*}_{i2}(\theta) - P^{*}_{i3}(\theta), \qquad (4)$$
$$P_{i4}(\theta) = P^{*}_{i3}(\theta) - P^{*}_{i4}(\theta), \qquad (5)$$
$$P_{i5}(\theta) = P^{*}_{i4}(\theta) - 0. \qquad (6)$$
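To make Equations 1 through 6 concrete, the following Python sketch (not part of the original article) computes the BRFs and the resulting category probabilities for a single five-option item; the parameter values are illustrative only.

```python
import numpy as np

def boundary_response(theta, a, b_k):
    """BRF from Equation 1: probability of responding above threshold k."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b_k)))

def grm_category_probs(theta, a, b):
    """Category probabilities (Equations 2-6) for thresholds b[0] < ... < b[m-2]."""
    brf = np.array([boundary_response(theta, a, b_k) for b_k in b])  # m - 1 BRFs
    upper = np.concatenate(([1.0], brf))   # P*_i0 = 1 by definition
    lower = np.concatenate((brf, [0.0]))   # P*_im = 0 by definition
    return upper - lower                   # P_ik = P*_i,k-1 - P*_ik

# Illustrative parameter values (hypothetical, not taken from the study)
a, b = 1.25, [-1.7, -0.5, 0.7, 1.9]
print(grm_category_probs(theta=0.0, a=a, b=b))  # five probabilities summing to 1
```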

Traditional Item Fit Indices


Item fit has been assessed traditionally using some form of χ² statistic. Examinees are divided into a number of intervals based on their estimated θs. Observed and expected frequencies for the interval are compared using the formula:

$$\chi^2_i = \sum_{k=1}^{K} N_k \frac{(O_{ik} - E_{ik})^2}{E_{ik}(1 - E_{ik})}, \qquad (7)$$

where $N_k$ is the number of individuals responding to item i with option k, $O_{ik}$ is the observed proportion, and $E_{ik}$ is the expected proportion. Bock's χ² (1972) uses the median θ of the interval to calculate expected frequencies and has varying sizes of intervals. Yen's Q statistic (1981) divides θ into 10 equal-sized intervals and uses the mean of the interval to calculate expected frequencies. McKinley and Mills (1985) developed a likelihood ratio χ² statistic called G² similar to the Q statistic. θ is divided into 10 equal-sized intervals and observed and expected frequencies are compared. The formula for G² is

$$G^2_i = 2\sum_{k=1}^{10} N_k \left[ O_{ik}\ln\!\left(\frac{O_{ik}}{E_{ik}}\right) + (1 - O_{ik})\ln\!\left(\frac{1 - O_{ik}}{1 - E_{ik}}\right) \right]. \qquad (8)$$
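As an illustration of the interval-based logic behind Equations 7 and 8, the sketch below bins examinees on estimated θ and compares observed and expected proportions for a dichotomous item. It is a simplified rendering rather than the exact procedures of Bock, Yen, or McKinley and Mills, and the `icc` argument is an assumed stand-in for the fitted model.

```python
import numpy as np

def interval_fit(theta_hat, responses, icc, n_groups=10):
    """Q/G2-style comparison of observed and expected proportions across theta intervals.

    theta_hat: estimated abilities; responses: 0/1 item scores;
    icc: function giving P(endorse | theta) under the fitted model.
    """
    edges = np.quantile(theta_hat, np.linspace(0, 1, n_groups + 1))
    groups = np.clip(np.digitize(theta_hat, edges[1:-1]), 0, n_groups - 1)
    chi2, g2, eps = 0.0, 0.0, 1e-10
    for g in range(n_groups):
        mask = groups == g
        n_g = mask.sum()
        if n_g == 0:
            continue
        obs = responses[mask].mean()        # observed proportion O_k in interval k
        exp = icc(theta_hat[mask].mean())   # expected proportion E_k at the interval mean (as in Yen's Q)
        chi2 += n_g * (obs - exp) ** 2 / (exp * (1 - exp))
        g2 += 2 * n_g * (obs * np.log((obs + eps) / exp)
                         + (1 - obs) * np.log((1 - obs + eps) / (1 - exp)))
    return chi2, g2
```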

A major issue with these fit indices is the treatment of estimated θ. Several researchers have noted that the subdivision of θ into intervals is arbitrary and that creating different intervals may affect the results (Orlando & Thissen, 2000; Reise, 1990). In addition, because the intervals are conditioned on θ, the observed frequencies are model dependent. This may affect the degrees of freedom of the fit indices (Orlando & Thissen, 2000). Finally, Stone (2000, 2003) has noted how the uncertainty in estimating θ also has adverse effects on the χ² statistics. To address these problems, researchers have developed several alternative item fit indices.

Alternative Item Fit Indices


Several alternative item fit indices have been proposed to address the issue of the subdivision of θ into intervals and the uncertainty of estimating θ. They differ in the way they accomplish this. Orlando and Thissen (2000) propose conditioning on summed scores. Stone (2000, 2003) developed measures based on the full posterior distribution of θ. Finally, Drasgow et al. (1995) recommend an approach based on comparing item frequencies with frequencies that would be expected given the IRT parameters and an assumed distribution of θ.

S − χ²
Orlando and Thissen (2000) developed item fit statistics for dichotomously scored items that were conditioned on summed scores instead of θ. Observed and expected frequencies for each summed score are compared using a χ² statistic. The procedure for computing expected frequencies uses the joint likelihood distribution of each possible total score across all possible response patterns for each total score. That is, the expected frequencies are based on the likelihood of response patterns where a given item is endorsed and that produce a given total score. Expected frequencies are computed using a recursive algorithm developed by Thissen, Pommerich, Billeaud, and Williams (1995) and the formula:

$$E_{ik} = \frac{\int P_{i1}(\theta)\, f^{*}_{i}(k-1\mid\theta)\,\phi(\theta)\,d\theta}{\int f(k\mid\theta)\,\phi(\theta)\,d\theta}, \qquad (9)$$

where $P_{i1}$ is the response function for item i, $f(k\mid\theta)$ is the conditional predicted test score distribution given θ, $f^{*}_{i}(k-1\mid\theta)$ is the conditional predicted test score distribution without item i, and $\phi(\theta)$ is the population distribution of θ. Orlando and Thissen (2000) proposed computing a χ² index (S − χ²) and a likelihood ratio statistic (S − G²) using the χ² and G² formulas listed above. The degrees of freedom for these statistics equal the number of total score categories minus the number of item parameters. When necessary, cells are collapsed to maintain a minimum expected frequency of 1. An adjustment is made to the degrees of freedom when collapsing occurs. Results of the simulations of Orlando and Thissen suggested that only S − χ² had acceptable Type I error rates and that large sample sizes were needed to achieve adequate power.
Kang and Chen (2008) extended S − χ² for use with polytomously scored items by modeling the expected response option proportions using the same recursive algorithm used for dichotomously scored items. They examined the performance of the generalized S − χ² for partial credit models. Their results were similar to Orlando and Thissen (2000) in that Type I error rates were acceptable and larger sample sizes were required for adequate power.
The S − χ² index has several positive characteristics. Research suggests that it has acceptable Type I error rates and adequate power for large samples (N ≥ 2,000) for some dichotomous and polytomous IRT models (Kang & Chen, 2008; Orlando & Thissen, 2000, 2003; Stone & Zhang, 2003). The recursive algorithms can be implemented for dichotomous items in the Goodfit software program (Orlando, 1997) and for both dichotomous and polytomous items in the IRTFIT statistical analysis software (SAS) macro (Bjorner, Smith, Stone, & Sun, 2007).1 In addition, S − χ² does not divide θ into arbitrary intervals because it is conditioned on summed total scores. However, as the number of items or response options increases, the number of possible total scores increases. This may lead to issues with sparseness. Finally, calculating the index can be computationally demanding when the number of possible total scores is large.
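The expected frequencies in Equation 9 rest on the conditional distribution of summed scores given θ, built item by item with the recursive algorithm of Thissen et al. (1995). The sketch below shows only that recursion under assumed inputs; the full S − χ² computation (Equation 9, integration over θ, and cell collapsing) is left to software such as IRTFIT.

```python
import numpy as np

def summed_score_distribution(theta, category_prob_fns):
    """P(summed score = s | theta), built one item at a time (recursion of Thissen et al., 1995).

    category_prob_fns: list of functions, each returning the vector of category
    probabilities for one item at the given theta; categories are scored 0..m-1.
    """
    dist = np.array([1.0])                      # before any item, score 0 with probability 1
    for probs_fn in category_prob_fns:
        p = probs_fn(theta)                     # category probabilities for this item
        new = np.zeros(len(dist) + len(p) - 1)
        for k, p_k in enumerate(p):             # convolve the running score distribution with this item
            new[k:k + len(dist)] += p_k * dist
        dist = new
    return dist                                 # index s gives P(score = s | theta)
```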

χ²*
Stone (2000) developed methods for accounting for uncertainty in estimating examinees' θ when testing item fit. Bayesian estimates of θ are usually calculated using the conditional distribution of θ given a response pattern (x) and the prior distribution of θ. The conditional distribution is defined as

$$P(\theta\mid x) = P(x\mid\theta)P(\theta)/P(x), \qquad (10)$$

where $P(\theta\mid x)$ is the posterior distribution of θ, $P(x\mid\theta)$ is the conditional probability of response pattern x, $P(\theta)$ is the prior distribution of θ, and $P(x)$ is the unconditional probability of response pattern x given an individual with unknown θ randomly sampled from a population with an assumed distribution. The mode or the mean of the posterior distribution can be used to produce point estimates of θ. Whereas the traditional χ² and G² indices use these point estimates of θ, Stone (2000) suggested using information from the full posterior distribution of θ to calculate pseudo-observed frequencies. Pseudo-observed frequencies can then be compared with expected frequencies using either the χ² or G² formulas. One issue with this approach is that the resulting statistics do not follow known distributions. Thus, Stone recommends a resampling procedure to obtain a scaling factor and effective degrees of freedom for significance tests. Simulation research indicates that 100 resamples are enough to obtain accurate values for these (Stone & Zhang, 2003). The degrees of freedom for χ²* equal the effective degrees of freedom minus the number of estimated item parameters.
Initial research suggests that χ²* appears to have acceptable Type I error rates and adequate power with at least moderate sample sizes (N ≥ 1,000). There is also some indication that χ²* works with polytomous items. Stone (2003) found acceptable Type I error rates and power for a three-category item in a mixed-item scale. In addition, χ²* addresses the issue of uncertainty in estimating θ by using the full posterior expectations when calculating pseudo-observed frequencies. However, the resampling procedure can be computationally demanding, and it is not clear how violating assumptions of the scaling correction affects the results. In addition, χ²* has not been extensively studied for the GRM.
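A rough sketch of the idea behind the pseudo-observed frequencies: each examinee's responses are spread across quadrature points in proportion to the posterior P(θq | x) rather than assigned to a single point estimate of θ. The function names and array layout below are assumptions for illustration; the scaling correction and resampling step described above are not shown.

```python
import numpy as np

def pseudo_observed_frequencies(resp_matrix, likelihood_fn, quad_points, quad_weights):
    """Accumulate pseudo-observed category frequencies at each quadrature point.

    resp_matrix: (n_persons, n_items) integer responses (0..m-1);
    likelihood_fn(x, theta): likelihood of response pattern x at theta under the fitted model.
    """
    n_persons, n_items = resp_matrix.shape
    n_q, n_cat = len(quad_points), int(resp_matrix.max()) + 1
    pseudo = np.zeros((n_items, n_q, n_cat))
    for x in resp_matrix:
        post = np.array([likelihood_fn(x, t) for t in quad_points]) * quad_weights
        post /= post.sum()                  # posterior weights P(theta_q | x)
        for i, k in enumerate(x):
            pseudo[i, :, k] += post         # spread this examinee's response over theta_q
    return pseudo
```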

Adjusted χ² to Degrees of Freedom Ratios


Drasgow et al. (1995) recommend using χ²/dfs to assess fit and to adjust these ratios for sample size. χ²s are based on comparing observed frequencies for each item with frequencies that are expected based on the estimated item parameters and the distribution of θ. Observed frequencies for an item are simply the number of times each response option is endorsed. Expected frequencies for item i are calculated using the formula

$$E_{ik} = N \int P(k\mid\theta)\,f(\theta)\,d\theta, \qquad (11)$$

where $P(k\mid\theta)$ is the probability of endorsing option k for a given θ, and $f(\theta)$ is the density function for θ, usually assumed to be standard normal.
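Equation 11 can be approximated with a fixed set of quadrature points under the usual standard normal assumption for θ. A minimal sketch follows; the function names are illustrative and not taken from Modfit.

```python
import numpy as np
from scipy.stats import norm

def expected_frequencies(n_persons, category_prob_fn, n_quad=61):
    """Approximate Equation 11: E_k = N * integral of P(k | theta) f(theta) dtheta."""
    theta = np.linspace(-4, 4, n_quad)
    weights = norm.pdf(theta)
    weights /= weights.sum()                                  # normalized standard normal weights
    probs = np.array([category_prob_fn(t) for t in theta])    # (n_quad, n_categories)
    return n_persons * (weights[:, None] * probs).sum(axis=0)
```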
Because single-item χ²s do not capture violations of local item independence, Drasgow et al. (1995) suggest evaluating χ² statistics for pairs and triplets of items. In addition, because the possible number of item doubles and triples increases greatly with the number of items, subsets of doubles and triples are calculated for larger test lengths. The subsets are based on sorting items into three "packets" based on item difficulty. Combinations of items within and between packets are sampled to compute item doubles and item triples.

The degrees of freedom for the χ²s equal the number of cells minus one. For example, an item single with five response options would have four degrees of freedom. A χ² for an item double with each item having five response options would have 24 degrees of freedom. Cells are collapsed to maintain a minimum expected frequency of 5. The degrees of freedom are adjusted to reflect the collapsing when it occurs.
Because χ² statistics are heavily dependent on sample size, Drasgow et al. (1995) suggest adjusting χ²/dfs to a sample size of 3,000 using the formula

$$\text{adj}\,\chi^2 = \frac{3{,}000\,(\chi^2_{\text{obs}} - n)}{N} + n, \qquad (12)$$

where n is the degrees of freedom. Drasgow et al. also recommended using a cross-validation sample when assessing fit to combat capitalizing on chance when estimating IRT models using the calibration sample. They suggest that adjusted χ²/dfs below 3 indicate acceptable fit.
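A direct translation of Equation 12, with the conventional cutoff of 3 applied to the resulting ratio; this is only a sketch, and the cell collapsing and doubles/triples bookkeeping are omitted.

```python
def adjusted_chi2_to_df(chi2_obs, df, n):
    """Equation 12: rescale an observed chi-square to a sample size of 3,000, then divide by df."""
    adj = 3000.0 * (chi2_obs - df) / n + df
    adj = max(adj, 0.0)        # negative adjusted values are set to 0 (see the Results section)
    return adj / df            # values below 3 are taken to indicate acceptable fit

# Example: chi-square of 60 on 24 df from a sample of 1,000
print(adjusted_chi2_to_df(60.0, 24, 1000))   # 5.5, which would flag misfit
```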
One advantage of adjusted χ²/dfs is that they are extremely flexible. They can be calculated for any IRT model. In addition, they do not require dividing θ into intervals because the expected frequencies are based on the assumed distribution of θ. Traditionally, the distribution has been assumed to be standard normal. Finally, the ratios can be calculated easily using the software program Modfit (Stark, 2007).2
Adjusted χ²/dfs have been commonly used in organizational research (Chernyshenko et al., 2001; LaHuis & Copeland, 2009; Robie et al., 2001). However, there is little empirical research regarding their Type I error rates and power. For example, it is not clear how the recommended value of three performs for different numbers of items or how well the ratios can detect misfitting items.

Current Study
In the current study, our goal was to examine the performance of the three alternative item fit indices
using the GRM, while varying the number of items and sample size. To assess power, we manipu-
lated the parameter estimates of the GRM and introduced model misspecification. We were partic-
ularly interested in comparing the item fit indices to determine which, if any, should be preferred
overall, and whether different indices captured different sources of misfit.

Method
Design
We conducted a simulation study with 36 conditions. Two different test lengths (10 and 20 items)
and three sample sizes (500, 1,000, and 2,000) were fully crossed to form six conditions where Type
I errors were assessed. Test lengths and sample sizes were chosen to allow for comparisons with
previous simulation studies (e.g., Kang & Chen, 2008; Orlando & Thissen, 2003; Stone & Zhang,
2003). The test lengths are also consistent with typical lengths of short and long forms of personality
scales. For example, the frequently used International Personality Item Pool's short and long forms for each Big Five personality dimension contain 10 and 20 items, respectively (Goldberg et al., 2006). Five manipulations (two discrimination manipulations, two threshold manipulations, and one
model misspecification manipulation) were each implemented for each combination of test length
and sample sizes. We analyzed 100 samples for each condition.

Data Generation
We generated the item data using Statistical Package for the Social Sciences (SPSS) 12.0. First, population values for the parameter estimates were generated for the 100 samples. Based on Meade, Lautenschlager, and Johnson (2007), discrimination (a) parameters were randomly drawn from a normal distribution with a mean of 1.25 and a standard deviation of .07. The threshold for the lowest BRF was sampled from a random normal distribution with a mean of −1.7 and a standard deviation of .45. We added constants of 1.2, 2.4, and 3.6 to the lowest threshold to create the other three thresholds, respectively. Second, a θ value for each simulee was sampled from a random standard normal distribution. Third, the probability of endorsing each option for each item was calculated using the item parameters, θ values, and the equations for the GRM. These were used to calculate a cumulative probability of endorsing each response option or lower. For example, the cumulative probability for endorsing the second response option would be the sum of the probabilities of endorsing the first and second response options. Finally, item data were generated by comparing random variables with uniform distributions ranging from 0 to 1 with the cumulative probabilities. The lowest response option for which the cumulative probability exceeded the random number was a simulee's item response.
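A condensed sketch of the generation steps just described, assuming the parameter means reported above (including the lowest-threshold mean of −1.7); the per-sample draws of population parameters are simplified relative to the study's procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_grm_responses(a, b, n_persons):
    """Generate GRM item responses by inverting cumulative category probabilities."""
    theta = rng.standard_normal(n_persons)
    n_items = b.shape[0]
    data = np.zeros((n_persons, n_items), dtype=int)
    for i in range(n_items):
        brf = 1.0 / (1.0 + np.exp(-a[i] * (theta[:, None] - b[i])))   # (n_persons, m-1) BRFs
        cat = np.concatenate([1 - brf[:, :1], brf[:, :-1] - brf[:, 1:], brf[:, -1:]], axis=1)
        cum = np.cumsum(cat, axis=1)                                   # cumulative P(option <= k)
        u = rng.random((n_persons, 1))
        data[:, i] = (u > cum).sum(axis=1)   # lowest option whose cumulative probability exceeds u
    return data

# Illustrative: 10 items with the reported parameter means
a = rng.normal(1.25, 0.07, size=10)
b1 = rng.normal(-1.7, 0.45, size=(10, 1))
b = np.concatenate([b1, b1 + 1.2, b1 + 2.4, b1 + 3.6], axis=1)
responses = simulate_grm_responses(a, b, n_persons=500)
```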
Consistent with Stone and Zhang (2003), model misfit was introduced by changing the item para-
meters for two of the items. That is, item parameters were calibrated and then the resulting item para-
meters were changed when assessing model fit. Two different slope manipulation conditions were
used: slope parameters were altered by subtracting either 0.25 or 0.50. There were also two different
manipulations of the b parameters. The last threshold parameter was changed by subtracting either
0.25 or 0.50.
Model misspecification was introduced using the generalized graded unfolding model (GGUM; Roberts, Donoghue, & Laughlin, 2000). The GGUM differs from the GRM in that it is based on an ideal point response process as opposed to the dominance response process underlying the GRM.
Ideal point models suggest that individuals judge how well an item describes them in terms of the
underlying trait and tend to endorse items that they feel match their level of the trait. They will tend
to not endorse items that they feel do not match their trait levels. The mismatch may occur because
they believe their trait level is less than or exceeds that indicated by the item. The former is termed
disagreeing from below and the latter is labeled disagreeing from above. Disagreeing from above
causes the expected item score to have a nonmonotonic or bell-shaped relationship with the under-
lying trait. In contrast, the dominance model-based GRM assumes that the expected item score
increases monotonically with the underlying ability.
For the model misspecification conditions, item data with five response options were generated based on the GGUM. Following Koenig and Roberts (2007), the discrimination parameters were sampled from a random uniform distribution ranging from 0.5 to 2. Item locations were generated from a random uniform distribution ranging from −2 to 2. Finally, the last threshold parameter was drawn from a random uniform distribution ranging from −1.4 to −0.4. Successive threshold parameters were generated by subtracting 0.25 and adding a value drawn from a random normal distribution with a mean of 0 and a standard deviation of 0.04. Probabilities of endorsing each response option were calculated using the GGUM formulas. Item data were generated by comparing the probabilities with random variables, as was done with the GRM item data generation.

Analyses
Item parameters were calibrated using Multilog 7.0 (Thissen, 2003). S − χ² and χ²* were estimated using the IRTFIT SAS macro (Bjorner et al., 2007). Adjusted χ²/dfs were computed using Microsoft Excel macros developed by the authors.3 Consistent with previous research (Orlando & Thissen, 2003; Stone & Zhang, 2003), we calculated S − χ² and χ²* indices based on all the cases for each sample. Similarly, we computed adjusted χ²/dfs using cross-validation as suggested by Drasgow et al. (1995). That is, we calibrated the item parameters using the first half of a sample and computed observed frequencies using the second half of the sample. We also calculated adjusted χ²/dfs without cross-validation to allow for more direct comparisons with the other item fit indices.

Type I error rates were calculated as the percentage of times an index suggested that an item did not fit when no manipulations were made. For S − χ² and χ²*, misfit was indicated when the χ²s were statistically significant using an α level of .05. For adjusted χ²/dfs, misfit was present when the value exceeded the recommended cutoff of 3. Percentages for Type I error rates were calculated across all items. Power was based on the percentage of times the indices suggested misfit when some manipulation was made. Percentages were calculated across all misfitting items. That is, for the parameter manipulation conditions, percentages were calculated across the two misfitting items; for the model misspecification condition, power was assessed across all the items. For item doubles and triples, only those containing at least one misfitting item were used to evaluate power.
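The tallying itself is straightforward. A sketch of how the reported percentages could be computed from per-replication results is shown below; the array layout is hypothetical.

```python
import numpy as np

def rejection_rate(p_values=None, ratios=None, alpha=0.05, cutoff=3.0):
    """Percentage of item-level tests flagged as misfitting across replications.

    Pass p_values for S - chi2 and chi2* (flag when p < alpha), or ratios for
    adjusted chi2/df (flag when the ratio exceeds the cutoff of 3).
    """
    flags = (np.asarray(p_values) < alpha) if p_values is not None else (np.asarray(ratios) > cutoff)
    return 100.0 * flags.mean()

# Under a null (no manipulation) condition this is a Type I error rate;
# under a manipulation condition, computed over misfitting items, it is power.
```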
Prior to examining the item fit indices, we tested the accuracy of recovering the population
parameters. This may be an issue because some of the sample sizes were smaller by IRT standards.
Consistent with Meade et al. (2007), we computed a root mean square error (RMSE) by comparing
the estimated and population values for the item parameters. For example, the formula for the a para-
meter was
$$\mathrm{RMSE}(a_i) = \sqrt{\frac{\sum_{i=1}^{k}(\hat{a}_i - a_i)^2}{k}}, \qquad (13)$$

where $\hat{a}_i$ is the estimated discrimination parameter for item i, k is the number of items, and $a_i$ is the population discrimination value.

Table 1. Overall Type I Error Percentages for the Fit Indices

                                           Adjusted χ²/df
Sample size     S − χ²    χ²*      Singles     Doubles     Triples
10 items
  500              2       2       67 (0)      65 (1)      38 (1)
  1,000            2       2       61 (0)      49 (0)      20 (0)
  2,000            3       1       50 (0)      23 (0)       1 (0)
20 items
  500              1       3       68 (0)      64 (1)      37 (1)
  1,000            2       2       59 (0)      48 (0)      21 (0)
  2,000            1       2       50 (0)      23 (0)       1 (0)

Note: Percentages out of 100 samples and across all items. Values in parentheses represent adjusted χ²/df without cross validation.

Results
Overall, we found RMSE values similar to Meade et al. (2007). The mean a parameter RMSE values ranged from .06 (SD = .01) to .12 (SD = .02). The lowest RMSE was found for 20 items with a sample size of 2,000, and the highest RMSE was found for 10 items with a sample size of 500. The mean b parameter RMSE values ranged from .05 (SD = .01) to .18 (SD = .05), with higher values for the extreme response options. Thus, these values suggested that the item parameters were estimated with reasonable accuracy. Further details are available from the first author.

Type I Error Rates


Table 1 presents the Type I error rates for the item fit indices. Type I error rates below 5% were observed for S − χ² and χ²*. Values for S − χ² and χ²* ranged from 1% to 3% but did not show a consistent pattern across sample sizes or test lengths. In contrast, Type I error rates for adjusted χ²/dfs with cross validation were found to be very large. Percentages ranged from 23% to 68% and were lower for larger sample sizes. Given these unacceptably large Type I error rates, we did not estimate power for adjusted χ²/dfs with cross validation. For adjusted χ²/dfs without cross validation, there was little variability in the values. The ratio was 0 for item singles across all samples and conditions. This occurs because adjusting χ²s that are less than their degrees of freedom results in negative values, which are set to 0. Low Type I error rates were also found for item doubles and triples. The error rate was 1% when the sample size was 500 and 0% for the other samples.

Power Estimates
Table 2 presents empirical power estimates for the item fit indices. For the decreased slope by 0.25 condition, S − χ² only displayed adequate power when the sample size was 2,000 and the test length was 10. For the larger decrease in slope condition, S − χ² displayed adequate power across all sample sizes and test lengths. χ²* had acceptable levels of power for both slope manipulation conditions across all sample sizes and test lengths. This was also the case for adjusted χ²/dfs based on item singles, doubles, and triples.


Table 2. Overall Power Estimates for the Fit Indices

                                                         Adjusted χ²/df
Number of items   Sample size    S − χ²    χ²*     Singles    Doubles    Triples
Decrease slope by .25
  10 items        N = 500           15      98        95         99         97
                  N = 1,000         39     100       100         99         96
                  N = 2,000         93     100       100         99         95
  20 items        N = 500            4      95        97         99         97
                  N = 1,000         18     100       100         99         96
                  N = 2,000         68     100       100         99         95
Decrease slope by .5
  10 items        N = 500           97     100       100        100         99
                  N = 1,000        100     100       100        100        100
                  N = 2,000        100     100       100        100        100
  20 items        N = 500           94     100       100        100        100
                  N = 1,000        100     100       100        100        100
                  N = 2,000        100     100       100        100        100
Decrease last threshold by .25
  10 items        N = 500            7      41        70         99         96
                  N = 1,000         12      89        95         99         95
                  N = 2,000         36      99        99         99         94
  20 items        N = 500            3      25        69         99         96
                  N = 1,000          7      76        93         99         94
                  N = 2,000         21      97        99         99         94
Decrease last threshold by .5
  10 items        N = 500           60     100       100        100         99
                  N = 1,000         93     100       100        100         99
                  N = 2,000        100     100       100        100         99
  20 items        N = 500           46      99       100        100         99
                  N = 1,000         77     100       100        100         99
                  N = 2,000         98     100       100        100         99
GGUM generated
  10 items        N = 500           74      88         4         87         98
                  N = 1,000         84      93        19         93         99
                  N = 2,000         92      95        35         96         99
  20 items        N = 500           69      85        39         92         98
                  N = 1,000         79      93        48         97         99
                  N = 2,000         87      97        58         98         99

Note: Percentages out of 100 samples and across all misfitting items. Adjusted χ²/dfs were computed without cross validating. GGUM = generalized graded unfolding model.

S − χ² did not have acceptable levels of power to detect a decrease in the last threshold of 0.25 but did have acceptable power for the larger threshold decrease condition when the sample size was 2,000 or when the sample size was 1,000 and the test length was 10 items. χ²* exhibited acceptable power to detect a last threshold decrease of 0.25 when the sample size was at least 1,000. Acceptable power was also found for χ²* when the threshold was decreased by 0.50 for all sample sizes and test lengths. For adjusted χ²/dfs for item singles, adequate power to detect a threshold decrease of 0.25 was found for sample sizes of at least 1,000. Adjusted χ²/dfs for item doubles and triples exhibited sufficient power to detect threshold changes for all sample size and test length conditions.

When the GGUM was used to generate the data, S − χ² had acceptable power for sample sizes of at least 1,000. χ²* exhibited acceptable power across all sample sizes and test lengths. The adjusted χ²/dfs for item singles did not demonstrate adequate power for any of the sample size and test length combinations. However, adjusted χ²/dfs for both item doubles and triples had adequate power across all sample sizes and test lengths.

Discussion
One of the major goals of the current study was to identify which item fit index should be preferred overall for the GRM and whether different indices captured different sources of misfit. Our results suggest that overall the S − χ², χ²*, and adjusted χ²/dfs without cross validation indices exhibited lower than expected Type I error rates. In contrast, the adjusted χ²/dfs with cross validation resulted in too many Type I errors to be a viable fit index for the GRM based on the test lengths and sample sizes in the current study. The χ²* and the non-cross-validated adjusted χ²/dfs demonstrated more power than S − χ² to detect changes in item parameters. The adjusted χ²/dfs for item singles without cross validation did not have sufficient power in the model misspecification condition, but those for item doubles and triples did. The χ²* and adjusted χ²/dfs for item doubles and triples seemed to be the most powerful across all misfit conditions. Thus, based on our results, either χ²* or a combination of adjusted χ²/dfs for item doubles and triples without cross validation appears to be the most promising approach for assessing the fit of the GRM.

Both χ²* and adjusted χ²/dfs have advantages and disadvantages when compared against each other. The main advantage of χ²* is that it is a single index based on a single item. This makes it easier to identify the problem item and remove it from the test regardless of the source of misfit. This is more difficult for adjusted χ²/dfs. It may be difficult to pinpoint problem items with item doubles and triples because the same item may be in multiple combinations and only one combination may indicate poor fit. One main advantage of adjusted χ²/dfs is that they are much less computationally intensive than χ²*. However, this advantage lessens as the computational power of computers increases. Another advantage is that, in our opinion, the concepts underlying adjusted χ²/dfs are much simpler to understand and communicate than those underlying χ²*.
The Type I error results from the current study are somewhat inconsistent with previous research regarding S − χ² and χ²*. Previous simulation research has suggested that these item fit indices had nominal Type I error rates for several dichotomous (Stone & Zhang, 2003) and polytomous (Kang & Chen, 2008) IRT models. This indicates that the distributions for these fit indices approximated a χ² distribution. In the current study, we found lower than expected Type I error rates, which suggests that these indices may not follow a χ² distribution when used with the GRM. One potential explanation for these differences is that the IRT models used in previous research were direct IRT models (Embretson & Reise, 2000). That is, the probability of endorsement is modeled directly. In contrast, the GRM is an indirect IRT model in that the probability of endorsing a response option is a function of multiple BRFs. This may reduce the effective degrees of freedom associated with the fit indices. Future research is needed to determine how S − χ² and χ²* are distributed when they are used with the GRM.
Consistent with prior practice, the degrees of freedom for these indices were adjusted for the number of item parameters. There is some debate over whether such adjustments should be made. Mislevy and Bock (1990) suggest that estimating item parameters does not result in a loss of degrees of freedom. Stone and Zhang (2003) argued that an adjustment should be made to reflect the fact that expected frequencies are based on estimated item parameters. There is some indication that the practical differences between adjusting and not adjusting are negligible for the generalized S − χ² (Kang & Chen, 2008). This is probably because the degrees of freedom are based on the number of total score categories, which tend to be large when polytomous items are used. However, adjusting degrees of freedom may have a bigger impact on χ²*. Stone and Yin (2006) note that the degrees of freedom adjustments do not consider varying precision in estimating item parameters associated with different sample sizes. They demonstrate that, in some situations, the adjustment may overcorrect and recommend estimating item parameters with each replication when calculating χ²* to account for item parameter uncertainty. Results of their simulation study suggest that this approach yields nominal Type I error rates and reasonable power. It should be noted that the IRTFIT macro used in this study does not implement the suggestion by Stone and Yin.
Using cross validation with adjusted χ²/dfs resulted in unacceptable Type I error rates in the current study. This was probably caused by capitalizing on chance when calibrating item parameters with the smaller sample sizes used in the simulation and emphasizes the need for larger sample sizes when estimating IRT models. It also highlights the distinction between error of approximation and overall error that has been made for structural equation models (Browne & Cudeck, 1992). Error of approximation concerns the fit of the model to the population given optimal parameter values. Fit indices without cross validation provide measures of this. These indices address the issue of whether the model fits well enough to use the item parameters with the current sample. Overall error addresses the fit of the model to the population, given sample-based parameter values. Fit indices using cross validation assess this type of error and ask the question, "Do the item parameters from this sample apply to other samples from the same population?"
The answer to the question of when to use item fit indices based on cross validation depends on
the nature of the research. Item fit indices that do not use cross validation seem more appropriate for
investigations that concern a specific sample. These include studies of DIF or questions regarding
the appropriateness of an IRT model for a type of item or sample. Item fit indices using cross
validation appear more appropriate for research that will use the item parameter estimates on future
samples. Examples of this include computer adaptive testing or the development of scales.
It is interesting to note the contrast between Type I error rates and power for the item fit indices. The combination of low Type I error rates and high levels of power suggests that these item fit indices perform well overall under the GRM. All the fit indices had the power to detect changes in the slope parameter. Such misfit may occur when multidimensional response data are analyzed with a unidimensional model. In this situation, the slope parameter is a composite of the multidimensional IRT model slope parameters (Stone & Zhang, 2003). Overall, both χ²* and adjusted χ²/dfs had adequate power to detect changes in the last threshold. This could be similar to the misfit caused by item drift, where threshold parameters may change over time (Stone & Zhang, 2003). Finally, S − χ², χ²*, and adjusted χ²/dfs for doubles and triples had acceptable power to detect misfit when the GGUM was used to generate the data. This is important, as there is some question as to whether individuals respond to personality items using an ideal point or a dominance response process (Stark et al., 2006). Our results indicate that S − χ², χ²*, and adjusted χ²/dfs for doubles and triples would be able to determine whether the dominance-based GRM was being incorrectly used to analyze data resulting from an ideal point response process.

Limitations and Future Research


There were several limitations of the current study. One limitation was the number of characteristics
we manipulated. We included sample sizes, test lengths, and sources of model misfit that allowed for
direct comparisons with previous research. However, other characteristics should also be investi-
gated. For example, we investigated how well the item fit indices could detect model misspecifica-
tion when the GGUM was used to generate the data. Other polytomous IRT models such as the
partial credit model could also be used to create model misspecification. In addition, consistent with
previous research, our largest sample size was 2,000 cases. It is not clear how the item fit indices will
perform with larger sample sizes.
We believe that adjusted χ²/dfs without cross-validation deserve future research. In the current study, they performed well in terms of low Type I error rates and acceptable levels of power under the GRM. However, there are few other empirical studies supporting their use. For example, it is not clear how adjusted χ²/dfs perform with other IRT models. In addition, we suggest that future research examine other possible cutoff values and whether cutoff values should vary with the number of response options.

Finally, all our θ values were generated using a standard normal distribution. Previous research has indicated that when θ is not normally distributed, the accuracy of item parameter estimates is diminished (Seong, 1990; Stone, 1992). This would, in turn, have effects on item fit indices. This may be an issue for some contexts, such as analyzing responses to personality items for applicants or incumbents. In these cases, the distribution of θ may be restricted because of response distortion or range restriction. All the alternative item fit indices make some assumptions about θ, and it is not clear how violating these assumptions affects the performance of the indices.

Implications
Based on our results, both χ²* and a combination of adjusted χ²/dfs for item singles, doubles, and triples without cross validation seem appropriate for analyzing the fit of the GRM. Across all conditions, χ²* demonstrated a good balance between Type I errors and power. Although the Type I error rates were lower than expected, it still exhibited acceptable levels of power to detect the sources of misfit investigated in the current study across most sample sizes and test lengths. Thus, although it is unclear exactly how χ²* is distributed, it shows promise as an item fit index for the GRM. A combination of adjusted χ²/dfs for item doubles and triples could also be used to assess fit for the GRM. They exhibited Type I error rates and power similar to χ²*.

Our results also have implications for cross validating with item fit indices. Although we believe that cross validation is always a good thing, there are times when the sample sizes are not large enough to cross validate. In addition, whether cross validation should be used seems to depend on the reason for assessing item fit. If fit is being assessed prior to use for a given sample (e.g., testing for DIF), not cross-validating appears to be reasonable. However, in situations where the IRT parameters are going to be used with future samples (computer adaptive testing, scale development), cross validation is warranted.

Notes
1. Detailed information about how to conduct and interpret analyses using IRTFIT can be obtained from the
manual available for download at http://outcomes.cancer.gov/areas/measurement/irt_model_fit.html.
2. The Modfit program and detailed instructions for its use can be found at http://io.psych.uiuc.edu/irt/
mdf_modfit.asp.
3. We developed our own macros to calculate adjusted χ²/dfs because the Modfit software program does not run in batch mode.

Declaration of Conflicting Interests


The authors declared no potential conflicts of interests with respect to the authorship and/or
publication of this article.

Financial Disclosure/Funding
The authors received no financial support for the research and/or authorship of this article.

References
Bjorner, J. B., Smith, K. J., Stone, C., & Sun, X. (2007). IRTFIT: A macro for item fit and local dependence
tests under IRT models [Computer software]. Retrieved June, 2009 from http://outcomes.cancer.gov/
areas/measurement/irt_model_fit.html

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more
nominal categories. Psychometrika, 37, 29-51.
Browne, M. W., & Cudeck, R. (1992). Alternative ways of assessing model fit. Sociological Methods & Research, 21, 230-258.
Chernyshenko, O. S., Stark, S., Chan, K. Y., Drasgow, F., & Williams, B. (2001). Fitting item response theory
models to two personality inventories: Issues and insights. Multivariate Behavioral Research, 36, 523-562.
Chernyshenko, O. S., Stark, S., Drasgow, F., & Roberts, B. W. (2007). Constructing personality scales under
assumptions of an ideal point response process: Toward increasing the flexibility of personality measures.
Psychological Assessment, 19, 88-106.
Donovan, M. A., Drasgow, F., & Probst, T. A. (2000). Does computerizing paper-and-pencil job attitude scales make a difference? New IRT analyses offer insight. Journal of Applied Psychology, 85, 305-313.
Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response
theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence
Erlbaum.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., et al. (2006). The
International Personality Item Pool and the future of public-domain personality measures. Journal of
Research in Personality, 40, 84-96.
Kang, T., & Chen, T. T. (2008). Performance of the generalized S − χ² item fit index for polytomous IRT models. Journal of Educational Measurement, 45, 391-406.
Koenig, J. A., & Roberts, J. S. (2007). Linking parameters estimated with the generalized graded unfolding
model: A comparison of characteristic curve methods. Applied Psychological Measurement, 31, 504-524.
LaHuis, D. M., & Copeland, D. A. (2009). Investigating faking using a multilevel logistic regression approach to measuring person fit. Organizational Research Methods, 12, 296-319.
Meade, A. W., Lautenschlager, G. J., & Johnson, E. C. (2007). A Monte Carlo examination of the sensitivity of
the differential functioning of items and tests framework for tests of measurement invariance with Likert
data. Applied Psychological Measurement, 31, 430-455.
McKinley, R. L., & Mills, C. N. (1985). A comparison of several goodness-of-fit statistics. Applied
Psychological Measurement, 9, 49-57.
Mislevy, R. J., & Bock, R. D. (1990). BILOG: Item analysis and test scoring with binary logistic models [com-
puter program]. Chicago, IL: Scientific Software, Inc.
Orlando, M. (1997). Item fit in the context of item response theory. Dissertation Abstracts International, 58/04-
B, 2175.
Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory
models. Applied Psychological Measurement, 24, 50-64.
Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S − χ²: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27, 289-298.
Reise, S. P. (1990). A comparison of item- and person-fit methods of assessing model-data fit in IRT. Applied
Psychological Measurement, 14, 127-137.
Roberts, J. S., Donoghue, J. R., & Laughlin, J. E. (2000). A general item response theory model for unfolding
unidimensional polytomous responses. Applied Psychological Measurement, 24, 3-32.
Robie, C., Zickar, M. J., & Schmit, M. J. (2001). Measurement equivalence between applicant and incumbent
groups: An IRT analysis of personality scales. Human Performance, 14, 187-207.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monographs, 34, 139.
Seong, T. J. (1990). Sensitivity of marginal maximum likelihood estimation of item and ability parameters to
the characteristics of prior ability distributions. Applied Psychological Measurement, 14, 299-311.
Stark, S. (2007). Modfit version 2.0 [computer software]. Retrieved March, 2007, from http://work.psych.uiuc.edu/
irt/mdf_modfit.asp

Stark, S., Chernyshenko, O. S., Drasgow, F., & Williams, B. A. (2006). Examining assumptions about item
responding in personality assessment: Should ideal point methods be considered for scale development and
scoring? Journal of Applied Psychology, 91, 25-39.
Stone, C. A. (1992). Recovery of marginal maximum likelihood estimates in the two-parameter logistic
response model: An evaluation of MULTILOG. Applied Psychological Measurement, 16, 1-16.
Stone, C. A. (2000). Monte Carlo based null distribution for an alternative goodness-of-fit test statistic in IRT
models. Journal of Educational Measurement, 37, 58-75.
Stone, C. A. (2003). Empirical power and Type I error rates for an IRT fit statistic that considers the precision of
ability estimates. Educational and Psychological Measurement, 63, 566-583.
Stone, C. A., & Yin, Y. (2006, April). Accounting for item and ability parameter estimation in a goodness-of-fit
statistic for IRT models. Paper presented at the Annual Meeting of the National Council on Measurement in
Education, San Francisco.
Stone, C. A., & Zhang, B. (2003). Assessing goodness-of-fit of IRT models: A comparison of traditional and
alternative procedures. Journal of Educational Measurement, 40, 331-352.
Thissen, D. (2003). Multilog 7.0 [computer software]. Chicago, IL: Scientific Software International.
Thissen, D., Pommerich, M., Billeaud, K., & Williams, V. (1995). Item response theory for scores on tests
including polytomous items with ordered responses. Applied Psychological Measurement, 19, 39-49.
Yen, W. M. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measure-
ment, 5, 245-262.

Bios
David M. LaHuis received his doctorate in industrial and organizational psychology from the University of Connecticut in 2002 and is currently an associate professor of psychology at Wright State University in Dayton, Ohio. His research interests include personnel selection, performance appraisal, item response theory, and multilevel modeling.

Patrick Clark is a PhD student at Wright State University. His research interests include item response theory and differential item functioning.

Erin O'Brien is a PhD student at Wright State University. Her research interests include item response theory and multilevel modeling.
