Psychometrics R
Anna Brown
2024-01-26
Contents

Preface
Why I wrote this book
What is this book for?
How to use this book
How this book is organised
How to work with exercises in this book
1.6 Solutions
2.1 Objectives
3.1 Objectives
3.3 Solutions
4 Thurstonian scaling
4.1 Objectives
4.4 Solutions
5.1 Objectives
5.5 Solutions
6.1 Objectives
6.4 Solutions
7.1 Objectives
7.4 Solutions
8.1 Objectives
References
Acknowledgments
Preface
This textbook provides a comprehensive set of exercises for practicing all major
Psychometric techniques using R and RStudio. The exercises are based on real
data from research studies and operational assessments, and provide step-by-
step guides that an instructor can use to teach students, or readers can use to
learn independently. Each exercise includes a worked example illustrating data
analysis steps and teaching how to interpret results and make analysis decisions,
and self-test questions that readers can attempt to check their own understanding.
You can read this book online here for free. Copies in printable format may be
ordered from the author.
Data and supporting materials for all exercises are available for download from
http://annabrown.name/psychometricsR
How to cite this book:
Brown, Anna. (2023). Psychometrics in Exercises using R and RStudio. Text-
book and data resource. Available from https://bookdown.org/annabrown/
psychometricsR.
This book can be used for teaching by university lecturers and instructors.
They may use data examples and analyses provided in this book as illustrations
in lectures (acknowledging the source), or simply adopt the book for the practi-
cal/computing part of their course. Some of these exercises will be useful as part
of the general statistics curriculum, and some will be more suitable for special
courses such as “Item Response Theory” or “Structural Equation Modelling” or
“Measurement Invariance”.
This book can be used for self-study by anybody who wants to acquire prac-
tical skills in conducting psychometric analyses. These may be students and
researchers in the fields of psychology, or any behavioural or social science, of
any age and level – undergraduate and postgraduate, PhD and postdoctoral
researchers, and seasoned researchers who want to acquire new skills in psycho-
metrics. These may also be practitioners in various fields of assessment who
need to be able to make sense of their data or create new assessments.
This book can also be used for self-study by people with some experience in
Psychometrics, but wanting to learn how to do these analyses in R, perhaps
moving from another software program like SPSS.
If you are an instructor, you can use this book as you see fit for your course and
your students, perhaps prioritising their needs in either software or statistical
curriculum depending on their level as described below.
If you are a student using the book for self-study, the way you will use this
book will depend on where you begin in psychometrics (and in R/RStudio!).
Beginners in R/RStudio should start from the beginning (including the “Get-
ting started with R and RStudio” section), gradually building skills in data
manipulations using R. You will establish your own routine when working with
exercises (consider each one a mini data analysis project), and will soon feel
confident building a portfolio of packages and functions. Intermediate or ad-
vanced psychometricians who are only beginning in R/RStudio will find the
learning process easier, because you will be applying familiar methods in a new
software environment. You may even compare outputs from R with other software
you used before. You may be pleasantly surprised!
Beginners in Psychometrics should also start from the beginning, gradually
building their skills in conducting analyses using specific techniques and meth-
ods, for instance computing a test score in the presence of missing data. The
exercises are ordered so that later exercises often (but not always!) rely on previ-
ous ones. References to previous exercises that are necessary to understand the
current exercise are always given, so you can refresh your knowledge if necessary.
Intermediate or advanced users of R/RStudio but beginners in Psychometrics
can skip my advice on R tips and tricks, but should follow the order of exercises
to build their psychometric skills.
This part contains 4 exercises, which teach techniques for scaling ordinal ques-
tionnaire data, nominal questionnaire data, and ranked preferences data.
This part contains 2 exercises, teaching how to conduct item analysis and relia-
bility analysis of polytomous questionnaire data and dichotomous questionnaire
data.
This part contains 2 exercises, teaching how to fit 1PL and 2PL models to
dichotomous questionnaire data, and a graded response model to polytomous
questionnaire data.
This part contains 2 exercises, teaching how to test for DIF dichotomous ques-
tionnaire data using binary logistic regression, and polytomous questionnaire
data using ordinal logistic regression.
This part contains 2 exercises, teaching how to fit CFA models to polytomous
item scores, and to a correlation matrix of subtest scores.
This part contains 2 exercises, teaching how to fit a path model to observed test
scores, and an autoregressive model to longitudinal test measurements.
This part contains 3 exercises, teaching how to fit a latent autoregressive model
and a growth curve model to longitudinal test measurements, and how to test
for longitudinal measurement invariance in repeated test measurements.
This part contains 2 exercises, teaching how to test for measurement invariance
across groups, and how to test for measurement invariance across time and
experimental groups using multiple-group analysis settings.
Each exercise encourages readers to apply the presented techniques to new
variables, make sense of outputs, and interpret results by presenting them
with self-test questions. Answering these questions is an important part
of learning. Students should attempt to answer each question independently,
and then verify their answers with those provided at the end of each exercise.
Getting Started with R and
RStudio
You now have R and RStudio installed and are keen to get started with your first
exercise. But before you do, I would like to suggest some routine steps that
will make your future work with data easier and more organized. If you are
already experienced with RStudio and have your own routine, feel free to skip
this section.
1. Creating a Project

First, let's create a directory (and then a project) for keeping all work
associated with this particular analysis. This is a very convenient way
to work, because in R/RStudio you are always working in a specific
("project") directory. Unless all the data and other files that you need are
in the same directory as your project, you will have to specify full paths
such as "C:/Users/annabrown/Documents/R Exercise book/Exercise 1/data.txt".
This can get tedious very quickly. Within a project, by contrast, you can
just refer to file names, such as "data.txt". More importantly, when all
the project-related files and objects are kept together, it is easy to get back to
your work at any time by simply opening the project (using the RStudio menu File
/ Open Project…).
To create an R project associated with this folder, in RStudio click File and
then New Project. A box will pop up asking you to select where to create the
new project. Select the folder you have just created (in my example, it is
"R Exercise 1"). You should see your project name appear in the Files
tab of the bottom right RStudio window, and the file SDQ.RData should be
visible there too.
2. Creating a Script
You can type commands directly into R Console and execute them one at a
time. This can be good for trying out some functions. However, in the long
run, you will need to save your analysis and modify or add to it, sometimes over
several sessions. So it is much better to create a Script (a text file with your
R code), from which any number of commands can be sent to Console at any
time.
To create a new script, select File / New File / R Script. It will open up in
its own window. Write all the code suggested in a particular exercise in this
script. Run any command in the script by moving your cursor to that command
and pressing Ctrl + Enter; you will see the command get passed to the Console
and executed.
For example, try sending this command, which loads the psych package:

library(psych)
INTRODUCTION TO PSYCHOMETRIC SCALING METHODS
Exercise 1: Likert Scaling of Ordinal Questionnaire Data, Creating a Sum Score
1.1 Objectives
The purpose of this exercise is to learn how to compute test scores from ordinal
test responses, and interpret them in relation to a norm. You will also learn
how to deal with missing responses when computing test scores.
You will also learn how to deal with items that indicate the opposite end of the
construct to other items. Such items are sometimes called “negatively keyed” or
“counter-indicative”. Compare, for example, item “I get very angry and often
lose my temper” with item “I usually do as I am told”. They represent positive
and negative indicators of Conduct Problems, respectively. The (small) problem
with such items is that they need to be appropriately coded before computing
the test score so that the score unambiguously reflects “problems” rather than
the lack thereof. This exercise will show you how to do that.
1.3 Worked Example 1 - Likert Scaling and Norm Referencing for Emotional Symptoms

If you downloaded the file SDQ.RData and saved it in the same folder as this
project, you should see the file name in the Files tab (usually in the bottom
right RStudio panel). Now we can load an object (data frame) contained in this
“native” R file (with extension .RData) into RStudio using the basic function
load().
load("SDQ.RData")
You should see a new object SDQ appear in the Environment tab (top right
RStudio panel). This tab shows any objects currently in the workspace; the
data frame SDQ was stored in the file SDQ.RData we just loaded. According
to the description in the Environment tab, the data frame contains “228 obs.”
(observations) “of 51 variables”; that is, 228 rows and 51 columns.
You can click on the SDQ object. You should see the View(SDQ) command run
on Console, and, in response to that command, the data set should open up in its
own tab named SDQ. Examine the data by scrolling down and across. The data
set contains 228 rows (cases, observations) on 51 variables. There is a Gender
variable (0=male; 1=female), followed by responses to 25 SDQ items named
consid, restles, somatic, etc. (these are the variable names, in the order of
items in the questionnaire). The item variables reflect the key meaning of the
actual SDQ questions, which are also attached to the data frame as labels. For
example, consid is a shortcut for “I try to be nice to other people. I care about
their feelings”, and restles is a shortcut for “I am restless, I cannot stay still for long”.
These 25 variables are followed by 25 more variables named consid2, restles2,
somatic2 etc. These are responses to the same SDQ items at Time 2.
You should also see that there are some missing responses, marked ‘NA’.
There are more missing responses for Time 2, with whole rows missing for some
pupils. This is typical for longitudinal data, because it is not always possible
to obtain responses from the same pupil one year later (for example, the pupil
might have moved schools).
You can obtain the names of all variables by typing and running command
names(SDQ).
Let us start analysis by creating a list of items that measure Emotional Symp-
toms (you can see them in a table given earlier). This will enable easy reference
to data from these 5 items (variables) in all analyses. We will use c() - the base
R function for combining values into a list.
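The command that creates this list appears to be missing from this extract. Reconstructing it from the item names that appear in the output later in the book (somatic, worries, unhappy, clingy, afraid):

```r
# names of the five SDQ Emotional Symptoms items (taken from later output)
items_emotion <- c("somatic", "worries", "unhappy", "clingy", "afraid")
```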
You can now refer to the data from these items by calling SDQ[items_emotion].
Try running this command and see how you get only the responses to the 5 items
we specified.
QUESTION 1. Now use the same logic to create a list of items measuring
Emotional Symptoms at Time 2, called items_emotion2.
Now you are ready to compute the Emotional Symptoms scale score for Time 1.
Normally, we could use base R function rowSums(), which computes the sum of
specified variables for each row (pupil):
rowSums(SDQ[items_emotion])
## [1] 4 3 1 2 4 2 4 0 1 1 0 8 2 3 7 4 5 2 8 6 1 4 9 4 5
## [26] 9 0 3 3 1 0 2 6 3 9 4 4 0 7 1 3 6 4 5 4 1 4 1 0 5
## [51] 1 2 2 4 4 4 6 1 8 3 2 2 4 1 1 0 2 2 7 5 0 NA NA 1 1
## [76] 7 4 1 8 3 5 0 5 4 0 1 1 5 3 6 1 3 2 6 6 0 2 4 5 3
## [101] 3 1 1 7 2 3 5 5 NA 0 4 0 4 1 1 1 1 0 2 7 0 3 8 4 6
## [126] NA 2 4 7 1 0 0 1 0 4 3 0 10 5 2 1 6 1 2 1 0 1 NA 4 4
## [151] 2 4 7 5 6 1 0 5 3 1 3 3 6 4 2 3 1 0 3 3 0 3 0 0 0
## [176] 2 2 2 0 1 5 3 3 1 4 3 1 6 2 4 2 NA 0 2 5 5 0 2 2 3
## [201] 4 0 2 4 2 2 1 3 2 0 1 0 0 8 1 1 2 1 2 2 4 0 0 1 2
## [226] 2 1 6
Try this and check the resulting scores printed in the Console window. Oops!
It appears that pupils with missing responses (even on one item) got ‘NA’ for
their scale score. This is because, by default (na.rm=FALSE), rowSums()
returns ‘NA’ for any row that contains missing data. Let’s change this to
skipping only the ‘NA’ responses rather than whole rows, like so:
rowSums(SDQ[items_emotion], na.rm=TRUE)
## [1] 4 3 1 2 4 2 4 0 1 1 0 8 2 3 7 4 5 2 8 6 1 4 9 4 5
## [26] 9 0 3 3 1 0 2 6 3 9 4 4 0 7 1 3 6 4 5 4 1 4 1 0 5
## [51] 1 2 2 4 4 4 6 1 8 3 2 2 4 1 1 0 2 2 7 5 0 2 7 1 1
## [76] 7 4 1 8 3 5 0 5 4 0 1 1 5 3 6 1 3 2 6 6 0 2 4 5 3
## [101] 3 1 1 7 2 3 5 5 4 0 4 0 4 1 1 1 1 0 2 7 0 3 8 4 6
## [126] 0 2 4 7 1 0 0 1 0 4 3 0 10 5 2 1 6 1 2 1 0 1 4 4 4
## [151] 2 4 7 5 6 1 0 5 3 1 3 3 6 4 2 3 1 0 3 3 0 3 0 0 0
## [176] 2 2 2 0 1 5 3 3 1 4 3 1 6 2 4 2 4 0 2 5 5 0 2 2 3
## [201] 4 0 2 4 2 2 1 3 2 0 1 0 0 8 1 1 2 1 2 2 4 0 0 1 2
## [226] 2 1 6
Now you should get scale scores for all pupils; but in this calculation the
missing responses are simply skipped, which is equivalent to scoring them as
zeros. This is not quite right. Remember that there may be different reasons
for not answering a question, and not answering is not the same as responding
“Not true”; it therefore should not be scored as 0.
Instead, we will do something more intelligent. We will use rowMeans() function
to compute the mean of those item responses that are present (still skipping the
‘NA’ values, na.rm=TRUE), and then multiply the result by 5 (the number of
items in the scale) to obtain a fair estimate of the sum score.
For example, if all non-missing responses of a person are 2, the mean is also
2, and multiplying this mean by the number of items in the scale, 5, will give
a fair estimate of the expected scale score, 5x2=10. So we essentially replace
any missing responses with the mean response for that person, thus producing
a fairer test score.
Try this and compare with the previous result from rowSums(). It should give
the same values for the vast majority of pupils, because they had no missing
data. The only differences will be for those few pupils who had missing data.
rowMeans(SDQ[items_emotion], na.rm=TRUE)*5
Now we will repeat the calculation, but this time appending the resulting score
as a new column (variable) named S_emotion to the data frame SDQ:
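The command itself appears to be missing from this extract; following the rowMeans() approach just described, it would presumably be:

```r
# prorate for missing responses: person mean times the number of items (5)
SDQ$S_emotion <- rowMeans(SDQ[items_emotion], na.rm = TRUE) * 5
```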
Let’s check whether the calculation worked as expected for those pupils with
missing data, for example case #72. Let’s pull that specific record from the data
frame, referring to the row (case) number, and then to the columns (variables)
of interest:
SDQ[72,items_emotion]
You can see that one response is missing on item afraid. If we just added up the
non-missing responses for this pupil, we would get the scale score of 2. However,
the mean of 4 non-missing scores is (0+1+0+1)/4 = 0.5, and multiplying this
by the total number of items 5 should give the scale score 2.5. Now check the
entry for this case in S_emotion:
SDQ$S_emotion[72]
## [1] 2.5
QUESTION 2. Repeat the above steps to compute the test score for Emo-
tional Symptoms at Time 2 (call it S_emotion2), and append the score to the
SDQ data frame as a new column.
hist(SDQ$S_emotion)
[Histogram of SDQ$S_emotion: Frequency (0 to 80) against scores 0 to 10; the bars pile up at the low end.]
library(psych)
describe(SDQ$S_emotion)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 228 2.89 2.33 2 2.66 2.97 0 10 10 0.72 -0.16 0.15
As you can see, the median of S_emotion (the score below which half of the
sample lies) is 2, while the mean is higher at 2.89. This is because the score is
positively skewed; in such cases, the median is more representative of the central
tendency. These statistics are consistent with our observation of the histogram,
which shows a profound floor effect.
QUESTION 4. Obtain and interpret the histogram and the descriptives for
S_emotion2 independently.
Below are the cut-offs for “Normal”, “Borderline” and “Abnormal” cases for
Emotional Symptoms provided by the test publisher (see https://sdqinfo.org/).
These are the scores that set apart likely borderline and abnormal cases from
the “normal” cases.
• Normal: 0-5
• Borderline: 6
• Abnormal: 7-10
Use the histogram you plotted earlier for S_emotion (Time 1) to visualize
roughly how many children in this community sample fall into the “Normal”,
“Borderline” and “Abnormal” bands.
Now let’s use the function table(), which tabulates cases with each score value.
table(SDQ$S_emotion)
##
## 0 1 2 2.5
## 36 43 37 1
## 3 4 5 6
## 27 32 19 13
## 6.66666666666667 7 8 8.75
## 1 8 6 1
## 9 10
## 3 1
A few non-integer scores should not worry you. They occurred because the scale
score was computed from the item means for some pupils with missing responses.
For all cases without missing responses, the resulting scale score is an integer.
We can use the table() function to establish the number of children in this
sample in the “Normal” range. From the cut-offs, we know that the Normal
range comprises S_emotion scores between 0 and 5. Simply specify this
condition (that we want scores less than or equal to 5) when calling the function:
table(SDQ$S_emotion <= 5)
##
## FALSE TRUE
## 33 195
This gives 195 children or 85.5% (195/228=0.855) classified in the Normal range.
QUESTION 5. Now try to work out the percentage of children who can be
classified “Borderline” on Emotional Symptoms (Time 1 only).
QUESTION 6. What is the percentage of children in the “Abnormal” range
on Emotional Symptoms (Time 1 only)?
Remind yourself about items designed to measure Conduct Problems. You can
see them in a table given in Worked Example 1.3. Now let us create a list, which
will enable easy reference to data from these 5 variables in all analyses.
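The command creating this list is missing from this extract. A sketch with hypothetical variable names (only obeys is confirmed in the text; substitute the actual names from the item table):

```r
# hypothetical names except "obeys" (item #2), which the text confirms
items_conduct <- c("tantrum", "obeys", "fights", "lies", "steals")
```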
Note how a new object items_conduct appeared in the Environment tab. Try
calling SDQ[items_conduct] to pull only the data from these 5 items.
QUESTION 7. Create a list of items measuring Conduct Problems at Time
2, called items_conduct2.
1.4 Worked Example 2 - Reverse Coding Counter-Indicative Items and Computing Test Scores
Before adding item scores together to obtain a scale score, we must reverse
code any items that are counter-indicative to the scale. Otherwise, positive and
negative indicators of the construct will cancel each other out in the sum score!
For Conduct Problems, we have only one counter-indicative item, obeys. To
reverse code this item, we will use a dedicated function of the psych package,
reverse.code(). This function has the general form reverse.code(keys,
items,…). The argument keys is a vector of values 1 or -1, where -1 indicates
that the item should be reverse coded. The argument items gives the names of
the variables we want to score.
Let’s look at the set of items again:
Since the only item to reverse-code is #2 in the set of 5 items, we will combine
the following values in a vector to obtain keys=c(1,-1,1,1,1). The whole
command will look like this:
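The command is missing from this extract; mirroring the Time 2 version shown in the solution to Question 8, it would be:

```r
# reverse code item #2 (obeys); keys of -1 mark items to reverse
R_conduct <- reverse.code(keys = c(1, -1, 1, 1, 1), SDQ[items_conduct])
# inspect the first rows of the recoded items
head(R_conduct)
```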
You should see that the item obeys is marked with “-“, and that it is indeed
reverse coded, if you compare it with the original below. How good is that?
# original items
head(SDQ[items_conduct])
## 3 0 2 0 0 0
## 4 0 2 0 0 0
## 5 1 0 0 2 0
## 6 0 2 0 0 0
QUESTION 8. Use the logic above to reverse code items measuring Conduct
Problems at Time 2, saving them in object R_conduct2.
Now we are ready to compute the Conduct Problems scale score for Time 1.
Because there are missing responses (particularly for Time 2), we will use
rowMeans() function to compute the mean of those item responses that are
present (skipping the ‘NA’ values, na.rm=TRUE), and then multiply the result
by 5 (the number of items in the scale) to obtain a fair estimate of the sum
score. Please refer to Worked Example 1.3 (Step 2) for a detailed explanation
of this procedure.
Importantly, we will use the reverse-coded items (R_conduct) rather than the
original items (SDQ[items_conduct]) in the calculation of the sum score. We will
append the computed scale score as a new variable (column) named S_conduct
to the data frame SDQ:
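The command is missing from this extract; by analogy with the S_emotion calculation in Worked Example 1, it would presumably be:

```r
# prorated sum score from the reverse-coded Conduct Problems items
SDQ$S_conduct <- rowMeans(R_conduct, na.rm = TRUE) * 5
```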
QUESTION 9. Compute the test score for Conduct Problems at Time 2 (call
it S_conduct2), and append the score to the SDQ data frame as a new variable
(column).
Example. Then, you will be able to practice this exercise with the remaining
SDQ scales.
Repeat the steps in the Worked Example 2 for the Hyperactivity and Peer Prob-
lems facets.
When finished with this exercise, do not forget to save your work as described
in the “Getting Started with R and RStudio” section.
1.6 Solutions
Q1.
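The solution code is missing from this extract. Following the Time 2 naming convention described earlier (item name plus “2”), it would presumably be:

```r
# Time 2 versions of the Emotional Symptoms items
items_emotion2 <- c("somatic2", "worries2", "unhappy2", "clingy2", "afraid2")
```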
Q2.
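The solution code is missing from this extract; repeating the Time 1 calculation for the Time 2 items:

```r
# prorated sum score for Emotional Symptoms at Time 2
SDQ$S_emotion2 <- rowMeans(SDQ[items_emotion2], na.rm = TRUE) * 5
```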
If you call SDQ$S_emotion2, you will see that there are many cases with missing
score, labelled NaN. This is because the scale score cannot be computed for those
pupils who had ALL responses missing at time 2.
Q3. The S_emotion score is positively skewed and shows the floor effect.
This is not surprising since the questionnaire was developed to screen clinical
populations, and most children in this community sample did not endorse any
of the symptoms (most items are too “difficult” for them to endorse).
Q4.
hist(SDQ$S_emotion2)
[Histogram of SDQ$S_emotion2: Frequency (0 to 60) against scores 0 to 9; the bars again pile up at the low end.]
describe(SDQ$S_emotion2)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 172 2.41 2.14 2 2.14 1.48 0 9 9 0.91 0.11 0.16
The histogram and the descriptive statistics for Time 2 look similar to Time 1.
There is still a floor effect, with many pupils not endorsing any symptoms.
Q5. 13 children (or 5.7%) can be classified Borderline. You can look up the
count for Borderline score 6 in the output of the basic table() function. Alter-
natively, you can call:
table(SDQ$S_emotion==6)
##
## FALSE TRUE
## 215 13
Importantly, you have to use == when describing the equality condition, because
a single = in R is reserved for assigning a value to an object.
Q6. For “Abnormal”, you can tabulate all scores greater than 6. To
calculate the proportion, divide by N=228. This gives 20/228 = 0.088, or 8.8%.
table(SDQ$S_emotion>6)
##
## FALSE TRUE
## 208 20
Q7.
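The solution code is missing from this extract. A sketch using the same hypothetical variable names as for Time 1, with “2” appended (only obeys2 is implied by the text; substitute the actual Time 2 names):

```r
# hypothetical names except "obeys2"; check the actual Time 2 variable names
items_conduct2 <- c("tantrum2", "obeys2", "fights2", "lies2", "steals2")
```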
Q8.
# reverse code
R_conduct2 <- reverse.code(keys=c(1,-1,1,1,1), SDQ[items_conduct2])
# check the reverse coded items
head(R_conduct2)
Q9.
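The solution code is missing from this extract; by analogy with Q2, using the reverse-coded Time 2 items from Q8:

```r
# prorated sum score for Conduct Problems at Time 2
SDQ$S_conduct2 <- rowMeans(R_conduct2, na.rm = TRUE) * 5
```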
If you call SDQ$S_conduct2, you will see that there are many cases with missing
score, labelled NaN. This is because the scale score cannot be computed for those
pupils who had ALL responses missing at time 2.
Exercise 2: Optimal Scaling of Ordinal Questionnaire Data
2.1 Objectives
There are many ways to “optimize” item scores; here, we will maximize the ratio
of the variance of the total score to the sum of the variances of the item scores.
In psychometrics, fulfilling this criterion results in maximizing the sum of the
item correlations (and therefore the test score’s “internal consistency” measured
by Cronbach’s alpha).
37
2.2 Worked Example - Optimal Scaling of SDQ Emotional Symptoms Items

load(file="SDQ.RData")
names(SDQ)
We will use package aspect, which makes optimal scaling easy by offering a
range of very useful options and built-in plots.
library("aspect")
# drop cases with missing responses and put complete cases into data frame called "items"
items <- na.omit(SDQ[items_emotion])
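The call that produced the output below is missing from this extract. Given the stated criterion (maximizing the sum of the item correlations), it would presumably be a corAspect() call along these lines; the exact argument values are an assumption, so check ?corAspect:

```r
# optimal scaling maximizing the sum of correlations ("aspectSum"),
# treating the item categories as ordered
opt <- corAspect(items, aspect = "aspectSum", level = "ordinal")
summary(opt)
```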
##
## Correlation matrix of the scaled data:
## somatic worries unhappy clingy afraid
## somatic 1.0000000 0.3480251 0.3651134 0.2258002 0.3113325
## worries 0.3480251 1.0000000 0.4612166 0.4020225 0.3901405
## unhappy 0.3651134 0.4612166 1.0000000 0.3598932 0.4603964
## clingy 0.2258002 0.4020225 0.3598932 1.0000000 0.3865003
## afraid 0.3113325 0.3901405 0.4603964 0.3865003 1.0000000
##
##
## Eigenvalues of the correlation matrix:
## [1] 2.4961448 0.7844417 0.6270432 0.5887536 0.5036166
##
## Category scores:
## somatic:
## score
## 0 -0.8864022
## 1 0.5836441
## 2 2.0454937
##
## worries:
## score
## 0 -0.8348282
## 1 0.4234660
## 2 2.1441015
##
## unhappy:
## score
## 0 -0.5895239
## 1 1.3910873
## 2 2.7286027
##
## clingy:
## score
## 0 -1.1851447
## 1 0.2510005
## 2 1.6576948
##
## afraid:
## score
## 0 -0.7821442
## 1 1.0234825
## 2 1.8943502
The output displays the “Correlation matrix of the scaled data”, which are cor-
relations of the item scores after optimal scaling. These can be compared to
correlations between the original variables calculated using cor(items). Fur-
ther, “Eigenvalues of the correlation matrix” are displayed. Eigenvalues are the
variances of principal components (from Principal Components Analysis), and
are very helpful in indicating the number of dimensions measured by this set of
items. The result here, with the first eigenvalue substantially larger than the
remaining eigenvalues, indicates just one dimension, as we hoped.
Finally, “Category scores” show the scores that the optimal scaling procedure
assigned to the item categories. For example, the result suggests scoring item
somatic by assigning the score -0.8864022 to the response “not true”, the score
0.5836441 to “somewhat true”, and the score 2.0454937 to “certainly true”. The
values are chosen so that each scaled item’s mean in the sample is 0, and the
correlations between the items are maximized.
[Transformation plots of the category scores (opt$catscores) for the five items: the assigned scores plotted against category index.]
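The plotting commands for these panels are missing from this extract; a sketch that would reproduce transformation plots of this kind (the panel layout is an assumption):

```r
# one transformation plot per item: category scores against category index
par(mfrow = c(3, 2))
for (item in names(opt$catscores)) {
  plot(opt$catscores[[item]], type = "b",
       xlab = "Index", ylab = paste0("opt$catscores$", item))
}
par(mfrow = c(1, 1))
```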
Looking at the transformation plots, it can be seen that 1) the scores for subse-
quent categories increase almost linearly; 2) the categories are roughly equidis-
tant. We conclude that for scoring ordinal items in the SDQ Emotional Symp-
toms scale, Likert scaling is appropriate, and not much can be gained by optimal
scaling over basic Likert scaling.
Exercise 3: Optimal Scaling of Nominal Questionnaire Data

3.1 Objectives
In this exercise, we will attempt to find “optimal” scores for nominal responses.
Unlike the ordinal responses considered in Exercise 2, nominal categories do not
assume any particular order. Consequently, the optimal scores assigned to them
are not expected to increase or decrease monotonically: there are simply no
restrictions on their sign or order.
As before, in assigning “optimal” scores, we will maximize the sum of the item
correlations (and therefore the test score’s “internal consistency” measured by
Cronbach’s alpha).
3. Children today are not as fortunate as when I was a child (agree / cannot tell / disagree)
4. Religion should be taught in school (agree / indifferent / disagree)
By looking at the items, we may tentatively propose that they measure some-
thing in common, perhaps nostalgic feelings about the past (?), if agreeing to
items 2, 3 and 4, and older age were keyed positively. To put this initial intuition
to the test, we will conduct optimal scaling.
Examine the item names (item1, item2, item3, item4) and responses. You
can see that the responses are not coded as numbers, they are actually strings
corresponding to the response options, for example “cannot tell”. We leave them
like that, as in this analysis, we will not make use of any ordering of the response
options, considering them purely nominal categories.
library("aspect")
3.2 Worked Example - Optimal Scaling of Nishisato Attitude Items
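The call producing the output below is missing from this extract. A sketch, assuming the data frame holding the four items is called attitudes (a hypothetical name) and treating all categories as purely nominal; check ?corAspect for the exact arguments:

```r
# optimal scaling with nominal (unordered) categories,
# maximizing the sum of item correlations
opt2 <- corAspect(attitudes, aspect = "aspectSum", level = "nominal")
summary(opt2)
```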
##
## Correlation matrix of the scaled data:
## item1 item2 item3 item4
## item1 1.0000000 0.7960663 0.3873563 0.7034011
## item2 0.7960663 1.0000000 0.3333116 0.4785331
## item3 0.3873563 0.3333116 1.0000000 0.3999385
## item4 0.7034011 0.4785331 0.3999385 1.0000000
##
##
## Eigenvalues of the correlation matrix:
## [1] 2.5896793 0.7535762 0.5116607 0.1450839
##
## Category scores:
## item1:
## score
## 20-29 1.4526531
## 30-39 -0.3425303
## 40+ -1.0122570
##
## item2:
## score
## agree -0.6607173
## cannot tell 1.4369605
## disagree 1.6078784
##
## item3:
## score
## agree -0.5216552
## cannot tell 1.8973623
## disagree -0.5281230
##
## item4:
## score
## agree -0.4235185
## disagree -0.7718009
## indifferent 1.6643455
The output displays the “Correlation matrix of the scaled data”, which are
correlations of the item scores after optimal scaling. These are all positive and
surprisingly high, ranging between 0.33 and 0.79. Further, “Eigenvalues of the
correlation matrix” (from Principal Components Analysis) are displayed. The
first eigenvalue here (2.59) is substantially larger than the remaining eigenvalues,
indicating just one dimension.
Finally, “Category scores” show the scores that the optimal scaling procedure
assigned to the response categories. For example, for those between 20 and 29
years of age, the score suggested is 1.45; those between 30 and 39 will get -.34
and those who are 40 or older will get -1.01. These and other values are chosen
so that the scaled item’s mean in the sample is 0, and the correlations between
the items are maximized.
[Transformation plots of the category scores (opt2$catscores) for the items: the assigned scores plotted against category index.]
The transformation plot for item1 (age) makes it obvious that although the
relationship is monotonic, it is not perfectly linear.
QUESTION 1. Interpret category score assignments and transformation plots
for the other items. Items 3 and 4 are the most interesting because they have
non-monotonic relationships, and thus depart completely from our initial intu-
ition about potential Likert scaling of items.
QUESTION 2. What kind of person would get the highest score on the total
attitude scale (how old would they be, how would they respond to the other
items?). What kind of person would get the lowest score?
QUESTION 3. Now, given the established scaling, what do you think the resulting scale measures?
This completes the exercise.
3.3 Solutions
Q1. Output for item2 (Children today are not as disciplined as when I was
a child: Agree / Cannot tell / Disagree) suggests that for those agreeing, the
score will be -.66; those who ‘cannot tell’ will get 1.44 and those who disagree
will get only slightly more, 1.61.
Output for item3 (Children today are not as fortunate as when I was a child:
Agree / Cannot tell / Disagree) shows that those who ‘cannot tell’ will get the
score 1.90, and those who agree or disagree will get very similar scores, -.52 or
-.53, respectively.
Output for item4 (Religion should be taught in school: Agree / Indifferent
/ Disagree) shows that those who are ‘indifferent’ will get the score 1.66, and
those who agree or disagree will get negative scores, -.42 or -.77 respectively.
Q2. The highest score on the scale will be obtained by those aged 20-29, dis-
agreeing with the idea that children today are not as disciplined as when they
were a child, and not providing any definitive opinion on the other two state-
ments (“Children today are not as fortunate as when I was a child”, and “Reli-
gion should be taught in school”). The lowest score on the scale will be obtained
by those aged 40+, feeling that children today are not as disciplined, and dis-
agreeing with the other two statements (“Children today are not as fortunate
as when I was a child”, and “Religion should be taught in school”). [In fact,
there is very little difference in score whether one agrees or disagrees with the
last two statements].
Q3. Are the score assignments consistent with the conjecture that what is measured is a form of conservatism marked by aging and nostalgia/dogmatism? To some extent, yes, because on one end of the scale we have young people who do not have any concerns about lowering discipline standards for children, feel it is impossible to tell whether children today are more or less fortunate, and are indifferent to whether religion is taught in school or not (more liberal/open-minded). On the other end of the scale we have older people who feel that discipline standards for children have deteriorated, and have opinions on whether children today are more or less fortunate, and whether religion should be taught in school or not (more nostalgic about the past and dogmatic).
Exercise 4
Thurstonian scaling
4.1 Objectives
The purpose of this exercise is to learn how to perform Thurstonian scaling of
preferential choice data. The objective of Thurstonian scaling is to estimate
means of psychological values (or utilities) of the objects under comparison.
That is, given the observed rank orders of the objects, we want to know what
utility distributions could give rise to such orderings. To estimate unobserved
utilities from observed preferences, we need to make assumptions about the dis-
tributions of utilities. In this analysis, normal distributions with equal variance
are assumed (so-called Thurstone Case V scaling).
6. Development (Development)
7. Social Interaction (Interaction)
8. Competitive Environment (Competition)
9. Pleasant environment and Safety (Safety)
A rank of 1 given to any job feature means that this job feature was the most important for this participant; a rank of 9 means that it was the least important.
To read the data file JobFeatures.txt into RStudio, I will use read.delim(),
a basic R function for reading text files (in this case, a tab-delimited file). I
will read the data into a new object (data frame), which I will call (arbitrarily)
JobFeatures.
You should see a new object JobFeatures appear on the Environment tab (usually the top right RStudio panel). This tab shows any objects currently in the workspace. According to the description in the Environment tab, the data set contains 1079 obs. of 9 variables.
You can click on the JobFeatures object. This will make the View(JobFeatures) command run in the Console, which in turn will open the data frame in its own tab named JobFeatures. Examine the data by scrolling down and across.
The data set contains 1079 rows (cases, observations) on 9 variables: Support,
Challenge, Career, etc. You can obtain the names of all variables in the data
frame JobFeatures by using the basic function names():
names(JobFeatures)
Each row represents someone’s ranking of the 9 job features. Function head()
will display first few participants with their rankings (and tail() will display
last few):
head(JobFeatures)
Before doing the analysis, load package psych to enable its features.
library(psych)
help("thurstone")
You will get the full documentation displayed in the Help tab, including the function's general form and the defaults of its arguments.
Since our data are raw rankings, we need to set ranks=TRUE. Type and run
the following code from your Script. It may take a little while, but you will
eventually get the results on your Console!
thurstone(JobFeatures, ranks=TRUE)
Let’s interpret this output. First, the title of the function and the full command
are printed. Next, the estimated population means of the psychological values
(or utilities) for 9 job features are printed (starting with [1], which simply indi-
cates the row number of output data). To make sense of the reported means,
you have to match them with the nine features. This is easy, as the means are
reported in the order of variables in the data frame.
Every mean reflects how valuable/desirable on average this particular job feature
is. A high mean indicates that participants value this job feature highly in
relation to other features. Because preferences are always relative, however, it
is impossible to uniquely identify all the means (for explanation, you may see
McDonald (1999), chapter 18, “Pair Comparison Model”). Therefore, one of
the means has to be fixed to some arbitrary value. It is customary to fix the
mean of the least preferred feature to 0. Then, all the remaining means are
positive.
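The identification constraint can be sketched numerically: starting from any set of means (hypothetical values below), subtracting the smallest fixes the least preferred object at 0 while preserving all the differences between means, which are the only identified quantities.

```r
# Hypothetical unidentified utility means for four objects
means <- c(A = 1.11, B = 0.74, C = 1.18, D = 0.14)

# Fix the smallest mean to 0; differences between means are unchanged
shifted <- means - min(means)
shifted
```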
QUESTION 1 Try to interpret the reported means. Which job feature was
least wanted? What was the utility mean of this feature? What was the most
wanted feature, and what was its utility mean? Looking at the mean values,
how would you interpret the relative values of Autonomy and Interaction?
What else can you say about the relative utility values of the job features?
Now, let's run the same function but assign its results to a new object called (arbitrarily) scaling. Type and execute the following command:

scaling <- thurstone(JobFeatures, ranks=TRUE)

This time, there is no output, and it looks like nothing happened. However, the same analysis was performed, but now its results are stored in scaling rather than printed out. To see what is stored in scaling, call function ls(), which will list all the object's constituents:
ls(scaling)
You can check what is stored inside any of these constituent parts by referring to
them by full name - starting with scaling followed by the $ sign, for example:
scaling$scale
## [1] 0.97 0.93 0.91 0.92 0.60 1.04 0.63 0.00 0.23
This will print out the 9 utility means, which we already examined.
scaling$choice
This will print a 9x9 matrix containing proportions of participants in the sample
who preferred the feature in the column to the feature in the row. This is a
summary of the rank preferences that we did not have in the beginning, but R
conveniently calculated it for us. In the “choice” matrix, the rows and columns
are in the order of variables in the original file.
Let's examine the "choice" matrix more carefully. Look for the entry in row [8] and column [6]. This value, 0.8526413, represents the proportion of participants who preferred the feature in column [6] (Development) to the feature in row [8] (Competition). We can confirm that this is the largest entry in the whole matrix:
max(scaling$choice)
## [1] 0.8526413
This pair of features has the most decisive preference for one feature over the
other.
QUESTION 2. How does the above result for choices of the 6th feature, Development, over the 8th feature, Competition, correspond to the estimated utility means?
scaling$residual
This will print a 9x9 matrix containing differences between the observed proportions (the choice matrix) and the expected proportions (the proportions preferring the feature in the column to the feature in the row that would be expected based on the standard normal distributions of utilities around the means scaled as above). Residuals are the direct way of measuring whether a model (in this case, the model of unobserved utilities that Thurstone proposed) "fits" the observed data. Small residuals (near zero) indicate small discrepancies between observed choices and choices predicted by the model, which means that the model we adopted is rather good.
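As a sketch of how an expected proportion relates to the scaled means, assuming the utility scale is set so that a difference in means maps to a choice probability through the standard normal CDF (an assumption for illustration; psych's exact scaling may differ slightly):

```r
# Expected proportion preferring Development (scaled mean 1.04)
# over Competition (scaled mean 0.00) under the assumed model
expected <- pnorm(1.04 - 0.00)

# Residual: observed proportion (0.8526, from the choice matrix) minus expected
residual <- 0.8526 - expected
```

With these means, the expected proportion comes out close to the observed 0.8526, leaving only a small residual.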
Finally, let’s ask for a Goodness of Fit index:
scaling$GF
## [1] 0.9987548
If the residuals are close to zero, then the ratios of their squares to the squared observed proportions should also be close to zero. Therefore, the goodness-of-fit index of a well-fitting model should be close to 1.
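An index of this general form can be sketched as follows (the formula here is an assumption for illustration, built on toy values; psych's GF may be computed slightly differently):

```r
# Toy observed vs model-expected choice proportions (hypothetical values)
observed <- c(0.85, 0.60, 0.55, 0.72)
expected <- c(0.84, 0.61, 0.54, 0.73)
residual <- observed - expected

# Small squared residuals relative to squared observed proportions give GF near 1
GF <- 1 - sum(residual^2) / sum(observed^2)
GF
```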
The residuals in our analysis are all very small, which indicates a close correspondence between the observed choices (proportions of preferences for one feature over the other) and the choices predicted by the model. The small residuals are reflected in the GF index, which is very close to 1. Overall, Thurstone's model fits the job-features data well.
4.4 Solutions
Q1. The smallest mean (0.00) corresponds to the 8th feature, Competition, and the highest mean (1.04) corresponds to the 6th feature, Development. This means that a Competitive environment was least wanted, and opportunities for personal Development were most wanted by people in their ideal job. Other features were scaled somewhere between these two: Safety has a low mean (0.23), barely higher than the 0 for Competition, whereas Support, Challenge, Career and Ethics have similarly high means (around 0.9). Autonomy and Interaction have similar moderate means, around 0.6.
Q2. The most decisive preference in terms of proportions of people choosing one feature over the other must have the largest distance/difference between the utilities (the 6th feature, Development, must have a much higher mean utility than the 8th feature, Competition). This result is indeed in line with the results for the utility means, where the Development mean was the highest at 1.04 and the Competition mean was the lowest at 0.
Part II
CLASSICAL TEST
THEORY AND
RELIABILITY THEORY
Exercise 5
Reliability analysis of
polytomous questionnaire
data
5.1 Objectives
The purpose of this exercise is to learn how to estimate the test score reliability
by different methods: test-retest, split-half and "internal consistency" (Cronbach's alpha). You will also learn how to judge whether test items contribute
to measurement.
When test scores on two occasions are available, we can estimate test reliability
by computing the correlation between them (correlate the scale score at Time
1 with the scale score at Time 2) using the base R function cor.test().
Remind yourself of the variables in the SDQ data frame (which should be available when you open the project from Exercise 1) by calling names(SDQ). Among them are the scale scores S_conduct (Conduct Problems at Time 1) and S_conduct2 (Conduct Problems at Time 2), which we can now correlate:
cor.test(SDQ$S_conduct, SDQ$S_conduct2)
##
## Pearson's product-moment correlation
##
## data: SDQ$S_conduct and SDQ$S_conduct2
## t = 7.895, df = 170, p-value = 3.421e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3992761 0.6195783
## sample estimates:
## cor
## 0.5179645
QUESTION 1. What is the test-retest reliability for the SDQ Conduct Prob-
lem scale? Try to interpret this result.
library(psych)
Now all that is left to do is to run function alpha() from package psych on
the correctly coded set R_conduct. There are various arguments you can
control in this function, and most defaults are fine, but I suggest you set
cumulative=TRUE (the default is FALSE). This will ensure that statistics in the
output are given for the sum score (“cumulative” of item scores) rather than for
the average score (default). We computed the sum score for Conduct Problems,
so we want the output to match the score we computed.
alpha(R_conduct, cumulative=TRUE)
##
## Reliability analysis
## Call: alpha(x = R_conduct, cumulative = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.72 0.73 0.7 0.35 2.7 0.028 2.1 2.1 0.33
##
## lower alpha upper 95% confidence boundaries
## 0.66 0.72 0.77
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## tantrum 0.62 0.65 0.59 0.31 1.8 0.041 0.0100 0.28
## obeys- 0.65 0.66 0.61 0.33 2.0 0.035 0.0078 0.33
## fights 0.67 0.66 0.61 0.33 2.0 0.034 0.0094 0.30
## lies 0.70 0.71 0.66 0.38 2.5 0.031 0.0086 0.38
## steals 0.71 0.72 0.68 0.39 2.6 0.031 0.0096 0.43
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## tantrum 226 0.79 0.76 0.68 0.59 0.57 0.72
## obeys- 228 0.71 0.72 0.64 0.52 0.58 0.60
## fights 228 0.66 0.73 0.64 0.52 0.19 0.44
## lies 226 0.70 0.64 0.49 0.43 0.54 0.72
## steals 227 0.57 0.62 0.45 0.38 0.19 0.49
##
## Non missing response frequency for each item
## 0 1 2 miss
## tantrum 0.56 0.31 0.13 0.01
## obeys- 0.48 0.46 0.06 0.00
## fights 0.82 0.16 0.02 0.00
## lies 0.59 0.27 0.14 0.01
## steals 0.86 0.10 0.04 0.00
QUESTION 2. What is the estimated (raw) alpha for the Conduct Problems scale? Try to interpret the size of alpha bearing in mind the definition of reliability as "the proportion of variance in the observed score due to true score".
Now examine the output in more detail. There are other useful statistics printed
in the same line with raw_alpha. Note the average_r, which is the average
correlation between the 5 items of this facet. The std.alpha is computed from
these.
Other useful stats are mean (which is the mean of the sum score) and sd (which
is the Standard Deviation of the sum score). It is very convenient that these
are calculated by function alpha() because you can get these even without
computing the Conduct Problem scale score! If you wish, check them against
the actual sum score stats using describe(SDQ$S_conduct). Note that this is
why I suggested you set cumulative=TRUE, so you can get stats for the sum
score, not the average score.
Now examine the output “Reliability if an item is dropped”. The first column will
give you the expected “raw_alpha” for a 4-item scale without this particular
item in it (if this item was dropped). This is useful for seeing whether the
item makes a good contribution to measurement provided by the scale. If this
expected alpha is lower than the actual reported alpha, the item improves the
test score reliability. If it is higher than the actual alpha, the item actually
reduces the score reliability. You may wonder how this is possible, since items are supposed to add to the reliability. Essentially, such an item contributes more noise (to the error variance) than signal (to the true score variance).

QUESTION 3. Examine the "Reliability if an item is dropped" output. Does every item contribute positively to measurement? Which item contributes the most?
Now examine the “Item statistics” output. Two statistics I want to draw your
attention to are:
raw.r - The correlation of each item with the total score. This value is always
inflated because the item is correlated with the scale in which it is already
included!
r.drop - The correlation of this item with the scale WITHOUT this item (with
the scale compiled from the remaining items). This is a more realistic indicator
than raw.r of how closely each item is associated with the scale.
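The difference between raw.r and r.drop can be seen on a toy data set (hypothetical item scores below, not the SDQ data): including the item in its own total inflates the correlation.

```r
# Three hypothetical items scored 0-2
items <- data.frame(
  a = c(2, 1, 0, 2, 1, 0, 2, 0),
  b = c(2, 1, 0, 1, 1, 0, 2, 1),
  c = c(1, 2, 0, 2, 0, 0, 2, 0)
)

# raw.r: item correlated with a total that includes the item itself (inflated)
raw_r <- cor(items$a, rowSums(items))

# r.drop: item correlated with the total of the remaining items only
r_drop <- cor(items$a, rowSums(items[c("b", "c")]))
```

Here raw_r exceeds r_drop, illustrating why r.drop is the more realistic indicator.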
Both raw.r and r.drop should be POSITIVE. If for any item these values
are negative, you must check whether the item was coded appropriately; for
example, if all counter-indicative items were reverse coded. To help you, the
output marks all reverse coded items with “-” sign.
QUESTION 4. Which item has the highest correlation with the remaining
items (“r.drop” value)? Look up this item’s text in the SDQ data frame.
Finally, we will request the split-half reliability coefficients for this scale. I said
“coefficients” rather than “coefficient” because there are lots of ways in which
the test can be split into two halves, each giving a slightly different estimate of
reliability.
You will use function splitHalf() from package psych on the appropriately
coded item set R_conduct. I suggest you set use="complete" to make sure
that only the complete cases (without missing data) are used, to avoid differ-
ent samples to be used in different splittings of the test. We should also set
covar=TRUE, to base all estimates on item raw scores and covariances rather
than correlations (the default in splitHalf()). This is to make our estimates
comparable with the "raw_alpha" we obtained from alpha().

splitHalf(R_conduct, use="complete", covar=TRUE)

You can also refer to the "raw_alpha" result we obtained with function
alpha(), and you will see that it is the same as we obtained with the function
splitHalf().
QUESTION 5. Repeat the steps in the Worked Example for the Hyperactivity
facet. Compute the test-retest reliability, alpha and split-half reliabilities (for
Time 1 only).
It is important that you save all new objects created, because you will need
some of these again in Exercise 7. When closing the project, make sure you save
the entire work space, and your script.
save(SDQ, file="SDQ_saved.RData")
If you want to practice further, you can pick any of the remaining SDQ facets.
Use the table below to enter your results (2 decimal points is fine).
NOTE. Don’t be surprised if the average split-half coefficient does not always
equal alpha.
Based on the analyses of scales that you have completed, try to answer the following questions:

QUESTION 6. Compare the reliability estimates obtained by the different methods. Which method tends to give the lowest estimates, and why might that be?

QUESTION 7. Which method do you think is best for estimating the precision of measurement in this study?
5.5 Solutions
Q1. The correlation 0.518 is positive, “large” in terms of effect size, but not
very impressive as an estimate of reliability because it suggests that only about
52% of variance in the test score is due to true score, and the rest is due to
error. It suggests that Conduct Problems is either not a very stable construct,
or it is not measured accurately by the SDQ. It is impossible to say which is
the case from this one correlation.
Q2. The estimated (raw) alpha is 0.72, which suggests that approximately
72% of variance in the Conduct problem score is due to true score. This is an
acceptable level of reliability for a short screening questionnaire.
Q3. The "Reliability if an item is dropped" output shows that every item contributes positively to measurement, because the reliability would go down from the current 0.72 if we dropped any of them. For example, if we dropped the item tantrum, alpha would be 0.62, which is much worse than the current alpha. Dropping tantrum would reduce alpha the most, so this item makes the greatest contribution to the reliability of this scale.
Q4. Item tantrum has the highest correlation with the scale (“r.drop”=0.59).
This item reads: “I get very angry and often lose my temper”. It is typical
that items with highest item-total correlation are also those contributing most
to alpha.
Q5.
                     Test-retest   Alpha   Split-half (ave)
Emotional Symptoms      0.49        0.74        0.70
Conduct Problems        0.52        0.72        0.72
Hyperactivity           0.65        0.76        0.75
Peer Problems           0.51        0.53        0.51
Pro-social              0.53        0.65        0.64
Q6. The test-retest method provides the lowest estimates, which is not surpris-
ing considering that the interval between the two testing sessions was one year.
Particularly low is the correlation between Emotional Problems at Time 1 and
Time 2, while its internal consistency is higher, which indicates that Emotional
Problems are more transient at this age than, for example, Hyperactivity.
Q7. The substantial differences between the test-retest and alpha estimates
for all but one scale (Peer Problems) suggest that the test-retest method likely
under-estimates the reliability due to instability of the constructs measured after
such a long interval (one year). So, alpha and split-half coefficients are more
appropriate as estimates of reliability here. Alpha is to be preferred to split-half
coefficients since the latter vary widely depending on the way the test is split.
For Peer Problems, both test-retest and alpha give similar results, so the low
test-retest cannot be interpreted as necessarily low stability - it may be that
the construct is relatively stable but not measured very accurately at each time
point.
Exercise 6
Item analysis and reliability analysis
of dichotomous questionnaire data
6.1 Objectives
The objective of this exercise is to conduct reliability analyses on a questionnaire
compiled from dichotomous items, and learn how to compute the Standard Error
of measurement and confidence intervals around the observed score.
The questionnaire uses binary (YES/NO) items, ensuring a good content coverage of all domains involved. The use of binary items avoids problems with response biases such as "central tendency" (the tendency to choose middle response options) and "extreme responding" (the tendency to choose extreme response options).
The data were collected in the USA back in 1997 as part of a large cross-cultural
study (Barrett, Petrides, Eysenck & Eysenck, 1998). This was a large study, N
= 1,381 (63.2% women and 36.8% men). Most participants were young people,
median age 20.5; however, there were adults of all ages present (range 16 – 89
years).
The focus of our analysis here will be the Neuroticism/Anxiety (N) scale, measured by 23 items.
Please note that all items indicate "Neuroticism" rather than "Emotional Stability" (i.e. there are no counter-indicative items).
Because you will work with these data again, I recommend you create a new
project, which you will be able to revisit later. Create a new folder with a name
that you can recognize later (e.g. “EPQ Neuroticism”), and download the data
file EPQ_N_demo.txt into this folder. In RStudio, create a new project in
the folder you have just created. Create a new script, where you will be typing
all your commands, and which you will save at the end.
The data for this exercise is not in the “native” R format, but in a tab-delimited
text file with headings (you can open it in Notepad to see how it looks). You
will use function read.delim() to import it into R. The function has the follow-
ing general format: read.delim(file, header = TRUE, sep = "\t", ...).
You must supply the first argument - the file name. Other arguments all have
defaults, so you change them only if necessary. Our file has headers and is
separated by tabulation ("\t") so the defaults are just fine.
Let's import the data into a new object (data frame) called EPQ:

EPQ <- read.delim("EPQ_N_demo.txt")
The object EPQ should appear on the Environment tab. Click on it and the
data will be displayed on a separate tab. As you can see, there are 26 variables
in this data frame, beginning with participant id, age and sex (0 = female; 1
= male). These demographic variables are followed by 23 item responses, which
are either 0 (for “NO”) or 1 (for “YES”). There are also a few missing responses,
marked with NA.
Analyses in this Exercise will require package psych, so type and run this
command from your script:
library(psych)
Let’s start the analyses. First, let’s call function describe() of package psych,
which will print descriptive statistics for item responses in the EPQ data frame
useful for psychometrics. Note that the 23 item responses start in the 4th
column of the data frame.
describe(EPQ[4:26])
## vars n mean sd median trimmed mad min max range skew kurtosis se
## N_3 1 1379 0.64 0.48 1 0.67 0 0 1 1 -0.57 -1.67 0.01
QUESTION 1. Examine the output. Look for the descriptive statistic repre-
senting item difficulty. Do the difficulties vary widely? What is the easiest (to
agree with) item in this set? What is the most difficult (to agree with) item?
Examine phrasing of the corresponding items – can you see why one item is
easier to agree with than the other?
Now let’s compute the product-moment (Pearson) correlations between EPQ
items using function lowerCor() of package psych. This function is very con-
venient because, unlike the base R function cor(), it shows only the lower
triangle of the correlation matrix, which is more compact and easier to read.
If you call help on this function you will see that by default the correlations
will be printed to 2 decimal places (digits=2), and the missing values will be
treated in the pairwise fashion (use="pairwise"). This is good, and we will
not change any defaults.
lowerCor(EPQ[4:26])
Now let’s compute the tetrachoric correlations for the same items. These would
be more appropriate for binary items on which a NO/YES dichotomy was forced
(although the underlying extent of agreement is actually continuous). The func-
tion tetrachoric() has the pairwise deletion for missing values by default -
again, this is good.
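To see why the two coefficients differ, here is a sketch using the classical cosine-pi approximation to the tetrachoric correlation for a 2x2 table of hypothetical NO/NO, NO/YES, YES/NO and YES/YES counts (an approximation only; tetrachoric() uses a proper maximum-likelihood estimate):

```r
# Hypothetical 2x2 agreement counts for two binary items
a <- 40; b <- 10; c <- 10; d <- 40

# Phi: the product-moment (Pearson) correlation of the two 0/1 items
phi <- (a*d - b*c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Cosine-pi approximation to the tetrachoric correlation
tet <- cos(pi / (1 + sqrt((a*d) / (b*c))))
```

For this table the tetrachoric value is noticeably larger than phi, anticipating the pattern you should see in the EPQ output.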
tetrachoric(EPQ[4:26])

QUESTION 2. Compare the tetrachoric correlations with the product-moment correlations you obtained earlier. Which are larger?
We can compute the Neuroticism scale score (as a sum of its item scores). Note
that there are no counter-indicative items in the Neuroticism scale, so there is
no need to reverse code any.
From Exercise 1 you should already know that in the presence of missing data, adding the item scores is not advisable, because any missing response will essentially be treated as zero, which is not right: not providing an answer to a question is not the same as saying "NO".
Instead, we will again use the base R function rowMeans() to compute the average score from non-missing item responses (removing "NA" values from the calculation, na.rm=TRUE), and then multiply the result by 23 (the number of items in the Neuroticism scale). This will essentially replace any missing responses with the mean for that individual, thus producing a fair estimate of the total score.

N_score <- rowMeans(EPQ[4:26], na.rm=TRUE)*23
Check out the new object N_score in the Environment tab. You will see that
scores for those with missing data (for example, participant #1, who had re-
sponse to N_88 missing) may have decimal points and scores for those without
missing data (for example, participant #2) are integers.
Now we can compute descriptive statistics for the Neuroticism score.
describe(N_score)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1381 12.15 5.53 12 12.17 5.93 0 23 23 -0.03 -0.89 0.15
And we plot the histogram, which will appear in the “Plots” tab:
hist(N_score)
[Histogram of N_score: Frequency on the y-axis (0 to 150), N_score on the x-axis (0 to 20+).]
QUESTION 3. Examine the histogram. What can you say about the test
difficulty for the tested population?
alpha(EPQ[4:26], cumulative=TRUE)
##
## Reliability analysis
## Call: alpha(x = EPQ[4:26], cumulative = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.87 0.87 0.88 0.22 6.7 0.0049 12 5.5 0.22
##
## lower alpha upper 95% confidence boundaries
## 0.86 0.87 0.88
##
## Reliability if an item is dropped:
QUESTION 4. What is alpha for the Neuroticism scale score? Report the raw_alpha printed at the beginning of the output (this is the version of alpha calculated from raw item scores and covariances rather than standardized item scores and correlations).
Examine the alpha() output further. You can call ?alpha to get help on this
function and its output. An important summary statistic is average_r, which
stands for the average inter-item correlation. Other useful summary statistics
are mean and sd, which are respectively the mean and the standard deviation
of the test score. Yes, you can get these statistics without computing the test
score but by just running function alpha(). Isn’t this convenient?
If you completed Step 3 and computed N_score previously, obtaining its mean and SD using function describe(N_score), you can now compare these results. The mean and SD given in alpha() should be the same (maybe rounded to fewer decimal points) as the results from describe().

QUESTION 5. Examine the "Item statistics" part of the output. Which item has the highest correlation with the remaining items (the "r.drop" value)?
To reuse the tetrachoric correlations, let's first save the output of tetrachoric() in a new object called (arbitrarily) EPQtetr:

EPQtetr <- tetrachoric(EPQ[4:26])

Click on the little blue arrow next to EPQtetr in the Environment tab. The object's structure will be revealed, and you should see that EPQtetr contains rho, which is a 23x23 matrix of the tetrachoric correlations, tau, which is the vector of 23 thresholds, etc. To retrieve the tetrachoric correlations from the object EPQtetr, we can simply refer to them as EPQtetr$rho.
Now you can pass the tetrachoric correlations to function alpha(). Note that
because you pass to the function the correlations rather than the raw scores, no
statistics for the “test score” can be computed and you cannot specify the type
of score using cumulative option.
alpha(EPQtetr$rho)
##
## Reliability analysis
## Call: alpha(x = EPQtetr$rho)
##
## raw_alpha std.alpha G6(smc) average_r S/N median_r
## 0.93 0.93 0.95 0.37 13 0.36
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N var.r med.r
## N_3 0.93 0.93 0.94 0.36 12 0.015 0.35
## N_7 0.93 0.93 0.94 0.36 13 0.015 0.35
## N_12 0.93 0.93 0.94 0.37 13 0.014 0.36
## N_15 0.93 0.93 0.94 0.37 13 0.014 0.37
## N_19 0.93 0.93 0.94 0.36 12 0.013 0.35
## N_23 0.93 0.93 0.94 0.36 13 0.015 0.35
## N_27 0.93 0.93 0.94 0.36 13 0.015 0.35
## N_31 0.93 0.93 0.94 0.37 13 0.013 0.36
## N_34 0.92 0.92 0.94 0.36 12 0.013 0.35
## N_38 0.93 0.93 0.94 0.37 13 0.015 0.35
## N_41 0.93 0.93 0.94 0.37 13 0.014 0.36
## N_47 0.93 0.93 0.95 0.38 13 0.013 0.38
## N_54 0.93 0.93 0.95 0.38 13 0.014 0.38
## N_58 0.93 0.93 0.94 0.37 13 0.015 0.37
## N_62 0.93 0.93 0.94 0.37 13 0.014 0.37
## N_66 0.93 0.93 0.95 0.37 13 0.015 0.37
## N_68 0.93 0.93 0.95 0.37 13 0.014 0.37
## N_72 0.93 0.93 0.94 0.36 13 0.014 0.35
## N_75 0.93 0.93 0.94 0.36 13 0.014 0.35
## N_77 0.93 0.93 0.94 0.36 13 0.014 0.35
## N_80 0.93 0.93 0.94 0.37 13 0.014 0.36
## N_84 0.93 0.93 0.95 0.38 13 0.014 0.37
## N_88 0.93 0.93 0.95 0.38 13 0.014 0.38
##
## Item statistics
## r r.cor r.drop
## N_3 0.72 0.71 0.69
## N_7 0.68 0.67 0.64
## N_12 0.66 0.65 0.62
## N_15 0.56 0.54 0.51
## N_19 0.74 0.74 0.70
## N_23 0.70 0.69 0.66
## N_27 0.69 0.67 0.65
## N_31 0.67 0.67 0.63
QUESTION 6. What is alpha computed from the tetrachoric correlations? How does it compare to the alpha computed earlier from the product-moment covariances?
Next, let's compute the Standard Error of measurement (SEm), which can be obtained from the score's Standard Deviation and its reliability (here, alpha):

SEm(y) = SD(y)*sqrt(1-alpha)
You will need the standard deviation of the total score, which you already
computed (scroll up your output to the descriptives for N_score). Now you
can substitute the SD value and also the alpha value into the formula, and do
the calculation in R by typing them in your script and then running them in
the Console.
5.53*sqrt(1-0.93)
## [1] 1.4631
Even better would be to save the result in a new object SEm, so we can use it later for computing confidence intervals:

SEm <- 5.53*sqrt(1-0.93)
Finally, let’s compute the 95% confidence interval around a given score Y using
formulas Y-2*SEm and Y+2*SEm. Say, we want to compute the 95% confidence
interval around N_score for Participant #1. It is easy to pull out the respective
score from vector N_score (which holds scores for 1381 participants) by simply
referring to the participant number in square brackets: N_score[1].
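A sketch of the calculation, with a hypothetical score Y = 14 standing in for N_score[1] (whose actual value depends on the data):

```r
# SD and alpha taken from the outputs above; Y is a hypothetical observed score
SEm <- 5.53 * sqrt(1 - 0.93)
Y <- 14

# Approximate 95% confidence interval: Y plus/minus 2 standard errors
c(lower = Y - 2 * SEm, upper = Y + 2 * SEm)
```

Replacing Y with N_score[1] gives the interval asked for in the question below.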
QUESTION 8. What is the 95% confidence interval around the Neuroticism
scale score for participant #1?
After you have finished this exercise, save your R script and the entire 'workspace', which
includes the data frame and also all the new objects you created. This will be
useful for when you come back to this exercise again, and they will also be
needed in Exercises 12 and 14.
6.4 Solutions
Q1. The mean represents item difficulty. In this item set, difficulties vary
widely. The ‘easiest’ item to endorse is N_88 (“Are you touchy about some
things?”), with the highest mean (0.89). Because the data are binary coded
0/1, the mean can be interpreted as 89% of the sample endorsed the item. The
most difficult item to endorse in this set is N_62, with the lowest mean value
(0.27). N_62 is phrased “Do you often feel life is very dull?”, and only 27% of
the sample agreed with this. This item indicates a higher degree of Neuroticism
(perhaps even a symptom of depression) than that of item N_88, and it is
therefore more difficult to endorse.
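The link between the mean of a 0/1 item and the percentage endorsing it can be checked directly in R (the responses below are hypothetical, not from the EPQ data):

```r
# For items coded 0/1, the mean equals the proportion who answered "YES"
responses <- c(1, 1, 0, 1, 0)  # hypothetical responses from 5 people
mean(responses)                 # 0.6, i.e. 60% endorsed the item
```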
Q2. The tetrachoric correlations are substantially larger than the product-
moment correlations. This is not surprising given that the data are binary
and the product-moment correlations were developed for continuous data. The
product-moment correlations underestimate the strength of the relationships
between these binary items.
Q3. The distribution appears symmetrical, without any visible skew. There
are no ceiling or floor effects, so the test’s difficulty appears appropriate for the
population.
Q4. The (raw) alpha computed from product-moment covariances is 0.87.
Q5. Item N_34 (“Are you a worrier?”) has the highest item-total correlation
(when the item itself is dropped). This item is central to the meaning of the
scale, or, in other words, the item that best represents this set of items. This
makes sense given that the scale is supposed to measure Neuroticism. Worrying
about things is one of the core indicators of Neuroticism.
Q6. Alpha from tetrachoric correlations is 0.93. It is larger than the product-
moment based alpha. This is not surprising since the tetrachoric correlations
were greater than the product-moment correlations.
Q7.
## [1] 1.463404
N_score[1]-2*SEm # lower
## [1] 9.618647
N_score[1]+2*SEm # upper
## [1] 15.47226
Part III
TEST HOMOGENEITY
AND SINGLE-FACTOR
MODEL
Exercise 7
Fitting a single-factor
model to polytomous
questionnaire data
7.1 Objectives
The objective of this exercise is to fit a single-factor model to item-level ques-
tionnaire data, thereby testing homogeneity of an item set.
7.2 Worked example - Testing homogeneity of SDQ Conduct Problems scale
If you have already worked with this data set in Exercise 1 or Exercise 5, the
simplest thing to do is to continue working within the project created back
then. In RStudio, select File / Open Project and navigate to the folder and the
project you created. You should see the data frame SDQ appearing in your
Environment tab, together with other objects you created and saved.
If you have not completed the previous Exercises or have not saved your work,
or simply want to start from scratch, download the data file SDQ.RData into
a new folder and follow instructions from Exercise 1 on how to set up a new
project and to load the data.
load("SDQ.RData")
Whether you are working in a new project or the old project, create a new R
script (File / New File / R Script) for this exercise to keep a separate record of
commands needed to conduct test homogeneity analyses.
Now either click on the SDQ object in the Environment tab or type View(SDQ)
to remind yourself of the data set. It contains 228 rows (cases, observations)
on 51 variables. Run names(SDQ) to get the variable names. There is Gender
variable (0=male; 1=female), followed by responses to 25 SDQ items at Time 1
named consid, restles, somatic etc. These are followed by 25 more variables
named consid2, restles2, somatic2 etc. These are responses to the same SDQ
items at Time 2.
Run head(SDQ) from your script to get a quick preview of first 6 rows of data.
You should see that there are some missing responses, marked ‘NA’. There are
more missing responses for Time 2, with whole rows missing for some pupils.
If you have not done so in previous exercises, begin by creating a list of items
that measure Conduct Problems (you can see them in a table given in Exercise
1). This will enable easy reference to data from these 5 variables in all analyses.
We will use c() - the base R function for combining values into a list.
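The command itself did not survive in this extract; a minimal sketch, assuming the five Conduct Problems item names that appear in the omega() output later in this exercise:

```r
# Combine the five Conduct Problems item names with c();
# these names are taken from the omega() output later in this exercise
items_conduct <- c("tantrum", "obeys", "fights", "lies", "steals")
```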
Note how a new object items_conduct appeared in the Environment tab. Try
calling SDQ[items_conduct] to pull only the data from these 5 items.
library(psych)
lowerCor(SDQ[items_conduct])
QUESTION 1. What can you say about the size and direction of inter-item
correlations? Do you think these data are suitable for factor analysis?
To obtain the measure of sampling adequacy (MSA) - an index summarizing
the correlations on their overall potential to measure something in common -
request the Kaiser-Meyer-Olkin (KMO) index:
KMO(SDQ[items_conduct])
Kaiser (1975) proposed the following guidelines for interpreting the MSA and
deciding on the utility of factor analysis: values in the 0.90s are “marvelous”, in
the 0.80s “meritorious”, in the 0.70s “middling”, in the 0.60s “mediocre”, in the
0.50s “miserable”, and below 0.50 unacceptable.
Package psych has a function producing Scree plots for the observed data and
random (i.e. uncorrelated) data matrix of the same size. Comparison of the
two scree plots is called Parallel Analysis. We retain factors from the blue
scree plot (real data) that are ABOVE the red plot (simulated random data),
in which we expect no common variance, only random variance. In Scree plots,
important factors that account for common variance resemble points on hard
rock of a mountain, and trivial factors only accounting for random variance
are compared with rubble at the bottom of a mountain (“scree” in geology).
While examining a Scree plot of the empirical data is helpful for preliminary
decisions on which factors belong to the hard rock and which belong to the
rubble, Parallel Analysis provides a statistical comparison with a baseline for
data of this size.
fa.parallel(SDQ[items_conduct], fa="pc")
[Figure: Parallel analysis scree plot of eigenvalues against component number, showing PC eigenvalues for actual, simulated, and resampled data]
## Parallel analysis suggests that the number of factors = NA and the number of components = 1
Now let’s run the factor analysis, extracting one factor. I recommend you call
documentation on function fa() from package psych by running command ?fa.
From the general form of this function, fa(r, nfactors=1,…), it is clear that
we need to supply the data (argument r, which can be a correlation or covariance
matrix or a raw data matrix, like we have). Other arguments all have defaults.
The default estimation method is fm="minres". This method will minimise the
residuals (differences between the observed and the reproduced correlations),
and it is probably a good choice for these data (the sample size is not that
large, and responses in only 3 categories cannot be normally distributed). The
default number of factors to extract (nfactors=1) is exactly what we want,
however, I recommend you write this explicitly for your own future reference:
fa(SDQ[items_conduct], nfactors=1)
Run this command and examine the output. Try to answer the following ques-
tions (refer to instructional materials of your choice for help, I recommend Mc-
Donald’s “Test theory”). I will indicate which parts of the output you need to
answer each question.
QUESTION 4. Examine the Standardized factor loadings. How do you in-
terpret them? What is the “marker item” for the Conduct Problems scale? [In
the “Standardized loadings” output, the loadings are printed in “MR1” column.
“MR” stands for the estimation method, “Minimum Residuals”, and “1” stands
for the factor number. Here we have only 1 factor, so only one column.]
Now let us examine the model’s goodness of fit (GOF). This output starts with
the line “Test of the hypothesis that 1 factor is sufficient”. This is the hypothesis
that the data complies with a single-factor model (“the model”). We are hoping
to retain this hypothesis, therefore hoping for a large p-value, definitely p >
0.05, and hopefully larger. The output will also tell us about the “null model”,
which is the model where all items are uncorrelated. We are obviously hoping
to reject this null model, and obtain a very small p-value. Both of these models
will be tested with the chi-square test, with their respective degrees of freedom.
For our single-factor model, there are 5 degrees of freedom, because the model
estimates 10 parameters (5 factor loadings and 5 uniquenesses), and there are
5*6/2=15 variances and covariances (sample moments) to estimate them. The
degrees of freedom are therefore 15-10=5.
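The degrees-of-freedom arithmetic described above can be checked in R:

```r
p <- 5                       # number of items
moments <- p * (p + 1) / 2   # 15 sample variances and covariances
params  <- 2 * p             # 10 parameters: 5 loadings + 5 uniquenesses
df <- moments - params       # 15 - 10 = 5 degrees of freedom
```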
QUESTION 7. Is the chi-square for the tested model significant? Do you
accept or reject the single-factor model? [Look for “Likelihood Chi Square” in
the output.]
For now, I will ignore some other “fit indices” printed in the output. I will
return to them in Exercises dealing with structural equation models (beginning
with Exercise 16).
Now let’s examine the model’s residuals. Residuals are the differences between
the observed item correlations (which we computed earlier) and the correlations
“reproduced” by the model – that is, correlations of item scores predicted by
the model. The smaller the residuals are, the closer the model reproduces the
data.
In the model output printed on Console, however, we have only the Root Mean
Square Residual (RMSR), which computes the mean of all residuals squared,
and then takes the square root of that. The RMSR is a summary measure of
the size of residuals, and in a way it is like GOF “effect size” - independent of
sample size. You can see that the RMSR=0.05, which is a good (low) value,
indicating that the average residual is sufficiently small.
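The RMSR computation described above can be reproduced on a handful of hypothetical residuals (these values are illustrative, not from the model output):

```r
# Square the residuals, average them, then take the square root
res <- c(0.05, -0.03, 0.08, -0.02, 0.04)  # hypothetical residual correlations
RMSR <- sqrt(mean(res^2))                  # approx. 0.049
```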
To obtain more detailed output of the residuals, we need to get access to all
of the results produced by the function fa(). Call the function again, but this
time, assign its results to a new object F_conduct, which we can “interrogate”
later:
F_conduct <- fa(SDQ[items_conduct], nfactors=1)
Package psych has a nice function that pulls the residuals from the saved factor
analysis results (object F_conduct) and prints them in a user-friendly way:
residuals.psych(F_conduct)
NOTE: On the diagonal are discrepancies between the observed item variances
(which equal 1 in this standardised solution) and the “reproduced” item vari-
ances (variances explained by the common factor, or communalities). So, on the
diagonal we have uniqueness = 1-communality. You can check this by comparing
the diagonal of the residual matrix with the u2 values that we discussed earlier.
They should be the same. To remove these values on the diagonal out of sight,
use:
residuals.psych(F_conduct, diag=FALSE)
QUESTION 8. Examine the residual correlations (they are printed OFF the
diagonal). What can you say about them? Are there any large residuals? (Hint.
Interpret the size of residuals as you would the size of correlation coefficients.)
In Exercise 5, we already computed alpha for this scale, so you can refer to that
instruction for detail. I will simply quote the result we obtained there - “raw
alpha” of 0.72.
To obtain omega, call function omega(), specifying the number of factors
(nfactors=1). You need this because various versions of coefficient omega exist
for multi-factor models, but you only need the estimate for a homogeneous
test, “Omega Total”.
# reverse code any counter-indicative items and put them in a data frame
# for details on this procedure, consult Exercise 1
R_conduct <- reverse.code(c(1,-1,1,1,1), SDQ[items_conduct])
# obtain McDonald's omega
omega(R_conduct, nfactors=1)
## Omega
## Call: omegah(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
## digits = digits, title = title, sl = sl, labels = labels,
## plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option,
## covar = covar)
## Alpha: 0.73
## G.6: 0.7
## Omega Hierarchical: 0.74
## Omega H asymptotic: 1
## Omega Total 0.74
##
## Schmid Leiman Factor loadings greater than 0.2
## g F1* h2 u2 p2
## tantrum 0.71 0.51 0.49 1
## obeys- 0.67 0.44 0.56 1
## fights 0.65 0.42 0.58 1
## lies 0.49 0.24 0.76 1
## steals 0.45 0.20 0.80 1
##
## With eigenvalues of:
## g F1*
## 1.8 0.0
##
## general/max 1.639707e+16 max/min = 1
QUESTION 9. What is the “Omega Total” for Conduct Problems scale score?
How does it compare with the “raw alpha” for this scale? Which estimate do
you think is more appropriate to assess the reliability of this scale?
After you have finished work with this exercise, save your R script by pressing the
Save icon in the script window. Give the script a meaningful name, for example
“SDQ scale homogeneity analyses”. When closing the project by pressing File /
Close project, make sure you select Save when prompted to save your ‘Workspace
image’ (with extension .RData).
7.3 Further practice - Factor analysis of the remaining SDQ subscales
7.4 Solutions
Q1. The correlations are all similar in magnitude (around 0.3-0.4). Correlations
between obeys and other items are negative, and correlations between all other
items are positive. This is because obeys indicates the low end of Conduct
Problems (i.e. the lack of such problems), so it is keyed in the opposite direction
to all other items. The data seem suitable for factor analysis
because all items correlate – so potentially have a common source for this shared
variance.
Q2. MSA = 0.76. The data are “middling” according to Kaiser’s guidelines
(i.e. suitable for factor analysis).
Q3. According to the Scree plot of the observed data (in blue) the 1st factor
accounts for a substantial amount of variance compared to the 2nd factor. There
is a large drop from the 1st factor to the 2nd, and the “rubble” begins with the
2nd factor. This indicates that most co-variation is explained by one factor.
Parallel analysis confirms this conclusion, because the simulated data yields a
line (in red) that crosses the observed scree between 1st and 2nd factor, with
1st factor significantly above the red line, and 2nd factor below it. It means
that only the 1st factor should be retained, and the 2nd, 3rd and all subsequent
factors should be discarded as they are part of the rubble.
Q4. Standardized factor loadings reflect the number of Standard Deviations by
which the item score will change per 1 SD change in the factor score. The higher
the loading, the more sensitive the item is to the change in the factor. Stan-
dardised factor loadings in the single-factor model are also correlations between
the factor and the items (just like beta coefficients in the simple regression).
For the Conduct Problems scale, factor loadings range between 0.45 and 0.71 in
magnitude (i.e. absolute value). Factor loadings over 0.5 in magnitude can be
considered reasonably high. The loading for obeys is negative (-0.67), which
means that higher factor score results in the lower score on this item. The
marker item is tantrum with the loading 0.71 (highest loading). The marker
item indicates that behaviour described by this item, “I get very angry and
often lose my temper”, is central to the meaning of the common factor.
Q5. Communality is the variance in the item due to the common factor, and
uniqueness is the unique item variance. In standardised factor solutions (which
is what function fa() prints), communality is the proportion of variance in the
item due to the factor, and uniqueness is the remaining proportion (1-communality).

Exercise 8

Fitting a single-factor model to dichotomous questionnaire data
8.1 Objectives
To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.
This exercise makes use of the data we considered in Exercise 6. These data
come from a large cross-cultural study (Barrett, Petrides, Eysenck & Eysenck,
1998), with N = 1,381 participants who completed the Eysenck Personality
Questionnaire (EPQ).
8.2 Worked example - Testing homogeneity of EPQ Neuroticism scale
The focus of our analysis here will be the Neuroticism/Anxiety (N) scale,
measured by 23 items with only two response options - either “YES” or “NO”.
You can find the full list of EPQ Neuroticism items in Exercise 6. Please
note that all items indicate “Neuroticism” rather than “Emotional Stability”
(i.e. there are no counter-indicative items).
If you have already worked with this data set in Exercise 6, the simplest thing
to do is to continue working within the project created back then. In RStudio,
select File / Open Project and navigate to the folder and the project you cre-
ated. You should see the data frame EPQ appearing in your Environment tab,
together with other objects you created and saved.
If you have not completed Exercise 6 or have not saved your work, or simply
want to start from scratch, download the data file EPQ_N_demo.txt into
a new folder and follow instructions from Exercise 6 on creating a project and
importing the data.
The object EPQ should appear on the Environment tab. Click on it and the
data will be displayed on a separate tab. As you can see, there are 26 variables
in this data frame, beginning with participant id, age and sex (0 = female; 1
= male). These demographic variables are followed by 23 item responses, which
are either 0 (for “NO”) or 1 (for “YES”). There are also a few missing responses,
marked with NA.
library(psych)
Use function lowerCor() from package psych to display the correlations of all
items in a compact format. Note that to refer to the 23 item responses only,
you need to specify the columns where they are stored (from 4 to 26):
lowerCor(EPQ[4:26])
## N_3 N_7 N_12 N_15 N_19 N_23 N_27 N_31 N_34 N_38 N_41
## N_3 1.00
## N_7 0.36 1.00
## N_12 0.26 0.28 1.00
## N_15 0.31 0.21 0.12 1.00
## N_19 0.33 0.30 0.32 0.18 1.00
## N_23 0.36 0.30 0.24 0.28 0.28 1.00
## N_27 0.28 0.28 0.31 0.24 0.32 0.29 1.00
## N_31 0.24 0.20 0.18 0.22 0.31 0.24 0.26 1.00
## N_34 0.30 0.33 0.34 0.20 0.41 0.32 0.36 0.45 1.00
## N_38 0.26 0.24 0.28 0.21 0.29 0.27 0.33 0.28 0.41 1.00
## N_41 0.23 0.21 0.16 0.29 0.27 0.26 0.23 0.43 0.35 0.27 1.00
## N_47 0.17 0.13 0.19 0.09 0.16 0.20 0.20 0.18 0.20 0.29 0.14
## N_54 0.20 0.19 0.11 0.20 0.15 0.16 0.15 0.20 0.18 0.16 0.21
## N_58 0.36 0.34 0.22 0.19 0.24 0.35 0.24 0.18 0.22 0.20 0.16
## N_62 0.26 0.23 0.14 0.23 0.20 0.29 0.20 0.20 0.23 0.19 0.24
## N_66 0.20 0.16 0.27 0.12 0.25 0.24 0.22 0.17 0.24 0.25 0.13
## N_68 0.22 0.29 0.15 0.19 0.22 0.21 0.22 0.14 0.21 0.19 0.15
## N_72 0.24 0.32 0.32 0.13 0.42 0.24 0.34 0.29 0.41 0.29 0.26
## N_75 0.26 0.29 0.21 0.25 0.31 0.27 0.25 0.53 0.38 0.28 0.40
## N_77 0.33 0.30 0.18 0.23 0.29 0.37 0.31 0.26 0.29 0.25 0.25
## N_80 0.23 0.23 0.27 0.10 0.56 0.19 0.29 0.28 0.35 0.24 0.25
## N_84 0.25 0.22 0.15 0.08 0.14 0.18 0.16 0.10 0.15 0.13 0.09
## N_88 0.19 0.17 0.12 0.12 0.19 0.17 0.13 0.05 0.17 0.15 0.11
## N_47 N_54 N_58 N_62 N_66 N_68 N_72 N_75 N_77 N_80 N_84
## N_47 1.00
## N_54 0.09 1.00
## N_58 0.16 0.20 1.00
## N_62 0.07 0.16 0.20 1.00
## N_66 0.21 0.11 0.16 0.15 1.00
## N_68 0.05 0.17 0.21 0.24 0.14 1.00
## N_72 0.18 0.08 0.21 0.19 0.27 0.22 1.00
## N_75 0.19 0.25 0.24 0.23 0.19 0.16 0.30 1.00
## N_77 0.11 0.23 0.27 0.41 0.24 0.28 0.27 0.29 1.00
## N_80 0.11 0.13 0.18 0.14 0.25 0.21 0.40 0.30 0.26 1.00
## N_84 0.12 0.12 0.29 0.14 0.11 0.16 0.13 0.12 0.17 0.13 1.00
## N_88 0.09 0.07 0.16 0.07 0.12 0.09 0.11 0.12 0.12 0.17 0.15
## N_88
## N_88 1.00
Now let’s compute the tetrachoric correlations for the same items. These would
be more appropriate for binary items on which a NO/YES dichotomy was forced
(although the underlying extent of agreement is actually continuous).
tetrachoric(EPQ[4:26])
##
## with tau of
## N_3 N_7 N_12 N_15 N_19 N_23 N_27 N_31 N_34 N_38 N_41
## -0.354 -0.112 -0.829 0.557 -0.195 -0.036 0.066 0.427 -0.252 -0.080 0.504
## N_47 N_54 N_58 N_62 N_66 N_68 N_72 N_75 N_77 N_80 N_84
## -0.209 0.486 -0.408 0.603 -0.209 0.162 -0.277 0.360 0.219 -0.276 -0.876
## N_88
## -1.232
KMO(tetrachoric(EPQ[4:26])$rho)
This time, we will call function fa.parallel() requesting the tetrachoric corre-
lations (cor="tet") rather than Pearson’s correlations (the default). We will also
change the default estimation method (minimum residuals or “minres”) to the
maximum likelihood (fm="ml"), which our large sample allows. Finally, we will
change another default - the type of eigenvalues shown. We will display only
eigenvalues for principal components (fa="pc"), as done in some commercial
software such as Mplus.
fa.parallel(EPQ[4:26], cor="tet", fm="ml", fa="pc")
[Figure: Parallel analysis scree plot for the 23 Neuroticism items, showing PC eigenvalues for actual, simulated, and resampled data against component number]
## Parallel analysis suggests that the number of factors = NA and the number of components = 3
The Scree plot shows that the first factor accounts for a substantial amount
of variance compared to the second and subsequent factors. There is a large
drop from the first factor to the second (forming a clear “mountain side”), and
mostly “rubble” afterwards, beginning with the second factor. This indicates
that most co-variation in the items is explained by just one factor. However,
Parallel Analysis reveals that there are 3 factors, as factors 2 and 3 explain
significantly more variance than would be expected in random data of this size.
From the Scree plot it is clear that factors 2 and 3 explain little variance (even
if significant). Residuals will reveal which correlations are not well explained by
just 1 factor, and might give us a clue why the 2nd and 3rd factors are required.
Now let’s fit a single-factor model to the 23 Neuroticism items. We will again
request the tetrachoric correlations (cor="tet") and the maximum likelihood
estimation method (fm="ml"):
fa(EPQ[4:26], nfactors=1, cor="tet", fm="ml")
##
## The root mean square of the residuals (RMSR) is 0.08
## The df corrected root mean square of the residuals is 0.08
##
## The harmonic number of observations is 1374 with the empirical chi square 4262.77
## The total number of observations was 1381 with Likelihood Chi Square = 5122.04
##
## Tucker Lewis Index of factoring reliability = 0.678
## RMSEA index = 0.124 and the 90 % confidence intervals are 0.121 0.127
## BIC = 3459.01
## Fit based upon off diagonal values = 0.96
## Measures of factor score adequacy
## ML1
## Correlation of (regression) scores with factors 0.97
## Multiple R square of scores with factors 0.94
## Minimum correlation of possible factor scores 0.88
Run this command and examine the output. Try to answer the following ques-
tions (refer to instructional materials of your choice for help, I recommend Mc-
Donald’s “Test theory”). I will indicate which parts of the output you need to
answer each question.
QUESTION 3. Examine the Standardized factor loadings. How do you inter-
pret them? What is the “marker item” for the Neuroticism scale? [In the “Stan-
dardized loadings” output, the loadings are printed in “ML1” column. “ML”
stands for the estimation method, “Maximum Likelihood”, and “1” stands for
the factor number. Here we have only 1 factor, so only one column.]
QUESTION 4. Examine communalities and uniquenesses (look at h2 and u2
values in the table of “Standardized loadings”, respectively). What is commu-
nality and uniqueness and how do you interpret them?
QUESTION 5. What is the proportion of variance explained by the factor
in all 23 items (total variance explained)? To answer this question, look for
“Proportion Var” entry in the output (in a small table beginning with “SS
loadings”).
Now let us examine the model’s goodness of fit (GOF). This output starts with
the line “Test of the hypothesis that 1 factor is sufficient”. This is the hypothesis
that the data complies with a single-factor model (“the model”). We are hoping
to retain this hypothesis, therefore hoping for a large p-value, definitely p >
0.05, and hopefully larger. The output will also tell us about the “null model”,
which is the model where all items are uncorrelated. We are obviously hoping
to reject this null model, and obtain a very small p-value. Both of these models
will be tested with the chi-square test, with their respective degrees of freedom.
For our single-factor model, there are 230 degrees of freedom, because the model
estimates 46 parameters (23 factor loadings and 23 uniquenesses), and there are
23*24/2=276 variances and covariances (sample moments) to estimate them.
The degrees of freedom are therefore 276-46=230.
QUESTION 6. Is the chi-square for the tested model significant? Do you
accept or reject the single-factor model? [Look for “Likelihood Chi Square” in
the output.]
For now, I will ignore some other “fit indices” printed in the output. I will
return to them in Exercises dealing with structural equation models (beginning
with Exercise 16).
Now let’s examine the model’s residuals. Residuals are the differences between
the observed item correlations (which we computed earlier) and the correlations
“reproduced” by the model – that is, correlations of item scores predicted by
the model. In the model output printed on Console, we have the Root Mean
Square Residual (RMSR), which is a summary measure of the size of residuals,
and in a way it is like GOF “effect size” - independent of sample size. You
can see that the RMSR=0.08, which is an acceptable value, indicating that the
average residual is sufficiently small.
To obtain more detailed output of the residuals, we need to get access to all
of the results produced by the function fa(). Call the function again, but this
time, assign its results to a new object fit, which we can “interrogate” later:
fit <- fa(EPQ[4:26], nfactors=1, cor="tet", fm="ml")
Package psych has a nice function that pulls the residuals from the saved factor
analysis results (object fit) and prints them in a user-friendly way. To remove
item uniquenesses from the diagonal out of sight, use option diag=FALSE.
residuals.psych(fit, diag=FALSE)
## N_3 N_7 N_12 N_15 N_19 N_23 N_27 N_31 N_34 N_38 N_41
## N_3 NA
## N_7 0.09 NA
## N_12 -0.01 0.06 NA
## N_15 0.19 0.01 -0.11 NA
## N_19 0.00 -0.03 0.05 -0.09 NA
## N_23 0.09 0.03 -0.01 0.12 -0.07 NA
## N_27 -0.02 -0.01 0.11 0.03 -0.01 0.00 NA
## N_31 -0.07 -0.13 -0.10 -0.01 -0.01 -0.08 -0.05 NA
## N_34 -0.09 -0.02 0.04 -0.08 0.01 -0.04 0.01 0.16 NA
## N_38 -0.05 -0.06 0.05 0.01 -0.04 -0.01 0.06 0.01 0.09 NA
## N_41 -0.05 -0.08 -0.11 0.13 -0.03 -0.01 -0.06 0.19 0.06 0.01 NA
## N_47 -0.01 -0.06 0.07 -0.05 -0.05 0.05 0.04 0.02 -0.01 0.18 -0.03
## N_54 0.04 0.03 -0.06 0.11 -0.06 -0.02 -0.04 0.02 -0.03 -0.02 0.07
## N_58 0.14 0.14 0.00 0.03 -0.05 0.16 -0.01 -0.09 -0.11 -0.06 -0.10
## N_62 0.09 0.02 -0.07 0.09 -0.07 0.11 -0.04 -0.05 -0.05 -0.04 0.03
## N_66 -0.03 -0.07 0.13 -0.05 0.02 0.04 0.01 -0.07 -0.02 0.06 -0.10
## N_68 0.03 0.13 -0.04 0.05 -0.02 0.00 0.02 -0.11 -0.06 -0.02 -0.07
## N_72 -0.09 0.03 0.08 -0.13 0.10 -0.08 0.06 0.00 0.05 -0.01 -0.02
## N_75 -0.06 -0.02 -0.06 0.04 -0.04 -0.05 -0.09 0.26 0.04 -0.02 0.14
## N_77 0.07 0.02 -0.10 0.02 -0.04 0.11 0.02 -0.05 -0.08 -0.05 -0.03
## N_80 -0.08 -0.07 0.03 -0.18 0.29 -0.13 0.00 0.01 0.00 -0.04 0.00
## N_84 0.13 0.11 0.01 -0.06 -0.08 0.04 0.01 -0.11 -0.07 -0.04 -0.09
## N_88 0.08 0.07 -0.03 0.07 0.07 0.06 -0.02 -0.19 -0.01 0.03 -0.02
## N_47 N_54 N_58 N_62 N_66 N_68 N_72 N_75 N_77 N_80 N_84
## N_47 NA
## N_54 -0.02 NA
## N_58 0.03 0.10 NA
## N_62 -0.10 0.03 0.04 NA
## N_66 0.13 -0.04 -0.04 -0.02 NA
## N_68 -0.12 0.07 0.06 0.12 -0.01 NA
## N_72 0.00 -0.16 -0.08 -0.06 0.07 0.01 NA
## N_75 0.03 0.09 -0.01 -0.02 -0.06 -0.09 -0.02 NA
## N_77 -0.09 0.08 0.05 0.26 0.04 0.10 -0.04 -0.04 NA
## N_80 -0.09 -0.06 -0.09 -0.13 0.05 0.01 0.14 0.02 -0.02 NA
## N_84 0.04 0.06 0.24 0.05 -0.02 0.09 -0.06 -0.08 0.04 -0.05 NA
## N_88 0.02 -0.04 0.07 -0.08 0.02 -0.01 -0.08 -0.04 -0.03 0.05 0.13
## N_88
## N_88 NA
hist(residuals.psych(fit, diag=FALSE))
[Figure: Histogram of the residual correlations]
QUESTION 7. Examine the residuals. What can you say about them? Are
there any large residuals? (Hint. Interpret the size of residuals as you would
the size of correlation coefficients.)
omega(tetrachoric(EPQ[4:26])$rho, nfactors=1)
## Omega
## Call: omegah(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
## digits = digits, title = title, sl = sl, labels = labels,
## plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option,
## covar = covar)
## Alpha: 0.93
## G.6: 0.95
## Omega Hierarchical: 0.93
## Omega H asymptotic: 1
## Omega Total 0.93
##
## Schmid Leiman Factor loadings greater than 0.2
## g F1* h2 u2 p2
## N_3 0.70 0.50 0.50 1
## N_7 0.66 0.44 0.56 1
## N_12 0.65 0.42 0.58 1
## N_15 0.53 0.28 0.72 1
## N_19 0.74 0.54 0.46 1
## N_23 0.68 0.46 0.54 1
## N_27 0.67 0.45 0.55 1
## N_31 0.67 0.45 0.55 1
## N_34 0.79 0.62 0.38 1
## N_38 0.65 0.42 0.58 1
## N_41 0.64 0.41 0.59 1
## N_47 0.40 0.16 0.84 1
## N_54 0.44 0.19 0.81 1
## N_58 0.60 0.36 0.64 1
## N_62 0.56 0.32 0.68 1
## N_66 0.50 0.25 0.75 1
## N_68 0.49 0.24 0.76 1
## N_72 0.68 0.46 0.54 1
## N_75 0.71 0.50 0.50 1
## N_77 0.68 0.47 0.53 1
## N_80 0.64 0.41 0.59 1
## N_84 0.44 0.19 0.81 1
## N_88 0.44 0.19 0.81 1
##
## With eigenvalues of:
## g F1*
## 8.7 0.0
##
## general/max 5.243623e+16 max/min = 1
## mean percent general = 1 with sd = 0 and cv of 0
QUESTION 8. What is the “Omega Total” for the Neuroticism scale score?
How does it compare with the alpha for this scale?
8.3 Solutions
Q1. Both sets of correlations support the suitability for factor analysis be-
cause they are all positive and relatively similar in size, as expected for items
measuring the same thing. The tetrachoric correlations are larger than the
product-moment correlations. This is not surprising given that the product-
moment correlations tend to underestimate the strength of the relationships
between binary items.
Q2. MSA = 0.92. The data are “marvelous” for factor analysis according to
Kaiser’s guidelines.
Q3. Standardized factor loadings reflect the number of Standard Deviations by
which the item score will change per 1 SD change in the factor score. The higher
the loading, the more sensitive the item is to the change in the factor. Standard-
ised factor loadings in the single-factor model are also correlations between the
factor and the items (just like beta coefficients in the simple regression). For the
Neuroticism scale, factor loadings range between 0.40 and 0.80. Factor loadings
over 0.5 can be considered reasonably high. The marker item is N_34 with the
loading 0.80 (highest loading). This item, “Are you a worrier?”, is central to the
meaning of the common factor (and supports the hypothesis that the construct
measured by this item set is indeed Neuroticism).
Q4. Communality is the variance in the item due to the common factor, and
uniqueness is the unique item variance. In standardised factor solutions (which
is what function fa() prints), communality is the proportion of variance in
the item due to the factor, and uniqueness is the remaining proportion (1-
communality). Looking at the printed values, between 16% and 64% of variance
in the items is due to the common factor.
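Because standardised loadings are correlations with the factor, each communality is simply the squared loading. For example, taking the lowest loading in the set, 0.40 (item N_47), reproduces the 16% lower bound quoted above:

```r
loading <- 0.40   # standardised loading of N_47, the lowest in this set
h2 <- loading^2   # communality: 16% of the item variance is common
u2 <- 1 - h2      # uniqueness: the remaining 84%
```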
Q5. The factor accounts for 38% of the variability in the 23 Neuroticism items.
Q6. The chi-square test is highly significant, which means the null hypothesis
(that the single-factor model holds in the population) must be rejected. How-
ever, the chi-square test is sensitive to the sample size, and even very good
models are rejected when large samples (like the one tested here, N=1381) are
used to test them.
## The total number of observations was 1381 with Likelihood Chi Square = 5122.04
Q7. The vast majority of residuals are between -0.1 and 0.1. However, there
are a few large residuals, above 0.2. For instance, the largest residual correlation
(0.29) is between N_19 (“Are your feelings easily hurt?”) and N_80 (“Are
you easily hurt when people find fault with you or the work you do?”). This is a
clear case of so-called local dependence - when a dependency between two items
remains even after accounting for the common factor. It is not surprising to find
local dependence in two items with such similar wording. These items have far
more in common than they have with other items. Beyond the common factor
that they share with other items, these items also share some of their unique
parts. Another large residual, 0.26, is between N_31 (“Would you call yourself
a nervous person?”) and N_75 (“Do you suffer from ‘nerves’?”). Again, there is
a clear similarity of wording in these two items. These cases of local dependence
might be responsible for the 2nd and 3rd factors we found in Parallel Analysis.
Q8. Omega Total = 0.93, and is the same as Alpha = 0.93 (at least to the
second decimal place). Coefficient alpha can be no higher than omega, and the
two are equal when all factor loadings are equal. Here, not all loadings are
equal, but most are very similar, resulting in very similar alpha and omega.
Part IV
EXPLORATORY FACTOR
ANALYSIS (EFA)
Exercise 9
EFA of polytomous item scores
9.1 Objectives
To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.
I recommend you download the data file CHI_ESQ.txt into a new folder and,
in RStudio, associate a new project with this folder. Create a new script, where
you will type all commands.
The data are stored in the space-separated (.txt) format. You can preview the
file by clicking on CHI_ESQ.txt in the Files tab in RStudio (this will not
import the data yet, just open the actual file). You will see that the first row
contains the item abbreviations (esq+p for “parent”): “esqp_01”, “esqp_02”,
“esqp_03”, etc., and the first entry in each row is the respondent number: “1”,
“2”, …“620”. Function read.table() will import this into RStudio taking care
of these column and row names. It will actually understand that we have headers
and row names because the first row contains one fewer field than the number
of columns (see ?read.table for detailed help on this function).
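The import step itself might look like this (assuming CHI_ESQ.txt sits in your project folder):

```r
# read.table() detects the header row and row names automatically here,
# because the first row has one fewer field than the data rows
CHI_ESQ <- read.table("CHI_ESQ.txt")
```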
We have just read the data into a new data frame named CHI_ESQ. Examine
the data frame by either pressing on it in the Environment tab, or calling func-
tion View(CHI_ESQ). You will see that there are quite a few missing responses,
which could have occurred because either “Don’t know” response option was
chosen, or because the question was not answered at all.
First, load the package psych to enable access to its functionality. Conveniently,
most functions in this package easily accommodate analysis with correlation
matrices as well as with raw data.
library(psych)
lowerCor(CHI_ESQ, use="pairwise")
## es_01 es_02 es_03 es_04 es_05 es_06 es_07 es_08 es_09 es_10 es_11
## esqp_01 1.00
## esqp_02 0.67 1.00
## esqp_03 0.65 0.68 1.00
## esqp_04 0.79 0.65 0.67 1.00
## esqp_05 0.61 0.54 0.52 0.64 1.00
## esqp_06 0.58 0.55 0.50 0.60 0.68 1.00
## esqp_07 0.60 0.50 0.55 0.63 0.71 0.70 1.00
## esqp_08 0.15 0.25 0.21 0.22 0.14 0.30 0.21 1.00
## esqp_09 0.24 0.26 0.18 0.19 0.18 0.24 0.16 0.31 1.00
## esqp_10 0.17 0.25 0.19 0.15 0.14 0.18 0.15 0.36 0.28 1.00
## esqp_11 0.63 0.53 0.60 0.60 0.61 0.55 0.58 0.22 0.20 0.22 1.00
## esqp_12 0.72 0.60 0.62 0.72 0.71 0.66 0.68 0.22 0.21 0.15 0.71
## [1] 1.00
Examine the correlations carefully. Notice that all correlations are positive.
However, there is a pattern to these correlations. While most are large (above
0.5), correlations between items esqp_08, esqp_09, esqp_10 (describing
experiences with facilities, appointment times and location) and the rest of the
items (describing experiences with treatment) are substantially lower, ranging
between 0.14 and 0.30. Correlations of these items among each other are only
slightly larger, ranging between 0.28 and 0.36. We will see whether and how
EFA will pick up on this pattern of correlations.
KMO(CHI_ESQ)
QUESTION 1. Examine the output. What is the overall measure of sampling
adequacy (MSA), and how do you interpret it?
We will use function fa.parallel() to produce a scree plot for the observed
data, and compare it to that of a random (i.e. uncorrelated) data matrix of
the same size. We will use the default estimation method (fm="minres"), because
the data are responses with only 3 ordered categories, where we cannot
expect a normal distribution. We will again display only eigenvalues for princi-
pal components (fa="pc"), as is done in commercial software packages such as
Mplus.
fa.parallel(CHI_ESQ, fa="pc")
[Scree plot: eigenvalues for principal components of the actual data, with
simulated and resampled random data for comparison, plotted against
component number.]
## Parallel analysis suggests that the number of factors = NA and the number of components = 2
QUESTION 2. Examine the scree plot. How many factors does it suggest?
Does your conclusion correspond to the text output of the parallel analysis
function?
We will use function fa(), which has the following general form
fa(r, nfactors=1, n.obs = NA, rotate="oblimin", fm="minres" …),
and requires data (argument r), and the number of observations if the data are
a correlation matrix (n.obs). Other arguments have defaults, which we are happy
with for the first analysis.
Specifying all necessary arguments, test the hypothesized 1-factor model:
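The call producing the output below might look like this (fit1 is the object name used with residuals.psych() later):

```r
# load psych and fit a single-factor model to the CHI-ESQ data
library(psych)
fit1 <- fa(CHI_ESQ, nfactors = 1)
fit1  # print the summary
```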
##
## Factor analysis with Call: fa(r = CHI_ESQ, nfactors = 1)
##
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 54 and the objective function was 0.92
## The number of observations was 620 with Chi Square = 562.91 with prob < 7.7e-86
##
## The root mean square of the residuals (RMSA) is 0.07
## The df corrected root mean square of the residuals is 0.08
##
## Tucker Lewis Index of factoring reliability = 0.86
## RMSEA index = 0.123 and the 10 % confidence intervals are 0.114 0.133
## BIC = 215.7
From the summary output, we find that Chi Square = 562.91 (df = 54) with
prob < 7.7e-86 - a highly significant result. According to the chi-square test, we
have to reject the model. We also find that the root mean square of the residuals
(RMSR) is 0.07. This is an acceptable result.
Next, let’s examine the model residuals, which are direct measures of model
(mis)fit.
# obtain residuals
residuals.psych(fit1, diag=FALSE)
## es_01 es_02 es_03 es_04 es_05 es_06 es_07 es_08 es_09 es_10 es_11
## esqp_01 NA
## esqp_02 0.05 NA
## esqp_03 0.03 0.12 NA
## esqp_04 0.10 0.03 0.04 NA
## esqp_05 -0.04 -0.05 -0.07 -0.02 NA
## esqp_06 -0.06 -0.02 -0.07 -0.04 0.08 NA
## esqp_07 -0.04 -0.08 -0.03 -0.03 0.10 0.11 NA
## esqp_08 -0.09 0.03 -0.01 -0.03 -0.09 0.08 -0.02 NA
## esqp_09 0.00 0.05 -0.03 -0.05 -0.04 0.03 -0.06 0.23 NA
## esqp_10 -0.03 0.06 0.01 -0.06 -0.05 -0.01 -0.04 0.28 0.21 NA
## esqp_11 0.00 -0.03 0.03 -0.04 0.01 -0.03 -0.01 0.00 -0.01 0.03 NA
## esqp_12 0.01 -0.05 -0.02 0.00 0.04 0.00 0.02 -0.03 -0.04 -0.07 0.06
## [1] NA
You can see that the residuals are all small except one cluster corresponding
to the correlations between items esqp_08, esqp_09, and esqp_10, where
residuals are between 0.21 and 0.28. Evidently, one factor can explain the
observed correlations between items 1-7 and 11-12, but it cannot fully explain
the pattern of correlations we observed between items 8-10 (which describe
experience with facilities, appointment times and location rather than treatment
as the rest of items). Quite intuitively, one factor can explain the overall positive
manifold of all correlations, but not a complex pattern we observe here.
We conclude that the 1-factor model is clearly not adequate for these data.
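Next, we fit a 2-factor model in the same way (fit2 is the object name used with residuals.psych() below):

```r
fit2 <- fa(CHI_ESQ, nfactors = 2)
fit2
```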
##
## Factor analysis with Call: fa(r = CHI_ESQ, nfactors = 2)
##
## Test of the hypothesis that 2 factors are sufficient.
## The degrees of freedom for the model is 43 and the objective function was 0.64
## The number of observations was 620 with Chi Square = 394 with prob < 3e-58
##
## The root mean square of the residuals (RMSA) is 0.04
## The df corrected root mean square of the residuals is 0.05
##
## Tucker Lewis Index of factoring reliability = 0.878
## RMSEA index = 0.115 and the 10 % confidence intervals are 0.105 0.125
## BIC = 117.52
## With factor correlations of
## MR1 MR2
## MR1 1.00 0.39
## MR2 0.39 1.00
QUESTION 3. What is the chi square statistic, and p value, for the tested
2-factor model? Would you retain or reject this model based on the chi-square
test? Why?
QUESTION 4. Find and interpret the RMSR in the output.
Examine the 2-factor model residuals.
residuals.psych(fit2, diag=FALSE)
## es_01 es_02 es_03 es_04 es_05 es_06 es_07 es_08 es_09 es_10 es_11
## esqp_01 NA
## esqp_02 0.05 NA
## esqp_03 0.03 0.12 NA
## esqp_04 0.09 0.03 0.04 NA
## esqp_05 -0.05 -0.04 -0.07 -0.03 NA
## esqp_06 -0.06 -0.02 -0.07 -0.04 0.09 NA
## esqp_07 -0.05 -0.07 -0.03 -0.04 0.08 0.11 NA
## esqp_08 -0.05 -0.03 -0.02 0.02 -0.02 0.05 0.03 NA
## esqp_09 0.04 0.01 -0.03 -0.01 0.01 0.01 -0.02 0.00 NA
## esqp_10 0.01 0.01 0.01 -0.02 0.02 -0.03 0.00 0.00 0.00 NA
## esqp_11 0.00 -0.03 0.03 -0.04 0.01 -0.03 -0.01 0.00 -0.01 0.03 NA
## esqp_12 -0.01 -0.04 -0.02 -0.02 0.02 0.01 0.00 0.02 0.01 -0.02 0.06
## [1] NA
QUESTION 5. Examine the residual correlations. What can you say about
them? Are there any non-trivial residuals (greater than 0.1 in absolute value)?
Having largely confirmed suitability of the 2-factor model, we are ready to
examine and interpret its results. We will start with an un-rotated solution.
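A sketch of how the un-rotated loadings below might be obtained (the object name fit2.none and the print cutoff are assumptions):

```r
# re-fit the 2-factor model without rotation
fit2.none <- fa(CHI_ESQ, nfactors = 2, rotate = "none")
print(fit2.none$loadings, cutoff = 0.1)  # suppress loadings below 0.1
```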
##
## Loadings:
## MR1 MR2
## esqp_01 0.829
## esqp_02 0.749
## esqp_03 0.748
## esqp_04 0.840 -0.104
## esqp_05 0.788 -0.155
## esqp_06 0.764
## esqp_07 0.777 -0.110
## esqp_08 0.310 0.551
## esqp_09 0.291 0.403
## esqp_10 0.259 0.501
## esqp_11 0.756
The un-rotated solution yields the first factor that maximizes the variance
shared by all items. Not surprisingly then, all items load on factor 1 (“MR1”).
However, loadings are weak for items esqp_08, esqp_09, and esqp_10 (de-
scribing experiences with facilities, appointment times and location). On the
other hand, these 3 items load substantially on factor 2 (“MR2”). We can
plot these loadings using plot.psych(), a very convenient function from package
psych.
[Factor Analysis loadings plot: items plotted by their MR1 and MR2 loadings.
Items 8, 9 and 10 stand apart from the cluster formed by the remaining items.]
The loadings plot visualizes the un-rotated solution. While items 1-7 and 11-12
cluster together and load exclusively on Factor 1, items 8, 9, and 10 (describing
experiences with facilities, appointment times and location) cluster together and
separately from the rest of items. They load weakly on Factor 1 and moderately
on Factor 2. Considering the item content, we may interpret Factor 1 as Overall
Satisfaction, and Factor 2 as Specific Satisfaction with Environment.
QUESTION 6. What is the proportion of variance explained in all items by each
of the factors in this un-rotated solution, and together (total variance explained)?
Next, let us try rotating this solution orthogonally, using the “varimax” rotation.
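The varimax-rotated solution below might be produced like this (the object name fit2.v and the print cutoff are assumptions):

```r
# re-fit the 2-factor model with an orthogonal varimax rotation
fit2.v <- fa(CHI_ESQ, nfactors = 2, rotate = "varimax")
print(fit2.v$loadings, cutoff = 0.1)
```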
##
## Loadings:
## MR1 MR2
## esqp_01 0.820 0.159
## esqp_02 0.687 0.311
## esqp_03 0.716 0.215
## esqp_04 0.832 0.155
## esqp_05 0.797
## esqp_06 0.722 0.252
## esqp_07 0.774 0.130
## esqp_08 0.129 0.619
## esqp_09 0.155 0.473
## esqp_10 0.556
## esqp_11 0.725 0.214
## esqp_12 0.856 0.145
##
## MR1 MR2
## SS loadings 5.413 1.264
## Proportion Var 0.451 0.105
## Cumulative Var 0.451 0.556
[Factor Analysis loadings plot for the varimax-rotated solution: items 8, 9 and
10 now sit higher on MR2, while the remaining items cluster on MR1.]
As we can see, the “varimax” rotation led to smaller loadings of items 8, 9 and
10 on Factor 1 than in the un-rotated solution. However, other cross-loadings
have increased in this solution, and the items are not factorially simple as we
hoped. In fact, this solution is more difficult to interpret than the un-rotated
solution.
QUESTION 7. What is the proportion of variance explained in all items by
each of the factors, and together (total variance explained)? To answer this
question, look for the “Proportion Var” and “Cumulative Var” entries in the
output (in the small table beginning with “SS loadings”). Why is the total (cumulative)
variance the same as in the un-rotated solution?
We will now try an obliquely rotated solution, using the default “oblimin” rota-
tion.
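The call producing the output below (fit2.o is the object name interrogated later with fit2.o$Phi; the print cutoff is an assumption):

```r
fit2.o <- fa(CHI_ESQ, nfactors = 2, rotate = "oblimin")
print(fit2.o$loadings, cutoff = 0.1)
```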
##
## Loadings:
## MR1 MR2
## esqp_01 0.842
## esqp_02 0.670 0.172
## esqp_03 0.721
## esqp_04 0.856
## esqp_05 0.832
## esqp_06 0.719 0.102
## esqp_07 0.799
## esqp_08 0.627
## esqp_09 0.466
## esqp_10 0.568
## esqp_11 0.731
## esqp_12 0.883
##
## MR1 MR2
## SS loadings 5.578 0.992
## Proportion Var 0.465 0.083
## Cumulative Var 0.465 0.547
[Factor Analysis loadings plot for the oblimin-rotated solution: items 8, 9 and
10 load on MR2 only, and the remaining items load on MR1 only.]
This obliquely rotated solution yields more factorially simple items than the
previous solutions and is easy to interpret. Factor 1 comprises items
1-7 and 11-12, with the marker item esqp_12 (overall good service). Factor 2
comprises items 8-10, with the marker item esqp_08 (good facilities).
We label the factors in this study Satisfaction with Care and Satisfaction with
Environment, and these are two correlated domains of satisfaction.
QUESTION 8. What is the proportion of variance explained by each of the
factors, and together (total variance explained)? To answer this question, look
for the “Proportion Var” and “Cumulative Var” entries in the output (in the
small table beginning with “SS loadings”). Why is the total (cumulative) variance
different to the one reported for the un-rotated solution?
We now ask for “Phi” - the correlation matrix of latent factors. These correla-
tions are model-based; that is, they are estimated as parameters within the
model. These correlations are between the latent factors, not their imperfect
measurements (e.g. sum scores), and therefore they are the closest you can
get to estimating how the actual attributes are correlated in the population.
In contrast, correlations between observed sum scores will tend to be lower,
because the sum scores are not perfectly reliable and correlation between them
will be attenuated by unreliability.
fit2.o$Phi
## MR1 MR2
## MR1 1.0000000 0.3876848
## MR2 0.3876848 1.0000000
QUESTION 9. Examine and interpret the correlation between the two factors.
After you have finished working with this exercise, save your R script by pressing the
Save icon in the script window. To keep all the created objects (such as fit2.o),
which might be useful when revisiting this exercise, save your entire work space.
To do that, when closing the project by pressing File / Close project, make sure
you select Save when prompted to save your “Workspace image”.
9.4 Solutions
Q1. MSA = 0.92. According to Kaiser’s guidelines, the data are “marvellous”
for fitting a factor model. Refer to Exercise 7 to remind yourself of this index
and its interpretation.
Q2. There is a very sharp drop from factor #1 to factor #2, and a less pro-
nounced drop from #2 to #3. We appear to have one very strong factor
(presumably general satisfaction with service), which explains the general pos-
itive manifold in the correlation matrix, and 1 further factor, making 2 factors
altogether. Parallel analysis confirms this: 2 factors of the blue plot (observed
data) are above the red dotted line (the simulated random data).
Q3. Chi-square = 394 (df=43) with prob < 3e-58. The p value is given in
scientific format: 3 multiplied by 10 to the power of -58 (that is, 3 divided
by 10 to the power of 58). It is so small that we can just say p<0.001. We should
reject the model, because the probability of observing the given correlation
matrix in a population where the 2-factor model is true (the hypothesis we
are testing) is vanishingly small (p<0.001). However, the chi-square test is very
sensitive to the sample size, and our sample is quite large. We must judge the
fit based on other criteria, for instance residuals.
Q4. The RMSR=0.04 is nice and small, indicating that the 2-factor model
reproduces the observed correlations well overall.
Q5. Most residuals are very small (close to zero), and only 2 residuals are above
0.1 in absolute value. The largest residual 0.12 is between items esqp_02 and
esqp_03. This is still only very slightly above 0.1. We conclude that none of
the residuals cause particular concern.
Q6. In the un-rotated solution, Factor 1 accounts for 49% of total variance and
Factor 2 accounts for 6.6%. As the factors are uncorrelated, these can be added
together to obtain 55.6%.
Q7. After the varimax rotation, Factor 1 accounts for 45.1% of total variance
and Factor 2 accounts for 10.5%. As the factors are still uncorrelated, these can
be added together to obtain 55.6%. This is the same amount as in the un-rotated
solution because factor rotations do not alter the total variance explained by
the factors (item uniquenesses are invariant to rotation). Despite the amounts
of variance re-distributing between factors, the total is still the same.
Q8. After the oblimin rotation, Factor 1 accounts for 46.5% of total variance
and Factor 2 accounts for 8.3%. The cumulative total given, 54.7%, is no longer
the same as in the un-rotated solution because the factors are correlated, and
their variances cannot be added. However, the true total variance explained is
still 55.6%, as reported in the un-rotated solution, because factor rotation does
not change this result.
Q9. In the oblique solution, factors correlate moderately at 0.39. We would
expect domains of satisfaction to correlate. We interpret this result as the
expected correlation between constructs Satisfaction with Care and Satisfaction
with Environment in the population of parents from which we sampled.
Exercise 10
EFA of ability subtest scores
10.1 Objectives
In this exercise, you will explore the factor structure of an ability test battery
by analyzing observed correlations of its subtest scores.
We will work with the published correlations between the 9 subtests, based on
N=215 subjects. Despite having no access to the actual subject scores, the
correlation matrix is all you need to run EFA.
A new object Thurstone should appear in the Environment tab. Examine the
object by either pressing on it, or calling function View(). You will see that it
is a correlation matrix, with values 1 on the diagonal, and moderate to large
positive values off the diagonal. This is typical for ability tests – they tend to
correlate positively with each other.
First, load the package psych to enable access to its functionality. Conveniently,
most functions in this package easily accommodate analysis with correlation
matrices as well as with raw data.
library(psych)
##
## Attaching package: 'psych'
Before you start EFA, request the Kaiser-Meyer-Olkin (KMO) index – the mea-
sure of sampling adequacy (MSA):
KMO(Thurstone)
QUESTION 1. Examine the output. What is the overall measure of sampling
adequacy (MSA), and how do you interpret it?
In this case, we have a prior hypothesis that 3 broad domains of ability are
underlying the 9 subtests. We begin the analysis with function fa.parallel()
to produce a scree plot for the observed data, and compare it to that of a
random (i.e. uncorrelated) data matrix of the same size. In addition to the actual
data (our correlation matrix, Thurstone), we need to supply the sample size
(n.obs=215) to enable simulation of random data, because from the correlation
matrix alone it is impossible to know how big the sample was. We will keep the
default estimation method (fm="minres"). We will display eigenvalues for both
- principal components and factors (fa="both").
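Putting these arguments together, the call might look like this:

```r
# parallel analysis of the Thurstone correlation matrix;
# n.obs enables simulation of random comparison data
fa.parallel(Thurstone, n.obs = 215, fm = "minres", fa = "both")
```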
[Scree plot: eigenvalues for principal components (PC) and factors (FA),
actual versus simulated data, plotted against factor/component number.]
## Parallel analysis suggests that the number of factors = 3 and the number of compon
Remember that when interpreting the scree plot, you retain only the factors on
the blue (real data) scree plot that are ABOVE the red (simulated random data)
plot, in which we know there is no common variance, only random variance,
i.e. “scree” (rubble).
QUESTION 2. Examine the Scree plot. Does it support the hypothesis that
3 factors underlie the data? Why? Does your conclusion correspond to the text
output of the parallel analysis function?
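The 1-factor model producing the output below might be fitted as follows (fit1 matches the residuals call used later):

```r
fit1 <- fa(Thurstone, nfactors = 1, n.obs = 215)
fit1
```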
##
## Factor analysis with Call: fa(r = Thurstone, nfactors = 1, n.obs = 215)
##
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 27 and the objective function was 1.23
## The number of observations was 215 with Chi Square = 257.68 with prob < 1.7e-39
##
## The root mean square of the residuals (RMSA) is 0.1
## The df corrected root mean square of the residuals is 0.12
##
## Tucker Lewis Index of factoring reliability = 0.708
## RMSEA index = 0.199 and the 10 % confidence intervals are 0.178 0.222
## BIC = 112.67
From the summary output, we find that Chi Square = 257.68 (df=27) with prob
< 1.7e-39 - a highly significant result. According to the chi-square test, we have
to reject the model. We also find that the root mean square of the residuals
(RMSR) is 0.1. This is unacceptably high.
Next, let’s examine the model residuals, which are direct measures of model
(mis)fit.
# obtain residuals
residuals.psych(fit1, diag=FALSE)
## s1 s2 s3 s4 s5 s6 s7 s8 s9
## s1 NA
## s2 0.16 NA
## s3 0.14 0.13 NA
## s4 -0.11 -0.07 -0.07 NA
## s5 -0.10 -0.08 -0.09 0.23 NA
## s6 -0.05 -0.02 -0.04 0.18 0.14 NA
## s7 -0.05 -0.08 -0.08 -0.03 0.00 -0.09 NA
## s8 0.01 -0.01 0.02 -0.09 -0.07 -0.08 0.15 NA
## s9 -0.09 -0.12 -0.09 0.03 0.06 -0.03 0.24 0.07 NA
You can see that the residuals are mostly close to zero except those within
clusters s1-s3, s4-s6, and s7-s9, where residuals are mostly above 0.1 and as high
as 0.24. Evidently, one factor cannot quite explain the dependencies of subtests
within these clusters.
We conclude that the 1-factor model is not adequate for these data.
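Now fit the hypothesized 3-factor model (fit3 is the object name used in later calls such as fit3$PVAL and fit3$Phi):

```r
fit3 <- fa(Thurstone, nfactors = 3, n.obs = 215)
fit3
```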
##
## Factor analysis with Call: fa(r = Thurstone, nfactors = 3, n.obs = 215)
##
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the model is 12 and the objective function was 0.01
## The number of observations was 215 with Chi Square = 3.01 with prob < 1
##
## The root mean square of the residuals (RMSA) is 0.01
## The df corrected root mean square of the residuals is 0.01
##
## Tucker Lewis Index of factoring reliability = 1.026
## RMSEA index = 0 and the 10 % confidence intervals are 0 0
## BIC = -61.44
## With factor correlations of
## MR1 MR2 MR3
## MR1 1.00 0.59 0.53
## MR2 0.59 1.00 0.52
## MR3 0.53 0.52 1.00
QUESTION 3. What is the chi square statistic, and p value, for the tested
3-factor model? Would you retain or reject this model based on the chi-square
test? Why?
QUESTION 4. Find and interpret the RMSR in the output.
Examine the 3-factor model residuals.
residuals.psych(fit3, diag=FALSE)
## s1 s2 s3 s4 s5 s6 s7 s8 s9
## s1 NA
## s2 0.01 NA
## s3 0.00 -0.01 NA
## s4 -0.01 0.00 0.01 NA
## s5 0.01 0.00 0.00 0.00 NA
## s6 0.00 0.00 0.00 0.00 0.00 NA
## s7 0.00 0.01 -0.01 0.01 -0.01 0.00 NA
## s8 -0.01 0.00 0.02 0.00 0.00 0.01 0.00 NA
## s9 0.01 -0.01 0.00 0.00 0.01 0.00 0.00 0.00 NA
QUESTION 5. Examine the residual correlations. What can you say about
them? Are there any non-trivial residuals (greater than 0.1 in absolute value)?
We will now interpret the 3-factor solution, which was rotated obliquely (by
default, rotate="oblimin" is used in fa() function).
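The loadings table below can be printed from the saved object (the cutoff value is an assumption based on the smallest loading displayed):

```r
print(fit3$loadings, cutoff = 0.1)
```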
##
## Loadings:
## MR1 MR2 MR3
## s1 0.901
## s2 0.890
## s3 0.838
## s4 0.853
## s5 0.747 0.104
## s6 0.180 0.626
## s7 0.842
## s8 0.382 0.463
## s9 0.212 0.627
##
## MR1 MR2 MR3
## SS loadings 2.489 1.732 1.337
## Proportion Var 0.277 0.192 0.149
## Cumulative Var 0.277 0.469 0.617
QUESTION 6. Examine the factor loadings. Does this solution conform to the
hypothesized 3-factor structure? Try to interpret the factors.
I use a 0.32 cut-off for non-trivial loadings since this translates into 10% of
variance shared with the factor in an orthogonal solution. This value is quite
arbitrary, and many use 0.3 instead.
QUESTION 7. Are there any non-trivial cross-loadings?
We now ask for “Phi” - the correlation matrix of latent factors. These corre-
lations are model-based; that is, they are estimated as parameters within the
model. These correlations are between the latent factors (NOT their
imperfect measurements, e.g. sum scores), and therefore they are the closest
you can get to estimating how the actual ability domains are correlated in the
population.
fit3$Phi
QUESTION 8. Examine the factor correlation matrix. What can you say about
the correlations between the ability domains?
After you have finished working with this exercise, save your R script by pressing
the Save icon in the script window. To keep all the created objects (such as
fit3), which might be useful when revisiting this exercise, save your entire
work space. To do that, when closing the project by pressing File / Close
project, make sure you select Save when prompted to save your “Workspace
image”.
10.4 Solutions
Q1. MSA = 0.88. According to Kaiser’s guidelines, the data are “meritorious”
(i.e. very much suitable for factor analysis). Refer to Exercise 7 to remind
yourself of this index and its interpretation.
Q2. The scree plot has a very steep drop from the first factor to the second,
and then a slightly ambiguous “step” formed from the 2nd and the 3rd factors.
The 2nd factor could already be considered part of “scree” (rubble), and Par-
allel Analysis based on PC (principal components) eigenvalues agrees with this
judgement. Alternatively, the 2nd and 3rd factors could be considered parts of
the hard rock and the “scree” beginning with the 4th factor. Parallel Analysis
based on FA (factor) eigenvalues supports this alternative, suggesting 3 factors.
Q3. Chi-square = 3.01 (df=12) with p value close to 1; the exact p value can
be obtained from the saved object:
fit3$PVAL
## [1] 0.9955007
We retain the model, because the data are entirely consistent with the 3-factor
model.
Q4. The root mean square of the residuals (RMSR) is 0.01. This is a
very small value, indicating excellent fit.
Q5. The residual correlations are all very close to 0. There are no residuals
greater than 0.1 in absolute value.
Q6. This obliquely rotated solution conforms to the hypothesized structure
and is easy to interpret. Factor 1 comprises subtests s1-s3, Factor 2
comprises subtests s4-s6, and Factor 3 comprises subtests s7-s9.
The factors can be labelled according to the prior hypothesis: Verbal
Ability, Word Fluency and Reasoning Ability, and these are 3 correlated domains
of ability.
Q7. Yes, subtest s8 “Pedigrees” cross-loads on F1 (loading = 0.382) as well as
on its own domain F3 (loading = 0.463). We can interpret this as indicating
that both Verbal Ability and Reasoning Ability are needed to complete this subtest.
Q8. All correlations between domains are positive and large (just over 0.5),
which is to be expected from measures of ability.
Exercise 11
EFA of personality scale scores
11.1 Objectives
The NEO PI-R is based on the Five Factor model of personality, measuring
five broad domains, namely Neuroticism (N), Extraversion (E), Openness (O),
Agreeableness (A) and Conscientiousness (C). In addition, NEO identifies 6
facet scales for each broad factor, measuring 30 subscales in total. The facet
subscales are listed below:
The NEO PI-R Manual reports correlations of the 30 facets based on N=1000
subjects. Despite having no access to the actual scale scores for the 1000 sub-
jects, the correlation matrix is all we need to perform factor analysis of the
subscales.
To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.
I recommend downloading the data file neo.csv into a new folder and, in
RStudio, associating a new project with this folder. Create a new script, where
you will type all commands.
The data this time is actually a correlation matrix, stored in the comma-
separated (.csv) format. You can preview the file by clicking on neo.csv in
the Files tab in RStudio, and selecting View File. This will not import the data
yet, just open the actual .csv file. You will see that the first row contains the
NEO facet names - “N1”, “N2”, “N3”, etc., and the first entry in each row is
again the facet names. To import this correlation matrix into RStudio preserv-
ing all these facet names for each row and each column, we will use function
read.csv(). We will say that we have headers (header=TRUE), and that the
row names are contained in the first column (row.names=1).
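Putting this together (assuming neo.csv sits in your project folder):

```r
# import the 30x30 facet correlation matrix, keeping facet names
# as both column headers and row names
neo <- read.csv("neo.csv", header = TRUE, row.names = 1)
```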
We have just read the data into a new object named neo, which appeared in
your Environment tab. Examine the object by either pressing on it, or calling
function View(neo). You will see that the data are indeed a correlation matrix,
with values 1 on the diagonal.
First, load the package psych to enable access to its functionality. Conveniently,
most functions in this package easily accommodate analysis with correlation
matrices as well as with raw data.
library(psych)
Before you start EFA, request the Kaiser-Meyer-Olkin (KMO) index – the mea-
sure of sampling adequacy (MSA):
KMO(neo)
QUESTION 1. Examine the output. What is the overall measure of sampling
adequacy (MSA), and how do you interpret it?
In the case of NEO PI-R, we have a clear prior hypothesis about the number
of factors underlying the 30 facets of NEO. The instrument was designed to
measure the Five Factors of personality; therefore we would expect the facets
to be indicators of 5 broad factors.
We will use function fa.parallel() to produce a scree plot for the observed
data, and compare it to that of a random (i.e. uncorrelated) data matrix of the
same size. This time, in addition to the actual data (our correlation matrix,
neo), we need to supply the sample size (n.obs=1000) to enable simulation of
random data, because from the correlation matrix alone it is impossible to know
how big the sample was. We will also change the default estimation method to
maximum likelihood (fm="ml"), because the sample is large and the scale scores
are reported to be normally distributed. We will again display only eigenvalues
for principal components (fa="pc").
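The call with these options might look like this:

```r
fa.parallel(neo, n.obs = 1000, fm = "ml", fa = "pc")
```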
[Scree plot: eigenvalues for principal components of the actual data versus
simulated data, plotted against component number.]
## Parallel analysis suggests that the number of factors = NA and the number of compo
Remember that when interpreting the scree plot, you retain only the factors on
the blue (real data) scree plot that are ABOVE the red (simulated random data)
plot, in which we know there is no common variance, only random variance,
i.e. “scree” (rubble).
QUESTION 2. Examine the Scree plot. Does it support the hypothesis that
5 factors underlie the data? Why? Does your conclusion correspond to the text
output of the parallel analysis function?
We will use function fa(), which has the following general form
fa(r, nfactors=1, n.obs = NA, rotate="oblimin", fm="minres" …),
and requires data (argument r), and the number of observations if the data are
a correlation matrix (n.obs). Other arguments have defaults, and we will change
the number of factors to extract from the default 1 to the expected number of
factors, nfactors=5, and the estimation method (fm="ml"). We are happy with
the oblique rotation method, rotate="oblimin".
Specifying all necessary arguments, we can simply call the fa() function to test
the hypothesized 5-factor model. However, it will be very convenient to store
the factor analysis results in a new object, which can be “interrogated” later
when we need to extract specific parts of the results. So, we will assign the
results of fa() to an object named (arbitrarily) FA_neo.
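The assignment, followed by printing the stored results:

```r
FA_neo <- fa(neo, nfactors = 5, n.obs = 1000, fm = "ml")
FA_neo
```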
##
## Factor analysis with Call: fa(r = neo, nfactors = 5, n.obs = 1000, fm = "ml")
##
## Test of the hypothesis that 5 factors are sufficient.
## The degrees of freedom for the model is 295 and the objective function was 1.47
## The number of observations was 1000 with Chi Square = 1443.45 with prob < 1.8e-150
##
## The root mean square of the residuals (RMSA) is 0.03
## The df corrected root mean square of the residuals is 0.04
##
## Tucker Lewis Index of factoring reliability = 0.863
## RMSEA index = 0.062 and the 10 % confidence intervals are 0.059 0.066
## BIC = -594.34
## With factor correlations of
## ML1 ML4 ML3 ML2 ML5
## ML1 1.00 -0.41 -0.12 -0.13 -0.07
## ML4 -0.41 1.00 0.07 0.21 0.08
## ML3 -0.12 0.07 1.00 0.01 -0.16
## ML2 -0.13 0.21 0.01 1.00 0.32
## ML5 -0.07 0.08 -0.16 0.32 1.00
Examine the summary output. Try to answer the following questions (you can
refer to instructional materials of your choice for help; I recommend McDonald’s
Test Theory). I will indicate which parts of the output you need to answer each
question.
QUESTION 3. Find the chi-square statistic testing the 5-factor model (look for "Likelihood Chi Square"). How many degrees of freedom are there? What are the chi-square statistic and p value for the tested model? Would you retain or reject this model based on the chi-square test? Why?
Next, let’s examine the model residuals, which are direct measures of model
(mis)fit. First, you can evaluate the root mean square of the residuals (RMSR),
which is a summary of all residuals.
QUESTION 4. Find and interpret the RMSR in the output.
Package psych has a nice function residuals.psych() that pulls the residuals
from the saved factor analysis results (object FA_neo) and prints them in a
user-friendly way.
residuals.psych(FA_neo)
The above call will print the residual matrix with uniquenesses on the diagonal. These are discrepancies between the observed item variances (equal to 1 in this standardized solution) and the variances explained by the 5 common factors, or communalities (uniqueness = 1 − communality). To remove these diagonal values from the output, use:
residuals.psych(FA_neo, diag=FALSE)
## N1 N2 N3 N4 N5 N6 E1 E2 E3 E4 E5
## N1 NA
## N2 0.00 NA
## N3 0.01 0.01 NA
## N4 0.00 -0.04 0.01 NA
## N5 -0.03 0.02 -0.02 0.02 NA
## N6 0.04 0.00 -0.02 0.00 0.00 NA
## E1 0.02 0.00 0.00 -0.02 -0.03 0.02 NA
## E2 0.04 -0.02 -0.01 -0.03 -0.06 0.08 0.09 NA
## E3 0.02 0.02 0.02 -0.06 -0.02 -0.01 0.04 0.08 NA
## E4 -0.02 0.00 0.01 0.03 -0.04 0.02 -0.06 0.01 0.01 NA
## E5 -0.01 -0.04 0.02 0.01 0.01 0.00 -0.04 0.09 -0.10 0.04 NA
## E6 -0.02 -0.01 -0.03 0.03 0.01 -0.01 -0.01 -0.06 -0.05 0.06 0.04
## O1 0.04 -0.03 0.01 0.00 0.02 -0.01 -0.03 -0.05 -0.03 -0.04 0.03
## O2 -0.05 0.01 -0.02 -0.01 0.01 0.03 0.00 0.03 -0.01 0.01 -0.02
## O3 0.00 0.04 0.00 -0.01 0.02 -0.03 0.03 -0.02 -0.01 -0.05 -0.05
## O4 0.01 -0.01 0.00 -0.02 -0.05 0.00 -0.03 0.08 0.00 0.06 0.04
## O5 0.01 -0.03 0.01 0.02 0.00 -0.02 0.00 -0.06 0.01 0.00 0.06
## O6 0.04 0.04 0.00 0.01 -0.03 -0.02 0.00 -0.03 -0.05 0.04 0.01
## A1 -0.02 0.02 0.00 -0.01 0.02 0.02 0.03 0.02 0.03 0.03 -0.05
## A2 -0.03 0.05 -0.01 -0.03 0.03 0.02 -0.04 -0.02 -0.01 0.04 -0.01
## A3 0.01 0.00 0.02 0.00 0.04 -0.04 0.02 -0.08 0.01 -0.02 -0.01
## A4 0.01 -0.06 0.01 0.02 -0.04 0.02 -0.03 0.05 -0.04 0.04 0.00
## A5 -0.03 0.02 0.02 0.00 -0.01 -0.07 -0.02 -0.06 -0.03 0.02 0.03
## A6 0.00 0.01 0.02 -0.02 0.01 -0.01 -0.02 0.01 0.01 -0.02 0.03
## C1 0.02 0.02 -0.01 0.02 0.05 -0.04 -0.01 -0.04 0.00 -0.05 -0.03
## C2 0.00 -0.02 -0.01 0.00 0.02 0.01 0.01 0.02 -0.02 -0.01 0.05
## C3 -0.01 0.01 0.01 0.00 0.02 0.00 0.04 0.02 0.03 -0.02 -0.05
## C4 -0.01 0.00 0.00 0.02 0.01 0.01 0.01 0.00 0.04 0.07 -0.01
## C5 0.01 -0.02 0.00 0.00 -0.03 -0.01 -0.03 0.00 -0.05 0.02 0.06
## C6 0.06 -0.02 -0.01 -0.02 -0.09 0.02 0.03 0.04 0.00 -0.07 0.00
## E6 O1 O2 O3 O4 O5 O6 A1 A2 A3 A4
## E6 NA
## O1 0.01 NA
## O2 0.03 -0.03 NA
## O3 0.03 0.05 0.03 NA
## O4 -0.02 0.00 0.04 -0.02 NA
## O5 -0.01 0.00 0.02 -0.04 -0.03 NA
## O6 -0.01 0.02 -0.08 -0.01 0.09 0.01 NA
## A1 -0.03 0.01 -0.01 -0.02 -0.03 -0.02 0.08 NA
## A2 0.01 -0.05 0.04 -0.03 0.05 -0.01 0.00 0.05 NA
## A3 0.02 0.02 -0.04 0.01 -0.03 0.02 -0.02 -0.03 0.01 NA
## A4 0.03 0.01 0.00 -0.04 0.05 0.00 0.01 0.00 0.00 -0.01 NA
## A5 0.02 -0.05 0.02 0.04 0.05 -0.01 -0.04 -0.01 0.06 -0.02 0.01
## A6 -0.04 0.01 0.01 0.00 -0.02 0.01 0.03 0.02 -0.01 0.05 -0.03
## C1 0.00 0.04 -0.05 0.05 -0.06 0.01 0.06 0.02 0.01 0.04 -0.03
## C2 0.01 -0.01 0.06 0.00 -0.01 -0.01 -0.03 -0.03 0.00 -0.03 0.01
## C3 -0.04 0.00 -0.02 0.03 -0.05 0.00 -0.02 0.05 0.02 0.00 -0.03
## C4 -0.01 -0.05 0.03 -0.04 0.03 0.01 -0.03 0.02 0.00 -0.02 0.04
## C5 0.01 0.04 -0.01 -0.01 0.05 -0.01 0.02 -0.01 0.00 -0.01 0.01
## C6 -0.01 0.00 -0.01 0.04 -0.05 0.00 0.00 -0.06 -0.05 0.01 -0.01
## A5 A6 C1 C2 C3 C4 C5 C6
## A5 NA
## A6 0.04 NA
## C1 -0.09 0.02 NA
## C2 0.00 -0.01 -0.02 NA
## C3 -0.01 -0.03 0.03 -0.04 NA
## C4 0.01 -0.03 -0.04 -0.02 0.01 NA
## C5 0.03 0.02 0.00 0.06 -0.02 0.01 NA
## C6 -0.07 0.03 0.08 0.04 0.02 -0.04 -0.04 NA
QUESTION 5. Examine the residual correlations. What can you say about
them? Are there any non-trivial residuals (greater than 0.1 in absolute value)?
FA_neo$loadings
##
## Loadings:
## ML1 ML4 ML3 ML2 ML5
## N1 0.798
## N2 0.578 -0.437 -0.123
## N3 0.783 -0.109
## N4 0.694 0.100
## N5 0.373 -0.252 -0.282 0.295
## N6 0.650 -0.251
## E1 0.170 0.690 0.105
## E2 -0.155 0.547
## E3 -0.290 0.227 -0.406 0.250 0.164
## E4 0.394 -0.382 0.358
## E5 -0.478 0.380
## E6 -0.116 0.696
## O1 -0.317 -0.144 0.114 0.474
## O2 0.129 0.183 0.699
## O3 0.331 0.342 0.418
## O4 -0.173 0.153 0.427
## O5 -0.149 -0.127 0.710
## O6 -0.123 -0.157 0.325
## A1 -0.282 0.422 0.306 0.111
## A2 0.219 0.627
## A3 0.208 0.353 0.635 -0.109
## A4 0.728 0.112
## A5 0.197 0.518 -0.122
## A6 0.457 0.373
## C1 -0.294 0.555 0.104
## C2 0.662 -0.135
## C3 0.619 0.273
## C4 0.715 -0.170 0.159
## C5 -0.180 0.727
## C6 -0.116 0.496 0.276 -0.229
##
## ML1 ML4 ML3 ML2 ML5
## SS loadings 3.228 3.020 2.834 2.612 1.855
## Proportion Var 0.108 0.101 0.094 0.087 0.062
## Cumulative Var 0.108 0.208 0.303 0.390 0.452
Note that in the factor loadings matrix, the factor labels do not always correspond to the column number. For example, the factor named ML1 is in the 1st column, but ML2 is in the 4th column. This does not matter; just pay attention to the factor name that you quote.
Note. I use a 0.32 cut-off for non-trivial loadings since this translates into 10%
of variance shared with the factor in an orthogonal solution. This value is quite
arbitrary, and many use 0.3 instead.
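As a side note, if you want the printed loadings to reflect this cut-off, the print method for loadings objects accepts a cutoff argument (a convenience, not something the exercise requires):

```r
# Suppress loadings below 0.32 in absolute value when printing
print(FA_neo$loadings, cutoff = 0.32)
```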
QUESTION 7. Refer to the NEO facet descriptions in the supplementary
document to hypothesize why any non-trivial cross-loadings may have occurred.
Examine columns h2 and u2 of the factor loading matrix. The h2 column is the facet's communality (proportion of variance due to all of the common factors), and the u2 column is its uniqueness (u2 = 1 − h2).
After you have finished working with this exercise, save your R script by pressing the Save icon in the script window. To keep all the created objects (such as FA_neo), which might be useful when revisiting this exercise, save your entire workspace. To do that, when closing the project (File / Close project), make sure you select Save when prompted to save your "Workspace image".
11.4 Solutions
Q1. MSA = 0.87. The data are “meritorious” according to Kaiser’s guidelines
(i.e. very much suitable for factor analysis). Refer to Exercise 7 to remind
yourself of this index and its interpretation.
Q2. There is a clear change between the steep slope ("mountain") and the shallow slope ("scree" or rubble) in the Scree plot. The first five factors form a mountain;
factor #6 and all subsequent factors belong to the rubble pile and therefore
we should proceed with 5 factors. Parallel analysis confirms this decision –
five factors of the blue plot (observed data) are above the red dotted line (the
simulated random data).
Q3. Chi-square=1443.45 on DF=295 (p=1.8E-150). The p value is given in the
scientific format, with 1.8 multiplied by 10 to the power of -150. It is so tiny
(divided by 10 to the power of 150) that we can just say p<0.001. We should
reject the model, because the probability of observing the given correlation
matrix in the population where the 5-factor model is true (the hypothesis we
are testing) is vanishingly small (p<0.001). However, the chi-square test is very sensitive to sample size, and our sample is large. We should therefore judge the fit based on other criteria, for instance the residuals.
Q4. The RMSR of 0.03 is a very small value, indicating that the 5-factor model reproduces the observed correlations well.
Q5. All residuals are very small (close to zero), and none are above 0.1 in
absolute value. Since the 30x30 matrix of residuals is somewhat hard to eyeball,
you can create a histogram of the values pulled by residuals.psych():
hist(residuals.psych(FA_neo, diag=FALSE))
[Figure: histogram of the residual correlations, tightly clustered around zero.]
Q6. The factor correlations are mostly modest. The strongest correlation is between ML1 (Neuroticism) and ML4 (Conscientiousness) (r=–.41).
Part V
ITEM RESPONSE THEORY (IRT)
Exercise 12
Fitting 1PL and 2PL models to dichotomous questionnaire data
12.1 Objectives
We already fitted a linear single-factor model to the Eysenck Personality Ques-
tionnaire (EPQ) Neuroticism scale in Exercise 8, using tetrachoric correlations
of its dichotomous items. The objective of this exercise is to fit two basic Item
Response Theory (IRT) models - 1-parameter logistic (1PL) and 2-parameter
logistic (2PL) - to the actual dichotomous responses to EPQ Neuroticism items.
After selecting the most appropriate response model, we will plot Item Char-
acteristic Curves, and examine item difficulties and discrimination parameters.
Finally, we will estimate people’s trait scores and their standard errors.
12.2 Worked Example - Fitting 1PL and 2PL models to EPQ Neuroticism items
Eysenck, 1998), with N = 1,381 participants who completed the Eysenck Person-
ality Questionnaire (EPQ). The focus of our analysis here will be the Neuroti-
cism/Anxiety (N) scale, measured by 23 items with only two response options -
either “YES” or “NO”, for example:
You can find the full list of EPQ Neuroticism items in Exercise 6. Please
note that all items indicate “Neuroticism” rather than “Emotional Stability”
(i.e. there are no counter-indicative items).
If you have already worked with this data set in Exercises 6 and 8, the sim-
plest thing to do is to continue working within one of the projects created back
then. In RStudio, select File / Open Project and navigate to the folder and the
project you created. You should see the data frame EPQ appearing in your
Environment tab, together with other objects you created and saved.
If you have not completed relevant exercises, or simply want to start from
scratch, download the data file EPQ_N_demo.txt into a new folder and
follow instructions from Exercise 6 on creating a project and importing the
data.
The object EPQ should appear on the Environment tab. Click on it and the
data will be displayed on a separate tab. As you can see, there are 26 variables
in this data frame, beginning with participant id, age and sex (0 = female; 1
= male). These demographic variables are followed by 23 item responses, which
are either 0 (for “NO”) or 1 (for “YES”). There are also a few missing responses,
marked with NA.
Because the item responses are stored in columns 4 to 26 of the EPQ data
frame, we will have to refer to them as EPQ[ ,4:26] in future analyses. More
conveniently, we can create a new object containing only the item response data:
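The command creating this object is not shown in the extract; given that the object is referred to as N_data later on, it was presumably:

```r
# Keep only the 23 item response columns (columns 4 to 26)
N_data <- EPQ[, 4:26]
```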
You will need package ltm (stands for “latent trait modelling”) installed on your
computer. Select Tools / Install packages… from the RStudio menu, type in ltm
and click Install. Make sure you load the installed package into the working
memory:
library(ltm)
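The fitting command itself is missing from this extract. Since the discrimination parameters printed below are identical across items, the model was presumably fitted with ltm's rasch() function, which estimates a single common discrimination parameter by default:

```r
# Fit the 1PL (Rasch-type) model to the dichotomous item responses
fit1PL <- rasch(N_data)
```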
Once the command has been executed, examine the model results by calling
summary(fit1PL).
You should see some model fit statistics, and the estimated item parameters.
Take note of the log likelihood (see Model Summary at the beginning of the
output). Log likelihood can only be interpreted in relative terms (compared
to a log likelihood of another model), so you will have to wait for results from
another model before using it.
For item parameters printed in a more convenient format, call coef(fit1PL):
## Dffclt Dscrmn
## N_3 -0.57725861 1.324813
## N_7 -0.18590831 1.324813
## N_12 -1.35483500 1.324813
## N_15 0.89517911 1.324813
## N_19 -0.32174146 1.324813
## N_23 -0.06547699 1.324813
## N_27 0.10182697 1.324813
## N_31 0.68572728 1.324813
## N_34 -0.41414515 1.324813
## N_38 -0.13645633 1.324813
## N_41 0.80993448 1.324813
## N_47 -0.34271380 1.324813
## N_54 0.78134781 1.324813
## N_58 -0.66528672 1.324813
## N_62 0.97477759 1.324813
## N_66 -0.34506389 1.324813
You should see the difficulty and discrimination parameters. Difficulty refers to
the latent trait score at the point of inflection of the probability function (item
characteristic curve, or ICC); this is the point where the curve changes from convex to concave. In 1-parameter and 2-parameter logistic models, this point
corresponds to the latent trait score where the probability of ‘YES’ response
equals the probability of ‘NO’ response (both P=0.5). Discrimination refers to
the steepness (slope) of the ICC at the point of inflection (at the item difficulty).
Now plot item characteristic curves for this model using plot() function:
plot(fit1PL)
[Figure: item characteristic curves for all 23 Neuroticism items under the 1PL model; Probability (0 to 1) plotted against Ability (−4 to 4). All curves are parallel.]
Examine the item characteristic curves (ICCs). First, you should notice that
they run in “parallel” to each other, and do not cross. This is because the slopes
are constrained to be equal in the 1PL model. Try to estimate the difficulty
levels of the most extreme items on the left and on the right, just by looking at
the approximate trait values where the corresponding probabilities equal 0.5. Check whether these values equal the difficulty parameters printed in the output.
To fit a 2-parameter logistic (2PL) model, use ltm() function (stands for ‘latent
trait model’) as follows:
The function uses a formula, described in the ltm package manual. The formula
follows regression conventions used by base R and many packages – it specifies
that items in the set N_data are ‘regressed on’ (~) one latent trait (z1). Note
that z1 is not an arbitrary name; it is actually fixed in the package. At most,
two latent traits can be fitted (z1 + z2). We are only fitting one trait and
therefore we specify ~ z1.
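Putting this together, the fitting command was presumably:

```r
# Fit the 2PL model: all items regressed on one latent trait z1
fit2PL <- ltm(N_data ~ z1)
```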
When the command has been executed, see the results by calling summary(fit2PL).
You should see some model fit statistics, and then estimated item parameters
– difficulties and discriminations. Take note of the log likelihood for this
model. We will test how the 2PL model compares to the 1PL model later using
this value.
For the item parameters printed in a convenient format, call coef(fit2PL):
## Dffclt Dscrmn
## N_3 -0.51605652 1.6321883
## N_7 -0.17666140 1.4774008
## N_12 -1.22040224 1.5911491
## N_15 1.03197005 1.0570904
## N_19 -0.26737063 1.9740413
## N_23 -0.06402011 1.5006632
## N_27 0.09002533 1.5308778
## N_31 0.59894671 1.6672151
## N_34 -0.32318252 2.3270028
## N_38 -0.13268055 1.4266667
## N_41 0.75664387 1.4738613
## N_47 -0.52960514 0.7057917
## N_54 1.08776504 0.8186594
## N_58 -0.69064694 1.2397378
## N_62 1.05886259 1.1532931
## N_66 -0.42109219 0.9640119
## N_68 0.32362475 0.9217228
## N_72 -0.40104331 1.6932412
## N_75 0.48193055 1.8070986
## N_77 0.31094924 1.5627476
## N_80 -0.42017648 1.5126054
## N_84 -1.92770607 0.8595893
## N_88 -2.57944224 0.9375370
plot(fit2PL)
[Figure: item characteristic curves for all 23 Neuroticism items under the 2PL model; Probability (0 to 1) plotted against Ability (−4 to 4). The curves now differ in steepness and some cross.]
You have fitted two IRT models to the Neuroticism items. These models are nested - one is a special case of the other (the 1PL model is a special case of the 2PL with all discrimination parameters constrained equal). The 1PL (Rasch) model
is more restrictive than the 2PL model because it has fewer parameters (23
difficulty parameters and only 1 discrimination parameter, against 23 difficulties
+ 23 discrimination parameters in the 2PL model), and we can test which model
fits better. Use the base R function anova() to compare the models:
anova(fit1PL, fit2PL)
##
## Likelihood Ratio Table
## AIC BIC log.Lik LRT df p.value
## fit1PL 34774.05 34899.59 -17363.03
## fit2PL 34471.99 34712.60 -17189.99 346.06 22 <0.001
This function prints the log likelihood values for the two models (you already
saw them). It computes the likelihood ratio test (LRT) by multiplying each log
likelihood by -2, and then subtracting the lower value from the higher value.
The resulting value is distributed as chi-square, with degrees of freedom equal to the difference in the number of parameters between the two models.
QUESTION 6. Examine the results of the anova() function. The LRT value is the likelihood ratio test result (the chi-square statistic). What are the degrees of freedom? Can you explain why the degrees of freedom are what they are? Is the difference between the two models significant? Which model would you retain and why?
Now we can score people in this sample using function factor.scores() and
Empirical Bayes method (EB). We will use the better-fitting 2PL model.
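A command along the following lines produces the scores (treat the exact arguments as a sketch; resp.patterns is passed here so that scores are computed for every respondent's observed response pattern):

```r
# Estimate trait scores and their standard errors for each respondent
# using the Empirical Bayes method and the 2PL model
Scores <- factor.scores(fit2PL, method = "EB", resp.patterns = N_data)
```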
The estimated scores together with their standard errors will be stored for each
respondent in a new object Scores. Check out what components are stored in
this object by calling head(Scores). It appears that the estimated trait scores
(z1) and their standard errors (se.z1) are stored in $score.dat part of the
Scores object.
To make our further work with these values easier, let’s assign them to new
variables in the EPQ data frame:
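Given the component names described above ($score.dat, z1 and se.z1), the assignment looks like this:

```r
# Copy the estimated trait scores and their standard errors into EPQ
EPQ$Zscore <- Scores$score.dat$z1
EPQ$Zse    <- Scores$score.dat$se.z1
```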
Now, you can plot the histogram of the estimated IRT scores, by calling
hist(EPQ$Zscore).
You can also examine relationships between the IRT estimated scores and simple sum scores. This is how we computed the sum scores previously (see Exercise 6):
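One way to compute them (a sketch; Exercise 6 may have used a slightly different form) is:

```r
# Sum the 23 item responses for each person, ignoring missing responses
EPQ$Nscore <- rowSums(EPQ[, 4:26], na.rm = TRUE)
```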
Then plot the sum score against the IRT score. Note that the relationship
between these scores is very strong but not linear. Rather, it resembles a logistic
curve.
plot(EPQ$Zscore, EPQ$Nscore)
[Figure: scatter plot of EPQ$Nscore (sum score, y-axis) against EPQ$Zscore (IRT score, x-axis), showing a strong, logistic-shaped relationship.]
Now let’s plot the IRT estimated scores against their standard errors:
plot(EPQ$Zscore, EPQ$Zse)
[Figure: scatter plot of EPQ$Zse (standard errors, y-axis, roughly 0.30 to 0.50) against EPQ$Zscore (x-axis, roughly −2 to 2); most points form a smooth curve, with a few outliers above it.]
A very interesting feature of this graph is that a few points are out of line with the majority (which form a very smooth curve). This means that SEs can differ between people with the same latent trait score. Specifically, SEs can be larger for some people (note that the outliers are always above the trend, never below). The reason is that some individuals had missing responses; because fewer items provided information for their score estimation, their SEs were larger than for individuals with complete response sets.
After you have finished this exercise, you may close the project. Do not forget to
save your R script, giving it a meaningful name, for example “IRT analysis of
EPQ_N”.
Please also make sure that you save your entire ‘workspace’, which includes the
data frame and all the new objects you created. This will be useful for your
revision.
12.3 Solutions
Q1. The difficulty parameters of Neuroticism items vary widely. The ‘easiest’
item to endorse is N_88, with the lowest difficulty value (-2.03). N_88 is
phrased “Are you touchy about some things?”, and according to the 1PL model,
at very low neuroticism level z=–2.03 (remember, the trait score is scaled like
z-score), the probability of agreeing with this item is already 0.5. The most
‘difficult’ item to endorse in this set is N_62, with the highest difficulty value
(0.97). N_62 is phrased "Do you often feel life is very dull?", and according to this model, one needs to have a neuroticism level of at least 0.97 to endorse this item with probability 0.5. The phrasing of this item is indeed more extreme
than that of item N_88, and it is therefore more ‘difficult’ to endorse.
Q2. Only one discrimination parameter is printed for this set because the 1PL
(Rasch) model assumes that all items have equal discriminations. Therefore the
model constrains all discriminations to be equal.
Q3. The ‘easiest’ item to endorse in this set is still item N_88, which now has
the difficulty value -2.58. The most ‘difficult’ item to endorse in this set is now
item N_54, with the difficulty value (1.09). The most difficult item from the
1PL model, N_62, has a similarly high difficulty (1.06). N_54 is phrased “Do
you suffer from sleeplessness?”, and according to this model, one needs to have
neuroticism level of at least 1.09 to endorse this item with probability 0.5.
Q4. The most discriminating item in this set is N_34 (Dscrmn.=2.33). This
item reads “Are you a worrier?”, which seems to be right at the heart of the
Neuroticism construct. This is the item which would have the highest factor
loading in factor analysis of these data. The least discriminating item is N_47
(Dscrmn.=0.71), reading “Do you worry about your health?”. According to this
model, the item is more marginal to the general Neuroticism construct (perhaps
it tackles a context-specific behaviour).
Q5. The most and the least discriminating items can be easily seen on the
plot. The most discriminating item N_34 has the steepest slope, and the least
discriminating item N_47 has the shallowest slope.
Q6. The degrees of freedom = 22. This is made up by the difference in the
number of item parameters estimated. The Rasch model estimated 23 difficulty
parameters, and only 1 discrimination parameter (one for all items). The 2PL
model estimated 23 difficulty parameters, and 23 discrimination parameters.
The difference between the two models = 22 parameters.
The chi-square value 346.06 on 22 degrees of freedom is highly significant, so
the 2PL model fits the data much better and we have to prefer it to the more
parsimonious but worse fitting 1PL model.
Q7. Most precise measurement is achieved in the range between about z=-0.2
and z=0. The smallest standard error was about 0.3 (exact value 0.299). The
largest standard error was about 0.57 (exact value 0.573).
Exercise 13
Fitting a graded response model to polytomous questionnaire data
13.1 Objectives
Data for this exercise come from the Medical Research Council National Survey
of Health and Development (NSHD), also known as the British 1946 birth cohort
study. The data pertain to a wave of interviewing undertaken in 1999, when
the participants were aged 53.
At that point, N = 2,901 participants (1,422 men and 1,479 women) completed
the 28-item version of General Health Questionnaire (GHQ-28). The GHQ-28
was developed as a screening questionnaire for detecting non-psychotic psy-
chiatric disorders in community settings and non-psychiatric clinical settings
(Goldberg, 1978). You can find a short description of the GHQ-28 in the data
archive available with this book.
Please note that some items indicate emotional health and some emotional dis-
tress; however, 4 response options for each item are custom-ordered depending
on whether the item measures health or distress, as follows:
With this coding scheme, for every item, the score of 1 indicates good health
or lack of distress; and the score of 4 indicates poor health or a lot of distress.
Therefore, high scores on the questionnaire indicate emotional distress.
To complete this exercise, you need to repeat the analysis from the worked example below and answer some questions. To practice further, you may repeat the analyses for the remaining GHQ-28 subscales.
13.3 A Worked Example - Fitting a Graded Response Model to Somatic Symptoms items
First, download the data file likertGHQ28sex.txt into a new folder and follow
instructions from Exercise 1 on creating a project and loading the data. As the
file here is in the tab-delimited format (.txt), use function read.table() to
import data into the data frame we will call GHQ.
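A minimal import command would look like this (assuming the file has a header row and sits in the project folder):

```r
# Read the tab-delimited GHQ-28 data into a data frame
GHQ <- read.table("likertGHQ28sex.txt", header = TRUE)
```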
The object GHQ should appear on the Environment tab. Click on it and the
data will be displayed on a separate tab. You should see 33 variables, beginning
with the 7 items for Somatic Symptoms scale, SOMAT_1, SOMAT_2, …
SOMAT_7, and followed by items for the remaining subscales. They are
followed by participant sex (0 = male; 1 = female). The last 4 variables in the
file are sum scores (sums of relevant 7 items) for each of the 4 subscales. There
are NO missing data.
Because the item responses for Somatic Symptoms scale are stored in columns
1 to 7 of the GHQ data frame, we can refer to them as GHQ[ ,1:7] in future
analyses. More conveniently, we can create a new object containing only the
relevant item response data:
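Given that the object is called somatic in the analyses below, the command was presumably:

```r
# Keep only the 7 Somatic Symptoms items (columns 1 to 7)
somatic <- GHQ[, 1:7]
```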
Before attempting to fit an IRT model, it would be good to examine whether the
7 items form a homogeneous scale (measure just one latent factor in common).
To this end, we may use Parallel Analysis functionality of package psych. For
a reminder of this procedure, see Exercise 7. In this case, we will treat the
4-point ordered responses as ordinal rather than interval scales, and ask for the
analysis to be performed from polychoric correlations (cor="poly") rather than
Pearson’s correlations that we used before:
library(psych)
fa.parallel(somatic, cor="poly", fa="pc")
[Figure: parallel analysis scree plot for the 7 Somatic Symptoms items, showing eigenvalues of the actual, simulated and resampled data against component number (1 to 7).]
## Parallel analysis suggests that the number of factors = NA and the number of compo
QUESTION 1. Examine the Scree plot. Does it support the hypothesis that
only one factor underlies responses to Somatic Symptoms items? Why? Does
your conclusion agree with the text output of the parallel analysis function?
For IRT analyses, you will need package ltm (stands for “latent trait modelling”)
installed on your computer. Select Tools / Install packages… from the RStudio
menu, type in ltm and click Install. Make sure you load the installed package
into the working memory before starting the below analyses.
library(ltm)
To fit a Graded Response Model (GRM), use grm() function on somatic object
containing item responses:
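The fitted model is stored in an object named fit1 (the name used in the model comparison later); the Call line in the output below confirms the form of the command:

```r
# Fit the Graded Response Model to the 7 ordinal items
fit1 <- grm(somatic)

# Print the coefficients shown below
fit1
```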
##
## Call:
## grm(data = somatic)
##
## Coefficients:
## Extrmt1 Extrmt2 Extrmt3 Dscrmn
## SOMAT_1 -1.707 1.281 2.889 2.350
## SOMAT_2 -0.447 0.791 1.944 2.856
## SOMAT_3 -0.375 0.793 1.965 3.743
## SOMAT_4 0.270 1.231 2.332 3.223
## SOMAT_5 1.014 3.252 5.742 1.092
## SOMAT_6 1.328 2.895 5.392 1.234
## SOMAT_7 0.681 2.754 5.176 0.808
##
## Log.Lik: -15376.89
Examine the basic output above. Take note of the Log.Lik (log likelihood) at
the end of the output. Log likelihood can only be interpreted in relative terms
(compared to a log likelihood of another model), so you will have to wait for
results from another model before using it.
Examine the Extremity and Discrimination parameters. Extremity parameters
are really extensions of the difficulty parameters in binary IRT models, but
instead of describing the threshold between two possible answers in binary items
(e.g. ‘YES’ or ‘NO’), they contrast response category 1 with those above it, then
category 2 with those above 2, etc. There are k-1 extremity parameters for an
item with k response categories. In our 4-category items, Extrmt1 refers to
the latent trait score z at which the probability of selecting any category >1
(2 or 3 or 4) equals 0.5. Below this point, selecting category 1 is more likely
than selecting 2, 3 or 4; and above this point selecting categories 2, 3 or 4 is
more likely. Extrmt2 refers to the latent trait score z at which the probability
of selecting any categories >2 (3 or 4) equals 0.5. And Extrmt3 refers to the
latent trait score z at which the probability of selecting any categories >3 (just
4) equals 0.5. The GRM assumes that the thresholds are ordered - that is,
switching from one response category to the next happens at higher latent trait
score values. That is why the GRM is a suitable model for Likert-type responses,
which are supposedly ordered categories. Discrimination refers to the steepness
(slope) of the response characteristic curves at the extremity points. In GRM,
this parameter is assumed identical for all categories, so the category curves
have the same steepness.
This will be easier to understand by looking at Operation Characteristic Curves
(OCC). These can be obtained easily by calling plot() function. If the default
setting items=NULL is used, then OCCs will be produced for all items and the user can move from one to the next by hitting Return. To obtain the OCC for a particular
item, specify the item number:
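For example, for the first item (SOMAT_1), whose curves are shown below:

```r
# Plot operation characteristic curves for item 1 only
plot(fit1, items = 1)
```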
[Figure: operation characteristic curves for item SOMAT_1, plotting the probability of responding above each of the three category thresholds against Ability (−4 to 4).]
It can be seen that for item 1, Extrmt1= -1.707 is the trait score (the plot refers
to it as “Ability” score) at which the OCC for category 1 crosses Probability 0.5;
Extrmt2= 1.281 corresponds to the trait score at which OCC 2 crosses P=0.5,
and Extrmt3= 2.889 corresponds to the trait score at which OCC 3 crosses
P=0.5.
Plot and examine OCCs for all items.
QUESTION 2. Examine the OCC plots and the corresponding Extremity
parameters. How are the extremities spaced out for different items? Do the
extremity values vary widely? What are the most extreme values in this set?
Examine the phrasing of the corresponding items – can you see why some cat-
egories are “easy” or “difficult” to select?
QUESTION 3. Examine the Discrimination parameters. What is the most
and least discriminating item in this set? Examine the phrasing of the corre-
sponding items – can you interpret the meaning of the construct we are mea-
suring in relation to the most discriminating item (as we did for “marker” items
in factor analysis)?
Now we are ready to look at Item Characteristic Curves (ICC) - the most useful
plot in practice of applying an IRT model. An ICC for each item will consist of
as many curves as there are response categories, in our case 4. Each curve will
represent the probability of selecting this particular response category, given the
latent trait score. Let’s obtain the ICC for item SOMAT_3 by calling plot()
function and specifying type="ICC":
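SOMAT_3 is the third column of somatic, so the call is presumably:

```r
# Plot item characteristic (category response) curves for SOMAT_3
plot(fit1, type = "ICC", items = 3)
```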
[Figure: item characteristic curves for SOMAT_3; the probability of selecting each of the four response categories plotted against Ability (−4 to 4).]
In this pretty plot, each response category from 1 to 4 has its own curve in its own colour. For item SOMAT_3, indicating distress, curve #1 (in black)
represents the probability of selecting response “not at all”. You should be able
to see that at the extremely low values of somatic symptoms (low distress), the
probability of selecting this response approaches 1, and the probability goes
down as the somatic symptoms score goes up (distress increases). Somewhere
around z=-0.5, the probability of selecting response #2 “no more than usual”
plotted in pink becomes equal to the probability of selecting category #1 (the
black and the pink lines cross), and then starts increasing as the somatic symp-
toms score goes up. The pink line #2 peaks somewhere around z=0.2 (which
means that at this level of somatic symptoms, response #2 “no more than usual”
is most likely), and then starts coming down. Then somewhere close to z=1,
the pink line crosses the green line for response #3 “rather more than usual”, so
that this response becomes more likely. Eventually, at about z=2 (high levels of
distress), response #4 “much more than usual” (light blue line) becomes more
likely and it reaches the probability close to 1 for extremely high z scores.
Plot and examine ICCs for all remaining items. Try to interpret the ICC for item
SOMAT_1, which is the only item indicating health rather than distress. See
how with the increasing somatic symptom score, the probability is decreasing for
category #1 “better than usual” and increasing for category #4 “much worse
than usual”.
The standard GRM allows all items to have different discrimination parameters,
and we indeed saw that they varied between items. It would be interesting to
see if these differences are significant - that is, whether constraining the discrimination
parameters to be equal would result in a significantly worse model fit. We can check that
by fitting a model where all discrimination parameters are constrained equal:
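The fitting command is not shown above; judging by the Call line in the output below, it would look like this (assuming the item responses are in the data frame somatic):

```r
# GRM with discrimination parameters constrained equal across items
fit2 <- grm(somatic, constrained = TRUE)
fit2
```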
##
## Call:
## grm(data = somatic, constrained = TRUE)
##
## Coefficients:
## Extrmt1 Extrmt2 Extrmt3 Dscrmn
## SOMAT_1 -1.798 1.406 3.122 1.789
## SOMAT_2 -0.423 1.346 2.814 1.789
## SOMAT_3 -0.360 1.340 2.929 1.789
## SOMAT_4 0.470 1.845 3.184 1.789
## SOMAT_5 0.830 2.367 4.173 1.789
## SOMAT_6 1.137 2.330 4.342 1.789
## SOMAT_7 0.480 1.663 2.963 1.789
##
## Log.Lik: -15668.06
Examine the log likelihood (Log.Lik) for fit2, and note that it is different from
the result for fit1.
You have fitted two IRT models to the Somatic Symptoms items. These models
are nested - one is the special case of another (constrained model is a special case
of the unconstrained model). The constrained model is more restrictive than
the unconstrained model because it has fewer parameters (the same number of
extremity parameters, but only 1 discrimination parameter rather than 7). We can
test whether the constraint worsened the fit significantly with a likelihood ratio
test, using the anova() function:
anova(fit2, fit1)
##
## Likelihood Ratio Table
## AIC BIC log.Lik LRT df p.value
## fit2 31380.12 31511.52 -15668.06
## fit1 30809.78 30977.02 -15376.89 582.34 6 <0.001
This function prints the Log.Lik values for the two models (you already saw
them). It computes the likelihood ratio test (LRT) by multiplying each log
likelihood by -2, and then subtracting the lower value from the higher value.
The resulting value is distributed as chi-square, with degrees of freedom equal
to the difference in the number of parameters between the two models.
QUESTION 4. Examine the results of anova() function. LRT value is the
likelihood ratio test result (the chi-square statistic). What are the degrees of
freedom? Can you explain why the degrees of freedom are what they are? Is the
difference between the two models significant? Which model would you retain
and why?
Now, let’s see if the unconstrained GRM model (which we should prefer) fits
the data well. For this, we will look at so-called two-way margins. They are
obtained by taking two variables at a time, making a contingency table, and
computing the observed and expected frequencies of bivariate responses under
the GRM model (for instance, response #1 to item1 and #1 to item2; response
#1 to item1 and #2 to item2 etc.). The Chi-square statistic is computed from
each pairwise contingency table. A significant Chi-square statistic (greater than
3.8 for 1 degree of freedom) would indicate that the expected bivariate response
frequencies are significantly different from the observed. This means that a pair
of items has relationships beyond what is accounted for by the IRT model. In
other words, the local independence assumption of IRT is violated - the items are
not independent controlling for the latent trait (dependence remains after the
IRT model controls for the influences due to the latent trait). Comparing the
observed and expected two-way margins is analogous to comparing the observed
and expected correlations when judging the fit of a factor analysis model.
To compute these so-called Chi-squared residuals, the package ltm has the function
margins().
margins(fit1)
##
## Call:
## grm(data = somatic)
##
## Fit on the Two-Way Margins
##
## SOMAT_1 SOMAT_2 SOMAT_3 SOMAT_4 SOMAT_5 SOMAT_6 SOMAT_7
## SOMAT_1 - 449.93 501.91 373.66 148.71 152.32 149.56
## SOMAT_2 *** - 328.58 253.63 154.07 153.28 145.72
## SOMAT_3 *** *** - 163.16 124.78 133.09 118.25
## SOMAT_4 *** *** *** - 102.39 96.00 99.20
## SOMAT_5 *** *** *** *** - 879.46 80.40
## SOMAT_6 *** *** *** *** *** - 82.40
## SOMAT_7 *** *** *** *** *** *** -
##
## '***' denotes pairs of items with lack-of-fit
QUESTION 5. Examine the Chi-squared residuals and find the largest one.
Which pair of items does it pertain to? Can you interpret why this pair of items
violates the local independence assumption?
The analysis of two-way margins confirms the results we obtained earlier from
Parallel Analysis - that there is more than one factor underlying responses to
the somatic items. At the very least, there is a second factor pertaining to a
local dependency between two somatic items.
Now we can score people in this sample using the function factor.scores() and
the Empirical Bayes method (EB). We will use the better-fitting unconstrained GRM (fit1).
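A sketch of the scoring call, assuming the unconstrained GRM is stored in fit1 and the item responses are in somatic:

```r
# Empirical Bayes trait estimates for each respondent's response pattern
Scores <- factor.scores(fit1, resp.patterns = somatic, method = "EB")
```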
The estimated scores together with their standard errors will be stored for each
respondent in a new object Scores. Check out what components are stored in
this object by calling head(Scores). It appears that the estimated trait scores
(z1) and their standard errors (se.z1) are stored in $score.dat part of the
Scores object.
To make our further work with these values easier, let’s assign them to new
variables in the GHQ data frame:
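Assuming the estimates sit in Scores$score.dat as described above, the assignment can be sketched as:

```r
GHQ$somaticZ   <- Scores$score.dat$z1    # estimated trait scores
GHQ$somaticZse <- Scores$score.dat$se.z1 # their standard errors
```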
Now, you can plot the histogram of the estimated IRT scores, by calling
hist(GHQ$somaticZ, breaks=20).
You can also examine relationships between the IRT estimated score and the
simple sum score stored in the variable SUM_SOM. [Since there are no missing
data, this variable can be computed by using function rowSums(somatic)]. You
can plot the sum score against the IRT score.
plot(GHQ$somaticZ, GHQ$SUM_SOM)
[Figure: scatterplot of GHQ$SUM_SOM (y axis, roughly 10 to 25) against GHQ$somaticZ (x axis, roughly -1 to 3)]
Note that the relationship between these scores is very strong, and almost linear.
It is certainly more linear than the relationship in Exercise 12, which resembled
a logistic curve. This suggests that the more response categories there are, the
closer IRT models approximate linear models.
Now let’s plot the IRT estimated scores against their standard errors:
plot(GHQ$somaticZ, GHQ$somaticZse)
[Figure: scatterplot of GHQ$somaticZse (y axis, roughly 0.30 to 0.50) against GHQ$somaticZ (x axis, roughly -1 to 3)]
A very interesting feature of this graph is that some points are out of line
with the majority (which form a relatively smooth curve). This means that
Standard Errors can be different for people with the same latent trait score.
Specifically, SEs can be larger for some people (note that the outliers are always
above the trend, not below). The reason for it is that some individuals have
aberrant response patterns - patterns not in line with the GRM model (for
example, endorsing more ‘difficult’ response categories for some items while not
endorsing ‘easier’ categories for other items). That makes estimating their score
less certain.
After you have finished this exercise, you may want to save the new variables you
created in the data frame GHQ. You may store the data in the R internal
format (extension .RData), so next time you can use function load() to read
it.
13.4. SOLUTIONS 177
save(GHQ, file="likertGHQ28sex.RData")
Before closing the project, do not forget to save your R script, giving it a
meaningful name, for example “Fitting GRM model to GHQ”.
13.4 Solutions
Q1. According to the Scree plot of the observed data (in blue) the 1st factor
accounts for a substantial amount of variance compared to the 2nd factor. There
is a large drop from the 1st factor to the 2nd; however, the break from the steep
to the shallow slope is not as clear as it could be, with the 2nd factor probably
still being part of “the mountain” rather than part of “the rubble”. There are
probably 2 factors underlying the data - one major explaining the most shared
variance in the items, and one minor.
Parallel analysis confirms this conclusion, because the simulated data yields a
line (in red) that crosses the observed scree between 2nd and 3rd factor, with
2nd factor significantly above the red line, and 3rd factor below it. It means
that factors 1 and 2 should be retained, and factors 3, 4 and all subsequent
factors should be discarded.
Q2. The extremity parameters vary widely between items. For SOMAT_2,
SOMAT_3, and SOMAT_4, they are spaced more closely than for SO-
MAT_1 or SOMAT_5, SOMAT_6 and SOMAT_7.
The lowest threshold is between categories 1 and 2 of item SOMAT_1, with
the lowest extremity value (Extrmt1=-1.707). SOMAT_1 is phrased “Have
you recently: been feeling perfectly well and in good health?” (which is an item
indicating health), and according to the GRM model, at the level of Somatic
Symptoms z=–1.707 (which is well below the mean - remember, the trait score
is scaled like z-score?), the probabilities of endorsing “1 = better than usual”
and “2 = same as usual” are equal. People with Somatic Symptoms score lower
than z=-1.707, will more likely say “1 = better than usual”, and with a score
higher than z=-1.707 (having more Somatic Symptoms) will more likely say “2
= same as usual”.
The highest threshold is between categories 3 and 4 of item SOMAT_5, with
the highest extremity value (Extrmt3=5.742). SOMAT_5 is phrased “Have you
recently: been getting any pains in your head?” (which is an item indicating
distress), and according to the GRM model, it requires an extremely
high score of z=5.742 on Somatic Symptoms (more than 5 Standard Deviations
above the mean) to switch from endorsing “3 = rather more than usual” to
endorsing “4 = much more than usual”. It is easy to see why this item is more
difficult to agree with than some other items indicating distress (which are all
items except SOMATIC_1), as it refers to a pretty severe somatic symptom
- head pains.
Q3. The most discriminating item in this set is SOMAT_3, phrased “Have you
recently: been feeling run down and out of sorts?”. Apparently, this symptom is
most sensitive to the change in overall Somatic Symptoms (a “marker” for this
construct). The least discriminating item in this set is SOMAT_7, phrased
“Have you recently: been having hot or cold spells?”. This symptom is less
sensitive to the change in Somatic Symptoms (peripheral to the meaning of this
construct).
Q4. The degrees of freedom for the likelihood ratio test is df=6. This is made
up by the difference in the number of item parameters estimated. The uncon-
strained model estimated 7x3=21 extremity parameters and 7 discrimination
parameters, 28 parameters in total. The constrained model estimated 7x3=21
extremity parameters and only 1 discrimination parameter (one for all items),
22 parameters in total. The difference between the two models is 28-22=6 parameters.
The chi-square value 582.34 on 6 degrees of freedom is highly significant, so the
unconstrained GRM fits the data much better and we have to prefer it to the
more parsimonious but worse fitting constrained model.
Q5. The largest Chi-squared residual is 879.46, pertaining to the pair SO-
MAT_5 (been getting any pains in your head?) and SOMAT_6 (been get-
ting a feeling of tightness or pressure in your head?). It is pretty easy to see
why this pair of items violates the local independence assumption. Both refer
to one’s head - and experiencing “pain” is more likely in people who also
experience “pressure”, even after controlling for all other somatic symptoms. In
other words, if we take people with exactly the same overall level of somatic
symptoms (i.e. controlling for the overall trait level), those among them who
experience “pain” will also more likely experience “pressure”. There is a residual
dependency between these two symptoms.
Q6. Most precise measurement is observed in the range between about z=-0.5
and z=2.5. The smallest standard error was about 0.3 (exact value 0.287). The
largest standard error was about 0.58 (exact value 0.581).
Part VI
DIFFERENTIAL ITEM
FUNCTIONING (DIF)
ANALYSIS
Exercise 14
DIF analysis of
dichotomous questionnaire
items using logistic
regression
14.1 Objectives
The objective of this exercise is to learn how to screen dichotomous test items
for Differential Item Functioning (DIF) using logistic regression. Item DIF is
present when people of the same ability (or people with the same trait level)
but from different groups have different probabilities of passing/endorsing the
item. In this example, we will screen for DIF with respect to gender.
14.2 Worked example - screening EPQ Neuroticism items for gender DIF
This exercise is based on data from a published study (Barrett, Petrides, Eysenck
& Eysenck, 1998), with N = 1,381 participants who completed the Eysenck
Personality Questionnaire (EPQ). The focus of our analysis here will be the
Neuroticism/Anxiety (N) scale, measured by 23 items with only two response
options - either “YES” or “NO”.
You can find the full list of EPQ Neuroticism items in Exercise 6. Please
note that all items indicate “Neuroticism” rather than “Emotional Stability”
(i.e. there are no counter-indicative items).
If you have already worked with this data set in previous exercises, the simplest
thing to do is to continue working within the project created back then. In RStu-
dio, select File / Open Project and navigate to the folder and the project you
created. You should see the data frame EPQ appearing in your Environment
tab, together with other objects you created and saved.
If you have not completed previous exercises or have not saved your work, or
simply want to start from scratch, download the data file EPQ_N_demo.txt
into a new folder and follow instructions from Exercise 6 on creating a project
and importing the data.
The object EPQ should appear on the Environment tab. Click on it and the
data will be displayed on a separate tab. As you can see, there are 26 variables
in this data frame, beginning with participant id, age and sex (0 = female; 1
= male). These demographic variables are followed by 23 item responses, which
are either 0 (for “NO”) or 1 (for “YES”). There are also a few missing responses,
marked with NA.
Any DIF analysis begins with creating a variable that represents the trait score
on which test takers will be matched. (Remember that DIF is the difference in
probability of endorsing an item for people with the same trait score? We need
to compute the trait score, to control for it in analyses).
You should already know how to compute the sum score when some item re-
sponses are missing. We did this in Exercise 1. You use the base R function
rowMeans() to compute the average item score (omitting NA values from calcu-
lation, na.rm=TRUE), and then multiply the result by 23 (the number of items
in the Neuroticism scale). This will essentially replace any missing responses
with the mean for that individual.
Noting that the item responses are located in columns 4 to 26, compute the
Neuroticism trait score (call it Nscore), and append it to the dataframe as a
new variable:
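Following the steps just described, the computation can be sketched as:

```r
# Average item response (NAs omitted), scaled to the 23-item metric
EPQ$Nscore <- rowMeans(EPQ[ ,4:26], na.rm = TRUE) * 23
```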
Next, we need to prepare the grouping variable for DIF analyses. The variable
sex is coded as 0 = female; 1 = male. ATTENTION : this means that male is
the focal group (group which will be the focus of analysis, and will be compared
to the reference group - here, female). To make it easy to interpret the DIF
analyses, give value labels to this variable as follows:
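One way to attach the labels is with the base R factor() function; the label order follows the coding 0 = female, 1 = male:

```r
EPQ$sex <- factor(EPQ$sex, levels = c(0, 1), labels = c("female", "male"))
```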
Run the command head(EPQ) again to check that the new variable Nscore and
the correct labels for sex indeed appeared in the data frame.
Next, let’s obtain and examine the item means, and the means of Nscore by
sex. An easy way to do this is to apply base R function colMeans() to only
males or females from the sample:
colMeans(EPQ[EPQ$sex=="female",4:27], na.rm=TRUE)
colMeans(EPQ[EPQ$sex=="male",4:27], na.rm=TRUE)
Now, let’s run DIF analyses for item N_19 (Are your feelings easily hurt?).
We will create 3 logistic regression models. The first, Baseline model, will
include the total Neuroticism score as the only predictor of N_19. Because
this item was designed to measure Neuroticism, Nscore should positively and
significantly predict responses to this item.
By adding sex as another predictor in the second model, we will check for
uniform DIF (main effect of sex). If sex adds significantly (in terms of chi-
square value) and substantially (in terms of Nagelkerke R2) over and above
Nscore, males and females have different odds of saying “YES” to N_19,
given their Neuroticism score. This means that uniform DIF is present.
By adding Nscore by sex interaction as another predictor in the third model,
we will check for non-uniform DIF. If the interaction term adds significantly (in
terms of chi-square value) and substantially (in terms of Nagelkerke R2) over
and above Nscore and sex, non-uniform DIF is present.
We will use the R base function glm() (stands for ‘generalized linear model’)
to specify the three logistic regression models:
# Baseline model
Baseline <- glm(N_19 ~ Nscore, data=EPQ, family=binomial(link="logit"))
# Uniform DIF model
dif.U <- glm(N_19 ~ Nscore + sex, data=EPQ, family=binomial(link="logit"))
# Non-Uniform DIF model
dif.NU <- glm(N_19 ~ Nscore + sex + Nscore:sex, data=EPQ, family=binomial(link="logit"))
You can see that the model syntax above is very simple. We are basically saying
that “N_19 is regressed on (~) Nscore”, or “N_19 is regressed on (~)
Nscore and sex”, or “N_19 is regressed on (~) Nscore, sex and the Nscore
by sex interaction” (Nscore:sex). We ask the function to perform logistic
regression (family = binomial(link="logit")). And of course, we pass the
dataset (data = EPQ) where all the variables can be found.
Type the models one by one into your script, and run them. Objects Baseline,
dif.U and dif.NU should appear in your Environment. Next, you will obtain
and interpret various outputs generated from the results of these models.
To test if the main effect of sex or the interaction between sex and Nscore
added significantly to the Baseline model, use the base R anova() function. It
analyses not only variance components (the ANOVA as most students know it),
but also deviance components (what is minimized in logistic regression). The
chi-square statistic is used to test the significance of contributions of each added
predictor, in the order in which they appear in the regression equation. To get
this breakdown, apply the anova() function to the final model, dif.NU.
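A minimal sketch of the call (the test = "Chisq" argument requests the chi-square tests):

```r
# Sequential analysis of deviance for the final model
anova(dif.NU, test = "Chisq")
```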
Each row shows the contribution of each added predictor to the model’s chi-
square. The Deviance column shows the chi-square statistic for the model with
each subsequent predictor added (starting from NULL model with just intercept
and no predictors, then with Nscore predictor added, then with sex added
and finally Nscore:sex). The Df column (first column) shows the degrees of
freedom for each added predictor, which is 1 degree of freedom every time. The
column Pr(>Chi) is the p-value.
Examine the output and try to judge whether the predictors added at each step
contributed significantly to the prediction of N_19.
QUESTION 2. Is the Baseline model (with Nscore as the only predic-
tor) significant? Try to report the chi-square statistic for this model using
the template below: Baseline – NULL: diff.chi-square (df= __ , N=__ ) =
_______, p = _______.
QUESTION 3. Does sex add significantly over and above Nscore? What
is the increment in chi-square compared to the Baseline model? What does
this mean in terms of testing for Uniform DIF? dif.U – Baseline: diff.chi-square
(df= __ , N=__ ) = _______, p = _______.
QUESTION 4. Does Nscore by sex interaction add significantly over and
above Nscore and sex? What is the increment in chi-square compared to
the dif.U model? What does this mean in terms of testing for Non-Uniform
DIF? dif.NU – dif.U: diff.chi-square (df= __ , N=__ ) = _______, p =
_______.
Effects of added predictors may be significant, but they may be trivial in size.
To judge whether the differences between groups while controlling for the latent
trait (DIF) are consequential, effect sizes need to be computed and evaluated.
Nagelkerke R Square is recommended for judging the effect size of logistic re-
gression models.
To obtain Nagelkerke R2, we will use package fmsb. Install this package on
your computer and load it into memory. Then apply function NagelkerkeR2()
to the results of 3 models you produced earlier.
library(fmsb)
NagelkerkeR2(Baseline)
## $N
## [1] 1377
##
## $R2
## [1] 0.4887726
14.2. WORKED EXAMPLE - SCREENING EPQ NEUROTICISM ITEMS FOR GENDER DIF187
NagelkerkeR2(dif.U)
## $N
## [1] 1377
##
## $R2
## [1] 0.5237134
NagelkerkeR2(dif.NU)
## $N
## [1] 1377
##
## $R2
## [1] 0.524433
Look at the output and note that the function returns 2 values – the sample
size on which the calculation was made, and the actual R2.
QUESTION 5. Report the Nagelkerke R2 for each model below:
Baseline: Nagelkerke R2 = _______
dif.U: Nagelkerke R2 = _______
dif.NU: Nagelkerke R2 = _______
Finally, let’s compute the increments in Nagelkerke R2 for dif.U compared to
Baseline, and dif.NU compared to dif.U. We will refer directly to the $R2
values of the models:
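The two subtractions producing the output below can be written as:

```r
# Increment for adding sex (uniform DIF)
NagelkerkeR2(dif.U)$R2 - NagelkerkeR2(Baseline)$R2
# Increment for adding the interaction (non-uniform DIF)
NagelkerkeR2(dif.NU)$R2 - NagelkerkeR2(dif.U)$R2
```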
## [1] 0.03494082
## [1] 0.0007195194
Now, refer to the following decision rules to judge whether DIF is present or
not:
QUESTION 6. What are the increments for Nagelkerke R2, and what do you
conclude about Differential Item Functioning?
To help with interpretation, you can obtain the odds ratios for the model effects by exponentiating the regression coefficients:
exp(coef(dif.NU))
After you have finished this exercise, save your R script by pressing the Save icon.
Give the script a meaningful name, for example “EPQ_N logistic regression”.
When closing the project by pressing File / Close project, make sure you select
Save when prompted to save your ‘Workspace image’ (with extension .RData).
14.3 Solutions
Q1. Females endorse item N_19 more frequently (0.70 or 70% of them endorse
it) than males (0.36 or 36%). Females also have higher Nscore (mean=13.30)
than males (mean=10.16). Knowing this, the differences in the item endorsement
rates are actually expected, as they may be explained by the difference
in Neuroticism levels between the sexes.
Exercise 15
DIF analysis of polytomous questionnaire items using ordinal logistic regression
15.1 Objectives
The objective of this exercise is to screen polytomous test items for Differential
Item Functioning (DIF) using ordinal logistic regression. Item DIF is present
when people with the same trait level but from different groups have different
probabilities of selecting the item response categories. In this example, we will
screen for DIF with respect to gender.
Item responses are coded so that the score of 1 indicates good health or lack of
distress; and the score of 4 indicates poor health or a lot of distress. Therefore,
high scores on the questionnaire indicate emotional distress.
If you completed Exercise 13, the easiest thing to do is to open the project you
created back then and continue working with it.
If you need to start from scratch, download the data file likertGHQ28sex.txt
into a new folder and create a project associated with it. As the file here is in
the tab-delimited format (.txt), use function read.table() to import data into
the data frame we will call GHQ.
The object GHQ should appear on the Environment tab. Click on it and the
data will be displayed on a separate tab. You should see 33 variables, beginning
with the 7 items for Somatic Symptoms scale, SOMAT_1, SOMAT_2, …
SOMAT_7, and followed by items for the remaining subscales. They are
followed by participant sex (variable sexm0f1 coded male = 0 and female =
1). The last 4 variables in the file are sum scores (sums of relevant 7 items) for
each of the 4 subscales. There are NO missing data.
Next, you need to prepare the grouping variable for DIF analyses. The variable
sexm0f1 is coded as male = 0 and female = 1. ATTENTION : this means that
female is the focal group (the group which will be the focus of analysis, and will be
compared to the reference group, male). Give value labels to the sex variable
as follows:
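A sketch of the labelling with factor(), where the label order follows the coding 0 = male, 1 = female:

```r
GHQ$sexm0f1 <- factor(GHQ$sexm0f1, levels = c(0, 1),
                      labels = c("male", "female"))
```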
Run the command head(GHQ$sexm0f1) to check that the correct labels for
sexm0f1 indeed appeared in the data frame.
Next, let’s obtain and examine the descriptive statistics for items and subscale
scores by sex. The easiest way to do this is to use the function describeBy(x,
group, ...) from psych package. The function provides descriptive statistics
for all variables in the data frame x by grouping variable group.
library(psych)
describeBy(GHQ[ ,c("SOMAT_1","SOMAT_2","SOMAT_3","SOMAT_4","SOMAT_5","SOMAT_6","SOMAT_7","SUM_SOM")], group=GHQ$sexm0f1)
##
## Descriptive statistics by group
## group: male
## vars n mean sd median trimmed mad min max range skew kurtosis
## SOMAT_1 1 1422 2.05 0.49 2 2.03 0.00 1 4 3 0.85 3.68
## SOMAT_2 2 1422 1.74 0.75 2 1.64 1.48 1 4 3 0.80 0.25
## SOMAT_3 3 1422 1.71 0.73 2 1.61 1.48 1 4 3 0.77 0.06
## SOMAT_4 4 1422 1.44 0.68 1 1.31 0.00 1 4 3 1.47 1.70
## SOMAT_5 5 1422 1.24 0.50 1 1.14 0.00 1 4 3 2.01 3.64
## SOMAT_6 6 1422 1.20 0.49 1 1.08 0.00 1 4 3 2.41 5.26
## SOMAT_7 7 1422 1.20 0.52 1 1.07 0.00 1 4 3 2.79 7.85
## SUM_SOM 8 1422 10.58 2.82 10 10.15 2.97 7 26 19 1.49 2.71
## se
## SOMAT_1 0.01
## SOMAT_2 0.02
## SOMAT_3 0.02
## SOMAT_4 0.02
## SOMAT_5 0.01
## SOMAT_6 0.01
## SOMAT_7 0.01
## SUM_SOM 0.07
## ------------------------------------------------------------
## group: female
## vars n mean sd median trimmed mad min max range skew kurtosis
## SOMAT_1 1 1479 2.09 0.59 2 2.08 0.00 1 4 3 0.58 1.49
## SOMAT_2 2 1479 1.87 0.79 2 1.80 1.48 1 4 3 0.63 -0.11
## SOMAT_3 3 1479 1.86 0.81 2 1.79 1.48 1 4 3 0.62 -0.26
## SOMAT_4 4 1479 1.49 0.74 1 1.34 0.00 1 4 3 1.44 1.40
## SOMAT_5 5 1479 1.39 0.62 1 1.28 0.00 1 4 3 1.46 1.52
## SOMAT_6 6 1479 1.30 0.59 1 1.17 0.00 1 4 3 1.95 3.13
## SOMAT_7 7 1479 1.77 0.81 2 1.67 1.48 1 4 3 0.81 -0.01
## SUM_SOM 8 1479 11.77 3.45 11 11.34 2.97 7 27 20 1.18 1.56
## se
## SOMAT_1 0.02
## SOMAT_2 0.02
## SOMAT_3 0.02
## SOMAT_4 0.02
## SOMAT_5 0.02
## SOMAT_6 0.02
## SOMAT_7 0.02
## SUM_SOM 0.09
QUESTION 1. Who has the higher average score on item SOMAT_7 (“been
having hot or cold spells”) – males or females? Who scores higher on the Somatic
Symptoms (SUM_SOM) on average – males or females? Interpret the means
for SOMAT_7 in the light of the SUM_SOM means.
Now you are ready for DIF analyses.
We will use package lordif to run the DIF analysis. This package has great
functions that automate the use of ordinal logistic regression to detect DIF.
Function rundif() is the basic function for detecting polytomous DIF items.
It performs an ordinal (common odds-ratio) logistic regression DIF analysis by
comparing the endorsement levels for each response category of given items with
the matching variable - the variable reflecting the scale score - for individuals
from the focal and referent groups. The function has the following format:
rundif(item, resp, theta, gr, criterion, pseudo.R2, R2.change,
...)
where item is one item or a collection of items which we examine for DIF; resp is
a data frame containing item responses; theta is a conditioning (matching) vari-
able; gr is a variable identifying groups; criterion is the criterion for flagging
DIF items (i.e., “CHISQR”, “R2”, or “BETA”); pseudo.R2 is the pseudo R-squared
measure to use (“McFadden”, “Nagelkerke”, or “CoxSnell”); and R2.change is
the R-squared change threshold used for flagging.
library(lordif)
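The call that produced the output below is not shown; a plausible reconstruction, matching the flagging criterion used later (R2 with the Nagelkerke pseudo R-squared and a 0.035 threshold - these argument values are an assumption), is:

```r
# Test item SOMAT_7 for DIF, matching on the SUM_SOM scale score
rundif(item = "SOMAT_7", resp = GHQ[ ,1:7], theta = GHQ$SUM_SOM,
       gr = GHQ$sexm0f1, criterion = "R2",
       pseudo.R2 = "Nagelkerke", R2.change = 0.035)
```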
## $stats
## item ncat chi12 chi13 chi23 beta12 pseudo12.McFadden pseudo13.McFadden
## 1 SOMAT_7 4 0 0 0 0.0052 0.0829 0.0875
## pseudo23.McFadden pseudo12.Nagelkerke pseudo13.Nagelkerke pseudo23.Nagelkerke
## 1 0.0046 0.1241 0.1305 0.0063
## pseudo12.CoxSnell pseudo13.CoxSnell pseudo23.CoxSnell df12 df13 df23
## 1 0.1046 0.11 0.0053 1 2 1
##
## $flag
## [1] TRUE
You can also run the above basic procedure for all 7 items simultaneously, storing the results in an object (here called DIF) that you can then print.
QUESTION 5. Obtain print(DIF) output and examine it. Can you see any
more DIF items?
The above analyses used the scale score SUM_SOM as the matching (condi-
tioning) variable. This score is made up of the items scores. But what if we have
some DIF items in the scale - surely then, the total score will be “contaminated”
by DIF? Indeed, there are methods that remove the effects of DIF items from
the matching (conditioning) score. Package lordif offers one of such “purifica-
tion” methods, whereby items flagged for DIF are removed from the “anchor”
set of common items from which the trait score is estimated. This function uses
IRT theta estimates (not the simple sum score) as the matching/conditioning
variable. The graded response model (GRM) is used for IRT trait estimation
as default. The procedure runs iteratively until the same set of items is flagged
over two consecutive iterations, unless anchor items are specified.
The lordif function has the following format:
lordif(resp.data, group, selection = NULL, criterion = c("Chisqr",
"R2", "Beta"), pseudo.R2 = c("McFadden", "Nagelkerke", "CoxSnell"),
R2.change = 0.02, anchor = NULL, ...)
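The command creating pureDIF was omitted; the Call line in the print() output below shows it was equivalent to:

```r
# Iterative DIF detection with purification of the matching variable
pureDIF <- lordif(resp.data = GHQ[ ,1:7], group = GHQ$sexm0f1,
                  criterion = "R2", pseudo.R2 = "Nagelkerke",
                  R2.change = 0.035)
```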
print(pureDIF)
## Call:
## lordif(resp.data = GHQ[, 1:7], group = GHQ$sexm0f1, criterion = "R2",
## pseudo.R2 = "Nagelkerke", R2.change = 0.035)
##
## Number of DIF groups: 2
##
## Number of items flagged for DIF: 1 of 7
##
## Items flagged: 7
##
## Number of iterations for purification: 2 of 10
##
## Detection criterion: R2
##
## Threshold: R-square change >= 0.035
##
## item ncat pseudo12.Nagelkerke pseudo13.Nagelkerke pseudo23.Nagelkerke
## 1 1 4 0.0011 0.0039 0.0028
## 2 2 4 0.0000 0.0001 0.0001
## 3 3 4 0.0000 0.0001 0.0000
## 4 4 4 0.0035 0.0041 0.0005
## 5 5 3 0.0105 0.0107 0.0002
## 6 6 3 0.0027 0.0031 0.0004
## 7 7 4 0.1732 0.1856 0.0124
The output obtained by calling print() makes it clear that after iterative
purification, there is only one item flagged as DIF, and it is item number 7
(SOMAT_7). So we obtain the same results as without purification. However,
the effect sizes obtained are slightly different, presumably because the IRT
estimated trait scores rather than sum scores were used as the matching variable.
QUESTION 6. Report the DIF effect sizes for SOMAT_7 based on the DIF
procedure with purification.
QUESTION 7. Given the item text and the characteristics of the sample, try
to interpret the DIF effect that you found. Who has a higher expected score on
item SOMAT_7 after controlling for the overall Somatic Symptoms score –
males or females? Can you interpret / explain this finding substantively?
Finally, let’s plot the results visually. Package lordif has its own function
plot.lordif(), which plots diagnostic graphs for items flagged for DIF. It has
the following format:
plot.lordif(x, labels = c("Reference", "Focal"), ...)
In our case, males are the reference group and females are the focal group; therefore
we call:
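A sketch of the call, with the label order (reference first, then focal) following the format above:

```r
plot(pureDIF, labels = c("Male", "Female"))
```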
You can use the arrows to navigate between the plots in the Plot tab. The first
output is “Item True Score Functions”. Here, you see the expected item score
for men (black line) and women (dashed red line) for each value of the IRT “true
score” (or theta score, in the standardized metric). The large discrepancy across
all theta values shows uniform DIF. The “Item Response Functions” output
provides the probabilities of endorsing each response category for men (black
line) and women (dashed red line) for each value of the theta score. You can
see that the biggest difference between sexes is observed for the first response
category (“not at all”). It is the presence of hot spells and not their amount
that seems to make the biggest difference between sexes.
The next group of plots shows the “Test Characteristic Curve” or TCC. The TCCs differ by about 0.5 points at most. That means that for the same true Somatic Symptom score, women are expected to have a sum score at most 0.5 points higher than men. This difference is most pronounced for average true scores.
After you have finished this exercise, save your R script by pressing the Save icon.
Give the script a meaningful name, for example “GHQ-28 ordinal DIF”.
When closing the project by pressing File / Close project, make sure you select
Save when prompted to save your ‘Workspace image’ (with extension .RData).
15.3 Solutions
Q1. Females have higher mean SOMAT_7 item score (1.77) than males
(1.20). Females also have higher SUM_SOM (mean=11.77) than males
(mean=10.58). Knowing this, the differences in the item means are actually
expected, as they may be explained by the difference in Somatic Symptom
levels. The question is whether the differences in responses are fully explained
by the differences in trait levels.
Q2. The chi13 statistic for SOMAT_7 is highly significant, with the p-value reported as 0.000 (p<0.001). The corresponding pseudo13.Nagelkerke is 0.1305, which is larger than 0.07 (the cut-off for a large effect size); therefore SOMAT_7 demonstrates LARGE DIF (either uniform or non-uniform or both). This is confirmed by $flag equal TRUE for SOMAT_7.
Q3. The chi12 statistic for SOMAT_7 is highly significant, with the p-value reported as 0.000 (p<0.001). The corresponding pseudo12.Nagelkerke is 0.1241, which is larger than 0.07; therefore SOMAT_7 demonstrates LARGE Uniform DIF.
Q4. The chi23 statistic for SOMAT_7 is highly significant, with the p-value reported as 0.000 (p<0.001). However, the corresponding pseudo23.Nagelkerke is 0.0063, which is smaller than 0.035; therefore any Non-uniform DIF effects are trivial, which means there is NO Non-uniform DIF.
print(DIF)
## $stats
## item ncat chi12 chi13 chi23 beta12 pseudo12.McFadden pseudo13.McFadden
## 1 SOMAT_1 4 0.000 0.0000 0.3373 0.0360 0.0080 0.0082
## 2 SOMAT_2 4 0.000 0.0000 0.0112 0.0241 0.0037 0.0047
## 3 SOMAT_3 4 0.000 0.0000 0.0218 0.0248 0.0039 0.0048
## 4 SOMAT_4 4 0.000 0.0000 0.0459 0.0652 0.0163 0.0170
## 5 SOMAT_5 4 0.068 0.1671 0.6181 0.0096 0.0008 0.0009
## 6 SOMAT_6 4 0.302 0.5136 0.6052 0.0069 0.0003 0.0004
## 7 SOMAT_7 4 0.000 0.0000 0.0000 0.0052 0.0829 0.0875
## pseudo23.McFadden pseudo12.Nagelkerke pseudo13.Nagelkerke pseudo23.Nagelkerke
## 1 0.0002 0.0095 0.0097 0.0002
## 2 0.0010 0.0040 0.0051 0.0011
## 3 0.0008 0.0035 0.0042 0.0007
## 4 0.0008 0.0171 0.0179 0.0008
## 5 0.0001 0.0011 0.0011 0.0001
## 6 0.0001 0.0004 0.0005 0.0001
## 7 0.0046 0.1241 0.1305 0.0063
## pseudo12.CoxSnell pseudo13.CoxSnell pseudo23.CoxSnell df12 df13 df23
## 1 0.0075 0.0077 0.0002 1 2 1
## 2 0.0036 0.0045 0.0010 1 2 1
## 3 0.0031 0.0037 0.0006 1 2 1
## 4 0.0143 0.0149 0.0007 1 2 1
## 5 0.0008 0.0009 0.0001 1 2 1
## 6 0.0003 0.0003 0.0001 1 2 1
## 7 0.1046 0.1100 0.0053 1 2 1
##
## $flag
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
CONFIRMATORY FACTOR ANALYSIS (CFA)
Exercise 16
16.1 Objectives
In this exercise, you will fit a Confirmatory Factor Analysis (CFA) model, or
measurement model, to polytomous responses to a short service satisfaction
questionnaire. We already established a factor structure for this questionnaire
using Exploratory Factor Analysis (EFA) in Exercise 9. This time, we will
confirm this structure using CFA. However, we will confirm it on a different
sample from the one we used for EFA.
In the process of fitting a CFA model, you will start getting familiar with the
SEM language and techniques using R package lavaan (stands for latent variable
analysis).
The data are stored in the space-separated (.txt) file CHI_ESQ2.txt. Down-
load the file now into a new folder and create a project associated with it.
Preview the file by clicking on CHI_ESQ2.txt in the Files tab in RStudio
(this will not import the data yet, just open the actual file). You will see that
the first row contains the item abbreviations (esq+p for “parent”): “esqp_01”,
“esqp_02”, “esqp_03”, etc., and the first entry in each row is the respondent
number: “1”, “2”, …“620”. Function read.table() will import this into RStudio, taking care of these column and row names. It will actually understand that we have headers and row names because the first row contains one fewer field than the number of columns (see ?read.table for detailed help on this function).
We have just read the data into a new data frame named CHI_ESQ2. Exam-
ine the data frame by either pressing on it in the Environment tab, or calling
function View(CHI_ESQ2). You will see that there are quite a few missing responses, which could have occurred either because the “Don’t know” response option was chosen, or because the question was not answered at all.
Before you start using commands from the package lavaan, make sure you have
it installed (if not, use menu Tools and then Install packages…), and load it.
library(lavaan)
Given that we already established the factor structure for CHI-ESQ by the
means of EFA in Exercise 9, we will fit a model with 2 correlated factors - Sat-
isfaction with Care (Care for short) and Satisfaction with Environment (Envi-
ronment for short).
We need to “code” this model in syntax, using the lavaan syntax conventions.
Here are these conventions:
Figure 16.1: Figure 16.1. Confirmatory model for CHI-ESQ (paths fixed to 1
are in dashed lines)
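The model syntax block itself is not reproduced here; reconstructed from the indicator lists in the parameter estimates shown later, it would read along these lines (the object name ESQmodel matches the one used below):

```r
# two correlated factors: Care (9 items) and Environment (3 items)
ESQmodel <- 'Care =~ esqp_01 + esqp_02 + esqp_03 + esqp_04 + esqp_05 +
                     esqp_06 + esqp_07 + esqp_11 + esqp_12
             Environment =~ esqp_08 + esqp_09 + esqp_10'
```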
Be sure to have the single quotation marks opening and closing the syntax.
Begin each equation on the new line. Spaces within each equation are optional,
they just make reading easier for you (but R does not mind).
By default, lavaan will scale each factor by setting the same unit as its first
indicator (and the factor variance is then freely estimated). So, it will set
the loading of esqp_01 to 1 to scale factor Care, and it will set the loading of esqp_08 to 1 to scale factor Environment. You can see this on the diagram, where the loading paths for these two indicators are in dashed rather than solid lines.
Write the above syntax in your Script, and run it by highlighting the whole lot
and pressing the Run button or Ctrl+Enter keys. You should see a new object
ESQmodel appear on the Environment tab. This object contains the model
syntax.
To fit this CFA model, we need function cfa(). We need to pass to this function
the data frame CHI_ESQ2 and the model name (model=ESQmodel).
# fit the model with default scaling, and ask for summary output
fit <- cfa(model=ESQmodel, data=CHI_ESQ2)
summary(fit)
Examine the output. Start with finding estimated factor loadings (under Latent
Variables: find statements =~), factor variances (under Variances: find state-
ments ~~) and covariances (under Covariances: find statements ~~).
QUESTION 1. Why are the factor loadings of esqp_01 and esqp_08 equal
1, and no Standard Errors, z-values or p-values are printed for them? How many
factor loadings are there to estimate?
QUESTION 2. How many latent variables are there in your model? What
are they? What parameters are estimated for these variables? [HINT. You do
not need the R output to answer the first part of the question. You need the
model diagram.]
QUESTION 3. How many parameters are there to estimate in total? What
are they?
QUESTION 4. How many known pieces of information (sample moments)
are there in the data? What are the degrees of freedom and how are they made
up?
QUESTION 5. Interpret the chi-square (reported as Test statistic). Do
you retain or reject the model?
To obtain more measures of fit beyond the chi-square, request an extended
output:
summary(fit, fit.measures=TRUE)
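Alternatively, specific fit indices can be pulled out of the fitted object with lavaan's fitMeasures() function, for example:

```r
# extract selected fit indices from the fitted model object
fitMeasures(fit, c("chisq", "df", "pvalue", "cfi", "rmsea", "srmr"))
```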
QUESTION 6. Examine the extended output; find and interpret the SRMR,
CFI and RMSEA.
Now, I suggest you change the way your factors are scaled – for the sake of
learning alternative ways of scaling latent variables, which will be useful in
different situations. The other popular way of scaling common factors is by
setting their variances to 1 and freeing all the factor loadings.
You can use your ESQmodel, but request the cfa() function to standardize all
the latent variables (std.lv=TRUE), so that their variances are set to 1. Assign
this new model run to a new object, fit.2, so you can compare it to the previous
run, fit.
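The call would be along these lines, with std.lv=TRUE being the only change from before:

```r
# fit the same model, now scaling factors by fixing their variances to 1
fit.2 <- cfa(model=ESQmodel, data=CHI_ESQ2, std.lv=TRUE)
```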
summary(fit.2)
Examine the output. First, compare the test statistic (chi-square) between
fit.2 and fit (see your answer for Q5). The two chi-square statistics and their
degrees of freedom should be exactly the same! This is because alternative ways
of scaling do not change the model fit or the number of parameters. They just swap the parameters to be estimated.
OK, now examine the latest output with standardized latent variables (factors),
and answer the following questions.
QUESTION 7. Examine the factor loadings and their standard errors. How do you interpret these loadings compared to the model where the factors were scaled by borrowing the scale of one of their indicators?
QUESTION 8. Interpret factor correlations. Are they what you would ex-
pect? (If you did Exercise 9, you can compare this result with your estimate of
factor correlations)
summary(fit.2, standardized=TRUE)
Examining the relative size of standardized factor loadings, we can see that the
“marker” (highest loading) item for factor Care is esqp_12 (Overall help) -
the same result as in EFA with “oblimin” rotation in Exercise 9. The marker
item for factor Environment is esqp_08 (facilities), while in EFA the marker
item was esqp_09 (appointment times).
QUESTION 11. Are the standardized error variances small or large, and how
do you judge their magnitude?
residuals(fit.2, type="cor")
## $type
## [1] "cor.bollen"
##
## $cov
## esq_01 esq_02 esq_03 esq_04 esq_05 esq_06 esq_07 esq_11 esq_12 esq_08
## esqp_01 0.000
## esqp_02 0.124 0.000
## esqp_03 0.080 0.115 0.000
## esqp_04 0.088 0.077 0.077 0.000
## esqp_05 -0.012 -0.024 0.012 -0.021 0.000
## esqp_06 -0.045 -0.016 -0.085 -0.047 0.017 0.000
## esqp_07 0.012 -0.068 0.000 -0.028 0.032 0.035 0.000
## esqp_11 -0.067 -0.072 -0.050 -0.023 0.008 0.019 -0.039 0.000
## esqp_12 -0.074 -0.045 -0.097 -0.031 -0.003 0.049 0.035 0.102 0.000
## esqp_08 -0.009 0.064 0.040 0.061 -0.007 0.030 0.042 0.033 0.050 0.000
## esqp_09 0.029 0.098 0.127 0.014 0.017 -0.035 -0.026 0.045 -0.060 -0.034
## esqp_10 0.006 -0.043 0.036 -0.008 -0.032 -0.137 -0.108 -0.010 -0.128 0.024
## esq_09 esq_10
## esqp_01
## esqp_02
## esqp_03
## esqp_04
## esqp_05
## esqp_06
## esqp_07
## esqp_11
## esqp_12
## esqp_08
## esqp_09 0.000
## esqp_10 0.021 0.000
You can also request standardized residuals, to formally test for significant dif-
ferences from zero. Any standardized residual larger than 1.96 in magnitude
(approximately 2 standard deviations from the mean in the standard normal
distribution) is significantly different from 0 at p=.05 level.
residuals(fit.2, type="standardized")
QUESTION 12. Are there any large residuals? Are there any statistically
significant residuals?
Now request modification indices by calling
modindices(fit.2)
QUESTION 13. Examine the modification indices output. Find the largest
index. What does it suggest?
Modify the model by adding a covariance path (~~) between items esqp_11
and esqp_12. All you need to do is to add a new line of code to the definition of
ESQmodel by using base R function paste(), thus creating a modified model
ESQmodel.m:
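For example, the modification could be coded like this (sep="\n" keeps the added equation on its own line; the exact formatting is up to you):

```r
# add correlated errors between esqp_11 and esqp_12 to the model syntax
ESQmodel.m <- paste(ESQmodel, 'esqp_11 ~~ esqp_12', sep="\n")
```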
Re-estimate the modified model following the steps as with ESQmodel, and
assign the results to a new object fit.m.
QUESTION 14. Examine the modified model summary output. What is the
chi-square for this modified model? How many degrees of freedom do we have
and why? What is the goodness of fit of this modified model?
After you have finished work with this exercise, save your R script by pressing the
Save icon in the script window. To keep all the created objects (such as fit.2),
which might be useful when revisiting this exercise, save your entire work space.
To do that, when closing the project by pressing File / Close project, make sure
you select Save when prompted to save your “Workspace image”.
16.4 Solutions
Q1. Factor loadings for esqp_01 and esqp_08 were not actually estimated;
instead, they were fixed to 1 to set the scale of the latent factors (latent factors
then simply take the scale of these particular measured variables). That is why there are no usual estimation statistics reported for them. Factor loadings for the remaining 10 items (12-2=10) are free parameters in the model to estimate.
Q2. There are 14 unobserved (latent) variables – 2 common factors (Care
and Environment), and 12 unique factors /errors (these are labelled by the
prefix . in front of the observed variable to which this error term is attached, for example .esqp_01, .esqp_02, etc.). It so happens that in this model, all
the latent variables are exogenous (independent), and for that reason variances
for them are estimated. In addition, covariance is estimated between the 2
common factors. Of course, the errors are assumed uncorrelated (remember the
local independence assumption in factor analysis?), and so their covariances are
fixed to 0 (not estimated and not printed in the output).
Q3. You can look this up in lavaan output. It says: Number of free
parameters 25. To work out how this number is derived, you can look how
many rows of values are printed in Parameter Estimates under each category.
The model estimates: 10 factor loadings (see Q1) + 2 factor variances (see Q2)
+ 1 factor covariance (see Q2) + 12 error variances (see Q2). That’s it. So we
have 10+2+1+12=25 parameters.
Q4. Sample moments refers to the number of variances and covariances in our
observed data. To know how many sample moments there are, the only thing
you need to know is how many observed variables there are. There are 12 ob-
served variables, therefore there are 12 variances and 12(12-1)/2=66 covariances,
78 “moments” in total.
You can use the following formula for calculating the number of sample moments
for any given data: m(m+1)/2 where m is the number of observed variables.
Q5. Chi-square is 264.573 (df=53, which is calculated as 78 sample moments minus 25 parameters). We have to reject the model, because the test indicates
that the probability of this factor model holding in the population is less than
.001.
Q6. Comparative Fit Index (CFI) = 0.903, which is larger than 0.90 but smaller
than 0.95, indicating adequate fit. The Root Mean Square Error of Approxima-
tion (RMSEA) = 0.110, and the 90% confidence interval for RMSEA is (0.097,
0.123), well outside of the cut-off .08 for adequate fit. Standardized Root Mean
Square Residual (SRMR) = 0.054 is small as we would hope for a well-fitting
model. All indices except the RMSEA indicate acceptable fit of this model.
Q7. All factor loadings are positive as would be expected with questions being
positive indicators of satisfaction. The SEs are low (magnitude of 1/sqrt(N), as
they should be in a properly identified model). All factor loadings are signifi-
cantly different from 0 (p-values are very small).
Latent Variables:
Estimate Std.Err z-value P(>|z|)
Care =~
esqp_01 0.361 0.023 15.872 0.000
esqp_02 0.350 0.024 14.350 0.000
esqp_03 0.219 0.018 12.011 0.000
esqp_04 0.412 0.023 18.269 0.000
esqp_05 0.520 0.028 18.416 0.000
esqp_06 0.430 0.029 14.903 0.000
esqp_07 0.477 0.028 16.845 0.000
esqp_11 0.448 0.025 17.911 0.000
esqp_12 0.459 0.024 19.427 0.000
Environment =~
esqp_08 0.211 0.038 5.617 0.000
esqp_09 0.321 0.045 7.189 0.000
esqp_10 0.194 0.035 5.624 0.000
Q8. The factor covariance is easy to interpret in terms of size, because we set the factors’ variances =1, and therefore the factor covariance is a correlation. The
correlation between Care and Environment is positive and of medium size (r
= 0.438). This estimate is close to the latent factor correlation estimated in
EFA in Exercise 9 (r = 0.39).
Covariances:
Estimate Std.Err z-value P(>|z|)
Care ~~
Environment 0.438 0.073 6.022 0.000
Q9. The Estimate and Std.lv parameter values are identical because in fit.2,
the factors were scaled by setting their variances =1. So, the model is already
standardized on the latent variables (factors). The Std.lv and Std.all param-
eter values are different because the observed variables were raw item responses
and not standardized originally. The standardization of factors (Std.lv) does
not standardize the observed variables. Only in Std.all, the observed variables
are also standardized. The Std.all output makes the results comparable with
EFA in Exercise 9, where all variables are standardized.
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Care =~
Q10. The Std.all loadings are interpreted as the standardized factor loadings
in EFA. In models with orthogonal factors, these loadings are equal to the cor-
relation between the item and the factor; also, squared factor loading represents
the proportion of variance explained by the factor.
Q11. The Std.all error variances range from (small) 0.256 for .esqp_12 to (quite large) 0.820 for .esqp_08 and .esqp_10. I judge them to be “small” or “large” considering that the observed variables are now standardized and have variance 1, so each error variance is simply a proportion of 1. The remaining proportion of variance is due to the common factors. For example, for esqp_12, the error variance is 0.256, which means 25.6% of variance is due to error and the remaining 74.4% (1-0.256) is due to the common factors.
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.esqp_01 0.099 0.008 11.732 0.000 0.099 0.431
.esqp_02 0.127 0.011 12.036 0.000 0.127 0.508
.esqp_03 0.080 0.007 12.356 0.000 0.080 0.626
.esqp_04 0.077 0.007 10.966 0.000 0.077 0.312
.esqp_05 0.119 0.011 10.902 0.000 0.119 0.305
.esqp_06 0.171 0.014 11.937 0.000 0.171 0.480
.esqp_07 0.140 0.012 11.476 0.000 0.140 0.382
.esqp_11 0.099 0.009 11.114 0.000 0.099 0.329
.esqp_12 0.073 0.007 10.366 0.000 0.073 0.256
.esqp_08 0.204 0.020 10.431 0.000 0.204 0.820
.esqp_09 0.171 0.027 6.352 0.000 0.171 0.623
.esqp_10 0.172 0.017 10.421 0.000 0.172 0.820
Q12. There are a few residuals over 0.1 in size. The largest is for the correlation between esqp_06 and esqp_10 (-.137). It is also significantly different from zero (its standardized residual is -3.155, which is larger than 1.96 in magnitude). There are larger standardized residuals, for example -5.320 between esqp_01 and esqp_12.
Q13.
<snip>
92 esqp_07 ~~ esqp_12 6.374 0.017 0.017 0.171 0.171
93 esqp_07 ~~ esqp_08 0.667 0.008 0.008 0.050 0.050
94 esqp_07 ~~ esqp_09 0.784 -0.009 -0.009 -0.060 -0.060
95 esqp_07 ~~ esqp_10 3.022 -0.017 -0.017 -0.106 -0.106
96 esqp_11 ~~ esqp_12 66.449 0.048 0.048 0.571 0.571
97 esqp_11 ~~ esqp_08 0.150 -0.003 -0.003 -0.024 -0.024
98 esqp_11 ~~ esqp_09 1.016 0.009 0.009 0.070 0.070
99 esqp_11 ~~ esqp_10 1.421 0.010 0.010 0.074 0.074
100 esqp_12 ~~ esqp_08 2.555 0.012 0.012 0.102 0.102
<snip>
# fit the modified model with standardized factors, and ask for summary output
fit.m <- cfa(model=ESQmodel.m, CHI_ESQ2, std.lv=TRUE)
summary(fit.m)
For the modified model, Chi-square = 201.249, Degrees of freedom = 52, p<
.001. The number of DF is 52, not 53 as in the original model. This is because
we added one more parameter to estimate (correlated error terms), therefore
losing 1 degree of freedom. The model reduced the RMSEA to 0.093 (90% CI
0.08-0.107) - still too high for an adequate fit.
Exercise 17
17.1 Objectives
In this exercise, you will fit a Confirmatory Factor Analysis (CFA) model, or
measurement model, to a published correlation matrix of scores on subtests from
an ability test battery. We already considered an Exploratory Factor Analysis
(EFA) of these same data in Exercise 10. In the process of fitting a CFA model,
you will start getting familiar with the SEM language and techniques using R
package lavaan (stands for latent variable analysis).
To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.
Examine object Thurstone that should now be in the Environment tab. Ex-
amine the object by either pressing on it, or calling function View(). You will
see that it is a correlation matrix, with values 1 on the diagonal, and moderate
to large positive values off the diagonal. This is typical for ability tests – they
tend to correlate positively with each other.
Before you start using commands from the package lavaan, make sure you have
it installed (if not, use menu Tools and then Install packages…), and load it:
library(lavaan)
Given that the subtests were designed to measure 3 broader mental abilities
(Verbal, Word Fluency and Reasoning Ability), we will test the below CFA
model.
Figure 17.1: Figure 17.1. Theoretical model for Thurstone’s ability data (Ver-
bal= Vrb, Word Fluency = WrF and Reasoning Ability = Rsn)
We need to “code” this model in syntax, using the lavaan syntax conventions.
Here are these conventions:
A shorthand for “and” is the plus sign +. So, this is how we translate the above sentences into syntax for our model (let’s call it T.model):
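The syntax block itself was not captured here; reconstructed from the parameter estimates reported later, it would read along these lines:

```r
# three correlated factors, three subtests each
T.model <- 'Verbal =~ s1 + s2 + s3
            WordFluency =~ s4 + s5 + s6
            Reasoning =~ s7 + s8 + s9'
```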
Be sure to have the single quotation marks opening and closing the syntax.
Begin each equation on the new line. Spaces within each equation are optional,
they just make reading easier for you (but R does not mind).
By default, lavaan will scale each factor by setting the same unit as its first
indicator (and the factor variance is then freely estimated). So, it will set
the loading of s1 to 1 to scale factor Verbal, it will set the loading of s4
to 1 to scale factor WordFluency, and the loading of s7 to 1 to scale factor
Reasoning. You can see this on the diagram, where the loading paths for these
three indicators are in dashed rather than solid lines.
Write the above syntax in your Script, and run it by highlighting the whole lot
and pressing the Run button or Ctrl+Enter keys. You should see a new object
T.model appear on the Environment tab. This object contains the model
syntax.
To fit this CFA model, we need function cfa(). We need to pass to this function
the model name (model=T.model), and the data. However, you cannot specify
data=Thurstone because the argument data is reserved for raw data (subjects
x variables), but our data is the correlation matrix! Instead, we need to pass our
matrix to the argument sample.cov, which is reserved for sample covariance (or
correlation) matrices. [That also means that we need to convert Thurstone,
which is a data frame, into matrix before submitting it to analysis.] Of course
we also need to tell R the number of observations (sample.nobs=215), because
the correlation matrix does not provide such information.
# fit the model with default scaling, and ask for summary output
fit <- cfa(model=T.model, sample.cov=Thurstone, sample.nobs=215)
summary(fit)
Examine the output. Start with finding estimated factor loadings (under Latent
Variables: find statements =~), factor variances (under Variances: find state-
ments ~~) and covariances (under Covariances: find statements ~~).
QUESTION 1. Why are the factor loadings of s1, s4 and s7 equal 1, and no
Standard Errors, z-values or p-values are printed for them? How many factor
loadings are there to estimate?
QUESTION 2. How many latent variables are there in your model? What
are they? What parameters are estimated for these variables? [HINT. You do
not need the R output to answer the first part of the question. You need the
model diagram.]
QUESTION 3. How many parameters are there to estimate in total? What
are they?
QUESTION 4. How many known pieces of information (sample moments)
are there in the data? What are the degrees of freedom and how are they made
up?
QUESTION 5. Interpret the chi-square (reported as Test statistic). Do
you retain or reject the model?
To obtain more measures of fit beyond the chi-square, request an extended
output:
summary(fit, fit.measures=TRUE)
QUESTION 6. Examine the extended output; find and interpret the SRMR,
CFI and RMSEA.
Now, I suggest you change the way your factors are scaled – for the sake of
learning alternative ways of scaling latent variables, which will be useful in
different situations. The other popular way of scaling common factors is by
setting their variances to 1 and freeing all the factor loadings.
You can use your T.model, but request the cfa() function to standardize all
the latent variables (std.lv=TRUE), so that their variances are set to 1. Assign
this new model run to a new object, fit.2, so you can compare it to the previous
run, fit.
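As in the previous exercise, the call simply adds std.lv=TRUE to the earlier cfa() call:

```r
# refit the model with factor variances fixed to 1
fit.2 <- cfa(model=T.model, sample.cov=Thurstone, sample.nobs=215, std.lv=TRUE)
```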
summary(fit.2)
Examine the output. First, compare the test statistic (chi-square) between
fit.2 and fit (see your answer for Q5). The two chi-square statistics and their
degrees of freedom should be exactly the same! This is because alternative ways
of scaling do not change the model fit or the number of parameters. They just
swap the parameters to be estimated. The standardized version of T.model
sets the factor variances to 1 and estimates all factor loadings, while the original
version sets 3 factor loadings (one per each factor) to 1 and estimates 3 factor
variances. The difference between the two models is the matter of scaling -
otherwise they are mathematically equivalent. This is nice to know, so you can
use one or the other way of scaling depending on what is more convenient in a
particular situation.
OK, now examine the latest output with standardized latent variables (factors),
interpret its parameters and answer the following questions.
QUESTION 7. Interpret the factor loadings and their standard errors.
QUESTION 8. Interpret factor covariances. Are they what you would expect?
QUESTION 9. Are the error variances small or large, and what do you
compare them to in order to judge their magnitude?
Now obtain the standardized solution (in which EVERYTHING is standard-
ized - the latent and observed variables) by adding standardized=TRUE to the
summary() output:
summary(fit.2, standardized=TRUE)
Now request two additional outputs – fitted covariance matrix and residuals.
The fitted covariances are covariances predicted by the model. Residuals are
differences between fitted and actual observed covariances.
fitted(fit.2)
## $cov
## s1 s2 s3 s4 s5 s6 s7 s8 s9
## s1 0.995
## s2 0.823 0.995
## s3 0.771 0.779 0.995
## s4 0.484 0.489 0.458 0.995
## s5 0.461 0.466 0.437 0.663 0.995
## s6 0.407 0.411 0.385 0.584 0.557 0.995
## s7 0.471 0.476 0.446 0.414 0.395 0.348 0.995
## s8 0.435 0.439 0.411 0.382 0.364 0.321 0.560 0.995
## s9 0.424 0.429 0.402 0.373 0.356 0.313 0.547 0.504 0.995
residuals(fit.2)
## $type
## [1] "raw"
##
## $cov
## s1 s2 s3 s4 s5 s6 s7 s8 s9
## s1 0.000
## s2 0.001 0.000
## s3 0.001 -0.003 0.000
## s4 -0.047 0.002 0.000 0.000
## s5 -0.031 -0.004 -0.014 0.008 0.000
## s6 0.038 0.076 0.056 0.003 -0.019 0.000
## s7 -0.026 -0.046 -0.047 -0.035 0.005 -0.061 0.000
## s8 0.104 0.096 0.120 -0.033 0.001 -0.002 -0.007 0.000
## s9 -0.046 -0.072 -0.044 0.049 0.088 0.010 0.048 -0.054 0.000
Examine the fitted covariances and compare them to the observed covariances
for the 9 subtests - you can see them again by calling View(Thurstone).
The residuals show if all covariances are well reproduced by the model. Since
our observed covariances are actually correlations (we worked with correlation
matrix, with 1 on the diagonal), interpretation of residuals is very simple. Just
think of them as differences between 2 sets of correlations, with the usual effect
sizes assumed for correlation coefficients.
You can also request standardized residuals, to formally test for significant dif-
ferences from zero. Any standardized residual larger than 1.96 in magnitude
(approximately 2 standard deviations from the mean in the standard normal
distribution) is significantly different from 0 at p=.05 level.
residuals(fit.2, type="standardized")
QUESTION 11. Are there any large residuals? Are there any statistically
significant residuals?
Now request modification indices by calling
modindices(fit.2)
QUESTION 12. Examine the modification indices output. Find the largest
index. What does it suggest?
Modify the model by adding a direct path from Verbal factor to subtest 8 (allow
s8 to load on the Verbal factor). All you need to do is to add the indicator s8
to the definition of the Verbal factor in T.model, creating T.model.m:
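The modified syntax would read along these lines:

```r
# cross-loading: s8 now loads on both Verbal and Reasoning
T.model.m <- 'Verbal =~ s1 + s2 + s3 + s8
              WordFluency =~ s4 + s5 + s6
              Reasoning =~ s7 + s8 + s9'
```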
Re-estimate the modified model following the steps as with T.model, and assign
the results to a new object fit.m.
QUESTION 13. Examine the modified model summary output. What is the
chi-square for this modified model? How many degrees of freedom do we have
and why? Do we accept or reject this modified model?
After you have finished work with this exercise, save your R script by pressing the
Save icon in the script window. To keep all the created objects (such as fit.2),
which might be useful when revisiting this exercise, save your entire work space.
To do that, when closing the project by pressing File / Close project, make sure
you select Save when prompted to save your “Workspace image”.
17.4 Solutions
Q1. Factor loadings for s1, s4 and s7 were not actually estimated; instead, they
were fixed to 1 to set the scale of the latent factors (latent factors then simply
take the scale of these particular measured variables). That is why there are no usual estimation statistics reported for them. Factor loadings for the remaining 6 subtests (9-3=6) are free parameters in the model to estimate.
Q2. There are 12 unobserved (latent) variables – 3 common factors (Verbal,
WordFluency and Reasoning), and 9 unique factors /errors (these are la-
belled by the prefix . in front of the observed variable to which this error term
is attached, for example .s1, .s2, etc.). It so happens that in this model, all
the latent variables are exogenous (independent), and for that reason variances
for them are estimated. In addition, covariances are estimated between the 3
common factors (3 covariances, one for each pair of factors). Of course, the
errors are assumed uncorrelated (remember the local independence assumption
in factor analysis?), and so their covariances are fixed to 0 (not estimated and
not printed in the output).
Q3. You can look this up in lavaan output. It says: Number of free
parameters 21. To work out how this number is derived, you can look how
many rows of values are printed in Parameter Estimates under each category.
The model estimates: 6 factor loadings (see Q1) + 3 factor variances (see Q2)
+ 3 factor covariances (see Q2) + 9 error variances (see Q2). That’s it. So we
have 6+3+3+9=21 parameters.
Q4. Sample moments refers to the number of variances and covariances in
our observed data. To know how many sample moments there are, the only
thing you need to know is how many observed variables there are. There are 9
observed variables, therefore there are 9 variances and 9(9-1)/2=36 covariances,
45 “moments” in total.
[Since our data is the correlation matrix, these sample moments are actually
right in front of you. How many unique cells are in the 9x9 correlation matrix?
There are 81 cells, but not all of them are unique, because the top and bottom
off-diagonal parts are mirror images of each other. So, we only need to count
the diagonal cells and the top off-diagonal cells, 9 + 36 = 45.]
You can use the following formula for calculating the number of sample moments
for any given data: m(m+1)/2 where m is the number of observed variables.
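As a quick arithmetic check in R (numbers taken from Q3 and Q4 above):

```r
m <- 9            # number of observed variables
m * (m + 1) / 2   # sample moments: 45
45 - 21           # degrees of freedom: 45 moments minus 21 free parameters = 24
```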
Q5. Chi-square is 38.737 (df=24). We have to reject the model, because the
test is significant (P = .029 < .05): the hypothesis that this factor model holds
exactly in the population is rejected.
Q6. Comparative Fit Index (CFI) = 0.986, which is larger than 0.95, indicating
excellent fit. The Root Mean Square Error of Approximation (RMSEA) =
0.053, just outside of the cut-off .05 for close fit. The 90% confidence interval
for RMSEA is (0.017, 0.083), which just includes the cut-off 0.08 for adequate
fit. Overall, RMSEA indicates adequate fit. Standardized Root Mean Square
Residual (SRMR) = 0.044 is small as we would hope for a well-fitting model.
Overall, the approximate fit indices indicate good fit of this model, despite the significant chi-square.
Q7. All factor loadings are positive as would be expected with ability variables,
since ability subtests should be positive indicators of the ability domains. The
loadings are all large:
Latent Variables:
Estimate Std.Err z-value P(>|z|)
Verbal =~
s1 0.903 0.054 16.805 0.000
s2 0.912 0.053 17.084 0.000
s3 0.854 0.056 15.389 0.000
WordFluency =~
s4 0.834 0.060 13.847 0.000
s5 0.795 0.061 12.998 0.000
s6 0.701 0.064 11.012 0.000
Reasoning =~
s7 0.779 0.064 12.230 0.000
s8 0.718 0.065 11.050 0.000
s9 0.702 0.065 10.729 0.000
Q8. The factor covariances are easy to interpret in terms of size, because we set
the factors’ variances =1, and therefore factor covariances are correlations. The
correlations between the 3 ability domains are positive and large, as expected.
Covariances:
Estimate Std.Err z-value P(>|z|)
Verbal ~~
WordFluency 0.643 0.050 12.815 0.000
Reasoning 0.670 0.051 13.215 0.000
WordFluency ~~
Reasoning 0.637 0.058 10.951 0.000
Q9. The error variances are certainly quite small – considering that the observed
variables had variance 1 (remember, we analysed the correlation matrix, with 1
on the diagonal?), the error variance is less than half that for most subtests. The
remaining proportion of variance is due to the common factors. For example,
for s1, error variance is 0.18 and this means 18% of variance is due to error and
the remaining 82% (1-0.18) is due to the common factors.
Variances:
Estimate Std.Err z-value P(>|z|)
.s1 0.181 0.028 6.418 0.000
.s2 0.164 0.027 5.981 0.000
.s3 0.266 0.033 8.063 0.000
.s4 0.300 0.050 5.951 0.000
Q10. The Estimate and Std.lv parameter values are identical because in
fit.2, the factors were scaled by setting their variances = 1. So, the model
is already standardized on the latent variables (factors). The Estimate and
Std.all parameter values are almost identical (allowing for rounding error
in the third decimal place) because the observed variables were already standardized,
since we worked with a correlation matrix!
Q11. The largest residual is for correlation between s3 and s8 (.120). It is also
significantly different from zero (standardized residual is 3.334, which is greater
than 1.96).
Q12.
modindices(fit.2)
The largest index is for the factor loading Verbal =~ s8 (19.943). It is at least
twice as large as any other index. This means that chi-square would reduce
by at least 19.943 if we allowed subtest s8 to load on the Verbal factor (note
that currently it does not, because only s1, s2 and s3 load on Verbal). It
appears that solving subtest 8 (“Pedigrees” - Identification of familial relation-
ships within a family tree) requires verbal as well as reasoning ability (s8 is
factorially complex).
Q13.
# fit the model with default scaling, and ask for summary output
fit.m <- cfa(model=T.model.m, sample.cov=Thurstone, sample.nobs=215)
summary(fit.m)
For the modified model, Chi-square = 20.882, Degrees of freedom = 23, p-value
= .588. The number of DF is 23, not 24 as in the original model. This is because
we added one more parameter to estimate (the factor loading), therefore losing 1
degree of freedom. The model fits the data (the test is non-significant), and we
accept it.
Part VIII
PATH ANALYSIS
Exercise 18
Fitting a path model to observed test scores
18.1 Objectives
In this exercise, you will test a path analysis model. Path analysis is also
known as “SEM without common factors”. In path analysis, we model observed
variables only; the only latent variables in these models are residuals/errors.
Your first model will be relatively simple; however, in the process you will learn
how to build path model diagrams in package lavaan, test your models and
interpret outputs.
how well 2 alternative models of SWB (bottom-up and top-down models) fit the
data. Variables of interest in both models were physical health, daily hassles,
world assumptions, and constructive thinking. Results showed that both models
provided good fit to the data, with neither model providing a closer fit than
the other, which suggests that the field would benefit from devoting more time
to examining how general dispositions toward happiness color perceptions of
life’s experiences. Results implicate bidirectional causal models of SWB and its
personality and situational influences.
We will work with published correlations between the observed variables, based
on N=149 subjects.
The correlations are stored in the file Wellbeing.RData. Download this file
and save it in a new folder (e.g. “Wellbeing study”). In RStudio, create a new
project in the folder you have just created. Create a new R script, where you
will be writing all your commands.
Since the data file is in the “native” R format, you can simply load it into the
environment:
load("Wellbeing.RData")
A new object Wellbeing should appear in the Environment tab. Examine the
object by either clicking on it or calling View(Wellbeing). You should see that
it is indeed a correlation matrix, with values 1 on the diagonal.
library(lavaan)
For our purposes, we only fit a bottom-up model of SWB, using data on physical
health, daily hassles, world assumptions, and constructive thinking collected at
the study’s baseline, and the data on SWB collected one month later. This
analysis is described in “Test theory: A Unified Treatment” (McDonald, 1999)
for illustration of main points and issues with path models.
We will be analyzing the following variables:
SWB - Subjective Well-Being, measured as the sum score of PIL (purpose in
life = extent to which one possesses goals and directions), EM (environmental
mastery), and SA (self-acceptance).
WA - World Assumptions, measured as the sum of BWP (benevolence of world
and people), and SW (self as worthy = extent of satisfaction with self).
CT - Constructive Thinking, measured as the sum of GCT (global constructive
thinking = acceptance of self and others), BC (behavioural coping = ability to
focus on effective action); EC (emotional coping = ability to avoid self-defeating
thoughts and feelings).
PS - Physical Symptoms, indicating problems rather than health. It is mea-
sured as the sum of FS (frequency of symptoms), MS (muscular symptoms); GS
(gastrointestinal symptoms).
DH - Daily ‘Hassles’, measured as the sum of TP (time pressures), MW (money
worries) and IC (inner concerns).
Given the study description, we will test the following path model.
QUESTION 1. How many observed and latent variables are there in your
model? What are the latent variables in this model? What are the exogenous
and endogenous variables in this model?
Now you need to “code” this model, using the lavaan syntax conventions. To
describe the model in Figure 1 in words, we need to mention all of its regression
relationships. Let’s work from left to right of the diagram:
• WA is regressed on PS and DH (residual for WA is assumed)
• CT is regressed on PS and DH (residual for CT is assumed)
• SWB is regressed on WA and CT (residual for SWB is assumed)
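Translated into lavaan syntax, the model and fitting command might look as follows (a sketch reconstructed from the description above; the book's exact code may differ slightly):

```r
# Bottom-up model of SWB: each line reads "outcome ~ predictors".
# Residuals for the endogenous WA, CT and SWB are added by lavaan automatically.
W.model <- '
  WA  ~ PS + DH
  CT  ~ PS + DH
  SWB ~ WA + CT
'

# Fit the path model to the published correlation matrix. By default, sem()
# also frees the covariance of the exogenous PS and DH, and the residual
# covariance of WA and CT.
sem(W.model, sample.cov = Wellbeing, sample.nobs = 149)
```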
That’s all you need to do to fit a scary looking path model from Figure 1.
Simple, right?
If you run the above command, you will get a very basic output telling you that
the model estimation ended normally, and giving you the chi-square statistic.
To get access to full results, assign the results to a new object (for example, fit),
and request the extended output including fit measures:
summary(fit, fit.measures=TRUE)
Now examine the output, starting with assessment of goodness of fit, and try
to answer the following questions.
QUESTION 2. How many parameters are there to estimate? How many
known pieces of information (sample moments) are there in the data? What are
the degrees of freedom and how are they calculated?
QUESTION 3. Interpret the chi-square. Do you retain or reject the model?
Now consider the following. McDonald (1999) argued that “researchers ap-
pear to rely too much on goodness-of-fit indices, when the actual discrepancies
[residuals] are much more informative” (p.390). He refers to indices such as
RMSEA and CFI, which we used to judge the goodness of fit of CFA models.
He continues: “generally, indices of fit are not informative in path analysis.
This is because large parts of the model can be unrestrictive, with only a few
discrepancies capable of being non-zero”.
To obtain residuals, which are differences between observed and ‘fit-
ted’/‘reproduced’ covariances, request this additional output:
residuals(fit)
## $type
## [1] "raw"
##
## $cov
## WA CT SWB PS DH
## WA 0.000
## CT 0.000 0.000
## SWB 0.000 0.000 0.000
## PS 0.000 0.000 0.096 0.000
## DH 0.000 0.000 -0.084 0.000 0.000
The residuals show how well the observed covariances are reproduced by the
model. Since our observed covariances are actually correlations (we worked
with a correlation matrix), the raw residuals can be read on the correlation
metric. To judge whether any residual differs significantly from zero, request
standardized residuals:
residuals(fit, type="standardized")
## $type
## [1] "standardized"
##
## $cov
## WA CT SWB PS DH
## WA 0.000
## CT 0.000 0.000
## SWB 0.000 0.000 0.000
## PS 0.000 0.000 2.051 0.000
## DH 0.000 0.000 -1.791 0.000 0.000
QUESTION 4. Are there any large residuals? Are there any statistically
significant residuals? What can you say about the model fit based on the resid-
uals?
We will not modify the model by adding any other paths to it – it already has
only 2 degrees of freedom, and adding parameters would reduce this number
further.
Next, examine the parameter estimates in the output: regressions (statements ~),
variances (statements ~~) and covariances (statements ~~). Try to answer the
following question.
You could do all these calculations by looking up the obtained parameter es-
timates and then multiplying and adding them (simple but tedious!). Or, you
could actually ask lavaan to do these for you while fitting the model. To do the
latter, you need to create ‘labels’ for the parameters of interest, and then write
formulas with these labels (also a bit tedious, but much less room for errors!).
It will be useful to learn how to create labels, because we will use them later in
the course.
According to the diagram in Figure 2 below, I will label the path from PS to
WA as a1 and the path from WA to SWB as b1. Then, my indirect effect from
PS to SWB via WA will be equal to a1*b1. I will label the path from PS to CT
as a2 and the path from CT to SWB as b2. Then, my indirect effect from PS to
SWB via CT will be equal to a2*b2.
Now look how I will modify the syntax to get these labels in place, in a new
model called W.model.label. The way to label a path coefficient is to add a
multiplier in front of the relevant variable (predictor), as is typical for regression
equations.
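A sketch of what the labelled model could look like (the label names and the defined-parameter names match those in the output below; the object name fit.label and the exact book syntax are assumptions):

```r
W.model.label <- '
  # regressions, with labels a1, a2, b1, b2 on the paths of interest
  WA  ~ a1*PS + DH
  CT  ~ a2*PS + DH
  SWB ~ b1*WA + b2*CT

  # defined parameters: indirect effects of PS on SWB, computed by lavaan
  ind.WA    := a1*b1
  ind.CT    := a2*b2
  ind.total := ind.WA + ind.CT
'

fit.label <- sem(W.model.label, sample.cov = Wellbeing, sample.nobs = 149)
summary(fit.label)
```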
Describe and estimate this new model and obtain the summary output. Note
that the model fit or any of the existing paths will not change! This is because
the model has not changed; all that we have done is assigned labels to previously
“unlabeled” regression paths. But, you should see the following additions to
the output. First, you should see the labels appearing next to the relevant
parameters:
Regressions:
Estimate Std.Err z-value P(>|z|)
WA ~
PS (a1) -0.339 0.080 -4.256 0.000
DH -0.165 0.080 -2.065 0.039
CT ~
PS (a2) -0.191 0.078 -2.430 0.015
Second, you should have a new section Defined Parameters with calculated
effects:
Defined Parameters:
Estimate Std.Err z-value P(>|z|)
ind.WA -0.111 0.033 -3.326 0.001
ind.CT -0.105 0.045 -2.345 0.019
ind.total -0.216 0.062 -3.463 0.001
Now you can interpret this output. (Our variables are standardized, making
it easy to interpret the indirect and total effects). Try to answer the following
questions.
QUESTION 7. What is the indirect effect of PS on SWB via WA? What
is the indirect effect of PS on SWB via CT? What is the total indirect effect
of PS on SWB (the sum of the indirect paths via WA and CT)? How do the
numbers correspond to the observed covariance of PS and SWB?
Finally, let us critically evaluate the model. Specifically, let us think of causality
in this study. The article’s abstract suggested that there was an equally plausible
top-down model of subjective well-being, in which Subjective well-being causes
World Assumptions and Constructive Thinking, with those in turn influencing
Physical Symptoms and Daily Hassles. Path Analysis in itself cannot prove
causality, so the analysis cannot tell us which model – the bottom-up or the top-
down – is better. You could say: “but this study employed a longitudinal design,
where subjective well-being was measured one month later than the predictor
variables”. Unfortunately, even longitudinal designs cannot always disentangle
the causes and effects. The fact that SWB was measured one month after the
other constructs does not mean it was caused by them, because SWB was very
stable over that time (the article reports test-retest correlation = 0.92), and
it is therefore a good proxy for SWB at baseline, or possibly even before the
baseline.
QUESTION 8 (Optional challenge). Can you try to suggest a research design
that will address these problems and help establish causality in this study?
After you have finished working with this exercise, save your R script with a meaningful
name, for example “Bottom-up model of Wellbeing”. To keep all of the created
objects, which might be useful when revisiting this exercise, save your entire
‘work space’ when closing the project. Press File / Close project, and select
“Save” when prompted to save your ‘Workspace image’.
18.4 Solutions
Q1. There are 8 variables in this model – 5 observed variables, and 3 unob-
served variables (the 3 regression residuals). There are 5 exogenous (indepen-
dent) variables: PS and DH (observed), and residuals .WA, .SWB and .CT
(unobserved). There are 3 endogenous (dependent) variables: WA, CT and
SWB.
Q2. The output prints that the “Number of free parameters” is 13. You can
work out what these are just by looking at the diagram, or by counting the
number of ‘Parameter estimates’ printed in the output. The best way to practice
is to have a go just looking at the diagram, and then check with the output.
There are 6 regression paths (arrows on diagram) to estimate. There are also 3
paths from residuals to the observed variables, but they are fixed to 1 for setting
the scale of residuals and therefore are not estimated. There are 2 covariances
(bi-directional arrows) to estimate: one for the independent variables PS and
DH, and the other is for the residuals of WA and CT. There are 5 variances
(look at how many IVs you have) to estimate. There are 3 variances of residuals,
and also 2 variances of the independent (exogenous) variables PS and DH. In
total, this makes 6+2+5=13 parameters.
Now let’s calculate sample moments, which is the number of observed variances
and covariances. There are 5 observed variables, therefore 5(5+1)/2 = 15 moments
in total (made up from 5 variances and 10 covariances). The degrees of
freedom are then 15 (moments) – 13 (parameters) = 2 (df).
Q3. The chi-square test (Chi-square = 10.682, Degrees of freedom = 2, P-value
= .005) suggests rejecting the model because the test is significant (p-value is
very low). In this case, we cannot “blame” the poor chi-square result on the
large sample size (our sample is not that large).
Q4. The residual output makes it very clear that all but 2 covariances were
freely estimated (and fully accounted for) by the model – this is why their
residuals are exactly 0. The only omitted connections were between DH and
SWB, and between PS and SWB (look at the diagram - there are no direct
paths between these pairs of variables). We modelled these relationships as fully
mediated by WA and CT, respectively. So, essentially, this model tests only
for the absence of 2 direct effects – the rest of the model is unrestricted. The
residuals for these omitted paths are small in magnitude (just under 0.1), but
one borderline residual PS/SWB (=0.096) is significantly different from zero
(stdz value = 2.051 is larger than the critical value 1.96).
Based on the residual output, we have detected an area of local misfit, pertaining
to the lack of direct path from PS to SWB. This misfit is “small” (not “trivial”)
in terms of effect size, and it is significant.
Q5. All paths are significant at the 0.05 level (p-values are less than 0.05), and
the standard errors of these estimates are small (of magnitude 1/SQRT(N), as
they should be for an identified model).
Regressions:
Estimate Std.Err z-value P(>|z|)
WA ~
PS -0.339 0.080 -4.256 0.000
DH -0.165 0.080 -2.065 0.039
CT ~
PS -0.191 0.078 -2.430 0.015
DH -0.349 0.078 -4.452 0.000
SWB ~
WA 0.328 0.061 5.332 0.000
CT 0.550 0.061 8.950 0.000
Q6. Considering that all our observed variables have variance 1, the variances
for residuals .WA and .CT are quite large (about 0.8). So the predictors only
explained about 20% of variance in these variables. The error variance for
.SWB is quite small (0.39) – which means that the majority of variance in
SWB is explained by the other variables in the model.
Variances:
Estimate Std.Err z-value P(>|z|)
.WA 0.811 0.094 8.631 0.000
.CT 0.787 0.091 8.631 0.000
.SWB 0.390 0.045 8.631 0.000
Q7. The total indirect effect of PS on SWB is –.216. Let’s see how it is
computed. Look at the regression weights in the answer to Q5. The route from
PS to SWB via WA: (–0.339) × 0.328 = –0.111. The route from PS to SWB via
CT: (–0.191) × 0.550 = –0.105. Adding these two routes, we get the indirect effect
= –0.216. This shows you how to compute the effects from given parameter
values by hand, but we obtained the same answers through parameter labels.
Now, if you look into the original correlation matrix, View(Wellbeing), you
will see that the covariance of PS and SWB is -0.21. This is close to the
total indirect effect (-0.216). The values are not exactly the same, which means
that the indirect effects do not explain the observed covariance fully. Adding a
direct path from PS to SWB (which is omitted in this model) would explain
the observed covariance 100%.
Q8. Perhaps one could alter the level of daily hassles or physical symptoms
through an intervention and see what difference (if any) this makes to the out-
comes compared to the non-intervention condition.
Exercise 19
Fitting an autoregressive
model to longitudinal test
measurements
19.1 Objectives
In this exercise, you will fit a special type of path analysis model - an autoregres-
sive model. Such models involve repeated (observed) measurements over several
time points, and each subsequent measurement is regressed on the previous one,
hence the name. The only latent variables in these models are regression resid-
uals; there are no common factors or errors of measurement. This exercise
will demonstrate one of the weaknesses of Path Analysis, specifically, why it is
important to model the measurement error for unreliable measures.
Data for this exercise is taken from a study by Osborne and Suddick (1972).
In this study, N=204 children took the Wechsler Intelligence Test for Children
(WISC) at ages 6, 7, 9 and 11. There are two subtests - Verbal and Non-verbal
reasoning.
Scores on both tests were recorded for each child at each time point, resulting
in 4 variables for the Verbal test (V1, V2, V3, V4) and 4 variables for the
Non-verbal test (N1, N2, N3, N4).
In terms of data, we only have summary statistics - means, Standard Deviations
and correlations of all the observed variables - which were reported in the article.
As you can see in the table below, the mean of Verbal test increases as the
children get older, and so does the mean of Non-verbal test. The Standard
Deviations also increase, suggesting the increasing spread of scores over time.
However, the test scores of the same kind correlate strongly, suggesting that
there is a certain stability to the rank ordering of children: despite everyone
improving their scores, the children doing best at Time 1 still tend to do better
at Times 2, 3 and 4.
To complete this exercise, you need to repeat the analysis of Verbal subtests from
a worked example below, and answer some questions. Once you are confident,
you can repeat the analyses of Non-verbal subtests independently.
The summary statistics are stored in the file WISC.RData. Download this
file and save it in a new folder (e.g. “Autoregressive model of WISC data”). In
RStudio, create a new project in the folder you have just created. Create a new
R script, where you will be writing all your commands.
If you want to analyse the data straight away, you can simply load the prepared
data into the environment:
load("WISC.RData")
# now fill the upper triangle part with the transposed values from the lower diagonal part
WISC.corr[upper.tri(WISC.corr)] <- t(WISC.corr)[upper.tri(WISC.corr)]
Examine the objects by either clicking on them in the Environment tab or
calling them, for example View(WISC.corr). You should see that WISC.corr
is indeed a full correlation matrix, with values 1 on the diagonal.
Before we proceed to fitting a path model, however, we need to produce a co-
variance matrix for WISC data, because model fitting functions in lavaan work
with covariance matrices. Of course you can supply a correlation matrix, but
then you will lose the original measurement scale of your variables and essen-
tially standardize them. I would like to keep the scale of WISC tests, and get
unstandardized parameters. Thankfully, it is easy to produce a covariance ma-
trix from the WISC correlation matrix and SDs. Since cov(X,Y) is the product
of corr(X,Y), SD(X) and SD(Y), we can write:
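One way to code this (a sketch, assuming WISC.SD is a numeric vector of standard deviations in the same variable order as WISC.corr):

```r
# cov(X,Y) = corr(X,Y) * SD(X) * SD(Y). %o% is the outer product, so
# WISC.SD %o% WISC.SD is the matrix of SD(X)*SD(Y) products.
WISC.cov <- WISC.corr * (WISC.SD %o% WISC.SD)
```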
library(lavaan)
We will fit the following auto-regressive path model, for now without means
(you can bring the means in as an optional challenge later).
Figure 19.1: Autoregressive model for WISC Verbal subtest
Coding this model using the lavaan syntax conventions is simple. You simply
need to describe all the regression relationships in Figure 1. Let’s work from
left to right of the diagram:
• V2 is regressed on V1 (residual for V2 is assumed)
• V3 is regressed on V2 (residual for V3 is assumed)
• V4 is regressed on V3 (residual for V4 is assumed)
That’s it. Try to write the respective statements using the shorthand symbol
~ for is regressed on. You should obtain something similar to this (I called my
model V.auto, but you can call it whatever you like):
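For example (object name V.auto as in the text; your exact formatting may differ):

```r
# Autoregressive chain: each measurement is regressed on the previous one
V.auto <- '
  V2 ~ V1
  V3 ~ V2
  V4 ~ V3
'
```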
To fit this path model, we need function sem(). We need to pass to it: the model
(model=V.auto, or simply V.auto if you put this argument first), and the data.
However, the argument data is reserved for raw data (subjects by variables),
but our data is the covariance matrix! So, we need to pass our matrix to the
argument sample.cov, which is reserved for sample covariance matrices. We
also need to pass the number of observations (sample.nobs = 204), because
the covariance matrix does not provide this information.
To get access to model fitting results stored in object fit, request the extended
output including fit measures and standardized parameter estimates:
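Putting these pieces together, the calls might look like this (a sketch; WISC.cov contains all 8 variables, but lavaan uses only those named in the model):

```r
# Fit the autoregressive path model to the covariance matrix
fit <- sem(model = V.auto, sample.cov = WISC.cov, sample.nobs = 204)

# Extended output: fit measures plus standardized estimates (Std.lv, Std.all)
summary(fit, fit.measures = TRUE, standardized = TRUE)
```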
##
## Model Test User Model:
##
## Test statistic 58.473
## Degrees of freedom 3
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 584.326
## Degrees of freedom 6
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.904
## Tucker-Lewis Index (TLI) 0.808
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -1864.023
## Loglikelihood unrestricted model (H1) -1834.786
##
## Akaike (AIC) 3740.046
## Bayesian (BIC) 3759.955
## Sample-size adjusted Bayesian (SABIC) 3740.945
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.301
## 90 Percent confidence interval - lower 0.237
## 90 Percent confidence interval - upper 0.371
## P-value H_0: RMSEA <= 0.050 0.000
## P-value H_0: RMSEA >= 0.080 1.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.099
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## V2 ~
## V1 0.754 0.051 14.691 0.000 0.754 0.717
## V3 ~
## V2 0.905 0.055 16.496 0.000 0.905 0.756
## V4 ~
## V3 1.162 0.062 18.847 0.000 1.162 0.797
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .V2 18.170 1.799 10.100 0.000 18.170 0.486
## .V3 22.971 2.274 10.100 0.000 22.971 0.428
## .V4 41.560 4.115 10.100 0.000 41.560 0.365
residuals(fit, type="cor")
## $type
## [1] "cor.bollen"
##
## $cov
## V2 V3 V4 V1
## V2 0.000
## V3 0.000 0.000
## V4 0.124 0.000 0.000
## V1 0.000 0.184 0.221 0.000
What do these results mean? Shall we conclude that the model requires direct
effects for non-adjacent measures, for example a direct path from V1 (verbal
score at age 6) to V3 (score at age 9)?
A predictor is said to have a “sleeper” effect on a distal outcome if it influenced it
not only indirectly through the more proximal outcome, but directly too. The
unexplained relationships between distal points showing up in large residuals
seem to suggest that we have such “sleeper” effects. For example, some variance
in verbal score at age 9 that could not be explained by verbal score at age 7
could, however, be explained by the verbal score at age 6. It appears that some
of the effect from age 6 was “sleeping” until age 9 when it resurfaced again.
Here we need to pause and look at the correlation structure implied by the
autoregressive model and compare it with the correlations observed in this study.
If indeed the autoregressive relationships held, then any regression path is the
product of the paths it consists of, for example:
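For example, the implied standardized path from V1 to V3 is the product b13 = b12 × b23. A quick check against the estimates in the Std.all column of the output:

```r
b12 <- 0.717   # V2 ~ V1, standardized
b23 <- 0.756   # V3 ~ V2, standardized
b12 * b23      # implied corr(V1, V3) is about 0.542, versus 0.726 observed
```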
The standardized direct paths b12 and b23, as we worked out in Question
1, are equal to the observed correlations between adjacent measures V1 and V2,
and V2 and V3, respectively. But according to our data, b13 (0.726) is much
larger than the product of b12 (0.717) and b23 (0.756), which should be 0.542.
Our observed correlations fail to reduce to the extent the autoregressive model
predicts.
To summarise, the autoregressive model predicts that the standardized indirect
effect should get smaller the more distal the outcome is. But our data does not
support this prediction. This finding is typical for data of this kind. And
the reason for it is not necessarily the existence of “sleeper” effects; it can
often be attributed to the failure to account for the error of measurement.
As you know, many psychological constructs such as “verbal reasoning” are not
measured perfectly. According to the Classical Test Theory, variance in the
observed measure (Y) is due to the ‘true’ score (T) and the measurement error
(E):
Y = T + E, and, assuming the ‘true’ score and error are uncorrelated, Var(Y) =
Var(T) + Var(E). Covariances between different measures, in contrast, involve
only the ‘true’ scores, because errors are assumed to correlate with nothing else.
Therefore, the error “contaminates” variances but not covariances. In the cor-
relation matrix then, the off-diagonal elements reflect the relationships between
the ‘true’ (standardized) scores, but the diagonal reflects the ‘true’ (standard-
ized) score variances plus the measurement errors! So, the correlation matrix
does NOT reflect correlations of ‘true’ scores. Instead, it reflects covariances
of ‘true’ scores with variances smaller than 1. The actual correlations of ‘true’
scores could be estimated if we knew their variances. In Classical Test Theory,
score reliability is defined as the proportion of variance in the observed score
due to ‘true’ score. So, knowing and accounting for the score reliability would
enable us to obtain correct estimates of the ‘true’ score correlations, and fit an
autoregressive model to those estimates.
To conclude, it is important to appreciate limitations of path analysis. Path
analysis deals with observed variables, and it assumes that these variables are
measured without error. This assumption is almost always violated in psycho-
logical measurement. Test scores are not 100% reliable, and WISC scores in
this data analysis example are no exception. Every time we deal with imper-
fect measures, we will encounter similar problems. The lower the reliability of
our observed variables, the greater the problems might be. These need to be
understood to draw correct conclusions from analysis.
To address this limitation, we need to account for the error of measurement.
Models with measurement parts (where latent traits are indicated by observed
variables) and structural parts (where we explore relationships between latent
constructs - ‘true’ scores - rather than their imperfect measures) will be a way
of dealing with unreliable data. I will provide such a full structural model for
WISC data in Exercise 20.
After you have finished working with this exercise, save your R script with a meaningful
name, for example “Autoregressive model of WISC”. To keep all of the created
objects, which might be useful when revisiting this exercise, save your entire
‘work space’ when closing the project. Press File / Close project, and select
“Save” when prompted to save your ‘Workspace image’.
19.5 Solutions
Q1. The standardized regression paths are as follows:
Std.all
V2 ~ V1 0.717
V3 ~ V2 0.756
V4 ~ V3 0.797
These are equal to the observed correlations between V1 and V2, V2 and V3,
and V3 and V4, respectively.
Q2. Let’s first calculate sample moments, which is the number of observed
variances and covariances. There are 4 observed variables, therefore 4x(4+1)/2
= 10 moments in total (made up from 4 variances and 6 covariances).
The output prints that the “Number of model parameters” is 6. You can work
out what these are just by looking at the diagram, or by counting the number of
‘Parameter estimates’ printed in the output. The best way to practice is to have
a go just looking at the diagram, and then check with the output. There are 3
regression paths (arrows on diagram) to estimate. There are also 3 paths from
residuals to the observed variables, but they are fixed to 1 for setting the scale of
residuals and therefore are not estimated. There are also 3 variances of regression
residuals. So far, this makes 3+3=6 parameters. But there is actually one more
parameter that lavaan does not print - the variance of the independent variable V1.
It is of course just the variance that we already know from the sample statistics;
nevertheless, it is a model parameter (although trivial). [If you want the variance
of V1 printed in the output, add the reference to it explicitly: V1 ~~ V1.]
Then, the degrees of freedom are 3 (as lavaan rightly tells you), made up from
10 (moments) – 7 (parameters) = 3 (df).
Q3. The chi-square test (Chi-square = 58.473, Degrees of freedom = 3, P-value
< .001) suggests rejecting the model because the test is significant (p-value is
very low). In this case, we cannot “blame” the poor chi-square result on the
large sample size (N=204 is not that large).
Part IX
STRUCTURAL EQUATION MODELLING
Exercise 20
Fitting a latent autoregressive model to longitudinal test measurements
20.1 Objectives
In this exercise, you will test for autoregressive relationships between latent
constructs. This will be done through a so-called full structural model, which
includes measurement part(s) (where latent traits are indicated by observed
variables) and a structural part (where we explore relationships between latent
constructs - ‘true’ scores - rather than their imperfect measures). Incorporating
the measurement part will be a way of dealing with unreliable data because
the error of measurement can be explicitly controlled. As in Exercise 19, we
will model repeated (observed) measurements over several time points, where
each subsequent measurement occasion is regressed on the previous one, hence
the name. However, we will model the autoregressive relationships not between
the observed measures but between the latent variables (‘true’ scores) in the
structural part of the model.
You can continue working in the project created for Exercise 19. Alterna-
tively, download the summary statistics for WISC data stored in the file
WISC.RData, save it in a new folder (e.g. “Latent autoregressive model of
WISC data”), and create a new project in this new folder.
If it is not in your Environment already, load the data consisting of
means (WISC.mean, which we will not use in this exercise), correlations
(WISC.corr), and SDs (WISC.SD):
load("WISC.RData")
And produce the WISC covariance matrix from the correlations and SDs:
# rescale the correlation matrix by the SDs to obtain covariances
WISC.cov <- diag(WISC.SD) %*% WISC.corr %*% diag(WISC.SD)
varnames=c("V1","V2","V3","V4","N1","N2","N3","N4")
rownames(WISC.cov) <- varnames
colnames(WISC.cov) <- varnames
20.3 Worked example - testing a latent autoregressive model for WISC Verbal subtest
library(lavaan)
Figure 20.1: Latent autoregressive model for WISC Verbal subtest
There are 4 measurement models - one for each of the Verbal scores across 4
time points. These models are based on the classic true-score model:
V = T + E
Each observed verbal score, for example V1, is a sum of two latent variables -
the ‘true’ score (T1) and the error of measurement (E1). The same applies to
V2, V3 and V4. The paths from each T and E latent variables to the observed
measures of V are fixed to 1, as the true-score model prescribes.
With these measurement models in place, the autoregressive relationships per-
tain to the ‘true’ scores only: T1, T2, T3 and T4.
Let’s code this model using the lavaan syntax conventions, starting with de-
scribing the four measurement models:
• T1 is measured by V1 (error of measurement E1 is assumed; lavaan will label
it .V1)
• T2 is measured by V2 (error of measurement E2 is assumed; lavaan will label
it .V2)
• T3 is measured by V3 (error of measurement E3 is assumed; lavaan will label
it .V3)
• T4 is measured by V4 (error of measurement E4 is assumed; lavaan will label
it .V4)
We then describe the autoregressive relationships between T1, T2, T3 and T4.
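Putting the description above into lavaan syntax, the model presumably looks along these lines (a sketch; lavaan fixes the single-indicator loadings to 1 by default, and creates the residual terms .V1 to .V4 automatically):

```r
T.auto <- ' # latent autoregressive model for WISC Verbal scores
  # measurement part: each true score has a single indicator
  T1 =~ V1
  T2 =~ V2
  T3 =~ V3
  T4 =~ V4
  # structural part: autoregressive paths between true scores
  T2 ~ T1
  T3 ~ T2
  T4 ~ T3
'
```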
To fit this path model, we need function sem(). We need to pass to it: the model
(model=T.auto, or simply T.auto if you put this argument first), and the data.
However, because our data is the WISC covariance matrix, we need to pass our
matrix to the argument sample.cov, rather than data which is reserved for raw
subject data. We also need to pass the number of observations (sample.nobs
= 204), because this information is not contained in the covariance matrix.
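Putting these arguments together, the fitting call is presumably (fit1 being the object name referred to below):

```r
fit1 <- sem(T.auto, sample.cov = WISC.cov, sample.nobs = 204)
```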
To get access to the model fitting results stored in object fit1, request the
standardized parameter estimates:
summary(fit1, standardized=TRUE)
Examine the output, starting with the Chi-square statistic. Refer to Exercise 19
for the model output, and you will see that the Chi-square is exactly the same,
58.473 on 3 Degrees of freedom. This is not what we expected. We expected
different fit due to estimating more parameters (errors of measurement that
we included in the model). Also, the standardized regression coefficients for
the ‘true’ scores T2~T1, T3~T2, and T4~T3 are the same as the standardized
regression coefficients for the observed scores V2~V1, V3~V2 and V4~V3 in the
path model of Exercise 19. To find out where we went wrong, let’s examine
the parameter estimates. Ah! The variances of errors of measurement that we
assumed for the observed variables (.V1, .V2, .V3 and .V4) have been fixed
to zero by lavaan:
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.V1 0.000 0.000 0.000
.V2 0.000 0.000 0.000
.V3 0.000 0.000 0.000
.V4 0.000 0.000 0.000
Obviously, our model is then reduced to the path model in Exercise 19
because it assumes that the errors of measurement are zero (and the observed
scores are perfectly reliable).
To specify the model in the way we intended, we need to explicitly free up the
variances of measurement errors, using the ~~ syntax convention. We add the
following syntax to T.auto, making T.auto2:
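A sketch of what the added syntax presumably looks like (appending the freed error variances to the model string and refitting):

```r
T.auto2 <- paste(T.auto, '
  # free the error variances of the observed scores
  V1 ~~ V1
  V2 ~~ V2
  V3 ~~ V3
  V4 ~~ V4
')
fit2 <- sem(T.auto2, sample.cov = WISC.cov, sample.nobs = 204)
```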
summary(fit2)
## .V2 8.833 NA
## .V3 8.016 NA
## .V4 28.797 NA
## T1 22.820 NA
## .T2 0.069 NA
## .T3 4.758 NA
## .T4 0.041 NA
From the output it is clear that our model is not identified, because we have
a negative number of degrees of freedom, -1. This means we have 1 more
parameter to estimate than the number of sample moments available.
The previous model, T.auto, had 3 degrees of freedom. By asking to estimate
4 error variances, we used up all these degrees of freedom and one more, and
we ran out.
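A standard remedy, and the one the Solutions below imply (note the shared label (e) on the four error variances in the output), is to constrain the error variances to be equal across time, so that they cost only 1 parameter instead of 4. A sketch of this constrained model:

```r
T.auto3 <- paste(T.auto, '
  # error variances constrained equal over time via the shared label "e"
  V1 ~~ e*V1
  V2 ~~ e*V2
  V3 ~~ e*V3
  V4 ~~ e*V4
')
fit3 <- sem(T.auto3, sample.cov = WISC.cov, sample.nobs = 204)
```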
And now we can fit the model again and obtain the summary output:
summary(fit3, standardized=TRUE)
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .V1 (e) 8.494 1.112 7.638 0.000 8.494 0.251
## .V2 (e) 8.494 1.112 7.638 0.000 8.494 0.228
## .V3 (e) 8.494 1.112 7.638 0.000 8.494 0.158
## .V4 (e) 8.494 1.112 7.638 0.000 8.494 0.075
## T1 25.329 3.529 7.178 0.000 1.000 1.000
## .T2 2.972 2.053 1.448 0.148 0.103 0.103
## .T3 4.738 2.081 2.276 0.023 0.105 0.105
## .T4 19.986 4.278 4.672 0.000 0.190 0.190
Now examine the model parameters, and try to answer the following questions.
QUESTION 3. What is the standardized effect of T1 on T2? How does it
compare to the observed correlation between V1 and V2? If you have completed
Exercise 19, how would you interpret this result?
QUESTION 4. What is the (unstandardized) variance of the measurement
error of the Verbal test? Why are the standardized values (Std.all) for this
parameter different for the 4 measurements? What does it tell you about the
reliability of the Verbal scores?
To summarise, the latent autoregressive model predicts that the standardized
indirect effect should get smaller the more distal the outcome is. And now our
data support this prediction. The correlations between the ‘true’ scores at the
adjacent time points are much stronger (0.9 or above) than the path model in
Exercise 19 would have you believe. The reason for the misfit of that path model
was not the existence of “sleeper” effects, but the failure to account for the error
of measurement.
After you have finished working on this exercise, save your R script with a meaningful
name, for example “Latent autoregressive model of WISC”. To keep all of the
created objects, which might be useful when revisiting this exercise, save your
entire ‘work space’ when closing the project. Press File / Close project, and
select “Save” when prompted to save your ‘Workspace image’.
20.4 Further practice - testing a latent autoregressive model for WISC Non-Verbal subtest
20.5 Solutions
Q1. There are 4 observed variables, therefore 4x(4+1)/2 = 10 sample moments
in total (made up from 4 variances and 6 covariances).
The output prints that the “Number of model parameters” is 11, and “number
of equality constraints” is 3. The 11 parameters are made up of 3 regression
paths (arrows on diagram) for the ‘true’ scores (T variables), plus 3 variances of
regression residuals for T2, T3 and T4 variables, plus 1 variance of independent
latent variable T1, plus the 4 error variances of V variables. Nominally, this
makes 11, but of course there are only 11-3=8 parameters to estimate because
we imposed equality constraints on the error variances, estimating only 1
parameter instead of 4 (3 fewer parameters than in the model T.auto2).
Then, the degrees of freedom is 2 (as lavaan rightly tells you), made up from
10 (moments) – 8 (parameters) = 2 (df).
Q2. The chi-square test (Chi-square = 2.115, Degrees of freedom = 2) suggests
accepting the model because it is quite likely (P-value = .347) that the observed
data could emerge from the population in which the model is true.
Q3. The standardized effect of T1 on T2 is 0.947 (you can look at Std.lv,
which stands for “standardized latent variables”). This is much higher than the
observed correlations between V1 and V2, which was 0.717. From the expla-
nation of this latter value in Exercise 19 it follows that 0.717 was an estimate
of covariance between ‘true’ Verbal scores at ages 6 and 7 rather than their
correlation. Because the scores had error of measurement, the (standardized)
variances of their ‘true’ scores were lower than 1, and therefore the covariance
of 0.717 equated to a much higher correlation of ‘true’ scores, which is now
estimated as 0.947.
Q4. The unstandardized variance of the measurement error of the Verbal test
is 8.494. These values are identical across 4 time points, and are marked with
label (e) in the output. However, the standardized values (Std.all) for this
parameter are different for the 4 measurements:
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
This is because the test score variance is composed of ‘true’ and ‘error’ vari-
ance, and even though the error variance was constrained equal, the ‘true’ score
variance change over time because the ‘true’ score is the property of the person,
and people change. Std.all values show you the proportion of error in the ob-
served score at each time point, and because the values are getting smaller, we
can conclude that the proportion of ‘true’ score variance increases. Presumably,
the ‘true’ score variance at time 4 (age 11) was the largest.
From these results, it is easy to calculate the reliability of the Verbal test scores.
By definition, reliability is the proportion of the variance in observed score due
to ‘true’ score. As the Std.all values give the proportion of error, reliabil-
ity is computed as 1 minus the given proportion of error. So, reliability of
V1 is 1-0.251=0.749, reliability of V2 is 1-0.228=0.772, reliability of V3 is
1-0.158=0.842 and reliability of V4 is 1-0.075=0.925. As Roderick McDonald
(1999) pointed out, reliability will vary according to the population sampled,
and is not a property of the test alone. On the contrary, the error of
measurement is, and this is why the ultimate goal of reliability analysis is to
obtain an estimate of the variance of E in the test score metric. In this data
analysis example, this estimate is e=8.494.
Exercise 21
Growth curve modelling of longitudinal measurements
21.1 Objectives
In this exercise, you will perform growth curve modelling of longitudinal mea-
surements of child development. These will be measurements of weight or height.
Redeeming features of such measures are that they have little or no error of mea-
surement (unlike most psychological data), and that they are continuous vari-
ables at the highest level of measurement - ratio scales. These features enable
analysis of growth patterns without being concerned about the measurement,
and thus focus on the structural part of the model. You will practice plotting
growth trajectories and modelling these trajectories with growth curve models
with random intercept and random slopes (both linear and quadratic).
First, I must make some acknowledgements. Data and many ideas for the
analyses in this exercise are due to Dr Jon Heron of Bristol Medical School.
This data example was first presented in a summer school that Dr Heron and
other dear colleagues of mine taught at the University of Cambridge in 2011.
Data for this exercise comes from the Avon Longitudinal Study of Parents
and Children (ALSPAC), also known as “Children of the Nineties”. This is
a population-based prospective birth-cohort study of ~14,000 children and their
parents, based in South-West England. For a child to be eligible for inclusion in
the study, their mother had to be resident in Avon and have an expected date
of delivery between April 1st 1991 and December 31st 1992.
We will analyse some basic growth measurements taken in clinics attended at
approximate ages 7, 9, 11, 13 and 15 years. There are N=802 children in the
data file ALSPAC.csv that we will consider today, 376 boys and 426 girls.
The analysis dataset is in wide-data format so that each repeated measure is
a different variable (e.g. ht7, ht9, ht11, …). The file contains the following
variables:
ht7/9/11/13/15 - Child height (cm) at the 7, 9, 11, 13 and 15 year clinics
wt7/9/11/13/15 - Child weight (kg) at each clinic
bmi7/9/11/13/15 - Child body mass index at each clinic
age7/9/11/13/15 - Child's actual age at measurement (months) at each clinic
female - Sex dummy (coded 1 for girl, and 0 for boy)
bwt - Birth weight (kg)
To complete this exercise, you need to repeat the analysis of weight measure-
ments from a worked example below, and answer some questions. Once you are
confident, you may want to explore the height measurements independently.
Download file ALSPAC.csv, save it in a new folder (e.g. “Growth curve model
of ALSPAC data”), and create a new project in this new folder.
Import the data (which is in the comma-separated-values format) into a new
data frame. I will call it ALSPAC.
First, produce and examine the descriptive statistics to learn about the coverage
and spread of measurements. As you can see, there are no missing data.
21.3 Worked example - fitting growth curve models to longitudinal measurements
library(psych)
# check the data coverage and spread
describe(ALSPAC)
You may also want to examine data by sex, because there might be differences
in growth patterns between boys and girls. Do this independently using function
describeBy(ALSPAC, group="female") from package psych.
We begin with visualizing the data. Perhaps the most useful way of looking
at individuals’ development over time is to connect individual values for each
time point, thus building “trajectories” of change. Package lcsm offers a nice
function plot_trajectories() that does exactly that:
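Assuming the data frame contains a child identifier column named "id" (the Solutions section uses this name), the call for all children might look like this:

```r
library(lcsm)
plot_trajectories(data = ALSPAC,
                  id_var = "id",
                  var_list = c("wt7", "wt9", "wt11", "wt13", "wt15"),
                  xlab = "Age in years",
                  ylab = "Weight in kilograms")
```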
[Figure: individual weight trajectories (roughly 25-100 kg) across the five clinic visits]
Each line on the graph represents a child, and you can see how weight of each
child changes over time. You can see that there is a strong tendency for weight
to increase, as you might expect, but also quite a lot of variation in the starting
weight and also in the rate of change. You can perhaps see an overall tendency
for almost linear growth between ages 7 and 11, then slowing as children reach
13, and then a fast acceleration between 13 and 15 (puberty).
You may, however, suspect that there might be systematic differences between
trajectories of boys and girls. To address this, we can plot the sexes separately.
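For example, for boys only (the Solutions give the matching call for girls, with female==1 and a red line colour):

```r
# boys only
plot_trajectories(data = ALSPAC[ALSPAC$female==0, ],
                  id_var = "id",
                  var_list = c("wt7", "wt9", "wt11", "wt13", "wt15"),
                  xlab = "Age in years",
                  ylab = "Weight in kilograms")
```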
[Figure: individual weight trajectories (roughly 25-100 kg) for boys only]
QUESTION 1. Using the syntax above as a guide, plot the individual trajec-
tories for girls, in a different colour (say, “red”). Do the trajectories for girls
look similar to the trajectories for boys?
One aim of such models is to summarize growth through new variables, which
can be used in analyses as predictors, or outcomes, or mediators, etc. Another
21.3. WORKED EXAMPLE - FITTING GROWTH CURVE MODELS TO LONGITUDINAL MEASUREMENTS O
wt7 = intercept + e7
wt9 = intercept + 2*slope + e9
wt11 = intercept + 4*slope + e11
wt13 = intercept + 6*slope + e13
The slope (latent variable) is the rate of change (kg per year), and coefficients
in front of this variable represent the number of years passing from the first
measurement at age 7. Obviously, at age 7 the coefficient for slope is 0. The
intercept (latent variable) represents the predicted weight at 7 for each child
(this may or may not correspond to the child’s actual weight at 7). Of course,
weights predicted by this rather crude model will not describe the data exactly,
and that is why we need errors of prediction (variables e7 to e13).
The hypothesized linear growth curve model can be depicted as follows.
To code this model, we describe the growth variables - intercept i and slope s
- using the lavaan syntax conventions. We use the shorthand symbol =~ for is
measured by, and use * to fix the “loading” coefficients as the equations above
specify.
Linear <- ' # linear growth model with 4 measurements
# intercept and slope with fixed coefficients
i =~ 1*wt7 + 1*wt9 + 1*wt11 + 1*wt13
s =~ 0*wt7 + 2*wt9 + 4*wt11 + 6*wt13
'
To fit this GC model, lavaan has function growth(). We need to pass to it: the
model (model=Linear, or simply Linear if you put this argument first), and
the data.
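The fitting call and the summary request were presumably along these lines (the object name fitL matches the residuals(fitL, ...) call later in this section; fit.measures=TRUE is assumed because the output reports RMSEA, SRMR and CFI):

```r
fitL <- growth(Linear, data = ALSPAC)
summary(fitL, fit.measures = TRUE)
```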
Figure 21.1: Linear growth curve model for 4 repeated bodyweight measurements
## RMSEA 0.377
## 90 Percent confidence interval - lower 0.340
## 90 Percent confidence interval - upper 0.416
## P-value H_0: RMSEA <= 0.050 0.000
## P-value H_0: RMSEA >= 0.080 1.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.096
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## i =~
## wt7 1.000
## wt9 1.000
## wt11 1.000
## wt13 1.000
## s =~
## wt7 0.000
## wt9 2.000
## wt11 4.000
## wt13 6.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## i ~~
## s 3.714 0.354 10.502 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .wt7 0.000
## .wt9 0.000
## .wt11 0.000
## .wt13 0.000
## i 25.882 0.222 116.466 0.000
## s 3.982 0.067 59.129 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .wt7 0.861 0.403 2.138 0.033
residuals(fitL, type="cor")
## $type
## [1] "cor.bollen"
##
## $cov
## wt7 wt9 wt11 wt13
## wt7 0.000
## wt9 0.011 0.000
## wt11 -0.007 0.016 0.000
## wt13 0.008 -0.001 0.036 0.000
##
## $mean
## wt7 wt9 wt11 wt13
## -0.020 0.043 0.089 -0.151
Under $cov, we have the differences between the actual and expected covariances
of the repeated measures. This information can help find weaknesses in the
model by finding residuals that are large compared to the rest. No large residuals
(greater than 0.1) can be seen in the output.
Under $mean, we have the differences between actual minus estimated means.
The residuals are largest for the last measurement point (wt13), suggesting
that a linear model might not be adequately capturing the longitudinal trend in
weight. The negative sign of the residual means that the model predicts larger
weight at age 13 than actually observed. We conclude that the linear model
fails to describe adequately the slowing down of growth towards the age 13.
To address the inadequate fit of the linear model, we will add a quadratic slope,
which will capture the non-linear trend (slowing down of growth) that we ob-
served on the trajectory plots and also by examining residuals.
Now we will describe each observation as a weighted sum of the intercept, the slope
and the quadratic slope:
wt7 = intercept + e7
wt9 = intercept + 2*slope + 4*qslope + e9
wt11 = intercept + 4*slope + 16*qslope + e11
wt13 = intercept + 6*slope + 36*qslope + e13
The qslope (latent variable) is the quadratic rate of change (kg per year
squared), and the coefficients in front of this variable represent the squared
number of years passing from the first measurement at age 7. Obviously, at age 7 the
coefficient for quadratic slope is 0.
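Following the same conventions as the Linear model, the Quadratic model can be sketched as follows (the fixed loadings 0, 4, 16 and 36 match the output printed later):

```r
Quadratic <- ' # quadratic growth model with 4 measurements
  # intercept, slope and quadratic slope with fixed coefficients
  i =~ 1*wt7 + 1*wt9 + 1*wt11 + 1*wt13
  s =~ 0*wt7 + 2*wt9 + 4*wt11 + 6*wt13
  q =~ 0*wt7 + 4*wt9 + 16*wt11 + 36*wt13
'
fitQ <- growth(Quadratic, data = ALSPAC)
```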
summary(fitQ)
## s ~~
## q -0.380 0.053 -7.159 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .wt7 0.000
## .wt9 0.000
## .wt11 0.000
## .wt13 0.000
## i 25.593 0.218 117.646 0.000
## s 4.566 0.111 41.264 0.000
## q -0.153 0.014 -10.670 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .wt7 4.747 1.867 2.543 0.011
## .wt9 -0.167 0.983 -0.170 0.865
## .wt11 9.494 1.432 6.629 0.000
## .wt13 -17.604 4.541 -3.877 0.000
## i 13.365 2.077 6.435 0.000
## s 2.965 0.518 5.726 0.000
## q 0.093 0.012 7.641 0.000
The output shows that two residual variances, .wt9 and .wt13, are negative. This is
obviously a problem, and could be a sign of model “overfitting”, when there are
too many parameters estimated in the model. Indeed, this model estimates 13
parameters on 14 sample moments and has only 1 degree of freedom.
Examining the output again, we can see that the variance of q (quadratic slope)
is only 0.093, which is very small compared to the variances of other growth
factors i and s. A good way to reduce the number of parameters then is to
fix the variance of q to 0, making it a fixed rather than a random effect. The
quadratic slope will have the same effect on everyone’s trajectory (it will have
the mean but no variance). And if q has no variance, it obviously should have
zero covariance with the other growth factors i and s, so we fix those covariances
to 0 as well. Here is the modified model:
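A sketch of the modified model, fixing the variance of q and its covariances with i and s to zero (matching the zero entries printed in the output):

```r
QuadraticFixed <- ' # quadratic growth model, q as a fixed effect
  i =~ 1*wt7 + 1*wt9 + 1*wt11 + 1*wt13
  s =~ 0*wt7 + 2*wt9 + 4*wt11 + 6*wt13
  q =~ 0*wt7 + 4*wt9 + 16*wt11 + 36*wt13
  # fix the variance of q, and its covariances with i and s, to zero
  q ~~ 0*q
  q ~~ 0*i
  q ~~ 0*s
'
fitQF <- growth(QuadraticFixed, data = ALSPAC)
```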
summary(fitQF, fit.measures=TRUE)
##
## SRMR 0.072
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## i =~
## wt7 1.000
## wt9 1.000
## wt11 1.000
## wt13 1.000
## s =~
## wt7 0.000
## wt9 2.000
## wt11 4.000
## wt13 6.000
## q =~
## wt7 0.000
## wt9 4.000
## wt11 16.000
## wt13 36.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## i ~~
## q 0.000
## s ~~
## q 0.000
## i ~~
## s 3.384 0.336 10.062 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .wt7 0.000
## .wt9 0.000
## .wt11 0.000
## .wt13 0.000
## i 25.786 0.221 116.892 0.000
## s 4.701 0.092 51.035 0.000
## q -0.155 0.013 -12.161 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## q 0.000
## .wt7 0.317 0.395 0.802 0.422
## .wt9 4.626 0.418 11.068 0.000
## .wt11 5.460 0.612 8.919 0.000
## .wt13 5.874 1.109 5.296 0.000
## i 17.981 1.385 12.979 0.000
## s 1.465 0.118 12.457 0.000
Now that we have a reasonably fitting model, at least according to the SRMR
(0.072) and CFI (0.933), we can interpret the growth factors.
The section Intercepts presents the means of i, s and q, which we might refer
to as the fixed-effect part of the model. We interpret these as follows:
The average 7-year-old boy in the population starts off with a body weight
of 25.786kg, and that weight then increases by about 4.7kg per year, with an
adjustment of -0.155 times the squared number of years since baseline. The
adjustment therefore gets bigger as time goes by, slowing the linear rate of
growth.
The sections Variances and Covariances describe the random-effects part of
the model, or the variances and covariance of our growth factors i and s. There
is a strong positive covariance 3.384 indicating that larger children at age 7 tend
to gain weight at a faster rate than smaller children. There is also considerable
variation in both intercept (var(i)=17.981) and linear slope (var(s)=1.465) as
evidenced by the plot of the raw data from earlier.
Finally, we can see the individual values on the growth factors (numbers describ-
ing intercept and slope for each boy’s individual trajectory). Package lavaan
has function lavPredict(), which will compute these values for each boy in the
sample. You need to apply this function to the result of fitting our best model
with fixed quadratic effect, fitQF.
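The call producing the factor scores below is presumably (wrapped in head() since only the first six boys are shown):

```r
head(lavPredict(fitQF))
```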
## i s q
## [1,] 22.58784 3.984492 -0.1548514
## [2,] 21.08919 3.667423 -0.1548514
## [3,] 29.87843 5.794144 -0.1548514
## [4,] 27.29049 4.850545 -0.1548514
## [5,] 28.26805 5.018105 -0.1548514
## [6,] 23.54283 3.867593 -0.1548514
You can see that there are 3 values for each boy - i, s and q. While i and s vary
between individuals, q is constant because we set it up as a fixed effect. You
can plot histograms of the intercept and linear slope values as follows:
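A minimal base-R sketch of such histograms (the book may have used a different plotting approach):

```r
# growth factor scores computed by lavPredict()
gf <- as.data.frame(lavPredict(fitQF))
hist(gf$i, main = "Intercept", xlab = "Weight at age 7 (kg)")
hist(gf$s, main = "Linear slope", xlab = "Weight gain (kg per year)")
```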
[Histograms of the intercept values (roughly 20-45 kg) and the linear slope values (roughly 3-9 kg per year)]
Both distributions have positive skews, with outliers located on the high end of
the weight scale at baseline, and the high end of the weight gain per year.
Finally, to give you an idea about how the random growth factors can be used in
research, I will provide a simple illustration of a growth model with covariates.
As we have the sex variable in the data set, we can easily investigate whether
the child’s sex has any effect on body weight trajectories at this age (between
7 and 13). To this end, we simply regress our random growth factors, i and s
on the sex dummy variable, female. We add the following lines to the model
QuadraticFixed:
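The added regressions presumably look like this (fitQFS is the object name used in the Solutions; QuadraticFixedSex is a hypothetical name for the extended model string):

```r
QuadraticFixedSex <- paste(QuadraticFixed, '
  # regress the random growth factors on the sex dummy
  i ~ female
  s ~ female
')
fitQFS <- growth(QuadraticFixedSex, data = ALSPAC)
```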
QUESTION 7. Does the child’s sex have a significant effect on the growth
intercept? Who starts off with higher weight - boys or girls? Does sex have an
effect on the linear slope? Who gains weight at a higher rate - boys or girls?
After you have finished working on this exercise, save your R script with a meaningful
name, for example “Growth curve model of body weights”. To keep all of the
created objects, which might be useful when revisiting this exercise, save your
entire ‘work space’ when closing the project. Press File / Close project, and
select “Save” when prompted to save your ‘Workspace image’.
To practice further, fit the growth curve models to the girls’ body weights, and
interpret the results.
21.5 Solutions
Q1.
# girls only
plot_trajectories(data = ALSPAC[ALSPAC$female==1, ],
id_var = "id",
var_list = c("wt7", "wt9", "wt11", "wt13", "wt15"),
xlab = "Age in years",
ylab = "Weight in kilograms",
line_colour = "red")
[Figure: individual weight trajectories for girls, plotted in red (roughly 25-75 kg)]
The trajectories for girls look quite similar to the trajectories for boys. The
spread is quite similar and the shapes are overall similar. Note that the scale
of the y axis (weight) is different, so the girls’ spread seems greater when in
fact it is not.
Q2. The output prints that the “Number of model parameters” is 9, which are
made up of 2 variances of latent variables i and s, 1 covariance of i and s, 2
intercepts of i and s, and 4 error (residual) variances of observed variables (.wt7,
.wt9, .wt11 and .wt13). There are 4 observed variables, therefore 4x(4+1)/2
= 10 sample moments in the variance/covariance matrix, plus 4 means, in total
14 sample moments. Then, the degrees of freedom is 5 (as lavaan tells you),
calculated as 14(moments) – 9(parameters) = 5(df).
Q3. The chi-square test (Chi-square = 272.201, Degrees of freedom = 5) sug-
gests rejecting the model because it is extremely unlikely (P-value < 0.001) that
the observed data could emerge from the population in which the model is true.
Q4. The CFI = 0.881 falls short of the acceptable level of 0.9 or above; the
RMSEA = 0.377 falls well short of the acceptable level of 0.08 or below; and the
SRMR = 0.096 is also short of the acceptable level of 0.08 or below. All in all, the model
does not fit the data.
Q5. QuadraticFixed model has 4 degrees of freedom. This is 3 more than in
the Quadratic model. This is because we fixed 3 parameters - variance of q,
its covariance with i and its covariance with s, releasing 3 degrees of freedom.
Q6. For the QuadraticFixed model, Chi-square = 154.874 on 4 DF. For the Linear
model, Chi-square = 272.201 on 5 DF. The improvement in Chi-square is
substantial, which the formal chi-square difference test confirms:
anova(fitL, fitQF)
##
## Chi-Squared Difference Test
##
## Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
## fitQF 4 8126.5 8165.8 154.87
## fitL 5 8241.9 8277.2 272.20 117.33 0.55622 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Q7.
summary(fitQFS, standardized=TRUE)
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
i ~
female -0.413 0.292 -1.416 0.157 -0.101 -0.051
s ~
female 0.255 0.088 2.911 0.004 0.213 0.106
Child’s sex does not have a significant effect on the growth intercept (p = 0.157).
However, child’s sex has a significant effect on the linear slope (p = 0.004).
Because the effect of the dummy variable female is positive, we conclude that
girls gain weight at a higher rate than boys. Looking at the standardized value
(0.106), this is a small effect.
Exercise 22
22.1 Objectives
The objective of this exercise is to fit a full structural model to repeated obser-
vations. This model will allow testing for significance of treatment effect (mean
difference in latent constructs for pre-treatment and post-treatment data). You
will learn how to implement measurement invariance constraints, which are es-
sential to make sure that the scale of measurement is maintained over time.
Data for this exercise is an anonymous sample from the Child and Adolescent
Mental Health Services (CAMHS) database. The sample includes children and
adolescents who were referred for psychological/psychiatric help with regard to
various problems. In order to evaluate the outcomes of these interventions, the
patients’ parents completed the Strengths and Difficulties Questionnaire (SDQ)
twice – at referral and then at follow-up, typically 6 months from the referral
(from 4 to 8 months on average), in most cases post treatment, or well into the
treatment.
The Strengths and Difficulties Questionnaire (SDQ) is a screening question-
naire about 3-16 year olds. It exists in several versions to meet the needs of
researchers, clinicians and educationalists (http://www.sdqinfo.org/). In Exer-
cises 1, 5 and 7, we worked with the self-rated version of the SDQ. Today we
will work with the parent-rated version, which allows recording outcomes for
children of any age. Just like the self-rated version, the parent-rated version
includes 25 items measuring 5 facets.
The participants in this study are parents of N=579 children and adolescents
(340 boys and 239 girls) aged from 2 to 16 (mean=10.4 years, SD=3.2). This
is a clinical sample, so all patients were referred to the services with various
presenting problems.
A new object SDQ should appear in your Environment panel. Examine this
object using functions head() and names().
22.3 Worked example - quantifying change on a latent construct after an intervention
Let’s use function describe() from package psych to get full descriptive statis-
tics for all variables:
library(psych)
describe(SDQ)
## vars n mean sd median trimmed mad min max range skew kurtosis
## age_at_r 1 579 10.42 3.22 10 10.50 4.45 2 16 14 -0.18 -1.00
## gender 2 579 1.41 0.49 1 1.39 0.00 1 2 1 0.35 -1.88
## p1_hyper 3 579 6.26 3.06 6 6.48 4.45 0 10 10 -0.38 -1.01
## p1_emo 4 579 5.15 2.85 5 5.15 2.97 0 10 10 0.00 -1.01
## p1_conduct 5 579 4.50 2.81 4 4.43 2.97 0 10 10 0.18 -0.96
## p1_peer 6 579 3.41 2.47 3 3.27 2.97 0 10 10 0.43 -0.61
## p1_prosoc 7 579 6.48 2.46 7 6.61 2.97 0 10 10 -0.40 -0.61
## p2_hyper 8 579 5.45 3.14 6 5.52 4.45 0 10 10 -0.09 -1.23
## p2_emo 9 579 3.98 2.83 4 3.83 2.97 0 10 10 0.38 -0.81
## p2_conduct 10 579 3.92 2.88 3 3.74 2.97 0 10 10 0.46 -0.77
## p2_peer 11 579 3.13 2.47 3 2.92 2.97 0 10 10 0.62 -0.34
## p2_prosoc 12 579 6.77 2.43 7 6.94 2.97 0 10 10 -0.45 -0.52
## se
## age_at_r 0.13
## gender 0.02
## p1_hyper 0.13
## p1_emo 0.12
## p1_conduct 0.12
## p1_peer 0.10
## p1_prosoc 0.10
## p2_hyper 0.13
## p2_emo 0.12
## p2_conduct 0.12
## p2_peer 0.10
## p2_prosoc 0.10
Given that the three SDQ subscales – Conduct problems, Hyperactivity and
Pro-social behaviour – are thought to measure Externalizing problems, we will
fit the same measurement model at each time point.
Figure 22.1: Basic model for change in Externalizing problems
library(lavaan)
Now let’s code the model in Figure 22.1 by translating the following sentences
into syntax:
• External1 is measured by p1_conduct and p1_hyper and p1_prosoc
• External2 is measured by p2_conduct and p2_hyper and p2_prosoc
• External2 is regressed on External1
Using lavaan conventions, we specify this model (let's call it Model0) as follows:
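Translating those three sentences into lavaan syntax, one possible specification (using the object names Model0 and fit0, which the later modindices() call assumes) is:

```r
Model0 <- ' # measurement model
            External1 =~ p1_conduct + p1_hyper + p1_prosoc
            External2 =~ p2_conduct + p2_hyper + p2_prosoc
            # structural model
            External2 ~ External1 '

# fit the model and request the summary output
fit0 <- sem(Model0, data = SDQ)
summary(fit0, fit.measures = TRUE)
```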
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .p1_conduct 4.504 0.117 38.576 0.000
## .p1_hyper 6.264 0.127 49.263 0.000
## .p1_prosoc 6.484 0.102 63.394 0.000
## .p2_conduct 3.924 0.120 32.771 0.000
## .p2_hyper 5.454 0.130 41.845 0.000
## .p2_prosoc 6.769 0.101 67.004 0.000
## External1 0.000
## .External2 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .p1_conduct 2.216 0.204 10.885 0.000
## .p1_hyper 4.586 0.311 14.762 0.000
## .p1_prosoc 3.183 0.211 15.077 0.000
## .p2_conduct 2.251 0.203 11.078 0.000
## .p2_hyper 4.127 0.292 14.117 0.000
## .p2_prosoc 2.867 0.193 14.820 0.000
## External1 5.678 0.472 12.029 0.000
## .External2 0.337 0.189 1.780 0.075
Examine the output. Note that the factor loadings for p1_conduct and
p2_conduct are fixed to 1, and the other loadings are freely estimated. As
expected, the loadings for p1_prosoc and p2_prosoc are negative, because
being pro-social indicates the lack of externalising problems. Also note that
the variance for External1 and residual variance for External2 (.External2)
are freely estimated, as are the unique (error) variances of all the indicators
(.p1_conduct, .p1_hyper, etc.).
There is also output called ‘Intercepts’. For every DV, its intercept is printed
(beginning with ‘.’, for example .p1_conduct), and for every IV, its mean is
printed.
Note that the mean of External1 and the intercept of External2 are fixed to
0. This is the default way of giving the origin of measurement to the common
factors. Lavaan did this automatically, just as it automatically gave the unit
of measurement to the factors by adopting the unit of their first indicators.
With the scale of common factors set, the intercepts of all indicators (observed
variables) are freely estimated – and thus they have Standard Errors (Std.Err).
These intercepts correspond to the expected scale scores on Conduct, Hyperac-
tivity and Pro-social for those with the average (=0) scores on External at the
respective time point.
Now examine the chi-square statistic, and other measures of fit.
QUESTION 2. Report and interpret the chi-square test, SRMR, CFI and
RMSEA. Would you accept or reject Model0?
EXERCISE 22. TESTING FOR LONGITUDINAL MEASUREMENT INVARIANCE IN REPEATED MEASURES
To help you understand the reasons for misfit, request the modification indices
(sort them in descending order):
modindices(fit0, sort.=TRUE)
QUESTION 3. What do the modification indices tell you? Which two changes
in the model would produce the greatest decrease in the chi-square? How do
you interpret these model changes?
Let us now modify the model by allowing the unique factors (errors) of p1_hyper
and p2_hyper, and of p1_prosoc and p2_prosoc, to correlate. Modify
Model0 as follows, creating Model1:
Model1 <- ' # measurement model
External1 =~ p1_conduct + p1_hyper + p1_prosoc
External2 =~ p2_conduct + p2_hyper + p2_prosoc
# structural model
External2 ~ External1
# correlated errors for repeated measures
p1_hyper ~~ p2_hyper
p1_prosoc ~~ p2_prosoc '
Now fit the modified model Model1 and ask for the summary output:
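For example (the object name fit1 is my choice; any name will do):

```r
fit1 <- sem(Model1, data = SDQ)
summary(fit1, fit.measures = TRUE)
```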
QUESTION 4. Report and interpret the chi-square for the model with cor-
related errors for repeated measures (Model1). Would you accept or reject
Model1? What are the degrees of freedom for Model1 and how do they com-
pare to Model0?
Figure 22.2: Full measurement invariance model for change in Externalizing problems
You should already know how to specify labels for path coefficients. Factor load-
ings are path coefficients. Simply add multipliers in front of indicators in mea-
surement =~ statements, like so: lh*p1_hyper. To specify labels for variances,
add multipliers in variance ~~ statements, like so: p1_hyper ~~ eh*p1_hyper.
To specify labels for intercepts, use multipliers in statements ~ 1. For example,
to label the intercept of p1_hyper as ih, write p1_hyper ~ ih*1. OK, here
is the full measurement invariance model (Model2) corresponding to Figure
22.2. [Please type it in yourself, using bits of your previous models, but do not
just copy and paste my text! You need to make your own mistakes and correct
them.]
Model2 <- ' # measurement model (labels lh and lp hold loadings equal across time)
External1 =~ p1_conduct + lh*p1_hyper + lp*p1_prosoc
External2 =~ p2_conduct + lh*p2_hyper + lp*p2_prosoc
# correlated errors for repeated measures
p1_hyper ~~ p2_hyper
p1_prosoc ~~ p2_prosoc
# error variances
p1_conduct ~~ ec*p1_conduct
p1_hyper ~~ eh*p1_hyper
p1_prosoc ~~ ep*p1_prosoc
p2_conduct ~~ ec*p2_conduct
p2_hyper ~~ eh*p2_hyper
p2_prosoc ~~ ep*p2_prosoc
# intercepts
p1_conduct ~ ic*1
p1_hyper ~ ih*1
p1_prosoc ~ ip*1
p2_conduct ~ ic*1
p2_hyper ~ ih*1
p2_prosoc ~ ip*1
# Structural model
External2 ~ External1 '
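To obtain the output discussed next, fit the model and assign the results to fit2 (the object name assumed by the modindices() call that follows); for example:

```r
fit2 <- sem(Model2, data = SDQ)
summary(fit2, fit.measures = TRUE)
```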
Note that all parameter labels that you introduced are printed in the output
next to the respective parameter. Also note that every pair of parameters that
you constrained equal are indeed equal!
QUESTION 5. Report and interpret the chi-square test for the full measure-
ment invariance model (Model2). Would you accept or reject this model?
Now we need to understand what the chi-square result means with respect to
the combined hypotheses that Model2 tested. Rejection of the model could
mean that the measurement invariance is violated (H1 is wrong) or that there
is a significant change in the Externalising score from Time 1 to Time 2 (H2
is wrong). To help us understand which hypothesis is wrong, let us obtain the
modification indices.
modindices(fit2, sort.=TRUE)
The largest modification index should appear first in the output. Compare its
size to the chi-square of the model, because the MI shows by how much the
chi-square would reduce if the respective parameter were freely estimated.
QUESTION 6. What is the largest modification index? How does it compare
to the chi-square and other modification indices? Try to interpret what this
index suggests. Do you think it points to H1 being wrong, or H2 being wrong?
Now, hopefully you agree that the reason for misfit of Model2 was fixing the
mean of External1 and the intercept of External2 to the same value – zero.
This basically allows no change in the Externalising score due to the interven-
tion, setting the regression intercept to 0, like in this equation:
External2 = 0 + B1*External1
NA*1 means that we are "freeing" the intercept of External2 (we have no particular
value or label to assign to it). Now, Model3 would test the hypothesis H1 (full
measurement invariance across Time).
Create and fit Model3 (assign its results to fit3), examine the output, and
answer the following questions.
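If you get stuck, one way to build Model3 is to append the freed intercept to the Model2 syntax string (paste() simply concatenates the two strings; the embedded line break keeps the lavaan syntax valid):

```r
# free the intercept of External2 (NA removes the default fixing to 0)
Model3 <- paste(Model2, '
External2 ~ NA*1 ')

fit3 <- sem(Model3, data = SDQ)
```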
summary(fit3)
## .p1_prosoc ~~
## .p2_prosoc 1.823 0.170 10.728 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .p1_condct (ic) 4.550 0.113 40.169 0.000
## .p1_hyper (ih) 6.150 0.127 48.553 0.000
## .p1_prosoc (ip) 6.416 0.095 67.667 0.000
## .p2_condct (ic) 4.550 0.113 40.169 0.000
## .p2_hyper (ih) 6.150 0.127 48.553 0.000
## .p2_prosoc (ip) 6.416 0.095 67.667 0.000
## .External2 -0.672 0.073 -9.181 0.000
## External1 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .p1_condct (ec) 1.333 0.120 11.096 0.000
## .p1_hyper (eh) 4.962 0.284 17.479 0.000
## .p1_prosoc (ep) 3.260 0.172 18.978 0.000
## .p2_condct (ec) 1.333 0.120 11.096 0.000
## .p2_hyper (eh) 4.962 0.284 17.479 0.000
## .p2_prosoc (ep) 3.260 0.172 18.978 0.000
## External1 6.411 0.444 14.431 0.000
## .External2 1.656 0.194 8.547 0.000
22.4 Solutions
Q1. The means decrease for Conduct and Hyperactivity, and increase for Pro-
social. This is to be expected since the SDQ Hyperactivity and Conduct scales
measure problems (the higher the score, the larger the extent of problems),
and SDQ Pro-social measures positive pro-social behaviour. Reduction in the
problem scores is to be expected after treatment.
Q2. Chi-square = 466.618 (df = 8); P-value < .001. We have to reject the
model because the chi-square is highly significant. CFI=0.793, which is much
lower than 0.95 (the threshold for good fit), and lower than 0.90 (the threshold
for adequate fit). RMSEA =0.315, which is much greater than 0.05 (for good
fit) and 0.08 (adequate fit). Finally, SRMR = 0.075, which is just below the
threshold for adequate fit (0.08). All indices except SRMR indicate very poor
model fit.
Q3. The two largest modification indices (MI) by far can be found in
the covariance ~~ statements: p1_hyper~~p2_hyper (mi=248.422) and
p1_prosoc~~p2_prosoc (mi=156.286).
The first MI tells you that if you repeat the analysis allowing p1_hyper and
p2_hyper to correlate (actually, because these are DVs, the correlation will be
between their errors/unique factors), the chi-square will fall by about 248. But
is it reasonable to allow the errors/unique factors for the same measures at Time
1 and Time 2 to correlate? Consider how the variance of the Hyperactivity facet is
explained by both the common Externalising factor and the remaining unique
content of the facet (the unique factor). Because the same Hyperactivity scale
was administered on two different occasions, its unique content not explained by
the common Externalising factor is still shared between the occasions.
Therefore, the unique factors at Time 1 and Time 2 cannot be considered inde-
pendent. The correlated errors will correct for this lack of local independence.
Similarly, we should allow correlated errors across time for the Pro-social con-
struct (p1_prosoc and p2_prosoc). A correlated error for p1_conduct and
p2_conduct is not needed since the modification indices did not suggest it.
Q4. Model1: Chi-square = 1.490 (df=6), P-value = 0.960. The chi-square test
is not significant and we accept the model with correlated errors for repeated
measures. The degrees of freedom for Model1 are 6, and the degrees of freedom
for Model0 were 8. The difference, 2 df, corresponds to the 2 additional param-
eters we introduced – the two error covariances. By adding 2 parameters, we
reduced df by 2.
Q5. Model2: Chi-square = 103.584 (df = 14); P-value < 0.001. The test is
highly significant and we have to reject the model.
Q6. The largest modification index by far is 73.030. It is of the same magnitude
as the chi-square for this model (103.584), and much larger than other MIs,
which are all in single digits. This index pertains to both External1~1 and
External2~1 – that is, to the mean of External1 and the intercept of External2,
which Model2 fixed to the same value (zero).
anova(fit2, fit3)
##
## Chi-Squared Difference Test
##
## Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
## fit3 13 14793 14854 24.219
## fit2 14 14870 14927 103.584 79.365 0.36789 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
MULTIPLE-GROUP STRUCTURAL EQUATION MODELLING
Exercise 23
23.1 Objectives
The objective of this exercise is to fit a measurement model to two groups of
participants using the multiple-group features of package lavaan. In particu-
lar, you will estimate means and variances of latent constructs in two groups,
implementing measurement invariance constraints.
EXERCISE 23. TESTING FOR MEASUREMENT INVARIANCE ACROSS SEXES IN A MULTIPLE-GROUP SEM
library(lavaan)
data(HolzingerSwineford1939)
Because the sex variable will be the focus of our analysis, it would be nice to have labels attached
to these values, so each group is clearly labelled in all outputs. To do this, we
use R base function factor(), which encodes a vector of values into categories
(note that this function has nothing to do with factor analysis!). We would like
to encode levels c(1,2) as two nominal categories c("boy", "girl"):
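One way to do this recoding (assuming the data frame is called sample, as in the text):

```r
# encode 1/2 as the labelled categories "boy"/"girl"
sample$sex <- factor(sample$sex, levels = c(1, 2), labels = c("boy", "girl"))
```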
Request head(sample) again. You will see that now, variable sex is populated
with either “boy” or “girl”.
Next, let’s obtain and examine the means of the six tests (x1 to x6) for boys
and girls. An easy way to do this is to use function describeBy() from package
psych:
library(psych)
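The call might look like this (passing the grouping variable directly as the group argument):

```r
describeBy(sample, group = sample$sex)
```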
##
## Descriptive statistics by group
## sex: boy
## vars n mean sd median trimmed mad min max range skew
## id 1 72 279.18 46.38 281.00 279.90 66.72 201.00 351.00 150.00 -0.10
## sex* 2 72 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN
## ageyr 3 72 12.90 1.05 13.00 12.84 1.48 11.00 16.00 5.00 0.62
## agemo 4 72 5.19 3.37 5.00 5.12 4.45 0.00 11.00 11.00 0.12
## school* 5 72 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN
## grade 6 71 7.49 0.50 7.00 7.49 0.00 7.00 8.00 1.00 0.03
## x1 7 72 4.97 1.16 4.92 4.97 1.11 2.67 8.50 5.83 0.17
## x2 8 72 6.23 1.10 6.00 6.13 1.11 4.00 9.25 5.25 0.64
## x3 9 72 2.14 1.08 1.88 2.08 1.11 0.38 4.38 4.00 0.40
## x4 10 72 3.10 1.02 3.00 3.08 0.99 0.33 6.00 5.67 0.26
## x5 11 72 4.60 1.05 4.75 4.62 1.11 1.75 6.50 4.75 -0.17
## x6 12 72 2.36 1.08 2.29 2.27 1.06 0.57 5.57 5.00 0.85
## x7 13 72 3.85 1.02 3.72 3.82 1.06 1.87 6.48 4.61 0.25
## x8 14 72 5.60 1.22 5.55 5.51 1.11 3.60 10.00 6.40 0.87
## x9 15 72 5.30 0.93 5.25 5.30 1.01 3.28 7.39 4.11 0.05
## kurtosis se
## id -1.38 5.47
## sex* NaN 0.00
## x9 0.69 0.13
QUESTION 1. Examine the means of the six test variables for boys and
girls. Note the mean differences for x3 (spatial orientation), and all verbal tests
(x4-x6). Who scores higher on which tests?
It is thought that performance on the first three tests (x1, x2 and x3) depends
on a broader spatial ability, whereas performance on the other three tests (x4,
x5 and x6) depends on a broader verbal ability. We will fit a confirmatory factor
model with two factors – Spatial and Verbal, which are expected to correlate.
We will scale the factors by adopting the unit of their first indicators (x1 for
Spatial, and x4 for Verbal), which is the default in lavaan.
Figure 23.1: Measurement model for Holzinger and Swineford data
OK, let’s first describe the measurement (confirmatory factor analysis, or CFA)
model depicted in Figure 23.1. We will call it HS.model (HS stands for
Holzinger-Swineford). You should be able to specify this model yourself by
now, using the “measured by” (=~) syntax conventions:
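Following those conventions, the specification should look something like:

```r
HS.model <- ' Spatial =~ x1 + x2 + x3
              Verbal  =~ x4 + x5 + x6 '
```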
Now we will fit the model using function cfa(). There is only one change from
how we used this function before. This time, we want to perform a multiple-
group analysis, and fit the model not to the whole sample of 145 children, but
separately to two groups - 73 girls and 72 boys. In order to do this, we set sex
as the grouping variable for analysis (group = "sex").
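For example (fit.b is the object name used in the summary() call below):

```r
fit.b <- cfa(HS.model, data = sample, group = "sex")
```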
summary(fit.b, fit.measures=TRUE)
When we specify the grouping variable (group = "sex"), the data will be sepa-
rated into two groups according to the children's sex, and the HS.model will be fitted
to both groups without any parameter constraints. That is, all model parame-
ters in each group will be freely estimated. This is called configural invariance,
because the only thing in common between the groups is the configuration of
the model (what variables indicate what factors).
NOTE that the default in multiple-group analysis is to include means/intercepts,
so we do not need to specify this.
QUESTION 2. How many sample moments are there in the data? Why?
{Hint. When counting sample moments, do not forget that we split the data
into 2 groups, so observed means, variances and covariances are available for
both groups}.
QUESTION 3. How many parameters does the baseline (configural) model
estimate? What are they? {Hint. You can look up the total number of parame-
ters in the output, but try to work out how these are made up. The output will
help you, if you look in ‘Parameter Estimates’. Remember that values printed
there are parameters.}
QUESTION 4. Interpret the chi-square, and SRMR. Do you retain or reject
the baseline model?
The baseline model does not allow us to compare the Spatial and Verbal
factor scores between boys and girls, because each group has its own metric for
these factors. For instance, origins of the factors are set to 0 in each group by
default, so girls and boys form their own ‘norm’ groups. Girls can be compared
with girls; boys with boys, but cross-comparisons are not meaningful because
the scale is “reset”. To compare the factor scores across groups properly, we
need full measurement invariance. We want the factor loadings, intercepts and
residual variances to be equal for every corresponding test across groups. They
will be still estimable parameters, but instead of estimating two sets – for boys
and for girls, we will estimate only one set for both groups.
This is very simple to do in lavaan. You do not need to adjust the model. You
only need to tell the cfa() function which types of parameters you want to
constrain equal across groups using the argument group.equal. Because we want
full measurement invariance, we request equality of "loadings", "intercepts" and "residuals".
23.3. WORKED EXAMPLE - COMPARING STRUCTURE OF MENTAL ABILITIES BETWEEN SEXES
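The fitting call should look something like this (requesting equality of loadings, intercepts and residuals across groups; fit.mi is the object name used in the summary() call below):

```r
fit.mi <- cfa(HS.model, data = sample, group = "sex",
              group.equal = c("loadings", "intercepts", "residuals"))
```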
summary(fit.mi, fit.measures=TRUE)
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .x1 (.p7.) 0.799 0.128 6.223 0.000
## .x2 (.p8.) 0.883 0.123 7.209 0.000
## .x3 (.p9.) 0.487 0.110 4.439 0.000
## .x4 (.10.) 0.289 0.063 4.572 0.000
## .x5 (.11.) 0.443 0.073 6.094 0.000
## .x6 (.12.) 0.412 0.069 5.990 0.000
## Spatial 0.538 0.180 2.989 0.003
## Verbal 1.114 0.227 4.917 0.000
What you have just done is fitted a full measurement invariance model. The
model tests the following combined hypothesis:
H1. The measure is fully invariant across groups. Factor loadings, intercepts
and error variances for corresponding indicators are equal.
Now examine the output, focusing on ‘Parameter Estimates’. Note that all those
measurement parameters that were supposed to be equal are indeed equal!
Here is a brief explanation of why the parameters are set the way they are.
While we assume full measurement invariance (i.e. the tests function equally
across groups), we do not have any particular reasons to assume that boys and
girls should be equal to each other in terms of their latent factors – Spatial and
Verbal. In fact, they might be different with respect to group means, variances
and covariances. This is why the logical way of scaling the latent factors is
setting their metric in the first group (say, boys), and carrying that metric over to
the second group (girls) via parameter constraints. The parameter constraints
will ensure the scale of measurement does not change, and then the means,
variances and covariances of the latent factors for girls can be freely estimated.
Now, examine and interpret the means, variances and covariances of Spatial
and Verbal factors.
QUESTION 7. What are the means and variances of Spatial and Verbal
for Girls? How do you interpret these values?
There are small differences between the means of Spatial and Verbal for boys
and girls. It appears that girls are slightly worse than boys in spatial ability,
but better in verbal ability. However, the means for girls were not significantly
different from 0 at the 0.05 level, which could also lead us to conclude that they
were not significantly different from the boys' means (see the answer to Question 7 for
an explanation).
Let us test the hypothesis of equality of the means of Spatial and Verbal for
girls and boys formally. All you need to do is to add one more group equality
constraint, for the "means" of latent variables:
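For example (fit.mi.e is the object name used below; "means" requests equality of the latent variable means across groups):

```r
fit.mi.e <- cfa(HS.model, data = sample, group = "sex",
                group.equal = c("loadings", "intercepts", "residuals", "means"))
```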
summary(fit.mi.e, fit.measures=TRUE)
##
## Comparative Fit Index (CFI) 0.988
## Tucker-Lewis Index (TLI) 0.989
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -1168.718
## Loglikelihood unrestricted model (H1) -1150.825
##
## Akaike (AIC) 2381.435
## Bayesian (BIC) 2446.923
## Sample-size adjusted Bayesian (SABIC) 2377.308
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.040
## 90 Percent confidence interval - lower 0.000
## 90 Percent confidence interval - upper 0.099
## P-value H_0: RMSEA <= 0.050 0.555
## P-value H_0: RMSEA >= 0.080 0.158
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.078
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
##
## Group 1 [boy]:
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## Spatial =~
## x1 1.000
## x2 (.p2.) 0.810 0.172 4.707 0.000
## x3 (.p3.) 1.012 0.195 5.185 0.000
## Verbal =~
## x4 1.000
## x5 (.p5.) 0.977 0.086 11.401 0.000
## x6 (.p6.) 0.960 0.084 11.480 0.000
##
## Covariances:
anova(fit.mi, fit.mi.e)
After you finished work with this exercise, save your R script with a meaningful
name, for example “Holzinger-Swineford 1939 analysis”.
To keep all of the created objects, which might be useful when revisiting this
exercise, save your entire ‘work space’ when closing the project. Press File /
Close project, and select Save when prompted to save your ‘Workspace image’.
23.4 Solutions
Q1. Boys scored higher than girls on x3, but girls scored higher than boys on
all verbal tests (x4, x5 and x6).
Q2. As we model means/intercepts too, we need to include them in the counted
sample moments. ‘Sample moments’ refers to the number of means, variances
and covariances in the observed data. There are 6 observed variables, therefore
6 means, plus 6(6+1)/2=21 variances and covariances; 27 moments in total for
each group. So we have 27*2=54 sample moments in both groups.
Q3. Baseline model estimates 38 parameters in total. Boys and girls groups
estimate the same parameters (i.e. there are 19 parameters in each group).
These are:
• 4 factor loadings (loadings for x1 and x4 are fixed to 1);
• 1 covariance of Spatial and Verbal factors;
• 6 intercepts of observed variables x1-x6;
• 6 residual variances of observed variables x1-x6;
• 2 variances (for Spatial and Verbal factors).
Q4. Chi-square for the baseline model is not significant (chisq = 16.710; degrees
of freedom = 16; p = .405). The SRMR is 0.040 – nice and small.
A breakdown of the chi-square statistic by group is also provided, attributing
8.748 to boys, and 7.962 to girls (the chi-square statistic is additive, so these
values add to the total chi-square statistic reported). The almost equal chi-
square values for both groups (and the groups were of almost equal size) indicate
similar fit of the baseline model in both groups. We conclude that the two-factor
configural model is appropriate for boys and girls.
Q5. Chi-square for the measurement invariance model is again not significant
(chisq = 27.482, degrees of freedom = 30, P-value = 0.598). The SRMR is
0.058 – small again. The almost equal chi-square values for both groups (boy
chisq = 13.693, girl chisq = 13.789) indicate that the measurement invariance
model is equally appropriate for boys and girls.
Q6. The output says that ‘Number of free parameters’ is 40, and ‘Number of
equality constraints’ is 16. This means that 40-16=24 unique parameters are
estimated:
• 4 factor loadings (note that these have labels and parameter values identical
for boys and girls);
• 1 covariance of Spatial and Verbal for boys + 1 covariance for girls (note
that these have no labels and the values are different in the output);
• 6 intercepts of observed variables x1-x6 (note that these have labels and
parameter values identical for boys and girls);
• 2 means of Spatial and Verbal factors for girls (note that for boys, these
means are set to 0);
• 6 residual variances of observed variables x1-x6 (note that these have labels
and parameter values identical for boys and girls);
• 2 variances of Spatial and Verbal factors for boys + 2 variances for girls.
Q7. For Girls, the mean for Spatial is –0.180 (which appears lower than for
Boys for whom the mean was fixed to 0), and the mean for Verbal is 0.318
(higher than for Boys for whom the mean was fixed to 0). Both means for girls
are not significantly different from zero at the 0.05 level (look at their p-values).
Variances of the Spatial and Verbal factors are larger for girls (0.538; 1.114)
than boys (0.484; 0.769). Girls appear to show more variability in their latent
abilities than boys.
Q8. The model with additional constraints (fit.mi.e) has the Chi-square =
35.786; Degrees of freedom = 32. Testing the difference between this model
and the previous model (fit.mi), we obtain Diff(Chi-square) = 35.786–27.482=
8.304 and Diff(DF) = 32–30 = 2.
Chi-square of 8.304 on 2 degrees of freedom is significant at the 0.05 level, with
the p-value=0.016. Restricting the model with additional equality constraints
resulted in significantly different (worse) fit. The fit is worse because the
chi-square is greater (and constraining some free parameters cannot make the
fit better!). We conclude that the means for boys and girls on the Spatial and
Verbal tests are significantly different, and that our measurement invariance
model with free means is better than the model with means constrained equal.
You may wonder how the fit can be 'significantly worse' if it is still very good
according to the chi-square test (the SRMR, not being a 'significance' measure
but an 'effect size' measure, picked up the worsening fit). Here the small sample
works against us – there is not enough power to reject the ‘wrong’ model (it
has too many parameters), but just enough power to detect the elements of the
model that make a difference.
Exercise 24
Measuring effect of intervention by comparing change models for control and experimental groups
24.1 Objectives
The objective of this exercise is to fit a full structural model to repeated obser-
vations across two groups, where one group received an intervention between
the two measurement occasions and the other did not. This model will allow
testing for difference between the experimental and control group in terms of
the change between the two measurement occasions. You will learn how to
implement measurement invariance constraints across time and groups, which
are essential to make sure that the longitudinal change can be compared across
groups.
Means and covariances of these four variables are available for the experimental
and control groups separately. I organised the data into objects that are ready
to be used by R for analyses in this exercise. These objects are packaged into
file Olsson_data.RData. Download this file, and save it in a new folder. In
RStudio, create a new project in the folder you have just created. Start a new
script.
We start from loading the data into RStudio Environment.
load(file='Olsson.RData')
Olsson.cov
## $Control
## pre_syn pre_opp post_syn post_opp
24.3. WORKED EXAMPLE - QUANTIFYING CHANGE ON A LATENT CONSTRUCT AFTER AN INTERVENTION
Olsson.mean
## $Control
## [1] 18.381 20.229 20.400 21.343
##
## $Experimental
## [1] 20.556 21.241 25.667 25.870
Olsson.N
## $Control
## [1] 105
##
## $Experimental
## [1] 108
QUESTION 1. Examine the means for the control and experimental groups.
Remember that the first two means pertain to the pre-test and the last two
means to the post-test. Do the means increase or decrease? Are there any
visible differences between the groups? How would you (tentatively) interpret
the changes?
Given that the two subtests - synonyms and opposites - are supposed to indicate
verbal ability, we will fit the following basic longitudinal measurement model to
both groups.
In this model, pre_verbal is measured by pre_syn and pre_opp, and
post_verbal is measured by post_syn and post_opp. We set the factor
scales by fixing the loadings of the first indicators (pre_syn and post_syn) to 1.
Figure 24.1: Basic model for change in verbal test performance
The correlated unique factors (errors) for pre_syn and post_syn account for
the shared variance between the Synonyms subtest on the two measurement
occasions, after the verbal factor has been controlled for. The correlated unique
factors (errors) for pre_opp and post_opp account for the shared variance
between the Opposites subtest on the two measurement occasions, after the
Verbal factor has been controlled for. This is typical for repeated measures –
we have considered this feature previously in Exercise 22.
The model depicted in Figure 24.1 is the default measurement model for change
that we will program in lavaan. Once we have specified this default model, we
will consider how this will be implemented across groups.
Using lavaan conventions, we can specify this repeated measures model with
parameter constraints (let’s call it Model.1) as follows:
Model.1 <- ' # measurement model (first loadings fixed to 1; label f_opp holds the Opposites loadings equal across time)
pre_verbal =~ 1*pre_syn + f_opp*pre_opp
post_verbal =~ 1*post_syn + f_opp*post_opp
# structural model
post_verbal ~ pre_verbal
# correlated errors for repeated measures
pre_syn ~~ post_syn
pre_opp ~~ post_opp
# intercepts with equality constraints across time (labels i_syn and i_opp)
pre_syn ~ i_syn*1
post_syn ~ i_syn*1
pre_opp ~ i_opp*1
post_opp ~ i_opp*1
# unique factors/errors with equality constraints across time (labels e_syn and e_opp)
pre_syn ~~ e_syn*pre_syn
post_syn ~~ e_syn*post_syn
pre_opp ~~ e_opp*pre_opp
post_opp ~~ e_opp*post_opp '
Now we can deal with ensuring measurement invariance across groups. It turns
out that by using the parameter equality labels in a multiple group setting,
we already imposed equality constraints across the groups. This is because
if a single label is attached to a parameter (say, "f_opp" is the label for the
factor loadings of pre_opp and post_opp), then this label also holds this
parameter equal across groups. If we wanted different measurement parameters
across groups (but wanted to maintain longitudinal invariance) we would need
to assign group-specific labels. Thankfully, we do not need to.
OK, let’s fit *thisModel.1 to both groups using sem() function of lavaan. Be-
cause our data is not a data frame with a grouping variable but instead lists
of statistics for both groups, we need to specify the sample covariance matrices
(sample.cov = Olsson.cov), sample means (sample.mean = Olsson.mean),
and sample sizes (sample.nobs = Olsson.N). With labels set up previously to
ensure invariance of repeated-measures parameters within each group, the op-
tion group.equal = c("loadings", "intercepts", "residuals") will en-
force these parameters to also be invariant across groups.
library(lavaan)
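Putting these arguments together, the call should look something like:

```r
fit.1 <- sem(Model.1,
             sample.cov  = Olsson.cov,
             sample.mean = Olsson.mean,
             sample.nobs = Olsson.N,
             group.equal = c("loadings", "intercepts", "residuals"))
```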
summary(fit.1, fit.measures=TRUE)
Run the syntax and examine the output carefully. You should be able to see
that the measurement parameters we set to be equal (factor loadings, intercepts
and error variances) are indeed equal across time and also across groups.
But when examining the structural parameters, we notice that “Intercepts”
for Control and Experimental groups are different. Specifically, the mean of
pre_verbal is fixed at 0.000 and the intercept of post_verbal (.post_verbal) is fixed at 0.000.
We know they are fixed because no Standard Errors are reported for them. This
is interesting. It is common to set the origin of measurement for pre_verbal at
Time 1 to 0 so that the intercepts of pre_syn and pre_opp could be estimated,
but surely the intercept of post_verbal can be freely estimated given that the
parameter constraints ‘pass on’ the scale of measurement from Time 1 to Time
2. However, lavaan sets the intercept of all latent factors to 0 by default. If we
retain this fixed parameter, we test the following hypothesis:
Hypothesis 1 (H1). The average performance did not change across the two testing occasions in the Control group.
Initially, this appears reasonable (because the Control group did not receive any
training) and we leave this intercept fixed to 0 for now.
Now, in the Experimental group, the mean of pre_verbal and the intercept of
post_verbal are freely estimated (we know that because Standard Errors are
reported for them), and are different from zero and from each other. Indeed,
the origin and the unit of measurement can be passed on to the model for
Experimental group through the group parameter constraints. Hence, we can
freely estimate the mean and variance of pre_verbal, and the intercept and
residual variance of post_verbal in the Experimental group.
Now let’s examine the chi-square statistic. Chi-square for the tested model is
43.384 (df = 11); P-value < .001. We have to reject the model because the
chi-square is highly significant despite the modest sample size. Because this is
multiple-group analysis, the chi-square statistic is the sum of two per-group statistics: 34.035 for the Control group and 9.349 for the Experimental group. Clearly, the misfit
comes from the Control group.
QUESTION 2. Report and interpret the SRMR, CFI and RMSEA. What can you say about the goodness of fit of Model.1?
To help you understand the reasons for misfit, request the modification indices,
and sort them from largest to smallest:
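In lavaan, this is done with the modindices() function applied to the fitted object (a sketch; note that the sorting argument is spelled with a trailing dot):

```r
# Request modification indices for Model.1, sorted from largest to smallest
modindices(fit.1, sort. = TRUE)
```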
## (modification indices output omitted)
QUESTION 3. Examine the two largest modification indices. What changes to Model.1 do they suggest, and in which group?
After having answered Question 3, hopefully you agree that the lavaan defaults imposed on our structural parameters (specifically, setting the intercept of post_verbal in the Control group to 0) caused model misfit.
We will now adjust the structural model setup by freeing the intercept of post_verbal, thus allowing for change in verbal performance score between pre-test and post-test. Because the Control group did not receive any intervention, this is not a "training" effect but rather a "practice" effect. Modifying the model is easy: simply append one more line to the syntax of Model.1, explicitly freeing the intercept of post_verbal by giving it the label NA, making Model.2:
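A minimal sketch of this modification, assuming the Model.1 syntax is stored in a character string model.1 and the data and grouping variable are named as before (these names are assumptions, not verbatim from the book):

```r
# Model.2: same as Model.1, plus a freed intercept for post_verbal.
# The NA label overrides lavaan's default of fixing latent intercepts to 0.
model.2 <- paste(model.1, "post_verbal ~ NA*1", sep = "\n")
fit.2 <- sem(model.2, data = olsson, group = "group",
             group.equal = c("loadings", "intercepts", "residuals"))
```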
We now fit the modified model, and request summary output with fit indices
and standardized parameters.
summary(fit.2, fit.measures=TRUE, standardized=TRUE)
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .pre_syn (i_sy) 18.556 0.574 32.338 0.000 18.556 3.082
## .pst_syn (i_sy) 18.556 0.574 32.338 0.000 18.556 3.262
## .pre_opp (i_pp) 19.883 0.558 35.625 0.000 19.883 3.230
## .post_pp (i_pp) 19.883 0.558 35.625 0.000 19.883 3.357
## .pst_vrb 1.684 0.295 5.716 0.000 0.319 0.319
## pr_vrbl 0.000 0.000 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .pre_syn (e_sy) 4.430 3.281 1.350 0.177 4.430 0.122
## .pst_syn (e_sy) 4.430 3.281 1.350 0.177 4.430 0.137
## .pre_opp (e_pp) 14.960 2.633 5.681 0.000 14.960 0.395
## .post_pp (e_pp) 14.960 2.633 5.681 0.000 14.960 0.426
## pr_vrbl 31.832 5.737 5.549 0.000 1.000 1.000
## .pst_vrb 0.517 1.479 0.350 0.727 0.019 0.019
##
##
## Group 2 [Experimental]:
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## pre_verbal =~
## pre_syn 1.000 6.800 0.955
## pre_opp (f_pp) 0.849 0.065 13.143 0.000 5.773 0.831
## post_verbal =~
## pst_syn 1.000 6.810 0.955
## post_pp (f_pp) 0.849 0.065 13.143 0.000 5.781 0.831
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## post_verbal ~
## pre_verbal 0.892 0.060 14.981 0.000 0.891 0.891
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .pre_syn ~~
## .post_syn -0.431 3.165 -0.136 0.892 -0.431 -0.097
## .pre_opp ~~
## .post_opp 7.494 2.595 2.887 0.004 7.494 0.501
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .pre_syn (i_sy) 18.556 0.574 32.338 0.000 18.556 2.607
## .pst_syn (i_sy) 18.556 0.574 32.338 0.000 18.556 2.603
QUESTION 4. Report and interpret the chi-square for Model.2. Would you
accept or reject this model? What are the degrees of freedom for Model.2 and
how do they compare to Model.1?
Now that we have a well-fitting model, let’s focus on the ‘Intercepts’ output to
interpret the findings. We can see that the latent mean changed for the Control group as well as for the Experimental group, but the change for the Experimental group is greater. Presumably this increase over and above the practice effect was due to the training programme.
QUESTION 5. What is the (unstandardized) intercept of post_verbal in the Control group (practice effect)? Is the effect in the expected direction? How would you interpret its size in relation to the unit of measurement of the verbal factor?
QUESTION 6. What is the (unstandardized) intercept of post_verbal in the Experimental group (training effect)? Is the effect in the expected direction? How would you interpret its size in relation to the unit of measurement of the verbal factor?
QUESTION 7. What is the standardized intercept of post_verbal in the
Control group (standardized practice effect)? Is this a large or a small effect?
QUESTION 8. What is the standardized intercept of post_verbal in the
Experimental group (standardized training effect)? Is this a large or a small
effect?
Overall, we conclude that there was a positive effect of retaking the tests on performance in the Control group, with an effect size of approximately 0.32 standard deviations of the Verbal factor score. There was a larger positive effect of the training intervention, plus the potential effect of retaking the tests, in the Experimental group, with an effect size of approximately 0.79 standard deviations of the Verbal factor score.
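These standardized effects can also be reproduced by hand: divide the unstandardized intercept of post_verbal by the model-implied standard deviation of post_verbal in that group. A sketch, assuming the fitted object fit.2 from above (the exact structure of the lavInspect() return value may differ across lavaan versions):

```r
# Model-implied latent covariance matrices, one list element per group
lv.cov <- lavInspect(fit.2, "cov.lv")

# Standardized practice effect in the Control group (group 1):
# unstandardized intercept / model-implied SD of post_verbal
1.684 / sqrt(lv.cov[[1]]["post_verbal", "post_verbal"])
```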
After you have finished working with this exercise, save your R script by pressing the Save icon in the script window and giving the script a meaningful name, for example "Olsson study of test performance change". To keep all of the created objects, which might be useful when revisiting this exercise, save your entire 'workspace' when closing the project. Press File / Close project, and select Save when prompted to save your 'Workspace image' (with extension .RData).
24.4 Solutions
Q1. The means for post-tests increase for both groups. However, the increases seem to be larger for the Experimental group (from 20.556 to 25.667 for synonyms, and from 21.241 to 25.870 for opposites, so by about 5 points each, while the increase is only 1 or 2 points for the Control group). This is likely because the Experimental group received performance training.
Q2. CFI = 0.952, which is greater than 0.95 and indicates good fit. RMSEA = 0.166, which is greater than 0.08 (the threshold for adequate fit) and indicates poor fit. Finally, SRMR = 0.062, which is below the threshold for adequate fit (0.08). The model fits well according to the CFI and SRMR, but not according to the RMSEA.
Q3. The two largest modification indices (MI) can be found among the intercept ('~1') statements. They both belong to Group 1 (Control), are equal in size, and tell the same story: if you free EITHER the mean of pre_verbal or the intercept of post_verbal (remember that they were both set to 0 in Model.1?), the chi-square will fall by 24.972. It seems that Hypothesis 1, that verbal test performance stays equal across time in the Control group, is not supported. Of course, it is reasonable to keep the mean of pre_verbal fixed at 0 and release the intercept of post_verbal, allowing change across time.
Q4. Chi-square = 14.272, degrees of freedom = 10, P-value = 0.161. The chi-square test is not significant, and we accept the model with the free intercept of post_verbal in the Control group. The degrees of freedom for Model.2 are 10, while the degrees of freedom for Model.1 were 11. The difference, 1 df, corresponds to the additional intercept parameter we introduced. By adding 1 parameter, we reduced df by 1.
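Because Model.2 is simply Model.1 with one constraint released, the two models are nested and can be compared directly with a likelihood-ratio (chi-square difference) test. A sketch, assuming both fitted objects are still in the workspace:

```r
# Chi-square difference test for the nested models:
# 43.384 - 14.272 = 29.112 on 11 - 10 = 1 df
anova(fit.1, fit.2)
```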
Q5. The (unstandardized) intercept of post_verbal in the Control group is 1.684. The effect is positive, as expected. This value is on the same scale as the synonyms subtest, since the verbal factors borrowed their unit of measurement from this subtest by fixing its factor loading to 1.
Q6. The (unstandardized) intercept of post_verbal in the Experimental group is 5.428. The effect is positive, as expected. This is again on the same scale as the synonyms subtest.
Q7. The standardized intercept of post_verbal in the Control group is 0.319.
This is a small effect.
Q8. The standardized intercept of post_verbal in the Experimental group is
0.797. This is a large effect.
References
Arbuckle, J. L. (2016). IBM SPSS Amos 24: User's Guide. Amos Development Corporation.
Barrett, P. T., Petrides, K. V., Eysenck, S. B. G., & Eysenck, H. J. (1998). The
Eysenck Personality Questionnaire: An examination of the factorial similarity
of P, E, N, and L across 34 countries. Personality and Individual Differences,
25(5), 805–819. https://doi.org/10.1016/S0191-8869(98)00026-9
Brown, A., Ford, T., Deighton, J., & Wolpert, M. (2014). Satisfaction in child
and adolescent mental health services: Translating users’ feedback into measure-
ment. Administration and Policy in Mental Health and Mental Health Services
Research, 41(4), 434-446. https://doi.org/10.1007/s10488-012-0433-9.
Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality Inventory
(NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual.
Odessa, FL: Psychological Assessment Resources.
Eysenck, S. B. G., & Eysenck, H. J. (1976). Personality and Mental Illness.
Psychological Reports, 39, 1011–1022.
Feist, G. J., Bodner, T. E., Jacobs, J. F., Miles, M., & Tan, V. (1995). Integrating top-down and bottom-up structural models of subjective well-being: A longitudinal investigation. Journal of Personality and Social Psychology, 68(1), 138-150. https://doi.org/10.1037/0022-3514.68.1.138
Goldberg, D. (1978). Manual of the General Health Questionnaire. Windsor: NFER-Nelson.
Holzinger, K., and Swineford, F. (1939). A study in factor analysis: The stability
of a bifactor solution. Supplementary Educational Monograph, no. 48. Chicago:
University of Chicago Press.
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power
rates using an effect size measure with the logistic regression procedure for
DIF detection. Applied measurement in education, 14(4), 329-349. https:
//doi.org/10.1207/S15324818AME1404_2
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183-202.
Acknowledgments
The style I adopted in this book has been inspired by my co-teachers and dear friends at Cambridge, Tim Croudace, Jon Heron, Jan Boehnke and Jan Stochl,
and perfected over the years thanks to feedback from my students at Cambridge
and Kent, and Graduate Teaching Assistants at Kent (most notably Hirotaka
Imada, Bjarki Gronfeldt Gunnarsson, Daqing Liu, Chole Bates and Rebecca
McNeill). I am very grateful to all of them.
Data used in this book are either open access from articles, R packages or other
software resources (sources are always acknowledged), or were shared with me by
colleagues and collaborators, or are from research projects I have been involved
in during my time at SHL, Anna Freud Centre, University of Cambridge and
University of Kent. I tried my best to acknowledge all sources appropriately;
however, please let me know if you spot any omissions.
Unless acknowledged within chapters, all contents are my own. I did not have any assistance in writing this book; if anything, many people have tried to stop me from finishing it by loading me with other work and commitments. I thank them for helping to build my character.
About the author