

Psychometrics in Exercises using R and RStudio


Textbook and data resource

Anna Brown

2024-01-26
Contents

Preface ... 12
  Why I wrote this book ... 13
  What is this book for? ... 13
  How to use this book ... 14
  How this book is organised ... 15
  How to work with exercises in this book ... 16

Getting Started with R and RStudio ... 19
  1. Creating work folder and project ... 19
  2. Creating a Script ... 20
  3. Installing and loading packages ... 20
  4. Saving your work ... 20

I INTRODUCTION TO PSYCHOMETRIC SCALING METHODS ... 21

1 Likert scaling of ordinal questionnaire data, creating a sum score, and norm referencing ... 23
  1.1 Objectives ... 23
  1.2 Study of a community sample using the Strengths and Difficulties Questionnaire (SDQ) ... 24
  1.3 Worked Example 1 - Likert scaling and norm referencing for Emotional Symptoms scale ... 24
  1.4 Worked Example 2 - Reverse coding counter-indicative items and computing test score for SDQ Conduct Problems ... 30
  1.5 Further practice - Likert scaling of remaining SDQ subscales ... 32
  1.6 Solutions ... 33

2 Optimal scaling of ordinal questionnaire data ... 37
  2.1 Objectives ... 37
  2.2 Worked Example – Optimal scaling of SDQ Emotional Symptoms items ... 38
  2.3 Further practice – Optimal scaling of the remaining SDQ subscales ... 42

3 Optimal scaling of nominal questionnaire data ... 43
  3.1 Objectives ... 43
  3.2 Worked Example – Optimal scaling of Nishisato attitude items ... 43
  3.3 Solutions ... 47

4 Thurstonian scaling ... 49
  4.1 Objectives ... 49
  4.2 Study of preferences for job features ... 49
  4.3 Worked Example - Thurstonian scaling of ideal job features ... 50
  4.4 Solutions ... 55

II CLASSICAL TEST THEORY AND RELIABILITY THEORY ... 57

5 Reliability analysis of polytomous questionnaire data ... 59
  5.1 Objectives ... 59
  5.2 Study of a community sample using the Strengths and Difficulties Questionnaire (SDQ) ... 59
  5.3 Worked Example - Estimating reliability for SDQ Conduct Problems ... 60
  5.4 Further practice - Reliabilities of the remaining SDQ facets ... 65
  5.5 Solutions ... 66

6 Item analysis and reliability analysis of dichotomous questionnaire data ... 69
  6.1 Objectives ... 69
  6.2 Personality study using Eysenck Personality Questionnaire (EPQ) ... 69
  6.3 Worked Example - Reliability analysis of EPQ Neuroticism/Anxiety (N) scale ... 70
  6.4 Solutions ... 80

III TEST HOMOGENEITY AND SINGLE-FACTOR MODEL ... 83

7 Fitting a single-factor model to polytomous questionnaire data ... 85
  7.1 Objectives ... 85
  7.2 Worked Example - Testing homogeneity of SDQ Conduct Problems scale ... 85
  7.3 Further practice - Factor analysis of the remaining SDQ subscales ... 95
  7.4 Solutions ... 95

8 Fitting a single-factor model to dichotomous questionnaire data ... 97
  8.1 Objectives ... 97
  8.2 Worked Example - Testing homogeneity of EPQ Neuroticism scale ... 97
  8.3 Solutions ... 109

IV EXPLORATORY FACTOR ANALYSIS (EFA) ... 111

9 EFA of polytomous item scores ... 113
  9.1 Objectives ... 113
  9.2 Study of service satisfaction using Experience of Service Questionnaire (CHI-ESQ) ... 113
  9.3 Worked Example - EFA of responses to CHI-ESQ parental version ... 114
  9.4 Solutions ... 125

10 EFA of ability subtest scores ... 127
  10.1 Objectives ... 127
  10.2 Study of “primary mental abilities” of Louis Thurstone ... 127
  10.3 Worked Example - EFA of Thurstone’s primary ability data ... 128
  10.4 Solutions ... 134

11 EFA of personality scale scores ... 137
  11.1 Objectives ... 137
  11.2 Study of factor structure of personality using NEO PI-R ... 137
  11.3 Worked Example - EFA of NEO PI-R correlation matrix ... 138
  11.4 Solutions ... 147

V ITEM RESPONSE THEORY (IRT) ... 151

12 Fitting 1PL and 2PL models to dichotomous questionnaire data ... 153
  12.1 Objectives ... 153
  12.2 Worked Example - Fitting 1PL and 2PL models to EPQ Neuroticism items ... 153
  12.3 Solutions ... 163

13 Fitting a graded response model to polytomous questionnaire data ... 165
  13.1 Objectives ... 165
  13.2 Study of emotional distress using General Health Questionnaire (GHQ-28) ... 165
  13.3 A Worked Example - Fitting a Graded Response model to Somatic Symptoms items ... 167
  13.4 Solutions ... 177

VI DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS ... 179

14 DIF analysis of dichotomous questionnaire items using logistic regression ... 181
  14.1 Objectives ... 181
  14.2 Worked Example - Screening EPQ Neuroticism items for gender DIF ... 181
  14.3 Solutions ... 188

15 DIF analysis of polytomous questionnaire items using ordinal logistic regression ... 191
  15.1 Objectives ... 191
  15.2 Worked Example - Screening GHQ Somatic Symptoms items for gender DIF ... 191
  15.3 Solutions ... 199

VII CONFIRMATORY FACTOR ANALYSIS (CFA) ... 201

16 CFA of polytomous item scores ... 203
  16.1 Objectives ... 203
  16.2 Study of service satisfaction using Experience of Service Questionnaire (CHI-ESQ) ... 203
  16.3 Worked Example - CFA of responses to CHI-ESQ parental version ... 204
  16.4 Solutions ... 213

17 CFA of a correlation matrix of subtest scores from an ability test battery ... 217
  17.1 Objectives ... 217
  17.2 Study of “primary mental abilities” of Louis Thurstone ... 217
  17.3 Worked Example - CFA of Thurstone’s primary ability data ... 218
  17.4 Solutions ... 225

VIII PATH ANALYSIS ... 231

18 Fitting a path model to observed test scores ... 233
  18.1 Objectives ... 233
  18.2 Study of subjective well-being of Feist et al. (1995) ... 233
  18.3 Worked Example - Testing a bottom-up model of Well-being ... 234
  18.4 Solutions ... 242

19 Fitting an autoregressive model to longitudinal test measurements ... 245
  19.1 Objectives ... 245
  19.2 Study of longitudinal stability of ability test score across 4 time points ... 245
  19.3 Worked Example - Testing an autoregressive model for WISC Verbal subtest ... 246
  19.4 Further practice - Path analysis of WISC Non-verbal scores ... 253
  19.5 Solutions ... 254

IX STRUCTURAL EQUATION MODELLING ... 255

20 Fitting a latent autoregressive model to longitudinal test measurements ... 257
  20.1 Objectives ... 257
  20.2 Study of longitudinal stability of the latent ability across 4 time points ... 258
  20.3 Worked Example - Testing a latent autoregressive model for WISC Verbal subtest ... 258
  20.4 Further practice - Testing a latent autoregressive model for WISC Non-verbal subtest ... 267
  20.5 Solutions ... 267

21 Growth curve modelling of longitudinal measurements ... 269
  21.1 Objectives ... 269
  21.2 Study of child growth in a longitudinal repeated-measures design spanning 8 years ... 269
  21.3 Worked Example - Fitting growth curve models to longitudinal measurements of child’s weight ... 270
  21.4 Further practice - Testing growth curve models for girls ... 288
  21.5 Solutions ... 288

22 Testing for longitudinal measurement invariance in repeated test measurements ... 291
  22.1 Objectives ... 291
  22.2 Study of clinical change in externalising problems after treatment ... 291
  22.3 Worked Example - Quantifying change on a latent construct after an intervention ... 292
  22.4 Solutions ... 308

X MULTIPLE-GROUP STRUCTURAL EQUATION MODELLING ... 311

23 Testing for measurement invariance across sexes in a multiple-group setting ... 313
  23.1 Objectives ... 313
  23.2 Study of structure of mental abilities ... 313
  23.3 Worked Example - Comparing structure of mental abilities between sexes ... 314
  23.4 Solutions ... 327

24 Measuring effect of intervention by comparing change models for control and experimental groups ... 329
  24.1 Objectives ... 329
  24.2 Study of training intervention to improve test performance (Olsson, 1973) ... 329
  24.3 Worked Example - Quantifying change on a latent construct after an intervention ... 330
  24.4 Solutions ... 343

References ... 345

Acknowledgments ... 347

About the author ... 349

Preface

This textbook provides a comprehensive set of exercises for practicing all major
Psychometric techniques using R and RStudio. The exercises are based on real
data from research studies and operational assessments, and provide step-by-
step guides that an instructor can use to teach students, or readers can use to
learn independently. Each exercise includes a worked example illustrating data
analysis steps, teaching how to interpret results and make analysis decisions,
and self-test questions that readers can attempt to check their own understanding.
You can read this book online here for free. Copies in printable format may be
ordered from the author.
Data and supporting materials for all exercises are available for download from
http://annabrown.name/psychometricsR
How to cite this book:
Brown, Anna. (2023). Psychometrics in Exercises using R and RStudio. Textbook and data resource. Available from https://bookdown.org/annabrown/psychometricsR.

Why I wrote this book

This book is an outcome of my experience of teaching psychometrics and statistics over the past 13 years; first at the University of Cambridge in 2010-2012 under the ESRC Researcher Development initiative, and then at the University of Kent in 2012-2023 as part of my normal duties as the statistics lecturer.
This book was born from computing workshop exercises that I created for my students over the years to practice psychometric techniques that they learnt in lectures. When preparing exercises for them, it quickly became apparent that while there are many good textbooks about psychometric theory (my absolute favourite is “Test Theory: A Unified Treatment” by the late Roderick McDonald), there aren’t any comprehensive sources of practical exercises that students can use to internalise and practice these techniques. Various tutorials have good illustrations with data, but they do not provide step-by-step guides that an instructor can use to teach students to interpret results and make analysis decisions, and they do not provide self-test questions that students can answer to test their own understanding.
This book is intended to fill this gap.

What is this book for?

This book can be used for teaching by university lecturers and instructors.
They may use data examples and analyses provided in this book as illustrations in lectures (acknowledging the source), or simply adopt the book for the practical/computing part of their course. Some of these exercises will be useful as part
of the general statistics curriculum, and some will be more suitable for special
courses such as “Item Response Theory” or “Structural Equation Modelling” or
“Measurement Invariance”.
This book can be used for self-study by anybody who wants to acquire practical skills in conducting psychometric analyses. These may be students and
researchers in the fields of psychology, or any behavioural or social science, of
any age and level – undergraduate and postgraduate, PhD and postdoctoral
researchers, and seasoned researchers who want to acquire new skills in psycho-
metrics. These may also be practitioners in various fields of assessment who
need to be able to make sense of their data or create new assessments.
This book can also be used for self-study by people with some experience in Psychometrics who want to learn how to do these analyses in R, perhaps moving from another software program such as SPSS.

How to use this book

If you are an instructor, you can use this book as you see fit for your course and
your students, perhaps prioritising their needs in either software or statistical
curriculum depending on their level as described below.
If you are a student using the book for self-study, the way you will use this book will depend on where you begin in psychometrics (and in R/RStudio!). Beginners in R/RStudio should start from the beginning (including the “Getting Started with R and RStudio” section), gradually building skills in data manipulation using R. You will establish your own routine when working with exercises (consider each one a mini data analysis project), and will soon feel confident enough to build a portfolio of packages and functions. Intermediate or advanced psychometricians who are only beginning in R/RStudio will find the learning process easier, because you will be applying familiar methods in a new software environment. You may even compare outputs from R with other software you used before. You may be pleasantly surprised!
Beginners in Psychometrics should also start from the beginning, gradually building their skills in conducting analyses using specific techniques and methods, for instance computing a test score in the presence of missing data. The exercises are ordered so that later exercises often (but not always!) rely on previous ones. References to previous exercises that are necessary to understand the current exercise are always given, so you can refresh your knowledge if necessary. Intermediate or advanced users of R/RStudio who are beginners in Psychometrics can skip my advice on R tips and tricks, but should follow the order of exercises to build their psychometric skills.

How this book is organised


The book is organised into 24 exercises, each showcasing unique psychometric
techniques. Although exercises are self-contained, some will refer to related
techniques described in previous exercises, so it is recommended to approach the
exercises in order, particularly if you are new to Psychometrics. The exercises
are grouped into 10 parts:

• Part I. Introduction to psychometric scaling methods

This part contains 4 exercises, which teach techniques for scaling ordinal questionnaire data, nominal questionnaire data, and ranked preferences data.

• Part II. Classical Test Theory and Reliability Theory

This part contains 2 exercises, teaching how to conduct item analysis and reliability analysis of polytomous questionnaire data and dichotomous questionnaire data.

• Part III. Test homogeneity and Single-Factor Model

This part contains 2 exercises, teaching how to fit a single-factor model to polytomous item scores and to dichotomous item scores.

• Part IV. Exploratory Factor Analysis (EFA)

This part contains 3 exercises, teaching how to conduct EFA of polytomous item scores, and of correlation matrices of subtest scores or multidimensional test scores.

• Part V. Item Response Theory (IRT) analysis

This part contains 2 exercises, teaching how to fit 1PL and 2PL models to
dichotomous questionnaire data, and a graded response model to polytomous
questionnaire data.

• Part VI. Differential Item Functioning (DIF) analysis

This part contains 2 exercises, teaching how to test dichotomous questionnaire data for DIF using binary logistic regression, and polytomous questionnaire data using ordinal logistic regression.

• Part VII. Confirmatory Factor Analysis (CFA)

This part contains 2 exercises, teaching how to fit CFA models to polytomous
item scores, and to a correlation matrix of subtest scores.

• Part VIII. Path analysis

This part contains 2 exercises, teaching how to fit a path model to observed test
scores, and an autoregressive model to longitudinal test measurements.

• Part IX. Structural equation modelling

This part contains 3 exercises, teaching how to fit a latent autoregressive model
and a growth curve model to longitudinal test measurements, and how to test
for longitudinal measurement invariance in repeated test measurements.

• Part X. Multiple-group structural equation modelling

This part contains 2 exercises, teaching how to test for measurement invariance
across groups, and how to test for measurement invariance across time and
experimental groups using multiple-group analysis settings.

How to work with exercises in this book


Each exercise will begin with objectives and a summary of techniques that will
be taught. Then, it will name a data set to be analysed, and any R packages
needed to conduct these analyses, for example:
Data file Example.csv
R package psych
Data sets (and accompanying materials such as questionnaire items) are provided on the dedicated website, http://annabrown.name/psychometricsR; occasionally, data will be part of the R packages used for analyses. Packages will need to be installed on your computer from R repositories.
The exercises will comprise several steps, each describing a specific activity or
technique, including how to import/load and save the data, how to examine the
data, and how to run specified analyses.
The main body of each exercise presents a “Worked Example”, where I showcase a technique, taking a student through the analyses step by step. The student needs to reproduce my analyses and outputs by following the descriptions in the Worked Example. Moreover, I encourage students to learn how to apply the presented techniques to new variables, make sense of outputs and interpret results by presenting them with self-test questions. Answering these questions is an important part of learning. Students should attempt to answer each question independently, and then verify their answers against the answers provided at the end of each Exercise.
Getting Started with R and RStudio

You have R and RStudio installed and are keen to get started with your first exercise. But before you do so, I would like to suggest some routine steps that will make your future work with data easier and more organized. If you are already experienced with RStudio and have your own routine, feel free to skip this section.

1. Creating work folder and project

First, let’s create a directory (and then a project) for keeping all work associated with this particular analysis. This is a very convenient way to work, because in R/RStudio you always work in a specific (“project”) directory. Unless all data and other files that you need are in the same directory as your project, you will have to specify full paths such as "C:/Users/annabrown/Documents/R Exercise book/Exercise 1/data.txt". This can get tedious very quickly, but within a project, you can just refer to file names, such as "data.txt". More importantly, when all the project-related files and objects are kept together, it is easy to get back to your work at any time by simply opening the project (using the RStudio menu File / Open Project…).

Begin by creating a new directory called “R Exercise 1” on your computer, and download/move the data file SDQ.RData into this directory.

To create an R project associated with this folder, in RStudio click File and then New Project. A dialog box will pop up, asking you to select where to create the new project. Select the folder you have just created (in my example, it is “R Exercise 1”). You should see your project name appearing in the Files tab of the bottom right RStudio window, and the file SDQ.RData should be visible there too.


2. Creating a Script
You can type commands directly into R Console and execute them one at a
time. This can be good for trying out some functions. However, in the long
run, you will need to save your analysis and modify or add to it, sometimes over
several sessions. So it is much better to create a Script (a text file with your
R code), from which any number of commands can be sent to Console at any
time.
To create a new script, select File / New File / R Script. It will open up in its own window. Write all code suggested in a particular exercise in this script. Run any command in the script by moving your cursor to that command and pressing Ctrl + Enter, and you will see how the command gets passed to Console and executed.

3. Installing and loading packages


R contributors have created thousands of functions to perform all sorts of analyses and bundled them into “packages”. Whenever we need to use function X from a particular package, we should first install the package containing X on the computer. (Note that many basic functions are part of base R, and you do not need to install or load anything to use them.) We will need package psych for Exercise 1. To install it, click menu Tools, then Install Packages, and type psych. You need to install a package only once, but you should load it in every session in which you use it (unless it is already loaded). To load package X, use function library(X), like so:

library(psych)
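Equivalently, a package can be installed from the Console with the base function install.packages() (a one-off step per computer):

install.packages("psych")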

4. Saving your work


It is a good idea to regularly save your R script by pressing the Save icon, or
choosing menu File / Save. Give the script a meaningful name, for example
“Likert scaling”.
After you have finished working with an exercise, save your R script. You may also want to save your entire workspace, which includes the data frame with any added columns and all the new objects you created. You might need these again when revisiting the exercise, or in other exercises using the same data set. When closing the project (File / Close Project), select Save to save your Workspace image (it will be saved with extension .RData). The project will close, your script and workspace will be saved, and the R session will be terminated (Console cleared). Close RStudio.
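If you prefer the Console to the menus, the workspace can also be saved at any time with the base function save.image(); a minimal sketch (the file name here is just an example):

# save all objects currently in the workspace to a file
save.image("Exercise1_workspace.RData")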
Part I

INTRODUCTION TO PSYCHOMETRIC SCALING METHODS
Exercise 1

Likert scaling of ordinal questionnaire data, creating a sum score, and norm referencing

Data file SDQ.RData
R package psych

1.1 Objectives

The purpose of this exercise is to learn how to compute test scores from ordinal
test responses, and interpret them in relation to a norm. You will also learn
how to deal with missing responses when computing test scores.

You will also learn how to deal with items that indicate the opposite end of the
construct to other items. Such items are sometimes called “negatively keyed” or
“counter-indicative”. Compare, for example, item “I get very angry and often
lose my temper” with item “I usually do as I am told”. They represent positive
and negative indicators of Conduct Problems, respectively. The (small) problem
with such items is that they need to be appropriately coded before computing
the test score so that the score unambiguously reflects “problems” rather than
the lack thereof. This exercise will show you how to do that.


1.2 Study of a community sample using the Strengths and Difficulties Questionnaire (SDQ)
The Strengths and Difficulties Questionnaire (SDQ) is a brief behavioural screening questionnaire for children and adolescents of 3-16 years of age. It
exists in several versions to meet the needs of researchers, clinicians and educa-
tionalists, see http://www.sdqinfo.org/. You can download the questionnaire,
and also the scoring key and norms provided by the test publisher.
The self-rated (SDQ Pupil) version includes 25 items measuring 5 scales (facets), with 5 items each:

Emotional Symptoms: somatic worries unhappy clingy afraid
Conduct Problems: tantrum obeys* fights lies steals
Hyperactivity: restles fidgety distrac reflect* attends*
Peer Problems: loner friend* popular* bullied oldbest
Pro-social: consid shares caring kind helpout
Respondents are asked to rate each question using the following response options: 0 = “Not true”, 1 = “Somewhat true”, 2 = “Certainly true”.
NOTE that some SDQ items represent behaviours counter-indicative of the
scales they intend to measure, so that higher scale scores correspond to lower
item scores. For instance, item “I usually do as I am told” (variable obeys) is
counter-indicative of Conduct Problems. There are 5 such items in the SDQ;
they are marked in the above table with asterisks (*).
Participants in this study are Year 7 pupils from the same school (N=228). This is a community sample, and we do not expect many children to have scores above clinical thresholds. The SDQ was administered twice: the first time when the children had just started secondary school (in Year 7), and again one year later (in Year 8).

1.3 Worked Example 1 - Likert scaling and norm referencing for Emotional Symptoms scale
This worked example is an analysis of the first SDQ scale, Emotional Symptoms. This scale has no counter-indicative items.

Step 1. Preliminary examination of data set

If you downloaded the file SDQ.RData and saved it in the same folder as this project, you should see the file name in the Files tab (usually in the bottom right RStudio panel). Now we can load the object (data frame) contained in this “native” R file (with extension .RData) into RStudio using the basic function load().

load("SDQ.RData")

You should see a new object SDQ appear in the Environment tab (top right RStudio panel). This tab shows any objects currently in the workspace; the data frame SDQ was stored in the file SDQ.RData we just loaded. According to the description in the Environment tab, the data frame contains “228 obs.” (observations) “of 51 variables”; that is, 228 rows and 51 columns.
You can click on the SDQ object. You should see the View(SDQ) command run
on Console, and, in response to that command, the data set should open up in its
own tab named SDQ. Examine the data by scrolling down and across. The data
set contains 228 rows (cases, observations) on 51 variables. There is Gender
variable (0=male; 1=female), followed by responses to 25 SDQ items named
consid, restles, somatic etc. (these are variable names in the order of items
in the questionnaire). Item variables reflect key meaning of the actual SDQ
questions, which are also attached to the data frame as labels. For example,
consid is a shortcut for “I try to be nice to other people. I care about their
feelings”, or restles is a shortcut for “I am restless, I cannot stay still for long”.
These 25 variables are followed by 25 more variables named consid2, restles2,
somatic2 etc. These are responses to the same SDQ items at Time 2.
You should also see that there are some missing responses, marked ‘NA’.
There are more missing responses for Time 2, with whole rows missing for some
pupils. This is typical for longitudinal data, because it is not always possible
to obtain responses from the same pupil one year later (for example, the pupil
might have moved schools).
You can obtain the names of all variables by typing and running command
names(SDQ).
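For convenience, here it is as a line you can add to your script; it prints all 51 variable names in the Console:

names(SDQ)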

Step 2. Creating variable lists

Let us start the analysis by creating a list of the items that measure Emotional Symptoms (you can see them in the table given earlier). This will enable easy reference to the data from these 5 items (variables) in all analyses. We will use c(), the base R function for combining values into a vector.

items_emotion <- c("somatic","worries","unhappy","clingy","afraid")

Note how a new object items_emotion appeared in the Environment tab.


Now you will be able to refer to the data from these variables via SDQ[items_emotion]. Try running this command and see how you get only the responses to the 5 items we specified.
QUESTION 1. Now use the same logic to create a list of items measuring
Emotional Symptoms at Time 2, called items_emotion2.

Step 3. Computing the test score

Now you are ready to compute the Emotional Symptoms scale score for Time 1. A natural first attempt is the base R function rowSums(), which computes the sum of the specified variables for each row (pupil):

rowSums(SDQ[items_emotion])

## [1] 4 3 1 2 4 2 4 0 1 1 0 8 2 3 7 4 5 2 8 6 1 4 9 4 5
## [26] 9 0 3 3 1 0 2 6 3 9 4 4 0 7 1 3 6 4 5 4 1 4 1 0 5
## [51] 1 2 2 4 4 4 6 1 8 3 2 2 4 1 1 0 2 2 7 5 0 NA NA 1 1
## [76] 7 4 1 8 3 5 0 5 4 0 1 1 5 3 6 1 3 2 6 6 0 2 4 5 3
## [101] 3 1 1 7 2 3 5 5 NA 0 4 0 4 1 1 1 1 0 2 7 0 3 8 4 6
## [126] NA 2 4 7 1 0 0 1 0 4 3 0 10 5 2 1 6 1 2 1 0 1 NA 4 4
## [151] 2 4 7 5 6 1 0 5 3 1 3 3 6 4 2 3 1 0 3 3 0 3 0 0 0
## [176] 2 2 2 0 1 5 3 3 1 4 3 1 6 2 4 2 NA 0 2 5 5 0 2 2 3
## [201] 4 0 2 4 2 2 1 3 2 0 1 0 0 8 1 1 2 1 2 2 4 0 0 1 2
## [226] 2 1 6

Try this and check out the resulting scores printed in the Console window. Oops! It appears that pupils with missing responses (even on one item) got ‘NA’ for their scale score. This is because, by default, rowSums() returns ‘NA’ for any row containing missing data. Let’s change this to skipping only the ‘NA’ responses, not whole rows, like so:

rowSums(SDQ[items_emotion], na.rm=TRUE)

## [1] 4 3 1 2 4 2 4 0 1 1 0 8 2 3 7 4 5 2 8 6 1 4 9 4 5
## [26] 9 0 3 3 1 0 2 6 3 9 4 4 0 7 1 3 6 4 5 4 1 4 1 0 5
## [51] 1 2 2 4 4 4 6 1 8 3 2 2 4 1 1 0 2 2 7 5 0 2 7 1 1
## [76] 7 4 1 8 3 5 0 5 4 0 1 1 5 3 6 1 3 2 6 6 0 2 4 5 3
## [101] 3 1 1 7 2 3 5 5 4 0 4 0 4 1 1 1 1 0 2 7 0 3 8 4 6
## [126] 0 2 4 7 1 0 0 1 0 4 3 0 10 5 2 1 6 1 2 1 0 1 4 4 4
## [151] 2 4 7 5 6 1 0 5 3 1 3 3 6 4 2 3 1 0 3 3 0 3 0 0 0
## [176] 2 2 2 0 1 5 3 3 1 4 3 1 6 2 4 2 4 0 2 5 5 0 2 2 3
## [201] 4 0 2 4 2 2 1 3 2 0 1 0 0 8 1 1 2 1 2 2 4 0 0 1 2
## [226] 2 1 6

Now you should get scale scores for all pupils, but in this calculation, the missing
responses are simply skipped, so essentially treated as zeros. This is not quite
right. Remember that there might be different reasons for not answering a
question, and not answering the question is not the same as saying “Not true”,
therefore should not be scored as 0.
Instead, we will do something more intelligent. We will use rowMeans() function
to compute the mean of those item responses that are present (still skipping the
‘NA’ values, na.rm=TRUE), and then multiply the result by 5 (the number of
items in the scale) to obtain a fair estimate of the sum score.
For example, if all non-missing responses of a person are 2, the mean is also
2, and multiplying this mean by the number of items in the scale, 5, will give
a fair estimate of the expected scale score, 5x2=10. So we essentially replace
any missing responses with the mean response for that person, thus producing
a fairer test score.
Try this and compare with the previous result from rowSums(). It should give
the same values for the vast majority of pupils, because they had no missing
data. The only differences will be for those few pupils who had missing data.

rowMeans(SDQ[items_emotion], na.rm=TRUE)*5

Now we will repeat the calculation, but this time appending the resulting score
as a new column (variable) named S_emotion to the data frame SDQ:

SDQ$S_emotion <- rowMeans(SDQ[items_emotion], na.rm=TRUE)*5

Let’s check whether the calculation worked as expected for those pupils with
missing data, for example case #72. Let’s pull that specific record from the data
frame, referring to the row (case) number, and then to the columns (variables)
of interest:

SDQ[72,items_emotion]

##    somatic worries unhappy clingy afraid
## 72       0       1       0      1     NA

You can see that one response is missing on item afraid. If we just added up the
non-missing responses for this pupil, we would get the scale score of 2. However,
the mean of 4 non-missing scores is (0+1+0+1)/4 = 0.5, and multiplying this
by the total number of items 5 should give the scale score 2.5. Now check the
entry for this case in S_emotion:

SDQ$S_emotion[72]

## [1] 2.5

QUESTION 2. Repeat the above steps to compute the test score for Emo-
tional Symptoms at Time 2 (call it S_emotion2), and append the score to the
SDQ data frame as a new column.

Step 4. Examining the distribution and scale score statistics

We start by plotting a basic histogram of the S_emotion score:

hist(SDQ$S_emotion)

[Figure: Histogram of SDQ$S_emotion. x-axis: SDQ$S_emotion (0 to 10); y-axis: Frequency.]

QUESTION 3. What can you say about the distribution of the S_emotion score? Is the Emotional Symptoms subtest “easy” for the children in this community, or “difficult”?
You can also compute descriptive statistics for S_emotion using the very convenient function describe() from package psych, which gives the range, the mean, the median, the standard deviation and other useful statistics.

library(psych)
describe(SDQ$S_emotion)

## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 228 2.89 2.33 2 2.66 2.97 0 10 10 0.72 -0.16 0.15

As you can see, the median (the score below which half of the sample lies) of S_emotion is 2, while the mean is higher at 2.89. This is because the score is positively skewed; in this case, the median is more representative of the central tendency. These statistics are consistent with our observation of the histogram, which shows a profound floor effect.
QUESTION 4. Obtain and interpret the histogram and the descriptives for
S_emotion2 independently.

Step 5. Norm referencing

Below are the cut-offs for “Normal”, “Borderline” and “Abnormal” cases for
Emotional Symptoms provided by the test publisher (see https://sdqinfo.org/).
These are the scores that set apart likely borderline and abnormal cases from
the “normal” cases.

• Normal: 0-5
• Borderline: 6
• Abnormal: 7-10

Use the histogram you plotted earlier for S_emotion (Time 1) to visualize
roughly how many children in this community sample fall into the “Normal”,
“Borderline” and “Abnormal” bands.
Now let’s use the function table(), which tabulates cases with each score value.

table(SDQ$S_emotion)

##
## 0 1 2 2.5
## 36 43 37 1
## 3 4 5 6
## 27 32 19 13
## 6.66666666666667 7 8 8.75
## 1 8 6 1
## 9 10
## 3 1

A few non-integer scores should not worry you. They occur because the scale score was computed from the item means for some pupils with missing responses. For all cases without missing responses, the resulting scale score is an integer.
We can use the table() function to establish the number of children in this sample in the “Normal” range. From the cut-offs, we know that the Normal range corresponds to S_emotion scores between 0 and 5. Simply specify this condition (that we want scores less than or equal to 5) when calling the function:

table(SDQ$S_emotion <= 5)

##
## FALSE TRUE
## 33 195

This gives 195 children or 85.5% (195/228=0.855) classified in the Normal range.
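The proportion can also be obtained directly, without manual division, by wrapping the table in the base R function prop.table():

prop.table(table(SDQ$S_emotion <= 5))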
QUESTION 5. Now try to work out the percentage of children who can be
classified “Borderline” on Emotional Symptoms (Time 1 only).
QUESTION 6. What is the percentage of children in the “Abnormal” range
on Emotional Symptoms (Time 1 only)?

1.4 Worked Example 2 - Reverse coding counter-indicative items and computing test score for SDQ Conduct Problems
This worked example comprises scaling of the SDQ items measuring Conduct
Problems. Once you feel confident with reverse coding, you can complete the
exercise for the remaining SDQ facets that contain counter-indicative items.

Step 1. Creating variable list

Remind yourself of the items designed to measure Conduct Problems; you can see them in the table given earlier. Now let us create a list, which will enable easy reference to the data from these 5 variables in all analyses.

items_conduct <- c("tantrum","obeys","fights","lies","steals")

Note how a new object items_conduct appeared in the Environment tab. Try
calling SDQ[items_conduct] to pull only the data from these 5 items.
QUESTION 7. Create a list of items measuring Conduct Problems at Time
2, called items_conduct2.

Step 2. Reverse coding counter-indicative items

Before adding item scores together to obtain a scale score, we must reverse code any items that are counter-indicative of the scale. Otherwise, positive and negative indicators of the construct will cancel each other out in the sum score! For Conduct Problems, we have only one counter-indicative item, obeys. To reverse-code this item, we will use a dedicated function of the psych package, reverse.code(). This function has the general form reverse.code(keys, items,…). Argument keys is a vector of values 1 or -1, where -1 means the item should be reversed. Argument items gives the variables we want to score. Let’s look at the set of items again:

tantrum obeys* fights lies steals

Since the only item to reverse-code is #2 in the set of 5 items, we will combine
the following values in a vector to obtain keys=c(1,-1,1,1,1). The whole
command will look like this:

R_conduct <- reverse.code(keys=c(1,-1,1,1,1), SDQ[items_conduct])

We assigned the appropriately coded subset of 5 items to a new object, R_conduct. Preview the item scores in this object:

# reverse coded items
head(R_conduct)

##      tantrum obeys- fights lies steals
## [1,]       0      0      0    0      0
## [2,] 0 0 0 0 0
## [3,] 0 0 0 0 0
## [4,] 0 0 0 0 0
## [5,] 1 2 0 2 0
## [6,] 0 0 0 0 0

You should see that the item obeys is marked with “-“, and that it is indeed
reverse coded, if you compare it with the original below. How good is that?

# original items
head(SDQ[items_conduct])

##   tantrum obeys fights lies steals
## 1       0     2      0    0      0
## 2 0 2 0 0 0

## 3 0 2 0 0 0
## 4 0 2 0 0 0
## 5 1 0 0 2 0
## 6 0 2 0 0 0

QUESTION 8. Use the logic above to reverse code items measuring Conduct
Problems at Time 2, saving them in object R_conduct2.

Step 3. Computing the test score

Now we are ready to compute the Conduct Problems scale score for Time 1.
Because there are missing responses (particularly for Time 2), we will use
rowMeans() function to compute the mean of those item responses that are
present (skipping the ‘NA’ values, na.rm=TRUE), and then multiply the result
by 5 (the number of items in the scale) to obtain a fair estimate of the sum score. Please refer to Worked Example 1.3 (Step 3) for a detailed explanation of this procedure.
Importantly, we will use the reverse-coded items (R_conduct) rather than the original items (SDQ[items_conduct]) in the calculation of the sum score. We will append the computed scale score as a new variable (column) named S_conduct to data frame SDQ:

SDQ$S_conduct <- rowMeans(R_conduct, na.rm=TRUE)*5

QUESTION 9. Compute the test score for Conduct Problems at Time 2 (call
it S_conduct2), and append the score to the SDQ data frame as a new variable
(column).

Step 4. Examining and norm referencing the scale score

Refer to Worked Example 1.3 for instructions on how to obtain descriptive statistics and a histogram of the scale score (Step 4), and how to refer the raw scale score to the published norms (Step 5).
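For instance, a minimal sketch mirroring those steps for the Conduct Problems score (assuming package psych is loaded):

hist(SDQ$S_conduct)
describe(SDQ$S_conduct)
table(SDQ$S_conduct)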

1.5 Further practice - Likert scaling of remaining SDQ subscales
Repeat the steps in Worked Example 1 for the Pro-social facet. NOTE that, just like the Emotional Symptoms scale, the Pro-social scale does not have any counter-indicative items. For computing scale scores for the other SDQ facets, you will need to reverse code such items, as we learned in the second Worked Example. Then, you will be able to practice this exercise with the remaining SDQ scales.

Repeat the steps in Worked Example 2 for the Hyperactivity and Peer Problems facets.

When finished with this exercise, do not forget to save your work as described in the “Getting Started with R and RStudio” section.

1.6 Solutions

Q1.

items_emotion2 <- c("somatic2","worries2","unhappy2","clingy2","afraid2")

Q2.

SDQ$S_emotion2 <- rowMeans(SDQ[items_emotion2], na.rm=TRUE)*5

If you call SDQ$S_emotion2, you will see that there are many cases with a missing score, labelled NaN. This is because the scale score cannot be computed for those pupils who had ALL responses missing at Time 2.

Q3. The S_emotion score is positively skewed and shows the floor effect.
This is not surprising since the questionnaire was developed to screen clinical
populations, and most children in this community sample did not endorse any
of the symptoms (most items are too “difficult” for them to endorse).

Q4.

hist(SDQ$S_emotion2)

[Figure: Histogram of SDQ$S_emotion2. x-axis: SDQ$S_emotion2 (0 to 9); y-axis: Frequency.]

describe(SDQ$S_emotion2)

## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 172 2.41 2.14 2 2.14 1.48 0 9 9 0.91 0.11 0.16

The histogram and the descriptive statistics for Time 2 look similar to Time 1.
There is still a floor effect, with many pupils not endorsing any symptoms.
Q5. 13 children (or 5.7%) can be classified Borderline. You can look up the count for the Borderline score of 6 in the output of the basic table() function. Alternatively, you can call:

table(SDQ$S_emotion==6)

##
## FALSE TRUE
## 215 13

Importantly, you have to use == when describing the equality condition, because the single = in R is reserved for assigning a value to an object.
Q6. For “Abnormal”, you can tabulate all scores greater than 6. To calculate the proportion, divide by N=228. This gives 20/228 = 0.088, or 8.8%.

table(SDQ$S_emotion>6)

##
## FALSE TRUE
## 208 20

Q7.

items_conduct2 <- c("tantrum2","obeys2","fights2","lies2","steals2")

Q8.

# reverse code
R_conduct2 <- reverse.code(keys=c(1,-1,1,1,1), SDQ[items_conduct2])
# check the reverse coded items
head(R_conduct2)

##      tantrum2 obeys2- fights2 lies2 steals2
## [1,]        0       0       0     1       0
## [2,] 1 1 0 0 0
## [3,] 2 0 0 0 0
## [4,] 0 1 0 0 0
## [5,] NA NA NA NA NA
## [6,] 0 0 0 0 0

# compare with original items
head(SDQ[items_conduct2])

##   tantrum2 obeys2 fights2 lies2 steals2
## 1        0      2       0     1       0
## 2 1 1 0 0 0
## 3 2 2 0 0 0
## 4 0 1 0 0 0
## 5 NA NA NA NA NA
## 6 0 2 0 0 0

Q9.

SDQ$S_conduct2 <- rowMeans(R_conduct2, na.rm=TRUE)*5

If you call SDQ$S_conduct2, you will see that there are many cases with a missing score, labelled NaN. This is because the scale score cannot be computed for those pupils who had ALL responses missing at Time 2.
Exercise 2

Optimal scaling of ordinal questionnaire data

Data file SDQ.RData
R package aspect

2.1 Objectives

Previously, in Exercise 1, we scored the Strengths and Difficulties Questionnaire (SDQ) using the so-called “Likert scaling” approach, whereby the response categories “not true”-“somewhat true”-“certainly true” were assigned consecutive integers 0-1-2. Apart from reflecting the apparently increasing degree of agreement in these response options, the assignment of the integers was arbitrary, as there was no particular reason we assigned 0-1-2 as opposed to, for instance, 1-2-3. Such an arbitrary way of scoring item responses is also called “measurement by fiat”. In this exercise, we will attempt to find “optimal” scores for ordinal responses to the SDQ. “Optimal” means that the scores we assign to responses are not just any scores; they are the “best” of all possible scores in terms of fulfilling some statistical criterion.

There are many ways to “optimize” item scores; here, we will maximize the ratio
of the variance of the total score to the sum of the variances of the item scores.
In psychometrics, fulfilling this criterion results in maximizing the sum of the
item correlations (and therefore the test score’s “internal consistency” measured
by Cronbach’s alpha).
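For reference, this connection can be made explicit (a standard result, stated here without derivation): for a scale of $k$ items with variances $\sigma^2_i$ and a total score with variance $\sigma^2_X$, Cronbach's alpha is

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^2_i}{\sigma^2_X}\right),$$

so maximizing the variance of the total score relative to the sum of the item variances directly maximizes alpha.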


2.2 Worked Example – Optimal scaling of SDQ Emotional Symptoms items
We begin by loading the data frame SDQ kept in file SDQ.RData. Please refer to Exercise 1 for an explanation of all variables in this data frame.

load(file="SDQ.RData")

names(SDQ)

## [1] "Gender" "consid" "restles" "somatic" "shares" "tantrum"
## [7] "loner" "obeys" "worries" "caring" "fidgety" "friend"
## [13] "fights" "unhappy" "popular" "distrac" "clingy" "kind"
## [19] "lies" "bullied" "helpout" "reflect" "steals" "oldbest"
## [25] "afraid" "attends" "consid2" "restles2" "somatic2" "shares2"
## [31] "tantrum2" "loner2" "obeys2" "worries2" "caring2" "fidgety2"
## [37] "friend2" "fights2" "unhappy2" "popular2" "distrac2" "clingy2"
## [43] "kind2" "lies2" "bullied2" "helpout2" "reflect2" "steals2"
## [49] "oldbest2" "afraid2" "attends2"

We will use package aspect, which makes optimal scaling easy by offering a
range of very useful options and built-in plots.

library("aspect")

Step 1. Selecting items for analysis

To analyse only the items measuring Emotional Symptoms, it is convenient to create a list of item (variable) names, and then refer to only these items in the data frame:

# pick only items designed to measure Emotional Symptoms
items_emotion <- c("somatic","worries","unhappy","clingy","afraid")
# preview the Emotional Symptoms item responses
head(SDQ[items_emotion])

##   somatic worries unhappy clingy afraid
## 1       2       1       0      1      0
## 2 2 0 0 1 0
## 3 0 0 0 0 1
## 4 0 0 0 1 1
## 5 2 1 0 1 0
## 6 1 0 0 1 0

Step 2. Dropping cases with missing responses

Before performing optimal scaling, we will drop cases with missing responses on at least one of the items, as the package aspect does not appear to support missing values. There are only 5 such cases with missing responses.

# see how many NA values there are
summary(SDQ[items_emotion])

##    somatic          worries          unhappy          clingy
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :1.0000
## Mean :0.6106 Mean :0.6211 Mean :0.3172 Mean :0.8421
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :2.0000 Max. :2.0000 Max. :2.0000 Max. :2.0000
## NA's :2 NA's :1 NA's :1
## afraid
## Min. :0.00
## 1st Qu.:0.00
## Median :0.00
## Mean :0.48
## 3rd Qu.:1.00
## Max. :2.00
## NA's :3

# drop cases with missing responses and put complete cases into data frame called "items"
items <- na.omit(SDQ[items_emotion])

Step 3. Running the optimal scaling procedure

Function corAspect() performs optimal scaling by optimizing various criteria on the correlation matrix. Here is the standard call for this function: corAspect(data, aspect = "aspectSum", level = "nominal", ...). First, we need to supply data, which will be our working data frame, items. Second, we need to choose the criterion (aspect) to optimize. Here, we will maximize the sum of the items’ correlations, so we use the default setting aspect="aspectSum". Third, we need to supply the level of measurement for the analysed variables. A nominal scale level (the default) assumes that the variables are nominal categories and involves no restrictions on the resulting scores. An ordinal scale level requires preserving the order of the scores, and numerical variables additionally require equal distances between the scores. In this example, the response categories “not true”-“somewhat true”-“certainly true” clearly reflect an increasing order of agreement, which we want to preserve, so we set level="ordinal".
40EXERCISE 2. OPTIMAL SCALING OF ORDINAL QUESTIONNAIRE DATA

opt <- corAspect(items, aspect = "aspectSum", level="ordinal")

# Summary output for the optimal scaling analysis
summary(opt)

##
## Correlation matrix of the scaled data:
## somatic worries unhappy clingy afraid
## somatic 1.0000000 0.3480251 0.3651134 0.2258002 0.3113325
## worries 0.3480251 1.0000000 0.4612166 0.4020225 0.3901405
## unhappy 0.3651134 0.4612166 1.0000000 0.3598932 0.4603964
## clingy 0.2258002 0.4020225 0.3598932 1.0000000 0.3865003
## afraid 0.3113325 0.3901405 0.4603964 0.3865003 1.0000000
##
##
## Eigenvalues of the correlation matrix:
## [1] 2.4961448 0.7844417 0.6270432 0.5887536 0.5036166
##
## Category scores:
## somatic:
## score
## 0 -0.8864022
## 1 0.5836441
## 2 2.0454937
##
## worries:
## score
## 0 -0.8348282
## 1 0.4234660
## 2 2.1441015
##
## unhappy:
## score
## 0 -0.5895239
## 1 1.3910873
## 2 2.7286027
##
## clingy:
## score
## 0 -1.1851447
## 1 0.2510005
## 2 1.6576948
##
## afraid:
## score
## 0 -0.7821442

## 1 1.0234825
## 2 1.8943502

The output displays the “Correlation matrix of the scaled data”, which contains the correlations of the item scores after optimal scaling. These can be compared to the correlations between the original variables, calculated using cor(items). Further, the “Eigenvalues of the correlation matrix” are displayed. Eigenvalues are the variances of principal components (from Principal Components Analysis), and are very helpful in indicating the number of dimensions measured by this set of items. The result here, with the first eigenvalue substantially larger than the remaining eigenvalues, indicates just one dimension, as we hoped.
Finally, “Category scores” show the scores that the optimal scaling procedure assigned to the item categories. For example, the result suggests scoring item somatic by assigning the score -0.8864022 to response “not true”, the score 0.5836441 to response “somewhat true” and the score 2.0454937 to response “certainly true”. The values are chosen so that the scaled item’s mean in the sample is 0, and the correlations between the items are maximized.
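As suggested above, you can run the comparison yourself; any gain from optimal scaling should show up as somewhat higher correlations among the scaled items than among the original ones:

# correlations of the original integer-scored items (complete cases)
cor(items)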

Step 4. Viewing transformation plots

Package aspect makes it very easy to obtain transformation plots, which show the category score assignments graphically.

plot(opt, plot.type = "transplot")


opt$catscores$somatic

opt$catscores$worries

−0.5
−1.0

1.0 1.5 2.0 2.5 3.0 1.0 1.5 2.0 2.5 3.0

Index Index
opt$catscores$unhappy

opt$catscores$clingy

−1.0
−0.5

1.0 1.5 2.0 2.5 3.0 1.0 1.5 2.0 2.5 3.0

Index Index
opt$catscores$afraid

−0.5

1.0 1.5 2.0 2.5 3.0

Index

Looking at the transformation plots, it can be seen that (1) the scores for subsequent categories increase almost linearly, and (2) the categories are roughly equidistant. We conclude that Likert scaling is appropriate for scoring the ordinal items of the SDQ Emotional Symptoms scale, and that not much can be gained from optimal scaling over basic Likert scaling.

2.3 Further practice – Optimal scaling of the remaining SDQ subscales
Following the steps in the worked example, perform optimal scaling of the remaining SDQ scales. Refer to Exercise 1 for the list of items in each scale. You should not need to worry about some items being counter-indicative of their scales, because optimal scaling takes care of this by assigning scores that monotonically decrease as the category increases.
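For instance, a minimal sketch for the Conduct Problems items (assuming the SDQ data frame from Exercise 1 is loaded and package aspect is attached):

# items designed to measure Conduct Problems (see Exercise 1)
items_conduct <- c("tantrum","obeys","fights","lies","steals")
# drop incomplete cases, then optimally scale the ordinal responses
opt_conduct <- corAspect(na.omit(SDQ[items_conduct]),
                         aspect = "aspectSum", level = "ordinal")
summary(opt_conduct)

If the logic above is right, the category scores for the counter-indicative item obeys should come out monotonically decreasing.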
Exercise 3

Optimal scaling of nominal questionnaire data

Data file Nishisato.csv
R package aspect

3.1 Objectives
In this exercise, we will attempt to find “optimal” scores for nominal responses. Unlike the ordinal responses considered in Exercise 2, nominal categories do not assume any particular order. Consequently, the optimal scores assigned to them are not expected to monotonically increase or decrease; there are simply no restrictions on their sign or order.
As before, in assigning “optimal” scores, we will maximize the sum of the item
correlations (and therefore the test score’s “internal consistency” measured by
Cronbach’s alpha).

3.2 Worked Example – Optimal scaling of Nishisato attitude items

This example considers responses to 4 attitude items from N=23 respondents (Nishisato, 1994). Optimal scaling of these data is considered in detail in McDonald (1999, p. 435-441).

1. How old are you? (20-29 / 30-39 / 40+)
2. Children today are not as disciplined as when I was a child (agree / cannot tell / disagree)
3. Children today are not as fortunate as when I was a child (agree / cannot tell / disagree)
4. Religion should be taught in school (agree / indifferent / disagree)

By looking at the items, we may tentatively propose that they measure something in common, perhaps nostalgic feelings about the past (?), if agreeing to items 2, 3 and 4, and older age, were keyed positively. To put this initial intuition to the test, we will conduct optimal scaling.

Step 1. Importing and examining data

We begin by importing the data file Nishisato.csv. Unlike in the previous
exercises, this file is not in the internal R format but is a comma-separated text file
(.csv). Files in formats foreign to R are not “loaded” but “read” instead. We can
use function read.csv(file, header = TRUE, sep = ",",...), which is dedicated to
reading files of this format. Note that by default, the function assumes that the
file has variable names (header=TRUE). Because this is an external file, we need
to place its contents into an internal R object - a data frame. We will call this data
frame (arbitrarily) attitude.

attitude <- read.csv(file="Nishisato.csv")


head(attitude)

## item1 item2 item3 item4


## 1 40+ agree disagree disagree
## 2 30-39 agree cannot tell agree
## 3 30-39 agree disagree agree
## 4 20-29 disagree disagree indifferent
## 5 40+ agree disagree agree
## 6 20-29 cannot tell agree agree

Examine the item names (item1, item2, item3, item4) and responses. You
can see that the responses are not coded as numbers; they are strings
corresponding to the response options, for example “cannot tell”. We leave them
like that because, in this analysis, we will not make use of any ordering of the response
options, considering them purely nominal categories.

Step 2. Running the optimal scaling procedure

We will again use package aspect, so load it into memory now.

library("aspect")

When using function corAspect(data, aspect = "aspectSum", level =
"nominal", ...), we will maximize the sum of the items’ correlations, so we use the
default setting aspect="aspectSum". For the level of measurement, we will
also use the default, level="nominal".

opt2 <- corAspect(attitude, aspect = "aspectSum", level="nominal")


# Summary output for the optimal scaling analysis
summary(opt2)

##
## Correlation matrix of the scaled data:
## item1 item2 item3 item4
## item1 1.0000000 0.7960663 0.3873563 0.7034011
## item2 0.7960663 1.0000000 0.3333116 0.4785331
## item3 0.3873563 0.3333116 1.0000000 0.3999385
## item4 0.7034011 0.4785331 0.3999385 1.0000000
##
##
## Eigenvalues of the correlation matrix:
## [1] 2.5896793 0.7535762 0.5116607 0.1450839
##
## Category scores:
## item1:
## score
## 20-29 1.4526531
## 30-39 -0.3425303
## 40+ -1.0122570
##
## item2:
## score
## agree -0.6607173
## cannot tell 1.4369605
## disagree 1.6078784
##
## item3:
## score
## agree -0.5216552
## cannot tell 1.8973623
## disagree -0.5281230
##
## item4:
## score
## agree -0.4235185
## disagree -0.7718009
## indifferent 1.6643455

The output displays the “Correlation matrix of the scaled data”, which are
correlations of the item scores after optimal scaling. These are all positive and
surprisingly high, ranging between 0.33 and 0.79. Further, “Eigenvalues of the
correlation matrix” (from Principal Components Analysis) are displayed. The
first eigenvalue here (2.59) is substantially larger than the remaining eigenvalues,
indicating just one dimension.
Finally, “Category scores” show the scores that the optimal scaling procedure
assigned to the response categories. For example, for those between 20 and 29
years of age, the score suggested is 1.45; those between 30 and 39 will get -.34
and those who are 40 or older will get -1.01. These and other values are chosen
so that the scaled item’s mean in the sample is 0, and the correlations between
the items are maximized.
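
To see how these assignments translate into a person’s total score, you could look up the category score for each response and sum across the items. A rough sketch (assuming, as the output above suggests, that each element of catscores is a one-column table with the category labels as row names):

# Hypothetical illustration: total optimal score for respondent 1
sum(mapply(function(item, resp) opt2$catscores[[item]][resp, "score"],
           names(attitude), unlist(attitude[1, ])))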

Step 3. Viewing transformation plots

We obtain transformation plots by calling

plot(opt2, plot.type = "transplot")


[Transformation plots for item1 to item4, showing the assigned category score against the category index (1 to 3) for each item.]

The transformation plot for item1 (age) makes it obvious that although the
relationship is monotonic, it is not perfectly linear.
QUESTION 1. Interpret category score assignments and transformation plots
for the other items. Items 3 and 4 are the most interesting because they have
non-monotonic relationships, and thus depart completely from our initial intu-
ition about potential Likert scaling of items.
QUESTION 2. What kind of person would get the highest score on the total
attitude scale (how old would they be, how would they respond to the other
items?). What kind of person would get the lowest score?
QUESTION 3. Now, given the established scaling, what do you think the
resulting scale measures?
This completes the exercise.

3.3 Solutions

Q1. Output for item2 (Children today are not as disciplined as when I was
a child: Agree / Cannot tell / Disagree) suggests that for those agreeing, the
score will be -.66; those who ‘cannot tell’ will get 1.44 and those who disagree
will get only slightly more, 1.61.
Output for item3 (Children today are not as fortunate as when I was a child:
Agree / Cannot tell / Disagree) shows that those who ‘cannot tell’ will get the
score 1.90, and those who agree or disagree will get very similar scores, -.52 or
-.53, respectively.
Output for item4 (Religion should be taught in school: Agree / Indifferent
/ Disagree) shows that those who are ‘indifferent’ will get the score 1.66, and
those who agree or disagree will get negative scores, -.42 or -.77 respectively.
Q2. The highest score on the scale will be obtained by those aged 20-29, dis-
agreeing with the idea that children today are not as disciplined as when they
were a child, and not providing any definitive opinion on the other two state-
ments (“Children today are not as fortunate as when I was a child”, and “Reli-
gion should be taught in school”). The lowest score on the scale will be obtained
by those aged 40+, feeling that children today are not as disciplined, and dis-
agreeing with the other two statements (“Children today are not as fortunate
as when I was a child”, and “Religion should be taught in school”). [In fact,
there is very little difference in score whether one agrees or disagrees with the
last two statements].
Q3. Are the score assignments consistent with the conjecture that what is mea-
sured is a form of conservatism marked by aging and nostalgia/dogmatism? To
some extent, yes, because on one end of the scale we have young people who
do not have any concerns about lowering discipline standards for children, feel
it is impossible to tell whether children today are more or less fortunate, and
are indifferent to whether religion is taught in school or not (more liberal/open-
minded). On the other end of the scale we have older people who feel that dis-
cipline standards for children have deteriorated, and have opinions on whether
children today are more or less fortunate, and whether religion should be taught
in school or not (more nostalgic about the past and dogmatic).
Exercise 4

Thurstonian scaling

Data file JobFeatures.txt


R package psych

4.1 Objectives
The purpose of this exercise is to learn how to perform Thurstonian scaling of
preferential choice data. The objective of Thurstonian scaling is to estimate
means of psychological values (or utilities) of the objects under comparison.
That is, given the observed rank orders of the objects, we want to know what
utility distributions could give rise to such orderings. To estimate unobserved
utilities from observed preferences, we need to make assumptions about the dis-
tributions of utilities. In this analysis, normal distributions with equal variance
are assumed (so-called Thurstone Case V scaling).
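
Concretely, under Case V the probability of preferring object i to object j is a normal-ogive function of the difference between their mean utilities. A small numerical sketch (assuming the common normalisation in which the standard deviation of the utility difference is set to 1; some treatments instead use unit-variance utilities, giving pnorm((mu_i - mu_j)/sqrt(2))):

# Case V: P(object i preferred to object j) = pnorm(mu_i - mu_j)
pnorm(1 - 0)  # a mean utility difference of 1 implies ~84% prefer object i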

4.2 Study of preferences for job features


Data used in this exercise was collected as part of a study of work motivation by
Dr Ilke Inceoglu. In this study, N=1079 participants were asked to rank order
the following nine job features according to the extent they would want them
in their ideal job:

1. Supportive Environment (Support)


2. Challenging Work (Challenge)
3. Career Progression (Career)
4. Ethical Job (Ethics)
5. Job Control, Personal Impact (Autonomy)


6. Development (Development)
7. Social Interaction (Interaction)
8. Competitive Environment (Competition)
9. Pleasant environment and Safety (Safety)

Rank 1 given to any job feature means that this job feature was the most
important for this participant, rank 9 means that it was least important.

4.3 Worked Example - Thurstonian scaling of ideal job features
To complete this exercise, you need to repeat the analysis from the worked example
below.

Step 1. Reading and examining data

To read the data file JobFeatures.txt into RStudio, I will use read.delim(),
a basic R function for reading text files (in this case, a tab-delimited file). I
will read the data into a new object (data frame), which I will call (arbitrarily)
JobFeatures.

JobFeatures <- read.delim("JobFeatures.txt")

You should see a new object JobFeatures appear on the Environment tab
(usually top right RStudio panel). This tab will show any objects currently in
the work space. According to the description in the Environment tab, the data
set contains 1079 obs. of 9 variables.
You can click on the JobFeatures object. This will run the View(JobFeatures)
command on the Console, which in turn will open the data frame in its own
tab named JobFeatures. Examine the data by scrolling down and across.
The data set contains 1079 rows (cases, observations) on 9 variables: Support,
Challenge, Career, etc. You can obtain the names of all variables in the data
frame JobFeatures by using the basic function names():

names(JobFeatures)

## [1] "Support" "Challenge" "Career" "Ethics" "Autonomy"


## [6] "Development" "Interaction" "Competition" "Safety"

Each row represents someone’s ranking of the 9 job features. Function head()
will display first few participants with their rankings (and tail() will display
last few):

head(JobFeatures)

## Support Challenge Career Ethics Autonomy Development Interaction Competition


## 1 8 3 4 5 2 6 1 7
## 2 7 5 1 6 2 8 3 9
## 3 5 8 1 9 6 2 3 4
## 4 7 6 8 9 3 4 2 5
## 5 1 4 8 3 9 2 6 7
## 6 6 1 3 7 8 5 2 4
## Safety
## 1 9
## 2 4
## 3 7
## 4 1
## 5 5
## 6 9

Step 2. Performing Thurstonian scaling

Before doing the analysis, load package psych to enable its features.

library(psych)

To perform Thurstonian scaling, we will use function thurstone(). For details
about how this function works, run the following code:

help("thurstone")

You will get full documentation displayed in the Help tab. As described in the
documentation, this function has the following general form:

thurstone(x, ranks = FALSE, digits = 2)

Each component in the parentheses is called an “argument”, and so the thurstone()
function has three arguments. The first argument (x) is the data to be scaled and
it needs to be supplied because there is no default value. The second argument
is the type of data you supply (either actual ranks, or percentages of preferences
summarised in a square matrix). By default, the summarised percentages are
assumed (ranks=FALSE). The third argument is the number of digits (decimal
places) reported for the goodness of fit index (default is 2). The second and
the third arguments are optional because they have default values. If you are
happy with the defaults, you do not need to even include these arguments in
your syntax.

Since our data are raw rankings, we need to set ranks=TRUE. Type and run
the following code from your Script. It may take a little while, but you will
eventually get the results on your Console!

thurstone(JobFeatures, ranks=TRUE)

## Thurstonian scale (case 5) scale values


## Call: thurstone(x = JobFeatures, ranks = TRUE)
## [1] 0.97 0.93 0.91 0.92 0.60 1.04 0.63 0.00 0.23
##
## Goodness of fit of model 1

Let’s interpret this output. First, the title of the function and the full command
are printed. Next, the estimated population means of the psychological values
(or utilities) for 9 job features are printed (starting with [1], which simply indi-
cates the row number of output data). To make sense of the reported means,
you have to match them with the nine features. This is easy, as the means are
reported in the order of variables in the data frame.
Every mean reflects how valuable/desirable on average this particular job feature
is. A high mean indicates that participants value this job feature highly in
relation to other features. Because preferences are always relative, however, it
is impossible to uniquely identify all the means (for explanation, you may see
McDonald (1999), chapter 18, “Pair Comparison Model”). Therefore, one of
the means has to be fixed to some arbitrary value. It is customary to fix the
mean of the least preferred feature to 0. Then, all the remaining means are
positive.
QUESTION 1. Try to interpret the reported means. Which job feature was
least wanted? What was the utility mean of this feature? What was the most
wanted feature, and what was its utility mean? Looking at the mean values,
how would you interpret the relative values of Autonomy and Interaction?
What else can you say about the relative utility values of the job features?

Step 3. Calculating pairwise preference proportions

Now, let’s run the same function but assign its results to a new object called
(arbitrarily) scaling. Type and execute the following command:

scaling <- thurstone(JobFeatures, ranks = TRUE)

This time, there is no output, and it looks like nothing happened. However, the
same analysis was performed but now its results are stored in scaling rather
than printed out. To see what is stored in scaling, call function ls(), which will
list all the object’s constituents:

ls(scaling)

## [1] "Call" "choice" "GF" "residual" "scale"

You can check what is stored inside any of these constituent parts by referring to
them by full name - starting with scaling followed by the $ sign, for example:

scaling$scale

## [1] 0.97 0.93 0.91 0.92 0.60 1.04 0.63 0.00 0.23

This will print out the 9 utility means, which we already examined.
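
Because scale is a plain numeric vector, you can label its elements with the feature names for easier reading, for example:

setNames(scaling$scale, names(JobFeatures))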

scaling$choice

## [,1] [,2] [,3] [,4] [,5] [,6] [,7]


## [1,] 0.5000000 0.4735867 0.4745134 0.4745134 0.3632994 0.5078777 0.3605190
## [2,] 0.5264133 0.5000000 0.4670992 0.4949027 0.3753475 0.5301205 0.3623726
## [3,] 0.5254866 0.5329008 0.5000000 0.4986098 0.3753475 0.5236330 0.3901761
## [4,] 0.5254866 0.5050973 0.5013902 0.5000000 0.3632994 0.5180723 0.3781279
## [5,] 0.6367006 0.6246525 0.6246525 0.6367006 0.5000000 0.6737720 0.5050973
## [6,] 0.4921223 0.4698795 0.4763670 0.4819277 0.3262280 0.5000000 0.2928638
## [7,] 0.6394810 0.6376274 0.6098239 0.6218721 0.4949027 0.7071362 0.5000000
## [8,] 0.7961075 0.8341057 0.8127896 0.8090825 0.7126969 0.8526413 0.7775718
## [9,] 0.7747915 0.7173309 0.7405005 0.7432808 0.6598703 0.8044486 0.6950880
## [,8] [,9]
## [1,] 0.2038925 0.2252085
## [2,] 0.1658943 0.2826691
## [3,] 0.1872104 0.2594995
## [4,] 0.1909175 0.2567192
## [5,] 0.2873031 0.3401297
## [6,] 0.1473587 0.1955514
## [7,] 0.2224282 0.3049120
## [8,] 0.5000000 0.6135310
## [9,] 0.3864690 0.5000000

This will print a 9x9 matrix containing proportions of participants in the sample
who preferred the feature in the column to the feature in the row. This is a
summary of the rank preferences that we did not have in the beginning, but R
conveniently calculated it for us. In the “choice” matrix, the rows and columns
are in the order of variables in the original file.
Let’s examine the “choice” matrix more carefully. Look for the entry in row [8]
and column [6]. This value, 0.8526413, represents the proportion of participants
who preferred the 6th feature, Development, to the 8th feature, Competition
(remember, it is the feature in the column that is preferred to the feature in the
row), and it is the largest value in the “choice” matrix:

max(scaling$choice)

## [1] 0.8526413

This pair of features has the most decisive preference for one feature over the
other.
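
Rather than scanning the matrix by eye, you can locate this cell directly:

# Row and column indices of the largest proportion in the choice matrix
which(scaling$choice == max(scaling$choice), arr.ind = TRUE)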
QUESTION 2. How does this most decisive preference, involving the 6th feature,
Development, and the 8th feature, Competition, correspond to the estimated utility
means?

Step 4. Assessing model fit

Now, let’s ask for model residuals:

scaling$residual

## [,1] [,2] [,3] [,4] [,5]


## [1,] 0.000000000 0.0133479521 0.001732662 0.0077151838 -0.005872352
## [2,] -0.013347952 0.0000000000 0.022201895 0.0003878924 -0.005625238
## [3,] -0.001732662 -0.0222018946 0.000000000 0.0073806440 0.004543318
## [4,] -0.007715184 -0.0003878924 -0.007380644 0.0000000000 0.010887738
## [5,] 0.005872352 0.0056252384 -0.004543318 -0.0108877381 0.000000000
## [6,] -0.020341170 -0.0111159497 -0.028230409 -0.0278455199 0.005140288
## [7,] -0.008084655 -0.0186493603 -0.001106985 -0.0074004791 -0.006785811
## [8,] 0.036657121 -0.0096726558 0.004628757 0.0122844463 0.012984302
## [9,] -0.007039854 0.0403013911 0.008671076 0.0106467596 -0.017008932
## [,6] [,7] [,8] [,9]
## [1,] 0.020341170 0.008084655 -0.036657121 0.007039854
## [2,] 0.011115950 0.018649360 0.009672656 -0.040301391
## [3,] 0.028230409 0.001106985 -0.004628757 -0.008671076
## [4,] 0.027845520 0.007400479 -0.012284446 -0.010646760
## [5,] -0.005140288 0.006785811 -0.012984302 0.017008932
## [6,] 0.000000000 0.049380030 0.002756144 0.015651117
## [7,] -0.049380030 0.000000000 0.042051948 0.041174300
## [8,] -0.002756144 -0.042051948 0.000000000 -0.021145626
## [9,] -0.015651117 -0.041174300 0.021145626 0.000000000

This will print a 9x9 matrix containing differences between the observed proportions
(the choice matrix) and the expected proportions (proportions preferring
the feature in the column to the feature in the row, as implied by standard
normal distributions of utilities around the means scaled as above). Residuals
are a direct way of measuring whether a model (in this case, the model of
unobserved utilities that Thurstone proposed) “fits” the observed data. Small
residuals (near zero) indicate that there are only small discrepancies between
observed choices and choices predicted by the model, which means that the
model we adopted is rather good.
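
A quick numerical check of this claim:

max(abs(scaling$residual))  # the largest absolute discrepancy is about 0.05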
Finally, let’s ask for a Goodness of Fit index:

scaling$GF

## [1] 0.9987548

According to documentation for thurstone() function, GF is calculated as:

1 - sum(squared residuals/squared observed)

If the residuals are close to zero, then their squared ratios to the observed
proportions should be also close to zero. Therefore, the goodness of fit index of
a well-fitting model should be close to 1.
The residuals in our analysis are all very small, which indicates a close correspondence
between the observed choices (proportions of preferences for one
feature over another) and the choices predicted by the model. The small residuals
are reflected in the GF index, which is very close to 1. Overall, Thurstone’s
model fits the job-features data well.
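
If you wish, you can verify the GF value from its stored components. One reading of the documented formula, which reproduces the reported value, takes the ratio of the summed squared residuals to the summed squared observed proportions:

1 - sum(scaling$residual^2)/sum(scaling$choice^2)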

4.4 Solutions
Q1. The smallest mean (0.00) corresponds to the 8th feature, Competition,
and the highest mean (1.04) corresponds to the 6th feature, Development.
This means that Competitive environment was least wanted, and opportuni-
ties for personal Development were most wanted by people in their ideal job.
Other features were scaled somewhere between these two: Safety had a low mean
(0.23), barely higher than the 0 for Competition, whereas Support,
Challenge, Career and Ethics had similarly high means (around 0.9).
Autonomy and Interaction had similar moderate means, around 0.6.
Q2. The most decisive preference in terms of proportions of people choosing
one feature over the other must have the largest distance/difference between
the utilities (6th feature, Development, must have a much higher mean utility
than 8th feature, Competition). This result is indeed in line with the results
for the utility means, where Development mean was the highest at 1.04 and
Competition was the lowest at 0.
Part II

CLASSICAL TEST
THEORY AND
RELIABILITY THEORY

Exercise 5

Reliability analysis of
polytomous questionnaire
data

Data file SDQ.RData


R package psych

5.1 Objectives
The purpose of this exercise is to learn how to estimate the test score reliability
by different methods: test-retest, split-half and “internal consistency” (Cron-
bach’s alpha). You will also learn how to judge whether test items contribute
to measurement.

5.2 Study of a community sample using the Strengths and Difficulties Questionnaire (SDQ)
In this exercise, we will again work with the self-rated version of Strengths and
Difficulties Questionnaire (SDQ), a brief behavioural screening questionnaire
about children and adolescents of 3-16 years of age. The data set, SDQ.RData,
which was used in Exercises 1 and 2, contains responses of N=228 pupils from
the same school to the SDQ. The SDQ was administered twice, the first time
when the children had just started secondary school (Year 7), and one year
later (Year 8).


To remind you, the SDQ measures 5 facets, with 5 items each:


Emotional Symptoms somatic worries unhappy clingy afraid
Conduct Problems tantrum obeys* fights lies steals
Hyperactivity restles fidgety distrac reflect* attends*
Peer Problems loner friend* popular* bullied oldbest
Pro-social consid shares caring kind helpout
NOTE that 5 items marked in the above table with asterisks * represent be-
haviours counter-indicative of the scales they intend to measure, so that
higher scale scores correspond to lower item scores.
Every item response is coded according to the following response options: 0 =
“Not true”, 1 = “Somewhat true”, 2 = “Certainly true”.
There are some missing responses. In particular, some pupils have no data
for the second measurement, possibly because they were absent on the day of
testing or moved to a different secondary school.

5.3 Worked Example - Estimating reliability for SDQ Conduct Problems
To complete this exercise, you need to repeat the worked example below, which
comprises reliability analysis of the Conduct Problems facet. Once you feel
confident, you can complete the exercise for the remaining SDQ facets.

Step 1. Creating/Opening project

If you completed Exercise 1, you should have already downloaded the file
SDQ.RData, and created a project associated with it. Please find and open
that project now. In that project, you should have already computed the
scale scores for Conduct Problems (Time 1 and Time 2). You should have also
reverse-coded the counter-indicative items. Both of those steps are essential for
running the reliability analysis. If you have not done so, please go to Exercise
1, and follow the instructions to complete these steps.

Step 2. Test-retest reliability

When test scores on two occasions are available, we can estimate test reliability
by computing the correlation between them (correlate the scale score at Time
1 with the scale score at Time 2) using the base R function cor.test().
Remind yourself of the variables in the SDQ data frame (which should be avail-
able when you open the project from Exercise 1) by calling names(SDQ). Among
other variables, you should have S_conduct and S_conduct2, representing
the test scores for Conduct Problems at Time 1 and Time 2, respectively, which
you computed then. Correlate them to obtain the test-retest reliability.

cor.test(SDQ$S_conduct, SDQ$S_conduct2)

##
## Pearson's product-moment correlation
##
## data: SDQ$S_conduct and SDQ$S_conduct2
## t = 7.895, df = 170, p-value = 3.421e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3992761 0.6195783
## sample estimates:
## cor
## 0.5179645

QUESTION 1. What is the test-retest reliability for the SDQ Conduct Prob-
lem scale? Try to interpret this result.

Step 3. Internal consistency reliability (coefficient alpha)


To compute the coefficient alpha, you will need to again refer to the items
indicating Conduct Problems.
It is important that before submitting them to analyses, the items should be
appropriately coded - that is, any counter-indicative items should be reverse
coded using function reverse.code() from package psych. This is because
alpha is computed as an average of items’ covariances (“raw” alpha) or corre-
lations (“standardized” alpha). If any items correlate negatively with the rest
(which counter-indicative items should), these negative correlations will cancel
out positive correlations and alpha will be spuriously low (or even negative).
This is obviously wrong, because we compute reliability of the test score, and
when we actually computed the test score we reversed negatively keyed items.
We should do the same when computing alpha.
Fortunately, you should have already prepared the correctly coded item scores
in Exercise 1. They should be stored in the object R_conduct. If you have not
done this, please go back to Exercise 1 and apply the function reverse.code()
to the 5 items measuring Conduct Problems to create R_conduct.

library(psych)

items_conduct <- c("tantrum","obeys","fights","lies","steals")


R_conduct <- reverse.code(keys=c(1,-1,1,1,1), SDQ[items_conduct])

Now all that is left to do is to run function alpha() from package psych on
the correctly coded set R_conduct. There are various arguments you can
control in this function, and most defaults are fine, but I suggest you set
cumulative=TRUE (the default is FALSE). This will ensure that statistics in the
output are given for the sum score (“cumulative” of item scores) rather than for
the average score (default). We computed the sum score for Conduct Problems,
so we want the output to match the score we computed.

alpha(R_conduct, cumulative=TRUE)

##
## Reliability analysis
## Call: alpha(x = R_conduct, cumulative = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.72 0.73 0.7 0.35 2.7 0.028 2.1 2.1 0.33
##
## lower alpha upper 95% confidence boundaries
## 0.66 0.72 0.77
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## tantrum 0.62 0.65 0.59 0.31 1.8 0.041 0.0100 0.28
## obeys- 0.65 0.66 0.61 0.33 2.0 0.035 0.0078 0.33
## fights 0.67 0.66 0.61 0.33 2.0 0.034 0.0094 0.30
## lies 0.70 0.71 0.66 0.38 2.5 0.031 0.0086 0.38
## steals 0.71 0.72 0.68 0.39 2.6 0.031 0.0096 0.43
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## tantrum 226 0.79 0.76 0.68 0.59 0.57 0.72
## obeys- 228 0.71 0.72 0.64 0.52 0.58 0.60
## fights 228 0.66 0.73 0.64 0.52 0.19 0.44
## lies 226 0.70 0.64 0.49 0.43 0.54 0.72
## steals 227 0.57 0.62 0.45 0.38 0.19 0.49
##
## Non missing response frequency for each item
## 0 1 2 miss
## tantrum 0.56 0.31 0.13 0.01
## obeys- 0.48 0.46 0.06 0.00
## fights 0.82 0.16 0.02 0.00
## lies 0.59 0.27 0.14 0.01
## steals 0.86 0.10 0.04 0.00

QUESTION 2. What is the alpha (raw_alpha is the most appropriate
statistic to report for reliability of the raw scale score) for the Conduct Problems
scale? Try to interpret the size of alpha bearing in mind the definition of
reliability as “the proportion of variance in the observed score due to true score”.

Now examine the output in more detail. There are other useful statistics printed
in the same line with raw_alpha. Note the average_r, which is the average
correlation between the 5 items of this facet. The std.alpha is computed from
these.

Other useful stats are mean (which is the mean of the sum score) and sd (which
is the Standard Deviation of the sum score). It is very convenient that these
are calculated by function alpha() because you can get these even without
computing the Conduct Problem scale score! If you wish, check them against
the actual sum score stats using describe(SDQ$S_conduct). Note that this is
why I suggested you set cumulative=TRUE, so you can get stats for the sum
score, not the average score.

Now examine the output “Reliability if an item is dropped”. The first column will
give you the expected “raw_alpha” for a 4-item scale without this particular
item in it (if this item was dropped). This is useful for seeing whether the
item makes a good contribution to measurement provided by the scale. If this
expected alpha is lower than the actual reported alpha, the item improves the
test score reliability. If it is higher than the actual alpha, the item actually
reduces the score reliability. You may wonder how this is possible, since items
are supposed to add to reliability. Essentially, such an item contributes more
noise (to the error variance) than signal (to the true score variance).

QUESTION 3. Judging by the “Reliability if an item is dropped” output, do
all of the items contribute positively to the test score reliability? Which item
provides the biggest contribution?

Now examine the “Item statistics” output. Two statistics I want to draw your
attention to are:

raw.r - The correlation of each item with the total score. This value is always
inflated because the item is correlated with the scale in which it is already
included!

r.drop - The correlation of this item with the scale WITHOUT this item (with
the scale compiled from the remaining items). This is a more realistic indicator
than raw.r of how closely each item is associated with the scale.

Both raw.r and r.drop should be POSITIVE. If for any item these values
are negative, you must check whether the item was coded appropriately; for
example, if all counter-indicative items were reverse coded. To help you, the
output marks all reverse coded items with “-” sign.

QUESTION 4. Which item has the highest correlation with the remaining
items (“r.drop” value)? Look up this item’s text in the SDQ data frame.

Step 4. Split-half reliability

Finally, we will request the split-half reliability coefficients for this scale. I said
“coefficients” rather than “coefficient” because there are lots of ways in which
the test can be split into two halves, each giving a slightly different estimate of
reliability.
You will use function splitHalf() from package psych on the appropriately
coded item set R_conduct. I suggest you set use="complete" to make sure
that only the complete cases (without missing data) are used, to avoid different
samples being used in different splittings of the test. We should also set
covar=TRUE, to base all estimates on item raw scores and covariances rather
than correlations (the default in splitHalf()). This is to make our estimates
comparable with the “raw_alpha” we obtained from alpha().

splitHalf(R_conduct, use="complete", covar=TRUE)

## Split half reliabilities


## Call: splitHalf(r = R_conduct, covar = TRUE, use = "complete")
##
## Maximum split half reliability (lambda 4) = 0.76
## Guttman lambda 6 = 0.7
## Average split half reliability = 0.72
## Guttman lambda 3 (alpha) = 0.72
## Guttman lambda 2 = 0.73
## Minimum split half reliability (beta) = 0.65
## Average interitem covariance = 0.12 with median = 0.12

The function prints estimates of reliability based on different splittings of the
test. For short tests like the Conduct Problems scale, the function actually per-
forms all possible splits, and calculates and prints their maximum (“Maximum
split half reliability (lambda 4)”), minimum (“Minimum split half reliability
(beta)”), and average (“Average split half reliability”). You can see them in the
output, and see that the estimates actually vary quite a bit, from 0.65 to 0.76
depending on the way in which the test was split.
Lee Cronbach showed in his famous 1951 paper that, theoretically, the coefficient
alpha is “the mean of all split-half coefficients resulting from different splittings
of a test”. But of course, it is much easier to compute alpha by using the
formula rather than running all possible splittings and estimating the split half
coefficients for them (and correcting these coefficients for unequal test lengths
because with 5 items, there will always be a 3-item half and a 2-item half).
So, in the output you will see both the “Average split half reliability” = 0.72
and “alpha” = 0.72 (also known as “Guttman lambda 3”). In this case, they
are exactly the same, which is lucky considering the amount of calculations and
corrections involved with computing the average of all split-half coefficients.
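
For reference, the classic Spearman-Brown formula for two equal halves projects the half-test correlation onto the full test length; the correction for unequal halves is more involved. A sketch with a hypothetical half-test correlation:

r_half <- 0.56           # hypothetical correlation between two half-test scores
2*r_half/(1 + r_half)    # projected full-length reliability, about 0.72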

You can also refer to the “raw_alpha” result we obtained with function
alpha(), and you will see that it is the same as we obtained with the function
splitHalf().

This completes the Worked Example.

QUESTION 5. Repeat the steps in the Worked Example for the Hyperactivity
facet. Compute the test-retest reliability, alpha and split-half reliabilities (for
Time 1 only).

Step 5. Saving your work

It is important that you save all new objects created, because you will need
some of these again in Exercise 7. When closing the project, make sure you save
the entire work space, and your script.

save(SDQ, file="SDQ_saved.RData")

5.4 Further practice - Reliabilities of the remaining SDQ facets

If you want to practice further, you can pick any of the remaining SDQ facets.

Use the table below to enter your results (2 decimal points is fine).

                     Test-retest   Alpha   Split-half (ave)
Emotional Symptoms
Conduct Problems        0.52        0.72        0.72
Hyperactivity
Peer Problems
Pro-social

NOTE. Don’t be surprised if the average split-half coefficient does not always
equal alpha.

Based on the analyses of scales that you have completed, try to answer the
following questions:

QUESTION 6. Which method for estimating reliability gives the lowest/highest
estimate? Why?

QUESTION 7. Which method do you think is best for estimating the precision
of measurement in this study?

5.5 Solutions

Q1. The correlation 0.518 is positive, “large” in terms of effect size, but not
very impressive as an estimate of reliability because it suggests that only about
52% of variance in the test score is due to true score, and the rest is due to
error. It suggests that Conduct Problems is either not a very stable construct,
or it is not measured accurately by the SDQ. It is impossible to say which is
the case from this one correlation.
Q2. The estimated (raw) alpha is 0.72, which suggests that approximately
72% of variance in the Conduct problem score is due to true score. This is an
acceptable level of reliability for a short screening questionnaire.
Q3. The “Reliability if an item is dropped” output shows that every item contributes
positively to measurement, because the reliability would go down from the current
0.72 if we dropped any of them. Dropping the item tantrum would reduce alpha
the most (to 0.62), so this item makes the biggest contribution to the reliability
of this scale.
Q4. Item tantrum has the highest correlation with the scale (“r.drop”=0.59).
This item reads: “I get very angry and often lose my temper”. It is typical
that items with highest item-total correlation are also those contributing most
to alpha.
Q5.

                     Test-retest   Alpha   Split-half (ave)
Emotional Symptoms      0.49        0.74        0.70
Conduct Problems        0.52        0.72        0.72
Hyperactivity           0.65        0.76        0.75
Peer Problems           0.51        0.53        0.51
Pro-social              0.53        0.65        0.64
Q6. The test-retest method provides the lowest estimates, which is not surpris-
ing considering that the interval between the two testing sessions was one year.
Particularly low is the correlation between Emotional Problems at Time 1 and
Time 2, while its internal consistency is higher, which indicates that Emotional
Problems are more transient at this age than, for example, Hyperactivity.
Q7. The substantial differences between the test-retest and alpha estimates
for all but one scale (Peer Problems) suggest that the test-retest method likely
under-estimates the reliability due to instability of the constructs measured after
such a long interval (one year). So, alpha and split-half coefficients are more
appropriate as estimates of reliability here. Alpha is to be preferred to split-half
coefficients since the latter vary widely depending on the way the test is split.
For Peer Problems, both test-retest and alpha give similar results, so the low
test-retest cannot be interpreted as necessarily low stability - it may be that
the construct is relatively stable but not measured very accurately at each time
point.
Exercise 6

Item analysis and reliability analysis of dichotomous questionnaire data

Data file EPQ_N_demo.txt


R package psych

6.1 Objectives
The objective of this exercise is to conduct reliability analyses on a questionnaire
compiled from dichotomous items, and learn how to compute the Standard Error
of measurement and confidence intervals around the observed score.

6.2 Personality study using Eysenck Personality Questionnaire (EPQ)
The Eysenck Personality Questionnaire (EPQ) is a personality inventory de-
signed to measure Extraversion (E), Neuroticism/Anxiety (N), Psychoticism
(P), and including a Social Desirability scale (L) (Eysenck and Eysenck, 1976).
The EPQ includes 90 items, to which respondents answer either “YES” or “NO”
(thus giving dichotomous or binary responses). This is relatively unusual for
personality questionnaires, which typically employ Likert scales to boost the
amount of information from each individual item, and therefore get away with
fewer items overall. On the contrary, the EPQ scales are measured by many items,
ensuring a good content coverage of all domains involved. The use of binary
items avoids problems with response biases such as “central tendency” (ten-
dency to choose middle response options) and “extreme responding” (tendency
to choose extreme response options).
The data were collected in the USA back in 1997 as part of a large cross-cultural
study (Barrett, Petrides, Eysenck & Eysenck, 1998). This was a large study, N
= 1,381 (63.2% women and 36.8% men). Most participants were young people,
median age 20.5; however, there were adults of all ages present (range 16 – 89
years).

6.3 Worked Example - Reliability analysis of EPQ Neuroticism/Anxiety (N) scale

The focus of our analysis here will be the Neuroticism/Anxiety (N) scale, mea-
sured by 23 items:

N_3 Does your mood often go up and down?


N_7 Do you ever feel "just miserable" for no reason?
N_12 Do you often worry about things you should not have done or said?
N_15 Are you an irritable person?
N_19 Are your feelings easily hurt?
N_23 Do you often feel "fed-up"?
N_27 Are you often troubled about feelings of guilt?
N_31 Would you call yourself a nervous person?
N_34 Are you a worrier?
N_38 Do you worry about awful things that might happen?
N_41 Would you call yourself tense or "highly-strung"?
N_47 Do you worry about your health?
N_54 Do you suffer from sleeplessness?
N_58 Have you often felt listless and tired for no reason?
N_62 Do you often feel life is very dull?
N_66 Do you worry a lot about your looks?
N_68 Have you ever wished that you were dead?
N_72 Do you worry too long after an embarrassing experience?
N_75 Do you suffer from "nerves"?
N_77 Do you often feel lonely?
N_80 Are you easily hurt when people find fault with you or the work you do?
N_84 Are you sometimes bubbling over with energy and sometimes very sluggish?
N_88 Are you touchy about some things?

Please note that all items indicate “Neuroticism” rather than “Emotional Sta-
bility” (i.e. there are no counter-indicative items).

Step 1. Preliminary examination of the data

Because you will work with these data again, I recommend you create a new
project, which you will be able to revisit later. Create a new folder with a name
that you can recognize later (e.g. “EPQ Neuroticism”), and download the data
file EPQ_N_demo.txt into this folder. In RStudio, create a new project in
the folder you have just created. Create a new script, where you will be typing
all your commands, and which you will save at the end.
The data for this exercise is not in the “native” R format, but in a tab-delimited
text file with headings (you can open it in Notepad to see how it looks). You
will use function read.delim() to import it into R. The function has the follow-
ing general format: read.delim(file, header = TRUE, sep = "\t", ...).
You must supply the first argument - the file name. Other arguments all have
defaults, so you change them only if necessary. Our file has headers and is
separated by tabulation ("\t") so the defaults are just fine.
Let’s import the data into into a new object (data frame) called EPQ:

EPQ <- read.delim(file="EPQ_N_demo.txt")

The object EPQ should appear on the Environment tab. Click on it and the
data will be displayed on a separate tab. As you can see, there are 26 variables
in this data frame, beginning with participant id, age and sex (0 = female; 1
= male). These demographic variables are followed by 23 item responses, which
are either 0 (for “NO”) or 1 (for “YES”). There are also a few missing responses,
marked with NA.

Step 2. Computing item statistics

Analyses in this Exercise will require package psych, so type and run this
command from your script:

library(psych)

Let’s start the analyses. First, let’s call function describe() of package psych,
which will print descriptive statistics useful for psychometrics for the item
responses in the EPQ data frame. Note that the 23 item responses start in the
4th column of the data frame.

describe(EPQ[4:26])

## vars n mean sd median trimmed mad min max range skew kurtosis se
## N_3 1 1379 0.64 0.48 1 0.67 0 0 1 1 -0.57 -1.67 0.01
## N_7 2 1379 0.54 0.50 1 0.56 0 0 1 1 -0.18 -1.97 0.01


## N_12 3 1381 0.80 0.40 1 0.87 0 0 1 1 -1.47 0.17 0.01
## N_15 4 1375 0.29 0.45 0 0.24 0 0 1 1 0.93 -1.13 0.01
## N_19 5 1377 0.58 0.49 1 0.60 0 0 1 1 -0.31 -1.90 0.01
## N_23 6 1377 0.51 0.50 1 0.52 0 0 1 1 -0.06 -2.00 0.01
## N_27 7 1377 0.47 0.50 0 0.47 0 0 1 1 0.11 -1.99 0.01
## N_31 8 1378 0.33 0.47 0 0.29 0 0 1 1 0.70 -1.51 0.01
## N_34 9 1378 0.60 0.49 1 0.62 0 0 1 1 -0.41 -1.84 0.01
## N_38 10 1378 0.53 0.50 1 0.54 0 0 1 1 -0.13 -1.99 0.01
## N_41 11 1370 0.31 0.46 0 0.26 0 0 1 1 0.83 -1.30 0.01
## N_47 12 1380 0.58 0.49 1 0.60 0 0 1 1 -0.33 -1.89 0.01
## N_54 13 1378 0.31 0.46 0 0.27 0 0 1 1 0.80 -1.36 0.01
## N_58 14 1381 0.66 0.47 1 0.70 0 0 1 1 -0.67 -1.56 0.01
## N_62 15 1380 0.27 0.45 0 0.22 0 0 1 1 1.02 -0.97 0.01
## N_66 16 1378 0.58 0.49 1 0.60 0 0 1 1 -0.34 -1.89 0.01
## N_68 17 1379 0.44 0.50 0 0.42 0 0 1 1 0.26 -1.93 0.01
## N_72 18 1379 0.61 0.49 1 0.64 0 0 1 1 -0.45 -1.80 0.01
## N_75 19 1375 0.36 0.48 0 0.32 0 0 1 1 0.59 -1.66 0.01
## N_77 20 1376 0.41 0.49 0 0.39 0 0 1 1 0.35 -1.88 0.01
## N_80 21 1375 0.61 0.49 1 0.64 0 0 1 1 -0.45 -1.80 0.01
## N_84 22 1376 0.81 0.39 1 0.89 0 0 1 1 -1.58 0.48 0.01
## N_88 23 1376 0.89 0.31 1 0.99 0 0 1 1 -2.51 4.29 0.01

QUESTION 1. Examine the output. Look for the descriptive statistic repre-
senting item difficulty. Do the difficulties vary widely? What is the easiest (to
agree with) item in this set? What is the most difficult (to agree with) item?
Examine phrasing of the corresponding items – can you see why one item is
easier to agree with than the other?
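
A handy way to see the full range of difficulties at a glance is to sort the item means:

# Proportion endorsing each item ("YES" responses), from hardest to easiest
sort(colMeans(EPQ[4:26], na.rm = TRUE))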
Now let’s compute the product-moment (Pearson) correlations between EPQ
items using function lowerCor() of package psych. This function is very con-
venient because, unlike the base R function cor(), it shows only the lower
triangle of the correlation matrix, which is more compact and easier to read.
If you call help on this function you will see that by default the correlations
will be printed to 2 decimal places (digits=2), and the missing values will be
treated in the pairwise fashion (use="pairwise"). This is good, and we will
not change any defaults.

lowerCor(EPQ[4:26])

Now let’s compute the tetrachoric correlations for the same items. These would
be more appropriate for binary items on which a NO/YES dichotomy was forced
(although the underlying extent of agreement is actually continuous). The func-
tion tetrachoric() has the pairwise deletion for missing values by default -
again, this is good.

tetrachoric(EPQ[4:26])

QUESTION 2. Examine the outputs for product-moment and tetrachoric
correlations. Compare them to each other. What do you notice? How do you
interpret the findings?
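
If you want to compare the two coefficients side by side for a single pair of items, here is a small sketch (the item pair is chosen arbitrarily):

cor(EPQ$N_3, EPQ$N_7, use = "pairwise")       # product-moment correlation
tetrachoric(EPQ[c("N_3","N_7")])$rho[1, 2]    # tetrachoric correlation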

Step 3 (Optional). Computing the total scale score and statistics

We can compute the Neuroticism scale score (as a sum of its item scores). Note
that there are no counter-indicative items in the Neuroticism scale, so there is
no need to reverse code any.
From Exercise 1 you should already know that in the presence of missing data,
adding the item scores is not advisable because any missing response would
essentially be treated as zero, which is not right: not providing an answer to a
question is not the same as saying “NO”.
Instead, we will again use the base R function rowMeans() to compute the
average score from non-missing item responses (removing “NA” values from
calculation, na.rm=TRUE), and then multiply the result by 23 (the number of
items in the Neuroticism scale). This will essentially replace any missing re-
sponses with the mean for that individual, thus producing a fair estimate of the
total score.

N_score <- rowMeans(EPQ[4:26], na.rm=TRUE)*23

Check out the new object N_score in the Environment tab. You will see that
scores for those with missing data (for example, participant #1, who had re-
sponse to N_88 missing) may have decimal points and scores for those without
missing data (for example, participant #2) are integers.
Now we can compute descriptive statistics for the Neuroticism score.

describe(N_score)

## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1381 12.15 5.53 12 12.17 5.93 0 23 23 -0.03 -0.89 0.15

And we plot the histogram, which will appear in the “Plots” tab:

hist(N_score)

[Histogram of N_score, showing frequency against the total Neuroticism score (ranging 0 to 23).]

QUESTION 3. Examine the histogram. What can you say about the test
difficulty for the tested population?

Step 4. Estimating the coefficient alpha based on product-moment correlations

Now let’s estimate the “internal consistency reliability” (Cronbach’s alpha) of
the total score. We apply function alpha() from package psych to all 23 items
of the EPQ data frame. Like in Exercise 5, I suggest obtaining output for the
sum score (which we computed) rather than the average score (the default in alpha()).

alpha(EPQ[4:26], cumulative=TRUE)

##
## Reliability analysis
## Call: alpha(x = EPQ[4:26], cumulative = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.87 0.87 0.88 0.22 6.7 0.0049 12 5.5 0.22
##
## lower alpha upper 95% confidence boundaries
## 0.86 0.87 0.88
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r


## N_3 0.86 0.86 0.87 0.22 6.2 0.0052 0.0075 0.21
## N_7 0.86 0.86 0.87 0.22 6.3 0.0052 0.0075 0.21
## N_12 0.87 0.86 0.87 0.22 6.4 0.0051 0.0075 0.22
## N_15 0.87 0.87 0.87 0.23 6.5 0.0050 0.0074 0.22
## N_19 0.86 0.86 0.87 0.22 6.2 0.0053 0.0067 0.21
## N_23 0.86 0.86 0.87 0.22 6.2 0.0052 0.0075 0.21
## N_27 0.86 0.86 0.87 0.22 6.3 0.0052 0.0075 0.21
## N_31 0.86 0.86 0.87 0.22 6.3 0.0052 0.0068 0.22
## N_34 0.86 0.86 0.87 0.22 6.1 0.0053 0.0067 0.21
## N_38 0.86 0.86 0.87 0.22 6.3 0.0052 0.0075 0.21
## N_41 0.87 0.86 0.87 0.22 6.3 0.0051 0.0072 0.22
## N_47 0.87 0.87 0.88 0.23 6.6 0.0049 0.0071 0.23
## N_54 0.87 0.87 0.88 0.23 6.6 0.0050 0.0073 0.23
## N_58 0.87 0.86 0.87 0.22 6.4 0.0051 0.0076 0.22
## N_62 0.87 0.87 0.87 0.23 6.4 0.0051 0.0074 0.22
## N_66 0.87 0.87 0.87 0.23 6.5 0.0050 0.0075 0.22
## N_68 0.87 0.87 0.87 0.23 6.5 0.0050 0.0075 0.23
## N_72 0.86 0.86 0.87 0.22 6.3 0.0052 0.0070 0.21
## N_75 0.86 0.86 0.87 0.22 6.2 0.0052 0.0070 0.21
## N_77 0.86 0.86 0.87 0.22 6.2 0.0052 0.0074 0.21
## N_80 0.87 0.86 0.87 0.22 6.3 0.0052 0.0069 0.22
## N_84 0.87 0.87 0.88 0.23 6.6 0.0050 0.0071 0.23
## N_88 0.87 0.87 0.88 0.23 6.7 0.0050 0.0068 0.23
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## N_3 1379 0.58 0.58 0.56 0.52 0.64 0.48
## N_7 1379 0.56 0.56 0.54 0.50 0.54 0.50
## N_12 1381 0.49 0.50 0.46 0.43 0.80 0.40
## N_15 1375 0.44 0.44 0.40 0.37 0.29 0.45
## N_19 1377 0.61 0.61 0.60 0.56 0.58 0.49
## N_23 1377 0.58 0.58 0.55 0.51 0.51 0.50
## N_27 1377 0.57 0.57 0.54 0.51 0.47 0.50
## N_31 1378 0.55 0.55 0.53 0.49 0.33 0.47
## N_34 1378 0.64 0.64 0.63 0.59 0.60 0.49
## N_38 1378 0.56 0.55 0.52 0.49 0.53 0.50
## N_41 1370 0.52 0.52 0.49 0.46 0.31 0.46
## N_47 1380 0.37 0.37 0.32 0.29 0.58 0.49
## N_54 1378 0.39 0.39 0.34 0.31 0.31 0.46
## N_58 1381 0.51 0.51 0.48 0.44 0.66 0.47
## N_62 1380 0.46 0.46 0.43 0.40 0.27 0.45
## N_66 1378 0.45 0.44 0.40 0.37 0.58 0.49
## N_68 1379 0.44 0.44 0.39 0.37 0.44 0.50
## N_72 1379 0.57 0.57 0.54 0.51 0.61 0.49
## N_75 1375 0.59 0.58 0.57 0.52 0.36 0.48
## N_77 1376 0.58 0.58 0.55 0.52 0.41 0.49


## N_80 1375 0.54 0.54 0.52 0.47 0.61 0.49
## N_84 1376 0.35 0.37 0.31 0.29 0.81 0.39
## N_88 1376 0.30 0.33 0.27 0.25 0.89 0.31
##
## Non missing response frequency for each item
## 0 1 miss
## N_3 0.36 0.64 0.00
## N_7 0.46 0.54 0.00
## N_12 0.20 0.80 0.00
## N_15 0.71 0.29 0.00
## N_19 0.42 0.58 0.00
## N_23 0.49 0.51 0.00
## N_27 0.53 0.47 0.00
## N_31 0.67 0.33 0.00
## N_34 0.40 0.60 0.00
## N_38 0.47 0.53 0.00
## N_41 0.69 0.31 0.01
## N_47 0.42 0.58 0.00
## N_54 0.69 0.31 0.00
## N_58 0.34 0.66 0.00
## N_62 0.73 0.27 0.00
## N_66 0.42 0.58 0.00
## N_68 0.56 0.44 0.00
## N_72 0.39 0.61 0.00
## N_75 0.64 0.36 0.00
## N_77 0.59 0.41 0.00
## N_80 0.39 0.61 0.00
## N_84 0.19 0.81 0.00
## N_88 0.11 0.89 0.00

QUESTION 4. What is alpha for the Neuroticism scale score? Report the
raw_alpha printed at the beginning of the output (this is the version of alpha
calculated from raw item scores and covariances rather than standardized item
scores and correlations).
Examine the alpha() output further. You can call ?alpha to get help on this
function and its output. An important summary statistic is average_r, which
stands for the average inter-item correlation. Other useful summary statistics
are mean and sd, which are respectively the mean and the standard deviation
of the test score. Yes, you can get these statistics without computing the test
score but by just running function alpha(). Isn’t this convenient?
If you completed Step 3, and computed N_score previously and obtained its
mean and SD using function describe(N_score), you can now compare these
results. The mean and SD given in alpha() should be the same (maybe rounded
to fewer decimal points) as the results from describe().

In “Item statistics” output, an interesting statistic is r.drop – that is the
correlation between the item score and the total score with this particular item
dropped. This tells you how closely each item is associated with the remaining
items. The higher this value is, the more representative the item is of the whole
item collection.
QUESTION 5. Which item has the highest correlation with the remaining
items (r.drop value)? Look up this item’s content in the EPQ questionnaire.
In “Reliability if an item is dropped” output, an important statistic is
raw_alpha. It tells you how the scale’s alpha would change if you dropped a
particular item. If an item contributes information to measurement, its “alpha
if deleted” should be lower than the actual alpha (this might not be obvious as
the values might be different in the third decimal point). In this questionnaire,
dropping any item will not reduce the alpha by more than 0.01. This is not
surprising since the scale is long (consists of 23 items), and dropping one item
is not going to reduce the reliability dramatically.

Step 5. Estimating the coefficient alpha based on tetrachoric correlations

Now, remember that in the standard alpha computation, the product-moment
covariances are used (the basis for Pearson’s correlation coefficient). These
assume continuous data and typically underestimate the degree of relationship
between binary “YES/NO” items, so a better option would be to compute alpha
from tetrachoric correlations. We have already used function tetrachoric()
to compute the tetrachoric correlations for the EPQ items. Please examine
the output of that function again carefully. You should see that this function
returns the matrix of correlations (which we will need) as well as thresholds or
“tau” (which we won’t). Let’s run the function again, this time assigning its
results to a new object EPQtetr:

EPQtetr <- tetrachoric(EPQ[4:26])

Press on the little blue arrow next to EPQtetr in the Environment tab. The
object’s structure will be revealed and you should see that EPQtetr contains
rho, which is a 23x23 matrix of the tetrachoric correlations, tau, which is
the vector of 23 thresholds, etc. To retrieve the tetrachoric correlations from
the object EPQtetr, we can simply refer to them like so EPQtetr$rho.
Now you can pass the tetrachoric correlations to function alpha(). Note that
because you pass to the function the correlations rather than the raw scores, no
statistics for the “test score” can be computed and you cannot specify the type
of score using cumulative option.

alpha(EPQtetr$rho)

##
## Reliability analysis
## Call: alpha(x = EPQtetr$rho)
##
## raw_alpha std.alpha G6(smc) average_r S/N median_r
## 0.93 0.93 0.95 0.37 13 0.36
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N var.r med.r
## N_3 0.93 0.93 0.94 0.36 12 0.015 0.35
## N_7 0.93 0.93 0.94 0.36 13 0.015 0.35
## N_12 0.93 0.93 0.94 0.37 13 0.014 0.36
## N_15 0.93 0.93 0.94 0.37 13 0.014 0.37
## N_19 0.93 0.93 0.94 0.36 12 0.013 0.35
## N_23 0.93 0.93 0.94 0.36 13 0.015 0.35
## N_27 0.93 0.93 0.94 0.36 13 0.015 0.35
## N_31 0.93 0.93 0.94 0.37 13 0.013 0.36
## N_34 0.92 0.92 0.94 0.36 12 0.013 0.35
## N_38 0.93 0.93 0.94 0.37 13 0.015 0.35
## N_41 0.93 0.93 0.94 0.37 13 0.014 0.36
## N_47 0.93 0.93 0.95 0.38 13 0.013 0.38
## N_54 0.93 0.93 0.95 0.38 13 0.014 0.38
## N_58 0.93 0.93 0.94 0.37 13 0.015 0.37
## N_62 0.93 0.93 0.94 0.37 13 0.014 0.37
## N_66 0.93 0.93 0.95 0.37 13 0.015 0.37
## N_68 0.93 0.93 0.95 0.37 13 0.014 0.37
## N_72 0.93 0.93 0.94 0.36 13 0.014 0.35
## N_75 0.93 0.93 0.94 0.36 13 0.014 0.35
## N_77 0.93 0.93 0.94 0.36 13 0.014 0.35
## N_80 0.93 0.93 0.94 0.37 13 0.014 0.36
## N_84 0.93 0.93 0.95 0.38 13 0.014 0.37
## N_88 0.93 0.93 0.95 0.38 13 0.014 0.38
##
## Item statistics
## r r.cor r.drop
## N_3 0.72 0.71 0.69
## N_7 0.68 0.67 0.64
## N_12 0.66 0.65 0.62
## N_15 0.56 0.54 0.51
## N_19 0.74 0.74 0.70
## N_23 0.70 0.69 0.66
## N_27 0.69 0.67 0.65
## N_31 0.67 0.67 0.63
## N_34 0.78 0.78 0.75


## N_38 0.67 0.65 0.63
## N_41 0.65 0.64 0.61
## N_47 0.44 0.40 0.38
## N_54 0.48 0.44 0.42
## N_58 0.63 0.62 0.59
## N_62 0.59 0.57 0.54
## N_66 0.54 0.50 0.48
## N_68 0.53 0.50 0.48
## N_72 0.68 0.67 0.65
## N_75 0.72 0.71 0.68
## N_77 0.70 0.69 0.66
## N_80 0.65 0.64 0.60
## N_84 0.49 0.46 0.43
## N_88 0.48 0.45 0.43

QUESTION 6. What is the tetrachoric-based alpha and how does it compare
to alpha computed from product-moment covariances?

Step 6. Computing the Standard Error of measurement for the total score
Finally, let’s compute the Standard Error of measurement from the tetrachorics-
based alpha using the formula

SEm(y) = SD(y)*sqrt(1-alpha)

You will need the standard deviation of the total score, which you already
computed (scroll up your output to the descriptives for N_score). Now you
can substitute the SD value and also the alpha value into the formula, and do
the calculation in R by typing them in your script and then running them in
the Console.

5.53*sqrt(1-0.93)

## [1] 1.4631

Even better would be to save the result in a new object SEm, so we can use it
later for computing confidence intervals.

SEm <- 5.53*sqrt(1-0.93)
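If you prefer to avoid retyping rounded values, you could instead save the output
of alpha() and pull the estimate from it directly (a sketch; psych stores the
estimate in the total element of its results):

a_tet <- alpha(EPQtetr$rho)                   # save the reliability output
SEm <- 5.53*sqrt(1 - a_tet$total$raw_alpha)   # use the stored alpha estimate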

QUESTION 7. As a challenge, try to compute the SEm not by substituting
the value for standard deviation from the previous output, but by computing
the standard deviation of N_score from scratch using base R function sd().

Step 7. Computing the Confidence Interval around a score

Finally, let’s compute the 95% confidence interval around a given score Y using
formulas Y-2*SEm and Y+2*SEm. Say, we want to compute the 95% confidence
interval around N_score for Participant #1. It is easy to pull out the respective
score from vector N_score (which holds scores for 1381 participants) by simply
referring to the participant number in square brackets: N_score[1].
QUESTION 8. What is the 95% confidence interval around the Neuroticism
scale score for participant #1?

Step 8. Saving your work

After you have finished this exercise, save your R script and the entire ‘workspace’,
which includes the data frame and also all the new objects you created. This will
be useful when you come back to this exercise again, and these objects will also
be needed in Exercises 12 and 14.
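If you prefer to save the workspace from your script rather than waiting for
RStudio’s prompt, base R can do it directly (the file name below is just a
suggestion):

save.image("EPQ_exercise6.RData")   # saves all objects in the Environment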

6.4 Solutions

Q1. The mean represents item difficulty. In this item set, difficulties vary
widely. The ‘easiest’ item to endorse is N_88 (“Are you touchy about some
things?”), with the highest mean (0.89). Because the data are binary coded
0/1, the mean can be interpreted as 89% of the sample endorsed the item. The
most difficult item to endorse in this set is N_62, with the lowest mean value
(0.27). N_62 is phrased “Do you often feel life is very dull?”, and only 27% of
the sample agreed with this. This item indicates a higher degree of Neuroticism
(perhaps even a symptom of depression) than that of item N_88, and it is
therefore more difficult to endorse.
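As a quick check on this interpretation, you can compute an item mean directly;
na.rm=TRUE drops the few missing responses:

mean(EPQ$N_88, na.rm=TRUE)   # proportion of the sample endorsing N_88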
Q2. The tetrachoric correlations are substantially larger than the product-
moment correlations. This is not surprising given that the data are binary
and the product-moment correlations were developed for continuous data. The
product-moment correlations underestimate the strength of the relationships
between these binary items.
Q3. The distribution appears symmetrical, without any visible skew. There
are no ceiling or floor effects, so the test’s difficulty appears appropriate for the
population.
Q4. The (raw) alpha computed from product-moment covariances is 0.87.
Q5. Item N_34 (“Are you a worrier?”) has the highest item-total correlation
(when the item itself is dropped). This item is central to the meaning of the
scale, or, in other words, the item that best represents this set of items. This
makes sense given that the scale is supposed to measure Neuroticism. Worrying
about things is one of the core indicators of Neuroticism.
Q6. Alpha from tetrachoric correlations is 0.93. It is larger than the product-
moment based alpha. This is not surprising since the tetrachoric correlations
were greater than the product-moment correlations.
Q7.

SEm <- sd(N_score)*sqrt(1-0.93)
SEm

## [1] 1.463404

Q8. The 95% confidence interval is (9.62, 15.47).

N_score[1]-2*SEm # lower

## [1] 9.618647

N_score[1]+2*SEm # upper

## [1] 15.47226
Part III

TEST HOMOGENEITY
AND SINGLE-FACTOR
MODEL

Exercise 7

Fitting a single-factor model to polytomous questionnaire data

Data file SDQ.RData
R package psych

7.1 Objectives
The objective of this exercise is to fit a single-factor model to item-level ques-
tionnaire data, thereby testing homogeneity of an item set.

7.2 Worked Example - Testing homogeneity of SDQ Conduct Problems scale
To complete this exercise, you need to repeat the analysis from a worked example
below. The worked example includes factor analysis of the SDQ scale Conduct
Problems. Once you feel confident, you can repeat the same steps for other SDQ
scales.
This exercise continues to make use of the data we considered in Exercise 1 and
then again in Exercise 5. These data come from a community sample of pupils
from the same school (N=228). They completed the Strengths and Difficulties
Questionnaire (SDQ) twice - the first time at the start of secondary school (Year
7), and then one year later (Year 8).


Step 1. Creating or Opening project

If you have already worked with this data set in Exercise 1 or Exercise 5, the
simplest thing to do is to continue working within the project created back
then. In RStudio, select File / Open Project and navigate to the folder and the
project you created. You should see the data frame SDQ appearing in your
Environment tab, together with other objects you created and saved.
If you have not completed the previous Exercises or have not saved your work,
or simply want to start from scratch, download the data file SDQ.RData into
a new folder and follow instructions from Exercise 1 on how to set up a new
project and to load the data.

load("SDQ.RData")

Whether you are working in a new project or the old project, create a new R
script (File / New File / R Script) for this exercise to keep a separate record of
commands needed to conduct test homogeneity analyses.

Step 2. Examining data

Now either press on the SDQ object in the Environment tab or type View(SDQ)
to remind yourself of the data set. It contains 228 rows (cases, observations)
on 51 variables. Run names(SDQ) to get the variable names. There is a Gender
variable (0=male; 1=female), followed by responses to 25 SDQ items at Time 1
named consid, restles, somatic etc. These are followed by 25 more variables
named consid2, restles2, somatic2 etc. These are responses to the same SDQ
items at Time 2.
Run head(SDQ) from your script to get a quick preview of first 6 rows of data.
You should see that there are some missing responses, marked ‘NA’. There are
more missing responses for Time 2, with whole rows missing for some pupils.

Step 3. Creating variable list

If you have not done so in previous exercises, begin by creating a list of items
that measure Conduct Problems (you can see them in a table given in Exercise
1). This will enable easy reference to data from these 5 variables in all analyses.
We will use c() - the base R function for combining values into a list.

items_conduct <- c("tantrum","obeys","fights","lies","steals")

Note how a new object items_conduct appeared in the Environment tab. Try
calling SDQ[items_conduct] to pull only the data from these 5 items.
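For example, wrapping the call in head() keeps the preview short:

head(SDQ[items_conduct])   # first 6 rows of the 5 Conduct Problems items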

Step 4. Examining suitability of data for factor analysis

Now, load the package psych to enable access to its functionality:

library(psych)

Before starting factor analysis, check correlations between responses to the 5
Conduct Problems items. Package psych has function corr.test(), which
prints the full correlation matrix, pairwise sample sizes and corresponding p-
values for the significance of correlations. If you want just the correlation matrix
in a compact format, use function lowerCor() from this package instead.

lowerCor(SDQ[items_conduct])

## tntrm obeys fghts lies stels
## tantrum 1.00
## obeys -0.46 1.00
## fights 0.42 -0.50 1.00
## lies 0.44 -0.30 0.25 1.00
## steals 0.30 -0.26 0.35 0.23 1.00

QUESTION 1. What can you say about the size and direction of inter-item
correlations? Do you think these data are suitable for factor analysis?
To obtain the measure of sampling adequacy (MSA) - an index summarizing
the correlations on their overall potential to measure something in common -
request the Kaiser-Meyer-Olkin (KMO) index:

KMO(SDQ[items_conduct])

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = SDQ[items_conduct])
## Overall MSA = 0.76
## MSA for each item =
## tantrum obeys fights lies steals
## 0.75 0.75 0.74 0.76 0.82

Kaiser (1975) proposed the following guidelines for interpreting the MSA and
deciding on utility of factor analysis:

MSA range     Interpretation
0.9 to 1      Marvelous data
0.8 to 0.9    Meritorious data
0.7 to 0.8    Middling
0.6 to 0.7    Mediocre
0.5 to 0.6    Miserable
0.0 to 0.5    Totally useless

QUESTION 2. Interpret the resulting measure of sampling adequacy (MSA).

Step 5. Determining the number of factors

Package psych has a function producing Scree plots for the observed data and
random (i.e. uncorrelated) data matrix of the same size. Comparison of the
two scree plots is called Parallel Analysis. We retain factors from the blue
scree plot (real data) that are ABOVE the red plot (simulated random data),
in which we expect no common variance, only random variance. In Scree plots,
important factors that account for common variance resemble points on hard
rock of a mountain, and trivial factors only accounting for random variance
are compared with rubble at the bottom of a mountain (“scree” in geology).
While examining a Scree Plot of the empirical data is helpful in preliminary
decisions on which factors belong to the hard rock and which belong to the
rubble, Parallel Analysis provides a statistical comparison with the baseline for
this size data.

Function fa.parallel(x,…, fa="both",…) has only one required argument –
the actual data (x). There are many other arguments, but they all have default
values and therefore are not required if we are happy with the defaults. We will
change only one default - the type of eigenvalues shown. We will keep it simple
and display only eigenvalues for principal components (fa="pc"), as done in
some commercial software such as Mplus.

fa.parallel(SDQ[items_conduct], fa="pc")
[Figure: Parallel Analysis Scree Plots - eigenvalues of principal components (PC Actual Data, PC Simulated Data, PC Resampled Data) plotted against component numbers 1 to 5]
## Parallel analysis suggests that the number of factors = NA and the number of components = 1

QUESTION 3. Examine the Scree plot. Does it support the hypothesis
that there is only one factor underlying responses to Conduct Problems items?
Why? Does your conclusion agree with the text output of the parallel analysis
function?

Step 6. Fitting and interpreting a single-factor model

Now let’s run the factor analysis, extracting one factor. I recommend you call
documentation on function fa() from package psych by running command ?fa.
From the general form of this function, fa(r, nfactors=1,…), it is clear that
we need to supply the data (argument r, which can be a correlation or covariance
matrix or a raw data matrix, like we have). Other arguments all have defaults.
The default estimation method is fm="minres". This method will minimise the
residuals (differences between the observed and the reproduced correlations),
and it is probably a good choice for these data (the sample size is not that
large, and responses in only 3 categories cannot be normally distributed). The
default number of factors to extract (nfactors=1) is exactly what we want,
however, I recommend you write this explicitly for your own future reference:

fa(SDQ[items_conduct], nfactors=1)

## Factor Analysis using method = minres
## Call: fa(r = SDQ[items_conduct], nfactors = 1)
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 h2 u2 com
## tantrum 0.71 0.51 0.49 1
## obeys -0.67 0.44 0.56 1
## fights 0.65 0.42 0.58 1
## lies 0.49 0.24 0.76 1
## steals 0.45 0.20 0.80 1
##
## MR1
## SS loadings 1.82
## Proportion Var 0.36
##
## Mean item complexity = 1
## Test of the hypothesis that 1 factor is sufficient.
##
## The degrees of freedom for the null model are 10 and the objective function was 1
## The degrees of freedom for the model are 5 and the objective function was 0.07
##
## The root mean square of the residuals (RMSR) is 0.05
## The df corrected root mean square of the residuals is 0.07
##
## The harmonic number of observations is 226 with the empirical chi square 11.73 wi
## The total number of observations was 228 with Likelihood Chi Square = 14.78 with prob < 0.011
##
## Tucker Lewis Index of factoring reliability = 0.908
## RMSEA index = 0.093 and the 90 % confidence intervals are 0.04 0.149
## BIC = -12.36
## Fit based upon off diagonal values = 0.98
## Measures of factor score adequacy
## MR1
## Correlation of (regression) scores with factors 0.87
## Multiple R square of scores with factors 0.76
## Minimum correlation of possible factor scores 0.52

Run this command and examine the output. Try to answer the following ques-
tions (refer to instructional materials of your choice for help, I recommend Mc-
Donald’s “Test theory”). I will indicate which parts of the output you need to
answer each question.
QUESTION 4. Examine the Standardized factor loadings. How do you in-
terpret them? What is the “marker item” for the Conduct Problems scale? [In
the “Standardized loadings” output, the loadings are printed in “MR1” column.
“MR” stands for the estimation method, “Minimum Residuals”, and “1” stands
for the factor number. Here we have only 1 factor, so only one column.]
QUESTION 5. Examine communalities and uniquenesses (look at h2 and u2
values in the table of “Standardized loadings”, respectively). What is commu-
nality and uniqueness and how do you interpret them?
QUESTION 6. What is the proportion of variance explained by the factor
in the five items (total variance explained)? To answer this question, look for
“Proportion Var” entry in the output (in a small table beginning with “SS
loadings”).

Step 7. Assessing the model’s goodness of fit

Now let us examine the model’s goodness of fit (GOF). This output starts with
the line “Test of the hypothesis that 1 factor is sufficient”. This is the hypothesis
that the data complies with a single-factor model (“the model”). We are hoping
to retain this hypothesis, therefore hoping for a large p-value, definitely p >
0.05, and hopefully larger. The output will also tell us about the “null model”,
which is the model where all items are uncorrelated. We are obviously hoping
to reject this null model, and obtain a very small p-value. Both of these models
will be tested with the chi-square test, with their respective degrees of freedom.
For our single-factor model, there are 5 degrees of freedom, because the model
estimates 10 parameters (5 factor loadings and 5 uniquenesses), and there are
5*6/2=15 variances and covariances (sample moments) to estimate them. The
degrees of freedom are therefore 15-10=5.
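You can verify this arithmetic in the Console if you wish:

5*6/2 - (5 + 5)   # sample moments minus estimated parameters = 5 df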
QUESTION 7. Is the chi-square for the tested model significant? Do you
accept or reject the single-factor model? [Look for “Likelihood Chi Square” in
the output.]
For now, I will ignore some other “fit indices” printed in the output. I will
return to them in Exercises dealing with structural equation models (beginning
with Exercise 16) .
Now let’s examine the model’s residuals. Residuals are the differences between
the observed item correlations (which we computed earlier) and the correlations
“reproduced” by the model – that is, correlations of item scores predicted by
the model. The smaller the residuals are, the closer the model reproduces the
data.
In the model output printed on Console, however, we have only the Root Mean
Square Residual (RMSR), which computes the mean of all residuals squared,
and then takes the square root of that. The RMSR is a summary measure of
the size of residuals, and in a way it is like GOF “effect size” - independent of
sample size. You can see that the RMSR=0.05, which is a good (low) value,
indicating that the average residual is sufficiently small.
To obtain more detailed output of the residuals, we need to get access to all
of the results produced by the function fa(). Call the function again, but this
time, assign its results to a new object F_conduct, which we can “interrogate”
later.

F_conduct <- fa(SDQ[items_conduct], nfactors=1)

Package psych has a nice function that pulls the residuals from the saved factor
analysis results (object F_conduct) and prints them in a user-friendly way:

residuals.psych(F_conduct)

## tntrm obeys fghts lies stels
## tantrum 0.49
## obeys 0.01 0.56
## fights -0.05 -0.06 0.58
## lies 0.09 0.03 -0.07 0.76
## steals -0.02 0.04 0.06 0.00 0.80

NOTE: On the diagonal are discrepancies between the observed item variances
(which equal 1 in this standardised solution) and the “reproduced” item vari-
ances (variances explained by the common factor, or communalities). So, on the
diagonal we have uniquness = 1-communality. You can check this by comparing
the diagonal of the residual matrix with values u2 that we discussed earlier.
They should be the same. To remove these values on the diagonal out of sight,
use:

residuals.psych(F_conduct, diag=FALSE)

## tntrm obeys fghts lies stels
## tantrum NA
## obeys 0.01 NA
## fights -0.05 -0.06 NA
## lies 0.09 0.03 -0.07 NA
## steals -0.02 0.04 0.06 0.00 NA
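If you want to verify this numerically rather than by eye, you can compare the
residual diagonal with the uniquenesses computed from the saved results (a
sketch; it assumes, as in current versions of psych, that the fitted object stores
the communalities in element communality):

round(1 - F_conduct$communality, 2)   # uniqueness = 1 - communality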

QUESTION 8. Examine the residual correlations (they are printed OFF the
diagonal). What can you say about them? Are there any large residuals? (Hint.
Interpret the size of residuals as you would the size of correlation coefficients.)

Step 8. Estimating scale reliability using McDonald’s omega

Since we have confirmed homogeneity of the Conduct Problems scale, we can
legitimately estimate its reliability using coefficients alpha or omega. In Exercise
5, we already computed alpha for this scale, so you can refer to that instruction
for detail. I will simply quote the result we obtained there - “raw alpha” of 0.72.
To obtain omega, call function omega(), specifying the number of factors
(nfactors=1). You need this because various versions of coefficient omega exist
for multi-factor models, but you only need the estimate for a homogeneous
test, “Omega Total”.

# reverse code any counter-indicative items and put them in a data frame
# for details on this procedure, consult Exercise 1
R_conduct <- reverse.code(c(1,-1,1,1,1), SDQ[items_conduct])
# obtain McDonald's omega
omega(R_conduct, nfactors=1)

## Loading required namespace: GPArotation

## Omega_h for 1 factor is not meaningful, just omega_t

## Warning in schmid(m, nfactors, fm, digits, rotate = rotate, n.obs = n.obs, :
## Omega_h and Omega_asymptotic are not meaningful with one factor

## Omega
## Call: omegah(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
## digits = digits, title = title, sl = sl, labels = labels,
## plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option,
## covar = covar)
## Alpha: 0.73
## G.6: 0.7
## Omega Hierarchical: 0.74
## Omega H asymptotic: 1
## Omega Total 0.74
##
## Schmid Leiman Factor loadings greater than 0.2
## g F1* h2 u2 p2
## tantrum 0.71 0.51 0.49 1
## obeys- 0.67 0.44 0.56 1
## fights 0.65 0.42 0.58 1
## lies 0.49 0.24 0.76 1
## steals 0.45 0.20 0.80 1
##
## With eigenvalues of:
## g F1*
## 1.8 0.0
##
## general/max 1.639707e+16 max/min = 1
## mean percent general = 1 with sd = 0 and cv of 0
## Explained Common Variance of the general factor = 1
##
## The degrees of freedom are 5 and the fit is 0.07
## The number of observations was 228 with Chi Square = 14.78 with prob < 0.011
## The root mean square of the residuals is 0.05
## The df corrected root mean square of the residuals is 0.07
## RMSEA index = 0.093 and the 10 % confidence intervals are 0.04 0.149
## BIC = -12.36
##
## Compare this with the adequacy of just a general factor and no group factors
## The degrees of freedom for just the general factor are 5 and the fit is 0.07
## The number of observations was 228 with Chi Square = 14.78 with prob < 0.011
## The root mean square of the residuals is 0.05
## The df corrected root mean square of the residuals is 0.07
##
## RMSEA index = 0.093 and the 10 % confidence intervals are 0.04 0.149
## BIC = -12.36
##
## Measures of factor score adequacy
## g F1*
## Correlation of scores with factors 0.87 0
## Multiple R square of scores with factors 0.76 0
## Minimum correlation of factor score estimates 0.52 -1
##
## Total, General and Subset omega for each subset
## g F1*
## Omega total for total scores and subscales 0.74 0.74
## Omega general for total scores and subscales 0.74 0.74
## Omega group for total scores and subscales 0.00 0.00

QUESTION 9. What is the “Omega Total” for Conduct Problems scale score?
How does it compare with the “raw alpha” for this scale? Which estimate do
you think is more appropriate to assess the reliability of this scale?

Step 9. Saving your work

After you have finished work with this exercise, save your R script by pressing the
Save icon in the script window. Give the script a meaningful name, for example
“SDQ scale homogeneity analyses”. When closing the project by pressing File /
Close project, make sure you select Save when prompted to save your ‘Workspace
image’ (with extension .RData).

7.3 Further practice - Factor analysis of the remaining SDQ subscales
If you wish to practice further, you can repeat the factor analyses with the
remaining SDQ scales. Use the list of items by subscale in Exercise 1 to include
the right items in your analyses.

7.4 Solutions
Q1. The correlations are all similar in magnitude (around 0.3-0.4). Correlations
between obeys and other items are negative, and between all other items they
are positive. This is because obeys indicates the low end of Conduct Problems
(i.e. the lack of such problems), so it is keyed in the opposite direction to all
other items. The data seem suitable for factor analysis because all items
correlate – so potentially have a common source for this shared variance.
Q2. MSA = 0.76. The data are “middling” according to Kaiser’s guidelines
(i.e. suitable for factor analysis).
Q3. According to the Scree plot of the observed data (in blue) the 1st factor
accounts for a substantial amount of variance compared to the 2nd factor. There
is a large drop from the 1st factor to the 2nd, and the “rubble” begins with the
2nd factor. This indicates that most co-variation is explained by one factor.
Parallel analysis confirms this conclusion, because the simulated data yields a
line (in red) that crosses the observed scree between 1st and 2nd factor, with
1st factor significantly above the red line, and 2nd factor below it. It means
that only the 1st factor should be retained, and the 2nd, 3rd and all subsequent
factors should be discarded as they are part of the rubble.
Q4. Standardized factor loadings reflect the number of Standard Deviations by
which the item score will change per 1 SD change in the factor score. The higher
the loading, the more sensitive the item is to the change in the factor. Stan-
dardised factor loadings in the single-factor model are also correlations between
the factor and the items (just like beta coefficients in the simple regression).
For the Conduct Problems scale, factor loadings range between 0.45 and 0.71 in
magnitude (i.e. absolute value). Factor loadings over 0.5 in magnitude can be
considered reasonably high. The loading for obeys is negative (-0.67), which
means that a higher factor score results in a lower score on this item. The
marker item is tantrum with the loading 0.71 (highest loading). The marker
item indicates that behaviour described by this item, “I get very angry and
often lose my temper”, is central to the meaning of the common factor.
Q5. Communality is the variance in the item due to the common factor, and
uniqueness is the unique item variance. In standardised factor solutions (which
is what function fa() prints), communality is the proportion of variance in
the item due to the factor, and uniqueness is the remaining proportion (1-
communality). Looking at the printed values, between 20% and 51% of variance
in the items is due to the common factor.
Q6. The factor accounts for 36% of the variability in the five Conduct Problems
items (see “Proportion Var” in the output).
Q7. The chi-square test is significant (“prob” or p-value is less than 0.011),
which means the hypothesis that the single-factor model holds must be rejected.
Q8. The residuals are all quite close to zero. The largest residual is for corre-
lation between “tantrum” and “lies”, approx. 0.09. It is still a small difference
between observed and reproduced correlation (normally we ignore any residuals
smaller than 0.1), so we conclude that the correlations are reproduced pretty
well.
Q9. Omega Total = 0.74. Omega should provide a better estimate of reliability
of a homogeneous scale than alpha because it does not rely on the assumption
that all factor loadings are the same as alpha does. Here, alpha (0.72) underes-
timates the reliability slightly, because not all factor loadings are equal. (alpha
can only be lower than omega, and they are equal when all factor loadings are
equal).
Exercise 8

Fitting a single-factor model to dichotomous questionnaire data

Data file EPQ_N_demo.txt
R package psych

8.1 Objectives

The objective of this exercise is to test homogeneity of a questionnaire com-
piled from dichotomous items, by fitting a single-factor model to tetrachoric
correlations of the questionnaire items rather than product-moment (Pearson’s)
correlations.

8.2 Worked Example - Testing homogeneity of EPQ Neuroticism scale

To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.
This exercise makes use of the data we considered in Exercise 6. These data
come from a large cross-cultural study (Barrett, Petrides, Eysenck & Eysenck,
1998), with N = 1,381 participants who completed the Eysenck Personality
Questionnaire (EPQ).


The focus of our analysis here will be the Neuroticism/Anxiety (N) scale, mea-
sured by 23 items with only two response options - either “YES” or “NO”, for
example:

N_3 Does your mood often go up and down?
N_7 Do you ever feel "just miserable" for no reason?
N_12 Do you often worry about things you should not have done or said?
etc.

You can find the full list of EPQ Neuroticism items in Exercise 6. Please
note that all items indicate “Neuroticism” rather than “Emotional Stability”
(i.e. there are no counter-indicative items).

Step 1. Opening and examining the data

If you have already worked with this data set in Exercise 6, the simplest thing
to do is to continue working within the project created back then. In RStudio,
select File / Open Project and navigate to the folder and the project you cre-
ated. You should see the data frame EPQ appearing in your Environment tab,
together with other objects you created and saved.
If you have not completed Exercise 6 or have not saved your work, or simply
want to start from scratch, download the data file EPQ_N_demo.txt into
a new folder and follow instructions from Exercise 6 on creating a project and
importing the data.

EPQ <- read.delim(file="EPQ_N_demo.txt")

The object EPQ should appear on the Environment tab. Click on it and the
data will be displayed on a separate tab. As you can see, there are 26 variables
in this data frame, beginning with participant id, age and sex (0 = female; 1
= male). These demographic variables are followed by 23 item responses, which
are either 0 (for “NO”) or 1 (for “YES”). There are also a few missing responses,
marked with NA.

Step 2. Examining suitability of data for factor analysis

Now, load the package psych to enable access to its functionality:

library(psych)

Before starting factor analysis, check correlations between responses to the 23
items. Package psych has function lowerCor() that prints the correlation matrix
in a compact format. Note that to refer to the 23 item responses only, you need
to specify the columns where they are stored (from 4 to 26):

lowerCor(EPQ[4:26])

## N_3 N_7 N_12 N_15 N_19 N_23 N_27 N_31 N_34 N_38 N_41
## N_3 1.00
## N_7 0.36 1.00
## N_12 0.26 0.28 1.00
## N_15 0.31 0.21 0.12 1.00
## N_19 0.33 0.30 0.32 0.18 1.00
## N_23 0.36 0.30 0.24 0.28 0.28 1.00
## N_27 0.28 0.28 0.31 0.24 0.32 0.29 1.00
## N_31 0.24 0.20 0.18 0.22 0.31 0.24 0.26 1.00
## N_34 0.30 0.33 0.34 0.20 0.41 0.32 0.36 0.45 1.00
## N_38 0.26 0.24 0.28 0.21 0.29 0.27 0.33 0.28 0.41 1.00
## N_41 0.23 0.21 0.16 0.29 0.27 0.26 0.23 0.43 0.35 0.27 1.00
## N_47 0.17 0.13 0.19 0.09 0.16 0.20 0.20 0.18 0.20 0.29 0.14
## N_54 0.20 0.19 0.11 0.20 0.15 0.16 0.15 0.20 0.18 0.16 0.21
## N_58 0.36 0.34 0.22 0.19 0.24 0.35 0.24 0.18 0.22 0.20 0.16
## N_62 0.26 0.23 0.14 0.23 0.20 0.29 0.20 0.20 0.23 0.19 0.24
## N_66 0.20 0.16 0.27 0.12 0.25 0.24 0.22 0.17 0.24 0.25 0.13
## N_68 0.22 0.29 0.15 0.19 0.22 0.21 0.22 0.14 0.21 0.19 0.15
## N_72 0.24 0.32 0.32 0.13 0.42 0.24 0.34 0.29 0.41 0.29 0.26
## N_75 0.26 0.29 0.21 0.25 0.31 0.27 0.25 0.53 0.38 0.28 0.40
## N_77 0.33 0.30 0.18 0.23 0.29 0.37 0.31 0.26 0.29 0.25 0.25
## N_80 0.23 0.23 0.27 0.10 0.56 0.19 0.29 0.28 0.35 0.24 0.25
## N_84 0.25 0.22 0.15 0.08 0.14 0.18 0.16 0.10 0.15 0.13 0.09
## N_88 0.19 0.17 0.12 0.12 0.19 0.17 0.13 0.05 0.17 0.15 0.11
## N_47 N_54 N_58 N_62 N_66 N_68 N_72 N_75 N_77 N_80 N_84
## N_47 1.00
## N_54 0.09 1.00
## N_58 0.16 0.20 1.00
## N_62 0.07 0.16 0.20 1.00
## N_66 0.21 0.11 0.16 0.15 1.00
## N_68 0.05 0.17 0.21 0.24 0.14 1.00
## N_72 0.18 0.08 0.21 0.19 0.27 0.22 1.00
## N_75 0.19 0.25 0.24 0.23 0.19 0.16 0.30 1.00
## N_77 0.11 0.23 0.27 0.41 0.24 0.28 0.27 0.29 1.00
## N_80 0.11 0.13 0.18 0.14 0.25 0.21 0.40 0.30 0.26 1.00
## N_84 0.12 0.12 0.29 0.14 0.11 0.16 0.13 0.12 0.17 0.13 1.00
## N_88 0.09 0.07 0.16 0.07 0.12 0.09 0.11 0.12 0.12 0.17 0.15
## [1] 1.00

Now let’s compute the tetrachoric correlations for the same items. These would
be more appropriate for binary items on which a NO/YES dichotomy was forced
(although the underlying extent of agreement is actually continuous).

tetrachoric(EPQ[4:26])

## Call: tetrachoric(x = EPQ[4:26])
## tetrachoric correlation
## N_3 N_7 N_12 N_15 N_19 N_23 N_27 N_31 N_34 N_38 N_41
## N_3 1.00
## N_7 0.54 1.00
## N_12 0.44 0.48 1.00
## N_15 0.55 0.35 0.23 1.00
## N_19 0.51 0.46 0.53 0.30 1.00
## N_23 0.55 0.46 0.42 0.46 0.42 1.00
## N_27 0.44 0.43 0.55 0.38 0.49 0.45 1.00
## N_31 0.40 0.33 0.35 0.35 0.51 0.38 0.41 1.00
## N_34 0.46 0.50 0.56 0.34 0.61 0.49 0.55 0.72 1.00
## N_38 0.40 0.37 0.48 0.35 0.45 0.42 0.49 0.45 0.60 1.00
## N_41 0.39 0.34 0.31 0.46 0.45 0.42 0.37 0.64 0.58 0.43 1.00
## N_47 0.26 0.20 0.33 0.16 0.25 0.31 0.31 0.30 0.31 0.44 0.23
## N_54 0.33 0.31 0.22 0.33 0.26 0.27 0.25 0.32 0.31 0.26 0.35
## N_58 0.54 0.52 0.38 0.33 0.38 0.55 0.39 0.31 0.35 0.31 0.28
## N_62 0.47 0.38 0.30 0.38 0.34 0.48 0.34 0.33 0.39 0.32 0.39
## N_66 0.31 0.25 0.46 0.21 0.39 0.38 0.35 0.28 0.38 0.38 0.23
## N_68 0.36 0.44 0.28 0.30 0.34 0.32 0.35 0.23 0.33 0.30 0.25
## N_72 0.38 0.48 0.54 0.23 0.62 0.38 0.52 0.48 0.61 0.44 0.43
## N_75 0.43 0.45 0.41 0.41 0.49 0.42 0.40 0.75 0.61 0.44 0.61
## N_77 0.53 0.46 0.34 0.37 0.46 0.56 0.47 0.41 0.46 0.38 0.40
## N_80 0.37 0.36 0.46 0.16 0.78 0.30 0.44 0.46 0.53 0.38 0.43
## N_84 0.43 0.39 0.29 0.17 0.24 0.32 0.30 0.19 0.27 0.24 0.19
## N_88 0.37 0.36 0.25 0.30 0.40 0.35 0.27 0.11 0.34 0.31 0.27
## N_47 N_54 N_58 N_62 N_66 N_68 N_72 N_75 N_77 N_80 N_84
## N_47 1.00
## N_54 0.15 1.00
## N_58 0.26 0.35 1.00
## N_62 0.12 0.27 0.36 1.00
## N_66 0.33 0.18 0.25 0.26 1.00
## N_68 0.07 0.28 0.34 0.39 0.23 1.00
## N_72 0.28 0.13 0.33 0.32 0.42 0.35 1.00
## N_75 0.31 0.40 0.40 0.38 0.30 0.26 0.48 1.00
## N_77 0.18 0.37 0.44 0.63 0.37 0.43 0.43 0.44 1.00
## N_80 0.17 0.23 0.29 0.24 0.38 0.33 0.59 0.49 0.42 1.00
## N_84 0.21 0.24 0.49 0.29 0.19 0.30 0.24 0.22 0.33 0.23 1.00
## N_88 0.19 0.15 0.32 0.16 0.24 0.20 0.22 0.27 0.26 0.34 0.31
## [1] 1.00
##
## with tau of
## N_3 N_7 N_12 N_15 N_19 N_23 N_27 N_31 N_34 N_38 N_41
## -0.354 -0.112 -0.829 0.557 -0.195 -0.036 0.066 0.427 -0.252 -0.080 0.504
## N_47 N_54 N_58 N_62 N_66 N_68 N_72 N_75 N_77 N_80 N_84
## -0.209 0.486 -0.408 0.603 -0.209 0.162 -0.277 0.360 0.219 -0.276 -0.876
## N_88
## -1.232

QUESTION 1. Examine the outputs for product-moment and tetrachoric
correlations. What can you say about their size and direction? Compare them
to each other. What do you notice? Do you think these data are suitable for
factor analysis?
To obtain the measure of sampling adequacy - an index summarizing the cor-
relations on their overall potential to measure something in common - request
the Kaiser-Meyer-Olkin (KMO) index. However, instead of applying function
KMO() to the original (binary) item responses, we can apply it to the results of
tetrachoric correlation analysis (we refer to matrix $rho - the actual tetrachoric
correlation matrix):

KMO(tetrachoric(EPQ[4:26])$rho)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = tetrachoric(EPQ[4:26])$rho)
## Overall MSA = 0.92
## MSA for each item =
## N_3 N_7 N_12 N_15 N_19 N_23 N_27 N_31 N_34 N_38 N_41 N_47 N_54 N_58 N_62 N_66
## 0.94 0.94 0.93 0.88 0.91 0.95 0.96 0.84 0.91 0.95 0.94 0.89 0.91 0.93 0.90 0.93
## N_68 N_72 N_75 N_77 N_80 N_84 N_88
## 0.94 0.94 0.90 0.92 0.87 0.90 0.83

QUESTION 2. Interpret the resulting measure of sampling adequacy (KMO)
using Kaiser’s guidelines given in Exercise 7.

Step 3. Determining the number of factors

We will use function fa.parallel() from package psych to produce Scree plots
for the observed data and random (i.e. uncorrelated) data matrix of the same
size. Comparison of the two scree plots is called Parallel Analysis. We retain
factors from the blue scree plot (real data) that are ABOVE the red plot (sim-
ulated random data), in which we expect no common variance, only random
variance.

This time, we will call function fa.parallel() requesting the tetrachoric corre-
lations (cor="tet") rather than Pearson’s correlations (default). We will also
change the default estimation method (minimum residuals or “minres”) to the
maximum likelihood (fm="ml"), which our large sample allows. Finally, we will
change another default - the type of eigenvalues shown. We will display only
eigenvalues for principal components (fa="pc"), as done in some commercial
software such as Mplus.

fa.parallel(EPQ[4:26], fm="ml", fa="pc", cor="tet")

[Figure: Parallel Analysis Scree Plots - eigenvalues of principal components (PC Actual Data, PC Simulated Data, PC Resampled Data) plotted against component numbers 1 to 23]

## Parallel analysis suggests that the number of factors = NA and the number of components = 3

The Scree plot shows that the first factor accounts for a substantial amount
of variance compared to the second and subsequent factors. There is a large
drop from the first factor to the second (forming a clear “mountain side”), and
mostly “rubble” afterwards, beginning with the second factor. This indicates
that most co-variation in the items is explained by just one factor. However,
Parallel Analysis reveals that there are 3 factors, as factors 2 and 3 explain
significantly more variance than would be expected in random data of this size.
From the Scree plot it is clear that factors 2 and 3 explain little variance (even
if significant). Residuals will reveal which correlations are not well explained by
just 1 factor, and might give us a clue why the 2nd and 3rd factors are required.

Step 4. Fitting and interpreting a single-factor model

Now let’s fit a single-factor model to the 23 Neuroticism items. We will again
request the tetrachoric correlations (cor="tet") and the maximum likelihood
estimation method (fm="ml").

fa(EPQ[4:26], nfactors=1, fm="ml", cor="tet")

## Factor Analysis using method = ml
## Call: fa(r = EPQ[4:26], nfactors = 1, fm = "ml", cor = "tet")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML1 h2 u2 com
## N_3 0.69 0.47 0.53 1
## N_7 0.65 0.43 0.57 1
## N_12 0.65 0.43 0.57 1
## N_15 0.52 0.27 0.73 1
## N_19 0.74 0.55 0.45 1
## N_23 0.66 0.44 0.56 1
## N_27 0.67 0.45 0.55 1
## N_31 0.69 0.48 0.52 1
## N_34 0.80 0.64 0.36 1
## N_38 0.65 0.42 0.58 1
## N_41 0.65 0.42 0.58 1
## N_47 0.40 0.16 0.84 1
## N_54 0.43 0.18 0.82 1
## N_58 0.58 0.34 0.66 1
## N_62 0.56 0.31 0.69 1
## N_66 0.50 0.25 0.75 1
## N_68 0.48 0.23 0.77 1
## N_72 0.69 0.48 0.52 1
## N_75 0.72 0.51 0.49 1
## N_77 0.67 0.45 0.55 1
## N_80 0.65 0.43 0.57 1
## N_84 0.43 0.18 0.82 1
## N_88 0.43 0.19 0.81 1
##
## ML1
## SS loadings 8.72
## Proportion Var 0.38
##
## Mean item complexity = 1
## Test of the hypothesis that 1 factor is sufficient.
##
## The degrees of freedom for the null model are 253 and the objective function was 12.39 with
## The degrees of freedom for the model are 230 and the objective function was 3.74
##
## The root mean square of the residuals (RMSR) is 0.08
## The df corrected root mean square of the residuals is 0.08
##
## The harmonic number of observations is 1374 with the empirical chi square 4262.77
## The total number of observations was 1381 with Likelihood Chi Square = 5122.04 w
##
## Tucker Lewis Index of factoring reliability = 0.678
## RMSEA index = 0.124 and the 90 % confidence intervals are 0.121 0.127
## BIC = 3459.01
## Fit based upon off diagonal values = 0.96
## Measures of factor score adequacy
## ML1
## Correlation of (regression) scores with factors 0.97
## Multiple R square of scores with factors 0.94
## Minimum correlation of possible factor scores 0.88

Run this command and examine the output. Try to answer the following ques-
tions (refer to instructional materials of your choice for help, I recommend Mc-
Donald’s “Test theory”). I will indicate which parts of the output you need to
answer each question.
QUESTION 3. Examine the Standardized factor loadings. How do you inter-
pret them? What is the “marker item” for the Neuroticism scale? [In the “Stan-
dardized loadings” output, the loadings are printed in “ML1” column. “ML”
stands for the estimation method, “Maximum Likelihood”, and “1” stands for
the factor number. Here we have only 1 factor, so only one column.]
QUESTION 4. Examine communalities and uniquenesses (look at h2 and u2
values in the table of “Standardized loadings”, respectively). What is commu-
nality and uniqueness and how do you interpret them?
QUESTION 5. What is the proportion of variance explained by the factor
in all 23 items (total variance explained)? To answer this question, look for
“Proportion Var” entry in the output (in a small table beginning with “SS
loadings”).

Step 5. Assessing the model’s goodness of fit

Now let us examine the model’s goodness of fit (GOF). This output starts with
the line “Test of the hypothesis that 1 factor is sufficient”. This is the hypothesis
that the data complies with a single-factor model (“the model”). We are hoping
to retain this hypothesis, therefore hoping for a large p-value, definitely p >
0.05, and hopefully larger. The output will also tell us about the “null model”,
which is the model where all items are uncorrelated. We are obviously hoping
to reject this null model, and obtain a very small p-value. Both of these models
will be tested with the chi-square test, with their respective degrees of freedom.
For our single-factor model, there are 230 degrees of freedom, because the model
estimates 46 parameters (23 factor loadings and 23 uniquenesses), and there are
23*24/2=276 variances and covariances (sample moments) to estimate them.
The degrees of freedom are therefore 276-46=230.
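As before, you can verify this arithmetic in the Console:

23*24/2 - (23 + 23)   # sample moments minus estimated parameters = 230 df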
QUESTION 6. Is the chi-square for the tested model significant? Do you
accept or reject the single-factor model? [Look for “Likelihood Chi Square” in
the output.]
For now, I will ignore some other “fit indices” printed in the output. I will
return to them in Exercises dealing with structural equation models (beginning
with Exercise 16) .
Now let’s examine the model’s residuals. Residuals are the differences between
the observed item correlations (which we computed earlier) and the correlations
“reproduced” by the model – that is, correlations of item scores predicted by
the model. In the model output printed on Console, we have the Root Mean
Square Residual (RMSR), which is a summary measure of the size of residuals,
and in a way it is like GOF “effect size” - independent of sample size. You
can see that the RMSR=0.08, which is an acceptable value, indicating that the
average residual is sufficiently small.
To obtain more detailed output of the residuals, we need to get access to all
of the results produced by the function fa(). Call the function again, but this
time, assign its results to a new object fit, which we can “interrogate” later.

fit <- fa(EPQ[4:26], nfactors=1, fm="ml", cor="tet")

Package psych has a nice function that pulls the residuals from the saved factor
analysis results (object fit) and prints them in a user-friendly way. To remove
item uniquenesses from the diagonal out of sight, use option diag=FALSE.

residuals.psych(fit, diag=FALSE)

## N_3 N_7 N_12 N_15 N_19 N_23 N_27 N_31 N_34 N_38 N_41
## N_3 NA
## N_7 0.09 NA
## N_12 -0.01 0.06 NA
## N_15 0.19 0.01 -0.11 NA
## N_19 0.00 -0.03 0.05 -0.09 NA
## N_23 0.09 0.03 -0.01 0.12 -0.07 NA
## N_27 -0.02 -0.01 0.11 0.03 -0.01 0.00 NA
## N_31 -0.07 -0.13 -0.10 -0.01 -0.01 -0.08 -0.05 NA
## N_34 -0.09 -0.02 0.04 -0.08 0.01 -0.04 0.01 0.16 NA
## N_38 -0.05 -0.06 0.05 0.01 -0.04 -0.01 0.06 0.01 0.09 NA
## N_41 -0.05 -0.08 -0.11 0.13 -0.03 -0.01 -0.06 0.19 0.06 0.01 NA
## N_47 -0.01 -0.06 0.07 -0.05 -0.05 0.05 0.04 0.02 -0.01 0.18 -0.03
## N_54 0.04 0.03 -0.06 0.11 -0.06 -0.02 -0.04 0.02 -0.03 -0.02 0.07
## N_58 0.14 0.14 0.00 0.03 -0.05 0.16 -0.01 -0.09 -0.11 -0.06 -0.10
## N_62 0.09 0.02 -0.07 0.09 -0.07 0.11 -0.04 -0.05 -0.05 -0.04 0.03
## N_66 -0.03 -0.07 0.13 -0.05 0.02 0.04 0.01 -0.07 -0.02 0.06 -0.10
## N_68 0.03 0.13 -0.04 0.05 -0.02 0.00 0.02 -0.11 -0.06 -0.02 -0.07
## N_72 -0.09 0.03 0.08 -0.13 0.10 -0.08 0.06 0.00 0.05 -0.01 -0.02
## N_75 -0.06 -0.02 -0.06 0.04 -0.04 -0.05 -0.09 0.26 0.04 -0.02 0.14
## N_77 0.07 0.02 -0.10 0.02 -0.04 0.11 0.02 -0.05 -0.08 -0.05 -0.03
## N_80 -0.08 -0.07 0.03 -0.18 0.29 -0.13 0.00 0.01 0.00 -0.04 0.00
## N_84 0.13 0.11 0.01 -0.06 -0.08 0.04 0.01 -0.11 -0.07 -0.04 -0.09
## N_88 0.08 0.07 -0.03 0.07 0.07 0.06 -0.02 -0.19 -0.01 0.03 -0.02
## N_47 N_54 N_58 N_62 N_66 N_68 N_72 N_75 N_77 N_80 N_84
## N_47 NA
## N_54 -0.02 NA
## N_58 0.03 0.10 NA
## N_62 -0.10 0.03 0.04 NA
## N_66 0.13 -0.04 -0.04 -0.02 NA
## N_68 -0.12 0.07 0.06 0.12 -0.01 NA
## N_72 0.00 -0.16 -0.08 -0.06 0.07 0.01 NA
## N_75 0.03 0.09 -0.01 -0.02 -0.06 -0.09 -0.02 NA
## N_77 -0.09 0.08 0.05 0.26 0.04 0.10 -0.04 -0.04 NA
## N_80 -0.09 -0.06 -0.09 -0.13 0.05 0.01 0.14 0.02 -0.02 NA
## N_84 0.04 0.06 0.24 0.05 -0.02 0.09 -0.06 -0.08 0.04 -0.05 NA
## N_88 0.02 -0.04 0.07 -0.08 0.02 -0.01 -0.08 -0.04 -0.03 0.05 0.13
## [1] NA

For a large residuals matrix as we have here, it is convenient to summarize the
results with a histogram:

hist(residuals.psych(fit, diag=FALSE))

[Figure: Histogram of residuals.psych(fit, diag = FALSE) - Frequency plotted against residual values ranging from about -0.2 to 0.3]

QUESTION 7. Examine the residuals. What can you say about them? Are
there any large residuals? (Hint. Interpret the size of residuals as you would
the size of correlation coefficients.)

Step 6. Estimating scale reliability using McDonald’s omega

Since we have confirmed relative homogeneity of the Neuroticism scale, we can
legitimately estimate its reliability using coefficients alpha or omega. In Exercise
6, we already computed alpha from tetrachoric correlations for this scale, so you
can refer to that instruction for detail. I will simply quote the result we obtained
there, alpha=0.93.
To obtain omega, call function omega(), specifying the number of factors
(nfactors=1). You need this because various versions of coefficient omega exist
for multi-factor models, but you only need the estimate for a homogeneous
test, “Omega Total”.

omega(tetrachoric(EPQ[4:26])$rho, nfactors=1)

## Loading required namespace: GPArotation

## Omega_h for 1 factor is not meaningful, just omega_t

## Warning in schmid(m, nfactors, fm, digits, rotate = rotate, n.obs = n.obs, :
## Omega_h and Omega_asymptotic are not meaningful with one factor

## Omega
## Call: omegah(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
## digits = digits, title = title, sl = sl, labels = labels,
## plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option,
## covar = covar)
## Alpha: 0.93
## G.6: 0.95
## Omega Hierarchical: 0.93
## Omega H asymptotic: 1
## Omega Total 0.93
##
## Schmid Leiman Factor loadings greater than 0.2
## g F1* h2 u2 p2
## N_3 0.70 0.50 0.50 1
## N_7 0.66 0.44 0.56 1
## N_12 0.65 0.42 0.58 1
## N_15 0.53 0.28 0.72 1
## N_19 0.74 0.54 0.46 1
## N_23 0.68 0.46 0.54 1
## N_27 0.67 0.45 0.55 1
## N_31 0.67 0.45 0.55 1
## N_34 0.79 0.62 0.38 1
## N_38 0.65 0.42 0.58 1
## N_41 0.64 0.41 0.59 1
## N_47 0.40 0.16 0.84 1
## N_54 0.44 0.19 0.81 1
## N_58 0.60 0.36 0.64 1
## N_62 0.56 0.32 0.68 1
## N_66 0.50 0.25 0.75 1
## N_68 0.49 0.24 0.76 1
## N_72 0.68 0.46 0.54 1
## N_75 0.71 0.50 0.50 1
## N_77 0.68 0.47 0.53 1
## N_80 0.64 0.41 0.59 1
## N_84 0.44 0.19 0.81 1
## N_88 0.44 0.19 0.81 1
##
## With eigenvalues of:
## g F1*
## 8.7 0.0
##
## general/max 5.243623e+16 max/min = 1
## mean percent general = 1 with sd = 0 and cv of 0
## Explained Common Variance of the general factor = 1
##
## The degrees of freedom are 230 and the fit is 3.74
##
## The root mean square of the residuals is 0.08
## The df corrected root mean square of the residuals is 0.08
##
## Compare this with the adequacy of just a general factor and no group factors
## The degrees of freedom for just the general factor are 230 and the fit is 3.74
##
## The root mean square of the residuals is 0.08
## The df corrected root mean square of the residuals is 0.08
##
## Measures of factor score adequacy
## g F1*
## Correlation of scores with factors 0.97 0
## Multiple R square of scores with factors 0.94 0
## Minimum correlation of factor score estimates 0.88 -1
##
## Total, General and Subset omega for each subset
## g F1*
## Omega total for total scores and subscales 0.93 0.93
## Omega general for total scores and subscales 0.93 0.93
## Omega group for total scores and subscales 0.00 0.00

QUESTION 8. What is the “Omega Total” for the Neuroticism scale score?
How does it compare with the alpha for this scale?

Step 7. Saving your work


After you have finished work with this exercise, save your R script by pressing the
Save icon in the script window. Give the script a meaningful name, for example
“EPQ_N scale homogeneity analyses”. When closing the project by pressing
File / Close project, make sure you select Save when prompted to save your
‘Workspace image’ (with extension .RData).

8.3 Solutions
Q1. Both sets of correlations support the suitability for factor analysis be-
cause they are all positive and relatively similar in size, as expected for items
measuring the same thing. The tetrachoric correlations are larger than the
product-moment correlations. This is not surprising given that the product-
moment correlations tend to underestimate the strength of the relationships
between binary items.
Q2. MSA = 0.92. The data are “marvelous” for factor analysis according to
Kaiser’s guidelines.
Q3. Standardized factor loadings reflect the number of Standard Deviations by
which the item score will change per 1 SD change in the factor score. The higher
the loading, the more sensitive the item is to the change in the factor. Standard-
ised factor loadings in the single-factor model are also correlations between the
factor and the items (just like beta coefficients in the simple regression). For the
Neuroticism scale, factor loadings range between 0.40 and 0.80. Factor loadings
over 0.5 can be considered reasonably high. The marker item is N_34 with the
loading 0.80 (highest loading). This item, “Are you a worrier?”, is central to the
meaning of the common factor (and supports the hypothesis that the construct
measured by this item set is indeed Neuroticism).
Q4. Communality is the variance in the item due to the common factor, and
uniqueness is the unique item variance. In standardised factor solutions (which
is what function fa() prints), communality is the proportion of variance in
the item due to the factor, and uniqueness is the remaining proportion (1-
communality). Looking at the printed values, between 16% and 64% of variance
in the items is due to the common factor.
Q5. The factor accounts for 38% of the variability in the 23 Neuroticism items.
Q6. The chi-square test is highly significant, which means the null hypothesis
(that the single-factor model holds in the population) must be rejected. How-
ever, the chi-square test is sensitive to the sample size, and even very good
models are rejected when large samples (like the one tested here, N=1381) are
used to test them.

## The total number of observations was 1381 with Likelihood Chi Square = 5122.04

Q7. The vast majority of residuals are between -0.1 and 0.1. However, there
are a few large residuals, above 0.2. For instance, the largest residual correlation
(0.29) is between N_19 (“Are your feelings easily hurt?”) and N_80 (“Are
you easily hurt when people find fault with you or the work you do?”). This is a
clear case of so-called local dependence - when the dependency between two items
remains even after accounting for the common factor. It is not surprising to find
local dependence in two items with such similar wording. These items have far
more in common than they have with other items. Beyond the common factor
that they share with other items, these items also share some of their unique
parts. Another large residual, 0.26, is between N_31 (“Would you call yourself
a nervous person?”) and N_75 (“Do you suffer from ‘nerves’?”). Again, there is
a clear similarity of wording in these two items. These cases of local dependence
might be responsible for the 2nd and 3rd factors we found in Parallel Analysis.
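To locate such large residuals without scanning the matrix by eye, you could
search it programmatically (a sketch, not part of the original analysis; note that
each pair appears twice because the residual matrix is symmetric):

res <- residuals.psych(fit, diag=FALSE)
which(abs(res) > 0.2, arr.ind=TRUE)   # positions of residuals exceeding 0.2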
Q8. Omega Total = 0.93, and is the same as Alpha = 0.93 (at least to the
second decimal place). Coefficient alpha can only be lower than omega, and
they are equal when all factor loadings are equal. Here, not all loadings are
equal, but most are very similar, resulting in very similar alpha and omega.
Part IV

EXPLORATORY FACTOR
ANALYSIS (EFA)

Exercise 9

EFA of polytomous item scores

Data file CHI_ESQ.txt
R package psych

9.1 Objectives

The objective of this exercise is to conduct Exploratory Factor Analysis (EFA)
of polytomous responses to a short service satisfaction questionnaire. Despite
the aim of this questionnaire to measure one construct, user satisfaction, the
analysis will reveal that more than one factor is responsible for the items’
covariation.

9.2 Study of service satisfaction using Experience of Service Questionnaire (CHI-ESQ)

In 2002, the Commission for Health Improvement in the UK developed the Expe-
rience of Service Questionnaire (CHI-ESQ) to measure user experiences with
the Child and Adolescent Mental Health Services (CAMHS). The questionnaire
exists in two versions: one for older children and adolescents, describing expe-
riences with their own treatment, and the other for parents/carers of children
who underwent treatment. We will consider the version for parents/carers.
The CHI-ESQ parent version consists of 12 questions covering different types of
experiences. Here are the questions:


1. I feel that the people who have seen my child listened to me
2. It was easy to talk to the people who have seen my child
3. I was treated well by the people who have seen my child
4. My views and worries were taken seriously
5. I feel the people here know how to help with the problem I came for
6. I have been given enough explanation about the help available here
7. I feel that the people who have seen my child are working together to help with the
8. The facilities here are comfortable (e.g. waiting area)
9. The appointments are usually at a convenient time (e.g. don’t interfere with work, s
10. It is quite easy to get to the place where the appointments are
11. If a friend needed similar help, I would recommend that he or she come here
12. Overall, the help I have received here is good

Parents/carers are asked to respond to these questions using response options
“Certainly True” — “Partly True” — “Not True” (coded 1-2-3) and “Don’t know”
(not scored but treated as missing data, “NA”).
Participants in this study are N=620 parents reporting on experiences with one
CAMHS member Service. This is a subset of the large multi-service sample
analysed and reported in Brown, Ford, Deighton, & Wolpert (2014).

9.3 Worked Example - EFA of responses to CHI-ESQ parental version

To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.

Step 1. Reading and examining the data

I recommend you download data file CHI_ESQ.txt into a new folder and, in
RStudio, associate a new project with this folder. Create a new script, where
you will type all commands.
The data are stored in the space-separated (.txt) format. You can preview the
file by clicking on CHI_ESQ.txt in the Files tab in RStudio (this will not
import the data yet, just open the actual file). You will see that the first row
contains the item abbreviations (esq+p for “parent”): “esqp_01”, “esqp_02”,
“esqp_03”, etc., and the first entry in each row is the respondent number: “1”,
“2”, …“620”. Function read.table() will import this into RStudio taking care
of these column and row names. It will actually understand that we have headers
and row names because the first row contains one fewer fields than the number
of columns (see ?read.table for detailed help on this function).
CHI_ESQ <- read.table(file="CHI_ESQ.txt")

We have just read the data into a new data frame named CHI_ESQ. Examine
the data frame by either pressing on it in the Environment tab, or calling func-
tion View(CHI_ESQ). You will see that there are quite a few missing responses,
which could have occurred because either “Don’t know” response option was
chosen, or because the question was not answered at all.
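To see how many responses are missing on each item before factor analysis, a
quick count may be helpful (not part of the original workflow):

colSums(is.na(CHI_ESQ))   # number of NAs per item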

Step 2. Examining suitability of data for factor analysis

First, load the package psych to enable access to its functionality. Conveniently,
most functions in this package easily accommodate analysis with correlation
matrices as well as with raw data.

library(psych)

Before we start EFA, it is always good to examine correlations between vari-
ables. Function lowerCor() from package psych provides these correlations in
a convenient format.

lowerCor(CHI_ESQ, use="pairwise")

## es_01 es_02 es_03 es_04 es_05 es_06 es_07 es_08 es_09 es_10 es_11
## esqp_01 1.00
## esqp_02 0.67 1.00
## esqp_03 0.65 0.68 1.00
## esqp_04 0.79 0.65 0.67 1.00
## esqp_05 0.61 0.54 0.52 0.64 1.00
## esqp_06 0.58 0.55 0.50 0.60 0.68 1.00
## esqp_07 0.60 0.50 0.55 0.63 0.71 0.70 1.00
## esqp_08 0.15 0.25 0.21 0.22 0.14 0.30 0.21 1.00
## esqp_09 0.24 0.26 0.18 0.19 0.18 0.24 0.16 0.31 1.00
## esqp_10 0.17 0.25 0.19 0.15 0.14 0.18 0.15 0.36 0.28 1.00
## esqp_11 0.63 0.53 0.60 0.60 0.61 0.55 0.58 0.22 0.20 0.22 1.00
## esqp_12 0.72 0.60 0.62 0.72 0.71 0.66 0.68 0.22 0.21 0.15 0.71
## [1] 1.00

Examine the correlations carefully. Notice that all correlations are positive.
However, there is a pattern to these correlations. While most are large (above
0.5), correlations between items esqp_08, esqp_09, esqp_10 (describing
experiences with facilities, appointment times and location) and the rest of the
items (describing experiences with treatment) are substantially lower, ranging
between 0.14 and 0.30. Correlations of these items among each other are only

slightly larger, ranging between 0.28 and 0.36. We will see whether and how
EFA will pick up on this pattern of correlations.
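
If you find a large correlation matrix hard to scan by eye, package psych also offers corPlot(), which shades correlations by their magnitude (an optional aside; base R’s cor() is used here to compute the matrix):

# optional: colour-coded display of the item correlations
corPlot(cor(CHI_ESQ, use="pairwise"), numbers=TRUE)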

Now, request the Kaiser-Meyer-Olkin (KMO) index – the measure of sampling
adequacy (MSA):

KMO(CHI_ESQ)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = CHI_ESQ)
## Overall MSA = 0.92
## MSA for each item =
## esqp_01 esqp_02 esqp_03 esqp_04 esqp_05 esqp_06 esqp_07 esqp_08 esqp_09 esqp_10
## 0.92 0.93 0.93 0.92 0.93 0.93 0.93 0.76 0.86 0.80
## esqp_11 esqp_12
## 0.95 0.94

QUESTION 1. What is the overall measure of sampling adequacy (MSA)?
Interpret this result.

Step 3. Determining the number of factors

We will use function fa.parallel() to produce a scree plot for the observed
data, and compare it to that of a random (i.e. uncorrelated) data matrix of
the same size. We will use the default estimation method (fm="minres"), because
the data are responses with only 3 ordered categories, where we cannot
expect a normal distribution. We will again display only eigenvalues for principal
components (fa="pc"), as is done in commercial software packages such as
Mplus.

fa.parallel(CHI_ESQ, fa="pc")

[Scree plot: “Parallel Analysis Scree Plots” — eigenvalues of principal components (PC Actual Data, PC Simulated Data, PC Resampled Data) plotted against component number]

## Parallel analysis suggests that the number of factors = NA and the number of components = 2

QUESTION 2. Examine the Scree plot. Does it support the hypothesis
that 1 factor - satisfaction with service - underlies the data? Why? Does your
conclusion correspond to the text output of the parallel analysis function?

Step 4. Fitting an exploratory 1-factor model and assessing the model fit

We will use function fa(), which has the following general form
fa(r, nfactors=1, n.obs = NA, rotate="oblimin", fm="minres" …),
and requires data (argument r), and the number of observations if the data is a
correlation matrix (n.obs). Other arguments have defaults, which we are happy
with for the first analysis.
Specifying all necessary arguments, test the hypothesized 1-factor model:

# fit 1-factor model
fit1 <- fa(CHI_ESQ, nfactors=1)
# print short summary with model fit
summary.psych(fit1)

##
## Factor analysis with Call: fa(r = CHI_ESQ, nfactors = 1)
##
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 54 and the objective function was 0.92
## The number of observations was 620 with Chi Square = 562.91 with prob < 7.7e-86
##
## The root mean square of the residuals (RMSA) is 0.07
## The df corrected root mean square of the residuals is 0.08
##
## Tucker Lewis Index of factoring reliability = 0.86
## RMSEA index = 0.123 and the 10 % confidence intervals are 0.114 0.133
## BIC = 215.7

From the summary output, we find that Chi Square = 562.91 (df = 54) with
prob < 7.7e-86 - a highly significant result. According to the chi-square test, we
have to reject the model. We also find that the root mean square of the residuals
(RMSR) is 0.07. This is an acceptable result.
Next, let’s examine the model residuals, which are direct measures of model
(mis)fit.

# obtain residuals
residuals.psych(fit1, diag=FALSE)

## es_01 es_02 es_03 es_04 es_05 es_06 es_07 es_08 es_09 es_10 es_11
## esqp_01 NA
## esqp_02 0.05 NA
## esqp_03 0.03 0.12 NA
## esqp_04 0.10 0.03 0.04 NA
## esqp_05 -0.04 -0.05 -0.07 -0.02 NA
## esqp_06 -0.06 -0.02 -0.07 -0.04 0.08 NA
## esqp_07 -0.04 -0.08 -0.03 -0.03 0.10 0.11 NA
## esqp_08 -0.09 0.03 -0.01 -0.03 -0.09 0.08 -0.02 NA
## esqp_09 0.00 0.05 -0.03 -0.05 -0.04 0.03 -0.06 0.23 NA
## esqp_10 -0.03 0.06 0.01 -0.06 -0.05 -0.01 -0.04 0.28 0.21 NA
## esqp_11 0.00 -0.03 0.03 -0.04 0.01 -0.03 -0.01 0.00 -0.01 0.03 NA
## esqp_12 0.01 -0.05 -0.02 0.00 0.04 0.00 0.02 -0.03 -0.04 -0.07 0.06
## [1] NA

You can see that the residuals are all small except one cluster corresponding
to the correlations between items esqp_08, esqp_09, and esqp_10, where
residuals are between 0.21 and 0.28. Evidently, one factor can explain the
observed correlations between items 1-7 and 11-12, but it cannot fully explain
the pattern of correlations we observed between items 8-10 (which describe
experience with facilities, appointment times and location rather than treatment,
as the rest of the items do). Quite intuitively, one factor can explain the overall
positive manifold of all correlations, but not the complex pattern we observe here.
We conclude that the 1-factor model is clearly not adequate for these data.
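
As an optional numeric check of this judgement (not in the original exercise), you can pull out the largest absolute residual:

# largest absolute residual under the 1-factor model (should be ~0.28)
max(abs(residuals.psych(fit1, diag=FALSE)), na.rm=TRUE)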

Step 5. Fitting an exploratory 2-factor model and assessing the model fit

Now let’s fit a 2-factor model to these data:

# fit 2-factor model
fit2 <- fa(CHI_ESQ, nfactors=2)

## Loading required namespace: GPArotation

# print short summary with model fit
summary.psych(fit2)

##
## Factor analysis with Call: fa(r = CHI_ESQ, nfactors = 2)
##
## Test of the hypothesis that 2 factors are sufficient.
## The degrees of freedom for the model is 43 and the objective function was 0.64
## The number of observations was 620 with Chi Square = 394 with prob < 3e-58
##
## The root mean square of the residuals (RMSA) is 0.04
## The df corrected root mean square of the residuals is 0.05
##
## Tucker Lewis Index of factoring reliability = 0.878
## RMSEA index = 0.115 and the 10 % confidence intervals are 0.105 0.125
## BIC = 117.52
## With factor correlations of
## MR1 MR2
## MR1 1.00 0.39
## MR2 0.39 1.00

QUESTION 3. What is the chi square statistic, and p value, for the tested
2-factor model? Would you retain or reject this model based on the chi-square
test? Why?
QUESTION 4. Find and interpret the RMSR in the output.
Examine the 2-factor model residuals.

residuals.psych(fit2, diag=FALSE)

## es_01 es_02 es_03 es_04 es_05 es_06 es_07 es_08 es_09 es_10 es_11
## esqp_01 NA
## esqp_02 0.05 NA
## esqp_03 0.03 0.12 NA
## esqp_04 0.09 0.03 0.04 NA
## esqp_05 -0.05 -0.04 -0.07 -0.03 NA
## esqp_06 -0.06 -0.02 -0.07 -0.04 0.09 NA
## esqp_07 -0.05 -0.07 -0.03 -0.04 0.08 0.11 NA
## esqp_08 -0.05 -0.03 -0.02 0.02 -0.02 0.05 0.03 NA
## esqp_09 0.04 0.01 -0.03 -0.01 0.01 0.01 -0.02 0.00 NA
## esqp_10 0.01 0.01 0.01 -0.02 0.02 -0.03 0.00 0.00 0.00 NA
## esqp_11 0.00 -0.03 0.03 -0.04 0.01 -0.03 -0.01 0.00 -0.01 0.03 NA
## esqp_12 -0.01 -0.04 -0.02 -0.02 0.02 0.01 0.00 0.02 0.01 -0.02 0.06
## [1] NA

QUESTION 5. Examine the residual correlations. What can you say about
them? Are there any non-trivial residuals (greater than 0.1 in absolute value)?

Step 6. Interpreting the 2-factor un-rotated solution

Having largely confirmed suitability of the 2-factor model, we are ready to
examine and interpret its results. We will start with an un-rotated solution.

fit2.u <- fa(CHI_ESQ, nfactors=2, rotate="none")
# print factor loadings
fit2.u$loadings

##
## Loadings:
## MR1 MR2
## esqp_01 0.829
## esqp_02 0.749
## esqp_03 0.748
## esqp_04 0.840 -0.104
## esqp_05 0.788 -0.155
## esqp_06 0.764
## esqp_07 0.777 -0.110
## esqp_08 0.310 0.551
## esqp_09 0.291 0.403
## esqp_10 0.259 0.501
## esqp_11 0.756

## esqp_12 0.859 -0.121
##
## MR1 MR2
## SS loadings 5.88 0.797
## Proportion Var 0.49 0.066
## Cumulative Var 0.49 0.556

The un-rotated solution yields the first factor that maximizes the variance
shared by all items. Not surprisingly then, all items load on factor 1 (“MR1”).
However, loadings are weak for items esqp_08, esqp_09, and esqp_10 (describing
experiences with facilities, appointment times and location). On the
other hand, these 3 items load substantially on factor 2 (“MR2”). We can
plot these loadings using plot.psych() - a very convenient function from package
psych.

# plot factor loadings
plot.psych(fit2.u, xlim=c(-0.5,1), ylim=c(-0.5,1))

[Figure: “Factor Analysis” loadings plot of MR2 against MR1 — items 8, 9 and 10 sit apart with higher MR2 loadings, while the remaining items cluster at high MR1 loadings]

The loadings plot visualizes the un-rotated solution. While items 1-7 and 11-12
cluster together and load exclusively on Factor 1, items 8, 9, and 10 (describing
experiences with facilities, appointment times and location) cluster together and
separately from the rest of the items. They load weakly on Factor 1 and moderately
on Factor 2. Considering the item content, we may interpret Factor 1 as Overall
Satisfaction, and Factor 2 as Specific Satisfaction with Environment. I said
“specific” because Factor 2 captures residualized variance in this un-rotated
solution - the variance left over after Factor 1 was accounted for.
QUESTION 6. What is the proportion of variance explained in all items by
each of the factors, and together (total variance explained)? To answer this
question, look for the “Proportion Var” and “Cumulative Var” entries in the output
(in the small table beginning with “SS loadings”).
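
Incidentally, these values can be pulled directly from the fitted object: psych stores the variance-accounted-for table in a field called Vaccounted (a convenient shortcut rather than part of the original instructions):

# table with SS loadings, Proportion Var and Cumulative Var
fit2.u$Vaccounted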

Step 7. Interpreting the 2-factor orthogonally rotated solution

The objective of factor rotations is usually to simplify the un-rotated solution,
so that a solution is found in which all variables are factorially simple (load on
only one factor) because such models are easier to interpret. Because of simpler
mathematics, orthogonal rotations were once very popular. They keep the factor
axes orthogonal to each other, so that the factors remain uncorrelated.
We will now attempt to rotate the 2-factor solution orthogonally, using the
popular “varimax” rotation.

fit2.v <- fa(CHI_ESQ, nfactors=2, rotate="varimax")
# print factor loadings
fit2.v$loadings

##
## Loadings:
## MR1 MR2
## esqp_01 0.820 0.159
## esqp_02 0.687 0.311
## esqp_03 0.716 0.215
## esqp_04 0.832 0.155
## esqp_05 0.797
## esqp_06 0.722 0.252
## esqp_07 0.774 0.130
## esqp_08 0.129 0.619
## esqp_09 0.155 0.473
## esqp_10 0.556
## esqp_11 0.725 0.214
## esqp_12 0.856 0.145
##
## MR1 MR2
## SS loadings 5.413 1.264
## Proportion Var 0.451 0.105
## Cumulative Var 0.451 0.556

plot.psych(fit2.v, xlim=c(-0.5,1), ylim=c(-0.5,1))

[Figure: “Factor Analysis” loadings plot of MR2 against MR1 after varimax rotation — items 8, 9 and 10 move towards MR2, while the remaining items keep high MR1 loadings with small positive MR2 loadings]

As we can see, the “varimax” rotation led to smaller loadings of items 8, 9 and
10 on Factor 1 than in the un-rotated solution. However, other cross-loadings
have increased in this solution, and the items are not factorially simple as we
hoped. In fact, this solution is more difficult to interpret than the un-rotated
solution.
QUESTION 7. What is the proportion of variance explained in all items by
each of the factors, and together (total variance explained)? To answer this
question, look for the “Proportion Var” and “Cumulative Var” entries in the output
(in the small table beginning with “SS loadings”). Why is the total (cumulative)
variance the same as in the un-rotated solution?

Step 8. Interpreting the 2-factor obliquely rotated solution

We will now try an obliquely rotated solution, using the default “oblimin” rotation.

fit2.o <- fa(CHI_ESQ, nfactors=2, rotate="oblimin")
# print factor loadings
fit2.o$loadings

##

## Loadings:
## MR1 MR2
## esqp_01 0.842
## esqp_02 0.670 0.172
## esqp_03 0.721
## esqp_04 0.856
## esqp_05 0.832
## esqp_06 0.719 0.102
## esqp_07 0.799
## esqp_08 0.627
## esqp_09 0.466
## esqp_10 0.568
## esqp_11 0.731
## esqp_12 0.883
##
## MR1 MR2
## SS loadings 5.578 0.992
## Proportion Var 0.465 0.083
## Cumulative Var 0.465 0.547

plot.psych(fit2.o, xlim=c(-0.5,1), ylim=c(-0.5,1))

[Figure: “Factor Analysis” loadings plot of MR2 against MR1 after oblimin rotation — items 8, 9 and 10 load on MR2 only, and the remaining items load on MR1 only]

This obliquely rotated solution yields more factorially simple items than the
previous solutions and is easy to interpret. Factor 1 comprises items
1-7 and 11-12, with the marker item esqp_12 (overall good service). Factor 2
comprises items 8-10, with the marker item esqp_08 (good facilities).
We label the factors in this study Satisfaction with Care and Satisfaction with
Environment; these are two correlated domains of satisfaction.
QUESTION 8. What is the proportion of variance explained by each of the
factors, and together (total variance explained)? To answer this question, look
for the “Proportion Var” and “Cumulative Var” entries in the output (in the small
table beginning with “SS loadings”). Why is the total (cumulative) variance
different to the one reported for the un-rotated solution?
We now ask for “Phi” - the correlation matrix of latent factors. These correlations
are model-based; that is, they are estimated within the model, as one of
its parameters. These correlations are between the latent factors, not their imperfect
measurements (e.g. sum scores), and therefore they are the closest you can
get to estimating how the actual attributes are correlated in the population.
In contrast, correlations between observed sum scores will tend to be lower,
because the sum scores are not perfectly reliable and the correlation between them
will be attenuated by unreliability.

fit2.o$Phi

## MR1 MR2
## MR1 1.0000000 0.3876848
## MR2 0.3876848 1.0000000
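
To make the attenuation point concrete, here is a minimal sketch of the classic Spearman correction for attenuation, with made-up numbers (the observed correlation and the reliabilities below are purely illustrative):

r_obs <- 0.30    # hypothetical observed correlation between two sum scores
rel_x <- 0.80    # hypothetical reliability of the first scale
rel_y <- 0.80    # hypothetical reliability of the second scale
r_obs / sqrt(rel_x * rel_y)    # disattenuated estimate = 0.375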

QUESTION 9. Try to interpret the factor correlation matrix.

Step 9. Saving your work

After you have finished working with this exercise, save your R script by pressing the
Save icon in the script window. To keep all the created objects (such as fit2.o),
which might be useful when revisiting this exercise, save your entire work space.
To do that, when closing the project by pressing File / Close project, make sure
you select Save when prompted to save your “Workspace image”.
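
If you prefer the console to the menus, the same can be achieved with one command (an optional alternative):

# save all objects in the work space to the .RData file in the project folder
save.image()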

9.4 Solutions

Q1. MSA = 0.92. According to Kaiser’s guidelines, the data are “marvellous”
for fitting a factor model. Refer to Exercise 7 to remind yourself of this index
and its interpretation.
Q2. There is a very sharp drop from factor #1 to factor #2, and a less pronounced
drop from #2 to #3. Presumably, we have one very strong factor
(general satisfaction with service), which explains the general positive
manifold in the correlation matrix, and 1 further factor, making 2 factors
altogether. Parallel analysis confirms that 2 factors of the blue plot (observed
data) are above the red dotted line (the simulated random data).
Q3. Chi-square = 394 (df=43) with prob < 3e-58. The p value is given in the
scientific format, with 3 multiplied by 10 to the power of -58. It is so small
(divided by 10 to the power of 58) that we can just say p<0.001. We should
reject the model, because the probability of observing the given correlation
matrix in a population where the 2-factor model is true (the hypothesis we
are testing) is vanishingly small (p<0.001). However, the chi-square test is very
sensitive to the sample size, and our sample is quite large. We must judge the
fit based on other criteria, for instance residuals.
Q4. The RMSR=0.04 is nice and small, indicating that the 2-factor model
reproduces the observed correlations well overall.
Q5. Most residuals are very small (close to zero), and only 2 residuals are above
0.1 in absolute value. The largest residual 0.12 is between items esqp_02 and
esqp_03. This is still only very slightly above 0.1. We conclude that none of
the residuals cause particular concern.
Q6. In the un-rotated solution, Factor 1 accounts for 49% of total variance and
Factor 2 accounts for 6.6%. As the factors are uncorrelated, these can be added
together to obtain 55.6%.
Q7. After the varimax rotation, Factor 1 accounts for 45.1% of total variance
and Factor 2 accounts for 10.5%. As the factors are still uncorrelated, these can
be added together to obtain 55.6%. This is the same amount as in the un-rotated
solution, because factor rotations do not alter the total variance explained by
the factors in the items (item uniquenesses are invariant to rotation). Despite the
amounts of variance re-distributing between factors, the total is still the same.
Q8. After the oblimin rotation, Factor 1 accounts for 46.5% of total variance
and Factor 2 accounts for 8.3%. The cumulative total given, 54.7%, is no longer
the same as in the un-rotated solution because the factors are correlated, and
their variances cannot be added. However, the true total variance explained is
still 55.6%, as reported in the un-rotated solution, because factor rotation does
not change this result.
Q9. In the oblique solution, factors correlate moderately at 0.39. We would
expect domains of satisfaction to correlate. We interpret this result as the
expected correlation between constructs Satisfaction with Care and Satisfaction
with Environment in the population of parents from which we sampled.
Exercise 10

EFA of ability subtest scores

Data file Thurstone.csv


R package psych

10.1 Objectives

In this exercise, you will explore the factor structure of an ability test battery
by analyzing observed correlations of its subtest scores.

10.2 Study of “primary mental abilities” of Louis Thurstone

This is a classic study of “primary mental abilities” by Louis Thurstone (1947).
Thurstone analysed 9 subtests, each measuring some specific mental ability
with several tasks. The 9 subtests were designed to measure 3 broader mental
abilities, namely:

Verbal Ability: 1. sentences; 2. vocabulary; 3. sentence completion
Word Fluency: 4. first letters; 5. four-letter words; 6. suffixes
Reasoning Ability: 7. letter series; 8. pedigrees; 9. letter grouping
Each subtest was scored as a number of correctly completed tasks, and the
scores on each of the 9 subtests can be considered continuous.


We will work with the published correlations between the 9 subtests, based on
N=215 subjects. Despite having no access to the actual subject scores, the
correlation matrix is all you need to run EFA.

10.3 Worked Example - EFA of Thurstone’s primary ability data
To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.

Step 1. Importing and examining observed correlations

The data (subscale correlations) are stored in a comma-separated (.csv) file
Thurstone.csv. Download this file and save it in a new folder. In RStudio,
create a new project in the folder you have just created. You can preview the file
by clicking on Thurstone.csv in the Files tab in RStudio, and selecting View
File. This will not import the data yet, just open the actual .csv file. You will
see that the first row contains the subscale names - “s1”, “s2”, “s3”, etc., and the
first entry in each row is again a subscale name. To import this correlation matrix
into RStudio preserving all these subscale names for each row and each column, we
will use function read.csv(). We will say that we have headers (header=TRUE),
and that the row names are contained in the first column (row.names=1).

Thurstone <- read.csv(file="Thurstone.csv", header=TRUE, row.names=1)

A new object Thurstone should appear in the Environment tab. Examine the
object by either clicking on it, or calling function View(Thurstone). You will see
that it is a correlation matrix, with values 1 on the diagonal, and moderate to large
positive values off the diagonal. This is typical for ability tests – they tend to
correlate positively with each other.
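
To back up this visual impression with numbers, an optional one-liner summarises the off-diagonal correlations:

# range of correlations between the 9 subtests
R <- as.matrix(Thurstone)
range(R[lower.tri(R)])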

Step 2. Examining suitability of data for factor analysis

First, load the package psych to enable access to its functionality. Conveniently,
most functions in this package easily accommodate analysis with correlation
matrices as well as with raw data.

library(psych)

##
## Attaching package: 'psych'

## The following object is masked _by_ '.GlobalEnv':
##
## Thurstone

Before you start EFA, request the Kaiser-Meyer-Olkin (KMO) index – the mea-
sure of sampling adequacy (MSA):

KMO(Thurstone)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = Thurstone)
## Overall MSA = 0.88
## MSA for each item =
## s1 s2 s3 s4 s5 s6 s7 s8 s9
## 0.86 0.86 0.90 0.86 0.88 0.92 0.85 0.93 0.87

QUESTION 1. What is the overall measure of sampling adequacy (MSA)?
Interpret this result.

Step 3. Determining the number of factors

In this case, we have a prior hypothesis that 3 broad domains of ability underlie
the 9 subtests. We begin the analysis with function fa.parallel()
to produce a scree plot for the observed data, and compare it to that of a
random (i.e. uncorrelated) data matrix of the same size. In addition to the actual
data (our correlation matrix, Thurstone), we need to supply the sample size
(n.obs=215) to enable simulation of random data, because from the correlation
matrix alone it is impossible to know how big the sample was. We will keep the
default estimation method (fm="minres"). We will display eigenvalues for both
principal components and factors (fa="both").

fa.parallel(Thurstone, n.obs=215, fa="both")

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected. Examine the results carefully

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.

[Scree plot: “Parallel Analysis Scree Plots” — eigenvalues of principal components and factor analysis (actual vs simulated data) plotted against factor/component number]

## Parallel analysis suggests that the number of factors = 3 and the number of components = 1

Remember that when interpreting the scree plot, you retain only the factors on
the blue (real data) scree plot that are ABOVE the red (simulated random data)
plot, in which we know there is no common variance, only random variance,
i.e. “scree” (rubble).
QUESTION 2. Examine the Scree plot. Does it support the hypothesis that
3 factors underlie the data? Why? Does your conclusion correspond to the text
output of the parallel analysis function?

Step 4. Fitting an exploratory 1-factor model and assessing the model fit

We begin by fitting a single-factor model, which would correspond to a hypothesis
that only one factor (presumably general mental ability) is needed to explain
the pattern of correlations between the 9 subtests. We will use function fa(), which
has the following general form
fa(r, nfactors=1, n.obs = NA, rotate="oblimin", fm="minres" …),
and requires data (argument r), and the number of observations if the data is a
correlation matrix (n.obs). Other arguments have defaults, which we are happy
with for the first analysis.
Specifying all necessary arguments, test the hypothesized 1-factor model:

# fit 1-factor model
fit1 <- fa(Thurstone, nfactors=1, n.obs=215)
# print short summary with model fit
summary.psych(fit1)

##
## Factor analysis with Call: fa(r = Thurstone, nfactors = 1, n.obs = 215)
##
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 27 and the objective function was 1.23
## The number of observations was 215 with Chi Square = 257.68 with prob < 1.7e-39
##
## The root mean square of the residuals (RMSA) is 0.1
## The df corrected root mean square of the residuals is 0.12
##
## Tucker Lewis Index of factoring reliability = 0.708
## RMSEA index = 0.199 and the 10 % confidence intervals are 0.178 0.222
## BIC = 112.67

From the summary output, we find that Chi Square = 257.68 (df=27) with prob
< 1.7e-39 - a highly significant result. According to the chi-square test, we have
to reject the model. We also find that the root mean square of the residuals
(RMSR) is 0.1. This is unacceptably high.
Next, let’s examine the model residuals, which are direct measures of model
(mis)fit.
# obtain residuals
residuals.psych(fit1, diag=FALSE)

## s1 s2 s3 s4 s5 s6 s7 s8 s9
## s1 NA
## s2 0.16 NA
## s3 0.14 0.13 NA
## s4 -0.11 -0.07 -0.07 NA
## s5 -0.10 -0.08 -0.09 0.23 NA
## s6 -0.05 -0.02 -0.04 0.18 0.14 NA
## s7 -0.05 -0.08 -0.08 -0.03 0.00 -0.09 NA
## s8 0.01 -0.01 0.02 -0.09 -0.07 -0.08 0.15 NA
## s9 -0.09 -0.12 -0.09 0.03 0.06 -0.03 0.24 0.07 NA

You can see that the residuals are mostly close to zero, except those within
clusters s1-s3, s4-s6, and s7-s9, where residuals are mostly above 0.1 and as high
as 0.24. Evidently, one factor cannot quite explain the dependencies of subtests
within these clusters.
We conclude that the 1-factor model is not adequate for these data.

Step 5. Fitting an exploratory 3-factor model and assessing the model fit

Now let’s fit a 3-factor model to these data:

# fit 3-factor model
fit3 <- fa(Thurstone, nfactors=3, n.obs=215)

## Loading required namespace: GPArotation

# print short summary with model fit
summary.psych(fit3)

##
## Factor analysis with Call: fa(r = Thurstone, nfactors = 3, n.obs = 215)
##
## Test of the hypothesis that 3 factors are sufficient.
## The degrees of freedom for the model is 12 and the objective function was 0.01
## The number of observations was 215 with Chi Square = 3.01 with prob < 1
##
## The root mean square of the residuals (RMSA) is 0.01
## The df corrected root mean square of the residuals is 0.01
##
## Tucker Lewis Index of factoring reliability = 1.026
## RMSEA index = 0 and the 10 % confidence intervals are 0 0
## BIC = -61.44
## With factor correlations of
## MR1 MR2 MR3
## MR1 1.00 0.59 0.53
## MR2 0.59 1.00 0.52
## MR3 0.53 0.52 1.00

QUESTION 3. What is the chi square statistic, and p value, for the tested
3-factor model? Would you retain or reject this model based on the chi-square
test? Why?
QUESTION 4. Find and interpret the RMSR in the output.
Examine the 3-factor model residuals.

residuals.psych(fit3, diag=FALSE)

## s1 s2 s3 s4 s5 s6 s7 s8 s9
## s1 NA

## s2 0.01 NA
## s3 0.00 -0.01 NA
## s4 -0.01 0.00 0.01 NA
## s5 0.01 0.00 0.00 0.00 NA
## s6 0.00 0.00 0.00 0.00 0.00 NA
## s7 0.00 0.01 -0.01 0.01 -0.01 0.00 NA
## s8 -0.01 0.00 0.02 0.00 0.00 0.01 0.00 NA
## s9 0.01 -0.01 0.00 0.00 0.01 0.00 0.00 0.00 NA

QUESTION 5. Examine the residual correlations. What can you say about
them? Are there any non-trivial residuals (greater than 0.1 in absolute value)?

Step 6. Interpreting the 3-factor obliquely rotated solution

We will now interpret the 3-factor solution, which was rotated obliquely (by
default, rotate="oblimin" is used in fa() function).

# print factor loadings
fit3$loadings

##
## Loadings:
## MR1 MR2 MR3
## s1 0.901
## s2 0.890
## s3 0.838
## s4 0.853
## s5 0.747 0.104
## s6 0.180 0.626
## s7 0.842
## s8 0.382 0.463
## s9 0.212 0.627
##
## MR1 MR2 MR3
## SS loadings 2.489 1.732 1.337
## Proportion Var 0.277 0.192 0.149
## Cumulative Var 0.277 0.469 0.617

QUESTION 6. Examine the standardized factor loadings (pattern matrix),
and name the factors you obtained based on which subtests load on them.
Examine the factor loading matrix again, this time looking for cross-loading
subtests (subtests with loadings greater than 0.32 on factor(s) other than their own).
Note that in the above output, loadings under 0.1 are suppressed by default.
I use a 0.32 cut-off for non-trivial loadings since this translates into 10% of
variance shared with the factor in an orthogonal solution. This value is quite
arbitrary, and many use 0.3 instead.
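
To see where the 10% figure comes from: in an orthogonal solution a loading is the correlation between a variable and a factor, so the squared loading gives the proportion of shared variance:

0.32^2    # = 0.1024, i.e. about 10% of variance shared with the factor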
QUESTION 7. Are there any non-trivial cross-loadings?
We now ask for “Phi” - the correlation matrix of latent factors. These correlations
are model-based; that is, they are estimated within the model, as one
of its parameters. These correlations are between the latent factors (NOT their
imperfect measurements, e.g. sum scores), and therefore they are the closest
you can get to estimating how the actual ability domains are correlated in the
population.

fit3$Phi

## MR1 MR2 MR3
## MR1 1.0000000 0.5926277 0.5334434
## MR2 0.5926277 1.0000000 0.5151336
## MR3 0.5334434 0.5151336 1.0000000

QUESTION 8. Interpret the factor correlation matrix.

Step 7. Saving your work

After you have finished working with this exercise, save your R script by pressing
the Save icon in the script window. To keep all the created objects (such as
fit3), which might be useful when revisiting this exercise, save your entire
work space. To do that, when closing the project by pressing File / Close
project, make sure you select Save when prompted to save your “Workspace
image”.

10.4 Solutions

Q1. MSA = 0.88. According to Kaiser’s guidelines, the data are “meritorious”
(i.e. very much suitable for factor analysis). Refer to Exercise 7 to remind
yourself of this index and its interpretation.
Q2. The scree plot has a very steep drop from the first factor to the second,
and then a slightly ambiguous “step” formed by the 2nd and the 3rd factors.
The 2nd factor could already be considered part of the “scree” (rubble), and
Parallel Analysis based on PC (principal components) eigenvalues agrees with this
judgement. Alternatively, the 2nd and 3rd factors could be considered parts of
the hard rock, with the “scree” beginning with the 4th factor. Parallel Analysis
based on FA (factor analysis) eigenvalues agrees with this alternative judgement.
So, at this point we have an ambiguous result about the number of factors.
Q3. Chi-square=3.01 (df=12) with prob<1. Clearly, we cannot reject the model
because the test is insignificant. To obtain the exact p-value, use

fit3$PVAL

## [1] 0.9955007

Q4. The root mean square of the residuals (RMSR) is 0.01. This is a
very small value, indicating excellent fit.
Q5. The residual correlations are all very close to 0. There are no residuals
greater than 0.1 in absolute value.
Q6. This obliquely rotated solution conforms to the hypothesized structure
and is easy to interpret. Factor 1 comprises subtests s1-s3, Factor 2
comprises subtests s4-s6, and Factor 3 comprises subtests s7-s9.
The factors can be labelled according to the prior hypothesis: Verbal
Ability, Word Fluency and Reasoning Ability, and these are 3 correlated domains
of ability.
Q7. Yes, subtest s8 “Pedigrees” cross-loads on F1 (loading = 0.382) as well as
on its own domain F3 (loading = 0.463). We can interpret this result as that
both Verbal Ability and Reasoning Ability are needed to complete this subtest.
Q8. All correlations between domains are positive and large (just over 0.5),
which is to be expected from measures of ability.
Exercise 11

EFA of personality scale scores

Data file neo.csv


R package psych

11.1 Objectives

The objective of this exercise is to conduct Exploratory Factor Analysis of a
multifaceted personality questionnaire, using a published correlation matrix of
its scale scores. Specifically, we will look to establish a factor structure of one
of the most popular personality measures, NEO PI-R (Costa & McCrae, 1992).

11.2 Study of factor structure of personality using NEO PI-R

The NEO PI-R is based on the Five Factor model of personality, measuring
five broad domains, namely Neuroticism (N), Extraversion (E), Openness (O),
Agreeableness (A) and Conscientiousness (C). In addition, NEO identifies 6
facet scales for each broad factor, measuring 30 subscales in total. The facet
subscales are listed below:


Neuroticism: N1 Anxiety; N2 Angry Hostility; N3 Depression; N4 Self-Consciousness; N5 Impulsiveness; N6 Vulnerability
Extraversion: E1 Warmth; E2 Gregariousness; E3 Assertiveness; E4 Activity; E5 Excitement-Seeking; E6 Positive Emotions
Openness: O1 Fantasy; O2 Aesthetics; O3 Feelings; O4 Ideas; O5 Actions; O6 Values
Agreeableness: A1 Trust; A2 Straightforwardness; A3 Altruism; A4 Compliance; A5 Modesty; A6 Tender-Mindedness
Conscientiousness: C1 Competence; C2 Order; C3 Dutifulness; C4 Achievement Striving; C5 Self-Discipline; C6 Deliberation

The NEO PI-R Manual reports correlations of the 30 facets based on N=1000
subjects. Despite having no access to the actual scale scores for the 1000 sub-
jects, the correlation matrix is all we need to perform factor analysis of the
subscales.

11.3 Worked Example - EFA of NEO PI-R correlation matrix

To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.

Step 1. Importing and examining observed correlations

I recommend you download the data file neo.csv into a new folder and, in RStudio,
associate a new project with this folder. Create a new script, where you will
type all commands.

The data this time are actually a correlation matrix, stored in the comma-separated
(.csv) format. You can preview the file by clicking on neo.csv in
the Files tab in RStudio, and selecting View File. This will not import the data
yet, just open the actual .csv file. You will see that the first row contains the
NEO facet names - “N1”, “N2”, “N3”, etc., and the first entry in each row is
again a facet name. To import this correlation matrix into RStudio preserving
all these facet names for each row and each column, we will use function
read.csv(). We will say that we have headers (header=TRUE), and that the
row names are contained in the first column (row.names=1).

neo <- read.csv(file="neo.csv", header=TRUE, row.names=1)

We have just read the data into a new object named neo, which appeared in
your Environment tab. Examine the object by either clicking on it, or calling
function View(neo). You will see that the data are indeed a correlation matrix,
with values 1 on the diagonal.

Step 2. Examining suitability of data for factor analysis

First, load the package psych to enable access to its functionality. Conveniently,
most functions in this package easily accommodate analysis with correlation
matrices as well as with raw data.

library(psych)

Before you start EFA, request the Kaiser-Meyer-Olkin (KMO) index – the mea-
sure of sampling adequacy (MSA):

KMO(neo)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = neo)
## Overall MSA = 0.87
## MSA for each item =
## N1 N2 N3 N4 N5 N6 E1 E2 E3 E4 E5 E6 O1 O2 O3 O4
## 0.89 0.91 0.92 0.93 0.89 0.92 0.86 0.77 0.89 0.86 0.79 0.87 0.86 0.73 0.84 0.85
## O5 O6 A1 A2 A3 A4 A5 A6 C1 C2 C3 C4 C5 C6
## 0.79 0.77 0.90 0.87 0.85 0.83 0.81 0.86 0.91 0.88 0.93 0.88 0.90 0.87

QUESTION 1. What is the overall measure of sampling adequacy (MSA)?
Interpret this result.

Step 3. Determining the number of factors

In the case of NEO PI-R, we have a clear prior hypothesis about the number
of factors underlying the 30 facets of NEO. The instrument was designed to
measure the Five Factors of personality; therefore we would expect the facets
to be indicators of 5 broad factors.
We will use function fa.parallel() to produce a scree plot for the observed
data, and compare it to that of a random (i.e. uncorrelated) data matrix of the
same size. This time, in addition to the actual data (our correlation matrix,
neo), we need to supply the sample size (n.obs=1000) to enable simulation of
random data, because from the correlation matrix alone it is impossible to know
how big the sample was. We will also change the default estimation method to
maximum likelihood (fm="ml"), because the sample is large and the scale scores
are reported to be normally distributed. We will again display only eigenvalues
for principal components (fa="pc").

fa.parallel(neo, n.obs=1000, fm="ml", fa="pc")

[Scree plot: “Parallel Analysis Scree Plots” — eigenvalues of principal components (PC Actual Data vs PC Simulated Data) plotted against component number]

## Parallel analysis suggests that the number of factors = NA and the number of components = 5

Remember that when interpreting the scree plot, you retain only the factors on
the blue (real data) scree plot that are ABOVE the red (simulated random data)
plot, in which we know there is no common variance, only random variance,
i.e. “scree” (rubble).
QUESTION 2. Examine the Scree plot. Does it support the hypothesis that
5 factors underlie the data? Why? Does your conclusion correspond to the text
output of the parallel analysis function?

Step 4. Fitting an exploratory 5-factor model and checking the model fit

We will use function fa(), which has the following general form
fa(r, nfactors=1, n.obs = NA, rotate="oblimin", fm="minres" …),
and requires data (argument r), and the number of observations if the data is a
correlation matrix (n.obs). Other arguments have defaults, and we will change
the number of factors to extract from the default 1 to the expected number of
factors, nfactors=5, and the estimation method (fm="ml"). We are happy with
the oblique rotation method, rotate="oblimin".
Specifying all necessary arguments, we can simply call the fa() function to test
the hypothesized 5-factor model. However, it will be very convenient to store
the factor analysis results in a new object, which can be “interrogated” later
when we need to extract specific parts of the results. So, we will assign the
results of fa() to an object named (arbitrarily) FA_neo.

FA_neo <- fa(neo, nfactors=5, n.obs=1000, fm="ml")

## Loading required namespace: GPArotation

# obtain results summary
summary.psych(FA_neo)

##
## Factor analysis with Call: fa(r = neo, nfactors = 5, n.obs = 1000, fm = "ml")
##
## Test of the hypothesis that 5 factors are sufficient.
## The degrees of freedom for the model is 295 and the objective function was 1.47
## The number of observations was 1000 with Chi Square = 1443.45 with prob < 1.8e-150
##
## The root mean square of the residuals (RMSA) is 0.03
## The df corrected root mean square of the residuals is 0.04
##
## Tucker Lewis Index of factoring reliability = 0.863
## RMSEA index = 0.062 and the 10 % confidence intervals are 0.059 0.066
## BIC = -594.34
## With factor correlations of
## ML1 ML4 ML3 ML2 ML5
## ML1 1.00 -0.41 -0.12 -0.13 -0.07
## ML4 -0.41 1.00 0.07 0.21 0.08
## ML3 -0.12 0.07 1.00 0.01 -0.16
## ML2 -0.13 0.21 0.01 1.00 0.32
## ML5 -0.07 0.08 -0.16 0.32 1.00

Examine the summary output. Try to answer the following questions (you can
refer to instructional materials of your choice for help; I recommend McDonald’s
Test Theory). I will indicate which parts of the output you need to answer each
question.
QUESTION 3. Find the chi-square statistic testing the 5-factor model (look
for “Likelihood Chi Square”). How many degrees of freedom are there? What
is the chi square statistic, and p value, for the tested model? Would you retain
or reject this model based on the chi-square test? Why?

Next, let’s examine the model residuals, which are direct measures of model
(mis)fit. First, you can evaluate the root mean square of the residuals (RMSR),
which is a summary of all residuals.
QUESTION 4. Find and interpret the RMSR in the output.
Package psych has a nice function residuals.psych() that pulls the residuals
from the saved factor analysis results (object FA_neo) and prints them in a
user-friendly way.

residuals.psych(FA_neo)

The above call will print the residuals matrix with uniquenesses on the diagonal.
These are discrepancies between the observed item variances (equal to 1 in this
standardized solution) and the variances explained by the 5 common factors,
or communalities (uniqueness = 1 - communality). To remove these values on the
diagonal out of sight, use:

residuals.psych(FA_neo, diag=FALSE)

## N1 N2 N3 N4 N5 N6 E1 E2 E3 E4 E5
## N1 NA
## N2 0.00 NA
## N3 0.01 0.01 NA
## N4 0.00 -0.04 0.01 NA
## N5 -0.03 0.02 -0.02 0.02 NA
## N6 0.04 0.00 -0.02 0.00 0.00 NA
## E1 0.02 0.00 0.00 -0.02 -0.03 0.02 NA
## E2 0.04 -0.02 -0.01 -0.03 -0.06 0.08 0.09 NA
## E3 0.02 0.02 0.02 -0.06 -0.02 -0.01 0.04 0.08 NA
## E4 -0.02 0.00 0.01 0.03 -0.04 0.02 -0.06 0.01 0.01 NA
## E5 -0.01 -0.04 0.02 0.01 0.01 0.00 -0.04 0.09 -0.10 0.04 NA
## E6 -0.02 -0.01 -0.03 0.03 0.01 -0.01 -0.01 -0.06 -0.05 0.06 0.04
## O1 0.04 -0.03 0.01 0.00 0.02 -0.01 -0.03 -0.05 -0.03 -0.04 0.03
## O2 -0.05 0.01 -0.02 -0.01 0.01 0.03 0.00 0.03 -0.01 0.01 -0.02
## O3 0.00 0.04 0.00 -0.01 0.02 -0.03 0.03 -0.02 -0.01 -0.05 -0.05
## O4 0.01 -0.01 0.00 -0.02 -0.05 0.00 -0.03 0.08 0.00 0.06 0.04
## O5 0.01 -0.03 0.01 0.02 0.00 -0.02 0.00 -0.06 0.01 0.00 0.06
## O6 0.04 0.04 0.00 0.01 -0.03 -0.02 0.00 -0.03 -0.05 0.04 0.01
## A1 -0.02 0.02 0.00 -0.01 0.02 0.02 0.03 0.02 0.03 0.03 -0.05
## A2 -0.03 0.05 -0.01 -0.03 0.03 0.02 -0.04 -0.02 -0.01 0.04 -0.01
## A3 0.01 0.00 0.02 0.00 0.04 -0.04 0.02 -0.08 0.01 -0.02 -0.01
## A4 0.01 -0.06 0.01 0.02 -0.04 0.02 -0.03 0.05 -0.04 0.04 0.00
## A5 -0.03 0.02 0.02 0.00 -0.01 -0.07 -0.02 -0.06 -0.03 0.02 0.03
## A6 0.00 0.01 0.02 -0.02 0.01 -0.01 -0.02 0.01 0.01 -0.02 0.03
## C1 0.02 0.02 -0.01 0.02 0.05 -0.04 -0.01 -0.04 0.00 -0.05 -0.03

## C2 0.00 -0.02 -0.01 0.00 0.02 0.01 0.01 0.02 -0.02 -0.01 0.05
## C3 -0.01 0.01 0.01 0.00 0.02 0.00 0.04 0.02 0.03 -0.02 -0.05
## C4 -0.01 0.00 0.00 0.02 0.01 0.01 0.01 0.00 0.04 0.07 -0.01
## C5 0.01 -0.02 0.00 0.00 -0.03 -0.01 -0.03 0.00 -0.05 0.02 0.06
## C6 0.06 -0.02 -0.01 -0.02 -0.09 0.02 0.03 0.04 0.00 -0.07 0.00
## E6 O1 O2 O3 O4 O5 O6 A1 A2 A3 A4
## E6 NA
## O1 0.01 NA
## O2 0.03 -0.03 NA
## O3 0.03 0.05 0.03 NA
## O4 -0.02 0.00 0.04 -0.02 NA
## O5 -0.01 0.00 0.02 -0.04 -0.03 NA
## O6 -0.01 0.02 -0.08 -0.01 0.09 0.01 NA
## A1 -0.03 0.01 -0.01 -0.02 -0.03 -0.02 0.08 NA
## A2 0.01 -0.05 0.04 -0.03 0.05 -0.01 0.00 0.05 NA
## A3 0.02 0.02 -0.04 0.01 -0.03 0.02 -0.02 -0.03 0.01 NA
## A4 0.03 0.01 0.00 -0.04 0.05 0.00 0.01 0.00 0.00 -0.01 NA
## A5 0.02 -0.05 0.02 0.04 0.05 -0.01 -0.04 -0.01 0.06 -0.02 0.01
## A6 -0.04 0.01 0.01 0.00 -0.02 0.01 0.03 0.02 -0.01 0.05 -0.03
## C1 0.00 0.04 -0.05 0.05 -0.06 0.01 0.06 0.02 0.01 0.04 -0.03
## C2 0.01 -0.01 0.06 0.00 -0.01 -0.01 -0.03 -0.03 0.00 -0.03 0.01
## C3 -0.04 0.00 -0.02 0.03 -0.05 0.00 -0.02 0.05 0.02 0.00 -0.03
## C4 -0.01 -0.05 0.03 -0.04 0.03 0.01 -0.03 0.02 0.00 -0.02 0.04
## C5 0.01 0.04 -0.01 -0.01 0.05 -0.01 0.02 -0.01 0.00 -0.01 0.01
## C6 -0.01 0.00 -0.01 0.04 -0.05 0.00 0.00 -0.06 -0.05 0.01 -0.01
## A5 A6 C1 C2 C3 C4 C5 C6
## A5 NA
## A6 0.04 NA
## C1 -0.09 0.02 NA
## C2 0.00 -0.01 -0.02 NA
## C3 -0.01 -0.03 0.03 -0.04 NA
## C4 0.01 -0.03 -0.04 -0.02 0.01 NA
## C5 0.03 0.02 0.00 0.06 -0.02 0.01 NA
## C6 -0.07 0.03 0.08 0.04 0.02 -0.04 -0.04 NA

QUESTION 5. Examine the residual correlations. What can you say about
them? Are there any non-trivial residuals (greater than 0.1 in absolute value)?

Step 5. Interpreting parameters of the 5-factor exploratory


model
Having evaluated the model suitability and fit, we are ready to obtain and
interpret its parameters - factor loadings, unique variances, and factor correla-
tions. You can call object FA_neo, or, to obtain just the factor loadings in a
convenient format, call:

FA_neo$loadings

##
## Loadings:
## ML1 ML4 ML3 ML2 ML5
## N1 0.798
## N2 0.578 -0.437 -0.123
## N3 0.783 -0.109
## N4 0.694 0.100
## N5 0.373 -0.252 -0.282 0.295
## N6 0.650 -0.251
## E1 0.170 0.690 0.105
## E2 -0.155 0.547
## E3 -0.290 0.227 -0.406 0.250 0.164
## E4 0.394 -0.382 0.358
## E5 -0.478 0.380
## E6 -0.116 0.696
## O1 -0.317 -0.144 0.114 0.474
## O2 0.129 0.183 0.699
## O3 0.331 0.342 0.418
## O4 -0.173 0.153 0.427
## O5 -0.149 -0.127 0.710
## O6 -0.123 -0.157 0.325
## A1 -0.282 0.422 0.306 0.111
## A2 0.219 0.627
## A3 0.208 0.353 0.635 -0.109
## A4 0.728 0.112
## A5 0.197 0.518 -0.122
## A6 0.457 0.373
## C1 -0.294 0.555 0.104
## C2 0.662 -0.135
## C3 0.619 0.273
## C4 0.715 -0.170 0.159
## C5 -0.180 0.727
## C6 -0.116 0.496 0.276 -0.229
##
## ML1 ML4 ML3 ML2 ML5
## SS loadings 3.228 3.020 2.834 2.612 1.855
## Proportion Var 0.108 0.101 0.094 0.087 0.062
## Cumulative Var 0.108 0.208 0.303 0.390 0.452

Note that in the factor loadings matrix, the factor labels do not always correspond
to the column number. For example, the factor named ML1 is in the 1st column,
but ML2 is in the 4th column. This does not matter; just pay attention to the
factor name that you quote.

QUESTION 6. Examine the standardized factor loadings (pattern matrix),
and name the factors you obtained based on which facets load on them.
Examine the factor loading matrix again, this time looking for cross-loading
facets (facets with loadings greater than 0.32 on factor(s) other than its own).
Note that in the above output, loadings under 0.1 are suppressed by default.
You can control the cutoff value for printing loadings, and set it to 0.32 using
function print():

print(FA_neo, cut = 0.32, digits=2)

## Factor Analysis using method = ml
## Call: fa(r = neo, nfactors = 5, n.obs = 1000, fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML1 ML4 ML3 ML2 ML5 h2 u2 com
## N1 0.80 0.60 0.40 1.0
## N2 0.58 -0.44 0.59 0.41 2.0
## N3 0.78 0.70 0.30 1.1
## N4 0.69 0.53 0.47 1.1
## N5 0.37 0.42 0.58 3.7
## N6 0.65 0.64 0.36 1.4
## E1 0.69 0.59 0.41 1.2
## E2 0.55 0.33 0.67 1.3
## E3 -0.41 0.50 0.50 3.7
## E4 0.39 -0.38 0.36 0.49 0.51 3.2
## E5 -0.48 0.38 0.38 0.62 2.0
## E6 0.70 0.55 0.45 1.1
## O1 0.47 0.41 0.59 2.2
## O2 0.70 0.48 0.52 1.2
## O3 0.33 0.34 0.42 0.48 0.52 3.1
## O4 0.43 0.28 0.72 1.7
## O5 0.71 0.51 0.49 1.2
## O6 0.32 0.15 0.85 2.0
## A1 0.42 0.41 0.59 2.9
## A2 0.63 0.46 0.54 1.3
## A3 0.35 0.63 0.61 0.39 1.9
## A4 0.73 0.56 0.44 1.1
## A5 0.52 0.32 0.68 1.5
## A6 0.46 0.37 0.36 0.64 2.1
## C1 0.56 0.58 0.42 1.6
## C2 0.66 0.41 0.59 1.1
## C3 0.62 0.52 0.48 1.4
## C4 0.72 0.59 0.41 1.2
## C5 0.73 0.69 0.31 1.2
## C6 0.50 0.40 0.60 2.2
##

## ML1 ML4 ML3 ML2 ML5
## SS loadings 3.52 3.31 2.89 2.80 2.01
## Proportion Var 0.12 0.11 0.10 0.09 0.07
## Cumulative Var 0.12 0.23 0.32 0.42 0.48
## Proportion Explained 0.24 0.23 0.20 0.19 0.14
## Cumulative Proportion 0.24 0.47 0.67 0.86 1.00
##
## With factor correlations of
## ML1 ML4 ML3 ML2 ML5
## ML1 1.00 -0.41 -0.12 -0.13 -0.07
## ML4 -0.41 1.00 0.07 0.21 0.08
## ML3 -0.12 0.07 1.00 0.01 -0.16
## ML2 -0.13 0.21 0.01 1.00 0.32
## ML5 -0.07 0.08 -0.16 0.32 1.00
##
## Mean item complexity = 1.8
## Test of the hypothesis that 5 factors are sufficient.
##
## The degrees of freedom for the null model are 435 and the objective function was
## The degrees of freedom for the model are 295 and the objective function was 1.47
##
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.04
##
## The harmonic number of observations is 1000 with the empirical chi square 857.09
## The total number of observations was 1000 with Likelihood Chi Square = 1443.45 with prob < 1.8e-150
##
## Tucker Lewis Index of factoring reliability = 0.863
## RMSEA index = 0.062 and the 90 % confidence intervals are 0.059 0.066
## BIC = -594.34
## Fit based upon off diagonal values = 0.98
## Measures of factor score adequacy
## ML1 ML4 ML3 ML2 ML5
## Correlation of (regression) scores with factors 0.94 0.93 0.92 0.92 0.88
## Multiple R square of scores with factors 0.89 0.87 0.85 0.85 0.78
## Minimum correlation of possible factor scores 0.78 0.75 0.69 0.69 0.56

Note. I use a 0.32 cut-off for non-trivial loadings since this translates into 10%
of variance shared with the factor in an orthogonal solution. This value is quite
arbitrary, and many use 0.3 instead.
QUESTION 7. Refer to the NEO facet descriptions in the supplementary
document to hypothesize why any non-trivial cross-loadings may have occurred.
Examine columns h2 and u2 of the factor loading matrix. The h2 column is the
facet’s communality (proportion of variance due to all of the common factors),
and u2 is the uniqueness (proportion of variance unique to this facet). The
communality and uniqueness sum to 1.
QUESTION 8. Based on communalities and uniquenesses, which facets are
the best and worst explained by the factor model we just tested?
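
A convenient way to answer this question is to rank the facets by their communalities, using fields stored in the fitted psych object (a shortcut, rather than eyeballing the printed matrix):

# facets sorted from worst to best explained by the 5 common factors
sort(FA_neo$communality)
# uniquenesses are the complement: u2 = 1 - h2
FA_neo$uniquenesses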
QUESTION 9. Refer to the earlier summary output to find and interpret the
factor correlation matrix. Which factors correlate the strongest?

Step 6. Saving your work

After you have finished working with this exercise, save your R script by pressing
the Save icon in the script window. To keep all the created objects (such as
FA_neo), which might be useful when revisiting this exercise, save your entire
work space. To do that, when closing the project by pressing File / Close
project, make sure you select Save when prompted to save your “Workspace
image”.

11.4 Solutions
Q1. MSA = 0.87. The data are “meritorious” according to Kaiser’s guidelines
(i.e. very much suitable for factor analysis). Refer to Exercise 7 to remind
yourself of this index and its interpretation.
Q2. There is a clear change from a steep slope (“mountain”) to a shallow slope
(“scree” or rubble) in the Scree plot. The first five factors form the mountain;
factor #6 and all subsequent factors belong to the rubble pile, and therefore
we should proceed with 5 factors. Parallel analysis confirms this decision –
five factors of the blue plot (observed data) are above the red dotted line (the
simulated random data).
Q3. Chi-square=1443.45 on DF=295 (p=1.8E-150). The p value is given in the
scientific format, with 1.8 multiplied by 10 to the power of -150. It is so tiny
(divided by 10 to the power of 150) that we can just say p<0.001. We should
reject the model, because the probability of observing the given correlation
matrix in the population where the 5-factor model is true (the hypothesis we
are testing) is vanishingly small (p<0.001). However, chi-square test is very
sensitive to the sample size, and our sample is large. We must judge the fit
based on other criteria, for instance residuals.
Q4. The root mean square of the residuals (RMSR) is 0.03. This is a very
small value, indicating that the 5-factor model reproduces the observed
correlations well.

Q5. All residuals are very small (close to zero), and none are above 0.1 in
absolute value. Since the 30x30 matrix of residuals is somewhat hard to eyeball,
you can create a histogram of the values pulled by residuals.psych():

hist(residuals.psych(FA_neo, diag=FALSE))

[Figure: histogram of the values returned by residuals.psych(FA_neo, diag = FALSE) — the residuals cluster tightly around zero, all within about ±0.10]

Q6. ML1=Neuroticism; ML4=Conscientiousness; ML3=Agreeableness;
ML2=Extraversion; ML5=Openness (see the salient loadings marked in Q7).
Q7. There are several cross-loadings for facets of Agreeableness (factor ML3)
and Extraversion (factor ML2) - see the loadings below. For instance, E4 (Ac-
tivity), described as “pace of living” loads on Extraversion as expected (0.36)
but even slightly stronger on Conscientiousness (0.39) and negatively on Agree-
ableness (-0.38). Perhaps the need to meet deadlines and achieve (part of Con-
scientiousness) has an effect on the “pace of living”, as well as the need to “get
ahead” (the low end of Agreeableness).
Q8. Facets O6 (Values) and O4 (Ideas) are worst explained by the factors
in this model. Their communalities (see column “h2”) are only 15% and 28%
respectively. O6 Values is an unusual scale, quite different from other NEO
scales. It has attitudinal rather than behavioural statements, such as “I be-
lieve letting students hear controversial speakers can only confuse and mislead
them”. Facets N3 (Depression) and C5 (Self-Discipline) are best explained by
the factors in this model.
Q9. Overall, the five factors correlate quite weakly, but they are not orthogonal.
ML1 (Neuroticism) correlates negatively with all other factors; the rest of the
factors correlate positively with each other. The strongest correlation is between
ML1 (Neuroticism) and ML4 (Conscientiousness) (r=–.41).

Figure 11.1: image of loading matrix with marker items
Part V

ITEM RESPONSE THEORY (IRT)
Exercise 12

Fitting 1PL and 2PL models to dichotomous questionnaire data

Data file EPQ_N_demo.txt


R package ltm

12.1 Objectives
We already fitted a linear single-factor model to the Eysenck Personality Ques-
tionnaire (EPQ) Neuroticism scale in Exercise 8, using tetrachoric correlations
of its dichotomous items. The objective of this exercise is to fit two basic Item
Response Theory (IRT) models - 1-parameter logistic (1PL) and 2-parameter
logistic (2PL) - to the actual dichotomous responses to EPQ Neuroticism items.
After selecting the most appropriate response model, we will plot Item Char-
acteristic Curves, and examine item difficulties and discrimination parameters.
Finally, we will estimate people’s trait scores and their standard errors.

12.2 Worked Example - Fitting 1PL and 2PL models to EPQ Neuroticism items
To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.
This exercise makes use of the data we considered in Exercises 6 and 8. These
data come from a large cross-cultural study (Barrett, Petrides, Eysenck &
Eysenck, 1998), with N = 1,381 participants who completed the Eysenck Personality
Questionnaire (EPQ). The focus of our analysis here will be the Neuroticism/Anxiety
(N) scale, measured by 23 items with only two response options -
either “YES” or “NO”, for example:

N_3 Does your mood often go up and down?
N_7 Do you ever feel "just miserable" for no reason?
N_12 Do you often worry about things you should not have done or said?
etc.

You can find the full list of EPQ Neuroticism items in Exercise 6. Please
note that all items indicate “Neuroticism” rather than “Emotional Stability”
(i.e. there are no counter-indicative items).

Step 1. Opening and preparing the data

If you have already worked with this data set in Exercises 6 and 8, the sim-
plest thing to do is to continue working within one of the projects created back
then. In RStudio, select File / Open Project and navigate to the folder and the
project you created. You should see the data frame EPQ appearing in your
Environment tab, together with other objects you created and saved.

If you have not completed relevant exercises, or simply want to start from
scratch, download the data file EPQ_N_demo.txt into a new folder and
follow instructions from Exercise 6 on creating a project and importing the
data.

EPQ <- read.delim(file="EPQ_N_demo.txt")

The object EPQ should appear on the Environment tab. Click on it and the
data will be displayed on a separate tab. As you can see, there are 26 variables
in this data frame, beginning with participant id, age and sex (0 = female; 1
= male). These demographic variables are followed by 23 item responses, which
are either 0 (for “NO”) or 1 (for “YES”). There are also a few missing responses,
marked with NA.

Because the item responses are stored in columns 4 to 26 of the EPQ data
frame, we will have to refer to them as EPQ[ ,4:26] in future analyses. More
conveniently, we can create a new object containing only the item response data:

N_data <- EPQ[ ,4:26]



Step 2. Fit a 1-parameter logistic (1PL or Rasch) model

You will need package ltm (stands for “latent trait modelling”) installed on your
computer. Select Tools / Install packages… from the RStudio menu, type in ltm
and click Install. Make sure you load the installed package into the working
memory:

library(ltm)

To fit a 1-parameter logistic (1PL) model (which is mathematically equivalent
to the Rasch model), use the rasch() function on the N_data object with item responses:

fit1PL <- rasch(N_data)

Once the command has been executed, examine the model results by calling
summary(fit1PL).
You should see some model fit statistics, and the estimated item parameters.
Take note of the log likelihood (see Model Summary at the beginning of the
output). Log likelihood can only be interpreted in relative terms (compared
to a log likelihood of another model), so you will have to wait for results from
another model before using it.
For item parameters printed in a more convenient format, call

# item parameters in convenient format
coef(fit1PL)

## Dffclt Dscrmn
## N_3 -0.57725861 1.324813
## N_7 -0.18590831 1.324813
## N_12 -1.35483500 1.324813
## N_15 0.89517911 1.324813
## N_19 -0.32174146 1.324813
## N_23 -0.06547699 1.324813
## N_27 0.10182697 1.324813
## N_31 0.68572728 1.324813
## N_34 -0.41414515 1.324813
## N_38 -0.13645633 1.324813
## N_41 0.80993448 1.324813
## N_47 -0.34271380 1.324813
## N_54 0.78134781 1.324813
## N_58 -0.66528672 1.324813
## N_62 0.97477759 1.324813
## N_66 -0.34506389 1.324813
## N_68 0.25343459 1.324813
## N_72 -0.45539741 1.324813
## N_75 0.57519145 1.324813
## N_77 0.34614593 1.324813
## N_80 -0.45161535 1.324813
## N_84 -1.43292247 1.324813
## N_88 -2.03201824 1.324813

You should see the difficulty and discrimination parameters. Difficulty refers to
the latent trait score at the point of inflection of the probability function (item
characteristic curve or ICC); this is the point where the curve changes from
convex to concave. In 1-parameter and 2-parameter logistic models, this point
corresponds to the latent trait score where the probability of 'YES' response
equals the probability of 'NO' response (both P=0.5). Discrimination refers to
the steepness (slope) of the ICC at the point of inflection (at the item difficulty).
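To make these definitions concrete, here is a minimal sketch of the 2PL response function (our own helper, not part of ltm); the 1PL is the special case where the slope a is the same for all items:

# 2PL item response function: P(YES) = 1 / (1 + exp(-a*(z - b)))
icc_2pl <- function(z, a, b) plogis(a * (z - b))
# e.g. with the common 1PL slope a = 1.32 and difficulty b = -0.58 (item N_3):
icc_2pl(z = -0.58, a = 1.32, b = -0.58)  # exactly 0.5 at the item's difficulty
icc_2pl(z = 0, a = 1.32, b = -0.58)      # about 0.68 at the average trait level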

QUESTION 1. Examine the 1PL difficulty parameters. Do the difficulty
parameters vary widely? What is the 'easiest' item in this set? What is the
most ‘difficult’ item? Examine the phrasing of the corresponding items (refer
to the item list in Appendix) – can you see why one item is easier to agree with
than the other?

QUESTION 2. Examine the 1PL discrimination parameters. Why is the
discrimination parameter the same for every one of the 23 items?

Now plot the item characteristic curves for this model using the plot() function:

plot(fit1PL)

[Figure: Item Characteristic Curves for the 1PL model. x-axis: Ability (-4 to 4); y-axis: Probability (0 to 1); one labelled curve per item, all running in parallel.]

Examine the item characteristic curves (ICCs). First, you should notice that
they run in "parallel" to each other, and do not cross. This is because the slopes
are constrained to be equal in the 1PL model. Try to estimate the difficulty
levels of the most extreme items on the left and on the right, just by looking at
the approximate trait values where the corresponding probabilities equal 0.5. Check
whether these values equal the difficulty parameters printed in the output.
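If the 0.5 crossing points are hard to judge by eye, one option (our own addition; ltm plots use base R graphics, so abline() draws on top of the last plot) is to add a reference line:

plot(fit1PL)
abline(h = 0.5, lty = 2)  # each ICC crosses this dashed line at its difficulty value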

Step 3. Fit a 2-parameter logistic model

To fit a 2-parameter logistic (2PL) model, use the ltm() function (which stands for
'latent trait model') as follows:

fit2PL <- ltm(N_data ~ z1)

The function uses a formula, described in the ltm package manual. The formula
follows regression conventions used by base R and many packages – it specifies
that items in the set N_data are ‘regressed on’ (~) one latent trait (z1). Note
that z1 is not an arbitrary name; it is actually fixed in the package. At most,
two latent traits can be fitted (z1 + z2). We are only fitting one trait and
therefore we specify ~ z1.
When the command has been executed, see the results by calling summary(fit2PL).
You should see some model fit statistics, and then estimated item parameters
– difficulties and discriminations. Take note of the log likelihood for this
model. We will test how the 2PL model compares to the 1PL model later using
this value.
For the item parameters printed in a convenient format, call

# item parameters in convenient format
coef(fit2PL)

## Dffclt Dscrmn
## N_3 -0.51605652 1.6321883
## N_7 -0.17666140 1.4774008
## N_12 -1.22040224 1.5911491
## N_15 1.03197005 1.0570904
## N_19 -0.26737063 1.9740413
## N_23 -0.06402011 1.5006632
## N_27 0.09002533 1.5308778
## N_31 0.59894671 1.6672151
## N_34 -0.32318252 2.3270028
## N_38 -0.13268055 1.4266667
## N_41 0.75664387 1.4738613
## N_47 -0.52960514 0.7057917
## N_54 1.08776504 0.8186594
## N_58 -0.69064694 1.2397378
## N_62 1.05886259 1.1532931
## N_66 -0.42109219 0.9640119
## N_68 0.32362475 0.9217228
## N_72 -0.40104331 1.6932412
## N_75 0.48193055 1.8070986
## N_77 0.31094924 1.5627476
## N_80 -0.42017648 1.5126054
## N_84 -1.92770607 0.8595893
## N_88 -2.57944224 0.9375370

QUESTION 3. Examine the 2PL difficulty parameters. What is the 'easiest'
item in this set? What is the most 'difficult' item? Are these the same as in
the 1PL model? Examine the phrasing of the corresponding items – can you
see why one item is much easier to agree with than the other?
QUESTION 4. Examine the 2PL discrimination parameters. What is the
most and least discriminating item in this set? Examine the phrasing of the
corresponding items – can you interpret the meaning of the construct we are
measuring in relation to the most discriminating item (as we did for “marker”
items in factor analysis)?
Now plot the item characteristic curves for this model:

plot(fit2PL)

[Figure: Item Characteristic Curves for the 2PL model. x-axis: Ability (-4 to 4); y-axis: Probability (0 to 1); one labelled curve per item, now with differing slopes.]

QUESTION 5. Examine the item characteristic curves. Now the curves
should cross. Can you identify the most and the least discriminating items
from this plot?

Step 4. Compare 1PL and 2PL models

You have fitted two IRT models to the Neuroticism items. These models are
nested - one is a special case of the other (the 1PL model is a special case of the 2PL
with all discrimination parameters constrained equal). The 1PL (Rasch) model
is more restrictive than the 2PL model because it has fewer parameters (23
difficulty parameters and only 1 discrimination parameter, against 23 difficulties
+ 23 discrimination parameters in the 2PL model), and we can test which model
fits better. Use the base R function anova() to compare the models:

anova(fit1PL, fit2PL)

##
## Likelihood Ratio Table
## AIC BIC log.Lik LRT df p.value
## fit1PL 34774.05 34899.59 -17363.03
## fit2PL 34471.99 34712.60 -17189.99 346.06 22 <0.001

This function prints the log likelihood values for the two models (you already
saw them). It computes the likelihood ratio test (LRT) by multiplying each log
likelihood by -2, and then subtracting the lower value from the higher value.
The resulting value is distributed as chi-square, with degrees of freedom equal
to the difference in the number of parameters between the two models.
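If you want to verify this by hand, the fitted ltm objects store the converged log likelihood in their $log.Lik component, so the test can be reproduced in a couple of lines (a sketch of our own):

LRT <- -2 * fit1PL$log.Lik - (-2 * fit2PL$log.Lik)
LRT                                       # should match the anova() output (about 346)
pchisq(LRT, df = 22, lower.tail = FALSE)  # the p-value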
QUESTION 6. Examine the results of the anova() function. The LRT value is the
likelihood ratio test result (the chi-square statistic). What are the degrees of
freedom? Can you explain why the degrees of freedom are what they are? Is the
difference between the two models significant? Which model would you retain
and why?

Step 5. Estimate Neuroticism trait scores for people

Now we can score people in this sample using function factor.scores() and
Empirical Bayes method (EB). We will use the better-fitting 2PL model.

Scores <- factor.scores(fit2PL, method="EB", resp.patterns=N_data)

The estimated scores together with their standard errors will be stored for each
respondent in a new object Scores. Check out what components are stored in
this object by calling head(Scores). It appears that the estimated trait scores
(z1) and their standard errors (se.z1) are stored in $score.dat part of the
Scores object.
To make our further work with these values easier, let’s assign them to new
variables in the EPQ data frame:

# the Neuroticism factor score
EPQ$Zscore <- Scores$score.dat$z1
# Standard Error of the factor score
EPQ$Zse <- Scores$score.dat$se.z1

Now, you can plot the histogram of the estimated IRT scores, by calling
hist(EPQ$Zscore).
You can also examine relationships between the IRT estimated scores and simple
sum scores. This is how we computed the sum scores previously (see Exercise
6):

EPQ$Nscore <- rowMeans(N_data, na.rm=TRUE)*23

Then plot the sum score against the IRT score. Note that the relationship
between these scores is very strong but not linear. Rather, it resembles a logistic
curve.

plot(EPQ$Zscore, EPQ$Nscore)
[Scatterplot: EPQ$Nscore (y-axis, 0 to 20+) against EPQ$Zscore (x-axis, -2 to 2).]

Step 6. Evaluate the Standard Errors of measurement

Now let’s plot the IRT estimated scores against their standard errors:

plot(EPQ$Zscore, EPQ$Zse)
[Scatterplot: EPQ$Zse (y-axis, 0.30 to 0.50+) against EPQ$Zscore (x-axis, -2 to 2).]

QUESTION 7. Examine the graph. What range of the Neuroticism trait
score is measured with most precision? What are the smallest and the largest
Standard Errors on this graph, approximately?
[Hint. You can also get the exact values by calling min(EPQ$Zse) and
max(EPQ$Zse).]

A very interesting feature of this graph is that a few points are out of line with
the majority (which form a very smooth curve). This means that SEs can be
different for people with the same latent trait score. Specifically, SEs can be
larger for some people (note that the outliers are always above the trend, not
below). The reason for it is that some individuals had missing responses, and
because fewer items provided information for their score estimation, their SEs
were larger than for individuals with complete response sets.
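You can check this explanation with a couple of lines of our own: count the missing responses per person and highlight those respondents on the plot.

n_miss <- rowSums(is.na(N_data))  # number of missing item responses per person
plot(EPQ$Zscore, EPQ$Zse, col = ifelse(n_miss > 0, "red", "black"))
# the off-trend points should all appear in red (respondents with missing data)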

Step 7. Saving your work

After you finished this exercise, you may close the project. Do not forget to
save your R script, giving it a meaningful name, for example “IRT analysis of
EPQ_N”.

Please also make sure that you save your entire ‘workspace’, which includes the
data frame and all the new objects you created. This will be useful for your
revision.

12.3 Solutions
Q1. The difficulty parameters of Neuroticism items vary widely. The ‘easiest’
item to endorse is N_88, with the lowest difficulty value (-2.03). N_88 is
phrased “Are you touchy about some things?”, and according to the 1PL model,
at very low neuroticism level z=–2.03 (remember, the trait score is scaled like
z-score), the probability of agreeing with this item is already 0.5. The most
‘difficult’ item to endorse in this set is N_62, with the highest difficulty value
(0.97). N_62 is phrased "Do you often feel life is very dull?", and according to
this model, one needs to have a neuroticism level of at least 0.97 to endorse this
item with probability 0.5. The phrasing of this item is indeed more extreme
than that of item N_88, and it is therefore more 'difficult' to endorse.
Q2. Only one discrimination parameter is printed for this set because the 1PL
(Rasch) model assumes that all items have equal discriminations. Therefore the
model constrains all discriminations to be equal.
Q3. The 'easiest' item to endorse in this set is still item N_88, which now has
the difficulty value -2.58. The most 'difficult' item to endorse in this set is now
item N_54, with a difficulty value of 1.09. The most difficult item from the
1PL model, N_62, has a similarly high difficulty (1.06). N_54 is phrased "Do
you suffer from sleeplessness?", and according to this model, one needs to have
a neuroticism level of at least 1.09 to endorse this item with probability 0.5.
Q4. The most discriminating item in this set is N_34 (Dscrmn.=2.33). This
item reads “Are you a worrier?”, which seems to be right at the heart of the
Neuroticism construct. This is the item which would have the highest factor
loading in factor analysis of these data. The least discriminating item is N_47
(Dscrmn.=0.71), reading “Do you worry about your health?”. According to this
model, the item is more marginal to the general Neuroticism construct (perhaps
it tackles a context-specific behaviour).
Q5. The most and the least discriminating items can be easily seen on the
plot. The most discriminating item N_34 has the steepest slope, and the least
discriminating item N_47 has the shallowest slope.
Q6. The degrees of freedom = 22. This is made up by the difference in the
number of item parameters estimated. The Rasch model estimated 23 difficulty
parameters, and only 1 discrimination parameter (one for all items). The 2PL
model estimated 23 difficulty parameters, and 23 discrimination parameters.
The difference between the two models = 22 parameters.
The chi-square value 346.06 on 22 degrees of freedom is highly significant, so
the 2PL model fits the data much better and we have to prefer it to the more
parsimonious but worse fitting 1PL model.
Q7. Most precise measurement is achieved in the range between about z=-0.2
and z=0. The smallest standard error was about 0.3 (exact value 0.299). The
largest standard error was about 0.57 (exact value 0.573).
Exercise 13

Fitting a graded response model to polytomous questionnaire data

Data file likertGHQ28sex.txt
R packages ltm, psych

13.1 Objectives

The objective of this exercise is to analyse polytomous questionnaire responses
(i.e. responses in multiple categories) using Item Response Theory (IRT). You
will use Samejima’s Graded Response Model (GRM), which is a popular choice
for analysis of Likert scales where response options are ordered categories. After
fitting and testing the GRM, you will plot Item Characteristic Curves, and
examine item parameters. Finally, you will estimate people’s trait scores and
their standard errors.

13.2 Study of emotional distress using General Health Questionnaire (GHQ-28)

Data for this exercise come from the Medical Research Council National Survey
of Health and Development (NSHD), also known as the British 1946 birth cohort
study. The data pertain to a wave of interviewing undertaken in 1999, when
the participants were aged 53.


At that point, N = 2,901 participants (1,422 men and 1,479 women) completed
the 28-item version of General Health Questionnaire (GHQ-28). The GHQ-28
was developed as a screening questionnaire for detecting non-psychotic psy-
chiatric disorders in community settings and non-psychiatric clinical settings
(Goldberg, 1978). You can find a short description of the GHQ-28 in the data
archive available with this book.

The questionnaire is designed to measure 4 facets of mental/emotional distress -
somatic symptoms (items 1-7), anxiety/insomnia (items 8-14), social dysfunction
(items 15-21) and severe depression (items 22-28). The focus of our analysis
here will be the Somatic Symptoms scale, measured by 7 items. For each item,
respondents need to think how they felt in the past 2 weeks:

Have you recently:

1) been feeling perfectly well and in good health?
2) been feeling in need of a good tonic?
3) been feeling run down and out of sorts?
4) felt that you are ill?
5) been getting any pains in your head?
6) been getting a feeling of tightness or pressure in your head?
7) been having hot or cold spells?

Please note that some items indicate emotional health and some emotional dis-
tress; however, 4 response options for each item are custom-ordered depending
on whether the item measures health or distress, as follows:

For items indicating DISTRESS:

1 = not at all;
2 = no more than usual;
3 = rather more than usual;
4 = much more than usual

For items indicating HEALTH:

1 = better than usual;
2 = same as usual;
3 = worse than usual;
4 = much worse than usual.

With this coding scheme, for every item, the score of 1 indicates good health
or lack of distress; and the score of 4 indicates poor health or a lot of distress.
Therefore, high scores on the questionnaire indicate emotional distress.

13.3 A Worked Example - Fitting a Graded Response model to Somatic Symptoms items

To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions. To practice further, you may repeat the
analyses for the remaining GHQ-28 subscales.

Step 1. Opening and examining the data

First, download the data file likertGHQ28sex.txt into a new folder and follow
instructions from Exercise 1 on creating a project and loading the data. As the
file here is in the tab-delimited format (.txt), use function read.table() to
import data into the data frame we will call GHQ.

GHQ <- read.table(file="likertGHQ28sex.txt", header=TRUE)

The object GHQ should appear on the Environment tab. Click on it and the
data will be displayed on a separate tab. You should see 33 variables, beginning
with the 7 items for Somatic Symptoms scale, SOMAT_1, SOMAT_2, …
SOMAT_7, and followed by items for the remaining subscales. They are
followed by participant sex (0 = male; 1 = female). The last 4 variables in the
file are sum scores (sums of relevant 7 items) for each of the 4 subscales. There
are NO missing data.

Step 2. Preparing data for analysis

Because the item responses for Somatic Symptoms scale are stored in columns
1 to 7 of the GHQ data frame, we can refer to them as GHQ[ ,1:7] in future
analyses. More conveniently, we can create a new object containing only the
relevant item response data:

somatic <- GHQ[ ,1:7]

Before attempting to fit an IRT model, it would be good to examine whether the
7 items form a homogeneous scale (measure just one latent factor in common).
To this end, we may use Parallel Analysis functionality of package psych. For
a reminder of this procedure, see Exercise 7. In this case, we will treat the
4-point ordered responses as ordinal rather than interval scales, and ask for the
analysis to be performed from polychoric correlations (cor="poly") rather than
Pearson’s correlations that we used before:

library(psych)
fa.parallel(somatic, cor="poly", fa="pc")

[Figure: Parallel Analysis Scree Plot. y-axis: eigenvalues of principal components (0 to 3+); x-axis: Component Number (1 to 7); lines for PC Actual Data, PC Simulated Data and PC Resampled Data.]

## Parallel analysis suggests that the number of factors = NA and the number of components = 2

QUESTION 1. Examine the Scree plot. Does it support the hypothesis that
only one factor underlies responses to Somatic Symptoms items? Why? Does
your conclusion agree with the text output of the parallel analysis function?

Step 3. Interpreting parameters of a Graded Response Model

For IRT analyses, you will need package ltm (stands for “latent trait modelling”)
installed on your computer. Select Tools / Install packages… from the RStudio
menu, type in ltm and click Install. Make sure you load the installed package
into the working memory before starting the analyses below.

library(ltm)

To fit a Graded Response Model (GRM), use grm() function on somatic object
containing item responses:

fit1 <- grm(somatic)
fit1

##
## Call:
## grm(data = somatic)
##
## Coefficients:
## Extrmt1 Extrmt2 Extrmt3 Dscrmn
## SOMAT_1 -1.707 1.281 2.889 2.350
## SOMAT_2 -0.447 0.791 1.944 2.856
## SOMAT_3 -0.375 0.793 1.965 3.743
## SOMAT_4 0.270 1.231 2.332 3.223
## SOMAT_5 1.014 3.252 5.742 1.092
## SOMAT_6 1.328 2.895 5.392 1.234
## SOMAT_7 0.681 2.754 5.176 0.808
##
## Log.Lik: -15376.89

Examine the basic output above. Take note of the Log.Lik (log likelihood) at
the end of the output. Log likelihood can only be interpreted in relative terms
(compared to a log likelihood of another model), so you will have to wait for
results from another model before using it.
Examine the Extremity and Discrimination parameters. Extremity parameters
are really extensions of the difficulty parameters in binary IRT models, but
instead of describing the threshold between two possible answers in binary items
(e.g. ‘YES’ or ‘NO’), they contrast response category 1 with those above it, then
category 2 with those above 2, etc. There are k-1 extremity parameters for an
item with k response categories. In our 4-category items, Extrmt1 refers to
the latent trait score z at which the probability of selecting any category >1
(2 or 3 or 4) equals 0.5. Below this point, selecting category 1 is more likely
than selecting 2, 3 or 4; and above this point selecting categories 2, 3 or 4 is
more likely. Extrmt2 refers to the latent trait score z at which the probability
of selecting any categories >2 (3 or 4) equals 0.5. And Extrmt3 refers to the
latent trait score z at which the probability of selecting any categories >3 (just
4) equals 0.5. The GRM assumes that the thresholds are ordered - that is,
switching from one response category to the next happens at higher latent trait
score values. That is why the GRM is a suitable model for Likert-type responses,
which are supposedly ordered categories. Discrimination refers to the steepness
(slope) of the response characteristic curves at the extremity points. In GRM,
this parameter is assumed identical for all categories, so the category curves
have the same steepness.
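To make this concrete, here is a minimal sketch (our own helper, not part of ltm) of how the GRM turns these parameters into category probabilities: the cumulative probability P(X >= k) is a logistic function of the trait, and adjacent cumulative probabilities are differenced to get the probability of each category.

grm_probs <- function(z, a, b) {
  # P(X >= k) for k = 1..K+1: bounded by 1 at the bottom and 0 at the top
  p_star <- c(1, plogis(a * (z - b)), 0)
  -diff(p_star)  # P(X = k) = P(X >= k) - P(X >= k+1)
}
# example: SOMAT_1 parameters from the output above, at the average trait level z = 0
grm_probs(z = 0, a = 2.350, b = c(-1.707, 1.281, 2.889))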
This will be easier to understand by looking at Operation Characteristic Curves
(OCC). These can be obtained easily by calling the plot() function. If the default
setting items=NULL is used, then OCCs will be produced for all items and the
user can move from one to the next by hitting Return. To obtain the OCC for a
particular item, specify the item number:

# obtain Operation Characteristic Curves for item #1
plot(fit1, type="OCCu", items=1)

[Figure: Item Operation Characteristic Curves for item SOMAT_1. x-axis: Ability (-4 to 4); y-axis: Probability (0 to 1); three curves labelled 1, 2 and 3.]

It can be seen that for item 1, Extrmt1= -1.707 is the trait score (the plot refers
to it as “Ability” score) at which the OCC for category 1 crosses Probability 0.5;
Extrmt2= 1.281 corresponds to the trait score at which OCC 2 crosses P=0.5,
and Extrmt3= 2.889 corresponds to the trait score at which OCC 3 crosses
P=0.5.
Plot and examine OCCs for all items.
QUESTION 2. Examine the OCC plots and the corresponding Extremity
parameters. How are the extremities spaced out for different items? Do the
extremity values vary widely? What are the most extreme values in this set?
Examine the phrasing of the corresponding items – can you see why some cat-
egories are “easy” or “difficult” to select?
QUESTION 3. Examine the Discrimination parameters. What is the most
and least discriminating item in this set? Examine the phrasing of the corre-
sponding items – can you interpret the meaning of the construct we are mea-
suring in relation to the most discriminating item (as we did for “marker” items
in factor analysis)?

Now we are ready to look at Item Characteristic Curves (ICC) - the most useful
plot in practice of applying an IRT model. An ICC for each item will consist of
as many curves as there are response categories, in our case 4. Each curve will
represent the probability of selecting this particular response category, given the
latent trait score. Let’s obtain the ICC for item SOMAT_3 by calling plot()
function and specifying type="ICC":

# obtain Item Characteristic Curves for item #3
plot(fit1, type="ICC", items=3)

[Figure: Item Response Category Characteristic Curves for item SOMAT_3. x-axis: Ability (-4 to 4); y-axis: Probability (0 to 1); four curves, one per response category 1 to 4.]

In this pretty plot, each response category from 1 to 4 has its own curve in
its own colour. For item SOMAT_3 indicating distress, curve #1 (in black)
represents the probability of selecting response “not at all”. You should be able
to see that at the extremely low values of somatic symptoms (low distress), the
probability of selecting this response approaches 1, and the probability goes
down as the somatic symptoms score goes up (distress increases). Somewhere
around z=-0.5, the probability of selecting response #2 “no more than usual”
plotted in pink becomes equal to the probability of selecting category #1 (the
black and the pink lines cross), and then starts increasing as the somatic symp-
toms score goes up. The pink line #2 peaks somewhere around z=0.2 (which
means that at this level of somatic symptoms, response #2 “no more than usual”
is most likely), and then starts coming down. Then somewhere close to z=1,
the pink line crosses the green line for response #3 “rather more than usual”, so
that this response becomes more likely. Eventually, at about z=2 (high levels of
distress), response #4 “much more than usual” (light blue line) becomes more
likely and it reaches the probability close to 1 for extremely high z scores.
Plot and examine ICCs for all remaining items. Try to interpret the ICC for item
SOMAT_1, which is the only item indicating health rather than distress. See
how with the increasing somatic symptom score, the probability is decreasing for
category #1 "better than usual" and increasing for category #4 "much worse
than usual".

Step 4. Testing alternative models and examining model fit

The standard GRM allows all items to have different discrimination parameters,
and we indeed saw that they varied between items. It would be interesting to
see if these differences are significant - that is, if constraining the discrimination
parameters to be equal would result in a significantly worse model fit. We can
check that by fitting a model where all discrimination parameters are constrained equal:

# discrimination parameters are constrained equal for all items
fit2 <- grm(somatic, constrained=TRUE)
fit2

##
## Call:
## grm(data = somatic, constrained = TRUE)
##
## Coefficients:
## Extrmt1 Extrmt2 Extrmt3 Dscrmn
## SOMAT_1 -1.798 1.406 3.122 1.789
## SOMAT_2 -0.423 1.346 2.814 1.789
## SOMAT_3 -0.360 1.340 2.929 1.789
## SOMAT_4 0.470 1.845 3.184 1.789
## SOMAT_5 0.830 2.367 4.173 1.789
## SOMAT_6 1.137 2.330 4.342 1.789
## SOMAT_7 0.480 1.663 2.963 1.789
##
## Log.Lik: -15668.06

Examine the log likelihood (Log.Lik) for fit2, and note that it is different from
the result for fit1.
You have fitted two IRT models to the Somatic Symptoms items. These models
are nested - one is a special case of the other (the constrained model is a special
case of the unconstrained model). The constrained model is more restrictive than
the unconstrained model because it has fewer parameters (the same number of
extremity parameters, but only 1 discrimination parameter in the constrained
model against 7 in the unconstrained model, the difference of 6 parameters),
and we can test which model fits better. Use the base R function anova() to
compare the models - the more constrained one goes first:

anova(fit2, fit1)

##
## Likelihood Ratio Table
## AIC BIC log.Lik LRT df p.value
## fit2 31380.12 31511.52 -15668.06
## fit1 30809.78 30977.02 -15376.89 582.34 6 <0.001

This function prints the Log.Lik values for the two models (you already saw
them). It computes the likelihood ratio test (LRT) by multiplying each log
likelihood by -2, and then subtracting the lower value from the higher value.
The resulting value is distributed as chi-square, with degrees of freedom equal
to the difference in the number of parameters between the two models.
QUESTION 4. Examine the results of the anova() function. The LRT value is the
likelihood ratio test result (the chi-square statistic). What are the degrees of
freedom? Can you explain why the degrees of freedom are what they are? Is the
difference between the two models significant? Which model would you retain
and why?
Now, let’s see if the unconstrained GRM model (which we should prefer) fits
the data well. For this, we will look at so-called two-way margins. They are
obtained by taking two variables at a time, making a contingency table, and
computing the observed and expected frequencies of bivariate responses under
the GRM model (for instance, response #1 to item1 and #1 to item2; response
#1 to item1 and #2 to item2 etc.). The Chi-square statistic is computed from
each pairwise contingency table. A significant Chi-square statistic (greater than
3.8 for 1 degree of freedom) would indicate that the expected bivariate response
frequencies are significantly different from the observed. This means that a pair
of items has relationships beyond what is accounted for by the IRT model. In
other words, the local independence assumption of IRT is violated - the items are
not independent after controlling for the latent trait (dependence remains after the
IRT model controls for the influences due to the latent trait). Comparing the
observed and expected two-way margins is analogous to comparing the observed
and expected correlations when judging the fit of a factor analysis model.
To compute these so-called Chi-squared residuals, the package ltm has the function
margins().

margins(fit1)

##
## Call:
## grm(data = somatic)
##
## Fit on the Two-Way Margins
##
## SOMAT_1 SOMAT_2 SOMAT_3 SOMAT_4 SOMAT_5 SOMAT_6 SOMAT_7
## SOMAT_1 - 449.93 501.91 373.66 148.71 152.32 149.56
## SOMAT_2 *** - 328.58 253.63 154.07 153.28 145.72
## SOMAT_3 *** *** - 163.16 124.78 133.09 118.25
## SOMAT_4 *** *** *** - 102.39 96.00 99.20
## SOMAT_5 *** *** *** *** - 879.46 80.40
## SOMAT_6 *** *** *** *** *** - 82.40
## SOMAT_7 *** *** *** *** *** *** -
##
## '***' denotes pairs of items with lack-of-fit

QUESTION 5. Examine the Chi-squared residuals and find the largest one.
Which pair of items does it pertain to? Can you interpret why this pair of items
violates the local independence assumption?
The analysis of two-way margins confirms the results we obtained earlier from
Parallel Analysis - that there is more than one factor underlying responses to
the somatic items. At the very least, there is a second factor pertaining to a
local dependency between two somatic items.

Step 5. Estimating trait scores for people

Now we can score people in this sample using function factor.scores() and
Empirical Bayes method (EB). We will use the better-fitting unconstrained model (fit1).

Scores <- factor.scores(fit1, method="EB", resp.patterns=somatic)

The estimated scores together with their standard errors will be stored for each
respondent in a new object Scores. Check out what components are stored in
this object by calling head(Scores). It appears that the estimated trait scores
(z1) and their standard errors (se.z1) are stored in $score.dat part of the
Scores object.
To make our further work with these values easier, let’s assign them to new
variables in the GHQ data frame:

# Somatic Symptoms factor score
GHQ$somaticZ <- Scores$score.dat$z1
# Standard Error of the factor score
GHQ$somaticZse <- Scores$score.dat$se.z1

Now, you can plot the histogram of the estimated IRT scores, by calling
hist(GHQ$somaticZ, breaks=20).

You can also examine relationships between the IRT estimated score and the
simple sum score stored in the variable SUM_SOM. [Since there are no missing
data, this variable can be computed using the function rowSums(somatic).] You
can plot the sum score against the IRT score.
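Before plotting, you can optionally confirm that the stored sum score matches a recomputed one (a one-line check of our own):

all(GHQ$SUM_SOM == rowSums(somatic))  # should return TRUE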

plot(GHQ$somaticZ, GHQ$SUM_SOM)
[Scatterplot: GHQ$SUM_SOM (y-axis, 10 to 25+) against GHQ$somaticZ (x-axis, -1 to 3).]

Note that the relationship between these scores is very strong, and almost linear.
It is certainly more linear than the relationship in Exercise 12, which resembled
a logistic curve. This suggests that the more response categories there are, the
closer IRT models approximate linear models.
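You can quantify this observation by correlating the two scores (a quick check of our own; expect a correlation close to 1):

cor(GHQ$somaticZ, GHQ$SUM_SOM)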

Step 6. Evaluating the Standard Errors of measurement

Now let’s plot the IRT estimated scores against their standard errors:

plot(GHQ$somaticZ, GHQ$somaticZse)
[Scatterplot: GHQ$somaticZse (y-axis, 0.30 to 0.50+) against GHQ$somaticZ (x-axis, -1 to 3).]

QUESTION 6. Examine the graph. What range of the somatic symptoms
score is measured with most precision? What are the smallest and the largest
Standard Errors on this graph, approximately?
[Hint. You can also get the exact values by calling min(GHQ$somaticZse) and
max(GHQ$somaticZse).]

A very interesting feature of this graph is that some points are out of line
with the majority (which form a relatively smooth curve). This means that
Standard Errors can be different for people with the same latent trait score.
Specifically, SEs can be larger for some people (note that the outliers are always
above the trend, not below). The reason for it is that some individuals have
aberrant response patterns - patterns not in line with the GRM model (for
example, endorsing more ‘difficult’ response categories for some items while not
endorsing ‘easier’ categories for other items). That makes estimating their score
less certain.
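If you are curious, you can inspect the response patterns behind the largest standard errors (our own inspection code):

idx <- order(GHQ$somaticZse, decreasing = TRUE)[1:5]  # five largest SEs
cbind(somatic[idx, ], z = GHQ$somaticZ[idx], se = GHQ$somaticZse[idx])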

Step 7. Saving your work

After you finished this exercise, you may want to save the new variables you
created in the data frame GHQ. You may store the data in the R internal
format (extension .RData), so next time you can use function load() to read
it.

save(GHQ, file="likertGHQ28sex.RData")

Before closing the project, do not forget to save your R script, giving it a
meaningful name, for example “Fitting GRM model to GHQ”.

13.4 Solutions
Q1. According to the Scree plot of the observed data (in blue) the 1st factor
accounts for a substantial amount of variance compared to the 2nd factor. There
is a large drop from the 1st factor to the 2nd; however, the break from the steep
to the shallow slope is not as clear as it could be, with the 2nd factor probably
still being part of “the mountain” rather than part of “the rubble”. There are
probably 2 factors underlying the data - one major explaining the most shared
variance in the items, and one minor.
Parallel analysis confirms this conclusion, because the simulated data yields a
line (in red) that crosses the observed scree between 2nd and 3rd factor, with
2nd factor significantly above the red line, and 3rd factor below it. It means
that factors 1 and 2 should be retained, and factors 3, 4 and all subsequent
factors should be discarded.
Q2. The extremity parameters vary widely between items. For SOMAT_2,
SOMAT_3, and SOMAT_4, they are spaced more closely than for SOMAT_1
or SOMAT_5, SOMAT_6 and SOMAT_7.
The lowest threshold is between categories 1 and 2 of item SOMAT_1, with
the lowest extremity value (Extrmt1=-1.707). SOMAT_1 is phrased "Have
you recently: been feeling perfectly well and in good health?" (which is an item
indicating health), and according to the GRM model, at the level of Somatic
Symptoms z=-1.707 (which is well below the mean - remember, the trait score
is scaled like a z-score), the probabilities of endorsing "1 = better than usual"
and "2 = same as usual" are equal. People with a Somatic Symptoms score lower
than z=-1.707 will more likely say "1 = better than usual", and those with a score
higher than z=-1.707 (having more Somatic Symptoms) will more likely say "2
= same as usual".
The highest threshold is between categories 3 and 4 of item SOMAT_5, with
the highest extremity value (Extrmt3=5.742). SOMAT_5 is phrased "Have you
recently: been getting any pains in your head?" (which is an item indicating
distress), and according to the GRM model, one needs an extremely high
score of z=5.742 on Somatic Symptoms (more than 5 Standard Deviations
above the mean) to switch from endorsing "3 = rather more than usual" to
endorsing "4 = much more than usual". It is easy to see why this item is more
difficult to agree with than some other items indicating distress (which are all
items except SOMAT_1), as it refers to a pretty severe somatic symptom
- head pains.

Q3. The most discriminating item in this set is SOMAT_3, phrased “Have you
recently: been feeling run down and out of sorts?”. Apparently, this symptom is
most sensitive to the change in overall Somatic Symptoms (a “marker” for this
construct). The least discriminating item in this set is SOMAT_7, phrased
“Have you recently: been having hot or cold spells?”. This symptom is less
sensitive to the change in Somatic Symptoms (peripheral to the meaning of this
construct).
Q4. The degrees of freedom for the likelihood ratio test is df=6. This is made
up by the difference in the number of item parameters estimated. The uncon-
strained model estimated 7x3=21 extremity parameters and 7 discrimination
parameters, 28 parameters in total. The constrained model estimated 7x3=21
extremity parameters and only 1 discrimination parameter (one for all items),
22 parameters in total. The difference between the two models is 28-22=6
parameters.
The chi-square value 582.34 on 6 degrees of freedom is highly significant, so the
unconstrained GRM fits the data much better and we have to prefer it to the
more parsimonious but worse fitting constrained model.
Q5. The largest Chi-squared residual is 879.46, pertaining to the pair SO-
MAT_5 (been getting any pains in your head?) and SOMAT_6 (been get-
ting a feeling of tightness or pressure in your head?). It is pretty easy to see
why this pair of items violates the local independence assumption. Both refer
to one's head - and experiencing "pain" is more likely in people who also
experience "pressure", even after controlling for all other somatic symptoms. In
other words, if we take people with exactly the same overall level of somatic
symptoms (i.e. controlling for the overall trait level), those among them who
experience “pain” will also more likely experience “pressure”. There is a residual
dependency between these two symptoms.
Q6. Most precise measurement is observed in the range between about z=-0.5
and z=2.5. The smallest standard error was about 0.3 (exact value 0.287). The
largest standard error was about 0.58 (exact value 0.581).
Part VI

DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS
Exercise 14

DIF analysis of dichotomous questionnaire items using logistic regression

Data file EPQ_N_demo.txt
R package fmsb

14.1 Objectives
The objective of this exercise is to learn how to screen dichotomous test items
for Differential Item Functioning (DIF) using logistic regression. Item DIF is
present when people of the same ability (or people with the same trait level)
but from different groups have different probabilities of passing/endorsing the
item. In this example, we will screen for DIF with respect to gender.

14.2 Worked Example - Screening EPQ Neuroticism items for gender DIF
To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.
This exercise makes use of the data we considered in Exercises 6, 8 and 12.
These data come from a large cross-cultural study (Barrett, Petrides, Eysenck
& Eysenck, 1998), with N = 1,381 participants who completed the Eysenck
Personality Questionnaire (EPQ). The focus of our analysis here will be the
Neuroticism/Anxiety (N) scale, measured by 23 items with only two response
options - either “YES” or “NO”, for example:

N_3 Does your mood often go up and down?
N_7 Do you ever feel "just miserable" for no reason?
N_12 Do you often worry about things you should not have done or said?
etc.

You can find the full list of EPQ Neuroticism items in Exercise 6. Please
note that all items indicate “Neuroticism” rather than “Emotional Stability”
(i.e. there are no counter-indicative items).

Step 1. Opening and examining the data

If you have already worked with this data set in previous exercises, the simplest
thing to do is to continue working within the project created back then. In RStu-
dio, select File / Open Project and navigate to the folder and the project you
created. You should see the data frame EPQ appearing in your Environment
tab, together with other objects you created and saved.
If you have not completed previous exercises or have not saved your work, or
simply want to start from scratch, download the data file EPQ_N_demo.txt
into a new folder and follow instructions from Exercise 6 on creating a project
and importing the data.

EPQ <- read.delim(file="EPQ_N_demo.txt")

The object EPQ should appear on the Environment tab. Click on it and the
data will be displayed on a separate tab. As you can see, there are 26 variables
in this data frame, beginning with participant id, age and sex (0 = female; 1
= male). These demographic variables are followed by 23 item responses, which
are either 0 (for “NO”) or 1 (for “YES”). There are also a few missing responses,
marked with NA.

Step 2. Creating the trait (matching) variable, and the grouping variable

Any DIF analysis begins with creating a variable that represents the trait score
on which test takers will be matched. (Remember that DIF is a difference in the
probability of endorsing an item for people with the same trait score. We need
to compute the trait score in order to control for it in the analyses.)

You should already know how to compute the sum score when some item re-
sponses are missing. We did this in Exercise 1. You use the base R function
rowMeans() to compute the average item score (omitting NA values from calcu-
lation, na.rm=TRUE), and then multiply the result by 23 (the number of items
in the Neuroticism scale). This will essentially replace any missing responses
with the mean for that individual.
Noting that the item responses are located in columns 4 to 26, compute the
Neuroticism trait score (call it Nscore), and append it to the dataframe as a
new variable:

EPQ$Nscore <- rowMeans(EPQ[ ,4:26], na.rm=TRUE)*23

Next, we need to prepare the grouping variable for DIF analyses. The variable
sex is coded as 0 = female; 1 = male. ATTENTION: this means that male is
the focal group (the group which will be the focus of analysis, and will be compared
to the reference group - here, female). To make it easy to interpret the DIF
analyses, give value labels to this variable as follows:

EPQ$sex <- factor(EPQ$sex,
                  levels = c(0,1),
                  labels = c("female", "male"))

Run the command head(EPQ) again to check that the new variable Nscore and
the correct labels for sex indeed appeared in the data frame.
Next, let's obtain and examine the item means, and the means of Nscore by
sex. An easy way to do this is to apply the base R function colMeans() to only
the males or the females from the sample:

colMeans(EPQ[EPQ$sex=="female",4:27], na.rm=TRUE)

## N_3 N_7 N_12 N_15 N_19 N_23 N_27
## 0.6800459 0.6486797 0.8339061 0.2980437 0.7019563 0.5189873 0.5137931
## N_31 N_34 N_38 N_41 N_47 N_54 N_58
## 0.3839080 0.6877153 0.5977011 0.3487833 0.5986239 0.3390805 0.6953036
## N_62 N_66 N_68 N_72 N_75 N_77 N_80
## 0.2935780 0.6160920 0.4954128 0.6923077 0.4170507 0.4602992 0.7238205
## N_84 N_88 Nscore
## 0.8513825 0.9033372 13.3015915

colMeans(EPQ[EPQ$sex=="male",4:27], na.rm=TRUE)

## N_3 N_7 N_12 N_15 N_19 N_23 N_27
## 0.5660750 0.3661417 0.7322835 0.2727273 0.3641732 0.5059055 0.4043393
## N_31 N_34 N_38 N_41 N_47 N_54 N_58
## 0.2500000 0.4477318 0.4192913 0.2366864 0.5551181 0.2696850 0.5944882
## N_62 N_66 N_68 N_72 N_75 N_77 N_80
## 0.2381890 0.5255906 0.3333333 0.4665354 0.2603550 0.3333333 0.4110672
## N_84 N_88 Nscore
## 0.7381890 0.8698225 10.1637778

QUESTION 1. Who has the higher proportion of endorsing item N_19
– males or females? (Hint. Remember that for binary items coded 0/1, the
item mean is also the proportion of endorsement.) Who scores higher on the
Neuroticism scale (Nscore) on average – males or females? Interpret the means
for N_19 in the light of the Nscore means.
Now you are ready for DIF analyses.

Step 3. Specifying logistic regression models

Now, let’s run DIF analyses for item N_19 (Are your feelings easily hurt?).
We will create 3 logistic regression models. The first, Baseline model, will
include the total Neuroticism score as the only predictor of N_19. Because
this item was designed to measure Neuroticism, Nscore should positively and
significantly predict responses to this item.
By adding sex as another predictor in the second model, we will check for
uniform DIF (main effect of sex). If sex adds significantly (in terms of chi-
square value) and substantially (in terms of Nagelkerke R2) over and above
Nscore, males and females have different odds of saying “YES” to N_19,
given their Neuroticism score. This means that uniform DIF is present.
By adding Nscore by sex interaction as another predictor in the third model,
we will check for non-uniform DIF. If the interaction term adds significantly (in
terms of chi-square value) and substantially (in terms of Nagelkerke R2) over
and above Nscore and sex, non-uniform DIF is present.
We will use the R base function glm() (stands for ‘generalized linear model’)
to specify the three logistic regression models:

# Baseline model
Baseline <- glm(N_19 ~ Nscore, data=EPQ, family=binomial(link="logit"))
# Uniform DIF model
dif.U <- glm(N_19 ~ Nscore + sex, data=EPQ, family=binomial(link="logit"))
# Non-Uniform DIF model
dif.NU <- glm(N_19 ~ Nscore + sex + Nscore:sex, data=EPQ, family=binomial(link="logit"))

You can see that the model syntax above is very simple. We are basically saying
that "N_19 is regressed on (~) Nscore"; or that "N_19 is regressed on (~)
Nscore and sex", or that "N_19 is regressed on (~) Nscore, sex and Nscore
by sex interaction" (Nscore:sex). We ask the function to perform logistic
regression (family = binomial(link="logit")). And of course, we pass the
dataset (data = EPQ) where all the variables can be found.
Type the models one by one into your script, and run them. Objects Baseline,
dif.U and dif.NU should appear in your Environment. Next, you will obtain
and interpret various outputs generated from the results of these models.

Step 4. Testing for significance of uniform and non-uniform effects of sex

To test if the main effect of sex or the interaction between sex and Nscore
added significantly to the Baseline model, use the base R anova() function. It
analyses not only variance components (the ANOVA as most students know it),
but also deviance components (what is minimized in logistic regression). The
chi-square statistic is used to test the significance of contributions of each added
predictor, in the order in which they appear in the regression equation. To get
this breakdown, apply the anova() function to the final model, dif.NU.

anova(dif.NU, test= "Chisq")

## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: N_19
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1376 1875.8
## Nscore 1 622.32 1375 1253.5 < 2.2e-16 ***
## sex 1 57.42 1374 1196.1 3.513e-14 ***
## Nscore:sex 1 1.21 1373 1194.9 0.2717
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Each row shows the contribution of each added predictor to the model’s chi-
square. The Deviance column shows the chi-square statistic for the model with
each subsequent predictor added (starting from NULL model with just intercept
and no predictors, then with Nscore predictor added, then with sex added
and finally Nscore:sex). The Df column (first column) shows the degrees of
freedom for each added predictor, which is 1 degree of freedom every time. The
column Pr(>Chi) is the p-value.
Examine the output and try to judge whether the predictors added at each step
contributed significantly to the prediction of N_19.
QUESTION 2. Is the Baseline model (with Nscore as the only predictor)
significant? Try to report the chi-square statistic for this model using the
template below:
Baseline – NULL: diff.chi-square (df= __ , N=__ ) = _______, p = _______.
QUESTION 3. Does sex add significantly over and above Nscore? What
is the increment in chi-square compared to the Baseline model? What does
this mean in terms of testing for Uniform DIF?
dif.U – Baseline: diff.chi-square (df= __ , N=__ ) = _______, p = _______.
QUESTION 4. Does Nscore by sex interaction add significantly over and
above Nscore and sex? What is the increment in chi-square compared to
the dif.U model? What does this mean in terms of testing for Non-Uniform
DIF?
dif.NU – dif.U: diff.chi-square (df= __ , N=__ ) = _______, p = _______.

Step 5. Evaluating effect sizes for uniform and non-uniform effects of sex

Effects of added predictors may be significant, but they may be trivial in size.
To judge whether the differences between groups while controlling for the latent
trait (DIF) are consequential, effect sizes need to be computed and evaluated.
Nagelkerke R Square is recommended for judging the effect size of logistic re-
gression models.
To obtain Nagelkerke R2, we will use the package fmsb. Install this package on
your computer and load it into memory. Then apply the function NagelkerkeR2()
to the results of the 3 models you produced earlier.

library(fmsb)

NagelkerkeR2(Baseline)

## $N
## [1] 1377
##
## $R2
## [1] 0.4887726

NagelkerkeR2(dif.U)

## $N
## [1] 1377
##
## $R2
## [1] 0.5237134

NagelkerkeR2(dif.NU)

## $N
## [1] 1377
##
## $R2
## [1] 0.524433

Look at the output and note that the function returns 2 values – the sample
size on which the calculation was made, and the actual R2.
QUESTION 5. Report the Nagelkerke R2 for each model below:
Baseline: Nagelkerke R2 = _______
dif.U: Nagelkerke R2 = _______
dif.NU: Nagelkerke R2 = _______
Finally, let’s compute the increments in Nagelkerke R2 for dif.U compared to
Baseline, and dif.NU compared to dif.U. We will refer directly to the $R2
values of the models:

# compare model dif.U against Baseline - Uniform DIF effect size
NagelkerkeR2(dif.U)$R2 - NagelkerkeR2(Baseline)$R2

## [1] 0.03494082

# compare model dif.NU against dif.U - Non-Uniform DIF effect size
NagelkerkeR2(dif.NU)$R2 - NagelkerkeR2(dif.U)$R2

## [1] 0.0007195194

Now, refer to the following decision rules to judge whether DIF is present or
not:

Large DIF: Chi-square significant and Nagelkerke R2 change ≥ 0.07
Moderate DIF: Chi-square significant and Nagelkerke R2 change between 0.035 and 0.07
Negligible DIF: Chi-square insignificant or Nagelkerke R2 change < 0.035

QUESTION 6. What are the increments for Nagelkerke R2, and what do you
conclude about Differential Item Functioning?

Step 6. Describing the effects: regression coefficients

Finally, obtain the regression coefficients of the final model by running
summary(dif.NU). You will see the sign of the effects and their significance.
However, remember that B coefficients in logistic regression are on the log-odds
scale and therefore are not directly interpretable. Instead, request exp(B) values
and interpret them as odds ratios.

exp(coef(dif.NU))

## (Intercept) Nscore sexmale Nscore:sexmale
## 0.06050411 1.35852112 0.20692613 1.04251259
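If you would also like confidence intervals for these odds ratios, base R's confint() works on glm objects (it profiles the likelihood, so it may take a moment); exponentiating puts the limits on the odds-ratio scale:

exp(confint(dif.NU))  # 95% confidence intervals for the odds ratios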

Now, try to report the size of the individual variables’ effects.


QUESTION 7. Write sentences describing the odds ratios, in terms of change
in the DV accompanying increases in the IV. Report significant effects only. Use
the templates below.
A one-point increase in Nscore was associated with a _______ times
_______________ (increase or decrease) in the odds of endorsing item N_19.
Male sex was associated with a _______ times _______________ (increase
or decrease) in the odds of endorsing item N_19.
QUESTION 8. Finally, try to interpret any moderate or large DIF effects
that you found (ignore negligible DIF). Who have higher expected probabilities
of endorsing item N_19 – males or females? Can you interpret / explain this
finding substantively?

Step 7. Saving your work

After you finished this exercise, save your R script by pressing the Save icon.
Give the script a meaningful name, for example “EPQ_N logistic regression”.
When closing the project by pressing File / Close project, make sure you select
Save when prompted to save your ‘Workspace image’ (with extension .RData).

14.3 Solutions
Q1. Females endorse item N_19 more frequently (0.70 or 70% of them endorse
it) than males (0.36 or 36%). Females also have higher Nscore (mean=13.30)
than males (mean=10.16). Knowing this, the differences in the item endorse-
ment rates are actually expected, as they may be explained by the difference
in Neuroticism levels. The question is whether the differences in responses are
fully explained by the trait level.
Q2. The Baseline model predicts the item response significantly (as we would
expect); chi-square (1, N = 1377)=622.32, p < .001.
[Hint. To determine the sample size on which the regression model was run,
you can look at ‘Resid.DF’ column. The NULL model always has N-1 degrees
of freedom. So, we can use the Resid.DF entry against the NULL model (1376)
to calculate N=1377. This is smaller than the total sample size for EPQ (1381)
because a few responses on N_19 were missing and these cases were deleted
listwise by the regression procedure.]
Q3. Sex adds significantly to prediction. The increment is diff.chi-square (1,
N = 1377) =57.42, p < .001. This means that Uniform DIF might be present
(to judge its effect size, we will need to look at the Nagelkerke R2).
Q4. Nscore by sex interaction does not add significantly to prediction. The
increment diff.chi-square (1, N = 1377) =1.21, p = .272. It means that there is
no Non-Uniform DIF, regardless of the effect size.
Q5. Baseline: Nagelkerke R2 = 0.4888; dif.U: Nagelkerke R2 = 0.5237;
dif.NU: Nagelkerke R2 = 0.5244
Q6. Nagelkerke R2 increment from Baseline model to dif.U model is 0.035.
According to the DIF classification rules, this just qualifies for moderate DIF
(because the effect was significant – see Q3, and the effect size is exactly at the
cut-off for moderate DIF).
Nagelkerke R2 increment from dif.U model to dif.NU model is 0.0007. Ac-
cording to the DIF classification rules, there is no DIF (i.e. DIF is negligible),
because the effect was insignificant – see Q4, and the effect size is tiny.
Q7. A one-point increase in Nscore was associated with a 1.359 times increase
in the odds of endorsing the item. Male sex was associated with a 0.207 times
decrease in the odds of endorsing the item. [NOTE that values above 1 are
associated with an increase, and below 1 with a decrease in odds].
Q8. We found moderate Uniform DIF for item N_19. We also found that
females have higher odds of endorsing the item given the same Nscore as males
(because males have lower odds – see Q7). It appears that females admit to
their “feelings being easily hurt” (see text of N_19) easier than males with the
same Neuroticism level. This might be because expressing their feeling is more
socially acceptable for females.
Exercise 15

DIF analysis of polytomous questionnaire items using ordinal logistic regression

Data file likertGHQ28sex.txt
R packages lordif, psych

15.1 Objectives
The objective of this exercise is to screen polytomous test items for Differential
Item Functioning (DIF) using ordinal logistic regression. Item DIF is present
when people with the same trait level but from different groups have different
probabilities of selecting the item response categories. In this example, we will
screen for DIF with respect to gender.

15.2 Worked Example - Screening GHQ Somatic Symptoms items for gender DIF
To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions. To practice further, you may repeat the
analyses for the remaining GHQ-28 subscales.
Data for this exercise come from the Medical Research Council National Survey
of Health and Development (NSHD), also known as the British 1946 birth cohort
study. We considered these data in Exercise 13.

To remind you, the data pertain to a wave of interviewing undertaken in 1999,
when the participants were aged 53. At that point, N = 2,901 participants
(1,422 men and 1,479 women) completed the 28-item version of General Health
Questionnaire (GHQ-28). The GHQ-28 is designed to measure 4 facets of men-
tal/emotional distress - somatic symptoms (items 1-7), anxiety/insomnia (items
8-14), social dysfunction (items 15-21) and severe depression (items 22-28).
The focus of our analysis here will be the Somatic Symptoms scale, measured
by 7 items. For each item, respondents need to think how they felt in the past
2 weeks:

Have you recently:


1) been feeling perfectly well and in good health?
2) been feeling in need of a good tonic?
3) been feeling run down and out of sorts?
4) felt that you are ill?
5) been getting any pains in your head?
6) been getting a feeling of tightness or pressure in your head?
7) been having hot or cold spells?

Item responses are coded so that the score of 1 indicates good health or lack of
distress; and the score of 4 indicates poor health or a lot of distress. Therefore,
high scores on the questionnaire indicate emotional distress.

Step 1. Opening and examining the data

If you completed Exercise 13, the easiest thing to do is to open the project you
created back then and continue working with it.
If you need to start from scratch, download the data file likertGHQ28sex.txt
into a new folder and create a project associated with it. As the file here is in
the tab-delimited format (.txt), use function read.table() to import data into
the data frame we will call GHQ.

GHQ <- read.table(file="likertGHQ28sex.txt", header=TRUE)

The object GHQ should appear on the Environment tab. Click on it and the
data will be displayed on a separate tab. You should see 33 variables, beginning
with the 7 items for Somatic Symptoms scale, SOMAT_1, SOMAT_2, …
SOMAT_7, and followed by items for the remaining subscales. They are
followed by participant sex (variable sexm0f1 coded male = 0 and female =
1). The last 4 variables in the file are sum scores (sums of relevant 7 items) for
each of the 4 subscales. There are NO missing data.
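
You can verify the latter with a quick one-line check (a sketch):

sum(is.na(GHQ))   # returns 0 when there are no missing values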

Step 2. Preparing the grouping variable, exploring group differences

Next, you need to prepare the grouping variable for DIF analyses. The variable
sexm0f1 is coded male = 0 and female = 1. ATTENTION: this means that
female is the focal group (group which will be the focus of analysis, and will be
compared to the reference group, male). Give value labels to the sex variable
as follows:

GHQ$sexm0f1 <- factor(GHQ$sexm0f1,
                      levels = c(0,1),
                      labels = c("male", "female"))

Run the command head(GHQ$sexm0f1) to check that the correct labels for
sexm0f1 indeed appeared in the data frame.
Next, let’s obtain and examine the descriptive statistics for items and subscale
scores by sex. The easiest way to do this is to use the function describeBy(x,
group, ...) from psych package. The function provides descriptive statistics
for all variables in the data frame x by grouping variable group.

library(psych)

describeBy(GHQ[ ,c("SOMAT_1","SOMAT_2","SOMAT_3","SOMAT_4","SOMAT_5","SOMAT_6","SOMAT_7","SUM_SOM")], group=GHQ$sexm0f1)

##
## Descriptive statistics by group
## group: male
## vars n mean sd median trimmed mad min max range skew kurtosis
## SOMAT_1 1 1422 2.05 0.49 2 2.03 0.00 1 4 3 0.85 3.68
## SOMAT_2 2 1422 1.74 0.75 2 1.64 1.48 1 4 3 0.80 0.25
## SOMAT_3 3 1422 1.71 0.73 2 1.61 1.48 1 4 3 0.77 0.06
## SOMAT_4 4 1422 1.44 0.68 1 1.31 0.00 1 4 3 1.47 1.70
## SOMAT_5 5 1422 1.24 0.50 1 1.14 0.00 1 4 3 2.01 3.64
## SOMAT_6 6 1422 1.20 0.49 1 1.08 0.00 1 4 3 2.41 5.26
## SOMAT_7 7 1422 1.20 0.52 1 1.07 0.00 1 4 3 2.79 7.85
## SUM_SOM 8 1422 10.58 2.82 10 10.15 2.97 7 26 19 1.49 2.71
## se
## SOMAT_1 0.01
## SOMAT_2 0.02
## SOMAT_3 0.02
## SOMAT_4 0.02
## SOMAT_5 0.01
## SOMAT_6 0.01
## SOMAT_7 0.01
## SUM_SOM 0.07
## ------------------------------------------------------------
## group: female
## vars n mean sd median trimmed mad min max range skew kurtosis
## SOMAT_1 1 1479 2.09 0.59 2 2.08 0.00 1 4 3 0.58 1.49
## SOMAT_2 2 1479 1.87 0.79 2 1.80 1.48 1 4 3 0.63 -0.11
## SOMAT_3 3 1479 1.86 0.81 2 1.79 1.48 1 4 3 0.62 -0.26
## SOMAT_4 4 1479 1.49 0.74 1 1.34 0.00 1 4 3 1.44 1.40
## SOMAT_5 5 1479 1.39 0.62 1 1.28 0.00 1 4 3 1.46 1.52
## SOMAT_6 6 1479 1.30 0.59 1 1.17 0.00 1 4 3 1.95 3.13
## SOMAT_7 7 1479 1.77 0.81 2 1.67 1.48 1 4 3 0.81 -0.01
## SUM_SOM 8 1479 11.77 3.45 11 11.34 2.97 7 27 20 1.18 1.56
## se
## SOMAT_1 0.02
## SOMAT_2 0.02
## SOMAT_3 0.02
## SOMAT_4 0.02
## SOMAT_5 0.02
## SOMAT_6 0.02
## SOMAT_7 0.02
## SUM_SOM 0.09

QUESTION 1. Who has the higher average score on item SOMAT_7 (“been
having hot or cold spells”) – males or females? Who scores higher on the Somatic
Symptoms (SUM_SOM) on average – males or females? Interpret the means
for SOMAT_7 in the light of the SUM_SOM means.
Now you are ready for DIF analyses.

Step 3. Running basic DIF analysis for item SOMAT_7

We will use package lordif to run the DIF analysis. This package has great
functions that automate the use of ordinal logistic regression to detect DIF.
Function rundif() is the basic function for detecting polytomous DIF items.
It performs an ordinal (common odds-ratio) logistic regression DIF analysis by
comparing the endorsement levels for each response category of given items with
the matching variable - the variable reflecting the scale score - for individuals
from the focal and referent groups. The function has the following format:
rundif(item, resp, theta, gr, criterion, pseudo.R2, R2.change,
...)
where item is one item or a collection of items which we examine for DIF; resp is
a data frame containing item responses; theta is a conditioning (matching) vari-
able; gr is a variable identifying groups; criterion is the criterion for flagging
DIF items (i.e., “CHISQR”, “R2”, or “BETA”); pseudo.R2 is pseudo R-squared
measure (i.e., “McFadden”, “Nagelkerke”, or “CoxSnell”), and R2.change is R-
squared change for pseudo R-squared criterion.
We will examine item SOMAT_7 for DIF, so we write item="SOMAT_7". Re-
sponses to this item are contained in the 7th column of GHQ dataframe, so we
write resp=GHQ[,7] or resp=GHQ[,"SOMAT_7"]. We will use the sum score of
Somatic Symptoms items, SUM_SOM, as the matching (conditioning) vari-
able, so we write theta=GHQ$SUM_SOM. If we did not have that sum score already
in the dataframe, we could easily compute it as the sum of all item scores using
function rowSums(GHQ[,1:7]) (this would work fine because all items in GHQ
are coded appropriately to represent distress and there are no missing data).
Our grouping variable is sexm0f1, so we write gr=GHQ$sexm0f1.
Finally, we need to decide what criterion we will use for detecting DIF. As our
sample size is very large, statistical significance will be almost guaranteed even
for very small effects. Therefore, we will rely on an effect size measure, pseudo
R-squared (because R-squared is, strictly speaking, not defined for categorical
data), to detect sizable differences between the focal and referent groups. We
will use the Nagelkerke R-squared, as it is well described and widely used in DIF
analysis (unlike other pseudo R-squared options, it can theoretically reach 1),
and we use an R-squared change of 0.035 - the cut-off for moderate DIF given
below - as the criterion for flagging DIF items.

library(lordif)

# run basic DIF procedure for item SOMAT_7
DIF7 <- rundif(item="SOMAT_7", resp=GHQ["SOMAT_7"], theta=rowSums(GHQ[,1:7]),
               gr=GHQ$sexm0f1,
               criterion = "R2", pseudo.R2 = "Nagelkerke", R2.change = 0.035)

# output DIF results
print(DIF7)

## $stats
## item ncat chi12 chi13 chi23 beta12 pseudo12.McFadden pseudo13.McFadden
## 1 SOMAT_7 4 0 0 0 0.0052 0.0829 0.0875
## pseudo23.McFadden pseudo12.Nagelkerke pseudo13.Nagelkerke pseudo23.Nagelkerke
## 1 0.0046 0.1241 0.1305 0.0063
## pseudo12.CoxSnell pseudo13.CoxSnell pseudo23.CoxSnell df12 df13 df23
## 1 0.1046 0.11 0.0053 1 2 1
##
## $flag
## [1] TRUE

Now examine the output carefully. You can see:

ncat - number of response categories;
chi12 - p-values for the chi-square change from the baseline model (Model 1),
with only the trait score accounting for all differences in item scores, to the
Uniform DIF model (Model 2), where the grouping variable also accounts for
differences;
chi13 - p-values for the chi-square change from the baseline model (Model 1)
to the Non-Uniform DIF model (Model 3), where the grouping variable AND
the interaction between the trait score and the grouping variable also account
for differences;
chi23 - p-values for the chi-square change from the Uniform DIF model (Model
2) to the Non-Uniform DIF model (Model 3).
The next important values are pseudo12.Nagelkerke, pseudo13.Nagelkerke,
and pseudo23.Nagelkerke, describing the effect sizes corresponding to differ-
ences between Model 1 and 2, Model 1 and 3 and Model 2 and 3 as described
above. Effects of added predictors tested with the chi-square statistic may be
significant, but they may be trivial in size. To judge whether the DIF effects
are consequential, effect sizes need to be computed and evaluated. Nagelk-
erke R Square is recommended for judging the effect size of logistic regression
models. These are interpreted together with the chi-squared results. Funda-
mentally, significant chi12 AND substantial pseudo12 indicates Uniform DIF;
significant chi13 AND substantial pseudo13 indicates either Uniform DIF or
Non-uniform DIF or both; and significant chi23 AND substantial pseudo23
indicates Non-uniform DIF.
You can use the following decision rules (Jodoin and Gierl, 2001) to judge
whether DIF is present, and its size:

Large DIF: Chi-square significant and Nagelkerke R2 change ≥ 0.07
Moderate DIF: Chi-square significant and Nagelkerke R2 change between 0.035 and 0.07
Negligible DIF: Chi-square insignificant or Nagelkerke R2 change < 0.035
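
These rules are easy to wrap into a small helper function (a sketch; the function
name and the 0.05 significance level are illustrative choices, not part of lordif):

classify_DIF <- function(p, R2.change, alpha = 0.05) {
  # Jodoin and Gierl (2001) decision rules
  if (p >= alpha || R2.change < 0.035) return("negligible")
  if (R2.change < 0.07) return("moderate")
  "large"
}
classify_DIF(p = 0, R2.change = 0.1241)   # "large" - cf. the SOMAT_7 output above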

QUESTION 2. Looking at the chi-square and Nagelkerke pseudo R-squared
in the output, is EITHER Uniform or Non-uniform DIF present for item SO-
MAT_7? Report the respective increment for Nagelkerke R2, and your con-
clusions about DIF effect size. Check your conclusions with the output, section
$flag, where any DIF effects are flagged.
QUESTION 3. Looking at the chi-square and Nagelkerke pseudo R-squared
in the output, is Uniform DIF present for item SOMAT_7? Report the re-
spective increment for Nagelkerke R2, and your conclusions about DIF effect
size.
QUESTION 4. Looking at the chi-square and Nagelkerke pseudo R-squared
in the output, is Non-uniform DIF present for item SOMAT_7? Report the
respective increment for Nagelkerke R2, and your conclusions about DIF effect
size.

You can also run the above basic procedure for all 7 items simultaneously calling

# run basic DIF procedure for all Somatic Symptom items
DIF <- rundif(item=c("SOMAT_1","SOMAT_2","SOMAT_3","SOMAT_4","SOMAT_5","SOMAT_6","SOMAT_7"),
              resp=GHQ[,1:7], theta=rowSums(GHQ[,1:7]), gr=GHQ$sexm0f1,
              criterion = "R2", pseudo.R2 = "Nagelkerke", R2.change = 0.035)

QUESTION 5. Obtain print(DIF) output and examine it. Can you see any
more DIF items?

Step 4. DIF analyses with purification

The above analyses used the scale score SUM_SOM as the matching (condi-
tioning) variable. This score is made up of the item scores. But what if we have
some DIF items in the scale - surely then, the total score will be “contaminated”
by DIF? Indeed, there are methods that remove the effects of DIF items from
the matching (conditioning) score. Package lordif offers one such “purifica-
tion” method, whereby items flagged for DIF are removed from the “anchor”
set of common items from which the trait score is estimated. The lordif()
function uses IRT theta estimates (not the simple sum score) as the matching/
conditioning variable. The graded response model (GRM) is used for IRT trait
estimation by default. The procedure runs iteratively until the same set of items is flagged
over two consecutive iterations, unless anchor items are specified.
The lordif function has the following format:
lordif(resp.data, group, selection = NULL, criterion = c("Chisqr",
"R2", "Beta"), pseudo.R2 = c("McFadden", "Nagelkerke", "CoxSnell"),
R2.change = 0.02, anchor = NULL, ...)

pureDIF <- lordif(resp.data=GHQ[,1:7], group=GHQ$sexm0f1,
                  criterion = "R2", pseudo.R2 = "Nagelkerke", R2.change = 0.035)

print(pureDIF)

## Call:
## lordif(resp.data = GHQ[, 1:7], group = GHQ$sexm0f1, criterion = "R2",
## pseudo.R2 = "Nagelkerke", R2.change = 0.035)
##
## Number of DIF groups: 2
##
## Number of items flagged for DIF: 1 of 7
##
## Items flagged: 7
##
## Number of iterations for purification: 2 of 10
##
## Detection criterion: R2
##
## Threshold: R-square change >= 0.035
##
## item ncat pseudo12.Nagelkerke pseudo13.Nagelkerke pseudo23.Nagelkerke
## 1 1 4 0.0011 0.0039 0.0028
## 2 2 4 0.0000 0.0001 0.0001
## 3 3 4 0.0000 0.0001 0.0000
## 4 4 4 0.0035 0.0041 0.0005
## 5 5 3 0.0105 0.0107 0.0002
## 6 6 3 0.0027 0.0031 0.0004
## 7 7 4 0.1732 0.1856 0.0124

The output obtained by calling print() makes it clear that after iterative
purification, there is only one item flagged as DIF, and it is item number 7
(SOMAT_7). So we obtain the same result as without purification. How-
ever, the effect sizes obtained are slightly different, presumably because the
IRT-estimated trait scores rather than sum scores were used as the matching variable.
QUESTION 6. Report the DIF effect sizes for SOMAT_7 based on the DIF
procedure with purification.

Step 5. Describing and interpreting the DIF effects

QUESTION 7. Given the item text and the characteristics of the sample, try
to interpret the DIF effect that you found. Who has a higher expected score on
item SOMAT_7 after controlling for the overall Somatic Symptoms score –
males or females? Can you interpret / explain this finding substantively?
Finally, let’s plot the results visually. Package lordif has its own function
plot.lordif(), which plots diagnostic graphs for items flagged for DIF. It has
the following format:
plot.lordif(x, labels = c("Reference", "Focal"), ...)
In our case, males are the reference group and females are the focal group,
therefore we call:

# graphics.off() is used to keep the outputs in the Plot tab
plot.lordif(pureDIF, labels=c("Male","Female"), graphics.off())

You can use the arrows to navigate between the plots in the Plot tab. The first
output is “Item True Score Functions”. Here, you see the expected item score
for men (black line) and women (dashed red line) for each value of the IRT “true
score” (or theta score, in the standardized metric). The large discrepancy across
all theta values shows uniform DIF. The “Item Response Functions” output
provides the probabilities of endorsing each response category for men (black
line) and women (dashed red line) for each value of the theta score. You can
see that the biggest difference between sexes is observed for the first response
category (“not at all”). It is the presence of hot spells and not their amount
that seems to make the biggest difference between sexes.
The next group of plots shows the “Test Characteristic Curve” or TCC. The
TCCs differ by about 0.5 points at most. That means that, for the same true
Somatic Symptoms score, women are expected to have a sum score at most
0.5 points higher than men. This difference is most pronounced for average true
scores.

Step 6. Saving your work

After you finished this exercise, save your R script by pressing the Save icon.
Give the script a meaningful name, for example “GHQ-28 ordinal DIF”.
When closing the project by pressing File / Close project, make sure you select
Save when prompted to save your ‘Workspace image’ (with extension .RData).

15.3 Solutions
Q1. Females have higher mean SOMAT_7 item score (1.77) than males
(1.20). Females also have higher SUM_SOM (mean=11.77) than males
(mean=10.58). Knowing this, the differences in the item means are actually
expected, as they may be explained by the difference in Somatic Symptom
levels. The question is whether the differences in responses are fully explained
by the differences in trait levels.
Q2. chi13 for SOMAT_7 is highly significant, with the p-value reported as
0.000 (p < .001). The corresponding pseudo13.Nagelkerke is 0.1305, which is
larger than 0.07 (the cut-off for a large effect size), therefore SOMAT_7 demon-
strates LARGE DIF (either uniform or non-uniform or both). This is confirmed
by the $flag equal TRUE for SOMAT_7.
Q3. chi12 for SOMAT_7 is highly significant, with the p-value reported as
0.000 (p < .001). The corresponding pseudo12.Nagelkerke is 0.1241, which is
larger than 0.07, therefore SOMAT_7 demonstrates LARGE Uniform DIF.
Q4. chi23 for SOMAT_7 is highly significant, with the p-value reported as
0.000 (p < .001). The corresponding pseudo23.Nagelkerke is 0.0063, which is
smaller than 0.035, therefore any Non-uniform DIF effects are trivial; this
means there is NO Non-uniform DIF.

Q5. There are no more DIF items apart from SOMAT_7.

print(DIF)

## $stats
## item ncat chi12 chi13 chi23 beta12 pseudo12.McFadden pseudo13.McFadden
## 1 SOMAT_1 4 0.000 0.0000 0.3373 0.0360 0.0080 0.0082
## 2 SOMAT_2 4 0.000 0.0000 0.0112 0.0241 0.0037 0.0047
## 3 SOMAT_3 4 0.000 0.0000 0.0218 0.0248 0.0039 0.0048
## 4 SOMAT_4 4 0.000 0.0000 0.0459 0.0652 0.0163 0.0170
## 5 SOMAT_5 4 0.068 0.1671 0.6181 0.0096 0.0008 0.0009
## 6 SOMAT_6 4 0.302 0.5136 0.6052 0.0069 0.0003 0.0004
## 7 SOMAT_7 4 0.000 0.0000 0.0000 0.0052 0.0829 0.0875
## pseudo23.McFadden pseudo12.Nagelkerke pseudo13.Nagelkerke pseudo23.Nagelkerke
## 1 0.0002 0.0095 0.0097 0.0002
## 2 0.0010 0.0040 0.0051 0.0011
## 3 0.0008 0.0035 0.0042 0.0007
## 4 0.0008 0.0171 0.0179 0.0008
## 5 0.0001 0.0011 0.0011 0.0001
## 6 0.0001 0.0004 0.0005 0.0001
## 7 0.0046 0.1241 0.1305 0.0063
## pseudo12.CoxSnell pseudo13.CoxSnell pseudo23.CoxSnell df12 df13 df23
## 1 0.0075 0.0077 0.0002 1 2 1
## 2 0.0036 0.0045 0.0010 1 2 1
## 3 0.0031 0.0037 0.0006 1 2 1
## 4 0.0143 0.0149 0.0007 1 2 1
## 5 0.0008 0.0009 0.0001 1 2 1
## 6 0.0003 0.0003 0.0001 1 2 1
## 7 0.1046 0.1100 0.0053 1 2 1
##
## $flag
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE

Q6. pseudo12.Nagelkerke = 0.1732, which indicates large Uniform DIF.
pseudo23.Nagelkerke = 0.0124, which indicates negligible Non-uniform DIF.
Q7. We found LARGE Uniform DIF for item SOMAT_7. We also found
that females have higher expected score on the item given the same scale score
SUM_SOM as males (see Q1). Females agree more with the symptom “having
hot and cold spells” than males with the same level of Somatic symptoms. This
is probably because many women in this sample are going through menopause
(remember, the participants were aged 53 in this wave of testing), and experience
hot flashes that men do not experience.
Part VII

CONFIRMATORY FACTOR ANALYSIS (CFA)

Exercise 16

CFA of polytomous item scores

Data file CHI_ESQ2.txt


R package lavaan

16.1 Objectives
In this exercise, you will fit a Confirmatory Factor Analysis (CFA) model, or
measurement model, to polytomous responses to a short service satisfaction
questionnaire. We already established a factor structure for this questionnaire
using Exploratory Factor Analysis (EFA) in Exercise 9. This time, we will
confirm this structure using CFA. However, we will confirm it on a different
sample from the one we used for EFA.
In the process of fitting a CFA model, you will start getting familiar with the
SEM language and techniques using R package lavaan (stands for latent variable
analysis).

16.2 Study of service satisfaction using Experience of Service Questionnaire (CHI-ESQ)
In 2002, the Commission for Health Improvement in the UK developed the Expe-
rience of Service Questionnaire (CHI-ESQ) to measure user experiences with
the Child and Adolescent Mental Health Services (CAMHS). The questionnaire
exists in two versions: one for older children and adolescents describing expe-
riences with their own treatment, and the other for parents/carers of children
who underwent treatment. We will consider the version for parents/carers.

The CHI-ESQ parent version consists of 12 questions covering different types of
experiences. Here are the questions:

1. I feel that the people who have seen my child listened to me
2. It was easy to talk to the people who have seen my child
3. I was treated well by the people who have seen my child
4. My views and worries were taken seriously
5. I feel the people here know how to help with the problem I came for
6. I have been given enough explanation about the help available here
7. I feel that the people who have seen my child are working together to help with the problem(s)
8. The facilities here are comfortable (e.g. waiting area)
9. The appointments are usually at a convenient time (e.g. don’t interfere with work, school)
10. It is quite easy to get to the place where the appointments are
11. If a friend needed similar help, I would recommend that he or she come here
12. Overall, the help I have received here is good

Parents/carers are asked to respond to these questions using response options
“Certainly True” — “Partly True” — “Not True” (coded 1-2-3) and Don’t know
(not scored but treated as missing data, “NA”). Note that there are quite a lot
of missing responses!
Participants in this study are N=460 parents reporting on experiences with one
CAMHS member Service. This is a subset of the large multi-service sample
analysed and reported in Brown, Ford, Deighton, & Wolpert (2014), and, im-
portantly, these data are from a different member service than the data we
considered in Exercise 9.

16.3 Worked Example - CFA of responses to CHI-ESQ parental version
To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.

Step 1. Importing and examining the data

The data are stored in the space-separated (.txt) file CHI_ESQ2.txt. Down-
load the file now into a new folder and create a project associated with it.
Preview the file by clicking on CHI_ESQ2.txt in the Files tab in RStudio
(this will not import the data yet, just open the actual file). You will see that
the first row contains the item abbreviations (esq+p for “parent”): “esqp_01”,
“esqp_02”, “esqp_03”, etc., and the first entry in each row is the respondent
number: “1”, “2”, …“620”. Function read.table() will import this into RStu-
dio taking care of these column and row names. It will actually understand
that we have headers and row names because the first row contains one fewer
fields than the number of columns (see ?read.table for detailed help on this
function).

CHI_ESQ2 <- read.table(file="CHI_ESQ2.txt")

We have just read the data into a new data frame named CHI_ESQ2. Exam-
ine the data frame by either pressing on it in the Environment tab, or calling
function View(CHI_ESQ2). You will see that there are quite a few missing re-
sponses, which could have occurred because either “Don’t know” response option
was chosen, or because the question was not answered at all.
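
To see how the missing responses are distributed across items, you can run a
quick check (a sketch):

colSums(is.na(CHI_ESQ2))   # number of missing responses per item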

Step 2. Specifying and fitting a measurement model

Before you start using commands from the package lavaan, make sure you have
it installed (if not, use menu Tools and then Install packages…), and load it.

library(lavaan)

Given that we already established the factor structure for CHI-ESQ by the
means of EFA in Exercise 9, we will fit a model with 2 correlated factors - Sat-
isfaction with Care (Care for short) and Satisfaction with Environment (Envi-
ronment for short).
We need to “code” this model in syntax, using the lavaan syntax conventions:
=~ means “is measured by”, and ~~ denotes a variance or covariance.

To describe the model in Figure 16.1 in words, we would say:

• Care is measured by esqp_01 and esqp_02 <…> and esqp_11 and esqp_12
• Environment is measured by esqp_08 and esqp_09 and esqp_10
A shorthand for and is the plus sign +. So, this is how we translate the above
sentences into syntax for our model (let’s call it ESQmodel):

Figure 16.1: Confirmatory model for CHI-ESQ (paths fixed to 1
are in dashed lines)

ESQmodel <- 'Care =~ esqp_01+esqp_02+esqp_03+esqp_04+
                     esqp_05+esqp_06+esqp_07+
                     esqp_11+esqp_12
             Environment =~ esqp_08+esqp_09+esqp_10 '

Be sure to have the single quotation marks opening and closing the syntax.
Begin each equation on the new line. Spaces within each equation are optional,
they just make reading easier for you (but R does not mind).
By default, lavaan will scale each factor by setting the same unit as its first
indicator (and the factor variance is then freely estimated). So, it will set
the loading of esqp_01 to 1 to scale factor Care, and it will set the loading of
esqp_08 to 1 to scale factor Environment. You can see this on the diagram,
where the loading paths for these two indicators are in dashed rather than
solid lines.
Write the above syntax in your Script, and run it by highlighting the whole lot
and pressing the Run button or Ctrl+Enter keys. You should see a new object
ESQmodel appear on the Environment tab. This object contains the model
syntax.
To fit this CFA model, we need function cfa(). We need to pass to this function
the data frame CHI_ESQ2 and the model name (model=ESQmodel).

# fit the model with default scaling, and ask for summary output
fit <- cfa(model=ESQmodel, data=CHI_ESQ2)

summary(fit)

## lavaan 0.6.15 ended normally after 56 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 25
##
## Used Total
## Number of observations 331 460
##
## Model Test User Model:
##
## Test statistic 264.573
## Degrees of freedom 53
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## Care =~
## esqp_01 1.000
## esqp_02 0.970 0.074 13.108 0.000
## esqp_03 0.607 0.054 11.255 0.000
## esqp_04 1.140 0.072 15.883 0.000
## esqp_05 1.441 0.090 15.979 0.000
## esqp_06 1.192 0.088 13.525 0.000
## esqp_07 1.321 0.089 14.924 0.000
## esqp_11 1.242 0.079 15.647 0.000
## esqp_12 1.270 0.076 16.620 0.000
## Environment =~
## esqp_08 1.000
## esqp_09 1.520 0.368 4.135 0.000
## esqp_10 0.921 0.227 4.050 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## Care ~~
## Environment 0.033 0.009 3.835 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .esqp_01 0.099 0.008 11.732 0.000
## .esqp_02 0.127 0.011 12.036 0.000
## .esqp_03 0.080 0.007 12.356 0.000
## .esqp_04 0.077 0.007 10.966 0.000
## .esqp_05 0.119 0.011 10.902 0.000
## .esqp_06 0.171 0.014 11.937 0.000
## .esqp_07 0.140 0.012 11.476 0.000
## .esqp_11 0.099 0.009 11.114 0.000
## .esqp_12 0.073 0.007 10.366 0.000
## .esqp_08 0.204 0.020 10.431 0.000
## .esqp_09 0.171 0.027 6.352 0.000
## .esqp_10 0.172 0.017 10.421 0.000
## Care 0.130 0.016 7.936 0.000
## Environment 0.045 0.016 2.809 0.005

Examine the output. Start with finding estimated factor loadings (under Latent
Variables: find statements =~), factor variances (under Variances: find state-
ments ~~) and covariances (under Covariances: find statements ~~).

QUESTION 1. Why are the factor loadings of esqp_01 and esqp_08 equal
1, and no Standard Errors, z-values or p-values are printed for them? How many
factor loadings are there to estimate?
QUESTION 2. How many latent variables are there in your model? What
are they? What parameters are estimated for these variables? [HINT. You do
not need the R output to answer the first part of the question. You need the
model diagram.]
QUESTION 3. How many parameters are there to estimate in total? What
are they?
QUESTION 4. How many known pieces of information (sample moments)
are there in the data? What are the degrees of freedom and how are they made
up?
QUESTION 5. Interpret the chi-square (reported as Test statistic). Do
you retain or reject the model?
To obtain more measures of fit beyond the chi-square, request an extended
output:

summary(fit, fit.measures=TRUE)

QUESTION 6. Examine the extended output; find and interpret the SRMR,
CFI and RMSEA.
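
If you prefer a compact output, you can also extract just the measures of
interest with lavaan’s fitMeasures() function, for example:

fitMeasures(fit, c("chisq", "df", "pvalue", "cfi", "rmsea", "srmr"))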

Step 3. Alternative scaling of latent factors

Now, I suggest you change the way your factors are scaled – for the sake of
learning alternative ways of scaling latent variables, which will be useful in
different situations. The other popular way of scaling common factors is by
setting their variances to 1 and freeing all the factor loadings.
You can use your ESQmodel, but request the cfa() function to standardize all
the latent variables (std.lv=TRUE), so that their variances are set to 1. Assign
this new model run to a new object, fit.2, so you can compare it to the previous
run, fit.

fit.2 <- cfa(ESQmodel, data=CHI_ESQ2, std.lv=TRUE)

summary(fit.2)

Examine the output. First, compare the test statistic (chi-square) between
fit.2 and fit (see your answer for Q5). The two chi-square statistics and their
degrees of freedom should be exactly the same! This is because alternative ways
of scaling do not change the model fit or the number of parameters. They just
swap the parameters to be estimated. The standardized version of ESQmodel
sets the factor variances to 1 and estimates all factor loadings, while the original
version sets 2 factor loadings (one per each factor) to 1 and estimates 2 factor
variances. The difference between the two models is the matter of scaling -
otherwise they are mathematically equivalent. This is nice to know, so you can
use one or the other way of scaling depending on what is more convenient in a
particular situation.
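
You can confirm this equivalence directly (a quick check):

fitMeasures(fit, c("chisq", "df"))     # original scaling
fitMeasures(fit.2, c("chisq", "df"))   # std.lv=TRUE scaling - identical values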

Step 4. Interpreting model parameters

OK, now examine the latest output with standardized latent variables (factors),
and answer the following questions.

QUESTION 7. Examine the factor loadings and their standard errors. How
do you interpret these loadings compared to the model where the factors were
scaled by borrowing the scale of one of their indicators?

QUESTION 8. Interpret factor correlations. Are they what you would ex-
pect? (If you did Exercise 9, you can compare this result with your estimate of
factor correlations)

Now obtain the standardized solution (in which EVERYTHING is standard-
ized - the latent and observed variables) by adding standardized=TRUE to the
summary() output:

summary(fit.2, standardized=TRUE)

QUESTION 9. Compare the default (unstandardized) parameter values in
Estimate column, values in Std.lv column (standardized on latent variables
only), and values in Std.all column (standardized on all - latent and observed
variables). Can you explain why Estimate are identical to Std.lv, but different
from Std.all?

QUESTION 10. Examine the standardized (Std.all) factor loadings. How
do you interpret these loadings compared to the Std.lv values?

Examining the relative size of standardized factor loadings, we can see that the
“marker” (highest loading) item for factor Care is esqp_12 (Overall help) -
the same result as in EFA with “oblimin” rotation in Exercise 9. The marker
item for factor Environment is esqp_09 (appointment times), while in EFA
the marker item was esqp_08 (facilities).

QUESTION 11. Are the standardized error variances small or large, and how
do you judge their magnitude?

Step 5. Examining residuals and local areas of misfit


Now request one additional output – residuals. Residuals are differences be-
tween fitted (or predicted by the model) covariances and the actual observed
covariances. The residuals show if all covariances are well reproduced by the
model. But since our data are raw responses to questionnaire items, residual
covariances will be relatively difficult to interpret. So, instead of requesting
residual covariances, we will request residual correlations - differences between
fitted and observed correlations of item responses.

residuals(fit.2, type="cor")

## $type
## [1] "cor.bollen"
##
## $cov
## esq_01 esq_02 esq_03 esq_04 esq_05 esq_06 esq_07 esq_11 esq_12 esq_08
## esqp_01 0.000
## esqp_02 0.124 0.000
## esqp_03 0.080 0.115 0.000
## esqp_04 0.088 0.077 0.077 0.000
## esqp_05 -0.012 -0.024 0.012 -0.021 0.000
## esqp_06 -0.045 -0.016 -0.085 -0.047 0.017 0.000
## esqp_07 0.012 -0.068 0.000 -0.028 0.032 0.035 0.000
## esqp_11 -0.067 -0.072 -0.050 -0.023 0.008 0.019 -0.039 0.000
## esqp_12 -0.074 -0.045 -0.097 -0.031 -0.003 0.049 0.035 0.102 0.000
## esqp_08 -0.009 0.064 0.040 0.061 -0.007 0.030 0.042 0.033 0.050 0.000
## esqp_09 0.029 0.098 0.127 0.014 0.017 -0.035 -0.026 0.045 -0.060 -0.034
## esqp_10 0.006 -0.043 0.036 -0.008 -0.032 -0.137 -0.108 -0.010 -0.128 0.024
## esq_09 esq_10
## esqp_01
## esqp_02
## esqp_03
## esqp_04
## esqp_05
## esqp_06
## esqp_07
## esqp_11
## esqp_12
## esqp_08
## esqp_09 0.000
## esqp_10 0.021 0.000

Now interpretation of residuals is very simple. Just think of them as differences
between 2 sets of correlations, with the usual effect sizes assumed for correlation
coefficients.

You can also request standardized residuals, to formally test for significant dif-
ferences from zero. Any standardized residual larger than 1.96 in magnitude
(approximately 2 standard deviations from the mean in the standard normal
distribution) is significantly different from 0 at p=.05 level.

residuals(fit.2, type="standardized")
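
To pick out the noteworthy residuals programmatically, you can filter the
returned matrix (a sketch; residuals() returns a list whose $cov element holds
the residual matrix):

sres <- residuals(fit.2, type="standardized")$cov
which(abs(sres) > 1.96, arr.ind=TRUE)   # positions of significant residuals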

QUESTION 12. Are there any large residuals? Are there any statistically
significant residuals?
Now request modification indices by calling

modindices(fit.2)

QUESTION 13. Examine the modification indices output. Find the largest
index. What does it suggest?
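
Since modindices() returns a long table, it may help to sort it and show only
the largest values, using the function’s own arguments:

modindices(fit.2, sort. = TRUE, maximum.number = 10)   # 10 largest indices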

Step 6. Modifying the model

Modify the model by adding a covariance path (~~) between items esqp_11
and esqp_12. All you need to do is to add a new line of code to the definition of
ESQmodel by using base R function paste(), thus creating a modified model
ESQmodel.m:

ESQmodel.m <- paste(ESQmodel, '
                    esqp_11 ~~ esqp_12')

Re-estimate the modified model following the steps as with ESQmodel, and
assign the results to a new object fit.m.
QUESTION 14. Examine the modified model summary output. What is the
chi-square for this modified model? How many degrees of freedom do we have
and why? What is the goodness of fit of this modified model?

Step 7. Saving your work

After you finished work with this exercise, save your R script by pressing the
Save icon in the script window. To keep all the created objects (such as fit.2),
which might be useful when revisiting this exercise, save your entire work space.
To do that, when closing the project by pressing File / Close project, make sure
you select Save when prompted to save your “Workspace image”.

16.4 Solutions

Q1. Factor loadings for esqp_01 and esqp_08 were not actually estimated;
instead, they were fixed to 1 to set the scale of the latent factors (latent factors
then simply take the scale of these particular measured variables). That is why
there is no usual estimation statistics reported for these. Factor loadings for the
remaining 10 items (12-2=10) are free parameters in the model to estimate.
Q2. There are 14 unobserved (latent) variables – 2 common factors (Care
and Environment), and 12 unique factors /errors (these are labelled by the
prefix . in front of the observed variable to which this error term is attached,
for example .esqp_01, .esqp_02, etc. It so happens that in this model, all
the latent variables are exogenous (independent), and for that reason variances
for them are estimated. In addition, covariance is estimated between the 2
common factors. Of course, the errors are assumed uncorrelated (remember the
local independence assumption in factor analysis?), and so their covariances are
fixed to 0 (not estimated and not printed in the output).
Q3. You can look this up in lavaan output. It says: Number of free
parameters 25. To work out how this number is derived, you can look how
many rows of values are printed in Parameter Estimates under each category.
The model estimates: 10 factor loadings (see Q1) + 2 factor variances (see Q2)
+ 1 factor covariance (see Q2) + 12 error variances (see Q2). That’s it. So we
have 10+2+1+12=25 parameters.
Q4. Sample moments refers to the number of variances and covariances in our
observed data. To know how many sample moments there are, the only thing
you need to know is how many observed variables there are. There are 12 ob-
served variables, therefore there are 12 variances and 12(12-1)/2=66 covariances,
78 “moments” in total.
You can use the following formula for calculating the number of sample moments
for any given data: m(m+1)/2 where m is the number of observed variables.
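
For example:

m <- 12
m*(m+1)/2   # 78 sample moments for 12 observed variables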
Q5. Chi-square is 264.573 (df=53, which is calculated as 78 sample moments
minus 25 parameters). We have to reject the model, because the test indicates
that the probability of this factor model holding in the population is less than
.001.
Q6. Comparative Fit Index (CFI) = 0.903, which is larger than 0.90 but smaller
than 0.95, indicating adequate fit. The Root Mean Square Error of Approxima-
tion (RMSEA) = 0.110, and the 90% confidence interval for RMSEA is (0.097,
0.123), well outside of the cut-off .08 for adequate fit. Standardized Root Mean
Square Residual (SRMR) = 0.054 is small as we would hope for a well-fitting
model. All indices except the RMSEA indicate acceptable fit of this model.
Q7. All factor loadings are positive as would be expected with questions being
positive indicators of satisfaction. The SEs are low (magnitude of 1/sqrt(N), as
they should be in a properly identified model). All factor loadings are signifi-
cantly different from 0 (p-values are very small).

Latent Variables:
Estimate Std.Err z-value P(>|z|)
Care =~
esqp_01 0.361 0.023 15.872 0.000
esqp_02 0.350 0.024 14.350 0.000
esqp_03 0.219 0.018 12.011 0.000
esqp_04 0.412 0.023 18.269 0.000
esqp_05 0.520 0.028 18.416 0.000
esqp_06 0.430 0.029 14.903 0.000
esqp_07 0.477 0.028 16.845 0.000
esqp_11 0.448 0.025 17.911 0.000
esqp_12 0.459 0.024 19.427 0.000
Environment =~
esqp_08 0.211 0.038 5.617 0.000
esqp_09 0.321 0.045 7.189 0.000
esqp_10 0.194 0.035 5.624 0.000

Q8. The factor covariance is easy to interpret in terms of size, because we set
the factors’ variances =1, and therefore factor covariance is correlation. The
correlation between Care and Environment is positive and of medium size (r
= 0.438). This estimate is close to the latent factor correlation estimated in
EFA in Exercise 9 (r = 0.39).

Covariances:
Estimate Std.Err z-value P(>|z|)
Care ~~
Environment 0.438 0.073 6.022 0.000

Q9. The Estimate and Std.lv parameter values are identical because in fit.2,
the factors were scaled by setting their variances =1. So, the model is already
standardized on the latent variables (factors). The Std.lv and Std.all param-
eter values are different because the observed variables were raw item responses
and not standardized originally. The standardization of factors (Std.lv) does
not standardize the observed variables. Only in Std.all, the observed variables
are also standardized. The Std.all output makes the results comparable with
EFA in Exercise 9, where all variables are standardized.

Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Care =~
esqp_01 0.361 0.023 15.872 0.000 0.361 0.755
esqp_02 0.350 0.024 14.350 0.000 0.350 0.702
esqp_03 0.219 0.018 12.011 0.000 0.219 0.611
esqp_04 0.412 0.023 18.269 0.000 0.412 0.830
esqp_05 0.520 0.028 18.416 0.000 0.520 0.834
esqp_06 0.430 0.029 14.903 0.000 0.430 0.721
esqp_07 0.477 0.028 16.845 0.000 0.477 0.786
esqp_11 0.448 0.025 17.911 0.000 0.448 0.819
esqp_12 0.459 0.024 19.427 0.000 0.459 0.862
Environment =~
esqp_08 0.211 0.038 5.617 0.000 0.211 0.424
esqp_09 0.321 0.045 7.189 0.000 0.321 0.614
esqp_10 0.194 0.035 5.624 0.000 0.194 0.425

Q10. The Std.all loadings are interpreted as the standardized factor loadings
in EFA. In models with orthogonal factors, these loadings are equal to the cor-
relation between the item and the factor; also, squared factor loading represents
the proportion of variance explained by the factor.
Q11. The Std.all error variances range from (small) 0.256 for .esqp_12 to
(quite large) 0.820 for .esqp_08 and .esqp_10. I judge them to be “small”
or “large” considering that the observed variables are now standardized and
have variance 1, the error variance is simply the proportion of 1. The remaining
proportion of variance is due to the common factors. For example, for esqp_12,
error variance is 0.256 and this means 25.6% of variance is due to error and the
remaining 74.4% (1-0.256) is due to the common factors.
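
You can verify this relationship against the standardized factor loading of
esqp_12 reported under Q9 (a quick check; the two values agree up to rounding):

0.862^2     # = 0.743, proportion of variance due to the common factor
1 - 0.256   # = 0.744, the same proportion from the standardized error variance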

Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.esqp_01 0.099 0.008 11.732 0.000 0.099 0.431
.esqp_02 0.127 0.011 12.036 0.000 0.127 0.508
.esqp_03 0.080 0.007 12.356 0.000 0.080 0.626
.esqp_04 0.077 0.007 10.966 0.000 0.077 0.312
.esqp_05 0.119 0.011 10.902 0.000 0.119 0.305
.esqp_06 0.171 0.014 11.937 0.000 0.171 0.480
.esqp_07 0.140 0.012 11.476 0.000 0.140 0.382
.esqp_11 0.099 0.009 11.114 0.000 0.099 0.329
.esqp_12 0.073 0.007 10.366 0.000 0.073 0.256
.esqp_08 0.204 0.020 10.431 0.000 0.204 0.820
.esqp_09 0.171 0.027 6.352 0.000 0.171 0.623
.esqp_10 0.172 0.017 10.421 0.000 0.172 0.820

Q12. There are a few residuals over 0.1 in size. The largest is for correlation be-
tween esqp_06 and esqp_10 (-.137). It is also significantly different from zero
(standardized residual is -3.155, which is greater than 1.96). There are larger
standardized residuals, for example -5.320 between esqp_01 and esqp_12.

Q13.

<snip>
92 esqp_07 ~~ esqp_12 6.374 0.017 0.017 0.171 0.171
93 esqp_07 ~~ esqp_08 0.667 0.008 0.008 0.050 0.050
94 esqp_07 ~~ esqp_09 0.784 -0.009 -0.009 -0.060 -0.060
95 esqp_07 ~~ esqp_10 3.022 -0.017 -0.017 -0.106 -0.106
96 esqp_11 ~~ esqp_12 66.449 0.048 0.048 0.571 0.571
97 esqp_11 ~~ esqp_08 0.150 -0.003 -0.003 -0.024 -0.024
98 esqp_11 ~~ esqp_09 1.016 0.009 0.009 0.070 0.070
99 esqp_11 ~~ esqp_10 1.421 0.010 0.010 0.074 0.074
100 esqp_12 ~~ esqp_08 2.555 0.012 0.012 0.102 0.102
<snip>

The largest index is for the covariance esqp_11 ~~ esqp_12 (66.449). It is
more than twice as large as any other index. This means that the chi-square
would reduce by at least 66 if we allowed the errors of esqp_11 and esqp_12
to correlate, which suggests that these items share something in common beyond
what the factor Care accounts for.
Q14.

# fit the modified model with standardized factors, and ask for summary output
fit.m <- cfa(model=ESQmodel.m, data=CHI_ESQ2, std.lv=TRUE)
summary(fit.m)

For the modified model, Chi-square = 201.249, Degrees of freedom = 52, p <
.001. The number of DF is 52, not 53 as in the original model. This is because
we added one more parameter to estimate (the error covariance), therefore
losing 1 degree of freedom. The modification reduced the RMSEA to 0.093
(90% CI 0.080-0.107) - still too high for an adequate fit.
Exercise 17

CFA of a correlation matrix of subtest scores from an ability test battery

Data file Thurstone.csv


R package lavaan

17.1 Objectives

In this exercise, you will fit a Confirmatory Factor Analysis (CFA) model, or
measurement model, to a published correlation matrix of scores on subtests from
an ability test battery. We already considered an Exploratory Factor Analysis
(EFA) of these same data in Exercise 10. In the process of fitting a CFA model,
you will start getting familiar with the SEM language and techniques using R
package lavaan (stands for latent variable analysis).

17.2 Study of “primary mental abilities” of Louis Thurstone

This is a classic study of “primary mental abilities” by Louis Thurstone. Thur-
stone analysed 9 subtests, each measuring some specific mental ability with
several tasks. The 9 subtests were designed to measure 3 broader mental abili-
ties, namely:

Verbal Ability: 1. sentences, 2. vocabulary, 3. sentence completion
Word Fluency: 4. first letters, 5. four-letter words, 6. suffixes
Reasoning Ability: 7. letter series, 8. pedigrees, 9. letter grouping

Each subtest was scored as a number of correctly completed tasks, and the
scores on each of the 9 subtests can be considered continuous.
You will work with the published correlations between the 9 subtests, based
on N=215 subjects (Thurstone, 1947). Although you have no access to the actual
subject scores, the correlation matrix is all you need to fit a basic CFA model
without intercepts.

17.3 Worked Example - CFA of Thurstone’s primary ability data

To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.

Step 1. Reading and examining data

The data (subtest correlations) are stored in a comma-separated (.csv) file
Thurstone.csv. If you have completed Exercise 10, you would have already
downloaded this file and created a project associated with it. You can continue
working in that project.
If you have not completed Exercise 10, you can download the file now into a
new folder and create a project associated with it. Preview the file by clicking
on Thurstone.csv in the Files tab in RStudio, and selecting View File. This
will not import the data yet, just open the actual .csv file. You will see that the
first row contains the subscale names - “s1”, “s2”, “s3”, etc., and the first entry
in each row is again the facet names. To import this correlation matrix into
RStudio preserving all these facet names for each row and each column, we will
use function read.csv(). We will say that we have headers (header=TRUE),
and that the row names are contained in the first column (row.names=1).

Thurstone <- read.csv(file="Thurstone.csv", header=TRUE, row.names=1)

Examine object Thurstone that should now be in the Environment tab. Ex-
amine the object by either pressing on it, or calling function View(). You will
see that it is a correlation matrix, with values 1 on the diagonal, and moderate
to large positive values off the diagonal. This is typical for ability tests – they
tend to correlate positively with each other.

Step 2. Specifying and fitting a measurement model

Before you start using commands from the package lavaan, make sure you have
it installed (if not, use menu Tools and then Install packages…), and load it:

library(lavaan)

Given that the subtests were designed to measure 3 broader mental abilities
(Verbal, Word Fluency and Reasoning Ability), we will test the below CFA
model.

Figure 17.1: Theoretical model for Thurstone’s ability data (Verbal = Vrb,
Word Fluency = WrF and Reasoning Ability = Rsn)

We need to “code” this model in syntax, using the same lavaan syntax conven-
tions as before (=~ means “is measured by”, ~~ denotes a variance or covariance).

To describe the model in Figure 17.1 in words, we would say:

• Verbal is measured by s1 and s2 and s3
• WordFluency is measured by s4 and s5 and s6
• Reasoning is measured by s7 and s8 and s9

A shorthand for and is the plus sign +. So, this is how we translate the above
sentences into syntax for our model (let’s call it T.model):

T.model <- ' Verbal =~ s1 + s2 + s3
             WordFluency =~ s4 + s5 + s6
             Reasoning =~ s7 + s8 + s9 '

Be sure to have the single quotation marks opening and closing the syntax.
Begin each equation on the new line. Spaces within each equation are optional,
they just make reading easier for you (but R does not mind).
By default, lavaan will scale each factor by setting the same unit as its first
indicator (and the factor variance is then freely estimated). So, it will set
the loading of s1 to 1 to scale factor Verbal, it will set the loading of s4
to 1 to scale factor WordFluency, and the loading of s7 to 1 to scale factor
Reasoning. You can see this on the diagram, where the loading paths for these
three indicators are in dashed rather than solid lines.
Write the above syntax in your Script, and run it by highlighting the whole lot
and pressing the Run button or Ctrl+Enter keys. You should see a new object
T.model appear on the Environment tab. This object contains the model
syntax.
To fit this CFA model, we need function cfa(). We need to pass to this function
the model name (model=T.model), and the data. However, you cannot specify
data=Thurstone because the argument data is reserved for raw data (subjects
x variables), but our data is the correlation matrix! Instead, we need to pass our
matrix to the argument sample.cov, which is reserved for sample covariance (or
correlation) matrices. [That also means that we need to convert Thurstone,
which is a data frame, into matrix before submitting it to analysis.] Of course
we also need to tell R the number of observations (sample.nobs=215), because
the correlation matrix does not provide such information.

# convert data frame to matrix
Thurstone <- as.matrix(Thurstone)

# fit the model with default scaling, and ask for summary output
fit <- cfa(model=T.model, sample.cov=Thurstone, sample.nobs=215)

summary(fit)

## lavaan 0.6.15 ended normally after 30 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 21
##
## Number of observations 215
##
## Model Test User Model:
##
## Test statistic 38.737
## Degrees of freedom 24
## P-value (Chi-square) 0.029
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## Verbal =~
## s1 1.000
## s2 1.010 0.050 20.031 0.000
## s3 0.946 0.053 17.726 0.000
## WordFluency =~
## s4 1.000
## s5 0.954 0.081 11.722 0.000
## s6 0.841 0.081 10.374 0.000
## Reasoning =~
## s7 1.000
## s8 0.922 0.097 9.514 0.000
## s9 0.901 0.097 9.332 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## Verbal ~~
## WordFluency 0.484 0.071 6.783 0.000
## Reasoning 0.471 0.070 6.685 0.000
## WordFluency ~~
## Reasoning 0.414 0.067 6.147 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .s1 0.181 0.028 6.418 0.000
## .s2 0.164 0.027 5.981 0.000
## .s3 0.266 0.033 8.063 0.000
## .s4 0.300 0.050 5.951 0.000
## .s5 0.363 0.052 6.974 0.000
## .s6 0.504 0.059 8.553 0.000
## .s7 0.389 0.059 6.625 0.000
## .s8 0.479 0.062 7.788 0.000
## .s9 0.503 0.063 8.032 0.000
## Verbal 0.815 0.097 8.402 0.000
## WordFluency 0.695 0.100 6.924 0.000
## Reasoning 0.607 0.099 6.115 0.000

Examine the output. Start with finding estimated factor loadings (under Latent
Variables: find statements =~), factor variances (under Variances: find state-
ments ~~) and covariances (under Covariances: find statements ~~).
QUESTION 1. Why are the factor loadings of s1, s4 and s7 equal 1, and no
Standard Errors, z-values or p-values are printed for them? How many factor
loadings are there to estimate?
QUESTION 2. How many latent variables are there in your model? What
are they? What parameters are estimated for these variables? [HINT. You do
not need the R output to answer the first part of the question. You need the
model diagram.]
QUESTION 3. How many parameters are there to estimate in total? What
are they?
QUESTION 4. How many known pieces of information (sample moments)
are there in the data? What are the degrees of freedom and how are they made
up?
QUESTION 5. Interpret the chi-square (reported as Test statistic). Do
you retain or reject the model?
To obtain more measures of fit beyond the chi-square, request an extended
output:

summary(fit, fit.measures=TRUE)

QUESTION 6. Examine the extended output; find and interpret the SRMR,
CFI and RMSEA.

Step 3. Alternative scaling of latent factors

Now, I suggest you change the way your factors are scaled – for the sake of
learning alternative ways of scaling latent variables, which will be useful in
different situations. The other popular way of scaling common factors is by
setting their variances to 1 and freeing all the factor loadings.
You can use your T.model, but request the cfa() function to standardize all
the latent variables (std.lv=TRUE), so that their variances are set to 1. Assign
this new model run to a new object, fit.2, so you can compare it to the previous
run, fit.

fit.2 <- cfa(T.model, sample.cov=Thurstone, sample.nobs=215, std.lv=TRUE)

summary(fit.2)

Examine the output. First, compare the test statistic (chi-square) between
fit.2 and fit (see your answer for Q5). The two chi-square statistics and their
degrees of freedom should be exactly the same! This is because alternative ways
of scaling do not change the model fit or the number of parameters. They just
swap the parameters to be estimated. The standardized version of T.model
sets the factor variances to 1 and estimates all factor loadings, while the original
version sets 3 factor loadings (one per each factor) to 1 and estimates 3 factor
variances. The difference between the two models is the matter of scaling -
otherwise they are mathematically equivalent. This is nice to know, so you can
use one or the other way of scaling depending on what is more convenient in a
particular situation.

Step 4. Interpreting model parameters

OK, now examine the latest output with standardized latent variables (factors),
interpret its parameters and answer the following questions.
QUESTION 7. Interpret the factor loadings and their standard errors.
QUESTION 8. Interpret factor covariances. Are they what you would expect?
QUESTION 9. Are the error variances small or large, and what do you
compare them to in order to judge their magnitude?
Now obtain the standardized solution (in which EVERYTHING is standard-
ized - the latent and observed variables) by adding standardized=TRUE to the
summary() output:

summary(fit.2, standardized=TRUE)

QUESTION 10. Compare the default (unstandardized) parameter values in
Estimate column, values in Std.lv column (standardized on latent variables
only), and values in Std.all column (standardized on all - latent and observed
variables). Can you explain why Estimate are identical to Std.lv, and Std.all
are nearly identical to them both?

Step 5. Examining residuals and local areas of misfit

Now request two additional outputs – fitted covariance matrix and residuals.
The fitted covariances are covariances predicted by the model. Residuals are
differences between fitted and actual observed covariances.

fitted(fit.2)

## $cov
## s1 s2 s3 s4 s5 s6 s7 s8 s9
## s1 0.995
## s2 0.823 0.995
## s3 0.771 0.779 0.995
## s4 0.484 0.489 0.458 0.995
## s5 0.461 0.466 0.437 0.663 0.995
## s6 0.407 0.411 0.385 0.584 0.557 0.995
## s7 0.471 0.476 0.446 0.414 0.395 0.348 0.995
## s8 0.435 0.439 0.411 0.382 0.364 0.321 0.560 0.995
## s9 0.424 0.429 0.402 0.373 0.356 0.313 0.547 0.504 0.995

residuals(fit.2)

## $type
## [1] "raw"
##
## $cov
## s1 s2 s3 s4 s5 s6 s7 s8 s9
## s1 0.000
## s2 0.001 0.000
## s3 0.001 -0.003 0.000
## s4 -0.047 0.002 0.000 0.000
## s5 -0.031 -0.004 -0.014 0.008 0.000
## s6 0.038 0.076 0.056 0.003 -0.019 0.000
## s7 -0.026 -0.046 -0.047 -0.035 0.005 -0.061 0.000
## s8 0.104 0.096 0.120 -0.033 0.001 -0.002 -0.007 0.000
## s9 -0.046 -0.072 -0.044 0.049 0.088 0.010 0.048 -0.054 0.000

Examine the fitted covariances and compare them to the observed covariances
for the 9 subtests - you can see them again by calling View(Thurstone).
The residuals show if all covariances are well reproduced by the model. Since
our observed covariances are actually correlations (we worked with correlation
matrix, with 1 on the diagonal), interpretation of residuals is very simple. Just
think of them as differences between 2 sets of correlations, with the usual effect
sizes assumed for correlation coefficients.
You can also request standardized residuals, to formally test for significant dif-
ferences from zero. Any standardized residual larger than 1.96 in magnitude
(approximately 2 standard deviations from the mean in the standard normal
distribution) is significantly different from 0 at p=.05 level.

residuals(fit.2, type="standardized")

QUESTION 11. Are there any large residuals? Are there any statistically
significant residuals?
Now request modification indices by calling

modindices(fit.2)

QUESTION 12. Examine the modification indices output. Find the largest
index. What does it suggest?

Step 6. Modifying the model

Modify the model by adding a direct path from Verbal factor to subtest 8 (allow
s8 to load on the Verbal factor). All you need to do is to add the indicator s8
to the definition of the Verbal factor in T.model, creating T.model.m:

T.model.m <- ' Verbal =~ s1+s2+s3 + s8
WordFluency =~ s4+s5+s6
Reasoning =~ s7+s8+s9 '

Re-estimate the modified model following the steps as with T.model, and assign
the results to a new object fit.m.
QUESTION 13. Examine the modified model summary output. What is the
chi-square for this modified model? How many degrees of freedom do we have
and why? Do we accept or reject this modified model?

Step 7. Saving your work

After you have finished work with this exercise, save your R script by pressing the
Save icon in the script window. To keep all the created objects (such as fit.2),
which might be useful when revisiting this exercise, save your entire work space.
To do that, when closing the project by pressing File / Close project, make sure
you select Save when prompted to save your “Workspace image”.

17.4 Solutions
Q1. Factor loadings for s1, s4 and s7 were not actually estimated; instead, they
were fixed to 1 to set the scale of the latent factors (latent factors then simply
take the scale of these particular measured variables). That is why the usual
estimation statistics are not reported for these. Factor loadings for the remaining
6 subtests (9-3=6) are free parameters in the model to estimate.
Q2. There are 12 unobserved (latent) variables – 3 common factors (Verbal,
WordFluency and Reasoning), and 9 unique factors /errors (these are la-
belled by the prefix . in front of the observed variable to which this error term
is attached, for example .s1, .s2, etc.). It so happens that in this model, all
the latent variables are exogenous (independent), and for that reason variances
for them are estimated. In addition, covariances are estimated between the 3
common factors (3 covariances, one for each pair of factors). Of course, the
errors are assumed uncorrelated (remember the local independence assumption
in factor analysis?), and so their covariances are fixed to 0 (not estimated and
not printed in the output).
Q3. You can look this up in lavaan output. It says: Number of free
parameters 21. To work out how this number is derived, you can look at how
many rows of values are printed in Parameter Estimates under each category.
The model estimates: 6 factor loadings (see Q1) + 3 factor variances (see Q2)
+ 3 factor covariances (see Q2) + 9 error variances (see Q2). That’s it. So we
have 6+3+3+9=21 parameters.
Q4. Sample moments refers to the number of variances and covariances in
our observed data. To know how many sample moments there are, the only
thing you need to know is how many observed variables there are. There are 9
observed variables, therefore there are 9 variances and 9(9-1)/2=36 covariances,
45 “moments” in total.
[Since our data is the correlation matrix, these sample moments are actually
right in front of you. How many unique cells are in the 9x9 correlation matrix?
There are 81 cells, but not all of them are unique, because the top and bottom
off-diagonal parts are mirror images of each other. So, we only need to count
the diagonal cells and the top off-diagonal cells, 9 + 36 = 45. ]
You can use the following formula for calculating the number of sample moments
for any given data: m(m+1)/2 where m is the number of observed variables.
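For example, a quick check of this formula in R for our 9 subtests:

m <- 9           # number of observed variables
m * (m + 1) / 2  # 45 sample moments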
Q5. Chi-square is 38.737 (df=24). We have to reject the model, because the
test indicates that the probability of this factor model holding in the population
is less than .05 (P = .029).
Q6. Comparative Fit Index (CFI) = 0.986, which is larger than 0.95, indicating
excellent fit. The Root Mean Square Error of Approximation (RMSEA) =
0.053, just outside of the cut-off .05 for close fit. The 90% confidence interval
for RMSEA is (0.017, 0.083), which just includes the cut-off 0.08 for adequate
fit. Overall, RMSEA indicates adequate fit. Standardized Root Mean Square
Residual (SRMR) = 0.044 is small as we would hope for a well-fitting model.
All indices indicate close fit of this model, despite the significant chi-square.
Q7. All factor loadings are positive as would be expected with ability variables,
since ability subtests should be positive indicators of the ability domains. The
SEs are low (magnitude of 1/sqrt(N), as they should be in a properly identified
model). All factor loadings are significantly different from 0 (p-values are very
small).

Latent Variables:
Estimate Std.Err z-value P(>|z|)
Verbal =~
s1 0.903 0.054 16.805 0.000
s2 0.912 0.053 17.084 0.000
s3 0.854 0.056 15.389 0.000
WordFluency =~
s4 0.834 0.060 13.847 0.000
s5 0.795 0.061 12.998 0.000
s6 0.701 0.064 11.012 0.000
Reasoning =~
s7 0.779 0.064 12.230 0.000
s8 0.718 0.065 11.050 0.000
s9 0.702 0.065 10.729 0.000

Q8. The factor covariances are easy to interpret in terms of size, because we set
the factors’ variances =1, and therefore factor covariances are correlations. The
correlations between the 3 ability domains are positive and large, as expected.

Covariances:
Estimate Std.Err z-value P(>|z|)
Verbal ~~
WordFluency 0.643 0.050 12.815 0.000
Reasoning 0.670 0.051 13.215 0.000
WordFluency ~~
Reasoning 0.637 0.058 10.951 0.000

Q9. The error variances are certainly quite small – considering that the observed
variables had variance 1 (remember, we analysed the correlation matrix, with 1
on the diagonal?), the error variance is less than half that for most subtests. The
remaining proportion of variance is due to the common factors. For example,
for s1, error variance is 0.18 and this means 18% of variance is due to error and
the remaining 82% (1-0.18) is due to the common factors.

Variances:
Estimate Std.Err z-value P(>|z|)
.s1 0.181 0.028 6.418 0.000
.s2 0.164 0.027 5.981 0.000
.s3 0.266 0.033 8.063 0.000
.s4 0.300 0.050 5.951 0.000
.s5 0.363 0.052 6.974 0.000
.s6 0.504 0.059 8.553 0.000
.s7 0.389 0.059 6.625 0.000
.s8 0.479 0.062 7.788 0.000
.s9 0.503 0.063 8.032 0.000
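Rather than computing these proportions by hand, you can ask lavaan for them directly. A minimal sketch (assuming the fitted object fit.2 from Step 3): the "rsquare" option of inspect() returns, for each subtest, the proportion of variance explained by the common factors, i.e. 1 minus the standardized error variance.

# proportion of variance in each subtest explained by the common factors
inspect(fit.2, "rsquare")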

Q10. The Estimate and Std.lv parameter values are identical because in
fit.2, the factors were scaled by setting their variances =1. So, the model
is already standardized on the latent variables (factors). The Estimate and
Std.all parameter values are almost identical (allowing for the rounding error
in third decimal place) because the observed variables were already standardized
since we worked with a correlation matrix!
Q11. The largest residual is for correlation between s3 and s8 (.120). It is also
significantly different from zero (standardized residual is 3.334, which is greater
than 1.96).
Q12.

modindices(fit.2)

## lhs op rhs mi epc sepc.lv sepc.all sepc.nox
## 25 Verbal =~ s4 1.030 -0.089 -0.089 -0.090 -0.090
## 26 Verbal =~ s5 0.512 -0.061 -0.061 -0.061 -0.061
## 27 Verbal =~ s6 3.764 0.162 0.162 0.163 0.163
## 28 Verbal =~ s7 5.081 -0.238 -0.238 -0.238 -0.238
## 29 Verbal =~ s8 19.943 0.447 0.447 0.448 0.448
## 30 Verbal =~ s9 4.736 -0.215 -0.215 -0.216 -0.216
## 31 WordFluency =~ s1 2.061 -0.085 -0.085 -0.085 -0.085
## 32 WordFluency =~ s2 1.158 0.063 0.063 0.064 0.064
## 33 WordFluency =~ s3 0.148 0.024 0.024 0.024 0.024
## 34 WordFluency =~ s7 2.629 -0.167 -0.167 -0.167 -0.167
## 35 WordFluency =~ s8 0.011 -0.011 -0.011 -0.011 -0.011
## 36 WordFluency =~ s9 3.341 0.180 0.180 0.180 0.180
## 37 Reasoning =~ s1 0.105 0.021 0.021 0.021 0.021
## 38 Reasoning =~ s2 0.341 -0.038 -0.038 -0.038 -0.038
## 39 Reasoning =~ s3 0.089 0.021 0.021 0.021 0.021
## 40 Reasoning =~ s4 0.626 -0.076 -0.076 -0.076 -0.076
## 41 Reasoning =~ s5 1.415 0.111 0.111 0.111 0.111
## 42 Reasoning =~ s6 0.182 -0.039 -0.039 -0.039 -0.039
## 43 s1 ~~ s2 0.164 0.018 0.018 0.105 0.105
## 44 s1 ~~ s3 0.063 0.009 0.009 0.043 0.043
## 45 s1 ~~ s4 2.120 -0.035 -0.035 -0.150 -0.150
## 46 s1 ~~ s5 0.041 -0.005 -0.005 -0.019 -0.019
## 47 s1 ~~ s6 0.051 0.006 0.006 0.020 0.020
## 48 s1 ~~ s7 0.189 0.011 0.011 0.043 0.043
## 49 s1 ~~ s8 0.580 0.021 0.021 0.071 0.071
## 50 s1 ~~ s9 0.033 -0.005 -0.005 -0.017 -0.017
## 51 s2 ~~ s3 0.385 -0.024 -0.024 -0.114 -0.114
## 52 s2 ~~ s4 0.285 0.013 0.013 0.057 0.057
## 53 s2 ~~ s5 0.044 -0.005 -0.005 -0.021 -0.021
## 54 s2 ~~ s6 1.337 0.031 0.031 0.107 0.107
## 55 s2 ~~ s7 0.213 -0.012 -0.012 -0.047 -0.047
## 56 s2 ~~ s8 0.532 0.020 0.020 0.070 0.070
## 57 s2 ~~ s9 2.161 -0.040 -0.040 -0.140 -0.140
## 58 s3 ~~ s4 0.295 0.014 0.014 0.051 0.051
## 59 s3 ~~ s5 0.211 -0.013 -0.013 -0.041 -0.041
## 60 s3 ~~ s6 0.047 0.007 0.007 0.018 0.018
## 61 s3 ~~ s7 1.165 -0.031 -0.031 -0.098 -0.098
## 62 s3 ~~ s8 2.617 0.049 0.049 0.138 0.138
## 63 s3 ~~ s9 0.043 -0.006 -0.006 -0.017 -0.017
## 64 s4 ~~ s5 0.985 0.065 0.065 0.196 0.196
## 65 s4 ~~ s6 0.046 0.011 0.011 0.029 0.029
## 66 s4 ~~ s7 0.015 -0.004 -0.004 -0.013 -0.013
## 67 s4 ~~ s8 1.557 -0.045 -0.045 -0.119 -0.119
## 68 s4 ~~ s9 1.193 0.040 0.040 0.103 0.103
## 69 s5 ~~ s6 1.275 -0.057 -0.057 -0.134 -0.134
## 70 s5 ~~ s7 0.380 0.022 0.022 0.059 0.059
## 71 s5 ~~ s8 0.322 -0.021 -0.021 -0.051 -0.051
## 72 s5 ~~ s9 2.937 0.065 0.065 0.151 0.151
## 73 s6 ~~ s7 1.963 -0.054 -0.054 -0.123 -0.123
## 74 s6 ~~ s8 0.066 0.010 0.010 0.021 0.021
## 75 s6 ~~ s9 0.201 -0.018 -0.018 -0.037 -0.037
## 76 s7 ~~ s8 0.257 -0.031 -0.031 -0.071 -0.071
## 77 s7 ~~ s9 9.757 0.183 0.183 0.414 0.414
## 78 s8 ~~ s9 6.840 -0.141 -0.141 -0.287 -0.287

The largest index is for the factor loading Verbal =~ s8 (19.943). It is at least
twice as large as any other index. This means that chi-square would reduce
by at least 19.943 if we allowed subtest s8 to load on the Verbal factor (note
that currently it does not, because only s1, s2 and s3 load on Verbal). It
appears that solving subtest 8 (“Pedigrees” - Identification of familial relation-
ships within a family tree) requires verbal as well as reasoning ability (s8 is
factorially complex).
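As an aside, rather than scanning the whole table by eye, you can ask modindices() to sort and truncate its output (sort. and maximum.number are arguments of lavaan's modindices() function):

# show only the 5 largest modification indices, sorted in decreasing order
modindices(fit.2, sort. = TRUE, maximum.number = 5)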
Q13.

# fit the model with default scaling, and ask for summary output
fit.m <- cfa(model=T.model.m, sample.cov=Thurstone, sample.nobs=215)
summary(fit.m)

For the modified model, Chi-square = 20.882, Degrees of freedom = 23, p-value
= .588. The number of DF is 23, not 24 as in the original model. This is because
we added one more parameter to estimate (factor loading), therefore losing 1
degree of freedom. The model fits the data (the chi-square test is non-significant),
and we accept it.
Part VIII

PATH ANALYSIS

Exercise 18

Fitting a path model to observed test scores

Data file Wellbeing.RData


R package lavaan

18.1 Objectives
In this exercise, you will test a path analysis model. Path analysis is also
known as “SEM without common factors”. In path analysis, we model observed
variables only; the only latent variables in these models are residuals/errors.
Your first model will be relatively simple; however, in the process you will learn
how to build path model diagrams in package lavaan, test your models and
interpret outputs.

18.2 Study of subjective well-being of Feist et al. (1995)
Data for this exercise is taken from a study by Feist et al. (1995). The study
investigated causal pathways of subjective well-being (SWB). Here is the ab-
stract:
Although there have been many recent advances in the literature on subjective
well-being (SWB), the field historically has suffered from 2 shortcomings: little
theoretical progress and lack of quasi-experimental or longitudinal design (E.
Diener, 1984). Causal influences therefore have been difficult to determine.
After collecting data over 4 time periods with 160 Subjects, the authors compared
how well 2 alternative models of SWB (bottom-up and top-down models) fit the
data. Variables of interest in both models were physical health, daily hassles,
world assumptions, and constructive thinking. Results showed that both models
provided good fit to the data, with neither model providing a closer fit than
the other, which suggests that the field would benefit from devoting more time
to examining how general dispositions toward happiness color perceptions of
life’s experiences. Results implicate bidirectional causal models of SWB and its
personality and situational influences.
We will work with published correlations between the observed variables, based
on N=149 subjects.

18.3 Worked Example - Testing a bottom-up model of Well-being
To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.

Step 1. Loading and examining data

The correlations are stored in the file Wellbeing.RData. Download this file
and save it in a new folder (e.g. “Wellbeing study”). In RStudio, create a new
project in the folder you have just created. Create a new R script, where you
will be writing all your commands.
Since the data file is in the “native” R format, you can simply load it into the
environment:

load("Wellbeing.RData")

A new object Wellbeing should appear in the Environment tab. Examine the
object by either clicking on it or calling View(Wellbeing). You should see that
it is indeed a correlation matrix, with values 1 on the diagonal.

Step 2. Specifying and fitting the model

Before you start modelling, load the package lavaan:

library(lavaan)

For our purposes, we only fit a bottom-up model of SWB, using data on physical
health, daily hassles, world assumptions, and constructive thinking collected at
the study’s baseline, and the data on SWB collected one month later. This
analysis is described in “Test theory: A Unified Treatment” (McDonald, 1999)
for illustration of main points and issues with path models.
We will be analyzing the following variables:
SWB - Subjective Well-Being, measured as the sum score of PIL (purpose in
life = extent to which one possesses goals and directions), EM (environmental
mastery), and SA (self-acceptance).
WA - World Assumptions, measured as the sum of BWP (benevolence of world
and people), and SW (self as worthy = extent of satisfaction with self).
CT - Constructive Thinking, measured as the sum of GCT (global constructive
thinking = acceptance of self and others), BC (behavioural coping = ability to
focus on effective action); EC (emotional coping = ability to avoid self-defeating
thoughts and feelings).
PS - Physical Symptoms, indicating problems rather than health. It is mea-
sured as the sum of FS (frequency of symptoms), MS (muscular symptoms); GS
(gastrointestinal symptoms).
DH - Daily ‘Hassles’, measured as the sum of TP (time pressures), MW (money
worries) and IC (inner concerns).
Given the study description, we will test the following path model.

Figure 18.1. Bottom-up model of Well-being

QUESTION 1. How many observed and latent variables are there in your
model? What are the latent variables in this model? What are the exogenous
and endogenous variables in this model?
Now you need to "code" this model, using the lavaan syntax conventions. To
describe the model in Figure 1 in words, we need to mention all of its regression
relationships. Let’s work from left to right of the diagram:
• WA is regressed on PS and DH (residual for WA is assumed)
• CT is regressed on PS and DH (residual for CT is assumed)
• SWB is regressed on WA and CT (residual for SWB is assumed)
We also need to mention all covariance relationships:
• PS is correlated with DH
• Residual for WA is correlated with residual for CT
Try to write the respective statements using the shorthand symbol ~ for is
regressed on, symbol ~~ for is correlated with, and the plus sign + for and.
You should obtain something similar to this (I called my model W.model, but
you can call it whatever you like):

W.model <- '
# regressions
WA ~ PS + DH
CT ~ PS + DH
SWB ~ WA + CT
# covariance of IVs
PS ~~ DH
# residual covariance of DVs
WA ~~ CT '

NOTE that to describe covariance between residuals of WA and CT, we just
referred to the variables themselves (WA ~~ CT). Actually, lavaan understands
what we mean, because for DVs, variances are "written away" to their predictors,
and what is left over is the residual, so by referring to the DV in a variance or
covariance statement, you are actually referring to its residual. In the output, you
will see that lavaan acknowledges this by writing .WA ~~.CT, with dots in front
of variable names to mark their residuals.
To fit this path model, we need function sem(). It works very similarly to
function cfa() we used in Exercises 16 and 17 to fit CFA models. We need to
pass the model description (model=W.model, or simply W.model if you put this
argument first), and the data. However, we cannot specify data=Wellbeing
because the argument data is reserved for raw data (subjects by variables), but
our data is the correlation matrix! Instead, we need to pass our matrix to the
argument sample.cov, which is reserved for sample covariance (or correlation)
matrices. We also need to pass the number of observations (sample.nobs=149),
because the correlations do not provide this information.

# Attention! We use "sample.cov=" because our data is the correlation matrix!
# Use "data=" for any standard case-by-variable data
sem(W.model, sample.cov=Wellbeing, sample.nobs=149)

That’s all you need to do to fit a scary looking path model from Figure 1.
Simple, right?

Step 3. Assessing Goodness of Fit with global indices and residuals

If you run the above command, you will get a very basic output telling you that
the model estimation ended normally, and giving you the chi-square statistic.
To get access to full results, assign the results to a new object (for example, fit),
and request the extended output including fit measures:

fit <- sem(W.model, sample.cov=Wellbeing, sample.nobs=149)

summary(fit, fit.measures=TRUE)

Now examine the output, starting with assessment of goodness of fit, and try
to answer the following questions.
QUESTION 2. How many parameters are there to estimate? How many
known pieces of information (sample moments) are there in the data? What are
the degrees of freedom and how are they calculated?
QUESTION 3. Interpret the chi-square. Do you retain or reject the model?
Now consider the following. McDonald (1999) argued that “researchers ap-
pear to rely too much on goodness-of-fit indices, when the actual discrepancies
[residuals] are much more informative” (p.390). He refers to indices such as
RMSEA and CFI, which we used to judge the goodness of fit of CFA models.
He continues: “generally, indices of fit are not informative in path analysis.
This is because large parts of the model can be unrestrictive, with only a few
discrepancies capable of being non-zero”.
To obtain residuals, which are differences between observed and ‘fit-
ted’/‘reproduced’ covariances, request this additional output:

residuals(fit)

## $type
## [1] "raw"
##
## $cov
## WA CT SWB PS DH
## WA 0.000
## CT 0.000 0.000
## SWB 0.000 0.000 0.000
## PS 0.000 0.000 0.096 0.000
## DH 0.000 0.000 -0.084 0.000 0.000

The residuals show how well the observed covariances are reproduced by the
model. Since our observed covariances are actually correlations (we worked
with the correlation matrix), interpretation of residuals is simple. Just think
of them as differences between 2 sets of correlations. We usually interpret a
correlation of 0.1 as small, and correlations below 0.1 as trivial. Use the same
logic for interpreting the residual correlations.
Note. In factor models, we used SRMR (standardized root mean square resid-
ual) to assess the average magnitude of the discrepancies between observed
and expected covariances in the correlation metric. This summary was useful
since in factor models, no direct covariances between observed variables were al-
lowed (remember the local independence assumption?) and all these covariances
needed to be reproduced by the factor model. In Path Analysis models, how-
ever, this index makes little sense because direct covariance paths are allowed
and therefore the reproduced correlations for them will be exact (residuals = 0).
Averaging (many) zero residuals with (few) non-zero ones will give a false im-
pression that the model fits well. So, ALWAYS look at the individual residuals,
and pay attention to those close or exceeding 0.1.
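For larger models, a quick programmatic way to flag the worst discrepancy is to take the largest absolute value of the residual matrix; a small sketch using the fitted object fit:

# largest absolute raw residual (observed minus fitted covariance)
max(abs(residuals(fit)$cov))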
Also request standardized residuals, to formally test for significant differences
from zero. Any standardized residual larger than 1.96 in magnitude (approx-
imately 2 standard deviations from the mean in the standard normal distribu-
tion) is significantly different from 0 at p=.05 level.

residuals(fit, type="standardized")

## $type
## [1] "standardized"
##
## $cov
## WA CT SWB PS DH
## WA 0.000
## CT 0.000 0.000
## SWB 0.000 0.000 0.000
## PS 0.000 0.000 2.051 0.000
## DH 0.000 0.000 -1.791 0.000 0.000

QUESTION 4. Are there any large residuals? Are there any statistically
significant residuals? What can you say about the model fit based on the resid-
uals?
We will not modify the model by adding any other paths to it – it already has
only 2 degrees of freedom, and we will reduce this number further by adding
parameters.

Step 4. Interpreting the parameter estimates


OK, let us now examine the full output (scroll up in the Console to see it,
or call summary(fit) again). Find estimated regression paths (statements ~),
variances (statements ~~) and covariances (statements ~~). Try to answer the
following question.

QUESTION 5. Interpret the regression coefficients and their standard errors.
Note that the observed variables in this study are standardized since we
are working with the correlation matrix, so the regression paths are like ‘beta’
coefficients in regression and are easy to interpret.

QUESTION 6. Examine the Variances output. Are the variances of regression
residuals (.SWB, .WA and .CT) small or large, and what do you compare
them to in order to judge their size?

Step 5. Computing indirect and total effects

It is sometimes important for research to quantify the effects – direct, indirect
and total. In the bottom-up model of Wellbeing, we might be interested in
describing the effect that Physical Symptoms (PS) have on Subjective Wellbeing
(SWB). If you look on the diagram again, you will see that PS does not influence
SWB directly, only indirectly via two routes – via WA and via CT. To compute
each of these indirect effects, we need to multiply the path coefficients along each
route. This will give us two indirect effects. To compute the total effect, we will
need to add the two indirect effects via the two routes. There is nothing else to
add here, because there is no other way of getting from PS to SWB.

You could do all these calculations by looking up the obtained parameter es-
timates and then multiplying and adding them (simple but tedious!). Or, you
could actually ask lavaan to do these for you while fitting the model. To do the
latter, you need to create ‘labels’ for the parameters of interest, and then write
formulas with these labels (also a bit tedious, but much less room for errors!).
It will be useful to learn how to create labels, because we will use them later in
the course.
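For reference, here is what the "tedious" manual route might look like. coef() returns the free parameters as a named vector, with names following the lavaan syntax (e.g. "WA~PS"); a sketch assuming the fitted object fit from Step 2:

est <- coef(fit)
# indirect effect of PS on SWB via WA (a1*b1)
est["WA~PS"] * est["SWB~WA"]
# indirect effect of PS on SWB via CT (a2*b2)
est["CT~PS"] * est["SWB~CT"]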

According to the diagram in Figure 2 below, I will label the path from PS to
WA as a1 and the path from WA to SWB as b1. Then, my indirect effect from
PS to SWB via WA will be equal to a1*b1. I will label the path from PS to CT
as a2 and the path from CT to SWB as b2. Then, my indirect effect from PS to
SWB via CT will be equal to a2*b2.

Now look at how I will modify the syntax to get these labels in place, in a new
model called W.model.label. The way to label a path coefficient is to add a
multiplier in front of the relevant variable (predictor), as is typical for regression
equations.

W.model.label <- '
# regressions - now with parameter labels assigned
WA ~ a1*PS + DH
CT ~ a2*PS + DH
SWB ~ b1*WA + b2*CT
# covariance of IVs (same as before)
PS ~~ DH
# residual covariance of DVs (same as before)
WA ~~ CT
# Here we describe new indirect effects
ind.WA := a1*b1 # indirect via WA
ind.CT := a2*b2 # indirect via CT
ind.total := ind.WA + ind.CT
'

Describe and estimate this new model and obtain the summary output. Note
that neither the model fit nor any of the existing paths will change! This is because
the model has not changed; all that we have done is assigned labels to previously
“unlabeled” regression paths. But, you should see the following additions to
the output. First, you should see the labels appearing next to the relevant
parameters:

Regressions:
Estimate Std.Err z-value P(>|z|)
WA ~
PS (a1) -0.339 0.080 -4.256 0.000
DH -0.165 0.080 -2.065 0.039
CT ~
PS (a2) -0.191 0.078 -2.430 0.015
DH -0.349 0.078 -4.452 0.000
SWB ~
WA (b1) 0.328 0.061 5.332 0.000
CT (b2) 0.550 0.061 8.950 0.000

Second, you should have a new section Defined Parameters with calculated
effects:

Defined Parameters:
Estimate Std.Err z-value P(>|z|)
ind.WA -0.111 0.033 -3.326 0.001
ind.CT -0.105 0.045 -2.345 0.019
ind.total -0.216 0.062 -3.463 0.001

Now you can interpret this output. (Our variables are standardized, making
it easy to interpret the indirect and total effects). Try to answer the following
questions.
QUESTION 7. What is the indirect effect of PS on SWB via WA? What
is the indirect effect of PS on SWB via CT? What is the total indirect effect
of PS on SWB (the sum of the indirect paths via WA and CT)? How do the
numbers correspond to the observed covariance of PS and SWB?

Step 6. Appraising the bottom-up model of Wellbeing critically

Finally, let us critically evaluate the model. Specifically, let us think of causality
in this study. The article’s abstract suggested that there was an equally plausible
top-down model of subjective well-being, in which Subjective well-being causes
World Assumptions and Constructive Thinking, with those in turn influencing
Physical Symptoms and Daily Hassles. Path Analysis in itself cannot prove
causality, so the analysis cannot tell us which model – the bottom-up or the top-
down – is better. You could say: “but this study employed a longitudinal design,
where subjective well-being was measured one month later than the predictor
variables”. Unfortunately, even longitudinal designs cannot always disentangle
the causes and effects. The fact that SWB was measured one month after the
other constructs does not mean it was caused by them, because SWB was very
stable over that time (the article reports test-retest correlation = 0.92), and
it is therefore a good proxy for SWB at baseline, or possibly even before the
baseline.
QUESTION 8 (Optional challenge). Can you try to suggest a research design
that will address these problems and help establish causality in this study?

Step 7. Saving your work

After you have finished work with this exercise, save your R script with a meaningful
name, for example “Bottom-up model of Wellbeing”. To keep all of the created
objects, which might be useful when revisiting this exercise, save your entire
‘work space’ when closing the project. Press File / Close project, and select
“Save” when prompted to save your ‘Workspace image’.

18.4 Solutions
Q1. There are 8 variables in this model – 5 observed variables, and 3 unob-
served variables (the 3 regression residuals). There are 5 exogenous (indepen-
dent) variables: PS and DH (observed), and residuals .WA, .SWB and .CT
(unobserved). There are 3 endogenous (dependent) variables: WA, CT and
SWB.
Q2. The output prints that the “Number of free parameters” is 13. You can
work out what these are just by looking at the diagram, or by counting the
number of ‘Parameter estimates’ printed in the output. The best way to practice
is to have a go just looking at the diagram, and then check with the output.
There are 6 regression paths (arrows on diagram) to estimate. There are also 3
paths from residuals to the observed variables, but they are fixed to 1 for setting
the scale of residuals and therefore are not estimated. There are 2 covariances
(bi-directional arrows) to estimate: one for the independent variables PS and
DH, and the other is for the residuals of WA and CT. There are 5 variances
(look at how many IVs you have) to estimate. There are 3 variances of residuals,
and also 2 variances of the independent (exogenous) variables PS and DH. In
total, this makes 6+2+5=13 parameters.
Now let’s calculate sample moments, which is the number of observed variances
and covariances. There are 5 observed variables, therefore 5(5+1)/2 = 15 mo-
ments in total (made up from 5 variances and 10 covariances). The degrees of
freedom are then 15 (moments) – 13 (parameters) = 2 (df).
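You can confirm these counts directly from the fitted object, for example:

# number of free parameters, degrees of freedom, chi-square and its p-value
fitMeasures(fit, c("npar", "df", "chisq", "pvalue"))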
Q3. The chi-square test (Chi-square = 10.682, Degrees of freedom = 2, P-value
= .005) suggests rejecting the model because the test is significant (p-value is
very low). In this case, we cannot “blame” the poor chi-square result on the
large sample size (our sample is not that large).
Q4. The residual output makes it very clear that all but 2 covariances were
freely estimated (and fully accounted for) by the model – this is why their
residuals are exactly 0. The only omitted connections were between DH and
SWB, and between PS and SWB (look at the diagram - there are no direct
paths between these pairs of variables). We modelled these relationships as fully
mediated by WA and CT, respectively. So, essentially, this model tests only
for the absence of 2 direct effects – the rest of the model is unrestricted. The
residuals for these omitted paths are small in magnitude (just under 0.1), but
one borderline residual PS/SWB (=0.096) is significantly different from zero
(stdz value = 2.051 is larger than the critical value 1.96).
Based on the residual output, we have detected an area of local misfit, pertaining
to the lack of direct path from PS to SWB. This misfit is “small” (not “trivial”)
in terms of effect size, and it is significant.
Q5. All paths are significant at the 0.05 level (p-values are less than 0.05), and
the standard errors of these estimates are small (of magnitude 1/SQRT(N), as
they should be for an identified model).

Regressions:
Estimate Std.Err z-value P(>|z|)
WA ~
PS -0.339 0.080 -4.256 0.000
DH -0.165 0.080 -2.065 0.039
CT ~
PS -0.191 0.078 -2.430 0.015
DH -0.349 0.078 -4.452 0.000
SWB ~
WA 0.328 0.061 5.332 0.000
CT 0.550 0.061 8.950 0.000

Q6. Considering that all our observed variables have variance 1, the variances
for residuals .WA and .CT are quite large (about 0.8). So the predictors only
explained about 20% of variance in these variables. The error variance for
.SWB is quite small (0.39) – which means that the majority of variance in
SWB is explained by the other variables in the model.

Variances:
Estimate Std.Err z-value P(>|z|)
.WA 0.811 0.094 8.631 0.000
.CT 0.787 0.091 8.631 0.000
.SWB 0.390 0.045 8.631 0.000
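Equivalently, you can request the R-squared values, which lavaan computes as 1 minus the standardized residual variance for each dependent variable:

# proportion of explained variance for each endogenous variable
inspect(fit, "rsquare")   # SWB should be about 1 - 0.39 = 0.61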

Q7. The total indirect effect of PS on SWB is –.216. Let’s see how it is
computed. Look at the regression weights in the answer to Q5. The route from
PS to SWB via WA: (–0.339) × 0.328 = –0.111. The route from PS to SWB via
CT: (–0.191) × 0.550 = –0.105. Adding these two routes, we get the indirect effect
= –0.216. This shows you how to compute the effects from given parameter
values by hand, but we obtained the same answers through parameter labels.
Now, if you look into the original correlation matrix, View(Wellbeing), you
will see that the covariance of PS and SWB is -0.21. This is close to the
total indirect effect (-0.216). The values are not exactly the same, which means
that the indirect effects do not explain the observed covariance fully. Adding a
direct path from PS to SWB (which is omitted in this model) would explain
the observed covariance 100%.
Q8. Perhaps one could alter the level of daily hassles or physical symptoms
through an intervention and see what difference (if any) this makes to the out-
comes compared to the non-intervention condition.
Exercise 19

Fitting an autoregressive model to longitudinal test measurements

Data file WISC.RData


R package lavaan

19.1 Objectives

In this exercise, you will fit a special type of path analysis model - an autoregres-
sive model. Such models involve repeated (observed) measurements over several
time points, and each subsequent measurement is regressed on the previous one,
hence the name. The only latent variables in these models are regression resid-
uals; there are no common factors or errors of measurement. This exercise
will demonstrate one of the weaknesses of Path Analysis, specifically, why it is
important to model the measurement error for unreliable measures.

19.2 Study of longitudinal stability of ability test score across 4 time points

Data for this exercise is taken from a study by Osborne and Suddick (1972).
In this study, N=204 children took the Wechsler Intelligence Test for Children
(WISC) at ages 6, 7, 9 and 11. There are two subtests - Verbal and Non-verbal
reasoning.


Scores on both tests were recorded for each child at each time point, resulting
in 4 variables for the Verbal test (V1, V2, V3, V4) and 4 variables for the
Non-verbal test (N1, N2, N3, N4).
In terms of data, we only have summary statistics - means, Standard Deviations
and correlations of all the observed variables - which were reported in the article.
As you can see in the table below, the mean of Verbal test increases as the
children get older, and so does the mean of Non-verbal test. The Standard
Deviations also increase, suggesting the increasing spread of scores over time.
However, the test scores of the same kind correlate strongly, suggesting that
there is certain stability to rank ordering of children; so that despite everyone
improving their scores, the children doing best at Time 1 still tend to do better
at Time 2, 3 and 4.

Table 19.1. Descriptive statistics for WISC subtests

19.3 Worked Example - Testing an autoregressive model for WISC Verbal subtest

To complete this exercise, you need to repeat the analysis of Verbal subtests from
a worked example below, and answer some questions. Once you are confident,
you can repeat the analyses of Non-verbal subtests independently.

Step 1. Reading and examining data

The summary statistics are stored in the file WISC.RData. Download this
file and save it in a new folder (e.g. “Autoregressive model of WISC data”). In
RStudio, create a new project in the folder you have just created. Create a new
R script, where you will be writing all your commands.
If you want to analyse the data straight away, you can simply load the prepared
data into the environment:

load("WISC.RData")

Step 2 (Optional). Creating data objects for given means, SDs and correlations
Alternatively, as an exercise, you may follow me in creating the data objects
from scratch using the descriptive statistics given above.

WISC.mean <- c(19.58,25.41,32.60,43.74,18.00,27.68,39.35,50.92)
WISC.SD <- c(5.83,6.13,7.34,10.70,8.37,10.02,10.31,12.52)

# assign variable names
varnames = c("V1","V2","V3","V4","N1","N2","N3","N4")

# create a lower diagonal correlation matrix, with variable names
library(sem)

WISC.corr <- readMoments(diag=FALSE, names=varnames,
                         text="
.717
.726 .756
.653 .727 .797
.609 .584 .622 .617
.517 .600 .591 .631 .779
.467 .530 .544 .593 .732 .793
.476 .511 .529 .609 .695 .785 .811
")

# now fill the upper triangle part with the transposed values from the lower diagonal part
WISC.corr[upper.tri(WISC.corr)] <- t(WISC.corr)[upper.tri(WISC.corr)]

# save your objects to use them next time
save(WISC.mean, WISC.SD, WISC.corr, file='WISC.RData')

You should now see objects - WISC.mean, WISC.SD and WISC.corr in
your Environment tab. Examine the objects by either clicking on them or
calling them, for example View(WISC.corr). You should see that WISC.corr
is indeed a full correlation matrix, with values 1 on the diagonal.
Before we proceed to fitting a path model, however, we need to produce a co-
variance matrix for WISC data, because model fitting functions in lavaan work
with covariance matrices. Of course you can supply a correlation matrix, but
then you will lose the original measurement scale of your variables and essen-
tially standardize them. I would like to keep the scale of WISC tests, and get
unstandardized parameters. Thankfully, it is easy to produce a covariance ma-
trix from the WISC correlation matrix and SDs. Since cov(X,Y) is the product
of corr(X,Y), SD(X) and SD(Y), we can write:

# produce WISC covariance matrix
WISC.cov <- crossprod(crossprod(WISC.corr, diag(WISC.SD)), diag(WISC.SD))
# adding variable names to rows and columns
rownames(WISC.cov) <- varnames
colnames(WISC.cov) <- varnames
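As an aside, lavaan ships a small helper, cor2cov(), that performs the same rescaling; the crossprod() route above is equivalent, but if you prefer the helper (WISC.cov2 is just an illustrative name):

# equivalent rescaling using lavaan's helper function
WISC.cov2 <- lavaan::cor2cov(WISC.corr, sds = WISC.SD)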

Step 3. Specifying and fitting the model

Before you start modelling, load the package lavaan:

library(lavaan)

We will fit the following auto-regressive path model, for now without means
(you can bring the means in as an optional challenge later).

Figure 19.1. Autoregressive model for WISC Verbal subtest

Coding this model using the lavaan syntax conventions is simple. You simply
need to describe all the regression relationships in Figure 1. Let’s work from
left to right of the diagram:
• V2 is regressed on V1 (residual for V2 is assumed)
• V3 is regressed on V2 (residual for V3 is assumed)
• V4 is regressed on V3 (residual for V4 is assumed)
That’s it. Try to write the respective statements using the shorthand symbol
~ for is regressed on. You should obtain something similar to this (I called my
model V.auto, but you can call it whatever you like):

V.auto <- '
# auto regressions
V2 ~ V1
V3 ~ V2
V4 ~ V3 '

To fit this path model, we need function sem(). We need to pass to it: the model
(model=V.auto, or simply V.auto if you put this argument first), and the data.
However, the argument data is reserved for raw data (subjects by variables),
but our data is the covariance matrix! So, we need to pass our matrix to the
argument sample.cov, which is reserved for sample covariance matrices. We
also need to pass the number of observations (sample.nobs = 204), because
the covariance matrix does not provide this information.

fit <- sem(V.auto, sample.cov= WISC.cov, sample.nobs=204)

Step 4. Assessing Goodness of Fit with global indices and residuals

To get access to model fitting results stored in object fit, request the extended
output including fit measures and standardized parameter estimates:

summary(fit, fit.measures=TRUE, standardized=TRUE)

## lavaan 0.6.15 ended normally after 1 iteration
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 6
##
## Number of observations 204
##
## Model Test User Model:
##
## Test statistic 58.473
## Degrees of freedom 3
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 584.326
## Degrees of freedom 6
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.904
## Tucker-Lewis Index (TLI) 0.808
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -1864.023
## Loglikelihood unrestricted model (H1) -1834.786
##
## Akaike (AIC) 3740.046
## Bayesian (BIC) 3759.955
## Sample-size adjusted Bayesian (SABIC) 3740.945
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.301
## 90 Percent confidence interval - lower 0.237
## 90 Percent confidence interval - upper 0.371
## P-value H_0: RMSEA <= 0.050 0.000
## P-value H_0: RMSEA >= 0.080 1.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.099
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## V2 ~
## V1 0.754 0.051 14.691 0.000 0.754 0.717
## V3 ~
## V2 0.905 0.055 16.496 0.000 0.905 0.756
## V4 ~
## V3 1.162 0.062 18.847 0.000 1.162 0.797
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .V2 18.170 1.799 10.100 0.000 18.170 0.486
## .V3 22.971 2.274 10.100 0.000 22.971 0.428
## .V4 41.560 4.115 10.100 0.000 41.560 0.365

Examine the output, starting with parameter estimates. Look at column
Std.all in the output, where parameters are based on all variables being
standardized.
QUESTION 1. What is the standardized effect of V1 on V2? Can you
relate this value to any of the observed sample moments? What about the
standardized effects of V2 on V3, and V3 on V4?
Now carry out assessment of the model’s goodness of fit, and try to answer the
following questions.
QUESTION 2. How many parameters are there to estimate? How many
known pieces of information (sample moments) are there in the data? What are
the degrees of freedom and how are they calculated? [HINT. Remember that
we are not interested in means here, only in covariance structure].
QUESTION 3. Interpret the chi-square. Do you retain or reject the model?
As I explained in Exercise 18, McDonald (1999) argued against the use of global
fit indices such as RMSEA and CFI for path models and advocated for looking
at the actual discrepancies [residuals]. To obtain residuals in the metric of corre-
lation (differences between observed correlations and correlations ‘reproduced’
by the model), request this additional output:

residuals(fit, type="cor")

## $type
## [1] "cor.bollen"
##
## $cov
## V2 V3 V4 V1
## V2 0.000
## V3 0.000 0.000
## V4 0.124 0.000 0.000
## V1 0.000 0.184 0.221 0.000

Thinking of residuals as differences between 2 sets of correlations, we can easily
interpret them. We usually interpret a correlation of 0.1 as small, and
correlations below 0.1 as trivial. Using this logic, we can see that all non-zero
residuals (residuals for all relationships constrained by the model) are above 0.1
and cannot be ignored. These are residuals pertaining to relationships between
non-adjacent time points - V1 and V3, V2 and V4, and V1 and V4.
Step 5. Interpretation and critical appraisal of model results. The importance of accounting for measurement error.

What do these results mean? Shall we conclude that the model requires direct
effects for non-adjacent measures, for example a direct path from V1 (verbal
score at age 6) to V3 (score at age 9)?
A predictor is said to have a “sleeper” effect on a distal outcome if it influenced it
not only indirectly through the more proximal outcome, but directly too. The
unexplained relationships between distal points showing up in large residuals
seem to suggest that we have such “sleeper” effects. For example, some variance
in verbal score at age 9 that could not be explained by verbal score at age 7
could, however, be explained by the verbal score at age 6. It appears that some
of the effect from age 6 was “sleeping” until age 9 when it resurfaced again.
Here we need to pause and look at the correlation structure implied by the
autoregressive model and compare it with the correlations observed in this study.
If indeed the autoregressive relationships held, then any regression path is the
product of the paths it consists of, for example:

b13 = b12 * b23.

The standardized direct paths b12 and b23, as we worked out in Question
1, are equal to the observed correlations between adjacent measures V1 and V2, and
V2 and V3, respectively. But according to our data, b13 (0.726) is much
larger than the product of b12 (0.717) and b23 (0.756), which should be 0.542.
Our observed correlations fail to reduce to the extent the autoregressive model
predicts.
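You can check this arithmetic directly:

# correlation between V1 and V3 implied by the autoregressive model
0.717 * 0.756   # = 0.542, much lower than the observed 0.726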
To summarise, the autoregressive model predicts that the standardized indirect
effect should get smaller the more distal the outcome is. But our data does not
support this prediction. This finding is typical for the data of this kind. And
the reason for it is not necessarily the existence of "sleeper" effects; it can
often be attributed to the failure to account for the error of measurement.
As you know, many psychological constructs such as “verbal reasoning” are not
measured perfectly. According to the Classical Test Theory, variance in the
observed measure (Y) is due to the ‘true’ score (T) and the measurement error
(E):

var(Y) = var(T) + var(E)

However, because the measurement errors are assumed uncorrelated, covariance
of two measures Y1 and Y2 is due to 'true' scores only:
cov(Y1, Y2) = cov(T1, T2)

Therefore, the error “contaminates” variances but not covariances. In the cor-
relation matrix then, the off-diagonal elements reflect the relationships between
the ‘true’ (standardized) scores, but the diagonal reflects the ‘true’ (standard-
ized) score variances plus the measurement errors! So, the correlation matrix
does NOT reflect correlations of ‘true’ scores. Instead, it reflects covariances
of ‘true’ scores with variances smaller than 1. The actual correlations of ‘true’
scores could be estimated if we knew their variances. In Classical Test Theory,
score reliability is defined as the proportion of variance in the observed score
due to ‘true’ score. So, knowing and accounting for the score reliability would
enable us to obtain correct estimates of the ‘true’ score correlations, and fit an
autoregressive model to those estimates.
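To make this concrete, the classic disattenuation formula divides an observed correlation by the square root of the product of the two reliabilities. A minimal sketch, where the reliability values are purely hypothetical (the study does not report them):

r.observed <- 0.717   # observed correlation between V1 and V2
rel.V1 <- 0.85        # hypothetical reliability of V1
rel.V2 <- 0.85        # hypothetical reliability of V2
# estimated correlation between the 'true' scores
r.observed / sqrt(rel.V1 * rel.V2)   # about 0.84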
To conclude, it is important to appreciate limitations of path analysis. Path
analysis deals with observed variables, and it assumes that these variables are
measured without error. This assumption is almost always violated in psycho-
logical measurement. Test scores are not 100% reliable, and WISC scores in
this data analysis example are no exception. Every time we deal with imper-
fect measures, we will encounter similar problems. The lower the reliability of
our observed variables, the greater the problems might be. These need to be
understood to draw correct conclusions from analysis.
To address this limitation, we need to account for the error of measurement.
Models with measurement parts (where latent traits are indicated by observed
variables) and structural parts (where we explore relationships between latent
constructs - ‘true’ scores - rather than their imperfect measures) will be a way
of dealing with unreliable data. I will provide such a full structural model for
WISC data in Exercise 20.

Step 6. Saving your work

After you have finished work with this exercise, save your R script with a meaningful
name, for example “Autoregressive model of WISC”. To keep all of the created
objects, which might be useful when revisiting this exercise, save your entire
‘work space’ when closing the project. Press File / Close project, and select
“Save” when prompted to save your ‘Workspace image’.

19.4 Further practice - Path analysis of WISC Non-verbal scores

To practice further, fit an autoregressive path model to the WISC Non-verbal
scores, and interpret the results.

19.5 Solutions
Q1. The standardized regression paths are as follows:

Std.all
V2 ~ V1 0.717
V3 ~ V2 0.756
V4 ~ V3 0.797

These are equal to the observed correlations between V1 and V2, V2 and V3,
and V3 and V4, respectively.
Q2. Let’s first calculate sample moments, which is the number of observed
variances and covariances. There are 4 observed variables, therefore 4x(4+1)/2
= 10 moments in total (made up from 4 variances and 6 covariances).
The output prints that the “Number of model parameters” is 6. You can work
out what these are just by looking at the diagram, or by counting the number of
‘Parameter estimates’ printed in the output. The best way to practice is to have
a go just looking at the diagram, and then check with the output. There are 3
regression paths (arrows on diagram) to estimate. There are also 3 paths from
residuals to the observed variables, but they are fixed to 1 for setting the scale of
residuals and therefore are not estimated. There are also variances of regression
residuals. So far, this makes 3+3=6 parameters. But, there is actually one more
parameter that lavaan does not print - the variance of independent variable V1.
It is of course just the variance that we we already know from sample statistics,
nevertheless, it is a model parameter (although trivial). [If you want the variance
of V1 printed in the output, add the reference to it explicitly: V1 ~~ V1 ]
Then, the degrees of freedom is 3 (as lavvan rightly tells you), made up from
10(moments) – 7(parameters) = 3(df).
Q3. The chi-square test (Chi-square = 58.473, Degrees of freedom = 3, P-value
< .001) suggests rejecting the model because the test is significant (p-value is
very low). In this case, we cannot “blame” the poor chi-square result on the
large sample size (N=204 is not that large).
Part IX

STRUCTURAL EQUATION MODELLING
Exercise 20

Fitting a latent autoregressive model to longitudinal test measurements

Data file WISC.RData


R package lavaan

20.1 Objectives

In this exercise, you will test for autoregressive relationships between latent
constructs. This will be done through a so-called full structural model, which
includes measurement part(s) (where latent traits are indicated by observed
variables) and a structural part (where we explore relationships between latent
constructs - ‘true’ scores - rather than their imperfect measures). Incorporating
the measurement part will be a way of dealing with unreliable data because
he error of measurement can be explicitly controlled. As in Exercise 19, we
will model repeated (observed) measurements over several time points, where
each subsequent measurement occasion is regressed on the previous one, hence
the name. However, we will model the autoregressive relationships not between
the observed measures but between the latent variables (‘true’ scores) in the
structural part of the model.


20.2 Study of longitudinal stability of the latent ability across 4 time points
Data for this exercise is the same as for Exercise 19. It is taken from a study by
Osborne and Suddick (1972). In this study, N=204 children took the Wechsler
Intelligence Test for Children (WISC) at ages 6, 7, 9 and 11. There are two
subtests - Verbal and Non-verbal reasoning.
Scores on both tests were recorded for each child at each time point, resulting
in 4 variables for the Verbal test (V1, V2, V3, V4) and 4 variables for the
Non-verbal test (N1, N2, N3, N4).
Please refer to Exercise 19 for the descriptive statistics - means, Standard De-
viations and correlations of all the observed variables.

20.3 Worked Example - Testing a latent autoregressive model for WISC Verbal subtest
To complete this exercise, you need to repeat the analysis of Verbal subtests from
a worked example below, and answer some questions. Once you are confident,
you can repeat the analyses of Non-verbal subtests independently.

Step 1. Reading data

You can continue working in the project created for Exercise 19. Alterna-
tively, download the summary statistics for WISC data stored in the file
WISC.RData, save it in a new folder (e.g. “Latent autoregressive model of
WISC data”), and create a new project in this new folder.
If it is not in your Environment already, load the data consisting of
means (WISC.mean, which we will not use in this exercise), correlations
(WISC.corr), and SDs (WISC.SD):

load("WISC.RData")

And produce the WISC covariance matrix from the correlations and SDs:

# produce WISC covariance matrix
WISC.cov <- crossprod(crossprod(WISC.corr, diag(WISC.SD)), diag(WISC.SD))

varnames=c("V1","V2","V3","V4","N1","N2","N3","N4")
rownames(WISC.cov) <- varnames
colnames(WISC.cov) <- varnames

Step 2. Specifying and fitting the model

Before you start, load the package lavaan:

library(lavaan)

We will fit the following latent auto-regressive path model.

Figure 20.1. Latent autoregressive model for WISC Verbal subtest

There are 4 measurement models - one for each of the Verbal scores across 4
time points. These models are based on the classic true-score model:

V = T + E

Each observed verbal score, for example V1, is a sum of two latent variables -
the ‘true’ score (T1) and the error of measurement (E1). The same applies to
V2, V3 and V4. The paths from each T and E latent variables to the observed
measures of V are fixed to 1, as the true-score model prescribes.
With these measurement models in place, the autoregressive relationships per-
tain to the ‘true’ scores only: T1, T2, T3 and T4.
Let’s code this model using the lavaan syntax conventions, starting with de-
scribing the four measurement models:
• T1 is measured by V1 (error of measurement E1 is assumed; lavaan will label
it .V1)
• T2 is measured by V2 (error of measurement E2 is assumed; lavaan will label
it .V2)
• T3 is measured by V3 (error of measurement E3 is assumed; lavaan will label
it .V3)
• T4 is measured by V4 (error of measurement E4 is assumed; lavaan will label
it .V4)
We then describe the autoregressive relationships between T1, T2, T3 and T4.
• T2 is regressed on T1 (residual for T2 is assumed)
• T3 is regressed on T2 (residual for T3 is assumed)
• T4 is regressed on T3 (residual for T4 is assumed)
That’s it. Try to write the respective statements using the shorthand symbols
=~ for is measured by and ~ for is regressed on. You should obtain something
similar to this (I called my model T.auto, but you can call it whatever you
like):

T.auto <- '
# measurement part (by default, the loadings will be fixed to 1)
T1 =~ V1
T2 =~ V2
T3 =~ V3
T4 =~ V4
# structural part - auto regressions
T2 ~ T1
T3 ~ T2
T4 ~ T3 '

To fit this path model, we need function sem(). We need to pass to it: the model
(model=T.auto, or simply T.auto if you put this argument first), and the data.
However, because our data is the WISC covariance matrix, we need to pass our
matrix to the argument sample.cov, rather than data which is reserved for raw
subject data. We also need to pass the number of observations (sample.nobs
= 204), because this information is not contained in the covariance matrix.

fit <- sem(T.auto, sample.cov= WISC.cov, sample.nobs=204)

Step 3. Assessing Goodness of Fit with global indices and residuals

To get access to the model fitting results stored in the object fit, request the standardized parameter estimates:

summary(fit, standardized=TRUE)

## lavaan 0.6.15 ended normally after 69 iterations


##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 7
##

## Number of observations 204


##
## Model Test User Model:
##
## Test statistic 58.473
## Degrees of freedom 3
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## T1 =~
## V1 1.000 5.816 1.000
## T2 =~
## V2 1.000 6.115 1.000
## T3 =~
## V3 1.000 7.322 1.000
## T4 =~
## V4 1.000 10.674 1.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## T2 ~
## T1 0.754 0.051 14.691 0.000 0.717 0.717
## T3 ~
## T2 0.905 0.055 16.496 0.000 0.756 0.756
## T4 ~
## T3 1.162 0.062 18.847 0.000 0.797 0.797
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .V1 0.000 0.000 0.000
## .V2 0.000 0.000 0.000
## .V3 0.000 0.000 0.000
## .V4 0.000 0.000 0.000
## T1 33.822 3.349 10.100 0.000 1.000 1.000
## .T2 18.170 1.799 10.100 0.000 0.486 0.486
## .T3 22.971 2.274 10.100 0.000 0.428 0.428
## .T4 41.560 4.115 10.100 0.000 0.365 0.365

Examine the output, starting with the Chi-square statistic. Refer to Exercise 19 for the model output, and you will see that the Chi-square is exactly the same: 58.473 on 3 degrees of freedom. This is not what we expected: we expected different fit because we estimate more parameters (the errors of measurement that we included in the model). Also, the standardized regression coefficients for the 'true' scores T2~T1, T3~T2 and T4~T3 are the same as the standardized regression coefficients for the observed scores V2~V1, V3~V2 and V4~V3 in the path model of Exercise 19. To find out where we went wrong, let's examine
the parameter estimates. Ah! The variances of errors of measurement that we
assumed for the observed variables (.V1, .V2, .V3 and .V4) have been fixed
to zero by lavaan:

Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.V1 0.000 0.000 0.000
.V2 0.000 0.000 0.000
.V3 0.000 0.000 0.000
.V4 0.000 0.000 0.000

Obviously, our model is then reduced to the path model in Exercise 19, because it assumes that the errors of measurement are zero (and the observed scores are perfectly reliable).
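If you want to confirm which parameters lavaan freed and which it fixed, you can inspect the parameter table of the fitted model (an optional check):

# 'free' is 0 for fixed parameters and non-zero for estimated ones
parTable(fit)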
To specify the model in the way we intended, we need to explicitly free up the
variances of measurement errors, using the ~~ syntax convention. We add the
following syntax to T.auto, making T.auto2:

T.auto2 <- paste(T.auto, '

# free up error variance


V1 ~~ V1
V2 ~~ V2
V3 ~~ V3
V4 ~~ V4
')

fit2 <- sem(T.auto2, sample.cov= WISC.cov, sample.nobs=204)

## Warning in lav_model_vcov(lavmodel = lavmodel, lavsamplestats = lavsamplestats, : la


## Could not compute standard errors! The information matrix could
## not be inverted. This may be a symptom that the model is not
## identified.

It appears we have a problem. The warning message is telling us that the standard errors could not be computed. To investigate the problem further, ask for the summary output:

summary(fit2)

## lavaan 0.6.15 ended normally after 35 iterations


##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 11
##
## Number of observations 204
##
## Model Test User Model:
##
## Test statistic NA
## Degrees of freedom -1
## P-value (Unknown) NA
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## T1 =~
## V1 1.000
## T2 =~
## V2 1.000
## T3 =~
## V3 1.000
## T4 =~
## V4 1.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## T2 ~
## T1 1.117 NA
## T3 ~
## T2 1.196 NA
## T4 ~
## T3 1.366 NA
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .V1 11.002 NA

## .V2 8.833 NA
## .V3 8.016 NA
## .V4 28.797 NA
## T1 22.820 NA
## .T2 0.069 NA
## .T3 4.758 NA
## .T4 0.041 NA

From the output it is clear that our model is not identified, because we have a negative number of degrees of freedom, -1. This means we have one more parameter to estimate than the number of sample moments available.
The previous model, T.auto, had 3 degrees of freedom. By asking to estimate 4 error variances, we used up all these degrees of freedom and one more, and we ran out.
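You can verify this arithmetic yourself (a quick illustrative calculation, not part of the lavaan output):

p <- 4                 # number of observed variables
moments <- p*(p+1)/2   # 10 sample moments: 4 variances + 6 covariances
moments - 11           # 10 moments - 11 parameters = -1 degrees of freedom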

Step 4. Constraining the errors of measurement equal across time

To remedy the model under-identification, we need to reduce the number of parameters. Clearly we cannot hope to estimate all 4 error variances. The most sensible way to proceed is to estimate one common variance for the 4 errors of measurement. This makes perfect conceptual sense because the same Verbal test was used on all 4 measurement occasions, and the error of measurement is a property of the test and therefore should not change over time (see the wonderful explanation of this in McDonald's "Test Theory: A Unified Treatment", 1999 edition, page 68). Therefore, we can constrain the 4 error variances to be equal, thus estimating one parameter instead of four.
We will constrain the error variances to be equal by using a parameter label, which we arbitrarily call e. This tells lavaan that every variance declared with the same label must take the same value. So, we replace the last part of the model with the new version:

T.auto3 <- paste(T.auto, '

# free up error variance but constrain equal


V1 ~~ e*V1
V2 ~~ e*V2
V3 ~~ e*V3
V4 ~~ e*V4
')

And now we can fit the model again and obtain the summary output:

fit3 <- sem(T.auto3, sample.cov= WISC.cov, sample.nobs=204)

summary(fit3, standardized=TRUE)

## lavaan 0.6.15 ended normally after 57 iterations


##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 11
## Number of equality constraints 3
##
## Number of observations 204
##
## Model Test User Model:
##
## Test statistic 2.115
## Degrees of freedom 2
## P-value (Chi-square) 0.347
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## T1 =~
## V1 1.000 5.033 0.865
## T2 =~
## V2 1.000 5.366 0.879
## T3 =~
## V3 1.000 6.725 0.918
## T4 =~
## V4 1.000 10.268 0.962
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## T2 ~
## T1 1.010 0.082 12.356 0.000 0.947 0.947
## T3 ~
## T2 1.186 0.074 15.983 0.000 0.946 0.946
## T4 ~
## T3 1.375 0.077 17.940 0.000 0.900 0.900
##

## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .V1 (e) 8.494 1.112 7.638 0.000 8.494 0.251
## .V2 (e) 8.494 1.112 7.638 0.000 8.494 0.228
## .V3 (e) 8.494 1.112 7.638 0.000 8.494 0.158
## .V4 (e) 8.494 1.112 7.638 0.000 8.494 0.075
## T1 25.329 3.529 7.178 0.000 1.000 1.000
## .T2 2.972 2.053 1.448 0.148 0.103 0.103
## .T3 4.738 2.081 2.276 0.023 0.105 0.105
## .T4 19.986 4.278 4.672 0.000 0.190 0.190

Examine the output and answer the following questions.


QUESTION 1. How many parameters are there to estimate? What are they?
How many known pieces of information (sample moments) are there in the data?
What are the degrees of freedom and how are they calculated?
QUESTION 2. Interpret the chi-square. Do you retain or reject the model?

Step 5. Interpreting and appraising the model results

Now examine the model parameters, and try to answer the following questions.
QUESTION 3. What is the standardized effect of T1 on T2? How does it
compare to the observed correlation between V1 and V2? If you have completed
Exercise 19, how would you interpret this result?
QUESTION 4. What is the (unstandardized) variance of the measurement
error of the Verbal test? Why are the standardized values (Std.all) for this
parameter different for the 4 measurements? What does it tell you about the
reliability of the Verbal scores?
To summarise, the latent autoregressive model predicts that the standardized indirect effect should get smaller the more distal the outcome is. And now our data support this prediction. The correlations between the 'true' scores at adjacent time points are much stronger (0.9 or above) than the path model in Exercise 19 would have you believe. The reason for the misfit of that path model was not the existence of "sleeper" effects, but the failure to account for the error of measurement.
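To illustrate, you can reproduce the standardized indirect effects of T1 on the more distal 'true' scores by multiplying the adjacent standardized paths from the output above (simple arithmetic, for illustration only):

# standardized indirect effect of T1 on T3
0.947 * 0.946           # approx. 0.90
# standardized indirect effect of T1 on T4 - smaller, as the model predicts
0.947 * 0.946 * 0.900   # approx. 0.81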

Step 6. Saving your work

After you have finished working with this exercise, save your R script with a meaningful name, for example "Latent autoregressive model of WISC". To keep all of the created objects, which might be useful when revisiting this exercise, save your entire 'work space' when closing the project. Press File / Close project, and select "Save" when prompted to save your 'Workspace image'.

20.4 Further practice - Testing a latent autoregressive model for WISC Non-verbal subtest
To practice further, fit a latent autoregressive path model to the WISC Non-
verbal scores, and interpret the results.

20.5 Solutions
Q1. There are 4 observed variables, therefore 4x(4+1)/2 = 10 sample moments in total (made up of 4 variances and 6 covariances).
The output prints that the "Number of model parameters" is 11, and the "number of equality constraints" is 3. The 11 parameters are made up of 3 regression paths (arrows on the diagram) for the 'true' scores (T variables), plus 3 variances of regression residuals for T2, T3 and T4, plus 1 variance of the independent latent variable T1, plus the 4 error variances of the V variables. Nominally this makes 11, but there are only 11-3=8 parameters to estimate, because we imposed equality constraints on the error variances, estimating only 1 parameter instead of 4 (3 fewer parameters than in the model T.auto2).
Then, the degrees of freedom are 2 (as lavaan rightly tells you), made up from 10(moments) – 8(parameters) = 2(df).
Q2. The chi-square test (Chi-square = 2.115, Degrees of freedom = 2) suggests
accepting the model because it is quite likely (P-value = .347) that the observed
data could emerge from the population in which the model is true.
Q3. The standardized effect of T1 on T2 is 0.947 (you can look at Std.lv,
which stands for “standardized latent variables”). This is much higher than the
observed correlations between V1 and V2, which was 0.717. From the expla-
nation of this latter value in Exercise 19 it follows that 0.717 was an estimate
of covariance between ‘true’ Verbal scores at ages 6 and 7 rather than their
correlation. Because the scores had error of measurement, the (standardized)
variances of their ‘true’ scores were lower than 1, and therefore the covariance
of 0.717 equated to a much higher correlation of ‘true’ scores, which is now
estimated as 0.947.
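You can check this correspondence yourself using the standardized loadings of V1 (0.865) and V2 (0.879) from the fit3 output: the model-implied observed correlation equals the 'true' score correlation attenuated by both loadings (illustrative arithmetic, not a formal lavaan computation):

# implied corr(V1,V2) = loading(V1) * corr(T1,T2) * loading(V2)
0.865 * 0.947 * 0.879   # approx. 0.72, close to the observed 0.717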
Q4. The unstandardized variance of the measurement error of the Verbal test
is 8.494. These values are identical across 4 time points, and are marked with
label (e) in the output. However, the standardized values (Std.all) for this
parameter are different for the 4 measurements:

Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all

.V1 (e) 8.494 1.112 7.638 0.000 8.494 0.251


.V2 (e) 8.494 1.112 7.638 0.000 8.494 0.228
.V3 (e) 8.494 1.112 7.638 0.000 8.494 0.158
.V4 (e) 8.494 1.112 7.638 0.000 8.494 0.075

This is because the test score variance is composed of 'true' and 'error' variance, and even though the error variance was constrained equal, the 'true' score variance changes over time, because the 'true' score is a property of the person, and people change. The Std.all values show you the proportion of error in the observed score at each time point, and because the values are getting smaller, we can conclude that the proportion of 'true' score variance increases. Presumably, the 'true' score variance at time 4 (age 11) was the largest.
From these results, it is easy to calculate the reliability of the Verbal test scores. By definition, reliability is the proportion of the variance in the observed score due to the 'true' score. As the Std.all values give the proportion of error, reliability is computed as 1 minus the given proportion of error. So, the reliability of V1 is 1-0.251=0.749, of V2 is 1-0.228=0.772, of V3 is 1-0.158=0.842, and of V4 is 1-0.075=0.925. As Roderick McDonald (1999) pointed out, reliability varies with the population sampled, and is not a property of the test alone. On the contrary, the error of measurement is, and this is why the ultimate goal of reliability analysis is to obtain an estimate of the variance of E in the test score metric. In this data analysis example, this estimate is e=8.494.
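In R, these reliabilities can be obtained in one line from the Std.all error proportions reported above:

# reliability = 1 - proportion of error variance in the observed score
1 - c(V1 = 0.251, V2 = 0.228, V3 = 0.158, V4 = 0.075)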
Exercise 21

Growth curve modelling of longitudinal measurements

Data file ALSPAC.csv
R package lavaan, lcsm, psych

21.1 Objectives

In this exercise, you will perform growth curve modelling of longitudinal mea-
surements of child development. These will be measurements of weight or height.
Redeeming features of such measures are that they have little or no error of mea-
surement (unlike most psychological data), and that they are continuous vari-
ables at the highest level of measurement - ratio scales. These features enable
analysis of growth patterns without being concerned about the measurement, allowing us to focus on the structural part of the model. You will practice plotting growth trajectories and modelling these trajectories with growth curve models with a random intercept and random slopes (both linear and quadratic).

21.2 Study of child growth in a longitudinal repeated-measures design spanning 8 years

First, I must make some acknowledgements. Data and many ideas for the
analyses in this exercise are due to Dr Jon Heron of Bristol Medical School.
This data example was first presented in a summer school that Dr Heron and
other dear colleagues of mine taught at the University of Cambridge in 2011.


Data for this exercise comes from the Avon Longitudinal Study of Parents
and Children (ALSPAC), also known as “Children of the Nineties”. This is
a population-based prospective birth-cohort study of ~14,000 children and their
parents, based in South-West England. For a child to be eligible for inclusion in
the study, their mother had to be resident in Avon and have an expected date
of delivery between April 1st 1991 and December 31st 1992.
We will analyse some basic growth measurements taken in clinics attended at
approximate ages 7, 9, 11, 13 and 15 years. There are N=802 children in the
data file ALSPAC.csv that we will consider today, 376 boys and 426 girls.
The analysis dataset is in wide-data format so that each repeated measure is
a different variable (e.g. ht7, ht9, ht11, …). The file contains the following
variables:
• ht7/9/11/13/15 - Child height (cm) at the 7, 9, 11, 13 and 15 year clinics
• wt7/9/11/13/15 - Child weight (kg) at each clinic
• bmi7/9/11/13/15 - Child body mass index at each clinic
• age7/9/11/13/15 - Child actual age at measurement (months) at each clinic
• female - Sex dummy (coded 1 for girl, and 0 for boy)
• bwt - Birth weight (kg)

21.3 Worked Example - Fitting growth curve models to longitudinal measurements of child's weight

To complete this exercise, you need to repeat the analysis of weight measure-
ments from a worked example below, and answer some questions. Once you are
confident, you may want to explore the height measurements independently.

Step 1. Reading and examining data

Download file ALSPAC.csv, save it in a new folder (e.g. “Growth curve model
of ALSPAC data”), and create a new project in this new folder.
Import the data (which is in the comma-separated-values format) into a new
data frame. I will call it ALSPAC.

ALSPAC <- read.csv(file="ALSPAC.csv")

First, produce and examine the descriptive statistics to learn about the coverage
and spread of measurements. As you can see, there are no missing data.

library(psych)
# check the data coverage and spread
describe(ALSPAC)

## vars n mean sd median trimmed mad min max range


## id 1 802 1913.68 1120.48 1929.00 1907.45 1447.02 3.00 3883.00 3880.00
## ht7 2 802 125.62 5.26 125.50 125.55 5.26 111.40 140.80 29.40
## wt7 3 802 25.58 4.09 25.00 25.23 3.56 16.60 43.40 26.80
## bmi7 4 802 16.14 1.88 15.79 15.94 1.50 10.85 27.78 16.94
## age7 5 802 89.56 2.05 89.00 89.27 1.48 86.00 109.00 23.00
## ht9 6 802 139.26 6.14 139.00 139.18 6.08 120.90 163.00 42.10
## wt9 7 802 34.13 6.70 32.90 33.54 6.08 21.40 61.80 40.40
## bmi9 8 802 17.50 2.62 16.89 17.22 2.29 13.02 29.84 16.82
## age9 9 802 117.71 3.22 117.00 117.28 2.97 113.00 136.00 23.00
## ht11 10 802 150.55 6.99 150.40 150.41 7.41 131.10 174.20 43.10
## wt11 11 802 43.11 9.23 41.20 42.34 8.60 24.00 76.20 52.20
## bmi11 12 802 18.89 3.15 18.14 18.58 2.88 12.44 31.69 19.25
## age11 13 802 140.54 2.24 140.00 140.39 1.48 133.00 154.00 21.00
## ht13 14 802 157.05 7.28 156.80 156.99 7.26 136.00 180.30 44.30
## wt13 15 802 48.77 9.98 47.50 48.08 9.56 25.70 87.30 61.60
## bmi13 16 802 19.68 3.26 19.04 19.38 2.91 13.07 32.02 18.95
## age13 17 802 153.29 2.35 153.00 153.23 1.48 138.00 169.00 31.00
## ht15 18 802 169.09 8.15 168.50 168.88 8.15 147.40 192.20 44.80
## wt15 19 802 60.87 11.17 58.80 59.83 8.90 35.20 119.10 83.90
## bmi15 20 802 21.25 3.36 20.62 20.86 2.73 14.09 38.23 24.14
## age15 21 802 185.05 3.33 184.00 184.55 1.48 177.00 203.00 26.00
## female 22 802 0.53 0.50 1.00 0.54 0.00 0.00 1.00 1.00
## bwt 23 802 3.42 0.54 3.45 3.44 0.46 0.82 5.14 4.31
## skew kurtosis se
## id 0.01 -1.22 39.57
## ht7 0.11 -0.23 0.19
## wt7 0.99 1.60 0.14
## bmi7 1.34 3.18 0.07
## age7 4.47 29.62 0.07
## ht9 0.14 -0.12 0.22
## wt9 0.89 0.81 0.24
## bmi9 1.05 1.17 0.09
## age9 1.63 4.07 0.11
## ht11 0.16 -0.23 0.25
## wt11 0.80 0.42 0.33
## bmi11 0.96 0.85 0.11
## age11 0.88 2.49 0.08
## ht13 0.06 -0.15 0.26
## wt13 0.67 0.31 0.35
## bmi13 0.90 0.69 0.12

## age13 0.39 5.43 0.08


## ht15 0.26 -0.24 0.29
## wt15 1.13 2.28 0.39
## bmi15 1.35 2.81 0.12
## age15 2.27 7.49 0.12
## female -0.12 -1.99 0.02
## bwt -0.56 2.10 0.02

You may also want to examine the data by sex, because there might be differences in growth patterns between boys and girls. Do this independently using the function describeBy(ALSPAC, group="female") from package psych.
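For reference, the call is (output omitted here to save space):

# descriptive statistics split by sex (female: 0 = boy, 1 = girl)
describeBy(ALSPAC, group = "female")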

Step 2. Plotting individual development trajectories

We begin with visualizing the data. Perhaps the most useful way of looking at individuals' development over time is to connect individual values for each time point, thus building "trajectories" of change. Package lcsm offers a nice function plot_trajectories() that does exactly that:

## load the lcsm package


library(lcsm)

## plotting individual WEIGHT trajectories


plot_trajectories(data = ALSPAC,
id_var = "id",
var_list = c("wt7", "wt9", "wt11", "wt13", "wt15"),
xlab = "Age in years",
ylab = "Weight in kilograms",
line_colour = "black")

[Plot of individual weight trajectories for all children; x-axis: Age in years (wt7 to wt15); y-axis: Weight in kilograms]

Each line on the graph represents a child, and you can see how the weight of each child changes over time. There is a strong tendency for weight to increase, as you may expect, but also quite a lot of variation in the starting weight and in the rate of change. You can perhaps see an overall tendency for almost linear growth between ages 7 and 11, then a slowing as children reach 13, and then a fast acceleration between 13 and 15 (puberty).

You may, however, suspect that there might be systematic differences between
trajectories of boys and girls. To address this, we can plot the sexes separately.

## trajectories for boys only


plot_trajectories(data = ALSPAC[ALSPAC$female==0, ],
id_var = "id",
var_list = c("wt7", "wt9", "wt11", "wt13", "wt15"),
xlab = "Age in years",
ylab = "Weight in kilograms",
line_colour = "blue")

[Plot of individual weight trajectories for boys; x-axis: Age in years (wt7 to wt15); y-axis: Weight in kilograms]

QUESTION 1. Using the syntax above as a guide, plot the individual trajectories for girls in a different colour (say, "red"). Do the trajectories for girls look similar to the trajectories for boys?

Step 3. Specifying and fitting a linear growth model to 4 repeated measurements
Now we are ready to start growth curve (GC) modelling. I will demonstrate
how to fit GC models to the repeated measures of weight for the boys only (and
you can try to fit these for girls independently).
First we note that each trajectory (individual child) can be characterized by
its own baseline and rate of increase. For instance, some boys start at below
average weight and stay low. Some start average but gain weight quickly. Some
start high and gain weight at an average rate. And all combinations of the
baseline and growth rate are of course possible.
When the developmental process is of interest, longitudinal observations can be
described in terms of trajectories of change. For every child, instead of all the
repeated observations, we may consider:

- The intercept (expected weight at baseline)


- The slope (linear rate of increase in weight per year)

One aim of such models is to summarize growth through new variables, which
can be used in analyses as predictors, or outcomes, or mediators, etc. Another aim is to test if a linear model is sufficient to describe the growth process. In linear models, any deviations from linear trajectories are considered random errors. In the ALSPAC example, the linear trend is visible between ages 7 and 11, but the rate of change seems to slow down as children reach age 13 (we will test this formally when looking at the model fit).
So, we will attempt to describe the observed change occurring between ages 7 and 13 (for simplicity, I will not include age 15, as different growth factors come into play then). We will describe the data through two new random variables
- intercept and slope, representing each child’s trajectory through his unique
baseline and rate of change. We can assume that these variables are normally
distributed, so there are average, low and high baselines, and average, low and
high rates of change. Most children are expected to be in the average range, and
a few will have extremely high or low values. So the idea is to describe systematic
variability in the data through a model with only 2 variables (which would
necessarily be quite a substantial simplification from the original 4 variables).
If linear rate of change holds, then each observation could be described as fol-
lows:

wt7 = intercept + e7
wt9 = intercept + 2*slope + e9
wt11 = intercept + 4*slope + e11
wt13 = intercept + 6*slope + e13

The slope (latent variable) is the rate of change (kg per year), and coefficients
in front of this variable represent the number of years passing from the first
measurement at age 7. Obviously, at age 7 the coefficient for slope is 0. The
intercept (latent variable) represents the predicted weight at 7 for each child
(this may or may not correspond to the child's actual weight at 7). Of course, weights predicted by this rather crude model will not describe the data exactly, and that is why we need errors of prediction (variables e7 to e13).
The hypothesized linear growth curve model can be depicted as follows.
To code this model, we describe the growth variables - intercept i and slope s - using the lavaan syntax conventions. We use the shorthand symbol =~ for is measured by and use * for fixing the "loading" coefficients as the equations above specify.
Linear <- ' # linear growth model with 4 measurements
# intercept and slope with fixed coefficients
i =~ 1*wt7 + 1*wt9 + 1*wt11 + 1*wt13
s =~ 0*wt7 + 2*wt9 + 4*wt11 + 6*wt13
'

To fit this GC model, lavaan has function growth(). We need to pass to it: the
model (model=Linear, or simply Linear if you put this argument first), and
the data.

Figure 21.1. Linear growth curve model for 4 repeated bodyweight measurements

# do not forget to load the lavaan package!


library(lavaan)

# fit the growth model for boys only


fitL <- growth(Linear, data=ALSPAC[ALSPAC$female==0, ])

# include additional fit indices


summary(fitL, fit.measures=TRUE)

## lavaan 0.6.15 ended normally after 96 iterations


##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 9
##
## Number of observations 376
##
## Model Test User Model:
##
## Test statistic 272.201
## Degrees of freedom 5
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 2260.736
## Degrees of freedom 6
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.881
## Tucker-Lewis Index (TLI) 0.858
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -4111.927
## Loglikelihood unrestricted model (H1) -3975.826
##
## Akaike (AIC) 8241.854
## Bayesian (BIC) 8277.220
## Sample-size adjusted Bayesian (SABIC) 8248.665
##
## Root Mean Square Error of Approximation:
##

## RMSEA 0.377
## 90 Percent confidence interval - lower 0.340
## 90 Percent confidence interval - upper 0.416
## P-value H_0: RMSEA <= 0.050 0.000
## P-value H_0: RMSEA >= 0.080 1.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.096
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## i =~
## wt7 1.000
## wt9 1.000
## wt11 1.000
## wt13 1.000
## s =~
## wt7 0.000
## wt9 2.000
## wt11 4.000
## wt13 6.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## i ~~
## s 3.714 0.354 10.502 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .wt7 0.000
## .wt9 0.000
## .wt11 0.000
## .wt13 0.000
## i 25.882 0.222 116.466 0.000
## s 3.982 0.067 59.129 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .wt7 0.861 0.403 2.138 0.033

## .wt9 3.836 0.383 10.014 0.000


## .wt11 5.488 0.664 8.262 0.000
## .wt13 12.342 1.492 8.270 0.000
## i 17.767 1.398 12.706 0.000
## s 1.508 0.127 11.883 0.000

Examine the output and answer the following questions.


QUESTION 2. How many parameters are there to estimate? What are they?
How many known pieces of information (sample moments) are there in the data?
What are the degrees of freedom and how are they calculated?
QUESTION 3. Interpret the chi-square statistic. Do you retain or reject the
model?
QUESTION 4. Interpret the RMSEA, CFI and SRMR. What do you conclude
about the model fit?
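As an aside, if you prefer a compact view instead of the full summary, lavaan's fitMeasures() function can extract just the indices you need (an optional convenience, not required for answering the questions):

# extract selected global fit indices for the linear growth model
fitMeasures(fitL, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))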
Next, let’s investigate the reason for model misfit. Ask for model residuals
(which are differences between the observed and the expected (model-implied)
summary statistics, such as means, variances and covariances). To evaluate
the effect size of the residuals, ask for type="cor", which will be based on
standardized variables (and the covariances will turn into correlations).

residuals(fitL, type="cor")

## $type
## [1] "cor.bollen"
##
## $cov
## wt7 wt9 wt11 wt13
## wt7 0.000
## wt9 0.011 0.000
## wt11 -0.007 0.016 0.000
## wt13 0.008 -0.001 0.036 0.000
##
## $mean
## wt7 wt9 wt11 wt13
## -0.020 0.043 0.089 -0.151

Under $cov, we have the differences between the actual and expected covariances of the repeated measures. This information can help find weaknesses in the model by flagging residuals that are large compared to the rest. No large residuals (greater than 0.1) can be seen in the output.
Under $mean, we have the differences between the actual and model-implied means. The residuals are largest for the last measurement point (wt13), suggesting that a linear model might not be adequately capturing the longitudinal trend in weight. The negative sign of the residual means that the model predicts larger weight at age 13 than actually observed. We conclude that the linear model fails to adequately describe the slowing down of growth towards age 13.

Step 4. Specifying and fitting a quadratic growth model to 4 repeated measurements

To address the inadequate fit of the linear model, we will add a quadratic slope, which will capture the non-linear trend (slowing down of growth) that we observed on the trajectory plots and by examining residuals.
Now we will describe each observation as a weighted sum of the intercept, the slope and the quadratic slope:

wt7 = intercept + e7
wt9 = intercept + 2*slope + 4*qslope + e9
wt11 = intercept + 4*slope + 16*qslope + e11
wt13 = intercept + 6*slope + 36*qslope + e13

The qslope (latent variable) is the quadratic rate of change (kg per year squared), and the coefficients in front of this variable represent the squared number of years passing from the first measurement at age 7. Obviously, at age 7 the coefficient for the quadratic slope is 0.

Quadratic <- ' # quadratic growth model for 4 measurements


# intercept and slope with fixed coefficients
i =~ 1*wt7 + 1*wt9 + 1*wt11 + 1*wt13
s =~ 0*wt7 + 2*wt9 + 4*wt11 + 6*wt13
q =~ 0*wt7 + 4*wt9 + 16*wt11 + 36*wt13
'
# fit the growth model, boys only
fitQ <- growth(Quadratic, data=ALSPAC[ALSPAC$female==0, ])

## Warning in lav_object_post_check(object): lavaan WARNING: some estimated ov


## variances are negative

## Warning in lav_object_post_check(object): lavaan WARNING: covariance matrix of laten


## is not positive definite;
## use lavInspect(fit, "cov.lv") to investigate.

It seems we have a problem. The program is telling us that the "covariance matrix of latent variables is not positive definite". Let's examine the summary output:

summary(fitQ)

## lavaan 0.6.15 ended normally after 118 iterations


##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 13
##
## Number of observations 376
##
## Model Test User Model:
##
## Test statistic 52.148
## Degrees of freedom 1
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## i =~
## wt7 1.000
## wt9 1.000
## wt11 1.000
## wt13 1.000
## s =~
## wt7 0.000
## wt9 2.000
## wt11 4.000
## wt13 6.000
## q =~
## wt7 0.000
## wt9 4.000
## wt11 16.000
## wt13 36.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## i ~~
## s 7.704 1.281 6.015 0.000
## q -0.657 0.144 -4.548 0.000

## s ~~
## q -0.380 0.053 -7.159 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .wt7 0.000
## .wt9 0.000
## .wt11 0.000
## .wt13 0.000
## i 25.593 0.218 117.646 0.000
## s 4.566 0.111 41.264 0.000
## q -0.153 0.014 -10.670 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .wt7 4.747 1.867 2.543 0.011
## .wt9 -0.167 0.983 -0.170 0.865
## .wt11 9.494 1.432 6.629 0.000
## .wt13 -17.604 4.541 -3.877 0.000
## i 13.365 2.077 6.435 0.000
## s 2.965 0.518 5.726 0.000
## q 0.093 0.012 7.641 0.000

It shows that two residual variances, .wt9 and .wt13, are negative. This is
obviously a problem, and could be a sign of model “overfitting”, when there are
too many parameters estimated in the model. Indeed, this model estimates 13
parameters on 14 sample moments and has only 1 degree of freedom.
Examining the output again, we can see that the variance of q (quadratic slope) is only 0.093, which is very small compared to the variances of the other growth factors i and s. A good way to reduce the number of parameters, then, is to fix the variance of q to 0, making it a fixed rather than a random effect. The quadratic slope will then have the same effect on everyone's trajectory (it will have a mean but no variance). And if q has no variance, it obviously should have zero covariance with the other growth factors i and s, so we fix those covariances to 0 as well. Here is the modified model:

QuadraticFixed <- paste(Quadratic, '


# the quadratic effect is fixed rather than random
q ~~ 0*q
q ~~ 0*i
q ~~ 0*s
')

fitQF <- growth(QuadraticFixed, data=ALSPAC[ALSPAC$female==0, ])



summary(fitQF, fit.measures=TRUE)

## lavaan 0.6.15 ended normally after 116 iterations


##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 10
##
## Number of observations 376
##
## Model Test User Model:
##
## Test statistic 154.874
## Degrees of freedom 4
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 2260.736
## Degrees of freedom 6
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.933
## Tucker-Lewis Index (TLI) 0.900
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -4053.263
## Loglikelihood unrestricted model (H1) -3975.826
##
## Akaike (AIC) 8126.526
## Bayesian (BIC) 8165.822
## Sample-size adjusted Bayesian (SABIC) 8134.095
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.317
## 90 Percent confidence interval - lower 0.275
## 90 Percent confidence interval - upper 0.360
## P-value H_0: RMSEA <= 0.050 0.000
## P-value H_0: RMSEA >= 0.080 1.000
##
## Standardized Root Mean Square Residual:

##
## SRMR 0.072
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## i =~
## wt7 1.000
## wt9 1.000
## wt11 1.000
## wt13 1.000
## s =~
## wt7 0.000
## wt9 2.000
## wt11 4.000
## wt13 6.000
## q =~
## wt7 0.000
## wt9 4.000
## wt11 16.000
## wt13 36.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## i ~~
## q 0.000
## s ~~
## q 0.000
## i ~~
## s 3.384 0.336 10.062 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .wt7 0.000
## .wt9 0.000
## .wt11 0.000
## .wt13 0.000
## i 25.786 0.221 116.892 0.000
## s 4.701 0.092 51.035 0.000
## q -0.155 0.013 -12.161 0.000
##

## Variances:
## Estimate Std.Err z-value P(>|z|)
## q 0.000
## .wt7 0.317 0.395 0.802 0.422
## .wt9 4.626 0.418 11.068 0.000
## .wt11 5.460 0.612 8.919 0.000
## .wt13 5.874 1.109 5.296 0.000
## i 17.981 1.385 12.979 0.000
## s 1.465 0.118 12.457 0.000

Now the model runs and produces admissible parameter values.


QUESTION 5. How many degrees of freedom does this model have? How
does it compare to the random effect model Quadratic?
QUESTION 6. Interpret the chi-square statistic. How does it compare to the
Linear model?

Step 5. Interpreting results of the growth curve model

Now that we have a reasonably fitting model, at least according to the SRMR (0.072) and CFI (0.933), we can interpret the growth factors.
The section Intercepts presents the means of i, s and q, which we might refer to as the fixed-effect part of the model. We interpret these as follows:
The average 7-year-old boy in the population starts off with a body weight of 25.786kg, and that weight then increases by about 4.7kg per year, with an adjustment of -0.155 times the squared number of years since baseline. Therefore, the adjustment gets bigger as time goes by, slowing the linear rate of growth.
The sections Variances and Covariances describe the random-effects part of
the model, or the variances and covariance of our growth factors i and s. There
is a strong positive covariance 3.384 indicating that larger children at age 7 tend
to gain weight at a faster rate than smaller children. There is also considerable
variation in both intercept (var(i)=17.981) and linear slope (var(s)=1.465) as
evidenced by the plot of the raw data from earlier.
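To gauge the strength of this covariance, you can convert it to a correlation using the estimated variances (illustrative arithmetic based on the output above):

# correlation between intercept and linear slope
3.384 / sqrt(17.981 * 1.465)   # approx. 0.66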
Finally, we can see the individual values on the growth factors (numbers describ-
ing intercept and slope for each boy’s individual trajectory). Package lavaan
has function lavPredict(), which will compute these values for each boy in the
sample. You need to apply this function to the result of fitting our best model
with fixed quadratic effect, fitQF.

# check the growth factors values for first few boys


head(lavPredict(fitQF))

## i s q
## [1,] 22.58784 3.984492 -0.1548514
## [2,] 21.08919 3.667423 -0.1548514
## [3,] 29.87843 5.794144 -0.1548514
## [4,] 27.29049 4.850545 -0.1548514
## [5,] 28.26805 5.018105 -0.1548514
## [6,] 23.54283 3.867593 -0.1548514

You can see that there are 3 values for each boy - i, s and q. While i and s vary
between individuals, q is constant because we set it up as a fixed effect. You
can plot histograms of the intercept and linear slope values as follows:

## plot the growth factors distributions


hist(lavPredict(fitQF)[,"i"], main="Random Intercept values",
xlab="Expected weight in kg at baseline")

[Histogram: Random Intercept values; x-axis: Expected weight in kg at baseline; y-axis: Frequency]

hist(lavPredict(fitQF)[,"s"], main="Random Linear slope values",
     xlab="Expected rate of growth in kg per year")

[Histogram: Random Linear slope values; x-axis: Expected rate of growth in kg per year; y-axis: Frequency]

Both distributions have positive skews, with outliers located on the high end of
the weight scale at baseline, and the high end of the weight gain per year.

Step 6. Investigating covariates of the growth factors

Finally, to give you an idea about how the random growth factors can be used in
research, I will provide a simple illustration of a growth model with covariates.
As we have the sex variable in the data set, we can easily investigate whether
the child’s sex has any effect on body weight trajectories at this age (between
7 and 13). To this end, we simply regress our random growth factors, i and s, on the sex dummy variable, female. We add the following lines to the model QuadraticFixed:

QuadraticFixedSex <- paste(QuadraticFixed, '


# sex influences intercept and slope?
i ~ female
s ~ female
')

# Now we fit the model to ALL cases, boys and girls!


fitQFS <- growth(QuadraticFixedSex, data=ALSPAC)

Run the model and interpret the results.



QUESTION 7. Does the child's sex have a significant effect on the growth intercept? Who starts off with higher weight - boys or girls? Does sex have an effect on the linear slope? Who gains weight at a higher rate - boys or girls?

Step 7. Saving your work

After you have finished working with this exercise, save your R script with a meaningful name, for example "Growth curve model of body weights". To keep all of the created objects, which might be useful when revisiting this exercise, save your entire 'work space' when closing the project. Press File / Close project, and select "Save" when prompted to save your 'Workspace image'.

21.4 Further practice - Testing growth curve models for girls

To practice further, fit the growth curve models to the girls' body weights, and interpret the results.

21.5 Solutions

Q1.

# girls only
plot_trajectories(data = ALSPAC[ALSPAC$female==1, ],
id_var = "id",
var_list = c("wt7", "wt9", "wt11", "wt13", "wt15"),
xlab = "Age in years",
ylab = "Weight in kilograms",
line_colour = "red")

[Plot of individual weight trajectories for girls; x-axis: Age in years (wt7 to wt15); y-axis: Weight in kilograms]

The trajectories for girls look quite similar to the trajectories for boys: the spread is quite similar and the shapes are overall similar. Note that the scales of the y axis (weight) are different, so the girls' spread seems greater when in fact it is not.
Q2. The output prints that the "Number of model parameters" is 9. These are made up of 2 variances of the latent variables i and s, 1 covariance of i and s, 2 intercepts of i and s, and 4 error (residual) variances of the observed variables (.wt7, .wt9, .wt11, and .wt13). There are 4 observed variables, therefore 4x(4+1)/2 = 10 sample moments in the variance/covariance matrix, plus 4 means, in total 14 sample moments. Then, the degrees of freedom are 5 (as lavaan tells you), calculated as 14(moments) – 9(parameters) = 5(df).
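The same calculation in R, for illustration:

moments <- 4*(4+1)/2 + 4   # 10 variances/covariances + 4 means = 14
moments - 9                # 14 moments - 9 parameters = 5 degrees of freedom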
Q3. The chi-square test (Chi-square = 272.201, Degrees of freedom = 5) sug-
gests rejecting the model because it is extremely unlikely (P-value < 0.001) that
the observed data could emerge from the population in which the model is true.
Q4. The CFI = 0.881 falls short of acceptable levels of 0.9 or above; the RMSEA
= 0.377 falls well short of acceptable levels of 0.08 and below, and SRMR =
0.096 is also short of acceptable levels of 0.08 or below. All in all, the model
does not fit the data.
Q5. QuadraticFixed model has 4 degrees of freedom. This is 3 more than in
the Quadratic model. This is because we fixed 3 parameters - variance of q,
its covariance with i and its covariance with s, releasing 3 degrees of freedom.
Q6. For the QuadraticFixed model, Chi-square = 154.874 on 4 DF. For the Linear model, Chi-square = 272.201 on 5 DF. The improvement in Chi-square is substantial for just 1 DF difference. To test the significance of the difference between these two nested models, you can run a formal chi-squared difference test:

anova(fitL, fitQF)

##
## Chi-Squared Difference Test
##
## Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
## fitQF 4 8126.5 8165.8 154.87
## fitL 5 8241.9 8277.2 272.20 117.33 0.55622 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Q7.

summary(fitQFS, standardized=TRUE)

Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
i ~
female -0.413 0.292 -1.416 0.157 -0.101 -0.051
s ~
female 0.255 0.088 2.911 0.004 0.213 0.106

Child’s sex does not have a significant effect on the growth intercept (p = 0.157).
However, child’s sex has a significant effect on the linear slope (p = 0.004).
Because the effect of the dummy variable female is positive, we conclude that
girls gain weight at a higher rate than boys. Looking at the standardized value
(0.106), this is a small effect.
Exercise 22

Testing for longitudinal measurement invariance in repeated test measurements

Data file SDQ_repeated.csv
R package lavaan

22.1 Objectives

The objective of this exercise is to fit a full structural model to repeated observations. This model will allow testing for the significance of a treatment effect (the mean difference in latent constructs between pre-treatment and post-treatment data). You
will learn how to implement measurement invariance constraints, which are es-
sential to make sure that the scale of measurement is maintained over time.

22.2 Study of clinical change in externalising problems after treatment

Data for this exercise is an anonymous sample from the Child and Adolescent
Mental Health Services (CAMHS) database. The sample includes children and
adolescents who were referred for psychological/psychiatric help with regard to
various problems. In order to evaluate the outcomes of these interventions, the
patients’ parents completed the Strengths and Difficulties Questionnaire (SDQ)
twice – at referral and then at follow-up, typically 6 months from the referral (from 4 to 8 months on average), in most cases post treatment, or well into the treatment.
The Strengths and Difficulties Questionnaire (SDQ) is a screening question-
naire about 3-16 year olds. It exists in several versions to meet the needs of
researchers, clinicians and educationalists (http://www.sdqinfo.org/). In Exer-
cises 1, 5 and 7, we worked with the self-rated version of the SDQ. Today we
will work with the parent-rated version, which allows recording outcomes for
children of any age. Just like the self-rated version, the parent-rated version
includes 25 items measuring 5 facets.
The participants in this study are parents of N=579 children and adolescents
(340 boys and 239 girls) aged from 2 to 16 (mean=10.4 years, SD=3.2). This
is a clinical sample, so all patients were referred to the services with various
presenting problems.

22.3 Worked Example - Quantifying change on a latent construct after an intervention
To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.
You will work with the facet scores as indicators of broader difficulties. The
three facets of interest here are Conduct problems, Hyperactivity and Pro-social
behaviour. These three facets together are thought to measure a broad construct
referred to as Externalizing problems, with pro-social behaviour indicating the
low end of this construct (it is a negative indicator).

Step 1. Reading and examining data

Download the data file SDQ_repeated.csv, and save it in a new folder. In RStudio, create a new project in the folder you have just created. Start a new script.
This time, the data file is not in the internal R format, but rather a comma-
separated text file. The first row contains variables names, and each subsequent
row is data for one individual on these variables. To read this file into a data
frame, use the dedicated function for .csv files read.csv(). Let’s name this
data frame SDQ.

SDQ <- read.csv(file="SDQ_repeated.csv")

A new object SDQ should appear in your Environment panel. Examine this
object using functions head() and names().
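For example, you could run:

# preview the first few rows, and list all variable names
head(SDQ)
names(SDQ)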

Variables p1_conduct, p1_hyper and p1_prosoc are parent-rated scores on the 3 subscales at Time 1, pre-treatment; and variables p2_conduct, p2_hyper and p2_prosoc are parent-rated scores on the 3 subscales at Time 2, post-treatment.

Let’s use function describe() from package psych to get full descriptive statis-
tics for all variables:

library(psych)
describe(SDQ)

## vars n mean sd median trimmed mad min max range skew kurtosis
## age_at_r 1 579 10.42 3.22 10 10.50 4.45 2 16 14 -0.18 -1.00
## gender 2 579 1.41 0.49 1 1.39 0.00 1 2 1 0.35 -1.88
## p1_hyper 3 579 6.26 3.06 6 6.48 4.45 0 10 10 -0.38 -1.01
## p1_emo 4 579 5.15 2.85 5 5.15 2.97 0 10 10 0.00 -1.01
## p1_conduct 5 579 4.50 2.81 4 4.43 2.97 0 10 10 0.18 -0.96
## p1_peer 6 579 3.41 2.47 3 3.27 2.97 0 10 10 0.43 -0.61
## p1_prosoc 7 579 6.48 2.46 7 6.61 2.97 0 10 10 -0.40 -0.61
## p2_hyper 8 579 5.45 3.14 6 5.52 4.45 0 10 10 -0.09 -1.23
## p2_emo 9 579 3.98 2.83 4 3.83 2.97 0 10 10 0.38 -0.81
## p2_conduct 10 579 3.92 2.88 3 3.74 2.97 0 10 10 0.46 -0.77
## p2_peer 11 579 3.13 2.47 3 2.92 2.97 0 10 10 0.62 -0.34
## p2_prosoc 12 579 6.77 2.43 7 6.94 2.97 0 10 10 -0.45 -0.52
## se
## age_at_r 0.13
## gender 0.02
## p1_hyper 0.13
## p1_emo 0.12
## p1_conduct 0.12
## p1_peer 0.10
## p1_prosoc 0.10
## p2_hyper 0.13
## p2_emo 0.12
## p2_conduct 0.12
## p2_peer 0.10
## p2_prosoc 0.10

QUESTION 1. Note the mean differences at Time 1 (pre-treatment) and Time 2 (post-treatment) for the focal constructs Conduct Problems, Hyperactivity and Pro-social. Do the means increase or decrease? How do you interpret the changes?

Step 2. Fitting a basic structural model for repeated measures

Given that the three SDQ subscales – Conduct problems, Hyperactivity and
Pro-social behaviour – are thought to measure Externalizing problems, we will
fit the same measurement model at each time point.

Figure 22.1. Basic model for change in Externalizing problems

This model specifies that p1_conduct, p1_hyper and p1_prosoc (scores on the 3 subscales at Time 1) are indicators of the factor External1; and that p2_conduct, p2_hyper and p2_prosoc (scores on the 3 subscales at Time 2) are indicators of the factor External2. These parts represent the 2 measurement models. External1 is interpreted as the extent of externalizing problems at referral (Time 1), and External2 is the extent of externalizing problems at follow-up (Time 2). The structural model specifies that External2 is linearly dependent on (regressed on) External1.
First, load the lavaan package.

library(lavaan)

Now let’s code the model in Figure 22.1 by translating the following sentences
into syntax:
• External1 is measured by p1_conduct and p1_hyper and p1_prosoc
• External2 is measured by p2_conduct and p2_hyper and p2_prosoc
• External2 is regressed on External1
Using lavaan conventions, we specify this model (let's call it Model0) as follows:

Model0 <- '


# Time 1 measurement model
External1 =~ p1_conduct + p1_hyper + p1_prosoc
# Time 2 measurement model

External2 =~ p2_conduct + p2_hyper + p2_prosoc


# Structural model
External2 ~ External1 '

By default, External1 and External2 will be scaled by adopting the scale of their first indicators (p1_conduct and p2_conduct respectively); lavaan will set the loadings of p1_conduct and p2_conduct to 1. On the diagram, I made these fixed loading paths dashed rather than solid lines.
To fit this basic model, we need lavaan function sem(). We need to pass to
this function the model name (model = Model0), and the data (data = SDQ).
Because in this exercise we are interested in change between Time 1 and Time
2, we need to bring means and intercepts into the analysis. This can be done
by either declaring means/intercepts of the variables in the model using the lavaan convention ~ 1 (this is convenient when you need labels for specific intercepts, and we will do this later), or by simply setting meanstructure=TRUE in the sem() function:

fit0 <- sem(model = Model0, data = SDQ, meanstructure=TRUE)

# ask for summary output including fit indices


summary(fit0, fit.measures=TRUE)

## lavaan 0.6.15 ended normally after 35 iterations


##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 19
##
## Number of observations 579
##
## Model Test User Model:
##
## Test statistic 466.618
## Degrees of freedom 8
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 2229.698
## Degrees of freedom 15
## P-value 0.000
##
## User Model versus Baseline Model:
##

## Comparative Fit Index (CFI) 0.793


## Tucker-Lewis Index (TLI) 0.612
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -7603.738
## Loglikelihood unrestricted model (H1) -7370.429
##
## Akaike (AIC) 15245.476
## Bayesian (BIC) 15328.341
## Sample-size adjusted Bayesian (SABIC) 15268.023
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.315
## 90 Percent confidence interval - lower 0.291
## 90 Percent confidence interval - upper 0.339
## P-value H_0: RMSEA <= 0.050 0.000
## P-value H_0: RMSEA >= 0.080 1.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.075
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## External1 =~
## p1_conduct 1.000
## p1_hyper 0.917 0.049 18.868 0.000
## p1_prosoc -0.711 0.040 -17.985 0.000
## External2 =~
## p2_conduct 1.000
## p2_hyper 0.971 0.047 20.840 0.000
## p2_prosoc -0.709 0.037 -19.184 0.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## External2 ~
## External1 1.003 0.046 21.889 0.000
##

## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .p1_conduct 4.504 0.117 38.576 0.000
## .p1_hyper 6.264 0.127 49.263 0.000
## .p1_prosoc 6.484 0.102 63.394 0.000
## .p2_conduct 3.924 0.120 32.771 0.000
## .p2_hyper 5.454 0.130 41.845 0.000
## .p2_prosoc 6.769 0.101 67.004 0.000
## External1 0.000
## .External2 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .p1_conduct 2.216 0.204 10.885 0.000
## .p1_hyper 4.586 0.311 14.762 0.000
## .p1_prosoc 3.183 0.211 15.077 0.000
## .p2_conduct 2.251 0.203 11.078 0.000
## .p2_hyper 4.127 0.292 14.117 0.000
## .p2_prosoc 2.867 0.193 14.820 0.000
## External1 5.678 0.472 12.029 0.000
## .External2 0.337 0.189 1.780 0.075

Examine the output. Note that the factor loadings for p1_conduct and
p2_conduct are fixed to 1, and the other loadings are freely estimated. As
expected, the loadings for p1_prosoc and p2_prosoc are negative, because
being pro-social indicates the lack of externalising problems. Also note that
the variance for External1 and residual variance for External2 (.External2)
are freely estimated, as are the unique (error) variances of all the indicators
(.p1_conduct, .p1_hyper, etc.)
There is also output called ‘Intercepts’. For every DV, its intercept is printed
(beginning with ‘.’, for example .p1_conduct), and for every IV, its mean is
printed.
Note that the mean of External1 and the intercept of External2 are fixed to
0. This is the default way of giving the origin of measurement to the common
factors. Lavaan did this automatically, just as it automatically gave the unit
of measurement to the factors by adopting the unit of their first indicators.
With the scale of common factors set, the intercepts of all indicators (observed
variables) are freely estimated – and thus they have Standard Errors (Std.Err).
These intercepts correspond to the expected scale scores on Conduct, Hyperac-
tivity and Pro-social for those with the average (=0) scores on External at the
respective time point.
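As an aside, lavaan offers an alternative way of setting the factor scale: fixing the factor variances to 1 and freely estimating all loadings via the std.lv argument (shown here for illustration only; we keep the default scaling in this exercise):

# alternative identification (not used here): factor (residual) variances
# fixed to 1, all loadings freely estimated
fit0.std <- sem(model = Model0, data = SDQ, meanstructure = TRUE, std.lv = TRUE)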
Now examine the chi-square statistic, and other measures of fit.
QUESTION 2. Report and interpret the chi-square test, SRMR, CFI and
RMSEA. Would you accept or reject Model0?

To help you understand the reasons for misfit, request the modification indices
(sort them in descending order):

modindices(fit0, sort.=TRUE)

## lhs op rhs mi epc sepc.lv sepc.all sepc.nox


## 37 p1_hyper ~~ p2_hyper 248.442 3.524 3.524 0.810 0.810
## 41 p1_prosoc ~~ p2_prosoc 156.286 1.839 1.839 0.609 0.609
## 38 p1_hyper ~~ p2_prosoc 45.346 1.208 1.208 0.333 0.333
## 40 p1_prosoc ~~ p2_hyper 41.278 1.175 1.175 0.324 0.324
## 36 p1_hyper ~~ p2_conduct 40.656 -1.231 -1.231 -0.383 -0.383
## 33 p1_conduct ~~ p2_hyper 31.941 -1.073 -1.073 -0.355 -0.355
## 32 p1_conduct ~~ p2_conduct 31.520 0.985 0.985 0.441 0.441
## 35 p1_hyper ~~ p1_prosoc 12.271 0.662 0.662 0.173 0.173
## 27 External2 =~ p1_conduct 12.270 -3.019 -7.425 -2.643 -2.643
## 39 p1_prosoc ~~ p2_conduct 10.351 0.503 0.503 0.188 0.188
## 34 p1_conduct ~~ p2_prosoc 10.057 0.471 0.471 0.187 0.187
## 44 p2_hyper ~~ p2_prosoc 9.982 0.565 0.565 0.164 0.164
## 24 External1 =~ p2_conduct 9.980 -2.442 -5.819 -2.020 -2.020
## 43 p2_conduct ~~ p2_prosoc 4.580 -0.359 -0.359 -0.141 -0.141
## 25 External1 =~ p2_hyper 4.580 1.464 3.489 1.112 1.112
## 28 External2 =~ p1_hyper 4.105 1.376 3.385 1.106 1.106
## 31 p1_conduct ~~ p1_prosoc 4.105 -0.359 -0.359 -0.135 -0.135
## 29 External2 =~ p1_prosoc 0.959 -0.513 -1.263 -0.513 -0.513
## 30 p1_conduct ~~ p1_hyper 0.959 0.222 0.222 0.070 0.070
## 42 p2_conduct ~~ p2_hyper 0.538 0.168 0.168 0.055 0.055
## 26 External1 =~ p2_prosoc 0.538 -0.364 -0.867 -0.357 -0.357

QUESTION 3. What do the modification indices tell you? Which two changes
in the model would produce the greatest decrease in the chi-square? How do
you interpret these model changes?
Let us now modify the model, allowing the unique factors (errors) of p1_hyper and p2_hyper, and the errors of p1_prosoc and p2_prosoc, to correlate. Modify Model0 as follows, creating Model1:

Model1 <- '


# Time 1 measurement model
External1 =~ p1_conduct + p1_hyper + p1_prosoc
# Time 2 measurement model
External2 =~ p2_conduct + p2_hyper + p2_prosoc
# Structural model
External2 ~ External1

# MODIFIED PART: correlated errors for repeated measures
p1_hyper ~~ p2_hyper
p1_prosoc ~~ p2_prosoc '

Now fit the modified model Model1 and ask for the summary output:

fit1 <- sem(model = Model1, data = SDQ, meanstructure=TRUE)

# ask for summary output


summary(fit1)

## lavaan 0.6.15 ended normally after 44 iterations


##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 21
##
## Number of observations 579
##
## Model Test User Model:
##
## Test statistic 1.490
## Degrees of freedom 6
## P-value (Chi-square) 0.960
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## External1 =~
## p1_conduct 1.000
## p1_hyper 0.793 0.044 17.853 0.000
## p1_prosoc -0.627 0.036 -17.575 0.000
## External2 =~
## p2_conduct 1.000
## p2_hyper 0.865 0.043 20.209 0.000
## p2_prosoc -0.640 0.033 -19.227 0.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## External2 ~
## External1 0.885 0.040 22.276 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## .p1_hyper ~~
## .p2_hyper 3.459 0.278 12.444 0.000
## .p1_prosoc ~~
## .p2_prosoc 1.827 0.170 10.771 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .p1_conduct 4.504 0.117 38.576 0.000
## .p1_hyper 6.264 0.128 49.094 0.000
## .p1_prosoc 6.484 0.102 63.381 0.000
## .p2_conduct 3.924 0.120 32.771 0.000
## .p2_hyper 5.454 0.131 41.689 0.000
## .p2_prosoc 6.769 0.101 67.154 0.000
## External1 0.000
## .External2 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .p1_conduct 1.200 0.211 5.680 0.000
## .p1_hyper 5.216 0.343 15.229 0.000
## .p1_prosoc 3.429 0.223 15.350 0.000
## .p2_conduct 1.396 0.200 6.962 0.000
## .p2_hyper 4.742 0.323 14.678 0.000
## .p2_prosoc 3.057 0.202 15.129 0.000
## External1 6.694 0.500 13.389 0.000
## .External2 1.666 0.196 8.494 0.000

QUESTION 4. Report and interpret the chi-square for the model with cor-
related errors for repeated measures (Model1). Would you accept or reject
Model1? What are the degrees of freedom for Model1 and how do they com-
pare to Model0?
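Because Model0 is nested within Model1 (Model0 is Model1 with the two error
covariances fixed to zero), you can also compare the two models formally with a
chi-square difference test. This is an optional check, using the same anova()
approach that we will use again later in this exercise:

# chi-square difference test for the nested models (optional check)
anova(fit0, fit1)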

Step 3. Fitting a full measurement invariance model for repeated measures
Our structural model with correlated errors fits well, but this model cannot be
used to measure change in externalising problems because it does not ensure
measurement invariance across time. The scale of measurement of the
Externalising factor might have changed, and if so, we cannot compare Externalising
constructs at Time 1 and Time 2 (it is like comparing temperatures measured
in Celsius and Fahrenheit). To ensure equivalent measurement across time,
we need constraints on the measurement parameters: factor loadings (unit of
measurement), intercepts (origin of measurement) and unique/error variances
(error of measurement) of respective indicators at Time 1 and Time 2.
These constraints are easy to implement by giving the corresponding parameters
the same label (look at small symbols in red in Figure 22.2). To constrain
the loadings of p1_hyper and p2_hyper to be the same across time, we
give them the same label (for instance, lh, standing for ‘loading hyper’ – but
you can use any label you want); to constrain the intercepts of p1_hyper
and p2_hyper to be the same, we give them the same label (ih, standing
for ‘intercept hyper’); and to constrain the error variances of p1_hyper and
p2_hyper to be the same, we give them the same label (eh, standing for ‘error
hyper’). We do the same for the other indicators, giving the same labels to the
loadings, intercepts and error variances at Time 1 and Time 2. NOTE that
the loadings for p1_conduct and p2_conduct do not need to be constrained
equal by using labels because they are fixed to 1 by default, therefore they are
already equal.

Figure 22.2: Full measurement invariance model for change in Externalizing problems

You should already know how to specify labels for path coefficients. Factor load-
ings are path coefficients. Simply add multipliers in front of indicators in mea-
surement =~ statements, like so: lh*p1_hyper. To specify labels for variances,
add multipliers in variance ~~ statements, like so: p1_hyper ~~ eh*p1_hyper.
To specify labels for intercepts, use multipliers in statements ~ 1. For example,
to label the intercept of p1_hyper as ih, write p1_hyper ~ ih*1. OK, here
is the full measurement invariance model (Model2) corresponding to Figure
22.2. [Please type it in yourself, using bits of your previous models, but do not
just copy and paste my text! You need to make your own mistakes and correct
them.]

Model2 <- '


# Time 1 measurement model with labels
External1 =~ p1_conduct + lh*p1_hyper + lp*p1_prosoc
# error variances
p1_conduct ~~ ec*p1_conduct
p1_hyper ~~ eh*p1_hyper
p1_prosoc ~~ ep*p1_prosoc
# intercepts
p1_conduct ~ ic*1
p1_hyper ~ ih*1
p1_prosoc ~ ip*1

# Time 2 measurement model with labels


External2 =~ p2_conduct + lh*p2_hyper + lp*p2_prosoc
# error variances
p2_conduct ~~ ec*p2_conduct
p2_hyper ~~ eh*p2_hyper
p2_prosoc ~~ ep*p2_prosoc
# intercepts
p2_conduct ~ ic*1
p2_hyper ~ ih*1
p2_prosoc ~ ip*1

# Structural model
External2 ~ External1

# correlated errors for repeated measures


p1_hyper ~~ p2_hyper
p1_prosoc ~~ p2_prosoc '

Model2 tests the following combined hypothesis:

• H1. Two measurements of Externalising are fully invariant across time.
• H2. And, because we have not changed the origin of measurement for External2
(its intercept was set to 0 by default, remember?), Model2 also assumes
that there is no change in Externalising from Time 1 to Time 2 (because the
mean of External1 was also set to 0).
Now fit Model2 (assign the results to new object fit2), and examine the output.

fit2 <- sem(model = Model2, data = SDQ, meanstructure=TRUE)

# ask for summary output


summary(fit2)

## lavaan 0.6.15 ended normally after 42 iterations


##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 21
## Number of equality constraints 8
##
## Number of observations 579
##
## Model Test User Model:
##
## Test statistic 103.584
## Degrees of freedom 14
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## External1 =~
## p1_condct 1.000
## p1_hyper (lh) 0.840 0.038 22.356 0.000
## p1_prosoc (lp) -0.625 0.029 -21.901 0.000
## External2 =~
## p2_condct 1.000
## p2_hyper (lh) 0.840 0.038 22.356 0.000
## p2_prosoc (lp) -0.625 0.029 -21.901 0.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## External2 ~
## External1 0.888 0.032 27.610 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## .p1_hyper ~~
## .p2_hyper 3.398 0.279 12.173 0.000
## .p1_prosoc ~~
## .p2_prosoc 1.844 0.170 10.863 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .p1_condct (ic) 4.283 0.110 39.027 0.000
## .p1_hyper (ih) 5.917 0.123 48.276 0.000
## .p1_prosoc (ip) 6.583 0.093 70.683 0.000
## .p2_condct (ic) 4.283 0.110 39.027 0.000
## .p2_hyper (ih) 5.917 0.123 48.276 0.000
## .p2_prosoc (ip) 6.583 0.093 70.683 0.000
## External1 0.000
## .External2 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .p1_condct (ec) 1.258 0.125 10.025 0.000
## .p1_hyper (eh) 5.048 0.284 17.743 0.000
## .p1_prosoc (ep) 3.253 0.172 18.896 0.000
## .p2_condct (ec) 1.258 0.125 10.025 0.000
## .p2_hyper (eh) 5.048 0.284 17.743 0.000
## .p2_prosoc (ep) 3.253 0.172 18.896 0.000
## External1 6.567 0.454 14.471 0.000
## .External2 2.110 0.224 9.436 0.000

Note that all parameter labels that you introduced are printed in the output
next to the respective parameter. Also note that every pair of parameters that
you constrained equal are indeed equal!
QUESTION 5. Report and interpret the chi-square test for the full measure-
ment invariance model (Model2). Would you accept or reject this model?
Now we need to understand what the chi-square result means with respect to
the combined hypotheses that Model2 tested. Rejection of the model could
mean that the measurement invariance is violated (H1 is wrong) or that there
is a significant change in the Externalising score from Time 1 to Time 2 (H2
is wrong). To help us understand which hypothesis is wrong, let us obtain the
modification indices.

modindices(fit2, sort.=TRUE)

## lhs op rhs mi epc sepc.lv sepc.all sepc.nox


## 24 External1 ~1 73.030 5.966 2.328 2.328 2.328
## 25 External2 ~1 73.030 -0.667 -0.247 -0.247 -0.247
## 38 External2 =~ p1_hyper 4.519 -0.070 -0.190 -0.061 -0.061
## 44 p1_conduct ~~ p2_prosoc 1.898 -0.173 -0.173 -0.085 -0.085
## 52 p2_hyper ~~ p2_prosoc 1.630 -0.164 -0.164 -0.041 -0.041
## 37 External2 =~ p1_conduct 1.568 0.045 0.121 0.043 0.043
## 46 p1_hyper ~~ p2_conduct 1.552 -0.205 -0.205 -0.081 -0.081
## 48 p1_prosoc ~~ p2_conduct 1.383 -0.151 -0.151 -0.075 -0.075
## 41 p1_conduct ~~ p1_prosoc 1.323 0.147 0.147 0.073 0.073
## 35 External1 =~ p2_hyper 1.199 0.040 0.102 0.032 0.032
## 50 p2_conduct ~~ p2_hyper 1.154 0.175 0.175 0.069 0.069
## 10 External2 =~ p2_conduct 0.965 -0.039 -0.104 -0.036 -0.036
## 1 External1 =~ p1_conduct 0.965 0.039 0.099 0.035 0.035
## 51 p2_conduct ~~ p2_prosoc 0.910 0.125 0.125 0.062 0.062
## 47 p1_hyper ~~ p2_prosoc 0.859 0.116 0.116 0.029 0.029
## 34 External1 =~ p2_conduct 0.748 -0.034 -0.088 -0.030 -0.030
## 39 External2 =~ p1_prosoc 0.723 -0.024 -0.065 -0.027 -0.027
## 49 p1_prosoc ~~ p2_hyper 0.143 0.047 0.047 0.012 0.012
## 45 p1_hyper ~~ p1_prosoc 0.051 0.029 0.029 0.007 0.007
## 36 External1 =~ p2_prosoc 0.028 0.005 0.013 0.005 0.005
## 43 p1_conduct ~~ p2_hyper 0.008 -0.014 -0.014 -0.006 -0.006
## 42 p1_conduct ~~ p2_conduct 0.007 -0.021 -0.021 -0.017 -0.017
## 40 p1_conduct ~~ p1_hyper 0.004 0.011 0.011 0.004 0.004

The largest modification index should appear first in the output. Compare its
size to the chi-square of the model, because the MI shows by how much the
chi-square would reduce if the respective parameter were freely estimated.
QUESTION 6. What is the largest modification index? How does it compare
to the chi-square and other modification indices? Try to interpret what this
index suggests. Do you think it points to H1 being wrong, or H2 being wrong?
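The modindices() function can also filter its output. For instance, to display
only the mean/intercept statements (operator ~1), which are the ones relevant
to H2, you could run this optional check:

# show only modification indices for means/intercepts (op = "~1")
modindices(fit2, sort. = TRUE, op = "~1")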

Step 4. Measuring change in Externalising Problems factor

Now, hopefully you agree that the reason for misfit of Model2 was fixing the
mean of External1 and the intercept of External2 to the same value – zero.
This basically allows no change in the Externalising score due to the interven-
tion, setting the regression intercept to 0, like in this equation:

External2 = 0 + B1*External1

This is indeed unreasonable. More reasonable would be to assume that after
treatment, externalising problems reduced. To allow this change between Exter-
nalising score at Time 1 and Time 2 in the model, we need to free the intercept
of External2. This can be done by adding the following new line to Model2,
making Model3.

Model3 <- paste(Model2,
' # it is important to have an empty line here
External2 ~ NA*1')

NA*1 means that we are “freeing” the intercept of External2 (we have no particular
value or label to assign to it). Now, Model3 would test the hypothesis H1 (full
measurement invariance across time).
Create and fit Model3 (assign its results to fit3), examine the output, and
answer the following questions.

fit3 <- sem(model = Model3, data = SDQ, meanstructure=TRUE)

summary(fit3)

## lavaan 0.6.15 ended normally after 44 iterations


##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 22
## Number of equality constraints 8
##
## Number of observations 579
##
## Model Test User Model:
##
## Test statistic 24.219
## Degrees of freedom 13
## P-value (Chi-square) 0.029
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## External1 =~
## p1_condct 1.000
## p1_hyper (lh) 0.866 0.037 23.389 0.000
## p1_prosoc (lp) -0.625 0.028 -22.133 0.000
## External2 =~
## p2_condct 1.000
## p2_hyper (lh) 0.866 0.037 23.389 0.000
## p2_prosoc (lp) -0.625 0.028 -22.133 0.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## External2 ~
## External1 0.918 0.030 30.138 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## .p1_hyper ~~
## .p2_hyper 3.417 0.280 12.223 0.000
## .p1_prosoc ~~
## .p2_prosoc 1.823 0.170 10.728 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .p1_condct (ic) 4.550 0.113 40.169 0.000
## .p1_hyper (ih) 6.150 0.127 48.553 0.000
## .p1_prosoc (ip) 6.416 0.095 67.667 0.000
## .p2_condct (ic) 4.550 0.113 40.169 0.000
## .p2_hyper (ih) 6.150 0.127 48.553 0.000
## .p2_prosoc (ip) 6.416 0.095 67.667 0.000
## .External2 -0.672 0.073 -9.181 0.000
## External1 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .p1_condct (ec) 1.333 0.120 11.096 0.000
## .p1_hyper (eh) 4.962 0.284 17.479 0.000
## .p1_prosoc (ep) 3.260 0.172 18.978 0.000
## .p2_condct (ec) 1.333 0.120 11.096 0.000
## .p2_hyper (eh) 4.962 0.284 17.479 0.000
## .p2_prosoc (ep) 3.260 0.172 18.978 0.000
## External1 6.411 0.444 14.431 0.000
## .External2 1.656 0.194 8.547 0.000

QUESTION 7. What is the (unstandardized) intercept of External2 (treatment
effect)? Recall that we expected a negative effect, since the interventions
were aiming to reduce problems. Is the effect as expected?
QUESTION 8. What is the (unstandardized) regression coefficient of Exter-
nal2 on External1? How will you interpret this value in relation to the unit
of measurement of External1?
Finally, carry out the chi-square difference test to compare nested models
Model2 and Model3. (HINT. Use function anova(fit2,fit3)).
QUESTION 9. How would you interpret the results of this test? Is Model3
significantly better than Model2? Which model would you retain?

Step 5. Saving your work


After you finished work with this exercise, save your R script by pressing the
Save icon in the script window, and giving the script a meaningful name, for
example “SDQ change after intervention”. To keep all of the created objects,
which might be useful when revisiting this exercise, save your entire ‘work space’
when closing the project. Press File / Close project, and select Save when
prompted to save your ‘Workspace image’ (with extension .RData).
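Alternatively, you can save the workspace from within the script using base R;
the file name below is just an example:

# save all objects in the current workspace (file name is illustrative)
save.image("SDQ change after intervention.RData")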

22.4 Solutions

Q1. The means decrease for Conduct and Hyperactivity, and increase for Pro-
social. This is to be expected since the SDQ Hyperactivity and Conduct scales
measure problems (the higher the score, the larger the extent of problems),
and SDQ Pro-social measures positive pro-social behaviour. Reduction in the
problem scores is to be expected after treatment.
Q2. Chi-square = 466.618 (df = 8); P-value < .001. We have to reject the
model because the chi-square is highly significant. CFI=0.793, which is much
lower than 0.95 (the threshold for good fit), and lower than 0.90 (the threshold
for adequate fit). RMSEA =0.315, which is much greater than 0.05 (for good
fit) and 0.08 (adequate fit). Finally, SRMR = 0.075, which is just below the
threshold for adequate fit (0.08). All indices except SRMR indicate very poor
model fit.
Q3. The two largest modification indices (MI) by far can be found in
the covariance ~~ statements: p1_hyper~~p2_hyper (mi=248.442) and
p1_prosoc~~p2_prosoc (mi=156.286).
The first MI tells you that if you repeat the analysis allowing p1_hyper and
p2_hyper to correlate (actually, because these are DVs, the correlation will be
between their errors/unique factors), the chi-square will fall by about 248. But
is it reasonable to allow the errors/unique factors for the same measures at
Time 1 and Time 2 to correlate? Consider how the variance on the Hyperactivity facet is
explained by both the common Externalising factor and the remaining unique
content of the facet (the unique factor). Because the same Hyperactivity scale
was administered on two different occasions, its unique content not explained by
the common Externalising factor would still be shared between the occasions.
Therefore, the unique factors at Time 1 and Time 2 cannot be considered inde-
pendent. The correlated errors will correct for this lack of local independence.
Similarly, we should allow correlated errors across time for the Pro-social con-
struct (p1_prosoc and p2_prosoc). A correlated error for p1_conduct and
p2_conduct is not needed since the modification indices did not suggest it.
Q4. Model1: Chi-square = 1.490 (df=6), P-value = 0.960. The chi-square test
is not significant and we accept the model with correlated errors for repeated
measures. The degrees of freedom for Model1 are 6, and the degrees of freedom
for Model0 were 8. The difference, 2 df, corresponds to the 2 additional param-
eters we introduced – the two error covariances. By adding 2 parameters, we
reduced df by 2.
Q5. Model2: Chi-square = 103.584 (df = 14); P-value < 0.001. The test is
highly significant and we have to reject the model.
Q6. The largest modification index by far is 73.030. It is of the same magnitude
as the chi-square for this model (103.584), and much larger than other MIs,
which are all in single digits. This index pertains to both External1~1 and
External2~1. These are mean/intercept statements. Remember that the mean
of External1 and the intercept of External2 were fixed to 0 by default? The
modification index says that if you either freed the mean of External1 or freed
the intercept of External2, the chi-square would improve dramatically. This
points to the hypothesis of no change in the Externalising score between Time
1 and Time 2 (H2) being wrong.
Q7. The intercept of External2 is -.672, significant (p<.001) and negative as
we expected, indicating reduction in Externalizing problems on average. [Note
that this value is on the same scale as the facet Conduct problems, since the
Externalising factors borrowed the unit of measurement from this facet.] This
result shows that the interventions reduced the extent of externalising problems
in children.
Q8. The regression coefficient is 0.918 (look for External2 ~ External1 in
the output). This indicates that per one unit change in External1, External2
changes by 0.918 units. Since both External1 and External2 are measured on
the same scale, the closeness of the regression coefficient to 1 indicates that the
children largely retain their relative ordering on Externalising problems after
intervention.
Q9.

anova(fit2, fit3)

##
## Chi-Squared Difference Test
##
## Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
## fit3 13 14793 14854 24.219
## fit2 14 14870 14927 103.584 79.365 0.36789 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The chi-square difference of 79.365 on 1 degree of freedom is highly significant.
Therefore Model3 is significantly better than Model2, and we should retain
Model3.
Part X

MULTIPLE-GROUP STRUCTURAL EQUATION MODELLING

Exercise 23

Testing for measurement invariance across sexes in a multiple-group setting

Data file HolzingerSwineford1939 (part of lavaan)


R package lavaan, psych

23.1 Objectives
The objective of this exercise is to fit a measurement model to two groups of
participants using the multiple-group features of package lavaan. In particu-
lar, you will estimate means and variances of latent constructs in two groups,
implementing measurement invariance constraints.

23.2 Study of structure of mental abilities


Holzinger and Swineford (1939) administered several mental ability tests to sev-
enth and eighth-grade students in two Chicago schools. In the present example,
we use scores obtained from 73 girls and 72 boys from the Grant-White school
(Joreskog, 1969). Variables that we need for this exercise:
sex (1=boy, 2=girl)
school School (“Pasteur” or “Grant-White”)
x1 Visual perception test
x2 Spatial visualization test

x3 Spatial orientation test
x4 Paragraph comprehension test
x5 Sentence completion test
x6 Word meaning test

23.3 Worked Example - Comparing structure of mental abilities between sexes
To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.

Step 1. Reading and examining data

The Holzinger-Swineford data set is included in package lavaan, so you do
not need to download it separately. Open RStudio, and create a new project,
associating it with a new folder or perhaps with a folder where you downloaded
these workshop instructions. Create a new R script. In the script, load the lavaan
package, and then load the data set using the function data(). This function
loads data included within packages.

library(lavaan)
data(HolzingerSwineford1939)

A new object HolzingerSwineford1939 should appear in your Environment
tab. Press on this object and the data set should open in its own tab. Examine
it carefully, and scroll all the way down. You should notice that the beginning
of the data set is populated with children from school “Pasteur”, and the end
with children from school “Grant-White”. The latter is the school we need for
our analysis. So, let’s create a new object called sample with only children
from school Grant-White included.

sample <- HolzingerSwineford1939[HolzingerSwineford1939$school=="Grant-White", ]

In the above, I used the format dataset[rows,columns] to select only rows
(cases) for which school equals “Grant-White”, retaining all columns (variables).
If you look at the data frame sample in the Environment tab, it includes only
145 children from the Grant-White school, while the original data set had 301
children.
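If you prefer, the base R function subset() selects the same rows with slightly
simpler syntax; this is an equivalent alternative, not a required step:

# equivalent selection of the Grant-White subsample using subset()
sample <- subset(HolzingerSwineford1939, school == "Grant-White")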
Examine the object sample using function head(). You will see that the vari-
able sex is populated with numbers 1 (for boys) and 2 (for girls). Because this
variable will be the focus of our analysis, it would be nice to have labels attached
to these values, so each group is clearly labelled in all outputs. To do this, we
use R base function factor(), which encodes a vector of values into categories
(note that this function has nothing to do with factor analysis!). We would like
to encode levels c(1,2) as two nominal categories c("boy", "girl"):

# give value labels for sex variable


sample$sex <- factor(sample$sex,
levels = c(1,2),
labels = c("boy", "girl"))

Request head(sample) again. You will see that now, variable sex is populated
with either “boy” or “girl”.
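A quick way to verify the recoding and see the group sizes is the base R
function table():

# count children in each category of sex
table(sample$sex)

##  boy girl
##   72   73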
Next, let’s obtain and examine the means of the six tests (x1 to x6) for boys
and girls. An easy way to do this is to use function describeBy() from package
psych:

library(psych)

# descriptive statistics by sex


describeBy(sample, group="sex")

##
## Descriptive statistics by group
## sex: boy
## vars n mean sd median trimmed mad min max range skew
## id 1 72 279.18 46.38 281.00 279.90 66.72 201.00 351.00 150.00 -0.10
## sex* 2 72 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN
## ageyr 3 72 12.90 1.05 13.00 12.84 1.48 11.00 16.00 5.00 0.62
## agemo 4 72 5.19 3.37 5.00 5.12 4.45 0.00 11.00 11.00 0.12
## school* 5 72 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN
## grade 6 71 7.49 0.50 7.00 7.49 0.00 7.00 8.00 1.00 0.03
## x1 7 72 4.97 1.16 4.92 4.97 1.11 2.67 8.50 5.83 0.17
## x2 8 72 6.23 1.10 6.00 6.13 1.11 4.00 9.25 5.25 0.64
## x3 9 72 2.14 1.08 1.88 2.08 1.11 0.38 4.38 4.00 0.40
## x4 10 72 3.10 1.02 3.00 3.08 0.99 0.33 6.00 5.67 0.26
## x5 11 72 4.60 1.05 4.75 4.62 1.11 1.75 6.50 4.75 -0.17
## x6 12 72 2.36 1.08 2.29 2.27 1.06 0.57 5.57 5.00 0.85
## x7 13 72 3.85 1.02 3.72 3.82 1.06 1.87 6.48 4.61 0.25
## x8 14 72 5.60 1.22 5.55 5.51 1.11 3.60 10.00 6.40 0.87
## x9 15 72 5.30 0.93 5.25 5.30 1.01 3.28 7.39 4.11 0.05
## kurtosis se
## id -1.38 5.47
## sex* NaN 0.00
## ageyr 0.55 0.12
## agemo -1.22 0.40
## school* NaN 0.00
## grade -2.03 0.06
## x1 -0.06 0.14
## x2 -0.10 0.13
## x3 -0.91 0.13
## x4 0.50 0.12
## x5 -0.44 0.12
## x6 0.65 0.13
## x7 -0.50 0.12
## x8 1.46 0.14
## x9 -0.70 0.11
## ------------------------------------------------------------
## sex: girl
## vars n mean sd median trimmed mad min max range skew
## id 1 73 271.38 40.06 270.00 270.61 45.96 202.00 348.00 146.00 0.11
## sex* 2 73 2.00 0.00 2.00 2.00 0.00 2.00 2.00 0.00 NaN
## ageyr 3 73 12.55 0.85 12.00 12.49 0.00 11.00 15.00 4.00 0.59
## agemo 4 73 5.49 3.61 5.00 5.49 4.45 0.00 11.00 11.00 0.00
## school* 5 73 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN
## grade 6 73 7.41 0.50 7.00 7.39 0.00 7.00 8.00 1.00 0.35
## x1 7 73 4.89 1.15 5.00 4.94 0.99 1.83 7.50 5.67 -0.40
## x2 8 73 6.17 1.13 6.25 6.15 1.11 2.25 9.25 7.00 -0.13
## x3 9 73 1.85 0.99 1.62 1.75 0.93 0.38 4.50 4.12 0.82
## x4 10 73 3.53 1.19 3.33 3.49 0.99 0.67 6.33 5.67 0.37
## x5 11 73 4.83 1.26 5.00 4.95 1.11 1.00 7.00 6.00 -0.82
## x6 12 73 2.57 1.19 2.29 2.51 1.06 0.29 5.86 5.57 0.56
## x7 13 73 3.99 1.05 4.00 3.97 1.10 1.30 6.30 5.00 0.06
## x8 14 73 5.38 0.84 5.50 5.41 0.74 3.05 7.15 4.10 -0.41
## x9 15 73 5.35 1.12 5.47 5.36 1.07 3.11 9.25 6.14 0.25
## kurtosis se
## id -1.04 4.69
## sex* NaN 0.00
## ageyr -0.18 0.10
## agemo -1.28 0.42
## school* NaN 0.00
## grade -1.90 0.06
## x1 -0.36 0.13
## x2 1.32 0.13
## x3 0.04 0.12
## x4 -0.31 0.14
## x5 0.44 0.15
## x6 -0.29 0.14
## x7 -0.38 0.12
## x8 0.08 0.10
## x9 0.69 0.13

QUESTION 1. Examine the means of the six test variables for boys and
girls. Note the mean differences for x3 (spatial orientation), and all verbal tests
(x4-x6). Who score higher on what tests?

Step 2. Fitting a baseline measurement model to two groups (boys and girls)

It is thought that performance on the first three tests (x1, x2 and x3) depends
on a broader spatial ability, whereas performance on the other three tests (x4,
x5 and x6) depends on a broader verbal ability. We will fit a confirmatory factor
model with two factors – Spatial and Verbal, which are expected to correlate.
We will scale the factors by adopting the unit of their first indicators (x1 for
Spatial, and x4 for Verbal), which is the default in lavaan.

Figure 23.1: Measurement model for Holzinger and Swineford data

OK, let’s first describe the measurement (confirmatory factor analysis, or CFA)
model depicted in Figure 23.1. We will call it HS.model (HS stands for
Holzinger-Swineford). You should be able to specify this model yourself by
now, using the “measured by” (=~) syntax conventions:

HS.model <- ' Spatial =~ x1 + x2 + x3
              Verbal =~ x4 + x5 + x6 '

Now we will fit the model using function cfa(). There is only one change from
how we used this function before. This time, we want to perform a multiple-
group analysis, and fit the model not to the whole sample of 145 children, but
separately to two groups - 73 girls and 72 boys. In order to do this, we set sex
as the grouping variable for analysis (group = "sex").

# Fitting Baseline (Configural) model


fit.b <- cfa(HS.model, data = sample, group = "sex")

summary(fit.b, fit.measures=TRUE)

When we specify the grouping variable (group = "sex"), the data will be sepa-
rated into two groups according to the children’s sex, and the HS.model will be fitted
to both groups without any parameter constraints. That is, all model parame-
ters in each group will be freely estimated. This is called configural invariance,
because the only thing in common between the groups is the configuration of
the model (what variables indicate what factors).
NOTE that the default in multiple-group analysis is to include means/intercepts,
so we do not need to specify this.
QUESTION 2. How many sample moments are there in the data? Why?
{Hint. When counting sample moments, do not forget that we split the data
into 2 groups, so observed means, variances and covariances are available for
both groups}.
QUESTION 3. How many parameters does the baseline (configural) model
estimate? What are they? {Hint. You can look up the total number of parame-
ters in the output, but try to work out how these are made up. The output will
help you, if you look in ‘Parameter Estimates’. Remember that values printed
there are parameters.}
QUESTION 4. Interpret the chi-square, and SRMR. Do you retain or reject
the baseline model?

Step 3. Fitting a full measurement invariance model to two groups

The baseline model does not allow us to compare the Spatial and Verbal
factor scores between boys and girls, because each group has its own metric for
these factors. For instance, origins of the factors are set to 0 in each group by
default, so girls and boys form their own ‘norm’ groups. Girls can be compared
with girls; boys with boys, but cross-comparisons are not meaningful because
the scale is “reset”. To compare the factor scores across groups properly, we
need full measurement invariance. We want the factor loadings, intercepts and
residual variances to be equal for every corresponding test across groups. They
will be still estimable parameters, but instead of estimating two sets – for boys
and for girls, we will estimate only one set for both groups.
This is very simple to do in lavaan. You do not need to adjust the model. You
only need to tell the cfa() function which types of parameters you want to
constrain equal across groups using argument group.equal. Because we want
loadings, intercepts and residuals to be equal, we combine these parameter types
into a vector using base R function c():

# Fitting Measurement Invariance model


fit.mi <- cfa(HS.model, data = sample, group = "sex",
group.equal = c("loadings", "intercepts", "residuals"))

summary(fit.mi, fit.measures=TRUE)

## lavaan 0.6.15 ended normally after 44 iterations


##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 40
## Number of equality constraints 16
##
## Number of observations per group:
## boy 72
## girl 73
##
## Model Test User Model:
##
## Test statistic 27.482
## Degrees of freedom 30
## P-value (Chi-square) 0.598
## Test statistic for each group:
## boy 13.693
## girl 13.789
##
## Model Test Baseline Model:
##
## Test statistic 342.274
## Degrees of freedom 30
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 1.000
## Tucker-Lewis Index (TLI) 1.008
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -1164.566
## Loglikelihood unrestricted model (H1) -1150.825
##
## Akaike (AIC) 2377.131
## Bayesian (BIC) 2448.573
## Sample-size adjusted Bayesian (SABIC) 2372.628
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.000
## 90 Percent confidence interval - lower 0.000
## 90 Percent confidence interval - upper 0.079
## P-value H_0: RMSEA <= 0.050 0.807
## P-value H_0: RMSEA >= 0.080 0.047
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.058
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
##
## Group 1 [boy]:
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## Spatial =~
## x1 1.000
## x2 (.p2.) 0.813 0.173 4.699 0.000
## x3 (.p3.) 1.063 0.200 5.302 0.000
## Verbal =~
## x4 1.000
## x5 (.p5.) 0.963 0.084 11.442 0.000
## x6 (.p6.) 0.947 0.082 11.520 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## Spatial ~~
## Verbal 0.387 0.116 3.329 0.001
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .x1 (.16.) 5.021 0.118 42.462 0.000
## .x2 (.17.) 6.274 0.108 57.911 0.000
## .x3 (.18.) 2.092 0.113 18.468 0.000
## .x4 (.19.) 3.157 0.117 27.051 0.000
## .x5 (.20.) 4.558 0.118 38.678 0.000
## .x6 (.21.) 2.317 0.115 20.093 0.000
## Spatial 0.000
## Verbal 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .x1 (.p7.) 0.799 0.128 6.223 0.000
## .x2 (.p8.) 0.883 0.123 7.209 0.000
## .x3 (.p9.) 0.487 0.110 4.439 0.000
## .x4 (.10.) 0.289 0.063 4.572 0.000
## .x5 (.11.) 0.443 0.073 6.094 0.000
## .x6 (.12.) 0.412 0.069 5.990 0.000
## Spatial 0.484 0.165 2.931 0.003
## Verbal 0.769 0.164 4.689 0.000
##
##
## Group 2 [girl]:
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## Spatial =~
## x1 1.000
## x2 (.p2.) 0.813 0.173 4.699 0.000
## x3 (.p3.) 1.063 0.200 5.302 0.000
## Verbal =~
## x4 1.000
## x5 (.p5.) 0.963 0.084 11.442 0.000
## x6 (.p6.) 0.947 0.082 11.520 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## Spatial ~~
## Verbal 0.395 0.134 2.956 0.003
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .x1 (.16.) 5.021 0.118 42.462 0.000
## .x2 (.17.) 6.274 0.108 57.911 0.000
## .x3 (.18.) 2.092 0.113 18.468 0.000
## .x4 (.19.) 3.157 0.117 27.051 0.000
## .x5 (.20.) 4.558 0.118 38.678 0.000
## .x6 (.21.) 2.317 0.115 20.093 0.000
## Spatial -0.180 0.145 -1.245 0.213
## Verbal 0.318 0.172 1.847 0.065
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .x1 (.p7.) 0.799 0.128 6.223 0.000
## .x2 (.p8.) 0.883 0.123 7.209 0.000
## .x3 (.p9.) 0.487 0.110 4.439 0.000
## .x4 (.10.) 0.289 0.063 4.572 0.000
## .x5 (.11.) 0.443 0.073 6.094 0.000
## .x6 (.12.) 0.412 0.069 5.990 0.000
## Spatial 0.538 0.180 2.989 0.003
## Verbal 1.114 0.227 4.917 0.000

What you have just done is fitted a full measurement invariance model. The
model tests the following combined hypothesis:

H1. The measure is fully invariant across groups. Factor loadings, intercepts
and error variances for corresponding indicators are equal.

QUESTION 5. Interpret the chi-square and SRMR. Do we retain or reject


the full measurement invariance model?

Now examine the output, focusing on ‘Parameter Estimates’. Note that all those
measurement parameters that were supposed to be equal are indeed equal!

Here is a brief explanation of why the parameters are set the way they are.
While we assume full measurement invariance (i.e. the tests function equally
across groups), we do not have any particular reasons to assume that boys and
girls should be equal to each other in terms of their latent factors – Spatial and
Verbal. In fact, they might be different with respect to group means, variances
and covariances. This is why the logical way of scaling the latent factors is
setting their metric in the first group (say, boys), and carrying over that metric to
the second group (girls) via parameter constraints. The parameter constraints
will ensure the scale of measurement does not change, and then the means,
variances and covariances of the latent factors for girls can be freely estimated.

QUESTION 6. How many parameters does the measurement invariance


model estimate? What are they? {Hint. Again, look up the total number
of parameters in the output, and then try to work out how these are made up.
The ‘Parameter Estimates’ output and the above explanation will help you.}

Now, examine and interpret the means, variances and covariances of Spatial
and Verbal factors.

QUESTION 7. What are the means and variances of Spatial and Verbal
for Girls? How do you interpret these values?

Step 4. Testing equality of means of latent factors

There are small differences between the means of Spatial and Verbal for boys
and girls. It appears that girls are slightly worse than boys in spatial ability,
but better in verbal ability. However, the means for girls were not significantly
different from 0 at the 0.05 level, which could also lead us to conclude that they
were not significantly different from boys’ means (see answer to Question 7 for
explanation).
Let us test the hypothesis of equality of the means of Spatial and Verbal for
girls and boys formally. All you need to do is to add one more group equality
constraint, for the "means" of latent variables:

# Testing Equality of means of latent factors


fit.mi.e <- cfa(HS.model, data = sample, group = "sex",
group.equal = c("loadings", "intercepts", "residuals", "means"))

summary(fit.mi.e, fit.measures=TRUE)

## lavaan 0.6.15 ended normally after 41 iterations


##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 38
## Number of equality constraints 16
##
## Number of observations per group:
## boy 72
## girl 73
##
## Model Test User Model:
##
## Test statistic 35.786
## Degrees of freedom 32
## P-value (Chi-square) 0.295
## Test statistic for each group:
## boy 17.258
## girl 18.528
##
## Model Test Baseline Model:
##
## Test statistic 342.274
## Degrees of freedom 30
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.988
## Tucker-Lewis Index (TLI) 0.989
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -1168.718
## Loglikelihood unrestricted model (H1) -1150.825
##
## Akaike (AIC) 2381.435
## Bayesian (BIC) 2446.923
## Sample-size adjusted Bayesian (SABIC) 2377.308
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.040
## 90 Percent confidence interval - lower 0.000
## 90 Percent confidence interval - upper 0.099
## P-value H_0: RMSEA <= 0.050 0.555
## P-value H_0: RMSEA >= 0.080 0.158
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.078
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
##
## Group 1 [boy]:
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## Spatial =~
## x1 1.000
## x2 (.p2.) 0.810 0.172 4.707 0.000
## x3 (.p3.) 1.012 0.195 5.185 0.000
## Verbal =~
## x4 1.000
## x5 (.p5.) 0.977 0.086 11.401 0.000
## x6 (.p6.) 0.960 0.084 11.480 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## Spatial ~~
## Verbal 0.384 0.117 3.267 0.001
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .x1 (.16.) 4.938 0.095 51.820 0.000
## .x2 (.17.) 6.207 0.092 67.524 0.000
## .x3 (.18.) 2.004 0.086 23.313 0.000
## .x4 (.19.) 3.277 0.092 35.697 0.000
## .x5 (.20.) 4.672 0.095 49.199 0.000
## .x6 (.21.) 2.430 0.093 26.199 0.000
## Spatial 0.000
## Verbal 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .x1 (.p7.) 0.778 0.131 5.939 0.000
## .x2 (.p8.) 0.871 0.123 7.072 0.000
## .x3 (.p9.) 0.520 0.111 4.689 0.000
## .x4 (.10.) 0.306 0.064 4.758 0.000
## .x5 (.11.) 0.434 0.073 5.978 0.000
## .x6 (.12.) 0.403 0.069 5.858 0.000
## Spatial 0.504 0.172 2.934 0.003
## Verbal 0.770 0.165 4.662 0.000
##
##
## Group 2 [girl]:
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## Spatial =~
## x1 1.000
## x2 (.p2.) 0.810 0.172 4.707 0.000
## x3 (.p3.) 1.012 0.195 5.185 0.000
## Verbal =~
## x4 1.000
## x5 (.p5.) 0.977 0.086 11.401 0.000
## x6 (.p6.) 0.960 0.084 11.480 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## Spatial ~~
## Verbal 0.384 0.137 2.811 0.005
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .x1 (.16.) 4.938 0.095 51.820 0.000
## .x2 (.17.) 6.207 0.092 67.524 0.000
## .x3 (.18.) 2.004 0.086 23.313 0.000
## .x4 (.19.) 3.277 0.092 35.697 0.000
## .x5 (.20.) 4.672 0.095 49.199 0.000
## .x6 (.21.) 2.430 0.093 26.199 0.000
## Spatial 0.000
## Verbal 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .x1 (.p7.) 0.778 0.131 5.939 0.000
## .x2 (.p8.) 0.871 0.123 7.072 0.000
## .x3 (.p9.) 0.520 0.111 4.689 0.000
## .x4 (.10.) 0.306 0.064 4.758 0.000
## .x5 (.11.) 0.434 0.073 5.978 0.000
## .x6 (.12.) 0.403 0.069 5.858 0.000
## Spatial 0.578 0.192 3.012 0.003
## Verbal 1.134 0.232 4.888 0.000

The resulting model tests the following combined hypothesis:


H1. The measure is fully invariant across groups. Factor loadings, intercepts
and error variances for corresponding indicators are equal.
H2. The means of Spatial and Verbal factors are equal across groups.
The model with equality constraints on latent means (fit.mi.e) appears to fit
well based on the chi-square test, and the SRMR=0.078 is just under the ac-
ceptable value of 0.08.
Next, we will formally compare two models – the measurement invariance
model (fit.mi) and the measurement invariance model with equality constraints
(fit.mi.e). It would be interesting to know if the two models are significantly
different from each other.
Because the two models are nested (one is a special case of the other), you can
conduct the chi-square difference test using the R base function anova(), which
will compute the difference of the models’ chi-square statistics and the difference
of their degrees of freedom, and print out the resulting p-value.

anova(fit.mi, fit.mi.e)

QUESTION 8. Conduct the chi-square difference test of models with and
without the equality constraints as described above, and interpret the results.
Are the models significantly different? Which model will you retain?

Step 5. Saving your work

After you finished work with this exercise, save your R script with a meaningful
name, for example “Holzinger-Swineford 1939 analysis”.
To keep all of the created objects, which might be useful when revisiting this
exercise, save your entire ‘work space’ when closing the project. Press File /
Close project, and select Save when prompted to save your ‘Workspace image’.

23.4 Solutions
Q1. Boys scored higher than girls on x3, but girls scored higher than boys on
all verbal tests (x4, x5 and x6).
Q2. As we model means/intercepts too, we need to include them in the counted
sample moments. ‘Sample moments’ refers to the number of means, variances
and covariances in the observed data. There are 6 observed variables, therefore
6 means, plus 6(6+1)/2=21 variances and covariances; 27 moments in total for
each group. So we have 27*2=54 sample moments in both groups.
Q3. Baseline model estimates 38 parameters in total. Boys and girls groups
estimate the same parameters (i.e. there are 19 parameters in each group).
These are:
• 4 factor loadings (loadings for x1 and x4 are fixed to 1);
• 1 covariance of Spatial and Verbal factors;
• 6 intercepts of observed variables x1-x6;
• 6 residual variances of observed variables x1-x6;
• 2 variances (for Spatial and Verbal factors).
Q4. Chi-square for the baseline model is insignificant (chisq = 16.710; Degrees
of freedom = 16; p = .405). The SRMR is 0.040 – nice and small.
A breakdown of the chi-square statistic by group is also provided, attributing
8.748 to boys, and 7.962 to girls (the chi-square statistic is additive, so these
values add to the total chi-square statistic reported). The almost equal chi-
square values for both groups (and the groups were of almost equal size) indicate
similar fit of the baseline model in both groups. We conclude that the two-factor
configural model is appropriate for boys and girls.
Q5. Chi-square for the measurement invariance model is again insignificant
(chisq = 27.482, Degrees of freedom = 30, P-value = 0.598). The SRMR is
0.058 – small again. The almost equal chi-square values for both groups (boy
chisq = 13.693, girl chisq = 13.789) indicate that the measurement invariance
model is equally appropriate for boys and girls.

Q6. The output says that ‘Number of free parameters’ is 40, and ‘Number of
equality constraints’ is 16. This means that 40-16=24 unique parameters are
estimated:
• 4 factor loadings (note that these have labels and parameter values identical
for boys and girls);
• 1 covariance of Spatial and Verbal for boys + 1 covariance for girls (note
that these have no labels and the values are different in the output);
• 6 intercepts of observed variables x1-x6 (note that these have labels and
parameter values identical for boys and girls);
• 2 means of Spatial and Verbal factors for girls (note that for boys, these
means are set to 0);
• 6 residual variances of observed variables x1-x6 (note that these have labels
and parameter values identical for boys and girls);
• 2 variances of Spatial and Verbal factors for boys + 2 variances for girls.
Q7. For Girls, the mean for Spatial is –0.180 (which appears lower than for
Boys for whom the mean was fixed to 0), and the mean for Verbal is 0.318
(higher than for Boys for whom the mean was fixed to 0). Both means for girls
are not significantly different from zero at the 0.05 level (look at their p-values).
Variances of the Spatial and Verbal factors are larger for girls (0.538; 1.114)
than boys (0.484; 0.769). Girls appear to show more variability in their latent
abilities than boys.
Q8. The model with additional constraints (fit.mi.e) has the Chi-square =
35.786; Degrees of freedom = 32. Testing the difference between this model
and the previous model (fit.mi), we obtain Diff(Chi-square) = 35.786–27.482=
8.304 and Diff(DF) = 32–30 = 2.
Chi-square of 8.304 on 2 degrees of freedom is significant at the 0.05 level, with
the p-value=0.016. Restricting the model with additional equality constraints
resulted in significantly different (worse) fit. The fit is worse because the
chi-square is greater (and constraining some free parameters cannot make the
fit better!). We conclude that the means for boys and girls on the Spatial and
Verbal tests are significantly different, and that our measurement invariance
model with free means is better than the model with means constrained equal.
You may wonder how the fit can be ‘significantly worse’ if it is still very good
according to the chi-square test (the SRMR, not being a ‘significance’ measure
but an ‘effect size’ measure, picked up the worsening fit). Here the small sample
works against us – there is not enough power to reject the ‘wrong’ model (it
has too many parameters), but just enough power to detect the elements of the
model that make a difference.
Exercise 24

Measuring effect of intervention by comparing change models for control and experimental groups

Data file Olsson.RData


R package lavaan, psych

24.1 Objectives
The objective of this exercise is to fit a full structural model to repeated obser-
vations across two groups, where one group received an intervention between
the two measurement occasions and the other did not. This model will allow
testing for difference between the experimental and control group in terms of
the change between the two measurement occasions. You will learn how to
implement measurement invariance constraints across time and groups, which
are essential to make sure that the longitudinal change can be compared across
groups.

24.2 Study of training intervention to improve test performance (Olsson, 1973)
This data analysis example is adopted (and adapted) from James L. Arbuckle’s
(2016) User Guide to software AMOS. Arbuckle analyses a study by Olsson
(1973), who administered a battery of tests to N=213 eleven-year-old students
on two occasions. In this exercise, we will focus on two tests, Synonyms and
Opposites. Between the two administrations of these tests, 108 of the students
(the experimental group) received training that was intended to improve perfor-
mance on the tests. The other 105 students (the control group) did not receive
any training. As a result of taking two tests on two occasions, each of the 213
students obtained four test scores:
Descriptions
pre_syn Pretest scores on the Synonyms test
pre_opp Pretest scores on the Opposites test
post_syn Posttest scores on the Synonyms test
post_opp Posttest scores on the Opposites test

24.3 Worked Example - Quantifying change on a latent construct after an intervention
To complete this exercise, you need to repeat the analysis from a worked example
below, and answer some questions.

Step 1. Reading and examining data

Means and covariances of these four variables are available for the experimental
and control groups separately. I organised the data into objects that are ready
to be used by R for analyses in this exercise. These objects are packaged into
file Olsson.RData. Download this file, and save it in a new folder. In
RStudio, create a new project in the folder you have just created. Start a new
script.
We start by loading the data into the RStudio Environment.

load(file='Olsson.RData')

Three new objects should appear in your Environment: Olsson.cov - a list of
2 covariance matrices, one for experimental and one for control group; Ols-
son.mean - a list of two vectors containing means for the two groups; and
Olsson.N - a list of two sample sizes for each group. Examine these objects by
calling them like so:

Olsson.cov

## $Control
## pre_syn pre_opp post_syn post_opp
## pre_syn 37.626 24.933 26.639 23.649
## pre_opp 24.933 34.680 24.236 27.760
## post_syn 26.639 24.236 32.013 23.565
## post_opp 23.649 27.760 23.565 33.443
##
## $Experimental
## pre_syn pre_opp post_syn post_opp
## pre_syn 50.084 42.373 40.760 37.343
## pre_opp 42.373 49.872 36.094 40.396
## post_syn 40.760 36.094 51.237 39.890
## post_opp 37.343 40.396 39.890 53.641

Olsson.mean

## $Control
## [1] 18.381 20.229 20.400 21.343
##
## $Experimental
## [1] 20.556 21.241 25.667 25.870

Olsson.N

## $Control
## [1] 105
##
## $Experimental
## [1] 108

QUESTION 1. Examine the means for the control and experimental groups.
Remember that the first two means pertain to the pre-test and the last two
means to the post-test. Do the means increase or decrease? Are there any
visible differences between the groups? How would you (tentatively) interpret
the changes?

Step 2. Fitting a basic structural model for repeated measures

Given that the two subtests - synonyms and opposites - are supposed to indicate
verbal ability, we will fit the following basic longitudinal measurement model to
both groups.
In this model, pre_verbal is measured by pre_syn and pre_opp, and
post_verbal is measured by post_syn and post_opp.

Figure 24.1: Basic model for change in verbal test performance

We set the factor loading of pre_syn to 1, to give the latent pre_verbal factor
a unit of measurement. We set the mean of the latent pre_verbal factor to 0, to specify
the origin of measurement. Because we have a repeated measures design within
each group, we need to constrain measurement parameters equal across time to
maintain the same scale of measurement before and after the intervention. We
do so by using labels, as we did before in Exercise 22. This is regardless of any
cross-group invariance considerations (which we will consider later).

The correlated unique factors (errors) for pre_syn and post_syn account for
the shared variance between the Synonyms subtest on the two measurement
occasions, after the verbal factor has been controlled for. The correlated unique
factors (errors) for pre_opp and post_opp account for the shared variance
between the Opposites subtest on the two measurement occasions, after the
Verbal factor has been controlled for. This is typical for repeated measures –
we have considered this feature previously in Exercise 22.

The model depicted in Figure 24.1 is the default measurement model for change
that we will program in lavaan. Once we have specified this default model, we
will consider how this will be implemented across groups.

Using lavaan conventions, we can specify this repeated measures model with
parameter constraints (let’s call it Model.1) as follows:

Model.1 <- '


# measurement models with factor loadings equal across time (label f_opp)
pre_verbal =~ pre_syn + f_opp*pre_opp
post_verbal =~ post_syn + f_opp*post_opp

# correlated residuals for repeated measures


pre_syn ~~ post_syn
pre_opp ~~ post_opp

# intercepts with equality constraints across time (labels i_syn and i_opp)
pre_syn ~ i_syn*1
post_syn ~ i_syn*1
pre_opp ~ i_opp*1
post_opp ~ i_opp*1

# unique factors/errors with equality constraints across time (labels e_syn and e_opp)
pre_syn ~~ e_syn*pre_syn
post_syn ~~ e_syn*post_syn
pre_opp ~~ e_opp*pre_opp
post_opp ~~ e_opp*post_opp

# finally, the structural part (regression)


post_verbal ~ pre_verbal '

Step 3. Setting parameters equal across groups in lavaan

Now we can deal with ensuring measurement invariance across groups. It turns
out that by using the parameter equality labels in a multiple group setting,
we already imposed equality constraints across the groups. This is because
if a single label is attached to a parameter (say, “f_opp” was the label for
factor loading of pre_opp and post_opp), then this label also holds this
parameter equal across groups. If we wanted different measurement parameters
across groups (but wanted to maintain longitudinal invariance) we would need
to assign group-specific labels. Thankfully, we do not need to.
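For illustration only (we will not use this here), lavaan accepts a vector of
labels, one per group. The following sketch would keep each loading equal
across time within a group, while allowing it to differ between the Control
and Experimental groups; the object name and labels are made up for the sketch:

# Sketch only: group-specific labels (one label per group).
# Within each group the loading is still equal across time, but the
# Control and Experimental groups get different loadings.
# The rest of Model.1 would stay the same.
Model.1g <- '
pre_verbal =~ pre_syn + c(f_opp_c, f_opp_e)*pre_opp
post_verbal =~ post_syn + c(f_opp_c, f_opp_e)*post_opp '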

OK, let’s fit Model.1 to both groups using the sem() function of lavaan. Be-
cause our data is not a data frame with a grouping variable but instead lists
of statistics for both groups, we need to specify the sample covariance matrices
(sample.cov = Olsson.cov), sample means (sample.mean = Olsson.mean),
and sample sizes (sample.nobs = Olsson.N). With labels set up previously to
ensure invariance of repeated-measures parameters within each group, the op-
tion group.equal = c("loadings", "intercepts", "residuals") will en-
force these parameters to also be invariant across groups.

library(lavaan)

fit.1 <- sem(Model.1, sample.cov = Olsson.cov, sample.mean = Olsson.mean,
             sample.nobs = Olsson.N,
             group.equal = c("loadings", "intercepts", "residuals"))

summary(fit.1, fit.measures=TRUE)

## lavaan 0.6.15 ended normally after 134 iterations


##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 32
## Number of equality constraints 15
##
## Number of observations per group:
## Control 105
## Experimental 108
##
## Model Test User Model:
##
## Test statistic 43.384
## Degrees of freedom 11
## P-value (Chi-square) 0.000
## Test statistic for each group:
## Control 34.035
## Experimental 9.349
##
## Model Test Baseline Model:
##
## Test statistic 689.109
## Degrees of freedom 12
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.952
## Tucker-Lewis Index (TLI) 0.948
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -2474.985
## Loglikelihood unrestricted model (H1) -2453.294
##
## Akaike (AIC) 4983.971
## Bayesian (BIC) 5041.113
## Sample-size adjusted Bayesian (SABIC) 4987.245
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.166
## 90 Percent confidence interval - lower 0.116
## 90 Percent confidence interval - upper 0.220
## P-value H_0: RMSEA <= 0.050 0.000
## P-value H_0: RMSEA >= 0.080 0.997
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.062
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
##
## Group 1 [Control]:
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## pre_verbal =~
## pre_syn 1.000
## pre_opp (f_pp) 0.872 0.069 12.621 0.000
## post_verbal =~
## pst_syn 1.000
## post_pp (f_pp) 0.872 0.069 12.621 0.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## post_verbal ~
## pre_verbal 0.874 0.060 14.566 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## .pre_syn ~~
## .post_syn -2.629 3.256 -0.807 0.420
## .pre_opp ~~
## .post_opp 8.412 2.763 3.045 0.002
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .pre_syn (i_sy) 19.758 0.532 37.138 0.000
## .pst_syn (i_sy) 19.758 0.532 37.138 0.000
## .pre_opp (i_pp) 20.869 0.524 39.865 0.000
## .post_pp (i_pp) 20.869 0.524 39.865 0.000
## pr_vrbl 0.000
## .pst_vrb 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .pre_syn (e_sy) 5.593 3.325 1.682 0.093
## .pst_syn (e_sy) 5.593 3.325 1.682 0.093
## .pre_opp (e_pp) 14.108 2.748 5.134 0.000
## .post_pp (e_pp) 14.108 2.748 5.134 0.000
## pr_vrbl 31.968 5.803 5.509 0.000
## .pst_vrb 2.841 1.643 1.729 0.084
##
##
## Group 2 [Experimental]:
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## pre_verbal =~
## pre_syn 1.000
## pre_opp (f_pp) 0.872 0.069 12.621 0.000
## post_verbal =~
## pst_syn 1.000
## post_pp (f_pp) 0.872 0.069 12.621 0.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## post_verbal ~
## pre_verbal 0.891 0.060 14.890 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## .pre_syn ~~
## .post_syn 0.616 3.193 0.193 0.847
## .pre_opp ~~
## .post_opp 6.756 2.688 2.513 0.012
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .pre_syn (i_sy) 19.758 0.532 37.138 0.000
## .pst_syn (i_sy) 19.758 0.532 37.138 0.000
## .pre_opp (i_pp) 20.869 0.524 39.865 0.000
## .post_pp (i_pp) 20.869 0.524 39.865 0.000
## pr_vrbl 0.714 0.862 0.828 0.407
## .pst_vrb 5.256 0.406 12.937 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .pre_syn (e_sy) 5.593 3.325 1.682 0.093
## .pst_syn (e_sy) 5.593 3.325 1.682 0.093
## .pre_opp (e_pp) 14.108 2.748 5.134 0.000
## .post_pp (e_pp) 14.108 2.748 5.134 0.000
## pr_vrbl 45.408 7.516 6.042 0.000
## .pst_vrb 9.360 2.393 3.911 0.000

Run the syntax and examine the output carefully. You should be able to see
that the measurement parameters we set to be equal (factor loadings, intercepts
and error variances) are indeed equal across time and also across groups.
But when examining the structural parameters, we notice that the "Intercepts"
for the Control and Experimental groups are different. Specifically, in the
Control group the mean of pre_verbal is fixed at 0.000 and the intercept of
post_verbal is fixed at 0.000. We know they are fixed because no Standard
Errors are reported for them. This is interesting. It is common to set the
origin of measurement for pre_verbal at Time 1 to 0 so that the intercepts
of pre_syn and pre_opp can be estimated, but surely the intercept of
post_verbal can be freely estimated given that the parameter constraints
'pass on' the scale of measurement from Time 1 to Time 2. However, lavaan
sets the intercepts of all latent factors to 0 by default. If we retain this
fixed parameter, we test the following hypothesis:

Hypothesis 1 (H1). The average performance did not change across the two testing occasions in the Control group.

Initially, this appears reasonable (because the Control group did not receive any
training) and we leave this intercept fixed to 0 for now.
Now, in the Experimental group, the mean of pre_verbal and the intercept of
post_verbal are freely estimated (we know that because Standard Errors are
reported for them), and are different from zero and from each other. Indeed,
the origin and the unit of measurement can be passed on to the model for the
Experimental group through the group parameter constraints. Hence, we can
freely estimate the mean and variance of pre_verbal, and the intercept and
residual variance of post_verbal in the Experimental group.
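
If you would like to verify directly which of these latent means and intercepts
are free or fixed, rather than scanning the summary for missing Standard
Errors, one possible check (a convenience sketch, not a required step) is:

# intercept/mean ('~1') rows for the verbal factors in both groups;
# fixed parameters are reported with se = 0
PE <- parameterEstimates(fit.1)
subset(PE, op == "~1" & lhs %in% c("pre_verbal", "post_verbal"))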
Now let's examine the chi-square statistic. The chi-square for the tested model
is 43.384 (df = 11); P-value < .001. We have to reject the model because the
chi-square is highly significant despite the modest sample size. Because this is
a multiple-group analysis, the chi-square statistic is made up of two group
statistics: 34.035 for the Control group and 9.349 for the Experimental group.
Clearly, the misfit comes from the Control group.
338EXERCISE 24. MEASURING EFFECT OF INTERVENTION BY COMPARING CHANGE MODE

QUESTION 2. Report and interpret the SRMR, CFI and RMSEA. What
can you say about the goodness of fit of Model.1?
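
If you prefer to pull out just the indices needed for Question 2 rather than
scanning the full summary output, fitMeasures() will return them (a
convenience, not a required step):

# extract selected fit indices from the fitted model
fitMeasures(fit.1, c("chisq", "df", "pvalue", "cfi", "rmsea", "srmr"))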
To help you understand the reasons for misfit, request the modification indices,
and sort them from largest to smallest:

modindices(fit.1, sort. = TRUE)

## lhs op rhs block group level mi epc sepc.lv sepc.all
## 19 post_verbal ~1 1 1 1 24.972 1.631 0.312 0.312
## 18 pre_verbal ~1 1 1 1 24.972 -12.962 -2.292 -2.292
## 68 pre_syn ~~ pre_opp 2 2 1 5.936 3.732 3.732 0.420
## 71 post_syn ~~ post_opp 2 2 1 4.821 -3.363 -3.363 -0.379
## 1 pre_verbal =~ pre_syn 1 1 1 1.231 0.102 0.574 0.094
## 58 pre_syn ~~ pre_opp 1 1 1 1.087 -1.630 -1.630 -0.183
## 57 post_verbal =~ pre_opp 1 1 1 0.927 -0.085 -0.442 -0.071
## 56 post_verbal =~ pre_syn 1 1 1 0.927 0.097 0.507 0.083
## 60 pre_opp ~~ post_syn 1 1 1 0.807 1.902 1.902 0.214
## 59 pre_syn ~~ post_opp 1 1 1 0.807 -1.902 -1.902 -0.214
## 61 post_syn ~~ post_opp 1 1 1 0.636 1.248 1.248 0.140
## 20 pre_verbal =~ pre_syn 2 2 1 0.240 -0.035 -0.238 -0.033
## 65 pre_verbal =~ post_opp 2 2 1 0.192 0.029 0.195 0.028
## 64 pre_verbal =~ post_syn 2 2 1 0.192 -0.033 -0.223 -0.031
## 69 pre_syn ~~ post_opp 2 2 1 0.142 0.959 0.959 0.108
## 70 pre_opp ~~ post_syn 2 2 1 0.142 -0.959 -0.959 -0.108
## 22 post_verbal =~ post_syn 2 2 1 0.123 -0.028 -0.187 -0.026
## 66 post_verbal =~ pre_syn 2 2 1 0.040 0.011 0.075 0.011
## 67 post_verbal =~ pre_opp 2 2 1 0.040 -0.010 -0.065 -0.009
## 3 post_verbal =~ post_syn 1 1 1 0.008 -0.009 -0.048 -0.008
## 54 pre_verbal =~ post_syn 1 1 1 0.001 0.003 0.016 0.003
## 55 pre_verbal =~ post_opp 1 1 1 0.001 -0.003 -0.014 -0.002
## sepc.nox
## 19 0.312
## 18 -2.292
## 68 0.420
## 71 -0.379
## 1 0.094
## 58 -0.183
## 57 -0.071
## 56 0.083
## 60 0.214
## 59 -0.214
## 61 0.140
## 20 -0.033
## 65 0.028
## 64 -0.031
## 69 0.108
## 70 -0.108
## 22 -0.026
## 66 0.011
## 67 -0.009
## 3 -0.008
## 54 0.003
## 55 -0.002

QUESTION 3. What do the modification indices tell you? Which changes in
the model would produce the greatest decrease in the chi-square? How do you
interpret this model change?

Step 4. Measuring change in latent factors using structural parameters in lavaan

After having answered Question 3, hopefully you agree that the lavaan de-
faults imposed on our structural parameters (specifically, setting the intercept
of post_verbal in the Control group to 0) caused model misfit.
We will now adjust the structural model setup by freeing the intercept of
post_verbal, thus allowing for change in verbal performance score between
pre-test and post-test. Because the Control group did not receive any inter-
vention, this is not a "training" effect but rather a "practice" effect. Modifying
the model is very easy. You can just append one more line to the syntax of
Model.1, explicitly freeing the intercept of post_verbal by giving it the label
NA, and making Model.2:

Model.2 <- paste(Model.1, ' # it is important to have a line break here
post_verbal ~ NA*1 ')
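
Because paste() joins the two strings (separated by a space by default), it may
be worth printing the combined syntax to confirm that the new statement
indeed sits on its own line:

# print the full model syntax with newlines rendered
cat(Model.2)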

We now fit the modified model, and request summary output with fit indices
and standardized parameters.

fit.2 <- sem(Model.2,
             sample.cov = Olsson.cov, sample.mean = Olsson.mean, sample.nobs = Olsson.N,
             group.equal = c("loadings", "intercepts", "residuals"))

summary(fit.2, standardized=TRUE)

## lavaan 0.6.15 ended normally after 123 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 33
## Number of equality constraints 15
##
## Number of observations per group:
## Control 105
## Experimental 108
##
## Model Test User Model:
##
## Test statistic 14.272
## Degrees of freedom 10
## P-value (Chi-square) 0.161
## Test statistic for each group:
## Control 4.774
## Experimental 9.498
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
##
## Group 1 [Control]:
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## pre_verbal =~
## pre_syn 1.000 5.642 0.937
## pre_opp (f_pp) 0.849 0.065 13.143 0.000 4.789 0.778
## post_verbal =~
## pst_syn 1.000 5.284 0.929
## post_pp (f_pp) 0.849 0.065 13.143 0.000 4.486 0.757
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## post_verbal ~
## pre_verbal 0.928 0.055 16.901 0.000 0.991 0.991
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .pre_syn ~~
## .post_syn -3.214 3.292 -0.976 0.329 -3.214 -0.726
## .pre_opp ~~
## .post_opp 8.833 2.719 3.249 0.001 8.833 0.590
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .pre_syn (i_sy) 18.556 0.574 32.338 0.000 18.556 3.082
## .pst_syn (i_sy) 18.556 0.574 32.338 0.000 18.556 3.262
## .pre_opp (i_pp) 19.883 0.558 35.625 0.000 19.883 3.230
## .post_pp (i_pp) 19.883 0.558 35.625 0.000 19.883 3.357
## .pst_vrb 1.684 0.295 5.716 0.000 0.319 0.319
## pr_vrbl 0.000 0.000 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .pre_syn (e_sy) 4.430 3.281 1.350 0.177 4.430 0.122
## .pst_syn (e_sy) 4.430 3.281 1.350 0.177 4.430 0.137
## .pre_opp (e_pp) 14.960 2.633 5.681 0.000 14.960 0.395
## .post_pp (e_pp) 14.960 2.633 5.681 0.000 14.960 0.426
## pr_vrbl 31.832 5.737 5.549 0.000 1.000 1.000
## .pst_vrb 0.517 1.479 0.350 0.727 0.019 0.019
##
##
## Group 2 [Experimental]:
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## pre_verbal =~
## pre_syn 1.000 6.800 0.955
## pre_opp (f_pp) 0.849 0.065 13.143 0.000 5.773 0.831
## post_verbal =~
## pst_syn 1.000 6.810 0.955
## post_pp (f_pp) 0.849 0.065 13.143 0.000 5.781 0.831
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## post_verbal ~
## pre_verbal 0.892 0.060 14.981 0.000 0.891 0.891
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .pre_syn ~~
## .post_syn -0.431 3.165 -0.136 0.892 -0.431 -0.097
## .pre_opp ~~
## .post_opp 7.494 2.595 2.887 0.004 7.494 0.501
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .pre_syn (i_sy) 18.556 0.574 32.338 0.000 18.556 2.607
## .pst_syn (i_sy) 18.556 0.574 32.338 0.000 18.556 2.603
## .pre_opp (i_pp) 19.883 0.558 35.625 0.000 19.883 2.861
## .post_pp (i_pp) 19.883 0.558 35.625 0.000 19.883 2.859
## .pst_vrb 5.428 0.421 12.906 0.000 0.797 0.797
## pr_vrbl 1.919 0.888 2.161 0.031 0.282 0.282
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .pre_syn (e_sy) 4.430 3.281 1.350 0.177 4.430 0.087
## .pst_syn (e_sy) 4.430 3.281 1.350 0.177 4.430 0.087
## .pre_opp (e_pp) 14.960 2.633 5.681 0.000 14.960 0.310
## .post_pp (e_pp) 14.960 2.633 5.681 0.000 14.960 0.309
## pr_vrbl 46.245 7.497 6.169 0.000 1.000 1.000
## .pst_vrb 9.595 2.446 3.923 0.000 0.207 0.207

QUESTION 4. Report and interpret the chi-square for Model.2. Would you
accept or reject this model? What are the degrees of freedom for Model.2 and
how do they compare to Model.1?
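
Model.1 is nested in Model.2 (it is Model.2 with the intercept of post_verbal
fixed to 0), so an optional way to check the comparison part of Question 4 is
a formal chi-square difference test; from the statistics reported above, the
difference should be 43.384 - 14.272 = 29.112 on 11 - 10 = 1 df:

# likelihood ratio (chi-square difference) test of the two nested models
anova(fit.1, fit.2)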

24.3.1 Step 5. Interpreting the effects of practice and training on latent ability score

Now that we have a well-fitting model, let's focus on the 'Intercepts' output to
interpret the findings. We can see that the latent mean changed for the Control
group as well as for the Experimental group, but the change for the Experimental
group is greater. Presumably this increase over and above the practice effect
was due to the training programme.
QUESTION 5. What is the (unstandardized) intercept of post_verbal in
the Control group (practice effect)? Is the effect in expected direction? How
will you interpret its size in relation to the unit of measurement of the verbal
factor?
QUESTION 6. What is the (unstandardized) intercept of post_verbal in
the Experimental group (training effect)? Is the effect in expected direction?
How will you interpret its size in relation to the unit of measurement of the
verbal factor?
QUESTION 7. What is the standardized intercept of post_verbal in the
Control group (standardized practice effect)? Is this a large or a small effect?
QUESTION 8. What is the standardized intercept of post_verbal in the
Experimental group (standardized training effect)? Is this a large or a small
effect?
Overall, we conclude that there was a positive effect of retaking the tests on
performance in the Control group, with an effect size of approximately 0.32 stan-
dard deviations of the Verbal factor score. There was a larger positive effect of
the training intervention plus the potential effect of retaking the tests in the
Experimental group, with an effect size of approximately 0.79 standard deviations
of the Verbal factor score.
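
If you prefer to extract these standardized effects programmatically rather
than reading them off the Std.all column of the summary, one possible sketch is:

# standardized intercepts of post_verbal (column est.std);
# group 1 = Control (practice effect), group 2 = Experimental (training effect)
std <- standardizedSolution(fit.2)
subset(std, op == "~1" & lhs == "post_verbal")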

Step 6. Saving your work

After you have finished working with this exercise, save your R script by pressing
the Save icon in the script window, and giving the script a meaningful name, for
example "Olsson study of test performance change". To keep all of the created
objects, which might be useful when revisiting this exercise, save your entire
'work space' when closing the project. Press File / Close project, and select
Save when prompted to save your 'Workspace image' (with extension .RData).

24.4 Solutions
Q1. The means for post-tests increase for both groups. However, the increases
seem to be larger for the Experimental group (from 20.556 to 25.667 for
synonyms, and from 21.241 to 25.870 for opposites - so by about 5 points
each, while the increase is only 1 or 2 points for the Control group). This might
be because the Experimental group received performance training.
Q2. CFI = 0.952, which is greater than 0.95 and indicates good fit. RMSEA
= 0.166, which is greater than 0.08 (the threshold for adequate fit). Finally,
SRMR = 0.062, which is below the threshold for adequate fit (0.08). So the
model fits well according to the CFI and SRMR, but not according to the RMSEA.
Q3. The two largest modification indices (MI) can be found in the intercept
('~1') statements:

lhs op rhs block group level mi epc sepc.lv sepc.all sepc.nox
19 post_verbal ~1 1 1 1 24.972 1.631 0.312 0.312 0.312
18 pre_verbal ~1 1 1 1 24.972 -12.962 -2.292 -2.292 -2.292

They both belong to group 1 (Control), are both equal, and both tell the same
story. They say that if you free EITHER the mean of pre_verbal OR the intercept
of post_verbal (remember that they were both set to 0 in Model.1?), the
chi-square will fall by 24.972. It seems that Hypothesis 1, that the verbal test
performance stays equal across time in the Control group, is not supported. Of
course, it is reasonable to keep the mean of pre_verbal fixed at 0 and release
the intercept of post_verbal, allowing change across time.
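
As an aside, the alternative route mentioned above (freeing the mean of
pre_verbal instead) could be sketched in the same way as Model.2 was built.
This is hypothetical and not the route taken in the exercise; Model.2b is a
made-up name:

# hypothetical alternative (NOT used in the exercise): free the latent mean
# of pre_verbal instead of the intercept of post_verbal
Model.2b <- paste(Model.1, '
pre_verbal ~ NA*1 ')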
Q4. Chi-square = 14.272, Degrees of freedom = 10, P-value = 0.161. The
chi-square test is not significant, and we accept the model with the free intercept
of post_verbal in the Control group. The degrees of freedom for Model.2
are 10, and the degrees of freedom for Model.1 were 11. The difference, 1 df,
corresponds to the additional intercept parameter we introduced. By adding 1
parameter, we reduced the df by 1.
Q5. The (unstandardized) intercept of post_verbal in the Control group is
1.684. The effect is positive, as expected. This is on the same scale as the
synonyms subtest, since the verbal factor borrowed its unit of measurement
from this subtest by fixing its factor loading to 1.
Q6. The (unstandardized) intercept of post_verbal in the Experimental
group is 5.428. The effect is positive, as expected. This is again on the same
scale as the synonyms subtest.
Q7. The standardized intercept of post_verbal in the Control group is 0.319.
This is a small effect.
Q8. The standardized intercept of post_verbal in the Experimental group is
0.797. This is a large effect.
References

Arbuckle, J. L. (2016). IBM SPSS Amos 24: User's Guide. Amos Development Corporation.
Barrett, P. T., Petrides, K. V., Eysenck, S. B. G., & Eysenck, H. J. (1998). The Eysenck Personality Questionnaire: An examination of the factorial similarity of P, E, N, and L across 34 countries. Personality and Individual Differences, 25(5), 805-819. https://doi.org/10.1016/S0191-8869(98)00026-9
Brown, A., Ford, T., Deighton, J., & Wolpert, M. (2014). Satisfaction in child and adolescent mental health services: Translating users' feedback into measurement. Administration and Policy in Mental Health and Mental Health Services Research, 41(4), 434-446. https://doi.org/10.1007/s10488-012-0433-9
Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.
Eysenck, S. B. G., & Eysenck, H. J. (1976). Personality and mental illness. Psychological Reports, 39, 1011-1022.
Feist, G. J., Bodner, T. E., Jacobs, J. F., Miles, M., & Tan, V. (1995). Integrating top-down and bottom-up structural models of subjective well-being: A longitudinal investigation. Journal of Personality and Social Psychology, 68(1), 138-150. https://doi.org/10.1037/0022-3514.68.1.138
Goldberg, D. (1978). Manual of the General Health Questionnaire. Windsor: NFER-Nelson.
Holzinger, K., & Swineford, F. (1939). A study in factor analysis: The stability of a bifactor solution. Supplementary Educational Monograph, no. 48. Chicago: University of Chicago Press.
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329-349. https://doi.org/10.1207/S15324818AME1404_2
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183-202.
McDonald, R. P. (2013). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum.
Nishisato, S. (1994). Elements of dual scaling: An introduction to practical data analysis. Hillsdale, NJ: Erlbaum.
Olsson, S. (1973). An experimental study of the effects of training on test scores and factor structure. Uppsala, Sweden: University of Uppsala, Department of Education.
Osborne, R. T., & Suddick, D. E. (1972). A longitudinal investigation of the intellectual differentiation hypothesis. The Journal of Genetic Psychology, 121(1), 83-89.
Thurstone, L. L. (1947). Multiple factor analysis. Chicago: University of Chicago Press.
Acknowledgments

The style I adopted in this book has been inspired by my co-teachers and dear
friends at Cambridge, Tim Croudace, Jon Heron, Jan Boehnke and Jan Stochl,
and perfected over the years thanks to feedback from my students at Cambridge
and Kent, and Graduate Teaching Assistants at Kent (most notably Hirotaka
Imada, Bjarki Gronfeldt Gunnarsson, Daqing Liu, Chole Bates and Rebecca
McNeill). I am very grateful to all of them.
Data used in this book are either open access from articles, R packages or other
software resources (sources are always acknowledged), or were shared with me by
colleagues and collaborators, or are from research projects I have been involved
in during my time at SHL, Anna Freud Centre, University of Cambridge and
University of Kent. I tried my best to acknowledge all sources appropriately;
however, please let me know if you spot any omissions.
Unless acknowledged within chapters, all contents are my own. I did not have
any assistance in writing this book; if anything, many people have tried to stop
me from finishing it by loading me with other work and commitments. I thank
them for helping to build my character.

About the author

Anna Brown is a psychometrician with an established reputation and extensive
industry experience. She is currently Professor of Psychometrics at the School
of Psychology of the University of Kent. Previously, she taught short courses in
Applied Psychometrics at the University of Cambridge. Her experience outside
of academia included research and test development at the Research Division of
the UK's largest occupational test publisher, SHL Group, where she held senior
and principal psychometrician positions for many years.
Anna completed a 5-year Master’s program in Mathematics with distinction and
a Ph.D. in Psychology with distinction. Her Ph.D. research led to the devel-
opment of the Thurstonian Item Response Theory (TIRT) model described as
a breakthrough in scoring of ‘ipsative’ (relative-to-self) questionnaires, and re-
ceived the prestigious “Best Dissertation” award from the Psychometric Society.
Further advances in this methodology made by Professor Brown enabled the de-
velopment of novel personality assessment tools that are resistant to response
biases and ‘faking good’ by candidates. Multiple world-leading test publishers
and organisations, including SHL, Korn Ferry, and many others, have incorporated
this methodology into millions of individual psychological assessments that have
taken place in at least 37 different languages, across 40 different countries, since
2010.
Anna’s research focuses on psychological measurement and testing, particularly
issues in test validity and test fairness. She specialises in modelling response bi-
ases and faking, scaling of comparative data, measurement invariance and other
measurement models using IRT and SEM frameworks more broadly. She has
published widely on these and other topics, including peer-reviewed articles in
top journals such as Psychological Methods, Psychological Science, Psychome-
trika and others; book chapters; psychometric tests and test manuals.
Professor Brown has extensive experience in designing, developing and imple-
menting psychometric testing solutions in the workplace, health settings and
education, and provides psychometric advice to several organisations in the pri-
vate and public sectors internationally. She served as an elected member on
the Council of the International Test Commission (ITC) for 8 years, chairing its
Research and Guidelines Committee. In this role, Anna facilitated publication
and translation of ITC guidelines aiming to improve testing standards interna-
tionally. She also serves as a member of the Editorial Board of the International
Journal of Testing, and as an ad-hoc reviewer for countless journals in the field
of psychological testing.
