
2

An Introduction to Multilevel Data Structure

Nested Data and Cluster Sampling Designs


In Chapter 1, we considered the standard linear model that underlies such
common statistical methods as regression and analysis of variance (ANOVA;
i.e. the general linear model). As we noted, this model rests on several pri-
mary assumptions regarding the nature of the data in the population. Of
particular importance in the context of multilevel modeling is the assump-
tion of independently distributed error terms for the individual observa-
tions within the sample. This assumption essentially means that there are no
relationships among individuals in the sample for the dependent variable,
once the independent variables in the analysis are accounted for. In the example
we described in Chapter 1, this assumption was indeed met, as the indi-
viduals in the sample were selected randomly from the general population.
Therefore, there was nothing linking their dependent variable values other
than the independent variables included in the linear model. However, in
many cases the method used for selecting the sample does create correlated
responses among individuals. For example, a researcher interested
in the impact of a new teaching method on student achievement might ran-
domly select schools for placement in either a treatment or control group.
If school A is placed into the treatment condition, all students within the
school will also be in the treatment condition – this is a cluster randomized
design, in that the clusters and not the individuals are assigned to a specific
group. Furthermore, it would be reasonable to assume that the school itself,
above and beyond the treatment condition, would have an impact on the per-
formance of the students. This impact would manifest itself as correlations
in achievement test scores among individuals attending that school. Thus, if
we were to use a simple one-way ANOVA to compare the achievement test
means for the treatment and control groups with such cluster sampled data,
we would likely be violating the assumption of independent errors because
a factor beyond treatment condition (in this case the school) would have an
additional impact on the outcome variable.
We typically refer to the data structure described above as nested,
meaning that individual data points at one level (e.g. student) appear in

24 Multilevel Modeling Using R

only one level of a higher-level variable such as school. Thus, students are
nested within school. Such designs can be contrasted with a crossed data
structure whereby individuals at the first level appear in multiple levels of
the second variable. In our example, students might be crossed with after-
school organizations if they are allowed to participate in more than one.
For example, a given student might be on the basketball team as well as in
the band. The focus of this book is almost exclusively on nested designs,
which give rise to multilevel data. Other examples of nested designs might
include a survey of job satisfaction for employees from multiple depart-
ments within a large business organization. In this case, each employee
works within only a single division in the company, which leads to a
nested design. Furthermore, it seems reasonable to assume that employ-
ees working within the same division will have correlated responses on
the satisfaction survey, as much of their view regarding the job would
be based exclusively upon experiences within their division. For a third
such example, consider the situation in which clients of several psycho-
therapists working in a clinic are asked to rate the quality of each of their
therapy sessions. In this instance, there exist three levels in the data: time,
in the form of individual therapy session, client, and therapist. Thus, ses-
sion is nested in client, who in turn is nested within therapist. This data
structure would be expected to lead to correlated scores on a therapy-rating
instrument.

Intraclass Correlation
In cases where individuals are clustered or nested within a higher-level unit
(e.g. classrooms, schools, school districts), it is possible to estimate the cor-
relation among individuals’ scores within the cluster/nested structure using
the intraclass correlation (denoted ρI in the population). The ρI is a measure
of the proportion of variation in the outcome variable that occurs between
groups versus the total variation present; it ranges from 0 (no variance
between clusters) to 1 (all variance is between clusters, with no within-cluster
variance). ρI can also be conceptualized as the correlation on the dependent
measure for two individuals randomly selected from the same cluster. It can be
expressed as

ρI = τ² / (τ² + σ²)    (2.1)

where
τ² = Population variance between clusters
σ² = Population variance within clusters

Higher values of ρI indicate that a greater share of the total variation in
the outcome measure is associated with cluster membership; i.e. there is a
relatively strong relationship among the scores for two individuals from the
same cluster. Another way to frame this issue is that individuals within the
same cluster (e.g. school) are more alike on the measured variable than they
are like those in other clusters.
It is possible to estimate τ2 and σ2 using sample data, and thus it is also pos-
sible to estimate ρI. Those familiar with ANOVA will recognize these esti-
mates as being related (though not identical) to the sum of squared terms.
The sample estimate for variation within clusters is simply

σ̂² = Σ_{j=1}^{C} (nj − 1)Sj² / (N − C)    (2.2)

where

Sj² = variance within cluster j = Σ_{i=1}^{nj} (yij − ȳj)² / (nj − 1)
nj = sample size for cluster j
N = total sample size
C = total number of clusters

In other words, σ² is simply the weighted average of the within-cluster
variances.
Estimation of τ2 involves a few more steps, but it is not much more complex
than what we have seen for σ2. In order to obtain the sample estimate for
variation between clusters, τ̂², we must first calculate the weighted between-cluster variance.

SB² = Σ_{j=1}^{C} nj(ȳj − ȳ)² / [ñ(C − 1)]    (2.3)

where
ȳj = mean on response variable for cluster j
ȳ = overall mean on response variable

ñ = (1/(C − 1))[N − (Σ_{j=1}^{C} nj²)/N]

We cannot use SB² as a direct estimate of τ² because it is impacted by the
random variation among subjects within the same clusters. Therefore, in order
to remove this random fluctuation, we will estimate the population between-
cluster variance as

τ̂² = SB² − σ̂²/ñ.    (2.4)

Using these variance estimates, we can in turn calculate the sample estimate
of ρI:

ρ̂I = τ̂² / (τ̂² + σ̂²).    (2.5)

Note that Equation (2.5) assumes that the clusters are of equal size. Clearly,
such will not always be the case, in which case this equation will not hold.
However, the purpose for its inclusion here is to demonstrate the principle
underlying the estimation of ρI, which holds even as the equation might
change.
In order to illustrate estimation of ρI, let us consider the following dataset.
Achievement test data were collected from 10,903 third-grade examinees
nested within 160 schools. School sizes range from 11 to 143, with a mean
size of 68.14. In this case, we will focus on the reading achievement test
score, and will use data from only five of the schools, in order to make the
calculations by hand easy to follow. First, we will estimate σ̂². To do so,
we must estimate the variance in scores within each school. These values
appear in Table 2.1.

TABLE 2.1
School Size, Mean, and Variance of Reading Achievement Test
School N Mean Variance
767 58 3.952 5.298
785 29 3.331 1.524
789 64 4.363 2.957
815 39 4.500 6.088
981 88 4.236 3.362
Total 278 4.149 3.916

Using these variances and sample sizes, we can calculate σ̂² as

σ̂² = [(58 − 1)5.3 + (29 − 1)1.5 + (64 − 1)2.9 + (39 − 1)6.1 + (88 − 1)3.4] / (278 − 5)
   = (302.1 + 42 + 182.7 + 231.8 + 295.8) / 273 = 1054.4 / 273 = 3.9

The school means, which are needed in order to calculate SB², appear in
Table 2.1 as well. First, we must calculate ñ:

ñ = (1/(C − 1))[N − (Σ nj²)/N] = (1/(5 − 1))[278 − (58² + 29² + 64² + 39² + 88²)/278]
  = (1/4)(278 − 63.2) = 53.7

TABLE 2.2
Between Subjects Intercept and Slope, and within Subjects Variation on
These Parameters by School
School Intercept U0j Slope U1j
1 1.230 −1.129 0.552 0.177
2 2.673 0.314 0.199 −0.176
3 2.707 0.348 0.376 0.001
4 2.867 0.508 0.336 −0.039
5 2.319 −0.040 0.411 0.036
Overall 2.359 0.375

Using this value, we can then calculate SB² for the five schools in our small
sample using Equation (2.3):

SB² = [58(3.952 − 4.149)² + 29(3.331 − 4.149)² + 64(4.363 − 4.149)² + 39(4.500 − 4.149)² + 88(4.236 − 4.149)²] / [53.7(5 − 1)]
    = (2.251 + 19.405 + 2.931 + 4.805 + 0.666) / 214.8 = 30.057 / 214.8 = 0.140

We can now estimate the population between-cluster variance, τ², using
Equation (2.4):

τ̂² = 0.140 − 3.9/53.7 = 0.140 − 0.073 = 0.067

We have now calculated all of the parts that we need to estimate ρI for the
population,

ρ̂I = 0.067 / (0.067 + 3.9) = 0.017

This result indicates that there is very little correlation of examinees’ test
scores within the schools. We can also interpret this value as the proportion
of variation in the test scores that is accounted for by the schools.
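The hand calculations above can be verified programmatically. The book's examples use R; the following Python sketch is purely illustrative and works from the Table 2.1 summary statistics at full precision, which is why its σ̂² of roughly 3.86 differs slightly from the rounded value of 3.9 used in the hand computation.

```python
# ICC from cluster summary statistics (Table 2.1).
# Illustrative Python sketch, not code from the book (which uses R).
sizes = [58, 29, 64, 39, 88]                      # n_j
means = [3.952, 3.331, 4.363, 4.500, 4.236]       # cluster means on reading
variances = [5.298, 1.524, 2.957, 6.088, 3.362]   # within-cluster variances

N, C = sum(sizes), len(sizes)
grand_mean = sum(n * m for n, m in zip(sizes, means)) / N

# sigma-hat^2: weighted average of within-cluster variances (Equation 2.2)
sigma2 = sum((n - 1) * s2 for n, s2 in zip(sizes, variances)) / (N - C)

# n-tilde: adjusted average cluster size
n_tilde = (N - sum(n ** 2 for n in sizes) / N) / (C - 1)

# S_B^2: weighted between-cluster variance (Equation 2.3)
sb2 = sum(n * (m - grand_mean) ** 2
          for n, m in zip(sizes, means)) / (n_tilde * (C - 1))

# tau-hat^2: between-cluster variance with sampling noise removed (Equation 2.4)
tau2 = sb2 - sigma2 / n_tilde

# Intraclass correlation (Equation 2.5)
icc = tau2 / (tau2 + sigma2)
print(round(sigma2, 3), round(n_tilde, 1), round(sb2, 3), round(icc, 3))
```

At full precision the estimates are τ̂² ≈ 0.068 and ρ̂I ≈ 0.017, in close agreement with the rounded hand calculations.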
Given that ρ̂I is a sample estimate, we know that it is subject to sampling
variation, which can be estimated with a standard error as in Equation (2.6):

sρI = (1 − ρI)(1 + (n − 1)ρI) √[2 / (n(n − 1)(N − 1))].    (2.6)

The terms in 2.6 are as defined previously (with n the common cluster size),
and the assumption is that all clusters are of equal size. As noted earlier in the chapter, this latter
condition is not a requirement, however, and an alternative formulation
exists for cases in which it does not hold. However, 2.6 provides suffi-
cient insight for our purposes into the estimation of the standard error
of the ICC.
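Equation (2.6) is simple to compute. The Python sketch below (illustrative only) plugs in the estimates from the worked example, using ñ = 53.7 as an approximate common cluster size even though the formula strictly assumes clusters of equal size:

```python
# Standard error of the ICC (Equation 2.6), which assumes equal cluster sizes.
# Illustrative sketch: n-tilde = 53.7 from the unequal-cluster example is
# used here as an approximate common cluster size.
from math import sqrt

def icc_standard_error(rho, n, N):
    """SE of the ICC for clusters of common size n; N = total sample size."""
    return (1 - rho) * (1 + (n - 1) * rho) * sqrt(2 / (n * (n - 1) * (N - 1)))

se = icc_standard_error(rho=0.017, n=53.7, N=278)
print(round(se, 4))
```

The resulting standard error is quite small here, reflecting the large total sample relative to the tiny estimated ICC.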
The ICC is an important tool in multilevel modeling, in large part because
it is an indicator of the degree to which the multilevel data structure might
impact the outcome variable of interest. Larger values of the ICC are indica-
tive of a greater impact of clustering. Thus, as the ICC increases in value, we
must be more cognizant of employing multilevel modeling strategies in our
data analysis. In the next section, we will discuss the problems associated
with ignoring this multilevel structure, before we turn our attention to meth-
ods for dealing with it directly.

Pitfalls of Ignoring Multilevel Data Structure


When researchers apply standard statistical methods to multilevel data, such
as the regression model described in Chapter 1, the assumption of indepen-
dent errors is violated. For example, if we have achievement test scores from
a sample of students who attend several different schools, it would be rea-
sonable to believe that those attending the same school will have scores that
are more highly correlated with one another than they are with scores from
students attending other schools. This within-school correlation would be
due, for example, to having a common set of teachers, a common teaching
curriculum, a single set of administrative policies, and a common commu-
nity, among numerous other reasons. The within-school correlation will in
turn result in an inappropriate estimate of the standard errors for the model
parameters, which will lead to errors of statistical inference, such as p-values
that are smaller than they should be, and thus rejection of null hypotheses for
the parameters at a rate above the stated Type I error rate. Recalling
our discussion in Chapter 1, the test statistic for the null hypothesis of no
relationship between the independent and dependent variable is simply the
regression coefficient divided by the standard error. If the standard error is
underestimated, this will lead to an overestimation of the test statistic, and
therefore statistical significance for the parameter in cases where it should
not be; i.e. a Type I error at a higher rate than specified. Indeed, the underes-
timation of the standard error will occur unless τ2 is equal to 0.
In addition to the underestimation of the standard error, another problem
with ignoring the multilevel structure of data is that we may miss important
relationships involving each level in the data. Recall that in our example,
there are two levels of sampling: students (level 1) are nested in schools
(level 2). Specifically, by not including information about the school, for
example, we may well miss important variables at the school level that help
to explain performance at the examinee level. Therefore, beyond the known
problem with misestimating standard errors, we also proffer an incorrect
model for understanding the outcome variable of interest. In the context of
MLMs, inclusion of variables at each level is relatively simple, as are interac-
tions among variables at different levels. This greater model complexity in
turn may lead to greater understanding of the phenomenon under study.

Multilevel Linear Models


In the following section we will review some of the core ideas that underlie
multilevel linear models (MLM). Our goal is to familiarize the reader with
terms that will repeat themselves throughout the book, and to do so in a
relatively nontechnical fashion. We will first focus on the difference between
random and fixed effects, after which we will discuss the basics of param-
eter estimation, focusing on the two most commonly used methods, maxi-
mum likelihood and restricted maximum likelihood, and conclude with a
review of assumptions underlying MLMs, and an overview of how they are
most frequently used, with examples. In this section, we will also address
the issue of centering, and explain why it is an important concept in MLM.
After reading the rest of this chapter, the reader will have sufficient techni-
cal background on MLMs to begin using the R software package for fitting
MLMs of various types.

Random Intercept
As we transition from the one-level regression framework of Chapter 1 to
the MLM context, let’s first revisit the basic simple linear regression model
of Equation (1.1), y = β0 + β1x + ε. Here, the dependent variable y is expressed
as a function of an independent variable, x, multiplied by a slope coeffi-
cient, β1, an intercept, β0, and random variation from subject to subject, ε.
We defined the intercept as the conditional mean of y when the value of x
is 0. In the context of a single-level regression model such as this, there is
one intercept that is common to all individuals in the population of inter-
est. However, when individuals are clustered together in some fashion (e.g.
within classrooms, schools, organizational units within a company), there
will potentially be a separate intercept for each of these clusters; that is,
there may be different means for the dependent variable for x = 0 across the
different clusters. We say potentially here because if there is in fact no cluster
effect, then the single intercept model of 1.1 will suffice. In practice, assess-
ing whether there are different means across the clusters is an empirical
question, which we describe below. It should also be noted that in this dis-
cussion we are considering only the case where the intercept is cluster spe-
cific, but it is also possible for β1 to vary by group, or even other coefficients
from more complicated models.
Allowing for group-specific intercepts and slopes leads to the following
notation commonly used for the level 1 (micro level) model in multilevel
modeling:

yij = β0j + β1j xij + εij    (2.7)

where the subscripts ij refer to the ith individual in the jth cluster. As we con-
tinue our discussion of multilevel modeling notation and structure, we will
begin with the most basic multilevel model: predicting the outcome from just
an intercept which we will allow to vary randomly for each group.

yij = β0j + εij.    (2.8)

Allowing the intercept to differ across clusters, as in Equation (2.8), leads to
the random intercept, which we express as

β0j = γ00 + U0j.    (2.9)

In this framework, γ00 represents an average or general intercept value that
holds across clusters, whereas U0j is a group-specific effect on the intercept.
We can think of γ00 as a fixed effect because it remains constant across all
clusters, and U0j is a random effect because it varies from cluster to cluster.
Therefore, for an MLM we are interested not only in some general mean value
for y when x is 0 for all individuals in the population (γ00), but also in the
deviation between the overall mean and the cluster-specific effects for the
intercept (U0j). If we go on to assume that the clusters are a random sample
from the population of all such clusters, then we can treat the U0j as a kind
of residual effect on yij, very similar to how we think of ε. In that case, U0j is
assumed to be drawn randomly from a population with a mean of 0 (recall
U0j is a deviation from the fixed effect) and a variance, τ². Furthermore, we
assume that U0j and ε, which has variance σ², are uncorrelated.
discussed τ² and its role in calculating ρ̂I. In addition, τ² can also be viewed as
the impact of the cluster on the dependent variable, and therefore testing it for
statistical significance is equivalent to testing the null hypothesis that cluster
(e.g. school) has no impact on the dependent variable. If we substitute the two
components of the random intercept into the regression model, we get

y = γ00 + U0j + β1x + ε.    (2.10)

Equation (2.10) is termed the full or composite model in which the multiple
levels are combined into a unified equation.
Often in MLM, we begin our analysis of a dataset with this simple random
intercept model, known as the null model, which takes the form

yij = γ00 + U0j + εij.    (2.11)

While the null model does not provide information regarding the impact
of specific independent variables on the dependent, it does yield important
information regarding how variation in y is partitioned between variance
among the individuals σ2 and variance among the clusters τ2. The total vari-
ance of y is simply the sum of σ2 and τ2. In addition, as we have already seen,
these values can be used to estimate ρI. The null model, as will be seen in
later sections, is also used as a baseline for model building and comparison.
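To see how the null model of Equation (2.11) partitions variance, consider a small simulation. The following Python sketch (the book's examples use R; all parameter values here are arbitrary) generates clustered data with known τ² and σ² and recovers the ICC with simple moment estimates:

```python
# Simulating the null model y_ij = gamma_00 + U_0j + e_ij and recovering
# the variance partition. A didactic sketch, not code from the book.
import random
import statistics

random.seed(42)
gamma_00, tau2, sigma2 = 10.0, 4.0, 4.0   # true ICC = 4 / (4 + 4) = 0.5
J, n = 200, 50                            # 200 clusters of 50 individuals

clusters = []
for _ in range(J):
    u0 = random.gauss(0, tau2 ** 0.5)     # cluster-specific effect U_0j
    clusters.append([gamma_00 + u0 + random.gauss(0, sigma2 ** 0.5)
                     for _ in range(n)])

# Moment (ANOVA-type) estimates, exploiting equal cluster sizes:
sigma2_hat = statistics.mean(statistics.variance(c) for c in clusters)
cluster_means = [statistics.mean(c) for c in clusters]
# Var of cluster means = tau2 + sigma2/n, so remove the sampling part.
tau2_hat = statistics.variance(cluster_means) - sigma2_hat / n
icc_hat = tau2_hat / (tau2_hat + sigma2_hat)
print(round(icc_hat, 2))
```

With many clusters, the recovered ICC lands close to the true value of 0.5; in real analyses these components are estimated by maximum likelihood rather than by moments, as discussed later in the chapter.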

Random Slopes
It is a simple matter to expand the random intercept model in 2.9 to accom-
modate one or more independent predictor variables. As an example, if we

add a single predictor (xij) at the individual level (level 1) to the model, we
obtain

yij = γ00 + γ10xij + U0j + εij.    (2.12)

This model can also be expressed in two separate levels as:


Level 1:

yij = β0j + β1j xij + εij    (2.13)

Level 2:

β0j = γ00 + U0j    (2.14)

β1j = γ10    (2.15)

This model now includes the predictor and the slope relating it to the depen-
dent variable, γ10, which we acknowledge as being at level 1 by the subscript
10. We interpret γ10 in the same way that we did β1 in the linear regression
model; i.e. a measure of the impact on y of a 1-unit change in x. In addition,
we can estimate ρI exactly as before, though now it reflects the correlation
between individuals from the same cluster after controlling for the indepen-
dent variable, x. In this model, both γ10 and γ00 are fixed effects, while σ2 and
τ2 remain random.
One implication of the model in 2.12 is that the dependent variable is
impacted by variation among individuals (σ2), variation among clusters (τ2),
an overall mean common to all clusters (γ00), and the impact of the indepen-
dent variable as measured by γ10, which is also common to all clusters. In
practice there is no reason that the impact of x on y would need to be com-
mon for all clusters, however. In other words, it is entirely possible that rather
than a single γ10 common to all clusters, there is actually a unique effect for
the cluster of γ10 + U1j, where γ10 is the average relationship of x with y across
clusters, and U1j is the cluster-specific variation of the relationship between
the two variables. This cluster-specific effect is assumed to have a mean of 0
and to vary randomly around γ10. The random slopes model is

yij = γ00 + γ10xij + U0j + U1jxij + εij.    (2.16)

Written in this way, we have separated the model into its fixed (γ00 + γ10xij)
and random (U0j + U1jxij + εij) components. Model 2.16 simply states that
there is an interaction between cluster and x, such that the relationship of x
and y is not constant across clusters.
Heretofore we have discussed only one source of between-group varia-
tion, which we have expressed as τ2, and which is the variation among
clusters in the intercept. However, Model 2.16 adds a second such source of

between-group variance in the form of U1j, which is cluster variation on the
slope relating the independent and dependent variables. In order to differen-
tiate between these two sources of between-group variance, we now denote
the variance of U0j as τ0² and the variance of U1j as τ1². Furthermore, within
clusters, we expect U1j and U0j to have a covariance of τ01. However, across
different clusters, these terms should be independent of one another, and in
all cases it is assumed that ε remains independent of all other model terms.
In practice, if we find that τ1² is not 0, we must be careful in describing the
relationship between the independent and dependent variables, as it is not
the same for all clusters. We will revisit this idea in subsequent chapters. For
the moment, however, it is most important to recognize that variation in the
dependent variable, y, can be explained by several sources, some fixed and
others random. In practice, we will most likely be interested in estimating all
of these sources of variability in a single model.
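To make the fixed/random decomposition of Equation 2.16 concrete, the following Python sketch generates data from the random-slopes model; all parameter values are arbitrary choices for illustration, not estimates from the book's data:

```python
# Generating data under the random-slopes model (Equation 2.16):
#   y_ij = gamma_00 + gamma_10 * x_ij + U_0j + U_1j * x_ij + e_ij
# Didactic sketch; parameter values are arbitrary.
import random

random.seed(1)
gamma_00, gamma_10 = 2.0, 0.5       # fixed effects, common to all clusters
tau0, tau1, sigma = 0.7, 0.2, 1.0   # SDs of U_0j, U_1j, and epsilon

def simulate_cluster(n):
    u0 = random.gauss(0, tau0)      # cluster-specific intercept deviation
    u1 = random.gauss(0, tau1)      # cluster-specific slope deviation
    data = []
    for _ in range(n):
        x = random.gauss(0, 1)
        y = gamma_00 + gamma_10 * x + u0 + u1 * x + random.gauss(0, sigma)
        data.append((x, y))
    return data

sample = [simulate_cluster(30) for _ in range(5)]
print(len(sample), len(sample[0]))
```

Note that u0 and u1 are drawn once per cluster while x and the residual are drawn once per individual; this is exactly the sense in which the relationship of x and y varies from cluster to cluster.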
As a means for further understanding the MLM, let’s consider a simple
example using the five schools described above. In this context, we are inter-
ested in treating a reading achievement test score as the dependent vari-
able and a vocabulary achievement test score as the independent variable.
Remember that students are nested within schools so that a simple regres-
sion analysis will not be appropriate. In order to understand what is being
estimated in the context of MLM, we can obtain separate intercept and slope
estimates for each school, which appear in Table 2.2.
Averaging the five school-specific estimates (weighting each school equally),
the estimate of γ00, the average intercept value, is 2.359, and the estimate of
the average slope value, γ10, is 0.375. Notice that for both parameters, the school values deviate from
these means. For example, the intercept for school 1 is 1.230. The difference
between this value and 2.359, −1.129, is U0j for that school. Similarly, the differ-
ence between the average slope value of 0.375 and the slope for school 1, 0.552,
is 0.177, which is U1j for this school. Table 2.2 includes U0j and U1j values for
each of the schools. The differences in slopes also provide information regard-
ing the relationship between vocabulary and reading test scores. For all of
the schools this relationship was positive, meaning that students who scored
higher on vocabulary also scored higher on reading. However, the strength
of this relationship was weaker for school 2 than for school 1, as an example.
Given the values in Table 2.2, it is also possible to estimate the variances
associated with U1j and U0j, τ1² and τ0², respectively. Again, weighting each
school equally makes the calculation of these variances a straightforward
matter, using

Σ_{j=1}^{J} (U1j − Ū1)² / (J − 1)    (2.17)
for the slopes, and an analogous equation for the intercept random variance.
Doing so, we obtain τ̂0² = 0.439 and τ̂1² = 0.016. In other words, much more

of the variance in the dependent variable is accounted for by variation in the
intercepts at the school level than is accounted for by variation in the slopes.
Another way to think of this result is that the schools exhibited greater dif-
ferences among one another in the mean level of achievement as compared
to differences in the impact of x on y.
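These school-level variance estimates can be checked directly from the U0j and U1j columns of Table 2.2. A small Python sketch (illustrative only; the book's own examples use R):

```python
# Variances of the school-specific deviations from Table 2.2 (Equation 2.17).
import statistics

u0 = [-1.129, 0.314, 0.348, 0.508, -0.040]   # intercept deviations U_0j
u1 = [0.177, -0.176, 0.001, -0.039, 0.036]   # slope deviations U_1j

tau0_sq = statistics.variance(u0)   # sum((U - mean(U))^2) / (J - 1)
tau1_sq = statistics.variance(u1)
print(round(tau0_sq, 3), round(tau1_sq, 3))
```

The sample-variance call implements Equation (2.17) directly, since the deviations already have a mean of (essentially) zero.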
The actual practice of obtaining these variance estimates using the R
environment for statistical computing and graphics and interpreting their
meaning are subjects for the coming chapters. Before discussing the practi-
cal nuts and bolts of conducting this analysis, we will first examine the basics
for how parameters are estimated in the MLM framework using maximum
likelihood and restricted maximum likelihood algorithms. While similar
in spirit to the simple calculations demonstrated above, they are different
in practice and will yield somewhat different results than those we would
obtain using least squares, as above. Prior to this discussion, however, there
is one more issue that warrants our attention as we consider the practice of
MLM, namely variable centering.

Centering
Centering simply refers to the practice of subtracting the mean of a variable
from each individual value. This implies that the sample mean of the centered
variable is 0, and that each individual’s (centered) score represents a
deviation from the mean, rather than whatever meaning its raw
value might have. In the context of regression, centering is commonly used,
for example, to reduce collinearity caused by including an interaction term
in a regression model. If the raw scores of the independent variables are used
to calculate the interaction, and then both the main effects and interaction
terms are included in the subsequent analysis, it is very likely that collin-
earity will cause problems in the standard errors of the model parameters.
Centering is a way to help avoid such problems (e.g. Iversen, 1991). Such
issues are also important to consider in MLMs, in which interactions are
frequently employed. In addition, centering is also a useful tool for avoid-
ing collinearity caused by highly correlated random intercepts and slopes
in MLMs (Wooldridge, 2004). Finally, centering provides a potential advan-
tage in terms of interpretation of results. Remember from our discussion in
Chapter 1 that the intercept is the value of the dependent variable when the
independent variable is set equal to 0. In many applications the indepen-
dent variable cannot reasonably be 0 (e.g. a measure of vocabulary), however,
which essentially renders the intercept as a necessary value for fitting the
regression line but not one that has a readily interpretable value. However,
when x has been centered, the intercept takes on the value of the dependent
variable when the independent is at its mean. This is a much more useful
interpretation for researchers in many situations, and yet another reason
why centering is an important aspect of modeling, particularly in the mul-
tilevel context.

Probably the most common approach to centering is to calculate the difference
between each individual’s score and the overall, or grand, mean across
the entire sample. This grand mean centering is certainly the most commonly
used in practice (Bickel, 2007). It is not, however, the only manner in which
data can be centered. An alternative approach, known as group mean cen-
tering, is to calculate the difference between each individual score and the
mean of the cluster to which they belong. In our school example, grand mean
centering would involve calculating the difference between each score and
the overall mean across schools, while group mean centering would lead the
researcher to calculate the difference between each score and the mean for
their school. While there is some disagreement in the literature regarding
which approach might be best at reducing the harmful effects of collinearity
(Bryk and Raudenbush, 2002; Snijders and Bosker, 1999), researchers have
demonstrated that in most cases either will work well in this regard (Kreft,
de Leeuw, and Aiken, 1995). Therefore, the choice of which approach to use
must be made on substantive grounds regarding the nature of the relation-
ship between x and y. By using grand mean centering, we are implicitly com-
paring individuals to one another (in the form of the overall mean) across
the entire sample. On the other hand, when using group mean centering,
we are placing each individual in their relative position on x within their
cluster. Thus, in our school example, using the group mean centered values
of vocabulary in the analysis would mean that we are investigating the rela-
tionship between one’s relative vocabulary score in their school and their
reading score. In contrast, the use of grand mean centering would examine
the relationship between one’s relative standing in the sample as a whole
on vocabulary and the reading score. This latter interpretation would be
equivalent conceptually (though not mathematically) to using the raw score,
while the group mean centering would not. Throughout the rest of this book,
we will use grand mean centering by default, per recommendations by Hox
(2002), among others. At times, however, we will also demonstrate the use of
group mean centering in order to illustrate how it provides different results,
and for applications in which interpretation of the impact of an individual’s
relative standing in their cluster might be more useful than their relative
standing in the sample as a whole.
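The distinction between the two centering approaches can be made concrete with a toy example (the scores and school labels below are hypothetical, not data from the book):

```python
# Grand-mean vs. group-mean centering on made-up scores.
scores = {"school_A": [4.0, 6.0, 8.0], "school_B": [1.0, 2.0, 3.0]}

all_scores = [x for xs in scores.values() for x in xs]
grand_mean = sum(all_scores) / len(all_scores)

# Grand-mean centering: deviation from the mean of the whole sample.
grand_centered = {s: [x - grand_mean for x in xs] for s, xs in scores.items()}
# Group-mean centering: deviation from one's own cluster mean.
group_centered = {s: [x - sum(xs) / len(xs) for x in xs]
                  for s, xs in scores.items()}

print(grand_centered["school_B"])   # position relative to the whole sample
print(group_centered["school_B"])   # position relative to one's own school
```

Notice that every student in school_B sits below the grand mean, yet after group-mean centering the top student there has a positive score: the two approaches answer different substantive questions, exactly as described above.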

Basics of Parameter Estimation with MLMs


Heretofore, when we have discussed estimation of model parameters, it has
been in the context of least squares, which serve as the underpinnings of
OLS and related linear models. However, as we move from these fairly sim-
ple applications to more complex models, OLS is not typically the optimal
approach to use for parameter estimation. Instead, we will rely on maximum-
likelihood estimation (MLE) and restricted maximum likelihood (REML).
In the following sections we review these approaches to estimation from a
conceptual basis, focusing on the generalities of how they work, what they
assume about the data, and how they differ from one another. For the tech-
nical details we refer the interested reader to Bryk and Raudenbush (2002)
or de Leeuw and Meijer (2008), both of which provide excellent resources
for those desiring a more in-depth coverage of these methods. Our purpose
here is to provide the reader with a conceptual understanding that will aid
in their understanding of application of MLMs in practice.

Maximum Likelihood Estimation


MLE has as its primary goal the estimation of population model parameters
that maximize the likelihood of our obtaining the sample that we in fact
obtained. In other words, the estimated parameter values should maximize
the likelihood of our particular sample. From a practical perspective, identi-
fying such parameter values takes place through comparison of the observed
data with that predicted by the model associated with the parameter val-
ues. The closer the observed and predicted values are to one another, the
greater the likelihood that the observed data arose from a population with
parameters close to those used to generate the predicted values. In practice,
MLE is an iterative methodology in which the algorithm searches for those
parameter values that will maximize the likelihood of the observed data (i.e.
produce predicted values that are as close as possible to the observed), and
as such can be computationally intensive, particularly for complex models
and large samples.
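The logic can be made concrete with a tiny numeric sketch (hypothetical scores, pure Python). For a normal model the closed-form ML estimates are the sample mean and the variance with divisor n, and the log-likelihood of the sample at those estimates exceeds the log-likelihood at nearby candidate values:

```python
import math

# A small hypothetical sample of scores.
y = [4.0, 6.0, 5.0, 7.0, 3.0]
n = len(y)

def log_likelihood(mu, var):
    # Normal log-likelihood of the whole sample at candidate parameters.
    return sum(-0.5 * math.log(2 * math.pi * var) - (yi - mu) ** 2 / (2 * var)
               for yi in y)

# Closed-form ML estimates: sample mean, and variance with divisor n.
mu_ml = sum(y) / n                               # 5.0
var_ml = sum((yi - mu_ml) ** 2 for yi in y) / n  # 2.0

# The likelihood is maximized at the ML estimates: nearby candidate
# values yield a smaller log-likelihood.
print(log_likelihood(mu_ml, var_ml) > log_likelihood(mu_ml + 0.5, var_ml))  # True
print(log_likelihood(mu_ml, var_ml) > log_likelihood(mu_ml, var_ml + 0.5))  # True
```

In real MLM software this search is carried out iteratively over many parameters at once rather than by closed-form solution, which is why estimation can be computationally intensive.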

Restricted Maximum Likelihood Estimation


There exists a variant of MLE, restricted maximum likelihood, that has
been shown to be more accurate with regard to the estimation of variance
parameters than is MLE (Kreft and De Leeuw, 1998). In particular, the two
methods differ with respect to how degrees of freedom are calculated in the
estimation of variances. As a simple example, the sample variance is typi-
cally calculated by dividing the sum of squared differences between indi-
vidual values and the mean by the number of observations minus 1, so as to
have an unbiased estimate. This is a REML estimate of variance. In contrast,
the MLE variance is calculated by dividing the sum of squared differences
by the total sample size, leading to a smaller variance estimate than REML
and, in fact, one that is biased in finite samples. In the context of multilevel
modeling, REML takes into account the number of parameters being esti-
mated in the model when determining the appropriate degrees of freedom
for the estimation of the random components such as the parameter vari-
ances described above. In contrast, MLE does not account for these, leading
to an underestimate of the variances that does not occur with REML. For this
reason, REML is generally the preferred method for estimating multilevel
models, though for testing variance parameters (or any random effect) it is
necessary to use MLE (Snijders and Bosker, 1999). It should be noted that as
An Introduction to Multilevel Data Structure 37

the number of level-2 clusters increases, the difference in value for MLE and
REML estimates becomes very small (Snijders and Bosker, 1999).
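The simple one-sample contrast described above can be sketched directly; the same sum of squared deviations is divided by a different quantity under each method (scores are hypothetical):

```python
# ML vs REML variance for the simple one-sample case.
y = [4.0, 6.0, 5.0, 7.0, 3.0]
n = len(y)
mean = sum(y) / n
ss = sum((yi - mean) ** 2 for yi in y)  # sum of squared deviations

var_ml = ss / n          # ML: divides by n, biased downward in finite samples
var_reml = ss / (n - 1)  # REML: divides by n - 1, unbiased

print(var_ml, var_reml)  # 2.0 2.5
```

As the sample (or, in the multilevel case, the number of level-2 clusters) grows, the divisors converge and the two estimates become nearly identical.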

Assumptions Underlying MLMs


As with any statistical model, the appropriate use of MLMs requires that
several assumptions about the data hold true. If these assumptions are not
met, the model parameter estimates may not be trustworthy, just as would be
the case with standard linear regression, which was reviewed in Chapter 1.
Indeed, while they differ somewhat from the assumptions for the single-
level models, the assumptions underlying MLMs are akin to those for the
simpler models. In this section, we provide an introduction to these assump-
tions and their implications for researchers using MLMs, and in subsequent
chapters we describe methods for checking the validity of these assump-
tions for a given set of data.
First, we assume that the level-2 residuals are independent between clus-
ters. In other words, there is an assumption that the random intercept and
slope(s) at level 2 are independent of one another across clusters. Second,
the level-2 intercepts and coefficients are assumed to be independent of the
level-1 residuals; i.e. the errors for the cluster-level estimates are unrelated to
errors at the individual level. Third, the level-1 residuals are normally dis-
tributed and have a constant variance. This assumption is very similar to
the one we make about residuals in the standard linear regression model.
Fourth, the level-2 intercept and slope(s) have a multivariate normal distri-
bution with a constant covariance matrix. Each of these assumptions can
be directly assessed for a sample, as we shall see in forthcoming chapters.
Indeed, the methods for checking the MLM assumptions are not very differ-
ent from those for checking the regression model that we used in Chapter 1.

Overview of Two-Level MLMs


To this point, we have described the specific terms that make up the MLM,
including the level-1 and level-2 random effects and residuals. We will close
out this chapter introducing the MLM by considering an example of two-
and three-level MLMs, and the use of MLMs with longitudinal data. This
discussion should prepare the reader for subsequent chapters in which we
consider the application of R to the estimation of specific MLMs. First, let
us consider the two-level MLM, parts of which we have already described
previously in the chapter.
Previously, in Equation (2.16), we considered the random slopes model
\( y_{ij} = \gamma_{00} + \gamma_{10}x_{ij} + U_{0j} + U_{1j}x_{ij} + \varepsilon_{ij} \), in which the dependent variable, \( y_{ij} \) (reading
achievement), was a function of an independent variable xij (vocabulary test
score), as well as random error at both the examinee and school level. We can
extend this model a bit further by including multiple independent variables at
both level 1 (examinee) and level 2 (school). Thus, for example, in addition to
ascertaining the relationship between an individual’s vocabulary and read-
ing scores, we can also determine the degree to which the average vocabu-
lary score at the school as a whole is related to an individual’s reading score.
This model would essentially have two parts, one explaining the relation-
ship between the individual level vocabulary (xij) and reading, and the other
explaining the coefficients at level 1 as a function of the level-2 predictor, aver-
age vocabulary score (zj). The two parts of this model are expressed as:
Level 1:

\( y_{ij} = \beta_{0j} + \beta_{1j}x_{ij} + \varepsilon_{ij} \) (2.18)

and

Level 2:

\( \beta_{hj} = \gamma_{h0} + \gamma_{h1}z_j + U_{hj} \). (2.19)

The additional piece of the equation in 2.19 is \( \gamma_{h1}z_j \): the slope \( \gamma_{h1} \)
multiplied by the average vocabulary score for the school \( (z_j) \). In other
words, the mean school performance is related directly to the coefficient
linking the individual vocabulary score to the individual reading score. For
our specific example, we can combine 2.18 and 2.19 in order to obtain a single
equation for the two-level MLM.

\( y_{ij} = \gamma_{00} + \gamma_{10}x_{ij} + \gamma_{01}z_j + \gamma_{11}x_{ij}z_j + U_{0j} + U_{1j}x_{ij} + \varepsilon_{ij} \). (2.20)

Each of these model terms has been defined previously in the chapter: \( \gamma_{00} \)
is the intercept or the grand mean for the model, \( \gamma_{10} \) is the fixed effect of
variable x (vocabulary) on the outcome, \( U_{0j} \) represents the random variation
for the intercept across groups, and \( U_{1j} \) represents the random variation for
the slope across groups. The additional pieces of the equation in 2.20 are \( \gamma_{01} \)
and \( \gamma_{11} \). \( \gamma_{01} \) represents the fixed effect of the level-2 variable z (average
vocabulary) on the outcome. The new term in Model 2.20 is the cross-level
interaction, \( \gamma_{11}x_{ij}z_j \). As the name implies, the cross-level interaction is
simply the interaction between the level-1 and level-2 predictors. In this
context, it is the interaction between an individual’s vocabulary score and
the mean vocabulary score for their school. The coefficient for this
interaction term, \( \gamma_{11} \), assesses the extent to which the relationship between
an examinee’s vocabulary score and their reading achievement is moderated
by the mean for the school that they attend. A large, significant value for
this coefficient would indicate that the relationship between a person’s
vocabulary test score and their overall reading achievement depends on the
level of vocabulary achievement at their school.
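To see how the cross-level interaction operates, consider the fixed-effects part of Equation 2.20 with purely hypothetical coefficient values; the effective level-1 slope for vocabulary then changes with the school mean \( z_j \):

```python
# Fixed-effects part of Equation 2.20 with hypothetical coefficient values.
# The cross-level interaction coefficient g11 makes the level-1 vocabulary
# slope depend on the school's mean vocabulary score z_j.
g00, g10, g01, g11 = 50.0, 0.8, 0.3, 0.02  # hypothetical gammas

def expected_reading(x_ij, z_j):
    # E[y_ij], ignoring the random terms U_0j, U_1j, and the residual
    return g00 + g10 * x_ij + g01 * z_j + g11 * x_ij * z_j

def vocab_slope(z_j):
    # Effective slope for vocabulary at a school with mean z_j
    return g10 + g11 * z_j

# The vocabulary-reading slope is steeper in higher-vocabulary schools.
print(vocab_slope(40.0), vocab_slope(60.0))
```

With these made-up values the slope is about 1.6 when the school mean is 40 but about 2.0 when it is 60, which is exactly what a large, positive interaction coefficient implies.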

Overview of Three-Level MLMs


With MLMs, it is entirely possible to have three or more levels of data struc-
ture. It should be noted that in actual practice, four-level and higher models
are rare, however. For our reading achievement data, where the second level
was school, the third level might be the district in which the school resides.
In that case, we would have multiple equations to consider when expressing
the relationship between vocabulary and reading achievement scores, start-
ing at the individual level

\( y_{ijk} = \beta_{0jk} + \beta_{1jk}x_{ijk} + \varepsilon_{ijk} \). (2.21)

Here, the subscript k represents the level-3 cluster to which the individual
belongs. Prior to formulating the rest of the model, we must evaluate if the
slopes and intercepts are random at both levels 2 and 3, or only at one of
these levels. This decision should always be based on the theory surround-
ing the research questions, what is expected in the population, and what is
revealed in the empirical data. We will proceed with the remainder of this
discussion under the assumption that the level-1 intercepts and slopes are
random for both levels 2 and 3, in order to provide a complete description
of the most complex model possible when three levels of data structure are
present. When the level-1 coefficients are not random at both levels, the terms
in the following models for which this randomness is not present would sim-
ply be removed. We will address this issue more specifically in Chapter 4,
when we discuss the fitting of three-level models using R.
The level-2 and level-3 contributions to the MLM described in 2.21 appear
below.
Level 2:

\( \beta_{0jk} = \gamma_{00k} + U_{0jk} \)

\( \beta_{1jk} = \gamma_{10k} + U_{1jk} \)

Level 3:

\( \gamma_{00k} = \delta_{000} + V_{00k} \)

\( \gamma_{10k} = \delta_{100} + V_{10k} \) (2.22)

We can then use simple substitution to obtain the expression for the level-1
intercept and slope in terms of both level-2 and level-3 parameters.

\( \beta_{0jk} = \delta_{000} + V_{00k} + U_{0jk} \)
and (2.23)
\( \beta_{1jk} = \delta_{100} + V_{10k} + U_{1jk} \)

In turn, these terms can be substituted into Equation (2.21) to provide the full
three-level MLM.

\( y_{ijk} = \left( \delta_{000} + V_{00k} + U_{0jk} \right) + \left( \delta_{100} + V_{10k} + U_{1jk} \right)x_{ijk} + \varepsilon_{ijk} \). (2.24)

There is an implicit assumption in the expression of 2.24 that there are no
cross-level interactions, though these are certainly possible to model across
all three levels, or for any pair of levels. In 2.24, we are expressing individ-
uals’ scores on the reading achievement test as a function of random and
fixed components from the school which they attend, the district in which
the school resides, as well as their own vocabulary test score and random
variation associated only with themselves. Though not present in 2.24, it is
also possible to include variables at both levels 2 and 3, similarly to what we
described for the two-level model structure.
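The substitution that yields Equation 2.24 can be traced numerically. Every value below is hypothetical, chosen only to show how the district-level, school-level, and individual pieces combine:

```python
# Tracing the substitution behind Equation 2.24 with hypothetical values.
d000, d100 = 45.0, 0.9    # level-3 (district) fixed effects
V00k, V10k = 2.0, 0.05    # hypothetical district-level random effects
U0jk, U1jk = -1.0, 0.10   # hypothetical school-level random effects
e_ijk = 0.5               # individual residual
x_ijk = 50.0              # the examinee's vocabulary score

intercept = d000 + V00k + U0jk             # beta_0jk from Equation 2.23
slope = d100 + V10k + U1jk                 # beta_1jk from Equation 2.23
y_ijk = intercept + slope * x_ijk + e_ijk  # Equation 2.24

print(intercept)                           # 46.0
print(round(slope, 4), round(y_ijk, 4))    # 1.05 99.0
```

The examinee's predicted reading score thus carries contributions from their district, their school, their own vocabulary score, and purely individual variation, exactly as the prose above describes.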

Overview of Longitudinal Designs and Their Relationship to MLMs


Finally, we will say just a word about how longitudinal designs can be
expressed as MLMs. Longitudinal research designs simply involve the
collection of data from the same individuals at multiple points in time.
For example, we may have reading achievement scores for examinees in
the fall and spring of the school year. With such a design, we would be
able to investigate issues around growth scores and change in achieve-
ment over time. Such models can be placed in the context of an MLM
where the examinee is the level-2 (cluster) variable, and the individual
test administration is at level 1. We would then simply apply the two-level
model described above, including whichever examinee-level variables are
appropriate for explaining reading achievement. Similarly, if examinees
are nested within schools, we would have a three-level model, with school
at the third level, and could apply Model 2.24, once again with which-
ever examinee- or school-level variables were pertinent to the research
question. One unique aspect of fitting longitudinal data in the MLM con-
text is that the error terms can potentially take specific forms that are not
common in other applications of multilevel analysis. These error terms
reflect the way in which measurements made over time are related to one
another, and are typically more complex than the basic error structure that
we have described thus far. In Chapter 5 we will look at examples of fit-
ting such longitudinal models with R, and focus much of our attention on
these error structures, when each is appropriate, and how they are inter-
preted. In addition, such MLMs need not take a linear form, but can be
adapted to fit quadratic, cubic, or other nonlinear trends over time. These
issues will be further discussed in Chapter 5.
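Treating the examinee as the level-2 cluster implies a particular data layout: one row per test administration rather than one row per examinee. This wide-to-long restructuring can be sketched as follows (scores and identifiers are hypothetical):

```python
# Wide format: one entry per examinee, with hypothetical reading scores
# from the fall and spring administrations.
wide = {
    "examinee_1": {"fall": 48.0, "spring": 55.0},
    "examinee_2": {"fall": 52.0, "spring": 57.0},
}

# Long format: one row per administration, with the examinee identifier
# serving as the level-2 cluster variable.
long_rows = [
    {"examinee": ex, "time": t, "reading": score}
    for ex, waves in wide.items()
    for t, score in waves.items()
]

for row in long_rows:
    print(row)
```

In the long layout, time becomes an ordinary level-1 predictor of reading, and repeated measurements are clustered within the examinee just as examinees were clustered within schools earlier in the chapter.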

Summary
The goal of this chapter was to introduce the basic theoretical underpin-
nings of multilevel modeling, but not to provide an exhaustive technical
discussion of these issues, as there are a number of useful sources avail-
able in this regard, which you will find among the references at the end of
the text. However, what is given here should stand you in good stead as we
move forward with multilevel modeling using R software. We recommend
that while reading subsequent chapters you make liberal use of the informa-
tion provided here, in order to gain a more complete understanding of the
output that we will be examining from R. In particular, when interpreting
output from R, it may be very helpful for you to come back to this chapter
for reminders on precisely what each model parameter means. In the next
two chapters we will take the theoretical information from Chapter 2 and
apply it to real datasets using two different R libraries, nlme and lme4, both
of which have been developed to conduct multilevel analyses with continu-
ous outcome variables. In Chapter 5, we will examine how these ideas can
be applied to longitudinal data, and in Chapters 7 and 8, we will discuss
multilevel modeling for categorical dependent variables. In Chapter 9, we
will diverge from the likelihood-based approaches described here, and dis-
cuss multilevel modeling within the Bayesian framework, focusing on appli-
cation, and learning when this method might be appropriate and when it
might not.
