
Linear Mixed Models

Chapter 3: The General Linear Mixed Model

Craig Anderson
The general linear mixed model

So far we have seen specific examples of mixed models.


We can generalize mixed models to arbitrary design
matrices:
y = Xβ + Zu + e
where X and Z are given matrices and

E\begin{pmatrix} u \\ e \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},
\qquad
\mathrm{Var}\begin{pmatrix} u \\ e \end{pmatrix} = \begin{pmatrix} G & 0 \\ 0 & R \end{pmatrix}.

We assume that u ∼ N(0, G) and e ∼ N(0, R)


independently of each other.
The models studied previously are all special cases of this
general model.

2/158
Example

Two-way mixed model without interaction

Consider the model

yij = αi + bj + eij

with factor A fixed and factor B random.


Since there is no µ in this parameterisation, there is no
need for an identifiability constraint on the αi .
Assume as usual that the bj are i.i.d. N(0, σB²), independently of
the eij , which are i.i.d. N(0, σE²).
For illustration purposes let i = 1, 2, 3 and j = 1, 2.

3/158
Example

Model components

 
y = \begin{pmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ y_{31} \\ y_{32} \end{pmatrix};
\qquad
\beta = \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix};
\qquad
u = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}.

4/158
Example

Model components

   
X = \begin{pmatrix}
1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1
\end{pmatrix};
\qquad
Z = \begin{pmatrix}
1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1
\end{pmatrix}.

5/158
Normal model

In general we can write

y ∼ N(Xβ, ZGZ| + R) = N(Xβ, V)

or

y = Xβ + e∗

where e∗ = Zu + e.

This is a linear model with correlated errors since

Var (e∗ ) = V = ZGZ| + R.

6/158
Variance terms in the example

 
G = \sigma_B^2 I_2 = \begin{pmatrix} \sigma_B^2 & 0 \\ 0 & \sigma_B^2 \end{pmatrix};
\qquad
R = \sigma_E^2 I_6 = \begin{pmatrix}
\sigma_E^2 & 0 & \cdots & 0 \\
0 & \sigma_E^2 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & \cdots & 0 & \sigma_E^2
\end{pmatrix}

V = \begin{pmatrix}
\sigma_B^2+\sigma_E^2 & 0 & \sigma_B^2 & 0 & \sigma_B^2 & 0 \\
0 & \sigma_B^2+\sigma_E^2 & 0 & \sigma_B^2 & 0 & \sigma_B^2 \\
\sigma_B^2 & 0 & \sigma_B^2+\sigma_E^2 & 0 & \sigma_B^2 & 0 \\
0 & \sigma_B^2 & 0 & \sigma_B^2+\sigma_E^2 & 0 & \sigma_B^2 \\
\sigma_B^2 & 0 & \sigma_B^2 & 0 & \sigma_B^2+\sigma_E^2 & 0 \\
0 & \sigma_B^2 & 0 & \sigma_B^2 & 0 & \sigma_B^2+\sigma_E^2
\end{pmatrix}
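A quick numerical check of V = ZGZ| + R for this example (a sketch in R, with
arbitrary values chosen for the two variance components):

# Sketch: rebuild V from Z, G and R and compare with the matrix above
sB2 <- 1; sE2 <- 0.5                        # arbitrary illustrative values
Z <- matrix(c(1,0, 0,1, 1,0, 0,1, 1,0, 0,1), ncol = 2, byrow = TRUE)
G <- sB2 * diag(2)
R <- sE2 * diag(6)
V <- Z %*% G %*% t(Z) + R                   # reproduces the pattern shown above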

7/158
Maximum likelihood for fixed effects

For a given V, the generalized least squares (GLS)


estimator of β is

β̃ = (X| V−1 X)−1 X| V−1 y

The estimator β̃ is the maximum likelihood estimator


(MLE) and the uniformly minimum variance unbiased
estimator (UMVUE).
It maximises the log-likelihood
-\tfrac{1}{2}\log|V| - \tfrac{1}{2}(y - X\beta)^\top V^{-1}(y - X\beta) + \text{const}
for the normal model.
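A sketch of the GLS computation in R (assuming X, V and y are already
available as matrices/vectors):

Vi <- solve(V)
beta_tilde <- solve(t(X) %*% Vi %*% X, t(X) %*% Vi %*% y)   # GLS / ML estimate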

8/158
Variance component estimation

Maximum likelihood can also be used to estimate variance


components.

Other options were popular before the computing advances


of recent decades.

These include minimum norm quadratic unbiased


estimation (MINQUE) and minimum variance quadratic
unbiased estimation (MIVQUE)

9/158
Maximum likelihood for variance matrices
The maximum likelihood estimate of

V = Var (y) = ZGZ| + R

is based on the model

y ∼ N(Xβ, V).

The log-likelihood of y under this model is


l(\beta, V) = -\tfrac{1}{2}\log|V| - \tfrac{1}{2}(y - X\beta)^\top V^{-1}(y - X\beta) + \text{const}
and the MLE of (β, V) is the one that maximises this
expression.

10/158
Maximum likelihood for variance matrices
For any fixed V, l(β, V) is maximised over β by
β̃ = (X| V−1 X)−1 X| V−1 y.

Substituting back into the log-likelihood expression, we


obtain the profile log-likelihood for V:
l_P(V) = -\tfrac{1}{2}\log|V| - \tfrac{1}{2}(y - X\tilde\beta)^\top V^{-1}(y - X\tilde\beta) + \text{const}
       = -\tfrac{1}{2}\log|V| - \tfrac{1}{2}\, y^\top V^{-1}\left[I - X(X^\top V^{-1}X)^{-1}X^\top V^{-1}\right]y + \text{const}

This can be maximised for the parameters in V.


11/158
Variance component estimation

However, there is one big problem with the maximum


likelihood approach.

Even with a very simple model, the variance component


estimates do not match those obtained by the ANOVA
approach.

This is because the ANOVA approach adjusts for the


degrees of freedom lost for estimation.

12/158
Variance component estimation

Simple example
Consider X1 , . . . , Xn i.i.d. N(µ, σ²).

The variance estimate using the ANOVA approach is


s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar X)^2.

The variance estimate using the maximum likelihood


approach is
\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)^2.
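A one-line illustration of the difference in R (the two estimators differ only
in the divisor, n − 1 versus n):

x <- rnorm(20, mean = 5, sd = 2)       # simulated sample
s2_anova <- var(x)                     # divides by n - 1
s2_ml    <- mean((x - mean(x))^2)      # divides by n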

13/158
Restricted Maximum Likelihood (REML)

To adjust for degrees of freedom we use restricted


maximum likelihood.

This maximises the likelihood of linear combinations of


the elements of y that do not depend on β.

The resulting criterion function is the restricted


log-likelihood
l_R(V) = l_P(V) - \tfrac{1}{2}\log\left|X^\top V^{-1}X\right|

14/158
Restricted Maximum Likelihood (REML)
REML

Starting with
y = Xβ + Zu + e

find all independent linear combinations of the response, k


such that k| · X = 0.

Then
k| · y = k| · Xβ + k| · Zu + k| · e

and taking K to be the matrix with rows k| we have

Ky = (KZ)u + Ke.
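One way to construct such a K in R (a sketch, assuming the design matrix X is
available) is via the QR decomposition of X:

qrX <- qr(X)
Q <- qr.Q(qrX, complete = TRUE)
K <- t(Q[, -(1:qrX$rank)])   # rows span the contrasts with K %*% X = 0
max(abs(K %*% X))            # numerically zero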

15/158
Restricted Maximum Likelihood (REML)

REML

Thus

Ky ∼ N(0, KVK| ) = N(0, KZGZ| K| + KRK| )

and the maximum likelihood can be used to estimate


variance components based on the likelihood of Ky.

Since there are no longer fixed effects to estimate, we do


not ‘lose’ degrees of freedom.

For simple and balanced designs, REML gives the same


variance component estimates as ANOVA.

16/158
Residual Maximum Likelihood

REML could also stand for Residual Maximum


Likelihood.

REML is equivalent to:


finding the least squares estimates of β from regressing y
on X (ignoring random effects).
taking the residuals.
using maximum likelihood on the residuals.

17/158
Examples

Balanced design - Heart X-ray data


Recall the two-factor random effects model with interaction and
the ANOVA estimates for the variance components:

Variance component Estimate


Observer 0.000070
Case 0.002718
Case:Observer 0.000112
Error 0.000114

18/158
Examples

Balanced design - Heart X-ray data


Recall the two-factor random effects model:

Variance component ANOVA est. REML ML


Observer 0.000070 0.000070 0.000058
Case 0.002718 0.002718 0.002556
Case:Observer 0.000112 0.000112 0.000112
Error 0.000114 0.000114 0.000114

19/158
Examples

Unbalanced data - Nitrogen concentrations


Recall the example of nitrogen concentrations in the
Mississippi river, with influent as the random factor:

Variance component ANOVA est. REML ML


Influent 56.17 63.32 51.26
Error 42.57 42.66 42.70

20/158
Prediction

Mixed models contain fixed effects, random effects and


variance-covariance matrix parameters.

The model parameters are the parameters in β and those in


V. These can be estimated using maximum likelihood as
outlined above.

Maximum likelihood does not apply to the random effects


u.

Instead, we use the term prediction.

21/158
Example

Multicentre clinical trial

Three drugs are compared in a multicentre clinical trial for


their effects on diastolic blood pressure.
Patients are given one of the three drugs, at random, at 1 of
26 clinics.
Measurements are taken on the patients during five visits at
the clinics.
Can you identify the fixed effects and random effects?

22/158
Example

Multicentre clinical trial

The clinics were randomly selected from a large


population of clinics; therefore, clinic is a random effect.

The variable drug is a fixed effect since we only have one


of three predetermined choices.

The drug by clinic interaction corresponds to the patient


effect, and is a random effect.

23/158
Prediction

In conventional fixed-effects analysis of multicentre trials,


inference focuses on the average drug effect throughout the
target population.
In many practical situations, we may be interested in the
performance of treatments at a specific clinic.
Suppose we suspect that different treatments perform
better under different environmental conditions.
These conditions could be represented by the various
locations in the trial.
We would therefore be interested in predicting the value
of the response at a randomly selected clinic.

24/158
Simple prediction example

Consider two random variables, Y and U, where

Y = U + e, U ∼ N(0, 1), e ∼ N(0, 4)


[Figure: density curves of Y and U; the density of U is more concentrated
around zero than the density of Y.]
25/158
Simple example of prediction

We observe only y. Based on this observation, what is our


prediction for the value of U?

The best predictor (BP) is defined to be the Ũ which


minimises the mean squared prediction error:

E[(Ũ − U)2 ]

For general Y and U, the solution is

ũ = BP(U) = E(U|Y = y).

In the example, it can be shown that ũ = y/5.
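A quick check of this value, using standard bivariate normal results:

\mathrm{Cov}(U, Y) = \mathrm{Var}(U) = 1, \quad \mathrm{Var}(Y) = 1 + 4 = 5,
\qquad\Rightarrow\qquad
\tilde u = E(U \mid Y = y) = \frac{\mathrm{Cov}(U, Y)}{\mathrm{Var}(Y)}\, y = \frac{y}{5}.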

26/158
Best predictor

In general, if y is the vector of observed data and u is a


random vector to be predicted, then best prediction
corresponds to minimisation of

E\left[\,\|\tilde{u} - u\|^2\,\right].

The solution is

ũ = BP(u) = E(u|y).

27/158
Best linear prediction

As a simplification, we usually restrict attention to


predictors that are linear in y, i.e. of the form

ũ = Ay + b

for some matrix A and vector b.

The solution is called the best linear predictor (BLP):

ũ = BLP(u) = E(u) + CV−1 [y − E(y)],

where

C = E {[u − E(u)][y − E(y)]| } and V = Var(y).

28/158
Best linear prediction

 
If the joint distribution of (u, y) is multivariate normal, then best
prediction and best linear prediction coincide.

In particular,

BP(u) = BLP(u) = E(u|y) = E(u) + CV−1 [y − E(y)].

29/158
BLP and the mixed model
In the mixed model
y = Xβ + Zu + e
we have
E(y) = Xβ
V = Var (y) = ZGZ| + R
C = E {[u − E(u)][y − E(y)]| }
= E[u(Zu + e)| ]
= E(uu| Z| ) + E(ue| )
= Var (u) Z| + 0 = GZ|
Therefore
ũ = BLP(u) = GZ| V−1 (y − Xβ).

30/158
BLP and the mixed model

The expression

ũ = BLP(u) = GZ| V−1 [y − Xβ]

includes the terms G, V and β, all of which need to be


estimated.

For instance β would be replaced by an estimator such as


β̃ = (X| V−1 X)−1 X| V−1 y.

31/158
Best linear unbiased prediction

Best linear unbiased prediction (BLUP) allows us to view


estimation of β and prediction of u in a more unified way.

This involves finding β̃ and ũ to minimise the prediction


error
E\left[\left\{(s^\top X\tilde\beta + t^\top Z\tilde u) - (s^\top X\beta + t^\top Zu)\right\}^2\right]

subject to the unbiasedness condition

E(s| Xβ̃ + t| Zũ) = E(s| Xβ + t| Zu)

where s and t are arbitrary n × 1 vectors.

32/158
Best linear unbiased prediction

It can be shown that the solutions are

BLUE(β) = β̃ = (X| V−1 X)−1 X| V−1 y

BLUP(u) = ũ = GZ| V−1 (y − Xβ̃).

The best linear unbiased estimate (BLUE) for β is the


same as the GLS estimate.
The best linear unbiased predictor (BLUP) for u is the
BLP with β replaced by BLUE(β) = β̃.

33/158
Henderson’s justification

One derivation of BLUPs is by solving Henderson’s


equations:

\begin{pmatrix} X^\top R^{-1}X & X^\top R^{-1}Z \\ Z^\top R^{-1}X & Z^\top R^{-1}Z + G^{-1} \end{pmatrix}
\begin{pmatrix} \beta \\ u \end{pmatrix}
=
\begin{pmatrix} X^\top R^{-1}y \\ Z^\top R^{-1}y \end{pmatrix}

These assume

y|u ∼ N(Xβ + Zu, R), u ∼ N(0, G)

and maximise the likelihood of (y, u) over β and u, using


f (y, u) = f (y|u)f (u).
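A sketch of solving Henderson's equations directly in R (assuming X, Z, y and
the variance matrices G and R are available):

Ri <- solve(R); Gi <- solve(G)
lhs <- rbind(cbind(t(X) %*% Ri %*% X, t(X) %*% Ri %*% Z),
             cbind(t(Z) %*% Ri %*% X, t(Z) %*% Ri %*% Z + Gi))
rhs <- rbind(t(X) %*% Ri %*% y, t(Z) %*% Ri %*% y)
sol <- solve(lhs, rhs)   # first entries: beta-tilde; remaining entries: u-tilde (BLUP)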

34/158
Henderson’s justification

The criterion to be optimised becomes

(y − Xβ − Zu)| R−1 (y − Xβ − Zu) + u| G−1 u.

This is essentially generalized least squares with a penalty


term.

35/158
Best linear unbiased prediction

It can be shown that the BLUP of (β, u) can be written as


 
\begin{pmatrix} \tilde\beta \\ \tilde u \end{pmatrix} = (D^\top R^{-1}D + B)^{-1}D^\top R^{-1}y,

where

D = \begin{pmatrix} X & Z \end{pmatrix}
\quad\text{and}\quad
B = \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} \end{pmatrix}.

The fitted values are then

BLUP(y) = Xβ̃ + Zũ = D(D| R−1 D + B)−1 D| R−1 y.

36/158
Estimated or empirical BLUP

We showed that the BLUPs for a mixed model are

BLUE(β) = β̃ = (X| V−1 X)−1 X| V−1 y

BLUP(u) = ũ = GZ| V−1 (y − Xβ̃)

These depend on G = Var(u) and R = Var(e) through


V = Var(y) = ZGZ| + R.

37/158
Estimated or empirical BLUP

In practice, the BLUE and BLUP are replaced by the


estimated or empirical BLUE/BLUP.

The EBLUE/EBLUP take the form:

β̂ = (X| V̂−1 X)−1 X| V̂−1 y


û = ĜZ| V̂−1 (y − Xβ̂)

where Ĝ or V̂ are obtained by plugging in the ML or


REML estimates of their parameters.

38/158
Estimated or empirical BLUP

Consider the mixed model as a whole:

BLUP[E(y|u)] = Xβ̃ + Zũ

EBLUP[E(y|u)] = ŷ = Xβ̂ + Zû.

Estimated BLUPs have two sources of variability: that


from estimation of β and u and that from estimation of G
and V.
Both sources should be taken into account when making
inference about the quantity of interest.
This can be tricky.

39/158
Standard error estimation

The variance of

BLUE(β) = β̃ = (X| V−1 X)−1 X| V−1 y

is

Var(β̃) = (X| V−1 X)−1 .

A natural estimate of the standard error of the ith entry of


the EBLUE β̂i is the square root of the ith diagonal entry
of (X| V̂−1 X)−1 .
This ignores the variability due to estimation of V.
For large samples this variability can be ignored, but for
small samples it makes a difference.

40/158
Precision of BLUPs involving u

To estimate the precision of BLUPs involving u we need


   
\mathrm{Var}\begin{pmatrix} \tilde\beta - \beta \\ \tilde u - u \end{pmatrix}
= \mathrm{Var}\begin{pmatrix} \tilde\beta \\ \tilde u - u \end{pmatrix}
= (D^\top R^{-1}D + B)^{-1}

where, as before

D = \begin{pmatrix} X & Z \end{pmatrix}
\quad\text{and}\quad
B = \begin{pmatrix} 0 & 0 \\ 0 & G^{-1} \end{pmatrix}.

Therefore we could use the approximation

\mathrm{Var}\begin{pmatrix} \hat\beta \\ \hat u - u \end{pmatrix}
\approx (D^\top \hat R^{-1}D + \hat B)^{-1}.

41/158
Precision of BLUPs involving u

Sometimes we may also need the conditional variance


     
\mathrm{Var}\left[\begin{pmatrix} \tilde\beta - \beta \\ \tilde u \end{pmatrix}\,\middle|\,u\right]
= \mathrm{Var}\left[\begin{pmatrix} \tilde\beta \\ \tilde u \end{pmatrix}\,\middle|\,u\right]
= (D^\top R^{-1}D + B)^{-1}D^\top R^{-1}D\,(D^\top R^{-1}D + B)^{-1}.

This suggests the approximation

\mathrm{Var}\left[\begin{pmatrix} \hat\beta \\ \hat u \end{pmatrix}\,\middle|\,u\right]
\approx (D^\top \hat R^{-1}D + \hat B)^{-1}D^\top \hat R^{-1}D\,(D^\top \hat R^{-1}D + \hat B)^{-1}.

42/158
Summary

The BLUP of the random effect is the expected value of


the random variable(s) given the observed data.

The solutions for the fixed effect yield best linear unbiased
estimators (BLUEs).

We solve the mixed model equations using the estimated


covariance matrices, Ĝ and R̂.

This yields the estimated or empirical best linear unbiased


predictor (EBLUP) for the random effect u and the
estimated or empirical best linear unbiased estimator
(EBLUE) for the fixed effect β.

43/158
Summary

Properties of a BLUP

it is unbiased; that is, E(û) = E(u) = 0.

it is a linear estimator; that is, it is a linear combination of


y: û = Ay + b, where A and b are free of the fixed effect
parameters.

it is best because it minimises the residual error. If u were


a fixed effect, this criterion is equivalent to minimum
variance.

44/158
Summary

Properties of a BLUP

BLUPs have a so-called shrinkage property.

They shrink toward the overall average - in other words,


they are less extreme than the observed counterparts.

They can be interpreted as the weighted average of the


grand mean and the observed value.

45/158
Toy Example

Recall the model

yij = µ + αi + bj + eij

where
yij is the breaking strength for the ith adhesive and jth toy,
i = 1, . . . , I (I = 3) and j = 1, . . . , J (J = 7).
µ is the overall mean.
αi is the fixed effect associated with the ith adhesive.
bj is the random effect associated with the jth toy (block).
eij is the experimental error associated with samples within
blocks.

46/158
BLUP for toy effect

It can be shown that the BLUP for the toy effect bj is

\tilde b_j = \frac{\sigma_B^2}{\sigma_B^2 + \sigma_E^2/3}\,(\bar y_{\cdot j} - \bar y)

where ȳ·j is the average pressure value for the jth toy and ȳ
is the grand mean.

Because the factor σB²/(σB² + σE²/3) is never greater than 1,
the BLUP can be thought of as a shrinkage estimator.
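A sketch of this formula as a small R function (the input values shown are
hypothetical and only for illustration):

# Shrinkage BLUP for the jth toy effect
blup_toy <- function(ybar_j, ybar, sigmaB2, sigmaE2, I = 3) {
  sigmaB2 / (sigmaB2 + sigmaE2 / I) * (ybar_j - ybar)
}
blup_toy(ybar_j = 10.2, ybar = 9.8, sigmaB2 = 0.5, sigmaE2 = 0.3)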

47/158
Hypothesis tests

Suppose we wish to test H0 : βi = 0 where βi is the ith


entry of β.

A Wald test of this hypothesis would depend on the


asymptotic normality of the MLE through a result such as

\frac{\hat\beta_i - \beta_i}{\mathrm{ese}(\hat\beta_i)} \overset{\text{approx}}{\sim} N(0, 1)

where the estimated standard error (ese) could be obtained


as the square root of the ith diagonal entry of (X| V̂−1 X)−1 .

48/158
Hypothesis tests

Wald test

- However, this result does not hold for general mixed


models because the elements of y are dependent due to the
random effects.

+ It is still applicable in special cases such as longitudinal


data analysis.

49/158
Hypothesis tests

We can use the sums of squares in the ANOVA


decompositions to construct F-tests for hypotheses such as
H0 : βi = 0.

Expected mean squares

+ For balanced data these tests are more powerful than


alternatives.

- The appropriate F-test has to be derived for each particular


example.
- For unbalanced experiments these are not exact and they
rely on various complex adjustments and approximations.

50/158
Likelihood ratio test for fixed effects

Let L(θ; y) be the likelihood of the parameter vector θ


based on the data y.

The likelihood ratio statistic for testing the restricted


model under the null hypothesis against an alternative
unrestricted model is

LR(y) = \frac{L(\hat\theta_0;\, y)}{L(\hat\theta;\, y)}

where θ̂ 0 is the MLE under the null model and θ̂ is the


MLE under the unrestricted model.

51/158
Likelihood ratio test for fixed effects

Let l(θ; y) = log L(θ; y) be the log-likelihood.

Usually we work with


-2\left[\, l(\hat\theta_0;\, y) - l(\hat\theta;\, y) \,\right]

which, under H0 , is approximately distributed as χ2ν .

Here

ν = number of parameters in the unrestricted model


− number of parameters in the null model.

52/158
Hypothesis tests

Likelihood ratio test for fixed effects

+ The test statistic depends on y and hence on the type of


correlation structure in matrices G and R.

- Even when the conditions for the asymptotic result hold,


the approximation could be poor.

53/158
Hypothesis tests for random effects

Suppose we wish to test H0 : σ 2 = 0 against H1 : σ 2 > 0


for some variance parameter.

The test based on comparing


-2\left[\, l(\hat\theta_0;\, y) - l(\hat\theta;\, y) \,\right]

with the percentiles of a χ2 (1) distribution does not apply


because its theoretical justification assumes that the
parameter of interest is not on the boundary of its
parameter space.

54/158
Likelihood ratio tests for variance

Since the parameter space for σ 2 is [0, ∞), this assumption


is violated.

If you do use the test with the usual degrees of freedom,


the test will be very conservative (the p-values will be
larger than they should be).

This means that if something appears to be significant


using the χ2 approximation, then we can be confident that
it is actually significant.

55/158
Special case

There is a special case where the asymptotic distribution is


available.

For hypothesis tests involving one variance parameter and


s regression coefficients,
-2\left[\, l(\hat\theta_0;\, y) - l(\hat\theta;\, y) \,\right] \overset{\text{approx}}{\sim} \tfrac{1}{2}\chi^2(s) + \tfrac{1}{2}\chi^2(s+1)

This means that the term −2[ l(θ̂0 ; y) − l(θ̂; y) ] has an
approximate density function equal to a 50:50 mixture of
the χ²(s) and χ²(s + 1) densities.
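A sketch of the corresponding p-value calculation in R (lr is the observed
test statistic and s the number of regression coefficients under test; both
are placeholders here):

p_mix <- 0.5 * pchisq(lr, df = s,     lower.tail = FALSE) +
         0.5 * pchisq(lr, df = s + 1, lower.tail = FALSE)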

56/158
Special case

For example, a test of H0 : σ² = 0, β1 = 0 against

H1 : σ² > 0 or β1 ≠ 0 would involve comparing

−2[ l(θ̂0 ; y) − l(θ̂; y) ]

with the percentiles of the ½χ²(1) + ½χ²(2) mixture


distribution.

Distribution theory is much more complex if the null


hypothesis involves more than one variance component.

57/158
Tests using bootstrap

All the asymptotic tests discussed above involve (often


poor) approximations.

A more precise way to obtain critical values for tests such


as the likelihood ratio test is via simulation.

This can be done via a method known as the parametric


bootstrap technique.

58/158
Tests using bootstrap

Parametric Bootstrap

1 Generate data under the null model using the fitted


parameter estimates.

2 Compute the likelihood ratio statistic for these generated


data.

3 Repeat steps 1 and 2 many times, and use the resulting distribution of the
statistic to obtain a critical value (or p-value) for the observed test
statistic.

59/158
Example

Paper brightness

The pulp data frame, available in R from the faraway


package has 20 rows and 2 columns.

Data comes from an experiment to test the paper


brightness depending on a shift operator.

This data frame contains the following columns:


bright Brightness of the pulp as measured by a
reflectance meter
operator Shift operator a-d

Data source: Statistical techniques applied to production situations, F.


Sheldon (1960), Industrial and Engineering Chemistry, 52, 507-509.

60/158
Pulp example

Model

yij = µ + ai + eij

where
yij is the paper brightness measured by the ith operator,
i = 1, . . . , 4 with j = 1, . . . , 5 replicates per operator.
µ is the overall mean
ai is the random effect associated with the ith operator
eij is the experimental error.

61/158
Data

pulp

bright operator
1 59.8 a
2 60.0 a
3 60.8 a
4 60.8 a
5 59.8 a
6 59.8 b
...
18 60.6 d
19 60.5 d
20 60.5 d

62/158
Inference using ANOVA decomposition

# Change the identifiability constraint


# to sum to zero:
op <- options(contrasts = c("contr.sum", "contr.poly"))

# Obtain the ANOVA decomposition for the


# one-way layout:
lmod <- aov(bright ~ operator, data=pulp)
summary(lmod)
Df Sum Sq Mean Sq F value Pr(>F)
operator 3 1.34 0.4467 4.204 0.0226 *
Residuals 16 1.70 0.1062

Operator effect significant with p-value of 0.023.


63/158
Fitting a mixed model using ML
smod <- lmer(bright ~ 1 + (1|operator), data=pulp,
             REML=FALSE)
summary(smod)

Linear mixed model fit by maximum likelihood


Formula: bright ˜ 1 + (1 | operator)
Data: pulp
AIC BIC logLik deviance REMLdev
22.51 25.5 -8.256 16.51 18.74
Random effects:
Groups Name Variance Std.Dev.
operator (Intercept) 0.04575 0.21389
Residual 0.10625 0.32596
Number of obs: 20, groups: operator, 4

Fixed effects:
Estimate Std. Error t value
(Intercept) 60.4000 0.1294 466.7
64/158
Likelihood ratio test

nullmod <- lm(bright ~ 1, data=pulp)

as.numeric(2*(logLik(smod)-logLik(nullmod)))
[1] 2.568371

pchisq(2.5684,1, lower=FALSE)
[1] 0.1090179

Can we trust the χ2 approximation?

65/158
Fitting a mixed model using REML
library(lme4)
mmod <- lmer(bright ~ 1 + (1|operator), data=pulp)
summary(mmod)

Linear mixed model fit by REML


Formula: bright ˜ 1 + (1 | operator)
Data: pulp
AIC BIC logLik deviance REMLdev
24.63 27.61 -9.313 16.64 18.63
Random effects:
Groups Name Variance Std.Dev.
operator (Intercept) 0.068082 0.26093
Residual 0.106250 0.32596
Number of obs: 20, groups: operator, 4

Fixed effects:
Estimate Std. Error t value
(Intercept) 60.4000 0.1494 404.2
66/158
Parametric bootstrap

lrstat <- numeric(1000)


for (i in 1:1000) {
  y <- unlist(simulate(nullmod))
  bnull <- lm(y ~ 1)
  balt <- lmer(y ~ 1 + (1|operator),
               data=pulp, REML=FALSE)
  lrstat[i] <- as.numeric(2*(logLik(balt)-logLik(bnull)))
}

# p-value:
mean(lrstat >2.5684)
[1] 0.02

The effect is significant at 5% level.

67/158
Pulp example

In this example, the p-value obtained from the parametric


bootstrap approach is similar to that from the ANOVA
table (fixed effects model).
The hypotheses for fixed and random effects are different.
It is easier to conclude that there is an effect in a fixed
effects model where the conclusion only applies to the
levels of the factor used in the experiment.
The conclusion about random effects generalizes to a
larger population, hence stronger evidence is required to
obtain significance.

68/158
Prediction

Suppose we want to predict a new value.


If this prediction is for a new operator or an unknown
operator, our best guess will be µ̂ = 60.4.
If we know the operator, we can combine µ̂ with the
estimate of the random effect for that operator to obtain the
empirical best linear unbiased predictor (EBLUP).

69/158
Prediction of the random effects

# Prediction of random effects:


ranef(mmod)$operator
(Intercept)
a -0.1219414
b -0.2591256
c 0.1676695
d 0.2133975

#EBLUPs:
fixef(mmod)+ranef(mmod)$operator
(Intercept)
a 60.27806
b 60.14087
c 60.56767
d 60.61340

70/158
Residuals
Because we can have different fitted values we end up with
more than one type of residual. In the example resid(mmod)
gives residuals as follows:
round(resid(mmod),5)
[1] -0.47806 -0.27806 0.52194 0.52194 -0.47806
[6] 0.34088 0.05912 0.25912 -0.24088 -0.14088
[11] 0.13233 0.13233 -0.06767 0.33233 -0.26767
[16] 0.38660 0.18660 -0.01340 -0.11340 -0.11340

pulp$bright-resid(mmod)
[1] 60.27806 60.27806 60.27806 60.27806 60.27806
[6] 60.14087 60.14087 60.14087 60.14087 60.14087
[11] 60.56767 60.56767 60.56767 60.56767 60.56767
[16] 60.61340 60.61340 60.61340 60.61340 60.61340

We can use these residuals in diagnostic plots.


71/158
Diagnostic plots for pulp data

[Figure: normal QQ-plot of the residuals (Theoretical Quantiles versus Sample
Quantiles, left) and residuals against fitted values (right).]

72/158
Diagnostic plots for pulp data

We can check normality and pick outliers from the


QQ-plot.
We can check the assumption of constant variance from
the residuals versus fitted plot.
If we had more operators, we could also check the
normality and constant variance assumption for the group
level effect too.
In this example the plots indicate no particular problems.

73/158
Mixed models for split-plot designs

Split-plot design

Two factors: A and B.


Factor A is applied to the large experimental units (whole
unit).
The large experimental unit is divided into smaller
experimental units (sub-units).
Factor B is applied to the sub-units.
Each whole unit is a complete replicate of all the levels of
factor B.

74/158
Mixed models for split-plot designs

Why use a split-plot design?

Split-plot experiments are often used out of necessity.


Sometimes a factor, or factorial combination, must be
applied to relatively large experimental units, whereas
other factors are more appropriately applied to sub-units.
Split-plot experiments are also used for convenience. It is
often easier to apply different factors to different sized
units.
They may also be used to increase the precision of the
estimated effect of the factor applied to the sub-units.

75/158
Advantages of split-plot designs

+ They provide greater power for testing the sub-unit


treatment factor and interaction.
+ They allow for different-sized experimental units in the
same experiment.
+ They allow for including a second factor at very little cost.
+ They can be used for experiments involving repeated
measures (from the sub-units) on the same experimental
unit (whole unit).

76/158
Disadvantages of split-plot designs

- Analysis is complicated by the presence of two


experimental error variances and the necessity for several
different standard errors for comparisons.
- High variance and few replications of whole units often
lead to poor sensitivity on the whole-unit factor.

77/158
Example

Water resistance

An experiment was conducted to investigate the effects of


different types of pretreatments and stains on the water
resistance property of wood.
Two types of pretreatments (A and B) and four types of
stains (1, 2, 3 and 4) were included in the study.
Fourteen wood panels were randomly selected and
pretreatment A was applied to seven of them, pretreatment
B was applied to the other seven wood panels.
Each wood panel was divided into four pieces and each of
the four stains was applied to one of the smaller pieces of wood.

78/158
Example
Water resistance data

The water resistance property was characterised by


measuring how long it takes for three drops of water to
pass through the treated materials.
The dataset wood, available from SAS, contains the
following variables:
wood: the identification number of each wood panel in the
study;
pretrt: pretreatment (A or B) applied to the wood panel;
stain: types of stains (1, 2 ,3, or 4) applied to the smaller
piece of wood;
resistance: the time it takes for three drops of water to
pass through the treated materials.

79/158
Example

A split-plot design

This experiment applies each of the pretreatment types (A


and B) to an entire wood panel. Then each panel is cut into
four pieces and the four stain types are applied to the
smaller pieces.
This is a split-plot design. For the pretreatment factor, the
experimental unit is the entire panel, but for the stain
factor, the experimental unit is one of the small pieces cut
from the large panel.

80/158
Example

Quiz
Which of the following factors in the model are fixed and which
random?
wood: the identification number of each wood panel in the
study;
pretrt: pretreatment (A or B) applied to the wood panel;
stain: types of stains (1, 2 ,3, or 4) applied to the smaller
piece of wood.

81/158
Model

yijk = µ + αi + βj + (αβ)ij + wk + eijk


where
yijk is the resistance measurement for the ith pretreatment
(i = 1, 2), jth stain (j = 1, 2, 3, 4) and kth wood panel,
k = 1, . . . , 14;
µ is the overall mean;
αi is the fixed effect associated with the ith pretreatment;
βj is the fixed effect associated with the jth stain;
(αβ)ij is the fixed effect for the pretreatment*stain
interaction;
wk is the random effect associated with the kth wood panel,
assumed i.i.d. N(0, σW²);
eijk is the residual effect, assumed i.i.d. N(0, σE2 ).
82/158
Fitting the model in R

Model without interaction term:


woodres <- read.table("wood.dat", header=TRUE)
woodres$wood <- as.factor(woodres$wood)
woodres$stain <- as.factor(woodres$stain)

# Fit a mixed model to these split-plot type data:

library(lme4)
m2 <- lmer(resistance ~ pretrt + stain
           + (1|wood), data=woodres)
summary(m2)

83/158
R output

Linear mixed model fit by REML


Formula: resistance ˜ pretrt + stain + (1 | wood)

Random effects:
Groups Name Variance Std.Dev.
wood (Intercept) 0.81245 0.90136
Residual 0.81566 0.90314
Number of obs: 56, groups: wood, 14

Fixed effects:
Estimate Std. Error t value
(Intercept) 5.9646 0.4346 13.724
pretrtB 1.3050 0.5389 2.422
stain2 -0.3807 0.3414 -1.115
stain3 -0.9064 0.3414 -2.655
stain4 -1.9714 0.3414 -5.775

84/158
Fixed effects estimates

Intercept corresponds to pretreatment A and stain 1.


The estimated means of all other combinations of
pretreatment and stain level can be worked out from the
output.
A t-value larger than 2 in absolute value roughly
corresponds to a significant effect.
Only certain contrasts can be tested directly from the
output in this way, and there is no multiple comparison
adjustment.

85/158
Types of explanatory variables

In the examples so far, we have focused on categorical


explanatory variables (factors) to answer questions such
as:

Are there differences in strength between adhesives?

Which pretreatment/stain combination is the best for


waterproofing wood?

It is often the case that there are also continuous variables


which must be taken into consideration in order to answer
such questions.

86/158
Analysis of covariance

Analysis of covariance is a (slightly dated) name for a


form of analysis which combines aspects of ANOVA and
regression.

It allows us to evaluate the effect of our categorical variable
of interest on our response variable, while accounting for
additional continuous variables.

The continuous variables are essentially nuisance
variables; we are not interested in their effects, but still
have to account for them in our model.

87/158
Example
Clinical trial for blood pressure drugs

Suppose two drugs were evaluated for the effect of


reducing blood pressure.
For each subject we measure their baseline blood pressure
and the changes in blood pressure after administering one
of the two drugs.
Continuous response: blood pressure (BP).
Explanatory variables: drug (categorical); baseline blood
pressure (continuous).
Is there a difference in mean change in BP between the two
treatment groups, when we compare individuals having the
same baseline BP?

88/158
Possible scenarios

Possible relationships between the response variable and the


treatment and covariate:

the slopes and intercepts for the treatments are the same
the slopes are different, but the intercepts are the same
the slopes and intercepts are different
the intercepts are different, but the slopes are the same
the intercepts are different, but all slopes are zero (a
special case of the previous scenario)

89/158
Possible scenarios

[Figure: five panels of BP Change against Baseline BP for Drug 1 and Drug 2,
one panel per scenario listed above.]

90/158
Example

Silicon wafers

The dataset wafer4 was obtained from the


semiconductor industry.
The experiment was designed to study the effect of
temperature on the deposition rate of a layer of polysilicon
in the fabrication of wafers.
It was thought that the wafer thickness before the
deposition process was applied might have an effect on the
deposition rate.
Therefore, the average thickness of each wafer (thick) was
determined and used as a possible covariate.

91/158
Silicon Wafer Example

Data

A random sample of 24 wafers was collected and used in


the experiment.
Wafers were randomly assigned to one of the three levels
of temperature (900◦ F, 1000◦ F, and 1100◦ F).
As a result, each level of temperature had eight wafers
assigned.
The amount of deposited material at three randomly
chosen sites from each wafer was measured.

92/158
Silicon Wafer Example

Variables in wafer4

temp: temperature (900◦ F, 1000◦ F, and 1100◦ F);


wafer: wafers randomly selected and assigned to one of
the three temperatures;
site: sites on each wafer where the response
measurements were taken (1, 2, and 3);
deposit: the amount of deposited material at each site;
thick: the average thickness of each wafer before the
deposition process.

93/158
Model

yijk = β0 + αi + β1 xij + δi xij + wj(i) + eijk

where
yijk is the deposition rate for the kth site from the jth wafer
assigned to the ith temperature, i = 1, 2, 3, j = 1, . . . , 8
and k = 1, 2, 3;
β0 is the overall intercept;
αi is the coefficient for the ith temperature effect on the
intercept;
β1 is the overall slope;
δi is the coefficient for the ith temperature effect on the
slope;

94/158
Model

Also
xij is the thickness measured on the jth wafer assigned to
the ith temperature.
wj(i) is the random effect for wafer, assumed i.i.d. N(0, σW²)
(wafer effect nested within temperature);
eijk is the site effect, assumed i.i.d. N(0, σE2 ).

95/158
Data

temp wafer site deposit thick


900 1 1 291 1919
900 1 2 295 1919
900 1 3 294 1919
900 2 1 318 2113
900 2 2 315 2113
900 2 3 315 2113
...
1100 8 1 271 2036
1100 8 2 271 2036
1100 8 3 270 2036

96/158
Exploratory Plot

97/158
Initial Impressions

It appears that there is a positive relationship between the


deposition rate and the thickness of the wafers across all
three temperatures.

On average, the deposition rate at 900◦ F seems to be


higher than the deposition rate at other temperatures within
the range of the data.

The slopes do not seem to be the same across three


temperatures.

98/158
Fitting the mixed model in R

We fit a mixed model

mod1 <- lmer(deposit ~ temp + thick + thick:temp
             + (1|wafer), data=data)

Random effects:
Groups Name Variance Std.Dev.
wafer (Intercept) 132.536 11.512
Residual 4.194 2.048

99/158
Fitting the mixed model in R

Fixed effects:
Estimate Std. Error t value
(Intercept) 114.40145 63.83150 1.792
temp1000 -141.12757 89.05137 -1.585
temp1100 84.67028 114.31343 0.741
thick 0.09970 0.03196 3.120
temp1000:thick 0.06371 0.04529 1.407
temp1100:thick -0.05879 0.05774 -1.018

100/158
Comments on output

The summary() command allows us to obtain estimates


for the fixed effect parameters.

This is an over-parameterised model, since more


parameters need to be estimated (8) than there are
independent pieces of information for (6).

To account for this, lme4 uses the first level of factors


temp and thick*temp as a baseline (equal to zero).

The estimates for the other parameters are computed


relative to this fixed estimate.

101/158
Output interpretation

The Intercept term corresponds to the intercept for the


first level of the group variable, in this case temp 900◦ F.

The estimated intercept coefficient for temp 900◦ F is


114.40.

The coefficient for temp 1000◦ F intercept is


114.40 + (−141.13) = −26.73.

The coefficient for temp 1100◦ F intercept is


114.40 + 84.67 = 199.07.

102/158
Output interpretation

The estimate for thick corresponds to the slope for the


first level of the group variable, in this case temp 900◦ F.

The estimated slope for temp 900◦ F is 0.0997.

The slope for temp 1000◦ F is computed as


0.0997 + 0.0637 = 0.1634.

The slope for temp 1100◦ F is computed as


0.0997 + (−0.05879) = 0.0409.

103/158
Fitted regression lines

For temp 900◦ F

deposit = 114.40 + 0.0997 ∗ thick

For temp 1000◦ F

deposit = (114.40 − 141.13) + (0.0997 + 0.0637) ∗ thick


= −26.73 + 0.1634 ∗ thick.

For temp 1100◦ F

deposit = (114.40 + 84.67) + (0.0997 − 0.05879) ∗ thick


= 199.07 + 0.0409 ∗ thick.

104/158
Test of the interaction term

Fixed effects:
Estimate Std. Error t value
(Intercept) 114.40145 63.83150 1.792
temp1000 -141.12757 89.05137 -1.585
temp1100 84.67028 114.31343 0.741
thick 0.09970 0.03196 3.120
temp1000:thick 0.06371 0.04529 1.407
temp1100:thick -0.05879 0.05774 -1.018

The |t-value | < 2 for the interaction terms indicates that


the slopes may not be significantly different.
There may not be enough evidence to warrant a model
with different slopes.

105/158
Model with different intercepts, same slope

We now fit a mixed model with different intercepts, but the


same slope.

mod2 <- lmer(deposit ~ temp + thick +
             (1|wafer), data=data)

Random effects:
Groups Name Variance Std.Dev.
wafer (Intercept) 151.817 12.321
Residual 4.194 2.048

Note that the variance component for wafer has changed


because we have changed the mean model.

106/158
Fixed effects parameter estimates

Fixed effects:
Estimate Std. Error t value
(Intercept) 83.89769 43.89139 1.911
temp1000 -17.14673 6.33695 -2.706
temp1100 -30.79875 6.20972 -4.960
thick 0.11501 0.02191 5.249

107/158
Fitted regression lines

For temp 900◦ F

deposit = 83.8977 + 0.1150 ∗ thick

For temp 1000◦ F

deposit = (83.8977 − 17.1467) + 0.1150 ∗ thick


= 66.7510 + 0.1150 ∗ thick.

For temp 1100◦ F

deposit = (83.8977 − 30.7988) + 0.1150 ∗ thick


= 53.0989 + 0.1150 ∗ thick.

108/158
Adjusting for a covariate

In the presence of a covariate (thickness), we can no longer


simply look at average deposition rates for the three
temperatures.

Instead, we can obtain these at a given covariate value,


e.g. at thickness=2000.

Or we could obtain the means for each temperature


evaluated at the average thickness, x̄·· , by taking
β0 + αi + β1 x̄·· .

These are called adjusted treatment means.

109/158
Adjusted treatment means in R

ls_means(mod2, test.effs=NULL, method.grad='simple')

Least Squares Means table:

Estimate Std. Error df t value lower upper Pr(>|t|)


temp900 309.8568 4.4204 20 70.097 300.6361 319.0776 < 2.2e-16
temp1000 292.7101 4.4382 20 65.953 283.4522 301.9680 < 2.2e-16
temp1100 279.0581 4.3778 20 63.743 269.9261 288.1901 < 2.2e-16

At the mean value of thick, 1964.7, the average amounts


of deposit for temp 900◦ F, 1000◦ F, and 1100◦ F are
309.86, 292.71, and 279.06, respectively.

110/158
Pairwise differences in means

ls_means(mod2, test.effs=NULL, method.grad='simple', pairwise = TRUE)

Least Squares Means table:

Estimate Std. Error df t value


temp900 - temp1000 17.14673 6.33695 20 2.7058
temp900 - temp1100 30.79875 6.20972 20 4.9598
temp1000 - temp1100 13.65202 6.24773 20 2.1851

lower upper Pr(>|t|)


temp900 - temp1000 3.92809 30.36538 0.01360
temp900 - temp1100 17.84550 43.75200 7.539e-05
temp1000 - temp1100 0.61948 26.68456 0.04095

Due to the common slope model, the differences between


deposits at different temperatures will be the same at any
thickness level.

111/158
Comments on the output

At any thickness level, the average amount of deposit for


temp 900◦ F is 17.15 larger than that for temp 1000◦ F.
The average amount of deposit for temp 900◦ F is 30.80
larger than that for temp 1100◦ F.
The average amount of deposit for temp 1000◦ F is 13.65
larger than that for temp 1100◦ F
All pairwise differences are significantly different from
zero.

112/158
Random coefficient models

In the ANCOVA model, the regression coefficients for one


or more continuous explanatory variables are assumed to
be fixed effects.
In a random coefficient model, the regression coefficients
for one or more continuous explanatory variables are
assumed to be random effects.
Data arise from independent subjects or clusters from a
larger population of interest.
The regression model for each subject or cluster can be
assumed to be a random deviation from some population
regression model.

113/158
ANCOVA

[Figure: fitted ANCOVA lines µ(y|x)i = αi + βi x for several treatment groups,
each with its own fixed intercept and slope.]

114/158
Random coefficient model

[Figure: subject-specific regression lines ai + bi x (random coefficients)
scattered around a population regression line.]

115/158
Fixed vs random regression coefficients

ANCOVA graph

The categorical variable (e.g. temperature in the silicon


wafer example) represents all levels of interest; therefore,
it is a fixed effect.

The regression coefficients for each level of the


temperature variable represent unknown fixed parameters
that are estimated from the data.

116/158
Fixed vs random regression coefficients

Random coefficient model graph

The random regression lines for each subject deviate about


the overall population regression line.

The goals of fitting a random coefficient model are


1 to estimate the variances of the intercept and the slope and
any covariance between the two; and
2 to obtain the best linear unbiased predictors (BLUPs) of the
intercept and slope for each subject or cluster.

117/158
Example
Wheat

Ten varieties of wheat were randomly selected from the


population of varieties of hard red winter wheat adapted to
dry climate conditions.
Each variety was randomly assigned to six one-acre plots
of land; thus the experimental units are one-acre plots of
land in a 60-acre field.
It was thought that the pre-plant moisture content of the
plots could have an influence on the germination rate and
hence the eventual yield of the plots.
Therefore, the amount of pre-planting moisture in the top
36 inches of the soil was determined for each plot.

118/158
Wheat Example

Data
The wheat dataset contains the following variables:
id: the identification number for the plots;
variety: ten randomly selected varieties of winter
wheat;
moist: the amount of moisture measured before planting
the varieties on the plots;
yield: the yield of the plot in bushels per acre.

119/158
Yield vs moisture
[Figure: scatter plot of yield against moisture, with points coloured by
variety (1-10).]
120/158
Wheat Example

The response variable is the yield in bushels per acre


(yield), and the continuous explanatory variable is the
measured amount of moisture (moist).

Varieties were randomly selected from the population of


wheat varieties and should be represented by a random
effect.

The resulting regression model for each variety therefore


represents a random sample from the model for the
population of varieties.

Each regression model can be expressed as deviations from


the population model.

121/158
Model

yij = ai + bi xij + eij

where
yij is the yield for the ith variety in the jth plot,
i = 1, . . . , 10 and j = 1, . . . , 6;
xij is the moisture of the ith variety in the jth plot;
ai is the intercept for the ith variety. This is a random effect
because variety is a random effect.
bi is the slope for the ith variety. This is also a random
effect because variety is a random effect.
eij is the random error, assumed i.i.d. N(0, σE2 ).

122/158
Model

For the random intercept and random slope we assume


\begin{pmatrix} a_i \\ b_i \end{pmatrix} \overset{\text{iid}}{\sim}
N\left(\begin{pmatrix} \alpha \\ \beta \end{pmatrix},
\begin{pmatrix} \sigma_A^2 & \sigma_{AB} \\ \sigma_{AB} & \sigma_B^2 \end{pmatrix}\right)

The fixed effects of the model are the intercept α and the
slope β.

These are the expected values of the intercepts and slopes


for the population of varieties.

123/158
Mixed model parameterisation
We have shown that

ai = α + a∗i
bi = β + b∗i

We can therefore rewrite the model as:

yij = α + βxij + a∗i + b∗i xij + eij

where
a∗i i.i.d. N(0, σA²),
b∗i i.i.d. N(0, σB²) and
Cov(a∗i , b∗i ) = σAB .


124/158
Mixed model parameterisation

Fixed effects part of the model

α + βxij

Random effects part of the model

a∗i + b∗i xij + eij

125/158
Random Slope, Random Intercept

Variance
  
\mathrm{Var}(y_{ij}) = \begin{pmatrix} 1 & x_{ij} \end{pmatrix}
\begin{pmatrix} \sigma_A^2 & \sigma_{AB} \\ \sigma_{AB} & \sigma_B^2 \end{pmatrix}
\begin{pmatrix} 1 \\ x_{ij} \end{pmatrix} + \sigma_E^2
= \sigma_A^2 + 2\sigma_{AB}x_{ij} + \sigma_B^2 x_{ij}^2 + \sigma_E^2

Covariance

\mathrm{Cov}(y_{ij}, y_{ik}) = \mathrm{Cov}(a_i^* + b_i^* x_{ij} + e_{ij},\; a_i^* + b_i^* x_{ik} + e_{ik})
= \sigma_A^2 + \sigma_{AB}x_{ij} + \sigma_{AB}x_{ik} + \sigma_B^2 x_{ij}x_{ik}
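These formulas as small R helper functions (a sketch; in practice the variance
components would be replaced by their estimates from the fitted model):

var_y <- function(x, sA2, sAB, sB2, sE2) sA2 + 2 * sAB * x + sB2 * x^2 + sE2
cov_y <- function(xj, xk, sA2, sAB, sB2) sA2 + sAB * (xj + xk) + sB2 * xj * xk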

126/158
Wheat Data
id variety yield moist
1 1 41 10
2 1 69 57
3 1 53 32
4 1 66 52
5 1 64 47
6 1 64 48
7 2 49 30
8 2 44 21
...
59 10 67 48
60 10 74 59

Note: to ensure numerical stability, it is a good idea to scale our


covariate (moist) to take values between 1 and 10.

127/158
Fitting the random coefficient model in R

m1 <- lmer(yield ~ moist + (moist|variety))

Random effects:
Groups Name Variance Std.Dev. Corr
variety (Intercept) 18.8947 4.3468
moist 0.2394 0.4893 -0.34
Residual 0.3521 0.5933

Fixed effects:
Estimate Std. Error t value
(Intercept) 33.4339 1.3985 23.91
moist 6.6166 0.1678 39.42

128/158
Fixed effects in lmer

The term moist in the formula generates a model matrix


X with two columns: the intercept column (all 1s) and the
numeric moist column.

Note that the intercept column is included by default.

If we want to fit the model without an intercept, we must


specify that using 0 + moist or -1 + moist.

129/158
Random effects in lmer

Our random effect terms are generated by


(moist|variety).

The second part of this term ( |variety) tells R to


generate random effect(s) for each of the 10 unique levels
of the variety parameter.

The first part (moist| ) determines the structure of


these random effect terms.

130/158
Random effects in lmer
Until now, we have only used random effects of the form
(1|variety).

The 1 tells R to generate a set of univariate random effects


at the intercept level.

This time, our random effects take the form


(moist|variety).

Again, this includes an intercept by default, and could be


rewritten as (1 + moist|variety).

This therefore tells R to generate a pair of random effects


for each vector - one for the slope (moist) and one for the
intercept.
131/158
Random effects in lmer

The pair of random effects generated by


(moist|variety) are correlated, i.e. there is a
correlation between the slope and intercept effects.

If we want uncorrelated effects, then we must instead


include them as two separate terms:
(1|variety) +(0 + moist|variety).

Similar to the fixed effects, we tell R not to include an


intercept effect using 0 or -1.

132/158
Output: fixed effects

The summary() function provides the following:


Fixed effects:
Estimate Std. Error t value
(Intercept) 33.4339 1.3985 23.91
moist 6.6166 0.1678 39.42

We obtain estimates for our intercept and slope


parameters: α̂ = 33.43 and β̂ = 6.62.

From the t-value, it is clear that there is a significant


relationship between moisture and yield.

133/158
Output: random effects

The summary() function also provides the following:


Random effects:
Groups Name Variance Std.Dev. Corr
variety (Intercept) 18.8947 4.3468
moist 0.2394 0.4893 -0.34
Residual 0.3521 0.5933

We obtain estimates for our variance parameters:


σ̂A2 = 18.8947, σ̂B2 = 0.2394 and σ̂E2 = 0.3521.

We can also compute the covariance σ̂AB as:


σ̂A σ̂B ρ̂AB = √18.8947 × √0.2394 × (−0.34) = −0.727.

134/158
Output: random effects
The ranef() function provides estimates for each of our
random effect terms.
(Intercept) moist
1 0.9577955 -0.4921125
2 -2.2842770 -0.6669726
3 -0.4081197 0.6722278
4 0.6960210 -0.2330618
5 1.1159079 -0.1990372
6 4.6391469 0.2388880
7 -10.7300464 0.5642359
8 2.4011660 0.2243375
9 -0.1762124 0.2335679
10 3.7886182 -0.3420729

The first and second columns contain our intercept effects


â∗i and slope effects b̂∗i respectively.

135/158
Output: random effects

We can use our estimates α̂, β̂, â∗i and b̂∗i to construct a
unique fitted line for each variety i.

For example, we can construct the line for variety 1 as


follows:

EBLUP1 = α̂ + β̂x + â∗1 + b̂∗1 x


= 33.43 + 6.62x + 0.96 + (−0.49)x
= 34.39 + 6.13x

Each variety has a unique intercept and slope, each of


which vary around a common mean.
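One convenient shortcut is lme4's coef() method, which adds the predicted
random effects to the fixed effects and so returns each variety's fitted
intercept and slope directly:

coef(m1)$variety
# e.g. variety 1: intercept 33.43 + 0.96 = 34.39, slope 6.62 - 0.49 = 6.13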

136/158
Fitted lines for each variety
[Figure: fitted EBLUP lines for each of the ten varieties, overlaid on the
yield versus moisture data.]

137/158
Likelihood ratio test

The correlation between the intercept and slope random


effects in our model was -0.34.

We may wish to consider a model with uncorrelated


random effects.

We can use a likelihood ratio test to compare these two


models and decide whether the correlation is necessary.

Note that this test is OK since we are testing ρ = 0, which


is not on the boundary of the parameter space for a
correlation parameter [−1, 1].

138/158
Likelihood ratio test

We can carry out the likelihood ratio test in R using the


anova() function.
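The uncorrelated model m2 is not fitted explicitly in these slides; a possible
call, matching the formulas shown in the anova() output below, would be:

m2 <- lmer(yield ~ moist + (1 | variety) + (0 + moist | variety))
m3 <- lmer(yield ~ moist + (1 | variety))   # used in the later comparison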
anova(m2,m1)

Models:
m2: yield ˜ moist + (1 | variety) + (0 + moist | variety)
m1: yield ˜ moist + (moist | variety)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
m2 5 193.10 203.57 -91.548 183.10
m1 6 194.06 206.62 -91.028 182.06 1.0411 1 0.3076

We obtain a p-value of 0.31, which suggests that the


simpler uncorrelated model may be used.

139/158
Likelihood ratio test
We can carry out a similar test to see whether we can
remove the random slope from our model.

anova(m3,m2)

Models:
m3: yield ˜ moist + (1 | variety)
m2: yield ˜ moist + (1 | variety) + (0 + moist | variety)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
m3 4 208.26 216.64 -100.129 200.26
m2 5 193.10 203.57 -91.548 183.10 17.162 1 3.432e-05 ***

We obtain a p-value << 0.05, which suggests that the


random slope is necessary.

We can do a similar test to see whether the random


intercept is necessary (p << 0.05 again).

140/158
Hierarchical linear models

In hierarchical or multilevel models, we have a nested


data structure.

We therefore have a model design which accounts for this


nesting by considering each level of the data in turn.

Level 1 of the model corresponds to the smallest sized


units.

Level 2 of the model corresponds to the first grouping


factor, such that Level 1 is nested within Level 2.

If we have more than two levels, then Level 2 is nested


within Level 3, and so on.

141/158
Hierarchical linear models

In the analysis of such data we fit a random coefficient


model at Level 1.

The coefficients for this Level 1 model are modelled as a


function of the Level 2 variables.

We continue this pattern if there are more levels and finally


combine models from all the levels.

142/158
Example

Test score gains

Data was collected for 3,111 eighth-grade students in the


US.
The students’ test score gains (Gain) on one of the
mathematics achievement tests were recorded.
In addition, the sum of some pretest core items
(PreTotal) on the same students was also recorded.
These students were grouped into 159 classes.
A variable measuring the percent of the class with a
sufficient degree of mastery of previous curricula
(Tmastry) was recorded for each class.

143/158
Test Score Example

Data
The mathscore dataset contains the following variables:
Gain: the test score gains on a mathematics achievement
test for each student;
PreTotal: the sum of some pretest core items for each
student;
Class: the class each student belongs to;
Tmastry: the percent of class mastering previous
curricula.

144/158
Test Score Example

Nested structure

Students are nested within classes.

Students and classes are often considered as random


effects.

In this example, the student effect is modelled by the


residuals.

Measurements were taken at both student and class levels.

145/158
Model at the student level

The response variable, the test score gains (Gain), was


measured at the student level.
The explanatory variable, the sum of some pretest scores
(PreTotal), was also measured at the student level.
A linear regression model might be appropriate to fit the
data.
Because classes are considered as random effects, a
random coefficient model may be reasonable.
In this random coefficient model, the coefficients for each
class (intercept and slope) represent a deviation from some
population regression model.

146/158
Model at the student level

The model at student level is therefore:

yij = aj + bj xij + eij

where
yij is the gain for the ith student in the jth class,
i = 1, . . . , nj and j = 1, . . . , 159;
xij is the sum of pretest scores of the ith student in the jth
class;
aj is the intercept for the jth class. This is a random effect
because class is a random effect.
bj is the slope for the jth class. This is also a random effect.
eij is the random error, assumed i.i.d. N(0, σE2 ).

147/158
Model at the student level

For the random intercept and random slope we assume


\begin{pmatrix} a_j \\ b_j \end{pmatrix} \overset{\text{iid}}{\sim}
N\left(\begin{pmatrix} \alpha_0 \\ \beta_0 \end{pmatrix},
\begin{pmatrix} \sigma_A^2 & \sigma_{AB} \\ \sigma_{AB} & \sigma_B^2 \end{pmatrix}\right)

The fixed effects of the model are the intercept α0 and the
slope β0 .

These are the expected values of the intercepts and slopes


for the population of classes.

148/158
Model at the student level
We can rewrite this model as:

yij = α0 + β0 xij + a∗j + b∗j xij + eij

where

aj = α0 + a∗j and
bj = β0 + b∗j

with
a∗j i.i.d. N(0, σA²),
b∗j i.i.d. N(0, σB²) and
Cov(a∗j , b∗j ) = σAB .


149/158
Model at the class level

The percentage of the class mastering previous curricula


(Tmastry) is measured at the class level.

This effect can be incorporated into the model for the


intercept and slope for each class:

aj = α0 + α1 zj + a∗j
bj = β0 + β1 zj + b∗j

where zj is the Tmastry for class j, α0 and β0 are fixed


intercepts and α1 and β1 fixed slope parameters.

150/158
Model at the class level

Here we are rewriting the random coefficients for the


intercept and slope to incorporate the Tmastry effect
measured at class level.

The distributional assumptions on a∗j and b∗j are

a∗j i.i.d. N(0, σA²),
b∗j i.i.d. N(0, σB²) and
Cov(a∗j , b∗j ) = σAB .


151/158
Multilevel model

We can combine the student-level model and the


class-level equations

yij = aj + bj xij + eij


aj = α0 + α1 zj + a∗j
bj = β0 + β1 zj + b∗j

to produce a single equation involving effects at two levels


(student and class levels):

yij = α0 + α1 zj + β0 xij + β1 zj xij + a∗j + b∗j xij + eij

152/158
Mathscore data

Class Tmastry Gain PreTotal


1 50 2 20
1 50 -3 18
1 50 5 12
1 50 1 9
1 50 -3 11
1 50 3 12
...
159 95 5 12
159 95 3 16
159 95 12 12

Note: to ensure numerical stability, it is a good idea to scale our


covariates (Tmastry and PreTotal) to take values between 1 and
10.

153/158
Fitting the multilevel model in R

m1 <- lmer(Gain ~ PreTotal + Tmastry +
           PreTotal:Tmastry + (PreTotal|Class))

Random effects:
Groups Name Variance Std.Dev. Corr
Class (Intercept) 9.0284 3.005
PreTotal 0.7796 0.883 -0.82
Residual 21.6545 4.653

Fixed effects:
Estimate Std. Error t value
(Intercept) -1.494221 1.341034 -1.114
PreTotal -1.602810 0.652153 -2.458
Tmastry 1.131062 0.176961 6.392
PreTotal:Tmastry -0.006142 0.084758 -0.072

154/158
Output: fixed effects

Fixed effects:
Estimate Std. Error t value
(Intercept) -1.494221 1.341034 -1.114
PreTotal -1.602810 0.652153 -2.458
Tmastry 1.131062 0.176961 6.392
PreTotal:Tmastry -0.006142 0.084758 -0.072

Our parameter estimates are α̂0 = −1.49, β̂0 = −1.60,


α̂1 = 1.13 and β̂1 = −0.0062.

The interaction term PreTotal:Tmastry has a very


small t-value. This term is not significant and can be
dropped from the model.

155/158
Output: random effects

Random effects:
Groups Name Variance Std.Dev. Corr
Class (Intercept) 9.0284 3.005
PreTotal 0.7796 0.883 -0.82
Residual 21.6545 4.653

The estimated variance components are


σ̂A2 = 9.0284
σ̂B2 = 0.7795
σ̂AB = σ̂A σ̂B ρ̂AB = −2.1861
σ̂E2 = 21.6545
Note the negative correlation between intercept and slope: larger intercepts
tend to have smaller slopes.

156/158
Likelihood ratio test

We can carry out a likelihood ratio test to see if we need


correlation between our slope and intercept random
effects.

Models:
m3: Gain ˜ PreTotal + Tmastry + (1 | Class) +
m3: (0 + PreTotal | Class)
m2: Gain ˜ PreTotal + Tmastry + (PreTotal | Class)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
m3 6 18681 18717 -9334.6 18669
m2 7 18673 18716 -9329.6 18659 9.8694 1 0.001681 **

We obtain a p-value << 0.05, which suggests that the


correlation is necessary.

157/158
EBLUPs for (some) random effects

Partial output showing some of the estimates â∗j and b̂∗j :

(Intercept) PreTotal
1 -1.92675077 0.0119876249
2 -5.39003631 0.0915662399
3 -1.00595155 0.0094430992
4 -0.38172198 0.0321817612
5 1.58481753 -0.0650445905
6 0.13879856 -0.0187367181
7 -0.87685526 0.0031362848
8 -2.18841662 0.0299758090

158/158
