Statistical Principles of
Experimental Design
Dov Stekel
Maximum information from
minimum effort
Overview
Blocking and randomization
Arrangement of samples and arrays
Class exercise
How many replicates?
Computer practical
Blocking, Randomization and
Blinding
Arrange the experimental design so that it
minimises problems from extraneous
sources of variability
Use blocking to avoid confounding
Use randomization and blinding to avoid
bias
Toxicity Example
We are interested in characterising the
toxic effect of Benzo(a)pyrene (BP) on rats
8 Rats are to be treated with BP and 8 rats
with a control compound
Each array will be hybridized against a
reference sample
16 Arrays in the experiment
Experimental Design
There are two batches of 8 slides from two
different print runs (1 and 2)
Hybridisation will be done by two
researchers, Alison and Brian.
What is the best way to arrange the
experiment?
Design 1
Alison prepares all 8 BP samples and
hybridises them to the arrays of print run 1
Brian prepares all 8 control samples and
hybridises them to the arrays of print run 2
Design 2
Alison chooses 8 rats and treats 4 with BP and 4
with control substance.
She prepares and hybridises 2 BP samples to
arrays from print run 1 and 2 BP samples to
arrays from print run 2
She prepares and hybridises 2 control samples
to arrays from print run 1 and 2 control samples
to arrays from print run 2
Brian does the same with the other 8 rats
Design 2
Number of arrays per researcher, print run and
treatment:
Researcher   Print Run   Control   Treated
Alison       1           2         2
Alison       2           2         2
Brian        1           2         2
Brian        2           2         2
Design 3
8 rats are randomly assigned to Alison, along
with 4 BP preparations and 4 control
preparations. She is not told which
preparations are which.
She prepares and hybridises samples to
randomly pre-arranged arrays so that 2 BP
samples and 2 control samples are hybridised
to 4 arrays from each of print runs 1 and 2.
Brian does the same with the other 8 rats
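A minimal R sketch of how a layout like Design 3 could be generated. The function name, column names, seed and blind codes are all hypothetical; the slides do not say how the randomization was actually carried out.

```r
# Hypothetical sketch: blocked, randomized and blinded layout for 16 rats.
randomise_design <- function(seed = 1) {
  set.seed(seed)
  design <- data.frame(
    rat        = sample(1:16),                        # rats in random order
    researcher = rep(c("Alison", "Brian"), each = 8), # first 8 to Alison
    print_run  = rep(rep(c(1, 2), each = 4), times = 2),
    treatment  = NA_character_,
    stringsAsFactors = FALSE
  )
  # Blocking: within each researcher x print-run block, assign
  # 2 BP and 2 control preparations in random order.
  for (r in unique(design$researcher)) {
    for (p in unique(design$print_run)) {
      idx <- which(design$researcher == r & design$print_run == p)
      design$treatment[idx] <- sample(rep(c("BP", "Control"), each = 2))
    }
  }
  # Blinding: each preparation gets an uninformative code.
  design$blind_code <- sprintf("S%02d", sample(1:16))
  design
}

design <- randomise_design()
# What the researchers see (no treatment column):
blinded_view <- design[, c("rat", "researcher", "print_run", "blind_code")]
# Key kept by a third party until the analysis:
unblinding_key <- design[, c("blind_code", "treatment")]
```

Every researcher and print run combination ends up with 2 BP and 2 control arrays, so the layout stays balanced while the choice of rats is left to chance.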
What is wrong with design 1?
Treatment, researcher and print run are
confounded variables
We cannot tell whether differences between the
two groups of rats result from treatment,
researcher or print run
Use blocking in designs 2 and 3 to deconfound
the variability of interest (treatment) from the
extraneous variabilities (researcher and print run)
Designs 2 and 3 are also balanced, which
increases the power of the analyses
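One way to see the confounding is to compare the rank of the design matrix under Design 1 and Design 2. This is an illustrative sketch; the data frames and the helper design_rank() are made up for this example.

```r
# Design 1: treatment, researcher and print run all change together.
design1 <- data.frame(
  treatment  = rep(c("BP", "Control"), each = 8),
  researcher = rep(c("Alison", "Brian"), each = 8),
  print_run  = rep(c("1", "2"), each = 8)
)

# Design 2: every combination of the three factors occurs twice.
design2 <- expand.grid(
  replicate  = 1:2,
  treatment  = c("BP", "Control"),
  print_run  = c("1", "2"),
  researcher = c("Alison", "Brian")
)

# Number of columns in the additive model matrix versus its rank.
design_rank <- function(d) {
  X <- model.matrix(~ treatment + researcher + print_run, data = d)
  c(columns = ncol(X), rank = qr(X)$rank)
}

design_rank(design1)  # 4 columns but rank 2: the effects cannot be separated
design_rank(design2)  # 4 columns and rank 4: each effect can be estimated
```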
What is wrong with design 2?
Alison's choice of rats may be biased
For example, she may choose the
healthiest rats, so confounding potential
treatment effects with researcher variability
Use randomization and blinding in design
3 to avoid bias
Arrangement of Samples and Arrays
Is it better to use Affymetrix arrays or a
two-colour array system?
If using a two-colour array system, is it
better to use a reference sample?
If using a two-colour array system, what is
the best arrangement of samples on the
slides?
Several Factors
Available technology
Cost
Statistical considerations
We consider the problem from the perspective
of three different experiments
Example 1:
Hepatocellular Carcinomas
Samples are taken from disease and
healthy tissue from patients suffering from
hepatocellular carcinomas and hybridised
to microarrays. We would like to identify
genes that are up- or down- regulated in
hepatocellular carcinomas relative to
healthy tissue.
Design 1.1
[Figure: Healthy 1 vs. reference sample on Array 1;
Disease 1 vs. reference sample on Array 2; x 20]
Design 1.2
[Figure: Healthy 1 on GeneChip 1;
Disease 1 on GeneChip 2; x 20]
Design 1.3
[Figure: Healthy 1 and Disease 1 together on
Array 1; x 20]
Design 1.4
[Figure: Healthy 1 and Disease 1 on Array 1 (x 10);
Healthy 11 and Disease 11 on Array 11 (x 10),
presumably with the dye assignment reversed]
Design 1.5
[Figure: Healthy 1 and Disease 1 on Array 1 and
again on Array 2, presumably with the dyes
swapped; x 20]
Which is the best design?
Simple experiment - five different
designs!
Design 1.1 is bad because it increases
variability.
Design 1.3 is bad because it confounds
colour with disease state.
Designs 1.4 and 1.5 are best.
Design 1.1
[Figure: Healthy vs. reference sample on Array 1;
Disease vs. reference sample on Array 2]
Coefficient of variability is 30%
Design increases variability to 43%
Design 1.5
[Figure: Healthy and Disease on Array 1 and
again on Array 2]
Coefficient of variability: 30%
Experimental design reduces variability to 21%
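The 43% and 21% figures can be reproduced under two assumptions that the slides do not state. First, the data are log-normal and each log ratio has the same standard deviation. Second, the reference design compares two independent log ratios (variance doubles), while Design 1.5 averages two replicate log ratios (variance halves). The function names below are illustrative.

```r
# Convert between a coefficient of variation on the raw scale and the
# standard deviation on the natural-log scale (log-normal data).
cv_to_log_sd <- function(cv) sqrt(log(1 + cv^2))
log_sd_to_cv <- function(s)  sqrt(exp(s^2) - 1)

sigma <- cv_to_log_sd(0.30)          # log-scale sd for a 30% CV

# Reference design: difference of two independent log ratios.
cv_reference <- log_sd_to_cv(sigma * sqrt(2))

# Two arrays per sample pair: mean of two replicate log ratios.
cv_replicated <- log_sd_to_cv(sigma / sqrt(2))

round(100 * c(reference = cv_reference, replicated = cv_replicated))
# reference 43, replicated 21
```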
Example 2:
B-Cell Lymphomas
Samples are taken from 60 patients
suffering from B-cell lymphomas and
hybridised to microarrays. The aim of the
experiment is to identify clinically relevant
subgroups of patients using a cluster
analysis, and then to build a classification
model to differentiate between the
subgroups.
Design 2.1
[Figure: Patient 1 and Patient 2 together on
Array 1; x 30]
Design 2.2
[Figure: Patient 1 vs. Reference on Array 1; x 60]
Design 2.3
[Figure: Patient 1 on GeneChip 1; x 60]
Which design is best?
Design 2.1 is bad because it is difficult to
compare patients on equal footing.
Designs 2.2 and 2.3 are good.
This is probably the most appropriate use of
Affymetrix technology.
Example 3:
Yeast Time Series
Budding yeast can reproduce sexually by
producing haploid cells through a process
called sporulation. Yeast was placed in a
sporulating medium, samples taken at 7
timepoints from the start of sporulation.
We are interested in identifying genes that
show similar profiles in the timecourse.
Design 3.1
[Figure: six arrays, each hybridising one of
Time 1 to Time 6 against Time 0]
Design 3.2
[Figure: loop of seven arrays: Time 0 with Time 1,
Time 1 with Time 2, and so on up to Time 6 with
Time 0]
Design 3.3
[Figure: one GeneChip per timepoint, Time 0 to
Time 6]
Which is the best design?
Design 3.3 is bad because timepoint is
confounded with array.
Design 3.2 is a loop design. It is a good
design, but harder to analyse.
Design 3.1 is the best design.
Bright Timepoint Problem
Imagine we have a "bright" array. This
could be because of:
Higher gene expression
Experimental artifact
Normalising by array mean or median
cannot deconfound these factors
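A small illustration of the problem with made-up intensities. It assumes an artifact that brightens the whole array also scales the reference channel on that array; the slides do not spell this out.

```r
# Hypothetical baseline intensities for four genes.
baseline  <- c(geneA = 100, geneB = 200, geneC = 400, geneD = 800)
reference <- baseline

# Scenario 1: every gene is genuinely expressed 2x higher.
biology  <- list(sample = 2 * baseline, reference = reference)
# Scenario 2: no biological change, but the whole array reads 2x brighter
# (assumed to affect both channels equally).
artifact <- list(sample = 2 * baseline, reference = 2 * reference)

# Normalising to the array median: log2 intensity minus the array median.
median_normalise <- function(x) log2(x) - median(log2(x))
# Normalising to the reference: log2 ratio of sample to reference.
log_ratio <- function(arr) log2(arr$sample / arr$reference)

# Median normalisation gives the same values in both scenarios,
# so it cannot tell them apart.
identical(median_normalise(biology$sample), median_normalise(artifact$sample))

log_ratio(biology)   # 1 for every gene: the real change is kept
log_ratio(artifact)  # 0 for every gene: the artifact cancels
```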
Time Series Example
[Figures: Time Series Ratios; Raw Gene Expression
for FYV1; FYV1 Normalised to Array; FYV1
Normalised to Reference]
Class Exercise
Two strains of Staphylococcus aureus:
methicillin-sensitive and methicillin-resistant
Each strain is cultured and then either
treated or untreated with methicillin
Samples are taken at several time points
(0h, 2h, 6h, 10h)
We want to identify genes involved in
methicillin-resistance
How Many Replicates?
Use Power Analysis which relates:
Difference in mean we are trying to detect
Population and experimental variability
Type of analysis
Chosen significance threshold
Number of replicates
Population Inference
[Figure: Population, Sample, Inference]
Confidence
The confidence is the probability of not getting
a false positive result.
It is the probability of not rejecting the null
hypothesis when the null hypothesis is true.
A false positive result is known as a Type I
Error.
We control for Type I errors explicitly by
selecting an appropriate confidence level
In microarray experiments, we must modify the
confidence level to account for multiplicity
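A short sketch of why multiplicity matters, using the 10,000 genes and the 0.001 threshold from the doxorubicin example in these slides. The Bonferroni correction is shown only as one common option; the slides do not name a method, and the p-values are made up.

```r
n_genes   <- 10000
threshold <- 0.001

# Expected number of false positives if no gene is truly changed.
expected_false_positives <- n_genes * threshold   # 10

# Bonferroni: per-gene threshold that keeps the chance of any
# false positive across all genes at or below 5%.
bonferroni_threshold <- 0.05 / n_genes            # 5e-06

# Equivalent view: adjust the p-values and compare them with 0.05.
p_values <- c(1e-7, 1e-4, 0.01, 0.5)              # hypothetical p-values
adjusted <- p.adjust(p_values, method = "bonferroni", n = n_genes)
adjusted < 0.05   # only the first gene remains significant
```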
Power
The power is the probability of not getting a false
negative result.
It is the probability of rejecting the null hypothesis
when the null hypothesis is false.
A false negative result is known as a Type II
Error.
We control the power implicitly via the confidence
level and the experimental design.
Type I and Type II Errors
                     TRUE SITUATION
OUR DECISION         No effect       Effect
Not significant      Correct         Type II error
Significant          Type I error    Correct
Power Analysis Assumptions
We assume that the data are approximately
log-normally distributed
This corresponds to standard deviation of the
errors of the raw data being proportional to the
signal intensity
This is equivalent to a constant standard
deviation in the logged data
The standard deviation divided by the mean is
called the coefficient of variation
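The link between these statements can be written out for a log-normal variable. The symbols X, mu and sigma are introduced here for illustration, and natural logarithms are assumed.

```latex
% Assume \ln X \sim N(\mu, \sigma^2), i.e. X is log-normal.
\begin{align*}
  \mathrm{E}[X]   &= e^{\mu + \sigma^2/2}, \\
  \mathrm{Var}(X) &= \left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2}, \\
  \mathrm{CV}     &= \frac{\sqrt{\mathrm{Var}(X)}}{\mathrm{E}[X]}
                   = \sqrt{e^{\sigma^2} - 1}, \\
  \sigma          &= \sqrt{\ln\left(1 + \mathrm{CV}^2\right)}.
\end{align*}
```

The coefficient of variation depends only on sigma, not on mu. A constant standard deviation on the log scale therefore means the raw standard deviation is a fixed fraction of the mean signal.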
Log Normally Distributed Data
Power Analysis
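We will use the power.t.test() function in
R to calculate the power of one- and two-
sample tests
power.t.test(n, delta, sd,
sig.level, power, type,
alternative)
The function is called with exactly one of the
first five arguments left unknown (NULL), and it
calculates that argument from the other four
sd and sig.level have default values (1 and
0.05), so they must be set to NULL explicitly
to be solved for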
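A minimal usage sketch. The inputs (difference of 1, standard deviation of 1, 10 samples per group) are hypothetical and not taken from the slides. With type left at its default, a two-sample test is assumed.

```r
# Power omitted: solve for the power of a two-sample test
# with 10 samples per group at the default 5% significance level.
res_power <- power.t.test(n = 10, delta = 1, sd = 1)
res_power$power

# n omitted: solve for the group size that gives 90% power.
res_n <- power.t.test(delta = 1, sd = 1, power = 0.9)
ceiling(res_n$n)      # round up to whole samples per group

# delta omitted: smallest difference detectable with 90% power.
res_delta <- power.t.test(n = 10, sd = 1, power = 0.9)
res_delta$delta
```

For two-sample tests the returned n is the number in each group, not the total.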
Power Analysis Example:
Doxorubicin Chemotherapy
We are interested in the treatment of breast
cancer patients with doxorubicin chemotherapy
We want to perform a microarray experiment to
determine genes that are up- or down- regulated
as a result of the chemotherapy
We would like to know:
How to design the experiment?
How many patients we need?
Paired vs Unpaired Design
In a paired design, we take samples from each
patient before and after treatment, and for each
gene, look at the difference in expression before
and after treatment
In an unpaired design, we have two groups of
patients, one group treated, the other group
untreated. We look at the difference in gene
expression between the two groups
Which is a better experiment?
Paired and Unpaired Designs
Paired: test if mean is different from zero
Unpaired: test if means of groups are different
Power Analysis Assumptions
Suppose we know from a pilot study and
evaluation of our technology that the
coefficient of variation is 40%
Let's say that we want to detect genes that are
2-fold regulated
We are testing 10,000 genes so we will use a
significance threshold of 0.001 to compensate
for multiplicity
How many patients do we need for a power of
80%, 90% and 99%?
Paired Experiment
The standard deviation of the underlying normal
distribution equivalent to 40% variability is 0.39
The difference in means is log2(2) = 1
The number of patients we need is:
Power    Number
80%      8
90%      9
99%      11
Unpaired Experiment
The standard deviation and difference in
means is the same.
The number of patients we need is:
Power    Group Size    Number    1-Sample Number
80%      8             16        8
90%      10            20        9
99%      13            26        11
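A sketch of how these tables could be computed with power.t.test(), using the values given on the slides (difference 1, standard deviation 0.39, threshold 0.001). The helper required_n() is hypothetical, and the two-sided alternative is an assumption. Note that 0.39 is the standard deviation on the natural-log scale for a 40% coefficient of variation, whereas log2(2) = 1 is in log2 units; in log2 units the same standard deviation would be about 0.56. The code keeps the slide's values as given.

```r
powers <- c(0.80, 0.90, 0.99)

# Smallest whole number of samples reaching the requested power.
required_n <- function(power, type) {
  res <- power.t.test(delta = 1, sd = 0.39, sig.level = 0.001,
                      power = power, type = type,
                      alternative = "two.sided")
  ceiling(res$n)
}

# Paired: sd is treated as the sd of the within-patient differences.
paired <- sapply(powers, required_n, type = "paired")
# Unpaired: n is the number of patients in each of the two groups.
unpaired_group <- sapply(powers, required_n, type = "two.sample")

data.frame(
  power            = powers,
  paired_patients  = paired,
  unpaired_group   = unpaired_group,
  unpaired_total   = 2 * unpaired_group
)
```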
Paired vs Unpaired
In this example, we need at least twice
the patients in the unpaired experiment to
obtain the same power as the paired
experiment (16 vs 8, 20 vs 9, 26 vs 11)
Paired experimental design is more
powerful than unpaired experimental
design because the differences between
individuals are factored out in the analysis
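A simple variance model makes this explicit. It is an illustrative assumption, not taken from the slides: each measurement is a treatment mean plus a patient effect plus independent measurement error.

```latex
% Illustrative model: patient i, condition j (before/after or control/treated)
%   y_{ij} = \mu_j + b_i + \varepsilon_{ij},
%   b_i \sim N(0, \sigma_b^2)  (between-patient variability),
%   \varepsilon_{ij} \sim N(0, \sigma_w^2)  (within-patient variability).
\begin{align*}
  \text{Paired:}\quad
    d_i &= y_{i,\text{after}} - y_{i,\text{before}}
         = (\mu_{\text{after}} - \mu_{\text{before}})
           + \varepsilon_{i,\text{after}} - \varepsilon_{i,\text{before}}, \\
    \mathrm{Var}(\bar{d}) &= \frac{2\sigma_w^2}{n}, \\[4pt]
  \text{Unpaired:}\quad
    \mathrm{Var}(\bar{y}_{\text{treated}} - \bar{y}_{\text{control}})
        &= \frac{2\left(\sigma_b^2 + \sigma_w^2\right)}{n}.
\end{align*}
```

The patient effect b_i cancels in each paired difference, so the between-patient variance appears only in the unpaired comparison. Here n is the number of patients in the paired design and the number per group in the unpaired design.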
Conclusions
Extraneous variability:
Block to avoid confounding variables
Randomisation to avoid bias
Blocked experiments require ANOVA
analyses
Two sample experiments
Reference samples increase variability.
Hybridise both samples to same array.
Conclusions
Multiple patient comparisons
Reference samples or Affymetrix technology
enable comparisons.
Time series analysis
Reference samples are essential.
Number of replicates
Calculate using power analyses.
Computer Practical
Power analysis for population inference test