Error analysis in biology
Marek Gierliński
Division of Computational Biology
Hand-outs available at http://is.gd/statlec
Errors, like straws, upon the surface flow;
He who would search for pearls must dive below
John Dryden (1631-1700)
Previously on Errors…
• Random variable: result of an experiment
• Probability distribution: how random values are
distributed
• Discrete and continuous probability distributions
Gaussian (normal) distribution
• very common
• 95% probability within 𝜇 ± 1.96𝜎
Poisson (count) distribution
• random and independent events
• mean = variance
• approximates Gaussian for large 𝑛
Binomial distribution
• probability of 𝑘 successes out of 𝑛 trials
• toss a coin
• approximates Gaussian for large 𝑛
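The 95% figure quoted for the Gaussian distribution can be checked numerically. The snippet below is a minimal sketch using Python's standard library; the variable names are illustrative and not part of the lecture.

```python
from statistics import NormalDist

# Standard normal distribution: mu = 0, sigma = 1
z = NormalDist(mu=0.0, sigma=1.0)

# Probability that a value falls within mu ± 1.96 sigma
p_within = z.cdf(1.96) - z.cdf(-1.96)
print(f"P(|x - mu| < 1.96 sigma) = {p_within:.4f}")
```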
Example
Take one mouse and weigh it
Result: 18.21 g
Reading error
Take five mice and find mean weight
Result: 18.81 g
Sampling error
These are examples of measurement errors
2. Measurement errors
“If your experiment needs statistics, you ought to have
done a better experiment”
Ernest Rutherford
Different types of errors
Systematic errors
• Incorrect instrument calibration
• Model uncertainties
• Change in experimental conditions
• Mistakes!
Systematic errors can be eliminated in good experiments
Random errors
• Reading errors
• Sampling errors
• Counting errors
• Background noise
• Intrinsic variability
• Sensitivity limits
You can’t eliminate random errors, you have to live with them. You can estimate (and reduce) random error by taking multiple measurements
Random measurement error
Determine the strength of oxalic acid in a sample
Method: find the volume of NaOH solution required to neutralize a given volume
of the acid by observing a phenolphthalein indicator
Uncertainties contributing to the final result
• volume of the acid sample
• judgement of the point at which the acid is neutralized
• volume of NaOH solution used at this point
• accuracy of NaOH concentration
  • weight of solid NaOH dissolved
  • volume of water added
Each of these uncertainties adds a random error to the final result
A model of random measurement error
Laplace 1783
Consider a measurement of a certain
quantity
Its unknown true value is 𝑚0
Measurement is perturbed by small
uncertainties
Each of them contributes a small random
deviation, ±𝜀, from the measured value
This creates a binomial distribution
For large 𝑛 it approximates a Gaussian
We expect random measurement errors to be normally distributed
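Laplace's model can be simulated directly. The sketch below adds 𝑛 deviations of +𝜀 or −𝜀 to a true value and checks that the spread matches the binomial prediction, 𝜀√𝑛. The numerical values of m0, eps and n, the seed, and the function name are illustrative assumptions, not taken from the lecture.

```python
import random
from collections import Counter

def laplace_measurement(m0, eps, n, rng):
    """One measurement: true value m0 plus n random deviations of +eps or -eps."""
    return m0 + sum(rng.choice((-eps, eps)) for _ in range(n))

rng = random.Random(1)          # fixed seed, for reproducibility only
m0, eps, n = 10.0, 0.1, 20      # illustrative values, not from the lecture
values = [laplace_measurement(m0, eps, n, rng) for _ in range(20000)]

mean = sum(values) / len(values)
sd = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5

# Theory: the number of +eps steps is binomial(n, 1/2), so the total
# deviation has mean 0 and standard deviation eps * sqrt(n).
print(f"mean = {mean:.3f} (expected {m0})")
print(f"SD   = {sd:.3f} (expected {eps * n ** 0.5:.3f})")

# Crude text histogram of the deviations, in units of eps
counts = Counter(round((v - m0) / eps) for v in values)
for k in sorted(counts):
    print(f"{k:+3d} {'#' * (counts[k] // 100)}")
```

The printed histogram should look roughly bell-shaped, which is the point of the model: many small independent contributions give an approximately Gaussian error.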
Biological and technical variability
Biological variability
• Molecular level
• Phenotype variability
• From subject to subject
• Variability in time
• Life is stochastic!
Technical variability
• Random measurement errors
• Accumulation of errors
In most experiments biological variability dominates
It is hard to disentangle the two types of variability
Sampling error
Repeated measurements give us
mean value
variability scale
Sampling from a population
Measure the body weight of a mouse
Sample: 5 mice
Population: all mice on the planet
Small sample size introduces uncertainty

Body weight of 5 mice (g)              Mean (g)
20.38  20.73  23.24  15.39  12.58      18.5
27.48  12.52  21.95  12.54  21.19      19.1
14.73  16.37  28.21  21.18  13.48      18.9
Reading error
When you do one simple measurement using
• ruler
• micrometer
• voltmeter
• thermometer
• measuring cylinder
• stopwatch
The reading error is half of the smallest division
A ruler with a 1-mm scale can give a reading of 230.5 mm, with a reading error of 0.5 mm
Beware of digital instruments that sometimes display more digits than their real accuracy justifies
Read the instruction manual!
Reading error does not take into account biological variability
Counting error
Dilution plating of bacteria
Counted $C = 17$ colonies on a plate at the $10^{-5}$ dilution
Counting statistics: Poisson distribution
$\sigma = \sqrt{\mu}$
Use standard deviation as error estimate
$S = \sqrt{C} = \sqrt{17} \approx 4$
$C = 17 \pm 4$
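The counting error for the plate above is a one-line calculation. A minimal sketch, with illustrative variable names:

```python
import math

count = 17                      # colonies counted on the plate (from the slide)
error = math.sqrt(count)        # Poisson: sigma = sqrt(mu), estimated by sqrt(C)

print(f"C = {count} ± {error:.1f}")
print(f"fractional error = {error / count:.2f}")
```

The fractional error shrinks as the count grows, which is why plates with more colonies give more precise estimates.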
Counting error
Gedankenexperiment
True mean count, $\mu = 11$
Measure counts on 10,000 plates (!)
Plot counts, $C_i$, and their errors, $S_i = \sqrt{C_i}$
Plot distribution of counts from 10,000 plates and its mean, $\mu$, and standard deviation, $\sigma$
Counting errors, $S_i = \sqrt{C_i}$, are similar, but not identical, to $\sigma$
$C_i$ is an estimator of $\mu$
$S_i$ is an estimator of $\sigma$
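This thought experiment is easy to run on a computer. The sketch below draws 10,000 Poisson counts with 𝜇 = 11 and compares the individual errors √𝐶𝑖 with the true 𝜎 = √𝜇. The Poisson sampler (Knuth's multiplication method), the seed and all names are choices made for this illustration; the lecture does not say how its plots were generated.

```python
import math
import random

def poisson_sample(mu, rng):
    """Draw one Poisson-distributed count (Knuth's multiplication method)."""
    limit, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(42)         # fixed seed, for reproducibility only
mu = 11                         # true mean count, from the slide
counts = [poisson_sample(mu, rng) for _ in range(10_000)]
errors = [math.sqrt(c) for c in counts]     # S_i = sqrt(C_i)

mean_count = sum(counts) / len(counts)
sd_count = (sum((c - mean_count) ** 2 for c in counts) / (len(counts) - 1)) ** 0.5

print(f"mean of counts = {mean_count:.2f} (mu = {mu})")
print(f"SD of counts   = {sd_count:.2f} (sigma = {math.sqrt(mu):.2f})")
print(f"S_i ranges from {min(errors):.2f} to {max(errors):.2f}")
```

Each single plate gives a different 𝑆𝑖, scattered around 𝜎, which is what "similar, but not identical" means in practice.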
Exercise: is Dundee a murder capital of Scotland?
On 2 October 2013 The Courier published
an article “Dundee is murder capital of
Scotland”
Data in the article (2012/2013):
City Murders Per 100,000
Dundee 6 4.1
Glasgow 19 3.2
Aberdeen 2 0.88
Edinburgh 2 0.41
Compare Dundee and Glasgow
Find errors on murder rates
Hint: find errors on murder count first
Exercise: is Dundee a murder capital of Scotland?
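City        Murders   Per 100,000
Dundee      6         4.1
Glasgow     19        3.2
Counting errors on the murder counts
$\Delta C_D = \sqrt{6} \approx 2.4$
$\Delta C_G = \sqrt{19} \approx 4.4$
Errors scale with variables, so we can use fractional errors
$\Delta C_D / C_D = 0.41$
$\Delta C_G / C_G = 0.23$
and apply them to murder rate
$\Delta R_D = 4.1 \times 0.41 = 1.7$
$\Delta R_G = 3.2 \times 0.23 = 0.74$
Dundee vs Glasgow: $p = 0.8$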
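The same calculation can be written as a short script. This is a sketch of the exercise's arithmetic only; the dictionary layout and names are illustrative.

```python
import math

# City: (murders, rate per 100,000), as quoted in the article
data = {"Dundee": (6, 4.1), "Glasgow": (19, 3.2)}

results = {}
for city, (count, rate) in data.items():
    count_error = math.sqrt(count)      # Poisson counting error
    fractional = count_error / count    # fractional error on the count
    rate_error = rate * fractional      # same fractional error applies to the rate
    results[city] = (rate, rate_error)
    print(f"{city}: {rate} ± {rate_error:.2f} per 100,000")

# Do the one-sigma intervals overlap?
d_rate, d_err = results["Dundee"]
g_rate, g_err = results["Glasgow"]
overlap = (d_rate - d_err) < (g_rate + g_err)
print("Intervals overlap:", overlap)
```

Without intermediate rounding the errors come out as about 1.67 and 0.73, which agrees with the hand calculation to within rounding. The intervals overlap, so these data do not single out Dundee.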
Exercise: is Dundee a murder capital of Scotland?
City        Murders   Per 100,000   p-value vs Dundee
Dundee      6         4.1
Glasgow     19        3.2           0.8
Aberdeen    2         0.88          0.04
Edinburgh   2         0.41          0.002
p-values from chi-square test vs Dundee
[Figure: murder rates with 95% confidence intervals (Lecture 4)]
Measurement errors: summary
Experimental random errors are expected to be normally distributed
Some errors can be estimated directly
reading (scale, gauge, digital read-out)
counting
Other uncertainties require replicates (a sample)
this introduces sampling error
Example
Body mass of 5 mice
This is a sample
We can find
mean = 18.8 g
median = 18.6 g
standard deviation = 5.0 g
standard error = 2.2 g
These are examples of statistical
estimators
3. Statistical estimators
“The average human has one breast and one testicle”
Des MacHale
Population and sample
Sample selection
Terms nicked from social sciences
Most biological experiments involve sample selection
Terms “population” and “sample” are not always literal
What is a sample?
The term “sample” has different meanings in biology and statistics
Biology: sample is a specimen, e.g., a cell culture you want to analyse
Experiment in 5 biological replicates requires 5 biological samples
After quantification (e.g. protein abundance) we get a set of 5 numbers
Statistics: sample is (usually) a set of numbers (measurements)
In these talks: 𝑥1 , 𝑥2 , … , 𝑥𝑛
[Figure: biological samples (specimens), after quantification, give a statistical sample (set of numbers): 1.32, 1.12, 0.98, 0.80, 1.07]
Population and sample
Population
Population can be a somewhat abstract concept
Huge size, impossible to handle
• all mice on Earth
• all people with eczema
• all possible measurements of gene expression (infinite population)
Sample
Sample is what you get from your experiments
Manageable size, 𝑛 measurements
• 12 mice in a particular experiment
• 26 patients with eczema
• 5 biological replicates to measure gene expression
Population and sample
Population
• unknown parameters 𝜇, 𝜎, …
• A parameter describes a population
Sample
• size 𝑛
• known statistics 𝑀, 𝑆𝐷, …
• A statistical estimator (statistic) describes a sample
A statistical estimator approximates the corresponding parameter
Sample size
Dilution plating experiment
What is the sample size?
𝑛=1
This sample consists of one
measurement: 𝑥1 = 17
17 colonies
What is a statistical estimator?
Stand at the door of a church on a
Sunday and bid 16 men to stop, tall
ones and small ones, as they happen to
pass out when the service is finished;
then make them put their left feet one
behind the other, and the length thus
obtained shall be a right and lawful
rood to measure and survey the land
with, and the 16th part of it shall be
the right and lawful foot.
“Right and lawful rood*” from Geometrei, by Jacob Köbel (Frankfurt 1575)
*rood – a unit of measure equal to 16 feet
Over 400 years ago Köbel:
• introduced random sampling from a population
• required a representative sample
• defined standardized units of measure
• used 16 replicates to minimize random error
• calculated an estimator: the sample mean
Statistical estimators
Statistical estimator is a sample attribute used to estimate a population parameter
From a sample $x_1, x_2, \ldots, x_n$ we can find
• mean: $M = \frac{1}{n} \sum_{i=1}^{n} x_i$
• standard deviation: $SD = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - M)^2}$
• median, proportion, correlation, …
[Figure: population $\mathcal{N}(20, 5)$ with parameters $\mu$, $\sigma$; sample of $n = 30$ with estimators $M$, $SD$]
• 𝑛 = 30
• 𝑀 = 20.3 g
• 𝑆𝐷 = 5.2 g
• 𝑆𝐸 = 0.94 g
𝑀 = 20.3 ± 0.9 g
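These estimators are straightforward to compute. The sketch below uses Python's statistics module on the first five-mouse sample listed under "Sampling error"; the choice of that sample and the variable names are for illustration only.

```python
import math
import statistics

# Body weights (g) of five mice: first sample from the sampling-error slide
weights = [20.38, 20.73, 23.24, 15.39, 12.58]

n = len(weights)
mean = statistics.mean(weights)
median = statistics.median(weights)
sd = statistics.stdev(weights)      # sample SD, divides by n - 1
se = sd / math.sqrt(n)              # standard error of the mean

print(f"n = {n}")
print(f"mean = {mean:.1f} g, median = {median:.1f} g")
print(f"SD = {sd:.1f} g, SE = {se:.1f} g")
```

Note that statistics.stdev uses the 𝑛 − 1 denominator, matching the definition of 𝑆𝐷 used in these slides, while statistics.pstdev divides by 𝑛.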
Standard deviation
Standard deviation is a measure of spread of data points
Idea:
• calculate the mean
• find deviations from the mean of individual points
• get rid of negative signs
• combine them together
[Figure: data points, the sample mean, and each point's deviation from the mean]
Standard deviation of $x_1, x_2, \ldots, x_n$
$SD_n = \sqrt{\frac{1}{n} \sum_i (x_i - M)^2}$
$SD_{n-1} = \sqrt{\frac{1}{n-1} \sum_i (x_i - M)^2}$
$SD_{n-1}^2$ is an unbiased estimator of variance
Mean deviation
$MD = \frac{1}{n} \sum_i |x_i - M|$
• doesn’t overestimate outliers
• less accurate than 𝑆𝐷
• mathematically more complicated
• tradition: use 𝑆𝐷
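The claim about the 𝑛 − 1 denominator can be checked by simulation. The sketch below draws many small samples from a normal population and averages both variance estimates. The population parameters (20 g, 5 g) are borrowed from the mouse examples in these slides; the sample size, number of trials, seed and function names are assumptions made for the illustration.

```python
import random

def sum_sq_dev(xs):
    """Sum of squared deviations from the sample mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def mean_deviation(xs):
    """Mean absolute deviation from the sample mean."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

rng = random.Random(7)              # fixed seed, for reproducibility only
mu, sigma = 20.0, 5.0               # population parameters (g)
n, trials = 5, 20000                # small samples make the bias visible

mean_var_n = 0.0                    # average of SD_n squared
mean_var_n1 = 0.0                   # average of SD_(n-1) squared
for _ in range(trials):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    ss = sum_sq_dev(sample)
    mean_var_n += ss / n / trials
    mean_var_n1 += ss / (n - 1) / trials

print(f"true variance      = {sigma ** 2:.1f}")
print(f"mean of SD_n^2     = {mean_var_n:.1f}")    # biased low by factor (n-1)/n
print(f"mean of SD_(n-1)^2 = {mean_var_n1:.1f}")   # close to the true variance

# Mean deviation of a single sample, for comparison with SD
md = mean_deviation([20.38, 20.73, 23.24, 15.39, 12.58])
print(f"MD = {md:.2f} g")
```

With 𝑛 = 5 the 1/𝑛 version underestimates the variance by about 20%, while the 1/(𝑛 − 1) version averages out close to 𝜎².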
Standard error of the mean
Gedankenexperiment
Consider a population of mice with normally distributed body weight with 𝜇 = 20 g and 𝜎 = 5 g
Take a sample of 5 mice
Calculate sample mean, 𝑀
Repeat many times
Plot distributions of sample means
[Figure: sample means plotted against sample no., with the normalized frequency distribution of sample means]
Standard error of the mean
Gedankenexperiment
Consider a population of mice with normally distributed body weight with 𝜇 = 20 g and 𝜎 = 5 g
Take a sample of 30 mice
Calculate sample mean, 𝑀
Repeat many times
Plot distributions of sample means
[Figure: sample means plotted against sample no., with the normalized frequency distribution of sample means]
Standard error of the mean
Distribution of sample means is called sampling distribution of the mean
The larger the sample, the narrower the sampling distribution
Sampling distribution is Gaussian, with standard deviation
$\sigma_m = \frac{\sigma}{\sqrt{n}}$
Hence, uncertainty of the mean can be estimated by
$SE = \frac{SD}{\sqrt{n}}$
Standard error estimates the width of the sampling distribution
[Figure: sample means plotted against sample no., with the sampling distribution of the mean]
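The Gedankenexperiment with 5 and 30 mice can be reproduced numerically. The sketch below repeats the sampling many times and compares the standard deviation of the sample means with 𝜎/√𝑛. The number of repeats and the seed are arbitrary choices for this illustration.

```python
import math
import random
import statistics

rng = random.Random(3)              # fixed seed, for reproducibility only
mu, sigma = 20.0, 5.0               # population parameters from the slides (g)
repeats = 10000                     # "repeat many times"; the number is arbitrary

sd_of_means = {}
for n in (5, 30):
    # Each entry is the mean of one random sample of n mice
    means = [sum(rng.gauss(mu, sigma) for _ in range(n)) / n
             for _ in range(repeats)]
    sd_of_means[n] = statistics.stdev(means)
    print(f"n = {n:2d}: SD of sample means = {sd_of_means[n]:.2f}, "
          f"sigma/sqrt(n) = {sigma / math.sqrt(n):.2f}")
```

The two columns should agree closely, and the width for 𝑛 = 30 should be clearly smaller than for 𝑛 = 5.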
Standard deviation and standard error
Standard deviation
$SD = \sqrt{\frac{1}{n-1} \sum_i (x_i - M)^2}$
• Measure of dispersion in the sample
• Estimates the true standard deviation in the population, 𝜎
• Does not depend on sample size
Standard error
$SE = \frac{SD}{\sqrt{n}}$
• Error of the mean
• Estimates the width (standard deviation) of the distribution of the sample means
• Gets smaller with increasing sample size
Correlation coefficient
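Two samples: $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$
$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - M_x}{SD_x} \right) \left( \frac{y_i - M_y}{SD_y} \right) = \frac{1}{n-1} \sum_{i=1}^{n} Z_{x_i} Z_{y_i}$
where 𝑍 is a “Z-score”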
Correlation does not mean causation!
Correlation coefficient: example
$r = \frac{1}{n-1} \sum_{i=1}^{n} Z_{x_i} Z_{y_i}$

First data set
𝑥      𝑦      𝑍𝑥      𝑍𝑦      𝑍𝑥𝑍𝑦
0.01   0.01   -1.35   -1.24    1.68
0.24   0.22   -0.64   -0.74    0.48
0.25   0.26   -0.62   -0.64    0.40
0.66   0.75    0.62    0.53    0.33
0.75   0.98    0.89    1.09    0.97
0.81   0.95    1.10    1.02    1.11
$\sum Z_x Z_y = 4.96$

Second data set
𝑥      𝑦      𝑍𝑥      𝑍𝑦      𝑍𝑥𝑍𝑦
0.45   0.74   -1.72    0.57   -0.98
0.60   0.19   -0.54   -0.72    0.39
0.68   0.00    0.05   -1.14   -0.06
0.73   0.98    0.47    1.14    0.54
0.77   0.15    0.77   -0.81   -0.63
0.80   0.90    0.96    0.95    0.92
$\sum Z_x Z_y = 0.18$
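The Z-score form of the correlation coefficient translates directly into code. The sketch below applies it to the two data sets tabulated above; the function name is illustrative. Because the tabulated 𝑥 and 𝑦 values are rounded to two decimals, the Z-scores it computes may differ slightly from those printed in the tables.

```python
import statistics

def pearson_r(xs, ys):
    """Correlation coefficient as the sum of Z-score products over n - 1."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sdx, sdy = statistics.stdev(xs), statistics.stdev(ys)   # n - 1 denominator
    zx = [(x - mx) / sdx for x in xs]
    zy = [(y - my) / sdy for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# First data set: strongly correlated
x1 = [0.01, 0.24, 0.25, 0.66, 0.75, 0.81]
y1 = [0.01, 0.22, 0.26, 0.75, 0.98, 0.95]

# Second data set: almost uncorrelated
x2 = [0.45, 0.60, 0.68, 0.73, 0.77, 0.80]
y2 = [0.74, 0.19, 0.00, 0.98, 0.15, 0.90]

r1 = pearson_r(x1, y1)
r2 = pearson_r(x2, y2)
print(f"r (first data set)  = {r1:.2f}")
print(f"r (second data set) = {r2:.2f}")
```

Dividing the sums of Z-score products quoted in the tables by 𝑛 − 1 = 5 gives approximately the same two values.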
Statistical estimators
Central point
• Mean
• Geometric mean
• Harmonic mean
• Median
• Mode
• Trimmed mean
Dispersion
• Variance
• Standard deviation
• Mean deviation
• Range
• Interquartile range
• Mean difference
Symmetry
• Skewness
• Kurtosis
Dependence
• Pearson’s correlation
• Rank correlation
• Distance
Hand-outs available at http://is.gd/statlec
Please leave your feedback forms on the table by the door