Advanced Diploma
Research Methodology and Project (RMP470S)
One-dimensional statistics
Dr N. Dlangamandla
Email: [email protected]
30 August 2023
The challenge of statistics is to….
Define the levels of uncertainty
• Based on measurement error and probabilities
• Related to randomly distributed values.
• These are named random errors and are different from
systematic errors which result from some bias in the
measurement technique (e.g. calibration error)
• The main difference between a population and sample has
to do with how observations are assigned to the data set.
• A population includes all of the elements from a set of data.
• A sample consists of one or more observations drawn from the
population.
Introduction
• A research paper reports a distance measurement of 10.5
m.
• The implication is that:
• The measurement accuracy is 10.5 ±0.05 m;
• The measurement instrument has been calibrated
• The measurement instrument is capable of resolving
measurements to this accuracy
• A measurement of 10.7 m is significantly different from the result
stated
• Plotting a histogram of the same result measured several times can
show the slightly different results being recorded.
• Reporting of a single measurement contains inherent
information about the accuracy of both the measurement
and the measurement system.
5% Probability estimate
• This is a common measure in statistics: a ‘rule of thumb’
• 5% of all the measured values will lie outside this range of
values centred on the mean value
• 95% of the measurements will lie within this range
• This probability value is a measure of the random,
symmetrical distribution of measured values about the
mean value.
• Assuming a normal (random) distribution about the mean
value (μ) less than 5% of the measurements will lie outside
the range of ± two standard deviations (σ) away from the
mean.
• On average 2.5% will have values greater than μ + 2σ and
2.5% will have values smaller than μ – 2σ.
• There might be situations where the 5% probability of error
is unacceptably large; in that case a smaller probability might be
mandated.
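The coverage figures on this slide can be checked numerically. The following is a minimal sketch, assuming an ideal normal distribution; the helper name `coverage` is illustrative and not part of the lecture material.

```python
import math

def coverage(k: float) -> float:
    """Fraction of a normal distribution lying within ±k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

within_2_sigma = coverage(2.0)         # fraction inside μ ± 2σ
outside_2_sigma = 1 - within_2_sigma   # fraction outside μ ± 2σ (both tails)
upper_tail = outside_2_sigma / 2       # fraction above μ + 2σ, by symmetry

print(f"within ±2σ:   {within_2_sigma:.4f}")
print(f"outside ±2σ:  {outside_2_sigma:.4f}")
print(f"above μ + 2σ: {upper_tail:.4f}")
```

Under this assumption the ±2σ range is a slight rounding: exactly 95% coverage corresponds to about ±1.96σ.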
Normal distribution curve
Descriptive and inferential statistics
• Descriptive statistics summarize information already present in
data
◦ Visualizations like boxplots, histograms, etc.
◦ Summary measures like averages, standard deviation, median,
etc.
• Inferential statistics use a sample of data to make predictions about
larger populations or about unobserved/future trends
◦ Any measurements made in the presence of noise or variation
◦ Generalizations from a sample to a population
▪ Confidence intervals, hypothesis tests, etc.
◦ Comparisons made between datasets
▪ Comparisons, correlations, regression, etc.
Statistics describe different types of data
• Categorical values take one of a discrete set of unordered values
◦ A tissue type: blood/skin/lung/GI/etc.
• Ordinal values take one of a discrete set of ordered values
◦ Counts or rank orders
◦ Often (but not always) analyzed in the same way as continuous
values
• Continuous values take one value from an ordered numerical scale
◦ Times, frequencies, ratios, percentages, abundances, etc.
Simple descriptive statistics
• A statistic is any single value that summarizes an entire dataset
• Parametric summary statistics
◦ Typically used to describe "well-behaved" data that are
approximately normally distributed
▪ i.e. continuous, symmetric, thin-tailed, no outliers
▪ Closeness needed for "approximately" depends on application
• Average = Mean = μ = Σx / n
Simple descriptive statistics
• Standard deviation = √Variance: $\sigma = \sqrt{\frac{\sum (x-\mu)^2}{n}}$ (population)
◦ Beware the difference between population and sample standard deviation
◦ $\sigma = \sqrt{\frac{\sum (x-\mu)^2}{n}}$ (population) or $s = \sqrt{\frac{\sum (x-\mu)^2}{n-1}}$ (sample). Why?
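One common answer to the "Why?" is that the values in a sample lie closer to their own mean than to the unknown population mean, so dividing by n tends to underestimate the spread; dividing by n – 1 compensates. The sketch below compares the two calculations with Python's standard `statistics` module. The data values are illustrative only.

```python
import statistics

heights = [150, 155, 160, 165, 175]  # illustrative values

mean = statistics.mean(heights)
sigma_population = statistics.pstdev(heights)  # divides the sum of squares by n
s_sample = statistics.stdev(heights)           # divides the sum of squares by n - 1

print(f"mean = {mean}")
print(f"population standard deviation = {sigma_population:.3f}")
print(f"sample standard deviation     = {s_sample:.3f}")
```

The sample value is always the larger of the two, and the gap shrinks as n grows.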
Nonparametric statistics
• Can be used to describe any data regardless of distribution
◦ No free lunch: they're less sensitive to false and real signals
◦ Fewer false positives, but potentially fewer true positives, too
• Median = m = x[|x|/2] = midpoint of dataset
• Percentile = p(y) = x[y|x|] = data point y% of the way "through" dataset
• Quartiles = 25th, 50th, and 75th percentiles = {p(0.25), p(0.5), p(0.75)}
◦ Also quintiles, deciles, etc.
• Inter-quartile range = IQR = p(0.75) - p(0.25)
◦ Difference between upper and lower quartiles
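As a rough illustration of these nonparametric summaries, the sketch below uses Python's standard `statistics` module on illustrative data. Note that `statistics.quantiles` interpolates between data points, so its results can differ slightly from the index-based definition p(y) = x[y|x|] given above.

```python
import statistics

data = [150, 155, 160, 165, 175]  # illustrative values

median = statistics.median(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # 25th, 50th and 75th percentiles
iqr = q3 - q1                                 # inter-quartile range

print(f"median = {median}")
print(f"quartiles = {q1}, {q2}, {q3}")
print(f"IQR = {iqr}")
```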
Statistics for paired data: comparisons
• What about experiments that result in more than one measurement?
◦ Paired? Multidimensional?
Subject           S1    S2    S3    S4    S5    Mean
Height            150   155   160   165   175   161
Height (father)   153   154   162   163   191   164.6
Can we generate a "joint" statistic that summarizes something about the
"similarity" of two sets of measurements?
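One candidate for such a joint statistic is the Pearson correlation coefficient, which the slides list among the inferential comparisons. The sketch below computes it by hand for the two rows of the table; choosing this particular statistic is an assumption made for illustration, and the function name `pearson_r` is hypothetical.

```python
import math

height = [150, 155, 160, 165, 175]
height_father = [153, 154, 162, 163, 191]

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by the spread of each variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

r = pearson_r(height, height_father)
print(f"r = {r:.3f}")
```

A value near +1 indicates that the two sets of measurements rise and fall together; a value near 0 indicates no linear relationship.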
Simple descriptive statistics
• Simply put, a z-score (also called a standard score) gives you an idea of how far from the
mean a data point is. But more technically it’s a measure of how many standard deviations
below or above the population mean a raw score is.
• Data expressed as z-scores are relative to a dataset's mean and standard deviation
• The Z Score Formula: One Sample
• The basic z score formula for a sample is:
◦ z = (x – μ) / σ
[Figure: normal distribution curve with the 2/3, 95% and 99% probability ranges marked]
Simple descriptive statistics
• For example, let’s say you have a test score of 190. The test has a
mean (μ) of 150 and a standard deviation (σ) of 25. Assuming
a normal distribution, your z score would be:
z = (x – μ) / σ
= (190 – 150) / 25 = 1.6.
• The z score tells you how many standard deviations from the mean
your score is. In this example, your score is 1.6 standard deviations
above the mean.
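The same calculation as a minimal Python sketch; the function name `z_score` is illustrative.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

z = z_score(190, 150, 25)
print(z)  # 1.6
```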
Simple descriptive statistics
• Technically, a z-score is the number of standard deviations from the
mean value of the reference population (a population whose known
values have been recorded, such as the charts the CDC compiles
about people’s weights). For example:
• A z-score of 1 is 1 standard deviation above the mean.
• A z-score of 2 is 2 standard deviations above the mean.
• A z-score of -1.8 is 1.8 standard deviations below the mean.
• A z-score tells you where the score lies on a normal distribution
curve. A z-score of zero tells you the value is exactly average, while
a z-score of +3 tells you that the value is much higher than average.
Z Score Formula: Standard Error of the Mean
• When you have multiple samples and want to describe the standard
deviation of those sample means (the standard error), you would
use this z score formula:
z = (x̄ – μ) / (σ / √n)
where x̄ is the sample mean and n is the sample size.
This z-score will tell you how many standard errors there are
between the sample mean and the population mean.
Example problem:
• In general, the mean weight of women is 65 kg with a standard
deviation of 3.5 kg. What is the probability of finding a random
sample of 50 women with a mean weight of 70 kg, assuming the
weights are normally distributed?
z = (x̄ – μ) / (σ / √n)
= (70 – 65) / (3.5 / √50) = 5 / 0.495 = 10.1
A z-score this large means the probability is effectively zero.
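The example can be reproduced with a short Python sketch. Reading the question as the probability of a sample mean of at least 70 kg (an upper-tail probability) is an assumption made here, since the slide does not specify the tail.

```python
import math

mu = 65.0           # population mean weight, kg
sigma = 3.5         # population standard deviation, kg
n = 50              # sample size
sample_mean = 70.0  # observed sample mean, kg

standard_error = sigma / math.sqrt(n)
z = (sample_mean - mu) / standard_error

# Upper-tail probability P(Z >= z) for a standard normal distribution
p_upper = 0.5 * math.erfc(z / math.sqrt(2))

print(f"standard error = {standard_error:.3f} kg")
print(f"z = {z:.1f}")
print(f"P(sample mean >= 70 kg) = {p_upper:.1e}")
```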
Z Score Formula: Standard Error of the Mean
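The standard error
$\sigma_\mu = \frac{\sigma_s}{\sqrt{n}} = \frac{1}{\sqrt{n}}\sqrt{\frac{\sum (x-\mu)^2}{n-1}}$
where $\sigma_s$ is the sample standard deviation.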
The degrees of freedom = (n – 1).
Probability: Basic definitions
• Experiment: anything that produces a non-deterministic result
◦ Coin flip, die roll, item count, concentration measurement,
distance measurement...
• Sample space: the set of all possible outcomes for a particular
experiment, finite or infinite, discrete or continuous
◦ {H, T}, {1, 2, 3, 4, 5, 6}, {0, 1, 2, 3, ...}, {0, 0.1, 0.001, 0.02,
3.14159, ...}
• Event: any subset of a sample space
◦ {}, {H}, {1, 3, 5}, {0, 1, 2}, [0, 3)
• Probability: for an event E, the limit of n(E)/n as n grows large
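The limit definition can be illustrated by simulation. The sketch below rolls a simulated fair die and tracks the relative frequency n(E)/n of the event E = {1, 3, 5}; the function name and the fixed seed are choices made for this illustration.

```python
import random

def relative_frequency(event: set, n_trials: int, seed: int = 0) -> float:
    """Estimate P(event) as n(E)/n over n_trials simulated rolls of a fair die."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_trials) if rng.randint(1, 6) in event)
    return hits / n_trials

odd = {1, 3, 5}
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(odd, n))

estimate = relative_frequency(odd, 100_000)
```

As n grows, the printed values should settle towards the theoretical probability of 0.5.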
Where the normal distribution cannot be used for
probability calculations
• The mean value is close to zero and negative values are not possible
in the data set
• The distribution is skewed about the mean. This is quantified
numerically by the skewness of the population.
• The μ ± 2σ range of values does not contain 95% of the probability.
This is quantified numerically by the kurtosis of the population.
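Skewness and kurtosis can be estimated from the central moments of the data. The sketch below uses the simple moment-based definitions (other, bias-corrected definitions exist); the function names and the data sets are illustrative. For reference, a normal distribution has a skewness of 0 and a kurtosis of 3.

```python
def central_moment(data, k):
    """k-th central moment: the mean of (x - mean)**k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Third central moment scaled by the variance to the power 1.5."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Fourth central moment scaled by the variance squared."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]   # illustrative: symmetric about its mean
skewed = [1, 1, 1, 2, 10]     # illustrative: long tail to the right

print(f"symmetric: skewness = {skewness(symmetric):.2f}, kurtosis = {kurtosis(symmetric):.2f}")
print(f"skewed:    skewness = {skewness(skewed):.2f}, kurtosis = {kurtosis(skewed):.2f}")
```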
Combining errors and uncertainties
• Once several parameters have been determined experimentally
• And associated errors determined using the 5% probability concept
• Some additional mathematical processing might be required
• In which the different parameters and their associated errors are
combined
• To calculate the final value of interest and the associated error.
• There are some simple rules for combining errors which are based on
the least-squares error analysis used to calculate the mean value
Combining errors and uncertainties
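• If two values are to be added or subtracted: $x_i \pm 2\sigma_i$ and $x_j \pm 2\sigma_j$
$y = (x_i \pm x_j) \pm 2\left(\sigma_i^2 + \sigma_j^2\right)^{0.5}$
• All units must be the same
• If two values are divided or multiplied: $x_i \pm 2\sigma_i$ and $x_j \pm 2\sigma_j$
$y = \frac{x_i}{x_j} \pm 2\,\frac{x_i}{x_j}\left[\left(\frac{\sigma_i}{x_i}\right)^2 + \left(\frac{\sigma_j}{x_j}\right)^2\right]^{0.5}$
(shown for division; for a product, $x_i x_j$ replaces $x_i / x_j$)
• The units of y, xi, xj do not have to be identical.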
• Expressions are based on the RMS analysis and so are statistically
rigorous and should be used in combining data with their associated
errors.
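The two combination rules can be sketched in Python as follows, assuming the two errors are independent. The squared errors add whether the values themselves are added or subtracted, and for a quotient it is the relative errors that combine. The function names and the numerical values are hypothetical.

```python
import math

def add_or_subtract(x_i, sigma_i, x_j, sigma_j, subtract=False):
    """Combine two values by addition or subtraction; absolute errors add in quadrature."""
    y = x_i - x_j if subtract else x_i + x_j
    sigma_y = math.sqrt(sigma_i ** 2 + sigma_j ** 2)
    return y, sigma_y

def divide(x_i, sigma_i, x_j, sigma_j):
    """Combine two values by division; relative errors add in quadrature."""
    y = x_i / x_j
    sigma_y = abs(y) * math.sqrt((sigma_i / x_i) ** 2 + (sigma_j / x_j) ** 2)
    return y, sigma_y

# Hypothetical measurements: 10.0 with sigma 0.3, and 5.0 with sigma 0.4
total, sigma_total = add_or_subtract(10.0, 0.3, 5.0, 0.4)
ratio, sigma_ratio = divide(10.0, 0.3, 5.0, 0.4)

# Report each result with the ±2σ (5% probability) range
print(f"sum   = {total:.2f} ± {2 * sigma_total:.2f}")
print(f"ratio = {ratio:.2f} ± {2 * sigma_ratio:.2f}")
```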
Student’s t-test
• In some experimental investigations it is important to know whether two
data sets are likely to be samples drawn from the same global
population.
• This type of one-dimensional question can be addressed using a
t-test (also known as Student’s t-test)
• The test is most applicable when the standard deviations are very
large in comparison to the likely changes or differences between
the two mean values.
Two types of t-test
Paired t-test
• Is one in which the same population is tested twice to determine if
there has been a change in the overall population
• It is a method of determining if there is a statistically significant
change in the population after an intervention.
• A simple mean and standard deviation calculation will not show a
significant change if the change is likely to be significantly smaller
than the standard deviation of the population.
Two types of t-test
Unpaired t-test
◦ Is one in which two different populations are measured to determine if
there is a difference between the two populations.
◦ In this case the two populations are unrelated and the number of samples
can be different in the two sample sets.
• The t-test can be evaluated in MS Excel using the T.TEST function
(or the Analysis ToolPak t-Test tools), which offers paired and
unpaired variants.
• In Matlab the functions are ttest and ttest2 for the paired and
unpaired data sets respectively.
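To show what these functions compute, the sketch below evaluates the t statistic by hand for both cases using only the Python standard library. The data are illustrative, the function names are hypothetical, and the unpaired version uses Welch's unequal-variance form, which is one of several variants. Converting t to a probability requires the t distribution, which is left to a statistics package or a printed table.

```python
import math
import statistics

first = [150, 155, 160, 165, 175]    # illustrative measurements
second = [153, 154, 162, 163, 191]   # illustrative measurements

def paired_t(a, b):
    """t statistic for paired data: mean difference over its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1  # (t, degrees of freedom)

def unpaired_t(a, b):
    """Welch's t statistic for two independent samples (sizes may differ)."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

t_paired, dof = paired_t(first, second)
t_unpaired = unpaired_t(first, second)

# The tabulated two-tailed 5% critical value for 4 degrees of freedom is 2.776
print(f"paired:   t = {t_paired:.3f} with {dof} degrees of freedom")
print(f"unpaired: t = {t_unpaired:.3f}")
```

For these values neither statistic exceeds the critical value in magnitude, so the difference would not be judged significant at the 5% level.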
ANOVA statistics
• We looked at one-dimensional statistical methods first
• Then two different populations were compared using the t-test.
• If there are more than two dependent data sets, these techniques are
inadequate.
• If many repeat measurements are made of a number of members of the
population, the ANOVA statistical methods allow the calculation of
probability estimates for three or more datasets
• As for the t-test, this method can determine statistically significant
differences when the standard deviations in the parameters are much
larger than the difference between the populations
◦ The ANOVA test can be evaluated using the MS Excel Anova data analysis tools
◦ Two-factor with replication ANOVA
◦ Two-factor without replication ANOVA
◦ In Matlab the functions are anova1 and anova2
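As an indication of what a one-way ANOVA computes, the sketch below evaluates the F statistic by hand: the variation between group means divided by the variation within groups. It covers only the single-factor case, the data are hypothetical, and the function name is illustrative.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)

    # Variation of the group means about the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the individual values about their own group mean
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical repeat measurements for three datasets
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(groups)

# The tabulated 5% critical value of F for (2, 6) degrees of freedom is about 5.14
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

An F value well above the critical value suggests that at least one dataset differs from the others.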
Exercise 1: Instrumentation & Calibration
Review specification sheets for 3 experimental
instruments that you will use in your research project.
Briefly summarise the following user requirements:
◦ Dynamic range
◦ Sensitivity
◦ Linearity
◦ Calibration requirements
◦ Calibration procedure
Exercise 2: Review of statistical analysis in journal
article
Review a published journal article in your engineering
discipline which includes a statistical analysis.
• Write a brief report on the statistical analysis.
• Can you suggest an improved statistical analysis?
• Suggest some additional parameters that might have been
measured during the data acquisition stage.
• Explain how you would analyze the total data set of the
additional measurements.
END