Lecture 3.
Numerical Measures of Data
AGEC 2001 Statistics I
Feng-An Yang
Department of Agricultural Economics
National Taiwan University
Fall Semester
Outline
Measures of Location
Mean
Median
Mode
Shape of a distribution
Measures of Variation
Range
Variance and Standard Deviation
Coefficient of Variation
Grouped Data
Measures of Position
Percentile
Location of Percentile
Quartile and Decile
Box plot
Measures of Location
Numerical measures used to describe the central tendency of the
data
- Common measures of location
  - Mean
  - Median
  - Mode
Mean
A numerical average of a set of numbers
- Arithmetic Mean
- Weighted Mean
- Geometric Mean
Example
- The mean height of AGEC students is 172 cm.
- The mean weight of AGEC students is 55.3 kg.
Arithmetic Mean
The arithmetic mean is the simplest and most widely used type of
mean. It is the sum of all the values in a dataset divided by the
number of observations in that dataset
Population Mean
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
- µ is the population mean
- N is the number of observations in the population
- x_i is the value of the i-th observation
Sample Mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
- x̄ is the sample mean
- n is the number of observations in the sample
Example
{90,77,94,89,119,112,91,110,92,100,113,83}
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{90 + 77 + \cdots + 83}{12} = \frac{1{,}170}{12} = 97.5$$
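The sample mean of the twelve values above can be checked in a few lines of Python. This is a minimal sketch: the variable names are illustrative, and `statistics.mean` from the standard library is used only as a cross-check.

```python
from statistics import mean

# Sample from the example above
scores = [90, 77, 94, 89, 119, 112, 91, 110, 92, 100, 113, 83]

# Sample mean by definition: sum of the observations divided by n
x_bar = sum(scores) / len(scores)

print(sum(scores), len(scores))   # 1170 12
print(x_bar)                      # 97.5
print(mean(scores) == x_bar)      # True
```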
Properties of the Arithmetic Mean
- All values in the dataset are used in the calculation of the mean
- The mean is unique
- The sum of the deviations from the mean is zero
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
Example
{3,7,5}, x̄ = 5
$$\sum_{i=1}^{n} (x_i - \bar{x}) = (3 - 5) + (7 - 5) + (5 - 5) = 0$$
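The zero-sum property of the deviations can also be verified numerically. The sketch below uses the three-value example and, as an additional assumption, reuses the twelve scores from the sample mean example. Because floating-point sums can carry rounding error, the check uses a tolerance rather than exact equality.

```python
import math

def deviations_from_mean(values):
    """Return the list of deviations x_i - x_bar."""
    x_bar = sum(values) / len(values)
    return [x - x_bar for x in values]

small = [3, 7, 5]
scores = [90, 77, 94, 89, 119, 112, 91, 110, 92, 100, 113, 83]

for data in (small, scores):
    total = sum(deviations_from_mean(data))
    # Compare with a tolerance: rounding error may leave a tiny nonzero sum
    print(math.isclose(total, 0.0, abs_tol=1e-9))
```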
Properties of the Arithmetic Mean (cont’d)
- The mean can be affected by extreme values
Example
- A={1,2,3,4,5}, x̄_A = 3
- B={1,2,3,4,100}, x̄_B = 22
Median
The midpoint of all values in a dataset
Steps for finding the median
- Sort the data in ascending (or descending) order
- If the number of observations n is odd, the median is the value at
  position $(n + 1)/2$
  - Example: {11, 17, 25, 38, 60}. The median is 25
- If the number of observations is even, the median is the
  simple average of the two middle values
  - Example: {11, 17, 25, 38, 60, 65}. The median is $\frac{25 + 38}{2} = 31.5$
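The steps above can be sketched as a small Python function. The function name and the empty-input check are illustrative choices, not part of the slides. Note that Python lists count positions from 0, so position (n + 1)/2 corresponds to index n // 2.

```python
def median(values):
    """Median via the sort-and-pick-the-middle steps listed above."""
    ordered = sorted(values)          # Step 1: sort ascending
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1) / 2 is 0-based index mid
        return ordered[mid]
    # Even n: simple average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([11, 17, 25, 38, 60]))       # 25
print(median([60, 11, 38, 65, 17, 25]))   # 31.5 (input need not be sorted)
```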
Median
- The median is less sensitive to extreme values than the mean
- The median is unique
Example
- A={1,2,3,4,5}, x̄_A = 3, median = 3
- B={1,2,3,4,100}, x̄_B = 22, median = 3
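A short Python check of the two datasets makes the contrast concrete. This sketch relies on `statistics.mean` and `statistics.median` from the standard library; the variable names are illustrative.

```python
from statistics import mean, median

a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 4, 100]   # same as a, with the largest value replaced by 100

for name, data in (("A", a), ("B", b)):
    print(name, "mean =", mean(data), "median =", median(data))

# The single extreme value moves the mean by 19 but leaves the median unchanged
print(mean(b) - mean(a), median(b) - median(a))
```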
Mode
The value that appears most often in a dataset
- The mode is less sensitive to extreme values
- There may be multiple modes
Steps for finding the mode
- Organize the data and make a frequency table
- The mode is the value(s) with the highest frequency
Example
{4,4,4,3,100,3,1,3,5,2,2,5,6,1,2,2,3,7,
1,3,7,8,1,4,7,5,2,2,5,1,1,3,3,1,2}
Value   Frequency
1       7
2       7
3       7
4       4
5       4
6       1
7       3
8       1
100     1
- The modes are 1, 2, and 3
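The two steps for finding the mode can be sketched with `collections.Counter`, which builds the frequency table directly from the 35 values listed in the example. The variable names are illustrative.

```python
from collections import Counter

data = [4, 4, 4, 3, 100, 3, 1, 3, 5, 2, 2, 5, 6, 1, 2, 2, 3, 7,
        1, 3, 7, 8, 1, 4, 7, 5, 2, 2, 5, 1, 1, 3, 3, 1, 2]

# Step 1: frequency table (value -> count)
freq = Counter(data)
for value in sorted(freq):
    print(value, freq[value])

# Step 2: every value whose count equals the highest count is a mode
top = max(freq.values())
modes = sorted(v for v, c in freq.items() if c == top)
print(modes)   # [1, 2, 3]
```

Collecting every value that reaches the highest count, rather than stopping at the first, is what allows the sketch to report multiple modes.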
Shape of a distribution
Skewness
Skewness is a measure of the asymmetry of a data distribution
[Figure: three density curves with the mode, median, and mean marked on each]
(a) Left-skewed: Mean < Median
(b) Symmetric: Mean = Median
(c) Right-skewed: Mean > Median
Measures of Variation
Numerical measures used to describe the spread of data
- Common measures of variation
  - Range
  - Variance and Standard Deviation
  - Coefficient of Variation
Why study dispersion?
Measures of location describe the central tendency of data, but
they tell nothing about the variability of the data. Two data
distributions can have the same central tendency but quite
different variability
[Figure: distribution curves plotted against x]
Range
The difference between the largest and the smallest values in a
dataset
Range = Maximum value − Minimum value
Example
{7,8,13,15,27,30}, Range = 30 − 7 = 23
Issues
- It can be affected by extreme values
  - {7,8,13,15,27,30}, Range = 30 − 7 = 23
  - {7,8,13,15,27,130}, Range = 130 − 7 = 123
- It tells nothing about how data are distributed
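Both issues can be seen in a short Python sketch. The function name `value_range` is illustrative, and the third dataset is a made-up example (not from the slides) chosen to share the range of the first while being distributed very differently.

```python
def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

print(value_range([7, 8, 13, 15, 27, 30]))    # 23
print(value_range([7, 8, 13, 15, 27, 130]))   # 123: one extreme value dominates

# Hypothetical dataset: same range as the first, very different distribution
print(value_range([7, 7, 7, 7, 7, 30]))       # 23
```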
Variance
The arithmetic mean of the squared deviations from the mean
Population Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
- σ² is the population variance
- x_i is the value of the i-th observation
- µ is the population mean
- N is the number of observations in the population
Sample Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
- s² is the sample variance
- x̄ is the sample mean
Sample Standard Deviation
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
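$$
\begin{aligned}
s^2 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \\
    &= \frac{\sum_{i=1}^{n} \left( x_i^2 - 2 x_i \bar{x} + \bar{x}^2 \right)}{n - 1} \\
    &= \frac{\sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2}{n - 1} \\
    &= \frac{\sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2}{n - 1} \\
    &= \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}
\end{aligned}
$$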
Example
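         x     x²     x − x̄    (x − x̄)²
         12    144    −5       25
         20    400     3        9
         16    256    −1        1
         18    324     1        1
         19    361     2        4
Total    85    1485    0       40
$$
\begin{aligned}
s^2 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{40}{5 - 1} = 10 \\
    &= \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{1485 - 5 \times 17^2}{5 - 1} = 10
\end{aligned}
$$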
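Both forms of the sample variance can be reproduced for the five values in the table. The sketch below is illustrative: the variable names are assumptions, and `statistics.variance` and `statistics.stdev` (which also divide by n − 1) serve as cross-checks.

```python
import math
from statistics import stdev, variance

x = [12, 20, 16, 18, 19]
n = len(x)
x_bar = sum(x) / n                                   # 17.0

# Definitional form: squared deviations from the mean over n - 1
s2_def = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Shortcut form: (sum of squares - n * mean^2) / (n - 1)
s2_short = (sum(xi ** 2 for xi in x) - n * x_bar ** 2) / (n - 1)

print(s2_def, s2_short)            # 10.0 10.0
print(variance(x) == s2_def)       # True
print(math.isclose(stdev(x), math.sqrt(s2_def)))   # True
```

The shortcut form needs only the running totals of x and x², but with large values and small spread it can lose precision in floating point, so the definitional form is usually the safer choice in code.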
Properties of Variance
- Variance and standard deviation can never be negative
- Variance and standard deviation do not depend on the
  location of data
- The more concentrated the data are, the smaller the variance
  and standard deviation
- What if there is no variation in the data, i.e., all values are the
  same?
[Figure: distribution curves plotted against x]
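Two of the properties above are easy to check numerically. In this sketch the shift of 100 and the constant dataset are made-up illustrations, and `statistics.variance` computes the sample variance.

```python
from statistics import variance

data = [12, 20, 16, 18, 19]

# Shifting every value by the same constant changes the location,
# not the spread, so the sample variance is unchanged
shifted = [x + 100 for x in data]
print(variance(data) == variance(shifted))   # True

# No variation at all: every deviation is 0, so the variance is 0
constant = [7, 7, 7, 7]
print(variance(constant))                    # 0
```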
Empirical Rule
For a symmetrical, bell-shaped distribution, approximately 68%,
95%, and 99.7% of the observations lie within plus and minus one,
two, and three standard deviations of the mean, respectively
- Pr(µ − σ ≤ X ≤ µ + σ) ≈ 68%
- Pr(µ − 2σ ≤ X ≤ µ + 2σ) ≈ 95%
- Pr(µ − 3σ ≤ X ≤ µ + 3σ) ≈ 99.7%
[Figure: bell curve with 68%, 95%, and 99.7% of the area within ±1σ, ±2σ, and ±3σ of µ]
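The three percentages can be reproduced under the assumption that the bell-shaped distribution is exactly normal. The sketch uses `statistics.NormalDist` from the standard library (Python 3.8 or later); because the probabilities depend only on the number of standard deviations, the standard normal is enough.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mu = 0, sigma = 1

probs = {}
for k in (1, 2, 3):
    # Pr(mu - k*sigma <= X <= mu + k*sigma) for a normal distribution
    probs[k] = z.cdf(k) - z.cdf(-k)
    print(f"within {k} sd: {probs[k]:.4f}")
```

The printed values are roughly 0.6827, 0.9545, and 0.9973, which round to the 68%, 95%, and 99.7% quoted in the rule.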
Chebyshev’s Theorem
For any set of observations (sample or population), the proportion
of values that lie within k standard deviations of the mean is at
least $1 - \frac{1}{k^2}$, where k is any value greater than 1
Example
The average height of AGEC students is 170 cm and the
corresponding standard deviation is 10 cm. At least what proportion
of students have heights within 3 standard deviations of the mean,
that is, between 140 cm and 200 cm?
$$1 - \frac{1}{k^2} = 1 - \frac{1}{3^2} = 1 - \frac{1}{9} \approx 0.89$$
So at least about 89% of students fall in this interval
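The bound is simple enough to wrap in a helper. In this sketch the function name `chebyshev_lower_bound` is an illustrative choice, and the numbers are those of the height example.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

mean_height, sd_height = 170, 10   # cm, from the example above
k = 3
low, high = mean_height - k * sd_height, mean_height + k * sd_height

bound = chebyshev_lower_bound(k)
print(f"at least {bound:.1%} of heights lie in [{low}, {high}] cm")
```

Unlike the empirical rule, this bound makes no assumption about the shape of the distribution, which is why it is weaker (about 89% rather than 99.7% for k = 3).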
Coefficient of Variation (CV)
The coefficient of variation is a standardized measure of dispersion
of a data distribution, expressed as a percentage
- $CV = \frac{s}{\bar{x}} \times 100\%$
  where s is the sample standard deviation and x̄ is the sample mean
- It quantifies the variability relative to the mean and facilitates
  the comparison of variability among data distributions with
  different units or significantly different means
Example
Pollutant   Mean        Standard Deviation   CV
PM2.5       100 µg/m³   10 µg/m³             10%
Ozone       50 ppm      10 ppm               20%
Relative to its mean, the ozone concentration is more variable than
the PM2.5 concentration
Example
Company   Mean Production   Standard Deviation   CV
A         10000             10                   0.1%
B         50                10                   20%
Companies A and B have the same standard deviation of production, but
company B is more variable relative to its mean production
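The CV values in both tables follow from one small function. This is a sketch with illustrative names; the zero-mean check is an added safeguard, since the ratio is undefined there.

```python
import math

def coefficient_of_variation(sd, mean):
    """CV as a percentage: (standard deviation / mean) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * sd / mean

rows = {
    "PM2.5": (10, 100),
    "Ozone": (10, 50),
    "Company A": (10, 10000),
    "Company B": (10, 50),
}
cv = {name: coefficient_of_variation(sd, mean) for name, (sd, mean) in rows.items()}
for name, value in cv.items():
    print(f"{name}: {value:.1f}%")
```

Because the units cancel in the ratio, the PM2.5 and ozone figures can be compared even though they are measured in µg/m³ and ppm.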
Arithmetic Mean of Grouped data
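Mean
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i M_i}{n}$$
- f_i is the frequency of class i
- M_i is the midpoint of class i
- k is the number of classes and n is the total number of observations
Example
Point    Frequency (f)   Midpoint (M)   f × M
0-10     5               5              25
10-20    1               15             15
20-30    3               25             75
30-40    4               35             140
40-50    2               45             90
Total    15                             345
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i M_i}{n} = \frac{345}{15} = 23$$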
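The grouped mean can be computed directly from the class limits and frequencies in the table. In this sketch the tuple layout `(lower, upper, frequency)` is an illustrative choice, and each midpoint is taken as the average of the class limits.

```python
# (lower limit, upper limit, frequency) for each class in the table above
classes = [(0, 10, 5), (10, 20, 1), (20, 30, 3), (30, 40, 4), (40, 50, 2)]

n = sum(f for _, _, f in classes)                            # total observations
fm_total = sum(f * (lo + hi) / 2 for lo, hi, f in classes)   # sum of f * M

grouped_mean = fm_total / n
print(n, fm_total, grouped_mean)   # 15 345.0 23.0
```

The result is an approximation of the true mean, since every observation in a class is treated as if it sat at the class midpoint.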
Standard Deviation of Grouped data
Standard Deviation
$$s = \sqrt{\frac{\sum_{i=1}^{k} f_i (M_i - \bar{x})^2}{n - 1}}$$
where k is the number of classes and n is the total number of observations
Example
Point    Frequency (f)   Midpoint (M)   f × M   M − x̄   (M − x̄)²   f(M − x̄)²
0-10     5               5              25      −18     324        1620
10-20    1               15             15      −8      64         64
20-30    3               25             75      2       4          12
30-40    4               35             140     12      144        576
40-50    2               45             90      22      484        968
Total    15                             345                        3240
$$s = \sqrt{\frac{\sum_{i=1}^{k} f_i (M_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{3240}{14}} \approx 15.21$$
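The grouped standard deviation can be reproduced from the frequencies and midpoints in the table. This is a minimal sketch with illustrative variable names.

```python
import math

freqs = [5, 1, 3, 4, 2]
midpoints = [5, 15, 25, 35, 45]

n = sum(freqs)
x_bar = sum(f * m for f, m in zip(freqs, midpoints)) / n   # 23.0

# Weighted sum of squared deviations of the midpoints from the grouped mean
ss = sum(f * (m - x_bar) ** 2 for f, m in zip(freqs, midpoints))

s = math.sqrt(ss / (n - 1))
print(ss, round(s, 2))   # 3240.0 15.21
```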
Measures of Position
Numerical measures used to divide data into equal parts
- Common measures of position
  - Quartile
  - Decile
  - Percentile
Percentile
A percentile is a value below which a given percentage of the
observations in a dataset fall
Example
- The 87th percentile is 90, which indicates that 87% of
  observations are below 90
Location of Percentile
Steps for finding the pth percentile
1. Order the data in ascending order
2. Multiply p percent by the number of observations n in the
   data. Call the resulting number the index i = (p/100) × n
3. Check the index from Step 2.
   - If i is a whole number, the pth percentile is the simple
     average of the ith value and the (i + 1)th value in the
     ordered data
   - Otherwise, round the index up to the nearest whole number.
     The pth percentile is the ⌈i⌉th value in the ordered data
Note
There are other ways to determine a percentile, such as the
nearest-rank method and the linear interpolation method
Example
{43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87,
88, 89, 93, 95, 96, 98, 99, 99}
- Suppose we want to find the 60th percentile. Index
  i = 60/100 × 25 = 15
- Since the index is a whole number, the 60th percentile is the
  simple average of the 15th value and the 16th value
- $P_{60} = \frac{79 + 85}{2} = 82$
Example
{34, 42, 51, 65, 69, 74, 78, 84, 85, 85, 86, 87}
- Suppose we want to find the 80th percentile. Index
  i = 80/100 × 12 = 9.6
- Since the index is not a whole number, we round it up to 10.
  Then the 80th percentile is at the 10th position in the
  ordered data
- $P_{80} = 85$
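The index rule used in the two examples above can be sketched as a Python function. The function name and the range check on p are illustrative additions. Library routines such as `statistics.quantiles` use other conventions and may return slightly different values for the same data.

```python
import math

def percentile(values, p):
    """pth percentile (0 < p < 100) using the index rule from the slides."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)                 # Step 1: ascending order
    n = len(ordered)
    index = p * n / 100                      # Step 2: i = (p / 100) * n
    if index == int(index):                  # Step 3, whole-number case
        i = int(index)
        # average of the ith and (i + 1)th values (1-based positions)
        return (ordered[i - 1] + ordered[i]) / 2
    i = math.ceil(index)                     # Step 3, otherwise round up
    return ordered[i - 1]

scores_a = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87,
            88, 89, 93, 95, 96, 98, 99, 99]
scores_b = [34, 42, 51, 65, 69, 74, 78, 84, 85, 85, 86, 87]

print(percentile(scores_a, 60))   # 82.0
print(percentile(scores_b, 80))   # 85
```

Multiplying p by n before dividing by 100 keeps the whole-number test reliable for integer inputs, since a product such as 0.6 × 25 could otherwise pick up floating-point error.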
Quartile and Decile
Quartiles
- The first quartile, Q1, is equal to the 25th
  percentile, indicating that 25% of observations are below it
- The second quartile, Q2, is equal to the 50th
  percentile. It is also the median, which splits the data in
  half
- The third quartile, Q3, is equal to the 75th
  percentile, indicating that 75% of observations are below it
- Interquartile range = Q3 − Q1
Deciles
In a similar fashion to quartiles, deciles are nine values that divide
the data into ten equal parts
Box plot
- A box plot is a graphical representation of the distribution of
  a data set
- It displays the median, quartiles, and potential outliers of the
  data, providing a visual summary of its central tendency and
  spread
- Also known as a box-and-whisker plot
Components of a Box Plot
- Box
  - The central box represents the interquartile range (IQR), which
    includes the middle 50% of the data
  - The edges of the box are the first quartile (Q1) and the third
    quartile (Q3)
- Median Line
  - A line inside the box represents the median (the 50th
    percentile), which divides the data into two equal halves
- Whiskers
  - Whiskers extend from the edges of the box to the smallest and
    largest values within a defined range, typically 1.5 times
    the IQR below Q1 and above Q3
  - They show the spread of the data outside the middle 50%
- Outliers
  - Data points that fall outside the whiskers are considered
    outliers and are often marked with individual points or symbols
Min and Max as the boundary
- Let’s consider an example where we have exam scores for a
  group of students
  - 55, 60, 65, 70, 72, 75, 78, 80, 83, 85, 88, 90, 92, 95, 100
- Summaries
  - Minimum: 55
  - Q1 (First Quartile): 70
  - Median (Q2): 80
  - Q3 (Third Quartile): 90
  - Maximum: 100
[Figure: box plot with whiskers at 55 and 100, box from 70 to 90, and median line at 80]
1.5 IQR as the boundary
- 30,50,51,53,53,54,54,58,59,60,61,62,62,64,65,67,68,69,80,90
- Summaries
  - Minimum: 30
  - Q1 (First Quartile): 53.5
  - Median (Q2): 60.5
  - Q3 (Third Quartile): 66
  - Maximum: 90
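- Lower and upper bound
  - Interquartile Range (IQR) = Q3 − Q1 = 66 − 53.5 = 12.5
  - Lower Bound = Q1 − 1.5 × IQR = 53.5 − 1.5 × 12.5 = 34.75
  - Upper Bound = Q3 + 1.5 × IQR = 66 + 1.5 × 12.5 = 84.75
- Outliers: 30 and 90
[Figure: box plot with whiskers limited to 1.5 × IQR from the box and outliers drawn as individual points]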
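The 1.5 × IQR rule can be sketched in Python for the 20 values listed on this slide. The quartiles are computed with the index rule from the percentile steps, so Q1 is 53.5 as in the summaries. The helper and variable names are illustrative, and plotting libraries may use a different quartile convention and therefore draw slightly different fences.

```python
import math

def percentile(values, p):
    """pth percentile via the index rule: average two values when the
    index (p/100) * n is whole, otherwise round the index up."""
    ordered = sorted(values)
    index = p * len(ordered) / 100
    if index == int(index):
        i = int(index)
        return (ordered[i - 1] + ordered[i]) / 2
    return ordered[math.ceil(index) - 1]

data = [30, 50, 51, 53, 53, 54, 54, 58, 59, 60,
        61, 62, 62, 64, 65, 67, 68, 69, 80, 90]

q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points beyond the fences are flagged as outliers
outliers = [x for x in data if x < lower or x > upper]
# Whiskers stop at the most extreme values that are still inside the fences
whiskers = (min(x for x in data if x >= lower), max(x for x in data if x <= upper))

print(q1, q2, q3)      # 53.5 60.5 66.0
print(lower, upper)    # 34.75 84.75
print(outliers)        # [30, 90]
print(whiskers)        # (50, 80)
```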