2021/08/25
Lecture 4
GIS220:
Descriptive statistics
Prof Gregory Breetzke
[email protected] Room 1-19, Geography
Lecture overview
• What are descriptive statistics?
• Types of descriptive statistics
– Univariate
– Bivariate
• Examples
Descriptive statistics
• Provide an initial entry point
• Some research questions can be answered satisfactorily
using descriptive statistics
Types of descriptive statistics
• Univariate and bivariate statistics
– U: mean, mode, range, standard deviation
– B: correlation coefficient
Types of descriptive statistics
UNIVARIATE
• Measures of central tendency
– Mean
– Mode
– Median
• Measures of dispersion
– Range
– Interquartile range
– Variance
– Standard deviation
The mean
• The mean is a measure of central value
– What most people mean by “average”
– Sum of a set of numbers divided by the number
of numbers in the set
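A minimal sketch of the definition above, assuming a small illustrative list of numbers (the values are made up for the example):

```python
from statistics import mean

values = [3, 4, 5, 6, 7]  # illustrative values, not real survey data

# Mean by definition: sum of the numbers divided by how many numbers there are
mean_by_hand = sum(values) / len(values)

# The standard library should agree with the hand calculation
mean_builtin = mean(values)

print(mean_by_hand, mean_builtin)
```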
The median
• Middlemost or most central item in the set of
ordered numbers; it separates the distribution
into two equal halves
• If n is odd, the median is the middle value of the ordered sequence
– if X = [1,2,4,6,9,10,12,14,17]
– then 9 is the median
• If n is even, the median is the average of the two middle values
– if X= [1,2,4,6,9,10,11,12,14,17]
– then 9.5 is the median; i.e., (9+10)/2
• Median is not affected by extreme values
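The odd and even cases can be sketched in a few lines of Python. The helper name `median_by_hand` is hypothetical and only there to mirror the rule on the slide; the library function `statistics.median` is used as a cross-check.

```python
from statistics import median

odd_set = [1, 2, 4, 6, 9, 10, 12, 14, 17]        # n = 9 (slide example)
even_set = [1, 2, 4, 6, 9, 10, 11, 12, 14, 17]   # n = 10 (slide example)

def median_by_hand(values):
    """Middle value for odd n, mean of the two middle values for even n."""
    ordered = sorted(values)          # the median is defined on ordered data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_hand(odd_set), median(odd_set))    # 9 9
print(median_by_hand(even_set), median(even_set))  # 9.5 9.5
```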
The mode
• The mode is the most frequently occurring
number in a distribution
– if X = [1,2,4,7,7,7,8,10,12,14,17]
– then 7 is the mode
• Easy to see in a simple frequency distribution
• Possible to have no modes or more than one
mode
– bimodal and multimodal
• Don’t have to be exactly equal frequency
– major mode, minor mode
• Mode is not affected by extreme values
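As a rough sketch, a simple frequency distribution and the mode can be obtained with the Python standard library. The `bimodal` list is a hypothetical data set added only to show the two-mode case.

```python
from collections import Counter
from statistics import multimode

x = [1, 2, 4, 7, 7, 7, 8, 10, 12, 14, 17]   # slide example
bimodal = [1, 2, 2, 2, 5, 8, 8, 8, 9]        # hypothetical data with two modes

counts = Counter(x)              # simple frequency distribution: value -> count
print(counts.most_common(1))     # [(7, 3)]
print(multimode(x))              # [7]
print(multimode(bimodal))        # [2, 8]
```

Note that `multimode` only reports values tied for the highest frequency, so a major mode and a minor mode with unequal frequencies would need to be read off the frequency counts instead.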
When to use what…?
• The mean is a great measure, but there are times when its
usage is inappropriate or impossible
– You have nominal data: Mode
– The distribution is bimodal: Mode
– You have ordinal data: Median or mode
– There are a few extreme scores: Median
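Two of these cases can be illustrated with a short sketch. All data below are hypothetical: a set of scores with one extreme value, and a list of nominal land-use categories for which a mean cannot be computed at all.

```python
from statistics import mean, median, mode

scores = [48, 49, 50, 51, 52]           # hypothetical scores
with_outlier = [48, 49, 50, 51, 520]    # same scores, but one extreme value

# One extreme score drags the mean upward but leaves the median where it was
print(mean(scores), median(scores))
print(mean(with_outlier), median(with_outlier))

# Nominal data: categories have no numeric value, so only the mode applies
land_use = ["residential", "commercial", "residential", "park", "residential"]
print(mode(land_use))
```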
Dispersion
• Dispersion
– How tightly clustered or how
variable the values are in a data
set
• Example
– Data set 1: [0,25,50,75,100]
– Data set 2: [48,49,50,51,52]
– Both have a mean of 50, but data
set 1 clearly has greater variability than data set 2
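One way to see the difference is to list how far each value lies from the mean. This is a minimal sketch; the helper name `deviations_from_mean` is hypothetical.

```python
set1 = [0, 25, 50, 75, 100]
set2 = [48, 49, 50, 51, 52]

def deviations_from_mean(values):
    """Return each value's signed distance from the mean of the list."""
    centre = sum(values) / len(values)
    return [v - centre for v in values]

print(deviations_from_mean(set1))   # [-50.0, -25.0, 0.0, 25.0, 50.0]
print(deviations_from_mean(set2))   # [-2.0, -1.0, 0.0, 1.0, 2.0]
```

Both lists are centred on 50, but the deviations in the first are 25 times larger.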
Range
• The difference between the maximum and
minimum values in a set
• Example
– Data set 1: [0,25,50,75,100]; R: 100-0 = 100
– Data set 2: [48,49,50,51,52]; R: 52-48 = 4
– The range ignores how data are distributed and
only takes the extreme scores into account
• Range = $X_{\text{largest}} - X_{\text{smallest}}$
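The point that the range only looks at the two extreme scores can be sketched as follows. The `clustered` list is a hypothetical data set, and `value_range` is a hypothetical helper name.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

evenly_spread = [0, 25, 50, 75, 100]   # values spaced across the interval
clustered = [0, 50, 50, 50, 100]       # hypothetical: mostly identical, two extremes

print(value_range(evenly_spread))  # 100
print(value_range(clustered))      # 100
```

The two sets are distributed very differently between their extremes, yet the range cannot tell them apart.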
Quartiles
• Split ordered data into four quarters
– Q1 = first quartile (25th percentile)
– Q2 = second quartile = median (50th percentile)
– Q3 = third quartile (75th percentile)
Interquartile range (IQR)
• Difference between third and first quartiles
– Interquartile Range = Q3-Q1
• Spread in middle 50%
• Not affected by extreme values
• The IQR measures how spread out the middle 50% of the data points
are around the median of the data set
• The larger the IQR, the more spread out the data points
• The smaller the IQR, the more bunched up the data points are
around the median
• It is best used with other measurements such as the median and
total range to build a complete picture of a data set’s tendency to
cluster around its centre.
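A small sketch of why the IQR is not affected by extreme values, while the range is. The data are hypothetical, `iqr` is a hypothetical helper, and the quartiles come from `statistics.quantiles` with its default convention (other software may use a slightly different quartile rule).

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range: Q3 minus Q1."""
    q1, _q2, q3 = quantiles(values, n=4)   # three cut points: Q1, Q2, Q3
    return q3 - q1

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 11, 12]     # hypothetical values
extreme = data[:-1] + [120]                      # largest value replaced by an outlier

print(iqr(data), max(data) - min(data))
print(iqr(extreme), max(extreme) - min(extreme))
```

Only the range reacts to the outlier; the IQR depends on the middle 50% of the ordered values.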
Example
• Given the set of values: 27, 18, 19, 12, 15, 1,
2, 6, 5, 9, 7, find the…
– Mean
– Median
– Range
– Interquartile range
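One way to check the answers is sketched below. The mean, median and range are unambiguous, but quartiles can be computed under more than one convention, so the sketch shows both methods offered by `statistics.quantiles`; which convention the course expects is an assumption that should be confirmed.

```python
from statistics import mean, median, quantiles

values = [27, 18, 19, 12, 15, 1, 2, 6, 5, 9, 7]

data_mean = mean(values)                  # 121 / 11 = 11
data_median = median(values)              # middle of the 11 ordered values = 9
data_range = max(values) - min(values)    # 27 - 1 = 26

# Convention 1: "exclusive" (the default in statistics.quantiles)
q1_ex, _, q3_ex = quantiles(values, n=4, method="exclusive")
iqr_exclusive = q3_ex - q1_ex             # 18 - 5 = 13

# Convention 2: "inclusive" (interpolates between neighbouring data points)
q1_in, _, q3_in = quantiles(values, n=4, method="inclusive")
iqr_inclusive = q3_in - q1_in             # 16.5 - 5.5 = 11

print(data_mean, data_median, data_range, iqr_exclusive, iqr_inclusive)
```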
Standard deviation
• Let X = [3, 4, 5, 6, 7], so n = 5
– $\bar{X} = 5$
– $(X - \bar{X}) = [-2, -1, 0, 1, 2]$
• Subtract the mean $\bar{X}$ from each number in X
– $(X - \bar{X})^2 = [4, 1, 0, 1, 4]$
• Squared deviations from the mean
– $\sum (X - \bar{X})^2 = 10$
• Sum of squared deviations from the mean (SS)
– $\sum (X - \bar{X})^2 / (n-1) = 10/4 = 2.5$
• Average squared deviation from the mean (the sample variance)
– $\sqrt{\sum (X - \bar{X})^2 / (n-1)} = \sqrt{2.5} \approx 1.58$
• Square root of the average squared deviation (the sample standard deviation)
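The same steps can be sketched in Python, one line per step, assuming the sample formula that divides by n - 1 = 4. The library function `statistics.stdev` (which also divides by n - 1) serves as a cross-check.

```python
from math import sqrt
from statistics import stdev

x = [3, 4, 5, 6, 7]
n = len(x)

x_bar = sum(x) / n                        # mean: 5.0
deviations = [xi - x_bar for xi in x]     # [-2, -1, 0, 1, 2]
squared = [d ** 2 for d in deviations]    # [4, 1, 0, 1, 4]
ss = sum(squared)                         # sum of squared deviations: 10
sample_variance = ss / (n - 1)            # 10 / 4 = 2.5
sample_sd = sqrt(sample_variance)         # about 1.58

print(round(sample_sd, 2), round(stdev(x), 2))
```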
Standard deviation
• Most South African employers issue raises based on
percent of salary
• Why do supervisors think the fairest raise is a
percentage raise?
• Answer:
1) Because higher paid persons get the most money.
2) The easiest thing to do is raise everyone’s salary by a fixed
percent.
• If your budget went up by 5%, salaries can go up by 5%.
• The problem is that the flat percent raise gives
unequal increased rewards
Standard deviation
• Acme Toilet Cleaning Services
• Salary Pool: R200,000
Incomes:
• President: R100K; Manager: R50K; Secretary: R40K; and
Toilet Cleaner: R10K
• Mean: R50K
• Range: R90K
• Variance: R1,050,000,000 (population variance, dividing by n = 4; units are rand squared)
• Standard Deviation: R32.4K
• The measures of spread above can be considered “measures of inequality”
• Now, let’s apply a 5% raise
Standard deviation
• After a 5% raise, the pool of money increases by R10K to
R210,000
• Incomes:
– President: R105K; Manager: R52.5K; Secretary: R42K; and Toilet Cleaner:
R10.5K
– Mean: R52.5K (went up by 5%)
– Range: R94.5K (went up by 5%)
– Variance: R1,157,625,000 (went up by 10.25%, since $1.05^2 = 1.1025$)
– Standard Deviation: R34K (went up by 5%)
• The flat percentage raise increased
inequality. The top earner got 50% of
the new money. The bottom earner
got 5% of the new money. The range and standard
deviation went up by 5%, and the variance by 10.25%.
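The figures in this example can be reproduced with a short sketch. It assumes the population formulas (dividing by n = 4), which is what matches the variance and standard deviation quoted on the slides.

```python
from statistics import mean, pstdev, pvariance

before = [100_000, 50_000, 40_000, 10_000]     # President, Manager, Secretary, Toilet Cleaner
after = [pay * 105 // 100 for pay in before]   # flat 5% raise, kept as whole rand

new_money = sum(after) - sum(before)           # R10,000 added to the salary pool
shares = [(a - b) / new_money for a, b in zip(after, before)]  # each person's share of it

for label, pay in (("before", before), ("after", after)):
    print(label, mean(pay), max(pay) - min(pay), pvariance(pay), round(pstdev(pay)))

print(shares)   # [0.5, 0.25, 0.2, 0.05]
```

Because every salary is multiplied by 1.05, the mean, range and standard deviation scale by 1.05, while the variance scales by 1.05 squared.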
Skew
• Skewness is a measure of the asymmetry of the
probability distribution
• Roughly speaking, a distribution has positive skew
(right-skewed) if the right (higher value) tail is
longer and a negative skew (left-skewed) if the left
(lower value) tail is longer (confusing the two is a
common error)
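The sign convention can be checked numerically. The sketch below assumes one common definition, the moment coefficient of skewness (third central moment divided by the 1.5 power of the second); statistical packages may apply a small-sample adjustment, so their values can differ slightly. All three data sets are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    x_bar = sum(values) / n
    m2 = sum((v - x_bar) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - x_bar) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_tail = [1, 2, 2, 3, 3, 3, 4, 20]         # long tail towards high values
left_tail = [21 - v for v in right_tail]       # mirror image: long tail towards low values
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tail))   # positive
print(skewness(left_tail))    # negative
print(skewness(symmetric))    # zero
```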
Kurtosis
• A high kurtosis distribution has a sharper "peak"
and fatter "tails", while a low kurtosis distribution
has a more rounded peak with wider "shoulders".
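As a rough numerical illustration, the sketch below assumes the excess kurtosis definition (fourth central moment divided by the squared second central moment, minus 3, so that a normal distribution scores about 0). Both data sets are hypothetical.

```python
def excess_kurtosis(values):
    """Excess kurtosis: m4 / m2**2 - 3 (population moments)."""
    n = len(values)
    x_bar = sum(values) / n
    m2 = sum((v - x_bar) ** 2 for v in values) / n   # second central moment
    m4 = sum((v - x_bar) ** 4 for v in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

peaked_heavy_tails = [-10, -1, 0, 0, 0, 0, 1, 10]   # sharp peak at 0, a few far-out values
flat = [1, 2, 3, 4, 5, 6, 7, 8]                      # evenly spread, no pronounced peak

print(excess_kurtosis(peaked_heavy_tails))   # positive (high kurtosis)
print(excess_kurtosis(flat))                 # negative (low kurtosis)
```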
Frequency distributions
• Symmetrical distribution
– Approximately equal numbers of observations above and
below the middle
• Skewed distribution
– One side is more spread out than the other, like a tail
– Direction of the skew
• Positive or negative (right or left)
• Side with the fewer scores
• Side that looks like a tail
[Figure: Symmetrical vs. skewed distributions]
Types of descriptive statistics
BIVARIATE
• Correlation
– linear pattern of relationship between one variable (x) and
another variable (y) –an association between two variables
• Relative position of one variable correlates with relative
distribution of another variable
• Warning:
– No proof of causality
– Cannot assume x causes y
Scatterplots and correlation
• A scatter plot (or scatter diagram) is used to show
the relationship between two variables
– Scatter diagram plots pairs of bivariate observations (x, y)
on the X-Y plane
– Y is called the dependent variable
– X is called the independent variable
• Correlation analysis is used to measure strength of
the association (linear relationship) between two
variables
– Only concerned with strength of the
relationship
– No causal effect is implied
Types of correlation
• Positive correlation
– High values of X tend to be associated with high values of Y.
– As X increases, Y increases
• Negative correlation
– High values of X tend to be associated with low values of Y.
– As X increases, Y decreases
• No correlation
– No consistent tendency for values on Y to increase or
decrease as X increases
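The three cases can be sketched with a correlation coefficient computed by hand. The sketch assumes Pearson's r is the coefficient intended, and every data value below is hypothetical (made up to produce each pattern), not taken from the lecture's scatterplots.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    co_dev = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return co_dev / sqrt(ss_x * ss_y)

x = [0, 5, 10, 15, 20]             # hypothetical X values
y_positive = [1, 3, 4, 6, 9]       # Y rises as X rises
y_negative = [90, 80, 65, 50, 40]  # Y falls as X rises
y_none = [5, 9, 2, 9, 5]           # no consistent tendency

print(round(pearson_r(x, y_positive), 2))   # close to +1
print(round(pearson_r(x, y_negative), 2))   # close to -1
print(round(pearson_r(x, y_none), 2))       # 0.0
```

A value near +1 or -1 indicates a strong linear association, and a value near 0 indicates no linear association; in none of the cases does r say anything about causation.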
Applications
Individual vs Group (Neighbourhood)
What type of relationship?
[Figure: Scatterplot: Video Games and Alcohol Consumption. x-axis: Average Hours of Video Games Per Week (0 to 25); y-axis: Average Number of Alcoholic Drinks Per Week (0 to 20)]
What type of relationship?
[Figure: Scatterplot: Video Games and Test Score. x-axis: Average Hours of Video Games Per Week (0 to 20); y-axis: Exam Score (0 to 100)]
Each point represents something or
some PLACE!!
Practical 1
Date: Thursday 26th August 11:30-14:30 (Posted on Thursday)
Location: Remotely or on-campus (Brown & Orange & Red IT labs)
Assistance: Thursdays 11:30-14:20 and Thursdays 14:00-16:00 by
appointment via Doodle
Due: Thursday 9th September at 11:30 (upload on ClickUp)
Task: Sampling exercise and gaining familiarity with GeoDa and
ArcPro
Software: Excel, GeoDa and ArcPro