2/7/2025
Descriptive Statistics
(part 2)
Agenda
• Graphical Summaries of Data
• Correlation and Covariance
Review of Previous Lecture
• What is Statistics? Descriptive vs Inferential
• Population, Samples, etc; Sampling Methods
• Data Types and Variable Types
• Measurements of Data
– Central Tendency
– Dispersion, Variability or Spread
– Five Number Summaries
Graphic Summaries of Data
Graphical Summaries of Data
Frequency table
• Frequency – a summary of counts for each category of the data
• Relative frequency – ratio between frequency of a category and sum of all frequencies
– All relative frequencies should add up to 1 or very close to 1
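As a minimal sketch of the two definitions above, the following Python snippet builds a frequency table and the matching relative frequencies. The `pets` data are made up for illustration and do not come from the lecture.

```python
from collections import Counter

# Hypothetical categorical data (an assumption, not lecture data):
# the kind of pet each of eight students has.
pets = ["dog", "cat", "dog", "none", "cat", "dog", "none", "dog"]

freq = Counter(pets)                 # frequency: count for each category
total = sum(freq.values())           # sum of all frequencies
rel_freq = {cat: n / total for cat, n in freq.items()}  # relative frequency

for cat in freq:
    print(f"{cat:5s} {freq[cat]:3d} {rel_freq[cat]:.3f}")

# The relative frequencies should add up to 1 (up to rounding).
print("sum of relative frequencies:", sum(rel_freq.values()))
```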
Graphic Displays of Data
• Basic plots: bar and pie charts
• Boxplots: graphic display of five-number summary
• Histograms: the x-axis shows the values, the y-axis shows the frequencies
• Scatter plots: each pair of values is a pair of coordinates and plotted as points in the plane
• Contingency tables: data summary for two categorical variables
• Segmented bar plots and mosaic plots
Graphic Summaries of Data: Basic Charts
• Bar chart - display a single categorical variable
– Pareto chart – bar chart with the bars sorted by frequency (conventionally in descending order)
– Side by side bar chart
• Pie chart
Graphic Summaries of Data: Histograms
• Histogram
– Visualizes the distribution of a continuous variable
– Shows the frequency distribution by grouping the values into classes (groups/bins)
– Every observation falls into exactly one bin
– Bins: same width, no overlap, no gaps
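The binning rules above can be sketched in a few lines of Python. The helper name `histogram_counts`, the heights, and the convention that each bin includes its left edge (with the last bin also including the right end) are assumptions made for illustration.

```python
def histogram_counts(values, low, high, n_bins):
    """Count values in n_bins equal-width bins that cover [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if not (low <= v <= high):
            raise ValueError(f"{v} is outside [{low}, {high}]")
        # Each bin is [left, right); the last bin also includes `high`,
        # so there are no gaps and no overlaps.
        idx = min(int((v - low) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical student heights in cm (not lecture data).
heights = [152, 158, 160, 161, 165, 167, 168, 170, 171, 175, 178, 180, 185, 190]
counts = histogram_counts(heights, low=150, high=190, n_bins=4)

for i, c in enumerate(counts):
    left = 150 + 10 * i
    print(f"{left}-{left + 10}: {'#' * c}")
```

Because every observation lands in exactly one bin, the counts add up to the number of observations.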
Skewness and Modes of Histograms
• Skewness
• Mode
Histograms
• Data: Infant mortality rates (number of deaths under one year of age per 1000 live
births) of 201 countries/regions in 2010-2015.
Histogram Exercise
Match the following variables with the histograms and bar graphs given below. Suppose
the data represent DATA202 students:
• Height of students
• Gender breakdown of students
• # of pets students have
Measuring the Dispersion of Data
Quartiles, Outliers and Boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• 5-number summary: min, Q1, median, Q3, max
• Outlier: a value more than 1.5 × IQR below Q1 or above Q3
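A short Python sketch of the five-number summary and the 1.5 × IQR rule follows. The sample is hypothetical, and the quartile convention (`method="inclusive"` in `statistics.quantiles`) is an assumption: textbooks and software packages use slightly different quartile definitions, so results on small samples can differ.

```python
import statistics

# Hypothetical sample (not lecture data) with one extreme value.
data = [2, 4, 5, 7, 8, 9, 10, 12, 30]

# Quartile conventions vary; "inclusive" is one common choice (an assumption).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences of the 1.5 x IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

five_number_summary = (min(data), q1, median, q3, max(data))
print("five-number summary:", five_number_summary)
print("IQR:", iqr, "fences:", (lower_fence, upper_fence))
print("outliers:", outliers)
```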
Boxplot Examples
• Like histograms, boxplots of related distributions are often placed side-by-side for
comparison.
Visualization of Data Dispersion
Histograms Often Tell More than Boxplots
• Consider the following histograms:
• These may have the same boxplot representation:
– The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions.
Graphic Summaries of Data: Scatter Plots
Scatter Plots
• Visualize the association between two variables
• One variable is the predictor (or explanatory) variable; the other is the response variable
• The association can be:
– Positive association
– Negative association
– No association
• The form can be linear or non-linear
Scatter Plots
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Different Relationships from Scatter Plots
Positively and Negatively Correlated Data
[Figure: three scatter plots, labeled "Positively correlated", "Negatively correlated", and "The left half is positively correlated; the right half is negatively correlated"]
Contingency Tables
• Contingency Table: A table that summarizes data for two categorical variables
• Example: Breakdown of passengers on Titanic by class and survival status
Contingency Tables: Overall Proportions
• From a contingency table, we can divide each cell by the overall total to get the
proportions of observations in the different combinations:
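The calculation can be sketched in Python as below. The counts are hypothetical and are NOT the actual Titanic figures; only the layout (class in rows, survival status in columns) follows the example.

```python
# Hypothetical counts (an assumption, not the real Titanic data).
# Rows = passenger class, columns = survival status.
table = {
    "First":  {"Survived": 60, "Died": 40},
    "Second": {"Survived": 40, "Died": 60},
    "Third":  {"Survived": 50, "Died": 150},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Overall proportion: each cell divided by the grand total.
overall = {
    cls: {status: n / grand_total for status, n in row.items()}
    for cls, row in table.items()
}

for cls, row in overall.items():
    print(cls, {status: round(p, 3) for status, p in row.items()})
```

The overall proportions of all cells add up to 1, which is a quick sanity check on the table.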
Contingency Tables: Row Proportions
• From a contingency table, we can divide each cell by its row total to get
the proportions of passengers in each of the four classes (rows) who survived:
Contingency Tables: Column Proportions
• From a contingency table, we can divide each cell by its column total to
get the proportion of each class among the passengers with each survival status (columns):
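To contrast the two kinds of conditional proportions, here is a small Python sketch that computes both from the same table. The counts are hypothetical (not the actual Titanic figures), and the layout with classes in rows and survival status in columns is assumed.

```python
# Hypothetical counts; rows = class, columns = (Survived, Died).
classes = ["First", "Second", "Third"]
statuses = ["Survived", "Died"]
counts = [
    [60, 40],
    [40, 60],
    [50, 150],
]

row_totals = [sum(row) for row in counts]         # one total per class
col_totals = [sum(col) for col in zip(*counts)]   # one total per status

# Row proportions: within each class, what share survived or died?
row_props = [[n / rt for n in row] for row, rt in zip(counts, row_totals)]

# Column proportions: within each status, what share came from each class?
col_props = [[n / ct for n, ct in zip(row, col_totals)] for row in counts]

for cls, rp, cp in zip(classes, row_props, col_props):
    print(cls, "row:", [round(p, 3) for p in rp], "col:", [round(p, 3) for p in cp])
```

Row proportions add up to 1 across each row, while column proportions add up to 1 down each column.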
Independence of Two Categorical Variables
• If the row proportions do not change from row to row, the two categorical variables are
independent. Otherwise, the two categorical variables are associated.
• In the Titanic example, the survival of passengers is associated with the class they were
in because the survival rates differ substantially from class to class:
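The row-proportion criterion can be checked mechanically. The sketch below is only a descriptive comparison under stated assumptions: both tables are hypothetical, and the helper names `row_proportions` and `looks_independent` are made up for this example.

```python
def row_proportions(table):
    """Divide each cell by its row total."""
    return [[cell / sum(row) for cell in row] for row in table]

def looks_independent(table, tol=1e-9):
    """True if every row has the same proportions as the first row (within tol)."""
    props = row_proportions(table)
    first = props[0]
    return all(abs(p - q) <= tol for row in props for p, q in zip(row, first))

# Hypothetical tables: rows = groups, columns = (Survived, Died).
associated = [[60, 40], [40, 60], [50, 150]]    # row proportions differ
independent = [[30, 70], [60, 140], [15, 35]]   # every row is 30% / 70%

print(looks_independent(associated))
print(looks_independent(independent))
```

With real data the row proportions are almost never exactly equal, so in practice the question is whether they differ substantially, as they do in the Titanic example.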
Independence of Two Categorical Variables
• We can also define two categorical variables to be independent if the column
proportions do not vary from column to column.
• The two conditions are equivalent… Why?
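One way to see the equivalence is sketched below. The notation is an assumption introduced here: n_{ij} is the count in row i and column j, r_i and c_j are the row and column totals, and n is the grand total.

```latex
% Notation (assumed): n_{ij} = cell count, r_i = row total,
% c_j = column total, n = grand total.
\begin{align*}
\text{Row proportions equal across rows:}\quad
  & \frac{n_{ij}}{r_i} = p_j \ \text{ for every row } i \\
\text{Sum over the rows:}\quad
  & c_j = \sum_i n_{ij} = p_j \sum_i r_i = p_j\, n
    \;\Longrightarrow\; p_j = \frac{c_j}{n} \\
\text{Hence:}\quad
  & n_{ij} = \frac{r_i\, c_j}{n}
    \;\Longrightarrow\; \frac{n_{ij}}{c_j} = \frac{r_i}{n}
    \ \text{ for every column } j
\end{align*}
```

The last expression does not depend on j, so the column proportions are the same in every column. Swapping the roles of rows and columns gives the converse.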
Exercise
The table below shows the breakdown of injury cases in the U.S. in a certain year, by
circumstance and gender. Counts are in millions.
• What proportion of injury cases occurred at work?
• What proportion of injury cases occurred at work and involved women?
• Among all injury cases involving women, what proportion occurred at work?
• Among all injury cases that occurred at work, what proportion involved women?
• Is the circumstance of injury cases independent of the gender of the victims?
Source: Vital and Health Statistics published by the National Center for Health Statistics
Segmented Bar Plots
• Segmented bar plots (stacked bar plots) are a type of bar chart that displays multiple
categorical variables in one bar.
• Each bar is divided into segments that represent the different categories or groups
within the variable.
• The length of each segment corresponds to the proportion or count of that category in
relation to the total.
Standardized Segmented Bar Plots
• Standardized segmented bar plots display the row proportions instead of the counts, so all bars have the same length.
• They are convenient for comparing row proportions and for judging whether the two
variables are independent.
Mosaic Plots
• Mosaic plots are graphical representations of multivariate categorical data. They use
tiles to represent the proportions of combinations of categories.
• Each tile's size is proportional to the frequency or count of the corresponding category
combination.
• bar widths = row totals
• segment lengths within a bar = row proportions
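The two rules above determine the geometry of every tile. As a rough sketch with hypothetical counts (not lecture data), the Python code below computes each bar's width and segment lengths and checks that a tile's area equals the overall proportion of its cell.

```python
# Hypothetical counts; rows = class, columns = (Survived, Died).
counts = {
    "First":  [60, 40],
    "Second": [40, 60],
    "Third":  [50, 150],
}
grand_total = sum(sum(row) for row in counts.values())

tiles = {}
for label, row in counts.items():
    width = sum(row) / grand_total          # bar width = row total (as a share)
    lengths = [n / sum(row) for n in row]   # segment lengths = row proportions
    tiles[label] = (width, lengths)

for label, (width, lengths) in tiles.items():
    areas = [width * length for length in lengths]
    print(label, "width:", width, "areas:", [round(a, 3) for a in areas])
```

Since width times length equals the cell count divided by the grand total, the tile areas add up to 1 over the whole plot.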
Segmented Bar vs Mosaic Plots
Ways to Inspect Relationships Between Variables
• Numerical vs Numerical
– scatterplots
• Categorical vs Categorical
– contingency tables
– segmented bar plots, standardized segmented bar plots, mosaic plots
• Categorical vs Numerical
– side-by-side boxplots
– histograms by group on the same horizontal axis
Correlation and Covariance
Covariance
• Covariance is a measure of how much two random variables change together. It
indicates the direction of the linear relationship between variables.
$$\mathrm{Cov}(X, Y) = \frac{\sum (X - \mu_X)(Y - \mu_Y)}{N}$$
• It's the average of the product of the deviations of each pair of data points from their
respective means.
• Covariance values can range from negative to positive infinity.
– Positive covariance indicates that the variables tend to increase together.
– Negative covariance indicates that as one variable increases, the other tends to decrease.
– Zero covariance suggests no linear relationship.
• Covariance indicates the direction of the relationship but not its strength, because its magnitude depends on the units of the variables.
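The population formula above translates directly into Python. In this sketch the function name `covariance` and the paired data are hypothetical; the second call rescales x from metres to centimetres to show that the covariance changes with the units even though the relationship is the same.

```python
def covariance(xs, ys):
    """Population covariance: mean of the products of paired deviations."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical paired data (not lecture data).
x_m = [1, 2, 3, 4, 5]            # x measured in metres
y = [2, 4, 5, 4, 5]
x_cm = [100 * v for v in x_m]    # the same x values in centimetres

cov_m = covariance(x_m, y)
cov_cm = covariance(x_cm, y)
print(cov_m, cov_cm)
```

Both covariances are positive, so the direction agrees, but the second is 100 times the first purely because of the change of units.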
Correlation
• Correlation is a statistical measure that describes the strength and direction of a
relationship between two variables. It shows how one variable changes in relation to
another.
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
• Correlation values range from -1 to 1:
– +1: Perfect positive correlation (as one variable increases, the other also increases).
– 0: No correlation (no linear relationship between the variables).
– -1: Perfect negative correlation (as one variable increases, the other decreases).
• Pearson Correlation Coefficient: Measures linear relationships.
• Spearman’s Rank Correlation: Measures monotonic relationships (not necessarily linear).
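To illustrate the difference between the two coefficients, the sketch below computes Pearson's correlation from the formula above and Spearman's correlation as Pearson's correlation of the ranks. The helper names and the data (y = x cubed, a monotonic but non-linear relationship) are assumptions chosen for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the SDs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)   # undefined (division by zero) if a variable is constant

def ranks(values):
    """Ranks 1..n; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: y increases with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

print("Pearson: ", pearson(x, y))
print("Spearman:", spearman(x, y))
```

Here the Spearman coefficient is 1 up to floating-point rounding because the relationship is perfectly monotonic, while the Pearson coefficient is below 1 because the points do not lie on a straight line.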
Example: Covariance and Correlation of Two Variables
Data:
      x   y
  1   0   1
  2   2   2
  3   1   1
  4   1   0
• Calculate the Mean of X and Y
• Calculate the Deviations from the Mean
• Calculate Covariance
• Calculate the Standard Deviations of X and Y
• Calculate Correlation
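The five steps can be traced in Python as a check on a hand calculation. This sketch assumes that the first column of the slide's table is a row number, so that x = 0, 2, 1, 1 and y = 1, 2, 1, 0, and it uses the population formulas (dividing by N) given earlier.

```python
import math

# Assumption: the first table column is the row number, so the data are:
x = [0, 2, 1, 1]
y = [1, 2, 1, 0]
n = len(x)

# Step 1: means
mean_x = sum(x) / n
mean_y = sum(y) / n

# Step 2: deviations from the mean
dev_x = [v - mean_x for v in x]
dev_y = [v - mean_y for v in y]

# Step 3: covariance (average product of paired deviations)
cov = sum(a * b for a, b in zip(dev_x, dev_y)) / n

# Step 4: population standard deviations
sd_x = math.sqrt(sum(a * a for a in dev_x) / n)
sd_y = math.sqrt(sum(b * b for b in dev_y) / n)

# Step 5: correlation
corr = cov / (sd_x * sd_y)

print(mean_x, mean_y, dev_x, dev_y)
print(cov, sd_x, sd_y, corr)
```

Under this reading of the table, both means are 1, the covariance is 0.25, both standard deviations are the square root of 0.5 (about 0.707), and the correlation is 0.5 up to floating-point rounding.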