
DATA202-02 - Descriptive Statistics (Part 2)

The document covers descriptive statistics, focusing on graphical summaries of data, correlation, and covariance. It includes various types of graphical displays such as histograms, scatter plots, and contingency tables, along with methods for measuring data dispersion and relationships between variables. Additionally, it explains covariance and correlation, detailing their calculations and interpretations.

2/7/2025

Descriptive Statistics
(part 2)

Agenda

• Graphical Summaries of Data


• Correlation and Covariance


Review of Previous Lecture

• What is Statistics? Descriptive vs Inferential


• Population, Samples, etc; Sampling Methods
• Data Types and Variable Types
• Measurements of Data
– Central Tendency
– Dispersion, Variability or Spread
– Five Number Summaries

Graphic Summaries of Data


Graphical Summaries of data

Frequency table
• Frequency – a summary of counts for each category of the data
• Relative frequency – ratio between frequency of a category and sum of all frequencies
– All relative frequencies should add up to 1 or very close to 1
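A minimal sketch of building such a frequency table, using only the Python standard library. The `pets` sample and the `frequency_table` helper are made up for illustration; they are not data or code from the lecture.

```python
from collections import Counter


def frequency_table(values):
    """Return {category: (frequency, relative_frequency)} for a list of labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (n, n / total) for category, n in counts.items()}


# Hypothetical categorical sample (illustration only).
pets = ["dog", "cat", "dog", "fish", "cat", "dog", "none", "dog"]
table = frequency_table(pets)

for category, (freq, rel) in table.items():
    print(f"{category:>5}  {freq:>2}  {rel:.3f}")

# The relative frequencies should add up to 1 (up to rounding error).
print(sum(rel for _, rel in table.values()))
```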

Graphic Displays of Data

• Basic plots: bar and pie charts
• Boxplots: graphic display of the five-number summary
• Histograms: the x-axis shows values, the y-axis shows frequencies
• Scatter plots: each pair of values is treated as a pair of coordinates and plotted as a point in the plane
• Contingency tables: data summary for two categorical variables
• Segmented bar and mosaic plots: graphic displays of two categorical variables


Graphic Summaries of Data: Basic Charts

• Bar chart - display a single categorical variable


– Pareto Chart – descending / ascending bar chart
– Side by side bar chart
• Pie chart

Graphic Summaries of Data: Histograms

• Histogram
– Visualization of distribution of continuous variable
– Frequency distribution of continuous variable by creating classes (groups/bins)
– All data falls into one of the groups
– Bins: Same size, No overlap, & No gaps
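The binning rules above (same size, no overlap, no gaps, every value in exactly one bin) can be sketched in plain Python. The `histogram_counts` helper and the `heights_cm` values are hypothetical, and closing the last bin on the right is one common convention rather than the only one.

```python
def histogram_counts(values, num_bins):
    """Count values in num_bins equal-width bins covering [min, max].

    Bins are half-open [left, right), except the last one, which also
    includes the maximum so that every value falls into exactly one bin.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so that the maximum lands in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical heights in centimetres (illustration only).
heights_cm = [150, 155, 158, 160, 162, 165, 165, 170, 172, 180]
edges, counts = histogram_counts(heights_cm, 3)
print(edges)   # bin edges: 150, 160, 170, 180
print(counts)  # number of values in each bin
```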


Skewness and Modes of Histograms

• Skewness

• Mode

Histograms

• Data: Infant mortality rates (number of deaths under one year of age per 1000 live
births) of 201 countries/regions in 2010-2015.


Histogram Exercise

Match the following variables with the histograms and bar graphs given below. Suppose
the data represent DATA202 students:
• Height of students
• Gender breakdown of students
• # of pets students have


Measuring the Dispersion of Data

Quartiles, Outliers and Boxplots


• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• 5-number summary: min, Q1, median, Q3, max

• Outlier: a value more than 1.5 × IQR below Q1 or above Q3
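A minimal sketch of these quantities in Python. Quartile conventions differ between textbooks and software; this sketch assumes the "inclusive" interpolation method of `statistics.quantiles`, which may not match the convention used in the lecture. The `data` list is made up.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def find_outliers(values):
    """Return values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical sample with one extreme value (illustration only).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(five_number_summary(data))
print(find_outliers(data))
```

With this sample, Q1 = 3.25 and Q3 = 7.75, so IQR = 4.5 and the fences sit at -3.5 and 14.5; only the value 100 lies outside them.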


Boxplot Examples

• Like histograms, boxplots of related distributions are often placed side-by-side for
comparison.


Visualization of Data Dispersion


Histograms Often Tell More than Boxplots

• Consider the following histograms:

• These may have the same boxplot representation:


– The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions.
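To make this concrete, the sketch below builds two made-up samples that share the same five-number summary but are distributed differently: one is clustered around two values, the other is spread fairly evenly. The samples and the helper function are hypothetical, and the quartiles again assume the "inclusive" method of `statistics.quantiles`.

```python
import statistics
from collections import Counter


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Two hypothetical samples (illustration only).
clustered = [0, 2, 2, 2, 5, 8, 8, 8, 10]   # values pile up at 2 and 8
spread_out = [0, 1, 2, 3, 5, 7, 8, 9, 10]  # values spread fairly evenly

print(five_number_summary(clustered))
print(five_number_summary(spread_out))

# The value counts (a crude histogram) reveal the difference.
print(sorted(Counter(clustered).items()))
print(sorted(Counter(spread_out).items()))
```

Both samples would produce the same boxplot, while their histograms would look quite different.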


Graphic Summaries of Data: Scatter Plots

Scatter Plots
• Allow us to visualize the association between two variables
• Distinguish the predictor (or explanatory) variable from the response variable
• The direction of the association can be:
– Positive association
– Negative association
– No association
• The form can be linear or non-linear


Scatter Plots

• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane


Different Relationships from Scatter Plots


Positively and Negatively Correlated Data

[Figure: three scatter plots, captioned as follows]
• Positively correlated
• Negatively correlated
• The left half is positively correlated; the right half is negatively correlated


Contingency Tables

• Contingency Table: A table that summarizes data for two categorical variables
• Example: Breakdown of passengers on Titanic by class and survival status


Contingency Tables: Overall Proportions

• From a contingency table, we can divide each cell by the overall total to get the
proportions of observations in the different combinations:


Contingency Tables: Row Proportions

• From a contingency table, we can divide each cell by the corresponding row total to get
the proportion of passengers who survived in each of the four classes (rows):


Contingency Tables: Column Proportions

• From a contingency table, we can divide each cell by the corresponding column total to
get the proportion of passengers from each of the four classes within each survival status (columns):
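The three kinds of proportions (overall, row, column) can be sketched as follows. The counts are invented for illustration and are not the Titanic figures; the row labels `"A"`/`"B"`, the column labels `"yes"`/`"no"`, and the helper names are likewise hypothetical.

```python
# Hypothetical 2x2 contingency table: counts[row][column] (not real data).
counts = {
    "A": {"yes": 30, "no": 10},
    "B": {"yes": 20, "no": 40},
}

columns = ["yes", "no"]
grand_total = sum(sum(row.values()) for row in counts.values())
row_totals = {r: sum(row.values()) for r, row in counts.items()}
col_totals = {c: sum(counts[r][c] for r in counts) for c in columns}

# Overall proportions: each cell divided by the grand total.
overall = {r: {c: counts[r][c] / grand_total for c in columns} for r in counts}

# Row proportions: each cell divided by its row total.
row_props = {r: {c: counts[r][c] / row_totals[r] for c in columns} for r in counts}

# Column proportions: each cell divided by its column total.
col_props = {r: {c: counts[r][c] / col_totals[c] for c in columns} for r in counts}

print(overall)
print(row_props)
print(col_props)
```

In this made-up table the row proportions differ between rows (0.75 of row A is "yes" versus about 0.33 of row B), which is the kind of pattern that signals an association between the two variables.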


Independence of Two Categorical Variables

• If the row proportions do not change from row to row, the two categorical variables are
independent. Otherwise, the two categorical variables are associated.
• In the Titanic example, the survival of passengers is associated with the class they were
in because the survival rates differ substantially from class to class:


Independence of Two Categorical Variables

• We can also define two categorical variables to be independent if the column
proportions do not vary from column to column.
• The two conditions are equivalent… Why?
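One possible way to see the equivalence is sketched below. The notation is an assumption introduced here, not taken from the slides: $n_{ij}$ is the count in row $i$ and column $j$, $n_{i\cdot}$ and $n_{\cdot j}$ are the row and column totals, and $n$ is the grand total.

```latex
% Assume the row proportions are the same in every row:
\frac{n_{ij}}{n_{i\cdot}} = p_j \quad \text{for all rows } i.

% Multiply by n_{i.} and sum over the rows i:
n_{\cdot j} = \sum_i n_{ij} = p_j \sum_i n_{i\cdot} = p_j \, n
\quad\Longrightarrow\quad p_j = \frac{n_{\cdot j}}{n}.

% Substitute back to get a condition that is symmetric in rows and columns:
n_{ij} = \frac{n_{i\cdot} \, n_{\cdot j}}{n}.

% Divide by the column total: the column proportions do not depend on j.
\frac{n_{ij}}{n_{\cdot j}} = \frac{n_{i\cdot}}{n}.
```

Because the middle condition treats rows and columns symmetrically, the same argument run in the other direction shows that constant column proportions imply constant row proportions.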


Exercise
The table below shows the breakdown of injury cases in the U.S. in a certain year, by
circumstance and gender. Counts are in millions.

• What proportion of injury cases occurred at work?
• What proportion of injury cases occurred at work and involved women?
• Among all injury cases involving women, what proportion occurred at work?
• Among all injury cases that occurred at work, what proportion involved women?
• Is the circumstance of injury cases independent of the gender of the victims?

Source: Vital and Health Statistics published by the National Center for Health Statistics


Segmented Bar Plots

• Segmented bar plots (stacked bar plots) are a type of bar chart that displays multiple
categorical variables in one bar.
• Each bar is divided into segments that represent the different categories or groups
within the variable.
• The length of each segment corresponds to the proportion or count of that category in
relation to the total.


Standardized Segmented Bar Plots

• Standardized segmented bar plots are built from the row proportions, so every bar has the same total length.
• They are convenient for comparing row proportions and for determining whether the two
variables are independent.


Mosaic Plots

• Mosaic plots are graphical representations of multivariate categorical data. They use
tiles to represent the proportions of combinations of categories.
• Each tile's size is proportional to the frequency or count of the corresponding category
combination.
• bar widths = row totals
• segment lengths within a bar = row proportions
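The geometry described above can be checked numerically. The sketch below reads "bar widths = row totals" as widths proportional to the row totals, which is an interpretation rather than something the slide states; the counts and the `mosaic_tiles` helper are hypothetical.

```python
import math


def mosaic_tiles(counts):
    """Return {(row, column): (width, height)} for a two-way table.

    width  = row total / grand total  (bar width, proportional to the row total)
    height = cell count / row total   (segment length, i.e. the row proportion)
    """
    grand_total = sum(sum(row.values()) for row in counts.values())
    tiles = {}
    for row_label, row in counts.items():
        row_total = sum(row.values())
        width = row_total / grand_total
        for col_label, n in row.items():
            tiles[(row_label, col_label)] = (width, n / row_total)
    return tiles


# Hypothetical counts (illustration only).
counts = {
    "A": {"yes": 30, "no": 10},
    "B": {"yes": 20, "no": 40},
}
tiles = mosaic_tiles(counts)

for (row_label, col_label), (width, height) in tiles.items():
    print(row_label, col_label, round(width, 3), round(height, 3), round(width * height, 3))
```

Each tile's area (width times height) equals the cell count divided by the grand total, which is why tile size is proportional to the frequency of the category combination.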


Segmented Bar vs Mosaic Plots


Ways to Inspect Relationships Between Variables

• Numerical vs Numerical
– scatterplots
• Categorical vs Categorical
– contingency tables
– segmented bar plots, standardized segmented bar plots, mosaic plots
• Categorical vs Numerical
– side-by-side boxplots
– histograms by group on the same horizontal axis


Correlation and Covariance


Covariance

• Covariance is a measure of how much two random variables change together. It
indicates the direction of the linear relationship between variables.

$$\mathrm{Cov}(X, Y) = \frac{\sum (X - \mu_X)(Y - \mu_Y)}{N}$$

• It's the average of the product of the deviations of each pair of data points from their
respective means.
• Covariance values can range from negative to positive infinity.
– Positive covariance indicates that the variables tend to increase together.
– Negative covariance indicates that as one variable increases, the other tends to decrease.
– Zero covariance suggests no linear relationship.
• Covariance provides a sense of the direction of the relationship but not its strength.
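A small sketch of why covariance shows direction but not strength: its size depends on the units of measurement. The `covariance` helper divides by N, as in the formula above; the height and weight values are made up for illustration.

```python
import math


def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical heights and weights (illustration only).
heights_m = [1.5, 1.6, 1.7, 1.8]
weights_kg = [50, 60, 65, 80]
heights_cm = [h * 100 for h in heights_m]

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)

print(cov_m)   # positive: the two variables tend to increase together
print(cov_cm)  # about 100 times larger, although the data are the same
```

Changing the unit of height from metres to centimetres multiplies the covariance by 100 without changing the relationship itself, so the magnitude alone says little about how strong the relationship is.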


Correlation

• Correlation is a statistical measure that describes the strength and direction of a
relationship between two variables. It shows how one variable changes in relation to
another.

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

• Correlation values range from -1 to 1:


– +1: Perfect positive correlation (as one variable increases, the other also increases).
– 0: No correlation (no linear relationship between the variables).
– -1: Perfect negative correlation (as one variable increases, the other decreases).
• Pearson Correlation Coefficient: Measures linear relationships.
• Spearman’s Rank Correlation: Measures monotonic relationships (not necessarily linear).
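The difference between the two coefficients can be illustrated with a relationship that is monotonic but not linear. This is a minimal sketch: the `ranks` helper assumes there are no tied values, Spearman's coefficient is computed here as the Pearson coefficient of the ranks, and the data (y equal to x cubed) are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical monotonic but non-linear relationship (illustration only).
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

print(round(pearson(x, y), 3))   # below 1: the relationship is not linear
print(round(spearman(x, y), 3))  # 1.0: the relationship is perfectly monotonic
```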


Example: Covariance and Correlation of Two Variables


• Data:

  i   x   y
  1   0   1
  2   2   2
  3   1   1
  4   1   0

• Calculate the mean of X and Y
• Calculate the deviations from the mean
• Calculate the covariance
• Calculate the standard deviations of X and Y
• Calculate the correlation
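The five steps can be followed in a few lines of Python. This sketch assumes the slide's table reads as x = 0, 2, 1, 1 and y = 1, 2, 1, 0 (the first column being a row index), and it divides by N, as in the covariance formula given earlier; the variable names are chosen here for illustration.

```python
import math

# Data as read from the slide's table (assumed reading; first column is an index).
x = [0, 2, 1, 1]
y = [1, 2, 1, 0]
n = len(x)

# Step 1: means of X and Y.
mean_x = sum(x) / n
mean_y = sum(y) / n

# Step 2: deviations from the mean.
dev_x = [xi - mean_x for xi in x]
dev_y = [yi - mean_y for yi in y]

# Step 3: covariance (average product of deviations, dividing by N).
cov_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y)) / n

# Step 4: standard deviations of X and Y (also dividing by N).
sd_x = math.sqrt(sum(dx ** 2 for dx in dev_x) / n)
sd_y = math.sqrt(sum(dy ** 2 for dy in dev_y) / n)

# Step 5: correlation.
rho = cov_xy / (sd_x * sd_y)

print(mean_x, mean_y)
print(dev_x, dev_y)
print(cov_xy)
print(round(sd_x, 4), round(sd_y, 4))
print(round(rho, 4))
```

Under these assumptions both means are 1, the covariance is 0.25, both variances are 0.5, and the correlation comes out at 0.5. Dividing by n - 1 instead of N would change the covariance and standard deviations but leave the correlation unchanged.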