GEA1000 Chapter 3 Review: David - Chew@nus - Edu.sg

This document provides a summary of key concepts in univariate and bivariate exploratory data analysis, including distributions, histograms, boxplots, scatter plots, correlation coefficients, and simple linear regression. It discusses how these statistical tools can be used to explore the shape and features of one or two numerical variables and identify potential relationships between them. Anscombe's quartet is presented as an example of how misleading summary statistics can be without visualizing the actual data.

Uploaded by

Huang Zhanyi

GEA1000

Chapter 3 Review

[email protected]

WHERE WE WERE (IN WORKSHOP 2) . . .

• CATEGORICAL DATA ANALYSIS

– Create tables
– Plot graphs
– Check for association

• WHAT VARIABLES TO MEASURE?

– Exposure variable
– Response variable
– Potential Confounders

WHERE WE ARE NOW (IN WORKSHOP 3) . . .

NUMERICAL VARIABLES
• UNIVARIATE EXPLORATORY DATA ANALYSIS

– Distributions
– Histograms
– Boxplots
– Outliers

• BIVARIATE EXPLORATORY DATA ANALYSIS

– Scatter plots
– Correlation coefficient
– Linear regression

1 DISTRIBUTIONS

2 HISTOGRAMS

A histogram is an excellent tool for judging the shape and peaks of a data set with a moderate or large sample size.
However, choosing the number of bins is not an exact science. With too few or too many bins, we may not see the true shape of the data.
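
The effect of the number of bins can be explored with a small sketch. The data below are hypothetical (two simulated mounds), and `histogram_counts` is an illustrative helper using equal-width bins, not a function from the course material.

```python
import random

def histogram_counts(data, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for value in data:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((value - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical data: two mounds, centred near 20 and near 60.
rng = random.Random(0)
data = [rng.gauss(20, 5) for _ in range(200)] + [rng.gauss(60, 5) for _ in range(200)]

for num_bins in (2, 10, 100):
    counts = histogram_counts(data, num_bins)
    if num_bins <= 10:
        print(num_bins, "bins:", counts)
    else:
        print(num_bins, "bins:", counts.count(0), "empty bins, tallest bin", max(counts))
```

With very few bins the two mounds merge into two blocks, while with very many bins each bin holds only a handful of points and the shape becomes noisy.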

For small data sets, the stem-and-leaf plot (a cousin of the histogram) may be useful. It is very economical – showing both shape and actual data points.

The plot on the left is a stem-and-leaf plot, while the one on the right is a train timetable (also a stem-and-leaf plot).
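
As a rough illustration, a stem-and-leaf plot can be built in a few lines. This sketch assumes non-negative integers with the tens digit as the stem and the units digit as the leaf; the scores are hypothetical and `stem_and_leaf` is an illustrative name.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative integers (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for value in sorted(values):
        leaves[value // 10].append(value % 10)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        # Stems with no data are kept so that gaps in the data stay visible.
        row = f"{stem:>2} | " + " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows

# Hypothetical test scores, used only for illustration.
scores = [52, 67, 61, 75, 78, 73, 70, 84, 89, 81, 95]
print("\n".join(stem_and_leaf(scores)))
```

Each row shows the shape (the row length) and the actual data points (the leaves) at the same time.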

3 BOXPLOTS

The five-number summary, made up of the

• minimum value,

• first quartile Q1,

• median,

• third quartile Q3, and

• the maximum value,

is the basis of the boxplot.
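
A minimal sketch of the five-number summary is shown below. It assumes the "inclusive" interpolation rule of Python's `statistics.quantiles` for the quartiles; the course or other software may use a different quartile convention, so Q1 and Q3 can differ slightly. The data are hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: quartiles use the 'inclusive' method of statistics.quantiles;
    other textbooks and software may use a different convention.
    """
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data
print(five_number_summary(values))    # (1, 3.0, 5.0, 7.0, 9)
```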

BOXPLOTS VERSUS HISTOGRAMS
A boxplot

• does not portray certain features of a distribution, such as distinct mounds and possible gaps, as clearly as a histogram.

• does indicate skew from the relative lengths of the whiskers and the two parts of the box.

• is useful for identifying outliers.

• is also very useful for graphical comparisons of distributions.
Which of the following are robust statistics?
Mean? Median? Standard deviation? IQR?
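
One way to explore this question is to change a single value into an extreme one and see which summaries move. The sketch below uses hypothetical data and the same "inclusive" quartile assumption as Python's `statistics.quantiles`; `summaries` is an illustrative helper.

```python
import statistics

def summaries(data):
    """Return the mean, median, standard deviation and IQR of data."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "sd": statistics.stdev(data),
        "iqr": q3 - q1,
    }

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical data
with_outlier = clean[:-1] + [900]     # the largest value replaced by an extreme one

before, after = summaries(clean), summaries(with_outlier)
for name in before:
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

In this example the median and IQR stay the same, while the mean and standard deviation change substantially.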

4 BIVARIATE DATA

Two types of relationship between 2 numerical variables:

• Deterministic
A formula can be used to calculate a true value for a variable, when another is given.

Example: Converting Celsius to Fahrenheit.

$F = \frac{9C}{5} + 32$
• Statistical or Non-deterministic
The relationship between numerical variables cannot be codified into a formula which gives us
true values.

Example: X and Y related via

Y = α + β X + random variation
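
The contrast between the two types of relationship can be sketched in code. The values of α and β and the normal distribution of the random variation are assumptions made for illustration; the slides do not specify them.

```python
import random

def celsius_to_fahrenheit(c):
    """Deterministic: the same input always gives the same output."""
    return 9 * c / 5 + 32

def noisy_line(x, alpha, beta, rng, noise_sd=1.0):
    """Statistical: Y = alpha + beta * X + random variation.

    Assumption: the random variation is normal with mean 0.
    """
    return alpha + beta * x + rng.gauss(0, noise_sd)

rng = random.Random(1)
# The same Celsius value always maps to the same Fahrenheit value.
print(celsius_to_fahrenheit(100), celsius_to_fahrenheit(100))
# The same X value gives a different Y each time it is observed.
print(noisy_line(10, 2.0, 0.5, rng), noisy_line(10, 2.0, 0.5, rng))
```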

SCATTER PLOTS
• A scatter plot is useful for visualizing bivariate data.

• For this data set, it is not a good idea to use a “linear” model to describe the relationship between x and y.

• But even when we use a “curve” to model the relationship between this set of x and y, we are only able to say that

y = f (x) + random variation,

which is non-deterministic.
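
A rough way to see why a straight line is a poor description of curved data is to look at the signs of the residuals from a fitted line. The data below are hypothetical, generated with f(x) = x² plus normal noise (both assumptions), and `fit_line` is an illustrative least-squares helper.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical curved data: y = f(x) + random variation with f(x) = x**2.
rng = random.Random(2)
xs = [x / 2 for x in range(21)]               # 0, 0.5, ..., 10
ys = [x ** 2 + rng.gauss(0, 1) for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
# One sign per point, in order of increasing x.
print("".join("+" if r > 0 else "-" for r in residuals))
```

The residuals are positive at both ends and negative in the middle, a systematic pattern that a suitable model would not leave behind.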

5 CORRELATION COEFFICIENT

• The correlation coefficient r measures linear association between 2 numerical variables.

• Value is always between −1 and 1.

• r has no units!

• Interpretation

– r > 0: positive linear association

– r < 0: negative linear association
– r = 0: no linear association

• R Shiny App
https://gallery.shinyapps.io/correlation_game/
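
The slides do not state a formula for r. The sketch below uses the standard definition, the sum of products of deviations divided by the square root of the product of the sums of squared deviations, with small hypothetical data sets.

```python
import math
import statistics

def correlation(xs, ys):
    """Correlation coefficient r computed from deviations about the means."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
print(correlation(xs, [2, 4, 6, 8, 10]))                # points on a rising line
print(correlation(xs, [10, 8, 6, 4, 2]))                # points on a falling line
print(correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # y = x**2, symmetric about 0
```

The last case gives r = 0 even though y is completely determined by x, which is a reminder that r measures linear association only.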

• Strength of association:

This is only a guide for our course. Some other texts will give some other characterisation.

• r is not affected by change of scale, nor when we switch x and y.

(But what happens if we multiply one of the variables by −1?)

• However, be aware of the effect of removing/adding outliers on the value of r.
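
These properties can be checked numerically. The heights and weights below are hypothetical, and `correlation` is an illustrative helper based on the standard definition of r.

```python
import math
import statistics

def correlation(xs, ys):
    """Correlation coefficient r computed from deviations about the means."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

heights_cm = [150, 160, 165, 172, 180, 185]   # hypothetical data
weights_kg = [52, 58, 63, 70, 77, 85]         # hypothetical data

r = correlation(heights_cm, weights_kg)
# Change of scale: centimetres to inches.
r_scaled = correlation([h / 2.54 for h in heights_cm], weights_kg)
# Switching the roles of x and y.
r_switched = correlation(weights_kg, heights_cm)
# Multiplying one variable by -1.
r_negated = correlation([-h for h in heights_cm], weights_kg)
# Adding one unusual point (short but very heavy).
r_outlier = correlation(heights_cm + [150], weights_kg + [120])

print(r, r_scaled, r_switched, r_negated, r_outlier)
```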

EXAMPLE OF r VALUES

6 SIMPLE LINEAR REGRESSION

• Suppose we wish to investigate the relationship between a father’s height and his son’s height.

• Let x denote the father’s height and y denote the son’s height.

• Is there an association between x and y?

• What is the predicted son’s height for a father whose height is 67 inches? 80 inches?

• Is the relationship between x and y deterministic?

MORE ABOUT SIMPLE LINEAR REGRESSION

• Regression line: “best fit” line via the least-squares method.

R Shiny App
https://gallery.shinyapps.io/simple_regression/

• For a given X value, we can only predict the average value of Y !

• The slope m of the linear regression line is given by

$m = \frac{s_y}{s_x}\, r,$

where $s_x$ and $s_y$ are the standard deviations of x and y respectively.

From this you see that m and r are of the same sign.

• It is prudent NOT to predict average values of Y for X values that are beyond the range of X used in the data set!
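
The points above can be put together in a short sketch. The father and son heights are hypothetical, and the function names are illustrative. The intercept is an assumption not stated in the slides: it is chosen so that the line passes through the point of means, a standard property of the least-squares line.

```python
import math
import statistics

def correlation(xs, ys):
    """Correlation coefficient r computed from deviations about the means."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def regression_line(xs, ys):
    """Slope from m = r * s_y / s_x; intercept so the line passes through the means."""
    slope = correlation(xs, ys) * statistics.stdev(ys) / statistics.stdev(xs)
    intercept = statistics.mean(ys) - slope * statistics.mean(xs)
    return slope, intercept

def predict_average_y(x, xs, ys):
    """Predict the average Y at x, refusing to extrapolate beyond the data range."""
    if not min(xs) <= x <= max(xs):
        raise ValueError(f"x = {x} is outside the observed range {min(xs)} to {max(xs)}")
    slope, intercept = regression_line(xs, ys)
    return intercept + slope * x

# Hypothetical father and son heights in inches.
fathers = [63, 65, 66, 68, 70, 72, 74]
sons = [66, 66, 68, 69, 70, 71, 73]

print(regression_line(fathers, sons))
print(predict_average_y(67, fathers, sons))   # 67 is inside the observed range
```

Asking for a prediction at 80 inches raises a `ValueError` here, since no father in this hypothetical data set is that tall.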

WHAT ABOUT NON-LINEAR RELATIONSHIPS?
• We can sometimes make use of linear regression to model non-linear relationships.

• An instance is when two variables are related via an exponential equation:

$y = c\,b^{-t}$.

• Take log on both sides of the equation to get

log y = log c − t log b.

This can be thought of as

Y = mX + C,

where Y = log y, m = −log b, X = t and C = log c.
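
As a sketch of this idea, the code below fits a straight line to (t, log y) and converts the slope and intercept back to b and c. Since log y = log c − t log b, the slope of that line is −log b and the intercept is log c. The data are hypothetical and noise-free, natural logarithms are an arbitrary choice, and the function names are illustrative.

```python
import math
import statistics

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def fit_exponential_decay(ts, ys):
    """Estimate c and b in y = c * b**(-t) by regressing log y on t (needs y > 0)."""
    slope, intercept = fit_line(ts, [math.log(y) for y in ys])
    b = math.exp(-slope)      # slope of the log-scale line is -log b
    c = math.exp(intercept)   # intercept of the log-scale line is log c
    return c, b

ts = [0, 1, 2, 3, 4, 5]
ys = [100 * 2 ** (-t) for t in ts]   # hypothetical data with c = 100 and b = 2
print(fit_exponential_decay(ts, ys))
```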

ANSCOMBE’S QUARTET

Seeing is believing
Be careful!
Summary statistics like mean, standard deviation and correlation of bivariate data are useful, but do not
give you the full picture! (Pun intended. . . )

Consider the following 4 sets of bivariate data:

x1    x2    x3    x4    y1       y2      y3       y4
10    10    10    8     8.04     9.14    7.46     6.58
8     8     8     8     6.95     8.14    6.77     5.76
13    13    13    8     7.58     8.74    12.74    7.71
9     9     9     8     8.81     8.77    7.11     8.84
11    11    11    8     8.33     9.26    7.81     8.47
14    14    14    8     9.96     8.10    8.84     7.04
6     6     6     8     7.24     6.13    6.08     5.25
4     4     4     19    4.26     3.10    5.39     12.50
12    12    12    8     10.84    9.13    8.15     5.56
7     7     7     8     4.82     7.26    6.42     7.91
5     5     5     8     5.68     4.74    5.73     6.89

It can be shown that they have the same summary statistics –

• mean and standard deviation for x

• mean and standard deviation for y

• correlation coefficient for (x, y)

and even the same linear regression line!

R Shiny App
https://david-chew.shinyapps.io/SimpleLinearRegression

You would think that they are quite similar. However, nothing could be further from the truth . . .
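
The claim about the summary statistics can be checked directly from the table. The sketch below uses the standard formulas for r and the least-squares line; `summarize` is an illustrative helper, and the values agree across the four sets only after rounding to about two decimal places.

```python
import math
import statistics

def summarize(xs, ys):
    """Mean and SD of x and y, correlation, and least-squares line."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    return {
        "mean_x": x_bar, "sd_x": statistics.stdev(xs),
        "mean_y": y_bar, "sd_y": statistics.stdev(ys),
        "r": sxy / math.sqrt(sxx * syy),
        "slope": slope, "intercept": y_bar - slope * x_bar,
    }

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # x1, x2 and x3 are identical
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for label, (xs, ys) in quartet.items():
    stats = summarize(xs, ys)
    print(label, {name: round(value, 2) for name, value in stats.items()})
```

The printed summaries match across the four sets, yet a scatter plot of each set looks very different, which is the point of the quartet.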
