
GEA1000 Chapter 3 Review: David - Chew@nus - Edu.sg

This document provides a summary of key concepts in univariate and bivariate exploratory data analysis, including distributions, histograms, boxplots, scatter plots, correlation coefficients, and simple linear regression. It discusses how these statistical tools can be used to explore the shape and features of one or two numerical variables and identify potential relationships between them. Anscombe's quartet is presented as an example of how misleading summary statistics can be without visualizing the actual data.

Uploaded by

Huang Zhanyi

GEA1000

Chapter 3 Review

[email protected]

WHERE WE WERE (IN WORKSHOP 2) . . .

• CATEGORICAL DATA ANALYSIS

– Create tables
– Plot graphs
– Check for association

• WHAT VARIABLES TO MEASURE?

– Exposure variable
– Response variable
– Potential Confounders

WHERE WE ARE NOW (IN WORKSHOP 3) . . .

NUMERICAL VARIABLES
• UNIVARIATE EXPLORATORY DATA ANALYSIS

– Distributions
– Histograms
– Boxplots
– Outliers

• BIVARIATE EXPLORATORY DATA ANALYSIS

– Scatter plots
– Correlation coefficient
– Linear regression

1 DISTRIBUTIONS

2 HISTOGRAMS

A histogram is an excellent tool for judging the shape and peaks of a data set with moderate or
large sample sizes. However, choosing the number of bins is not an exact science. With either too
few bins or too many, we may not see the true shape of the data.
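The effect of the bin count can be tried out with a small sketch. The sample values and the helper `bin_counts` below are made up for illustration (they are not from the workshop data), and the text "histogram" is only a rough stand-in for a real plot.

```python
def bin_counts(data, num_bins):
    """Count how many values fall into each of num_bins equal-width bins.

    Assumes the data are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for value in data:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((value - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical sample with two clusters (around 2 and around 8).
sample = [1.5, 1.8, 2.0, 2.1, 2.2, 2.4, 2.6, 7.4, 7.7, 7.9, 8.0, 8.1, 8.3, 8.6]

for num_bins in (1, 4, 14):
    counts = bin_counts(sample, num_bins)
    # One column of stars per bin; "." marks an empty bin.
    print(f"{num_bins:2d} bins:", " ".join("*" * c if c else "." for c in counts))
```

With a single bin the two clusters are invisible, while with as many bins as data points the picture breaks up into isolated values; a moderate choice shows the two mounds and the gap between them.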

For small data sets, the stem-and-leaf plot (a cousin of the histogram) may be useful. It is very
economical, showing both shape and actual data points.

The figure on the left shows a stem-and-leaf plot, while the one on the right is a train timetable
(also a stem-and-leaf plot).
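As a rough illustration of how such a plot is built, here is a minimal sketch. The function name `stem_and_leaf` and the scores are hypothetical, and it assumes non-negative whole numbers with the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative integers (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        # Empty stems are kept so that gaps in the data remain visible.
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:2d} | {row}")
    return rows


scores = [12, 15, 21, 23, 23, 28, 30, 34, 35, 35, 37, 52]  # hypothetical data
print("\n".join(stem_and_leaf(scores)))
```

Each row lists the actual data values, and the row lengths give the same shape a histogram with bins of width 10 would show.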

3 BOXPLOTS

The five-number summary, made up of the

• minimum value,

• first quartile Q1,

• median,

• third quartile Q3, and

• the maximum value,

is the basis of the boxplot.
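The five numbers can be computed with Python's standard library, as in the sketch below. Note the assumption: quartile conventions differ between textbooks and software, and this sketch uses the "inclusive" method of `statistics.quantiles`, which may not match the convention used in the course. The heights are hypothetical.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)


heights = [150, 155, 158, 160, 162, 165, 168, 170, 185]  # hypothetical, in cm
print(five_number_summary(heights))
```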

BOXPLOTS VERSUS HISTOGRAMS
A boxplot

• does not portray certain features of a distribution, such as distinct mounds and possible gaps,
as clearly as a histogram.

• does indicate skew from the relative lengths of the whiskers and the two parts of the box.

• is useful for identifying outliers.

• is also very useful for graphical comparisons of distributions.
Which of the following are robust statistics?
mean? median? standard deviation? IQR?
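One way to explore this question is to change a single value into an extreme outlier and see which summaries move. The sketch below uses hypothetical heights and an imagined data-entry error; the quartiles again follow the "inclusive" method of `statistics.quantiles`, which is an assumption rather than the course's stated convention.

```python
import statistics


def summarize(data):
    """Return the four candidate statistics for a list of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "sd": statistics.stdev(data),
        "iqr": q3 - q1,
    }


clean = [150, 155, 158, 160, 162, 165, 168, 170, 185]  # hypothetical, in cm
with_outlier = clean[:-1] + [1850]  # imagined typo: 185 entered as 1850

before, after = summarize(clean), summarize(with_outlier)
for name in before:
    print(f"{name:6s} before = {before[name]:8.2f}   after = {after[name]:8.2f}")
```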

4 BIVARIATE DATA

Two types of relationship between 2 numerical variables:

• Deterministic
A formula can be used to calculate a true value for a variable, when another is given.

Example: Converting Celsius to Fahrenheit.

$F = \frac{9C}{5} + 32$

• Statistical or Non-deterministic
The relationship between numerical variables cannot be codified into a formula which gives us
true values.

Example: X and Y related via

Y = α + β X + random variation
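The difference between the two types can be sketched in code. The function names, the parameter values and the normally distributed noise are assumptions chosen for illustration; the slides only say "random variation".

```python
import random


def celsius_to_fahrenheit(c):
    """Deterministic: the same input always gives the same output."""
    return 9 * c / 5 + 32


def noisy_linear(x, alpha, beta, noise_sd, rng):
    """Statistical: Y = alpha + beta * X + random variation (assumed normal here)."""
    return alpha + beta * x + rng.gauss(0, noise_sd)


rng = random.Random(0)  # fixed seed so that runs are repeatable

# Two calls with the same input agree exactly.
print(celsius_to_fahrenheit(100), celsius_to_fahrenheit(100))

# Two calls with the same input differ because of the random term.
first = noisy_linear(100, 32, 1.8, 2.0, rng)
second = noisy_linear(100, 32, 1.8, 2.0, rng)
print(first, second)
```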

SCATTER PLOTS
• A scatter plot is useful for visualizing bivariate data.

• For this data set, it is not a good idea to use a “linear” model to describe the relationship
between x and y.

• But even when we use the “curve” to model the relationship between this set of x and y, we are
only able to say that

y = f(x) + random variation,

which is non-deterministic.

5 CORRELATION COEFFICIENT

• The correlation coefficient r measures linear association between 2 numerical variables.

• Value is always between −1 and 1.

• r has no units!

• Interpretation

– r > 0: positive linear association
– r < 0: negative linear association
– r = 0: no linear association

• R Shiny App
https://gallery.shinyapps.io/correlation_game/

• Strength of association:

This is only a guide for our course. Other texts may give a different characterisation.

• r is not affected by a change of scale, nor by switching x and y.
(But what happens if we multiply one of the variables by −1?)

• However, be aware of the effect of removing/adding outliers on the value of r.
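These properties can be checked numerically. The sketch below computes r as the average product of the standardized x and y values (with an n − 1 divisor), which is one common definition and is assumed here; the data and the added outlier are made up for illustration.

```python
import statistics


def correlation(x, y):
    """Sample correlation coefficient of two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    n = len(x)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (n - 1)


x = [1, 2, 3, 4, 5, 6]                 # hypothetical data
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

r = correlation(x, y)
r_switched = correlation(y, x)                       # switch x and y
r_rescaled = correlation([10 * a + 3 for a in x], y)  # change of scale and shift
r_negated = correlation([-a for a in x], y)          # multiply x by -1
r_outlier = correlation(x + [20], y + [0.0])         # add one outlier

print("original  :", round(r, 4))
print("switched  :", round(r_switched, 4))
print("rescaled  :", round(r_rescaled, 4))
print("negated   :", round(r_negated, 4))
print("w/ outlier:", round(r_outlier, 4))
```

Switching the variables or rescaling by a positive factor leaves r unchanged, multiplying one variable by −1 flips its sign, and a single outlier is enough to change r drastically.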

EXAMPLES OF r VALUES

6 SIMPLE LINEAR REGRESSION

• Suppose we wish to investigate the relationship between a father’s height and his son’s height.

• Let x denote the father’s height and y denote the son’s height.

• Is there an association between x and y?

• What is the predicted son’s height for a father whose height is 67 inches? 80 inches?

• Is the relationship between x and y deterministic?

MORE ABOUT SIMPLE LINEAR REGRESSION

• Regression line: “best fit” line via the least-squares method.

R Shiny App
https://gallery.shinyapps.io/simple_regression/

• For a given X value, we can only predict the average value of Y!

• The slope m of the linear regression line is given by

$m = \frac{s_y}{s_x}\, r,$

where $s_x$ and $s_y$ are the standard deviations of x and y respectively.

From this you see that m and r have the same sign, since $s_x$ and $s_y$ are positive.

• It is prudent NOT to predict average values of Y for X values that are beyond the range used in
the data set!
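The points above can be sketched with a small example. The father and son heights below are hypothetical (they are not the data shown in the workshop), and `predict_mean` is an illustrative helper that simply declines to extrapolate outside the observed range of x.

```python
import statistics


def correlation(x, y):
    """Sample correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / ((len(x) - 1) * statistics.stdev(x) * statistics.stdev(y))


def fit_line(x, y):
    """Least-squares line y = intercept + slope * x; returns (slope, intercept)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    # The least-squares line passes through the point of means (mx, my).
    return slope, my - slope * mx


def predict_mean(x_new, x, slope, intercept):
    """Predicted average y at x_new, or None if x_new is outside the data range."""
    if not min(x) <= x_new <= max(x):
        return None
    return intercept + slope * x_new


fathers = [63, 65, 66, 67, 68, 69, 70, 72]                 # hypothetical, inches
sons = [65.5, 66.0, 67.5, 67.0, 69.0, 68.5, 70.5, 71.0]    # hypothetical, inches

slope, intercept = fit_line(fathers, sons)
slope_from_r = correlation(fathers, sons) * statistics.stdev(sons) / statistics.stdev(fathers)

print("least-squares slope:", slope)
print("r * s_y / s_x      :", slope_from_r)
print("prediction at 67   :", predict_mean(67, fathers, slope, intercept))
print("prediction at 80   :", predict_mean(80, fathers, slope, intercept))
```

The two slope values agree up to rounding error, and the request for 80 inches is refused because no father in this made-up data set is that tall.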

WHAT ABOUT NON-LINEAR RELATIONSHIPS?
• We can sometimes make use of linear regression to model non-linear relationships.

• An instance is when two variables are related via an exponential equation:

$y = c b^{-t}.$

• Take log on both sides of the equation to get

$\log y = \log c - t \log b.$

This can be thought of as

$Y = mX + C,$

where $Y = \log y$, $m = -\log b$, $X = t$ and $C = \log c$.
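The log transformation can be tried on noise-free data, as in the sketch below. The values c = 5 and b = 2 are assumed purely for the demonstration, and natural logarithms are used (any base works as long as it is used consistently).

```python
import math
import statistics


def fit_line(x, y):
    """Least-squares line y = intercept + slope * x; returns (slope, intercept)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


c_true, b_true = 5.0, 2.0  # assumed values for the demonstration
t = [0, 1, 2, 3, 4, 5]
y = [c_true * b_true ** (-ti) for ti in t]  # y = c * b^(-t), no random variation

# Regress log y on t: the slope estimates -log b, the intercept estimates log c.
slope, intercept = fit_line(t, [math.log(yi) for yi in y])
b_hat = math.exp(-slope)
c_hat = math.exp(intercept)

print("estimated b:", b_hat)
print("estimated c:", c_hat)
```

With real data the points would not lie exactly on a line after the transformation, so the recovered b and c would only be estimates.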

ANSCOMBE’S QUARTET

Seeing is believing
Be careful!
Summary statistics like mean, standard deviation and correlation of bivariate data are useful, but do not
give you the full picture! (Pun intended. . . )

Consider the following 4 sets of bivariate data:

x1  x2  x3  x4     y1     y2     y3     y4
10  10  10   8   8.04   9.14   7.46   6.58
 8   8   8   8   6.95   8.14   6.77   5.76
13  13  13   8   7.58   8.74  12.74   7.71
 9   9   9   8   8.81   8.77   7.11   8.84
11  11  11   8   8.33   9.26   7.81   8.47
14  14  14   8   9.96   8.10   8.84   7.04
 6   6   6   8   7.24   6.13   6.08   5.25
 4   4   4  19   4.26   3.10   5.39  12.50
12  12  12   8  10.84   9.13   8.15   5.56
 7   7   7   8   4.82   7.26   6.42   7.91
 5   5   5   8   5.68   4.74   5.73   6.89

It can be shown that they have the same summary statistics:

• mean and standard deviation for x

• mean and standard deviation for y

• correlation coefficient for (x, y)

and even the same linear regression line!
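The claim can be checked directly from the table. The sketch below computes each summary for the four data sets and rounds to two decimal places before comparing; the rounding is a choice made here, since the unrounded values are close but not identical.

```python
import statistics

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # x1, x2 and x3 are identical
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def summary(x, y):
    """Return (mean x, sd x, mean y, sd y, r, slope, intercept), rounded to 2 d.p."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / ((len(x) - 1) * sx * sy)
    slope = r * sy / sx                 # slope of the regression line
    intercept = my - slope * mx         # line passes through the point of means
    return tuple(round(v, 2) for v in (mx, sx, my, sy, r, slope, intercept))


summaries = [summary(x, y) for x, y in ((x123, y1), (x123, y2), (x123, y3), (x4, y4))]
for label, s in zip(("set 1", "set 2", "set 3", "set 4"), summaries):
    print(label, s)
```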

R Shiny App
https://david-chew.shinyapps.io/SimpleLinearRegression

You would think that they are quite similar. However, nothing could be further from the truth . . .
