GEA1000
Chapter 3 Review
WHERE WE WERE (IN WORKSHOP 2) . . .
• CATEGORICAL DATA ANALYSIS
– Create tables
– Plot graphs
– Check for association
• WHAT VARIABLES TO MEASURE?
– Exposure variable
– Response variable
– Potential Confounders
WHERE WE ARE NOW (IN WORKSHOP 3) . . .
NUMERICAL VARIABLES
• UNIVARIATE EXPLORATORY DATA ANALYSIS
– Distributions
– Histograms
– Boxplots
– Outliers
• BIVARIATE EXPLORATORY DATA ANALYSIS
– Scatter plots
– Correlation coefficient
– Linear regression
1 DISTRIBUTIONS
2 HISTOGRAMS
A histogram is an excellent tool for judging the shape and peaks of a data set
with a moderate or large sample size.
However, choosing the number of bins is not an exact science. With either too
few or too many bins, we may not see the true shape of the data.
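To see how the bin count changes the apparent shape, the sketch below bins one small sample three ways and prints the counts. The sample values and the helper `bin_counts` are made up for illustration and are not part of the workshop material.

```python
def bin_counts(data, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in data:
        # The maximum value goes into the last bin rather than a new one.
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical sample with two clusters (around 2-3 and around 7-8).
sample = [1.8, 2.0, 2.1, 2.3, 2.4, 2.6, 2.9, 3.1,
          6.8, 7.0, 7.2, 7.3, 7.5, 7.7, 8.0, 8.2]

for n_bins in (2, 5, 16):
    print(f"{n_bins:>2} bins:", bin_counts(sample, n_bins))
```

With very few bins the gap between the two clusters cannot be seen, and with as many bins as data points most bins hold zero or one value, so the shape dissolves into noise.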
For small data sets, the stem-and-leaf plot (a cousin of the histogram) may be useful. It is very
economical – showing both shape and actual data points.
The figure on the left shows a stem-and-leaf plot, while the one on the right is a train timetable
(also a stem-and-leaf plot).
3 BOXPLOTS
The five-number summary, made up of the
• minimum value,
• first quartile Q1,
• median,
• third quartile Q3, and
• the maximum value,
is the basis of the boxplot.
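A minimal sketch of computing the five-number summary is given below. Quartile conventions differ between textbooks and software; the "inclusive" method of Python's `statistics.quantiles` used here is one common choice and is an assumption, as are the data values.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # Assumption: quartiles via the "inclusive" interpolation method.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

heights = [150, 155, 160, 162, 165, 168, 170, 175, 190]  # made-up data
print(five_number_summary(heights))
```

The box of the boxplot spans Q1 to Q3 with a line at the median, so these five values are enough to draw the basic plot.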
BOXPLOTS VERSUS HISTOGRAMS
A boxplot
• does not portray certain features of a distribution, such as distinct mounds
and possible gaps, as clearly as a histogram.
• does indicate skew from the relative lengths of the whiskers and the two
parts of the box.
• is useful for identifying outliers.
• is also very useful for graphical comparisons of distributions.
Which of the following is a robust statistic?
mean? median? standard deviation? IQR?
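One way to explore this question is to replace a single value with an extreme one and see which summaries move. The sketch below does this on made-up data; the quartile method ("inclusive") is an assumption, since conventions vary.

```python
import statistics

def summarize(data):
    """Mean, median, standard deviation and IQR of a list of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "sd": statistics.stdev(data),
        "iqr": q3 - q1,
    }

clean = [10, 11, 12, 13, 14, 15, 16, 17, 18]  # made-up values
with_outlier = clean[:-1] + [180]             # largest value replaced by an extreme one

before, after = summarize(clean), summarize(with_outlier)
for name in before:
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

A statistic that barely changes when one value becomes extreme is behaving robustly on this sample.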
4 BIVARIATE DATA
Two types of relationship between 2 numerical variables:
• Deterministic
A formula can be used to calculate a true value for a variable, when another is given.
Example: Converting Celsius to Fahrenheit.
$F = \frac{9C}{5} + 32$
• Statistical or Non-deterministic
The relationship between numerical variables cannot be codified into a formula which gives us
true values.
Example: X and Y related via
Y = α + β X + random variation
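The contrast between the two types can be sketched in code. The function names, the parameter values, and the choice of normally distributed random variation are all assumptions made for illustration.

```python
import random

def celsius_to_fahrenheit(c):
    """Deterministic: the same input always gives the same output."""
    return 9 * c / 5 + 32

def noisy_line(x, alpha, beta, rng, noise_sd=1.0):
    """Statistical: Y = alpha + beta * X + random variation.

    Assumption: the random variation is normal with mean 0.
    """
    return alpha + beta * x + rng.gauss(0, noise_sd)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Same input twice: identical outputs.
print(celsius_to_fahrenheit(100), celsius_to_fahrenheit(100))

# Same input twice: different outputs, scattered around alpha + beta * x.
print(noisy_line(10, 2.0, 0.5, rng), noisy_line(10, 2.0, 0.5, rng))
```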
SCATTER PLOTS
• A scatter plot is useful for visualizing bivariate data.
• For this data set, it is not a good idea to use a "linear" model to describe
the relationship between x and y.
• But even when we use a "curve" to model the relationship between this set of
x and y, we are only able to say that
y = f(x) + random variation,
which is non-deterministic.
5 CORRELATION COEFFICIENT
• The correlation coefficient r measures linear association between 2 numerical variables.
• Value is always between −1 and 1.
• r has no units!
• Interpretation
– r > 0: positive linear association
– r < 0: negative linear association
– r = 0: no linear association
• R Shiny App
https://gallery.shinyapps.io/correlation_game/
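The slides do not state a formula for r. The sketch below uses the standard Pearson definition: the sum of products of standardized x and y values, divided by n − 1. The data are made up for illustration.

```python
import statistics

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    # Sum of products of standardized values, divided by n - 1.
    total = sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

x = [1, 2, 3, 4, 5]
print(correlation(x, [2, 4, 6, 8, 10]))  # points on an increasing line
print(correlation(x, [10, 8, 6, 4, 2]))  # points on a decreasing line
print(correlation(x, [2, 1, 0, 1, 2]))   # symmetric V shape
```

The V-shaped example gives r close to 0 even though y clearly depends on x. This shows that r = 0 rules out a linear association only, not every kind of association.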
• Strength of association:
This is only a guide for our course; other texts may use different characterisations.
• r is not affected by a change of scale, nor by switching x and y.
(But what happens if we multiply one of the variables by −1?)
• However, be aware of the effect of removing/adding outliers on the value of r.
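These properties can be checked numerically. The sketch below uses made-up data and a small helper `correlation` (the standard Pearson formula, an assumption since the slides give no formula); it compares r after rescaling, swapping, negating, and adding one extreme point.

```python
import statistics

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

x = [1, 2, 3, 4, 5, 6]  # made-up values
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

r = correlation(x, y)
r_scaled = correlation([2.54 * v for v in x], y)  # e.g. inches to centimetres
r_swapped = correlation(y, x)                     # roles of x and y exchanged
r_negated = correlation([-v for v in x], y)       # one variable multiplied by -1
r_outlier = correlation(x + [20], y + [0.0])      # one extra extreme point

print(f"original {r:.3f}, scaled {r_scaled:.3f}, swapped {r_swapped:.3f}")
print(f"negated {r_negated:.3f}, with outlier {r_outlier:.3f}")
```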
EXAMPLE OF r VALUES
6 SIMPLE LINEAR REGRESSION
• Suppose we wish to investigate the relationship between a father's height
and his son's height.
• Let x denote the father's height and y denote the son's height.
• Is there an association between x and y?
• What is the predicted height of a son whose father's height is 67 inches?
80 inches?
• Is the relationship between x and y deterministic?
MORE ABOUT SIMPLE LINEAR REGRESSION
• Regression line: "best fit" line via the least-squares method.
R Shiny App
https://gallery.shinyapps.io/simple_regression/
• For a given X value, we can only predict the average value of Y!
• The slope m of the linear regression line is given by
$m = \frac{s_y}{s_x} r$,
where $s_x$ and $s_y$ are the standard deviations of x and y respectively.
From this you can see that m and r have the same sign.
• It is prudent NOT to predict average values of Y for X values that lie beyond the range of the
data set!
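The slope formula and the warning about extrapolation can be put together in a short sketch. The intercept uses the standard fact that the least-squares line passes through the point of means, which the slides do not state. The father/son heights and the helper `fit_line` are made up for illustration.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares line y = m * x + b, using m = r * s_y / s_x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / ((len(xs) - 1) * sd_x * sd_y)
    m = r * sd_y / sd_x
    b = mean_y - m * mean_x  # the line passes through (mean_x, mean_y)
    return m, b, r

# Made-up father/son heights in inches, for illustration only.
fathers = [64, 66, 67, 68, 70, 72]
sons = [66, 67, 68, 68, 70, 71]

m, b, r = fit_line(fathers, sons)
for x_new in (67, 80):
    if min(fathers) <= x_new <= max(fathers):
        print(f"x = {x_new}: predicted average y = {m * x_new + b:.2f}")
    else:
        print(f"x = {x_new}: outside the observed range, prediction not advised")
```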
WHAT ABOUT NON-LINEAR RELATIONSHIPS?
• We can sometimes make use of linear regression to model non-linear relationships.
• An instance is when two variables are related via an exponential equation:
$y = c b^{-t}$.
• Take logs on both sides of the equation to get
$\log y = \log c - t \log b$.
This can be thought of as
$Y = mX + C$,
where $Y = \log y$, $m = -\log b$, $X = t$ and $C = \log c$.
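The log transformation can be tried on generated data. In the sketch below the parameter values are hypothetical and the data are generated exactly, with no random variation. Since log y = log c − t log b, the fitted slope of log y against t estimates −log b and the intercept estimates log c.

```python
import math
import statistics

# Hypothetical parameters for y = c * b**(-t).
c_true, b_true = 50.0, 2.0
ts = [0, 1, 2, 3, 4, 5]
ys = [c_true * b_true ** (-t) for t in ts]

# Transform: Y = log y is linear in X = t.
log_ys = [math.log(y) for y in ys]

# Least-squares slope and intercept of log y against t.
mean_t, mean_ly = statistics.mean(ts), statistics.mean(log_ys)
slope = (sum((t - mean_t) * (ly - mean_ly) for t, ly in zip(ts, log_ys))
         / sum((t - mean_t) ** 2 for t in ts))
intercept = mean_ly - slope * mean_t

# Undo the transformation: slope = -log b and intercept = log c.
b_est = math.exp(-slope)
c_est = math.exp(intercept)
print(b_est, c_est)
```

With real data the points of log y against t would not lie exactly on a line, and the recovered b and c would only be estimates.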
ANSCOMBE'S QUARTET
Seeing is believing
Be careful!
Summary statistics like mean, standard deviation and correlation of bivariate data are useful, but do not
give you the full picture! (Pun intended. . . )
Consider the following 4 sets of bivariate data:

x1   x2   x3   x4   y1      y2     y3      y4
10   10   10   8    8.04    9.14   7.46    6.58
8    8    8    8    6.95    8.14   6.77    5.76
13   13   13   8    7.58    8.74   12.74   7.71
9    9    9    8    8.81    8.77   7.11    8.84
11   11   11   8    8.33    9.26   7.81    8.47
14   14   14   8    9.96    8.10   8.84    7.04
6    6    6    8    7.24    6.13   6.08    5.25
4    4    4    19   4.26    3.10   5.39    12.50
12   12   12   8    10.84   9.13   8.15    5.56
7    7    7    8    4.82    7.26   6.42    7.91
5    5    5    8    5.68    4.74   5.73    6.89

It can be shown that they have the same summary statistics:
• mean and standard deviation for x
• mean and standard deviation for y
• correlation coefficient for (x, y)
and even the same linear regression line!
R Shiny App
https://david-chew.shinyapps.io/SimpleLinearRegression
You would think that they are quite similar. However, nothing could be further from the truth . . .
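The claim about matching summary statistics can be checked directly from the table. The sketch below computes each statistic for the four data sets and rounds to two decimal places (the rounding level is an assumption; the values are not identical to every digit). The helper name `summarize` is hypothetical.

```python
import statistics

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets 1 to 3
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def summarize(xs, ys):
    """Mean and SD of x and y, r, and least-squares slope and intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / ((len(xs) - 1) * sd_x * sd_y)
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    values = (mean_x, sd_x, mean_y, sd_y, r, slope, intercept)
    return tuple(round(v, 2) for v in values)

datasets = {1: (x123, y1), 2: (x123, y2), 3: (x123, y3), 4: (x4, y4)}
results = {k: summarize(xs, ys) for k, (xs, ys) in datasets.items()}
for k, stats in results.items():
    print(k, stats)
```

The printed rows agree, yet nothing in them reveals how differently the four sets are laid out, which is why plotting the data matters.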