Contents
Misleading ways to present data
Advantages of Numerical Summaries
Summaries of Spread
Reporting Center and Spread
Formulas
Base R Codes
GGplot Codes
Misleading ways to present data
Real estate agents can quote the average sale price of a suburb. A few houses sold at a high price
(outliers) can inflate the average even though the median house price is much lower, so
less desirable suburbs are presented as better than they are.
Quoting the price per square foot instead of per square meter, the predominant unit in Australia,
can make the price look lower.
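A minimal sketch of both points in R. The sale prices and the price per square meter are made up for illustration and do not come from real suburb data.

```r
# Hypothetical sale prices in $1000s: four typical houses and one very expensive outlier
prices <- c(500, 520, 540, 560, 3000)

avg_price <- mean(prices)    # pulled upward by the single outlier
mid_price <- median(prices)  # stays with the typical houses
print(c(mean = avg_price, median = mid_price))

# Hypothetical price of $10,000 per square meter, converted to square feet
# (1 square meter is about 10.7639 square feet)
price_per_sqm  <- 10000
price_per_sqft <- price_per_sqm / 10.7639
print(price_per_sqft)        # a much smaller-looking number for the same property
```

The mean is almost double the median here, even though four of the five houses sold for less than $600k.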
Advantages of Numerical Summaries
A numerical summary reduces all the data to one simple number. This loses a lot of information,
but it allows easy communication and comparison.
Major features that can be summarised numerically:
● Maximum
● Minimum
● Center (mean, median)
● Spread (standard deviation, range, IQR)
Summaries of Spread
IQR
Range
Variance
Standard deviation
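As a rough sketch, each of these spread summaries can be computed in base R. The vector x is made-up data; IQR(), var() and sd() are built-in functions, and var() and sd() give the sample versions (dividing by n - 1).

```r
# Small made-up data set for illustration
x <- c(2, 4, 4, 5, 7, 9, 11)

x_range <- max(x) - min(x)  # range as one number (range(x) returns the min and max)
x_iqr   <- IQR(x)           # Q3 - Q1, using R's default quantile method
x_var   <- var(x)           # sample variance
x_sd    <- sd(x)            # sample standard deviation, the square root of the variance

print(c(range = x_range, IQR = x_iqr, variance = x_var, SD = x_sd))
```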
Reporting center and spread
Correct pairs to present: (mean, SD) OR (median, IQR)
INCORRECT pairs to present: (mean, IQR) OR (median, SD)
Note: the (mean, SD) pair is not robust to outliers; the (median, IQR) pair is robust.
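A small sketch of what robustness means in practice, using made-up numbers: one value is replaced by an outlier and the two pairs are compared. The helper summarise_both() is a hypothetical name introduced only for this example.

```r
clean        <- c(1, 2, 3, 4, 5)
with_outlier <- c(1, 2, 3, 4, 50)  # last value replaced by an outlier

# Hypothetical helper: returns both (center, spread) pairs for a vector
summarise_both <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), iqr = IQR(x))
}

before <- summarise_both(clean)
after  <- summarise_both(with_outlier)
print(rbind(before, after))
```

In this example the median and IQR are identical before and after, while the mean and SD both change noticeably.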
Characteristics of Summaries
Summary | Robust | Center or Spread | Compares 2 or more data sets? | Effect of shift (adding/subtracting n from every data value) | Effect of scaling (multiplying/dividing every data value by n)
IQR | | | | |
Range | | | | |
Median | | | | |
Mean | | | | Shifts | Scales
Variance | | | | |
Standard Deviation | | | | No change | Scales (SD × n)
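The shift and scaling entries for the mean and standard deviation in the table above can be checked with a short sketch. The data A and the constant n = 5 are arbitrary choices for illustration.

```r
A <- 1:20
n <- 5

shifted <- A + n  # add n to every data value
scaled  <- A * n  # multiply every data value by n

# Mean: shifts by n, scales by n
print(c(mean(A), mean(shifted), mean(scaled)))

# SD: unchanged by the shift, multiplied by n under scaling
print(c(sd(A), sd(shifted), sd(scaled)))
```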
Formulas
Function | Formula
Coefficient of variation (CV): combines the mean and standard deviation into one summary. Uses: analytical chemistry, to express the precision and repeatability of an assay; engineering and physics, for quality assurance studies; economics, for determining the volatility of a security. | CV = SD / mean
Lower threshold (for flagging outliers) | Q1 - 1.5 × IQR
Upper threshold (for flagging outliers) | Q3 + 1.5 × IQR
Standard units: how many standard deviations a data point is above or below the mean | (data point - mean) / SD, or gap / SD. Equivalently, data point = mean + SD × standard units
Population standard deviation | SDpop = RMS of the gaps from the mean
Area under the normal curve to the left of x (defaults: mean = 0, sd = 1) | pnorm(x, mean = m, sd = s)
Area between -2.5 and 2.5 SDs | pnorm(2.5) - pnorm(-2.5)
dnorm(
rnorm(
To hide messages and warnings in the knitted output (errors are not suppressed):
```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```
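A rough sketch of the formulas in this section, on a made-up vector x. It assumes the usual outlier convention (lower threshold = Q1 - 1.5 × IQR, upper threshold = Q3 + 1.5 × IQR) and uses the sample SD from sd(); swap in the population SD if that is what is required.

```r
x <- c(12, 15, 11, 14, 13, 30)  # made-up data; 30 looks suspicious

# Coefficient of variation: SD relative to the mean
cv <- sd(x) / mean(x)

# Outlier thresholds from the quartiles
q   <- quantile(x)              # 0%, 25%, 50%, 75%, 100%
q1  <- q[["25%"]]
q3  <- q[["75%"]]
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
outliers <- x[x < lower | x > upper]

# Standard units, and converting back to the original data
z    <- (x - mean(x)) / sd(x)
back <- mean(x) + sd(x) * z

# Area under the standard normal curve between -2.5 and 2.5
area_within <- pnorm(2.5) - pnorm(-2.5)

print(list(cv = cv, lower = lower, upper = upper, outliers = outliers,
           area_within = area_within))
```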
Base R Codes
Function | Code
Histogram with one variable | hist(iris$Sepal.Length)
Boxplot with one variable | boxplot(iris$Petal.Length)
Average/mean | mean(datasetname$variable), e.g. mean(Newtown2017$Sold)
Importing data into RStudio | Import Dataset → (choose data) → Import
Show dataset (displays data size, variable names, variable classifications) | str(datasetname)
Tidyverse | library(tidyverse); head(datasetname, rows), e.g. head(Newtown2017, 2)
Filter to find the mean of a subset of the data | mean(Newtown2017$Sold[Newtown2017$Type == "House" & Newtown2017$Bedrooms == "4"]). Note: this gives the mean price of (large) houses with 4 bedrooms
Median | median(datasetname$variable)
Median of a subset of the data | median(datasetname$variable[datasetname$Type == "value" & datasetname$variable2 == "number"]), e.g. median(Newtown2017$Sold[Newtown2017$Type == "House" & Newtown2017$Bedrooms == "4"])
Mean and median together | c(mean(datasetname$variable), median(datasetname$variable))
Gaps | gaps = datasetname$variable - mean(datasetname$variable)
Maximum gap | max(gaps)
Standard deviation for sample | sd(datasetname$variable)
Standard deviation for population | sample SD × sqrt((n-1)/n): sd(datasetname$variable) * sqrt((n-1)/n), where n = length(datasetname$variable); OR rafalib package + popsd(datasetname$variable)
Make barplot |
Quantile | quantile(datasetname$variable)
IQR | quantile(datasetname$variable)[4] - quantile(datasetname$variable)[2]
Moving data up | e.g. A = c(1:20); B = A + 5. Note: mean(B) = mean(A) + 5 and sd(B) = sd(A)
Boxplot values | summary(datasetname)
Ordering | sort(datasetname$variable)
Population standard deviation | library(rafalib); popsd(datasetname$variable). Note: without the rafalib package, sd(datasetname$variable) outputs the sample SD
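Since Newtown2017 and datasetname are placeholders, here is a runnable sketch of several of the entries above using the built-in iris dataset instead. The subset condition (setosa flowers with petals longer than 1.5) is an arbitrary choice for illustration.

```r
data(iris)

str(iris)      # data size, variable names, variable types
head(iris, 2)  # first two rows

# Mean and median of one variable, reported together
centre <- c(mean(iris$Sepal.Length), median(iris$Sepal.Length))

# Mean of a subset: combine conditions with & inside the square brackets
subset_mean <- mean(iris$Sepal.Length[iris$Species == "setosa" &
                                        iris$Petal.Length > 1.5])

# Gaps from the mean, and the largest gap
gaps    <- iris$Sepal.Length - mean(iris$Sepal.Length)
max_gap <- max(gaps)

# IQR from the quantiles: the 4th entry is Q3, the 2nd entry is Q1
qs          <- quantile(iris$Sepal.Length)
iqr_by_hand <- unname(qs[4] - qs[2])

print(list(centre = centre, subset_mean = subset_mean,
           max_gap = max_gap, iqr = iqr_by_hand))
```

The built-in IQR() function should give the same value as the quantile subtraction.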
GGplot Codes
NOTE: do not use $ inside ggplot(); give the data frame first and refer to variables by name inside aes().
Function | Code
Histogram: 1 variable | ggplot(iris, aes(Petal.Length)) + geom_histogram()
Histogram: 1 variable + coloured | ggplot(iris, aes(Petal.Length)) + geom_histogram(aes(fill = Species))
Boxplot: 1 variable | ggplot(iris, aes(Petal.Length)) + geom_boxplot()
Calculating popsd without the function
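One way this could be done, as a sketch in base R: the population SD is the RMS of the gaps from the mean, which should agree with rescaling the sample SD by sqrt((n-1)/n). The vector x is made-up data chosen so that the answer is a whole number.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up data with mean 5
n <- length(x)

# Route 1: root-mean-square of the gaps from the mean
gaps       <- x - mean(x)
pop_sd_rms <- sqrt(mean(gaps^2))

# Route 2: rescale the sample SD returned by sd()
pop_sd_from_sample <- sd(x) * sqrt((n - 1) / n)

print(c(pop_sd_rms, pop_sd_from_sample))  # both are 2 for this data
```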
Quiz
Question | Answer
What feature of data can be easily communicated by a single numerical summary? | The maximum of quantitative data
The mean is the unique point at which the data is ____. | Balanced
For measuring the spread of data, what is wrong with calculating the mean of the "gaps", where "gap" = data - mean? | It will always be 0
Can standard deviation be negative? | No. It involves the RMS, which cannot be negative.
In R, does the sd() command work out the population SD, or does the rafalib package need to be installed? | sd() gives the sample SD. The rafalib package with popsd() gives the population SD.
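The answers about the gaps can be checked numerically. A minimal sketch on made-up data:

```r
x    <- c(3, 8, 10, 15)  # made-up data with mean 9
gaps <- x - mean(x)      # -6, -1, 1, 6

mean_gap <- mean(gaps)          # positive and negative gaps cancel out
rms_gap  <- sqrt(mean(gaps^2))  # squaring removes the signs, so this cannot be negative

print(c(mean_gap = mean_gap, rms_gap = rms_gap))
```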
# Read the project data (path as given in these notes)
Project1 <- read.csv("Downloads/Project1Data.csv")
View(Project1)
str(Project1)
library(tidyverse)
# Bar chart of breakfast days per week, coloured by employment status
ggplot(Project1, aes(Breakfast, fill = Employment)) +
  geom_bar(stat = "count") +
  labs(x = "Number of days per week breakfast is consumed", y = "Frequency",
       title = "Breakfast Habits vs Employment Status")
Potential contents:
Advantages and disadvantages of numerical and graphical summaries
Skills:
- Find the mean, median, IQR, range, etc. from a boxplot, histogram and dataset, both by hand and
in RStudio (base R and ggplot)