R Complete Note
Presenting Data in Charts and Tables
Bar Chart
# Function to create bar charts:
barplot()
Parameters for barplot() :
height — Vector or matrix containing the numeric values used in the bar chart (passed as x in the examples below)
xlab — Label for x-axis
ylab — Label for y-axis
main — The title of the bar chart
names.arg — Vector names appearing under each bar
col — Give colors to the bars
border — Give colors to the borders of the bars
density — (A single number or a vector) Gives the density of the shading
lines for the bars. The default is no shading
Example:
x = c(5, 7, 4, 8, 12, 16, 15)
barplot(
x,
main = "Number of customers visited the store",
xlab = "Days",
ylab = "Number of customers",
names.arg = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"),
col = rainbow(length(x))
)
Exercise
Create a barplot for this Exercise using R.
Performance Level Frequency
Good 13
Above Average 12
Average 15
Poor 10
Total 50
x = c(13, 12, 15, 10)
barplot(
x,
main = "Performance Level",
xlab = "Levels",
ylab = "Frequency",
names.arg = c("Good", "Above Average", "Average", "Poor"),
col = rainbow(length(x))
)
Example with density :
x = c(5, 7, 4, 8, 12, 16, 15) # Redefine 'x'; the exercise above overwrote it
barplot(
x,
main = "Number of customers visited the store",
xlab = "Days",
ylab = "Number of customers",
names.arg = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"),
col = rainbow(length(x)),
density = seq(10, 70, 10)
)
Tabular Method
Example:
Rating          Frequency   Relative Frequency   Percent Frequency
Poor            2           2/20 = 0.10          10%
Below Average   3           3/20 = 0.15          15%
Average         5           5/20 = 0.25          25%
Above Average   9           9/20 = 0.45          45%
Excellent       1           1/20 = 0.05          5%
Total           20          1.00                 100%
set = c(1, 3, 8, 4, 2, 3, 6, 5, 5, 8, 4, 2, 4, 1, 5)
# Create a frequency table
table(set)
ta <- table(set)
# Relative frequency
prop.table(ta)
# Cumulative Frequency
cumsum(ta)
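The rating table in the example above can be rebuilt from its frequency column alone. The following is a minimal sketch; the names freq, rel_freq, pct_freq and rating_table are illustrative and do not come from the original notes.

```r
# Frequencies from the rating example (Poor ... Excellent)
freq <- c(Poor = 2, "Below Average" = 3, Average = 5,
          "Above Average" = 9, Excellent = 1)

rel_freq <- freq / sum(freq)   # relative frequency = frequency / total
pct_freq <- 100 * rel_freq     # percent frequency  = relative frequency * 100

# Collect the three columns in one data frame (row names come from 'freq')
rating_table <- data.frame(
  Frequency = freq,
  Relative  = rel_freq,
  Percent   = pct_freq
)
rating_table
```

The relative frequencies always sum to 1, which is a quick way to check the table.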
Pie Chart
# Function to create pie charts:
pie()
Parameters for pie() :
labels — Descriptions to the slices
radius — Radius of the circle (value between −1 and +1)
main — The title of the pie chart
col — The color of the slice
clockwise — If set to TRUE , slices are drawn clockwise
💡 Slices are drawn counter-clockwise by default
Example:
x <- c(35, 28, 47, 63, 50)
pie(x)
Pie Chart with slice percentages as labels:
x <- c(35, 28, 47, 63, 50)
piepercent <- round(100 * x / sum(x), 1)
lbls <- paste(piepercent, "%", sep = "")
pie(
x,
labels = lbls,
main = "Pie chart with slice percentage",
col = rainbow(length(x)),
radius = 1
)
Pie chart with slice percentage along with characters as labels:
x <- c(35, 28, 47, 63, 50)
districts <- c("Colombo", "Kandy", "Jaffna", "Anuradhapura", "Batticaloa")
piepercent <- round(100 * x / sum(x), 1)
# Put a space between the district name and its percentage
lbls <- paste0(districts, " ", piepercent, "%")
pie(
x,
labels = lbls,
main = "Pie chart with slice percentage",
col = heat.colors(length(x)),
radius = 1
)
Measures of Central Tendency
Mean
datasets::CO2
d <- CO2 # Assign the CO2 dataset to 'd'
u <- d$uptake # Assign the 'uptake' column of 'd' to 'u'
mean(u) # Find the mean of 'u'
Find mean values for each category in a column (similar to group by in SQL):
## Find mean values for each category in a column (similar to group by in SQL)
# Find the mean value of the 'uptake' variable for each category in the 'Plant' column
tapply(d$uptake, d$Plant, mean)
# Find the maximum value of the 'uptake' variable for each category in the 'Treatment' column
tapply(d$uptake, d$Treatment, max)
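For comparison, the same kind of grouping can also be written with aggregate(), which returns a data frame instead of a named vector. This is only a sketch of one alternative; the names by_tapply and by_aggregate are illustrative.

```r
d <- datasets::CO2

# Group means with tapply(): one named entry per treatment
by_tapply <- tapply(d$uptake, d$Treatment, mean)

# The same grouping with aggregate(): one row per treatment
by_aggregate <- aggregate(uptake ~ Treatment, data = d, FUN = mean)

by_tapply
by_aggregate
```

The data frame form can be more convenient when the result needs to be merged with other tables.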
Find the mean value for a specific category in a column:
## Find the mean value of a specific category type in a column
# Find the mean value of the 'uptake' variable for the category 'chilled' in the 'Treatment' column
mean(d$uptake[d$Treatment=="chilled"])
Median
median(u) # Find the median of 'u'
Mode
## Find the mode -- Method 1
# Define function to find the mode
getmode <- function(v) {
uniqv <- unique(v) # Get unique values of 'v'
uniqv[which.max(tabulate(match(v, uniqv)))] # Return the value that occurs most frequently
}
# Call the function to get the mode of 'u'
getmode(u)
💬 match(v, uniqv) returns a vector of the same length as v, where each element is the index in uniqv of the corresponding element of v.
tabulate() counts the number of times each integer occurs in the
input vector, up to the maximum value in the input vector.
which.max() returns the index of the maximum value in the input vector.
So,
uniqv[which.max(tabulate(match(v, uniqv)))] returns the value in uniqv that
corresponds to the maximum count in v , which is the mode of v .
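The steps described above can be traced on a small vector. The vector v below is made up for illustration; the intermediate names idx and counts are not part of the original function.

```r
v <- c(4, 7, 4, 9, 7, 4)

uniqv  <- unique(v)          # 4 7 9
idx    <- match(v, uniqv)    # 1 2 1 3 2 1 (position of each value in 'uniqv')
counts <- tabulate(idx)      # 3 2 1       (how often each position occurs)
uniqv[which.max(counts)]     # 4           (value with the highest count)
```

Note that which.max() returns only the first maximum, so when two values are tied this method reports just one of them.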
## Find the mode -- Method 2
# Create a frequency table of 'u' and assign it to 'y'
y <- table(u)
# Find the mode(s) of 'u'
names(y)[which(y==max(y))]
💬 The table() function counts the number of times each unique value occurs in u.
💬 max(y) finds the maximum frequency in y.
which(y==max(y)) returns the indices of y where the frequency is equal to the maximum frequency.
names(y)[which(y==max(y))] returns the names (or labels) of y at these indices. In other words, it finds the values of u that occur most frequently, which is the mode(s) of u.
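Two details of Method 2 are easy to miss: it reports every tied mode, and it returns them as character strings because they come from names(). A short sketch with a made-up vector w (not from the original notes):

```r
w <- c(2, 5, 5, 8, 2, 9)   # 2 and 5 both occur twice

y <- table(w)
modes_chr <- names(y)[which(y == max(y))]   # character vector: "2" "5"
modes_num <- as.numeric(modes_chr)          # converted back to numbers: 2 5
modes_num
```

Converting with as.numeric() is only needed when the modes are used in further calculations.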
Measures of Dispersion
Range and Interquartile Range
# Find the range
Range = max(u) - min(u)
Range
# Find the Interquartile Range
IQR(u)
Quartiles
# Find Quartiles
quantile(u, 0.25) #First Quartile
quantile(u, 0.5) #Second Quartile
quantile(u, 0.75) #Third Quartile
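quantile() also accepts a vector of probabilities, so the three quartiles can be requested in one call. The sketch below also checks that IQR() is the third quartile minus the first; the name q is illustrative.

```r
u <- datasets::CO2$uptake

# All three quartiles in one call
q <- quantile(u, probs = c(0.25, 0.5, 0.75))
q

# The interquartile range is Q3 - Q1
unname(q[3] - q[1])
IQR(u)
```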
Five number summary
# Find the five-number summary
summary(u)
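Strictly speaking, summary() prints six values for a numeric vector, because it adds the mean to the five-number summary. Base R also has fivenum(), which returns exactly five values; it uses Tukey's hinges, so its middle values can differ slightly from the quartiles shown by summary(). A brief comparison:

```r
u <- datasets::CO2$uptake

summary(u)   # Min, 1st Qu., Median, Mean, 3rd Qu., Max
fivenum(u)   # minimum, lower hinge, median, upper hinge, maximum
```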
Find the five-number summary for a specific category in a column:
## Find the five-number summary of a specific category type in a column
# Find the five-number summary of the 'uptake' variable for the category 'chilled' in the 'Treatment' column
summary(d$uptake[d$Treatment=="chilled"])
Exercise
Find the summary statistics for uptake where plant type is Qn1 and uptake value
is more than 20
# Find the summary statistics for uptake where plant type is "Qn1" and uptake value is more than 20
summary(d$uptake[d$Plant=="Qn1" & d$uptake > 20])
💡 The summary() function automatically ignores missing values, while most other functions do not.
To ignore missing values in other functions, you have to request it explicitly, for example with na.rm = TRUE.
Example:
num = c(10, 20, 33, 44, NA, 88, 55)
mean(num) # Returns NA because 'NA' is not ignored
mean(num, na.rm = TRUE) # Ignores 'NA'
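The same na.rm argument is accepted by several other base functions. The lines below are a small sketch using the num vector from the example; na.omit() is shown as one alternative way to drop the missing values first.

```r
num <- c(10, 20, 33, 44, NA, 88, 55)

sum(is.na(num))             # number of missing values
median(num, na.rm = TRUE)   # median of the non-missing values
sd(num, na.rm = TRUE)       # standard deviation of the non-missing values
mean(na.omit(num))          # same result as mean(num, na.rm = TRUE)
```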
Deciles
# Find Deciles
quantile(u, 0.4) #Fourth Decile
quantile(u, 0.7) #Seventh Decile
Percentiles
# Find Percentiles
quantile(u, 0.98) # 98th Percentile
quantile(u, 0.37) # 37th Percentile
Variance
# Find the Sample Variance
var(u)
# Find the Standard Deviation
sd(u)
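To see what var() and sd() compute, the sample variance can be written out by hand: the sum of squared deviations from the mean, divided by n - 1. A minimal sketch, with illustrative names manual_var and manual_sd:

```r
u <- datasets::CO2$uptake
n <- length(u)

manual_var <- sum((u - mean(u))^2) / (n - 1)   # sample variance
manual_sd  <- sqrt(manual_var)                 # sample standard deviation

all.equal(manual_var, var(u))
all.equal(manual_sd, sd(u))
```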
Box-Plot (Box & Whisker Plot)
# Function to create box plots:
boxplot()
Parameters for boxplot() :
[y-axis]~[x-axis] — The axes of the graph
data — The dataset
main — The title of the box plot
xlab — Label for x-axis
ylab — Label for y-axis
col — Give colors to the boxes
border — Give colors to the borders of the boxes
notch — Add a notch to the box at the Median
varwidth — If set to FALSE , all boxes will have the same width regardless of
the size of the group
horizontal — If set to TRUE , the boxes will be horizontal
Examples:
datasets::ToothGrowth
TG<-ToothGrowth
boxplot(
TG$len,
main="Box plot of tooth length",
ylab="Tooth length",
col="hotpink",
border="lightpink",
notch = FALSE,
varwidth = FALSE,
horizontal = TRUE
)
datasets::ToothGrowth
TG<-ToothGrowth
boxplot(
len~supp,
data = TG,
main = "Tooth growth with supplement types",
xlab = "Supplement type",
ylab = "Tooth length",
col = c("hotpink", "lightpink")
)
Tally Table
# Tally Table
datasets::iris
i <- iris
table(i$Species)
Output:
    setosa versicolor  virginica
        50         50         50
Contingency Table
# Contingency Table
table(d$Plant, d$Type)
table(d$Plant, d$Treatment)
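Output: each call prints a two-way table of counts, with one row per plant and one column per category of the second variable.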
Binomial Distribution
dbinom
For binomial distributions, dbinom is used in R.
# dbinom Help
help(dbinom)
Example:
Find P(x = 1) when n = 5 and θ = 0.1.
x = 1
n = 5
θ = 0.1
$P(x = 1) = \binom{5}{1}(0.1)^{1}(0.9)^{4} = 5 \times 0.1 \times 0.6561 = 0.32805$
dbinom(x = 1, size = 5, prob = 0.1)
Find P(x ≤ 3) when n = 5 and θ = 0.1.
sum(dbinom(x = 0:3, size = 5, prob = 0.1))
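The hand calculation for P(x = 1) can be reproduced with choose(), which gives the binomial coefficient. This is a small check rather than a replacement for dbinom; the names theta and manual are illustrative.

```r
n <- 5
theta <- 0.1

# P(x = 1) from the formula nCx * theta^x * (1 - theta)^(n - x)
manual <- choose(n, 1) * theta^1 * (1 - theta)^(n - 1)
manual

# The built-in density function gives the same value
dbinom(x = 1, size = n, prob = theta)
```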
Exercise
A customer receiving service from a customer care center can be classified as
good service or bad service. The probability of getting good service is 0.4.
1. What is the probability of the customer getting at least 2 good services out of 10 tries?
n = 10
x ≥ 2
θ = 0.4
sum(dbinom(x = 2:10, size = 10, prob = 0.4))
1 - sum(dbinom(x = 0:1, size = 10, prob = 0.4))
2. What is the probability of the customer getting bad service between 3 and 7 times (exclusive) out of 10 tries?
n = 10
3 < x < 7
θ = 0.6
sum(dbinom(x = 4:6, size = 10, prob = 0.6))
pbinom
pbinom is a cumulative function
# pbinom Help
help(pbinom)
Examples:
Find P(x ≤ 3) when n = 5 and θ = 0.1.
pbinom(3, size = 5, prob = 0.1)
The same can be done with dbinom as:
sum(dbinom(x = 0:3, size = 5, prob = 0.1))
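Because pbinom returns P(x ≤ q), upper-tail probabilities can be obtained either by subtracting from 1 or with the lower.tail = FALSE argument. A brief sketch; the variable names are illustrative, and the last line revisits the "at least 2 good services" exercise.

```r
# P(x <= 3): cumulative probability
p_le3 <- pbinom(3, size = 5, prob = 0.1)

# P(x > 3): upper tail, two equivalent ways
p_gt3_a <- 1 - p_le3
p_gt3_b <- pbinom(3, size = 5, prob = 0.1, lower.tail = FALSE)

# P(x >= 2) when n = 10 and theta = 0.4 is the upper tail beyond 1
p_ge2 <- pbinom(1, size = 10, prob = 0.4, lower.tail = FALSE)

c(p_le3, p_gt3_a, p_gt3_b, p_ge2)
```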
Poisson Distribution
dpois
dpois is used for Poisson distributions in R
# dpois Help
help(dpois)
Examples:
Find P(x = 0) when λ = 0.03.
dpois(x = 0, lambda = 0.03)
Find P(x ≥ 1) when λ = 0.03.
1 - dpois(x = 0, lambda = 0.03)
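As a check on what dpois returns, the Poisson probability P(x = k) = e^(-λ) λ^k / k! can be evaluated directly. The names lambda and manual_p0 below are illustrative.

```r
lambda <- 0.03

# P(x = 0) from the Poisson formula exp(-lambda) * lambda^k / k!
manual_p0 <- exp(-lambda) * lambda^0 / factorial(0)
manual_p0

# The built-in function gives the same value
dpois(x = 0, lambda = lambda)
```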
ppois
ppois is a cumulative function
# ppois Help
help(ppois)
Example:
Find the value of P(x = 0) + P(x = 1) + P(x = 2) when λ = 2.
ppois(2, lambda = 2)
The same can be done with dpois as:
# Method 1
p1 <- dpois(x = 0, lambda = 2)
p2 <- dpois(x = 1, lambda = 2)
p3 <- dpois(x = 2, lambda = 2)
p <- p1 + p2 + p3
p
# Method 2
sum(dpois(x=0:2, lambda = 2))
Exercise
Suppose it has been observed that, on average, 180 cars per hour pass a specified point on a particular road in the morning rush hour. Due to impending road works, it is estimated that congestion will occur closer to the city center if more than 5 cars pass the point in any one minute. What is the probability of congestion occurring?
$\lambda = \frac{180}{60} = 3$ cars per minute
x > 5
1 - ppois(5, lambda = 3)
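The congestion probability can be cross-checked in three equivalent ways. This is a sketch with illustrative names (p_a, p_b, p_c); all three should agree at roughly 0.084.

```r
lambda <- 180 / 60   # average number of cars per minute

p_a <- 1 - ppois(5, lambda = lambda)                  # complement of P(x <= 5)
p_b <- ppois(5, lambda = lambda, lower.tail = FALSE)  # upper tail directly
p_c <- 1 - sum(dpois(0:5, lambda = lambda))           # summing individual terms

c(p_a, p_b, p_c)
```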
Exercise
A manufacturer of balloons produces 40% that are oval and 60% that are
round. Packets of 20 balloons may be assumed to contain random samples of
balloons. Determine the probability that such a packet contains:
1. an equal number of oval balloons and round balloons
Let x be the number of oval balloons in a packet.
P(oval) = 0.4
P(round) = 0.6
$P(x = 10) = \binom{20}{10}(0.4)^{10}(0.6)^{10}$
dbinom(x = 10, size = 20, prob = 0.4)
2. fewer oval balloons than round balloons
$P(x \le 9)$
pbinom(9, size = 20, prob = 0.4)
A customer selects packets of 20 balloons at random from a large consignment
until she finds a packet with exactly 12 round balloons.
3. Give a reason why a binomial distribution is not an appropriate model for
the number of packets selected.
The number of trials is not fixed, even though the trials are independent events.
Continuous Uniform Distribution
dunif
dunif(x, min, max) gives the value of the PDF at the point x.
# dunif Help
?dunif
Example:
Find the PDF of a uniform distribution between 0 and 5 at the point x = 2.
dunif(2, min = 0, max = 5)
Cumulative Distribution Function (CDF)
💡 Cumulative distribution function (CDF) for a uniform distribution gives
the probability that the random variable X is less than or equal to a
certain value x.
punif
punif is a cumulative function in R.
punif(q, min, max) gives the CDF at q, that is, P(X ≤ q)
Example:
Find the probability that a random variable from a uniform distribution between
0 and 5 is less than or equal to 3.
punif(3 , min = 0, max = 5)
qunif
qunif is a quantile function in R.
qunif(p, min, max) is used to find the quantile defined by p
Example:
What is the 90th percentile of a uniform distribution between 0 and 5?
qunif(0.90, min = 0, max = 5)
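For a uniform distribution on [a, b], all three functions reduce to simple formulas: the density is 1 / (b - a), the CDF is (q - a) / (b - a), and the quantile is a + p(b - a). The sketch below checks the three examples above by hand; the names a, b, pdf_2, cdf_3 and q_90 are illustrative.

```r
a <- 0
b <- 5

pdf_2 <- 1 / (b - a)          # density at any point inside [a, b]
cdf_3 <- (3 - a) / (b - a)    # P(X <= 3)
q_90  <- a + 0.90 * (b - a)   # 90th percentile

c(pdf_2, dunif(2, min = a, max = b))
c(cdf_3, punif(3, min = a, max = b))
c(q_90,  qunif(0.90, min = a, max = b))
```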
The Normal Distribution / Gaussian Distribution
pnorm
pnorm is used for Normal Distribution calculations in R.
# pnorm Help
?pnorm
Example:
Find P(x < 18) when the mean is 15 and the standard deviation is 2.
$P(x < 18) = P\left(\frac{x - \mu}{\sigma} < \frac{18 - 15}{2}\right) = P(z < 1.5) = 0.9332$
pnorm(q = 18, mean = 15, sd = 2)
pnorm(q = 18, mean = 15, sd = 2, lower.tail = TRUE)
💬 The lower.tail argument specifies whether the cumulative probability is calculated for the lower tail (left-hand side, P(X ≤ q)) or the upper tail (right-hand side, P(X > q)) of the normal distribution.
Example:
Find P(x > 18) when the mean is 15 and the standard deviation is 2.
$P(x > 18) = 1 - P(x < 18) = 1 - 0.9331928 = 0.0668072$
pnorm(q = 18, mean = 15, sd = 2, lower.tail = FALSE)
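The manual route through the z-score and the direct call to pnorm give the same answer, since pnorm with its default mean = 0 and sd = 1 is the standard normal CDF. A short sketch with illustrative names:

```r
mu <- 15
sigma <- 2

z <- (18 - mu) / sigma                       # standardized value: 1.5
p_from_z   <- pnorm(z)                       # P(Z < 1.5) on the standard normal
p_direct   <- pnorm(18, mean = mu, sd = sigma)
p_upper    <- pnorm(z, lower.tail = FALSE)   # P(Z > 1.5)

c(p_from_z, p_direct, p_upper)
```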
Example:
Find P(970000 < x < 1060000) when the mean is 1000000 and the standard deviation is 30000.
# P(x < 1060000)
P1 <- pnorm(q = 1060000, mean = 1000000, sd = 30000, lower.tail = TRUE)
# P(x < 970000)
P2 <- pnorm(q = 970000, mean = 1000000, sd = 30000, lower.tail = TRUE)
# P(970000 < x < 1060000) = P(x < 1060000) - P(x < 970000)
P <- P1 - P2
P
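An interval probability is always the CDF at the upper bound minus the CDF at the lower bound, so the result must lie between 0 and 1. The sketch below repeats the calculation through z-scores as a sanity check; the names z_low, z_high and p_interval are illustrative.

```r
mu <- 1000000
sigma <- 30000

z_low  <- (970000 - mu) / sigma     # -1
z_high <- (1060000 - mu) / sigma    #  2

# P(-1 < Z < 2) on the standard normal
p_interval <- pnorm(z_high) - pnorm(z_low)
p_interval

# Same probability without standardizing first
diff(pnorm(c(970000, 1060000), mean = mu, sd = sigma))
```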
Hypothesis Testing - Examples
H0 : μ ≤ 80
H1 : μ > 80
Sample Mean = 83
Standard Deviation = 8
Sample Size = 25
# Test Statistic Value
Z1 = (83 - 80) / (8 / sqrt(25))
Z1
# Table_value for 95% upper tail test
Table_value <- qnorm(0.95)
Table_value
if (Table_value < Z1) {
print("Reject the H0")
}
H0 : μ = 170 (this specifies a single value for the parameter of interest)
H1 : μ > 170 (this is what we want to determine)
sd = 65
mu_0 = 170
n = 400
x_bar = 178
# Test Statistic Value
z1 <- (x_bar - mu_0) / (sd / sqrt(n))
z1
# Table_value for 95% upper tail test
Table_value <- qnorm(0.95)
Table_value
if (Table_value < z1) {
print("Reject the H0")
}
The owner of a shop wants to examine the shop's annual income rate. He suspects that, compared with previous years, the rate has declined to less than 5%, and he wants to test this at the 5% significance level. The standard deviation of the annual income rate over the last 16 years is 0.1%. The population mean is 5%, and the sample mean is 4.962%.
H0 : μ = 5 (this specifies a single value for the parameter of interest)
H1 : μ < 5 (this is what we want to determine)
sd = 0.1
mu_0 = 5
n = 16
x_bar = 4.962
# Test Statistic Value
z1 <- (x_bar - mu_0) / (sd / sqrt(n))
z1
# Table_value for 5% lower tail test
Table_value <- round(qnorm(0.05), 2)
Table_value
if (Table_value > z1) {
print("Reject the H0")
} else {
print("Failed to reject H0")
}
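The three tests above compare the test statistic with a table value. An equivalent decision rule compares a p-value with the significance level. The sketch below wraps that in a helper; z_test is a hypothetical function written for this note (not part of base R), and its argument names are assumptions.

```r
# Hypothetical helper: one-sample, one-tailed z test from summary statistics
z_test <- function(x_bar, mu_0, sigma, n, alternative = c("greater", "less")) {
  alternative <- match.arg(alternative)
  z <- (x_bar - mu_0) / (sigma / sqrt(n))   # test statistic
  p_value <- if (alternative == "greater") {
    pnorm(z, lower.tail = FALSE)            # upper-tail test
  } else {
    pnorm(z)                                # lower-tail test
  }
  list(z = z, p_value = p_value)
}

alpha <- 0.05
res1 <- z_test(83, 80, 8, 25, "greater")      # first example
res2 <- z_test(178, 170, 65, 400, "greater")  # second example
res3 <- z_test(4.962, 5, 0.1, 16, "less")     # shop income example

# TRUE means "Reject the H0" at the 5% level
c(res1$p_value < alpha, res2$p_value < alpha, res3$p_value < alpha)
```

Both rules lead to the same decision: H0 is rejected when the p-value is smaller than the significance level, which happens exactly when the test statistic lies beyond the table value.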