EDA With R Lab Manual
Aim:-
R download:
https://cran.r-project.org/bin/windows/base/
RStudio download:
https://www.rstudio.com/products/rstudio/download/
http://www.r-bloggers.com/download-and-install-r-in-ubuntu/
INSTALLATION STEPS:
STEP:1
Page | 1
STEP:2
STEP:3
Click on next button and choose install location, and click on next.
STEP:4
STEP:5
RESULT:-
Exp No Experiment Name Date
1.b The Basics of R Syntax
Aim:-
Output:
10 5 4
#Arithmetic Operations
add=2 + 2 #addition
sub=2-2 #subtraction
mul=2*2 #multiplication
div=2 / 11 #returns arithmetic Quotient
rem=10 %% 2 #returns remainder
div2=2%/%11 #returns integer Quotient
cat("Addition=",add,"Subtraction=",sub,"Multiplication=",mul,"\nDivision=",div,"Remainder=",rem,"Integer Quotient=",div2)
Output:
Addition= 4 Subtraction= 0 Multiplication= 4
Division= 0.1818182 Remainder= 0 Integer Quotient= 0
#Relational Operators
a <- c(1, 3, 5)
b <- c(2, 4, 6)
print(a>b)
print(a<b)
print(a<=b)
print(a>=b)
print(a==b)
print(a>=b)
Output:
[1] FALSE FALSE FALSE
[1] TRUE TRUE TRUE
[1] TRUE TRUE TRUE
[1] FALSE FALSE FALSE
[1] FALSE FALSE FALSE
[1] FALSE FALSE FALSE
#Logical Operators
a <- c(3, 0, TRUE, 2+2i)
b <- c(2, 4, TRUE, 2+3i)
print(a&b)
print(a|b)
print(!a)
print(a&&b) # && uses only the first element of each vector (errors for length > 1 operands in R >= 4.3)
print(a||b) # || likewise uses only the first elements
Output:
[1] TRUE FALSE TRUE TRUE
[1] TRUE TRUE TRUE TRUE
[1] FALSE TRUE FALSE FALSE
[1] TRUE
[1] TRUE
Result:
Exp No Experiment Name Date
1.c Matrices and Lists
Aim:-
Matrices in R:
#Matrix with one column
data=c(1, 2, 3, 4, 5, 6)
amatrix = matrix(data)
print(amatrix)
Output:
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5
[6,] 6
Output:
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
#Matrix multiplication
a=matrix(c(1,2,3,4),ncol=2)
print(a)
b=matrix(c(2,2,2,2),ncol=2)
print(b)
print(a%*%b)
Output:
[,1] [,2]
[1,] 1 3
[2,] 2 4
[,1] [,2]
[1,] 2 2
[2,] 2 2
[,1] [,2]
[1,] 8 8
[2,] 12 12
Lists in R:
#list with same datatype
list1<-list(1,2,3)
list2<-list("CSE","ECE","EEE")
print(list1)
print(list2)
Output:
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[1]]
[1] "CSE"
[[2]]
[1] "ECE"
[[3]]
[1] "EEE"
Output:
[[1]]
[1] "CSE"
[[2]]
[1] "ECE"
[[3]]
[1] 1 2 3 4 5
[[4]]
[1] TRUE
[[5]]
[1] FALSE
[[6]]
[1] 22.5
[[7]]
[1] 12
Output:
$Students
[1] "UDay" "Kumar" "Venkatesh"
$Marks
[,1] [,2] [,3]
[1,] 40 60 90
[2,] 80 70 80
$Course
$Course[[1]]
[1] "CSE"
$Course[[2]]
[1] "ECE"
$Course[[3]]
[1] "EEE"
Result:
Exp No Experiment Name Date
1.d Subsetting
Aim:-
Output:
Original vector: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
First 5 values of vector: 1 2 3 4 5
Without values present at index 1, 2 and 3: 4 5 6 7 8 9 10 11 12 13 14 15
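The vector-subsetting script that produced this output was lost in extraction; a minimal sketch that reproduces it, assuming the vector is simply 1:15:

```r
# Vector subsetting: positive indices select elements, negative indices drop them
x <- 1:15
cat("Original vector:", x, "\n")
cat("First 5 values of vector:", x[1:5], "\n")
cat("Without values present at index 1, 2 and 3:", x[-c(1, 2, 3)], "\n")
```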
Output:
Original List:
$a
[1] 1
$b
[1] 2
$c
[1] 10
$d
[1] 20
First element of list: 1
Output:
Original list:
$a
[1] 1
$b
[1] 2
$c
[1] "Hello"
$d
[1] "GFG"
Using $ operator:
[1] "GFG"
Output:
gear hp
Ford Pantera L 5 264
Maserati Bora 5 335
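The data-frame subsetting call for this output was lost; one call consistent with it (the gear and hp columns of the built-in mtcars dataset, filtered to high-powered five-gear cars — the exact condition is an assumption):

```r
# Keep rows with 5 gears and more than 200 hp, and only the gear and hp columns
subset(mtcars, gear == 5 & hp > 200, select = c(gear, hp))
```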
Result:
Exp No Experiment Name Date
1.e Help System
Aim:-
help("gsub")
?gsub
help.search("chisquare")
??chisquare
Result:
Exp No Experiment Name Date
2.a Viewing and Manipulating Data
Aim:-
View(data)
Output:
View(iris)
Output:
#Manipulating data using dplyr package
install.packages("dplyr")
library("dplyr")
summary(iris)
Output:
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
selected <- select(iris, Sepal.Length, Sepal.Width) # select() call reconstructed from the output below
head(selected)
Output:
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
filtered <- filter(iris, Species == "setosa") # filter() call reconstructed (condition assumed)
head(filtered,3)
Output:
col1 <- select(iris, Sepal.Length) # reconstructed (column assumed)
tail(col1,4)
Output:
arranged <- arrange(iris, Sepal.Length) # arrange() call reconstructed (ordering assumed)
head(arranged)
Output:
arranged <- arrange(iris, desc(Sepal.Length)) # reconstructed (descending order assumed)
head(arranged)
Output:
Result:
Exp No Experiment Name Date
2.b Plotting Data
Aim:-
plot(airquality$Ozone, xlab = 'Ozone Concentration', ylab = 'No of Instances', main = 'Ozone levels in NY city', col = 'green')
Output:
Output:
#Histogram Plot using hist() function
hist(airquality$Temp, xlab = "Temperature", main = "Histogram of Temperature") # call reconstructed (variable assumed)
Output:
boxplot(airquality,col="Red")
Output:
Result:
Exp No Experiment Name Date
2.c Reading External Data
Aim:-
print(data)
Output:
flavor number
1 pistachio 6
3 vanilla 5
4 chocolate 10
5 strawberry 2
6 neopolitan 4
head(my_data,3)
Output:
1 IND1 10 A
2 IND2 7 A
3 IND3 20 A
Result:
Exp No Experiment Name Date
3.a Tables, Charts and Plots
Aim:-
Tables in R:
data = data.frame(Name = c("abc", "cde", "def"), Gender = c(1, 0, 1)) # contents reconstructed from the table output below
table(data)
Output:
Gender
abc 0 1
cde 1 0
def 0 1
Charts in R:
Types of R – Charts
# defining vector (values assumed; the original line was lost)
x <- c(10, 20, 30, 25, 15)
png(file = "barplot.png")
# plotting vector (barplot() call reconstructed)
barplot(x)
dev.off()
Output:
png(file = "piechart.png")
pie(x, labels = as.character(x)) # pie() call reconstructed (labels assumed)
dev.off()
Output:
Histogram Plot:
# defining vector
x <- c(21, 23, 56, 90, 20, 7, 94, 12, 57, 76, 69, 45, 34, 32, 49, 55, 57)
png(file = "hist.png")
hist(x) # hist() call reconstructed
dev.off()
Output:
Scatter Plot:
#defining vector (values assumed; the original line was lost)
x <- c(21, 23, 56, 90, 20, 7, 94, 12, 57, 76)
png(file = "plot.png")
# plotting (plot() call reconstructed around the surviving argument)
plot(x, col.axis = "darkgreen")
dev.off()
Output:
Box Plot:
x <- c(42, 21, 22, 24, 25, 30, 29, 22,23, 23, 24, 28, 32, 45, 39, 40)
png(file = "boxplot.png")
# plotting (boxplot() call reconstructed)
boxplot(x)
dev.off()
Output:
Result:
Exp No Experiment Name Date
3.b Univariate Data, Measures of Central Tendency, Frequency Distribution
Aim:-
mean<-mean(iris$Petal.Length)
print(mean)
Output:
[1] 3.758
median<-median(iris$Petal.Length)
print(median)
Output:
[1] 4.35
install.packages("modeest")
library(modeest)
mode<-mfv(iris$Petal.Length)
print(mode)
Output:
Frequency Distribution:
freq<-table(mtcars$carb)
print(freq)
Output:
1 2 3 4 6 8
7 10 3 10 1 1
data <- cut(airquality$Temp, breaks = 9) # reconstructed: bin Temp into 9 intervals, matching the table below
print(data)
freq<-table(data)
print(freq)
hist(airquality$Temp, xlab="Temperature", ylab="count", main="frequency distribution of Temperature", col = "white")
Output:
data
(56,60.6] (60.6,65.1] (65.1,69.7] (69.7,74.2] (74.2,78.8] (78.8,83.3] (83.3,87.9]
8 10 14 16 26 35 22
(87.9,92.4] (92.4,97]
15 7
Result:
Exp No Experiment Name Date
3.c Multivariate Data: Relationship between categorical and continuous variables
Aim:-
mean(iris$Petal.Length[iris$Species=="setosa"])
mean(iris$Petal.Length[iris$Species=="versicolor"])
mean(iris$Petal.Length[iris$Species=="virginica"])
Output:
[1] 1.462
[1] 4.26
[1] 5.552
by(iris$Petal.Length, iris$Species, mean) # by() call reconstructed from the output
Output:
iris$Species: setosa
[1] 1.462
----------------------------------------------------------------------------------
iris$Species: versicolor
[1] 4.26
----------------------------------------------------------------------------------
iris$Species: virginica
[1] 5.552
#finding Standard Deviation of Petal.Length for all Species using by() function
by(iris$Petal.Length, iris$Species, sd)
Output:
iris$Species: setosa
[1] 0.173664
----------------------------------------------------------------------------------
iris$Species: versicolor
[1] 0.469911
----------------------------------------------------------------------------------
iris$Species: virginica
[1] 0.5518947
Output:
iris$Species: setosa
----------------------------------------------------------------------------------
iris$Species: versicolor
----------------------------------------------------------------------------------
iris$Species: virginica
#Plotting Relationship between Species and Petal.Length variables using Box and whisker plot
boxplot(Petal.Length ~ Species, data = iris) # call reconstructed (assumed)
Output:
Result:
Exp No Exp Name Date
3.d Relationship between two continuous variables
Aim:-
The covariance between two vectors x and y is

cov(x, y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

where x, y are the vector values and x̄, ȳ are the means of the x and y columns.
cov(women$height,women$weight)
Output:
[1] 69
cov(women$height*2.54,women$weight)
Output:
[1] 175.26
#plotting scatter plot for height and weight variables in women dataset to show relationship
plot(weight ~ height, data = women) # call reconstructed (assumed)
Output:
Finding strength of Relationship between two numerical variables using Correlation:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
cor(women$height,women$weight)
Output:
[1] 0.9954948
cor(women$height*2.54,women$weight)
Output:
[1] 0.9954948
#plotting scatter plot for height and weight variables in women dataset to show relationship
plot(weight ~ height, data = women) # plot call reconstructed (assumed)
abline(lm(weight~height,data=women),col="green")
Output:
Finding Correlation between multiple numerical variables in a Dataset:
data<- iris[,-5]
cor(data)
Output:
Result
Exp No Exp Name Date
3.e Visualization Methods
AIM:-
library(ggplot2)
qplot(Species,Petal.Length,data=iris,geom="boxplot",fill=Species)
Output:
install.packages("vcd")
library(vcd)
#the formula Freq~Gender+Admit is read as frequency broken down by Gender and by whether Admission is rejected or admitted
Output:
library(ggplot2)
Output:
#Drawing a smooth line connecting scatter points and constructing a linear model
Output:
Result:
Exp No Exp Name Date
4.a Sampling from Distributions: Binomial and Normal Distribution
Aim:
Binomial Distribution:
The binomial distribution is a discrete distribution with only two outcomes per trial: success or failure. All trials are independent, the probability of success remains the same, and a previous outcome does not affect the next one. The binomial distribution helps us to find individual probabilities as well as cumulative probabilities over a certain range.
dbinom():
dbinom(k, n, p)
where n is total number of trials, p is probability of success, k is the value at which the
probability has to be found out.
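As a concrete illustration (not from the original manual), the probability of exactly two successes in five fair trials can be computed directly:

```r
# P(X = 2) for X ~ Binomial(n = 5, p = 0.5): choose(5, 2) / 2^5 = 0.3125
dbinom(2, size = 5, prob = 0.5)
```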
Script:
data.frame(x, probs)
Output:
[1] 0.2138454
prob
1 1.615056e-01
2 3.230112e-01
3 2.907100e-01
4 1.550454e-01
pbinom():
pbinom(k, n, p)
where n is total number of trials, p is probability of success, k is the value at which the
probability has to be found out.
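Because pbinom() is cumulative, a quick self-check (not from the manual) is the probability of at most two successes in five fair trials:

```r
# P(X <= 2) for X ~ Binomial(5, 0.5): (1 + 5 + 10) / 32 = 0.5
pbinom(2, size = 5, prob = 0.5)
```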
Script:
Output:
[1] 0.8419226
qbinom()
qbinom(P, n, p)
Where P is the probability, n is the total number of trials and p is the probability of success.
Script:
x <- seq(0, 1, by = 0.1)
Output:
[1] 3
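As a small added illustration, qbinom() inverts the cumulative probability; the median of a Binomial(5, 0.5) variable:

```r
# Smallest k with P(X <= k) >= 0.5 for X ~ Binomial(5, 0.5)
qbinom(0.5, size = 5, prob = 0.5)
```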
rbinom()
rbinom(n, N, p)
Script:
Output:
[1] 1 1 2 1 4 0 2 3
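The rbinom() script itself was lost; a sketch of the same shape (the seed is an assumption, added only for reproducibility):

```r
set.seed(1)  # assumed seed
# Eight random draws from Binomial(size = 10, prob = 0.5); values vary with the seed
draws <- rbinom(8, size = 10, prob = 0.5)
draws
```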
Normal Distribution:
Normal Distribution is a probability function used in statistics that tells about how the data
values are distributed. It is the most important probability distribution function used in
statistics because of its advantages in real case scenarios. For example, the height of the
population, shoe size, IQ level, rolling a dice, and many more.
dnorm()
Script:
x <- seq(-4, 4, by = 0.1) # sequence reconstructed (range assumed)
y <- dnorm(x)
plot(x, y)
Output:
pnorm()
Script:
x <- seq(-4, 4, by = 0.1) # reconstructed (range assumed)
y <- pnorm(x)
plot(x, y)
Output:
qnorm()
Script:
x <- seq(0, 1, by = 0.01) # probabilities from 0 to 1 (reconstructed)
y <- qnorm(x)
plot(x, y)
Output:
rnorm()
x <- rnorm(10000) # sample size assumed; original line lost
hist(x, breaks=50)
Result:
Exp No Exp Name Date
4.b t-Test, z-Test, Chi-Square Test
Aim:
t-Test:
A t-test is used to determine whether the means of two groups are equal to each other. The assumption for the test is that both groups are sampled from normal distributions with equal variances. The null hypothesis is that the two means are equal, and the alternative is that they are not equal.
Classification of T-tests
The One-Sample T-Test is used to test the statistical difference between a sample mean and a
known or assumed/hypothesized value of the mean in the population.
Script:
set.seed(0)
sweetSold <- rnorm(50, mean = 140, sd = 5) # data generation reconstructed (parameters assumed)
t.test(sweetSold, mu = 150)
Output:
data: sweetSold
138.8176 141.4217
sample estimates:
mean of x
140.1197
It is used to help us understand whether the difference between the two means is real or simply due to chance.
Script:
set.seed(0)
Output:
-11.482807 -8.473061
sample estimates:
mean of x mean of y
140.1077 150.0856
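The two-sample script was lost after set.seed(0); a sketch of the same shape (group sizes and parameters are assumptions, chosen only to mirror the printed means):

```r
set.seed(0)
x <- rnorm(200, mean = 140, sd = 10)  # assumed sample size and parameters
y <- rnorm(200, mean = 150, sd = 10)
t.test(x, y)  # Welch two-sample t-test
```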
This is a statistical procedure that is used to determine whether the mean difference between
two sets of observations is zero. In a paired sample t-test, each subject is measured two times,
resulting in pairs of observations.
Script:
set.seed(2820)
Output:
Paired t-test
0.9892738 1.1329434
sample estimates:
mean difference
1.061109
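The paired script was lost after set.seed(2820); a sketch of a paired t-test of the same shape (the data-generating step is an assumption, chosen to mirror the printed mean difference near 1):

```r
set.seed(2820)
before <- rnorm(100, mean = 10)           # assumed first measurement
after  <- before + rnorm(100, mean = 1)   # second measurement on the same subjects
t.test(after, before, paired = TRUE)
```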
zTest:
This function is based on the standard normal distribution and creates confidence intervals
and tests hypotheses for both one and two sample problems.
You can use the z.test() function from the BSDA package to perform one sample and two
sample z-tests in R.
Syntax:
z.test(x, y = NULL, alternative = "two.sided", mu = 0, sigma.x = NULL, sigma.y = NULL, conf.level = 0.95)
where:
x: values for the first sample
y: values for the second sample (if performing a two sample z-test)
mu: mean under the null or mean difference (in two sample case)
sigma.x, sigma.y: known population standard deviations of the two samples
Script:
library(BSDA)
data = c(88, 92, 94, 94, 96, 97, 97, 97, 99, 99,
105, 109, 109, 109, 110, 112, 112, 113, 114, 115)
Output:
One-sample z-Test
data: data
96.47608 109.62392
sample estimates:
mean of x
103.05
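The z.test() call itself was lost; a manual equivalent in base R, assuming mu = 100 (the null mean is an assumption) and sigma.x = 15 (which is consistent with the printed confidence interval):

```r
data <- c(88, 92, 94, 94, 96, 97, 97, 97, 99, 99,
          105, 109, 109, 109, 110, 112, 112, 113, 114, 115)
# Manual equivalent of BSDA::z.test(data, mu = 100, sigma.x = 15)
z  <- (mean(data) - 100) / (15 / sqrt(length(data)))
ci <- mean(data) + c(-1, 1) * qnorm(0.975) * 15 / sqrt(length(data))
z
round(ci, 5)  # matches the interval 96.47608 109.62392 printed above
```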
library(BSDA)
cityA = c(82, 84, 85, 89, 91, 91, 92, 94, 99, 99,
105, 109, 109, 109, 110, 112, 112, 113, 114, 114)
cityB = c(90, 91, 91, 91, 95, 95, 99, 99, 108, 109,
109, 114, 115, 116, 117, 117, 128, 129, 130, 133)
Output:
Two-sample z-Test
-17.446925 1.146925
sample estimates:
mean of x mean of y
100.65 108.80
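The two-sample z.test() call was also lost; the printed interval is consistent with equal known standard deviations of 15 (an assumption). A manual base-R equivalent:

```r
cityA <- c(82, 84, 85, 89, 91, 91, 92, 94, 99, 99,
           105, 109, 109, 109, 110, 112, 112, 113, 114, 114)
cityB <- c(90, 91, 91, 91, 95, 95, 99, 99, 108, 109,
           109, 114, 115, 116, 117, 117, 128, 129, 130, 133)
# Manual equivalent of BSDA::z.test(cityA, cityB, mu = 0, sigma.x = 15, sigma.y = 15)
se   <- sqrt(15^2 / length(cityA) + 15^2 / length(cityB))
diff <- mean(cityA) - mean(cityB)
ci   <- diff + c(-1, 1) * qnorm(0.975) * se
round(ci, 6)  # matches the interval -17.446925 1.146925 printed above
```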
Chi-Square Test:
In this test we particularly have to check the p-value. Like all statistical tests, we state a null hypothesis and an alternate hypothesis, and we reject the null hypothesis if the p-value in the result is less than a predetermined significance level, usually 0.05.
chisq.test(data)
Script:
data("mtcars")
table(mtcars$carb, mtcars$cyl)
chisq.test(mtcars$carb, mtcars$cyl)
Output:
We have a high chi-squared value and a p-value of less than 0.05 significance level. So we
reject the null hypothesis and conclude that carb and cyl have a significant relationship.
Result:
Exp No Exp Name Date
4.c Density Function
Aim:
The probability density function of a vector x, denoted by f(x), describes the relative likelihood of the variable taking a given value. A density plot represents the distribution of a numeric variable using a kernel density estimate of its probability density function. In R we use the density() function, which computes kernel density estimates; its return value is then used to build the final density plot.
plot(density(x), main,xlab,ylab,col,lwd)
Example1:
set.seed(13531)
x <- rnorm(1000)
dx<-density(x)
plot(dx, main = "Kernel Density Plot", xlab = "X-Values", ylab = "Density of my X-Values",
col="red", lwd=2)
Output:
Example2: Filling area under density curve and drawing line at mean
set.seed(13531)
x <- rnorm(1000)
dx<-density(x)
plot(dx, main = "Kernel Density Plot", xlab = "X-Values", ylab = "Density of my X-Values", col = "red", lwd = 2)
polygon(dx, col = "lightblue") # fills the area under the curve (reconstructed; original colour unknown)
abline(v = mean(x), lty = 2)   # vertical line at the mean (reconstructed)
Output:
x <- rnorm(100)
y <- rnorm(200)
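The code comparing the two samples' densities was lost; a sketch, assuming the two curves are overlaid with lines():

```r
x <- rnorm(100)
y <- rnorm(200)
dx <- density(x)
dy <- density(y)
plot(dx, main = "Comparing two kernel densities", col = "red", lwd = 2)
lines(dy, col = "blue", lwd = 2)  # overlay the second density on the same axes
```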
Output:
Result:
Exp No Exp Name Date
5 EDA on Population Dataset
Aim:
Population Dataset:
Dataset: population_by_country_2020
Description:
The population dataset consists of a sorted list of countries by their population in the year 2020. There are 235 countries along with their population in the dataset, and there are 11 columns, each representing a different feature of the countries.
List of Columns:
Country (or dependency), Population (2020), Yearly Change, Net Change, Density
(P/Km²), Land Area (Km²), Migrants (net), Fert. Rate, Med. Age, Urban Pop , World
Share
Range: The range can be defined as the difference between the maximum and minimum
elements in the given data, the data can be a vector or a dataframe. So we can define the
range as the difference between maximum_value – minimum_value
Script:
data<-read.csv("population.csv")
View(data)
x<-data.frame(data)
na.omit(x)
#Med..Age, Fert..Rate,Urban..Pop..
mage<-as.integer(data$Med..Age)
Frate<-as.integer(data$Fert..Rate)
Urbanpop<-as.integer(data$Urban.Pop)
print(range(x$Country..or.dependency.,na.rm = TRUE))
print(range(x$Population..2020.,na.rm = TRUE))
print(range(x$Yearly.Change,na.rm = TRUE))
print(range(x$Net.Change,na.rm = TRUE))
print(range(x$`Density..P.Km².`,na.rm = TRUE))
print(range(x$`Land.Area..Km².`,na.rm = TRUE))
print(range(x$Migrants..net.,na.rm = TRUE))
print(range(Frate,na.rm = TRUE))
print(range(mage,na.rm = TRUE))
print(range(Urbanpop,na.rm = TRUE))
print(range(x$World.Share,na.rm = TRUE))
Output:
[1] -653249 954806
> print(range(Frate,na.rm = TRUE))
[1] 1 7
> print(range(mage,na.rm = TRUE))
[1] 15 48
> print(range(Urbanpop,na.rm = TRUE))
[1] 0 100
> print(range(x$World.Share,na.rm = TRUE))
[1] "0.00%" "4.25%"
Summary:
Script:
summary(data)
Output:
> summary(data)
Country..or.dependency. Population..2020. Yearly.Change Net.Change
Length:235 Min. :8.010e+02 Length:235 Min. : -383840
Class :character 1st Qu.:3.995e+05 Class :character 1st Qu.: 424
Mode :character Median :5.460e+06 Mode :character Median : 39170
Mean :3.323e+07 Mean : 346088
3rd Qu.:2.067e+07 3rd Qu.: 249660
Max. :1.440e+09 Max. :13586631
Mean : 475.8 Mean : 553592 Mean : 6.3
3rd Qu.: 239.5 3rd Qu.: 403820 3rd Qu.: 9741.0
Max. :26337.0 Max. :16376870 Max. : 954806.0
NA's :34
Med..Age Urban.Pop World.Share
Length:235 Length:235 Length:235
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Mean:
Script:
#mean
mean(data$Population..2020.)
mean(data$Net.Change)
Output:
> mean(data$Population..2020.)
[1] 33227444
> mean(data$Net.Change)
[1] 346087.8
Variance:
#variance
var(data$Population..2020.,na.rm = TRUE)
var(data$`Density..P.Km².`,na.rm = TRUE)
Output:
> var(data$Population..2020.,na.rm = TRUE)
[1] 1.830702e+16
> var(data$`Density..P.Km².`,na.rm = TRUE)
[1] 5434894
Median:
#median
median(data$Population..2020.)
median(data$Net.Change)
Output:
> median(data$Population..2020.)
[1] 5460109
> median(data$Net.Change)
[1] 39170
Standard Deviation:
#Standard Deviation
sd(data$Population..2020.)
sd(data$Net.Change)
Output:
> sd(data$Population..2020.)
[1] 135303438
> sd(data$Net.Change)
[1] 1128260
Script:
library(ggplot2)
ggplot(x,aes(x=Urbanpop))+geom_histogram(color="blue",fill="white")
Output:
Box Plot:
View(xs)
ggplot(xs,aes(x=Med..Age,y=Population..2020.,color="red",fill=Med..Age))+geom_boxplot(
)
Output:
Scatter Plot:
#scatter plot
h<-head(x)
ggplot(h,aes(x=Population..2020.,y=Country..or.dependency.))+geom_point(size=3,color="blue",shape=19)
Output:
Result:
Exp No Exp Name Date
6.a Null hypothesis significance testing
Aim:
A statistical hypothesis is an assumption made by the researcher about the data of the
population collected for any experiment. It is not mandatory for this assumption to be true
every time. Hypothesis testing, in a way, is a formal process of validating the hypothesis
made by the researcher.
Null Hypothesis – Hypothesis testing is carried out in order to test the validity of a claim or
assumption that is made about the larger population. This claim that involves attributes to the
trial is known as the Null Hypothesis. The null hypothesis testing is denoted by H0.
Hypothesis testing ultimately uses a p-value to weigh the strength of the evidence, in other words, what the data say about the population. The p-value ranges between 0 and 1. It can be interpreted in the following way:
A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so
you reject it.
A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to
reject it.
A p-value very close to the cutoff (0.05) is considered to be marginal and could go either
way.
Script:
binom.test(10,30)
Output:
data: 10 and 30
number of successes = 10, number of trials = 30, p-value = 0.09874
0.1728742 0.5281200
sample estimates:
As we can see in the output above, the p-value is not less than the default alpha value of 0.05, so we can conclude that there is not enough evidence to reject the null hypothesis (that the probability of success is equal to 0.5).
Script:
binom.test(10,30, alternative="less")
Output:
data: 10 and 30
0.0000000 0.4994387
sample estimates:
As we can see in the output above, the p-value is less than the default alpha value of 0.05, so we can reject the null hypothesis (that the probability of success is not less than 0.5) and conclude that it is less than 0.5.
Result:
Exp No Exp Name Date
6.b,c Testing the mean of one sample and testing two means
Aim:
• H0 = the average (mean) precipitation in the US is equal to the known average precipitation in the rest of the world
• H1 = the average (mean) precipitation in the US is different than the known average precipitation in the rest of the world
Script:
mean(precip)
Output:
34.88571
t.test(precip, mu=38)
Output:
data: precip
95 percent confidence interval:
31.61748 38.15395
sample estimates:
mean of x
34.88571
t.test(precip, mu = 38, alternative = "less") # one-sided call reconstructed from the output
Output:
data: precip
-Inf 37.61708
sample estimates:
mean of x
34.88571
Script:
t.test(mpg ~ am, data = mtcars, alternative = "less") # call reconstructed from the output
Output:
data: mpg by am
t = -3.7671, df = 18.332, p-value = 0.0006868
alternative hypothesis: true difference in means between group 0 and group 1 is less than 0
-Inf -3.913256
sample estimates:
17.14737 24.39231
As we can see the p-value is less than 0.05 in the above output so we can reject the null
hypothesis and conclude that manual transmission cars give more mileage than automatic
gear cars.
Result:
Exp No Exp Name Date
7.a Predicting a Continuous Variable using Linear Models
Aim:
Linear Model:
Constructing model using lm() function:
#syntax: lm(response ~ predictor, data=dataset)
model<-lm(weight~height,data=women)
plot(weight ~ height, data = women) # scatter plot reconstructed (assumed)
abline(model,col="red")
print(model)
Output:
Call:
lm(formula = weight ~ height, data = women)
Coefficients:
(Intercept) height
-87.52 3.45
Output:
> summary(model)
Call:
lm(formula = weight ~ height, data = women)
Residuals:
Min 1Q Median 3Q Max
-1.7333 -1.1333 -0.3833 0.7417 3.1167
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
height 3.45000 0.09114 37.85 1.09e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
> coef(model)
(Intercept) height
-87.51667 3.45000
Result:
Exp No Exp Name Date
7.b Predicting a Continuous Variable using Linear Regression
Aim:
Linear Regression:
In Linear Regression these two variables are related through an equation, where exponent
(power) of both these variables is 1. Mathematically a linear relationship represents a straight
line when plotted as a graph. A non-linear relationship where the exponent of any variable is
not equal to 1 creates a curve.
head(women)
Output:
height weight
1 58 115
2 59 117
3 60 120
plot(weight~height,data=women)
model<-lm(weight~height,data=women)
abline(model,col="red")
print(model)
Output:
Call:
Coefficients:
(Intercept) height
-87.52 3.45
hdata=data.frame(height=71)
predict(model,hdata)
Output:
157.4333
Result:
Exp No Exp Name Date
7.c Predicting a Continuous Variable using Multiple Linear Regression
Aim:
Multiple regression is an extension of linear regression into relationship between more than
two variables. In simple linear relation we have one predictor and one response variable, but
in multiple regression we have more than one predictor variable and one response variable.
The general equation is y = a + b1x1 + b2x2 + ... + bnxn, where y is the response variable, a, b1, b2...bn are the coefficients, and x1, x2, ...xn are the predictor variables.
data <- mtcars[, c("mpg", "disp", "hp", "wt")] # subset reconstructed from the column names in the output
head(data,3)
Output:
mpg disp hp wt
Constructing model:
model <- lm(mpg ~ disp + hp + wt, data = data) # reconstructed from the Call shown in the output
print(model)
Output:
Call:
lm(formula = mpg ~ disp + hp + wt, data = data)
Coefficients:
(Intercept) disp hp wt
mpgdata=data.frame(disp=108,hp=110,wt=2.6)
predict(model,mpgdata)
Output:
23.69477
Result:
Exp No Exp Name Date
7.d Bias-Variance Trade-off and Cross-Validation
Aim:
Bias: In statistical learning, the bias of a model refers to the error of the model. A model with
high bias will fail to accurately predict its dependent variable.
Variance: The variance of a model refers to how sensitive a model is to changes in the data
that built the model. A linear model with high variance is very sensitive to changes to the data
that it was built with, and the estimated coefficients will be unstable.
Cross validation is defined as testing the efficiency of a constructed model. To perform the
same the data is split into training and test datasets.
To perform k-fold cross validation we will use cv.glm() function from boot package
Script:
library(boot)
model<-glm(mpg~.,data=mtcars)
modelcverror<-cv.glm(mtcars,model,K=5)
modelcverror$delta[2]
Output:
13.92993
model2<-glm(mpg~disp+hp+wt,data=mtcars)
model2cverror<-cv.glm(mtcars,model2,K=5)
model2cverror$delta[2]
Output:
7.846927
Result:
Exp No Exp Name Date
8.a correlation between two variables
Aim:
Correlation is usually computed on two quantitative variables, but it can also be computed on
two qualitative ordinal variables.
Pearson correlation is often used for quantitative continuous variables that have a linear
relationship
Spearman correlation (which is actually similar to Pearson but based on the ranked values
for each variable rather than on the raw data) is often used to evaluate relationships involving
at least one qualitative ordinal variable or two quantitative variables if the link is partially
linear
Kendall’s tau-b which is computed from the number of concordant and discordant pairs is
often used for qualitative ordinal variables
Script:
head(mtcars,3)
mtcarsdata<-mtcars[,-c(8,9)]
head(mtcarsdata,3)
Output:
mpg cyl disp hp drat wt qsec gear carb
cor(mtcarsdata$mpg, mtcarsdata$hp)
Output:
-0.7761684
Output:
0.7746767
Output:
0.7851865
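The calls that produced the last two outputs were lost in extraction; in general the method argument to cor() selects the flavour described above. A sketch (the variable pair is illustrative, not the one behind those particular values):

```r
mtcarsdata <- mtcars[, -c(8, 9)]  # drop the vs and am columns, as earlier in this experiment
cor(mtcarsdata$mpg, mtcarsdata$hp, method = "pearson")   # default: linear correlation
cor(mtcarsdata$mpg, mtcarsdata$hp, method = "spearman")  # rank-based correlation
cor(mtcarsdata$mpg, mtcarsdata$hp, method = "kendall")   # concordant/discordant pairs
```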
Result:
Exp No Exp Name Date
8.b Plotting Scatter Plots
Aim:
Output:
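The base-R plotting script for this output was lost; a sketch, assuming the hp/mpg pair used throughout this experiment:

```r
# Base-R scatter plot of horsepower against mileage
plot(mtcars$mpg, mtcars$hp, xlab = "mpg", ylab = "hp", col = "blue", pch = 19)
```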
Using ggplot2 package:
library(ggplot2)
ggplot(mtcars,aes(x=mpg,y=hp))+geom_point(size=2,color="blue")
Result:
Exp No Exp Name Date
8.c Correlation using Scatter Plot
Aim:
install.packages("ggstatsplot")
library(ggstatsplot)
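The ggscatterstats() call was lost; a sketch consistent with the note under the graph (hp against mpg; Pearson correlation is the function's default):

```r
library(ggstatsplot)
# Scatter plot annotated with the correlation coefficient and p-value
ggscatterstats(data = mtcars, x = hp, y = mpg)
```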
Output:
As we can see in the graph above the r= -0.78 and p<0.05 so we can conclude that the
correlation between Horse Power and Miles per gallon is negative.
Spearman Correlation method: For the Spearman correlation method, use the type="nonparametric" argument in the ggscatterstats function.
Output:
Result:
Exp No Exp Name Date
9 Tests of Hypotheses
Aim:
Example:
Let's say we need to determine whether the average score of students in the exam is higher than 610 or not. We have the information that the standard deviation of the students' scores is 100. So, we collect the scores of 30 students by random sampling and get the following data:
670,730,540,670,480,800,690,560,590,620,700,660,640,710,650,490,800
,600,560,700,680,550,580,700,705,690,520,650,660,790
Assume that the score follows a normal distribution. Test at 5% level of significance.
Script:
library("BSDA")
marks <- c(670,730,540,670,480,800,690,560,590,620,700,660,640,710,650,490,800
,600,560,700,680,550,580,700,705,690,520,650,660,790)
z.test(marks, mu = 610, sigma.x = 100, alternative = "greater") # call reconstructed from the output below
Output:
One-sample z-Test
data: marks
616.1359 NA
sample estimates:
mean of x
646.1667
As we can see the p-value is less than 0.05 in the above output so we can reject the null
hypothesis and conclude that average score of students is greater than 610.
Script:
mu <- 5 # hypothesised mean (reconstructed; the original line was garbled)
sd <- 2
sample <- 20
meanvalue <- 7
zscore <- (meanvalue - mu)/(sd/sqrt(sample)) # z statistic (line reconstructed from the printed p-value)
print(zscore)
pvalue <- 2*pnorm(-abs(zscore))
print(pvalue)
Output:
7.744216e-06
Result:
Exp No Exp Name Date
10 Estimating A Linear Relationship
Aim:
In R, linear relationships between continuous variables are constructed using the lm() function. Using the lm() function, both simple linear regression and multiple linear regression models can be constructed.
Syntax:
lm(formula, data = dataframe)
Parameters:
formula: a symbolic description of the model, e.g. response ~ predictor1 + predictor2.
dataframe: determines the name of the data frame that contains the data.
The least squares principle provides a way of choosing the coefficients effectively by
minimising the sum of the squared errors. That is, we choose the values of β0,β1,…,βk that
minimise the squared error.
To evaluate the goodness of fit of a model constructed using lm( ) function R2 coefficients
and Residual standard error are used. The model is said to be well fit if R2 value is high
and another way of evaluating goodness of fit of constructed model is using Residual
standard error, the lower the residual standard error the better the model is fit.
Generally the Residual standard error coefficient is compared with the sample mean of y
or standard deviation of y.
Script:
simplelinearmodel<-lm(mpg~hp,data = mtcars)
multiplelinearmodel<-lm(mpg~hp+cyl+wt,data=mtcars)
summary(simplelinearmodel)
summary(multiplelinearmodel)
sd(mtcars$mpg)
Output:
> summary(simplelinearmodel)
Call:
Residuals:
Coefficients:
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
> summary(multiplelinearmodel)
Call:
Residuals:
Coefficients:
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> sd(mtcars$mpg)
[1] 6.026948
As we can see in the above output, R2 value of simple linear model is 0.35 and R2 value of
multiple linear model is 0.82, so we can conclude that multiple linear model is doing better
than simple linear model. Also the Residual standard error of simple linear model is 3.86
and multiple linear model is 2.51 which also tells that multiple linear model is having
better goodness of fit to the original dataset.
Result:
Exp No Exp Name Date
11.a Defining user-defined classes and operations
Aim:
Class is the blueprint that helps to create an object and contains its member variable along
with the attributes.
Script:
marks<-list(name="S Charan", regNo="19F41A0586", score="41") # list contents reconstructed from the output below
class(marks)<-"studentmarks"
marks
display<-function(obj){
cat("Marks Secured:", obj$score) # body reconstructed from the output below
}
display(marks)
Output:
$name
$regNo
[1] "19F41A0586"
$score
[1] "41"
attr(,"class")
[1] "studentmarks"
Marks Secured: 41
For creating an S4 class the setClass() function is used, and the new() function is used to create objects of the class. We can set methods for an S4 class using the setMethod() function.
Script:
setClass("studentmarks", representation(name="character", regNo="character", score="numeric")) # class definition reconstructed
objcharan<-new("studentmarks",name="S Charan",regNo="19F41A0586",score=41)
objcharan
setMethod("show",
"studentmarks",
function(object){
cat("\nName: ", object@name) # reconstructed
cat("\nReg No: ",object@regNo)
cat("\nMarks Secured: ", object@score) # reconstructed from the output below
})
show(objcharan)
Output:
Marks Secured: 41
Result:
Exp No Exp Name Date
11.b Conditional Statements
Aim:
Conditional statements, also called decision statements, help in making decisions that control the flow of execution of statements in a program.
Simple if
if-else
if-else-if ladder
Nested if-else
Switch statement
Simple if Statement:
Syntax:
if(condition is true){
execute this statement
}
Example:
number<-scan()
if(number<0){
print("Negative number") # body reconstructed (assumed)
}
Output:
> number<-scan()
1: -3
if-else Statement:
Syntax:
if(condition is true) {
execute this statement
} else {
execute this statement
}
Example:
number<-scan()
if(number<0){
print("Negative number") # assumed body
}else{
print("Positive number") # assumed body
}
Output:
> number<-scan()
1: 61
Syntax:
if(condition 1 is true) {
execute this statement
} else if(condition 2 is true) {
execute this statement
} else {
execute this statement
}
Example:
number<-scan()
if(number==0){
print("Zero") # assumed body
}else if(number>0){
print("Positive number") # assumed body
}else{
print("Negative number") # assumed body
}
Output:
> number<-scan()
1: 0
Syntax:
if(condition 1 is true) {
if(condition 2 is true) {
execute this statement
} else {
execute this statement
}
} else {
execute this statement
}
Example:
number<-scan()
if(number==0){
print("Zero") # assumed body
}else if(number>0){
if(number%%2==0){
print("Positive even number") # assumed body
}else{
print("Positive odd number") # assumed body
}
}else{
print("Negative number") # assumed body
}
Output:
> number<-scan()
1: 5
Switch Statement:
Syntax:
switch(expression, case1, case2, ..., default)
Example:
print("Enter a value:")
a<-scan()
print("Enter b value")
b<-scan()
cat("\n1.Addition
\n2.Subtraction
\n3.Multiplication")
choice<-scan()
print(choice)
result<-switch(as.character(choice),
"1"=cat("\nAddition is =",a+b),
"2"=cat("\nSubtraction is =",a-b),
"3"=cat("\nMultiplication is =",a*b),
print("invalid choice")
)
Output:
"Enter a value:"
1: 5
"Enter b value"
1: 6
1.Addition
2.Subtraction
3.Multiplication
1: 1
Addition is = 11
Result:
Exp No Exp Name Date
11.c Loop and Iteration Statements
Aim:
For loop
While loop
Repeat loop
for loop:
Syntax:
for (value in vector) {
statements
}
Example:
data<-1:10
for(i in data){
if(i%%2==0){
cat(" ", i)
}
}
Output:
2 4 6 8 10
While loop:
In a while loop, the condition is checked first and then the body of the loop executes. Because the condition is tested before every iteration, for n iterations it is evaluated n+1 times.
Syntax:
while (test_expression) {
statement
}
Example:
i=1
while(i<=10){
if(i%%2==0){
cat(" ", i)
}
i=i+1
}
Output:
2 4 6 8 10
Repeat loop:
A repeat loop is used to iterate a block of code. It is a special type of loop in which there is no
condition to exit from the loop. For exiting, we include a break statement with a user-defined
condition. This property of the loop makes it different from the other loops.
Syntax:
repeat {
commands
if(condition) {
break
}
}
Example:
i=1
repeat{
if(i==10){
break
}
if(i%%2==0){
cat(" ", i)
}
i=i+1
}
Output:
2 4 6 8
Result:
Exp No Exp Name Date
12 Statistical Functions In R
Aim:
Mean
The average of a dataset, defined as the sum of all observations divided by the number of observations.
mean(Vector)
Example: mean(iris$Sepal.Length)
Variance
The average squared deviation of the observations from the mean.
var(Vector)
var(iris$Sepal.Length)
Standard Deviation
The amount any observation can be expected to differ from the mean.
sd(Vector)
sd(iris$Sepal.Length)
Median
The middle value of the sorted observations.
median(Vector)
median(iris$Sepal.Length)
Minimum
min(Vector)
min(iris$Sepal.Length)
Maximum
max(Vector)
max(iris$Sepal.Length)
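The functions above can be sanity-checked on a small vector whose statistics are easy to verify by hand (x is an invented example):

```r
# Small vector where each statistic can be computed by hand
x <- c(2, 4, 6, 8)

mean(x)    # (2 + 4 + 6 + 8) / 4 = 5
median(x)  # midpoint of the two middle values = 5
var(x)     # sum of squared deviations / (n - 1) = 20 / 3
sd(x)      # sqrt(var(x)), about 2.58
min(x)     # 2
max(x)     # 8
```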
t-test
A test of whether the means of two groups differ significantly. If your group variable has more than two levels, don't use a t test - use an ANOVA instead.
t.test(Vector1, Vector2)
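A minimal sketch using the built-in iris dataset (the choice of species and variable is ours, not from the manual):

```r
# Two-sample t-test: do setosa and versicolor differ in mean Sepal.Length?
setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]
result <- t.test(setosa, versicolor)
result$p.value   # far below 0.05, so the group means differ significantly
```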
Chi-squared Test
A test to see if two categorical variables are related. The null hypothesis is that the two variables are independent of one another.
For more information - particularly on what form your data should be in - see the help page ?chisq.test.
chisq.test(Vector1, Vector2)
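A minimal sketch with made-up categorical vectors (the data are invented purely for illustration):

```r
# Chi-squared test of independence between two categorical vectors.
# Group A: 30 yes / 20 no; group B: 10 yes / 40 no - clearly associated.
treatment <- rep(c("A", "B"), each = 50)
outcome   <- rep(c("yes", "no", "yes", "no"), times = c(30, 20, 10, 40))

test <- chisq.test(treatment, outcome)
test$p.value   # small p-value: treatment and outcome are associated
```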
Linear Models
A type of regression which predicts the value of a response variable at given values of independent predictor variables.
lm(Response ~ Predictor, data = DataFrame)
You can add multiple predictor vectors using +, and include interaction terms using + Vector1:Vector2.
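A minimal sketch using the built-in iris dataset (the choice of variables is ours):

```r
# Linear model with two predictors
model <- lm(Sepal.Length ~ Petal.Length + Sepal.Width, data = iris)
coef(model)   # intercept plus one slope per predictor (3 coefficients)

# Adding an interaction term with Vector1:Vector2
model2 <- lm(Sepal.Length ~ Petal.Length + Sepal.Width + Petal.Length:Sepal.Width,
             data = iris)
coef(model2)  # now 4 coefficients, including the interaction
```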
Result:
Beyond Syllabus
Exp No Exp Name Date
1 EDA on Parkinsons Disease Data
Aim:
Population Dataset:
Description:
The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient; the name of the patient is identified in the first column.
List of Columns:
name, MDVP.Fo.Hz., MDVP.Fhi.Hz., MDVP.Flo.Hz., MDVP.Jitter..., MDVP.Jitter.Abs., MDVP.RAP, MDVP.PPQ, Jitter.DDP, MDVP.Shimmer, MDVP.Shimmer.dB., Shimmer.APQ3, Shimmer.APQ5, MDVP.APQ, Shimmer.DDA, NHR, HNR, status, RPDE, DFA, spread1, spread2, D2, PPE
Range: The range is the difference between the maximum and minimum elements of the given data; the data can be a vector or a dataframe column. So we can define the range as maximum_value – minimum_value.
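Note that R's range() function returns both extremes rather than their difference; a quick sketch on a toy vector (x is an invented example):

```r
x <- c(3, 7, 1, 9)
range(x)         # returns c(min, max): 1 9
diff(range(x))   # max - min = 8, the range as defined above
```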
Script:
data<-read.csv("parkinsons.data")
View(data)
head(data,3)
#Finding Range
print(range(data$MDVP.Fo.Hz.,na.rm=TRUE))
print(range(data$MDVP.Fhi.Hz., na.rm=TRUE))
print(range(data$MDVP.Flo.Hz., na.rm=TRUE))
print(range(data$MDVP.Jitter..., na.rm=TRUE))
Output:
> print(range(data$MDVP.Fo.Hz.))
[1] 88.333 260.105
> #Finding Range
> print(range(data$MDVP.Fo.Hz.,na.rm=TRUE))
[1] 88.333 260.105
>
> print(range(data$MDVP.Fhi.Hz., na.rm=TRUE))
[1] 102.145 592.030
> print(range(data$MDVP.Flo.Hz., na.rm=TRUE))
[1] 65.476 239.170
> print(range(data$MDVP.Jitter..., na.rm=TRUE))
[1] 0.00168 0.03316
Summary:
Script:
summary(data)
Output:
> summary(data)
name MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz.
Length:195 Min. : 88.33 Min. :102.1 Min. : 65.48
Class :character 1st Qu.:117.57 1st Qu.:134.9 1st Qu.: 84.29
Mode :character Median :148.79 Median :175.8 Median :104.31
Mean :154.23 Mean :197.1 Mean :116.32
3rd Qu.:182.77 3rd Qu.:224.2 3rd Qu.:140.02
Max. :260.11 Max. :592.0 Max. :239.17
MDVP.Jitter... MDVP.Jitter.Abs. MDVP.RAP MDVP.PPQ
Min. :0.001680 Min. :7.000e-06 Min. :0.000680 Min. :0.000920
1st Qu.:0.003460 1st Qu.:2.000e-05 1st Qu.:0.001660 1st Qu.:0.001860
Median :0.004940 Median :3.000e-05 Median :0.002500 Median :0.002690
Mean :0.006220 Mean :4.396e-05 Mean :0.003306 Mean :0.003446
3rd Qu.:0.007365 3rd Qu.:6.000e-05 3rd Qu.:0.003835 3rd Qu.:0.003955
Max. :0.033160 Max. :2.600e-04 Max. :0.021440 Max. :0.019580
Jitter.DDP MDVP.Shimmer MDVP.Shimmer.dB. Shimmer.APQ3
Min. :0.002040 Min. :0.00954 Min. :0.0850 Min. :0.004550
1st Qu.:0.004985 1st Qu.:0.01650 1st Qu.:0.1485 1st Qu.:0.008245
Median :0.007490 Median :0.02297 Median :0.2210 Median :0.012790
Mean :0.009920 Mean :0.02971 Mean :0.2823 Mean :0.015664
3rd Qu.:0.011505 3rd Qu.:0.03789 3rd Qu.:0.3500 3rd Qu.:0.020265
Max. :0.064330 Max. :0.11908 Max. :1.3020 Max. :0.056470
Shimmer.APQ5 MDVP.APQ Shimmer.DDA NHR
Min. :0.00570 Min. :0.00719 Min. :0.01364 Min. :0.000650
1st Qu.:0.00958 1st Qu.:0.01308 1st Qu.:0.02474 1st Qu.:0.005925
Median :0.01347 Median :0.01826 Median :0.03836 Median :0.011660
Mean :0.01788 Mean :0.02408 Mean :0.04699 Mean :0.024847
3rd Qu.:0.02238 3rd Qu.:0.02940 3rd Qu.:0.06080 3rd Qu.:0.025640
Max. :0.07940 Max. :0.13778 Max. :0.16942 Max. :0.314820
HNR status RPDE DFA
Min. : 8.441 Min. :0.0000 Min. :0.2566 Min. :0.5743
1st Qu.:19.198 1st Qu.:1.0000 1st Qu.:0.4213 1st Qu.:0.6748
Median :22.085 Median :1.0000 Median :0.4960 Median :0.7223
Mean :21.886 Mean :0.7538 Mean :0.4985 Mean :0.7181
3rd Qu.:25.076 3rd Qu.:1.0000 3rd Qu.:0.5876 3rd Qu.:0.7619
Max. :33.047 Max. :1.0000 Max. :0.6852 Max. :0.8253
spread1 spread2 D2 PPE
Min. :-7.965 Min. :0.006274 Min. :1.423 Min. :0.04454
1st Qu.:-6.450 1st Qu.:0.174350 1st Qu.:2.099 1st Qu.:0.13745
Median :-5.721 Median :0.218885 Median :2.362 Median :0.19405
Mean :-5.684 Mean :0.226510 Mean :2.382 Mean :0.20655
3rd Qu.:-5.046 3rd Qu.:0.279234 3rd Qu.:2.636 3rd Qu.:0.25298
Max. :-2.434 Max. :0.450493 Max. :3.671 Max. :0.52737
Mean:
Script:
#mean
mean(data$MDVP.Fo.Hz.)
mean(data$MDVP.Jitter...)
mean(data$MDVP.Shimmer)
Output:
> mean(data$MDVP.Fo.Hz.)
[1] 154.2286
> mean(data$MDVP.Jitter...)
[1] 0.006220462
> mean(data$MDVP.Shimmer)
[1] 0.02970913
Variance:
#variance
var(data$MDVP.Fo.Hz.)
var(data$NHR)
var(data$MDVP.Shimmer)
Output:
> var(data$MDVP.Fo.Hz.)
[1] 1713.137
> var(data$NHR)
[1] 0.001633651
> var(data$MDVP.Shimmer)
[1] 0.0003555839
Median:
#median
median(data$MDVP.Fo.Hz.)
median(data$MDVP.Jitter...)
median(data$MDVP.Shimmer)
Output:
> median(data$MDVP.Fo.Hz.)
[1] 148.79
> median(data$MDVP.Jitter...)
[1] 0.00494
> median(data$MDVP.Shimmer)
[1] 0.02297
Standard Deviation:
#standard deviation
sd(data$MDVP.Fo.Hz.)
sd(data$MDVP.Jitter...)
sd(data$MDVP.Shimmer)
Output:
> sd(data$MDVP.Fo.Hz.)
[1] 41.39006
> sd(data$MDVP.Jitter...)
[1] 0.004848134
> sd(data$MDVP.Shimmer)
[1] 0.01885693
Box Plot:
#box plot (status must be treated as a factor to get one box per group)
library(ggplot2)
ggplot(data=data,aes(x=factor(status), y=Shimmer.DDA))+geom_boxplot(color="red",fill="green")
Output:
Scatter Plot:
#scatter plot
library(ggplot2)
ggplot(data=data,aes(x=MDVP.Fo.Hz.,y=MDVP.Jitter...))+geom_point(size=3,color="blue",shape=19)
Output:
Result:
Exp No Exp Name Date
2 Finding Correlation between Variables in Parkinsons Disease Data
Aim:
Correlation is usually computed on two quantitative variables, but it can also be computed on two qualitative ordinal variables.
Pearson correlation is often used for quantitative continuous variables that have a linear relationship.
Spearman correlation (similar to Pearson, but based on the ranked values of each variable rather than on the raw data) is often used to evaluate relationships involving at least one qualitative ordinal variable, or two quantitative variables whose relationship is only partially linear.
Kendall's tau-b, computed from the number of concordant and discordant pairs, is often used for qualitative ordinal variables.
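All three methods are available through cor()'s method argument; a sketch on the built-in iris dataset (the variable choice is ours, not from the manual):

```r
# The same pair of variables under the three correlation methods
x <- iris$Sepal.Length
y <- iris$Petal.Length

cor(x, y, method = "pearson")    # strong positive linear correlation
cor(x, y, method = "spearman")   # rank-based correlation
cor(x, y, method = "kendall")    # based on concordant/discordant pairs
```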
Script:
data<-read.csv("parkinsons.data")
head(data,3)
Output:
(output truncated; the displayed columns include MDVP.Jitter.Abs., MDVP.RAP, MDVP.PPQ, Jitter.DDP, MDVP.Shimmer and MDVP.Shimmer.dB.)
cor(data$MDVP.Fo.Hz., data$MDVP.Jitter...)
Output:
-0.1180026
Output:
0.7291587
Output:
0.5442087
Result:
Exp No Exp Name Date
3 Constructing Predictive Model for PD Detection
Aim:
Multiple regression is an extension of linear regression to relationships among more than two variables. In a simple linear relation we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable. The general equation is:
y = a + b1*x1 + b2*x2 + ... + bn*xn
where y is the response variable, a, b1, b2, ..., bn are the coefficients, and x1, x2, ..., xn are the predictor variables.
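The equation above can be illustrated on synthetic data, where the true coefficients are known and lm() should recover them (the numbers are invented for the sketch):

```r
# y = a + b1*x1 + b2*x2 with a = 2, b1 = 3, b2 = -1.5, plus tiny noise
set.seed(1)
x1 <- runif(100)
x2 <- runif(100)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(100, sd = 0.01)

fit <- lm(y ~ x1 + x2)
round(coef(fit), 1)   # approximately 2.0, 3.0, -1.5
```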
data<-read.csv("parkinsons.data")
PDdata <- data[,c("status","MDVP.Fo.Hz.","MDVP.Jitter...","MDVP.Shimmer","HNR","spread1")]
head(PDdata,3)
Output:
Constructing model:
model <- lm(status ~ MDVP.Fo.Hz. + MDVP.Jitter... + MDVP.Shimmer + HNR + spread1, data = PDdata)
print(model)
Output:
Call:
Coefficients:
spread1
0.218370
(remaining output truncated)
testdata=data.frame(MDVP.Fo.Hz.=119.9,MDVP.Jitter...=0.007,MDVP.Shimmer=0.04,HNR=21,spread1=-4.8)
result=predict(model,testdata)
print(round(result))
Output:
Result: