EDA With R Lab Manual

The document provides a comprehensive guide on installing R and R-Studio, including download links and installation steps. It covers basic R syntax, data structures like matrices and lists, subsetting techniques, and the help system. Additionally, it includes data manipulation using the dplyr package, plotting data, reading external data, and creating tables and charts in R.


Exp No Experiment Name Date

1.a INSTALLATION OF ‘R’

Aim:-

How to get R and R-Studio

To download and work with R on Windows operating system:

R download:

https://cran.r-project.org/bin/windows/base/

To download and work with RStudio

RStudio download:

For Windows operating system:

https://www.rstudio.com/products/rstudio/download/

For Linux operating system:

http://www.r-bloggers.com/download-and-install-r-in-ubuntu/

INSTALLATION STEPS:

STEP:1

Download the RStudio installer (e.g. version 1.1.383 for Windows Vista/7/8/10) and the R console.

STEP:2

Open R-studio setup wizard

STEP:3

Click the Next button, choose the install location, and click Next.

STEP:4

The install wizard will appear; click the Install button.

STEP:5

Click the Finish button. RStudio is now installed successfully on your PC.

RESULT:-

Exp No Experiment Name Date
1.b The Basics of R Syntax

Aim:-

#Assigning values to variables


a=10
b<-5
4->c
cat(a,b,c)

Output:
10 5 4

#Arithmetic Operations
add=2 + 2 #addition
sub=2-2 #subtraction
mul=2*2 #multiplication
div=2 / 11 #returns arithmetic quotient
rem=10 %% 2 #returns remainder
div2=2%/%11 #returns integer quotient
cat("Addition=",add,"Subtraction=",sub,"Multiplication=",mul,"\nDivision=",div,"Remainder=",rem,
"Integer Quotient=",div2)

Output:
Addition= 4 Subtraction= 0 Multiplication= 4
Division= 0.1818182 Remainder= 0 Integer Quotient= 0

#Relational Operators
a <- c(1, 3, 5)
b <- c(2, 4, 6)
print(a>b)
print(a<b)
print(a<=b)
print(a>=b)
print(a==b)
print(a>=b)

Output:
[1] FALSE FALSE FALSE
[1] TRUE TRUE TRUE
[1] TRUE TRUE TRUE
[1] FALSE FALSE FALSE

[1] FALSE FALSE FALSE
[1] FALSE FALSE FALSE

#Logical Operators
a <- c(3, 0, TRUE, 2+2i)
b <- c(2, 4, TRUE, 2+3i)
print(a&b) #element-wise AND (non-zero values are treated as TRUE)
print(a|b) #element-wise OR
print(!a) #element-wise NOT
print(a&&b) #compares only the first elements; errors for vectors longer than 1 in R >= 4.3
print(a||b) #compares only the first elements; errors for vectors longer than 1 in R >= 4.3

Output:
[1] TRUE FALSE TRUE TRUE
[1] TRUE TRUE TRUE TRUE
[1] FALSE TRUE FALSE FALSE
[1] TRUE
[1] TRUE

Result:

Exp No Experiment Name Date
1.c Matrices and Lists

Aim:-

Matrices in R:
#Matrix with one column
data=c(1, 2, 3, 4, 5, 6)
amatrix = matrix(data)
print(amatrix)

Output:
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5
[6,] 6

#Matrix with two columns


amatrix = matrix(data, ncol=2)
print(amatrix)

Output:
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

#Matrix multiplication
a=matrix(c(1,2,3,4),ncol=2)
print(a)
b=matrix(c(2,2,2,2),ncol=2)
print(b)
print(a%*%b)

Output:
[,1] [,2]
[1,] 1 3
[2,] 2 4
[,1] [,2]
[1,] 2 2
[2,] 2 2
[,1] [,2]

[1,] 8 8
[2,] 12 12

Lists in R:
#list with same datatype
list1<-list(1,2,3)
list2<-list("CSE","ECE","EEE")
print(list1)
print(list2)

Output:
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[1]]
[1] "CSE"

[[2]]
[1] "ECE"

[[3]]
[1] "EEE"

#list with different datatypes


listdata<-list("CSE","ECE",c(1,2,3,4,5),TRUE,FALSE,22.5,12L)
print(listdata)

Output:
[[1]]
[1] "CSE"

[[2]]
[1] "ECE"

[[3]]
[1] 1 2 3 4 5

[[4]]
[1] TRUE

[[5]]
[1] FALSE

[[6]]
[1] 22.5

[[7]]
[1] 12

#list with different objects and names for list values


listdata <- list(c("UDay","Kumar","Venkatesh"), matrix(c(40,80,60,70,90,80), nrow = 2),
list("CSE","ECE","EEE"))
names(listdata) <- c("Students", "Marks", "Course")
print(listdata)

Output:
$Students
[1] "UDay" "Kumar" "Venkatesh"
$Marks
[,1] [,2] [,3]
[1,] 40 60 90
[2,] 80 70 80
$Course
$Course[[1]]
[1] "CSE"
$Course[[2]]
[1] "ECE"
$Course[[3]]
[1] "EEE"

Result:

Exp No Experiment Name Date
1.d Subsetting

Aim:-

#Subsetting Using [ ] Operator


x <- 1:15
cat("Original vector: ", x, "\n")
cat("First 5 values of vector: ", x[1:5])
cat("Without values present at index 1, 2 and 3: ",
x[-c(1, 2, 3)], "\n")

Output:
Original vector: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
First 5 values of vector: 1 2 3 4 5
Without values present at index 1, 2 and 3: 4 5 6 7 8 9 10 11 12 13 14 15

#subsetting using [[ ]] Operator


ls <- list(a = 1, b = 2, c = 10, d = 20)
cat("Original List: \n")
print(ls)
cat("First element of list: ", ls[[1]], "\n")

Output:
Original List:
$a
[1] 1
$b
[1] 2
$c
[1] 10
$d
[1] 20
First element of list: 1

#Subsetting using $ Operator


ls <- list(a = 1, b = 2, c = "Hello", d = "GFG")
cat("Original list:\n")
print(ls)
cat("Using $ operator:\n")
print(ls$d)

Output:
Original list:
$a
[1] 1
$b
[1] 2
$c
[1] "Hello"
$d
[1] "GFG"
Using $ operator:
[1] "GFG"

#Subsetting using subset( ) function


mtc <- subset(mtcars, gear == 5 & hp > 200,select = c(gear, hp))
print(mtc)

Output:
gear hp
Ford Pantera L 5 264
Maserati Bora 5 335

Result:

Exp No Experiment Name Date
1.e Help System

Aim:-

#getting help manual


help.start()

#getting help on a particular function

help("gsub")

#or ? operator is used instead of help function

?gsub

#help.search() or ?? is used when name of function is unaware

help.search("chisquare")

??chisquare

Result:

Exp No Experiment Name Date
2.a Viewing and Manipulating Data

Aim:-

#Viewing data in a dataframe

data <- data.frame(x=1:50,y=100:149)

View(data)

Output:

#viewing iris data from the built in iris dataset

View(iris)

Output:

#Manipulating data using dplyr package

#installing dplyr package

install.packages("dplyr")

#loading dplyr package

library("dplyr")

#summary of iris dataset using summary() function

summary(iris)

Output:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50

1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50

Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50

Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199

3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800

Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500

#selecting data using select() function

selected <- select(iris, Sepal.Length, Sepal.Width)

head(selected)

Output:

Sepal.Length Sepal.Width

1 5.1 3.5

2 4.9 3.0

3 4.7 3.2

4 4.6 3.1

5 5.0 3.6

6 5.4 3.9

#Filtering data using filter() function

filtered <- filter(iris, Species == "setosa" )

head(filtered,3)

Output:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

#creating a "Greater.Half" column based on a condition using the mutate() function

col1 <- mutate(iris, Greater.Half = Sepal.Width > 0.5 * Sepal.Length)

tail(col1,4)

Output:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species Greater.Half

147 6.3 2.5 5.0 1.9 virginica FALSE

148 6.5 3.0 5.2 2.0 virginica FALSE

149 6.2 3.4 5.4 2.3 virginica TRUE

150 5.9 3.0 5.1 1.8 virginica TRUE

#Arranging data in an order using arrange() function

arranged <- arrange(iris, Sepal.Width)

head(arranged)

Output:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.0 2.0 3.5 1.0 versicolor

2 6.0 2.2 4.0 1.0 versicolor

3 6.2 2.2 4.5 1.5 versicolor

4 6.0 2.2 5.0 1.5 virginica

5 4.5 2.3 1.3 0.3 setosa

6 5.5 2.3 4.0 1.3 versicolor

#To arrange Sepal Width in descending order

arranged <- arrange(iris, desc(Sepal.Width))

head(arranged)

Output:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.7 4.4 1.5 0.4 setosa

2 5.5 4.2 1.4 0.2 setosa

3 5.2 4.1 1.5 0.1 setosa

4 5.8 4.0 1.2 0.2 setosa

5 5.4 3.9 1.7 0.4 setosa

6 5.4 3.9 1.3 0.4 setosa

Result:

Exp No Experiment Name Date
2.b Plotting Data

Aim:-

#Plotting graph for Ozone levels in airquality dataset using plot() function

plot(airquality$Ozone, xlab = 'ozone Concentration', ylab = 'No of Instances', main = 'Ozone levels in
NY city', col = 'green')

Output:

# Horizontal bar plot

barplot(airquality$Ozone, main = 'Ozone Concentration in air', xlab = 'ozone levels', col = 'green', horiz = TRUE)

Output:

#Histogram Plot using hist() function

hist(airquality$Solar.R, main = 'Solar Radiation values in air',xlab = 'Solar rad.', col='red')

Output:

#Box plot using boxplot() function

boxplot(airquality,col="Red")

Output:

Result:

Exp No Experiment Name Date
2c Reading External Data

Aim:-

#reading data using read.csv() function

data=read.csv("U:/KEC/R Lab Experiments/sample.txt",header=TRUE,sep=",",dec=".")

print(data)

Output:

flavor number

1 pistachio 6

2 mint chocolate chip 7

3 vanilla 5

4 chocolate 10

5 strawberry 2

6 neopolitan 4

#reading a file from internet

my_data <- read.csv("http://www.sthda.com/upload/boxplot_format.txt",sep="\t")

head(my_data,3)

Output:

Nom variable Group

1 IND1 10 A

2 IND2 7 A

3 IND3 20 A

Result:

Exp No Experiment Name Date
3a Tables, Charts And Plots

Aim:-

Tables in R:

#creating table using table() function

data = data.frame(
  "Name" = c("abc", "cde", "def"),
  "Gender" = c("Male", "Female", "Male")
)

table(data)

Output:

Gender

Name Female Male

abc 0 1

cde 1 0

def 0 1

Charts in R:

Types of R – Charts

• Bar Plot or Bar Chart
• Pie Diagram or Pie Chart
• Histogram
• Scatter Plot
• Box Plot

Bar Plot or Bar Chart:

# defining vector

x <- c(7, 15, 23, 12, 44, 56, 32)

# output to be present as PNG file

png(file = "barplot.png")

# plotting vector

barplot(x, xlab = "sales", ylab = "Count", col = "white", col.axis = "darkgreen", col.lab = "darkgreen", main = "sales of product")

dev.off()

Output:

Pie Diagram or Pie Chart

# defining vector x with number of articles

x <- c(210, 450, 250, 100, 50, 90)

# defining labels for each value in x

names(x) <- c("Algo", "DS", "Java", "C", "C++", "Python")

# output to be present as PNG file

png(file = "piechart.png")

# creating pie chart

pie(x, labels = names(x), col = "white", main = "programming languages",

radius = -1, col.main = "red")

# saving the file

dev.off()

Output:

Histogram Plot:

# defining vector

x <- c(21, 23, 56, 90, 20, 7, 94, 12, 57, 76, 69, 45, 34, 32, 49, 55, 57)

# output to be present as PNG file

png(file = "hist.png")

hist(x, main = "Histogram of Vector x", xlab = "Values", col.lab = "darkgreen", col.main = "darkgreen")

# saving the file

dev.off()

Output:

Scatter Plot:

#defining vector

orange <- Orange[, c('age', 'circumference')]

# output to be present as PNG file

png(file = "plot.png")

# plotting

plot(x = orange$age, y = orange$circumference, xlab = "Age",

ylab = "Circumference", main = "Age VS Circumference",

col.lab = "darkgreen", col.main = "darkgreen",

col.axis = "darkgreen")

# saving the file

dev.off()

Output:

Box Plot:

# defining vector with ages of employees

x <- c(42, 21, 22, 24, 25, 30, 29, 22,23, 23, 24, 28, 32, 45, 39, 40)

# output to be present as PNG file

png(file = "boxplot.png")

# plotting

boxplot(x, xlab = "Box Plot", ylab = "Age",col.axis = "darkgreen", col.lab = "darkgreen")

# saving the file

dev.off()

Output:

Result:

Exp No Experiment Name Date
3b Univariate Data, Measure of Central Tendency, Frequency Distribution

Aim:-

Measure of Central Tendency:

#finding mean/average of iris$Petal.Length

mean<-mean(iris$Petal.Length)

print(mean)

Output:

[1] 3.758

#finding median or middle value of iris$Petal.Length

median<-median(iris$Petal.Length)

print(median)

Output:

[1] 4.35

#finding mode value of iris$Petal.Length

#installing and loading the modeest library

install.packages("modeest")

library(modeest)

mode<-mfv(iris$Petal.Length)

print(mode)

Output:

[1] 1.4 1.5
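If installing modeest is not possible, the most frequent value(s) can also be found with base R's table(); this sketch reproduces the mfv() result above:

```r
# Count occurrences of each value, then keep the value(s) with the highest count
tab <- table(iris$Petal.Length)
modes <- as.numeric(names(tab)[tab == max(tab)])
print(modes)  # 1.4 1.5, matching mfv(iris$Petal.Length)
```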

Frequency Distribution:

Frequency distribution for Categorical Data:

#finding frequency distribution of carburetors in mtcars dataset using table() function

freq<-table(mtcars$carb)

print(freq)

barplot(freq,xlab="carburetors",ylab="count",main="frequency distribution of carburetors",col = "black")

Output:

1 2 3 4 6 8

7 10 3 10 1 1

Frequency distribution for Continuous Data:

data<-cut(airquality$Temp,9) #cut() is used to make equally spaced bins

print(data)

freq<-table(data)

print(freq)

hist(airquality$Temp,xlab="Temperature",ylab="count",main="frequency distribution of
Temperature",col = "white")

Output:

data

(56,60.6] (60.6,65.1] (65.1,69.7] (69.7,74.2] (74.2,78.8] (78.8,83.3] (83.3,87.9]

8 10 14 16 26 35 22

(87.9,92.4] (92.4,97]

15 7
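The counts above can also be turned into relative frequencies with prop.table(), a base R function (a small sketch):

```r
# Convert the frequency table of binned temperatures into proportions
freq <- table(cut(airquality$Temp, 9))
rel <- prop.table(freq)   # each count divided by the total number of observations
print(round(rel, 3))      # relative frequencies summing to 1
```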

Result:

Exp No Experiment Name Date
3C Multivariate data, Relationship between categorical and continuous variables

Aim:-

Finding Relationship between Species and Petal.Length variables in iris dataset

#mean of Petal.Length for setosa Species

mean(iris$Petal.Length[iris$Species=="setosa"])

#mean of Petal.Length for versicolor Species

mean(iris$Petal.Length[iris$Species=="versicolor"])

#mean of Petal.Length for virginica Species

mean(iris$Petal.Length[iris$Species=="virginica"])

Output:

[1] 1.462

[1] 4.26

[1] 5.552

#finding mean of Petal.Length for all Species using by() function

by(iris$Petal.Length, iris$Species, mean)

Output:

iris$Species: setosa

[1] 1.462

----------------------------------------------------------------------------------

iris$Species: versicolor

[1] 4.26

----------------------------------------------------------------------------------

iris$Species: virginica

[1] 5.552

#finding Standard Deviation of Petal.Length for all Species using by() function

by(iris$Petal.Length, iris$Species, sd)

Output:

iris$Species: setosa

[1] 0.173664

----------------------------------------------------------------------------------

iris$Species: versicolor

[1] 0.469911

----------------------------------------------------------------------------------

iris$Species: virginica

[1] 0.5518947

#finding summary of Petal.Length for all Species using by() function

by(iris$Petal.Length, iris$Species, summary)

Output:

iris$Species: setosa

Min. 1st Qu. Median Mean 3rd Qu. Max.

1.000 1.400 1.500 1.462 1.575 1.900

----------------------------------------------------------------------------------

iris$Species: versicolor

Min. 1st Qu. Median Mean 3rd Qu. Max.

3.00 4.00 4.35 4.26 4.60 5.10

----------------------------------------------------------------------------------

iris$Species: virginica

Min. 1st Qu. Median Mean 3rd Qu. Max.

4.500 5.100 5.550 5.552 5.875 6.900

#Plotting Relationship between Species and Petal.Length variables using Box and whisker plot

boxplot(Petal.Length~Species,data=iris, xlab="Species",ylab="Petal Length",main="Box and whisker plot for Species and Petal Length",col=c("red","green","blue"))

Output:

Result:

Exp No Exp Name Date
3d Relationship between two continuous variables

Aim:-

Finding Relationship between two numerical variables using Covariance:

covxy = Σ(xi − x̄)(yi − ȳ) / (n − 1)

where x, y are the vector values, x̄ and ȳ are the means of x and y, and n is the number of observations

#finding covariance(relationship) between height and weight in women dataset

cov(women$height,women$weight)

Output:

[1] 69
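As a check, the covariance formula can be computed manually and compared with cov() on the built-in women dataset:

```r
# cov(x, y) = sum((x - mean(x)) * (y - mean(y))) / (n - 1)
x <- women$height
y <- women$weight
n <- length(x)
manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
print(manual)      # 69, identical to cov(women$height, women$weight)
print(cov(x, y))
```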

#Covariance between height and weight in centimetres and pounds

cov(women$height*2.54,women$weight)

Output:

[1] 175.26

#plotting scatter plot for height and weight variables in women dataset to show relationship

plot(women$weight,women$height,pch=19,xlab="weight", ylab = "height", main = "Relationship b/w height and weight")

Output:

Finding strength of Relationship between two numerical variables using Correlation:

r = Σ(xi − x̄)(yi − ȳ) / ((n − 1) · sx · sy)

equivalently, r = covxy / (sx · sy), where sx and sy are the standard deviations of x and y

#Correlation(How strong relationship) between height and weight

cor(women$height,women$weight)

Output:

[1] 0.9954948

#Correlation between height and weight in centimetres and pounds

cor(women$height*2.54,women$weight)

Output:

[1] 0.9954948
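Correlation is covariance scaled by the two standard deviations, which is why the unit conversion above leaves the result unchanged; a quick check:

```r
# r = cov(x, y) / (sd(x) * sd(y)); rescaling x cancels out of the ratio
x <- women$height
y <- women$weight
manual <- cov(x, y) / (sd(x) * sd(y))
print(manual)              # same value as cor(x, y)
print(cor(x * 2.54, y))    # unchanged by converting inches to centimetres
```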

#plotting scatter plot for height and weight variables in women dataset to show relationship

plot(women$weight,women$height,pch=19, xlab="weight", ylab = "height", main = "Relationship b/w height and weight")

abline(lm(height~weight,data=women),col="green")

Output:

Finding Correlation between multiple numerical variables in a Dataset:

#Dropping Non numerical data in dataset (Species in iris)

data<- iris[,-5]

#Finding correlation between data

cor(data)

Output:

Sepal.Length Sepal.Width Petal.Length Petal.Width

Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411

Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259

Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654

Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000

Result

Exp No Exp Name Date
3e Visualization methods

AIM:-

Visualizing Categorical and Continuous variables:

library(ggplot2)

qplot(Species,Petal.Length,data=iris,geom="boxplot",fill=Species)

Output:

Visualizing Two Categorical variables using mosaic() plot:

#installing vcd package for mosaic() plot

install.packages("vcd")

library(vcd)

#converting UCBAdmission dataset into a dataframe

ucbadata <- data.frame(UCBAdmissions)

mosaic(Freq~Gender+Admit, data=ucbadata,shade=TRUE, legend=FALSE, main="Frequency of Gender with Admission")

#the formula Freq~Gender+Admit is read as frequency broken down by Gender and by whether the applicant was admitted or rejected

Output:

Visualizing Two Continuous variables using scatter plot:

#loading ggplot2 package for plotting scatter plot

library(ggplot2)

qplot(height, weight, data=women, geom="point")

Output:

#Drawing a smooth line connecting scatter points and constructing a linear model
#note: in recent ggplot2 versions, use ggplot(women, aes(height, weight)) + geom_point() + geom_smooth(method = "lm") instead

qplot(height, weight, data=women, geom=c("point","smooth"),method="lm")

Output:

Result:

Exp No Exp Name Date
4.a Sampling from distribution - Binomial and Normal distribution

Aim:

Sampling from Distribution:

Binomial Distribution:

The binomial distribution is a discrete distribution with only two outcomes per trial, i.e. success or failure. All trials are independent, the probability of success remains the same, and the previous outcome does not affect the next one. The binomial distribution helps us find individual probabilities as well as cumulative probabilities over a certain range.

Functions for Binomial Distribution

We have four functions for handling binomial distribution in R namely:

dbinom():

dbinom(k, n, p)

where n is the total number of trials, p is the probability of success, and k is the value at which the probability is evaluated.

Script:

dbinom(3, size = 13, prob = 1 / 6)

probabilities <- dbinom(x = 0:3, size = 10, prob = 1 / 6)

data.frame(x = 0:3, probs = probabilities)

plot(0:3, probabilities, type = "l")

Output:

[1] 0.2138454

  x     probs
1 0 0.1615056
2 1 0.3230112
3 2 0.2907100
4 3 0.1550454

pbinom():

pbinom(k, n, p)

where n is the total number of trials, p is the probability of success, and k is the value up to which the cumulative probability is computed.

Script:

pbinom(3, size = 13, prob = 1 / 6)

plot(0:10, pbinom(0:10, size = 10, prob = 1 / 6), type = "l")

Output:

[1] 0.8419226

qbinom()

qbinom(P, n, p)

where P is the cumulative probability, n is the total number of trials and p is the probability of success.

Script:

qbinom(0.8419226, size = 13, prob = 1 / 6)

x <- seq(0, 1, by = 0.1)

y <- qbinom(x, size = 13, prob = 1 / 6)

plot(x, y, type = 'l')

Output:

[1] 3

rbinom()

rbinom(n, N, p)

where n is the number of observations, N is the total number of trials, and p is the probability of success.

Script:

rbinom(8, size = 13, prob = 1 / 6)

hist(rbinom(8, size = 13, prob = 1 / 6))

Output:

[1] 1 1 2 1 4 0 2 3

Normal Distribution:

The normal distribution is a probability function used in statistics that describes how data values are distributed. It is the most important probability distribution in statistics because of how often it arises in real scenarios, for example the heights of a population, shoe sizes, and IQ scores.

In R, there are 4 built-in functions to generate normal distribution:

dnorm()

dnorm(x, mean, sd)

Script:

x = seq(-15, 15, by=0.1)

y = dnorm(x, mean(x), sd(x))

plot(x, y)

Output:

pnorm()

pnorm(x, mean, sd)

Script:

x <- seq(-10, 10, by=0.1)

y <- pnorm(x, mean = 2.5, sd = 2)

plot(x, y)

Output:

qnorm()

qnorm(p, mean, sd)

Script:

x <- seq(0, 1, by=0.02)

#y <- pnorm(x, mean = 2.5, sd = 2)

y = qnorm(x, mean(x), sd(x))

plot(x, y)

Output:

rnorm()

rnorm(n, mean, sd)

x <- rnorm(10000, mean=90, sd=5)

hist(x, breaks=50)

Result:

Exp No Exp Name Date
4.b tTest, zTest, Chi Square Test

Aim:

tTest:

A t-test is used to determine whether the means of two groups are equal to each other. The assumption for the test is that both groups are sampled from normal distributions with equal variances. The null hypothesis is that the two means are equal, and the alternative is that they are not.

Classification of T-tests

One Sample T-test

Two sample T-test

Paired sample T-test

One Sample T-test

The One-Sample T-Test is used to test the statistical difference between a sample mean and a
known or assumed/hypothesized value of the mean in the population.

Script:

set.seed(0)

sweetSold <- c(rnorm(50, mean = 140, sd = 5))

t.test(sweetSold, mu = 150)

Output:

One Sample t-test

data: sweetSold

t = -15.249, df = 49, p-value < 2.2e-16

alternative hypothesis: true mean is not equal to 150

95 percent confidence interval:

138.8176 141.4217

sample estimates:

mean of x

140.1197

Two sample T-test

It is used to determine whether the difference between the two means is real or simply due to chance.

Script:

set.seed(0)

shopOne <- rnorm(50, mean = 140, sd = 4.5)

shopTwo <- rnorm(50, mean = 150, sd = 4)

t.test(shopOne, shopTwo, var.equal = TRUE)

Output:

Two Sample t-test

data: shopOne and shopTwo

t = -13.158, df = 98, p-value < 2.2e-16

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-11.482807 -8.473061

sample estimates:

mean of x mean of y

140.1077 150.0856

Paired sample T-test

This is a statistical procedure that is used to determine whether the mean difference between
two sets of observations is zero. In a paired sample t-test, each subject is measured two times,
resulting in pairs of observations.

Script:

set.seed(2820)

sweetOne <- c(rnorm(100, mean = 14, sd = 0.3))

sweetTwo <- c(rnorm(100, mean = 13, sd = 0.2))

t.test(sweetOne, sweetTwo, paired = TRUE)

Output:

Paired t-test

data: sweetOne and sweetTwo

t = 29.31, df = 99, p-value < 2.2e-16

alternative hypothesis: true mean difference is not equal to 0

95 percent confidence interval:

0.9892738 1.1329434

sample estimates:

mean difference

1.061109

zTest:

This function is based on the standard normal distribution and creates confidence intervals
and tests hypotheses for both one and two sample problems.

One Sample z-Test:

You can use the z.test() function from the BSDA package to perform one sample and two
sample z-tests in R.

Syntax:

z.test(x, y, alternative='two.sided', mu=0, sigma.x=NULL, sigma.y=NULL,conf.level=.95)

where:

x: values for the first sample

y: values for the second sample (if performing a two sample z-test)

alternative: the alternative hypothesis (‘greater’, ‘less’, ‘two.sided’)

mu: mean under the null or mean difference (in two sample case)

sigma.x: population standard deviation of first sample

sigma.y: population standard deviation of second sample

conf.level: confidence level to use

Script:

library(BSDA)

#enter IQ levels for 20 patients

data = c(88, 92, 94, 94, 96, 97, 97, 97, 99, 99,

105, 109, 109, 109, 110, 112, 112, 113, 114, 115)

#perform one sample z-test

z.test(data, mu=100, sigma.x=15)

Output:

One-sample z-Test

data: data

z = 0.90933, p-value = 0.3632

alternative hypothesis: true mean is not equal to 100

95 percent confidence interval:

96.47608 109.62392

sample estimates:

mean of x

103.05

Two Sample z-Test:

library(BSDA)

#enter IQ levels for 20 individuals from each city

cityA = c(82, 84, 85, 89, 91, 91, 92, 94, 99, 99,

105, 109, 109, 109, 110, 112, 112, 113, 114, 114)

cityB = c(90, 91, 91, 91, 95, 95, 99, 99, 108, 109,

109, 114, 115, 116, 117, 117, 128, 129, 130, 133)

#perform two sample z-test

z.test(x=cityA, y=cityB, mu=0, sigma.x=15, sigma.y=15)

Output:

Two-sample z-Test

data: cityA and cityB

z = -1.7182, p-value = 0.08577

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-17.446925 1.146925

sample estimates:

mean of x mean of y

100.65 108.80

Chi-Square Test:

The Chi-Square test in R is a statistical method used to determine whether two categorical variables have a significant association between them. The two variables are selected from the same population.

In this test we examine the p-value. Like all statistical tests, we state a null hypothesis and an alternate hypothesis, and we reject the null hypothesis if the p-value in the result is less than a predetermined significance level, which is usually 0.05.

Syntax of a chi-square test:

chisq.test(data)

Script:

data("mtcars")

table(mtcars$carb, mtcars$cyl)

chisq.test(mtcars$carb, mtcars$cyl)

Output:

Pearson's Chi-squared test

data: mtcars$carb and mtcars$cyl

X-squared = 24.389, df = 10, p-value = 0.006632

We have a high chi-squared value and a p-value of less than 0.05 significance level. So we
reject the null hypothesis and conclude that carb and cyl have a significant relationship.

Result:

Exp No Exp Name Date
4.c Density Function

Aim:

The probability density function of a vector x, denoted by f(x), describes the probability of the variable taking a certain value. A density plot is a representation of the distribution of a numeric variable that uses a kernel density estimate to show the probability density function of the variable. In R we use the density() function, which computes kernel density estimates; its return value is then passed to plot() to build the final density plot.

Syntax: density(x) #x is a vector

plot(density(x), main,xlab,ylab,col,lwd)

Where lwd is line width of the density curve

Example1:

set.seed(13531)

x <- rnorm(1000)

dx<-density(x)

plot(dx, main = "Kernel Density Plot", xlab = "X-Values", ylab = "Density of my X-Values",

col="red", lwd=2)

Output:

Example2: Filling area under density curve and drawing line at mean

We use polygon() to fill the area under density curve

set.seed(13531)

x <- rnorm(1000)

dx<-density(x)

plot(dx, main = "Kernel Density Plot",

xlab = "X-Values",

ylab = "Density of my X-Values",

col="red",

lwd=2)

polygon(density(x), col = "green") #for filling area under density curve

abline(v = mean(x), col = "blue", lwd=3) #for drawing line at mean of x

Output:

Example3: Multiple densities in same plot

x <- rnorm(100)

y <- rnorm(200)

plot(density(x), lwd=2) # Plot density of x

lines(density(y), col = "red", lwd=2) # Overlay density of y

Output:

Result:

Exp No Exp Name Date
5 EDA on Population Dataset

Aim:

Population Dataset:

Dataset: population_by_country_2020

Source link: https://www.kaggle.com/datasets/tanuprabhu/population-by-country-2020

Description:

The population dataset consists of a sorted list of countries by their population in the year 2020. There are 235 countries in the dataset, and 11 columns, each representing a different feature of the countries.

List of Columns:

Country (or dependency), Population (2020), Yearly Change, Net Change, Density
(P/Km²), Land Area (Km²), Migrants (net), Fert. Rate, Med. Age, Urban Pop , World
Share

Range: The range is the difference between the maximum and minimum elements in the given data; the data can be a vector or a dataframe. So range = maximum_value − minimum_value. (Note that R's range() function returns the minimum and maximum values themselves, not their difference.)
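Since range() returns the minimum and maximum rather than their difference, the numeric range as defined above can be obtained with diff() (a small sketch with illustrative values):

```r
# diff() on the two values returned by range() gives max - min
v <- c(801, 5460109, 1440297825)   # illustrative population values
r <- range(v)                       # c(801, 1440297825)
print(diff(r))                      # 1440297024
```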

Script:

data<-read.csv("population.csv")

View(data)

x<-data.frame(data)

na.omit(x)

#converting character data into integer

#since we can't remove NA from a character vector

#character vectors with NA in population dataset are

#Med..Age, Fert..Rate,Urban..Pop..

mage<-as.integer(data$Med..Age)

Frate<-as.integer(data$Fert..Rate)

Urbanpop<-as.integer(data$Urban.Pop)

print(range(x$Country..or.dependency.,na.rm = TRUE))

print(range(x$Population..2020.,na.rm = TRUE))

print(range(x$Yearly.Change,na.rm = TRUE))

print(range(x$Net.Change,na.rm = TRUE))

print(range(x$`Density..P.Km².`,na.rm = TRUE))

print(range(x$`Land.Area..Km².`,na.rm = TRUE))

print(range(x$Migrants..net.,na.rm = TRUE))

print(range(Frate,na.rm = TRUE))

print(range(mage,na.rm = TRUE))

print(range(Urbanpop,na.rm = TRUE))

print(range(x$World.Share,na.rm = TRUE))

Output:

> print(range(x$Country..or.dependency.,na.rm = TRUE))


[1] "Afghanistan" "Zimbabwe"
> print(range(x$Population..2020.,na.rm = TRUE))
[1] 801 1440297825
> print(range(x$Yearly.Change,na.rm = TRUE))
[1] "-0.03%" "3.84%"
> print(range(x$Net.Change,na.rm = TRUE))
[1] -383840 13586631
> print(range(x$`Density..P.Km².`,na.rm = TRUE))
[1] 0 26337
> print(range(x$`Land.Area..Km².`,na.rm = TRUE))
[1] 0 16376870
> print(range(x$Migrants..net.,na.rm = TRUE))

[1] -653249 954806
> print(range(Frate,na.rm = TRUE))
[1] 1 7
> print(range(mage,na.rm = TRUE))
[1] 15 48
> print(range(Urbanpop,na.rm = TRUE))
[1] 0 100
> print(range(x$World.Share,na.rm = TRUE))
[1] "0.00%" "4.25%"

Summary:
Script:
summary(data)

Output:
> summary(data)
Country..or.dependency. Population..2020. Yearly.Change Net.Change
Length:235 Min. :8.010e+02 Length:235 Min. : -383840
Class :character 1st Qu.:3.995e+05 Class :character 1st Qu.: 424
Mode :character Median :5.460e+06 Mode :character Median : 39170
Mean :3.323e+07 Mean : 346088
3rd Qu.:2.067e+07 3rd Qu.: 249660
Max. :1.440e+09 Max. :13586631

Density..P.Km². Land.Area..Km². Migrants..net. Fert..Rate


Min. : 0.0 Min. : 0 Min. :-653249.0 Length:235
1st Qu.: 37.0 1st Qu.: 2545 1st Qu.: -10047.0 Class :character
Median : 95.0 Median : 77240 Median : -852.0 Mode :character

Mean : 475.8 Mean : 553592 Mean : 6.3
3rd Qu.: 239.5 3rd Qu.: 403820 3rd Qu.: 9741.0
Max. :26337.0 Max. :16376870 Max. : 954806.0
NA's :34
Med..Age Urban.Pop World.Share
Length:235 Length:235 Length:235
Class :character Class :character Class :character
Mode :character Mode :character Mode :character

Mean:
Script:
#mean
mean(data$Population..2020.)
mean(data$Net.Change)

Output:
> mean(data$Population..2020.)
[1] 33227444
> mean(data$Net.Change)
[1] 346087.8

Variance:
#variance
var(data$Population..2020.,na.rm = TRUE)
var(data$`Density..P.Km².`,na.rm = TRUE)

Output:

> var(data$Population..2020.,na.rm = TRUE)
[1] 1.830702e+16
> var(data$`Density..P.Km².`,na.rm = TRUE)
[1] 5434894

Median:
#meadian
median(data$Population..2020.)
median(data$Net.Change)

Output:
> median(data$Population..2020.)
[1] 5460109
> median(data$Net.Change)
[1] 39170

Standard Deviation:
#Standard Deviation
sd(data$Population..2020.)
sd(data$Net.Change)

Output:
> sd(data$Population..2020.)
[1] 135303438
> sd(data$Net.Change)
[1] 1128260

Plots using ggplot2 package:


Histogram Plot:

Script:
library(ggplot2)
ggplot(x,aes(x=Urbanpop))+geom_histogram(color="blue",fill="white")

Output:

Box Plot:

xs=subset(x,x$Med..Age<20) #extracted only a few rows from the dataset, since Med..Age is a continuous variable

View(xs)

ggplot(xs,aes(x=Med..Age,y=Population..2020.,color="red",fill=Med..Age))+geom_boxplot()

Output:

Scatter Plot:

#scatter plot

h<-head(x)

ggplot(h,aes(x=Population..2020.,y=Country..or.dependency.))+geom_point(size=3,color="blue",shape=19)

Output:

Result:

Exp No Exp Name Date
6.a Null hypothesis significance testing

Aim:

A statistical hypothesis is an assumption made by the researcher about the data of the
population collected for any experiment. It is not mandatory for this assumption to be true
every time. Hypothesis testing, in a way, is a formal process of validating the hypothesis
made by the researcher.

Statistical Hypothesis Testing can be categorized into two types,

Null Hypothesis – Hypothesis testing is carried out in order to test the validity of a claim or assumption that is made about the larger population. This claim about the attributes of the trial is known as the Null Hypothesis. The null hypothesis is denoted by H0.

Alternative Hypothesis – An alternative hypothesis would be considered valid if the null hypothesis is fallacious. The evidence present in the trial is basically the data and the statistical computations that accompany it. The alternative hypothesis is denoted by H1 or Ha.

To perform hypothesis testing in R we use binom.test( ) function.

Hypothesis testing ultimately uses a p-value to weigh the strength of the evidence, or in other words what the data tell us about the population. The p-value ranges between 0 and 1. It can be interpreted in the following way:

A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so
you reject it.

A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to
reject it.

A p-value very close to the cutoff (0.05) is considered to be marginal and could go either
way.

Script:

binom.test(10,30)

Output:

Exact binomial test

data: 10 and 30

number of successes = 10, number of trials = 30, p-value = 0.09874

alternative hypothesis: true probability of success is not equal to 0.5

95 percent confidence interval:

0.1728742 0.5281200

sample estimates:

probability of success 0.3333333

As we can see in the output above, the p-value is not less than the default alpha value of 0.05, so we conclude that there is not enough evidence to reject the null hypothesis (that the probability of success is equal to 0.5).

Directional Hypothesis testing:

To perform directional hypothesis testing, the alternative argument is passed to the binom.test( ) function; its value can be "less", "greater", or "two.sided".

Script:

binom.test(10,30, alternative="less")

Output:

Exact binomial test

data: 10 and 30

number of successes = 10, number of trials = 30, p-value = 0.04937

alternative hypothesis: true probability of success is less than 0.5

95 percent confidence interval:

0.0000000 0.4994387

sample estimates:

probability of success 0.3333333

As we can see in the output above, the p-value is less than the default alpha value of 0.05, so we can reject the null hypothesis (that the probability of success is not less than 0.5).

Result:

Exp No Exp Name Date
Testing the mean of one sample and
6.b,c
Testing two means

Aim:

Testing the mean of one sample:


A common statistical hypothesis test is the one-sample t-test. You use it when you have one sample and you want to test whether that sample likely came from a population by comparing the sample mean against the known population mean. For this test to work, you have to know the population mean.

Here we use “precip” dataset for performing one sample t-test

• H0 = the average (mean) precipitation in the US is equal to the known average precipitation in the rest of the world

• H1 = the average (mean) precipitation in the US is different than the known average precipitation in the rest of the world

Script:

#mean of precip dataset

mean(precip)

Output:

34.88571

#performing one sample t.test

t.test(precip, mu=38)

Output:

One Sample t-test

data: precip

t = -1.901, df = 69, p-value = 0.06148

alternative hypothesis: true mean is not equal to 38

95 percent confidence interval:

31.61748 38.15395

sample estimates:

mean of x

34.88571

#directional hypothesis on precip dataset

t.test(precip, mu=38, alternative="less")

Output:

One Sample t-test

data: precip

t = -1.901, df = 69, p-value = 0.03074

alternative hypothesis: true mean is less than 38

95 percent confidence interval:

-Inf 37.61708

sample estimates:

mean of x

34.88571

Testing two means:


To perform hypothesis testing on two samples we use the t.test( ) function with a y~x formula as the first argument. Here we consider the mtcars dataset, and hypothesis testing is performed to compare the mileage of automatic-transmission cars with that of manual-transmission cars.

Script:

t.test(mpg ~ am, data=mtcars, alternative="less")

Output:

Welch Two Sample t-test

data: mpg by am

t = -3.7671, df = 18.332, p-value = 0.0006868

alternative hypothesis: true difference in means between group 0 and group 1 is less than 0

95 percent confidence interval:

-Inf -3.913256

sample estimates:

mean in group 0 mean in group 1

17.14737 24.39231

As we can see the p-value is less than 0.05 in the above output so we can reject the null
hypothesis and conclude that manual transmission cars give more mileage than automatic
gear cars.

Result:

Exp No Exp Name Date
Predicting Continuous Variables using
7.a
Linear models

Aim:

Linear Model:

A linear model is a model for a continuous outcome Y of the form y = mx + b

where y is the response (outcome) variable and x is the covariate (predictor)

The covariates X can be:

 a continuous variable (age, weight, temperature, etc.)


 Dummy variables coding a categorical covariate (more later)

m is the slope and b is the y-intercept of the model

R uses the function lm() to fit the linear models.

Viewing data for the model:


head(women)
Output:
height weight
1 58 115
2 59 117
3 60 120
Plotting data using plot() function:
plot(weight~height,data=women)

Constructing model using lm() function:
#syntax: lm(response ~ covariate, data=dataset)
model<-lm(weight~height,data=women)
abline(model,col="red")
print(model)

Output:
Call:
lm(formula = weight ~ height, data = women)
Coefficients:
(Intercept) height
-87.52 3.45

#printing summary and coefficients of the model


summary(model)
coef(model)

Output:
> summary(model)

Call:
lm(formula = weight ~ height, data = women)
Residuals:
Min 1Q Median 3Q Max

-1.7333 -1.1333 -0.3833 0.7417 3.1167
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
height 3.45000 0.09114 37.85 1.09e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14

> coef(model)
(Intercept) height
-87.51667 3.45000

Result:

Exp No Exp Name Date
Predicting Continuous Variables using
7.b
Linear Regression

Aim:

Linear Regression:

In linear regression, the two variables (response and predictor) are related through an equation in which the exponent (power) of both variables is 1. Mathematically, a linear relationship represents a straight line when plotted as a graph. A non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.

Viewing data for the model:

head(women)

Output:

height weight

1 58 115

2 59 117

3 60 120

Plotting data using plot() function:

plot(weight~height,data=women)

Constructing model using lm() function:

#syntax: lm(response ~ covariate, data=dataset)

model<-lm(weight~height,data=women)

abline(model,col="red")

print(model)

Output:

Call:

lm(formula = weight ~ height, data = women)

Coefficients:

(Intercept) height

-87.52 3.45

Predicting the weight of a new person

#predicting new weight for given height

hdata=data.frame(height=71)

predict(model,hdata)

Output:

157.4333

Result:

Exp No Exp Name Date
Predicting Continuous Variables using
7.c
Multiple Linear Regression

Aim:

Multiple Linear Regression:

Multiple regression is an extension of linear regression to relationships among more than two variables. In a simple linear relationship we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.

The general mathematical equation for multiple regression is –

y = a + b1x1 + b2x2 +...bnxn

where, y is the response variable, a, b1, b2...bn are the coefficients, x1, x2, ...xn are the
predictor variables.

Viewing data for the model:

data <- mtcars[,c("mpg","disp","hp","wt")]

head(data,3)

Output:

mpg disp hp wt

Mazda RX4 21.0 160 110 2.620

Mazda RX4 Wag 21.0 160 110 2.875

Datsun 710 22.8 108 93 2.320

Constructing model:

model <- lm(mpg~disp+hp+wt, data = data)

print(model)

Output:

Call:

lm(formula = mpg ~ disp + hp + wt, data = data)

Coefficients:

(Intercept) disp hp wt

37.105505 -0.000937 -0.031157 -3.800891

Predicting mpg with multiple covariates:

mpgdata=data.frame(disp=108,hp=110,wt=2.6)

predict(model,mpgdata)

Output:

23.69477

Result:

Exp No Exp Name Date
Bias-Variance trade-off and Cross
7.d
validation

Aim:

Bias: In statistical learning, the bias of a model refers to the error of the model. A model with
high bias will fail to accurately predict its dependent variable.

Variance: The variance of a model refers to how sensitive a model is to changes in the data
that built the model. A linear model with high variance is very sensitive to changes to the data
that it was built with, and the estimated coefficients will be unstable.
Cross validation is a way of testing the efficiency of a constructed model. To perform it, the data is split into training and test datasets.
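The train/test split described above can be illustrated with a simple manual holdout before moving to k-fold cross validation. This is a minimal sketch on the built-in mtcars dataset; the 70/30 split ratio is an arbitrary choice:

```r
# Holdout validation: fit on a random 70% of mtcars and measure the
# mean squared prediction error on the held-out 30%.
set.seed(1)                                    # reproducible split
idx   <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]
fit   <- lm(mpg ~ disp + hp + wt, data = train)
mse   <- mean((test$mpg - predict(fit, test))^2)
print(mse)
```

k-fold cross validation, shown next with cv.glm( ), simply repeats this idea with every observation taking a turn in the test set.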

To perform k-fold cross validation we will use cv.glm() function from boot package

Script:

library(boot)

model<-glm(mpg~.,data=mtcars)

modelcverror<-cv.glm(mtcars,model,K=5)

modelcverror$delta[2]

Output:

13.92993

model2<-glm(mpg~disp+hp+wt,data=mtcars)

model2cverror<-cv.glm(mtcars,model2,K=5)

model2cverror$delta[2]

Output:

7.846927

Result:

Exp No Exp Name Date
8.a Correlation between two variables

Aim:

Correlation: Correlations between variables play an important role in a descriptive


analysis. A correlation measures the relationship between two variables, that is, how they are
linked to each other. In this sense, a correlation allows us to know which variables evolve in the
same direction, which ones evolve in the opposite direction, and which ones are independent.

Correlation is usually computed on two quantitative variables, but it can also be computed on
two qualitative ordinal variables.

There are several correlation methods:

Pearson correlation is often used for quantitative continuous variables that have a linear
relationship

Spearman correlation (which is actually similar to Pearson but based on the ranked values
for each variable rather than on the raw data) is often used to evaluate relationships involving
at least one qualitative ordinal variable or two quantitative variables if the link is partially
linear

Kendall’s tau-b which is computed from the number of concordant and discordant pairs is
often used for qualitative ordinal variables

Preparing data for finding Correlation:

Script:

head(mtcars,3)

#removing vs and am columns as they are categorical variables

mtcarsdata<-mtcars[,-c(8,9)]

head(mtcarsdata,3)

Output:

mpg cyl disp hp drat wt qsec gear carb

Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 4 4

Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 4 4

Datsun 710 22.8 4 108 93 3.85 2.320 18.61 4 1

Pearson Correlation method: (Default)

cor(mtcarsdata$mpg, mtcarsdata$hp)

Output:

-0.7761684

Spearman Correlation method:

cor(mtcarsdata$hp, mtcarsdata$wt, method="spearman")

Output:

0.7746767

Kendall Correlation method:

cor(mtcarsdata$cyl, mtcarsdata$hp,method = "kendall")

Output:

0.7851865
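Beyond individual pairs, all pairwise correlations can be computed in one call by passing the whole data frame to cor( ), which returns a correlation matrix. A short sketch using the mtcarsdata frame prepared above:

```r
# Passing the whole data frame to cor() gives every pairwise Pearson
# correlation at once, as a symmetric matrix with 1s on the diagonal.
round(cor(mtcarsdata), 2)
```

The same method argument ("spearman" or "kendall") can be supplied here as well.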

Result:

Exp No Exp Name Date
8.b Plotting Scatter Plots

Aim:

Using plot( ) function:

Syntax: plot(x, y, xlab, ylab, main, col="value")

plot(x = mtcars$mpg, y = mtcars$hp, xlab = "Miles per Gallon",

ylab = "Horse Power", main = "mpg VS Hp",

col.lab = "red", col.main = "red",

col.axis = "red", col="blue")

Output:

Scatter Plot for several pairs of variables:

pairs(mtcarsdata[, c("mpg", "hp", "wt","cyl")])

Using ggplot2 package:

library(ggplot2)

ggplot(mtcars,aes(x=mpg,y=hp))+geom_point(size=2,color="blue")

Result:

Exp No Exp Name Date
8.c Correlation using Scatter Plot

Aim:

In this experiment we use ggscatterstats( ) function from ggstatsplot package

Pearson Correlation method:

By default the Pearson correlation method is considered in the ggscatterstats( ) function.

install.packages("ggstatsplot")

library(ggstatsplot)

ggscatterstats(data = mtcars,x = mpg,y = hp)

Output:

As we can see in the graph above, r = -0.78 and p < 0.05, so we can conclude that the correlation between Horse Power and Miles per Gallon is negative and statistically significant.

Spearman Correlation method: For the Spearman correlation method, use the type="nonparametric" argument in the ggscatterstats function

ggscatterstats(data = mtcars,x = mpg,y = hp, type="nonparametric")

Output:

Result:

Exp No Exp Name Date
9 Tests of Hypotheses

Aim:

Tests of hypotheses about the mean when the variance is known:

To perform hypothesis testing when the variance of a population is known, a z-test is performed. The BSDA package should be installed to use the z.test( ) function in an R script.

Example:

Let’s say we need to determine whether the average score of students in the exam is higher than 610 or not. We have the information that the standard deviation of the students’ scores is 100. So, we collect the data of 30 students using random samples and get the following data:

670,730,540,670,480,800,690,560,590,620,700,660,640,710,650,490,800

,600,560,700,680,550,580,700,705,690,520,650,660,790

Assume that the score follows a normal distribution. Test at 5% level of significance.

Script:

library("BSDA")

marks <- c(670,730,540,670,480,800,690,560,590,620,700,660,640,710,650,490,800

,600,560,700,680,550,580,700,705,690,520,650,660,790)

z.test(x=marks,mu=610,alternative = "greater",sigma.x = 100)

Output:

One-sample z-Test

data: marks

z = 1.9809, p-value = 0.0238


alternative hypothesis: true mean is greater than 610

95 percent confidence interval:

616.1359 NA

sample estimates:

mean of x

646.1667

As we can see the p-value is less than 0.05 in the above output so we can reject the null
hypothesis and conclude that average score of students is greater than 610.

Computing the p-Value:

Script:

mu0 <- 5 #hypothesised population mean

sd <- 2 #population standard deviation

n <- 20 #sample size

meanvalue <- 7 #sample mean

zscore <- (meanvalue-mu0)/(sd/sqrt(n))

print(zscore)

pvalue <- 2*pnorm(-abs(zscore))

print(pvalue)

Output:

[1] 4.472136
[1] 7.744216e-06

Result:

Exp No Exp Name Date
10 Estimating A Linear Relationship

Aim:

In R, linear relationships between continuous variables are modelled using the lm( ) function. Using the lm( ) function, both simple linear regression and multiple linear regression models can be constructed.

Syntax:

lm( fitting_formula, dataframe )

Parameter:

 fitting_formula: determines the formula for the linear model.

 dataframe: determines the name of the data frame that contains the data.

Least Square Estimates:

The least squares principle provides a way of choosing the coefficients effectively by minimising the sum of the squared errors. That is, we choose the values of β0,β1,…,βk that minimise the sum of squared errors.
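As a check on the least squares principle, the coefficients lm( ) returns can be reproduced from the closed-form normal equations. A sketch using the women dataset from experiment 7:

```r
# lm() chooses coefficients minimising the residual sum of squares;
# the minimiser has the closed form b = (X'X)^-1 X'y.
X <- cbind(1, women$height)    # design matrix with an intercept column
y <- women$weight
b <- solve(t(X) %*% X) %*% t(X) %*% y
print(b)    # matches coef(lm(weight ~ height, data = women)): -87.51667, 3.45
```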

Evaluating the Goodness of fit of a model:

To evaluate the goodness of fit of a model constructed using the lm( ) function, the R2 coefficient and the Residual standard error are used. The model is said to fit well if the R2 value is high. Another way of evaluating the goodness of fit of a constructed model is the Residual standard error: the lower the residual standard error, the better the model fits.

Generally, the Residual standard error is compared with the sample mean of y or the standard deviation of y.

Script:

#constructing simple linear regression model

simplelinearmodel<-lm(mpg~hp,data = mtcars)

#constructing multiple linear regression model

multiplelinearmodel<-lm(mpg~hp+cyl+wt,data=mtcars)

summary(simplelinearmodel)

summary(multiplelinearmodel)

sd(mtcars$mpg)

Output:

> summary(simplelinearmodel)

Call:

lm(formula = mpg ~ hp, data = mtcars)

Residuals:

Min 1Q Median 3Q Max

-5.7121 -2.1122 -0.8854 1.5819 8.2360

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 30.09886 1.63392 18.421 < 2e-16 ***

hp -0.06823 0.01012 -6.742 1.79e-07 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.863 on 30 degrees of freedom

Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892

F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07

> summary(multiplelinearmodel)

Call:

lm(formula = mpg ~ hp + cyl + wt, data = mtcars)

Residuals:

Min 1Q Median 3Q Max

-3.9290 -1.5598 -0.5311 1.1850 5.8986

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 38.75179 1.78686 21.687 < 2e-16 ***

hp -0.01804 0.01188 -1.519 0.140015

cyl -0.94162 0.55092 -1.709 0.098480 .

wt -3.16697 0.74058 -4.276 0.000199 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.512 on 28 degrees of freedom

Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263

F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11

> sd(mtcars$mpg)

[1] 6.026948

As we can see in the above output, the R2 value of the simple linear model is 0.60 and the R2 value of the multiple linear model is 0.84, so we can conclude that the multiple linear model is doing better than the simple linear model. Also, the Residual standard error of the simple linear model is 3.86 and that of the multiple linear model is 2.51, which also tells us that the multiple linear model has a better goodness of fit to the original dataset.

Result:

Exp No Exp Name Date
Defining user defined classes and
11.a
operations

Aim:

A class is a blueprint that helps to create an object and contains its member variables along with their attributes.

In R we can create classes in two ways namely S3 and S4

Creating S3 class and methods:

Syntax: listname<- list(value1, value2,…..)

class(listname) <- “classname”

Script:

marks<-list(name="S Charan", regNo="19F41A0586", score="41")

class(marks)<-"studentmarks"

marks

#creating a method for the studentmarks class

display<-function(obj){

cat("Name of the student is ",obj$name)

cat("\nReg No: ",obj$regNo)

cat("\nMarks Secured: ",obj$score)

}

display(marks)

Output:

$name

[1] "S Charan"

$regNo

[1] "19F41A0586"

$score

[1] "41"

attr(,"class")

[1] "studentmarks"

Name of the student is S Charan

Reg No: 19F41A0586

Marks Secured: 41

Creating S4 class and methods:

For creating an S4 class the setClass( ) function is used, and the new( ) function is used to create objects of the class. We can define methods for an S4 class using the setMethod( ) function.

Script:

setClass("studentmarks",slots = list(name="character", regNo="character",score="numeric"))

#creating object for S4 class using new() function

objcharan<-new("studentmarks",name="S Charan",regNo="19F41A0586",score=41)

objcharan

#creating a method for the S4 class using the setMethod() function

setMethod("show",

"studentmarks",

function(object){

cat("Name of the student is ",object@name)

cat("\nReg No: ",object@regNo)

cat("\nMarks Secured: ",object@score)

})

show(objcharan)

Output:

Name of the student is S Charan

Reg No: 19F41A0586

Marks Secured: 41

Result:

Exp No Exp Name Date
11.b Conditional Statements

Aim:

Conditional statements, also called decision statements, help in making decisions that control the flow of execution of statements in a program.

R provides the following conditional statements:

 Simple if
 if-else
 if-else-if ladder
 Nested if-else
 Switch statement

Simple if Statement:

Syntax:

if(condition is true){

execute this statement

}

Example:

number<-scan()

if(number<0){

print("The number is negative")

}

Output:

> number<-scan()

1: -3

[1] "The number is negative"

if-else Statement:

Syntax:

if(condition is true) {

execute this statement

} else {

execute this statement

}

Example:

number<-scan()

if(number<0){

print("The number is negative")

}else{

print("The given number is Positive")

}

Output:

> number<-scan()

1: 61

[1] "The given number is Positive"

if-else-if ladder Statement:

Syntax:

if(condition 1 is true) {
execute this statement
} else if(condition 2 is true) {
execute this statement
} else {
execute this statement
}

Example:

number<-scan()

if(number==0){

print("The number is Zero")

}else if(number>0){

print("The number is Positive")

}else{

print("The number is Negative")

}

Output:

> number<-scan()

1: 0

[1] "The number is Zero"

Nested if-else Statement:

Syntax:

if(parent condition is true) {


if( child condition 1 is true) {
execute this statement
} else {
execute this statement
}
} else {
if(child condition 2 is true) {
execute this statement
} else {
execute this statement

}
}

Example:

number<-scan()

if(number==0){

print("The number is Zero")

}else if(number>0){

print("The number is Positive")

if(number%%2==0){

print("It is an even number")

}else{

print("It is an odd number")

}

}else{

print("The number is Negative")

}

Output:

> number<-scan()

1: 5

[1] "The number is Positive"

[1] "It is an odd number"

Switch Statement:

Syntax:

switch (expression, case1, case2, case3,…,case n )

Example:


print("Enter a value:")

a<-scan()

print("Enter b value")

b<-scan()

cat("\n1.Addition\n2.Subtraction\n3.Multiplication")

print("Enter your choice:")

choice<-scan()

print(choice)

result<-switch(as.character(choice),

"1"=cat("\nAddition is =",a+b),

"2"=cat("\nSubtraction is =",a-b),

"3"=cat("\nMultiplication is =",a*b),

print("invalid choice")

)

Output:

"Enter a value:"

1: 5

"Enter b value"

1: 6

1.Addition

2.Subtraction

3.Multiplication

"Enter your choice:"

1: 1

Addition is = 11

Result:

Exp No Exp Name Date
11.c Loop and Iteration Statements

Aim:

R provides the following looping statements

 For loop
 While loop
 Repeat loop

for loop:

A for loop is used to iterate a vector.

Syntax:

for (value in vector) {

statements

}

Example:

data<-1:10

for(i in data){

if(i%%2==0){

cat(" ", i)

}

}
Output:

2 4 6 8 10

While loop:

In a while loop, the condition is checked first and then the body of the loop executes. For n iterations of the body, the condition will be checked n+1 times.

Syntax:

while (test_expression) {

statement

}

Example:

i=1

while(i<=10){

if(i%%2==0){

cat(" ", i)

}

i=i+1

}
Output:

2 4 6 8 10

Repeat loop:

A repeat loop is used to iterate a block of code. It is a special type of loop in which there is no
condition to exit from the loop. For exiting, we include a break statement with a user-defined
condition. This property of the loop makes it different from the other loops.

Syntax:

repeat {

commands

if(condition) {

break

}

}

Example:

i=1

repeat{

if(i==10){

break

}

if(i%%2==0){

cat(" ", i)

}

i=i+1

}

Output:

2 4 6 8

Result:

Exp No Exp Name Date
12 Statistical Functions In R

Aim:

Estimates and Statistics Functions:


Mean

 The average of a dataset, defined as the sum of all observations divided by the
number of observations.

 In R, run mean(Vector) to calculate.

 Example: mean(iris$Sepal.Length)

Variance

 A measure of the spread of your data.

 var(Vector)

 var(iris$Sepal.Length)

Standard Deviation

 The amount any observation can be expected to differ from the mean.

 sd(Vector)

 sd(iris$Sepal.Length)

Median

 A robust estimate of the center of the data.

 median(Vector)

 median(iris$Sepal.Length)

Minimum

 The smallest value.

 min(Vector)

 min(iris$Sepal.Length)

Maximum

 The largest value.

 max(Vector)

 max(iris$Sepal.Length)

Models and Test functions:


t Test

 A method of comparing the means of two groups

 If your group variable has more than two levels, don’t use a t test - use an ANOVA
instead

 t.test(Vector1, Vector2)

Chi-squared Test

 A test to see if two categorical variables are related. The null hypothesis is that
both variables are independent from one another

 For more information - particularly on what form your data should be in - see R's built-in help via ?chisq.test.

 chisq.test(Vector1, Vector2)

Linear Models

 A type of regression which predicts the value of a response variable at given values
of independent predictor variables

 You can add multiple predictor vectors using +, and include interaction terms
using + Vector1:Vector2.

 lm(ResponseVector ~ PredictorVectors, data)

 lm(Sepal.Length ~ Species, data = iris)
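The functions above can be exercised together in one short script. iris is built into R, so this runs as-is:

```r
# Descriptive statistics on one column of the built-in iris data,
# followed by a two-sample t test and a linear model.
x <- iris$Sepal.Length
print(c(mean = mean(x), var = var(x), sd = sd(x),
        median = median(x), min = min(x), max = max(x)))
t.test(iris$Sepal.Length, iris$Petal.Length)       # compare two means
summary(lm(Sepal.Length ~ Species, data = iris))
```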

Result:

Beyond Syllabus

Exp No Exp Name Date
1 EDA on Parkinsons Disease Data

Aim:


Dataset: Parkinson’s Disease Data Set

Source link: https://www.kaggle.com/datasets/vikasukani/parkinsons-disease-data-set

Description:

This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and
each row corresponds to one of 195 voice recordings from these individuals ("name"
column). The main aim of the data is to discriminate healthy people from those with PD,
according to the "status" column which is set to 0 for healthy and 1 for PD.

The data is in ASCII CSV format. The rows of the CSV file contain an instance
corresponding to one voice recording. There are around six recordings per patient, the name
of the patient is identified in the first column.

List of Columns:

Matrix column entries (attributes):


name - ASCII subject name and recording number
MDVP:Fo(Hz) - Average vocal fundamental frequency
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
MDVP:Flo(Hz) - Minimum vocal fundamental frequency
MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several
measures of variation in fundamental frequency
MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shim
mer:DDA - Several measures of variation in amplitude
NHR, HNR - Two measures of the ratio of noise to tonal components in the voice
status - The health status of the subject (one) - Parkinson's, (zero) - healthy
RPDE, D2 - Two nonlinear dynamical complexity measures
DFA - Signal fractal scaling exponent
spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation

Range: The range can be defined as the difference between the maximum and minimum elements in the given data; the data can be a vector or a dataframe. So we can define the range as maximum_value – minimum_value.
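Note that R's range( ) function returns the pair c(minimum, maximum) rather than the difference itself; the single-number range in the sense defined above can be obtained with diff( ). A small sketch:

```r
# range() returns the pair c(min, max); diff() of that pair gives the
# single-number range in the sense defined above.
v <- c(88.333, 119.992, 260.105)
print(range(v))          # 88.333 260.105
print(diff(range(v)))    # 171.772 = maximum_value - minimum_value
```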

Script:

data<-read.csv("parkinsons.data")

View(data)

head(data,3)

#Finding Range

print(range(data$MDVP.Fo.Hz.,na.rm=TRUE))

print(range(data$MDVP.Fhi.Hz., na.rm=TRUE))

print(range(data$MDVP.Flo.Hz., na.rm=TRUE))

print(range(data$MDVP.Jitter..., na.rm=TRUE))

Output:

> print(range(data$MDVP.Fo.Hz.,na.rm=TRUE))
[1] 88.333 260.105
> print(range(data$MDVP.Fhi.Hz., na.rm=TRUE))
[1] 102.145 592.030
> print(range(data$MDVP.Flo.Hz., na.rm=TRUE))
[1] 65.476 239.170
> print(range(data$MDVP.Jitter..., na.rm=TRUE))
[1] 0.00168 0.03316

Summary:
Script:
summary(data)

Output:
> summary(data)
name MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz.
Length:195 Min. : 88.33 Min. :102.1 Min. : 65.48
Class :character 1st Qu.:117.57 1st Qu.:134.9 1st Qu.: 84.29
Mode :character Median :148.79 Median :175.8 Median :104.31
Mean :154.23 Mean :197.1 Mean :116.32
3rd Qu.:182.77 3rd Qu.:224.2 3rd Qu.:140.02
Max. :260.11 Max. :592.0 Max. :239.17
MDVP.Jitter... MDVP.Jitter.Abs. MDVP.RAP MDVP.PPQ
Min. :0.001680 Min. :7.000e-06 Min. :0.000680 Min. :0.000920
1st Qu.:0.003460 1st Qu.:2.000e-05 1st Qu.:0.001660 1st Qu.:0.001860
Median :0.004940 Median :3.000e-05 Median :0.002500 Median :0.002690
Mean :0.006220 Mean :4.396e-05 Mean :0.003306 Mean :0.003446
3rd Qu.:0.007365 3rd Qu.:6.000e-05 3rd Qu.:0.003835 3rd Qu.:0.003955
Max. :0.033160 Max. :2.600e-04 Max. :0.021440 Max. :0.019580
Jitter.DDP MDVP.Shimmer MDVP.Shimmer.dB. Shimmer.APQ3
Min. :0.002040 Min. :0.00954 Min. :0.0850 Min. :0.004550
1st Qu.:0.004985 1st Qu.:0.01650 1st Qu.:0.1485 1st Qu.:0.008245
Median :0.007490 Median :0.02297 Median :0.2210 Median :0.012790
Mean :0.009920 Mean :0.02971 Mean :0.2823 Mean :0.015664
3rd Qu.:0.011505 3rd Qu.:0.03789 3rd Qu.:0.3500 3rd Qu.:0.020265
Max. :0.064330 Max. :0.11908 Max. :1.3020 Max. :0.056470
Shimmer.APQ5 MDVP.APQ Shimmer.DDA NHR
Min. :0.00570 Min. :0.00719 Min. :0.01364 Min. :0.000650

1st Qu.:0.00958 1st Qu.:0.01308 1st Qu.:0.02474 1st Qu.:0.005925
Median :0.01347 Median :0.01826 Median :0.03836 Median :0.011660
Mean :0.01788 Mean :0.02408 Mean :0.04699 Mean :0.024847
3rd Qu.:0.02238 3rd Qu.:0.02940 3rd Qu.:0.06080 3rd Qu.:0.025640
Max. :0.07940 Max. :0.13778 Max. :0.16942 Max. :0.314820
HNR status RPDE DFA
Min. : 8.441 Min. :0.0000 Min. :0.2566 Min. :0.5743
1st Qu.:19.198 1st Qu.:1.0000 1st Qu.:0.4213 1st Qu.:0.6748
Median :22.085 Median :1.0000 Median :0.4960 Median :0.7223
Mean :21.886 Mean :0.7538 Mean :0.4985 Mean :0.7181
3rd Qu.:25.076 3rd Qu.:1.0000 3rd Qu.:0.5876 3rd Qu.:0.7619
Max. :33.047 Max. :1.0000 Max. :0.6852 Max. :0.8253
spread1 spread2 D2 PPE
Min. :-7.965 Min. :0.006274 Min. :1.423 Min. :0.04454
1st Qu.:-6.450 1st Qu.:0.174350 1st Qu.:2.099 1st Qu.:0.13745
Median :-5.721 Median :0.218885 Median :2.362 Median :0.19405
Mean :-5.684 Mean :0.226510 Mean :2.382 Mean :0.20655
3rd Qu.:-5.046 3rd Qu.:0.279234 3rd Qu.:2.636 3rd Qu.:0.25298
Max. :-2.434 Max. :0.450493 Max. :3.671 Max. :0.52737

Mean:
Script:
#mean
mean(data$MDVP.Fo.Hz.)
mean(data$MDVP.Jitter...)
mean(data$MDVP.Shimmer)
Output:
> mean(data$MDVP.Fo.Hz.)
[1] 154.2286

> mean(data$MDVP.Jitter...)
[1] 0.006220462
> mean(data$MDVP.Shimmer)
[1] 0.02970913
Variance:
#variance
var(data$MDVP.Fo.Hz.)
var(data$NHR)
var(data$MDVP.Shimmer)
Output:
> var(data$MDVP.Fo.Hz.)
[1] 1713.137
> var(data$NHR)
[1] 0.001633651
> var(data$MDVP.Shimmer)
[1] 0.0003555839

Median:
#median
median(data$MDVP.Fo.Hz.)
median(data$MDVP.Jitter...)
median(data$MDVP.Shimmer)
Output:
> median(data$MDVP.Fo.Hz.)
[1] 148.79
> median(data$MDVP.Jitter...)
[1] 0.00494
> median(data$MDVP.Shimmer)
[1] 0.02297

Standard Deviation:
#standard deviation
sd(data$MDVP.Fo.Hz.)
sd(data$MDVP.Jitter...)
sd(data$MDVP.Shimmer)
Output:
> sd(data$MDVP.Fo.Hz.)
[1] 41.39006
> sd(data$MDVP.Jitter...)
[1] 0.004848134
> sd(data$MDVP.Shimmer)
[1] 0.01885693

Plots using ggplot2 package:


Histogram Plot:
Script:
library(ggplot2)
ggplot(data=data,aes(x=MDVP.Shimmer))+geom_histogram(color="red",fill="blue")
Output:

Box Plot:

ggplot(data=data,aes(x=status, y=Shimmer.DDA))+geom_boxplot(color="red",fill="green")

Output:

Scatter Plot:

#scatter plot

ggplot(data = data,aes(x=MDVP.Fo.Hz.,y=MDVP.Jitter...))+geom_point(size=3,color="blue",shape=19)

Output:

Result:

Exp No Exp Name Date
2 Finding Correlation between Variables in Parkinson's Disease Data

Aim:

Correlation: Correlations between variables play an important role in a descriptive
analysis. A correlation measures the relationship between two variables, that is, how they are
linked to each other. In this sense, a correlation tells us which variables evolve in the
same direction, which ones evolve in opposite directions, and which ones are independent.

Correlation is usually computed on two quantitative variables, but it can also be computed on
two qualitative ordinal variables.

There are several correlation methods:

Pearson correlation is often used for quantitative continuous variables that have a linear
relationship.

Spearman correlation (similar to Pearson, but computed on the ranked values of each variable
rather than on the raw data) is often used when at least one variable is qualitative ordinal,
or for two quantitative variables whose relationship is only partially linear.

Kendall's tau-b, computed from the number of concordant and discordant pairs, is often used
for qualitative ordinal variables.
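The difference between the three methods can be seen on a hypothetical monotone but non-linear relationship: the rank-based methods report a perfect correlation while Pearson does not. A minimal sketch (toy values, not dataset columns):

```r
# Toy data: y increases with x, but not linearly
x <- 1:8
y <- x^3

cor(x, y)                        # Pearson: below 1, since the relationship is not linear
cor(x, y, method = "spearman")   # Spearman: exactly 1, the ranks agree perfectly
cor(x, y, method = "kendall")    # Kendall: exactly 1, every pair is concordant
```

This is why Spearman and Kendall are preferred when the link between variables is monotone but only partially linear.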

Preparing data for finding Correlation:

Script:

data<-read.csv("parkinsons.data")

head(data,3)

Output:

            name MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz. MDVP.Jitter...
1 phon_R01_S01_1     119.992      157.302       74.997        0.00784
2 phon_R01_S01_2     122.400      148.650      113.819        0.00968
3 phon_R01_S01_3     116.682      131.111      111.555        0.01050

  MDVP.Jitter.Abs. MDVP.RAP MDVP.PPQ Jitter.DDP MDVP.Shimmer MDVP.Shimmer.dB.
1            7e-05  0.00370  0.00554    0.01109      0.04374            0.426
2            8e-05  0.00465  0.00696    0.01394      0.06134            0.626
3            9e-05  0.00544  0.00781    0.01633      0.05233            0.482

  Shimmer.APQ3 Shimmer.APQ5 MDVP.APQ Shimmer.DDA     NHR    HNR status     RPDE
1      0.02182      0.03130  0.02971     0.06545 0.02211 21.033      1 0.414783
2      0.03134      0.04518  0.04368     0.09403 0.01929 19.085      1 0.458359
3      0.02757      0.03858  0.03590     0.08270 0.01309 20.651      1 0.429895

       DFA   spread1  spread2       D2      PPE
1 0.815285 -4.813031 0.266482 2.301442 0.284654
2 0.819521 -4.075192 0.335590 2.486855 0.368674
3 0.825288 -4.443179 0.311173 2.342259 0.332634

Pearson Correlation method: (Default)

cor(data$MDVP.Fo.Hz., data$MDVP.Jitter...)

Output:

[1] -0.1180026

Spearman Correlation method:

cor(data$MDVP.Shimmer , data$MDVP.Jitter..., method="spearman")

Output:

[1] 0.7291587

Kendall Correlation method:

cor(data$MDVP.Shimmer , data$MDVP.Jitter..., method="kendall")

Output:

[1] 0.5442087
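cor() returns only a point estimate; cor.test() additionally reports a p-value and a confidence interval, which helps judge whether an observed correlation could have arisen by chance. A minimal sketch on simulated stand-ins for two dataset columns (the variable names and values are hypothetical):

```r
set.seed(1)
# Simulated stand-ins for columns such as MDVP.Shimmer and MDVP.Jitter...
shimmer <- runif(50, 0.01, 0.12)
jitter  <- 0.05 * shimmer + rnorm(50, sd = 0.001)

ct <- cor.test(shimmer, jitter, method = "pearson")
ct$estimate   # the same value cor(shimmer, jitter) would return
ct$p.value    # a small p-value means the correlation is unlikely to be zero
```

The same method= argument ("pearson", "spearman", or "kendall") applies to cor.test() as well.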

Result:

Exp No Exp Name Date
3 Constructing Predictive Model for PD Detection

Aim:

Multiple Linear Regression:

Multiple regression is an extension of simple linear regression that models the relationship
between one response variable and two or more predictor variables. In simple linear regression
we have one predictor and one response variable; in multiple regression we have more than one
predictor variable and one response variable.

The general mathematical equation for multiple regression is:

y = a + b1x1 + b2x2 + ... + bnxn

where y is the response variable, a is the intercept, b1, b2, ..., bn are the coefficients,
and x1, x2, ..., xn are the predictor variables.
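The equation above can be illustrated on synthetic data where the true coefficients are known in advance, so lm() can be seen recovering them. The names and values below are hypothetical, chosen only for the demonstration:

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)

# True model: a = 2, b1 = 3, b2 = -1.5, plus a little noise
y <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.1)

fit <- lm(y ~ x1 + x2)
round(coef(fit), 1)   # close to the true values 2.0, 3.0, -1.5
```

With 200 observations and small noise, the estimated coefficients land very close to the true ones, which is exactly what lm() is doing with the Parkinson's data below, just with real (noisier) measurements.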

Viewing data for the model:

PDdata <- data[, c("status","MDVP.Fo.Hz.","MDVP.Jitter...","MDVP.Shimmer","HNR","spread1")]

head(PDdata,3)

Output:

status MDVP.Fo.Hz. MDVP.Jitter... MDVP.Shimmer HNR spread1

1 1 119.992 0.00784 0.04374 21.033 -4.813031

2 1 122.400 0.00968 0.06134 19.085 -4.075192

3 1 116.682 0.01050 0.05233 20.651 -4.443179

Constructing model:

model <- lm(status ~ MDVP.Fo.Hz. + MDVP.Jitter... + MDVP.Shimmer + HNR + spread1, data = PDdata)

print(model)

Output:

Call:

lm(formula = status ~ MDVP.Fo.Hz. + MDVP.Jitter... + MDVP.Shimmer +

HNR + spread1, data = PDdata)

Coefficients:

(Intercept) MDVP.Fo.Hz. MDVP.Jitter... MDVP.Shimmer HNR

2.411416 -0.001774 -26.238660 4.058376 -0.004570

spread1

0.218370

Predicting PD status with multiple covariates:

testdata = data.frame(MDVP.Fo.Hz.=119.9, MDVP.Jitter...=0.007, MDVP.Shimmer=0.04, HNR=21, spread1=-4.8)

result=predict(model,testdata)

print(round(result))
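Because status is a binary (0/1) outcome, a linear model can predict values outside [0, 1]; logistic regression fitted with glm() is the more conventional choice for this kind of classification. A minimal sketch on simulated data (the column name mirrors the dataset, but the values are made up):

```r
set.seed(7)
n <- 100

# Simulated spread1 values; higher spread1 is made more likely to mean PD (status = 1)
spread1 <- rnorm(n, mean = -5.7, sd = 1)
status  <- rbinom(n, 1, plogis(2 * (spread1 + 5.7)))

logit <- glm(status ~ spread1, family = binomial)

# Predicted probability of PD for a new observation, then thresholded at 0.5
p <- predict(logit, data.frame(spread1 = -4.8), type = "response")
as.integer(p > 0.5)
```

The predicted value is a probability between 0 and 1, so rounding or thresholding it gives a proper 0/1 class label, unlike rounding the output of lm().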

Output:

Result:
