CB161 (R Lab Manual)

This lab manual covers measures of central tendency (mean, median, mode, geometric mean, harmonic mean) and dispersion (range, quartile deviation, mean deviation, standard deviation, coefficient of variation), correlation and regression analysis (correlation coefficients, regression lines, rank correlation, multiple correlation, multiple linear regression), curve fitting, ANOVA, time series, goodness-of-fit tests, parametric and non-parametric tests, and graphical representation of data using R. Each experiment gives definitions, formulas, and R source code with sample output.


EXPERIMENT 1

AIM: Measures of central tendency


a) Mean b) Median c) Mode d) Geometric Mean e) Harmonic Mean

A measure of central tendency represents a whole data set by a single value, indicating where the data are centred. The three classical measures are the mean, median and mode; the geometric and harmonic means are special-purpose averages also computed here.

a) Mean: The mean (arithmetic average) is the sum of the observations divided by the number of observations:

Mean = (x₁ + x₂ + … + xₙ) / n = Σxᵢ / n, where n = number of terms


Source code:
marks <- c (97, 78, 57, 64, 87)
result <- mean(marks)
print(result)
Output: [1] 76.6

b) Median: The median is the middle value of the ordered data set; it splits the data into two halves. If the number of observations n is odd, the median is the ((n + 1)/2)th value; if n is even, it is the average of the (n/2)th and (n/2 + 1)th values.

• Syntax: median(x, na.rm = FALSE)
• where x is a numeric vector and na.rm = TRUE removes missing values before computing
• The mean is the average of a data set.
• The mode is the most common value in a data set.
• The median is the middle value of the ordered data.
Source code:
marks <- c (97, 78, 57, 64, 87)
result <- median(marks)
print(result)

Output: [1] 78

c) Mode: The mode is the value that occurs most often in a data set. Base R has no built-in function for the statistical mode, so a small function is written with table().

Source code:

marks <- c(97, 78, 57, 78, 97, 66, 87, 64, 87, 78)
# sort the frequency table in decreasing order and take the name of the first entry
get_mode <- function(v)
{
  return(names(sort(-table(v)))[1])
}
get_mode(marks)
Output: [1] "78"
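Since names(...)[1] returns the modal value as a character string, an alternative that keeps the original numeric type may be preferred. A minimal sketch (the helper name is illustrative):

stat_mode <- function(v)
{
  uv <- unique(v)                          # candidate values
  uv[which.max(tabulate(match(v, uv)))]    # value with the highest frequency
}
stat_mode(marks)   # returns 78 as a number for the marks above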
d) Geometric Mean: The geometric mean is the nth root of the product of n numbers, GM = (x₁ · x₂ · … · xₙ)^(1/n). In R it is computed as exp(mean(log(x))), which avoids overflow for large products. Functions used:
• prod(): calculates the product of all elements in a vector (the direct form prod(x)^(1/length(x)) also works for small data)
• ^ (power): raises a number to a power
• length(x): returns the number of elements in the vector x
• exp(): calculates the exponential of a number
• mean(): calculates the average of a vector
• log(): calculates the natural logarithm of a vector
Source code:
x <- c (4, 8, 9, 9, 12, 14, 17)
exp(mean(log(x)))

Output: [1] 9.579479


Geometric Mean of Columns in Data Frame
Source code:
df <- data.frame(a=c(1, 3, 4, 6, 8, 8, 9),
                 b=c(7, 8, 8, 7, 13, 14, 16),
                 c=c(11, 13, 13, 18, 19, 19, 22),
                 d=c(4, 8, 9, 9, 12, 14, 17))
exp(mean(log(df$a)))
Output:[1] 4.567508
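To obtain the geometric mean of every column at once, the same expression can be applied column-wise. A minimal sketch using base R's sapply (assuming all values are positive, as log() requires):

sapply(df, function(col) exp(mean(log(col))))   # geometric mean of each of the columns a, b, c, d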

e) Harmonic Mean: The harmonic mean (HM) is the reciprocal of the arithmetic mean of the reciprocals of the given numbers, HM = n / Σ(1/xᵢ). The data must not contain zeros, since the reciprocal of zero is undefined. The harmonic mean is commonly used for averaging rates or ratios (e.g. speeds), where it gives a more appropriate answer than the arithmetic mean.

Source code:
library("psych")
x <- c(10, 20, 60)
harmonic.mean(x)
Output: [1] 18
Harmonic mean from a data frame column
Source code:
df <- data.frame(bike=c("A", "B", "C"), speed=c(10, 20, 60))
df
library("psych")
harmonic.mean(df$speed)
Output:
  bike speed
1    A    10
2    B    20
3    C    60
[1] 18
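If the psych package is not available, the harmonic mean can also be computed directly from its definition in base R. A minimal sketch:

x <- c(10, 20, 60)
1 / mean(1 / x)   # reciprocal of the mean of reciprocals; gives 18 for this data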
EXPERIMENT 2

AIM: Measures of dispersion


a) Range b) Quartile deviation c) Mean deviation d) Standard deviation
e) Coeff. of Variation.

Dispersion refers to the degree of spread or variability in a dataset. While measures of central
tendency (like mean, median) describe the center of the data, measures of dispersion describe
the extent to which data values deviate from the center.
Dispersion helps in understanding the reliability, consistency, and variability of the dataset. A
smaller dispersion implies that values are closely clustered, while a larger dispersion indicates
greater variability.
Types of Measures of Dispersion:
a) Range: The difference between the maximum and minimum values in a data set.
   Range = Max − Min
b) Quartile Deviation (Semi-Interquartile Range): Measures the spread of the middle 50% of the data.
   QD = (Q3 − Q1) / 2
   where Q1 = first quartile, Q3 = third quartile
c) Mean Deviation: The average of the absolute deviations from the mean.
   MD = (1/n) Σ |xᵢ − x̄|
d) Standard Deviation: The square root of the average of the squared deviations from the mean (see the note on sample vs. population denominators after this list).
   σ = √[ (1/n) Σ (xᵢ − x̄)² ]
e) Coefficient of Variation: A relative measure of dispersion, expressed as a percentage.
   CV = (σ / μ) × 100
   where:
   σ = the standard deviation of the data set
   μ = the mean of the data set
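Note: R's sd() and var() use the sample formula with denominator n − 1, while the classical population formulas above divide by n, so small differences between hand calculations and R output are expected. A minimal sketch contrasting the two:

x <- c(10, 12, 14, 18, 20, 22, 25)
sd(x)                           # sample standard deviation (denominator n - 1)
sqrt(mean((x - mean(x))^2))     # population standard deviation (denominator n)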

Source code:
data <- c(10, 12, 14, 18, 20, 22, 25)
range_val <- max(data) - min(data)
Q1 <- quantile(data, 0.25)
Q3 <- quantile(data, 0.75)
quartile_deviation <- (Q3 - Q1) / 2
mean_val <- mean(data)
mean_deviation <- mean (abs (data - mean_val))
std_deviation <- sd(data)
cv <- (std_deviation / mean_val) * 100
cat ("Data: ", data, "\n")
cat ("Range: ", range_val, "\n")
cat("Quartile Deviation: ", quartile_deviation, "\n")
cat ("Mean Deviation: ", mean_deviation, "\n")
cat ("Standard Deviation: ", std_deviation, "\n")
cat ("Coefficient of Variation (%): ", cv, "\n")
output:
Data: 10 12 14 18 20 22 25
Range: 15
Quartile Deviation: 4
Mean Deviation: 4.163265
Standard Deviation: 5.160563
Coefficient of Variation (%): 30.35627

EXPERIMENT 3
AIM: Correlation & Regression
a) Correlation coefficient b) Regression lines c) Rank Correlation d) Multiple
correlation coefficient e) Multiple linear regression
Correlation and Regression: Correlation measures the strength and direction of the linear relationship between two variables; regression describes how a dependent variable changes with one or more independent variables.
a) Correlation Coefficient: The correlation coefficient (usually Pearson's r) measures the strength and direction of a linear relationship between two variables.
In its most basic usage, cor(data) calculates the Pearson correlation coefficient for all pairs of variables in the input data.
method parameter: other correlation methods such as Spearman or Kendall can be used; for example, cor(data, method = "spearman") calculates the Spearman rank correlation coefficient.
• Correlation coefficients range from −1 to 1.
• 1 indicates a perfect positive linear correlation.
• −1 indicates a perfect negative linear correlation.
• 0 indicates no linear correlation.
Source Code:

x<-c(7,9,4,10,6,7,8,8,5,6)
y<-c(6,8,6,10,8,5,10,7,7,8)
n<-length(x)
xy<-x*y
xx<-x*x
yy<-y*y
mydata <- data.frame(x,y,xy,xx,yy)
print(mydata)
sums<-list(sum(x),sum(y),sum(x*y),sum(x*x),sum(y*y))
mydata<-rbind(mydata,sums)
print(mydata,row.names=FALSE)
meanx<-sum(x)/n
print(meanx)
meany<-sum(y)/n
print(meany)
cov<-(sum(x*y)/n)-(meanx*meany)
print(cov)
sdx<-sqrt((sum(x*x)/n)-(meanx^2))
print(sdx)
sdy<-sqrt((sum(y*y)/n)-(meany^2))
print(sdy)
corcof<-cov/sdx/sdy
print(" Correlation coefficient using program")
print(round(corcof,digits=4))
print(" Correlation coefficient using built in")
print(cor(x,y))
plot(x, y)

output:
[1] 7.0
[1] 7.5
[1] 1.5
[1] 1.732
[1] 1.565
[1] " Correlation coefficient using program"
[1] 0.5536
[1] " Correlation coefficient using built in"
[1] 0.5536079

b) Regression lines: A regression line is the best-fit straight line through the data points.
Simple linear regression equation: Y = a + bX
where:
a: intercept (value of Y when X = 0)
b: slope (rate of change of Y with respect to X)
Source Code:
x<-c(7,9,4,10,6,7,8,8,5,6)
y<-c(6,8,6,10,8,5,10,7,7,8)
n<-length(x)
xy<-x*y
xx<-x*x
yy<-y*y
mydata <- data.frame(x,y,xy,xx,yy)
print(mydata)
sums<-list(sum(x),sum(y),sum(x*y),sum(x*x),sum(y*y))
mydata<-rbind(mydata,sums)
print(mydata,row.names=FALSE)
print("regression line x on y")
result1<-lm(x~y)
print(result1)
print("regression line y on x")
result2<-lm(y~x)
print(result2)
x_pred<-coef(result1)[1] + coef(result1)[2]*23   # estimate of x when y = 23
print(x_pred)
y_pred<-coef(result2)[1] + coef(result2)[2]*45   # estimate of y when x = 45
print(y_pred)

output:
[1] "regression line x on y"
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
1.5333 0.6667
[1] "regression line y on x"
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
2.2333 0.7500
[1] 16.8667 # Predicted x when y = 23
[1] 35.9833 # Predicted y when x = 45
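The same predictions can be obtained with predict() instead of multiplying coefficients by hand; a minimal sketch using the fitted models above (result1 regresses x on y, so its new data supplies y):

predict(result1, newdata = data.frame(y = 23))   # estimate of x when y = 23
predict(result2, newdata = data.frame(x = 45))   # estimate of y when x = 45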
c) Rank Correlation: Spearman's rank correlation measures the degree of association between the ranks of two variables: ρ = 1 − 6Σd² / (n(n² − 1)), where d is the difference between paired ranks.
Source Code:
x <- c (10, 20, 30, 40, 50)
y <- c(3, 4, 5, 6, 7)
a <- rank(-x)
cat ("Ranks for X (descending):\n")
print(a)
b <- rank(-y)
cat ("Ranks for Y (descending): \n")
print(b)
d <- a - b
cat ("Difference in ranks:\n")
print(d)
ssqd <- sum(d^2)
cat ("Sum of squared differences:\n")
print(ssqd)
n <- length(x)
rankcor <- 1 - ((6 * ssqd) / (n * (n^2 - 1)))
cat("Manual Spearman Rank Correlation Coefficient:\n")
print(rankcor)
corr <- cor.test(x, y, method = "spearman")
cat ("Spearman Rank Correlation using built-in function:\n")
print(corr$estimate)

output:
Ranks for X (descending):
[1] 5 4 3 2 1
Ranks for Y (descending):
[1] 5 4 3 2 1
Difference in ranks:
[1] 0 0 0 0 0
Sum of squared differences:
[1] 0
Manual Spearman Rank Correlation Coefficient:
[1] 1
Spearman Rank Correlation using built-in function:
spearman
1
d) Multiple correlation coefficient: Measures the strength of the relationship between one dependent variable and two or more independent variables. For three variables it is computed as R₁.₂₃ = √[(r₁₂² + r₁₃² − 2·r₁₂·r₁₃·r₂₃) / (1 − r₂₃²)].
Source Code:
x<-c(7,9,4,10,6,7,8,8,5,6)
y<-c(6,8,6,10,8,5,10,7,7,8)
z<-c(1,2,3,4,5,6,9,7,8,9)
n<-length(x)
corcof<- function(x,y)
{
xy<-x*y
xx<-x*x
yy<-y*y
mydata <- data.frame(x,y,xy,xx,yy)
sums<-list(sum(x),sum(y),sum(x*y),sum(x*x),sum(y*y))
mydata<-rbind(mydata,sums)
cat("\n")
print(mydata,row.names=FALSE)
meanx<-sum(x)/n
meany<-sum(y)/n
cov<-(sum(x*y)/n)-(meanx*meany)
sdx<-sqrt((sum(x*x)/n)-(meanx^2))
sdy<-sqrt((sum(y*y)/n)-(meany^2))
corcof<-cov/sdx/sdy
print(" Correlation coefficient using program")
print(round(corcof,digits=4))
}
r12<-corcof(x,y)
r23<-corcof(y,z)
r13<-corcof(x,z)
cat("\n\n\n")
print("partial correlation coefficient")
pcof<-(r12-(r13*r23))/(sqrt(1-(r13^2))*sqrt(1-(r23^2)))
print(pcof)
print("multiple correlation coefficient")
mcof<-sqrt((r12^2+r13^2-2*r12*r13*r23)/(1-r23^2))
print(mcof)

output:
[1] " Correlation coefficient using program"
[1] 0.5536
[1] " Correlation coefficient using program"
[1] 0.5530
[1] " Correlation coefficient using program"
[1] 0.7716
[1] "partial correlation coefficient"
[1] 0.2394
[1] "multiple correlation coefficient"
[1] 0.7857
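As a cross-check, the multiple correlation coefficient of x on y and z equals the square root of the R² obtained by regressing x on y and z; a minimal sketch using lm():

fit <- lm(x ~ y + z)
sqrt(summary(fit)$r.squared)   # multiple correlation coefficient of x on y and z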
e) Multiple linear regression: Predicts a dependent variable using multiple independent variables.
Y = a + b₁X₁ + b₂X₂ + … + bₙXₙ
Source Code:
x1<-c(3,5,6,8,12,14)
x2<-c(16,10,7,4,3,2)
x3<-c(90,72,54,42,30,12)
n<-length(x1)
x1x2<-x1*x2
x1x3<-x1*x3
x2x3<-x2*x3
x1x1<-x1*x1
x2x2<-x2*x2
mydata <- data.frame(x1,x2,x3,x1x2,x1x3,x2x3,x1x1,x2x2)
print(mydata)
sums<-list(sum(x1),sum(x2),sum(x3),sum(x1*x2),sum(x1*x3),sum(x2*x3),sum(x1*x1),
sum(x2*x2))
mydata<-rbind(mydata,sums)
print(mydata,row.names=FALSE)
result1<-lm(x3~x1+x2)
print(result1)

output:
 x1 x2  x3 x1x2 x1x3 x2x3 x1x1 x2x2
  3 16  90   48  270 1440    9  256
  5 10  72   50  360  720   25  100
  6  7  54   42  324  378   36   49
  8  4  42   32  336  168   64   16
 12  3  30   36  360   90  144    9
 14  2  12   28  168   24  196    4
 48 42 300  236 1818 2820  474  434


Call:
lm(formula = x3 ~ x1 + x2)
Coefficients:
(Intercept) x1 x2
61.400 -3.646 2.538
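Predictions for new predictor values follow the usual predict() pattern; a short sketch with illustrative values of x1 and x2 (these inputs are assumptions, not from the data above):

predict(result1, newdata = data.frame(x1 = 10, x2 = 5))   # predicted x3 for x1 = 10, x2 = 5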
EXPERIMENT 4
AIM: Curve fitting
a) Straight line b) Parabola c) Y = aX^b d) Y = ab^X e) Y = ae^(bX)

a) Straight line:
x<-c(1,2,3,4,6,8)
y<-c(2.4,3,3.6,4,5,6)
n<-length(x)
xy<-x*y
xx<-x*x
mydata <- data.frame(x,y,xy,xx)
print(mydata)
sums<-list(sum(x),sum(y),sum(x*y),sum(x*x))
mydata<-rbind(mydata,sums)
print (mydata, row.names=FALSE)
stline<-lm(y~x)
print(stline)
summary(stline)
plot(x,y)
abline(stline,col="BLUE")
output:

b) Parabola:
x<-c(0,1,2,3,4)
y<-c(1,1.8,1.3,2.5,6.3)
n<-length(x)
xy<-x*y
xx<-x*x
xxx<-x^3
xxxx<-x^4
xxy<-x^2*y
mydata <- data.frame(x,y,xy,xx,xxx,xxxx,xxy)
print(mydata)
sums<-list(sum(x),sum(y),sum(x*y),sum(x*x),sum(x^3),sum(x^4),sum(x^2*y))
mydata<-rbind(mydata,sums)
print(mydata,row.names=FALSE)
parabola <- lm(y ~ x+I(x^2))
print(parabola)
f<-coef(parabola)[1]+((coef(parabola)[2])*x)+((coef(parabola)[3])*x*x)
print(f)
plot(x,y)
curve(coef(parabola)[1] + coef(parabola)[2]*x + coef(parabola)[3]*x*x,
      from = x[1], to = x[n], add = TRUE)
curve(predict(parabola, newdata = data.frame(x = x)), add = TRUE)

output:

c) Y = aX^b
x<-c(1,2,3,4,6,8)
y<-c(2.4,3,3.6,4,5,6)
n<-length(x)
logx<-round(log10(x),digits=4)
logy<-round(log10(y),digits=4)
logxlogy<-round(logx*logy,digits=4)
logxlogx<-round(logx*logx,digits=4)
mydata <-data.frame(logx,logy,logxlogy,logxlogx)
colnames(mydata)=c("X=logx","Y=logy","XY","XX")
print(mydata)
sums<-
list(sum(logx),sum(logy),round(sum(logx*logy),digits=4),round(sum(logx*logx),digits=4))
mydata<-rbind(mydata,sums)
print(mydata,row.names=FALSE)
power<-lm(log10(y)~log10(x))
print(power)
alpha<-10^(coef(power)[1])
beta<-coef(power)[2]
print(round(alpha,digits=4))
print(round(beta,digits=4))
f<-alpha*(x^beta)
print(f)
plot(x,y)
curve(alpha*(x^beta), from = x[1], to = x[n], add = TRUE)
output:
(Intercept)
     2.2858
   log10(x)
     0.4372
[1] 2.285842 3.095021 3.695344 4.190646 5.003481 5.674118

d) Y = ab^X
x<-c(1,1.5,2,2.5,3,3.5,4)
y<-c(1,1.3,1.6,2,2.7,3.4,4.1)
n<-length(x)
logy<-round(log10(y),digits=4)
xlogy<-round(x*logy,digits=4)
xx<-x*x
mydata <-data.frame(x,logy,xlogy,xx)
colnames(mydata)=c("X=x","Y=logy","XY","XX")
print(mydata)
sums<-list(sum(x),sum(logy),sum(x*logy),sum(x*x))
mydata<-rbind(mydata,sums)
print(mydata,row.names=FALSE)
power<-lm(log10(y)~x)
print(power)
alpha<-10^(coef(power)[1])
beta<-10^(coef(power)[2])
print(alpha)
print(beta)
f<-alpha*(beta^x)
print(f)
plot(x,y)
curve(alpha*(beta^x), from = x[1], to = x[n], add = TRUE)

output:

(Intercept)
  0.6245328
          x
   1.611352
[1] 1.006342 1.277441 1.621572 2.058408 2.612923 3.316820 4.210339

e) Y = ae^(bX)
x<-c(1,2,3,4,5)
y<-c(1.8,5.1,8.9,14.1,19.8)
n<-length(x)
logy<-round(log10(y),digits=4)
xlogy<-round(x*logy,digits=4)
xx<-x*x
mydata <-data.frame(x,logy,xlogy,xx)
colnames(mydata)=c("X=x","Y=logy","XY","XX")
print(mydata)
sums<-list(sum(x),sum(logy),sum(x*logy),sum(x*x))
mydata<-rbind(mydata,sums)
print(mydata,row.names=FALSE)
power<-lm(log10(y)~x)
print(power)
alpha<-10^(coef(power)[1])
beta<-coef(power)[2]/0.4343
print(alpha)
print(beta)
f<-alpha*exp(beta*x)   # fitted values of y = a*e^(b*x)
print(f)
plot(x,y)
curve(alpha*exp(beta*x), from = x[1], to = x[n], add = TRUE)

output:
X=x Y=logy XY XX
1 0.2553 0.2553 1
2 0.7076 1.4152 4
3 0.9494 2.8482 9
4 1.1492 4.5968 16
5 1.2967 6.4835 25
15 4.3582 15.5990 55

Call:
lm(formula = log10(y) ~ x)

Coefficients:
(Intercept) x
0.1143 0.2524
(Intercept)
   1.301047
          x
  0.5812651

[1]  2.326662  4.160767  7.440697 13.306193 23.795456
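The fits above linearise each model with logarithms before calling lm(). The exponential curve can also be fitted directly on the original scale with nls(), which needs rough starting values; a minimal sketch (the starting values here are assumptions):

expfit <- nls(y ~ a * exp(b * x), start = list(a = 1, b = 0.5))   # direct least-squares fit of y = a*e^(b*x)
print(coef(expfit))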


EXPERIMENT 5

AIM: ANOVA
a) one-way classification b) two-way classification

a) one-way classification:
Source code:

# Data

data <- c(25, 30, 28, 35, 40, 42, 45, 48, 46, 10, 15, 18)

group <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"))

# One-Way ANOVA

oneway_model <- aov(data ~ group)

print(summary(oneway_model))

output:
Df Sum Sq Mean Sq F value Pr(>F)

group 3 2188.2 729.4 52.09 4.71e-06 ***

Residuals 8 111.9 14.0

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
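When the one-way ANOVA is significant, pairwise comparisons between the groups can follow; a minimal sketch using base R's Tukey HSD test on the fitted model:

posthoc <- TukeyHSD(oneway_model)   # pairwise differences between groups A, B, C and D
print(posthoc)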

b) two-way classification
Source code:

# Data

output <- c (10,12,11,13,14,15,13,16,12,15,14,16)

machine <- factor(rep(c("M1","M2","M3"), each = 4))

operator <- factor(rep(c("O1","O2"), times = 6))

# Two-Way ANOVA

twoway_model <- aov(output ~ machine + operator)

print(summary(twoway_model))

output:
Df Sum Sq Mean Sq F value Pr(>F)
machine 2 16.0 8.0 4.000 0.0574 .

operator 1 2.0 2.0 1.000 0.3390

Residuals 8 16.0 2.0

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
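Because this layout has two observations per machine–operator cell, an interaction term can also be examined; a short sketch (an extension of the model above, not part of the printed output):

twoway_inter <- aov(output ~ machine * operator)   # main effects plus machine:operator interaction
print(summary(twoway_inter))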

EXPERIMENT 6

AIM: Time series


a) Moving averages b) ARIMA

a) Moving averages
Source code:

# Sample monthly data (e.g., monthly demand or sales)

data <- c(25, 30, 28, 35, 40, 42, 45, 48, 46, 50, 53, 55)

months <- month.abb[1:12]

# Create a time series object

ts_data <- ts(data, start = c(2024, 1), frequency = 12)

# Display original data

print("Original Monthly Data:")

print(ts_data)

# 3-month Moving Average

ma3 <- filter(ts_data, rep(1/3, 3), sides = 2)

# 4-month Moving Average

ma4 <- filter (ts_data, rep(1/4, 4), sides = 2)

# 5-month Moving Average

ma5 <- filter(ts_data, rep(1/5, 5), sides = 2)

# Combine all into a data frame

result <- data.frame(

Month = months,

Sales = ts_data,

MA_3 = round(ma3, 2),

MA_4 = round(ma4, 2),


MA_5 = round(ma5, 2)
)

print("Moving Averages Table:")

print(result)

# Plot

plot(ts_data, type = "o", col = "black", ylim=c(20,60), main = "Moving Averages",

ylab = "Sales", xlab = "Month")

lines(ma3, type="o", col="blue")

lines(ma4, type="o", col="red")

lines(ma5, type="o", col="green")

legend("topleft", legend=c("Original", "MA(3)", "MA(4)", "MA(5)"),

col=c("black", "blue", "red", "green"), lty=1, pch=1)

output:
[1] "Moving Averages Table:"

Month Sales MA_3 MA_4 MA_5

1 Jan 25 NA NA NA

2 Feb 30 27.67 29.50 NA

3 Mar 28 31.00 33.25 31.6

4 Apr 35 34.33 36.25 35.0

5 May 40 39.00 40.50 38.0

6 Jun 42 42.33 43.75 42.0

7 Jul 45 45.00 45.25 44.2

8 Aug 48 46.33 47.25 46.2

9 Sep 46 48.00 49.25 48.4

10 Oct 50 49.67 51.00 50.4

11 Nov 53 52.67 NA NA

12 Dec 55 NA NA NA
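The filters above are centred (sides = 2), which is why values at the ends are NA. For forecasting-style smoothing, a trailing average of the most recent observations can be used instead; a minimal sketch:

ma3_trailing <- stats::filter(ts_data, rep(1/3, 3), sides = 1)   # mean of the current and two previous months
print(round(ma3_trailing, 2))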
b) ARIMA
Source code:

# Load necessary libraries

install.packages("forecast")   # only once

library(forecast)

# Create sample time series data

data <- c(112,118,132,129,121,135,148,148,136,119,104,118,

115,126,141,135,125,149,170,170,158,133,114,140)

# Convert to time series object (monthly data starting from Jan 1949)

ts_data <- ts(data, start = c(1949,1), frequency = 12)

# Plot the original time series

plot(ts_data, main = "Monthly Data", ylab = "Value", col = "blue")

# Check if the data is stationary using Augmented Dickey-Fuller Test

# install.packages("tseries") # Uncomment if not installed

library(tseries)

adf.test(ts_data)

# Apply differencing if not stationary (auto.arima handles it automatically)

# Fit ARIMA model

model <- auto.arima(ts_data)

# Display the ARIMA model summary

summary(model)
# Forecast next 12 periods

forecast_values <- forecast(model, h = 12)

# Print forecast results

print(forecast_values)

# Plot the forecast

plot(forecast_values, main = "ARIMA Forecast")

Output:

Series: ts_data

ARIMA(0,1,1)(0,1,1)[12]

Coefficients:

ma1 sma1

-0.3775 -0.5964

s.e. 0.2314 0.1534

sigma^2 estimated as 120.7: log likelihood=-140.42

AIC=286.83 AICc=287.89 BIC=291.79

Forecast Output (first few months)

Point Forecast Lo 80 Hi 80 Lo 95 Hi 95

Jan 1951 147.6 132.2 163.0 124.2 171.0

Feb 1951 137.8 122.3 153.3 114.2 161.4

Mar 1951 153.1 136.2 170.1 127.3 178.9

...
EXPERIMENT 7
AIM: Goodness of fit
a) Binomial b) Poisson

a) Binomial: The binomial distribution is a discrete probability distribution used when:
A fixed number of independent trials (n) is conducted
Each trial has two possible outcomes: success or failure
The probability of success (p) is constant for each trial

P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)

where
P(X = k): probability of exactly k successes
n: total number of trials
p: probability of success in a single trial
k: number of successes (0 ≤ k ≤ n)

Source code:

# Observed frequencies
obs <- c(7, 6, 19, 35, 30, 23, 7, 1)
x <- 0:7
n <- 7
total <- sum(obs)
mean_val <- sum(obs * x) / total   # mean number of successes
p <- mean_val / n                  # estimated probability of success per trial

# Expected frequencies
expected_probs <- dbinom(x, size = n, prob = p)
expected_freq <- round(expected_probs * total)

# Chi-square test
chisq.test(obs, p = expected_probs, rescale.p = TRUE)
Output:

Chi-squared test for given probabilities


data: obs
X-squared = 29.102, df = 7, p-value =
0.0001386
b) Poisson: The Poisson distribution is a discrete probability distribution used to model the number of events occurring in a fixed interval of time or space, when the events happen independently and at a constant average rate λ (lambda).

f(x) = P(X = x) = (e^(−λ) · λ^x) / x!

where
x: number of events
λ: average rate (mean)
e ≈ 2.718
Source code:
# Observed frequencies

x <- 0:8

f <- c(103, 143, 98, 42, 8, 4, 2, 0, 0)

# Total frequency

total <- sum(f)

# Calculate mean (lambda)

lambda <- sum(x * f) / total

# Expected probabilities

pr <- dpois(x, lambda = lambda)

# Expected frequencies

fe <- round(pr * total)

# Show data

mydata <- data.frame(x, f, pr = round(pr, 5), Expected = fe)

print(mydata)

# Chi-square test for goodness of fit

result <- chisq.test(f, p = pr, rescale.p = TRUE)

print(result)

# Critical value at 5% significance level

s <- length(x) - 1

cat("Chi-square table value (df =", s, "):", qchisq(0.95, df = s), "\n")

Output:
x f pr Expected

0 103 0.26647 107

1 143 0.35240 141

2 98 0.23303 93

3 42 0.10273 41

4 8 0.03396 14

5 4 0.00898 4

6 2 0.00198 1

7 0 0.00037 0

8 0 0.00006 0

Chi-squared test for given probabilities

data: f

X-squared = 4.7755, df = 8, p-value = 0.7813

Chi-square table value (df = 8 ): 15.50731

EXPERIMENT 8

AIM: Parametric tests


a) t-test for one-mean b) t-test for two means c) paired t-test d) F-test

a) t-test for one-mean: Tests whether the mean of a single group differs from
a known or hypothesized value.
Null Hypothesis (H₀): μ = μ₀
Alternative Hypothesis (H₁): μ ≠ μ₀ (or > or <)
Source code:

# Sample data

x <- c (25, 27, 29, 28, 26, 30, 31)

# Test if mean is 28

result <- t.test(x, mu = 28)

print(result)
Output:

data: x

t = 0.4472, df = 6, p-value = 0.6715

alternative hypothesis: true mean is not equal to 28

95 percent confidence interval: 26.72526 29.56045

sample estimates: mean of x = 28.14286

b) t-test for two means: Compares the means of two independent groups. The var.equal argument controls whether equal variances are assumed: var.equal = TRUE gives the pooled-variance test, while the default FALSE gives Welch's test (shown after this example).
H₀: μ₁ = μ₂
H₁: μ₁ ≠ μ₂
Source code:

# Two independent samples

group1 <- c (22, 25, 27, 30, 32)

group2 <- c (20, 23, 26, 28, 29)

# Test for equal means

result <- t.test(group1, group2, var.equal = TRUE)

print(result)

Output:

t = 0.8944, df = 8, p-value = 0.3967

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval: -3.20719 7.20719

mean of x = 27.2, mean of y = 25.2
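If the equal-variance assumption is doubtful, dropping var.equal = TRUE gives Welch's t-test, which is R's default; a minimal sketch with the same groups:

result_welch <- t.test(group1, group2)   # Welch's t-test: does not assume equal variances
print(result_welch)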

c) paired t-test: Used when same subjects are measured before and after a
treatment. H₀: mean difference = 0
Source code:

# Before and after values

before <- c (200, 195, 210, 190, 205)

after <- c (198, 193, 208, 192, 204)

# Paired t-test

result <- t.test(before, after, paired = TRUE)

print(result)

Output:

t = 2.8284, df = 4, p-value = 0.0473

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval: 0.2 3.8

mean of the differences: 2

d) F-test: Tests the equality of the variances of two populations.
H₀: σ₁² = σ₂²
H₁: σ₁² ≠ σ₂²
Source code:

# Two samples

x <- c (15, 16, 14, 15, 17, 18)

y <- c (10, 12, 9, 11, 13, 14)

# F-test for equality of variances

result <- var.test(x, y)

print(result)

Output:

F = 2.25, num df = 5, denom df = 5, p-value = 0.3012

alternative hypothesis: true ratio of variances is not equal to 1

95 percent confidence interval: 0.4350 11.6426

EXPERIMENT 9
AIM: Non-parametric tests
a) Sign test b) Mann-Whitney test c) Run test d) Kolmogorov-Smirnov test

a) Sign test: The Sign Test is used to test the median of a population or for
comparing paired samples. It only considers the signs (+/-) of differences,
ignoring their magnitude.
Hypotheses:

H₀: The median difference is zero.

H₁: The median difference is not zero.

Source code:

# Install BSDA if not installed

# install.packages("BSDA")

library(BSDA)

# Example: Paired data (Before and After)

before <- c(55, 60, 52, 63, 70, 65, 62)

after <- c (58, 62, 54, 60, 68, 68, 63)

# Perform Sign Test

SIGN.test(x = before, y = after, alternative = "two.sided")

Output:

Dependent-samples Sign-Test

data: before and after

S = 1, p-value = 0.07031

alternative hypothesis: true median difference is not equal to 0

b) Mann-Whitney test (Wilcoxon Rank Sum Test): Used to compare two


independent samples. It’s a non-parametric alternative to the t-test.
Hypotheses:
H₀: The two populations are equal.
H₁: The two populations are not equal
Source code:

# Group 1 and Group 2 data

group1 <- c(85, 90, 88, 75, 95)

group2 <- c(80, 70, 78, 85, 68)

# Mann-Whitney U Test

wilcox.test(group1, group2, alternative = "two.sided")

Output:

Wilcoxon rank sum test with continuity correction

data: group1 and group2

W = 21.5, p-value = 0.2893

alternative hypothesis: true location shift is not equal to 0

c) Run test (Wald-Wolfowitz Runs Test): Checks the randomness of a sequence.


It counts the number of runs (uninterrupted sequences of similar items).
Source code:

# Install and load tseries

install.packages("tseries")

library(tseries)

# Sample sequence

sequence <- c(1.2, 1.5, 1.3, 1.7, 2.1, 1.9, 1.8, 2.2, 2.4)

# Run test

runs.test(as.factor(sequence > median(sequence)))

Output:

Runs Test

data: as.factor(sequence > median(sequence))


Standard Normal = 0.169, p-value = 0.866

alternative hypothesis: two.sided

d) Kolmogorov-Smirnov test: Used to test whether a sample comes from a specific distribution (e.g., normal), or to compare two samples. It is based on the maximum distance between their empirical distribution functions.
Source code:

# rnorm() draws random samples, so the exact output varies between runs
x <- rnorm(30, mean = 5)
y <- rnorm(30, mean = 6)
ks.test(x, y)
Output:

Two-sample Kolmogorov-Smirnov test


data: x and y
D = 0.5, p-value = 0.0029
alternative hypothesis: two-sided

EXPERIMENT 10

AIM: Graphical representation of data


a) Bar plot b) Frequency polygon c) Histogram d) Pie chart e) scatter plot

Graphical representation helps to visualize and interpret data effectively. Different types of
graphs are used depending on the nature of data and what insights are to be derived.
a) Bar plot:

Source Code:
H <- c(5,15,17,18,16,15)
M <- c(1980,1981,1982,1983,1984,1985)
barplot(H, xlab="Year", ylab="Profit", ylim=c(0,20),
        col=rainbow(6), names.arg=M,
        main="RVRJC PHARMACEUTICAL FIRM", border="red")
Output:
b) Frequency polygon: A frequency polygon is a line graph that represents the distribution of data. It is drawn by connecting the midpoints of the tops of the bars in a histogram.
Source Code:
data <- c(10, 20, 20, 30, 30, 30, 40, 40, 50, 50, 50, 50, 60)
hist_data <- hist(data, plot = FALSE)
plot(hist_data$mids, hist_data$counts, type = "o", col = "red",
     xlab = "Class Intervals", ylab = "Frequency", main = "Frequency Polygon")
Output:

c) Histogram: A histogram is used for continuous data and shows the frequency
distribution of a dataset using adjacent rectangles.
Source Code:
v <- c(3,5,6,19,9,18,23,67,11,10,44,45,54,37,26,8,5,1)
hist(v, main="STUDENTS MARKS", xlab="Marks", xlim=c(0,70),
     ylab="No. of students", ylim=c(0,5), col=rainbow(10))

Output:

d) Pie chart: A pie chart is a circular graph divided into sectors representing
proportions of a whole.
Source Code:
# Pie Chart
slices <- c(10, 20, 30, 40)
labels <- c("Q1", "Q2", "Q3", "Q4")
pie(slices, labels = labels, main = "Pie Chart", col = rainbow(length(slices)))

Output:
e) scatter plot: A scatter plot shows the relationship between two continuous variables.
Points are plotted for each pair (x, y).
Source Code:
# Scatter Plot
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 4, 5, 7, 10, 12)
plot(x, y, main = "Scatter Plot", xlab = "X values", ylab = "Y values", col = "blue",
pch = 16)

Output:
