EXPERIMENT 1
AIM: Measures of central tendency
a) Mean b) Median c) Mode d) Geometric Mean e) Harmonic Mean
A measure of central tendency represents a whole set of data by a single value and gives the
location of its central point. The three main measures of central tendency are the mean, the
median, and the mode; the geometric and harmonic means are related averages.
a) Mean: The mean is the sum of the observations divided by the total number of observations;
it is the average, i.e. the sum divided by the count.
Mean = (x1 + x2 + … + xn) / n, where n = number of terms
Source code:
marks <- c (97, 78, 57, 64, 87)
result <- mean(marks)
print(result)
Output: [1] 76.6
b) Median: The median is the middle value of the data set; it splits the ordered data into two
halves. If the number of elements is odd, the centre element is the median; if it is even, the
median is the average of the two central elements.
• For odd n the median is the ((n + 1)/2)th ordered value; for even n it is the average of the
(n/2)th and (n/2 + 1)th ordered values, where n = number of terms
• Syntax: median(x, na.rm = FALSE)
• where x is a vector and na.rm removes missing values before computing
• The mean is the average of a data set.
• The mode is the most common number in a data set.
• The median is the middle of the set of numbers.
Source code:
marks <- c (97, 78, 57, 64, 87)
result <- median(marks)
print(result)
Output: [1] 78
c) Mode: The mode is the most common number in a data set.
Source code:
marks <- c(97, 78, 57, 78, 97, 66, 87, 64, 87, 78)
mode <- function(v)
{
  # sort the frequency table in decreasing order and take the most frequent value
  return(names(sort(-table(v)))[1])
}
mode(marks)
Output: [1] "78"
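Because table() stores the observed values as names, the function above returns the mode as a
character string. A minimal sketch for inspecting the frequency table and converting the result
back to a number:
print(table(marks))      # frequency of each mark
as.numeric(mode(marks))  # the same mode, [1] 78, as a numeric value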
d) Geometric Mean: The geometric mean is the nth root of the product of n numbers.
• prod (): This function calculates the product of all elements in a vector.
• ^ (power): This operator raises a number to a power.
• length(x): This function returns the number of elements in the vector x.
• exp (): This function calculates the exponential of a number.
• mean (): This function calculates the average of a vector.
• log (): This function calculates the natural logarithm of a vector.
exp(mean(log(x)))
Source code:
x <- c (4, 8, 9, 9, 12, 14, 17)
exp(mean(log(x)))
Output: [1] 9.579479
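The same value follows from the nth-root form of the definition, using the prod(), ^ and
length() functions listed above; a minimal sketch:
prod(x)^(1/length(x))    # nth root of the product; should also print 9.579479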
Geometric Mean of Columns in Data Frame
Source code:
df <- data.frame(a = c(1, 3, 4, 6, 8, 8, 9),
                 b = c(7, 8, 8, 7, 13, 14, 16),
                 c = c(11, 13, 13, 18, 19, 19, 22),
                 d = c(4, 8, 9, 9, 12, 14, 17))
exp(mean(log(df$a)))
Output:[1] 4.567508
e) Harmonic Mean: The harmonic mean (HM) is the reciprocal of the arithmetic mean of the
reciprocals of the given numbers. The dataset should contain no zeros, otherwise the harmonic
mean collapses to zero. The harmonic mean is commonly used for averaging rates or ratios
(e.g. speeds), where it gives a more appropriate answer than the arithmetic mean.
Source code:
library("psych")
x <- c(10, 20, 60)
harmonic.mean(x)
Output: [1] 18
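Without the psych package, the same value can be obtained directly from the definition
(the reciprocal of the mean of the reciprocals); a minimal base-R sketch:
x <- c(10, 20, 60)
1 / mean(1 / x)          # harmonic mean from the definition; gives 18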
Harmonic Mean from Data Frame Columns
Source code:
df <- data.frame(bike = c("A", "B", "C"), speed = c(10, 20, 60))
print(df)
library("psych")
harmonic.mean(df$speed)
Output:
  bike speed
1    A    10
2    B    20
3    C    60
[1] 18
EXPERIMENT 2
AIM: Measures of dispersion
a) Range b) Quartile deviation c) Mean deviation d) Standard deviation
e) Coeff. of Variation.
Dispersion refers to the degree of spread or variability in a dataset. While measures of central
tendency (like mean, median) describe the center of the data, measures of dispersion describe
the extent to which data values deviate from the center.
Dispersion helps in understanding the reliability, consistency, and variability of the dataset. A
smaller dispersion implies that values are closely clustered, while a larger dispersion indicates
greater variability.
Types of Measures of Dispersion:
a) Range: The difference between the maximum and minimum values in a dataset.
Range = Max − Min
b) Quartile Deviation (Semi-Interquartile Range): Measures the spread of the middle
50% of the data.
QD = (Q3 − Q1) / 2
where Q1 = First Quartile, Q3 = Third Quartile
c) Mean Deviation: Average of absolute deviations from the mean.
MD = (1/n) Σ |xi − x̄|
d) Standard Deviation: The square root of the average of the squared deviations from the
mean.
SD = √( Σ (xi − x̄)² / (n − 1) )   (R's sd() uses the sample form with n − 1)
e) Coefficient of Variation: A relative measure of dispersion, expressed as a
percentage.
CV = (σ / μ) × 100
where:
σ: The standard deviation of dataset
μ: The mean of dataset
Source code:
data <- c(10, 12, 14, 18, 20, 22, 25)
range_val <- max(data) - min(data)
Q1 <- quantile(data, 0.25)
Q3 <- quantile(data, 0.75)
quartile_deviation <- (Q3 - Q1) / 2
mean_val <- mean(data)
mean_deviation <- mean (abs (data - mean_val))
std_deviation <- sd(data)
cv <- (std_deviation / mean_val) * 100
cat ("Data: ", data, "\n")
cat ("Range: ", range_val, "\n")
cat("Quartile Deviation: ", quartile_deviation, "\n")
cat ("Mean Deviation: ", mean_deviation, "\n")
cat ("Standard Deviation: ", std_deviation, "\n")
cat ("Coefficient of Variation (%): ", cv, "\n")
output:
Data: 10 12 14 18 20 22 25
Range: 15
Quartile Deviation: 4
Mean Deviation: 4.163265
Standard Deviation: 5.160563
Coefficient of Variation (%): 30.35627
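As a cross-check, base R's IQR() returns Q3 − Q1 directly, so the quartile deviation can also
be written as below; a minimal sketch using the same data vector:
IQR(data) / 2            # same as (Q3 - Q1) / 2; gives 4 for this data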
EXPERIMENT 3
AIM: Correlation & Regression
a) Correlation coefficient b) Regression lines c) Rank Correlation d) Multiple
correlation coefficient e) Multiple linear regression
Correlation and Regression: Correlation measures the strength and direction of the
linear relationship between two variables.
a) Correlation Coefficient: The correlation coefficient (usually Pearson's r) measures
the strength and direction of a linear relationship between two variables.
In its most basic usage, cor(data) calculates the Pearson correlation coefficient for all
pairs of variables in the input data.
method parameter: other correlation methods such as Spearman or Kendall can be selected;
for example, cor(data, method = "spearman") calculates the Spearman rank correlation
coefficient.
• Correlation coefficients range from -1 to 1
• 1 indicates a perfect positive correlation.
• -1 indicates a perfect negative correlation.
• 0 indicates no correlation.
Source Code:
x<-c(7,9,4,10,6,7,8,8,5,6)
y<-c(6,8,6,10,8,5,10,7,7,8)
n<-length(x)
xy<-x*y
xx<-x*x
yy<-y*y
mydata <- data.frame(x,y,xy,xx,yy)
print(mydata)
sums<-list(sum(x),sum(y),sum(x*y),sum(x*x),sum(y*y))
mydata<-rbind(mydata,sums)
print(mydata,row.names=FALSE)
meanx<-sum(x)/n
print(meanx)
meany<-sum(y)/n
print(meany)
cov<-(sum(x*y)/n)-(meanx*meany)
print(cov)
sdx<-sqrt((sum(x*x)/n)-(meanx^2))
print(sdx)
sdy<-sqrt((sum(y*y)/n)-(meany^2))
print(sdy)
corcof<-cov/sdx/sdy
print(" Correlation coefficient using program")
print(round(corcof,digits=4))
print(" Correlation coefficient using built in")
print(cor(x,y))
plot(x, y)
output:
[1] 7.0
[1] 7.5
[1] 1.5
[1] 1.732
[1] 1.565
[1] " Correlation coefficient using program"
[1] 0.5536
[1] " Correlation coefficient using built in"
[1] 0.5536079
b) Regression lines: A regression line is the best-fit straight line through the data points.
Simple linear regression equation: Y = a + bX
where:
a: Intercept (value of Y when X = 0)
b: Slope (rate of change of Y with respect to X)
Source Code:
x<-c(7,9,4,10,6,7,8,8,5,6)
y<-c(6,8,6,10,8,5,10,7,7,8)
n<-length(x)
xy<-x*y
xx<-x*x
yy<-y*y
mydata <- data.frame(x,y,xy,xx,yy)
print(mydata)
sums<-list(sum(x),sum(y),sum(x*y),sum(x*x),sum(y*y))
mydata<-rbind(mydata,sums)
print(mydata,row.names=FALSE)
print("regression line x on y")
result1<-lm(x~y)
print(result1)
print("regression line y on x")
result2<-lm(y~x)
print(result2)
xpred <- coef(result1)[1] + coef(result1)[2]*23   # predicted x when y = 23
print(xpred)
ypred <- coef(result2)[1] + coef(result2)[2]*45   # predicted y when x = 45
print(ypred)
output:
[1] "regression line x on y"
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
1.5333 0.6667
[1] "regression line y on x"
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
2.2333 0.7500
[1] 16.8667 # Predicted x when y = 23
[1] 35.9833 # Predicted y when x = 45
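The same predictions can also be obtained with predict(), using the illustrative inputs
y = 23 and x = 45 from above; a minimal sketch:
predict(result1, newdata = data.frame(y = 23))   # predicted x when y = 23
predict(result2, newdata = data.frame(x = 45))   # predicted y when x = 45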
c) Rank Correlation: Measures the degree of association between the ranks of two
variables.
Source Code:
x <- c (10, 20, 30, 40, 50)
y <- c(3, 4, 5, 6, 7)
a <- rank(-x)
cat ("Ranks for X (descending):\n")
print(a)
b <- rank(-y)
cat ("Ranks for Y (descending): \n")
print(b)
d <- a - b
cat ("Difference in ranks:\n")
print(d)
ssqd <- sum(d^2)
cat ("Sum of squared differences:\n")
print(ssqd)
n <- length(x)
rankcor <- 1 - ((6 * ssqd) / (n * (n^2 - 1)))
cat("Manual Spearman Rank Correlation Coefficient:\n")
print(rankcor)
corr <- cor.test(x, y, method = "spearman")
cat ("Spearman Rank Correlation using built-in function:\n")
print(corr$estimate)
output:
Ranks for X (descending):
[1] 5 4 3 2 1
Ranks for Y (descending):
[1] 5 4 3 2 1
Difference in ranks:
[1] 0 0 0 0 0
Sum of squared differences:
[1] 0
Manual Spearman Rank Correlation Coefficient:
[1] 1
Spearman Rank Correlation using built-in function:
spearman
1
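The data above are perfectly monotone, so the coefficient is 1. A short sketch with
hypothetical, non-monotone values (introduced here only for illustration) gives an
intermediate value:
x2 <- c(10, 20, 30, 40, 50)        # hypothetical data for illustration
y2 <- c(3, 7, 5, 6, 4)
cor(x2, y2, method = "spearman")   # expected to be about 0.1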
d) Multiple correlation coefficient: Measures the strength of the relationship between one
dependent variable and multiple independent variables.
Source Code:
x<-c(7,9,4,10,6,7,8,8,5,6)
y<-c(6,8,6,10,8,5,10,7,7,8)
z<-c(1,2,3,4,5,6,9,7,8,9)
n<-length(x)
corcof<- function(x,y)
{
xy<-x*y
xx<-x*x
yy<-y*y
mydata <- data.frame(x,y,xy,xx,yy)
sums<-list(sum(x),sum(y),sum(x*y),sum(x*x),sum(y*y))
mydata<-rbind(mydata,sums)
cat("\n")
print(mydata,row.names=FALSE)
meanx<-sum(x)/n
meany<-sum(y)/n
cov<-(sum(x*y)/n)-(meanx*meany)
sdx<-sqrt((sum(x*x)/n)-(meanx^2))
sdy<-sqrt((sum(y*y)/n)-(meany^2))
corcof<-cov/sdx/sdy
print(" Correlation coefficient using program")
print(round(corcof,digits=4))
}
r12<-corcof(x,y)
r23<-corcof(y,z)
r13<-corcof(x,z)
cat("\n\n\n")
print("partial correlation coefficient")
pcof<-(r12-(r13*r23))/(sqrt(1-(r13^2))*sqrt(1-(r23^2)))
print(pcof)
print("multiple correlation coefficient")
mcof<-sqrt((r12^2+r13^2-2*r12*r13*r23)/(1-r23^2))
print(mcof)
output:
[1] " Correlation coefficient using program"
[1] 0.5536
[1] " Correlation coefficient using program"
[1] 0.5530
[1] " Correlation coefficient using program"
[1] 0.7716
[1] "partial correlation coefficient"
[1] 0.2394
[1] "multiple correlation coefficient"
[1] 0.7857
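As a cross-check, the multiple correlation of x on y and z is the square root of R-squared
from the corresponding linear model; a minimal sketch:
fit <- lm(x ~ y + z)
sqrt(summary(fit)$r.squared)       # should be close to mcof (about 0.79)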
e) Multiple linear regression: Predicts a dependent variable using multiple independent
variables. Y = a + b1X1 + b2X2 + … + bnXn
Source Code:
x1 <- c(3, 5, 6, 8, 12, 14)
x2 <- c(16, 10, 7, 4, 3, 2)
x3 <- c(90, 72, 54, 42, 30, 12)
n <- length(x1)
x1x2<-x1*x2
x1x3<-x1*x3
x2x3<-x2*x3
x1x1<-x1*x1
x2x2<-x2*x2
mydata <- data.frame(x1,x2,x3,x1x2,x1x3,x2x3,x1x1,x2x2)
print(mydata)
sums<-list(sum(x1),sum(x2),sum(x3),sum(x1*x2),sum(x1*x3),sum(x2*x3),sum(x1*x1),
sum(x2*x2))
mydata<-rbind(mydata,sums)
print(mydata,row.names=FALSE)
result1<-lm(x3~x1+x2)
print(result1)
output:
x1 x2  x3 x1x2 x1x3 x2x3 x1x1 x2x2
 3 16  90   48  270 1440    9  256
 5 10  72   50  360  720   25  100
 6  7  54   42  324  378   36   49
 8  4  42   32  336  168   64   16
12  3  30   36  360   90  144    9
14  2  12   28  168   24  196    4
48 42 300  236 1818 2820  474  434
Call:
lm(formula = x3 ~ x1 + x2)
Coefficients:
(Intercept) x1 x2
61.400 -3.646 2.538
EXPERIMENT 4
AIM: Curve fitting
a) Straight line b) Parabola c) Y = aX^b d) Y = ab^X e) Y = ae^(bX)
a) Straight line:
x<-c(1,2,3,4,6,8)
y<-c(2.4,3,3.6,4,5,6)
n<-length(x)
xy<-x*y
xx<-x*x
mydata <- data.frame(x,y,xy,xx)
print(mydata)
sums<-list(sum(x),sum(y),sum(x*y),sum(x*x))
mydata<-rbind(mydata,sums)
print (mydata, row.names=FALSE)
stline<-lm(y~x)
print(stline)
summary(stline)
plot(x,y)
abline(stline,col="BLUE")
output:
b) Parabola:
x<-c(0,1,2,3,4)
y<-c(1,1.8,1.3,2.5,6.3)
n<-length(x)
xy<-x*y
xx<-x*x
xxx<-x^3
xxxx<-x^4
xxy<-x^2*y
mydata <- data.frame(x,y,xy,xx,xxx,xxxx,xxy)
print(mydata)
sums<-list(sum(x),sum(y),sum(x*y),sum(x*x),sum(x^3),sum(x^4),sum(x^2*y))
mydata<-rbind(mydata,sums)
print(mydata,row.names=FALSE)
parabola <- lm(y ~ x+I(x^2))
print(parabola)
f<-coef(parabola)[1]+((coef(parabola)[2])*x)+((coef(parabola)[3])*x*x)
print(f)
plot(x,y)
curve(coef(parabola)[1] + coef(parabola)[2]*x + coef(parabola)[3]*x*x,
      from = x[1], to = x[n], add = TRUE)
curve(predict(parabola, newdata = data.frame(x)), add = TRUE)
output:
c) Y = aX^b
x<-c(1,2,3,4,6,8)
y<-c(2.4,3,3.6,4,5,6)
n<-length(x)
logx<-round(log10(x),digits=4)
logy<-round(log10(y),digits=4)
logxlogy<-round(logx*logy,digits=4)
logxlogx<-round(logx*logx,digits=4)
mydata <-data.frame(logx,logy,logxlogy,logxlogx)
colnames(mydata)=c("X=logx","Y=logy","XY","XX")
print(mydata)
sums <- list(sum(logx), sum(logy), round(sum(logx*logy), digits = 4), round(sum(logx*logx), digits = 4))
mydata<-rbind(mydata,sums)
print(mydata,row.names=FALSE)
power<-lm(log10(y)~log10(x))
print(power)
alpha<-10^(coef(power)[1])
beta<-coef(power)[2]
print(round(alpha,digits=4))
print(round(beta,digits=4))
f<-alpha*(x^beta)
print(f)
plot(x,y)
curve(alpha*(x^beta), from = x[1], to = x[n], add = TRUE)
output:
(Intercept)
2.2858
log10(x)
0.4372
[1] 2.285842 3.095021 3.695344 4.190646 5.003481
[6] 5.674118
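An alternative to the log-transform is to fit the power curve directly by nonlinear least
squares, which minimises the error on the original scale; a minimal sketch (the starting
values below are rough guesses, not taken from the text):
fit_nls <- nls(y ~ a * x^b, start = list(a = 2, b = 0.5))
print(coef(fit_nls))               # a and b estimated without the log transform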
d) Y = ab^X
x<-c(1,1.5,2,2.5,3,3.5,4)
y<-c(1,1.3,1.6,2,2.7,3.4,4.1)
n<-length(x)
logy<-round(log10(y),digits=4)
xlogy<-round(x*logy,digits=4)
xx<-x*x
mydata <-data.frame(x,logy,xlogy,xx)
colnames(mydata)=c("X=x","Y=logy","XY","XX")
print(mydata)
sums<-list(sum(x),sum(logy),sum(x*logy),sum(x*x))
mydata<-rbind(mydata,sums)
print(mydata,row.names=FALSE)
power<-lm(log10(y)~x)
print(power)
alpha<-10^(coef(power)[1])
beta<-10^(coef(power)[2])
print(alpha)
print(beta)
f<-alpha*(beta^x)
print(f)
plot(x,y)
curve(alpha*(beta^x), from = x[1], to = x[n], add = TRUE)
output:
(Intercept)
0.6245328
1.611352
[1] 1.006342 1.277441 1.621572 2.058408 2.612923
[6] 3.316820 4.210339
e) Y = ae^(bX)
x<-c(1,2,3,4,5)
y<-c(1.8,5.1,8.9,14.1,19.8)
n<-length(x)
logy<-round(log10(y),digits=4)
xlogy<-round(x*logy,digits=4)
xx<-x*x
mydata <-data.frame(x,logy,xlogy,xx)
colnames(mydata)=c("X=x","Y=logy","XY","XX")
print(mydata)
sums<-list(sum(x),sum(logy),sum(x*logy),sum(x*x))
mydata<-rbind(mydata,sums)
print(mydata,row.names=FALSE)
power<-lm(log10(y)~x)
print(power)
alpha<-10^(coef(power)[1])
beta<-coef(power)[2]/0.4343
print(alpha)
print(beta)
f <- alpha*exp(beta*x)
print(f)
plot(x, y)
curve(alpha*exp(beta*x), from = x[1], to = x[n], add = TRUE)
output:
X=x Y=logy XY XX
1 0.2553 0.2553 1
2 0.7076 1.4152 4
3 0.9494 2.8482 9
4 1.1492 4.5968 16
5 1.2967 6.4835 25
15 4.3582 15.5990 55
Call:
lm(formula = log10(y) ~ x)
Coefficients:
(Intercept) x
0.1143 0.2524
(Intercept)
1.301047
0.5812651
[1]  2.326662  4.160770  7.440706 13.306216 23.795509
EXPERIMENT 5
AIM: ANOVA
a) one-way classification b) two-way classification
a) one-way classification:
Source code:
# Data
data <- c(25, 30, 28, 35, 40, 42, 45, 48, 46, 10, 15, 18)
group <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"))
# One-Way ANOVA
oneway_model <- aov(data ~ group)
print(summary(oneway_model))
output:
Df Sum Sq Mean Sq F value Pr(>F)
group 3 2188.2 729.4 52.09 4.71e-06 ***
Residuals 8 111.9 14.0
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
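Because the one-way ANOVA rejects the null hypothesis, a post-hoc comparison can show which
group means differ; a minimal sketch using the fitted model:
TukeyHSD(oneway_model)             # pairwise comparisons of group means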
b) two-way classification
Source code:
# Data
output <- c (10,12,11,13,14,15,13,16,12,15,14,16)
machine <- factor(rep(c("M1","M2","M3"), each = 4))
operator <- factor(rep(c("O1","O2"), times = 6))
# Two-Way ANOVA
twoway_model <- aov(output ~ machine + operator)
print(summary(twoway_model))
output:
Df Sum Sq Mean Sq F value Pr(>F)
machine 2 16.0 8.0 4.000 0.0574 .
operator 1 2.0 2.0 1.000 0.3390
Residuals 8 16.0 2.0
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
EXPERIMENT 6
AIM: Time series
a) Moving averages b) ARIMA
a) Moving averages
Source code:
# Sample monthly data (e.g., monthly demand or sales)
data <- c(25, 30, 28, 35, 40, 42, 45, 48, 46, 50, 53, 55)
months <- month.abb[1:12]
# Create a time series object
ts_data <- ts(data, start = c(2024, 1), frequency = 12)
# Display original data
print("Original Monthly Data:")
print(ts_data)
# 3-month Moving Average
ma3 <- filter(ts_data, rep(1/3, 3), sides = 2)
# 4-month Moving Average
ma4 <- filter (ts_data, rep(1/4, 4), sides = 2)
# 5-month Moving Average
ma5 <- filter(ts_data, rep(1/5, 5), sides = 2)
# Combine all into a data frame
result <- data.frame(
  Month = months,
  Sales = ts_data,
  MA_3 = round(ma3, 2),
  MA_4 = round(ma4, 2),
  MA_5 = round(ma5, 2)
)
print("Moving Averages Table:")
print(result)
# Plot
plot(ts_data, type = "o", col = "black", ylim=c(20,60), main = "Moving Averages",
ylab = "Sales", xlab = "Month")
lines(ma3, type="o", col="blue")
lines(ma4, type="o", col="red")
lines(ma5, type="o", col="green")
legend("topleft", legend=c("Original", "MA(3)", "MA(4)", "MA(5)"),
col=c("black", "blue", "red", "green"), lty=1, pch=1)
output:
[1] "Moving Averages Table:"
Month Sales MA_3 MA_4 MA_5
1 Jan 25 NA NA NA
2 Feb 30 27.67 29.50 NA
3 Mar 28 31.00 33.25 31.6
4 Apr 35 34.33 36.25 35.0
5 May 40 39.00 40.50 38.0
6 Jun 42 42.33 43.75 42.0
7 Jul 45 45.00 45.25 44.2
8 Aug 48 46.33 47.25 46.2
9 Sep 46 48.00 49.25 48.4
10 Oct 50 49.67 51.00 50.4
11 Nov 53 52.67 NA NA
12 Dec 55 NA NA NA
b) ARIMA
Source code:
# Load necessary libraries
install.packages("forecast")   # Only once
library(forecast)
# Create sample time series data
data <- c(112,118,132,129,121,135,148,148,136,119,104,118,
115,126,141,135,125,149,170,170,158,133,114,140)
# Convert to time series object (monthly data starting from Jan 1949)
ts_data <- ts(data, start = c(1949,1), frequency = 12)
# Plot the original time series
plot(ts_data, main = "Monthly Data", ylab = "Value", col = "blue")
# Check if the data is stationary using Augmented Dickey-Fuller Test
# install.packages("tseries") # Uncomment if not installed
library(tseries)
adf.test(ts_data)
# Apply differencing if not stationary (auto.arima handles it automatically)
# Fit ARIMA model
model <- auto.arima(ts_data)
# Display the ARIMA model summary
summary(model)
# Forecast next 12 periods
forecast_values <- forecast(model, h = 12)
# Print forecast results
print(forecast_values)
# Plot the forecast
plot(forecast_values, main = "ARIMA Forecast")
Output:
Series: ts_data
ARIMA(0,1,1)(0,1,1)[12]
Coefficients:
ma1 sma1
-0.3775 -0.5964
s.e. 0.2314 0.1534
sigma^2 estimated as 120.7: log likelihood=-140.42
AIC=286.83 AICc=287.89 BIC=291.79
Forecast Output (first few months)
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 1951 147.6 132.2 163.0 124.2 171.0
Feb 1951 137.8 122.3 153.3 114.2 161.4
Mar 1951 153.1 136.2 170.1 127.3 178.9
...
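As a follow-up, the forecast package provides residual diagnostics for the fitted model;
a minimal sketch:
checkresiduals(model)              # Ljung-Box test and residual plots for the ARIMA fit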
EXPERIMENT 7
AIM: Goodness of fit
a) Binomial b) Poisson
a) Binomial: The binomial distribution is a discrete probability distribution used when:
• A fixed number of independent trials (n) is conducted
• Each trial has two possible outcomes: success or failure
• The probability of success (p) is constant for each trial
P(X = k) = C(n, k) p^k (1 − p)^(n − k), where
P(X = k): probability of exactly k successes
n: total number of trials
p: probability of success in a single trial
k: number of successes (0 ≤ k ≤ n)
Source code:
# Observed frequencies
obs <- c(7, 6, 19, 35, 30, 23, 7, 1)
x <- 0:7
n <- 7
total <- sum(obs)
mean_val <- sum(obs * x) / total
p <- mean_val / n
# Expected frequencies
expected_probs <- dbinom(x, size = n, prob = p)
expected_freq <- round(expected_probs * total)
# Chi-square test
chisq.test(obs, p = expected_probs, rescale.p = TRUE)
Output:
Chi-squared test for given probabilities
data: obs
X-squared = 29.102, df = 7, p-value =
0.0001386
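Because p was estimated from the observed data, one further degree of freedom is usually
subtracted when comparing against the chi-square table; a minimal sketch of that adjustment:
chi_stat <- chisq.test(obs, p = expected_probs, rescale.p = TRUE)$statistic
pchisq(chi_stat, df = length(x) - 2, lower.tail = FALSE)   # df = 8 - 1 - 1 = 6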
b) Poisson : The Poisson distribution is a discrete probability distribution used to
model the number of events occurring in a fixed interval of time or space,
when the events happen independently and at a constant average rate (λ =
lambda)
f(x) = P(X = x) = (e^(−λ) λ^x) / x!
X: number of events
λ: average rate (mean)
e ≈ 2.718
Source code:
# Observed frequencies
x <- 0:8
f <- c(103, 143, 98, 42, 8, 4, 2, 0, 0)
# Total frequency
total <- sum(f)
# Calculate mean (lambda)
lambda <- sum(x * f) / total
# Expected probabilities
pr <- dpois(x, lambda = lambda)
# Expected frequencies
fe <- round(pr * total)
# Show data
mydata <- data.frame(x, f, pr = round(pr, 5), Expected = fe)
print(mydata)
# Chi-square test for goodness of fit
result <- chisq.test(f, p = pr, rescale.p = TRUE)
print(result)
# Critical value at 5% significance level
s <- length(x) - 1
cat("Chi-square table value (df =", s, "):", qchisq(0.95, df = s), "\n")
Output:
x f pr Expected
0 103 0.26647 107
1 143 0.35240 141
2 98 0.23303 93
3 42 0.10273 41
4 8 0.03396 14
5 4 0.00898 4
6 2 0.00198 1
7 0 0.00037 0
8 0 0.00006 0
Chi-squared test for given probabilities
data: f
X-squared = 4.7755, df = 8, p-value = 0.7813
Chi-square table value (df = 8 ): 15.50731
EXPERIMENT 8
AIM: Parametric tests
a) t-test for one-mean b) t-test for two means c) paired t-test d) F-test
a) t-test for one-mean: Tests whether the mean of a single group differs from
a known or hypothesized value.
Null Hypothesis (H₀): μ = μ₀
Alternative Hypothesis (H₁): μ ≠ μ₀ (or > or <)
Source code:
# Sample data
x <- c (25, 27, 29, 28, 26, 30, 31)
# Test if mean is 28
result <- t.test(x, mu = 28)
print(result)
Output:
data: x
t = 0.4472, df = 6, p-value = 0.6715
alternative hypothesis: true mean is not equal to 28
95 percent confidence interval: 26.72526 29.56045
sample estimates: mean of x = 28.14286
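For the one-sided alternatives mentioned above, the alternative argument can be set
explicitly; a minimal sketch:
t.test(x, mu = 28, alternative = "greater")   # H1: true mean is greater than 28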
b) t-test for two means: Compares the means of two independent groups.
Assumes equal or unequal variance (use var.equal).
H₀: μ₁ = μ₂
H₁: μ₁ ≠ μ₂
Source code:
# Two independent samples
group1 <- c (22, 25, 27, 30, 32)
group2 <- c (20, 23, 26, 28, 29)
# Test for equal means
result <- t.test(group1, group2, var.equal = TRUE)
print(result)
Output:
t = 0.8944, df = 8, p-value = 0.3967
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -3.20719 7.20719
mean of x = 27.2, mean of y = 25.2
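When equal variances cannot be assumed, omitting var.equal = TRUE (or setting it to FALSE)
gives Welch's t-test, which is R's default; a sketch:
t.test(group1, group2, var.equal = FALSE)     # Welch two-sample t-test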
c) paired t-test: Used when same subjects are measured before and after a
treatment. H₀: mean difference = 0
Source code:
# Before and after values
before <- c (200, 195, 210, 190, 205)
after <- c (198, 193, 208, 192, 204)
# Paired t-test
result <- t.test(before, after, paired = TRUE)
print(result)
Output:
t = 2.8284, df = 4, p-value = 0.0473
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: 0.2 3.8
mean of the differences: 2
d) F-test: Tests equality of variances of two populations.
H₀: σ₁² = σ₂²
H₁: σ₁² ≠ σ₂²
Source code:
# Two samples
x <- c (15, 16, 14, 15, 17, 18)
y <- c (10, 12, 9, 11, 13, 14)
# F-test for equality of variances
result <- var.test(x, y)
print(result)
Output:
F = 2.25, num df = 5, denom df = 5, p-value = 0.3012
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval: 0.4350 11.6426
EXPERIMENT 9
AIM: Non-parametric tests
a) Sign test b) Mann-Whitney test c) Run test d) Kolmogorov-Smirnov test
a) Sign test: The Sign Test is used to test the median of a population or for
comparing paired samples. It only considers the signs (+/-) of differences,
ignoring their magnitude.
Hypotheses:
H₀: The median difference is zero.
H₁: The median difference is not zero.
Source code:
# Install BSDA if not installed
# install.packages("BSDA")
library(BSDA)
# Example: Paired data (Before and After)
before <- c(55, 60, 52, 63, 70, 65, 62)
after <- c (58, 62, 54, 60, 68, 68, 63)
# Perform Sign Test
SIGN.test(x = before, y = after, alternative = "two.sided")
Output:
Dependent-samples Sign-Test
data: before and after
S = 1, p-value = 0.07031
alternative hypothesis: true median difference is not equal to 0
b) Mann-Whitney test (Wilcoxon Rank Sum Test): Used to compare two
independent samples. It’s a non-parametric alternative to the t-test.
Hypotheses:
H₀: The two populations are equal.
H₁: The two populations are not equal
Source code:
# Group 1 and Group 2 data
group1 <- c(85, 90, 88, 75, 95)
group2 <- c(80, 70, 78, 85, 68)
# Mann-Whitney U Test
wilcox.test(group1, group2, alternative = "two.sided")
Output:
Wilcoxon rank sum test with continuity correction
data: group1 and group2
W = 21.5, p-value = 0.2893
alternative hypothesis: true location shift is not equal to 0
c) Run test (Wald-Wolfowitz Runs Test): Checks the randomness of a sequence.
It counts the number of runs (uninterrupted sequences of similar items).
Source code:
# Install and load tseries
install.packages("tseries")
library(tseries)
# Sample sequence
sequence <- c(1.2, 1.5, 1.3, 1.7, 2.1, 1.9, 1.8, 2.2, 2.4)
# Run test
runs.test(as.factor(sequence > median(sequence)))
Output:
Runs Test
data: as.factor(sequence > median(sequence))
Standard Normal = 0.169, p-value = 0.866
alternative hypothesis: two.sided
d) Kolmogorov-Smirnov test: Used to test if a sample comes from a specific
distribution (e.g., normal), or to compare two samples. It’s based on the
maximum distance between their empirical distribution functions.
Source code:
x <- rnorm(30, mean=5)
y <- rnorm(30, mean=6)
ks.test(x, y)
Output:
Two-sample Kolmogorov-Smirnov test
data: x and y
D = 0.5, p-value = 0.0029
alternative hypothesis: two-sided
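For the one-sample form mentioned above, the sample can be compared against a named
theoretical distribution; a minimal sketch using the sample's own mean and standard deviation:
ks.test(x, "pnorm", mean = mean(x), sd = sd(x))   # one-sample test of x against a fitted normal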
EXPERIMENT 10
AIM: Graphical representation of data
a) Bar plot b) Frequency polygon c) Histogram d) Pie chart e) scatter plot
Graphical representation helps to visualize and interpret data effectively. Different types of
graphs are used depending on the nature of data and what insights are to be derived.
a) Bar plot:
Source Code:
H <- c(5,15,17,18,16,15)
M <- c(1980,1981,1982,1983,1984,1985)
barplot(H, xlab = "Year", ylab = "Profit", ylim = c(0, 20),
        col = rainbow(6), names.arg = M,
        main = "RVRJC PHARMACEUTICAL FIRM", border = "red")
Output:
b) Frequency polygon: A frequency polygon is a line graph that represents the
distribution of data. It is drawn by connecting the midpoints of the tops of the bars of a
histogram.
Source Code:
data <- c(10, 20, 20, 30, 30, 30, 40, 40, 50, 50, 50, 50, 60)
hist_data <- hist(data, plot = FALSE)
plot(hist_data$mids, hist_data$counts, type = "o", col = "red",
     xlab = "Class Intervals", ylab = "Frequency", main = "Frequency Polygon")
Output:
c) Histogram: A histogram is used for continuous data and shows the frequency
distribution of a dataset using adjacent rectangles.
Source Code:
v <- c(3, 5, 6, 19, 9, 18, 23, 67, 11, 10, 44, 45, 54, 37, 26, 8, 5, 1)
hist(v, main = "STUDENTS MARKS", xlab = "Marks", xlim = c(0, 70),
     ylab = "No. of students", ylim = c(0, 5), col = rainbow(10))
Output:
d) Pie chart: A pie chart is a circular graph divided into sectors representing
proportions of a whole.
Source Code:
# Pie Chart
slices <- c(10, 20, 30, 40)
labels <- c("Q1", "Q2", "Q3", "Q4")
pie(slices, labels = labels, main = "Pie Chart", col = rainbow(length(slices)))
Output:
e) scatter plot: A scatter plot shows the relationship between two continuous variables.
Points are plotted for each pair (x, y).
Source Code:
# Scatter Plot
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 4, 5, 7, 10, 12)
plot(x, y, main = "Scatter Plot", xlab = "X values", ylab = "Y values", col = "blue",
pch = 16)
Output: