Name: Niyati Shah
UIN: 633007415
STAT 654
Email: niyatibhavikshah@tamu.edu
Assignment 1
#1.

For the simple linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
fit to the data set $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the residuals are $e_i = y_i - \hat{y}_i$, and the sum of squared residuals is
$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2 .$$
To determine $\hat{\beta}_0$ and $\hat{\beta}_1$, we minimize $S$. Setting $\partial S / \partial \beta_0 = 0$:
$$-2 \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right] = 0$$
$$\sum y_i - n\beta_0 - \beta_1 \sum x_i = 0$$
$$\therefore \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
Now, setting $\partial S / \partial \beta_1 = 0$:
$$-2 \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right] x_i = 0$$
$$\sum x_i y_i - \beta_0 \sum x_i - \beta_1 \sum x_i^2 = 0.$$
Substituting $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ and solving:
$$\hat{\beta}_1 = \frac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}.$$
The square of the sample correlation coefficient equals $R^2$, the coefficient of determination.
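As a quick numerical check, the closed-form estimates can be compared against lm() in R. A minimal sketch with made-up x and y values (the numbers are illustrative only; any small data set would do):

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope formula
b0 <- mean(y) - b1 * mean(x)                                     # intercept formula
c(b0, b1)
coef(lm(y ~ x))  # should match the hand-derived values
cor(x, y)^2      # equals the R-squared reported by summary(lm(y ~ x))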
Residuals: $e_i = y_i - \hat{y}_i$, so the error sum of squares is
$$\mathrm{SSE} = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2 .$$
The total sum of squares is
$$\mathrm{SST} = \sum (y_i - \bar{y})^2 = \sum \left[ (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}) \right]^2 = \sum (y_i - \hat{y}_i)^2 + 2 \sum (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) + \sum (\hat{y}_i - \bar{y})^2 .$$
Now, $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and $\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}$, so
$$\hat{y}_i - \bar{y} = \hat{\beta}_1 (x_i - \bar{x}), \qquad y_i - \hat{y}_i = (y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x}).$$
Therefore,
$$\sum (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = \hat{\beta}_1 \sum (x_i - \bar{x})(y_i - \bar{y}) - \hat{\beta}_1^2 \sum (x_i - \bar{x})^2 = 0,$$
since $\hat{\beta}_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$. Hence,
$$\mathrm{SST} = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2 = \mathrm{SSE} + \mathrm{SSR},$$
where $\mathrm{SSR} = \sum (\hat{y}_i - \bar{y})^2$ is the regression sum of squares.
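The decomposition can likewise be verified numerically. A small sketch, reusing the toy x and y from the check above:

fit <- lm(y ~ x)
SST <- sum((y - mean(y))^2)
SSE <- sum(resid(fit)^2)
SSR <- sum((fitted(fit) - mean(y))^2)
all.equal(SST, SSE + SSR)  # TRUE: the cross term vanishes, as shown above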
#2.

Given that $\mu_X = 2$:
$$\mu_{Y|X}(x) = \beta_0 + \beta_1 (x - \mu_X) + \beta_2 (x - \mu_X)^2 = \beta_0 + \beta_1 (x - 2) + \beta_2 (x - 2)^2 .$$
Expanding:
$$\mu_{Y|X}(x) = \beta_0 + \beta_1 x - 2\beta_1 + \beta_2 (x^2 - 4x + 4) = (\beta_0 - 2\beta_1 + 4\beta_2) + (\beta_1 - 4\beta_2) x + \beta_2 x^2 .$$
Comparing this with $\mu_{Y|X}(x) = -8.5 - 3.2x + 0.7x^2$ and matching coefficients:
$$\beta_2 = 0.7,$$
$$\beta_1 - 4\beta_2 = -3.2 \;\Rightarrow\; \beta_1 = -3.2 + 4(0.7) = -3.2 + 2.8 = -0.4,$$
$$\beta_0 - 2\beta_1 + 4\beta_2 = -8.5 \;\Rightarrow\; \beta_0 = -8.5 + 2(-0.4) - 4(0.7) = -12.1.$$
Thus, the centered model is
$$\mu_{Y|X}(x) = -12.1 - 0.4(x - 2) + 0.7(x - 2)^2 .$$
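A quick way to confirm the algebra is to evaluate both parameterizations of the mean function on a grid of x values. A minimal sketch:

xg <- seq(-5, 5, by = 0.5)
centered <- -12.1 - 0.4 * (xg - 2) + 0.7 * (xg - 2)^2
expanded <- -8.5 - 3.2 * xg + 0.7 * xg^2
all.equal(centered, expanded)  # TRUE: the two forms agree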
*ChatGPT was used to help with learning some parts of the code.
Question D
R Code with Outputs
#Question D1
setwd("C:/Users/n-shah/OneDrive - Texas A&M University/Semester 3/STAT 654 Stats")
# Read the data from the text file
hc <- read.table("HardwoodTensileStr-1.txt", header=TRUE, sep="")
# View the data
head(hc)
Concentration Strength
1 1.0 6.3
2 1.5 11.1
3 2.0 20.0
4 3.0 24.0
5 4.0 26.1
6 4.5 30.0
# Center the predictor
hc$Concentration_centered <- hc$Concentration - mean(hc$Concentration)
y <- hc$Strength
# Fit a third order polynomial model
hc3 <- lm(Strength ~ poly(Concentration_centered, 3, raw=TRUE), data=hc)
#Summary of the model to get the adjusted R-squared and p-value
summary(hc3)
Call:
lm(formula = Strength ~ poly(Concentration_centered, 3, raw = TRUE),
    data = hc)
Residuals:
Min 1Q Median 3Q Max
-4.6250 -1.6109 0.0413 1.5892 5.0216
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.975562 0.869032 51.754 < 2e-16 ***
poly(Concentration_centered, 3, raw = TRUE)1 4.339394 0.350978 12.364 2.87e-09 ***
poly(Concentration_centered, 3, raw = TRUE)2 -0.548873 0.039199 -14.002 5.11e-10 ***
poly(Concentration_centered, 3, raw = TRUE)3 -0.055188 0.009789 -5.638 4.72e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.585 on 15 degrees of freedom
Multiple R-squared: 0.9707, Adjusted R-squared: 0.9648
F-statistic: 165.4 on 3 and 15 DF, p-value: 1.025e-11
The adjusted R² is 0.9648 and the p-value is 1.025e-11, which is far below the 0.01 significance level. Thus, the model is statistically significant and can be used to predict the tensile strength.
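The adjusted R² and the overall F-test p-value can also be extracted from the summary object directly instead of being read off the printout. A short sketch:

s3 <- summary(hc3)
s3$adj.r.squared    # adjusted R-squared
f <- s3$fstatistic  # F statistic with its degrees of freedom
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)  # overall p-value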
#Question D2
# Plot the data and the fitted model
plot(hc$Concentration_centered, y, main="Scatter Plot with Third Order Polynomial Fit",
     xlab="Concentration (centered)", ylab="Strength", pch=19)
points(hc$Concentration_centered,fitted(hc3), col="red",pch=19)
# Adding the fitted curve
curve(predict(hc3,newdata = data.frame(Concentration_centered=x)), add = TRUE, col="blue",lwd=2)
As seen in the graph above, the third-order polynomial appears to fit the dataset well.
#Question D3
# Residuals vs Fitted values plot
plot(fitted(hc3),resid(hc3), xlab = "Fitted Values", ylab="Residuals", main = "Residuals vs Fitted Values")
abline(h=0, col="red", lty=2)
# Fit a fifth order polynomial model to this data
hc5 <- lm(Strength ~ poly(Concentration_centered, 5, raw=TRUE), data=hc)
#Summary of the model to get the adjusted R-squared and p-value
summary(hc5)
Call:
lm(formula = Strength ~ poly(Concentration_centered, 5, raw = TRUE),
data = hc)
Residuals:
Min 1Q Median 3Q Max
-2.65167 -0.91159 -0.03811 0.96396 2.56865
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.6187788 0.7309210 59.676 < 2e-16 ***
poly(Concentration_centered, 5, raw = TRUE)1 5.3479308 0.3896655 13.724 4.11e-09 ***
poly(Concentration_centered, 5, raw = TRUE)2 -0.1378567 0.1059263 -1.301 0.215700
poly(Concentration_centered, 5, raw = TRUE)3 -0.1630817 0.0289147 -5.640 8.06e-05 ***
poly(Concentration_centered, 5, raw = TRUE)4 -0.0114448 0.0026525 -4.315 0.000840 ***
poly(Concentration_centered, 5, raw = TRUE)5 0.0021978 0.0005163 4.257 0.000935 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.703 on 13 degrees of freedom
Multiple R-squared: 0.989, Adjusted R-squared: 0.9847
F-statistic: 233.1 on 5 and 13 DF, p-value: 3.022e-12
The adjusted R² is 0.9847, which is greater than that of the third-order polynomial, and the p-value (3.022e-12) is below 0.01 and lower than that of the third-order model.
Question D4
summary_hc5 <- summary(hc5)
# p-values for the polynomial terms (excluding the intercept)
p_values <- summary_hc5$coefficients[-1, "Pr(>|t|)"]
# Finding the index of the term with the largest p-value (the intercept is
# already excluded, so the index lines up with the polynomial terms directly)
largest_p_value_index <- which.max(p_values)
# Generating the polynomial terms and removing the one with the largest p-value
poly_terms <- paste0("I(Concentration_centered^", 1:5, ")")
poly_terms <- poly_terms[-largest_p_value_index]
# Fitting the model with the updated formula
newf <- as.formula(paste("Strength ~", paste(poly_terms, collapse=" + ")))
hc5_reduced <- lm(newf, data=hc)
summary(hc5_reduced)
Call:
lm(formula = newf, data = hc)
Residuals:
Min 1Q Median 3Q Max
-2.65167 -0.91159 -0.03811 0.96396 2.56865
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.6187788 0.7309210 59.676 < 2e-16 ***
I(Concentration_centered^1) 5.3479308 0.3896655 13.724 4.11e-09 ***
I(Concentration_centered^2) -0.1378567 0.1059263 -1.301 0.215700
I(Concentration_centered^3) -0.1630817 0.0289147 -5.640 8.06e-05 ***
I(Concentration_centered^4) -0.0114448 0.0026525 -4.315 0.000840 ***
I(Concentration_centered^5) 0.0021978 0.0005163 4.257 0.000935 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.703 on 13 degrees of freedom
Multiple R-squared: 0.989, Adjusted R-squared: 0.9847
F-statistic: 233.1 on 5 and 13 DF, p-value: 3.022e-12
# Scatterplot of the data with the fitted curve from the final model
plot(hc$Concentration_centered, hc$Strength, main="Scatter Plot with Reduced Model Fit",
     xlab="Concentration (Centered)", ylab="Strength", pch=19)
lines(sort(hc$Concentration_centered), predict(hc5_reduced)[order(hc$Concentration_centered)],
      col="blue", lwd=2)
Omitting the polynomial term with the largest p-value did not affect the model's significance; the adjusted R² remains 0.9847, the same as before. Both visually and statistically, the fifth-degree polynomial fits the dataset better and more precisely than the third-order polynomial.
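Since the third- and fifth-order fits are nested, the claim that the fifth-order model fits better can also be checked formally with a partial F-test. A short sketch using anova() on the two fitted models:

# Partial F-test: do the fourth- and fifth-order terms improve the fit?
anova(hc3, hc5)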
Question E
Question E1
setwd("C:/Users/n-shah/OneDrive - Texas A&M University/Semester 3/STAT 654 Stats")
data <- read.table("TreeAgeDiamSugarMaple-1.txt", header = TRUE, sep = "")
x=data$Diamet
y=data$Age
lm1=lm(y~poly(x,1,raw=TRUE), data=data)
lm2=lm(y~poly(x,2,raw=TRUE), data=data)
lm3=lm(y~poly(x,3,raw=TRUE), data=data)
lm4=lm(y~poly(x,4,raw=TRUE), data=data)
lm5=lm(y~poly(x,5,raw=TRUE), data=data)
lm6=lm(y~poly(x,6,raw=TRUE), data=data)
lm7=lm(y~poly(x,7,raw=TRUE), data=data)
lm8=lm(y~poly(x,8,raw=TRUE), data=data)
AIC <- c(AIC(lm1), AIC(lm2), AIC(lm3), AIC(lm4), AIC(lm5), AIC(lm6), AIC(lm7), AIC(lm8))
BIC <- c(BIC(lm1), BIC(lm2), BIC(lm3), BIC(lm4), BIC(lm5), BIC(lm6), BIC(lm7), BIC(lm8))
AIC
BIC
AIC
[1] 239.5899 230.0744 231.8443 233.4604 235.2434 237.2078
[7] 234.6909 234.6980
BIC
[1] 243.4774 235.2577 238.3235 241.2355 244.3142 247.5745
[7] 246.3534 247.6563
which.min(AIC)
[1] 2
which.min(BIC)
[1] 2
Here, the smallest AIC and BIC values both occur for the second-order polynomial, so that model is used for further analysis.
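The eight fits and their criteria can also be generated in a loop, which avoids the repetition above. An equivalent sketch:

aic <- bic <- numeric(8)
for (k in 1:8) {
  fit_k <- lm(y ~ poly(x, k, raw = TRUE), data = data)
  aic[k] <- AIC(fit_k)
  bic[k] <- BIC(fit_k)
}
which.min(aic); which.min(bic)  # both should point to the second-order model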
Question E2
summary(lm2)
Call:
lm(formula = y ~ poly(x, 2, raw = TRUE), data = data)
Residuals:
Min 1Q Median 3Q Max
-25.451 -10.027 -1.046 8.201 31.469
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.4693224 5.7627144 1.123 0.27271
poly(x, 2, raw = TRUE)1 0.4545286 0.0658670 6.901 3.89e-07 ***
poly(x, 2, raw = TRUE)2 -0.0004106 0.0001149 -3.573 0.00154 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.68 on 24 degrees of freedom
Multiple R-squared: 0.9072, Adjusted R-squared: 0.8995
F-statistic: 117.4 on 2 and 24 DF, p-value: 4.062e-13
plot(lm2$fitted,lm2$res,main="",xlab="fitted",ylab="residuals",pch=19)
abline(h=mean(lm2$res),col="red")
The p-value is 4.062e-13.
Question E4
# 95% prediction intervals at the observed diameters. Note that newdata for
# this model must contain the predictor x; a data frame without x (e.g., one
# containing only Age) is silently ignored and the fitted data are used.
prediction <- predict(lm2, interval="prediction", level=0.95)
prediction
fit lwr upr
1 7.765916 -26.614207 42.14604
2 8.411693 -25.920323 42.74371
3 10.334476 -23.860730 44.52968
4 12.880690 -21.148054 46.90943
5 13.508629 -20.481620 47.49888
6 14.139317 -19.813281 48.09191
7 15.395654 -18.484949 49.27626
8 29.935475 -3.391954 63.26290
9 32.262878 -1.021172 65.54693
10 43.505908 10.277105 76.73471
11 47.816181 14.548155 81.08421
12 48.877911 15.595732 82.16009
13 49.929246 16.631410 83.22708
14 50.977566 17.662541 84.29259
15 53.054053 19.700554 86.40755
16 74.100469 40.126265 108.07467
17 74.541587 40.551983 108.53119
18 95.148398 60.531815 129.76498
19 95.500596 60.876930 130.12426
20 108.199612 73.486420 142.91280
21 122.220881 87.938708 156.50305
22 130.676961 96.543286 164.81064
23 132.241131 96.559324 167.92294
24 132.058646 97.289721 166.82757
25 132.083716 97.279233 166.88820
26 132.233644 97.063349 167.40394
27 132.198768 97.159801 167.23773
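To predict at a single new diameter instead of the observed ones, newdata must supply the predictor x. A sketch with a hypothetical diameter of 200 (the value is illustrative, not from the assignment):

new_x <- data.frame(x = 200)  # hypothetical diameter, for illustration only
predict(lm2, newdata = new_x, interval = "prediction", level = 0.95)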