First Assignment
Yaikob Melaku
2024-11-22
R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring
HTML, PDF, and MS Word documents. For more details on using R Markdown see
http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as
well as the output of any embedded R code chunks within the document. You can embed an
R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
Including Plots
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the code chunk to prevent printing
of the R code that generated the plot.
#1) Describe the assumptions of the linear regression model
Linear regression is a widely used statistical method for modeling the relationship between
a dependent variable and one or more independent variables. Here are the key
assumptions underlying linear regression:
1. Linearity
The relationship between the independent variables and the dependent variable is linear:
each one-unit change in an independent variable shifts the mean of the dependent variable
by a constant amount, regardless of the variable's level.
2. Independence
Observations are independent of each other. This means that the residuals (the differences
between observed and predicted values) should not be correlated. This assumption is
critical for valid statistical inference.
3. Homoscedasticity
The variance of the residuals should be constant across all levels of the independent
variables. In other words, the spread of residuals should be roughly the same regardless of
the value of the independent variables.
4. Normality of Residuals
The residuals should be approximately normally distributed. This is especially important
for hypothesis testing and constructing confidence intervals. Normality can be checked
using Q-Q plots or statistical tests like the Shapiro-Wilk test.
5. No Multicollinearity
Independent variables should not be too highly correlated with each other.
Multicollinearity can inflate the variance of the coefficient estimates and make the model
unstable.
6. No Autocorrelation
In time series data, the residuals should not be correlated with each other. Autocorrelation
can lead to inefficient estimates and affect the validity of statistical tests.
7. Adequate Sample Size
A sufficient number of observations are needed to provide reliable estimates of the
parameters. Small sample sizes can lead to overfitting and unreliable coefficient estimates.
Summary
These assumptions are crucial for the validity of the linear regression model. Violating
these assumptions may lead to biased or inefficient estimates, reducing the model's
predictive accuracy and interpretability. It is essential to assess these assumptions when
developing a linear regression model and consider transformations or alternative methods
if they are not met.
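These checks can be carried out directly in R. The sketch below uses simulated data, so every variable name and number in it is illustrative rather than taken from the assignment's data sets:

```r
# A minimal sketch of assumption checks on simulated data
set.seed(1)
x1 <- runif(60, 0, 10)
x2 <- x1 + rnorm(60, sd = 2)            # deliberately correlated with x1
y  <- 3 + 2 * x1 + 0.5 * x2 + rnorm(60)
fit <- lm(y ~ x1 + x2)

# Linearity / homoscedasticity: residuals vs fitted should look patternless
plot(fitted(fit), resid(fit)); abline(h = 0)

# Normality of residuals: Q-Q plot plus the Shapiro-Wilk test
qqnorm(resid(fit)); qqline(resid(fit))
shapiro.test(resid(fit))

# Multicollinearity: variance inflation factor for x1, computed by hand
# (car::vif() automates this for every predictor)
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)

# Autocorrelation: Durbin-Watson statistic by hand; values near 2 suggest
# no autocorrelation (lmtest::dwtest() automates this)
e  <- resid(fit)
dw <- sum(diff(e)^2) / sum(e^2)
```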
#2 QUESTION ##a) Fit a simple linear regression model relating weight and systolic
blood pressure
getwd()
## [1] "C:/Users/pc/Desktop/4th year"
setwd("C:/Users/pc/desktop")
A1=read.table("table1.txt",header = T)
A1
## sub weight sybBP
## 1 1 165 130
## 2 2 167 133
## 3 3 180 150
## 4 4 155 128
## 5 5 212 151
## 6 6 175 146
## 7 7 190 150
## 8 8 210 140
## 9 9 200 148
## 10 10 149 125
## 11 11 158 133
## 12 12 169 135
## 13 13 170 150
## 14 14 172 153
## 15 15 159 128
## 16 16 168 132
## 17 17 174 149
## 18 18 183 158
## 19 19 215 150
## 20 20 195 163
## 21 21 180 156
## 22 22 143 124
## 23 23 240 170
## 24 24 235 165
## 25 25 192 160
## 26 26 187 159
model<-lm(A1$sybBP~A1$weight)
summary(model)
##
## Call:
## lm(formula = A1$sybBP ~ A1$weight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.182 -6.485 -2.519 8.926 12.143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.10437 12.91013 5.353 1.71e-05 ***
## A1$weight 0.41942 0.07015 5.979 3.59e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.681 on 24 degrees of freedom
## Multiple R-squared: 0.5983, Adjusted R-squared: 0.5815
## F-statistic: 35.74 on 1 and 24 DF, p-value: 3.591e-06
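For reference, the coefficients printed above imply the fitted line sybBP = 69.10 + 0.4194 × weight; a small sketch (the helper name sbp_hat is illustrative, not part of the assignment code):

```r
# Fitted line implied by summary(model); coefficients copied from the output
b0 <- 69.10437
b1 <- 0.41942
sbp_hat <- function(weight) b0 + b1 * weight   # illustrative helper
sbp_hat(185)   # predicted mean systolic BP at 185 lbs, about 146.7 mmHg
```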
##b) Display scatter plot for the data with the fitted regression line
# Load necessary libraries
library(ggplot2)
# Scatter plot of the data with the fitted regression line
ggplot(A1, aes(x = weight, y = sybBP)) +
  geom_point(color = 'blue') +                 # observed data points
  geom_smooth(method = 'lm', color = 'red') +  # fitted line with 95% CI band
  labs(title = 'Scatter Plot of Weight vs Systolic Blood Pressure',
       x = 'Weight (lbs)',
       y = 'Systolic Blood Pressure (mmHg)') +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
##c) Construct the analysis-of-variance table and test for significance of regression.
anova_table <- anova(model)
print(anova_table)
## Analysis of Variance Table
##
## Response: A1$sybBP
## Df Sum Sq Mean Sq F value Pr(>F)
## A1$weight 1 2693.6 2693.58 35.744 3.591e-06 ***
## Residuals 24 1808.6 75.36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
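The F test in this table can be reproduced by hand from the printed sums of squares:

```r
# Regression F test rebuilt from the ANOVA table above
ss_reg <- 2693.58                       # SS for weight, 1 df
ss_res <- 1808.6                        # residual SS, 24 df
f_stat <- (ss_reg / 1) / (ss_res / 24)  # about 35.74
p_val  <- pf(f_stat, df1 = 1, df2 = 24, lower.tail = FALSE)
```

Since the resulting p-value (about 3.6e-06) is far below 0.05, the regression is significant.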
##d) Construct a 95% CI on the slope, intercept and expected response, and display the
95% expected response on the plot given in (b).
# Confidence intervals
conf_int <- confint(model)
print(conf_int)
## 2.5 % 97.5 %
## (Intercept) 42.4591756 95.7495700
## A1$weight 0.2746281 0.5642023
# 95% confidence band for the expected response, evaluated at each observed
# weight; these 26 intervals trace the shaded band that geom_smooth() draws
# around the fitted line in (b). Note: because the model formula hard-codes
# A1$weight, predict() silently ignores any `newdata` argument for this fit.
new_data <- data.frame(weight = 185)
predicted <- predict(model, interval = "confidence", level = 0.95)
print(predicted)
## fit lwr upr
## 1 138.3079 133.9824 142.6334
## 2 139.1467 134.9835 143.3100
## 3 144.5991 141.0679 148.1303
## 4 134.1137 128.8118 139.4157
## 5 158.0204 152.4810 163.5598
## 6 142.5020 138.8276 146.1764
## 7 148.7933 145.1123 152.4742
## 8 157.1816 151.8629 162.5002
## 9 152.9874 148.6489 157.3259
## 10 131.5972 125.6169 137.5776
## 11 135.3720 130.3870 140.3570
## 12 139.9855 135.9702 144.0009
## 13 140.4050 136.4576 144.3523
## 14 141.2438 137.4197 145.0679
## 15 135.7914 130.9080 140.6748
## 16 139.5661 135.4787 143.6535
## 17 142.0826 138.3633 145.8019
## 18 145.8574 142.3427 149.3720
## 19 159.2786 153.3970 165.1603
## 20 150.8903 146.9328 154.8479
## 21 144.5991 141.0679 148.1303
## 22 129.0807 122.3780 135.7835
## 23 169.7640 160.7174 178.8107
## 24 167.6669 159.2827 176.0512
## 25 149.6321 145.8547 153.4095
## 26 147.5350 143.9594 151.1107
##e) Calculate R2 and interpret the result
r_squared <- summary(model)$r.squared
print(paste("R-squared: ", r_squared))
## [1] "R-squared: 0.59828724501484"
The R-squared value of approximately 0.598 indicates that about 59.8% of the variance in
the dependent variable (systolic blood pressure) can be explained by the independent
variable (weight) in the linear regression model.
Interpretation:
Explanation of Variance: This means that nearly 60% of the changes in systolic blood
pressure among the individuals in your study can be attributed to changes in their weight.
Model Fit: An R-squared value of 0.598 suggests a moderate fit of the model. While the
model explains a significant portion of the variance, there are still other factors (about
40.2%) influencing systolic blood pressure that are not captured by weight alone.
Contextual Considerations: The interpretation of R-squared can depend on the context of
the study. In some fields, an R-squared of 0.598 might be considered acceptable, while in
others, it might be viewed as insufficient.
Limitations: R-squared does not provide information about the model's accuracy or
whether the regression assumptions are met. Therefore, further model diagnostics should
be performed to evaluate the model's performance comprehensively.
##f) Find a 95% CI on the mean systolic blood pressure when the individual weight is 185
# CI for the mean response at weight = 185. The model is refit with the
# data-frame interface because, with the original A1$sybBP ~ A1$weight
# formula, predict() ignores `newdata` and returns intervals for all 26
# observed weights instead of the single requested one.
model_f <- lm(sybBP ~ weight, data = A1)
predicted_mean <- predict(model_f, newdata = data.frame(weight = 185),
interval = "confidence", level = 0.95)
print(predicted_mean)
This prints a single row: the estimated mean systolic blood pressure for an individual
weighing 185 lbs, 69.10437 + 0.41942 × 185 ≈ 146.7 mmHg, together with its 95% lower
and upper confidence limits.
#3 QUESTION ##a) Fit a multiple linear regression model relating gasoline mileage y
(miles per gallon) to engine displacement x1 and the number of carburetor barrels x6.
setwd("C:/Users/pc/desktop")
A2=read.table("table221.txt",header = T)
A2
## automobile y.0 x1 x2 x3 x4.0 x5.0 x6 x7 x8.0 x9.0 x10 x11
## 1 Apollo 18.90 350.0 165 260 8.00:1 2.56:1 4 3 200.3 69.9 3910 A
## 2 Omega 17.00 350.0 170 275 8.50:1 2.56:1 4 3 199.6 72.9 2860 A
## 3 Nova 20.00 250.0 105 185 8.25:1 2.73:1 1 3 196.7 72.2 3510 A
## 4 Monarch 18.25 351.0 143 255 8.00:1 3.00:1 2 3 199.9 74.0 3890 A
## 5 Duster 20.07 225.0 95 170 8.40:1 2.76:1 1 3 194.1 71.8 3365 M
## 6 Jenson 11.20 440.0 215 330 8.20:1 2.88:1 4 3 184.5 69.0 4215 A
## 7 Skyhawk 22.12 231.0 110 175 8.00:1 2.56:1 2 3 179.3 65.4 3020 A
## 8 Monza 21.47 262.0 110 200 8.50:1 2.56:1 2 3 179.3 65.4 3180 A
## 9 Scirocco 34.70 89.7 70 81 8.20:1 3.90:1 2 4 155.7 64.0 1905 M
## 10 Corolla 30.40 96.9 75 83 9.00:1 4.30:1 2 5 165.2 65.0 2320 M
## 11 Camaro 16.50 350.0 155 250 8.50:1 3.08:1 4 3 195.4 74.4 3885 A
## 12 Datsun 36.50 85.3 80 83 8.50:1 3.89:1 2 4 160.6 62.2 2009 M
## 13 CapriII 21.50 171.0 109 146 8.20:1 3.22:1 2 4 170.4 66.9 2655 M
## 14 Pacer 19.70 258.0 110 195 8.00:1 3.08:1 1 3 171.5 77.0 3375 A
## 15 Babcat 20.30 140.0 83 109 8.40:1 3.40:1 2 4 168.8 69.4 2700 M
## 16 Granada 17.80 302.0 129 220 8.00:1 3.00:1 2 3 199.9 74.0 3890 A
## 17 Eldorado 14.39 500.0 190 360 8.50:1 2.73:1 4 3 224.1 79.8 5290 A
## 18 Imperial 14.89 440.0 215 330 8.20:1 2.71:1 4 3 231.0 79.7 5185 A
## 19 NovaLN 17.80 350.0 155 250 8.50:1 3.08:1 4 3 196.7 72.2 3910 A
## 20 Valiant 16.41 318.0 145 255 8.50:1 2.45:1 2 3 197.6 71.0 3660 A
## 21 Starfire 23.54 231.0 110 175 8.00:1 2.56:1 2 3 179.3 65.4 3050 A
## 22 Cordoba 21.47 360.0 180 290 8.40:1 2.45:1 2 3 214.2 76.3 4250 A
## 23 TransAM 16.59 400.0 185 NA 7.60:1 3.08:1 4 3 196.0 73.0 3850 A
## 24 CorollaE-5 31.90 96.9 75 83 9.00:1 4.30:1 2 5 165.2 61.8 2275 M
## 25 Astre 29.40 140.0 86 NA 8.00:1 2.92:1 2 4 176.4 65.4 2150 M
## 26 MarkIV 13.27 460.0 223 366 8.00:1 3.00:1 4 3 228.0 79.8 5430 A
## 27 CelicaGT 23.90 133.6 96 120 8.40:1 3.91:1 2 5 171.5 63.4 2535 M
## 28 ChargerSE 19.73 318.0 140 255 8.50:1 2.71:1 2 3 215.3 76.3 4370 A
## 29 Cougar 13.90 351.0 148 243 8.00:1 3.25:1 2 3 215.5 78.5 4540 A
## 30 Elite 13.27 351.0 148 243 8.00:1 3.26:1 2 3 216.1 78.5 4715 A
## 31 Matador 13.77 360.0 195 295 8.25:1 3.15:1 4 3 209.3 77.4 4215 A
## 32 Corvette 16.50 350.0 165 255 8.50:1 2.73:1 4 3 185.2 69.0 3660 A
model<-lm(A2$y.0~A2$x1+A2$x6 , data = A2)
summary(model)
##
## Call:
## lm(formula = A2$y.0 ~ A2$x1 + A2$x6, data = A2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0456 -1.6368 -0.3348 1.6503 6.2540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.910041 1.540929 21.357 < 2e-16 ***
## A2$x1 -0.053025 0.006145 -8.628 1.68e-09 ***
## A2$x6 0.929500 0.670108 1.387 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.021 on 29 degrees of freedom
## Multiple R-squared: 0.7862, Adjusted R-squared: 0.7714
## F-statistic: 53.31 on 2 and 29 DF, p-value: 1.934e-10
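The coefficients above imply the fitted equation y = 32.91 − 0.0530 x1 + 0.9295 x6; a small sketch (the helper name mpg_hat is illustrative, not part of the assignment code):

```r
# Fitted equation implied by summary(model); coefficients from the output
b0 <- 32.910041
b1 <- -0.053025
b6 <- 0.929500
mpg_hat <- function(x1, x6) b0 + b1 * x1 + b6 * x6   # illustrative helper
mpg_hat(350, 4)   # e.g. a 350 cu in engine with 4 barrels, about 18.1 mpg
```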
##b) Construct the analysis-of-variance table and test for significance of regression.
anova_table<-anova(model)
print(anova_table)
## Analysis of Variance Table
##
## Response: A2$y.0
## Df Sum Sq Mean Sq F value Pr(>F)
## A2$x1 1 955.34 955.34 104.687 3.916e-11 ***
## A2$x6 1 17.56 17.56 1.924 0.176
## Residuals 29 264.65 9.13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
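Because anova() reports sequential sums of squares, the overall regression F statistic is obtained by pooling the two model rows; a sketch from the printed values:

```r
# Overall regression F assembled from the sequential ANOVA above
ss_model  <- 955.34 + 17.56              # SS(x1) + SS(x6 | x1), 2 df
ms_resid  <- 264.65 / 29                 # residual mean square
f_overall <- (ss_model / 2) / ms_resid   # about 53.3, matching summary(model)
```

With the summary's p-value of about 1.9e-10, the regression as a whole is highly significant.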
##C) Calculate R2 and R2 Adj for this model. Compare this to the R2 and the R2 Adj for the
simple linear regression model relating mileage to engine displacement
summary_model<-summary(model)
r_squared<-summary_model$r.squared
r_squared_adj<-summary_model$adj.r.squared
# For simple linear regression (mileage vs displacement)
simple_model<-lm(A2$y.0~A2$x1, data = A2)
summary_simple<-summary(simple_model)
r_squared_simple<-summary_simple$r.squared
r_squared_adj_simple<-summary_simple$adj.r.squared
# Print R2 and R2 Adj
cat("Multiple Regression R2:", r_squared, "\n")
## Multiple Regression R2: 0.7861525
cat("Multiple Regression R2 Adj:", r_squared_adj, "\n")
## Multiple Regression R2 Adj: 0.7714044
cat("Simple Regression R2:", r_squared_simple, "\n")
## Simple Regression R2: 0.7719647
cat("Simple Regression R2 Adj:", r_squared_adj_simple, "\n")
## Simple Regression R2 Adj: 0.7643635
Comparison: adding x6 raises R2 only from 0.772 to 0.786, and the adjusted R2 moves
from 0.764 to just 0.771. The number of carburetor barrels therefore explains very little
variation in mileage beyond what engine displacement already accounts for.
##D) Find a 95% CI for β1 and β6
confint_model<-confint(model, level = 0.95)
print(confint_model)
## 2.5 % 97.5 %
## (Intercept) 29.75848643 36.06159556
## A2$x1 -0.06559354 -0.04045598
## A2$x6 -0.44102434 2.30002368
##E) Compute the t statistics for testing H0: β1 = 0 and H0: β6 = 0. What conclusions can you
draw?
t_stats<-summary_model$coefficients[, "t value"]
p_values<-summary_model$coefficients[, "Pr(>|t|)"]
cat("t-statistic for β1:", t_stats["A2$x1"], "p-value:",
p_values["A2$x1"], "\n")
## t-statistic for β1: -8.628348 p-value: 1.676495e-09
cat("t-statistic for β6:", t_stats["A2$x6"], "p-value:",
p_values["A2$x6"], "\n")
## t-statistic for β6: 1.38709 p-value: 0.1759837
Conclusions: the t test for β1 (p ≈ 1.7e-09) rejects H0: β1 = 0, so engine displacement is a
significant predictor of mileage. The t test for β6 (p ≈ 0.176) fails to reject H0: β6 = 0, so
the number of carburetor barrels adds no significant explanatory power once displacement
is already in the model.
##F) Find a 95% prediction interval for a new observation on gasoline mileage when x1 =
275 and x6 = 2 barrels.
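The document ends before computing this interval. A sketch, with the y (mileage), x1 (displacement), and x6 (carburetor barrels) columns transcribed from the data listing in (a); the model is refit with the data-frame interface so that predict() honors newdata:

```r
# 95% prediction interval for a new observation at x1 = 275, x6 = 2
y  <- c(18.90, 17.00, 20.00, 18.25, 20.07, 11.20, 22.12, 21.47, 34.70, 30.40,
        16.50, 36.50, 21.50, 19.70, 20.30, 17.80, 14.39, 14.89, 17.80, 16.41,
        23.54, 21.47, 16.59, 31.90, 29.40, 13.27, 23.90, 19.73, 13.90, 13.27,
        13.77, 16.50)
x1 <- c(350, 350, 250, 351, 225, 440, 231, 262, 89.7, 96.9,
        350, 85.3, 171, 258, 140, 302, 500, 440, 350, 318,
        231, 360, 400, 96.9, 140, 460, 133.6, 318, 351, 351,
        360, 350)
x6 <- c(4, 4, 1, 2, 1, 4, 2, 2, 2, 2, 4, 2, 2, 1, 2, 2, 4, 4, 4, 2,
        2, 2, 4, 2, 2, 4, 2, 2, 2, 2, 4, 4)
m    <- lm(y ~ x1 + x6)
pred <- predict(m, newdata = data.frame(x1 = 275, x6 = 2),
                interval = "prediction", level = 0.95)
pred   # point prediction 32.91 - 0.053*275 + 0.93*2, about 20.2 mpg
```

The lwr/upr columns give the 95% prediction limits; they are wider than the corresponding confidence limits because a prediction interval also carries the residual variation of a single new observation.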