Machine Learning
Lecturer: Ya-Mei Chang
Office: Room 446, ext 66117
[email protected]
Textbook
Title: An Introduction to Statistical Learning: with Applications in R, 2021
Authors: G. James, D. Witten, T. Hastie and R. Tibshirani
Reference Book
Title: The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Authors: T. Hastie, R. Tibshirani and J. Friedman
Grading:
⚫ Attendance 10%
⚫ Coursework 30%
⚫ Midterm Exam 30%
⚫ Final Report 30%
Office hours:
Tue. 10:00~11:00
Thu. 10:00~11:00
What Is Statistical Learning?
⚫ An example: predicting a response Y (e.g., product sales) from a predictor X (e.g., TV advertising budget).
[Figure: scatter plot of the response Y against the predictor X]
More generally, suppose that we observe a quantitative response Y and p
different predictors, X1, X2, . . . , Xp. We assume that there is some relationship
between Y and X = (X1, X2, . . . , Xp), which can be written in the very general form

Y = f(X) + ε.

Here f is some fixed but unknown function of X1, . . . , Xp, and ε is a random
error term, which is independent of X and has mean zero. In this formulation, f
represents the systematic information that X provides about Y.
⚫ In essence, statistical learning refers to a set of approaches for estimating f.
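To make this concrete, here is a minimal R sketch (not from the text; the form of f, the sample size, and the error variance are illustrative assumptions) that simulates data from Y = f(X) + ε and estimates f with a smoother:

# Simulate from Y = f(X) + eps for a made-up f, then estimate f from the data
set.seed(1)
n <- 100
x <- runif(n, 0, 10)                 # predictor
f <- function(x) 2 + 3 * sin(x / 2)  # "true" f (an assumption for illustration)
eps <- rnorm(n, mean = 0, sd = 1)    # error term: independent of X, mean zero
y <- f(x) + eps
fit <- loess(y ~ x)                  # one possible nonparametric estimate of f
plot(x, y)
lines(sort(x), predict(fit)[order(x)], col = "red", lwd = 2)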
Linear Regression
⚫ Simple Linear Regression
➢ Assumption:
It assumes that there is approximately a linear relationship between X and
Y. Mathematically, we can write this linear relationship as

Y ≈ β0 + β1 X.

β0 and β1 are two unknown constants that represent the intercept and
slope terms in the linear model. Once we have used our training data to
produce estimates β̂0 and β̂1 for the model coefficients, we can predict
future sales on the basis of a particular value of TV advertising by
computing

ŷ = β̂0 + β̂1 x,

where ŷ indicates a prediction of Y on the basis of X = x.
➢ Estimating the Coefficients:
Let (x1, y1), (x2, y2), . . . , (xn, yn) represent n observation pairs, each of which
consists of a measurement of X and a measurement of Y. Let
ŷi = β̂0 + β̂1 xi be the prediction for Y based on the ith value of X. Then
ei = yi − ŷi represents the ith residual: the difference between the
ith observed response value and the ith response value that is predicted by
our linear model. We define the residual sum of squares (RSS) as

RSS = e1² + e2² + · · · + en²,

or equivalently

RSS = (y1 − β̂0 − β̂1 x1)² + (y2 − β̂0 − β̂1 x2)² + · · · + (yn − β̂0 − β̂1 xn)².

The least squares approach chooses β̂0 and β̂1 to minimize the RSS.
Using some calculus, one can show that the minimizers are

β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,
β̂0 = ȳ − β̂1 x̄,

where the sums run over i = 1, . . . , n, and ȳ = Σ yi / n and x̄ = Σ xi / n are the sample means.
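As a quick check of these formulas, a small R sketch (simulated data; the true intercept and slope are illustrative assumptions) computes the least squares estimates by hand and compares them with lm():

set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)  # true intercept 1, slope 2 (made up)
beta1.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0.hat <- mean(y) - beta1.hat * mean(x)
c(beta0.hat, beta1.hat)     # hand computation
coef(lm(y ~ x))             # should agree with the line above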
➢ Assessing the Accuracy of the Coefficient Estimates:
1. About μ̂ (the estimate of the population mean μ of Y):
its standard error satisfies SE(μ̂)² = σ²/n. In general, σ² is not known; σ is
estimated by the residual standard error (RSE), given by the formula
RSE = √(RSS / (n − 2)).
2. About β̂0,
SE(β̂0)² = σ² [1/n + x̄² / Σ (xi − x̄)²], and the 95% confidence interval
for β0 approximately takes the form

β̂0 ± 2 · SE(β̂0).

3. About β̂1,
SE(β̂1)² = σ² / Σ (xi − x̄)², and the 95% confidence interval for β1
approximately takes the form

β̂1 ± 2 · SE(β̂1).

Standard errors can also be used to perform hypothesis tests on the coefficients:
H0 : There is no relationship between X and Y
versus
Ha : There is some relationship between X and Y.
Mathematically, this corresponds to testing

H0 : β1 = 0 versus Ha : β1 ≠ 0,

which is carried out by computing the t-statistic t = (β̂1 − 0) / SE(β̂1); under H0
this follows a t-distribution with n − 2 degrees of freedom.
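A short R sketch (simulated data with made-up coefficients) verifying that the approximate interval β̂1 ± 2 · SE(β̂1) closely matches the exact interval from confint():

set.seed(1)
x <- rnorm(100)
y <- 3 + 0.5 * x + rnorm(100)    # made-up true model
fit <- lm(y ~ x)
se <- coef(summary(fit))["x", "Std. Error"]
coef(fit)["x"] + c(-2, 2) * se   # approximate 95% CI
confint(fit)["x", ]              # exact 95% CI (uses t quantiles)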
➢ Assessing the Accuracy of the Model:
1. RSE: RSE = √(RSS / (n − 2)) measures the lack of fit of the model, in the units of Y.
2. R² statistic: R² = (TSS − RSS) / TSS = 1 − RSS / TSS, where TSS = Σ (yi − ȳ)²
is the total sum of squares; R² is the proportion of variability in Y that can be
explained using X.
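Both measures are easy to compute by hand; a minimal sketch (same simulated data as above, an illustrative assumption):

set.seed(1)
x <- rnorm(100)
y <- 3 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
sqrt(rss / (length(y) - 2))   # RSE by hand; compare summary(fit)$sigma
1 - rss / tss                 # R^2 by hand; compare summary(fit)$r.squared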
⚫ Multiple Linear Regression
➢ Assumption:
Y = β0 + β1 X1 + · · · + βp Xp + ε
➢ Estimating the Coefficients:
We choose β̂0, β̂1, . . . , β̂p to minimize the sum of squared residuals

RSS = Σ (yi − ŷi)² = Σ (yi − β̂0 − β̂1 xi1 − · · · − β̂p xip)²,

with the sums over i = 1, . . . , n.
➢ Some Important Questions
1. Is at least one of the predictors X1,X2, . . . , Xp useful in predicting
the response?
◼ We test the null hypothesis,
H0 : β1 = β2 = ···= βp = 0
versus
Ha : at least one βj is non-zero.
This hypothesis test is performed by computing the F-statistic

F = [(TSS − RSS) / p] / [RSS / (n − p − 1)],

where TSS = Σ (yi − ȳ)² and RSS = Σ (yi − ŷi)².
◼ Sometimes we wish to test whether a particular subset of q of the
coefficients is zero, say the last q, βp−q+1, . . . , βp, on the suspicion that
those variables may not be useful in prediction.
This corresponds to a null hypothesis
H0 : βp−q+1 = βp−q+2 = . . . = βp = 0.
In this case we fit a second model that uses all the variables except those
last q. Suppose that the residual sum of squares for that model is RSS0.
Then the appropriate F-statistic is

F = [(RSS0 − RSS) / q] / [RSS / (n − p − 1)]

(an anova() sketch of this test follows the list below).
2. Do all the predictors help to explain Y , or is only a subset of the
predictors useful?
◼ Variable selection: classical approaches include forward selection,
backward selection, and mixed selection (a backward-selection sketch
also follows this list).
3. How well does the model fit the data?
◼ As in simple regression, the RSE and the R² statistic are the most
common numerical measures of model fit.
4. Given a set of predictor values, what response value should we
predict, and how accurate is our prediction?
◼ We use a confidence interval to quantify the uncertainty around the
average response f(X), and a (wider) prediction interval to quantify the
uncertainty around an individual response Y = f(X) + ε.
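For question 1, the partial F-test for dropping the last q variables can be carried out in R with anova() on nested models; a minimal sketch using the Boston data from the lab below, with age and indus as the q = 2 candidate variables (an illustrative choice):

library(ISLR2)
full    <- lm(medv ~ ., data = Boston)                # all predictors
reduced <- lm(medv ~ . - age - indus, data = Boston)  # drop the last q = 2
anova(reduced, full)  # F-statistic tests H0: the dropped coefficients are 0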
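For question 2, one convenient way to do backward selection in R is step(); note this is an AIC-based shortcut, not the text's prescribed method:

library(ISLR2)
full <- lm(medv ~ ., data = Boston)
back <- step(full, direction = "backward", trace = 0)  # drop predictors by AIC
formula(back)  # the selected model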
Computer Session
➢ Simple Linear Regression
library (MASS)
library (ISLR2)
##
## Attaching package: 'ISLR2'
## The following object is masked from 'package:MASS':
##
## Boston
head (Boston)
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7
lm.fit <- lm(medv~lstat, data=Boston)
attach(Boston)
lm.fit<- lm(medv~lstat)
lm.fit
##
## Call:
## lm(formula = medv ~ lstat)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
names(lm.fit)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
coef(lm.fit)
## (Intercept) lstat
## 34.5538409 -0.9500494
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
predict(lm.fit, data.frame(lstat = (c(5, 10, 15) )),interval = "confidence")
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
predict(lm.fit, data.frame(lstat = (c(5, 10, 15) )),interval = "prediction")
## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
plot(lstat,medv)
abline(lm.fit)
abline(lm.fit,lwd=3)
abline(lm.fit,lwd=3,col="red")
plot(lstat,medv, col = "red")
plot(lstat,medv, pch=20)
plot(lstat,medv,pch="+")
plot (1:20 , 1:20 , pch = 1:20)
par(mfrow = c(2, 2))
plot(lm.fit)
plot(predict(lm.fit), residuals(lm.fit))
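Beyond the default diagnostics, the lab in the text also examines studentized residuals and leverage statistics:

plot(predict(lm.fit), rstudent(lm.fit))  # studentized residuals vs fitted values
plot(hatvalues(lm.fit))                  # leverage statistic for each observation
which.max(hatvalues(lm.fit))             # observation with the largest leverage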
➢ Multiple Linear Regression
lm.fit <- lm(medv~lstat+age, data=Boston)
summary(lm.fit)
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
lm.fit <- lm(medv~., data=Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1304 -2.7673 -0.5814 1.9414 26.2526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.617270 4.936039 8.431 3.79e-16 ***
## crim -0.121389 0.033000 -3.678 0.000261 ***
## zn 0.046963 0.013879 3.384 0.000772 ***
## indus 0.013468 0.062145 0.217 0.828520
## chas 2.839993 0.870007 3.264 0.001173 **
## nox -18.758022 3.851355 -4.870 1.50e-06 ***
## rm 3.658119 0.420246 8.705 < 2e-16 ***
## age 0.003611 0.013329 0.271 0.786595
## dis -1.490754 0.201623 -7.394 6.17e-13 ***
## rad 0.289405 0.066908 4.325 1.84e-05 ***
## tax -0.012682 0.003801 -3.337 0.000912 ***
## ptratio -0.937533 0.132206 -7.091 4.63e-12 ***
## lstat -0.552019 0.050659 -10.897 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 4.798 on 493 degrees of freedom
## Multiple R-squared: 0.7343, Adjusted R-squared: 0.7278
## F-statistic: 113.5 on 12 and 493 DF, p-value: < 2.2e-16
library (car)
## Loading required package: carData
vif(lm.fit)
## crim zn indus chas nox rm age dis
## 1.767486 2.298459 3.987181 1.071168 4.369093 1.912532 3.088232 3.954037
## rad tax ptratio lstat
## 7.445301 9.002158 1.797060 2.870777
lm.fit1<-lm( medv~.-age, data = Boston)
summary (lm.fit1)
## Call:
## lm(formula = medv ~ . - age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1851 -2.7330 -0.6116 1.8555 26.3838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.525128 4.919684 8.441 3.52e-16 ***
## crim -0.121426 0.032969 -3.683 0.000256 ***
## zn 0.046512 0.013766 3.379 0.000785 ***
## indus 0.013451 0.062086 0.217 0.828577
## chas 2.852773 0.867912 3.287 0.001085 **
## nox -18.485070 3.713714 -4.978 8.91e-07 ***
## rm 3.681070 0.411230 8.951 < 2e-16 ***
## dis -1.506777 0.192570 -7.825 3.12e-14 ***
## rad 0.287940 0.066627 4.322 1.87e-05 ***
## tax -0.012653 0.003796 -3.333 0.000923 ***
## ptratio -0.934649 0.131653 -7.099 4.39e-12 ***
## lstat -0.547409 0.047669 -11.483 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.794 on 494 degrees of freedom
## Multiple R-squared: 0.7343, Adjusted R-squared: 0.7284
## F-statistic: 124.1 on 11 and 494 DF, p-value: < 2.2e-16
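An equivalent way to refit without retyping the whole formula is update(); a short sketch that additionally drops indus, whose coefficient is also non-significant in the output above:

lm.fit2 <- update(lm.fit1, ~ . - indus)  # same model minus indus
summary(lm.fit2)$coefficients            # remaining coefficient estimates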