CSE 375 Machine Learning and Pattern Recognition
3. Multiple Linear Regression
1
Contents
• Formulate a machine learning model as a multiple linear regression
model.
– Identify prediction vector and target for the problem.
• Write the regression model in matrix form and write the feature matrix.
• Compute the least-squares solution for the regression coefficients on
training data.
• Derive the least-squares formula from minimization of the RSS
• Manipulate 2D arrays in python (indexing, stacking, computing
shapes, …)
• Compute the LS solution using python linear algebra and machine
learning packages
Slides Credit: Sundeep Rangan
#2
Pre-Requisites for this Lecture
• You should now know how to:
– Install python and run python notebooks
– Describe simple linear models mathematically
– Derive equations for fitting simple linear models
– Perform basic manipulations and plotting of data in
python
#3
Outline
• Motivating Example: Understanding glucose levels in
diabetes patients
• Multiple variable linear models
• Least squares solutions
• Computing the solutions in python
• Special case: Simple linear regression
• Extensions
#4
Example: Blood Glucose Level
• Diabetes patients must monitor
glucose level
• What causes blood glucose levels to
rise and fall?
• Many factors
• We know mechanisms qualitatively
• But, quantitative models are difficult
to obtain
– Hard to derive from first principles
– Difficult to model physiological
process precisely
• Can machine learning help?
#5
Data from AIM 94 Experiment
• Data collected as series of events
– Eating
– Exercise
– Insulin dosage
• Target variable: the monitored glucose level
#6
Demo on GitHub
• Demo:
demo2_glucose.ipynb
#7
Loading the Data
• Sklearn package:
– Many methods for machine
learning
– Datasets
– Will use throughout this class
• Diabetes dataset is one example
#8
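A minimal sketch of loading this dataset with sklearn's `load_diabetes` (variable names here are illustrative; the full demo is in demo2_glucose.ipynb):

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()          # Bunch object with the data and metadata
X = data.data                   # feature matrix, shape (442, 10)
y = data.target                 # target vector (the "glucose" response in this lecture)
names = data.feature_names      # ['age', 'sex', 'bmi', 'bp', 's1', ..., 's6']

print(X.shape, y.shape)         # (442, 10) (442,)
```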
Finding a Mathematical Model
Attributes: $x_1$: Age, $x_2$: Sex, $x_3$: BMI, $x_4$: BP, $x_5$: S1, …, $x_{10}$: S6
Target: $y$ = Glucose level
Model: $y \approx \hat{y} = f(x_1, \ldots, x_{10})$
• Goal: Find a function to predict glucose level from the 10 attributes
• Problem: Several attributes
– Need a multi-variable function
#9
Matrix Representation of Data
• Data is a matrix and a target vector
• $n$ samples: one sample per row
• $k$ features / attributes / predictors: one feature per column
$$X = \begin{bmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nk} \end{bmatrix} \text{ (samples × attributes)}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \text{ (target vector)}$$
• This example:
– $y_i$ = blood glucose measurement of the $i$-th sample
– $x_{i,j}$ = $j$-th feature of the $i$-th sample
– $\boldsymbol{x}_i^T = [x_{i,1}, x_{i,2}, \ldots, x_{i,k}]$: feature or predictor vector
– The $i$-th sample is the pair $(\boldsymbol{x}_i, y_i)$ (see the sketch after this slide)
#10
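A minimal numpy sketch of this layout, using made-up numbers, showing how samples and features map to rows and columns:

```python
import numpy as np

# Toy data matrix: n = 4 samples (rows), k = 3 features (columns)
X = np.array([[1.0, 2.0, 0.5],
              [0.2, 1.5, 3.0],
              [2.2, 0.1, 1.1],
              [0.9, 0.4, 2.4]])
y = np.array([10.0, 12.0, 9.5, 11.0])   # one target per sample

n, k = X.shape          # number of samples, number of features
x3 = X[2, :]            # feature vector of the 3rd sample (row index 2)
col2 = X[:, 1]          # all samples' values of the 2nd feature (column index 1)
```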
Outline
• Motivating Example: Understanding glucose levels in
diabetes patients
• Multiple variable linear models
• Least squares solutions
• Computing the solutions in python
• Special case: Simple linear regression
• Extensions
#11
Multivariable Linear Model for Glucose
Attributes: $\boldsymbol{x} = [x_1, \ldots, x_{10}]$ = Age, Sex, BMI, BP, S1, …, S6
Target: $y$ = Glucose level, with $y \approx \hat{y} = f(x_1, \ldots, x_{10})$
• Goal: Find a function to predict glucose level from the 10 attributes
• Linear Model: Assume glucose is a linear function of the predictors:
glucose ≈ prediction $= \beta_0 + \beta_1\,\mathrm{Age} + \cdots + \beta_4\,\mathrm{BP} + \beta_5\,\mathrm{S1} + \cdots + \beta_{10}\,\mathrm{S6}$
• General form:
$y \approx \hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_4 x_4 + \beta_5 x_5 + \cdots + \beta_{10} x_{10}$
(target, intercept $\beta_0$, and one coefficient for each of the 10 features)
#12
Multiple Variable Linear Model
• Vector of features: $\boldsymbol{x} = [x_1, \ldots, x_k]$
– $k$ features (also known as predictors, independent variables, attributes, covariates, …)
• Single target variable $y$
– What we want to predict
• Linear model: Make a prediction $\hat{y}$
$y \approx \hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$
• Data for training
– Samples are $(\boldsymbol{x}_i, y_i)$, $i = 1, 2, \ldots, n$
– Each sample has a vector of features $\boldsymbol{x}_i = [x_{i1}, \ldots, x_{ik}]$ and a scalar target $y_i$
• Problem: Learn the best coefficients $\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_k]$ from the training data
#13
Example: Heart Rate Increase
• Linear Model: HR increase $\approx \beta_0 + \beta_1$ [mins exercise] $+ \beta_2$ [exercise intensity]
• Data: measuring fitness of athletes

| Subject number | HR before | HR after | Mins on treadmill | Speed (min/km) | Days exercise / week |
|---|---|---|---|---|---|
| 123 | 60 | 90 | 1 | 5.2 | 3 |
| 456 | 80 | 110 | 2 | 4.1 | 1 |
| 789 | 70 | 130 | 5 | 3.5 | 2 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |

https://www.mercurynews.com/2017/10/29/4851089/
#14
Why Use a Linear Model?
• Many natural phenomena have linear relationships
• Often a reasonable approximation when a predictor varies over a small range
• Simple to compute
• Easy to interpret the relation
– Coefficient $\beta_j$ indicates the importance of feature $j$ for the target
• Advanced: Gaussian random variables:
– If two variables are jointly Gaussian, the optimal predictor of one from the other is a linear predictor
#15
Matrix Review
• Consider
$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}, \qquad B = \begin{bmatrix} 2 & 0 \\ 3 & 2 \end{bmatrix}, \qquad x = \begin{bmatrix} 2 \\ 3 \end{bmatrix}$$
• Compute (computations on the board; a numpy check follows this slide):
– Matrix-vector multiply: $Ax$
– Transpose: $A^T$
– Matrix multiply: $AB$
– Solution of linear equations: solve $x = Bu$ for $u$
– Matrix inverse: $B^{-1}$
#16
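The same computations can be checked with numpy (a sketch; the hand computations are done on the board):

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])   # 3 x 2
B = np.array([[2, 0], [3, 2]])           # 2 x 2
x = np.array([2, 3])                     # length-2 vector

Ax = A @ x                   # matrix-vector product, shape (3,)
At = A.T                     # transpose, shape (2, 3)
AB = A @ B                   # matrix-matrix product, shape (3, 2)
u = np.linalg.solve(B, x)    # solve x = B u for u
Binv = np.linalg.inv(B)      # matrix inverse
```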
Slopes, Intercept and Inner Products
• Model with coefficients $\boldsymbol{\beta}$: $\hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$
• Sometimes use the weight and bias version:
$\hat{y} = b + w_1 x_1 + \cdots + w_k x_k$
– $b = \beta_0$: bias or intercept
– $\boldsymbol{w} = \boldsymbol{\beta}_{1:k} = [\beta_1, \ldots, \beta_k]$: weights or slope vector
• Can write either form with an inner product:
$\hat{y} = \beta_0 + \boldsymbol{\beta}_{1:k} \cdot \boldsymbol{x}$  or  $\hat{y} = b + \boldsymbol{w} \cdot \boldsymbol{x}$
• Inner product:
– $\boldsymbol{w} \cdot \boldsymbol{x} = \sum_{j=1}^{k} w_j x_j$
– Will use the alternate notations $\boldsymbol{w}^T \boldsymbol{x} = \langle \boldsymbol{w}, \boldsymbol{x} \rangle$
#17
Matrix Form of Linear Regression
• Data: $(\boldsymbol{x}_i, y_i)$, $i = 1, \ldots, n$
• Predicted value for the $i$-th sample: $\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$
• Matrix form:
$$\hat{\boldsymbol{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix}, \qquad \boldsymbol{A} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix}, \qquad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}$$
– $\hat{\boldsymbol{y}}$: the $n$ predicted values
– $\boldsymbol{A}$: the $n \times p$ feature matrix
– $\boldsymbol{\beta}$: the coefficient vector, with $p = k + 1$ coefficients
• Matrix equation: $\hat{\boldsymbol{y}} = \boldsymbol{A}\boldsymbol{\beta}$ (see the sketch after this slide)
#18
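A minimal numpy sketch of forming the feature matrix A (a column of ones followed by the k feature columns) and computing all n predictions at once; the coefficient values below are arbitrary illustrative numbers:

```python
import numpy as np

# X: (n, k) data matrix, as in the earlier slides (toy values here)
X = np.array([[1.0, 2.0],
              [0.5, 1.5],
              [2.0, 0.0]])
n, k = X.shape

A = np.column_stack([np.ones(n), X])     # (n, p) feature matrix, p = k + 1
beta = np.array([0.5, 1.0, -2.0])        # [beta_0, beta_1, ..., beta_k] (illustrative)

yhat = A @ beta                          # all n predicted values at once
```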
In-Class Exercise
#19
Outline
• Motivating Example: Understanding glucose levels in
diabetes patients
• Multiple variable linear models
• Least squares solutions
• Computing the solutions in python
• Special case: Simple linear regression
• Extensions
#20
Least Squares Model Fitting
• How do we select the parameters $\boldsymbol{\beta} = (\beta_0, \ldots, \beta_k)$?
• Define $\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$
– Predicted value on sample $i$ for parameters $\boldsymbol{\beta} = (\beta_0, \ldots, \beta_k)$
• Define the residual sum of squares:
$$\mathrm{RSS}(\boldsymbol{\beta}) := \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
– Note that $\hat{y}_i$ is implicitly a function of $\boldsymbol{\beta} = (\beta_0, \ldots, \beta_k)$
– Also called the sum of squared residuals (SSR) and sum of squared errors (SSE)
• Least squares solution: Find $\boldsymbol{\beta}$ to minimize the RSS.
#21
Variants of RSS
• Often use some variant of RSS
– Note: these are not standard
• Residual sum of squares: $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• RSS per sample, or Mean Squared Error:
$$\mathrm{MSE} = \frac{\mathrm{RSS}}{n} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
• Normalized RSS or Normalized MSE:
$$\frac{\mathrm{RSS}/n}{s_y^2} = \frac{\mathrm{MSE}}{s_y^2} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
(a numpy sketch of all three follows this slide)
#22
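A short numpy sketch of the three variants, assuming y and yhat are 1D arrays of equal length:

```python
import numpy as np

def rss_variants(y, yhat):
    """Return RSS, MSE, and normalized MSE for targets y and predictions yhat."""
    resid = y - yhat
    rss = np.sum(resid ** 2)                           # residual sum of squares
    mse = rss / len(y)                                 # RSS per sample
    norm_mse = rss / np.sum((y - np.mean(y)) ** 2)     # RSS relative to target variation
    return rss, mse, norm_mse
```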
Finding Parameters via Optimization
A general ML recipe
General ML problem → Multiple linear regression:
• Pick a model with parameters → Linear model: $\hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$
• Get data → Data: $(\boldsymbol{x}_i, y_i)$, $i = 1, 2, \ldots, n$
• Pick a loss function that measures the goodness of fit of the model to the data (a function of the parameters) → Loss function: $\mathrm{RSS}(\beta_0, \ldots, \beta_k) := \sum_i (y_i - \hat{y}_i)^2$
• Find the parameters that minimize the loss → Select $\boldsymbol{\beta} = (\beta_0, \ldots, \beta_k)$ to minimize $\mathrm{RSS}(\boldsymbol{\beta})$
#23
RSS as a Vector Norm
• RSS is given by the sum:
$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
• Define the norm of a vector:
– $\|\boldsymbol{x}\| = (x_1^2 + \cdots + x_r^2)^{1/2}$
– Standard Euclidean norm
– Sometimes called the ℓ2 norm (ℓ is for Lebesgue)
• Write RSS in vector form:
$$\mathrm{RSS} = \|\boldsymbol{y} - \hat{\boldsymbol{y}}\|^2$$
#24
Least Squares Solution
• Consider the RSS cost function:
$$\mathrm{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = \sum_{j=0}^{k} A_{ij} \beta_j$$
– The vector $\boldsymbol{\beta}$ that minimizes the RSS is called the least-squares solution
• Least squares solution: The vector $\hat{\boldsymbol{\beta}}$ that minimizes the RSS is
$$\hat{\boldsymbol{\beta}} = (\boldsymbol{A}^T \boldsymbol{A})^{-1} \boldsymbol{A}^T \boldsymbol{y}$$
– Can compute the best coefficient vector analytically
– Just solve a linear set of equations (see the sketch after this slide)
– Will show the proof below
#25
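A minimal sketch of evaluating the least-squares formula by solving the normal equations $(A^T A)\beta = A^T y$ rather than forming the inverse explicitly:

```python
import numpy as np

def fit_ls(A, y):
    """Least-squares coefficients for the model y ≈ A @ beta.

    Solves the normal equations (A^T A) beta = A^T y instead of
    computing the matrix inverse explicitly.
    """
    return np.linalg.solve(A.T @ A, A.T @ y)
```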
Proving the LS Formula
• Least squares formula: The vector $\hat{\boldsymbol{\beta}}$ that minimizes the RSS is
$$\hat{\boldsymbol{\beta}} = (\boldsymbol{A}^T \boldsymbol{A})^{-1} \boldsymbol{A}^T \boldsymbol{y}$$
• To prove this formula, we will:
– Review gradients of multi-variable functions
– Compute the gradient $\nabla \mathrm{RSS}(\boldsymbol{\beta})$
– Solve $\nabla \mathrm{RSS}(\boldsymbol{\beta}) = 0$
#26
Gradients of Multi-Variable Functions
• Consider a scalar-valued function of a vector: $f(\boldsymbol{\beta}) = f(\beta_1, \ldots, \beta_n)$
• The gradient is the column vector:
$$\nabla f(\boldsymbol{\beta}) = \begin{bmatrix} \partial f(\boldsymbol{\beta}) / \partial \beta_1 \\ \vdots \\ \partial f(\boldsymbol{\beta}) / \partial \beta_n \end{bmatrix}$$
• Represents the direction of maximum increase
• At a local minimum or maximum: $\nabla f(\boldsymbol{\beta}) = 0$
– Solve $n$ equations in $n$ unknowns
• Ex: $f(\beta_1, \beta_2) = \beta_1 \sin \beta_2 + \beta_1^2 \beta_2$.
– Compute $\nabla f(\boldsymbol{\beta})$ (try it!)
#27
Proof of the LS Formula
• Consider the RSS cost function:
$$\mathrm{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = \sum_{j=0}^{k} A_{ij} \beta_j$$
– The vector $\boldsymbol{\beta}$ that minimizes the RSS is called the least-squares solution
• Compute the partial derivatives via the chain rule:
$$\frac{\partial \mathrm{RSS}}{\partial \beta_j} = -2 \sum_{i=1}^{n} (y_i - \hat{y}_i) A_{ij}, \qquad j = 0, 1, \ldots, k$$
• Matrix form: $\mathrm{RSS} = \|\boldsymbol{A}\boldsymbol{\beta} - \boldsymbol{y}\|^2$, $\quad \nabla \mathrm{RSS} = -2\boldsymbol{A}^T(\boldsymbol{y} - \boldsymbol{A}\boldsymbol{\beta})$
• Solution: $\boldsymbol{A}^T(\boldsymbol{y} - \boldsymbol{A}\boldsymbol{\beta}) = 0 \;\Rightarrow\; \hat{\boldsymbol{\beta}} = (\boldsymbol{A}^T\boldsymbol{A})^{-1}\boldsymbol{A}^T\boldsymbol{y}$ (the least-squares solution of the equation $\boldsymbol{A}\boldsymbol{\beta} = \boldsymbol{y}$)
• Minimum RSS: $\mathrm{RSS} = \boldsymbol{y}^T\left(I - \boldsymbol{A}(\boldsymbol{A}^T\boldsymbol{A})^{-1}\boldsymbol{A}^T\right)\boldsymbol{y}$
– Proof in a previous lecture
#28
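As a sanity check (illustrative only), the matrix gradient formula can be compared against a finite-difference approximation on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 4))        # random feature matrix
y = rng.normal(size=20)
beta = rng.normal(size=4)

def rss(b):
    return np.sum((y - A @ b) ** 2)

grad_formula = -2 * A.T @ (y - A @ beta)       # gradient from the slide

# Central finite-difference approximation of the gradient
eps = 1e-6
grad_numeric = np.array([
    (rss(beta + eps * np.eye(4)[j]) - rss(beta - eps * np.eye(4)[j])) / (2 * eps)
    for j in range(4)
])

print(np.allclose(grad_formula, grad_numeric, atol=1e-4))   # True
```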
In-Class Exercise
#29
Outline
• Motivating Example: Understanding glucose levels in
diabetes patients
• Multiple variable linear models
• Least squares solutions
• Computing the solutions in python
• Special case: Simple linear regression
• Extensions
#30
Arrays and Vectors in Python
• Python:
– Arrays can have 1, 2, 3, … dimensions
– Vectors can be 1D arrays; matrices are generally 2D arrays
– Vectors stored as 1D arrays are neither row nor column vectors
– If x is 1D and A is 2D, then both left and right multiplication are valid: x.dot(A) treats x as a row vector, and A.dot(x) treats x as a column vector
• Lecture notes: We will generally treat $\boldsymbol{x}$ and $\boldsymbol{x}^T$ the same.
– Can write $\boldsymbol{x} = (x_1, \ldots, x_N)$ and still multiply by a matrix on the left or the right (see the sketch after this slide)
#31
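A small sketch of this 1D-array behavior:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([1.0, -1.0])    # 1D array: neither a row nor a column vector

print(x.shape)               # (2,)  no second dimension
print(A.dot(x))              # treats x as a column vector: A @ x
print(x.dot(A))              # treats x as a row vector: x^T A
print((x.T == x).all())      # True: transposing a 1D array changes nothing
```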
Fitting Using sklearn
• Return to diabetes data example
• All code in demo
• Divide data into two portions:
– Training data: First 300 samples
– Test data: Remaining 142
samples
• Train model on training data.
• Test model (i.e. measure RSS) on
test data
• Reason for splitting data discussed
next lecture.
#32
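A minimal sketch of this split on the sklearn diabetes data (variable names are illustrative):

```python
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)    # 442 samples, 10 features
ntr = 300
Xtr, ytr = X[:ntr], y[:ntr]              # training data: first 300 samples
Xts, yts = X[ntr:], y[ntr:]              # test data: remaining 142 samples
```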
Manually Computing the Solution
• Use a numpy linear algebra routine to solve
$$\hat{\boldsymbol{\beta}} = (\boldsymbol{A}^T\boldsymbol{A})^{-1}\boldsymbol{A}^T\boldsymbol{y}$$
• Common mistake:
– Compute the matrix inverse $P = (\boldsymbol{A}^T\boldsymbol{A})^{-1}$,
– Then compute $\hat{\boldsymbol{\beta}} = P\boldsymbol{A}^T\boldsymbol{y}$
– The full matrix inverse is VERY slow and not needed
– Can instead solve the linear system $\boldsymbol{A}\boldsymbol{\beta} = \boldsymbol{y}$ (in the least-squares sense) directly
– numpy has routines that do this directly (see the sketch after this slide)
#33
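A sketch contrasting the inverse-based computation with solving the system directly; the data here are random illustrative values:

```python
import numpy as np

# Toy data (illustrative); A has a leading column of ones
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
A = np.column_stack([np.ones(len(y)), X])

# Avoid: forming the explicit inverse is slower and less numerically stable
# P = np.linalg.inv(A.T @ A); beta = P @ A.T @ y

# Better: solve the normal equations (A^T A) beta = A^T y directly
beta = np.linalg.solve(A.T @ A, A.T @ y)

# Or let numpy solve the least-squares problem A beta ≈ y itself
beta_lstsq, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
```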
Calling the sklearn Linear Regression method
• Construct a linear regression
object
• Run it on the training data
• Predict values on the test
data
#34
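A minimal sketch of these three steps with sklearn's LinearRegression (variable names are illustrative; the full code is in the demo):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
Xtr, ytr = X[:300], y[:300]          # training data
Xts, yts = X[300:], y[300:]          # test data

regr = LinearRegression()            # construct a linear regression object
regr.fit(Xtr, ytr)                   # fit intercept_ and coef_ on the training data
yhat = regr.predict(Xts)             # predict values on the test data
```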
Outline
• Motivating Example: Understanding glucose levels in
diabetes patients
• Multiple variable linear models
• Least squares solutions
• Computing the solutions in python
• Special case: Simple linear regression
• Extensions
#35
Simple vs. Multiple Regression
• Simple linear regression: one predictor (feature)
– Scalar predictor $x$
– Linear model: $\hat{y} = \beta_0 + \beta_1 x$
– Can only account for one variable
• Multiple linear regression: multiple predictors (features)
– Vector predictor $\boldsymbol{x} = (x_1, \ldots, x_k)$
– Linear model: $\hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$
– Can account for multiple predictors
– Turns into simple linear regression when $k = 1$
#36
Comparison to Single Variable Models
• We could compute a model for each variable separately:
$$y = a_1 + b_1 x_1, \qquad y = a_2 + b_2 x_2, \qquad \ldots$$
• But this doesn't provide a way to account for joint effects
• Example: Consider three linear models for predicting longevity:
– A: Longevity vs. some factor in diet (e.g. amount of fiber consumed)
– B: Longevity vs. exercise
– C: Longevity vs. diet AND exercise
– What does C tell you that A and B do not?
#37
Special Case: Single Variable
• Suppose there is $k = 1$ predictor.
• Feature matrix and coefficient vector:
$$A = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$
• LS solution:
$$\hat{\boldsymbol{\beta}} = (A^T A)^{-1} A^T \boldsymbol{y} = P^{-1} r, \qquad P = \frac{1}{n} A^T A = \begin{bmatrix} 1 & \bar{x} \\ \bar{x} & \overline{x^2} \end{bmatrix}, \qquad r = \frac{1}{n} A^T \boldsymbol{y} = \begin{bmatrix} \bar{y} \\ \overline{xy} \end{bmatrix}$$
• Obtain the single-variable solutions for the coefficients (after some algebra):
$$\beta_1 = \frac{s_{xy}}{s_x^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}, \qquad R^2 = \frac{s_{xy}^2}{s_x^2 s_y^2}$$
(see the sketch after this slide)
#38
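A small numpy sketch evaluating these closed-form expressions on toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xbar, ybar = x.mean(), y.mean()
sxy = np.mean((x - xbar) * (y - ybar))   # sample covariance
sx2 = np.mean((x - xbar) ** 2)           # sample variance of x
sy2 = np.mean((y - ybar) ** 2)           # sample variance of y

beta1 = sxy / sx2                        # slope
beta0 = ybar - beta1 * xbar              # intercept
R2 = sxy ** 2 / (sx2 * sy2)              # coefficient of determination
```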
Simple Linear Regression for Diabetes Data
• Try a fit of each variable individually
• Compute the $R_k^2$ coefficient for each variable
– Use the formula on the previous slide (a sketch follows this slide)
• The "best" individual variable is a poor fit
– $R_k^2 \approx 0.34$
[Figure: $R_k^2$ for each variable, with the best individual variable highlighted]
#39
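A sketch of computing $R_k^2$ for each variable individually using the previous slide's formula:

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()
X, y = data.data, data.target

ybar = y.mean()
sy2 = np.mean((y - ybar) ** 2)

# One simple linear regression per feature; report its R^2
for name, xk in zip(data.feature_names, X.T):
    xbar = xk.mean()
    sxy = np.mean((xk - xbar) * (y - ybar))
    sx2 = np.mean((xk - xbar) ** 2)
    R2k = sxy ** 2 / (sx2 * sy2)
    print(f"{name:4s}  R^2 = {R2k:.3f}")
```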
Scatter Plot
• No one variable explains glucose well
• Multiple linear regression is much better
#40
Outline
• Motivating Example: Understanding glucose levels in
diabetes patients
• Multiple variable linear models
• Least squares solutions
• Computing in python
• Extensions
#41
Next Lecture
• Model Selection
#42