W3 - Linear Regression
Spring – 2025
Linear Regression
- Optimization Algorithm
Let’s buy a motorcycle
- Suppose you want to buy a new motorcycle, and you want to estimate the price of the motorcycle using the engine size.
- The exact relationship between the engine size and the price is unknown to you.
  𝑃𝑟𝑖𝑐𝑒 = 𝑓(𝑒𝑛𝑔𝑖𝑛𝑒 𝑠𝑖𝑧𝑒)
- So, you go to the market and take the price quotation for different motorcycles (you collected some data).
- Heading back home, you plot the data on a graph and try to discover the relationship between engine size and the price.
- You notice that the price increases when engine size increases. So, as a first estimation, you draw a line that fits your data.
[Figure: price (x1000) on the 𝑦 axis against engine displacement (cc) on the 𝑥 axis]
How can you determine the suitability of a predictor line for the given data?
- Find the difference between actual function values and the predicted function values.
- Evaluate ℒ(𝑤) on various values of 𝑤.
- What’s the value of 𝑤 that minimises ℒ(𝑤)?
- Why minimise ℒ(𝑤)?
[Figures: ℒ(𝑤) against 𝑤; price (x1000) against engine displacement (cc)]
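A minimal sketch of evaluating ℒ(𝑤) on a grid of candidate slopes. The data points are hypothetical (they are not the quotations plotted in the lecture), and the mean squared difference is assumed as ℒ; the slide does not say which measure it uses.

```python
# Hypothetical data: engine displacement (cc) and quoted price (x1000).
engine_cc = [50, 100, 150, 200, 250]
price_k = [60, 115, 180, 230, 295]


def cost(w):
    """Mean squared difference between actual and predicted prices for slope w."""
    errors = [(y - w * x) ** 2 for x, y in zip(engine_cc, price_k)]
    return sum(errors) / len(errors)


# Candidate slopes 0.25, 0.50, ..., 2.50, matching the w axis of the plot.
candidates = [0.25 * k for k in range(1, 11)]
costs = {w: cost(w) for w in candidates}

# The best candidate is the one with the smallest cost.
best_w = min(costs, key=costs.get)

for w in candidates:
    print(f"w = {w:.2f}  L(w) = {costs[w]:.1f}")
print("best w on this grid:", best_w)
```

The grid only finds the best of the values tried; a finer grid or an optimisation algorithm is needed to get closer to the true minimum.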
How to find the optimal value of 𝑤?
- Use the graph (or table) that shows ℒ(𝑤) for all 𝑤 and pick the optimal value.
- Calculate the gradient again at the new location and take another step.
- Continue until you reach the optimal point.
[Figure: ℒ(𝑤) against 𝑤]
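The step-by-step procedure can be sketched as a loop that computes the gradient, takes a step, and repeats. This is an illustration under assumptions: the data are hypothetical, ℒ is taken to be the mean squared error, and the starting slope, learning rate, and stopping tolerance are arbitrary choices.

```python
# Hypothetical data: engine displacement (cc) and quoted price (x1000).
engine_cc = [50, 100, 150, 200, 250]
price_k = [60, 115, 180, 230, 295]


def gradient(w):
    """Derivative of the mean squared error with respect to the slope w."""
    n = len(engine_cc)
    return sum(2 * (w * x - y) * x for x, y in zip(engine_cc, price_k)) / n


def gradient_descent(w=2.0, learning_rate=1e-5, tolerance=1e-6, max_steps=10_000):
    """Step against the gradient until it is (almost) zero."""
    for _ in range(max_steps):
        g = gradient(w)
        if abs(g) < tolerance:  # flat enough: treat this as the optimal point
            break
        w -= learning_rate * g  # move downhill, scaled by the learning rate
    return w


w_opt = gradient_descent()
print(f"w after gradient descent: {w_opt:.4f}")
```

The learning rate is tiny here because the inputs are unscaled engine sizes; a larger value would make the steps overshoot and diverge on this data.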
How accurately can you estimate motorcycle price by engine size only?
- There are many other factors besides engine size that determine the price of a motorcycle.
- Before,
- Now,
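One way to write the two cases, as a sketch: the symbols $x_j$, $w_j$, $b$, and $n$ are assumed notation for the features, weights, bias, and number of features, not necessarily the notation used in the lecture.

```latex
% Before: a single feature (engine size) and a single weight
\hat{y} = w \, x

% Now: n features, one weight per feature, plus a bias
\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b
        = \mathbf{w}^{\top} \mathbf{x} + b
```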
Let’s establish some common terminology
- Data: (Mostly) real-world observations. May consist of input-output pairs or inputs only. In the case of input-output pairs, the outputs are called Ground Truth or True Targets.
- Hypothesis Function: An approximation of the actual function (mostly unknown or intractable) that governs the relationship between input and output. Also called the predictor function or the model.
- Parameters: Coefficients of the predictor function. They are learnt through optimisation and operate on the input features. Consist of weights and biases.
- Prediction: Output of the predictor function, obtained by applying the weights to the features.
- Loss: The difference between prediction and ground truth. Also known as residual.
Let’s establish some common terminology (continued)
- Loss Function: A function used to calculate the loss. Many variants are available. Some are preferred over others depending on the task.
- Cost Function: Average of all losses over all input samples. Represents how good or bad a predictor function (model) is, given some training data. Also called objective function.
- Optimisation: The process of finding optimal values of parameters (weights) that result in the minimum value of the cost function. (A comment about fairness of model). Mathematically,
  $\min_{\theta} J(\theta)$
- Gradient Descent: An optimisation algorithm that uses the gradient of the cost function with respect to the parameters to update them step by step and eventually find the combination of parameters that minimises the cost function.
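The terms above map onto code fairly directly. The sketch below is illustrative only: the data, the parameter values, and the choice of squared error as the loss function are assumptions, and the function names are hypothetical.

```python
# Data: hypothetical input-output pairs (engine cc, price x1000).
# The second element of each pair is the ground truth.
data = [(50, 60), (100, 115), (150, 180), (200, 230), (250, 295)]

# Parameters: one weight and one bias (arbitrary, not optimised).
params = {"w": 1.0, "b": 10.0}


def hypothesis(x, params):
    """Hypothesis (predictor) function: a straight line."""
    return params["w"] * x + params["b"]


def residual(prediction, target):
    """Difference between prediction and ground truth."""
    return prediction - target


def loss(prediction, target):
    """Loss function for one sample; squared error is assumed here."""
    return residual(prediction, target) ** 2


def cost(params, data):
    """Cost function: average of the per-sample losses over the data."""
    return sum(loss(hypothesis(x, params), y) for x, y in data) / len(data)


prediction = hypothesis(250, params)  # prediction for a 250 cc engine
print("prediction:", prediction)
print("residual:", residual(prediction, 295))
print("cost J(theta):", cost(params, data))
```

Optimisation then means searching for the values of `w` and `b` that make `cost` as small as possible.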
There are different types of loss functions
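As an illustration, three commonly used regression losses are sketched below. These are standard textbook examples and not necessarily the ones shown on the original slide; the function names and the `delta` default are assumptions.

```python
def absolute_error(prediction, target):
    """L1 loss: grows linearly with the residual, less sensitive to outliers."""
    return abs(prediction - target)


def squared_error(prediction, target):
    """L2 loss: penalises large residuals much more than small ones."""
    return (prediction - target) ** 2


def huber(prediction, target, delta=1.0):
    """Huber loss: quadratic for small residuals, linear beyond delta."""
    r = abs(prediction - target)
    if r <= delta:
        return 0.5 * r ** 2
    return delta * (r - 0.5 * delta)


# The same residual of 35 is scored very differently by each loss.
print(absolute_error(260, 295))
print(squared_error(260, 295))
print(huber(260, 295, delta=10))
```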
Optimisation means learning optimal model parameters
- The objective of an optimisation algorithm is to find the optimal parameters that minimise the cost function.
𝛼 represents the learning rate. It controls how big (or small) a step is taken at each iteration.
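For reference, the standard gradient descent update that 𝛼 appears in can be written as follows, using $J(\theta)$ for the cost function as in the terminology above. The iteration index $t$ is assumed notation.

```latex
\theta^{(t+1)} = \theta^{(t)} - \alpha \, \nabla_{\theta} J\!\left(\theta^{(t)}\right)
```

A large $\alpha$ takes big steps and may overshoot the minimum; a small $\alpha$ takes small steps and needs more iterations.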
There are different versions of gradient descent algorithm
- Minibatch SGD can help smooth the optimisation path of SGD and take
advantage of parallelisation on GPUs.
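A minimal sketch of how the versions differ: the only change is how many samples feed each gradient estimate. The data, learning rate, epoch count, and function names are assumptions chosen for illustration.

```python
import random

# Hypothetical data: (engine cc, price x1000).
data = [(50, 60), (100, 115), (150, 180), (200, 230), (250, 295)]


def batch_gradient(w, batch):
    """Gradient of the mean squared error over one batch, for price = w * cc."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def minibatch_sgd(data, batch_size, w=2.0, learning_rate=1e-5, epochs=200, seed=0):
    """Gradient descent where each update uses batch_size samples."""
    rng = random.Random(seed)
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the samples in a new order each epoch
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            w -= learning_rate * batch_gradient(w, batch)
    return w


w_batch = minibatch_sgd(data, batch_size=len(data))  # batch gradient descent
w_sgd = minibatch_sgd(data, batch_size=1)            # stochastic gradient descent
w_mini = minibatch_sgd(data, batch_size=2)           # minibatch SGD

print(f"batch: {w_batch:.4f}  SGD: {w_sgd:.4f}  minibatch: {w_mini:.4f}")
```

With a constant learning rate, the single-sample and minibatch versions keep jittering around the minimum instead of settling exactly on it, because each batch pulls towards its own best slope.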
Modelling
- A simplified model with random parameters (a model without optimal parameters):
  𝑝𝑟𝑖𝑐𝑒 = 2000 × 𝑒𝑛𝑔𝑖𝑛𝑒 𝑠𝑖𝑧𝑒
  (Consider only engine size and ignore other features.)
Prediction
- Prediction: Ask the model questions and record the answers.
- The model returns the estimated price of the motorcycle by multiplying the slope with the engine size.
Learning
- Learning: Learn the values of the model parameters that fit the training data more accurately.
- Requires training data and some optimisation algorithm.
- (The model learns that the best value of 𝑤 is 1.17 instead of the initial value.)
[Figure: price (x1000) plots for the model with random parameters and the model with optimal parameters; other diagram labels: Real-world, Simplified, + data]
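The three stages can be sketched as one small class. Several things here are assumptions: the class and method names are hypothetical, the data are made up and chosen so that the fitted slope lands near the 1.17 quoted above, and `fit` uses the closed-form least-squares slope for a line through the origin rather than an iterative optimisation algorithm.

```python
class MotorcyclePriceModel:
    """Price (x1000) modelled as w * engine size (cc)."""

    def __init__(self, w=2.0):
        # Modelling: start from an arbitrary slope. With price in thousands,
        # w = 2.0 is assumed to correspond to price = 2000 * engine size.
        self.w = w

    def predict(self, engine_cc):
        # Prediction: multiply the slope by the engine size.
        return self.w * engine_cc

    def fit(self, engine_cc, price_k):
        # Learning: least-squares slope for a line through the origin,
        # w = sum(x * y) / sum(x * x).
        numerator = sum(x * y for x, y in zip(engine_cc, price_k))
        denominator = sum(x * x for x in engine_cc)
        self.w = numerator / denominator
        return self


# Hypothetical training data: engine cc and price (x1000).
engine_cc = [50, 100, 150, 200, 250]
price_k = [60, 115, 180, 230, 295]

model = MotorcyclePriceModel()
before = model.predict(100)   # prediction with the initial slope
model.fit(engine_cc, price_k)
after = model.predict(100)    # prediction with the learnt slope

print(f"before learning: {before:.1f}  after learning: {after:.1f}")
print(f"learnt w: {model.w:.2f}")
```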
Machine Learning has its foundation in statistics
- Linear Regression (outcome is a continuous variable)
  - Simple Linear Regression: one input variable
  - Multiple Linear Regression: multiple input variables
- Logistic Regression (outcome is a categorical variable)
  - Binary Logistic Regression: binary output
  - Multinomial Logistic Regression: more than two outputs without order
  - Ordinal Logistic Regression: more than two outputs with order
Source: https://www.ibm.com/topics/logistic-regression
Summary of today’s lecture
- We discussed linear regression, starting from a single feature and moving to multi-feature input, for the prediction of motorcycle prices.
  𝒙 → Model → 𝑦 ∈ ℝ
- Made a model and initialised it randomly. [Modelling]
- Had the model predict the price of a motorcycle with a known engine size. [Prediction]
- Calculated the difference between the predicted price and the actual price to calculate the loss. [Learning]
- Used an optimisation algorithm to update the model (weights) to minimise the difference. [Learning]
- Next Lecture:
Do you have any questions?