Introduction to Machine Learning
Mauricio A. Álvarez
Foundations of Machine Learning
The University of Manchester
1 / 60
Textbooks
Rogers and Girolami, A First Course in Machine Learning, Chapman and Hall/CRC Press, 2nd Edition, 2016.
Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006.
2 / 60
Textbooks
Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
Géron, Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, O’Reilly, 3rd Edition, 2022.
3 / 60
Textbooks
Murphy, Probabilistic Machine Learning: An Introduction, MIT Press, 2022.
Murphy, Probabilistic Machine Learning: Advanced Topics, MIT Press, 2023.
4 / 60
Textbooks
Hardt and Recht, Patterns, Predictions, and Actions: Foundations of Machine Learning, Princeton University Press, 2022.
Zhang et al., Dive into Deep Learning, Cambridge University Press, 2023.
5 / 60
Contents
Machine learning
Definitions
An example of a predictive model
Review of probability
Random variables
Discrete random variables
Continuous random variables
Additional comments
6 / 60
Machine learning or Statistical Learning
❑ We would like to design an algorithm that helps us solve different
prediction problems.
❑ The algorithm is designed based on a mathematical model or function,
and a dataset.
❑ The goal is to extract knowledge from data.
7 / 60
Examples of ML problems
Handwritten digit recognition
8 / 60
Examples of ML problems
Face detection and face recognition
From Murphy (2012).
9 / 60
Examples of ML problems
Predicting the age of a person watching a particular YouTube video.
10 / 60
Examples of ML problems
Stock market
11 / 60
Examples of ML problems
Clustering: segmenting customers in e-commerce
12 / 60
Examples of ML problems
Recommendation systems
13 / 60
ML has contributed to advances in AI
AlphaGo, AlphaFold, autonomous driving.
14 / 60
Generative AI
DALL-E, ChatGPT, GitHub Copilot.
15 / 60
Contents
Machine learning
Definitions
An example of a predictive model
Review of probability
Random variables
Discrete random variables
Continuous random variables
Additional comments
16 / 60
Basic definitions
❑ Handwritten digit recognition
❑ Variability: the same digit can be written in many different ways.
❑ Each image can be transformed into a vector x (feature extraction).
❑ An instance is made of the pair (x, y ), where y is the label of the image.
❑ Objective: find a function f (x, w) that predicts the label y from the features x.
17 / 60
Basic definitions
❑ Training set: a set of N images and their labels, (x1 , y1 ), . . . , (xN , yN ),
used to fit the predictive model.
❑ Estimation or training phase: the process of finding the values of w in
the function f (x, w) that best fit the data.
❑ Generalisation: ability to correctly predict the label of new images x∗ .
18 / 60
Supervised and unsupervised learning
❑ Supervised learning:
– Variable y is discrete: classification.
– Variable y is continuous: regression.
❑ Unsupervised learning: from the pairs (x1 , y1 ), . . . , (xN , yN ), we only
have access to the inputs x1 , . . . , xN .
– Find similar groups: clustering.
– Find a probability function for x: density estimation.
– Find a lower dimensionality representation for x: dimensionality reduction
and visualisation.
❑ Other types of learning: reinforcement learning, semi-supervised
learning, active learning, multi-task learning.
19 / 60
Contents
Machine learning
Definitions
An example of a predictive model
Review of probability
Random variables
Discrete random variables
Continuous random variables
Additional comments
20 / 60
Olympic 100m Data
Image from Wikimedia Commons http://bit.ly/191adDC.
21 / 60
Dataset
[Figure: men’s 100 m Olympic winning times in seconds (y-axis, 10.0–12.0) against year (x-axis, 1900–2000).]
22 / 60
Model
❑ We will use a linear model f (x, w), where y is the time in seconds and x
the year of the competition.
❑ The linear model is given as
y = w1 x + w0 ,
where w0 is the intercept and w1 is the slope.
❑ We use w to refer both to w0 and w1 .
23 / 60
Objective function
❑ We use an objective function to estimate the parameters w0 and w1 that
best fit the data.
❑ In this example, we use a least squares objective function
E(w0 , w1 ) = \sum_i (yi − f (xi ))² = \sum_i [yi − (w1 xi + w0 )]².
❑ By minimising this error with respect to w, we obtain the solution
w0 = 36.4 and w1 = −1.34 × 10⁻² (a small numerical sketch of this fit
follows below).
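Below is a minimal sketch of this least squares fit in Python. The data values are assumptions for illustration only (approximate winning times), so the fitted coefficients will differ slightly from the w0 = 36.4 and w1 = −1.34 × 10⁻² quoted above, which come from the full dataset.

```python
import numpy as np

# Approximate men's Olympic 100 m winning times (illustrative values only;
# not the exact dataset used in the lecture).
years = np.array([1896, 1912, 1928, 1948, 1964, 1980, 1996, 2008], dtype=float)
times = np.array([12.00, 10.80, 10.80, 10.30, 10.00, 10.25, 9.84, 9.69])

# Least squares fit of y = w1 * x + w0: stack a column of ones so that the
# solution vector is [w0, w1].
X = np.column_stack([np.ones_like(years), years])
w0, w1 = np.linalg.lstsq(X, times, rcond=None)[0]

print(f"w0 = {w0:.2f}, w1 = {w1:.4f}")
print(f"Prediction for 2012: {w0 + w1 * 2012:.2f} seconds")
```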
24 / 60
Data and model
[Figure: the same men’s 100 m data with the fitted straight line f (x, w) overlaid; time in seconds against year.]
25 / 60
Predictions
❑ We can now use this model for making predictions.
❑ For example, what does the model predict for 2012?
❑ If we set x = 2012, then
y = f (x, w) = f (x = 2012, w) = w1 x + w0 = (−1.34 × 10⁻²) × 2012 + 36.4 = 9.59.
(The value 9.59 follows from the unrounded estimates of w0 and w1 .)
❑ The actual value was 9.63.
26 / 60
Main challenges of machine learning
❑ Insufficient quantity of training data.
❑ Nonrepresentative training data.
❑ Poor-quality data.
❑ Irrelevant features.
❑ Overfitting the training data.
❑ Underfitting the training data.
27 / 60
Contents
Machine learning
Definitions
An example of a predictive model
Review of probability
Random variables
Discrete random variables
Continuous random variables
Additional comments
28 / 60
Definition
❑ A random variable (RV) is a function that assigns a number to the
outcome of a random experiment.
❑ For example, we toss a coin (random experiment).
❑ We assign the number 0 to the outcome “tails” and the number 1 to the
outcome “heads”.
30 / 60
Discrete and continuous random variables
❑ A random variable can either be discrete or continuous.
❑ A discrete RV can take a value only from a countable number of
distinct values, like 1, 2, 3, . . ..
❑ For example, the number of phone calls received in a call centre from
9:00 to 10:00, or the number of COVID patients in a hospital on May 30,
2020.
❑ A continuous RV can take any of an infinite number of possible values
within one or more intervals.
❑ Examples include the time that a cyclist takes to finish the Tour de
France, or the exchange rate between the British pound and the US dollar
on June 30, 2023.
31 / 60
Notation
❑ We use capital letters to denote random variables, e.g. X , Y , Z , . . ..
❑ We use lowercase letters to denote the values that the random variable
takes, x, y , z, . . ..
32 / 60
Contents
Machine learning
Definitions
An example of a predictive model
Review of probability
Random variables
Discrete random variables
Continuous random variables
Additional comments
33 / 60
Probability mass function
❑ A discrete RV X is completely defined by a set of values it can take,
x1 , x2 , . . . , xn , and their corresponding probabilities.
❑ The probability that X = xi is denoted as P(X = xi ) for i = 1, . . . , n, and
it is known as the probability mass function (pmf).
❑ Properties
1. P(X = xi ) ≥ 0, i = 1, . . . , n.
2. \sum_{i=1}^{n} P(X = xi ) = 1.
34 / 60
Two discrete RVs
❑ In machine learning, we are usually interested in more than one
random variable.
❑ Consider two RVs X and Y taking values x1 , x2 , . . . , xn and
y1 , y2 , . . . , ym , respectively.
❑ These two random variables can be fully described with a joint
probability mass function P(X = xi , Y = yj ) specifying the probability
of X = xi and Y = yj .
❑ Properties
1. P(X = xi , Y = yj ) ≥ 0, i = 1, . . . , n, j = 1, . . . , m.
2. \sum_{i=1}^{n} \sum_{j=1}^{m} P(X = xi , Y = yj ) = 1.
35 / 60
Rules of probability
❑ Marginal
P(X = xi ) = \sum_{j=1}^{m} P(X = xi , Y = yj ).
We obtain the probability of P(X = xi ) regardless of the value of Y .
This is also known as the sum rule of probability.
❑ Conditional
P(X = xi |Y = yj ) = P(X = xi , Y = yj ) / P(Y = yj ),   provided P(Y = yj ) ≠ 0.
❑ From the expression above, we can also write
P(X = xi , Y = yj ) = P(X = xi |Y = yj )P(Y = yj ).
This expression is also known as the product rule of probability. A small numerical sketch of these rules follows below.
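A minimal numerical sketch of these rules, assuming a small hypothetical joint pmf stored as a table (rows index the values of X, columns the values of Y):

```python
import numpy as np

# Hypothetical joint pmf P(X = xi, Y = yj) for n = 3 values of X (rows) and
# m = 2 values of Y (columns); entries are non-negative and sum to 1.
P_xy = np.array([[0.10, 0.20],
                 [0.25, 0.15],
                 [0.05, 0.25]])

# Sum rule: marginalise over Y (sum across columns) to obtain P(X = xi).
P_x = P_xy.sum(axis=1)          # [0.30, 0.40, 0.30]
P_y = P_xy.sum(axis=0)          # [0.40, 0.60]

# Conditional P(X = xi | Y = yj): divide each column of the joint by P(Y = yj).
P_x_given_y = P_xy / P_y

# Product rule check: P(X | Y) P(Y) recovers the joint pmf.
assert np.allclose(P_x_given_y * P_y, P_xy)

print(P_x, P_y)
print(P_x_given_y)
```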
36 / 60
How do we compute P(X = xi ) from data?
❑ A way to estimate the probability P(X = xi ) is to repeat an experiment
several times, say N, count how many outcomes satisfy X = xi , say
n_{X=xi} , and then approximate the probability as
P(X = xi ) ≈ n_{X=xi} / N.
❑ We expect the approximation to become an equality when N → ∞,
P(X = xi ) = \lim_{N→∞} n_{X=xi} / N.
37 / 60
What about P(X = xi , Y = yj ) and P(X = xi |Y = yj )?
❑ We can follow a similar procedure to compute P(X = xi , Y = yj ),
P(X = xi , Y = yj ) = \lim_{N→∞} n_{X=xi,Y=yj} / N,
where n_{X=xi,Y=yj} is the number of times we observe a simultaneous
occurrence of X = xi and Y = yj .
❑ To compute P(X = xi |Y = yj ), we can use the definition of the
conditional probability
P(X = xi |Y = yj ) = P(X = xi , Y = yj ) / P(Y = yj ) ≈ (n_{X=xi,Y=yj} / N) / (n_{Y=yj} / N)
= n_{X=xi,Y=yj} / n_{Y=yj} .
❑ In the limit N → ∞,
P(X = xi |Y = yj ) = \lim_{N→∞} n_{X=xi,Y=yj} / n_{Y=yj} .
The sketch below illustrates these frequency estimates on simulated data.
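A minimal simulation sketch of these frequency estimates, reusing the hypothetical joint pmf from the previous sketch and drawing N samples from it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint pmf over X in {0, 1, 2} and Y in {0, 1} (same table as before).
P_xy = np.array([[0.10, 0.20],
                 [0.25, 0.15],
                 [0.05, 0.25]])

# Repeat the "experiment" N times by sampling (X, Y) pairs from the joint pmf.
N = 100_000
flat = rng.choice(P_xy.size, size=N, p=P_xy.ravel())
x_samples, y_samples = np.unravel_index(flat, P_xy.shape)

# Frequency estimates for xi = 1 and yj = 1.
n_x1 = np.sum(x_samples == 1)
n_y1 = np.sum(y_samples == 1)
n_x1_y1 = np.sum((x_samples == 1) & (y_samples == 1))

print(n_x1 / N)        # P(X = 1), true value 0.40
print(n_x1_y1 / N)     # P(X = 1, Y = 1), true value 0.15
print(n_x1_y1 / n_y1)  # P(X = 1 | Y = 1), true value 0.15 / 0.60 = 0.25
```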
38 / 60
Examples of the different pmf
[Figure: histograms of a joint distribution p(X , Y ) over two values Y = 1, 2, together with the marginal p(Y ), the marginal p(X ), and the conditional p(X |Y = 1).]
From Bishop (2006).
39 / 60
Statistical independence
Two discrete RVs are statistically independent if
P(X = xi , Y = yj ) = P(X = xi )P(Y = yj ), i = 1, . . . , n, j = 1, . . . , m.
40 / 60
Bayes theorem
❑ Bayes theorem allows us to go from P(X = x) to P(X = x|Y = y ).
❑ According to Bayes theorem
P(X = x|Y = y ) = P(Y = y |X = x)P(X = x) / P(Y = y ).
41 / 60
Example: Bayes theorem
There are two barrels in front of you. Barrel One contains 20 apples and 4
oranges. Barrel Two contains 4 apples and 8 oranges. You choose a barrel
randomly and select a fruit. It is an apple. What is the probability that the
barrel was Barrel One?
42 / 60
Answer (I)
❑ There are two random variables involved.
❑ Let B be the random variable associated with choosing one of the barrels.
So B can either be “One” or “Two”.
❑ Let F be the random variable associated with picking a fruit. So F can
either be “Apple” (A) or “Orange” (O).
❑ The probability we want to compute is the conditional probability
P(B = One|F = A).
43 / 60
Answer (II)
❑ The statement says “You choose a barrel randomly”, which means that
P(B = One) = P(B = Two) = 1/2.
❑ Since we want to go from P(B = One) to P(B = One|F = A), we can
use Bayes theorem,
P(B = One|F = A) = P(F = A|B = One)P(B = One) / P(F = A).
❑ We need to compute P(F = A|B = One) and P(F = A).
44 / 60
Answer (III)
❑ Using the sum rule of probability and the product rule of probability
P(F = A) = \sum_B P(F = A, B) = \sum_B P(F = A|B)P(B)
= P(F = A|B = One)P(B = One) + P(F = A|B = Two)P(B = Two).
❑ From the statement,
P(F = A|B = One) = 20/24,
P(F = A|B = Two) = 4/12.
❑ We then have P(F = A) = (20/24)(1/2) + (4/12)(1/2).
45 / 60
Answer (IV)
We can finally compute P(B = One|F = A):
P(B = One|F = A) = P(F = A|B = One)P(B = One) / P(F = A)
= [(20/24)(1/2)] / [(20/24)(1/2) + (4/12)(1/2)]
= (20/24) / (20/24 + 4/12)
= 5/7 ≈ 0.71.
A short sketch verifying this computation follows below.
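A short sketch checking this computation with exact rational arithmetic (Python's fractions module):

```python
from fractions import Fraction

# Prior over barrels: each barrel is chosen with probability 1/2.
p_one = p_two = Fraction(1, 2)

# Likelihood of drawing an apple from each barrel.
p_apple_given_one = Fraction(20, 24)   # 20 apples out of 24 fruits
p_apple_given_two = Fraction(4, 12)    # 4 apples out of 12 fruits

# Sum rule: marginal probability of drawing an apple.
p_apple = p_apple_given_one * p_one + p_apple_given_two * p_two

# Bayes theorem: posterior probability that the fruit came from Barrel One.
p_one_given_apple = p_apple_given_one * p_one / p_apple

print(p_one_given_apple, float(p_one_given_apple))   # 5/7, approximately 0.714
```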
46 / 60
Expected value and statistical moments
❑ The expected value of a function g(X ) of a discrete RV X is defined as
E{g(X )} = \sum_{i=1}^{n} g(xi )P(X = xi ).
❑ Two frequently used expected values or statistical moments of the
discrete RV X are the mean µX and the variance σX²,
µX = E{X } = \sum_{i=1}^{n} xi P(X = xi ),
σX² = var{X } = E{(X − µX )²} = \sum_{i=1}^{n} (xi − µX )² P(X = xi )
= E{X²} − µX².
❑ The square root of the variance, σX , is known as the standard
deviation.
47 / 60
Example: Expected values
❑ Consider the following discrete RV X and its pmf. Compute µX and σX².
X      1    2    3    4
P(X )  0.3  0.2  0.1  0.4
❑ For the mean µX , we have
µX = \sum_{i=1}^{n} xi P(X = xi ) = (1)(0.3) + (2)(0.2) + (3)(0.1) + (4)(0.4) = 2.6.
❑ For the variance, we first compute E{X²} and then use σX² = E{X²} − µX².
❑ To compute E{X²}, we can use E{g(X )} with g(X ) = X²,
E{X²} = \sum_{i=1}^{n} xi² P(X = xi ) = (1)²(0.3) + (2)²(0.2) + (3)²(0.1) + (4)²(0.4) = 8.4.
❑ We finally get σX² = E{X²} − µX² = 8.4 − (2.6)² = 1.64. The sketch below
checks these numbers.
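A minimal sketch checking these numbers with NumPy:

```python
import numpy as np

# The pmf from the example above.
x = np.array([1, 2, 3, 4])
p = np.array([0.3, 0.2, 0.1, 0.4])

mean = np.sum(x * p)                  # E{X}
second_moment = np.sum(x**2 * p)      # E{X^2}
variance = second_moment - mean**2    # var{X} = E{X^2} - (E{X})^2

print(round(mean, 2), round(second_moment, 2), round(variance, 2))   # 2.6 8.4 1.64
```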
48 / 60
Notation
❑ When referring to the probability P(X = x), we usually simply write
P(x).
❑ Likewise, instead of writing P(X = x, Y = y), we simply write P(x, y ).
49 / 60
Contents
Machine learning
Definitions
An example of a predictive model
Review of probability
Random variables
Discrete random variables
Continuous random variables
Additional comments
50 / 60
Probability density function
❑ A continuous RV X takes values within one or more intervals of the real
line.
❑ We use probability density functions (pdf), pX (x), to describe a
continuous RV X .
❑ Properties of a pdf
1. pX (x) ≥ 0.
2. \int_{−∞}^{∞} pX (x) dx = 1.
3. P(X ≤ a) = \int_{−∞}^{a} pX (x) dx.
4. P(a ≤ X ≤ b) = \int_{a}^{b} pX (x) dx.
A numerical sketch of these properties follows below.
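A minimal numerical sketch of these properties. The standard Gaussian is used here purely as an example pdf (an assumption for illustration); SciPy provides the density and the numerical integration:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Example pdf: a Gaussian with mean 0 and standard deviation 1.
pdf = norm(loc=0.0, scale=1.0).pdf

# Property 2: the pdf integrates to 1 over the whole real line.
total, _ = quad(pdf, -np.inf, np.inf)

# Property 3: P(X <= a) as an integral up to a.
a = 1.0
p_leq_a, _ = quad(pdf, -np.inf, a)

# Property 4: P(a <= X <= b) as an integral from a to b.
b = 2.0
p_a_to_b, _ = quad(pdf, a, b)

print(total)     # approximately 1.0
print(p_leq_a)   # approximately 0.8413 (equals norm.cdf(1.0))
print(p_a_to_b)  # approximately 0.1359
```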
51 / 60
Two continuous RVs
❑ As was the case for discrete RVs, in ML we are interested in
analysing more than one continuous RV.
❑ We can use a joint probability density function, pX ,Y (x, y ) to fully
characterise two continuous random variables X and Y .
❑ Properties of a joint pdf
1. pX,Y (x, y ) ≥ 0.
2. \int_{−∞}^{∞} \int_{−∞}^{∞} pX,Y (x, y ) dx dy = 1.
3. P(X ≤ a, Y ≤ c) = \int_{−∞}^{a} \int_{−∞}^{c} pX,Y (x, y ) dy dx.
4. P(a ≤ X ≤ b, c ≤ Y ≤ d) = \int_{a}^{b} \int_{c}^{d} pX,Y (x, y ) dy dx.
52 / 60
Rules of probability (continuous RVs)
❑ Sum rule of probability. In the case of continuous RVs, we replace
the sums we had before with an integral
pX (x) = \int_{−∞}^{∞} pX,Y (x, y ) dy ,
where pX (x) is known as the marginal pdf.
❑ Product rule of probability. The conditional pdf can be obtained as
pX|Y (x|y ) = pX,Y (x, y ) / pY (y ),
which can also be written as
pX,Y (x, y ) = pX|Y (x|y ) pY (y ).
This last expression is known as the product rule of probability for two
continuous RVs; the sketch below illustrates the sum rule numerically.
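A minimal sketch of the sum rule for continuous RVs, assuming a bivariate Gaussian as the joint pdf and integrating out y numerically:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm
from scipy.integrate import quad

# Hypothetical joint pdf: a zero-mean bivariate Gaussian with correlated components.
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])
joint = multivariate_normal(mean=[0.0, 0.0], cov=cov)

# Sum rule: integrate the joint pdf over y to get the marginal pdf of X at x0.
x0 = 0.5
marginal_at_x0, _ = quad(lambda y: joint.pdf([x0, y]), -np.inf, np.inf)

# For a Gaussian joint, the marginal of X is Gaussian with variance cov[0, 0].
print(marginal_at_x0)                          # approximately 0.3521
print(norm(0.0, np.sqrt(cov[0, 0])).pdf(x0))   # same value
```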
53 / 60
Bayes theorem and statistical independence
❑ For the case of continuous RVs, Bayes theorem follows as
pY|X (y |x) = pX|Y (x|y ) pY (y ) / pX (x).
❑ We say that two continuous RVs X and Y are statistically independent if
pX ,Y (x, y) = pX (x)pY (y).
54 / 60
Expected values and statistical moments
For continuous RVs, expected values are computed as
E{g(X )} = \int_{−∞}^{∞} g(x) pX (x) dx,
µX = E{X } = \int_{−∞}^{∞} x pX (x) dx,
σX² = var{X } = E{(X − µX )²} = \int_{−∞}^{∞} (x − µX )² pX (x) dx
= E{X²} − µX².
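A minimal sketch evaluating these integrals numerically, assuming a Gaussian pdf with mean 1.5 and standard deviation 2 as the example:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Example pdf (an assumption for illustration): Gaussian with mean 1.5, std 2.
pdf = norm(loc=1.5, scale=2.0).pdf

mean, _ = quad(lambda x: x * pdf(x), -np.inf, np.inf)
second_moment, _ = quad(lambda x: x**2 * pdf(x), -np.inf, np.inf)
variance = second_moment - mean**2

print(mean)      # approximately 1.5
print(variance)  # approximately 4.0 (= 2 squared)
```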
55 / 60
Notation
We have been using pX (x) or pX,Y (x, y ) to refer to pdfs. We will normally
drop the subscripts for the RVs and simply use p(x) or p(x, y ) to refer to
the pdfs.
56 / 60
Contents
Machine learning
Definitions
An example of a predictive model
Review of probability
Random variables
Discrete random variables
Continuous random variables
Additional comments
57 / 60
What if we don’t have the pmf or pdf?
❑ Discrete RVs. In practice, we can use data to compute the probabilities
P(X = xi ) or P(X = xi , Y = yj ) by applying the definitions we saw
before.
❑ Notice that those definitions are valid in the limit N → ∞.
❑ Continuous RVs. In practice, we assume a particular model for the pdf,
e.g. a Gaussian pdf, and estimate the parameters of that pdf, e.g. the
mean and variance for the Gaussian pdf.
❑ There are advanced methods to model both pmfs and pdfs, but we will
not study those in this module.
58 / 60
What about moments?
❑ We can estimate µX and σX² when we have access to observations
x1 , x2 , . . . , xN of the random variable X , but no access to the pmf or the
pdf.
❑ In statistics, these are called “estimators” for µX and σX², denoted as µ̂X
and σ̂X².
❑ An estimator for µX is given as
µ̂X = (1/N) \sum_{k=1}^{N} xk .
❑ An estimator for σX² is given as
σ̂X² = (1/(N − 1)) \sum_{k=1}^{N} (xk − µ̂X )².
The sketch below computes both estimators with NumPy.
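A minimal sketch of both estimators, assuming simulated observations from a Gaussian (any set of observations of X would do):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observations of X: draws from a Gaussian with mean 2 and
# standard deviation 3, standing in for data whose pmf/pdf we do not know.
x = rng.normal(loc=2.0, scale=3.0, size=1000)

mu_hat = np.mean(x)             # estimator of the mean
sigma2_hat = np.var(x, ddof=1)  # estimator of the variance (divides by N - 1)

print(mu_hat)       # close to 2
print(sigma2_hat)   # close to 9
```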
59 / 60
What if we have more than two RVs?
❑ In ML, we are usually faced with problems where we have more than
two RVs.
❑ In fact, there are applications of ML in natural language processing,
speech processing, computer vision, computational biology, etc., where
we can have hundreds of thousands or even millions of RVs.
❑ The ideas that we saw before can be extended to these cases and we
will see some examples in the following lectures.
60 / 60