
Introduction to ML

Computer Vision
Introduction to
Machine Learning

2 / 38
Recognition Problems
• The main problem: the rules must be selected manually.

[Figure: a decision tree of manually selected rules. Condition 1 and Condition 2 are checked in turn; their true/false branches assign Class 1 or Class 2. The inputs are geometric features for selected segments.]

• Consequences:
1. Meaningful and informative features should be used.
2. There are very few such features, and their complex combinations
cannot be processed.
3 / 38
Ideal Solution
• The machine gives answers to questions based on the data
already processed.
[Diagram: data is fed to a learning algorithm, which produces a trained machine; the trained machine receives a question and returns an answer.]

4 / 38
Learning Process
• Learning is not the same as memorization; memorization is not a
problem for a machine.
• The machine must learn to draw inferences from a set of training
data.
• The machine must work correctly based on new data that was
not given to it before.

5 / 38
Definition
• «A computer program is said to learn from experience 𝐸 with
respect to some task 𝑇 and some performance measure 𝑃, if its
performance on 𝑇, as measured by 𝑃, improves with experience
𝐸» (T. M. Mitchell, 1997).

6 / 38
Applications
• computer vision,
• speech recognition,
• computational linguistics and natural language processing,
• medical diagnostics,
• bioinformatics,
• technical diagnostics,
• financial applications,
• text search and categorization,
• expert systems,
• etc.
7 / 38
Machine Learning Classes
1. Deductive learning (from general to particular).
• There are formalized data.
• It is required to derive a rule applicable to a particular case
based on formalized data.
• Typical example: expert systems.
2. Inductive learning (from particular to general).
• Empirical data are given; it is required to recover some dependence.
• Subdivided into:
a. Supervised learning;
b. Unsupervised learning;
c. Reinforcement learning;
d. Active learning etc.
8 / 38
Probability Theory and Stochastic Processes

• What is a probability?
1. In the frequentist (frequency) interpretation, probability is the
relative frequency of a repeating event.
2. In the Bayesian interpretation, probability is a measure of
uncertainty about the outcome of an experiment.

9 / 38
Example: Extraction of Fruit From Two Boxes

• Experiment:
• Box selection;
• Fruit extraction;
• Putting the fruit back.
• Two random variables:
• 𝑋 – the color of the box (red or blue);
• 𝑌 – fruit (orange or apple).
$P(X = \mathrm{red}) = \frac{\text{number of times the red box was chosen}}{\text{number of experiments performed}}$
• 𝑃 is the probability of choosing the red box.
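• A minimal simulation of this frequency definition (a hypothetical Python sketch; the 40%/60% box probabilities are assumed values, not taken from the slides):

import random

random.seed(0)
N = 100_000                      # number of repeated experiments
red_count = 0
for _ in range(N):
    # choose a box at random; the 0.4 / 0.6 split is an assumed example value
    box = "red" if random.random() < 0.4 else "blue"
    if box == "red":
        red_count += 1

# frequency interpretation: probability is the relative frequency of the event
print("estimated P(X = red):", red_count / N)   # close to 0.4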

10 / 38
Example: Extraction of Fruit From Two Boxes

• We will perform an experiment and enter the number of outcomes
in the table (horizontally – the colors of the box, vertically – fruits).

• Event intersection (joint) probability:
$P(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$,
where $N$ is the number of experiments and $n_{ij}$ the number of outcomes of type $(i, j)$.
• Conditional probability:
$P(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}$, where $c_i = \sum_j n_{ij}$.
11 / 38
Example: Extraction of Fruit From Two Boxes

[Figure: histograms of the conditional distributions P(Y | X = red) and P(Y | X = blue).]

Conditional probability:
the probability that the fruit is an orange is 75% for the red box
and 25% for the blue box.
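• These numbers can be reproduced from a count table with a short sketch (hypothetical counts, chosen only so that the conditional probabilities above come out to 75% and 25%):

# n_ij: number of outcomes for each (fruit, box) pair; assumed illustrative counts.
counts = {
    ("orange", "red"): 30, ("apple", "red"): 10,    # 40 draws from the red box
    ("orange", "blue"): 15, ("apple", "blue"): 45,  # 60 draws from the blue box
}
N = sum(counts.values())

# Joint probability P(X = x, Y = y) = n_ij / N
joint = {pair: n / N for pair, n in counts.items()}

# Conditional probability P(Y = y | X = x) = n_ij / (number of draws from box x)
def conditional(fruit, box):
    draws_from_box = sum(n for (f, b), n in counts.items() if b == box)
    return counts[(fruit, box)] / draws_from_box

print(joint[("orange", "red")])       # 0.3
print(conditional("orange", "red"))   # 0.75
print(conditional("orange", "blue"))  # 0.25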
12 / 38
Bayes Formula

• What is the probability that we have a dinosaur in front of us
(𝑥 – observation)?
$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}$ – the Bayes formula,
where $P(x \mid y)$ – the probability that the dinosaur looks like this;
$P(y)$ – the probability of meeting a dinosaur;
$P(x)$ – the probability of seeing such a scene.
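• In code the same computation is a single line; the prior, likelihood and evidence values below are invented for illustration, since the slides give no numbers:

def bayes_posterior(likelihood, prior, evidence):
    # P(y | x) = P(x | y) * P(y) / P(x)
    return likelihood * prior / evidence

p_x_given_dino = 0.8   # P(x | y): a dinosaur would look like this (assumed)
p_dino = 1e-6          # P(y): prior probability of meeting a dinosaur (assumed)
p_scene = 0.01         # P(x): probability of seeing such a scene (assumed)

print(bayes_posterior(p_x_given_dino, p_dino, p_scene))  # 8e-05: still very unlikely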

13 / 38
Probability Theory
• Sum rule:
$P(x) = \int_y P(x, y)\,dy \;\leftrightarrow\; P(y) = \int_x P(x, y)\,dx$
• Product rule:
$P(x, y) = P(y \mid x)\,P(x) = P(x \mid y)\,P(y)$
• If two random variables are independent:
$P(x, y) = P(x)\,P(y)$
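• For discrete variables the integrals turn into sums; a quick sketch with the fruit-and-box numbers (the 0.4/0.6 box probabilities and the 75%/25% conditionals are the assumed values used earlier):

# P(x): probability of each box; P(y | x): conditional fruit probabilities (assumed values).
P_x = {"red": 0.4, "blue": 0.6}
P_y_given_x = {"red": {"orange": 0.75, "apple": 0.25},
               "blue": {"orange": 0.25, "apple": 0.75}}

# Product rule: P(x, y) = P(y | x) * P(x)
P_xy = {(x, y): P_y_given_x[x][y] * P_x[x]
        for x in P_x for y in ("orange", "apple")}

# Sum rule (discrete form): P(y) = sum over x of P(x, y)
P_y = {y: sum(P_xy[(x, y)] for x in P_x) for y in ("orange", "apple")}
print(P_y)  # roughly {'orange': 0.45, 'apple': 0.55}

# Independence would require P(x, y) = P(x) * P(y); here it does not hold:
print(P_xy[("red", "orange")], P_x["red"] * P_y["orange"])  # 0.3 vs 0.18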

14 / 38
Machine Learning Tasks
• Setting the task of supervised learning:
• 𝕏 – a set of objects or examples, situations, inputs (samples);
• 𝕐 – a set of answers or labels, outputs (responses).
• There is some dependence that allows predicting 𝑦 ∈ 𝕐 from 𝑥 ∈
𝕏.
• If the dependence is deterministic, then there is a function
𝑓*: 𝕏 → 𝕐.
• The dependence is known only on the objects of the training
sample, i.e. we know a finite set of data:
$\{(x^{(i)}, y^{(i)}): x^{(i)} \in \mathbb{X},\ y^{(i)} \in \mathbb{Y}\}\ (i = 1, \dots, N)$.

15 / 38
Machine Learning Tasks
• An ordered object-response pair $(x^{(i)}, y^{(i)}) \in (\mathbb{X} \times \mathbb{Y})$ is called a
precedent.
• The task of supervised learning is to restore the relationship
between input and output based on the existing training sample,
• i.e., it is necessary to design a function (decision rule) 𝑓: 𝕏 → 𝕐, for
new objects 𝑥 ∈ 𝕏 predicting the answer 𝑓(𝑥) ∈ 𝕐:
$y = f(x) \approx f^*(x)$.
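• As an illustration of such a decision rule (not the specific method of these slides), a sketch of a one-nearest-neighbour rule: f answers for a new object with the response of the closest precedent; the data points are made up:

import math

# Training sample of precedents (x_i, y_i): x in R^2, y a class label (made-up values).
train = [((0.0, 0.0), "apple"), ((1.0, 0.2), "apple"),
         ((5.0, 4.8), "orange"), ((6.0, 5.5), "orange")]

def f(x):
    # Decision rule f: X -> Y built from the precedents (1-nearest neighbour).
    _, label = min(train, key=lambda pair: math.dist(pair[0], x))
    return label

print(f((0.5, 0.1)))   # 'apple'
print(f((5.5, 5.0)))   # 'orange'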

16 / 38
Basic Definitions
• The functions 𝑓 are chosen from the parametric family 𝐹, i.e.
from a set of possible models.
• The process of finding the function 𝑓 is called learning, as well
as tuning or fitting the model.
• An algorithm for designing a function 𝑓 from a given training set
is called a learning algorithm.
• Some class of algorithms is called a learning method.

17 / 38
Basic Definitions
• Learning algorithms operate with object descriptions: each
sample element is described by a set of features
$x = (x_1, x_2, \dots, x_d)$ (a feature vector), where $x_j \in Q_j$, $j = 1, \dots, d$,
$\mathbb{X} = Q_1 \times Q_2 \times \cdots \times Q_d$.
• The set 𝕏 is called the feature space.
• It is necessary to design such a function 𝑦 = 𝑓(𝑥) of the
feature vector $x = (x_1, x_2, \dots, x_d)$ that would give the answer 𝑦 for
any possible observation 𝑥.
• The component $x_j$ is called the 𝑗-th feature, or property, or
attribute of the object 𝑥.

18 / 38
Basic Definitions
• If $Q_j = \mathbb{R}$, then the 𝑗-th attribute is called quantitative, or real.
• If $Q_j$ is finite, then the 𝑗-th feature is called nominal, or
categorical, or factor.
• If $Q_j$ has exactly two values ($|Q_j| = 2$), then the feature is called binary.
• If $Q_j$ is ordered, then the feature is called ordinal.

19 / 38
Regression Retrieval Task
• If 𝕐 = ℝ, then it is a regression retrieval task.
• The decision rule 𝑓 is called the regression function.
• If 𝕐 is finite 𝕐={1,2,…,𝐾}, then this is a classification task.
• The decision rule 𝑓 is called the classifier.

[Figure: an example of regression retrieval; 𝑦 is a continuous value.]
20 / 38


Binary Classification Task
• Given a training sample:
$X^m = \{(x_1, y_1), \dots, (x_m, y_m)\},\ (x_i, y_i) \in \mathbb{R}^n \times Y,\ Y = \{-1, +1\}$.
• Objects belong to one of two classes.
• We mark the main class as “+1”, the secondary “background” as “−1”.
• For every new value 𝑥, it is required to assign the class "+1" or "−1".
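• A minimal sketch of such a binary decision rule: a linear threshold that maps any new x to +1 (the main class) or −1 (the background); the weights are arbitrary illustrative values, not a trained model:

def classify(x, w=(1.0, -0.5), b=0.2):
    # Return +1 for the main class, -1 for the background (illustrative linear rule).
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return +1 if score >= 0 else -1

print(classify((2.0, 1.0)))    # +1
print(classify((-3.0, 2.0)))   # -1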

21 / 38
Multiclass Classification Task
• Given a training sample:
$X^m = \{(x_1, y_1), \dots, (x_m, y_m)\},\ (x_i, y_i) \in \mathbb{R}^n \times Y,\ Y = \{1, \dots, K\}$.
• Objects belong to one of the 𝐾 classes.
• For every new value 𝑥, it is required to assign a class label from 1
to 𝐾.

22 / 38
Remarks
1. The found decision rule should have a generalizing ability (the
constructed classifier or regression function should reflect the
overall dependence of the output on the input, based only on
known data about the precedents of the training sample).
2. Attention should be paid to the problem of effective
computability of the function 𝑓 and to the learning algorithm:
the model tuning should take place in an acceptable time.

23 / 38
Machine Learning Tasks
• We are interested in the quality of the algorithm on new data:
it is necessary to connect the existing data with the data that
we will process in the future.
• For this, the values of the features will be considered random
variables.
• We will assume that the data that will have to be processed in
the future and the available data are identically distributed.
[Figure: the training set and the data with unknown answers.]

24 / 38
Machine Learning Paradigms
• Discriminative Approach
• We will choose functions 𝑓 from the parametric family 𝐹, i.e. from some
set of possible models.
• Let us introduce some loss function 𝐿(𝑦,𝑓(𝑥)) of the true value of the
output 𝑦 and the predicted value 𝑓(𝑥):

25 / 38
Discriminative Approach
• Let us introduce some loss function 𝐿(𝑦,𝑓(𝑥)) of the true value of the
output 𝑦 and the predicted value 𝑓(𝑥):
• In a regression retrieval task,
• a quadratic error:
$L(y, f(x)) = \frac{1}{2}\,(y - f(x))^2$,
• an absolute error:
$L(y, f(x)) = |y - f(x)|$.
• In a classification task,
• a prediction error:
$L(y, f(x)) = I(y \neq f(x))$,
where 𝑓(𝑥) is the predicted class and $I(\cdot)$ is the indicator function:
$I(\text{condition}) = 1$ if the condition is met, $0$ otherwise.
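• The three loss functions, written out directly (a straightforward sketch assuming scalar outputs):

def squared_loss(y, y_pred):
    # Quadratic error for regression: L = 1/2 * (y - f(x))^2
    return 0.5 * (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    # Absolute error for regression: L = |y - f(x)|
    return abs(y - y_pred)

def zero_one_loss(y, y_pred):
    # Prediction error for classification: L = I(y != f(x))
    return 1 if y != y_pred else 0

print(squared_loss(3.0, 2.0))       # 0.5
print(absolute_loss(3.0, 2.0))      # 1.0
print(zero_one_loss("cat", "dog"))  # 1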
26 / 38
Discriminative Approach
• It is necessary to design a function 𝑦=𝑓(𝑥) – a decision rule or a
classifier.
• Any decision rule divides space into decision regions separated
by decision boundaries.

27 / 38
Machine Learning Paradigms
• Generative approach
• The Bayes formula $P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}$ is used;
• Each class is modeled separately: we estimate 𝑃(𝑥|𝑦) and 𝑃(𝑦);
• The problem statement is similar to classification.
• Discriminative approach
• Since we are interested in 𝑃(𝑦│𝑥), we estimate it directly;
• The problem statement is similar to regression.

28 / 38
Risks
• The task of learning is to find the classifier parameters (the
function 𝑓) for which the losses on new data are minimal.
• Let's introduce the concept of general (average) risk – this is the
mathematical expectation of losses:

$R(f) = E[L(f(x), y)] = \int_{x, y} L(f(x), y)\,dP$.
• Unfortunately, due to the unknown probability distribution 𝑃 of
the joint random variable (𝑥,𝑦), the total risk cannot be
calculated.
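• If the distribution 𝑃 were known, the total risk could at least be approximated by sampling from it; a hypothetical sketch (the data-generating distribution and the fixed predictor f below are assumptions for illustration):

import random

random.seed(1)

def sample_xy():
    # Assumed data-generating distribution P(x, y): y = 2x plus Gaussian noise.
    x = random.uniform(0.0, 1.0)
    y = 2.0 * x + random.gauss(0.0, 0.1)
    return x, y

def f(x):
    # A fixed (not necessarily good) decision rule whose risk we estimate.
    return 1.8 * x

# Monte Carlo approximation of R(f) = E[L(f(x), y)] with the quadratic loss.
n = 100_000
R = sum(0.5 * (f(x) - y) ** 2 for x, y in (sample_xy() for _ in range(n))) / n
print(R)  # roughly 0.012 for these assumed numbers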

29 / 38
Risks
• Let us introduce the concept of empirical risk. Let $X = \{x_1, \dots, x_m\}$,
$Y = \{y_1, \dots, y_m\}$ be the training sample. Empirical risk, or training
error:
$R_{emp}(f, X) = \frac{1}{m}\sum_{i=1}^{m} L(y_i, f(x_i))$.
• To minimize the empirical risk, it is necessary to find the
function 𝑓 in accordance with the condition:
$f = \arg\min_{f \in F} R_{emp}(f, X)$.
• The condition is called the empirical risk minimization principle.
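• A tiny sketch of the principle: the empirical risk of every candidate model is computed on the training sample and the minimizer is chosen; the data and the model family (constant predictors) are toy assumptions, not from the slides:

# Toy training sample and a tiny model family of constant predictors f_c(x) = c (assumed).
X = [0.0, 1.0, 2.0, 3.0]
Y = [0.1, 0.9, 2.2, 2.8]
candidates = [lambda x, c=c: c for c in (0.0, 0.5, 1.0, 1.5, 2.0)]

def empirical_risk(f, X, Y):
    # R_emp(f, X) = (1/m) * sum_i L(y_i, f(x_i)), with the quadratic loss L.
    return sum(0.5 * (y - f(x)) ** 2 for x, y in zip(X, Y)) / len(X)

# Empirical risk minimization: pick the candidate with the smallest training error.
best = min(candidates, key=lambda f: empirical_risk(f, X, Y))
print(best(0.0), empirical_risk(best, X, Y))  # the constant 1.5 wins here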

30 / 38
Comment
• There can be an unlimited number of hypotheses that have zero
empirical risk:

[Figure: three hypotheses with zero empirical risk: the most specific (particular) one, a middle way, and the most general one.]

31 / 38
Supervised Learning Challenge
• The problem was reduced to finding a function 𝑓 from an
admissible set 𝐹 that satisfies the condition:
$f = \arg\min_{f \in F} R_{emp}(f, X)$,
𝐹 and 𝐿 are fixed and known.
• The class of models 𝐹 is parametrized, i.e. there is a description
of the form $F = \{f(x) = f(x, \theta): \theta \in \Theta\}$, where Θ is some known
set.
• Model tuning process:
• the learning algorithm selects the values of the parameters 𝜃 ∈ Θ
that ensure the fulfillment of the condition on 𝑓, i.e. minimize the
error on the precedents of the training sample.
32 / 38
Overfitting
• The considered condition is not suitable for evaluating the
generalizing ability of the algorithm.
• All available data is divided into training and test sets:
• Training is performed using a training set,
• Evaluation of the prediction quality based on test sample data.
• The values $R(f)$ and $R_{emp}(f, X)$ can differ significantly.
• The phenomenon when $R_{emp}(f, X)$ is small and $R(f)$ is too large is
called overfitting.

33 / 38
Overfitting
• Let there be a regression problem.
• $t = \sin(2\pi x) + \epsilon$, where 𝜖 is normally distributed noise, but we
don't know that.
• Let there be a training sample and it is required to restore the
dependence:

34 / 38
Overfitting
• We will choose the target dependence among polynomials of
order 𝑀 (parametrized set):
$y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = w^{T}\phi_M(x)$.
• We introduce the loss function:
$L((x, t), y) = \frac{1}{2}\,(y(x, w) - t)^2$.
• Among the set of polynomials, we will choose the one that
brings the least total loss on the training set.
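• A sketch of this experiment with NumPy; the sample sizes and noise level are assumed, and numpy.polyfit is used as the least-squares fit, which minimizes the same total quadratic loss up to a constant factor:

import numpy as np

rng = np.random.default_rng(0)

def make_sample(n):
    # t = sin(2*pi*x) + normally distributed noise (noise level 0.3 is assumed).
    x = rng.uniform(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, t

x_train, t_train = make_sample(10)    # small training sample
x_test, t_test = make_sample(100)     # new data the model has not seen

for M in (1, 3, 9):
    w = np.polyfit(x_train, t_train, deg=M)                      # polynomial of order M
    train_err = np.mean((np.polyval(w, x_train) - t_train) ** 2)
    test_err = np.mean((np.polyval(w, x_test) - t_test) ** 2)
    print(f"M={M}: train error {train_err:.3f}, test error {test_err:.3f}")
# Typically M=9 drives the training error to almost zero while the test error
# grows: the polynomial fits the noise, i.e. overfitting.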

35 / 38
Overfitting

36 / 38
Overfitting
• Reason: the hypothesis describes well not the properties of
objects in general, but only the objects from the training sample:
• Too many degrees of freedom in the model parameters (a too
complex model);
• Noisy data;
• Bad training set.

37 / 38
THANK YOU
FOR YOUR TIME!

[email protected]
