Machine learning
Introduction
Mohamed FARAH
Academic year: 2024-2025
Machine Learning
Machine Learning is the field of study that gives computers
the ability to learn without being explicitly programmed
(Arthur Samuel, 1959)
Machine Learning
Machine Learning is:
• a subfield of artificial intelligence that uses mathematical and statistical approaches to enable computers to learn from data
• a way to solve decision problems automatically, without explicit programming
• concerned with the design, optimisation and implementation of methods that learn from past data in order to predict new observations
Machine Learning – new programming paradigm
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program (a data-driven program)
Automating automation: getting computers to program themselves
Machine Learning
A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance
at tasks in T, as measured by P, improves with experience E (Tom
Mitchell, 1998)
Example:
• Experience E (data): games played by the program (against itself)
• Performance measure P: winning rate
Learning is the acquisition of the ability to perform the task;
how to learn it is a separate question, and there are many methods
The Experience E / Data
Most algorithms experience an entire dataset
Dataset: A collection of examples, aka data points
An example is a collection of features (data) that have been
quantitatively measured for some object/event that we
want the ML system to process
Data – Example
Anderson’s Iris data (one of the oldest datasets in statistics/ML, 1936)
• Measurements of 150 iris flowers
- 4 attributes: sepal length, sepal width, petal length, petal width, so each example x ∈ ℝ⁴
• 3 species: Setosa, Versicolor, Virginica
https://en.wikipedia.org/wiki/Iris_flower_data_set
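As a quick hands-on illustration, a minimal sketch, assuming scikit-learn is installed (it bundles this dataset):

```python
from sklearn.datasets import load_iris

# Load Anderson's Iris data: 150 examples, 4 features each
iris = load_iris()
X, y = iris.data, iris.target      # X: (150, 4) matrix, y: labels 0/1/2
print(X.shape)                     # (150, 4)
print(iris.feature_names)          # sepal/petal length and width
print(iris.target_names)           # ['setosa' 'versicolor' 'virginica']
```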
Data as vectors, matrices, tensors
Tensors: generalization of matrices
to n dimensions (or rank, order, degree)
• 1D tensor: vector
• 2D tensor: matrix
• 3D, 4D, 5D tensors
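A small numpy illustration of these ranks (numpy calls the rank ndim); the shapes are illustrative:

```python
import numpy as np

vector = np.zeros(4)                  # 1D tensor: e.g. one Iris example
matrix = np.zeros((150, 4))           # 2D tensor: a whole dataset
images = np.zeros((32, 64, 64, 3))    # 4D tensor: a batch of 32 RGB images
print(vector.ndim, matrix.ndim, images.ndim)  # 1 2 4
```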
Data
Datasets decomposition
• Training set: data to train on
• Validation set: data to tune hyperparameters and decide when to stop training
• Test (or generalisation) set: data to evaluate on
These datasets are all disjoint
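One common way to obtain the three disjoint sets, sketched with scikit-learn's train_test_split (the 60/20/20 ratios are illustrative, not prescribed here):

```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# First carve out a disjoint test set, then split the rest into train/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 90 / 30 / 30: all disjoint
```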
Dataset Assumptions
The data are assumed to be generated by an underlying probability distribution (the data-generating distribution)
We typically make the i.i.d. assumptions:
• Samples are independent of each other
• Training and test sets are identically distributed
(drawn from the same distribution)
The Task T
ML enables tackling tasks too difficult to solve with fixed
programs written and designed manually
T is usually described in terms of how the machine learning
system should process an example
NB. The process of learning itself is not the task
The Performance Measures, P
P is specific to the task T
Well known measures based on the confusion matrix
Accuracy
Precision
Recall
F-score
etc.
! Applied to data not seen before:
the test set, not the training set
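A minimal sketch of these measures with scikit-learn, using made-up binary labels for a hypothetical test set:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Hypothetical labels on a held-out test set (never the training set)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))
```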
After the task is learned
Processing of new data is called inference
Computational cost: high during training vs. lower during inference
Related Domains
Statistics: learning theory, data mining, inference
Computing: AI, computer vision, IR
Engineering: signal, robotics, control
Cognitive science, psychology, epistemology, neuroscience
Economics: decision theory, game theory
Applications of Machine Learning
Computer Vision
Image recognition, segmentation, classification, etc.
(Figure: an input image fed to a model, which outputs "cat" or "dog")
Example: recognition of handwritten characters
Computer Vision
Example: face detection
Computer Vision
Example: detection of pedestrians
(Figure: examples of training images)
Natural Language Processing (NLP)
Example: classification of textual documents
Natural Language Processing (NLP)
Example: detection of spam in emails
Hint: count the frequency and co-occurrence of certain keywords, e.g.
congratulations, lottery, win, prize, etc.
Natural Language Processing (NLP)
Example: automatic translation
“How are you?” → Model → “Wie geht’s dir?” (a translating machine)
Natural Language Processing (NLP)
Example: recommendation systems
Natural Language Processing (NLP)
Example: chatbots
“How are you?” → Model → “I am fine, thank you” (a conversational agent / chatbot)
Bio-Informatics
Sequence alignment, analysis of genetic data, etc.
Example: prediction of emergency Caesarean conditions
Signal processing
Speech recognition, speaker identification, speech-to-text, text-to-speech, etc.
audio → Model → “Hello” (speech recognition)
Other areas of application
Robotics: estimation of positions, states, etc.
Financial analysis: portfolio allocation, credit and loan decisions, etc.
Medicine: diagnosis, treatment, design of therapies, etc.
Graphic design: realistic designs and simulations, etc.
Social networks
Content generation
etc.
Learning Types
(based on tasks)
Learning Types
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised learning
Supervised learning
Given: a dataset that contains n samples (x^(1), y^(1)), …, (x^(n), y^(n))
Task: if a residence has x square feet, predict its price y
e.g., the 15th sample (x^(15), y^(15)): x^(15) = 800, y^(15) = ?
Housing price prediction
Regression vs Classification
regression: if y ∈ ℝ is a continuous variable
e.g., price prediction
classification: if the label y is a discrete variable
e.g., the task of predicting the type of residence:
x = (size, lot size) → y = house or townhouse?
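To make the distinction concrete, a minimal scikit-learn sketch with made-up housing numbers (all values are illustrative): a linear regressor for the continuous price, a logistic classifier for the discrete residence type.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: square feet -> price (continuous y); toy numbers
X = np.array([[600], [800], [1000], [1200]])   # square feet
y = np.array([150, 200, 250, 300])             # price in k$
reg = LinearRegression().fit(X, y)
print(reg.predict([[800]]))                    # predicted price for 800 sq ft

# Classification: (size, lot size) -> house (1) or townhouse (0); toy labels
X2 = np.array([[600, 0], [800, 100], [1500, 3000], [2000, 5000]])
y2 = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X2, y2)
print(clf.predict([[900, 200]]))               # predicted class
```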
Supervised Learning – Model Types
2 types of models:
• Discriminative model:
• estimates P(y | x) directly
• we learn the decision boundary
• Generative model:
• estimates P(x | y) and P(y), then deduces P(y | x) (e.g., via Bayes' rule)
• we learn the probability distributions of the data
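A hedged illustration on the Iris data: logistic regression as a discriminative model (learns P(y|x) directly) vs. Gaussian Naive Bayes as a generative one (models P(x|y) and P(y)):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

disc = LogisticRegression(max_iter=1000).fit(X, y)  # learns P(y|x) directly
gen = GaussianNB().fit(X, y)                        # models P(x|y) and P(y)

print(disc.predict_proba(X[:1]))  # P(y|x) from the decision boundary
print(gen.predict_proba(X[:1]))   # P(y|x) deduced via Bayes' rule
```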
Supervised learning in Computer Vision
Image Classification
x = raw pixels of the image, y = the main object
ImageNet Large Scale Visual Recognition Challenge. Russakovsky et al.’2015
Supervised learning in Computer Vision
Object localization and detection
x = raw pixels of the image, y = the bounding boxes
ImageNet Large Scale Visual Recognition Challenge. Russakovsky et al.’2015
Supervised learning in Computer Vision
Recognition of handwritten characters (OCR)
x: pixel intensity values of the image
y: identity of the character (the class)
Supervised learning in NLP
Machine translation
Unsupervised learning
Also called Knowledge discovery
Unsupervised Learning
Dataset contains no labels: x^(1), …, x^(n)
Target is not explicitly known
Goal (vaguely-posed): to find interesting structures /
patterns in the data
(Figure: the same point cloud with class labels (supervised) vs. without labels (unsupervised))
Clustering
k-means clustering, mixture of Gaussians, etc.
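A minimal k-means sketch with scikit-learn on two made-up blobs of unlabelled points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2D data with no labels: two loose blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])       # cluster assignment for the first 10 points
print(kmeans.cluster_centers_)   # the 2 learned centroids
```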
Density Estimation
learning the probability distribution that generated the data.
• To generate new realistic data
• To distinguish “realistic” data from “false” data (e.g., spam filtering)
• To compress data
• etc.
Density Estimation
Given a sample {x^(i), i = 1..n} from a distribution,
obtain an estimate of the density function at any point.
Parametric:
• Assume a parametric family of densities p(x; θ) (e.g., the Gaussian N(μ, σ²)) and obtain the best estimate θ̂ of θ
Nonparametric:
• Obtain a good estimate of the entire density directly from the sample (e.g., a histogram)
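A small numpy sketch contrasting the two approaches on a sample that is secretly Gaussian (the parameters 2.0 and 1.5 are made up):

```python
import numpy as np

# Sample from an "unknown" distribution (here secretly Gaussian)
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

# Parametric estimate: assume N(mu, sigma^2), fit by maximum likelihood
mu_hat, sigma_hat = x.mean(), x.std()
print(mu_hat, sigma_hat)   # close to 2.0 and 1.5

# Nonparametric estimate: a histogram of the sample
counts, edges = np.histogram(x, bins=30, density=True)
print(counts[:5])          # estimated density in the first bins
```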
Representation learning
Automatically extracting useful, meaningful features from raw data without labels.
The aim is to transform the data into a more compact and informative representation (embeddings) that facilitates subsequent tasks such as classification or clustering.
Word Embedding
Represent words by vectors (word → encode → vector)
(Figure: capital-country pairs Rome-Italy, Paris-France, Berlin-Germany; a shared vector direction encodes the capital-of relation)
Word2vec [Mikolov et al’13]
GloVe [Pennington et al’14]
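A minimal sketch with the gensim library (an assumption: no tool is prescribed here); the toy three-sentence corpus is far too small to give meaningful vectors, but the API flow is the same at scale:

```python
from gensim.models import Word2Vec

# Tiny toy corpus; real embeddings need large-scale text
sentences = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["rome", "is", "the", "capital", "of", "italy"],
    ["berlin", "is", "the", "capital", "of", "germany"],
]
model = Word2Vec(sentences, vector_size=16, min_count=1, seed=0)
vec = model.wv["paris"]                 # the learned word vector
print(vec.shape)                        # (16,)
print(model.wv.most_similar("paris"))   # nearest words in vector space
```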
Clustering Words with Similar Meaning
(Hierarchically)
[Arora-Ge-Liang-M.-Risteski, TACL’17,18]
Dimensionality reduction
reduce the number of variables or dimensions of the data,
while preserving the essential information.
(Figure: the “swiss roll” dataset, a 2D manifold embedded in 3D)
https://link.springer.com/article/10.1007/s00477-016-1246-2/figures/1
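The simplest linear instance is PCA, sketched below with scikit-learn on made-up data; note that unrolling a nonlinear manifold like the swiss roll would need a nonlinear method (e.g., Isomap), so this is only the linear baseline:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy high-dimensional data: 100 samples, 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

pca = PCA(n_components=2)             # keep the 2 main directions of variance
X2 = pca.fit_transform(X)
print(X2.shape)                       # (100, 2)
print(pca.explained_variance_ratio_)  # information kept per component
```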
Latent Semantic Analysis (LSA)
(Figure: a word-document matrix, factorised to detect topics)
Principal Component Analysis (PCA) used in LSA
https://commons.wikimedia.org/wiki/File:Topic_detection_in_a_document-word_matrix.gif
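A minimal LSA-style sketch with scikit-learn, where TruncatedSVD plays the PCA-like role on a TF-IDF word-document matrix built from a made-up four-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares on the market",
]
X = TfidfVectorizer().fit_transform(docs)   # word-document matrix
lsa = TruncatedSVD(n_components=2)          # PCA-like factorisation used in LSA
topics = lsa.fit_transform(X)
print(topics.shape)                         # (4 docs, 2 latent topics)
```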
Large Language Models (LLM)
Machine learning models for language, trained on large-scale text datasets
They can be used for many purposes
Language Models are Few-Shot Learners [Brown et al.’20]
https://openai.com/blog/better-language-models/
Reinforcement learning
Reinforcement learning
Learning to make sequential decisions
Chess
• 1997: Deep Blue (IBM) defeated world chess champion Garry Kasparov
in a six-game match.
• 2017: AlphaZero (DeepMind) defeated Stockfish (chess engine)
Go
• 2016: AlphaGo (DeepMind) defeated 18-time world champion Lee
Sedol 4-1 in a five-game match.
• 2017: AlphaGo Master defeated world champion Ke Jie
• 2017: AlphaGo Zero (a more advanced version) surpassed all previous
versions
Reinforcement learning
The algorithm can collect data interactively:
try the strategy and collect feedback → data → improve the strategy, training on the collected feedback → try again
Reinforcement learning
Problem / Data
• A state describes a situation
• An action allows the agent to move between states
• A policy chooses the action to take based on the current state
• After each action, a positive or negative reward is observed
Objectives (a toy sketch of these ingredients follows)
• Guide an agent to learn a policy: improve the choice of action at time t+1
• Avoid failure situations
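To ground these terms, a minimal tabular Q-learning sketch (Q-learning is one classic RL method, not prescribed here) on a made-up 5-state corridor where reaching the goal state yields a positive reward:

```python
import numpy as np

# Toy corridor: states 0..4, goal at state 4; actions: 0=left, 1=right
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # value of each (state, action) pair
alpha, gamma, eps = 0.5, 0.9, 0.2     # learning rate, discount, exploration
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != 4:
        # Epsilon-greedy policy (explore, or break ties randomly)
        if rng.random() < eps or Q[s, 0] == Q[s, 1]:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s_next == 4 else 0.0   # reward observed after the action
        # Q-learning update: improve the action choice for next time
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))   # learned policy: action 1 (go right) in every state
```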
Challenges
• The ability of a model to generalise
• Overfitting
• Underfitting
• Curse of dimensionality (many features vs. dataset size)
• Vanishing and exploding gradients (in neural-network-based models)
• Data scarcity → data augmentation (for small datasets)
• Imbalanced datasets
• etc.
Generalisation
A major challenge of ML
• Ability to perform well on previously unseen inputs
Training error vs. test / generalisation error
• training error: error on the training inputs
• test / generalisation error: expected error on a new input
The ML training algorithm reduces the training error, which is a task of optimisation
What differentiates ML from pure optimisation is that the test / generalisation error needs to be low as well
Typical learning curve
(Figure: training loss and validation loss vs. number of training steps)
Overfitting
• A major problem for learning techniques!
• One can find a hypothesis that predicts the training data well but does not generalise to the rest of the data.
• In the rest of the course, we will see methods to mitigate the overfitting problem.
Vanishing and exploding gradients problem
For neural-network-based models
• Vanishing Gradients: Occur when the gradients of the loss
function with respect to the parameters become very small
during backpropagation. This prevents the weights from
updating effectively, slowing or halting learning, especially
in early layers of deep networks.
• Exploding Gradients: Occur when the gradients become
very large, causing unstable updates to the weights and
making training diverge.
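A small numpy sketch of the mechanism, assuming a chain of sigmoid layers: by the chain rule the gradient is a product of per-layer derivatives; since the sigmoid's derivative is at most 0.25 the product shrinks exponentially with depth, and large weights can make it blow up instead:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient through a chain of layers is a product of local derivatives.
# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) <= 0.25, so the product
# shrinks exponentially with depth: the vanishing-gradient effect.
z, grad = 0.5, 1.0
for layer in range(20):
    s = sigmoid(z)
    grad *= s * (1 - s)        # multiply by this layer's local derivative
print(grad)                    # ~1e-13: almost no signal for early layers

# With a large weight w, the factor |w| * sigmoid'(z) can exceed 1,
# and the product blows up instead: the exploding-gradient effect.
w, grad = 10.0, 1.0
for layer in range(20):
    s = sigmoid(z)
    grad *= w * s * (1 - s)
print(grad)                    # very large: unstable weight updates
```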
Data augmentation
What?
• Increase the size and diversity of a training dataset
• Apply various transformations to the original data
• Used when the original dataset is small or lacks diversity
Why?
• Prevents overfitting by exposing the model to more varied data
• Improves the model's ability to generalize to unseen data
• Enhances performance in tasks like image classification, object detection, natural language processing, etc.
Data augmentation
Common Techniques:
1. Image data:
- Rotation, flipping, cropping, scaling, and translation
- Color jittering (adjusting brightness, contrast, saturation)
- Adding noise or blurring
- Random erasing or cutout
2. Text data:
- Synonym replacement, random insertion, or deletion of words
- Back-translation (translating text to another language and back)
- Shuffling sentences or phrases
3. Audio data:
- Time stretching, pitch shifting, or adding background noise
4. Tabular data:
- Adding noise to numerical features
- Synthetic minority oversampling techniques (e.g., SMOTE)
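For image data, a minimal sketch with torchvision (an assumption: no library is prescribed here); the file name is hypothetical:

```python
from torchvision import transforms
from PIL import Image

# A pipeline of random transformations applied on the fly at training time
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),      # flipping
    transforms.RandomRotation(degrees=15),       # rotation
    transforms.ColorJitter(brightness=0.2,       # color jittering
                           contrast=0.2,
                           saturation=0.2),
    transforms.RandomResizedCrop(size=224),      # cropping + scaling
])

img = Image.open("cat.jpg")        # hypothetical training image
augmented = augment(img)           # a new random variant on each call
```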
Data augmentation
(Figure: examples of augmented variants of a training image)
Imbalanced datasets
Skewed class distributions can lead to biased
models that favor the majority class.
Common techniques: resampling
• Oversampling: increase the number of instances in the minority class
- Examples: random oversampling, SMOTE (Synthetic Minority Oversampling Technique), ADASYN
• Undersampling: reduce the number of instances in the majority class
- Examples: random undersampling, Tomek links, cluster centroids
• Hybrid approaches: combine oversampling and undersampling for balanced results
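A minimal SMOTE sketch with the imbalanced-learn library (an assumption: the technique is named here, not a tool); the dataset is synthetic:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data: ~90% majority class, ~10% minority class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))            # skewed class counts

# Oversample the minority class with synthetic examples
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))        # balanced class counts
```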
Prerequisites
• Knowledge of calculus: derivatives, partial derivatives, gradients, integrals, etc.
• Knowledge of linear algebra: matrices, vectors, norms, scalar products, etc.
• Knowledge of probability and statistics
• Knowledge of programming
References
• A. Géron. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, 2019.
• C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
• R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley, 2001.
• ...