Machine Learning
Data Science Process
Learning?
• Herbert Simon: “Learning is any process by which a
system improves performance from experience.”
• There are two ways that a system can improve:
1. By acquiring new knowledge (e.g. acquiring new facts)
2. By adapting its behavior (e.g. solving problems more accurately)
• How can a machine learn from data?
Main types of Machine Learning
• Supervised learning (with a teacher): uses a
series of labelled examples with direct feedback
• Unsupervised learning, e.g. clustering (without a
teacher): no feedback
• Semi-supervised learning: in between supervised and
unsupervised learning (some data is labelled but
most of it is unlabelled)
Supervised vs Unsupervised
• How many groups do we have in this figure?
• Can we apply supervised learning?
• What will you get if you apply unsupervised learning?
What do you think now?
Supervised vs Unsupervised
• Can you separate this data into two groups?
Supervised vs Unsupervised vs Semi-supervised
Example
• We have a dataset with two columns x1 and x2
X1 X2
1 2
5 3
… …
• We plot the data into two-dimensional space as follows
• Q1) Can you divide the data into two groups?
• Try to separate the points based on the distance
between the data points
X1 X2
1 2
5 3
... ...
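Separating points by the distance between them, as Q1 suggests, can be sketched with a tiny k-means loop. This is only an illustration: the points below are made up (the full dataset is not shown on the slide), k-means is just one of several clustering methods, and the helper name `kmeans_two_groups` is hypothetical.

```python
import math

def kmeans_two_groups(points, iterations=10):
    """Split 2-D points into two groups by distance (basic k-means, k=2)."""
    # Assumption: use the first and last point as the initial centres.
    centres = [points[0], points[-1]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [
            min((0, 1), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Update step: move each centre to the mean of its members.
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a group ends up empty
                centres[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centres

# Made-up (x1, x2) points forming two visibly separate clumps.
points = [(1, 2), (1.5, 1.8), (2, 2.2), (5, 3), (5.5, 3.5), (6, 3.2)]
labels, centres = kmeans_two_groups(points)
print(labels)   # group index (0 or 1) for each point
print(centres)  # the two group centres
```

No labels are used anywhere: the groups come only from the distances, which is what makes this unsupervised.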
Q2) If we give you the labels (a new column which
provides the class of each row), can you draw a line
that separates the two classes?
X1 X2 X3 (Label)
1 2 normal (blue)
5 3 abnormal (red)
.. .. ..
Examples of ML Algorithms
Usual ML stages
• Hypothesis, data
• Training or learning (requires examples/data)
• Testing or generalization
Training
• Training is the acquisition of knowledge, skills, and competencies as
a result of the teaching of practical skills and knowledge that relate to
specific useful competencies (Wikipedia)
• Training requires scenarios or examples (data)
In machine learning we learn from the available data or examples
Training: the figure shows how the separating line is updated over several training steps
Figure panels: Initial random line | Updating the line after one training step | Training is complete
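The step-by-step line updates in the figure can be imitated with a perceptron, a simple rule that nudges the line whenever a training point falls on the wrong side. The slides do not say which algorithm produced the figure, so the perceptron, the made-up data, and the names `train_perceptron` and `predict` are all assumptions for illustration.

```python
def train_perceptron(points, labels, epochs=100, lr=0.1):
    """Learn a line w1*x1 + w2*x2 + b = 0 separating labels +1 and -1."""
    # The figure starts from a random line; zeros are used here (an assumption)
    # so that the run is reproducible.
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            score = w1 * x1 + w2 * x2 + b
            if y * score <= 0:  # point is on the wrong side (or on the line)
                # One training step: shift the line towards this point's class.
                w1 += lr * y * x1
                w2 += lr * y * x2
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # training is complete: every point is separated
            break
    return w1, w2, b

def predict(model, point):
    """Return +1 or -1 depending on which side of the line the point lies."""
    w1, w2, b = model
    return 1 if w1 * point[0] + w2 * point[1] + b > 0 else -1

# Made-up data: -1 = normal (blue), +1 = abnormal (red).
points = [(1, 2), (1.5, 1.8), (2, 2.2), (5, 3), (5.5, 3.5), (6, 3.2)]
labels = [-1, -1, -1, 1, 1, 1]
model = train_perceptron(points, labels)
print([predict(model, p) for p in points])
```

Unlike clustering, every update here is driven by the labels, which is the "direct feedback" of supervised learning.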
Testing
• How well does the learned system work?
• Generalization
• Performance on unseen or unknown scenarios or data
• Which model performs the best?
Types of testing
• Evaluate performance by
testing on data NOT used for
training (both should be
randomly sampled)
• Cross-validation methods
for small data sets
The more (relevant) data the
better.
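Both testing styles above can be sketched with the standard library alone. The helpers `holdout_split` and `k_fold_splits` are hypothetical names written for this illustration (they are not a library API), and the 25% test fraction and 5 folds are arbitrary choices.

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=0):
    """Randomly hold out a fraction of the rows for testing."""
    shuffled = rows[:]                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)   # random sampling, reproducible via seed
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

def k_fold_splits(rows, k=5, seed=0):
    """Yield (train, test) pairs; every row is used for testing exactly once."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]    # k roughly equal folds
    for i in range(k):
        test_rows = folds[i]
        train_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train_rows, test_rows

rows = list(range(20))   # stand-in for 20 data rows
train, test = holdout_split(rows)
print(len(train), len(test))

for fold_train, fold_test in k_fold_splits(rows, k=5):
    print(len(fold_train), len(fold_test))
```

With a small data set, cross-validation lets every row contribute to both training and testing, at the cost of training the model k times.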
Defining the Learning Task
Improve on task, T, with respect to
performance metric, P, based on experience, E.
T: Recognizing hand-written words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words
T: Driving on four-lane highways using vision sensors
P: Average distance traveled before a human-judged error
E: A sequence of images and steering commands recorded while
observing a human driver.
T: Categorize email messages as spam or legitimate.
P: Percentage of email messages correctly classified.
E: Database of emails, some with human-given labels
Suppose that we are done with EDA and the data is
ready for modelling. What is next?
Cancer diagnosis
This is our data (103 rows × 5 columns)
Patient ID # of Tumors Avg Area Avg Density Diagnosis
1 5 20 118 M
2 3 15 130 B
3 7 10 52 B
4 2 30 100 M
... ... ... ... ...
100 3 19 100 M
101 4 16 95 M
102 9 22 125 B
103 1 14 80 M
Recall ML stages
Supervised Learning Classification
Training Set
• Use this training set to learn how to classify patients
whose diagnosis is not known:
Patient ID # of Tumors Avg Area Avg Density Diagnosis
101 4 16 95 ?
102 9 22 125 ?
103 1 14 80 ?
Test Set: the feature columns are the input data; the Diagnosis column ("?") will be predicted by our model
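The training/test separation can be written down directly. This sketch uses only the rows visible in the data table (the rows elided with "..." are not reproduced), and represents the hidden "?" diagnosis with `None`, which is a choice made for this illustration.

```python
# Columns: patient_id, n_tumors, avg_area, avg_density, diagnosis
records = [
    (1, 5, 20, 118, "M"),
    (2, 3, 15, 130, "B"),
    (3, 7, 10, 52, "B"),
    (4, 2, 30, 100, "M"),
    (100, 3, 19, 100, "M"),
    (101, 4, 16, 95, None),   # None stands for "?" (diagnosis to be predicted)
    (102, 9, 22, 125, None),
    (103, 1, 14, 80, None),
]

# Training set: rows with a known diagnosis. Test set: rows the model must predict.
training_set = [r for r in records if r[4] is not None]
test_set = [r for r in records if r[4] is None]

# Separate the inputs (features) from the target (diagnosis).
X_train = [r[1:4] for r in training_set]
y_train = [r[4] for r in training_set]
X_test = [r[1:4] for r in test_set]

print(len(training_set), len(test_set))
print(X_test)
```

The patient ID is deliberately left out of the inputs, since it identifies a row rather than describing the tumor.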
Breast Cancer Diagnosis Linear Separation
Figure: the line produced by our model to separate the two classes
The training data plotted in 2D, where
red represents M cases and blue represents B cases
Predict the test data
The gray circles represent the test set
• The model predicts the test data as follows:
Patient ID # of Tumors Avg Area Avg Density Diagnosis (predicted by the model)
101 4 16 95 M
102 9 22 125 M
103 1 14 80 M
Actual diagnosis (from the data table): 101 M, 102 B, 103 M
• How good is our model?
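One simple answer is accuracy: the fraction of test patients whose predicted diagnosis matches the actual one. The sketch below takes the predictions from the prediction table and the actual values from the original data table (rows 101 to 103); accuracy is only one possible metric, chosen here for illustration.

```python
predicted = {101: "M", 102: "M", 103: "M"}   # predicted by the model
actual = {101: "M", 102: "B", 103: "M"}      # diagnosis in the original data table

# Count the test patients for which prediction and actual diagnosis agree.
correct = sum(predicted[pid] == actual[pid] for pid in predicted)
accuracy = correct / len(predicted)

print(f"{correct} of {len(predicted)} correct, accuracy = {accuracy:.2f}")
# prints "2 of 3 correct, accuracy = 0.67"
```

The one error is patient 102, a benign case predicted as malignant. With only three test rows the estimate is very coarse, which is why more (relevant) test data gives a more trustworthy answer.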
Examples of ML Algorithms