CMPE 442 INTRODUCTION TO MACHINE LEARNING
MACHINE LEARNING
Machine learning is the field of study that gives
computers the ability to learn without being
explicitly programmed.
A computer program is said to learn from
experience E with respect to some task T and
some performance measure P, if its performance
on T, as measured by P, improves with
experience E.
MACHINE LEARNING
Example: Spam filter: given examples of spam e-mails and examples of ham (non-spam) e-mails, it learns to flag spam.
Training set: the examples that the system uses to learn.
T (task): flag spam for new e-mails
E (experience): the training data
P (performance): needs to be defined.
Ex.: the ratio of correctly classified e-mails → accuracy
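The performance measure P can be made concrete in a few lines. A minimal sketch with made-up labels (not data from the course), computing accuracy as the ratio of correctly classified e-mails:

```python
# Accuracy: the fraction of e-mails whose predicted label matches the true one.
# The labels below are made-up illustrative data.
true_labels      = ["spam", "ham", "ham", "spam", "ham"]
predicted_labels = ["spam", "ham", "spam", "spam", "ham"]

correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
accuracy = correct / len(true_labels)
print(accuracy)  # 4 of 5 predictions are correct -> 0.8
```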
EVALUATING PERFORMANCE ON A TASK
Machine learning problems don’t have a
“correct” answer.
Consider sorting problem:
Many sorting algorithms available: bubble sort, quick
sort, insertion sort ...
The performance is measured in terms of how fast
they are and how much data they can handle.
Would we compare the sorting algorithms with
respect to the correctness of the result?
An algorithm that isn’t guaranteed to produce a sorted list every time is useless as a sorting algorithm.
EVALUATING PERFORMANCE ON A TASK
There is no perfect solution in machine learning:
A perfect e-mail spam filter does not exist!
In many cases the data is “noisy”:
Examples are mislabelled
Features contain errors
Performance evaluation of learning algorithms is therefore important in machine learning.
WHY USE MACHINE LEARNING?
WHY USE MACHINE LEARNING?
Let’s write a spam filter using a traditional programming technique:
1) Study spam e-mails and identify patterns and the most frequently occurring words.
2) Write a detection algorithm.
3) Test, and repeat steps 1 and 2 until it is good enough.
WHY USE MACHINE LEARNING?
[Diagram (traditional approach): Study the problem → Write rules → Evaluate → Analyze errors → Launch!]
WHY USE MACHINE LEARNING?
[Diagram (ML approach): Study the problem → Train ML algorithm on data → Evaluate → Analyze errors → Launch!]
WHY USE MACHINE LEARNING?
Consider the example of recognizing handwritten
digits.
Each digit corresponds to a 28x28 pixel image
and so can be represented by a vector x
comprising 784 real numbers.
Goal: build a machine that will take such a vector
x as input and that will produce the identity of
the digit 0, …, 9 as the output.
WHY USE MACHINE LEARNING?
It is better to use a machine learning approach, where a large set of N digits, called the training set, is used to tune the parameters of an adaptive model.
The categories of the digits in the training set are known in advance → target vector t.
The goal is to determine a function y(x) which takes a new digit image x as input and generates an output vector y → learning/training phase.
Once the model is trained we can run it on the test
set.
The ability to categorize correctly new examples that
differ from those used for training is known as
generalization.
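The train-then-generalize workflow described above can be sketched with toy data standing in for the 784-dimensional digit vectors. The nearest-centroid model below is a hypothetical stand-in, not the specific adaptive model the slides refer to:

```python
# Toy version of the train-then-test workflow: x vectors are 2-D here
# instead of 784-D digit images; labels play the role of the target vector t.
def train(training_set):
    """Tune the model's parameters: one mean vector (centroid) per class."""
    sums, counts = {}, {}
    for x, label in training_set:
        s = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in s] for label, s in sums.items()}

def predict(model, x):
    """y(x): return the class whose centroid is closest to x."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(model, key=lambda label: dist2(model[label], x))

training_set = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([1.0, 0.9], 1), ([0.8, 1.1], 1)]
model = train(training_set)
# Generalization: classify a point that was never seen during training.
print(predict(model, [0.9, 1.0]))  # -> 1
```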
WHY USE MACHINE LEARNING?
For problems that are too complex for the traditional approach.
For problems that have no known algorithm.
Ex.: speech recognition
Helps humans learn: applying ML techniques to large amounts of data reveals patterns that were not immediately apparent → data mining.
SOME ML PROBLEMS
Speech Recognition
Document Classification
Face Detection and Recognition
...
TYPES OF MACHINE LEARNING SYSTEMS
Whether or not they are trained with human
supervision: supervised, unsupervised, semi-
supervised, reinforcement learning.
Instance-based versus model-based learning.
SUPERVISED LEARNING
The training data includes the desired solutions,
called labels.
Spam filter → classification
SPAM FILTERING AS A CLASSIFICATION
TASK
MACHINE LEARNING FOR SPAM
FILTERING
SUPERVISED LEARNING
The training data includes the desired solutions,
called labels.
House price prediction → regression
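As a hedged illustration of regression, here is a least-squares straight-line fit on made-up house sizes and prices (the slides do not prescribe a particular method):

```python
# Simple linear regression y = a*x + b fitted by least squares.
# Sizes (m^2) and prices are made-up illustrative numbers.
sizes  = [50.0, 70.0, 90.0, 110.0]
prices = [100.0, 140.0, 180.0, 220.0]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
    / sum((x - mean_x) ** 2 for x in sizes)
b = mean_y - a * mean_x

print(a * 80.0 + b)  # predicted price for an 80 m^2 house -> 160.0
```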
SUPERVISED LEARNING
Some of the most important supervised learning algorithms:
K-Nearest Neighbours
Linear Regression
Naïve Bayes
Logistic Regression
Support Vector Machines
Decision Trees and Random Forests
Neural Networks
UNSUPERVISED LEARNING
The training data is unlabelled.
The system tries to learn without anyone's
guidance.
UNSUPERVISED LEARNING
Some of the most important unsupervised learning algorithms:
Clustering
K-Means
Hierarchical Cluster Analysis
Expectation Maximization
Visualization and Dimensionality Reduction
Principal Component Analysis (PCA)
Locally-Linear Embedding (LLE)
t-Distributed Stochastic Neighbour Embedding (t-SNE)
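A minimal sketch of the K-Means idea on made-up 1-D data; real implementations add smarter initialization and convergence checks:

```python
# Minimal K-Means sketch (k = 2, 1-D points, fixed initial centroids).
# The data and starting centroids are made up for illustration.
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids = [0.0, 10.0]  # deliberately poor starting guesses

for _ in range(10):  # fixed number of assign/update iterations
    clusters = [[], []]
    for p in points:  # assignment step: each point goes to nearest centroid
        clusters[min((0, 1), key=lambda i: abs(p - centroids[i]))].append(p)
    # update step: move each centroid to the mean of its cluster
    centroids = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]

print([round(c, 6) for c in centroids])  # -> [1.0, 8.0]
```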
SUPERVISED/UNSUPERVISED LEARNING
INSTANCE-BASED VS. MODEL-BASED
LEARNING
Most ML problems are about making predictions:
Given training examples, the system needs to be
able to generalize to examples it has never seen
before
The true goal is to perform well on new instances
Two main generalization approaches:
Instance-based: the system learns the examples by heart, then generalizes to new cases using a similarity measure.
Model-based: the system generalizes from a set of examples by building a model of these examples, then uses that model to make predictions.
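Instance-based learning can be sketched as a 1-nearest-neighbour classifier: the "model" is just the stored training examples plus a similarity measure (squared Euclidean distance here; the points are made up):

```python
# Instance-based learning: store the training examples, then classify a new
# case by finding the most similar stored example.
training_set = [([1.0, 1.0], "A"), ([1.2, 0.9], "A"),
                ([5.0, 5.0], "B"), ([4.8, 5.2], "B")]

def classify(x):
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    _, label = min(training_set, key=lambda ex: dist2(ex[0], x))
    return label

print(classify([1.1, 1.0]))  # -> A
```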
INSTANCE-BASED LEARNING
MODEL-BASED LEARNING
REGRESSION PROBLEM
LINEAR REGRESSION
PROJECT PHASES
Study the data
Select a learning algorithm
Train it on the training data
Apply the model to make predictions on new
cases
MAIN CHALLENGES IN MACHINE
LEARNING
Two things that can go wrong:
Bad data
Bad algorithm
BAD DATA
Insufficient quantity of training data
It takes a lot of data for most ML algorithms to work
properly.
Non-representative training data
It is crucial that your training data is representative of the
new cases you want to generalize to.
Poor-quality data
It is better to spend time cleaning up the training data:
decide about outliers and missing features.
Irrelevant features
Feature engineering involves:
Feature selection: selecting the most useful features to train on
among existing features
Feature extraction: combining existing features to produce a
more useful one
Creating new features by gathering new data
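Feature extraction can be as simple as combining two existing features into a more informative one. A hypothetical example (price per square metre derived from price and area):

```python
# Feature extraction: combine existing features to produce a more useful one.
# Hypothetical housing records: total price and area in m^2.
records = [{"price": 200000.0, "area": 100.0},
           {"price": 150000.0, "area": 60.0}]

for r in records:
    r["price_per_m2"] = r["price"] / r["area"]  # derived feature

print([r["price_per_m2"] for r in records])  # -> [2000.0, 2500.0]
```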
BAD ALGORITHM
Overfitting the training data:
Happens when the model performs well on the
training data, but it does not generalize well.
Underfitting the training data:
Happens when the model is too simple to learn the underlying structure of the data.
BAD ALGORITHM: EXAMPLE
Simple regression problem: Suppose we observe a
real-valued input variable x and we wish to use
this observation to predict the value of a real-
valued target variable t.
The data for this example is generated from the function sin(2πx), with random noise included in the target values.
Suppose we are given a training set comprising N observations of x, written x = (x1, ..., xN), together with the corresponding target values t = (t1, ..., tN).
BAD ALGORITHM: EXAMPLE
N = 10: the input data set x is generated by choosing values of x_n, for n = 1, ..., N, spaced uniformly in the range [0, 1].
The target data set t is obtained by computing sin(2πx_n) for the corresponding x values and adding a small level of noise having a Gaussian distribution.
Goal: exploit the training set in order to
make predictions of the value of the target
variable for some new value of the input
variable.
In other words, we are trying to discover the underlying function sin(2πx).
POLYNOMIAL CURVE FITTING
Fit the data using a polynomial function of the form
y(x, w) = w0 + w1 x + w2 x^2 + ... + wM x^M
M: the order of the polynomial
w = (w0, ..., wM): the coefficients
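A sketch of least-squares polynomial curve fitting on data generated as described above (sin(2πx) plus Gaussian noise). The normal-equations solver below is for illustration only; in practice a library routine such as numpy.polyfit would be used:

```python
import math, random

random.seed(0)
N, M = 10, 3  # 10 training points, cubic polynomial

# Generate the training set: x_n uniform in [0, 1], t_n = sin(2*pi*x_n) + noise.
xs = [n / (N - 1) for n in range(N)]
ts = [math.sin(2 * math.pi * x) + random.gauss(0.0, 0.1) for x in xs]

# Least squares via the normal equations (A^T A) w = A^T t,
# where A[n][j] = x_n ** j. Solved with plain Gaussian elimination.
A = [[x ** j for j in range(M + 1)] for x in xs]
ATA = [[sum(A[n][i] * A[n][j] for n in range(N)) for j in range(M + 1)]
       for i in range(M + 1)]
ATt = [sum(A[n][i] * ts[n] for n in range(N)) for i in range(M + 1)]

for col in range(M + 1):  # forward elimination with partial pivoting
    piv = max(range(col, M + 1), key=lambda r: abs(ATA[r][col]))
    ATA[col], ATA[piv] = ATA[piv], ATA[col]
    ATt[col], ATt[piv] = ATt[piv], ATt[col]
    for row in range(col + 1, M + 1):
        f = ATA[row][col] / ATA[col][col]
        ATA[row] = [a - f * b for a, b in zip(ATA[row], ATA[col])]
        ATt[row] -= f * ATt[col]

w = [0.0] * (M + 1)
for row in range(M, -1, -1):  # back substitution
    w[row] = (ATt[row] - sum(ATA[row][j] * w[j]
                             for j in range(row + 1, M + 1))) / ATA[row][row]

def y(x):  # the fitted polynomial y(x, w)
    return sum(w_j * x ** j for j, w_j in enumerate(w))

print(y(0.25))  # close to sin(2*pi*0.25) = 1
```

Increasing M drives the training error down but, past some point, makes the fit follow the noise: the overfitting behaviour discussed earlier.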
CURVE FITTING
TESTING AND VALIDATING
Once you have a trained model, evaluate it and
fine-tune it.
Split your data into two sets: the training set and the test set.
Generalization error: the error rate on new cases, estimated by evaluating the model on the test set.
If the training error is low (makes few mistakes
on training set) but the generalization error is
high, then the model is overfitting the training
set.
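The overfitting criterion above (low training error, high generalization error) can be demonstrated with a model that memorizes its training set; the data below is made up:

```python
import random

random.seed(1)
# Made-up regression data: t = x plus Gaussian noise.
xs = [random.random() for _ in range(40)]
data = [(x, x + random.gauss(0.0, 0.05)) for x in xs]

train_set, test_set = data[:30], data[30:]  # hold out 10 cases for testing

def predict(x):
    """A model that memorizes the training set: return the target of the
    nearest training x. Its training error is zero by construction."""
    return min(train_set, key=lambda ex: abs(ex[0] - x))[1]

def mse(dataset):  # mean squared error of the model on a data set
    return sum((predict(x) - t) ** 2 for x, t in dataset) / len(dataset)

print(mse(train_set))                  # 0.0: every training case is recalled
print(mse(test_set) > mse(train_set))  # True: the generalization error is higher
```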
HOW DOES ML HELP TO SOLVE A TASK?