Artificial Intelligence
Machine Learning
Dr. Ahmed Sultan Al-Hegami
Professor of Artificial Intelligence
AI Prof. Ahmed Sultan Al-Hegami 1
What is learning?
“Learning denotes changes in a
system that ... enable a system to
do the same task more efficiently
the next time.”
“Learning is constructing or
modifying representations of what
is being experienced.”
“Learning is making useful changes
in our minds.”
What Is Machine Learning?
“Logic is not the end of wisdom, it is just the beginning” --- Spock
[Figure: two snapshots of a system in its environment. At first, the system's knowledge produces Action 1; after learning has changed the knowledge, the same system produces Action 2.]
Machine Learning: A
Definition
Definition: A computer program is
said to learn from experience E with
respect to some class of tasks T and
performance measure P, if its
performance at tasks in T, as
measured by P, improves with
experience E.
Applications of Machine
Learning
Learning to recognize spoken words
(Lee, 1989; Waibel, 1989).
Learning to drive an autonomous
vehicle (Pomerleau, 1989).
Learning to classify new astronomical
structures (Fayyad et al., 1995).
Learning to play world-class
backgammon (Tesauro 1992, 1995).
Why learn?
Understand and improve efficiency of human learning
Improve methods for teaching and tutoring people (better
CAI)
Discover new things or structure that were previously
unknown to humans
Examples: data mining, scientific discovery
Fill in skeletal or incomplete specifications about a
domain
Large, complex AI systems cannot be completely derived by
hand and require dynamic updating to incorporate new
information.
Learning new characteristics expands the domain or expertise
and lessens the “brittleness” of the system
Build software agents that can adapt to their users or
to other software agents
Reproduce an important aspect of intelligent behavior
Learning and Machine
learning
Like human learning from past experiences.
A computer does not have “experiences”.
A computer system learns from data, which
represent some “past experiences” of an
application domain.
Our focus: learn a target function that can be
used to predict the values of a discrete class
attribute, e.g., approved or not-approved, and
high-risk or low-risk.
This task is commonly called supervised
learning, classification, or inductive learning.
The data and the goal
Data: A set of data records (also called
examples, instances or cases) described
by
k attributes: A1, A2, …, Ak.
a class: Each example is labelled with a
pre-defined class.
Goal: To learn a classification model from
the data that can be used to predict the
classes of new (future, or test)
cases/instances.
An example: data (loan
application)
Approved or not
An example: the learning
task
Learn a classification model from the data
Use the model to classify future loan
applications into
Yes (approved) and
No (not approved)
What is the class for the following
case/instance?
Supervised vs. unsupervised
Learning
Supervised learning: classification is seen as
supervised learning from examples.
Supervision: The data (observations,
measurements, etc.) are labeled with pre-
defined classes. It is as if a “teacher” gave
the classes (supervision).
Test data are classified into these classes too.
Unsupervised learning (clustering)
Class labels of the data are unknown
Given a set of data, the task is to establish the
existence of classes or clusters in the data
Supervised learning process: two
steps
Learning (training): Learn a model using the
training data
Testing: Test the model using unseen test
data to assess the model accuracy
Accuracy = (Number of correct classifications) / (Total number of test cases)
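The accuracy measure above can be sketched in a few lines; the label lists here are hypothetical examples, not data from the lecture.

```python
def accuracy(predicted, actual):
    """Fraction of test cases whose predicted class matches the true class."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

predicted = ["yes", "no", "yes", "yes", "no"]
actual    = ["yes", "no", "no",  "yes", "no"]
print(accuracy(predicted, actual))  # 4 of 5 correct -> 0.8
```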
What do we mean by
learning?
Given
a data set D,
a task T, and
a performance measure M,
a computer system is said to learn from D
to perform the task T if after learning the
system’s performance on T improves as
measured by M.
In other words, the learned model helps
the system to perform T better as
compared to no learning.
An example
Data: Loan application data
Task: Predict whether a loan should be
approved or not.
Performance measure: accuracy.
No learning: classify all future applications
(test data) to the majority class (i.e., Yes):
Accuracy = 9/15 = 60%.
We can do better than 60% with learning.
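The "no learning" baseline from the slide, always predicting the majority class, can be computed directly; with 9 Yes and 6 No this gives 9/15 = 60%.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return the most common class label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]

labels = ["Yes"] * 9 + ["No"] * 6        # the loan data's class counts
majority = majority_baseline(labels)
baseline_accuracy = labels.count(majority) / len(labels)
print(majority, baseline_accuracy)       # Yes 0.6
```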
Fundamental assumption of
learning
Assumption: The distribution of training examples is
identical to the distribution of test examples
(including future unseen examples).
In practice, this assumption is often violated to a
certain degree.
Strong violations will clearly result in poor
classification accuracy.
To achieve good accuracy on the test data,
training examples must be sufficiently
representative of the test data.
Classification (Supervised Learning)
Learn a method for predicting the instance class
from pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
Classification: Linear Regression
Decision boundary: w0 + w1 x + w2 y >= 0
Regression computes the weights wi from the
data so as to minimize the squared error,
i.e., to ‘fit’ the data.
A single linear boundary is often not flexible enough.
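A minimal sketch of this idea: fit w0, w1, w2 by least squares against +/-1 class labels, then classify a point by the sign of w0 + w1·x + w2·y. The toy points below are mine, purely illustrative.

```python
import numpy as np

points = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5]])
labels = np.array([-1.0, -1.0, 1.0, 1.0])            # two classes coded -1 / +1

X = np.column_stack([np.ones(len(points)), points])  # prepend a column for w0
w, *_ = np.linalg.lstsq(X, labels, rcond=None)       # minimizes squared error

def classify(x, y):
    """Classify by the sign of w0 + w1*x + w2*y."""
    return 1 if w[0] + w[1] * x + w[2] * y >= 0 else -1

print(classify(4.0, 4.0), classify(1.0, 1.0))
```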
Classification: Decision
Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
[Figure: the X-Y plane split into axis-parallel regions at X = 2, X = 5, and Y = 3.]
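The rules above translate directly into code:

```python
def region_color(x, y):
    """The axis-parallel decision rules from the figure."""
    if x > 5:
        return "blue"
    elif y > 3:
        return "blue"
    elif x > 2:
        return "green"
    else:
        return "blue"

print(region_color(6, 0), region_color(3, 1), region_color(1, 1))
```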
Decision Trees
-a way of representing a series of rules
that lead to a class or value;
-basic components of a decision tree:
decision node, branches and leaves;
[Example tree: the root tests Income > 40,000.
If No, test Job tenure > 5: Yes -> Low Risk, No -> High Risk.
If Yes, test High Debt: Yes -> High Risk, No -> Low Risk.]
Decision Trees (cont.)
- handle non-numeric data very well;
- work best when the predictor
variables are categorical;
Example Decision Tree
Attribute types: Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

[Decision tree: the root splits on Refund (the splitting attribute).
Refund = Yes -> NO.
Refund = No -> split on MarSt: Married -> NO; Single or Divorced -> split on TaxInc: < 80K -> NO, > 80K -> YES.]

The splitting attribute at a node is
determined based on the Gini index.
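The Gini index mentioned on the slide measures node impurity: Gini(D) = 1 − Σ_j Pr(c_j)², zero for a pure node and maximal for an even class mix. A minimal helper (my own sketch, not the lecture's code):

```python
from collections import Counter

def gini(labels):
    """Gini index of a set of class labels: 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["No"] * 7))                 # pure node -> 0.0
print(gini(["Yes"] * 5 + ["No"] * 5))   # 50/50 split -> 0.5
```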
Classification: Neural Networks
- efficiently model large and complex
problems;
- may be used in classification problems or
for regressions;
- consist of an input layer => hidden layer => output layer;
[Figure: a small network with input units 1-2, hidden units 3-5, and a single output unit 6.]
Neural Networks (cont.)
- can be easily implemented to run on
massively parallel computers;
- cannot be easily interpreted;
- require an extensive amount of training
time;
- require a lot of data preparation (very
careful data cleansing, selection,
preparation, and pre-processing);
- require a sufficiently large data set and a high
signal-to-noise ratio.
Classification and
Prediction
An example application
An emergency room in a hospital measures
17 variables (e.g., blood pressure, age, etc)
of newly admitted patients. A decision has
to be taken whether to put the patient in an
intensive-care unit. Due to the high cost of
ICU, those patients who may survive less
than a month are given higher priority. The
problem is to predict high-risk patients and
discriminate them from low-risk patients.
Another application
A credit card company receives thousands of
applications for new cards. Each application
contains information about an applicant,
age
marital status
annual salary
outstanding debts
credit rating
etc.
Problem: to decide whether an application should
be approved, i.e., to classify applications into two
categories, approved and not approved.
Classification
Data: The data set has k attributes A1, …, Ak.
Each tuple (case or example) is described by
values of the attributes and a class
label.
Goal: To learn rules or to build a model
that can be used to predict the classes
of new (or future or test) cases.
The data used for building the model is
called the training data.
An example data set
Outlook Temp Humidity Windy Class
Sunny Hot High No Yes
Sunny Hot High Yes Yes
O’cast Hot High No No
Rain Mild Normal No No
Rain Cool Normal No No
Rain Cool Normal Yes Yes
O’cast Cool Normal Yes No
Sunny Mild High No Yes
Sunny Cool Normal No No
Rain Mild Normal No No
Sunny Mild Normal Yes No
O’cast Mild High Yes No
O’cast Hot Normal No No
Rain Mild High Yes Yes
Classification
—A Two-Step Process
Model construction: describing a set of
predetermined classes based on a training set. It is
also called learning.
Each tuple/sample is assumed to belong to a predefined
class
The model is represented as classification rules, decision
trees, or mathematical formulae
Model usage: for classifying future test data/objects
Estimate accuracy of the model
The known label of a test example is compared with the
classified result from the model
Accuracy rate is the % of test cases that are correctly
classified by the model
If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known.
Classification Process
(1): Model Construction
The training data are fed to a classification algorithm, which outputs a classifier (the model).

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
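The learned rule on this slide can be applied as a classifier; a small sketch (the case-insensitive rank comparison is my assumption, since the slide writes 'professor' while the table writes 'Professor'):

```python
def tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# The unseen case from the next slide: (Jeff, Professor, 4)
print(tenured("Professor", 4))
print(tenured("Assistant Prof", 2))
```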
Classification Process (2):
Use the Model in
Prediction
The classifier is first run on testing data to estimate its accuracy, and then applied to unseen data.

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen case: (Jeff, Professor, 4) -> Tenured?
Supervised vs.
Unsupervised Learning
Supervised learning: classification is seen
as supervised learning from examples.
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the classes of the observations/cases.
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data are unknown
Given a set of measurements, observations,
etc. with the aim of establishing the existence
of classes or clusters in the data
Evaluating Classification
Methods
Predictive accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability:
how understandable the model is, and the insight it provides
Compactness of the model: size of the tree, or
the number of rules.
Different classification
techniques
There are many techniques for
classification
Decision trees
Naïve Bayesian classifiers
Neural networks
Genetic Algorithms
Logistic regression
and many more ...
Decision Tree
Decision tree learning is one of the
most widely used techniques for
classification.
Its classification accuracy is competitive
with other methods, and
it is very efficient.
The classification model is a tree, called a
decision tree.
C4.5 by Ross Quinlan is perhaps the
best known system. It can be
downloaded from the Web.
The loan data (reproduced)
Approved or not
A decision tree from the loan
data
Decision nodes and leaf nodes (classes)
Use the decision tree
[Applying the tree to the unseen case yields class: No]
Is the decision tree
unique?
No. Here is a simpler tree.
We want a tree that is both small and accurate:
it is easier to understand and often performs better.
Finding the best tree is very hard.
All current tree-building algorithms are heuristic
algorithms.
From a decision tree to a set of
rules
A decision tree can
be converted to a
set of rules
Each path from the
root to a leaf is a
rule.
Algorithm for decision tree
learning
Basic algorithm (a greedy divide-and-conquer algorithm)
Assume attributes are categorical for now (continuous
attributes can be handled too)
Tree is constructed in a top-down recursive manner
At start, all the training examples are at the root
Examples are partitioned recursively based on
selected attributes
Attributes are selected on the basis of an impurity
function (e.g., information gain)
Conditions for stopping partitioning
All examples for a given node belong to the same class
There are no remaining attributes for further partitioning –
majority class is the leaf
There are no examples left
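The greedy top-down procedure above can be sketched in a few lines of Python, with entropy-based information gain as the impurity function. This is an illustrative implementation under my own conventions (rows as dicts, a tiny made-up dataset), not the lecture's code, and it omits the empty-subset stopping case.

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    """Entropy of the class distribution of the target attribute."""
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def build_tree(rows, attributes, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1:                       # all examples in one class
        return classes.pop()
    if not attributes:                          # no attributes left: majority leaf
        return Counter(r[target] for r in rows).most_common(1)[0][0]

    def remainder(a):                           # expected entropy after splitting on a
        total = 0.0
        for v in {r[a] for r in rows}:
            subset = [r for r in rows if r[a] == v]
            total += len(subset) / len(rows) * entropy(subset, target)
        return total

    best = min(attributes, key=remainder)       # min remainder = max information gain
    branches = {}
    for v in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == v]
        branches[v] = build_tree(subset, [a for a in attributes if a != best], target)
    return (best, branches)

rows = [
    {"Outlook": "Sunny",    "Wind": "Strong", "Play": "No"},
    {"Outlook": "Sunny",    "Wind": "Light",  "Play": "No"},
    {"Outlook": "Overcast", "Wind": "Light",  "Play": "Yes"},
]
print(build_tree(rows, ["Outlook", "Wind"], "Play"))
```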
Choose an attribute to partition data
The key to building a decision tree is which
attribute to choose in order to branch.
The objective is to reduce impurity or
uncertainty in data as much as possible.
A subset of data is pure if all instances
belong to the same class.
The heuristic in C4.5 is to choose the
attribute with the maximum Information
Gain or Gain Ratio based on information
theory.
The loan data (reproduced)
Approved or not
Two possible roots, which is
better?
Fig. (B) seems to be better.
Building a decision tree:
an example training
dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for
“buys_computer”
age?
  <=30:   student?
            no  -> no
            yes -> yes
  31..40: yes
  >40:    credit rating?
            excellent -> no
            fair      -> yes
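The tree above, transcribed as nested conditionals (the exact attribute-value strings such as "31..40" are my encoding of the slide's labels):

```python
def buys_computer(age, student, credit_rating):
    """The buys_computer decision tree from the slide."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    elif age == "31..40":
        return "yes"
    else:  # age > 40
        return "no" if credit_rating == "excellent" else "yes"

print(buys_computer("<=30", "yes", "fair"))
print(buys_computer(">40", "no", "excellent"))
```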
Inducing a decision tree
There are many possible trees;
let us try it on a credit data set.
How do we find the most compact tree
that is consistent with the data?
An Illustrative Example
Training Examples for Concept PlayTennis
Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
1    Sunny     Hot          High      Light   No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Light   Yes
4    Rain      Mild         High      Light   Yes
5    Rain      Cool         Normal    Light   Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Light   No
9    Sunny     Cool         Normal    Light   Yes
10   Rain      Mild         Normal    Light   Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Light   Yes
14   Rain      Mild         High      Strong  No
Constructing a Decision Tree for
PlayTennis
Day Outlook Temperature Humidity Wind Play Tennis?    [9+, 5-]
1 Sunny Hot High Light No
2 Sunny Hot High Strong No
3 Overcast Hot High Light Yes
4 Rain Mild High Light Yes
5 Rain Cool Normal Light Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Light No
9 Sunny Cool Normal Light Yes
10 Rain Mild Normal Light Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Light Yes
14 Rain Mild High Strong No
Constructing a Decision Tree for
PlayTennis
Potential Splits of Root Node [9+, 5-]
Outlook:     Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]
Temperature: Cool [3+, 1-], Mild [4+, 2-], Hot [2+, 2-]
Humidity:    High [3+, 4-], Normal [6+, 1-]
Wind:        Light [6+, 2-], Strong [3+, 3-]
Constructing a Decision Tree for
PlayTennis
[Tree: root Outlook? over examples 1-14 [9+, 5-].
Sunny (1,2,8,9,11 [2+, 3-]) -> Humidity?: High (1,2,8 [0+, 3-]) -> No; Normal (9,11 [2+, 0-]) -> Yes.
Overcast (3,7,12,13 [4+, 0-]) -> Yes.
Rain (4,5,6,10,14 [3+, 2-]) -> Wind?: Strong (6,14 [0+, 2-]) -> No; Light (4,5,10 [3+, 0-]) -> Yes.]
Building a compact tree
The key to building a decision tree is
which attribute to choose in order to
branch.
The heuristic is to choose the attribute
with the maximum Information Gain
based on information theory.
Another explanation is to reduce
uncertainty as much as possible.
Information theory
Information theory provides a mathematical
basis for measuring the information content.
To understand the notion of information, think
about it as providing the answer to a question,
for example, whether a coin will come up heads.
If one already has a good guess about the
answer, then the actual answer is less
informative.
If one already knows that the coin is rigged so
that it will come up heads with probability
0.99, then a message (advance information)
about the actual outcome of a flip is worth
less than it would be for an honest coin (50-
50).
Information theory (cont…)
For a fair (honest) coin, you have no
information, and you are willing to pay
more (say, in dollars) for advance
information: the less you know, the more
valuable the information.
Information theory uses this same
intuition, but instead of measuring the
value of information in dollars, it
measures information content in bits.
One bit of information is enough to
answer a yes/no question about which
one has no idea, such as the flip of a
fair coin
Back to decision tree
learning
For a given example, what is the
correct classification?
We may think of a decision tree as
conveying information about the
classification of examples in the
table (of examples);
The entropy measure characterizes
the (im)purity of an arbitrary
collection of examples.
Information theory: Entropy measure
The entropy formula:

    entropy(D) = - Σ_{j=1}^{|C|} Pr(c_j) · log2 Pr(c_j),    with Σ_{j=1}^{|C|} Pr(c_j) = 1

Pr(c_j) is the probability of class c_j in data set D.
We use entropy as a measure of the impurity or
disorder of data set D (or, a measure of the
information in a tree).
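The formula can be sketched directly, assuming the class distribution is given as probabilities Pr(c_j) that sum to 1:

```python
from math import log2

def entropy(probabilities):
    """entropy(D) = -sum_j Pr(c_j) * log2 Pr(c_j); 0 * log2(0) is taken as 0."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # an even two-class split is maximally impure: 1.0
print(entropy([0.9, 0.1]))   # a purer set has lower entropy
```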
Entropy measure: let us get a
feeling
As the data become purer and purer, the entropy value
becomes smaller and smaller. This is useful to us!
Information gain
Given a set of examples D, we first
compute its entropy, entropy(D).
If we make attribute Ai, with v values, the root
of the current tree, this will partition D into v
subsets D1, D2, …, Dv. The expected entropy if
Ai is used as the current root is:

    entropy_Ai(D) = Σ_{j=1}^{v} (|Dj| / |D|) · entropy(Dj)
Information gain (cont …)
The information gained by selecting attribute
Ai to branch on (to partition the data) is:

    gain(D, Ai) = entropy(D) - entropy_Ai(D)

We choose the attribute with the highest
gain to branch/split the current tree.
Note: log2(x) = log10(x) / log10(2).
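The gain computation can be checked against the numbers on the following example slide, assuming the loan data's class counts given there: 9 approved and 6 not approved overall, Own_house splitting the data 6/9, and Age splitting it 5/5/5.

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as raw counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

e_D = entropy([6, 9])                                              # 0.971
e_own_house = 6/15 * entropy([6, 0]) + 9/15 * entropy([3, 6])      # 0.551
e_age = (5/15 * entropy([2, 3]) + 5/15 * entropy([3, 2])
         + 5/15 * entropy([4, 1]))                                 # 0.888
gain_own_house = e_D - e_own_house
gain_age = e_D - e_age

print(round(e_D, 3), round(e_own_house, 3), round(e_age, 3))
```
Own_house has the higher gain, so it is the better choice for the root, matching the slide's conclusion.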
An example
entropy(D) = -(6/15)·log2(6/15) - (9/15)·log2(9/15) = 0.971

entropy_Own_house(D) = (6/15)·entropy(D1) + (9/15)·entropy(D2)
                     = (6/15)·0 + (9/15)·0.918
                     = 0.551

entropy_Age(D) = (5/15)·entropy(D1) + (5/15)·entropy(D2) + (5/15)·entropy(D3)
               = (5/15)·0.971 + (5/15)·0.971 + (5/15)·0.722
               = 0.888

Age     Yes  No  entropy(Di)
young   2    3   0.971
middle  3    2   0.971
old     4    1   0.722

Own_house is the best choice for the root.
We build the final tree
We can use information gain ratio to evaluate the
impurity as well (see the handout)
Handling continuous
attributes
Handle a continuous attribute by splitting its range
into two intervals (can be more) at each node.
How do we find the best threshold for the split?
Use information gain or gain ratio again.
Sort all the values of the continuous
attribute in increasing order: {v1, v2, …, vr}.
One possible threshold lies between each pair of
adjacent values vi and vi+1. Try all possible
thresholds and find the one that maximizes
the gain (or gain ratio).
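A sketch of this threshold search: sort the values, consider the midpoint between each adjacent pair, and keep the split with the highest information gain. The toy values and labels below are illustrative, not from the lecture.

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Return the midpoint threshold that maximizes information gain."""
    pairs = sorted(zip(values, labels))
    vs = [v for v, _ in pairs]
    ls = [l for _, l in pairs]
    base = entropy(ls)
    best_gain, best_t = -1.0, None
    for i in range(len(vs) - 1):
        if vs[i] == vs[i + 1]:
            continue                              # no threshold between equal values
        t = (vs[i] + vs[i + 1]) / 2               # candidate threshold (midpoint)
        left, right = ls[:i + 1], ls[i + 1:]
        gain = base - (len(left) / len(ls) * entropy(left)
                       + len(right) / len(ls) * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t

print(best_threshold([1, 2, 3, 10, 11, 12], ["no", "no", "no", "yes", "yes", "yes"]))
```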
Evaluating classification methods
Predictive accuracy
Efficiency
time to construct the model
time to use the model
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability:
how understandable the model is, and the insight it provides
Compactness of the model: size of the tree, or the number
of rules.
Classification Accuracy or
Error Rates
Partition: training-and-testing
use two independent data sets, e.g., training
set (2/3), test set (1/3)
used for data sets with a large number of examples
Cross-validation
divide the data set into k subsamples
use k-1 subsamples as training data and one
subsample as test data—k-fold cross-validation
for data sets of moderate size
leave-one-out: for small data sets
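The k-fold scheme above can be sketched as an index generator: split the n examples into k subsamples, then train on k-1 of them and test on the remaining one in each round (leave-one-out is the special case k = n). This is my own illustrative helper.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_examples))
    fold_size = n_examples // k
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n_examples  # last fold takes the remainder
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test

splits = list(k_fold_splits(9, 3))
print(splits[0])   # first fold: test on examples 0-2, train on the rest
```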
Avoid Overfitting in Classification
Overfitting: A tree may overfit the training
data
Good accuracy on training data but poor on test
examples
Too many branches, some may reflect anomalies
due to noise or outliers
Two approaches to avoid overfitting
Prepruning: Halt tree construction early
Difficult to decide
Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees.
This method is commonly used (based on validation set or
statistical estimate or MDL)
Enhancements to basic
decision tree induction
Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes
that partition the continuous attribute value into a
discrete set of intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign probability to each of the possible values
Attribute construction
Create new attributes based on existing ones that
are sparsely represented.
This reduces fragmentation, repetition, and
replication
Summary
Applications of supervised learning are in
almost any field or domain.
We studied many classification techniques.
There are still many other methods, e.g.,
Bayesian networks
Neural networks
Genetic algorithms
Fuzzy classification
This large number of methods also shows the
importance of classification and its wide
applicability.
It remains an active research area.
Discussion