
Artificial Intelligence

Machine Learning

Prof. Dr. Ahmed Sultan Al-Hegami
Professor of Artificial Intelligence


What is learning?
 “Learning denotes changes in a system that ... enable a system to do the same task more efficiently the next time.”
 “Learning is constructing or modifying representations of what is being experienced.”
 “Learning is making useful changes in our minds.”
What Is Machine Learning?
“Logic is not the end of wisdom, it is just the beginning” --- Spock

[Figure: two snapshots of the same system in the same environment. Before learning it performs Action1; after its knowledge is changed by learning, it performs Action2.]
Machine Learning: A
Definition

Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.


Applications of Machine
Learning
 Learning to recognize spoken words
(Lee, 1989; Waibel, 1989).
 Learning to drive an autonomous
vehicle (Pomerleau, 1989).
 Learning to classify new astronomical
structures (Fayyad et al., 1995).
 Learning to play world-class
backgammon (Tesauro 1992, 1995).



Why learn?
 Understand and improve efficiency of human learning
 Improve methods for teaching and tutoring people (better
CAI)
 Discover new things or structure that were previously
unknown to humans
 Examples: data mining, scientific discovery
 Fill in skeletal or incomplete specifications about a
domain
 Large, complex AI systems cannot be completely derived by
hand and require dynamic updating to incorporate new
information.
 Learning new characteristics expands the domain of expertise
and lessens the “brittleness” of the system
 Build software agents that can adapt to their users or
to other software agents
 Reproduce an important aspect of intelligent behavior
Learning and Machine
learning

 Like human learning from past experiences.


 A computer does not have “experiences”.
 A computer system learns from data, which
represent some “past experiences” of an
application domain.
 Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not-approved, and high-risk or low-risk.
 The task is commonly called: supervised learning, classification, or inductive learning.
The data and the goal
 Data: A set of data records (also called
examples, instances or cases) described
by
 k attributes: A1, A2, …, Ak.
 a class: Each example is labelled with a pre-defined class.
 Goal: To learn a classification model from
the data that can be used to predict the
classes of new (future, or test)
cases/instances.



An example: data (loan
application)
[Table: loan application records; the class attribute is Approved or not.]


An example: the learning
task
 Learn a classification model from the data
 Use the model to classify future loan
applications into
 Yes (approved) and
 No (not approved)
 What is the class for the following case/instance?


Supervised vs. unsupervised
Learning

 Supervised learning: classification is seen as supervised learning from examples.
 Supervision: The data (observations, measurements, etc.) are labeled with pre-defined classes, as if a “teacher” had given the classes (supervision).
 Test data are classified into these classes too.

 Unsupervised learning (clustering)
 Class labels of the data are unknown
 Given a set of data, the task is to establish the existence of classes or clusters in the data


Supervised learning process: two
steps
 Learning (training): Learn a model using the training data
 Testing: Test the model using unseen test data to assess the model accuracy

    Accuracy = (Number of correct classifications) / (Total number of test cases)
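A minimal sketch of these two steps in Python, assuming scikit-learn is available; the tiny loan-style data set is invented purely so the example runs:

```python
# Step 1 (learning/training) on a training split, step 2 (testing) on
# held-out data; accuracy = correct classifications / total test cases.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[25, 0], [40, 1], [35, 0], [50, 1], [23, 0], [45, 1]]  # [age, owns_house]
y = ["No", "Yes", "No", "Yes", "No", "Yes"]                 # class labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)      # learning

print(accuracy_score(y_test, model.predict(X_test)))        # testing
```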


What do we mean by
learning?

 Given
 a data set D,

 a task T, and

 a performance measure M,

a computer system is said to learn from D


to perform the task T if after learning the
system’s performance on T improves as
measured by M.
 In other words, the learned model helps the system to perform T better as compared to no learning.
An example

 Data: Loan application data


 Task: Predict whether a loan should be
approved or not.
 Performance measure: accuracy.
 No learning: classify all future applications (test data) to the majority class (i.e., Yes): Accuracy = 9/15 = 60%.
 We can do better than 60% with learning (see the baseline sketch below).
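To make the 60% concrete, here is a minimal sketch of the majority-class ("no learning") baseline; the labels (9 Yes, 6 No) mirror the slide's counts, and everything else is illustrative:

```python
# "No learning": always predict the majority class of the data.
from collections import Counter

labels = ["Yes"] * 9 + ["No"] * 6
majority = Counter(labels).most_common(1)[0][0]     # -> "Yes"

predictions = [majority] * len(labels)
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(majority, accuracy)                           # Yes 0.6
```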
Fundamental assumption of
learning

Assumption: The distribution of training examples is


identical to the distribution of test examples
(including future unseen examples).

 In practice, this assumption is often violated to a certain degree.
 Strong violations will clearly result in poor
classification accuracy.
 To achieve good accuracy on the test data,
training examples must be sufficiently
representative of the test data.



Classification (Supervised Learning
)

Learn a method for predicting the instance class from pre-labeled (classified) instances.

Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...


Classification: Linear Regression

 Linear Regression: classify according to the sign of w0 + w1·x + w2·y >= 0
 Regression computes the wi from the data to minimize squared error, to ‘fit’ the data
 Not flexible enough
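A least-squares sketch of this idea; the four 2-D points and their ±1 labels are made up for illustration, and the decision rule is the sign of w0 + w1·x + w2·y from the slide:

```python
# Fit w = (w0, w1, w2) by minimizing squared error, classify by sign.
import numpy as np

pts = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 1.0], [5.0, 4.0]])
labels = np.array([-1, -1, 1, 1])               # two classes encoded as +/-1

A = np.c_[np.ones(len(pts)), pts]               # columns: 1, x, y
w, *_ = np.linalg.lstsq(A, labels, rcond=None)  # least-squares fit

print(np.sign(A @ w))                           # side of the line per point
```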
Classification: Decision
Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue

[Figure: the corresponding axis-parallel decision regions in the X-Y plane, with splits at X = 2, X = 5, and Y = 3.]
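The slide's rule list transcribed directly into a small function; a point is classified by testing the conditions from top to bottom:

```python
def classify(x: float, y: float) -> str:
    if x > 5:
        return "blue"
    if y > 3:
        return "blue"
    if x > 2:
        return "green"
    return "blue"

print(classify(6, 1), classify(3, 1), classify(1, 1))  # blue green blue
```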


Decision Trees
- a way of representing a series of rules that lead to a class or value;
- basic components of a decision tree: decision nodes, branches and leaves;

Income > 40,000?
  No  -> Job > 5?
           Yes -> Low Risk
           No  -> High Risk
  Yes -> High Debt?
           Yes -> High Risk
           No  -> Low Risk
Decision Trees (cont.)

- handle non-numeric data very well;
- work best when the predictor variables are categorical;


Example Decision Tree
Attribute types: Refund and Marital Status are categorical, Taxable Income is continuous, and Cheat is the class.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Splitting attributes:

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

The splitting attribute at a node is determined based on the Gini index.


Classification: Neural Networks

- efficiently model large and complex problems;
- may be used in classification problems or for regression;
- structure: input layer => hidden layer => output layer

[Figure: a feed-forward network with input units 1 and 2, hidden units 3, 4 and 5, and output unit 6.]
Neural Networks (cont.)

- can be easily implemented to run on massively parallel computers;
- cannot be easily interpreted;
- require an extensive amount of training time;
- require a lot of data preparation (very careful data cleansing, selection, preparation, and pre-processing);
- require a sufficiently large data set and a high signal-to-noise ratio.
Classification and
Prediction



An example application
 An emergency room in a hospital measures
17 variables (e.g., blood pressure, age, etc)
of newly admitted patients. A decision has
to be taken whether to put the patient in an
intensive-care unit. Due to the high cost of
ICU, those patients who may survive less
than a month are given higher priority. The
problem is to predict high-risk patients and
discriminate them from low-risk patients.



Another application
 A credit card company receives thousands of
applications for new cards. Each application
contains information about an applicant,
 age

marital status

 annual salary

 outstanding debts

 credit rating

 etc.

 Problem: to decide whether an application should be approved, or to classify applications into two categories, approved and not approved.
Classification
 Data: each tuple (case or example) is described by the values of k attributes, A1, …, Ak, and a class label.
 Goal: To learn rules or to build a model
that can be used to predict the classes
of new (or future or test) cases.
 The data used for building the model is
called the training data.
An example data
Outlook Temp Humidity Windy Class
Sunny Hot High No Yes
Sunny Hot High Yes Yes
O’cast Hot High No No
Rain Mild Normal No No
Rain Cool Normal No No
Rain Cool Normal Yes Yes
O’cast Cool Normal Yes No
Sunny Mild High No Yes
Sunny Cool Normal No No
Rain Mild Normal No No
Sunny Mild Normal Yes No
O’cast Mild High Yes No
O’cast Hot Normal No No
Rain Mild High Yes Yes



Classification
—A Two-Step Process
 Model construction: describing a set of
predetermined classes based on a training set. It is
also called learning.
 Each tuple/sample is assumed to belong to a predefined
class
 The model is represented as classification rules, decision
trees, or mathematical formulae
 Model usage: for classifying future test data/objects
 Estimate the accuracy of the model
 The known label of each test example is compared with the classified result from the model
 Accuracy rate is the % of test cases that are correctly classified by the model
 If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known.
Classification Process
(1): Model Construction

The training data are fed to a classification algorithm, which outputs a classifier (model):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned classifier (model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Classification Process (2):
Use the Model in Prediction

The classifier is applied first to testing data, then to unseen data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
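A sketch of applying the learned rule to the unseen case (Jeff, Professor, 4); the function name and signature are assumptions for illustration:

```python
def tenured(rank: str, years: int) -> str:
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "professor" or years > 6 else "no"

print(tenured("professor", 4))  # -> "yes": the model predicts Jeff is tenured
```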
Supervised vs.
Unsupervised Learning
 Supervised learning: classification is seen
as supervised learning from examples.
 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the classes of the observations/cases.
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations,
etc. with the aim of establishing the existence
of classes or clusters in the data
Evaluating Classification
Methods
 Predictive accuracy
 Speed and scalability

time to construct the model

time to use the model
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability:
 the understandability of, and the insight provided by, the model
 Compactness of the model: size of the tree, or the number of rules.


Different classification
techniques
 There are many techniques for
classification
 Decision trees
 Naïve Bayesian classifiers
 Neural networks
 Genetic Algorithms
 Logistic regression
 and many more ...
Decision Tree
 Decision tree learning is one of the
most widely used techniques for
classification.
 Its classification accuracy is competitive
with other methods, and
 it is very efficient.
 The classification model is a tree, called a decision tree.
 C4.5 by Ross Quinlan is perhaps the best-known system. It can be downloaded from the Web.
The loan data (reproduced)
[Table: the loan application data again; the class attribute is Approved or not.]


A decision tree from the loan
data
 Decision nodes and leaf nodes (classes)



Use the decision tree

[Figure: tracing a test case down the tree; the predicted class is No.]


Is the decision tree
unique?
 No. Here is a simpler tree.
 We want a smaller and more accurate tree: it is easier to understand and performs better.
 Finding the best tree is very hard.
 All current tree-building algorithms are heuristic algorithms.


From a decision tree to a set of
rules
 A decision tree can be converted to a set of rules.
 Each path from the root to a leaf is a rule.


Algorithm for decision tree
learning
 Basic algorithm (a greedy divide-and-conquer algorithm); see the sketch after this list
 Assume attributes are categorical for now (continuous attributes can be handled too)
 The tree is constructed in a top-down recursive manner
 At the start, all the training examples are at the root
 Examples are partitioned recursively based on selected attributes
 Attributes are selected on the basis of an impurity function (e.g., information gain)
 Conditions for stopping partitioning
 All examples for a given node belong to the same class
 There are no remaining attributes for further partitioning – the majority class becomes the leaf
 There are no examples left
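A minimal sketch of this greedy, top-down, divide-and-conquer recursion, assuming categorical attributes; examples are dicts with a "class" key, and best_attribute stands in for the impurity-based selection (e.g., information gain) supplied by the caller:

```python
from collections import Counter

def majority_class(examples):
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def build_tree(examples, attributes, best_attribute):
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                      # all examples in one class
        return classes.pop()
    if not attributes:                         # no attributes left: majority leaf
        return majority_class(examples)
    a = best_attribute(examples, attributes)   # impurity-based choice
    remaining = [x for x in attributes if x != a]
    node = {"attribute": a, "branches": {}}
    for value in {e[a] for e in examples}:     # partition on each value of a
        subset = [e for e in examples if e[a] == value]
        node["branches"][value] = build_tree(subset, remaining, best_attribute)
    return node
```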


Choose an attribute to partition data

 The key to building a decision tree is which attribute to choose in order to branch.
 The objective is to reduce impurity or uncertainty in the data as much as possible.
 A subset of data is pure if all instances belong to the same class.
 The heuristic in C4.5 is to choose the attribute with the maximum Information Gain or Gain Ratio based on information theory.
The loan data (reproduced)
[Table: the loan application data again; the class attribute is Approved or not.]


Two possible roots, which is
better?

[Figure: two candidate root attributes, (A) and (B), each shown with the class distributions in the partitions it produces.]

 Fig. (B) seems to be better.


Building a decision tree:
an example training
dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no



Output: A Decision Tree for
“buys_computer”

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit rating?
              excellent -> no
              fair      -> yes


Inducing a decision tree
 There are many possible trees
 let’s try it on a credit data
 How to find the most compact one
 that is consistent with the data?



An Illustrative Example
Training Examples for Concept PlayTennis

Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
1    Sunny     Hot          High      Light   No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Light   Yes
4    Rain      Mild         High      Light   Yes
5    Rain      Cool         Normal    Light   Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Light   No
9    Sunny     Cool         Normal    Light   Yes
10   Rain      Mild         Normal    Light   Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Light   Yes
14   Rain      Mild         High      Strong  No
Mil
Constructing a Decision Tree for
PlayTennis

Day Outlook Temperature Humidity Wind PlayTennis?    [9+, 5-]
1 Sunny Hot High Light No
2 Sunny Hot High Strong No
3 Overcast Hot High Light Yes
4 Rain Mild High Light Yes
5 Rain Cool Normal Light Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Light No
9 Sunny Cool Normal Light Yes
10 Rain Mild Normal Light Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Light Yes
14 Rain Mild High Strong No



Constructing a Decision Tree for
PlayTennis
Potential Splits of Root Node

Outlook [9+, 5-]:      Sunny [2+, 3-]   Overcast [4+, 0-]   Rain [3+, 2-]
Temperature [9+, 5-]:  Cool [3+, 1-]    Mild [4+, 2-]       Hot [2+, 2-]
Humidity [9+, 5-]:     High [3+, 4-]    Normal [6+, 1-]
Wind [9+, 5-]:         Light [6+, 2-]   Strong [3+, 3-]


Constructing a Decision Tree for
PlayTennis

Examples 1-14 [9+, 5-]: Outlook?
  Sunny (1,2,8,9,11) [2+, 3-] -> Humidity?
      High (1,2,8) [0+, 3-]   -> No
      Normal (9,11) [2+, 0-]  -> Yes
  Overcast (3,7,12,13) [4+, 0-] -> Yes
  Rain (4,5,6,10,14) [3+, 2-] -> Wind?
      Strong (6,14) [0+, 2-]  -> No
      Light (4,5,10) [3+, 0-] -> Yes


Building a compact tree
 The key to building a decision tree is which attribute to choose in order to branch.
 The heuristic is to choose the attribute with the maximum Information Gain based on information theory.
 Another way to see it: we want to reduce uncertainty as much as possible.
Information theory
 Information theory provides a mathematical
basis for measuring the information content.
 To understand the notion of information, think
about it as providing the answer to a question,
for example, whether a coin will come up heads.
 If one already has a good guess about the answer, then the actual answer is less informative.
 If one already knows that the coin is rigged so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin (50-50).


Information theory (cont
…)

 For a fair (honest) coin, you have no information, and you are willing to pay more (say, in terms of $) for advance information: the less you know, the more valuable the information.
 Information theory uses this same intuition, but instead of measuring the value of information in dollars, it measures information content in bits.
 One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin.
Back to decision tree
learning
 For a given example, what is the
correct classification?
 We may think of a decision tree as
conveying information about the
classification of examples in the
table (of examples);
 The entropy measure characterizes
the (im)purity of an arbitrary
collection of examples.
Information theory: Entropy measure

 The entropy formula:

    entropy(D) = - Σ_{j=1..|C|} Pr(cj) · log2 Pr(cj)

    with Σ_{j=1..|C|} Pr(cj) = 1,

 where Pr(cj) is the probability of class cj in data set D.
 We use entropy as a measure of the impurity or disorder of data set D (or, a measure of information in a tree).
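A direct sketch of the entropy formula in Python; the probabilities Pr(cj) are estimated from label counts in D:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(round(entropy(["Yes"] * 9 + ["No"] * 6), 3))  # 0.971, as in the worked example below
```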


Entropy measure: let us get a feeling

 As the data become purer and purer, the entropy value becomes smaller and smaller. This is useful to us!
Information gain

 Given a set of examples D, we first compute its entropy, entropy(D).
 If we make attribute Ai, with v values, the root of the current tree, this will partition D into v subsets D1, D2, …, Dv. The expected entropy if Ai is used as the current root:

    entropy_Ai(D) = Σ_{j=1..v} (|Dj| / |D|) · entropy(Dj)


Information gain (cont …)

 Information gained by selecting attribute Ai to branch or to partition the data is

    gain(D, Ai) = entropy(D) - entropy_Ai(D)

 We choose the attribute with the highest gain to branch/split the current tree.
 Notice: log2(x) = log10(x) / log10(2)
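A sketch of these two formulas, reusing the entropy() function sketched earlier; examples are assumed to be dicts mapping attribute names (and "class") to values:

```python
def expected_entropy(examples, attribute):
    n = len(examples)
    total = 0.0
    for value in {e[attribute] for e in examples}:      # subsets D1..Dv
        subset = [e["class"] for e in examples if e[attribute] == value]
        total += (len(subset) / n) * entropy(subset)    # (|Dj|/|D|) * entropy(Dj)
    return total

def gain(examples, attribute):
    return entropy([e["class"] for e in examples]) - expected_entropy(examples, attribute)
```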


An example

entropy(D) = -(6/15)·log2(6/15) - (9/15)·log2(9/15) = 0.971

entropy_Own_house(D) = (6/15)·entropy(D1) + (9/15)·entropy(D2)
                     = (6/15)·0 + (9/15)·0.918
                     = 0.551

entropy_Age(D) = (5/15)·entropy(D1) + (5/15)·entropy(D2) + (5/15)·entropy(D3)
               = (5/15)·0.971 + (5/15)·0.971 + (5/15)·0.722
               = 0.888

Age     Yes  No  entropy(Di)
young   2    3   0.971
middle  3    2   0.971
old     4    1   0.722

 Own_house is the best choice for the root.
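The slide's numbers can be reproduced with the functions sketched above. The 15 records below respect the class counts per Age group and per Own_house value given on the slide; the joint pairing of Own_house with Age is an assumption made only so the data set is concrete:

```python
def rec(own, age, cls):
    return {"Own_house": own, "Age": age, "class": cls}

D = ([rec("yes", "young", "Yes")] * 1 + [rec("yes", "middle", "Yes")] * 2 +
     [rec("yes", "old", "Yes")] * 3 +
     [rec("no", "young", "Yes")] * 1 + [rec("no", "middle", "Yes")] * 1 +
     [rec("no", "old", "Yes")] * 1 +
     [rec("no", "young", "No")] * 3 + [rec("no", "middle", "No")] * 2 +
     [rec("no", "old", "No")] * 1)

print(round(entropy([e["class"] for e in D]), 3))   # 0.971
print(round(expected_entropy(D, "Own_house"), 3))   # 0.551
print(round(expected_entropy(D, "Age"), 3))         # 0.888
```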


We build the final tree

 We can use the information gain ratio to evaluate the impurity as well (see the handout).

[Figure: the final decision tree for the loan data, with Own_house at the root.]


Handling continuous
attributes

 Handle a continuous attribute by splitting it into two intervals (can be more) at each node.
 How do we find the best threshold to divide?
 Use information gain or gain ratio again
 Sort all the values of a continuous attribute in increasing order {v1, v2, …, vr}.
 One possible threshold lies between each pair of adjacent values vi and vi+1. Try all possible thresholds and find the one that maximizes the gain (or gain ratio), as in the sketch below.
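A sketch of this threshold search, reusing entropy() from earlier; midpoints between adjacent distinct values serve as the candidate thresholds:

```python
def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_t, best_gain = None, -1.0
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                              # no threshold between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2   # midpoint between vi and vi+1
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        g = base - (len(left) / len(pairs)) * entropy(left) \
                 - (len(right) / len(pairs)) * entropy(right)
        if g > best_gain:
            best_t, best_gain = t, g
    return best_t, best_gain
```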
Evaluating classification methods

 Predictive accuracy
 Efficiency
 time to construct the model
 time to use the model
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability: the understandability of, and the insight provided by, the model
 Compactness of the model: size of the tree, or the number of rules.
Classification Accuracy or
Error Rates

 Partition: training-and-testing
 use two independent data sets, e.g., training set (2/3), test set (1/3)
 used for data sets with a large number of examples
 Cross-validation (sketched below)
 divide the data set into k subsamples
 use k-1 subsamples as training data and one subsample as test data: k-fold cross-validation
 for data sets of moderate size
 leave-one-out: for small data sets
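A minimal k-fold cross-validation sketch: each of the k subsamples serves once as test data while the rest train the model. The train() and predict() callables are placeholders to be supplied by the caller:

```python
def k_fold_accuracy(examples, k, train, predict):
    folds = [examples[i::k] for i in range(k)]    # k disjoint subsamples
    scores = []
    for i in range(k):
        test = folds[i]
        training = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = train(training)
        correct = sum(predict(model, e) == e["class"] for e in test)
        scores.append(correct / len(test))
    return sum(scores) / k                        # average accuracy over folds
```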
Avoid Overfitting in Classification

 Overfitting: a tree may overfit the training data
 Good accuracy on training data but poor on test examples
 Too many branches, some of which may reflect anomalies due to noise or outliers
 Two approaches to avoid overfitting
 Prepruning: halt tree construction early
 Difficult to decide when to stop
 Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees.
 This method is commonly used (based on a validation set, a statistical estimate, or MDL)
Enhancements to basic
decision tree induction
 Allow for continuous-valued attributes
 Dynamically define new discrete-valued attributes
that partition the continuous attribute value into a
discrete set of intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones that
are sparsely represented.
 This reduces fragmentation, repetition, and
replication
Summary
 Applications of supervised learning are in
almost any field or domain.
 We studied many classification techniques.
 There are still many other methods, e.g.,

Bayesian networks

Neural networks

Genetic algorithms

Fuzzy classification
This large number of methods also shows the importance of classification and its wide applicability.
 It remains an active research area.
Discussion

