Data Mining and Business Intelligence
Classification: Attribute Selection Measures, Tree Pruning, Extracting Rules
By Dr. Nora Shoaip
Lecture 5
Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems
2024-2025
Outline
The Basics
• What is Classification?
• General Approach
Decision Tree Induction
• The Algorithm
• Attribute Selection Measures
• Tree Pruning
• Extracting Rules from Decision Trees
The Basics: What is Classification?
Motivation: Prediction
Is a bank loan applicant “safe” or “risky”?
Which treatment is better for a patient, “treatment X” or “treatment Y”?
Classification is a data analysis task in which a model is constructed to predict class labels (categories).
The Basics: General Approach
A two-step process:
Learning (training) step: construct the classification model
• Build a classifier for a predetermined set of classes
• Learn from a training dataset (data tuples + their associated class labels) → supervised learning
Classification step: the model is used to predict class labels for given data (the test set)
The Basics: General Approach
[Figure: learning step – the training data (attribute vectors with their class labels) is fed to the classification algorithm, which produces the model, e.g., a set of classification rules]
The Basics: General Approach
[Figure: classification step – the classification rules are first applied to test data to estimate classifier accuracy (% of test-set tuples correctly classified, guarding against overfitting), then used to predict the class of new data, e.g., applicant (Mohammed, youth, medium) → loan decision: risky]
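A minimal sketch of this two-step process in Python with scikit-learn; the tiny applicant table, its two numeric attributes (age, income), and the safe/risky labels are all made up for illustration:

```python
# Two-step classification: (1) learn a model from training data, (2) use it on test/new data.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical attribute vectors (age, income) and class labels ("safe"/"risky")
X = [[25, 30000], [45, 80000], [35, 42000], [52, 110000], [23, 12000], [40, 60000]]
y = ["risky", "safe", "safe", "safe", "risky", "safe"]

# Hold out part of the data as a test set for estimating accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Step 1 -- learning (training): construct the classifier from the training set
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Step 2 -- classification: predict class labels for the test set and estimate accuracy
y_pred = clf.predict(X_test)
print("Estimated accuracy:", accuracy_score(y_test, y_pred))

# The accepted model can then classify new, unseen applicants
print("New applicant ->", clf.predict([[30, 35000]])[0])
```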
Outline
The Basics
• What is Classification?
• General Approach
Decision Tree Induction
• The Algorithm
• Attribute Selection Measures
• Tree Pruning
• Extracting Rules from Decision Trees
Decision Tree Induction
Learning of decision trees from a training dataset
Decision tree: a flowchart-like tree structure
• Internal node: a test on an attribute
• Branch: a test outcome
• Leaf node: a class label
The constructed tree can be binary or otherwise
Decision Tree Induction
Benefits
No domain knowledge required
No parameter setting
Can handle multidimensional data
Easy-to-understand representation
Simple and fast
Decision Tree Induction: The Algorithm
Inputs: the training dataset D, the attribute list, and an attribute selection method
[Figure: a node N initially represents all of D; the attribute selection method is applied to D and the attribute list to determine the splitting attribute together with its split point(s) or splitting subsets]
Decision Tree Induction: The Algorithm
[Figure: the chosen splitting criterion labels node N; one branch is grown from N for each outcome 1..n of the test, and D is split into the corresponding partitions 1..n, to which the algorithm is applied recursively]
Decision Tree Induction :The Algorithm
Splitting
Attribute Splitting Criterion
Outcome 1 Outcome n
Discrete
Partition 1 Partition n
Continuous
No Yes
Discrete
Attribute
Binary
Tree Selection Method
12
Decision Tree Induction: The Algorithm
The splitting criterion is a test that determines:
• which attribute to test at node N, i.e., the “best” way to partition D into mutually exclusive classes
• which (and how many) branches to grow from node N to represent the test outcomes
Resulting partitions at each branch should be as “pure” as possible
• A partition is “pure” if all of its tuples belong to the same class
Once an attribute is chosen to split the training dataset, it is removed from the attribute list
Decision Tree Induction: The Algorithm
Terminating conditions:
• All the tuples in D (represented at node N) belong to the same class
• There are no remaining attributes on which the tuples may be further partitioned
– majority voting is employed: node N is converted into a leaf labeled with the most common class in the data partition
• There are no tuples for a given branch
– a leaf is created and labeled with the majority class in the data partition
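A compact Python sketch of this recursive procedure, assuming tuples are represented as dicts with the class label under a "class" key and that an attribute_selection_method function is supplied by the caller (all names here are illustrative, not from the slides):

```python
from collections import Counter

def majority_class(D):
    """Most common class label in partition D (used for majority voting)."""
    return Counter(t["class"] for t in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    """Recursive decision tree induction over a list of dict tuples D."""
    classes = {t["class"] for t in D}
    # Terminating condition 1: all tuples at this node belong to the same class
    if len(classes) == 1:
        return classes.pop()                      # leaf labeled with that class
    # Terminating condition 2: no remaining attributes -> majority voting
    if not attribute_list:
        return majority_class(D)
    # The attribute selection measure picks the splitting attribute,
    # which is then removed from the attribute list
    splitting_attribute = attribute_selection_method(D, attribute_list)
    remaining = [a for a in attribute_list if a != splitting_attribute]
    node = {splitting_attribute: {}}
    # One branch per outcome (here: one per discrete value observed in D)
    for value in {t[splitting_attribute] for t in D}:
        Dj = [t for t in D if t[splitting_attribute] == value]
        # Terminating condition 3: empty partition -> leaf with majority class of D
        # (cannot occur when branching only on observed values; kept for fidelity)
        node[splitting_attribute][value] = (
            majority_class(D) if not Dj
            else generate_decision_tree(Dj, remaining, attribute_selection_method)
        )
    return node
```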
Decision Tree Induction: Attribute Selection Measures
An attribute selection measure is a heuristic for selecting the splitting criterion that “best” splits a given data partition into smaller, mutually exclusive classes
• Attributes are ranked according to the measure; the attribute with the best score is chosen as the splitting attribute
• A split point must also be determined for continuous attributes, and a splitting subset for discrete attributes when building binary trees
Measures: Information Gain, Gain Ratio, Gini Index
Decision Tree Induction: Attribute Selection Measures
Information Gain
• Based on Shannon’s information theory
• The goal is to minimize the expected number of tests needed to classify a tuple, which helps guarantee that a simple tree is found
• The attribute with the highest information gain is chosen as the splitting attribute
– it minimizes the information needed to classify tuples in the resulting partitions
– it reflects the least “impurity” in the resulting partitions
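For reference, the standard definitions underlying these measures (and used in the worked example below) are:

```latex
% Expected information (entropy) needed to classify a tuple in D,
% where p_i is the proportion of tuples of D belonging to class C_i (m classes):
\[ Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \]

% Information still required after splitting D on attribute A into partitions D_1, ..., D_v:
\[ Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j) \]

% Information gain of attribute A:
\[ Gain(A) = Info(D) - Info_A(D) \]

% Gain ratio (used by C4.5) normalizes the gain by the split information:
\[ SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\frac{|D_j|}{|D|}, \qquad
   GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)} \]

% Gini index (used by CART) measures the impurity of D:
\[ Gini(D) = 1 - \sum_{i=1}^{m} p_i^2 \]
```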
Attribute Selection Measures – Example: the buys_computer training data
Class distribution: C1 (buys_computer = yes) = 9 tuples, C2 (buys_computer = no) = 5 tuples
RID age income student credit_rating Class: buys_computer
1 youth high no fair no
2 youth high no excellent no
3 middle aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle aged medium no excellent yes
13 middle aged high yes fair yes
14 senior medium no excellent no
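For the Python sketches below, the same training data can be written as a list of dicts (the attribute names follow the table; the encoding itself is mine, not the slides'):

```python
# The buys_computer training data from the table above, one dict per tuple (RIDs 1..14 in order).
D = [
    {"age": "youth",       "income": "high",   "student": "no",  "credit_rating": "fair",      "class": "no"},
    {"age": "youth",       "income": "high",   "student": "no",  "credit_rating": "excellent", "class": "no"},
    {"age": "middle_aged", "income": "high",   "student": "no",  "credit_rating": "fair",      "class": "yes"},
    {"age": "senior",      "income": "medium", "student": "no",  "credit_rating": "fair",      "class": "yes"},
    {"age": "senior",      "income": "low",    "student": "yes", "credit_rating": "fair",      "class": "yes"},
    {"age": "senior",      "income": "low",    "student": "yes", "credit_rating": "excellent", "class": "no"},
    {"age": "middle_aged", "income": "low",    "student": "yes", "credit_rating": "excellent", "class": "yes"},
    {"age": "youth",       "income": "medium", "student": "no",  "credit_rating": "fair",      "class": "no"},
    {"age": "youth",       "income": "low",    "student": "yes", "credit_rating": "fair",      "class": "yes"},
    {"age": "senior",      "income": "medium", "student": "yes", "credit_rating": "fair",      "class": "yes"},
    {"age": "youth",       "income": "medium", "student": "yes", "credit_rating": "excellent", "class": "yes"},
    {"age": "middle_aged", "income": "medium", "student": "no",  "credit_rating": "excellent", "class": "yes"},
    {"age": "middle_aged", "income": "high",   "student": "yes", "credit_rating": "fair",      "class": "yes"},
    {"age": "senior",      "income": "medium", "student": "no",  "credit_rating": "excellent", "class": "no"},
]
```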
Attribute Selection Measures
Worked example on the buys_computer data:
1. Compute Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94 bits
2. Compute Info_age(D) = 0.694 bits
3. Compute Gain(age) = 0.94 - 0.694 = 0.246 bits
Attribute Selection Measures
Similarly:
Gain(age) = 0.246 bits
Gain(income) = 0.029 bits
Gain(student) = 0.151 bits
Gain(credit_rating) = 0.048 bits
Gain(age) is the highest information gain, so age is chosen as the splitting attribute
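Continuing from the D list defined under the data table, a short sketch that recomputes these gains from the standard definitions:

```python
from math import log2
from collections import Counter

def info(D):
    """Expected information (entropy) Info(D) of a partition, in bits."""
    counts = Counter(t["class"] for t in D)
    total = len(D)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_a(D, attribute):
    """Info_A(D): expected information after partitioning D on the attribute."""
    total = len(D)
    result = 0.0
    for value in {t[attribute] for t in D}:
        Dj = [t for t in D if t[attribute] == value]
        result += (len(Dj) / total) * info(Dj)
    return result

def gain(D, attribute):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(D) - info_a(D, attribute)

for a in ["age", "income", "student", "credit_rating"]:
    print(f"Gain({a}) = {gain(D, a):.3f} bits")
# Prints approximately: age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (the slide's 0.246 / 0.151 round the intermediate Info values);
# age has the highest gain either way and becomes the splitting attribute.
```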
Attribute Selection Measures
The resulting decision tree:
age?
• youth → student? (no → No, yes → Yes)
• middle_aged → Yes
• senior → credit_rating? (fair → Yes, excellent → No)
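The same tree can be written down in the nested-dict form produced by the earlier induction sketch and used to classify a tuple, e.g., the tuple X that reappears in the rule-extraction example later:

```python
# The tree above, hand-written in the nested-dict representation of the earlier sketch.
tree = {"age": {
    "youth":       {"student": {"no": "no", "yes": "yes"}},
    "middle_aged": "yes",
    "senior":      {"credit_rating": {"fair": "yes", "excellent": "no"}},
}}

def classify(tree, tuple_):
    """Follow the branch matching the tuple's attribute value until a leaf is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[tuple_[attribute]]
    return tree

X = {"age": "youth", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(classify(tree, X))   # -> "yes" (buys_computer = yes)
```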
Decision Tree Induction: Tree Pruning
The constructed tree may overfit anomalies and outliers in the training data
Pruning removes the least reliable branches, making the decision tree less complex
Prepruning: statistically assess the goodness of a split before it takes place
• it is hard to choose appropriate thresholds for statistical significance
Postpruning: remove subtrees from the already constructed tree
• a subtree’s branches are removed and replaced with a leaf node
• the leaf is labeled with the most frequent class in the subtree
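As one concrete realisation of postpruning (not necessarily the method the lecture has in mind), scikit-learn exposes cost-complexity pruning: the full tree is grown first, and subtrees are then collapsed according to a complexity parameter chosen on held-out data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the full tree, then compute candidate pruning levels (ccp_alpha values).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best = None
for alpha in path.ccp_alphas:
    # Larger alpha => more subtrees collapsed into leaves => simpler tree.
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)   # held-out accuracy guards against overfitting
    if best is None or score >= best[0]:   # on ties, prefer the simpler (more pruned) tree
        best = (score, alpha, pruned.get_n_leaves())

# For a real workflow, alpha should be chosen on a separate validation set.
print("best held-out accuracy %.3f at ccp_alpha=%.5f with %d leaves" % best)
```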
Decision Tree Induction: Rule Extraction from a Decision Tree
Example: classify the tuple X = (age = youth, income = medium, student = yes, credit_rating = fair) from the buys_computer data
Following the tree (age = youth → student? = yes), X is predicted as buys_computer = yes
Decision Tree Induction: Rule Extraction from a Decision Tree – Resolving Rule Conflicts
Rule conflicts arise when a tuple fires more than one rule with different class predictions
Two resolution strategies:
• Size ordering: the rule with the largest (toughest) antecedent has the highest priority; it fires and returns its class prediction
• Rule ordering: rules are prioritized a priori according to
– class-based ordering: decreasing class importance (e.g., order of prevalence – most frequent classes first)
– rule-based ordering: measures of rule quality (e.g., accuracy, size, domain expertise)
A fallback (default) rule fires when no other rule is triggered
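A small sketch of ordered rule firing with a fallback rule; the rule list, its ordering, and the default class are illustrative (the default here is the most frequent class, yes):

```python
# Rules are kept in priority order (class-based or rule-based ordering);
# the first rule whose antecedent is satisfied fires and returns its prediction.
rules = [
    ({"age": "youth", "student": "no"}, "no"),     # R1
    ({"age": "youth", "student": "yes"}, "yes"),   # R2
    ({"age": "middle_aged"}, "yes"),               # R3
]
default_class = "yes"   # fallback rule: most frequent class in the training data

def predict(tuple_):
    for antecedent, consequent in rules:
        if all(tuple_.get(a) == v for a, v in antecedent.items()):
            return consequent          # first matching rule fires
    return default_class               # no rule triggered -> default rule

print(predict({"age": "senior", "credit_rating": "fair"}))   # -> "yes" via the default rule
```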
Decision Tree Induction: Rule Extraction from a Decision Tree
• Create one rule for each path from the root to a leaf in the decision tree
1. Each splitting criterion along the path is ANDed to form the rule antecedent (the IF part)
2. The leaf node holds the class prediction (the THEN part)
Can the rules resulting from decision trees have conflicts?
R1: IF age = youth AND student = no THEN buys_computer = no
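A sketch of this extraction procedure over the nested-dict tree used earlier: conditions along each root-to-leaf path are ANDed into the IF part, and the leaf supplies the THEN part:

```python
def extract_rules(tree, conditions=()):
    """One IF-THEN rule per root-to-leaf path of a nested-dict decision tree."""
    if not isinstance(tree, dict):                       # reached a leaf: emit the rule
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {antecedent} THEN buys_computer = {tree}"]
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():              # each branch adds one condition
        rules += extract_rules(subtree, conditions + ((attribute, value),))
    return rules

tree = {"age": {
    "youth":       {"student": {"no": "no", "yes": "yes"}},
    "middle_aged": "yes",
    "senior":      {"credit_rating": {"fair": "yes", "excellent": "no"}},
}}
for rule in extract_rules(tree):
    print(rule)
# First rule printed: IF age = youth AND student = no THEN buys_computer = no  (R1 above)
```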