Introduction to Classification
Dr. techn. Annisa Maulida Ningtyas, M.Eng.
Learning Methods of Data Mining Algorithms
• Supervised Learning
• Semi-Supervised Learning
• Unsupervised Learning
• Association-based Learning
1. Supervised Learning
• Learning "with a teacher": the dataset has a target/label/class
• Most data mining algorithms (estimation, prediction/forecasting,
  classification) are supervised learning
• The algorithm learns from the values of the target variable that are
  associated with the values of the predictor variables
Dataset with a Class
(Illustration: each column is an Attribute/Feature/Dimension — nominal or
numeric — and the last column is the Class/Label/Target.)
2. Unsupervised Learning
• The data mining algorithm searches for patterns across all variables (attributes)
• No variable (attribute) is designated as the target/label/class
• Clustering algorithms are unsupervised learning algorithms

Dataset without a Class
(Illustration: only Attribute/Feature/Dimension columns, no label column.)
3. Semi-Supervised Learning
• Semi-supervised learning is a data mining method that uses labeled and
  unlabeled data together during the learning process
• The labeled records are used to build the model (knowledge), while the
  unlabeled records are used to refine the boundaries between the classes
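As an illustration only (not part of the original slides), this idea can be tried with scikit-learn's SelfTrainingClassifier, assuming scikit-learn is available; the data points, labels, and threshold below are invented:

```python
# Illustrative sketch: self-training fits a base model on the labeled rows,
# then pseudo-labels the unlabeled rows (marked -1) to refine the boundary.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.array([[0.0], [0.2], [0.9], [1.1], [0.45], [0.65]])
y = np.array([0, 0, 1, 1, -1, -1])            # -1 marks unlabeled records

model = SelfTrainingClassifier(GaussianNB(), threshold=0.8)
model.fit(X, y)                               # labeled data builds the model,
                                              # confident pseudo-labels refine it
print(model.predict([[0.5]]))
```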
Data Mining Methods
1. Estimation:
   Linear Regression (LR), Neural Network (NN), Deep Learning (DL),
   Support Vector Machine (SVM), Generalized Linear Model (GLM), etc.
2. Forecasting:
   Linear Regression (LR), Neural Network (NN), Deep Learning (DL),
   Support Vector Machine (SVM), Generalized Linear Model (GLM), etc.
3. Classification:
   Decision Tree (CART, ID3, C4.5, Credal DT, Credal C4.5, Adaptive Credal
   C4.5), Naive Bayes (NB), K-Nearest Neighbor (kNN), Linear Discriminant
   Analysis (LDA), Logistic Regression (LogR), etc.
4. Clustering:
   K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means (FCM), etc.
5. Association:
   FP-Growth, Apriori, Coefficient of Correlation, Chi-Square, etc.
Output / Pattern / Model / Knowledge
1. Formula/Function (regression formula or function)
   • TRAVEL_TIME = 0.48 + 0.6·DISTANCE + 0.34·TRAFFIC_LIGHTS + 0.2·ORDERS
2. Decision Tree
3. Degree of Correlation
4. Rule
   • IF ips3 = 2.8 THEN lulustepatwaktu
     (IF the third-semester GPA equals 2.8 THEN the student graduates on time)
5. Cluster
Classification
What is Classification?
Approach:
• Given a collection of records (training set)
• each record contains a set of attributes
• one of the attributes is the class (label) that should be predicted.
• Learn a model for the class attribute as a function of the
values of other attributes.
Variants:
• binary (two-class) problems (class labels e.g. true/false or fraud/no fraud)
• multi-class problems (class labels e.g. low, medium, high)
• multi-label problems (e.g. in text classification, an article can be about
  'Technology', 'Health', and 'Travel' at the same time)
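A minimal sketch of this approach ("learn a model for the class attribute as a function of the other attributes"), assuming scikit-learn; the attributes, values, and class labels are invented for illustration:

```python
# Minimal sketch: learn the class attribute from a training set of records.
from sklearn.tree import DecisionTreeClassifier

# Each record is a set of attribute values; y_train holds the class labels.
X_train = [[25, 40000], [47, 90000], [35, 60000], [52, 30000]]  # e.g. age, income
y_train = ["no", "yes", "yes", "no"]                            # class attribute

model = DecisionTreeClassifier().fit(X_train, y_train)
print(model.predict([[40, 70000]]))   # apply the model to an unseen record
```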
Introduction to Classification
A Couple of Questions:
□ What is this?
□ Why do you know?
□ How have you come to that
knowledge?
Goal: Learn a model for recognizing a concept,
e.g. fruit
□ Training data:
"fruit" "fruit" "fruit"
“not a fruit" “not a fruit" “not a fruit"
Model Learning and Model Application Process
Classification Examples
□ Credit Risk Assessment
• Attributes: your age, income, debts, …
• Class: Will your bank grant you credit?
□ Marketing
• Attributes: previously bought products, browsing behaviour
• Class: Are you a target customer for a new product?
□ Tax Fraud
• Attributes: the values in your tax declaration
• Class: Are you trying to cheat?
□ SPAM Detection
• Attributes: words and header fields of an e-mail
• Class: Is it a spam e-mail?
Classification Techniques
1. K-Nearest-Neighbors
2. Decision Trees
3. Rule Learning
4. Naïve Bayes
5. Support Vector Machines
6. Artificial Neural Networks
7. Deep Neural Networks
8. Many others …
K-Nearest Neighbours (KNN)
K-Nearest-Neighbors
Example Problem
– Predict what the current weather
is in a certain place
– where there is no weather station.
– How could you do that?
Basic Idea
□ Use the average of
the nearest stations
□ Example:
• 3x sunny
• 2x cloudy
• result = sunny
□ This approach is called K-Nearest-Neighbors
• where k is the number of neighbors to consider
• in the example: k=5
• in the example: “near” denotes geographical proximity
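A minimal sketch of this idea, with invented station coordinates and weather labels (plain Python, standard library only):

```python
# k-NN sketch for the weather example: predict the weather at a query point
# by a majority vote of the k geographically nearest stations.
import math
from collections import Counter

stations = [((0.0, 0.0), "sunny"), ((1.0, 0.5), "sunny"), ((0.5, 1.0), "sunny"),
            ((2.0, 2.0), "cloudy"), ((2.5, 1.5), "cloudy"), ((3.0, 3.0), "rainy")]

def knn_predict(query, k=5):
    # sort stations by Euclidean distance to the query point, keep the k nearest
    nearest = sorted(stations, key=lambda s: math.dist(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]        # majority vote of the k neighbors

print(knn_predict((1.0, 1.0), k=5))          # -> 'sunny' (3x sunny vs 2x cloudy)
```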
Impact of K value in KNN
Discussion of K-Nearest-Neighbor Classification
□ Often very accurate
• for instance for optical character
recognition (OCR)
□ … but slow
• as training data needs to be searched
□ Assumes that all attributes are equally important
• remedy: attribute selection or attribute weights
□ Can handle decision boundaries which are not
parallel to the axes (unlike decision trees)
Lazy versus Eager Learning
□ Lazy Learning
• Instance-based learning approaches, like KNN, are also called
lazy learning as no explicit knowledge (model) is learned
• Single goal: Classify unseen records as accurately as possible
□ Eager Learning
• but actually, we might have two goals
1. classify unseen records
2. understand the application domain as a human
• Eager learning approaches generate models that are (might be)
interpretable by humans
• Examples of eager techniques: Decision Tree Learning, Rule
Learning
Decision Tree
Why Decision Tree?
• A decision tree is a prediction model that works by creating a flowchart-
like structure (tree-like structure).
• Tree-like Structure: The model resembles an upside-down tree that starts
from a single point (the root) and branches out. Information flows from
top to bottom, with decisions being made at each step.
• Internal Nodes Represent Attribute Tests: Each non-leaf node in the tree
represents a question or test about a specific feature (attribute) in your
dataset. For example, in a medical diagnosis tree, a node might ask "Is
the patient's temperature above 100°F?"
• Branches Represent Attribute Values: The lines connecting nodes are
branches that represent possible answers or outcomes of the test at the
internal node. Following our temperature example, the branches might
be "Yes" and "No" coming out from the temperature test node.
• Leaf Nodes Represent Final Decisions or Predictions: When you follow a
path from the root to the end of a branch, you reach a leaf node, which
provides the final prediction or decision. In a classification problem, this
might be a category label (like "has heart disease" or "doesn't have heart
disease"). In a regression problem, it would be a numerical value.
Intuition
1. Root Node (Income)
First Question: “Is the person’s income greater than $50,000?”
• If Yes, proceed to the next question.
• If No, predict “No Purchase” (leaf node).
2. Internal Node (Age):
If the person’s income is greater than $50,000, ask: “Is the
person’s age above 30?”
• If Yes, proceed to the next question.
• If No, predict “No Purchase” (leaf node).
3. Internal Node (Previous Purchases):
• If the person is above 30 and has made previous
purchases, predict “Purchase” (leaf node).
• If the person is above 30 and has not made previous
purchases, predict “No Purchase” (leaf node).
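The same intuition written as nested if/else rules (a sketch; the thresholds are the ones from the example above):

```python
# The purchase-intuition tree above expressed as nested if/else rules.
def predict_purchase(income, age, has_previous_purchases):
    if income <= 50000:                 # root node: income test
        return "No Purchase"
    if age <= 30:                       # internal node: age test
        return "No Purchase"
    if has_previous_purchases:          # internal node: previous purchases
        return "Purchase"               # leaf node
    return "No Purchase"                # leaf node

print(predict_purchase(income=60000, age=35, has_previous_purchases=True))  # Purchase
```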
Forming a Decision Tree From Training Data

Training data (14 records):

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31..40  high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31..40  low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31..40  medium   no        excellent       yes
31..40  high     yes       fair            yes
>40     medium   no        excellent       no

What needs to be understood first:
• Dataset
• Attribute
• Label / class
• Data types

How can we obtain a model from this training data that can classify
automatically?
Forming a Decision Tree From Training Data

From the same 14-record training data, the following decision tree model is
learned:

age?
• <=30   → student?        (no → no; yes → yes)
• 31..40 → yes
• >40    → credit_rating?  (excellent → no; fair → yes)

Rule:
IF ((age<=30) AND (student)) OR (age=31..40) OR
   ((age>40) AND (credit_rating=fair))
THEN buys_computer = yes
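The extracted rule can also be written directly as a small predicate (a sketch; the attribute values are written as in the table above):

```python
# The rule extracted from the decision tree, written as a predicate.
def buys_computer(age, student, credit_rating):
    return ((age == "<=30" and student == "yes")
            or age == "31..40"
            or (age == ">40" and credit_rating == "fair"))

print(buys_computer("<=30", "yes", "excellent"))   # True  -> buys_computer = yes
print(buys_computer(">40", "no", "excellent"))     # False -> buys_computer = no
```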
Forming a Decision Tree From Training Data

The same training data and the same decision tree model as above.

How do we choose which attribute becomes the root, which attribute is tested
next, and so on?
Attribute Selection

Let's try the attribute student as the root of our decision tree.

student?
• student = yes → 6 "yes", 1 "no"
• student = no  → 3 "yes", 4 "no"

The attribute student is not well suited as a separator, because it cannot
separate the class labels cleanly: both branches still mix "yes" and "no".
Attribute Selection

Now let's try the attribute age.

age?
• age <= 30  → 2 "yes", 3 "no"
• age 31..40 → 4 "yes", 0 "no"
• age > 40   → 3 "yes", 2 "no"

The age attribute is better suited as a separator because it separates the
class labels more effectively.
Attribute Selection

Splitting on age gives branches with:
• More predictiveness
• Less impurity
• Lower entropy

The age = 31..40 branch contains only "yes" records: it is a pure node.
Entropy

• A measure of the purity, diversity, randomness, or uncertainty of a node.
• The smaller the entropy value, the more homogeneous the class distribution,
  and the purer the node.

Examples:
• 1 "yes", 7 "no" → low entropy          • 3 "yes", 5 "no" → high entropy
• 0 "yes", 8 "no" → entropy = 0 (pure)   • 4 "yes", 4 "no" → entropy = 1

Info(D) = Entropy(D) = − Σ_{i=1..m} p_i · log2(p_i)
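A small helper that computes this entropy from class counts (a plain-Python sketch; it reproduces the node examples above and the dataset entropy used on the next slides):

```python
import math

def entropy(*counts):
    """Info(D) = -sum(p_i * log2(p_i)) over the class proportions of a node."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy(0, 8))   # 0.0     -> pure node
print(entropy(4, 4))   # 1.0     -> maximally impure node
print(entropy(9, 5))   # ~0.940  -> entropy of the buys_computer class below
```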
Entropy Calculation

Step 1: Calculate the entropy of the 'buys_computer' class/label (the entropy
of the whole dataset D).

• buys_computer = "yes" → 9 records
• buys_computer = "no"  → 5 records
• Total                 → 14 records

Info(D) = I(9,5) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940

Formulas used:
Info(D) = − Σ_{i=1..m} p_i · log2(p_i)
Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) · I(D_j)
Entropy Calculation

Step 2: Calculate the entropy of each attribute — attribute "age".

age      "yes"   "no"   I(yes,no)
<=30       2       3      0.971
31..40     4       0      0
>40        3       2      0.971

I(2,3) = −(2/5)·log2(2/5) − (3/5)·log2(3/5) = 0.971

Info_age(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694
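Using the entropy() helper sketched above, the weighted entropy of the age split can be reproduced:

```python
# Weighted entropy after splitting on "age", using the entropy() helper above.
info_age = (5/14) * entropy(2, 3) + (4/14) * entropy(4, 0) + (5/14) * entropy(3, 2)
print(round(info_age, 3))   # 0.694
```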
Entropy Calculation

Step 3: Calculate the entropy of each attribute — attribute "income".

income    "yes"   "no"   I(yes,no)
low         3       1      0.811
medium      4       2      0.918
high        2       2      1

Info_income(D) = (4/14)·I(3,1) + (6/14)·I(4,2) + (4/14)·I(2,2) = 0.911
Entropy Calculation

Step 4: Calculate the entropy of each attribute — attribute "student".

student   "yes"   "no"   I(yes,no)
yes         6       1      0.592
no          3       4      0.985

Info_student(D) = (7/14)·I(6,1) + (7/14)·I(3,4) = 0.788
Entropy Calculation

Step 5: Calculate the entropy of each attribute — attribute "credit_rating".

credit_rating   "yes"   "no"   I(yes,no)
fair              6       2      0.811
excellent         3       3      1

Info_credit_rating(D) = (8/14)·I(6,2) + (6/14)·I(3,3) = 0.892
Information Gain

• Information gain (often referred to simply as 'gain') measures how much the
  uncertainty about the class decreases after the data is split on an attribute.
• Gain(A) describes how much the entropy decreases due to attribute A.
  The larger the gain, the better.

Example: entropy before the split E = 0.940; after splitting on student?, the
"yes" branch has E = 0.592 and the "no" branch has E = 0.985.

Gain(A) = Info(D) − Info_A(D)
Calculate the Information Gain of Each Attribute

Gain(A) = Info(D) − Info_A(D)

Attribute         Entropy
buys_computer      0.940
age                0.694
income             0.911
student            0.788
credit_rating      0.892

• Gain(age)           = 0.940 − 0.694 = 0.246
• Gain(income)        = 0.940 − 0.911 = 0.029
• Gain(student)       = 0.940 − 0.788 = 0.152
• Gain(credit_rating) = 0.940 − 0.892 = 0.048

The attribute age has the highest information gain, so it is selected as the
initial node (root).
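The whole attribute-selection step can be reproduced with a short script (a sketch assuming pandas is available; the 14 rows are the training data above, and small differences from the slide values are due to rounding):

```python
# Sketch: compute the information gain of every attribute on the 14-row
# training data and confirm that "age" is chosen as the root.
import math
import pandas as pd

rows = [("<=30", "high", "no", "fair", "no"),
        ("<=30", "high", "no", "excellent", "no"),
        ("31..40", "high", "no", "fair", "yes"),
        (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"),
        (">40", "low", "yes", "excellent", "no"),
        ("31..40", "low", "yes", "excellent", "yes"),
        ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"),
        (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"),
        ("31..40", "medium", "no", "excellent", "yes"),
        ("31..40", "high", "yes", "fair", "yes"),
        (">40", "medium", "no", "excellent", "no")]
df = pd.DataFrame(rows, columns=["age", "income", "student",
                                 "credit_rating", "buys_computer"])

def column_entropy(labels):
    # Info(D) = -sum(p_i * log2(p_i)) over the class proportions
    return sum(-p * math.log2(p)
               for p in labels.value_counts(normalize=True) if p > 0)

def information_gain(data, attribute, target="buys_computer"):
    info_d = column_entropy(data[target])
    info_a = sum(len(part) / len(data) * column_entropy(part[target])
                 for _, part in data.groupby(attribute))
    return info_d - info_a

for attr in ["age", "income", "student", "credit_rating"]:
    print(attr, round(information_gain(df, attr), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048 -> age is the root
```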
Forming a Decision Tree

age?
• <=30   → ?
• 31..40 → yes
• >40    → ?

• After the age attribute, which attribute comes next?
• Each branch is processed further as long as it still contains more than one
  class.
• The 31..40 branch does not need to be processed further, because it contains
  only one possible class (yes).
Forming a Decision Tree

Next, process the records where age <= 30:

age    income   student   credit_rating   buys_computer
<=30   high     no        fair            no
<=30   high     no        excellent       no
<=30   medium   no        fair            no
<=30   low      yes       fair            yes
<=30   medium   yes       excellent       yes

Info(D) = I(2,3) = −(2/5)·log2(2/5) − (3/5)·log2(3/5) = 0.97

Calculate the attribute gains (age itself no longer needs to be considered):

Info_income(D)        = (2/5)·I(0,2) + (2/5)·I(1,1) + (1/5)·I(1,0) = 0.4
Gain(income)          = 0.97 − 0.4  = 0.57

Info_student(D)       = (3/5)·I(0,3) + (2/5)·I(2,0) = 0
Gain(student)         = 0.97 − 0    = 0.97

Info_credit_rating(D) = (3/5)·I(1,2) + (2/5)·I(1,1) = 0.95
Gain(credit_rating)   = 0.97 − 0.95 = 0.02
Forming a Decision Tree

The attribute student has the highest gain in this branch, so it becomes the
next test node:

age?
• <=30   → student?   (no → no; yes → yes)
• 31..40 → yes
• >40    → ?   (continue with the same procedure…)
Another Way to Form a Decision Tree

• Using the Gini Index (GI), a measure of impurity
• Steps:
  • Calculate the Gini Index (GI) for each attribute
  • Determine the root based on the GI value: the root is the attribute with
    the smallest GI value
  • Repeat steps 1 and 2 for the next levels of the tree until the GI value = 0
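For comparison, a minimal sketch (not from the slides) of the Gini impurity computed from class counts, analogous to the entropy helper above:

```python
def gini(*counts):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions in a node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini(9, 5))   # ~0.459  impurity of the full buys_computer dataset
print(gini(4, 0))   # 0.0     a pure node (e.g. the age = 31..40 branch)
```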
Evaluation Metrics
• Choosing the right evaluation metrics is crucial
after selecting a classification model
• We will cover the most commonly used metrics
for classification tasks
Evaluation Metrics
(Figures adapted from https://www.datacamp.com/blog/classification-machine-learning)
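A short sketch of computing the usual classification metrics, assuming scikit-learn is available; the true and predicted labels below are invented for illustration:

```python
# Sketch: common classification metrics from true vs. predicted labels.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = ["yes", "no", "yes", "yes", "no", "no",  "yes", "no"]
y_pred = ["yes", "no", "no",  "yes", "no", "yes", "yes", "no"]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, pos_label="yes"))
print("recall   :", recall_score(y_true, y_pred, pos_label="yes"))
print("f1       :", f1_score(y_true, y_pred, pos_label="yes"))
print(confusion_matrix(y_true, y_pred, labels=["yes", "no"]))
```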
References:
• https://romisatriawahono.net/lecture/dm/romi-dm-apr2020.pptx
• https://www.geeksforgeeks.org/decision-tree-introduction-example/
• https://www.datacamp.com/blog/classification-machine-learning
• DTS

Thank you