UNIT – IV CLASSIFICATION AND CLUSTERING TECHNIQUES
1. Classification Techniques
Definition: Classification is a supervised learning technique where the goal is
to assign labels (categories) to data based on input features.
A. Decision Trees
Definition: A tree-like structure where data is split based on conditions to
reach a decision.
Components:
o Root Node: Represents the entire dataset and applies the first test.
o Internal Nodes: Represent tests on attributes.
o Leaf Nodes: Represent the outcome (class label).
Advantages:
o Easy to understand and interpret.
o Works well with both numerical and categorical data.
Use Case: Classifying loan applications as "Approved" or "Rejected"
based on income, credit score, etc.
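The root, internal, and leaf nodes described above can be sketched as a small nested structure. This is a minimal illustration only: the tree is written by hand rather than learned from data, and the feature names and thresholds (credit_score 650, income 40000) are hypothetical values chosen for the loan example.

```python
# Hypothetical hand-built tree; thresholds are illustrative, not learned from data.
TREE = {
    "feature": "credit_score", "threshold": 650,   # root node: first test
    "left": {"label": "Rejected"},                 # leaf: credit_score < 650
    "right": {                                     # internal node: second test
        "feature": "income", "threshold": 40000,
        "left": {"label": "Rejected"},             # leaf: income < 40000
        "right": {"label": "Approved"},            # leaf: income >= 40000
    },
}


def classify(node, applicant):
    """Walk from the root to a leaf, applying one attribute test per node."""
    while "label" not in node:
        go_left = applicant[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(classify(TREE, {"credit_score": 720, "income": 55000}))  # Approved
print(classify(TREE, {"credit_score": 600, "income": 90000}))  # Rejected
```

In practice the tests and thresholds would be chosen by a training algorithm from labeled loan records; the traversal logic stays the same.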
B. K-Nearest Neighbors (KNN)
Definition: Classifies a new data point based on the majority class of its
k nearest neighbors in the dataset.
Steps:
1. Choose k (number of neighbors).
2. Calculate the distance (e.g., Euclidean) between the new point and
every point in the training data.
3. Assign the class most common among k neighbors.
Use Case: Recommender systems (e.g., suggesting products similar to
others liked by a user).
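The three KNN steps can be sketched in a few lines of plain Python. The function name knn_classify and the toy two-dimensional points are assumptions made for illustration; Euclidean distance is used because it is the example distance named in the steps.

```python
import math
from collections import Counter


def knn_classify(train, new_point, k=3):
    """train is a list of (features, label) pairs; returns the majority label."""
    # Step 2: sort training points by Euclidean distance to the new point.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], new_point))
    # Step 3: majority vote among the k closest points.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy labeled data (hypothetical).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

print(knn_classify(train, (2.0, 2.0), k=3))  # A
print(knn_classify(train, (6.0, 7.0), k=3))  # B
```

An odd k is a common choice for two-class problems because it avoids tied votes.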
C. Logistic Regression
Definition: A statistical model used for binary classification (e.g.,
Yes/No, 0/1).
Formula:
P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}
Output: Probability of class membership (converted to 0 or 1 using a
threshold).
Use Case: Predicting if a customer will buy a product or not.
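To make the formula and the threshold step concrete, the sketch below evaluates the logistic function for given coefficients and converts the probability to a class. The coefficient values and the two features (age, and whether the customer visited the product page) are hypothetical; in a real model the coefficients are estimated from data.

```python
import math


def predict_proba(x, beta0, betas):
    """P(Y=1) = 1 / (1 + exp(-(beta0 + beta1*x1 + ... + betan*xn)))."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, beta0, betas, threshold=0.5):
    """Convert the probability to 0 or 1 using a threshold."""
    return 1 if predict_proba(x, beta0, betas) >= threshold else 0


# Hypothetical coefficients for features [age, visited_product_page].
beta0, betas = -3.0, [0.05, 1.2]

p = predict_proba([40, 1], beta0, betas)          # z = -3 + 2 + 1.2 = 0.2
print(round(p, 3))                                # 0.55
print(predict_class([40, 1], beta0, betas))       # 1 (predicted to buy)
print(predict_class([20, 0], beta0, betas))       # 0 (z = -2, p is about 0.12)
```

The 0.5 threshold is only a default; it can be moved when false positives and false negatives have different costs.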
D. Discriminant Analysis
Definition: A classification technique that models the difference between
classes using linear combinations of features.
Types:
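o Linear Discriminant Analysis (LDA): Assumes all classes share the
same covariance matrix, which gives linear decision boundaries.
o Quadratic Discriminant Analysis (QDA): Allows each class its own
covariance matrix, which gives quadratic decision boundaries.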
Use Case: Classifying customers into loyalty tiers (e.g., Silver, Gold,
Platinum).
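As a rough illustration of LDA, the sketch below handles the simplest case: a single feature with one variance pooled across all classes. Each class k gets the score x*mu_k/var - mu_k^2/(2*var) + ln(prior_k), and the class with the highest score is chosen. The single feature (annual spend, in thousands) and the sample values are hypothetical, and the function names are made up for this example.

```python
import math


def fit_lda_1d(samples):
    """samples maps class label -> list of values of one feature."""
    n_total = sum(len(v) for v in samples.values())
    means = {c: sum(v) / len(v) for c, v in samples.items()}
    priors = {c: len(v) / n_total for c, v in samples.items()}
    # LDA assumption: one variance shared (pooled) across all classes.
    ss = sum((x - means[c]) ** 2 for c, v in samples.items() for x in v)
    pooled_var = ss / (n_total - len(samples))
    return means, priors, pooled_var


def predict_lda_1d(x, means, priors, pooled_var):
    """Return the class with the largest linear discriminant score."""
    def score(c):
        return (x * means[c] / pooled_var
                - means[c] ** 2 / (2 * pooled_var)
                + math.log(priors[c]))
    return max(means, key=score)


# Hypothetical annual spend (in thousands) per loyalty tier.
spend = {
    "Silver": [1.0, 1.5, 2.0, 2.5],
    "Gold": [4.0, 4.5, 5.0, 5.5],
    "Platinum": [8.0, 8.5, 9.0, 9.5],
}
model = fit_lda_1d(spend)

print(predict_lda_1d(2.0, *model))  # Silver
print(predict_lda_1d(5.2, *model))  # Gold
print(predict_lda_1d(9.0, *model))  # Platinum
```

A QDA version of this sketch would estimate a separate variance for each class instead of pooling them.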
2. Clustering Techniques
Definition: Clustering is an unsupervised learning method where the goal is to
group data points into clusters based on similarity.
Key Concept: Unlike classification, clustering does not use pre-labeled
data.
Popular Methods:
K-Means Clustering:
o Divides data into k clusters by minimizing the sum of squared
distances between data points and their assigned cluster centroids.
o Use Case: Customer segmentation.
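K-means is usually run as an alternation of two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. The sketch below shows that loop in plain Python. Starting from the first k points is an assumption made to keep the example deterministic; practical implementations use better initialization. The toy points are hypothetical.

```python
import math


def kmeans(points, k, iterations=100):
    """Basic k-means loop; returns (centroids, clusters)."""
    centroids = [points[i] for i in range(k)]  # naive init (assumption)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster
            else centroids[i]  # keep the old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


# Toy data: two visibly separate groups (hypothetical).
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(clusters[0])  # [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
print(clusters[1])  # [(9.0, 9.0), (8.0, 9.0), (9.0, 8.0)]
```

The result depends on the starting centroids, so the algorithm is commonly run several times with different starts.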
Hierarchical Clustering:
o Builds a tree of clusters using either a bottom-up (agglomerative)
or top-down (divisive) approach.
o Use Case: Grouping documents or behaviors with unknown
categories.
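The bottom-up (agglomerative) approach can be sketched as follows: start with every point as its own cluster and repeatedly merge the two closest clusters. Single linkage (distance between the two closest members) is an assumption made here for simplicity; other linkage rules exist. The function name and the one-dimensional toy data are hypothetical.

```python
import math


def agglomerative(points, target_clusters=1):
    """Bottom-up clustering with single linkage; returns (clusters, merges)."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []                       # merge history: the data behind a dendrogram
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Toy one-dimensional data (hypothetical).
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
clusters, merges = agglomerative(points, target_clusters=2)
print([d for _, _, d in merges])  # [1.0, 1.5, 4.0]
print(clusters)                   # [[(20.0,)], [(0.0,), (1.0,), (5.0,), (6.5,)]]
```

The recorded merge distances are the heights at which branches join in a dendrogram; cutting the tree at a chosen height gives the final clusters.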
DBSCAN:
o Groups points that lie in dense regions and labels points in sparse
regions as noise; handles arbitrary-shaped clusters and does not
require choosing the number of clusters in advance.
o Use Case: Detecting anomalies or fraud.
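A compact sketch of the DBSCAN idea is shown below. A point with at least min_pts neighbors within radius eps (counting itself) is treated as a core point, clusters grow outward from core points, and anything unreachable is labeled noise. The parameter values and toy points are hypothetical, and the neighbor search is a simple linear scan rather than an optimized index.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already visited
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE             # not dense enough (may become a border point)
            continue
        labels[i] = cluster_id            # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id    # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep growing
        cluster_id += 1
    return labels


# Two dense groups plus one isolated point (hypothetical data).
points = [
    (1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8),
    (5.0, 5.0), (5.1, 5.2), (4.9, 4.8), (5.2, 4.9),
    (10.0, 0.0),
]
print(dbscan(points, eps=0.6, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The point labeled -1 is the kind of isolated observation that an anomaly or fraud check would flag for review.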
3. Market Basket Analysis
Definition: A technique used to find associations between items
purchased together.
Goal: Identify frequent itemsets and generate association rules.
Key Terms:
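Support: Fraction of all transactions that contain the itemset.
Confidence: Likelihood that item B is bought when item A is bought,
i.e., support(A and B) / support(A).
Lift: Strength of a rule relative to chance, i.e., confidence(A → B) /
support(B); a value above 1 means A and B occur together more often
than expected if they were independent.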
Use Case:
In a retail store:
o If customers buy bread and butter → they also tend to buy milk.
o Rule: {Bread, Butter} → {Milk}
Enables cross-selling and store layout optimization.
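The three key terms can be computed directly from a list of transactions. The sketch below does this for the rule {Bread, Butter} → {Milk}, using support as the fraction of transactions containing an itemset, confidence as support(A and B) / support(A), and lift as confidence / support(B). The eight transactions are made-up data for illustration only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """confidence(A -> B) / support(B); above 1 suggests a positive association."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# Made-up transactions for illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"eggs"},
    {"bread", "butter", "milk", "eggs"},
    {"butter"},
]

rule = ({"bread", "butter"}, {"milk"})
print(support(transactions, rule[0] | rule[1]))  # 0.375
print(confidence(transactions, *rule))           # 0.75
print(round(lift(transactions, *rule), 2))       # 1.2
```

Here the lift of 1.2 means milk appears 20% more often in baskets with bread and butter than in baskets overall.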
Comparison Table
Technique               | Type               | Use Case                   | Output
Decision Trees          | Classification     | Loan approval              | Class label
K-Nearest Neighbors     | Classification     | Product recommendation     | Class label
Logistic Regression     | Classification     | Purchase prediction        | Probability + Class
Discriminant Analysis   | Classification     | Customer tier prediction   | Class label
K-Means Clustering      | Clustering         | Market segmentation        | Cluster assignment
Hierarchical Clustering | Clustering         | Document or gene grouping  | Cluster tree (dendrogram)
Market Basket Analysis  | Association Mining | Cross-sell recommendations | Association rules
Suggested Visual Aids for Lecture
Flowchart of decision tree classification
K-means scatterplot with clusters
ROC curve for logistic regression
Dendrogram from hierarchical clustering
Table showing example of association rules: {Milk, Bread} → {Butter}
Conclusion
Classification helps predict known categories using labeled data.
Clustering helps discover hidden groupings without labels.
Market Basket Analysis reveals buying patterns to improve marketing
and sales strategies.