Unit 4-DWDM

Cluster analysis is a technique for grouping similar objects based on their characteristics, utilizing various data types such as interval-scaled, binary, categorical, ordinal, and ratio-scaled data. Different clustering methods, including partitioning, hierarchical, density-based, grid-based, and model-based methods, cater to specific data types and business scenarios, enabling effective segmentation and analysis. Key takeaways emphasize the importance of selecting appropriate clustering algorithms based on data structure and intended outcomes.


📘 Topic: Types of Data in Cluster Analysis

🧠 What is Cluster Analysis?


Cluster analysis is a technique used to group similar objects (customers, products, transactions,
etc.) based on their characteristics. It is unsupervised learning, meaning no predefined labels.

🔍 Types of Data in Cluster Analysis


1. Interval-Scaled Data

💡 Definition: Continuous numeric data where the differences between values are
meaningful.

📊 Example Scenario:
A retail chain wants to segment its stores based on monthly sales and number of
footfalls.

●​ Sales: ₹5,00,000 to ₹25,00,000​

●​ Footfalls: 2,000 to 12,000 visitors/month​

🔍 Why is this Interval-Scaled?


Because the difference between ₹5 lakh and ₹10 lakh is meaningful and measurable, just
like the difference in footfalls. (Strictly speaking, sales also have a true zero, so they could be
treated as ratio-scaled; for clustering purposes, both types are handled the same way.)

💼 Business Insight:
Use K-means clustering on this data to group stores into high-performing, medium, and
low-performing categories. Tailor marketing or resource allocation accordingly.
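
A minimal sketch of this segmentation with scikit-learn; the store figures below are invented for illustration, and standardization matters because sales and footfalls sit on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical stores: [monthly sales in lakh INR, monthly footfalls]
stores = np.array([
    [5.0, 2000], [7.5, 3100], [24.0, 11500],
    [22.5, 10800], [12.0, 6000], [13.5, 6400],
])

# Standardize first, otherwise footfalls (in thousands) dominate
# the Euclidean distances that K-means relies on.
X = StandardScaler().fit_transform(stores)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels)  # e.g., three groups: low, medium, high performers
```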

2. Binary Data
💡 Definition: Variables with only two values – often 0 or 1 (Yes/No, Male/Female,
Defaulted/Not).

📊 Example Scenario:
A credit card company wants to cluster customers based on behavior:

●​ Has defaulted before: Yes (1) / No (0)​

●​ Owns another credit card: Yes (1) / No (0)​

🧠 Subtypes:
●​ Symmetric Binary: Both outcomes equally important (e.g., owns card: yes/no)​

●​ Asymmetric Binary: One outcome is rare and more important (e.g., fraud: yes/no)​

💼 Business Insight:
Binary data helps in identifying risky or loyal customer segments and can guide personalized
communication or credit limits.
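
A small sketch of the symmetric vs. asymmetric distinction using SciPy's distance functions; the two customer records are made up.

```python
from scipy.spatial.distance import hamming, jaccard

# Hypothetical customers: 1/0 flags such as
# [defaulted_before, owns_other_card, ...]
a = [1, 0, 0, 1, 0]
b = [1, 1, 0, 0, 0]

# Symmetric binary: simple matching, i.e. the fraction of
# attributes on which the two customers disagree.
print(hamming(a, b))   # 0.4

# Asymmetric binary: Jaccard distance ignores 0-0 matches, so the
# rare, important "1"s (e.g., default flags) carry all the weight.
print(jaccard(a, b))   # ~0.67
```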

3. Categorical (Nominal) Data

💡 Definition: Data that can be grouped into categories, but not ordered.
📊 Example Scenario:
An e-commerce company wants to cluster customers based on preferred device and
favorite product category.

●​ Device used: Mobile, Laptop, Tablet​

●​ Category: Fashion, Electronics, Books​

🔍 Why Categorical?
You can’t say Mobile > Laptop > Tablet—these are just categories.

💼 Business Insight:
Use mode-based clustering (e.g., k-modes) to understand which customer groups prefer
which channels and what to recommend.
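
A sketch of k-modes, assuming the third-party kmodes package (pip install kmodes); the customer rows are invented.

```python
import numpy as np
from kmodes.kmodes import KModes

# Hypothetical customers: [device used, favourite category]
X = np.array([
    ["Mobile", "Fashion"], ["Mobile", "Fashion"],
    ["Laptop", "Electronics"], ["Laptop", "Books"],
    ["Tablet", "Books"], ["Mobile", "Electronics"],
])

# k-modes swaps means for modes and Euclidean distance for a
# mismatch count, so it works directly on unordered categories.
km = KModes(n_clusters=3, init="Huang", n_init=5)
labels = km.fit_predict(X)
print(labels)
print(km.cluster_centroids_)  # modal device/category per cluster
```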

4. Ordinal Data

💡 Definition: Categorical data that can be ranked but differences between ranks are not
measurable.

📊 Example Scenario:
A hotel chain collects customer satisfaction levels:

●​ Very Unsatisfied (1), Unsatisfied (2), Neutral (3), Satisfied (4), Very Satisfied (5)​

🔍 Why Ordinal?
“Very Satisfied” is better than “Satisfied”, but the gap between levels isn’t numerically equal.

💼 Business Insight:
Cluster feedback responses to identify regions where service is lagging. Use ordinal-aware
methods like Gower’s distance.
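
A sketch of ordinal-aware clustering, assuming the third-party gower package (pip install gower); the survey rows are invented.

```python
import pandas as pd
import gower
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical survey responses: ordinal satisfaction codes plus region.
df = pd.DataFrame({
    "satisfaction": [1, 2, 2, 4, 5, 5],  # 1 = Very Unsatisfied ... 5 = Very Satisfied
    "region": ["North", "North", "South", "South", "West", "West"],
})

# Gower's distance range-scales the numeric/ordinal column and uses
# simple matching for the categorical one, then averages the two.
dist = gower.gower_matrix(df)

# Cluster on the precomputed distances and cut into two groups.
Z = linkage(squareform(dist, checks=False), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))
```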

5. Ratio-Scaled Data

💡 Definition: Like interval-scaled but with a true zero point.


📊 Example Scenario:
A logistics firm is clustering delivery centers based on delivery time and number of
delays.

●​ Delivery time (hours)​

●​ Number of late deliveries​

🔍 Why Ratio?
You can say 10 hours is twice as long as 5 hours. “Zero” means none—a true origin.

💼 Business Insight:
Use clustering to highlight inefficient hubs, optimize logistics, and reduce operational costs.

📌 Summary Table
| Type of Data | Example Features | Real-World Use Case |
|---|---|---|
| Interval-Scaled | Sales, temperature, profit margin | Store performance analysis |
| Binary | Defaulted (Y/N), Gender | Credit risk segmentation |
| Categorical | Product category, Device type | Customer behavior clustering |
| Ordinal | Satisfaction rating, Risk level | Service quality feedback grouping |
| Ratio-Scaled | Age, Income, Delivery time | Logistics optimization, product pricing |

🎓 Key Takeaways for Students


●​ Choosing the right clustering algorithm depends on data type.​

●​ Always preprocess data (normalization, encoding) based on type.​

●​ Use distance measures appropriate to the data (Euclidean for interval, Hamming for
binary, Gower for mixed).​

●​ Understanding data types helps in customer segmentation, fraud detection, feedback analysis, etc.

📘 Topic: A Categorization of Major Clustering Methods


🧠 What is Clustering?
Clustering is a method of grouping data points (e.g., customers, products, branches) such that
objects in the same group (cluster) are more similar to each other than to those in other groups.

🧩 Categorization of Clustering Methods — Explained with Business Scenarios

1. Partitioning Methods

📌 Concept: Divide data into k non-overlapping clusters. Each object belongs to one and only
one cluster.

📊 Scenario:
A bank wants to segment 10,000 customers into 3 groups based on account balance,
transaction frequency, and loan repayment history.

🛠 Common Algorithms:
●​ K-Means​

●​ K-Medoids​

💼 Use Case:
●​ Targeting low-balance, low-activity customers for re-engagement.​

●​ Optimizing financial products for high-value segments.​

📌 Notes:
●​ Works well with numeric data.​

●​ You must specify number of clusters (k) in advance.​


2. Hierarchical Methods

📌 Concept: Create a tree-like structure (dendrogram) of clusters, either by merging smaller


clusters (agglomerative) or splitting larger ones (divisive).

📊 Scenario:
An HR team in a multinational company wants to group employees based on experience,
skills, and appraisal ratings to understand talent clusters across branches.

🛠 Common Algorithms:
●​ Agglomerative Hierarchical Clustering​

●​ DIANA (Divisive Analysis)​

💼 Use Case:
●​ Visualizing career progression groups.​

●​ Identifying potential leaders or mentors.​

📌 Notes:
●​ No need to specify number of clusters initially.​

●​ Good for small to medium datasets.​

3. Density-Based Methods

📌 Concept: Clusters are defined as dense regions of data points, separated by low-density
areas.

📊 Scenario:
An e-commerce platform wants to detect fraudulent transactions that differ from normal
purchasing behavior.

🛠 Common Algorithm:
●​ DBSCAN (Density-Based Spatial Clustering of Applications with Noise)​

💼 Use Case:
●​ Fraud detection.​

●​ Identifying outliers (e.g., unusually high or low-value transactions).​

📌 Notes:
●​ Can find clusters of arbitrary shape.​

●​ Automatically detects noise/outliers.​

●​ Does not require the number of clusters in advance.​

4. Grid-Based Methods

📌 Concept: Data space is divided into a finite number of grid cells, then clustering is
performed on those cells.

📊 Scenario:
A real estate analytics firm is analyzing house prices and sizes across a city grid to suggest
investment zones.

🛠 Common Algorithm:
●​ STING (Statistical Information Grid)​

💼 Use Case:
●​ Geospatial analysis.​

●​ Fast clustering on large datasets with spatial dimensions.​

📌 Notes:
●​ Good for high-dimensional or spatial data.​

●​ Fast and scalable but may lose accuracy in fine-grained patterns.​

5. Model-Based Methods

📌 Concept: Assume a mathematical model (like probability distributions) and fit the data to
that model to form clusters.

📊 Scenario:
A telecom company wants to identify different customer usage profiles based on call
durations, data usage, and plan types.

🛠 Common Algorithm:
●​ Gaussian Mixture Model (GMM)​

💼 Use Case:
●​ Predictive customer segmentation.​

●​ Assigning probabilities of belonging to a cluster (soft clustering).​

📌 Notes:
●​ Useful when data fits known distributions.​

●​ Produces soft clusters (probabilistic memberships).​
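
A minimal sketch of soft clustering with a Gaussian Mixture Model in scikit-learn; the telecom usage figures are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical subscribers: [avg call minutes per day, data GB per month]
X = np.array([
    [5, 1.0], [8, 1.5], [40, 8.0],
    [45, 9.5], [20, 4.0], [22, 5.0],
])

gmm = GaussianMixture(n_components=3, random_state=42).fit(X)

print(gmm.predict(X))                 # hard label per subscriber
print(gmm.predict_proba(X).round(2))  # soft memberships, e.g. a 0.7/0.3 split
```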

📌 Summary Table
| Clustering Method | Key Feature | Example Use Case |
|---|---|---|
| Partitioning | Fixed number of clusters (k) | Customer segmentation for marketing |
| Hierarchical | Dendrogram/tree of clusters | Talent mapping in HR |
| Density-Based | Groups based on density; detects noise | Fraud detection in transactions |
| Grid-Based | Uses grid cells to form clusters | Urban planning / real estate zoning |
| Model-Based | Assumes data from a statistical model | Telecom customer profiling |

🎓 Key Takeaways for Students:


●​ Different clustering methods suit different types of data and objectives.​

●​ Partitioning (like K-Means) is good for marketing and operations.​

●​ Density-based and model-based clustering are valuable for anomaly detection and
predictive modeling.​

●​ As future managers and analysts, choose clustering techniques based on:​

○​ Data structure (numeric, categorical, mixed)​

○​ Purpose (segmentation, fraud detection, profiling)​

○​ Size and shape of data​

📘 Topic: Partitioning Methods in Clustering


🧠 What are Partitioning Methods?
Partitioning methods divide a dataset into k distinct, non-overlapping clusters based on a
measure of similarity (usually distance). The goal is to group similar items together so that those
within a cluster are as similar as possible, and items from different clusters are as dissimilar
as possible.
These methods are suitable for large numeric datasets and typically require the number of
clusters (k) to be defined beforehand.

🔍 Business Scenarios Explaining Partitioning Methods


📊 Scenario 1: Customer Segmentation in Retail
Context: A retail chain wants to categorize its customers based on:

●​ Annual spending​

●​ Frequency of visits​

●​ Loyalty card usage​

Solution:​
Use K-Means Clustering to form groups such as:

●​ High-value customers (frequent visitors, high spenders)​

●​ Occasional shoppers​

●​ Discount seekers​

Decision Making:​
Design loyalty programs and promotions tailored to each segment.

📊 Scenario 2: Startup Hiring Strategy


Context: A tech startup wants to cluster job applicants based on:

●​ Years of experience​

●​ Coding test score​

●​ Interview performance rating​


Solution:​
Use K-Medoids Clustering to avoid the influence of outliers (e.g., one applicant with 25 years
of experience).

Decision Making:​
Identify top-tier applicants, promising learners, and those needing more training.
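
A sketch of this K-Medoids step, assuming the third-party scikit-learn-extra package (pip install scikit-learn-extra); the applicant rows are invented.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids

# Hypothetical applicants: [years of experience, coding score, interview rating]
X = np.array([
    [2, 78, 4.0], [3, 82, 4.2], [1, 55, 3.0],
    [2, 60, 3.2], [25, 80, 4.1],   # the 25-year outlier
    [4, 85, 4.5],
])

# Medoids are actual applicants, so the outlier cannot drag a
# cluster center the way a mean would in K-Means.
km = KMedoids(n_clusters=3, random_state=0).fit(X)
print(km.labels_)
print(km.cluster_centers_)  # each center is a real applicant row
```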

📊 Scenario 3: Financial Risk Profiling


Context: A credit agency wants to assess the risk level of customers applying for loans using:

●​ Income level​

●​ Credit score​

●​ Debt-to-income ratio​

Solution:​
Apply K-Means to segment customers into:

●​ Low-risk​

●​ Moderate-risk​

●​ High-risk​

Decision Making:​
Use the clusters to automate loan approvals, determine interest rates, or flag high-risk
applications for manual review.

🛠 Key Partitioning Algorithms


| Algorithm | Description | Use Case Example |
|---|---|---|
| K-Means | Assigns each data point to the nearest centroid | Customer segmentation |
| K-Medoids | Similar to K-Means but uses actual data points as centers (more robust to outliers) | Applicant profiling |
| CLARANS (Clustering Large Applications based on RANdomized Search) | Extension of K-Medoids for large datasets using random sampling | Market segmentation with millions of records |

📈 Process Flow (K-Means Example)


1.​ Choose k (number of clusters)​

2.​ Randomly assign k centroids​

3.​ Assign each data point to the nearest centroid​

4.​ Recalculate centroids​

5.​ Repeat until centroids stabilize​
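
A from-scratch NumPy sketch of this loop, for intuition only; it assumes no cluster ever empties out, which real implementations (e.g., scikit-learn's KMeans) must handle.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and pick k random points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 3: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate each centroid as the mean of its members.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: repeat until the centroids stabilize.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```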

⚖️ Strengths vs Weaknesses
| Strengths | Weaknesses |
|---|---|
| Simple and fast | Requires predefined k |
| Efficient for large datasets | Sensitive to outliers and noise |
| Good for numerical and structured data | Assumes clusters are spherical |

🧠 Key Takeaways for Students


●​ Partitioning methods are best for quick, effective segmentation when the number of
groups is known.​
●​ K-Means is ideal for marketing and customer analytics.​

●​ K-Medoids works better with data having outliers or non-numeric features.​

●​ Useful for market segmentation, pricing strategies, hiring, credit scoring, etc.​

📘 Topic: Hierarchical Clustering Methods


🧠 What are Hierarchical Methods?
Hierarchical clustering creates a tree-like structure (dendrogram) to show how data points are
grouped together step-by-step. It doesn’t require pre-defining the number of clusters (unlike
K-Means).

There are two approaches:

●​ Agglomerative (bottom-up): Start with each point as its own cluster and merge them.​

●​ Divisive (top-down): Start with all data in one cluster and split iteratively.​

🔍 Business Scenarios for Hierarchical Clustering


📊 Scenario 1: Employee Skill Mapping in a Multinational Company
Context:​
An HR manager in a multinational wants to group employees based on:

●​ Years of experience​

●​ Skill proficiency scores (e.g., in data analysis, leadership, communication)​

Method:​
Use Agglomerative Hierarchical Clustering to discover:
●​ Natural clusters like entry-level, mid-career, and senior experts​

●​ Hidden leadership potential within teams​

Decision Making:​
Use clusters to assign roles, plan career development, and offer targeted training.

📊 Scenario 2: Tour Package Personalization


Context:​
A travel agency wants to group travelers based on:

●​ Preferred destinations​

●​ Travel frequency​

●​ Average budget​

Method:​
Apply Divisive Hierarchical Clustering to break large customer base into increasingly specific
segments.

Decision Making:​
Design personalized holiday packages for each cluster — luxury, adventure, budget-friendly,
etc.

📊 Scenario 3: Retail Store Performance Evaluation


Context:​
A retail company with 200 stores wants to compare performance based on:

●​ Sales volume​

●​ Inventory turnover​

●​ Customer satisfaction scores​

Method:​
Use Agglomerative clustering to build a hierarchy of stores by similarity.
Decision Making:​
Identify benchmark stores, underperformers, and regional trends to guide strategic
changes.

🌳 Visual Insight: The Dendrogram


A dendrogram is a tree diagram that visually represents how clusters are formed.​
You can “cut” the dendrogram at any level to form your desired number of clusters.
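
A quick sketch of building and cutting a dendrogram with SciPy and matplotlib; the store metrics here are randomly generated stand-ins.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Stand-in data: 20 stores x [sales, inventory turnover, satisfaction]
X = np.random.default_rng(42).normal(size=(20, 3))

Z = linkage(X, method="ward")     # agglomerative, Ward's linkage
dendrogram(Z)
plt.axhline(y=5, linestyle="--")  # illustrative "cut" height
plt.show()
```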

🛠 Algorithms & Linkage Methods


| Method | Description | Example Use Case |
|---|---|---|
| Single Linkage | Minimum distance between points in clusters | Risk analysis (minimum overlap) |
| Complete Linkage | Maximum distance between points | Customer similarity (tight groups) |
| Average Linkage | Average distance across all points | Balanced segmentation |
| Ward's Method | Minimizes variance within clusters | Organizational analysis |

⚖️ Strengths vs Limitations
| Strengths | Limitations |
|---|---|
| No need to pre-define number of clusters | Not scalable for very large datasets |
| Produces a clear visual (dendrogram) | Sensitive to outliers and noisy data |
| Works with mixed data types (numeric + categorical, if distance is defined properly) | Computation can be heavy (O(n²)) |

🎓 Key Takeaways for MBA Students


●​ Hierarchical clustering is ideal for decision-making when:​

○​ You want a visual, step-by-step understanding of grouping.​

○​ You’re unsure about how many clusters you need.​

●​ Use in HR, operations, strategy, and customer analysis.​

●​ It’s great for exploratory data analysis—especially on medium-sized datasets.​

🧠 Pro Tip:
If your dataset is too large, perform K-Means first to create small clusters, then apply
hierarchical clustering to those clusters. This hybrid approach is efficient and insightful.
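
A sketch of that hybrid, assuming scikit-learn and SciPy; the data is random filler standing in for a large customer table.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(100_000, 4))  # "too large" dataset

# Step 1: compress 100,000 rows into 200 K-Means centroids.
km = KMeans(n_clusters=200, n_init=3, random_state=0).fit(X)

# Step 2: hierarchical clustering on the centroids only, which is
# cheap compared with running it on every original point.
Z = linkage(km.cluster_centers_, method="ward")
dendrogram(Z, no_labels=True)
plt.show()
```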

📘 Topic: Density-Based Clustering Methods


🧠 What are Density-Based Clustering Methods?
Density-based clustering identifies clusters as dense regions of data points separated by
low-density (sparse) regions. Unlike K-Means or Hierarchical clustering, these methods can
discover arbitrary-shaped clusters and automatically detect outliers.

🚀 Key Idea:
A cluster is formed if a group of points is densely packed (i.e., close enough to each other),
and outliers are left unclustered.

🔍 Business Scenarios for Density-Based Clustering


📊 Scenario 1: Fraud Detection in Banking
Context:​
A bank wants to identify suspicious transactions. Most users show consistent behavior, but
frauds appear as irregular spikes.

Method:​
Apply DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

●​ Legit transactions form dense clusters.​

●​ Fraudulent ones appear as sparse, isolated points (noise).​

Decision Making:​
Flag anomalies for manual audit or block real-time threats.

📊 Scenario 2: Geospatial Store Analysis for Expansion


Context:​
A retail chain wants to find dense shopping zones in a city based on customer location data.

Method:​
Use density-based clustering to identify urban clusters where:
●​ Footfall is dense (high customer presence)​

●​ Some areas are sparse (not suitable for opening stores)​

Decision Making:​
Focus on high-density clusters for store expansion, ignore low-density areas to save cost.

📊 Scenario 3: Social Media Campaign Targeting


Context:​
A brand wants to cluster social media users based on:

●​ Posting frequency​

●​ Sentiment score​

●​ Engagement levels​

Method:​
Apply density-based clustering to:

●​ Identify highly engaged user groups​

●​ Detect bots or spammers as outliers​

Decision Making:​
Target campaigns at engaged clusters and block outliers.

🛠 Common Density-Based Algorithms


| Algorithm | Description | Key Feature |
|---|---|---|
| DBSCAN (most popular) | Groups data based on neighborhood density; no need to specify the number of clusters | Detects outliers |
| OPTICS (Ordering Points To Identify the Clustering Structure) | Extension of DBSCAN; handles varying density | Better for complex data |
| HDBSCAN (Hierarchical DBSCAN) | Combines DBSCAN with hierarchical clustering for better structure | More accurate, less parameter tuning |

📐 Key Parameters in DBSCAN


| Parameter | Meaning |
|---|---|
| ε (Epsilon) | Radius to consider for the neighborhood |
| MinPts | Minimum number of points required to form a dense region |

👉 Too small ε = too many clusters
👉 Too large ε = clusters merge, or everything is one cluster
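
A minimal DBSCAN sketch in scikit-learn showing both parameters; the transactions are invented, and eps/min_samples map to the ε/MinPts described above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Hypothetical transactions: [amount, hour of day]
X = StandardScaler().fit_transform(np.array([
    [200.0, 10], [210.0, 11], [190.0, 10], [205.0, 12],  # dense, legit
    [950.0, 3],                                          # isolated spike
    [180.0, 11], [220.0, 10],
]))

# eps is the neighborhood radius (ε); min_samples is MinPts.
db = DBSCAN(eps=0.8, min_samples=3).fit(X)
print(db.labels_)  # -1 marks noise, i.e. the candidate fraud point
```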

⚖️ Strengths vs Limitations
| Strengths | Limitations |
|---|---|
| No need to specify number of clusters | Choice of ε and MinPts affects outcome |
| Can detect arbitrary shapes | Struggles with high-dimensional data |
| Identifies noise/outliers | Slower on very large datasets |

🎓 Key Takeaways for Students


●​ Density-based methods are best when:​

○​ You expect irregular cluster shapes​

○​ You want to detect outliers​

●​ Widely used in:​

○​ Fraud detection​

○​ Geo-marketing​
○​ Customer behavioral pattern analysis​

●​ Always visualize data before using DBSCAN—results can vary with parameter tuning.​

🧠 Pro Tip:
Use DBSCAN when:

●​ You don’t know how many clusters exist.​

●​ Your clusters are non-spherical.​

●​ You want to exclude outliers from analysis (e.g., bot detection, noise elimination).​

📘 Topic: Grid-Based Clustering Methods


🧠 What are Grid-Based Clustering Methods?
Grid-based clustering divides the data space into a grid structure (cells) and then forms
clusters by analyzing the density of points within these cells. It’s efficient and scalable,
especially for large spatial datasets.

🚀 Key Idea:
●​ The entire data space is partitioned into a finite number of cells, regardless of the
data distribution.​

●​ Clusters are formed by merging adjacent dense cells.​

🔍 Business Scenarios for Grid-Based Clustering


📊 Scenario 1: Urban Planning and Retail Zoning
Context:​
A real estate developer wants to find the best locations for new retail outlets based on
customer movement data (GPS coordinates).

Method:​
Use grid-based clustering to divide the city map into blocks (grids) and find high-density
shopping activity zones.

Decision Making:​
Select top dense zones for retail expansion and avoid low-traffic regions.

📊 Scenario 2: Telecom Tower Optimization


Context:​
A telecom provider wants to optimize the placement of towers based on call drop location
data from across the city.

Method:​
Apply grid-based clustering to identify high-frequency drop zones within defined spatial
grids.

Decision Making:​
Deploy towers in hotspot areas to improve signal coverage and reduce service complaints.

📊 Scenario 3: Online Map Heatmap Visualization


Context:​
An e-commerce logistics company wants to visualize delivery hotspots and failed deliveries
on a city map.

Method:​
Divide the city into grid cells (e.g., 1 km² each). Use clustering to highlight:

●​ Dense delivery zones​

●​ Error-prone delivery areas​


Decision Making:​
Improve last-mile delivery strategies and reroute logistics.

🛠 Common Grid-Based Clustering Algorithms


| Algorithm | Description | Key Feature |
|---|---|---|
| STING (Statistical Information Grid) | Divides the data space hierarchically using statistical summaries | Fast, works with large data |
| CLIQUE (CLustering In QUEst) | Finds dense regions in subspaces of high-dimensional data | Good for high-dimensional problems |
| WaveCluster | Uses wavelet transforms to detect clusters at different resolutions | Good noise filtering and visualization |

📐 How It Works (Conceptually)


1.​ Divide the entire data space into equal-sized cells (grids).​

2.​ Label each cell based on the number of points (density).​

3.​ Identify clusters by connecting adjacent dense cells.​

4.​ Ignore sparse/noisy cells.​
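
A toy version of these four steps using NumPy and SciPy's connected-component labelling; the GPS points are simulated, and real grid methods like STING add hierarchical statistical summaries on top of this basic idea.

```python
import numpy as np
from scipy.ndimage import label

rng = np.random.default_rng(1)
# Simulated GPS points: two dense shopping zones plus scattered noise.
pts = np.vstack([
    rng.normal([2, 2], 0.3, size=(300, 2)),
    rng.normal([7, 6], 0.3, size=(300, 2)),
    rng.uniform(0, 9, size=(50, 2)),
])

# Steps 1-2: divide the space into equal cells and count points per cell.
counts, x_edges, y_edges = np.histogram2d(pts[:, 0], pts[:, 1], bins=20)

# Steps 3-4: keep dense cells, connect adjacent ones, ignore the rest.
dense = counts > 5
clusters, n = label(dense)   # adjacent dense cells share a cluster id
print(f"found {n} grid clusters")  # expected: 2
```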

⚖️ Strengths vs Limitations
| Strengths | Limitations |
|---|---|
| Fast processing time | Quality depends on grid size choice |
| Scalable for large datasets | Not ideal for very irregularly shaped clusters |
| Good for spatial and geographic data | Sensitive to parameter tuning |

🎓 Key Takeaways for Students
●​ Grid-based clustering is best for:​

○​ Geographic data​

○​ Large transactional datasets​

○​ Visual dashboards or heatmaps​

●​ Efficient for real-time spatial analysis, urban development, and logistics planning​

●​ Choose grid size carefully – too fine = noise, too broad = oversimplified​

🧠 Pro Tip:
Grid-based clustering is especially useful when:

●​ You need quick insights from massive spatial data​

●​ You want to display patterns visually (e.g., heatmaps)​

●​ You want to avoid complex distance calculations like in K-Means​
