Unit 4-DWDM

Cluster analysis is a technique for grouping similar objects based on their characteristics, utilizing various data types such as interval-scaled, binary, categorical, ordinal, and ratio-scaled data. Different clustering methods, including partitioning, hierarchical, density-based, grid-based, and model-based methods, cater to specific data types and business scenarios, enabling effective segmentation and analysis. Key takeaways emphasize the importance of selecting appropriate clustering algorithms based on data structure and intended outcomes.


📘 Topic: Types of Data in Cluster Analysis

🧠 What is Cluster Analysis?


Cluster analysis is a technique used to group similar objects (customers, products, transactions,
etc.) based on their characteristics. It is unsupervised learning, meaning no predefined labels.

🔍 Types of Data in Cluster Analysis


1. Interval-Scaled Data

💡 Definition: Continuous numeric data where the differences between values are
meaningful.

📊 Example Scenario:
A retail chain wants to segment its stores based on monthly sales and number of
footfalls.

●​ Sales: ₹5,00,000 to ₹25,00,000​

●​ Footfalls: 2,000 to 12,000 visitors/month​

🔍 Why is this Interval-Scaled?


Because the difference between ₹5 lakh and ₹10 lakh is meaningful and measurable, just
like the difference in footfalls. (Strictly speaking, sales also have a true zero, so they could be
treated as ratio-scaled; for clustering purposes, both types are handled the same way.)

💼 Business Insight:
Use K-means clustering on this data to group stores into high-performing, medium, and
low-performing categories. Tailor marketing or resource allocation accordingly.
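
A minimal sketch of this segmentation with scikit-learn; the store figures below are invented for illustration, and standardization matters because sales and footfalls sit on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical stores: [monthly sales in lakh INR, monthly footfalls]
stores = np.array([
    [5.0, 2000], [7.5, 3100], [24.0, 11500],
    [22.5, 10800], [12.0, 6000], [13.5, 6400],
])

# Standardize first, otherwise footfalls (in thousands) dominate
# the Euclidean distances that K-means relies on.
X = StandardScaler().fit_transform(stores)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels)  # e.g., three groups: low, medium, high performers
```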

2. Binary Data
💡 Definition: Variables with only two values – often 0 or 1 (Yes/No, Male/Female,
Defaulted/Not).

📊 Example Scenario:
A credit card company wants to cluster customers based on behavior:

●​ Has defaulted before: Yes (1) / No (0)​

●​ Owns another credit card: Yes (1) / No (0)​

🧠 Subtypes:
●​ Symmetric Binary: Both outcomes equally important (e.g., owns card: yes/no)​

●​ Asymmetric Binary: One outcome is rare and more important (e.g., fraud: yes/no)​

💼 Business Insight:
Binary data helps in identifying risky or loyal customer segments and can guide personalized
communication or credit limits.
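
A small sketch of the symmetric vs. asymmetric distinction using SciPy's distance functions; the two customer records are made up.

```python
from scipy.spatial.distance import hamming, jaccard

# Hypothetical customers: 1/0 flags such as
# [defaulted_before, owns_other_card, ...]
a = [1, 0, 0, 1, 0]
b = [1, 1, 0, 0, 0]

# Symmetric binary: simple matching, i.e. the fraction of
# attributes on which the two customers disagree.
print(hamming(a, b))   # 0.4

# Asymmetric binary: Jaccard distance ignores 0-0 matches, so the
# rare, important "1"s (e.g., default flags) carry all the weight.
print(jaccard(a, b))   # ~0.67
```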

3. Categorical (Nominal) Data

💡 Definition: Data that can be grouped into categories, but not ordered.
📊 Example Scenario:
An e-commerce company wants to cluster customers based on preferred device and
favorite product category.

●​ Device used: Mobile, Laptop, Tablet​

●​ Category: Fashion, Electronics, Books​

🔍 Why Categorical?
You can’t say Mobile > Laptop > Tablet—these are just categories.

💼 Business Insight:
Use mode-based clustering (e.g., k-modes) to understand which customer groups prefer
which channels and what to recommend.
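
A sketch of k-modes, assuming the third-party kmodes package (pip install kmodes); the customer rows are invented.

```python
import numpy as np
from kmodes.kmodes import KModes

# Hypothetical customers: [device used, favourite category]
X = np.array([
    ["Mobile", "Fashion"], ["Mobile", "Fashion"],
    ["Laptop", "Electronics"], ["Laptop", "Books"],
    ["Tablet", "Books"], ["Mobile", "Electronics"],
])

# k-modes swaps means for modes and Euclidean distance for a
# mismatch count, so it works directly on unordered categories.
km = KModes(n_clusters=3, init="Huang", n_init=5)
labels = km.fit_predict(X)
print(labels)
print(km.cluster_centroids_)  # modal device/category per cluster
```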

4. Ordinal Data

💡 Definition: Categorical data that can be ranked but differences between ranks are not
measurable.

📊 Example Scenario:
A hotel chain collects customer satisfaction levels:

●​ Very Unsatisfied (1), Unsatisfied (2), Neutral (3), Satisfied (4), Very Satisfied (5)​

🔍 Why Ordinal?
“Very Satisfied” is better than “Satisfied”, but the gap between levels isn’t numerically equal.

💼 Business Insight:
Cluster feedback responses to identify regions where service is lagging. Use ordinal-aware
methods like Gower’s distance.
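
A sketch of ordinal-aware clustering, assuming the third-party gower package (pip install gower); the survey rows are invented.

```python
import pandas as pd
import gower
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical survey responses: ordinal satisfaction codes plus region.
df = pd.DataFrame({
    "satisfaction": [1, 2, 2, 4, 5, 5],  # 1 = Very Unsatisfied ... 5 = Very Satisfied
    "region": ["North", "North", "South", "South", "West", "West"],
})

# Gower's distance range-scales the numeric/ordinal column and uses
# simple matching for the categorical one, then averages the two.
dist = gower.gower_matrix(df)

# Cluster on the precomputed distances and cut into two groups.
Z = linkage(squareform(dist, checks=False), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))
```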

5. Ratio-Scaled Data

💡 Definition: Like interval-scaled but with a true zero point.


📊 Example Scenario:
A logistics firm is clustering delivery centers based on delivery time and number of
delays.

●​ Delivery time (hours)​

●​ Number of late deliveries​

🔍 Why Ratio?
You can say 10 hours is twice as long as 5 hours. “Zero” means none—a true origin.

💼 Business Insight:
Use clustering to highlight inefficient hubs, optimize logistics, and reduce operational costs.

📌 Summary Table
| Type of Data | Example Features | Real-World Use Case |
|---|---|---|
| Interval-Scaled | Sales, temperature, profit margin | Store performance analysis |
| Binary | Defaulted (Y/N), Gender | Credit risk segmentation |
| Categorical | Product category, Device type | Customer behavior clustering |
| Ordinal | Satisfaction rating, Risk level | Service quality feedback grouping |
| Ratio-Scaled | Age, Income, Delivery time | Logistics optimization, product pricing |

🎓 Key Takeaways for Students


●​ Choosing the right clustering algorithm depends on data type.​

●​ Always preprocess data (normalization, encoding) based on type.​

●​ Use distance measures appropriate to the data (Euclidean for interval, Hamming for
binary, Gower for mixed).​

●​ Understanding data types helps in customer segmentation, fraud detection, feedback analysis, etc.

📘 Topic: A Categorization of Major Clustering Methods


🧠 What is Clustering?
Clustering is a method of grouping data points (e.g., customers, products, branches) such that
objects in the same group (cluster) are more similar to each other than to those in other groups.

🧩 Categorization of Clustering Methods — Explained with Business Scenarios

1. Partitioning Methods

📌 Concept: Divide data into k non-overlapping clusters. Each object belongs to one and only
one cluster.

📊 Scenario:
A bank wants to segment 10,000 customers into 3 groups based on account balance,
transaction frequency, and loan repayment history.

🛠 Common Algorithms:
●​ K-Means​

●​ K-Medoids​

💼 Use Case:
●​ Targeting low-balance, low-activity customers for re-engagement.​

●​ Optimizing financial products for high-value segments.​

📌 Notes:
●​ Works well with numeric data.​

●​ You must specify number of clusters (k) in advance.​


2. Hierarchical Methods

📌 Concept: Create a tree-like structure (dendrogram) of clusters, either by merging smaller


clusters (agglomerative) or splitting larger ones (divisive).

📊 Scenario:
An HR team in a multinational company wants to group employees based on experience,
skills, and appraisal ratings to understand talent clusters across branches.

🛠 Common Algorithms:
●​ Agglomerative Hierarchical Clustering​

●​ DIANA (Divisive Analysis)​

💼 Use Case:
●​ Visualizing career progression groups.​

●​ Identifying potential leaders or mentors.​

📌 Notes:
●​ No need to specify number of clusters initially.​

●​ Good for small to medium datasets.​

3. Density-Based Methods

📌 Concept: Clusters are defined as dense regions of data points, separated by low-density
areas.

📊 Scenario:
An e-commerce platform wants to detect fraudulent transactions that differ from normal
purchasing behavior.

🛠 Common Algorithm:
●​ DBSCAN (Density-Based Spatial Clustering of Applications with Noise)​

💼 Use Case:
●​ Fraud detection.​

●​ Identifying outliers (e.g., unusually high or low-value transactions).​

📌 Notes:
●​ Can find clusters of arbitrary shape.​

●​ Automatically detects noise/outliers.​

●​ Does not require the number of clusters in advance.​

4. Grid-Based Methods

📌 Concept: Data space is divided into a finite number of grid cells, then clustering is
performed on those cells.

📊 Scenario:
A real estate analytics firm is analyzing house prices and sizes across a city grid to suggest
investment zones.

🛠 Common Algorithm:
●​ STING (Statistical Information Grid)​

💼 Use Case:
●​ Geospatial analysis.​

●​ Fast clustering on large datasets with spatial dimensions.​

📌 Notes:
●​ Good for high-dimensional or spatial data.​

●​ Fast and scalable but may lose accuracy in fine-grained patterns.​

5. Model-Based Methods

📌 Concept: Assume a mathematical model (like probability distributions) and fit the data to
that model to form clusters.

📊 Scenario:
A telecom company wants to identify different customer usage profiles based on call
durations, data usage, and plan types.

🛠 Common Algorithm:
●​ Gaussian Mixture Model (GMM)​

💼 Use Case:
●​ Predictive customer segmentation.​

●​ Assigning probabilities of belonging to a cluster (soft clustering).​

📌 Notes:
●​ Useful when data fits known distributions.​

●​ Produces soft clusters (probabilistic memberships).​
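
A minimal sketch of soft clustering with a Gaussian Mixture Model in scikit-learn; the telecom usage figures are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical subscribers: [avg call minutes per day, data GB per month]
X = np.array([
    [5, 1.0], [8, 1.5], [40, 8.0],
    [45, 9.5], [20, 4.0], [22, 5.0],
])

gmm = GaussianMixture(n_components=3, random_state=42).fit(X)

print(gmm.predict(X))                 # hard label per subscriber
print(gmm.predict_proba(X).round(2))  # soft memberships, e.g. a 0.7/0.3 split
```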

📌 Summary Table
| Clustering Method | Key Feature | Example Use Case |
|---|---|---|
| Partitioning | Fixed number of clusters (k) | Customer segmentation for marketing |
| Hierarchical | Dendrogram/tree of clusters | Talent mapping in HR |
| Density-Based | Groups based on density; detects noise | Fraud detection in transactions |
| Grid-Based | Uses grid cells to form clusters | Urban planning / real estate zoning |
| Model-Based | Assumes data from a statistical model | Telecom customer profiling |

🎓 Key Takeaways for Students:


●​ Different clustering methods suit different types of data and objectives.​

●​ Partitioning (like K-Means) is good for marketing and operations.​

●​ Density-based and model-based clustering are valuable for anomaly detection and
predictive modeling.​

●​ As future managers and analysts, choose clustering techniques based on:​

○​ Data structure (numeric, categorical, mixed)​

○​ Purpose (segmentation, fraud detection, profiling)​

○​ Size and shape of data​

📘 Topic: Partitioning Methods in Clustering


🧠 What are Partitioning Methods?
Partitioning methods divide a dataset into k distinct, non-overlapping clusters based on a
measure of similarity (usually distance). The goal is to group similar items together so that those
within a cluster are as similar as possible, and items from different clusters are as dissimilar
as possible.
These methods are suitable for large numeric datasets and typically require the number of
clusters (k) to be defined beforehand.

🔍 Business Scenarios Explaining Partitioning Methods


📊 Scenario 1: Customer Segmentation in Retail
Context: A retail chain wants to categorize its customers based on:

●​ Annual spending​

●​ Frequency of visits​

●​ Loyalty card usage​

Solution:​
Use K-Means Clustering to form groups such as:

●​ High-value customers (frequent visitors, high spenders)​

●​ Occasional shoppers​

●​ Discount seekers​

Decision Making:​
Design loyalty programs and promotions tailored to each segment.

📊 Scenario 2: Startup Hiring Strategy


Context: A tech startup wants to cluster job applicants based on:

●​ Years of experience​

●​ Coding test score​

●​ Interview performance rating​


Solution:​
Use K-Medoids Clustering to avoid the influence of outliers (e.g., one applicant with 25 years
of experience).

Decision Making:​
Identify top-tier applicants, promising learners, and those needing more training.
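
A sketch of this K-Medoids step, assuming the third-party scikit-learn-extra package (pip install scikit-learn-extra); the applicant rows are invented.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids

# Hypothetical applicants: [years of experience, coding score, interview rating]
X = np.array([
    [2, 78, 4.0], [3, 82, 4.2], [1, 55, 3.0],
    [2, 60, 3.2], [25, 80, 4.1],   # the 25-year outlier
    [4, 85, 4.5],
])

# Medoids are actual applicants, so the outlier cannot drag a
# cluster center the way a mean would in K-Means.
km = KMedoids(n_clusters=3, random_state=0).fit(X)
print(km.labels_)
print(km.cluster_centers_)  # each center is a real applicant row
```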

📊 Scenario 3: Financial Risk Profiling


Context: A credit agency wants to assess the risk level of customers applying for loans using:

●​ Income level​

●​ Credit score​

●​ Debt-to-income ratio​

Solution:​
Apply K-Means to segment customers into:

●​ Low-risk​

●​ Moderate-risk​

●​ High-risk​

Decision Making:​
Use the clusters to automate loan approvals, determine interest rates, or flag high-risk
applications for manual review.

🛠 Key Partitioning Algorithms


| Algorithm | Description | Use Case Example |
|---|---|---|
| K-Means | Assigns each data point to the nearest centroid | Customer segmentation |
| K-Medoids | Similar to K-Means but uses actual data points as centers (more robust to outliers) | Applicant profiling |
| CLARANS (Clustering Large Applications based on RANdomized Search) | Extension of K-Medoids for large datasets using random sampling | Market segmentation with millions of records |

📈 Process Flow (K-Means Example)


1.​ Choose k (number of clusters)​

2.​ Randomly assign k centroids​

3.​ Assign each data point to the nearest centroid​

4.​ Recalculate centroids​

5.​ Repeat until centroids stabilize​
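
A from-scratch NumPy sketch of this loop, for intuition only; it assumes no cluster ever empties out, which real implementations (e.g., scikit-learn's KMeans) must handle.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and pick k random points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 3: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate each centroid as the mean of its members.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: repeat until the centroids stabilize.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```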

⚖️ Strengths vs Weaknesses
| Strengths | Weaknesses |
|---|---|
| Simple and fast | Requires predefined k |
| Efficient for large datasets | Sensitive to outliers and noise |
| Good for numerical and structured data | Assumes clusters are spherical |

🧠 Key Takeaways for Students


●​ Partitioning methods are best for quick, effective segmentation when the number of
groups is known.​
●​ K-Means is ideal for marketing and customer analytics.​

●​ K-Medoids works better with data having outliers or non-numeric features.​

●​ Useful for market segmentation, pricing strategies, hiring, credit scoring, etc.​

📘 Topic: Hierarchical Clustering Methods


🧠 What are Hierarchical Methods?
Hierarchical clustering creates a tree-like structure (dendrogram) to show how data points are
grouped together step-by-step. It doesn’t require pre-defining the number of clusters (unlike
K-Means).

There are two approaches:

●​ Agglomerative (bottom-up): Start with each point as its own cluster and merge them.​

●​ Divisive (top-down): Start with all data in one cluster and split iteratively.​

🔍 Business Scenarios for Hierarchical Clustering


📊 Scenario 1: Employee Skill Mapping in a Multinational Company
Context:​
An HR manager in a multinational wants to group employees based on:

●​ Years of experience​

●​ Skill proficiency scores (e.g., in data analysis, leadership, communication)​

Method:​
Use Agglomerative Hierarchical Clustering to discover:
●​ Natural clusters like entry-level, mid-career, and senior experts​

●​ Hidden leadership potential within teams​

Decision Making:​
Use clusters to assign roles, plan career development, and offer targeted training.

📊 Scenario 2: Tour Package Personalization


Context:​
A travel agency wants to group travelers based on:

●​ Preferred destinations​

●​ Travel frequency​

●​ Average budget​

Method:​
Apply Divisive Hierarchical Clustering to break large customer base into increasingly specific
segments.

Decision Making:​
Design personalized holiday packages for each cluster — luxury, adventure, budget-friendly,
etc.

📊 Scenario 3: Retail Store Performance Evaluation


Context:​
A retail company with 200 stores wants to compare performance based on:

●​ Sales volume​

●​ Inventory turnover​

●​ Customer satisfaction scores​

Method:​
Use Agglomerative clustering to build a hierarchy of stores by similarity.
Decision Making:​
Identify benchmark stores, underperformers, and regional trends to guide strategic
changes.

🌳 Visual Insight: The Dendrogram


A dendrogram is a tree diagram that visually represents how clusters are formed.​
You can “cut” the dendrogram at any level to form your desired number of clusters.
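
A quick sketch of building and cutting a dendrogram with SciPy and matplotlib; the store metrics here are randomly generated stand-ins.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Stand-in data: 20 stores x [sales, inventory turnover, satisfaction]
X = np.random.default_rng(42).normal(size=(20, 3))

Z = linkage(X, method="ward")     # agglomerative, Ward's linkage
dendrogram(Z)
plt.axhline(y=5, linestyle="--")  # illustrative "cut" height
plt.show()
```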

🛠 Algorithms & Linkage Methods


| Method | Description | Example Use Case |
|---|---|---|
| Single Linkage | Minimum distance between points in clusters | Risk analysis (minimum overlap) |
| Complete Linkage | Maximum distance between points | Customer similarity (tight groups) |
| Average Linkage | Average distance across all points | Balanced segmentation |
| Ward's Method | Minimizes variance within clusters | Organizational analysis |

⚖️ Strengths vs Limitations
| Strengths | Limitations |
|---|---|
| No need to pre-define number of clusters | Not scalable for very large datasets |
| Produces a clear visual (dendrogram) | Sensitive to outliers and noisy data |
| Works with mixed data types (numeric + categorical, if distance is defined properly) | Computation can be heavy (O(n²)) |

🎓 Key Takeaways for MBA Students


●​ Hierarchical clustering is ideal for decision-making when:​

○​ You want a visual, step-by-step understanding of grouping.​

○​ You’re unsure about how many clusters you need.​

●​ Use in HR, operations, strategy, and customer analysis.​

●​ It’s great for exploratory data analysis—especially on medium-sized datasets.​

🧠 Pro Tip:
If your dataset is too large, perform K-Means first to create small clusters, then apply
hierarchical clustering to those clusters. This hybrid approach is efficient and insightful.
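
A sketch of that hybrid, assuming scikit-learn and SciPy; the data is random filler standing in for a large customer table.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(100_000, 4))  # "too large" dataset

# Step 1: compress 100,000 rows into 200 K-Means centroids.
km = KMeans(n_clusters=200, n_init=3, random_state=0).fit(X)

# Step 2: hierarchical clustering on the centroids only, which is
# cheap compared with running it on every original point.
Z = linkage(km.cluster_centers_, method="ward")
dendrogram(Z, no_labels=True)
plt.show()
```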

📘 Topic: Density-Based Clustering Methods


🧠 What are Density-Based Clustering Methods?
Density-based clustering identifies clusters as dense regions of data points separated by
low-density (sparse) regions. Unlike K-Means or Hierarchical clustering, these methods can
discover arbitrary-shaped clusters and automatically detect outliers.

🚀 Key Idea:
A cluster is formed if a group of points is densely packed (i.e., close enough to each other),
and outliers are left unclustered.

🔍 Business Scenarios for Density-Based Clustering


📊 Scenario 1: Fraud Detection in Banking
Context:​
A bank wants to identify suspicious transactions. Most users show consistent behavior, but
frauds appear as irregular spikes.

Method:​
Apply DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

●​ Legit transactions form dense clusters.​

●​ Fraudulent ones appear as sparse, isolated points (noise).​

Decision Making:​
Flag anomalies for manual audit or block real-time threats.

📊 Scenario 2: Geospatial Store Analysis for Expansion


Context:​
A retail chain wants to find dense shopping zones in a city based on customer location data.

Method:​
Use density-based clustering to identify urban clusters where:
●​ Footfall is dense (high customer presence)​

●​ Some areas are sparse (not suitable for opening stores)​

Decision Making:​
Focus on high-density clusters for store expansion, ignore low-density areas to save cost.

📊 Scenario 3: Social Media Campaign Targeting


Context:​
A brand wants to cluster social media users based on:

●​ Posting frequency​

●​ Sentiment score​

●​ Engagement levels​

Method:​
Apply density-based clustering to:

●​ Identify highly engaged user groups​

●​ Detect bots or spammers as outliers​

Decision Making:​
Target campaigns at engaged clusters and block outliers.

🛠 Common Density-Based Algorithms


| Algorithm | Description | Key Feature |
|---|---|---|
| DBSCAN (most popular) | Groups data based on neighborhood density; no need to specify the number of clusters | Detects outliers |
| OPTICS (Ordering Points To Identify the Clustering Structure) | Extension of DBSCAN; handles varying density | Better for complex data |
| HDBSCAN (Hierarchical DBSCAN) | Combines DBSCAN with hierarchical clustering for better structure | More accurate, less parameter tuning |

📐 Key Parameters in DBSCAN


| Parameter | Meaning |
|---|---|
| ε (Epsilon) | Radius to consider for the neighborhood |
| MinPts | Minimum number of points required to form a dense region |

👉 Too small ε = too many clusters
👉 Too large ε = clusters merge, or everything is one cluster
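
A minimal DBSCAN sketch in scikit-learn showing both parameters; the transactions are invented, and eps/min_samples map to the ε/MinPts described above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Hypothetical transactions: [amount, hour of day]
X = StandardScaler().fit_transform(np.array([
    [200.0, 10], [210.0, 11], [190.0, 10], [205.0, 12],  # dense, legit
    [950.0, 3],                                          # isolated spike
    [180.0, 11], [220.0, 10],
]))

# eps is the neighborhood radius (ε); min_samples is MinPts.
db = DBSCAN(eps=0.8, min_samples=3).fit(X)
print(db.labels_)  # -1 marks noise, i.e. the candidate fraud point
```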

⚖️ Strengths vs Limitations
| Strengths | Limitations |
|---|---|
| No need to specify number of clusters | Choice of ε and MinPts affects outcome |
| Can detect arbitrary shapes | Struggles with high-dimensional data |
| Identifies noise/outliers | Slower on very large datasets |

🎓 Key Takeaways for Students


●​ Density-based methods are best when:​

○​ You expect irregular cluster shapes​

○​ You want to detect outliers​

●​ Widely used in:​

○​ Fraud detection​

○​ Geo-marketing​
○​ Customer behavioral pattern analysis​

●​ Always visualize data before using DBSCAN—results can vary with parameter tuning.​

🧠 Pro Tip:
Use DBSCAN when:

●​ You don’t know how many clusters exist.​

●​ Your clusters are non-spherical.​

●​ You want to exclude outliers from analysis (e.g., bot detection, noise elimination).​

📘 Topic: Grid-Based Clustering Methods


🧠 What are Grid-Based Clustering Methods?
Grid-based clustering divides the data space into a grid structure (cells) and then forms
clusters by analyzing the density of points within these cells. It’s efficient and scalable,
especially for large spatial datasets.

🚀 Key Idea:
●​ The entire data space is partitioned into a finite number of cells, regardless of the
data distribution.​

●​ Clusters are formed by merging adjacent dense cells.​

🔍 Business Scenarios for Grid-Based Clustering


📊 Scenario 1: Urban Planning and Retail Zoning
Context:​
A real estate developer wants to find the best locations for new retail outlets based on
customer movement data (GPS coordinates).

Method:​
Use grid-based clustering to divide the city map into blocks (grids) and find high-density
shopping activity zones.

Decision Making:​
Select top dense zones for retail expansion and avoid low-traffic regions.

📊 Scenario 2: Telecom Tower Optimization


Context:​
A telecom provider wants to optimize the placement of towers based on call drop location
data from across the city.

Method:​
Apply grid-based clustering to identify high-frequency drop zones within defined spatial
grids.

Decision Making:​
Deploy towers in hotspot areas to improve signal coverage and reduce service complaints.

📊 Scenario 3: Online Map Heatmap Visualization


Context:​
An e-commerce logistics company wants to visualize delivery hotspots and failed deliveries
on a city map.

Method:​
Divide the city into grid cells (e.g., 1 km² each). Use clustering to highlight:

●​ Dense delivery zones​

●​ Error-prone delivery areas​


Decision Making:​
Improve last-mile delivery strategies and reroute logistics.

🛠 Common Grid-Based Clustering Algorithms


| Algorithm | Description | Key Feature |
|---|---|---|
| STING (Statistical Information Grid) | Divides the data space hierarchically using statistical summaries | Fast, works with large data |
| CLIQUE (CLustering In QUEst) | Finds dense regions in subspaces of high-dimensional data | Good for high-dimensional problems |
| WaveCluster | Uses wavelet transforms to detect clusters at different resolutions | Good noise filtering and visualization |

📐 How It Works (Conceptually)


1.​ Divide the entire data space into equal-sized cells (grids).​

2.​ Label each cell based on the number of points (density).​

3.​ Identify clusters by connecting adjacent dense cells.​

4.​ Ignore sparse/noisy cells.​
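
A toy version of these four steps using NumPy and SciPy's connected-component labelling; the GPS points are simulated, and real grid methods like STING add hierarchical statistical summaries on top of this basic idea.

```python
import numpy as np
from scipy.ndimage import label

rng = np.random.default_rng(1)
# Simulated GPS points: two dense shopping zones plus scattered noise.
pts = np.vstack([
    rng.normal([2, 2], 0.3, size=(300, 2)),
    rng.normal([7, 6], 0.3, size=(300, 2)),
    rng.uniform(0, 9, size=(50, 2)),
])

# Steps 1-2: divide the space into equal cells and count points per cell.
counts, x_edges, y_edges = np.histogram2d(pts[:, 0], pts[:, 1], bins=20)

# Steps 3-4: keep dense cells, connect adjacent ones, ignore the rest.
dense = counts > 5
clusters, n = label(dense)   # adjacent dense cells share a cluster id
print(f"found {n} grid clusters")  # expected: 2
```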

⚖️ Strengths vs Limitations
| Strengths | Limitations |
|---|---|
| Fast processing time | Quality depends on grid size choice |
| Scalable for large datasets | Not ideal for very irregularly shaped clusters |
| Good for spatial and geographic data | Sensitive to parameter tuning |

🎓 Key Takeaways for Students
●​ Grid-based clustering is best for:​

○​ Geographic data​

○​ Large transactional datasets​

○​ Visual dashboards or heatmaps​

●​ Efficient for real-time spatial analysis, urban development, and logistics planning​

●​ Choose grid size carefully – too fine = noise, too broad = oversimplified​

🧠 Pro Tip:
Grid-based clustering is especially useful when:

●​ You need quick insights from massive spatial data​

●​ You want to display patterns visually (e.g., heatmaps)​

●​ You want to avoid complex distance calculations like in K-Means​
