Unit 1: Introduction to Data Warehousing and Business Analysis

(Question Bank with Key Points)


Each question is listed with its Bloom's Taxonomy Level (BTL) and the key points expected in the answer.

1. Explain the difference between OLAP and OLTP with examples. (BTL 2)
   - Definitions of OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing)
   - Comparison in terms of purpose, data type, users, and performance
   - Examples of OLAP and OLTP applications

2. Describe the main components of a Data Warehouse. (BTL 1)
   - Explanation of components: ETL (Extract, Transform, Load), Data Storage, Metadata, and Access Tools
   - Role of each component in Data Warehousing
   - Importance of Data Integration and Quality

3. Outline the architecture of a Data Warehouse. (BTL 2)
   - Explanation of architecture layers: Data Sources, Staging Area, Data Storage, Presentation Layer
   - Components in each layer and their roles
   - Example of a typical data flow in the architecture

4. Why is Data Mining important in business analysis? (BTL 4)
   - Importance of discovering patterns and trends in large datasets
   - Role in decision-making and competitive advantage
   - Examples of data mining applications in business

5. Define Data Mining and its process steps. (BTL 1)
   - Definition and importance of Data Mining
   - Process steps: Data Collection, Data Processing, Data Mining, Evaluation, Deployment
   - Tools and technologies used

6. List and describe various data types that can be mined. (BTL 2)
   - Structured and unstructured data
   - Examples of data types like text, multimedia, web data
   - Applications for each data type

7. What patterns can be mined from data? (BTL 4)
   - Explanation of patterns: associations, classifications, clusters, outliers
   - Example of each pattern type
   - Importance in business insights

8. List technologies that support Data Mining. (BTL 1)
   - Technologies: Machine Learning, AI, Statistics, Database Systems
   - Brief role of each technology
   - Integration in Data Mining

9. Discuss applications targeted by Data Mining. (BTL 3)
   - Application areas: Retail, Finance, Healthcare, E-commerce
   - Benefits and limitations in each area
   - Case examples

10. Identify major issues faced in Data Mining. (BTL 5)
    - Issues: data quality, scalability, data privacy, interpretation
    - Impact of each issue on data analysis
    - Solutions to mitigate issues

11. Describe the concept of data objects and attribute types. (BTL 2)
    - Data objects: entities and their instances
    - Types of attributes: nominal, ordinal, interval, ratio
    - Examples of each type

12. How can data visualization aid in data understanding? (BTL 3)
    - Role in presenting data insights
    - Visualization techniques like charts, graphs, and dashboards
    - Advantages in interpretation

13. Explain similarity and dissimilarity measurements in data. (BTL 3)
    - Concepts of similarity and dissimilarity
    - Calculation methods: Euclidean, Cosine, Jaccard
    - Application in clustering and classification

14. Provide basic statistical descriptions of data. (BTL 2)
    - Descriptive statistics: mean, median, mode, variance, standard deviation
    - Applications in data understanding
    - Role in data pre-processing

15. How do data warehousing and OLAP support business intelligence? (BTL 4)
    - Explanation of business intelligence
    - Role of Data Warehousing and OLAP in decision support
    - Examples and case studies

Example problems

Problem Statement: A retail company wants to improve its customer insights by implementing a Data Warehouse and applying Data Mining techniques.
(a) Design a simple Data Warehouse architecture including OLAP and OLTP components.
(b) Identify potential data mining tasks for customer behavior analysis and outline the types of patterns that can be mined.
(c) Discuss data similarity and dissimilarity measures for clustering customer segments based on transaction history.

Solution

Step 1: Data Warehouse Architecture
- OLTP Component: Stores real-time data such as sales transactions.
- ETL Process: Extracts, transforms, and loads data into the data warehouse from the OLTP databases.
- Data Warehouse Layer: Central repository storing historical customer data.
- OLAP Component: Supports multidimensional analysis for business intelligence reports, such as sales trends and customer segmentation.

Step 2: Data Mining Tasks
- Pattern Discovery: Identify patterns in customer purchase frequency, preferences, and spending behavior.
- Classification: Classify customers into categories (e.g., high-value customers).
- Association Analysis: Find product association patterns (e.g., customers buying product A often buy product B).

Step 3: Similarity and Dissimilarity Measures
- Similarity: Use cosine similarity to compare customer purchase vectors (e.g., baskets of items purchased).
- Dissimilarity: Use Euclidean distance to measure differences in transaction amounts or frequency.

Key Points
1. OLAP vs. OLTP: OLAP is for analysis; OLTP is for real-time transactions.
2. Data Warehouse Structure: Consists of ETL, a central repository, and OLAP systems.
3. Data Mining Patterns: Patterns include associations, sequences, classifications, and clusters.
4. Similarity and Dissimilarity: Metrics such as Euclidean distance and cosine similarity are used in customer segmentation.

Bloom's Taxonomy Levels
- Creating (Level 6): Design a Data Warehouse architecture for specific applications.
- Analyzing (Level 4): Analyze data mining tasks and patterns for customer insights.
- Applying (Level 3): Apply similarity and dissimilarity measures for clustering analysis.

Sample Data for Customer Transactions

Customer ID Transaction Date Amount ($) Product Category Location


1 2023-10-01 200 Electronics New York
2 2023-10-05 350 Home Appliances Los Angeles
3 2023-10-10 120 Groceries New York
4 2023-10-12 90 Clothing Chicago
5 2023-10-15 500 Electronics New York
6 2023-10-18 250 Groceries Los Angeles
7 2023-10-20 100 Clothing Chicago
8 2023-10-22 450 Electronics Chicago

Explanation of Key Steps

1. Data Warehouse Architecture: Integrates OLTP and OLAP components for efficient
analysis and reporting.
2. Data Mining Tasks: Focuses on finding customer buying patterns, which helps in targeted
marketing.
3. Similarity and Dissimilarity: These measures allow clustering of customers based on
similar purchase behaviors.
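
To make point 3 concrete, here is a minimal sketch, assuming Python with NumPy; the two purchase vectors are hypothetical counts per product category, not rows from the sample table above:

import numpy as np

# Hypothetical purchase-count vectors for two customers
# (number of items bought in each of four product categories).
a = np.array([2.0, 0.0, 1.0, 3.0])
b = np.array([1.0, 1.0, 0.0, 2.0])

# Cosine similarity: dot product divided by the product of the magnitudes.
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line distance between the two vectors.
euclidean_dist = np.linalg.norm(a - b)

print(f"Cosine similarity:  {cosine_sim:.3f}")    # higher = more similar
print(f"Euclidean distance: {euclidean_dist:.3f}")  # lower = more similar

Cosine similarity compares the direction of the purchase vectors (useful when basket sizes differ), while Euclidean distance compares absolute magnitudes.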
Unit 2: Data Pre-processing - (Question Bank with Key Points)

1. Describe the main steps in data pre-processing. (BTL 2)
   - Overview of data cleaning, integration, reduction, transformation
   - Importance of each step in ensuring data quality
   - Tools used in data pre-processing

2. Explain data cleaning and its importance in data analysis. (BTL 4)
   - Definition of data cleaning
   - Role in removing noise, handling missing values
   - Examples of cleaning techniques

3. Outline the process of data integration. (BTL 2)
   - Importance of combining data from multiple sources
   - Techniques like schema merging, entity resolution
   - Benefits and challenges

4. What is data reduction, and why is it necessary? (BTL 2)
   - Concept of data reduction: dimensionality reduction, aggregation
   - Techniques like PCA, sampling
   - Use in efficient analysis

5. Discuss the methods of data transformation. (BTL 4)
   - Methods: normalization, scaling, binning
   - Purpose of each method
   - Impact on data quality

6. Explain data discretization and its applications. (BTL 4)
   - Definition and purpose
   - Techniques like binning, clustering
   - Use in classification and modeling

7. Describe how handling missing values is approached. (BTL 5)
   - Techniques like deletion, imputation, and interpolation
   - Impact on data quality and analysis
   - Example methods for various types of data

8. Outline the role of noise removal in data cleaning. (BTL 3)
   - Definition of noise
   - Techniques like smoothing, outlier detection
   - Impact on analysis accuracy

9. Why is feature scaling important in data pre-processing? (BTL 4)
   - Explanation of scaling techniques: Min-Max, Standardization
   - Importance in machine learning models
   - Examples of scaled data

10. Compare dimensionality reduction methods. (BTL 5)
    - Comparison of methods like PCA, t-SNE
    - Advantages and limitations of each
    - Situations where each method is preferred

11. Describe entity resolution in data integration. (BTL 2)
    - Purpose of identifying similar entities across datasets
    - Techniques for matching
    - Examples in customer data integration

12. Explain how binning helps in data transformation. (BTL 3)
    - Concept of binning
    - Types: equal-width, equal-frequency
    - Application in discretization

13. Discuss the concept of attribute selection. (BTL 4)
    - Importance in model simplification and efficiency
    - Methods like correlation, mutual information
    - Impact on predictive modeling

14. Describe the challenges in integrating heterogeneous data sources. (BTL 5)
    - Issues like schema mismatches, data redundancy
    - Solutions: schema alignment, entity matching
    - Case examples

15. What is the purpose of standardizing data? (BTL 2)
    - Explanation of standardization
    - Techniques and formulas used
    - Importance in feature scaling

Example problems

Problem Statement: Given a dataset containing customer data with issues like missing values, duplicate entries, high dimensionality, and different measurement scales, perform data preprocessing steps. Specifically, apply data cleaning, integration, reduction, transformation, and discretization techniques to prepare the dataset for further analysis.

Solution

Step 1: Data Cleaning
- Handling Missing Values: Replace missing values in the "Annual Income" column with the column's mean value.
- Removing Duplicates: Remove duplicate customer records based on the unique "Customer ID".

Step 2: Data Integration
- Integrate datasets from different sources containing demographic and purchase data on common keys like "Customer ID" for complete customer profiles.

Step 3: Data Reduction
- Feature Selection: Drop unimportant columns such as "Customer Address" to reduce dimensionality.

Step 4: Data Transformation
- Normalization: Scale the "Annual Income" column using Min-Max normalization to a [0, 1] range.

Step 5: Data Discretization
- Binning: Discretize the "Age" column into bins representing age groups (e.g., 20-30, 30-40) for analysis.

Key Points
1. Data Cleaning ensures accuracy and completeness by filling missing values and removing duplicates.
2. Data Integration combines datasets for a holistic view.
3. Data Reduction removes irrelevant data for efficiency.
4. Data Transformation scales data to a uniform range, aiding comparability.
5. Data Discretization groups continuous variables, enhancing interpretability in classification tasks.

Bloom's Taxonomy Levels
- Understanding (Level 2): Identify and apply basic preprocessing techniques.
- Applying (Level 3): Implement transformation and reduction techniques effectively.

Sample Data (Before and After)

Before Data Preprocessing


Customer ID Age Annual Income (in $) Spending Score (1-100) Customer Address
1 25 50,000 60 123 Street, City
2 30 NULL 70 456 Avenue, City
2 30 NULL 70 456 Avenue, City
3 35 70,000 80 789 Blvd, City
4 40 80,000 90 101 Road, City

After Data Preprocessing

Customer ID Age Group Annual Income (Normalized) Spending Score (1-100)

1 20-30 0.00 60
2 30-40 0.56 70
3 30-40 0.67 80
4 40-50 1.00 90

Explanation of Key Steps:

1. Data Cleaning: Missing values are filled with mean values, and duplicates are removed.
2. Data Integration: Combines data to give a complete profile for each customer.
3. Data Reduction: Dropping less informative columns reduces the data size.
4. Data Transformation: Normalization standardizes "Annual Income" for improved model
performance.
5. Data Discretization: Binning groups "Age" into manageable categories, aiding
interpretation.
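
The five steps above can be sketched end to end in a few lines. The following is a minimal illustration, assuming Python with pandas; the column names are illustrative and the rows mirror the "Before" table:

import pandas as pd

# Hypothetical raw data mirroring the "Before" table above.
df = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3, 4],
    "Age": [25, 30, 30, 35, 40],
    "AnnualIncome": [50000, None, None, 70000, 80000],
    "SpendingScore": [60, 70, 70, 80, 90],
    "CustomerAddress": ["123 Street", "456 Avenue", "456 Avenue", "789 Blvd", "101 Road"],
})

# Step 1: Data cleaning - drop duplicate customers, fill missing income with the mean.
df = df.drop_duplicates(subset="CustomerID")
df["AnnualIncome"] = df["AnnualIncome"].fillna(df["AnnualIncome"].mean())

# Step 3: Data reduction - drop a low-value column.
df = df.drop(columns=["CustomerAddress"])

# Step 4: Data transformation - Min-Max normalize income into [0, 1].
inc = df["AnnualIncome"]
df["IncomeNorm"] = (inc - inc.min()) / (inc.max() - inc.min())

# Step 5: Data discretization - bin ages into labeled groups.
# right=False makes the bins [20,30), [30,40), [40,50), so age 30 falls in "30-40".
df["AgeGroup"] = pd.cut(df["Age"], bins=[20, 30, 40, 50],
                        labels=["20-30", "30-40", "40-50"], right=False)

print(df)

Step 2 (integration) is omitted here because it needs a second source table; in pandas it would typically be a merge on "CustomerID".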
Unit 3: Classification - (Question Bank with Key Points)

1. Define classification and explain its importance. (BTL 1)
   - Definition of classification in data mining
   - Application in predicting categorical outcomes
   - Importance in data-driven decision-making

2. Outline the general approach to solving a classification problem. (BTL 2)
   - Steps: Data Preparation, Model Selection, Training, Testing
   - Importance of each step
   - Tools commonly used in classification

3. Describe the working of a decision tree in classification. (BTL 3)
   - Decision tree structure and node-based decisions
   - Process of splitting data on attributes
   - Example workflow of a decision tree

4. Explain the process of building a decision tree. (BTL 4)
   - Steps: data splitting, choosing attributes, recursive partitioning
   - Role of entropy and information gain
   - Examples of split criteria

5. What methods are used to express an attribute test condition? (BTL 4)
   - Test conditions: binary split, multiway split
   - Handling categorical and continuous data
   - Examples of test applications

6. Discuss measures for selecting the best split in a decision tree. (BTL 5)
   - Measures: Information Gain, Gini Index, Chi-square
   - Calculations and comparison
   - Impact on decision tree accuracy

7. Explain Bayesian Belief Networks in classification. (BTL 3)
   - Definition and components (nodes, directed acyclic graph)
   - Inference mechanism in belief networks
   - Applications in probabilistic reasoning

8. Compare decision tree classification with Bayesian classification. (BTL 5)
   - Differences in approach, interpretability, and usage
   - Advantages and limitations of each method
   - Use cases for each classification method

9. What is overfitting in decision trees, and how can it be prevented? (BTL 4)
   - Definition of overfitting
   - Techniques to prevent it: pruning, cross-validation
   - Impact on model accuracy

10. Describe the steps in building a Bayesian classifier. (BTL 3)
    - Steps: calculate prior probabilities, likelihoods, apply Bayes' theorem
    - Example of Bayesian classification
    - Application in real-world scenarios

11. Outline the types of classifiers used in data mining. (BTL 1)
    - Types: Decision Trees, Naive Bayes, SVM, k-NN
    - Brief explanation of each
    - Use cases for each classifier

12. How does attribute selection improve classification? (BTL 4)
    - Importance in reducing noise and computational cost
    - Methods like correlation analysis and information gain
    - Example of improved model performance

13. Explain the role of training and testing in classification. (BTL 2)
    - Difference between training and testing data
    - Importance of model evaluation
    - Common metrics: accuracy, precision, recall

14. Describe the k-nearest neighbors (k-NN) algorithm. (BTL 3)
    - Explanation of k-NN as a distance-based classifier
    - Steps in classifying a new instance
    - Use cases and limitations

15. What are ensemble methods in classification? (BTL 5)
    - Definition of ensemble learning
    - Examples like bagging, boosting
    - Impact on accuracy and robustness
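
For question 10, the underlying formula is Bayes' theorem; written in its standard form, together with the naive Bayes decision rule that assumes conditional independence across attributes x_1, ..., x_n:

P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}
\qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)

Here P(C) is the prior probability of a class, P(x_i | C) the likelihood of attribute value x_i given the class, and P(X) a normalizing constant that can be ignored when comparing classes.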

Example problems

Problem Statement: Given a dataset of students' academic performance, use Decision Tree Induction to classify students as "Pass" or "Fail" based on the following attributes: Hours Studied, Attendance, and Past Grades. Demonstrate the working of a decision tree, methods for expressing attribute test conditions, and measures for selecting the best split. Additionally, explain how Bayesian Belief Networks could be used for this classification problem.

Solution

Step 1: Define Attributes and Goal
- Attributes: Hours Studied, Attendance, Past Grades.
- Goal: Classify students as "Pass" or "Fail".

Step 2: Decision Tree Induction
- Calculate the Gini Index or Information Gain for each attribute.
- Split Criteria: Split on the attribute with the highest Information Gain.
- Build the tree by choosing split attributes at each node until classification purity is achieved.

Step 3: Attribute Test Conditions
- Test conditions are based on thresholds for Hours Studied, Attendance, and Past Grades.

Step 4: Bayesian Belief Networks
- Define probabilities for relationships between attributes (e.g., Hours Studied → Pass/Fail).
- Calculate posterior probabilities to predict the class.

Key Points
1. Decision Trees provide a visual structure for classification.
2. Information Gain and Gini Index are essential for determining optimal splits.
3. Bayesian Belief Networks use prior knowledge and probability to refine predictions.
4. Thresholds in attribute test conditions allow flexibility in classification.

Bloom's Taxonomy Levels
- Understanding (Level 2): Explain decision tree and Bayesian network methods.
- Applying (Level 3): Apply algorithms to classify new data.
- Analyzing (Level 4): Assess the effectiveness of split measures and classifications.

Sample Data

Student ID Hours Studied Attendance (%) Past Grades Outcome (Pass/Fail)


1 5 80 B Pass
2 2 60 C Fail
3 4 75 B Pass
4 3 65 C Fail
5 6 85 A Pass

Explanation of Key Steps

1. Selecting the Best Split: Compute Information Gain for each attribute (e.g., Hours Studied)
and choose the one with the highest gain to split the data at the root node. This forms the
basis of classification.
2. Attribute Test Conditions: Conditions can be set, such as "Hours Studied > 3," to further
split data at each node.
3. Bayesian Belief Network: For each new student, calculate the probability of passing based
on conditional probabilities derived from existing data.
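
To make the split-selection step concrete, here is a small sketch, assuming Python; it uses the standard entropy and information-gain formulas to evaluate the candidate test "Hours Studied > 3" on the sample data above:

from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = {c: labels.count(c) for c in set(labels)}
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in counts.values())

# Sample data from the table above: (Hours Studied, Outcome).
data = [(5, "Pass"), (2, "Fail"), (4, "Pass"), (3, "Fail"), (6, "Pass")]

outcomes = [y for _, y in data]
parent_h = entropy(outcomes)  # entropy before splitting (3 Pass, 2 Fail)

# Candidate test condition: Hours Studied > 3.
left = [y for h, y in data if h > 3]    # Pass, Pass, Pass
right = [y for h, y in data if h <= 3]  # Fail, Fail

# Weighted child entropy, then the information gain of the split.
child_h = (len(left) / len(data)) * entropy(left) \
        + (len(right) / len(data)) * entropy(right)
gain = parent_h - child_h

print(f"Parent entropy:              {parent_h:.3f}")  # ~0.971
print(f"Information gain (Hours>3):  {gain:.3f}")      # ~0.971

On this tiny sample the split separates the classes perfectly, so the information gain equals the parent entropy; a decision-tree inducer would therefore place this test at the root.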
Unit 4: Association Analysis - (Question Bank with Key Points)

1. Define association analysis and its application areas. (BTL 1)
   - Definition of association analysis
   - Use in market basket analysis and other domains
   - Importance in finding co-occurring patterns

2. Describe the problem of frequent itemset generation. (BTL 2)
   - Explanation of frequent itemsets
   - Role in association rule mining
   - Challenges in finding frequent itemsets

3. Explain the Apriori algorithm for rule generation. (BTL 3)
   - Steps: finding frequent itemsets, rule generation
   - Role of support and confidence
   - Example with sample data

4. Outline compact representation of frequent itemsets. (BTL 4)
   - Methods: closed and maximal itemsets
   - Importance of reducing redundancy
   - Applications in data mining

5. How does the FP-Growth algorithm differ from Apriori? (BTL 5)
   - FP-Growth structure and process
   - Differences in computation and performance
   - Examples of large dataset applications

6. Discuss the challenges in association rule mining. (BTL 5)
   - Challenges: scalability, interpretability, support thresholds
   - Solutions like pruning, use of alternative metrics
   - Impact on rule generation quality

7. Define confidence and lift in association rules. (BTL 1)
   - Definitions of confidence, lift
   - Role in evaluating association rules
   - Examples to illustrate

8. Explain the importance of support in rule mining. (BTL 2)
   - Definition of support
   - Role in filtering meaningful patterns
   - Examples in transaction data

9. Compare support, confidence, and lift in rule evaluation. (BTL 4)
   - Explanation of each metric and their use
   - Advantages and limitations
   - Situations where each is prioritized

10. Describe the use of the FP-tree structure in FP-Growth. (BTL 3)
    - Concept of FP-tree and its creation
    - How it facilitates frequent pattern discovery
    - Benefits in large datasets

11. Explain how association rules are applied in e-commerce. (BTL 3)
    - Applications in recommendation systems, cross-selling
    - Examples in online shopping platforms
    - Impact on customer engagement

12. Describe the steps in rule pruning. (BTL 4)
    - Definition and importance of pruning
    - Methods like minimum confidence and lift thresholding
    - Examples of improved rule quality

13. What are closed and maximal itemsets? (BTL 2)
    - Definitions of closed and maximal itemsets
    - Role in reducing rule redundancy
    - Examples to illustrate

14. Discuss challenges in setting thresholds for support and confidence. (BTL 5)
    - Balancing specificity and generality in rule generation
    - Examples of too low or high threshold issues
    - Impact on meaningful rule discovery

15. Explain how rule generation can improve business decision-making. (BTL 4)
    - Role in uncovering customer purchasing patterns
    - Use in strategic marketing and sales
    - Examples of actionable insights
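
Questions 7-9 rest on three standard rule-evaluation metrics. For a rule A ⇒ B over N transactions, they are conventionally defined as:

\mathrm{support}(A \Rightarrow B) = \frac{\sigma(A \cup B)}{N}
\qquad
\mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
\qquad
\mathrm{lift}(A \Rightarrow B) = \frac{\mathrm{confidence}(A \Rightarrow B)}{\mathrm{support}(B)}

where \sigma(A \cup B) counts the transactions that contain all items of both A and B. A lift above 1 indicates that A and B co-occur more often than expected if they were independent.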

Problem Statement

Given Data: A retail store’s transaction data with items purchased together.

Task: Use the FP-Growth Algorithm to identify frequent item sets and generate association rules
for understanding customer purchasing patterns. Discuss the advantages of using FP-Growth over the
Apriori Algorithm in terms of efficiency and compact representation.

Sample Data

Transaction ID Items Purchased


1 Milk, Bread, Butter
2 Milk, Bread
3 Bread, Butter
4 Milk, Bread, Butter, Jam
5 Bread, Butter

Solution

Step 1: Build FP-Tree
- Calculate the frequency of each item across all transactions (e.g., Bread: 5, Milk: 3, Butter: 4, Jam: 1).
- Arrange items in each transaction in descending order of frequency and construct the FP-tree.

Step 2: Generate Frequent Item Sets
- Identify frequent item sets using conditional FP-trees, reducing redundant calculations by grouping common patterns.

Step 3: Association Rule Generation
- For each frequent item set, generate rules based on confidence (e.g., Bread → Butter).
- Calculate support and confidence for each rule. For example, Support(Bread → Butter) = 4/5 (80%) and Confidence(Bread → Butter) = 4/5 (80%).

Step 4: Analysis
- Compare FP-Growth with Apriori: FP-Growth is efficient with large datasets due to its compact representation, eliminating the candidate generation required by Apriori.

Key Points
1. FP-Growth reduces the need for candidate generation by using the compact FP-tree structure.
2. Frequent item set generation enables understanding of common purchase patterns.
3. Association rule generation allows for actionable insights on related items.
4. Compact representation in FP-Growth is useful for large datasets.

Bloom's Taxonomy Levels
- Applying (Level 3): Implement the FP-Growth algorithm and interpret results.
- Analyzing (Level 4): Analyze item relationships and rule generation.
- Evaluating (Level 5): Evaluate FP-Growth efficiency over Apriori.

Explanation of Key Steps

1. FP-Tree Construction: Organizes data to enable efficient traversal and pattern recognition.
2. Frequent Item Sets: Highlights common item combinations, offering insights into frequently
co-purchased products.
3. Association Rule Generation: Helps identify strong associations between items to inform
product placement, bundling, and promotions.
4. Efficiency of FP-Growth: By avoiding candidate generation, FP-Growth provides faster
processing for larger datasets compared to Apriori.
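
As a check on step 3, the following minimal sketch (plain Python over the five sample transactions above) reproduces the support and confidence figures for Bread → Butter and adds lift:

# Sample transactions from the table above.
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Butter", "Jam"},
    {"Bread", "Butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

antecedent, consequent = {"Bread"}, {"Butter"}

sup = support(antecedent | consequent)  # 4/5 = 0.80
conf = sup / support(antecedent)        # 0.80 / 1.00 = 0.80
lift = conf / support(consequent)       # 0.80 / 0.80 = 1.00

print(f"support(Bread -> Butter)    = {sup:.2f}")
print(f"confidence(Bread -> Butter) = {conf:.2f}")
print(f"lift(Bread -> Butter)       = {lift:.2f}")

Note that the lift of exactly 1.0 here means Bread and Butter co-occur about as often as independence would predict on this tiny sample; production FP-Growth implementations compute the same metrics over the frequent item sets mined from the FP-tree.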

Unit 5: Cluster Analysis - (Question Bank with Key Points)

1. Define cluster analysis and its objectives. (BTL 1)
   - Definition of cluster analysis
   - Objectives: grouping similar data, discovering patterns
   - Importance in exploratory data analysis

2. Explain different types of clustering methods. (BTL 2)
   - Types: partitioning, hierarchical, density-based, model-based
   - Brief explanation of each method
   - Examples of use cases

3. Describe the K-means clustering algorithm and its steps. (BTL 3)
   - Steps: initialization, assignment, updating centroids
   - Iterative process of clustering
   - Advantages and limitations

4. Discuss additional issues with K-means clustering. (BTL 5)
   - Issues: choice of k, sensitivity to initialization
   - Solutions like multiple runs, Elbow method
   - Example scenarios

5. Compare K-means and hierarchical clustering. (BTL 4)
   - Differences in approach and structure
   - Advantages and limitations of each
   - Examples of applications for each method

6. What is bisecting K-means, and how does it work? (BTL 3)
   - Explanation of the bisecting process
   - Comparison to traditional K-means
   - Examples of large data clustering

7. Outline the strengths and weaknesses of K-means. (BTL 5)
   - Strengths: simplicity, speed
   - Weaknesses: sensitivity to outliers, fixed k
   - Practical considerations in using K-means

8. Explain the basic agglomerative hierarchical clustering algorithm. (BTL 2)
   - Steps in agglomerative clustering
   - Linkage methods (single, complete, average)
   - Examples of hierarchical clustering applications

9. Describe the DBSCAN algorithm and its applications. (BTL 3)
   - Steps in density-based clustering
   - Concepts of core, border, noise points
   - Applications in spatial data analysis

10. Compare DBSCAN with K-means clustering. (BTL 5)
    - Differences in clustering approach and flexibility
    - Advantages of DBSCAN in handling noise
    - Example of non-spherical data clusters

11. Explain the concept of density-based clustering. (BTL 4)
    - Definition and purpose in cluster discovery
    - Examples of density-based clusters
    - Comparison with other clustering methods

12. Discuss the concept of hierarchical clusters. (BTL 3)
    - Definition of hierarchical clustering
    - Explanation of dendrograms
    - Example of hierarchical clustering analysis

13. Describe clustering applications in marketing. (BTL 4)
    - Use in customer segmentation and targeting
    - Example scenarios in retail marketing
    - Benefits for targeted campaigns

14. Outline the concept of centroid in clustering algorithms. (BTL 2)
    - Definition of centroid
    - Role in K-means and other centroid-based methods
    - Example illustrating centroid calculation

15. Explain clustering validation methods. (BTL 5)
    - Methods: Silhouette Score, Davies-Bouldin Index
    - Importance in cluster quality assessment
    - Example interpretation of validation scores

Example problems

Problem Statement: Given a dataset of customer purchase behavior with the following attributes: Age, Annual Income, and Spending Score, apply K-means and Agglomerative Hierarchical Clustering to segment customers into distinct groups. Discuss the strengths and weaknesses of each clustering technique.

Solution

Step 1: Data Preprocessing
- Normalize the data (standardization).

Step 2: K-means Clustering
- Choose an optimal value for K using the Elbow Method.
- Apply the K-means algorithm to segment customers.
- Analyze the centroids and group characteristics.

Step 3: Agglomerative Hierarchical Clustering
- Use Ward's method to perform hierarchical clustering.
- Plot a dendrogram to determine the optimal number of clusters.
- Compare results with the K-means segmentation.

Step 4: Comparison
- K-means Strengths: Efficient for large datasets, easy to implement.
- K-means Weaknesses: Sensitive to outliers, requires pre-specification of K.
- Agglomerative Hierarchical Strengths: No need to specify the number of clusters, useful for small datasets.
- Agglomerative Hierarchical Weaknesses: Computationally expensive for large datasets, sensitive to noise.

Key Points
1. K-means uses distance metrics to partition data into K clusters.
2. The Elbow Method helps in determining the optimal K.
3. Agglomerative clustering builds clusters based on proximity, leading to a dendrogram for visualization.
4. DBSCAN offers a density-based approach that handles noise and varying cluster sizes.

Bloom's Taxonomy Levels
- Analyzing (Level 4): Understand and analyze clustering results and their implications.
- Evaluating (Level 5): Evaluate the strengths and weaknesses of K-means and Agglomerative Hierarchical Clustering.

Explanation of Key Steps:

1. Data Preprocessing: Normalization ensures that different scales of data do not skew the
clustering results.
2. K-means Clustering: This algorithm partitions data into K clusters by minimizing variance
within each cluster.
3. Agglomerative Hierarchical Clustering: This method builds a hierarchy of clusters, useful
for understanding the data structure.
4. Comparison: By evaluating both clustering methods, insights can be drawn about the data's
structure and clustering appropriateness.
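
A minimal sketch of steps 1-3, assuming Python with scikit-learn; k = 2 is an illustrative choice that would in practice come from the Elbow Method or the dendrogram, and the rows are taken from the sample table below:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering

# Columns: Age, Annual Income ($), Spending Score (rows from the sample table).
X = np.array([
    [25,  50_000, 60], [30,  60_000, 70], [35,  70_000, 80],
    [40,  80_000, 90], [45,  90_000, 95], [50, 100_000, 40],
    [55, 110_000, 30], [60, 120_000, 20], [65, 130_000, 50],
    [70, 140_000, 10],
])

# Step 1: standardize so income's large scale does not dominate the distances.
X_scaled = StandardScaler().fit_transform(X)

# Step 2: K-means (k = 2 for illustration; choose k via the Elbow Method).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
km_labels = kmeans.fit_predict(X_scaled)

# Step 3: agglomerative clustering with Ward linkage for comparison.
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
agg_labels = agg.fit_predict(X_scaled)

print("K-means labels:      ", km_labels)
print("Agglomerative labels:", agg_labels)

Comparing the two label vectors (step 4) shows whether both methods recover the same customer segments on the standardized data.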

Sample Data

Customer ID Age Annual Income (in $) Spending Score (1-100)


1 25 50,000 60
2 30 60,000 70
3 35 70,000 80
4 40 80,000 90
5 45 90,000 95
6 50 100,000 40
7 55 110,000 30
8 60 120,000 20
9 65 130,000 50
10 70 140,000 10
We generally use the Euclidean distance (or other metrics like Manhattan, Cosine, etc., depending
on the problem). Euclidean distance is commonly used in K-means and Agglomerative Hierarchical
Clustering.

Steps to Calculate Euclidean Distance

For any two data points (x1, y1, z1) and (x2, y2, z2) in the given dataset, where:

 x represents Age
 y represents Annual Income
 z represents Spending Score

The Euclidean distance d between these two points is calculated as:

d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}

Example Calculation

Let's take Customer 1 and Customer 2 from the sample data:

Customer ID Age Annual Income (in $) Spending Score (1-100)


1 25 50,000 60
2 30 60,000 70

1. Extract values:
o Customer 1: (25, 50000, 60)
o Customer 2: (30, 60000, 70)
2. Calculate distance:

d = \sqrt{(30 - 25)^2 + (60000 - 50000)^2 + (70 - 60)^2} = \sqrt{25 + 100000000 + 100} = \sqrt{100000125} \approx 10000.01

The resulting distance, approximately 10000.01, is the Euclidean distance between Customer 1 and Customer 2 in terms of their age, annual income, and spending score. Note that Annual Income dominates the result because of its much larger scale, which is exactly why the data should be normalized before clustering.
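
The same arithmetic can be checked in one line, assuming Python with NumPy:

import numpy as np

c1 = np.array([25, 50_000, 60])  # Customer 1: Age, Annual Income, Spending Score
c2 = np.array([30, 60_000, 70])  # Customer 2

print(round(np.linalg.norm(c1 - c2), 2))  # 10000.01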

Applying Distance Calculation in Clustering

 K-means: This distance helps in assigning each customer to the closest centroid.
 Agglomerative Hierarchical Clustering: Distances between each pair of data points are
calculated to form clusters iteratively.