DWM Q Bank
11. Describe the concept of data. (2 marks)
- Data objects: entities and their instances
Q-Bank-DWM-SITE21-CST/@theva
- …, ratio
- Examples of each type
- Role in presenting data insights
12. How can data visualization aid in data understanding? (3 marks)
- Visualization techniques like charts, graphs, and dashboards
- Advantages in interpretation
13. Explain similarity and dissimilarity measurements in data. (3 marks)
- Concepts of similarity and dissimilarity
- Calculation methods: Euclidean, Cosine, Jaccard
- Application in clustering and classification
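As a quick illustration of the three calculation methods listed above, here is a minimal Python sketch (the sample vectors and sets are hypothetical):

```python
import math

def euclidean(a, b):
    # Straight-line distance between two numeric vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors (1 = identical direction)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard(s, t):
    # Overlap of two sets: |intersection| / |union|
    return len(s & t) / len(s | t)

print(euclidean([1, 2], [4, 6]))                      # 5.0
print(round(cosine_similarity([1, 0], [1, 1]), 4))    # 0.7071
print(jaccard({"A", "B"}, {"B", "C"}))
```

Euclidean and Cosine apply to numeric attributes, while Jaccard suits set-valued (e.g., binary) attributes, which is why the choice of measure matters for clustering and classification.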
14. Provide basic statistical descriptions of data. (2 marks)
- Descriptive statistics: mean, median, mode, variance, standard deviation
- Applications in data understanding
- Role in data pre-processing
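The descriptive statistics named above can be computed directly with Python's standard `statistics` module (the age values below are a hypothetical sample):

```python
import statistics

ages = [22, 25, 25, 29, 34, 40, 40, 40]  # hypothetical sample

print(statistics.mean(ages))       # 31.875
print(statistics.median(ages))     # 31.5
print(statistics.mode(ages))       # 40
print(statistics.pvariance(ages))  # population variance
print(statistics.pstdev(ages))     # population standard deviation
```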
15. How do data warehousing and OLAP support business intelligence? (4 marks)
- Explanation of business intelligence
- Role of data warehousing and OLAP in decision support
- Examples and case studies
Example problems
- …customer purchase frequency, preferences, and spending behavior.
- Classification: Classify customers into different categories (e.g., high-value customers).
- Association Analysis: Find product association patterns (e.g., customers buying product A often buy product B).
1. Data Warehouse Architecture: Integrates OLTP and OLAP components for efficient
analysis and reporting.
2. Data Mining Tasks: Focuses on finding customer buying patterns, which helps in targeted
marketing.
3. Similarity and Dissimilarity: These measures allow clustering of customers based on
similar purchase behaviors.
Unit 2: Data Pre-processing - (Question Bank with Key Points)
…selection. (contd.)
- …efficiency
- Methods like correlation, mutual information
- Impact on predictive modeling
14. Describe the challenges in integrating heterogeneous data sources. (5 marks)
- Issues like schema mismatches, data redundancy
- Solutions: schema alignment, entity matching
- Case examples
15. What is the purpose of standardizing data? (2 marks)
- Explanation of standardization
- Techniques and formulas used
- Importance in feature scaling
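The most common standardization formula is the z-score, z = (x - mean) / stdev, which rescales a feature to mean 0 and standard deviation 1. A minimal Python sketch (the income values are hypothetical):

```python
import statistics

def standardize(values):
    # z-score: (x - mean) / stdev, giving mean 0 and stdev 1
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

incomes = [30000, 50000, 70000]  # hypothetical "Annual Income" values
z = standardize(incomes)
print([round(v, 2) for v in z])  # [-1.22, 0.0, 1.22]
```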
Example problems
Step 3: Data Reduction
- Feature Selection: Drop unimportant columns such as "Customer Address" to reduce dimensionality.

Step 4: Data Transformation
- Normalization: Scale the "Annual Income" column using Min-Max normalization to a [0, 1] range.

Step 5: Data Discretization
- Binning: Discretize the "Age" column into bins representing age groups (e.g., 20-30, 30-40) for analysis.
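Steps 4 and 5 above can be sketched in a few lines of Python (the income and age values are made up for illustration):

```python
incomes = [20000, 35000, 50000, 80000]   # hypothetical "Annual Income" values
ages = [23, 27, 31, 38, 45]              # hypothetical "Age" values

# Min-Max normalization: (x - min) / (max - min) maps values into [0, 1]
lo, hi = min(incomes), max(incomes)
normalized = [(x - lo) / (hi - lo) for x in incomes]
print(normalized)  # [0.0, 0.25, 0.5, 1.0]

# Binning: map each age to a decade-wide group such as "20-30" or "30-40"
def age_group(age):
    start = (age // 10) * 10
    return f"{start}-{start + 10}"

print([age_group(a) for a in ages])  # ['20-30', '20-30', '30-40', '30-40', '40-50']
```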
1. Data Cleaning: Missing values are filled with mean values, and duplicates are removed.
2. Data Integration: Combines data to give a complete profile for each customer.
3. Data Reduction: Dropping less informative columns reduces the data size.
4. Data Transformation: Min-Max normalization rescales "Annual Income" to a common range, improving model performance.
5. Data Discretization: Binning groups "Age" into manageable categories, aiding
interpretation.
Unit 3: Classification - (Question Bank with Key Points)
…information gain (contd.)
- Example of improved model performance
13. Explain the role of training and testing in classification. (2 marks)
- Difference between training and testing data
- Importance of model evaluation
- Common metrics: accuracy, precision, recall
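The three metrics listed above can be computed from simple confusion counts; a minimal sketch with hypothetical Pass/Fail labels:

```python
def evaluate(actual, predicted, positive="Pass"):
    # True positives, false positives, false negatives for the positive class
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    accuracy = sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)
    precision = tp / (tp + fp)   # of predicted positives, how many were right
    recall = tp / (tp + fn)      # of actual positives, how many were found
    return accuracy, precision, recall

actual    = ["Pass", "Pass", "Fail", "Fail", "Pass"]  # hypothetical test labels
predicted = ["Pass", "Fail", "Fail", "Pass", "Pass"]  # hypothetical model output
print(evaluate(actual, predicted))  # (0.6, 0.666..., 0.666...)
```

Metrics are always computed on held-out testing data, never on the data the model was trained on.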
14. Describe the k-nearest neighbors (k-NN) algorithm. (3 marks)
- Explanation of k-NN as a distance-based classifier
- Steps in classifying a new instance
- Use cases and limitations
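The steps in classifying a new instance (compute distances, keep the k nearest neighbors, take a majority vote) fit in a short sketch; the training points and query below are hypothetical:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    # train: list of (feature_vector, label) pairs; classify by majority
    # vote among the k nearest neighbors under Euclidean distance
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 9), "B"), ((2, 2), "A")]
print(knn_classify(train, (3, 2), k=3))  # 'A'
```

Because every prediction scans the whole training set, k-NN is simple but slow on large datasets, which is one of its main limitations.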
15. What are ensemble methods in classification? (5 marks)
- Definition of ensemble learning
- Examples like bagging, boosting
- Impact on accuracy and robustness
Example problems
Step 3: Attribute Test Conditions
- Test conditions are based on thresholds for Hours Studied, Attendance, and Past Grades.

Step 4: Bayesian Belief Networks
- Define probabilities for relationships between attributes (e.g., Hours Studied → Pass/Fail).
- Calculate posterior probabilities to predict the class.
Sample Data
1. Selecting the Best Split: Compute Information Gain for each attribute (e.g., Hours Studied)
and choose the one with the highest gain to split the data at the root node. This forms the
basis of classification.
2. Attribute Test Conditions: Conditions can be set, such as "Hours Studied > 3," to further
split data at each node.
3. Bayesian Belief Network: For each new student, calculate the probability of passing based
on conditional probabilities derived from existing data.
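Point 1 above (choosing the split with the highest Information Gain) can be illustrated with a small sketch; the Hours Studied values and Pass/Fail labels are hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum(p * log2 p) over the class proportions
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, test):
    # Gain = H(parent) - weighted average H(children) for a boolean split
    left = [l for r, l in zip(rows, labels) if test(r)]
    right = [l for r, l in zip(rows, labels) if not test(r)]
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

# Hypothetical records: hours studied per student, with a Pass/Fail label
hours = [(1,), (2,), (4,), (5,), (6,)]
labels = ["Fail", "Fail", "Pass", "Pass", "Pass"]

gain = information_gain(hours, labels, lambda r: r[0] > 3)
print(round(gain, 3))  # 0.971 -- a perfect split, so gain equals parent entropy
```

The attribute test "Hours Studied > 3" separates the classes completely here, so it would be chosen at the root node.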
Unit 4: Association Analysis - (Question Bank with Key Points)
- Examples to illustrate

14. Discuss challenges in setting thresholds for support and confidence. (5 marks)
- Balancing specificity and generality in rule generation
- Examples of too low or high threshold issues
- Impact on meaningful rule discovery
15. Explain how rule generation can improve business decision-making. (4 marks)
- Role in uncovering customer purchasing patterns
- Use in strategic marketing and sales
- Examples of actionable insights
Problem Statement
Given Data: A retail store’s transaction data with items purchased together.
Task: Use the FP-Growth Algorithm to identify frequent item sets and generate association rules
for understanding customer purchasing patterns. Discuss the advantages of using FP-Growth over the
Apriori Algorithm in terms of efficiency and compact representation.
Sample Data
Solution
…→ Butter).
- Calculate support and confidence for each rule. For example, Support(Bread → Butter) = 4/5 (80%) and Confidence(Bread → Butter) = 4/5 (80%).
Step 4: Analysis
- Compare FP-Growth with Apriori: FP-Growth is efficient with large datasets due to its compact representation, eliminating the need for the candidate generation used in Apriori.
1. FP-Tree Construction: Organizes data to enable efficient traversal and pattern recognition.
2. Frequent Item Sets: Highlights common item combinations, offering insights into frequently
co-purchased products.
3. Association Rule Generation: Helps identify strong associations between items to inform
product placement, bundling, and promotions.
4. Efficiency of FP-Growth: By avoiding candidate generation, FP-Growth provides faster
processing for larger datasets compared to Apriori.
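The support and confidence figures from the solution can be checked mechanically; the five transactions below are hypothetical, chosen only to reproduce the 4/5 (80%) values reported above:

```python
# Hypothetical transaction data: Bread appears in all 5 baskets,
# Bread and Butter appear together in 4 of them
transactions = [
    {"Bread", "Butter"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Butter", "Eggs"},
    {"Bread", "Milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    # P(consequent | antecedent) = support(A ∪ C) / support(A)
    return support(antecedent | consequent) / support(antecedent)

print(support({"Bread", "Butter"}))       # 0.8
print(confidence({"Bread"}, {"Butter"}))  # 0.8
```

Support and confidence are the two thresholds a rule must clear; FP-Growth and Apriori differ only in how frequent itemsets are found, not in how these measures are defined.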
Unit 5: Cluster Analysis - (Question Bank with Key Points)
…analysis (contd.)

13. Describe clustering applications in marketing. (4 marks)
- Use in customer segmentation and targeting
- Example scenarios in retail marketing
- Benefits for targeted campaigns
14. Outline the concept of centroid in clustering algorithms. (2 marks)
- Definition of centroid
- Role in K-means and other centroid-based methods
- Example illustrating centroid calculation
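A centroid is simply the coordinate-wise mean of the points in a cluster; a minimal sketch with hypothetical 2-D points:

```python
def centroid(points):
    # Mean of each coordinate across all points in the cluster
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

cluster = [(2, 4), (4, 6), (6, 8)]
print(centroid(cluster))  # (4.0, 6.0)
```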
15. Explain clustering validation methods. (5 marks)
- Methods: Silhouette Score, Davies-Bouldin Index
- Importance in cluster quality assessment
- Example interpretation of validation scores
Example problems
Problem Statement: Given a dataset of customer purchase behavior with the following attributes: Age, Annual Income, and Spending Score, apply K-means and Agglomerative Hierarchical Clustering to segment customers into distinct groups. Discuss the strengths and weaknesses of each clustering technique.

Solution:

Step 1: Data Preprocessing
- Normalize the data (standardization).

Step 2: K-means Clustering
- Choose an optimal value for K using the Elbow Method.
- Apply the K-means algorithm to segment customers.
- Analyze the centroids and group characteristics.

Step 3: Agglomerative Hierarchical Clustering
- Use Ward's method to perform hierarchical clustering.
- Plot a dendrogram to determine the optimal number of clusters.
- Compare results with K-means segmentation.

Step 4: Comparison
- K-means Strengths: Efficient for large datasets, easy to implement.
- K-means Weaknesses: Sensitive to outliers, requires pre-specification of K.
- Agglomerative Hierarchical Strengths: No need to specify the number of clusters, useful for small datasets.
- Agglomerative Hierarchical Weaknesses: Computationally expensive for large datasets, sensitive to noise.

Key Points:
1. K-means uses distance metrics to partition data into K clusters.
2. The Elbow Method helps in determining the optimal K.
3. Agglomerative clustering builds clusters based on proximity, leading to a dendrogram for visualization.
4. DBSCAN offers a density-based approach that handles noise and varying cluster sizes.

Bloom's Taxonomy Level:
- Analyzing (Level 4): Understand and analyze clustering results and their implications.
- Evaluating (Level 5): Evaluate the strengths and weaknesses of K-means and Agglomerative Hierarchical Clustering.
1. Data Preprocessing: Normalization ensures that different scales of data do not skew the
clustering results.
2. K-means Clustering: This algorithm partitions data into K clusters by minimizing variance
within each cluster.
3. Agglomerative Hierarchical Clustering: This method builds a hierarchy of clusters, useful
for understanding the data structure.
4. Comparison: By evaluating both clustering methods, insights can be drawn about the data's
structure and clustering appropriateness.
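The K-means loop (assign each point to its nearest centroid, recompute centroids as cluster means, repeat) can be sketched in plain Python; the six (age, spending score) points below are hypothetical and deliberately well separated:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    # Plain K-means: assign each point to the nearest centroid, then
    # recompute each centroid as the mean of its cluster, and repeat.
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [math.dist(p, c) for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated hypothetical groups of (age, spending score) pairs
points = [(25, 80), (27, 75), (24, 78), (60, 20), (62, 25), (58, 22)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Note that the raw values are on different scales, which is exactly why the solution normalizes the data before clustering.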
Sample Data
We generally use the Euclidean distance (or other metrics like Manhattan, Cosine, etc., depending
on the problem). Euclidean distance is commonly used in K-means and Agglomerative Hierarchical
Clustering.
For any two data points (x1, y1, z1) and (x2, y2, z2) in the given dataset, where:
- x represents Age
- y represents Annual Income
- z represents Spending Score

the Euclidean distance is:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
Example Calculation
1. Extract values:
   o Customer 1: (25, 50000, 60)
   o Customer 2: (30, 60000, 70)
2. Calculate distance:
   d = sqrt((30 - 25)^2 + (60000 - 50000)^2 + (70 - 60)^2) = sqrt(25 + 100000000 + 100) = sqrt(100000125) ≈ 10000.01
The resulting distance, 10000.01, indicates the Euclidean distance between Customer 1 and
Customer 2 in terms of their age, annual income, and spending score.
K-means: This distance helps in assigning each customer to the closest centroid.
Agglomerative Hierarchical Clustering: Distances between each pair of data points are
calculated to form clusters iteratively.
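The calculation above can be verified in one line with Python's `math.dist`:

```python
import math

customer1 = (25, 50000, 60)   # (Age, Annual Income, Spending Score)
customer2 = (30, 60000, 70)

d = math.dist(customer1, customer2)  # Euclidean distance in 3-D
print(round(d, 2))  # 10000.01
```

The distance is dominated almost entirely by Annual Income because of its scale, which is why normalizing the data first (as in the preprocessing step) is essential before clustering.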