PART C
1 Stacking, or Stacked Generalization, is an ensemble learning technique that
combines multiple base models (also called level-0 models) and a meta-model (level-1
model) to improve prediction performance. It leverages the strengths of diverse algorithms
by training a meta-learner on their outputs. In this case, we design a stacking model using:
K-Nearest Neighbors (KNN)
Decision Tree
Naive Bayes
as base learners, and a Logistic Regression model as the meta-learner.
Architecture of the Stacking Model:
Level-0 Models (Base Learners):
o KNN: Instance-based, non-parametric learner, effective at capturing local decision
boundaries.
o Decision Tree: Captures complex decision rules and feature interactions.
o Naive Bayes: Assumes feature independence; fast and performs well with
categorical data.
Level-1 Model (Meta-Learner):
o Logistic Regression: A linear model well suited to binary classification tasks
(e.g., churn prediction). It generalizes well and combines the base models'
outputs into class probabilities.
Justification of Base Learner Selection
KNN:
o Captures local data patterns and performs well with well-separated classes.
o Brings diversity due to its lazy learning and distance-based approach.
Decision Tree:
o Handles non-linear relationships and feature importance.
o Robust to irrelevant features and outliers.
Naive Bayes:
o Simple and fast, ideal when feature independence assumption holds.
o Often performs surprisingly well on text or categorical data.
➡️These algorithms represent diverse learning biases, improving ensemble generalization
through diversity in their predictions.
Justification of Meta-Learner Selection
Logistic Regression is selected because:
o It can interpret and weigh the predictions of base models effectively.
o It is less prone to overfitting as a meta-learner.
o Provides probabilistic outputs, useful for classification thresholds and ROC
analysis.
Implementation Example in Python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier

# Define base learners (level-0 models)
base_learners = [
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('dt', DecisionTreeClassifier(max_depth=5)),
    ('nb', GaussianNB())
]

# Define meta-learner (level-1 model)
meta_learner = LogisticRegression()

# Build stacking model
stacking_model = StackingClassifier(estimators=base_learners,
                                    final_estimator=meta_learner)
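A brief usage sketch, assuming a feature matrix X and binary churn labels y have already been loaded (the variable names are illustrative):

from sklearn.model_selection import train_test_split

# X, y assumed to be the churn feature matrix and binary labels (illustrative)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the ensemble; StackingClassifier internally uses cross-validation to create
# out-of-fold base-model predictions as training features for the meta-learner
stacking_model.fit(X_train, y_train)
print("Test accuracy:", stacking_model.score(X_test, y_test))

# Probability of churn from the Logistic Regression meta-learner (useful for ROC analysis)
churn_probs = stacking_model.predict_proba(X_test)[:, 1]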
The designed stacking ensemble effectively combines the strengths of KNN (local
patterns), Decision Trees (rule-based learning), and Naive Bayes (probabilistic
reasoning). The Logistic Regression meta-learner efficiently integrates their predictions
for improved accuracy, robustness, and generalization. This approach is particularly
beneficial in complex classification problems like customer churn prediction, where no
single algorithm performs best across all scenarios.
2 Bagging (Bootstrap Aggregating) is an ensemble technique used to improve the accuracy
and stability of machine learning algorithms. It is especially effective with high-variance models
like Decision Trees. In customer churn prediction, where we aim to identify customers likely to
leave, bagging can improve classification performance by reducing overfitting.
Working of Bagging Ensemble
Bagging works as follows:
1. Bootstrap Sampling: Multiple subsets of the original training dataset are created by
random sampling with replacement.
2. Model Training: A Decision Tree is trained on each bootstrap sample.
3. Aggregation: For classification tasks, predictions from all trees are combined using
majority voting.
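To make these three steps concrete, here is a minimal hand-written sketch of bagging with bootstrap sampling and majority voting. It assumes X_train, y_train, and X_test are NumPy arrays with binary 0/1 labels (as created in the scikit-learn example below); in practice BaggingClassifier handles all of this internally.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_estimators=25, random_state=42):
    rng = np.random.RandomState(random_state)
    trees = []
    n = len(X_train)
    # 1. Bootstrap sampling and 2. train one tree per bootstrap sample
    for _ in range(n_estimators):
        idx = rng.choice(n, size=n, replace=True)   # sample with replacement
        trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))
    # 3. Aggregate by majority vote across the trees' predictions (0/1 labels assumed)
    votes = np.array([tree.predict(X_test) for tree in trees])
    return np.round(votes.mean(axis=0)).astype(int)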
Python Code Example
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Simulated data (replace with the churn dataset)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Bagging with Decision Trees ('estimator' replaces the older 'base_estimator'
# argument in recent scikit-learn versions)
bag_model = BaggingClassifier(estimator=DecisionTreeClassifier(),
                              n_estimators=50, random_state=42)
bag_model.fit(X_train, y_train)
y_pred = bag_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Impact of Increasing Number of Base Learners
No. of Base Learners   | Effect on Model Performance
Low (e.g., 5–10)       | Faster training, but less robust; higher variance.
Moderate (e.g., 30–50) | Balanced performance; reduces overfitting; good accuracy.
High (e.g., 100+)      | Minor gains in accuracy; lower variance; stable predictions.
Too many               | Minimal gain; increased computation time and memory usage.
✅ Summary: More base learners reduce variance and increase model stability, but after a certain
point, the performance gain plateaus.
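A short sketch of how this plateau can be observed empirically, reusing the train/test split and imports from the code example above (the chosen values of n_estimators are illustrative):

# Sweep the number of base learners and watch accuracy plateau
for n in [5, 10, 30, 50, 100, 200]:
    model = BaggingClassifier(estimator=DecisionTreeClassifier(),
                              n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    print(f"n_estimators={n:>3}  accuracy={model.score(X_test, y_test):.3f}")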
Bagging with Decision Trees is a powerful ensemble method for predicting customer churn. By
aggregating multiple models trained on diverse bootstrap samples, it improves accuracy and
reduces overfitting. Increasing the number of base learners enhances model robustness, but the
improvement slows after a certain threshold due to the law of diminishing returns.
3 K-Means is a centroid-based, unsupervised clustering algorithm that partitions data into K
clusters by minimizing the within-cluster sum of squares (WCSS). It assumes that clusters are
spherical (convex) and equally sized.
Performance of K-Means on Non-Convex Clusters When applied to non-convex clusters (e.g.,
crescent-shaped or ring-shaped clusters), K-Means often performs poorly because:
It relies on Euclidean distance to assign points to the nearest centroid.
It cannot correctly represent clusters with irregular shapes, varying density, or overlaps.
It may incorrectly split a single non-convex cluster into multiple convex parts.
Example Diagram: K-Means Failure on Non-Convex Data
Original Clusters (Non-Convex) K-Means Clustering Output
( ) ( ) XXXXX OOOOO
( ) ( ) XXXXX OOOOO
( ) ( ) XXXXX OOOOO
Ring shapes K-Means splits into incorrect
convex parts
Left: True cluster shapes; Right: K-Means failing to capture the structure.
Alternative Clustering Method: DBSCAN
We recommend DBSCAN (Density-Based Spatial Clustering of Applications with Noise) as a
better alternative.
Why DBSCAN Works Better:
Captures arbitrary shapes: Clusters can be non-convex and irregular.
Handles noise: Can identify outliers as noise.
No need to specify K: Instead, uses parameters eps (neighborhood size) and
min_samples.
Key Advantages Over K-Means:
Feature              | K-Means               | DBSCAN
Cluster shape        | Convex only           | Arbitrary shapes
Sensitivity to noise | High                  | Low (can detect outliers)
Number of clusters   | Must be specified (K) | Determined automatically
Density sensitivity  | Poor                  | Very good
Python Example: DBSCAN vs. K-Means
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
import matplotlib.pyplot as plt

# Create non-convex (crescent-shaped) data
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means clustering
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# DBSCAN clustering
dbscan = DBSCAN(eps=0.3, min_samples=5).fit(X)

# Plot results side by side
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.title("K-Means Clustering")
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_)
plt.title("DBSCAN Clustering")
plt.show()
K-Means clustering is efficient but assumes convex, equally sized clusters, leading to poor
performance on non-convex datasets. DBSCAN is a strong alternative: it identifies arbitrarily
shaped clusters and detects noise, making it a robust choice for real-world, non-linearly
separable data. This highlights the importance of choosing clustering methods based on data
geometry and distribution.
4 (a) KNN vs. K-Means for Fraud Detection
1. Nature of Algorithms
KNN (Supervised)                          | K-Means (Unsupervised)
Classification algorithm                  | Clustering algorithm
Requires labeled data                     | Does not require labels
Predicts based on similarity to neighbors | Groups data into K clusters
2. Use Case for Fraud Detection
KNN: Effective for fraud classification when historical fraud labels (fraud/non-fraud) are
available. It compares a new transaction to known patterns.
K-Means: Groups transactions into clusters, but does not know which are fraudulent.
Used for anomaly detection, not direct classification.
3. Suitability Justification (3 Marks)
Criteria           | KNN                        | K-Means
Label availability | ✅ Uses labels             | ❌ No labels
Fraud prediction   | ✅ Direct prediction       | ❌ Only detects groups
Interpretability   | ✅ Transparent             | ⚠ Depends on clusters
Suitability        | ✔ Best for classification | Used for anomaly grouping
KNN is more suitable for fraud detection when labeled data is available. It can predict directly
whether a transaction is fraudulent based on past examples, whereas K-Means is only useful for
unsupervised anomaly detection.
4. Diagram: KNN vs. K-Means Fraud Detection (1 Mark)
KNN (Supervised): Known labels → Train KNN → Predict Fraud / Legit
K-Means (Unsupervised): No labels → Run K-Means (k=2) → Cluster 1 / Cluster 2 (unknown which cluster is fraud)
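A minimal sketch contrasting the two approaches on simulated transaction data (the class imbalance, feature count, and labels are purely illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Simulated imbalanced transactions: 1 = fraud, 0 = legitimate (illustrative)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Supervised: KNN learns from labelled fraud examples and predicts labels directly
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("KNN test accuracy:", knn.score(X_test, y_test))

# Unsupervised: K-Means only returns cluster IDs; which cluster is "fraud" is unknown
clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)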
(b) Computational Efficiency: 1000 samples, 50 features
1. KNN Complexity
Training Time: O(1) – No training required (lazy learner).
Prediction Time: O(n × d)
where:
o n = number of training samples = 1000
o d = number of features = 50
➡️Cost per prediction: Must compute distance to all 1000 points in 50D space.
➡️Scales poorly with high dimensionality (Curse of Dimensionality).
2. K-Means Complexity
Training Time: O(k × n × d × i)
where:
o k = number of clusters (e.g., 3–10)
o n = 1000
o d = 50
o i = number of iterations (typically <100)
➡️K-Means does not compare a new point to all training points; prediction only requires
distances to the k centroids, i.e., O(k × d) per point.
➡️It is more efficient at prediction time, but its initial training cost is higher than KNN's.
3. Conclusion on Efficiency
Algorithm | Training Efficiency   | Prediction Efficiency            | Overall (50D)
KNN       | ✅ Fast (no training) | ❌ Slow (distance to all points) | ❌ Less efficient
K-Means   | ❌ Slower (training)  | ✅ Fast (after training)         | ✅ More efficient
For high-dimensional data (50 features), K-Means is computationally more efficient overall.
KNN becomes slower during prediction due to distance calculations in high dimensions.
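A rough timing sketch for the stated setting (1000 samples, 50 features). Absolute times depend on hardware, but the relative cost of KNN prediction versus centroid assignment should be visible; the choice of k=5 clusters is an assumption:

import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = make_classification(n_samples=1000, n_features=50, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # "training" just stores the data
start = time.perf_counter()
knn.predict(X)                                        # distances to all 1000 points per query
print("KNN prediction time:", time.perf_counter() - start)

km = KMeans(n_clusters=5, n_init=10, random_state=42)
start = time.perf_counter()
km.fit(X)                                             # iterative training: O(k * n * d * i)
print("K-Means training time:", time.perf_counter() - start)
start = time.perf_counter()
km.predict(X)                                         # assignment to k centroids only: O(k * d) per point
print("K-Means prediction time:", time.perf_counter() - start)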
5 A Gaussian Mixture Model (GMM) is a probabilistic model that assumes the data is
generated from a mixture of several Gaussian distributions with unknown parameters.
Expectation-Maximization (EM) is an iterative algorithm used to estimate these
parameters:
E-Step (Expectation): Estimate the probability (responsibility) that each data point
belongs to each Gaussian.
M-Step (Maximization): Update the parameters (means, variances, and weights)
based on the responsibilities.
After the first EM iteration, the GMM starts adapting to the data structure by shifting means and
responsibilities. The log-likelihood improves in each iteration, moving toward convergence. This
iterative process allows GMMs to model complex multimodal distributions, making them
effective in applications like speaker recognition, image segmentation, and anomaly detection.
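A small sketch of fitting a GMM with EM in scikit-learn. Capping max_iter at 1 shows how the log-likelihood lower bound improves between the first iteration and convergence; the bimodal one-dimensional data and the two-component setting are illustrative assumptions:

import numpy as np
from sklearn.mixture import GaussianMixture

# Bimodal 1-D data drawn from two Gaussians (illustrative)
rng = np.random.RandomState(42)
X = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 1.5, 500)]).reshape(-1, 1)

# Run EM for a single iteration, then to convergence, and compare the log-likelihood bound
gmm_one_step = GaussianMixture(n_components=2, max_iter=1, random_state=42).fit(X)
gmm_full = GaussianMixture(n_components=2, max_iter=100, random_state=42).fit(X)
print("Lower bound after 1 EM iteration:", gmm_one_step.lower_bound_)
print("Lower bound at convergence:", gmm_full.lower_bound_)

# Responsibilities from the E-step: probability of each point under each component
responsibilities = gmm_full.predict_proba(X)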
6 Boosting is an ensemble method that combines several weak learners (typically decision
trees) to form a strong classifier.
It trains models sequentially, where each new model focuses on correcting the errors of
the previous ones.
Effectiveness of Boosting on Noisy Data
🔹 Problem with Noisy Data:
Boosting gives higher weights to misclassified examples.
If noise causes some points to be consistently misclassified, Boosting overemphasizes
these, treating them as important patterns.
This leads to overfitting, where the model fits the noise instead of the true data
distribution.
🔹 Observed Effects:
Scenario        | Behavior of Boosting
Clean data      | Learns progressively better models
Noisy data      | Learns patterns in the noise → overfits
Imbalanced data | May overfocus on outliers
Conclusion: Boosting is sensitive to noise and can overfit significantly if not controlled.
Diagram: Boosting Overfitting on Noisy Data
Iteration 1: noisy samples are misclassified → Iteration 2: boosting re-weights and focuses on
those misclassified (noisy) points → Final boosted model: a complex decision boundary that
wraps around the noise.
Correct pattern ignored → fit to noise increases.
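A minimal sketch illustrating this sensitivity with AdaBoost by flipping a fraction of the training labels; the 10% noise rate and dataset parameters are assumptions for illustration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Inject 10% label noise into the training set only
rng = np.random.RandomState(0)
noisy = y_train.copy()
flip = rng.rand(len(noisy)) < 0.10
noisy[flip] = 1 - noisy[flip]

# Compare train/test accuracy with clean versus noisy training labels
for labels, name in [(y_train, "clean"), (noisy, "noisy")]:
    model = AdaBoostClassifier(n_estimators=200, random_state=42).fit(X_train, labels)
    print(f"{name}: train acc={model.score(X_train, labels):.3f}, "
          f"test acc={model.score(X_test, y_test):.3f}")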
Modifications to Reduce Overfitting
1. Use Early Stopping
Monitor validation loss.
Stop training when validation performance degrades (prevents overfitting on noise).
2. Limit Base Learner Complexity
Use shallow decision trees (depth = 1 or 2).
Prevents learning overly specific patterns caused by noise.
3. Use Robust Variants of Boosting
Gradient Boosting with Regularization (e.g., XGBoost):
o Adds penalties for model complexity.
o Regularizes trees using parameters like gamma, lambda.
Stochastic Boosting:
o Introduce randomness by subsampling data/features.
o Prevents the model from focusing too heavily on noise.
4. Label Cleaning or Outlier Removal
Preprocess data to remove mislabeled/noisy points using:
o Clustering, outlier detection (e.g., LOF)
o Manual review for small datasets
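A hedged sketch combining several of these ideas (shallow trees, stochastic subsampling, a small learning rate, and early stopping) using scikit-learn's GradientBoostingClassifier; the specific parameter values are illustrative assumptions, and X_train, y_train are as in the earlier noisy-data sketch:

from sklearn.ensemble import GradientBoostingClassifier

robust_boost = GradientBoostingClassifier(
    n_estimators=500,        # upper limit; early stopping usually halts well before this
    max_depth=2,             # shallow trees limit base-learner complexity
    learning_rate=0.05,      # smaller steps reduce the influence of any single noisy round
    subsample=0.8,           # stochastic boosting: each tree sees 80% of the training data
    validation_fraction=0.1, # hold out 10% of the training data for early stopping
    n_iter_no_change=10,     # stop if the validation score stalls for 10 rounds
    random_state=42,
)
robust_boost.fit(X_train, y_train)
print("Trees actually fitted:", robust_boost.n_estimators_)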
Boosting is highly effective on clean data but suffers in noisy environments due to its
error-focusing nature.
Using techniques like regularization, early stopping, and robust models (e.g., XGBoost,
LightGBM) significantly reduces overfitting.
These modifications make Boosting more reliable and generalizable even with noisy
datasets.