Practical CO/PO
Number
1 CO1
Problem Definition
Machine Learning Environments and Data Exploration Using
the Iris Dataset
You are beginning your journey in machine learning by setting up the
environment needed to write and run ML code. Your first task is to
understand how to install and use Anaconda for launching Jupyter
Notebooks, and explore the use of Google Colab as a cloud-based
alternative. Using a beginner-friendly dataset such as the Iris dataset
from the UCI Machine Learning Repository, you will learn how to
load data using Python libraries like pandas, view basic statistics, and
visualize the dataset using simple plots. This module focuses on
navigating the ML environment, loading datasets, viewing column
types, checking for missing values, and performing basic descriptive
analysis—all within an interactive notebook interface. The goal is to
build comfort with the tools and develop foundational skills for
handling data before applying machine learning algorithms.
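The loading and inspection steps described above can be sketched as follows. This is a minimal example that assumes the copy of Iris bundled with scikit-learn is an acceptable stand-in for the UCI download (so the cell also runs offline); the column name `species` is chosen here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled Iris data instead of the UCI file.
iris = load_iris(as_frame=True)
df = iris.frame.copy()

# Replace the numeric target with readable species names.
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
df = df.drop(columns="target")

print(df.head())                      # first five rows
print(df.dtypes)                      # column types
missing_per_column = df.isna().sum()  # missing values per column
print(missing_per_column)
summary = df.describe()               # basic descriptive statistics
print(summary)
class_counts = df["species"].value_counts()
print(class_counts)
```

The same cell should run unchanged in Jupyter Notebook, Google Colab, or a VS Code notebook, provided pandas and scikit-learn are installed.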
Key Questions / Analysis / Interpretation to be Evaluated
1. Was the dataset correctly loaded and explored using pandas in
Jupyter/Colab?
2. Were column types and missing values identified, and summary
statistics generated?
3. Was the dataset correctly loaded and explored using VS Code?
Supplementary Problems
Try loading another dataset, such as the Wine dataset from UCI, and
perform similar steps independently.
Key Skills Addressed
Installation, Data-loading, Visualization, Summarization.
Applications
Forms the base for preparing datasets in any domain before applying
machine learning algorithms.
Learning Outcome
Upon completing this practical, you will:
1. Operate ML Environments: Confidently use tools like Jupyter
Notebook, Google Colab, and VS Code for data science tasks.
2. Handle Datasets: Load, inspect, and summarize structured
datasets using Python libraries like pandas.
3. Visualize Data: Create simple yet insightful plots to
understand data distributions and relationships.
4. Develop Insights: Analyze dataset characteristics, such as
column types and missing values, to inform preprocessing and
machine learning model decisions.
These foundational skills are critical for preparing and
understanding data, which is the first step in any machine learning
pipeline.
Dataset/Test Data
Source: https://archive.ics.uci.edu/dataset/53/iris
Description: This is one of the earliest datasets used in the literature
on classification methods and widely used in statistics and machine
learning. The data set contains 3 classes of 50 instances each, where
each class refers to a type of iris plant. One class is linearly separable
from the other 2; the latter are not linearly separable from each other.
Predicted attribute: class of iris plant.
Tools/Technology
Python, pandas, scikit-learn, matplotlib, seaborn.
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work:
Prepare a notebook summary that includes: dataset import, head/tail,
data types, null value check, and at least two plots with insights.
Evaluation Strategy:
Successful loading and inspection of data, basic descriptive
statistics and missing-value check, and a well-documented notebook
submission.
2 CO1
Problem Definition
Exploratory Data Analysis (EDA) of Air Quality Data in India
You are tasked with performing Exploratory Data Analysis (EDA) on
the Air Quality Data in India using Jupyter Notebook or Google
Colab. This dataset contains measurements of various pollutants
(PM2.5, PM10, NO2, SO2, CO, O3) across multiple Indian cities and
dates. Your objective is to explore the dataset by calculating
descriptive statistics to understand central tendencies and dispersion,
identifying and handling missing values, analyzing correlations
between pollutants using a correlation matrix, and generating
visualizations such as line plots, boxplots, histograms, and heatmaps
to reveal trends, anomalies, and relationships in the data. The goal is
to develop a meaningful understanding of pollution patterns across
time and location while gaining hands-on experience in basic EDA
techniques.
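One possible shape for this EDA is sketched below. The data frame is a small synthetic stand-in for the Kaggle file, so the values, the two cities chosen, and the column names `City` and `Date` are assumptions; the per-city median fill is only one of several reasonable missing-value strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle data: two cities, 60 daily readings each.
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=60, freq="D")
frames = []
for city, base in [("Delhi", 120.0), ("Chennai", 45.0)]:
    pm25 = base + rng.normal(0, 10, len(dates))
    frames.append(pd.DataFrame({
        "City": city,
        "Date": dates,
        "PM2.5": pm25,
        "PM10": 1.6 * pm25 + rng.normal(0, 8, len(dates)),
        "NO2": 30 + rng.normal(0, 5, len(dates)),
    }))
air = pd.concat(frames, ignore_index=True)
air.loc[air.index[::15], "PM2.5"] = np.nan  # inject a few gaps

pollutants = ["PM2.5", "PM10", "NO2"]
print(air[pollutants].describe())            # central tendency and dispersion
missing_before = air[pollutants].isna().sum()

# One possible strategy: fill gaps with the median of the same city.
air[pollutants] = air.groupby("City")[pollutants].transform(
    lambda s: s.fillna(s.median())
)
missing_after = air[pollutants].isna().sum()

corr = air[pollutants].corr()                # correlation matrix
city_mean = air.groupby("City")["PM2.5"].mean()
print(corr)
print(city_mean)
```

The `corr` frame can be passed to a heatmap function such as `seaborn.heatmap` for the visual part of the analysis.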
Key Questions / Analysis / Interpretation to be Evaluated
1. Interpret spatial and temporal pollution patterns.
2. Analyze interdependence among pollutants using correlation
matrices.
3. Evaluate data quality and justify handling strategies for
missing values.
Supplementary Problems
Compare pollution levels during winter and summer months across
top 5 polluted cities using visual analysis.
Key Skills Addressed
Data Cleaning and Pre-processing, Descriptive Statistical Analysis,
Correlation Analysis, Data Visualization
Applications
Developing data-driven air quality monitoring systems for urban
pollution control and policy planning.
Learning Outcome
By the end of this practical, learners will:
1. Perform Comprehensive EDA:
o Explore datasets using descriptive statistics (e.g.,
mean, median, standard deviation).
o Visualize trends, anomalies, and relationships using
appropriate plots.
2. Handle Data Quality Issues:
o Identify missing data and apply strategies like
imputation or deletion.
3. Analyze Correlations:
o Interpret pollutant relationships using correlation
matrices and heatmaps.
4. Interpret Trends:
o Use visual analysis to compare seasonal pollution
patterns across cities.
Dataset/Test Data
Source: https://www.kaggle.com/datasets/rohanrao/air-quality-data-in-india/data
Description:
Context
Air is what keeps humans alive. Monitoring it and understanding its
quality is of immense importance to our well-being.
Content
The dataset contains air quality data and AQI (Air Quality Index) at
hourly and daily level of various stations across multiple cities in
India.
Cities
Ahmedabad, Aizawl, Amaravati, Amritsar, Bengaluru, Bhopal,
Brajrajnagar, Chandigarh, Chennai, Coimbatore, Delhi, Ernakulam,
Gurugram, Guwahati, Hyderabad, Jaipur, Jorapokhar, Kochi,
Kolkata, Lucknow, Mumbai, Patna, Shillong, Talcher,
Thiruvananthapuram, Visakhapatnam
Tools/Technology
Python, pandas, scikit-learn, matplotlib, seaborn
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work:
Try to recreate the visualizations in an interactive form.
Evaluation Strategy:
EDA, metrics interpretation, assumption validation, correlation
analysis
3 CO1
Problem Definition
Data Cleaning and Preprocessing for COVID-19 Dataset
You are tasked with performing Exploratory Data Analysis (EDA) on
the COVID-19 India dataset available at Kaggle: COVID-19 in India.
This dataset includes time-series data of confirmed, recovered, and
deceased cases across Indian states. Your objective is to understand
data quality and trends by loading the dataset in Jupyter Notebook or
Google Colab, and applying core EDA skills such as handling
missing values, encoding categorical state names, detecting and
treating outliers in daily case counts, and scaling numerical features
for visualization. You will generate plots to identify peaks, trends,
and state-wise comparisons, while also using descriptive statistics
and correlation analysis to uncover meaningful patterns. This
practical will build foundational skills essential for preparing real-
world public health data for modeling or policy insights.
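The cleaning steps named above (missing values, encoding, outliers, scaling) might look like the following sketch. The frame is synthetic, and the column names `State`, `NewCases`, `StateCode`, `NewCasesCapped`, and `NewCasesScaled` are hypothetical; the IQR rule and capping are one common choice, not the only valid treatment.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical daily new-case counts for two states.
rng = np.random.default_rng(1)
dates = pd.date_range("2021-01-01", periods=30, freq="D")
covid = pd.concat(
    [
        pd.DataFrame({
            "Date": dates,
            "State": state,
            "NewCases": rng.integers(50, 150, len(dates)).astype(float),
        })
        for state in ("Kerala", "Maharashtra")
    ],
    ignore_index=True,
)
covid.loc[5, "NewCases"] = 5000.0   # simulated reporting spike
covid.loc[12, "NewCases"] = np.nan  # simulated missing report

# Missing values: interpolate within each state's own time series.
covid["NewCases"] = covid.groupby("State")["NewCases"].transform(
    lambda s: s.interpolate(limit_direction="both")
)

# Encoding: map state names to integer codes.
covid["StateCode"] = LabelEncoder().fit_transform(covid["State"])

# Outliers: flag values outside 1.5 * IQR, then cap them at the fences.
q1, q3 = covid["NewCases"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = ~covid["NewCases"].between(lower, upper)
covid["NewCasesCapped"] = covid["NewCases"].clip(lower, upper)

# Scaling: bring the capped counts into [0, 1] for plotting.
covid["NewCasesScaled"] = MinMaxScaler().fit_transform(
    covid[["NewCasesCapped"]]
).ravel()
print(covid.loc[is_outlier, ["Date", "State", "NewCases", "NewCasesCapped"]])
```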
Key Questions / Analysis / Interpretation to be Evaluated
1. Demonstrate understanding of recovery patterns and potential
predictive relationships.
2. Clean and validate data before modelling.
3. Interpret trends and compare the regional impact of the
pandemic.
Supplementary Problems
Analyze how lockdown phases impacted the trend of COVID-19
cases in at least two major Indian states using time-series plots.
Key Skills Addressed
Data Cleaning and Pre-processing, Descriptive Statistical Analysis,
Correlation Analysis, Data Visualization
Applications
Supports public health decision-making by identifying regional
COVID-19 trends and data quality issues for targeted interventions.
Learning Outcome
By the end of this practical, learners will be able to:
1. Load and Inspect Time-Series Data: Understand dataset
structure and key features.
2. Handle Data Quality Issues: Clean missing values, encode
categorical data, and scale numerical features.
3. Analyze Trends and Patterns: Use plots to explore
temporal and regional trends.
4. Prepare Data for Modeling: Develop a structured dataset
ready for predictive analysis.
Dataset/Test Data
Source: https://www.kaggle.com/datasets/sudalairajkumar/covid19-in-india
Description: Coronaviruses are a large family of viruses which may
cause illness in animals or humans. In humans, several
coronaviruses are known to cause respiratory infections ranging
from the common cold to more severe diseases such as Middle East
Respiratory Syndrome (MERS) and Severe Acute Respiratory
Syndrome (SARS). The most recently discovered coronavirus
causes coronavirus disease COVID-19 (World Health Organization).
The number of new cases is increasing day by day around the
world. This dataset has information from the states and union
territories of India at daily level.
State level data comes from the Ministry of Health & Family Welfare.
Testing data and vaccination data come from covid19india. Huge
thanks to them for their efforts!
Update on April 20, 2021: Thanks to the Team at ISIBang, I was
able to get the historical data for the periods that I missed to collect
and updated the csv file.
Tools/Technology
Python, pandas, scikit-learn, matplotlib, seaborn.
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work:
Prepare a notebook summary that includes:
1. Steps to load and clean the dataset.
2. Code and outputs for descriptive statistics and visualizations.
3. At least two time-series plots comparing state-wise trends
with insights.
Evaluation Strategy:
1. Identified and compared total confirmed, recovered, and
deceased cases across different Indian states over time.
2. Detected and addressed any anomalies or outliers in daily case
counts, noting potential data quality issues.
3. Performed correlation analysis between key variables (e.g.,
confirmed vs. recovered cases) and interpreted relationships
across states and time.
4 CO2
Problem Definition
Energy Efficiency Estimation for Smart Buildings
You are part of a smart infrastructure team working on developing an
energy efficiency model for residential buildings. The dataset
includes architectural and environmental features like wall area, roof
area, glazing area, orientation, relative compactness, and overall
height. Your goal is to predict heating load in kilowatts based on these
attributes. You must analyze the role of each feature, check for
linearity assumptions, and determine whether simple linear
approaches suffice or polynomial transformations are necessary.
Investigate both underfitting and overfitting scenarios by comparing
training and testing errors. You are expected to critically evaluate and
justify your modeling decisions using residual plots, correlation
matrices, and error metrics.
Key Questions / Analysis / Interpretation to be Evaluated
1. Conduct exploratory data analysis to determine if linear or
nonlinear patterns exist.
2. Select relevant features by analyzing correlation and
multicollinearity.
3. Build a predictive model for heating load, interpret its
coefficients, and explain each feature's influence.
4. Calculate MAE, MSE, RMSE, and R² for both training and
test sets.
5. Report on any signs of overfitting.
6. Assess bias and variance using learning curves and validate
assumptions via residual plots.
7. Recommend whether polynomial transformations or feature
interaction terms may improve performance.
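The comparison of linear and polynomial fits on training and test data can be sketched as below. The data is synthetic with a deliberately curved relationship, so the two feature columns are placeholders and not the actual Energy Efficiency attributes; the `report` helper is a name introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical data: the target depends on a squared term and an interaction.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(400, 2))
y = 10 + 15 * X[:, 0] ** 2 + 5 * X[:, 0] * X[:, 1] + rng.normal(0, 0.5, 400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def report(model, X_part, y_part):
    """Return MAE, MSE, RMSE and R2 for one data split."""
    pred = model.predict(X_part)
    mse = mean_squared_error(y_part, pred)
    return {
        "MAE": mean_absolute_error(y_part, pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_part, pred),
    }

results = {}
for degree in (1, 2):
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    results[degree] = {
        "train": report(model, X_train, y_train),
        "test": report(model, X_test, y_test),
    }
    print(degree, results[degree])
```

A training error far below the test error suggests overfitting, while high error on both splits suggests underfitting. Plotting the residuals (`y_test - model.predict(X_test)`) against the predictions is the usual next check.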
Supplementary Problems
Predict electricity consumption using weather and occupancy data.
Key Skills Addressed
Regression modeling, error interpretation, residual analysis,
transformation handling.
Applications
Linear regression, when mastered through practical implementations
such as energy load prediction, serves as a foundational technique
for numerous real-world tasks involving continuous value
estimation.
Learning Outcome
Upon completing this practical:
Students will master regression diagnostics, identify modeling
pitfalls, and improve estimation accuracy using real-world features.
Dataset/Test Data
Source: UCI Energy Efficiency dataset
Link: https://archive.ics.uci.edu/dataset/242/energy+efficiency
Dataset Information
We perform energy analysis using 12 different building shapes
simulated in Ecotect. The buildings differ with respect to the glazing
area, the glazing area distribution, and the orientation, amongst other
parameters. We simulate various settings as functions of the
aforementioned characteristics to obtain 768 building shapes. The dataset
comprises 768 samples and 8 features, aiming to predict two real
valued responses. It can also be used as a multi-class classification
problem if the response is rounded to the nearest integer.
Tools/Technology
Python, pandas, scikit-learn, matplotlib, seaborn.
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work:
Try advanced transformations (log, polynomial) and compare
against base model.
Evaluation Strategy:
EDA, metrics interpretation, assumption validation, coefficient
analysis.
5 Problem Definition CO2
Diagnosing Disease from Symptoms
You are working with a health analytics company to develop a
predictive system that flags high-risk patients for chronic illness
based on demographic, biometric, and lifestyle attributes. Your
dataset contains fields like age, BMI, blood pressure, glucose levels,
and behavioral flags (e.g., smoking, alcohol). You are tasked with
designing a binary classifier that not only predicts the outcome but
also justifies its sensitivity to false positives and false negatives under
different decision thresholds. Your solution should deal with class
imbalance and emphasize the use of probability-based predictions
instead of hard labels.
Key Questions / Analysis / Interpretation to be Evaluated
1. Which features have the highest impact? How is this
validated?
2. What does the confusion matrix reveal about the model's
classification priorities?
3. How does adjusting the decision threshold influence precision
and recall?
4. Why is ROC-AUC a more appropriate evaluation metric than
accuracy in this scenario?
5. Is the model biased toward a particular class? How can this
bias be detected and addressed?
6. How well does the model generalize across multiple
validation folds?
Supplementary Problems
Fraud detection in financial transactions.
Key Skills Addressed
Binary classification, threshold tuning, evaluation metrics (ROC-
AUC, F1), cost-sensitive modelling.
Applications
Logistic regression is a fundamental algorithm for binary and multi-
class classification tasks, widely adopted in various sectors where
decisions are made based on probability-driven outcomes. Its
mathematical simplicity, interpretability, and real-time efficiency
make it a core skill in the data science and applied machine learning
toolkit.
Learning Outcome
Upon completing this practical:
Students will build interpretable classifiers, handle imbalanced
datasets, and make cost-sensitive decisions.
Dataset/Test Data
Source: PIMA Diabetes dataset / Synthetic healthcare dataset
Link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
About Dataset
Context
This dataset is originally from the National Institute of Diabetes and
Digestive and Kidney Diseases. The objective of the dataset is to
diagnostically predict whether or not a patient has diabetes, based on
certain diagnostic measurements included in the dataset. Several
constraints were placed on the selection of these instances from a
larger database. In particular, all patients here are females at least 21
years old of Pima Indian heritage.
Content
The dataset consists of several medical predictor variables and one
target variable, Outcome. Predictor variables include the number of
pregnancies the patient has had, their BMI, insulin level, age, and so
on.
Tools/Technology
Python, sklearn, imbalanced-learn, seaborn.
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work:
Evaluate confusion matrix across 3 thresholds (0.3, 0.5, 0.7);
suggest best threshold for real use case.
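A minimal sketch of the threshold comparison follows. It uses a synthetic imbalanced dataset from `make_classification` rather than the PIMA data, and `class_weight="balanced"` is just one way to address imbalance (resampling with imbalanced-learn is another).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, roc_auc_score
)
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: about 15% positives.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Work with probabilities of the positive class, not hard labels.
proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)

threshold_report = {}
for t in (0.3, 0.5, 0.7):
    pred = (proba >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
    threshold_report[t] = {
        "TN": int(tn), "FP": int(fp), "FN": int(fn), "TP": int(tp),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred),
    }
    print(t, threshold_report[t])
print("ROC-AUC:", auc)
```

Lowering the threshold can only add positive predictions, so recall never decreases as the threshold drops; the cost is usually more false positives.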
Evaluation Strategy:
Threshold analysis, metric justification, interpretability, impact of
imbalance.
6 Problem Definition CO2
Recommending Products Based on Browsing Behaviour
A personalized recommendation system is to be developed to predict
which product category a user is likely to interact with next, based on
their browsing behavior. The dataset contains features such as time
spent per category, last purchased product, time since last login,
search keywords, and average rating of viewed products. The task
involves building a similarity-based model that identifies the top-k
closest users or items and classifies the next likely interaction.
The model must evaluate multiple distance metrics (e.g., Euclidean,
Manhattan, Cosine) and analyze their influence on classification
performance. Additionally, KNN should be extended to a regression
context to predict user engagement time or next session duration.
Scalability concerns, the effect of dimensionality, and the role of
normalization in distance-based models must be critically analyzed.
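The distance-metric comparison could be set up along these lines. The data is synthetic (three classes standing in for product categories), and `algorithm="brute"` is set explicitly here because brute-force search accepts all three metrics.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical browsing features with three target categories.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

metric_scores = {}
for metric in ("euclidean", "manhattan", "cosine"):
    # Scaling sits inside the pipeline so each CV fold is normalized
    # using statistics from its own training part only.
    knn = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, metric=metric, algorithm="brute"),
    )
    metric_scores[metric] = cross_val_score(knn, X, y, cv=5).mean()

for metric, score in metric_scores.items():
    print(f"{metric:10s} mean CV accuracy = {score:.3f}")
```

Without the scaler, features with large numeric ranges dominate Euclidean and Manhattan distances, which is why normalization matters for distance-based models.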
Key Questions / Analysis / Interpretation to be Evaluated
1. Which distance metric gave the best performance, and
what justifies this choice?
2. How does the value of K influence model accuracy and
variance?
3. At what K-values does the model exhibit overfitting or
underfitting?
4. What insights do misclassified categories provide about
potential class overlap?
5. How were the features normalized, and why is
normalization critical in this context?
6. What are the scalability limitations of this approach when
applied to large datasets?
7. How does the regression model perform (e.g., in
predicting session time)? Compare MAE/MSE across K
values.
Supplementary Problems
News recommendation, personalized course suggestion.
Key Skills Addressed
Distance-based classification, multi-class confusion matrix,
hyperparameter tuning.
Applications
K-Nearest Neighbors (KNN) is a non-parametric, instance-based
learning algorithm widely used in both classification and regression
tasks. Despite its simplicity, KNN is powerful in applications where
pattern similarity, local structure, or case-based reasoning is
important.
Learning Outcome
Upon completing this practical:
Students will develop and analyze similarity-based classification
systems under practical constraints.
Dataset/Test Data
Source: Retail user activity logs / Synthetic behavior data
Link: https://archive.ics.uci.edu/dataset/352/online+retail
Dataset Information
This is a transactional data set which contains all the transactions
occurring between 01/12/2010 and 09/12/2011 for a UK-based and
registered non-store online retail. The company mainly sells unique
all-occasion gifts. Many customers of the company are wholesalers.
Tools/Technology
Python, scikit-learn, seaborn, NumPy.
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work:
Create K vs Accuracy (for classification) and K vs MAE/MSE (for
regression) plots. Apply PCA to analyze the impact of dimensionality
reduction on performance.
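For the regression half of the post-lab work, the K versus MAE curve can be computed as in this sketch. The target `session_minutes` and its generating formula are invented for illustration; the resulting dictionaries can be plotted with matplotlib.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical behaviour features and session duration in minutes.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
session_minutes = (
    5 + 2 * np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.5, 500)
)
X_train, X_test, y_train, y_test = train_test_split(
    X, session_minutes, test_size=0.3, random_state=0
)

k_values = [1, 3, 5, 11, 21, 51]
train_mae, test_mae = {}, {}
for k in k_values:
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    model.fit(X_train, y_train)
    train_mae[k] = mean_absolute_error(y_train, model.predict(X_train))
    test_mae[k] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"K={k:2d} train MAE={train_mae[k]:.3f} test MAE={test_mae[k]:.3f}")

best_k = min(test_mae, key=test_mae.get)
print("K with lowest test MAE:", best_k)
```

At K=1 each training point is its own nearest neighbour, so the training error is zero while the test error is not; that gap is the overfitting signature to look for in the plot.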
Evaluation Strategy:
Classification and regression performance metrics, distance metric
comparison, normalization analysis, scalability discussion, PCA
analysis.
7 Problem Definition CO3
Customer Churn Prediction using Decision Tree & Random
Forest
In today’s competitive telecom industry, retaining customers is
crucial. Customer churn refers to the loss of clients who stop using a
company's services. This practical aims to develop a predictive model
using Decision Tree and Random Forest algorithms to identify
customers likely to churn based on historical data such as service
usage, billing information, and customer support interactions. By
analyzing these patterns, the model will help the company proactively
address issues and improve retention strategies. The key focus is on
building accurate and interpretable models, identifying important
features, and minimizing overfitting while ensuring better decision-
making support for the business.
Key Questions / Analysis / Interpretation to be Evaluated
1. What are the most important features influencing customer
churn?
2. Can a decision tree model accurately classify churners vs
non-churners?
3. Does the Random Forest model reduce overfitting compared
to a single decision tree?
4. How does feature importance differ between the models?
5. How does model accuracy change with varying depths or
number of trees?
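The questions above can be explored with a sketch like the following. It uses synthetic data with generic feature names (`feature_0` and so on) instead of the Telco columns, and compares an unconstrained tree with a 200-tree forest; both settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical churn-like data: about 25% positives, some label noise.
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=4, n_redundant=2,
    weights=[0.75, 0.25], flip_y=0.05, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "tree": DecisionTreeClassifier(random_state=0),  # unlimited depth
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": model.score(X_train, y_train),
        "test": model.score(X_test, y_test),
    }
    print(name, scores[name])

# Rank features by the forest's impurity-based importance.
ranked = sorted(
    zip(feature_names, models["forest"].feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top3 = ranked[:3]
print("Top 3 features:", top3)
```

A large gap between training and test accuracy for the single tree is the overfitting to discuss; repeating the loop with different `max_depth` or `n_estimators` values addresses the last question.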
Supplementary Problems
Handling missing or imbalanced data.
Key Skills Addressed
Students will gain hands-on experience in handling data imbalance,
building and tuning Decision Tree and Random Forest models,
analyzing feature importance, and evaluating models to minimize
overfitting.
Applications
Students will understand how machine learning can predict
customer churn, aiding industries like telecom, banking, and e-
commerce in reducing customer attrition through data-driven
strategies.
Learning Outcome
Upon completing this practical:
Students will learn to preprocess data, build interpretable predictive
models, identify key factors driving churn, and validate their models
effectively for real-world applications.
Dataset/Test Data
Source: Telco Customer Churn Dataset (Kaggle)
Link: https://www.kaggle.com/datasets/blastchar/telco-customer-churn
Features include: tenure, MonthlyCharges, Contract,
TechSupport, PaymentMethod, etc.
Target: Churn (Yes/No).
Tools/Technology
Python, pandas, numpy for data handling
sklearn for ML models
matplotlib, seaborn for visualization
Jupyter Notebook or Google Colab.
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work:
Which model performed better and why?
Top 3 features influencing churn
Visualization of decision tree and feature importance plot
Suggested Experiment: Try with another dataset (e.g., Credit Default
dataset).
Evaluation Strategy:
Decision Tree Implementation, Random Forest Model, Overfitting
Explanation and Handling, Feature Importance Interpretation,
Evaluation Metrics and Justification.
8 Problem Definition CO3
Email Spam Detection using Support Vector Machine (SVM)
With the rapid increase in email usage, spam messages have become
a major nuisance, often carrying malicious links or irrelevant
promotions. The objective of this practical is to build a Support
Vector Machine (SVM) classifier to distinguish between spam and
non-spam (ham) emails based on the textual content of emails. The
model will analyze word frequencies, patterns, and metadata to
identify whether an email is likely spam. By leveraging key SVM
concepts such as margin maximization and kernel tricks, the solution
aims to deliver high accuracy and generalization while visualizing
how data is separated in high-dimensional space.
Key Questions / Analysis / Interpretation to be Evaluated
1. Can SVM effectively separate spam and non-spam
emails based on textual features?
2. How does changing the kernel (linear, polynomial, RBF)
affect classification performance?
3. What is the optimal decision boundary, and how is it
determined?
4. How does margin maximization improve generalization?
5. What insights can be gained by visualizing high-
dimensional data?
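A kernel comparison on text can be sketched as below. The eight messages are a tiny made-up corpus, far too small to draw conclusions from; they only show how TF-IDF features feed an SVM with different kernels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny hypothetical corpus; the real practical uses the SMS Spam Collection.
messages = [
    "Win a free prize now, click the link",
    "Free entry in a weekly cash competition",
    "Urgent! You have won a free holiday, call now",
    "Claim your free ringtone, text WIN now",
    "Are we still meeting for lunch today?",
    "I will call you when I get home",
    "Can you send me the notes from class?",
    "See you at the station at six",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]
new_messages = ["free prize, click now", "see you at lunch"]

kernel_predictions = {}
for kernel in ("linear", "poly", "rbf"):
    # TF-IDF turns each message into a sparse word-weight vector.
    model = make_pipeline(TfidfVectorizer(), SVC(kernel=kernel))
    model.fit(messages, labels)
    kernel_predictions[kernel] = model.predict(new_messages)
    n_support = model.named_steps["svc"].n_support_
    print(kernel, kernel_predictions[kernel], "support vectors per class:", n_support)
```

On the full dataset, the same loop combined with a train/test split and `classification_report` gives the precision, recall, and F1 figures needed to justify a kernel choice.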
Supplementary Problems
Document classification.
Key Skills Addressed
Students will build text classification pipelines, train SVM
classifiers with linear, polynomial, and RBF kernels, and evaluate
them using the confusion matrix, precision, recall, and F1-score.
Applications
Students will understand how machine learning can power robust
spam detection systems that improve email filtering, enhance
cybersecurity, and moderate content across domains like
communication and social media.
Learning Outcome
Upon completing this practical:
Students will develop an understanding of SVM principles,
implement text classification pipelines, and interpret the
performance and decision boundaries of SVM for high-dimensional
data.
Dataset/Test Data
Dataset: SMS Spam Collection Dataset (UCI)
Link: https://archive.ics.uci.edu/dataset/228/sms+spam+collection
Features extracted from: email/SMS content, frequency of certain
keywords, presence of special characters or links.
Target: Spam or Ham
Tools/Technology
Python, pandas, numpy for data handling
sklearn for ML models
matplotlib, seaborn for visualization
Jupyter Notebook or Google Colab.
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work:
Choice and justification of kernel
Confusion matrix, precision, recall, F1-score
Visualization of feature space and decision boundaries
Experiment with another dataset or add bigrams/trigrams as
features.
Evaluation Strategy:
SVM Model Implementation (with kernel variation), Explanation of
Kernel Trick & Margin Maximization, Visualization and
Interpretation, Evaluation Metrics (confusion matrix, F1-score).
9 Problem Definition CO3
Heart Disease Prediction Using Model Evaluation & Cross-
Validation
Heart disease is one of the leading causes of death globally. Early
and accurate prediction of heart disease based on patient health
indicators can assist in timely diagnosis and treatment. This
practical aims to build and evaluate a classification model (e.g.,
Logistic Regression, Decision Tree) using various performance
metrics such as accuracy, precision, recall, F1-score, and ROC
curve. To ensure robustness and generalizability of the model, k-
fold cross-validation will be applied. The goal is to not only build a
classifier but also interpret and compare the model's performance
using different evaluation techniques.
Key Questions / Analysis / Interpretation to be Evaluated
1. Is the model accurately predicting patients at risk of heart
disease?
2. How do performance metrics vary when using different
algorithms?
3. Which metric (precision, recall, F1) is more important in a
medical context?
4. What does the ROC curve reveal about the model's
discrimination ability?
5. How consistent is the model's performance across k-folds?
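One way to gather these metrics across folds is sketched here. The data is synthetic (13 features, an arbitrary choice), and five stratified folds is an assumption; the practical leaves k open.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary classification data standing in for patient records.
X, y = make_classification(
    n_samples=600, n_features=13, n_informative=6, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

cv_summary = {}
for name, model in models.items():
    res = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Store (mean, standard deviation) across folds for each metric.
    cv_summary[name] = {
        m: (res[f"test_{m}"].mean(), res[f"test_{m}"].std()) for m in scoring
    }
    for m, (mean, std) in cv_summary[name].items():
        print(f"{name:20s} {m:10s} {mean:.3f} +/- {std:.3f}")
```

The standard deviation across folds is a direct answer to the consistency question: a small spread suggests stable performance, a large one suggests sensitivity to the particular split.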
Supplementary Problems
Quality assurance in manufacturing.
Key Skills Addressed
This practical enables students to evaluate models with metrics like
ROC and F1-score, apply k-fold cross-validation for reliability, and
compare algorithms to ensure robust predictions.
Applications
Students will see how predictive models can assist in early detection
of heart disease, supporting timely medical interventions and
improving healthcare decision-making.
Learning Outcome
Upon completing this practical:
Students will learn to train and assess classification models, apply
evaluation techniques for reliability, and interpret results, preparing
them to solve real-world healthcare problems with machine learning.
Dataset/Test Data
Dataset: UCI Heart Disease Dataset
Link: https://archive.ics.uci.edu/dataset/45/heart+disease
Features include: age, sex, chest pain type, resting blood pressure,
cholesterol, fasting blood sugar, etc.
Target: Presence or absence of heart disease
Tools/Technology
Python
pandas, numpy for data handling
sklearn for modeling, evaluation, and cross-validation
matplotlib, seaborn for plotting metrics and ROC curves
Jupyter Notebook / Google Colab.
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work:
Confusion matrices, Metric values, ROC curves and interpretation,
Summary of k-fold cross-validation results.
Evaluation Strategy:
Model Training and Basic Evaluation, Metric Calculation
(Precision, Recall, F1, Accuracy), ROC Curve and AUC
Interpretation, k-Fold Cross-Validation Implementation and
Analysis.
10 Problem Definition CO4
Customer Segmentation Using K-Means and DBSCAN
Businesses often deal with large, diverse customer bases. Segmenting
customers based on behavior allows companies to tailor marketing
strategies, improve customer experience, and optimize resources.
This practical aims to apply K-Means and DBSCAN clustering
algorithms to a retail customer dataset to group customers with
similar purchasing behavior. By analyzing features like annual
income and spending score, we will explore clustering techniques,
evaluate performance using inertia and silhouette scores, and
visualize the clusters. This hands-on practical demonstrates how
unsupervised learning can uncover hidden patterns in data without
labeled outcomes.
Key Questions / Analysis / Interpretation to be Evaluated
1. How many natural customer segments are there in the
dataset?
2. What is the optimal number of clusters (for K-Means)?
3. How does DBSCAN perform compared to K-Means for
density-based clustering?
4. What do inertia and silhouette scores reveal about clustering
performance?
5. How can clusters be interpreted for business decision-
making?
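The K-Means and DBSCAN comparison can be sketched as follows. The points come from `make_blobs` rather than the mall dataset, and the DBSCAN settings `eps=0.3` and `min_samples=5` are illustrative guesses that would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature data, e.g. income and spending score.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# K-Means: record inertia (for the elbow plot) and silhouette per K.
inertia, silhouette = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertia[k] = km.inertia_
    silhouette[k] = silhouette_score(X_scaled, km.labels_)
    print(f"K={k} inertia={inertia[k]:.1f} silhouette={silhouette[k]:.3f}")
best_k = max(silhouette, key=silhouette.get)

# DBSCAN: the number of clusters is discovered; label -1 marks noise.
db = DBSCAN(eps=0.3, min_samples=5).fit(X_scaled)
n_db_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = int(np.sum(db.labels_ == -1))
print("K with best silhouette:", best_k)
print("DBSCAN clusters:", n_db_clusters, "noise points:", n_noise)
```

Inertia always tends to fall as K grows, so it is read through the elbow shape, whereas the silhouette score can be compared directly across K values.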
Supplementary Problems
Image compression and segmentation.
Key Skills Addressed
By completing this practical, students will apply centroid-based and
density-based clustering techniques, evaluate cluster quality using
inertia and silhouette scores, and visualize data to derive meaningful
insights for business.
Applications
Students will learn to segment customers effectively, enabling
businesses in retail and e-commerce to design targeted marketing
strategies, improve customer satisfaction, and detect anomalies.
Learning Outcome
Upon completing this practical:
Students will understand clustering algorithms, evaluate their
performance, and interpret clusters to support business decisions
with unsupervised learning techniques.
Dataset/Test Data
Dataset: Mall Customer Segmentation Dataset (Kaggle)
https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python
Features: CustomerID, Gender, Age, Annual Income, Spending
Score
Optional: Normalize Annual Income and Spending Score for
DBSCAN
Tools/Technology
Python
pandas, numpy for data manipulation
sklearn for K-Means, DBSCAN, silhouette score
matplotlib, seaborn for visualization
scipy, PCA from sklearn.decomposition for dimensionality
reduction
Jupyter Notebook / Google Colab.
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work:
Cluster analysis and business interpretation of each segment
Evaluation metrics: inertia and silhouette score comparison
Visualization of K-Means and DBSCAN clusters
Suggested Extension: Apply on a different dataset such as customer
behavior or geolocation data.
Evaluation Strategy:
K-Means Clustering Implementation & Evaluation, DBSCAN
Clustering Implementation & Comparison, Inertia and Silhouette
Score Calculation & Interpretation, Cluster Visualization and
Business Insights.
11 Problem Definition CO4
Face Recognition Feature Reduction using PCA
Face recognition systems process high-dimensional image data,
which increases computational cost and may lead to overfitting. This
practical focuses on using Principal Component Analysis (PCA) to
reduce the dimensionality of facial image datasets while retaining
essential features. The goal is to apply PCA to compress the dataset,
visualize it in 2D using principal components, and understand the
impact of eigen decomposition and explained variance. This helps in
improving model efficiency without significant loss of accuracy.
Through this, learners explore how PCA simplifies data while
preserving structure and key information.
Key Questions / Analysis / Interpretation to be Evaluated
1. How does PCA reduce the dimensionality of image data?
2. What percentage of variance is retained by top components?
3. How are eigenvectors and eigenvalues used in PCA?
4. How do lower-dimensional visualizations help in
understanding data clusters?
5. What are the trade-offs between compression and
information loss?
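The explained-variance and reconstruction questions can be explored with this sketch. To stay offline it uses scikit-learn's bundled 8x8 digits images as a stand-in for the Olivetti faces (which `fetch_olivetti_faces` would download); the 95% variance target is an assumed cut-off.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in image data: 1797 images, 64 pixel features each.
X = load_digits().data
X_std = StandardScaler().fit_transform(X)

# Fit all components to inspect the variance spectrum (scree plot data).
pca_full = PCA().fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)

# Compress, then reconstruct to measure the information lost.
pca = PCA(n_components=n_95).fit(X_std)
X_reduced = pca.transform(X_std)
X_restored = pca.inverse_transform(X_reduced)
reconstruction_mse = float(np.mean((X_std - X_restored) ** 2))

# Two components for a 2D scatter plot.
X_2d = PCA(n_components=2).fit_transform(X_std)

print("components for 95% variance:", n_95, "of", X.shape[1])
print("reconstruction MSE:", reconstruction_mse)
```

In scikit-learn, `explained_variance_` holds the eigenvalues of the data covariance matrix and `components_` holds the corresponding eigenvectors, which links the code back to the eigen decomposition.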
Supplementary Problems
Handwriting recognition.
Key Skills Addressed
Through this practical, students will standardize high-dimensional
data, apply PCA for feature reduction, and analyze explained
variance to balance dimensionality and performance.
Applications
Students will explore how dimensionality reduction optimizes face
recognition systems and other high-dimensional tasks, reducing
computation costs while maintaining essential data for accurate
predictions.
Learning Outcome
Upon completing this practical:
Students will learn to implement PCA, visualize data in reduced
dimensions, and evaluate the trade-offs between data compression
and accuracy for real-world applications.
Dataset/Test Data
Dataset: Olivetti Faces Dataset (sklearn)
https://scikit-learn.org/stable/datasets/real_world.html#olivetti-faces-dataset
Features: Pixel values of grayscale facial images
Labels (optional): Person identities (used for classification post
PCA)
Tools/Technology
Python
numpy, pandas for data handling
sklearn.decomposition.PCA for dimensionality reduction
matplotlib, seaborn for 2D plotting and visualizations
sklearn.datasets for loading image datasets.
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work:
Scree plot (explained variance vs components)
2D plot of PCA-transformed data
Interpretation of the number of components selected
Evaluation Strategy:
PCA Implementation & Component Analysis, Eigen Decomposition
& Explained Variance Interpretation, 2D Plotting and Visualization
of Data, Reconstruction Error and Information Loss Analysis.
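One way to approach the reconstruction error and information loss analysis is to compress to k components, project back, and measure the mean squared error. The sketch below uses NumPy's SVD on mean-centred data (equivalent to PCA) and a hypothetical random matrix in place of real images; the helper name reconstruction_mse is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for flattened images: 100 samples x 64 "pixels".
X = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 64)) \
    + 0.05 * rng.normal(size=(100, 64))

mean = X.mean(axis=0)
Xc = X - mean                                   # PCA works on mean-centred data

# Rows of Vt are the principal directions of the centred data.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

def reconstruction_mse(k):
    """Compress to k components, reconstruct, and return the mean squared error."""
    codes = Xc @ Vt[:k].T                       # compressed representation (100 x k)
    X_hat = codes @ Vt[:k] + mean               # back-projection to 64 pixels
    return float(np.mean((X - X_hat) ** 2))

for k in (1, 2, 4, 8, 16, 64):
    retained = (S[:k] ** 2).sum() / (S ** 2).sum()
    print(f"k={k:2d}  variance retained={retained:.3f}  MSE={reconstruction_mse(k):.5f}")
```

The error never increases as k grows and vanishes when all components are kept, which makes the compression versus information-loss trade-off visible in one table.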
12 CO4
Problem Definition
Handwritten Digit Recognition using Artificial Neural Networks
(ANN)
Handwritten digit recognition is a classic problem in the field of
pattern recognition and deep learning. This practical aims to build an
Artificial Neural Network (ANN) using a Multi-Layer Perceptron
(MLP) to classify images of handwritten digits from the MNIST
dataset. The model will learn from pixel-level image data through
forward propagation, adjusting weights using backpropagation, and
applying activation functions such as ReLU and Softmax. The
objective is to train the ANN to achieve high accuracy in predicting
digits (0–9), understand the working of neural layers, and interpret
training behavior using loss and accuracy curves.
Key Questions / Analysis / Interpretation to be Evaluated
1. How does the MLP architecture (hidden layers, neurons)
affect performance?
2. What role do activation functions play in learning non-linear
patterns?
3. How does the network learn through forward and backward
passes?
4. What is the effect of learning rate and number of epochs?
5. How well does the trained ANN generalize to unseen digit
images?
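To make the forward and backward passes concrete, here is a minimal NumPy sketch of a one-hidden-layer MLP with ReLU and Softmax. It is not the TensorFlow or PyTorch implementation the practical calls for; the toy data, layer sizes, learning rate and epoch count are all assumptions chosen so the example runs in a second.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data standing in for flattened images:
# 300 samples, 20 features, 3 classes.
n, d, h, c = 300, 20, 16, 3
y = rng.integers(0, c, size=n)
centers = rng.normal(size=(c, d))
X = centers[y] + rng.normal(size=(n, d))
Y = np.eye(c)[y]                                 # one-hot targets

# Parameters: small random weights, zero biases.
W1 = rng.normal(scale=0.1, size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, c)); b2 = np.zeros(c)

def forward(X):
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0.0)                     # ReLU
    z2 = a1 @ W2 + b2
    z2 = z2 - z2.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(z2) / np.exp(z2).sum(axis=1, keepdims=True)  # Softmax
    return z1, a1, p

lr, epochs = 0.05, 300
losses = []
for epoch in range(epochs):
    z1, a1, p = forward(X)                       # forward pass
    losses.append(-np.mean(np.sum(Y * np.log(p + 1e-12), axis=1)))  # cross-entropy

    # Backward pass: apply the chain rule layer by layer.
    dz2 = (p - Y) / n                            # Softmax + cross-entropy gradient
    dW2 = a1.T @ dz2; db2 = dz2.sum(axis=0)
    da1 = dz2 @ W2.T
    dz1 = da1 * (z1 > 0)                         # ReLU derivative
    dW1 = X.T @ dz1; db1 = dz1.sum(axis=0)

    # Gradient descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

accuracy = np.mean(forward(X)[2].argmax(axis=1) == y)
print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}, train accuracy {accuracy:.2f}")
```

Changing lr or epochs and re-plotting losses is a quick way to explore question 4 before moving to a full framework.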
Supplementary Problems
Automated postal address reading.
Key Skills Addressed
This practical will involve training an ANN using forward and
backward propagation, understanding activation functions, and
tuning hyperparameters to optimize model performance.
Applications
Students will explore how neural networks support handwritten digit
recognition in tasks such as automated postal address reading, where
scanned digits must be classified accurately.
Learning Outcome
Upon completing this practical:
Students will gain the ability to implement and train ANNs, analyze
training behavior, and achieve high accuracy in recognizing
handwritten digits using modern deep learning frameworks.
Dataset/Test Data
Dataset: MNIST Handwritten Digit Dataset (60,000 training, 10,000
test images)
Features: 28x28 grayscale images of digits (0–9)
Target: Digit label (0–9)
Tools/Technology
Python
TensorFlow or PyTorch for ANN implementation
Keras (if using TensorFlow) for high-level APIs
matplotlib, seaborn for plotting training metrics
Jupyter Notebook / Google Colab for hands-on environment.
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work:
Accuracy/loss plots for training and validation sets
Confusion matrix for prediction analysis
At least one experiment with hyperparameter tuning
Evaluation Strategy:
Use of Activation Functions and Forward/Backward Logic, Model
Training and Validation, Accuracy Evaluation and Confusion Matrix,
Hyperparameter Tuning and Interpretation.
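For the accuracy evaluation and confusion matrix, a small NumPy sketch is shown below. The labels are hypothetical and use three classes for brevity; the MNIST task would use n_classes=10. The function name confusion_matrix is defined here and is not the scikit-learn function of the same name.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)      # count each (true, predicted) pair
    return cm

# Hypothetical labels for a 3-class toy case.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()          # correct predictions lie on the diagonal
per_class_recall = np.diag(cm) / cm.sum(axis=1)
print(cm)
print(accuracy, per_class_recall)
```

Off-diagonal cells show which digits are confused with which, which is usually more informative than the single accuracy figure.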
13 CO5
Problem Definition
To implement a Convolutional Neural Network (CNN) for the
binary classification of images of dogs and cats. The goal is to build
a model that can accurately distinguish between images of dogs and
cats.
In this case study, we will use the Kaggle Dogs vs. Cats dataset,
which consists of 25,000 labeled images of dogs and cats. The
dataset can be downloaded from the following link:
https://www.kaggle.com/c/dogs-vs-cats/data
Dataset:
• Training set: 25,000 images (12,500 dogs and 12,500 cats)
• Test set: Either split from the training set or taken from the
provided test set.
Tasks to be Performed:
1. Dataset Preparation:
• Download and extract the dataset.
• Split the dataset into training and validation sets.
• Preprocess the images (resizing, normalization, and augmentation).
2. Model Building:
• Define the architecture of the CNN.
• Choose appropriate layers (convolutional layers, pooling layers,
dense layers, etc.).
• Implement the model using a deep learning framework (e.g.,
TensorFlow/Keras, PyTorch).
3. Model Training:
• Compile the model with appropriate loss function and optimizer.
• Train the model on the training set and validate it on the validation
set.
• Monitor the training process using metrics like accuracy and loss.
4. Model Evaluation:
• Evaluate the model's performance on the test set.
• Analyze the results using confusion matrix, precision, recall, and
F1-score.
5. Model Optimization:
• Implement techniques to improve model accuracy (e.g.,
hyperparameter tuning, regularization, dropout).
• Retrain and evaluate the optimized model.
6. Model Deployment (Optional):
• Save the trained model.
• Implement a simple application to use the model for real-time
image classification.
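The dataset preparation step (task 1) can be sketched as follows. This is a NumPy-only illustration under several assumptions: the images are a hypothetical random array standing in for decoded pictures that have already been resized to 64x64, the 0 = cat / 1 = dog encoding and the 80/20 split are arbitrary, and the only augmentation shown is a horizontal flip.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for decoded, already-resized images: 100 RGB images, 64x64, uint8.
images = rng.integers(0, 256, size=(100, 64, 64, 3), dtype=np.uint8)
labels = rng.integers(0, 2, size=100)             # 0 = cat, 1 = dog (assumed encoding)

# Normalization: scale pixel values from [0, 255] to [0, 1].
x = images.astype(np.float32) / 255.0

# Shuffle, then hold out 20% for validation (the ratio is an assumption).
idx = rng.permutation(len(x))
split = int(0.8 * len(x))
train_idx, val_idx = idx[:split], idx[split:]
x_train, y_train = x[train_idx], labels[train_idx]
x_val, y_val = x[val_idx], labels[val_idx]

def augment(batch, rng):
    """Randomly flip about half of the images left-to-right (training data only)."""
    flip = rng.random(len(batch)) < 0.5
    out = batch.copy()
    out[flip] = out[flip, :, ::-1, :]             # reverse the width axis
    return out

x_train_aug = augment(x_train, rng)
print(x_train_aug.shape, x_val.shape)
```

Augmentation is applied to the training split only, so the validation images remain an unmodified measure of generalization.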
Key Questions / Analysis / Interpretation to be Evaluated
1. Understanding the Dataset:
• How is the dataset structured, and what preprocessing steps were
necessary?
• What challenges did you encounter during data preprocessing?
2. Model Architecture:
• What architecture did you choose for the CNN, and why?
• How did you decide on the number of layers and their types?
3. Training Process:
• What loss function and optimizer were used, and why?
• How did you split the dataset for training and validation?
• What metrics were used to monitor the training process?
4. Model Performance:
• What was the accuracy of the model on the validation and test
sets?
• What do the confusion matrix and other evaluation metrics
indicate about the model's performance?
5. Optimization Techniques:
• What optimization techniques did you implement, and how did
they affect the model's performance?
• What were the best hyperparameters found during tuning?
6. Deployment and Application:
• How can the trained model be deployed for practical use?
• What are the potential applications of the model in real-world
scenarios?
7. Reflection:
• What were the main challenges you faced during the
implementation of the CNN?
• How would you improve the project if given more time or
resources?
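For the model performance questions, precision, recall and F1-score all follow from the four cells of the binary confusion matrix. A minimal pure-Python sketch, with hypothetical labels where 1 = dog and 0 = cat (an assumed encoding):

```python
def binary_metrics(y_true, y_pred):
    """Return confusion-matrix counts and derived metrics for labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted dogs, how many are dogs
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual dogs, how many were found
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical predictions on ten validation images.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]
m = binary_metrics(y_true, y_pred)
print(m)
```

Here the model finds most dogs (recall 0.8) but also labels two cats as dogs, so precision (about 0.67) is lower than recall; accuracy alone (0.7) would hide that asymmetry.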
Supplementary Problems
Three-class image classification problem
Key Skills Addressed
Training and evaluating deep learning models
Visualizing and interpreting model performance
Performing hyperparameter tuning (epochs, learning rate, batch
size)
Applications
Image Classification
Learning Outcome
Implement a CNN using modern frameworks like Keras or
PyTorch.
Apply forward and backward propagation effectively during
training.
Analyze training curves and optimize model parameters.
Evaluate and interpret classification results on real-world datasets.
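Before building the network in Keras or PyTorch, it can help to see what a convolutional layer and a pooling layer compute. The sketch below is a NumPy illustration on a single-channel toy image, not a framework implementation; the 3x3 edge filter is an arbitrary example kernel, whereas a real CNN learns its kernels during training.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no padding), stride-1 2D cross-correlation, as used in CNN layers."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    oh, ow = fmap.shape[0] // size, fmap.shape[1] // size
    trimmed = fmap[:oh * size, :ow * size]
    return trimmed.reshape(oh, size, ow, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)     # toy 6x6 "image"
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)            # illustrative horizontal-gradient filter

fmap = np.maximum(conv2d(image, kernel), 0.0)        # convolution followed by ReLU
pooled = max_pool2d(fmap)
print(fmap.shape, pooled.shape)                      # (4, 4) (2, 2)
```

A 6x6 input shrinks to 4x4 after a 3x3 convolution without padding and to 2x2 after 2x2 pooling; tracking these shapes is the main bookkeeping task when deciding on the number of layers.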
Dataset/Test Data
The dataset can be downloaded from the following link:
https://www.kaggle.com/c/dogs-vs-cats/data
Dataset:
• Training set: 25,000 images (12,500 dogs and 12,500 cats)
• Test set: Either split from the training set or taken from the
provided test set.
Tools/Technology
Keras, matplotlib
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work
Accuracy/loss plots for training and validation sets
Confusion matrix for prediction analysis
At least one experiment with hyperparameter tuning
Evaluation Strategy
Confusion Matrix
14 CO5
Problem Definition
To implement a text classification system using Natural Language
Processing (NLP) techniques. The objective is to preprocess raw text
data, convert it into numerical features using TF-IDF or Transformer-
based embeddings, and build a classification model to perform basic
sentiment analysis or category-based classification.
Key Questions / Analysis / Interpretation to be Evaluated
Compare and contrast TF-IDF and Transformer-based embeddings
(e.g., BERT). When would you prefer one over the other?
How does the TF-IDF vectorizer convert text into numerical form?
How did the model perform in terms of accuracy, precision, recall,
and F1-score?
How would you improve the model if you had more time or data?
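To illustrate how a TF-IDF vectorizer turns text into numbers, here is a pure-Python sketch on a hypothetical three-sentence corpus. It assumes whitespace tokenization and the smoothed IDF variant idf(t) = ln((1 + N) / (1 + df(t))) + 1 with L2 normalization, which matches scikit-learn's default settings; other libraries may use a different formula.

```python
import math
from collections import Counter

# Hypothetical toy corpus.
docs = [
    "the movie was great",
    "the movie was terrible",
    "great acting and great story",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency: rare terms get larger weights.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    tf = Counter(tokens)                               # raw term counts
    vec = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0   # avoid division by zero
    return [v / norm for v in vec]

matrix = [tfidf_vector(doc) for doc in tokenized]
for term, weight in zip(vocab, matrix[1]):
    if weight:
        print(f"{term:10s} {weight:.3f}")
```

In the second document, "terrible" receives a higher weight than "the" because it occurs in fewer documents, which is the behaviour that makes TF-IDF useful for classification.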
Supplementary Problems
Key Skills Addressed
Text preprocessing (tokenization, stop word removal, etc.)
Feature extraction using TF-IDF or Transformers (BERT,
DistilBERT, etc.)
Training classification models (Logistic Regression, SVM, or
Neural Networks)
Evaluating model performance (accuracy, F1-score)
Sentiment or category classification using real-world text data
Applications
Sentiment analysis (e.g., movie/product reviews)
Spam detection
News topic classification
Social media content moderation
Customer feedback analysis
Learning Outcome
Understand and apply standard NLP preprocessing techniques
Learn the working of TF-IDF and Transformer-based vector
representations
Build and evaluate a basic text classification model
Gain experience with real-world NLP workflows using libraries like
Scikit-learn or HuggingFace
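A basic Scikit-learn workflow for this practical might look like the sketch below. The six reviews and their labels are hypothetical; a real run would load one of the public datasets, hold out a test split, and report precision, recall and F1-score rather than predicting on two hand-written sentences.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews (1 = positive, 0 = negative).
texts = [
    "a wonderful film with great acting",
    "great story and brilliant direction",
    "truly wonderful and enjoyable movie",
    "a boring film with terrible acting",
    "awful story and dull direction",
    "truly terrible and boring movie",
]
labels = [1, 1, 1, 0, 0, 0]

# Preprocessing (lowercasing, stop word removal), TF-IDF features, and a classifier in one pipeline.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

new_reviews = ["an awful, boring film", "wonderful and great"]
predictions = model.predict(new_reviews)
print(list(zip(new_reviews, predictions)))
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned from the training texts only, which avoids leaking information from evaluation data.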
Dataset/Test Data
Public datasets such as IMDB reviews, SMS Spam Collection, or
Twitter sentiment datasets
Tools/Technology
Python
Scikit-learn (for TF-IDF and classical ML models)
NLTK / spaCy (for text preprocessing)
HuggingFace Transformers (for BERT-based vectorization, if used)
Pandas / NumPy / Matplotlib (for data handling and visualization)
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work
Accuracy/loss plots for training and validation sets
Confusion matrix for prediction analysis
At least one experiment with hyperparameter tuning
Evaluation Strategy
File
15 CO5
Problem Definition
To implement the Q-Learning algorithm for solving a reinforcement
learning problem and evaluate its performance.
Tasks to be Performed
1. Setup Environment:
• Install the OpenAI Gym toolkit.
• Load the Frozen Lake environment.
• Understand the environment’s state and action space.
2. Implement Q-Learning Algorithm:
• Initialize the Q-table with zeros.
• Define the hyperparameters: learning rate (alpha), discount factor
(gamma), and exploration rate (epsilon).
• Implement the Q-Learning update rule.
• Implement an epsilon-greedy policy for action selection.
3. Training the Agent:
• Run the Q-Learning algorithm for a fixed number of episodes.
• Update the Q-values based on the experiences gained by the agent.
• Monitor the agent's performance over time.
4. Evaluation:
• Test the trained Q-Learning agent in the environment.
• Measure the agent’s performance using metrics such as average
reward per episode.
5. Analysis:
• Analyze how different hyperparameters affect the learning
process.
• Compare the performance of the Q-Learning agent with other
baseline methods if available.
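The tasks above can be sketched as follows. To keep the example self-contained, it defines a small deterministic (non-slippery) 4x4 grid in place of the Gym environment; with Gym installed, the environment would instead come from a call such as gym.make("FrozenLake-v1"). The map layout, function names, hyperparameter values and decay schedule are all illustrative assumptions.

```python
import numpy as np

# Self-contained stand-in for Frozen Lake: S = start, F = frozen, H = hole, G = goal.
MAP = ["SFFF",
       "FHFH",
       "FFFH",
       "HFFG"]
N = 4
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}    # left, down, right, up

def step(state, action):
    """Return (next_state, reward, done); walls keep the agent in place."""
    r, c = divmod(state, N)
    dr, dc = MOVES[action]
    r, c = min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1)
    tile = MAP[r][c]
    return r * N + c, float(tile == "G"), tile in "HG"

def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(4))                         # explore
    best = np.flatnonzero(Q[state] == Q[state].max())      # exploit, random tie-break
    return int(rng.choice(best))

def train(episodes=5000, alpha=0.8, gamma=0.95, max_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N * N, 4))                                # Q-table initialised with zeros
    epsilon = 1.0
    for episode in range(episodes):
        state = 0
        for t in range(max_steps):
            action = epsilon_greedy(Q, state, epsilon, rng)
            nxt, reward, done = step(state, action)
            # Q-Learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            target = reward + gamma * Q[nxt].max() * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state = nxt
            if done:
                break
        epsilon = max(0.05, epsilon * 0.999)                # decay exploration
    return Q

def greedy_episode(Q, max_steps=100):
    """Follow the learned policy without exploration; return the total reward."""
    state, total = 0, 0.0
    for t in range(max_steps):
        state, reward, done = step(state, int(np.argmax(Q[state])))
        total += reward
        if done:
            break
    return total

Q = train()
print("Greedy episode reward:", greedy_episode(Q))
```

Re-running train with different alpha, gamma, or decay values and comparing the greedy reward is one simple way to carry out the analysis step. On the slippery Gym version, transitions are stochastic, so a smaller alpha and averaging over many evaluation episodes would be more appropriate.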
Key Questions / Analysis / Interpretation to be Evaluated
1. How does the Q-Learning algorithm update the Q-values?
2. What is the role of the learning rate (alpha) and how does it affect
the learning process?
3. How does the discount factor (gamma) influence the Q-Learning
algorithm?
4. What is the epsilon-greedy policy and why is it used in Q-
Learning?
5. How does the exploration rate (epsilon) impact the agent’s
learning and performance?
6. What challenges did you encounter while implementing the Q-
Learning algorithm and how did you address them?
7. How does the performance of the Q-Learning agent change with
different values of alpha, gamma, and epsilon?
8. Can the Q-Learning algorithm be applied to other environments?
If yes, what changes would be required in the implementation?
9. Discuss the importance of choosing the right hyperparameters for
the Q-Learning algorithm.
10. How does the Q-Learning algorithm compare to other
reinforcement learning algorithms?
11. Analyse the reward value for the different actions.
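For reference when answering questions 1 to 5, the standard tabular Q-Learning update and the epsilon-greedy rule can be written as below. The notation (s_t, a_t, r_{t+1}) is a common textbook convention and is an assumption here, since the practical does not fix one.

```latex
% Q-Learning update after taking action a_t in state s_t,
% receiving reward r_{t+1} and landing in state s_{t+1}:
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

% Epsilon-greedy action selection:
a_t =
\begin{cases}
  \text{a uniformly random action} & \text{with probability } \varepsilon \\
  \arg\max_{a} Q(s_t, a)           & \text{with probability } 1 - \varepsilon
\end{cases}
```

As a worked example with illustrative values: if Q(s_t, a_t) = 0, alpha = 0.5, gamma = 0.9, the reward is 1, and every entry of Q for the next state is 0, the updated value is 0 + 0.5 * (1 + 0.9 * 0 - 0) = 0.5. A larger alpha moves the estimate further toward the new target in one step, and a larger gamma gives more weight to the value of the next state.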
Supplementary Problems
Frozen lake problem
Key Skills Addressed
Implementation of Q-Learning algorithm
Hyperparameter tuning (α, γ, ε)
Policy design using epsilon-greedy strategy
Applications
Grid-world navigation and path planning
Game-playing agents
Robot control and autonomous navigation
Learning Outcome
Students will understand and implement a tabular Q-Learning
algorithm
Students will analyze how learning parameters affect agent
performance
Students will develop the ability to design, train, and evaluate RL
agents
Dataset/Test Data
Environment: Frozen Lake v1 from OpenAI Gym
Predefined 4x4 or 8x8 grid map with slippery surface, holes, and a
goal
No external dataset; environment provides states, actions, and
rewards dynamically
Tools/Technology
OpenAI Gym (for Frozen Lake environment)
NumPy (for Q-table and numerical operations)
Matplotlib / Seaborn (optional, for performance visualization)
Total Hours (Definition): 2
Total Hours (Engagement): 2
Post Lab Work
Explanation of the Q-Learning algorithm with the update formula
Description of environment (Frozen Lake)
Hyperparameter values used and justification
Graphs: Reward per episode (optional)
Evaluation Strategy
File