Department of Computer Science & Engineering
Progress Report-II
Synopsis
On
Project Title: CKD Prediction System
Submitted By              Roll No.         Submitted To
Devansh Singh Kushwah     0905CS231077     Mr. Pradeep
Dhairya Jain              0905CS231080
Disha Jain                0905CS231084
Gopal Namdev              0905CS231099
Index
1. Abstract of the project
2. Project Details
3. Understanding libraries
4. Model selection
5. Role of the team members
1.1 Abstract
Chronic Kidney Disease (CKD) is a major health concern globally, often going undetected
until it reaches an advanced stage. This project explores the application of machine learning
algorithms to predict CKD using clinical and laboratory data. A system was developed using
models such as Random Forest Classifier, Support Vector Machine (SVM), Logistic
Regression, and K-Nearest Neighbors (KNN) to enable early detection and assist in medical
decision-making.
The project involved cleaning and preprocessing the dataset, then training and
evaluating several classifiers using accuracy, precision, recall, and F1-score.
Among the models tested, the Random Forest Classifier and SVM performed
particularly well at identifying CKD from complex, non-linear patterns in the clinical data.
This study supports the feasibility and value of integrating machine learning into healthcare
systems for efficient and early diagnosis of chronic diseases. It also emphasizes the
importance of using predictive analytics to support clinical judgment and improve patient
outcomes.
1.2 Project Details
1.21 Project Title:
Chronic Kidney Disease (CKD) Prediction Using Machine Learning Models
1.22 Objective:
To build a machine learning-based prediction system for CKD detection. The
system will analyze health parameters and assist in early diagnosis, thus
improving healthcare efficiency and patient management.
1.23 Dataset:
Source: UCI Machine Learning Repository
Features: Lab results and clinical measurements such as serum creatinine,
hemoglobin, blood pressure, and sugar levels.
Data Preprocessing: Cleaning, handling missing values, label encoding,
normalization, and feature selection.
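The preprocessing steps listed above can be sketched as follows. This is a minimal illustration, assuming a small hand-made DataFrame in place of the UCI file; the column names (sc, hemo, bp, su, classification) and all values are hypothetical and do not come from the project's data. Feature selection is left out of the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Toy stand-in for the CKD data; column names and values are hypothetical.
raw = pd.DataFrame({
    "sc": [1.2, np.nan, 0.9, 5.6, 1.1, 7.3],        # serum creatinine
    "hemo": [15.4, 11.3, np.nan, 9.6, 14.8, 8.1],   # hemoglobin
    "bp": [80, 70, 80, np.nan, 70, 100],            # blood pressure
    "su": [0, 0, 0, 3, 0, 4],                       # sugar level
    "classification": ["notckd", "ckd", "notckd", "ckd", "notckd", "ckd"],
})
feature_cols = ["sc", "hemo", "bp", "su"]

# Handling missing values: replace each NaN with the column median.
imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(raw[feature_cols])

# Normalization: rescale every feature to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X)

# Label encoding: map the class strings to integers (alphabetical order).
encoder = LabelEncoder()
y = encoder.fit_transform(raw["classification"])

features = pd.DataFrame(X_scaled, columns=feature_cols)
print(features.round(2))
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))
```

In a full workflow the imputer and scaler would normally be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.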
1.24 Machine Learning Models:
1. Random Forest Classifier: Robust and high-performing ensemble method.
2. Support Vector Machine (SVM): Effective in high-dimensional feature spaces.
3. Logistic Regression: Simple and interpretable baseline model.
4. K-Nearest Neighbors (KNN): Instance-based learning technique.
5. Additional Models: Decision Trees and Naïve Bayes, explored for
comparison.
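One possible way to compare the classifiers listed above is cross-validated accuracy. The sketch below assumes synthetic data from scikit-learn's make_classification instead of the CKD dataset, and the hyperparameters are illustrative defaults rather than the settings used in the project, so the printed scores say nothing about CKD.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the CKD data (assumption, not the UCI dataset).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Distance- and margin-based models are wrapped with a scaler.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```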
1.3 Understanding Libraries
1.31 NumPy
Purpose: Numerical computing.
Usage: Provides support for large multi-dimensional arrays and matrices, along with a
collection of mathematical functions to operate on these arrays. It is essential for handling
and performing operations on numerical data.
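As a small illustration of array operations, the sketch below standardizes two columns of made-up lab readings. The values are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical readings: rows are patients, columns are
# serum creatinine (mg/dL) and hemoglobin (g/dL).
readings = np.array([
    [1.2, 15.4],
    [5.6, 9.6],
    [0.9, 14.8],
    [7.3, 8.1],
])

# Per-column statistics (axis=0 aggregates over the rows).
col_means = readings.mean(axis=0)
col_stds = readings.std(axis=0)

# Broadcasting applies the per-column statistics to every row.
z_scores = (readings - col_means) / col_stds
print(z_scores.round(2))
```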
1.32 pandas
Purpose: Data manipulation and analysis.
Usage: Used for reading, cleaning, and manipulating datasets. It offers data structures like
DataFrames that are ideal for handling structured data and performing operations like
filtering, grouping, and aggregation.
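The filtering, grouping, and aggregation operations mentioned above might look like this on a toy DataFrame. The column names, the values, and the 12 g/dL threshold are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy records; names and values are hypothetical.
df = pd.DataFrame({
    "age": [48, 62, 35, 70, 55, 41],
    "hemo": [15.4, 9.6, 14.8, 8.1, 10.2, 13.9],
    "classification": ["notckd", "ckd", "notckd", "ckd", "ckd", "notckd"],
})

# Filtering: keep patients whose hemoglobin is below the illustrative threshold.
low_hemo = df[df["hemo"] < 12.0]

# Grouping and aggregation: mean hemoglobin and patient count per class.
summary = df.groupby("classification")["hemo"].agg(["mean", "count"])

print(low_hemo)
print(summary)
```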
1.33 scikit-learn
Purpose: Machine learning.
Usage: Core library for implementing machine learning algorithms such as Random Forest,
Decision Trees, Support Vector Machines, and more. It also provides tools for data
preprocessing, model evaluation, and hyperparameter tuning.
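A minimal sketch of hyperparameter tuning with scikit-learn is shown below. It assumes synthetic data and a deliberately small parameter grid; neither reflects the project's actual search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the CKD data (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative grid; a real search would usually cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X_train, y_train)

# With scoring="f1", score() reports the F1-score of the refitted best model.
test_f1 = search.score(X_test, y_test)
print(search.best_params_, round(test_f1, 3))
```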
1.34 Matplotlib
Purpose: Data visualization.
Usage: Used for creating static, interactive, and animated visualizations in Python. It helps in
plotting graphs, histograms, and charts to understand the data distribution and model
performance.
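For example, a histogram comparing one feature across the two classes could be drawn as sketched below. The values are randomly generated for illustration and are not hemoglobin readings from the dataset.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated values standing in for a lab feature (assumption).
rng = np.random.default_rng(0)
ckd = rng.normal(10.0, 1.5, 100)
not_ckd = rng.normal(15.0, 1.2, 100)

fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(ckd, bins=15, alpha=0.6, label="ckd")
ax.hist(not_ckd, bins=15, alpha=0.6, label="notckd")
ax.set_xlabel("Hemoglobin (g/dL)")
ax.set_ylabel("Number of patients")
ax.set_title("Feature distribution by class (synthetic data)")
ax.legend()
fig.tight_layout()

# Render to an in-memory PNG; pass a file name instead to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```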
1.35 Seaborn
Purpose: Statistical data visualization.
Usage: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing
attractive and informative statistical graphics. It’s particularly useful for visualizing complex
datasets with plots like heatmaps, pair plots, and box plots.
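A correlation heatmap, one of the plots mentioned above, can be sketched as follows. The DataFrame holds random numbers under hypothetical column names, so the correlations shown carry no clinical meaning.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random data under hypothetical feature names (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["sc", "hemo", "bp", "su"])

# Pairwise Pearson correlation between the columns.
corr = df.corr()

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Feature correlation (synthetic data)")
fig.tight_layout()
```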
1.4 Model Selection
In our Chronic Kidney Disease (CKD) Prediction project, we explored multiple machine learning
classifiers to identify the most suitable one for accurately diagnosing CKD. After evaluating several models,
including Logistic Regression, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and others,
the Random Forest Classifier emerged as the best-performing model.
1.41 Reasons for Selecting Random Forest Classifier:
1. Superior Predictive Accuracy: The Random Forest Classifier consistently demonstrated the
highest accuracy among all models tested for CKD prediction. Its ability to model complex, non-
linear relationships between clinical variables contributed significantly to more accurate disease
detection.
2. Robustness and Generalization: By aggregating the outputs of multiple decision trees, the
Random Forest Classifier reduces overfitting and enhances generalization. This ensemble
approach helps maintain stable performance on previously unseen patient data, which is
important for real-world medical applications.
3. Handling of Diverse Clinical Features: Given the diverse range of features in our dataset,
including lab results such as serum creatinine and hemoglobin, blood pressure, and sugar
levels, the Random Forest Classifier proved adept at handling a large number of input
variables. It efficiently identifies the most important features, which helps in making
accurate predictions.
4. Resistance to Noise and Outliers: Due to its ensemble nature, the Random Forest Classifier is
inherently robust to noisy or anomalous data entries. By averaging predictions from multiple
trees, it mitigates the impact of outliers and delivers more stable results across different subsets
of patient data.
5. Feature Importance Analysis: One of the key advantages of the Random Forest model is its
ability to quantify feature importance. This enables us to identify the most influential medical
indicators contributing to CKD, offering valuable support for clinical decision-making and
further research.
6. Scalability and Efficiency: The Random Forest Classifier is well-suited for handling large-scale
health datasets efficiently. Its parallel processing capability allows for fast training and
prediction, which is essential for building scalable, deployable diagnostic tools.
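The feature importance and parallel training points above can be illustrated with a short sketch. It assumes synthetic data, and the feature names are hypothetical labels attached to the synthetic columns, so the resulting ranking has no clinical meaning.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical names attached to synthetic columns (assumption).
feature_names = ["sc", "hemo", "bp", "su", "age"]
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# n_jobs=-1 builds the trees in parallel on all available CPU cores.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.round(3))
```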
1.42 Evaluation Metrics:
• Accuracy: 98.5%
• Precision: 98.6%
• Recall: 98.8%
• F1-Score: 98.7%
• ROC-AUC Score: 0.995
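The sketch below shows how these metrics are computed with scikit-learn on a small hand-made set of labels and scores (toy values, not the project's predictions). It also checks that the reported F1-score is consistent with the reported precision and recall through F1 = 2PR / (P + R).

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy example: 1 = CKD, 0 = not CKD. Values are illustrative only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1, 0.3, 0.6]  # predicted P(CKD)

accuracy = accuracy_score(y_true, y_pred)    # correct predictions / all predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
roc_auc = roc_auc_score(y_true, y_score)     # uses scores, not hard labels

# Consistency check on the values reported in this section.
reported_precision, reported_recall = 0.986, 0.988
reported_f1 = (
    2 * reported_precision * reported_recall
    / (reported_precision + reported_recall)
)
print(round(reported_f1, 3))
```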
1.43 Conclusion:
The Random Forest Classifier was finalized for CKD prediction due to its outstanding performance across
all evaluation metrics. It provides a reliable diagnostic aid for clinicians and can be deployed in real-world
healthcare environments.
1.5 Role of Team Members
Team Member               Role
Devansh Singh Kushwaha    Model selection and evaluation
Dhairya Jain              Analyzing various models
Disha Jain                Plotting of the data
Gopal Namdev              Cleaning of the data