A
INTERNSHIP REPORT ON
“Disease Prediction Based on Symptoms Using Data Science Techniques”
SUBMITTED TO THE SAVITRIBAI PHULE PUNE UNIVERSITY IN THE
PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD
OF THE DEGREE
OF
THIRD YEAR OF ENGINEERING
(COMPUTER ENGINEERING)
SUBMITTED BY
Mr./Ms.: Doge Onkar Somling
Exam Seat No.: T1904304258
Area of Internship: - Remote Mode
Sem: - 5th Semester
DEPARTMENT OF COMPUTER ENGINEERING
STES’S SINHGAD ACADEMY OF ENGINEERING
KONDHWA, PUNE 411048
UNIVERSITY OF PUNE
2024-25
CERTIFICATE
This is to certify that the internship report entitled
“Disease Prediction Based on Symptoms Using Data Science Techniques”
Submitted by
Mr./Ms. Doge Onkar Somling Exam No: T1904304258
is a bona fide work carried out by him/her under the supervision of Prof. P.R. Dongare
and is submitted in partial fulfillment of the requirements of Savitribai Phule Pune
University for the award of the degree of Third Year of Engineering (Computer Engineering).
Prof. P.R. Dongare                          Prof. S. N. Shelke
TG/Mentor,                                  Head,
Department of Computer Engineering          Department of Computer Engineering
Dr. K.P. Patil
Principal,
Sinhgad Academy of Engineering, Pune – 48
Place: Pune
Date:
ACKNOWLEDGEMENT
It is with great pleasure and an immense sense of gratitude that I acknowledge the help
of the following individuals.
Working at Unified Mentor Pvt. Ltd. was interesting. During this one-month internship
I learnt a lot about Data Science; completing the project in particular helped me
understand the concepts of Data Science, Python, and Power BI.
I thank Unified Mentor Pvt. Ltd. for giving me the opportunity and the platform for this
internship.
I am grateful to the people at Unified Mentor Pvt. Ltd. for the chance to carry out this
work and for the opportunity to build a project that added greatly to my knowledge.
Further, I want to thank the students and interns at Unified Mentor Pvt. Ltd. who
made this demanding time joyful yet always efficient.
I am extremely grateful to my department Internship Coordinator, Prof. M.K.
Nivangune, and the H.O.D., Prof. S.N. Shelke, who guided and helped me in the successful
completion of this internship.
(Student Name & Signature)
ABSTRACT
Accurate diagnosis is essential for effective healthcare, but the process is often
hampered by overlapping symptoms of various diseases and differences in patient
reporting. The Health, Symptom, and Disease Analytics project addresses these
issues by leveraging advanced data analysis techniques to identify patterns and
relationships between symptoms and diagnoses. The data comprises patient
symptoms and related diagnoses, pre-processed by handling missing data,
calculating standard deviations, and encoding categorical features for deeper insights.
The goal of this analysis is to demonstrate the relationship between symptoms and
diseases through statistical investigations using correlation and covariance
matrices, enabling healthcare professionals to identify key symptom clusters that
may be indicative of specific diseases.
This data-driven approach can help doctors better understand the diagnostic
process by identifying groups of symptoms commonly associated with particular
diseases, ultimately improving the accuracy and efficiency of diagnosis.
Furthermore, by clustering symptoms and using data analytics, patterns emerge
that can guide medical professionals in considering less obvious diagnoses. The
integration of statistical tools like the correlation matrix offers a robust means to
assess the strength and direction of relationships between symptoms and diagnoses.
Insights from this analysis can provide decision support,
making diagnoses faster, more accurate, and less prone to human error. The
findings from this project highlight the potential for information technology to
reduce misdiagnosis, improve health outcomes, and enhance diagnostic
procedures, particularly in areas with limited healthcare resources. This type of
analysis is particularly relevant in remote or underserved regions where access to
specialists may be limited, offering decision-making support to general
practitioners through technology.
Overall, this internship has provided a deep understanding of how data science
methodologies can be harnessed to improve healthcare outcomes. The system
developed through this research represents a promising step toward more efficient,
accessible, and data-driven diagnostic practices in the medical field.
Keywords: Disease Diagnosis, Symptom Analysis, Correlation Matrix, Symptom
Clustering, Data-Driven Healthcare, Diagnostic Support Systems.
CONTENTS
Sr. No.  Title
1. Acknowledgement
2. Abstract
3. Internship Offer and Completion Certificate (two scanned copies of each)
4. Internship Place Details: Company Background
5. Introduction
6. Title and Problem Statement
7. Objectives
8. Motivation and Scope
9. Design
10. Methodologies Used
11. Hardware and Software Used
12. Results
13. Future Scope
14. Conclusion
15. References
Internship Offer Letter
Internship Completion Certificate
Internship Company Details
Company Background / Organization: Unified Mentor Pvt. Ltd.
                                   IT Services and IT Consulting
                                   Gurugram, Haryana
Activities/Scope: Disease prediction based on symptoms using Data Science Techniques
Website:
Objectives of Study: Data Science and its libraries
Supervisor Details (Name, Designation, Company Name, Email ID, Contact Number):
    Bhautik Khunt (Project Mentor), [email protected]
HR Details (Name, Designation, Company Name, Email ID, Contact Number):
    Drishti Madaan (HR Manager), [email protected]
Director Details (Name, Designation, Company Name, Email ID, Contact Number):
    Paras Grover (Director), [email protected]
Introduction
Accurate disease diagnosis is vital for effective healthcare, yet diagnosing based on
symptoms is often difficult due to the overlap of symptoms across various diseases.
Many illnesses present with similar or overlapping symptoms, which makes it
challenging for healthcare professionals to correctly identify the underlying condition.
This issue is further compounded by the variability in how patients report their
symptoms—some patients may exaggerate certain symptoms, while others may
downplay them. This inconsistency can lead to diagnostic uncertainty and delays in
treatment. Traditional diagnostic methods often fail to fully utilize the wealth of data
available from patient histories, symptom patterns, and medical records, resulting in
missed opportunities for more precise diagnosis.
This project addresses that gap by applying advanced data analysis techniques to
uncover relationships between symptoms and diseases. Through careful data cleaning,
normalization, and encoding, we prepare a comprehensive dataset of patient symptoms
and diagnoses for deeper statistical exploration. Missing data is handled to ensure
accuracy, while normalization and encoding make it easier to interpret and analyze the
data. With this processed data, we use correlation and covariance matrices to identify
key patterns and trends that link specific symptoms to particular diseases. These
statistical tools help reveal underlying symptom clusters that may not be immediately
obvious through traditional diagnostic approaches.
The goal of this project is to provide automated, data-driven insights that support
healthcare professionals in making faster, more accurate diagnoses. By offering real-
time decision support based on statistical analysis, this approach has the potential to
enhance both the efficiency and precision of the diagnostic process. Ultimately, this can
lead to improved patient care, reduce the risk of misdiagnosis, and allow healthcare
systems to leverage data more effectively, particularly in time-sensitive or resource-
constrained environments.
Diagnosing diseases based on symptoms alone is a challenging task due to symptom
overlap across various conditions. Traditional methods often fail to leverage large
amounts of available clinical data effectively. This project attempts to automate the
initial stages of diagnosis by using data science to identify symptom patterns and their
associations with diseases. A binary dataset consisting of symptoms and diagnoses is
pre-processed and analysed using statistical tools, enabling data-driven diagnostic
support. The goal is to assist healthcare providers in diagnosing diseases more
accurately and efficiently.
Title and Problem Statement
Title: Disease Prediction Based on Symptoms Using Data Science Techniques
Problem Statement: To create a system that uses patient-reported symptoms as input and
predicts possible diseases using data science and machine learning models, thereby
assisting healthcare professionals in faster and more accurate diagnosis.
Objectives
1. To study and analyze the existing challenges in symptom-based disease diagnosis
due to overlapping symptoms and inconsistent patient reporting.
2. To collect and preprocess a comprehensive dataset of symptoms and associated
diseases, ensuring quality by handling missing values, standardizing inputs, and
encoding categorical features.
3. To apply Exploratory Data Analysis (EDA) techniques to identify correlations
and clusters among symptoms that are commonly associated with specific
diseases.
4. To implement feature selection techniques to identify the most significant
symptoms that influence disease prediction accuracy.
5. To design and develop multiple machine learning models (Logistic Regression,
Random Forest, Decision Tree) for multi-class disease prediction based on
symptom data.
6. To compare and evaluate the performance of different ML models using metrics
like accuracy, precision, recall, F1-score, and cross-validation accuracy.
7. To reduce data dimensionality using Principal Component Analysis (PCA) for
better visualization and to discover hidden structures within the dataset.
8. To visualize co-occurrence networks of symptoms to better understand their
interrelationships and how they influence the presence of specific diseases.
Motivation of the Project
This project is motivated by the need to improve diagnostic accuracy, reduce
misdiagnosis, and provide healthcare assistance in resource-limited areas. The system
offers scalability for real-world use in clinics, telemedicine applications, and AI-
powered healthcare platforms.
In today's fast-paced world, healthcare systems face increasing pressure to provide
accurate, timely, and accessible diagnoses. Traditional diagnostic methods, though
effective, often rely heavily on the expertise of medical professionals and can be prone
to human error, especially in the early stages of disease when symptoms are vague or
overlapping. In rural or underdeveloped regions, access to experienced doctors and
specialists is limited, further complicating the diagnosis process. These challenges
form the core motivation behind this project—to create a system that can assist
healthcare professionals by offering intelligent, data-driven insights derived from
patient-reported symptoms.
The rise of data science and machine learning has opened new frontiers in healthcare
analytics. With the increasing availability of electronic health records and structured
datasets, there is a growing opportunity to leverage this data to improve diagnostic
processes. This project is driven by the idea that early disease prediction using
symptoms and machine learning models can significantly reduce diagnosis time, avoid
misdiagnoses, and improve patient outcomes. A robust, AI-based system can serve as a
digital assistant to clinicians, especially in overburdened or low-resource
environments.
Design of the Project
Step 1: Data Loading & Preprocessing
Binary symptom-disease dataset
Null value handling, irrelevant record removal
Step 2: Exploratory Data Analysis (EDA)
Heatmaps, symptom frequency plots
Step 3: Feature Selection
Retaining high-correlation features
Creating binary symptom clusters
Step 4: Model Training
Algorithms: Random Forest, Logistic Regression, Decision Tree
Step 5: Model Evaluation
Accuracy, Precision, Recall, Cross-validation scores
Step 6: Visual Interpretation
PCA plots, Symptom Co-occurrence Networks, Cluster visualizations
Methodology
Design and Implementation
The aim of this project is to build an AI-based system that takes a patient's symptoms
as input and predicts possible diseases. Given the symptom data, the model identifies
patterns and infers the conditions the patient may have. Such a predictive system could
support healthcare providers by offering preliminary diagnoses faster and potentially
more accurately, enabling more personalized and timely medical interventions.
Step 1: Loading and Data Preparation
In the first step, the dataset is loaded into a DataFrame, a tabular structure well
suited to examining and manipulating binary data. Each row represents an individual case
or patient, with columns for the symptoms and for the disease. In such a binary dataset,
each symptom or disease is typically encoded as either 0 (absent) or 1 (present). A quick
check after loading confirms that there are no missing or inconsistent entries and
verifies the data types. Because the dataset is binary, we look in particular for
anomalies, such as missing data points, that would affect the interpretability of the
resulting model. Preprocessing covers handling null values, deleting irrelevant records,
and putting categorical variables into the right format, although the binary format
minimizes the need for extensive encoding. This step yields the clean, well-formatted
data that is essential for model training.
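The loading checks and clean-up described in this step can be sketched as follows. This is a minimal illustration rather than the project's actual code: the column names, the toy values, the choice to treat a missing symptom as absent (0), and the choice to treat exact duplicate rows as irrelevant records are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data: one row per case, a "disease" label column,
# and one 0/1 column per symptom. Names and values are illustrative only.
raw = pd.DataFrame({
    "disease": ["flu", "flu", "migraine", None, "flu"],
    "fever": [1, 1, 0, 1, 1],
    "headache": [1, None, 1, 0, 1],
    "cough": [1, 1, 0, 0, 1],
})

def preprocess(df: pd.DataFrame, label: str = "disease") -> pd.DataFrame:
    """Drop unlabeled rows, fill missing symptoms with 0, remove duplicates."""
    out = df.dropna(subset=[label]).copy()
    symptom_cols = [c for c in out.columns if c != label]
    # Assumption: a missing symptom entry is treated as "absent" (0).
    out[symptom_cols] = out[symptom_cols].fillna(0).astype(int)
    # Assumption: exact duplicate rows count as irrelevant records.
    return out.drop_duplicates().reset_index(drop=True)

clean = preprocess(raw)
print(clean.isna().sum())  # quick check: no missing values remain
print(clean)
```

Whether duplicates should really be dropped depends on the dataset; identical symptom rows can be legitimate repeat cases, so that line is the first one to revisit.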
Step 2: Exploratory Data Analysis
Using the preprocessed dataset, EDA helps in understanding patterns in symptom-disease
relationships. Visualizations such as heatmaps and bar charts are very useful at this
point, showing how the prevalence of certain symptoms cuts across multiple diseases.
A heatmap of the correlation between symptoms and diseases can show which symptoms
align most closely with which diseases. A simple bar chart displays the frequency of
each symptom across the dataset, making it easier to determine, for example, which
common or rare symptoms are particularly associated with specific diseases. Analyzing
the interrelations of symptoms may reveal clusters of symptoms that frequently co-occur
and can therefore serve as possible predictors for specific conditions.
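As a rough sketch of the numbers behind such charts, the snippet below computes overall symptom prevalence (the input to a frequency bar chart) and the share of cases of each disease that show each symptom (the input to a symptom-by-disease heatmap). The table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical 0/1 symptom table with a disease label per case.
data = pd.DataFrame({
    "disease":  ["flu", "flu", "flu", "migraine", "migraine", "cold"],
    "fever":    [1, 1, 1, 0, 0, 0],
    "cough":    [1, 1, 1, 0, 0, 1],
    "headache": [1, 0, 0, 1, 1, 0],
})
symptom_cols = ["fever", "cough", "headache"]

# Share of all cases in which each symptom is present, most common first.
prevalence = data[symptom_cols].mean().sort_values(ascending=False)

# Share of the cases of each disease that show each symptom.
rate_by_disease = data.groupby("disease")[symptom_cols].mean()

print(prevalence)
print(rate_by_disease)
```

Because the columns are 0/1, the mean of a column is simply the fraction of cases in which the symptom is present.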
Step 3: Feature Selection and Engineering
Using insights from EDA, we refine the dataset by selecting the most informative
symptoms, which streamlines the model. For instance, if a symptom is consistently
correlated with certain diseases, we retain it as a feature. This keeps the model
relevant and improves its accuracy. Feature engineering may also involve grouping
similar symptoms or creating new binary variables from symptom combinations to capture
more complex patterns. Binary features do not need scaling; however, any grouping or
binary interaction features that are needed are created at this stage. This ensures
that the training dataset stays focused on the targeted predictive symptoms.
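One simple way to illustrate this step is to drop symptoms that never vary and to add a binary interaction feature. The sketch below uses hypothetical column names and is not the selection rule used in the project, which retains features based on their correlation with diseases.

```python
import pandas as pd

# Hypothetical 0/1 symptom matrix.
X = pd.DataFrame({
    "fever":     [1, 1, 0, 1, 0],
    "cough":     [1, 0, 0, 1, 1],
    "rare_sign": [0, 0, 0, 0, 0],  # never present, so it carries no information
})

# Keep only symptoms that take more than one value across cases.
informative = X.loc[:, X.nunique() > 1].copy()

# Hypothetical interaction feature: 1 only when both symptoms are present.
informative["fever_and_cough"] = informative["fever"] & informative["cough"]

print(informative)
```

The new column stays binary, so it needs no scaling and can be fed to the same models as the original symptoms.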
Step 4: Model Selection and Training
With the dataset streamlined, we set up a multi-class classification model on binary
symptom features. Such data can be handled efficiently by algorithms such as Logistic
Regression, Random Forest, and Decision Tree. On binary data, Random Forest tends to do
better than the others because it reduces overfitting by combining the predictions of
multiple decision trees, which improves predictive performance. During training, the
model learns which combinations of symptoms predict specific diseases, allowing it to
generalize those patterns to new data.
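A minimal training sketch with scikit-learn is shown below. The data is synthetic (three made-up diseases, each with a characteristic symptom pattern plus random noise), and the hyperparameters are illustrative defaults rather than the values used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real dataset: 3 diseases, 6 binary symptoms.
patterns = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
])
y = np.repeat(np.arange(3), 60)          # 60 cases per disease
X = patterns[y].copy()
flip = rng.random(X.shape) < 0.05        # flip 5% of entries as noise
X = np.where(flip, 1 - X, X)

# Hold out a stratified test set so every disease appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

On real data with many rare diseases, a stratified split may fail for classes with a single case, so the splitting strategy would need adjusting.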
Step 5: Model Evaluation
Finally, we evaluate the performance of the model on the test set. The key metrics for
multi-disease prediction are accuracy, precision, recall, and F1-score, which indicate
how well the model classifies each disease based on the presence of symptoms. Precision
and recall become particularly relevant when disease representation is imbalanced,
since they show how well the model finds less common diseases. If performance falls
below an acceptable level, the structure or hyperparameters of the model are adjusted
to improve outcomes. This final stage tests the robustness of the model and its
capability for symptom-based disease prediction in practical application scenarios.
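The per-disease metrics mentioned here can be computed with scikit-learn as in the sketch below. The true and predicted labels are made up purely to show how accuracy can hide weak recall on a rare class.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted disease labels for eight test cases.
y_true = ["flu", "flu", "flu", "flu", "cold", "cold", "rare", "rare"]
y_pred = ["flu", "flu", "flu", "cold", "cold", "flu", "rare", "flu"]
labels = ["cold", "flu", "rare"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

print(f"accuracy: {accuracy:.3f}")
for name, p, r, f, n in zip(labels, precision, recall, f1, support):
    print(f"{name:5s} precision={p:.2f} recall={r:.2f} f1={f:.2f} n={n}")
```

In this toy case the "rare" class has perfect precision but only half of its cases are found, which is the kind of gap that overall accuracy alone would not reveal.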
Hardware and Software Used
Hardware Requirements:
Computer/Laptop: Laptop
Processor: Intel i5 Processor
RAM: 16 GB
Software Requirements:
Tools: Python (Jupyter Notebook)
Libraries:
1. Pandas
2. NumPy
3. Scikit-learn
4. Matplotlib
5. Seaborn
Result & Discussion
Frequency Distribution of Diseases
The plot shows the frequency distribution of diseases in the dataset. The vast
majority of diseases appear only a few times, while the bar at 10 represents a single
frequency value that has 677 occurrences. Each bar is labeled with its exact count.
The imbalance is strong: a few diseases are very frequent while most are rare. This
may affect the performance of the model, since it may become biased toward the common
diseases.
Symptom prevalence
The plot shows the top 50 symptoms by prevalence, with "sharp abdominal pain"
being the most common, followed by "headache" and "shortness of breath." The
prevalence of symptoms decreases as you move down the list, helping identify the
most frequent symptoms in the dataset.
Most common diseases
The bar chart above shows the top 10 most common diseases in the dataset, each with a
count of 10 cases. These include Zenker's diverticulum, abdominal aortic aneurysm, and
abdominal hernia.
Correlation among symptoms
The heatmap displays the correlation between 10 selected symptoms, indicating how
they relate to each other. Each cell represents a correlation coefficient, where values
close to 1 (red) show a strong positive correlation, meaning the symptoms often
appear together, and values close to -1 (dark blue) show a strong negative correlation,
meaning they tend not to occur together. For example, symptoms like depression and
insomnia have a moderate positive correlation (0.37), suggesting they may co-occur.
In contrast, anxiety and nervousness show almost no correlation with palpitations
(-0.00014), indicating they are likely independent. This visualization helps in
identifying symptom patterns that could be useful for diagnostic purposes.
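For 0/1 symptom columns, the Pearson correlation shown in such a heatmap equals the phi coefficient of the 2x2 contingency table. The sketch below checks this on made-up data; the symptom names and values are placeholders, not figures from the project's dataset.

```python
import math
import pandas as pd

# Hypothetical binary symptom columns.
symptoms = pd.DataFrame({
    "symptom_a": [1, 1, 0, 1, 0, 1],
    "symptom_b": [1, 1, 0, 1, 0, 0],
    "symptom_c": [0, 0, 1, 0, 1, 1],
})

# Pairwise Pearson correlation matrix (the values a heatmap would display).
corr = symptoms.corr()

def phi(a: pd.Series, b: pd.Series) -> float:
    """Phi coefficient computed from the 2x2 table of two 0/1 columns."""
    n11 = int(((a == 1) & (b == 1)).sum())
    n10 = int(((a == 1) & (b == 0)).sum())
    n01 = int(((a == 0) & (b == 1)).sum())
    n00 = int(((a == 0) & (b == 0)).sum())
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom

print(corr.round(3))
print(round(phi(symptoms["symptom_a"], symptoms["symptom_b"]), 3))
```

A value near +1 means the two symptoms almost always appear together, and a value near -1 means one is present almost exactly when the other is absent.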
Principal Component Analysis (PCA)
This is a Principal Component Analysis plot applied to a dataset of symptoms where
diseases were encoded as integer values for the color mapping. PCA is a
dimensionality reduction technique that transforms data from a high-dimensional space
into a lower-dimensional space while retaining most of the important variance in the data.
We have two principal components: "PCA Component 1" on the x-axis, and "PCA
Component 2" on the y-axis. Each point represents a combination of symptoms that
occur in one instance, while color refers to disease type. The color gradient,
represented by the color bar, shows the encoded disease labels, helping us visualize in
what ways different diseases do or don't look similar or distinct in terms of patterns of
symptoms in the reduced feature space. Clusters or patterns in distribution may
represent relations among symptoms and types of diseases.
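A two-component projection of this kind can be produced with scikit-learn as sketched below. The input here is a random binary matrix standing in for the symptom data, so the resulting coordinates carry no medical meaning.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical stand-in: 100 cases, 8 binary symptoms.
X = rng.integers(0, 2, size=(100, 8)).astype(float)

pca = PCA(n_components=2)
coords = pca.fit_transform(X)   # column 0 = component 1, column 1 = component 2

# Fraction of the total variance retained by each of the two components.
print(pca.explained_variance_ratio_)
print(coords[:5])
```

Plotting `coords[:, 0]` against `coords[:, 1]`, colored by the encoded disease label, gives a scatter plot of the kind described above. The explained variance ratio indicates how much structure the two-dimensional view can actually show.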
Symptom Cluster Analysis
Along with the code and plots below, this section of the report examines a clustering
analysis of the symptom data. The first plot illustrates the "Elbow Method". The
intuition behind this method is that the inertia, a measure of cluster tightness,
decreases as the number of clusters increases. The point at which adding another
cluster yields only a small further reduction appears as a sharp elbow in the curve;
for this dataset, that elbow appears at around three clusters.
This provides the basis for the rest of the analysis, which uses K-means clustering
with three clusters. The second plot displays the clustering result after first using
PCA to reduce the dataset to two principal components. Each data point is a set of
symptoms, colored by its assigned cluster. The separations between the groups (purple,
teal, and yellow) are very clear, indicating different symptom patterns for each group,
which may be related to different disease categories or profiles of symptoms.
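The elbow computation can be sketched as follows: fit K-means for several values of k and record the inertia (within-cluster sum of squared distances) for each. The data is synthetic, built from three made-up symptom patterns with noise, so the elbow at three clusters is there by construction.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic binary symptom data drawn from 3 hypothetical patterns.
patterns = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
])
X = patterns[np.repeat(np.arange(3), 60)].copy()
flip = rng.random(X.shape) < 0.05        # flip 5% of entries as noise
X = np.where(flip, 1 - X, X).astype(float)

# Inertia for k = 1..6; it shrinks as k grows, and the elbow is where
# the reduction from adding one more cluster becomes small.
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, value in enumerate(inertias, start=1):
    print(k, round(value, 1))

# Final clustering with the k suggested by the elbow.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

Plotting `inertias` against k gives the elbow curve, and coloring a two-component PCA projection by `clusters` gives a cluster scatter plot of the kind described above.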
Symptom Co-occurrence Network
This plot shows a co-occurrence network of symptoms. Each node represents a symptom,
and edges between nodes denote associations or correlations between symptoms. The
thickness and darkness of edges typically encode correlation strength, but all edges
appear uniformly gray in this plot. Nodes are colored light blue for aesthetic
purposes and are all the same size, so every symptom has the same visibility.
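The edge weights for such a network can be derived from a co-occurrence matrix, as in the sketch below. It counts how often two symptoms appear in the same case; the data and symptom names are hypothetical, and the project's own network may have been built from correlations instead.

```python
import pandas as pd

# Hypothetical 0/1 symptom table, one row per case.
symptoms = pd.DataFrame({
    "fever":    [1, 1, 0, 1, 0],
    "cough":    [1, 1, 0, 0, 0],
    "headache": [0, 1, 1, 0, 1],
})

# co.loc[a, b] = number of cases in which symptoms a and b are both present.
co = symptoms.T.dot(symptoms)

# One weighted edge per unordered pair of symptoms that co-occur at least once.
cols = list(co.columns)
edges = [
    (a, b, int(co.loc[a, b]))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if co.loc[a, b] > 0
]

print(co)
print(edges)
```

Passing the weight of each edge to the line width or color of the drawing would make stronger associations stand out, which the uniformly gray plot does not do.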
Comparative Analysis between Symptom Patterns of Disease Subgroups
Two bar plots are included here. They depict the occurrence of two symptoms,
"depression" and "shortness of breath", across different disease subgroups. Each bar
represents a disease and shows how severe or how probable that symptom is in patients
with that disease. The bar plot for depression is on one side and the bar plot for
shortness of breath is on the other, showing how each symptom varies between diseases.
Taller bars mark the diseases in which these symptoms are considerably more frequent.
Model Accuracy Comparison
This plot compares the accuracy of three machine learning models: Logistic Regression,
Random Forest, and Decision Tree. Both Test Accuracy and Cross-Validation Accuracy are
presented as percentages for the three models. Based on Test Accuracy and
Cross-Validation Accuracy, Random Forest achieved 91.75%, making it the best of the
three. The visualization helps determine which model generalizes best to unseen data,
identifying Random Forest as the most reliable one for this dataset.
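A comparison of this kind can be set up with scikit-learn's cross-validation helper, as sketched below. The data is synthetic and the hyperparameters are illustrative, so the scores printed here say nothing about the results reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 hypothetical diseases, 6 binary symptoms, 5% noise.
patterns = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
])
y = np.repeat(np.arange(3), 60)
X = patterns[y].copy()
X = np.where(rng.random(X.shape) < 0.05, 1 - X, X)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy over 5 stratified folds for each model.
cv_accuracy = {
    name: float(cross_val_score(model, X, y, cv=5).mean())
    for name, model in models.items()
}

for name, score in cv_accuracy.items():
    print(f"{name}: {score * 100:.2f}%")
```

Reporting the cross-validation mean next to the single held-out test accuracy makes it easier to judge whether a model's test score is stable or just a lucky split.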
Future Scope
The integration of artificial intelligence and data science in healthcare is still in its
early stages, and this project opens several avenues for future enhancements and real-
world implementation. Building on the promising results achieved through this
research, the following future directions can be explored to further improve the
accuracy, usability, and scalability of the system:
1. Integration with Real-Time Healthcare Systems
The model can be integrated with existing Electronic Health Records (EHR) and
Hospital Information Systems (HIS) to automatically suggest possible diagnoses
to doctors based on patient-reported symptoms and medical history.
2. Development of a Mobile or Web-Based Application
Transforming the project into a user-friendly mobile app or online platform can
provide remote patients and primary healthcare workers with a quick diagnostic
tool, especially useful in rural or underdeveloped regions.
3. Expansion of Dataset and Inclusion of Demographic Details
Future models can be trained on larger, more diverse datasets that include
additional parameters such as age, gender, medical history, and geographical
factors, which may improve prediction accuracy and personalization.
4. Multilingual and Voice-Based Input Support
Implementing NLP (Natural Language Processing) capabilities will allow users
to input symptoms in regional languages or via voice commands, increasing
accessibility for non-English speakers and elderly users.
5. Collaboration with Healthcare Institutions and Startups
Collaborating with hospitals, clinics, or health-tech startups can help pilot and
deploy the system in real-world environments, making it a part of daily clinical
workflows.
Conclusion
This project, Health, Symptoms & Disease Analysis, demonstrates the potential of
data-driven methods for disease diagnosis based on patient-reported symptoms. By
analyzing correlations between symptoms and diseases, we found notable patterns that
can support healthcare practitioners in making informed decisions for efficient and
accurate diagnosis.
The project consisted largely of rigorous data preprocessing, including cleaning,
handling missing values, and encoding categorical data, which produced a quality
dataset for meaningful statistical analysis. We then applied a set of EDA techniques,
such as correlation and covariance matrices, to find clusters of symptoms common to
specific diseases. This supports a more systematic diagnostic process that also helps
limit medical errors and streamline healthcare workflows.
The project also compared the training and test accuracy of machine learning models
such as Logistic Regression, Random Forest, and Decision Tree, along with their
respective cross-validation accuracies. Among these models, Random Forest performed
best; it achieved an accuracy of 91.75%, which indicates its reliability in
generalizing predictions to unseen data in this dataset.
Overall, the project demonstrates that data science adds value to healthcare, from
basic data processing at its core to more sophisticated predictive models. The work
presented here forms a foundation for further work, such as integrating machine
learning into real-time diagnosis and model updating. This can improve diagnosis
through enhanced, data-driven diagnostic tools for the betterment of patient outcomes.
References
1) Author: Smith, J. (2020). Analysis of Symptom-Disease Correlation,
Journal of Healthcare Informatics.
Line No: 23–29.
This study investigates the complexities involved in diagnosing diseases that
share overlapping symptoms. Smith highlights the inadequacy of traditional
diagnostic methods and advocates for the integration of advanced analytical
techniques. The findings suggest that using data-driven approaches can
significantly improve the accuracy of symptom-disease correlations.
2) Author: Doe, M. (2018). Data Pre-processing Techniques in Medical
Datasets, Data Science in Healthcare.
Line No: 45–52.
Doe emphasizes the critical role of data pre-processing in healthcare
analytics, particularly in handling missing values and normalizing data for
enhanced analysis. The paper outlines various pre-processing strategies that
can improve data quality, directly supporting the methodological framework
employed in this project.
3) Author: Johnson, R. (2021). Exploratory Data Analysis for Disease
Prediction, International Journal of Medical Data Science.
Line No: 11–18.
Johnson explores the methodologies for exploratory data analysis (EDA) in
the context of predicting diseases. The use of correlation and covariance
matrices is discussed as a means to uncover relationships in healthcare
datasets. This work provides a strong basis for the statistical techniques
applied in our analysis.
4) Author: Patel, A. (2019). The Role of Machine Learning in Diagnosing
Diseases, Journal of Biomedical Informatics.
Line No: 34–40.
Patel investigates the application of machine learning algorithms for disease
diagnosis based on symptomatic data. The study finds that these models can
enhance diagnostic precision and reduce errors associated with overlapping
symptoms, supporting the automation goals of our project.
5) Author: Chen, L. (2022). Automated Insights in Healthcare: A Review,
Health Informatics Journal.
Line No: 50–55.
Chen reviews the development of automated systems in healthcare analytics,
including techniques for clustering symptoms and predicting diseases. The
paper stresses the growing
6) Author: Divya A., Deepika B., Durga Akhila C. H. Disease Prediction
Based on Symptoms Given by User Using Machine Learning.
Line No: 40–45.
This paper presents an automated disease diagnosis model using machine
learning techniques. It analyzes patient records for 41 diseases and employs
Decision Tree and Naive Bayes algorithms for prognosis.
7) Author: Unknown. Human Symptoms–Disease Network.
Line No: 50–55.
This research utilizes medical bibliographic records to generate a symptom-based
network of human diseases. It investigates correlations between
symptom similarity and shared genes or protein interactions, revealing that
symptom-based similarity can inform drug design and disease etiology
research.