Final PD
Learning Algorithm”
A Capstone Project Report
Submitted by
Lakshmi 20221BCA0269
Harshitha 20221BCA0241
Varsha 20221BCA0272
Pavitra 20221BCA0243
Jyoshna 20221BCA0242
Dr. V. Chandrasekar
Professor
PRESIDENCY UNIVERSITY
BENGALURU
MAY 2024
BACHELOR OF COMPUTER APPLICATIONS
PRESIDENCY UNIVERSITY
CERTIFICATE
This is to certify that the Capstone Project report “Detecting Parkinson’s disease
using Machine learning”, being submitted by Lakshmi, Harshitha, Varsha,
Pavitra and Jyoshna, bearing roll numbers 20221BCA0269, 20221BCA0241,
20221BCA0272, 20221BCA0243 and 20221BCA0242, in partial fulfilment of the
requirements for the award of the degree of Bachelor of Computer Applications, is a
bonafide work carried out under my supervision.
Moreover, this kind of disease can worsen in its later stages, so it is very important
to predict in the early stages whether a person has Parkinson's disease. A great deal of
research is under way to find the best possible solution. Some of it uses non-motor
symptoms, while many other studies have addressed the problem using motor symptoms.
However, these approaches are not very accurate, so we are trying to develop a more
accurate model for this prediction.
The goal of this project was to create a machine learning model that predicts whether
a person has Parkinson's disease using Machine Learning techniques such as
classification. Many classification algorithms can be used to separate the classes.
Since this is a binary classification problem, there are two classes: 0 means 'The
person is healthy' and 1 means 'The person has Parkinson's disease'. Before building
our machine learning model we preprocessed the data using Feature Scaling and PCA
(Principal Component Analysis). Feature scaling is important before using PCA; we
standardized our features using Standard Scaler. The reason for selecting this Machine
Learning algorithm was that, being an ensemble technique, it was expected to produce a
more accurate solution than a single model would. Many existing models use
classification algorithms such as SVM, KNN and LR. Out of all these machine learning
models, Logistic Regression gave us the best accuracy score, i.e., it predicted the
results more accurately than the other classification algorithms that we examined.
Finally, we used our machine learning model to predict the result for a new data
sample.
The completion of project work brings a great sense of satisfaction, but it is never
complete without thanking the people responsible for its successful completion.
First and foremost, we are indebted to GOD ALMIGHTY for giving us the opportunity
to excel in our efforts to complete this project on time. We wish to express our deep
and sincere gratitude to our Institution, Presidency University, for providing us the
opportunity to pursue our education.
We express our sincere thanks to our respected dean, Dr. Md. Sameeruddin Khan,
Dean, School of Computer Science Engineering and Information Science, Presidency
University, for giving us permission to undertake the project.
Student Name ID
Lakshmi 20221BCA0269
Harshitha S M 20221BCA0241
Varsha A 20221BCA0272
Jyoshna 20221BCA0242
TABLE OF CONTENTS
Abstract
Acknowledgement
List of Figures
1 INTRODUCTION
4 OBJECTIVES
5 METHODOLOGY
5.2 IMPLEMENTATION DETAILS
6 OUTCOMES
8 CONCLUSION
REFERENCES
APPENDIX
LIST OF FIGURES
LIST OF TABLES
Table No.   Caption
3.1         Dataset Attributes
CHAPTER-1
INTRODUCTION
Patients with Parkinson's disease (PD) often complain of a variable impairment of voice
emission including hypophonia, mono-pitch and mono-loudness speech, and hypokinetic
articulation, collectively called hypokinetic dysarthria. Parkinsonian patients may manifest
voice disorders in the early stage of the disease, with growing evidence showing voice
impairment occurring even in the prodromal phase of PD. Also, voice typically worsens over
the course of the disease, leading to severe voice impairment in more advanced stages of PD.
Furthermore, the standardized clinical assessment of voice in PD is currently based only on
qualitative evaluation (i.e., a specific subitem of the Unified Parkinson's Disease Rating
Scale, UPDRS) (2, 10), thus precluding the objective assessment of the voice impairment in
this disorder.
Over recent years, quantitative approaches based on spectral analysis have been developed to
examine voice samples objectively (11). Spectral analysis in patients with PD has
demonstrated several abnormalities in specific voice features, such as reduced fundamental
frequency and harmonics-to-noise ratio, and increased jitter and shimmer (3, 12–16). The
human voice, however, is a complex phenomenon characterized by high-dimensional
data based on an exponential number of features. Accordingly, besides the independent
examination of specific voice features (e.g., fundamental frequency) through spectral analysis,
more advanced techniques able to analyse and dynamically combine high-dimensional
datasets of voice features, such as machine-learning algorithms (17–23), would significantly
improve the accuracy of the objective classification of voice samples in PD. Indeed,
machine learning has made it possible to classify voice impairment objectively and
automatically in a number of neurologic disorders, with previously unreported high accuracy
(19, 21, 22).
To date, concerning the application of machine learning analysis in PD, only a few
preliminary studies in rather small and clinically heterogeneous cohorts of patients have been
reported (24–26). It is therefore important to examine instrumentally voice impairment in a
large and clinically well-characterized cohort of PD. Also, it is relevant to verify whether
machine learning can recognize the effect of disease severity by discriminating patients in
Here, we investigated voice in a large and clinically well-characterized cohort of patients with
PD. Then, to examine the effect of disease severity on voice, we compared voices collected in
patients in early and mid-advanced stage of PD. Furthermore, to investigate the effect of
L-Dopa on voice, we compared patients OFF and ON therapy. To verify the effect of the
specific speech
tasks, we compared voice recordings during the emission of a vowel and a sentence,
according to standardized procedures (19, 21, 22). We assessed the sensitivity, specificity,
positive and negative predictive values, and accuracy of all diagnostic tests and calculated the
area under the receiver operating characteristic (ROC) curves. Lastly, by providing a machine
learning measure of voice impairment severity for each patient, we also assessed possible
clinical-instrumental correlations. Our hypothesis is that machine learning analysis of speech
samples is able to discriminate PD patients from controls, patients in early and mid-advanced
stages, and finally patients OFF and ON therapy, with previously unreported high accuracy.
Early in 1969, Darley et al. defined dysarthria as a collective term for related speech
disorders. The classification of dysarthria includes flaccid dysarthria, spastic dysarthria,
ataxic dysarthria, hypokinetic dysarthria, hyperkinetic dysarthria, unilateral upper motor
neuron dysarthria and mixed dysarthria (4). The speech abnormalities of patients with PD are
collectively termed hypokinetic dysarthria (HKD). These speech flaws are typically
characterized by increased acoustic noise, a reduced intensity of voice, harsh and breathy
voice quality, increased voice nasality, monopitch, monoloudness, speech rate disturbances,
the imprecise articulation of consonants, the involuntary introduction of pauses, the rapid
repetitions of words and syllables and sudden deceleration or acceleration in speech.
Speech impairments are caused by impaired speech mechanisms during any of the basic
motor processes involved in speech performance (5). The neuromotor speech sequence
activates the muscles of the pharynx, tongue, larynx, chest and diaphragm through
1.1 SIGNIFICANCE
Patients need not travel physically to a doctor; instead, they can record audio using
phones and perform a simple test at home. Common voice modulation symptoms
include dysphonia [4] and dysarthria [5]. Patients can be asked to hold a single vowel’s
pitch for as long as possible (sustained phonation), or running speech tests can be
administered as a more realistic test of impairment. These phonation tests can
be used for diagnosis of Parkinson’s at stage 0. Following early detection, doctors can
tailor therapeutic solutions or use deep brain stimulation [6] to reactivate the
dopamine-producing neurons in the brain, thereby slowing the progress of PD. Owing to
its complex nature, there is no cure for Parkinson’s to date. However, early identification
followed by the right medication can reduce the tremors and imbalance symptoms in
patients, enabling them to lead a normal life. This paper focuses on early detection
through audio recordings of people with Parkinson’s (PWP) using ML techniques.
1.2 IMPLICATIONS
Vocal analysis can also provide insights into the functional differences between
patients with Parkinson's disease and healthy individuals. For example, high-frequency
voice content has been found to be a significant factor in assisting Parkinson's disease
detection in women, while low-frequency content is more important in men.
The Parkinson's disease speech dataset from the UCI Machine Learning Repository serves
as training data. In addition, by combining the inputs of healthy and Parkinson's
afflicted patients' spiral drawings, our suggested approach produces reliable outcomes.
We suggest a hybrid approach, which is both efficient and accurate, by evaluating the
patient's speech and their spiral drawing data. By comparing the two sets of data, the
doctor can determine if the patient is healthy or not and what medication to give them
depending on the severity of their condition.
Pre-processing: speech signals have to be broken down into their component parts.
The quiet portions have less energy than the spoken portions because their
amplitude is lower, so signal energy can be used to separate speech from silence.
In this study, we isolate the steady parts of each speech signal before carrying out
the segmentation process. As these signals tend to be most stable midway through
their duration, cutting off the beginning and end portions leaves a consistent signal
throughout. To eliminate issues at the beginning and end of the phonations, 2 s
segments were selected from the intermediate,
steady section of the speech signals for the following acoustic analysis.
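The pre-processing idea above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the signal is synthetic, and the helper names `middle_segment` and `frame_energy`, the 100 ms frame length and the 10% energy threshold are hypothetical choices, not values taken from this study.

```python
import numpy as np

def middle_segment(signal, sample_rate, seconds=2.0):
    """Return a `seconds`-long slice centred on the midpoint of `signal`."""
    n = int(round(seconds * sample_rate))
    if n >= len(signal):
        return signal.copy()
    start = (len(signal) - n) // 2
    return signal[start:start + n]

def frame_energy(signal, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).mean(axis=1)

# Synthetic 5 s recording (assumption): 1 s silence, 3 s of a 150 Hz tone, 1 s silence.
sr = 8000
t = np.arange(5 * sr) / sr
voiced = (t >= 1.0) & (t < 4.0)
signal = np.where(voiced, 0.5 * np.sin(2 * np.pi * 150 * t), 0.0)

# Keep the steady 2 s in the middle of the recording.
segment = middle_segment(signal, sr)

# Low-energy frames are treated as silence (threshold is an illustrative choice).
energy = frame_energy(signal, frame_len=sr // 10)
is_speech = energy > 0.1 * energy.max()
print(len(segment) / sr, "seconds kept;", int(is_speech.sum()), "voiced frames")
```

On real recordings the threshold would need tuning, since background noise raises the energy of the quiet frames.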
Overall, vocal analysis has the potential to be a valuable tool in the detection and
diagnosis of Parkinson's disease, and could potentially be used in conjunction with
other diagnostic methods to improve accuracy and outcomes.
This chapter briefly describes Parkinson's disease and how it affects the brain. It
also covers the history of the disease: how it was discovered and how it has been
treated. Finally, it gives details about deaths from Parkinson's disease and the
places where the disease is most common.
Parkinson's disease affects the regions of the brain that are responsible for posture
and balance. Its symptoms are very difficult to identify because different people
experience different problems; for this reason Parkinson's disease is considered a
complex disease. The disease was named after the British doctor James Parkinson, who
was the first to write a book about it, in 1817. Because of that book, the disease
became an easily recognized entity. It took around 100 years to first notice the major
changes in the brains of people affected by the disease, and a further 50 years before
experts agreed that these changes were part of the disease process. The first treatment
for Parkinson's disease, L-Dopa, came in the 1960s, after the importance of a brain
chemical called dopamine was understood. Earlier treatments for this disease were very
poor.
Parkinson's recommendation of vertical incision at the back of the neck was not much
So, when we talk about treating Parkinson's disease, we mean treating only the
symptoms of the disease, not the disease itself. The medications in use only improve
the symptoms: by helping to restore a normal chemical balance in the brain, they
improve slowness, stiffness, mobility, etc., but they do not alter the process that
caused the damage. This is very similar to how we treat a cold. We take medicines to
make the cough less severe and the sore throat less painful, but we do nothing to stop
the virus that caused the problem. The medications help up to a point, but they stop
helping when the disease has become extremely severe.
The project uses different machine learning techniques on the Parkinson's dataset,
which we obtained from the UCI Machine Learning Repository. The data consists of
voice recordings of people with and without Parkinson's disease. The dataset contains
voice measurements from 31 people, of whom 23 have Parkinson's disease. The data was
first analysed to check for outliers and then split into training and testing data.
Since the dataset is high-dimensional and therefore difficult to analyse, a
dimensionality reduction algorithm, PCA, was used. Before applying PCA we had to
bring the data into a standard form, so feature scaling was applied; the feature
scaling technique that we used was Standard Scaler. After the dimensionality
reduction step we fitted the data to our machine learning models, i.e., Support
Vector Machine (SVM) and Logistic Regression (LR). Then the results are predicted
CHAPTER-2
LITERATURE SURVEY
Previous studies to predict PD have been implemented on MRI scans, gait and genetic
data, but research on audio impairment for early detection is minimal. For instance,
Bilal et al. [1] studied genetic data to predict the onset of PD in senior patients.
They trained an SVM model to reach an accuracy of 0.889, while this
research paper describes an improved SVM model with an accuracy of 0.9183. These
results also corroborate the merits of classifying PD based on audio data over
genetic data.
Raundale, Thosar and Rane [2] used keystroke data from the UCI telemonitoring dataset
to train a Random Forest classifier to predict the severity of PD in older patients.
Cordella et al. [3] use audio data to classify PWP; however, their models are heavily
reliant on MATLAB. Our research uses open-source models trained in Python that are
faster and more memory efficient.
The majority of research emphasizes the use of deep learning in PD detection. For
example, Ali et al. [4] explain the use of ensemble deep learning models applied to
phonation data to predict the progress of Parkinson’s disease. Their work lacked the
feature selection that would improve deep learning model (DNN) performance.
Hence, this paper implements PCA on 22 attributes to select 7 major voice modalities
in PD detection.
Wodzinski et al. [6] trained a ResNet model on images of audio data, instead of
training the model on the nuances of the frequency of audio.
Wang et al. [8] implemented 12 machine learning models on a 401 voice biomarkers
dataset to classify patients as PD or not. They built a custom deep learning model
(DEEP) with a classification accuracy of 96.45%; however, the model was expensive
due to large memory requirements.
However, the dataset was small and artificial data augmentation was needed. A. U. Haq
and colleagues [11] implemented L1-support SVM, without feature identification, on a
vowel phonation dataset for neurological disorder patients. Their paper focused on
the patient age group of 46-85 years, without considering healthy individuals in a
lower age bracket.
The proposed methodology collects audio data from Parkinson’s patients’ voice
modulations. The dataset contains information about jitter, shimmer and MDVP of vowel
phonations. Data is preprocessed, analyzed and visualized for a thorough
understanding of the attributes. The models (Logistic regression, SVM and K nearest
neighbors) are trained on 75% of the data to classify given audio data as PD or
healthy, based on variations in frequency. They are tested on the remaining 25% of
the data and evaluated on sensitivity, precision and accuracy. Figure 1 illustrates the
generic process implemented. It shows the stages of data ingestion from the vocal
database, separation of data into testing and training sets, training of the models on
data and validation of results using test data.
Dataset
Biomedical voice measurements [24] of 31 people have been gathered, of whom 23
have PD. Patients are in the age range of 46 to 85 years, while normal
readings are from people of 23 years of age. An average of 6 phonations were
recorded per person, giving 195 recordings in total, each ranging from 1 to 36
seconds in duration. The attributes of the 195 records are elaborated in Table 1 below:
Attribute       Purpose
Name            Data is stored in ASCII CSV format, where patient name and
                recording number are stored.
MDVP:Fo(Hz)     Fundamental frequency of pitch period
Model training
This research paper studies Logistic Regression, Support Vector classifier and K
nearest neighbors models in 3 approaches:
Support vector machine (SVM) [30] is a supervised machine learning algorithm that
constructs a hyperplane to separate the classes in the multidimensional space spanned
by the N features. The architecture of SVM model has been illustrated in
Since PD voice data is not linearly separable, we use an SVM kernel to transform data
into higher dimensional space. SVM performs well for PD data due to memory
efficiency and support vectors formed from a subset of training data points.
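The effect of a kernel on data that is not linearly separable can be illustrated with a small sketch. It assumes scikit-learn is available and uses the synthetic `make_circles` toy data as a stand-in for the PD voice features; the RBF kernel is one common choice and is not necessarily the kernel used in this work.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate the classes (toy data, assumption).
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A linear kernel looks for a separating line in the original 2-D space.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)

# The RBF kernel implicitly maps the points to a higher-dimensional space
# in which the two rings become separable by a hyperplane.
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

After fitting, the `support_vectors_` attribute of an `SVC` holds the subset of training points that define the boundary, which is the memory-efficiency property mentioned above.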
Support vector machines (SVMs) are regarded as effective learning techniques and are
frequently applied to problems in biomedical and health informatics [49]. The output
of a trained SVM model is an optimal hyperplane that maximises the margin, i.e., the
distance between the hyperplane and the closest training data points of each class.
The following are the main factors that lead machine learning researchers to use SVMs
for their problems.
(1) The first justification is that SVMs are very good at generalising to new data.
(2) The second factor is that SVMs rely on a relatively small set of hyperparameters.
The logistic regression model estimates the regression coefficients using maximum
likelihood estimation, which finds the values of the coefficients that maximize the
likelihood of observing the training data.
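Maximum likelihood estimation for logistic regression can be sketched with plain NumPy. The example below is a minimal illustration on synthetic data, using simple gradient ascent on the log-likelihood; the function names, learning rate and iteration count are hypothetical choices, and library implementations typically use more sophisticated solvers and may add regularisation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """Log-likelihood of labels y under coefficients w (small eps avoids log(0))."""
    p = sigmoid(X @ w)
    eps = 1e-12
    return float(np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Gradient ascent on the log-likelihood, starting from all-zero coefficients."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ w))  # gradient of the log-likelihood
        w += lr * grad / len(y)
    return w

# Synthetic data (assumption): one feature plus an intercept column.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_w = np.array([-0.5, 2.0])
y = (rng.random(n) < sigmoid(X @ true_w)).astype(float)

w_start = np.zeros(2)
w_hat = fit_logistic(X, y)
print("log-likelihood at start:", round(log_likelihood(w_start, X, y), 2))
print("log-likelihood after fit:", round(log_likelihood(w_hat, X, y), 2))
print("estimated coefficients:", np.round(w_hat, 2))
```

The fitted coefficients give a higher log-likelihood than the starting point, which is exactly what "maximising the likelihood of the training data" means.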
To apply logistic regression to vocal data, the first step is to extract a set of relevant
input features from the vocal signal. This might include measures such as:
OBJECTIVES
The main objective of our project is to find the best machine learning model for
early prediction of Parkinson's disease using the motor symptoms. This would
in turn help people to get the necessary treatments discussed previously for
easing the symptoms before the disease gets even worse.
If the disease becomes extremely severe, medications for the symptoms no longer
have any effect. A machine learning model makes it easier to predict the disease on
the basis of previous data, namely the voice measurements collected from different
persons.
We use the fact that 90% of those affected by this disease have a speech disorder;
the dataset was collected on this basis. The input features extracted from the
voice measurements are given to the machine learning model, which is trained on this
data so that it can later be used to predict results. The predictions are compared
with the true labels and evaluated using evaluation metrics such as the accuracy
score. In this project we have tried to choose the most effective algorithm in terms
of accuracy.
The data is split into training and testing data: 70% of the data is used for
training and 30% for testing. In this project, Support Vector Machine and Logistic
Regression are used as our machine learning models; they are trained on the training
data and tested on the testing data. The model that gave the best accuracy when
evaluated is the one used for predicting the results.
Hence, this machine learning model could be used for early detection of the disease,
which would help to increase the lifespan of elderly people with Parkinson's and
decrease the deaths caused by Parkinson's disease.
Our research is based on qualitative analysis rather than quantitative. We will use the
qualitative data analysis online tool. All research will be done at the university and in
our personal space.
1. Data Collection
Collect a dataset of voice samples from patients with Parkinson's disease and healthy
individuals.
2. Feature Extraction
Feature extraction increases the accuracy of learned models by extracting features from the
input data. This phase reduces the dimensionality of the data by removing redundant records,
which also improves classification speed. Feature extraction helps get the best features from
large data sets by selecting and combining variables into features, thus effectively
reducing the amount of data. These features are easy to process but still describe the
actual data set accurately.
3. Feature Selection
4. Model Training
Train SVM, LR, and KNN models on the preprocessed dataset, using the selected
features. With the train/test split method, X_train and y_train are the training features
and target variable, respectively, while X_test holds the test features. You need to
replace these variables with your actual dataset.
5. Voice Segmentation
In this phase, we divide the voice into segments and examine the relationships among
the seven features extracted for classification of the human voice.
Based on these seven features, we identify the ranges of frequencies and compare them
with patients’ health status. Evaluate the performance of each model using metrics
such as accuracy, precision, recall, and F1-score.
6. Model Deployment
Deploy the selected model in a web application or mobile app, where users can input
their voice samples to receive a prediction of whether they have Parkinson's disease or
not.
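The evaluation metrics named in the steps above (accuracy, precision, recall and F1-score) can be computed directly from the confusion counts. The sketch below uses plain Python and made-up labels purely for illustration; label 1 stands for "has Parkinson's disease" and 0 for "healthy", as elsewhere in this report.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted PD, how many are PD
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual PD, how many were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels for ten recordings (assumption, not results from this project).
y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall is the same quantity that the literature survey calls sensitivity.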
METHODOLOGY
1. Dataset used: The dataset used in our project was taken from the UCI Machine Learning
Repository; its name is the Parkinson's dataset. It contains 195 voice
measurements taken from 31 people with and without Parkinson’s.
2. Analysis of the data: We analyse the data using methods such as correlation, which
we visualize with a heat map, and we plot a graph showing how many entries in the
dataset have Parkinson's and how many do not.
3. Data preprocessing: We need to remove all outliers and null values before giving
the data for training and testing. We also need to do feature scaling before moving on.
Feature scaling brings the features onto a common scale and should be done in the
preprocessing step to handle high-magnitude values in the data. After feature scaling
we can apply the dimensionality reduction algorithm PCA (Principal Component
Analysis). Dimensionality reduction is important for data with high dimensionality:
from the lower-dimensional data we can extract more of the essence of the data.
4. Splitting the data into training and testing: The data we are using is split into two
parts: training and testing data. From the data, 70% of the data will go for training and
20% of the data will go for testing. The training data has a known output and the machine
learning model is trained on this data. After the training is completed we use the testing
data to test our model's predictions. Here we make use of the Random Forest Classifier.
6. Predicting the results: After training the model, we predict the results, which
are then used to compute the accuracy score.
7. Evaluation: To evaluate how our model performed we use evaluation metrics such as
the accuracy score.
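The preprocessing step in the list above (feature scaling followed by PCA) might look like the following sketch. It assumes scikit-learn and uses random numbers in place of the real voice measurements; the shapes (195 rows, 22 features, 7 components) follow figures mentioned in this report, but the values are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the voice features: 195 rows, 22 columns whose
# magnitudes differ by several orders (assumption, not the real data).
rng = np.random.default_rng(42)
X = rng.normal(size=(195, 22)) * np.logspace(-3, 2, 22)

# Standardise every feature to zero mean and unit variance. Without this,
# PCA would be dominated by the columns with the largest magnitudes.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Project the standardised data onto its first 7 principal components.
pca = PCA(n_components=7)
X_reduced = pca.fit_transform(X_scaled)

print("reduced shape:", X_reduced.shape)
print("variance explained:", round(float(pca.explained_variance_ratio_.sum()), 3))
```

In a full workflow the scaler and PCA would normally be fitted on the training split only and then applied to the test split with `transform`, so that no information from the test data leaks into training.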
Before starting this project, we are expected to import some required libraries as shown
below
After getting the dataframe ready, we analyse the data to check whether it contains
any outliers or null values, and to get some more useful insights from it. Fig 3.4
shows the names of the different columns we have in our data.
Then we use info to check whether there are any null values in the data and the type of data
we have.
From this we inferred that there are 147 recordings from people with Parkinson's disease and
48 recordings from people without Parkinson's disease.
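The checks described above can be sketched with pandas as follows. The tiny dataframe is made up for illustration; the column names `name`, `MDVP:Fo(Hz)` and `status` are assumed to mirror the dataset's layout, with `status` equal to 1 for Parkinson's and 0 for healthy.

```python
import pandas as pd

# Made-up rows standing in for the real file (assumption).
df = pd.DataFrame({
    "name": ["rec_01", "rec_02", "rec_03", "rec_04"],
    "MDVP:Fo(Hz)": [119.9, 122.4, 197.1, 116.7],
    "status": [1, 1, 0, 1],
})

# Column types and non-null counts, as produced by `info`.
df.info()

# Number of missing values per column.
null_counts = df.isnull().sum()
print(null_counts)

# Number of rows per class (1 = Parkinson's, 0 = healthy).
class_counts = df["status"].value_counts()
print(class_counts)
```

On the full dataset the same `value_counts` call would be expected to produce the class totals reported above.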
First we apply feature scaling to our data, since it contains many high-magnitude values
that need to be brought onto a common scale.
* Data Preprocessing
Now we split the data into training and testing sets as shown in Fig 3.13. We import
train_test_split from sklearn.model_selection to split the data. Here the test size
is given as 0.2, which means 20% of the data is used for testing and the rest for
training.
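A minimal sketch of this split is shown below. The placeholder arrays only reproduce the size of the dataset (195 rows, 147 positive and 48 negative labels); `random_state` and `stratify` are assumptions added for reproducibility and to keep the class ratio similar in both parts, and are not stated in this report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels with the dataset's size (values are synthetic).
X = np.arange(195 * 3).reshape(195, 3)
y = np.array([1] * 147 + [0] * 48)

# test_size=0.2 keeps 20% of the rows for testing and 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("training rows:", X_train.shape[0])
print("testing rows: ", X_test.shape[0])
```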
Now this training data is fitted to the machine learning models; here we are using
Support Vector Machines (SVM) and Logistic Regression (LR).
For that we import the SVM classifier (SVC) from the sklearn library's svm submodule and
LogisticRegression from its linear_model submodule, as shown in Fig 3.14.
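Fitting the two classifiers could be sketched as follows, assuming scikit-learn's usual module layout (`sklearn.svm.SVC` and `sklearn.linear_model.LogisticRegression`). The data comes from `make_classification` and is only a stand-in for the voice features, so the printed scores say nothing about the real dataset; the default hyperparameters and `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data with 22 features (assumption).
X, y = make_classification(n_samples=195, n_features=22, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Each pipeline standardises the features and then fits one classifier.
models = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                       # train on the training split
    predictions = model.predict(X_test)               # predict the held-out rows
    scores[name] = accuracy_score(y_test, predictions)
    print(f"{name} test accuracy: {scores[name]:.3f}")
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training split only.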
The project had some functional requirements that were necessary for it to be
implemented.
* A Python library such as seaborn is needed to visualize the data and draw inferences from it.
* PCA should be applied in order to reduce the dimensions of the data. This can be done by
using the Python library sklearn.
* Data should be preprocessed before use, i.e., no outliers or null values should remain
before it is given for training and testing. If the data is not preprocessed there would be
inconsistencies during prediction. A Python library such as sklearn can be used for
preprocessing the data.
* While writing the code, the software's performance and capability are also considered.
* Some other non-functional requirements are speed, reliability and the security of the
system.
So, in this project we have used some Python libraries: Pandas, NumPy, Matplotlib,
Seaborn and Sklearn. These are the libraries that we needed to complete the project.
Pandas - This library is used for reading the dataset and for creating the dataframe. This is
the tool that is generally used for data analysis. Pandas is an open source Python package that
is most widely used for data science/data analysis and machine learning tasks. It is built on
NumPy - NumPy stands for Numerical Python. This library is used for scientific computation
in Python and consists of multidimensional array objects and a collection of routines for
processing those arrays. Its array object is faster than the normal lists that we use in
Python, and mathematical and logical operations on arrays can be performed with it.
NumPy is also a Python prerequisite for Dlib.
Matplotlib - This is the library that we generally use for data visualization using different
plots and graphs. Matplotlib is a cross-platform, data visualization and graphical plotting
library for Python and its numerical extension NumPy. As such, it offers a viable open source
alternative to MATLAB.
Finally, we used the Sklearn library and its submodules for the machine learning techniques
in our project.
Sklearn: Scikit-learn, or sklearn, is a free software machine learning library. It has
various classification algorithms, such as Decision Tree, SVM, KNN and Random Forest,
and it interoperates with the Python numerical and scientific libraries NumPy and SciPy.
This library also includes modules for data preprocessing, feature scaling and many
other machine learning techniques, as well as various dimensionality reduction
algorithms such as PCA and LDA.
Classification Techniques
SVM is a machine learning method that can solve both linear and nonlinear problems. It
gives good performance on both regression and classification problems. The
SVM classification technique looks for the optimal separating hyperplane if you
2. Logistic Regression (LR):
Logistic Regression was commonly used in biological studies and applications in the
early twentieth century. Logistic Regression (LR) is one of the most used machine
learning algorithms when the target variable is categorical. Recently, LR has become a
popular method for binary classification problems. Moreover, it produces a discrete binary
outcome between 0 and 1. Logistic Regression computes the relationship among the feature
variables by estimating probabilities (p) using the underlying logistic function.
ARCHITECTURE DIAGRAM
III.TESTING
UNIT TESTING
INTEGRATION TESTING
FUNCTIONAL TESTING
Unit testing is an essential part of software development that helps ensure the correctness of
individual components or units of a software system. In the context of machine learning
models for Parkinson's disease diagnosis using vocal features, unit testing can help ensure
that the models are working as expected and that the input and output data are being
processed correctly.
Input Validation: Test that the input data is being validated correctly. This could include
checking that the vocal features are within acceptable ranges, that missing values are handled
correctly, and that the data is being preprocessed correctly.
To perform these tests, you can use a variety of tools and frameworks, such as Python's
unittest or pytest libraries. These tools allow you to write automated tests that can be run as
part of a continuous integration pipeline, ensuring that the system is always working
correctly.
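An input-validation unit test of the kind described above might be sketched like this. The function `validate_features` and its rules (22 features, no missing or non-finite values) are hypothetical and serve only to show the testing pattern; the test functions use plain `assert` statements and follow the `test_` naming that pytest discovers automatically.

```python
import math

def validate_features(features, expected_len=22):
    """Hypothetical validator: check length and reject missing or non-finite values."""
    if len(features) != expected_len:
        raise ValueError(f"expected {expected_len} features, got {len(features)}")
    cleaned = []
    for i, value in enumerate(features):
        if value is None or not math.isfinite(value):
            raise ValueError(f"feature {i} is missing or not finite")
        cleaned.append(float(value))
    return cleaned

def raises_value_error(features):
    """Small helper: True if validate_features rejects the input."""
    try:
        validate_features(features)
    except ValueError:
        return True
    return False

def test_accepts_complete_vector():
    assert validate_features([0.1] * 22) == [0.1] * 22

def test_rejects_wrong_length():
    assert raises_value_error([0.1] * 21)

def test_rejects_missing_value():
    assert raises_value_error([0.1] * 21 + [None])
    assert raises_value_error([0.1] * 21 + [float("nan")])
```

Saved in a file such as `test_validation.py` (a hypothetical name), these tests could be run with the `pytest` command as part of a continuous integration job.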
INTEGRATION TESTING:
Integration testing is a type of testing that focuses on testing the interactions between
components or modules of a software system. In the context of machine learning models for
Parkinson's disease diagnosis using vocal features, integration testing can help ensure that the
It is a level of software testing where individual units are combined and tested as a
group. In the proposed project all the data is combined and tested. The accuracy level is
94.87%. This testing tests the whole project at once, which reduces the time needed for
integration testing.
FUNCTIONAL TESTING:
Functional testing is a type of software testing that validates the software against the
functional requirements/specifications. Here it checks that Parkinson’s disease is detected
by the machine learning algorithm as specified. The ML algorithm also speeds up detection.
OUTCOMES
In this study on detecting PD patients from vocal signals, we described and
implemented two models based on two different feature extraction algorithms
combined with SVM, a popular supervised algorithm for classification problems
that uses hyperplanes to separate both linearly and non-linearly separable data. In
the first model, PCA was used, as it is a popular unsupervised method for finding the
principal components of data in order to reduce its dimensionality. This, in turn,
avoided a known weakness of SVM, whose classification performance degrades when
there are more features than samples, as in the dataset used in this study.
The above-mentioned models were trained and tested using the dataset obtained. Due
to the high imbalance in the dataset, the F1-score, MCC, and Precision-Recall curve
were used to evaluate the models along with accuracy.
Different feature extraction techniques were applied, and the relative comparisons
were depicted with a much larger dataset with 752 features and 756 voice samples,
unlike the recent study in Hemmerling and Sztahó [28], which has only 198 voice
samples and 33 features. Small datasets have several disadvantages, which lead to
2. DISTRIBUTION
As noted in previous literature, accuracy can be misleading on an imbalanced dataset
if it is used as the only evaluation metric. We therefore used the F1-score and MCC
along with accuracy. We also used the Precision-Recall curve to visualize the
performance of the models under such a skewed class distribution, and applied
SMOTE to synthesize new minority examples so that the models could be evaluated in
a balanced-class scenario. Both models showed better balanced-class performance.
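To make the point about misleading accuracy concrete, the following minimal sketch computes accuracy, F1-score, and MCC from a confusion matrix for a trivial classifier that always predicts the majority class. The labels are toy values invented for this example, not results from the project, and the helper names are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN with 1 = PD (positive) and 0 = healthy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy imbalanced labels: 8 PD samples and 2 healthy samples.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A trivial classifier that always predicts PD.
y_always_pd = [1] * 10

tp, fp, fn, tn = confusion_counts(y_true, y_always_pd)
print(accuracy(tp, fp, fn, tn))  # 0.8
print(f1_score(tp, fp, fn))      # about 0.889
print(mcc(tp, fp, fn, tn))       # 0.0
```

In this toy case the classifier has learned nothing, yet accuracy is 0.8 and the F1-score of the majority class is also high; only the MCC of 0.0 shows that the predictions carry no information about the true class.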
3. REDUCE DIMENSIONALITY
The study explored the field of Parkinson's disease patient detection based on vocal
features by building on the idea of merging feature extraction methods, removing
irrelevant data by reducing dimensionality based on the variance of the data, and
additionally using a DNN in an unsupervised manner together with SVM, which is one
of the most powerful classifiers thus far when it comes to data points separable with
a larger number of hyperplanes.
Imbalanced data prevents an accurate picture of the detection of PD patients, which
can be addressed with a more balanced dataset. Further research and experiments
can be conducted by employing other dimensionality reduction and feature extraction
algorithms such as kernel PCA (kPCA), Denoising Autoencoders to reduce the noise
CHAPTER-7
By using machine learning techniques, the problem can be solved with a minimal error rate.
Parkinson's disease detection can use gait, tremor, and handwriting samples together as the
dataset, in order to increase accuracy by finding the correlation between these symptoms.
Individual analysis of each symptom has drawbacks: for example, handwriting is a complex
activity in which other factors can influence motor movement; speech recognition requires
additional steps such as noise removal and speech segmentation; and breath samples have
failed to produce clinically relevant results.
In the case of SVM, the metrics stay within a fixed range (about 96.5%). For the class of people
affected by Parkinson's disease, the weighted average shows the same metric values in all cases.
A bar chart comparing the normal LR and normal SVM models for PD diagnosis using voice
analysis was created to visualize the results. The chart has two bars, one for the normal LR
model and one for the normal SVM model, with the height of each bar representing the
accuracy of the model. The x-axis can be labelled "Ranges" and the y-axis "Metrics",
covering accuracy, precision, recall, and F1-score. The chart can also include error bars to
represent the standard deviation or confidence interval of the accuracy results.
DESIGN
1. Data Collection: Collect a dataset of voice samples from people with Parkinson's disease
and healthy controls. The samples should be of the same task, such as sustained phonation of
the vowel /a/. The samples should be labeled according to the health status of the individual.
2. Preprocessing: Preprocess the voice samples to extract relevant features. This could
include pitch, formants, harmonics-to-noise ratio, speech rate, etc. The preprocessing step
might also involve normalizing the data and handling any missing values.
3. Model Training: Split the dataset into a training set and a test set. Use the training set to
train the SVM, LR, and KNN models. You might want to use cross-validation to ensure that
the models are not overfitting the training data.
4. Model Evaluation: Evaluate the performance of the models on the test set. You can use
metrics such as accuracy, precision, recall, and F1 score.
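The design steps above can be sketched end to end with the Python standard library alone. This is a minimal sketch under stated assumptions: the two features per sample and all values are made up for illustration, the helper names are hypothetical, and a small KNN classifier stands in for the three models; it is not the implementation used in the project.

```python
import math
import random


def stratified_split(X, y, test_fraction=0.25, seed=0):
    """Split the rows class by class so both sets keep the class proportions."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]

    def pick(rows, ids):
        return [rows[i] for i in ids]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)


def fit_scaler(X_train):
    """Per-feature mean and standard deviation, computed on the training set only."""
    cols = list(zip(*X_train))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has std 0; fall back to 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds


def transform(X, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def knn_predict(X_train, y_train, row, k=3):
    """Majority vote among the k nearest training rows (Euclidean distance)."""
    nearest = sorted(zip(X_train, y_train), key=lambda pair: math.dist(pair[0], row))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


# Toy data: two made-up features per sample; label 1 = PD, 0 = healthy.
X = [[120.0, 0.3], [118.0, 0.4], [125.0, 0.2], [122.0, 0.35],
     [180.0, 1.1], [175.0, 1.3], [185.0, 1.0], [178.0, 1.2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = stratified_split(X, y)
means, stds = fit_scaler(X_train)
X_train_s = transform(X_train, means, stds)
X_test_s = transform(X_test, means, stds)
y_pred = [knn_predict(X_train_s, y_train, row) for row in X_test_s]
test_accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)
print(y_test, y_pred, test_accuracy)
```

The scaler is fitted on the training rows only and then applied to the test rows, so that no information from the test set leaks into preprocessing. Cross-validation would repeat the same split, fit, and evaluate cycle over several different splits.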
With this condition, most people with Parkinson's disease experience changes in speech, voice,
and swallowing. The effects of Parkinson's disease on the body's muscles (tremor, stiffness,
slow movement, and slow speech) can also occur in the muscles used to speak and swallow.
In clinical analysis of the human voice, machines read the voice at different frequencies, and
one of the most critical of these is the fundamental vocal frequency.
The fundamental vocal frequency (Fo) of the human voice lies in the range 100–250 Hz, which
characterizes the human voice. This range increases when the person is affected by PD. The
maximum vocal fundamental frequency (Fhi, in Hz) takes values higher than Fo because these
frequencies measure the vocal voice length. The minimum vocal fundamental frequency (Flo)
shows the lower end of the frequency range of the human voice signal.
Detecting Parkinson's disease using machine learning algorithms has shown promising results.
In terms of Support Vector Machine (SVM), studies have demonstrated that SVM can achieve an
accuracy of around 92% in detecting Parkinson's disease from speech signals.
For Logistic Regression (LR), research has shown that LR can achieve an accuracy of around
85% in detecting Parkinson's disease from motor and non-motor symptoms.
Regarding K-Nearest Neighbors (KNN), studies have demonstrated that KNN can achieve an
accuracy of around 88% in detecting Parkinson's disease from gait and balance signals.
It's worth noting that these results may vary depending on the specific dataset and features used
in the machine learning models. However, overall, machine learning algorithms have shown great
potential in detecting Parkinson's disease with high accuracy.
In conclusion, we proposed using machine learning approaches such as Support Vector
Machines (SVM), Logistic Regression (LR), and K-Nearest Neighbours (KNN) to identify
Parkinson's disease from voice signal features. The results of these methods (SVM 85% and
KNN 88%) are more accurate than previous works. The proposed working model can help
reduce treatment costs by providing an initial diagnosis in time. This model can also be used
as a teaching tool for medical students and as a soft diagnostic tool for physicians. The
accuracy and scalability of this prediction model can both be improved further in a number
of ways.
Future prospects:
These results are promising because they may introduce novel means of assessing patient
health and neurological diseases using voice data. Given the high accuracy achieved by the
models with these short audio clips, there is reason to believe that denser feature sets with
spoken words, video, or other modalities would aid disease prediction and clinical validation
of diagnosis in the future. In future investigations [11], researchers will also use handwriting
samples from individuals diagnosed with PD.
[1] Alatas Bilal, Moradi Shadi, Tapak Leili, Afshar Saeid (2022), "Identification of
Novel Noninvasive Diagnostics Biomarkers in the Parkinson's Diseases and
Improving the Disease Classification Using Support Vector Machine", BioMed
Research International, Hindawi
[2] P. Raundale, C. Thosar and S. Rane (2021), "Prediction of Parkinson's disease and
severity of the disease using Machine Learning and Deep Learning algorithm," 2021 2nd
International Conference for Emerging Technology (INCET), pp. 1-5, doi:
10.1109/INCET51464.2021.9456292.
[3] F. Cordella, A. Paffi and A. Pallotti (2021), "Classification-based screening of
Parkinson's disease patients through voice signal," 2021 IEEE International Symposium on
Medical Measurements and Applications (MeMeA), pp. 1-6,
doi:10.1109/MeMeA52024.2021.9478683.
[4] Ali, L., Chakraborty, C., He, Z. et al. (2022), "A novel sample and feature
dependent ensemble approach for Parkinson's disease detection".
Neural Comput & Applic. https://doi.org/10.1007/00521-022-07046-2
[5] F. Huang, H. Xu, T. Shen and L. Jin (2021), "Recognition of Parkinson's Disease
Based on Residual Neural Network and Voice Diagnosis,"2021 IEEE 5th Information
[11] C. Ricciardi et al., "Machine Learning can detect the presence of Mild cognitive
impairment in patients affected by Parkinson's Disease," 2020 IEEE International
Symposium on Medical Measurements and Applications (MeMeA), 2020, pp.
1-6, doi:10.1109/MeMeA49120.2020.9137301.
[12] X. Yang, Q. Ye, G. Cai, Y. Wang and G. Cai, (2022), "PD-ResNet for
Classification of Parkinson’s Disease from Gait," in IEEE Journal of Translational
Engineering in Health and Medicine, vol. 10, pp. 1-11, 2022, Art no. 2200111, doi:
10.1109/JTEHM.2022.3180933.
[13] A. U. Haq et al., "Feature Selection Based on L1-Norm Support Vector Machine
and Effective Recognition System for Parkinson’s Disease Using Voice Recordings,"
in IEEE Access, vol. 7, pp. 37718-37734, 2019, doi: 10.1109/ACCESS.2019.2906350.
APPENDIX
APPENDIX 1:
The voice signal dataset [15] is shown by displaying some of its rows and columns with
head(5), which lists all attributes of the data with their values as floats. This dataset
comprises a range of biomedical voice measurements from 31 individuals, of whom 23
have PD. Each column in the table is a specific voice measure, and each row corresponds to
one of the voice recordings from these individuals ("name" column). The "status" column is
set to 0 for healthy individuals and 1 for PD.
a) What are some common vocal features used for Parkinson's disease classification?
b) How can SVM be used to classify Parkinson's disease patients based on vocal features?
c) What is the role of the kernel function in SVM for Parkinson's disease classification?
d) How can LR be used to classify Parkinson's disease patients based on vocal features?
e) What is the difference between SVM and LR for Parkinson's disease classification?
APPENDIX 3:
Solutions:
a) Some common vocal features used for Parkinson's disease classification include pitch,
intensity, jitter, shimmer, and harmonics-to-noise ratio (HNR).
b) SVM can be used to classify Parkinson's disease patients based on vocal features by
training a model on a dataset of labeled vocal recordings. The SVM model can then be used
c) The kernel function implicitly maps the input data into a higher-dimensional space
where a linear decision boundary can be found, without computing the mapped coordinates
explicitly. Common kernel functions used for Parkinson's disease classification include
linear, polynomial, and radial basis function (RBF) kernels.
e) SVM and LR are both supervised learning algorithms used for classification tasks. SVM
works by finding the optimal hyperplane that separates the two classes in the feature space,
while LR models the relationship between the features and the probability of having the
disease. SVM is generally more robust to outliers and can handle high-dimensional data
better than LR, while LR is faster and easier to interpret.
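As a supplement to answer c), the three kernel functions named there can be written out directly. This is a minimal sketch: the parameter defaults (degree, coef0, gamma) and the two input vectors are arbitrary values chosen for illustration, not settings used in the project.

```python
import math


def linear_kernel(x, z):
    """Plain dot product: K(x, z) = x . z"""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """K(x, z) = (x . z + coef0) ** degree"""
    return (linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


x = [1.0, 2.0]
z = [2.0, 0.0]
print(linear_kernel(x, z))      # 2.0
print(polynomial_kernel(x, z))  # 9.0
print(rbf_kernel(x, x))         # 1.0
print(rbf_kernel(x, z))         # exp(-2.5), about 0.082
```

Each function returns a similarity score between two feature vectors. An SVM uses only these scores, which is why the higher-dimensional mapping never has to be computed explicitly.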