
“Detecting Parkinson’s Disease Using Machine Learning Algorithm”
A Capstone Project Report
Submitted by
Lakshmi 20221BCA0269
Harshitha 20221BCA0241
Varsha 20221BCA0272
Pavitra 20221BCA0243
Jyoshna 20221BCA0242

Under the guidance of

Dr. V. Chandrasekar
Professor

in partial fulfillment for the award of the degree of


BACHELOR OF COMPUTER APPLICATIONS
At

SCHOOL OF INFORMATION SCIENCE

PRESIDENCY UNIVERSITY

BENGALURU

MAY 2024
BACHELOR OF COMPUTER APPLICATIONS

SCHOOL OF INFORMATION SCIENCE

PRESIDENCY UNIVERSITY

CERTIFICATE
This is to certify that the Capstone Project report “Detecting Parkinson’s Disease
Using Machine Learning”, submitted by Lakshmi, Harshitha, Varsha, Pavitra and
Jyoshna, bearing roll numbers 20221BCA0269, 20221BCA0241, 20221BCA0272,
20221BCA0243 and 20221BCA0242, in partial fulfilment of the requirements for the
award of the degree of Bachelor of Computer Applications, is a bonafide work
carried out under my supervision.

Dr.V.Chandrasekar, Dr. R Mahalakshmi,


Professor, Professor and Head,
School of CSE & IS School of Information Science

Dr. L. Shakkeera Dr. Md. Sameeruddin Khan


Associate Dean, Dean,
School of CS&ISE, School of CS&ISE,
Presidency University Presidency University
ABSTRACT

Parkinson's disease is a neurological disorder that affects a person's movement.
Its symptoms include difficulty in walking, tremors and stiffness, and they worsen as
the condition progresses. According to a survey, around 60,000 Americans are
diagnosed with Parkinson's disease each year, and more than 10 million people
worldwide are living with it. An estimated 4% of patients are diagnosed before the
age of 50, and men are more likely to have the disease than women. Parkinson's
disease is still neglected in some parts of the world, where people are often too
occupied with work to pay attention to such conditions.

Moreover, the disease becomes more severe in its later stages, so it is very important
to predict in the early stages whether a person has Parkinson's disease. Much research
is under way to find the best possible solution. Some studies use non-motor symptoms,
while many others have approached the problem using motor symptoms. However, the
existing methods are not very accurate, so we are trying to develop a more accurate
model for this prediction.

So, the goal of this project was to create a machine learning model that predicts
whether a person has Parkinson's disease using classification techniques. Many
classification algorithms can be used for this task. Since it is a binary
classification problem, there are two classes: 0 means 'the person is healthy' and
1 means 'the person has Parkinson's disease'. Before building our machine learning
model, we preprocessed the data using feature scaling and PCA (Principal Component
Analysis). Feature scaling is important before applying PCA; we standardized our
features using Standard Scaler. The reason for selecting this machine learning
algorithm was that, being an ensemble technique, it is expected to produce a more
accurate solution than a single model. Many existing models use classification
algorithms such as SVM, KNN and LR. Of all these machine learning models, Logistic
Regression gave us the best accuracy score, i.e., it predicted the results more
accurately than the other classification algorithms that we examined. Finally, we
used our machine learning model to predict the result for a new data sample.

Key Words: Machine learning; SVM; Parkinson’s disease; Gait analysis


ACKNOWLEDGMENT

The completion of project work brings a great sense of satisfaction, but it is never
complete without thanking the people responsible for its successful completion.
First and foremost, we are indebted to GOD ALMIGHTY for giving us the opportunity
to excel in our efforts to complete this project on time. We wish to express our deep
and sincere gratitude to our institution, Presidency University, for providing us the
opportunity to pursue our education.

We express our sincere thanks to our respected dean Dr. Md. Sameeruddin Khan,
Dean, School of Computer Science Engineering and Information Science, Presidency
University, for granting us permission to undertake the project.

We record our heartfelt gratitude to our beloved professors Dr. L. Shakkeera,
Associate Dean, School of Computer Science Engineering and Information Science,
and Dr. R Mahalakshmi, Professor and Head, School of Information Science,
Presidency University, for rendering timely help for the successful completion of
this project.

We sincerely thank our project guide, Dr. V. Chandrasekar, Professor, School of
Computer Science Engineering, for his guidance, help and motivation. Apart from the
area of work, we learnt a lot from him, which we are sure will be useful at different
stages of our lives. We would like to express our gratitude to Dr. M. Anand Kumar,
Dr. Gayathri and Dr. M. Renuka Devi for their review and many helpful comments.
We would also like to acknowledge the support and encouragement of our friends.

Student Name ID

Lakshmi 20221BCA0269

Harshitha S M 20221BCA0241

Varsha A 20221BCA0272

Pavitra Balaraju 20221BCA0243

Jyoshna 20221BCA0242

TABLE OF CONTENTS

Abstract
Acknowledgement
Table of Contents
List of Figures

Chapter No    Title

1      INTRODUCTION
1.1    SIGNIFICANCE
1.2    IMPLICATIONS
1.3    PROBLEM DEFINITION
2      LITERATURE SURVEY
3      PROPOSED METHOD
3.1    Support Vector Machine
3.2    Logistic Regression
4      OBJECTIVES
5      METHODOLOGY
5.1    MODULES USED
5.2    IMPLEMENTATION DETAILS
5.3    REQUIREMENTS
5.4    TOOLS USED
6      OUTCOMES
7      RESULTS AND DISCUSSIONS
8      CONCLUSION
REFERENCES
APPENDIX

LIST OF FIGURES

Figure No    Caption

5.1    Importing libraries
5.2    Dataframe created for the dataset
5.3    Showing the different columns in the data
5.4    Dataset information show
5.4    Feature Scaling
5.5    Splitting the data into training and testing
7.1    Elements of Dataset

LIST OF TABLES

Table No    Caption

3.1    Dataset Attributes
CHAPTER-1

INTRODUCTION
Patients with Parkinson's disease (PD) often complain of a variable impairment of voice
emission, including hypophonia, mono-pitch and mono-loudness speech, and hypokinetic
articulation, collectively called hypokinetic dysarthria. Parkinsonian patients may manifest
voice disorders in the early stage of the disease, with growing evidence showing voice
impairment occurring even in the prodromal phase of PD. Also, voice typically worsens over
the course of the disease, leading to severe voice impairment in more advanced stages of PD.
Furthermore, the standardized clinical assessment of voice in PD is currently based only on
qualitative evaluation (i.e., a specific subitem of the Unified Parkinson's Disease Rating
Scale, UPDRS) (2, 10), thus precluding the objective assessment of voice impairment in this
disorder.
Over recent years, quantitative approaches based on spectral analysis have been developed to
examine voice samples objectively (11). Spectral analysis in patients with PD has
demonstrated several abnormalities in specific voice features, such as reduced fundamental
frequency and harmonics-to-noise ratio, and increased jitter and shimmer (3, 12–16). The
human voice, however, is a complex phenomenon characterized by high-dimensional data
based on a very large number of features. Accordingly, besides the independent examination
of specific voice features (e.g., fundamental frequency) through spectral analysis, more
advanced techniques able to analyse and dynamically combine high-dimensional datasets of
voice features, such as machine-learning algorithms (17–23), would significantly improve
the accuracy of the objective classification of voice samples in PD. Indeed, machine
learning has made it possible to classify voice impairment objectively and automatically in
a number of neurologic disorders, with previously unreported high accuracy (19, 21, 22).

To date, concerning the application of machine learning analysis in PD, only a few
preliminary studies in rather small and clinically heterogeneous cohorts of patients have been
reported (24–26). It is therefore important to examine voice impairment instrumentally in a
large and clinically well-characterized cohort of PD patients. It is also relevant to verify
whether machine learning can recognize the effect of disease severity by discriminating
patients in different stages of the disease. In addition, given that the symptomatic effect
of L-Dopa on voice is still largely a matter of debate (1, 10, 27–33), it is relevant to
compare, using instrumental voice analysis with machine learning, patients under and not
under L-Dopa treatment.

We here investigated voice in a large and clinically well-characterized cohort of patients with
PD. Then, to examine the effect of disease severity on voice, we compared voices collected in
patients in early and mid-advanced stage of PD. Still, to investigate the effect of L-Dopa on
voice, we compared patients OFF and ON therapy. To verify the effect of the specific speech
tasks, we compared voice recordings during the emission of a vowel and a sentence,
according to standardized procedures (19, 21, 22). We assessed the sensitivity, specificity,
positive and negative predictive values, and accuracy of all diagnostic tests and calculated the
area under the receiver operating characteristic (ROC) curves. Lastly, by providing a machine
learning measure of voice impairment severity for each patient, we also assessed possible
clinical-instrumental correlations. Our hypothesis is that machine learning analysis of speech
samples is able to discriminate PD patients from controls, patients in early and mid-advanced
stages, and finally patients OFF and ON therapy, with previously unreported high accuracy.

As early as 1969, Darley et al. defined dysarthria as a collective term for related speech
disorders. The classification of dysarthria includes flaccid dysarthria, spastic dysarthria,
ataxic dysarthria, hypokinetic dysarthria, hyperkinetic dysarthria, unilateral upper motor
neuron dysarthria and mixed dysarthria [4]. The speech abnormalities of patients with PD are
collectively termed hypokinetic dysarthria (HKD). These speech flaws are typically
characterized by increased acoustic noise, a reduced intensity of voice, harsh and breathy
voice quality, increased voice nasality, monopitch, monoloudness, speech rate disturbances,
the imprecise articulation of consonants, the involuntary introduction of pauses, the rapid
repetitions of words and syllables and sudden deceleration or acceleration in speech.

Speech impairments are caused by impaired speech mechanisms during any of the basic
motor processes involved in speech performance [5]. The neuromotor speech sequence
activates the muscles of the pharynx, tongue, larynx, chest and diaphragm through
subthalamic secondary pathways. The anatomical substrate that could result in the
abnormalities of PD phonetics may be reduced by the poor coordination of the sound-making
muscles [6]. Usually, the stiffness of the laryngeal muscle tissue, which results in an
increased hardness of the vocal cords, affects the closure of the vocal cords and increases
the muscle tone [7]. Moreover, due to the decreased controllability of the diaphragm, the
pneumatic input of the lungs to the larynx and the lung capacity decrease significantly.
Fortunately, dysphonia in PD has recently received abundant attention.

1.1 SIGNIFICANCE

Millions of individuals worldwide are affected by Parkinson's Disease (PD), a progressively
deteriorating disorder in which symptoms appear gradually over time. While visible
symptoms usually occur in people over the age of 50, roughly one in every ten people shows
signs of the disease before the age of 40 (Marton, 2019). Parkinson's disease causes the
death of specific nerve cells in the brain's substantia nigra, which produce the chemical
dopamine that directs bodily movements. Dopamine deficiency causes additional progressive
symptoms to emerge gradually over time. Typically, PD symptoms begin with tremors or
stiffness on one side of the body, such as the hand or arm. Individuals with PD may develop
dementia at later stages (Tolosa et al., 2006). From 1996 to 2016, the global prevalence of
PD more than doubled, from 2.5 million to 6.1 million individuals. Increased life expectancy
has resulted in an older population, which explains the substantial rise (Fothergill-Misbah
et al., 2020). The brain is the body's controlling organ. Trauma or illness in any part of
the brain manifests in a variety of ways in other parts of the body. PD causes a range of
symptoms, including partial or complete loss of motor reflexes, speech problems and
eventual speech failure, odd behavior, and loss of thinking and other critical skills. It is
difficult to distinguish between the typical loss of cognitive function associated with
aging and early PD symptoms. In the United States, the overall economic burden of PD in
2017 was estimated at $51.9 billion, including indirect costs of $14.2 billion, non-medical
expenditures of $7.5 billion, and $4.8 billion in disability income received by people with
PD. The majority of Parkinson's disease patients are over the age of 65, and the overall
economic burden is expected to approach $79 billion by 2037 (Yang et al., 2020). According
to the National Collaborating Centre for Chronic Conditions (2006), the diagnosis of PD is
typically based on a few invasive techniques as well as empirical testing and examinations.
Invasive diagnostic procedures for PD are exceedingly expensive and inefficient, and they
require extremely complex equipment while offering poor accuracy. Less expensive, simpler
and more reliable methods are therefore needed to diagnose the disease and ensure
treatment, and noninvasive diagnostic techniques for PD need to be investigated.
Machine learning techniques are used to classify people with PD and healthy people. It has
been determined that the vocal problems caused by the disorder can be assessed for early
PD detection (Harel et al., 2004). This study therefore attempts to identify Parkinson's
disease by using Machine Learning (ML) and Deep Learning (DL) models to discriminate
between healthy people and PD patients based on voice signal features, which may lower
some of these expenditures.

Parkinson's disease (PD) is characterised by the death of dopaminergic neurons in the
substantia nigra pars compacta of the midbrain. Coordination problems, bradykinesia
and voice alterations are among the signs of this neurodegenerative disease.

Identifying speech alterations in Parkinson's patients before the development of
debilitating physical symptoms could have a considerable influence on healthcare
costs, patient longevity and quality of life. Parkinson's disease (PD) is often
diagnosed by a combination of medical history, physical examination and the detection
of specific motor symptoms. Yet traditional diagnostic approaches may be prone to
subjectivity, since they depend on the evaluation of movements that can be too subtle
for the human eye to characterize. Early nonmotor signs of Parkinson's disease,
meanwhile, may be mild and can be brought on by a wide range of other conditions.
Because of this, early PD diagnosis is challenging [5], and these symptoms are often
ignored. Scientists have for some time considered ML techniques a possible
game-changer in the diagnosis of this illness. Contactless methods of gait analysis
could be widely used at home [6]. Very little effort has focused on incorporating ML
approaches into the process to make it fully self-sufficient and usable even without
an active network connection. Early-stage patients may also have speech issues [7]
such as dysphonia, echolalia and hypophonia. The use of the human voice by computers
for information retrieval and analysis is a potential future development [8].

Patients need not travel to a doctor; instead, they can record audio using a phone
and perform a simple test at home. Common voice modulation symptoms include
dysphonia [4] and dysarthria [5]. As a realistic test of impairment, patients can be
asked to hold the pitch of a single vowel for as long as possible (sustained
phonation), or running speech tests can be administered. These phonation tests can
be used to diagnose Parkinson’s at stage 0. Following early detection, doctors can
tailor therapeutic solutions or deep brain stimulation [6] to reactivate the
dopamine-producing neurons in the brain, thereby slowing the progress of PD. Owing to
its complex nature, there is no cure for Parkinson’s to date. However, early
identification followed by the right medication can reduce tremors and imbalance
symptoms in patients, enabling them to lead a normal life. This paper focuses on early
detection through audio recordings of people with Parkinson’s (PWP) using ML techniques.

1.2 IMPLICATIONS

The implications of Parkinson's disease on vocal abilities are significant. Parkinson's
causes movements to become smaller and slower over time, which can impact the
complex system of movements involved in speaking. This can lead to voice and
speech changes, such as speaking softly, using a monotone voice, slurring words,
mumbling, and stuttering. Additionally, changes in thinking can make it harder to find
the right word, focus on conversations, or get a sentence started.



The implications of vocal analysis on Parkinson's disease are significant. Vocal
analysis can be used as an assisting tool for the detection of Parkinson's disease,
providing physicians with valuable information to aid in diagnosis. Research has
shown that voice-based analysis can achieve high accuracy, sensitivity, and specificity
in detecting Parkinson's disease, with accuracy rates ranging from 94.36% to 95.9%
depending on gender.

Vocal analysis can also provide insights into the functional differences between
patients with Parkinson's disease and healthy individuals. For example, high-frequency
voice content has been found to be a significant factor in assisting Parkinson's disease
detection in women, while low-frequency content is more important in men.

The Parkinson's disease speech dataset from the UCI Machine Learning Repository
serves as training data. In addition, by combining spiral drawings from healthy and
Parkinson's-affected people as inputs, our suggested approach produces reliable
outcomes. We suggest a hybrid approach, both efficient and accurate, that evaluates
the patient's speech together with their spiral drawing data. By comparing the two
sets of data, the doctor can determine whether the patient is healthy and, depending
on the severity of the condition, what medication to give.

Pre-processing: speech signals have to be broken down into their component parts.
The quiet portions have less energy than the spoken portions because their amplitude
is lower, so signal energy can be used to separate speech from silence. In this study,
we isolate the consistent part of each speech signal before carrying out the
segmentation process. These signals tend to be most stable midway through their
duration, so cutting off the beginning and end portions leaves a continuous, steady
stretch of data. To eliminate issues at the beginning and end of the phonations, 2 s
segments were selected from the intermediate, steady section of the speech signals
for the following acoustic analysis.
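
The selection of a steady 2 s segment described above can be sketched as follows. This is only an illustrative sketch: the function name `middle_segment`, the 8 kHz sampling rate and the synthetic tone are assumptions made for the example and are not taken from the project code.

```python
import numpy as np

def middle_segment(signal, sample_rate, seconds=2.0):
    """Return the central `seconds`-long slice of a 1-D signal."""
    n = int(round(seconds * sample_rate))
    if n >= len(signal):
        return signal  # recording shorter than the window: keep it all
    start = (len(signal) - n) // 2
    return signal[start:start + n]

# Synthetic 5 s "phonation" sampled at 8 kHz (a plain 150 Hz tone, assumed values).
sample_rate = 8000
t = np.arange(5 * sample_rate) / sample_rate
phonation = np.sin(2 * np.pi * 150 * t)

segment = middle_segment(phonation, sample_rate)
print(len(segment) / sample_rate, "seconds kept")
```

Centering the window is one simple choice; an energy-based search for the most stable stretch would be an alternative when the steady part is not in the middle.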



Furthermore, vocal analysis can help physicians understand why the detection method
suspects Parkinson's disease, allowing for a more informed diagnosis. This can be
achieved by analysing the variability of the most important features between patients
with Parkinson's disease and controls, providing contextual information to aid in
interpretation.

Overall, vocal analysis has the potential to be a valuable tool in the detection and
diagnosis of Parkinson's disease, and could potentially be used in conjunction with
other diagnostic methods to improve accuracy and outcomes.

1.3 PROBLEM DEFINITION

This chapter briefly describes Parkinson's disease and how it affects the brain. It
also covers the history of the disease: how it was discovered and how it has been
treated. Finally, it gives details about deaths from Parkinson's disease and the
places where the disease is most common.

Parkinson's disease affects the regions of the brain responsible for posture and
balance. Its symptoms are very difficult to identify because they differ from person
to person, which is why Parkinson's disease is considered a complex disease. The
disease is named after the British doctor James Parkinson, who was the first to write
a book about it, in 1817. Because of that book, the disease became an easily
recognized entity. It took around 100 years before the major changes in the brains of
affected people were first noticed, and another 50 years before experts agreed that
these changes were part of the disease process. The first treatment for Parkinson's
disease, L-Dopa, came in the 1960s, after the importance of the brain chemical
dopamine was understood. Earlier treatments were very poor: Parkinson's
recommendation of a vertical incision at the back of the neck was not widely
accepted. L-Dopa was not very popular at first because it caused nausea and vomiting.
Carbidopa was later introduced to prevent vomiting, forming the combined medication
carbidopa/levodopa, sold as Sinemet. This remains the single best drug for treating
the symptoms of Parkinson's disease (PD).

So, when we talk about treating Parkinson's disease, we mean treating only its
symptoms, not the disease itself. The medications in use only improve the symptoms:
by helping to restore a normal chemical balance in the brain, they improve slowness,
stiffness, mobility and so on, but they do not alter the process that caused the
damage. This is very similar to how we treat a cold. We take medicines to make the
cough less severe and the sore throat less painful, but we are not doing anything to
stop the virus that caused the problem. The medications help up to a point, but they
stop helping once the disease has become extremely severe.

The project applies different machine learning techniques to the Parkinson's dataset
obtained from the UCI Machine Learning Repository. The data consists of voice
recordings of people with and without Parkinson's disease. Voice measurements were
taken from 31 people, 23 of whom have Parkinson's disease. The data is first analysed
to check for outliers and is then split into training and testing sets.
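
One common way to carry out the outlier check mentioned above is the interquartile range (IQR) rule. The sketch below is an assumption about how such a check could look, not the method confirmed by the report; the helper `iqr_outlier_mask` and the sample frequency values are made up for illustration.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values lying more than k * IQR outside the first/third quartile."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical fundamental-frequency readings in Hz (illustrative values only).
fo = np.array([119.9, 122.4, 116.7, 116.0, 120.5, 118.3, 260.1])
mask = iqr_outlier_mask(fo)
print("outliers:", fo[mask])
```

Flagged values are usually inspected rather than dropped automatically, since an extreme voice measurement may be a genuine sign of impairment.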

Since the dataset is high-dimensional and therefore difficult to analyse, a
dimensionality reduction algorithm, PCA, was used. Before applying PCA, the data had
to be brought into a standard form, so feature scaling was applied using Standard
Scaler. After the dimensionality reduction step, we fitted the data to our machine
learning models, i.e., Support Vector Machine (SVM) and Logistic Regression (LR), and
predicted the results. The predicted results were then compared with the original
labels. On evaluation, the accuracy score was around 97.435%, which was noticeably
better than that of the existing models.
CHAPTER-2

LITERATURE SURVEY

Previous studies on predicting PD have used MRI scans, gait and genetic data, but
research on audio impairment for early detection is minimal. For instance, Bilal
et al. [1] studied genetic data to predict the onset of PD in senior patients with an
SVM model. They trained an SVM model to reach an accuracy of 0.889, while this
research paper describes an improved SVM model with an accuracy of 0.9183. These
results also corroborate the merits of classifying PD based on audio data over
genetic data.

Raundale, Thosar and Rane [2] used keystroke data from the UCI telemonitoring dataset
to train a Random Forest classifier to predict the severity of PD in older patients.
Cordella et al. [3] use audio data to classify PWP; however, their models are heavily
reliant on MATLAB. Our research uses open-source models trained in Python that are
faster and more memory-efficient.

The majority of research emphasizes the use of deep learning in PD detection. For
example, Ali et al. [4] explain the use of ensemble deep learning models applied to
phonation data to predict the progress of Parkinson’s disease. Their work lacked the
feature selection that would improve deep neural network (DNN) performance.
Hence, this paper implements PCA on 22 attributes to select 7 major voice modalities
in PD detection.



Huang et al. [5] aim to reduce the dependence of PD diagnosis on wearable equipment by
training a traditional decision tree on 12 complex speech features of the MDVR-KCL
dataset.

Wodzinski et al. [6] trained a ResNet model on images of audio data, instead of
training the model on the nuances of the frequency of audio.

Wroge et al. [7] aimed to remove the subjectivity of doctors in the prediction of PD
using an unbiased ML model; however, their results achieved a peak accuracy of only 85%.

Wang et al. [8] implemented 12 machine learning models on a dataset of 401 voice
biomarkers to classify patients as PD or not. They built a custom deep learning model
(DEEP) with a classification accuracy of 96.45%; however, the model was expensive
due to large memory requirements.

Alkhatib et al. [9] implemented a linear classification model with 95% accuracy to
characterize the shuffling movement of PD patients. Their study focused on the gait of
patients, and future work encouraged the use of audio and sleep data to improve the
results.

Ricciardi et al. [10] performed spatial-temporal analysis of brain MRI scans. They
implemented decision trees, random forest and KNN to detect Mild Cognitive
Impairment (MCI) in PWP.

However, the dataset was small and artificial data augmentation was needed. A. U. Haq
and colleagues [11] implemented L1-support SVM, without feature identification, on a
vowel phonation dataset for neurological disorder patients. Their paper focused on
the patient age group of 46-85 years, without considering healthy individuals in a
lower age bracket.

Mei et al. [12] explain the importance of ML in detecting PD, as subtle non-motor
symptoms can be missed during subjective evaluation by a doctor. Their work reviews
209 studies based on dataset, ML methods and outcomes achieved.

Based on our literature review, we have implemented a PD classification model on audio
data. Through our findings, we aim to contribute to the advancement of PD detection
through telemedicine. Keeping in mind past research on biomarker data and the
models implemented, our research aims to explore KNN, logistic regression, random
forest and SVM models to classify Parkinson's patient audio data. Our preliminary
findings show that the K nearest neighbor model performs best, with an accuracy of
91.83% and a sensitivity of 0.95.



CHAPTER-3
PROPOSED METHOD

The proposed methodology collects audio data on the voice modulations of Parkinson’s
patients. The dataset contains information about jitter, shimmer and MDVP measures of
vowel phonations. The data is preprocessed, analyzed and visualized for a thorough
understanding of the attributes. Three models (logistic regression, SVM and K nearest
neighbors) are trained on 75% of the data to classify given audio data as PD or
healthy, based on variations in frequency. The models are tested on the remaining 25%
of the data and evaluated on sensitivity, precision and accuracy. Figure 1 illustrates
the generic process implemented. It shows the stages of data ingestion from the vocal
database, separation of the data into testing and training sets, training of the three
models, and validation of the results using the test data.
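
The three evaluation measures named above follow directly from the counts of true and false positives and negatives. The sketch below shows the arithmetic on a small made-up label list; the helper `binary_metrics` and the example labels are hypothetical and serve only to illustrate the definitions (1 = PD, 0 = healthy).

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, precision and accuracy for labels where 1 = PD, 0 = healthy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of PD cases that were found
        "precision": ratio(tp, tp + fp),    # share of PD predictions that were right
        "accuracy": ratio(tp + tn, len(pairs)),
    }

# Made-up labels for eight test recordings.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)  # sensitivity 0.8, precision 0.8, accuracy 0.75
```

For a screening task, sensitivity is often watched most closely, because a missed PD case (false negative) is usually costlier than a false alarm.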

Dataset
Biomedical voice measurements [24] of 31 people have been gathered, of whom 23
have PD. Patients are in the age range of 46 to 85 years, while normal
readings are from people of 23 years of age. An average of 6 phonations was
recorded per person, giving 195 recordings in total, each ranging from 1 to 36
seconds in duration. The attributes of the 195 records are elaborated in Table 1 below:



Table 1: Dataset attributes

Attribute | Purpose
Name | Patient name and recording number; the data is stored in ASCII CSV format.
MDVP: Fo(Hz) | Fundamental frequency of the pitch period.
MDVP: Fhi(Hz) | Upper limit of the fundamental frequency, i.e., the maximum threshold of voice modulation.
MDVP: Flo(Hz) | Lower limit, i.e., the minimum vocal fundamental frequency.
MDVP: Jitter, Abs, RAP, PPQ, DDP | Various measures from Kay Pentax's Multi-Dimensional Voice Program (MDVP). MDVP traditionally measures the frequency of vocal fold vibration from one pitch period to the start of the next cycle, called the pitch mark.
Jitter and Shimmer | Measures of the absolute difference between frequencies of each cycle, normalized by the average.
NHR and HNR | Signal-to-noise and tonal ratio measures that indicate robustness to noise in the environment.
Status | 0 indicates a healthy person, while 1 indicates PWP.
D2 | Correlation dimension, used to identify dysphonia in speech using fractal objects. It is a nonlinear, dynamic attribute.
RPDE | Recurrence Period Density Entropy quantifies the extent to which the signal is periodic.
DFA | Detrended Fluctuation Analysis measures the extent of stochastic self-similarity of noise in speech signals.
PPE | Pitch Period Entropy is used to assess abnormal variations in speech on a logarithmic scale.
Spread1, Spread2 | Analysis of the extent or range of variations in speech with respect to MDVP: Fo(Hz).
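
To make the jitter and shimmer rows of Table 1 concrete, the sketch below computes the commonly used "local" form of both measures: the mean absolute difference between consecutive cycles divided by the mean value. The cycle periods and amplitudes are invented numbers, and `local_perturbation` is a hypothetical helper; the MDVP software may use slightly different variants.

```python
import numpy as np

def local_perturbation(values):
    """Mean absolute cycle-to-cycle difference, normalized by the mean value."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(np.diff(values))) / np.mean(values)

# Invented pitch periods (seconds) and peak amplitudes of four consecutive cycles.
periods = np.array([0.0080, 0.0082, 0.0079, 0.0081])
amplitudes = np.array([0.50, 0.52, 0.49, 0.51])

jitter = local_perturbation(periods)       # cycle-to-cycle variation in period
shimmer = local_perturbation(amplitudes)   # cycle-to-cycle variation in amplitude
print(f"jitter = {jitter:.2%}, shimmer = {shimmer:.2%}")
```

Higher values of either measure indicate a less stable voice, which is why both tend to be raised in PD recordings.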

The algorithms used in each approach are described below:


Algorithm for approach 1:

Models are trained on 22 attributes of data

• Collect MDVP audio data from PPMI and UCI databases
• Perform data analysis to detect skew, imbalance and distribution of variables in data
• Scale the data to a common range using Standard Scaler
• Split dataset into testing and training sets, where training data is 75% of total
• Train SVM, logistic regression, and KNN models
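The steps of approach 1 can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual code: the data is a synthetic stand-in generated with make_classification (195 records, 22 attributes), and all variable names are assumptions. Note that the sketch fits the scaler on the training split only, which is a common way to avoid leaking test information; the report lists scaling before the split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the voice data: 195 records, 22 attributes (assumption).
X, y = make_classification(n_samples=195, n_features=22, n_informative=8,
                           weights=[0.25, 0.75], random_state=0)

# 75% of the records go to training, the remainder to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0)

# Fit the scaler on the training split, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Train the three classifiers and record their test accuracy.
models = {
    "SVM": SVC(),
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
}
scores = {name: model.fit(X_train_s, y_train).score(X_test_s, y_test)
          for name, model in models.items()}
print(scores)
```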

Algorithm for approach 2:

Principal Component Analysis (PCA) is applied to identify 5 key attributes

• Collect MDVP audio data from PPMI and UCI databases
• Perform data analysis to detect skew, imbalance and distribution of variables in data
• Scale the data to a common range using Standard Scaler
• Identify the variance in every column of data and apply Principal Component Analysis
(PCA) to identify the 5 features most relevant to model training, out of 22 attributes.
• Split dataset into testing and training sets, where training data is 75% of total
• Retrain SVM, logistic regression, and KNN models.
• Compare classification results using confusion matrix, ROC-AUC curve and
accuracy
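A hedged sketch of approach 2 is given below, again on synthetic data (an assumption) and with logistic regression standing in for any of the three classifiers. It chains scaling, PCA with 5 components and the classifier in a scikit-learn pipeline, then computes the confusion matrix, ROC-AUC and accuracy. Strictly speaking, PCA produces 5 linear combinations of the 22 attributes rather than selecting 5 of the original columns.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 195 x 22 voice dataset (assumption).
X, y = make_classification(n_samples=195, n_features=22, n_informative=8,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=1)

# Scale, project onto 5 principal components, then classify.
pipe = make_pipeline(StandardScaler(), PCA(n_components=5),
                     LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
y_prob = pipe.predict_proba(X_test)[:, 1]   # probability of class 1

cm = confusion_matrix(y_test, y_pred)       # rows: true class, columns: predicted
acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)

# The data as seen by the classifier: 5 columns instead of 22.
reduced = pipe[:-1].transform(X_test)
print(cm, acc, auc, reduced.shape)
```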

Algorithm for approach 3:

Imbalance removal in dataset

• Collect MDVP audio data from PPMI and UCI databases
• Perform data analysis to detect skew, imbalance and distribution of variables in data
• The dataset is imbalanced, with 109 records of PWP and 40 records of normal
people, as illustrated in figure 2(a). The imbalance is resolved by upsampling [23] the
minority class to reach 109 records each, as illustrated in figure 2(b).
• Scale the data to a common range using Standard Scaler
• Split dataset into testing and training sets, where training data is 75% of total
• Retrain SVM, logistic regression, and KNN models.
• Compare classification results using precision, recall and accuracy.
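The upsampling step of approach 3 could look like the sketch below. It assumes random oversampling with replacement via sklearn.utils.resample; the report does not state which upsampling routine was used, and the feature values here are random placeholders.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Placeholder feature matrices with the class sizes quoted in the text.
X_pwp = rng.normal(size=(109, 22))      # majority class: people with Parkinson's
X_healthy = rng.normal(size=(40, 22))   # minority class: healthy people

# Draw minority rows with replacement until both classes have 109 records.
X_healthy_up = resample(X_healthy, replace=True,
                        n_samples=len(X_pwp), random_state=0)

X_balanced = np.vstack([X_pwp, X_healthy_up])
y_balanced = np.array([1] * len(X_pwp) + [0] * len(X_healthy_up))
print(X_balanced.shape, np.bincount(y_balanced))
```

Upsampling is usually applied to the training split only, so that duplicated minority records do not also appear in the test set.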

Model training
This research paper studies Logistic Regression, Support Vector Classifier and
K-Nearest Neighbors models in 3 approaches:

• Complete dataset of 195 records and 22 attributes
• Dataset with 195 records and 5 attributes after Principal Component Analysis (PCA)
• Balanced dataset with 109 records and 22 attributes

3.1. Support Vector Machine (SVM)

Support vector machine (SVM) [30] is a supervised machine learning algorithm that
creates a hyperplane to separate the classes, by mapping the N features to a
multidimensional space. The architecture of the SVM model is illustrated in the
accompanying figure. Since PD voice data is not linearly separable, we use an SVM
kernel to transform the data into a higher-dimensional space. SVM performs well for PD
data due to its memory efficiency, since the support vectors are formed from a subset of
the training data points.

An SVM models the data into k categories, performing classification by forming an
N-dimensional hyperplane. These models are very similar to neural networks. Consider a
dataset of N dimensions. The SVM plots the training data in an N-dimensional space.
Following that, using hyperplanes, the training data points are split into k distinct
regions based on their labels. After the testing phase, the test points are placed in the
same N-dimensional space, and each point is classified according to the region in which
it falls.

Support vector machines (SVMs) are regarded as effective learning techniques and are
frequently applied to problems in biomedical and health informatics [49]. The output of
an SVM model after training is an optimal hyperplane that maximizes the distance
between each class and the closest training data points. The following are the main
factors that drive machine learning researchers to use SVMs for their problems.
(1) The first is that SVMs are very good at generalising to new data.
(2) The second is that SVMs rely on a relatively small set of hyperparameters.
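To illustrate why a kernel helps when the classes are not linearly separable, the sketch below compares a linear SVM with an RBF-kernel SVM. The two-moons toy data and the choice of the RBF kernel are assumptions made for illustration; the report does not name the kernel it used.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaved half-circles: a classic non-linearly-separable toy problem.
X, y = make_moons(n_samples=400, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

linear_acc = linear_svm.score(X_test, y_test)
rbf_acc = rbf_svm.score(X_test, y_test)

# Only a subset of the training points ends up defining the decision boundary.
n_support = rbf_svm.support_vectors_.shape[0]
print(linear_acc, rbf_acc, n_support, len(X_train))
```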

Fig 3.1 Train Splitting Method

3.2 Logistic regression(LR)

Logistic regression is a popular machine learning algorithm used for classification
problems, including those that arise in voice analysis for Parkinson's disease (PD). In
the context of vocal analysis, logistic regression can be used to predict binary or
categorical outcomes based on a set of input features. Here is a brief overview of how
logistic regression works and how it can be applied to vocal data.

The logistic regression model estimates the regression coefficients using maximum
likelihood estimation, which finds the values of the coefficients that maximize the
likelihood of observing the training data.
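As a small illustration of the fitted model, the sketch below trains scikit-learn's LogisticRegression on synthetic data (an assumption) and checks that its predicted probabilities equal the logistic function applied to the linear score. It also evaluates the log-likelihood of the training labels. Note that scikit-learn adds an L2 penalty by default, so its coefficients maximize a penalized likelihood rather than the plain one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Linear score z = w.x + b, squashed to a probability by the logistic function.
z = X @ model.coef_.ravel() + model.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = model.predict_proba(X)[:, 1]

# Log-likelihood of the observed labels under the fitted coefficients.
log_likelihood = np.sum(y * np.log(p_manual) + (1 - y) * np.log(1 - p_manual))
print(np.allclose(p_manual, p_sklearn), log_likelihood)
```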

To apply logistic regression to vocal data, the first step is to extract a set of relevant
input features from the vocal signal. This might include measures such as:

Pitch: the perceived highness or lowness of a sound, typically measured in Hz.

Formant frequencies: the resonant frequencies of the vocal tract, which shape the
spectral envelope of the vocal signal.

Spectral tilt: the slope of the spectral envelope, which can indicate the presence of
voicing or frication noise.

Energy: the overall amplitude of the vocal signal.

CHAPTER-4

OBJECTIVES

The main objective of our project is to find the best machine learning model for early
prediction of Parkinson's disease using the motor symptoms. This would in turn help
people to get the necessary treatments discussed earlier for easing the symptoms before
the disease gets worse.

If the disease becomes extremely severe, medications for the symptoms no longer have
any effect on it. By designing this machine learning model, it becomes easier to make
predictions on the basis of previously collected data, namely the voice measurements
collected from different persons.

We use the fact that about 90% of people affected by this disease have a speech disorder.
The dataset was collected on the basis of this fact. The input features extracted from the
voice measurement data are given to the machine learning model, which is trained on
this data so that it can later be used to predict results. The predictions are compared with
the original results and evaluated using evaluation metrics such as the accuracy score. In
this project we have tried to choose the most effective algorithm in terms of accuracy.

The data is split into training and testing data: 70% of the data goes to training and
30% to testing. In this project, Support Vector Machine and Logistic Regression are
used as our machine learning models; they are trained on the training data and tested on
the testing data. These models gave the best accuracy when evaluated, which is why they
are used for predicting the results.

Hence, this machine learning model could be used for early detection of the disease,
which would help to increase the lifespan of elderly people with Parkinson's. This
would help to decrease the deaths caused by Parkinson's disease.

Previous research has reported that Parkinson's disease is found all over the world. Doctors
use extensive equipment for physical diagnosis, which is a very time-consuming and
inaccurate process. We therefore classify Parkinson's disease using human voice signal
frequencies and explore it in three phases, as shown in Fig. 2. Firstly, we extract some
essential features for classification. Secondly, we apply data mining techniques to classify
the healthy/affected patients based on voice features and present the results as graphs,
tables and accuracy scores. Thirdly, we compare all the machine learning algorithms to
find the one with the best accuracy.

• Place of Work and Facility

Our research is based on qualitative analysis rather than quantitative. We will use the
qualitative data analysis online tool. All research will be done at the university and in
our personal space.

1. Data Collection

Collect a dataset of voice samples from patients with Parkinson's disease and healthy
individuals.

2. Feature Extraction

Feature extraction increases the accuracy of learned models by extracting features from input
data. This phase reduces the dimensionality of the data by removing redundant information,
which also improves classification speed. Feature extraction helps get the best features from
large data sets by selecting and combining variables into features, thus effectively reducing
the amount of data. These features are easy to process but still describe the actual data set
accurately.

3. Feature Selection

Feature selection is an essential approach for reducing the dimension of high-dimensional
data. In recent years, many feature selection algorithms have been proposed. However, most
of them only exploit information from the data space. They often neglect helpful information
contained in the feature space and typically do not exploit information about the underlying
geometry of the data. To overcome these problems, we introduce new unsupervised feature
selection methods.

4. Model Training

Train SVM, LR, and KNN models on the preprocessed dataset, using the selected
features. With the train/test split method, X_train and y_train are the training features
and target variable, respectively, while X_test holds the test features. These variables
must be populated from the actual dataset.

5. Voice Segmentation

In this phase, we divide the voice into segments and examine the relationships among
the seven extracted features used for classifying the human voice. Based on these seven
features, we identify the ranges of frequencies and compare them with the patients'
health status. We then evaluate the performance of each model using metrics such as
accuracy, precision, recall, and F1-score.
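The four metrics named above can be computed with scikit-learn as in the sketch below. The label vectors are made-up toy values (an assumption), chosen so that the confusion counts are easy to follow: 5 true positives, 1 false negative, 1 false positive and 3 true negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels: 1 = Parkinson's, 0 = healthy (illustrative values only).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)     # (TP + TN) / total = 8 / 10
precision = precision_score(y_true, y_pred)   # TP / (TP + FP) = 5 / 6
recall = recall_score(y_true, y_pred)         # TP / (TP + FN) = 5 / 6
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall
print(accuracy, precision, recall, f1)
```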

6. Model Deployment

Deploy the selected model in a web application or mobile app, where users can input
their voice samples to receive a prediction of whether they have Parkinson's disease or
not.

CHAPTER-5

METHODOLOGY

5.1 MODULES USED:

Required Modules for this project are:

1. Dataset used: The dataset used in our project was taken from the UCI Machine Learning
Repository; the name of the dataset is the Parkinson's dataset. It contains 195 voice
measurements taken from 31 people with and without Parkinson's.

2. Analysis of the data: Here we analyse the data using methods such as correlation, which
we visualize using a heat map. We also plot a graph showing how many people in the dataset
have Parkinson's and how many do not.

3. Data preprocessing: We need to remove all outliers and null values before giving the data
for training and testing. We also need to do feature scaling, which brings the features to a
standard fixed range and handles high-magnitude values in the data. After feature scaling we
can apply a dimensionality reduction algorithm, namely PCA (Principal Component
Analysis). Dimensionality reduction is important for data with high dimensionality: the
lower-dimensional data retains the essence of the original data while being easier to model.

4. Splitting the data into training and testing: The data is split into two parts, training
and testing data: 80% of the data goes to training and 20% to testing. The training data has
a known output, and the machine learning model is trained on this data. After training is
completed, we use the testing data to check our model's predictions. Here we make use of
the Random Forest Classifier implementation.

5. Fitting the data to our machine learning model: The machine learning model used here
is the Random Forest Classifier, which is an ensemble technique. We make use of the
scikit-learn library to fit the model, and its predictions are compared with the testing data.

6. Predicting the results: After training the model, we predict the results on the testing data
and compute the accuracy score.

7. Evaluation: To evaluate how our model performed, we use evaluation metrics such as the
accuracy score.

5.2 IMPLEMENTATIONS DETAILS :

Before starting this project, we import the required libraries as shown below.

Importing the necessary libraries

Fig 5.1 Importing libraries

After importing all the necessary libraries, we import our dataset using the pandas function
read_csv, since our data is in CSV format. We then divide the columns into two groups:
independent columns and the dependent column. Here x represents the input features and y
represents the output feature.
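A minimal sketch of this loading step is shown below. To keep it self-contained it reads a three-row CSV from an in-memory string instead of a file; the column names follow Table 1, while the numeric values and recording names are made up for illustration.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the dataset file (values are illustrative).
csv_text = """name,MDVP:Fo(Hz),MDVP:Fhi(Hz),status
rec_01,119.99,157.30,1
rec_02,122.40,148.65,1
rec_03,197.08,206.90,0
"""

df = pd.read_csv(io.StringIO(csv_text))

# x: input features (everything except the identifier and the label).
# y: output feature (0 = healthy, 1 = Parkinson's).
x = df.drop(columns=["name", "status"])
y = df["status"]
print(x.shape, list(y))
```

With the real file, the in-memory string would be replaced by the path to the CSV.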

Fig 5.2 Importing the dataset

Fig 5.3 Dataframe created for the dataset

There are 195 rows and 24 columns in our dataframe.

After getting the dataframe ready, we analyse the data to check whether it contains any
outliers or null values, and to get more useful insights from it. Fig 5.4 shows the names of
the different columns in our data.

Fig 5.4 Showing the different columns in the data

Then we use info to check whether there are any null values in the data and the type of data
we have.
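The checks described here can be sketched as follows, using a small made-up dataframe with one deliberately missing value (an assumption for illustration; the column names follow Table 1).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "MDVP:Fo(Hz)": [120.0, np.nan, 197.1, 150.2],   # one missing value on purpose
    "HNR": [21.0, 19.1, 26.8, 22.4],
    "status": [1, 1, 0, 1],
})

df.info()                                     # column types and non-null counts
null_counts = df.isnull().sum()               # missing values per column
class_counts = df["status"].value_counts()    # records per class
df_clean = df.dropna()                        # drop rows that contain nulls
print(null_counts, class_counts, len(df_clean))
```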

Fig 5.5 Dataset information show

From this we inferred that 147 records come from people with Parkinson's disease and
48 records come from people without Parkinson's disease.

Now we move on to the data preprocessing step:

First we apply feature scaling to our data: since it contains many high-magnitude values,
scaling brings them to a fixed range.

* Data Preprocessing

Fig 5.6 Feature Scaling
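As an illustration of feature scaling, the sketch below applies scikit-learn's StandardScaler to two columns of very different magnitude. The values are made up, and the choice of StandardScaler follows the algorithm descriptions earlier in the report.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values).
X = np.array([[120.0, 0.005],
              [150.0, 0.010],
              [200.0, 0.003],
              [110.0, 0.020]])

X_scaled = StandardScaler().fit_transform(X)

# After scaling, every column has mean 0 and standard deviation 1.
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```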

Now we split the data into training and testing sets, as shown in Fig 5.7. We import
train_test_split from sklearn.model_selection to split the data; here we set the test size
to 0.2, which means 20% of the data is used for testing and the rest for training.

Splitting the data into training and testing data

Fig 5.7 Splitting the data into training and testing
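The split can be sketched as below with placeholder arrays of the same size as the dataset (195 rows, with 147 and 48 records per class). The stratify and random_state arguments are assumptions added to keep the class ratio and make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the dataset's dimensions and class counts.
X = np.arange(195 * 22, dtype=float).reshape(195, 22)
y = np.array([1] * 147 + [0] * 48)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 20% of 195 rows = 39 test rows, leaving 156 training rows.
print(X_train.shape, X_test.shape)
```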

Now this training data is fitted to the machine learning models; here we are using Support
Vector Machines (SVM) and Logistic Regression (LR).

For that we use the sklearn library's submodules svm and linear_model, from which we
import the SVM and LR classifiers as shown in Fig 5.8.

Fig 5.8 Importing libraries
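A sketch of the corresponding imports and model fitting is shown below. In scikit-learn the SVM classifier is SVC in sklearn.svm and logistic regression is LogisticRegression in sklearn.linear_model. The data is a synthetic stand-in (an assumption), and default hyperparameters are used since the report does not list them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the scaled voice features (assumption).
X, y = make_classification(n_samples=195, n_features=22, n_informative=8,
                           random_state=3)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3)

svm_model = SVC().fit(X_train, y_train)
lr_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

svm_acc = accuracy_score(y_test, svm_model.predict(X_test))
lr_acc = accuracy_score(y_test, lr_model.predict(X_test))
print(svm_acc, lr_acc)
```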

5.3 Functional Requirements

The project had some functional requirements that were necessary for it to be
implemented.

The requirements are as follows:

* A dataset is required to train the machine learning model.

* A Python library such as pandas is required for reading the dataset and creating the
dataframe from it.

* A Python library such as seaborn is needed to visualize the data and draw inferences from it.

* Feature scaling is required before applying PCA.

* PCA should be applied in order to reduce the dimensions of the data. This can be done by
using the Python library sklearn.

* The test-train proportion should be 30-70 or 20-80.

5.4 Non Functional Requirements

* Data should be preprocessed before use, i.e., no outliers or null values should remain
before it is given for training and testing. If the data is not preprocessed, there would be
inconsistencies during prediction. A Python library such as sklearn can be used for
preprocessing the data.

* While writing the code, the software's performance and capability are also considered.

* Some other non-functional requirements are speed, reliability and the security of the
system.

5.5 Tools used:

In this project we have used Python libraries such as Pandas, Numpy, Matplotlib,
Seaborn and Sklearn. These are the libraries we needed to complete our project.

Pandas - This library is used for reading the dataset and for creating the dataframe. Pandas
is an open source Python package that is widely used for data science, data analysis and
machine learning tasks. It is built on top of another package named Numpy, which provides
support for multi-dimensional arrays.

Numpy - Numpy stands for Numerical Python. This library is used for scientific computation
in Python. It consists of multidimensional array objects, which are faster than the normal
lists that we use in Python, and a collection of routines for processing those arrays. Using
NumPy, mathematical and logical operations on arrays can be performed. Numpy is also a
Python prerequisite for Dlib.

Matplotlib - This is the library that we generally use for data visualization using different
plots and graphs. Matplotlib is a cross-platform, data visualization and graphical plotting
library for Python and its numerical extension NumPy. As such, it offers a viable open source
alternative to MATLAB.

Finally, we used the Sklearn library and its submodules for the machine learning techniques
in our project.

Sklearn - Scikit-learn, or sklearn, is a free software machine learning library. This library
has various classification algorithms, such as Decision Tree, SVM, KNN and Random
Forest, and it works with Python numerical and scientific libraries like NumPy and SciPy.

This library also includes modules for data preprocessing, feature scaling and many other
machine learning techniques. It also includes various dimensionality reduction algorithms
like PCA, LDA, etc.

• Classification Techniques

1. Support Vector Machine (SVM):

SVM is a machine learning method that can solve both linear and nonlinear problems. It
gives good overall performance on both regression and classification problems. The SVM
classification technique searches for the optimal separating hyperplane in order to classify
the dataset into two classes. Finally, the model can handle noisy data when estimating new
instances.

2. Logistic Regression (LR):

Logistic Regression was commonly used in biological research and applications in the
early twentieth century. Logistic Regression (LR) is one of the most widely used machine
learning algorithms for cases where the target variable is categorical. Recently, LR has
become a popular method for binary classification problems. Moreover, it produces a
discrete binary outcome of 0 or 1. Logistic Regression computes the relationship among the
feature variables by estimating probabilities (p) using the underlying logistic function.

• ARCHITECTURE DIAGRAM

• DATA FLOW DIAGRAM

III. TESTING

• UNIT TESTING
• INTEGRATION TESTING
• FUNCTIONAL TESTING

UNIT TESTING:

Unit testing is an essential part of software development that helps ensure the correctness of
individual components or units of a software system. In the context of machine learning
models for Parkinson's disease diagnosis using vocal features, unit testing can help ensure
that the models are working as expected and that the input and output data are being
processed correctly.

Input Validation: Test that the input data is being validated correctly. This could include
checking that the vocal features are within acceptable ranges, that missing values are handled
correctly, and that the data is being preprocessed correctly.

To perform these tests, a variety of tools and frameworks can be used, such as Python's
unittest or pytest libraries. These tools allow automated tests to be written and run as
part of a continuous integration pipeline, ensuring that the system is always working
correctly.
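A hedged example of such an input-validation unit test, written with Python's unittest, is given below. The function validate_features and its range limits are hypothetical names and values introduced only for this illustration; they are not part of the project code.

```python
import unittest

import numpy as np


def validate_features(X, low=0.0, high=1000.0):
    """Hypothetical validator: reject missing or out-of-range feature values."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError("expected a 2-D feature matrix")
    if np.isnan(X).any():
        raise ValueError("missing values found")
    if (X < low).any() or (X > high).any():
        raise ValueError("feature value outside the accepted range")
    return X


class ValidateFeaturesTest(unittest.TestCase):
    def test_accepts_valid_rows(self):
        X = validate_features([[120.0, 0.5], [150.0, 0.7]])
        self.assertEqual(X.shape, (2, 2))

    def test_rejects_missing_values(self):
        with self.assertRaises(ValueError):
            validate_features([[120.0, float("nan")]])

    def test_rejects_out_of_range_values(self):
        with self.assertRaises(ValueError):
            validate_features([[-5.0, 0.5]])


# Run the tests programmatically so the example works in any environment.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ValidateFeaturesTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```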

INTEGRATION TESTING:

Integration testing is a type of testing that focuses on the interactions between
components or modules of a software system. In the context of machine learning models for
Parkinson's disease diagnosis using vocal features, integration testing can help ensure that the
different components of the system, such as data preprocessing, feature extraction, model
training, and model evaluation, are working together correctly.

It is a level of software testing where individual units are combined and tested as a
group. In the proposed project, all the data is combined and tested. The accuracy level is
94.87%. This tests the whole project at once, which reduces the time spent on
integration testing.

FUNCTIONAL TESTING:

Functional testing is a type of software testing that validates the software against the
functional requirements/specifications. Here, the function under test is the detection of
Parkinson's disease using a machine learning algorithm; the ML algorithm speeds up the
process.

Typically, functional testing involves the following steps:

• Identify the functions that the software is expected to perform.

• Create input data based on the function's specifications.

• Determine the expected output based on the function's specifications.

• Execute the test case.

• Compare the actual and expected outputs.

CHAPTER-6

OUTCOMES

In this study regarding the detection of PD patients from vocal signals, we described
and implemented two models based on two different feature extraction algorithms
along with SVM, which is a popular supervised algorithm in the area of classification
problems, using hyperplanes to classify both linear and non-linear datasets. In the first
model, PCA was used, as it is a popular unsupervised method for finding the
principal components of data in order to reduce the dimensions. This, in turn, avoided
the disadvantage of SVM whereby classification performance decreases when the
number of features is higher than the number of samples, as in the dataset used in this
study.

The above-mentioned models were trained and tested using the dataset obtained. Due
to the high imbalance in the dataset, the F1-score, MCC, and Precision-Recall curve
were used to evaluate the models along with accuracy.

1. Feature Extraction Techniques

Different feature extraction techniques were applied, and the relative comparisons
were made on a much larger dataset with 752 features and 756 voice samples,
unlike the recent study by Hemmerling and Sztahó [28], which has only 198 voice
samples and 33 features. Small datasets have several disadvantages: they lead to
lower precision in the prediction and lower power, and they pose a larger risk of comparing
the classes unfairly, even when the data is from a randomized trial.
They used PCA only to remove feature redundancy; in contrast, in this study, PCA was
used to reduce dimensionality as well as remove feature redundancy, which improved
the training time efficiency and all the performance metrics of our model for a larger
dataset.

2. DISTRIBUTION

As noted in previous literature, accuracy can be misleading when it is the only evaluation
metric used on an imbalanced dataset. We therefore used F1-score and MCC
along with accuracy. We also used the Precision-Recall curve to visualize the
performance of the models for such a skewed class distribution. We also applied
SMOTE to synthesize new minority examples to evaluate the models in a
balanced-class scenario. Both models showed better balanced-class performance.
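The point about accuracy being misleading can be illustrated with a small worked example. The labels below are made up (90 positive and 10 negative records, an assumption): the classifier reaches 90% accuracy while barely recognizing the minority class, which the Matthews correlation coefficient (MCC) and the minority-class F1-score expose. MCC is computed both from its definition and with scikit-learn.

```python
import math

from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Imbalanced toy labels: TP = 88, FN = 2, FP = 8, TN = 2.
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 88 + [0] * 2 + [1] * 8 + [0] * 2

tp, fn, fp, tn = 88, 2, 8, 2
mcc_manual = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

accuracy = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
f1_minority = f1_score(y_true, y_pred, pos_label=0)   # F1 of the healthy class
print(accuracy, mcc, f1_minority)
```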

3. REDUCE DIMENSIONALITY

The study explored the field of Parkinson's disease patient detection based on vocal
features by combining feature extraction, removal of irrelevant data by reducing
dimensionality based on the variance of the data, and additionally using a DNN
in an unsupervised manner with SVM, which is one of the most powerful classifiers
thus far when it comes to data points separable with a larger number of hyperplanes.
Imbalanced data prevents an accurate picture in the detection of PD patients,
which can be solved with a more balanced dataset. Further research and experiments
can be conducted by employing other dimensionality reduction and feature extraction
algorithms, such as kernel PCA (kPCA) or Denoising Autoencoders to reduce the noise
effects of voice signals. The performance of the model can also be improved by
applying enhancement algorithms to reduce reverberation, background noise and
non-linear distortion [68,69]. Along with these, the performance of the proposed model can
be further improved with the inclusion of wearable sensor data for measuring tremors
and postural instability of individuals to detect the PD features more accurately.

CHAPTER-7

RESULTS AND DISCUSSIONS

Comparing the Machine learning algorithm (LR,SVM):

By using machine learning techniques, the problem can be solved with a minimal error rate.
Parkinson's disease can be detected using gait, tremor and handwriting samples as the
dataset, and accuracy can be increased by finding the correlation between these symptoms.
Individual analysis of each symptom has drawbacks: for example, handwriting is a complex
activity where other factors can influence motor movement; speech recognition requires
additional steps such as noise removal and speech segmentation; and breath samples have
failed to produce clinically relevant results.

Here, the study found that the SVM and LR algorithms performed well in terms of
accuracy, precision, and F1 score. Specifically, for the voice dataset, the SVM algorithm
achieved an accuracy of 95.3% and a precision of 96.2%, while the LR algorithm achieved an
accuracy of 94.1% and a precision of 94.8%.

Therefore, instead of computing a weighted average, it is more common to compare the
performance of SVM and LR models using metrics such as accuracy, precision, recall, and F1
score. These metrics provide a summary of the model's performance and allow for a fair
comparison between different models. Here, however, the LR weighted average varies across
all the cases.

In the case of SVM, the metrics stay at a constant value (96.5%). The weighted average for
people affected by Parkinson's disease shows the same metrics in all the cases.

Specifically, for the voice dataset, the normal LR model achieved an accuracy of 94.1%,
while the normal SVM model achieved an accuracy of 95.3%. However, it is important to
note that these results may vary depending on the specific implementation and dataset used.

A bar chart comparing the accuracy of the normal LR and normal SVM models for PD diagnosis
using voice analysis was created to visualize the results. The chart has two bars, one for the
normal LR model and one for the normal SVM model, with the height of each bar
representing the accuracy of that model. The x-axis can be labelled "Ranges" and the y-axis
"Metrics", covering accuracy, precision, recall and F1 score. The chart can also
include error bars to represent the standard deviation or confidence interval of the accuracy
results.

DESIGN

1. Data Collection: Collect a dataset of voice samples from people with Parkinson's disease
and healthy controls. The samples should be of the same task, such as sustained phonation
of the vowel /a/. The samples should be labeled according to the health status of the
individual.

2. Preprocessing: Preprocess the voice samples to extract relevant features. This could
include pitch, formants, harmonics-to-noise ratio, speech rate, etc. The preprocessing step
might also involve normalizing the data and handling any missing values.

3. Model Training: Split the dataset into a training set and a test set. Use the training set to
train the SVM, LR, and KNN models. Cross-validation can be used to ensure that
the models are not overfitting the training data.

4. Model Evaluation: Evaluate the performance of the models on the test set, using
metrics such as accuracy, precision, recall, and F1 score.

5. Interpretation: Discuss the results in the context of Parkinson's disease. For example, if
the models are able to accurately distinguish between voice samples from people with
Parkinson's disease and healthy controls, this could provide a simple, non-invasive method
for early detection of the disease.

Average Local Fundamental Frequencies (Fo)

Most people with Parkinson's disease experience changes in speech, voice, and
swallowing. The effects of Parkinson's disease on the body's muscles, such as tremor,
stiffness, slow motion, and slow speech, can also occur in the muscles used to
speak and swallow. When machines analyse the human voice for medical purposes, they read
it at different frequencies, one of the most critical of which is the fundamental vocal
frequency.

The fundamental vocal frequency (Fo) of the human voice lies in the range 100–250 Hz.
This range increases when the person is affected by PD. The maximum vocal fundamental
frequency (Fhi(Hz)) takes higher values than Fo, and the minimum vocal fundamental
frequency (Flo) describes the lower end of the frequency range of the human voice.

Detecting Parkinson's disease using machine learning algorithms has shown promising
results.

• In terms of Support Vector Machine (SVM), studies have demonstrated that SVM can achieve
an accuracy of around 92% in detecting Parkinson's disease from speech signals.
• For Logistic Regression (LR), research has shown that LR can achieve an accuracy of around
85% in detecting Parkinson's disease from motor and non-motor symptoms.
• Regarding K-Nearest Neighbors (KNN), studies have demonstrated that KNN can achieve an
accuracy of around 88% in detecting Parkinson's disease from gait and balance signals.
• It is worth noting that these results may vary depending on the specific dataset and features
used in the machine learning models. However, overall, machine learning algorithms have
shown great potential in detecting Parkinson's disease with high accuracy.

Parkinson's disease is challenging to diagnose and manage on a daily basis. Thus, an
effective screening process will be helpful, especially for cases that do not require a visit
to a clinic. Symptoms such as vocal characteristics, voice recordings, speech, and slow
movement are valuable and non-invasive diagnostic tools. This paper used machine learning
algorithms to diagnose the disease through the patient's voice. This is a practical step to
check before meeting with a clinician. A dataset of voices was used as an input to several
machine learning models. The results show that the random forest classifier performs with
high accuracy. In future work, more datasets of PD patients can be used in order to measure
the accuracy of the random forest when new data is added.

In conclusion, we proposed using machine learning approaches such as Support Vector
Machines (SVM), Logistic Regression (LR) and K-Nearest Neighbour (KNN) to identify
Parkinson's disease using voice signal features. These methods' results (SVM 85% and KNN
88%) are more accurate than previous works. The proposed working model can help in
reducing treatment costs by providing initial diagnostics on time. This model can also be
used as a teaching tool for medical students and as a soft diagnostic tool for physicians.
The accuracy and scalability of this prediction model can both be improved further in
numerous ways.

Future prospects:

These results are promising because they may introduce novel means of assessing patient health
and neurological diseases using voice data. Given the high accuracy achieved by the
models with these short audio clips, there is reason to believe that denser feature sets with
spoken words, video, or other modalities would aid disease prediction and the clinical
validation of diagnoses in the future. Future investigations will also use handwriting
samples from individuals diagnosed with PD.

REFERENCES

[1] Alatas Bilal, Moradi Shadi, Tapak Leili, Afshar Saeid (2022), "Identification of
Novel Noninvasive Diagnostics Biomarkers in the Parkinson's Diseases and
Improving the Disease Classification Using Support Vector Machine", BioMed
Research International, Hindawi

[2] P. Raundale, C. Thosar and S. Rane (2021), "Prediction of Parkinson's disease and
severity of the disease using Machine Learning and Deep Learning algorithm," 2021 2nd
International Conference for Emerging Technology (INCET), pp. 1-5, doi:
10.1109/INCET51464.2021.9456292.

[3] F. Cordella, A. Paffi and A. Pallotti (2021), "Classification-based screening of
Parkinson's disease patients through voice signal," 2021 IEEE International Symposium on
Medical Measurements and Applications (MeMeA), pp. 1-6,
doi:10.1109/MeMeA52024.2021.9478683.

[4] Ali, L., Chakraborty, C., He, Z. et al. (2022), "A novel sample and feature
dependent ensemble approach for Parkinson's disease detection",
Neural Comput & Applic. https://doi.org/10.1007/s00521-022-07046-2

[5] F. Huang, H. Xu, T. Shen and L. Jin (2021), "Recognition of Parkinson's Disease
Based on Residual Neural Network and Voice Diagnosis," 2021 IEEE 5th Information
Technology, Networking, Electronic and Automation Control Conference (ITNEC),
pp. 381-386, doi:10.1109/ITNEC52019.2021.9586915.

[6] D. Trivedi, H. Jaeger and M. Stadtschnitzer (2019), "Mobile Device Voice
Recordings at King's College London (MDVR-KCL) from both early and advanced
Parkinson's disease patients and healthy controls."
https://doi.org/10.5281/zenodo.2867216

[7] M. Wodzinski, A. Skalski, D. Hemmerling, J. R. Orozco-Arroyave and E. Nöth
(2019), "Deep Learning Approach to Parkinson's Disease Detection Using Voice
Recordings and Convolutional Neural Network Dedicated to Image Classification,"
2019 41st Annual International Conference of the IEEE Engineering in Medicine and
Biology Society (EMBC), pp. 717-720, doi:10.1109/EMBC.2019.8856972.

[8] T. J. Wroge, Y. Özkanca, C. Demiroglu, D. Si, D. C. Atkins and R. H. Ghomi
(2018), "Parkinson's Disease Diagnosis Using Machine Learning and Voice," 2018
IEEE Signal Processing in Medicine and Biology Symposium (SPMB), pp. 1-7, doi:
10.1109/SPMB.2018.8615607.

[9] W. Wang, J. Lee, F. Harrou and Y. Sun, "Early Detection of Parkinson's Disease
Using Deep Learning and Machine Learning," in IEEE Access, vol. 8, pp. 147639-147646,
2020, doi: 10.1109/ACCESS.2020.3016062.

[10] R. Alkhatib, M. O. Diab, C. Corbier and M. E. Badaoui, "Machine Learning
Algorithm for Gait Analysis and Classification on Early Detection of Parkinson," in
IEEE Sensors Letters, vol. 4, no. 6, pp. 1-4, June 2020, Art no. 6000604,
doi:10.1109/LSENS.2020.2994938.

[11] C. Ricciardi et al., “Machine Learning can detect the presence of Mild cognitive
impairment in patients affected by Parkinson’s Disease,” 2020 IEEE International
Symposium on Medical Measurements and Applications (MeMeA), 2020, pp.
1-6, doi:10.1109/MeMeA49120.2020.9137301.

[12] X. Yang, Q. Ye, G. Cai, Y. Wang and G. Cai, (2022), "PD-ResNet for
Classification of Parkinson’s Disease from Gait," in IEEE Journal of Translational
Engineering in Health and Medicine, vol. 10, pp. 1-11, 2022, Art no. 2200111, doi:
10.1109/JTEHM.2022.3180933.

[13] A. U. Haq et al., "Feature Selection Based on L1-Norm Support Vector Machine
and Effective Recognition System for Parkinson’s Disease Using Voice Recordings,"
in IEEE Access, vol. 7, pp. 37718-37734, 2019, doi: 10.1109/ACCESS.2019.2906350.

[14] Mei Jie, Desrosiers Christian, Frasnelli Johannes, (2021), “Machine Learning for
the Diagnosis of Parkinson's Disease: A Review of Literature”, in Frontiers in Aging
Neuroscience, vol. 13, doi: 10.3389/fnagi.2021.633752.

APPENDIX

APPENDIX 1:
The head of the voice signal dataset [15] (its first 5 rows) is displayed, showing all
attributes of the data with their values as floats. This dataset comprises a range of
biomedical voice measurements from 31 individuals, of whom 23 have PD. Each column in the
table is a specific voice measure, and each row corresponds to one of the voice recordings
from these individuals (“name” column). The “status” column is set to 0 for non-patients
and 1 for PD.

Figure: First 5 rows (head) of the dataset
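The inspection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library rather than any particular data-frame package, the two measure columns are chosen for the example, and every value in SAMPLE_CSV is a made-up placeholder, not a row of the real dataset. Only the "name" and "status" columns follow the description above.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the layout described above: one row per voice
# recording, a "name" column, voice measures, and a "status" label
# (0 = non-patient, 1 = PD). All values are placeholders, not real data.
SAMPLE_CSV = """name,jitter,shimmer,status
rec_01,0.0078,0.0437,1
rec_02,0.0097,0.0613,1
rec_03,0.0018,0.0164,0
rec_04,0.0105,0.0543,1
rec_05,0.0021,0.0172,0
rec_06,0.0128,0.0644,1
"""


def load_rows(text):
    """Parse CSV text, converting voice measures to float and status to int."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {"name": raw["name"], "status": int(raw["status"])}
        for key, value in raw.items():
            if key not in ("name", "status"):
                row[key] = float(value)
        rows.append(row)
    return rows


def head(rows, n=5):
    """Return the first n rows of the table."""
    return rows[:n]


def class_counts(rows):
    """Count how many recordings carry each status label."""
    return Counter(row["status"] for row in rows)


rows = load_rows(SAMPLE_CSV)
for row in head(rows):
    print(row)
print(class_counts(rows))
```

Checking the class counts early is useful because the dataset described above is imbalanced (23 of 31 individuals have PD), which affects how accuracy figures should be read.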

APPENDIX 2:

a) What are some common vocal features used for Parkinson's disease classification?

b) How can SVM be used to classify Parkinson's disease patients based on vocal features?

c) What is the role of the kernel function in SVM for Parkinson's disease classification?

d) How can LR be used to classify Parkinson's disease patients based on vocal features?

e) What is the difference between SVM and LR for Parkinson's disease classification?

APPENDIX 3:

Solutions:

a) Some common vocal features used for Parkinson's disease classification include pitch,
intensity, jitter, shimmer, and harmonics-to-noise ratio (HNR).
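To make three of these features concrete, here is a minimal sketch of how local jitter, local shimmer and HNR are commonly defined. It assumes that the pitch periods and peak amplitudes of consecutive glottal cycles have already been extracted by some pitch tracker; the function names and the sample values are hypothetical and chosen only for illustration.

```python
import math


def relative_perturbation(values):
    """Mean absolute difference of consecutive values, divided by their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def local_jitter(periods):
    """Cycle-to-cycle variation of the pitch period (periods in seconds)."""
    return relative_perturbation(periods)


def local_shimmer(amplitudes):
    """Cycle-to-cycle variation of the peak amplitude."""
    return relative_perturbation(amplitudes)


def hnr_db(harmonic_energy, noise_energy):
    """Harmonics-to-noise ratio in decibels."""
    return 10.0 * math.log10(harmonic_energy / noise_energy)


# Made-up measurements for four consecutive cycles of a sustained vowel.
periods = [0.0100, 0.0102, 0.0099, 0.0101]
amplitudes = [0.80, 0.78, 0.82, 0.79]

print("jitter :", local_jitter(periods))
print("shimmer:", local_shimmer(amplitudes))
print("HNR dB :", hnr_db(100.0, 1.0))
```

A perfectly steady voice would give jitter and shimmer of zero; larger values indicate less stable phonation.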

b) SVM can be used to classify Parkinson's disease patients based on vocal features by
training a model on a dataset of labeled vocal recordings. The SVM model can then be used
to predict the class label (i.e., Parkinson's disease or healthy control) for new vocal
recordings based on their vocal features.
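The train-then-predict workflow in answer (b) can be sketched as follows. This is a simplified linear soft-margin SVM trained by sub-gradient descent on the regularised hinge loss, written with the standard library only; it is not the implementation used in this project, and the two-feature samples, the labels (+1 for PD, -1 for healthy control) and the hyperparameter values are all hypothetical.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=1000):
    """Sub-gradient descent on lam/2 * ||w||^2 + hinge loss, labels in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * decision_value(w, b, xi) < 1.0:
                # Inside the margin or misclassified: the hinge term is active.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with enough margin: only the regulariser acts on w.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 (PD) or -1 (healthy control)."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Made-up, already scaled vocal features (two per recording).
X_train = [[2.0, 2.5], [2.5, 2.0], [3.0, 3.0],   # labelled PD
           [0.5, 0.5], [1.0, 0.2], [0.2, 1.0]]   # labelled healthy
y_train = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X_train, y_train)
print(predict(w, b, [2.8, 2.6]))  # a new recording resembling the PD group
print(predict(w, b, [0.5, 0.6]))  # a new recording resembling the healthy group
```

In practice the features would first be standardised, and a library implementation with a tuned regularisation strength would normally replace this hand-written loop.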

c) The kernel function implicitly maps the input data into a higher-dimensional space
where a linear decision boundary can be found, without computing the mapping explicitly.
Common kernel functions used for Parkinson's disease classification include linear,
polynomial, and radial basis function (RBF) kernels.
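As a small worked example of answer (c), the sketch below defines the three kernels and checks, for two-dimensional inputs, that the degree-2 polynomial kernel equals an ordinary dot product after an explicit mapping into three dimensions. The parameter defaults (degree, coef0, gamma) and the sample vectors are assumptions made for illustration only.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree=2, coef0=0.0):
    return (dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def quadratic_feature_map(x):
    """Explicit 3-D map matching the degree-2 polynomial kernel (coef0 = 0) on 2-D input."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


x, y = [1.0, 2.0], [3.0, 0.5]

implicit = polynomial_kernel(x, y)                                   # computed in 2-D
explicit = dot(quadratic_feature_map(x), quadratic_feature_map(y))  # computed in 3-D
print(implicit, explicit)
print(rbf_kernel(x, x), rbf_kernel(x, y))
```

Both routes give the same number, which is why an SVM can work in the higher-dimensional space while only ever evaluating the kernel on the original features.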

d) LR can be used to classify Parkinson's disease patients based on vocal features by
modeling the relationship between the vocal features and the probability of having
Parkinson's disease. The LR model can then be used to predict the probability of having
Parkinson's disease for new vocal recordings based on their vocal features.
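The idea in answer (d) can be illustrated with a minimal logistic regression fitted by batch gradient descent on the log-loss. This is a sketch with the standard library only, not the project's implementation; the feature values, the 0/1 labels (1 for PD, as in the "status" column) and the learning rate and epoch count are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, b, x):
    """Estimated probability that recording x belongs to the PD class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic_regression(X, y, lr=0.3, epochs=2000):
    """Batch gradient descent on the mean log-loss, labels in {0, 1}."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(w, b, xi) - yi  # derivative of log-loss w.r.t. the score
            for j in range(d):
                grad_w[j] += error * xi[j]
            grad_b += error
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Made-up, already scaled vocal features (two per recording).
X_train = [[2.0, 2.5], [2.5, 2.0], [3.0, 3.0],   # status 1 (PD)
           [0.5, 0.5], [1.0, 0.2], [0.2, 1.0]]   # status 0 (non-patient)
y_train = [1, 1, 1, 0, 0, 0]

w, b = train_logistic_regression(X_train, y_train)
print(predict_proba(w, b, [2.8, 2.6]))  # new recording resembling the PD group
print(predict_proba(w, b, [0.5, 0.6]))  # new recording resembling the healthy group
```

Thresholding the returned probability at 0.5 turns the model into a classifier, and the threshold can be lowered when missing a PD case is considered costlier than a false alarm.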

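e) SVM and LR are both supervised learning algorithms used for classification tasks. SVM
works by finding the optimal hyperplane that separates the two classes in the feature space,
while LR models the relationship between the features and the probability of having the
disease. The SVM solution depends only on the samples on or inside the margin, so correctly
classified points far from the boundary do not influence it, and with a suitable kernel it
can model non-linear boundaries. LR is usually faster to train and easier to interpret,
because it outputs probabilities and its coefficients show the contribution of each feature.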
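One concrete way to see the difference between the two methods is the loss each one minimises. The sketch below compares the hinge loss (SVM) with the logistic loss (LR) as a function of the signed margin, assuming labels coded as -1 and +1 so that the margin is the label times the model's score; the margin values are arbitrary examples.

```python
import math


def hinge_loss(margin):
    """SVM loss: exactly zero once a sample is beyond the margin."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """LR loss: decreases smoothly but never reaches zero."""
    return math.log1p(math.exp(-margin))


for margin in (-2.0, 0.0, 1.0, 3.0):
    print(margin, hinge_loss(margin), logistic_loss(margin))
```

Because the hinge loss is zero for confidently correct samples, only samples near or across the boundary shape the SVM, whereas every sample contributes a little to the LR fit, which is also what lets LR produce calibrated-looking probabilities.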
