Seminar Report

The document discusses the integration of advanced AI techniques, particularly federated learning (FL), in healthcare to enhance diagnosis and treatment while addressing privacy concerns. It highlights the challenges of data access due to regulations like GDPR and the benefits of FL, such as preserving patient privacy and enabling collaborative model training across institutions. The literature survey reviews various studies on FL applications in healthcare, emphasizing its potential for personalized care and compliance with privacy regulations.

Uploaded by

AARUNI RAI

Study of Information Extraction from Unstructured and Multidimensional Big Data

CHAPTER 1
INTRODUCTION
In light of modern AI, various state-of-the-art techniques, including deep learning (DL) and the
Internet of Medical Things (IoMT), have made their way into the healthcare industry. This has
improved the diagnosis and treatment of various conditions such as COVID-19 [1] and autism
spectrum disorder (ASD) [2]. However, existing healthcare AI models are not yet truly
intelligent, and some have been criticized for providing ineffective and unsafe treatment
recommendations [3]. Several factors may have caused these deficiencies. A significant
issue is the difficulty of obtaining sufficient data with complex features that can adequately describe
the patient's symptoms. In addition, with the implementation of rigorous laws such as the United
States Consumer Privacy Bill of Rights and the European Commission's General Data Protection
Regulation (GDPR), which aim to safeguard individuals' privacy [4], AI models can no longer
directly access source data for training purposes. Instead, they must adhere to strict limitations and
regulatory requirements. Federated learning (FL), a distributed AI paradigm aimed at addressing
concerns related to healthcare data privacy and management [5], has emerged as a popular subject
of discussion in recent years [6]. Google first introduced FL in 2015 [7]. Essentially, FL is a distributed
AI methodology that involves training several local models and aggregating them to derive a global
model without the need for data sharing. FL is particularly relevant in the following situations.
Traditional machine learning commonly assumes that data is independently and identically
distributed (IID). In most practical scenarios, however, this assumption is not met. For
instance, each client exhibits its own set of behaviors, so the data it collects is biased and
may differ from that of other participants [8]. The result is non-IID, or heterogeneous,
data.

An unbalanced data distribution occurs when certain participants hold a disproportionate
share of the relevant training data. For example, when the training participants include both
hospitals and individuals, hospitals are likely to have significantly larger sample sizes than
individuals. Additionally, data relevant to the same disease can vary substantially between
hospitals due to differences in equipment, personnel, and other factors. This creates
challenges for machine learning models, especially when they must generalize to new and
diverse datasets.
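
The two situations above (label skew and unequal client sizes) are often simulated in experiments before real institutional data is available. The sketch below is one illustrative way to do this and is not taken from the report: it splits a synthetic label vector across clients using a Dirichlet distribution. The function name `dirichlet_partition`, the concentration parameter `alpha`, and the synthetic labels are all assumptions made for the example.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with label skew.

    Smaller alpha gives more skewed (more strongly non-IID) class shares
    and, as a side effect, more unbalanced client sizes.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Fraction of this class assigned to each client.
        shares = rng.dirichlet(alpha * np.ones(num_clients))
        # Turn the fractions into cut points in the shuffled index list.
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Synthetic data: 4 classes with 250 samples each (hypothetical).
labels = np.repeat(np.arange(4), 250)
parts = dirichlet_partition(labels, num_clients=5, alpha=0.3)
for client, idx in enumerate(parts):
    counts = np.bincount(labels[np.asarray(idx, dtype=int)], minlength=4)
    print(f"client {client}: n={len(idx)}, class counts={counts.tolist()}")
```

Printing the per-client class counts makes both effects visible at once: clients end up with different class mixes and different sample sizes.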

Dept. of CSE, RRCE 2019-2020 Page 1


CHAPTER 2
LITERATURE SURVEY
[1] Federated learning for healthcare informatics

Description: This paper presents a comprehensive examination of federated learning
techniques in the context of healthcare informatics. It delves into the unique challenges and
opportunities that arise when applying federated learning in healthcare settings, particularly
focusing on preserving patient privacy while leveraging distributed data sources for model
training...

Benefits: Privacy Preservation: Federated learning ensures that sensitive patient data remains
on local devices, thereby mitigating privacy concerns associated with centralized data
aggregation.

Data Utilization: By enabling model training on decentralized data sources across multiple
healthcare institutions, federated learning allows for the utilization of diverse datasets without
the need to centralize them.

Collaborative Learning: Healthcare institutions can collaborate on model training without
sharing raw data, fostering collaboration while maintaining data privacy.

Personalized Healthcare: Federated learning facilitates the development of personalized
healthcare models tailored to individual patient needs, leading to improved patient outcomes
and care delivery.

Compliance: The decentralized nature of federated learning helps healthcare institutions
comply with regulations such as HIPAA and GDPR by minimizing data exposure and
ensuring patient data privacy.


[2] Privacy-Preserving Federated Brain Tumor Segmentation


Description: This study focuses on the application of federated learning techniques for brain
tumor segmentation in MRI scans while prioritizing patient privacy. It addresses the
challenges of centralized data sharing in medical imaging and proposes a privacy-preserving
federated learning framework to train segmentation models across multiple healthcare
institutions.

Benefits: Privacy Preservation: By keeping patient data local and only sharing model
updates, federated learning ensures that sensitive MRI scans are not exposed to external
parties, thus preserving patient privacy.

Improved Segmentation Accuracy: The federated learning approach achieves state-of-the-art
segmentation accuracy by leveraging distributed data sources from multiple healthcare
institutions without centralizing the data.

Collaborative Model Training: Healthcare institutions can collaborate on training
segmentation models without sharing raw MRI scans, enabling collaborative research and
model development while protecting patient data.

Regulatory Compliance: The privacy-preserving nature of federated learning helps healthcare
institutions comply with regulations such as HIPAA and GDPR by minimizing data exposure
and ensuring patient data privacy.

Scalability: Federated learning enables scalable model training across a large number of
healthcare institutions, allowing for the development of robust and accurate segmentation
models using diverse datasets.

[3] Towards federated learning at scale: System design

Description: The paper "Towards federated learning at scale: System design" by Bonawitz,
Keith, et al. (2019) addresses the intricate technical challenges inherent in deploying
federated learning systems at scale. It meticulously explores the design considerations
essential for the successful implementation of federated learning across a large number of
devices or institutions. The authors delve into optimizing communication efficiency between

Practical Implementation: The discussion of system design considerations in the paper serves
as a practical guide for implementing federated learning systems in real-world settings,
including healthcare applications. It helps researchers and practitioners navigate the technical
challenges of deploying federated learning at scale and harness its benefits for collaborative
machine learning.

[4] Federated learning for electrocardiogram (ECG) classification: A privacy-preserving model

Description: Text-based approaches for web image information retrieval have been
exploited for many years; however, the noisy textual content of web pages makes their
task challenging. Moreover, text based systems that retrieve information from textual
sources such as image file names, anchor texts, existing keywords and, of course,
surrounding text often share the inability to correctly assign all relevant text to an image
and discard the irrelevant. A novel method for indexing web images is discussed in the
present paper. The main concern of the proposed system is to overcome the obstacle of
correctly assigning textual information to web images, while disregarding text that is
unrelated to them. The proposed system uses visual cues in order to cluster a web page into
several regions and compares this method to the use of semantic information and the
realization of a k-means clustering. The evaluation reveals the advantages and
disadvantages of the different clustering techniques and confirms the validity of the
proposed method for web image indexing.

Benefits: An image indexing system that uses textual information in order to extract the
concept of the images that are found in a web page. The method uses visual cues in order to
identify the segments of the web page and calculates Euclidean distances among these
segments. It delivers a semantic or Euclidean clustering of the contents of a web page in
order to assign textual information to the existing images.

[5] Automatic Feature Extraction for Classifying Audio Data


Description: The study "Privacy-Preserving Federated Brain Tumor Segmentation" by
Rieke N., et al. (2020) focuses on brain tumor segmentation in MRI scans while safeguarding
patient privacy through federated learning. By employing this approach, the research allows
model training to occur locally on individual healthcare institutions' data, preventing
centralized aggregation of sensitive medical information. Benefits include enhanced privacy
protection, ensuring compliance with regulations like HIPAA and GDPR, while achieving
state-of-the-art segmentation accuracy through collaboration. Federated learning enables
scalable model training across multiple institutions, fostering innovation and knowledge
sharing without compromising data security. Consequently, the study's findings not only
advance brain tumor diagnosis and treatment planning but also pave the way for collaborative
research initiatives in healthcare while maintaining patient confidentiality.

Benefits: Privacy Preservation: By adopting federated learning, the study ensures that patient
data remains decentralized and secure, thus mitigating privacy concerns associated with
centralized data aggregation. This approach enhances patient trust and compliance with
healthcare regulations like HIPAA and GDPR.

Improved Accuracy: Leveraging federated learning techniques, the study achieves state-of-
the-art segmentation accuracy by aggregating model updates from diverse datasets. This
collaborative approach enhances the robustness and generalizability of the segmentation
model.

Collaborative Research: The federated learning framework enables healthcare institutions to
collaborate on model training without directly sharing raw MRI scans. This collaborative
research paradigm fosters innovation and knowledge sharing while respecting data privacy.

Scalability: Federated learning facilitates scalable model training across numerous healthcare
institutions, allowing for the development of accurate segmentation models using large and
diverse datasets. This scalability enhances the model's effectiveness in real-world clinical
settings.

Clinical Impact: By advancing brain tumor segmentation accuracy, the study contributes to
improved diagnosis, treatment planning, and patient outcomes in neuro-oncology. The
privacy-preserving federated learning approach ensures that these benefits are achieved
without compromising patient confidentiality.

[6] Towards federated learning at scale: System design


Description: The paper "Federated Learning in Medicine: A Systematic Review" by Wang,
Shuhan, et al. (2021) provides a comprehensive analysis of federated learning applications
across various medical domains. It systematically reviews existing literature to evaluate the
effectiveness, challenges, and future directions of federated learning in medicine.


Benefits: Comprehensive Overview: The systematic review offers a holistic understanding
of the current landscape of federated learning in medicine, including its applications,
methodologies, and outcomes. This comprehensive overview serves as a valuable resource
for researchers and practitioners interested in leveraging federated learning for medical
research and healthcare delivery.

Identification of Challenges: By synthesizing findings from existing studies, the paper
identifies common challenges and limitations associated with federated learning in medical
settings. This insight helps researchers anticipate potential obstacles and design strategies to
address them effectively.

Highlighting Success Stories: The systematic review showcases successful implementations
of federated learning in medicine, demonstrating its potential to improve diagnostic accuracy,
patient outcomes, and healthcare delivery. These success stories serve as inspiration for future
research and adoption of federated learning in clinical practice.

Informing Future Research: By summarizing key findings and gaps in the literature, the paper
informs future research directions and priorities in the field of federated learning in medicine.
Researchers can use this information to design studies that address unanswered questions and
further advance the application of federated learning in healthcare.

Policy Implications: The systematic review may also have policy implications by providing
insights into the regulatory and ethical considerations associated with federated learning in
medicine. Policymakers can use this information to develop guidelines and regulations that
promote the responsible and ethical use of federated learning in healthcare settings.

[7] Depth Extraction from Video Using Non-parametric Sampling


Description: A technique that automatically generates plausible depth maps from videos
using non-parametric depth sampling. We demonstrate our technique in cases where past
methods fail (non-translating cameras and dynamic scenes). Our technique is applicable to
single images as well as videos. For videos, we use local motion cues to improve the
inferred depth maps, while optical flow is used to ensure temporal depth consistency. For
training and evaluation, we use a Kinect-based system to collect a large dataset containing
stereoscopic videos with known depths. We show that our depth estimation technique
outperforms the state-of-the-art on benchmark databases. Our technique can be used to
automatically convert a monoscopic video into stereo for 3D visualization, and we
demonstrate this through a variety of visually pleasing results for indoor and outdoor
scenes, including results from the feature film Charade.

Benefits: Demonstrated a fully automatic technique to estimate depths for videos. Our
method is applicable in cases where other methods fail, such as those based on motion
parallax and structure from motion, and works even for single images and dynamic scenes.
Our depth estimation technique is novel in that we use a non-parametric approach, which
gives qualitatively good results, and our single image algorithm quantitatively outperforms
existing methods.

[8] Privacy-Preserving Federated Learning for Wearable Healthcare Systems: A Survey

Description: The paper "Privacy-Preserving Federated Learning for Wearable Healthcare
Systems: A Survey" by Chen, Lianyong, et al. (2020) investigates the application of federated
learning in wearable healthcare systems. It explores methods for preserving user privacy
while leveraging data collected from wearable devices for collaborative model training.

Benefits: Privacy Preservation: The survey explores privacy-preserving techniques in
federated learning, ensuring that sensitive health data collected from wearable devices
remains secure and confidential.

Collaborative Model Training: By enabling collaborative model training across distributed
wearable devices, federated learning facilitates the development of accurate and robust
healthcare models while respecting user privacy.

Scalability: Federated learning allows for scalable model training across a large number of
wearable devices, accommodating diverse data sources and ensuring the generalizability of
healthcare models.

Personalized Healthcare: The collaborative nature of federated learning enables the
development of personalized healthcare models tailored to individual user needs, enhancing
the effectiveness of healthcare interventions.

Regulatory Compliance: By implementing privacy-preserving federated learning techniques,
wearable healthcare systems can adhere to regulatory requirements such as HIPAA and
GDPR, ensuring compliance with data privacy regulations.


CHAPTER 3

IMPLEMENTATION


3.1 MODEL INITIALIZATION AND CLIENT SELECTION:

The process begins by defining a task in the healthcare domain, such as medical image
classification, segmentation, or human activity recognition (HAR). Next, the parameters of
the global server are artificially initialized, and clients are then selected by the global
server to participate in the training.

Algorithm 1: The key stages involved in federated learning, where ω represents the model's
parameters, D denotes the local dataset held by individual clients, and method refers to the
aggregation approach.

Initialization: Clients = {}, ω_global = 0, number of clients N, initial
parameters ω_pretrained, number of communication rounds C.

procedure Initialization & Selection(ω_pretrained, N)
    Clients = SelectClients(N)
    ω_global = ω_pretrained
end procedure

procedure Local training & Upload(ω_global, Clients)
    ω_client = ω_global
    ω_localnew = LocalTraining(ω_client, Clients, D)
    UploadPara(Clients, ω_localnew)
end procedure

procedure Aggregation & Download(ω_localnew, Clients)
    ω_global = Aggregation(ω_localnew, method)
    DownloadPara(Clients, ω_global)
end procedure

for c = 1 to C do
    Local training & Upload(ω_global, Clients)
    Aggregation & Download(ω_localnew, Clients)
    if performance meets requirement then
        break
    end if
end for


3.1.1 LOCAL TRAINING AND PARAMETER UPLOAD:


Once the participating clients are identified, the global server distributes the initial model
and its parameters to each client. In every subsequent communication round, each client
trains the model on its own dataset and uploads the parameters of its local model to the
global server for aggregation.

3.1.2 MODEL AGGREGATION AND PARAMETER DOWNLOAD:


After all participating clients have uploaded their updated parameters, the global server
combines them to compute a new global model. This updated model is then distributed to each
client for the next training round. The FL process continues until the loss function of the
global model converges or the performance requirements are met.
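
The three stages described above (initialization and selection, local training and upload, aggregation and download) can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the report's implementation: the clients are synthetic, the local model is a linear least-squares model, and the aggregation method is a sample-size-weighted average of the client parameters (the rule commonly known as federated averaging). All function names and hyperparameters are hypothetical.

```python
import numpy as np

def local_training(w_global, X, y, lr=0.1, epochs=5):
    """Client step: a few epochs of gradient descent on a least-squares loss."""
    w = w_global.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def aggregate(client_weights, client_sizes):
    """Server step: average client parameters, weighted by local sample count."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)
    return (sizes[:, None] * stacked).sum(axis=0) / sizes.sum()

def global_loss(w, clients):
    """Mean squared error of w over all client data (simulation only)."""
    errors = np.concatenate([X @ w - y for X, y in clients])
    return float(np.mean(errors ** 2))

# Three synthetic clients of different sizes; the raw (X, y) never leaves a client.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])
clients = []
for n in (200, 50, 20):
    X = rng.normal(size=(n, 3))
    y = X @ w_true + 0.01 * rng.normal(size=n)
    clients.append((X, y))

w_global = np.zeros(3)  # stands in for the pretrained parameters
for c in range(1, 51):  # at most C = 50 communication rounds
    uploads = [local_training(w_global, X, y) for X, y in clients]
    w_global = aggregate(uploads, [len(y) for _, y in clients])
    if global_loss(w_global, clients) < 1e-3:  # "performance meets requirement"
        break

print("rounds used:", c)
print("global parameters:", np.round(w_global, 3))
```

In a real deployment the server could not evaluate `global_loss` on pooled data as this simulation does; the stopping check would rely on metrics reported by clients or on a held-out validation set.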

Fig: 3.1.2 FEDERATED LEARNING APPROACH IN HEALTH CARE

This section classifies FL into three distinct types, namely Horizontal FL (HFL) [21],
Vertical FL (VFL), and Federated Transfer Learning (FTL). Figure 4 illustrates these
categories.

Fig. 4: The types of federated learning used in the healthcare field. Left: Horizontal
Federated Learning (HFL), which involves the same feature space but different sample
spaces. Middle: Vertical Federated Learning (VFL), with distinct feature spaces but the same
sample space. Right: Federated Transfer Learning (FTL), characterized by disparate feature
and sample spaces. The blue and green colors represent the different types of samples, while
the gray circles indicate the feature types.

1) Horizontal federated learning: Horizontal FL, also referred to as Sample-Partitioned FL,
involves healthcare clients whose datasets share the same feature space but have different
sample spaces. In this scenario, each participant can use the same model to train on its data
and then upload it to the global server. Integrating data from the same feature space that is
spread across multiple clients is common in privacy-sensitive fields such as healthcare and
mobile services, and HFL makes this possible, as described in [22]. Specifically, HFL can be
defined as:

$$X_i = X_j,\quad Y_i = Y_j,\quad I_i \neq I_j,\quad \forall\, D_i, D_j,\ i \neq j \tag{3}$$

where I denotes the sample space, while X and Y refer to the feature space and the label
space, respectively. The datasets owned by the ith and jth clients are represented by D_i and
D_j, respectively.

2) Vertical federated learning: Vertical FL, also referred to as Feature-Partitioned FL [23],
operates within the FL framework where the sample space remains the same but the feature
space differs. The goal of VFL is to collaboratively create a shared machine learning model
that uses all the features gathered by the participating clients. An instance of this is the
Federated Data Network (FDN) [24], which integrates anonymous data from a prominent
social network service, thus allowing for the inclusion of a vast majority of user samples
from other data holders, such as bank customers. Formally, VFL can be defined as:

$$X_i \neq X_j,\quad Y_i \neq Y_j,\quad I_i = I_j,\quad \forall\, D_i, D_j,\ i \neq j \tag{4}$$

where X again represents the feature space, Y the label space, I the sample space, and D the
datasets owned by each healthcare client.

3) Federated transfer learning: HFL and VFL require all clients to share either the same
feature space or the same sample space, but this assumption often does not hold in practice.
Transfer learning (TL) is a technique that attempts to improve the performance of target
learners on target domains by transferring knowledge from distinct but related source
domains [25]. FTL therefore addresses the case where both the sample space and the feature
space differ, using a TL method to minimize the data distribution discrepancy between local
datasets. In healthcare, for example, FTL can assist in disease diagnosis with data from
different patients (different sample spaces) in multiple hospitals with different therapeutic
programs (different feature spaces) [26]. Hence, FTL can be defined as:

$$X_i \neq X_j,\quad Y_i \neq Y_j,\quad I_i \neq I_j,\quad \forall\, D_i, D_j,\ i \neq j \tag{5}$$

where X_i is the ith feature space, Y_i the ith label space, and I_i the ith sample space, and
D_i and D_j are the datasets owned by the ith and jth healthcare clients, respectively.
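
The three definitions can be read as a simple decision rule on two questions: do the clients share a feature space, and do they share a sample space? The sketch below encodes that rule for a pair of clients. It is illustrative only: the function name `fl_category`, the column names, and the patient identifiers are hypothetical, set equality is used as a simplifying stand-in for "same space", and the label space Y from the equations is ignored.

```python
def fl_category(features_i, ids_i, features_j, ids_j):
    """Classify a pair of client datasets as HFL, VFL, or FTL.

    features_*: iterable of feature (column) names held by a client
    ids_*:      iterable of sample identifiers (for example, patient IDs)
    """
    same_features = set(features_i) == set(features_j)
    same_samples = set(ids_i) == set(ids_j)
    if same_features and not same_samples:
        return "HFL"  # same feature space, different sample spaces
    if same_samples and not same_features:
        return "VFL"  # same sample space, different feature spaces
    if not same_features and not same_samples:
        return "FTL"  # both spaces differ
    return "identical"  # same features and samples: nothing to federate

# Two hospitals recording the same measurements for different patients.
print(fl_category(["age", "bp"], ["p1", "p2"], ["age", "bp"], ["p3", "p4"]))
# A hospital and a laboratory holding different measurements for the same patients.
print(fl_category(["age", "bp"], ["p1", "p2"], ["glucose"], ["p1", "p2"]))
# Different patients and different measurements.
print(fl_category(["age", "bp"], ["p1", "p2"], ["glucose"], ["p3", "p4"]))
```

Real deployments rarely see exactly equal sample sets; vertical settings typically work on the overlap of identifiers, which this sketch does not model.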

3.2 CANCER DIAGNOSIS:


Recent studies have shown the feasibility and benefits of applying FL to cancer diagnostic
tasks. For instance, one study proposes a differentially private FL framework that employs
Bag Preparation and Multiple Instance Learning (MIL) to perform a classification task on a
lung cancer dataset. The authors conduct experiments on their hand-crafted dataset derived
from The Cancer Genome Atlas (TCGA) [48] and demonstrate that their FL model performs
better than non-FL models while also addressing medical data privacy concerns. However,
the performance of their model degrades when the number of clients is high (32 clients), with
an accuracy of less than 60% in this case. This limitation hinders the use of the model in
large-scale collaborative environments.

Heterogeneous data is a common challenge in FL that can cause local and global drift,
affecting the performance of the model [28]. To address this issue, the authors of [28]
introduced an FL framework called HarmoFL, which aims to harmonize local and global
drifts simultaneously using magnitude normalization. To address local drift, magnitudes are
limited to a specific range to generate a coordinated feature space across local clients. They
also used client weight perturbation based on the generated feature space to guide the local
target toward a globally optimal solution, which reduces global drift. Specifically, it
considered both local and global drift.

A solution was also proposed to alleviate the instability arising from data diversity in a setup
known as FL with Shared Label Distribution. This approach employs a weighted
cross-entropy loss, which optimizes the relevance of each sample to the local target by taking
into account the label distribution in each client. However, it is assumed that clients can share
the number of samples in each class, which may result in privacy leakage if this information
is valuable. The proposed approach achieved improved test accuracy on the OrganMNIST
dataset [53]. Yet, these studies performed experiments on limited types of datasets, and
further analyses on more varied and complex medical datasets are warranted.

The work in [31] introduces a novel self-supervised pretraining FL approach that uses the
Vision Transformer (ViT) as the underlying network architecture. This approach performs
local model pre-training on each client dataset to overcome data heterogeneity concerns.
Experiments conducted on a dermatology dataset related to skin cancer showed that the
method achieves notable improvements in test accuracy [54]–[56]. In contrast to previous
studies, the authors of this work perform three classification tasks in both simulated and
real-world scenarios, providing a more thorough assessment of reliability. However, their
experiments only consider a limited number of clients (5 clients), which raises concerns
about possible bias and the approach's ability to handle a larger number of clients.

To address the issue of non-IID (not independent and identically distributed) data across
different clients, the approach in [32] trains personalized models using channel-wise
assignment instead of the layer-wise personalization techniques of previous studies
[57]–[60]. In this method, the global model is decoupled at the channel level to enable
personalization. To further improve the decoupling effect, a new cyclic distillation technique
is introduced for reducing divergence. Experiments conducted on the colorectal cancer
HISTO-FED dataset demonstrated the proposed approach's effectiveness in handling
non-IID data. However, the approach was only tested using three clients.
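
To make the idea of a label-distribution-aware weighted cross-entropy concrete, the sketch below computes per-class weights from a client's local class counts and applies them to the loss. The text does not give the exact weighting used in the cited work, so the inverse-frequency rule here is an assumption chosen for illustration, as are the function names and the example counts.

```python
import numpy as np

def class_weights(local_counts):
    """Inverse-frequency weights from a client's per-class sample counts.

    Assumed rule: weight_k = total / (num_classes * count_k), so rare classes
    receive larger weights. Counts of zero are treated as one to avoid
    division by zero.
    """
    counts = np.maximum(np.asarray(local_counts, dtype=float), 1.0)
    return counts.sum() / (len(counts) * counts)

def weighted_cross_entropy(probs, labels, weights):
    """Mean of -weight[y] * log p(y) over the samples of one client."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-weights[labels] * np.log(picked)))

# A client holding 90 samples of class 0 and 10 samples of class 1 (hypothetical).
weights = class_weights([90, 10])
probs = np.array([[0.5, 0.5], [0.5, 0.5]])  # predicted class probabilities
labels = np.array([0, 1])
loss = weighted_cross_entropy(probs, labels, weights)
print("class weights:", np.round(weights, 4))
print("weighted loss:", round(loss, 4))
```

Because the weights depend on the per-class counts, a client that shares them reveals its label distribution, which is the privacy concern raised above.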

3.2.1 COVID-19 DETECTION:


Studies have also investigated the use of FL for COVID-19 detection and pneumonia diagnosis
[36]–[39], [62]. Since COVID-19 is a worldwide pandemic, incorporating more clients to
create a robust global model can be beneficial for patients and physicians. The study in [36]
leverages customized local models for healthcare personalization, employing distinct local
batch normalization to optimize model generalizability while maintaining a high specificity
for each patient. Experimental results on the COVID-19 chest X-ray dataset [63] showed
promising performance and rapid convergence of the method. Experiments involving 100
clients showed the method achieves an average classification accuracy of 75%, which
indicates its robustness under a large number of clients. In [37], two FL techniques are
proposed for different active learning scenarios: Labeling Efficient Federated Active
Learning (LEFAL) and Training Efficient Federated Active Learning (TEFAL). LEFAL aims
to enhance the effectiveness of feature learning by taking into account data uncertainty and
diversity, while TEFAL improves client efficiency by employing a discriminator to assess the
amount of useful information a client can provide. The authors conducted experiments on the
COVID-19 dataset [64] and showed their approach achieves high accuracy and F1 scores in a
limited number of iterations. For example, their model obtained an average accuracy of 0.9
and an average F1 score of 0.95 with only 50 iterations. Additionally, the experiments
covered two scenarios, involving a small hospital and a large hospital, providing a more
practical assessment of the performance of the FL model in complex settings. However, the
maximum number of clients was limited to five in this study. The work in [38] presents a FL
approach utilizing Generative Adversarial Networks (GANs) to mitigate the risk of data
privacy leakage. In this approach, a Convolutional Neural Network (CNN) was used as a
generator to produce synthetic COVID-19 images, enabling the discriminator to learn and
replicate the actual distribution of COVID-19 data. Additionally, a blockchain-based
Differential Privacy Protection technique was implemented to enhance the data privacy
protection. Experiments on the DarkCOVID dataset [65] and the ChestCOVID dataset [66]
indicated that this approach could outperform state-of-the-art FL methods on these datasets.
Results on the DarkCOVID dataset reveal that the classification accuracy for COVID and
normal cases is 99%; however, the performance in predicting pneumonia is relatively lower,
with an accuracy of 80%. Furthermore, the proposed method requires a large number of
epochs, typically around 200, to achieve optimal results, which is time-consuming. The
authors of [62] use cyclic homomorphic encryption to improve the privacy-preserving
capabilities of their FL method by encrypting the aggregation process. Adversarial attacks are
also simulated to evaluate the model’s resilience. However, their privacy protection technique
is only effective when there are more than two clients. In other words, when there are fewer
than three participating clients, the model’s privacy-preserving ability is almost nonexistent.


Experimental results based on the RAD-ChestCT dataset showed their approach to achieve an
average accuracy of 94%, which is similar to the performance of TL (95%) [67]. However,
the maximum number of clients used in this work is limited to 5. Moreover, the GPU
memory usage of the method exceeds 26 GB, which may restrict the choice of computational
device. One advantage is that the training time is shorter than that of centralized training,
shedding light on training efficient FL models. In [39], a practical FL scenario called
intermittent clients is considered, in which some clients are involved in the training while
others leave due to internet connectivity issues. The method in this work achieves an accuracy
of 80.29% for pneumonia diagnosis on the chest X-ray image dataset [68]. However, this study
only considers whether there is one
client leaving or not, which fails to provide a comprehensive reflection of the overall impact
of clients leaving. Additionally, the maximum number of clients is limited to 10.
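One plausible reading of the two-client limitation noted above for [62] can be sketched in a few lines. This is an assumption made here for illustration, not a statement from the cited work: if the server only reveals an aggregate of the client updates, a client in a two-party federation can subtract its own update and recover the other client's update exactly, whereas with three or more clients it only learns the combined contribution of the others. The function names and the integer updates are hypothetical.

```python
def aggregate(updates):
    """Element-wise sum of the clients' updates, i.e. what aggregation reveals."""
    return [sum(values) for values in zip(*updates)]


def subtract_own(aggregated, own_update):
    """What a curious client can compute: the aggregate minus its own contribution."""
    return [total - own for total, own in zip(aggregated, own_update)]


# Hypothetical integer-valued updates, chosen only to keep the arithmetic exact.
client_a = [1, -2, 3]
client_b = [4, 0, -1]
client_c = [2, 5, 7]

# Two clients: client A recovers client B's update exactly.
leak_two = subtract_own(aggregate([client_a, client_b]), client_a)
print(leak_two == client_b)  # True

# Three clients: client A only learns the sum of B and C, not either one alone.
leak_three = subtract_own(aggregate([client_a, client_b, client_c]), client_a)
print(leak_three)  # [6, 5, 6]
```

The same argument applies when the aggregate is a mean rather than a sum, since a client that knows the number of participants can rescale it.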
3.3.1 HUMAN ACTIVITY RECOGNITION:
The development of IoT technology has enabled Human Activity Recognition (HAR) to play
a critical role in assisting medical professionals with collecting patient data for diagnosing
chronic illnesses [72]. However, HAR is susceptible to privacy violations and data
dissimilarity issues. FL is a potential solution for implementing robust models with numerous
clients, as it effectively addresses the previous issues. In a recent study [42], the authors
concluded that privacy regulations would not be violated if a label with natural language is
specified when sharing data. The study considered the classification problem as a matching
process between data and class representation, and transformed the classifier into a data and
category encoder to facilitate this process. Additionally, it used the class names as a reference
point to ensure category representation in the label encoder through natural language.
Experiments conducted on the PAMAP2 dataset [73] demonstrated that this method could
outperform most existing classification techniques based on FL. Nevertheless, the
experiments did not include the results obtained using a centralized model. Instead, the
authors only compared their results with those of six recent FL methods. Thus, this
comparison does not adequately reflect the differences in performance between TL and FL.
In [41], the limitations of existing wearable devices such as data privacy, service integrity,
and network structure adaptability have led the authors to create an adaptive network for
intelligent wearables based on the distributed structural features of the fog-IoT network. The
proposed FL platform integrates blockchain technology to enhance data privacy protection.
When tested on a HAR task using smartphone data [74], this approach achieved good
performance in terms of privacy

3.3.2 MAJOR DEPRESSIVE DISORDER DISEASE DIAGNOSIS:


Major Depressive Disorder (MDD), a prevalent, severe, and expensive mental disorder
worldwide, causes depressed mood, reduced interest, and impaired cognitive function.
Detecting functional connectivity biomarkers and early intervention is important for
managing MDD. The privacy concerns related to patients’ information and data require the
utilization of FL to train a large global model. In a recent study [44], the authors developed a
federated joint estimator to detect these biomarkers by training a multilayer Bayesian network
based on continuous optimization. To enhance personalized models, they utilized a group fused
lasso penalty during training and proposed an alternating direction method of multipliers
(ADMM) technique to aid in processing neuroimaging data. The proposed method
incorporated information-sharing strategies to improve the learning of local models.
Experiments on an rs-fMRI dataset [75] demonstrated the superior effectiveness and precision of
this method.
3.4 AUTISM SPECTRUM DISORDER PREDICTION:
Autism spectrum disorder (ASD) has a substantial impact on the prevalence of mental
illnesses and can harm a child’s mental health development [76]. CNN [77], [78] and
Recurrent Neural Network (RNN) [79], [80] are frequently employed for early ASD
detection and prediction. Although these techniques have achieved good results, they mostly
disregard the correlations and connections between subjects in the population [45]. Recent
research has shown that graph neural networks can effectively overcome this limitation [81].
This approach employs graph generative adversarial networks to complete the missing
information in the local network and uses network inpainting and inter-institutional data to
enhance the edge predictor [45]. The method’s effectiveness was demonstrated through
experiments on two neuroimaging datasets, ABIDE [82] and ADNI [83]. For the ADNI dataset,
the performance of the FL model remains the same when increasing the number of clients
beyond 8. However, this performance continues to improve for the ABIDE dataset, suggesting
that the model’s potential may not be fully attained when faced with more clients.

Surgical phase recognition serves a crucial clinical purpose by accurately identifying the
current phase without future information from the surgical video [84]. Despite its
importance, the field continues to face challenges due to the
sensitive nature of medical data. This restricts collaborations between multiple institutions
and limits the deployment of traditional deep models in real-world settings. In [46], the
authors introduced the first FL strategy that employs semi-supervised learning to enhance the
generalization capability of the surgical phase recognition model using both labeled and
unlabeled data present in the dataset. The experimental results demonstrated that this
approach can learn better features and exhibit a feasible generalization performance in
unknown domains. The MultiChole2022 dataset used in this study was created from the
Cholec80 dataset.
3.4.1 PROSTATE TUMOR SEGMENTATION:
The accurate segmentation of prostate regions in MRI is a crucial step in numerous
medical imaging applications for detecting prostate cancer, characterizing its
aggressiveness, predicting its recurrence, and assessing the effectiveness of
treatment [129]. The work in [28] trains an FL-based segmentation model using a
multi-site prostate dataset [89], which comprises 79 samples from six different
sites. Results showed that this model achieves an average Dice of 94.28%.
Compared to FedAvg and FedBN, the proposed method shows enhanced
stability with increased local training epochs. However, this study did not
evaluate the performance improvement or decrease brought by using FL,
compared to centralized approaches. Weakly supervised learning has emerged
as a popular approach to alleviate the burden of labeling data [130]. In this
approach, incomplete but easier-to-obtain annotations are used instead of full
image annotations. In [90], the authors proposed the first federated weakly
supervised segmentation (FedWSS) method to learn a segmentation task from
multiple data sources while minimizing the impact of data drift. To address local
and global data drift, the authors introduced two strategies, CAC and HGD.
CAC reduces local drift using a Monte Carlo sampling technique that
customizes a distal peer and proximal peer for each client, and accurately
distinguishes between clean and noisy labels. Meanwhile, the HGD strategy
mitigates global data drift by using primary gradient data to aid clients in
subsequent training cycles [90]. In another study, the authors proposed a
personalized FL paradigm to address the challenges of performance degradation
and unbalanced label distribution. The proposed method leverages progressive Fourier
aggregation on the global server side and enhanced transfer on the client side to
learn the parameters of individual client models and transfer local knowledge to
the global model more effectively. To address the problem of label distribution
imbalance, it also introduces a new loss function called Conjoint Prototype
Aligned (CPA) loss. This loss evaluates the global conjoint objective based on
the global imbalance and modifies the local client-side training via
prototype-aligned refinement to eliminate the imbalance gap with a balanced
objective. Experimental results on the PROMISE12 dataset [91] and the ISBI dataset
[95] showed the method’s superior performance compared to recent approaches.
However, the local training time of this method is twice that of standard FL,
which could potentially increase the communication load when using edge
devices. Moreover, without a comparison with the centralized model, the
potential of using FL for prostate tumor segmentation is not sufficiently
demonstrated.
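
The Dice score reported above for [28] measures the overlap between a predicted mask and a reference mask. A minimal sketch of the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), on flattened binary masks is given below. The function name, the convention for two empty masks, and the toy masks are illustrative assumptions and are not taken from the cited work.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0
    return 2 * intersection / total


# Toy 6-pixel masks (hypothetical): 2 pixels overlap, each mask has 3 foreground pixels.
predicted = [1, 1, 0, 0, 1, 0]
reference = [1, 0, 0, 0, 1, 1]
print(f"Dice: {dice_coefficient(predicted, reference):.4f}")  # Dice: 0.6667
```

In a federated evaluation, a score like this would typically be computed per site and then averaged, which is presumably what an "average Dice" over six sites refers to.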

3.4.2 BREAST TUMOR SEGMENTATION:


Breast cancer, which is the most prevalent type of cancer in women, can be fatal if not
detected early [131]. To simulate an FL model, a recent study [100] proposed a novel
label-agnostic supervised FL method called FedMix. FedMix trains each client by
utilizing both strong and weak labels with an adaptive weight adjustment strategy, which
allows for dynamic weight adaptation during the FL training process to learn better feature
representations. This method breaks the restriction of only using one type of label for
training. FedMix was tested on three breast tumor segmentation datasets: BUS [101], BUSIS
[102], and UDIAT [103]. Experimental results showed that it outperforms most current
approaches. Nevertheless, the performance of this technique relies heavily on the choice of
hyper-parameters, which need extensive fine-tuning to avoid degradation in performance.
Additionally, it is assumed that label-rich clients exhibit higher training loss, indicating a
greater amount of information available for model training. However, the presence of noisy
or corrupted labels can lead to a substantial rise in the loss, and their model is unable to
effectively differentiate these labels. Consequently, the performance of the model in this
particular scenario may be negatively impacted.
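
The concern about noisy labels can be made concrete with a toy weighting rule. The sketch below is not FedMix's actual formula, which is not given here; it simply assumes, for illustration, that each client's aggregation weight is proportional to its training loss. Under that assumption, a client whose loss is high only because its labels are corrupted receives the largest weight. All names and loss values are hypothetical.

```python
def loss_proportional_weights(losses):
    """Assumed rule: a client's aggregation weight is its loss divided by the total loss."""
    total = sum(losses)
    if total <= 0:
        raise ValueError("losses must sum to a positive value")
    return [loss / total for loss in losses]


# Hypothetical per-client training losses. The third client has noisy labels,
# which inflates its loss without adding useful information.
client_losses = [0.2, 0.3, 1.5]
weights = loss_proportional_weights(client_losses)

for name, weight in zip(["clean A", "clean B", "noisy C"], weights):
    print(f"{name}: weight {weight:.2f}")
```

With these numbers the noisy client ends up with three quarters of the total weight, which illustrates why a loss-driven scheme needs some way to tell informative labels from corrupted ones.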

CHAPTER 4
RESULTS AND DISCUSSION

Due to the fast advancement of artificial intelligence (AI), centralized models have
become critical for healthcare tasks such as medical image analysis and human behavior
recognition. Although these models exhibit suitable performance, they are frequently
constrained by privacy concerns. In particular, a centralized learning strategy cannot be
used in cases where there is a risk of data privacy breach, particularly in healthcare centers.
Federated learning (FL) is a technique that allows for training a global model without sharing
data by training distributed local models and aggregating them. By implementing FL
throughout the training process, we can obtain a model with comparable generalization
abilities to centralized learning while maintaining data privacy. This survey provides an
introduction to the fundamental concepts and categories of FL, highlights the limitations of
the centralized healthcare model, and discusses how FL can address these constraints. We
also provide a detailed overview of the healthcare applications using FL models, along with
commonly used evaluation metrics and public datasets. In this context, we have implemented
a case study to demonstrate how FL can be applied in the healthcare field. Furthermore, we
outline the key challenges and future trends in FL.

4.1 Challenges Of Protection of data privacy:


FL ensures data privacy by allowing clients to share only model parameters, not the raw
data, as mentioned in the previous section. This approach is highly effective in protecting
data privacy. In [18], homomorphic encryption-based privacy-preserving
strategies were used to address data privacy leakage issues. As data privacy laws
continue to become more stringent, FL is expected to play a crucial role in the smart
healthcare industry.

4.1.1 Reduce training consumption:


The FL technique can distribute data efficiently to each edge server, leading to a reduction in
communication usage, network transmission latency, and costs. Sharing model parameters
through FL typically requires much less energy compared to exchanging raw data. For
example, the size of parameter gradients is significantly smaller than the actual data in the
dataset, as stated in [19]. This makes FL an energy-efficient solution for distributed machine
learning.

Figure: Flowchart of the federated learning model for healthcare. The process involves several
steps, including model initialization and client selection (Left), local training and parameter
upload (Middle), and model aggregation and parameter download (Right). 1) The global model
is initialized, and clients are selected to participate in the federated learning process. 2) The
second step involves local training on client data and the upload of updated model parameters
to the server. 3) Finally, the updated parameters from all clients are aggregated to create a new
global model, which is transmitted back to the clients for the next round of training. This
approach enables the training of models using decentralized data while preserving data privacy.
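
The size comparison between model parameters and raw data can be checked with simple arithmetic. The sketch below uses hypothetical figures that do not come from [19]: a model with one million 32-bit parameters and a local dataset of ten thousand 512 x 512 single-channel 8-bit images. It also counts the traffic of repeated rounds, since parameters are exchanged in every round while raw data would be sent once.

```python
def parameter_bytes(num_parameters, bytes_per_value=4):
    """Size of one full copy of the model parameters (float32 by default)."""
    return num_parameters * bytes_per_value


def raw_image_bytes(num_images, height, width, channels=1, bytes_per_pixel=1):
    """Size of an uncompressed image dataset."""
    return num_images * height * width * channels * bytes_per_pixel


def fl_traffic_bytes(num_parameters, rounds, bytes_per_value=4):
    """One client's traffic: an upload and a download of the parameters per round."""
    return 2 * rounds * parameter_bytes(num_parameters, bytes_per_value)


# Hypothetical sizes, for illustration only.
one_copy = parameter_bytes(1_000_000)             # 4,000,000 bytes
dataset = raw_image_bytes(10_000, 512, 512)       # 2,621,440,000 bytes
hundred_rounds = fl_traffic_bytes(1_000_000, 100) # 800,000,000 bytes

print(f"one parameter copy : {one_copy / 1e6:.1f} MB")
print(f"raw local dataset  : {dataset / 1e6:.1f} MB")
print(f"100 FL rounds      : {hundred_rounds / 1e6:.1f} MB")
```

Under these assumptions a single parameter exchange is far smaller than the dataset, and even 100 rounds stay below the cost of shipping the raw images. The balance depends on the model size and the number of rounds, so the saving should be checked for each deployment.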

4.1.2 Large amount of training data:

FL provides strategies, such as FedAvg [20], that allow for the merging of multiple clients
when the number of clients is sufficient. This merging of clients promotes the availability of
training data and can alleviate the problem of requiring a large quantity of data to train AI
models. Thus, FL is a powerful technique for distributed machine learning, especially when
there is a large number of clients available.
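
The merging step performed by FedAvg can be sketched as a weighted average of the clients' parameter vectors, where each client is weighted by its number of local training samples. The code below is a minimal pure-Python illustration; the function name, the list-based parameter representation, and the example values are assumptions made for this sketch.

```python
def fedavg(client_params, client_sizes):
    """Average parameter vectors, weighting each client by its local sample count."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total_samples = sum(client_sizes)
    dimension = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes))
        / total_samples
        for i in range(dimension)
    ]


# Two hypothetical clients: the second holds three times as much data,
# so the merged parameters lie closer to its values.
merged = fedavg([[1.0, 2.0], [3.0, 4.0]], [100, 300])
print(merged)  # [2.5, 3.5]
```

Weighting by sample count means the merged model behaves as if it had seen the pooled data of all clients, which is how the approach increases the effective amount of training data without moving it.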

4.1.3 Model initialization and client selection:


The process begins by defining a task in the healthcare domain, such as medical image
classification, segmentation, or HAR. Next, the parameters of the global server are artificially
initialized, and clients are then selected by the global server to participate in the training.
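
The initialization and selection step can be sketched as follows. The text does not specify how parameters are initialized or how clients are chosen, so this example assumes small random Gaussian initial values and uniform random sampling of a fixed fraction of clients. The function names, the fraction, and the seeds are hypothetical.

```python
import random


def initialize_global_model(num_parameters, seed=0):
    """Assumed scheme: draw each initial parameter from a narrow Gaussian."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.01) for _ in range(num_parameters)]


def select_clients(client_ids, fraction, seed=0):
    """Assumed scheme: sample a fraction of clients uniformly, at least one."""
    rng = random.Random(seed)
    count = max(1, round(fraction * len(client_ids)))
    return sorted(rng.sample(client_ids, count))


global_params = initialize_global_model(num_parameters=4)
hospitals = list(range(10))  # ten hypothetical client identifiers
participants = select_clients(hospitals, fraction=0.3)

print(len(global_params), "parameters initialized")
print(len(participants), "of", len(hospitals), "clients selected:", participants)
```

Fixing the seed makes a simulated run reproducible; a real deployment might instead select clients by availability or data quality.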

4.1.4 Dimensionality and Heterogeneity


Unstructured big data comes with high dimensionality, diversity, dynamicity and
heterogeneity. Dimensionality reduction and semantic annotation can further improve IE
performance on high-dimensional and heterogeneous data, respectively. Techniques with
high representational power are appropriate for high-dimensional data. With the influx of data
from increasingly diverse sources, big data IE and analytics require advanced techniques to
handle more than data accessibility.

4.1.5 Local training and parameters upload:


Once the participating clients are identified, the global server distributes the initial model
and its parameters to each client. In every subsequent communication round, each client
trains on its own dataset and uploads the parameters of its local model to the global server for
aggregation.
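
A single client's part of this step can be sketched with a deliberately small model. The example assumes a one-dimensional linear model y = w*x + b trained by full-batch gradient descent on squared error; the model, learning rate, number of epochs, and data are all hypothetical and stand in for whatever network a real client would train.

```python
def mse(params, data):
    """Mean squared error of the linear model y = w*x + b on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def local_train(global_params, data, lr=0.1, epochs=5):
    """Start from the received global parameters and run a few local epochs.

    Returns the updated parameters and the local sample count, which is what
    the client would upload for aggregation. The raw data never leaves here.
    """
    w, b = global_params
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b], n


local_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # private to this client
received = [0.0, 0.0]                               # parameters sent by the server
uploaded_params, num_samples = local_train(received, local_data)

print("loss before:", mse(received, local_data))
print("loss after :", mse(uploaded_params, local_data))
```

Only `uploaded_params` and `num_samples` are sent back, which is the property the privacy argument relies on.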

4.1.6 Model aggregation and parameters download:


After all participating clients have completed uploading their updated parameters, the global
server combines them to compute a new global model. This updated model is then distributed
to each client for the next training session. The process of FL continues until the loss function
of the global server converges or meets the performance requirements.
Efficient parallelism and computational power are required to support large data models.
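
The full cycle of download, local training, upload, and aggregation, repeated until the global loss stops changing, can be sketched as below. The example makes several assumptions for illustration: a one-dimensional linear model, one local gradient step per round, sample-count weighting in the aggregation, and a convergence test on the change in the pooled loss. The two clients and their data are hypothetical.

```python
def mse(params, data):
    """Mean squared error of y = w*x + b on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def local_train(params, data, lr=0.1):
    """One full-batch gradient step on the client's own data."""
    w, b = params
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return [w - lr * grad_w, b - lr * grad_b], n


def aggregate(updates):
    """Sample-count-weighted average of the uploaded (params, n) pairs."""
    total = sum(n for _, n in updates)
    return [sum(p[i] * n for p, n in updates) / total for i in range(2)]


def federated_training(clients, max_rounds=2000, tol=1e-10):
    """Repeat local training and aggregation until the global loss converges."""
    global_params = [0.0, 0.0]
    previous_loss = float("inf")
    total_samples = sum(len(data) for data in clients)
    for round_index in range(1, max_rounds + 1):
        # Download: every client starts from the current global parameters.
        updates = [local_train(global_params, data) for data in clients]
        # Upload and aggregation on the server.
        global_params = aggregate(updates)
        loss = sum(mse(global_params, d) * len(d) for d in clients) / total_samples
        if abs(previous_loss - loss) < tol:
            break
        previous_loss = loss
    return global_params, round_index, loss


# Two hypothetical clients whose data follow y = 2x + 1.
clients = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
]
params, rounds, final_loss = federated_training(clients)
print(f"stopped after {rounds} rounds, w={params[0]:.3f}, b={params[1]:.3f}")
```

Stopping on a small change in loss is one way to express "converges"; a fixed accuracy target would correspond to the alternative criterion of meeting the performance requirements.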

CONCLUSION
data privacy in the healthcare sector. FL is presented as a potential solution to address privacy
concerns by developing a global model through local training and model aggregation on

Dept. of CSE, RRCE 2019-2020 Page 22


Study of Information Extraction from Unstructured and Multidimensional Big Data

decentralized datasets without sharing raw data. However, FL in healthcare faces its own set
of challenges such as poor data quality, data heterogeneity, and data allocation and
management. We also compare FL with TL and highlight the advantages of the former
approach. The critical steps of FL are explained in detail, and FL is categorized based on
sample and feature space. The applications of FL in healthcare are summarized and
categorized, along with typical evaluation metrics and commonly used medical datasets. The
reported case study also sheds light on the importance of FL in healthcare. It is expected that
FL techniques will continue to be widely used in both academia and hospitals in the near
future. With the aid of advances in science and technology, we anticipate that FL can be
further enhanced to provide more effective support to patients in the healthcare sector.

FUTURE WORK

Federated learning in healthcare has immense potential for various applications, given its
ability to train machine learning models across decentralized data sources while ensuring
privacy and security. Here are some potential future directions for federated learning in
healthcare:
1. Clinical Decision Support Systems (CDSS): Federated learning can be utilized to
develop CDSS that provide real-time recommendations for healthcare providers based
on collective knowledge from various hospitals and clinics while preserving patient
privacy.
2. Disease Prediction and Diagnosis: Federated learning models can be trained across
diverse healthcare systems to improve disease prediction and diagnosis accuracy by
leveraging a broader spectrum of data while respecting privacy regulations.
3. Drug Discovery and Development: Federated learning can facilitate collaborative
drug discovery and development by enabling pharmaceutical companies and research
institutions to jointly train models on distributed datasets without sharing sensitive
information.
4. Medical Imaging Analysis: Federated learning can enhance medical imaging
analysis by aggregating data from different hospitals to train robust models for tasks
such as tumor detection, organ segmentation, and disease progression tracking while
protecting patient privacy.

REFERENCES

1. Wang K, Shi Y. User information extraction in big data environment. In: 3rd IEEE
international conference on computer and communications (ICCC).
2. Li P, Mao K. Knowledge-oriented convolutional neural network for causal relation
extraction from natural language texts. Expert Syst Appl. 2019;115:512–23.
3. Liu Z, Tong J, Gu J, Liu K, Hu B. A Semi-automated entity relation extraction
mechanism with weakly supervised learning for Chinese medical webpages. In:
International conference on smart health. Cham: Springer; 2016; p 44–56.
4. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition
(CVPR). 2016; p. 770–8.
5. Gantz J, Reinsel D. The digital universe in 2020: big data, bigger digital shadows, and
biggest growth in the far east. IDC iView IDC Analyze Future. 2012;2007(2012):1–
16.
6. Wang Y, Kung LA, Byrd TA. Big data analytics: understanding its capabilities and
potential benefits for healthcare organizations. Technol Forecast Soc Change.
2018;126:3–13.
7. Lomotey RK, Deters R. Topics and terms mining in unstructured data stores. In: 2013
IEEE 16th international conference on computational science and engineering, 2013.
p. 854–61.
8. Lomotey RK, Deters R. RSenter: terms mining tool from unstructured data sources.
Int J Bus Process Integr Manag, 2013;6(4):298.
9. Goldberg S, Wang DZ, Grant C. A probabilistically integrated system for crowd-
assisted text labeling and extraction. J Data Inf Qual. 2017;8(2):1–23.
10. Napoli C, Tramontana E, Verga G. Extracting location names from unstructured
Italian texts using grammar rules and MapReduce. In: International conference on
information and software technologies. Cham: Springer; 2016; p.593–601.
