Clinical Data: Sources, Preparation,
and Feature Operations
Textbook: An Overview of Data Collection, Preprocessing, and Feature Extraction in AI
(Chapter 3)
Artificial Intelligence in Medicine
A Practical Guide for Clinicians
Index:
1. Clinical Data meaning.
2. Clinical data sources.
3. Clinical Data Preprocessing.
4. Clinical Data Handling.
5. Feature extraction and selection.
Introduction:
• Data can be seen as a treasure for AI systems.
• Data is the foundation for training and developing AI models to
assist healthcare professionals in various aspects of patient care.
• Each piece of data holds valuable insights, stories, and facts that
contribute to the AI system’s understanding of healthcare.
• Data forms: structured data, such as organized rows and columns in
spreadsheets, and unstructured data, such as text documents, images,
and audio recordings.
• For example, AI systems can analyze medical images to detect
diseases, process text data to extract clinical insights, and
analyze patient records to predict health outcomes.
• Data is processed by computers using algorithms and logical
operations to produce new data or meaningful output based
on input data. (data -> information -> knowledge).
• In a healthcare context, data refers to information related to
health conditions, including clinical metrics and clinical
outcomes, as well as environmental, socioeconomic, and
behavioral information pertinent to health and wellness. AI
algorithms use this data to learn patterns, make predictions, and
provide valuable insights that support clinical decision-making
(for example, planning vaccination campaigns).
• By analyzing data related to patient flow, staffing, scheduling,
and supply chain management, AI systems can help
hospitals optimize their operations and improve the quality of
care.
Machine Learning Steps:
ACQUIRE -> PREPARE -> ANALYZE -> REPORT -> ACT
Goal: data collection is the first
step, in which scientists gather all the
necessary medical information
from different sources before
it is processed in an AI system.
Acquire Data:
• Identify data sources
• Collect data
• Integrate data
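The "integrate data" step can be sketched as follows. This is a minimal illustration, not a prescribed method: the record layout, the `patient_id` key, and the `integrate` helper are all hypothetical, and real integration also has to deal with identifier matching, units, and privacy rules.

```python
from collections import defaultdict

# Hypothetical records from two sources that share a patient identifier.
ehr_records = [
    {"patient_id": "P001", "age": 54, "diagnosis": "hypertension"},
    {"patient_id": "P002", "age": 37, "diagnosis": "asthma"},
]
wearable_readings = [
    {"patient_id": "P001", "heart_rate": 72},
    {"patient_id": "P001", "heart_rate": 80},
    {"patient_id": "P003", "heart_rate": 65},  # no matching EHR record
]


def integrate(ehr, wearable):
    """Attach each patient's wearable heart-rate readings to their EHR record."""
    readings = defaultdict(list)
    for row in wearable:
        readings[row["patient_id"]].append(row["heart_rate"])

    merged = []
    for record in ehr:
        combined = dict(record)  # copy so the source record is not modified
        combined["heart_rates"] = readings.get(record["patient_id"], [])
        merged.append(combined)
    return merged


integrated = integrate(ehr_records, wearable_readings)
for row in integrated:
    print(row)
```

Here the EHR list drives the merge, so wearable readings without a matching EHR record are dropped; whether that is acceptable depends on the study.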
Data Sources:
1. Electronic Health Records (EHRs): EHRs contain comprehensive patient health information,
including medical history, diagnoses, treatments, laboratory results, and medications. These
records are collected and stored by healthcare providers and hospitals during patient visits.
2. Medical imaging:
Medical imaging data such as X-rays, MRIs, CT scans, and ultrasounds
provide visual representations of the patient’s anatomy and can help diagnose
and monitor diseases. Images are captured using specialized equipment and
stored in digital formats.
3. Wearable Devices and Internet of Things (IoT) Devices:
With the increasing popularity of wearable devices, such as fitness trackers
and smartwatches, physiological data like heart rate, activity levels, and sleep
patterns can be collected continuously. IoT devices, such as remote
monitoring systems and sensors, also contribute to the collection of patient-
generated data outside of healthcare facilities.
4. Clinical trials and research studies:
Research studies and clinical trials collect data from participants to investigate the
effectiveness and safety of new treatments, interventions, or medical devices. These studies
generate valuable data that can be used for AI analysis and to improve patient care.
5. Health apps and patient portals:
Mobile health applications and patient portals allow patients to record and track their
health information, such as symptoms, vital signs, medication adherence, and lifestyle habits.
These apps enable individuals to actively participate in managing their health and contribute to
the collection of personal health data.
6. Social media and online communities:
Social media platforms and online communities provide a wealth of health-related
information and patient experiences. Analyzing these unstructured data sources can uncover
insights and trends that contribute to AI-driven healthcare improvements.
Prepare Data:
Step 2-A: Explore
Step 2-B: Pre-process
Why Explore?
Goal: Understand your data
• Describe your data.
• Visualize your data: histogram, heat map, line plot, scatter plot.
• Look for outliers, general trends, and correlations.
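The exploration goals above (describe the data, spot outliers, check correlations) can be sketched with the Python standard library. The sample values, the function names, and the 1.5 x IQR outlier rule are assumptions chosen for illustration; plotting is left out.

```python
import statistics

# Hypothetical data: resting heart rate (bpm) and systolic blood pressure (mmHg).
heart_rate = [62, 70, 68, 75, 71, 66, 140, 73]
systolic_bp = [110, 121, 118, 130, 124, 115, 170, 127]


def describe(values):
    """Basic descriptive statistics for one numeric variable."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (a common rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


print(describe(heart_rate))
print("outliers:", iqr_outliers(heart_rate))
print("correlation:", round(pearson(heart_rate, systolic_bp), 3))
```

A flagged value is only a candidate for review: a heart rate of 140 bpm may be a recording error or a genuine clinical event.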
Time Series Definition:
• A time series is a sequential set of data points, typically measured
at successive times.
• A time series containing the values of a single variable is termed
univariate. If the values of more than one variable are considered, it
is termed multivariate.
• A time series can be continuous or discrete.
• Continuous time series: observations are measured at every instant
of time. Examples include temperature readings, river flow, and rate
of illness spread.
• Discrete time series: observations are measured at discrete points in
time. Examples include city population, company production,
exchange rates, number of patients, and number of required beds.
Time Series Components:
• Trend component: overall, persistent, long-term movement, up or
down, linear or non-linear.
• Seasonal component: regular periodic fluctuations; short-term,
regular, wave-like patterns.
• Cyclical component: repeating swings or movements; the cycle length
(years, months, days) is measured peak to peak.
• Irregular / random component: erratic or residual fluctuations.
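One way to see these components is a classical additive decomposition, sketched below under simplifying assumptions: the series is modeled as trend + seasonal + residual, the seasonal period is known and odd, and the cyclical component is not separated from the trend. The function names and the synthetic series are illustrative only.

```python
def moving_average_trend(series, period):
    """Centered moving average; None where the window does not fit.

    Assumes an odd period so the window is symmetric around each point.
    """
    half = period // 2
    trend = [None] * len(series)
    for i in range(half, len(series) - half):
        window = series[i - half : i + half + 1]
        trend[i] = sum(window) / period
    return trend


def decompose_additive(series, period):
    """Split a series into trend, seasonal, and residual parts.

    Assumes the series covers at least two full periods.
    """
    trend = moving_average_trend(series, period)

    # Average the detrended values at each position within the cycle.
    buckets = [[] for _ in range(period)]
    for i, (y, t) in enumerate(zip(series, trend)):
        if t is not None:
            buckets[i % period].append(y - t)
    seasonal_means = [sum(b) / len(b) for b in buckets]

    # Center the seasonal effects so they sum to zero over one period.
    offset = sum(seasonal_means) / period
    pattern = [s - offset for s in seasonal_means]
    seasonal = [pattern[i % period] for i in range(len(series))]

    residual = [
        None if t is None else y - t - s
        for y, t, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, residual


# Synthetic example: linear trend plus a repeating pattern of period 3.
series = [100 + 2 * i + [5, 0, -5][i % 3] for i in range(12)]
trend, seasonal, residual = decompose_additive(series, period=3)
print("trend:   ", trend)
print("seasonal:", seasonal)
print("residual:", residual)
```

Because the synthetic series has no noise, the residual is essentially zero here; with real patient counts it would hold the irregular component.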
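Data Trend:
• Mann-Kendall trend test: the sign of tau (positive or negative) gives
the direction of the trend, and the p-value indicates whether the
trend is statistically significant.
• Pettitt test: detects the point of change in the series.
• Spearman's Rho test: measures the correlation between time and the
data values.
• Theil-Sen's slope: determines the magnitude of the trend (rate of
change) of the variable under study (for example, a climatic
variable). The change over the record, as a percentage of the mean, is:
$$\text{change (\%)} = \frac{\text{slope} \times \text{number of years} \times 100}{\text{mean}}$$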
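The Mann-Kendall test and Theil-Sen's slope can be sketched in plain Python. This is a minimal version under stated assumptions: one observation per year, equally spaced, no correction for tied values, a normal approximation for the p-value, and the number of years taken as the length of the record. The admissions figures are hypothetical.

```python
import math
from statistics import mean, median


def mann_kendall(values):
    """Return (tau, z, two-sided p-value); no correction for ties."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of the difference
    tau = s / (n * (n - 1) / 2)
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return tau, z, p_value


def theil_sen_slope(values):
    """Median of all pairwise slopes, assuming equally spaced observations."""
    slopes = [
        (values[j] - values[i]) / (j - i)
        for i in range(len(values) - 1)
        for j in range(i + 1, len(values))
    ]
    return median(slopes)


def percent_change(values):
    """slope * number of years * 100 / mean."""
    return theil_sen_slope(values) * len(values) * 100 / mean(values)


# Hypothetical annual admissions, one value per year.
admissions = [120, 125, 123, 130, 134, 133, 140, 145, 143, 150]
tau, z, p_value = mann_kendall(admissions)
slope = theil_sen_slope(admissions)
print(f"tau={tau:.3f}, p={p_value:.4f}, slope={slope:.2f} per year")
print(f"change over the record: {percent_change(admissions):.1f}% of the mean")
```

A positive tau with a small p-value points to an increasing trend, and the Theil-Sen slope then gives its size in units per year.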
Clinical Data Handling: Handling missing data
• Deletion: If the amount of missing data is relatively small, the rows or
columns containing missing values may be removed.
• Mean/mode/median imputation: Missing values are replaced with the mean,
mode, or median value of the corresponding feature.
• Forward/backward fill: Also known as “last observation carried forward” or
“next observation carried backward,” this method fills missing
values with the previous or subsequent non-missing value in the dataset. It
is commonly used in time series data, where neighboring observations are
expected to have similar values.
• Interpolation: Interpolation methods estimate missing values based on the
values of neighboring data points. Common interpolation techniques include
linear interpolation, polynomial interpolation, and spline interpolation.
• Multiple imputation: Multiple imputation generates several plausible
values for each missing data point based on the distribution of the observed
data. Each set of imputed values yields one completed dataset, the analysis
is run on every completed dataset, and the results are then combined. This
method accounts for uncertainty in imputation and produces more robust
results.
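Three of the simpler methods above (mean imputation, forward fill, and linear interpolation) can be sketched in plain Python. Missing values are represented as `None`, and the function names and temperature readings are assumptions for illustration. Multiple imputation is not shown because it needs a statistical model.

```python
from statistics import mean


def mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def linear_interpolate(values):
    """Fill gaps between two observed points along a straight line.

    Gaps before the first or after the last observation are left as None.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


# Hypothetical hourly body temperature readings (degrees Celsius).
temps = [36.6, None, 37.0, None, None, 38.2]
print("mean imputation:", mean_impute(temps))
print("forward fill:   ", forward_fill(temps))
print("interpolation:  ", linear_interpolate(temps))
```

On a rising series like this one, forward fill holds values flat while interpolation follows the rise, which is one reason the choice of method depends on how the variable is expected to behave between measurements.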
Clinical Data Handling: Handling imbalanced data
• In certain applications, datasets may be imbalanced, meaning that one class
or category is significantly more prevalent than others. A common way to
address imbalanced data is random resampling.
• Oversampling — Generate new samples for the under-represented class.
• Undersampling — Remove samples from the class which is over-
represented.
• Oversampling or undersampling works to balance the representation of
different classes and prevent biases in model training and evaluation.
• For example, if a dataset of patient records contains only a few cases of a
rare disease, oversampling can be used to generate additional samples of the
rare-disease class.
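Random oversampling and undersampling can be sketched as follows. The record layout, the `label` key, and the function names are hypothetical; this version only duplicates or drops existing records and does not synthesize new ones.

```python
import random
from collections import Counter


def group_by_class(records, label_key):
    """Group records into lists keyed by their class label."""
    groups = {}
    for record in records:
        groups.setdefault(record[label_key], []).append(record)
    return groups


def random_oversample(records, label_key, seed=0):
    """Duplicate randomly chosen minority records until all classes match the largest."""
    rng = random.Random(seed)
    groups = group_by_class(records, label_key)
    target = max(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def random_undersample(records, label_key, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(records, label_key)
    target = min(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Hypothetical dataset: 8 common-condition records and 2 rare-disease records.
records = [{"id": i, "label": "common"} for i in range(8)]
records += [{"id": 100 + i, "label": "rare"} for i in range(2)]

over = random_oversample(records, "label")
under = random_undersample(records, "label")
print("original:    ", Counter(r["label"] for r in records))
print("oversampled: ", Counter(r["label"] for r in over))
print("undersampled:", Counter(r["label"] for r in under))
```

Resampling is normally applied to the training set only, so that evaluation still reflects the real class proportions.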