Data Science Process
Data Science for Health Care Conference
May 14, 2019
Lect. Anuchate Pattanateepapon, D.Eng.
Section for Clinical Epidemiology and Biostatistics
Mahidol University Faculty of Medicine Ramathibodi Hospital
© 2019
Outline
• Introduction to Data Science Process
• Data Science Lifecycle
- Business Understanding
- Data Acquisition and Understanding
- Modeling
- Deployment
What is a Data Science Process?
• The Data Science Process [1] is a framework for approaching data science tasks.
• The Data Science Process [2] is a process for finding relevant information inside the data that will transform the business toward higher profits.
• The Team Data Science Process (TDSP) [3] is an agile, iterative data science methodology for delivering predictive analytics solutions and intelligent applications efficiently. TDSP helps improve team collaboration and learning.
Data Science Lifecycle
Source: https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle
Data Science Lifecycle
Business Understanding: In this first step, we try to get a better idea of the business needs that the data should help us address.
Data Science Lifecycle
Data Acquisition and Understanding: This step is about getting a business-level view of the data you have and understanding what each part of the data means.
Data Science Lifecycle
Modeling: This is where statistics and data analysis come in to create a model that best fits the data.
Data Science Lifecycle
Deployment: This is where you share the findings from your data.
Business Understanding
Example
Steps: Identify Business Problem → Problem Formalization → Identify Variables and Metrics → Identify Data Source
Goal: To identify
- What data to use
- Where the data are
- How to measure
Scenario: 10-year risk of coronary heart disease (CHD)
Early prognosis of cardiovascular disease can aid decisions on lifestyle changes in high-risk patients and, in turn, reduce complications.
Problem: How can coronary heart disease be predicted with acceptable accuracy?
Business Understanding
Example: Problem Formalization
Problem setting: Given a patient's demographic, behavioral, and medical risk factors (X1, X2, …, Xm), predict the outcome Y.
Formalization: Regression problem, Y = f(X1, X2, …, Xm) (logistic regression when Y is binary)
Business Understanding
Example: Identify Variables and Metrics
Variables¹:
- Demographic: sex (male = 1, female = 0), age (continuous), etc.
- Behavioral: current smoker (yes = 1, no = 0), cigarettes per day (continuous), etc.
- Medical (history): blood pressure medication (continuous), previous stroke (yes = 1, no = 0), hypertension (yes = 1, no = 0), diabetes (yes = 1, no = 0), etc.
- Medical (current): total cholesterol level (continuous), systolic blood pressure (continuous), diastolic blood pressure (continuous), body mass index (continuous), heart rate (continuous), glucose level (continuous), etc.
Predicted variable: 10-year risk of coronary heart disease (CHD) (binary: "1" means "Yes" and "0" means "No")
Metric: Model accuracy
¹ Example of variables: https://www.kaggle.com/neisha/heart-disease-prediction-using-logistic-regression
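The accuracy metric named on this slide can be sketched as below. This is a minimal illustration only: the function name `accuracy` and the example labels are assumptions, not taken from the scenario data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = CHD within 10 years, 0 = no CHD.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 3 of the 4 predictions are correct
```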
Business Understanding
Example: Identify Data Source
Data sources
- Internal data, e.g. hospital database
- External data: n/a
Artifacts
- Data source
- Data dictionary
Data Acquisition and Understanding
Steps: Data Ingestion → Data Preprocessing & Exploration → Solution Architecture
Goal
- To produce high-quality data
- To ingest data from the operational environment into the analytic environment
- To develop the solution architecture
Data Acquisition and Understanding
• Data Ingestion
[Diagram: Operational Data Sources, Enterprise Data Store, and Data Warehouse, with the labels "Ingestion" and "Analytic"]
Data Acquisition and Understanding
• Data Preprocessing & Exploration
Major Data Preprocessing Tasks
- Data cleansing, e.g. missing value handling
- Data transformation, e.g. rescaling, normalization
- Data reduction, e.g. data sampling
- Data discretization, e.g. continuous-to-category conversion
- Text cleansing, e.g. inconsistent delimiters
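Two of the tasks above, data transformation and data discretization, can be sketched in plain Python. The function names, the age cut points, and the category labels are illustrative assumptions; the sysBP and age values come from the example rows of the CHD scenario.

```python
from statistics import mean, pstdev


def min_max_rescale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Center values on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def discretize(value, cut_points, labels):
    """Map a continuous value to a category; expects len(labels) == len(cut_points) + 1."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


sys_bp = [106.0, 121.0, 127.5, 150.0]
print(min_max_rescale(sys_bp))
print(z_score_normalize(sys_bp))

# Assumed age groups, chosen only to illustrate continuous-to-category conversion.
age_labels = ["<45", "45-59", "60+"]
print([discretize(age, [45, 60], age_labels) for age in [39, 46, 48, 61]])
```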
Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• From the 10-year risk of coronary heart disease scenario
Internal Data
sex  age  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP  BMI    heartRate  glucose  TenYearCHD
1    39   0.0     0                0             0         195.0    106.0  70.0   26.97  80.0       77.0     0
0    46   0.0     0                0             0         250.0    121.0  81.0   28.73  95.0       76.0     0
1    48   0.0     0                0             0         245.0    127.5  80.0   25.34  75.0       70.0     0
0    61   0.0     0                0             1         225.0    150.0  95.0   28.58  65.0       103.0    1
External Data
currentSmoker  cigsPerDay
0              0.0
0              0.0
1              20.0
1              30.0
Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• From the 10-year risk of coronary heart disease scenario
sex  age  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP  BMI    heartRate  glucose  TenYearCHD
1    39   0              0.0         0.0     0                0             0         195.0    106.0  70.0   26.97  80.0       77.0     0
0    46   0              0.0         0.0     0                0             0         250.0    121.0  81.0   28.73  95.0       76.0     0
1    48   1              20.0        0.0     0                0             0         245.0    127.5  80.0   25.34  75.0       70.0     0
0    61   1              30.0        0.0     0                0             1         225.0    150.0  95.0   28.58  65.0       103.0    1
Legend: currentSmoker and cigsPerDay come from the external data; all other columns come from the internal data.
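Combining the internal and external columns into one table can be sketched as a join on a shared key. The slides do not show how rows are matched, so the `patient_id` keys below are a hypothetical assumption, and only a few columns are included for brevity.

```python
# Hypothetical records keyed by an assumed patient_id (not shown on the slides).
internal = {
    101: {"sex": 1, "age": 39, "TenYearCHD": 0},
    102: {"sex": 0, "age": 46, "TenYearCHD": 0},
    103: {"sex": 1, "age": 48, "TenYearCHD": 0},
    104: {"sex": 0, "age": 61, "TenYearCHD": 1},
}
external = {
    101: {"currentSmoker": 0, "cigsPerDay": 0.0},
    102: {"currentSmoker": 0, "cigsPerDay": 0.0},
    103: {"currentSmoker": 1, "cigsPerDay": 20.0},
    104: {"currentSmoker": 1, "cigsPerDay": 30.0},
}


def merge_sources(left, right):
    """Inner-join two record dictionaries on their shared keys."""
    merged = {}
    for key in left:
        if key in right:
            merged[key] = {**left[key], **right[key]}
    return merged


combined = merge_sources(internal, external)
print(combined[104])
```

An inner join keeps only patients present in both sources; whether unmatched patients should be kept instead is a design decision the slides leave open.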
Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• Data cleansing, e.g. missing value handling
sex  age  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP  BMI    heartRate  glucose  TenYearCHD
1    39   0              0.0         0.0     0                0             0         195.0    106.0  70.0   26.97  80.0       77.0     0
0    -    0              0.0         0.0     0                0             0         250.0    121.0  81.0   28.73  95.0       76.0     0
-    48   1              20.0        0.0     0                0             0         245.0    127.5  80.0   25.34  75.0       70.0     0
0    61   1              30.0        0.0     0                0             1         225.0    150.0  95.0   28.58  65.0       103.0    1
Missing value in categorical data: sex (row 3)
Missing value in continuous data: age (row 2)
Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• Data cleansing, e.g. missing value handling
sex  age  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP  BMI    heartRate  glucose  TenYearCHD
1    39   0              0.0         0.0     0                0             0         195.0    106.0  70.0   26.97  80.0       77.0     0
0    -    0              0.0         0.0     0                0             0         250.0    121.0  81.0   28.73  95.0       76.0     0
-    48   1              20.0        0.0     0                0             0         245.0    127.5  80.0   25.34  75.0       70.0     0
0    61   1              30.0        0.0     0                0             1         225.0    150.0  95.0   28.58  65.0       103.0    1
Missing value in categorical data: sex (row 3)
Missing value in continuous data: age (row 2)
1. Remove the records (transactions) with missing values (when the share of missing values is ≤ 5% or ≤ 10%)
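Removing incomplete records only when they are a small share of the data can be sketched as follows. The function name, the use of `None` for a missing value, and the 40-row data set are assumptions made for illustration.

```python
def drop_incomplete_rows(rows, max_missing_share=0.05):
    """Drop rows containing None, but only if they are a small share of the data."""
    if not rows:
        return []
    incomplete = [r for r in rows if any(v is None for v in r.values())]
    share = len(incomplete) / len(rows)
    if share > max_missing_share:
        raise ValueError(f"{share:.0%} of rows are incomplete; consider imputation")
    return [r for r in rows if all(v is not None for v in r.values())]


# Hypothetical data set: 40 records, 2 of them with a missing value (5%).
rows = [{"sex": i % 2, "age": 40 + i % 20} for i in range(40)]
rows[1]["age"] = None
rows[2]["sex"] = None

cleaned = drop_incomplete_rows(rows, max_missing_share=0.05)
print(len(rows), len(cleaned))
```

Raising an error above the threshold, rather than silently dropping many records, is one possible design; the slide only states the 5% or 10% rule of thumb.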
Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• Data cleansing, e.g. missing value handling
sex  age  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP  BMI    heartRate  glucose  TenYearCHD
1    39   0              0.0         0.0     0                0             0         195.0    106.0  70.0   26.97  80.0       77.0     0
0    -    0              0.0         0.0     0                0             0         250.0    121.0  81.0   28.73  95.0       76.0     0
1    48   1              20.0        0.0     0                0             0         245.0    127.5  80.0   25.34  75.0       70.0     0
0    61   1              30.0        0.0     0                0             1         225.0    150.0  95.0   28.58  65.0       103.0    1
Missing value in categorical data: sex (row 3), now imputed as 1
Missing value in continuous data: age (row 2)
2. Imputation
2.1 Single imputation: median (easy and fast, but it does not account for correlations between features and is not very accurate)
2.2 Multiple imputation: k-NN, EM, etc.
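Single imputation by the median can be sketched as below. The `sex` column here is a hypothetical example, not the scenario data, and `None` marks a missing value. `median_low` is used as an assumption so that the fill value is always one of the observed codes (0 or 1) rather than an average such as 0.5.

```python
from statistics import median_low


def impute_median(values):
    """Replace None entries with the (low) median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median_low(observed)
    return [fill if v is None else v for v in values]


# Hypothetical binary column with one missing entry.
sex = [1, 0, None, 0, 1, 1]
print(impute_median(sex))
```

Every missing entry receives the same value regardless of the patient's other features, which is the weakness the slide points out.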
Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• Data cleansing, e.g. missing value handling
sex  age  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP  BMI    heartRate  glucose  TenYearCHD
1    39   0              0.0         0.0     0                0             0         195.0    106.0  70.0   26.97  80.0       77.0     0
0    44   0              0.0         0.0     0                0             0         250.0    121.0  81.0   28.73  95.0       76.0     0
1    48   1              20.0        0.0     0                0             0         245.0    127.5  80.0   25.34  75.0       70.0     0
0    61   1              30.0        0.0     0                0             1         225.0    150.0  95.0   28.58  65.0       103.0    1
Missing value in categorical data: sex (row 3), imputed as 1
Missing value in continuous data: age (row 2), imputed as 44
3. Imputation
3.1 Single imputation: mean rounded up, e.g. ceil(43.3) = 44
3.2 Multiple imputation: k-NN, etc.
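Imputation with k-NN can be sketched as: find the k complete records closest to the incomplete one, then average their values for the missing column. The function name `knn_impute`, the choice of sysBP and diaBP as distance features, and k = 2 are assumptions for illustration; in practice the features would usually be rescaled first so that no single one dominates the distance.

```python
import math


def knn_impute(rows, target, features, k=2):
    """Fill missing `target` values with the mean target of the k nearest complete rows."""
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        query = [row[f] for f in features]
        # Rank the complete rows by Euclidean distance over the chosen features.
        nearest = sorted(
            complete,
            key=lambda c: math.dist(query, [c[f] for f in features]),
        )[:k]
        estimate = sum(n[target] for n in nearest) / len(nearest)
        filled.append({**row, target: estimate})
    return filled


rows = [
    {"age": 39, "sysBP": 106.0, "diaBP": 70.0},
    {"age": None, "sysBP": 121.0, "diaBP": 81.0},
    {"age": 48, "sysBP": 127.5, "diaBP": 80.0},
    {"age": 61, "sysBP": 150.0, "diaBP": 95.0},
]
imputed = knn_impute(rows, "age", ["sysBP", "diaBP"], k=2)
print(imputed[1]["age"])
```

Unlike the mean or median, the estimate depends on the record's other features, so similar patients receive similar imputed values.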
K-Nearest Neighbors (k-NN)
a) k-NN is a supervised learning algorithm
b) It has no explicit training phase or model
c) It can be applied to both classification and regression problems
d) It uses the k nearest neighbors of x to vote on the label of x
A k-NN framework
1. Define or initialize k
2. Compute the distance between the test instance and each training instance
3. Sort the distances
4. Take the k nearest neighbors
5. Apply a simple majority vote
6. Output the class
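The framework above can be sketched in a few lines of Python. The training points, labels, and the function name `knn_classify` are hypothetical, and Euclidean distance is an assumed choice of metric.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the test instance to each training instance.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort the distances in ascending order.
    distances.sort(key=lambda pair: pair[0])
    # Keep the labels of the k nearest neighbors.
    nearest_labels = [label for _, label in distances[:k]]
    # Simple majority vote; on a tie the label seen first (the nearer one) wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D training data with two classes.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 5.0), (7.0, 7.0), (6.5, 6.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(train_points, train_labels, query=(2.0, 2.0), k=3))
print(knn_classify(train_points, train_labels, query=(6.0, 6.0), k=3))
```

Note that there is no training step: all the work happens when a query arrives, which is why classification time grows with the size of the training set.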
A majority vote in K-Nearest Neighbors
k-NN classifies a point by the majority vote of its k closest training points.
[Figure: three scatter plots (x1 vs. x2) showing the query point x with its 1-nearest neighbor, 2-nearest neighbors, and 3-nearest neighbors]
k-NN in a classification problem
Decision boundaries are not computed explicitly.
For 1-NN, the boundaries between distinct classes form a subset of the Voronoi diagram of the training data.
Each line segment is equidistant from its two neighboring points.
[Figure omitted. Image by MIT OpenCourseWare]
Summary
a) k-NN can deal with complex and arbitrary decision boundaries
b) Despite its simplicity, researchers have shown that the classification accuracy of k-NN can be quite strong, in many cases as accurate as more elaborate methods
c) k-NN is slow at classification time
d) k-NN does not produce an understandable model
Data Acquisition and Understanding
• Solution Architecture
Batch: The standard architecture for a data warehouse, such as a data mart.
Stream: Data flows continuously from the data sources. This idea is called a data lake.
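The difference between the two styles can be sketched with a simple statistic. In the batch version the full data set is available up front; in the stream version records are processed one at a time as they arrive. The function names are assumptions, and the heart-rate values are taken from the example rows of the CHD scenario.

```python
def batch_mean(records):
    """Batch style: the full data set is available before processing starts."""
    return sum(records) / len(records)


def stream_means(source):
    """Stream style: update the running mean as each record arrives."""
    count, total = 0, 0.0
    for value in source:
        count += 1
        total += value
        yield total / count


heart_rates = [80.0, 95.0, 75.0, 65.0]
print(batch_mean(heart_rates))
print(list(stream_means(iter(heart_rates))))
```

Once all records have arrived, the last streamed value equals the batch result; the stream version simply makes intermediate results available earlier.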
Data Acquisition and Understanding
• Solution Architecture
Data Mart: Like a store of bottled water: cleansed, packaged, and structured for easy consumption.
Data Lake: Like a large body of water in a more natural state. Various users of the lake can come to examine it, dive in, or take samples.
Modeling
Steps: Feature Engineering → Modeling
Goal
- To create a list of feature vectors from raw data
- To create a machine learning model
Modeling
• Feature Engineering
Structured Data
- Pick all variables relevant to the target class
- Data preprocessing
Male  Age  Location          …  TenYearCHD
1     39   Bangkok           …  0
0     46   Bangkok           …  0
1     48   Nakon Ratchasima  …  0
0     61   Nakon Ratchasima  …  1
0     46   Chonburi          …  0
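A text column such as Location has to become numeric before it can enter a feature vector. One common option, assumed here since the slides do not name a method, is one-hot encoding. The function name and the `Location=<value>` column naming are illustrative choices, and the elided columns (…) are left out.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded_rows = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded_rows.append(new_row)
    return encoded_rows


rows = [
    {"Male": 1, "Age": 39, "Location": "Bangkok", "TenYearCHD": 0},
    {"Male": 0, "Age": 46, "Location": "Bangkok", "TenYearCHD": 0},
    {"Male": 1, "Age": 48, "Location": "Nakon Ratchasima", "TenYearCHD": 0},
    {"Male": 0, "Age": 61, "Location": "Nakon Ratchasima", "TenYearCHD": 1},
    {"Male": 0, "Age": 46, "Location": "Chonburi", "TenYearCHD": 0},
]
encoded = one_hot_encode(rows, "Location")
for row in encoded:
    print(row)
```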
Modeling
• Feature Engineering
Structured Data
- Pick all variables relevant to the target class
- Data preprocessing
sex  age  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP  BMI    heartRate  glucose  TenYearCHD
1    39   0              0.0         0.0     0                0             0         195.0    106.0  70.0   26.97  80.0       77.0     0
0    46   0              0.0         0.0     0                0             0         250.0    121.0  81.0   28.73  95.0       76.0     0
1    48   1              20.0        0.0     0                0             0         245.0    127.5  80.0   25.34  75.0       70.0     0
0    61   1              30.0        0.0     0                0             1         225.0    150.0  95.0   28.58  65.0       103.0    1
Modeling
• Modeling
- Unknown target function f: X → y (e.g. the ideal disease diagnosis function)
- Data collection (e.g. patients who agree to be participants)
- Training samples (x1, y1), (x2, y2), …, (xN, yN) (e.g. participants' medical records)
- Hypothesis set H
- Learning algorithm A: uses the training samples to select a hypothesis from H
- Final hypothesis g ≈ f (e.g. the approximated disease diagnosis function)
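The roles of H, A, and g can be made concrete with a toy sketch. Here the hypothesis set H is assumed to be a family of age thresholds, and the learning algorithm A simply picks the threshold with the lowest training error. The samples and all names are illustrative, not a real diagnostic model.

```python
def make_hypothesis(threshold):
    """One member of the hypothesis set H: predict 1 when x >= threshold."""
    return lambda x: 1 if x >= threshold else 0


def training_error(hypothesis, samples):
    """Fraction of training samples the hypothesis gets wrong."""
    return sum(1 for x, y in samples if hypothesis(x) != y) / len(samples)


def learn(samples, thresholds):
    """Learning algorithm A: return the threshold in H with the lowest training error."""
    return min(thresholds, key=lambda t: training_error(make_hypothesis(t), samples))


# Hypothetical training samples (x = age, y = outcome); values are illustrative only.
samples = [(39, 0), (46, 0), (48, 0), (61, 1)]
best_threshold = learn(samples, thresholds=range(30, 71, 5))
g = make_hypothesis(best_threshold)  # final hypothesis g, intended to approximate f
print(best_threshold, training_error(g, samples))
```

A low training error does not guarantee that g is close to f on new patients; that has to be checked on data not used for learning.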
Deployment
Goal
- Deploy models with a data pipeline to a production or production-like environment for final user acceptance
- Track model performance and improve the model if required
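Tracking model performance after deployment can be sketched as a rolling accuracy check over the most recent predictions whose true outcomes have become known. The class name, window size, and alert threshold are assumptions chosen for illustration.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions once true outcomes arrive."""

    def __init__(self, window_size=100, alert_below=0.8):
        self.results = deque(maxlen=window_size)  # oldest results drop out automatically
        self.alert_below = alert_below

    def record(self, predicted, actual):
        self.results.append(predicted == actual)

    def accuracy(self):
        return sum(self.results) / len(self.results) if self.results else None

    def needs_retraining(self):
        acc = self.accuracy()
        return acc is not None and acc < self.alert_below


# Hypothetical (predicted, actual) pairs fed back from production.
monitor = AccuracyMonitor(window_size=4, alert_below=0.75)
for predicted, actual in [(0, 0), (1, 1), (0, 1), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_retraining())
```

When `needs_retraining()` returns True, the model can be improved, for example by retraining on more recent data.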
Deployment
[Diagram with two parts, Batch Training and Real-time Prediction; elements: Historical Data, Analytic Data, Model, Prediction, Live Data, Feedback]
References
[1] A. Lauren, G.L. Gabriel, and W. Joschua, "CS109 Data Science", School of Engineering and Applied Sciences, Harvard, [online] http://cs109.github.io/2015/, [15 January 2019].
[2] J. Wood, "Data Science and the Data Science Process", Wintellect, [online] https://www.wintellect.com/data-science-data-science-process/, [15 January 2019].
[3] S. Nick, E. Gary, et al., "What is the Team Data Science Process", [online] https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview, [15 January 2019].