DataScienceProcess 14may2019

The document summarizes the data science process which includes 4 main steps: 1) business understanding to define the problem and identify data sources, 2) data acquisition and understanding to ingest and preprocess data, 3) modeling to analyze data and create models, and 4) deployment to share findings. It provides examples for the business understanding step applied to predicting coronary heart disease risk.


Data Science Process

Data Science for Health Care Conference


May 14, 2019
Lect. Anuchate Pattanateepapon, D.Eng.
Section for Clinical Epidemiology and Biostatistics
Mahidol University Faculty of Medicine Ramathibodi Hospital
© 2019
Outline
• Introduction to Data Science Process
• Data Science Lifecycle
- Business Understanding
- Data Acquisition and Understanding
- Modeling
- Deployment

What is a Data Science Process?
• The Data Science Process [1] is a framework for approaching data science
tasks.
• The Data Science Process [2] is a process for finding relevant insights
in the data that will help the business achieve higher profits.
• The Team Data Science Process (TDSP) [3] is an agile, iterative data
science methodology to deliver predictive analytics solutions and intelligent
applications efficiently. TDSP helps improve team collaboration and
learning.

Data Science Lifecycle

Source: https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle

Data Science Lifecycle
Business Understanding: In this first step, we try to get a better idea of
which business needs we should be addressing with the data.

Source: https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle

Data Science Lifecycle

Data Acquisition and Understanding: This step is about getting a
business-level idea of the data that you have and understanding what each
part of the data means.

Source: https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle

Data Science Lifecycle

Modeling: This is where statistics and data analysis come in to create a
model that best fits the data.

Source: https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle

Data Science Lifecycle

Deployment: This is where you share your findings from the data.

Source: https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle

Business Understanding
Example
- Identify Business Problem
- Problem Formalization
- Identify variables and metrics
- Identify Data Source

Goal: To identify
- What data to use
- Where the data are
- How to measure

Scenario: 10-year risk of coronary heart disease (CHD)
The early prognosis of cardiovascular diseases can aid in making decisions on lifestyle changes in high-risk patients and in turn reduce the complications.

Problem: How to predict coronary heart disease with acceptable accuracy?

Business Understanding
Example
Problem settings:
Given a patient’s demographic, behavioral and medical risk factors.
Formalization: Regression Problem
Inputs: X1, X2, …, Xm
Output: Y

Business Understanding
Example

Variables¹:
- Demographic: sex (male = 1/female = 0), age (continuous), etc.
- Behavioral: current smoker (yes = 1/no = 0), cigarettes per day (continuous), etc.
- Medical (history): blood pressure medication (continuous), the patient previously had a stroke (yes = 1/no = 0), the patient was hypertensive (yes = 1/no = 0), the patient had diabetes (yes = 1/no = 0), etc.
- Medical (current): total cholesterol level (continuous), systolic blood pressure (continuous), diastolic blood pressure (continuous), body mass index (continuous), heart rate (continuous), glucose level (continuous), etc.

Predicted variable: 10-year risk of coronary heart disease (CHD) (binary: “1” means “Yes” and “0” means “No”)
Metrics: Model accuracy
¹ Example of variables: https://www.kaggle.com/neisha/heart-disease-prediction-using-logistic-regression
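
Since the metric named on this slide is model accuracy, the following is a minimal sketch of how it could be computed for the binary 10-year CHD label. The function name `accuracy` and the sample labels are illustrative assumptions and do not come from the slides.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = CHD within 10 years, 0 = no CHD.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 3 of the 4 predictions are correct
```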

Business Understanding
Example
Data sources
- Internal Data, e.g. hospital database
- External Data, n/a

Artifacts
- Data source
- Data Dictionary

Data Acquisition and Understanding

- Data Ingestion
- Data Preprocessing & Exploration
- Solution Architecture

Goal
- To produce high quality data
- To ingest data from operation to analytic environment
- To develop solution architecture

Data Acquisition and Understanding

• Data Ingestion
[Figure: ingestion flow with the labels Data Sources, Operational Data Store, Enterprise Data Warehouse, Ingestion, Analytic]

Data Acquisition and Understanding

Major Data Preprocessing Tasks
- Data cleansing, e.g. missing value handling
- Data transformation, e.g. rescaling, normalization
- Data reduction, e.g. data sampling
- Data discretization, e.g. continuous to category conversion
- Text cleansing, e.g. inconsistent delimiters
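
To make the transformation and discretization tasks concrete, here is a small sketch using only the Python standard library. The function names, the age values reused from the example rows, and the age bands are assumptions chosen for illustration; the slides do not prescribe a particular method.

```python
from statistics import mean, pstdev


def rescale_min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Normalize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def discretize(value, cut_points, labels):
    """Map a continuous value to a category using ascending cut points.

    `labels` must have exactly one more entry than `cut_points`.
    """
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


ages = [39, 46, 48, 61]
print(rescale_min_max(ages))
print(standardize(ages))
# Hypothetical age bands, chosen only for illustration.
print([discretize(a, [40, 60], ["<40", "40-59", "60+"]) for a in ages])
```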

Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• From the 10-year risk of coronary heart disease scenario
Internal Data, External Data

currentSmoker cigsPerDay
0 0.0
0 0.0
1 20.0
1 30.0

sex age BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
1 39 0.0 0 0 0 195.0 106.0 70.0 26.97 80.0 77.0 0
0 46 0.0 0 0 0 250.0 121.0 81.0 28.73 95.0 76.0 0
1 48 0.0 0 0 0 245.0 127.5 80.0 25.34 75.0 70.0 0
0 61 0.0 0 0 1 225.0 150.0 95.0 28.58 65.0 103.0 1

Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• From the 10-year risk of coronary heart disease scenario
sex age currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
1 39 0 0.0 0.0 0 0 0 195.0 106.0 70.0 26.97 80.0 77.0 0
0 46 0 0.0 0.0 0 0 0 250.0 121.0 81.0 28.73 95.0 76.0 0
1 48 1 20.0 0.0 0 0 0 245.0 127.5 80.0 25.34 75.0 70.0 0
0 61 1 30.0 0.0 0 0 1 225.0 150.0 95.0 28.58 65.0 103.0 1

Internal Data
External Data

Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• Data cleansing, e.g. missing value handling
sex age currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
1 39 0 0.0 0.0 0 0 0 195.0 106.0 70.0 26.97 80.0 77.0 0
0 - 0 0.0 0.0 0 0 0 250.0 121.0 81.0 28.73 95.0 76.0 0
- 48 1 20.0 0.0 0 0 0 245.0 127.5 80.0 25.34 75.0 70.0 0
0 61 1 30.0 0.0 0 0 1 225.0 150.0 95.0 28.58 65.0 103.0 1

Missing value in categorical data


Missing value in continuous data

Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• Data cleansing, e.g. missing value handling
1. Remove the records with missing values (when missing values ≤ 5% or ≤ 10%)
sex age currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
1 39 0 0.0 0.0 0 0 0 195.0 106.0 70.0 26.97 80.0 77.0 0
0 - 0 0.0 0.0 0 0 0 250.0 121.0 81.0 28.73 95.0 76.0 0
- 48 1 20.0 0.0 0 0 0 245.0 127.5 80.0 25.34 75.0 70.0 0
0 61 1 30.0 0.0 0 0 1 225.0 150.0 95.0 28.58 65.0 103.0 1

Missing value in categorical data


Missing value in continuous data
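
The removal option on this slide can be sketched as follows. This is only an illustration: it assumes missing values are stored as `None`, interprets the 5% or 10% limit as the share of records that contain a missing value, and uses a hypothetical function name `drop_incomplete_rows`.

```python
def drop_incomplete_rows(rows, max_missing_fraction=0.05):
    """Drop rows containing None if they are a small enough share of the data.

    Returns the cleaned rows, or a copy of the original rows when too many
    rows are incomplete for deletion to be a reasonable choice.
    """
    if not rows:
        return []
    incomplete = [r for r in rows if any(v is None for v in r.values())]
    fraction = len(incomplete) / len(rows)
    if fraction > max_missing_fraction:
        return list(rows)
    return [r for r in rows if all(v is not None for v in r.values())]


# Hypothetical data: 38 complete records and 2 records with a missing value.
complete = [{"sex": 1, "age": 39} for _ in range(38)]
incomplete = [{"sex": 0, "age": None}, {"sex": None, "age": 48}]
cleaned = drop_incomplete_rows(complete + incomplete, max_missing_fraction=0.10)
print(len(cleaned))
```

When the share of incomplete records exceeds the limit, the function leaves the data untouched, which is the point at which imputation becomes the more natural option.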

Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• Data cleansing, e.g. missing value handling
sex age currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
1 39 0 0.0 0.0 0 0 0 195.0 106.0 70.0 26.97 80.0 77.0 0
0 - 0 0.0 0.0 0 0 0 250.0 121.0 81.0 28.73 95.0 76.0 0
1 48 1 20.0 0.0 0 0 0 245.0 127.5 80.0 25.34 75.0 70.0 0
0 61 1 30.0 0.0 0 0 1 225.0 150.0 95.0 28.58 65.0 103.0 1

Missing value in categorical data
Missing value in continuous data

2. Imputation
2.1 Single imputation: median (easy and fast, but it does not factor in the correlations between features and is not very accurate)
2.2 Multiple imputation: k-NN, EM, etc.
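
Single imputation with the median could look like the sketch below. It assumes missing values are stored as `None` and uses `statistics.median_low` (a median variant that always returns an observed value) so that a 0/1 column is never filled with 0.5; that choice, the function name `impute_median`, and the sample records are assumptions made for illustration.

```python
from statistics import median_low


def impute_median(rows, column):
    """Return copies of rows with None in `column` replaced by the column median."""
    observed = [r[column] for r in rows if r[column] is not None]
    if not observed:
        raise ValueError(f"no observed values in column {column!r}")
    fill = median_low(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


# Hypothetical records; None marks the missing categorical value.
rows = [
    {"sex": 1, "age": 39},
    {"sex": 1, "age": 46},
    {"sex": None, "age": 48},
    {"sex": 0, "age": 61},
]
imputed = impute_median(rows, "sex")
print(imputed[2])
```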

Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• Data cleansing, e.g. missing value handling
sex age currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
1 39 0 0.0 0.0 0 0 0 195.0 106.0 70.0 26.97 80.0 77.0 0
0 44 0 0.0 0.0 0 0 0 250.0 121.0 81.0 28.73 95.0 76.0 0
1 48 1 20.0 0.0 0 0 0 245.0 127.5 80.0 25.34 75.0 70.0 0
0 61 1 30.0 0.0 0 0 1 225.0 150.0 95.0 28.58 65.0 103.0 1

Missing value in categorical data
Missing value in continuous data

3. Imputation
3.1 Single imputation: mean rounded up to an integer, e.g. ceil(43.3) = 44
3.2 Multiple imputation: k-NN, etc.
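
As an illustration of the k-NN option, the sketch below fills a missing age with the rounded average age of the k most similar complete records. The choice of `sysBP` and `diaBP` as the similarity features, k = 2, and the function name `knn_impute` are assumptions for this example only; in practice the features would usually be rescaled first.

```python
import math


def knn_impute(rows, target, features, k=2):
    """Fill None in `target` with the mean of the k nearest complete rows.

    Distance is Euclidean over `features`, which are assumed to be complete
    and on comparable scales.
    """
    donors = [r for r in rows if r[target] is not None]
    if len(donors) < k:
        raise ValueError("not enough complete rows to impute from")
    result = []
    for r in rows:
        if r[target] is not None:
            result.append(dict(r))
            continue
        nearest = sorted(
            donors,
            key=lambda d: math.dist([r[f] for f in features],
                                    [d[f] for f in features]),
        )[:k]
        estimate = sum(d[target] for d in nearest) / k
        result.append({**r, target: round(estimate)})
    return result


# Age and blood pressure values from the example rows; None marks the gap.
rows = [
    {"age": 39, "sysBP": 106.0, "diaBP": 70.0},
    {"age": None, "sysBP": 121.0, "diaBP": 81.0},
    {"age": 48, "sysBP": 127.5, "diaBP": 80.0},
    {"age": 61, "sysBP": 150.0, "diaBP": 95.0},
]
filled = knn_impute(rows, "age", ["sysBP", "diaBP"], k=2)
print(filled[1])
```

Unlike the mean or median, this estimate depends on the other features of the incomplete record, so correlations between features are taken into account.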

K-Nearest Neighbors (k-NN)
a) k-NN is a supervised learning algorithm
b) No explicit training or model
c) Can be applied to both classification and regression problems
d) Uses the k nearest neighbors of x to vote on the label of x

A k-NN framework
1. Define or initialize K
2. Compute the distances (test instance to each training instance)
3. Sort the distances
4. Take the K nearest neighbors
5. Apply simple majority
6. Output the class
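
The framework above can be sketched in a few lines of Python. This is a minimal illustration that assumes Euclidean distance and breaks voting ties in favor of the label seen first among the nearest neighbors; the function name `knn_classify` and the toy points are hypothetical.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by simple majority among its k nearest training points."""
    if not 1 <= k <= len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Compute the distance from the test instance to each training instance.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort the distances and take the K nearest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Apply simple majority; ties go to the label seen first among the nearest.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy two-dimensional training data with two classes.
train_points = [(1, 1), (1, 2), (2, 1), (5, 5), (6, 5), (5, 6)]
train_labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(train_points, train_labels, (2, 2), k=3))
print(knn_classify(train_points, train_labels, (5, 4), k=3))
```

All the work happens at prediction time, since every query is compared with every training point.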

A majority vote in K-Nearest Neighbors
k-NN classifies by using the majority vote of the k closest training points
[Figure: three panels on axes x1 and x2, each with a query point x: 1-nearest neighbor, 2-nearest neighbor, 3-nearest neighbor]

k-NN in a classification problem
• No explicit computation of decision boundaries
• The boundaries between distinct classes form a subset of the Voronoi diagram of the training data
• Each line segment is equidistant from its two neighboring points

Image by MIT OpenCourseWare

Summary
a) k-NN can deal with complex and arbitrary decision boundaries
b) Despite its simplicity, researchers have shown that the classification
accuracy of k-NN can be quite strong and in many cases as
accurate as that of more elaborate methods
c) k-NN is slow at classification time
d) k-NN does not produce an understandable model

Data Acquisition and Understanding

Batch:
Standard architecture for data warehouse such as Data mart

Stream:
Data flows continuously from the data sources. This idea is called Data Lake.
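
The difference between the two processing styles can be illustrated with a small sketch: a batch computation sees all records at once, while a streaming computation updates its result as each record arrives. The function names and the use of a running mean are assumptions for illustration and say nothing about any specific warehouse or lake product.

```python
def batch_mean(values):
    """Batch style: all records are available before processing starts."""
    return sum(values) / len(values)


def stream_means(source):
    """Stream style: update the running mean as each record arrives."""
    count, total = 0, 0.0
    for value in source:
        count += 1
        total += value
        yield total / count


# Heart rate values from the example rows.
heart_rates = [80.0, 95.0, 75.0, 65.0]
print(batch_mean(heart_rates))
print(list(stream_means(iter(heart_rates))))
```

After the last record, the running mean equals the batch mean; the streaming version simply makes intermediate results available earlier.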

Data Acquisition and Understanding
Data Mart:
Just like a store of bottled water: cleansed, packaged and structured for easy consumption

Data Lake:
Similar to a large body of water in a more natural state. Various users of the lake can come to examine, dive in, or take samples
Modeling

- Feature Engineering
- Modeling

Goal
- To create a list of feature vectors from raw data
- To create a machine learning model

Modeling
Structured Data
- Pick all variables relevant to the target class
- Data Preprocessing

Male  Age  Location          …  TenYearCHD
1     39   Bangkok           …  0
0     46   Bangkok           …  0
1     48   Nakon Ratchasima  …  0
0     61   Nakon Ratchasima  …  1
0     46   Chonburi          …  0
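
A text column such as Location has to be turned into numbers before it can be part of a feature vector. One common option, shown here only as an assumption since the slides do not name an encoding, is one-hot encoding; the function name `one_hot_encode` is hypothetical.

```python
def one_hot_encode(records, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        row = {key: value for key, value in r.items() if key != column}
        for category in categories:
            row[f"{column}={category}"] = 1 if r[column] == category else 0
        encoded.append(row)
    return encoded


# The visible columns of the example table (elided columns are left out).
records = [
    {"Male": 1, "Age": 39, "Location": "Bangkok"},
    {"Male": 0, "Age": 46, "Location": "Bangkok"},
    {"Male": 1, "Age": 48, "Location": "Nakon Ratchasima"},
    {"Male": 0, "Age": 61, "Location": "Nakon Ratchasima"},
    {"Male": 0, "Age": 46, "Location": "Chonburi"},
]
for row in one_hot_encode(records, "Location"):
    print(row)
```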

Modeling
Structured Data
- Pick all variables relevant to the target class
- Data Preprocessing

sex age currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
1 39 0 0.0 0.0 0 0 0 195.0 106.0 70.0 26.97 80.0 77.0 0
0 46 0 0.0 0.0 0 0 0 250.0 121.0 81.0 28.73 95.0 76.0 0
1 48 1 20.0 0.0 0 0 0 245.0 127.5 80.0 25.34 75.0 70.0 0
0 61 1 30.0 0.0 0 0 1 225.0 150.0 95.0 28.58 65.0 103.0 1

Modeling
- UNKNOWN TARGET FUNCTION: f: X → y (Ex. Ideal disease diagnosis function)
- DATA COLLECTION: Ex. Patients who agree to be participants
- TRAINING SAMPLES: (x1, y1), (x2, y2), …, (xN, yN) (Ex. Participants’ medical records)
- LEARNING ALGORITHM: A
- HYPOTHESIS SET: H
- FINAL HYPOTHESIS: g ≈ f (Ex. Approximated disease diagnosis function)
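
The roles of the hypothesis set H, the learning algorithm A, and the final hypothesis g can be illustrated with a deliberately tiny sketch. It assumes H is a handful of threshold rules on a single feature and that A simply picks the rule with the lowest training error; the thresholds and function names are hypothetical, and real models use far richer hypothesis sets.

```python
def make_threshold_hypothesis(threshold):
    """h(x) = 1 if x >= threshold else 0, for a single numeric feature x."""
    return lambda x: 1 if x >= threshold else 0


def training_error(h, samples):
    """Fraction of training samples (x, y) that h misclassifies."""
    return sum(1 for x, y in samples if h(x) != y) / len(samples)


def learn(samples, thresholds):
    """Learning algorithm A: pick the g in H with the lowest training error."""
    hypothesis_set = [(t, make_threshold_hypothesis(t)) for t in thresholds]
    best_threshold, g = min(
        hypothesis_set, key=lambda pair: training_error(pair[1], samples)
    )
    return best_threshold, g


# Toy training samples (x = sysBP, y = TenYearCHD) from the example rows.
samples = [(106.0, 0), (121.0, 0), (127.5, 0), (150.0, 1)]
threshold, g = learn(samples, thresholds=[100, 120, 140, 160])
print(threshold, training_error(g, samples))
```

A perfect fit on four samples says little about how well g approximates f on new patients, which is why the accuracy metric is normally measured on data not used for training.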

Deployment

Goal
- Deploy models with a data pipeline to a production or production-like environment for final user acceptance
- Track model performance and improve it if required
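
Tracking model performance after deployment could be sketched as a sliding-window accuracy check. The class name `AccuracyMonitor`, the window size, and the accuracy threshold are illustrative assumptions and not part of the slides.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions with known outcomes."""

    def __init__(self, window_size=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window_size)
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether one prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_improvement(self):
        """True when the windowed accuracy has fallen below the threshold."""
        current = self.accuracy()
        return current is not None and current < self.min_accuracy


# Hypothetical (predicted, actual) pairs collected from feedback.
monitor = AccuracyMonitor(window_size=4, min_accuracy=0.75)
for predicted, actual in [(0, 0), (1, 1), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_improvement())
```

When `needs_improvement()` returns true, that is the signal to revisit the earlier lifecycle steps, for example by retraining on more recent data.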

Deployment

[Figure: deployment pipeline with Batch Training and Real-time Prediction; labels: Historical Data, Analytic Data, Model, Prediction, Live Data, Feedback]

Reference
[1] A. Lauren, G.L. Gabriel, and W. Joschua, “CS109 Data Science”, School of Engineering and Applied Sciences, Harvard, [online] http://cs109.github.io/2015/, [15 January, 2019].
[2] J. Wood, “Data Science and the Data Science Process”, Wintellect, [online] https://www.wintellect.com/data-science-data-science-process/, [15 January, 2019].
[3] S. Nick, E. Gary et al., “What is the Team Data Science Process”, [online] https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview, [15 January, 2019].
