Data Science Process

Data Science for Health Care Conference

May 14, 2019
Lect. Anuchate Pattanateepapon, D.Eng.
Section for Clinical Epidemiology and Biostatistics
Mahidol University Faculty of Medicine Ramathibodi Hospital
© 2019
Outline
• Introduction to Data Science Process
• Data Science Lifecycle
- Business Understanding
- Data Acquisition and Understanding
- Modeling
- Deployment

What is a Data Science Process?
• The Data Science Process [1] is a framework for approaching data science
tasks.
• The Data Science Process [2] is a process for finding relevant information
inside the data that will transform the business toward higher profits.
• The Team Data Science Process (TDSP) [3] is an agile, iterative data
science methodology for delivering predictive analytics solutions and
intelligent applications efficiently. TDSP helps improve team collaboration
and learning.

Data Science Lifecycle

Source: https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle

Data Science Lifecycle
Business Understanding: In this first step, we try to get a better idea of
which business needs the data should address.

Data Science Lifecycle
Data Acquisition and Understanding: This step is about getting a
business-level picture of the data that you have and understanding what each
part of the data means.

Data Science Lifecycle
Modeling: This is where statistics and data analysis come in, to create a
model that best fits the data.

Data Science Lifecycle
Deployment: This is where you share your findings from the data.

Business Understanding
Example
Steps: Identify Business Problem, Problem Formalization, Identify variables
and metrics, Identify Data Source

Goal: To identify
- What data to use
- Where the data are
- How to measure

Scenario: 10-year risk of coronary heart disease (CHD)
The early prognosis of cardiovascular diseases can aid in making decisions on
lifestyle changes in high-risk patients and in turn reduce complications.

Problem: How can coronary heart disease be predicted with acceptable
accuracy?

Business Understanding
Example: Problem Formalization

Problem setting: Given a patient’s demographic, behavioral and medical risk
factors.
Formalization: Regression problem
X1, X2, …, Xm → Y

Business Understanding
Example: Identify variables and metrics

Variables¹:
- Demographic: sex (male = 1/female = 0), age (continuous), etc.
- Behavioral: current smoker (yes = 1/no = 0), cigarettes per day
(continuous), etc.
- Medical (history): blood pressure medication (continuous), the patient
previously had a stroke (yes = 1/no = 0), the patient was hypertensive
(yes = 1/no = 0), the patient had diabetes (yes = 1/no = 0), etc.
- Medical (current): total cholesterol level (continuous), systolic blood
pressure (continuous), diastolic blood pressure (continuous), body mass
index (continuous), heart rate (continuous), glucose level (continuous), etc.

Predicted variable: 10-year risk of coronary heart disease, CHD (binary: “1”
means “Yes” and “0” means “No”)
Metrics: Model accuracy

¹ Example of variables: https://www.kaggle.com/neisha/heart-disease-prediction-using-logistic-regression

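The accuracy metric named above can be made concrete with a short sketch. It computes the fraction of predictions that match the true labels. The predicted labels below are hypothetical values invented for illustration; they are not results from the lecture.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# TenYearCHD labels of the four sample patients used in this lecture.
y_true = [0, 0, 0, 1]
# Hypothetical model output, invented for this example only.
y_pred = [0, 0, 1, 1]

print(accuracy(y_true, y_pred))  # 3 of the 4 predictions are correct
```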
Business Understanding
Example: Identify Data Source

Data sources
- Internal data, e.g. hospital database
- External data: n/a

Artifacts
- Data source
- Data dictionary

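A data dictionary artifact could be kept in machine-readable form so that incoming records can be checked against it. The following is a minimal sketch under stated assumptions: the `Field` class, the `validate` helper and the two kinds ("binary", "continuous") are hypothetical names chosen for this example, and only a few of the variables from the lecture are listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    kind: str          # "binary" or "continuous" (illustrative categories)
    description: str


# Entries follow the variable list of the CHD scenario; the structure itself
# is an assumption made for this sketch.
DATA_DICTIONARY = [
    Field("sex", "binary", "male = 1, female = 0"),
    Field("age", "continuous", "age of the patient"),
    Field("currentSmoker", "binary", "yes = 1, no = 0"),
    Field("cigsPerDay", "continuous", "cigarettes per day"),
    Field("TenYearCHD", "binary", "10-year CHD risk, yes = 1, no = 0"),
]


def validate(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one record (empty if it is valid)."""
    problems = []
    for field in dictionary:
        if field.name not in record:
            problems.append(f"missing field: {field.name}")
        elif field.kind == "binary" and record[field.name] not in (0, 1):
            problems.append(f"{field.name} must be 0 or 1")
        elif field.kind == "continuous" and not isinstance(
            record[field.name], (int, float)
        ):
            problems.append(f"{field.name} must be numeric")
    return problems


good = {"sex": 1, "age": 39, "currentSmoker": 0, "cigsPerDay": 0.0, "TenYearCHD": 0}
bad = {"sex": 2, "age": 39, "currentSmoker": 0, "cigsPerDay": 0.0}

print(validate(good))
print(validate(bad))
```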
Data Acquisition and Understanding
Steps: Data Ingestion, Data Preprocessing & Exploration, Solution Architecture

Goal
- To produce high-quality data
- To ingest data from the operational environment into the analytic
environment
- To develop the solution architecture

Data Acquisition and Understanding
• Data Ingestion
[Figure: data ingestion diagram with the labels Data Sources, Operational
Data Store, Enterprise Data Warehouse, Ingestion, Analytic]

Data Acquisition and Understanding

Major Data Preprocessing Tasks
- Data cleansing, e.g. missing value handling
- Data transformation, e.g. rescaling, normalization
- Data reduction, e.g. data sampling
- Data discretization, e.g. continuous-to-category conversion
- Text cleansing, e.g. inconsistent delimiters

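Two of the tasks listed above, data transformation and data discretization, can be sketched in a few lines of plain Python. The function names, the bin edges and the group labels are assumptions made for this example; the ages are those of the four sample patients in the CHD scenario.

```python
from statistics import mean, pstdev


def min_max_rescale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Normalize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def discretize(value, edges, labels):
    """Map a continuous value to a category; labels has one more item than edges."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


ages = [39, 46, 48, 61]
scaled = min_max_rescale(ages)
normalized = z_score(ages)
# Hypothetical age groups, chosen only to illustrate the conversion.
age_groups = [discretize(a, [40, 60], ["<40", "40-59", "60+"]) for a in ages]

print(scaled)
print(age_groups)
```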
Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• From the 10-year risk of coronary heart disease scenario

Internal Data
sex  age  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP  BMI    heartRate  glucose  TenYearCHD
1    39   0.0     0                0             0         195.0    106.0  70.0   26.97  80.0       77.0     0
0    46   0.0     0                0             0         250.0    121.0  81.0   28.73  95.0       76.0     0
1    48   0.0     0                0             0         245.0    127.5  80.0   25.34  75.0       70.0     0
0    61   0.0     0                0             1         225.0    150.0  95.0   28.58  65.0       103.0    1

External Data
currentSmoker  cigsPerDay
0              0.0
0              0.0
1              20.0
1              30.0

Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• From the 10-year risk of coronary heart disease scenario

sex  age  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP  BMI    heartRate  glucose  TenYearCHD
1    39   0              0.0         0.0     0                0             0         195.0    106.0  70.0   26.97  80.0       77.0     0
0    46   0              0.0         0.0     0                0             0         250.0    121.0  81.0   28.73  95.0       76.0     0
1    48   1              20.0        0.0     0                0             0         245.0    127.5  80.0   25.34  75.0       70.0     0
0    61   1              30.0        0.0     0                0             1         225.0    150.0  95.0   28.58  65.0       103.0    1

External Data: currentSmoker, cigsPerDay
Internal Data: all other columns

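Combining internal and external data into one table amounts to a join. The sketch below shows an inner join on a shared key in plain Python. The slides do not name a join key, so `patient_id` is a hypothetical column introduced for this example, and only a subset of the columns is shown.

```python
def merge_on_key(left_rows, right_rows, key):
    """Inner-join two lists of dict records on a shared key column."""
    right_by_key = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


# "patient_id" is a hypothetical join key; the slides do not name one.
internal = [
    {"patient_id": 1, "sex": 1, "age": 39, "TenYearCHD": 0},
    {"patient_id": 2, "sex": 0, "age": 46, "TenYearCHD": 0},
    {"patient_id": 3, "sex": 1, "age": 48, "TenYearCHD": 0},
    {"patient_id": 4, "sex": 0, "age": 61, "TenYearCHD": 1},
]
external = [
    {"patient_id": 1, "currentSmoker": 0, "cigsPerDay": 0.0},
    {"patient_id": 2, "currentSmoker": 0, "cigsPerDay": 0.0},
    {"patient_id": 3, "currentSmoker": 1, "cigsPerDay": 20.0},
    {"patient_id": 4, "currentSmoker": 1, "cigsPerDay": 30.0},
]

combined = merge_on_key(internal, external, "patient_id")
print(combined[2])
```

An inner join drops patients that appear in only one source; whether that is acceptable depends on the study.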
Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• Data cleansing, e.g. missing value handling

sex  age  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP  BMI    heartRate  glucose  TenYearCHD
1    39   0              0.0         0.0     0                0             0         195.0    106.0  70.0   26.97  80.0       77.0     0
0    -    0              0.0         0.0     0                0             0         250.0    121.0  81.0   28.73  95.0       76.0     0
-    48   1              20.0        0.0     0                0             0         245.0    127.5  80.0   25.34  75.0       70.0     0
0    61   1              30.0        0.0     0                0             1         225.0    150.0  95.0   28.58  65.0       103.0    1

Missing value in categorical data: sex (row 3)
Missing value in continuous data: age (row 2)

Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• Data cleansing, e.g. missing value handling

1. Remove the records with missing values (when missing values make up ≤ 5%
or ≤ 10% of the data)

sex  age  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP  BMI    heartRate  glucose  TenYearCHD
1    39   0              0.0         0.0     0                0             0         195.0    106.0  70.0   26.97  80.0       77.0     0
0    -    0              0.0         0.0     0                0             0         250.0    121.0  81.0   28.73  95.0       76.0     0
-    48   1              20.0        0.0     0                0             0         245.0    127.5  80.0   25.34  75.0       70.0     0
0    61   1              30.0        0.0     0                0             1         225.0    150.0  95.0   28.58  65.0       103.0    1

Missing value in categorical data: sex (row 3)
Missing value in continuous data: age (row 2)

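Removing incomplete records is only advisable when few records are affected. A minimal sketch of that check follows, with missing values represented as `None`. The helper names are hypothetical, the 10% limit is the upper rule of thumb quoted above, and only three columns of the sample table are shown.

```python
def missing_row_fraction(rows):
    """Fraction of rows that contain at least one missing value (None)."""
    incomplete = sum(1 for row in rows if None in row.values())
    return incomplete / len(rows)


def drop_incomplete_rows(rows):
    """Keep only the rows in which every value is present."""
    return [row for row in rows if None not in row.values()]


# The four sample rows, with the two gaps marked "-" in the table.
sample = [
    {"sex": 1, "age": 39, "TenYearCHD": 0},
    {"sex": 0, "age": None, "TenYearCHD": 0},
    {"sex": None, "age": 48, "TenYearCHD": 0},
    {"sex": 0, "age": 61, "TenYearCHD": 1},
]

MAX_MISSING_FRACTION = 0.10  # upper rule of thumb for listwise deletion

fraction = missing_row_fraction(sample)
# Delete rows only when the share of incomplete rows is small enough.
cleaned = drop_incomplete_rows(sample) if fraction <= MAX_MISSING_FRACTION else sample

print(fraction, len(cleaned))
```

In this toy sample half of the rows are incomplete, so the rows are kept and imputation would be the more suitable option.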
Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• Data cleansing, e.g. missing value handling

sex  age  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP  BMI    heartRate  glucose  TenYearCHD
1    39   0              0.0         0.0     0                0             0         195.0    106.0  70.0   26.97  80.0       77.0     0
0    -    0              0.0         0.0     0                0             0         250.0    121.0  81.0   28.73  95.0       76.0     0
1    48   1              20.0        0.0     0                0             0         245.0    127.5  80.0   25.34  75.0       70.0     0
0    61   1              30.0        0.0     0                0             1         225.0    150.0  95.0   28.58  65.0       103.0    1

2. Imputation of the missing value in categorical data (sex, row 3)
2.1 Single imputation: median (easy and fast, but it does not account for
correlations between features and is not very accurate)
2.2 Multiple imputation: k-NN, EM, etc.

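Single imputation replaces every gap in a column with one summary value of the observed entries. The sketch below uses `statistics.median_low` from the Python standard library so that the filled value is always one of the observed categories; that choice is an assumption of this example, not something the slides specify. The column values are hypothetical.

```python
from statistics import median_low


def impute_single(values, statistic=median_low):
    """Replace each None with one summary statistic of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = statistic(observed)
    return [fill if v is None else v for v in values]


# Hypothetical binary column (yes = 1 / no = 0) with one missing entry.
flags = [1, 0, 1, None, 1]
imputed = impute_single(flags)
print(imputed)
```

Because every gap receives the same value regardless of the other columns, this approach ignores correlations between features, which is the weakness noted above.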
Data Acquisition and Understanding
• Data Preprocessing and Data Exploration
• Data cleansing, e.g. missing value handling

sex  age  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP  BMI    heartRate  glucose  TenYearCHD
1    39   0              0.0         0.0     0                0             0         195.0    106.0  70.0   26.97  80.0       77.0     0
0    44   0              0.0         0.0     0                0             0         250.0    121.0  81.0   28.73  95.0       76.0     0
1    48   1              20.0        0.0     0                0             0         245.0    127.5  80.0   25.34  75.0       70.0     0
0    61   1              30.0        0.0     0                0             1         225.0    150.0  95.0   28.58  65.0       103.0    1

3. Imputation of the missing value in continuous data (age, row 2)
3.1 Single imputation: rounded mean, e.g. a mean of 43.3 rounded up to 44
3.2 Multiple imputation: k-NN, etc.

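A k-NN based imputation can be sketched as follows: for a row with a missing value, find the k most similar complete rows and average their values. This is an illustrative implementation under stated assumptions, not the lecture's method: the function name, the choice of features, k = 2 and the use of unscaled Euclidean distance are all choices made for this example, and the feature columns are assumed to be complete. `math.dist` requires Python 3.8 or later.

```python
import math


def knn_impute(rows, target, features, k=2):
    """Fill missing `target` values with the mean target of the k nearest complete rows."""
    complete = [r for r in rows if r[target] is not None]
    if len(complete) < k:
        raise ValueError("not enough complete rows for the requested k")
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        point = [row[f] for f in features]
        # Rank the complete rows by Euclidean distance on the feature columns.
        nearest = sorted(
            complete,
            key=lambda other: math.dist(point, [other[f] for f in features]),
        )[:k]
        estimate = sum(r[target] for r in nearest) / k
        filled.append({**row, target: estimate})
    return filled


# Sample rows from the CHD scenario; the age of the second patient is missing.
sample = [
    {"age": 39, "totChol": 195.0, "sysBP": 106.0, "diaBP": 70.0, "BMI": 26.97},
    {"age": None, "totChol": 250.0, "sysBP": 121.0, "diaBP": 81.0, "BMI": 28.73},
    {"age": 48, "totChol": 245.0, "sysBP": 127.5, "diaBP": 80.0, "BMI": 25.34},
    {"age": 61, "totChol": 225.0, "sysBP": 150.0, "diaBP": 95.0, "BMI": 28.58},
]

filled = knn_impute(sample, "age", ["totChol", "sysBP", "diaBP", "BMI"], k=2)
print(filled[1]["age"])
```

In practice the features would usually be rescaled first so that a column with a large numeric range, such as totChol, does not dominate the distance.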
K-Nearest Neighbors (k-NN)
a) k-NN is a supervised learning algorithm
b) No explicit training or model
c) Can be applied to both classification and regression problems
d) Uses the k nearest neighbors of x to vote on the label of x

A k-NN framework
1. Define or initialize K
2. Compute the distance between the test instance and each training instance
3. Sort the distances
4. Take the K nearest neighbors
5. Apply a simple majority vote
6. Output the class

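The steps of the framework above map directly onto a few lines of Python. This is a minimal sketch that assumes Euclidean distance (the slides do not name a distance measure) and uses hypothetical two-dimensional training points. `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(train_X, train_y, x, k=3):
    """Classify x by a simple majority vote among its k nearest training points."""
    # Step 1: K is supplied by the caller.
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    # Step 2: distance between the test instance and each training instance.
    distances = [(math.dist(p, x), label) for p, label in zip(train_X, train_y)]
    # Step 3: sort the distances.
    distances.sort(key=lambda pair: pair[0])
    # Step 4: take the K nearest neighbors.
    nearest = distances[:k]
    # Step 5: simple majority vote; on a tie the label seen first
    # (the one belonging to the closer neighbor) wins.
    votes = Counter(label for _, label in nearest)
    # Step 6: output the class.
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated groups of points.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_classify(train_X, train_y, (2, 2), k=3))
print(knn_classify(train_X, train_y, (6, 5), k=3))
```

Every prediction scans the whole training set, which is why k-NN needs no training phase but is slow at classification time.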
A majority vote in K-Nearest Neighbors
k-NN classifies a point by using the majority vote of the k closest training
points.
[Figure: three panels, each showing a query point x on the axes x1 and x2:
1-nearest neighbor, 2-nearest neighbor, 3-nearest neighbor]

k-NN in a classification problem
• No explicit computation of decision boundaries
• The boundaries between distinct classes form a subset of the
Voronoi diagram of the training data
• Each line segment is equidistant to neighboring points

Image by MIT OpenCourseWare

Summary
a) k-NN can deal with complex and arbitrary decision boundaries
b) Despite its simplicity, researchers have shown that the classification
accuracy of k-NN can be quite strong and in many cases as high as that of
more elaborate methods
c) k-NN is slow at classification time
d) k-NN does not produce an understandable model

Data Acquisition and Understanding

Batch:
Standard architecture for a data warehouse, such as a data mart.

Stream:
Data flows continuously from the data sources. This idea is called a data
lake.

Data Acquisition and Understanding

Data Mart:
Just like a store of bottled water: cleansed, packaged and structured for
easy consumption.

Data Lake:
Similar to a large body of water in a more natural state. Various users of
the lake can come to examine, dive in, or take samples.

Modeling
Steps: Feature Engineering, Modeling

Goal
- To create a list of feature vectors from raw data
- To create a machine learning model

Modeling
Structured Data
- Pick all variables relevant to the target class
- Data preprocessing

Male  Age  Location           …  TenYearCHD
1     39   Bangkok            …  0
0     46   Bangkok            …  0
1     48   Nakhon Ratchasima  …  0
0     61   Nakhon Ratchasima  …  1
0     46   Chonburi           …  0

Modeling
Structured Data
- Pick all variables relevant to the target class
- Data preprocessing

sex  age  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  prevalentHyp  diabetes  totChol  sysBP  diaBP  BMI    heartRate  glucose  TenYearCHD
1    39   0              0.0         0.0     0                0             0         195.0    106.0  70.0   26.97  80.0       77.0     0
0    46   0              0.0         0.0     0                0             0         250.0    121.0  81.0   28.73  95.0       76.0     0
1    48   1              20.0        0.0     0                0             0         245.0    127.5  80.0   25.34  75.0       70.0     0
0    61   1              30.0        0.0     0                0             1         225.0    150.0  95.0   28.58  65.0       103.0    1

Modeling
[Figure: the learning setup]
- Unknown target function f: X → y (e.g. the ideal disease diagnosis function)
- Data collection (e.g. patients who agree to be participants)
- Training samples (x1, y1), (x2, y2), …, (xN, yN) (e.g. participants’
medical records)
- Hypothesis set H
- Learning algorithm A
- Final hypothesis g ≈ f (e.g. the approximated disease diagnosis function)
The learning algorithm A uses the training samples to select, from the
hypothesis set H, a final hypothesis g that approximates f.

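The roles of the hypothesis set H, the learning algorithm A and the final hypothesis g can be illustrated with a deliberately tiny example. Everything specific here is an assumption made for illustration: H is a handful of threshold rules on systolic blood pressure, A picks the rule with the fewest training errors, and the samples are the sysBP and TenYearCHD values of the four sample patients. It is not a clinical rule.

```python
def make_threshold_rule(threshold):
    """One hypothesis h in H: predict CHD = 1 when sysBP >= threshold."""
    def rule(sys_bp):
        return 1 if sys_bp >= threshold else 0
    return rule


def training_error(h, samples):
    """Number of training samples (x, y) that hypothesis h gets wrong."""
    return sum(1 for x, y in samples if h(x) != y)


def learn(hypothesis_set, samples):
    """Learning algorithm A: return the (threshold, rule) pair with the fewest errors."""
    return min(hypothesis_set, key=lambda item: training_error(item[1], samples))


# Training samples (x_n, y_n): sysBP and TenYearCHD of the four sample patients.
samples = [(106.0, 0), (121.0, 0), (127.5, 0), (150.0, 1)]

# Hypothesis set H: a few candidate thresholds, chosen arbitrarily.
H = [(t, make_threshold_rule(t)) for t in (110, 120, 130, 140, 150)]

best_threshold, g = learn(H, samples)
print(best_threshold, training_error(g, samples))
```

Several thresholds fit these four samples equally well; `min` returns the first of them, which shows that a small training set cannot single out one hypothesis.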
Deployment

Goal
- Deploy models with a data pipeline to a production or production-like
environment for final user acceptance
- Track model performance and improve the model if required

Deployment
[Figure: deployment diagram with the labels Batch Training, Real-time
Prediction, Historical Data, Analytic Data, Model, Prediction, Live Data,
Feedback]

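The loop of batch training, real-time prediction and feedback can be sketched as a small wrapper that records how often live predictions turn out to be correct. This is an illustrative sketch only: the `DeployedModel` class, the `train_on_history` stand-in, the threshold rule on sysBP and the live records are all hypothetical and are not taken from the lecture.

```python
class DeployedModel:
    """Wrap a trained prediction function and track its live accuracy."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn
        self._outcomes = []  # True where a prediction matched the later feedback

    def predict(self, features):
        return self._predict_fn(features)

    def record_feedback(self, predicted, actual):
        self._outcomes.append(predicted == actual)

    def live_accuracy(self):
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def needs_retraining(self, min_accuracy):
        accuracy = self.live_accuracy()
        return accuracy is not None and accuracy < min_accuracy


def train_on_history(history):
    """Batch-training stand-in: threshold halfway between the class means of sysBP."""
    positives = [x for x, y in history if y == 1]
    negatives = [x for x, y in history if y == 0]
    threshold = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    return lambda sys_bp: 1 if sys_bp >= threshold else 0


# Historical data: sysBP and TenYearCHD of the four sample patients.
history = [(106.0, 0), (121.0, 0), (127.5, 0), (150.0, 1)]
model = DeployedModel(train_on_history(history))

# Hypothetical live records with outcomes that arrive later as feedback.
live = [(140.0, 1), (110.0, 0), (128.0, 1), (160.0, 1)]
for sys_bp, actual in live:
    model.record_feedback(model.predict(sys_bp), actual)

print(model.live_accuracy(), model.needs_retraining(min_accuracy=0.8))
```

When the tracked accuracy falls below the agreed minimum, the feedback records can be added to the historical data for the next batch training run.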
Reference
[1] A. Lauren, G.L. Gabriel, and W. Joschua, “CS109 Data Science”, School of Engineering and Applied Sciences, Harvard, [online] http://cs109.github.io/2015/, [15 January, 2019].
[2] J. Wood, “Data Science and the Data Science Process”, Wintellect, [online] https://www.wintellect.com/data-science-data-science-process/, [15 January, 2019].
[3] S. Nick, E. Gary et al., “What is the Team Data Science Process”, [online] https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview, [15 January, 2019].
