Data Science Project Lifecycle

The document discusses the lifecycle of a data science project using the CRISP-DM methodology. It covers the business understanding and data understanding phases of CRISP-DM, including understanding the problem to be solved, documenting goals and scope, exploring and assessing data quality. It also provides an example use case of predicting grid losses using a dataset on grid load, weather forecasts, and other features.

Uploaded by

Giorgio Aduso

TDT4259 – Applied Data Science

Lecture 5: Lifecycle of a data science project

Nisha Dalal
Adj. Associate Professor

[email protected]

But first,
• Check your groups
• Contact your team members
• Decide datasets
• Discuss contributions
• Email from TAs
• Language preferences
• Schedule group meeting with TAs (preferably after)
• Some changes in the scoring scheme for group assignment

• Questions on Slack

Reference groups and feedback

We are looking for 5-8 students to form the reference group. The purpose of the reference group is to provide constructive feedback about the course through an ongoing open dialogue with other students throughout the semester. You can read more about the task of the reference group at this link.

If you want to sign up to be a member of the reference group, use this link.

A survey will be sent out to all to evaluate the course during the last week.
CRISP-DM: with a use case

Aneo: Grid loss data

• Grid load
• Total amount of electric energy in the grid
• Grid loss
• Difference in electricity between what has been
produced by the power plants (or load) and what has
been sold to the customers
• Grid load = consumption by customers + grid loss
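
The identity above can be rearranged to recover grid loss from the two measured quantities. A minimal sketch, assuming hourly values in MWh; the numbers and the function name `grid_loss` are made up for illustration and do not come from the Aneo dataset:

```python
def grid_loss(grid_load_mwh, consumption_mwh):
    """Grid loss implied by: grid load = consumption by customers + grid loss."""
    return grid_load_mwh - consumption_mwh


# Hypothetical hourly values in MWh (illustrative only).
hourly_load = [120.0, 135.5, 128.0]
hourly_consumption = [112.0, 126.0, 119.5]

losses = [grid_loss(load, used) for load, used in zip(hourly_load, hourly_consumption)]
print(losses)
```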

What is CRISP-DM
Cross-industry standard process for data mining (CRISP-DM)

An open standard developed in 1996 by leading companies in data analysis

It is still the most popular methodology for data-centric projects

It is an agile method that introduces almost no overhead and emphasizes adaptive transitions between project phases

[Figure: CRISP-DM process diagram with an additional "Maintenance and monitoring" label]

Business Understanding
• Initially, it is vital to understand the problem to be solved

• This may seem obvious, but business projects seldom come pre-packaged as clear and unambiguous data science problems

• The design team should think carefully about the problem to be solved and about the use scenario

• Learn to concretize or even reduce the scope of the initial idea

Documentation
• One Pager
• Design document

• High level documents explaining the overall goal

• Quick feedback from the stakeholders and data scientists/engineers
• Different documents for different audiences
• Enough information to make decisions and provide feedback
• Everyone on the same page
• Easier to scope the project
• Provides clarity and avoids getting into the perfection rabbit hole
• Makes project planning easier

Data analytics
Examining data to answer questions, identify trends, and extract insights.

Types of data analytics

Descriptive analysis
• Pull trends from raw data and describe them succinctly.

• Focus on: What happened, or what is currently happening?

Diagnostic analysis
• Comparing coexisting trends or movement, uncovering correlations between
variables, and determining causal relationships where possible.

• Focus on: Why did it happen?

• Correlation versus causation: correlated variables can move together without one causing the other (spurious correlations)
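
A small numeric illustration of correlation versus causation. This is a sketch with made-up series, not data from the lecture: two unrelated quantities that both happen to trend upward show a high Pearson correlation in their levels, while their period-to-period changes do not move together. Comparing the two is one common sanity check; it is an assumption here, not a method the slides prescribe.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def differences(values):
    """Period-to-period changes, which remove a shared trend."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Two hypothetical, unrelated series that both trend upward.
series_a = [10, 12, 11, 14, 15, 17, 16, 19, 21, 20]
series_b = [3, 4, 6, 5, 8, 7, 9, 11, 10, 12]

r_levels = pearson(series_a, series_b)
r_changes = pearson(differences(series_a), differences(series_b))
print(f"levels: {r_levels:.2f}, changes: {r_changes:.2f}")
```

The high correlation in levels comes from the shared trend alone, so it says nothing about whether one series drives the other.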

Predictive analysis
• Predict future trends and events using the data at hand.

• Focus on: What might happen in the future?

Prescriptive analysis
• Suggests actionable takeaways considering all possible factors in a scenario

• Focus on: What should we do next?

Grid loss data: Problems

• Not grid-specific
• Manual retraining
• Manual and subjective alterations
• Lack of monitoring infrastructure
• Poor scalability

Data Understanding
• If solving the business problem is the goal, the data comprise the available raw material from which the solution will be built

• Collect initial data
o Existing data
o Purchased data
o Additional data
• Describe data
o Amount of data
o Value types
o Coding schemes
• Explore data
• Verify data quality
o Missing data
o Data errors
o Coding inconsistencies
o Bad metadata
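
The "verify data quality" checks can be sketched in a few lines. This is a minimal example using only the standard library; the column names (`timestamp`, `grid_load`, `temperature`), the sample rows, and the rule that grid load must be non-negative are all assumptions made for illustration:

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"timestamp": "2023-01-01 00:00", "grid_load": 120.0, "temperature": -3.5},
    {"timestamp": "2023-01-01 01:00", "grid_load": None, "temperature": -4.0},
    {"timestamp": "2023-01-01 01:00", "grid_load": 118.0, "temperature": -4.0},
    {"timestamp": "2023-01-01 02:00", "grid_load": -5.0, "temperature": None},
]


def quality_report(rows, non_negative=("grid_load",)):
    """Count missing values, repeated timestamps, and implausible negatives."""
    missing = {}
    for row in rows:
        for column, value in row.items():
            if value is None:
                missing[column] = missing.get(column, 0) + 1

    seen, duplicates = set(), []
    for row in rows:
        if row["timestamp"] in seen:
            duplicates.append(row["timestamp"])
        seen.add(row["timestamp"])

    negatives = [
        row["timestamp"]
        for row in rows
        for column in non_negative
        if row[column] is not None and row[column] < 0
    ]
    return {
        "missing": missing,
        "duplicate_timestamps": duplicates,
        "negative_values": negatives,
    }


report = quality_report(rows)
print(report)
```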


Aneo: Grid loss data

• https://www.kaggle.com/trnderenergikraft/grid-loss-time-series-dataset

Aneo: Grid loss data

1. Grid load:
• Grid load = consumption by customers + grid loss
• Estimated grid loss = idle loss + k * (expected power consumption)^2

2. Calendar features

3. Weather forecasts

4. Estimated demand in the area
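
The estimated grid loss formula in item 1 can be written out directly. A minimal sketch: the values chosen for `idle_loss` and `k` are hypothetical placeholders, since the slides do not give them.

```python
def estimated_grid_loss(expected_consumption, idle_loss, k):
    """Estimated grid loss = idle loss + k * (expected power consumption)^2."""
    return idle_loss + k * expected_consumption ** 2


# Hypothetical parameters, chosen only to show the quadratic behaviour.
IDLE_LOSS = 2.0
K = 0.001

loss_at_100 = estimated_grid_loss(100.0, IDLE_LOSS, K)
loss_at_200 = estimated_grid_loss(200.0, IDLE_LOSS, K)
print(loss_at_100, loss_at_200)
```

Because the load-dependent term is quadratic, doubling the expected consumption quadruples that term while the idle loss stays constant.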

Data Understanding
Dos and Don'ts

• Do not economize on this phase
o The earlier you discover issues with your data the better
o Data understanding leads to domain understanding

• Verify, as far as you can, that your data is correct, complete, coherent, deduplicated, representative, independent and up-to-date

• Investigate what sort of processing was applied to the raw data

• Understand anomalies and outliers

Grid loss data: Problems

• Delayed measurements
• Missing values
• Incorrect values
• Changing values
• Small datasets
• Missing features

Data Preparation
1. Select data
• Select features

2. Clean data
• Correct data errors
• Make coding consistent
• Fill in or infer missing data

3. Construct data
• Generate derived attributes

4. Integrate data
• Merge information from different sources

5. Format data
• Convert to format convenient for modelling
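
Two of the steps above, filling in missing data and generating a derived attribute, can be sketched as follows. Forward filling is only one possible imputation strategy and is an assumption here, as is the `is_weekend` attribute; the slides do not say which techniques were used.

```python
from datetime import date


def forward_fill(values):
    """Replace each missing value (None) with the last observed value."""
    filled, last_seen = [], None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


def is_weekend(day):
    """Derived attribute: True for Saturday and Sunday."""
    return day.weekday() >= 5  # Monday is 0, Sunday is 6


# Hypothetical measurements with gaps.
measurements = [None, 1.0, None, None, 3.0]
print(forward_fill(measurements))
print(is_weekend(date(2023, 9, 30)))
```

A leading gap stays missing because there is no earlier observation to carry forward, so it still has to be handled separately.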

Day in the life of a Data Scientist

Aneo: Grid loss data

1. Grid load
• Is it possible to predict grid load?

2. Calendar features
• Categorical features
• Encoding?

3. Time series decomposed features

• Prophet based features?
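
The "Encoding?" question for categorical calendar features has several reasonable answers. The sketch below shows two common options, one-hot encoding for the weekday and a sine/cosine (cyclical) encoding for the hour of day. Both are illustrative assumptions, not the encoding the lecture settled on.

```python
import math


def one_hot_weekday(weekday):
    """One-hot vector of length 7; weekday 0 is Monday, 6 is Sunday."""
    return [1 if index == weekday else 0 for index in range(7)]


def cyclical_hour(hour):
    """Map an hour (0-23) onto the unit circle so that 23:00 sits next to 00:00."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def distance(point_a, point_b):
    """Euclidean distance between two encoded points."""
    return math.dist(point_a, point_b)


print(one_hot_weekday(2))
print(distance(cyclical_hour(23), cyclical_hour(0)))
print(distance(cyclical_hour(0), cyclical_hour(12)))
```

With the cyclical encoding, 23:00 and 00:00 end up as close together as any other pair of adjacent hours, which a plain integer encoding of the hour would not capture.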

Modelling
1. Select modelling techniques
• Select an algorithm or a model
2. Build the model

• Feature selection
• Hyperparameter optimization
• Training and validation

3. Assess model
• Model performance on test dataset
• Time
• Other Key Performance Indicators (KPIs)
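
For the training, validation and test steps above, time series data is often split chronologically so that the model is never assessed on observations older than those it was trained on. The slides do not specify a splitting scheme, so the following is a sketch under that assumption; the function name and the split sizes are hypothetical.

```python
def chronological_split(rows, n_val, n_test):
    """Split time-ordered rows into train, validation and test, oldest first."""
    n_train = len(rows) - n_val - n_test
    if n_train <= 0:
        raise ValueError("not enough rows for the requested split sizes")
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, validation, test


# Ten hypothetical time steps, already sorted by time.
time_steps = list(range(10))
train, validation, test = chronological_split(time_steps, n_val=2, n_test=2)
print(train, validation, test)
```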

Aneo: Grid loss data

Which models to use
• Multi-layer perceptron
• Decision tree regressor
• Gradient boosting regressor ensemble
• CatBoost

What baselines to compare to

• Manual method
• Last week

How much training data to use

Which features to use
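
The "Last week" baseline can be read as predicting each value with the observation from one week earlier. That reading is an assumption, as are the daily resolution and the sample values below. Any candidate model should beat the baseline's error on the same period before it is worth deploying.

```python
def last_week_baseline(series, period=7):
    """Predictions for series[period:], each taken from one period earlier."""
    return series[:-period]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Two hypothetical weeks of daily grid loss values.
daily_loss = [10, 12, 11, 13, 14, 9, 8, 11, 12, 12, 14, 13, 10, 8]

predicted = last_week_baseline(daily_loss, period=7)
actual = daily_loss[7:]
baseline_mae = mean_absolute_error(actual, predicted)
print(baseline_mae)
```

For hourly data the same idea would use `period=168`.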

Traditional software testing

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.

Data Science pipeline testing

Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.

Important Deadlines
When you will need to deliver or complete a task

1 20/9 Register yourself/group and the company/dataset for group assignment

2 30/10 Deliver individual assignment

3 27/11 Deliver presentation and report for group assignment

Lecture Plan
Unpacking the course syllabus

1 23/8 Lecture 1: Introduction [Nisha Dalal]

2 30/8 Lecture 2: Presentation of datasets [Nisha Dalal]

3 6/9 Lecture 3: Crash course in machine learning [Kshitij Sharma]

4 13/9 Lecture 4: Data analysis with low or no-code tools [Nisha Dalal]

5 20/9 No lecture

6 27/9 Lecture 5: Lifecycle of a Data Science project I [Nisha Dalal]

7 4/10 Lecture 6: Lifecycle of a Data Science project II [Nisha Dalal]

8 11/10 Lecture 7: Data Visualization & Storytelling [Manos Papagiannidis]

9 18/10 Lecture 8: Data Science in the time of ChatGPT [Pikakshi Manchanda]

10 25/10 No lecture

11 1/11 Lecture 9: Experiences from Industry [Thomas Thorensen]

12 8/11 Lecture 10: Decision making with data science [Nisha Dalal]

13 15/11 Course finish

Summer Internship 2024
in the AI and Product Development department of Aneo

We are looking for you who want to contribute to a sustainable future by applying AI and/or software development in the renewable energy sector!
Where: Trondheim
When: Summer 2024 (7 weeks, dates TBA)
Deadline: November 12th, 2023

Read more and apply here: https://tinyurl.com/2xxh5uhx

Questions & Discussion

Nisha Dalal
[email protected]
