TDT4259 – Applied Data Science
Lecture 5: Lifecycle of a data science project
Nisha Dalal
Adj. Associate Professor
[email protected]
But first,
• Check your groups
• Contact your team members
• Decide datasets
• Discuss contributions
• Email from TAs
• Language preferences
• Schedule group meeting with TAs (preferably after)
• Some changes in the scoring scheme for group assignment
• Questions on Slack
Reference groups and feedback
We are looking for 5-8 students to comprise the reference group. The purpose of the reference group is to provide
constructive feedback about the course through an ongoing open dialogue with other students throughout the semester.
You can read more about the task of the reference group at this link.
If you want to sign up to be a member of the reference group, use this link.
A survey will be sent out to all to evaluate the course during the last week.
CRISP-DM: with a use case
Aneo: Grid loss data
• Grid load
• Total amount of electric energy in the grid
• Grid loss
• Difference in electricity between what has been
produced by the power plants (or load) and what has
been sold to the customers
• Grid load = consumption by customers + grid loss
What is CRISP-DM
Cross-industry standard process for data
mining - CRISP-DM
An open standard developed in 1996 by leading
companies in data analysis
It is still the most popular methodology for data-centric
projects
It is an agile method that introduces almost no
overhead and emphasizes adaptive transitions between
project phases
[Figure: CRISP-DM process diagram, including a "Maintenance and monitoring" phase]
Business Understanding
• Initially, it is vital to understand the problem to be solved
• This may seem obvious, but business projects seldom
come pre-packaged as clear and unambiguous data
science problems
• The design team should think carefully about the problem
to be solved and about the use scenario
• Learn to concretize or even reduce the scope of the initial
idea
Documentation
• One Pager
• Design document
• High level documents explaining the overall goal
• Quick feedback from the stakeholders and data scientists/engineers
• Different documents for different audiences
• Enough information to make decisions and provide feedback
• Everyone on the same page
• Easier to scope the project
• Provides clarity and avoids the perfection rabbit hole
• Make project planning easier
Data analytics
Examining data to answer questions, identify trends, and extract insights.
Types of data analytics
Descriptive analysis
• Pull trends from raw data and describe them succinctly.
• Focus on: What happened, or what is currently happening?
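As a minimal sketch of descriptive analysis, the snippet below summarises a handful of made-up hourly grid-load readings. The values and the names `hourly_load_mwh` and `describe` are illustrative assumptions, not Aneo data or course code.

```python
from statistics import mean, median

# Hypothetical hourly grid-load readings in MWh (illustrative values only).
hourly_load_mwh = [310.0, 295.0, 280.0, 305.0, 350.0, 420.0, 460.0, 440.0]


def describe(values):
    """Return basic descriptive statistics that answer 'what happened?'."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


summary = describe(hourly_load_mwh)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The summary only describes the past; it says nothing yet about why the load rose or what it will do next.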
Diagnostic analysis
• Comparing coexisting trends or movement, uncovering correlations between
variables, and determining causal relationships where possible.
• Focus on: Why did it happen?
• Correlation versus Causation
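The difference between correlation and causation can be illustrated with a small simulation. This is a sketch under stated assumptions: the three series (temperature, ice cream sales, swimming accidents) are synthetic and hypothetical, and temperature is built in as a common cause of the other two. Neither series influences the other, yet they correlate strongly until the shared driver is removed.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def residuals(ys, xs):
    """Remove the linear effect of xs from ys (simple least squares)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = cov / sum((x - mx) ** 2 for x in xs)
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


random.seed(0)
# Hypothetical confounder: temperature drives both of the other series.
temperature = [random.gauss(0, 1) for _ in range(5000)]
ice_cream_sales = [2 * t + random.gauss(0, 1) for t in temperature]
swimming_accidents = [2 * t + random.gauss(0, 1) for t in temperature]

raw_corr = pearson(ice_cream_sales, swimming_accidents)
# Correlation that remains once the shared temperature effect is removed.
adjusted_corr = pearson(
    residuals(ice_cream_sales, temperature),
    residuals(swimming_accidents, temperature),
)
print(f"raw correlation: {raw_corr:.2f}")
print(f"after controlling for temperature: {adjusted_corr:.2f}")
```

The raw correlation is high, while the adjusted one is close to zero, which is what one would expect when a third variable explains the relationship.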
Correlation versus Causation
Spurious Correlations
Predictive analysis
• Predict future trends and events using the data at hand.
• Focus on: What might happen in the future?
Prescriptive analysis
• Suggests actionable takeaways considering all possible factors in a scenario
• Focus on: What should we do next?
Aneo: Grid loss data
• Grid load
• Total amount of electric energy in the grid
• Grid loss
• Difference in electricity between what has been
produced by the power plants (or load) and what has
been sold to the customers
• Grid load = consumption by customers + grid loss
• Estimated grid loss = idle loss + k*(expected power consumption)²
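The two relations above can be written as small functions. This is only a sketch: the function names, the units and the numeric values (including the constant `k`) are assumptions chosen for illustration, not figures from Aneo.

```python
def grid_loss(grid_load, customer_consumption):
    """Grid loss implied by: grid load = consumption by customers + grid loss."""
    return grid_load - customer_consumption


def estimated_grid_loss(idle_loss, k, expected_consumption):
    """Estimated grid loss = idle loss + k * (expected power consumption)^2."""
    return idle_loss + k * expected_consumption ** 2


# Illustrative numbers only (units and magnitudes are assumptions).
print(grid_loss(grid_load=105.0, customer_consumption=100.0))
print(estimated_grid_loss(idle_loss=1.0, k=0.0004, expected_consumption=100.0))
```

Because the consumption term is squared, the estimated loss grows faster than the consumption itself, so errors in the expected consumption are amplified in the estimate.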
Grid loss data: Problems
• Not grid-specific
• Manual retraining
• Manual and subjective alterations
• Lack of monitoring infrastructure
• Poor scalability
Data Understanding
• If solving the business problem is the goal, the data
comprise the available raw material from which the
solution will be built
• Collect initial data
o Existing data
o Purchased data
o Additional data
• Describe data
o Amount of data
o Value types
o Coding schemes
• Explore data
• Verify data quality
o Missing data
o Data errors
o Coding inconsistencies
o Bad metadata
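A few of the data quality checks listed above can be sketched in plain Python. The records, column names and the specific checks are hypothetical examples; a real project would run similar checks on the actual dataset, typically with a dataframe library.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing measurement.
records = [
    {"timestamp": "2023-01-01 00:00", "weekday": "Sun", "load_mwh": 310.0},
    {"timestamp": "2023-01-01 01:00", "weekday": "sunday", "load_mwh": None},
    {"timestamp": "2023-01-01 01:00", "weekday": "sunday", "load_mwh": None},
    {"timestamp": "2023-01-01 02:00", "weekday": "Sun", "load_mwh": -5.0},
]

# Missing data: count None values per column.
missing = Counter(
    key for row in records for key, value in row.items() if value is None
)

# Duplicates: rows that repeat an earlier row exactly.
seen = set()
duplicates = 0
for row in records:
    fingerprint = tuple(sorted(row.items()))
    if fingerprint in seen:
        duplicates += 1
    seen.add(fingerprint)

# Coding inconsistencies: the same weekday written in different ways.
weekday_codes = sorted({row["weekday"] for row in records})

# Data errors: a load measurement should not be negative.
negative_loads = [
    row for row in records
    if row["load_mwh"] is not None and row["load_mwh"] < 0
]

print("missing per column:", dict(missing))
print("duplicate rows:", duplicates)
print("weekday codes in use:", weekday_codes)
print("rows with negative load:", len(negative_loads))
```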
Aneo: Grid loss data
• https://www.kaggle.com/trnderenergikraft/grid-loss-time-series-dataset
Aneo: Grid loss data
1. Grid load:
• Grid load = consumption by customers + grid loss
• Estimated grid loss = idle loss + k*(expected power consumption)²
2. Calendar features
3. Weather forecasts
4. Estimated demand in the area
Data Understanding
Dos and Don'ts
• Do not economize on this phase
o The earlier you discover issues with your data the better
o Data understanding leads to domain understanding
• Verify, as far as you can, that your data is correct,
complete, coherent, deduplicated, representative,
independent and up-to-date
• Investigate what sort of processing was applied to the
raw data
• Understand anomalies and outliers
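One common first step towards understanding outliers is to flag values that lie far outside the interquartile range. The 1.5 x IQR rule used here is a general rule of thumb chosen for illustration, not a method prescribed in the lecture, and the readings are made up.

```python
from statistics import quantiles


def iqr_outliers(values, factor=1.5):
    """Return the values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sensor readings with one suspicious spike.
readings = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 55.0, 10.0]
print(iqr_outliers(readings))
```

A flagged value is a prompt for investigation rather than something to delete automatically: it may be a measurement error, or it may be a real and important event.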
Grid loss data: Problems
• Delayed measurements
• Missing values
• Incorrect values
• Changing values
• Small datasets
• Missing features
Data Preparation
1. Select data
• Select features
2. Clean data
• Correct data errors
• Make coding consistent
• Fill in or infer missing data
3. Construct data
• Generate derived attributes
4. Integrate data
• Merge information from different sources
5. Format data
• Convert to format convenient for modelling
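The cleaning, construction and formatting steps above can be sketched on a tiny example. The rows, the column names and the choice of forward-filling missing values are assumptions made for illustration; the right way to fill gaps depends on the data.

```python
from datetime import datetime

# Hypothetical raw rows: (timestamp string, load in MWh or None if missing).
raw = [
    ("2023-01-01 00:00", 310.0),
    ("2023-01-01 01:00", None),
    ("2023-01-01 02:00", 295.0),
]


def prepare(rows):
    """Clean, enrich and format raw rows for modelling."""
    prepared = []
    last_value = None
    for stamp, load in rows:
        # Clean: forward-fill a missing value from the previous row.
        # A gap in the very first row stays None.
        if load is None:
            load = last_value
        last_value = load
        # Format: parse the timestamp string into a datetime object.
        ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        prepared.append({
            "timestamp": ts,
            "load_mwh": load,
            # Construct: derived attributes based on the timestamp.
            "hour": ts.hour,
            "is_weekend": ts.weekday() >= 5,
        })
    return prepared


for row in prepare(raw):
    print(row)
```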
Day in the life of a Data Scientist
Aneo: Grid loss data
1. Grid load
• Is it possible to predict grid load?
2. Calendar features
• Categorical features
• Encoding?
3. Time series decomposed features
• Prophet based features?
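Two common answers to the encoding question for calendar features are one-hot encoding for categories such as the weekday, and a sine/cosine encoding for cyclical values such as the hour. The sketch below shows both; the function names are hypothetical, and which encoding works best depends on the model.

```python
from math import cos, pi, sin

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def one_hot_weekday(day):
    """Encode a weekday name as a list of seven 0/1 indicators."""
    return [1 if day == name else 0 for name in WEEKDAYS]


def cyclical_hour(hour):
    """Map an hour (0-23) onto the unit circle so 23:00 lies next to 00:00."""
    angle = 2 * pi * hour / 24
    return sin(angle), cos(angle)


print(one_hot_weekday("Wed"))
print(cyclical_hour(0))
print(cyclical_hour(23))
```

With the raw hour as a number, 23 and 0 look far apart; on the circle they are neighbours, which better matches how load behaves around midnight.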
Modelling
1. Select modelling techniques
• Select an algorithm or a model
2. Build the model
• Feature selection
• Hyperparameter optimization
• Training and validation
3. Assess model
• Model performance on test dataset
• Time
• Other Key Performance Indicators (KPIs)
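For time series such as grid load, training, validation and test sets are usually split chronologically, so that the model is always evaluated on data from after its training period. The sketch below is one way to do this; the function name and the split sizes are assumptions, not the setup used in the lecture.

```python
def chronological_split(rows, val_size, test_size):
    """Split time-ordered rows into train, validation and test parts.

    The most recent rows form the test set and the rows just before them
    form the validation set, so no future data leaks into training.
    """
    train_end = len(rows) - val_size - test_size
    val_end = len(rows) - test_size
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


# Hypothetical series of 100 consecutive hourly observations.
hours = list(range(100))
train, val, test = chronological_split(hours, val_size=15, test_size=15)
print(len(train), len(val), len(test))
```

Hyperparameters are tuned against the validation part, and the test part is kept aside for the final assessment of the model.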
Aneo: Grid loss data
Which models to use
• Multi-layer perceptron
• Decision tree regressor
• Gradient boosting regressor ensemble
• CatBoost
What baselines to compare to
• Manual method
• Last week
How much training data to use
Which features to use
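A baseline gives the models something to beat. The sketch below reads the "last week" baseline as predicting each hour with the value observed exactly one week earlier, and scores it with the mean absolute error. That interpretation, the error metric and the synthetic series are all assumptions made for illustration.

```python
HOURS_PER_WEEK = 7 * 24


def last_week_baseline(series):
    """Predict each hour with the value observed exactly one week earlier."""
    return series[:-HOURS_PER_WEEK]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Two hypothetical weeks of hourly values: week two is week one plus 2.0.
week_one = [100.0 + (h % 24) for h in range(HOURS_PER_WEEK)]
week_two = [v + 2.0 for v in week_one]
series = week_one + week_two

actual = series[HOURS_PER_WEEK:]
predicted = last_week_baseline(series)
baseline_mae = mean_absolute_error(actual, predicted)
print(f"baseline MAE: {baseline_mae}")
```

A candidate model is only worth deploying if it improves clearly on this number when scored on the same hours.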
Traditional software testing
Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
Data Science pipeline testing
Breck, Eric, et al. "The ML test score: A rubric for ML production readiness and technical debt
reduction." 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
Important Deadlines
When you will need to deliver or complete a task
1 20/9 Register yourself/group and the company/dataset for group assignment
2 30/10 Deliver individual assignment
3 27/11 Deliver presentation and report for group assignment
Lecture Plan
Unpacking the course syllabus
1 23/8 Lecture 1: Introduction [Nisha Dalal]
2 30/8 Lecture 2: Presentation of datasets [Nisha Dalal]
3 6/9 Lecture 3: Crash course in machine learning [Kshitij Sharma]
4 13/9 Lecture 4: Data analysis with low or no-code tools [Nisha Dalal]
5 20/9 No lecture
6 27/9 Lecture 5: Lifecycle of a Data Science project I [Nisha Dalal]
7 4/10 Lecture 6: Lifecycle of a Data Science project II [Nisha Dalal]
8 11/10 Lecture 7: Data Visualization & Storytelling [Manos Papagiannidis]
9 18/10 Lecture 8: Data Science in the time of ChatGPT [Pikakshi Manchanda]
10 25/10 No lecture
11 1/11 Lecture 9: Experiences from Industry [Thomas Thorensen]
12 8/11 Lecture 10: Decision making with data science [Nisha Dalal]
13 15/11 Course finish
Summer Internship 2024
in the AI and Product Development department of Aneo
We are looking for you who want to contribute to a sustainable future by
applying AI and/or software development in the renewable energy sector!
Where: Trondheim
When: Summer 2024 (7 weeks, dates TBA)
Deadline: November 12th, 2023
Read more and apply here: https://tinyurl.com/2xxh5uhx
Questions & Discussion
Nisha Dalal
[email protected]