28/3/2025
ST2195
Introduction to Data Science
What is Data Science?
Data Science is about data gathering, analysis and
decision-making.
Data Science is about finding patterns in data through
analysis and using them to make predictions about the future.
By using Data Science, companies are able to make:
• Better decisions (should we choose A or B?)
• Predictive analysis (what will happen next?)
• Pattern discoveries (find patterns, or hidden
information, in the data)
Where is Data Science Needed?
• For route planning: To discover the best shipping routes
• To foresee delays for flight/ship/train etc. (through
predictive analysis)
• To create promotional offers
• To find the best suited time to deliver goods
• To forecast the next year's revenue for a company
• To analyze the health benefits of training
• To predict who will win elections
Applications of Data Science
• Consumer goods
• Stock markets
• Industry
• Politics
• Logistics companies
• E-commerce
How Does a Data Scientist Work?
A Data Scientist requires expertise in several
fields:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases
How a Data Scientist Works:
1. Ask the right questions - To understand the business problem.
2. Explore and collect data - From databases, web logs, customer
feedback, etc.
3. Extract the data - Transform the data to a standardized format.
4. Clean the data - Remove erroneous values from the data.
5. Find and replace missing values - Check for missing values and
replace them with a suitable value (e.g. an average value).
6. Normalize data - Scale the values to a practical range (e.g. 140 cm is
shorter than 1.8 m, yet the number 140 is larger than 1.8, so values
must be brought to a common scale before they are compared).
7. Analyze data, find patterns and make future predictions.
8. Represent the result - Present the result with useful insights in a
way the "company" can understand.
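Steps 5 and 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the height values are made up, `None` is assumed to mark a missing value, and min-max scaling is just one common way to normalize (the slides do not prescribe a method). The function names are hypothetical.

```python
from statistics import mean

# Hypothetical heights in cm; None marks a missing value (assumption).
heights_cm = [140, 180, None, 165, 175]


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    average = mean(observed)
    return [average if v is None else v for v in values]


def min_max_scale(values):
    """Rescale linearly so the smallest value becomes 0 and the largest 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


filled = fill_missing_with_mean(heights_cm)
scaled = min_max_scale(filled)
print(filled)
print(scaled)
```

After scaling, every value lies between 0 and 1, so variables recorded in different units can be compared on the same footing.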
DATA
What is Data?
Data is a collection of information.
One purpose of Data Science is to structure data, making
it interpretable and easy to work with.
Data can be categorized into two groups:
• Structured data
• Unstructured data
Unstructured Data
Unstructured data is not organized. We must organize the data for analysis
purposes.
Structured Data
• Structured data is organized and easier to work with.
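As a rough illustration of organizing unstructured data, the sketch below turns free-text workout notes into structured records. The notes, the `parse_note` helper, and the regular expressions are all assumptions made for this example; real unstructured data is usually far messier.

```python
import re

# Hypothetical free-text workout notes (unstructured data).
notes = [
    "Ran for 30 minutes, average pulse 80, burned 240 calories",
    "Trained 45 minutes; average pulse 90 and 260 calories burned",
]


def parse_note(note):
    """Pull the duration, pulse, and calorie numbers out of one note."""
    duration = re.search(r"(\d+) minutes", note)
    pulse = re.search(r"pulse (\d+)", note)
    calories = re.search(r"(\d+) calories", note)
    return {
        "Duration": int(duration.group(1)),
        "Average_Pulse": int(pulse.group(1)),
        "Calorie_Burnage": int(calories.group(1)),
    }


# Each record now has the same named fields (structured data).
records = [parse_note(n) for n in notes]
for record in records:
    print(record)
```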
How to Structure Data?
We can use an array or a database table to structure or
present data.
Example of an array:
• [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
Example (array in Python):
    Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
    print(Array)
Database Table
A database table is a table with structured data.
• The following table shows a database table with
health data extracted from a sports watch:
Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
45        100            140        280              0           7
60        105            140        290              7           8
60        110            145        300              7           8
60        115            145        310              8           8
75        120            150        320              0           8
75        125            150        330              8           8
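To make the idea of a database table concrete, the sketch below loads the same health data into an in-memory SQLite table using Python's standard `sqlite3` module and runs one query against it. The table name `health` and the choice of SQLite are assumptions for illustration; the slides do not name a database system.

```python
import sqlite3

# The ten rows of health data from the table above.
rows = [
    (30, 80, 120, 240, 10, 7),
    (30, 85, 120, 250, 10, 7),
    (45, 90, 130, 260, 8, 7),
    (45, 95, 130, 270, 8, 7),
    (45, 100, 140, 280, 0, 7),
    (60, 105, 140, 290, 7, 8),
    (60, 110, 145, 300, 7, 8),
    (60, 115, 145, 310, 8, 8),
    (75, 120, 150, 320, 0, 8),
    (75, 125, 150, 330, 8, 8),
]

# An in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE health ("
    "Duration INTEGER, Average_Pulse INTEGER, Max_Pulse INTEGER, "
    "Calorie_Burnage INTEGER, Hours_Work INTEGER, Hours_Sleep INTEGER)"
)
conn.executemany("INSERT INTO health VALUES (?, ?, ?, ?, ?, ?)", rows)

# Select two columns for the sessions lasting at least 60 minutes.
long_sessions = conn.execute(
    "SELECT Duration, Calorie_Burnage FROM health "
    "WHERE Duration >= 60 ORDER BY Calorie_Burnage"
).fetchall()
print(long_sessions)
conn.close()
```

Because the data is structured, a single query can filter rows and pick columns without any custom parsing.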
Database Table Structure
• A database table consists of column(s) and row(s):
        Column 1  Column 2       Column 3   Column 4         Column 5    Column 6
        Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
Row 1   30        80             120        240              10          7
Row 2   30        85             120        250              10          7
Row 3   45        90             130        260              8           7
Row 4   45        95             130        270              8           7
Row 5   45        100            140        280              0           7
Row 6   60        105            140        290              7           8
Row 7   60        110            145        300              7           8
Row 8   60        115            145        310              8           8
Row 9   75        120            150        320              0           8
Row 10  75        125            150        330              8           8
Variables
A variable is defined as something that can be measured
or counted.
Examples can be characters, numbers or time.
• In the example below, each column represents a
variable.
There are 6 columns, meaning that there are 6 variables
(Duration, Average_Pulse, Max_Pulse, Calorie_Burnage,
Hours_Work, Hours_Sleep).
• There are 11 rows including the header row, meaning
that each variable has 10 observations.
Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
45        100            140        280              0           7
60        105            140        290              7           8
60        110            145        300              7           8
60        115            145        310              8           8
75        120            150        320              0           8
75        125            150        330              8           8
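The count of variables and observations can be checked with a short Python sketch. It assumes the table has been saved as comma-separated text (the `table_text` string below is a hand-made stand-in for such a file) and uses only the standard `csv` module.

```python
import csv
import io

# The table above written as CSV text: one header row plus ten data rows.
table_text = """Duration,Average_Pulse,Max_Pulse,Calorie_Burnage,Hours_Work,Hours_Sleep
30,80,120,240,10,7
30,85,120,250,10,7
45,90,130,260,8,7
45,95,130,270,8,7
45,100,140,280,0,7
60,105,140,290,7,8
60,110,145,300,7,8
60,115,145,310,8,8
75,120,150,320,0,8
75,125,150,330,8,8
"""

reader = csv.reader(io.StringIO(table_text))
header = next(reader)  # the first row holds the variable names
observations = [[int(cell) for cell in row] for row in reader]

# Regroup the rows so that each variable maps to its list of observations.
columns = {
    name: [row[i] for row in observations] for i, name in enumerate(header)
}

print("Variables:", len(header))
print("Observations per variable:", len(observations))
print("Average_Pulse:", columns["Average_Pulse"])
```

The header row is read separately, which is why 11 rows of text yield 10 observations per variable.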