Session 1
• Defining data science and big data
• Recognizing the different types of data
• Gaining insight into the data science process
Data All Around
• Lots of data is being collected and warehoused:
– Web data, e-commerce
– Financial transactions, bank/credit transactions
– Online trading and purchasing
– Social networks
– Cloud
Data and Big Data
• “90% of the world’s data was generated in the last few years.”
• Due to the advent of new technologies, devices, and communication means such as social networking sites, the amount of data produced by mankind is growing rapidly every year.
• The amount of data produced from the beginning of time until 2003 was 5 billion gigabytes. Piled up in the form of disks, it would fill an entire football field.
• The same amount was created every two days in 2011, and every six minutes in 2016. This rate is still growing enormously.
What is Big Data
• Big Data is a collection of large datasets that cannot be processed using traditional computing techniques.
• It is not a single technique or tool; rather, it involves many areas of business and technology.
• Big Data is any data that is expensive to manage and hard to extract value from:
– Volume: the size of the data
– Velocity: the latency of data processing relative to the growing demand for interactivity
– Variety and complexity: the diversity of sources, formats, quality, and structures
Characteristics of Big Data: Volume
• 44x increase in data volume from 2009 to 2020
• From 0.8 zettabytes to 35 zettabytes
• Data volume is increasing exponentially
Characteristics of Big Data: Variety
• Various formats, types, and structures
• Text, numerical data, images, audio, video, sequences, time series, social media data, multidimensional arrays, etc.
• Static data vs. streaming data
• A single application can be generating/collecting many types of data
Characteristics of Big Data: Velocity
• Data is being generated fast and needs to be processed fast
• Online data analytics
• Late decisions mean missed opportunities
Examples:
• E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
• Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires an immediate reaction
Benefits of Big Data
• Using the information kept in social networks like Facebook, marketing agencies learn about the response to their campaigns, promotions, and other advertising media.
• Using information from social media, such as the preferences and product perceptions of their consumers, product companies and retail organizations plan their production.
• Using data on the previous medical history of patients, hospitals provide better and quicker service.
Big Data Technologies
• Operational Big Data
• Analytical Big Data
Operational Big Data
• These include systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.
• NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade, allowing massive computations to be run.
Analytical Big Data
• These include systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data.
• MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and systems based on MapReduce can be scaled up from single servers to thousands of high- and low-end machines.
Who generates Big Data?
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
• Mobile devices (tracking all objects all the time)
Challenges in Big Data
• The major challenges associated with big data are as follows:
– Capturing data
– Curation
– Storage
– Searching
– Sharing
– Transfer
– Analysis
– Presentation
What is Data Science ?
• An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data.
• Data science (DS) is a multidisciplinary field of study whose goal is to address the challenges in big data.
• Data science principles apply to all data, big and small.
What is Data Science ?
• Theories and techniques from many fields and disciplines are used to investigate and analyze large amounts of data to help decision makers in many industries such as science, engineering, economics, politics, finance, and education:
– Computer science: pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
– Mathematics: mathematical modeling
– Statistics: statistical and stochastic modeling
Data Science Disciplines
Real Life Examples
• Internet Search
• Digital Advertisements (targeted advertising and re-targeting)
• Recommender Systems
• Image Recognition
• Speech Recognition
• Gaming
• Price Comparison Websites
• Airline Route Planning
• Fraud and Risk Detection
• Delivery logistics
Facets of Data
• In data science and big data you’ll come across many different types of data, and each of them tends to require different tools and techniques. The main categories of data are:
– Structured
– Unstructured
– Natural language
– Machine-generated
– Graph-based
– Audio, video, and images
– Streaming
Structured Data
• Structured data is data that depends on a data model and resides in a fixed field within a record.
• As such, it’s often easy to store structured data in tables within databases or Excel files. SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases (see the sketch below).
• You may also come across structured data that might give you a hard time storing it in a traditional relational database.
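A minimal sketch of managing and querying structured data with SQL from Python, using the standard sqlite3 module (the table and columns are invented for illustration):

import sqlite3

# In-memory database holding one structured table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, age INTEGER)")
con.executemany("INSERT INTO people VALUES (?, ?)", [("Asha", 31), ("Ravi", 28)])

# SQL query against a fixed schema: the fields are known in advance.
for row in con.execute("SELECT name FROM people WHERE age > 30"):
    print(row)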
Unstructured Data
• Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying. One example of unstructured data is your regular email.
• Although email contains structured elements such as the sender, title, and body text, it’s a challenge to find, for example, the number of people who have written an email complaint about a specific employee, because so many ways exist to refer to a person.
• The thousands of different languages and dialects out there further complicate this.
• A human-written email is also a perfect example of natural language data.
Machine Generated Data
• Machine-generated data is information that’s automatically created by a computer, process, application, or other machine without human intervention.
• Machine-generated data is becoming a major data resource and will continue to be one.
• IDC (International Data Corporation) estimated there would be 26 times more connected things than people in 2020. This network is commonly referred to as the Internet of Things.
Session 2
Data Science Process: Overview and Different Steps
Data Science Process
Objectives:
• Understanding the flow of a data science process
• Discussing the steps in a data science process
Data Science Process
The typical data science process consists of six steps.
1. Setting research goal
• An essential outcome is the research goal, which states the purpose of your assignment in a clear and focused manner.
• Understanding the business goals and context is critical for project success.
• In this phase, you also need to frame the business problem and formulate initial hypotheses (IH) to test.
Create project charter
• Clients like to know upfront what they’re paying for, so after you have a good understanding of the business problem, try to get a formal agreement on the deliverables.
• All this information is best collected in a project charter. For any significant project this is mandatory.
• A project charter requires teamwork, and your input covers at least the following:
– A clear research goal
– The project mission and context
– How you’re going to perform your analysis
– What resources you expect to use
– Proof that it’s an achievable project, or a proof of concept
– Deliverables and a measure of success
2. Retrieving data
Data Retrieval
• The next step in data science is to retrieve the required data. Sometimes you need to go into the field and design a data collection process yourself, but most of the time you won’t be involved in this step.
• Many companies will have already collected and stored the data for you, and what they don’t have can often be bought from third parties.
• Don’t be afraid to look outside your organization for data, because more and more organizations are making even high-quality data freely available.
Data Stored in the Company
• Your first act should be to assess the relevance and quality of the data that’s readily available within your company.
• Most companies have a program for maintaining key data, so much of the cleaning work may already be done.
• This data can be stored in official data repositories such as databases, data marts, data warehouses, and data lakes maintained by a team of IT professionals.
• The primary goal of a database is data storage, while a data warehouse is designed for reading and analyzing data.
Data Sources
3. Data Preparation
Data Preparation
• Data can have lots of inconsistencies, such as missing values, blank columns, and incorrect data formats, which need to be cleaned.
• Your task now is to sanitize the data and prepare it for use in the modeling and reporting phases.
• Doing so is tremendously important: your models will perform better and you’ll lose less time trying to fix strange output.
• It can’t be mentioned often enough: garbage in equals garbage out.
Data Cleansing
• Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so your data becomes a true and consistent representation of the processes it originates from.
Overview of common errors
Example: Outliers
Data Entry Errors
• Data collection and data entry are error-prone processes.
• They often require human intervention, and because humans are only human, they make typos or lose their concentration for a second and introduce an error into the chain.
• Data collected by machines or computers isn’t free from errors either. Some errors arise from human sloppiness, whereas others are due to machine or hardware failure.
• Examples of errors originating from machines are transmission errors or bugs in the extract, transform, and load (ETL) phase.
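Entry typos like these can often be surfaced with a quick frequency count; a short pandas sketch (the values are invented for illustration):

import pandas as pd

# Hypothetical categorical column with entry typos.
s = pd.Series(["male", "female", "Male ", "femle", "female"])

# Frequency counts surface misspellings and stray whitespace.
print(s.value_counts())

# Typical fix: normalize case/whitespace, then map known typos.
print(s.str.strip().str.lower().replace({"femle": "female"}).value_counts())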
Impossible Values / Sanity Checks
• Sanity checks are another valuable type of data check.
• Here you check the value against physically or theoretically impossible values, such as people taller than 3 meters or someone with an age of 299 years.
• Sanity checks can be directly expressed with rules, as in the pandas sketch below:
check = 0 <= age <= 120
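A minimal sketch of applying such a rule to a whole column with pandas (the column name age and the example values are invented for illustration):

import pandas as pd

# Hypothetical example data; in practice this comes from your own source.
df = pd.DataFrame({"age": [25, 299, 47, -3, 61]})

# Sanity-check rule: flag ages outside the plausible range 0-120.
valid = df["age"].between(0, 120)
print(df[~valid])  # rows that violate the rule and need inspection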
Outliers
• An outlier is an observation that seems distant from other observations or, more specifically, an observation that follows a different logic or generative process than the other observations.
• The easiest way to find outliers is to use a plot or a table with the minimum and maximum values (see the sketch below).
• The normal distribution, or Gaussian distribution, is the most common distribution in the natural sciences.
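A minimal sketch of the min/max table approach with pandas, plus a common rule of thumb (the 1.5 x IQR fence, widely used but not from the slides); the data is invented for illustration:

import pandas as pd

# Hypothetical measurements; 540 follows a different generative process.
heights = pd.Series([172, 168, 181, 540, 175], name="height_cm")

# A min/max (and quartile) summary exposes the outlier at a glance.
print(heights.describe())

# Rule of thumb: flag points more than 1.5 IQR outside the quartiles.
q1, q3 = heights.quantile([0.25, 0.75])
iqr = q3 - q1
print(heights[(heights < q1 - 1.5 * iqr) | (heights > q3 + 1.5 * iqr)])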
Dealing with missing values
• Missing values aren’t necessarily wrong, but you still need to handle them separately; certain modeling techniques can’t handle missing values.
• They might be an indicator that something went wrong in your data collection or that an error happened in the ETL process.
Handling missing values
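A brief sketch of common handling options in pandas (column names and values are invented for illustration; which option is right depends on your data and model):

import numpy as np
import pandas as pd

# Hypothetical data with missing entries.
df = pd.DataFrame({"income": [52000, np.nan, 61000, 58000],
                   "city": ["Pune", "Delhi", None, "Mumbai"]})

print(df.isna().sum())  # first, inspect how much is missing

print(df.dropna())      # option 1: omit rows with missing values
print(df.fillna({"income": df["income"].mean(),  # option 2: impute a simple estimate
                 "city": "unknown"}))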
4. Exploratory Data Analysis
• During exploratory data analysis you take a deep dive into the data.
• Information becomes much easier to grasp when shown in a picture; therefore you mainly use graphical techniques to gain an understanding of your data and the interactions between variables.
• The goal isn’t to cleanse the data, but it’s common that you’ll still discover anomalies you missed before, forcing you to take a step back and fix them.
Histogram
• In a histogram, a variable is cut into discrete categories, and the number of occurrences in each category is summed up and shown in the graph (see the sketch below).
• The boxplot, on the other hand, doesn’t show how many observations are present but does offer an impression of the distribution within categories.
• It can show the maximum, minimum, median, and other characterizing measures at the same time.
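A minimal sketch of both plot types with matplotlib (the data is randomly generated purely for illustration):

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=170, scale=10, size=500)  # synthetic Gaussian sample

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=20)   # histogram: counts per discrete category (bin)
ax1.set_title("Histogram")
ax2.boxplot(data)         # boxplot: median, quartiles, and extreme values
ax2.set_title("Boxplot")
plt.show()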
5. Build the model
Building a model
• With clean data in place and a good understanding of the content, you’re ready to build models with the goal of making better predictions, classifying objects, or gaining an understanding of the system that you’re modeling.
• This phase is much more focused than the exploratory analysis step, because you know what you’re looking for and what you want the outcome to be.
Building a model
• Building a model is an iterative process.
• The way you build your model depends on whether you go with classic statistics or the somewhat more recent machine learning school, and on the type of technique you want to use.
• Either way, most models consist of the following main steps:
– Selection of a modeling technique and the variables to enter into the model
– Execution of the model
Build a model
• You’ll need to select the variables you want to include in your model and a modeling technique.
• Your findings from the exploratory analysis should already give a fair idea of which variables will help you construct a good model.
• Many modeling techniques are available, and choosing the right model for a problem requires judgment on your part.
• You’ll need to consider model performance and whether your project meets all the requirements to use your model, as well as other factors:
– Must the model be moved to a production environment and, if so, would it be easy to implement?
– How difficult is the maintenance of the model: how long will it remain relevant if left untouched?
– Does the model need to be easy to explain?
Model Execution
• Luckily, most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn. These packages implement several of the most popular techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process.
• As the code sketch below shows, it’s fairly easy to use linear regression with StatsModels or Scikit-learn.
• Doing this yourself would require much more effort, even for the simplest techniques.
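A minimal sketch of the kind of code the slide refers to, fitting a linear regression with both StatsModels and Scikit-learn (the synthetic data and coefficients are invented for illustration):

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on x, plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(scale=2.0, size=100)

# StatsModels: add an intercept column, then fit ordinary least squares.
results = sm.OLS(y, sm.add_constant(x)).fit()
print(results.summary())  # coefficients, R-squared, p-values

# The same fit with Scikit-learn.
lr = LinearRegression().fit(x.reshape(-1, 1), y)
print(lr.intercept_, lr.coef_)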
Summary
• Setting the research goal: defining the what, the why, and the how of your project in a project charter.
• Retrieving data: finding and getting access to the data needed in your project. This data is either found within the company or retrieved from a third party.
• Data preparation: checking and remediating data errors, enriching the data with data from other data sources, and transforming it into a suitable format for your models.
• Data exploration: diving deeper into your data using descriptive statistics and visual techniques.
• Data modeling: using machine learning and statistical techniques to achieve your project goal.
• Presentation and automation: presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.
Session III
Machine Learning: Definition and Relation with Data Science
ML
• Data science is the study of data cleansing, preparation, and analysis, while machine learning is a branch of AI and a subfield of data science.
• Data science is a field that studies approaches to finding insights from raw data.
• Machine learning is a technique used by data scientists to enable machines to learn automatically from past data.
Machine Learning
• Machine learning is an application of artificial intelligence (AI) that gives systems the ability to automatically learn and improve from experience without being explicitly programmed.
• Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
• The primary aim is to allow computers to learn automatically, without human intervention or assistance, and adjust their actions accordingly.
Uses
• Predict the outcomes of elections
• Identify and filter spam messages from email
• Foresee criminal activity
• Automate traffic signals according to road conditions
• Produce financial estimates of storms and natural disasters
• Examine customer churn
• Create auto-piloting planes and self-driving cars
• Identify individuals with the capacity to donate
Types of Machine Learning
Supervised Machine Learning
Unsupervised Machine Learning
Data Science vs. Machine Learning
• Data Science deals with understanding and finding hidden patterns or useful insights in data, which helps in making smarter business decisions. Machine Learning is a subfield of data science that enables machines to learn from past data and experiences automatically.
• Data Science is used for discovering insights from data. Machine Learning is used for making predictions and classifying results for new data points.
• Data Science is a broad term that includes the various steps to create a model for a given problem and deploy it. Machine Learning is used in the data modeling step of the data science process.
• A data scientist needs skills with big data tools like Hadoop, Hive, and Pig, statistics, and programming in Python, R, or Scala. A Machine Learning engineer needs skills such as computer science fundamentals, programming in Python or R, and statistics and probability concepts.
• Data Science can work with raw, structured, and unstructured data. Machine Learning mostly requires structured data to work on.
• Data scientists spend much of their time handling and cleansing data. ML engineers spend much of their time managing the complexities that arise while implementing algorithms.
Applications of ML in Data Science
• Regression and classification are of primary importance to a data scientist. To achieve these goals, one of the main tools a data scientist uses is machine learning. The uses for regression and automatic classification are wide-ranging, such as the following:
– Finding oil fields or gold mines based on existing sites (classification and regression)
– Finding place names or persons in text (classification)
– Identifying people based on pictures or voice recordings (classification)
– Recognizing birds based on their whistle (classification)
– Identifying profitable customers (regression and classification)
– Proactively identifying car parts that are likely to fail (regression)
– Identifying tumors and diseases (classification)
– Predicting the amount of money a person will spend on product X (regression)
– Predicting your company’s yearly revenue (regression)
– Predicting which team will win the Champions League in soccer (classification)
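As a toy sketch of automatic classification with scikit-learn (the bundled iris dataset stands in for any of the classification tasks above):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for the classification tasks listed above.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a classifier on the training split and check held-out accuracy.
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))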
Presentation
• Sometimes people get so excited about your work that you’ll need to repeat it over and over again, because they value the predictions of your models or the insights that you produced. For this reason, you need to automate your models.
• This doesn’t always mean that you have to redo all of your analysis all the time.
• Sometimes it’s sufficient to implement only the model scoring; other times you might build an application that automatically updates reports, Excel spreadsheets, or PowerPoint presentations.
• The last stage of the data science process is where your soft skills will be most useful, and yes, they’re extremely important.