AACS1573
Introduction to
Data Science
Chapter 2
Data Science Process
Last week…
1.5 Types of Analytics
1.6 Analytics Process Model (Step 1: ?)
1.7 Software/Tools (some market players)
1.8 Related Data Science Applications (shhh…assignment idea)
In this lesson, we will learn about…
2.1 Data Preparation
2.2 Data Exploration
2.3 Data Representation
2.4 Data Discovery
2.5 Learning from Data
2.6 Creating a Data Product
2.7 Insight, Deliverance and Visualization
You have learned about…
2.1 Data Preparation
2.2 Data Exploration
2.3 Data Representation
2.4 Data Discovery
2.5 Learning from Data
2.6 Creating a Data Product
2.7 Insight, Deliverance and Visualization
Coming next…
Chapter 3: Visualization and Descriptive
Analytics
2.1 Data Preparation
next…
Data Preparation
● Reading the data
● Cleansing the data
Data Preparation
● This is the first step in turning the available data into a dataset, i.e., a group of data points, usually normalized, that can be used with a data analysis model or a machine learning system (often without any additional preprocessing).
● Reading the data & cleansing the data
Reading the data
● Reading the data is relatively straightforward.
● However, when you are dealing with big data, you often need to employ the Hadoop Distributed File System (HDFS) to store the data for further analysis, and the data then needs to be read using a MapReduce system.
Data Preparation
● However, you may need to supply the data in JSON or some other similar format.
JSON value types: strings, numbers, objects, arrays, Booleans
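As a minimal illustration, the sketch below uses Python's standard `json` module to parse a small document that exercises all five value types; the field names and values are made up for the example:

```python
import json

# A small JSON document exercising all five JSON value types:
# strings, numbers, objects, arrays, and Booleans.
raw = '''
{
  "name": "customer-42",
  "age": 31,
  "address": {"city": "Kuala Lumpur"},
  "purchases": [19.90, 5.50],
  "active": true
}
'''

record = json.loads(raw)  # parse JSON text into native Python objects

print(type(record["name"]))       # str   <- JSON string
print(type(record["age"]))        # int   <- JSON number
print(type(record["address"]))    # dict  <- JSON object
print(type(record["purchases"]))  # list  <- JSON array
print(type(record["active"]))     # bool  <- JSON Boolean
```

Each JSON value type maps onto a natural Python type, which is why JSON is such a convenient interchange format during data preparation.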
Data Preparation
● Also, if your data is in a completely custom form, you may need to write your own
program(s) for accessing and restructuring it into a format that can be
understood by the mappers and the reducers of your cluster.
● When reading a very large amount of data, it is wise to first do a sample run on a
relatively small subset of your data to ensure that the resulting dataset will be
useable and useful for the analysis you plan to perform.
● Some preliminary visualization of the resulting sample dataset would also be
quite useful as this will ensure that the dataset is structured correctly for the
different analyses you will do in the later stages of the process.
Sampling
● The aim of sampling is to take a subset of past customer data and use that to build
an analytical model.
Question: Given the high performance of computers nowadays, why do we need sampling when we could directly analyze the full data set?
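A hedged sketch of simple random sampling using only Python's standard library; the 100,000-row customer dataset and the 1% sampling fraction are hypothetical:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical "full" customer dataset: (customer_id, monthly spend).
full_data = [(i, random.uniform(10, 500)) for i in range(100_000)]

# Draw a 1% simple random sample without replacement.
sample = random.sample(full_data, k=len(full_data) // 100)

# The sample mean is usually close to the population mean,
# at a fraction of the processing cost.
pop_mean = sum(spend for _, spend in full_data) / len(full_data)
sample_mean = sum(spend for _, spend in sample) / len(sample)
print(f"population mean: {pop_mean:.1f}, sample mean: {sample_mean:.1f}")
```

Even a modest sample typically estimates summary statistics well, which is one reason sampling stays useful despite fast hardware: it shortens the build-test loop for analytical models.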
Data Preparation
Cleansing the data
● a very time-consuming part of data preparation that requires a certain level of understanding of the data
● This step involves:
○ filling in missing values,
○ removing corrupt or problematic data,
○ normalizing the data in a way that makes sense for the analysis that ensues.
● To comprehend this point better, let us examine the rationale
behind normalization and how distributions (mathematical models
of the frequency of the values of a variable) come into play.
Data Preparation
● Although the most commonly used distribution is the
normal distribution (N), there are several others that often
come into play such as:
○ uniform distribution (U)
○ Student's t-distribution (t)
○ Poisson distribution (P)
○ binomial distribution (B)
● Note: normalization applies only to numeric data
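For the normal and uniform cases, Python's standard `random` module can draw samples directly (Poisson and binomial draws would need a library such as NumPy); a small sketch with arbitrary parameters:

```python
import random

random.seed(0)  # reproducible sketch

# 1,000 draws from a normal distribution N(mean=0, sd=1)...
normal_sample = [random.gauss(0, 1) for _ in range(1000)]

# ...and 1,000 draws from a uniform distribution U(0, 1).
uniform_sample = [random.uniform(0, 1) for _ in range(1000)]

# Quick sanity checks on the samples' basic statistics.
normal_mean = sum(normal_sample) / len(normal_sample)
print(f"normal sample mean (should be near 0): {normal_mean:.2f}")
print(f"uniform min/max (should stay in [0, 1]): "
      f"{min(uniform_sample):.2f}, {max(uniform_sample):.2f}")
```

Plotting histograms of such samples is a quick way to see which mathematical model best matches the shape of your own variables.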
Feature Scaling: Normalization
Distance AB before scaling = √((40 − 60)² + (3 − 3)²) = 20
Distance BC before scaling = √((40 − 40)² + (4 − 3)²) = 1
Distance AB after scaling = √((−1.1 − 1.5)² + (1.18 − 1.18)²) = 2.6
Distance BC after scaling = √((−1.1 − (−1.1))² + (−0.41 − 1.18)²) = 1.59
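The effect can be reproduced with a short sketch. The three points below are a hypothetical reconstruction of the example, as (salary, years); the after-scaling distances come out differently from the slide's figures because here the z-scores are computed over these three points only:

```python
import math

# Hypothetical points (salary in $k, years of experience), chosen to
# reproduce the before-scaling distances in the example.
points = {"A": (60.0, 3.0), "B": (40.0, 3.0), "C": (40.0, 4.0)}

def euclid(p, q):
    return math.dist(p, q)  # Euclidean distance (Python 3.8+)

ab_before = euclid(points["A"], points["B"])  # 20.0 -- salary dominates
bc_before = euclid(points["B"], points["C"])  # 1.0

# Z-score scale each column: (x - mean) / population standard deviation.
def zscale(column):
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / sd for x in column]

salaries, years = zip(*points.values())
scaled = dict(zip(points, zip(zscale(salaries), zscale(years))))

ab_after = euclid(scaled["A"], scaled["B"])  # both features now comparable
bc_after = euclid(scaled["B"], scaled["C"])

print(ab_before, bc_before, round(ab_after, 2), round(bc_after, 2))
```

Before scaling, the salary axis (tens of units) swamps the years axis (single units); after scaling, both features contribute on an equal footing to the distance.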
Data Preparation
● Normalizing your data will sometimes change the shape of its
distribution, so it makes sense to try out first a few normalizing
approaches before deciding on one.
● The approaches that are most popular are:
○ Subtracting the mean and dividing by the standard deviation, (x − μ) / σ. This is particularly useful for data that follows a normal distribution; it usually yields values between −3 and 3, approximately.
○ Subtracting the mean and dividing by the range, (x − μ) / (max − min). This approach is a bit more generic; it usually yields values between −0.5 and 0.5, approximately.
○ Subtracting the minimum and dividing by the range, (x − min) / (max − min). This approach is very generic and always yields values between 0 and 1, inclusive.
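The three approaches can be sketched as plain Python functions; the example column of numbers is arbitrary:

```python
def z_score(xs):
    """(x - mean) / standard deviation: roughly -3..3 for normal data."""
    mean = sum(xs) / len(xs)
    sd = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / sd for x in xs]

def mean_range(xs):
    """(x - mean) / (max - min): roughly -0.5..0.5."""
    mean = sum(xs) / len(xs)
    rng = max(xs) - min(xs)
    return [(x - mean) / rng for x in xs]

def min_max(xs):
    """(x - min) / (max - min): always 0..1 inclusive."""
    lo, rng = min(xs), max(xs) - min(xs)
    return [(x - lo) / rng for x in xs]

data = [2, 6, 10, 14, 18]
print(min_max(data))  # first value is 0.0, last is 1.0
```

Trying all three on the same column and plotting the results is a quick way to see which one preserves the shape of the distribution best for your analysis.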
Data Cleansing – Missing Values
Data Cleansing - Outliers
Data Preparation
● Normally, when dealing with big data, outliers shouldn't be an issue…
● BUT extremely large or extremely small outlier values may still affect the basic statistics of the dataset, especially if there are many outliers in it.
Find the problems!
Missing value: common remedies are filling with the mean value, the midpoint of the scale, or a random number, or removing the column.
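The remedies above can be sketched as follows; the ratings column (a 1-5 scale) and the 50% drop threshold are hypothetical choices for the example:

```python
import random

# Hypothetical ratings on a 1-5 scale; None marks a missing value.
ratings = [4, None, 5, 3, None, 4]
present = [r for r in ratings if r is not None]

# Remedy 1: fill with the mean of the observed values.
mean_fill = [r if r is not None else sum(present) / len(present)
             for r in ratings]

# Remedy 2: fill with the midpoint of the scale, here (1 + 5) / 2 = 3.
midpoint_fill = [r if r is not None else 3 for r in ratings]

# Remedy 3: fill with a random value drawn from the scale.
random.seed(1)
random_fill = [r if r is not None else random.randint(1, 5) for r in ratings]

# Remedy 4: drop the column entirely if it has too many gaps.
missing_share = ratings.count(None) / len(ratings)
drop_column = missing_share > 0.5  # hypothetical threshold

print(mean_fill, midpoint_fill, random_fill, drop_column)
```

Which remedy makes sense depends on the variable and the analysis: mean filling preserves the average but shrinks the variance, while random filling preserves spread at the cost of adding noise.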
Data Preparation
● When dealing with text data, which is often the
case if you need to analyze logs or social media
posts, a different type of cleansing is required.
● This involves one or more of the following:
○ removing certain characters (e.g., special
characters such as @,*, and punctuation
marks)
○ making all words either uppercase or
lowercase
○ removing certain words that convey little
information (e.g., "a", "the", etc.)
○ removing extra or unnecessary spaces and
line breaks
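The cleansing steps above can be sketched in a few lines; the stop-word list is a tiny illustrative one, and the sample post is invented:

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "to"}  # tiny illustrative list

def cleanse(text):
    text = text.lower()                       # unify case
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # strip @, *, punctuation, etc.
    text = re.sub(r"\s+", " ", text).strip()  # collapse spaces/line breaks
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

post = "The BEST product!!!   Go to   @shop *now*\n"
print(cleanse(post))  # -> "best product go shop now"
```

Real pipelines usually use a fuller stop-word list and may also stem or lemmatize words, but the order of operations (case, characters, spacing, stop words) is the same.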
Data Preparation
All these data preparation steps
(and other methods that may be
relevant to your industry), will help
you turn the data into a dataset.
Make sure you keep a record of
what you have done though, in
case you need to redo these steps
or describe them in a report.
Data Preparation
● Remove stop words
● Change to lower case
2.2 Data Exploration
next…
Data Exploration
● Once the dataset is ready, some exploration of it is performed to figure out the potential information that could be hiding within it.
● There is a common misconception that
the more data one has, the better the
results of the analysis will be.
● It is very easy to fall victim to the illusion
that a large dataset is all you need, but
more often than not such a dataset will
contain noise and several irrelevant
attributes.
● All of these wrinkles will need to be ironed out in the stages that follow, starting with data exploration. (more data = more noise!!)
2.3 Data Representation
next…
Data Representation
● comes right after data exploration.
● According to the McGraw-Hill Dictionary of Scientific & Technical
Terms, it is "the manner in which data is expressed symbolically by
binary digits in a computer." >> How data is stored in the
computer.
● This basically involves assigning specific data structures to the
variables involved and serves a dual purpose:
○ completing the transformation of the original (raw) data into a dataset
○ optimizing the memory and storage usage for the stages to follow.
Which one is better?
● 1, 2, 3 vs 1.00000, 2.00000, 3.00000
● "TRUE" | "FALSE" vs TRUE | FALSE
● 00101, 00110, 00111 vs 101, 110, 111
● I make it finally! vs I MaKe iT FINALly!!!!
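To make the trade-off concrete, here is a hedged sketch (standard library only; exact byte counts vary by Python version) comparing the memory footprint of three representations of the same 1,000 small integers:

```python
import sys
from array import array

# The same 1,000 small integers stored three ways.
values = list(range(1000))

as_strings = [str(v) for v in values]  # "0", "1", ... as text
as_objects = values                    # Python int objects
as_packed = array("i", values)         # packed 32-bit integers

def total_size(seq):
    # Container size plus, for lists, the elements it references.
    size = sys.getsizeof(seq)
    if isinstance(seq, list):
        size += sum(sys.getsizeof(v) for v in seq)
    return size

# Text is the bulkiest, boxed ints are lighter, packed ints lightest.
print(total_size(as_strings), total_size(as_objects), total_size(as_packed))
```

Choosing the leanest structure that still expresses the data correctly is exactly the "optimizing memory and storage" purpose of data representation.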
Data Representation
● All this may seem very abstract to someone who has never dealt
with data before, but it becomes very clear once you start working
with R or any other statistical analysis package.
● Speaking of R, the data structure of a dataset in that programming
platform is referred to as a data frame, and it is the most complete
structure you can find as it includes useful information about the
data (e.g. names, modality, etc.).
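A bare-bones stand-in for the idea in Python (R's data.frame, like pandas' DataFrame, adds much more on top: column types, row names, rich operations), with invented example columns:

```python
# A data frame is essentially a table of named columns,
# each column holding one type of value.
frame = {
    "name":   ["Ann", "Bob", "Cal"],  # character column
    "age":    [31, 25, 47],           # numeric column
    "active": [True, False, True],    # logical column
}

def row(frame, i):
    """Return one record as a dict, analogous to df[i, ] in R."""
    return {col: values[i] for col, values in frame.items()}

print(row(frame, 1))  # -> {'name': 'Bob', 'age': 25, 'active': False}
```

Keeping each column in a single, appropriate type is what lets a data frame report useful metadata (names, modality, etc.) and store the data compactly.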
2.4 Data Discovery
next…
Data Discovery
● the CORE of the data science process
● finding patterns in a dataset through hypothesis formulation and testing
● makes use of several statistical methods to prove the significance of the relationships that the data scientist observes
● filters out less robust relationships based on statistics
● throws away the less meaningful relationships based on our judgment
Data Discovery
● Unfortunately there is no fool-proof methodology for data discovery
although there are several tools that can be used to make the
whole process more manageable.
● How effective you are regarding this stage of the data science
process will depend on your experience, your intuition and how
much time you spend on it.
● Good knowledge of the various data analysis tools (especially
machine learning techniques) can prove very useful here.
● In addition, experience with scientific research in data analysis will
also prove to be priceless in this stage.
2.5 Learning from Data
next…
Learning from Data
● Learning from data is a crucial stage in the data science process
and involves a lot of intelligent (and often creative) analysis of a
dataset using statistical methods and machine learning systems.
Machine Learning
● supervised: helps a computer learn how to distinguish and predict new data points based on a training set
● unsupervised: enables the computer to learn on its own what the data structure can reveal about the data itself
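A toy contrast between the two modes, on hypothetical 1-D "monthly spend" values: the supervised rule learns from labeled examples, while the unsupervised rule uses only the structure of the unlabeled data.

```python
# Supervised: a labeled training set -> predict the label of a new
# point with a 1-nearest-neighbour rule.
train = [(12.0, "low"), (15.0, "low"), (90.0, "high"), (110.0, "high")]

def predict(x):
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

# Unsupervised: no labels at all -- split the points at the largest
# gap in the sorted values (a crude two-cluster rule).
def two_clusters(xs):
    xs = sorted(xs)
    gaps = [xs[i + 1] - xs[i] for i in range(len(xs) - 1)]
    cut = gaps.index(max(gaps))
    return xs[:cut + 1], xs[cut + 1:]

print(predict(20.0))                            # -> "low"
print(two_clusters([12.0, 15.0, 90.0, 110.0]))  # two spend groups
```

Real systems replace these toy rules with proper models (k-nearest neighbours, k-means, neural networks, etc.), but the division of labour is the same: supervised learning needs labels, unsupervised learning finds structure without them.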
Learning from Data
It may seem that using unsupervised and supervised learning guarantees a more or less automated way of learning from data.
However, without feedback from the user/programmer, this process is unlikely
to yield any good results for the majority of cases. (This feedback can take the
form of validation or corrections that provide more meaningful results.)
For example, artificial neural networks (ANNs), a very popular artificial
intelligence tool that emulates the way the human brain works, are a great tool
for supervised learning.
2.6 Creating a Data Product
next…
Creating a Data Product
● All of the aforementioned parts of the data science process
are precursors to developing something concrete that can
be used as a product of sorts.
● This part of the process is referred to as creating a data product and was defined by influential data scientist Hilary Mason as "a product that is based on the combination of data and algorithms."
● So, a data product is not some kind of luxury that marketing
people try to force us to buy.
● It is something the user cares about.
Creating a Data Product
● To create a data product, you need to understand the
end users and become familiar with their expectations.
● You also need to exercise good judgment on the algorithms you will use and (particularly) on the form that the results will take.
● Graphs, particularly interactive ones, are a very useful form in which to
present information if you want to promote it as a data
product.
Creating a Data Product
So a data product is similar to having a data expert in your pocket who can afford to
give you useful information at very low rates due to the economies of scale employed.
2.7 Insight, Deliverance &
Visualization
next…
Insight, Deliverance and Visualization
Data science involves research into the data, the goal of which is to determine and understand more of what's happening below the surface: how the data products perform in terms of usefulness to the end users, maintainability, etc.
This often leads to new iterations of data discovery, data learning,
etc., making data science an ongoing, evolving activity, oftentimes
employing the agile framework frequently used in software
development today.
Insight, Deliverance and Visualization
In this final stage of the data science process, the data scientist
delivers the data product he has created and observes how it is
received.
The user's feedback is crucial as it will provide the information he
needs to refine the analysis, upgrade it and even redo it from scratch
if necessary.
The data scientist may get ideas on how he can generate similar data
products (or completely new ones) based on the users' newest
requirements.
Insight, Deliverance and Visualization
● Visualization involves the
graphical representation of
data so that interesting and
meaningful information can
be obtained by the viewer.
● It is a way of summarizing
the essence of the analyzed
data graphically in a way that
is intuitive, intelligent and
oftentimes interactive.
Insight, Deliverance and Visualization
● These graphs can bring about insight (which is the most valuable part of the data science process). This translates into deeper understanding and usually to some new hypotheses about the data.
● They also make you aware of what you don't know and therefore able to handle the uncertainty of the data much better. This means that you are more aware of the limitations of your models as well as the value of the data.
Insight, Deliverance and Visualization
● It brings about the improvements
you see in data products all over
the world, the clever upgrades of
certain data applications and,
most importantly, the various
innovations in the big data world.
● So this final stage of the data
science process is NOT THE END
but rather the last part of a cycle
that starts again and again,
spiraling to greater heights of
understanding, usefulness and
evolution.