AACS1573
Introduction to
Data Science
Chapter 2
Data Science Process
Last week…
1.5 Types of Analytics
1.6 Analytics Process Model (Step 1: ?)
1.7 Software/Tools (some market players)
1.8 Related Data Science Applications (shhh…assignment idea)
In this lesson, we will learn about…
2.1 Data Preparation
2.2 Data Exploration
2.3 Data Representation
2.4 Data Discovery
2.5 Learning from Data
2.6 Creating a Data Product
2.7 Insight, Deliverance and Visualization
You have learned about…
2.1 Data Preparation
2.2 Data Exploration
2.3 Data Representation
2.4 Data Discovery
2.5 Learning from Data
2.6 Creating a Data Product
2.7 Insight, Deliverance and Visualization
Coming next…
Chapter 3: Visualization and Descriptive
Analytics
2.1 Data Preparation
next…
Data Preparation
● Reading the data
● Cleansing the data
Data Preparation
● This is the first step in turning the available data into a dataset, i.e., a group of data points, usually normalized, that can be used with a data analysis model or a machine learning system (often without any additional preprocessing).
● Reading the data & cleansing the data
Reading the data
● Reading the data is relatively straightforward.
● However, when you are dealing with big data, you often need to employ the Hadoop Distributed File System (HDFS) to store the data for further analysis, and the data then needs to be read using a MapReduce system.
Data Preparation
● However, you may need to supply the data in JSON or some other similar format.
JSON value types: strings, numbers, objects, arrays, Booleans
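As a minimal illustration, the sketch below uses Python's standard `json` module to parse a small document that exercises all five value types; the field names and values are made up for the example:

```python
import json

# A small JSON document exercising all five JSON value types:
# strings, numbers, objects, arrays, and Booleans.
raw = '''
{
  "name": "customer-42",
  "age": 31,
  "address": {"city": "Kuala Lumpur"},
  "purchases": [19.90, 5.50],
  "active": true
}
'''

record = json.loads(raw)  # parse JSON text into native Python objects

print(type(record["name"]))       # str   <- JSON string
print(type(record["age"]))        # int   <- JSON number
print(type(record["address"]))    # dict  <- JSON object
print(type(record["purchases"]))  # list  <- JSON array
print(type(record["active"]))     # bool  <- JSON Boolean
```

Each JSON value type maps onto a natural Python type, which is why JSON is such a convenient interchange format during data preparation.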
Data Preparation
● Also, if your data is in a completely custom form, you may need to write your own
program(s) for accessing and restructuring it into a format that can be
understood by the mappers and the reducers of your cluster.
● When reading a very large amount of data, it is wise to first do a sample run on a
relatively small subset of your data to ensure that the resulting dataset will be
useable and useful for the analysis you plan to perform.
● Some preliminary visualization of the resulting sample dataset would also be
quite useful as this will ensure that the dataset is structured correctly for the
different analyses you will do in the later stages of the process.
Sampling
● The aim of sampling is to take a subset of past customer data and use that to build
an analytical model.
Question: Given the high performance of computers nowadays, why do we need sampling when we could directly analyze the full data set?
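A hedged sketch of simple random sampling using only Python's standard library; the 100,000-row customer dataset and the 1% sampling fraction are hypothetical:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical "full" customer dataset: (customer_id, monthly spend).
full_data = [(i, random.uniform(10, 500)) for i in range(100_000)]

# Draw a 1% simple random sample without replacement.
sample = random.sample(full_data, k=len(full_data) // 100)

# The sample mean is usually close to the population mean,
# at a fraction of the processing cost.
pop_mean = sum(spend for _, spend in full_data) / len(full_data)
sample_mean = sum(spend for _, spend in sample) / len(sample)
print(f"population mean: {pop_mean:.1f}, sample mean: {sample_mean:.1f}")
```

Even a modest sample typically estimates summary statistics well, which is one reason sampling stays useful despite fast hardware: it shortens the build-test loop for analytical models.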
Data Preparation
Cleansing the data
● a very time-consuming part of data preparation that requires a certain level of understanding of the data
● This step involves:
○ filling in missing values,
○ removing corrupt or problematic data,
○ normalizing the data in a way that makes sense for the analysis that ensues.
● To comprehend this point better, let us examine the rationale
behind normalization and how distributions (mathematical models
of the frequency of the values of a variable) come into play.
Data Preparation
● Although the most commonly used distribution is the
normal distribution (N), there are several others that often
come into play such as:
○ uniform distribution (U)
○ Student's t-distribution (t)
○ Poisson distribution (P)
○ binomial distribution (B)
● Note: normalization applies only to numeric data
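For the normal and uniform cases, Python's standard `random` module can draw samples directly (Poisson and binomial draws would need a library such as NumPy); a small sketch with arbitrary parameters:

```python
import random

random.seed(0)  # reproducible sketch

# 1,000 draws from a normal distribution N(mean=0, sd=1)...
normal_sample = [random.gauss(0, 1) for _ in range(1000)]

# ...and 1,000 draws from a uniform distribution U(0, 1).
uniform_sample = [random.uniform(0, 1) for _ in range(1000)]

# Quick sanity checks on the samples' basic statistics.
normal_mean = sum(normal_sample) / len(normal_sample)
print(f"normal sample mean (should be near 0): {normal_mean:.2f}")
print(f"uniform min/max (should stay in [0, 1]): "
      f"{min(uniform_sample):.2f}, {max(uniform_sample):.2f}")
```

Plotting histograms of such samples is a quick way to see which mathematical model best matches the shape of your own variables.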
Feature Scaling: Normalization
Distance AB before scaling = √((40 − 60)² + (3 − 3)²) = 20
Distance BC before scaling = √((40 − 40)² + (4 − 3)²) = 1
Distance AB after scaling = √((−1.1 − 1.5)² + (1.18 − 1.18)²) = 2.6
Distance BC after scaling = √((−1.1 − (−1.1))² + (−0.41 − 1.18)²) = 1.59
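The effect can be reproduced with a short sketch. The three points below are a hypothetical reconstruction of the example, as (salary, years); the after-scaling distances come out differently from the slide's figures because here the z-scores are computed over these three points only:

```python
import math

# Hypothetical points (salary in $k, years of experience), chosen to
# reproduce the before-scaling distances in the example.
points = {"A": (60.0, 3.0), "B": (40.0, 3.0), "C": (40.0, 4.0)}

def euclid(p, q):
    return math.dist(p, q)  # Euclidean distance (Python 3.8+)

ab_before = euclid(points["A"], points["B"])  # 20.0 -- salary dominates
bc_before = euclid(points["B"], points["C"])  # 1.0

# Z-score scale each column: (x - mean) / population standard deviation.
def zscale(column):
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / sd for x in column]

salaries, years = zip(*points.values())
scaled = dict(zip(points, zip(zscale(salaries), zscale(years))))

ab_after = euclid(scaled["A"], scaled["B"])  # both features now comparable
bc_after = euclid(scaled["B"], scaled["C"])

print(ab_before, bc_before, round(ab_after, 2), round(bc_after, 2))
```

Before scaling, the salary axis (tens of units) swamps the years axis (single units); after scaling, both features contribute on an equal footing to the distance.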
Data Preparation
● Normalizing your data will sometimes change the shape of its
distribution, so it makes sense to try out first a few normalizing
approaches before deciding on one.
● The approaches that are most popular are:
○ Subtracting the mean and dividing by the standard deviation, (x − μ) / σ. This is particularly useful for data that follows a normal distribution; it usually yields values between −3 and 3, approximately.
○ Subtracting the mean and dividing by the range, (x − μ) / (max − min). This approach is a bit more generic; it usually yields values between −0.5 and 0.5, approximately.
○ Subtracting the minimum and dividing by the range, (x − min) / (max − min). This approach is very generic and always yields values between 0 and 1, inclusive.
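The three approaches can be sketched as plain Python functions; the example column of numbers is arbitrary:

```python
def z_score(xs):
    """(x - mean) / standard deviation: roughly -3..3 for normal data."""
    mean = sum(xs) / len(xs)
    sd = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / sd for x in xs]

def mean_range(xs):
    """(x - mean) / (max - min): roughly -0.5..0.5."""
    mean = sum(xs) / len(xs)
    rng = max(xs) - min(xs)
    return [(x - mean) / rng for x in xs]

def min_max(xs):
    """(x - min) / (max - min): always 0..1 inclusive."""
    lo, rng = min(xs), max(xs) - min(xs)
    return [(x - lo) / rng for x in xs]

data = [2, 6, 10, 14, 18]
print(min_max(data))  # first value is 0.0, last is 1.0
```

Trying all three on the same column and plotting the results is a quick way to see which one preserves the shape of the distribution best for your analysis.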
Data Cleansing – Missing Values
Data Cleansing - Outliers
Data Preparation
● Normally, when dealing with big data, outliers shouldn't be an issue…
● BUT extremely large or extremely small outlier values may still affect the basic statistics of the dataset, especially if there are many outliers in it.
Find the problems!
Missing value: common remedies are filling with the mean value, the midpoint of the scale, or a random number, or removing the column.
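The remedies above can be sketched as follows; the ratings column (a 1-5 scale) and the 50% drop threshold are hypothetical choices for the example:

```python
import random

# Hypothetical ratings on a 1-5 scale; None marks a missing value.
ratings = [4, None, 5, 3, None, 4]
present = [r for r in ratings if r is not None]

# Remedy 1: fill with the mean of the observed values.
mean_fill = [r if r is not None else sum(present) / len(present)
             for r in ratings]

# Remedy 2: fill with the midpoint of the scale, here (1 + 5) / 2 = 3.
midpoint_fill = [r if r is not None else 3 for r in ratings]

# Remedy 3: fill with a random value drawn from the scale.
random.seed(1)
random_fill = [r if r is not None else random.randint(1, 5) for r in ratings]

# Remedy 4: drop the column entirely if it has too many gaps.
missing_share = ratings.count(None) / len(ratings)
drop_column = missing_share > 0.5  # hypothetical threshold

print(mean_fill, midpoint_fill, random_fill, drop_column)
```

Which remedy makes sense depends on the variable and the analysis: mean filling preserves the average but shrinks the variance, while random filling preserves spread at the cost of adding noise.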
Data Preparation
● When dealing with text data, which is often the
case if you need to analyze logs or social media
posts, a different type of cleansing is required.
● This involves one or more of the following:
○ removing certain characters (e.g., special
characters such as @,*, and punctuation
marks)
○ making all words either uppercase or
lowercase
○ removing certain words that convey little
information (e.g., "a", "the", etc.)
○ removing extra or unnecessary spaces and
line breaks
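The cleansing steps above can be sketched in a few lines; the stop-word list is a tiny illustrative one, and the sample post is invented:

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "to"}  # tiny illustrative list

def cleanse(text):
    text = text.lower()                       # unify case
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # strip @, *, punctuation, etc.
    text = re.sub(r"\s+", " ", text).strip()  # collapse spaces/line breaks
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

post = "The BEST product!!!   Go to   @shop *now*\n"
print(cleanse(post))  # -> "best product go shop now"
```

Real pipelines usually use a fuller stop-word list and may also stem or lemmatize words, but the order of operations (case, characters, spacing, stop words) is the same.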
Data Preparation
All these data preparation steps
(and other methods that may be
relevant to your industry), will help
you turn the data into a dataset.
Make sure you keep a record of
what you have done though, in
case you need to redo these steps
or describe them in a report.
Data Preparation
● Remove stop words
● Change to lower case
2.2 Data Exploration
next…
Data Exploration
● Once the dataset is ready, some exploration of it is performed to figure out the potential information that could be hiding within it.
● There is a common misconception that
the more data one has, the better the
results of the analysis will be.
● It is very easy to fall victim to the illusion
that a large dataset is all you need, but
more often than not such a dataset will
contain noise and several irrelevant
attributes.
● All of these wrinkles will need to be ironed out in the stages that follow, starting with data exploration. (more data = more noise!!)
2.3 Data Representation
next…
Data Representation
● comes right after data exploration.
● According to the McGraw-Hill Dictionary of Scientific & Technical
Terms, it is "the manner in which data is expressed symbolically by
binary digits in a computer." >> How data is stored in the
computer.
● This basically involves assigning specific data structures to the
variables involved and serves a dual purpose:
○ completing the transformation of the original (raw) data into a dataset
○ optimizing the memory and storage usage for the stages to follow.
Which one is better?
● 1, 2, 3 vs 1.00000, 2.00000, 3.00000
● "TRUE" | "FALSE" vs TRUE | FALSE
● 00101, 00110, 00111 vs 101, 110, 111
● I make it finally! vs I MaKe iT FINALly!!!!
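To make the trade-off concrete, here is a hedged sketch (standard library only; exact byte counts vary by Python version) comparing the memory footprint of three representations of the same 1,000 small integers:

```python
import sys
from array import array

# The same 1,000 small integers stored three ways.
values = list(range(1000))

as_strings = [str(v) for v in values]  # "0", "1", ... as text
as_objects = values                    # Python int objects
as_packed = array("i", values)         # packed 32-bit integers

def total_size(seq):
    # Container size plus, for lists, the elements it references.
    size = sys.getsizeof(seq)
    if isinstance(seq, list):
        size += sum(sys.getsizeof(v) for v in seq)
    return size

# Text is the bulkiest, boxed ints are lighter, packed ints lightest.
print(total_size(as_strings), total_size(as_objects), total_size(as_packed))
```

Choosing the leanest structure that still expresses the data correctly is exactly the "optimizing memory and storage" purpose of data representation.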
Data Representation
● All this may seem very abstract to someone who has never dealt
with data before, but it becomes very clear once you start working
with R or any other statistical analysis package.
● Speaking of R, the data structure of a dataset in that programming
platform is referred to as a data frame, and it is the most complete
structure you can find as it includes useful information about the
data (e.g. names, modality, etc.).
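A bare-bones stand-in for the idea in Python (R's data.frame, like pandas' DataFrame, adds much more on top: column types, row names, rich operations), with invented example columns:

```python
# A data frame is essentially a table of named columns,
# each column holding one type of value.
frame = {
    "name":   ["Ann", "Bob", "Cal"],  # character column
    "age":    [31, 25, 47],           # numeric column
    "active": [True, False, True],    # logical column
}

def row(frame, i):
    """Return one record as a dict, analogous to df[i, ] in R."""
    return {col: values[i] for col, values in frame.items()}

print(row(frame, 1))  # -> {'name': 'Bob', 'age': 25, 'active': False}
```

Keeping each column in a single, appropriate type is what lets a data frame report useful metadata (names, modality, etc.) and store the data compactly.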
2.4 Data Discovery
next…
Data Discovery
● the CORE of the data science process
● finding patterns in a dataset through hypothesis formulation and testing
● makes use of several statistical methods to prove the significance of the relationships that the data scientist observes
● filters out less robust relationships based on statistics
● throws away the less meaningful relationships based on our judgment
Data Discovery
● Unfortunately there is no fool-proof methodology for data discovery
although there are several tools that can be used to make the
whole process more manageable.
● How effective you are regarding this stage of the data science
process will depend on your experience, your intuition and how
much time you spend on it.
● Good knowledge of the various data analysis tools (especially
machine learning techniques) can prove very useful here.
● In addition, experience with scientific research in data analysis will
also prove to be priceless in this stage.
2.5 Learning from Data
next…
Learning from Data
● Learning from data is a crucial stage in the data science process
and involves a lot of intelligent (and often creative) analysis of a
dataset using statistical methods and machine learning systems.
Machine Learning
● supervised: helps a computer learn how to distinguish and predict new data points based on a training set
● unsupervised: enables the computer to learn on its own what the data structure can reveal about the data itself
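A toy contrast between the two modes, on hypothetical 1-D "monthly spend" values: the supervised rule learns from labeled examples, while the unsupervised rule uses only the structure of the unlabeled data.

```python
# Supervised: a labeled training set -> predict the label of a new
# point with a 1-nearest-neighbour rule.
train = [(12.0, "low"), (15.0, "low"), (90.0, "high"), (110.0, "high")]

def predict(x):
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

# Unsupervised: no labels at all -- split the points at the largest
# gap in the sorted values (a crude two-cluster rule).
def two_clusters(xs):
    xs = sorted(xs)
    gaps = [xs[i + 1] - xs[i] for i in range(len(xs) - 1)]
    cut = gaps.index(max(gaps))
    return xs[:cut + 1], xs[cut + 1:]

print(predict(20.0))                            # -> "low"
print(two_clusters([12.0, 15.0, 90.0, 110.0]))  # two spend groups
```

Real systems replace these toy rules with proper models (k-nearest neighbours, k-means, neural networks, etc.), but the division of labour is the same: supervised learning needs labels, unsupervised learning finds structure without them.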
Learning from Data
It may seem that using unsupervised and supervised learning guarantees a more or less automated way of learning from data.
However, without feedback from the user/programmer, this process is unlikely
to yield any good results for the majority of cases. (This feedback can take the
form of validation or corrections that provide more meaningful results.)
For example, artificial neural networks (ANNs), a very popular artificial
intelligence tool that emulates the way the human brain works, are a great tool
for supervised learning.
2.6 Creating a Data Product
next…
Creating a Data Product
● All of the aforementioned parts of the data science process
are precursors to developing something concrete that can
be used as a product of sorts.
● This part of the process is referred to as creating a data product and was defined by influential data scientist Hilary Mason as "a product that is based on the combination of data and algorithms."
● So, a data product is not some kind of luxury that marketing
people try to force us to buy.
● It is something the user cares about.
Creating a Data Product
● To create a data product, you need to understand the
end users and become familiar with their expectations.
● You also need to exercise good judgment on the algorithms you will use and (particularly) on the form that the results will take.
● Graphs, particularly interactive ones, are a very useful form in which to
present information if you want to promote it as a data
product.
Creating a Data Product
So a data product is similar to having a data expert in your pocket who can afford to
give you useful information at very low rates due to the economies of scale employed.
2.7 Insight, Deliverance &
Visualization
next…
Insight, Deliverance and Visualization
Data science involves research into the data, the goal of which is to determine and understand more of what's happening below the surface: how the data products perform in terms of usefulness to the end users, maintainability, etc.
This often leads to new iterations of data discovery, data learning,
etc., making data science an ongoing, evolving activity, oftentimes
employing the agile framework frequently used in software
development today.
Insight, Deliverance and Visualization
In this final stage of the data science process, the data scientist
delivers the data product he has created and observes how it is
received.
The user's feedback is crucial as it will provide the information he
needs to refine the analysis, upgrade it and even redo it from scratch
if necessary.
The data scientist may get ideas on how he can generate similar data
products (or completely new ones) based on the users' newest
requirements.
Insight, Deliverance and Visualization
● Visualization involves the
graphical representation of
data so that interesting and
meaningful information can
be obtained by the viewer.
● It is a way of summarizing
the essence of the analyzed
data graphically in a way that
is intuitive, intelligent and
oftentimes interactive.
Insight, Deliverance and Visualization
● These graphs can bring about insight (which is the most valuable part of the data science process). This translates into deeper understanding and usually to some new hypotheses about the data.
● They also make you aware of what you don't know and therefore able to handle the uncertainty of the data much better. This means that you are more aware of the limitations of your models as well as the value of the data.
Insight, Deliverance and Visualization
● It brings about the improvements
you see in data products all over
the world, the clever upgrades of
certain data applications and,
most importantly, the various
innovations in the big data world.
● So this final stage of the data
science process is NOT THE END
but rather the last part of a cycle
that starts again and again,
spiraling to greater heights of
understanding, usefulness and
evolution.