Unit 3
Data science is a multidisciplinary field that uses scientific methods to extract insights from structured and unstructured data. Data science is such a huge field and concept that it is often intermixed with other disciplines; it unifies statistics, machine learning and related fields.
The data science life cycle provides the structure to the development of a data science project. The lifecycle outlines the major steps, from start to finish, that projects usually follow. There are various approaches to managing DS projects, among which are the Cross-industry standard process for data mining (aka CRISP-DM), the process of knowledge discovery in databases (aka KDD), proprietary custom procedures conjured up by a company, and a few other simplified processes.
CRISP-DM
CRISP-DM is an open standard process model that describes common approaches used by data mining scientists. In 2015, it was refined and extended by IBM, which released a new methodology called Analytics Solutions Unified Method for Data Mining/Predictive Analytics (aka ASUM-DM).
The CRISP-DM model steps are:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modelling
5. Evaluation
6. Deployment
[Figure: CRISP-DM phases with their generic tasks and outputs (e.g. Assess Situation, Construct Data, Test Design, Review Process)]
Knowledge discovery in databases (KDD)
KDD is commonly defined with the following stages:
✓ Selection
✓ Pre-processing
✓ Transformation
✓ Data mining
✓ Interpretation/evaluation
The simplified process looks as follows: (1) Pre-processing, (2) Data Mining, and (3)
Results Validation.
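As a rough illustration of how the five KDD stages chain together, the following Python sketch pushes a toy dataset through them. Everything here is hypothetical: the records, the field names, and the simple threshold rule that stands in for the data mining step were chosen only to keep the example self-contained.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 25, "income": 30000, "bought": 0},
    {"id": 2, "age": 47, "income": 82000, "bought": 1},
    {"id": 3, "age": 35, "income": None, "bought": 0},
    {"id": 4, "age": 52, "income": 91000, "bought": 1},
    {"id": 5, "age": 29, "income": 41000, "bought": 0},
]

def select(records):
    """Selection: keep only the fields relevant to the question."""
    return [{"income": r["income"], "bought": r["bought"]} for r in records]

def preprocess(records):
    """Pre-processing: drop records with missing values."""
    return [r for r in records if r["income"] is not None]

def transform(records):
    """Transformation: rescale income to thousands."""
    return [{"income_k": r["income"] / 1000, "bought": r["bought"]} for r in records]

def mine(records):
    """Data mining: learn a one-variable threshold (midpoint of the class means)."""
    buyers = [r["income_k"] for r in records if r["bought"] == 1]
    others = [r["income_k"] for r in records if r["bought"] == 0]
    return (mean(buyers) + mean(others)) / 2

def evaluate(records, threshold):
    """Interpretation/evaluation: share of records the rule classifies correctly.

    Scored on the same records used for mining, purely for brevity.
    """
    hits = sum((r["income_k"] > threshold) == bool(r["bought"]) for r in records)
    return hits / len(records)

data = transform(preprocess(select(raw)))
threshold = mine(data)
accuracy = evaluate(data, threshold)
print(f"threshold = {threshold:.1f}k, accuracy = {accuracy:.2f}")
```

In the simplified three-step view, `select`, `preprocess` and `transform` together form the pre-processing step, while `evaluate` plays the role of results validation.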
Suppose we have a standard DS project (without any industry-specific peculiarities); then the lifecycle would typically include:
✓ Business understanding
✓ Data acquisition and understanding
✓ Modelling
✓ Deployment
✓ Customer acceptance
The DS project life cycle is an iterative process of research and discovery that provides guidance on the tasks needed to use predictive models. The goal of this process is to move a DS project to a clear end-point by providing means for easier and clearer communication between teams and customers, with a well-defined set of artifacts and standardized templates to homogenize procedures and avoid misunderstandings.
Each stage has the following information:
• Goals and specific objectives of the stage
• A clear outline of specific tasks and instructions on how to complete them
• The expected deliverables (artifacts)
Business understanding
Before you even embark on a DS project, you need to understand the problem you're trying to solve and define the central objectives of your project by identifying the variables to predict.
Goals:
✓ Identify key variables that will serve as model targets and as the metrics for defining the success of the project
✓ Identify data sources that the business already has access to or needs to obtain access to
Guidelines:
Work with customers and stakeholders to define business problems and formulate questions that data science needs to answer.
The goal here is to identify the key business variables (aka model targets) that your analysis needs to predict and that the project's success would be assessed against. For example, the sales forecasts: this is what needs to be predicted, and at the end of your project you compare your predictions to the actual volume of sales.
Define project goals by asking specific questions related to data science, such as:
How much/many? (regression)
Which category? (classification)
Which group? (clustering)
Does this make sense? (anomaly detection)
Which option should be taken? (recommendation)
Business Requirements
✓ The purpose of business requirements is to define a project's business need, as well as the criteria of its success.
✓ Business requirements describe why a project is needed, whom it will benefit, when and where it will take place, and what standards will be used to evaluate it.
✓ Business requirements generally do not define how a project is to be implemented; the requirements of the business need do not encompass a project’s implementation details.
✓ “Business requirements are higher-level statements of the goals, objectives, or needs of the enterprise.”
✓ “They describe the reasons why a project has been initiated, the objectives that the project will achieve, and the metrics that will be used to measure its success.”
✓ In short, business requirements chart where a project is going, not how it’s going to get there.
The business requirements the analyst creates for this project would include (but not be limited to):
• Identification of the business problem (key objectives of the project), i.e., “Declining ticket sales require a strategy to increase the number of customers at our theatres.”
• Why the solution has been proposed (its benefits; why it will produce the desired outcome of returning ticket sales to higher levels), i.e., “Customers have overwhelmingly cited the inconvenience of standing in line as the primary reason they no longer attend our theatre. We will remove this impediment by enabling customers to buy and print their theatre tickets at home with just a few clicks.”
• The scope of the project. A few examples might be: “While the plan is to bring this project to all 400 theatres eventually, we will start with 50 theatres in the most populated metropolitan areas.”
• Rules, policies, and regulations. For example, “We will design our web site and commerce so that all other relevant governmental regulations are properly adhered to.”
• Key features of the service (without details as to how they will be implemented). A few examples might include: “1. We will provide a secure site for the user to select the number of tickets and showing they wish, and to enter their payment information. 2. We will give the user the option to store his or her card information in our system so that they do not have to re-enter it in a later session. 3. The system will accommodate credit, debit, or PayPal payment methods only.”
• [...] at any given time [...] within [...] performance [...]
• Criteria (metrics) [...] within 12 months of its [...].
The business requirements would not include:
• A description of how to adhere to governmental regulatory requirements.
• Any description of how [...] will be implemented, such as “[...] information will be backed up every [...].”
• Any description of how the unique ticket identifier [...].
• Any detail or specific [...], such as “The credit card [...] will be 20 characters long [...] user selects Yes [...], the information will be loaded [...] to our XYZ storage [...].”
While the above examples are textual, business requirements may include graphs, models, or any combination of these that best serves the project. Effective business requirements require strong strategic thinking, significant input from a project's business owner, and the ability to clearly state the needs of a project at a high level.
As with all requirements, business requirements should be:
• Verifiable. Just because business requirements state business needs rather than technical specifications doesn't mean they mustn't be demonstrable. Verifiable requirements are specific and objective. A quality control expert must be able to check, for example, that the system accommodates the debit, credit, and PayPal methods specified in the business requirements. He or she could not do so if the requirements were more vague, for example, “The system will accommodate appropriate payment methods.” (Appropriate is subject to interpretation.)
• Unambiguous, stating precisely what problem is being solved. For example, “This project will be successful if ticket sales increase sufficiently,” is probably too vague to be judged at the project's end.
• Comprehensive. Business requirements are indeed big picture [...]. In the aforementioned example, the developers would need to know to design a system that could accommodate the number of customers the theatre chain has at any one time without performance issues; if the requirements did not say so, the developers might not design for it.
Remember that business requirements answer the what's, not the how's, but they are meticulously thorough. No business point is overlooked. At a project's end, the business requirements should serve as a methodical record of the initial business problem and the scope of its solution.
Business understanding means understanding the project objectives and requirements from a domain perspective and then converting this knowledge into a data science problem definition with a preliminary plan designed to achieve the objectives. Data science projects are often structured around the specific needs of an industry sector (as shown below) or even tailored and built for a single organization. A successful data science project starts from a well defined question or need.
Data Acquisition
Data acquisition (commonly abbreviated as DAQ or DAS) is the process of sampling signals that measure real-world physical phenomena and converting them into a digital form that can be manipulated by a computer and software.
Data Acquisition is generally accepted to be distinct from earlier forms of recording to tape recorders or paper charts. Unlike those methods, the signals are converted from the analog domain to the digital domain and then recorded to a digital medium such as ROM, flash media, or hard disk drives.
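The conversion from the analog to the digital domain can be pictured as two steps: sampling the signal at regular time intervals and quantizing each sample to one of a finite number of levels. The sketch below is only an illustration under assumed parameters (a 5 Hz sine wave as the signal, a 100 Hz sample rate, an 8-bit converter with a ±1 V input range); none of these values come from the notes, and the function name is hypothetical.

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate_hz, n_bits, v_min, v_max):
    """Sample a continuous signal and map each sample to an n-bit integer code."""
    levels = 2 ** n_bits
    step = (v_max - v_min) / (levels - 1)          # voltage width of one code
    codes = []
    n_samples = int(duration_s * sample_rate_hz)
    for i in range(n_samples):
        t = i / sample_rate_hz                     # sampling instant in seconds
        v = min(max(signal(t), v_min), v_max)      # clip to the input range
        codes.append(round((v - v_min) / step))    # nearest quantization level
    return codes

# Hypothetical analog source: a 5 Hz sine wave between -1 V and +1 V.
def sine(t):
    return math.sin(2 * math.pi * 5 * t)

codes = sample_and_quantize(
    sine, duration_s=1.0, sample_rate_hz=100, n_bits=8, v_min=-1.0, v_max=1.0
)
print(len(codes), min(codes), max(codes))
```

The resulting list of integer codes is the "digital form" that can then be stored, visualized or analysed by software.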
The Purposes of Data Acquisition
The primary purpose of a data acquisition system is to acquire and store the data. But they are also intended to provide real-time and post-recording visualization and analysis of the data. Furthermore, most data acquisition systems have some analytical and report generation capability built-in.
• Real-time data visualization
• Post-recording data review
• Data analysis using various mathematical and statistical calculations
• Report generation
Data Preparation
Data is information, typically the results of measurement (numerical) or counting (categorical). Variables serve as placeholders for data. There are two types of variables, numerical and categorical.
A numerical or continuous variable is one that can accept any value within a finite or infinite interval (e.g., height, weight, temperature, blood glucose). There are two types of numerical data, interval and ratio. Data on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided because there is no true zero. For example, we cannot say that one day is twice as hot as another day. On the other hand, data on a ratio scale has a true zero and can be added, subtracted, multiplied or divided (e.g., weight).
A categorical or discrete variable is one that can accept two or more values (categories). There are two types of categorical data, nominal and ordinal. Nominal data does not have an intrinsic ordering in the categories. For example, "gender" with two categories, male and female. In contrast, ordinal data does have an intrinsic ordering in the categories. For example, "level of energy" with three orderly categories (low, medium and high).
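The distinction between the four measurement scales shows up in which operations give meaningful answers. The Python sketch below uses made-up values to illustrate this; the kelvin conversion is an added illustration of why ratios on an interval scale are not meaningful, and the `Energy` enum is a hypothetical encoding of the "level of energy" example.

```python
from enum import IntEnum
from statistics import mode

# Ratio scale (true zero): ratios are meaningful.
weights_kg = [40.0, 80.0]
weight_ratio = weights_kg[1] / weights_kg[0]              # twice as heavy

# Interval scale (no true zero): differences are meaningful, ratios are not.
temps_c = [10.0, 20.0]
temp_diff = temps_c[1] - temps_c[0]                       # 10 degrees warmer
ratio_c = temps_c[1] / temps_c[0]                         # 2.0 in Celsius ...
ratio_k = (temps_c[1] + 273.15) / (temps_c[0] + 273.15)   # ... but not in kelvin

# Nominal: categories without order; counting and the mode make sense.
gender = ["female", "male", "female"]
most_common = mode(gender)

# Ordinal: categories with an intrinsic order; comparison and median make sense.
class Energy(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

levels = [Energy.LOW, Energy.HIGH, Energy.MEDIUM]
middle = sorted(levels)[len(levels) // 2]                 # median category

print(weight_ratio, temp_diff, round(ratio_k, 3), most_common, middle.name)
```

The same 10-degree difference gives a "ratio" of 2 in Celsius but only about 1.035 in kelvin, which is why "twice as hot" is not a valid statement on an interval scale.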
Dataset
Dataset is a collection of data, usually presented in a tabular form. Each column represents a
particular variable, and each row corresponds to a given member of the data.
[Figure: example table labelling its columns, rows and values]
There are some alternatives for columns, rows and values.
• Columns, Fields, Attributes, Variables
• Rows, Records, Objects, Cases, Instances, Examples, Vectors
• Values, Data
In predictive modeling, predictors or attributes are the input variables, and the target or class attribute is the output variable whose value is determined by the values of the predictors and the function of the predictive model.
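A small Python sketch may make the terminology concrete. The table below is hypothetical (a few weather-style records invented for illustration); each dictionary is a row, each key is a column, and the `play` column is arbitrarily chosen as the target.

```python
# Hypothetical dataset: each dict is a row (record), each key a column (variable).
dataset = [
    {"outlook": "sunny", "temperature": 30, "windy": False, "play": "no"},
    {"outlook": "overcast", "temperature": 22, "windy": True, "play": "yes"},
    {"outlook": "rainy", "temperature": 18, "windy": False, "play": "yes"},
]

target = "play"                                        # the class attribute
predictors = [c for c in dataset[0] if c != target]    # the input variables

# One vector of predictor values per row, plus the matching target values.
X = [[row[c] for c in predictors] for row in dataset]
y = [row[target] for row in dataset]

# A single column is the list of values of one variable across all rows.
temperature_column = [row["temperature"] for row in dataset]

print(predictors)
print(X[0], "->", y[0])
```

A predictive model is then a function that maps each vector in `X` to a value of `y`.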
Database
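A database collects, stores and manages information so users can retrieve, add, update or remove such information. It presents information in tables with rows and columns. A table is referred to as a relation in the sense that it is a collection of objects of the same type (rows). Data in a table can be related according to common keys or concepts, and the ability to retrieve related data from related tables is the basis for the term relational database. A Database Management System (DBMS) handles the way data is stored, maintained, and retrieved.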
SQL (Structured Query Language) is a database computer language for managing and manipulating data in relational database management systems (RDBMS).
SQL Data Definition Language (DDL) permits database tables to be created, altered or deleted. We
can also define indexes (keys), specify links between tables, and impose constraints between
database tables.
• CREATE INDEX : creates an index
• DROP INDEX : deletes an index
• CREATE TABLE : creates a new table
• ALTER TABLE : alters a table
• DROP TABLE : deletes a table
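The DDL statements listed above can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch assuming SQLite as the database engine (the notes do not name one), and the `customer` and `loan` tables are hypothetical; other database systems differ in details such as `ALTER TABLE` and `DROP INDEX` syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# CREATE TABLE: two tables linked by a key, with constraints on the columns.
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur.execute(
    "CREATE TABLE loan ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(id),"
    " amount REAL CHECK (amount > 0))"
)

# CREATE INDEX and ALTER TABLE
cur.execute("CREATE INDEX idx_loan_customer ON loan (customer_id)")
cur.execute("ALTER TABLE loan ADD COLUMN defaulted INTEGER DEFAULT 0")

def schema_objects():
    """Names of all tables and indexes currently defined."""
    rows = cur.execute(
        "SELECT name FROM sqlite_master WHERE type IN ('table', 'index') ORDER BY name"
    )
    return [r[0] for r in rows]

loan_columns = [r[1] for r in cur.execute("PRAGMA table_info(loan)")]
before = schema_objects()

# DROP INDEX and DROP TABLE
cur.execute("DROP INDEX idx_loan_customer")
cur.execute("DROP TABLE loan")
after = schema_objects()

print(loan_columns, before, after)
```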
SQL Data Manipulation Language (DML) is a language which enables users to access and manipulate data.
• SELECT : retrieval of data from the database
• INSERT INTO : insertion of new data into the database
• UPDATE : modification of data in the database
• DELETE : deletion of data in the database
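The four DML statements can be exercised in the same way. The sketch below again assumes SQLite through Python's `sqlite3` module, and the `customer` table with its rows is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, balance REAL)")

# INSERT INTO: add new rows (values are passed separately from the SQL text).
cur.executemany(
    "INSERT INTO customer (name, balance) VALUES (?, ?)",
    [("Ann", 120.0), ("Bob", 45.5), ("Cy", 0.0)],
)

# UPDATE: modify existing rows.
cur.execute("UPDATE customer SET balance = balance + 10 WHERE name = ?", ("Bob",))

# DELETE: remove rows.
cur.execute("DELETE FROM customer WHERE balance = 0")

# SELECT: retrieve data.
rows = cur.execute("SELECT name, balance FROM customer ORDER BY name").fetchall()
conn.commit()

print(rows)
```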
ETL (Extraction, Transformation and Loading)
ETL extracts data from data sources and loads it into data destinations using a set of transformation functions.
• Data extraction provides the ability to extract data from a variety of data sources, such as flat files, relational databases, streaming data, XML files, and ODBC/JDBC data sources.
• Data transformation provides the ability to cleanse, convert, aggregate, merge, and split data.
• Data loading provides the ability to load data into destination databases via update, insert or delete statements, or in bulk.
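The three ETL steps can be sketched end to end in a few lines of Python. This is only an illustration under assumptions: the source is a small CSV "flat file" held in a string, the destination is an in-memory SQLite table, and the column names and cleansing rules are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source: a flat file, held in a string to stay self-contained.
SOURCE_CSV = """customer,city,amount
ann, Paris ,100.50
bob,Berlin,
ann,Paris,20.25
cy,Rome,7
"""

def extract(text):
    """Extraction: read rows from the flat-file source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transformation: cleanse, convert, then aggregate per customer."""
    totals = {}
    for row in rows:
        if not row["amount"].strip():          # cleanse: skip rows with no amount
            continue
        name = row["customer"].strip().title() # cleanse: normalise the name
        totals[name] = totals.get(name, 0.0) + float(row["amount"])  # convert + aggregate
    return sorted(totals.items())

def load(records, conn):
    """Loading: bulk insert into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_total (customer TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany("INSERT INTO customer_total VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute(
    "SELECT customer, total FROM customer_total ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the source-to-destination flow and lets each step be swapped out independently, for example reading from a database instead of a file.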
[Figure: ETL process, from Source to Destination]
Credit Default Datasets
Data Exploration
Data Exploration is about describing the data by means of statistical and visualization techniques. We explore data in order to bring important aspects of that data into focus for further analysis.
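A first pass at exploration usually looks at one variable at a time. The sketch below, using only Python's standard library and made-up values, summarises a numerical variable with descriptive statistics and a categorical variable with a frequency table; the text histogram is a crude stand-in for a real chart.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical sample: one numerical and one categorical variable.
ages = [23, 35, 31, 52, 46, 35, 28]
housing = ["rent", "own", "rent", "own", "rent", "rent", "own"]

numeric_summary = {
    "count": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": mean(ages),
    "median": median(ages),
    "stdev": stdev(ages),   # sample standard deviation
}

# For a categorical variable, a frequency table plays the same role.
frequency = Counter(housing)

print(numeric_summary)
for category, n in frequency.most_common():
    print(f"{category:<5} {'#' * n} ({n})")
```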
✓ Univariate Analysis
Modeling
Predictive modeling is the process by which a model is created to predict an outcome. If the outcome is categorical it is called classification, and if the outcome is numerical it is called regression. Descriptive modeling or clustering is the assignment of observations into clusters so that observations in the same cluster are similar. Finally, association rules can find interesting associations amongst observations.
Model Evaluation
Model Evaluation is an integral part of the model development process. It helps to find the best model that represents our data and how well the chosen model will work in the future. Evaluating model performance with the data used for training is not acceptable in data science because it can easily generate overoptimistic and overfitted models. There are two methods of evaluating models in data science, Hold-Out and Cross-Validation. To avoid overfitting, both methods use a test set (not seen by the model) to evaluate model performance.
Hold-Out
In this method, the (usually large) dataset is randomly divided into three subsets:
1. Training set is a subset of the dataset used to build predictive models.
2. Validation set is a subset of the dataset used to assess the performance of the model built in the training phase. It provides a test platform for fine tuning the model's parameters and selecting the best-performing model. Not all modeling algorithms need a validation set.
3. Test set, or unseen examples, is a subset of the dataset used to assess the likely future performance of a model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.
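A hold-out split can be sketched in a few lines of Python. The 60/20/20 proportions, the fixed random seed and the function name below are assumptions made for illustration; the notes do not prescribe particular subset sizes.

```python
import random

def hold_out_split(rows, train=0.6, validation=0.2, seed=0):
    """Randomly partition rows into training, validation and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)    # fixed seed only for reproducibility
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],          # whatever remains is the test set
    )

rows = list(range(100))                      # stand-in for 100 dataset rows
train_set, validation_set, test_set = hold_out_split(rows)
print(len(train_set), len(validation_set), len(test_set))
```

Because the three subsets are slices of one shuffled list, no row can appear in more than one of them, so the test set stays unseen during training and tuning.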
Cross-Validation
When only a limited amount of data is available, to achieve an unbiased estimate of the model performance we use k-fold cross-validation. In k-fold cross-validation, we divide the data into k subsets of equal size. We build models k times, each time leaving out one of the subsets from training and using it as the test set. If k equals the sample size, this is called "leave-one-out".
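The fold bookkeeping behind k-fold cross-validation can be sketched as a small generator of index lists. This is a minimal illustration with a hypothetical function name; it does not shuffle the data first (which is commonly done in practice) and leaves the model building itself out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
loo = list(k_fold_indices(n_samples=4, k=4))   # k equal to n gives leave-one-out

for train, test in folds:
    print("train:", train, "test:", test)
```

Each row is used for testing exactly once and for training k - 1 times; the k test scores are then typically averaged into one performance estimate.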
Model evaluation can be divided into two sections:
• Classification Evaluation
• Regression Evaluation
Model Deployment
The concept of deployment in data science refers to the application of a model for prediction using new data. Building a model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data science process. In many cases, it will be the customer, not the data analyst, who will carry out the deployment steps. For example, a credit card company may want to deploy a trained model or set of models (e.g., neural networks, meta-learner) to quickly identify transactions which have a high probability of being fraudulent. However, even if the analyst will not carry out the deployment effort, it is important for the customer to understand up front what actions will need to be carried out in order to actually make use of the created models.
Model deployment methods:
In general, there are four ways of deploying the models in data science.
1. Data science tools (or cloud)
2. Programming language (Java, C, VB, ...)
3. Database and SQL script (TSQL, PL-SQL, ...)
4. PMML (Predictive Model Markup Language)
[Figure: an example of using a data mining tool (Orange) to deploy a decision tree model]