Data Analysis Process My Notes

All data analytic projects involve similar processes like exploratory data analysis (EDA), machine learning (ML), and predictive modeling. This document outlines the steps in a typical project, including data auditing to check for issues like missing values, duplicates, and appropriate variable types and levels of analysis, as well as data preparation through activities like merging, aggregating, and handling relationships between files. It then discusses exploring and summarizing the data in EDA and diagnosing problems to identify root causes. Various types of business problems are classified that can potentially be addressed through predictive analytics techniques like regression, classification, segmentation, forecasting and optimization using algorithms ranging from linear models to decision trees to neural networks.

Uploaded by

Shameel Lamba


All data analytics projects share a few similar processes:

● EDA
● ML
● Predictive modeling

Steps Involved :-

1. Data Audit
2. Data preparation = file preparation for analysis

Data Audit ?
1. Is the data legitimate?
- Sample or population? If the data is sampled, you should understand whether the sample
is representative of the population. You should know the characteristics of the
population.
- Know the distribution of the data (averages, percentages, etc.)
- How was the sample drawn?
- Sample extraction methodology
- SRS (simple random sampling)
- Stratified sample
- What percentage of the population does the sample cover?

2. Data Dictionary?
- Meta data = data about data
- Variables
- Data types
- Table names
- Which variables are key variables
- No. of observation
- No. of columns
- Size of the data
- encoded or not?
- Duplicates
- Missing
- Time series or normal data
- Relationships between tables (Schema - structure of database)

3. Sources of data
- Internal sources
- External sources

4. Analysis at what level - do we have the data at that level?

- Customer level?
- Store level
- Branch level
- Product level

5. Data checks at variable level ?

- Data types - mismatch - Was the data imported with the types given in the data
dictionary (or the types intuition suggests)?
- Renaming variables ?
- Does the data have any special characters ?
- #, #NA, #error, inf, -inf, 9999999, #value; what is the meaning of zero?
- Missing values ? Do missing values have any encoding ?
- Does the data have any outliers ?
- Data entry problems
- Outliers may occur only in specific scenarios
- Does the data require any normalisation ?
- Derived variables
- Calculated variables
- Extracting specific information from a column
- Transformations
- Binning variables
- Encoded variables
- Which variables make each observation unique ?
- Key variable identification
- Does the data have duplicate records ?
- How to join the data
- Does the data require any renaming of variables ?
- Does the data require any kind of type casting ?
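
A few of the variable-level checks above can be sketched in plain Python. This is only an illustration under stated assumptions: the rows, the column names (cust_id, age, income) and the set of codes treated as missing are hypothetical, and the real list of missing encodings should come from the data dictionary.

```python
# Hypothetical raw rows, as they might arrive from a CSV export (all values are text).
rows = [
    {"cust_id": "C1", "age": "34", "income": "52000"},
    {"cust_id": "C2", "age": "#NA", "income": "61000"},
    {"cust_id": "C3", "age": "29", "income": "9999999"},
    {"cust_id": "C1", "age": "34", "income": "52000"},  # exact duplicate of the first row
    {"cust_id": "C4", "age": "", "income": "inf"},
]

# Encodings treated as "missing" in this sketch (an assumption, not a standard list).
MISSING_CODES = {"", "#", "#NA", "#error", "#value", "inf", "-inf", "9999999"}


def audit(rows, key):
    """Return missing counts per column, the duplicate row count, and whether `key` is unique."""
    missing = {col: 0 for col in rows[0]}
    seen_rows, seen_keys = set(), set()
    duplicate_rows, key_is_unique = 0, True
    for row in rows:
        for col, value in row.items():
            if value.strip() in MISSING_CODES:
                missing[col] += 1
        fingerprint = tuple(sorted(row.items()))  # whole-row identity
        if fingerprint in seen_rows:
            duplicate_rows += 1
        seen_rows.add(fingerprint)
        if row[key] in seen_keys:
            key_is_unique = False
        seen_keys.add(row[key])
    return missing, duplicate_rows, key_is_unique


missing, duplicate_rows, key_is_unique = audit(rows, key="cust_id")
print("missing per column:", missing)
print("duplicate rows:", duplicate_rows)
print("key is unique:", key_is_unique)
```

In this toy data the candidate key cust_id is not unique because one row is duplicated, so the duplicates would need handling before any join.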

File Preparation:
- Appending
- Merging
- Aggregations
- For example:
- A retail outlet has these files:
- Customer - customer level
- Transaction - transaction level
- Product hierarchy - product level
- Returns - transaction level
- The task is to create a single table at customer level (customer 360)
- Handle all the problems found in the data audit phase.
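
The customer 360 idea can be sketched as: aggregate each transaction-level file to customer level, then join the aggregates onto the customer file. The table contents and column names below (cust_id, amount, city) are hypothetical, and plain Python is used only so the sketch runs without extra libraries.

```python
from collections import defaultdict

# Hypothetical tables at different levels (names and columns are illustrative).
customers = [  # customer level: one row per customer
    {"cust_id": "C1", "city": "Pune"},
    {"cust_id": "C2", "city": "Delhi"},
]
transactions = [  # transaction level: several rows per customer
    {"txn_id": "T1", "cust_id": "C1", "amount": 100.0},
    {"txn_id": "T2", "cust_id": "C1", "amount": 50.0},
    {"txn_id": "T3", "cust_id": "C2", "amount": 80.0},
]
returns = [  # also transaction level
    {"txn_id": "T2", "cust_id": "C1", "amount": 50.0},
]


def total_by_customer(rows):
    """Aggregate a transaction-level table to customer level (sum and count)."""
    totals = defaultdict(lambda: {"total": 0.0, "count": 0})
    for row in rows:
        totals[row["cust_id"]]["total"] += row["amount"]
        totals[row["cust_id"]]["count"] += 1
    return totals


sales = total_by_customer(transactions)
refunds = total_by_customer(returns)

# Left-join the aggregates onto the customer table: one output row per customer.
# Customers with no returns get zeros because of the defaultdict.
customer_360 = []
for cust in customers:
    cid = cust["cust_id"]
    customer_360.append({
        **cust,
        "txn_count": sales[cid]["count"],
        "gross_sales": sales[cid]["total"],
        "returned": refunds[cid]["total"],
        "net_sales": sales[cid]["total"] - refunds[cid]["total"],
    })

for row in customer_360:
    print(row)
```

The important design point is to aggregate before joining: joining transaction-level rows directly onto customers would repeat each customer once per transaction.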
Dimensions to explore while analyzing projects:
1. Exploring, auditing and understanding the data (EDA)
2. How to summarize, aggregate or present the data in a specific form (EDA)
3. How to draw insights from the data and find possible problems (EDA)
4. Diagnosing the problems - identifying the root causes of the problem (diagnostic analysis)

How to solve the problem ? ---> Predictive analytics can help to solve the problem, but not
every problem can be solved with it.

What are business problems :-

● Sales are not increasing
● NPA - non-performing assets (loans going into default)
○ So how to reduce the NPAs ?
● Fraud
○ Transaction level
○ Customer
○ Insurance claims
○ How to mitigate the risk of fraud ?
● How to plan for the future ?
○ Mostly in the manufacturing sector…
○ How to estimate the demand ?
● How to identify genuine customers to sell products to ?
○ Low conversion rate

All businesses are there for:-

1. Revenue / top line
a. How to increase the revenue?
i. Increase the customer base by adding new customers
ii. Engage the customers to buy more
1. Cross-sell
2. Up-sell
iii. Retain existing customers
iv. Win back lost customers

2. Profit / bottom line

a. Optimize the cost
i. Reducing marketing spend
ii. Other expenses

So we need to classify these problems into buckets and then solve each one in
a specific way.

Business problems - Classification

1. Predicting a value (Regression)
2. Predicting an event (Classification)
a. Classify the data into predefined groups
3. Grouping the data into n groups based on similarities, where 'n' is not known in advance
(Segmentation)
4. Predicting a value over time (Forecasting)
a. Here time is indexed: the prediction is linked to time.
5. Optimizing something (Optimization)
6. Others (other problems)

Business problems - Classification 2:-

1. Strategic problems
a. How to increase the revenue / business profits and define strategies for long-term
problems?
2. Operations problems
a. Primarily increasing operational efficiency

Business problems - Classification 3:-

1. Supervised vs unsupervised
2. If there is an objective (a target Y variable) = supervised (regression, classification,
forecasting)
3. If there is no objective = unsupervised (segmentation)

Algorithms / Techniques :-

● Regression: OLS regression, decision trees, bagging, random forest, AdaBoost, gradient
boosting, XGBoost, SVR, KNN, ANN, lasso regression, elastic net regression, ridge regression
● Classification: logistic regression, decision trees, bagging, random forest, AdaBoost,
gradient boosting, XGBoost, SVC, ANN, KNN, naive Bayes
● Segmentation:
○ Heuristic: value based, life stage, loyalty, RFM
○ Scientific: K-means, hierarchical, DBSCAN
● Forecasting:
○ Basic: averages (MA, WMA, CMA), ETS models (exponential)
○ Medium: SARIMAX (ARIMA, SARIMA, ARIMAX, SARIMAX)
○ Advanced: VAR, wavelets, ARCH / GARCH
● Optimization: linear programming, nonlinear programming, integer programming
● Others: recommendation systems (MBA, collaborative filtering), survival analysis
In model building, the variable whose value is not available (the one we want to predict) is
called the Y variable, and the variables that are available are called X variables. In other
words, my sales (Y) depend on the X variables.

Y variable = dependent variable

X variables = independent variables, driving factors, features, predictors

If we can establish a mathematical relationship between Y and X, we can use that equation to
predict the sales.

Modeling is trying to come up with some mathematical relationship between Y and X.

The predictive model is the equation you come up with between Y and X.

Now, how to come up with a relationship ?

Algorithms help to establish the relationship.

Linear regression establishes a linear relationship between Y and one or more X variables.

Y = B1*X1 + B2*X2 + ... + Bn*Xn + C (linear model)

Known things :-
● Y, X1, X2, X3… from existing data

Unknown things :-
● B1, B2, B3, etc. and C. We need to estimate these unknowns to form the relationship.

How to estimate these Betas ?

Converting normally distributed data to the standard normal distribution:

X follows a normal distribution
Z = (X - Mean(X)) / Std(X)

Now, Z follows the standard normal distribution (mean 0, standard deviation 1).
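
As a small worked example of the Z formula, assuming a made-up list of values and using the population standard deviation (statistics.pstdev; statistics.stdev would be the sample version):

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values only

mean_x = statistics.fmean(x)
std_x = statistics.pstdev(x)  # population standard deviation (an assumption of this sketch)

# Z = (X - Mean(X)) / Std(X), applied to every value
z = [(value - mean_x) / std_x for value in x]

print("mean:", mean_x, "std:", std_x)
print("z scores:", z)
print("mean of z:", statistics.fmean(z), "std of z:", statistics.pstdev(z))
```

After the conversion the values are centred on 0 with a spread of 1, while their relative ordering and the shape of the distribution stay the same.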

In statistics, you can do two things in terms of distribution.

● Either rescale the distribution (shift its location and change its spread) without
changing its shape (e.g. Z score)
● Or transform the distribution into another distribution (e.g. log-normal to normal
using the log function)

How to check distributions:

● Look at the graph (e.g. a histogram); this is one approach
● Check the characteristics of the curve: mean, median, mode, etc.
● Use statistical tests that tell whether the distribution is normal, etc.
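
One way to check the characteristics numerically is to compare mean with median and look at skewness before and after a log transform. The sketch below assumes simulated log-normal data (so the log is normal by construction) and a simple moment-based skewness helper written for this example; it is not a formal normality test.

```python
import math
import random
import statistics


def skewness(values):
    """Moment-based skewness: close to 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])


rng = random.Random(0)  # fixed seed so the run is repeatable
# Right-skewed illustrative data: log-normal, so log(x) is normal by construction.
x = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
log_x = [math.log(v) for v in x]

print("raw: mean", statistics.fmean(x), "median", statistics.median(x), "skew", skewness(x))
print("log: mean", statistics.fmean(log_x), "median", statistics.median(log_x), "skew", skewness(log_x))
```

For the raw data the mean sits well above the median and the skewness is strongly positive; after the log transform the mean and median are close and the skewness is near zero.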

Linear Regression:
● Linear relationship between the X & Y variables.
● Form of the relationship by number of independent variables:

Target   Independent   Linear relationship
Y        X             Y = B*X + C
Y        X1, X2        Y = B1*X1 + B2*X2 + C
Y        X1, X2, X3    Y = B1*X1 + B2*X2 + B3*X3 + C

● Independent variables:
○ 1 independent variable = simple linear regression
○ 2 or more independent variables = multiple linear regression (often loosely called a
multivariate model/regression)
● How to estimate the values of B & C?
○ Search for the best possible values of B & C: the line that is closest to the
points, i.e. the one with the smallest sum of squared deviations from the line (SSE).
○ B = the change in Y with respect to a change in X. If we change X by one unit, what
is the change in Y?
○ C = the value of Y when X is 0, i.e. the part of Y that is not explained by the X
variables but by factors other than X.
● Goodness of fit?
○ How accurate is my model?
○ For OLS, a common measure is RMSE (root mean squared error).
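
For the one-variable case, the search for B and C has a closed-form answer. The sketch below uses made-up data points and the standard least-squares formulas for simple linear regression, then compares the SSE of the fitted line with two nearby lines to show that it is the smallest.

```python
import math

# Illustrative data, roughly following Y = 2*X + 1 with some noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form OLS estimates for Y = B*X + C (the values that minimise SSE).
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
c = mean_y - b * mean_x


def sse_of(slope, intercept):
    """Sum of squared deviations of the points from the line slope*X + intercept."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


sse = sse_of(b, c)
rmse = math.sqrt(sse / n)  # root mean squared error of the fitted line

print("B:", b, "C:", c)
print("SSE:", sse, "RMSE:", rmse)
print("SSE with a slightly steeper line:", sse_of(b + 0.1, c))
print("SSE with a slightly higher line:", sse_of(b, c + 0.1))
```

Any other line, such as the two perturbed ones printed at the end, gives a larger SSE than the fitted B and C.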

Feature Engineering (feature selection / feature reduction / variable reduction / dimension
reduction)
1. Creation of new variables, encoding the variables ----> done as data prep 1
2. How to reduce the variables ?

1. Drop some variables based on the data audit

a. Drop constant or near-constant variables (near-zero variance), e.g. CV < 0.05
b. Drop one of each pair of highly correlated variables (> 80% correlation)
c. Drop variables with lots of missing values (> 25% missing)
d. If you have created derived variables, you should drop the original variables.
e. Drop the variables that make each observation unique (key variables)
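
Rules a, b, c and e can be sketched as a simple filter over columns. Everything below is an assumption made for illustration: the column names and values are hypothetical, None marks a missing value, the thresholds are the ones from the notes, and the correlation rule keeps whichever variable of a correlated pair appears first.

```python
import math
import statistics

# Hypothetical columns; None marks a missing value.
data = {
    "cust_id": [1, 2, 3, 4, 5, 6, 7, 8],            # key: unique per row
    "income": [40, 52, 61, 45, 70, 58, 66, 49],
    "income_k": [41, 52, 60, 45, 71, 58, 65, 50],   # near copy of income
    "country": [1, 1, 1, 1, 1, 1, 1, 1],            # constant
    "age": [25, None, None, 31, None, 44, 38, 29],  # 3 of 8 values missing
    "visits": [3, 9, 1, 7, 2, 8, 4, 6],
}


def pearson(a, b):
    """Pearson correlation on the rows where both values are present."""
    pairs = [(p, q) for p, q in zip(a, b) if p is not None and q is not None]
    mean_a = statistics.fmean([p for p, _ in pairs])
    mean_b = statistics.fmean([q for _, q in pairs])
    cov = sum((p - mean_a) * (q - mean_b) for p, q in pairs)
    var_a = sum((p - mean_a) ** 2 for p, _ in pairs)
    var_b = sum((q - mean_b) ** 2 for _, q in pairs)
    return cov / math.sqrt(var_a * var_b)


def reduce_variables(data, key, max_missing=0.25, min_cv=0.05, max_corr=0.80):
    """Apply the audit-based drop rules in order; return kept names and drop reasons."""
    kept, dropped = [], {}
    for name, values in data.items():
        present = [v for v in values if v is not None]
        spread = statistics.pstdev(present) if present else 0.0
        if name == key:
            dropped[name] = "key variable"
        elif 1 - len(present) / len(values) > max_missing:
            dropped[name] = "too many missing values"
        elif spread == 0 or spread < min_cv * abs(statistics.fmean(present)):
            dropped[name] = "near-zero variance"
        elif any(abs(pearson(values, data[k])) > max_corr for k in kept):
            dropped[name] = "highly correlated with a kept variable"
        else:
            kept.append(name)
    return kept, dropped


kept, dropped = reduce_variables(data, key="cust_id")
print("kept:", kept)
print("dropped:", dropped)
```

Which variable of a correlated pair to keep is a judgment call; this sketch keeps the first one it sees, whereas in practice the choice usually depends on business meaning or data quality.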
