Data Analysis Process My Notes

All data analytic projects involve similar processes like exploratory data analysis (EDA), machine learning (ML), and predictive modeling. This document outlines the steps in a typical project, including data auditing to check for issues like missing values, duplicates, and appropriate variable types and levels of analysis, as well as data preparation through activities like merging, aggregating, and handling relationships between files. It then discusses exploring and summarizing the data in EDA and diagnosing problems to identify root causes. Various types of business problems are classified that can potentially be addressed through predictive analytics techniques like regression, classification, segmentation, forecasting and optimization using algorithms ranging from linear models to decision trees to neural networks.

Uploaded by

Shameel Lamba

All data analytics projects share a few similar processes:

● EDA
● ML
● Predictive modeling

Steps Involved :-

1. Data Audit
2. Data preparation = file preparation for analysis

Data Audit:
1. Is the data legitimate?
- Sample or population? If the data is a sample, understand whether the sample
is representative of the population. You should know the characteristics of the
population.
- Know the distribution of the data (averages, percentages, etc.)
- How was the sample drawn?
- Sample extraction methodology
- SRS (simple random sample)
- Stratified sample
- Percentage of the population covered by the sample
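
The sampling checks above can be sketched in pandas. This is only an illustration: the `population` table, the `region` column and the 10% fraction are made-up assumptions, not part of the notes. It draws a simple random sample and a stratified sample, then compares the share of each region with the population.

```python
import pandas as pd

# Hypothetical population: 1,000 customers spread unevenly across 3 regions.
population = pd.DataFrame({
    "customer_id": range(1000),
    "region": ["north"] * 600 + ["south"] * 300 + ["west"] * 100,
})

# Simple random sample (SRS): every row has the same chance of selection.
srs = population.sample(frac=0.10, random_state=42)

# Stratified sample: draw 10% within each region separately.
stratified = population.groupby("region").sample(frac=0.10, random_state=42)

# Compare region shares in the population and in each sample.
comparison = pd.DataFrame({
    "population": population["region"].value_counts(normalize=True),
    "srs": srs["region"].value_counts(normalize=True),
    "stratified": stratified["region"].value_counts(normalize=True),
})
print(comparison)

# Percentage of the population covered by the sample.
coverage_pct = 100 * len(stratified) / len(population)
print(f"Sample covers {coverage_pct:.1f}% of the population")
```

If the sample shares are far from the population shares, the sample is probably not representative on that characteristic.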

2. Data Dictionary?
- Meta data = data about data
- Variables
- Data types
- Table names
- Which variables are key variables
- No. of observations
- No. of columns
- Size of the data
- Encoded or not?
- Duplicates
- Missing values
- Time series or normal data
- Relationships between tables (Schema - structure of database)
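
Several of the data dictionary items above can be read straight off a table once it is loaded. A minimal sketch, assuming pandas; the table `df` and its columns are hypothetical:

```python
import pandas as pd

# Hypothetical table; in practice this would be loaded from a file or database.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, 45, 45, None, 29],
    "city": ["Pune", "Delhi", "Delhi", "Mumbai", None],
})

# One row per variable: data type, missing count, number of distinct values.
data_dictionary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_missing": df.isna().sum(),
    "n_unique": df.nunique(),
})

n_rows, n_cols = df.shape                           # no. of observations / columns
size_bytes = int(df.memory_usage(deep=True).sum())  # in-memory size of the data
n_duplicates = int(df.duplicated().sum())           # fully duplicated records

print(data_dictionary)
print(f"{n_rows} rows, {n_cols} columns, {size_bytes} bytes, {n_duplicates} duplicate rows")
```

Key variables, table relationships and encodings still have to come from the data owner; they cannot be inferred reliably from the data alone.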

3. Sources of data
- Internal sources
- External sources

4. At what level is the analysis needed, and do we have the data at that level?
- Customer level
- Store level
- Branch level
- Product level

5. Data checks at variable level
- Data type mismatches: was the data imported as per the data dictionary and
intuition?
- Do variables need renaming?
- Does the data have any special characters or placeholder values?
- #, #NA, #error, inf, -inf, 9999999, #value; what is the meaning of zero?
- Missing values: do missing values have any encoding?
- Does the data have any outliers?
- Data entry problems
- Outliers may occur in specific scenarios
- Does the data require any normalisation?
- Derived variables
- Calculated variables
- Extracting specific information from a column
- Transformations
- Binning variables
- Encoded variables
- Which variables make each observation unique?
- Key variable identification
- Does the data have duplicate records?
- How to join the data?
- Does the data require any type casting?
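
A few of these variable-level checks, sketched in pandas. The table, the list of placeholder codes and the 1.5 * IQR outlier rule are assumptions for illustration; the real encodings should come from the data dictionary.

```python
import pandas as pd

# Hypothetical raw import: income arrives as text with placeholder codes.
raw = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105, 106, 107, 108],
    "income": ["52000", "#NA", "48000", "9999999", "51000", "50000", "49000", "250000"],
})

# Treat known placeholder codes as missing (this list is an assumption).
placeholders = ["#", "#NA", "#error", "#value", "9999999"]
clean = raw.copy()
clean["income"] = clean["income"].where(~clean["income"].isin(placeholders))

# Type casting: text to numeric; anything unparseable becomes NaN.
clean["income"] = pd.to_numeric(clean["income"], errors="coerce")

# Flag outliers with the 1.5 * IQR rule (one common convention among several).
q1, q3 = clean["income"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["is_outlier"] = (clean["income"] < q1 - 1.5 * iqr) | (clean["income"] > q3 + 1.5 * iqr)

# Key check: does customer_id identify each observation uniquely?
key_is_unique = clean["customer_id"].is_unique

print(clean)
print("customer_id is a unique key:", key_is_unique)
```

Whether a flagged value is a data entry problem or a genuine extreme case is a judgment call that needs business context.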

File Preparation:
- Appending
- Merging
- Aggregations
- For example:
- A retail outlet has these files:
- Customer - customer level
- Transaction - transaction level
- Product hierarchy - product level
- Returns - transaction level
- The task is to create a single table at customer level (customer 360)
- Handle all the problems identified in the data audit phase.
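
The customer 360 example can be sketched with pandas merges and aggregations. All table and column names below are hypothetical, and the product hierarchy file is left out to keep the sketch short.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "city": ["Pune", "Delhi", "Mumbai"]})
transactions = pd.DataFrame({
    "txn_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 2],
    "amount": [100.0, 250.0, 80.0, 40.0],
})
returns = pd.DataFrame({"txn_id": [11], "refund": [250.0]})

# Merging: bring returns onto transactions (both are at transaction level).
txn = transactions.merge(returns, on="txn_id", how="left")
txn["refund"] = txn["refund"].fillna(0.0)

# Aggregation: roll transaction-level rows up to customer level.
per_customer = txn.groupby("customer_id", as_index=False).agg(
    n_txns=("txn_id", "count"),
    total_spend=("amount", "sum"),
    total_refund=("refund", "sum"),
)

# A left join keeps customers that have no transactions at all.
customer_360 = customers.merge(per_customer, on="customer_id", how="left")
summary_cols = ["n_txns", "total_spend", "total_refund"]
customer_360[summary_cols] = customer_360[summary_cols].fillna(0)

print(customer_360)
```

Aggregating before the final merge matters: joining customers directly to transactions would produce one row per transaction, not one row per customer.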

Dimensions to explore while analyzing projects:
1. Exploring, auditing and understanding the data (EDA)
2. How to summarize, aggregate or present the data in a specific form (EDA)
3. How to draw insights from the data and find possible problems (EDA)
4. Diagnosing the problems - identifying root causes of the problem (diagnostic analysis)

How to solve the problem? ---> Predictive analytics can help to solve the problem, but not
every problem can be solved with it.

What are business problems :-


● Sales are not increasing
● NPA - non performing assets (loans going into default)
○ How to reduce the NPAs?
● Fraud
○ Transaction level
○ Customer level
○ Insurance claims
○ How to mitigate the risk of fraud?
● How to plan for the future?
○ Mostly in the manufacturing sector
○ How to estimate the demand?
● How to identify genuine customers to sell products to?
○ Low conversion rate

All businesses are there for:-

1. Revenue / topline
a. How to increase the revenue?
i. Increase the customer base by adding new customers
ii. Engage customers to buy more
1. Cross sell
2. Up sell
iii. Retain existing customers
iv. Win back lost customers

2. Profit / bottom line


a. Optimize the cost
i. Reduce marketing spend
ii. Reduce other expenses

So we need to classify these problems into buckets and then solve each bucket in
a specific way.

Business problems - Classification


1. Predicting a value (Regression)
2. Predicting an event (Classification)
a. Classify the data into predefined groups
3. Grouping the data into n groups based on similarities, where 'n' is not known in advance
(Segmentation)
4. Predicting a value over time (Forecasting)
a. Here time is indexed: the prediction is linked to time.
5. Optimizing something (Optimization)
6. Others (other problems)

Business problems - Classification 2 :-

1. Strategic problems
a. How to increase revenue / business profits and define strategies for long term
problems?
2. Operational problems
a. Primarily increasing operational efficiency

Business problems - Classification 3 :-

1. Supervised vs unsupervised
2. If there is an objective (a Y variable to predict) = supervised (regression, classification, forecasting)
3. If there is no objective = unsupervised (segmentation)

Algorithms / Techniques :-

● Regression
○ OLS regression
○ Decision trees
○ Bagging
○ Random forest
○ Adaboost
○ Gradient boost
○ Xgboost
○ SVR
○ KNN
○ ANN
○ Lasso regression
○ Elastic net regression
○ Ridge regression
● Classification
○ Logistic regression
○ Decision trees
○ Bagging
○ Random forest
○ Adaboost
○ Gradient boost
○ Xgboost
○ SVC
○ ANN
○ KNN
○ Naive bayes
● Segmentation
○ Heuristic: value based, life stage, loyalty, RFM
○ Scientific: Kmeans, hierarchical, DBSCAN
● Forecasting
○ Basic: averages (MA, WMA, CMA), ETS models (exponential)
○ Medium: SARIMAX (ARIMA, SARIMA, ARIMAX, SARIMAX)
○ Advanced: VAR, wavelets, ARCH / GARCH
● Optimization
○ Linear programming
○ Non linear programming
○ Integer programming
● Others
○ Recommendation systems (MBA - market basket analysis, collaborative filtering)
○ Survival analysis

In model building, the variable that is not available (the one we want to predict) is called the
Y variable, and the variables that are available are called X variables. For example, sales (Y)
depend on the X variables.

Y variable = dependent variable
X variables = independent variables, driving factors, features, predictors

If we can establish a mathematical relationship between Y and X, we can use that equation
to predict the sales.

Modeling is the attempt to come up with a mathematical relationship between Y and X.

A predictive model is the equation between Y and X that you come up with.

Now, how to come up with the relationship?

Algorithms help to establish the relationship.

Linear regression establishes a linear relationship between Y and one or more X variables.

Y = B1*X1 + B2*X2 + ... + Bn*Xn + C (linear model)

Known things :-
● Y, X1, X2, X3… from existing data

Unknown things :-
● B1, B2, B3, etc. and C. We need to estimate these unknowns to form the relationship.

How to estimate these Betas?

Converting any normally distributed data into the standard normal distribution:

X follows a normal distribution
Z = (X - Mean(X)) / Std(X)

Now, Z follows the standard normal distribution (mean 0, standard deviation 1).
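
A small worked example of the Z formula, using only the Python standard library. The values in `x` are made up, and the population standard deviation is used here as an assumption; the sample version would work the same way.

```python
import statistics

# Hypothetical sample values for X.
x = [12.0, 15.0, 9.0, 18.0, 11.0, 13.0]

mean_x = statistics.fmean(x)
std_x = statistics.pstdev(x)  # population std; statistics.stdev gives the sample version

# Z = (X - Mean(X)) / Std(X)
z = [(value - mean_x) / std_x for value in x]

# After standardizing, the mean is (numerically) 0 and the std is 1.
print("mean of Z:", round(statistics.fmean(z), 6))
print("std of Z:", round(statistics.pstdev(z), 6))
```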

In statistics, you can do two things to a distribution:

● Rescale it without changing its shape: shift the location and change the scale (e.g. Z score)
● Transform it into another distribution (e.g. right-skewed, log-normal data to normal
using the log function)

How to check distributions:

● Look at the graph (e.g. a histogram)
● Check the characteristics of the curve: mean, median, mode, etc.
● Use statistical tests that tell whether the distribution is normal, etc.
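
As a rough illustration of checking the characteristics of a distribution and transforming it, the sketch below generates hypothetical right-skewed (log-normal) data with NumPy, then compares mean, median and skewness before and after a log transform. The data and the hand-written `skewness` helper are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed data (log-normal), e.g. transaction amounts.
amounts = rng.lognormal(mean=3.0, sigma=1.0, size=5000)


def skewness(values):
    """Moment-based skewness: about 0 for a symmetric distribution."""
    centered = values - values.mean()
    return float((centered ** 3).mean() / values.std() ** 3)


# Transforming the distribution: the log of log-normal data is normal.
log_amounts = np.log(amounts)

for name, values in [("raw", amounts), ("log", log_amounts)]:
    print(f"{name}: mean={values.mean():.2f} median={np.median(values):.2f} "
          f"skew={skewness(values):.2f}")
```

For the raw data the mean sits well above the median and the skewness is clearly positive; after the log transform mean and median nearly coincide. A formal normality test (for example Shapiro-Wilk, available as `scipy.stats.shapiro`) can complement these checks.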

Linear Regression:
● Linear relationship between the X & Y variables.

Target    Independent    Linear relationship
Y         X              Y = B*X + C
Y         X1, X2         Y = B1*X1 + B2*X2 + C
Y         X1, X2, X3     Y = B1*X1 + B2*X2 + B3*X3 + C

● Independent variables:
○ 1 independent variable = simple linear regression
○ 2 or more independent variables = multiple linear regression (often loosely called a multivariate model)
● How to estimate the values of B & C?
○ Search for the best possible values of B & C: the line that is closest to the
points, i.e. the one with the smallest sum of squared deviations from the line (SSE).
○ B = the change in Y with respect to a change in X: if X changes by one unit, how much
does Y change?
○ C = the value of Y when X is 0, i.e. the part of Y that is not explained by the X
variables but by other factors.
● Goodness of fit?
○ How accurate is my model?
○ For OLS, a common measure is RMSE (root mean squared error)
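
The search for the best B and C can be sketched with NumPy's least squares solver. The data here is synthetic, generated from a known equation (an assumption for illustration), so the estimates can be compared with the true values.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data generated from Y = 3*X1 - 2*X2 + 5 plus small noise.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

# Add a column of ones so the intercept C is estimated together with the B's.
X_design = np.column_stack([X, np.ones(len(X))])

# Least squares: the coefficients that minimize SSE.
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b1, b2, c = coef

# Goodness of fit: SSE and RMSE on the data used for fitting.
predictions = X_design @ coef
sse = float(np.sum((y - predictions) ** 2))
rmse = float(np.sqrt(sse / len(y)))

print(f"B1={b1:.2f} B2={b2:.2f} C={c:.2f} SSE={sse:.2f} RMSE={rmse:.3f}")
```

The estimates should land close to 3, -2 and 5, and the RMSE close to the noise level of 0.1. On real data, RMSE is better measured on observations that were not used for fitting.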

Feature Engineering (feature selection / feature reduction / variable reduction / dimension
reduction)
1. Creation of new variables, encoding the variables ----> done as data prep 1
2. How to reduce the variables ?

1. Drop some variables based on the data audit

a. Variables that are nearly constant (near zero variance), e.g. CV < 0.05
b. Highly correlated variables (correlation > 0.8): keep one of each correlated pair
c. Variables with lots of missing values (> 25% missing)
d. If you have created derived variables, drop the original variables
e. Variables that are unique for each observation (key / ID variables)
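
Rules a, b, c and e can be sketched in pandas as below. The table is synthetic and the column names are hypothetical; the thresholds follow the notes. Rule d is left out because it depends on knowing which variables were derived from which. Treating every non-float column with all-unique values as a key is a simplifying assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200
spend = rng.normal(loc=50, scale=10, size=n)
sparse = rng.normal(loc=5, scale=1, size=n)
sparse[:80] = np.nan  # 40% missing

df = pd.DataFrame({
    "row_id": np.arange(n),                                   # unique per observation
    "spend": spend,
    "spend_copy": 2 * spend + rng.normal(scale=0.5, size=n),  # almost a copy of spend
    "flat": 100 + rng.normal(scale=0.1, size=n),              # nearly constant
    "sparse": sparse,
    "visits": rng.normal(loc=20, scale=5, size=n),
})

to_drop = set()

# a. Near zero variance: coefficient of variation (std / |mean|) below 0.05.
#    CV is only meaningful when the mean is not close to zero.
cv = df.std() / df.mean().abs()
to_drop |= set(cv[cv < 0.05].index)

# c. More than 25% missing values.
missing_share = df.isna().mean()
to_drop |= set(missing_share[missing_share > 0.25].index)

# e. Key-like variables: non-float columns where every value is unique.
to_drop |= {col for col in df.columns
            if not pd.api.types.is_float_dtype(df[col]) and df[col].is_unique}

# b. Highly correlated pairs (|correlation| > 0.8): drop the later column of each pair.
remaining = df.drop(columns=sorted(to_drop))
corr = remaining.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop |= {col for col in upper.columns if (upper[col] > 0.8).any()}

reduced = df.drop(columns=sorted(to_drop))
print("dropped:", sorted(to_drop))
print("kept:", list(reduced.columns))
```

Which member of a correlated pair to keep is a judgment call; the sketch keeps whichever column comes first.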
