Detailed Notes on Predictive Analytics
1. Overview of Predictive Analytics
Definition: Predictive Analytics refers to using historical data, statistical
algorithms, and machine learning techniques to forecast future outcomes.
Goal: Go beyond describing "what happened" to predict what is likely to happen.
Applications:
o Business: customer churn prediction, sales forecasting.
o Healthcare: predicting disease risk, patient readmission.
o Finance: credit scoring, fraud detection.
o Engineering: equipment failure prediction.
Key Steps:
1. Define the problem.
2. Understand and preprocess data.
3. Explore variables through visualization.
4. Apply statistical tests for significance.
5. Build models and evaluate performance.
2. Setting Up the Problem
Problem Formulation: Translate the business/real-world question into a data science
problem.
o Example: “Why are customers leaving?” → Predict whether a customer will
churn (classification problem).
Define Outcome Variable (Target): What we want to predict (e.g., churn = Yes/No).
Define Predictors (Features): Variables that might influence the outcome (e.g., age,
income, purchase history); see the code sketch at the end of this section.
Check Feasibility:
o Is relevant data available?
o Is there enough quantity and quality of data?
o Is the timeline practical for predictions?
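A minimal sketch of this setup in pandas, assuming a hypothetical customers.csv
file with age, income, purchase_count, and churn columns:

    import pandas as pd

    # Load the hypothetical customer table
    df = pd.read_csv("customers.csv")

    # Outcome variable (target): what we want to predict
    y = df["churn"]                              # "Yes"/"No" labels

    # Predictors (features): variables that might influence churn
    X = df[["age", "income", "purchase_count"]]

Framing the question this way makes the task explicit: binary classification with a
labeled target.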
3. Data Understanding
Purpose: Build intuition about the dataset before modeling.
Steps:
1. Data Collection – Gather data from internal sources (databases, logs) or
external sources (APIs, surveys).
2. Data Description – Identify number of observations (rows), variables
(columns), types of variables (categorical, numerical, ordinal).
3. Data Quality Check – Missing values, duplicates, inconsistencies.
4. Initial Insights – Basic statistics: mean, median, standard deviation,
correlations.
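These steps map onto a few standard pandas calls; a sketch, reusing the hypothetical
customers.csv from above:

    import pandas as pd

    df = pd.read_csv("customers.csv")

    # Data description: observations, variables, and their types
    print(df.shape)                      # (rows, columns)
    print(df.dtypes)                     # numerical vs. categorical columns

    # Data quality check: missing values and duplicates
    print(df.isna().sum())
    print(df.duplicated().sum())

    # Initial insights: basic statistics and pairwise correlations
    print(df.describe())
    print(df.corr(numeric_only=True))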
4. Single Variable Analysis
Focus on understanding one variable at a time.
For Categorical Variables:
o Frequency counts, mode, proportions.
o Example: Gender distribution (Male: 60%, Female: 40%).
For Numerical Variables:
o Measures of central tendency: mean, median, mode.
o Dispersion: variance, standard deviation, range, interquartile range (IQR).
o Distribution shape: skewness, kurtosis.
Purpose: Identify unusual distributions, outliers, or dominant categories.
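A sketch of these one-variable summaries in pandas, assuming hypothetical gender
and income columns:

    import pandas as pd

    df = pd.read_csv("customers.csv")    # hypothetical dataset

    # Categorical: frequency counts, proportions, mode
    print(df["gender"].value_counts())
    print(df["gender"].value_counts(normalize=True))

    # Numerical: central tendency, dispersion, distribution shape
    s = df["income"]
    print(s.mean(), s.median(), s.mode().iloc[0])
    print(s.var(), s.std(), s.max() - s.min())
    print(s.quantile(0.75) - s.quantile(0.25))    # interquartile range (IQR)
    print(s.skew(), s.kurtosis())                 # shape of the distribution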
5. Data Visualization in One Dimension
Visualization helps detect patterns, skewness, and anomalies in a single variable.
For Categorical Variables:
o Bar charts, pie charts.
o Example: Bar chart of customer segments.
For Numerical Variables:
o Histograms: show distribution of data.
o Box plots: highlight outliers and spread.
o Density plots: smooth estimation of distribution.
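Each of these plots is a one-liner in pandas/matplotlib; a sketch using the same
hypothetical columns (a categorical segment and a numerical income):

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("customers.csv")    # hypothetical dataset

    # Bar chart of a categorical variable
    df["segment"].value_counts().plot(kind="bar")
    plt.show()

    # Histogram and box plot of a numerical variable
    df["income"].plot(kind="hist", bins=30)
    plt.show()
    df["income"].plot(kind="box")
    plt.show()

    # Density plot (pandas computes a KDE; requires scipy)
    df["income"].plot(kind="density")
    plt.show()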
6. Data Visualization in Two or Higher Dimensions
Two Variables (Bivariate Analysis):
o Helps identify relationships between predictor and target variable.
o Numerical vs. Numerical: Scatter plots, correlation heatmaps.
o Numerical vs. Categorical: Box plots, violin plots.
o Categorical vs. Categorical: Cross-tabulations, stacked bar charts.
Higher Dimensions (Multivariate Analysis):
o Pair plots (scatterplot matrix).
o Heatmaps for correlations across many variables.
o Dimensionality reduction techniques: PCA (Principal Component Analysis),
t-SNE for visualization.
Purpose: Understand variable interactions and potential predictors.
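A sketch of bivariate and multivariate views on the hypothetical dataset; the PCA
projection uses scikit-learn and standardizes the numeric columns first, which is
usually advisable because PCA is scale-sensitive:

    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("customers.csv")    # hypothetical dataset

    # Numerical vs. numerical: scatter plot
    df.plot(kind="scatter", x="age", y="income")
    plt.show()

    # Categorical vs. categorical: cross-tabulation
    print(pd.crosstab(df["segment"], df["churn"]))

    # Multivariate: project scaled numeric columns onto 2 principal components
    num = StandardScaler().fit_transform(df.select_dtypes("number").dropna())
    pts = PCA(n_components=2).fit_transform(num)
    plt.scatter(pts[:, 0], pts[:, 1])
    plt.show()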
7. The Value of Statistical Significance
Definition: Assesses whether an observed relationship between variables is likely
to be genuine or merely due to chance.
Key Concepts:
o Null Hypothesis (H₀): No effect/relationship.
o Alternative Hypothesis (H₁): There is an effect/relationship.
o p-value: Probability of observing results at least as extreme as the current
ones, assuming H₀ is true.
p < 0.05 → statistically significant (by common convention).
o Confidence Intervals: Range of values that is likely to contain the true effect
at a given confidence level (e.g., 95%).
Tests Used:
o t-test (difference between two means).
o Chi-square test (categorical associations).
o ANOVA (differences across multiple groups).
o Correlation coefficients (strength of linear relationships).
Why it matters: Avoids building models on spurious correlations and ensures that
predictors are genuinely meaningful.
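Two of the tests above as a scipy sketch on the hypothetical dataset (the churn,
income, and segment columns are illustrative assumptions):

    import pandas as pd
    from scipy import stats

    df = pd.read_csv("customers.csv")    # hypothetical dataset

    # t-test: does mean income differ between churners and non-churners?
    churned = df.loc[df["churn"] == "Yes", "income"]
    stayed = df.loc[df["churn"] == "No", "income"]
    t, p = stats.ttest_ind(churned, stayed, equal_var=False)
    print(f"t = {t:.3f}, p = {p:.4f}")   # p < 0.05 suggests a real difference

    # Chi-square test: is customer segment associated with churn?
    table = pd.crosstab(df["segment"], df["churn"])
    chi2, p, dof, expected = stats.chi2_contingency(table)
    print(f"chi2 = {chi2:.3f}, p = {p:.4f}")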
8. Pulling It All Together into a Data Audit
Definition: A structured summary of the dataset before predictive modeling.
Checklist for a Data Audit:
1. Data Availability – Where data is sourced, time span covered.
2. Data Quantity – Number of records, adequacy for modeling.
3. Data Quality – Missing values, noise, duplicates, outliers.
4. Variable Properties – Data types, ranges, distributions.
5. Variable Relationships – Correlations, significant predictors.
6. Potential Issues – Bias, class imbalance (e.g., 95% no churn vs. 5% churn).
7. Documentation – Clear description for transparency and reproducibility.
Outcome:
A Data Audit Report ensures the dataset is clean, well-understood, and ready for feature
engineering and predictive modeling.
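As a closing sketch, a small helper that gathers several checklist items into one
per-column table (data_audit is a hypothetical name, not a library function):

    import pandas as pd

    def data_audit(df: pd.DataFrame) -> pd.DataFrame:
        """One row per column: type, missingness, cardinality, numeric range."""
        return pd.DataFrame({
            "dtype": df.dtypes,
            "missing": df.isna().sum(),
            "unique": df.nunique(),
            "min": df.min(numeric_only=True),
            "max": df.max(numeric_only=True),
        })

    df = pd.read_csv("customers.csv")    # hypothetical dataset
    print(data_audit(df))

    # Flag class imbalance in the target (e.g., 95% No vs. 5% Yes)
    print(df["churn"].value_counts(normalize=True))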