PREDICTIVE MODELING
ASSIGNMENT: 2
BY
HARI HARA SUDHARSAN R
720722110011
3rd YEAR IT ‘A’
1. Customer Churn Prediction in Telecom
Objective:
Identify why customers leave and predict which ones are likely to churn.
1. Using Categorical vs. Continuous Field Analysis in SPSS Modeler
Categorical Fields (e.g., Plan Type, Complaints):
Use "Table" and "Distribution" nodes to view frequency of categories.
Cross-tabulate each field with churn status (for example, with a "Matrix" node) to see which
groups have high churn rates.
o Example: Prepaid users have a 30% churn rate, Postpaid only 10%.
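Outside SPSS Modeler, the same cross-tab idea can be sketched in a few lines of Python. This is only an illustration: the records and the function name `churn_rate_by_category` are invented for the example and are not part of any SPSS Modeler API.

```python
from collections import defaultdict

# Hypothetical sample records: (plan_type, churned) pairs, invented for illustration.
customers = [
    ("Prepaid", True), ("Prepaid", False), ("Prepaid", True), ("Prepaid", False),
    ("Postpaid", False), ("Postpaid", False), ("Postpaid", True),
    ("Postpaid", False), ("Postpaid", False), ("Postpaid", False),
]

def churn_rate_by_category(records):
    """Return {category: fraction of customers in that category who churned}."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for category, did_churn in records:
        totals[category] += 1
        if did_churn:
            churned[category] += 1
    return {category: churned[category] / totals[category] for category in totals}

rates = churn_rate_by_category(customers)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.0%} churn")
```

Categories whose rate sits well above the overall churn rate are the ones worth examining further.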
Continuous Fields (e.g., Monthly Bill, Usage Time):
Use "Statistics", "Histogram", and "Box Plot" nodes to examine:
o Average and median bill.
o Outliers (e.g., very high bills may cause dissatisfaction).
o Relationships with churn: Use the Pearson correlation between monthly bill and a 0/1
churn flag (a point-biserial correlation) to see how strongly bill size is associated with churn.
These insights guide variable selection and feature engineering for modeling.
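The continuous-field checks above (typical value, outliers, correlation with churn) can be sketched with Python's standard library. The bill amounts and churn flags are invented, and the 1.5 x IQR rule is one common box-plot convention assumed here, not necessarily the rule SPSS Modeler applies.

```python
import math
import statistics

# Hypothetical monthly bills and churn flags (1 = churned), invented for illustration.
monthly_bill = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 200.0]
churned = [0, 0, 0, 0, 1, 0, 1, 1]

def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

print("mean bill:  ", statistics.fmean(monthly_bill))
print("median bill:", statistics.median(monthly_bill))
print("outliers:   ", iqr_outliers(monthly_bill))
print("correlation with churn:", round(pearson(monthly_bill, churned), 3))
```

The gap between the mean and the median in this toy data comes from the single very high bill, which is the kind of record the outlier check is meant to surface.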
2. How CHAID Helps in Predicting Churn
CHAID (Chi-squared Automatic Interaction Detection):
A decision tree algorithm that selects splits based on chi-square tests.
Excellent at identifying:
o Interaction effects between variables.
o Subgroups with significantly different churn rates.
Benefits:
Works well with categorical variables.
Produces easy-to-interpret trees (e.g., "If usage < 100 mins and has 2+ complaints →
50% churn probability").
Allows multiway splits (more than two branches per node), unlike binary-split trees such as CART.
You can plug in a CHAID node in SPSS Modeler and let it visualize the churn-driving factors
hierarchically.
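To make the chi-square idea concrete, the sketch below computes the Pearson chi-square statistic of each candidate field against churn and picks the field with the strongest association. It is a simplification, not the CHAID algorithm itself: real CHAID also merges similar categories and compares adjusted p-values, and comparing raw statistics is only fair here because both hypothetical fields have the same degrees of freedom. The counts are invented.

```python
from collections import Counter

def chi_square_statistic(pairs):
    """Pearson chi-square statistic and degrees of freedom for (category, outcome) pairs."""
    observed = Counter(pairs)
    row_totals = Counter(category for category, _ in pairs)
    col_totals = Counter(outcome for _, outcome in pairs)
    n = len(pairs)
    statistic = 0.0
    for category, row_total in row_totals.items():
        for outcome, col_total in col_totals.items():
            expected = row_total * col_total / n
            statistic += (observed[(category, outcome)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical counts: plan type is related to churn, gender is not.
plan_type = ([("Prepaid", "churn")] * 30 + [("Prepaid", "stay")] * 70
             + [("Postpaid", "churn")] * 10 + [("Postpaid", "stay")] * 90)
gender = ([("Female", "churn")] * 20 + [("Female", "stay")] * 80
          + [("Male", "churn")] * 20 + [("Male", "stay")] * 80)

candidates = {"plan_type": plan_type, "gender": gender}
scores = {name: chi_square_statistic(pairs) for name, pairs in candidates.items()}
best_field = max(scores, key=lambda name: scores[name][0])
print(scores)
print("first split candidate:", best_field)
```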
3. Handling Missing Values
Steps:
1. Identify missing values using the "Data Audit" node.
2. Handle them using:
o Imputation set up from the "Data Audit" node (which can generate a Missing Values
SuperNode), or a "Filler" node for rule-based replacement.
o Manual strategies:
Categorical: Replace with "Unknown" or most common category.
Numerical: Impute with mean, median, or regression based on other
fields.
3. Consider flag variables to indicate where data was missing — sometimes the
missingness itself is predictive!
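The manual strategies and the missing-value flags can be sketched as follows. The records, field names, and the `impute` helper are hypothetical; median imputation and an "Unknown" label are just two of the options listed above.

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"plan": "Prepaid", "monthly_bill": 30.0},
    {"plan": None, "monthly_bill": 45.0},
    {"plan": "Postpaid", "monthly_bill": None},
    {"plan": "Prepaid", "monthly_bill": 60.0},
]

def impute(rows, numeric_field, categorical_field, unknown_label="Unknown"):
    """Return copies of rows with missing values filled and <field>_missing flags added."""
    present = [r[numeric_field] for r in rows if r[numeric_field] is not None]
    fill_value = statistics.median(present)
    cleaned = []
    for r in rows:
        row = dict(r)  # copy so the original records are left untouched
        row[f"{numeric_field}_missing"] = r[numeric_field] is None
        row[f"{categorical_field}_missing"] = r[categorical_field] is None
        if row[numeric_field] is None:
            row[numeric_field] = fill_value
        if row[categorical_field] is None:
            row[categorical_field] = unknown_label
        cleaned.append(row)
    return cleaned

cleaned = impute(records, "monthly_bill", "plan")
for row in cleaned:
    print(row)
```

Keeping the flag columns lets a later model use the fact that a value was missing as a predictor in its own right.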
2. Employee Salary Prediction Using Regression Models
Objective:
Predict salaries using experience, education, and department.
1. Why Data Partitioning is Important
Before building a model:
Partition data into:
o Training set: 70% for building the model.
o Testing set: 30% for evaluating it.
This helps detect overfitting, where the model performs well on the data it was trained on
but poorly on new cases.
Use SPSS Modeler’s "Partition" node to split data automatically.
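The idea behind a 70/30 partition can be sketched in plain Python. The `partition` function and the fixed seed are assumptions made for the example; this version gives exact 70/30 counts, which need not match how the Partition node assigns records.

```python
import random

def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-in for 100 employee records
train, test = partition(rows)
print(len(train), "training rows,", len(test), "testing rows")
```

The model is fitted on `train` only; `test` is held back until evaluation so that the reported error reflects unseen cases.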
2. Key Factors for Accurate Salary Prediction
Predictor Quality: Experience, education level (categorical), department.
Linearity Check: Linear regression assumes salary changes roughly linearly with each
predictor; plot salary against experience to confirm.
Multicollinearity Check: Use correlation matrix or VIF (Variance Inflation Factor) to
ensure variables aren't duplicating info.
Outlier Management: Salary outliers (e.g., C-level execs) can distort regression;
consider a log transformation of salary or removing those records.
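The multicollinearity and outlier checks can be illustrated with a small sketch. It relies on the fact that with exactly two predictors the VIF reduces to 1 / (1 - r^2), where r is their Pearson correlation; with more predictors each VIF needs its own regression. The data and the helper names are invented for illustration.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical employees: age tracks experience closely, so the two overlap heavily.
experience_years = [1, 3, 5, 7, 9, 12]
age = [23, 26, 27, 31, 32, 36]
salary = [30000, 38000, 47000, 55000, 66000, 400000]  # last value is an executive outlier

vif = vif_two_predictors(experience_years, age)
log_salary = [math.log(s) for s in salary]

print("VIF:", round(vif, 1))
print("raw salary range:", max(salary) - min(salary))
print("log salary range:", round(max(log_salary) - min(log_salary), 2))
```

A VIF this far above the commonly quoted rule-of-thumb thresholds of 5 or 10 suggests keeping only one of the two predictors, and the log scale shows how the transformation pulls the outlier closer to the rest of the data.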
3. Role of Central Tendency and Variability
Before modeling:
Mean/Median: Show typical salary — good for baseline comparison.
Standard Deviation: Indicates how much variation there is.
Skewness/Kurtosis: Reveal whether salary is roughly symmetric or skewed (important
because linear regression assumes approximately normal residuals, which a heavily skewed
target often violates).
Use "Statistics" node to get a full statistical profile of salary and its predictors.
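A rough equivalent of that statistical profile can be computed by hand. This sketch uses population (divide-by-n) moments, so the skewness and kurtosis values may differ slightly from the sample-adjusted figures a statistics package reports. The salary figures are invented.

```python
import statistics

def describe(values):
    """Return mean, median, standard deviation, skewness and excess kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    third_moment = sum((v - mean) ** 3 for v in values) / n
    fourth_moment = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std_dev": sd,
        "skewness": third_moment / sd ** 3,
        "excess_kurtosis": fourth_moment / sd ** 4 - 3.0,  # 0 for a normal distribution
    }

# Hypothetical salaries in thousands; one high earner skews the distribution.
salaries = [30, 35, 40, 45, 50, 55, 150]
summary = describe(salaries)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A mean well above the median together with positive skewness is the typical signature of a right-skewed salary distribution.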
3. Credit Risk Analysis Using Neural Networks
Objective:
Predict if a loan applicant is a high or low credit risk.
1. Why Data Transformation Helps
Before modeling, you may need to:
Bin continuous data (e.g., income into low/medium/high).
o Helps with nonlinear relationships and model interpretability.
Reclassify categories (e.g., combine job types into “skilled” vs “unskilled”).
In SPSS Modeler:
Use the "Binning" node or the "Reclassify" node.
These transforms can reduce noise from extreme values and sparse categories, which may
help the model train more reliably.
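Both transformations amount to simple lookups, as this sketch shows. The income cutoffs and the job-to-group mapping are hypothetical values chosen for illustration; in practice they would come from the data or from business rules.

```python
def bin_income(income, low_cutoff=30000, high_cutoff=80000):
    """Map a continuous income to 'low', 'medium' or 'high' (cutoffs are assumed)."""
    if income < low_cutoff:
        return "low"
    if income < high_cutoff:
        return "medium"
    return "high"

# Hypothetical mapping from detailed job types to two broader groups.
JOB_GROUPS = {
    "engineer": "skilled",
    "nurse": "skilled",
    "laborer": "unskilled",
    "cleaner": "unskilled",
}

def reclassify_job(job, mapping=JOB_GROUPS, default="other"):
    """Collapse a detailed job type into a broader group; unmapped jobs get the default."""
    return mapping.get(job.strip().lower(), default)

applicants = [(25000, "Cleaner"), (52000, "Nurse"), (120000, "Engineer"), (40000, "Pilot")]
transformed = [(bin_income(income), reclassify_job(job)) for income, job in applicants]
print(transformed)
```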
2. Why Choose Neural Network Over Logistic Regression
Logistic Regression:
o Simple, explainable.
o Assumes a linear relationship between predictors and log-odds of default.
Neural Networks (MLP, RBF):
o Learn complex, nonlinear patterns (e.g., "high income + low repayment =
risk").
o Multi-layer perceptron (MLP) models capture interactions and subtle signals
in the data.
o Can achieve higher accuracy, especially with large datasets.
When to prefer NN:
When the relationship is not obviously linear.
When predictive power matters more than interpretability.
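The difference can be seen on a toy pattern. The sketch below uses an XOR-style rule (the label is 1 only when exactly one of two flags is set), chosen because it is the simplest pattern a single logistic unit cannot fit; it is not credit data. The logistic model is trained by gradient descent, while the small MLP uses hand-set weights rather than trained ones, purely to show that one hidden layer is enough to represent the pattern.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy XOR-style pattern (not credit data): label is 1 when exactly one flag is set.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

def train_logistic(rows, labels, epochs=500, learning_rate=0.5):
    """Fit a two-input logistic regression by batch gradient descent from zero weights."""
    w1 = w2 = bias = 0.0
    for _ in range(epochs):
        g1 = g2 = g_bias = 0.0
        for (x1, x2), y in zip(rows, labels):
            error = sigmoid(w1 * x1 + w2 * x2 + bias) - y
            g1 += error * x1
            g2 += error * x2
            g_bias += error
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
        bias -= learning_rate * g_bias
    return w1, w2, bias

def mlp_predict(x1, x2):
    """Tiny 2-2-1 MLP with hand-set (not trained) weights."""
    h_or = sigmoid(20 * x1 + 20 * x2 - 10)         # fires if at least one input is 1
    h_nand = sigmoid(-20 * x1 - 20 * x2 + 30)      # fires unless both inputs are 1
    return sigmoid(20 * h_or + 20 * h_nand - 30)   # fires only if both hidden units fire

w1, w2, bias = train_logistic(rows, labels)
logistic_probs = [sigmoid(w1 * x1 + w2 * x2 + bias) for x1, x2 in rows]
mlp_probs = [mlp_predict(x1, x2) for x1, x2 in rows]
print("logistic:", [round(p, 3) for p in logistic_probs])
print("mlp:     ", [round(p, 3) for p in mlp_probs])
```

On this pattern the logistic gradient is zero at zero weights, and because the logistic loss is convex that is its best fit: every row gets probability 0.5. The MLP separates the four cases, at the cost of weights that are much harder to interpret.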
3. Using Statistical Functions to Analyze Risk Trends
Use Descriptive Statistics nodes to summarize key variables:
o Mean loan amount by risk category.
o Frequency of late payments.
Use Graphs (bar charts, time trends) to visualize:
o Risk categories by income.
o Changes in customer behavior over time.
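The two summaries listed above (mean loan amount by risk category and frequency of late payments) can be sketched as a group-by and a frequency count. The loan records and the `mean_by_group` helper are invented for illustration.

```python
import statistics
from collections import Counter, defaultdict

# Hypothetical loan records, invented for illustration.
loans = [
    {"risk": "high", "amount": 12000, "late_payments": 3},
    {"risk": "high", "amount": 18000, "late_payments": 2},
    {"risk": "low", "amount": 8000, "late_payments": 0},
    {"risk": "low", "amount": 10000, "late_payments": 0},
    {"risk": "low", "amount": 6000, "late_payments": 1},
]

def mean_by_group(records, group_field, value_field):
    """Return {group: mean of value_field over the records in that group}."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_field]].append(record[value_field])
    return {group: statistics.fmean(values) for group, values in groups.items()}

mean_amount_by_risk = mean_by_group(loans, "risk", "amount")
late_payment_counts = Counter(record["late_payments"] for record in loans)

print("mean loan amount by risk:", mean_amount_by_risk)
print("late payment frequency:  ", dict(sorted(late_payment_counts.items())))
```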
Summary of Tools in SPSS Modeler
Task                     | Node to Use
Field Distribution       | Data Audit, Histogram, Boxplot
Missing Value Handling   | Missing Values node
Partitioning             | Partition node
Modeling                 | CHAID, Regression, Neural Net
Binning/Reclassifying    | Binning node, Reclassify node
Statistical Summary      | Statistics node, Table node