
PREDICTIVE MODELING
ASSIGNMENT: 2

BY
HARI HARA SUDHARSAN R
720722110011
3rd YEAR IT 'A'
1. Customer Churn Prediction in Telecom

Objective:

Identify why customers leave and predict which ones are likely to churn.

1. Using Categorical vs. Continuous Field Analysis in SPSS Modeler

Categorical Fields (e.g., Plan Type, Complaints):

 Use "Table" and "Distribution" nodes to view frequency of categories.

 Cross-tab with churn status to see which groups have high churn rates.

o Example: Prepaid users have a 30% churn rate, Postpaid only 10%.

Continuous Fields (e.g., Monthly Bill, Usage Time):

 Use "Statistics", "Histogram", and "Box Plot" nodes to examine:

o Average and median bill.

o Outliers (e.g., very high bills may cause dissatisfaction).

o Relationships with churn: Use "Pearson Correlation" to see how monthly bill
correlates with churn probability.

These insights guide variable selection and feature engineering for modeling.
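For illustration outside SPSS Modeler, the same exploration can be sketched in Python with pandas. This is a minimal sketch; the DataFrame and column names (plan_type, monthly_bill, churn) are assumptions, not part of the assignment data:

    import pandas as pd

    # Hypothetical churn data; column names are assumptions for illustration only.
    df = pd.DataFrame({
        "plan_type":    ["Prepaid", "Postpaid", "Prepaid", "Prepaid", "Postpaid", "Postpaid"],
        "monthly_bill": [45.0, 80.0, 30.0, 55.0, 95.0, 70.0],
        "churn":        [1, 0, 1, 0, 0, 0],
    })

    # Categorical: frequency of each category (rough equivalent of the Distribution node).
    print(df["plan_type"].value_counts())

    # Categorical: churn rate per category (like a cross-tab with churn status).
    print(df.groupby("plan_type")["churn"].mean())

    # Continuous: summary statistics and outlier hints (like the Statistics node).
    print(df["monthly_bill"].describe())

    # Continuous vs. churn: Pearson correlation with the 0/1 churn flag.
    print(df["monthly_bill"].corr(df["churn"]))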

2. How CHAID Helps in Predicting Churn

CHAID (Chi-squared Automatic Interaction Detection):

• A decision tree algorithm that selects splits based on chi-square tests.

• Excellent at identifying:

o Interaction effects between variables.

o Subgroups with significantly different churn rates.

Benefits:

• Works well with categorical variables.

• Produces easy-to-interpret trees (e.g., "If usage < 100 mins and has 2+ complaints → 50% churn probability").

• Allows multiway splits, unlike binary trees such as CART.


You can plug in a CHAID node in SPSS Modeler and let it visualize the churn-driving factors
hierarchically.
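CHAID itself is not available in common Python libraries, but its core idea (pick the next split from the predictor with the strongest chi-square association with churn) can be sketched with scipy. A minimal sketch under the same hypothetical column names as above:

    import pandas as pd
    from scipy.stats import chi2_contingency

    # Hypothetical data; a real CHAID tree repeats this test recursively inside each branch.
    df = pd.DataFrame({
        "plan_type":  ["Prepaid", "Postpaid", "Prepaid", "Prepaid", "Postpaid", "Postpaid"],
        "complaints": ["2+", "0-1", "2+", "0-1", "0-1", "2+"],
        "churn":      [1, 0, 1, 0, 0, 1],
    })

    # Chi-square test of each categorical predictor against churn,
    # mimicking how CHAID chooses its next split.
    for col in ["plan_type", "complaints"]:
        table = pd.crosstab(df[col], df["churn"])
        chi2, p, dof, expected = chi2_contingency(table)
        print(f"{col}: chi2={chi2:.2f}, p={p:.3f}")

    # The predictor with the smallest p-value would become the first split.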

3. Handling Missing Values

Steps:

1. Identify missing values using the "Data Audit" node.

2. Handle them using:

o "Missing Values" node for automatic imputation.

o Manual strategies:

- Categorical: Replace with "Unknown" or the most common category.

- Numerical: Impute with the mean, median, or a regression based on other fields.

3. Consider flag variables to indicate where data was missing — sometimes the
missingness itself is predictive!
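Outside Modeler, the same manual strategies can be sketched in pandas. A minimal sketch with hypothetical column names:

    import numpy as np
    import pandas as pd

    # Hypothetical data with gaps in a categorical and a numeric field.
    df = pd.DataFrame({
        "plan_type":    ["Prepaid", None, "Postpaid", "Prepaid"],
        "monthly_bill": [45.0, 80.0, np.nan, 55.0],
    })

    # Identify missing values (rough equivalent of the Data Audit node).
    print(df.isna().sum())

    # Flag variables first, so the fact of missingness stays available to the model.
    df["bill_missing"] = df["monthly_bill"].isna().astype(int)

    # Impute: categorical -> "Unknown", numeric -> median.
    df["plan_type"] = df["plan_type"].fillna("Unknown")
    df["monthly_bill"] = df["monthly_bill"].fillna(df["monthly_bill"].median())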

2. Employee Salary Prediction Using Regression Models

Objective:

Predict salaries using experience, education, and department.

1. Why Data Partitioning is Important

Before building a model:

• Partition data into:

o Training set: 70% for building the model.

o Testing set: 30% for evaluating it.

This avoids overfitting, where the model performs well on the data it was trained on but poorly on new cases.

Use SPSS Modeler’s "Partition" node to split data automatically.
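The same 70/30 split can be sketched with scikit-learn; the employee data and column names here are hypothetical:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical employee data (salary in thousands); column names are assumptions.
    df = pd.DataFrame({
        "experience": [1, 3, 5, 7, 10, 12, 15, 20],
        "salary":     [30, 38, 45, 52, 64, 70, 82, 95],
    })

    X = df[["experience"]]
    y = df["salary"]

    # 70% training, 30% testing, with a fixed seed so the split is reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42
    )
    print(len(X_train), len(X_test))  # 5 training rows, 3 testing rows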

2. Key Factors for Accurate Salary Prediction


• Predictor Quality: Experience, education level (categorical), and department.

• Linearity Check: Salary should increase approximately linearly with experience.

• Multicollinearity Check: Use a correlation matrix or VIF (Variance Inflation Factor) to ensure variables aren't duplicating information.

• Outlier Management: Salary outliers (e.g., C-level executives) can distort the regression; consider a log transformation or removing them.
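A VIF check can be sketched with statsmodels; the predictors below are hypothetical numeric columns chosen so that two of them are strongly correlated:

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    # Hypothetical predictors; experience and age deliberately move together.
    df = pd.DataFrame({
        "experience": [1, 3, 5, 7, 10, 12, 15, 20],
        "age":        [22, 25, 27, 29, 33, 35, 38, 43],
        "dept_size":  [10, 40, 25, 60, 15, 80, 30, 50],
    })

    # Add an intercept column, then compute VIF for each predictor.
    X = add_constant(df)
    for i, col in enumerate(X.columns):
        if col != "const":
            print(col, variance_inflation_factor(X.values, i))

    # A VIF well above 5-10 suggests the variable duplicates information in the others.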

3. Role of Central Tendency and Variability

Before modeling:

• Mean/Median: Show the typical salary; a good baseline for comparison.

• Standard Deviation: Indicates how much salaries vary around the mean.

• Skewness/Kurtosis: Reveal whether salary is normally distributed or skewed (important for linear regression assumptions).

Use "Statistics" node to get a full statistical profile of salary and its predictors.

3. Credit Risk Analysis Using Neural Networks

Objective:

Predict if a loan applicant is a high or low credit risk.

1. Why Data Transformation Helps

Before modeling, you may need to:

• Bin continuous data (e.g., income into low/medium/high).

o Helps with nonlinear relationships and model interpretability.

• Reclassify categories (e.g., combine job types into "skilled" vs. "unskilled").

In SPSS Modeler:

 Use "Binning node" or "Reclassify node".

 Transforms improve model convergence and reduce noise.
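A minimal pandas sketch of both transformations; the income cut-points and job-type mapping are illustrative assumptions:

    import pandas as pd

    # Hypothetical applicant data.
    df = pd.DataFrame({
        "income":   [18000, 32000, 55000, 75000, 120000],
        "job_type": ["clerk", "engineer", "laborer", "manager", "technician"],
    })

    # Binning: income into low/medium/high (rough equivalent of the Binning node).
    df["income_band"] = pd.cut(
        df["income"],
        bins=[0, 30000, 70000, float("inf")],
        labels=["low", "medium", "high"],
    )

    # Reclassifying: collapse job types into skilled vs. unskilled
    # (rough equivalent of the Reclassify node).
    skill_map = {
        "engineer": "skilled", "manager": "skilled", "technician": "skilled",
        "clerk": "unskilled", "laborer": "unskilled",
    }
    df["skill_level"] = df["job_type"].map(skill_map)
    print(df)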

2. Why Choose a Neural Network Over Logistic Regression


• Logistic Regression:

o Simple and explainable.

o Assumes a linear relationship between the predictors and the log-odds of default.

• Neural Networks (MLP, RBF):

o Learn complex, nonlinear patterns (e.g., "high income + low repayment = risk").

o Multi-layer perceptron (MLP) models capture interactions and subtle signals in the data.

o Can achieve higher accuracy, especially with large datasets.

When to prefer a neural network:

• When the relationship is not obviously linear.

• When predictive power matters more than interpretability.
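The comparison can be sketched in scikit-learn. This uses a synthetic dataset as a stand-in for credit-risk data (real applications would need real applicant records and careful evaluation):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in for credit-risk data (1 = high risk).
    X, y = make_classification(n_samples=1000, n_features=8, n_informative=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Logistic regression: linear in the log-odds, easy to explain.
    logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    logit.fit(X_train, y_train)

    # Multi-layer perceptron: can learn nonlinear interactions between predictors.
    mlp = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0),
    )
    mlp.fit(X_train, y_train)

    print("Logistic regression accuracy:", logit.score(X_test, y_test))
    print("MLP accuracy:", mlp.score(X_test, y_test))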

3. Using Statistical Functions to Analyze Risk Trends

• Use Descriptive Statistics nodes to summarize key variables:

o Mean loan amount by risk category.

o Frequency of late payments.

• Use Graphs (bar charts, time trends) to visualize:

o Risk categories by income.

o Changes in customer behavior over time.
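A groupby-based sketch of those summaries; the loan records and column names are hypothetical:

    import pandas as pd

    # Hypothetical loan records.
    df = pd.DataFrame({
        "risk":          ["high", "low", "high", "low", "low", "high"],
        "loan_amount":   [12000, 5000, 20000, 7000, 4000, 15000],
        "late_payments": [3, 0, 5, 1, 0, 2],
        "income_band":   ["low", "high", "low", "medium", "high", "medium"],
    })

    # Mean loan amount and late-payment counts by risk category.
    print(df.groupby("risk")[["loan_amount", "late_payments"]].mean())

    # Risk category frequencies within each income band (could feed a bar chart).
    print(pd.crosstab(df["income_band"], df["risk"]))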

Summary of Tools in SPSS Modeler

Task                      Node to Use
Field Distribution        Data Audit, Histogram, Boxplot
Missing Value Handling    Missing Values node
Partitioning              Partition node
Modeling                  CHAID, Regression, Neural Net
Binning/Reclassifying     Binning node, Reclassify node
Statistical Summary       Statistics node, Table node
