Exploratory Data Analysis (EDA) Summary Report
1. Introduction
The purpose of this report is to conduct an initial exploratory data analysis on the
provided customer delinquency dataset. The primary goal is to identify key
trends, patterns, and potential risk indicators associated with loan delinquency.
The insights gained from this analysis will be instrumental in preparing the data
for subsequent predictive modeling, aiming to forecast future customer behavior
and credit risk.
2. Dataset Overview
This section summarizes the dataset, including the number of records,
key variables, and data types. It also highlights any anomalies, duplicates, or
inconsistencies observed during the initial review of the available data sample.
■Key dataset attributes:
● Number of records:
The provided sample contains records for at least 494 distinct customers. A full
review of the dataset would be required to determine the exact total.
●Key variables:
●Customer_ID: A unique identifier for each customer.
●Age: The customer's age in years.
●Income: The customer's annual income.
●Credit_Score: A numerical score indicating creditworthiness.
●Credit_Utilization: The ratio of credit used to available credit, as defined in the
glossary.
●Missed_Payments: A count of the number of missed payments.
●Delinquent_Account: The target variable, indicating whether an account is
delinquent (1) or not (0).
●Loan_Balance: The outstanding loan balance.
●Debt_to_Income_Ratio: The ratio of a customer's total debt to their gross
income.
●Employment_Status: The customer's employment status (e.g., 'Employed',
'Self-employed', 'Unemployed').
●Account_Tenure: The duration of the customer's account in months.
●Credit_Card_Type: The type of credit card the customer holds.
●Location: The customer's geographical location.
●Month_1 to Month_6: The payment status for each of the last six months
('On-time', 'Late', 'Missed').
●Data types: The dataset contains a mix of numerical and categorical data. Age,
Income, Credit_Score, Credit_Utilization, Missed_Payments, Loan_Balance,
Debt_to_Income_Ratio, and Account_Tenure are numerical. The remaining
variables (Customer_ID, Employment_Status, Credit_Card_Type, Location,
Month_1 to Month_6, and Delinquent_Account) are categorical.
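The overview checks described in this section could be sketched as follows. This is a minimal illustration using pandas, not the procedure used to produce this report: the helper name overview, the NUMERICAL list, and the tiny demo frame are assumptions made for the example, and the column spellings should be checked against the real file header.

```python
import pandas as pd

# Numerical columns as listed in this section; spellings are assumed.
NUMERICAL = ["Age", "Income", "Credit_Score", "Credit_Utilization",
             "Missed_Payments", "Loan_Balance", "Debt_to_Income_Ratio",
             "Account_Tenure"]


def overview(df: pd.DataFrame) -> dict:
    """Summarize record counts, duplicate IDs, and mistyped numeric columns."""
    present = [c for c in NUMERICAL if c in df.columns]
    return {
        "n_records": len(df),
        "n_customers": df["Customer_ID"].nunique(),
        "duplicate_ids": int(df["Customer_ID"].duplicated().sum()),
        # Columns expected to be numeric but loaded as text or another type.
        "non_numeric": [c for c in present
                        if not pd.api.types.is_numeric_dtype(df[c])],
    }


# Invented rows for illustration only; they are not taken from the dataset.
demo = pd.DataFrame({
    "Customer_ID": ["A1", "A2", "A2"],
    "Age": [30, 45, 45],
    "Income": ["52000", "61000", "61000"],  # stored as text on purpose
})
summary = overview(demo)
```

On the full dataset, a non-zero duplicate_ids or a non-empty non_numeric list would point to records or columns that need cleaning before modeling.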
3. Missing Data Analysis
Identifying and addressing missing data is critical to ensuring model accuracy.
This section outlines missing values in the dataset, the approach taken to handle
them, and justifications for the chosen method.
●Key missing data findings: Based on the provided data sample, no missing
values were immediately apparent. A full-scale analysis of the complete dataset
is required to confirm the presence of any missing data.
●Missing data treatment: If missing data were identified, a strategy of
imputation would be recommended. For numerical features like Income or
Loan_Balance, an appropriate imputation method could involve using the mean
or median of the variable, or more advanced techniques like K-Nearest Neighbors
(KNN) imputation if correlations with other variables are high. For categorical
features like Employment_Status or Location, imputation with the mode or
creating a new 'Unknown' category would be a suitable approach.
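The simpler of the strategies above (median for numerical features, an 'Unknown' category for categorical ones) might look like the sketch below. The function name impute and the demo values are hypothetical, and the choice of median over mean is an assumption made here because the median is less sensitive to extreme values.

```python
import pandas as pd


def impute(df: pd.DataFrame, numerical: list, categorical: list) -> pd.DataFrame:
    """Median-fill numerical columns; label missing categoricals 'Unknown'."""
    out = df.copy()  # leave the caller's frame untouched
    for col in numerical:
        out[col] = out[col].fillna(out[col].median())
    for col in categorical:
        out[col] = out[col].fillna("Unknown")
    return out


# Invented rows for illustration only; they are not taken from the dataset.
demo = pd.DataFrame({
    "Income": [40000.0, None, 60000.0],
    "Employment_Status": ["Employed", None, "Unemployed"],
})
filled = impute(demo, ["Income"], ["Employment_Status"])
```

For the KNN variant, scikit-learn's KNNImputer is one available option; whether it is worth the added complexity depends on how much data is actually missing.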
4. Key Findings and Risk Indicators
This section identifies trends and patterns that may indicate risk factors for
delinquency. Feature relationships and statistical correlations are explored to
uncover insights relevant to predictive modeling.
● Correlations observed between key variables:
●Missed Payments: The Missed_Payments variable appears to be a strong
indicator of delinquency. For example, CUST0002, with 6 missed payments, is
marked as a delinquent account (1), while CUST0003, with 0 missed payments, is
not (0). These two records are consistent with a positive correlation, which
should be confirmed on the full dataset.
●Credit Score: A lower Credit_Score seems to correlate with a higher likelihood of
delinquency.
●Credit Utilization: A higher Credit_Utilization ratio may also be associated with
delinquency, as it indicates a customer is using a large portion of their available
credit, which can be a sign of financial strain.
●Payment History (Month_1 to Month_6): The detailed payment status for the
past six months provides granular behavioral data. A high frequency of 'Late' or
'Missed' payments in this period is a clear precursor to a delinquent status. The
data for CUST0002 corroborates this, showing 'Missed' and 'Late' payments
across multiple months.
●Employment Status: The data includes various employment statuses
(Employed, Self-employed, Unemployed). Unemployed customers or those with
less stable employment are likely to be at a higher risk of delinquency, although
the sample alone does not confirm this.
●Unexpected anomalies: An interesting data point is CUST0494, who has a very
low Credit_Score of 306. This value is an extreme outlier and warrants further
investigation to determine if it is a data entry error or a legitimate score. Such
anomalies can significantly skew a predictive model and should be handled with
care.
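The relationships and the outlier noted above could be checked on the full dataset with something like the sketch below. The function names, the 1.5 x IQR rule, and the demo rows are assumptions for illustration. The demo values are invented: only the 306 echoes the score discussed above, and its target label here is arbitrary.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame,
                        target: str = "Delinquent_Account") -> pd.Series:
    """Pearson correlation of each numeric column with the binary target."""
    numeric = df.select_dtypes(include="number")
    return numeric.corr()[target].drop(target).sort_values()


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    spread = q3 - q1
    return (series < q1 - k * spread) | (series > q3 + k * spread)


# Invented rows for illustration only; they are not taken from the dataset.
demo = pd.DataFrame({
    "Missed_Payments": [0, 1, 0, 5, 6, 4],
    "Credit_Score": [720, 690, 705, 540, 306, 580],
    "Delinquent_Account": [0, 0, 0, 1, 1, 1],
})
corr = target_correlations(demo)
flags = iqr_outliers(demo["Credit_Score"])
```

A flagged value is a prompt for investigation rather than a reason to delete the record, since a low score may be legitimate.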
5. AI & GenAI Usage
Generative AI tools were used to summarize the dataset, infer data patterns, and
structure this report. The AI analyzed the provided data snippet, identified the
variables and their types, and extrapolated potential relationships and risk
indicators based on common data analysis practices.
Example AI prompts used:
●"Create an EDA report following the structure of the
EDA_SummaryReport_Template.docx file. Analyze the key variables, identify
potential risk factors for delinquency, and propose next steps for predictive
modeling."
●"Analyze the provided CSV snippet to identify data types, missing values, and
potential correlations between the target variable 'Delinquent_Account' and
other features."
6. Conclusion & Next Steps
The initial EDA has successfully identified several key risk indicators for customer
delinquency, most notably a high number of missed payments, a low credit
score, and a high credit utilization ratio. The detailed payment history over the
last six months provides valuable behavioral data that can be used to build a
robust predictive model.
The recommended next steps are:
1. Full Data Audit: Conduct a comprehensive scan of the entire dataset to confirm
the findings, identify any other anomalies, and accurately quantify missing data.
2. Data Cleaning & Preprocessing: Address any missing values using the
proposed imputation strategies.
3. Feature Engineering: Create new features from the existing data, such as a
"total months with late payments" variable from the Month_1 to Month_6
columns, or categorical bins for numerical data.
4. Model Building: Use the cleaned and prepared data to train a predictive model
(e.g., Logistic Regression, Gradient Boosting Machines) to forecast customer
delinquency.
5. Model Evaluation: Evaluate the model's performance using appropriate metrics
such as AUC-ROC and a confusion matrix to ensure it can accurately identify
customers at risk.
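Steps 3 and 5 can be illustrated with the sketch below. It derives the "total months with late payments" feature and computes a confusion matrix and AUC-ROC by hand so that the definitions are visible. All names (Late_Or_Missed_Months, add_late_month_count, confusion_counts, auc_roc) are hypothetical, the labels and scores are invented, and the model itself is omitted.

```python
import pandas as pd

MONTH_COLS = [f"Month_{i}" for i in range(1, 7)]  # Month_1 .. Month_6


def add_late_month_count(df: pd.DataFrame) -> pd.DataFrame:
    """Add Late_Or_Missed_Months: months with a 'Late' or 'Missed' status."""
    out = df.copy()
    out["Late_Or_Missed_Months"] = (
        out[MONTH_COLS].isin(["Late", "Missed"]).sum(axis=1)
    )
    return out


def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded 0/1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def auc_roc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented rows, labels, and scores for illustration only.
demo = pd.DataFrame(
    [["On-time", "Late", "Missed", "On-time", "Late", "Missed"],
     ["On-time"] * 6],
    columns=MONTH_COLS,
)
features = add_late_month_count(demo)

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities
y_pred = [int(s >= 0.5) for s in scores]
counts = confusion_counts(y_true, y_pred)
auc = auc_roc(y_true, scores)
```

In practice a library implementation of these metrics would normally be used. The pairwise AUC shown here is quadratic in the number of records and is meant only to make the definition concrete.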