
Unsupervised Credit Scoring Models

Bachelor of Science in Computer Science

Prepared by:

Tran Anh Vu V202100569

Vu Duy Tung V202100528

Nguyen Canh Huy

Under the supervision of:


Instructor’s Name: Prof. Doan Dang Khoa

Related course: COMP3020 - Machine Learning


COLLEGE OF ENGINEERING AND COMPUTER SCIENCE

VINUNIVERSITY

October, 2025

Team Members and Roles

Tran Anh Vu (V202100571)
Role: Data processor, unsupervised learning model developer.
Role Detail: Preprocess raw data and handle missing data. Research existing credit scoring functions to optimize feature utilization. Develop unsupervised learning models on unlabeled datasets.

Vu Duy Tung (V202100528)
Role: Data collection, exploring supervised learning algorithms, considering future directions.
Role Detail: Convert raw SMS data into a structured dataset and leverage supervised machine learning techniques, specifically XGBoost, to categorize creditworthiness based on SMS transaction records.

Nguyen Canh Huy

Note: Roles may change flexibly as needed during the work process.
I. Project Overview
A. Introduction
Credit scoring is a significant task in the financial sector, especially for banks where millions of
dollars in loan decisions largely depend on users' credit scores. However, due to the complexity
of users' financial situations, traditional credit scoring methods done by humans can be
inconsistent. This makes it essential to develop a machine learning-based credit scoring model.
To ensure a general approach, we aim to develop both supervised and unsupervised learning
models so we can maintain the natural dependence of credit scores on financial features without
relying on artificial labels. This report provides an overview of our initial work in data
processing, developing both types of models, the challenges we encountered, and our future
plans for the next steps. Besides this, it also covers our communication methods and collaboration plans toward the final goal.

B. Related works
Recent research has shown that advanced machine learning methods outperform traditional models in credit scoring evaluations, as demonstrated in a recent study conducted by Mestiri (2024) [1]. This study compared six credit scoring models: Linear Discriminant Analysis (LDA), Random Forests (RF), Logistic Regression (LR), Decision Trees (DT), Support Vector Machines (SVM), and Deep Neural Networks (DNN). The results from analyzing 688 samples and 12 variables revealed that the machine learning techniques were more accurate in predicting loan defaults than the traditional statistical approaches.

Ensemble methods have shown particularly promising results. XGBoost, LightGBM, and
CatBoost have emerged as top performers in recent credit-scoring studies. For instance, research
by Hlongwane et al. (2024) revealed that XGBoost achieved the highest accuracy among
examined techniques, even surpassing the industry-standard FICO scores in some cases [2].
From these recent findings, K-means clustering and building a custom credit scoring formula present significant advantages and opportunities for credit scoring classification models, leading the team to adopt these approaches. K-means clustering enables the categorization of customers into risk groups without predefined labels, which can reveal patterns that traditional methods may overlook. Custom credit scoring formulas can incorporate features identified through selection methods and include alternative data sources that have demonstrated potential in recent studies. By combining K-means clustering for customer segmentation with a custom scoring formula, we aim to develop a robust, flexible, and transparent credit scoring system that can adapt to changing market conditions and regulatory requirements.

C. Data description
The dataset was provided by SFIN Joint-Stock Company, containing six months of transaction
metadata for 800,000 anonymous users. Each user’s transactions, recorded via the Short Message Service (SMS), capture diverse financial behaviors. Extracting meaningful insights from SMS
data presents a unique opportunity due to its rich and often underutilized content. By applying
prompting techniques with Large Language Models (LLMs), we extracted and structured
relevant entities, transforming raw text into structured feature data.
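For illustration, the sketch below shows the kind of prompt-based entity extraction described above. The prompt wording, the field names, and the call_llm helper are hypothetical placeholders; the actual prompts and LLM interface we used are not reproduced here.

```python
import json

# Hypothetical prompt template for turning one raw SMS into structured fields.
PROMPT_TEMPLATE = """Extract the following fields from the SMS below and answer
with a single JSON object: transaction_type, amount, currency, balance_after, date.
Use null for any field that is not present.

SMS: "{sms}"
JSON:"""

def extract_entities(sms_text, call_llm):
    """Ask an LLM to parse one SMS; call_llm is any function that maps a prompt
    string to the model's text completion (the interface is an assumption)."""
    response = call_llm(PROMPT_TEMPLATE.format(sms=sms_text))
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        # Malformed completions are treated as missing data (NaN downstream).
        return {}
```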

The dataset comprised 38 features, but it suffered from significant sparsity, with more than half
of the features having over 90% NaN values. This high level of sparsity posed challenges, as
missing data can often degrade model performance by introducing noise and reducing predictive
reliability. Therefore, our preprocessing pipeline not only focused on filling or removing NaNs
as necessary but also on maximizing data usability by discarding features with excessively low
completion rates.

This sparsity issue underscores a limitation in SMS-based transaction data - though SMS can
capture granular financial actions, its unstructured nature may require robust handling of missing
data. In future applications, a more comprehensive data source, potentially combining SMS data
with banking records, could offer enhanced reliability.

II. Progress Description


A. Data Processing

1. Missing data imputation


Because our initial dataset is extracted from users' SMS information, a sample might have several missing features if there is no SMS related to the corresponding field during the data collection period. Therefore, the number of NaN entries is very large, especially for fields that are less common than others (e.g., data related to credit card usage is more likely to be missing than the number of transactions). Hence, although there were many samples in our credit scoring dataset, the proportion of missing values was unexpectedly high (some features are missing in nearly 80% of instances). That is why we initially aimed to fill in the missing entries with a missing value imputation algorithm; Bayesian Network Internal Imputation (BNII), introduced by Lan et al. in 2020, is a promising method because it was designed for credit scoring datasets.

BNII Algorithm: Essentially, the internal missing value filling algorithm consists of three steps. Firstly, we build a Bayesian network that defines relationships among features, in which each node (feature) has its own list of parents and children. Secondly, we filter out the complete dataset, i.e., the rows that contain no null values. Based on this dataset, we calculate the conditional probability table of each feature taking each of its values given its parents' values (called evidence). Finally, we fill in missing values in the incomplete dataset by finding the value with the maximum likelihood given the existing parents and children. In the final step, specifically, the probability of a feature X taking value i is proportional to the conditional probability of X = i given its parents' evidence, multiplied by the probability of its children's evidence given X = i:

P(X = i | evidence) ∝ P(X = i | parents' evidence) × ∏_{C ∈ children(X)} P(C's evidence | X = i)

So after initializing all missing values with their modes, the algorithm iterates through all features in the missing list and changes each to the most likely candidate. Then, it compares the new sample with the one from the previous iteration. If the similarity (e.g., cosine similarity) is higher than a pre-defined threshold, the algorithm stops (convergence).
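The following is a minimal sketch of this filling step, assuming the network structure and conditional probability tables have already been estimated from the complete rows. The names structure and cpt, the 1e-6 smoothing constant, and the simplified "no change" stopping rule (instead of a cosine-similarity threshold) are illustrative choices, not taken from the original paper.

```python
import pandas as pd

def bnii_fill(df, structure, cpt, max_iters=20):
    """Sketch of BNII's internal filling step.

    df        : DataFrame of discretized features (NaN = missing).
    structure : {feature: {"parents": [...], "children": [...]}}.
    cpt       : {feature: {(value, tuple_of_parent_values): probability}}.
    """
    filled = df.copy()
    missing_mask = df.isna()

    # Step 0: initialize every missing entry with the feature's mode.
    for col in filled.columns:
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

    prev = filled.copy()
    for _ in range(max_iters):
        for col in filled.columns:
            candidates = df[col].dropna().unique()
            for idx in filled.index[missing_mask[col]]:
                scores = {}
                for value in candidates:
                    # P(X = value | parents' evidence) ...
                    parent_ev = tuple(filled.loc[idx, structure[col]["parents"]])
                    p = cpt[col].get((value, parent_ev), 1e-6)
                    # ... times P(child's evidence | X = value) for each child.
                    for child in structure[col]["children"]:
                        ev = tuple(value if pa == col else filled.at[idx, pa]
                                   for pa in structure[child]["parents"])
                        p *= cpt[child].get((filled.at[idx, child], ev), 1e-6)
                    scores[value] = p
                filled.at[idx, col] = max(scores, key=scores.get)
        # Simplified convergence check: stop when no imputed entry changed.
        if ((filled != prev) & missing_mask).to_numpy().sum() == 0:
            break
        prev = filled.copy()
    return filled
```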

Moreover, to fit the characteristics of our dataset, we extended the BNII algorithm to continuous variables. After confirming the limitations of the original algorithm with its authors, we looked for methods to discretize the existing numerical features into discrete-valued ones. To do so, we handled skewed distributions and outliers by conducting multiple z-score removal iterations before binning the remaining values into equal-width intervals.
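As a small illustration of this discretization step, the sketch below trims outliers by iterative z-score filtering and then bins values into equal-width intervals; the threshold of 3, the three passes, the five bins, and the choice to clip extreme values into the edge bins are illustrative rather than the exact settings we used.

```python
import numpy as np
import pandas as pd

def trim_outliers_zscore(series, z_thresh=3.0, n_passes=3):
    """Repeatedly drop values whose z-score exceeds the threshold."""
    s = series.dropna()
    for _ in range(n_passes):
        z = (s - s.mean()) / s.std()
        keep = z.abs() <= z_thresh
        if keep.all():
            break
        s = s[keep]
    return s

# Hypothetical skewed feature: trim outliers, then bin into equal-width intervals.
values = pd.Series(np.random.lognormal(mean=2.0, sigma=1.0, size=10_000))
trimmed = trim_outliers_zscore(values)
# Clip extremes into the edge bins so no samples are lost (one possible choice).
binned = pd.cut(values.clip(trimmed.min(), trimmed.max()), bins=5, labels=False)
```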

After that, we found that our dataset contains so many missing values that some features would lose certain value categories if we built the complete dataset from all non-null rows, as in the original algorithm. Therefore, we created a dictionary of Bayesian networks, where each network stores the probabilities related to the parents and children of only ONE feature. With this improvement, we only need to remove rows with null values in that feature's parents or children, not in all features. As a result, we not only lost fewer values (most features preserve their full set of categories, and only 2 features lose 1 category each) but also kept more samples to better generalize the relationships among them.

2. Feature encoding method


After internal imputation, our data is transformed from continuous (numerical) values into discrete (categorized) values, which might not be suitable for the later classification models. Therefore, we needed an encoding method to turn them back into numerical values, and we chose Leave-One-Out (LOO) Target Encoding:

LOO Target Encoding: We encode each categorical value by the average target value of its category, calculated excluding the current row (the "leave-one-out" part). For each category of a categorical feature, we compute the mean of the target variable (e.g., if the target is binary, it is the probability of the positive class). For each row, the categorical value is replaced by the mean target value of that category, excluding the current row (to prevent leakage):

TE(x_i) = ( Σ_{j ≠ i : x_j = x_i} y_j ) / (n − 1)

- TE(x_i) is the target encoding for row i
- y_j is the target value of another row j (j ≠ i) belonging to the same category as x_i
- n is the number of rows in the dataset that belong to that category.

Why we used LOO: Because our categorized data has high cardinality (5-6 values per feature) and we want to prevent data leakage into the later models.
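A minimal pandas sketch of this encoder is shown below; the column names, the binary surrogate label, and the fallback for singleton categories are illustrative assumptions.

```python
import pandas as pd

def loo_target_encode(df, cat_col, target_col):
    """Leave-one-out target encoding: for each row, the mean target of its
    category computed over all OTHER rows of that category."""
    grp = df.groupby(cat_col)[target_col]
    cat_sum, cat_count = grp.transform("sum"), grp.transform("count")
    enc = (cat_sum - df[target_col]) / (cat_count - 1)
    # A category seen only once has no "other" rows; fall back to the global mean.
    return enc.fillna(df[target_col].mean())

# Illustrative usage with a hypothetical binned feature and binary surrogate label.
df = pd.DataFrame({"loan_bin": ["a", "a", "b", "b", "b", "c"],
                   "label":    [1,   0,   1,   1,   0,   1]})
df["loan_bin_te"] = loo_target_encode(df, "loan_bin", "label")
# Row 0 ("a", label 1) is encoded as 0.0: the only other "a" row has label 0.
```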

B. Training Models

1. Unsupervised Learning Model


Because our dataset is unlabeled, we implement unsupervised learning models to preserve the natural separation among data points. Specifically, we choose K-means clustering with k = 2 (binary clustering for bad/good users). The reason for this choice is that K-means is relatively simple and computationally efficient, which is helpful when working with a high-dimensional dataset (we have ~20 features per sample). Moreover, the K-means clusters with their centroids also make the output model interpretable, because the algorithm's mechanism makes it easier to understand the profile of users in each cluster and assess their credit risk.

After implementing it, we also visualized the correlation map to see how each variable correlates with the "surrogate labels". The visualization in Fig. 1 shows that the two features "Biggest_Loan" and "Average_Loan_Per_Month" are the most correlated with the labels assigned by K-means clustering, which means the data are most separated along the dimensions of these two features. We also confirmed this by plotting a small group of randomly sampled data points, colored by cluster, in the 2D space of those features (Fig. 2); as the plot shows, they are well separated.
Fig 1. Correlation Maps of Features in the Dataset

Fig 2. Data plotted in the 2D space of “Biggest_Loan” and “Average_Loan_Per_Month”
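The figures above can be reproduced along the following lines; this is a hedged sketch that reuses the hypothetical X and surrogate_labels objects from the K-means sketch, and the sample size and plotting style are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Correlation of every feature with the surrogate cluster labels (cf. Fig. 1).
data = X.copy()
data["cluster"] = surrogate_labels
label_corr = data.corr()["cluster"].drop("cluster").sort_values()

# Scatter of a random sample in the two most label-correlated dimensions (cf. Fig. 2).
sample = data.sample(n=2_000, random_state=0)
plt.scatter(sample["Biggest_Loan"], sample["Average_Loan_Per_Month"],
            c=sample["cluster"], s=5, cmap="coolwarm")
plt.xlabel("Biggest_Loan")
plt.ylabel("Average_Loan_Per_Month")
plt.show()
```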

Unfortunately, when we compare the labels produced by the K-means algorithm with those given by the FICO formula, which is widely applied by international creditors, the agreement is not even at an acceptable level (0.34 is the maximum agreement between the two types of labels). However, because the FICO formula is quite general (its credit scoring mechanism is not very specific), the error might come from the way we apply it to our dataset. Therefore, the disagreement between the FICO and K-means labels should be investigated further in the next steps.
2. Supervised Learning Model
Credit scores were categorized into 10 discrete bins, from 0 to 9, each representing a unique
range of creditworthiness. These bins, ranging from the lowest (0) to the highest score (9), were
defined as follows: 0 (<0.15), 1 (0.15-0.3), 2 (0.3-0.4), 3 (0.4-0.45), 4 (0.45-0.5), 5 (0.5-0.55), 6
(0.55-0.6), 7 (0.6-0.7), 8 (0.7-0.8), 9 (>0.8).
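A sketch of this discretization is given below; we assume scores lie in [0, 1] and that a boundary value belongs to the higher bin, which may differ from the exact convention used in our pipeline.

```python
import numpy as np
import pandas as pd

# Bin edges matching the ranges listed above.
EDGES = [-np.inf, 0.15, 0.3, 0.4, 0.45, 0.5, 0.55, 0.6, 0.7, 0.8, np.inf]

def bin_credit_scores(scores):
    """Map continuous credit scores to discrete classes 0 (lowest) to 9 (highest)."""
    # right=False puts a boundary value such as 0.3 into the higher bin (class 2).
    return pd.cut(scores, bins=EDGES, labels=list(range(10)), right=False).astype(int)

classes = bin_credit_scores(pd.Series([0.05, 0.33, 0.58, 0.92]))  # -> 0, 2, 6, 9
```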

While credit scores are inherently continuous, discretization facilitated the model’s ability to
handle the classification problem effectively. This discretization also introduces interpretability,
allowing stakeholders to understand credit classes in defined ranges. Discretizing scores, though
common, can result in information loss when subtle differences within each bin are disregarded.
Future models could consider ordinal regression techniques to preserve more nuanced insights
between score ranges, enhancing precision without sacrificing interpretability.

The distribution across credit score categories was notably unbalanced, with higher scores (e.g.,
8 and 9) represented far less frequently than mid-range scores. This imbalance can hinder model
performance by skewing predictions towards the majority classes. To counteract this, we applied
the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic instances
of minority classes. SMOTE enables the model to better learn underrepresented patterns,
reducing the risk of bias and enhancing the robustness of predictions. Class imbalance remains a
fundamental challenge in many real-world datasets, and SMOTE proved beneficial here.
However, as synthetic data generation could also introduce artificial patterns, future work might
explore ensemble-based approaches or penalized loss functions as alternative methods for
handling class imbalance.

For the primary model, we selected XGBoost, a gradient-boosting algorithm well-suited to structured data and effective in handling high-dimensional spaces and sparse data. XGBoost’s capacity for parallelism, regularization, and efficient computation made it an ideal choice for our credit classification task. A grid search was conducted to optimize hyperparameters, yielding the following configuration: n_estimators=100, learning_rate=0.1, max_depth=6, min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8, objective='multi:softmax', num_class=10.

The parameters were selected based on their impact on overfitting, interpretability, and training
efficiency. For instance, a max_depth of 6 was chosen to balance depth (and thus complexity)
with the risk of overfitting, while a learning_rate of 0.1 helped achieve a moderate convergence
rate.
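The reported configuration corresponds to the estimator below, fitted on the resampled training data from the SMOTE sketch above; the small search grid shown is only an illustration of the procedure, not the full grid we ran.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Final configuration from the grid search reported above.
best_model = XGBClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=6, min_child_weight=1,
    gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective="multi:softmax", num_class=10)
best_model.fit(X_train_res, y_train_res)

# Illustrative grid search over a subset of the tuned hyperparameters.
param_grid = {"max_depth": [4, 6, 8], "learning_rate": [0.05, 0.1, 0.2]}
search = GridSearchCV(
    XGBClassifier(objective="multi:softmax", num_class=10),
    param_grid, cv=3, scoring="accuracy", n_jobs=-1)
search.fit(X_train_res, y_train_res)
print(search.best_params_)
```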

In addition to XGBoost, a Random Forest model was implemented as a baseline for comparison.
While both models yielded competitive results, XGBoost outperformed Random Forest across all
key metrics, including accuracy and training time. This is in line with literature on boosting
versus bagging, where XGBoost’s iterative correction of errors generally yields higher accuracy
in classification tasks. In practical terms, the model provides a reliable method for credit scoring
that accommodates the complexity of real-world financial data. The results also highlight the
potential for SMS-derived transaction data as a scalable alternative to traditional credit scoring
inputs. The success of this model underscores the value of using advanced preprocessing (e.g.,
SMOTE and discretization) alongside a powerful classification algorithm like XGBoost. Moving
forward, hybrid models or ensemble strategies could further refine classification performance,
especially in cases of extreme class imbalance.
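For completeness, the baseline comparison can be sketched as follows; the metrics shown and the Random Forest settings are illustrative, and the snippet reuses the hypothetical objects from the earlier sketches.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Baseline model trained on the same resampled training data.
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train_res, y_train_res)

# Compare both models on the untouched test split.
for name, model in [("XGBoost", best_model), ("RandomForest", rf)]:
    pred = model.predict(X_test)
    print(name,
          "accuracy:", round(accuracy_score(y_test, pred), 3),
          "macro-F1:", round(f1_score(y_test, pred, average="macro"), 3))
```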

III. Challenges
Most of our challenges currently come from the low-quality dataset and our lack of financial knowledge to define a proper metric for evaluating the scoring model. Specifically, we face challenges in:

Extracting information from a non-generalized, irrelevant feature set: In our unsupervised credit scoring project, the quality of the dataset was a major issue. It had many features that did not really add value to the credit scoring task, which made finding useful patterns harder. Many features had missing values, and there were so many outliers that the data were heavily skewed. This low data quality not only made preprocessing more complicated but also limited the model's ability to create balanced, useful clusters.

Verifying the results of our model: Another key challenge with unsupervised learning was the lack of labels, so we did not have a benchmark to measure the model's accuracy in sorting users into "good" and "bad" credit clusters. We had to rely on indirect methods and judgment calls, which led to uncertainty in the results. We believe we need financial experts to weigh in on which features matter most for credit scoring, giving us a guideline to pick features that actually make sense. Evaluating the model's performance was also difficult: accuracy alone did not feel like enough, as it missed important financial trade-offs. For instance, making the scoring less strict could attract more customers and boost revenue, but it could also raise risk. This made it clear we needed expert guidance to set up evaluation metrics that align with the banking industry's goals and risk management practices.

IV. Future plans


Looking ahead, several promising avenues could enhance the effectiveness, robustness, and
applicability of this credit classification framework. First, integrating additional data sources
alongside SMS transaction data, such as banking or social network information, could enrich the
model’s input features, allowing a more comprehensive view of creditworthiness. Handling the
current data sparsity could also benefit from advanced imputation techniques—like K-nearest
neighbors, matrix factorization, or variational autoencoders—better capturing the relationships
within incomplete data to improve model accuracy.
Interpretability is another critical area, especially given the sensitive nature of credit decisions.
Implementing interpretability methods such as SHAP values or LIME would help explain the
model’s decision process to users and auditors, fostering transparency and trust. Additionally,
exploring ensemble or hybrid models, potentially combining XGBoost with neural networks or
recent deep learning models for tabular data, could further boost classification performance and
resilience by capturing more complex interactions within the data.

Considering the dynamic nature of creditworthiness, future models could incorporate temporal
analysis, using RNNs or Transformers to capture shifts in user behavior over time. This approach
would enable real-time, adaptive credit scoring that reflects users’ most current financial status.
Ensuring fairness in credit scoring is also essential; dedicated studies could help identify and
mitigate any demographic biases, fostering ethical AI deployment.

Finally, for real-world application, optimizing the model for real-time processing and scalability
will be crucial. Techniques such as model pruning or quantization could help streamline the
model’s deployment within high-volume environments, allowing continuous updates to credit
scores and supporting fast, data-driven decision-making in financial services.

V. Communication Method
1. Team Communication and Planning
Communication Platforms: Use Messenger for daily updates and discussions.
Weekly Meetings: Conduct virtual meetings to track progress, upcoming tasks, and potential blockers.
Shared Calendar: Maintain a project calendar with deadlines, meeting minutes, and events.
2. Planning and Initiation
Scope Definition: Clearly define the project scope to set boundaries and expectations.
Resource Allocation: Identify and assign resources (personnel, technology) from the outset.
3. Agile Framework
Methodology: Implement Agile for flexibility and iterative progress.
Weekly Stand-Ups: Conduct short weekly meetings to align the team and address obstacles.
Sprint Reviews and Retrospectives: Evaluate completed work and processes for continuous
improvement.
4. Tracking Requirements, Risks, and Issues
Project Workspace: Utilize Google Sheets for: a comprehensive requirements document outlining scope, objectives, and deliverables; a risk register to identify, assess, and monitor potential risks with mitigation strategies; and an issues log to track challenges encountered during the project lifecycle, documenting resolutions to prevent recurrence.
VI. Members' contributions

Vu Duy Tung:
- Researched the method to manually assign credit scores to users, which are used as surrogate labels in supervised models.
- Implemented supervised learning models (XGBoost, RandomForestClassifier).
- Collected the dataset containing users' financial features.
Tran Anh Vu:
- Hosted weekly meetings to discuss the projects.
- Researched and implemented the missing value imputation algorithm and improved it to fit the characteristics of the existing dataset.
- Handled the imbalanced dataset with the SMOTE algorithm.
- Implemented and evaluated the performance of unsupervised learning models.
Nguyen Canh Huy:
- Researched and applied evaluation metrics of credit scoring used by existing creditors
around the world.
- Researched related works, and conducted the literature review for the topic.
- Managed the progress of the team, and tracked and reminded other team members to complete assigned tasks.
VII. Conclusion
This study introduces an effective framework for credit score classification using SMS-derived
transaction data. By leveraging a robust preprocessing pipeline, SMOTE, and XGBoost, we
developed a model that achieves reliable credit classification across multiple credit levels. The
findings demonstrate the potential of SMS data to facilitate credit scoring in settings where
traditional data sources may be unavailable or incomplete.

Looking forward, future work might explore integrating this framework with alternative data
sources or testing additional boosting algorithms to enhance accuracy and interpretability.
Additionally, interpretability methods, such as SHAP values or feature importance, could be
applied to further elucidate the model’s decision-making process, which is particularly valuable
for stakeholders in financial services.
