
Unsupervised Credit Scoring Models

Bachelor of Science in Computer Science

Prepared by:

Tran Anh Vu V202100569

Vu Duy Tung V202100528

Nguyen Canh Huy

Under the supervision of:


Instructor’s Name: Prof. Doan Dang Khoa

Related course: COMP3020 - Machine Learning


COLLEGE OF ENGINEERING AND COMPUTER SCIENCE

VINUNIVERSITY

October, 2025

Team Members and Roles

Tran Anh Vu (V202100571)
Role: Data processor, unsupervised learning model developer.
Role Detail: Preprocess raw data and handle missing data. Research existing credit scoring functions to optimize feature utilization. Develop unsupervised learning models on unlabeled datasets.

Vu Duy Tung (V202100528)
Role: Data collection, exploring supervised learning algorithms, considering future directions.
Role Detail: Convert raw SMS data into a structured dataset and leverage supervised machine learning techniques, specifically XGBoost, to categorize creditworthiness based on SMS transaction records.

Nguyen Canh Huy

Note: Roles may change flexibly as needed during the work process.
I. Project Overview
A. Introduction
Credit scoring is a significant task in the financial sector, especially for banks where millions of
dollars in loan decisions largely depend on users' credit scores. However, due to the complexity
of users' financial situations, traditional credit scoring methods done by humans can be
inconsistent. This makes it essential to develop a machine learning-based credit scoring model.
To ensure a general approach, we aim to develop both supervised and unsupervised learning
models so we can maintain the natural dependence of credit scores on financial features without
relying on artificial labels. This report provides an overview of our initial work in data
processing, developing both types of models, the challenges we encountered, and our future
plans for the next steps. Besides this, it also covers our communication methods and collaboration plans toward the final goal.

B. Related works
Recent research has shown that advanced machine learning methods outperform traditional models in credit scoring evaluations, as demonstrated in a recent study conducted by Mestiri (2024) [1]. This study compared six credit scoring models: Linear Discriminant Analysis (LDA), Random Forests (RF), Logistic Regression (LR), Decision Trees (DT), Support Vector Machines (SVM), and Deep Neural Networks (DNN). The results from analyzing 688 samples and 12 variables revealed that the machine learning techniques were more accurate in predicting loan defaults than the traditional statistical approaches.

Ensemble methods have shown particularly promising results. XGBoost, LightGBM, and
CatBoost have emerged as top performers in recent credit-scoring studies. For instance, research
by Hlongwane et al. (2024) revealed that XGBoost achieved the highest accuracy among
examined techniques, even surpassing the industry-standard FICO scores in some cases [2].
From these recent findings, K-means clustering and building a custom credit scoring formula present significant advantages and opportunities for credit scoring classification models, leading the team to adopt these approaches. K-means clustering enables the categorization of customers into risk groups without predefined labels, which can reveal patterns that traditional methods may overlook. Custom credit scoring formulas can incorporate features identified through selection methods and include alternative data sources that have demonstrated potential in recent studies. By combining K-means clustering for customer segmentation with a custom scoring formula, we aim to develop a robust, flexible, and transparent credit scoring system that can adapt to changing market conditions and regulatory requirements.

C. Data description
The dataset was provided by SFIN Joint-Stock Company, containing six months of transaction
metadata for 800,000 anonymous users. Each user’s transactions, recorded via the Short Message Service (SMS), capture diverse financial behaviors. Extracting meaningful insights from SMS
data presents a unique opportunity due to its rich and often underutilized content. By applying
prompting techniques with Large Language Models (LLMs), we extracted and structured
relevant entities, transforming raw text into structured feature data.
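For illustration, the sketch below shows the kind of prompt-based entity extraction described above. The prompt wording, the field names, and the call_llm helper are hypothetical placeholders; the actual prompts and LLM interface we used are not reproduced here.

```python
import json

# Hypothetical prompt template for turning one raw SMS into structured fields.
PROMPT_TEMPLATE = """Extract the following fields from the SMS below and answer
with a single JSON object: transaction_type, amount, currency, balance_after, date.
Use null for any field that is not present.

SMS: "{sms}"
JSON:"""

def extract_entities(sms_text, call_llm):
    """Ask an LLM to parse one SMS; call_llm is any function that maps a prompt
    string to the model's text completion (the interface is an assumption)."""
    response = call_llm(PROMPT_TEMPLATE.format(sms=sms_text))
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        # Malformed completions are treated as missing data (NaN downstream).
        return {}
```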

The dataset comprised 38 features, but it suffered from significant sparsity, with more than half
of the features having over 90% NaN values. This high level of sparsity posed challenges, as
missing data can often degrade model performance by introducing noise and reducing predictive
reliability. Therefore, our preprocessing pipeline not only focused on filling or removing NaNs
as necessary but also on maximizing data usability by discarding features with excessively low
completion rates.

This sparsity issue underscores a limitation in SMS-based transaction data - though SMS can
capture granular financial actions, its unstructured nature may require robust handling of missing
data. In future applications, a more comprehensive data source, potentially combining SMS data
with banking records, could offer enhanced reliability.

II. Progress Description


A. Data Processing

1. Missing data imputation


Because our initial dataset is extracted from users' SMS information, a sample might have several missing features if there is no SMS related to the corresponding field during the data collection period. Therefore, the number of NaN entries is very large, especially for fields that are less common than others (e.g., data related to credit card usage is more likely to be missing than the number of transactions). Hence, although there were many samples in our credit scoring dataset, the proportion of missing values was unexpectedly high (some features are missing in nearly 80% of instances). That is why we initially aimed to fill in the missing entries with a missing value imputation algorithm; Bayesian Network Internal Imputation (BNII), introduced by Lan et al. in 2020, is a promising method because it was designed for credit scoring datasets.

BNII Algorithm: Essentially, the internal missing value filling algorithm consists of three steps. Firstly, we build a Bayesian network that defines relationships among features, in which each node (feature) has its own list of parents and children. Secondly, we filter out the complete dataset, i.e., the rows that contain no null values. Based on this dataset, we calculate the conditional probability table of each feature taking each of its values given its parents' values (called evidence). Finally, we fill in missing values in the incomplete dataset by finding the value with the maximum likelihood given the existing parents and children. In the final step, specifically, the probability of a feature X taking value i is proportional to the conditional probability of X = i given its parents' evidence, multiplied by the probability of its children's evidence given X = i:

P(X = i | evidence) ∝ P(X = i | parents' evidence) × ∏_{C ∈ children(X)} P(C's evidence | X = i)

So after initializing all missing values with their modes, the algorithm iterates through all features in the missing list and changes each to the most likely candidate. Then, it compares the new sample with the one from the previous iteration. If the similarity (e.g., cosine similarity) is higher than a pre-defined threshold, the algorithm stops (convergence).
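The following is a minimal sketch of this filling step, assuming the network structure and conditional probability tables have already been estimated from the complete rows. The names structure and cpt, the 1e-6 smoothing constant, and the simplified "no change" stopping rule (instead of a cosine-similarity threshold) are illustrative choices, not taken from the original paper.

```python
import pandas as pd

def bnii_fill(df, structure, cpt, max_iters=20):
    """Sketch of BNII's internal filling step.

    df        : DataFrame of discretized features (NaN = missing).
    structure : {feature: {"parents": [...], "children": [...]}}.
    cpt       : {feature: {(value, tuple_of_parent_values): probability}}.
    """
    filled = df.copy()
    missing_mask = df.isna()

    # Step 0: initialize every missing entry with the feature's mode.
    for col in filled.columns:
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

    prev = filled.copy()
    for _ in range(max_iters):
        for col in filled.columns:
            candidates = df[col].dropna().unique()
            for idx in filled.index[missing_mask[col]]:
                scores = {}
                for value in candidates:
                    # P(X = value | parents' evidence) ...
                    parent_ev = tuple(filled.loc[idx, structure[col]["parents"]])
                    p = cpt[col].get((value, parent_ev), 1e-6)
                    # ... times P(child's evidence | X = value) for each child.
                    for child in structure[col]["children"]:
                        ev = tuple(value if pa == col else filled.at[idx, pa]
                                   for pa in structure[child]["parents"])
                        p *= cpt[child].get((filled.at[idx, child], ev), 1e-6)
                    scores[value] = p
                filled.at[idx, col] = max(scores, key=scores.get)
        # Simplified convergence check: stop when no imputed entry changed.
        if ((filled != prev) & missing_mask).to_numpy().sum() == 0:
            break
        prev = filled.copy()
    return filled
```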

Moreover, to fit the characteristics of our dataset, we extended the BNII algorithm to continuous variables. After confirming the limitations of the original algorithm with its authors, we looked for methods to discretize the existing numerical features into discrete-valued ones. To do so, we handled skewed distributions and outliers by conducting multiple z-score removal iterations before binning the remaining values into equal-width intervals.
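As a small illustration of this discretization step, the sketch below trims outliers by iterative z-score filtering and then bins values into equal-width intervals; the threshold of 3, the three passes, the five bins, and the choice to clip extreme values into the edge bins are illustrative rather than the exact settings we used.

```python
import numpy as np
import pandas as pd

def trim_outliers_zscore(series, z_thresh=3.0, n_passes=3):
    """Repeatedly drop values whose z-score exceeds the threshold."""
    s = series.dropna()
    for _ in range(n_passes):
        z = (s - s.mean()) / s.std()
        keep = z.abs() <= z_thresh
        if keep.all():
            break
        s = s[keep]
    return s

# Hypothetical skewed feature: trim outliers, then bin into equal-width intervals.
values = pd.Series(np.random.lognormal(mean=2.0, sigma=1.0, size=10_000))
trimmed = trim_outliers_zscore(values)
# Clip extremes into the edge bins so no samples are lost (one possible choice).
binned = pd.cut(values.clip(trimmed.min(), trimmed.max()), bins=5, labels=False)
```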

After that, we found that our dataset contains so many missing values that some features would lose certain value categories if we built the complete dataset from all non-null rows, as in the original algorithm. Therefore, we created a dictionary of Bayesian networks, where each network stores the probabilities related to the parents and children of only ONE feature. With this improvement, we only need to remove rows with null values in that feature's parents or children, not in all features. As a result, we not only lost fewer values (most features preserve their full set of categories, and only 2 features lose 1 category each) but also kept more samples to better generalize the relationships among them.

2. Feature encoding method


After internal imputation, our data is transformed from continuous (numerical) values into discrete (categorized) values, which might not be suitable for the later classification models. Therefore, we needed an encoding method to turn them back into numerical values, and we chose Leave-One-Out (LOO) Target Encoding:

LOO Target Encoding: We encode each categorical value by the average target value of its category, calculated excluding the current row (the "leave-one-out" part). For each category of a categorical feature, we compute the mean of the target variable (e.g., if the target is binary, it is the probability of the positive class). For each row, the categorical value is replaced by the mean target value of that category, excluding the current row (to prevent leakage):

TE(x_i) = ( Σ_{j ≠ i : x_j = x_i} y_j ) / (n − 1)

- TE(x_i) is the target encoding for row i
- y_j is the target value of another row j (j ≠ i) belonging to the same category as x_i
- n is the number of rows in the dataset that belong to that category.

Why we used LOO: Because our categorized data has high cardinality (5-6 values per feature) and we want to prevent data leakage into the later models.
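A minimal pandas sketch of this encoder is shown below; the column names, the binary surrogate label, and the fallback for singleton categories are illustrative assumptions.

```python
import pandas as pd

def loo_target_encode(df, cat_col, target_col):
    """Leave-one-out target encoding: for each row, the mean target of its
    category computed over all OTHER rows of that category."""
    grp = df.groupby(cat_col)[target_col]
    cat_sum, cat_count = grp.transform("sum"), grp.transform("count")
    enc = (cat_sum - df[target_col]) / (cat_count - 1)
    # A category seen only once has no "other" rows; fall back to the global mean.
    return enc.fillna(df[target_col].mean())

# Illustrative usage with a hypothetical binned feature and binary surrogate label.
df = pd.DataFrame({"loan_bin": ["a", "a", "b", "b", "b", "c"],
                   "label":    [1,   0,   1,   1,   0,   1]})
df["loan_bin_te"] = loo_target_encode(df, "loan_bin", "label")
# Row 0 ("a", label 1) is encoded as 0.0: the only other "a" row has label 0.
```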

B. Training Models

1. Unsupervised Learning Model


Because our dataset is unlabeled, we implement unsupervised learning models to preserve the natural separation among data points. Specifically, we choose K-means clustering with k = 2 (binary clustering for bad/good users). The reason for this choice is that K-means is relatively simple and computationally efficient, which is helpful when working with a high-dimensional dataset (we have ~20 features per sample). Moreover, the K-means clusters with their centroids also make the output model interpretable, because the algorithm's mechanism makes it easier to understand the profile of users in each cluster and assess their credit risk.

After implementing it, we also visualized the correlation map to see how each variable correlates with the "surrogate labels". The visualization in Fig. 1 shows that the two features "Biggest_Loan" and "Average_Loan_Per_Month" are the most correlated with the labels assigned by K-means clustering, which means the data are most separated along the dimensions of these two features. We also confirmed this by plotting a small group of randomly sampled data points, colored by cluster, in the 2D space of those features (Fig. 2); as the plot shows, they are well separated.
Fig 1. Correlation Maps of Features in the Dataset

Fig 2. Data plotted in the 2D space of “Biggest_Loan” and “Average_Loan_Per_Month”
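The figures above can be reproduced along the following lines; this is a hedged sketch that reuses the hypothetical X and surrogate_labels objects from the K-means sketch, and the sample size and plotting style are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Correlation of every feature with the surrogate cluster labels (cf. Fig. 1).
data = X.copy()
data["cluster"] = surrogate_labels
label_corr = data.corr()["cluster"].drop("cluster").sort_values()

# Scatter of a random sample in the two most label-correlated dimensions (cf. Fig. 2).
sample = data.sample(n=2_000, random_state=0)
plt.scatter(sample["Biggest_Loan"], sample["Average_Loan_Per_Month"],
            c=sample["cluster"], s=5, cmap="coolwarm")
plt.xlabel("Biggest_Loan")
plt.ylabel("Average_Loan_Per_Month")
plt.show()
```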

Unfortunately, when we compare the labels produced by the K-means algorithm with those given by the FICO formula, which is widely applied by international creditors, the agreement is not even at an acceptable level (0.34 is the maximum agreement between the two types of labels). However, because the FICO formula is quite general (its credit scoring mechanism is not very specific), the error might come from the way we apply it to our dataset. Therefore, the disagreement between the FICO and K-means labels should be investigated further in the next steps.
2. Supervised Learning Model
Credit scores were categorized into 10 discrete bins, from 0 to 9, each representing a unique
range of creditworthiness. These bins, ranging from the lowest (0) to the highest score (9), were
defined as follows: 0 (<0.15), 1 (0.15-0.3), 2 (0.3-0.4), 3 (0.4-0.45), 4 (0.45-0.5), 5 (0.5-0.55), 6
(0.55-0.6), 7 (0.6-0.7), 8 (0.7-0.8), 9 (>0.8).
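A sketch of this discretization is given below; we assume scores lie in [0, 1] and that a boundary value belongs to the higher bin, which may differ from the exact convention used in our pipeline.

```python
import numpy as np
import pandas as pd

# Bin edges matching the ranges listed above.
EDGES = [-np.inf, 0.15, 0.3, 0.4, 0.45, 0.5, 0.55, 0.6, 0.7, 0.8, np.inf]

def bin_credit_scores(scores):
    """Map continuous credit scores to discrete classes 0 (lowest) to 9 (highest)."""
    # right=False puts a boundary value such as 0.3 into the higher bin (class 2).
    return pd.cut(scores, bins=EDGES, labels=list(range(10)), right=False).astype(int)

classes = bin_credit_scores(pd.Series([0.05, 0.33, 0.58, 0.92]))  # -> 0, 2, 6, 9
```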

While credit scores are inherently continuous, discretization facilitated the model’s ability to
handle the classification problem effectively. This discretization also introduces interpretability,
allowing stakeholders to understand credit classes in defined ranges. Discretizing scores, though
common, can result in information loss when subtle differences within each bin are disregarded.
Future models could consider ordinal regression techniques to preserve more nuanced insights
between score ranges, enhancing precision without sacrificing interpretability.

The distribution across credit score categories was notably unbalanced, with higher scores (e.g.,
8 and 9) represented far less frequently than mid-range scores. This imbalance can hinder model
performance by skewing predictions towards the majority classes. To counteract this, we applied
the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic instances
of minority classes. SMOTE enables the model to better learn underrepresented patterns,
reducing the risk of bias and enhancing the robustness of predictions. Class imbalance remains a
fundamental challenge in many real-world datasets, and SMOTE proved beneficial here.
However, as synthetic data generation could also introduce artificial patterns, future work might
explore ensemble-based approaches or penalized loss functions as alternative methods for
handling class imbalance.

For the primary model, we selected XGBoost, a gradient-boosting algorithm well-suited to structured data and effective in handling high-dimensional spaces and sparse data. XGBoost’s capacity for parallelism, regularization, and efficient computation made it an ideal choice for our credit classification task. A grid search was conducted to optimize hyperparameters, yielding the following configuration: n_estimators=100, learning_rate=0.1, max_depth=6, min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8, objective='multi:softmax', num_class=10.

The parameters were selected based on their impact on overfitting, interpretability, and training
efficiency. For instance, a max_depth of 6 was chosen to balance depth (and thus complexity)
with the risk of overfitting, while a learning_rate of 0.1 helped achieve a moderate convergence
rate.
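The reported configuration corresponds to the estimator below, fitted on the resampled training data from the SMOTE sketch above; the small search grid shown is only an illustration of the procedure, not the full grid we ran.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Final configuration from the grid search reported above.
best_model = XGBClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=6, min_child_weight=1,
    gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective="multi:softmax", num_class=10)
best_model.fit(X_train_res, y_train_res)

# Illustrative grid search over a subset of the tuned hyperparameters.
param_grid = {"max_depth": [4, 6, 8], "learning_rate": [0.05, 0.1, 0.2]}
search = GridSearchCV(
    XGBClassifier(objective="multi:softmax", num_class=10),
    param_grid, cv=3, scoring="accuracy", n_jobs=-1)
search.fit(X_train_res, y_train_res)
print(search.best_params_)
```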

In addition to XGBoost, a Random Forest model was implemented as a baseline for comparison.
While both models yielded competitive results, XGBoost outperformed Random Forest across all
key metrics, including accuracy and training time. This is in line with literature on boosting
versus bagging, where XGBoost’s iterative correction of errors generally yields higher accuracy
in classification tasks. In practical terms, the model provides a reliable method for credit scoring
that accommodates the complexity of real-world financial data. The results also highlight the
potential for SMS-derived transaction data as a scalable alternative to traditional credit scoring
inputs. The success of this model underscores the value of using advanced preprocessing (e.g.,
SMOTE and discretization) alongside a powerful classification algorithm like XGBoost. Moving
forward, hybrid models or ensemble strategies could further refine classification performance,
especially in cases of extreme class imbalance.
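For completeness, the baseline comparison can be sketched as follows; the metrics shown and the Random Forest settings are illustrative, and the snippet reuses the hypothetical objects from the earlier sketches.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Baseline model trained on the same resampled training data.
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train_res, y_train_res)

# Compare both models on the untouched test split.
for name, model in [("XGBoost", best_model), ("RandomForest", rf)]:
    pred = model.predict(X_test)
    print(name,
          "accuracy:", round(accuracy_score(y_test, pred), 3),
          "macro-F1:", round(f1_score(y_test, pred, average="macro"), 3))
```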

III. Challenges
Most of our challenges currently come from the low-quality dataset and our lack of financial knowledge to define a proper metric for evaluating the scoring model. Specifically, we face challenges in:

Extracting information from a non-generalized, irrelevant feature set: In our unsupervised credit scoring project, the quality of the dataset was a major issue. It had many features that did not really add value to the credit scoring task, which made finding useful patterns harder. Many features had missing values, and there were so many outliers that the data were heavily skewed. This low data quality not only made preprocessing more complicated but also limited the model's ability to create balanced, useful clusters.

Verifying the results of our model: Another key challenge with unsupervised learning was the lack of labels, so we did not have a benchmark to measure the model's accuracy in sorting users into "good" and "bad" credit clusters. We had to rely on indirect methods and judgment calls, which led to uncertainty in the results. We believe we need financial experts to weigh in on which features matter most for credit scoring, giving us a guideline to pick features that actually make sense. Evaluating the model's performance was also difficult: accuracy alone did not feel like enough, as it missed important financial trade-offs. For instance, making the scoring less strict could attract more customers and boost revenue, but it could also raise risk. This made it clear we needed expert guidance to set up evaluation metrics that align with the banking industry's goals and risk management practices.

IV. Future plans


Looking ahead, several promising avenues could enhance the effectiveness, robustness, and
applicability of this credit classification framework. First, integrating additional data sources
alongside SMS transaction data, such as banking or social network information, could enrich the
model’s input features, allowing a more comprehensive view of creditworthiness. Handling the
current data sparsity could also benefit from advanced imputation techniques—like K-nearest
neighbors, matrix factorization, or variational autoencoders—better capturing the relationships
within incomplete data to improve model accuracy.
Interpretability is another critical area, especially given the sensitive nature of credit decisions.
Implementing interpretability methods such as SHAP values or LIME would help explain the
model’s decision process to users and auditors, fostering transparency and trust. Additionally,
exploring ensemble or hybrid models, potentially combining XGBoost with neural networks or
recent deep learning models for tabular data, could further boost classification performance and
resilience by capturing more complex interactions within the data.

Considering the dynamic nature of creditworthiness, future models could incorporate temporal
analysis, using RNNs or Transformers to capture shifts in user behavior over time. This approach
would enable real-time, adaptive credit scoring that reflects users’ most current financial status.
Ensuring fairness in credit scoring is also essential; dedicated studies could help identify and
mitigate any demographic biases, fostering ethical AI deployment.

Finally, for real-world application, optimizing the model for real-time processing and scalability
will be crucial. Techniques such as model pruning or quantization could help streamline the
model’s deployment within high-volume environments, allowing continuous updates to credit
scores and supporting fast, data-driven decision-making in financial services.

V. Communication Method
1. Team Communication and Planning
Communication Platforms: Use Messenger for daily updates and discussions.
Weekly Meetings: Conduct virtual meetings to track progress, upcoming tasks, and potential blockers.
Shared Calendar: Maintain a project calendar with deadlines, meeting minutes, and events.
2. Planning and Initiation
Scope Definition: Clearly define the project scope to set boundaries and expectations.
Resource Allocation: Identify and assign resources (personnel, technology) from the outset.
3. Agile Framework
Methodology: Implement Agile for flexibility and iterative progress.
Weekly Stand-Ups: Conduct short weekly meetings to align the team and address obstacles.
Sprint Reviews and Retrospectives: Evaluate completed work and processes for continuous
improvement.
4. Tracking Requirements, Risks, and Issues
Project Workspace: Utilize Google Sheets for: a comprehensive requirements document outlining scope, objectives, and deliverables; a risk register to identify, assess, and monitor potential risks with mitigation strategies; and an issues log to track challenges encountered during the project lifecycle, documenting resolutions to prevent recurrence.
VI. Members' contributions

Vu Duy Tung:
- Researched the method to manually assign credit scores to users, which are used as surrogate labels in supervised models.
- Implemented supervised learning models (XGBoost, RandomForestClassifier).
- Collected the dataset containing users' financial features.
Tran Anh Vu:
- Hosted weekly meetings to discuss the projects.
- Researched and implemented the missing value imputation algorithm and improved it to fit the characteristics of the existing dataset.
- Handled the imbalanced dataset with the SMOTE algorithm.
- Implemented and evaluated the performance of unsupervised learning models.
Nguyen Canh Huy:
- Researched and applied evaluation metrics of credit scoring used by existing creditors
around the world.
- Researched related works, and conducted the literature review for the topic.
- Managed the progress of the team, and tracked and reminded other team members to complete assigned tasks.
VII. Conclusion
This study introduces an effective framework for credit score classification using SMS-derived
transaction data. By leveraging a robust preprocessing pipeline, SMOTE, and XGBoost, we
developed a model that achieves reliable credit classification across multiple credit levels. The
findings demonstrate the potential of SMS data to facilitate credit scoring in settings where
traditional data sources may be unavailable or incomplete.

Looking forward, future work might explore integrating this framework with alternative data
sources or testing additional boosting algorithms to enhance accuracy and interpretability.
Additionally, interpretability methods, such as SHAP values or feature importance, could be
applied to further elucidate the model’s decision-making process, which is particularly valuable
for stakeholders in financial services.
