Irfan Ali, Resume Classification System

The document presents a study on an automated Resume Classification System (RCS) utilizing Natural Language Processing (NLP) and Machine Learning (ML) techniques to efficiently categorize job applications. The study evaluates various ML algorithms, demonstrating that the Support Vector Machine (SVM) classifiers achieved over 96% accuracy in classifying resumes into 25 job categories. This research aims to streamline the recruitment process by reducing the time and effort required for resume screening, ultimately benefiting both employers and job seekers.

Mehran University Research Journal of Engineering and Technology

Vol. 41, No. 1, 65 - 79, January 2022


p-ISSN: 0254-7821, e-ISSN: 2413-7219
DOI: https://doi.org/10.22581/muet1982.2201.07

Resume Classification System using Natural Language Processing and Machine Learning Techniques
Irfan Ali1a, Nimra Mughal1b, Zahid Hussain Khand1c, Javed Ahmed1d, Ghulam Mujtaba1e

RECEIVED ON 02.11.2020, ACCEPTED ON 05.05.2021


ABSTRACT
The selection of a suitable job applicant from a pool of thousands of applications is often a daunting job for an
employer. Categorizing job applications submitted in the form of Resumes against the available vacancies
takes significant time and effort. A Resume Classification System (RCS) using Natural
Language Processing (NLP) and Machine Learning (ML) techniques could automate this tedious process.
Moreover, automating this process can significantly expedite the applicants' screening process and make it
more transparent, with minimal human involvement. This experimental study presents an automated NLP and ML-based
RCS that classifies Resumes according to job categories with performance guarantees. This study employs
various ML algorithms and NLP techniques to measure the accuracy of the RCS and proposes a solution with
better accuracy and reliability in different settings. To demonstrate the significance of NLP and ML techniques
for RCS, the extracted features were evaluated on nine ML classification models, namely Support Vector
Machine - SVM (Linear, SGD, SVC and NuSVC), Naïve Bayes (Bernoulli, Multinomial and Gaussian), K-Nearest
Neighbor (KNN), and Logistic Regression (LR). The Term-Frequency-Inverse-Document-Frequency (TF-IDF)
feature representation scheme proved suitable for RCS. The developed models were evaluated using the
Confusion Matrix, F-Score, Recall, Precision, and overall Accuracy. The experimental results indicate that,
using the One-Vs-Rest classification strategy for this multi-class Resume classification task, the SVM class of
Machine Learning classifiers performed better on the study dataset of more than 960 parsed
resumes, with more than 96% accuracy. The promising results suggest that the NLP and ML techniques employed
in this study could be used for developing an efficient RCS.

Keywords: Resume Classification, Natural Language Processing, Machine Learning, Text Classification,
Recommender System

1. INTRODUCTION

Internet-based recruiting systems have been rapidly adopted by recruiters in recent years. The rapid growth of the internet caused a corresponding growth in the quantity of available online information [1]. As a result, information is widely available. At the same time, information became overloaded, which created the need for information management [2, 3]. Moreover, the ever-increasing unemployment rate in developing countries like Pakistan results in a considerable number of job applications for a vacant position [4]. Thus, the selection of suitable job applicants from a pool of thousands of applications is often a daunting job for an employer. Recruiters need to screen a large amount of data to select the most suitable application from the pool, which significantly increases the workload of the recruiter's concerned department [5]. Moreover, this process involves the engagement of considerable Human Resources and requires rigorous efforts and resources to finalize the most suitable applicant for further

1
Center of Excellence for Robotics, Artificial Intelligence, and Blockchain, Department of Computer Science, Sukkur
IBA University, Sukkur-65200, Sindh Pakistan. Email: [email protected] (Corresponding Author),
b
[email protected], [email protected], [email protected], [email protected].
This is an open access article published by Mehran University of Engineering and Technology, Jamshoro under CC BY 4.0
International License.
recruiting process. If the recruiters can identify the non-relevant profiles at the earlier stages of the hiring process, this can save significant time and money [6].

The Resume is a portfolio document developed by job applicants to present the relevant details for the vacant job. In this document, the applicant provides personal details, educational details, accomplishments, competencies, skills, and experiences. The resume helps recruiters to shortlist the applicant from the pool of applications, as it provides a complete picture of the applicant's competencies and skills. Resume screening demands domain knowledge to understand the suitability and relevance of an applicant for the advertised job vacancy. However, under current global economic conditions, companies have less capital to invest in their HR departments while still needing to ensure that they choose the most competitive applicant for the job description [1]. Thus, recruiters are facing three main challenges:

● Making sense of Resume: Resumes in the market have no defined standard. Every resume in the pool of applications may have a different structure. Thus, HR needs to manually go through each resume to find the best one.
● Mapping resume to the job description: This is based on mapping the applicant's Resume to the requirements criteria provided by the recruiter. This process involves detailed screening and requires domain experts to perform the task efficiently.
● Managing the cost: For screening and selection, recruiters need to adopt automated processes with minimal human involvement to save time and money.

Hence, Machine Learning based automated Resume Classification Systems can be used to classify the Resumes according to the job category. This approach can automate the tedious process of Resume Selection and support recruiters in overcoming the above-mentioned challenges. Moreover, the automation of this process can significantly expedite the applicants' shortlisting process and make the selection process more transparent, with minimal human involvement.

Text Classification (TC) is a technique to automatically assign predefined classes to a particular text document [7, 8]. TC is one of the most fundamental tasks of Natural Language Processing (NLP). TC is carried out with Supervised Machine Learning techniques. These techniques require the text to be represented as a fixed-length feature vector [7]. Thus, Preprocessing and Feature Engineering are the most important and fundamental steps for such text classification tasks, where various feature extraction and feature representation techniques are applied [9].

Feature extraction typically finds the set of most informative features, whereas feature representation determines the most suitable way to represent the values of the extracted features. The most widely used feature extraction techniques for text documents are N-grams, Bag of Words (BoW), and Word-to-Vec. Every extracted feature is assigned a numeric value using a representation technique such as Binary or TF-IDF. Every feature engineering technique has pros and cons. Hence, the job of a Machine Learning Engineer is to find the most useful technique for the problem under consideration. Various Machine Learning approaches have been proposed in the literature to develop Resume Classification Systems. This study aims at developing an ML-based system that classifies Resumes according to job categories. The study applies the Supervised Machine Learning approach to correctly classify resumes into the 25 different job categories they belong to. The dataset has 962 resumes labeled with their categories to train the classifier. Various multi-class classification algorithms and NLP techniques are employed, and the accuracy of Resume Classification is measured using performance metrics such as overall accuracy, F-Score, Precision, and Recall. This study proposes an ML-based Resume Classifier with better accuracy and performance guarantees.

The resume is an official and formal document used mainly for presenting the brief profile of a job applicant. The resume contains information related to the education, skills, experience, achievements, and portfolio of a job applicant. The resume is often used as an effective tool to assess the overall suitability of an

aspirant for the desired job. Moreover, in response to job postings, applicants submit the Resume as a formal document for job application consideration. The employer receives hundreds of Resumes for only a few vacancies and finds it difficult to categorize them and match them to a suitable job vacancy. Thus, this study attempts to develop an efficient and accurate Resume Classification System to ease the job of employers.

The study proposes an automated Resume Classification System (RCS) using state-of-the-art Natural Language Processing (NLP) techniques for processing Resumes (plain-text documents submitted as a job application) and Machine Learning (ML) algorithms (classifiers) for classifying Resumes by available job category. The major contribution of the study lies in preprocessing the Resumes into a corpus and a vectorized representation, using NLP techniques suitable for the classification task carried out by ML-based classifiers. Moreover, the experimental evaluation of various feature extraction and representation schemes is a major contribution to the body of knowledge for the Resume Classification task. In addition, this study presents an experimental performance comparison of various ML-based classifiers for the Resume or plain-text classification task. This study can serve as a basic building block for developing an automated, robust, and reliable RCS that could be employed in real-time applicant shortlisting based on Resumes.

The rest of the paper is organized as follows. Section 2 presents the review of related studies. Section 3 describes the proposed methodology to accomplish the objectives. Section 4 presents and discusses the findings of the study. Section 5 presents the major limitations of the study and the proposed future work. Finally, Section 6 concludes the study.

2. RELATED WORK

In recent years, ML-based Text Classification (TC) techniques have been widely employed in various domains [10] such as Sentiment analysis [11, 12], E-Commerce portals [13, 14], Email classification [15], Human Resource Management [2], Banking and Exchange Stock Markets [16, 17], and bioinformatics [18, 19]. In this study, ML-based text classification techniques are employed in the Human Resource Management domain. Various NLP and ML classification techniques have been employed to predict the category of a Resume.

Several studies have proposed Machine Learning based systems for Human Resource Management and recruiting processes. For instance, the study [20] designed an approach for Resume ranking that uses a layered information retrieval framework to parse the resumes. The goal of that study was to help recruiters find the relevant job applicant for a job opening. Another study [21] designed a personalized approach for Resume-job matching that offers statistical similarity for ranking resumes against the available jobs. This study could have been generalized further for recruiters as well as for job seekers: employers can use such a system to find the relevant resumes, whereas job seekers can use it to search for the jobs that best match their resumes. A fuzzy-based model was used in [22] to evaluate the relevancy of a resume compared to the job description. All the above-mentioned studies address document similarity by comparing the resume to the job description. However, few studies employed Supervised Text Classification Techniques to predict the category of a Resume.

Perhaps the work most closely related to the proposed approach is [23]. In this work, NLP and ML techniques were employed to predict the domain of resumes. The study aimed to allocate the relevant project to recruits. It proposed the Named Entity Recognition (NER) approach coupled with various classification models such as Logistic Regression and K-Nearest Neighbors for the classification. Besides this, the study proposed an ensemble learning-based voting classifier that was retrained after a fixed interval, so that the number of votes for each classifier was modified. The experimental results revealed that the voting-based classifier produced 91.2% accuracy in predicting the categories, while the accuracy was 84.2% without retraining. Another related study is [24], in which the Convolutional Neural Network (CNN) was used to classify Resumes into 27 different job categories. In

this study, a CNN classifier was trained on word2Vec pre-trained representations to determine the category of a Resume. This approach achieved 40.15% accuracy on resume classification and 74.88% accuracy on the job classification task. However, the study only used job summary text for classification and considered only one baseline method, fastText, for performance comparison.

Hence, both of the aforementioned studies had some major limitations. They employed various classification techniques but did not evaluate different preprocessing techniques for the proposed classifiers, which may lead to low classifier accuracy. Further, only overall accuracy was used as an evaluation measure; evaluation metrics such as F-Score, Precision, and Recall were not used to evaluate the learning efficiency of the classifiers.

It is evident from the above-mentioned studies that the approaches used suffered mainly from two problems: lower accuracy and limited performance comparison. Besides this, very few ML models were employed for the Resume Classification task, and accuracy was the only measure used for performance. Moreover, feature extraction and representation techniques were not explored to overcome the low accuracy problem. To overcome the limitations of previously proposed studies, this study uses different NLP and Machine Learning techniques to improve the efficiency of classifiers, and various performance metrics are used for model evaluation. Also, various feature extraction and representation techniques are employed to obtain discriminative features that contribute to better classification. Further, this study provides the discriminative features to several machine learning models, and performance metrics such as Precision, Recall, and F-Score are used for performance measurement.

3. METHODOLOGY

This Section discusses in detail the proposed methodology for building an efficient and accurate Resume Classification System. To achieve the objective of Resume Classification, NLP and ML techniques are employed using the best practices. The overall methodology is divided into five stages as illustrated in Fig. 1: (i) Data Collection and Visualization, (ii) Preprocessing, (iii) Feature Engineering, (iv) Model Construction, and (v) Model Evaluation and testing in a real-time environment using a Graphical User Interface (GUI).

[Fig. 1 flowchart. Resume Dataset. Step 1, Preprocessing: data cleansing (remove special characters; remove URLs and emails; convert short forms to standard forms; remove stop words) and label encoding. Step 2, Feature Engineering: feature extraction and representation techniques (BoW technique; N-Gram model; TF-IDF vectorizer); select the best performing technique. Step 3, Model Construction: resume classification techniques (K Nearest Neighbors; Logistic Regression; SVM: Linear SVC, SVC, NuSVC, SGD; Naive Bayes: Bernoulli NB, Multinomial NB, Gaussian NB); select the best performing classifier. Step 4, Performance Evaluation: Accuracy, Recall, Precision, F-Score. Output: Trained Resume Classifier Model.]

Fig. 1: The proposed methodology for resume classification

3.1 Data Collection and Visualization

The Resumes with Job Categories dataset was collected from an online data repository. The dataset is in Comma Separated Values (CSV) file format and has three columns, namely ID, Category, and Resume's Text. The ID, Category, and Resume columns represent the index, the job category/field, and the content of the resume respectively. The dataset contains 962 parsed and labelled resumes in 25 different job categories.
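As a minimal sketch of how the class distribution of such a dataset could be inspected, the snippet below counts resumes per category and converts the counts to percentages. The column headers `ID`, `Category`, and `Resume` and the four sample rows are assumptions made for illustration; the real file holds 962 rows and its exact header names are not given in the text.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the CSV file described above (not real data).
SAMPLE_CSV = """ID,Category,Resume
0,Java Developer,Skills: Java Spring Hibernate
1,Java Developer,Core Java developer with REST experience
2,Testing,Manual and regression testing of web apps
3,Advocate,Practising advocate handling civil cases
"""

def class_distribution(csv_text):
    """Return {category: (count, percentage of all rows)} for the Category column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = Counter(row["Category"] for row in rows)
    total = sum(counts.values())
    return {cat: (n, round(100.0 * n / total, 1)) for cat, n in counts.items()}

distribution = class_distribution(SAMPLE_CSV)
# Print the categories from most to least frequent.
for category, (count, pct) in sorted(distribution.items(), key=lambda kv: -kv[1][0]):
    print(f"{category:15s} {count:3d} {pct:5.1f}%")
```

The same counts could feed a bar chart or pie chart of the kind the paper plots with Matplotlib.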
Fig. 2: Resume instances for each job category

The number of resume instances for each job category is illustrated in Fig. 2 as a visual representation, in Table 1 as a clear class distribution, and in Fig. 3 as the category-wise distribution (percentage) of resume instances, plotted using the Python Matplotlib library. The visual evidence in Fig. 2 shows that each job category has a different number of resume instances, which can lead to an imbalanced data problem.

Fig. 3: Category-wise total distribution of Resume instances

Table 1: Resume Instance for each Category

Category                     Instances
Java Developer               81
Testing                      70
Development Engineering      55
Python Developer             48
Web Designing                45
HR                           44
Hadoop                       42
Operation Manager            40
Sales                        40
Mechanical Engineer          40
Blockchain                   40
ETL Developer                40
Data Science                 40
Arts                         36
Database                     33
Electrical Engineering       30
PMO                          30
Health and Fitness           30
DotNet Developer             28
Business Analyst             28
Automation Testing           26
Network Security Engineer    25
Civil Engineer               24
SAP Developer                24
Advocate                     20

Moreover, two categories, namely Java Developer and Testing, have the highest number of resume instances and can be considered over-represented classes. Whereas, the resume instances for some
categories such as Advocate, Civil Engineer, and SAP Developer are relatively fewer than for other categories, for instance Java Developer. However, the category-wise total distribution in Fig. 3 shows that each category accounts for between 2.1% and 8.7% of the resume instances in the dataset. This illustration provides an intuition that, to some extent, the dataset has a class imbalance problem. However, the classification accuracy was not affected by this imbalanced class distribution, owing to the effective feature extraction methodology employed for the classification task.

3.2 Data Preprocessing

Data preprocessing involves steps to transform raw data into meaningful information for the Machine Learning task. In the case of textual data for text classification, these steps involve cleaning the raw text data, removing unnecessary or meaningless data, removing repetitive (redundant) data, removing missing (null) values, and transforming the data to a common scale. To preprocess the resume's textual data for the Resume Classification task, the following key steps were performed.

3.2.1 Data Cleansing

The dataset contains resumes parsed from different formats such as PDF, DOC, and DOCX into a CSV format. It has a lot of unnecessary and unprocessed data in the resume column. Thus, major efforts were required to preprocess the data and make it ready for Text Classification. In the data preprocessing step, the less informative text was cleaned using the Natural Language Tool Kit - NLTK [23] for stop words removal and Python 3.7.3 Regular Expressions. The following key tasks were performed for data preprocessing using a custom-written function in Python.

i) The textual content of resumes was converted to lowercase.
ii) The special characters, punctuation, brackets, URLs, Email addresses, mentions, hash tags, apostrophes, leading and trailing characters, extra white spaces, and Non-ASCII characters were removed from the Resume's text.
iii) Masking was applied to special escape sequences such as \n, \t, \a, \b, and so on.
iv) The numbers were masked.
v) The string fragmentations were masked.
vi) Word phrases in short form, such as I'll for I will, were converted to their full forms.
vii) Similar operations were performed on the unclean/unprocessed raw resume's text data.

3.2.2 Removal of the stop words

Stop words removal is one of the most essential steps in data preprocessing. Stop words such as 'is', 'each', 'and' and so on appear most often in any textual data. However, these most frequently occurring words in a text document are not informative features (tokens) for any classifier. Thus, these stop words should be removed from the corpus for the classification model. The stop words were removed from the resume's text column by performing the following steps in Python:

i) Word tokenization was performed on the resume's text using the NLTK library, and the tokens were stored in an array.
ii) The standard English language stop words were imported from the NLTK corpus and compared with each element in the tokenized array.
iii) If any element of the tokenized array was found in the list of NLTK stop words, that particular element (tokenized word) was removed.
iv) This process was repeated for all the tokens. The final array of tokenized elements did not contain any stop word.

To visualize the stop words removal process, the word cloud of the most frequently occurring words in the corpus of resumes was generated using the Python word cloud feature, as illustrated in Fig. 4.

Fig. 4: Word Cloud of most frequent words in the cleaned dataset
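The cleansing tasks listed in Section 3.2.1 could be sketched with Python regular expressions roughly as follows. This is an illustration under stated assumptions, not the authors' function: `clean_resume` and the `CONTRACTIONS` map are hypothetical names, "masking" is interpreted here as replacing the match with a space, and the paper does not list which short forms it expanded.

```python
import re

# Illustrative short-form map (assumption); the paper does not enumerate its list.
CONTRACTIONS = {"i'll": "i will", "can't": "cannot", "won't": "will not"}

def clean_resume(text):
    """Apply a simplified version of the cleansing steps to one resume string."""
    text = text.lower().replace("\u2019", "'")          # lowercase; normalize curly apostrophes
    for short, full in CONTRACTIONS.items():            # expand short forms before stripping apostrophes
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+", " ", text)                # e-mail addresses
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hash tags
    text = re.sub(r"[\n\t\a\b\r]", " ", text)           # escape sequences, replaced by a space
    text = re.sub(r"\d+", " ", text)                    # numbers, replaced by a space
    text = re.sub(r"[^a-z\s]", " ", text)               # punctuation, brackets, non-ASCII characters
    return re.sub(r"\s+", " ", text).strip()            # collapse extra white space

cleaned = clean_resume(
    "I'll build APIs!\nEmail: jane.doe@example.com, see https://example.com #python 2019"
)
print(cleaned)
```

The order of the substitutions matters in this sketch: URLs and e-mail addresses are removed before punctuation, otherwise their fragments would survive as ordinary words.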
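The four stop-word removal steps of Section 3.2.2 can be illustrated with a few lines of plain Python. To keep the sketch self-contained, whitespace splitting stands in for NLTK's word tokenizer and a small hand-written set stands in for NLTK's English stop-word list; both substitutions, and the name `remove_stop_words`, are assumptions for illustration only.

```python
# Small illustrative stop-word set (assumption); the study used NLTK's English list.
STOP_WORDS = {"is", "each", "and", "the", "a", "an", "of", "in", "to", "for", "with"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Tokenize the text and keep only the tokens that are not stop words."""
    tokens = text.split()                               # step i: tokenize into an array
    return [t for t in tokens if t not in stop_words]   # steps ii-iv: compare and drop

tokens = remove_stop_words(
    "experience in java and the spring framework is required for each project"
)
print(tokens)
```

With NLTK installed and its stop-word corpus downloaded, the hand-written set could be swapped for `set(nltk.corpus.stopwords.words("english"))`.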

The word cloud in Fig. 4 now contains more informative words rather than frequently occurring stop words, and these words are more meaningful for classifiers to learn.

3.2.3 Stemming and Lemmatization

Stemming and Lemmatization are known as Text Normalization, or sometimes Word Normalization, techniques in Natural Language Processing (NLP). The purpose of these techniques is to decrease word inflection in the corpus of classification text by mapping a group of words to the same root stem. Specifically, stemming and lemmatization strip affixes (prefixes and suffixes such as -es, -s, -ed, in-, un-, and -ing) that produce inflected or derived forms of a word. For instance, the stem (root) word for Plays, Playing, and Played is Play, so the stemming and lemmatization techniques would map these words in the corpus of classification text to that root (stem) word. Using such mappings, a sentence can be normalized by replacing each inflected word with its root.

The Natural Language Tool Kit (NLTK) library in Python offers implementations of stemming and lemmatization techniques with different settings. However, unlike the stemming offered by the NLTK library, lemmatization reduces the inflected words properly by ensuring that the root word belongs to the language. Thus, lemmatization was applied to the Resume's text corpus for text normalization, as Resumes are more formal documents and lemmatization ensures the proper word structure in the normalized text. In our implementation, lemmatization produced promising results for corpus tokenization and vectorization.

3.2.4 Label Encoding

The label encoding technique handles the categorical values of variables in the Machine Learning Model. It assigns a unique integer value to each value of a categorical variable. To make the raw text data ready for the machine learning model, label encoding was done to assign a numerical label to all categories, as shown in Fig.2. The Scikit-learn Label Encoder was used for this purpose. Hence, the label encoder was applied on the Category field of the data.

3.3 Feature Engineering

Feature engineering helps to extract, formulate, and represent the set of most discriminative (informative) features from the corpus of text for the classification task. After data cleaning and preprocessing, the resume's corpus has an informative set of words, as depicted in the word cloud in Fig. 4, which shows that the dataset does not contain stop words and other less informative words. The feature engineering process in Machine Learning mainly involves feature extraction and representation techniques for the classification task. Therefore, different feature extraction techniques for word and character vectorization, with varying ranges of hyperparameters, were compared as discussed in Section 3.3.1. For feature representation, different variants of Term-Frequency-Inverse-Document-Frequency (TF-IDF), a suitable document representation scheme for plain-text classification [26, 27], were evaluated as discussed in Section 3.3.2.

3.3.1 Feature Extraction and Master Feature Creation

After applying the preprocessing step on the data, the dataset contains the words that are important features for the classification. To demonstrate their significance, different variants of feature extraction, namely BoW, Word Vectorizer, and Character Vectorizer with varying ranges of n-grams, were evaluated. The proposed model yielded better accuracy with the Word Vectorizer implementation using the TF-IDF feature representation scheme.

Among the feature extraction techniques, word vectorization with different hyperparameter ranges, unigram (N=1), bigram (N=2), and n-gram (2-6) [25], was compared. Moreover, character vectorization with hyperparameter n-gram (2-6) was also evaluated. However, word vectorization with hyperparameter n-gram (2-6) and a max-features value of 1500 yielded better accuracy and improved training and computation time.

3.3.2 Feature Representation

This step aims to allocate an arithmetic value to each of the extracted features in the vector. Term-
Frequency-Inverse-Document-Frequency (TF-IDF) has been reported as a better performing feature representation scheme in various plain-text classification studies in the literature [19, 26, 28]. Therefore, TF-IDF [27] was used for representing the value of each extracted feature. TF-IDF is a numerical statistic intended to reflect the importance of a word to a document in a text corpus. The technique combines two quantities. TF is concerned with the occurrences of each word/feature and measures how frequently the word appears in each document. IDF, in turn, weights each word by how rare it is across the documents of the corpus. The objective of the TF-IDF feature representation is to weigh down the more frequent words while scaling up the rare words in the document.

Hence, the TF-IDF Vectorizer was implemented using the Python Scikit-Learn library. It is used to perform both feature extraction and feature representation for the task. To compare the performance of the most discriminative features, different values for the max-feature sub-set were tested. However, the accuracy of the classifiers decreased as the max-feature value was increased. For instance, max-feature values of 2000 and 1500 resulted in an accuracy of 95% and 97% respectively on SVM-SVC. Thus, it can be concluded that a larger max-feature sub-set did not contribute to better accuracy, so the max-feature value was set to 1500.

3.4 Resume Classifier Construction

The discriminative features extracted using the techniques described in the previous section were used to build the classifier to accurately classify the Resumes. Several Machine Learning classifiers were evaluated to select the best performing model for deployment and the Graphical User Interface (GUI). The details of the classifier construction are presented in the sections below.

3.4.1 Implementation details and experimental setup

After extracting features from the dataset, the data was divided into training and testing sets, with 70% for the train set and 30% for the test set. Nine different text classifiers were employed, as each has its own philosophy for classifying the instances. The "One-Vs-Rest-Classification" strategy for multiclass classification was used [26]. A brief description of the nine implemented machine learning models is as follows:

1. K Nearest Neighbors (KNN): KNN finds the k nearest data points to the new instance and assigns the label held by the majority of those neighbors. KNN is also known as a lazy learner classifier because it defers computation until a new instance has to be classified; here the distance between instances is measured with the simple Euclidean distance, equation (1) [26].

d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}    (1)

where p and q are two feature vectors of length n.

2. Multinomial Naïve Bayes (MNB): The Naïve Bayes classifier is based on conditional probability. The NB classifier finds the probability of a vector belonging to each class. It computes this probability for all the given instances and classifies them according to the conditional probability. It is based on a strong independence assumption between the features. MNB is a variant of Naïve Bayes that assumes multinomially distributed features [27].
3. Bernoulli Naïve Bayes (BNB): It is also a variant of Naïve Bayes, one that accepts binary features only. BNB is also effective for classification tasks [28].
4. Gaussian Naïve Bayes (GNB): It is also a variant of NB, one that supports continuous-valued features that are assumed to follow a Gaussian distribution. To implement GNB, the vectorized feature representation was used [29].
5. Logistic Regression (LR): Logistic Regression applies the logistic function to the classification task and assigns a class using a threshold value. LR is considered one of the easiest implementations for classification problems [30].
6. Linear Support Vector Classifier (SVC): It is based on finding the best separating line between two classes. It is the simplest form of Support Vector Machine and finds the linear hyperplane between two classes. However, it will not give good results if the data is not linearly separable. Linear SVM is also known as the least square Support Vector Machine classifier [31].
7. Support Vector Classifier (SVC): SVC overcomes the above-mentioned issue of Linear SVM by using the kernel concept [32], which works well on data that is not linearly separable.
8. Nu-Support Vector Classifier (NuSVC): It is similar to SVC, but it also uses a parameter to control the number of support vectors.
9. Stochastic Gradient Descent (SGD): It uses SGD for training (that is, it looks for the minima of the loss using SGD).

The extracted features and learned ML models were stored on disk in external Python pkl files for future evaluation and testing. The scikit-learn externals joblib library was used to store the extracted feature representation and the learned models, which were later loaded in the GUI for real-time testing.

3.4.2 Graphical User Interface and System Evaluation in a Real Time Environment

To evaluate the trained ML models in real-time settings on unseen data, a Graphical User Interface (GUI) was designed using Python Tkinter (Fig. 5). The extracted features and learned ML models are imported for use in the GUI. The GUI allows users to provide a resume in text format or to select a resume text from an unseen test dataset. The GUI also lets users select from the nine learned ML models for classification of the resume. This implementation ensures transparency and real-time analysis of Resume Classification on the nine learned models. The designed GUI would also be helpful for implementing Machine Learning models in a real-time environment and for recruiters tackling the tedious task of classifying Resumes into different job categories.

Fig. 5: Graphical User Interface of the proposed system

3.5 Evaluation Metrics

To measure the performance of the classification models, we used several performance evaluation metrics. As the dataset was imbalanced (shown in Fig. 2 and 3), overall accuracy alone was not a sufficient metric for model evaluation. Therefore, overall accuracy, precision, recall, and F-score were used. A brief description of these metrics follows.

I. Overall Accuracy: Accuracy is the fraction of predictions that the algorithm identifies correctly. However, accuracy alone does not tell the full story when working with imbalanced data.

II. Macro Precision: Precision answers the question: of all the positive predictions, what fraction is actually positive? The value of precision is between 0 and 1. Any model that does not produce false positive results has a precision of 1. It indicates how precisely the model identifies the true positive values of the classes. In a multiclass classification problem, precision is computed for each class and the results are then averaged. Macro-averaging computes the metric independently for each class and then takes the average. The mathematical definition of Precision is as follows:

$$\text{Precision} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i}$$

where $N$ is the number of classes, and $TP_i$ and $FP_i$ are the numbers of true positive and false positive predictions for class $i$.

III. Macro Recall: Recall answers the question: of all the actual positive records, what fraction is correctly identified? The value of recall is between 0 and 1. Any model that does not produce false negative results has a recall of 1. In a multiclass classification problem, recall is computed for each class and the results are then averaged. Macro-averaging computes the metric independently for each class and then takes the average. The mathematical definition of Recall is as follows:

$$\text{Recall} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}$$

where $FN_i$ is the number of false negative predictions for class $i$.

IV. Macro F-Score: The F-measure is defined as the weighted harmonic mean of precision and recall. Its values lie between 0 and 1, where the highest value, 1, indicates that the algorithm achieves the best precision and recall.

4. RESULTS AND DISCUSSION

Table 2 presents the Precision, Recall, F-Score, and overall accuracy of all the trained models on test data. The performance of the trained models varies noticeably. The Support Vector Machine class of learning algorithms performs better than the other classifiers. Across all 318 analyses on test data instances, the Linear Support Vector Classifier outperforms the other eight classifiers with nearly 98% overall accuracy and 1.0 precision. It can be generalized that, for the Resume Text Classification task, the SVM class of classifiers performs best.

Table 2 summarizes the Precision, Recall, F-Score, and overall Accuracy of the classifiers on testing data. The results show that most of the algorithms produced excellent results on the study data. This can be explained by the fact that the dataset size was optimal and that the best NLP and ML techniques were employed to achieve significantly better results. It is also shown that LSVC, SGD, LR, and SVC produced exceptionally good results. Thus, the LSVC classifier is the best performing classifier.

Fig. 6 illustrates the overall accuracy and misclassification report of the classifiers. It can also be seen that Bernoulli Naïve Bayes (BNB) did not perform as well as the other classifiers, while Multinomial Naïve Bayes (MNB) performed well on the dataset. The misclassification of BNB is high compared to all other classifiers. One of the reasons for this misclassification is that the Bernoulli classifier is mainly used for binary classification and treats all values as the negative class, whereas Resume Classification is a multi-class problem. Most of the models produced approximately similar results, except BNB.

Table 2: Performance Evaluation of learned ML Models


Classifier   Precision   Recall   F-Score   Overall Accuracy (%)   Misclassification (%)
LSVC         1.00        1.00     1.00      99.6                   0.4
SGD          1.00        1.00     1.00      99.6                   0.4
LR           1.00        0.99     0.99      99.3                   0.7
SVC          1.00        0.99     0.99      99.3                   0.7
NuSVC        0.99        0.99     0.99      99.3                   0.7
KNN          0.99        0.98     0.99      97.2                   2.8
GNB          0.98        0.96     0.96      96.5                   3.5
MNB          0.98        0.95     0.96      94.8                   5.2
BNB          0.89        0.76     0.79      79.2                   20.8
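
As a rough illustration of how figures like those in Table 2 could be obtained, the sketch below builds TF-IDF features with scikit-learn, applies a 70/30 split, trains a Linear SVC, and reports macro-averaged precision, recall, and F-score together with overall accuracy and the misclassification rate (100 minus accuracy). The toy corpus, the category names, and all variable names are hypothetical stand-ins for the study's resume dataset; stratified splitting, the fixed random seed, and default hyperparameters are assumptions, so the numbers it prints will not match the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the resume dataset (not the study data).
templates = {
    "Data Science": "python machine learning statistics pandas model {}",
    "Web Designing": "html css javascript layout responsive design {}",
    "HR": "recruitment payroll onboarding employee relations {}",
}
texts, labels = [], []
for category, template in templates.items():
    for i in range(10):
        texts.append(template.format(f"project{i}"))
        labels.append(category)

# TF-IDF features capped at 1500 terms (the max-feature value chosen in the text).
vectorizer = TfidfVectorizer(max_features=1500)
X = vectorizer.fit_transform(texts)

# 70/30 split applied after feature extraction, following the order described
# in Section 3.4.1. Stratification and the seed are assumptions of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.30, stratify=labels, random_state=0
)

clf = LinearSVC()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Overall accuracy and macro-averaged precision, recall, and F-score.
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f_score, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0
)
misclassification_pct = 100.0 * (1.0 - accuracy)

# Train accuracy, for a train-versus-test comparison.
train_accuracy = accuracy_score(y_train, clf.predict(X_train))

print(f"precision={precision:.2f} recall={recall:.2f} f_score={f_score:.2f}")
print(f"accuracy={100.0 * accuracy:.1f}% misclassification={misclassification_pct:.1f}%")
print(f"train accuracy={100.0 * train_accuracy:.1f}%")
```

Fitting the vectorizer on the full corpus before splitting mirrors the order given in the text. Fitting it on the training portion only is a common alternative that keeps test-set term statistics out of the features.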

The overall misclassification is relatively low; thus, it can be inferred that the features extracted using TF-IDF were the most discriminative for the Resume Classification task. Moreover, the GNB and BNB models require a vectorized representation of features, and this could be a reason for their slightly poorer performance.

Fig. 6: Overall Accuracy vs Misclassification Report

Fig. 7 illustrates the Precision, Recall, and F-Score of all the models. There is only a minor difference between the Precision, Recall, and F-Score. This was not the case when unprocessed data was used. The same performance metrics were measured on raw data and the results were not encouraging. Hence, our designed methodology extracted the most discriminative features from the dataset. That is the reason why most of the classifiers yielded the best performance.

Fig. 7: Precision, Recall, F-Score - Performance Metrics

Fig. 8 illustrates the train versus test accuracy of the nine classifiers used. The overall dataset was divided into 70% and 30% for training and testing, respectively. Machine Learning models often suffer from overfitting and underfitting problems.

The overfitting problem occurs when the learned ML model performs best on training data and yields better accuracy there, but fails to perform well on the test or unseen data [33]. Overfitting yields higher train accuracy and lower test accuracy. The underfitting problem, in contrast, occurs when the model fails to perform well on either test or train data. Underfitting yields slightly lower accuracies for both train and test data.

It is evident from Fig. 8 that the proposed models in this study are neither overfitting nor underfitting the train or test data. The trained models perform equally well on training and test data. It can be inferred that the overall process of Natural Language Processing (NLP) and Machine Learning (ML) techniques was employed efficiently to yield balanced and better performance on test and train data.

Fig. 8: Train vs Test Accuracy

Fig. 9 illustrates the normalized confusion matrix of actual versus predicted class categories for the best performing Support Vector Machine (SVM) classifier, Linear SVC. The classifier yielded over 98% true class prediction accuracy, as depicted in the normalized confusion matrix in Fig. 9. The predicted values for the confusion matrix are automatically normalized by the plot confusion matrix function of sklearn metrics in the Python implementation. The results shown in Fig. 9 make it evident that the employed NLP techniques and the well-trained classification model for Resume categorization predicted the categories correctly.

Fig. 9: Normalized Confusion Matrix of Actual vs Predicted class labels

5. LIMITATIONS AND FUTURE WORK

The major limitation and challenge for the Resume Classification and Recommendation task is finding an appropriate and standard dataset to process using NLP techniques and to train the ML models. Since the resume is not a standard document and there is no specific industry standard, major effort was put into processing the documents in the dataset, which were parsed from different formats and layouts. Moreover, the dataset size was somewhat small to train the ML model for generalized classification. However, efforts were made to find a more suitable dataset for the classification task. The study achieved significant accuracy and performance gain on Resume Classification in different job categories. Therefore, in future work, the model will be extended to match the content of the resume with the provided job description. This extension will make the proposed system suitable for the complete recruiting process. The proposed system will then perform the most tedious tasks of the recruiting process: categorization and recommendation of suitable resumes for a given job description.

6. CONCLUSION

Resume classification is a time-consuming, costly, and tedious job for an organization. In this regard, this study proposes an automated approach that uses various machine learning and NLP techniques for the classification of Resumes. The proposed methodology used several NLP and ML techniques for data preprocessing, feature extraction and representation, model construction, and evaluation for the Resume Classification task. The study results suggested that the TF-IDF vectorizer performed best in feature extraction and representation, as the extracted features yielded excellent results on almost all classifiers. In particular, the Support Vector Machine (SVM) class algorithms (Linear SVC, SVC, NuSVC, and SGD) performed exceptionally well, with over 98% and 96% accuracy on the train and unseen test data, respectively. The study results are quite encouraging for automating job application categorization and recommendation based on the content of the Resumes. The developed system can be deployed in real-time settings for an employer to automate the recruiting process.

REFERENCES

1. Koyande B.A., Walke R.S., Jondhale M.G., “Predictive Human Resource Candidate Ranking System”, International Journal of Research in Engineering, Science and Management, Vol. 3, No. 1, January 2020.
2. Al-Otaibi S.T., Ykhlef M., “A survey of job recommender systems”, International Journal of Physical Sciences, Vol. 7, No. 29, pp. 5127-5142, 2012.
3. Färber F., Weitzel T., Keim T., “An automated recommendation approach to selection in
personnel recruitment”, Proceedings of the 9th Americas Conference on Information Systems (AMCIS), Association of Information Systems Electronic Library, pp. 2329-2339, 2003.
4. Breaugh J.A., “The use of biodata for employee selection: Past research and future directions”, Human Resource Management Review, Vol. 19, No. 3, pp. 219-231, 2009.
5. Lin Y., Lei H., Addo P.C., Li X., “Machine learned resume-job matching solution”, arXiv preprint arXiv:1607.07657, 2016.
6. Yi X., Allan J., Croft W.B., “Matching resumes and jobs based on relevance models”, Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 809-810, Amsterdam, The Netherlands, July 2007.
7. Sebastiani F., “Machine learning in automated text categorization”, ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47, 2002.
8. Nigam K., McCallum A.K., Thrun S., Mitchell T., “Text classification from labeled and unlabeled documents using EM”, Machine Learning, Vol. 39, No. 2-3, pp. 103-134, 2000.
9. Uysal A.K., Gunal S., “The impact of preprocessing on text classification”, Information Processing and Management, Vol. 50, No. 1, pp. 104-112, 2014.
10. Otter D.W., Medina J.R., Kalita J.K., “A Survey of the Usages of Deep Learning for Natural Language Processing”, IEEE Transactions on Neural Networks and Learning Systems, Vol. 32, No. 2, pp. 604-624, February 2021.
11. Parkhe V., Biswas B., “Sentiment analysis of movie reviews: finding most important movie aspects using driving factors”, Soft Computing, Vol. 20, No. 9, pp. 3373-3379, 2016.
12. Bakshi R.K., Kaur N., Kaur R., Kaur G., “Opinion mining and sentiment analysis”, Proceedings of the 3rd International Conference on Computing for Sustainable Global Development (INDIACom), pp. 452-455, New Delhi, India, 16-18 March 2016.
13. Sivapalan S., Sadeghian A., Rahnama H., Madni A.M., “Recommender systems in e-commerce”, Proceedings of the World Automation Congress (WAC), pp. 179-184, Kona, Hawaii, 2014.
14. Srifi M., Oussous A., Lahcen A.A., Mouline S., “Recommender Systems Based on Collaborative Filtering Using Review Texts—A Survey”, Information, Vol. 11, No. 6, pp. 317, 2020.
15. Mujtaba G., Shuib L., Raj R.G., Majeed N., Al-Garadi M.A., “Email classification research trends: review and open issues”, IEEE Access, Vol. 5, pp. 9044-9064, 2017.
16. Theofilatos K., Likothanassis S., Karathanasopoulos A., “Modeling and trading the EUR/USD exchange rate using machine learning techniques”, Engineering, Technology and Applied Science Research, Vol. 2, No. 5, pp. 269-272, 2012.
17. Rahman A., Khan M.N.A., “A Classification Based Model to Assess Customer Behavior in Banking Sector”, Engineering, Technology and Applied Science Research, Vol. 8, No. 3, pp. 2949-2953, 2018.
18. Al-Garadi M.A., Khan M.S., Varathan K.D., Mujtaba G., Al-Kabsi A.M., “Using online social networks to track a pandemic: A systematic review”, Journal of Biomedical Informatics, Vol. 62, pp. 1-11, 2016.
19. Mujtaba G., Shuib L., Raj R.G., Rajandram R., Shaikh K., “Automatic text classification of ICD-10 related CoD from complex and free text forensic autopsy reports”, Proceedings of the 15th IEEE International Conference on Machine Learning and Applications (ICMLA), Anaheim, C.A., U.S.A., pp. 1055-1058, 18-20 December 2016.
20. Gonzalez T., Santos P., Orozco F., Alcaraz M., Zaldivar V., De Obeso A., Garcia A., “Adaptive Employee Profile Classification for Resource Planning Tool”, Proceedings of the Annual SRII Global Conference, pp. 544-553, San Jose, C.A., U.S.A., 24-27 July 2012.
21. Guo S., Alamudun F., Hammond T., “RésuMatcher: A personalized résumé-job matching system”, Expert Systems with Applications, Vol. 60, pp. 169-182, 2016.
22. Golec A., Kahya E., “A fuzzy model for competency-based employee evaluation and selection”, Computers and Industrial Engineering, Vol. 52, No. 1, pp. 143-161, 2007.
23. Gopalakrishna S.T., Vijayraghavan V.,

“Automated Tool for Resume Classification Using Sementic Analysis”, International Journal of Artificial Intelligence and Applications, Vol. 10, No. 1, 2019.
24. Sayfullina L., Malmi E., Liao Y., Jung A., “Domain adaptation for resume classification using convolutional neural networks”, Proceedings of the International Conference on Analysis of Images, Social Networks and Texts, pp. 82-93, Springer, 2017.
25. Ramos J., “Using TF-IDF to determine word relevance in document queries”, Proceedings of the First Instructional Conference on Machine Learning, New Jersey, U.S.A., 2003.
26. Xu J., “An extended one-versus-rest support vector machine for multi-label classification”, Neurocomputing, Vol. 74, No. 17, pp. 3114-3124, 2011.
27. Loper E., Bird S., “NLTK: the natural language toolkit”, arXiv preprint cs/0205028, 2002.
28. Kibriya A.M., Frank E., Pfahringer B., Holmes G., “Multinomial naive bayes for text categorization revisited”, in Australasian Joint Conference on Artificial Intelligence, Springer, 2004.
29. McCallum A., Nigam K., “A comparison of event models for naive bayes text classification”, Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, pp. 41-48, 1998.
30. Raschka S., “Naive bayes and text classification introduction and theory”, arXiv preprint arXiv:1410.5329, 2014.
31. Xu S., “Bayesian Naïve Bayes classifiers to text classification”, Journal of Information Science, Vol. 44, No. 1, pp. 48-59, 2018.
32. Schölkopf B., Smola A.J., Bach F., “Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond”, MIT Press, 2002.
33. Suykens J.A., Vandewalle J., “Least Squares Support Vector Machine Classifiers”, Neural Processing Letters, Vol. 9, No. 3, pp. 293-300, 1999.
