Group 1 Report-2
Project Report
on
BACHELOR OF TECHNOLOGY
DEGREE
SESSION 2023-24
in
We hereby declare that this submission is our own work and that, to the best of our
knowledge and belief, it contains no material previously published or written by another
person nor material which to a substantial extent has been accepted for the award of any
other degree or diploma of the university or other institute of higher learning, except where
due acknowledgment has been made in the text.
Signature Signature
Date: Date:
CERTIFICATE
(Designation) (Professor)
Date:
ACKNOWLEDGEMENT
It gives us a great sense of pleasure to present the report of the B.Tech project undertaken
during the B.Tech final year. We owe a special debt of gratitude to Prof. Neha Yadav,
Department of Computer Science & Engineering, KIET, Ghaziabad, for her constant support
and guidance throughout the course of our work. Her sincerity, thoroughness and
perseverance have been a constant source of inspiration for us. It is only through her cognizant
efforts that our endeavors have seen the light of day.
We also take the opportunity to acknowledge the contribution of Dr. Vineet Sharma, Head
of the Department of Computer Science & Engineering, KIET, Ghaziabad, for his full
support and assistance during the development of the project. We also take this opportunity
to acknowledge the contribution of all the faculty members of the department for their kind
assistance and cooperation during the development of our project. Last but not least, we
acknowledge our friends for their contribution to the completion of the project.
Date: Date:
Signature: Signature:
ABSTRACT
TABLE OF CONTENTS
Page No.
DECLARATION……………………………………………………………………. ii
CERTIFICATE……………………………………………………………………… iii
ACKNOWLEDGEMENTS…………………………………………………………. iv
ABSTRACT………………………………………………………………………..... v
LIST OF FIGURES…………………………………………………………………. viii
LIST OF TABLES…………………………………………………………………… ix
LIST OF ABBREVIATIONS………………………………………………………. x
CHAPTER 1 (INTRODUCTION)…………………………………………………. 1
1.1. Introduction……………………………………………………………………... 1
1.2. Project Description……………………………………………………………… 2
REFERENCES…………………………………………………………………….... 38
APPENDIX…………................................................................................................. 40
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
In the current digital era, the process of recruitment has evolved beyond the traditional
methods of evaluating candidates purely on qualifications and face-to-face interviews.
Today, understanding the personality traits of candidates is increasingly recognized as a
critical factor in ensuring optimal job fit and organizational harmony. One such innovation
is the Personality Prediction System through CV Analysis, a tool designed to leverage
machine learning (ML) and natural language processing (NLP) for the analysis of resumes
or CVs.
The Personality Prediction System is built on a Flask-based web interface that allows users
to upload resumes in various file formats, which are then processed to extract textual data.
The goal of this work is to use the Big Five model and machine learning algorithms to
determine an individual's personality. A person's personality has a large impact on both their
personal and professional life. These days, many companies have also begun to shortlist
applicants based on their personality, as this boosts productivity: the individual is doing what
they are good at rather than what they are compelled to do.
The OCEAN model, also known as the Big Five model or the Five-Factor Model (FFM),
was established in the early 1980s on the basis of several psychological studies. When
statistical analysis is applied to personality survey data, a small set of terms emerges to
characterize an individual, and these terms provide an accurate description of the person's
overall personality or character. The word personality comes from the Latin persona, meaning
a characterization of a person's actions or disposition; the meaning of personality is thus
reflected in the distinct attitude that sets one individual apart from others.
"The dynamic organization within the individual of those psychological systems that
determine his characteristic behavior and thought" is how Hall and Lindzey define
personality. Personality thus determines the distinct manner in which an individual adjusts to
their surroundings. A person's personality is characterized by their sense of self, which
shapes their behavior in a distinctive and dynamic way. This behavior might alter as a result
of experience, education, learning, and so on.
This viewpoint echoes Setiadi's theory, which holds that personality is the dynamic
systemic organization that specifically dictates how an individual adapts to their
surroundings. The project mines user characteristic data and looks for patterns using learning
algorithms and sophisticated data mining methods. Large volumes of behavioral personal
data are accessible in certain domains; by applying automated personality prediction and
categorization, this data can be used to classify individuals.
Five characteristics of individuals, commonly known as the Big Five characteristics,
namely openness, neuroticism, conscientiousness, agreeableness, and extraversion, are
stored in a dataset and used for training. Based on this training, the personality of individuals
is predicted using data mining concepts. Before testing, the dataset is pre-processed using
data mining techniques such as handling missing values and data normalization. This
pre-processed data can then be used to classify and predict user personality based on past
classifications: the system analyses user characteristics and behaviors, then predicts a new
user's personality from the personality data stored by classifying previous users' data. The
model used to predict the test dataset is the Random Forest Classifier, because it is an
effective model for predicting output class labels for dependent categorical data.
The system streamlines the recruitment process. It is capable of processing multiple resume
formats and offers functionalities such as text extraction, personality trait prediction, and
detailed AI-generated trait descriptions. The Personality Prediction System through CV
Analysis uses a blend of web development technologies and artificial intelligence to create an
innovative tool for recruitment. Here is a detailed explanation of its components and
functionalities:
The system utilizes:
Frontend Technologies:
HTML: Serves as the backbone of the webpage, structuring the content of the web interface
where users can upload resumes.
CSS: Styles the webpages, making the interface visually appealing and easy to navigate.
JavaScript: Enhances interactivity, handling events like resume uploads and displaying the
results of the personality trait predictions.
The frontend acts as the point of interaction for users. It includes forms for uploading
resumes and panels or dashboards where the results (personality traits and descriptions) are
displayed after analysis.
Python with Flask: Flask is a lightweight web framework for Python, chosen for its
simplicity and effectiveness in setting up a web server quickly. Python's extensive
library ecosystem and its prowess in data handling and machine learning make it ideal for
the backend.
The backend handles the processing logic: receiving uploaded files, managing data flow,
storing results temporarily, and interfacing with machine learning models for personality
prediction.
Pandas: Used for handling and manipulating structured data. In this context, it could be used
to manage and analyze data extracted from resumes.
NLTK: Stands for Natural Language Toolkit, and it provides tools for building Python
programs to work with human language data. It could be used for text processing and feature
extraction from resumes.
PyPDF2: A library for PDF file manipulation, allowing the system to read and extract text
from resumes in PDF format.
These libraries provide essential tools that facilitate various operations from data
manipulation to complex text processing, which are critical in processing resumes and
extracting meaningful information for further analysis.
PyCharm is used to write, test, and debug the code that makes up the backend and helps in
integrating the frontend components.
Key Features:
1. Resume Processing:
The system extracts text from uploaded resumes, which can be in various formats such as
PDFs or Word documents.
It then uses machine learning models to analyze the text and predict personality traits based
on the content, such as expressions, skills listed, and the general tone of the resume.
2. Web Interface:
3. AI Descriptions:
Utilizes advanced generative AI, possibly leveraging models like those from Google, to
generate detailed descriptions of the predicted personality traits. This can provide deeper
insights and explanations that can help employers understand the implications of these traits
in a professional setting.
*******
CHAPTER 2
LITERATURE REVIEW
Ayub Zubeda et al. [6] worked on a design to rank CVs using Natural Language Processing
and Machine Learning. The system ranks CVs in any format according to the company's
criteria. The authors also propose considering a candidate's GitHub and LinkedIn profiles
to get a better understanding, making it easier for the company to find a suitable match
based on skill sets, capability and, most importantly, personality.
Liden et al. published The General Factor of Personality, in which the interrelations among
the Big Five personality factors (Openness, Conscientiousness, Extraversion, Agreeableness,
and Neuroticism) were analyzed to test for the existence of a General Factor of Personality
(GFP). The meta-analysis provides evidence for a GFP at the highest hierarchical level, and
the paper concludes that the GFP has a substantive component, as it is related to
supervisor-rated job performance. However, the authors also note that the existence of a GFP
does not mean that personality factors lower in the hierarchy lose their relevance.
The studies reviewed above establish a robust foundation for the use of ML and NLP in
personality prediction from textual data. The integration of these technologies into a
user-friendly web platform, as in the Personality Prediction System, aligns with
contemporary research and addresses both the practical and ethical complexities of modern
recruitment technologies. This review not only validates the approach taken in this project
but also highlights the innovative potential of combining these technologies for enhanced
recruitment processes.
Based on the literature review provided, several key insights and recommendations can be
drawn to enhance the Personality Prediction System through CV Analysis:
Incorporate Personality Assessment in Recruitment Process:
The studies highlighted the significance of personality assessment in predicting job success
and organizational fit. Therefore, integrating personality analysis into the recruitment
process can provide valuable insights for employers in selecting candidates who align with
the company culture and job requirements.
Utilize Machine Learning for Personality Prediction:
Leveraging machine learning algorithms, as proposed by Suraj Mali [3] and Ayub Zubeda et
al [6], can enhance the accuracy and efficiency of personality prediction. By training models
on large datasets of resumes and corresponding personality assessments, the system can learn
to identify patterns and correlations between resume content and personality traits.
Consider Multiple Data Sources:
As suggested by Kalghatgi et al. [4], considering additional data sources beyond resumes,
such as social media activity (e.g., Twitter, LinkedIn) and online profiles (e.g., GitHub), can
provide a more comprehensive understanding of candidates' personalities and skills.
Integrating Natural Language Processing (NLP) techniques to analyze textual data from
these sources can further enrich the personality assessment process.
Customize Assessment Criteria:
Providing flexibility for administrators, as proposed by Suraj Mali [3], to customize aptitude
and personality test questions based on organizational requirements can enhance the
relevance and effectiveness of the assessment process. This customization allows
organizations to tailor the assessment criteria to specific job roles and company culture.
Implement Fair and Transparent Ranking Mechanisms:
Allan Robey et al. [5] emphasized the importance of fairness and legality in the shortlisting
process. Implementing transparent ranking mechanisms ensures that candidates are
evaluated objectively based on merit and suitability for the role. Additionally, conducting
aptitude and personality tests alongside CV analysis can provide a more holistic evaluation
of candidates' capabilities and traits.
Continuous Model Improvement and Evaluation:
It's crucial to continuously update and refine the machine learning models used for
personality prediction based on feedback and performance evaluation. Regularly evaluating
model accuracy, bias, and generalization capabilities helps maintain the system's
effectiveness and reliability over time.
Considerations for General Factor of Personality (GFP):
Liden et al.'s study [7] highlights the existence of a General Factor of Personality (GFP) and
its relevance to job performance. While incorporating the Big Five personality factors
(Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) into the
assessment, it's essential to recognize the interrelations among these factors and their
implications for job-related outcomes.
In summary, the Personality Prediction System should aim to integrate machine learning
techniques, consider multiple data sources, customize assessment criteria, ensure fairness
and transparency in ranking mechanisms, and continuously evaluate and refine models to
enhance its effectiveness in predicting candidates' personalities and job suitability.
*******
CHAPTER 3
PROPOSED METHODOLOGY
The proposed methodology for the Personality Prediction System through CV Analysis
outlines a systematic approach integrating machine learning (ML), natural language
processing (NLP), and web development technologies to analyze resumes and predict
personality traits. This chapter details each component of the methodology, including data
collection, preprocessing, model development, and deployment.
Resume Processing and Trait Assignment:
1. Text Extraction:
Utilizes Python libraries such as PyPDF2, textract, and docx to extract textual data from
resumes regardless of their format (PDF or DOCX).
Handles potential exceptions, such as PDFs encrypted with passwords, using appropriate
error handling mechanisms.
2. Text Preprocessing:
Preprocesses the extracted text by removing punctuation, tokenizing it into words, and
lemmatizing each word to its base form using NLTK (Natural Language Toolkit).
Removes stopwords (commonly occurring words like "the", "is", "and") to focus on
meaningful content.
3. Trait Assignment:
Assigns personality traits to candidates based on the extracted skills and predefined
associations stored in 'traits.txt'.
Matches extracted skills with traits and assigns relevant traits to each candidate based on
their skill set.
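Steps 1-3 above can be sketched in miniature. To keep the example self-contained, a tiny hand-rolled stopword list stands in for NLTK's, and the `SKILL_TRAITS` mapping is a hypothetical stand-in for the associations stored in 'traits.txt':

```python
# Simplified sketch of the resume pipeline: clean the extracted text,
# then map recognized skills to personality traits.
import string

# Minimal stand-in for NLTK's stopword list.
STOPWORDS = {"the", "is", "and", "a", "an", "in", "with", "of"}

# Hypothetical contents of traits.txt: skill -> associated trait.
SKILL_TRAITS = {
    "python": "openness",
    "teamwork": "agreeableness",
    "leadership": "extraversion",
    "testing": "conscientiousness",
}

def preprocess(text):
    """Lower-case, strip punctuation, tokenize, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

def assign_traits(tokens):
    """Return the traits whose associated skills appear in the text."""
    return sorted({SKILL_TRAITS[t] for t in tokens if t in SKILL_TRAITS})

tokens = preprocess("Skilled in Python, testing, and teamwork.")
traits = assign_traits(tokens)
print(traits)  # ['agreeableness', 'conscientiousness', 'openness']
```

The real system additionally lemmatizes each token with NLTK before matching, which normalizes inflected forms ("testing" vs "tested") to one key.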
5. Resume Upload and Analysis:
Allows users to upload resumes through a file input field on the web interface.
Handles file uploads securely, processing each uploaded resume immediately upon
submission.
3.1 TECHNIQUES USED
1. Feature Engineering:
• Objective: Enhance predictive model performance by creating new features from raw data.
• Application: Extracting relevant features from resume text, such as word frequency,
sentence length, and syntactic patterns, to improve personality trait prediction accuracy.
• Methods Used:
• Tokenization: Breaking text into tokens (words or sentences) for analysis.
• Stemming and Lemmatization: Reducing words to their base form to normalize text.
• Part-of-Speech (POS) Tagging: Identifying the grammatical components of words in
context.
• Application: Preprocessing resume text to prepare it for feature extraction and model
training.
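The feature ideas listed above (word frequency, sentence length) can be computed directly. NLTK offers richer tokenizers and POS taggers; plain string operations are used here only to keep the sketch dependency-free:

```python
# Sketch of simple text features for a resume: sentence count,
# average sentence length in words, and word frequencies.
from collections import Counter

def extract_features(text):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    words = text.lower().replace(".", " ").split()
    return {
        "num_sentences": len(sentences),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "word_freq": Counter(words),
    }

feats = extract_features("Led a team of five. Shipped two products. Loves testing.")
print(feats["num_sentences"], round(feats["avg_sentence_len"], 2))
```

Each resume thus becomes a numeric feature vector that can be fed to the classifier alongside the skill-derived traits.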
1. PyPDF2
Purpose: PyPDF2 is a Python library for working with PDF files. It allows users to read,
merge, split, crop, and extract text and metadata from PDF documents.
Features:
Text Extraction: PyPDF2 enables users to extract text content from PDF files, making it
accessible for further processing or analysis.
Document Manipulation: Users can perform various operations on PDF documents, such as
merging multiple PDFs into one, splitting a PDF into multiple documents, or extracting
specific pages.
Metadata Access: PyPDF2 allows access to metadata information stored within PDF files,
including author, title, subject, and creation/modification dates.
Use Cases: PyPDF2 is commonly used in applications requiring PDF manipulation and text
extraction, such as document management systems, data extraction pipelines, and text
analysis tools.
2. textract
Purpose: textract is a Python library for extracting text from various document formats,
including PDFs, Microsoft Office documents (e.g., Word, PowerPoint), and other common
formats like EPUB, RTF, and HTML.
Features:
Document Parsing: textract supports parsing text from a wide range of document formats,
making it versatile for extracting content from different sources.
Pluggable Architecture: The library utilizes external command-line utilities (e.g., pdftotext,
antiword) to extract text from specific file types, providing robust support for different
formats.
Simple API: textract offers a straightforward API for extracting text, abstracting away the
complexities of interacting with different file formats and external dependencies.
Use Cases: textract is commonly used in applications requiring text extraction from diverse
document formats, such as content indexing, information retrieval, and data analysis
pipelines.
3. docx (Python-docx)
Purpose: The docx library, also known as Python-docx, is a Python library for creating,
modifying, and extracting text from Microsoft Word (.docx) documents.
Features:
Document Manipulation: docx allows users to create new Word documents, modify existing
documents, and extract text content from .docx files.
Text Formatting: Users can apply various text formatting options (e.g., font styles, colors,
alignment) to document content programmatically.
Table Support: docx supports working with tables in Word documents, enabling users to
create, modify, and extract tabular data.
Use Cases: docx is commonly used in applications requiring interaction with Word
documents, such as document generation, report automation, and content extraction.
4. PyCharm
Purpose: PyCharm is an Integrated Development Environment (IDE) specifically designed
for Python development. It provides a comprehensive set of features for writing, debugging,
and deploying Python applications.
Features:
Code Editor: PyCharm offers a powerful code editor with syntax highlighting, code
completion, and intelligent code analysis features, enhancing productivity and code quality.
Debugger: The built-in debugger allows users to step through code, set breakpoints, and
inspect variables, making it easier to identify and resolve issues in Python code.
Version Control Integration: PyCharm seamlessly integrates with version control systems
like Git, enabling collaborative development and efficient code management.
Project Management: The IDE includes project management tools for organizing files,
dependencies, and configurations, streamlining development workflows.
3.2 BACKGROUND OF PERSONALITY PERCEPTION
The Big Five Personality Traits model is based on findings from several independent
researchers, and it dates back to the late 1950s. But the model as we know it now began to
take shape in the 1990s.
Lewis Goldberg, a researcher at the Oregon Research Institute, is credited with naming the
model "The Big Five." It is now considered to be an accurate and respected personality scale,
which is routinely used by businesses and in psychological research.
The Big Five Personality Traits Model measures five key dimensions of people's
personalities:
Conscientiousness: this looks at the level of care that you take in your life and work. If you
score highly in conscientiousness, you'll likely be organized and thorough, and know how to
make plans and follow them through. If you score low, you'll likely be lax and disorganized.
Agreeableness: this dimension measures how well you get on with other people. Are you
considerate, helpful and willing to compromise? Or do you tend to put your needs before
others'?
3.3 MACHINE LEARNING
Machine Learning: Machine learning enables computer systems to learn automatically
without the need for explicit programming. How does a machine learning system operate?
It can be explained by the machine learning lifecycle, a cyclical process for building an
effective machine learning project, whose primary goal is to find a solution for the problem
at hand. With machine learning (ML), artificial intelligence (AI) systems can learn
automatically and improve over time without explicit programming. ML focuses on
developing computer algorithms that can obtain data and use it to learn on their own.
Much like a human, the computer gets better at its assigned task the more data, or
"experience", it accumulates. Learning starts with observations or data, such as examples,
first-hand experience, or instruction, in order to find patterns in the data and make better
decisions in the future based on the examples provided. The main goal is to let computers
learn on their own, without human assistance or intervention, and adjust their actions
accordingly.
The following categories apply to ML algorithms:
Supervised Learning: Supervised machine learning algorithms apply what has been learned
from labeled examples in the past to new data in order to predict future events. Starting
from the analysis of a known training dataset, the learning algorithm produces an inferred
function to predict output values. After sufficient training, the system can provide targets
for any new input. The learning algorithm can also compare its output with the correct,
intended output and find errors, allowing the model to be modified accordingly.
Unsupervised Learning: When the training data are neither classified nor labeled,
unsupervised machine learning techniques are employed. Unsupervised learning studies how
systems can infer a function to describe a hidden structure from unlabeled data. The system
explores the data and can draw inferences from datasets to describe hidden structures, but it
does not determine a correct output.
Reinforcement Learning: Reinforcement learning algorithms are a training method that
interacts with its environment by producing actions and discovering errors or rewards.
Trial and error, search, and delayed reward are the most relevant characteristics of the
reinforcement technique. This method allows machines and software agents to
automatically determine the ideal behavior within a specific context in order to maximize
their performance.
3.4 EXPLANATION
This section explains the Flask web application for analyzing resumes and predicting
personality traits based on the extracted information. Let's break down the code and explain
each part:
1. Imports:
The code imports the necessary modules and functions from Flask, datetime, os, and pandas,
along with custom modules (ai_prediction and resume_extraction) for AI prediction and
resume extraction.
2. Flask App Setup:
Sets up the Flask application and specifies the folder where uploaded resumes will be
saved.
3. Routes and Functions:
3.1. Index Route:
4. Helper Functions:
4.1. Saving to CSV:
This Flask application provides a user-friendly interface for uploading resumes, extracting
relevant details, predicting personality traits, and displaying the results. It also offers
functionalities for managing historical data, exporting data, and clearing history, enhancing
the overall user experience.
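The wiring just described can be sketched as a minimal Flask app. The `predict_traits` helper below is a stub standing in for the real `resume_extraction` and `ai_prediction` modules, and the route names are illustrative:

```python
# Minimal sketch of the Flask backend: an index route and an /upload
# endpoint that saves the resume and returns a (stubbed) prediction.
import io
import os
import tempfile
from flask import Flask, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
app.config["UPLOAD_FOLDER"] = tempfile.mkdtemp()  # folder for uploaded resumes

def predict_traits(text):
    # Placeholder for the real ML pipeline (resume_extraction + ai_prediction).
    return ["openness", "conscientiousness"]

@app.route("/")
def index():
    return "Upload a resume to /upload"

@app.route("/upload", methods=["POST"])
def upload():
    f = request.files["resume"]
    path = os.path.join(app.config["UPLOAD_FOLDER"], secure_filename(f.filename))
    f.save(path)  # store the upload, then analyse it
    with open(path, encoding="utf-8", errors="ignore") as fh:
        traits = predict_traits(fh.read())
    return {"traits": traits}  # Flask serializes the dict to JSON

# Exercise the app with Flask's built-in test client.
client = app.test_client()
resp = client.post("/upload", data={
    "resume": (io.BytesIO(b"Skilled in Python"), "resume.txt"),
})
print(resp.get_json())
```

`secure_filename` guards against path-traversal filenames, one of the "handles file uploads securely" concerns mentioned earlier.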
This Python code defines a conversational agent named "Luna" that interacts with Google's
GenerativeAI service to generate descriptive text about a candidate's personality traits. Let's
break down the code and explain each part:
1. Import Statements:
The code imports the necessary modules: genai for accessing Google's GenerativeAI service
and personality traits for retrieving personality traits.
2. API Configuration:
Configures the GenerativeAI service with the API key. This API key should be obtained
from the Google Maker Suite platform.
3. Generation Configuration:
Specifies the generation configuration settings for the GenerativeAI model, such as
temperature, top-p, top-k, and max output tokens.
4. Generative Model Initialization:
Initializes the Generative Model with the specified name ("gemini-pro") and generation
configuration.
5. Chat Function:
Defines a function chat(query) that interacts with the GenerativeAI model to generate
descriptive text based on the input query. It retries up to three times if an error occurs.
6. Auxiliary Functions:
Defines auxiliary functions: say(text) for printing Luna's responses and takeCommand() for
generating the query based on predefined personality traits.
7. Main Execution:
Executes the main part of the script. It generates a query using takeCommand(), interacts
with the GenerativeAI model using chat(query), and prints Luna's response.
In summary, this code sets up a conversational agent named "Luna" that utilizes Google's
GenerativeAI service to generate descriptive text about a candidate's personality traits
based on predefined traits. It demonstrates how AI models can be integrated into
conversational systems to provide meaningful responses to user queries.
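The retry behaviour of the `chat()` helper can be sketched with the model call injected as a parameter and stubbed, so the example runs without the google-generativeai package or an API key; in the real code, `send_message` would call the "gemini-pro" model:

```python
# Sketch of Luna's retry logic: attempt the generative call up to
# max_retries times before giving up. The model call is injectable so a
# stub can simulate transient API errors.
def chat(query, send_message, max_retries=3):
    """Try the generative model up to max_retries times."""
    last_error = None
    for _ in range(max_retries):
        try:
            return send_message(query)
        except Exception as exc:
            last_error = exc
    return f"Luna could not answer: {last_error}"

calls = {"n": 0}
def flaky_model(query):
    # Stub standing in for Gemini: fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return f"Description for: {query}"

result = chat("openness, extraversion", flaky_model)
print(result)
```

Separating the retry policy from the API client also makes the conversational agent straightforward to unit-test.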
Flask is a web application framework written in Python. It has several modules that make it
easier for a web developer to write applications without worrying about details like protocol
and thread management. Flask provides the tools and libraries needed to build a web
application, along with a range of options for constructing web applications. Here we
develop a web application that integrates with the model constructed earlier: a user interface
is offered where input values for prediction are entered; the saved model receives these
values, and the prediction is displayed on the user interface. The trained machine learning
model is serialized to a pickle file and loaded by the Flask backend. The project folder
contains: a Python file named app.py; the file containing the machine learning code (for
instance, personality_prediction.py or personality_prediction.ipynb); a model file such as
Personality Prediction.pkl; and a templates folder containing the home.html page.
Sanitizing the dataset
Machine learning's second stage is pre-processing the dataset. Before training a better model,
it is vital to remove noisy data, fill in null (empty) values, replace garbage data, and use
algorithms to discover unknown columns.
We use Python library functions, such as those from NumPy and Pandas, to clean the
dataset.
Visualize the dataset: Prior to training a model, it's critical to comprehend the dataset and
select the machine learning approach. Defining the dataset includes organizing the dataset
columns for training, identifying trends, eliminating outliers, and visualizing the rows and
columns into graphs.
Since unsupervised learning is the foundation of our problem, we chose the clustering
technique.
Clustering the dataset: We must determine the number of clusters to include before training
the model. Determining the number of clusters is crucial in order to identify every unique
characteristic inside the dataset. To determine the optimal number of clusters that best suit
the dataset, we employ the elbow approach.
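The elbow approach can be sketched as follows: fit K-means for several cluster counts and record the distortion (inertia); the "elbow" in that curve suggests the best k. Synthetic 2-D blobs stand in here for the trait dataset:

```python
# Elbow-method sketch: inertia (within-cluster sum of squares) for
# k = 1..6 on data with three well-separated blobs, so the elbow
# should appear around k = 3.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(30, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(30, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(30, 2)),
])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias.append(km.inertia_)

print([round(i, 1) for i in inertias])  # sharp drop until k = 3, then flat
```

Plotting `inertias` against k produces the distortion-score curve shown in Fig 3.1; with the OCEAN model, k = 5 is expected for the real trait data.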
Algorithmic strategy:
A. Decision Trees: Decision Trees are a type of supervised machine learning (that is, you
explain what the input is and what the corresponding output is in the training data) in which
the data is continuously split according to a certain parameter. The tree can be explained by
two entities, namely decision nodes and leaves. The leaves are the decisions or final
outcomes, such as "fit" or "unfit"; in this case it is a binary classification problem (a yes/no
type problem). The decision nodes are where the data is split.
We build such a decision tree using the ID3 algorithm, which performs the following tasks
recursively:
1. Create a root node for the tree.
2. If all examples are positive, return the leaf node "positive".
3. Else, if all examples are negative, return the leaf node "negative".
4. Calculate the entropy of the current state, H(S).
5. For each attribute, calculate the entropy with respect to the attribute x, denoted H(S, x).
6. Select the attribute with the maximum value of IG(S, x).
7. Remove the attribute that offers the highest IG from the set of attributes.
8. Repeat until we run out of attributes, or the decision tree consists only of leaf nodes.
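Steps 4-6 above can be made concrete with a small sketch of entropy and information gain, computed on a toy "fit / unfit" dataset invented for illustration:

```python
# Entropy H(S) and information gain IG(S, x) from the ID3 steps above.
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum p_i * log2(p_i) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """IG(S, x) = H(S) - sum over values v of |S_v|/|S| * H(S_v)."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(attr_values):
        subset = [lab for v, lab in zip(attr_values, labels) if v == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy data: the attribute "exercises daily" perfectly predicts the outcome.
exercises = ["yes", "yes", "no", "no"]
outcome   = ["fit", "fit", "unfit", "unfit"]

print(entropy(outcome))                        # 1.0: classes perfectly mixed
print(information_gain(exercises, outcome))    # 1.0: attribute splits perfectly
```

ID3 would therefore choose "exercises daily" as the root split, since it has the maximum IG.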
B. Logistic Regression:
Logistic regression models the probability of the default class (e.g. the first class). For
example, if we are modeling people's sex as male or female from their height, then the first
class could be male, and the logistic regression model could be written as the probability of
male given a person's height, or more formally: P(sex=male | height). Written another way,
we are modeling the probability that an input (X) belongs to the default class (Y=1), which
we can write formally as: P(X) = P(Y=1|X). We are predicting probabilities.
Although logistic regression is a classification algorithm, note that the probability
prediction must be transformed into binary values (0 or 1) in order to actually make a class
prediction.
Logistic regression is a linear method, but the predictions are transformed using the logistic
function. Continuing on from above, the model can be stated as:
p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X)). Without diving into the math too much, we can
rearrange the above equation (remembering that e can be removed from one side by taking
the natural logarithm (ln) of the other): ln(p(X) / (1 - p(X))) = b0 + b1*X. This is useful
because we can see that the calculation of the output on the right is linear again (just like
linear regression), and the input on the left is the log of the probability of the default class.
The ratio on the left is called the odds of the default class (it is historical that we use odds;
for example, odds are used in horse racing rather than probabilities).
Odds are calculated as a ratio of the probability of the event divided by the probability of not
the event, e.g. 0.8/(1-0.8) which has the odds of 4.
So we could instead write: ln(odds) = b0 + b1 * X Because the odds are log transformed, we
call this left hand side the log-odds or the profit.
It is possible to use other types of functions for the transform (which is out of scope_, but as
such it is common to refer to the transform that relates the linear regression equation to the
probabilities as the link function, e.g. the profit link function. We can move the exponent
back to the right and write it as: odds = e^(b0 + b1 * X) .
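These formulas can be checked numerically with a short sketch; the coefficients b0 and b1 below are made-up values rather than fitted ones.

```python
import math

def p_of_x(b0, b1, x):
    """p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))"""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

b0, b1 = -100.0, 0.6   # made-up coefficients for height in cm
x = 170.0              # so b0 + b1*x = 2.0

p = p_of_x(b0, b1, x)
log_odds = math.log(p / (1 - p))   # ln(p(X) / (1 - p(X))) recovers b0 + b1*X
label = 1 if p >= 0.5 else 0       # crisp 0/1 class from the probability

# The odds example from the text: 0.8 / (1 - 0.8) = 4
odds = 0.8 / (1 - 0.8)
```

Note how the log-odds come back to exactly b0 + b1*x, confirming that the model is linear on the logit scale.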
Fig 3.1 Distortion Score
Following the OCEAN model, there will be five primary clusters representing the five
distinct personality types.
Train the Model: The primary function of machine learning is model training. Using a
dataset, train many algorithms and determine which yields the best accuracy.
Numerous techniques, such as K-means and partitioning, are utilized for clustering. When the algorithm is given the specified dataset and number of clusters, it trains the model.
Utilizing metrics like accuracy, ROC curve, and confusion matrix, assess the model.
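A minimal version of the K-means training step described above can be sketched as follows (Lloyd's algorithm on toy 2-D points; real use would rely on a library such as scikit-learn):

```python
import math

def kmeans(points, k, iters=10):
    """Minimal Lloyd's algorithm: assign each point to the nearest
    centroid, recompute centroids as cluster means, repeat."""
    centroids = points[:k]  # naive initialization: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated toy groups; with k=2 each group becomes a cluster.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```

In the personality setting, each point would be a candidate's feature vector and k would be five, one cluster per OCEAN trait group.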
Visualize the Model: This is a useful tool for determining whether or not the trained model
will perform properly. Identifying mistakes and outliers is the goal of model visualization.
Visualizing the model involves presenting the model's internal workings, structure, and
performance metrics in a graphical format. In the context of the Personality Prediction
System, here's how the model can be visualized:
1. Dimensionality Reduction:
PCA Visualization: After preprocessing the text data, Principal Component Analysis (PCA)
can be applied to reduce the dimensionality of the feature space while preserving most of the
variance. Visualizing the data in a lower-dimensional space allows for easier interpretation
and identification of clusters corresponding to different personality traits.
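Assuming NumPy is available, the PCA projection described above can be sketched as follows; the 5-dimensional OCEAN-style scores are invented for illustration.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD
    of the mean-centred data matrix, preserving most of the variance."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # shape (n_samples, n_components)

# Invented 5-dimensional OCEAN-style scores for four candidates:
# rows 0-1 form one personality group, rows 2-3 another.
X = np.array([[0.9, 0.1, 0.8, 0.2, 0.5],
              [0.8, 0.2, 0.9, 0.1, 0.4],
              [0.1, 0.9, 0.2, 0.8, 0.6],
              [0.2, 0.8, 0.1, 0.9, 0.5]])
Z = pca_project(X)  # 2-D coordinates ready for a scatter plot
```

The two groups land on opposite sides of the first principal component, which is exactly what makes the lower-dimensional scatter plot easy to interpret.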
2. Clustering Visualization:
Cluster Plot: Using clustering algorithms such as K-means or hierarchical clustering, the
system can cluster candidates into distinct groups based on their personality traits.
Visualizing these clusters on a scatter plot can provide insights into the distribution and
separation of personality types within the dataset.
3. Model Evaluation:
ROC Curve: If the model involves binary classification tasks (e.g., predicting whether a
candidate possesses a specific personality trait), Receiver Operating Characteristic (ROC)
curves can be plotted to visualize the trade-off between true positive rate and false positive
rate at different threshold settings.
Confusion Matrix: For multi-class classification tasks (e.g., predicting personality traits
across multiple dimensions), a confusion matrix can be visualized to display the number of
true positives, true negatives, false positives, and false negatives for each class, offering
insights into the model's classification performance.
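Both evaluation views can be computed from first principles; the labels, predictions, and scores below are toy values, not project results.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

def roc_point(y_true, scores, threshold):
    """One (FPR, TPR) point of the ROC curve at the given threshold."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fp / neg, tp / pos

y_true = [1, 1, 0, 0]            # toy trait labels
y_pred = [1, 0, 0, 0]            # toy hard predictions
cm = confusion_matrix(y_true, y_pred, classes=[0, 1])
fpr, tpr = roc_point(y_true, scores=[0.9, 0.6, 0.4, 0.2], threshold=0.5)
```

Sweeping the threshold from 1 down to 0 and collecting these (FPR, TPR) points traces the full ROC curve described above.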
4. Textual Representation:
Word Clouds: To gain insights into the most prominent words or phrases associated with
each personality trait, word clouds can be generated to visualize the frequency of terms
extracted from the resumes. This provides a qualitative understanding of the textual patterns
characteristic of different personality types.
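The term frequencies that a word cloud renders can be sketched with the standard library; the sample resume text and stop-word list are illustrative assumptions.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "of", "the", "in", "with", "to"}

def top_terms(text, n=3):
    """Term frequencies that a word-cloud renderer would scale by:
    the bigger the count, the bigger the word in the cloud."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(n)

resume = ("Led a team of engineers. Team leadership and communication "
          "skills. Organized team events and communication workshops.")
```

A word-cloud library would then draw each term at a size proportional to its count, giving the qualitative view of textual patterns described above.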
5. GenerativeAI Output:
Textual Descriptions: The descriptions generated by the GenerativeAI service can be
presented alongside the corresponding personality traits, providing users with contextual
explanations and insights into the implications of each trait. These descriptions may be
displayed in a user-friendly format within the web interface.
6. Interactive Dashboard:
Interactive Visualization: Incorporating interactive elements such as dropdown menus,
sliders, or checkboxes into the web interface allows users to dynamically explore different
aspects of the model's output.
7. Model Parameters:
Parameter Tuning Plot: If the model involves hyperparameter tuning or optimization,
visualizing the performance metrics (e.g., accuracy, loss) as a function of different parameter
values can aid in selecting the optimal configuration for the model.
Fig 3.2 Personality Traits after PCA
We employed PCA to lower the dimensionality and to linearly decorrelate the features, because our dataset has millions of rows.
Test the Model: Testing the model is just as crucial as training it. After the model has been trained, unlabeled data is supplied, and the trained model predicts the data's labels and produces results.
If the outcomes are inaccurate, the model is trained again using a different approach until it produces predictions with good accuracy.
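The retrain-until-accurate loop described above amounts to comparing candidate models on held-out data and keeping the best; the tiny 1-nearest-neighbour model and toy personality labels below are stand-ins, not the project's classifiers.

```python
import math

def knn_predict(train_X, train_y, x):
    """1-nearest-neighbour: return the label of the closest training point."""
    i = min(range(len(train_X)), key=lambda j: math.dist(train_X[j], x))
    return train_y[i]

def accuracy(model, X, y):
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

# Toy 2-D features with invented personality labels.
train_X = [(0, 0), (0, 1), (5, 5), (5, 6)]
train_y = ["introvert", "introvert", "extrovert", "extrovert"]
test_X = [(0, 0.5), (5, 5.5)]
test_y = ["introvert", "extrovert"]

# Candidate "approaches": keep whichever scores best on held-out data.
models = {
    "majority": lambda x: "introvert",
    "1-nn": lambda x: knn_predict(train_X, train_y, x),
}
best_name = max(models, key=lambda name: accuracy(models[name], test_X, test_y))
```

In the full system the candidate models would be the algorithms compared earlier (random forest, kNN, logistic regression, SVM, Naive Bayes) rather than these stand-ins.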
In summary, the forecasting process yields detailed insights into the dataset's structure,
facilitates the identification of distinct personality clusters, and enables the development of
accurate predictive models. These outcomes are crucial for informing decision-making
processes in various domains, including recruitment, organizational development, and
individual assessment.
*******
CHAPTER 4
4.1 SNAPSHOTS:
2- CV Analysis Results
3- Personality Traits
The results of the project report on "Personality Prediction Through CV Analysis Using ML"
demonstrate the successful implementation of a sophisticated system for automating
personality assessments from resumes. Here's a detailed summary of the results:
2. Trait Assignment:
Personality traits are assigned to candidates based on extracted skills and predefined
associations stored in 'traits.txt', ensuring accuracy and relevance in trait assignment.
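A minimal sketch of this trait-assignment step follows; the skill-to-trait map below is a hypothetical stand-in for the associations stored in 'traits.txt'.

```python
# Hypothetical skill-to-trait associations standing in for 'traits.txt'.
TRAIT_MAP = {
    "teamwork": "Agreeableness",
    "creativity": "Openness",
    "planning": "Conscientiousness",
    "public speaking": "Extraversion",
}

def assign_traits(extracted_skills):
    """Map skills extracted from a CV to Big Five traits, ignoring
    skills with no known association."""
    return sorted({TRAIT_MAP[s] for s in extracted_skills if s in TRAIT_MAP})

traits = assign_traits(["teamwork", "planning", "welding"])
```

Keeping the associations in a data file rather than in code, as the project does, means the mapping can be revised without touching the pipeline.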
4. AI Interaction for Trait Description:
The system interacts with Google's GenerativeAI service to generate descriptive text about
a candidate's personality based on their assigned traits, providing deeper insights.
4.2 DISCUSSION
The discussion section of the project report on "Personality Prediction Through CV Analysis
Using ML" provides an opportunity to delve deeper into the implications, limitations, and
future directions of the project. Here's a structured discussion based on the key components
of the project:
1. Implications of the Project:
Recruitment Efficiency: The automation of personality assessments from resumes
streamlines the recruitment process, reducing manual workload for recruiters and
accelerating candidate screening.
Informed Decision-Making: Personality insights gleaned from the system empower
employers to make more informed hiring decisions, aligning candidates with organizational
values and culture.
Ethical Considerations: The project addresses ethical concerns surrounding data privacy,
bias mitigation, and transparency, promoting fairness and equality in the recruitment process.
2. Limitations and Challenges:
Algorithmic Bias: Despite efforts to mitigate bias, the system may still exhibit biases
inherent in the training data or algorithmic decisions, requiring ongoing monitoring and
refinement.
Generalization and Scalability: The system's performance may vary across different
datasets or industries, necessitating further research to enhance generalization capabilities
and scalability.
Interpretability: The interpretability of personality predictions and trait descriptions
generated by AI models may be limited, posing challenges for user understanding and trust.
3. Future Directions:
Refinement of Models: Continuous refinement of machine learning models based on
feedback and integration of advanced algorithms to improve prediction accuracy and
generalization.
Enhanced Features: Exploration of additional features such as personalized trait
descriptions, career guidance based on personality insights, and integration with HR systems
for comprehensive recruitment solutions.
Research and Development: Further research into the intersection of AI, psychology, and
recruitment practices to deepen understanding and improve the efficacy of personality
prediction systems.
4. Real-World Applications:
Industry Adoption: The project lays the groundwork for industry adoption of automated
personality assessment tools, offering potential benefits for various sectors such as human
resources, talent acquisition, and career counseling.
Academic and Research Impact: The project contributes to academic and research
endeavors in the fields of machine learning, natural language processing, and organizational
psychology, fostering interdisciplinary collaboration and knowledge exchange.
5. Conclusion:
Summary of Contributions: The discussion concludes by summarizing the project's
contributions to recruitment technology, ethical AI deployment, and future research
directions.
6. Call to Action: Encourages stakeholders to leverage the insights gained from the project
to drive positive change in recruitment practices, foster diversity and inclusion, and promote
responsible AI innovation.
Nowadays, the corporate world not only prioritizes an individual's skills but also their
personality traits, as they play a crucial role in achieving success both professionally and
personally. Therefore, recruiters must have knowledge of potential employees' personality
traits. However, due to the significant increase in job seekers and the decline in job
availability, it is challenging to manually select the most suitable candidate by just reviewing
their resume. This analysis aims to explore various machine learning techniques for
predicting personality traits effectively by analyzing resumes through Natural Language
Processing (NLP) methods. The research demonstrates that the Random Forest algorithm
outperforms other approaches such as k-Nearest Neighbors (kNN), Logistic Regression,
SVM, and Naive Bayes in terms of accuracy.
7. SWOT ANALYSIS
Strengths:
• Interactive and easy to use.
• Extracts all the important features of a resume in seconds.
• Easily predicts the personality of the applicant.
Weaknesses:
• It does not store the predicted personality data.
• Bulk CVs cannot be parsed in one go.
Opportunities:
• It can be extended for commercial use.
• It can be made more interactive so that bulk data can be easily handled and represented.
• The training model can be improved with various additional features that help predict more accurate results.
• Instead of directly asking for the five characteristic values, we can add a questionnaire that asks some multiple-choice questions and auto-calculates the various values.
Threats:
• There is no security added in the app yet that gives different rights to different users.
• There are a lot of companies in the world, and their hiring systems differ from sector to sector, so the app needs to be adapted to each company's requirements, which can be complicated.
*******
32
CHAPTER 5
5.1 CONCLUSION
The Personality Prediction System leverages machine learning, NLP, and web development
technologies to create a sophisticated tool for the recruitment process. By automating
personality assessment and potentially utilizing AI for trait descriptions, the system
empowers recruiters and employers to make more informed hiring decisions based on a
candidate's potential cultural fit.
Personality prediction models include the MBTI and OCEAN models. However, the
OCEAN model has a larger and more accurate dataset than the MBTI model. Using the
provided dataset, we are able to achieve an accuracy of 89.13% in this project. The machine
learning model can be trained and tested using random forest classifier techniques.
Additionally, we can use the Flask library to deploy our model. This library offers a user
interface that makes it simple for the user to submit data, and in the back end, our machine
learning model uses the data to predict the user's personality and display it.
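A minimal sketch of such a Flask deployment follows; the route name and the predict_personality() stub are assumptions, not the project's actual code.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_personality(features):
    """Stand-in for the trained model; the real system would load and
    call the fitted random forest classifier here."""
    return "Openness" if features.get("creativity", 0) > 0.5 else "Conscientiousness"

@app.route("/predict", methods=["POST"])
def predict():
    # The user interface submits candidate features as JSON; the back
    # end predicts the personality and returns it for display.
    features = request.get_json()
    return jsonify({"personality": predict_personality(features)})
```

Running `app.run()` would serve this endpoint locally, with the web form posting its data to /predict and rendering the returned trait.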
The future scope of the Personality Prediction System through CV Analysis holds immense
potential for further innovation and advancement in recruitment technology. This chapter
explores avenues for future development, including enhancements to the system's
functionality, expansion into new domains, and ongoing research directions.
Enhanced Functionality
Performance Analytics:
Industry-Specific Applications:
• Adapting the system to cater to the unique recruitment needs of specific industries, such as
healthcare, finance, or technology, by customizing personality trait models and analysis
criteria.
The future scope of the Personality Prediction System through CV Analysis is brimming
with possibilities for further innovation and impact in the field of recruitment technology.
By embracing enhanced functionality, integrating with HR systems, expanding into new
domains, and pursuing ongoing research, the system can continue to evolve and address the
evolving needs of recruiters and candidates alike. As the landscape of talent acquisition
continues to evolve, this project stands poised to lead the way in shaping the future of
recruitment practices.
Human personality has played a vital role in an individual's life as well as in the development of an organization. One of the ways to judge human personality is by using standard questionnaires or by analyzing the Curriculum Vitae (CV). Traditionally, recruiters manually shortlist/filter a candidate's CV as per their requirements. In this paper, we present a system that automates the eligibility check and aptitude evaluation of candidates in a recruitment process. To meet this need, an online application is developed for the analysis of aptitude or personality tests and the candidate's CV. The system analyzes professional eligibility based on the uploaded CV. The system employs a machine learning approach using the TF-IDF algorithm. The output of our system gives a decision for candidate recommendation. Further, the resulting scores help in evaluating the qualities of the candidates by analyzing the scores obtained in different areas. The graphical analysis of the performance of any candidate makes it easier to evaluate his/her personality.
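The TF-IDF scoring mentioned above can be sketched in a few lines; the token lists stand in for tokenized resumes.

```python
import math
from collections import Counter

def tf_idf(docs):
    """tf-idf(t, d) = tf(t, d) * ln(N / df(t)) over a small corpus,
    where df(t) is the number of documents containing term t."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{term: count * math.log(n_docs / df[term])
             for term, count in Counter(doc).items()}
            for doc in docs]

# Token lists standing in for three tokenized resumes.
docs = [["python", "ml", "python"],
        ["java", "ml"],
        ["python", "java"]]
scores = tf_idf(docs)
```

Terms that are frequent in one resume but rare across the corpus get the highest weights, which is what makes TF-IDF useful for distinguishing candidates.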
During the pandemic it was hard for organizations to conduct face-to-face interviews, so online recruitment was carried out instead. It is hard to judge an interviewee's personality online. Our intention is therefore to identify an individual's personality by taking photographs of the interviewee at random times during the recruitment process. There is adequate confirmation that expressions and expressive signals in a human face offer hints about a person's character. With that, we deduce an individual's personality using a Convolutional Neural Network (CNN) algorithm and the facial width-to-height ratio (fWHR). Additionally, a questionnaire with situational questions is used to identify the level of each personality trait within a person. Questionnaire-based personality prediction is achieved using the K-Means clustering algorithm. By combining the conclusions of both approaches, we predict the strongest trait that a person possesses among the big five personality traits. Thus, it will be a great help for the recruitment process. It is more advantageous for organizations and interviewees to conduct recruitment online, because anyone from anywhere can attend interviews, saving time and money. We therefore support organizations in continuing with online recruitment even after the pandemic.
Personality is an important parameter as it differentiates individuals from one another. Personality prediction is an evergreen area of research. Predicting personality with the help of data from social media is a promising approach, as this method does not require any questionnaires to be filled out by users, thus reducing time and increasing credibility. Having knowledge of personality is therefore an interesting domain for researchers to work on. Predicting personality has many applications in the real world. Use of social media is increasing day by day, and huge amounts of textual data as well as images continue to pour onto the web daily. The current work focuses on Linear Discriminant Analysis, Multinomial Naive Bayes, and AdaBoost over a standard Twitter dataset.
With the development of social networks, a large variety of approaches have been developed
to define users’ personalities based on their social activities and language use habits.
Particular approaches differ with regard to different machine learning algorithms, data
sources, and feature sets. The goal of this paper is to investigate the predictability of the
personality traits of Facebook users based on different features and measures of the Big 5
model. We examine the presence of structures of social networks and linguistic features
relative to personality interactions using the myPersonality project data set. We analyze and
compare four machine learning models and perform the correlation between each of the
feature sets and personality traits. The results for the prediction accuracy show that even if
tested under the same data set, the personality prediction system built on the XGBoost
classifier outperforms the average baseline for all the feature sets, with a highest prediction
accuracy of 74.2%. The best prediction performance was reached for the extraversion trait
by using the individual social network analysis features set, which achieved a higher
personality prediction accuracy of 78.6 %.
Integrate with existing applicant tracking systems (ATS) for seamless workflow. Refine
personality prediction models for improved accuracy.
Incorporate additional data sources like cover letters or social media profiles (with proper
consent) for a more holistic candidate assessment.
Address potential ethical considerations and biases within the personality prediction
algorithms.
*******
REFERENCES
1. Shankarwar, Tanuj, "Machine Learning to Predict Personality via CV", 2023.