ARTICLE INFO

Keywords: Information extraction, Transformers, Online enrolling process, Natural language processing, Course recommendation

ABSTRACT

Today we have tons of information posted on the web every day regarding job supply and demand, which has heavily affected the job market. The online enrolling process has thus become efficient for applicants as it allows them to present their resumes using the Internet and, as such, simultaneously to numerous organizations. Online systems such as Monster.com, OfferZen, and LinkedIn contain millions of job offers and resumes of potential candidates, leaving companies with the hard task of facing an enormous amount of data in order to select the most suitable applicant. The task of assessing the resumes of candidates and providing automatic recommendations on which one suits a particular position best has, therefore, become essential to speed up the hiring process. Similarly, it is important to help applicants quickly find a job appropriate to their skills and provide recommendations about what they need to master to become eligible for certain jobs. Our approach lies in this context and proposes a new method to identify skills from candidates' resumes and match resumes with job descriptions. We employed the O*NET database entities related to the different skills and abilities required by different jobs; moreover, we leveraged deep learning technologies to compute the semantic similarity between O*NET entities and parts of text extracted from candidates' resumes. The ultimate goal is to identify the most suitable job for a certain resume according to the information contained therein. We have defined two scenarios: i) given a resume, identify the top O*NET occupations with the highest match with the resume; ii) given a candidate's resume and a set of job descriptions, identify which one of the input jobs is the most suitable for the candidate. The evaluation that has been carried out indicates that the proposed approach outperforms the baselines in the two scenarios. Finally, we provide a use case for candidates where it is possible to recommend courses with the goal of filling certain skill gaps and making them qualified for a certain job.
1. Introduction

The job market has been heavily influenced by the tons of information that are posted on the Web every day regarding job supply and demand. The online enrolling process has become efficient for applicants as it allows them to present their resumes using the Internet and, as such, simultaneously to numerous organizations (companies or research centers). One problem that arises from the application to multiple online systems is that there is no universal standard format to adopt when filling resume information (although systems such as Europass1 or LinkedIn2 have some methods to automatically generate structured profiles). It follows that resumes are all very different from each other in terms of structure, design, and format, hindering an efficient and fast analysis from people working within human resources divisions of
* Corresponding author at: Department of Mathematics and Computer Science, University of Cagliari, Cagliari, Italy.
E-mail addresses: [email protected] (R. Alonso), [email protected] (D. Dessí), [email protected] (A. Meloni),
[email protected] (D. Reforgiato Recupero).
1 https://europa.eu/europass/en.
2 http://www.linkedin.com.
https://doi.org/10.1016/j.bdr.2025.100509
Received 18 August 2022; Received in revised form 22 January 2025; Accepted 6 February 2025
• we provide an interactive demo9 and its source code10 that implements our proposed approach to match applicants' resumes with the O*NET database jobs to measure the eligibility of a user for a given job.

The remainder of this paper is organized as follows. Section 2 discusses related works on resume extraction and job recommendation. Section 3 details the tools we have used in this paper as well as the O*NET database. The proposed approach is presented in Section 4 whereas Section 5 introduces the scenarios we have considered. Section 6 illustrates the evaluation we have carried out for the two scenarios. A use case we propose in the paper which can leverage the proposed approach to recommend skills to job seekers is shown in Section 7. Finally, Section 8 ends the paper with conclusions, limitations, and the future work we are headed towards.

2. Related work

The last two decades have seen the development of online recruiting platforms, a topic that has recently acquired increasing attention. Authors in [18,19] presented two surveys of existing recommendation approaches that have been proposed to create recommendation systems for job seekers and recruiters. On the one hand, extracting features from resumes and from job postings has always been challenging. In particular, user profiling deals with acquiring, extracting, and representing the features of users [20]. On the other hand, job profiling is a representation of job descriptions and their requirements. Usually, job descriptions come in unstructured text with no attribute names with well-defined values. It follows that the skill set for a particular job includes skills with a Boolean value: True if the skill is required for that job, False otherwise.

A kind of approach that achieved great success in recommending jobs is collaborative filtering. It is based on the assumption that if users A and B have similar behaviors they will rate other items similarly [21-23]. Authors in [23] developed a job recommendation system using a model-based collaborative algorithm with clustering algorithms. Latent Semantic Analysis (LSA) and Singular Value Decomposition (SVD) have been adopted to create a lower-rank matrix with information about skills and positions. Moreover, the inverse cosine similarity was employed as distance to perform the agglomerative clustering to create clusters of positions. Authors in [22] created a system for job recommendation by making clusters of users that are based on skills extracted from different websites. The Euclidean distance has been leveraged as a measure of similarity between the skills of users. Then, identified skills with low occurrence were removed from the list and a classification using Naive Bayes was employed to rank the final set of recommendations for a user.

The Labor Market Explorer interactive dashboard for job seekers has been presented by authors in [24]. It was built with a careful user-centered design process where both job seekers and job mediators were involved so that the matching process between jobs and job seekers could be optimized. The dashboard enables an exploration of the job market in a personalized way based on the skills and competencies of the applicants. Efforts related to career exploration and the detection of training needs have also been and are being carried out. For example, the STAR11 project has the goal of designing new technologies to enable the deployment of standard-based secure, safe, reliable, and trusted human-centric AI systems in manufacturing environments. There, the Workers' Training Platform is being developed. This platform allows workers to self-assess themselves and detect training needs related to skills or knowledge while offering them training recommendations for those skills. All in an anonymous way and based on public occupation databases such as O*NET or ESCO12.

Other authors in [25] created a system to suggest jobs based on users' profiles. Users and jobs have been treated as text documents, and a model that incorporates job transitions trained on the career progressions of a set of users has been adopted. The authors also showed that combining career transitions with cosine similarity outperforms the system using just career transitions. The evaluation proving the statements above has been carried out on a dataset of 2,400 LinkedIn users with the task of predicting users' current positions by looking at their profiles and their job history.

The works illustrated above do not leverage a widely recognized taxonomy of skills or other similar entities. Sometimes the set of considered skills is too large and approaches to reduce the dimensional space need to be adopted, with the drawback of losing precision. Differently from the previous approaches, we use the O*NET taxonomy by including not just skills but several other entities of a different kind (i.e., knowledge, abilities, technology skills, etc.). O*NET is one of the main occupational databases, and is a de facto reference for occupation analysis and worker requirements. It is the primary source of occupational information in the United States and it includes regularly updated occupational characteristics based on questionnaires administered to several hundred workers. In doing so, we inject into our approach cognitive, interpersonal, and physical skill knowledge representing worker and job requirements coming from a large number of workers and companies, thus making the person-job matching more robust. Finally, our work differs from previous studies [21-25] because it does not rely on previous applicants for a specific job position as collaborative filtering approaches do. Therefore, each resume is used as it is and is matched against jobs, thus allowing us to avoid biases as well as the cold start problem [26]. Last but not least, to match each of these entities with applicants' resumes and job postings we make use of sentence transformers, thus leveraging the semantics of each word, sentence, and context. This differs from the existing studies, which often make use of only well-defined attributes mentioned in the applicants' resumes or job postings, and makes it possible for our approach to deal with the complexity of the natural language text contained in applicants' resumes.

3. Task definition and the used material

In this section, we provide a formal definition of the task addressed in this work, alongside details about the transformers model and the O*NET database leveraged in the proposed approach.

3.1. Task definition

The objective of our approach is to develop a system that automatically identifies and matches relevant skills and qualifications between candidates' resumes and job postings. Formally, given:

• Resume Information (R): a structured or semi-structured textual representation of a candidate's skills, experiences, education, and qualifications.
• Job Information (J): a set of job classes derived from O*NET, comprising multiple entities such as technology skills, general skills, knowledge, abilities, work activities, task ratings, and tools used.

The task is defined as a mapping function f : R → J, where for each resume r ∈ R, the function f returns a ranked list of job classes {j1, j2, ..., jn} ⊂ J, ordered by their relevance to the candidate's profile.

The relevance score is calculated by leveraging semantic similarity to align the extracted elements from the resume with the O*NET entities corresponding to each job.

9 Demo: http://192.167.149.11:8000.
10 Demo source code: https://gitlab.com/hri_lab1/onet-db26-transformers-demo.
11 https://www.star-ai.eu.
12 https://ec.europa.eu/esco/.
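To make the mapping f more concrete, the following minimal Python sketch (ours, for illustration only; all names are hypothetical and not part of the released demo code) shows the interface that the rest of the paper fills in: each O*NET job class is scored against a resume, and a ranked list of (job, score) pairs is returned.

from typing import Callable, Dict, List, Tuple

def rank_jobs(resume_text: str,
              onet_jobs: Dict[str, dict],
              score_job: Callable[[str, dict], float],
              top_k: int = 5) -> List[Tuple[str, float]]:
    # onet_jobs maps a job title to its O*NET entities (skills, knowledge,
    # abilities, work activities, task ratings, tools used, technology skills).
    # score_job implements the per-job score described in Section 4 (Equation (1));
    # it is injected as a parameter so that this sketch stays self-contained.
    scores = {title: score_job(resume_text, entities)
              for title, entities in onet_jobs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]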
Fig. 1. The figure illustrates the structure of a Siamese network designed for text comparison. Each branch of the network consists of a text input processed through
a BERT model that encodes the text into vectorial representations, and a pooling layer that aggregates the token-level features into a fixed-size vector representation
capturing the most salient information. The resulting vectors from both branches u and v are then compared using cosine similarity to measure their similarity.
For each matched element, a job-specific score derived from O*NET is assigned across various categories, including "skills", "knowledge", "abilities", "work activities", and "tasks". Additionally, conventional scores are applied for "technology skills" and "tools used".

The approach operates in two specific scenarios:

1. Resume-to-Job Matching (R2J): assisting companies in screening multiple resumes to identify the most suitable candidates for a specific job posting.
2. Job-to-Resume Matching (J2R): helping candidates find the most relevant job postings based on their resumes.

By leveraging NLP advancements and semantic similarity computation, the proposed system aims to address challenges such as entity disambiguation, contextual understanding of skills and qualifications, and alignment between heterogeneous textual data from resumes and job descriptions.

3.2. Sentence transformers

SentenceTransformers13 [27] is the state-of-the-art Python framework for text embedding generation. The framework leverages models that are built on top of the original BERT model [28] or its further developments such as RoBERTa [29], MPNet [30], and ALBERT [31]. It provides transformer models that use siamese and triplet network structures to transform sentences into embeddings. These can then be compared using metrics (e.g., the cosine similarity) to perform tasks such as information retrieval, clustering, and semantic search. The first models were trained on Natural Language Inference (NLI) datasets [32,33] and successfully became the state of the art for solving Semantic Textual Similarity (STS) tasks.

Today, SentenceTransformers pre-trained models are built as extensions of Huggingface Transformers models14 by applying pooling layers within siamese structures. The reader can observe an example of a common siamese structure with two BERT models to compute the cosine similarity between two texts in Fig. 1. The pool of SentenceTransformers models can be found at https://huggingface.co/models.

Within the proposed system, we have chosen to embed the all-mpnet-base-v2 model, which achieves a ρ = 0.69 on encoding sentences over 14 diverse tasks, and a ρ = 0.57 on 6 diverse tasks for semantic search such as encoding of queries and questions. According to the official Sentence Transformers documentation15, this model is specifically noted for achieving higher performance compared to a range of other pre-trained models, including standard BERT-based and RoBERTa-based architectures. The documentation emphasizes that Sentence Transformers models are optimized for sentence-level tasks, such as semantic similarity, making them particularly well-suited for our application, which involves comparing resumes, job descriptions, and O*NET entities. In addition, our use of Sentence Transformers aligns with the nature of our task, where capturing nuanced semantic similarity between textual data is critical. These models are specifically designed to deliver superior results for tasks involving sentence embeddings and pairwise comparisons, which are central to our approach. The reader should note that ρ corresponds to the Spearman rank correlation coefficient [34] between the ranking of sentence pairs using the cosine similarity as score and the gold standard ranking for various semantic textual similarity tasks. The Spearman rank correlation coefficient gives a score in the continuous range [-1, 1], where a value of 1 indicates a perfect correlation between the two rankings, a value of 0 indicates no correlation, and a value of -1 means a perfect negative correlation.

The all-mpnet-base-v2 model has a size of 420 Mbytes, a maximum sequence length of 384 tokens, and an embedding speed of 2,800 sentences/sec on a V100 GPU. The generated embeddings are 768-dimensional vectors. The suitable score functions are the dot product, the cosine similarity, and the Euclidean distance.

3.3. The O*NET database

The O*NET Program16 is the primary source of occupational information for the United States. The goal of its creation was to understand the rapidly changing nature of work and how it impacts the workforce and the U.S. economy. One of the main outcomes of the program is the O*NET database, which includes hundreds of standardized and occupation-specific descriptors on almost a thousand occupations covering the entire U.S. economy. The database is freely available under a Creative Commons license and is continuously updated on a quarterly basis by different institutions. It has already been used by millions of persons for career exploration and to discover which training is necessary to be eligible for a position, and by employers to find skilled workers to better compete in the marketplace.

Each occupation included in the O*NET database requires a disparate mix of knowledge, skills, and abilities and is performed using a variety of activities and tasks. We have used the O*NET database version 26.2, which includes 1,016 occupations, 52 Abilities, 33 Knowledge, 35 Skills, 41 Work Activities, 4,127 Tools Used, 17,975 Task Ratings, and 8,761 Technology Skills grouped into 135 categories. In the remainder of the paper, we will refer to Abilities, Knowledge, Skills, Work Activities, Tools Used, Task Ratings and Technology Skills as O*NET entities. Each occupation is associated with a unique identifier, a title, and a related description. For example, the occupation Computer and Information Research Scientist has the description Conduct research into fundamental computer and information science as theorists, designers, or inventors. Develop solutions to problems in the field of computer hardware and software.

The Abilities consist of a set of capacities related to each occupation. One occupation may be associated with many abilities, each with a related score. A given ability (as for Knowledge, Skills and Work Activities) might be present in different occupations with different scores.

13 https://www.sbert.net/.
14 https://huggingface.co/docs/transformers/index.
15 https://www.sbert.net/docs/pretrained_models.html.
16 https://www.onetcenter.org/.
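As a concrete illustration of the embedding comparison described in Section 3.2, the short Python sketch below (ours; the texts are made-up examples) encodes a resume fragment and two O*NET entity elements with all-mpnet-base-v2 and scores them with the cosine similarity, the measure used throughout the rest of the paper.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # 768-dimensional embeddings

resume_fragment = "Developed REST services in Java and tuned SQL queries"
onet_elements = ["Programming", "Oral Comprehension"]

# Encode the resume fragment and the O*NET elements into vectors
emb_resume = model.encode(resume_fragment, convert_to_tensor=True)
emb_elements = model.encode(onet_elements, convert_to_tensor=True)

# Cosine similarity between the fragment and each O*NET element
similarities = util.cos_sim(emb_resume, emb_elements)
for element, score in zip(onet_elements, similarities[0]):
    print(f"{element}: {float(score):.4f}")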
Fig. 2. O*NET "Technology Skills" and "Tasks" Entities - First 50 elements by frequency.
One example is Oral Comprehension which, among others, appears in the occupations Chief Executives and Biostatisticians with a score value of, respectively, 4.5 and 4. The score reflects the importance of that ability with respect to the associated occupation and has a value in the [1-6] continuous range. The higher the score, the more important that ability is for the job it refers to.

Knowledge represents the required area of expertise for the underlying job and has scores in the [1-7] continuous range. An example is provided by Mathematics, which has a score of 6.83 for the occupation Mathematicians. Similarly to the abilities, Knowledge also has many-to-many relations with the occupations and each different pair (occupation, knowledge) has a different score value.

Skills are the competencies required for each occupation and have a score in the [1-6] continuous range. One example is Programming, associated with the job Computer Systems Engineers/Architects with a score value of 3.38.

For Work Activities (values range in the [1-7] continuous interval), an example is given by Analyzing Data or Information, with a score of 6.61 with respect to the occupation Financial Quantitative Analysts. Both Skills and Work Activities have many-to-many relations with different score values for each different job.

One example of Task Ratings is Direct and coordinate activities involving sales of manufactured products, services, commodities, real estate or other subjects of sale, related to the job Sales Manager with an importance score of 4.22 (scores of Task Ratings are in the interval [1-5]).

An example of Tools Used is Personal Computer, associated with several jobs (almost all of them).

Finally, an example of Technology Skills is Atlassian JIRA, related to the occupation Administrative Services Managers; Technology Skills have one more field, (hot technology), indicating whether the underlying technology skill is hot or not.

The reader should note that the last three described O*NET entities, Task Ratings, Tools Used, and Technology Skills, only occur when they are required for the underlying job. Moreover, Tools Used and Technology Skills do not have a score value associated with an occupation, whereas Task Ratings do. Overall, as previously mentioned, there are 4,127 Tools Used, 17,975 Task Ratings, and 8,761 distinct Technology Skills. A certain O*NET job will require a subset of each of the three entities. For example, the O*NET job Chief Executives requires 7 Tools Used, 31 Task Ratings, and 49 Technology Skills. Conversely, each O*NET occupation is always associated with a fixed number of elements for each of the first four described entities. Hence, any O*NET occupation will be associated with a vector of 52 Abilities, a vector of 33 Knowledge, a vector of 35 Skills, and a vector of 41 Work Activities.

One more table worth mentioning is the content model reference, which contains complete descriptions of all the elements included in the entities of the O*NET database previously described. For example, one of the Abilities' elements is Cognitive Abilities and its description is Abilities that influence the acquisition and application of knowledge in problem solving.

To provide the reader with some statistics, Figs. 2 and 3 show the first 50 elements, in decreasing order by the number of times they occur within the dataset's jobs, of the O*NET entities with a variable number of elements per job, that is "Technology Skills", "Tasks", and "Tools Used". For example, Microsoft Excel is a technology skill required by 834 O*NET jobs whereas Personal Computers is a tool required by 656 occupations. Similarly, Figs. 4 and 5 show the average values of the scores, calculated over all the O*NET jobs, of the elements of the O*NET entities
Table 1
Dimension of the vector space for the first four O*NET entities.

Entity            Vector dimension
Abilities         52
Knowledge         33
Skills            35
Work Activities   41
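The fixed-size entity vectors of Table 1 and the variable-size, job-specific entities (Task Ratings, Tools Used, Technology Skills) can be held in a simple per-occupation record. The toy Python dictionary below is our own illustration and is heavily abridged: only the Oral Comprehension score quoted above is filled in, and a real record would carry all 52 + 33 + 35 + 41 scored elements plus the job-specific subsets.

occupation_record = {
    "title": "Chief Executives",
    # Fixed-size scored entities (dimensions as in Table 1)
    "abilities": {"Oral Comprehension": 4.5},   # 52 elements in total, scores in [1-6]
    "knowledge": {},                            # 33 elements, scores in [1-7]
    "skills": {},                               # 35 elements, scores in [1-6]
    "work_activities": {},                      # 41 elements, scores in [1-7]
    # Variable-size, job-specific entities
    "task_ratings": {},                         # scores in [1-5]
    "tools_used": [],                           # no score associated
    "technology_skills": [],                    # (name, hot-technology flag), no score
}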
Fig. 4. O*NET "Skills" and "Knowledge" Entities - Average value of the elements.
Fig. 5. O*NET "Abilities" and "Work Activities" Entities - Average value of the elements.
resume (or job posting) and the rows are the elements of the O*NET en
tity which might occur in the job being considered. For example, when
analyzing the entity Abilities for a certain job, the associated similarity
matrix will have 52 rows and a number of columns depending on the in
formation (nouns, sentences, noun phrases) extracted from the resume
or job posting. We fill each cell of the matrix with the similarity value be
tween the extracted information from the resume/job posting (column
𝑐 ) and the elements of the O*NET entities (row 𝑟). From each column,
the system extracts the element with the maximum value, which corre
sponds to the element of the O*NET entity with the highest similarity
value with respect to the extracted element of the resume/job posting
represented in the underlying column. If the similarity value is greater
than an empirically found threshold, the score of the element of the
O*NET entity is added to the resume/job posting score for the current
job.
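The following Python sketch (our own illustration, not the authors' released code; the 0.65 threshold is the one reported below) reproduces this step with numpy: it builds the similarity matrix between the elements of one O*NET entity (rows) and the pieces of information extracted from a resume (columns), takes the maximum of each column, and accumulates the O*NET score of the matched element only when that maximum exceeds the threshold.

import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")
THRESHOLD = 0.65

def entity_score(entity_elements, onet_scores, resume_pieces):
    # entity_elements: element names of one O*NET entity (e.g., the 52 Abilities)
    # onet_scores: O*NET score of each element for the job under analysis
    # resume_pieces: nouns, noun phrases, and sentences extracted from the resume
    rows = model.encode(entity_elements, convert_to_tensor=True)
    cols = model.encode(resume_pieces, convert_to_tensor=True)
    sim = util.cos_sim(rows, cols).cpu().numpy()   # similarity matrix (rows x columns)
    total = 0.0
    for c in range(sim.shape[1]):                  # one column per extracted piece
        r = int(np.argmax(sim[:, c]))              # best-matching entity element
        if sim[r, c] > THRESHOLD:                  # keep only confident matches
            total += onet_scores[r]
    return total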
This threshold was determined after conducting numerous experiments, and it was adjusted to identify the value that maximized classification accuracy. We tested different values and found that 0.65 provided the best balance between correctly identifying relevant O*NET entities and minimizing misclassifications.

Fig. 7. System diagram.
The system then calculates the maximum score obtainable for the current job j and normalizes the score obtained from the resume/job posting by applying the formula:

score(j) = \sum_{i=1}^{num\_entities} \left( \frac{\sum_{k \in entity_i} score_k(j)}{score\_max_i(j)} + 0.5 \cdot \sum_{k \in entity_i} score_k(j) \right)    (1)

where num_entities is equal to seven and corresponds to the entities that have been previously described, score_k(j) is the score of each element (identified in the resume/job posting) of the entity i of the job j that is being checked, whereas score_max_i(j) is the maximum score that the job j would obtain if all the related elements of the entity i were detected in a resume/job posting. The term 0.5 \cdot \sum_{k \in entity_i} score_k(j) acts as a corrective factor, empirically determined to prevent the misjudgment of job categories with a large number of entities associated in O*NET. Without this correction, the formula would only normalize the scores by dividing by the maximum possible score, which could unfairly disadvantage jobs with many associated entities. For example, if a job category has 25 entities associated in O*NET and only 5 entities are matched in the resume, the score would be normalized by dividing by the sum of the 25 scores, yielding a lower score compared to a job with only 10 associated entities in O*NET. This imbalance could lead to the underrepresentation of jobs with a higher number of associated entities, such as IT-related jobs, which are linked to many tools and technical skills in O*NET, even if a significant portion of their relevant entities is detected in the resume. The corrective factor is necessary to address this issue, ensuring that jobs with a large number of associated entities do not receive disproportionately low scores due to the normalization process. We conducted several experiments adjusting this parameter (e.g., testing values like 0.4, 0.3, etc.), and found that 0.5 provided the best balance for correct classification across diverse job types. Changing this value would influence the classification results, as a smaller corrective factor could lead to the misclassification of jobs with many associated entities, while increasing the factor could cause underrepresentation of jobs with fewer associated entities. Therefore, the value of 0.5 was chosen based on its ability to yield the highest accuracy across various datasets. This corrective factor can be considered a hyperparameter, and its optimal value was determined through extensive experimentation.
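A compact Python rendering of Equation (1), written by us for illustration (variable names are hypothetical and the per-entity sums can be produced with the entity_score sketch shown earlier), normalizes the matched scores of each of the seven O*NET entities by the maximum obtainable score and adds the 0.5 corrective term.

def job_score(matched_scores, max_scores, corrective_factor=0.5):
    # matched_scores: for each of the 7 entities, sum of the O*NET scores of the
    #                 elements detected in the resume/job posting for job j
    # max_scores:     for each entity, the score the job would get if every related
    #                 element of that entity had been detected
    total = 0.0
    for matched, max_possible in zip(matched_scores, max_scores):
        if max_possible > 0:   # guard against entities with no elements for this job
            total += matched / max_possible + corrective_factor * matched
    return total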
The described procedure is performed for each job j and returns the first five jobs that have obtained the highest scores. Fig. 7 shows the flow diagram of the algorithm just illustrated.

Fig. 8 shows an example of a similarity matrix for the O*NET entity Knowledge (with 33 rows) with Java Programmer as job and Computers, training, maths, English, sales, administrator, engineer as information extracted from an input resume. The reader should note that in the similarity matrix, for each piece of information extracted from the resume (column), we have a cell with the maximum value on the row corresponding to an element of the Knowledge entity (as previously mentioned, Knowledge has 33 rows). For example, for the information Computers, this maximum value corresponds to the Knowledge element Computers and Electronics, but the corresponding similarity value of 0.6389 is below the threshold (empirically fixed for all the entity values at 0.65). This represents a typical "failure case", where the match is incorrectly excluded despite being highly relevant. The O*NET score of 4.87 (associated with the value Computers and Electronics for the job Java Programmer) will not, therefore, be added to the total score of the resume. The normalized and averaged score of 9.26 (at the bottom of the figure and corresponding to the computation of the member between the parentheses of Equation (1) for the Knowledge entity) will then be summed with the other normalized and averaged scores of the other O*NET entities. The similarity matrices used to calculate the other normalized and averaged scores work in the same way. What changes is the number of rows and columns, according to the elements of the underlying O*NET entity and the information extracted from the underlying resume or job posting.

5. Scenarios

In this section, we will show the two scenarios that we have considered for job matching. They are related to the identification of the best match between a resume and a certain job description. They are also complementary, in the sense that the first one is company-oriented whereas the second one is candidate-oriented.

5.1. First scenario

The input of the first scenario is a resume. Here the task is to identify the O*NET occupations with the highest match with the given resume. This might be very useful for companies that need to quickly assess different resumes. With an automatic tool like this, companies would be able to categorize plenty of resumes with very high precision in a matter of seconds. In such a case, the application of our approach is straightforward. We would first compute the entities of the resume as described in Section 4. Next, we would compute the score of each job as indicated in Equation (1) and would consider the job whose score is the highest.

5.2. Second scenario

The input of the second scenario we have envisaged consists of a resume and a set of job descriptions. Here the goal is to return the job whose description matches the most with the candidate's skills present
Table 2
Inter-agreement evaluation of the three experts on our approach and the two baselines.

                Annotation1   Annotation2

As far as the first scenario is concerned, to check the performances of our approach we used the Resume Dataset version 1¹⁸. The dataset consists of 963 resumes from 25 different categories. An example of a resume is shown in Fig. 9. Several resumes had duplicates and some categories contained only a few of them. To create the test dataset so that we could analyze the performances on its categories, we fixed as a constraint that at least five resumes should occur for each category. Only 21 categories satisfied such a constraint. Therefore, we selected a set of 105 resumes from 21 different categories (5 resumes per category) and ran our method which, for each resume, returned the top five O*NET occupations with the highest match with the underlying resume. The 21 classes were "Data Science", "Human Resources", "Advocate", "Mechanical Engineer", "Sales", "Health and fitness", "Civil Engineer", "Java Developer", "Business Analyst", "SAP Developer", "Automation Testing", "Electrical Engineering", "Python Developer", "DevOps Engineer", "Network Security Engineer", "Database", "Hadoop", "ETL Developer", "DotNet Developer", "Blockchain", "Testing".

To evaluate our method, we performed an extensive manual assessment. The relevance of each suggested job was determined by adopting a similar approach presented by [35]. We rely on a five-point relevance scale and on two annotations: the first one (annotation1) assesses the first returned occupation, whereas the second one (annotation2) assesses the five returned occupations, so that score(annotation2) ≥ score(annotation1).

Three independent recruiters have annotated the two scores returned by our algorithm for the 105 resumes. To decide which annotation to pick, we applied the majority voting algorithm. If three annotations for a particular resume were all different from each other (so that the majority voting algorithm could not be applied) we computed the average and rounded it to the closest integer score (for example, if three scores were 1, 3, and 4 the final score would be round((1 + 3 + 4)/3) = 3). The inter-annotator agreement scores computed according to Fleiss' kappa [36] were 0.37 and 0.27 for annotation1 and annotation2 respectively, as reported in Table 2, indicating a fair/moderate agreement among the annotators, despite the elevated number of classes to pick for each sample. Then, we applied the majority voting and scored the results of the proposed algorithm accordingly. In particular, we obtained an average score of 3.8 for annotation1 and an average score of 4.2 for annotation2, as indicated in Table 3. We compared these results against those obtained using two naive unsupervised approaches which would constitute our baselines.
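The aggregation of the three annotators' scores can be written in a few lines; the sketch below is our illustration of the rule described above (majority voting, with a rounded average as fallback when all three scores differ).

from collections import Counter

def aggregate_scores(scores):
    # scores: the three annotators' relevance scores for one resume, e.g. [1, 3, 4]
    counts = Counter(scores)
    value, frequency = counts.most_common(1)[0]
    if frequency >= 2:                        # majority voting applies
        return value
    return round(sum(scores) / len(scores))   # all different: rounded average

print(aggregate_scores([1, 3, 4]))  # -> 3, as in the example above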
The two baselines do not take into account any entity of the O*NET database and work as follows. First, each input resume is broken down into nouns, noun phrases, and sentences. For each element, we compute the cosine similarity against each of the 21 class names of jobs (baseline1) and against each of the 1,016 O*NET occupation names (baseline2).

18 https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset.
Table 3
Results of our approach and the two baselines.

               Annotation1   Annotation2
Our Approach   3.8           4.2
Baseline1      3.4           4.9
Baseline2      3.0           3.8

Table 4
Inter-agreement score of the three experts on our approach and the two baselines for the second scenario.

               Annotation1   Annotation2
Our Approach   0.69          0.66
Baseline1      0.81          0.63
Baseline2      0.66          0.65
Table 5
Results of our approach and the two baselines for the second scenario.

               Annotation1   Annotation2
Our Approach   4.17          4.95
Baseline1      3.8           4.87
Baseline2      2.03          3.75

In both cases, we return the five job classes with the highest semantic similarity. Those classes have been annotated by the three experts with the same scoring approach mentioned above and a majority voting strategy has been applied. We computed the inter-annotator agreement scores [36] among the three annotators for the baselines as well. The inter-agreement values for the two baselines are reported in Table 2. The reader might notice a much higher inter-agreement value for baseline1 than the others. The reason is that baseline1 returns elements from a set of 21 classes, which are much easier to rate for the annotators with respect to 1,016 classes. Finally, as similarly performed for our approach, we consider the majority voting result that we report in Table 3. We can observe that for baseline1, for the annotation annotation2, the obtained score is very high. This happens for the same reason as the high inter-annotator agreement value mentioned earlier. Because the set of considered classes for this baseline is 21, it is very likely that out of 21 candidate classes, one of the first five returned jobs is correct (equal to the resume label).

6.2. Evaluating scenario 2

For this second scenario, we employed a dataset of 10k job posts of the CareerBuilder job dataset from the UK in 2019¹⁹. A subset of 8,948 of these 10k job posts are in English and belong to 299 distinct categories. Each category might contain one or more jobs. Each job belongs to one category only. To prepare the dataset to be used with our system, we selected the posts belonging to categories with cardinality greater than or equal to 5 (i.e., each job category contained at least five jobs). We then selected 100 resumes (5 for each category) from the dataset used in scenario 1, belonging to the following 20 categories: Data Science, Human Resources, Advocate, Mechanical Engineer, Sales, Health and fitness, Pharmacists, Java Developer, Business Analyst, SAP Developer, Automation Testing, Electrical Engineering, Python Developer, DevOps Engineer, Network Security Engineer, Database, Hadoop, ETL Developer, DotNet Developer, Testing. Then we selected the most semantically similar job post category for each of the 20 resume categories: Computer and Information Research Scientists, Human Resources Specialists, Lawyers, Industrial Engineers, Sales Managers, Health Educators, Pharmacists, Computer Programmers, Management Analysts, Software Developers, Applications, Software Quality Assurance Engineers and Testers, Electrical Engineers, Computer Programmers, Computer Systems Engineers/Architects, Network and Computer Systems Administrators, Database Administrators, Computer Systems Analysts, Software Developers, Applications, Software Developers, Applications, Software Quality Assurance Engineers and Testers.²⁰ Each of the 100 resumes has been associated with 10 different job posts, the last of which belongs to the corresponding resume category. The extraction of job posts was random and each extracted post could occur just once.

Therefore, each sample contained a resume, ten job postings and their related categories included in the CareerBuilder job dataset. For a given sample, the ten job postings included are always of different categories. The last job posting of each sample corresponds to the same category of the resume. The same job posting is never present more than once in the entire collection. Because the related categories of each job posting are not always matched with the occupations found within the O*NET database, we applied our method to each job posting in order to identify the O*NET occupation that matches the related job posting the most. Therefore, for each sample, we also added ten O*NET job categories, each one related to one job posting in the sample.

We defined two baselines. The first one works as follows. Given one sample, it starts by extracting noun phrases, nouns, and sentences from the resume. Then for each extracted element it computes the cosine similarity against the 20 job labels (the job posting categories of the CareerBuilder dataset and those from the O*NET database calculated by using our approach) and returns the five job classes whose semantic similarity with any of the extracted elements is the highest.

The second baseline uses the same O*NET entities we have defined within our approach but without using the scores. From the resume, it identifies all the values for each O*NET entity and builds one binary vector for each entity. Thus, the entity ent will consist of a vector of 0s and 1s: 0 if the i-th feature of ent is not included in the resume, 1 otherwise. Note that we considered all the values of each entity so that vectors of a certain entity will always have the same size. We perform the same operation for the 20 jobs present in each sample (thus we obtain binary entity vectors for each job in the sample) and then compute the Euclidean distance between binary vectors. We take the 5 jobs with the smallest distance with respect to the resume.
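A compact sketch of this second baseline, written by us for illustration with hypothetical inputs, builds the binary presence vectors and ranks the candidate jobs by Euclidean distance; for brevity it uses a single element vocabulary, whereas the baseline described above builds one such vector per O*NET entity.

import numpy as np

def binary_vector(detected_elements, all_elements):
    # 1 if the entity element is present, 0 otherwise; same length for every vector
    return np.array([1 if e in detected_elements else 0 for e in all_elements])

def rank_by_distance(resume_elements, jobs_elements, all_elements, k=5):
    # resume_elements: set of entity elements detected in the resume
    # jobs_elements: dict mapping a job label to the set of its entity elements
    resume_vec = binary_vector(resume_elements, all_elements)
    distances = {job: np.linalg.norm(resume_vec - binary_vector(elems, all_elements))
                 for job, elems in jobs_elements.items()}
    return sorted(distances, key=distances.get)[:k]   # the k closest jobs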
Finally, in a given sample, our approach is applied to the resume by computing the entities using the method already illustrated in Section 4 and looking for the best O*NET job label among the twenty we have.

Three human annotators filled two annotations for the two baselines and our approach. The first annotation assesses the first predicted job post whereas the second annotation assesses the top-5 predicted job posts, similarly as defined within the first scenario. Note that a certain prediction is compared against the list of the 10 original job labels and against the list of the derived 10 O*NET job labels. So, for example, if a baseline contains "Architectural and Engineering Managers" as its first prediction, annotation1 gets a 5 if that category is either the tenth among the original job categories or if that is the O*NET category corresponding to the tenth job description. We computed the final annotations by using a majority voting strategy on the three annotators' scores similarly as performed within the first evaluation scenario. The inter-annotator agreement scores according to Fleiss' kappa [36] among the three annotators are reported in Table 4, indicating a substantial agreement.

The reader should note that such values are much higher than those obtained within scenario 1. The reason is that, in this case, each method had to choose among 10 different jobs only. This low number and the fact that they are different from each other helped the annotators, who were in agreement more often. Table 5 shows the results we have obtained for the three methods, where it

19 https://data.world/promptcloud/jobs-on-careerbuilder-uk.
20 The reader should note that such a list is presented in order of similarity with the list of resume categories shown above. That is, the first element, Computer and Information Research Scientists, is the most semantically similar job category to the resume category Data Science, etc.
is possible to note how ours outperforms the two baselines for both annotation1 and annotation2, showing again its efficiency. The data related to the evaluation that has been carried out for both scenarios is freely available at https://gitlab.com/hri_lab1/using-transformers-and-o-net-to-match-jobs-to-applicants-resumes.

7. Use case: recommending skills to job seekers

Our approach is well-suited for a recommendation task for job seekers. In this section, we will show an extension of our method in this direction. It works as follows. The user of the system, that is a person looking for a job, should just provide his/her resume and a list of job names he/she is interested in. First of all, the job names are matched against the O*NET occupations so that a map can be defined between input job names and O*NET occupations. We already performed a similar computation within the evaluation of the proposed second scenario. Then, for each identified O*NET job, the approach computes a score of the resume against the input O*NET categories according to the Equation (1) introduced in Section 4.2.

7.1. Job seeking and skills recommendation

The reader can find an implementation of this use case at http://192.167.149.11:8000/ under the first demo link. Let us assume that the user gives as an input one job name only, and thus only one job from O*NET is selected. Then the user uploads his/her resume. The system elaborates on it. To assess whether the score obtained from the resume for the requested job makes the resume eligible, we compare it with an experimentally established threshold value of 34.62. The value of 34.62 for eligibility represents the average score that resumes from the correct job category obtained during the testing phase. It was chosen as a natural cutoff, distinguishing eligible candidates from ineligible ones based on the distribution of scores across different classes. This value reflects the typical score achieved by resumes correctly classified, ensuring that only resumes with a high likelihood of being relevant to the job are considered eligible. It is calculated by averaging the scores obtained from all elements of the Resume Dataset used in the previous scenarios, under the assumption (confirmed among the release notes of the dataset) that each resume in that dataset is eligible for the O*NET occupation that was selected.
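The eligibility cutoff can therefore be reproduced as a simple average; the sketch below is our own reading of the procedure (variable and function names are hypothetical), computing the threshold from the Equation (1) scores of resumes matched against the O*NET occupation of their own category.

def eligibility_threshold(labeled_resumes, score_for_job):
    # labeled_resumes: list of (resume_text, correct_onet_occupation) pairs
    # score_for_job(resume_text, occupation): the Equation (1) score of the resume
    scores = [score_for_job(text, occupation) for text, occupation in labeled_resumes]
    return sum(scores) / len(scores)   # e.g., 34.62 on the Resume Dataset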
When the user's resume is not eligible for the input job, the system returns the list of the O*NET entity elements not found within the input resume in decreasing order of importance. For the entities with scores (Abilities, Knowledge, Skills, Work Activities, Task Ratings), this means returning their elements in decreasing order of scores. The other entities (Technical Skills and Tools Used) are returned in alphabetical order (for the technological skills we first return the entries with hot technologies and then the others). For each element, the system indicates the O*NET score, the name of the element, a brief description (when available), and the value that would be added to the overall CV score if it was found in the resume (that is, the value between the parentheses of Equation (1) for a fixed i, k and j). If the user notices that he/she has simply failed to include some of his/her skills or knowledge or some previous work activity, he/she can add it to the resume and resubmit it. Otherwise, he/she would need to get more competencies and, consequently, raise the score of his/her resume over the pre-established threshold. In addition to this, the demonstration also allows us to identify the skills, technical abilities, and other qualifications needed for a job and can recommend job opportunities that align with the user's characteristics (second and third demo).

7.2. Acquiring new skills

To acquire new skills, a system that embeds our approach can cover the different values present in the entities so that the candidate becomes eligible for that specific job. There are several ways that can be employed for such a purpose. The first one is to leverage existing APIs of well-known massive open online course providers such as Udemy21 or Coursera22. In such a case we would simply leverage the search engine of those providers by feeding them with the entities to be mastered by the applicant.

As an example, we submitted for the job Mechanical Engineer a resume from the Resume Dataset belonging to the category Electrical Engineering. Our system returns a score of 19.33, which is below the threshold (34.62) by 15.29 points. Hence, the system does not consider the resume eligible for the Mechanical Engineer job and lists the elements of the O*NET entities that have not been identified in the CV, including for example: Reading comprehension, Active Listening, Mathematics and Critical Thinking. For each item, up to 5 Udemy courses are listed, when available. In our example, the Become Active Reader & Master Your Reading Comprehension Skills course is suggested for the first element, the Giving full attention to what other people are saying, taking time to understand the points being made, asking questions as appropriate, and not interrupting at inappropriate times course for the second, the Math and Perfect your Mathematics skills courses for the third, and the Using logic and reasoning to identify the strengths and weaknesses of alternative solutions, conclusions, or approaches to problems course for the fourth. Then the candidate should read or attend those lectures and improve his/her resume with the acquired competencies. Afterward, by adding the missing elements listed above to the resume and resubmitting the application to the system, the augmented resume obtains a score of 39.85 and passes the eligibility threshold.

Besides relying on online providers, we can also exploit dumps of resources with courses' information [37,38]. For example, one resource that can be leveraged is the COCO [38] dataset, a large collection of courses that includes lessons, teachers, instructors, and learners' ratings collected with the purpose of developing AI-based e-learning applications on top. For the courses, it provides metadata such as short and long descriptions, the necessary requirements, and the expected skills acquired by successfully attending them. These can be leveraged by our system to find out which courses would support an applicant to master new skills by matching them against the detected missing skills of a resume. Our system would suggest a list of courses from the COCO dataset similar to what is performed by e-learning providers' search engines. The advantage of using such a dataset lies in the opportunity to develop customized AI-based applications (for example leveraging transformers or other language models), giving freedom to the developers to build their own solutions to find the best matches with the missing skills.

Finally, with the aim to recommend lectures or articles, our approach can also be easily fed with modern state-of-the-art resources such as the "Academia/Industry DynAmic" (AIDA) [39] or "The Computer Science Knowledge Graph" (CS-KG) [40-43]. AIDA is a collection of metadata about 21M publications and 8M patents categorized within a taxonomy of computer science topics from the Computer Science Ontology (CSO) [44]. CS-KG is an automatically generated knowledge graph that describes the content of 6.7M scientific papers with 10M research entities and 41M relationships among them within the computer science domain. These resources can be exploited to recommend patents and research papers that might be of interest for the applicants to discover where and how state-of-the-art technologies are used to improve their skills and be ready for the job application. Considering the example above, if an applicant has to acquire the skill Critical Thinking for the Mechanical Engineering job, the system can be easily extended to look for research papers that study it within the CS-KG SparQL endpoint23, suggesting to the applicant interesting reads to make his/her application stronger. For example, CS-KG suggests

21 https://www.udemy.com.
22 https://www.coursera.org.
23 https://scholkg.kmi.open.ac.uk/sparql/.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX cskg: <http://scholkg.kmi.open.ac.uk/cskg/resource/>
PREFIX cskg-ont: <http://scholkg.kmi.open.ac.uk/cskg/ontology#>
PREFIX provo: <http://www.w3.org/ns/prov#>
PREFIX cso: <http://cso.kmi.open.ac.uk/schema/cso#>
PREFIX dcterm: <http://purl.org/dc/terms/>
Fig. 10. SparQL query to find papers related to the Critical Thinking skill.
"Serious games on environmental management"24 and "Teaching Flooding Attack to the SDN Data Plane with POGIL"25 as relevant papers to master the Critical Thinking skill. Fig. 10 shows the SparQL query to be executed in the CS-KG SparQL endpoint to retrieve the papers related to the Critical Thinking skill.
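Submitting the query of Fig. 10 to the public CS-KG endpoint can be done with a few lines of Python; the sketch below is ours and uses the SPARQLWrapper library, with a trivial placeholder query body (to retrieve papers related to a given skill, replace it with the query reported in Fig. 10).

from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://scholkg.kmi.open.ac.uk/sparql/")
endpoint.setQuery("""
    # Placeholder query: paste the Fig. 10 query here
    SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding)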
8. Conclusions and future directions

In this paper, we have proposed an approach for job recommendation based on the information extracted from applicants' resumes and job postings. The information is then matched against different O*NET entities using the semantic similarity of deep learning transformers in order to identify the most suitable O*NET job for the underlying resume or job posting. Two scenarios have been considered: one useful for companies with the goal of quickly screening several resumes and the other useful for applicants with the goal of quickly identifying the job posting most suitable to their skills.

For the first scenario, an extensive evaluation on the Resume Dataset version 1 has been carried out considering 105 resumes from 21 different categories. We have defined a scoring mechanism with five different values from 1 to 5. Moreover, we came up with two baselines (one returning the output over the 21 categories and the other returning the output over the 1,016 possible jobs) to compare our approach against, and manually annotated the results of the three methods with the scoring mechanism we defined. The scoring of the three methods has been applied to two outputs (the job with the highest score and the first five jobs with the highest scores). Our approach outperformed the others in three cases out of four.

For the second scenario, we considered 1,000 job postings from the CareerBuilder job dataset and 100 resumes from the Resume Dataset used within the first scenario. With similar scoring metrics as those used within the first scenario, we showed how our method obtained much better results than two more baselines that we defined.

Finally, as a use case of the proposed approach, we discussed a recommendation task for job seekers where, given as input a resume and a list of occupations, the system returns, for each occupation, a list of courses that need to be attended to acquire the missing skills and become eligible for that specific job.

As future directions, our aim is to refine our prototype and create a real system that performs all the tasks we have illustrated in this work, including the use case discussed at the end of the manuscript. We will release the APIs so that everyone will be able to play with them and integrate them into existing systems. The API will also be employed to collect users' feedback and create a gold standard to evaluate the accuracy of the system at scale. We believe this system might be either used standalone or incorporated into well-known platforms for job seekers. One feature we would like to add is the possibility to include a generator of resumes starting from a list of skills and abilities: this feature would leverage new text generation models with the goal of preparing a complete resume in natural language. Furthermore, we would like to study the injection of O*NET database knowledge and transformer models into existing recommendation methodologies (e.g., collaborative filtering) with the goal of creating a full-fledged platform for job recommendation that benefits from domain knowledge using O*NET, transformers, and past candidate experiences.

One of the areas where the authors are continuing their research focuses on providing recommendations for courses and training materials. These recommendations aim to help job seekers acquire the skills or competencies they lack. Additionally, the focus will shift towards applying the lessons learned in the job-seeking domain to worker reskilling initiatives within organizations, especially in the context of Industry 5.0. In this context, LLMs and retrieval-augmented generation (RAG) systems will be explored for the development of study guides based on the organization's training documentation. Another case study will involve adapting the system for document classification. This adaptation will aim to identify the specific skills and abilities that a job seeker or student could improve through the study of the corresponding document. Finally, we plan to explore fine-tuning strategies and further improvements to the system's performance.

CRediT authorship contribution statement

Rubén Alonso: Conceptualization, Funding acquisition, Visualization, Writing - review & editing. Danilo Dessí: Formal analysis, Investigation, Validation, Writing - review & editing. Antonello Meloni: Data curation, Investigation, Software, Writing - original draft. Diego Reforgiato Recupero: Conceptualization, Funding acquisition, Investigation, Methodology, Supervision, Writing - review & editing.

Declaration of competing interest

No conflict of interest exists.

Acknowledgements
Data availability [23] G. Domeniconi, G. Moro, A. Pagliarani, K. Pasini, R. Pasolini, Job recommenda
tion from semantic similarity of linkedln users’ skills, in: Proceedings of the 5th
International Conference on Pattern Recognition Applications and Methods, ICPRAM
Data will be made available on request. 2016, SCITEPRESS - Science and Technology Publications, Lda, Setubal, PRT, 2016,
pp. 270--277.