PatternRank: Leveraging Pretrained Language Models and Part of Speech for Unsupervised Keyphrase Extraction
Tim Schopf, Simon Klimek and Florian Matthes
Department of Informatics, Technical University of Munich, Boltzmannstrasse 3, Garching, Germany
{tim.schopf, simon.klimek, matthes}@tum.de
Keywords: Natural Language Processing, Keyphrase Extraction, Pretrained Language Models, Part of Speech.
Abstract: Keyphrase extraction is the process of automatically selecting a small set of the most relevant phrases from a given text. Supervised keyphrase extraction approaches need large amounts of labeled training data and perform poorly outside the domain of the training data (Bennani-Smires et al., 2018). In this paper, we present PatternRank, which leverages pretrained language models and part-of-speech for unsupervised keyphrase extraction from single documents. Our experiments show PatternRank achieves higher precision, recall, and F1-scores than previous state-of-the-art approaches. In addition, we present the KeyphraseVectorizers package, which allows easy modification of part-of-speech patterns for candidate keyphrase selection, and hence adaptation of our approach to any domain.
Figure 1: PatternRank approach for unsupervised keyphrase extraction. A single text document is used as input for an
initial filtering step where candidate keyphrases are selected which match a defined PoS pattern. Subsequently, the candidate
keyphrases are ranked by a PLM based on their semantic similarity to the input text document. Finally, the top-N keyphrases
are extracted as a concise reflection of the input text document.
Unsupervised keyphrase extraction approaches can be categorized into statistics-based, graph-based, or embedding-based methods, while Tf-Idf is a common baseline used for evaluation (Papagiannopoulou and Tsoumakas, 2019). YAKE uses a set of different statistical metrics including word casing, word position, word frequency, and more to extract keyphrases from text (Campos et al., 2020). TextRank uses PoS filters to extract noun phrase candidates that are added to a graph as nodes, while adding an edge between nodes if the words co-occur within a defined window (Mihalcea and Tarau, 2004). Finally, PageRank (Page et al., 1999) is applied to extract keyphrases. SingleRank expands the TextRank approach by adding weights to edges based on word co-occurrences (Wan and Xiao, 2008). RAKE generates a word co-occurrence graph and assigns scores based on word frequency, word degree, or the ratio of degree and frequency for keyphrase extraction (Rose et al., 2010). Furthermore, Knowledge Graphs can be used to incorporate semantics for keyphrase extraction (Shi et al., 2017). EmbedRank leverages Doc2Vec (Le and Mikolov, 2014) and Sent2Vec (Pagliardini et al., 2018) sentence embeddings to rank candidate keyphrases for extraction (Bennani-Smires et al., 2018). More recently, a PLM-based approach was introduced that uses BERT (Devlin et al., 2019) for self-labeling of keyphrases and subsequent use of the generated labels in an LSTM classifier (Sharma and Li, 2019).

3 KEYPHRASE EXTRACTION APPROACH
Figure 1 illustrates the general keyphrase extraction process of our PatternRank approach. The input consists of a single text document, which is word-tokenized. The word tokens are then tagged with PoS tags. Tokens whose tags match a previously defined PoS pattern are selected as candidate keyphrases. Then, the candidate keyphrases are fed into a PLM to rank them based on their similarity to the input text document. The PLM embeds the entire text document as well as all candidate keyphrases as semantic vector representations. Subsequently, the cosine similarities between the document representation and the candidate keyphrase representations are computed, and the candidate keyphrases are ranked in descending order based on the computed similarity scores. Finally, the top-N ranked keyphrases, which are most representative of the input document, are extracted.

3.1 Candidate Selection with Part of Speech

In previous work, simple noun phrases consisting of zero or more adjectives followed by one or more nouns were used for keyphrase extraction (Mihalcea and Tarau, 2004; Wan and Xiao, 2008; Bennani-Smires et al., 2018). However, we define a more complex PoS pattern to extract candidate keyphrases from the input text document. In our approach, the tags of the word tokens have to match the following PoS pattern in order for the tokens to be considered as candidate keyphrases:

    {.*}{HYPH}{.*}{NOUN}*  |  {VBG}|{VBN}? {ADJ}* {NOUN}+    (1)

The PoS pattern quantifiers correspond to the regular expression syntax. Therefore, we can translate the PoS pattern as arbitrary parts of speech separated by a hyphen, followed by zero or more nouns, OR zero or one verb (gerund or present or past participle), followed by zero or more adjectives, followed by one or more nouns.
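To make the selection step concrete, the following is a minimal sketch of matching this pattern, assuming spaCy and its en_core_web_sm model for tokenization and PoS tagging. The tag_code helper and the single-character tag encoding are our own illustrative simplification, not the KeyphraseVectorizers implementation.

```python
# Illustrative sketch: candidate selection with the Section 3.1 PoS pattern.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def tag_code(token):
    # Map each token to one character so the PoS pattern becomes a plain regex.
    if token.tag_ == "HYPH":
        return "H"
    if token.tag_ in ("VBG", "VBN"):
        return "V"
    if token.pos_ == "ADJ":
        return "A"
    if token.pos_ == "NOUN":
        return "N"
    return "x"  # any other part of speech

# {.*}{HYPH}{.*}{NOUN}*  |  {VBG}|{VBN}? {ADJ}* {NOUN}+
PATTERN = re.compile(r".H.N*|V?A*N+")

def candidate_keyphrases(text):
    doc = nlp(text)
    codes = "".join(tag_code(t) for t in doc)
    # Character offsets in `codes` equal token offsets in `doc`.
    return {doc[m.start():m.end()].text for m in PATTERN.finditer(codes)}

print(candidate_keyphrases(
    "Pre-trained language models enable unsupervised keyphrase extraction."
))
```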
3.2 Candidate Ranking with Pretrained Language Models

Earlier work used graphs (Mihalcea and Tarau, 2004; Wan and Xiao, 2008) or paragraph and sentence embeddings (Bennani-Smires et al., 2018) to rank candidate keyphrases. However, we leverage PLMs based on current transformer architectures, which have recently demonstrated promising results (Grootendorst, 2020), to rank the candidate keyphrases. Therefore, we follow the general EmbedRank (Bennani-Smires et al., 2018) approach for ranking, but use PLMs instead of Doc2Vec (Le and Mikolov, 2014) and Sent2Vec (Pagliardini et al., 2018) to create semantic vector representations of the entire text document as well as all candidate keyphrases. In our experiments, we use SBERT (Reimers and Gurevych, 2019) PLMs since they have been shown to produce state-of-the-art text representations for semantic similarity tasks. Using these semantic vector representations, we rank the candidate keyphrases based on their cosine similarity to the input text document.
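A sketch of this ranking step is shown below, assuming the sentence-transformers library; rank_candidates and top_n are illustrative names rather than a published API.

```python
# Sketch: rank candidate keyphrases by cosine similarity to the document.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # SBERT PLM used in our experiments

def rank_candidates(document, candidates, top_n=5):
    candidates = list(candidates)
    doc_embedding = model.encode(document, convert_to_tensor=True)
    cand_embeddings = model.encode(candidates, convert_to_tensor=True)
    # Cosine similarity between the document and every candidate keyphrase.
    scores = util.cos_sim(doc_embedding, cand_embeddings)[0]
    ranked = sorted(zip(candidates, scores.tolist()),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]  # top-N keyphrases, most representative first
```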
4 EXPERIMENTS

In this section, we compare four different approaches for unsupervised keyphrase extraction in the scholarly domain.
4.1 Data

In our experiments, we use the Inspec dataset (Hulth, 2003), which consists of 2,000 English computer science abstracts collected from scientific journal articles between 1998 and 2002. Each abstract is assigned two different types of keyphrases. First, controlled, manually assigned keyphrases that appear in the thesaurus of the Inspec dataset but do not necessarily have to appear in the abstract. Second, uncontrolled keyphrases that are freely assigned by professional indexers and are not restricted to either the thesaurus or the abstract. In our experiments, we consider the union of both types of keyphrases as the ground truth.
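For reference, the following loading sketch assumes the community-hosted midas/inspec dataset on the Hugging Face Hub, with its "raw" configuration and the field names published there; these names are an assumption about that mirror, not part of the original Hulth (2003) distribution.

```python
# Hypothetical sketch: loading the Inspec abstracts and gold keyphrases,
# assuming the "midas/inspec" Hub dataset and its published schema.
from datasets import load_dataset

inspec = load_dataset("midas/inspec", "raw")
sample = inspec["test"][0]
abstract = " ".join(sample["document"])  # the tokenized abstract text
# The union of both keyphrase types serves as the ground truth.
gold = sample["extractive_keyphrases"] + sample["abstractive_keyphrases"]
print(len(gold), gold[:3])
```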
4.2 Evaluation

For evaluation, we compare the performances of four different keyphrase extraction approaches.

YAKE: is a fast and lightweight approach for unsupervised keyphrase extraction from single documents based on statistical features (Campos et al., 2020).

SingleRank: applies a ranking algorithm to word co-occurrence graphs for unsupervised keyphrase extraction from single documents (Wan and Xiao, 2008).

KeyBERT: ranks simple n-grams as candidate keyphrases rather than word tokens that match a certain PoS pattern, as in our PatternRank approach (Grootendorst, 2020). For the KeyBERT experiments, we use the all-mpnet-base-v2 SBERT model for candidate keyphrase ranking and an n-gram range of [1, 3] for candidate keyphrase selection. This means that n-grams consisting of 1, 2, or 3 words are selected as candidate keyphrases.

PatternRank: to select candidate keyphrases, we developed the KeyphraseVectorizers package, which allows custom PoS patterns to be defined and returns matching candidate keyphrases. We evaluate two different versions of the PatternRank approach. PatternRankNP selects simple noun phrases as candidate keyphrases and PatternRankPoS selects word tokens whose PoS tags match the pattern defined in Section 3.1. In both cases, the all-mpnet-base-v2 SBERT model is used for candidate keyphrase ranking.
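A usage sketch of these setups with KeyBERT and our KeyphraseVectorizers package is shown below; the example document is illustrative. The pos_pattern shown selects simple noun phrases (PatternRankNP); the Section 3.1 pattern can be supplied analogously once expressed in the vectorizer's tag-pattern syntax.

```python
# Usage sketch: PatternRank via KeyBERT and KeyphraseVectorizers, plus the
# plain n-gram KeyBERT baseline used in our experiments.
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

doc = ("Supervised keyphrase extraction approaches need large amounts of "
       "labeled training data and perform poorly outside the domain of "
       "the training data.")

kw_model = KeyBERT(model="all-mpnet-base-v2")

# PatternRankNP: zero or more adjectives followed by one or more nouns.
np_vectorizer = KeyphraseCountVectorizer(pos_pattern="<J.*>*<N.*>+")
print(kw_model.extract_keywords(doc, vectorizer=np_vectorizer))

# KeyBERT baseline: plain n-grams of length 1 to 3 as candidates instead.
print(kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3)))
```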
We evaluate the models based on exact match, partial match, and the average of exact and partial match. For each approach, we report Precision@N, Recall@N, and F1@N scores, using the top-N extracted keyphrases respectively. The gold keyphrases always remain the entire set of all manually assigned keyphrases, regardless of N. Additionally, we lowercase the gold keyphrases as well as the extracted keyphrases and remove duplicates. We follow the approach of Rousseau and Vazirgiannis (2015) and calculate Precision@N, Recall@N, and F1@N scores per document and then use the macro-average at the collection level for evaluation. The exact match approach yields true positives only for extracted keyphrases that have an exact string match to one of the gold keyphrases. However, this evaluation approach penalizes keyphrase extraction methods which predict keyphrases that are syntactically different from the gold keyphrases but semantically similar (Rousseau and Vazirgiannis, 2015; Wang et al., 2015). The partial match approach converts gold keyphrases as well as extracted keyphrases to unigrams and yields true positives if the extracted unigram keyphrases have a string match to one of the unigram gold keyphrases (Rousseau and Vazirgiannis, 2015). The drawback of the partial match evaluation approach, however, is that it rewards methods which predict keyphrases that occur in the unigram gold keyphrases but are not appropriate for the corresponding document (Papagiannopoulou and Tsoumakas, 2019). For empirical comparison of keyphrase extraction approaches, we therefore also report the average of the exact and partial matching results.
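The following is a minimal sketch of this evaluation protocol for the exact match case, with a helper for the unigram conversion used in partial matching; the function names are illustrative.

```python
# Sketch of the per-document metrics, macro-averaged over the collection.
def exact_match_scores(extracted, gold, n):
    # Lowercase, deduplicate (order-preserving), and keep the top-N extractions.
    extracted = list(dict.fromkeys(k.lower() for k in extracted))[:n]
    gold = {k.lower() for k in gold}
    tp = sum(1 for k in extracted if k in gold)  # exact string matches only
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def to_unigrams(keyphrases):
    # Partial match compares sets of unigrams instead of full keyphrases.
    return {w for k in keyphrases for w in k.lower().split()}

def macro_average(per_document_scores):
    # per_document_scores: one (precision, recall, f1) tuple per document.
    n_docs = len(per_document_scores)
    return tuple(sum(s[i] for s in per_document_scores) / n_docs for i in range(3))
```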
Table 1: Precision@N (P@N), Recall@N (R@N), and F1@N scores in percent for N = 5, 10, and 20, evaluated with exact match (top), partial match (middle), and the average of exact and partial match (bottom).

Exact match
Approach        P@5    R@5    F1@5   P@10   R@10   F1@10  P@20   R@20   F1@20
SingleRank      38.11  16.55  21.97  33.29  27.27  28.55  27.24  38.84  30.80
KeyBERT         12.97   6.08   7.82  11.42  10.53  10.30   9.75  17.14  11.76
PatternRankNP   41.15  18.09  23.92  34.60  28.33  29.66  25.88  36.69  29.19
PatternRankPoS  41.76  18.44  24.35  36.10  29.63  30.99  27.80  39.42  31.37

Partial match
YAKE            77.45  19.49  29.91  68.20  33.46  42.67  59.69  45.58  48.69
SingleRank      75.54  19.36  29.56  68.63  33.98  43.24  58.82  53.68  53.68
KeyBERT         77.48  20.06  30.55  65.78  32.90  41.67  57.11  45.37  48.34
PatternRankNP   83.64  21.93  33.29  75.27  37.62  47.69  62.78  56.69  57.03
PatternRankPoS  82.49  21.61  32.79  74.79  37.50  47.48  63.21  57.66  57.71

Average of exact and partial match
YAKE            51.81  15.60  22.64  44.54  25.96  30.59  38.07  36.68  34.13
SingleRank      56.83  17.96  25.77  50.96  30.63  35.90  43.03  46.26  42.24
KeyBERT         45.23  13.07  19.19  38.60  21.72  25.99  33.43  31.23  30.05
PatternRankNP   62.40  20.01  28.61  54.94  32.98  38.68  44.33  46.69  43.11
PatternRankPoS  62.13  20.03  28.57  55.45  33.57  39.24  45.51  48.54  44.54
The results of our evaluation are shown in Table 1. We can see that our PatternRank approach outperforms all other approaches across all benchmarks. In general, both approaches PatternRankNP and PatternRankPoS perform fairly similarly, whereas PatternRankPoS produces slightly better results in most cases. In the exact match evaluation, PatternRankPoS consistently achieves the best results of all approaches. Furthermore, PatternRankPoS also yields the best results in the average match evaluation for N = 10 and 20. In the partial match evaluation, the PatternRankNP approach marginally outperforms the PatternRankPoS approach and yields the best results for N = 5 and 10. However, as mentioned earlier, the partial match evaluation approach may wrongly reward methods which extract keyphrases that occur in the unigram gold keyphrases but are not appropriate for the corresponding document. Since the PatternRankPoS approach outperforms the PatternRankNP approach in the more important exact match and average match evaluations, we argue that selecting candidate keyphrases based on the PoS pattern defined in Section 3.1 instead of simple noun phrases helps to extract keyphrases predominantly occurring in the scholarly domain. In contrast, skipping the PoS pattern-based candidate keyphrase selection step results in a significant performance decline. KeyBERT uses the same PLM to rank the candidate keyphrases as PatternRank, but uses simple n-grams for candidate keyphrase selection instead of PoS patterns or noun phrases. As a result, the KeyBERT approach consistently performs worst among all approaches. As expected, YAKE was the fastest keyphrase extraction approach because it is a lightweight method based on statistical features. However, the extracted keyphrases are not very accurate, and in comparison to PatternRank, YAKE performs significantly worse in all evaluations. SingleRank is the only approach that achieves competitive results compared to PatternRank. Nevertheless, it consistently performs a few percentage points worse than PatternRank across all evaluations. We therefore conclude that our PatternRank approach achieves state-of-the-art keyphrase extraction results, especially in the scholarly domain.
5 CONCLUSION

We presented the PatternRank approach, which leverages PLMs and PoS for unsupervised keyphrase extraction. We evaluated our approach against three different keyphrase extraction methods: one statistics-based approach, one graph-based approach, and one PLM-based approach. The results show that the PatternRank approach performs best in terms of precision, recall, and F1-score across all evaluations. Furthermore, we evaluated two different PatternRank versions. PatternRankNP selects simple noun phrases as candidate keyphrases and PatternRankPoS selects word tokens whose PoS tags match the pattern defined in Section 3.1. While PatternRankPoS produced better results in the majority of cases, PatternRankNP still performed very well in all benchmarks. We therefore conclude that the PatternRankPoS approach works particularly well in the evaluated scholarly domain. Furthermore, since the use of noun phrases as candidate keyphrases is a more general and domain-independent approach, we propose using PatternRankNP as a simple but effective keyphrase extraction method for arbitrary domains. Future work may investigate how the PLM and PoS pattern used in this approach can be adapted to different domains or languages.
REFERENCES

Augenstein, I., Das, M., Riedel, S., Vikraman, L., and McCallum, A. (2017). SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 546-555, Vancouver, Canada. Association for Computational Linguistics.

Bennani-Smires, K., Musat, C., Hossmann, A., Baeriswyl, M., and Jaggi, M. (2018). Simple unsupervised keyphrase extraction using sentence embeddings. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 221-229, Brussels, Belgium. Association for Computational Linguistics.

Braun, D., Klymenko, O., Schopf, T., Kaan Akan, Y., and Matthes, F. (2021). The language of engineering: Training a domain-specific word embedding model for engineering. In 2021 3rd International Conference on Management Science and Industrial Engineering, MSIE 2021, page 8-12, New York, NY, USA. Association for Computing Machinery.

Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., and Jatowt, A. (2020). YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509:257-289.

Caragea, C., Bulgarov, F. A., Godea, A., and Das Gollapalli, S. (2014). Citation-enhanced keyphrase extraction from research papers: A supervised approach. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1435-1446, Doha, Qatar. Association for Computational Linguistics.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Grootendorst, M. (2020). KeyBERT: Minimal keyword extraction with BERT.

Hulth, A. (2003). Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, EMNLP '03, page 216-223, USA. Association for Computational Linguistics.

Hulth, A. and Megyesi, B. B. (2006). A study on automatically extracted keywords in text categorization. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 537-544, Sydney, Australia. Association for Computational Linguistics.

Kim, S. N., Medelyan, O., Kan, M.-Y., and Baldwin, T. (2012). Automatic keyphrase extraction from scientific articles.

Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In Xing, E. P. and Jebara, T., editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1188-1196, Beijing, China. PMLR.

Meng, R., Zhao, S., Han, S., He, D., Brusilovsky, P., and Chi, Y. (2017). Deep keyphrase generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 582-592, Vancouver, Canada. Association for Computational Linguistics.

Mihalcea, R. and Tarau, P. (2004). TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404-411, Barcelona, Spain. Association for Computational Linguistics.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. In WWW 1999.

Pagliardini, M., Gupta, P., and Jaggi, M. (2018). Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 528-540, New Orleans, Louisiana. Association for Computational Linguistics.

Papagiannopoulou, E. and Tsoumakas, G. (2019). A review of keyphrase extraction. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10.

Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982-3992, Hong Kong, China. Association for Computational Linguistics.

Rose, S. J., Engel, D. W., Cramer, N., and Cowley, W. (2010). Automatic keyword extraction from individual documents.

Rousseau, F. and Vazirgiannis, M. (2015). Main core retention on graph-of-words for single-document keyword extraction. In Hanbury, A., Kazai, G., Rauber, A., and Fuhr, N., editors, Advances in Information Retrieval, pages 382-393, Cham. Springer International Publishing.

Schopf, T., Braun, D., and Matthes, F. (2021). Lbl2Vec: An embedding-based approach for unsupervised document retrieval on predefined topics. In Proceedings of the 17th International Conference on Web Information Systems and Technologies - WEBIST, pages 124-132. INSTICC, SciTePress.

Schopf, T., Weinberger, P., Kinkeldei, T., and Matthes, F. (2022). Towards bilingual word embedding models for engineering: Evaluating semantic linking capabilities of engineering-specific word embeddings across languages. In 2022 4th International Conference on Management Science and Industrial Engineering (MSIE), MSIE 2022, page 407-413, New York, NY, USA. Association for Computing Machinery.

Sharma, P. and Li, Y. (2019). Self-supervised contextual keyword and keyphrase retrieval with self-labelling.

Shi, W., Zheng, W., Yu, J. X., Cheng, H., and Zou, L. (2017). Keyphrase extraction using knowledge graphs. Data Science and Engineering, 2:275-288.

Song, M., Song, I. Y., Allen, R. B., and Obradovic, Z. (2006). Keyphrase extraction-based query expansion in digital libraries. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '06, page 202-209, New York, NY, USA. Association for Computing Machinery.

Wan, X. and Xiao, J. (2008). CollabRank: Towards a collaborative approach to single-document keyphrase extraction. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 969-976, Manchester, UK. Coling 2008 Organizing Committee.

Wang, R., Liu, W., and McDonald, C. (2015). Using word embeddings to enhance keyword identification for scientific publications. In Sharaf, M. A., Cheema, M. A., and Qi, J., editors, Databases Theory and Applications, pages 257-268, Cham. Springer International Publishing.

Zhang, Y., Zincir-Heywood, N., and Milios, E. (2004). World wide web site summarization.