
Journal of Intelligent Systems 2021; 30: 808–815

Research Article

Jinye Li*

A comparative study of keyword extraction algorithms for English texts
https://doi.org/10.1515/jisys-2021-0040
received March 14, 2021; accepted June 15, 2021

Abstract: This study mainly analyzed the keyword extraction of English text. First, two commonly used
algorithms, the term frequency–inverse document frequency (TF–IDF) algorithm and the keyphrase extrac-
tion algorithm (KEA), were introduced. Then, an improved TF–IDF algorithm was designed, which
improved the calculation of word frequency, and it was combined with the position weight to improve
the performance of keyword extraction. Finally, 100 English texts were selected from the British
Academic Written English Corpus for the analysis experiment. The results showed that the improved
TF–IDF algorithm had the shortest running time, taking only 4.93 s to process 100 texts, and that the
precision of all algorithms decreased as the number of extracted keywords increased. The comparison
among the three algorithms demonstrated that the improved TF–IDF algorithm had the best performance,
with a precision of 71.2%, a recall of 52.98%, and an F1 score of 60.75% when five keywords were
extracted from each article. The experimental results show that the improved TF–IDF algorithm is effective
in extracting keywords from English texts and can be further promoted and applied in practice.

Keywords: English text, keyword extraction, TF–IDF algorithm, KEA

1 Introduction
With the development of society, information is expressed in more and more ways, among which
natural language text is the most important [1] and one of the largest information sources [2]. With the
rapid growth of textual information, processing and retrieving these massive texts has become an
increasingly important problem [3]. Information retrieval [4], text classification [5], sentiment analysis
[6], and topic identification [7] have therefore received wide attention from researchers. In information
retrieval, users find the relevant web pages by entering keywords. However, if the keywords entered by
users are inaccurate, or the keywords do not appear on the corresponding page, the retrieval
effect is greatly diminished. Keywords are therefore very important in text processing [8].
Beliga et al. [9] reviewed supervised and unsupervised keyword extraction methods,
analyzed and compared various graph-based methods, and encouraged the development of new graph-
based methods for keyword extraction. Biswas et al. [10] studied keyword extraction from Twitter,
proposed an unsupervised graph-based method that extracts keywords using collective node weight,
carried out experiments on five data sets, and found that the method outperformed other methods. Hu
et al. [11] designed a patent keyword extraction algorithm based on a distributed skip-gram model and
conducted experiments on standard data sets and self-made data sets. They found that the method was
useful in extracting keywords from patent texts. Zhang et al. [12] studied the TextRank algorithm and
conducted experiments on Hulth 2003 and Krapivin 2009 datasets. The results showed that the TextRank


* Corresponding author: Jinye Li, Department of Applied Foreign Languages, Dongguan Polytechnic, No. 3, Daxue Road,
Dongguan, Guangdong 523808, China, e-mail: [email protected]

Open Access. © 2021 Jinye Li, published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0
International License.

algorithm had an excellent performance and that the results were independent of the text length. Machine
learning is a branch of artificial intelligence whose theory and methods have been extensively applied to
complex problems in engineering and science; it also performs well in natural language processing [13].
With the explosive growth of information, finding the needed information rapidly has become more and
more difficult; thus, research on keyword extraction becomes increasingly important. This study compared
three keyword extraction algorithms, the term frequency–inverse document frequency (TF–IDF) algorithm,
the keyphrase extraction algorithm (KEA), and an improved TF–IDF algorithm, and performed an experimental
analysis on a corpus to verify the effectiveness of the improved algorithm. The improved TF–IDF algorithm
improves the extraction of keywords, enables effective information mining, and supports tasks such as text
retrieval and classification, providing a more efficient method for information processing. In practical
applications, the improved TF–IDF algorithm can be used to extract keywords from English texts, serving the
sorting and retrieval of English questions and English literature and enhancing the efficiency and quality
of text processing.

2 Common keyword extraction algorithms


The TF–IDF algorithm is a classic keyword extraction method [14] that evaluates the importance of a word
or phrase to a text. The importance depends on two factors, TF and IDF. TF refers to the frequency with
which a word appears in the document; the higher the frequency, the more important the word. IDF reflects
how widely a word is used across the corpus that contains the document: if the word appears in many texts
of the corpus, it is not highly representative and is of low importance; if it appears in few texts, it is
highly representative and important. Suppose that there is a text $d_i$ whose keywords need to be extracted.
After preprocessing, such as word segmentation, its content is regarded as a set of feature words, expressed
as $d_i = (t_1, t_2, \ldots, t_j, \ldots, t_D)$. The importance of every feature word $t_j$ in text $d_i$ is
written as weight $w_j$, which is calculated from the TF and IDF of the feature word:

$$\text{TF-IDF}(d_i, t_j) = \text{TF}(d_i, t_j) \times \text{IDF}(t_j) = \text{TF}(d_i, t_j) \times \lg\frac{M}{\text{DF}(t_j)}, \tag{1}$$
where $\text{TF}(d_i, t_j)$ stands for the frequency of appearance of $t_j$ in $d_i$, $\text{DF}(t_j)$ stands for the number
of texts in the text set in which $t_j$ appears, and $M$ stands for the total number of texts in the text set.
After the weights are calculated, the feature words of $d_i$ are sorted by $w_j$, i.e., $\text{Sort}(d_i)$, and the
top $k$ feature words are taken as the final keywords.
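
As a concrete illustration, the following Python sketch implements the steps above: raw term frequency, the document-frequency statistic, the weight of equation (1), and the final sorting and top-k selection. It is a minimal sketch assuming whitespace tokenization, not the implementation used in the experiments.

```python
import math
from collections import Counter

def tf_idf_keywords(texts, k=5):
    """Rank each text's words by the TF-IDF weight of equation (1)
    and return the top k words per text."""
    docs = [t.lower().split() for t in texts]   # naive whitespace tokenization
    M = len(docs)                               # number of texts in the text set
    # DF(t): number of texts in which term t appears at least once
    df = Counter(word for doc in docs for word in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)                       # TF(d_i, t_j): raw counts in this text
        w = {t: tf[t] * math.log10(M / df[t]) for t in tf}
        results.append(sorted(w, key=w.get, reverse=True)[:k])  # Sort(d_i), top k
    return results
```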
KEA targets English texts [15] and predicts good keywords through machine learning. Its principle is to
train a model on documents with manually marked keywords using a naive Bayes classifier [16] and then to
extract keywords from new documents with the trained model. It mainly uses two features. One is the TF–IDF
value, and the other is the first appearance position of a word: a feature word that appears in the title,
abstract, or introduction of a text is more likely to be a keyword. The first appearance position is
calculated as the ratio of the number of words preceding the feature word's first occurrence in the text to
the total number of words in the text. For training the Bayesian model, the text to be classified is
expressed as follows:
$$x = \{a_1, a_2, \ldots, a_m\}. \tag{2}$$

The category collection is expressed as follows:


$$C = \{y_1, y_2, \ldots, y_n\}. \tag{3}$$

During training, the prior probability of every category, $P(y_i)$, and the conditional probability of every
feature $a_j$ given a category, $P(a_j \mid y_i)$, are estimated. In the application stage, the posterior
probability that a new sample $x$ belongs to every category $y_i$ is computed, which under the naive
independence assumption is proportional to $P(y_i) \prod_j P(a_j \mid y_i)$; $x$ is assigned to the category
$y_i$ with the highest posterior probability.
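
The following Python sketch illustrates KEA's two-feature representation and the train-then-extract workflow. It is illustrative only: the classifier here is scikit-learn's GaussianNB, whereas the original KEA discretizes the features and applies a discrete naive Bayes model, and the toy values below are assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def kea_features(tokens, tfidf):
    """KEA's two features per candidate word: its TF-IDF value and the
    relative position of its first occurrence in the document."""
    total = len(tokens)
    vocab = sorted(set(tokens))
    # tokens.index(w) counts the words before the first occurrence of w
    feats = [[tfidf[w], tokens.index(w) / total] for w in vocab]
    return np.array(feats), vocab

# Training stage (toy data): rows are candidates from labeled documents;
# y = 1 if the candidate was a manually marked keyword, else 0.
X_train = np.array([[0.90, 0.02], [0.70, 0.05], [0.10, 0.60], [0.05, 0.85]])
y_train = np.array([1, 1, 0, 0])
model = GaussianNB().fit(X_train, y_train)

# Extraction stage: rank the candidates of a new document by keyword probability.
doc = "solar power systems convert solar radiation into electric power".split()
X_new, vocab = kea_features(doc, tfidf={w: 0.5 for w in doc})  # dummy TF-IDF values
scores = model.predict_proba(X_new)[:, 1]
top3 = [vocab[i] for i in np.argsort(scores)[::-1][:3]]
```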

3 Improved TF–IDF algorithm


To improve the performance of the TF–IDF algorithm, this study improved it in two aspects. First, in the
calculation of TF, the original TF–IDF algorithm treats all words equally and assigns a higher weight
whenever a word appears more frequently. In actual English texts, however, two problematic situations
arise: a word that appears frequently may be given a high weight even though it is an everyday word with
little power to discriminate the text, while a rare word that seldom appears may be given a low weight even
though it contributes greatly to distinguishing the text. Given these situations, this study improved the
calculation of TF by adjusting the weight according to the relative number of appearances of a word, as
expressed by the following formula:

$$\text{TF} = e^{N_{d,t} - \bar{N}_t}, \tag{4}$$

where $N_{d,t}$ stands for the number of appearances of keyword $t$ in text $d$ and $\bar{N}_t$ stands for
the average number of appearances of $t$ per text in the text set. If a word appears in a text fewer times
than its average, the TF value is smaller than 1 and the word's weight decreases; otherwise, it increases.
In this way, the influence of common words and rare words on the weight is reduced.
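
Equation (4) can be transcribed directly; the short sketch below uses hypothetical argument names (n_dt for N_{d,t}, avg_nt for the corpus average) and demonstrates the damping and boosting behavior described above.

```python
import math

def improved_tf(n_dt, avg_nt):
    """Equation (4): TF = exp(N_{d,t} - avg N_t).
    n_dt   - number of occurrences of term t in document d
    avg_nt - average occurrences of t per document over the text set"""
    return math.exp(n_dt - avg_nt)

# Below-average occurrence -> TF < 1, so the word's weight is reduced;
# above-average occurrence -> TF > 1, so the weight is increased.
print(improved_tf(2, 5.0))   # ~0.0498
print(improved_tf(8, 5.0))   # ~20.09
```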
For an English text, candidate keywords may appear in the title, introduction, main body, and so on. This
study divides the positions of keywords into three types. The first is the title. The title summarizes the
content and theme of the text; a word appearing in the title is most likely to be a final keyword and has
the highest importance. The second is the chapter title. Chapter titles summarize the content of a chapter,
and candidate keywords appearing in a chapter title are also very important. The third is the main body.
The main body contains much content; therefore, keywords are relatively likely to appear there, but their
importance is slightly lower than that of words in the title and chapter titles.
After the TF–IDF value and the position weight $p_i$ are calculated, the feature value of a candidate
keyword is the product of its position weight and its TF–IDF value. The position weights were estimated as
follows. Suppose there is a text of about 10,000 words whose keywords have been labeled; the title, chapter
titles, and main body of the text are named text 1, text 2, and text 3, respectively, and the occurrences of
labeled keywords in these parts are named place 1, place 2, and place 3, respectively. The frequency of
keywords in each position is calculated as follows:
$$p = \frac{m}{n}, \tag{5}$$
where $m$ refers to the number of keyword occurrences in a given part of the content and $n$ refers to the
total number of words in that part. With the statistical text information shown in Table 1, the
probabilities of keywords appearing in the title, chapter titles, and main body are:

Table 1: Statistical results of text information

Name      Frequency of words      Name      Frequency of words
Place 1   15                      Text 1    76
Place 2   29                      Text 2    498
Place 3   102                     Text 3    9,426

$$P_1 = \text{place 1} / \text{text 1} = 15/76 = 0.1974, \tag{6}$$
$$P_2 = \text{place 2} / \text{text 2} = 29/498 = 0.0582, \tag{7}$$
$$P_3 = \text{place 3} / \text{text 3} = 102/9426 = 0.0108. \tag{8}$$

According to these results, the position weights of keywords in the title, chapter title, and main body can
be set as 19.7%, 5.8%, and 1.1%, respectively.
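
Combining the pieces, a candidate's final score is its position weight times its TF–IDF value. The sketch below hard-codes the weights just derived; the text does not specify how a word occurring in several positions combines them, so the sketch scores each occurrence position independently (an assumption).

```python
# Position weights estimated above from Table 1 (title, chapter title, main body).
POSITION_WEIGHTS = {"title": 0.197, "chapter_title": 0.058, "body": 0.011}

def candidate_score(tf, idf, position):
    """Improved feature value: position weight x TF-IDF,
    with TF in the exponential form of equation (4)."""
    return POSITION_WEIGHTS[position] * tf * idf

# At equal TF-IDF, a title occurrence outranks a body occurrence:
print(candidate_score(tf=1.5, idf=2.0, position="title"))  # 0.591
print(candidate_score(tf=1.5, idf=2.0, position="body"))   # 0.033
```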

4 Experimental analysis
In this study, 100 English texts were randomly selected from the British Academic Written English
(BAWE) Corpus [17] as experimental data. The 100 texts carried a total of 672 author-assigned keywords.
The subject distribution of the texts is presented in Table 2.

Table 2: Experimental data

Subject Number of documents

Biology 12
Military 6
Power engineering 10
Electrical engineering 9
Chemical engineering 7
Mechanical engineering 11
Water conservancy project 8
Mineral engineering 10
Natural science 15
Architecture science 12

The texts were preprocessed with StopAnalyzer, an analyzer designed for English texts. It filters out
whitespace, normalizes letter case, and removes stop words. After preprocessing, the remaining words in a
text are mainly nouns and verbs, which is more conducive to keyword extraction.
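
StopAnalyzer is a Lucene (Java) analyzer; an equivalent preprocessing pass can be sketched in Python as follows. The stop-word list here is NLTK's English list, an assumption, since the exact list used in the experiment is not stated.

```python
import re
from nltk.corpus import stopwords   # requires nltk.download("stopwords") once

STOP = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, keep only alphabetic tokens, and drop stop words,
    mirroring the filtering described for StopAnalyzer."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP]

print(preprocess("The improved TF-IDF algorithm extracts the keywords."))
# -> ['improved', 'tf', 'idf', 'algorithm', 'extracts', 'keywords']
```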
English papers generally provide three to six keywords and no more than 15. To better assess the
performance of the algorithms, the number of keywords to be extracted was set to 5–30.
The possible outcomes of keyword extraction are categorized in Table 3, and the algorithms were evaluated
with three indicators computed from these counts, as shown below.

Table 3: Results of keyword extraction

                                              Manually labeled as a keyword    Not manually labeled as a keyword
Extracted as a keyword by the algorithm       A                                B
Not extracted as a keyword by the algorithm   C                                D

(1) Precision:
$$P = \frac{A}{A + B}. \tag{9}$$
(2) Recall rate:
$$R = \frac{A}{A + C}. \tag{10}$$
(3) F1 score:
$$F1 = \frac{2PR}{P + R}. \tag{11}$$
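
Given the sets of extracted and manually labeled keywords for a text, the three indicators follow directly from the counts in Table 3; a minimal sketch:

```python
def evaluate(extracted, labeled):
    """Precision, recall, and F1 from the contingency counts of Table 3.
    extracted - set of keywords returned by an algorithm for a text
    labeled   - set of manually assigned keywords for that text"""
    a = len(extracted & labeled)                    # A: extracted and labeled
    p = a / len(extracted) if extracted else 0.0    # equation (9): A / (A + B)
    r = a / len(labeled) if labeled else 0.0        # equation (10): A / (A + C)
    f1 = 2 * p * r / (p + r) if p + r else 0.0      # equation (11)
    return p, r, f1

# Example: 5 extracted keywords, 3 of which match the 4 labeled ones.
print(evaluate({"corpus", "keyword", "tfidf", "text", "model"},
               {"corpus", "keyword", "tfidf", "extraction"}))
# -> (0.6, 0.75, 0.666...)
```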

The performance of three algorithms, TF–IDF, KEA, and improved TF–IDF algorithms, in extracting
keywords was compared. First, the running time of the algorithms under different numbers of texts is
presented in Table 4.

Table 4: Comparison of the running time between algorithms

Number of texts TF–IDF algorithm (s) KEA algorithm (s) Improved TF–IDF algorithm (s)

10 1.12 1.36 0.78
20 1.56 1.89 1.21
30 2.08 2.21 1.79
40 2.51 2.78 2.33
50 3.01 3.21 2.64
60 3.46 3.81 2.89
70 3.98 4.23 3.46
80 4.31 4.76 3.89
90 4.86 5.21 4.21
100 5.27 5.73 4.93

Table 4 shows that the running time of all algorithms increased gradually with the number of texts. The
running time of the KEA algorithm was the longest, followed by the TF–IDF algorithm and then the improved
TF–IDF algorithm. When the number of texts was ten, the running time of the improved TF–IDF algorithm was
30.36% shorter than that of the TF–IDF algorithm and 42.65% shorter than that of the KEA algorithm; when
the number of texts was 100, it was 6.45% shorter than that of the TF–IDF algorithm and 13.96% shorter than
that of the KEA algorithm. The main reason for this result was that the KEA algorithm required a training
stage.
The extraction results of the algorithms when the number of keywords extracted was different are
presented in Table 5.

Table 5: Results of keyword extraction under different algorithms

Algorithm            Keywords extracted per text    Total keywords extracted    Correctly extracted keywords
TF–IDF algorithm     5                              500                         198
                     10                             1,000                       241
                     15                             1,500                       307
                     20                             2,000                       322
                     25                             2,500                       356
                     30                             3,000                       409
KEA algorithm        5                              500                         219
                     10                             1,000                       263
                     15                             1,500                       321
                     20                             2,000                       398
                     25                             2,500                       411
                     30                             3,000                       489
Improved TF–IDF      5                              500                         243
algorithm            10                             1,000                       289
                     15                             1,500                       367
                     20                             2,000                       421
                     25                             2,500                       478
                     30                             3,000                       563

According to the results of keyword extraction, the P value, R value, and F1 value of the algorithms were
calculated, and the results are shown in Table 6.
Table 6 shows that the P value of the algorithms decreased as more keywords were extracted from each text.
The decrease of the P value was likely because the number of keywords originally marked in the texts was
far smaller than the number of keywords extracted by the algorithms.

Table 6: Performance comparison between algorithms

Algorithm            Keywords extracted per text    Precision P (%)    Recall R (%)    F1 score (%)
TF–IDF algorithm     5                              57.80              43.01           49.32
                     10                             30.80              45.83           36.84
                     15                             21.40              47.77           29.56
                     20                             16.75              49.85           25.07
                     25                             13.88              51.64           21.88
                     30                             12.27              54.76           20.04
KEA algorithm        5                              63.80              47.47           54.44
                     10                             31.20              46.43           37.32
                     15                             22.20              49.55           30.66
                     20                             17.85              53.13           26.72
                     25                             15.84              58.93           24.97
                     30                             14.23              63.54           23.26
Improved TF–IDF      5                              71.20              52.98           60.75
algorithm            10                             37.20              55.36           44.50
                     15                             25.60              57.14           35.36
                     20                             22.80              67.86           34.13
                     25                             18.88              70.24           29.76
                     30                             18.40              82.14           30.07

With the increase of the number of keywords extracted, the R value of the algorithms increased, indicating
that the more keywords were extracted, the more of the marked keywords were covered. Owing to the rapid
decline of the P value, the F1 score also showed a downward trend. To compare the performance of the
algorithms, the case of extracting five keywords from each text was taken as an example for comparing the
P value, R value, and F1 score. The results are shown in Figure 1.

Figure 1: Performance comparison between algorithms when extracting five keywords from each text.

Figure 1 shows that the performance of the improved TF–IDF algorithm was significantly better: its
precision was 71.2%, which was 13.4 percentage points higher than that of the TF–IDF algorithm and 7.4
percentage points higher than that of the KEA algorithm; its recall was 52.98%, 9.97 percentage points
higher than that of the TF–IDF algorithm and 5.51 percentage points higher than that of the KEA algorithm;
and its F1 score was 60.75%, 11.43 percentage points higher than that of the TF–IDF algorithm and 6.31
percentage points higher than that of the KEA algorithm. Therefore, the improved TF–IDF algorithm is
reliable for extracting keywords from English texts.

5 Discussion
In text processing, machine learning methods have been widely used. Onan et al. [18] designed a method
that integrated Bayesian logistic regression, naive Bayes, linear discriminant analysis, logistic
regression, and support vector machines. They studied the sentiment classification of texts and found
through experiments that their method achieved the highest classification accuracy, 98.86%. Onan and
Tocoglu [19] provided an effective sarcasm recognition framework for social media data based on neural
language models and deep neural networks, evaluated it on a corpus, and obtained a classification accuracy
of 95.3%. Keywords play a very important role in many fields of text processing [20]. In document
management [21], efficiency can be improved by collecting the keywords of documents in different fields and
establishing corresponding indexes; in text classification [22], the relevance of keywords to texts can be
calculated to group semantically similar texts into one category; in automatic abstracting [23], the weight
of each sentence can be calculated after keyword extraction to determine whether the sentence should become
part of the abstract.
This article mainly compared three keyword extraction algorithms, TF–IDF, KEA, and improved TF–IDF
algorithms. This study extracted 5–30 keywords from each of the 100 texts. The running time of the improved
TF–IDF algorithm on 100 texts was the shortest, 4.93 s, which was 6.45% shorter than that of the TF–IDF
algorithm and 13.96% shorter than that of the KEA algorithm. The precision, recall rate, and F1 score of the algorithms
were compared under different numbers of keywords extracted. It was found that the P value and F1 score of
the three algorithms decreased with the increase of the number of keywords extracted, while the R value
increased, which was because the number of keywords labeled in the text originally was less than the
number of keywords extracted by the algorithms. The improved TF–IDF algorithm had significantly
improved performance after improving the calculation method of TF and introducing the concept of the
position weight; its P value, R value, and F1 score were significantly higher than those of TF–IDF and KEA
algorithms. It showed that the improved TF–IDF algorithm had strong applicability in extracting keywords
from English texts.
The experimental results indicate that the TF–IDF algorithm uses only word frequency to measure the
importance of a word and does not take positional information into account. For KEA, if a text is poorly
structured, the first-occurrence feature contributes little to keyword extraction, and the extraction
results suffer. In addition, KEA depends strongly on the naive Bayes algorithm, which assumes that the
features are independent of each other; if this assumption fails, the trained classifier has obvious
defects, and the keyword extraction deteriorates. Table 6 shows that the performance of the TF–IDF
algorithm and KEA was mediocre, whereas this study improved the calculation of word frequency and combined
it with the position weight, greatly improving the keyword extraction performance of the TF–IDF algorithm
and verifying the effectiveness of the improved algorithm. In practical applications, the improved TF–IDF
algorithm can be used for keyword extraction from English literature to help readers grasp the content
better, and for keyword extraction from English educational resources to improve the intelligence of
teaching and meet the needs of learners.

6 Conclusion
This study compared the performance of three algorithms in extracting keywords from English texts. The
results showed that the improved TF–IDF algorithm had the best performance and the shortest running time.
Its precision, recall, and F1 score were 71.2%, 52.98%, and 60.75%, respectively, significantly better than
those of the TF–IDF and KEA algorithms. The method can be further promoted and applied in practice. This
study also has some limitations. In future work, the performance of keyword extraction algorithms will be
further improved, and experiments will be carried out on more corpora to strengthen the experimental
results.

Conflict of interest: The author states no conflict of interest.

References
[1] Perovšek M, Kranjc J, Erjavec T, Cestnik B, Lavrač N. TextFlows: a visual programming platform for text mining and natural
language processing. Sci Comput Program. 2016;121:128–52.
[2] Onan A. Hybrid supervised clustering based ensemble scheme for text classification. Kybernetes. 2017;46(2):330–48.
[3] Onan A. Sentiment analysis on twitter based on ensemble of psychological and linguistic feature sets. Balkan J Electr
Comput Eng. 2018;6:1–9.
[4] Berger A, Lafferty J. Information retrieval as statistical translation. ACM SIGIR Forum. 2017;51(2):219–26.
[5] Onan A. An ensemble scheme based on language function analysis and feature engineering for text genre classification.
J Inf Sci Prin Pract. 2018;44(1):28–47.
[6] Onan A, Korukoglu S. A feature selection model based on genetic rank aggregation for text sentiment classification. J Inf
Sci. 2017;43(1):25–38.
[7] Onan A, Tocoglu MA. Weighted word embeddings and clustering-based identification of question topics in MOOC discussion
forum posts. Comput Appl Eng Educ. 2020.
[8] Firoozeh N, Nazarenko A, Alizon F, Daille B. Keyword extraction: issues and methods. Nat Lang Eng. 2019;26(3):1–33.
[9] Beliga S, Meštrović A, Martinčić-Ipšić S. An overview of graph-based keyword extraction methods and approaches.
J Inform Organ Sci. 2015;39(1):1–20.
[10] Biswas SK, Bordoloi M, Shreya J. A graph based keyword extraction model using collective node weight. Expert Syst Appl.
2017;97:51–9.
[11] Hu J, Li SB, Yao Y, Yu LY, Yang GC, Hu JJ. Patent keyword extraction algorithm based on distributed representation for
patent classification. Entropy. 2018;20(2):104.
[12] Zhang MX, Li XM, Yue SB, Yang LQ. An empirical study of TextRank for keyword extraction. IEEE Access.
2020;8:178849–58.
[13] Onan A. Mining opinions from instructor evaluation reviews: a deep learning approach. Comput Appl Eng Educ.
2020;28(1):117–38.
[14] Chen K, Zhang Z, Long J, Zhang H. Turning from TF–IDF to TF-IGM for term weighting in text classification. Expert Syst Appl.
2016;66:245–60.
[15] Qiu Q, Xie Z, Wu L, Li WJ. Geoscience keyphrase extraction algorithm using enhanced word embedding. Expert Syst Appl.
2019;125:157–69.
[16] Onan A, Korukoglu S, Bulut H. LDA-based topic modelling in text sentiment classification: an empirical analysis. Int J
Comput Linguist Appl. 2016;7(1):101–19.
[17] Farahani M. Metadiscourse in academic English texts: a corpus-driven probe into the British Academic Written English corpus.
Stud About Lang. 2019;34:56–73.
[18] Onan A, Korukoğlu S, Bulut H. A multiobjective weighted voting ensemble classifier based on differential evolution
algorithm for text sentiment classification – ScienceDirect. Expert Syst Appl. 2016;62:1–16.
[19] Onan A, Tocoglu MA. A term weighted neural language model and stacked bidirectional LSTM based framework for
sarcasm identification. IEEE Access. 2016;4:1–23.
[20] Onan A, Korukoğlu S, Bulut H. Ensemble of keyword extraction methods and classifiers in text classification. Expert Syst
Appl. 2016;57:232–47.
[21] Schaefer MB, Lima EDS. Evaluation and classification of documents: an analysis of its application in a digital archival
document management system. Perspect Ciênc Inf. 2012;17(3):137–54.
[22] Otaibi JA, Hassaine A, Safi Z, Jaoua A. Inconsistency detection in islamic advisory opinions using multilevel text cate-
gorization. Adv Sci Lett. 2017;23(5):4591–5.
[23] Hernandez-Castaneda A, Garcia-Hernandez RA, Ledeneva Y, Millán-Hernández CE. Extractive automatic text summariza-
tion based on lexical-semantic keywords. IEEE Access. 2020;8:49896–907.
