YAKE! is a lightweight, unsupervised, automatic keyword extraction method that uses text statistical features to select the most important keywords from a document.
- Unsupervised: No training data required, making it easy to use out-of-the-box
- Language Independent: Works across different languages, with built-in support for many of them
- Domain Independent: Effective for various types of content including news articles, scientific papers, and web content
- Single-Document: Designed to extract keywords from individual documents without needing a corpus
- Customizable: Offers multiple parameters to fine-tune extraction for specific needs
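To make the "single-document" idea above concrete, here is a minimal sketch of collecting candidate n-grams and their counts from one text, with no external corpus involved. It is illustrative only: the helper name `candidate_ngrams` and the simple ASCII tokenizer are assumptions made for this example, and plain frequency counting is not YAKE's actual feature set or scoring.

```python
import re
from collections import Counter

def candidate_ngrams(text, max_n=3):
    """Count every 1..max_n word n-gram inside each sentence of one document.

    Hypothetical helper for illustration; not part of the yake package.
    """
    counts = Counter()
    # Split on sentence-ending punctuation so n-grams never cross sentences.
    for sentence in re.split(r"[.!?]+", text.lower()):
        # ASCII-only tokenizer, kept deliberately simple for this sketch.
        tokens = re.findall(r"[a-z0-9']+", sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

doc = "Google is acquiring Kaggle. Kaggle hosts machine learning competitions."
counts = candidate_ngrams(doc, max_n=2)
print(counts["kaggle"])            # 2
print(counts["machine learning"])  # 1
```

Everything the sketch needs comes from the document itself, which is the property the "Unsupervised" and "Single-Document" bullets describe.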
pip install git+https://github.com/LIAAD/yake

This project uses uv for dependency management.

curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
uv pip install -e ".[dev]"

import yake
text = "Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud Next conference in San Francisco this week, the official announcement could come as early as tomorrow."
# Simple extraction with default parameters
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(text)
for kw, score in keywords:
    print(f"{kw} ({score})")

# Configure the extractor with custom parameters
custom_kw_extractor = yake.KeywordExtractor(
    lan="en",            # Language
    n=3,                 # Maximum ngram size
    dedup_lim=0.9,       # Deduplication threshold
    dedup_func="seqm",   # Deduplication function
    window_size=1,       # Window size
    top=20               # Number of keywords to extract
)
keywords = custom_kw_extractor.extract_keywords(text)

yake -ti "Your text goes here" -l en -n 3 -v

Options:
  -ti, --text_input TEXT              Input text
  -i, --input_file TEXT               Input file
  -l, --language TEXT                 Language
  -n, --ngram-size INTEGER            Max size of the ngram
  -df, --dedup-func [leve|jaro|seqm]  Deduplication function
  -dl, --dedup-lim FLOAT              Deduplication threshold
  -ws, --window-size INTEGER          Window size
  -t, --top INTEGER                   Number of keyphrases to extract
  -v, --verbose                       Show scores in output
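The deduplication function and threshold options can be pictured with a small sketch: walk the ranked candidates best-first and drop any candidate whose string similarity to an already-kept keyword exceeds the threshold. This is an assumption-laden illustration, not YAKE's implementation: the helper `dedup_ranked` is hypothetical, and using `difflib.SequenceMatcher` as the similarity measure is a stand-in that may differ from what the `seqm` option computes internally.

```python
from difflib import SequenceMatcher

def dedup_ranked(candidates, threshold=0.9):
    """Drop near-duplicate keywords from a best-first list of (keyword, score).

    Hypothetical helper for illustration; not part of the yake package.
    """
    kept = []
    for kw, score in candidates:
        # A candidate is a near-duplicate if it is too similar to any kept keyword.
        too_similar = any(
            SequenceMatcher(None, kw, prev).ratio() > threshold
            for prev, _ in kept
        )
        if not too_similar:
            kept.append((kw, score))
    return kept

# Made-up scores, already sorted best-first (lower is better).
ranked = [
    ("machine learning", 0.09),
    ("machine learnings", 0.10),
    ("data science", 0.11),
]
print(dedup_ranked(ranked, threshold=0.9))
# [('machine learning', 0.09), ('data science', 0.11)]
```

In this sketch a lower threshold removes more near-duplicates, while a threshold of 1.0 keeps every candidate.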
The lower the score, the more relevant the keyword is:
google (0.026580863364597897)
kaggle (0.0289005976239829)
san francisco (0.048810837074825336)
machine learning (0.09147989238151344)
data science (0.097574333771058)
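Because lower scores mean more relevant keywords, ranking is an ascending sort. The sketch below reuses the scores listed above, deliberately shuffled, and picks the most relevant entries; `top_keywords` is a hypothetical helper written for this example and does not call the yake package.

```python
def top_keywords(scored, top=3):
    """Return the `top` most relevant (keyword, score) pairs; lowest score first."""
    return sorted(scored, key=lambda pair: pair[1])[:top]

# Scores copied from the sample output above, in shuffled order.
scored = [
    ("data science", 0.097574333771058),
    ("google", 0.026580863364597897),
    ("machine learning", 0.09147989238151344),
    ("san francisco", 0.048810837074825336),
    ("kaggle", 0.0289005976239829),
]

best = top_keywords(scored, top=3)
for kw, score in best:
    print(f"{kw} ({score:.4f})")
# google (0.0266)
# kaggle (0.0289)
# san francisco (0.0488)
```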
YAKE! supports multiple languages:
# Portuguese example
custom_kw_extractor = yake.KeywordExtractor(lan="pt")
keywords = custom_kw_extractor.extract_keywords(portuguese_text)

Please cite the following works when using YAKE.
Published at the Information Sciences Journal
Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C. and Jatowt, A. (2020). YAKE! Keyword Extraction from Single Documents using Multiple Local Features. In Information Sciences Journal. Elsevier, Vol 509, pp. 257-289.
Conference papers at ECIR
Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). A Text Feature Based Automatic Keyword Extraction Method for Single Documents. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772, pp. 684-691.
Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). YAKE! Collection-independent Automatic Keyword Extractor. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772, pp. 806-810. (Best Short Paper Award)
To contribute, please refer to the CONTRIBUTING.rst file for details.