YAKE! - Yet Another Keyword Extractor

YAKE! is a lightweight, unsupervised, automatic keyword extraction method that uses statistical features of the text to select the most important keywords from a document.

Key Features

  • Unsupervised: No training data required, making it easy to use out-of-the-box
  • Language Independent: Works across languages, with built-in support for several of them
  • Domain Independent: Effective for various types of content including news articles, scientific papers, and web content
  • Single-Document: Designed to extract keywords from individual documents without needing a corpus
  • Customizable: Offers multiple parameters to fine-tune extraction for specific needs

Installation

pip install git+https://github.com/LIAAD/yake

This project uses uv for dependency management.

Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

Install the package

uv sync

Install in development mode

uv pip install -e ".[dev]"

Basic Usage

Python API

import yake

text = "Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud Next conference in San Francisco this week, the official announcement could come as early as tomorrow."

# Simple extraction with default parameters
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(text)

for kw, score in keywords:
    print(f"{kw} ({score})")

Custom Parameters

# Configure the extractor with custom parameters
custom_kw_extractor = yake.KeywordExtractor(
    lan="en",            # Language of the text
    n=3,                 # Maximum n-gram size (words per keyword)
    dedup_lim=0.9,       # Deduplication threshold
    dedup_func="seqm",   # Deduplication function (leve, jaro, or seqm)
    window_size=1,       # Window size
    top=20,              # Number of keywords to extract
)

keywords = custom_kw_extractor.extract_keywords(text)
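
To give an intuition for how a deduplication threshold such as dedup_lim might interact with a sequence-matching function such as "seqm", here is a minimal sketch built on Python's standard difflib module. It is an illustration under stated assumptions, not YAKE!'s actual implementation: the helper names similarity and deduplicate, the comparison rule, and the sample scores are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def deduplicate(candidates, dedup_lim=0.9):
    """Keep a candidate only if it is not too similar to one already kept.

    `candidates` is a list of (keyword, score) pairs sorted best-first,
    i.e. lowest score first. This is a hypothetical helper, not part of YAKE!.
    """
    kept = []
    for keyword, score in candidates:
        # Drop the candidate if it exceeds the threshold against any kept keyword.
        if all(similarity(keyword, k) <= dedup_lim for k, _ in kept):
            kept.append((keyword, score))
    return kept


# Made-up scores, used only to exercise the helper.
ranked = [
    ("machine learning", 0.09),
    ("machine learnings", 0.10),
    ("data science", 0.11),
]
print(deduplicate(ranked, dedup_lim=0.9))
# → [('machine learning', 0.09), ('data science', 0.11)]
```

In this sketch a lower threshold removes more near-duplicates, while a threshold close to 1.0 keeps almost every candidate.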

Command Line

yake -ti "Your text goes here" -l en -n 3 -v

Options:

    -ti, --text_input TEXT          Input text
    -i, --input_file TEXT           Input file
    -l, --language TEXT             Language 
    -n, --ngram-size INTEGER        Max size of the ngram
    -df, --dedup-func [leve|jaro|seqm]  Deduplication function
    -dl, --dedup-lim FLOAT          Deduplication threshold
    -ws, --window-size INTEGER      Window size
    -t, --top INTEGER               Number of keyphrases to extract
    -v, --verbose                   Show scores in output
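
When the command line tool is called from a script, it can help to assemble the arguments programmatically. The sketch below only builds the argument list from the flags listed above and prints it; it does not run anything. The function name build_yake_command and its defaults are assumptions made for illustration.

```python
import shlex


def build_yake_command(text, language="en", ngram_size=3, top=None, verbose=False):
    """Assemble a `yake` command as an argument list (nothing is executed)."""
    cmd = ["yake", "-ti", text, "-l", language, "-n", str(ngram_size)]
    if top is not None:
        cmd += ["-t", str(top)]  # number of keyphrases to extract
    if verbose:
        cmd.append("-v")  # show scores in the output
    return cmd


cmd = build_yake_command("Your text goes here", verbose=True)
print(shlex.join(cmd))
# → yake -ti 'Your text goes here' -l en -n 3 -v
```

The resulting list can be passed to subprocess.run(cmd, capture_output=True, text=True) on a machine where the yake command is installed.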

Example Output

The lower the score, the more relevant the keyword is:

google (0.026580863364597897)
kaggle (0.0289005976239829)
san francisco (0.048810837074825336)
machine learning (0.09147989238151344)
data science (0.097574333771058)
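
Because lower scores indicate more relevant keywords, any post-processing should sort in ascending order. The sketch below reuses the (keyword, score) pairs from the example output above and picks the best entries, optionally applying a score cutoff. The helper top_keywords and the cutoff value are hypothetical.

```python
# (keyword, score) pairs copied from the example output, deliberately shuffled.
keywords = [
    ("machine learning", 0.09147989238151344),
    ("google", 0.026580863364597897),
    ("data science", 0.097574333771058),
    ("kaggle", 0.0289005976239829),
    ("san francisco", 0.048810837074825336),
]


def top_keywords(pairs, k=3, max_score=None):
    """Return up to k pairs with the lowest scores, optionally below a cutoff."""
    ranked = sorted(pairs, key=lambda pair: pair[1])  # ascending: best first
    if max_score is not None:
        ranked = [pair for pair in ranked if pair[1] <= max_score]
    return ranked[:k]


for kw, score in top_keywords(keywords, k=3):
    print(f"{kw} ({score})")
# → google (0.026580863364597897)
# → kaggle (0.0289005976239829)
# → san francisco (0.048810837074825336)
```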

Multilingual Support

YAKE! supports multiple languages; select one by passing its language code through the lan parameter:

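# Portuguese example (the sample sentence is only an illustration)
portuguese_text = "O Google está a adquirir a Kaggle, uma plataforma que organiza competições de ciência de dados."
custom_kw_extractor = yake.KeywordExtractor(lan="pt")
keywords = custom_kw_extractor.extract_keywords(portuguese_text)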

References

If you use YAKE!, please cite the following works.

Published in the Information Sciences journal

Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C. and Jatowt, A. (2020). YAKE! Keyword Extraction from Single Documents using Multiple Local Features. In Information Sciences Journal. Elsevier, Vol 509, pp 257-289.

Conference papers at ECIR

Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). A Text Feature Based Automatic Keyword Extraction Method for Single Documents. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772, pp. 684-691.

Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). YAKE! Collection-independent Automatic Keyword Extractor. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772, pp. 806-810. (Best Short Paper Award)

Contributing

Please refer to the CONTRIBUTING.rst file for details.
