
techthiyanes/nmtscore

 
 

NMTScore


A library of translation-based text similarity measures.

To learn more about how these measures work, have a look at Jannis' blog post. Also, read our paper, "NMTScore: A Multilingual Analysis of Translation-based Text Similarity Measures".

Figure: The three text similarity measures implemented in this library.

Installation

  • Requires Python >= 3.7 and PyTorch
  • pip install nmtscore
  • Extra requirements for the Prism model: pip install nmtscore[prism]

Usage

NMTScorer

Instantiate a scorer and start scoring short sentence pairs.

from nmtscore import NMTScorer

scorer = NMTScorer()

scorer.score("This is a sentence.", "This is another sentence.")
# 0.45192727655379844

Different similarity measures

The library implements three different measures:

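# Example inputs (any two sentences, or two lists of sentences)
a = "This is a sentence."
b = "This is another sentence."

# Translation cross-likelihood (default)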
scorer.score_cross_likelihood(a, b, tgt_lang="en", normalize=True, both_directions=True)

# Direct translation probability
scorer.score_direct(a, b, a_lang="en", b_lang="en", normalize=True, both_directions=True)

# Pivot translation probability
scorer.score_pivot(a, b, a_lang="en", b_lang="en", pivot_lang="en", normalize=True, both_directions=True)

The score method is a shortcut for cross-likelihood.
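The `normalize` and `both_directions` flags can be read as two small post-processing steps applied to raw translation probabilities. The sketch below assumes, following the description in the paper, that normalization divides the translation probability p(b|a) by the self-translation probability p(b|b), and that the two directions are averaged. The function names and the toy probabilities are hypothetical and not part of the nmtscore API; the library's actual computation may differ in detail.

```python
def normalize_score(p_b_given_a: float, p_b_given_b: float) -> float:
    """Assumed normalization: divide by the self-translation probability."""
    return p_b_given_a / p_b_given_b


def symmetrize(sim_a_to_b: float, sim_b_to_a: float) -> float:
    """Assumed symmetrization: average the scores of both directions."""
    return (sim_a_to_b + sim_b_to_a) / 2


# Toy probabilities, made up for illustration only
p_b_given_a, p_b_given_b = 0.30, 0.60
p_a_given_b, p_a_given_a = 0.20, 0.50

sim_ab = normalize_score(p_b_given_a, p_b_given_b)
sim_ba = normalize_score(p_a_given_b, p_a_given_a)
similarity = symmetrize(sim_ab, sim_ba)
print(similarity)
```

Under these assumptions, two identical sentences get a score of 1 in each direction, because the numerator and the denominator coincide.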

Batch processing

The scoring methods also accept lists of strings:

scorer.score(
    ["This is a sentence.", "This is a sentence.", "This is another sentence."],
    ["This is another sentence.", "This sentence is completely unrelated.", "This is another sentence."],
)
# [0.4519273529250307, 0.13127038689469997, 1.0000000000000102]

The sentences in the first list are compared element-wise to the sentences in the second list.
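Element-wise comparison means that the i-th score compares the i-th sentence of the first list with the i-th sentence of the second list. The following minimal sketch illustrates the pairing with a hypothetical stand-in scorer, `toy_overlap` (a token-overlap ratio, not an NMT-based measure); `score_elementwise` and its length check are likewise illustrative and do not describe how the library handles mismatched lists.

```python
from typing import Callable, List


def toy_overlap(a: str, b: str) -> float:
    """Hypothetical stand-in scorer: Jaccard overlap of lowercased tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def score_elementwise(
    first: List[str],
    second: List[str],
    score_fn: Callable[[str, str], float] = toy_overlap,
) -> List[float]:
    """Score first[i] against second[i] for every index i."""
    if len(first) != len(second):
        raise ValueError("Both lists must have the same length")
    return [score_fn(a, b) for a, b in zip(first, second)]


first = ["This is a sentence.", "This is a sentence.", "This is another sentence."]
second = [
    "This is another sentence.",
    "This sentence is completely unrelated.",
    "This is another sentence.",
]
scores = score_elementwise(first, second)
print(scores)
```

To compare one sentence against several candidates, repeat it in the first list so that both lists have the same length.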

The default batch size is 8. An alternative batch size can be specified as follows (independently for translating and scoring):

scorer.score_direct(
    a, b, a_lang="en", b_lang="en",
    score_kwargs={"batch_size": 16}
)

scorer.score_cross_likelihood(
    a, b,
    translate_kwargs={"batch_size": 16},
    score_kwargs={"batch_size": 16}
)
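As a rough illustration of what the batch size controls, the sketch below splits a list of sentences into consecutive chunks. It is a generic example, not the library's internal batching logic, and the helper name `iter_batches` is hypothetical.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


sentences = [f"Sentence {i}." for i in range(20)]

# With the default of 8, twenty sentences need three forward passes
batch_lengths = [len(batch) for batch in iter_batches(sentences)]

# With a batch size of 16, two passes are enough
batch_lengths_16 = [len(batch) for batch in iter_batches(sentences, batch_size=16)]
print(batch_lengths, batch_lengths_16)
```

A larger batch size usually means fewer model calls at the cost of more memory per call, so the best value depends on the available hardware.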

Different NMT models

This library currently supports three NMT models: m2m100_418M, m2m100_1.2B and prism.

By default, the leanest model (m2m100_418M) is loaded. The main results in the paper are based on the Prism model.

scorer = NMTScorer("m2m100_418M", device=None)  # default
scorer = NMTScorer("m2m100_1.2B", device=None)
scorer = NMTScorer("prism", device=None)

Enable caching of NMT output

It can make sense to cache the translations and scores if they are needed repeatedly, e.g. in reference-based evaluation.

scorer.score_direct(
    a, b, a_lang="en", b_lang="en",
    score_kwargs={"use_cache": True}  # default: False
)

scorer.score_cross_likelihood(
    a, b,
    translate_kwargs={"use_cache": True},  # default: False
    score_kwargs={"use_cache": True}  # default: False
)

Activating this option will create an SQLite database in the ~/.cache directory. The directory can be overridden via the NMTSCORE_CACHE environment variable.
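The sketch below shows two things under stated assumptions: setting NMTSCORE_CACHE from Python (assuming the variable has to be set before nmtscore is imported), and the general idea of an SQLite-backed score cache. The table layout, the file name `toy_cache.db` and the functions `expensive_score` and `cached_score` are hypothetical; they do not reflect the library's actual schema or code.

```python
import os
import sqlite3
import tempfile
from pathlib import Path

# Assumption: set the variable before importing nmtscore so the library picks it up
cache_dir = Path(tempfile.mkdtemp(prefix="nmtscore_cache_"))
os.environ["NMTSCORE_CACHE"] = str(cache_dir)

# Hypothetical cache layout, for illustration only
db = sqlite3.connect(str(cache_dir / "toy_cache.db"))
db.execute("CREATE TABLE IF NOT EXISTS scores (key TEXT PRIMARY KEY, value REAL)")

stats = {"calls": 0}  # counts how often the expensive function really runs


def expensive_score(a: str, b: str) -> float:
    """Stand-in for a slow NMT scoring call: counts shared tokens."""
    stats["calls"] += 1
    return float(len(set(a.split()) & set(b.split())))


def cached_score(a: str, b: str) -> float:
    """Return the stored score if present, otherwise compute and store it."""
    key = f"{a}\t{b}"
    row = db.execute("SELECT value FROM scores WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0]
    value = expensive_score(a, b)
    db.execute("INSERT INTO scores (key, value) VALUES (?, ?)", (key, value))
    db.commit()
    return value


first = cached_score("This is a test.", "This is a test.")
second = cached_score("This is a test.", "This is a test.")  # served from the cache
```

The second call returns the stored value without recomputing it, which is the effect that matters when the same references are scored many times.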

Print a version signature (à la SacreBLEU)

scorer.score(a, b, print_signature=True)
# NMTScore-cross|tgt-lang:en|model:facebook/m2m100_418M|normalized|both-directions|v0.1.0|hf4.17.0
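When signatures are logged alongside results, it can help to split them into fields for comparison. The sketch below parses the example signature above, assuming that fields are separated by `|`, that `key:value` fields are options and that the remaining fields are flags or version tags. This format is inferred from the single example, not from a documented grammar, and `parse_signature` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def parse_signature(signature: str) -> Tuple[str, Dict[str, str], List[str]]:
    """Split a signature into its name, key:value options and bare flags."""
    name, *fields = signature.split("|")
    options: Dict[str, str] = {}
    flags: List[str] = []
    for field in fields:
        if ":" in field:
            key, value = field.split(":", 1)
            options[key] = value
        else:
            flags.append(field)
    return name, options, flags


signature = (
    "NMTScore-cross|tgt-lang:en|model:facebook/m2m100_418M"
    "|normalized|both-directions|v0.1.0|hf4.17.0"
)
name, options, flags = parse_signature(signature)
print(name, options, flags)
```

Comparing the parsed fields of two runs makes it easy to spot a different model or a missing `normalized` flag before comparing scores.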

Direct usage of NMT models

The NMT models also provide a direct interface for translating and scoring.

from nmtscore.models import load_translation_model

model = load_translation_model("m2m100_418M")

model.translate("de", ["This is a test."])
# ["Das ist ein Test."]

model.score("de", ["This is a test."], ["Das ist ein Test."])
# [0.5148844122886658]

Experiments

See experiments/README.md

Citation

@article{vamvas2022nmtscore,
  title={{NMTScore}: A Multilingual Analysis of Translation-based Text Similarity Measures},
  author={Vamvas, Jannis and Sennrich, Rico},
  journal={arXiv preprint arXiv:2204.13692},
  year={2022}
}

License

  • Code: MIT License
  • Data: See data subdirectories
