Paasaa is an Elixir library for detecting the natural language and script of a text. It works by statistical analysis of character n-grams and Unicode script properties, without relying on AI. This makes it useful for text processing, natural language understanding, and internationalization tasks that need to identify the writing system and human language of a given text.
Add paasaa to your list of dependencies in mix.exs:
def deps do
  [{:paasaa, "~> 1.0.0"}]
end
After you are done, run mix deps.get in your shell to fetch and compile Paasaa.
Detect a language:
iex> Paasaa.detect("Detect this!")
"eng"
Detect language and return a scored list of languages:
iex> Paasaa.all("Detect this!")
[
  {"eng", 1.0},
  {"sco", 0.8230731943771207},
  {"nob", 0.6030053320407174},
  {"nno", 0.5525933107125545},
  ...
]
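The scored list can be post-processed with ordinary Enum functions. The following is a minimal sketch, assuming the result shape shown above (a list of {code, score} tuples sorted best-first). The module name LanguageCandidates, its function names, and the 0.7 and 0.1 thresholds are hypothetical and are not part of Paasaa.

```elixir
defmodule LanguageCandidates do
  # Hypothetical helper module; not part of Paasaa.

  # Keep only the candidates whose score is at least `min_score`.
  def confident(scored, min_score \\ 0.7) do
    Enum.filter(scored, fn {_code, score} -> score >= min_score end)
  end

  # True when the best candidate beats the runner-up by at least `margin`.
  # Assumes the list is sorted best-first, as in the sample output.
  def clear_winner?(scored, margin \\ 0.1)
  def clear_winner?([{_, best}, {_, second} | _], margin), do: best - second >= margin
  def clear_winner?([_only], _margin), do: true
  def clear_winner?([], _margin), do: false
end

# Sample data copied from the scored list shown for "Detect this!".
scored = [
  {"eng", 1.0},
  {"sco", 0.8230731943771207},
  {"nob", 0.6030053320407174},
  {"nno", 0.5525933107125545}
]

LanguageCandidates.confident(scored)
# => [{"eng", 1.0}, {"sco", 0.8230731943771207}]

LanguageCandidates.clear_winner?(scored)
# => true (1.0 - 0.823... is above the 0.1 margin)
```

Working on the plain list keeps the helper independent of how the scores were produced, so it can be applied to the result of any Paasaa.all call.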
Detect a script:
iex> Paasaa.detect_script("Detect this!")
{"Latin", 0.8333333333333334}
The detect/2 and all/2 functions accept a keyword list of options to control their behavior.
Whitelist and Blacklist Languages
You can restrict the set of possible languages. This is useful if you already know the text must be one of a few languages, or you want to exclude a common false positive.
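One common source for a whitelist is the set of languages an application already supports. The sketch below assumes a hypothetical LocaleGuess module and a made-up mapping from language codes to application locales; only the :whitelist option and Paasaa.detect/2 come from Paasaa itself. The detector is passed in as a function so the helper can be exercised without the dependency.

```elixir
defmodule LocaleGuess do
  # Hypothetical mapping from the language codes Paasaa returns to the
  # locales an application ships; adjust it to your own needs.
  @locales %{"eng" => "en", "pol" => "pl", "srp" => "sr"}

  # `detector` defaults to Paasaa.detect/2. Any two-argument function that
  # takes the text and a keyword list of options can be substituted.
  def guess(text, detector \\ &Paasaa.detect/2) do
    code = detector.(text, whitelist: Map.keys(@locales))

    # Any code outside the mapping (including "und") becomes :unknown.
    Map.get(@locales, code, :unknown)
  end
end
```

With Paasaa installed, LocaleGuess.guess(text) restricts detection to the three mapped languages and returns the matching locale, or :unknown when no mapped code comes back.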
# Exclude English to find the next most likely language
iex> Paasaa.detect("Detect this!", blacklist: ["eng"])
"sco"
# Only consider Polish and Serbian
iex> text = "Pošto je priznavanje urođenog dostojanstva i jednakih i neotuđivih prava..."
iex> Paasaa.detect(text, whitelist: ["pol", "srp"])
"srp"
Set Minimum Text Length
By default, Paasaa returns "und" for very short strings. You can adjust this threshold with :min_length.
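Because "und" is an ordinary string, callers may want to separate it from real language codes before using the result. The following is a small sketch of one way to do that; the SafeDetect module and its tagged-tuple return values are hypothetical and not part of Paasaa.

```elixir
defmodule SafeDetect do
  # Hypothetical wrapper, not part of Paasaa: turns the "und" sentinel
  # into a tagged tuple so callers can pattern match on the outcome.
  #
  # `detector` defaults to Paasaa.detect/2 and can be replaced by any
  # two-argument function taking the text and a keyword list of options.
  def detect(text, opts \\ [], detector \\ &Paasaa.detect/2) do
    case detector.(text, opts) do
      "und" -> {:error, :undetermined}
      code -> {:ok, code}
    end
  end
end
```

With Paasaa installed, SafeDetect.detect("Привет", min_length: 6) passes the option through unchanged and wraps whatever code Paasaa returns.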
iex> Paasaa.detect("Привет", min_length: 10)
"und"
iex> Paasaa.detect("Привет", min_length: 6)
"rus"
For a full list of supported languages, please see LANGUAGES.md.
Contributions are welcome! Please feel free to open an issue or submit a pull request on GitHub.
If you are updating the language data, you can regenerate the necessary modules with the following command:
mix run script/generate_language_data.exs
Paasaa is a derivative work of Franc (JavaScript, MIT) by Titus Wormer.
MIT © Egor Kislitsyn