minibikini/paasaa

Paasaa

Paasaa is an Elixir library for natural language and script detection. It works by statistical analysis of character n-grams and Unicode script properties, without relying on AI. Given a piece of text, it identifies the writing system and the human language, which is useful in text processing, natural language understanding, and internationalization.
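To make the idea of character n-grams concrete, here is a minimal sketch that counts trigrams (runs of three consecutive characters) in a string. The module name `TrigramSketch` and its normalization rules are assumptions made for illustration; this is not Paasaa's internal code, and the library's actual preprocessing may differ.

```elixir
defmodule TrigramSketch do
  @moduledoc "Illustrative only; not Paasaa's internal implementation."

  # Lowercase the text, collapse every run of non-letter characters into a
  # single space, and pad with spaces so word boundaries appear in trigrams.
  def normalize(text) do
    cleaned =
      text
      |> String.downcase()
      |> String.replace(~r/[^\p{L}]+/u, " ")
      |> String.trim()

    " " <> cleaned <> " "
  end

  # Return a map of trigram => number of occurrences.
  def trigrams(text) do
    text
    |> normalize()
    |> String.graphemes()
    |> Enum.chunk_every(3, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> Enum.frequencies()
  end
end

IO.inspect(TrigramSketch.trigrams("Detect this!"))
```

A detector built on this idea would compare such a frequency profile against stored per-language profiles and rank languages by how closely they match.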

API Documentation | Hex Package

Installation

Add paasaa to your list of dependencies in mix.exs:

def deps do
  [{:paasaa, "~> 1.0.0"}]
end

Then run mix deps.get in your shell to fetch Paasaa. It is compiled the next time you build your project.

Usage

Detect a language:

iex> Paasaa.detect("Detect this!")
"eng"

Detect a language and return a scored list of candidate languages:

iex> Paasaa.all("Detect this!")
[
  {"eng", 1.0},
  {"sco", 0.8230731943771207},
  {"nob", 0.6030053320407174},
  {"nno", 0.5525933107125545},
  ...
]
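Because the result is a plain list of `{language, score}` tuples, it can be post-processed with ordinary `Enum` functions or comprehensions. The sketch below keeps only candidates at or above a cut-off. The module `ScoreFilter` and the 0.6 threshold are hypothetical, and the sample data is copied from the output above rather than produced by calling the library.

```elixir
defmodule ScoreFilter do
  # Hypothetical helper: keep the language codes whose score is at least
  # `min_score`, preserving the order of the input list.
  def at_least(scored, min_score) do
    for {lang, score} <- scored, score >= min_score, do: lang
  end
end

# Sample data copied from the README output (first four entries only).
scored = [
  {"eng", 1.0},
  {"sco", 0.8230731943771207},
  {"nob", 0.6030053320407174},
  {"nno", 0.5525933107125545}
]

IO.inspect(ScoreFilter.at_least(scored, 0.6))
```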

Detect a script:

iex> Paasaa.detect_script("Detect this!")
{"Latin", 0.8333333333333334}
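The score in this example is consistent with the share of characters that belong to the winning script: "Detect this!" has 12 characters, 10 of which are Latin letters, and 10 / 12 = 0.8333333333333334. The sketch below reproduces that arithmetic with Unicode script properties in regular expressions. The module `ScriptShare` and its three-script table are assumptions for illustration, not Paasaa's implementation.

```elixir
defmodule ScriptShare do
  # Hypothetical, deliberately tiny script table.
  defp scripts do
    [
      {"Latin", ~r/\p{Latin}/u},
      {"Cyrillic", ~r/\p{Cyrillic}/u},
      {"Greek", ~r/\p{Greek}/u}
    ]
  end

  # Empty input has no dominant script.
  def dominant(""), do: nil

  # Return {script, share}, where share is the fraction of all characters
  # (including spaces and punctuation) that match the script.
  def dominant(text) do
    total = String.length(text)

    scripts()
    |> Enum.map(fn {name, regex} ->
      {name, length(Regex.scan(regex, text)) / total}
    end)
    |> Enum.max_by(fn {_name, share} -> share end)
  end
end

IO.inspect(ScriptShare.dominant("Detect this!"))
```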

Advanced Usage with Options

The detect/2 and all/2 functions accept a keyword list of options to control their behavior.

Whitelist and Blacklist Languages

You can restrict the set of possible languages. This is useful if you already know the text must be one of a few languages, or you want to exclude a common false positive.

# Exclude English to find the next most likely language
iex> Paasaa.detect("Detect this!", blacklist: ["eng"])
"sco"

# Only consider Polish and Serbian
iex> text = "Pošto je priznavanje urođenog dostojanstva i jednakih i neotuđivih prava..."
iex> Paasaa.detect(text, whitelist: ["pol", "srp"])
"srp"
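One way to picture these options is as a filter over the candidate languages before the best one is picked. The sketch below applies that idea to an already scored list. `CandidateFilter.best/2` is a hypothetical helper, not part of Paasaa, and returning "und" when no candidate remains is an assumption. The library may apply the lists at a different stage, so scores can differ even where the ranking agrees.

```elixir
defmodule CandidateFilter do
  # Hypothetical helper. Option names mirror the README's :whitelist and
  # :blacklist; an empty whitelist means "allow every language".
  def best(scored, opts \\ []) do
    whitelist = Keyword.get(opts, :whitelist, [])
    blacklist = Keyword.get(opts, :blacklist, [])

    scored
    |> Enum.filter(fn {lang, _score} -> whitelist == [] or lang in whitelist end)
    |> Enum.reject(fn {lang, _score} -> lang in blacklist end)
    |> case do
      # Assumption: fall back to "und" when nothing is left.
      [] -> "und"
      remaining -> remaining |> Enum.max_by(fn {_lang, score} -> score end) |> elem(0)
    end
  end
end

# Sample scores copied from the Paasaa.all/1 example in this README.
scored = [
  {"eng", 1.0},
  {"sco", 0.8230731943771207},
  {"nob", 0.6030053320407174},
  {"nno", 0.5525933107125545}
]

IO.inspect(CandidateFilter.best(scored, blacklist: ["eng"]))
IO.inspect(CandidateFilter.best(scored, whitelist: ["nob", "nno"]))
```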

Set Minimum Text Length

By default, Paasaa returns "und" (the ISO 639-3 code for "undetermined") for very short strings. You can adjust this threshold with :min_length.

iex> Paasaa.detect("Привет", min_length: 10)
"und"

iex> Paasaa.detect("Привет", min_length: 6)
"rus"
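The two results above suggest that :min_length is compared against the number of characters rather than bytes: "Привет" has 6 characters but occupies 12 bytes in UTF-8, so a byte-based check would have passed a threshold of 10. This is an inference from the examples, not a documented guarantee. The helper `LengthCheck.long_enough?/2` below is hypothetical and only illustrates the difference.

```elixir
defmodule LengthCheck do
  # Hypothetical helper: compare the character count, not the byte size,
  # against the minimum length.
  def long_enough?(text, min_length) do
    String.length(text) >= min_length
  end
end

text = "Привет"
IO.inspect(String.length(text))
IO.inspect(byte_size(text))
IO.inspect(LengthCheck.long_enough?(text, 10))
IO.inspect(LengthCheck.long_enough?(text, 6))
```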

Supported Languages

For a full list of supported languages, please see LANGUAGES.md.

Contributing

Contributions are welcome! Please feel free to open an issue or submit a pull request on GitHub.

If you are updating the language data, you can regenerate the necessary modules with the following command:

mix run script/generate_language_data.exs

Derivation

Paasaa is a derivative work of Franc (JavaScript, MIT) by Titus Wormer.

License

MIT © Egor Kislitsyn
