NLP toolkit for the Azerbaijani language. Pure Go, zero dependencies.
All packages are safe for concurrent use.
| Package | Description |
|---|---|
| translit | Latin / Cyrillic script conversion |
| tokenizer | Word and sentence tokenization with byte offsets |
| morph | Stem and suffix chain decomposition |
| numtext | Number / text conversion ("123" → "yüz iyirmi üç") |
| ner | FIN, VOEN, phone, email, IBAN, plate, URL extraction |
| datetime | Date/time parser ("5 mart 2026" → structured) |
| normalize | Diacritic restoration ("gozel" → "gözəl") |
| spell | Spell checking (SymSpell algorithm) |
| detect | Language detection (az/ru/en/tr) |
| keywords | Keyword extraction (TF-IDF / TextRank) |
| validate | Text quality validation (spelling, punctuation, layout) |
| sentiment | Lexicon-based sentiment analysis |
| chunker | Text chunking for RAG/LLM pipelines |
go get github.com/az-ai-labs/az-lang-nlp
Requires Go 1.25.7 or later.
gRPC proto definitions and generated client stubs are available in a separate package:
go get github.com/az-ai-labs/az-lang-nlp-grpc
13 services, 28 RPCs covering all packages above. See az-lang-nlp-grpc for details.
Convert Azerbaijani text between Latin and Cyrillic scripts.
translit.CyrillicToLatin("Азәрбајҹан")
// Azərbaycan
translit.LatinToCyrillic("Həyat gözəldir")
// Һәјат ҝөзәлдир

Contextual rules handle Cyrillic Г/г disambiguation automatically. Non-Azerbaijani characters (digits, punctuation, emoji) pass through unchanged.
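The pass-through behaviour can be pictured with a small table-driven sketch. It is not the package's implementation: the table is partial, `cyrillicToLatinSketch` is a hypothetical name, and Г/г is left out on purpose because its Latin form depends on context.

```go
package main

import (
	"fmt"
	"strings"
)

// cyrToLat is a partial, illustrative table (an assumption for this example).
// Г/г is deliberately missing because its mapping is contextual.
var cyrToLat = map[rune]string{
	'\u0410': "A", // А
	'\u0430': "a", // а
	'\u0431': "b", // б
	'\u0437': "z", // з
	'\u043D': "n", // н
	'\u0440': "r", // р
	'\u0458': "y", // ј
	'\u04B9': "c", // ҹ
	'\u04D9': "ə", // ә
}

// cyrillicToLatinSketch converts runes found in the table and copies
// everything else (digits, punctuation, emoji) unchanged.
func cyrillicToLatinSketch(s string) string {
	var b strings.Builder
	for _, r := range s {
		if lat, ok := cyrToLat[r]; ok {
			b.WriteString(lat)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(cyrillicToLatinSketch("Азәрбајҹан 2026!"))
}
```

A per-rune table like this is enough for letters with a single Latin counterpart; anything contextual needs a look at neighbouring runes, which the sketch does not attempt.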
Split Azerbaijani text into words and sentences with byte offsets.
// Word tokenization
tokenizer.Words("Bakı'nın küçələri gözəldir.")
// [Bakı'nın küçələri gözəldir]
// Structured tokens with byte offsets
for _, t := range tokenizer.WordTokens("Salam, dünya!") {
	fmt.Printf("%s: %q\n", t.Type, t.Text)
}
// Word: "Salam"
// Punctuation: ","
// Space: " "
// Word: "dünya"
// Punctuation: "!"
// Sentence splitting
tokenizer.Sentences("Birinci cümlə. İkinci cümlə.")
// [Birinci cümlə. İkinci cümlə.]

Handles URLs, emails, Azerbaijani abbreviations (Prof., Az.R.), thousand-separator dots (1.000.000), decimal commas (3,14), hyphens (sosial-iqtisadi), and apostrophe suffixes (Bakı'nın).
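Byte offsets differ from rune positions as soon as a word contains a multi-byte letter. The sketch below is a simplified stand-in for a word tokenizer (letters only, no abbreviation or number handling); `span` and `wordSpans` are names invented for this example.

```go
package main

import (
	"fmt"
	"unicode"
)

// span is a hypothetical token type carrying byte offsets.
type span struct {
	Text       string
	Start, End int // byte offsets into the original string
}

// wordSpans returns maximal runs of letters together with their byte offsets.
func wordSpans(s string) []span {
	var out []span
	start := -1
	for i, r := range s { // i is a byte index, not a rune index
		if unicode.IsLetter(r) {
			if start < 0 {
				start = i
			}
			continue
		}
		if start >= 0 {
			out = append(out, span{s[start:i], start, i})
			start = -1
		}
	}
	if start >= 0 {
		out = append(out, span{s[start:], start, len(s)})
	}
	return out
}

func main() {
	for _, w := range wordSpans("Salam, dünya!") {
		fmt.Printf("[%d:%d] %q\n", w.Start, w.End, w.Text)
	}
}
```

Because "ü" occupies two bytes in UTF-8, the five-letter word "dünya" spans six bytes, and slicing the input with the reported offsets returns the token text exactly.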
Decompose inflected Azerbaijani words into stem and suffix chain.
// Extract stem from inflected word
morph.Stem("kitablarımızdan")
// kitab
// Full morphological analysis
for _, a := range morph.Analyze("kitablar") {
	fmt.Println(a)
}
// kitab[Plural:lar]
// kitabl[TenseAorist:ar]
// kitablar
// Batch stemming (pairs with tokenizer.Words)
morph.Stems([]string{"kitablarımızdan", "evlərdə", "gəlmişdir"})
// [kitab ev gəl]

Uses a table-driven morphotactic state machine with backtracking. Validates vowel harmony, consonant assimilation, and suffix ordering. Includes an embedded dictionary (~12K stems from Wiktionary) for stem validation.
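As an illustration of one of these checks, the sketch below validates two-way vowel harmony for the plural suffix only. It is a minimal example under stated assumptions (lowercase input, plural suffix only) and says nothing about how the state machine itself is organised; `pluralSuffix` and `validPlural` are hypothetical helpers.

```go
package main

import "fmt"

// pluralSuffix picks -lar after a back vowel and -lər after a front vowel,
// based on the last vowel of a lowercase stem. It returns "" if the stem
// has no vowel.
func pluralSuffix(stem string) string {
	suffix := ""
	for _, r := range stem {
		switch r {
		case 'a', 'ı', 'o', 'u':
			suffix = "lar"
		case 'e', 'ə', 'i', 'ö', 'ü':
			suffix = "lər"
		}
	}
	return suffix
}

// validPlural reports whether stem+suffix respects vowel harmony.
func validPlural(stem, suffix string) bool {
	want := pluralSuffix(stem)
	return want != "" && want == suffix
}

func main() {
	fmt.Println(validPlural("kitab", "lar")) // back vowel a, so -lar is accepted
	fmt.Println(validPlural("ev", "lar"))    // front vowel e, so -lar is rejected
}
```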
Convert between numbers and Azerbaijani text representations.
// Cardinal number
numtext.Convert(123)
// yüz iyirmi üç
// Ordinal number with vowel-harmony suffix
numtext.ConvertOrdinal(5)
// beşinci
// Decimal: math mode
numtext.ConvertFloat("3.14", numtext.MathMode)
// üç tam yüzdə on dörd
// Decimal: digit-by-digit mode
numtext.ConvertFloat("3.14", numtext.DigitMode)
// üç vergül bir dörd
// Parse text back to number
n, _ := numtext.Parse("iki milyon üç yüz min doxsan beş")
fmt.Println(n)
// 2300095

Supports integers up to ±10^18, negative numbers, ordinals, and decimals with dot or comma separator. Parse is case-insensitive and accepts both canonical ("yüz") and explicit ("bir yüz") forms.
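The canonical form can be sketched with a few lookup tables. The example below is an independent illustration, not the package's code: it covers only 0 to 999,999,999, and the function names are made up. It drops the leading "bir" before "yüz" and "min", matching the canonical forms mentioned above.

```go
package main

import (
	"fmt"
	"strings"
)

var ones = []string{"", "bir", "iki", "üç", "dörd", "beş", "altı", "yeddi", "səkkiz", "doqquz"}
var tens = []string{"", "on", "iyirmi", "otuz", "qırx", "əlli", "altmış", "yetmiş", "səksən", "doxsan"}

// below1000 spells 1..999; "bir" is omitted before "yüz".
func below1000(n int) []string {
	var w []string
	if h := n / 100; h > 0 {
		if h > 1 {
			w = append(w, ones[h])
		}
		w = append(w, "yüz")
	}
	if t := n / 10 % 10; t > 0 {
		w = append(w, tens[t])
	}
	if o := n % 10; o > 0 {
		w = append(w, ones[o])
	}
	return w
}

// cardinal spells 0..999,999,999 and returns "" outside that range.
func cardinal(n int) string {
	if n < 0 || n > 999_999_999 {
		return ""
	}
	if n == 0 {
		return "sıfır"
	}
	var w []string
	if m := n / 1_000_000; m > 0 {
		w = append(w, below1000(m)...)
		w = append(w, "milyon")
	}
	if t := n / 1000 % 1000; t > 0 {
		if t > 1 { // "min", not "bir min"
			w = append(w, below1000(t)...)
		}
		w = append(w, "min")
	}
	w = append(w, below1000(n%1000)...)
	return strings.Join(w, " ")
}

func main() {
	fmt.Println(cardinal(123))
	fmt.Println(cardinal(2300095))
}
```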
Extract structured entities from Azerbaijani text: FIN, VOEN, phone numbers, emails, IBANs, license plates, and URLs.
// Extract all entities with byte offsets
for _, e := range ner.Recognize("FIN: 5ARPXK2, tel +994501234567") {
	fmt.Printf("%s: %q (labeled=%v)\n", e.Type, e.Text, e.Labeled)
}
// FIN: "5ARPXK2" (labeled=true)
// Phone: "+994501234567" (labeled=false)
// Convenience functions return []string
ner.Phones("+994501234567 və 0551234567")
// [+994501234567 0551234567]
ner.Emails("[email protected]")
// [[email protected]]
ner.IBANs("AZ21NABZ00000000137010001944")
// [AZ21NABZ00000000137010001944]

FIN and VOEN patterns are ambiguous in isolation. When preceded by a keyword (e.g. "FIN:", "VOEN:"), Entity.Labeled is true, indicating higher confidence. Overlapping entities are resolved by preferring longer matches.
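One plausible way to implement "prefer longer matches" is a greedy pass over candidates sorted by length. The sketch below assumes that strategy; the `match` type, the tie-break on start offset, and the sample offsets are all invented for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// match is a hypothetical candidate entity with byte offsets.
type match struct {
	Type       string
	Start, End int
}

// resolveOverlaps keeps the longest candidates first and drops anything
// that overlaps an already accepted match. Ties go to the earlier start.
func resolveOverlaps(ms []match) []match {
	sorted := append([]match(nil), ms...)
	sort.SliceStable(sorted, func(i, j int) bool {
		li := sorted[i].End - sorted[i].Start
		lj := sorted[j].End - sorted[j].Start
		if li != lj {
			return li > lj
		}
		return sorted[i].Start < sorted[j].Start
	})
	var kept []match
	for _, m := range sorted {
		free := true
		for _, k := range kept {
			if m.Start < k.End && k.Start < m.End { // half-open ranges overlap
				free = false
				break
			}
		}
		if free {
			kept = append(kept, m)
		}
	}
	// Return results in document order.
	sort.Slice(kept, func(i, j int) bool { return kept[i].Start < kept[j].Start })
	return kept
}

func main() {
	ms := []match{{"Phone", 0, 10}, {"IBAN", 0, 28}, {"Email", 30, 45}}
	fmt.Println(resolveOverlaps(ms))
}
```

Here the 10-byte phone candidate sits inside the 28-byte IBAN candidate, so only the IBAN and the non-overlapping email survive.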
Parse Azerbaijani date and time expressions into structured values.
// Parse a natural-language date
r, _ := datetime.Parse("5 mart 2026", time.Time{})
fmt.Println(r.Type, r.Time.Format("2006-01-02"))
// Date 2026-03-05
// Extract dates from running text
for _, r := range datetime.Extract("Görüş 15 yanvar 2026 saat 14:30-da olacaq", time.Time{}) {
	fmt.Printf("%s: %q -> %s\n", r.Type, r.Text, r.Time.Format("2006-01-02 15:04"))
}
// Date: "15 yanvar 2026" -> 2026-01-15 00:00
// Time: "14:30" -> ... 14:30

Handles natural text ("5 mart 2026"), numeric formats ("05.03.2026", "2026-03-05"), relative expressions ("bu gün", "3 gün əvvəl", "keçən həftə"), and durations ("2 saat 30 dəqiqə"). Written-out numbers are supported via numtext integration ("iki saat"). Relative expressions resolve against a reference time, respecting its timezone.
Restore missing Azerbaijani diacritics in ASCII-degraded text.
// Restore diacritics in a single word
normalize.NormalizeWord("gozel")
// gözəl
normalize.NormalizeWord("azerbaycan")
// azərbaycan
// Ambiguous words are left unchanged
normalize.NormalizeWord("seher")
// seher (could be səhər or şəhər)
// Full text normalization
normalize.Normalize("Bu gozel seherde yasayiram.")
// Bu gözəl seherde yasayiram.
// Case is preserved
normalize.NormalizeWord("GOZEL")
// GÖZƏL

Uses dictionary lookup against the morph package's ~12K stem dictionary to find unambiguous diacritic restorations. Words with multiple possible restorations or not found in the dictionary are returned unchanged. Handles hyphenated words and apostrophe suffixes. Input longer than 1 MiB is returned unchanged.
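The "unambiguous only" rule can be sketched by indexing dictionary words under their ASCII-folded form and restoring a word only when exactly one candidate exists. This is an assumed approach for illustration, with a four-word toy dictionary and lowercase input; the real package also handles case, hyphens, and suffixes.

```go
package main

import (
	"fmt"
	"strings"
)

// fold strips Azerbaijani diacritics the way ASCII-degraded text usually does.
var fold = strings.NewReplacer(
	"ə", "e", "ö", "o", "ü", "u", "ı", "i", "ş", "s", "ç", "c", "ğ", "g",
)

// buildIndex groups dictionary words by their folded form.
func buildIndex(dict []string) map[string][]string {
	idx := make(map[string][]string)
	for _, w := range dict {
		key := fold.Replace(w)
		idx[key] = append(idx[key], w)
	}
	return idx
}

// restore returns the single dictionary candidate for word, or word itself
// when there is no candidate or more than one.
func restore(idx map[string][]string, word string) string {
	if c := idx[word]; len(c) == 1 {
		return c[0]
	}
	return word
}

func main() {
	idx := buildIndex([]string{"gözəl", "səhər", "şəhər", "kitab"})
	fmt.Println(restore(idx, "gozel"))     // one candidate, restored
	fmt.Println(restore(idx, "seher"))     // two candidates, left unchanged
	fmt.Println(restore(idx, "kompyuter")) // not in the toy dictionary, left unchanged
}
```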
Check and correct spelling errors in Azerbaijani text using the SymSpell algorithm with morphology-aware validation.
// Check if a word is correctly spelled
spell.IsCorrect("kitab") // true
spell.IsCorrect("ketab") // false
spell.IsCorrect("kitablar") // true (morphologically valid)
spell.IsCorrect("gozel") // true (normalizable to gözəl)
// Get correction suggestions
suggestions := spell.Suggest("ketab", 2)
fmt.Println(suggestions[0].Term, suggestions[0].Distance)
// kitab 1
// Correct a single word (preserves case)
spell.CorrectWord("ketab") // kitab
spell.CorrectWord("KETAB") // KİTAB
// Correct all misspelled words in text
spell.Correct("Bu ketab gozeldir")
// Bu kitab gozeldir

Uses an embedded frequency dictionary (~86K entries from a 1.25 GB Azerbaijani corpus) with the SymSpell symmetric delete algorithm for sub-microsecond lookups. Validates words through frequency dictionary, morphological analysis, and diacritic normalization. Handles hyphenated words, apostrophe suffixes, and case preservation. Title-case unknown words are left unchanged to avoid over-correcting proper nouns.
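The symmetric delete idea is easiest to see at edit distance 1: index every dictionary word under itself and its single-rune deletions, generate the same deletions for the query, and verify the candidates that collide. The sketch below is a simplified toy version (distance 1, plain Levenshtein verification, no frequencies) and does not reflect the package's internals.

```go
package main

import "fmt"

// deletes returns w plus every string obtained by deleting one rune from w.
func deletes(w string) []string {
	r := []rune(w)
	out := []string{w}
	for i := range r {
		out = append(out, string(r[:i])+string(r[i+1:]))
	}
	return out
}

// buildIndex maps each delete variant to the dictionary words producing it.
func buildIndex(dict []string) map[string][]string {
	idx := make(map[string][]string)
	for _, w := range dict {
		for _, d := range deletes(w) {
			idx[d] = append(idx[d], w)
		}
	}
	return idx
}

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein is the plain edit distance over runes.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

// suggest returns dictionary words within edit distance 1 of word.
func suggest(idx map[string][]string, word string) []string {
	seen := map[string]bool{}
	var out []string
	for _, d := range deletes(word) {
		for _, cand := range idx[d] {
			if !seen[cand] && levenshtein(word, cand) <= 1 {
				seen[cand] = true
				out = append(out, cand)
			}
		}
	}
	return out
}

func main() {
	idx := buildIndex([]string{"kitab", "kitablar", "ev"})
	fmt.Println(suggest(idx, "ketab"))
}
```

"ketab" and "kitab" both reduce to "ktab" after one deletion, so the lookup needs only a handful of map probes instead of a scan over the dictionary.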
Identify the language of input text: Azerbaijani, Russian, English, or Turkish.
// Detect language with confidence score
r := detect.Detect("Salam, necəsən? Bu gün hava çox gözəldir.")
fmt.Println(r.Lang, r.Script, r.Confidence)
// Azerbaijani Latn 0.95...
// ISO 639-1 code
detect.Lang("Hello, how are you doing today?")
// en
// Ranked results for all supported languages
for _, r := range detect.DetectAll("Привет, как дела?") {
	fmt.Printf("%s: %.2f\n", r.Lang, r.Confidence)
}
// Russian: 0.55
// Azerbaijani: 0.45
// English: 0.00
// Turkish: 0.00

Uses hybrid character-set scoring with trigram fallback for ambiguous cases (Azerbaijani vs Turkish). Supports Azerbaijani in both Latin and Cyrillic scripts. Input longer than 1 MiB is silently truncated.
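Character-set scoring can be pictured as counting letters by script and by membership in a set of distinctive letters. The sketch below is a rough illustration rather than the package's scoring: `scriptOf` and `share` are hypothetical helpers, and the only marker used is "ə", which the Turkish and English alphabets do not contain.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// scriptOf returns "Cyrl" or "Latn" depending on which script dominates,
// or "" when the text contains neither.
func scriptOf(text string) string {
	var latin, cyr int
	for _, r := range text {
		switch {
		case unicode.Is(unicode.Latin, r):
			latin++
		case unicode.Is(unicode.Cyrillic, r):
			cyr++
		}
	}
	switch {
	case latin == 0 && cyr == 0:
		return ""
	case cyr > latin:
		return "Cyrl"
	default:
		return "Latn"
	}
}

// share returns the fraction of letters in text that belong to set.
func share(text, set string) float64 {
	var letters, hits int
	for _, r := range text {
		if !unicode.IsLetter(r) {
			continue
		}
		letters++
		if strings.ContainsRune(set, r) {
			hits++
		}
	}
	if letters == 0 {
		return 0
	}
	return float64(hits) / float64(letters)
}

func main() {
	fmt.Println(scriptOf("Salam, necəsən?"), share("necəsən", "ə") > 0)
	fmt.Println(scriptOf("Привет, как дела?"))
}
```

A signal like this is decisive only when a distinctive letter actually appears, which is why a statistical fallback is useful for short or ambiguous input.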
Extract keywords from Azerbaijani text using TF-IDF or TextRank algorithms.
// Structured: TF-IDF scored keywords
for _, kw := range keywords.ExtractTFIDF("Azərbaycan iqtisadiyyatı sürətlə inkişaf edir. Azərbaycan neft sektorunda liderdir.", 3) {
	fmt.Printf("%s (score=%.2f, count=%d)\n", kw.Stem, kw.Score, kw.Count)
}
// azərbaycan (score=1.41, count=2)
// sektor (score=1.25, count=1)
// lider (score=1.11, count=1)
// Structured: TextRank scored keywords
for _, kw := range keywords.ExtractTextRank("Azərbaycan iqtisadiyyatı sürətlə inkişaf edir", 3) {
	fmt.Printf("%s (score=%.2f)\n", kw.Stem, kw.Score)
}
// iqtisadiyyat (score=0.30)
// sürət (score=0.30)
// azərbaycan (score=0.20)
// Convenience: top 10 keyword stems via TextRank
keywords.Keywords("Azərbaycan iqtisadiyyatı sürətlə inkişaf edir")
// [iqtisadiyyat sürət azərbaycan inkişaf]

Integrates with normalize for diacritic restoration, tokenizer for word splitting, and morph for stemming. Inflected forms ("kitab", "kitablar", "kitabdan") group under a single stem. Stopwords (pronouns, conjunctions, particles, auxiliaries) are filtered after stemming. Input longer than 1 MiB returns nil.
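For readers unfamiliar with TF-IDF, here is the textbook formulation over already-stemmed documents: term frequency times the logarithm of inverse document frequency. It is a generic sketch, so it will not reproduce the scores shown above, whose exact weighting is not documented here; the `tfidf` function and the two toy documents are assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// tfidf scores each stem of docs[target]. Every document is a slice of stems,
// so inflected forms are assumed to be grouped before this step.
func tfidf(docs [][]string, target int) map[string]float64 {
	// Document frequency: in how many documents does each stem occur?
	df := map[string]int{}
	for _, d := range docs {
		seen := map[string]bool{}
		for _, s := range d {
			if !seen[s] {
				seen[s] = true
				df[s]++
			}
		}
	}
	// Term frequency within the target document.
	tf := map[string]int{}
	for _, s := range docs[target] {
		tf[s]++
	}
	n := float64(len(docs))
	size := float64(len(docs[target]))
	scores := map[string]float64{}
	for s, c := range tf {
		scores[s] = float64(c) / size * math.Log(n/float64(df[s]))
	}
	return scores
}

func main() {
	docs := [][]string{
		{"azərbaycan", "iqtisadiyyat", "inkişaf"},
		{"azərbaycan", "neft", "sektor", "lider"},
	}
	for stem, score := range tfidf(docs, 0) {
		fmt.Printf("%s: %.3f\n", stem, score)
	}
}
```

In this formulation a stem that occurs in every document gets a score of zero, which is the usual argument for pairing TF-IDF with a background corpus rather than a single text.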
Validate Azerbaijani text quality: spelling, punctuation, keyboard layout errors (homoglyphs), and mixed script detection.
// Full validation with quality score and positioned issues
report := validate.Validate("Bu ketab ,gözəldir.")
fmt.Println(report.Score)
// 87
for _, issue := range report.Issues {
	fmt.Printf("[%s] %q: %s", issue.Type, issue.Text, issue.Message)
	if issue.Suggestion != "" {
		fmt.Printf(" (suggest: %q)", issue.Suggestion)
	}
	fmt.Println()
}
// [spelling] "ketab": unknown word (suggest: "kitab")
// [punctuation] " ": space before punctuation
// Quick validity check (no error-severity issues)
validate.IsValid("Bu kitab gözəldir.") // true
validate.IsValid("Bu ketab gözəldir.") // false

Returns a quality score (0-100) with weighted deductions: error -10, warning -3, info -1. Checks four categories: spelling errors via spell.IsCorrect, punctuation issues (spacing, repetition), keyboard layout errors (Cyrillic/Latin homoglyph detection), and mixed script usage. Title-case unknown words are skipped as likely proper nouns. Issues include byte offsets for editor integration. Input longer than 1 MiB returns score 100 with no issues.
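The scoring rule is simple arithmetic, sketched below. The `severity` type and `qualityScore` function are hypothetical, and clamping at zero is an assumption inferred from the stated 0-100 range.

```go
package main

import "fmt"

type severity int

const (
	info severity = iota
	warning
	errorSev
)

// penalty holds the deductions quoted above.
var penalty = map[severity]int{errorSev: 10, warning: 3, info: 1}

// qualityScore starts at 100, subtracts one penalty per issue, and clamps
// at 0 (the clamp is an assumption of this sketch).
func qualityScore(issues []severity) int {
	score := 100
	for _, s := range issues {
		score -= penalty[s]
	}
	if score < 0 {
		score = 0
	}
	return score
}

func main() {
	// One error plus one warning: 100 - 10 - 3.
	fmt.Println(qualityScore([]severity{errorSev, warning}))
}
```

One error and one warning give 87, which agrees with the report in the example above if the spelling issue counts as an error and the punctuation issue as a warning.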
Analyze the sentiment of Azerbaijani text using a lexicon-based approach.
// Full analysis with score and word counts
r := sentiment.Analyze("Bu film gözəl və maraqlı idi")
fmt.Println(r.Sentiment, r.Score)
// Positive 0.8
// Quick score check
sentiment.Score("Pis hava")
// -0.8
// Boolean convenience
sentiment.IsPositive("Həyat gözəldir")
// true

Uses an embedded sentiment lexicon with ~200 Azerbaijani stems. Words are normalized and stemmed before lookup, so inflected forms ("gözəldir", "sevirdim") match their stem entries. Returns a score from -1.0 (most negative) to +1.0 (most positive). Unknown words are skipped. Input longer than 1 MiB returns a zero result.
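A lexicon-based scorer can be as small as a map lookup followed by an average. The sketch below assumes that aggregation; the three lexicon entries and their weights are made up, and input is expected to be lowercase stems, so it is not expected to reproduce the scores shown above.

```go
package main

import "fmt"

// lexicon is hypothetical: the stems and weights exist only for this example.
var lexicon = map[string]float64{"gözəl": 1, "maraqlı": 0.5, "pis": -1}

// score averages the weights of known stems and skips unknown words.
// With weights in [-1, 1] the result stays in [-1, 1].
func score(stems []string) float64 {
	var sum float64
	var n int
	for _, s := range stems {
		if w, ok := lexicon[s]; ok {
			sum += w
			n++
		}
	}
	if n == 0 {
		return 0 // no known words: neutral
	}
	return sum / float64(n)
}

func main() {
	fmt.Println(score([]string{"bu", "film", "gözəl", "və", "maraqlı"}))
	fmt.Println(score([]string{"pis", "hava"}))
}
```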
Split text into overlapping or non-overlapping chunks for RAG/LLM pipelines.
// Convenience: split with defaults (size=512, overlap=50)
chunker.Chunks("Birinci paraqraf.\n\nİkinci paraqraf.")
// [Birinci paraqraf.\n\n İkinci paraqraf.]
// Fixed-size rune-count splitting
for _, c := range chunker.BySize("abcdefghij", 5, 0) {
	fmt.Printf("[%d:%d] %q\n", c.Start, c.End, c.Text)
}
// [0:5] "abcde"
// [5:10] "fghij"
// Sentence-aware splitting
chunks := chunker.BySentence("Birinci cümlə. İkinci cümlə.", 100, 0)
fmt.Println(chunks[0].Text)
// Birinci cümlə. İkinci cümlə.
// Recursive: paragraph > sentence > word > rune with merge-back
for _, c := range chunker.Recursive("Birinci paraqraf.\n\nİkinci paraqraf.", 20, 0) {
	fmt.Printf("[%d:%d] %q\n", c.Start, c.End, c.Text)
}
// [0:19] "Birinci paraqraf.\n\n"
// [19:36] "İkinci paraqraf."

Three strategies: BySize (pure rune-count), BySentence (sentence-boundary aware via tokenizer), and Recursive (hierarchical paragraph/sentence/word/rune with greedy merge-back). All return []Chunk with byte offsets satisfying text[c.Start:c.End] == c.Text. Chunk size is measured in runes, not bytes, for correct handling of Azerbaijani multi-byte diacritics. Inherits abbreviation handling from the tokenizer.
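The difference between rune-counted sizes and byte offsets is shown in the standalone sketch below. It mimics the rune-count strategy under simple assumptions (invalid parameters return nil, the last chunk may be shorter); the lowercase `chunk` type and `bySize` function are written for this example and are not the package's code.

```go
package main

import "fmt"

// chunk is a hypothetical result type with byte offsets.
type chunk struct {
	Text       string
	Start, End int
}

// bySize cuts text into pieces of at most size runes, with overlap runes
// shared between neighbours, and reports byte offsets for each piece.
func bySize(text string, size, overlap int) []chunk {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil
	}
	// starts[i] is the byte offset of rune i; the final entry is len(text).
	var starts []int
	for i := range text {
		starts = append(starts, i)
	}
	starts = append(starts, len(text))
	n := len(starts) - 1 // number of runes

	var out []chunk
	for from := 0; from < n; from += size - overlap {
		to := from + size
		if to > n {
			to = n
		}
		out = append(out, chunk{text[starts[from]:starts[to]], starts[from], starts[to]})
		if to == n {
			break
		}
	}
	return out
}

func main() {
	for _, c := range bySize("İkinci", 3, 1) {
		fmt.Printf("[%d:%d] %q\n", c.Start, c.End, c.Text)
	}
}
```

Since "İ" takes two bytes, the first three-rune chunk ends at byte 4 rather than 3, and slicing the input with the reported offsets still yields the chunk text.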