Modern Azerbaijani NLP

An NLP toolkit for the Azerbaijani language, written in pure Go with zero dependencies.

All packages are safe for concurrent use.

Packages

Package	Description
translit	Latin / Cyrillic script conversion
tokenizer	Word and sentence tokenization with byte offsets
morph	Stem and suffix chain decomposition
numtext	Number / text conversion ("123" → "yüz iyirmi üç")
ner	FIN, VOEN, phone, email, IBAN, plate, URL extraction
datetime	Date/time parser ("5 mart 2026" → structured)
normalize	Diacritic restoration ("gozel" → "gözəl")
spell	Spell checking (SymSpell algorithm)
detect	Language detection (az/ru/en/tr)
keywords	Keyword extraction (TF-IDF / TextRank)
validate	Text quality validation (spelling, punctuation, layout)
sentiment	Lexicon-based sentiment analysis
chunker	Text chunking for RAG/LLM pipelines

Install

go get github.com/az-ai-labs/az-lang-nlp

Requires Go 1.25.7 or later.

gRPC

Proto definitions and generated client stubs are available in a separate package:

go get github.com/az-ai-labs/az-lang-nlp-grpc

13 services, 28 RPCs covering all packages above. See az-lang-nlp-grpc for details.

Transliteration

Convert Azerbaijani text between Latin and Cyrillic scripts.

translit.CyrillicToLatin("Азәрбајҹан")
// Azərbaycan

translit.LatinToCyrillic("Həyat gözəldir")
// Һәјат ҝөзәлдир

Contextual rules handle Cyrillic Г/г disambiguation automatically. Non-Azerbaijani characters (digits, punctuation, emoji) pass through unchanged.
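
As a rough illustration of per-rune conversion with pass-through, the sketch below uses only the standard library. The table is deliberately partial (it covers just the letters in the two sample strings), the name toLatin is hypothetical, and the contextual Г/г rules are not modelled; it is not the package's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// cyrToLat is a partial, illustrative table: only the letters needed for the
// sample strings are listed. A real table covers the full alphabet.
var cyrToLat = map[rune]string{
	'А': "A", 'а': "a", 'б': "b", 'д': "d", 'з': "z", 'и': "i",
	'л': "l", 'н': "n", 'р': "r", 'т': "t",
	'ә': "ə", 'ө': "ö", 'ј': "y", 'ҝ': "g", 'ҹ': "c", 'Һ': "H",
}

// toLatin maps each rune through the table and copies anything it does not
// know (digits, punctuation, spaces) unchanged.
func toLatin(s string) string {
	var b strings.Builder
	for _, r := range s {
		if lat, ok := cyrToLat[r]; ok {
			b.WriteString(lat)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(toLatin("Азәрбајҹан"))
	// Azərbaycan
	fmt.Println(toLatin("Һәјат ҝөзәлдир, 2026!"))
	// Həyat gözəldir, 2026!
}
```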

Tokenizer

Split Azerbaijani text into words and sentences with byte offsets.

// Word tokenization
tokenizer.Words("Bakı'nın küçələri gözəldir.")
// [Bakı'nın küçələri gözəldir]

// Structured tokens with byte offsets
for _, t := range tokenizer.WordTokens("Salam, dünya!") {
    fmt.Printf("%s: %q\n", t.Type, t.Text)
}
// Word: "Salam"
// Punctuation: ","
// Space: " "
// Word: "dünya"
// Punctuation: "!"

// Sentence splitting
tokenizer.Sentences("Birinci cümlə. İkinci cümlə.")
// [Birinci cümlə.  İkinci cümlə.]

Handles URLs, emails, Azerbaijani abbreviations (Prof., Az.R.), thousand-separator dots (1.000.000), decimal commas (3,14), hyphens (sosial-iqtisadi), and apostrophe suffixes (Bakı'nın).
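
To make "byte offsets" concrete, here is a minimal standard-library sketch that records where each word starts and ends in the input. The span type, its field names, and wordSpans are assumptions for illustration; the sketch handles none of the special cases listed above (URLs, abbreviations, apostrophe suffixes).

```go
package main

import (
	"fmt"
	"unicode"
)

// span is a hypothetical token type; the field names are assumptions.
type span struct {
	Text       string
	Start, End int // byte offsets into the input
}

// wordSpans returns maximal runs of letters and digits with byte offsets.
func wordSpans(s string) []span {
	var out []span
	start := -1 // -1 means "not inside a word"
	for i, r := range s {
		isWord := unicode.IsLetter(r) || unicode.IsDigit(r)
		switch {
		case isWord && start < 0:
			start = i
		case !isWord && start >= 0:
			out = append(out, span{s[start:i], start, i})
			start = -1
		}
	}
	if start >= 0 {
		out = append(out, span{s[start:], start, len(s)})
	}
	return out
}

func main() {
	for _, t := range wordSpans("Salam, dünya!") {
		fmt.Printf("%q [%d:%d]\n", t.Text, t.Start, t.End)
	}
	// "Salam" [0:5]
	// "dünya" [7:13]
}
```

The second token is five runes but six bytes long because "ü" takes two bytes in UTF-8, which is why offsets are reported in bytes rather than characters.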

Morphological Analysis

Decompose inflected Azerbaijani words into stem and suffix chain.

// Extract stem from inflected word
morph.Stem("kitablarımızdan")
// kitab

// Full morphological analysis
for _, a := range morph.Analyze("kitablar") {
    fmt.Println(a)
}
// kitab[Plural:lar]
// kitabl[TenseAorist:ar]
// kitablar

// Batch stemming (pairs with tokenizer.Words)
morph.Stems([]string{"kitablarımızdan", "evlərdə", "gəlmişdir"})
// [kitab ev gəl]

Uses a table-driven morphotactic state machine with backtracking. Validates vowel harmony, consonant assimilation, and suffix ordering. Includes an embedded dictionary (~12K stems from Wiktionary) for stem validation.
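
Vowel harmony is the easiest of these checks to show in isolation. The sketch below picks the plural suffix from the last vowel of a lowercase stem: back vowels take -lar, front vowels take -lər. The function name pluralSuffix is hypothetical, and this is a simplified illustration rather than the package's state machine.

```go
package main

import "fmt"

// pluralSuffix picks -lar or -lər from the last vowel of a lowercase stem.
// Stems without any vowel fall back to "lar" (an arbitrary choice here).
func pluralSuffix(stem string) string {
	suffix := "lar"
	for _, r := range stem {
		switch r {
		case 'a', 'ı', 'o', 'u': // back vowels
			suffix = "lar"
		case 'e', 'ə', 'i', 'ö', 'ü': // front vowels
			suffix = "lər"
		}
	}
	return suffix
}

func main() {
	for _, stem := range []string{"kitab", "ev", "göz", "qız"} {
		fmt.Println(stem + pluralSuffix(stem))
	}
	// kitablar
	// evlər
	// gözlər
	// qızlar
}
```

A validator can run the same rule in reverse: a candidate split such as "kitab" + "lər" is rejected because the suffix vowel does not agree with the stem.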

Number-to-Text

Convert between numbers and Azerbaijani text representations.

// Cardinal number
numtext.Convert(123)
// yüz iyirmi üç

// Ordinal number with vowel-harmony suffix
numtext.ConvertOrdinal(5)
// beşinci

// Decimal: math mode
numtext.ConvertFloat("3.14", numtext.MathMode)
// üç tam yüzdə on dörd

// Decimal: digit-by-digit mode
numtext.ConvertFloat("3.14", numtext.DigitMode)
// üç vergül bir dörd

// Parse text back to number
n, _ := numtext.Parse("iki milyon üç yüz min doxsan beş")
fmt.Println(n)
// 2300095

Supports integers up to ±10^18, negative numbers, ordinals, and decimals with dot or comma separator. Parse is case-insensitive and accepts both canonical ("yüz") and explicit ("bir yüz") forms.
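
The canonical forms can be sketched with a small group-by-thousands routine. The version below is an illustration only: the names cardinal and below1000 are hypothetical, it covers 0 to 999,999,999, and it omits negatives, ordinals, and decimals. It assumes 1000 is spelled "min" in the same way that 100 is spelled "yüz".

```go
package main

import (
	"fmt"
	"strings"
)

var units = []string{"", "bir", "iki", "üç", "dörd", "beş", "altı", "yeddi", "səkkiz", "doqquz"}
var tens = []string{"", "on", "iyirmi", "otuz", "qırx", "əlli", "altmış", "yetmiş", "səksən", "doxsan"}

// below1000 spells 1..999 as a list of words; 100 is "yüz", not "bir yüz".
func below1000(n int) []string {
	var w []string
	if h := n / 100; h > 0 {
		if h > 1 {
			w = append(w, units[h])
		}
		w = append(w, "yüz")
	}
	if t := n / 10 % 10; t > 0 {
		w = append(w, tens[t])
	}
	if u := n % 10; u > 0 {
		w = append(w, units[u])
	}
	return w
}

// cardinal spells 0..999_999_999; larger or negative input is out of scope.
func cardinal(n int) string {
	if n == 0 {
		return "sıfır"
	}
	var w []string
	if m := n / 1_000_000; m > 0 {
		w = append(w, below1000(m)...)
		w = append(w, "milyon")
	}
	if k := n / 1000 % 1000; k > 0 {
		if k > 1 { // assumption: 1000 is "min", not "bir min"
			w = append(w, below1000(k)...)
		}
		w = append(w, "min")
	}
	w = append(w, below1000(n%1000)...)
	return strings.Join(w, " ")
}

func main() {
	fmt.Println(cardinal(123))
	// yüz iyirmi üç
	fmt.Println(cardinal(2300095))
	// iki milyon üç yüz min doxsan beş
}
```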

Named Entity Recognition

Extract structured entities from Azerbaijani text: FIN, VOEN, phone numbers, emails, IBANs, license plates, and URLs.

// Extract all entities with byte offsets
for _, e := range ner.Recognize("FIN: 5ARPXK2, tel +994501234567") {
    fmt.Printf("%s: %q (labeled=%v)\n", e.Type, e.Text, e.Labeled)
}
// FIN: "5ARPXK2" (labeled=true)
// Phone: "+994501234567" (labeled=false)

// Convenience functions return []string
ner.Phones("+994501234567 və 0551234567")
// [+994501234567 0551234567]

ner.Emails("[email protected]")
// [[email protected]]

ner.IBANs("AZ21NABZ00000000137010001944")
// [AZ21NABZ00000000137010001944]

FIN and VOEN patterns are ambiguous in isolation. When preceded by a keyword (e.g. "FIN:", "VOEN:"), Entity.Labeled is true, indicating higher confidence. Overlapping entities are resolved by preferring longer matches.
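
One common way to implement "prefer longer matches" is a greedy pass over candidates sorted by length. The sketch below shows that idea on hypothetical data; the match type, the resolve function, and the sample offsets are assumptions, and the package may resolve overlaps differently.

```go
package main

import (
	"fmt"
	"sort"
)

// match is a hypothetical candidate entity with byte offsets.
type match struct {
	Type       string
	Start, End int
}

// resolve keeps the longest candidates first and drops any candidate that
// overlaps one already kept. The result is returned in document order.
func resolve(ms []match) []match {
	sorted := append([]match(nil), ms...) // copy so the caller's slice is untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].End-sorted[i].Start > sorted[j].End-sorted[j].Start
	})
	var kept []match
	for _, m := range sorted {
		overlaps := false
		for _, k := range kept {
			if m.Start < k.End && k.Start < m.End {
				overlaps = true
				break
			}
		}
		if !overlaps {
			kept = append(kept, m)
		}
	}
	sort.Slice(kept, func(i, j int) bool { return kept[i].Start < kept[j].Start })
	return kept
}

func main() {
	candidates := []match{
		{"FIN", 6, 13},    // 7 bytes, sits inside the phone match
		{"Phone", 0, 13},  // 13 bytes
		{"Email", 20, 35}, // no overlap
	}
	for _, m := range resolve(candidates) {
		fmt.Printf("%s [%d:%d]\n", m.Type, m.Start, m.End)
	}
	// Phone [0:13]
	// Email [20:35]
}
```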

Datetime

Parse Azerbaijani date and time expressions into structured values.

// Parse a natural-language date
r, _ := datetime.Parse("5 mart 2026", time.Time{})
fmt.Println(r.Type, r.Time.Format("2006-01-02"))
// Date 2026-03-05

// Extract dates from running text
for _, r := range datetime.Extract("Görüş 15 yanvar 2026 saat 14:30-da olacaq", time.Time{}) {
    fmt.Printf("%s: %q -> %s\n", r.Type, r.Text, r.Time.Format("2006-01-02 15:04"))
}
// Date: "15 yanvar 2026" -> 2026-01-15 00:00
// Time: "14:30" -> ... 14:30

Handles natural text ("5 mart 2026"), numeric formats ("05.03.2026", "2026-03-05"), relative expressions ("bu gün", "3 gün əvvəl", "keçən həftə"), and durations ("2 saat 30 dəqiqə"). Written-out numbers are supported via numtext integration ("iki saat"). Relative expressions resolve against a reference time, respecting its timezone.
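
The natural-text case reduces to a month-name lookup plus two integer parses. The following sketch handles only the "day month year" shape; parseDate and the months table are hypothetical names, the result is fixed to UTC for simplicity, and the day is not range-checked (time.Date normalizes out-of-range values).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// months maps lowercase Azerbaijani month names to time.Month values.
var months = map[string]time.Month{
	"yanvar": time.January, "fevral": time.February, "mart": time.March,
	"aprel": time.April, "may": time.May, "iyun": time.June,
	"iyul": time.July, "avqust": time.August, "sentyabr": time.September,
	"oktyabr": time.October, "noyabr": time.November, "dekabr": time.December,
}

// parseDate handles only the "<day> <month name> <year>" shape.
func parseDate(s string) (time.Time, error) {
	f := strings.Fields(strings.ToLower(s))
	if len(f) != 3 {
		return time.Time{}, fmt.Errorf("want 3 fields, got %d", len(f))
	}
	day, err := strconv.Atoi(f[0])
	if err != nil {
		return time.Time{}, err
	}
	month, ok := months[f[1]]
	if !ok {
		return time.Time{}, fmt.Errorf("unknown month %q", f[1])
	}
	year, err := strconv.Atoi(f[2])
	if err != nil {
		return time.Time{}, err
	}
	return time.Date(year, month, day, 0, 0, 0, 0, time.UTC), nil
}

func main() {
	d, err := parseDate("5 mart 2026")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(d.Format("2006-01-02"))
	// 2026-03-05
}
```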

Text Normalization

Restore missing Azerbaijani diacritics in ASCII-degraded text.

// Restore diacritics in a single word
normalize.NormalizeWord("gozel")
// gözəl

normalize.NormalizeWord("azerbaycan")
// azərbaycan

// Ambiguous words are left unchanged
normalize.NormalizeWord("seher")
// seher (could be səhər or şəhər)

// Full text normalization
normalize.Normalize("Bu gozel seherde yasayiram.")
// Bu gözəl seherde yasayiram.

// Case is preserved
normalize.NormalizeWord("GOZEL")
// GÖZƏL

Uses dictionary lookup against the morph package's ~12K stem dictionary to find unambiguous diacritic restorations. Words with multiple possible restorations or not found in the dictionary are returned unchanged. Handles hyphenated words and apostrophe suffixes. Input longer than 1 MiB is returned unchanged.
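
The "unambiguous only" rule can be sketched by indexing dictionary words under their ASCII-folded spelling and restoring a word only when exactly one candidate exists. The three-word dictionary, the fold table, and the function names below are assumptions for illustration; case handling, hyphens, and apostrophe suffixes are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// fold strips Azerbaijani diacritics the way ASCII-degraded text does.
var fold = strings.NewReplacer(
	"ə", "e", "ö", "o", "ü", "u", "ı", "i", "ş", "s", "ç", "c", "ğ", "g",
)

// buildIndex groups dictionary words by their ASCII-folded spelling.
func buildIndex(dict []string) map[string][]string {
	idx := make(map[string][]string)
	for _, w := range dict {
		k := fold.Replace(w)
		idx[k] = append(idx[k], w)
	}
	return idx
}

// restore returns the dictionary form only when exactly one candidate exists;
// ambiguous or unknown words come back unchanged.
func restore(idx map[string][]string, word string) string {
	if c := idx[word]; len(c) == 1 {
		return c[0]
	}
	return word
}

func main() {
	idx := buildIndex([]string{"gözəl", "səhər", "şəhər"})
	fmt.Println(restore(idx, "gozel"))
	// gözəl
	fmt.Println(restore(idx, "seher"))
	// seher
}
```

"seher" stays as typed because both "səhər" and "şəhər" fold to it, which mirrors the ambiguous case shown above.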

Spell Checker

Check and correct spelling errors in Azerbaijani text using the SymSpell algorithm with morphology-aware validation.

// Check if a word is correctly spelled
spell.IsCorrect("kitab")    // true
spell.IsCorrect("ketab")    // false
spell.IsCorrect("kitablar") // true (morphologically valid)
spell.IsCorrect("gozel")    // true (normalizable to gözəl)

// Get correction suggestions
suggestions := spell.Suggest("ketab", 2)
fmt.Println(suggestions[0].Term, suggestions[0].Distance)
// kitab 1

// Correct a single word (preserves case)
spell.CorrectWord("ketab")  // kitab
spell.CorrectWord("KETAB")  // KİTAB

// Correct all misspelled words in text
spell.Correct("Bu ketab gozeldir")
// Bu kitab gozeldir

Uses an embedded frequency dictionary (~86K entries from a 1.25 GB Azerbaijani corpus) with the SymSpell symmetric delete algorithm for sub-microsecond lookups. Validates words through frequency dictionary, morphological analysis, and diacritic normalization. Handles hyphenated words, apostrophe suffixes, and case preservation. Title-case unknown words are left unchanged to avoid over-correcting proper nouns.
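
The core idea of symmetric delete is that only deletions are generated, on both the dictionary side and the query side, and candidates are found where the two sets meet. The sketch below shows this for a maximum of one deletion per side with a toy dictionary; the function names are hypothetical, and it omits the frequency ranking and the edit-distance verification step that a full SymSpell implementation applies to each candidate.

```go
package main

import "fmt"

// deletes returns every string obtained by removing one rune from w.
func deletes(w string) []string {
	r := []rune(w)
	out := make([]string, 0, len(r))
	for i := range r {
		out = append(out, string(r[:i])+string(r[i+1:]))
	}
	return out
}

// buildIndex maps each dictionary word and each of its single-rune deletes
// back to the words that produce it.
func buildIndex(dict []string) map[string][]string {
	idx := make(map[string][]string)
	for _, w := range dict {
		idx[w] = append(idx[w], w)
		for _, d := range deletes(w) {
			idx[d] = append(idx[d], w)
		}
	}
	return idx
}

// candidates looks up the query and its deletes in the index and returns the
// distinct dictionary words reached. Results are unverified candidates.
func candidates(idx map[string][]string, q string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, k := range append([]string{q}, deletes(q)...) {
		for _, w := range idx[k] {
			if !seen[w] {
				seen[w] = true
				out = append(out, w)
			}
		}
	}
	return out
}

func main() {
	idx := buildIndex([]string{"kitab", "kitabxana", "ev"})
	fmt.Println(candidates(idx, "ketab"))
	// [kitab]
}
```

"ketab" and "kitab" meet at the shared delete "ktab", so the lookup costs a handful of map reads instead of a scan over the dictionary.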

Language Detection

Identify the language of input text: Azerbaijani, Russian, English, or Turkish.

// Detect language with confidence score
r := detect.Detect("Salam, necəsən? Bu gün hava çox gözəldir.")
fmt.Println(r.Lang, r.Script, r.Confidence)
// Azerbaijani Latn 0.95...

// ISO 639-1 code
detect.Lang("Hello, how are you doing today?")
// en

// Ranked results for all supported languages
for _, r := range detect.DetectAll("Привет, как дела?") {
    fmt.Printf("%s: %.2f\n", r.Lang, r.Confidence)
}
// Russian: 0.55
// Azerbaijani: 0.45
// English: 0.00
// Turkish: 0.00

Uses hybrid character-set scoring with trigram fallback for ambiguous cases (Azerbaijani vs Turkish). Supports Azerbaijani in both Latin and Cyrillic scripts. Input longer than 1 MiB is silently truncated.
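
A coarse script tally is one plausible first ingredient of character-set scoring. The sketch below only decides between Cyrillic and Latin by counting letters; the script function is hypothetical, and it does not attempt language scoring or the trigram fallback described above.

```go
package main

import (
	"fmt"
	"unicode"
)

// script returns "Cyrl" or "Latn" by majority of letters, or "" when the
// input has no letters of either script or the counts tie.
func script(s string) string {
	var cyr, lat int
	for _, r := range s {
		switch {
		case unicode.Is(unicode.Cyrillic, r):
			cyr++
		case unicode.Is(unicode.Latin, r):
			lat++
		}
	}
	switch {
	case cyr > lat:
		return "Cyrl"
	case lat > cyr:
		return "Latn"
	}
	return ""
}

func main() {
	fmt.Println(script("Salam, necəsən?"))
	// Latn
	fmt.Println(script("Привет, как дела?"))
	// Cyrl
}
```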

Keyword Extraction

Extract keywords from Azerbaijani text using TF-IDF or TextRank algorithms.

// Structured: TF-IDF scored keywords
for _, kw := range keywords.ExtractTFIDF("Azərbaycan iqtisadiyyatı sürətlə inkişaf edir. Azərbaycan neft sektorunda liderdir.", 3) {
    fmt.Printf("%s (score=%.2f, count=%d)\n", kw.Stem, kw.Score, kw.Count)
}
// azərbaycan (score=1.41, count=2)
// sektor (score=1.25, count=1)
// lider (score=1.11, count=1)

// Structured: TextRank scored keywords
for _, kw := range keywords.ExtractTextRank("Azərbaycan iqtisadiyyatı sürətlə inkişaf edir", 3) {
    fmt.Printf("%s (score=%.2f)\n", kw.Stem, kw.Score)
}
// iqtisadiyyat (score=0.30)
// sürət (score=0.30)
// azərbaycan (score=0.20)

// Convenience: top 10 keyword stems via TextRank
keywords.Keywords("Azərbaycan iqtisadiyyatı sürətlə inkişaf edir")
// [iqtisadiyyat sürət azərbaycan inkişaf]

Integrates with normalize for diacritic restoration, tokenizer for word splitting, and morph for stemming. Inflected forms ("kitab", "kitablar", "kitabdan") group under a single stem. Stopwords (pronouns, conjunctions, particles, auxiliaries) are filtered after stemming. Input longer than 1 MiB returns nil.

Text Validation

Validate Azerbaijani text quality: spelling, punctuation, keyboard layout errors (homoglyphs), and mixed script detection.

// Full validation with quality score and positioned issues
report := validate.Validate("Bu ketab ,gözəldir.")
fmt.Println(report.Score)
// 87

for _, issue := range report.Issues {
    fmt.Printf("[%s] %q: %s", issue.Type, issue.Text, issue.Message)
    if issue.Suggestion != "" {
        fmt.Printf(" (suggest: %q)", issue.Suggestion)
    }
    fmt.Println()
}
// [spelling] "ketab": unknown word (suggest: "kitab")
// [punctuation] " ": space before punctuation

// Quick validity check (no error-severity issues)
validate.IsValid("Bu kitab gözəldir.") // true
validate.IsValid("Bu ketab gözəldir.") // false

Returns a quality score (0-100) with weighted deductions: error -10, warning -3, info -1. Checks four categories: spelling errors via spell.IsCorrect, punctuation issues (spacing, repetition), keyboard layout errors (Cyrillic/Latin homoglyph detection), and mixed script usage. Title-case unknown words are skipped as likely proper nouns. Issues include byte offsets for editor integration. Input longer than 1 MiB returns score 100 with no issues.
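
The weighted deduction is simple arithmetic. In the sketch below, the severity strings, the penalty table, and the score function are hypothetical names, and flooring at 0 is an assumption drawn from the stated 0-100 range. One reading consistent with the score of 87 shown above is one error (-10) plus one warning (-3).

```go
package main

import "fmt"

// penalty holds the deduction per issue severity, as stated in the text.
var penalty = map[string]int{"error": 10, "warning": 3, "info": 1}

// score starts at 100, subtracts a penalty per issue, and floors at 0.
// Unknown severities deduct nothing.
func score(severities []string) int {
	s := 100
	for _, sev := range severities {
		s -= penalty[sev]
	}
	if s < 0 {
		s = 0
	}
	return s
}

func main() {
	fmt.Println(score([]string{"error", "warning"}))
	// 87
	fmt.Println(score(nil))
	// 100
}
```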

Sentiment Analysis

Analyze the sentiment of Azerbaijani text using a lexicon-based approach.

// Full analysis with score and word counts
r := sentiment.Analyze("Bu film gözəl və maraqlı idi")
fmt.Println(r.Sentiment, r.Score)
// Positive 0.8

// Quick score check
sentiment.Score("Pis hava")
// -0.8

// Boolean convenience
sentiment.IsPositive("Həyat gözəldir")
// true

Uses an embedded sentiment lexicon with ~200 Azerbaijani stems. Words are normalized and stemmed before lookup, so inflected forms ("gözəldir", "sevirdim") match their stem entries. Returns a score from -1.0 (most negative) to +1.0 (most positive). Unknown words are skipped. Input longer than 1 MiB returns a zero result.

Text Chunking

Split text into overlapping or non-overlapping chunks for RAG/LLM pipelines.

// Convenience: split with defaults (size=512, overlap=50)
chunker.Chunks("Birinci paraqraf.\n\nİkinci paraqraf.")
// [Birinci paraqraf.\n\n İkinci paraqraf.]

// Fixed-size rune-count splitting
for _, c := range chunker.BySize("abcdefghij", 5, 0) {
    fmt.Printf("[%d:%d] %q\n", c.Start, c.End, c.Text)
}
// [0:5] "abcde"
// [5:10] "fghij"

// Sentence-aware splitting
chunks := chunker.BySentence("Birinci cümlə. İkinci cümlə.", 100, 0)
fmt.Println(chunks[0].Text)
// Birinci cümlə. İkinci cümlə.

// Recursive: paragraph > sentence > word > rune with merge-back
for _, c := range chunker.Recursive("Birinci paraqraf.\n\nİkinci paraqraf.", 20, 0) {
    fmt.Printf("[%d:%d] %q\n", c.Start, c.End, c.Text)
}
// [0:19] "Birinci paraqraf.\n\n"
// [19:36] "İkinci paraqraf."

Three strategies: BySize (pure rune-count), BySentence (sentence-boundary aware via tokenizer), and Recursive (hierarchical paragraph/sentence/word/rune with greedy merge-back). All return []Chunk with byte offsets satisfying text[c.Start:c.End] == c.Text. Chunk size is measured in runes, not bytes, for correct handling of Azerbaijani multi-byte diacritics. Inherits abbreviation handling from the tokenizer.
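
The difference between rune-count sizing and byte offsets is easiest to see in a small sketch. The version below splits every size runes without overlap and records byte offsets so that text[c.Start:c.End] == c.Text holds; the chunk type and bySize are hypothetical names, and overlap is omitted to keep the example short.

```go
package main

import "fmt"

// chunk is a hypothetical result type with byte offsets into the input.
type chunk struct {
	Text       string
	Start, End int
}

// bySize splits text every size runes, without overlap. Ranging over a
// string yields the byte index of each rune, so cuts never split a character.
func bySize(text string, size int) []chunk {
	if size <= 0 {
		return nil
	}
	var out []chunk
	start, count := 0, 0
	for i := range text {
		if count == size {
			out = append(out, chunk{text[start:i], start, i})
			start, count = i, 0
		}
		count++
	}
	if start < len(text) {
		out = append(out, chunk{text[start:], start, len(text)})
	}
	return out
}

func main() {
	for _, c := range bySize("gözəl", 2) {
		fmt.Printf("[%d:%d] %q\n", c.Start, c.End, c.Text)
	}
	// [0:3] "gö"
	// [3:6] "zə"
	// [6:7] "l"
}
```

Each of the first two chunks holds two runes but spans three bytes, because "ö" and "ə" are two bytes each in UTF-8.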

License

Apache-2.0
