Go implementation of the original zemberek-nlp Java library for Turkish language processing.
Currently, the following modules have been ported:
- Turkish alphabet and phonetic attributes
- Multi-level perfect hash functions and compression primitives
- Text utilities for casing, diacritics and token helpers
- Token/span types and sentence boundary detection
- Compressed vocabulary and n‑gram accessors
- SmoothLM reader with MPHFs
- Binary lexicon loader and dictionary items
- Morphotactics graph, analysis and generation helpers
- Full sentence normalizer with spell checking and language-model ranking
- Deasciifier and ASCII-tolerant utilities
go get github.com/kalaomer/zemberek-go

package main

import (
	"fmt"

	"github.com/kalaomer/zemberek-go/core/turkish"
	"github.com/kalaomer/zemberek-go/tokenization"
)

func main() {
	// Use Turkish alphabet
	alphabet := turkish.Instance
	fmt.Println("Is 'ı' a vowel?", alphabet.IsVowel('ı'))

	// Tokenize text
	extractor, _ := tokenization.NewTurkishSentenceExtractor(false, "")
	sentences := extractor.FromParagraph("Merhaba dünya! Bu bir test cümlesidir.")
	for _, sentence := range sentences {
		fmt.Println(sentence)
	}
}

package main

import (
	"fmt"
	"log"

	"github.com/kalaomer/zemberek-go/morphology"
	"github.com/kalaomer/zemberek-go/normalization"
)

func main() {
	morph := morphology.CreateWithDefaults()

	// "data" is the directory holding the language resources
	normalizer, err := normalization.NewTurkishSentenceNormalizerAdvanced(morph, "data")
	if err != nil {
		log.Fatalf("normalizer init: %v", err)
	}

	input := "Yrn okua gidicem"
	fmt.Println(normalizer.Normalize(input))
}

- Go 1.18 or higher
- Standard library only (no external dependencies for core functionality)
Language resources (lexicon binaries, normalization tables, language models) are expected under data/ by default. If you keep them elsewhere, export ZEMBEREK_DATA_ROOT=/absolute/path/to/your/data so both the examples and the advanced normalizer can locate them.
Example data bundles (LM and normalization folders) are available here: https://drive.google.com/drive/folders/1tztjRiUs9BOTH-tb1v7FWyixl-iUpydW. Download the archive, extract it to a directory of your choice, and point ZEMBEREK_DATA_ROOT to that directory before running the examples.
The port follows zemberek-nlp’s architecture module by module. Core components, tokenization, lexicon handling, language model loading and advanced normalization are functional; remaining work focuses on fine-tuning morphology generation/ambiguity resolution and extending test coverage as the Java baseline evolves.
This port mirrors the Java implementation’s architecture while adapting to Go idioms:
- Java classes → Go structs/interfaces
- Java enums → Go iota constants
- Immutable data → Go value types and generated readers
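
The mappings above can be illustrated with a short sketch: a Java-style enum expressed as iota constants with a `String` method, and a Java interface expressed as a Go interface satisfied by a small struct. All names here (`PrimaryPos`, `Analyzer`, `stubAnalyzer`) are hypothetical and chosen for illustration; they are not taken from the zemberek-go source.

```go
package main

import "fmt"

// PrimaryPos is a hypothetical enum-like type (illustration only).
type PrimaryPos int

// iota assigns 0, 1, 2 in order, replacing a Java enum declaration.
const (
	Noun PrimaryPos = iota
	Verb
	Adjective
)

// String implements fmt.Stringer, playing the role of a Java enum's name().
func (p PrimaryPos) String() string {
	switch p {
	case Noun:
		return "Noun"
	case Verb:
		return "Verb"
	case Adjective:
		return "Adjective"
	default:
		return fmt.Sprintf("PrimaryPos(%d)", int(p))
	}
}

// Analyzer is a hypothetical interface standing in for a Java interface.
type Analyzer interface {
	Analyze(word string) []PrimaryPos
}

// stubAnalyzer is a placeholder implementation that labels every word a noun.
type stubAnalyzer struct{}

func (stubAnalyzer) Analyze(word string) []PrimaryPos {
	return []PrimaryPos{Noun}
}

func main() {
	// stubAnalyzer satisfies Analyzer implicitly; no "implements" keyword.
	var a Analyzer = stubAnalyzer{}
	for _, pos := range a.Analyze("kitap") {
		fmt.Println(pos)
	}
}
```

Because Go interfaces are satisfied implicitly, a ported type only needs the right method set, which tends to keep the Go version of a Java class hierarchy flatter.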
- Original Java implementation: zemberek-nlp by Ahmet A. Akın
- Go port: This repository and its contributors
Apache License 2.0
Contributions are welcome! This is a large codebase and help with porting remaining modules would be appreciated.