Zemberek-Go

A Go port of zemberek-nlp, the original Java library for Turkish natural language processing.

Features

Currently, the following modules have been ported:

Core

Turkish alphabet and phonetic attributes
Multi-level perfect hash functions and compression primitives
Text utilities for casing, diacritics and token helpers
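
To show why Turkish needs dedicated casing utilities, here is a minimal sketch that uses only the Go standard library (`strings.ToUpperSpecial` and `unicode.TurkishCase`). It does not call zemberek-go's own helpers, whose exact API is not shown here; `upperTR` and `lowerTR` are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// upperTR upper-cases s with Turkish rules (i -> İ, ı -> I).
// Hypothetical helper, not part of zemberek-go.
func upperTR(s string) string {
	return strings.ToUpperSpecial(unicode.TurkishCase, s)
}

// lowerTR lower-cases s with Turkish rules (I -> ı, İ -> i).
// Hypothetical helper, not part of zemberek-go.
func lowerTR(s string) string {
	return strings.ToLowerSpecial(unicode.TurkishCase, s)
}

func main() {
	fmt.Println(strings.ToUpper("istanbul")) // default rules: ISTANBUL
	fmt.Println(upperTR("istanbul"))         // Turkish rules: İSTANBUL
	fmt.Println(lowerTR("IŞIK"))             // Turkish rules: ışık
}
```

The default `strings.ToUpper` loses the dotted/dotless distinction, which is why Turkish text processing usually routes casing through locale-aware helpers.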

Tokenization

Token/span types and sentence boundary detection

Language Model (LM)

Compressed vocabulary and n‑gram accessors
SmoothLM reader with MPHFs

Morphology

Binary lexicon loader and dictionary items
Morphotactics graph, analysis and generation helpers

Normalization

Full sentence normalizer with spell checker and LM ranking
Deasciifier and ASCII-tolerant utilities

Installation

go get github.com/kalaomer/zemberek-go

Usage

package main

import (
    "fmt"
    "log"

    "github.com/kalaomer/zemberek-go/core/turkish"
    "github.com/kalaomer/zemberek-go/tokenization"
)

func main() {
    // Query phonetic attributes through the shared Turkish alphabet instance.
    alphabet := turkish.Instance
    fmt.Println("Is 'ı' a vowel?", alphabet.IsVowel('ı'))

    // Split a paragraph into sentences; check the constructor error instead of discarding it.
    extractor, err := tokenization.NewTurkishSentenceExtractor(false, "")
    if err != nil {
        log.Fatalf("sentence extractor init: %v", err)
    }
    sentences := extractor.FromParagraph("Merhaba dünya! Bu bir test cümlesidir.")
    for _, sentence := range sentences {
        fmt.Println(sentence)
    }
}

Sentence normalization

package main

import (
    "fmt"
    "log"

    "github.com/kalaomer/zemberek-go/morphology"
    "github.com/kalaomer/zemberek-go/normalization"
)

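func main() {
    // Build the morphology component with its default settings.
    morph := morphology.CreateWithDefaults()

    // "data" is the resource directory (see "Resource data" below).
    normalizer, err := normalization.NewTurkishSentenceNormalizerAdvanced(morph, "data")
    if err != nil {
        log.Fatalf("normalizer init: %v", err)
    }

    // Informal, misspelled input to be normalized.
    input := "Yrn okua gidicem"
    fmt.Println(normalizer.Normalize(input))
}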

Dependencies

Go 1.18 or higher
Standard library only (no external dependencies for core functionality)

Resource data

Language resources (lexicon binaries, normalization tables, language models) are expected under data/ by default. If you keep them elsewhere, export ZEMBEREK_DATA_ROOT=/absolute/path/to/your/data so both the examples and the advanced normalizer can locate them.

Example data bundles (LM and normalization folders) are available here: https://drive.google.com/drive/folders/1tztjRiUs9BOTH-tb1v7FWyixl-iUpydW. Download the archive, extract it to a directory of your choice, and point ZEMBEREK_DATA_ROOT to that directory before running the examples.

Development Status

The port follows zemberek-nlp’s architecture module by module. Core components, tokenization, lexicon handling, language model loading and advanced normalization are functional; remaining work focuses on fine-tuning morphology generation/ambiguity resolution and extending test coverage as the Java baseline evolves.

Notes

This port mirrors the Java implementation’s architecture while adapting to Go idioms:

Java classes → Go structs/interfaces
Java enums → Go iota constants
Immutable data → Go value types and generated readers
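
As an illustration of the enum mapping listed above, here is a minimal sketch of a Java-style enum expressed with Go `iota` constants. The type `PrimaryPos` and its values are hypothetical examples, not necessarily the names used in this repository.

```go
package main

import "fmt"

// PrimaryPos is a hypothetical part-of-speech enum, standing in for a Java enum.
type PrimaryPos int

const (
	Noun PrimaryPos = iota // 0
	Verb                   // 1
	Adjective              // 2
)

// primaryPosNames maps each constant to its display name, by index.
var primaryPosNames = [...]string{"Noun", "Verb", "Adjective"}

// String implements fmt.Stringer, covering the role of Java's Enum.name().
func (p PrimaryPos) String() string {
	if p < 0 || int(p) >= len(primaryPosNames) {
		return fmt.Sprintf("PrimaryPos(%d)", int(p))
	}
	return primaryPosNames[p]
}

func main() {
	fmt.Println(Verb)           // formatted through String()
	fmt.Println(PrimaryPos(42)) // out-of-range values stay printable
}
```

Unlike Java enums, Go constants are not a closed set, so the `String` method guards against values outside the declared range.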

Credits

Original Java implementation: zemberek-nlp by Ahmet A. Akın
Go port: This repository and its contributors

License

Apache License 2.0

Contributing

Contributions are welcome! This is a large codebase and help with porting remaining modules would be appreciated.
