Zemberek-Go

Go implementation of the original zemberek-nlp Java library for Turkish language processing.

Features

Currently, the following modules have been ported:

Core

  • Turkish alphabet and phonetic attributes
  • Multi-level perfect hash functions and compression primitives
  • Text utilities for casing, diacritics and token helpers

Tokenization

  • Token/span types and sentence boundary detection

Language Model (LM)

  • Compressed vocabulary and n‑gram accessors
  • SmoothLM reader with MPHFs

Morphology

  • Binary lexicon loader and dictionary items
  • Morphotactics graph, analysis and generation helpers

Normalization

  • Full sentence normalizer with spell checker + LM ranking
  • Deasciifier and ASCII tolerant utilities

Installation

go get github.com/kalaomer/zemberek-go

Usage

package main

import (
    "fmt"
    "log"

    "github.com/kalaomer/zemberek-go/core/turkish"
    "github.com/kalaomer/zemberek-go/tokenization"
)

func main() {
    // Query the shared Turkish alphabet instance.
    alphabet := turkish.Instance
    fmt.Println("Is 'ı' a vowel?", alphabet.IsVowel('ı'))

    // Split a paragraph into sentences.
    extractor, err := tokenization.NewTurkishSentenceExtractor(false, "")
    if err != nil {
        log.Fatalf("sentence extractor init: %v", err)
    }
    sentences := extractor.FromParagraph("Merhaba dünya! Bu bir test cümlesidir.")
    for _, sentence := range sentences {
        fmt.Println(sentence)
    }
}

Sentence normalization

package main

import (
    "fmt"
    "log"

    "github.com/kalaomer/zemberek-go/morphology"
    "github.com/kalaomer/zemberek-go/normalization"
)

func main() {
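    // Build the morphology with its default settings.
    morph := morphology.CreateWithDefaults()

    // The second argument is the resource directory (see "Resource data").
    normalizer, err := normalization.NewTurkishSentenceNormalizerAdvanced(morph, "data")
    if err != nil {
        log.Fatalf("normalizer init: %v", err)
    }

    // Informal, misspelled input to be normalized.
    input := "Yrn okua gidicem"
    fmt.Println(normalizer.Normalize(input))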
}

Dependencies

  • Go 1.18 or higher
  • Standard library only (no external dependencies for core functionality)

Resource data

Language resources (lexicon binaries, normalization tables, language models) are expected under data/ by default. If you keep them elsewhere, export ZEMBEREK_DATA_ROOT=/absolute/path/to/your/data so both the examples and the advanced normalizer can locate them.

Example data bundles (LM and normalization folders) are available here: https://drive.google.com/drive/folders/1tztjRiUs9BOTH-tb1v7FWyixl-iUpydW. Download the archive, extract it to a directory of your choice, and point ZEMBEREK_DATA_ROOT to that directory before running the examples.

Development Status

The port follows zemberek-nlp’s architecture module by module. Core components, tokenization, lexicon handling, language model loading and advanced normalization are functional; remaining work focuses on fine-tuning morphology generation/ambiguity resolution and extending test coverage as the Java baseline evolves.

Notes

This port mirrors the Java implementation’s architecture while adapting to Go idioms:

  • Java classes → Go structs/interfaces
  • Java enums → Go iota constants
  • Immutable data → Go value types and generated readers
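As an illustration of the enum mapping, a Java enum typically becomes a named integer type with iota constants and a String method. The sketch below is generic: `PosTag` and its values are hypothetical and are not taken from this repository's source.

```go
package main

import "fmt"

// PosTag is a hypothetical enum-like type used only for illustration.
type PosTag int

// iota assigns 0, 1, 2 in declaration order, like Java enum ordinals.
const (
	Noun PosTag = iota
	Verb
	Adjective
)

// posNames holds the display name of each constant, indexed by value.
var posNames = [...]string{"Noun", "Verb", "Adjective"}

// String makes PosTag satisfy fmt.Stringer, similar to Java's Enum.name().
func (p PosTag) String() string {
	if p < 0 || int(p) >= len(posNames) {
		return fmt.Sprintf("PosTag(%d)", int(p))
	}
	return posNames[p]
}

func main() {
	fmt.Println(Noun, Verb, Adjective) // prints "Noun Verb Adjective"
	fmt.Println(PosTag(7))             // prints "PosTag(7)"
}
```

Unlike a Java enum, the Go type does not prevent out-of-range values, which is why String handles them explicitly.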

Credits

  • Original Java implementation: zemberek-nlp by Ahmet A. Akın
  • Go port: This repository and its contributors

License

Apache License 2.0

Contributing

Contributions are welcome! This is a large codebase, and help with porting the remaining modules would be appreciated.

