#nlp #zero-copy #unicode #normalization #tokenizer

normy

Ultra-fast, zero-copy text normalization for Rust NLP pipelines & tokenizers

2 releases

Uses new Rust 2024

new 0.1.1 Jan 17, 2026
0.1.0 Jan 16, 2026

#376 in Text processing

MIT/Apache

320KB
7K SLoC

📦 Normy

Ultra-fast, zero-copy text normalization, built for Rust NLP pipelines & tokenizers. Flexible enough for any high-throughput multilingual text processing (search, logs, APIs, data pipelines, …)

Normy delivers extreme performance through automatic iterator fusion and precise early-exit checks, while respecting language-specific rules (e.g., Turkish dotted/dotless I, German ß folding).

  • Zero-copy → Returns immediately, without allocating, when the input needs no changes.
  • Automatic fusion → Fuses eligible stages (two or more fusable stages) into a single pass for better cache locality.
  • Locale-accurate → Built-in rules for correctness across scripts.
  • Format-aware → Cleans HTML/Markdown while preserving content.
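
The fusion idea can be sketched with standard-library iterators alone. This is an illustration of the general technique, not Normy's implementation; the function names are hypothetical. The two-pass version materializes an intermediate `String` between stages, while the fused version chains both stages as iterator adapters and allocates once.

```rust
/// Two-pass version: each stage builds its own intermediate String.
fn lowercase_then_strip_two_pass(input: &str) -> String {
    let lowered: String = input.chars().flat_map(char::to_lowercase).collect();
    lowered.chars().filter(|c| !c.is_control()).collect()
}

/// Fused version: the same two stages chained as iterator adapters,
/// so the text is walked once and only the final String is allocated.
fn lowercase_then_strip_fused(input: &str) -> String {
    input
        .chars()
        .flat_map(char::to_lowercase)
        .filter(|c| !c.is_control())
        .collect()
}

fn main() {
    // U+0007 (BEL) is a control character and should be dropped.
    let input = "Hello\u{0007} World";
    let fused = lowercase_then_strip_fused(input);
    assert_eq!(fused, lowercase_then_strip_two_pass(input));
    println!("{fused}"); // prints "hello world"
}
```

Both functions produce the same output; the fused one simply avoids the intermediate buffer.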

Why Normy?

Traditional normalizers allocate on every call, even for clean text. Normy eliminates this overhead:

  • On already-normalized text (common in production streams): up to 51× higher throughput than HuggingFace tokenizers normalizers due to true zero-copy.
  • On text requiring transformation: 3.7–4.1× faster through fusion and optimized stages.

🏆 Performance Comparison

Measured against HuggingFace tokenizers normalizers on 64 KiB inputs (200 samples each).

Complex pipeline, BERT-like (Chinese + Strip + Whitespace + NFD + Diacritics + Lowercase)

  • Already normalized text: [chart: Complex Normalized]
  • Needs transform: [chart: Complex Transform]

Simple pipeline (French + Lowercase + Transliterate)

  • Already normalized text: [chart: Simple Normalized]
  • With accents/diacritics: [chart: Simple Accents]

Numbers represent the geometric mean over 200 samples. Hardware, OS, and input distribution can affect results. See /benches/comparison_tokenizers_bench.rs to reproduce them.
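
For readers unfamiliar with the statistic, a geometric mean can be computed as the exponential of the mean of the logarithms. The sketch below is a generic illustration, not code from the benchmark suite; `geometric_mean` and the sample values are hypothetical.

```rust
/// Geometric mean via logarithms: exp(mean(ln(x))).
/// Returns None for empty input or non-positive samples, where it is undefined.
fn geometric_mean(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() || samples.iter().any(|&x| x <= 0.0) {
        return None;
    }
    let log_sum: f64 = samples.iter().map(|x| x.ln()).sum();
    Some((log_sum / samples.len() as f64).exp())
}

fn main() {
    // Hypothetical per-sample throughput ratios, for illustration only.
    let ratios = [2.0, 8.0];
    if let Some(gm) = geometric_mean(&ratios) {
        // sqrt(2 * 8) = 4, up to floating-point rounding.
        assert!((gm - 4.0).abs() < 1e-9);
        println!("geometric mean = {gm:.3}");
    }
}
```

Unlike the arithmetic mean, the geometric mean is not dominated by a few unusually fast or slow samples, which is why it is a common choice for summarizing ratios.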

💾 Installation

Add Normy to your project:

cargo add normy

⚡ Quickstart

Normy uses a fluent builder pattern with automatic fusion detection.

use std::error::Error;

use normy::{
    COLLAPSE_WHITESPACE_UNICODE, CaseFold, DEU, FRA, JPN, LowerCase, Normy, RemoveDiacritics, SegmentWords,
    TUR, Transliterate, UnifyWidth, ZHO,
};

fn main() -> Result<(), Box<dyn Error>> {
    // ────────────────────────────────────────────────────────────────
    // TURKISH (Turkic) – famous for its dotted/dotless I distinction
    // ────────────────────────────────────────────────────────────────
    let tur = Normy::builder()
        .lang(TUR)
        .add_stage(LowerCase) // Critical: İ → i, I → ı
        .build();

    println!(
        "Turkish : {}",
        tur.normalize("KIZILIRMAK NEHRİ TÜRKİYE'NİN EN UZUN NEHRİDİR.")?
    );
    // → kızılırmak nehri türkiye'nin en uzun nehridir.

    // ────────────────────────────────────────────────────────────────
    // GERMAN (Germany/Austria/Switzerland) – ß and umlaut handling
    // ────────────────────────────────────────────────────────────────
    let deu = Normy::builder()
        .lang(DEU)
        .add_stage(CaseFold) // ß → ss
        .add_stage(Transliterate) // Ä → ae, Ö → oe, Ü → ue
        .build();

    println!(
        "German  : {}",
        deu.normalize("Grüße aus München! Die Straße ist sehr schön.")?
    );
    // → gruesse aus muenchen! die strasse ist sehr schoen.

    // ────────────────────────────────────────────────────────────────
    // FRENCH (France/Belgium/Canada/etc.) – classic accented text
    // ────────────────────────────────────────────────────────────────
    let fra = Normy::builder()
        .lang(FRA)
        .add_stage(CaseFold)
        .add_stage(RemoveDiacritics) // é → e, ç → c, etc.
        .build();

    println!(
        "French  : {}",
        fra.normalize("Bonjour ! J'adore le café et les croissants à Paris.")?
    );
    // → bonjour ! j'adore le cafe et les croissants a paris.

    // ────────────────────────────────────────────────────────────────
    // CHINESE (Simplified – China) – fullwidth & word segmentation
    // ────────────────────────────────────────────────────────────────
    let zho = Normy::builder()
        .lang(ZHO)
        .add_stage(UnifyWidth)
        .add_stage(COLLAPSE_WHITESPACE_UNICODE)
        .add_stage(SegmentWords) // unigram segmentation
        .build();

    println!(
        "Chinese : {}",
        zho.normalize("北京的秋天特别美丽,长城非常壮观!")?
    );
    // → 北 京 的 秋 天 特 别 美 丽 , 长 城 非 常 壮 观 !

    // ────────────────────────────────────────────────────────────────
    // JAPANESE (Japan) – script transitions + width unification
    // ────────────────────────────────────────────────────────────────
    let jpn = Normy::builder()
        .lang(JPN)
        .add_stage(UnifyWidth)
        .add_stage(COLLAPSE_WHITESPACE_UNICODE)
        .add_stage(SegmentWords) // script boundary segmentation
        .build();

    println!(
        "Japanese: {}",
        jpn.normalize("東京は本当に素晴らしい街です!桜がとてもきれい。")?
    );
    // → 東京は本当に素晴らしい街です ! 桜がとてもきれい 。

    Ok(())
}

When text is already normalized, Normy returns Cow::Borrowed, so no allocation takes place.
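
The borrow-when-clean pattern can be shown with `std::borrow::Cow` alone. The following is a minimal sketch of the general technique, assuming a single ASCII-lowercasing stage; `lowercase_ascii` is a hypothetical helper and does not reflect how Normy's early-exit checks are written.

```rust
use std::borrow::Cow;

/// Lowercases ASCII letters, borrowing the input when nothing would change.
fn lowercase_ascii(input: &str) -> Cow<'_, str> {
    // Early-exit check: one scan, no allocation if the text is already clean.
    if !input.bytes().any(|b| b.is_ascii_uppercase()) {
        return Cow::Borrowed(input);
    }
    // Only inputs that actually need work pay for a new String.
    Cow::Owned(input.to_ascii_lowercase())
}

fn main() {
    let clean = lowercase_ascii("already lowercase");
    assert!(matches!(clean, Cow::Borrowed(_)));

    let changed = lowercase_ascii("Needs Work");
    assert!(matches!(changed, Cow::Owned(_)));
    assert_eq!(changed, "needs work");

    println!("{clean} / {changed}");
}
```

Callers can treat both variants as `&str` through deref, so the zero-copy path needs no special handling at the call site.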

✨ Features

| Feature | Description |
|---|---|
| Zero-Copy | No allocation on clean input |
| Iterator Fusion | Automatic speedup on 2+ fusable stages (monomorphized char iterators) |
| Locale-Accurate | Turkish İ/i, German ß→ss, Dutch Ĳ→ij, Arabic/Hebrew diacritics, etc. |
| Format-Aware | Safe HTML/Markdown stripping (preserves <code>, fences, attributes) |
| Composable Pipelines | Fluent builder + dynamic runtime stages |
| Segmentation | Word boundaries for CJK, Indic, Thai, Khmer, etc. (ZWSP insertion) |
| Extensible | Implement custom transformation stages |

💼 Available Normalization Stages

Normy provides a rich set of composable, high-performance normalization stages.
Most stages support static iterator fusion for maximum speed (single-pass, zero-copy when possible).

| Stage | Description | Fusion Support |
|---|---|---|
| CaseFold | Locale-aware case folding (German ß→ss, etc.) | Yes |
| LowerCase | Locale-aware lowercasing (Turkish İ→i) | Yes |
| RemoveDiacritics | Removes combining/spacing diacritics (accents, tone marks, etc.) | Yes |
| Transliterate | Language-specific character substitutions (Ä→ae, Ю→ju, etc.) | Yes |
| NormalizePunctuation | Normalizes dashes, quotes, ellipsis, bullets, etc. to standard forms | Yes |
| UnifyWidth | Converts fullwidth → halfwidth (critical for CJK compatibility) | Yes |
| SegmentWords | Inserts spaces at word/script boundaries (CJK unigram, Indic virama, etc.) | Yes |
| StripControlChars | Removes all control characters (Unicode Cc category) | Yes |
| StripFormatControls | Removes directional marks, joiners, ZWSP, invisible operators, etc. | Yes |
| Whitespace Variants | | |
| • COLLAPSE_WHITESPACE | Collapse consecutive ASCII whitespace → single space | Yes |
| • COLLAPSE_WHITESPACE_UNICODE | Collapse all Unicode whitespace → single space | Yes |
| • NORMALIZE_WHITESPACE_FULL | Normalize + collapse + trim all Unicode whitespace | Yes |
| • TRIM_WHITESPACE | Trim leading/trailing ASCII whitespace only | Yes |
| • TRIM_WHITESPACE_UNICODE | Trim leading/trailing Unicode whitespace | Yes |
| Normalization Forms | | |
| • NFC | Unicode canonical composed form (most compact, W3C recommended) | No |
| • NFD | Unicode canonical decomposed form | No |
| • NFKC | Unicode compatibility composed (lossy, e.g. ﬁ→fi, ℃→°C) | No |
| • NFKD | Unicode compatibility decomposed | No |
| StripHtml | Strips HTML tags and decodes entities (format-aware) | No |
| StripMarkdown | Removes Markdown formatting while preserving content | No |

Key notes

  • Fusion = static single-pass iterator fusion (zero-copy + minimal allocation when conditions are met)
  • Non-fusable stages (NFC/NFD/NFKC/NFKD, StripHtml, StripMarkdown) use optimized batch processing and should usually be placed early in the pipeline
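
To make one of the fusable, per-character stages concrete, here is a rough sketch of width unification. It is an assumption-laden illustration rather than Normy's UnifyWidth: it covers only the fullwidth ASCII block (U+FF01 to U+FF5E) and the ideographic space, and `to_halfwidth` and `unify_width` are hypothetical names.

```rust
/// Maps a fullwidth ASCII variant to its halfwidth counterpart.
/// Fullwidth forms U+FF01..=U+FF5E sit exactly 0xFEE0 above ASCII U+0021..=U+007E.
fn to_halfwidth(c: char) -> char {
    match c {
        '\u{3000}' => ' ', // ideographic space → ASCII space
        '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
        _ => c,
    }
}

/// A char → char mapping like this is the kind of stage that can be chained
/// with other per-character adapters in a single pass.
fn unify_width(input: &str) -> String {
    input.chars().map(to_halfwidth).collect()
}

fn main() {
    // Fullwidth "AB!" followed by an ideographic space and halfwidth "ok".
    let mixed = "\u{FF21}\u{FF22}\u{FF01}\u{3000}ok";
    assert_eq!(unify_width(mixed), "AB! ok");
    println!("{}", unify_width(mixed));
}
```

Because the mapping never needs to look ahead or behind, it composes naturally with other one-character-at-a-time stages, which is the property the fusion note relies on.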

🆎 Supported Languages

| Language | Code | Special Features |
|---|---|---|
| European | | |
| Turkish | TUR | Custom case rules (İ/i, I/ı) |
| German | DEU | ß folding, umlaut transliteration |
| Dutch | NLD | IJ digraph folding |
| Danish | DAN | Å/Æ/Ø transliteration |
| Norwegian | NOR | Å/Æ/Ø transliteration |
| Swedish | SWE | Å/Ä/Ö transliteration |
| Icelandic | ISL | Þ/Ð/Æ transliteration |
| French | FRA | Œ/Æ ligatures, accent handling |
| Spanish | SPA | Accent normalization |
| Portuguese | POR | Comprehensive diacritics |
| Italian | ITA | Grave/acute accents |
| Catalan | CAT | Ç transliteration |
| Czech | CES | Háček preservation, selective stripping |
| Slovak | SLK | Caron handling |
| Polish | POL | Ogonek & acute accents |
| Croatian | HRV | Digraph normalization |
| Serbian | SRP | Cyrillic diacritics |
| Lithuanian | LIT | Dot-above vowels |
| Greek | ELL | Polytonic diacritics (6 types) |
| Russian | RUS | Cyrillic→Latin transliteration |
| Middle Eastern | | |
| Arabic | ARA | 15 diacritic types (tashkeel) |
| Hebrew | HEB | 20 vowel points (nikud) |
| Asian | | |
| Vietnamese | VIE | Tone marks (5 tones × vowels) |
| Chinese | ZHO | Word segmentation, CJK unigram |
| Japanese | JPN | Word segmentation |
| Korean | KOR | Word segmentation |
| Thai | THA | Tone marks, word segmentation |
| Lao | LAO | 15 combining marks, segmentation |
| Khmer | KHM | 30+ combining marks, segmentation |
| Myanmar | MYA | 17 combining marks, segmentation |
| South Asian | | |
| Hindi | HIN | Devanagari diacritics, segmentation |
| Bengali | BEN | Bengali diacritics, segmentation |
| Tamil | TAM | Tamil diacritics, segmentation |
| Other | | |
| English | ENG | Default/baseline |
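
The Turkish entry shows why locale matters: locale-independent lowercasing maps `I` to `i`, which is wrong for Turkish. The sketch below hand-codes the two special cases using only the standard library. It is an illustration under that assumption, not Normy's LowerCase stage, and `lowercase_turkish` is a hypothetical helper.

```rust
/// Lowercases text using the Turkish dotted/dotless I rules.
fn lowercase_turkish(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            'I' => out.push('\u{0131}'),       // I → ı (dotless i)
            '\u{0130}' => out.push('i'),       // İ → i (dotted i)
            _ => out.extend(c.to_lowercase()), // everything else: default mapping
        }
    }
    out
}

fn main() {
    let input = "KIZILIRMAK NEHR\u{0130}";
    // Default lowercasing turns every 'I' into a dotted 'i'.
    assert_eq!("KIZILIRMAK".to_lowercase(), "kizilirmak");
    // The Turkish-aware version keeps the dotless vowel.
    assert_eq!(lowercase_turkish(input), "k\u{0131}z\u{0131}l\u{0131}rmak nehri");
    println!("{}", lowercase_turkish(input));
}
```

A real locale-aware stage has to handle more than these two characters, but they are the cases that most often break search and deduplication for Turkish text.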

Features Key:

  • Word Segmentation: Automatic boundary detection for non-space-delimited scripts
  • CJK Unigram: Character-level tokenization for Chinese ideographs
  • Transliteration: Script→Latin conversion (e.g., Cyrillic, ligatures)
  • Diacritics: Intelligent spacing/combining mark handling

📖 Documentation

  • Full API docs: docs.rs/normy
  • Linguistic rules: LINGUISTIC_POLICY.md
  • Pipeline guidelines: PIPELINE_GUIDELINES.md
  • Examples are in the examples/ directory
  • Generate local docs:
cargo doc --open

🤝 Contributing

Contributions are very welcome! See CONTRIBUTING.md


📜 License

Dual-licensed under MIT or Apache-2.0, at your option.

See LICENSE-MIT and LICENSE-APACHE.


Normy: ultra-fast, linguistically correct normalization, the next-generation layer for Rust NLP & tokenizers

Dependencies

~4.5MB
~83K SLoC