Releases: 2 (uses the Rust 2024 edition)
| Version | Released |
|---|---|
| 0.1.1 (new) | Jan 17, 2026 |
| 0.1.0 | Jan 16, 2026 |
🦀 Normy
Ultra-fast, zero-copy text normalization, built for Rust NLP pipelines and tokenizers. Flexible enough for any high-throughput multilingual text processing (search, logs, APIs, data pipelines, ...).
Normy delivers extreme performance through automatic iterator fusion and precise early-exit checks, while respecting language-specific rules (e.g., Turkish dotted/dotless I, German ß folding).
- Zero-copy: returns immediately, without allocating, when the input needs no changes.
- Automatic fusion: when a pipeline contains two or more fusable stages, they are fused into a single pass for better cache locality.
- Locale-accurate: built-in rules for correctness across scripts.
- Format-aware: cleans HTML/Markdown while preserving content.
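To see why locale rules matter, here is a minimal standard-library sketch of Turkish lowercasing. It does not use Normy; `lowercase_turkish` is a hypothetical helper written for this illustration, and it only covers the two Turkish-specific mappings for the letter I.

```rust
/// Lowercase a string using the Turkish rules for the letter I.
/// Hypothetical helper for illustration; not part of Normy's API.
fn lowercase_turkish(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            'İ' => out.push('i'), // U+0130 (dotted capital I) -> plain i
            'I' => out.push('ı'), // ASCII I -> U+0131 (dotless i)
            _ => out.extend(c.to_lowercase()),
        }
    }
    out
}

fn main() {
    // Locale-agnostic lowercasing turns "I" into "i", which is wrong for Turkish.
    assert_eq!("KIZILIRMAK".to_lowercase(), "kizilirmak");
    // With the Turkish rules, "I" becomes dotless "ı".
    assert_eq!(lowercase_turkish("KIZILIRMAK"), "kızılırmak");
    println!("{}", lowercase_turkish("TÜRKİYE"));
}
```

A locale-aware stage has to apply mappings like these per language, which is why the pipeline is told the language up front.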
Why Normy?
Traditional normalizers allocate on every call, even for clean text. Normy eliminates this overhead:
- On already-normalized text (common in production streams): up to 51× higher throughput than the HuggingFace tokenizers normalizers, due to true zero-copy.
- On text requiring transformation: 3.7 to 4.1× faster through fusion and optimized stages.
Performance Comparison
Measured against HuggingFace tokenizers normalizers on 64 KiB inputs (200 samples each).
Charts (omitted here) compare throughput for two pipelines:
- Complex BERT-like pipeline (Chinese + Strip + Whitespace + NFD + Diacritics + Lowercase): "Already Normalized Text" and "Needs Transform" panels
- Simple pipeline (French + Lowercase + Transliterate): "Already Normalized Text" and "With Accents/Diacritics" panels
Numbers are geometric means over 200 samples. Hardware, OS, and input distribution can affect results. See /benches/comparison_tokenizers_bench.rs to reproduce them.
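For readers unfamiliar with the statistic, this is a small sketch of how a geometric mean over per-sample ratios can be computed. It is not taken from the benchmark code; `geometric_mean` and the sample values are made up for illustration.

```rust
/// Geometric mean, computed as exp(mean(ln(x))) so that multiplying many
/// values cannot overflow. Returns None for empty input or any value <= 0.
fn geometric_mean(values: &[f64]) -> Option<f64> {
    if values.is_empty() || values.iter().any(|&v| v <= 0.0) {
        return None;
    }
    let log_sum: f64 = values.iter().map(|v| v.ln()).sum();
    Some((log_sum / values.len() as f64).exp())
}

fn main() {
    // Made-up per-sample speedup ratios, for illustration only.
    let ratios = [2.0, 8.0];
    let gm = geometric_mean(&ratios).expect("non-empty, positive input");
    println!("geometric mean = {gm:.3}");
}
```

Compared with an arithmetic mean, the geometric mean is less dominated by a few unusually fast or slow samples, which is why it is a common choice for summarizing speedup ratios.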
💾 Installation
Add Normy to your project:
cargo add normy
⚡ Quickstart
Normy uses a fluent builder pattern with automatic fusion detection.
use std::error::Error;

use normy::{
    COLLAPSE_WHITESPACE_UNICODE, CaseFold, DEU, FRA, JPN, LowerCase, Normy, RemoveDiacritics,
    SegmentWords, TUR, Transliterate, UnifyWidth, ZHO,
};

fn main() -> Result<(), Box<dyn Error>> {
    // ────────────────────────────────────────────────────────────────
    // TURKISH (Turkic): famous for its dotted/dotless I distinction
    // ────────────────────────────────────────────────────────────────
    let tur = Normy::builder()
        .lang(TUR)
        .add_stage(LowerCase) // Critical: İ → i, I → ı
        .build();
    println!(
        "Turkish : {}",
        tur.normalize("KIZILIRMAK NEHRİ TÜRKİYE'NİN EN UZUN NEHRİDİR.")?
    );
    // → kızılırmak nehri türkiye'nin en uzun nehridir.

    // ────────────────────────────────────────────────────────────────
    // GERMAN (Germany/Austria/Switzerland): ß and umlaut handling
    // ────────────────────────────────────────────────────────────────
    let deu = Normy::builder()
        .lang(DEU)
        .add_stage(CaseFold) // ß → ss
        .add_stage(Transliterate) // Ä → ae, Ö → oe, Ü → ue
        .build();
    println!(
        "German : {}",
        deu.normalize("Grüße aus München! Die Straße ist sehr schön.")?
    );
    // → gruesse aus muenchen! die strasse ist sehr schoen.

    // ────────────────────────────────────────────────────────────────
    // FRENCH (France/Belgium/Canada/etc.): classic accented text
    // ────────────────────────────────────────────────────────────────
    let fra = Normy::builder()
        .lang(FRA)
        .add_stage(CaseFold)
        .add_stage(RemoveDiacritics) // é → e, ç → c, etc.
        .build();
    println!(
        "French : {}",
        fra.normalize("Bonjour ! J'adore le café et les croissants à Paris.")?
    );
    // → bonjour ! j'adore le cafe et les croissants a paris.

    // ────────────────────────────────────────────────────────────────
    // CHINESE (Simplified, China): fullwidth & word segmentation
    // ────────────────────────────────────────────────────────────────
    let zho = Normy::builder()
        .lang(ZHO)
        .add_stage(UnifyWidth)
        .add_stage(COLLAPSE_WHITESPACE_UNICODE)
        .add_stage(SegmentWords) // unigram segmentation
        .build();
    println!(
        "Chinese : {}",
        zho.normalize("北京的秋天特别美丽，长城非常壮观！")?
    );
    // → 北 京 的 秋 天 特 别 美 丽 , 长 城 非 常 壮 观 !

    // ────────────────────────────────────────────────────────────────
    // JAPANESE (Japan): script transitions + width unification
    // ────────────────────────────────────────────────────────────────
    let jpn = Normy::builder()
        .lang(JPN)
        .add_stage(UnifyWidth)
        .add_stage(COLLAPSE_WHITESPACE_UNICODE)
        .add_stage(SegmentWords) // script boundary segmentation
        .build();
    println!(
        "Japanese: {}",
        jpn.normalize("東京は本当に素晴らしい街です！桜がとてもきれい。")?
    );
    // → 東京は本当に素晴らしい街です ! 桜がとてもきれい 。

    Ok(())
}
When the text is already normalized, Normy returns Cow::Borrowed, so no allocation takes place.
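The borrowed-versus-owned idea can be shown with the standard library alone. The sketch below is not Normy's implementation; `lowercase_ascii` is a hypothetical stand-in for a stage, using a cheap scan as its early-exit check.

```rust
use std::borrow::Cow;

/// Lowercase ASCII letters, allocating only when something actually changes.
/// Hypothetical stand-in for a normalization stage.
fn lowercase_ascii(input: &str) -> Cow<'_, str> {
    // Early exit: a cheap scan decides whether any work is needed.
    if !input.bytes().any(|b| b.is_ascii_uppercase()) {
        return Cow::Borrowed(input);
    }
    Cow::Owned(input.to_ascii_lowercase())
}

fn main() {
    let clean = lowercase_ascii("already clean");
    let dirty = lowercase_ascii("Needs Work");
    // Clean input is handed back as a borrow of the original string.
    assert!(matches!(clean, Cow::Borrowed(_)));
    // Only input that changes pays for a new String.
    assert!(matches!(dirty, Cow::Owned(_)));
    println!("{clean} / {dirty}");
}
```

Callers can treat both variants as `&str` through deref, so the zero-allocation path needs no special handling on their side.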
✨ Features
| Feature | Description |
|---|---|
| Zero-Copy | No allocation on clean input |
| Iterator Fusion | Automatic speedup on 2+ fusable stages (monomorphized char iterators) |
| Locale-Accurate | Turkish İ/i, German ß→ss, Dutch Ĳ→ij, Arabic/Hebrew diacritics, etc. |
| Format-Aware | Safe HTML/Markdown stripping (preserves <code>, fences, attributes) |
| Composable Pipelines | Fluent builder + dynamic runtime stages |
| Segmentation | Word boundaries for CJK, Indic, Thai, Khmer, etc. (ZWSP insertion) |
| Extensible | Implement custom transformation stages |
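As a rough picture of what a custom stage could look like, here is a self-contained sketch. The `Stage` trait, its two methods, `TabsToSpaces`, and `run` are all assumptions made up for this example; Normy's real trait and method names may differ, so check the API docs before relying on any of this.

```rust
use std::borrow::Cow;

/// Hypothetical stage trait for illustration; Normy's real trait may differ.
trait Stage {
    /// Cheap check: does this stage have anything to do for `text`?
    fn needs_apply(&self, text: &str) -> bool;
    /// Transform the text; only called when `needs_apply` returned true.
    fn apply(&self, text: &str) -> String;
}

/// Example custom stage: replace each tab with a single space.
struct TabsToSpaces;

impl Stage for TabsToSpaces {
    fn needs_apply(&self, text: &str) -> bool {
        text.contains('\t')
    }

    fn apply(&self, text: &str) -> String {
        text.replace('\t', " ")
    }
}

/// Run stages in order, staying borrowed until some stage changes the text.
fn run<'a>(stages: &[&dyn Stage], input: &'a str) -> Cow<'a, str> {
    let mut current: Cow<'a, str> = Cow::Borrowed(input);
    for stage in stages {
        if stage.needs_apply(&current) {
            current = Cow::Owned(stage.apply(&current));
        }
    }
    current
}

fn main() {
    let stages: [&dyn Stage; 1] = [&TabsToSpaces];
    println!("{}", run(&stages, "a\tb"));
}
```

Splitting the check from the transformation is one way to keep the no-change path allocation-free even when stages are added at runtime.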
Available Normalization Stages
Normy provides a rich set of composable, high-performance normalization stages.
Most stages support static iterator fusion for maximum speed (single-pass, zero-copy when possible).
| Stage | Description | Fusion Support |
|---|---|---|
| CaseFold | Locale-aware case folding (German ß→ss, etc.) | Yes |
| LowerCase | Locale-aware lowercasing (Turkish İ→i) | Yes |
| RemoveDiacritics | Removes combining/spacing diacritics (accents, tone marks, etc.) | Yes |
| Transliterate | Language-specific character substitutions (Ä→ae, Ю→ju, etc.) | Yes |
| NormalizePunctuation | Normalizes dashes, quotes, ellipsis, bullets, etc. to standard forms | Yes |
| UnifyWidth | Converts fullwidth → halfwidth (critical for CJK compatibility) | Yes |
| SegmentWords | Inserts spaces at word/script boundaries (CJK unigram, Indic virama, etc.) | Yes |
| StripControlChars | Removes all control characters (Unicode Cc category) | Yes |
| StripFormatControls | Removes directional marks, joiners, ZWSP, invisible operators, etc. | Yes |
| Whitespace Variants | | |
| • COLLAPSE_WHITESPACE | Collapse consecutive ASCII whitespace → single space | Yes |
| • COLLAPSE_WHITESPACE_UNICODE | Collapse all Unicode whitespace → single space | Yes |
| • NORMALIZE_WHITESPACE_FULL | Normalize + collapse + trim all Unicode whitespace | Yes |
| • TRIM_WHITESPACE | Trim leading/trailing ASCII whitespace only | Yes |
| • TRIM_WHITESPACE_UNICODE | Trim leading/trailing Unicode whitespace | Yes |
| Normalization Forms | | |
| • NFC | Unicode canonical composed form (most compact, W3C recommended) | No |
| • NFD | Unicode canonical decomposed form | No |
| • NFKC | Unicode compatibility composed (lossy, e.g. ﬁ→fi, ℃→°C) | No |
| • NFKD | Unicode compatibility decomposed | No |
| StripHtml | Strips HTML tags and decodes entities (format-aware) | No |
| StripMarkdown | Removes Markdown formatting while preserving content | No |
Key notes
- Fusion = static single-pass iterator fusion (zero-copy, with minimal allocation when the conditions are met)
- Non-fusable stages (NFC/NFD/NFKC/NFKD, StripHtml, StripMarkdown) use optimized batch processing and should usually be placed early in the pipeline
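The difference between running stages one after another and fusing them can be sketched with plain iterator adapters. This is only an analogy built on the standard library, not Normy's internals; the three toy stages (lowercase, drop control characters, map the ideographic space to an ASCII space) are chosen for illustration.

```rust
/// Multi-pass version: each stage materializes an intermediate String.
fn multi_pass(input: &str) -> String {
    let lowered: String = input.chars().flat_map(char::to_lowercase).collect();
    let stripped: String = lowered.chars().filter(|c| !c.is_control()).collect();
    stripped
        .chars()
        .map(|c| if c == '\u{3000}' { ' ' } else { c })
        .collect()
}

/// Fused version: the same three stages chained as iterator adapters, so
/// each char flows through all of them in one pass with one allocation.
fn fused(input: &str) -> String {
    input
        .chars()
        .flat_map(char::to_lowercase)
        .filter(|c| !c.is_control())
        .map(|c| if c == '\u{3000}' { ' ' } else { c })
        .collect()
}

fn main() {
    let text = "Hello\u{0007}\u{3000}World";
    // Both versions produce the same result; only the work done differs.
    assert_eq!(multi_pass(text), fused(text));
    println!("{}", fused(text));
}
```

Because the adapter chain is a concrete, monomorphized type, the compiler can inline the whole pipeline into a single loop. A stage that needs to see the whole string at once, such as a Unicode normalization form, does not fit this per-char shape.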
Supported Languages
| Language | Code | Special Features |
|---|---|---|
| European | | |
| Turkish | TUR | Custom case rules (İ/i, I/ı) |
| German | DEU | ß folding, umlaut transliteration |
| Dutch | NLD | IJ digraph folding |
| Danish | DAN | Å/Æ/Ø transliteration |
| Norwegian | NOR | Å/Æ/Ø transliteration |
| Swedish | SWE | Å/Ä/Ö transliteration |
| Icelandic | ISL | Þ/Ð/Æ transliteration |
| French | FRA | Œ/Æ ligatures, accent handling |
| Spanish | SPA | Accent normalization |
| Portuguese | POR | Comprehensive diacritics |
| Italian | ITA | Grave/acute accents |
| Catalan | CAT | Ç transliteration |
| Czech | CES | Háček preservation, selective stripping |
| Slovak | SLK | Caron handling |
| Polish | POL | Ogonek & acute accents |
| Croatian | HRV | Digraph normalization |
| Serbian | SRP | Cyrillic diacritics |
| Lithuanian | LIT | Dot-above vowels |
| Greek | ELL | Polytonic diacritics (6 types) |
| Russian | RUS | Cyrillic→Latin transliteration |
| Middle Eastern | | |
| Arabic | ARA | 15 diacritic types (tashkeel) |
| Hebrew | HEB | 20 vowel points (nikud) |
| Asian | | |
| Vietnamese | VIE | Tone marks (5 tones × vowels) |
| Chinese | ZHO | Word segmentation, CJK unigram |
| Japanese | JPN | Word segmentation |
| Korean | KOR | Word segmentation |
| Thai | THA | Tone marks, word segmentation |
| Lao | LAO | 15 combining marks, segmentation |
| Khmer | KHM | 30+ combining marks, segmentation |
| Myanmar | MYA | 17 combining marks, segmentation |
| South Asian | | |
| Hindi | HIN | Devanagari diacritics, segmentation |
| Bengali | BEN | Bengali diacritics, segmentation |
| Tamil | TAM | Tamil diacritics, segmentation |
| Other | | |
| English | ENG | Default/baseline |
Features Key:
- Word Segmentation: Automatic boundary detection for non-space-delimited scripts
- CJK Unigram: Character-level tokenization for Chinese ideographs
- Transliteration: Script→Latin conversion (e.g., Cyrillic, ligatures)
- Diacritics: Intelligent spacing/combining mark handling
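To make "CJK unigram" concrete, the sketch below inserts a space between adjacent ideographs using only the standard library. It is a simplification for illustration: `is_cjk_ideograph` and `segment_unigrams` are hypothetical names, only the basic CJK Unified Ideographs block is checked, and punctuation and other scripts are passed through untouched, which may not match Normy's own segmentation rules.

```rust
/// Rough check covering only the basic CJK Unified Ideographs block.
fn is_cjk_ideograph(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Insert one space between every pair of adjacent ideographs, so that
/// each ideograph becomes its own whitespace-delimited token.
fn segment_unigrams(input: &str) -> String {
    let mut out = String::with_capacity(input.len() * 2);
    let mut prev_is_cjk = false;
    for c in input.chars() {
        let cjk = is_cjk_ideograph(c);
        if cjk && prev_is_cjk {
            out.push(' ');
        }
        out.push(c);
        prev_is_cjk = cjk;
    }
    out
}

fn main() {
    // Each ideograph ends up as a separate token.
    assert_eq!(segment_unigrams("北京的秋天"), "北 京 的 秋 天");
    // Text without ideographs is left as it was.
    assert_eq!(segment_unigrams("hello"), "hello");
    println!("{}", segment_unigrams("北京的秋天"));
}
```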
Documentation
- Full API docs: docs.rs/normy
- Linguistic rules: LINGUISTIC_POLICY.md
- Pipeline guidelines: PIPELINE_GUIDELINES.md
- Examples are in the examples/ directory
- Generate local docs: cargo doc --open
🤝 Contributing
Contributions are very welcome! See CONTRIBUTING.md
License
Dual-licensed under MIT or Apache-2.0, at your option.
See LICENSE-MIT and LICENSE-APACHE.
Normy: ultra-fast, linguistically correct normalization, the next-generation layer for Rust NLP & tokenizers