1 unstable release
Uses new Rust 2024
| 0.1.0 | Jan 5, 2026 |
|---|
#1485 in Text processing
Used in 2 crates
370KB
9K
SLoC
mecrab
Core runtime library for MeCrab morphological analyzer.
Features
- MeCab Compatible: Works with IPADIC/UniDic dictionaries
- High Performance: Memory-mapped dictionaries, SIMD-optimized Viterbi (AVX2)
- Thread-safe: Safe concurrent access
- Live Updates: Add/remove words at runtime
- Semantic Linking: Wikidata URI attachment with JSON-LD/RDF export
- N-best Search: A* algorithm for multiple path analysis
- Streaming: Sentence boundary detection for large text processing
- Phonetic Transduction: Kana/Romaji/X-SAMPA/IPA conversion
Installation
[dependencies]
mecrab = "0.1"
Feature Flags
| Feature | Description |
|---|---|
json |
JSON output format |
parallel |
Parallel batch processing (rayon) |
simd |
SIMD optimizations (AVX2) |
wasm |
WebAssembly bindings |
python |
Python bindings (PyO3) |
Usage
use mecrab::MeCrab;
let mecrab = MeCrab::new()?;
let result = mecrab.parse("すもももももももものうち")?;
println!("{}", result);
// Add custom words
mecrab.add_word("ChatGPT", "チャットジーピーティー", "チャットジーピーティー", 5000);
// N-best paths
use mecrab::viterbi::NbestSearch;
let nbest = NbestSearch::new(&mecrab);
for path in nbest.search("東京", 5)? {
println!("Cost: {}", path.total_cost);
}
// Phonetic conversion
use mecrab::phonetic::PhoneticTransducer;
let transducer = PhoneticTransducer::new();
println!("{}", transducer.to_romaji("こんにちは")); // konnichiha
Module Structure
mecrab/src/
├── lib.rs # Public API
├── dict/ # Dictionary loading
│ ├── mod.rs # Token, SysDic, OverlayDictionary
│ └── user_dict.rs # User dictionary persistence
├── lattice/ # Lattice construction
├── viterbi/ # Viterbi algorithm
│ ├── mod.rs # Core Viterbi
│ ├── simd.rs # AVX2 acceleration
│ ├── nbest.rs # N-best A* search
│ └── analysis.rs # Cost analysis
├── semantic/ # Semantic enrichment
│ ├── mod.rs # SemanticEntry, EntityType
│ ├── pool.rs # SemanticPool (5-byte entries)
│ ├── jsonld.rs # JSON-LD export
│ ├── rdf.rs # RDF/Turtle/N-Triples export
│ ├── disambiguation.rs # Disambiguation strategies
│ └── extension.rs # TokenExtension
├── phonetic/ # Phonetic processing
│ ├── mod.rs # Reading extraction
│ └── transducer.rs # Kana/Romaji/X-SAMPA/IPA
├── stream.rs # Streaming text processing
├── normalize.rs # Text normalization
├── bench.rs # Benchmarking utilities
└── error.rs # Error types
License
MIT OR Apache-2.0
Dependencies
~4–8.5MB
~219K SLoC