# tekken-rs
A Rust implementation of the Mistral Tekken tokenizer with audio support. This library provides fast and efficient tokenization capabilities for text and audio data, fully compatible with Mistral AI's tokenizer.
## Features

- **Text Tokenization**: Full compatibility with Mistral's Tekken tokenizer
- **Audio Support**: Encode and decode audio data with mel-scale spectrogram processing
- **Multiple Versions**: Support for various tokenizer versions (V7, etc.)
- **Special Tokens**: Complete handling of special tokens (BOS, EOS, audio tokens, etc.)
## Installation

Add this to your `Cargo.toml`:

```toml
[dependencies]
tekken = "0.1.0"
```

Or use the Git repository directly:

```toml
[dependencies]
tekken = { git = "https://github.com/jorge-menjivar/tekken-rs" }
```
## Quick Start

### Basic Text Tokenization

```rust
use tekken::tekkenizer::Tekkenizer;
use tekken::special_tokens::SpecialTokenPolicy;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the tokenizer from its JSON definition
    let tokenizer = Tekkenizer::from_file("tekken.json")?;

    // Encode text (add_bos = true, add_eos = true)
    let text = "Hello, world!";
    let tokens = tokenizer.encode(text, true, true)?;

    // Decode tokens, keeping special tokens in the output
    let decoded = tokenizer.decode(&tokens, SpecialTokenPolicy::Keep)?;

    println!("Original: {}", text);
    println!("Tokens: {:?}", tokens);
    println!("Decoded: {}", decoded);
    Ok(())
}
```
### Audio Processing

```rust
use tekken::audio::{Audio, AudioConfig, AudioSpectrogramConfig, AudioEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load audio from a WAV file
    let audio = Audio::from_file("audio.wav")?;

    // Create the audio configuration
    let spectrogram_config = AudioSpectrogramConfig::new(80, 160, 400)?;
    let audio_config = AudioConfig::new(16000, 12.5, spectrogram_config, None)?;

    // Encode audio to tokens (audio_token_id = 1000, begin_audio_token_id = 1001)
    let encoder = AudioEncoder::new(audio_config, 1000, 1001);
    let encoding = encoder.encode(audio)?;

    println!("Audio encoded to {} tokens", encoding.tokens.len());
    Ok(())
}
```
## Examples

Run the examples to see the tokenizer in action:

```sh
# Basic tokenizer test
cargo run --example basic_tokenizer_test

# Audio processing test
cargo run --bin test_audio
```
## Testing

Run the test suite:

```sh
cargo test
```
## Architecture

The tokenizer consists of several key components:

- `tokenizer.rs`: Main tokenizer implementation
- `audio.rs`: Audio processing and encoding functionality
- `special_tokens.rs`: Special token definitions and handling
- `config.rs`: Configuration structures
- `errors.rs`: Error handling
## Audio Support

The audio implementation includes:

- WAV file loading and processing
- Mel-scale spectrogram computation
- Audio chunk encoding to tokens
- Compatibility with the Python implementation
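For background, the mel-scale spectrogram step rests on the standard HTK-style mel conversion. A minimal, self-contained sketch of the frequency-to-mel mapping (this is the textbook formula, not this crate's internal API):

```rust
/// Convert a frequency in Hz to the mel scale (HTK formula).
fn hz_to_mel(hz: f64) -> f64 {
    2595.0 * (1.0 + hz / 700.0).log10()
}

/// Inverse mapping: mel back to Hz.
fn mel_to_hz(mel: f64) -> f64 {
    700.0 * (10f64.powf(mel / 2595.0) - 1.0)
}

fn main() {
    // Mel filter banks place filter centers evenly on the mel scale
    // between 0 Hz and the Nyquist frequency (8 kHz at 16 kHz sampling).
    let mel_max = hz_to_mel(8000.0);
    println!("8 kHz = {:.1} mel", mel_max);
    // Round-trip sanity check
    println!("round-trip: {:.1} Hz", mel_to_hz(mel_max));
}
```

The mel centers are then mapped back to FFT bins to build the triangular filter bank applied to each spectrogram frame.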
### Audio Token Flow

1. **Load Audio**: Load WAV files or audio data
2. **Resample**: Convert to the target sampling rate (16 kHz)
3. **Pad**: Ensure minimum length for processing
4. **Tokenize**: Convert to a token sequence with special audio markers
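The arithmetic behind these steps can be sketched in plain Rust. This is a hypothetical illustration (not the crate's API), assuming a 16 kHz target rate and 12.5 audio tokens per second, consistent with the `AudioConfig` values shown earlier:

```rust
/// Length of a clip after resampling from src_rate to dst_rate.
/// (Only length scaling is shown; a real resampler also interpolates samples.)
fn resampled_len(samples: usize, src_rate: u32, dst_rate: u32) -> usize {
    (samples as u64 * dst_rate as u64 / src_rate as u64) as usize
}

/// Pad the clip up to a minimum number of samples.
fn padded_len(samples: usize, min_samples: usize) -> usize {
    samples.max(min_samples)
}

/// Number of audio tokens for a clip at a given tokens-per-second rate.
fn token_count(samples: usize, sample_rate: u32, tokens_per_sec: f64) -> usize {
    let seconds = samples as f64 / sample_rate as f64;
    (seconds * tokens_per_sec).ceil() as usize
}

fn main() {
    // 2 s of 44.1 kHz audio resampled to 16 kHz -> 32_000 samples
    let resampled = resampled_len(88_200, 44_100, 16_000);
    // Pad to at least one analysis window (e.g. 400 samples, as in the config)
    let padded = padded_len(resampled, 400);
    // 12.5 tokens/s over 2 s -> 25 tokens
    println!("{} samples -> {} tokens", padded, token_count(padded, 16_000, 12.5));
}
```

The special begin-audio and audio marker tokens are then wrapped around this token sequence during encoding.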
## Compatibility
This Rust implementation is designed to be fully compatible with the Python version:
- Same tokenization results
- Identical audio processing
- Compatible special token handling
- Same mel filter bank computations
## Requirements
- Rust 1.70 or higher
- For audio support: audio files in WAV format
## Project Structure

```text
tekken-rs/
├── src/
│   ├── lib.rs            # Library entry point
│   ├── tokenizer.rs      # Main tokenizer implementation
│   ├── audio.rs          # Audio processing functionality
│   ├── special_tokens.rs # Special token definitions
│   ├── config.rs         # Configuration structures
│   └── errors.rs         # Error types
├── examples/             # Example usage
├── tests/                # Integration tests
└── benches/              # Performance benchmarks
```
## Performance
The Rust implementation provides significant performance improvements over the Python version:
- Fast tokenization using efficient data structures
- Zero-copy string handling where possible
- Optimized audio processing with SIMD operations
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to:
- Update tests as appropriate
- Follow Rust coding conventions
- Run `cargo fmt` and `cargo clippy` before submitting

See CONTRIBUTING.md for detailed guidelines.
## License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
## Acknowledgments
This is an original Rust implementation designed to be compatible with Mistral AI's Tekken tokenizer format.
See NOTICE file for detailed attribution.