The missing tiktoken training code
A lightweight Rust library for training GPT-style BPE tokenizers. The tiktoken library is excellent for inference but doesn't support training. The HuggingFace tokenizers library supports training but carries significant complexity from years of accumulated tokenizer variants. My minbpe library handles both training and inference, but it is pure Python and not optimized for speed.
rustbpe fills this gap: a simple, efficient BPE training implementation in Rust with Python bindings. Train your tokenizer with rustbpe, then export to tiktoken for fast inference.
- Fast training with parallel processing (rayon)
- GPT-4 style regex pre-tokenization by default
- Direct export to tiktoken format
- Python bindings via PyO3
- Batch encoding with automatic parallelization
Install from PyPI:

```bash
pip install rustbpe
```

Or build from source:

```bash
git clone https://github.com/karpathy/rustbpe.git
cd rustbpe
uv venv && source .venv/bin/activate
uv pip install maturin
maturin develop --release
```

Quick start:

```python
import rustbpe
# Create tokenizer and train on your data
tokenizer = rustbpe.Tokenizer()
tokenizer.train_from_iterator(
["your", "training", "texts", "here"],
vocab_size=4096
)
# Encode and decode
ids = tokenizer.encode("hello world")
text = tokenizer.decode(ids) # "hello world"
# Check vocabulary size
print(tokenizer.vocab_size) # 4096
# Batch encode (parallel)
all_ids = tokenizer.batch_encode(["text one", "text two", "text three"])
```

The main use case: train with rustbpe, inference with tiktoken.

```python
import rustbpe
import tiktoken
# Train
tokenizer = rustbpe.Tokenizer()
tokenizer.train_from_iterator(open("corpus.txt"), vocab_size=8192)
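# (note: an open file is an iterator over its lines, so the corpus streams through)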
# Export to tiktoken
enc = tiktoken.Encoding(
name="my_tokenizer",
pat_str=tokenizer.get_pattern(),
mergeable_ranks={bytes(k): v for k, v in tokenizer.get_mergeable_ranks()},
special_tokens={},
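    # special tokens could be registered here instead, e.g. (my example):
    # special_tokens={"<|endoftext|>": tokenizer.vocab_size},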
)
# Fast inference with tiktoken
ids = enc.encode("hello world")
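# optional sanity check (my assumption: the exported ranks reproduce the
# trained merges exactly, so tiktoken and rustbpe agree on any input)
assert ids == tokenizer.encode("hello world")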
text = enc.decode(ids)
```

By default, rustbpe uses the GPT-4 tokenization pattern. You can provide your own:

```python
tokenizer.train_from_iterator(
    texts,
    vocab_size=4096,
    pattern=r"[a-zA-Z]+|[0-9]+|\s+"  # custom pattern
)
```

| Method | Description |
|---|---|
| `Tokenizer()` | Create a new tokenizer |
| `train_from_iterator(texts, vocab_size, buffer_size=8192, pattern=None)` | Train on an iterator of strings |
| `encode(text)` | Encode a string to token IDs |
| `decode(ids)` | Decode token IDs back to a string |
| `batch_encode(texts)` | Encode multiple strings in parallel |
| `vocab_size` | Property: vocabulary size (256 + number of merges) |
| `get_pattern()` | Get the regex pattern used for pre-tokenization |
| `get_mergeable_ranks()` | Get token bytes and ranks for tiktoken export |
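To make the pattern's role concrete, here is a small illustration (my sketch, not part of the library): the pre-tokenization regex splits text into chunks, and BPE merges never cross chunk boundaries. GPT-style patterns use Unicode property classes and possessive quantifiers, so they need the third-party `regex` module rather than the standard `re`:

```python
import regex  # third-party module, not the stdlib `re`

# assuming `tokenizer` was trained with the default (GPT-4) pattern
pat = regex.compile(tokenizer.get_pattern())
print(pat.findall("hello world!!"))  # expected chunks: ['hello', ' world', '!!']
```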
To set up for development you will need:

- Rust: https://rustup.rs/
- uv: `curl -LsSf https://astral.sh/uv/install.sh | sh`

```bash
git clone https://github.com/karpathy/rustbpe.git
cd rustbpe
uv venv && source .venv/bin/activate
uv pip install maturin pytest
maturin develop
```

Run the tests:

```bash
# Rust tests (fast, tests the core algorithm)
cargo test
# Python tests (requires maturin develop first)
pytest tests/python/ -v -s
# Both
cargo test && pytest tests/python/ -vrustbpe/
├── Cargo.toml # Rust package manifest
├── pyproject.toml # Python package manifest
├── src/
│   └── lib.rs            # Rust implementation + PyO3 bindings + tests
└── tests/
    └── python/
        └── test_tokenizer.py
```
Byte Pair Encoding builds a vocabulary iteratively:
- Start with 256 byte-level tokens (0x00-0xff)
- Count all adjacent token pairs in the corpus
- Merge the most frequent pair into a new token
- Repeat until reaching target vocabulary size
The result is a vocabulary that efficiently represents common patterns while being able to encode any input.
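For intuition, here is a deliberately naive Python sketch of that loop (my illustration, not the rustbpe implementation, which chunks text with the regex pattern and counts pairs in parallel in Rust):

```python
from collections import Counter

def train_bpe(text: str, vocab_size: int) -> dict[tuple[int, int], int]:
    """Naive BPE training: returns a dict mapping merged pair -> new token id."""
    ids = list(text.encode("utf-8"))        # start from raw bytes (tokens 0..255)
    merges = {}
    for new_id in range(256, vocab_size):
        pairs = Counter(zip(ids, ids[1:]))  # count adjacent token pairs
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)    # most frequent pair wins
        merges[pair] = new_id
        # replace every occurrence of `pair` with the new token
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return merges

merges = train_bpe("aaabdaaabac", vocab_size=256 + 3)
print(merges)  # e.g. {(97, 97): 256, (256, 97): 257, (257, 98): 258}
```

Decoding simply inverts the merges: expand each token id back into its constituent bytes and UTF-8-decode the result.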
I wrote the Python reference code personally and from scratch, and I understand it fully. I then wrote the Rust code against that implementation, with tests for equality. However, I am not a Rust developer by background, so I had significant help from ChatGPT and Claude Code Opus 4.5. All the equality tests pass as far as I am aware, but I apologize if some of the Rust code is not properly arranged, structured, or implemented. Please let me know in Issues/PRs if so, and I am happy to adjust the code to make it better.
License: MIT