
rustbpe

The missing tiktoken training code

A lightweight Rust library for training GPT-style BPE tokenizers. The tiktoken library is excellent for inference but doesn't support training. The HuggingFace tokenizers library supports training but carries significant complexity from years of accumulated tokenizer variants. My minbpe library handles both training and inference, but it is written in pure Python and is not optimized for speed.

rustbpe fills this gap: a simple, efficient BPE training implementation in Rust with Python bindings. Train your tokenizer with rustbpe, then export to tiktoken for fast inference.

Features

  • Fast training with parallel processing (rayon)
  • GPT-4 style regex pre-tokenization by default
  • Direct export to tiktoken format
  • Python bindings via PyO3
  • Batch encoding with automatic parallelization

Installation

Python

pip install rustbpe

From source

git clone https://github.com/karpathy/rustbpe.git
cd rustbpe
uv venv && source .venv/bin/activate
uv pip install maturin
maturin develop --release

Usage

Training

import rustbpe

# Create tokenizer and train on your data
tokenizer = rustbpe.Tokenizer()
tokenizer.train_from_iterator(
    ["your", "training", "texts", "here"],
    vocab_size=4096
)

# Encode and decode
ids = tokenizer.encode("hello world")
text = tokenizer.decode(ids)  # "hello world"

# Check vocabulary size
print(tokenizer.vocab_size)  # 4096

# Batch encode (parallel)
all_ids = tokenizer.batch_encode(["text one", "text two", "text three"])

Export to tiktoken

The main use case: train with rustbpe, inference with tiktoken.

import rustbpe
import tiktoken

# Train
tokenizer = rustbpe.Tokenizer()
tokenizer.train_from_iterator(open("corpus.txt"), vocab_size=8192)

# Export to tiktoken
enc = tiktoken.Encoding(
    name="my_tokenizer",
    pat_str=tokenizer.get_pattern(),
    mergeable_ranks={bytes(k): v for k, v in tokenizer.get_mergeable_ranks()},
    special_tokens={},
)

# Fast inference with tiktoken
ids = enc.encode("hello world")
text = enc.decode(ids)
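
Since the exported encoding should reproduce rustbpe's behavior, a quick equality check makes a good sanity test. A small sketch continuing the example above, assuming both encode calls return plain lists of token IDs:

# the two implementations should agree on the same input
sample = "The quick brown fox jumps over the lazy dog 1234!"
assert tokenizer.encode(sample) == enc.encode(sample)
assert enc.decode(enc.encode(sample)) == sample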

Custom regex pattern

By default, rustbpe uses the GPT-4 tokenization pattern. You can provide your own:

tokenizer.train_from_iterator(
    texts,
    vocab_size=4096,
    pattern=r"[a-zA-Z]+|[0-9]+|\s+"  # custom pattern
)
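
Pre-tokenization means the text is first split into chunks by this regex, and BPE merges are learned within chunks only; merges never cross chunk boundaries. A quick illustration of how the simple custom pattern above chunks a string, using the standard library's re module (note the default GPT-4 pattern uses \p{...} classes, which Python's built-in re does not support; the third-party regex package does):

import re

# the text is split into regex matches before any merging happens
pattern = r"[a-zA-Z]+|[0-9]+|\s+"
print(re.findall(pattern, "abc123 def"))  # ['abc', '123', ' ', 'def']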

API Reference

Tokenizer

  • Tokenizer(): create a new tokenizer
  • train_from_iterator(texts, vocab_size, buffer_size=8192, pattern=None): train on an iterator of strings
  • encode(text): encode a string to token IDs
  • decode(ids): decode token IDs back to a string
  • batch_encode(texts): encode multiple strings in parallel
  • vocab_size (property): vocabulary size (256 + number of merges)
  • get_pattern(): get the regex pattern used for pre-tokenization
  • get_mergeable_ranks(): get token bytes and ranks for tiktoken export
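
For large corpora, train_from_iterator can consume a lazy generator so the whole dataset never has to sit in memory. A minimal sketch, assuming a line-oriented corpus.txt, and assuming that buffer_size (see the signature above) controls how many texts are pulled from the iterator per internal batch:

import rustbpe

def iter_texts(path):
    # yield one line at a time; skip blank lines
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield line

tokenizer = rustbpe.Tokenizer()
# buffer_size is assumed here to set how many texts are buffered per batch
tokenizer.train_from_iterator(iter_texts("corpus.txt"), vocab_size=8192, buffer_size=16384)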

Development

Prerequisites

  • Rust toolchain (cargo, used by maturin to build the extension)
  • Python with uv (or pip) to manage the virtual environment

Setup

git clone https://github.com/karpathy/rustbpe.git
cd rustbpe
uv venv && source .venv/bin/activate
uv pip install maturin pytest
maturin develop

Running tests

# Rust tests (fast, tests core algorithm)
cargo test

# Python tests (requires maturin develop first)
pytest tests/python/ -v -s

# Both
cargo test && pytest tests/python/ -v

Project structure

rustbpe/
├── Cargo.toml              # Rust package manifest
├── pyproject.toml          # Python package manifest
├── src/
│   └── lib.rs              # Rust implementation + PyO3 bindings + tests
└── tests/
    └── python/
        └── test_tokenizer.py

How BPE works

Byte Pair Encoding builds a vocabulary iteratively:

  1. Start with 256 byte-level tokens (0x00-0xff)
  2. Count all adjacent token pairs in the corpus
  3. Merge the most frequent pair into a new token
  4. Repeat until reaching target vocabulary size

The result is a vocabulary that efficiently represents common patterns while being able to encode any input.
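
A pure-Python sketch of this loop, for illustration only (the Rust implementation uses rayon for parallelism and works on regex pre-tokenized chunks rather than one long byte sequence):

from collections import Counter

def train_bpe(text, vocab_size):
    # toy BPE trainer on a single string, no regex pre-tokenization
    ids = list(text.encode("utf-8"))        # 1. start from raw bytes: tokens 0..255
    merges = {}                             # (left, right) -> new token id
    next_id = 256
    while next_id < vocab_size:
        pairs = Counter(zip(ids, ids[1:]))  # 2. count adjacent token pairs
        if not pairs:
            break
        top = max(pairs, key=pairs.get)     # 3. pick the most frequent pair
        merges[top] = next_id
        out, i = [], 0                      #    replace every occurrence with the new token
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == top:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1                        # 4. repeat until target vocab size
    return merges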

LLM Assistance note

I wrote the Python reference code personally and from scratch and I am expert there and understand it fully. I then wrote the Rust code against this implementation with tests for equality. However, I am not a Rust developer by background so I had significant help from ChatGPT and Claude Code Opus 4.5. All the equality tests pass as far as I am aware, but I do apologize if some of the Rust code is not properly arranged, structured, or implemented. Please let me know in Issues/PRs if so and I am happy to adjust the code to make it better.

License

MIT
