#bpe #nlp #tokenize #unigram

sentencepiece-rs

Rust runtime reimplementation of SentencePiece model loading, normalization, encoding, and decoding

4 releases

Uses new Rust 2024

new 0.2.2 May 20, 2026
0.2.1 May 15, 2026
0.2.0 May 15, 2026
0.1.0 May 10, 2026

#1520 in Text processing

352 downloads per month
Used in multiscreen-rs

Apache-2.0

66KB
1.5K SLoC

sentencepiece-rs

SentencePiece model loading, normalization, encoding, and decoding in pure Rust. Load a .model file, tokenize text, and decode tokens back into text.

Install

[dependencies]
sentencepiece-rs = "0.2"

Quick Start

use sentencepiece_rs::SentencePieceProcessor;

fn main() -> sentencepiece_rs::Result<()> {
    let sp = SentencePieceProcessor::open("tokenizer.model")?;

    let pieces = sp.encode("hello rust world")?;
    let ids = sp.encode_to_ids("hello rust world")?;
    let text = sp.decode_ids(&ids)?;

    println!("{pieces:?}");
    println!("{ids:?}");
    println!("{text}");

    Ok(())
}

Decode Pieces

use sentencepiece_rs::SentencePieceProcessor;

fn main() -> sentencepiece_rs::Result<()> {
    let sp = SentencePieceProcessor::open("tokenizer.model")?;

    // "▁" (U+2581) stands in for a space inside SentencePiece pieces.
    let text = sp.decode(&["▁hello", "▁world"])?;
    assert_eq!(text, "hello world");

    Ok(())
}
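
To make the "▁" marker less mysterious, here is a minimal, crate-independent sketch of how pieces can be joined back into text. It is not this crate's implementation: the function name join_pieces is made up for illustration, and byte-fallback pieces, control symbols, and model-specific settings are all ignored. It only assumes the common SentencePiece convention that "▁" (U+2581) stands in for a space and that a dummy space is prefixed to the input.

```rust
/// "▁" (U+2581), the character SentencePiece uses in place of a space.
const SPACE_MARKER: char = '\u{2581}';

/// Hypothetical helper: join pieces into plain text.
/// Simplified on purpose; real decoding handles many more cases.
fn join_pieces(pieces: &[&str]) -> String {
    // Concatenate all pieces, then turn every marker back into a space.
    let joined: String = pieces.concat().replace(SPACE_MARKER, " ");
    // Drop the single leading space left over from the dummy prefix, if any.
    joined.strip_prefix(' ').unwrap_or(joined.as_str()).to_string()
}

fn main() {
    // A word may be split across several pieces; only "▁" starts a new word.
    let text = join_pieces(&["▁hello", "▁wor", "ld"]);
    println!("{text}");
}
```

Pieces without a leading "▁" attach directly to the previous piece, which is why "▁wor" and "ld" come back as one word.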

Extra Options

use sentencepiece_rs::SentencePieceProcessor;

fn main() -> sentencepiece_rs::Result<()> {
    let mut sp = SentencePieceProcessor::open("tokenizer.model")?;
    // Options are colon-separated; this one requests both BOS and EOS markers.
    sp.set_encode_extra_options("bos:eos")?;

    let ids = sp.encode_to_ids("ship it")?;
    println!("{ids:?}");

    Ok(())
}

Supported options: bos, eos, reverse, unk_piece.
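
For intuition, the effect of these options on an ID sequence can be sketched without the crate at all. Everything below is an assumption made for illustration: the function apply_extra_options is hypothetical, the BOS and EOS IDs (1 and 2) are placeholders for whatever the model defines, and options are applied left to right. unk_piece is accepted but not modeled, since this sketch only deals with IDs.

```rust
// Placeholder IDs; a real model defines its own values.
const BOS_ID: u32 = 1; // assumed
const EOS_ID: u32 = 2; // assumed

/// Hypothetical helper: apply a colon-separated option string to encoded IDs,
/// one option at a time, in the order written.
fn apply_extra_options(mut ids: Vec<u32>, options: &str) -> Result<Vec<u32>, String> {
    for opt in options.split(':').filter(|o| !o.is_empty()) {
        match opt {
            "bos" => ids.insert(0, BOS_ID),
            "eos" => ids.push(EOS_ID),
            "reverse" => ids.reverse(),
            // Not modeled in this ID-only sketch.
            "unk_piece" => {}
            other => return Err(format!("unknown option: {other}")),
        }
    }
    Ok(ids)
}

fn main() {
    let ids = apply_extra_options(vec![10, 11], "bos:eos").unwrap();
    println!("{ids:?}"); // [1, 10, 11, 2]

    // Order matters here: reversing first keeps BOS at the front.
    let ids = apply_extra_options(vec![10, 11], "reverse:bos:eos").unwrap();
    println!("{ids:?}"); // [1, 11, 10, 2]
}
```

Whether the crate applies options in exactly this order is not stated here, so check the docs before relying on a particular combination with reverse.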

Compatibility Notes

The crate reads standard SentencePiece model protobufs and uses the embedded normalizer trie, so normal .model files should load without drama. The runtime behavior follows the C++ implementation, but this is not a line-by-line rewrite.
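
As a rough picture of what rule-based normalization does, here is a toy longest-prefix-match normalizer. It is only a sketch: the function normalize and its rule table are hypothetical, and it scans a small slice linearly where a real model uses a compiled trie with its own rules. The idea it illustrates is the same, though: at each position, replace the longest matching source string, otherwise copy one character through.

```rust
/// Hypothetical helper: rewrite `input` using (source, replacement) rules,
/// always preferring the longest source that matches at the current position.
fn normalize(input: &str, rules: &[(&str, &str)]) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while !rest.is_empty() {
        // Find the longest rule whose source is a prefix of the remaining text.
        let best = rules
            .iter()
            .filter(|(src, _)| !src.is_empty() && rest.starts_with(*src))
            .max_by_key(|(src, _)| src.len());
        match best {
            Some((src, dst)) => {
                out.push_str(dst);
                rest = &rest[src.len()..];
            }
            None => {
                // No rule applies: copy one character through unchanged.
                let ch = rest.chars().next().unwrap();
                out.push(ch);
                rest = &rest[ch.len_utf8()..];
            }
        }
    }
    out
}

fn main() {
    // Made-up rules in the spirit of NFKC: expand a ligature, fold a wide space.
    let rules = [("\u{FB01}", "fi"), ("\u{3000}", " ")];
    println!("{}", normalize("\u{FB01}ne\u{3000}day", &rules)); // fine day
}
```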

The big missing chunk is training. Bring your own existing model for now.

Documentation

docs.rs/sentencepiece-rs

Test Locally

cargo fmt --all
cargo test --all-features
cargo clippy --all-targets --all-features -- -D warnings
cargo doc --all-features --no-deps

License and Terms of Use

This crate is licensed under the Apache License, Version 2.0. See ./LICENSE.

The original SentencePiece source is also licensed under the Apache License, Version 2.0. This crate is a Rust reimplementation that uses the upstream source as the behavioral reference; it is not a C++ binding and does not link against the C++ library.

Apache 2.0 is permissive. In normal-human terms, you can use, modify, distribute, sublicense, and ship derivative work, including commercially, as long as you follow the license terms.

If you redistribute this crate, the upstream source, or a modified version, keep the important bits intact:

  • include a copy of the Apache 2.0 license
  • preserve copyright, patent, trademark, and attribution notices that apply
  • mark modified files when you change Apache-licensed source
  • include upstream NOTICE content if a distributed upstream package includes one
  • do not imply Google or SentencePiece trademark endorsement

The software is provided as-is, without warranty, and the Apache 2.0 patent grant terminates if you sue over patent infringement involving the licensed work. That is the deal. Pretty reasonable, honestly.

This README is a practical summary, not legal advice. The actual license text in ./LICENSE and the upstream SentencePiece license are the source of truth.

No runtime deps