12 releases (7 breaking)
| Version | Released |
|---|---|
| 0.47.4 (new) | May 27, 2026 |
| 0.46.0 | May 22, 2026 |
bio-rs
bio-rs turns protein FASTA into validated, tokenized, model-ready inputs for bio-AI workflows.
FASTA -> validated protein sequence -> token IDs -> model-ready JSON
DNA and RNA FASTA validation is also supported; tokenization is currently protein-only.
Status: pre-1.0 CLI and JSON contract stabilization.
Why bio-rs?
Most bio-AI models are born in Python, but the tooling around them often needs to run somewhere else:
- local CLIs
- CI pipelines
- servers
- browsers
- agents
bio-rs focuses on the boring but important layer before inference:
- parse biological sequence input
- validate it with structured diagnostics
- tokenize it into stable IDs
- emit machine-readable JSON contracts
- keep preprocessing reproducible outside notebooks
The goal is not to replace Python research workflows.
The goal is to make the input layer around bio-AI models faster, more portable, and easier to trust.
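To make the "parse, then validate with structured diagnostics" idea concrete, here is a minimal sketch in plain Rust. It is not the biors-core API: `Record`, `Diagnostic`, and `parse_fasta` are hypothetical names chosen for illustration, and the rules (letters only, uppercase normalization) are simplifying assumptions.

```rust
/// One parsed FASTA record (hypothetical type, not the biors-core API).
#[derive(Debug, PartialEq)]
struct Record {
    id: String,
    sequence: String,
}

/// A diagnostic tied to a 1-based line number (hypothetical shape).
#[derive(Debug, PartialEq)]
struct Diagnostic {
    line: usize,
    message: String,
}

/// Parse FASTA text, uppercasing residues and collecting diagnostics
/// instead of stopping at the first problem.
fn parse_fasta(text: &str) -> (Vec<Record>, Vec<Diagnostic>) {
    let mut records: Vec<Record> = Vec::new();
    let mut diagnostics = Vec::new();

    for (index, raw) in text.lines().enumerate() {
        let line_no = index + 1;
        let line = raw.trim();
        if line.is_empty() {
            continue;
        }
        if let Some(header) = line.strip_prefix('>') {
            // The record id is the first whitespace-delimited word of the header.
            let id = header.split_whitespace().next().unwrap_or("").to_string();
            if id.is_empty() {
                diagnostics.push(Diagnostic {
                    line: line_no,
                    message: "empty record id".into(),
                });
            }
            records.push(Record { id, sequence: String::new() });
        } else if let Some(record) = records.last_mut() {
            // Sequence line: keep ASCII letters, report everything else.
            for ch in line.chars() {
                if ch.is_ascii_alphabetic() {
                    record.sequence.push(ch.to_ascii_uppercase());
                } else {
                    diagnostics.push(Diagnostic {
                        line: line_no,
                        message: format!("unexpected character {:?}", ch),
                    });
                }
            }
        } else {
            diagnostics.push(Diagnostic {
                line: line_no,
                message: "sequence data before first header".into(),
            });
        }
    }
    (records, diagnostics)
}
```

Returning diagnostics alongside the records, rather than failing on the first bad character, is the design choice that lets a caller report every problem in a file in one pass.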
Quickstart
cargo install biors --version 0.47.4
biors tokenize examples/protein.fasta
biors workflow --max-length 8 examples/protein.fasta
biors batch validate --kind auto examples/
biors tokenizer inspect --profile protein-20-special
biors dataset inspect --source uniprot --version 2026_02 --split train examples/
Full commands, demos, and install options: docs/quickstart.md
Proof
bio-rs keeps performance claims tied to reproducible in-repo benchmarks.
Latest recorded FASTA benchmark baseline (recorded on biors-core v0.20.0;
rerun the benchmark before making new numeric claims for later versions):
| Dataset | Matched workload | bio-rs core mean | Biopython mean | bio-rs speedup |
|---|---|---|---|---|
| Human proteome | Parse + validation | 0.036s | 0.584s | 16.09x |
| Human proteome | Parse + tokenization | 0.061s | 0.587s | 9.68x |
| 100MB+ FASTA | Parse + validation | 0.294s | 3.994s | 13.59x |
| 100MB+ FASTA | Parse + tokenization | 0.492s | 4.040s | 8.22x |
| Many short records | Parse + validation | 0.007s | 0.204s | 28.35x |
| Many short records | Parse + tokenization | 0.010s | 0.205s | 20.54x |
| Single long sequence | Parse + validation | 0.005s | 0.176s | 34.48x |
| Single long sequence | Parse + tokenization | 0.007s | 0.177s | 26.67x |
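The speedup column reads as the Biopython mean divided by the bio-rs core mean. That definition is an assumption here; the benchmark doc is the authority. The sketch below uses a hypothetical helper, `speedup`, to show the arithmetic. Because the means in the table are rounded to milliseconds, recomputing from them gives roughly 16.22 for the first row rather than the listed 16.09, which would be consistent with the published ratios being derived from unrounded timings.

```rust
/// Speedup of a candidate over a baseline, given mean wall-clock seconds.
/// Returns None when either mean is non-finite or the candidate mean is
/// not strictly positive, so callers never divide by zero.
fn speedup(baseline_mean_s: f64, candidate_mean_s: f64) -> Option<f64> {
    let valid = baseline_mean_s.is_finite()
        && candidate_mean_s.is_finite()
        && candidate_mean_s > 0.0;
    if valid {
        Some(baseline_mean_s / candidate_mean_s)
    } else {
        None
    }
}
```

For numeric claims, prefer the ratios in the committed benchmark artifact over values recomputed from this rounded table.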
Benchmark details:
- Datasets:
  - UniProt human reference proteome (UP000005640, 9606)
  - 100MB+ large FASTA generated by repeating the same real proteome to isolate large-input throughput
  - 20,000 short 48-residue records generated from the same proteome residue stream
  - one 960,000-residue sequence generated from the same proteome residue stream
- Matched workloads:
  - pure parse
  - parse plus validation
  - parse plus tokenization
- Current best recorded raw throughput:
  - human proteome parse + validation: 315.4M residues/s, 360.6 MB/s
  - 100MB+ FASTA parse + validation: 350.8M residues/s, 401.1 MB/s
  - human proteome parse + tokenization: 189.0M residues/s, 216.1 MB/s
  - 100MB+ FASTA parse + tokenization: 209.7M residues/s, 239.8 MB/s
- Benchmark doc: benchmarks/fasta_vs_biopython.md
- Benchmark script: scripts/benchmark_fasta_vs_biopython.py
This benchmark measures biors-core directly and excludes CLI startup and JSON
serialization overhead. It is still workload-specific, not a broad claim that
bio-rs is faster than Biopython across every FASTA workload or researcher input
shape. The 0.47.4 patch keeps those numeric claims pinned to the latest
committed benchmark artifact while adding benchmark coverage for fixed-length
model-input construction and reducing repeated ASCII classification in reader
FASTA scanning.
What works today
biors-core provides the Rust engine and data contracts. biors provides the CLI surface.
Sequence handling
- FASTA parsing and normalization with buffered reader APIs
- Protein/DNA/RNA validation with per-record kind detection (--kind auto)
- Line and record-index diagnostics with residue warning/error reporting
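Per-record kind detection can be pictured with a small alphabet-based heuristic. This is a hypothetical sketch, not how `--kind auto` is implemented: `SequenceKind`, `detect_kind`, and the chosen alphabets (`ACGTN` for DNA, `ACGUN` for RNA) are assumptions for illustration.

```rust
/// Sequence kinds named in the text above (hypothetical enum).
#[derive(Debug, PartialEq, Clone, Copy)]
enum SequenceKind {
    Dna,
    Rna,
    Protein,
}

/// Guess the kind of one record from its residues.
/// Returns None for an empty sequence or one containing non-letters.
fn detect_kind(sequence: &str) -> Option<SequenceKind> {
    if sequence.is_empty() || !sequence.chars().all(|c| c.is_ascii_alphabetic()) {
        return None;
    }
    let upper = sequence.to_ascii_uppercase();
    // True when every residue belongs to the given alphabet.
    let only = |allowed: &str| upper.chars().all(|c| allowed.contains(c));
    if only("ACGTN") {
        Some(SequenceKind::Dna)
    } else if only("ACGUN") {
        Some(SequenceKind::Rna)
    } else {
        // Any other all-letter sequence is treated as protein in this sketch.
        Some(SequenceKind::Protein)
    }
}
```

A heuristic like this is inherently ambiguous for short inputs: `ACG` is valid as DNA, RNA, and protein, and the sketch resolves it to DNA only because that branch is checked first.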
Tokenization
- protein-20 tokenization with stable IDs
- protein-20-special tokenization with UNK/PAD/CLS/SEP/MASK special tokens
- JSON tokenizer config loading and inspection
- Hugging Face tokenizer config conversion
- Positional token alignment preserved with explicit unknown-token IDs
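Positional alignment means that token `i` always corresponds to residue `i`, even when a residue falls outside the alphabet. A minimal sketch, assuming the unknown-token ID is passed in by the caller (the real IDs come from the tokenizer config; the value 20 used in the note below is hypothetical, as is the function name `tokenize_aligned`):

```rust
/// Protein-20 alphabet in token-ID order.
const PROTEIN_20: &str = "ACDEFGHIKLMNPQRSTVWY";

/// Tokenize without dropping anything: each input residue yields exactly
/// one ID, and residues outside the alphabet become `unk_id`.
fn tokenize_aligned(sequence: &str, unk_id: u32) -> Vec<u32> {
    sequence
        .chars()
        .map(|c| {
            // The alphabet is ASCII, so the byte offset returned by `find`
            // equals the residue's position, which is its token ID.
            PROTEIN_20
                .find(c.to_ascii_uppercase())
                .map(|i| i as u32)
                .unwrap_or(unk_id)
        })
        .collect()
}
```

With a hypothetical `unk_id` of 20, `tokenize_aligned("AXCY", 20)` yields four IDs for four residues, so downstream per-position outputs can still be mapped back to the original sequence.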
Model input
- model-input CLI: input_ids, attention_mask, and truncation metadata
- workflow CLI: end-to-end validation → tokenization → model input with readiness issues and reproducibility provenance
- pipeline CLI: no-config validate → tokenize → export, or config-driven (TOML/YAML/JSON) workflows with lockfile generation
- debug CLI: step-by-step per-record inspection with compact residue markers
- Checked and unchecked model-input builders with safety checks for unresolved residues
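The relationship between `input_ids`, `attention_mask`, and truncation metadata can be sketched as follows. This is an illustration under assumptions, not the bio-rs contract: the struct `ModelInput`, the function `build_model_input`, the extra field names, right-side padding, and the caller-supplied `pad_id` are all hypothetical.

```rust
/// Fixed-length model input (hypothetical struct; `input_ids` and
/// `attention_mask` follow the field names mentioned above).
#[derive(Debug, PartialEq)]
struct ModelInput {
    input_ids: Vec<u32>,
    attention_mask: Vec<u8>,
    truncated: bool,
    original_length: usize,
}

/// Truncate or right-pad `token_ids` to exactly `max_length` entries.
/// Mask entries are 1 for real tokens and 0 for padding.
fn build_model_input(token_ids: &[u32], max_length: usize, pad_id: u32) -> ModelInput {
    let kept = token_ids.len().min(max_length);
    let mut input_ids = token_ids[..kept].to_vec();
    let mut attention_mask = vec![1u8; kept];
    // Pad both vectors out to the fixed length.
    input_ids.resize(max_length, pad_id);
    attention_mask.resize(max_length, 0);
    ModelInput {
        input_ids,
        attention_mask,
        truncated: token_ids.len() > max_length,
        original_length: token_ids.len(),
    }
}
```

Recording `truncated` and `original_length` next to the tensors keeps silent data loss visible to whoever consumes the output.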
Batch and dataset operations
- batch validate: multiple files, recursive directories, quoted globs
- dataset inspect: dataset descriptors, sample mapping, file SHA-256 provenance
- cache inspect and guarded cache clean for local artifact store
Package management
- Manifest inspection, validation, and migration (v0 → v1)
- Schema compatibility checks and canonical diffs
- SHA-256 checksum verification and fixture verification
- Python project to bio-rs package skeleton conversion
- Runtime bridge planning reports, backend execution abstraction contracts, and guarded external-process backend adapters
- Optional Candle backend crate for CPU safetensors linear-probe inference
- Model artifact metadata and runtime/model compatibility checks in package bridge reports
- Transport-agnostic service interface contract for service hosts, without bundling a server runtime
- Typed validation issue codes and manifest enums
External interfaces
- biors-python: PyO3 bindings for Python integration and notebook workflows
- biors-wasm: WebAssembly/JavaScript bindings with TypeScript definitions
- biors-mcp-server: local MCP server crate for agent-callable sequence tools
- service contract: offline JSON route/schema contract for caller-owned service hosts
Utilities
- diff: canonical JSON/raw comparison with SHA-256 hashes
- doctor: platform, toolchain, WASM target, and fixture readiness
- completions: shell completion generation
- JSON success/error envelopes for all commands
Documentation
- Quickstart — install, first commands, demos
- Launch demo — researcher-facing demo workflow
- Installation and distribution — cargo, binaries, completions
- CLI contract — commands, JSON envelopes, exit codes
- Package format — manifest layout and research metadata
- Package conversion — HF/Python project conversion path
- Backend architecture — runtime abstraction boundary
- Candle backend — optional Candle runtime crate
- Service interface — service-host contract and runtime boundary
- Pipeline config — config-driven static preprocessing workflows
- Dataset inputs and artifact store
- Error code registry
- Reliability and input safety
- Python interop
- Python API
- WASM readiness
- WASM API
- Phase 7 status
- 1.0 contract candidates
- Versioning policy
- Schema versioning
- Final release checklist
- JSON schemas
- Citation metadata
Not yet
These are roadmap directions, not current capabilities:
- hosted web workflows
- pretrained model-specific inference backends
- package registry or plugin ecosystem
- general-purpose chemistry tooling
- structure tooling
- no-code or low-code workflows
Development
Run checks:
scripts/check.sh
Run the faster local commit gate:
scripts/check-fast.sh
The check suite runs:
- cargo fmt
- shell and Python syntax checks for repo scripts
- benchmark Markdown regeneration check
- release workflow publish-order invariant check
- Rust checks
- biors-core wasm32-unknown-unknown build check
- tests
- cargo clippy with warnings denied
Reproduce the FASTA benchmark:
cargo build --release -p biors-core --example benchmark_fasta
python3 -m venv .venv-bench
. .venv-bench/bin/activate
pip install biopython
python scripts/benchmark_fasta_vs_biopython.py
cat benchmarks/fasta_vs_biopython.json
The benchmark script updates both benchmarks/fasta_vs_biopython.json and
benchmarks/fasta_vs_biopython.md. scripts/check-benchmark-docs.sh verifies
that the Markdown report still matches the JSON artifact.
Compare two benchmark artifacts:
python scripts/compare-benchmark-artifacts.py before.json after.json
Run the Rust library example:
cargo run -p biors-core --example tokenize
Workspace
packages/
  rust/
    biors/                  CLI
    biors-backend-candle/   Optional Candle runtime backend
    biors-core/             Core engine + contracts
    biors-mcp-server/       Local MCP server
    biors-python/           PyO3 bindings
    biors-wasm/             WASM/JS bindings
schemas/
  batch-validation-output.v0.json
  cache-output.v0.json
  cli-error.v0.json
  cli-success.v0.json
  dataset-inspect-output.v0.json
  doctor-output.v0.json
  fasta-validation-output.v0.json
  inspect-output.v0.json
  model-input-output.v0.json
  output-diff.v0.json
  pipeline-config.v0.json
  pipeline-lock.v0.json
  pipeline-output.v0.json
  sequence-workflow-output.v0.json
  sequence-debug-output.v0.json
  package-bridge-output.v0.json
  package-compatibility-output.v0.json
  package-conversion-output.v0.json
  package-diff-output.v0.json
  package-inspect-output.v0.json
  package-manifest.v0.json
  package-manifest.v1.json
  package-migration-output.v0.json
  package-skeleton-output.v0.json
  package-validation-report.v0.json
  package-verify-output.v0.json
  tokenizer-conversion-output.v0.json
  tokenizer-inspect-output.v0.json
  tokenize-output.v0.json
examples/
  protein.fasta
  multi.fasta
  model-input-contract/
    protein.fasta
    protein-20-special.config.json
    protein-20-special.expected.json
    reference-python-parity.json
  python/
    esm_from_biors_json.py
    pandas_numpy_friendly.py
    protbert_from_biors_json.py
    reference_preprocess.py
  protein-package/
    models/
    docs/
    manifest.json
    observations.json
    fixtures/
    observed/
    tokenizers/
    vocabs/
    pipelines/
  pipeline/
    protein.toml
    protein.yaml
    protein.json
    pipeline.lock
Protein-20 alphabet
A C D E F G H I K L M N P Q R S T V W Y
Token IDs follow that order, starting at 0.
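As a worked illustration of that ordering, the sketch below maps residues to IDs and back. The function names `token_id` and `residue_for` are hypothetical, and the lookup assumes uppercase, already-normalized residues.

```rust
/// Protein-20 alphabet in token-ID order: array index == token ID.
const PROTEIN_20: [char; 20] = [
    'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
    'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y',
];

/// Token ID for an uppercase residue, or None if it is not in the alphabet.
fn token_id(residue: char) -> Option<u8> {
    PROTEIN_20.iter().position(|&r| r == residue).map(|i| i as u8)
}

/// Residue for a token ID, or None if the ID is out of range.
fn residue_for(id: u8) -> Option<char> {
    PROTEIN_20.get(id as usize).copied()
}
```

Under this ordering `A` is 0, `K` is 8, and `Y` is 19; a residue such as `X` has no ID in the plain 20-letter alphabet.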
Contributing
See CONTRIBUTING.md for local setup, checks, and PR expectations.
License
Dual licensed under MIT OR Apache-2.0. If you use bio-rs in research software or publications, cite the repository and version via CITATION.cff.