Date: April 30, 2026
Status: Draft v0.1
Target Language: Rust
Inspiration: llama.cpp by Georgi Gerganov
Build a high-performance, dependency-light LLM inference engine in Rust that runs large language models on commodity hardware (CPUs, GPUs, Apple Silicon) using quantization and modern systems programming techniques.
Key Differentiators:
- Zero-cost abstractions via Rust's ownership model
- Memory safety without GC overhead
- First-class async/concurrency support
- Native WebAssembly support for browser inference
- Modern crate ecosystem (burn, candle, etc.)
┌─────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ CLI │ HTTP Server │ Python Bindings │ WASM │ FFI │
├─────────────────────────────────────────────────────────────┤
│ API LAYER │
│ Session Management │ Sampling │ Tokenization │ Scheduling │
├─────────────────────────────────────────────────────────────┤
│ COMPUTE LAYER │
│ CPU Kernels (AVX2/AVX512/NEON) │ GPU (CUDA/HIP/Vulkan) │
│ Quantization │ Dequantization │ Matrix Multiplication │
├─────────────────────────────────────────────────────────────┤
│ MODEL LAYER │
│ GGUF Loader │ Model Graph │ Weight Storage │ KV Cache │
├─────────────────────────────────────────────────────────────┤
│ HARDWARE ABSTRACTION │
│ CPU │ NVIDIA (CUDA) │ AMD (HIP) │ Apple (Metal) │ WASM │
└─────────────────────────────────────────────────────────────┘
Objective: Establish Rust project structure with cross-platform compilation support
- TODO-1.1: Initialize Cargo workspace with workspace-level dependencies
    [workspace]
    members = ["oxidize-core", "oxidize-cli", "oxidize-server", "oxidize-quantize"]
    resolver = "3"
- TODO-1.2: Set up CI/CD (GitHub Actions) for Linux, macOS, Windows builds
- TODO-1.3: Configure cross-compilation for ARM64, WASM32 targets
- TODO-1.4: Set up benchmark harness with criterion.rs
- TODO-1.5: Create Docker images for deployment
- TODO-1.6: Add justfile/Makefile for common tasks
- TODO-1.7: Set up cargo deny for license/security auditing
- TODO-1.8: Configure release profile with LTO and panic=abort
Estimated Effort: 2-3 days
Priority: P0 (Blocking)
Objective: Parse and load GGUF files (the model file format used by llama.cpp)
- TODO-2.1: Implement GGUF file format parser
  - Magic number validation (GGUF)
  - Version handling (v2, v3)
  - Tensor info metadata parsing
  - Alignment and padding handling
- TODO-2.2: Create Tensor struct with shape, strides, dtype
- TODO-2.3: Implement memory-mapped file loading (memmap2 crate)
- TODO-2.4: Support tensor name mapping for different architectures
- TODO-2.5: Add quantization type detection (tensor types such as Q4_0, Q4_K, Q5_K, Q8_0; file-level presets such as Q4_K_M and Q5_K_M are mixes of these tensor types)
- TODO-2.6: Implement ModelLoader trait for extensibility
- TODO-2.7: Add progress callbacks for large model loading
- TODO-2.8: Create comprehensive unit tests with fixture files
- TODO-2.9: Benchmark loader against llama.cpp baseline
Key Crates: memmap2, bytemuck, half
Estimated Effort: 5-7 days
Priority: P0 (Blocking)
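As a starting point for TODO-2.1, the fixed-size part of a GGUF file can be parsed with the standard library alone. The sketch below assumes a little-endian v2 or v3 file, where the 4-byte magic is followed by a u32 version and two u64 counts. The names `GgufHeader`, `GgufError`, `parse_header`, and `align_up` are illustrative and not part of any existing crate.

```rust
/// Fixed-size GGUF header fields (v2 and v3 store both counts as u64).
#[derive(Debug, PartialEq)]
pub struct GgufHeader {
    pub version: u32,
    pub tensor_count: u64,
    pub metadata_kv_count: u64,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum GgufError {
    TooShort,
    BadMagic([u8; 4]),
    UnsupportedVersion(u32),
}

fn read_u32(bytes: &[u8], at: usize) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&bytes[at..at + 4]);
    u32::from_le_bytes(buf)
}

fn read_u64(bytes: &[u8], at: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[at..at + 8]);
    u64::from_le_bytes(buf)
}

/// Parse the first 24 bytes: magic, version, tensor count, metadata KV count.
pub fn parse_header(bytes: &[u8]) -> Result<GgufHeader, GgufError> {
    if bytes.len() < 24 {
        return Err(GgufError::TooShort);
    }
    let mut magic = [0u8; 4];
    magic.copy_from_slice(&bytes[0..4]);
    if &magic != b"GGUF" {
        return Err(GgufError::BadMagic(magic));
    }
    let version = read_u32(bytes, 4);
    if version != 2 && version != 3 {
        return Err(GgufError::UnsupportedVersion(version));
    }
    Ok(GgufHeader {
        version,
        tensor_count: read_u64(bytes, 8),
        metadata_kv_count: read_u64(bytes, 16),
    })
}

/// Round `offset` up to the next multiple of `alignment` (alignment > 0).
/// Tensor data offsets are padded this way.
pub fn align_up(offset: u64, alignment: u64) -> u64 {
    (offset + alignment - 1) / alignment * alignment
}
```

With memory-mapped loading (TODO-2.3), `parse_header` would receive a slice of the mapped file, so no bytes are copied beyond the small header fields.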
Objective: Implement quantization/dequantization schemes matching llama.cpp
- TODO-3.1: Implement scalar dequantization kernels:
  - Q4_0, Q4_1 (4-bit, without/with a per-block offset)
  - Q5_0, Q5_1 (5-bit, without/with a per-block offset)
  - Q8_0 (8-bit)
  - Q2_K, Q3_K, Q4_K, Q5_K, Q6_K (K-quants)
- TODO-3.2: Implement dequantization to f16 and f32
- TODO-3.3: Add quantization from f16/f32 to all supported formats
- TODO-3.4: Implement block-wise quantization with per-block scales
- TODO-3.5: Add importance matrix support (IMatrix) for better quality
- TODO-3.6: Create quantization CLI tool (oxidize-quantize)
- TODO-3.7: Add mixed quantization support (different types per layer)
- TODO-3.8: Validate output against llama.cpp reference implementation
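A scalar Q4_0 dequantization kernel (TODO-3.1) might look like the sketch below. It assumes the llama.cpp block layout as I understand it: 32 weights per block, stored as a little-endian f16 scale followed by 16 bytes of packed nibbles, with low nibbles holding the first 16 weights and high nibbles the last 16. The half-precision conversion is written by hand here only to keep the example dependency-free; the `half` crate listed for the loader would normally do this. Output should still be validated against the reference implementation, as TODO-3.8 requires.

```rust
pub const QK4_0: usize = 32; // weights per block
pub const Q4_0_BLOCK_BYTES: usize = 2 + QK4_0 / 2; // f16 scale + 16 nibble bytes

/// Convert IEEE 754 half-precision bits to f32 (hand-rolled for the sketch).
pub fn f16_to_f32(bits: u16) -> f32 {
    let sign = if bits >> 15 == 1 { -1.0f32 } else { 1.0f32 };
    let exp = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x3ff) as f32;
    let magnitude = if exp == 0 {
        frac * 2.0f32.powi(-24) // zero or subnormal
    } else if exp == 31 {
        if frac == 0.0 { f32::INFINITY } else { f32::NAN }
    } else {
        (1.0 + frac / 1024.0) * 2.0f32.powi(exp - 15)
    };
    sign * magnitude
}

/// Dequantize consecutive Q4_0 blocks into `out` (32 floats per block).
pub fn dequantize_q4_0(blocks: &[u8], out: &mut [f32]) -> Result<(), String> {
    if blocks.len() % Q4_0_BLOCK_BYTES != 0 {
        return Err("input is not a whole number of Q4_0 blocks".to_string());
    }
    if out.len() != blocks.len() / Q4_0_BLOCK_BYTES * QK4_0 {
        return Err("output length does not match block count".to_string());
    }
    for (block, dst) in blocks
        .chunks_exact(Q4_0_BLOCK_BYTES)
        .zip(out.chunks_exact_mut(QK4_0))
    {
        let scale = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
        let quants = &block[2..];
        for j in 0..QK4_0 / 2 {
            // Each nibble stores a value in 0..=15, re-centred by subtracting 8.
            let lo = (quants[j] & 0x0f) as i32 - 8;
            let hi = (quants[j] >> 4) as i32 - 8;
            dst[j] = lo as f32 * scale;
            dst[j + QK4_0 / 2] = hi as f32 * scale;
        }
    }
    Ok(())
}
```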
Key Traits:
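    pub trait Quantization {
        /// Quantize `data` into `output`, which must hold a whole number of blocks.
        fn quantize(&self, data: &[f32], output: &mut [u8]) -> Result<()>;
        /// Dequantize `data` (whole blocks) into `output`.
        fn dequantize(&self, data: &[u8], output: &mut [f32]) -> Result<()>;
        /// Number of weights per block.
        fn block_size(&self) -> usize;
        /// Size of one encoded block in bytes.
        fn type_size(&self) -> usize;
    }

Estimated Effort: 10-14 days
Priority: P0 (Blocking)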
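To illustrate how a format could plug into the Quantization trait, here is a minimal sketch of block-wise 8-bit quantization with one scale per block (TODO-3.4). It is deliberately not byte-compatible with llama.cpp's Q8_0: the scale is stored as an f32 rather than an f16 so the example needs no extra crates. The `Result` alias and the `Q8Simple` type are assumptions made for this example.

```rust
// Assumed error alias for the sketch; the real crate would likely use thiserror.
pub type Result<T> = std::result::Result<T, String>;

pub trait Quantization {
    fn quantize(&self, data: &[f32], output: &mut [u8]) -> Result<()>;
    fn dequantize(&self, data: &[u8], output: &mut [f32]) -> Result<()>;
    fn block_size(&self) -> usize;
    fn type_size(&self) -> usize;
}

const BLOCK: usize = 32; // weights per block
const BLOCK_BYTES: usize = 4 + BLOCK; // f32 scale + 32 signed bytes

/// Hypothetical 8-bit format: per-block f32 scale, then 32 i8 values.
pub struct Q8Simple;

impl Quantization for Q8Simple {
    fn quantize(&self, data: &[f32], output: &mut [u8]) -> Result<()> {
        if data.len() % BLOCK != 0 || output.len() != data.len() / BLOCK * BLOCK_BYTES {
            return Err("length mismatch".to_string());
        }
        for (src, dst) in data.chunks_exact(BLOCK).zip(output.chunks_exact_mut(BLOCK_BYTES)) {
            // The largest magnitude in the block maps to +/-127.
            let amax = src.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = amax / 127.0;
            let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
            dst[..4].copy_from_slice(&scale.to_le_bytes());
            for (q, v) in dst[4..].iter_mut().zip(src) {
                *q = (v * inv).round() as i8 as u8;
            }
        }
        Ok(())
    }

    fn dequantize(&self, data: &[u8], output: &mut [f32]) -> Result<()> {
        if data.len() % BLOCK_BYTES != 0 || output.len() != data.len() / BLOCK_BYTES * BLOCK {
            return Err("length mismatch".to_string());
        }
        for (src, dst) in data.chunks_exact(BLOCK_BYTES).zip(output.chunks_exact_mut(BLOCK)) {
            let scale = f32::from_le_bytes([src[0], src[1], src[2], src[3]]);
            for (o, q) in dst.iter_mut().zip(&src[4..]) {
                *o = (*q as i8) as f32 * scale;
            }
        }
        Ok(())
    }

    fn block_size(&self) -> usize { BLOCK }
    fn type_size(&self) -> usize { BLOCK_BYTES }
}
```

A round trip through `quantize` and `dequantize` should reproduce each weight to within half a quantization step (`scale / 2`), which is a convenient property to assert in the unit tests planned under TODO-16.1.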
Objective: High-performance CPU inference kernels with SIMD optimization
- TODO-4.1: Set up SIMD abstraction layer
  - x86: SSE2, AVX, AVX2, AVX512 (via std::arch)
  - ARM: NEON (via std::arch)
  - Fallback: scalar implementations
- TODO-4.2: Implement matrix-vector multiplication (GEMV)
  - F32, F16 input types
  - Quantized weights with on-the-fly dequantization
- TODO-4.3: Implement matrix-matrix multiplication (GEMM) for batching
- TODO-4.4: Implement attention mechanisms:
  - Multi-head attention (MHA)
  - Grouped-query attention (GQA)
  - Multi-query attention (MQA)
- TODO-4.5: Implement RoPE (Rotary Position Embedding)
- TODO-4.6: Implement SwiGLU activation
- TODO-4.7: Implement RMSNorm and LayerNorm
- TODO-4.8: Implement Softmax (stable, numerically accurate)
- TODO-4.9: Add thread pool for parallel layer execution
- TODO-4.10: Optimize cache locality and prefetching
- TODO-4.11: Add runtime CPU feature detection
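The SIMD layer with a scalar fallback (TODO-4.1) and runtime feature detection (TODO-4.11) can be combined in one dispatch function. The sketch below does this for an f32 dot product, the inner loop of GEMV, using AVX2 and FMA on x86_64 and plain Rust elsewhere. Function names are illustrative, and a production kernel would add NEON and AVX-512 paths and operate on quantized blocks.

```rust
/// Portable fallback used when no SIMD path is available.
pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// AVX2 + FMA path: processes 8 floats per iteration.
/// Callers must verify both CPU features before calling.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2_fma(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let n = a.len().min(b.len());
    let chunks = n / 8;
    unsafe {
        let mut acc = _mm256_setzero_ps();
        for i in 0..chunks {
            let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
            acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb
        }
        // Horizontal sum of the 8 lanes, then the scalar tail.
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        let mut sum: f32 = lanes.iter().sum();
        for i in chunks * 8..n {
            sum += a[i] * b[i];
        }
        sum
    }
}

/// Dispatch at runtime: SIMD when the CPU supports it, scalar otherwise.
#[cfg(target_arch = "x86_64")]
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        // SAFETY: both required features were detected just above.
        unsafe { dot_avx2_fma(a, b) }
    } else {
        dot_scalar(a, b)
    }
}

#[cfg(not(target_arch = "x86_64"))]
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    dot_scalar(a, b)
}
```

The detection macro has a small cost per call, so a real kernel layer would probably resolve the function pointer once at startup rather than on every dot product.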
Performance Target: Within 10% of llama.cpp CPU performance
Estimated Effort: 15-20 days
Priority: P0 (Blocking)
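Two of the smaller kernels above, the numerically stable softmax (TODO-4.8) and RMSNorm (TODO-4.7), are short enough to sketch in scalar form. Subtracting the maximum before exponentiating keeps `exp` from overflowing on large logits. Function names and signatures are assumptions for illustration; these scalar versions could serve as the reference the SIMD kernels are tested against.

```rust
/// In-place softmax that subtracts the row maximum before exponentiating.
pub fn softmax_in_place(x: &mut [f32]) {
    if x.is_empty() {
        return;
    }
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for v in x.iter_mut() {
        *v = (*v - max).exp();
        sum += *v;
    }
    for v in x.iter_mut() {
        *v /= sum;
    }
}

/// RMSNorm: out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32, out: &mut [f32]) {
    assert_eq!(x.len(), weight.len());
    assert_eq!(x.len(), out.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    for ((o, v), w) in out.iter_mut().zip(x).zip(weight) {
        *o = v * inv_rms * w;
    }
}
```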
Objective: CUDA kernels for NVIDIA GPU acceleration
- TODO-5.1: Set up CUDA build pipeline with rust-cuda or cust
- TODO-5.2: Implement memory management (device allocation, H2D/D2H transfers)
- TODO-5.3: Port GEMV kernels to CUDA
- TODO-5.4: Port GEMM kernels using cuBLAS
- TODO-5.5: Implement attention kernels (flash attention style)
- TODO-5.6: Implement quantization-aware kernels (dequantize on GPU)
- TODO-5.7: Add kernel fusion (combine multiple ops into single kernel)
- TODO-5.8: Implement layer offloading (--n-gpu-layers equivalent)
- TODO-5.9: Add multi-GPU support (tensor/pipeline parallelism)
- TODO-5.10: Optimize memory usage with Flash Attention
Key Crates: cust, cudarc
Estimated Effort: 20-25 days
Priority: P1 (High)
Objective: Metal Performance Shaders for Apple Silicon
- TODO-6.1: Set up Metal build with metal-rs
- TODO-6.2: Implement buffer management for unified memory
- TODO-6.3: Port compute kernels to Metal Shading Language
- TODO-6.4: Optimize for Apple Silicon unified memory architecture
- TODO-6.5: Add Metal Performance Shaders integration where beneficial
Estimated Effort: 10-12 days
Priority: P1 (High)
Objective: Support multiple transformer architectures
- TODO-7.1: Define Model trait:
      pub trait Model {
          fn forward(&mut self, tokens: &[Token], session: &mut Session) -> Result<Logits>;
          fn vocab_size(&self) -> usize;
          fn context_size(&self) -> usize;
          fn layer_count(&self) -> usize;
      }
- TODO-7.2: Implement LLaMA architecture (LLaMA 2, LLaMA 3)
- TODO-7.3: Implement Mistral architecture
- TODO-7.4: Implement Mixtral MoE architecture
- TODO-7.5: Implement Qwen architecture
- TODO-7.6: Implement Gemma architecture
- TODO-7.7: Implement Falcon architecture
- TODO-7.8: Implement GPT architecture (GPT-2, GPT-J, GPT-NeoX)
- TODO-7.9: Implement Phi architecture
- TODO-7.10: Implement architecture auto-detection from GGUF metadata
- TODO-7.11: Add LoRA/QLoRA support
Estimated Effort: 15-20 days
Priority: P0 (Blocking)
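Architecture auto-detection (TODO-7.10) can key off the `general.architecture` metadata entry in the GGUF file. The sketch below models metadata as a plain string map, which is a simplification, since real GGUF metadata values are typed. The `Architecture` enum and the set of recognised strings are illustrative and should be checked against the GGUF specification and llama.cpp's architecture table.

```rust
use std::collections::HashMap;

/// Illustrative subset of supported architectures.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Architecture {
    Llama,
    Falcon,
    Gpt2,
    GptNeoX,
    Gemma,
}

/// Pick an architecture from GGUF metadata (simplified to string values here).
pub fn detect_architecture(metadata: &HashMap<String, String>) -> Result<Architecture, String> {
    let name = metadata
        .get("general.architecture")
        .ok_or_else(|| "missing general.architecture".to_string())?;
    match name.as_str() {
        // Arch strings are assumptions to verify against the spec.
        "llama" => Ok(Architecture::Llama),
        "falcon" => Ok(Architecture::Falcon),
        "gpt2" => Ok(Architecture::Gpt2),
        "gptneox" => Ok(Architecture::GptNeoX),
        "gemma" => Ok(Architecture::Gemma),
        other => Err(format!("unsupported architecture: {}", other)),
    }
}
```

Returning an error for unknown names, rather than falling back to a default, makes unsupported models fail at load time instead of producing garbage output.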
Objective: Efficient key-value cache for attention
- TODO-8.1: Implement KV cache storage with configurable dtype
- TODO-8.2: Add sliding window attention cache management
- TODO-8.3: Implement cache eviction strategies
- TODO-8.4: Add cache quantization (8-bit, 4-bit KV cache)
- TODO-8.5: Support continuous batching with cache management
- TODO-8.6: Add cache persistence across sessions
- TODO-8.7: Optimize memory layout for cache access patterns
Estimated Effort: 5-7 days
Priority: P1 (High)
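One possible shape for sliding-window cache management (TODO-8.2) is a ring buffer indexed by absolute token position: once the window is full, each new entry overwrites the oldest one. The sketch below covers a single layer with f32 storage. The `KvCache` type and its methods are hypothetical, and a real cache would add per-layer storage, configurable dtype, and batching.

```rust
/// Single-layer sliding-window KV cache backed by a ring buffer.
pub struct KvCache {
    window: usize,   // maximum number of positions retained
    dim: usize,      // floats per position (n_kv_heads * head_dim)
    keys: Vec<f32>,  // window * dim
    values: Vec<f32>,
    next_pos: usize, // total number of positions appended so far
}

impl KvCache {
    pub fn new(window: usize, dim: usize) -> Self {
        assert!(window > 0 && dim > 0);
        KvCache {
            window,
            dim,
            keys: vec![0.0; window * dim],
            values: vec![0.0; window * dim],
            next_pos: 0,
        }
    }

    /// Store K/V for the next position, overwriting the oldest slot when full.
    pub fn append(&mut self, k: &[f32], v: &[f32]) {
        assert_eq!(k.len(), self.dim);
        assert_eq!(v.len(), self.dim);
        let start = (self.next_pos % self.window) * self.dim;
        self.keys[start..start + self.dim].copy_from_slice(k);
        self.values[start..start + self.dim].copy_from_slice(v);
        self.next_pos += 1;
    }

    /// Number of positions currently held.
    pub fn len(&self) -> usize {
        self.next_pos.min(self.window)
    }

    fn slot(&self, pos: usize) -> Option<usize> {
        // A position is live if it was appended and has not been overwritten.
        if pos >= self.next_pos || self.next_pos - pos > self.window {
            None
        } else {
            Some((pos % self.window) * self.dim)
        }
    }

    /// Key vector for absolute position `pos`, if still inside the window.
    pub fn key(&self, pos: usize) -> Option<&[f32]> {
        self.slot(pos).map(|s| &self.keys[s..s + self.dim])
    }

    /// Value vector for absolute position `pos`, if still inside the window.
    pub fn value(&self, pos: usize) -> Option<&[f32]> {
        self.slot(pos).map(|s| &self.values[s..s + self.dim])
    }
}
```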
Objective: Text-to-tokens and tokens-to-text conversion
- TODO-9.1: Implement Byte-Pair Encoding (BPE)
- TODO-9.2: Implement SentencePiece (Unigram)
- TODO-9.3: Implement WordPiece
- TODO-9.4: Add tiktoken-style BPE (as used by OpenAI GPT-4 models)
- TODO-9.5: Create tokenizer loader from GGUF metadata
- TODO-9.6: Add special token handling (BOS, EOS, PAD, etc.)
- TODO-9.7: Implement streaming detokenization
- TODO-9.8: Add chat template processing
- TODO-9.9: Support token healing (merge incomplete tokens)
Key Crates: tokenizers (Hugging Face), tiktoken-rs
Estimated Effort: 5-7 days
Priority: P0 (Blocking)
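Streaming detokenization (TODO-9.7) has to cope with tokens that end in the middle of a multi-byte UTF-8 character. One approach, sketched below, buffers bytes until they form valid UTF-8 and emits only the decodable prefix. The `StreamDecoder` type is hypothetical and assumes the tokenizer can provide the raw bytes of each token.

```rust
/// Accumulates token bytes and emits text only once it is valid UTF-8.
pub struct StreamDecoder {
    pending: Vec<u8>,
}

impl StreamDecoder {
    pub fn new() -> Self {
        StreamDecoder { pending: Vec::new() }
    }

    /// Feed the raw bytes of one token and return whatever text is now complete.
    pub fn push(&mut self, token_bytes: &[u8]) -> String {
        self.pending.extend_from_slice(token_bytes);
        let mut out = String::new();
        loop {
            // `bad_len`: None = all valid, Some(None) = incomplete tail,
            // Some(Some(n)) = n invalid bytes follow the valid prefix.
            let (valid, bad_len) = match std::str::from_utf8(&self.pending) {
                Ok(_) => (self.pending.len(), None),
                Err(e) => (e.valid_up_to(), Some(e.error_len())),
            };
            out.push_str(&String::from_utf8_lossy(&self.pending[..valid]));
            self.pending.drain(..valid);
            match bad_len {
                None => break,       // everything decoded
                Some(None) => break, // wait for the rest of the character
                Some(Some(n)) => {
                    // Invalid bytes can never become valid: replace and skip them.
                    out.push('\u{FFFD}');
                    self.pending.drain(..n);
                }
            }
        }
        out
    }
}
```

For example, if "é" (bytes 0xC3 0xA9) is split across two tokens, the first `push` returns an empty string and the second returns "é".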
Objective: Text generation with various sampling strategies
- TODO-10.1: Implement basic sampling:
  - Greedy
  - Temperature scaling
  - Top-k
  - Top-p (nucleus)
  - Min-p
- TODO-10.2: Implement advanced sampling:
  - Mirostat
  - Locally typical sampling
  - Tail-free sampling
- TODO-10.3: Implement repetition penalties:
  - Frequency penalty
  - Presence penalty
  - Penalize newlines
- TODO-10.4: Add grammar-based constrained generation
- TODO-10.5: Implement speculative decoding
- TODO-10.6: Add beam search
- TODO-10.7: Implement streaming generation (async iterator)
Estimated Effort: 7-10 days
Priority: P1 (High)
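The basic samplers in TODO-10.1 compose naturally as a filter pipeline over the logits: temperature scaling, then top-k, then softmax, then top-p, then a draw from what remains. The sketch below keeps the random draw outside the function, with the caller supplying a uniform number `u` in [0, 1), so it stays deterministic and testable. The function names and the ordering of the filters are assumptions, not a statement of how llama.cpp orders its samplers.

```rust
/// Apply temperature, top-k and top-p; returns (token_id, probability) pairs
/// sorted by descending probability and renormalized to sum to 1.
pub fn filter_distribution(
    logits: &[f32],
    temperature: f32,
    top_k: usize,
    top_p: f32,
) -> Vec<(usize, f32)> {
    assert!(temperature > 0.0, "use greedy decoding for temperature 0");
    if logits.is_empty() {
        return Vec::new();
    }
    let mut ranked: Vec<(usize, f32)> =
        logits.iter().map(|l| l / temperature).enumerate().collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked.truncate(top_k.max(1)); // top-k: keep the k highest logits

    // Softmax over the survivors, subtracting the max for stability.
    let max = ranked[0].1;
    let mut sum = 0.0f32;
    for e in ranked.iter_mut() {
        e.1 = (e.1 - max).exp();
        sum += e.1;
    }
    for e in ranked.iter_mut() {
        e.1 /= sum;
    }

    // Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    let mut cumulative = 0.0f32;
    let mut keep = ranked.len();
    for (i, &(_, p)) in ranked.iter().enumerate() {
        cumulative += p;
        if cumulative >= top_p {
            keep = i + 1;
            break;
        }
    }
    ranked.truncate(keep);
    let total: f32 = ranked.iter().map(|&(_, p)| p).sum();
    for e in ranked.iter_mut() {
        e.1 /= total;
    }
    ranked
}

/// Inverse-CDF draw: `u` is a uniform random number in [0, 1).
pub fn sample(dist: &[(usize, f32)], u: f32) -> Option<usize> {
    let mut cumulative = 0.0f32;
    for &(id, p) in dist {
        cumulative += p;
        if u < cumulative {
            return Some(id);
        }
    }
    dist.last().map(|&(id, _)| id) // guards against rounding at the tail
}
```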
Objective: Command-line interface for inference
- TODO-11.1: Create oxidize-cli binary
- TODO-11.2: Implement argument parsing (clap)
  - Model path
  - Prompt (interactive, file, stdin)
  - Context size
  - Thread count
  - GPU layers
  - Sampling parameters
  - System prompt
- TODO-11.3: Add interactive chat mode (REPL)
- TODO-11.4: Implement single-shot inference mode
- TODO-11.5: Add conversation history management
- TODO-11.6: Implement progress indicators for loading/generation
- TODO-11.7: Add token/speed reporting
- TODO-11.8: Support prompt caching
- TODO-11.9: Add multi-line input support
- TODO-11.10: Implement reverse prompt (stop sequences)
Estimated Effort: 5-7 days
Priority: P1 (High)
Objective: OpenAI-compatible HTTP API server
- TODO-12.1: Create oxidize-server binary with axum or actix-web
- TODO-12.2: Implement OpenAI-compatible endpoints:
  - POST /v1/chat/completions
  - POST /v1/completions
  - GET /v1/models
  - POST /v1/embeddings
- TODO-12.3: Add Server-Sent Events (SSE) for streaming
- TODO-12.4: Implement JSON mode and structured output
- TODO-12.5: Add request/response logging
- TODO-12.6: Implement rate limiting and request queuing
- TODO-12.7: Add health check endpoints
- TODO-12.8: Support concurrent request handling
- TODO-12.9: Add authentication middleware (API keys)
- TODO-12.10: Create OpenAPI documentation
Estimated Effort: 7-10 days
Priority: P1 (High)
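For the SSE streaming item (TODO-12.3), the wire format itself is simple: each event is one or more `data:` lines followed by a blank line, and OpenAI-style streams end with a `data: [DONE]` event. The framing sketch below is independent of the web framework. The helper names are illustrative, and axum or actix-web provide their own SSE response types that would normally handle this.

```rust
/// Frame one payload as a Server-Sent Event.
/// Multi-line payloads get one `data:` field per line, as the SSE format requires.
pub fn sse_event(payload: &str) -> String {
    let mut out = String::new();
    for line in payload.split('\n') {
        out.push_str("data: ");
        out.push_str(line);
        out.push('\n');
    }
    out.push('\n'); // a blank line terminates the event
    out
}

/// Terminal event used by OpenAI-compatible streaming responses.
pub fn sse_done() -> String {
    sse_event("[DONE]")
}
```

Each generated token would typically be wrapped in a JSON chunk (serialized with serde) and passed through `sse_event` before being written to the response body.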
Objective: Python interface via PyO3
- TODO-13.1: Set up pyo3 workspace
- TODO-13.2: Create oxidize Python package
- TODO-13.3: Implement Llama class with methods:
  - __init__
  - generate
  - create_chat_completion
  - embed
- TODO-13.4: Add async support with asyncio
- TODO-13.5: Support numpy and torch tensor interop
- TODO-13.6: Create pip-installable wheels (maturin)
- TODO-13.7: Add Python type stubs
- TODO-13.8: Match llama-cpp-python API for compatibility
Estimated Effort: 7-10 days
Priority: P2 (Medium)
Objective: Browser-based inference
- TODO-14.1: Set up wasm-bindgen build
- TODO-14.2: Implement WebGPU compute backend
- TODO-14.3: Add Web Worker support for background inference
- TODO-14.4: Create JavaScript/TypeScript bindings
- TODO-14.5: Implement streaming generation in browser
- TODO-14.6: Add model download/cache management in browser
- TODO-14.7: Create demo web application
Estimated Effort: 10-14 days
Priority: P2 (Medium)
Objective: Achieve llama.cpp-level performance
- TODO-15.1: Profile CPU inference with perf/samply
- TODO-15.2: Optimize memory access patterns
- TODO-15.3: Implement operator fusion (combine linear + activation)
- TODO-15.4: Add INT8/INT4 GEMM via gemm crate or custom kernels
- TODO-15.5: Implement Flash Attention for long contexts
- TODO-15.6: Add continuous batching for server throughput
- TODO-15.7: Implement pipeline parallelism for multi-GPU
- TODO-15.8: Add tensor parallelism for large models
- TODO-15.9: Optimize prompt processing (prefill) with batching
- TODO-15.10: Add memory pool allocator to reduce allocations
Performance Targets:
- CPU: Within 15% of llama.cpp
- GPU: Within 20% of llama.cpp
- Memory usage: Comparable to llama.cpp
Estimated Effort: Ongoing (20+ days)
Priority: P1 (High)
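The memory pool item (TODO-15.10) could start as a simple free list of scratch buffers that are reused across forward passes instead of being reallocated. The sketch below is one minimal design, assuming single-threaded use and f32 buffers. `BufferPool` is a hypothetical name, and a real allocator would likely add size classes and thread safety.

```rust
/// Reuses scratch buffers to avoid repeated heap allocation.
pub struct BufferPool {
    free: Vec<Vec<f32>>,
}

impl BufferPool {
    pub fn new() -> Self {
        BufferPool { free: Vec::new() }
    }

    /// Get a zeroed buffer of `len` floats, reusing a released one if it is big enough.
    pub fn acquire(&mut self, len: usize) -> Vec<f32> {
        match self.free.iter().position(|b| b.capacity() >= len) {
            Some(index) => {
                let mut buf = self.free.swap_remove(index);
                buf.clear();
                buf.resize(len, 0.0); // stays within capacity, so no reallocation
                buf
            }
            None => vec![0.0; len],
        }
    }

    /// Return a buffer to the pool for later reuse.
    pub fn release(&mut self, buf: Vec<f32>) {
        self.free.push(buf);
    }

    /// Number of buffers currently waiting for reuse.
    pub fn available(&self) -> usize {
        self.free.len()
    }
}
```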
Objective: Comprehensive test coverage and benchmarking
- TODO-16.1: Unit tests for all quantization schemes
- TODO-16.2: Numerical accuracy tests (vs. PyTorch reference)
- TODO-16.3: Integration tests with real GGUF models
- TODO-16.4: Benchmark suite comparing to llama.cpp
- TODO-16.5: Perplexity benchmarks on standard datasets
- TODO-16.6: Memory usage benchmarks
- TODO-16.7: Create CI benchmarks with regression detection
- TODO-16.8: Add fuzzing for parser and tokenizer
- TODO-16.9: Create benchmark dashboard
- TODO-16.10: Add model compatibility tests (run 100+ models)
Estimated Effort: Ongoing (10+ days)
Priority: P1 (High)
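For the perplexity benchmarks (TODO-16.5), the metric is the exponential of the mean negative log-likelihood over the evaluated tokens. A minimal sketch, assuming the model's natural-log probability of each correct token has already been collected:

```rust
/// Perplexity = exp(-(1/N) * sum(ln p_i)) over the evaluated tokens.
/// Returns None for an empty input.
pub fn perplexity(token_log_probs: &[f64]) -> Option<f64> {
    if token_log_probs.is_empty() {
        return None;
    }
    let mean_nll = -token_log_probs.iter().sum::<f64>() / token_log_probs.len() as f64;
    Some(mean_nll.exp())
}
```

A model that assigns probability 0.25 to every correct token has perplexity 4, which makes a handy sanity check. Comparing this number between a quantized model and its f16 original is a common way to quantify quantization loss.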
Objective: Excellent developer and user experience
- TODO-17.1: Write comprehensive README with quick start
- TODO-17.2: Create API documentation with rustdoc
- TODO-17.3: Write architecture documentation
- TODO-17.4: Create quantization guide
- TODO-17.5: Write performance tuning guide
- TODO-17.6: Create examples:
  - Basic inference
  - Chat completion
  - Streaming generation
  - Batch processing
  - Custom sampling
  - Embedding extraction
- TODO-17.7: Add troubleshooting guide
- TODO-17.8: Create contribution guidelines
- TODO-17.9: Write blog post announcing release
Estimated Effort: 5-7 days
Priority: P2 (Medium)
| Component | Primary Choice | Alternatives |
|---|---|---|
| Build System | Cargo | Bazel |
| Async Runtime | Tokio | async-std |
| CLI Framework | clap | structopt (maintenance mode; its derive API was merged into clap) |
| Web Server | axum | actix-web, rocket |
| Serialization | serde | - |
| CUDA Bindings | cudarc | rustacuda, cust |
| Metal Bindings | metal-rs | - |
| Python Bindings | pyo3 + maturin | - |
| WASM | wasm-bindgen | - |
| Logging | tracing | log |
| Error Handling | thiserror + anyhow | - |
| Testing | built-in + criterion | - |
| Quantization | custom | - |
| BLAS | intel-mkl-src, openblas-src | gemm |
- TODO-1.x: Project setup
- TODO-2.x: GGUF loader
- TODO-3.x: Basic quantization
- TODO-7.1, 7.2: LLaMA architecture
Deliverable: Load and run LLaMA 2/3 models on CPU
- TODO-4.x: CPU kernels
- TODO-8.x: KV cache
- TODO-9.x: Tokenization
- TODO-10.x: Sampling
- TODO-11.x: CLI
Deliverable: Full CLI with chat mode, competitive CPU performance
- TODO-5.x: CUDA support
- TODO-6.x: Metal support
- TODO-15.x: Performance optimization
Deliverable: GPU inference matching llama.cpp speeds
- TODO-12.x: HTTP server
- TODO-13.x: Python bindings
- TODO-7.3-7.9: More architectures
- TODO-16.x: Testing
Deliverable: Production-ready with server and Python API
- TODO-14.x: WASM
- TODO-10.5: Speculative decoding
- TODO-15.6-15.8: Advanced parallelism
- TODO-7.10-7.11: Architecture auto-detection, LoRA/QLoRA
Deliverable: Full feature parity with llama.cpp + Rust advantages
| Metric | Target |
|---|---|
| Models Supported | 50+ GGUF architectures |
| CPU Performance | Within 15% of llama.cpp |
| GPU Performance | Within 20% of llama.cpp |
| Memory Safety | Zero memory leaks or undefined behavior in tests (checked with Valgrind and Miri) |
| Test Coverage | >80% line coverage |
| Binary Size | <50MB for CLI (release) |
| Startup Time | <2s for 7B model |
| Token Throughput | Match or exceed llama.cpp per watt |
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| CUDA kernel performance gap | Medium | High | Use cuBLAS where possible, profile extensively |
| Quantization accuracy loss | Low | High | Validate against reference, use IMatrix |
| Memory overhead vs C++ | Medium | Medium | Zero-copy design, careful allocation |
| Build complexity (CUDA deps) | High | Medium | Feature flags, optional GPU backends |
| Compilation time | High | Low | Workspace organization, sccache |
- Should we use candle or burn crates for tensor operations, or implement custom?
- How to handle CUDA build in CI? (GitHub Actions has limited GPU runners)
- Should we support GGML format legacy loading?
- What's the minimum Rust version to support?
- How to handle model downloads and Hugging Face integration?
- llama.cpp — Reference implementation
- GGUF Format Spec
- The Rust Programming Language
- Rust SIMD Guide
- LLaMA Paper
- Flash Attention
Next Steps:
- Create GitHub repository and initialize workspace
- Start with TODO-1.1 (workspace setup)
- Implement TODO-2.1 (GGUF parser) as first milestone
- Set up benchmark harness to compare against llama.cpp baseline
Estimated Total Effort: 4-5 months for MVP, 6-8 months for full feature parity
Team Size: 2-3 developers recommended