Product Requirements Document (PRD)

Project: oxidize — A Rust-Based LLM Inference Engine

Date: April 30, 2026
Status: Draft v0.1
Target Language: Rust
Inspiration: llama.cpp by Georgi Gerganov


1. Executive Summary

Build a high-performance, dependency-light LLM inference engine in Rust that runs large language models on commodity hardware (CPUs, GPUs, Apple Silicon) using quantization and modern systems programming techniques.

Key Differentiators:

  • Zero-cost abstractions via Rust's traits and generics
  • Memory safety without GC overhead, enforced by Rust's ownership model
  • First-class async/concurrency support
  • Native WebAssembly support for browser inference
  • Modern crate ecosystem (burn, candle, etc.)

2. Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    APPLICATION LAYER                        │
│  CLI │ HTTP Server │ Python Bindings │ WASM │ FFI          │
├─────────────────────────────────────────────────────────────┤
│                    API LAYER                                │
│  Session Management │ Sampling │ Tokenization │ Scheduling  │
├─────────────────────────────────────────────────────────────┤
│                    COMPUTE LAYER                            │
│  CPU Kernels (AVX2/AVX512/NEON) │ GPU (CUDA/HIP/Vulkan)    │
│  Quantization │ Dequantization │ Matrix Multiplication     │
├─────────────────────────────────────────────────────────────┤
│                    MODEL LAYER                              │
│  GGUF Loader │ Model Graph │ Weight Storage │ KV Cache      │
├─────────────────────────────────────────────────────────────┤
│                    HARDWARE ABSTRACTION                     │
│  CPU │ NVIDIA (CUDA) │ AMD (HIP) │ Apple (Metal) │ WASM     │
└─────────────────────────────────────────────────────────────┘

3. Core Modules & TODOs

MODULE 1: Project Foundation & Build System

Objective: Establish Rust project structure with cross-platform compilation support

  • TODO-1.1: Initialize Cargo workspace with workspace-level dependencies
    [workspace]
    members = ["oxidize-core", "oxidize-cli", "oxidize-server", "oxidize-quantize"]
    resolver = "3"
  • TODO-1.2: Set up CI/CD (GitHub Actions) for Linux, macOS, Windows builds
  • TODO-1.3: Configure cross-compilation for ARM64, WASM32 targets
  • TODO-1.4: Set up benchmark harness with criterion.rs
  • TODO-1.5: Create Docker images for deployment
  • TODO-1.6: Add justfile/Makefile for common tasks
  • TODO-1.7: Set up cargo deny for license/security auditing
  • TODO-1.8: Configure release profile with LTO and panic=abort

Estimated Effort: 2-3 days
Priority: P0 (Blocking)


MODULE 2: GGUF Format & Model Loader

Objective: Parse and load GGUF files, the model file format used by llama.cpp

  • TODO-2.1: Implement GGUF file format parser
    • Magic number validation (GGUF)
    • Version handling (v2, v3)
    • Tensor info metadata parsing
    • Alignment and padding handling
  • TODO-2.2: Create Tensor struct with shape, strides, dtype
  • TODO-2.3: Implement memory-mapped file loading (memmap2 crate)
  • TODO-2.4: Support tensor name mapping for different architectures
  • TODO-2.5: Add quantization type detection (tensor types Q4_0, Q4_K, Q5_K, Q8_0, etc.; file-level mixes such as Q4_K_M and Q5_K_M combine several tensor types)
  • TODO-2.6: Implement ModelLoader trait for extensibility
  • TODO-2.7: Add progress callbacks for large model loading
  • TODO-2.8: Create comprehensive unit tests with fixture files
  • TODO-2.9: Benchmark loader against llama.cpp baseline
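
As a starting point for TODO-2.1, the fixed-size part of the header can be validated before any metadata is read. The sketch below assumes the GGUF v2/v3 layout (4-byte magic, u32 version, u64 tensor count, u64 metadata key-value count) and little-endian files only; the names `GgufHeader`, `HeaderError`, and `parse_gguf_header` are hypothetical and not part of any existing crate.

```rust
/// Fixed-size GGUF header fields (layout assumed from the GGUF v2/v3 format).
#[derive(Debug, PartialEq)]
pub struct GgufHeader {
    pub version: u32,
    pub tensor_count: u64,
    pub metadata_kv_count: u64,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum HeaderError {
    TooShort,
    BadMagic([u8; 4]),
    UnsupportedVersion(u32),
}

/// Read a little-endian u64 from the first 8 bytes of `b`.
fn read_u64_le(b: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&b[..8]);
    u64::from_le_bytes(buf)
}

/// Parse the first 24 bytes of a GGUF file: magic, version, tensor count, KV count.
pub fn parse_gguf_header(bytes: &[u8]) -> Result<GgufHeader, HeaderError> {
    if bytes.len() < 24 {
        return Err(HeaderError::TooShort);
    }
    let magic = [bytes[0], bytes[1], bytes[2], bytes[3]];
    if &magic != b"GGUF" {
        return Err(HeaderError::BadMagic(magic));
    }
    let version = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    // This sketch only accepts the versions listed in TODO-2.1.
    if version != 2 && version != 3 {
        return Err(HeaderError::UnsupportedVersion(version));
    }
    Ok(GgufHeader {
        version,
        tensor_count: read_u64_le(&bytes[8..16]),
        metadata_kv_count: read_u64_le(&bytes[16..24]),
    })
}
```

The metadata key-value pairs and tensor info records follow this header and would be parsed by separate functions; a memory-mapped file (TODO-2.3) can be passed in directly as a byte slice.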

Key Crates: memmap2, bytemuck, half
Estimated Effort: 5-7 days
Priority: P0 (Blocking)


MODULE 3: Quantization Engine

Objective: Implement quantization/dequantization schemes matching llama.cpp

  • TODO-3.1: Implement scalar dequantization kernels:
    • Q4_0, Q4_1 (4-bit, without/with a per-block offset)
    • Q5_0, Q5_1 (5-bit variants)
    • Q8_0 (8-bit)
    • Q2_K, Q3_K, Q4_K, Q5_K, Q6_K (K-quants)
  • TODO-3.2: Implement dequantization to f16 and f32
  • TODO-3.3: Add quantization from f16/f32 to all supported formats
  • TODO-3.4: Implement block-wise quantization with per-block scales
  • TODO-3.5: Add importance matrix support (IMatrix) for better quality
  • TODO-3.6: Create quantization CLI tool (oxidize-quantize)
  • TODO-3.7: Add mixed quantization support (different types per layer)
  • TODO-3.8: Validate output against llama.cpp reference implementation

Key Traits:

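// `Result` is assumed to be a crate-level alias (for example `anyhow::Result`).
pub trait Quantization {
    /// Quantize f32 `data` into packed blocks written to `output`.
    fn quantize(&self, data: &[f32], output: &mut [u8]) -> Result<()>;
    /// Dequantize packed blocks in `data` into f32 values written to `output`.
    fn dequantize(&self, data: &[u8], output: &mut [f32]) -> Result<()>;
    /// Number of weights stored in one block.
    fn block_size(&self) -> usize;
    /// Size of one packed block in bytes.
    fn type_size(&self) -> usize;
}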
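
To make the block layout concrete, here is a minimal scalar sketch of Q4_0 dequantization (TODO-3.1). It assumes the llama.cpp Q4_0 layout: 32 weights per block, stored as a little-endian f16 scale followed by 16 bytes of packed 4-bit values, with low nibbles holding the first half of the block and high nibbles the second half. The function names are hypothetical, and the f16 conversion is hand-written only to keep the example free of dependencies; the `half` crate would normally do this.

```rust
const QK4_0: usize = 32; // weights per block
const Q4_0_BLOCK_BYTES: usize = 2 + QK4_0 / 2; // f16 scale + 16 packed bytes

/// Convert IEEE 754 half-precision bits to f32.
fn f16_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) & 1) as u32;
    let exp = ((bits >> 10) & 0x1f) as u32;
    let frac = (bits & 0x3ff) as u32;
    let f32_bits = match (exp, frac) {
        (0, 0) => sign << 31, // signed zero
        (0, _) => {
            // Subnormal: value = frac * 2^-24.
            let v = frac as f32 * (1.0 / 16_777_216.0);
            return if sign == 1 { -v } else { v };
        }
        (0x1f, _) => (sign << 31) | 0x7f80_0000 | (frac << 13), // inf / NaN
        _ => (sign << 31) | ((exp + 112) << 23) | (frac << 13), // rebias 15 -> 127
    };
    f32::from_bits(f32_bits)
}

/// Dequantize Q4_0 blocks: weight = (nibble - 8) * scale.
pub fn dequantize_q4_0(data: &[u8], out: &mut [f32]) -> Result<(), String> {
    if data.len() % Q4_0_BLOCK_BYTES != 0 {
        return Err(format!("input length {} is not a whole number of blocks", data.len()));
    }
    if out.len() != data.len() / Q4_0_BLOCK_BYTES * QK4_0 {
        return Err(format!("output length {} does not match block count", out.len()));
    }
    for (block, dst) in data.chunks_exact(Q4_0_BLOCK_BYTES).zip(out.chunks_exact_mut(QK4_0)) {
        let scale = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
        for (j, &byte) in block[2..].iter().enumerate() {
            let lo = (byte & 0x0f) as i32 - 8; // element j
            let hi = (byte >> 4) as i32 - 8; // element j + 16
            dst[j] = lo as f32 * scale;
            dst[j + QK4_0 / 2] = hi as f32 * scale;
        }
    }
    Ok(())
}
```

A scalar version like this is also useful later as the reference that SIMD kernels are checked against (TODO-3.8).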

Estimated Effort: 10-14 days
Priority: P0 (Blocking)


MODULE 4: Compute Kernels — CPU

Objective: High-performance CPU inference kernels with SIMD optimization

  • TODO-4.1: Set up SIMD abstraction layer
    • x86: SSE2, AVX, AVX2, AVX512 (via std::arch)
    • ARM: NEON (via std::arch)
    • Fallback: scalar implementations
  • TODO-4.2: Implement matrix-vector multiplication (GEMV)
    • F32, F16 input types
    • Quantized weights with on-the-fly dequantization
  • TODO-4.3: Implement matrix-matrix multiplication (GEMM) for batching
  • TODO-4.4: Implement attention mechanisms:
    • Multi-head attention (MHA)
    • Grouped-query attention (GQA)
    • Multi-query attention (MQA)
  • TODO-4.5: Implement RoPE (Rotary Position Embedding)
  • TODO-4.6: Implement SwiGLU activation
  • TODO-4.7: Implement RMSNorm and LayerNorm
  • TODO-4.8: Implement Softmax (stable, numerically accurate)
  • TODO-4.9: Add thread pool for parallel layer execution
  • TODO-4.10: Optimize cache locality and prefetching
  • TODO-4.11: Add runtime CPU feature detection
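
Two of the kernels above, Softmax (TODO-4.8) and RMSNorm (TODO-4.7), are small enough to show as scalar reference versions. This is a sketch of the fallback path only, with hypothetical function names; the SIMD variants would need to match its output within a tolerance.

```rust
/// Numerically stable softmax, in place: subtracting the maximum before
/// exponentiating keeps `exp` from overflowing on large logits.
pub fn softmax_inplace(x: &mut [f32]) {
    if x.is_empty() {
        return;
    }
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for v in x.iter_mut() {
        *v = (*v - max).exp();
        sum += *v;
    }
    for v in x.iter_mut() {
        *v /= sum;
    }
}

/// RMSNorm: out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32, out: &mut [f32]) {
    assert_eq!(x.len(), weight.len());
    assert_eq!(x.len(), out.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    for i in 0..x.len() {
        out[i] = x[i] * inv_rms * weight[i];
    }
}
```

Without the max subtraction, logits around 1000 would produce `inf / inf = NaN`; with it, equal logits yield a uniform distribution regardless of their magnitude.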

Performance Target: Within 10% of llama.cpp CPU performance
Estimated Effort: 15-20 days
Priority: P0 (Blocking)


MODULE 5: Compute Kernels — GPU (CUDA)

Objective: CUDA kernels for NVIDIA GPU acceleration

  • TODO-5.1: Set up CUDA build pipeline with rust-cuda or cust
  • TODO-5.2: Implement memory management (device allocation, H2D/D2H transfers)
  • TODO-5.3: Port GEMV kernels to CUDA
  • TODO-5.4: Port GEMM kernels using cuBLAS
  • TODO-5.5: Implement attention kernels (flash attention style)
  • TODO-5.6: Implement quantization-aware kernels (dequantize on GPU)
  • TODO-5.7: Add kernel fusion (combine multiple ops into single kernel)
  • TODO-5.8: Implement layer offloading (--n-gpu-layers equivalent)
  • TODO-5.9: Add multi-GPU support (tensor/pipeline parallelism)
  • TODO-5.10: Optimize memory usage with Flash Attention

Key Crates: cust, cudarc
Estimated Effort: 20-25 days
Priority: P1 (High)


MODULE 6: Compute Kernels — Apple Metal

Objective: Metal compute backend for Apple Silicon GPUs

  • TODO-6.1: Set up Metal build with metal-rs
  • TODO-6.2: Implement buffer management for unified memory
  • TODO-6.3: Port compute kernels to Metal Shading Language
  • TODO-6.4: Optimize for Apple Silicon unified memory architecture
  • TODO-6.5: Add Metal Performance Shaders integration where beneficial

Estimated Effort: 10-12 days
Priority: P1 (High)


MODULE 7: Model Architectures

Objective: Support multiple transformer architectures

  • TODO-7.1: Define Model trait:
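    pub trait Model {
        /// Run one forward pass over `tokens`, updating `session` state (e.g. the KV cache).
        /// `Token`, `Session`, and `Logits` are project types still to be defined.
        fn forward(&mut self, tokens: &[Token], session: &mut Session) -> Result<Logits>;
        /// Number of entries in the vocabulary.
        fn vocab_size(&self) -> usize;
        /// Maximum context length in tokens.
        fn context_size(&self) -> usize;
        /// Number of transformer layers.
        fn layer_count(&self) -> usize;
    }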
  • TODO-7.2: Implement LLaMA architecture (LLaMA 2, LLaMA 3)
  • TODO-7.3: Implement Mistral architecture
  • TODO-7.4: Implement Mixtral MoE architecture
  • TODO-7.5: Implement Qwen architecture
  • TODO-7.6: Implement Gemma architecture
  • TODO-7.7: Implement Falcon architecture
  • TODO-7.8: Implement GPT architecture (GPT-2, GPT-J, GPT-NeoX)
  • TODO-7.9: Implement Phi architecture
  • TODO-7.10: Implement architecture auto-detection from GGUF metadata
  • TODO-7.11: Add LoRA/QLoRA support

Estimated Effort: 15-20 days
Priority: P0 (Blocking)


MODULE 8: KV Cache Management

Objective: Efficient key-value cache for attention

  • TODO-8.1: Implement KV cache storage with configurable dtype
  • TODO-8.2: Add sliding window attention cache management
  • TODO-8.3: Implement cache eviction strategies
  • TODO-8.4: Add cache quantization (8-bit, 4-bit KV cache)
  • TODO-8.5: Support continuous batching with cache management
  • TODO-8.6: Add cache persistence across sessions
  • TODO-8.7: Optimize memory layout for cache access patterns
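
The dtype and quantization choices in TODO-8.1 and TODO-8.4 mostly matter because of cache size, which follows directly from the model shape: 2 (keys and values) x layers x context length x KV heads x head dimension x bytes per element. The sketch below does that arithmetic; the struct and its field names are hypothetical, and the example shape (32 layers, 32 KV heads, head dimension 128) is assumed for illustration as a typical 7B LLaMA-style configuration.

```rust
/// Shape parameters that determine KV cache size (hypothetical struct).
pub struct KvCacheShape {
    pub n_layers: u64,
    pub n_ctx: u64,
    pub n_kv_heads: u64,
    pub head_dim: u64,
    pub bytes_per_elem: u64,
}

impl KvCacheShape {
    /// Total bytes for keys plus values across all layers and positions.
    pub fn total_bytes(&self) -> u64 {
        2 * self.n_layers * self.n_ctx * self.n_kv_heads * self.head_dim * self.bytes_per_elem
    }
}

/// Assumed 7B-style shape with an f16 cache and a 4096-token context.
pub fn example_7b_f16() -> KvCacheShape {
    KvCacheShape {
        n_layers: 32,
        n_ctx: 4096,
        n_kv_heads: 32,
        head_dim: 128,
        bytes_per_elem: 2,
    }
}
```

For the assumed shape this comes to 2 GiB at f16. Cutting the KV head count to 8 (grouped-query attention) or storing 8-bit values reduces it proportionally, which is why the cache dtype is worth making configurable.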

Estimated Effort: 5-7 days
Priority: P1 (High)


MODULE 9: Tokenization

Objective: Text-to-tokens and tokens-to-text conversion

  • TODO-9.1: Implement Byte-Pair Encoding (BPE)
  • TODO-9.2: Implement SentencePiece (Unigram)
  • TODO-9.3: Implement WordPiece
  • TODO-9.4: Add tiktoken-style BPE (as used by OpenAI GPT-4 models)
  • TODO-9.5: Create tokenizer loader from GGUF metadata
  • TODO-9.6: Add special token handling (BOS, EOS, PAD, etc.)
  • TODO-9.7: Implement streaming detokenization
  • TODO-9.8: Add chat template processing
  • TODO-9.9: Support token healing (merge incomplete tokens)
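
Streaming detokenization (TODO-9.7) has one recurring pitfall: a multi-byte UTF-8 character can be split across two tokens, so the bytes of each token cannot be converted to text independently. One possible approach, sketched below with a hypothetical `Utf8Stream` type, is to buffer the trailing incomplete sequence and emit it once the rest arrives.

```rust
/// Buffers raw token bytes and emits only complete UTF-8 text.
#[derive(Default)]
pub struct Utf8Stream {
    pending: Vec<u8>,
}

impl Utf8Stream {
    /// Append the bytes of one token and return whatever text is now complete.
    pub fn push(&mut self, token_bytes: &[u8]) -> String {
        self.pending.extend_from_slice(token_bytes);
        let mut out = String::new();
        loop {
            // Copy the error out so the borrow of `pending` ends before mutation.
            let err = match std::str::from_utf8(&self.pending) {
                Ok(s) => {
                    out.push_str(s);
                    None
                }
                Err(e) => Some(e),
            };
            let e = match err {
                None => {
                    self.pending.clear();
                    return out;
                }
                Some(e) => e,
            };
            let valid = e.valid_up_to();
            out.push_str(&String::from_utf8_lossy(&self.pending[..valid]));
            match e.error_len() {
                // Sequence is cut off at the end of the buffer: keep it for later.
                None => {
                    self.pending.drain(..valid);
                    return out;
                }
                // Genuinely invalid bytes: emit U+FFFD and skip past them.
                Some(n) => {
                    out.push('\u{FFFD}');
                    self.pending.drain(..valid + n);
                }
            }
        }
    }
}
```

The distinction between "incomplete" and "invalid" comes from `Utf8Error::error_len`, which returns `None` only when the input ended in the middle of a sequence that could still become valid.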

Key Crates: tokenizers (Hugging Face), tiktoken-rs
Estimated Effort: 5-7 days
Priority: P0 (Blocking)


MODULE 10: Sampling & Generation

Objective: Text generation with various sampling strategies

  • TODO-10.1: Implement basic sampling:
    • Greedy
    • Temperature scaling
    • Top-k
    • Top-p (nucleus)
    • Min-p
  • TODO-10.2: Implement advanced sampling:
    • Mirostat
    • Locally typical sampling
    • Tail-free sampling
  • TODO-10.3: Implement repetition penalties:
    • Frequency penalty
    • Presence penalty
    • Penalize newlines
  • TODO-10.4: Add grammar-based constrained generation
  • TODO-10.5: Implement speculative decoding
  • TODO-10.6: Add beam search
  • TODO-10.7: Implement streaming generation (async iterator)
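
The basic samplers in TODO-10.1 compose naturally into a pipeline: scale logits by temperature, keep the top-k candidates, convert to probabilities, then keep the smallest prefix whose cumulative probability reaches top-p. The sketch below is one possible ordering of those steps, not necessarily the order llama.cpp uses; the function names are hypothetical, `temperature` is assumed to be positive, and the random draw `u` in [0, 1) is passed in so that no RNG crate is needed.

```rust
/// Apply temperature, top-k, and top-p; returns (token id, probability) pairs
/// sorted by descending probability and renormalized to sum to 1.
pub fn filter_candidates(
    logits: &[f32],
    temperature: f32,
    top_k: usize,
    top_p: f32,
) -> Vec<(usize, f32)> {
    if logits.is_empty() {
        return Vec::new();
    }
    let mut cands: Vec<(usize, f32)> =
        logits.iter().map(|&l| l / temperature).enumerate().collect();
    cands.sort_by(|a, b| b.1.total_cmp(&a.1));
    cands.truncate(top_k.max(1));

    // Stable softmax over the surviving candidates.
    let max = cands[0].1;
    let mut sum = 0.0f32;
    for c in cands.iter_mut() {
        c.1 = (c.1 - max).exp();
        sum += c.1;
    }
    for c in cands.iter_mut() {
        c.1 /= sum;
    }

    // Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    let mut cumulative = 0.0f32;
    let mut keep = cands.len();
    for (i, c) in cands.iter().enumerate() {
        cumulative += c.1;
        if cumulative >= top_p {
            keep = i + 1;
            break;
        }
    }
    cands.truncate(keep);
    let kept: f32 = cands.iter().map(|c| c.1).sum();
    for c in cands.iter_mut() {
        c.1 /= kept;
    }
    cands
}

/// Pick a token by inverse-CDF sampling, given a uniform draw `u` in [0, 1).
pub fn pick(cands: &[(usize, f32)], u: f32) -> usize {
    let mut cumulative = 0.0f32;
    for &(id, p) in cands {
        cumulative += p;
        if u < cumulative {
            return id;
        }
    }
    cands.last().map(|c| c.0).unwrap_or(0)
}
```

Greedy decoding falls out as the special case `top_k = 1`. Keeping the draw outside the function also makes the sampler deterministic under test.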

Estimated Effort: 7-10 days
Priority: P1 (High)


MODULE 11: CLI Application

Objective: Command-line interface for inference

  • TODO-11.1: Create oxidize-cli binary
  • TODO-11.2: Implement argument parsing (clap)
    • Model path
    • Prompt (interactive, file, stdin)
    • Context size
    • Thread count
    • GPU layers
    • Sampling parameters
    • System prompt
  • TODO-11.3: Add interactive chat mode (REPL)
  • TODO-11.4: Implement single-shot inference mode
  • TODO-11.5: Add conversation history management
  • TODO-11.6: Implement progress indicators for loading/generation
  • TODO-11.7: Add token/speed reporting
  • TODO-11.8: Support prompt caching
  • TODO-11.9: Add multi-line input support
  • TODO-11.10: Implement reverse prompt (stop sequences)
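
Stop sequences (TODO-11.10) reduce to finding the earliest occurrence of any configured string in the generated text and cutting the output there. A minimal sketch, with a hypothetical function name, that works on the accumulated output:

```rust
/// Returns the text before the earliest occurrence of any stop sequence,
/// or `None` if no stop sequence appears. Empty stop strings are ignored.
pub fn truncate_at_stop<'a>(text: &'a str, stops: &[&str]) -> Option<&'a str> {
    stops
        .iter()
        .filter(|s| !s.is_empty())
        .filter_map(|s| text.find(*s))
        .min()
        .map(|idx| &text[..idx])
}
```

In a streaming setting the caller would also need to hold back any output tail that is a prefix of a stop sequence, so that a partial match is not printed before it can be confirmed; that is left out here.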

Estimated Effort: 5-7 days
Priority: P1 (High)


MODULE 12: HTTP Server & API

Objective: OpenAI-compatible HTTP API server

  • TODO-12.1: Create oxidize-server binary with axum or actix-web
  • TODO-12.2: Implement OpenAI-compatible endpoints:
    • POST /v1/chat/completions
    • POST /v1/completions
    • GET /v1/models
    • POST /v1/embeddings
  • TODO-12.3: Add Server-Sent Events (SSE) for streaming
  • TODO-12.4: Implement JSON mode and structured output
  • TODO-12.5: Add request/response logging
  • TODO-12.6: Implement rate limiting and request queuing
  • TODO-12.7: Add health check endpoints
  • TODO-12.8: Support concurrent request handling
  • TODO-12.9: Add authentication middleware (API keys)
  • TODO-12.10: Create OpenAPI documentation
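
For the streaming endpoints (TODO-12.3), the wire format is simple enough to show without a web framework. Each Server-Sent Event is one or more `data:` lines followed by a blank line, and OpenAI-style streams end with a literal `[DONE]` payload. The helper below is a framework-independent sketch with hypothetical names; axum and actix-web both ship their own SSE types that would normally be used instead.

```rust
/// Frame a payload as one Server-Sent Event. Multi-line payloads are split
/// into one `data:` line per line, as the SSE format requires.
pub fn sse_data_frame(payload: &str) -> String {
    let mut frame = String::new();
    for line in payload.split('\n') {
        frame.push_str("data: ");
        frame.push_str(line);
        frame.push('\n');
    }
    frame.push('\n'); // a blank line terminates the event
    frame
}

/// Terminal event used by OpenAI-compatible streaming responses.
pub const SSE_DONE: &str = "data: [DONE]\n\n";
```

Each generated chunk would be serialized to JSON and passed through `sse_data_frame`, with `SSE_DONE` written once generation finishes or a stop sequence is hit.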

Estimated Effort: 7-10 days
Priority: P1 (High)


MODULE 13: Python Bindings

Objective: Python interface via PyO3

  • TODO-13.1: Set up pyo3 workspace
  • TODO-13.2: Create oxidize Python package
  • TODO-13.3: Implement Llama class with methods:
    • __init__
    • generate
    • create_chat_completion
    • embed
  • TODO-13.4: Add async support with asyncio
  • TODO-13.5: Support numpy and torch tensor interop
  • TODO-13.6: Create pip installable wheels (maturin)
  • TODO-13.7: Add Python type stubs
  • TODO-13.8: Match llama-cpp-python API for compatibility

Estimated Effort: 7-10 days
Priority: P2 (Medium)


MODULE 14: WebAssembly Support

Objective: Browser-based inference

  • TODO-14.1: Set up wasm-bindgen build
  • TODO-14.2: Implement WebGPU compute backend
  • TODO-14.3: Add Web Worker support for background inference
  • TODO-14.4: Create JavaScript/TypeScript bindings
  • TODO-14.5: Implement streaming generation in browser
  • TODO-14.6: Add model download/cache management in browser
  • TODO-14.7: Create demo web application

Estimated Effort: 10-14 days
Priority: P2 (Medium)


MODULE 15: Performance Optimization

Objective: Achieve llama.cpp-level performance

  • TODO-15.1: Profile CPU inference with perf/samply
  • TODO-15.2: Optimize memory access patterns
  • TODO-15.3: Implement operator fusion (combine linear + activation)
  • TODO-15.4: Add INT8/INT4 GEMM via gemm crate or custom kernels
  • TODO-15.5: Implement Flash Attention for long contexts
  • TODO-15.6: Add continuous batching for server throughput
  • TODO-15.7: Implement pipeline parallelism for multi-GPU
  • TODO-15.8: Add tensor parallelism for large models
  • TODO-15.9: Optimize prompt processing (prefill) with batching
  • TODO-15.10: Add memory pool allocator to reduce allocations

Performance Targets:

  • CPU: Within 15% of llama.cpp
  • GPU: Within 20% of llama.cpp
  • Memory usage: Comparable to llama.cpp

Estimated Effort: Ongoing (20+ days)
Priority: P1 (High)


MODULE 16: Testing & Quality Assurance

Objective: Comprehensive test coverage and benchmarking

  • TODO-16.1: Unit tests for all quantization schemes
  • TODO-16.2: Numerical accuracy tests (vs. PyTorch reference)
  • TODO-16.3: Integration tests with real GGUF models
  • TODO-16.4: Benchmark suite comparing to llama.cpp
  • TODO-16.5: Perplexity benchmarks on standard datasets
  • TODO-16.6: Memory usage benchmarks
  • TODO-16.7: Create CI benchmarks with regression detection
  • TODO-16.8: Add fuzzing for parser and tokenizer
  • TODO-16.9: Create benchmark dashboard
  • TODO-16.10: Add model compatibility tests (run 100+ models)
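
The perplexity benchmark in TODO-16.5 rests on a short formula: perplexity is the exponential of the mean negative log-likelihood of the reference tokens, PPL = exp(-(1/N) Σ ln p(token_i)). A minimal sketch, assuming the per-token natural-log probabilities have already been collected from the model (the function name is hypothetical):

```rust
/// Perplexity from per-token natural-log probabilities.
/// Returns `None` for an empty input, where perplexity is undefined.
pub fn perplexity(token_log_probs: &[f64]) -> Option<f64> {
    if token_log_probs.is_empty() {
        return None;
    }
    let mean_nll = -token_log_probs.iter().sum::<f64>() / token_log_probs.len() as f64;
    Some(mean_nll.exp())
}
```

As a sanity check, a model that assigns probability 0.25 to every reference token has a perplexity of 4, the same as guessing uniformly among four options.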

Estimated Effort: Ongoing (10+ days)
Priority: P1 (High)


MODULE 17: Documentation & Examples

Objective: Excellent developer and user experience

  • TODO-17.1: Write comprehensive README with quick start
  • TODO-17.2: Create API documentation with rustdoc
  • TODO-17.3: Write architecture documentation
  • TODO-17.4: Create quantization guide
  • TODO-17.5: Write performance tuning guide
  • TODO-17.6: Create examples:
    • Basic inference
    • Chat completion
    • Streaming generation
    • Batch processing
    • Custom sampling
    • Embedding extraction
  • TODO-17.7: Add troubleshooting guide
  • TODO-17.8: Create contribution guidelines
  • TODO-17.9: Write blog post announcing release

Estimated Effort: 5-7 days
Priority: P2 (Medium)


4. Technology Stack

Component       | Primary Choice              | Alternatives
----------------|-----------------------------|--------------------
Build System    | Cargo                       | Bazel
Async Runtime   | Tokio                       | async-std
CLI Framework   | clap                        | structopt
Web Server      | axum                        | actix-web, rocket
Serialization   | serde                       | -
CUDA Bindings   | cudarc                      | rustacuda, cust
Metal Bindings  | metal-rs                    | -
Python Bindings | pyo3 + maturin              | -
WASM            | wasm-bindgen                | -
Logging         | tracing                     | log
Error Handling  | thiserror + anyhow          | -
Testing         | built-in + criterion        | -
Quantization    | custom                      | -
BLAS            | intel-mkl-src, openblas-src | gemm

5. Development Phases

Phase 1: Foundation (Weeks 1-3)

  • TODO-1.x: Project setup
  • TODO-2.x: GGUF loader
  • TODO-3.x: Basic quantization
  • TODO-7.1, 7.2: LLaMA architecture

Deliverable: Load and run LLaMA 2/3 models on CPU

Phase 2: Core Inference (Weeks 4-6)

  • TODO-4.x: CPU kernels
  • TODO-8.x: KV cache
  • TODO-9.x: Tokenization
  • TODO-10.x: Sampling
  • TODO-11.x: CLI

Deliverable: Full CLI with chat mode, competitive CPU performance

Phase 3: GPU Acceleration (Weeks 7-9)

  • TODO-5.x: CUDA support
  • TODO-6.x: Metal support
  • TODO-15.x: Performance optimization

Deliverable: GPU inference matching llama.cpp speeds

Phase 4: Production Features (Weeks 10-12)

  • TODO-12.x: HTTP server
  • TODO-13.x: Python bindings
  • TODO-7.3-7.9: More architectures
  • TODO-16.x: Testing

Deliverable: Production-ready with server and Python API

Phase 5: Advanced Features (Weeks 13-16)

  • TODO-14.x: WASM
  • TODO-10.5: Speculative decoding
  • TODO-15.6-15.8: Advanced parallelism
  • TODO-7.10-7.11: Architecture auto-detection, LoRA/QLoRA

Deliverable: Full feature parity with llama.cpp + Rust advantages


6. Success Metrics

Metric           | Target
-----------------|------------------------------------------------
Models Supported | 50+ GGUF architectures
CPU Performance  | Within 15% of llama.cpp
GPU Performance  | Within 20% of llama.cpp
Memory Safety    | Zero memory leaks (verified with Valgrind/Miri)
Test Coverage    | >80% line coverage
Binary Size      | <50MB for CLI (release)
Startup Time     | <2s for 7B model
Token Throughput | Match or exceed llama.cpp per watt

7. Risks & Mitigations

Risk                         | Probability | Impact | Mitigation
-----------------------------|-------------|--------|-----------------------------------------------
CUDA kernel performance gap  | Medium      | High   | Use cuBLAS where possible, profile extensively
Quantization accuracy loss   | Low         | High   | Validate against reference, use IMatrix
Memory overhead vs C++       | Medium      | Medium | Zero-copy design, careful allocation
Build complexity (CUDA deps) | High        | Medium | Feature flags, optional GPU backends
Compilation time             | High        | Low    | Workspace organization, sccache

8. Open Questions

  1. Should we use the candle or burn crates for tensor operations, or implement our own?
  2. How to handle CUDA build in CI? (GitHub Actions has limited GPU runners)
  3. Should we support loading the legacy GGML format?
  4. What's the minimum Rust version to support?
  5. How to handle model downloads and Hugging Face integration?


Next Steps:

  1. Create GitHub repository and initialize workspace
  2. Start with TODO-1.1 (workspace setup)
  3. Implement TODO-2.1 (GGUF parser) as first milestone
  4. Set up benchmark harness to compare against llama.cpp baseline

Estimated Total Effort: 4-5 months for MVP, 6-8 months for full feature parity
Team Size: 2-3 developers recommended