PRD.md

Product Requirements Document (PRD)

Project: oxidize — A Rust-Based LLM Inference Engine

Date: April 30, 2026
Status: Draft v0.1
Target Language: Rust
Inspiration: llama.cpp by Georgi Gerganov

1. Executive Summary

Build a high-performance, dependency-light LLM inference engine in Rust that runs large language models on commodity hardware (CPUs, GPUs, Apple Silicon) using quantization and modern system programming techniques.

Key Differentiators:

Zero-cost abstractions via traits, generics, and iterators
Memory safety without GC overhead, enforced by Rust's ownership model
First-class async/concurrency support
Native WebAssembly support for browser inference
Modern crate ecosystem (burn, candle, etc.)

2. Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    APPLICATION LAYER                        │
│  CLI │ HTTP Server │ Python Bindings │ WASM │ FFI          │
├─────────────────────────────────────────────────────────────┤
│                    API LAYER                                │
│  Session Management │ Sampling │ Tokenization │ Scheduling  │
├─────────────────────────────────────────────────────────────┤
│                    COMPUTE LAYER                            │
│  CPU Kernels (AVX2/AVX512/NEON) │ GPU (CUDA/HIP/Vulkan)    │
│  Quantization │ Dequantization │ Matrix Multiplication     │
├─────────────────────────────────────────────────────────────┤
│                    MODEL LAYER                              │
│  GGUF Loader │ Model Graph │ Weight Storage │ KV Cache      │
├─────────────────────────────────────────────────────────────┤
│                    HARDWARE ABSTRACTION                     │
│  CPU │ NVIDIA (CUDA) │ AMD (HIP) │ Apple (Metal) │ WASM     │
└─────────────────────────────────────────────────────────────┘

3. Core Modules & TODOs

MODULE 1: Project Foundation & Build System

Objective: Establish Rust project structure with cross-platform compilation support

TODO-1.1: Initialize Cargo workspace with workspace-level dependencies

[workspace]
members = ["oxidize-core", "oxidize-cli", "oxidize-server", "oxidize-quantize"]
resolver = "3"

TODO-1.2: Set up CI/CD (GitHub Actions) for Linux, macOS, Windows builds
TODO-1.3: Configure cross-compilation for ARM64, WASM32 targets
TODO-1.4: Set up benchmark harness with criterion.rs
TODO-1.5: Create Docker images for deployment
TODO-1.6: Add justfile/Makefile for common tasks
TODO-1.7: Set up cargo deny for license/security auditing
TODO-1.8: Configure release profile with LTO and panic=abort

Estimated Effort: 2-3 days
Priority: P0 (Blocking)

MODULE 2: GGUF Format & Model Loader

Objective: Parse and load GGUF files, the model file format used by llama.cpp and ggml

TODO-2.1: Implement GGUF file format parser
- Magic number validation (GGUF)
- Version handling (v2, v3)
- Tensor info metadata parsing
- Alignment and padding handling
TODO-2.2: Create Tensor struct with shape, strides, dtype
TODO-2.3: Implement memory-mapped file loading (memmap2 crate)
TODO-2.4: Support tensor name mapping for different architectures
TODO-2.5: Add quantization type detection for tensor types (Q4_0, Q4_K, Q5_K, Q8_0, etc.) and file-level mixes (Q4_K_M, Q5_K_M, etc.)
TODO-2.6: Implement ModelLoader trait for extensibility
TODO-2.7: Add progress callbacks for large model loading
TODO-2.8: Create comprehensive unit tests with fixture files
TODO-2.9: Benchmark loader against llama.cpp baseline
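As a starting point for TODO-2.1, the fixed-size header check can be sketched as below. This is a minimal sketch, not the project's parser: it assumes a little-endian GGUF v2/v3 file whose header is the 4-byte magic, a `u32` version, a `u64` tensor count, and a `u64` metadata key/value count. The names `GgufHeader`, `HeaderError`, `parse_header`, and `align_offset` are hypothetical, and the alignment value passed to `align_offset` is expected to come from the file's metadata.

```rust
use std::convert::TryInto;

/// Fixed-size prefix of a GGUF v2/v3 file (little-endian assumed).
#[derive(Debug, PartialEq)]
pub struct GgufHeader {
    pub version: u32,
    pub tensor_count: u64,
    pub metadata_kv_count: u64,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum HeaderError {
    TooShort,
    BadMagic([u8; 4]),
    UnsupportedVersion(u32),
}

/// Parses the 24-byte header: magic, version, tensor count, metadata KV count.
pub fn parse_header(bytes: &[u8]) -> Result<GgufHeader, HeaderError> {
    if bytes.len() < 24 {
        return Err(HeaderError::TooShort);
    }
    let magic: [u8; 4] = bytes[0..4].try_into().unwrap();
    if &magic != b"GGUF" {
        return Err(HeaderError::BadMagic(magic));
    }
    let version = u32::from_le_bytes(bytes[4..8].try_into().unwrap());
    if version != 2 && version != 3 {
        return Err(HeaderError::UnsupportedVersion(version));
    }
    let tensor_count = u64::from_le_bytes(bytes[8..16].try_into().unwrap());
    let metadata_kv_count = u64::from_le_bytes(bytes[16..24].try_into().unwrap());
    Ok(GgufHeader { version, tensor_count, metadata_kv_count })
}

/// Rounds `offset` up to the next multiple of `alignment` (padding handling).
pub fn align_offset(offset: u64, alignment: u64) -> u64 {
    offset + (alignment - offset % alignment) % alignment
}
```

The metadata key/value section and tensor info entries follow the header and are not covered here; a real loader would read them from a memory-mapped slice before computing tensor data offsets.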

Key Crates: memmap2, bytemuck, half
Estimated Effort: 5-7 days
Priority: P0 (Blocking)

MODULE 3: Quantization Engine

Objective: Implement quantization/dequantization schemes matching llama.cpp

TODO-3.1: Implement scalar dequantization kernels:
- Q4_0, Q4_1 (4-bit without/with a per-block offset)
- Q5_0, Q5_1 (5-bit variants)
- Q8_0 (8-bit)
- Q2_K, Q3_K, Q4_K, Q5_K, Q6_K (K-quants)
TODO-3.2: Implement dequantization to f16 and f32
TODO-3.3: Add quantization from f16/f32 to all supported formats
TODO-3.4: Implement block-wise quantization with per-block scales
TODO-3.5: Add importance matrix support (IMatrix) for better quality
TODO-3.6: Create quantization CLI tool (oxidize-quantize)
TODO-3.7: Add mixed quantization support (different types per layer)
TODO-3.8: Validate output against llama.cpp reference implementation
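To make TODO-3.1 concrete, here is a minimal scalar sketch of Q4_0 dequantization. It assumes the block layout used by the ggml reference code as I understand it: 32 weights per block, stored as a little-endian f16 scale followed by 16 bytes of packed nibbles, where the low nibble of byte `j` is element `j` and the high nibble is element `j + 16`. The function names are hypothetical, the f16 conversion is hand-rolled only to keep the example dependency-free (the `half` crate would normally do this), and the layout should be confirmed against llama.cpp as TODO-3.8 requires.

```rust
/// Elements per Q4_0 block.
pub const QK4_0: usize = 32;
/// Bytes per Q4_0 block: 2-byte f16 scale + 16 bytes holding 32 nibbles.
pub const Q4_0_BLOCK_BYTES: usize = 2 + QK4_0 / 2;

/// Converts IEEE 754 half-precision bits to f32.
pub fn f16_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) & 1) as u32;
    let exp = ((bits >> 10) & 0x1F) as u32;
    let frac = (bits & 0x3FF) as u32;
    if exp == 0 {
        // Zero or subnormal: value = frac * 2^-24.
        let v = frac as f32 / 16_777_216.0;
        return if sign == 1 { -v } else { v };
    }
    let out = if exp == 0x1F {
        // Infinity or NaN.
        (sign << 31) | 0x7F80_0000 | (frac << 13)
    } else {
        // Normal number: rebias the exponent from 15 to 127.
        (sign << 31) | ((exp + 112) << 23) | (frac << 13)
    };
    f32::from_bits(out)
}

/// Dequantizes packed Q4_0 blocks in `data` into `out` (32 floats per block).
pub fn dequantize_q4_0(data: &[u8], out: &mut [f32]) -> Result<(), String> {
    if data.len() % Q4_0_BLOCK_BYTES != 0 {
        return Err("input is not a whole number of Q4_0 blocks".to_string());
    }
    if out.len() != data.len() / Q4_0_BLOCK_BYTES * QK4_0 {
        return Err("output length does not match block count".to_string());
    }
    for (block, dst) in data
        .chunks_exact(Q4_0_BLOCK_BYTES)
        .zip(out.chunks_exact_mut(QK4_0))
    {
        let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
        let qs = &block[2..];
        for j in 0..QK4_0 / 2 {
            // Nibbles are stored unsigned with an implicit offset of 8.
            let lo = (qs[j] & 0x0F) as i32 - 8;
            let hi = (qs[j] >> 4) as i32 - 8;
            dst[j] = lo as f32 * d;
            dst[j + QK4_0 / 2] = hi as f32 * d;
        }
    }
    Ok(())
}
```

A scalar version like this is mainly useful as a correctness reference for the SIMD kernels in Module 4.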

Key Traits:

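use anyhow::Result; // assumed: anyhow, as listed under Error Handling in the Technology Stack

/// A block-wise quantization scheme; one implementation per format (e.g. Q4_0).
pub trait Quantization {
    /// Quantizes `data` into packed blocks written to `output`.
    fn quantize(&self, data: &[f32], output: &mut [u8]) -> Result<()>;
    /// Expands the packed blocks in `data` back to f32 values in `output`.
    fn dequantize(&self, data: &[u8], output: &mut [f32]) -> Result<()>;
    /// Number of elements per block.
    fn block_size(&self) -> usize;
    /// Size of one packed block in bytes (assumed meaning, following ggml's type size).
    fn type_size(&self) -> usize;
}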
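The following sketch shows how one format might implement this trait with per-block scales (TODO-3.4). `SimpleQ8` is a hypothetical 8-bit format invented for illustration: it stores the scale as an f32 rather than the f16 that ggml's Q8_0 uses, so it is not GGUF-compatible. The local `Result` alias stands in for the project's real error type, and `type_size` is assumed to mean bytes per packed block.

```rust
// Stand-in error type so the sketch has no external dependencies.
type Result<T> = std::result::Result<T, String>;

pub trait Quantization {
    fn quantize(&self, data: &[f32], output: &mut [u8]) -> Result<()>;
    fn dequantize(&self, data: &[u8], output: &mut [f32]) -> Result<()>;
    fn block_size(&self) -> usize;
    fn type_size(&self) -> usize;
}

/// Hypothetical format: 32 values per block, one f32 scale + 32 signed bytes.
pub struct SimpleQ8;

impl Quantization for SimpleQ8 {
    fn block_size(&self) -> usize {
        32
    }

    fn type_size(&self) -> usize {
        4 + 32
    }

    fn quantize(&self, data: &[f32], output: &mut [u8]) -> Result<()> {
        let (bs, ts) = (self.block_size(), self.type_size());
        if data.len() % bs != 0 || output.len() != data.len() / bs * ts {
            return Err("length mismatch".to_string());
        }
        for (src, dst) in data.chunks_exact(bs).zip(output.chunks_exact_mut(ts)) {
            // Per-block scale maps the largest magnitude onto 127.
            let amax = src.iter().fold(0.0f32, |m, x| m.max(x.abs()));
            let scale = amax / 127.0;
            let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
            dst[..4].copy_from_slice(&scale.to_le_bytes());
            for (q, x) in dst[4..].iter_mut().zip(src) {
                *q = (x * inv).round() as i8 as u8;
            }
        }
        Ok(())
    }

    fn dequantize(&self, data: &[u8], output: &mut [f32]) -> Result<()> {
        let (bs, ts) = (self.block_size(), self.type_size());
        if data.len() % ts != 0 || output.len() != data.len() / ts * bs {
            return Err("length mismatch".to_string());
        }
        for (src, dst) in data.chunks_exact(ts).zip(output.chunks_exact_mut(bs)) {
            let scale = f32::from_le_bytes([src[0], src[1], src[2], src[3]]);
            for (x, q) in dst.iter_mut().zip(&src[4..]) {
                // Reinterpret the stored byte as a signed quant.
                *x = (*q as i8) as f32 * scale;
            }
        }
        Ok(())
    }
}
```

With this scheme the round-trip error per element is at most half the block's scale, which gives a simple bound to assert in unit tests.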

Estimated Effort: 10-14 days
Priority: P0 (Blocking)

MODULE 4: Compute Kernels — CPU

Objective: High-performance CPU inference kernels with SIMD optimization

Performance Target: Within 10% of llama.cpp CPU performance
Estimated Effort: 15-20 days
Priority: P0 (Blocking)

MODULE 5: Compute Kernels — GPU (CUDA)

Objective: CUDA kernels for NVIDIA GPU acceleration

Key Crates: cust, cudarc
Estimated Effort: 20-25 days
Priority: P1 (High)

MODULE 6: Compute Kernels — Apple Metal

Objective: Metal Performance Shaders for Apple Silicon

TODO-6.1: Set up Metal build with metal-rs
TODO-6.2: Implement buffer management for unified memory
TODO-6.3: Port compute kernels to Metal Shading Language
TODO-6.4: Optimize for Apple Silicon unified memory architecture
TODO-6.5: Add Metal Performance Shaders integration where beneficial

Estimated Effort: 10-12 days
Priority: P1 (High)

MODULE 7: Model Architectures

Objective: Support multiple transformer architectures

Estimated Effort: 15-20 days
Priority: P0 (Blocking)

MODULE 8: KV Cache Management

Objective: Efficient key-value cache for attention

TODO-8.1: Implement KV cache storage with configurable dtype
TODO-8.2: Add sliding window attention cache management
TODO-8.3: Implement cache eviction strategies
TODO-8.4: Add cache quantization (8-bit, 4-bit KV cache)
TODO-8.5: Support continuous batching with cache management
TODO-8.6: Add cache persistence across sessions
TODO-8.7: Optimize memory layout for cache access patterns
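One possible shape for TODO-8.1 and TODO-8.2 is a ring buffer that keeps only the most recent positions. The sketch below is illustrative only: `SlidingKvCache` and its methods are hypothetical names, it covers a single layer with f32 storage, and it treats each position's keys and values as one flat slice of `dim` elements (for example, KV heads times head dimension).

```rust
/// Hypothetical single-layer KV cache that retains the last `window` positions.
pub struct SlidingKvCache {
    window: usize,
    dim: usize,       // elements stored per position
    keys: Vec<f32>,   // window * dim elements, used as a ring buffer
    values: Vec<f32>, // same layout as `keys`
    next_pos: usize,  // absolute position of the next token to append
}

impl SlidingKvCache {
    pub fn new(window: usize, dim: usize) -> Self {
        assert!(window > 0 && dim > 0);
        Self {
            window,
            dim,
            keys: vec![0.0; window * dim],
            values: vec![0.0; window * dim],
            next_pos: 0,
        }
    }

    /// Stores the key/value vectors for the next position, overwriting the
    /// oldest entry once the window is full.
    pub fn append(&mut self, k: &[f32], v: &[f32]) {
        assert_eq!(k.len(), self.dim);
        assert_eq!(v.len(), self.dim);
        let slot = self.next_pos % self.window;
        let range = slot * self.dim..(slot + 1) * self.dim;
        self.keys[range.clone()].copy_from_slice(k);
        self.values[range].copy_from_slice(v);
        self.next_pos += 1;
    }

    /// Number of positions currently held.
    pub fn len(&self) -> usize {
        self.next_pos.min(self.window)
    }

    /// Iterates over (absolute position, key, value), oldest first.
    pub fn iter(&self) -> impl Iterator<Item = (usize, &[f32], &[f32])> + '_ {
        let start = self.next_pos - self.len();
        (start..self.next_pos).map(move |pos| {
            let slot = pos % self.window;
            let range = slot * self.dim..(slot + 1) * self.dim;
            (pos, &self.keys[range.clone()], &self.values[range])
        })
    }
}
```

Eviction here is implicit (oldest position first). Other strategies from TODO-8.3, quantized storage, and batching would need a different layout.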

Estimated Effort: 5-7 days
Priority: P1 (High)

MODULE 9: Tokenization

Objective: Text-to-tokens and tokens-to-text conversion

TODO-9.1: Implement Byte-Pair Encoding (BPE)
TODO-9.2: Implement SentencePiece (Unigram)
TODO-9.3: Implement WordPiece
TODO-9.4: Add tiktoken-style BPE (as used by OpenAI GPT-4 models)
TODO-9.5: Create tokenizer loader from GGUF metadata
TODO-9.6: Add special token handling (BOS, EOS, PAD, etc.)
TODO-9.7: Implement streaming detokenization
TODO-9.8: Add chat template processing
TODO-9.9: Support token healing (merge incomplete tokens)
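Streaming detokenization (TODO-9.7) has one subtle point: a single token's bytes can end in the middle of a multi-byte UTF-8 character, so the decoder should hold back an incomplete tail until the next token arrives. A minimal sketch, assuming tokens have already been mapped to raw bytes and using the hypothetical name `StreamingDecoder`:

```rust
/// Accumulates token bytes and emits only complete UTF-8 text.
#[derive(Default)]
pub struct StreamingDecoder {
    pending: Vec<u8>,
}

impl StreamingDecoder {
    /// Appends one token's bytes and returns whatever text is now complete.
    pub fn push(&mut self, token_bytes: &[u8]) -> String {
        self.pending.extend_from_slice(token_bytes);
        match std::str::from_utf8(&self.pending) {
            Ok(s) => {
                let out = s.to_owned();
                self.pending.clear();
                out
            }
            Err(e) => {
                let valid = e.valid_up_to();
                match e.error_len() {
                    // The buffer ends mid-character: emit the valid prefix
                    // and keep the incomplete tail for the next call.
                    None => {
                        let out =
                            String::from_utf8_lossy(&self.pending[..valid]).into_owned();
                        self.pending.drain(..valid);
                        out
                    }
                    // Invalid bytes: lossy-decode everything buffered so the
                    // stream cannot stall (bad bytes become U+FFFD).
                    Some(_) => {
                        let out = String::from_utf8_lossy(&self.pending).into_owned();
                        self.pending.clear();
                        out
                    }
                }
            }
        }
    }
}
```

For example, pushing the bytes `0xC3` and then `0xA9` yields an empty string followed by "é".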

Key Crates: tokenizers (Hugging Face), tiktoken-rs
Estimated Effort: 5-7 days
Priority: P0 (Blocking)

MODULE 10: Sampling & Generation

Objective: Text generation with various sampling strategies

TODO-10.1: Implement basic sampling:
- Greedy
- Temperature scaling
- Top-k
- Top-p (nucleus)
- Min-p
TODO-10.2: Implement advanced sampling:
- Mirostat
- Locally typical sampling
- Tail-free sampling
TODO-10.3: Implement repetition penalties:
- Frequency penalty
- Presence penalty
- Penalize newlines
TODO-10.4: Add grammar-based constrained generation
TODO-10.5: Implement speculative decoding
TODO-10.6: Add beam search
TODO-10.7: Implement streaming generation (async iterator)
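The basic samplers in TODO-10.1 compose naturally as a filter chain over the logits. The sketch below applies temperature, then top-k, then top-p, and finally draws a token; that ordering is an assumption rather than a requirement of this document, and `filter_candidates` and `sample` are hypothetical names. The caller supplies the uniform random number `u` in [0, 1), since the standard library has no random number generator.

```rust
/// Applies temperature, top-k, and top-p, returning (token id, probability)
/// pairs sorted by descending probability and renormalized to sum to 1.
pub fn filter_candidates(
    logits: &[f32],
    temperature: f32,
    top_k: usize,
    top_p: f32,
) -> Vec<(usize, f32)> {
    assert!(!logits.is_empty() && temperature > 0.0);
    // Softmax over temperature-scaled logits; subtracting the max keeps exp() stable.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut cands: Vec<(usize, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| (i, ((l - max) / temperature).exp()))
        .collect();
    let sum: f32 = cands.iter().map(|c| c.1).sum();
    for c in &mut cands {
        c.1 /= sum;
    }
    cands.sort_by(|a, b| b.1.total_cmp(&a.1));

    // Top-k: keep the k most likely tokens (at least one).
    cands.truncate(top_k.max(1));

    // Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    let mut cumulative = 0.0;
    let mut keep = cands.len();
    for (n, c) in cands.iter().enumerate() {
        cumulative += c.1;
        if cumulative >= top_p {
            keep = n + 1;
            break;
        }
    }
    cands.truncate(keep);

    // Renormalize the survivors.
    let sum: f32 = cands.iter().map(|c| c.1).sum();
    for c in &mut cands {
        c.1 /= sum;
    }
    cands
}

/// Draws a token id from `cands` using a uniform random number `u` in [0, 1).
pub fn sample(cands: &[(usize, f32)], u: f32) -> usize {
    let mut cumulative = 0.0;
    for &(id, p) in cands {
        cumulative += p;
        if u < cumulative {
            return id;
        }
    }
    // Floating-point rounding can leave the sum slightly below 1.
    cands.last().expect("no candidates").0
}
```

Greedy decoding falls out as the special case `top_k = 1`. Min-p and the repetition penalties would slot in as additional filters before the final draw.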

Estimated Effort: 7-10 days
Priority: P1 (High)

MODULE 11: CLI Application

Objective: Command-line interface for inference

Estimated Effort: 5-7 days
Priority: P1 (High)

MODULE 12: HTTP Server & API

Objective: OpenAI-compatible HTTP API server

Estimated Effort: 7-10 days
Priority: P1 (High)

MODULE 13: Python Bindings

Objective: Python interface via PyO3

TODO-13.1: Set up pyo3 workspace
TODO-13.2: Create oxidize Python package
TODO-13.3: Implement Llama class with methods:
- __init__
- generate
- create_chat_completion
- embed
TODO-13.4: Add async support with asyncio
TODO-13.5: Support numpy and torch tensor interop
TODO-13.6: Create pip installable wheels (maturin)
TODO-13.7: Add Python type stubs
TODO-13.8: Match llama-cpp-python API for compatibility

Estimated Effort: 7-10 days
Priority: P2 (Medium)

MODULE 14: WebAssembly Support

Objective: Browser-based inference

TODO-14.1: Set up wasm-bindgen build
TODO-14.2: Implement WebGPU compute backend
TODO-14.3: Add Web Worker support for background inference
TODO-14.4: Create JavaScript/TypeScript bindings
TODO-14.5: Implement streaming generation in browser
TODO-14.6: Add model download/cache management in browser
TODO-14.7: Create demo web application

Estimated Effort: 10-14 days
Priority: P2 (Medium)

MODULE 15: Performance Optimization

Objective: Achieve llama.cpp-level performance

Performance Targets:

CPU: Within 15% of llama.cpp
GPU: Within 20% of llama.cpp
Memory usage: Comparable to llama.cpp

Estimated Effort: Ongoing (20+ days)
Priority: P1 (High)

MODULE 16: Testing & Quality Assurance

Objective: Comprehensive test coverage and benchmarking

Estimated Effort: Ongoing (10+ days)
Priority: P1 (High)

MODULE 17: Documentation & Examples

Objective: Excellent developer and user experience

TODO-17.1: Write comprehensive README with quick start
TODO-17.2: Create API documentation with rustdoc
TODO-17.3: Write architecture documentation
TODO-17.4: Create quantization guide
TODO-17.5: Write performance tuning guide
TODO-17.6: Create examples:
- Basic inference
- Chat completion
- Streaming generation
- Batch processing
- Custom sampling
- Embedding extraction
TODO-17.7: Add troubleshooting guide
TODO-17.8: Create contribution guidelines
TODO-17.9: Write blog post announcing release

Estimated Effort: 5-7 days
Priority: P2 (Medium)

4. Technology Stack

Component	Primary Choice	Alternatives
Build System	Cargo	Bazel
Async Runtime	Tokio	async-std
CLI Framework	clap	structopt
Web Server	axum	actix-web, rocket
Serialization	serde	-
CUDA Bindings	cudarc	rustacuda, cust
Metal Bindings	metal-rs	-
Python Bindings	pyo3 + maturin	-
WASM	wasm-bindgen	-
Logging	tracing	log
Error Handling	thiserror + anyhow	-
Testing	built-in + criterion	-
Quantization	custom	-
BLAS	intel-mkl-src, openblas-src	gemm

5. Development Phases

Phase 1: Foundation (Weeks 1-3)

TODO-1.x: Project setup
TODO-2.x: GGUF loader
TODO-3.x: Basic quantization
TODO-7.1, 7.2: LLaMA architecture

Deliverable: Load and run LLaMA 2/3 models on CPU

Phase 2: Core Inference (Weeks 4-6)

TODO-4.x: CPU kernels
TODO-8.x: KV cache
TODO-9.x: Tokenization
TODO-10.x: Sampling
TODO-11.x: CLI

Deliverable: Full CLI with chat mode, competitive CPU performance

Phase 3: GPU Acceleration (Weeks 7-9)

TODO-5.x: CUDA support
TODO-6.x: Metal support
TODO-15.x: Performance optimization

Deliverable: GPU inference within 20% of llama.cpp throughput

Phase 4: Production Features (Weeks 10-12)

TODO-12.x: HTTP server
TODO-13.x: Python bindings
TODO-7.3-7.9: More architectures
TODO-16.x: Testing

Deliverable: Production-ready with server and Python API

Phase 5: Advanced Features (Weeks 13-16)

TODO-14.x: WASM
TODO-10.5: Speculative decoding
TODO-15.6-15.8: Advanced parallelism
TODO-7.10-7.11: LoRA, more models

Deliverable: Full feature parity with llama.cpp + Rust advantages

6. Success Metrics

Metric	Target
Models Supported	50+ GGUF architectures
CPU Performance	Within 15% of llama.cpp
GPU Performance	Within 20% of llama.cpp
Memory Safety	Zero memory leaks (verified by Valgrind/Miri)
Test Coverage	>80% line coverage
Binary Size	<50MB for CLI (release)
Startup Time	<2s for 7B model
Token Throughput	Match or exceed llama.cpp per watt

7. Risks & Mitigations

Risk	Probability	Impact	Mitigation
CUDA kernel performance gap	Medium	High	Use cuBLAS where possible, profile extensively
Quantization accuracy loss	Low	High	Validate against reference, use IMatrix
Memory overhead vs C++	Medium	Medium	Zero-copy design, careful allocation
Build complexity (CUDA deps)	High	Medium	Feature flags, optional GPU backends
Compilation time	High	Low	Workspace organization, sccache

8. Open Questions

Should we use candle or burn crates for tensor operations, or implement custom?
How to handle CUDA build in CI? (GitHub Actions has limited GPU runners)
Should we support GGML format legacy loading?
What's the minimum Rust version to support?
How to handle model downloads and Hugging Face integration?

9. References

Next Steps:

Create GitHub repository and initialize workspace
Start with TODO-1.1 (workspace setup)
Implement TODO-2.1 (GGUF parser) as first milestone
Set up benchmark harness to compare against llama.cpp baseline

Estimated Total Effort: 4-5 months for MVP, 6-8 months for full feature parity
Team Size: 2-3 developers recommended
