Compare GPU kernels across CUDA, Metal, and CPU — one API, real numbers.
Unikernels is a lightweight cross-backend benchmarking toolkit for GPU developers, compiler engineers, and AI researchers. It lets you write a kernel once, run it on multiple backends, and measure how they really perform — with consistent APIs, timing, and disassembly.
> **Warning:** This project is an early development preview. APIs and features may change without notice.
GPU compute is fragmented. CUDA, Metal, HIP, SYCL, oneAPI, Vulkan… every vendor has its own dialect.
Unikernels doesn’t try to replace them — it exposes them. You can write, benchmark, and compare kernels across devices with minimal friction.
Think:
- tinygrad’s simplicity × Kokkos’ backend reach × Triton’s introspection tools.
| Feature | Status |
|---|---|
| ✅ Unified C++ API for kernels (CUDA, Metal, CPU) | done |
| 🧪 CLI benchmark runner | in progress |
| 📈 Cross-backend perf visualization | planned |
| 🧠 Python and Rust bindings | planned |
| 🔬 GEMM, conv2d, reduction, attention microbenchmarks | planned |
| 🔍 Kernel disassembly viewer (PTX / Metal IR) | planned |
| 🧰 Reproducibility metadata (compiler, driver, device) | planned |
```sh
git clone https://github.com/raishish/unikernels
cd unikernels
cmake -B build
cmake --build build -j
```

```sh
./build/unikernels bench matmul --size 1024 --backend metal
./build/unikernels bench matmul --size 1024 --backend cuda
python3 scripts/plot_benchmarks.py results.json
```

| Kernel | Backend | Size | Time (ms) | TFLOPS |
|---|---|---|---|---|
| matmul | CUDA (RTX 4090) | 1024 | 0.42 | 5.1 |
| matmul | Metal (M3 Max) | 1024 | 0.75 | 2.8 |
| matmul | CPU (i9) | 1024 | 12.5 | 0.2 |
| Backend | Supported | Notes |
|---|---|---|
| Metal | ✅ | Metal 4 kernels (support for Metal 3.x coming soon) |
| CUDA | ✅ | |
| CPU | 🔜 | |
```
src/
├─ backends/
│  ├─ cuda/
│  ├─ metal/
│  ├─ cpu/
├─ core/
│  ├─ context.cpp
│  ├─ tensor.cpp
├─ benchmarks/
│  ├─ matmul.cpp
│  ├─ conv2d.cpp
└─ cli/
   ├─ main.cpp
```
Each backend implements a small, consistent interface for launching kernels and collecting timings. The CLI and Python bindings wrap these interfaces for easy experimentation.
- Metal, CUDA, CPU backends
- Vector add + matmul examples
- CLI benchmarking tool
- JSON/CSV output
- conv2d, reduction, attention kernels
- perf charts + Python bindings
- reproducibility metadata
- Disassembly viewer
- Auto-report generator for perf comparisons
Pull requests are welcome — especially new kernels or backends. See CONTRIBUTING.md for setup and testing guidelines.
MIT — do whatever you want, just credit the project.
Because “write once, run anywhere” has always been a myth — and it’s time someone measured how mythical it actually is.
If you use UniKernels in your research, education, or production systems, please cite:
```bibtex
@software{unikernels2025,
  title={UniKernels: A Cross-Platform C++ GPU Computing Library for Deep Learning and HPC},
  author={Rai, Ashish},
  url={https://github.com/raishish/unikernels},
  version={1.0},
  year={2025},
  note={C++ library with Python and Rust bindings for CUDA, ROCm, and Metal GPU programming}
}
```