Namespace: vernier::bench
Platform: Linux (full), macOS (core harness)
C++ Standard: C++20 (C++23 used when available)
Performance benchmarking framework with profiler integrations, GPU support, and statistical analysis.
- Quick Start
- Key Features
- Common Workflows
- CLI Tools and Backends
- API Reference
- Requirements
- Platform Support
- Testing
- Project Structure
- License
- See Also
#include "Perf.hpp"
PERF_TEST(MyLib, Throughput) {
UB_PERF_GUARD(perf);
perf.warmup([&]{ work(); });
auto result = perf.throughputLoop([&]{ work(); }, "label");
EXPECT_GT(result.callsPerSecond, 10000.0);
}
PERF_MAIN()make compose-debug
make compose-testp
docker compose run --rm -T dev-cuda bash -c '
./build/native-linux-debug/bin/ptests/BenchmarkCPU_PTEST --csv results.csv
'cmake --preset native-linux-debug
cmake --build --preset native-linux-debug
./build/native-linux-debug/bin/ptests/BenchmarkCPU_PTEST --csv results.csv- GoogleTest integration with CSV export and end-of-run summary tables
- 14 profiler backends covering CPU, heap, off-CPU, energy, thread-safety, and both NVIDIA + AMD GPU stacks (see Section 4 for the list)
- Per-backend environment doctor (
--profile-check) with actionable hints - SIGALRM per-test watchdog so hung profiler runs fail loudly, not silently
- CUDA GPU benchmarking with multi-GPU and Unified Memory support, plus in-process CUPTI kernel metrics (register / smem / launch counts) without spawning ncu
- NVTX timeline annotation API auto-injected into Nsight Systems runs
- Companion
vernier::monitorlibrary for lightweight runtime instrumentation in production runs (lock-free queue, env-var-driven enablement, console + file sinks) - Statistical analysis: median, percentiles, CV%, adaptive stability detection
- Memory bandwidth analysis with efficiency calculations
- Multi-threaded contention benchmarking with synchronized start gates
- Semantic test macros (PERF_THROUGHPUT, PERF_LATENCY, PERF_MEMORY, etc.)
- CLI tools for analysis, comparison, regression detection, visualization,
doctor / profile-all / profile-summarize orchestration, project-level
defaults via
.bench.yaml
# 1. Baseline measurement
./bin/ptests/MyComponent_PTEST --repeats 30 --csv baseline.csv
# 2. Profile to find hotspots
./bin/ptests/MyComponent_PTEST --profile perf
# 3. Make changes, rebuild, measure again
./bin/ptests/MyComponent_PTEST --repeats 30 --csv optimized.csv
# 4. Statistical comparison
bench compare baseline.csv optimized.csv --threshold 5./bin/ptests/BenchmarkCPU_PTEST --quick --gtest_filter="*Throughput*"make compose-release
make installConsumers use find_package(vernier):
find_package(vernier REQUIRED)
target_link_libraries(my_benchmark PRIVATE vernier::bench)The install tree contains headers, shared libraries, CMake config, and documentation
under build/native-linux-release/install/.
CLI tools build with make tools-rust and make tools-py; source .env
from the build directory to put them on PATH.
| Tool | Language | Purpose |
|---|---|---|
bench |
Rust | Analysis, comparison, validation, run, doctor, profile-all, profile-summarize, init, config-validate, gpu-env, gpu-lock, gpu-monitor, gpu-topo, flamegraph |
bench-plot |
Python | Visualization (plots, dashboards, charts) |
nsight-parse |
Python | Turn .nsys-rep / .ncu-rep reports into a tidy CSV |
--profile X dispatches to whichever backend self-registered under that
name; bench doctor lists them all with their environment readiness.
| Backend | Layer | Wraps |
|---|---|---|
perf |
CPU | Linux perf_events (stat / record / mem / c2c) |
gperf |
CPU | gperftools |
callgrind |
CPU | valgrind callgrind |
bpftrace |
CPU | bpftrace scripts |
rapl |
CPU | Intel RAPL MSRs |
massif |
CPU | valgrind massif (heap timeline, ~20x) |
memcheck |
CPU | valgrind memcheck (errors / leaks) |
helgrind |
CPU | valgrind helgrind / DRD (data races, locks) |
offcpu |
CPU | bpftrace finish_task_switch (off-CPU stacks) |
heaptrack |
CPU | heaptrack (low-overhead heap, ~1.5x) |
jemalloc |
CPU | jemalloc prof sampling (~5-10%, LD_PRELOAD) |
nsight |
GPU | NVIDIA Nsight Systems / Compute |
compute-sanitizer |
GPU | NVIDIA Compute Sanitizer (GPU memcheck) |
rocprof |
GPU | AMD ROCm rocprof |
CUPTI kernel metrics populate the GPU CSV section automatically on every
GPU benchmark; NVTX annotations are available via BENCH_NVTX_SCOPE.
bench summary results.csv
bench compare baseline.csv candidate.csv --fail-on-regression
bench doctor ./build/native-linux-debug/bin/ptests/MyComponent_PTEST
bench profile-all MyComponent --quick
bench-plot plot results.csv --output charts/See tools/README.md for full CLI documentation.
| Document | Purpose |
|---|---|
| CPU Guide | CPU benchmarking patterns and profiler usage |
| GPU Guide | GPU/CUDA benchmarking patterns |
| API Reference | Complete API documentation |
| Advanced Guide | Memory profiling, parameterized tests |
| CI/CD Integration | Automated regression detection |
| Docker Setup | Container build and profiling setup |
| Troubleshooting | Common issues and solutions |
| Demo Walkthroughs | 22 step-by-step walkthroughs (16 CPU + 4 GPU demos, plus rocprof and CUPTI) |
| Monitor Guide | Runtime instrumentation library |
Required:
- C++20 compiler or newer (Clang 12+ / GCC 10+); C++23 is used automatically when the toolchain supports it (Clang 21 / GCC 13+)
- CMake 3.24+
- GoogleTest (auto-fetched via CMake FetchContent)
- POSIX system (Linux or macOS)
Optional:
- CUDA toolkit 12+ (GPU benchmarking, NVTX, CUPTI, Compute Sanitizer)
- gperftools (
gperfbackend) - valgrind (
callgrind/massif/memcheckbackends) - bpftrace (
bpftraceandoffcpubackends; needs root + tracefs) - heaptrack (
heaptrackbackend -- low-overhead heap profiler) - jemalloc with
profenabled (jemallocbackend; LD_PRELOAD) - ROCm + rocprof (AMD GPU profiling via the
rocprofbackend) - Rust toolchain (for
benchCLI tool) - Python 3.10+ with Poetry (for
bench-plotandnsight-parseCLI tools)
| Platform | Library | Profilers | CUDA | Pre-built Artifact |
|---|---|---|---|---|
| x86_64 Linux | Full | All 14 | Yes | vernier-*-x86_64-linux[-cuda] |
| Jetson (aarch64) | Full | All except RAPL | Yes | vernier-*-aarch64-jetson |
| Raspberry Pi (aarch64) | Full | CPU backends, no RAPL | No | vernier-*-aarch64-rpi |
| RISC-V 64 | Full | CPU backends, no RAPL | No | vernier-*-riscv64-linux |
| macOS (Apple Silicon/x86) | Full | No-ops | No | Build from source |
rapl is Intel-only (energy via MSRs); rocprof is AMD-only; nsight /
compute-sanitizer / cupti / nvtx are NVIDIA-only. All backends
degrade gracefully when hardware or tools are unavailable -- the core
timing harness always works.
# Build and run all tests (Docker)
make compose-debug
make compose-testp
# Run specific library tests
docker compose run --rm -T dev-cuda ctest --test-dir build/native-linux-debug -L bench
# CLI tool tests
make test-rust
make test-pyvernier/
CMakeLists.txt Root project (version, presets, CUDA detection)
Makefile Build entry point (make help for full list)
docker-compose.yml Dev containers (CPU, CUDA, cross-compile)
cmake/vernier/ CMake infrastructure (targets, testing, coverage)
docker/ Dockerfiles (base, dev, builder, toolchain)
mk/ Make modules (build, test, docker, coverage)
src/
bench/ Benchmarking library (perf, GPU harness, profilers)
inc/ Public headers (Perf.hpp, PerfGpu.hpp, Nvtx.hpp, profilers)
src/ Profiler implementations + CUPTI collector
bpf/ bpftrace scripts (write / fsync latency)
utst/ Unit tests
ptst/ Performance tests (CPU + GPU)
demo/ 16 CPU + 4 GPU walkthroughs with step-by-step docs
docs/ Library documentation
monitor/ Runtime instrumentation library (vernier::monitor)
inc/ Public headers (Monitor.hpp, MonitorConfig.hpp)
src/ Sink implementations
utst/ Unit tests
examples/ End-to-end usage examples
docs/ MONITOR_GUIDE.md
tools/
rust/ bench CLI (Rust) -- analysis, doctor, run, gpu-*
py/ bench-plot, nsight-parse CLIs (Python)
MIT License. See LICENSE for details.
- tools/README.md - CLI tools documentation (bench, bench-plot)
- src/bench/docs/CPU_GUIDE.md - CPU benchmarking guide
- src/bench/docs/GPU_GUIDE.md - GPU benchmarking guide
- src/bench/docs/ - Technical documentation