MPMC Ring is a high-performance, bounded multi-producer/multi-consumer ring buffer for C++20.
It provides a work-conserving blocking (ticketed) path and a non-blocking try_* API,
uses atomic operations (no mutexes), and targets low-latency inter-thread messaging.
The library is header-only and ships with benchmarks and tests for reproducibility.
Bounded multi-producer / multi-consumer ring buffer with:
- Blocking, ticketed fast path and a non-blocking `try_*` API
- No mutexes; uses atomic operations
- Per-slot acquire/release handoff; relaxed cursors (see the sketch after this list)
- Thread pinning and cursor padding; reproducible bench with CSV output
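The per-slot handoff follows the familiar bounded-MPMC pattern: every slot carries a sequence counter, and producers and consumers compare it against the ticket they claimed from a relaxed cursor. The sketch below is illustrative only and is not the library's code; names are made up, and `T` is assumed default-constructible. See `include/mpmc.hpp` for the real implementation.

```cpp
// Illustrative per-slot acquire/release handoff (Vyukov-style sketch),
// NOT the library's actual implementation.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename T>
class SketchRing {
    struct Slot {
        std::atomic<std::size_t> seq{0};  // ticket this slot is waiting for
        T value{};                        // T assumed default-constructible here
    };

    std::vector<Slot> slots_;
    std::size_t mask_;
    // Relaxed cursors; the real library can pad these onto separate cache lines.
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};

public:
    explicit SketchRing(std::size_t capacity_pow2)
        : slots_(capacity_pow2), mask_(capacity_pow2 - 1) {
        for (std::size_t i = 0; i < capacity_pow2; ++i)
            slots_[i].seq.store(i, std::memory_order_relaxed);
    }

    bool try_push(const T& v) {
        std::size_t pos = tail_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& s = slots_[pos & mask_];
            std::size_t seq = s.seq.load(std::memory_order_acquire);
            auto diff = static_cast<std::intptr_t>(seq) - static_cast<std::intptr_t>(pos);
            if (diff == 0) {
                // Slot is free for ticket `pos`; claim it with a relaxed CAS on the cursor.
                if (tail_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
                    s.value = v;
                    // Release the filled slot to the consumer holding ticket `pos`.
                    s.seq.store(pos + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false;                                  // ring is full
            } else {
                pos = tail_.load(std::memory_order_relaxed);   // lost the race; reload
            }
        }
    }

    bool try_pop(T& out) {
        std::size_t pos = head_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& s = slots_[pos & mask_];
            std::size_t seq = s.seq.load(std::memory_order_acquire);
            auto diff = static_cast<std::intptr_t>(seq) - static_cast<std::intptr_t>(pos + 1);
            if (diff == 0) {
                if (head_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
                    out = std::move(s.value);
                    // Hand the slot back to producers one lap ahead.
                    s.seq.store(pos + mask_ + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false;                                  // ring is empty
            } else {
                pos = head_.load(std::memory_order_relaxed);
            }
        }
    }
};
```

The acquire load of the slot sequence pairs with the release store from the other side, so the value handoff itself needs no stronger ordering than that; the cursors only distribute tickets and can stay relaxed.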
- `try_push(const T&)` / `try_push(T&&)`: non-blocking; returns `false` if full
- `push(const T&)` / `push(T&&)`: blocking (spins) until enqueued
- `try_pop(T&)`: non-blocking; returns `false` if empty
- `pop(T&)`: blocking (spins) until dequeued
- Template: `MpmcRing<T, /*Padding=*/bool>` (cursor padding toggle). Owns bounded storage.

Full signatures: see `include/mpmc.hpp`.
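A minimal usage sketch of the API above. The constructor taking a runtime capacity is an assumption; check `include/mpmc.hpp` for the exact signature.

```cpp
#include <cstdint>
#include <thread>

#include "mpmc.hpp"  // MpmcRing

int main() {
    // Capacity-taking constructor is an assumption; see include/mpmc.hpp.
    MpmcRing<std::uint64_t, /*Padding=*/true> ring(65536);

    std::thread producer([&] {
        for (std::uint64_t i = 0; i < 1'000'000; ++i)
            ring.push(i);                    // blocking: spins while the ring is full
    });

    std::thread consumer([&] {
        std::uint64_t v;
        for (std::uint64_t i = 0; i < 1'000'000; ++i)
            while (!ring.try_pop(v)) { }     // non-blocking: returns false when empty
    });

    producer.join();
    consumer.join();
    return 0;
}
```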
- Open the folder as a CMake project.
- Set the x64 | Release configuration.
- Run the `tests` or `bench` target.

Typical paths: `out/build/msvc-ninja-release/tests.exe` and `out/build/msvc-ninja-release/bench.exe`.
cmake --preset=mingw-release
cmake --build --preset=mingw-release -j
./out/build/mingw-release/bench

Benchmark defaults: producers=consumers=1, capacity=65,536, mode=blocking, warmup=2,500 ms, duration=17,500 ms, bucket_width=5 ns, buckets=4,096, padding=on, pinning=on, large_payload=off, move_only_payload=off.
Testbed: Windows 11 (24H2)
CPU: Intel Core i7-11800H (8c/16t)
Compiler: MSVC 19.44 (Visual Studio 2022 17.10), /O2 /GL
Build system: CMake 4.1.1 + Ninja 1.11.1 (Release)
Power plan: Legion Balance Mode (OEM Balanced)
# Blocking (defaults)
./out/build/mingw-release/bench --producers 4 --consumers 4
# Non-blocking A/B
./out/build/mingw-release/bench --producers 4 --consumers 4 --blocking off

See `--help` for all options and defaults.
Figure settings (used for all charts unless stated): producers=consumers=4, capacity=65,536, mode=blocking, warmup=2,500 ms, duration=17,500 ms, bucket_width=5 ns, buckets=4,096, padding=on, pinning=on, large_payload=off, move_only_payload=off.
At 4p4c, blocking outperforms non-blocking by ~2–3× on this host, while tightening p99/p999.
Pinning and cursor padding materially reduce tail latency.
Large copyable payloads pay for the extra data movement; move-only payloads keep latency close to the small-POD baseline (see the sketch below).
As the producer/consumer count increases, p50, p99, and p999 all grow.
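To make the move-only case concrete: a payload such as `std::unique_ptr` goes through the rvalue `push`/`try_push` overloads, so only a pointer-sized object is moved through the ring. A minimal sketch under the same API assumptions as above (the capacity constructor is assumed):

```cpp
#include <memory>

#include "mpmc.hpp"  // MpmcRing

int main() {
    // Capacity-taking constructor is an assumption; see include/mpmc.hpp.
    MpmcRing<std::unique_ptr<int>, /*Padding=*/true> ring(1024);

    ring.push(std::make_unique<int>(42));   // rvalue overload: the unique_ptr is moved in

    std::unique_ptr<int> out;
    if (ring.try_pop(out)) {                // the stored unique_ptr is moved out
        // *out == 42
    }
    return 0;
}
```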
- On this Windows testbed, total throughput plateaued by 4p4c despite the ticketed fast path.
- Explicit thread pinning (affinity APIs) did not change the result.
- Blocking is ~2–3× faster than non-blocking at 4p4c, and pinning/padding primarily tighten tails.
- On a single-socket Linux host, higher scaling is expected.
- Next step for higher throughput: shard the queue into N sub-rings (see the sketch below).
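Purely as a hypothetical sketch of that direction (nothing here is library API; `ShardedRing` and the capacity constructor are assumptions): keep N independent `MpmcRing` instances and spread traffic across them so each sub-ring sees fewer contending cursors.

```cpp
// Hypothetical sharding sketch: N independent sub-rings with round-robin selection.
// ShardedRing and the capacity constructor are assumptions, not library API.
#include <array>
#include <atomic>
#include <cstddef>
#include <memory>

#include "mpmc.hpp"  // MpmcRing

template <typename T, std::size_t N>
class ShardedRing {
    using Ring = MpmcRing<T, /*Padding=*/true>;
    std::array<std::unique_ptr<Ring>, N> shards_;
    std::atomic<std::size_t> rr_{0};  // round-robin counter for producers

public:
    explicit ShardedRing(std::size_t per_shard_capacity) {
        for (auto& s : shards_)
            s = std::make_unique<Ring>(per_shard_capacity);  // assumed capacity ctor
    }

    bool try_push(const T& v) {
        // Start at the next shard in round-robin order; skip shards that are full.
        std::size_t start = rr_.fetch_add(1, std::memory_order_relaxed);
        for (std::size_t i = 0; i < N; ++i)
            if (shards_[(start + i) % N]->try_push(v)) return true;
        return false;  // every shard is full
    }

    bool try_pop(T& out) {
        // Consumers sweep all shards; a per-consumer start index would spread load further.
        for (std::size_t i = 0; i < N; ++i)
            if (shards_[i]->try_pop(out)) return true;
        return false;  // every shard is empty
    }
};
```

Note that sharding trades global FIFO ordering for reduced contention on the shared head/tail cursors, so it only fits workloads that do not need a single total order.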