High-Performance Order Book Matching Engine

A limit order book implementation achieving 50M+ operations per second with deterministic O(1) complexity for all core operations. Built in Zig for zero-cost abstractions and explicit memory control.

Performance Metrics

Operation       Throughput        Latency    Complexity
────────────────────────────────────────────────────────
Insert          54.91M ops/sec    18ns       O(1)
Cancel          53.11M ops/sec    18ns       O(1)
Best Price      211.82M ops/sec   4ns        O(1)
Match           119.05M ops/sec   8ns        O(1) per fill

Measured on Apple M3 Max using isolated microbenchmarks (zig build bench-ladder-micro).

Architecture Overview

graph TB
    subgraph "Domain Layer"
        LB[LadderBook<br/>Array-based O(1)]
        OB[OrderBook<br/>RB-tree baseline]
    end

    subgraph "Application Layer"
        ME[MatchingEngine]
        CMD[Command Processor]
        EVT[Event Emitter]
    end

    subgraph "Adapter Layer"
        HTTP[HTTP Server]
        WS[WebSocket Server]
        BIN[Binary Protocol]
        JRN[Journal Persistence]
        SNAP[Snapshot Store]
    end

    CMD --> ME
    ME --> LB
    ME --> EVT
    EVT --> JRN
    HTTP --> CMD
    WS --> CMD
    BIN --> CMD

Core Data Structure: The Ladder Algorithm

The ladder algorithm replaces the traditional RB-tree with fixed-size arrays and hierarchical bitsets for constant-time operations.

graph LR
    subgraph "Memory Layout (per side)"
        PL[Price Levels<br/>1M slots × 32B<br/>32MB total]
        OP[Order Pool<br/>100K orders<br/>Pre-allocated]
        HM[HashMap<br/>OrderID → Index<br/>No rehashing]
        BS[Bitset<br/>2-level hierarchy<br/>O(1) find]
    end

    PL --> OP
    HM --> OP
    BS --> PL

Mathematical Foundation

Price to Index Mapping:

index = (price - base_tick) / tick_size

Best Price Discovery:

  • For bids: best_bid = base_tick + (highest_set_bit × tick_size)
  • For asks: best_ask = base_tick + (lowest_set_bit × tick_size)

Where set bits are found using CPU intrinsics:

  • @clz (count leading zeros) for highest bit
  • @ctz (count trailing zeros) for lowest bit
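
A minimal sketch of these mappings in Zig (hypothetical helper names; shown over a single u64 bitset word for brevity, while the actual book applies the same idea through its 2-level bitset hierarchy):

fn priceToIndex(price: u64, base_tick: u64, tick_size: u64) u64 {
    return (price - base_tick) / tick_size;
}

fn indexToPrice(index: u64, base_tick: u64, tick_size: u64) u64 {
    return base_tick + index * tick_size;
}

// Lowest occupied tick (best ask) via count-trailing-zeros.
fn bestAskIndex(word: u64) ?u64 {
    if (word == 0) return null;
    return @ctz(word);
}

// Highest occupied tick (best bid) via count-leading-zeros.
fn bestBidIndex(word: u64) ?u64 {
    if (word == 0) return null;
    return 63 - @clz(word);
}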

Complexity Analysis:

Operation       Traditional (RB-tree)             Ladder Implementation
────────────────────────────────────────────────────────────────────────
Insert          O(log M), M = unique prices       O(1)
Cancel          O(log M) + O(1)                   O(1)
Best Price      O(1) with cached min/max          O(1) via bitset
Match           O(log M) + O(K), K = orders       O(1) + O(K)

Memory Layout Details

classDiagram
    class PriceLevel {
        +u32 head_idx
        +u32 tail_idx
        +u64 aggregate_qty
        +[12]u8 padding
        ----
        32 bytes (cache-aligned)
    }

    class Order {
        +u64 id
        +u64 quantity
        +u64 filled
        +u32 next_idx
        +u32 prev_idx
        +OrderType type
        +TimeInForce tif
    }

    class BookSide {
        +[1M]PriceLevel levels
        +[100K]Order pool
        +HashMap id_map
        +Bitset occupancy
        +u32 free_head
    }

    BookSide --> PriceLevel : contains array
    BookSide --> Order : manages pool
    PriceLevel --> Order : indexes into
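
A minimal sketch of the level record in Zig, following the diagram above (field names come from the diagram, not necessarily the repository's exact definitions):

const std = @import("std");

const PriceLevel = extern struct {
    head_idx: u32,        // first resting order at this price (FIFO head)
    tail_idx: u32,        // last resting order (new arrivals link here)
    aggregate_qty: u64,   // sum of remaining quantity at this level
    _pad: [12]u8,         // pad the record out to a fixed 32-byte stride
};

comptime {
    // extern layout + padding: two levels fit in one 64-byte cache line
    std.debug.assert(@sizeOf(PriceLevel) == 32);
}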

Algorithm Walkthrough

Order Insertion

flowchart TD
    A[New Order] --> B[Calculate price index]
    B --> C[Allocate from pool]
    C --> D[Add to HashMap]
    D --> E{Level empty?}
    E -->|Yes| F[Set bitset bit]
    E -->|No| G[Link to tail]
    F --> H[Update level head/tail]
    G --> H
    H --> I[Update aggregate]
    I --> J[Return success]

Pseudocode:

function insert(order):
    idx = pool.allocate()              // O(1) - pop from free list
    pool[idx] = order                  // O(1) - array write

    tick = price_to_tick(order.price)  // O(1) - arithmetic
    level = &levels[tick]               // O(1) - array access

    if level.is_empty():
        occupancy.set(tick)             // O(1) - bit operation
        level.head = idx
    else:
        pool[level.tail].next = idx     // O(1) - link update
        pool[idx].prev = level.tail

    level.tail = idx
    level.aggregate_qty += order.qty   // O(1) - arithmetic
    id_map.put(order.id, idx)          // O(1) - amortized
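
The pool.allocate() call is constant time because free slots form an intrusive free list threaded through the orders' next_idx fields. A minimal sketch (hypothetical names, not the repository's code):

const std = @import("std");

const INVALID: u32 = std.math.maxInt(u32);

const Order = struct {
    id: u64 = 0,
    quantity: u64 = 0,
    next_idx: u32 = INVALID,
    prev_idx: u32 = INVALID,
};

const OrderPool = struct {
    slots: []Order,
    free_head: u32,

    // Pop one slot off the free list; null means the pool is exhausted.
    fn allocate(self: *OrderPool) ?u32 {
        const idx = self.free_head;
        if (idx == INVALID) return null;
        self.free_head = self.slots[idx].next_idx;
        return idx;
    }

    // Push a slot back onto the free list (on cancel or full fill).
    fn release(self: *OrderPool, idx: u32) void {
        self.slots[idx].next_idx = self.free_head;
        self.free_head = idx;
    }
};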

Order Matching

flowchart TD
    A[Market Order] --> B[Find best price via bitset]
    B --> C[Get level at price]
    C --> D{Has liquidity?}
    D -->|Yes| E[Match FIFO]
    E --> F{Order filled?}
    F -->|No| G[Find next price]
    F -->|Yes| H[Complete]
    G --> C
    D -->|No| I[No liquidity]

Matching Loop:

function match_market(qty_requested):
    while qty_requested > 0:
        best_tick = occupancy.find_first()     // O(1) via @ctz
        if !best_tick:
            break  // No liquidity

        level = &levels[best_tick]
        idx = level.head

        while idx != INVALID and qty_requested > 0:
            order = &pool[idx]
            match_qty = min(order.qty, qty_requested)

            emit_trade(order.id, match_qty)
            order.qty -= match_qty
            qty_requested -= match_qty

            if order.qty == 0:
                next = order.next
                remove_order(idx)               // O(1) unlink
                idx = next
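
The occupancy.find_first() call above is where the hierarchical bitset delivers its O(1) bound: one @ctz on a summary word selects a non-empty leaf word, and a second @ctz selects the tick within it. A minimal sketch (hypothetical layout and names, sized to 4,096 ticks for brevity rather than the engine's 1M):

const TwoLevelBitset = struct {
    summary: u64 = 0,                    // bit i set => leaves[i] != 0
    leaves: [64]u64 = [_]u64{0} ** 64,   // 64 x 64 = 4,096 ticks

    fn set(self: *TwoLevelBitset, tick: u32) void {
        const word: u6 = @intCast(tick / 64);
        const bit: u6 = @intCast(tick % 64);
        self.leaves[word] |= @as(u64, 1) << bit;
        self.summary |= @as(u64, 1) << word;
    }

    fn clear(self: *TwoLevelBitset, tick: u32) void {
        const word: u6 = @intCast(tick / 64);
        const bit: u6 = @intCast(tick % 64);
        self.leaves[word] &= ~(@as(u64, 1) << bit);
        if (self.leaves[word] == 0)
            self.summary &= ~(@as(u64, 1) << word);
    }

    // Lowest occupied tick (best ask); the best bid uses @clz symmetrically.
    fn findFirst(self: *const TwoLevelBitset) ?u32 {
        if (self.summary == 0) return null;
        const word = @ctz(self.summary);
        const bit = @ctz(self.leaves[word]);
        return @as(u32, word) * 64 + bit;
    }
};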

Build and Test

Prerequisites

  • Zig 0.14.0 or later
  • No external dependencies

Build Commands

# Build optimized binary
zig build -Doptimize=ReleaseFast

# Run all tests (unit + parity + integration)
zig build test

# Run benchmarks
zig build bench-ladder-micro   # Isolated operation benchmarks
zig build bench-compare        # Ladder vs RB-tree comparison
zig build bench-ladder         # Full workflow benchmark

Testing Strategy

  1. Unit Tests - Verify individual operations
  2. Parity Tests - Ensure Ladder and RB-tree produce identical results
  3. Invariant Tests - Validate state consistency:
    • Aggregate quantity = Σ(individual orders)
    • Bitset occupancy ⟷ level state
    • FIFO ordering maintained
  4. Performance Tests - Measure throughput and latency

Integration Guide

Basic Usage

const std = @import("std");
const MatchingEngine = @import("matching_engine");

// Initialize engine (any std.mem.Allocator works; page_allocator shown for brevity)
const allocator = std.heap.page_allocator;
var engine = try MatchingEngine.init(allocator, .{
    .n_ticks = 1_000_000,      // Price range: 1M ticks
    .max_orders = 100_000,     // Order capacity
    .tick_size = 1,            // Minimum price increment (cents)
});
defer engine.deinit();

// Insert limit order
const order_id = try engine.insertLimit(.{
    .id = unique_id,
    .side = .buy,
    .price = 45000,            // $450.00 with tick_size=1
    .quantity = 100,
    .type = .limit,
    .time_in_force = .good_till_cancel,
});

// Cancel order
engine.cancel(order_id, .buy);

// Match market order
var events = std.ArrayList(Event).init(allocator);
defer events.deinit();

try engine.matchMarket(.{
    .side = .sell,
    .quantity = 50,
    .events_out = &events,
});

// Process resulting events
for (events.items) |event| {
    switch (event) {
        .trade => |t| processTrade(t),
        .level_update => |u| updateMarketData(u),
        .order_accepted => |a| confirmOrder(a),
    }
}

Event Types

pub const DomainEvent = union(enum) {
    order_accepted: struct {
        id: u64,
        side: Side,
        price: u64,
        quantity: u64,
        timestamp: u64,
    },
    order_rejected: struct {
        id: u64,
        reason: RejectReason,
    },
    order_canceled: struct {
        id: u64,
        remaining_qty: u64,
    },
    trade: struct {
        maker_id: u64,
        taker_id: u64,
        price: u64,
        quantity: u64,
        maker_filled: bool,
        taker_filled: bool,
        timestamp: u64,
    },
    level_update: struct {
        side: Side,
        price: u64,
        new_quantity: u64,
    },
};

Design Decisions

Why Array-based Over Tree-based?

Traditional order books use balanced trees (RB-tree, AVL) for price levels:

  • Pros: Dynamic range, memory efficient for sparse books
  • Cons: O(log M) operations, poor cache locality, rebalancing overhead

The ladder approach uses fixed arrays:

  • Pros: O(1) operations, excellent cache locality, no allocations
  • Cons: Fixed memory overhead, limited price range

For active markets, the ladder's performance advantage (6-7x based on benchmarks) outweighs the memory cost (64MB).

Memory vs Performance Trade-off

Memory usage: 64MB per symbol
  - Price levels: 32MB (1M × 32 bytes)
  - Order pool: ~20MB (100K orders)
  - HashMap + bitset: ~12MB

Performance gain: 6-7x throughput
  - RB-tree: ~8M ops/sec → Ladder: 54M ops/sec
  - Worth it for active symbols
  - Consider hybrid approach for long-tail symbols

Why Pre-allocation?

Dynamic allocation introduces:

  • Unpredictable latency spikes (malloc can block)
  • Memory fragmentation
  • Cache pollution
  • HashMap rehashing (10x slowdown during resize)

Pre-allocation ensures (sketched below):

  • Deterministic latency (18ns consistently)
  • No allocation in hot path
  • Predictable memory layout
  • Better cache utilization
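
To make the "no allocation in hot path" guarantee concrete, here is a minimal sketch using std.AutoHashMap (chosen for illustration; the repository's actual map type may differ): reserve capacity for the whole order pool once at startup, then use the capacity-assuming insert on the hot path so it can never allocate or rehash.

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    var id_map = std.AutoHashMap(u64, u32).init(gpa.allocator());
    defer id_map.deinit();

    // Cold path: reserve room for every order the pool can hold.
    try id_map.ensureTotalCapacity(100_000);

    // Hot path: guaranteed not to allocate or trigger a rehash.
    id_map.putAssumeCapacity(42, 7);    // order id -> pool index
    _ = id_map.remove(42);
}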

Configuration Guidelines

// For equity markets (stocks)
.n_ticks = 100_000,     // $0.01 to $1,000.00 range
.tick_size = 1,         // 1 cent increments

// For crypto markets
.n_ticks = 10_000_000,  // Wide range for volatility
.tick_size = 1,         // $0.01 increments

// For FX markets
.n_ticks = 1_000_000,   // 6 decimal places
.tick_size = 1,         // 0.000001 increments

Production Deployment

System Requirements

  • Memory: 64MB per symbol × number of symbols
  • CPU: Single-threaded per symbol (no lock contention)
  • Latency: Sub-microsecond matching, network is bottleneck

Scaling Architecture

graph TB
    subgraph "Gateway Layer"
        GW1[Gateway 1]
        GW2[Gateway 2]
        GW3[Gateway N]
    end

    subgraph "Matching Layer"
        subgraph "Server 1"
            ME1[AAPL Engine]
            ME2[GOOGL Engine]
        end
        subgraph "Server 2"
            ME3[MSFT Engine]
            ME4[AMZN Engine]
        end
    end

    subgraph "Services"
        BAL[Balance Service]
        RISK[Risk Engine]
        SETTLE[Settlement]
        MD[Market Data]
    end

    GW1 --> ME1
    GW2 --> ME2
    ME1 --> BAL
    ME2 --> RISK
    ME3 --> SETTLE
    ME4 --> MD

Integration Checklist

Required services to build around this engine:

  • Authentication Service - Verify user identity
  • Balance Service - Lock/unlock funds before/after trades
  • Risk Management - Position limits, margin requirements
  • Settlement Service - Clear and settle trades
  • Market Data Distribution - WebSocket/FIX feed
  • Audit/Compliance - Trade reporting, regulatory compliance
  • Monitoring - Metrics, alerts, dashboards

Performance Monitoring

Key metrics to track:

  • Matching latency - P50, P95, P99, P99.9
  • Throughput - Orders/sec, trades/sec
  • Queue depth - Orders waiting at each price
  • Memory usage - Pool utilization, HashMap load factor
  • Event lag - Time from match to event emission

Why Zig?

After evaluating multiple languages for this performance-critical application:

  • C++ - Template complexity, hidden allocations in STL, undefined behavior pitfalls
  • Rust - Borrow checker friction with intrusive data structures, async runtime overhead
  • C - Manual memory management overhead, lack of modern abstractions
  • Go - GC pauses unacceptable for sub-microsecond latency requirements

Zig provides:

  • No hidden allocations - explicit memory control
  • Comptime metaprogramming - zero-cost abstractions
  • First-class error handling - no exceptions
  • Direct hardware access - CPU intrinsics when needed
  • Simple, readable code - maintainability matters

Benchmarking Methodology

All benchmarks follow consistent methodology:

  1. Warmup - 100K operations to stabilize caches
  2. Measurement - 10M+ operations for statistical significance
  3. Isolation - Single operation type per benchmark
  4. Environment - Release build with optimizations enabled
  5. Verification - Results validated against reference implementation

Future Optimizations

While current performance exceeds requirements, potential optimizations include:

  1. SIMD Aggregation - Vectorize quantity summation
  2. Prefetching - Explicit cache line prefetch hints
  3. Huge Pages - Reduce TLB misses for large arrays
  4. NUMA Awareness - Pin memory to local nodes

These remain unimplemented as the bottleneck is network I/O, not the matching engine.

Contributing

Performance improvements welcome. Requirements:

  • Benchmark demonstrating measurable improvement
  • All existing tests pass
  • No complexity increase without justification
  • Clear documentation of trade-offs

License

MIT

Acknowledgments

Built with Zig for its performance and safety guarantees without sacrificing control.
