High-Performance Order Book Matching Engine

A limit order book implementation achieving 50M+ operations per second with deterministic O(1) complexity for all core operations. Built in Zig for zero-cost abstractions and explicit memory control.

Performance Metrics

Operation       Throughput        Latency    Complexity
────────────────────────────────────────────────────────
Insert          54.91M ops/sec    18ns       O(1)
Cancel          53.11M ops/sec    18ns       O(1)
Best Price      211.82M ops/sec   4ns        O(1)
Match           119.05M ops/sec   8ns        O(1) per fill

Measured on Apple M3 Max using isolated microbenchmarks (zig build bench-ladder-micro).

Architecture Overview

graph TB
    subgraph "Domain Layer"
        LB[LadderBook<br/>Array-based O(1)]
        OB[OrderBook<br/>RB-tree baseline]
    end

    subgraph "Application Layer"
        ME[MatchingEngine]
        CMD[Command Processor]
        EVT[Event Emitter]
    end

    subgraph "Adapter Layer"
        HTTP[HTTP Server]
        WS[WebSocket Server]
        BIN[Binary Protocol]
        JRN[Journal Persistence]
        SNAP[Snapshot Store]
    end

    CMD --> ME
    ME --> LB
    ME --> EVT
    EVT --> JRN
    HTTP --> CMD
    WS --> CMD
    BIN --> CMD

Core Data Structure: The Ladder Algorithm

The ladder algorithm replaces the traditional RB-tree with fixed-size arrays and hierarchical bitsets for constant-time operations.

graph LR
    subgraph "Memory Layout (per side)"
        PL[Price Levels<br/>1M slots × 32B<br/>32MB total]
        OP[Order Pool<br/>100K orders<br/>Pre-allocated]
        HM[HashMap<br/>OrderID → Index<br/>No rehashing]
        BS[Bitset<br/>2-level hierarchy<br/>O(1) find]
    end

    PL --> OP
    HM --> OP
    BS --> PL

Mathematical Foundation

Price to Index Mapping:

index = (price - base_tick) / tick_size

Best Price Discovery:

  • For bids: best_bid = base_tick + (highest_set_bit × tick_size)
  • For asks: best_ask = base_tick + (lowest_set_bit × tick_size)

Where set bits are found using CPU intrinsics:

  • @clz (count leading zeros) for highest bit
  • @ctz (count trailing zeros) for lowest bit
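
A minimal sketch of these mappings in Zig (hypothetical helper names; shown over a single u64 bitset word for brevity, while the actual book applies the same idea through its 2-level bitset hierarchy):

fn priceToIndex(price: u64, base_tick: u64, tick_size: u64) u64 {
    return (price - base_tick) / tick_size;
}

fn indexToPrice(index: u64, base_tick: u64, tick_size: u64) u64 {
    return base_tick + index * tick_size;
}

// Lowest occupied tick (best ask) via count-trailing-zeros.
fn bestAskIndex(word: u64) ?u64 {
    if (word == 0) return null;
    return @ctz(word);
}

// Highest occupied tick (best bid) via count-leading-zeros.
fn bestBidIndex(word: u64) ?u64 {
    if (word == 0) return null;
    return 63 - @clz(word);
}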

Complexity Analysis:

Operation       Traditional (RB-tree)             Ladder Implementation
────────────────────────────────────────────────────────────────────────
Insert          O(log M), M = unique prices       O(1)
Cancel          O(log M) + O(1)                   O(1)
Best Price      O(1) with cached min/max          O(1) via bitset
Match           O(log M) + O(K), K = orders       O(1) + O(K)

Memory Layout Details

classDiagram
    class PriceLevel {
        +u32 head_idx
        +u32 tail_idx
        +u64 aggregate_qty
        +[12]u8 padding
        ----
        32 bytes (cache-aligned)
    }

    class Order {
        +u64 id
        +u64 quantity
        +u64 filled
        +u32 next_idx
        +u32 prev_idx
        +OrderType type
        +TimeInForce tif
    }

    class BookSide {
        +[1M]PriceLevel levels
        +[100K]Order pool
        +HashMap id_map
        +Bitset occupancy
        +u32 free_head
    }

    BookSide --> PriceLevel : contains array
    BookSide --> Order : manages pool
    PriceLevel --> Order : indexes into
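
A minimal sketch of the level record in Zig, following the diagram above (field names come from the diagram, not necessarily the repository's exact definitions):

const std = @import("std");

const PriceLevel = extern struct {
    head_idx: u32,        // first resting order at this price (FIFO head)
    tail_idx: u32,        // last resting order (new arrivals link here)
    aggregate_qty: u64,   // sum of remaining quantity at this level
    _pad: [12]u8,         // pad the record out to a fixed 32-byte stride
};

comptime {
    // extern layout + padding: two levels fit in one 64-byte cache line
    std.debug.assert(@sizeOf(PriceLevel) == 32);
}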

Algorithm Walkthrough

Order Insertion

flowchart TD
    A[New Order] --> B[Calculate price index]
    B --> C[Allocate from pool]
    C --> D[Add to HashMap]
    D --> E{Level empty?}
    E -->|Yes| F[Set bitset bit]
    E -->|No| G[Link to tail]
    F --> H[Update level head/tail]
    G --> H
    H --> I[Update aggregate]
    I --> J[Return success]

Pseudocode:

function insert(order):
    idx = pool.allocate()              // O(1) - pop from free list
    pool[idx] = order                  // O(1) - array write

    tick = price_to_tick(order.price)  // O(1) - arithmetic
    level = &levels[tick]               // O(1) - array access

    if level.is_empty():
        occupancy.set(tick)             // O(1) - bit operation
        level.head = idx
    else:
        pool[level.tail].next = idx     // O(1) - link update
        pool[idx].prev = level.tail

    level.tail = idx
    level.aggregate_qty += order.qty   // O(1) - arithmetic
    id_map.put(order.id, idx)          // O(1) - amortized
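
The pool.allocate() call is constant time because free slots form an intrusive free list threaded through the orders' next_idx fields. A minimal sketch (hypothetical names, not the repository's code):

const std = @import("std");

const INVALID: u32 = std.math.maxInt(u32);

const Order = struct {
    id: u64 = 0,
    quantity: u64 = 0,
    next_idx: u32 = INVALID,
    prev_idx: u32 = INVALID,
};

const OrderPool = struct {
    slots: []Order,
    free_head: u32,

    // Pop one slot off the free list; null means the pool is exhausted.
    fn allocate(self: *OrderPool) ?u32 {
        const idx = self.free_head;
        if (idx == INVALID) return null;
        self.free_head = self.slots[idx].next_idx;
        return idx;
    }

    // Push a slot back onto the free list (on cancel or full fill).
    fn release(self: *OrderPool, idx: u32) void {
        self.slots[idx].next_idx = self.free_head;
        self.free_head = idx;
    }
};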

Order Matching

flowchart TD
    A[Market Order] --> B[Find best price via bitset]
    B --> C[Get level at price]
    C --> D{Has liquidity?}
    D -->|Yes| E[Match FIFO]
    E --> F{Order filled?}
    F -->|No| G[Find next price]
    F -->|Yes| H[Complete]
    G --> C
    D -->|No| I[No liquidity]

Matching Loop:

function match_market(qty_requested):
    while qty_requested > 0:
        best_tick = occupancy.find_first()     // O(1) via @ctz
        if !best_tick:
            break  // No liquidity

        level = &levels[best_tick]
        idx = level.head

        while idx != INVALID and qty_requested > 0:
            order = &pool[idx]
            match_qty = min(order.qty, qty_requested)

            emit_trade(order.id, match_qty)
            order.qty -= match_qty
            qty_requested -= match_qty

            if order.qty == 0:
                next = order.next
                remove_order(idx)               // O(1) unlink
                idx = next
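
The occupancy.find_first() call above is where the hierarchical bitset delivers its O(1) bound: one @ctz on a summary word selects a non-empty leaf word, and a second @ctz selects the tick within it. A minimal sketch (hypothetical layout and names, sized to 4,096 ticks for brevity rather than the engine's 1M):

const TwoLevelBitset = struct {
    summary: u64 = 0,                    // bit i set => leaves[i] != 0
    leaves: [64]u64 = [_]u64{0} ** 64,   // 64 x 64 = 4,096 ticks

    fn set(self: *TwoLevelBitset, tick: u32) void {
        const word: u6 = @intCast(tick / 64);
        const bit: u6 = @intCast(tick % 64);
        self.leaves[word] |= @as(u64, 1) << bit;
        self.summary |= @as(u64, 1) << word;
    }

    fn clear(self: *TwoLevelBitset, tick: u32) void {
        const word: u6 = @intCast(tick / 64);
        const bit: u6 = @intCast(tick % 64);
        self.leaves[word] &= ~(@as(u64, 1) << bit);
        if (self.leaves[word] == 0)
            self.summary &= ~(@as(u64, 1) << word);
    }

    // Lowest occupied tick (best ask); the best bid uses @clz symmetrically.
    fn findFirst(self: *const TwoLevelBitset) ?u32 {
        if (self.summary == 0) return null;
        const word = @ctz(self.summary);
        const bit = @ctz(self.leaves[word]);
        return @as(u32, word) * 64 + bit;
    }
};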

Build and Test

Prerequisites

  • Zig 0.14.0 or later
  • No external dependencies

Build Commands

# Build optimized binary
zig build -Doptimize=ReleaseFast

# Run all tests (unit + parity + integration)
zig build test

# Run benchmarks
zig build bench-ladder-micro   # Isolated operation benchmarks
zig build bench-compare        # Ladder vs RB-tree comparison
zig build bench-ladder         # Full workflow benchmark

Testing Strategy

  1. Unit Tests - Verify individual operations
  2. Parity Tests - Ensure Ladder and RB-tree produce identical results
  3. Invariant Tests - Validate state consistency:
    • Aggregate quantity = Σ(individual orders)
    • Bitset occupancy ⟷ level state
    • FIFO ordering maintained
  4. Performance Tests - Measure throughput and latency

Integration Guide

Basic Usage

const std = @import("std");
const MatchingEngine = @import("matching_engine");

// Initialize engine (any std.mem.Allocator works; page_allocator shown for brevity)
const allocator = std.heap.page_allocator;
var engine = try MatchingEngine.init(allocator, .{
    .n_ticks = 1_000_000,      // Price range: 1M ticks
    .max_orders = 100_000,     // Order capacity
    .tick_size = 1,            // Minimum price increment (cents)
});
defer engine.deinit();

// Insert limit order
const order_id = try engine.insertLimit(.{
    .id = unique_id,
    .side = .buy,
    .price = 45000,            // $450.00 with tick_size=1
    .quantity = 100,
    .type = .limit,
    .time_in_force = .good_till_cancel,
});

// Cancel order
engine.cancel(order_id, .buy);

// Match market order
var events = std.ArrayList(Event).init(allocator);
defer events.deinit();

try engine.matchMarket(.{
    .side = .sell,
    .quantity = 50,
    .events_out = &events,
});

// Process resulting events
for (events.items) |event| {
    switch (event) {
        .trade => |t| processTrade(t),
        .level_update => |u| updateMarketData(u),
        .order_accepted => |a| confirmOrder(a),
    }
}

Event Types

pub const DomainEvent = union(enum) {
    order_accepted: struct {
        id: u64,
        side: Side,
        price: u64,
        quantity: u64,
        timestamp: u64,
    },
    order_rejected: struct {
        id: u64,
        reason: RejectReason,
    },
    order_canceled: struct {
        id: u64,
        remaining_qty: u64,
    },
    trade: struct {
        maker_id: u64,
        taker_id: u64,
        price: u64,
        quantity: u64,
        maker_filled: bool,
        taker_filled: bool,
        timestamp: u64,
    },
    level_update: struct {
        side: Side,
        price: u64,
        new_quantity: u64,
    },
};

Design Decisions

Why Array-based Over Tree-based?

Traditional order books use balanced trees (RB-tree, AVL) for price levels:

  • Pros: Dynamic range, memory efficient for sparse books
  • Cons: O(log M) operations, poor cache locality, rebalancing overhead

The ladder approach uses fixed arrays:

  • Pros: O(1) operations, excellent cache locality, no allocations
  • Cons: Fixed memory overhead, limited price range

For active markets, the ladder's performance advantage (6-7x based on benchmarks) outweighs the memory cost (64MB).

Memory vs Performance Trade-off

Memory usage: 64MB per symbol
  - Price levels: 32MB (1M × 32 bytes)
  - Order pool: ~20MB (100K orders)
  - HashMap + bitset: ~12MB

Performance gain: 6-7x throughput
  - RB-tree: ~8M ops/sec → Ladder: 54M ops/sec
  - Worth it for active symbols
  - Consider hybrid approach for long-tail symbols

Why Pre-allocation?

Dynamic allocation introduces:

  • Unpredictable latency spikes (malloc can block)
  • Memory fragmentation
  • Cache pollution
  • HashMap rehashing (10x slowdown during resize)

Pre-allocation ensures (sketched below):

  • Deterministic latency (18ns consistently)
  • No allocation in hot path
  • Predictable memory layout
  • Better cache utilization
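
To make the "no allocation in hot path" guarantee concrete, here is a minimal sketch using std.AutoHashMap (chosen for illustration; the repository's actual map type may differ): reserve capacity for the whole order pool once at startup, then use the capacity-assuming insert on the hot path so it can never allocate or rehash.

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    var id_map = std.AutoHashMap(u64, u32).init(gpa.allocator());
    defer id_map.deinit();

    // Cold path: reserve room for every order the pool can hold.
    try id_map.ensureTotalCapacity(100_000);

    // Hot path: guaranteed not to allocate or trigger a rehash.
    id_map.putAssumeCapacity(42, 7);    // order id -> pool index
    _ = id_map.remove(42);
}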

Configuration Guidelines

// For equity markets (stocks)
.n_ticks = 100_000,     // $0.01 to $1,000.00 range
.tick_size = 1,         // 1 cent increments

// For crypto markets
.n_ticks = 10_000_000,  // Wide range for volatility
.tick_size = 1,         // $0.01 increments

// For FX markets
.n_ticks = 1_000_000,   // 6 decimal places
.tick_size = 1,         // 0.000001 increments

Production Deployment

System Requirements

  • Memory: 64MB per symbol × number of symbols
  • CPU: Single-threaded per symbol (no lock contention)
  • Latency: Sub-microsecond matching, network is bottleneck

Scaling Architecture

graph TB
    subgraph "Gateway Layer"
        GW1[Gateway 1]
        GW2[Gateway 2]
        GW3[Gateway N]
    end

    subgraph "Matching Layer"
        subgraph "Server 1"
            ME1[AAPL Engine]
            ME2[GOOGL Engine]
        end
        subgraph "Server 2"
            ME3[MSFT Engine]
            ME4[AMZN Engine]
        end
    end

    subgraph "Services"
        BAL[Balance Service]
        RISK[Risk Engine]
        SETTLE[Settlement]
        MD[Market Data]
    end

    GW1 --> ME1
    GW2 --> ME2
    ME1 --> BAL
    ME2 --> RISK
    ME3 --> SETTLE
    ME4 --> MD

Integration Checklist

Required services to build around this engine:

  • Authentication Service - Verify user identity
  • Balance Service - Lock/unlock funds before/after trades
  • Risk Management - Position limits, margin requirements
  • Settlement Service - Clear and settle trades
  • Market Data Distribution - WebSocket/FIX feed
  • Audit/Compliance - Trade reporting, regulatory compliance
  • Monitoring - Metrics, alerts, dashboards

Performance Monitoring

Key metrics to track:

  • Matching latency - P50, P95, P99, P99.9
  • Throughput - Orders/sec, trades/sec
  • Queue depth - Orders waiting at each price
  • Memory usage - Pool utilization, HashMap load factor
  • Event lag - Time from match to event emission

Why Zig?

After evaluating multiple languages for this performance-critical application:

  • C++ - Template complexity, hidden allocations in STL, undefined behavior pitfalls
  • Rust - Borrow checker friction with intrusive data structures, async runtime overhead
  • C - Manual memory management overhead, lack of modern abstractions
  • Go - GC pauses unacceptable for sub-microsecond latency requirements

Zig provides:

  • No hidden allocations - explicit memory control
  • Comptime metaprogramming - zero-cost abstractions
  • First-class error handling - no exceptions
  • Direct hardware access - CPU intrinsics when needed
  • Simple, readable code - maintainability matters

Benchmarking Methodology

All benchmarks follow consistent methodology:

  1. Warmup - 100K operations to stabilize caches
  2. Measurement - 10M+ operations for statistical significance
  3. Isolation - Single operation type per benchmark
  4. Environment - Release build with optimizations enabled
  5. Verification - Results validated against reference implementation

Future Optimizations

While current performance exceeds requirements, potential optimizations include:

  1. SIMD Aggregation - Vectorize quantity summation
  2. Prefetching - Explicit cache line prefetch hints
  3. Huge Pages - Reduce TLB misses for large arrays
  4. NUMA Awareness - Pin memory to local nodes

These remain unimplemented as the bottleneck is network I/O, not the matching engine.

Contributing

Performance improvements welcome. Requirements:

  • Benchmark demonstrating measurable improvement
  • All existing tests pass
  • No complexity increase without justification
  • Clear documentation of trade-offs

License

MIT

Acknowledgments

Built with Zig for its performance and safety guarantees without sacrificing control.
