# ADR-062: QEMU ESP32-S3 Swarm Configurator

| Field       | Value                                          |
|-------------|------------------------------------------------|
| **Status**  | Accepted                                       |
| **Date**    | 2026-03-14                                     |
| **Authors** | RuView Team                                    |
| **Relates** | ADR-061 (QEMU testing platform), ADR-060 (channel/MAC filter), ADR-018 (binary frame), ADR-039 (edge intel) |

## Glossary

| Term | Definition |
|------|-----------|
| Swarm | A group of N QEMU ESP32-S3 instances running simultaneously |
| Topology | How nodes are connected: star, mesh, line, ring |
| Role | Node function: `sensor` (collects CSI), `coordinator` (aggregates + forwards), `gateway` (bridges to host) |
| Scenario matrix | Cross-product of topology × node count × NVS config × mock scenario |
| Health oracle | Python process that monitors all node UART logs and declares swarm health |

## Context

ADR-061 Layer 3 provides a basic multi-node mesh test: N identical nodes with sequential TDM slots connected via a Linux bridge. This is useful but limited:

1. **All nodes are identical** — real deployments have heterogeneous roles (sensor, coordinator, gateway)
2. **Single topology** — only fully-connected bridge; no star, line, or ring topologies
3. **No scenario variation per node** — all nodes run the same mock CSI scenario
4. **Manual configuration** — each test requires hand-editing env vars and arguments
5. **No swarm-level health monitoring** — validation checks individual nodes, not collective behavior
6. **No cross-node timing validation** — TDM slot ordering and inter-frame gaps aren't verified

Real WiFi-DensePose deployments use 3-8 ESP32-S3 nodes in various topologies. A single coordinator aggregates CSI from multiple sensors. The firmware must handle TDM conflicts, missing nodes, role-based behavior differences, and network partitions — none of which ADR-061 Layer 3 tests.

## Decision

Build a **QEMU Swarm Configurator** — a YAML-driven tool that defines multi-node test scenarios declaratively and orchestrates them under QEMU with swarm-level validation.

### Architecture

```
┌─────────────────────────────────────────────────────┐
│                 swarm_config.yaml                     │
│  nodes: [{role: sensor, scenario: 2, channel: 6}]   │
│  topology: star                                       │
│  duration_s: 60                                       │
│  assertions: [all_nodes_boot, tdm_no_collision, ...]  │
└──────────────────────┬──────────────────────────────┘
                       │
          ┌────────────▼────────────┐
          │   qemu_swarm.py         │
          │   (orchestrator)        │
          └───┬────┬────┬───┬──────┘
              │    │    │   │
         ┌────▼┐ ┌▼──┐ ▼  ┌▼────┐
         │Node0│ │N1 │... │N(n-1)│   QEMU instances
         │sens │ │sen│    │coord │
         └──┬──┘ └─┬─┘    └──┬───┘
            │      │         │
         ┌──▼──────▼─────────▼──┐
         │  Virtual Network      │    TAP bridge / SLIRP
         │  (topology-shaped)    │
         └──────────┬───────────┘
                    │
         ┌──────────▼───────────┐
         │  Aggregator (Rust)    │    Collects frames
         └──────────┬───────────┘
                    │
         ┌──────────▼───────────┐
         │  Health Oracle        │    Swarm-level assertions
         │  (swarm_health.py)    │
         └──────────────────────┘
```

### YAML Configuration Schema

```yaml
# swarm_config.yaml
swarm:
  name: "3-sensor-star"
  duration_s: 60
  topology: star          # star | mesh | line | ring
  aggregator_port: 5005

nodes:
  - role: coordinator
    node_id: 0
    scenario: 0           # empty room (baseline)
    channel: 6
    edge_tier: 2
    is_gateway: true       # receives aggregated frames

  - role: sensor
    node_id: 1
    scenario: 2           # walking person
    channel: 6
    tdm_slot: 1           # TDM slot index (auto-assigned from node position if omitted)

  - role: sensor
    node_id: 2
    scenario: 3           # fall event
    channel: 6
    tdm_slot: 2

assertions:
  - all_nodes_boot
  - no_crashes
  - tdm_no_collision
  - all_nodes_produce_frames
  - coordinator_receives_from_all
  - fall_detected_by_node_2
  - frame_rate_above: 15    # Hz minimum per node
  - max_boot_time_s: 10
```
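
The schema comment says `tdm_slot` is auto-assigned from node position when omitted. A minimal Python sketch of that normalization step follows, operating on an already-parsed config dict. The helper name `normalize_config`, the specific validation checks, and the "slot equals list position" rule are assumptions made for illustration, not the actual behavior of `qemu_swarm.py`.

```python
from typing import Any

VALID_TOPOLOGIES = {"star", "mesh", "line", "ring"}
VALID_ROLES = {"sensor", "coordinator", "gateway"}


def normalize_config(cfg: dict[str, Any]) -> dict[str, Any]:
    """Validate a parsed swarm config and fill in omitted tdm_slot values.

    Hypothetical helper: the checks and the slot-assignment rule are
    assumptions based on the schema comments above.
    """
    topology = cfg["swarm"]["topology"]
    if topology not in VALID_TOPOLOGIES:
        raise ValueError(f"unknown topology: {topology!r}")

    nodes = [dict(node) for node in cfg["nodes"]]  # copy so the input is untouched
    seen_ids: set[int] = set()
    for position, node in enumerate(nodes):
        if node["role"] not in VALID_ROLES:
            raise ValueError(f"node {node['node_id']}: unknown role {node['role']!r}")
        if node["node_id"] in seen_ids:
            raise ValueError(f"duplicate node_id: {node['node_id']}")
        seen_ids.add(node["node_id"])
        # Assumed rule: an omitted tdm_slot defaults to the node's list position.
        node.setdefault("tdm_slot", position)

    return {**cfg, "nodes": nodes}


example = {
    "swarm": {"name": "3-sensor-star", "duration_s": 60, "topology": "star"},
    "nodes": [
        {"role": "coordinator", "node_id": 0, "scenario": 0, "channel": 6},
        {"role": "sensor", "node_id": 1, "scenario": 2, "channel": 6, "tdm_slot": 1},
        {"role": "sensor", "node_id": 2, "scenario": 3, "channel": 6},
    ],
}
slots = [n["tdm_slot"] for n in normalize_config(example)["nodes"]]
print(slots)  # [0, 1, 2]
```

Validating before any QEMU process is launched means a malformed config fails in milliseconds rather than after a boot timeout.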

### Topologies

| Topology | Network | Description |
|----------|---------|-------------|
| `star` | All sensors connect to coordinator; coordinator has TAP to each sensor | Hub-and-spoke, most common |
| `mesh` | All nodes on same bridge (existing Layer 3 behavior) | Every node sees every other |
| `line` | Node 0 ↔ Node 1 ↔ Node 2 ↔ ... | Linear chain, tests multi-hop |
| `ring` | Like line but last connects to first | Circular, tests routing |
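
The table above can be read as a mapping from a topology name to a set of node-to-node links. The sketch below makes that mapping concrete. The function name `topology_links` and the `hub` parameter are hypothetical. For `mesh`, the pairwise list only models reachability, since the real setup is a single shared bridge rather than individual links.

```python
def topology_links(topology: str, node_ids: list[int], hub: int = 0) -> list[tuple[int, int]]:
    """Return the undirected links (a, b) implied by a topology name.

    Illustrative only: `hub` is an assumed parameter naming the star's
    coordinator node.
    """
    if topology == "star":
        # Hub-and-spoke: every non-hub node links to the hub.
        return [(hub, n) for n in node_ids if n != hub]
    if topology == "mesh":
        # Every node can reach every other node.
        return [(a, b) for i, a in enumerate(node_ids) for b in node_ids[i + 1:]]

    # line and ring share the same neighbour-to-neighbour chain.
    chain = list(zip(node_ids, node_ids[1:]))
    if topology == "line":
        return chain
    if topology == "ring":
        # Close the loop only when that adds a distinct link.
        if len(node_ids) > 2:
            chain.append((node_ids[-1], node_ids[0]))
        return chain
    raise ValueError(f"unknown topology: {topology!r}")


print(topology_links("star", [0, 1, 2, 3]))  # [(0, 1), (0, 2), (0, 3)]
print(topology_links("ring", [0, 1, 2, 3]))  # [(0, 1), (1, 2), (2, 3), (3, 0)]
```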

### Node Roles

| Role | Behavior | NVS Keys |
|------|----------|----------|
| `sensor` | Runs mock CSI, sends frames to coordinator | `node_id`, `tdm_slot`, `target_ip` |
| `coordinator` | Receives frames from sensors, runs edge aggregation | `node_id`, `tdm_slot=0`, `edge_tier=2` |
| `gateway` | Like coordinator but also bridges to host UDP | `node_id`, `target_ip=host`, `is_gateway=1` |
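
As a rough illustration of the role table, the following sketch derives the per-node NVS key/value pairs from a node entry. The function name `nvs_for_node` and the IP addresses are placeholders chosen for the example. How the orchestrator actually writes these values into an NVS image is not shown here.

```python
def nvs_for_node(node: dict, coordinator_ip: str, host_ip: str) -> dict:
    """Map a node entry to the NVS keys listed for its role.

    Hypothetical helper; key names come from the role table, everything
    else (signature, IP handling) is an assumption.
    """
    role = node["role"]
    nvs: dict = {"node_id": node["node_id"]}
    if role == "sensor":
        nvs["tdm_slot"] = node["tdm_slot"]
        nvs["target_ip"] = coordinator_ip  # sensors send frames to the coordinator
    elif role == "coordinator":
        nvs["tdm_slot"] = 0
        nvs["edge_tier"] = 2
    elif role == "gateway":
        nvs["target_ip"] = host_ip  # gateway bridges to the host over UDP
        nvs["is_gateway"] = 1
    else:
        raise ValueError(f"unknown role: {role!r}")
    return nvs


# Placeholder addresses, not taken from any real configuration.
sensor = {"role": "sensor", "node_id": 1, "tdm_slot": 1}
print(nvs_for_node(sensor, coordinator_ip="10.0.0.1", host_ip="10.0.0.254"))
# {'node_id': 1, 'tdm_slot': 1, 'target_ip': '10.0.0.1'}
```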

### Assertions (Swarm-Level)

| Assertion | What It Checks |
|-----------|---------------|
| `all_nodes_boot` | Every node's UART log shows boot indicators within timeout |
| `no_crashes` | No Guru Meditation error, assertion failure, or panic appears in any node's log |
| `tdm_no_collision` | No two nodes transmit in the same TDM slot |
| `all_nodes_produce_frames` | Every sensor node's log contains CSI frame output |
| `coordinator_receives_from_all` | Coordinator log shows frames from each sensor's node_id |
| `fall_detected_by_node_N` | Node N's log reports a fall detection event |
| `frame_rate_above` | Each node produces at least N frames/second |
| `max_boot_time_s` | All nodes boot within N seconds |
| `no_heap_errors` | No OOM or heap corruption in any log |
| `network_partitioned_recovery` | After deliberate partition, nodes resume communication (future) |
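
Three of these assertions can be sketched as small pure functions that return a list of failure messages (an empty list means the assertion passes). This is not the `swarm_health.py` implementation. The function signatures, the crash-marker substrings, and the sample log lines are assumptions, and `tdm_no_collision` here only checks the configured slot assignment, a static approximation of the transmit-time check described in the table.

```python
CRASH_MARKERS = ("guru meditation", "assert", "panic")  # assumed substrings


def no_crashes(logs: dict[int, list[str]]) -> list[str]:
    """Return one failure message per log line that contains a crash marker."""
    failures = []
    for node_id, lines in sorted(logs.items()):
        for line in lines:
            lowered = line.lower()
            if any(marker in lowered for marker in CRASH_MARKERS):
                failures.append(f"node {node_id}: {line.strip()}")
    return failures


def tdm_no_collision(slots: dict[int, int]) -> list[str]:
    """Given node_id -> tdm_slot, report every slot claimed by several nodes."""
    by_slot: dict[int, list[int]] = {}
    for node_id, slot in sorted(slots.items()):
        by_slot.setdefault(slot, []).append(node_id)
    return [
        f"slot {slot} shared by nodes {ids}"
        for slot, ids in sorted(by_slot.items())
        if len(ids) > 1
    ]


def frame_rate_above(frame_counts: dict[int, int], duration_s: float, min_hz: float) -> list[str]:
    """Report nodes whose average frame rate over the run is below min_hz."""
    return [
        f"node {node_id}: {count / duration_s:.1f} Hz < {min_hz} Hz"
        for node_id, count in sorted(frame_counts.items())
        if count / duration_s < min_hz
    ]


# Made-up sample data for illustration.
logs = {
    0: ["I (312) main: boot complete"],
    1: ["Guru Meditation Error: Core 0 panic'ed"],
}
print(no_crashes(logs))
print(tdm_no_collision({0: 0, 1: 1, 2: 1}))
print(frame_rate_above({1: 1200, 2: 600}, duration_s=60, min_hz=15))
```

Returning messages instead of a bare boolean lets the oracle aggregate every failure across the swarm into one report.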

### Preset Configurations

| Preset | Nodes | Topology | Purpose |
|--------|-------|----------|---------|
| `smoke` | 2 | star | Quick CI smoke test (15s) |
| `standard` | 3 | star | Default 3-node swarm (two sensors + one coordinator) |
| `large-mesh` | 6 | mesh | Scale test with 6 fully-connected nodes |
| `line-relay` | 4 | line | Multi-hop relay chain |
| `ring-fault` | 4 | ring | Ring with fault injection mid-test |
| `heterogeneous` | 5 | star | Mixed scenarios: walk, fall, static, channel-sweep, empty |
| `ci-matrix` | 3 | star | CI-optimized preset (30s, minimal assertions) |

## File Layout

```
scripts/
├── qemu_swarm.py              # Main orchestrator (CLI entry point)
├── swarm_health.py            # Swarm-level health oracle
└── swarm_presets/
    ├── smoke.yaml
    ├── standard.yaml
    ├── large_mesh.yaml
    ├── line_relay.yaml
    ├── ring_fault.yaml
    ├── heterogeneous.yaml
    └── ci_matrix.yaml

.github/workflows/
└── firmware-qemu.yml          # MODIFIED: add swarm test job
```
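
The preset table uses hyphenated names (`large-mesh`, `ci-matrix`) while the files above use underscores (`large_mesh.yaml`, `ci_matrix.yaml`). One way to reconcile the two, sketched below, is to translate hyphens to underscores when resolving a preset name. The helper `preset_path` and this translation rule are assumptions, not documented behavior of `qemu_swarm.py`.

```python
from pathlib import Path

PRESET_DIR = Path("scripts/swarm_presets")


def preset_path(name: str) -> Path:
    """Resolve a preset name to its YAML file (assumed hyphen-to-underscore rule)."""
    return PRESET_DIR / f"{name.replace('-', '_')}.yaml"


print(preset_path("large-mesh").as_posix())  # scripts/swarm_presets/large_mesh.yaml
```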

## Consequences

### Benefits

1. **Declarative testing** — define swarm topology in YAML, not shell scripts
2. **Role-based nodes** — test coordinator/sensor/gateway interactions
3. **Topology variety** — star/mesh/line/ring match real deployment patterns
4. **Swarm-level assertions** — validate collective behavior, not just individual nodes
5. **Preset library** — quick CI smoke tests and thorough manual validation
6. **Reproducible** — YAML configs are version-controlled and shareable

### Limitations

1. **Still requires root** for TAP bridge topologies (star, line, ring); mesh can use SLIRP
2. **QEMU resource usage** — 6+ QEMU instances use ~2GB RAM, may slow CI runners
3. **No real RF** — inter-node communication is IP-based, not WiFi CSI multipath

## References

- ADR-061: QEMU ESP32-S3 firmware testing platform (Layers 1-9)
- ADR-060: Channel override and MAC address filter provisioning
- ADR-018: Binary CSI frame format (magic `0xC5110001`)
- ADR-039: Edge intelligence pipeline (biquad, vitals, fall detection)