|
| 1 | +# ADR-037: Multi-Person Pose Detection from Single ESP32 CSI Stream |
| 2 | + |
| 3 | +- **Status**: Proposed |
| 4 | +- **Date**: 2026-03-02 |
| 5 | +- **Issue**: [#97](https://github.com/ruvnet/wifi-densepose/issues/97) |
| 6 | +- **Deciders**: @ruvnet |
| 7 | +- **Supersedes**: None |
| 8 | +- **Related**: ADR-014 (SOTA signal processing), ADR-024 (AETHER re-ID), ADR-029 (multistatic sensing), ADR-036 (RVF training pipeline) |
| 9 | + |
| 10 | +## Context |
| 11 | + |
| 12 | +The current signal-derived pose estimation pipeline (`derive_pose_from_sensing()` in the sensing server) generates at most one skeleton per frame from aggregate CSI features. When multiple people are present, it produces only a single blended skeleton. Live testing with ESP32 hardware confirmed this: with two people in the room, only one person is detected. |
| 13 | + |
| 14 | +A single ESP32 node provides 1 TX × 1 RX × 56 subcarriers of CSI data per frame. Although this offers far less spatial resolution than camera-based systems, the signal still contains composite reflections from all scatterers in the environment. The challenge is decomposing this composite signal into per-person contributions. |
| 15 | + |
| 16 | +## Decision |
| 17 | + |
| 18 | +Implement multi-person pose detection in four phases, progressively improving accuracy from heuristic to neural approaches. |
| 19 | + |
| 20 | +### Phase 1: Person Count Estimation |
| 21 | + |
| 22 | +Estimate occupancy count from CSI signal statistics without decomposition. |
| 23 | + |
| 24 | +**Approach**: Eigenvalue analysis of the CSI covariance matrix across subcarriers. |
| 25 | + |
| 26 | +- Compute the 56×56 covariance matrix of CSI amplitudes over a sliding window (e.g., 50 frames / 5 seconds) |
| 27 | +- Count eigenvalues above a noise threshold — each significant eigenvalue corresponds to an independent scatterer (person or static object) |
| 28 | +- Subtract the static environment baseline (estimated during calibration or from the field model's SVD eigenstructure) |
| 29 | +- The residual significant eigenvalue count estimates person count |
| 30 | + |
| 31 | +**Accuracy target**: > 80% for 0-3 people with a single ESP32 node. |
| 32 | + |
| 33 | +**Integration point**: `signal/src/ruvsense/field_model.rs` already computes SVD eigenstructure. Extend it with an `estimate_occupancy()` method. |
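The counting steps above can be sketched roughly as follows. This is a minimal, std-only illustration and not the field model's actual API: the free functions `covariance` and `estimate_occupancy`, their signatures, and the absolute `noise_floor` threshold are all assumptions made for the example. The eigenvalues themselves are assumed to come from the decomposition that `field_model.rs` already performs.

```rust
/// Sample covariance across subcarriers for one sliding window.
/// `frames[t][s]` is the CSI amplitude of subcarrier `s` at frame `t`.
/// (Hypothetical helper; the real field model may organise this differently.)
pub fn covariance(frames: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let t = frames.len();
    let n = frames.first().map_or(0, |f| f.len());

    // Per-subcarrier mean over the window.
    let mut mean = vec![0.0; n];
    for f in frames {
        for (m, x) in mean.iter_mut().zip(f) {
            *m += x / t as f64;
        }
    }

    // Unbiased estimate: divide by (t - 1), guarding the single-frame case.
    let denom = (t.max(2) - 1) as f64;
    let mut cov = vec![vec![0.0; n]; n];
    for f in frames {
        for i in 0..n {
            for j in 0..n {
                cov[i][j] += (f[i] - mean[i]) * (f[j] - mean[j]) / denom;
            }
        }
    }
    cov
}

/// Count eigenvalues above the noise floor, then subtract the number
/// attributed to the static environment during calibration.
/// (Hypothetical signature; `noise_floor` and `static_count` are assumed inputs.)
pub fn estimate_occupancy(eigenvalues: &[f64], noise_floor: f64, static_count: usize) -> usize {
    let significant = eigenvalues.iter().filter(|&&v| v > noise_floor).count();
    // Saturate at zero so a noisy window never yields a negative count.
    significant.saturating_sub(static_count)
}
```

With eigenvalues `[9.0, 4.0, 1.5, 0.2, 0.1]`, a noise floor of `1.0`, and one eigenvalue attributed to the static scene, this sketch would report two people. How the noise floor is chosen (absolute, or relative to the largest eigenvalue) is left open here.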
| 34 | + |
| 35 | +### Phase 2: Signal Decomposition |
| 36 | + |
| 37 | +Separate per-person signal contributions using blind source separation. |
| 38 | + |
| 39 | +**Approach**: Non-negative Matrix Factorization (NMF) on the CSI spectrogram. |
| 40 | + |
| 41 | +- Construct a time-frequency matrix from CSI amplitudes: rows = subcarriers (56), columns = time frames |
| 42 | +- Apply NMF with k components (k = estimated person count from Phase 1) |
| 43 | +- Each component's subcarrier (frequency) profile, together with its activation over time, is taken to represent one person's motion pattern |
| 44 | +- NMF is preferred over ICA because CSI amplitudes are non-negative |
| 45 | + |
| 46 | +**Alternative**: Independent Component Analysis (ICA) on complex CSI (amplitude + phase). More powerful but requires phase calibration (see `ruvsense/phase_align.rs`). |
| 47 | + |
| 48 | +**Integration point**: New module `signal/src/ruvsense/separation.rs`. |
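As an illustration of the decomposition step, the sketch below factorises a non-negative matrix `V` (subcarriers × frames) into `W` (subcarriers × k) and `H` (k × frames). The ADR does not name an NMF solver; Lee-Seung multiplicative updates for the squared Frobenius error are used here as an assumption, and the function names, the deterministic initialisation, and the fixed iteration count are likewise hypothetical choices for the example.

```rust
type Matrix = Vec<Vec<f64>>;

fn matmul(a: &Matrix, b: &Matrix) -> Matrix {
    let (m, p, n) = (a.len(), b.len(), b[0].len());
    let mut c = vec![vec![0.0; n]; m];
    for i in 0..m {
        for l in 0..p {
            for j in 0..n {
                c[i][j] += a[i][l] * b[l][j];
            }
        }
    }
    c
}

fn transpose(a: &Matrix) -> Matrix {
    let (m, n) = (a.len(), a[0].len());
    let mut t = vec![vec![0.0; m]; n];
    for i in 0..m {
        for j in 0..n {
            t[j][i] = a[i][j];
        }
    }
    t
}

/// Squared Frobenius norm of `V - W * H`.
pub fn reconstruction_error(v: &Matrix, w: &Matrix, h: &Matrix) -> f64 {
    let wh = matmul(w, h);
    v.iter()
        .zip(&wh)
        .flat_map(|(rv, rwh)| rv.iter().zip(rwh))
        .map(|(a, b)| (a - b).powi(2))
        .sum()
}

/// NMF via multiplicative updates (assumed solver, not prescribed by the ADR).
/// Returns `(W, H)` with all entries non-negative.
pub fn nmf(v: &Matrix, k: usize, iters: usize) -> (Matrix, Matrix) {
    let (m, n) = (v.len(), v[0].len());
    let eps = 1e-9; // avoids division by zero in the update ratios

    // Deterministic positive initialisation so the sketch needs no RNG crate.
    let init = |i: usize, j: usize| 0.5 + 0.5 * (((i * 31 + j * 17 + 7) % 13) as f64 / 13.0);
    let mut w: Matrix = (0..m).map(|i| (0..k).map(|j| init(i, j)).collect()).collect();
    let mut h: Matrix = (0..k).map(|i| (0..n).map(|j| init(i + m, j)).collect()).collect();

    for _ in 0..iters {
        // H <- H .* (W^T V) ./ (W^T W H)
        let wt = transpose(&w);
        let num = matmul(&wt, v);
        let den = matmul(&matmul(&wt, &w), &h);
        for i in 0..k {
            for j in 0..n {
                h[i][j] *= num[i][j] / (den[i][j] + eps);
            }
        }
        // W <- W .* (V H^T) ./ (W H H^T)
        let ht = transpose(&h);
        let num = matmul(v, &ht);
        let den = matmul(&w, &matmul(&h, &ht));
        for i in 0..m {
            for j in 0..k {
                w[i][j] *= num[i][j] / (den[i][j] + eps);
            }
        }
    }
    (w, h)
}
```

In this layout each column of `W` is a subcarrier profile and the matching row of `H` is that component's activation over time. A production version in `separation.rs` would likely use a linear-algebra crate and a convergence check instead of a fixed iteration count.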
| 49 | + |
| 50 | +### Phase 3: Multi-Skeleton Generation |
| 51 | + |
| 52 | +Generate distinct pose skeletons per decomposed component. |
| 53 | + |
| 54 | +**Approach**: Per-component feature extraction → per-person skeleton synthesis. |
| 55 | + |
| 56 | +- Extract motion features (dominant frequency, energy, spectral centroid) per NMF component |
| 57 | +- Map each component to a spatial position using subcarrier phase gradient (Fresnel zone model) |
| 58 | +- Generate 17-keypoint COCO skeleton per person with position offset |
| 59 | +- Assign person IDs using the existing Kalman tracker (`ruvsense/pose_tracker.rs`) with AETHER re-ID embeddings (ADR-024) |
| 60 | + |
| 61 | +**Integration point**: Modify `derive_pose_from_sensing()` in `sensing-server/src/main.rs` to return a `Vec<Person>` that may contain more than one entry. |
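The per-component feature extraction listed above (dominant frequency, energy, spectral centroid) might look something like the following. This is only a sketch: the `MotionFeatures` struct, the `motion_features` function, and the use of a naive DFT over a component's temporal activation are assumptions for illustration, not existing code in the sensing server.

```rust
use std::f64::consts::PI;

/// Hypothetical container for one component's motion features.
#[derive(Debug, Clone, PartialEq)]
pub struct MotionFeatures {
    /// Sum of squared deviations from the mean activation.
    pub energy: f64,
    /// Frequency (Hz) of the strongest non-DC bin.
    pub dominant_freq_hz: f64,
    /// Power-weighted mean frequency (Hz) over non-DC bins.
    pub spectral_centroid_hz: f64,
}

/// Compute features from one component's activation over time
/// (for example, one row of the NMF coefficient matrix).
pub fn motion_features(activation: &[f64], sample_rate_hz: f64) -> MotionFeatures {
    let n = activation.len();
    if n == 0 {
        return MotionFeatures { energy: 0.0, dominant_freq_hz: 0.0, spectral_centroid_hz: 0.0 };
    }
    let mean = activation.iter().sum::<f64>() / n as f64;
    let energy = activation.iter().map(|x| (x - mean).powi(2)).sum::<f64>();

    let (mut best_bin, mut best_power) = (0usize, 0.0f64);
    let (mut weighted, mut total) = (0.0f64, 0.0f64);

    // Naive O(n^2) DFT over bins 1..=n/2; the DC bin is skipped because the
    // mean has already been removed.
    for k in 1..=n / 2 {
        let (mut re, mut im) = (0.0f64, 0.0f64);
        for (t, x) in activation.iter().enumerate() {
            let angle = -2.0 * PI * (k * t) as f64 / n as f64;
            re += (x - mean) * angle.cos();
            im += (x - mean) * angle.sin();
        }
        let power = re * re + im * im;
        let freq = k as f64 * sample_rate_hz / n as f64;
        weighted += freq * power;
        total += power;
        if power > best_power {
            best_power = power;
            best_bin = k;
        }
    }

    MotionFeatures {
        energy,
        dominant_freq_hz: best_bin as f64 * sample_rate_hz / n as f64,
        spectral_centroid_hz: if total > 0.0 { weighted / total } else { 0.0 },
    }
}
```

For windows much longer than a few hundred frames an FFT would be the more sensible choice; the naive transform is used here only to keep the example dependency-free.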
| 62 | + |
| 63 | +### Phase 4: Neural Multi-Person Model |
| 64 | + |
| 65 | +Train a dedicated multi-person model using the RVF pipeline (ADR-036). |
| 66 | + |
| 67 | +- Use MM-Fi dataset (ADR-015) multi-person scenarios for training data |
| 68 | +- Architecture: shared CSI encoder → person count head + per-person pose heads |
| 69 | +- LoRA fine-tuning profile for multi-person specialization |
| 70 | +- Inference via the model manager in the sensing server |
| 71 | + |
| 72 | +**Accuracy target**: [email protected] > 60% for 2-person scenarios. |
| 73 | + |
| 74 | +## Consequences |
| 75 | + |
| 76 | +### Positive |
| 77 | + |
| 78 | +- Enables room occupancy counting (Phase 1 alone is useful) |
| 79 | +- Distinct pose tracking per person enables activity recognition per individual |
| 80 | +- Progressive approach — each phase delivers incremental value |
| 81 | +- Reuses existing infrastructure (field model SVD, Kalman tracker, AETHER, RVF pipeline) |
| 82 | + |
| 83 | +### Negative |
| 84 | + |
| 85 | +- Single ESP32 node has fundamental spatial resolution limits — separating 2 people standing close together (< 0.5m) will be unreliable |
| 86 | +- NMF decomposition adds ~5-10ms latency per frame |
| 87 | +- Person count estimation will have false positives from large moving objects (pets, fans) |
| 88 | +- Phase 4 neural model requires multi-person training data collection |
| 89 | + |
| 90 | +### Neutral |
| 91 | + |
| 92 | +- Multi-node multistatic mesh (ADR-029) dramatically improves multi-person separation but is a separate effort |
| 93 | +- UI already supports multi-person rendering — no frontend changes needed for the `persons[]` array |
| 94 | + |
| 95 | +## Affected Components |
| 96 | + |
| 97 | +| Component | Phase | Change | |
| 98 | +|-----------|-------|--------| |
| 99 | +| `signal/src/ruvsense/field_model.rs` | 1 | Add `estimate_occupancy()` | |
| 100 | +| `signal/src/ruvsense/separation.rs` | 2 | New module: NMF decomposition | |
| 101 | +| `sensing-server/src/main.rs` | 3 | `derive_pose_from_sensing()` multi-person output | |
| 102 | +| `signal/src/ruvsense/pose_tracker.rs` | 3 | Multi-target tracking | |
| 103 | +| `nn/` | 4 | Multi-person inference head | |
| 104 | +| `train/` | 4 | Multi-person training pipeline | |
| 105 | + |
| 106 | +## Performance Budget |
| 107 | + |
| 108 | +| Operation | Budget | Phase | |
| 109 | +|-----------|--------|-------| |
| 110 | +| Person count estimation | < 2ms | 1 | |
| 111 | +| NMF decomposition (k=3) | < 10ms | 2 | |
| 112 | +| Multi-skeleton synthesis | < 3ms | 3 | |
| 113 | +| Neural inference (multi-person) | < 50ms | 4 | |
| 114 | +| **Total pipeline** | **< 65ms** (15 FPS) | All | |
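The total in the table is the sum of the four per-phase budgets, and the frame rate follows from it. The short sketch below reproduces that arithmetic; the constant and function names are hypothetical, and it assumes all four stages run sequentially on every frame.

```rust
/// Per-phase latency budgets in milliseconds, copied from the table above.
pub const BUDGET_MS: [(&str, f64); 4] = [
    ("Person count estimation", 2.0),
    ("NMF decomposition (k=3)", 10.0),
    ("Multi-skeleton synthesis", 3.0),
    ("Neural inference (multi-person)", 50.0),
];

/// Sum of all stage budgets, assuming the stages run one after another.
pub fn total_budget_ms() -> f64 {
    BUDGET_MS.iter().map(|(_, ms)| ms).sum()
}

/// Highest sustainable frame rate for a given per-frame latency.
pub fn max_fps(total_ms: f64) -> f64 {
    1000.0 / total_ms
}
```

The budgets sum to 65 ms, and 1000 / 65 is roughly 15.4, which is where the 15 FPS figure comes from.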
| 115 | + |
| 116 | +## Alternatives Considered |
| 117 | + |
| 118 | +1. **Camera fusion**: Use a camera for person detection and WiFi for pose — rejected because the project goal is camera-free sensing. |
| 119 | +2. **Multiple single-person models**: Run N independent pose estimators — rejected because they would produce correlated outputs from the same CSI data. |
| 120 | +3. **Spatial filtering (beamforming)**: Use antenna array beamforming to isolate directions — rejected because single ESP32 has only 1 antenna; viable with multistatic mesh (ADR-029). |
| 121 | +4. **Skip signal-derived, go straight to neural**: Train an end-to-end multi-person model — rejected because signal-derived provides faster iteration and interpretability for the early phases. |