Commit 36b0d27

docs: ADR-037 multi-person pose detection from single ESP32 CSI stream
Four-phase approach: eigenvalue-based person count estimation, NMF signal decomposition, multi-skeleton generation with Kalman tracking, and neural multi-person model training via the RVF pipeline. Ref: ruvnet#97

Co-Authored-By: claude-flow <[email protected]>
1 parent 113011e commit 36b0d27

1 file changed

Lines changed: 121 additions & 0 deletions

# ADR-037: Multi-Person Pose Detection from Single ESP32 CSI Stream

- **Status**: Proposed
- **Date**: 2026-03-02
- **Issue**: [#97](https://github.com/ruvnet/wifi-densepose/issues/97)
- **Deciders**: @ruvnet
- **Supersedes**: None
- **Related**: ADR-014 (SOTA signal processing), ADR-024 (AETHER re-ID), ADR-029 (multistatic sensing), ADR-036 (RVF training pipeline)
## Context
The current signal-derived pose estimation pipeline (`derive_pose_from_sensing()` in the sensing server) generates at most one skeleton per frame from aggregate CSI features. When multiple people are present, only a single blended skeleton is produced. Live testing with ESP32 hardware confirmed this: two people in the room yield one detected person.

A single ESP32 node provides 1 TX × 1 RX × 56 subcarriers of CSI data per frame. While this is limited spatial resolution compared to camera-based systems, the signal contains composite reflections from all scatterers in the environment. The challenge is decomposing these composite signals into per-person contributions.
## Decision
Implement multi-person pose detection in four phases, progressively improving accuracy from heuristic to neural approaches.
### Phase 1: Person Count Estimation
Estimate occupancy count from CSI signal statistics, without decomposition.

**Approach**: Eigenvalue analysis of the CSI covariance matrix across subcarriers.

- Compute the 56×56 covariance matrix of CSI amplitudes over a sliding window (e.g., 50 frames / 5 seconds)
- Count eigenvalues above a noise threshold — each significant eigenvalue corresponds to an independent scatterer (person or static object)
- Subtract the static environment baseline (estimated during calibration or from the field model's SVD eigenstructure)
- The residual count of significant eigenvalues estimates the person count

**Accuracy target**: > 80% for 0-3 people with a single ESP32 node.

**Integration point**: `signal/src/ruvsense/field_model.rs` already computes the SVD eigenstructure. Extend it with an `estimate_occupancy()` method.
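The eigenvalue procedure above can be sketched as follows. This is an illustrative NumPy prototype, not the Rust implementation the ADR targets; the function name mirrors the proposed `estimate_occupancy()` method, but the `noise_floor` value and the `static_rank` baseline parameter are assumptions for the sketch.

```python
import numpy as np

def estimate_occupancy(csi_window, noise_floor=0.5, static_rank=0):
    """Estimate person count from a CSI amplitude window.

    csi_window: (n_subcarriers, n_frames) amplitudes, e.g. 56 x 50.
    noise_floor: eigenvalues below this are treated as noise
        (assumed value; tuned against calibration data in practice).
    static_rank: significant eigenvalues attributed to the static
        environment, taken from a calibration baseline.
    """
    # Zero-mean each subcarrier over time, then form the covariance
    # matrix across subcarriers (56x56 for a single ESP32 node).
    centered = csi_window - csi_window.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / centered.shape[1]
    # The covariance matrix is symmetric, so eigvalsh applies.
    eigvals = np.linalg.eigvalsh(cov)
    significant = int(np.sum(eigvals > noise_floor))
    # Residual count after removing static scatterers.
    return max(significant - static_rank, 0)
```

On synthetic data where two independently moving scatterers modulate the subcarriers, the window's covariance shows two eigenvalues well above the noise floor, and subtracting a static baseline of one drops the estimate accordingly.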
### Phase 2: Signal Decomposition
Separate per-person signal contributions using blind source separation.

**Approach**: Non-negative Matrix Factorization (NMF) on the CSI spectrogram.

- Construct a time-frequency matrix from CSI amplitudes: rows = subcarriers (56), columns = time frames
- Apply NMF with k components (k = estimated person count from Phase 1)
- Each component's frequency profile maps to a person's motion pattern
- NMF is preferred over ICA because CSI amplitudes are non-negative

**Alternative**: Independent Component Analysis (ICA) on complex CSI (amplitude + phase). More powerful, but it requires phase calibration (see `ruvsense/phase_align.rs`).

**Integration point**: New module `signal/src/ruvsense/separation.rs`.
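A minimal NMF sketch of the decomposition step, again in NumPy rather than the Rust module the ADR proposes. It uses the classic Lee-Seung multiplicative updates; the function name, iteration count, and initialization scheme are assumptions for illustration.

```python
import numpy as np

def nmf_decompose(spectrogram, k, n_iter=500, seed=0):
    """Factor a non-negative CSI spectrogram V (subcarriers x frames)
    into W (subcarriers x k) and H (k x frames) with V ~= W @ H,
    using Lee-Seung multiplicative updates. Row i of H is the temporal
    activation of component i (one component per estimated person)."""
    rng = np.random.default_rng(seed)
    n, m = spectrogram.shape
    # Strictly positive random init keeps the updates well-defined.
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    eps = 1e-9  # guards against division by zero
    for _ in range(n_iter):
        H *= (W.T @ spectrogram) / (W.T @ W @ H + eps)
        W *= (spectrogram @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Multiplicative updates preserve non-negativity by construction, which is why NMF fits amplitude data directly while ICA would need calibrated complex CSI.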
### Phase 3: Multi-Skeleton Generation
Generate a distinct pose skeleton per decomposed component.

**Approach**: Per-component feature extraction → per-person skeleton synthesis.

- Extract motion features (dominant frequency, energy, spectral centroid) per NMF component
- Map each component to a spatial position using the subcarrier phase gradient (Fresnel zone model)
- Generate a 17-keypoint COCO skeleton per person with a position offset
- Assign person IDs using the existing Kalman tracker (`ruvsense/pose_tracker.rs`) with AETHER re-ID embeddings (ADR-024)

**Integration point**: Modify `derive_pose_from_sensing()` in `sensing-server/src/main.rs` to return a `Vec<Person>` that can contain more than one entry.
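The per-component feature extraction in the first bullet can be sketched like this: an FFT over one component's temporal activation yields the dominant frequency, energy, and spectral centroid that feed skeleton synthesis. A NumPy illustration with an assumed function name and frame rate; the production path would live in the Rust pipeline.

```python
import numpy as np

def component_features(activation, fps=10.0):
    """Motion features for one NMF component's temporal activation
    (a row of H): dominant frequency, energy, spectral centroid."""
    x = activation - activation.mean()          # drop the DC term
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    return {
        "dominant_hz": float(freqs[np.argmax(spectrum)]),
        "energy": float(np.sum(x ** 2)),
        "centroid_hz": float(np.sum(freqs * spectrum)
                             / (np.sum(spectrum) + 1e-9)),
    }
```

For a component whose activation oscillates at 1 Hz (e.g., a walking-pace limb motion), both the dominant frequency and the centroid land near 1 Hz.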
### Phase 4: Neural Multi-Person Model
Train a dedicated multi-person model using the RVF pipeline (ADR-036).

- Use the MM-Fi dataset (ADR-015) multi-person scenarios for training data
- Architecture: shared CSI encoder → person count head + per-person pose heads
- LoRA fine-tuning profile for multi-person specialization
- Inference via the model manager in the sensing server

**Accuracy target**: [email protected] > 60% for 2-person scenarios.
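The shared-encoder / two-head layout can be made concrete at the shape level. This is only a dimensional sketch in NumPy with random weights — the hidden size, head shapes, and function names are all assumptions, and the real model would be trained through the RVF pipeline rather than hand-built like this.

```python
import numpy as np

def init_params(n_sub=56, hidden=64, max_persons=3, seed=0):
    """Random weights for a shape-level sketch of the architecture."""
    rng = np.random.default_rng(seed)
    return {
        "enc_w": rng.normal(scale=0.1, size=(hidden, n_sub)),
        "enc_b": np.zeros(hidden),
        "cnt_w": rng.normal(scale=0.1, size=(max_persons + 1, hidden)),
        "cnt_b": np.zeros(max_persons + 1),
        "pose_w": rng.normal(scale=0.1, size=(max_persons * 17 * 2, hidden)),
        "pose_b": np.zeros(max_persons * 17 * 2),
    }

def multi_person_forward(csi, params, max_persons=3):
    """One CSI frame feeds a shared encoder; its features drive both a
    person-count head (logits for 0..max_persons people) and per-person
    17-keypoint (x, y) pose heads."""
    h = np.tanh(params["enc_w"] @ csi + params["enc_b"])
    count_logits = params["cnt_w"] @ h + params["cnt_b"]
    poses = (params["pose_w"] @ h + params["pose_b"]).reshape(max_persons, 17, 2)
    return count_logits, poses
```

Sharing the encoder forces the count and pose heads to agree on one set of CSI features, which is the point of the joint architecture.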
## Consequences
### Positive

- Enables room occupancy counting (Phase 1 alone is useful)
- Distinct per-person pose tracking enables per-individual activity recognition
- Progressive approach — each phase delivers incremental value
- Reuses existing infrastructure (field model SVD, Kalman tracker, AETHER, RVF pipeline)

### Negative

- A single ESP32 node has fundamental spatial-resolution limits — separating two people standing close together (< 0.5 m) will be unreliable
- NMF decomposition adds ~5-10 ms of latency per frame
- Person count estimation will produce false positives from large moving objects (pets, fans)
- The Phase 4 neural model requires multi-person training data collection

### Neutral

- A multi-node multistatic mesh (ADR-029) would dramatically improve multi-person separation, but that is a separate effort
- The UI already supports multi-person rendering — no frontend changes are needed for the `persons[]` array
## Affected Components
| Component | Phase | Change |
|-----------|-------|--------|
| `signal/src/ruvsense/field_model.rs` | 1 | Add `estimate_occupancy()` |
| `signal/src/ruvsense/separation.rs` | 2 | New module: NMF decomposition |
| `sensing-server/src/main.rs` | 3 | `derive_pose_from_sensing()` multi-person output |
| `signal/src/ruvsense/pose_tracker.rs` | 3 | Multi-target tracking |
| `nn/` | 4 | Multi-person inference head |
| `train/` | 4 | Multi-person training pipeline |
## Performance Budget
| Operation | Budget | Phase |
|-----------|--------|-------|
| Person count estimation | < 2 ms | 1 |
| NMF decomposition (k=3) | < 10 ms | 2 |
| Multi-skeleton synthesis | < 3 ms | 3 |
| Neural inference (multi-person) | < 50 ms | 4 |
| **Total pipeline** | **< 65 ms** (15 FPS) | All |
## Alternatives Considered
1. **Camera fusion**: Use a camera for person detection and WiFi for pose — rejected because the project goal is camera-free sensing.
2. **Multiple single-person models**: Run N independent pose estimators — rejected because they would produce correlated outputs from the same CSI data.
3. **Spatial filtering (beamforming)**: Use antenna-array beamforming to isolate directions — rejected because a single ESP32 has only one antenna; viable with the multistatic mesh (ADR-029).
4. **Skip signal-derived, go straight to neural**: Train an end-to-end multi-person model — rejected because the signal-derived approach provides faster iteration and interpretability for the early phases.
