MLLCV records data as episode-based observation-action trajectories.
Robotic perception and VLA training require synchronized time-series data, not independent images: every step must capture what the system sees, what it believes, and what action it takes.
Each step should include:
- timestamp
- episode ID
- frame ID
- language instruction
- RGB frame path
- IR frame path, if available
- detection bbox
- tracking score
- target center
- Kalman prediction state
- gimbal yaw/pitch/zoom telemetry
- expert action: yaw rate, pitch rate, zoom command, stop, mode
- latency in milliseconds
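The field list above could be captured as a typed record, with each episode written as one JSON line per step. This is a minimal sketch, not the project's actual schema: the names `StepRecord`, `ExpertAction`, and `to_jsonl_line`, the units, and the bbox and Kalman state layouts are all assumptions made for illustration.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ExpertAction:
    """Hypothetical container for the expert action fields."""
    yaw_rate: float    # assumed unit: deg/s
    pitch_rate: float  # assumed unit: deg/s
    zoom_cmd: float    # assumed: normalized zoom command
    stop: bool
    mode: str          # free-form mode label; valid values are not specified here


@dataclass
class StepRecord:
    """Hypothetical per-step record mirroring the field list above."""
    timestamp: float                      # assumed: seconds since the Unix epoch
    episode_id: str
    frame_id: int
    instruction: str                      # language instruction for the episode
    rgb_path: str
    ir_path: Optional[str]                # None when no IR frame is available
    bbox: Optional[list[float]]           # assumed layout: [x, y, w, h] in pixels
    tracking_score: Optional[float]
    target_center: Optional[list[float]]  # assumed layout: [cx, cy] in pixels
    kalman_state: Optional[list[float]]   # assumed layout: [cx, cy, vx, vy]
    gimbal_yaw: float
    gimbal_pitch: float
    gimbal_zoom: float
    action: ExpertAction
    latency_ms: float


def to_jsonl_line(step: StepRecord) -> str:
    """Serialize one step as a single JSON line (nested dataclasses included)."""
    return json.dumps(asdict(step), sort_keys=True)
```

Keeping one record per line means an episode can be appended to while recording and streamed back step by step during training, without loading the whole file.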
The expert action may come from:
- Visual servo controller
- Human teleoperation
- Hybrid controller
- Replay from a validated trajectory
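Recording where each expert action came from makes it possible to filter or weight trajectories later. One way to do that is a small enumeration stored alongside each step; the names `ActionSource` and `parse_action_source` and the string values are hypothetical, chosen only to mirror the four sources listed above.

```python
from enum import Enum


class ActionSource(str, Enum):
    """Hypothetical labels for the origin of an expert action."""
    VISUAL_SERVO = "visual_servo"
    TELEOP = "teleop"
    HYBRID = "hybrid"
    REPLAY = "replay"


def parse_action_source(value: str) -> ActionSource:
    """Map a raw string to an ActionSource, ignoring case and outer whitespace."""
    try:
        return ActionSource(value.strip().lower())
    except ValueError:
        allowed = ", ".join(source.value for source in ActionSource)
        raise ValueError(
            f"unknown action source {value!r}; expected one of: {allowed}"
        ) from None
```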
Do not commit real recordings to Git. Use:
- local data/ directory during development
- Git LFS for small public demo assets
- Hugging Face Dataset for public sanitized datasets
- private storage for sensitive real-world recordings
Do not publish private faces, license plates, company interiors, device credentials, RTSP URLs, or sensitive scenes.
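Part of this check can be automated before a dataset is exported. The sketch below redacts RTSP URLs (which often embed device credentials) from every string in a metadata record. The function names and the placeholder text are assumptions, and the approach covers only text metadata: faces, license plates, and sensitive scenes in the images themselves still need separate review.

```python
import re

# Matches rtsp:// and rtsps:// URLs up to the next whitespace, quote, or angle bracket.
_RTSP_URL = re.compile(r"rtsps?://[^\s\"'<>]+", re.IGNORECASE)


def redact_rtsp_urls(text: str, placeholder: str = "[REDACTED_RTSP_URL]") -> str:
    """Replace every RTSP URL in text with a placeholder."""
    return _RTSP_URL.sub(placeholder, text)


def scrub_record(record: dict) -> dict:
    """Return a copy of record with RTSP URLs redacted in all nested string values."""
    def scrub(value):
        if isinstance(value, str):
            return redact_rtsp_urls(value)
        if isinstance(value, dict):
            return {key: scrub(item) for key, item in value.items()}
        if isinstance(value, list):
            return [scrub(item) for item in value]
        return value  # numbers, booleans, and None pass through unchanged

    return scrub(record)
```

The original record is left unmodified, so the private copy keeps its full metadata while only the scrubbed copy is published.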