Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
Installation
Realtime
Native speech-to-speech over WebRTC with built-in STT/TTS.

| Name | Type | Default | Description |
|---|---|---|---|
| model | str | "gpt-realtime" | OpenAI realtime model |
| voice | str | "marin" | Voice ("marin", "alloy", "echo", etc.) |
| fps | int | 1 | Video frames per second |
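
To make the Realtime defaults concrete, here is a minimal sketch that mirrors the three parameters above in a plain dataclass. `RealtimeSettings`, `frame_interval_seconds`, and `as_kwargs` are hypothetical names for illustration only and are not part of Vision Agents. Rejecting `fps` values below 1 is an assumption of this sketch, not documented library behaviour.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RealtimeSettings:
    """Hypothetical container mirroring the Realtime parameters listed above."""

    model: str = "gpt-realtime"  # default from the table
    voice: str = "marin"         # default from the table
    fps: int = 1                 # default from the table

    def __post_init__(self) -> None:
        # Assumption of this sketch: fps must be at least 1.
        if self.fps < 1:
            raise ValueError("fps must be a positive integer")

    @property
    def frame_interval_seconds(self) -> float:
        # At N frames per second, consecutive frames are 1/N seconds apart.
        return 1.0 / self.fps

    def as_kwargs(self) -> dict:
        # Plain dict that could be unpacked into a constructor call.
        return asdict(self)


defaults = RealtimeSettings()
faster_video = RealtimeSettings(voice="alloy", fps=2)
```

A dict produced by `as_kwargs()` could then be unpacked into whichever constructor accepts these three parameters, which keeps overrides in one place.
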
LLM
Uses the Responses API (default for GPT-5+). Requires separate STT/TTS.

| Name | Type | Default | Description |
|---|---|---|---|
| model | str | — | Model (e.g., "gpt-4o", "gpt-5") |
| api_key | str | None | API key (defaults to OPENAI_API_KEY env var) |
| base_url | str | None | Custom API endpoint |
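
The fallback described for `api_key` and the optional `base_url` can be sketched as follows. This is only an illustration of the resolution order the table implies. `resolve_llm_settings` is a hypothetical helper, not a Vision Agents function, and raising an error when no key is found is an assumption of this sketch.

```python
import os
from typing import Optional


def resolve_llm_settings(
    model: str,
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
) -> dict:
    """Hypothetical helper: apply the defaults described in the table above."""
    # An explicit api_key wins; otherwise fall back to the environment variable.
    key = api_key if api_key is not None else os.environ.get("OPENAI_API_KEY")
    if not key:
        # Assumption of this sketch: fail early instead of at request time.
        raise ValueError("No API key: pass api_key or set OPENAI_API_KEY")

    settings = {"model": model, "api_key": key}
    # base_url is only included when a custom endpoint is requested.
    if base_url is not None:
        settings["base_url"] = base_url
    return settings
```

With `OPENAI_API_KEY` set, `resolve_llm_settings("gpt-4o")` returns the model and the key from the environment. Passing `api_key` or `base_url` explicitly overrides or extends that result.
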
ChatCompletionsLLM
Works with any OpenAI-compatible API (Together AI, Fireworks, DeepSeek, etc.).
TTS
Streaming text-to-speech.

| Name | Type | Default | Description |
|---|---|---|---|
| model | str | "gpt-4o-mini-tts" | TTS model |
| voice | str | "alloy" | Voice ("alloy", "echo", "fable", "onyx", "nova", "shimmer") |

