Moondream provides zero-shot object detection, visual question answering, and image captioning. Detect any object by describing it in natural language, with no training required. It is available as a cloud-hosted API or as a local on-device model.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
## Installation

```bash
uv add "vision-agents[moondream]"
```
## Detection (Cloud)

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import moondream, gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a vision assistant.",
    llm=gemini.Realtime(fps=10),
    processors=[
        moondream.CloudDetectionProcessor(
            detect_objects=["person", "car", "dog"],
            conf_threshold=0.3,
        )
    ],
)
```
Set `MOONDREAM_API_KEY` in your environment or pass `api_key` directly.
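For example, export the key before starting the agent (placeholder value shown):

```bash
# Make the Moondream API key available to the agent process (placeholder value).
export MOONDREAM_API_KEY="your-api-key"
```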
| Name | Type | Default | Description |
|---|---|---|---|
| `detect_objects` | `str` or `List[str]` | `"person"` | Objects to detect, described in natural language (zero-shot) |
| `conf_threshold` | `float` | `0.3` | Minimum confidence score for a detection to be reported |
| `fps` | `int` | `30` | Frame processing rate |
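To illustrate what `conf_threshold` means in practice, detections scoring below the cutoff are simply dropped. This is a standalone sketch, not the plugin's internal code; the `(label, confidence)` pairs and the `filter_detections` helper are hypothetical:

```python
from typing import List, Tuple

# Hypothetical (label, confidence) pairs, as a detector might return them.
Detection = Tuple[str, float]

def filter_detections(
    detections: List[Detection], conf_threshold: float = 0.3
) -> List[Detection]:
    """Keep only detections at or above the confidence threshold."""
    return [d for d in detections if d[1] >= conf_threshold]

raw = [("person", 0.92), ("dog", 0.28), ("car", 0.55)]
print(filter_detections(raw, conf_threshold=0.3))  # [('person', 0.92), ('car', 0.55)]
```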
## Detection (Local)

Runs on-device with no API calls. Requires `HF_TOKEN` for model access.

```python
from vision_agents.plugins import moondream

processor = moondream.LocalDetectionProcessor(
    detect_objects=["person", "car"],
    device="cuda",  # CUDA recommended for best performance
)
```
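The Hugging Face token can be supplied via the environment before the first run (placeholder value shown):

```bash
# Token used to authenticate the Moondream model download from Hugging Face (placeholder).
export HF_TOKEN="your-hf-token"
```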
## VLM (Cloud)

Visual question answering or automatic captioning.

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import moondream, deepgram, elevenlabs, getstream

llm = moondream.CloudVLM(mode="vqa")  # or mode="caption"

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    llm=llm,
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)
```
| Name | Type | Default | Description |
|---|---|---|---|
| `mode` | `str` | `"vqa"` | Operating mode: `"vqa"` (question answering) or `"caption"` (captioning) |
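Since `mode` accepts exactly two values, a small guard like the following (illustrative, not part of the plugin) can catch typos before the model is constructed:

```python
VALID_MODES = {"vqa", "caption"}

def check_mode(mode: str) -> str:
    """Validate a VLM mode string; raise ValueError on anything unexpected."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    return mode

print(check_mode("vqa"))  # vqa
```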
## VLM (Local)

```python
from vision_agents.plugins import moondream

llm = moondream.LocalVLM(mode="vqa", force_cpu=False)
```
## Cloud vs Local

| | Cloud | Local |
|---|---|---|
| Use when | You want simple setup with no infrastructure to manage | You need higher throughput and run your own GPU infrastructure |
| Pros | No model download, no GPU required, automatic updates | No rate limits, no API costs, full control |
| Cons | Requires an API key; 2 RPS rate limit (can be increased) | Requires a GPU for best performance |
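A common deployment pattern is to choose the variant from an environment flag so the same code runs in both setups. The `MOONDREAM_LOCAL` variable and the helper below are illustrative, not part of the plugin:

```python
import os
from typing import Mapping, Optional

def moondream_variant(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "local" when the hypothetical MOONDREAM_LOCAL flag is "1", else "cloud"."""
    env = os.environ if env is None else env
    return "local" if env.get("MOONDREAM_LOCAL") == "1" else "cloud"

print(moondream_variant({"MOONDREAM_LOCAL": "1"}))  # local
print(moondream_variant({}))  # cloud
```

The returned string can then select between `CloudDetectionProcessor` and `LocalDetectionProcessor` (or `CloudVLM` and `LocalVLM`) at startup.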
Local models require `HF_TOKEN` for Hugging Face authentication. CUDA is recommended for best performance.
## Next Steps