Moondream provides zero-shot object detection, visual question answering, and image captioning. Detect any object by describing it in natural language, with no training required. It is available as a cloud-hosted API or as a local on-device model.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
## Installation

```bash
uv add "vision-agents[moondream]"
```
## Detection (Cloud)

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import moondream, gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a vision assistant.",
    llm=gemini.Realtime(fps=10),
    processors=[
        moondream.CloudDetectionProcessor(
            detect_objects=["person", "car", "dog"],
            conf_threshold=0.3,
        )
    ],
)
```
Set `MOONDREAM_API_KEY` in your environment or pass `api_key` directly.
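For example, export the key before starting the agent (placeholder value shown):

```bash
# Make the Moondream API key available to the agent process (placeholder value).
export MOONDREAM_API_KEY="your-api-key"
```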
| Name | Type | Default | Description |
|---|---|---|---|
| `detect_objects` | `str` or `List[str]` | `"person"` | Objects to detect, described in natural language (zero-shot) |
| `conf_threshold` | `float` | `0.3` | Minimum confidence score for a detection to be reported |
| `fps` | `int` | `30` | Frame processing rate |
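To illustrate what `conf_threshold` means in practice, detections scoring below the cutoff are simply dropped. This is a standalone sketch, not the plugin's internal code; the `(label, confidence)` pairs and the `filter_detections` helper are hypothetical:

```python
from typing import List, Tuple

# Hypothetical (label, confidence) pairs, as a detector might return them.
Detection = Tuple[str, float]

def filter_detections(
    detections: List[Detection], conf_threshold: float = 0.3
) -> List[Detection]:
    """Keep only detections at or above the confidence threshold."""
    return [d for d in detections if d[1] >= conf_threshold]

raw = [("person", 0.92), ("dog", 0.28), ("car", 0.55)]
print(filter_detections(raw, conf_threshold=0.3))  # [('person', 0.92), ('car', 0.55)]
```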
## Detection (Local)

Runs on-device with no API calls. Requires `HF_TOKEN` for model access.

```python
from vision_agents.plugins import moondream

processor = moondream.LocalDetectionProcessor(
    detect_objects=["person", "car"],
    device="cuda",  # CUDA recommended for best performance
)
```
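The Hugging Face token can be supplied via the environment before the first run (placeholder value shown):

```bash
# Token used to authenticate the Moondream model download from Hugging Face (placeholder).
export HF_TOKEN="your-hf-token"
```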
## VLM (Cloud)

Visual question answering or automatic captioning.

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import moondream, deepgram, elevenlabs, getstream

llm = moondream.CloudVLM(mode="vqa")  # or mode="caption"

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    llm=llm,
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)
```
| Name | Type | Default | Description |
|---|---|---|---|
| `mode` | `str` | `"vqa"` | Operating mode: `"vqa"` (question answering) or `"caption"` (captioning) |
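Since `mode` accepts exactly two values, a small guard like the following (illustrative, not part of the plugin) can catch typos before the model is constructed:

```python
VALID_MODES = {"vqa", "caption"}

def check_mode(mode: str) -> str:
    """Validate a VLM mode string; raise ValueError on anything unexpected."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    return mode

print(check_mode("vqa"))  # vqa
```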
## VLM (Local)

```python
from vision_agents.plugins import moondream

llm = moondream.LocalVLM(mode="vqa", force_cpu=False)
```
## Cloud vs Local

| | Cloud | Local |
|---|---|---|
| Use when | You want simple setup with no infrastructure to manage | You need higher throughput and run your own GPU infrastructure |
| Pros | No model download, no GPU required, automatic updates | No rate limits, no API costs, full control |
| Cons | Requires an API key; 2 RPS rate limit (can be increased) | Requires a GPU for best performance |
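A common deployment pattern is to choose the variant from an environment flag so the same code runs in both setups. The `MOONDREAM_LOCAL` variable and the helper below are illustrative, not part of the plugin:

```python
import os
from typing import Mapping, Optional

def moondream_variant(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "local" when the hypothetical MOONDREAM_LOCAL flag is "1", else "cloud"."""
    env = os.environ if env is None else env
    return "local" if env.get("MOONDREAM_LOCAL") == "1" else "cloud"

print(moondream_variant({"MOONDREAM_LOCAL": "1"}))  # local
print(moondream_variant({}))  # cloud
```

The returned string can then select between `CloudDetectionProcessor` and `LocalDetectionProcessor` (or `CloudVLM` and `LocalVLM`) at startup.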
Local models require `HF_TOKEN` for Hugging Face authentication. CUDA is recommended for best performance.
## Next Steps