Moondream provides zero-shot object detection, visual question answering, and image captioning. Detect any object by describing it in natural language, without training. Available as a cloud-hosted API or as a local on-device model.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Installation

uv add "vision-agents[moondream]"

Detection (Cloud)

from vision_agents.core import Agent, User
from vision_agents.plugins import moondream, gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a vision assistant.",
    llm=gemini.Realtime(fps=10),
    processors=[
        moondream.CloudDetectionProcessor(
            detect_objects=["person", "car", "dog"],
            conf_threshold=0.3,
        )
    ],
)
Set MOONDREAM_API_KEY in your environment or pass api_key directly.
| Name | Type | Default | Description |
|---|---|---|---|
| detect_objects | str or List[str] | "person" | Objects to detect (zero-shot) |
| conf_threshold | float | 0.3 | Confidence threshold |
| fps | int | 30 | Frame processing rate |
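Since detect_objects accepts either a single string or a list of strings, it is typically normalized to a list before use. A minimal sketch of that handling (an assumption about common practice, not the plugin's actual internals):

```python
def normalize_objects(detect_objects):
    """Accept a single label or a list of labels; always return a list."""
    if isinstance(detect_objects, str):
        return [detect_objects]
    return list(detect_objects)

normalize_objects("person")           # ["person"]
normalize_objects(["person", "car"])  # ["person", "car"]
```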

Detection (Local)

Runs on-device without API calls. Requires HF_TOKEN for model access.

processor = moondream.LocalDetectionProcessor(
    detect_objects=["person", "car"],
    device="cuda",
)
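When CUDA is not guaranteed to be available, the device argument can be chosen at runtime. The helper below is a hypothetical convenience, not part of the plugin; it falls back to CPU when torch is missing or no GPU is present:

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-capable torch install is present, else "cpu"."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"

# processor = moondream.LocalDetectionProcessor(
#     detect_objects=["person", "car"],
#     device=pick_device(),
# )
```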

VLM (Cloud)

Visual question answering or automatic captioning.

from vision_agents.core import Agent, User
from vision_agents.plugins import moondream, deepgram, elevenlabs, getstream

llm = moondream.CloudVLM(mode="vqa")  # or "caption"

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    llm=llm,
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)
| Name | Type | Default | Description |
|---|---|---|---|
| mode | str | "vqa" | Mode ("vqa" or "caption") |

VLM (Local)

llm = moondream.LocalVLM(mode="vqa", force_cpu=False)

Cloud vs Local

|  | Cloud | Local |
|---|---|---|
| Use when | Simple setup, no infrastructure management | Higher throughput, own GPU infrastructure |
| Pros | No model download, no GPU required, automatic updates | No rate limits, no API costs, full control |
| Cons | Requires API key, 2 RPS rate limit (can be increased) | Requires GPU for best performance |
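The trade-offs above can be folded into a small selection helper. This is a hypothetical convenience wrapper, not part of the plugin API; it only inspects which credentials are set:

```python
import os

def choose_backend(env=None) -> str:
    """Pick "cloud" when a Moondream API key is set, else "local" when HF_TOKEN is."""
    env = os.environ if env is None else env
    if env.get("MOONDREAM_API_KEY"):
        return "cloud"   # e.g. moondream.CloudDetectionProcessor / CloudVLM
    if env.get("HF_TOKEN"):
        return "local"   # e.g. moondream.LocalDetectionProcessor / LocalVLM
    raise RuntimeError("Set MOONDREAM_API_KEY (cloud) or HF_TOKEN (local)")
```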
Local models require HF_TOKEN for HuggingFace authentication. CUDA recommended for best performance.

Next Steps