TL;DR:
camroll-agentis an AI agent that does VQA on a personal camera roll.
- index your camera roll into a hierarchical queryable memory (events << captions << images).
- the agent answers questions over that memory using 5 atomic tools:
search,grep,list_by_date,get, andview_image.
| π Use API (OpenAI, Gemini) | π» Local (GPU required) |
|---|---|
git clone https://github.com/thaoshibe/camroll
cd camroll
conda create -n camroll python=3.10 -y
conda activate camroll
pip install -r requirements.txt
pip install -e camroll-agent/OpenAI + Gemini APIs. No torch (~50 MB install). |
git clone https://github.com/thaoshibe/camroll
cd camroll
conda create -n camroll-local python=3.10 -y
conda activate camroll-local
pip install -r requirements_local.txt
pip install -e camroll-agent/Adds Qwen-VL / Kimi-VL + |
Set the API key for whichever cloud backend you use:
export OPENAI_API_KEY=sk-β¦ # for OpenAI VLM/LLM + embeddings (default)
export GEMINI_API_KEY=β¦ # for Gemini VLM/LLMThe default embedding model is OpenAI's text-embedding-3-small (fast,
no local install). If you'd rather use a local sentence-transformers model
(free, offline), install requirements_local.txt and pass
--embedding-model sentence-transformers/all-MiniLM-L6-v2 at index time.
All commands below assume you are at the repo root (camroll/).
A ready-to-run sample (6 real photos) is at camroll-agent/examples/sample_conversation.json.
Preview what will be processed without calling any API:
python -m camroll_agent inspect camroll-agent/examples/sample_conversation.jsonUsing OpenAI (default):
export OPENAI_API_KEY=sk-β¦
python -m camroll_agent run camroll-agent/examples/sample_conversation.json -o memory/ \
--vlm-backend openai \
--vlm-model gpt-4o \
--embedding-model text-embedding-3-smallUsing Gemini:
export GEMINI_API_KEY=β¦
python -m camroll_agent run camroll-agent/examples/sample_conversation.json -o memory/ \
--vlm-backend gemini \
--vlm-model gemini-2.5-flash \
--embedding-model text-embedding-3-small # still needs OPENAI_API_KEY unless you use local embeddingsFully local β no API key needed (GPU required):
python -m camroll_agent run camroll-agent/examples/sample_conversation.json -o memory/ \
--vlm-backend local \
--vlm-model Qwen/Qwen2.5-VL-7B-Instruct \
--embedding-model sentence-transformers/all-MiniLM-L6-v2First run downloads Qwen2.5-VL-7B from HuggingFace (~15 GB). Cached after that.
All run flags:
| Flag | Default | Description |
|---|---|---|
-o / --output-dir |
(required) | Where to write the memory |
--vlm-backend |
openai |
openai | gemini | local |
--vlm-model |
backend default | e.g. gpt-4o, gemini-2.5-flash, Qwen/Qwen2.5-VL-7B-Instruct |
--embedding-model |
text-embedding-3-small |
OpenAI model name or any sentence-transformers ID |
--max-images |
all | Process at most N images (useful for smoke tests) |
--resume |
off | Continue an interrupted run |
Or run Stage 1 and Stage 2 separately:
python -m camroll_agent build camroll-agent/examples/sample_conversation.json -o memory/ \
--vlm-backend openai --vlm-model gpt-4o --max-images 10 --resume
python -m camroll_agent index memory/ \
--embedding-model text-embedding-3-smallUsing OpenAI (default):
export OPENAI_API_KEY=sk-β¦
python -m camroll_agent ask "When did I go to Lake Michigan?" \
--memory memory/ \
--llm-backend openai \
--llm-model gpt-4oUsing Gemini:
export GEMINI_API_KEY=β¦
python -m camroll_agent ask "When did I go to Lake Michigan?" \
--memory memory/ \
--llm-backend gemini \
--llm-model gemini-2.5-flashFully local (no API key needed, GPU required):
python -m camroll_agent ask "When did I go to Lake Michigan?" \
--memory memory/ \
--llm-backend local \
--llm-model Qwen/Qwen2.5-Coder-7B-InstructAll ask flags:
| Flag | Default | Description |
|---|---|---|
--memory |
(required) | Memory directory built in Step 2 |
--llm-backend |
openai |
openai | gemini | local (Qwen, GPU required) |
--llm-model |
backend default | e.g. gpt-4o, gemini-2.5-flash, Qwen/Qwen2.5-Coder-7B-Instruct |
--vlm-backend |
openai |
VLM used by view_image (openai | gemini | local) |
--vlm-model |
backend default | e.g. gpt-4o, gemini-2.5-flash, Qwen/Qwen2.5-VL-7B-Instruct |
--no-stream |
off | Suppress live tool output, print final answer only |
--json |
off | Output full JSON (answer + tool trace + latency) |
--max-steps |
25 |
Max ReAct steps before stopping |
--max-view-image-calls |
5 |
Cap on expensive view_image calls |
Examples:
# use a different VLM for viewing photos (default: openai)
python -m camroll_agent ask "What color was the car at the airport?" \
--memory memory/ --vlm-backend local
# get structured JSON output (answer + tool trace + latency)
python -m camroll_agent ask "When did I go to Lake Michigan?" \
--memory memory/ --json
# suppress live output, print final answer only
python -m camroll_agent ask "When did I go to Lake Michigan?" \
--memory memory/ --no-streamfrom camroll_agent import build_memory, index, Agent
build_memory.run("my_album.json", output_dir="memory/", backend="openai")
index.run("memory/")
agent = Agent(memory_dir="memory/", llm_backend="openai")
result = agent.ask("When did I go to Lake Michigan?")
print(result.final_text)
print(result.tool_trace)Streaming:
for evt, data in agent.ask_streaming("..."):
print(evt, data)The agent reasons over 5 deliberately small, single-purpose tools:
| Tool | What it does | Cost |
|---|---|---|
search(query, β¦) |
Semantic (vector) search over events + captions | cheap |
grep(query, β¦) |
Literal BM25 keyword search via SQLite FTS5 | cheap |
list_by_date(date_from, date_to, β¦) |
Pure metadata filter | cheap |
get(id) |
Fetch the full event or image record by id | cheap |
view_image(image_ids, prompt) |
Look at the actual photos with a VLM | expensive |
Every tool requires a one-sentence thought argument before it can be
called β this is the ReAct discipline. The agent terminates by emitting
plain text (no answer tool).
Any class that implements LLMClient.chat(messages, tools) works:
from camroll_agent.llm.base import LLMClient
from camroll_agent import Agent
class MyLLM(LLMClient):
def chat(self, messages, tools=None, *, tool_choice="auto"):
# return an OpenAI-shaped assistant message dict
...
agent = Agent(memory_dir="memory/", llm=MyLLM())from camroll_agent.llm.base import VLMClient
from camroll_agent import build_memory
class MyVLM(VLMClient):
def generate(self, prompt: str, image_paths: list[str]) -> str:
...
build_memory.run("my_album.json", output_dir="memory/", vlm=MyVLM())from camroll_agent import index
from camroll_agent.vector import EmbeddingClient
class MyEmbed:
def embed_many(self, texts: list[str]) -> list[list[float]]:
...
index.run("memory/", embedding_client=MyEmbed())camroll-agent/
βββ pyproject.toml
βββ camroll_agent/
β βββ __init__.py
β βββ build_memory.py Stage 1: VLM captioning + event grouping
β βββ index.py Stage 2: SQLite + FTS5 + vector store
β βββ store.py β³ SQLite schema + read/write helpers
β βββ vector.py β³ embeddings + FAISS / numpy
β βββ agent.py Stage 3: ReAct loop, pluggable backends
β βββ tools.py β³ the 5 atomic tools
β βββ prompts.py β³ system prompts + observation formatter
β βββ schemas.py β³ OpenAI-style tool schemas
β βββ cli.py `camroll-agent inspect/build/index/run/ask`
β βββ llm/ pluggable VLM + LLM backends
β βββ base.py
β βββ openai_client.py
β βββ gemini_client.py
β βββ local_client.py
βββ examples/
βββ sample_conversation.json
βββ quickstart.py
@misc{camroll,
title={Personal AI Agent for Camera Roll VQA},
author={Thao Nguyen and Krishna Kumar Singh and Donghyun Kim and Yong Jae Lee and Yuheng Li},
year={2026},
eprint={2606.05275},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.05275},
}Attribution-NonCommercial-ShareAlike 4.0 International