Sixiang Chen1,2, Zhaohu Xing1, Tian Ye1, Xinyu Geng3, Yunlong Lin, Jianyu Lai1,2, Xuanhua He3, Fuxiang Zhai1, Jialin Gao4,‡, Lei Zhu1,3,†
1The Hong Kong University of Science and Technology (Guangzhou)
2Meituan
3The Hong Kong University of Science and Technology
4National University of Singapore
Project Leader: Junfeng Luo (Meituan)
The same trained agent policy paired with two reference-conditioned generators: Qwen-Image-Edit (open) · Nano Banana Pro (strong)
GenEvolve formulates open-ended image generation as a tool-orchestrated visual trajectory. The agent gathers external textual evidence, retrieves visual references, performs internal knowledge activation through callable generation skills, and synthesizes a prompt-reference program.
The released GenEvolve policy is based on Qwen3-VL-8B and is designed to be generator-transferable: the same agent output can be rendered by the open Qwen-Image-Edit backend or by a stronger proprietary renderer such as Nano Banana Pro.
| Component | Where |
|---|---|
| Trained agent policy GenEvolve (Qwen3-VL-8B-based) | Hugging Face: MeiGen-AI/GenEvolve |
| Standalone inference runtime (`GenEvolveAgent`, OpenAI-compatible) | this repo |
| Three tools (`search`, `image_search`, `query_knowledge`) | this repo |
| The eight skill markdown files used at training time | this repo |
| Reference-conditioned generator wrappers (Qwen-Image-Edit + Nano Banana Pro) | this repo |
| SFT trajectories (9,000 records) | Hugging Face: MeiGen-AI/GenEvolve-Data-Bench / GenEvolve-Data-SFT/ |
| Self-evolution prompts + GT images (3,175 records) | Hugging Face: MeiGen-AI/GenEvolve-Data-Bench / GenEvolve-Data-RL/ |
| Held-out evaluation benchmark (594 prompts + GT images) | Hugging Face: MeiGen-AI/GenEvolve-Data-Bench / GenEvolve-Bench/ |
GenEvolve has a main runtime environment for policy serving, agent rollouts, tool execution, and benchmark inference. This is not the only process used in a full image-generation pipeline: for reproducible Qwen rendering, run Qwen-Image-Edit as a separate FastAPI/diffusers service and call it from GenEvolve through --service-url.
Use this environment for the released agent code path: serving GenEvolve, running the agent, calling tools, using the Nano client, and calling a Qwen service endpoint. Install it once using the Quickstart commands below.
| Component | Version | Notes |
|---|---|---|
| Python | 3.11 | |
| CUDA stack | CUDA 12.x; our logs used PyTorch CUDA 12.8 wheels | |
| `torch` / `torchvision` | 2.8.0 / 0.23.0 | |
| `transformers` | 4.57.1 | |
| `vllm` | 0.11.0 | |
| `ray` | 2.54.1 | |
| `flash-attn` | 2.8.3 | |
This environment does not install or launch external services such as Qwen-Image-Edit, Serper, or the Google image API. Those are configured separately.
| Service | Variable | Used for |
|---|---|---|
| serper.dev | `SERPER_API_KEY` | required for `search` and `image_search` |
| Google Generative Language API | `GOOGLE_API_KEY` or `GEMINI_API_KEY` | only for `--backend nano-banana-pro` |
| Qwen-Image-Edit FastAPI service | `--service-url` | only for `--backend qwen-image-edit-service` |
For Qwen rendering, use a separate service environment instead of mixing the diffusion stack into the vLLM server. A typical working stack is Python 3.11, PyTorch/torchvision 2.6.0/0.21.0 with CUDA 12.4 wheels, diffusers>=0.38, transformers>=4.57, accelerate, fastapi, uvicorn, pillow, and requests.
```bash
conda create -n qwenimage python=3.11 -y
conda activate qwenimage
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
pip install "diffusers>=0.38" "transformers>=4.57" accelerate fastapi uvicorn pillow requests
```

Start any Qwen-Image-Edit FastAPI service compatible with `POST /generate`; a common deployment is one Qwen pipeline per visible GPU, with one HTTP endpoint such as `http://host:8001`. GenEvolve sends requests with `--backend qwen-image-edit-service --service-url http://host:8001`.
```bash
git clone https://github.com/MeiGen-AI/GenEvolve.git
cd GenEvolve
conda create -n genevolve python=3.11 -y
conda activate genevolve
pip install -U pip setuptools wheel packaging psutil ninja
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install --no-build-isolation -r requirements.txt
pip install -e .
```

This installs only the main GenEvolve runtime: vLLM serving, the agent tools, and lightweight generator clients/wrappers. It does not install or start the separate Qwen-Image-Edit service; set up that service from the Qwen environment section above when using `--backend qwen-image-edit-service`.
Put the Hugging Face checkpoint directory in MODEL_PATH. The serving scripts support both tensor parallelism (TP) and data parallel replicas (DP).
- `TP` shards one model replica across multiple GPUs.
- `DP` launches multiple model replicas to improve throughput for many concurrent prompts.
- Total GPU usage is `TP × DP`.
- Use a larger `DP` when `scripts/run_agent.py --parallel` is large and each request fits on one GPU.
- Use a larger `TP` when one model replica needs more memory or longer context than one GPU can provide.
```bash
# Single GPU / single replica.
MODEL_PATH=MeiGen-AI/GenEvolve PORT=8000 TP=1 DP=1 bash scripts/serve_vllm.sh

# Higher throughput on one 8-GPU node: 8 replicas, one GPU per replica.
MODEL_PATH=MeiGen-AI/GenEvolve PORT=8000 TP=1 DP=8 bash scripts/serve_vllm.sh

# If one replica needs more memory: 4 replicas, two GPUs per replica.
MODEL_PATH=MeiGen-AI/GenEvolve PORT=8000 TP=2 DP=4 bash scripts/serve_vllm.sh
```

For example, `TP=8 DP=1` is one model replica sharded over 8 GPUs. It is not 8 independent services. For throughput on one 8-GPU node, prefer `TP=1 DP=8` if the model fits on one GPU; use `TP=2 DP=4` or `TP=4 DP=2` when each replica needs multiple GPUs.
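The `TP × DP` arithmetic above can be checked before launching a server. The following is a minimal sketch, not part of the repository: `plan_gpus` and `layouts_for_node` are hypothetical helper names, and the sketch only covers the GPU-count rule stated above, not any further constraints the serving stack may impose on `TP`.

```python
def plan_gpus(tp: int, dp: int, gpus_available: int) -> int:
    """Return the GPUs a TP x DP layout needs, or raise if it does not fit."""
    if tp < 1 or dp < 1:
        raise ValueError("TP and DP must both be >= 1")
    total = tp * dp  # total GPU usage is TP x DP
    if total > gpus_available:
        raise ValueError(
            f"TP={tp} x DP={dp} needs {total} GPUs, only {gpus_available} available"
        )
    return total


def layouts_for_node(gpus: int, min_tp: int = 1) -> list[tuple[int, int]]:
    """List (TP, DP) pairs that use every GPU on the node, most replicas first."""
    return [
        (tp, gpus // tp)
        for tp in range(min_tp, gpus + 1)
        if gpus % tp == 0
    ]


if __name__ == "__main__":
    print(plan_gpus(tp=2, dp=4, gpus_available=8))  # 8
    print(layouts_for_node(8))            # [(1, 8), (2, 4), (4, 2), (8, 1)]
    print(layouts_for_node(8, min_tp=2))  # [(2, 4), (4, 2), (8, 1)]
```

Picking the first entry of `layouts_for_node` matches the advice above: prefer the layout with the most replicas that still gives each replica enough memory.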
```bash
export SERPER_API_KEY=<your_key>      # required for search and image_search
export GOOGLE_API_KEY=<your_key>      # or GEMINI_API_KEY; only for Nano Banana Pro

python examples/quickstart.py \
  --backend nano-banana-pro \
  --base-url http://localhost:8000/v1 \
  --model GenEvolve \
  --prompt "A 1990s travel-magazine cover of two backpackers in front of the Eiffel Tower at golden hour, the title \"PARIS\" rendered in bold serif type." \
  --output paris.png
```

For the open-generator path, use `--backend qwen-image-edit-service` with one or more Qwen-Image-Edit service endpoints:

```bash
python examples/quickstart.py \
  --backend qwen-image-edit-service \
  --service-url http://your-qwen-service:8001 \
  --base-url http://localhost:8000/v1 \
  --model GenEvolve \
  --output paris_qwen.png
```

`--backend qwen-image-edit` is kept only as a local diffusers debug path when the Qwen-Image-Edit dependencies are installed in the active environment.
The agent rollout and the heavy image rendering are split into two stages so they can run on different machines.
```bash
# Stage 1: agent rollouts -> results.json.
python scripts/run_agent.py \
  --input examples/example_prompts.jsonl \
  --output-dir runs/example \
  --base-url http://localhost:8000/v1 \
  --model GenEvolve \
  --parallel 4

# Stage 2a: render through one or more Qwen-Image-Edit services.
# Repeating --service-url enables round-robin dispatch; --parallel sends
# concurrent requests so multiple service workers can be used.
python scripts/generate_images.py \
  --input runs/example/results.json \
  --output-dir runs/example_qwen_service \
  --backend qwen-image-edit-service \
  --service-url http://your-qwen-service-1:8001 \
  --service-url http://your-qwen-service-2:8001 \
  --parallel 8

# Stage 2b: render with Nano Banana Pro.
python scripts/generate_images.py \
  --input runs/example/results.json \
  --output-dir runs/example_nano \
  --backend nano-banana-pro \
  --parallel 4
```

Current script support:
| Stage | Script | Scaling knobs |
|---|---|---|
| Agent model serving | `scripts/serve_vllm.sh` | `TP`, `DP`, `PORT`, `MAX_MODEL_LEN`, `MODEL_PATH` |
| Agent rollouts | `scripts/run_agent.py` | `--parallel`, `--base-url`, `--model` |
| Remote Qwen rendering | `scripts/generate_images.py --backend qwen-image-edit-service` | repeat `--service-url` and set `--parallel` |
| Local Qwen debug rendering | `scripts/generate_images.py --backend qwen-image-edit` | single local process; requires a Qwen-compatible diffusers environment |
| Nano rendering | `scripts/generate_images.py --backend nano-banana-pro` | `--parallel`, subject to API quota/rate limits |
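To illustrate how repeated `--service-url` values and `--parallel` interact, here is a minimal sketch of round-robin dispatch over several endpoints with a thread pool. It is an illustration under assumptions, not the code in `scripts/generate_images.py`: `dispatch_round_robin` and the `render` callback are hypothetical names, and the actual HTTP call is left to the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
from typing import Any, Callable, Sequence


def dispatch_round_robin(
    jobs: Sequence[Any],
    service_urls: Sequence[str],
    render: Callable[[str, Any], Any],
    parallel: int = 8,
) -> list:
    """Send job i to endpoint i % len(service_urls), up to `parallel` at a time."""
    if not service_urls:
        raise ValueError("at least one service URL is required")
    # Fix the job -> endpoint assignment up front so it does not depend on
    # which worker thread happens to pick up a job first.
    assignments = list(zip(jobs, cycle(service_urls)))

    def run(pair):
        job, url = pair
        return render(url, job)

    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # pool.map returns results in input order, not completion order.
        return list(pool.map(run, assignments))


if __name__ == "__main__":
    # Stand-in for a real request to a rendering service.
    def fake_render(url: str, job: str) -> str:
        return f"{job} -> {url}"

    urls = ["http://your-qwen-service-1:8001", "http://your-qwen-service-2:8001"]
    for line in dispatch_round_robin(["a", "b", "c"], urls, fake_render, parallel=2):
        print(line)
```

With two endpoints, setting `parallel` below 2 would leave one service idle, which is why the table pairs the two knobs.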
To reproduce benchmark metrics, download the public dataset and pass the
benchmark JSONL directly to the agent runner. The public benchmark uses
`question` as the prompt field; `scripts/run_agent.py` accepts both `question`
and `prompt`, preserves extra fields such as `gt_image`, `eval_type`,
`category`, and `difficulty`, and the rendering script copies them into its
output `results.json`.
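The field handling described above can be sketched as follows. This is an assumption-laden illustration rather than the logic in `scripts/run_agent.py`: `normalize_row` is a hypothetical helper, and giving `question` priority when both fields are present is a choice made here, not something the repository documents.

```python
import json


def normalize_row(line: str) -> dict:
    """Parse one JSONL row, accept `question` or `prompt`, keep extra fields."""
    row = json.loads(line)
    # Assumption: `question` wins if a row happens to carry both fields.
    prompt = row.get("question") or row.get("prompt")
    if not prompt:
        raise ValueError("row needs a non-empty 'question' or 'prompt' field")
    extras = {k: v for k, v in row.items() if k not in ("question", "prompt")}
    return {"prompt": prompt, **extras}


if __name__ == "__main__":
    line = json.dumps({
        "id": "0",
        "question": "A detailed image-generation request...",
        "gt_image": "images/case_00000.jpg",
        "eval_type": "Knowledge-Anchored",
    })
    print(normalize_row(line))
```

Carrying `gt_image` and `eval_type` through unchanged is what lets the later scoring stage find the ground-truth image and group metrics by track.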
The scorer in scripts/evaluate_images.py is the paper-compatible Gemini judge:
it uses the same rubric prompt, the same image order (Image 1 = generated,
Image 2 = GT), the same OpenAI-compatible multimodal chat-completions call, and
the same score normalization and weighted overall formula used for the reported
benchmark numbers. No service endpoint or API key is hard-coded.
Public benchmark row format:
```json
{"id": "0", "question": "A detailed image-generation request...", "gt_image": "images/case_00000.jpg", "eval_type": "Knowledge-Anchored", "category": "architecture_landmark", "difficulty": "hard"}
```

Run the same two-stage pipeline, then score the rendered images with Gemini:
```bash
huggingface-cli download MeiGen-AI/GenEvolve-Data-Bench \
  --repo-type dataset \
  --local-dir ./GenEvolve-Data-Bench

# Stage 1: agent rollouts.
python scripts/run_agent.py \
  --input ./GenEvolve-Data-Bench/GenEvolve-Bench/test.jsonl \
  --output-dir runs/bench_agent \
  --base-url http://localhost:8000/v1 \
  --model GenEvolve \
  --parallel 16

# Stage 2: render images, for example through Qwen-Image-Edit services.
python scripts/generate_images.py \
  --input runs/bench_agent/results.json \
  --output-dir runs/bench_qwen \
  --backend qwen-image-edit-service \
  --service-url http://your-qwen-service:8001 \
  --parallel 16

# Stage 3: Gemini judge.
# Use an OpenAI-compatible Gemini chat-completions endpoint.
export OPENAI_API_KEY=<your_eval_api_key>
export OPENAI_API_BASE=<your_openai_compatible_base_url>
python scripts/evaluate_images.py \
  --results runs/bench_qwen/results.json \
  --gt-root ./GenEvolve-Data-Bench/GenEvolve-Bench \
  --model gemini-3.1-pro-preview \
  --max-workers 16 \
  --rpm 60 \
  --resume
```

`scripts/evaluate_images.py` writes:
| File | Contents |
|---|---|
| `results_eval.json` | per-sample judge output and rationale |
| `summary.json` | aggregate metrics |
| `summary.csv` | the same metrics in table form |
results_eval.json also appends benchmark split summaries such as
eval_type:Knowledge-Anchored, eval_type:Quality-Anchored, and
overall_avg.
The reported metrics are faithfulness, visual_correctness,
text_accuracy, aesthetics, and the weighted overall score:
```
overall = 0.1 * faithfulness
        + 0.4 * visual_correctness
        + 0.4 * text_accuracy
        + 0.1 * aesthetics
```
overall_missing_zero keeps the full denominator and treats missing or failed
cases as zero. The summary also reports metrics by eval_type, category,
and difficulty when those fields are present.
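As a worked example of the weighted formula and the `overall_missing_zero` variant, the sketch below applies the weights above to per-sample scores. It is not the scorer in `scripts/evaluate_images.py`: `overall` and `summarize` are hypothetical helpers, failed cases are represented here as `None`, and it is an assumption that the plain `overall` average skips failed cases while `overall_missing_zero` divides by the full sample count.

```python
from typing import Optional

WEIGHTS = {
    "faithfulness": 0.1,
    "visual_correctness": 0.4,
    "text_accuracy": 0.4,
    "aesthetics": 0.1,
}


def overall(scores: dict) -> float:
    """Weighted overall score for one sample, using the weights above."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


def summarize(samples: list[Optional[dict]]) -> dict:
    """Average overall scores; `None` marks a missing or failed case."""
    scored = [overall(s) for s in samples if s is not None]
    return {
        # Assumption: averaged over successfully scored cases only.
        "overall": sum(scored) / len(scored) if scored else 0.0,
        # Full denominator: missing or failed cases count as zero.
        "overall_missing_zero": sum(scored) / len(samples) if samples else 0.0,
    }


if __name__ == "__main__":
    ok = {"faithfulness": 1.0, "visual_correctness": 0.5,
          "text_accuracy": 0.5, "aesthetics": 1.0}
    print(round(overall(ok), 6))   # 0.6
    print(summarize([ok, None]))   # one scored case, one failed case
```

With one scored sample at 0.6 and one failed sample, the two averages diverge (roughly 0.6 versus 0.3), which is the gap that `overall_missing_zero` is meant to expose.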
If you only want to run the provided scripts, you can skip this section. This is for users who want to call the agent and renderer directly from their own Python pipeline instead of going through scripts/run_agent.py and scripts/generate_images.py.
```python
from genevolve import GenEvolveAgent
from genevolve.generator import QwenImageEditServiceGenerator  # or NanoBananaProGenerator

agent = GenEvolveAgent(
    model="GenEvolve",
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
result = agent.run("A cyberpunk version of the Sydney Opera House at sunset.")

# z = (gen_prompt, reference_images)
print(result.gen_prompt)
for r in result.reference_images:
    print(r["img_id"], r["local_path"], r["note"])

backend = QwenImageEditServiceGenerator(["http://your-qwen-service:8001"])
image = backend.generate(
    result.gen_prompt,
    [r["local_path"] for r in result.reference_images if r.get("local_path")],
)
image.save("opera.png")
```

For a user request, the agent produces a multi-turn trajectory where each step calls one of three tools:
| Tool | Role | Output |
|---|---|---|
| `search(queries)` | External textual evidence - entities, dates, facts. | Markdown digest. |
| `image_search(query)` | Visual references; each result gets a unique `IMG_###` id. | Image list with local paths. |
| `query_knowledge(skill_name)` | Internal knowledge activation - invokes one of the eight callable generation skills. | Skill instructions in Markdown. |
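One way to picture the `IMG_###` ids is as keys in a small registry that maps each retrieved image to its local path, so that the `img_id` values cited by the agent can later be turned back into files for the renderer. The sketch below is purely illustrative: `ReferenceRegistry` is a hypothetical class, and the zero-padded sequential numbering is an assumption based on the `IMG_001` pattern shown in this README.

```python
class ReferenceRegistry:
    """Assign sequential IMG_### ids to downloaded images and resolve them later."""

    def __init__(self) -> None:
        self._paths: dict[str, str] = {}

    def add(self, local_path: str) -> str:
        """Register one downloaded image and return its new id."""
        img_id = f"IMG_{len(self._paths) + 1:03d}"
        self._paths[img_id] = local_path
        return img_id

    def resolve(self, reference_images: list[dict]) -> list[str]:
        """Return local paths in the order the references are listed.

        Order matters because prompts refer to references by ordinal
        position. Unknown ids are skipped here, mirroring the
        `if r.get("local_path")` filter in the Python example above.
        """
        return [
            self._paths[ref["img_id"]]
            for ref in reference_images
            if ref.get("img_id") in self._paths
        ]


if __name__ == "__main__":
    registry = ReferenceRegistry()
    first = registry.add("/tmp/genevolve_images/opera_house.jpg")
    second = registry.add("/tmp/genevolve_images/neon_street.jpg")
    print(first, second)  # IMG_001 IMG_002
    print(registry.resolve([{"img_id": second, "note": "lighting"}]))
```

Skipping an unknown id shifts the ordinal position of later references, so a stricter variant could raise instead; which behavior is preferable depends on how the renderer handles missing references.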
The final answer is a JSON object, the prompt-reference program:
```json
{
  "gen_prompt": "... a targeted instruction that refers to references by ordinal phrases ('the first reference image', 'the second reference image') ...",
  "reference_images": [
    {"img_id": "IMG_001", "note": "what to copy from this reference"}
  ]
}
```

We release the training data and benchmark in one Hugging Face dataset repository: MeiGen-AI/GenEvolve-Data-Bench. The full trajectory data is too large for GitHub, but it can be downloaded with a single command through Hugging Face `datasets` or `huggingface-cli`.
| Dataset | Records | Size | Purpose |
|---|---|---|---|
| `GenEvolve-Data-SFT/` | 9,000 records | ~7.4 GB | Multi-turn tool-orchestrated trajectories used for the SFT cold start. Each record: `messages` (chat-format ReAct trajectory ending in `<answer>{gen_prompt, reference_images}`) + `images` (reference JPEGs). |
| `GenEvolve-Data-RL/` | 3,175 records | ~680 MB | Open-ended user requests paired with curated GT images. Used for GRPO + Visual Experience Distillation, where multiple agent rollouts per prompt are scored against the GT. |
| `GenEvolve-Bench/` | 594 prompts | ~120 MB | Held-out evaluation benchmark. Contains both Knowledge-Anchored (335) and Quality-Anchored (259) tracks plus per-prompt `category`, `difficulty`, and `skill` metadata. |
```bash
pip install -U huggingface_hub datasets
huggingface-cli download MeiGen-AI/GenEvolve-Data-Bench \
  --repo-type dataset \
  --local-dir ./GenEvolve-Data-Bench
```

```python
from datasets import load_dataset

repo_id = "MeiGen-AI/GenEvolve-Data-Bench"

bench = load_dataset(repo_id, "bench", split="test")
print(bench[0]["question"], bench[0]["gt_image"])

rl = load_dataset(repo_id, "rl", split="train")

sft = load_dataset(repo_id, "sft", split="train")
print(sft[0]["messages"])
print(sft[0]["images"])
```

All paths inside the datasets are relative, for example `images/case_00512.jpg` or `images/traj_00213/IMG_001.jpg`; resolve them against the dataset directory you downloaded to. Per-dataset usage notes live on each dataset's Hub page.
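Resolving those relative paths can be done with `pathlib`. The sketch below is a minimal illustration: `resolve_dataset_path` is a hypothetical helper, the directory names in the usage lines are only examples of where the download might live, and the check that rejects paths escaping the dataset directory is an extra safeguard added here rather than something the datasets require.

```python
from pathlib import Path


def resolve_dataset_path(dataset_dir: str, relative_path: str) -> Path:
    """Join a record's relative image path onto the downloaded dataset directory."""
    root = Path(dataset_dir).resolve()
    full = (root / relative_path).resolve()
    # Refuse paths such as "../x.jpg" that would point outside the dataset.
    if not full.is_relative_to(root):
        raise ValueError(f"{relative_path!r} escapes {root}")
    return full


if __name__ == "__main__":
    bench_dir = "./GenEvolve-Data-Bench/GenEvolve-Bench"  # example location
    gt = resolve_dataset_path(bench_dir, "images/case_00512.jpg")
    print(gt)           # absolute path under the benchmark directory
    print(gt.exists())  # True only if the dataset has been downloaded there
```

`Path.is_relative_to` requires Python 3.9 or newer, which the Python 3.11 environment above satisfies.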
The full training scripts are not included in this repository, but the released SFT/RL datasets, model weights, tools, and runtime let you reproduce the path from a user request to a rendered image.
The same GenEvolve policy paired with two different reference-conditioned generators. Orange marks external/uncommon knowledge, blue marks internal generation-knowledge requirements.
Additional qualitative results of GenEvolve with Nano Banana Pro as the downstream renderer. The agent autonomously orchestrates search, reference selection, and skill activation across diverse open-ended categories: spatial layout, text rendering, quantity counting, attribute binding, anatomy/pose, creative transfer, material physics, and aesthetic drawing.
The same trained agent policy paired with the open-source Qwen-Image-Edit-2511 renderer. Consistent quality across both generators demonstrates that GenEvolve learns generator-transferable tool orchestration rather than overfitting to one specific renderer.
| Variable | Purpose | Default |
|---|---|---|
| `OPENAI_BASE_URL` | OpenAI-compatible chat-completions endpoint | `http://localhost:8000/v1` |
| `OPENAI_API_KEY` | API key for the inference server or the OpenAI-compatible evaluator endpoint | `EMPTY` for local inference |
| `OPENAI_API_BASE` | OpenAI-compatible Gemini judge endpoint used by `scripts/evaluate_images.py` | provider-specific |
| `SERPER_API_KEY` | serper.dev key for text and image search | required |
| `SERPER_BASE_URL` | Override for Serper-compatible gateways | `https://google.serper.dev` |
| `IMAGE_DOWNLOAD_DIR` | Local cache for `image_search` downloads | `/tmp/genevolve_images` |
| `GOOGLE_API_KEY` / `GEMINI_API_KEY` | Google Generative Language API key | required for Nano backend |
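For a custom pipeline, the table above can be read into one configuration object up front so that missing keys fail early. The sketch below is an assumption, not the repository's loader: `RuntimeConfig` and `load_config` are hypothetical names, and only the variable names and defaults are taken from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RuntimeConfig:
    base_url: str
    api_key: str
    serper_api_key: str
    serper_base_url: str
    image_download_dir: str
    google_api_key: Optional[str]


def load_config(
    env: Mapping[str, str] = os.environ,
    backend: str = "qwen-image-edit-service",
) -> RuntimeConfig:
    """Collect settings from the environment, applying the documented defaults."""
    serper_key = env.get("SERPER_API_KEY")
    if not serper_key:
        raise RuntimeError("SERPER_API_KEY is required for search and image_search")
    # Either variable name is accepted for the Google key.
    google_key = env.get("GOOGLE_API_KEY") or env.get("GEMINI_API_KEY")
    if backend == "nano-banana-pro" and not google_key:
        raise RuntimeError("GOOGLE_API_KEY or GEMINI_API_KEY is required for the Nano backend")
    return RuntimeConfig(
        base_url=env.get("OPENAI_BASE_URL", "http://localhost:8000/v1"),
        api_key=env.get("OPENAI_API_KEY", "EMPTY"),
        serper_api_key=serper_key,
        serper_base_url=env.get("SERPER_BASE_URL", "https://google.serper.dev"),
        image_download_dir=env.get("IMAGE_DOWNLOAD_DIR", "/tmp/genevolve_images"),
        google_api_key=google_key,
    )


if __name__ == "__main__":
    # Pass an explicit mapping instead of os.environ to keep the demo hermetic.
    print(load_config({"SERPER_API_KEY": "demo-key"}))
```

Accepting a mapping instead of reading `os.environ` directly keeps the loader easy to test without touching the real environment.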
| Symptom | Check |
|---|---|
| `search` / `image_search` returns authentication errors | Set `SERPER_API_KEY` or configure `SERPER_BASE_URL` for your internal Serper-compatible gateway. |
| Agent cannot connect to the model | Confirm the vLLM server is running and `OPENAI_BASE_URL` or `--base-url` ends with `/v1`. |
| Qwen local renderer fails at import time | Use a separate Qwen-Image-Edit service environment and call it with `qwen-image-edit-service`; avoid mixing incompatible `xformers` / `flash-attn` combinations into the renderer env. |
| Qwen renderer says it needs a reference image | Qwen-Image-Edit is reference-conditioned; rerun the agent or use Nano Banana Pro for no-reference prompts. |
| `evaluate_images.py` cannot find GT images | Keep `gt_image` in each input record and pass `--gt-root` pointing to the downloaded benchmark directory. |
| `flash-attn` build fails | Install a PyTorch/CUDA wheel first, then run `pip install flash-attn==2.8.3 --no-build-isolation`. |
| Batch rendering resumes after interruption | `scripts/generate_images.py` writes `results.json` incrementally under the output directory. |
```
genevolve/
├── genevolve/
│   ├── agent.py              # GenEvolveAgent: ReAct loop on top of an OpenAI-compatible server
│   ├── system_prompt.py      # system prompt used by the released agent
│   ├── knowledge_tool.py     # query_knowledge: eight callable generation skills
│   ├── tools/web_search.py   # search + image_search (Serper-compatible)
│   ├── generator.py          # Qwen-Image-Edit + Nano Banana Pro backends
│   └── knowledge/skills/     # skill markdown files
├── scripts/
│   ├── serve_vllm.sh         # serve the checkpoint with vLLM
│   ├── run_agent.py          # batch agent rollouts -> results.json
│   ├── generate_images.py    # render images from results.json
│   └── evaluate_images.py    # Gemini judge scoring and metric summary
├── examples/
│   ├── quickstart.py         # single-prompt end-to-end example
│   └── example_prompts.jsonl
├── assets/                   # README figures
├── requirements.txt
├── setup.py
└── README.md
```
We thank the authors and maintainers of Gen-Searcher, Qwen3-VL, Qwen-Image-Edit, vLLM, Serper.dev, and the Google Generative Language API.
```bibtex
@misc{chen2026genevolveselfevolvingimagegeneration,
  title={GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation},
  author={Sixiang Chen and Zhaohu Xing and Tian Ye and Xinyu Geng and Yunlong Lin and Jianyu Lai and Xuanhua He and Fuxiang Zhai and Jialin Gao and Lei Zhu},
  year={2026},
  eprint={2605.21605},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.21605},
}
```

Code is released under the Apache 2.0 license. Released model weights inherit the upstream license of Qwen3-VL-8B-Instruct. Search results returned by Serper.dev and images rendered by Nano Banana Pro / Qwen-Image-Edit are governed by the respective upstream service terms.




