Stephanie Fu1, Trevor Darrell1, XuDong Wang1*
University of California, Berkeley1
University of California, Los Angeles2
Panasonic AI Research3
*Corresponding author
[2025-11-24] 🥃 Released the training data and code. Give it a shot!
[2025-11-24] ⭐️ The evaluation and Gradio demo are available NOW!
[2025-11-24] 🤗 Our finetuned weights are available. Check them out here!
Rather than restricting VLM reasoning to a discrete language space with limited representational capacity, CoVT forms a visual thought chain that enables VLMs to reason in continuous visual space. By introducing continuous visual tokens that encode perceptual cues (e.g., segmentation, depth, instance, and edge structure), CoVT composes chains of textual and visual thoughts that link semantic reasoning with perceptual grounding. These visual “thought chains” bridge language and vision, enabling fine-grained understanding, spatial precision, and geometric awareness beyond the reach of text-based reasoning.
💡 Abstract
Vision–Language Models (VLMs) excel at reasoning in linguistic space but struggle with tasks that require dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (CoVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens, compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, CoVT distills knowledge from lightweight vision experts capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, a VLM equipped with CoVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual-token space, preserving efficiency while optionally decoding dense predictions for interpretability. Across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating CoVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16%, demonstrating that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.
🧩 Pipeline
Continuous visual thinking with CoVT. CoVT introduces compact, continuous visual tokens that encode fine-grained perceptual cues, such as object localization, spatial structure, and scene semantics, directly into VLM reasoning. These tokens ground multimodal reasoning in visual space, enabling the model to capture fine-grained relationships across vision-centric tasks (e.g., counting, depth ordering, and scene understanding) without relying on external tools. They can also be decoded into dense predictions, offering human-interpretable visualizations of the model's reasoning process.
The training pipeline of CoVT. CoVT first generates the thinking process, containing visual thinking tokens, and then leverages these visual thoughts to condition next-token prediction and reason toward the final answer. To endow these tokens with perceptual meaning, we align them with lightweight vision experts (e.g., SAM, DepthAnything, PIDINet, DINO) on their respective tasks during training. Specifically, SAM uses 8 visual tokens as mask prompts; DepthAnything uses 4 tokens to reconstruct depth; PIDINet uses 4 tokens to reconstruct edges; and DINO uses 4 tokens to match patch-level features. The VLM is finetuned with LoRA, and all projection layers are trainable. Note: during inference, dense predictions are decoded only when interpretability is desired; otherwise, reasoning occurs entirely in the latent visual space.
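To make the 20-token budget and the expert alignment concrete, here is a minimal PyTorch-style sketch of how the visual thinking tokens could be projected and supervised against the frozen experts. The module names, hidden sizes, and loss choices are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative budget from the paper: 8 (SAM mask prompts) + 4 (depth)
# + 4 (edge) + 4 (DINO) = 20 continuous visual thinking tokens.
TOKEN_BUDGET = {"seg": 8, "depth": 4, "edge": 4, "dino": 4}
HIDDEN_DIM = 3584   # assumed VLM hidden size (e.g., Qwen2.5-VL-7B)
EXPERT_DIM = 256    # assumed prompt/feature dimension expected by each expert head


class VisualThoughtProjector(nn.Module):
    """Maps VLM hidden states at the visual-thought positions into the input
    space of each frozen vision expert (SAM, DepthAnything, PIDINet, DINO)."""

    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: nn.Linear(HIDDEN_DIM, EXPERT_DIM) for name in TOKEN_BUDGET}
        )

    def forward(self, thought_states: torch.Tensor) -> dict:
        # thought_states: (B, 20, HIDDEN_DIM), ordered seg | depth | edge | dino.
        chunks, start = {}, 0
        for name, n in TOKEN_BUDGET.items():
            chunks[name] = self.heads[name](thought_states[:, start:start + n])
            start += n
        return chunks


def expert_alignment_loss(dense_pred: dict, dense_gt: dict) -> torch.Tensor:
    """Toy stand-in for the dense supervision. In the full pipeline, the projected
    tokens are first fed to the frozen expert decoders (omitted here), which
    produce the dense predictions compared against the expert targets."""
    return (
        F.binary_cross_entropy_with_logits(dense_pred["seg"], dense_gt["seg"])       # masks
        + F.l1_loss(dense_pred["depth"], dense_gt["depth"])                           # depth map
        + F.binary_cross_entropy_with_logits(dense_pred["edge"], dense_gt["edge"])    # edge map
        + (1 - F.cosine_similarity(dense_pred["dino"], dense_gt["dino"], -1)).mean()  # features
    )


if __name__ == "__main__":
    projector = VisualThoughtProjector()
    hidden = torch.randn(2, sum(TOKEN_BUDGET.values()), HIDDEN_DIM)
    prompts = projector(hidden)  # per-expert prompt embeddings
    print({k: v.shape for k, v in prompts.items()})
```

The per-expert split mirrors the token counts above; only these projection heads and the LoRA adapters would be trainable, while the vision experts stay frozen.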
💫 Results
Comparison of CoVT with the baseline and closed-source models. CoVT delivers consistent improvements across all vision-centric benchmarks and further reveals that each type of visual token contributes most effectively to the tasks that align with its encoded perceptual information.
Comparison between CoVT and Aurora based on LLaVA-v1.5-13B.
$^{\dag}$ indicates our reproduced results based on the provided checkpoints.
To ensure consistency and reproducibility, we use VLMEvalKit as the evaluation framework. Our repository includes a fork of VLMEvalKit; you can get started quickly by following these instructions.
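As a rough sketch, an evaluation run through the forked VLMEvalKit typically amounts to invoking its `run.py` with a list of benchmarks and a registered model name. The model identifier below is an assumption; check the fork for the exact name it registers for CoVT.

```python
# Minimal sketch of launching VLMEvalKit from Python; equivalent to running
# `python run.py --data ... --model ...` inside the forked VLMEvalKit checkout.
# "CoVT-Qwen2.5-VL-7B" is an assumed model name; use the identifier the fork registers.
import subprocess

subprocess.run(
    [
        "python", "run.py",
        "--data", "MMStar", "RealWorldQA",   # benchmark names from VLMEvalKit's registry
        "--model", "CoVT-Qwen2.5-VL-7B",     # assumed registered model name
    ],
    cwd="VLMEvalKit",                        # path to the forked VLMEvalKit checkout
    check=True,
)
```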
We provide an interactive demo built with Gradio, showcasing a conversational interface powered by the CoVT VLM. The demo allows users to upload images, ask questions, and interact with the model in real time through a simple web UI. You can get started quickly by following the instructions here.
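For orientation, a stripped-down version of such a demo could look like the sketch below; `covt_answer` is a hypothetical placeholder for the actual CoVT inference call in the released demo code.

```python
# Minimal Gradio sketch in the spirit of the released demo.
# `covt_answer` is a hypothetical placeholder for the real CoVT inference call.
import gradio as gr


def covt_answer(image, question):
    # In the real demo this would run the CoVT VLM on (image, question)
    # and return the generated answer (optionally with decoded dense maps).
    return f"[CoVT answer placeholder] You asked: {question}"


demo = gr.Interface(
    fn=covt_answer,
    inputs=[gr.Image(type="pil", label="Image"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="CoVT: Chain-of-Visual-Thought Demo",
)

if __name__ == "__main__":
    demo.launch()
```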
Our training data is released here!
Dataset Composition
The CoVT dataset draws on subsets of LLaVA-OneVision and merges in the filtered TallyQA dataset and ADE20K-Depth from Aurora.
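As a rough illustration of how such a mix can be assembled, the sketch below simply concatenates LLaVA-style JSON annotation files. The file names and schema are hypothetical; refer to the released dataset card for the actual layout.

```python
# Illustrative sketch of assembling a training mix from LLaVA-style JSON
# annotation files. All file names below are hypothetical placeholders.
import json

SOURCES = [
    "llava_onevision_subset.json",  # subsets of LLaVA-OneVision (assumed file name)
    "tallyqa_filtered.json",        # filtered TallyQA (assumed file name)
    "ade20k_depth.json",            # ADE20K-Depth from Aurora (assumed file name)
]

merged = []
for path in SOURCES:
    with open(path) as f:
        merged.extend(json.load(f))

with open("covt_train_mix.json", "w") as f:
    json.dump(merged, f)

print(f"Merged {len(merged)} samples from {len(SOURCES)} sources.")
```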
To reproduce our method, please follow these instructions and start training CoVT! 🍻
A collection of CoVT models on Hugging Face with benchmark performance:
| Baseline | Segment | Depth | DINO | Edge | Parameters | CV-Bench | Link |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | ✔ | | | | 7B (+1B) | 77.9 | 🤗 HuggingFace |
| Qwen2.5-VL-7B-Instruct | | ✔ | | | 7B (+1B) | 78.7 | 🤗 HuggingFace |
| Qwen2.5-VL-7B-Instruct | ✔ | ✔ | ✔ | | 7B (+1B) | 80.0 | 🤗 HuggingFace |
| Qwen2.5-VL-7B-Instruct | ✔ | ✔ | ✔ | ✔ | 7B (+1B) | 79.8 | 🤗 HuggingFace |
| LLaVA-v1.5-13B | ✔ | | | | 13B (+1B) | 59.9 | 🤗 HuggingFace |
+1B denotes the parameters of the projection layers used to decode the visual thinking tokens. These parameters are not needed during inference!
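As a quick-start sketch, the Qwen2.5-VL-based checkpoints can presumably be loaded with Hugging Face transformers as shown below. The repository ID is a placeholder (use the links in the table above), and this assumes the released weights are merged into the standard Qwen2.5-VL format rather than shipped as separate LoRA adapters.

```python
# Sketch of loading a released CoVT checkpoint with Hugging Face transformers.
# The repository ID below is a placeholder; use the link from the table above.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

repo_id = "ORG_NAME/CoVT-Qwen2.5-VL-7B"  # hypothetical repo id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo_id)
```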
- Release our model weights on Hugging Face.
- Release the evaluation code.
- Release the Gradio demo code.
- Release the dataset.
- Release the training code.
- Support Hugging Face demo.
- Support more VLMs as the base models.
The majority of CoVT is licensed under the Apache License; however, portions of the project are available under their own license terms: the Qwen series, SAM, Depth Anything v2, and DINOv2 are licensed under Apache, while PIDINet is licensed under its own license. If you later add other third-party code, please keep this license information updated, and let us know if that component is licensed under something other than Apache, CC-BY-NC, MIT, or CC0.
For feedback or collaboration opportunities, feel free to reach out!
For general questions, feel free to drop us an email at [email protected] or [email protected].
If you're running into code or implementation issues, the best way is to open an issue right here in the repo (highly recommended!) — chances are your question might help someone else too. 😊
If you use this work in your research, please cite:
@article{qin2025chain,
title={Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens},
author={Qin, Yiming and Wei, Bomin and Ge, Jiaxin and Kallidromitis, Konstantinos and Fu, Stephanie and Darrell, Trevor and Wang, Xudong},
journal={arXiv preprint arXiv:2511.19418},
year={2025}
}