
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

⭐️ CoVT enriches VLMs’ vision-centric reasoning capabilities. ⭐️

arXiv | Hugging Face Collection | Project Page

Yiming Qin¹, Bomin Wei², Jiaxin Ge¹, Konstantinos Kallidromitis³,
Stephanie Fu¹, Trevor Darrell¹, XuDong Wang¹*

University of California, Berkeley¹
University of California, Los Angeles²
Panasonic AI Research³

*Corresponding author

🔥 News

[2025-11-24] 🥃 Released the training data and code. Give it a shot!

[2025-11-24] ⭐️ The evaluation and Gradio demo are available NOW!

[2025-11-24] 🤗 Our finetuned weights are available. Check them out here!


👀 Overview

Teaser Image

Rather than restricting VLM reasoning to a discrete language space with limited representational capacity, CoVT forms a visual thought chain that enables VLMs to reason in continuous visual space. By introducing continuous visual tokens that encode perceptual cues (e.g., segmentation, depth, instance, and edge structure), CoVT composes chains of textual and visual thoughts that link semantic reasoning with perceptual grounding. These visual “thought chains” bridge language and vision, enabling fine-grained understanding, spatial precision, and geometric awareness beyond the reach of text-based reasoning.

💡 Abstract

Vision–Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (CoVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens — compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, CoVT distills knowledge from lightweight vision experts capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, a VLM equipped with CoVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual-token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating CoVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16%, demonstrating that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.


🧩 Pipeline

Pipeline Image

Continuous visual thinking with CoVT. CoVT introduces compact, continuous visual tokens that encode fine-grained perceptual cues, such as object localization, spatial structure, and scene semantics, directly into VLM reasoning. These tokens ground multimodal reasoning in visual space, enabling the model to capture fine-grained relationships across vision-centric tasks (e.g., counting, depth ordering, and scene understanding) without relying on external tools. They can also be decoded into dense predictions, offering human-interpretable visualizations of the model's reasoning process.

Method Image

The training pipeline of CoVT. CoVT first generates the thinking process, containing visual thinking tokens, and then leverages these visual thoughts to condition next-token prediction and reason the final answer. To endow these tokens with perceptual meaning, we align them with lightweight vision experts (e.g., SAM, DepthAnything, PIDINet, DINO) on their respective tasks during training. Specifically: SAM uses 8 visual tokens as mask prompts; DepthAnything uses 4 tokens to reconstruct depth; PIDINet uses 4 tokens to reconstruct edges; and DINO uses 4 tokens to match patch-level features. The VLM is finetuned with LoRA and all projection layers are trainable. Note: During inference, dense predictions are decoded only when interpretability is desired; otherwise, reasoning occurs entirely in the latent visual space.
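
To make the alignment concrete, here is a minimal PyTorch sketch of the idea described above. It assumes the 8/4/4/4 token split from the paragraph, invents simple linear projection heads, and substitutes generic reconstruction and feature-matching losses (with random tensors standing in for the real expert outputs); it is an illustration of the distillation setup, not the released training code.

```python
# Illustrative sketch of the expert-alignment objective, NOT the official implementation:
# the token split (8/4/4/4) comes from the README text, but the projection heads and the
# loss choices (L1 for depth, BCE for edges, cosine matching for DINO) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 3584                                   # hidden size of the base VLM (assumed)
N_SEG, N_DEPTH, N_EDGE, N_DINO = 8, 4, 4, 4     # visual-token budget described above


class VisualThoughtHeads(nn.Module):
    """Projects the VLM's visual thinking tokens into each vision expert's space."""

    def __init__(self, hidden=HIDDEN, sam_dim=256, dino_dim=1024, patches=576):
        super().__init__()
        self.to_sam_prompt = nn.Linear(hidden, sam_dim)   # 8 tokens -> SAM mask prompts
        self.to_depth = nn.Linear(hidden, patches)        # 4 tokens -> coarse depth map
        self.to_edge = nn.Linear(hidden, patches)         # 4 tokens -> coarse edge map
        self.to_dino = nn.Linear(hidden, dino_dim)        # 4 tokens -> DINO feature space

    def forward(self, vt):                                # vt: (B, 20, hidden)
        seg, depth, edge, dino = torch.split(vt, [N_SEG, N_DEPTH, N_EDGE, N_DINO], dim=1)
        return {
            "sam_prompts": self.to_sam_prompt(seg),       # (B, 8, sam_dim)
            "depth": self.to_depth(depth).mean(1),        # (B, patches)
            "edge": self.to_edge(edge).mean(1),           # (B, patches), logits
            "dino": self.to_dino(dino),                   # (B, 4, dino_dim)
        }


def alignment_loss(preds, targets):
    """Hypothetical distillation objective combining the four expert signals."""
    loss = F.l1_loss(preds["depth"], targets["depth"])                                   # depth reconstruction
    loss = loss + F.binary_cross_entropy_with_logits(preds["edge"], targets["edge"])     # edge reconstruction
    loss = loss + (1 - F.cosine_similarity(preds["dino"], targets["dino"], dim=-1)).mean()  # DINO feature matching
    # In the real pipeline the 8 segmentation tokens are fed to a frozen SAM decoder as
    # prompt embeddings; here we only mimic that with a feature-matching placeholder.
    loss = loss + F.mse_loss(preds["sam_prompts"], targets["sam_prompts"])
    return loss


# Toy usage with random tensors standing in for real expert outputs:
heads = VisualThoughtHeads()
vt = torch.randn(2, 20, HIDDEN)                 # visual thinking tokens from the VLM
targets = {
    "depth": torch.rand(2, 576),
    "edge": torch.rand(2, 576),
    "dino": torch.randn(2, 4, 1024),
    "sam_prompts": torch.randn(2, 8, 256),
}
print(alignment_loss(heads(vt), targets))
```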


💫 Results

Results Image

Comparison of CoVT with the baseline and closed-source models. CoVT delivers consistent improvements across all vision-centric benchmarks and further reveals that each type of visual token contributes most effectively to the tasks that align with its encoded perceptual information.

Results Image

Comparison between CoVT and Aurora based on LLaVA-v1.5-13B. † indicates our reproduced results based on the provided checkpoints.


🚀 Quick Start!

Evaluation

To ensure consistency and reproducibility, we use VLMEvalKit as the framework for evaluating models. Our repository includes a fork of VLMEvalKit. You can get started quickly by following these instructions.
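
As a rough sketch of what an evaluation run can look like, the snippet below assumes the fork keeps upstream VLMEvalKit's `run.py` entry point and that a CoVT model key (here the illustrative name `CoVT-Qwen2.5-VL-7B`) is registered; the linked instructions have the actual model keys and dataset names.

```python
# Hedged sketch of launching evaluation through the forked VLMEvalKit. The model key is
# illustrative and the dataset names are examples of benchmarks reported in the paper.
import subprocess

cmd = [
    "python", "run.py",
    "--data", "MMStar", "RealWorldQA",       # example benchmark names
    "--model", "CoVT-Qwen2.5-VL-7B",         # illustrative model key, check the instructions
]
subprocess.run(cmd, cwd="VLMEvalKit", check=True)   # run inside the forked VLMEvalKit folder
```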

Gradio Demo

We provide an interactive demo built with Gradio, showcasing a conversational interface powered by the CoVT VLM. The demo allows users to upload images, ask questions, and interact with the model in real time through a simple web UI. You can get started quickly by following the instructions here.

Gradio Demo Image
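
For orientation, here is a minimal Gradio sketch in the spirit of the demo: an image upload plus a question box wired to a placeholder answer function. The model call is a stub you would replace with the CoVT VLM's generation code; see the linked demo instructions for the actual entry point.

```python
# Minimal Gradio sketch of an image + question chat demo. The inference call below is a
# placeholder; the real demo wires this to the CoVT VLM.
import gradio as gr


def answer(image, question):
    # Placeholder inference: swap in the CoVT VLM's generate call here.
    if image is None or not question:
        return "Please upload an image and ask a question."
    return f"(model output for: {question!r})"


demo = gr.Interface(
    fn=answer,
    inputs=[gr.Image(type="pil", label="Image"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="CoVT demo (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```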

Training CoVT

Our training data is released here!

Dataset Composition

Data Image

The CoVT dataset draws on subsets of LLaVA-OneVision and merges in the filtered TallyQA dataset and ADE20K-Depth from Aurora.


To reproduce our method, please follow these instructions and start training CoVT! 🍻
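
Before diving into the full instructions, the sketch below shows how the LoRA setup described in the pipeline section ("finetuned with LoRA, all projection layers trainable") could look with Hugging Face peft. The rank, alpha, target modules, and the projection-head module name are assumptions, not the repo's actual config.

```python
# Hedged sketch of a LoRA configuration in the spirit of the pipeline description.
# Hyperparameters, target modules, and the "visual_thought_heads" name are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto"
)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LoRA on attention projections (assumed)
    modules_to_save=["visual_thought_heads"],                 # keep projection heads fully trainable (hypothetical name)
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```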

🤗 Model Zoo

A collection of CoVT models on Hugging Face with benchmark performance:

| Baseline | Segment | Depth | DINO | Edge | Parameters | CV-Bench | Link |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B-Instruct | | | | | 7B (+1B) | 77.9 | 🤗 HuggingFace |
| Qwen2.5-VL-7B-Instruct | | | | | 7B (+1B) | 78.7 | 🤗 HuggingFace |
| Qwen2.5-VL-7B-Instruct | | | | | 7B (+1B) | 80.0 | 🤗 HuggingFace |
| Qwen2.5-VL-7B-Instruct | | | | | 7B (+1B) | 79.8 | 🤗 HuggingFace |
| LLaVA-v1.5-13B | | | | | 13B (+1B) | 59.9 | 🤗 HuggingFace |

+1B denotes the parameters of the projection layers used to decode the visual thinking tokens. These parameters are not needed during inference!
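
If you want to try a released checkpoint directly, the snippet below is a hedged example of loading one with transformers. It assumes the weights load as a standard Qwen2.5-VL model; the repo id is a placeholder, so use the links in the table above, and if the weights ship as LoRA adapters instead, attach them to the base model with peft.

```python
# Hedged example of pulling a CoVT checkpoint from Hugging Face with transformers.
# The repo id below is a placeholder, not a real repository name.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

repo_id = "<covt-checkpoint-from-the-table-above>"   # placeholder
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo_id)
```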

🏖️ TODO

  • Release our model weights on Hugging Face.
  • Release the evaluation code.
  • Release the Gradio demo code.
  • Release the dataset.
  • Release the training code.
  • Support a Hugging Face demo.
  • Support more VLMs as the base models.

🪪 License

The majority of CoVT is licensed under the Apache License; however, portions of the project are available under their own license terms. The Qwen series, SAM, Depth Anything v2, and DINOv2 are licensed under Apache, while PIDINet is licensed under its own license. If you later add other third-party code, please keep this license information updated, and please let us know if that component is licensed under something other than Apache, CC-BY-NC, MIT, or CC0.

📮 Contact

For feedback or collaboration opportunities, feel free to reach out!

For general questions, feel free to drop us an email at [email protected] or [email protected].

If you're running into code or implementation issues, the best way is to open an issue right here in the repo (highly recommended!) — chances are your question might help someone else too. 😊

Citation

If you use this work in your research, please cite:

@article{qin2025chain,
  title={Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens},
  author={Qin, Yiming and Wei, Bomin and Ge, Jiaxin and Kallidromitis, Konstantinos and Fu, Stephanie and Darrell, Trevor and Wang, Xudong},
  journal={arXiv preprint arXiv:2511.19418},
  year={2025}
}

