
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

⭐️ CoVT enriches VLMs’ vision-centric reasoning capabilities. ⭐️

arXiv | Hugging Face Collection | Project Page

Yiming Qin¹, Bomin Wei², Jiaxin Ge¹, Konstantinos Kallidromitis³,
Stephanie Fu¹, Trevor Darrell¹, XuDong Wang¹*

University of California, Berkeley¹
University of California, Los Angeles²
Panasonic AI Research³

*Corresponding author

🔥 News

[2025-11-24] 🥃 Released the training data and code. Give it a shot!

[2025-11-24] ⭐️ The evaluation and Gradio demo are available NOW!

[2025-11-24] 🤗 Our finetuned weights are available. Check them out here!


👀 Overview

Teaser Image

Rather than restricting VLM reasoning to a discrete language space with limited representational capacity, CoVT forms a visual thought chain that enables VLMs to reason in continuous visual space. By introducing continuous visual tokens that encode perceptual cues (e.g., segmentation, depth, instance, and edge structure), CoVT composes chains of textual and visual thoughts that link semantic reasoning with perceptual grounding. These visual “thought chains” bridge language and vision, enabling fine-grained understanding, spatial precision, and geometric awareness beyond the reach of text-based reasoning.

💡 Abstract

Vision–Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (CoVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens — compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, CoVT distills knowledge from lightweight vision experts capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, a VLM equipped with CoVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual-token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating CoVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16%, demonstrating that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.


🧩 Pipeline

Pipeline Image

Continuous visual thinking with CoVT. CoVT introduces compact, continuous visual tokens that encode fine-grained perceptual cues, such as object localization, spatial structure, and scene semantics, directly into VLM reasoning. These tokens ground multimodal reasoning in visual space, enabling the model to capture fine-grained relationships across vision-centric tasks (e.g., counting, depth ordering, and scene understanding) without relying on external tools. They can also be decoded into dense predictions, offering human-interpretable visualizations of the model's reasoning process.

Method Image

The training pipeline of CoVT. CoVT first generates the thinking process, containing visual thinking tokens, and then leverages these visual thoughts to condition next-token prediction and reason the final answer. To endow these tokens with perceptual meaning, we align them with lightweight vision experts (e.g., SAM, DepthAnything, PIDINet, DINO) on their respective tasks during training. Specifically: SAM uses 8 visual tokens as mask prompts; DepthAnything uses 4 tokens to reconstruct depth; PIDINet uses 4 tokens to reconstruct edges; and DINO uses 4 tokens to match patch-level features. The VLM is finetuned with LoRA and all projection layers are trainable. Note: During inference, dense predictions are decoded only when interpretability is desired; otherwise, reasoning occurs entirely in the latent visual space.
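
To make the alignment concrete, here is a minimal PyTorch sketch of the idea described above. It assumes the 8/4/4/4 token split from the paragraph, invents simple linear projection heads, and substitutes generic reconstruction and feature-matching losses (with random tensors standing in for the real expert outputs); it is an illustration of the distillation setup, not the released training code.

```python
# Illustrative sketch of the expert-alignment objective, NOT the official implementation:
# the token split (8/4/4/4) comes from the README text, but the projection heads and the
# loss choices (L1 for depth, BCE for edges, cosine matching for DINO) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 3584                                   # hidden size of the base VLM (assumed)
N_SEG, N_DEPTH, N_EDGE, N_DINO = 8, 4, 4, 4     # visual-token budget described above


class VisualThoughtHeads(nn.Module):
    """Projects the VLM's visual thinking tokens into each vision expert's space."""

    def __init__(self, hidden=HIDDEN, sam_dim=256, dino_dim=1024, patches=576):
        super().__init__()
        self.to_sam_prompt = nn.Linear(hidden, sam_dim)   # 8 tokens -> SAM mask prompts
        self.to_depth = nn.Linear(hidden, patches)        # 4 tokens -> coarse depth map
        self.to_edge = nn.Linear(hidden, patches)         # 4 tokens -> coarse edge map
        self.to_dino = nn.Linear(hidden, dino_dim)        # 4 tokens -> DINO feature space

    def forward(self, vt):                                # vt: (B, 20, hidden)
        seg, depth, edge, dino = torch.split(vt, [N_SEG, N_DEPTH, N_EDGE, N_DINO], dim=1)
        return {
            "sam_prompts": self.to_sam_prompt(seg),       # (B, 8, sam_dim)
            "depth": self.to_depth(depth).mean(1),        # (B, patches)
            "edge": self.to_edge(edge).mean(1),           # (B, patches), logits
            "dino": self.to_dino(dino),                   # (B, 4, dino_dim)
        }


def alignment_loss(preds, targets):
    """Hypothetical distillation objective combining the four expert signals."""
    loss = F.l1_loss(preds["depth"], targets["depth"])                                   # depth reconstruction
    loss = loss + F.binary_cross_entropy_with_logits(preds["edge"], targets["edge"])     # edge reconstruction
    loss = loss + (1 - F.cosine_similarity(preds["dino"], targets["dino"], dim=-1)).mean()  # DINO feature matching
    # In the real pipeline the 8 segmentation tokens are fed to a frozen SAM decoder as
    # prompt embeddings; here we only mimic that with a feature-matching placeholder.
    loss = loss + F.mse_loss(preds["sam_prompts"], targets["sam_prompts"])
    return loss


# Toy usage with random tensors standing in for real expert outputs:
heads = VisualThoughtHeads()
vt = torch.randn(2, 20, HIDDEN)                 # visual thinking tokens from the VLM
targets = {
    "depth": torch.rand(2, 576),
    "edge": torch.rand(2, 576),
    "dino": torch.randn(2, 4, 1024),
    "sam_prompts": torch.randn(2, 8, 256),
}
print(alignment_loss(heads(vt), targets))
```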


💫 Results

Results Image

Comparison of CoVT with the baseline and closed-source models. CoVT delivers consistent improvements across all vision-centric benchmarks and further reveals that each type of visual token contributes most effectively to the tasks that align with its encoded perceptual information.

Results Image

Comparison between CoVT and Aurora based on LLaVA-v1.5-13B. † indicates our reproduced results based on the provided checkpoints.


🚀 Quick Start!

Evaluation

To ensure consistency and reproducibility, we use VLMEvalKit as the framework for evaluating models. Our repository includes a fork of VLMEvalKit. You can get started quickly by following these instructions.
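
As a rough sketch of what an evaluation run can look like, the snippet below assumes the fork keeps upstream VLMEvalKit's `run.py` entry point and that a CoVT model key (here the illustrative name `CoVT-Qwen2.5-VL-7B`) is registered; the linked instructions have the actual model keys and dataset names.

```python
# Hedged sketch of launching evaluation through the forked VLMEvalKit. The model key is
# illustrative and the dataset names are examples of benchmarks reported in the paper.
import subprocess

cmd = [
    "python", "run.py",
    "--data", "MMStar", "RealWorldQA",       # example benchmark names
    "--model", "CoVT-Qwen2.5-VL-7B",         # illustrative model key, check the instructions
]
subprocess.run(cmd, cwd="VLMEvalKit", check=True)   # run inside the forked VLMEvalKit folder
```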

Gradio Demo

We provide an interactive demo built with Gradio, showcasing a conversational interface powered by the CoVT VLM. The demo allows users to upload images, ask questions, and interact with the model in real time through a simple web UI. You can get started quickly by following the instructions here.

Gradio Demo Image
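
For orientation, here is a minimal Gradio sketch in the spirit of the demo: an image upload plus a question box wired to a placeholder answer function. The model call is a stub you would replace with the CoVT VLM's generation code; see the linked demo instructions for the actual entry point.

```python
# Minimal Gradio sketch of an image + question chat demo. The inference call below is a
# placeholder; the real demo wires this to the CoVT VLM.
import gradio as gr


def answer(image, question):
    # Placeholder inference: swap in the CoVT VLM's generate call here.
    if image is None or not question:
        return "Please upload an image and ask a question."
    return f"(model output for: {question!r})"


demo = gr.Interface(
    fn=answer,
    inputs=[gr.Image(type="pil", label="Image"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="CoVT demo (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```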

Training CoVT

Our training data is released here!

Dataset Composition

Data Image

The CoVT dataset draws on subsets of LLaVA-OneVision and merges in the filtered TallyQA dataset and ADE20K-Depth from Aurora.


To reproduce our method, please follow these instructions and start training CoVT! 🍻
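
Before diving into the full instructions, the sketch below shows how the LoRA setup described in the pipeline section ("finetuned with LoRA, all projection layers trainable") could look with Hugging Face peft. The rank, alpha, target modules, and the projection-head module name are assumptions, not the repo's actual config.

```python
# Hedged sketch of a LoRA configuration in the spirit of the pipeline description.
# Hyperparameters, target modules, and the "visual_thought_heads" name are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto"
)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LoRA on attention projections (assumed)
    modules_to_save=["visual_thought_heads"],                 # keep projection heads fully trainable (hypothetical name)
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```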

🤗 Model Zoo

A collection of CoVT models on Hugging Face with benchmark performance:

| Baseline | Segment | Depth | DINO | Edge | Parameters | CV-Bench | Link |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B-Instruct | | | | | 7B (+1B) | 77.9 | 🤗 HuggingFace |
| Qwen2.5-VL-7B-Instruct | | | | | 7B (+1B) | 78.7 | 🤗 HuggingFace |
| Qwen2.5-VL-7B-Instruct | | | | | 7B (+1B) | 80.0 | 🤗 HuggingFace |
| Qwen2.5-VL-7B-Instruct | | | | | 7B (+1B) | 79.8 | 🤗 HuggingFace |
| LLaVA-v1.5-13B | | | | | 13B (+1B) | 59.9 | 🤗 HuggingFace |

+1B denotes the parameters of the projection layers used to decode the visual thinking tokens. These parameters are not needed during inference!
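
If you want to try a released checkpoint directly, the snippet below is a hedged example of loading one with transformers. It assumes the weights load as a standard Qwen2.5-VL model; the repo id is a placeholder, so use the links in the table above, and if the weights ship as LoRA adapters instead, attach them to the base model with peft.

```python
# Hedged example of pulling a CoVT checkpoint from Hugging Face with transformers.
# The repo id below is a placeholder, not a real repository name.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

repo_id = "<covt-checkpoint-from-the-table-above>"   # placeholder
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo_id)
```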

🏖️ TODO

  • Release our model weights on Hugging Face.
  • Release the evaluation code.
  • Release the Gradio demo code.
  • Release the dataset.
  • Release the training code.
  • Support a Hugging Face demo.
  • Support more VLMs as the base models.

🪪 License

The majority of CoVT is licensed under the Apache License; however, portions of the project are available under their own license terms. The Qwen series, SAM, Depth Anything v2, and DINOv2 are licensed under Apache, while PIDINet is licensed under its own license. If you later add other third-party code, please keep this license information updated, and please let us know if that component is licensed under something other than Apache, CC-BY-NC, MIT, or CC0.

📮 Contact

For feedback or collaboration opportunities, feel free to reach out!

For general questions, feel free to drop us an email at [email protected] or [email protected].

If you're running into code or implementation issues, the best way is to open an issue right here in the repo (highly recommended!) — chances are your question might help someone else too. 😊

Citation

If you use this work in your research, please cite:

@article{qin2025chain,
  title={Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens},
  author={Qin, Yiming and Wei, Bomin and Ge, Jiaxin and Kallidromitis, Konstantinos and Fu, Stephanie and Darrell, Trevor and Wang, Xudong},
  journal={arXiv preprint arXiv:2511.19418},
  year={2025}
}

