- [2026.01] Published as a conference paper at SID's Display Week 2026!
XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression
Zunhai Su*, Weihao Ye*, Hansen Feng, Keyu Fan, Jing Zhang, Dahai Yu, Zhengwu Liu, Ngai Wong
* Equal contribution.
We propose XStreamVGGT, a tuning-free and extremely memory-efficient streaming vision geometry transformer that compresses the KV cache through joint token pruning and distribution-aware quantization. By removing redundant tokens and quantizing the remaining KV representations, XStreamVGGT achieves up to 4.42× memory reduction and 5.48× inference speedup with negligible performance degradation. This enables scalable, long-horizon streaming 3D reconstruction in real-world applications.
We recommend using Conda to set up the environment:
conda env create -f StreamVGGT_environment.yml
conda activate streamvggt

Download the pretrained StreamVGGT model weights from:
After downloading, place the checkpoint file under:
XStreamVGGT/ckpt/
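Before launching evaluation, it may help to confirm the checkpoint landed in the right place. A minimal sketch (the filename `checkpoints.pth` is taken from the layout below; adjust if your download uses a different name):

```shell
# Sanity check: verify the checkpoint file exists under ckpt/
# before running any evaluation script.
ckpt="ckpt/checkpoints.pth"
if [ -f "$ckpt" ]; then
  echo "checkpoint found: $ckpt"
else
  echo "missing checkpoint: $ckpt (download it first)" >&2
fi
```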
Please refer to the official instructions of the following repositories to prepare the evaluation datasets:
The supported datasets include:
- Sintel
- Bonn
- KITTI
- NYU-v2
- ScanNet
- 7Scenes
- Neural-RGBD
The overall folder structure should be organized as follows:
XStreamVGGT
├── ckpt/
│   └── checkpoints.pth
├── config/
│   └── ...
├── data/
│   ├── eval/
│   │   ├── 7scenes
│   │   ├── bonn
│   │   ├── kitti
│   │   ├── neural_rgbd
│   │   ├── nyu-v2
│   │   ├── scannetv2
│   │   └── sintel
│   └── train/
│       ├── processed_arkitscenes
│       └── ...
├── src/
└── ...
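The evaluation skeleton above can be created up front so dataset preparation scripts have their target directories in place. A sketch (directory names are taken from the layout above; the checkpoint and dataset contents must still be downloaded separately):

```shell
# Create the expected top-level and evaluation directories.
mkdir -p ckpt config src
for d in 7scenes bonn kitti neural_rgbd nyu-v2 scannetv2 sintel; do
  mkdir -p "data/eval/$d"
done
mkdir -p data/train/processed_arkitscenes
```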
To evaluate XStreamVGGT with KV cache pruning enabled:
CUDA_VISIBLE_DEVICES=0 \
KV_POOL_SIZE=16 \
KV_CACHE_SIZE=2048 \
bash eval/video_depth/run.sh

To evaluate the version with KV cache quantization, please switch to the corresponding branch first:
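Since the cache budget is set purely through environment variables, it is easy to sweep several budgets and compare accuracy against memory. A dry-run sketch that only prints the commands (whether these particular `KV_CACHE_SIZE` values suit your GPU is an assumption, not a project recommendation):

```shell
# Dry-run sweep over KV cache budgets: print each evaluation
# command instead of executing it.
cmds=""
for cache in 1024 2048 4096; do
  cmd="CUDA_VISIBLE_DEVICES=0 KV_POOL_SIZE=16 KV_CACHE_SIZE=$cache bash eval/video_depth/run.sh"
  echo "$cmd"
  cmds="$cmds$cmd
"
done
```

Dropping the `echo` and running `$cmd` directly (or piping the printed lines to `sh`) turns the dry run into a real sweep.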
git checkout prune_and_quantize

Then run:
CUDA_VISIBLE_DEVICES=0 \
KV_QUANT_MODE=KCVT \
KV_POOL_SIZE=16 \
KV_CACHE_SIZE=2048 \
bash eval/video_depth/run.sh

This codebase is built upon StreamVGGT and related streaming 3D reconstruction frameworks. We thank the authors for their open-source contributions.