🚀 Official code for “XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression”, published at SID’s Display Week 2026.

ywh187/XStreamVGGT

XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression

🔔 News

  • [2026.01] 🚀📄 Published as a conference paper at SID’s Display Week 2026!

Zunhai Su*, Weihao Ye*, Hansen Feng, Keyu Fan, Jing Zhang, Dahai Yu, Zhengwu Liu, Ngai Wong

* Equal contribution.


Overview

We propose XStreamVGGT, a tuning-free and extremely memory-efficient streaming vision geometry transformer that compresses the KV cache through joint token pruning and distribution-aware quantization. By removing redundant tokens and quantizing the remaining KV representations, XStreamVGGT achieves up to 4.42× memory reduction and 5.48× inference speedup with mostly negligible performance degradation. This enables scalable, long-horizon streaming 3D reconstruction in real-world applications.
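
To make the two compression steps concrete, here is a toy sketch in plain Python of top-k token pruning followed by uniform min-max quantization. It is not the repository's implementation: the per-token importance scores, the helper names (`prune_tokens`, `quantize_uniform`, `dequantize`), and the 4-bit setting are all assumptions for illustration, and the quantizer shown is a plain uniform one rather than the distribution-aware scheme the paper describes.

```python
# Toy illustration only: NOT the XStreamVGGT implementation.
# Assumptions: every cached token has a scalar importance score, and
# each token vector is quantized independently with uniform min-max scaling.

def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]

def quantize_uniform(vec, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1]; return (codes, scale, offset)."""
    levels = (1 << bits) - 1
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in vec]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Recover approximate floats from integer codes."""
    return [c * scale + lo for c in codes]

if __name__ == "__main__":
    # Four hypothetical cached token vectors and their hypothetical scores.
    cache = [[0.1, -0.4, 0.9], [0.0, 0.2, 0.1], [1.5, -1.0, 0.3], [0.05, 0.0, -0.02]]
    scores = [0.7, 0.1, 0.9, 0.05]
    for vec in prune_tokens(cache, scores, keep=2):
        codes, scale, lo = quantize_uniform(vec, bits=4)
        print(codes, [round(a, 3) for a in dequantize(codes, scale, lo)])
```

Pruning reduces the number of cached tokens, while quantization reduces the bits stored per remaining value, so the two savings multiply. The reconstruction error of each value in this sketch is bounded by half the quantization step.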


Environment Setup

We recommend using Conda to set up the environment:

conda env create -f StreamVGGT_environment.yml
conda activate streamvggt

Model Weights

Download the pretrained StreamVGGT model weights from:

After downloading, place the checkpoint file under:

XStreamVGGT/ckpt/

Evaluation Datasets

Please refer to the official instructions of the following repositories to prepare the evaluation datasets:

The supported datasets include:

  • Sintel
  • Bonn
  • KITTI
  • NYU-v2
  • ScanNet
  • 7Scenes
  • Neural-RGBD

Folder Structure

The overall folder structure should be organized as follows:

XStreamVGGT
├── ckpt/
│   └── checkpoints.pth
├── config/
│   ├── ...
├── data/
│   ├── eval/
│   │   ├── 7scenes
│   │   ├── bonn
│   │   ├── kitti
│   │   ├── neural_rgbd
│   │   ├── nyu-v2
│   │   ├── scannetv2
│   │   └── sintel
│   ├── train/
│   │   ├── processed_arkitscenes
│   │   ├── ...
└── src/
    ├── ...
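
Before launching an evaluation, it can help to confirm that the layout above is in place. The following is a small sketch, not a script shipped with the repository: the `EXPECTED` list and the `missing_paths` helper are hypothetical names, and the list only covers the checkpoint and evaluation entries named in the tree, since the elided `...` entries are unknown.

```python
from pathlib import Path

# Paths copied from the folder tree above (elided "..." entries are skipped).
EXPECTED = [
    "ckpt/checkpoints.pth",
    "config",
    "data/eval/7scenes",
    "data/eval/bonn",
    "data/eval/kitti",
    "data/eval/neural_rgbd",
    "data/eval/nyu-v2",
    "data/eval/scannetv2",
    "data/eval/sintel",
    "src",
]

def missing_paths(root):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]

if __name__ == "__main__":
    # Run from the repository root (the XStreamVGGT directory).
    missing = missing_paths(".")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Layout looks complete.")
```

Only the datasets you intend to evaluate need to be present, so a reported missing dataset folder is not necessarily an error.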

Evaluation

Standard KV Cache (Pruning Only)

To evaluate XStreamVGGT with KV cache pruning enabled:

CUDA_VISIBLE_DEVICES=0 \
KV_POOL_SIZE=16 \
KV_CACHE_SIZE=2048 \
bash eval/video_depth/run.sh

KV Cache Pruning with Simulated Quantization

To evaluate the version with KV cache quantization, please switch to the corresponding branch first:

git checkout prune_and_quantize

Then run:

CUDA_VISIBLE_DEVICES=0 \
KV_QUANT_MODE=KCVT \
KV_POOL_SIZE=16 \
KV_CACHE_SIZE=2048 \
bash eval/video_depth/run.sh
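
To compare several cache budgets, the commands above can be wrapped in a small driver. This is a sketch under assumptions: the environment variable names and the script path are taken from the commands above, while the `build_env` and `run_eval` helpers and the swept cache sizes are hypothetical, and it is not confirmed here which values `run.sh` accepts.

```python
import os
import subprocess

SCRIPT = "eval/video_depth/run.sh"  # path used in the commands above

def build_env(cache_size, pool_size=16, quant_mode=None, gpu="0"):
    """Return a copy of the current environment with the KV cache settings applied."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)
    env["KV_POOL_SIZE"] = str(pool_size)
    env["KV_CACHE_SIZE"] = str(cache_size)
    if quant_mode is None:
        # Pruning-only run: make sure no inherited quantization mode leaks in.
        env.pop("KV_QUANT_MODE", None)
    else:
        env["KV_QUANT_MODE"] = quant_mode
    return env

def run_eval(cache_size, **kwargs):
    """Run the evaluation script once with the given settings."""
    return subprocess.run(["bash", SCRIPT], env=build_env(cache_size, **kwargs), check=True)

if __name__ == "__main__":
    if not os.path.exists(SCRIPT):
        print("Run this from the repository root; " + SCRIPT + " was not found.")
    else:
        for size in (1024, 2048, 4096):  # illustrative values, not from the README
            run_eval(size)
```

For the quantized variant, check out the `prune_and_quantize` branch first and pass `quant_mode="KCVT"` to `run_eval`.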

Acknowledgements

This codebase is built upon StreamVGGT and related streaming 3D reconstruction frameworks. We thank the authors for their open-source contributions.
