Thanks to visit codestin.com
Credit goes to GitHub.com

Skip to content

VisualRWKV is the visual-enhanced version of the RWKV language model, enabling RWKV to handle various visual tasks.

License

Notifications You must be signed in to change notification settings

howard-hou/VisualRWKV

Repository files navigation

VisualRWKV: A Visual Language Model Based on RWKV

Logo

πŸ“– Paper | πŸ€— Model | 🐰 Demo

VisualRWKV is a visual language model based on the RWKV language model, enabling RWKV to handle various visual tasks.

Key Papers:

πŸš€ News and Updates

  • 2025.02.10 πŸ”₯ VisualRWKV-7.00 checkpoints released! [weights]
  • 2024.01.11 πŸ”₯ VisualRWKV-7.00 code released! [code]
  • 2024.06.25 πŸ”₯ VisualRWKV-6.0 checkpoints released! [weights]
  • 2024.05.11 πŸ”₯ VisualRWKV-6.0 code released! [code]
  • 2024.03.25 πŸ”₯ VisualRWKV-5.0 released!

πŸ“Š VisualRWKV v7.0 Metrics

The following table presents the performance comparison between VisualRWKV v7.0 and its predecessor VisualRWKV v6 across several benchmark datasets.

Model Name VQAv2(test-dev) ScienceQA(IMG) TextVQA GQA(acc) Vision Encoder
v0700+0b1 75.22 50.62 37.90 59.92 SigLIP+dinov2+Sam
v0700+0b4 77.85 54.98 41.05 62.30 SigLIP+dinov2+Sam
v0700+1b5 79.84 59.74 49.49 63.20 SigLIP+dinov2+Sam
VisualRWKV - v6 1.6B 73.66 57.02 48.70 58.23 SigLIP+dinov2+Sam
VisualRWKV - v6 3B 71.52 65.34 48.68 59.56 CLIP
VisualRWKV - v6 7B 75.82 68.22 51.01 64.27 CLIP

πŸ—οΈ Architecture

VisualRWKV Architecture

πŸ¦„ Model Zoo

VisualRWKV weights, checkpoints, and related results can be found in the Model Zoo.


πŸ’» Installation

1. Clone the repository

Clone the repo and navigate to the VisualRWKV folder. Version 7.00 is the stable release.

git clone https://github.com/howard-hou/VisualRWKV.git
cd VisualRWKV-v7/v7.00

2. Install dependencies

Create a conda environment and install the necessary packages.

conda create -n visualrwkv python=3.10 -y
conda activate visualrwkv
pip install --upgrade pip  # Enable PEP 660 support

# Install dependencies:
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install pytorch-lightning==1.9.5 deepspeed==0.7.0 wandb ninja

# For best performance, use the following:
pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu126
pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade

πŸ“š Pre-training and Fine-tuning

Latest stable version is VisualRWKV-v7/v7.00. Please navigate to the VisualRWKV-v7/v7.00 directory for running the code.

VisualRWKV training consists of two stages:

  1. Pre-training: Using a pretrain dataset to train a projection layer from a frozen pretrained vision encoder to the frozen RWKV.
  2. Fine-tuning: Using visual instruction data to teach the model to follow visual instructions.

πŸ”₯ Pre-training

Download LLaVA-Pretrain Dataset

You can download the LLaVA-Pretrain.

Download RWKV Checkpoints for Pre-training

If you want to pretrain the model yourself, download the following RWKV checkpoints.

VisualRWKV Version RWKV 0B1 RWKV 0B4 RWKV 1B5 RWKV 3B RWKV 7B
VisualRWKV-v6 - - RWKV-x060-World-1B6 RWKV-x060-World-3B RWKV-x060-World-7B
VisualRWKV-v700 RWKV-x070-World-0B1 RWKV-x070-World-0B4 RWKV-x070-World-1B5 - -

Pre-training Command

To pretrain the VisualRWKV-v7.0 model (example for using 4 GPUs with a 1B5 RWKV model): please refer to pretrain script


πŸ”§ Visual Instruction Tuning

Prepare Data

Refer to the LLaVA project for visual instruction data.

Fine-tuning Command

To fine-tune the VisualRWKV-v7.0 model, please refer to fine-tune script

About

VisualRWKV is the visual-enhanced version of the RWKV language model, enabling RWKV to handle various visual tasks.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 5