

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

The official repository of our NeurIPS 2025 paper.

Website · arXiv · HF Dataset: Inst-It-Bench · HF Dataset: Inst-It-Dataset · HF Model: Inst-It · Leaderboard
Wujian Peng1,2*, Lingchen Meng1*, Yitong Chen1,2, Yiweng Xie1, Yang Liu1, Tao Gui1, Hang Xu3, Xipeng Qiu1,2, Zuxuan Wu1,2†, Yu-Gang Jiang1
1School of Computer Science, Fudan University  2Shanghai Innovation Institute  3Huawei Noah’s Ark Lab 
* Equal contributions  † Corresponding author 

🔥 News

  • Feb. 19, 2025: The Inst-It-Bench evaluation toolkit is released; you can evaluate your model now!
  • Dec. 11, 2024: The Inst-It Dataset is available here. Welcome to use our dataset!
  • Dec. 5, 2024: Our checkpoints are available on Hugging Face.

🏆 Inst-It Bench

Inst-It Bench is a fine-grained multimodal benchmark for evaluating LMMs at the instance level.

  • Size: ~1,000 image QAs and ~1,000 video QAs
  • Splits: Image split and Video split
  • Evaluation Formats: Open-Ended and Multiple-Choice

See Evaluate.md to learn how to run the evaluation on Inst-It-Bench.
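
To get a quick look at the benchmark before running the full toolkit, you can browse it with the Hugging Face datasets library. This is a minimal sketch, assuming the benchmark is hosted under the repo id "Inst-IT/Inst-It-Bench" and using illustrative config, split, and field names; check the dataset card and Evaluate.md for the authoritative layout.

# Browse a few Inst-It-Bench examples with the `datasets` library.
from datasets import load_dataset

# Assumed repo id and config name; the benchmark has image and video
# splits, each in open-ended and multiple-choice formats.
bench = load_dataset("Inst-IT/Inst-It-Bench", "image_open_ended")

for example in bench["test"].select(range(3)):  # "test" split name is assumed
    # Hypothetical field names for a question about a marked instance
    # and its reference answer.
    print(example["question"])
    print(example["answer"])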

🏆 Inst-It Dataset

The Inst-It Dataset can be downloaded here. To our knowledge, this is the first dataset that provides fine-grained annotations centered on specific instances. In total, the Inst-It Dataset includes:

  • 21k videos
  • 51k images
  • 21k video-level descriptions
  • 207k frame-level descriptions (51k images, 156k video frames); each frame-level description includes captions of 1) individual instances, 2) the entire image, and 3) the temporal changes
  • 335k open-ended QA pairs

We visualize the data structure in the figure below, and you can view a more detailed data sample here.
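
If you prefer to fetch the data programmatically, the snippet below is a minimal sketch using huggingface_hub, assuming the dataset is hosted on the Hugging Face Hub under the repo id "Inst-IT/Inst-It-Dataset" (see the HF Dataset badge above).

# Download the full Inst-It-Dataset snapshot from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Inst-IT/Inst-It-Dataset",  # assumed repo id; verify on the Hub
    repo_type="dataset",
)
print(f"Dataset files downloaded to: {local_dir}")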


Click here to see the annotation format of Inst-It-Dataset. Video annotations follow this structure:
[
    {
        "video_id": int,
        "frame_level_caption": (annotation for each frame within this video)
          [
              {
                  "timestamp": int, (indicate the timestamp of this frame in the video, e.g. <1>)
                  "frame_name": string, (the image filename of this frame)
                  "instance_level": (caption for each instance within this frame)
                    {
                        "1": "caption for instance 1",
                        (more instance level captions ...)
                    },
                  "image_level": string, (caption for the entire frame)
                  "temporal_change": string (caption for the temporal changes relative to the previous frame)
              },
              (more frame level captions ...)
          ],
        "question_answer_pairs": (open ended question answer pairs)
          [
             {
                "question": "the question",
                "answer": "the corresponding answer"
              },
             (more question answer pairs ...)
          ],
        "video_level_caption": string, (a dense caption for the entire video, encompassing all frames)
        "video_path": string (the path to where this video is stored)
    },
    (more annotations for other videos ...)
]
Image annotations follow this structure:

[
    {
        "image_id": int,
        "instance_level_caption": (caption for each instance within this image)
          {
              "1": "caption for instance 1",
              (more instance level captions ...)
          },
        "image_level_caption": string, (caption for the entire image)
        "image_path": string (the path to where this image is stored)
    },
    (more annotations for other images ...)
]
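
To illustrate how these annotations can be consumed, here is a short parsing sketch for the video annotation format above. The file name is a placeholder; the field names follow the structure documented above.

import json

# Placeholder path; point this at the video annotation file you downloaded.
with open("video_annotations.json") as f:
    videos = json.load(f)

for video in videos:
    print(f"Video {video['video_id']}: {video['video_level_caption'][:80]}...")
    for frame in video["frame_level_caption"]:
        # Each frame carries instance-level captions keyed by instance id,
        # an image-level caption, and a temporal-change caption.
        for inst_id, caption in frame["instance_level"].items():
            print(f"  <{frame['timestamp']}> instance {inst_id}: {caption}")
        print(f"  image-level: {frame['image_level']}")
    for qa in video["question_answer_pairs"]:
        print(f"  Q: {qa['question']} | A: {qa['answer']}")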

🌐 Model weights

We trained two models based on LLaVA-Next using our Inst-It Dataset. They not only achieve outstanding performance on Inst-It-Bench but also demonstrate significant improvements on other generic image and video understanding benchmarks. We provide the checkpoints below:

Model                        | Checkpoints
LLaVA-Next-Inst-It-Vicuna-7B | weights and docs
LLaVA-Next-Inst-It-Qwen2-7B  | weights and docs
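
The linked model docs describe the exact usage for each checkpoint. As a rough sketch, loading typically goes through the load_pretrained_model builder of the LLaVA-NeXT codebase; the repo id and model-name hint below are assumptions, so defer to the per-checkpoint docs.

# Load an Inst-It checkpoint via the LLaVA-NeXT builder (llava must be installed).
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, max_length = load_pretrained_model(
    model_path="Inst-IT/LLaVA-Next-Inst-It-Qwen2-7B",  # assumed HF repo id
    model_base=None,
    model_name="llava_qwen",  # assumed model-name hint for the builder
)
model.eval()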

📝 Todo

  • Release the Inst-It Bench data and evaluation code.
  • Release the Inst-It Dataset.
  • Release the checkpoint of our fine-tuned models.
  • Release the meta-annotations of the Inst-It Dataset, such as instance segmentation masks, bounding boxes, and more ...
  • Release the annotation file of Inst-It Dataset, which follows the format in the LLaVA codebase.
  • Release the training code.

📧 Contact Us

Feel free to contact us if you have any questions or suggestions.

📎 Citation

If you find our work helpful, please consider citing our paper 📎 and starring our repo 🌟:

@article{peng2024inst,
  title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
  author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Xu, Hang and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2412.03565},
  year={2024}
}
