MMICL: EMPOWERING VISION-LANGUAGE MODEL WITH MULTI-MODAL IN-CONTEXT LEARNING
Haozhe Zhao*§1,2, Zefan Cai*1, Shuzheng Si*1, Xiaojian Ma3, Kaikai An1,
Liang Chen1, Zixuan Liu4, Sheng Wang4, Wenjuan Han†5, Baobao Chang†1
1 National Key Laboratory for Multimedia Information Processing, Peking University
2 School of Software and Microelectronics, Peking University, China
3 National Key Laboratory of General Artificial Intelligence, BIGAI
4 Paul G. Allen School of Computer Science and Engineering, University of Washington
5 Beijing Jiaotong University
* Equal contribution. § Project lead. † Corresponding author.
ABSTRACT
1 INTRODUCTION
General-purpose vision-language pre-trained models (VLMs) have made significant advancements (Li et al., 2022; 2023d;g; Zhu et al., 2023; Li et al., 2023b). Recent VLMs mostly augment a large language model (LLM) with a visual encoder and exhibit impressive zero-shot capabilities on various visual tasks. However, unlike LLMs, which can extract rich background knowledge and task information from the prompt via in-context learning (ICL), most VLMs still struggle to understand complex multi-modal prompts that include multiple images. Previous studies (Li et al., 2023d;b) primarily focus on handling user queries with a single image rather than multi-modal prompts with interleaved multiple images and text. Although some VLMs like Flamingo (Alayrac et al., 2022) and Kosmos-1 (Huang et al., 2023b) can handle user queries with multiple images, their pre-training data cannot provide more sophisticated multi-modal prompts than interleaved image and text crawled from the web (Awadalla et al., 2023). Hence, there is a gap between the prompts used in pre-training these VLMs and the user queries in real-world scenarios, which often contain multiple images and
more sophisticated text. Specifically, these VLMs may suffer from the following three limitations, which make them less effective in downstream vision-language tasks.
[Figure 1(e)-(f): example multi-image prompt ("These images illustrate the growth phases of the tree, please describe the contents of each image carefully") with the model's per-image answers (image 0 is just germinating, image 1 is only a bare trunk, image 2 is luxuriant, and image 3 is a growing plant).]
Hard to Understand Text-to-Image Reference: Previous studies rarely attempt to address the issue of text-to-image reference in multi-modal prompts. However, there are often intricate referential relationships between the text and images in user queries, with different words mentioning different images. For example, the user may ask a specific question involving multiple images (Fig. 1.c and Fig. 1.f) or use multiple images as exemplars and ask a question only about a specific image (Fig. 1.d). However, the training data used in previous studies (Li et al., 2023d; Alayrac et al., 2022; Huang et al., 2023a) are crawled from the web and may lack explicit text-to-image references. VLMs thus might fail to handle user queries involving intricate text-to-image references.
Hard to Understand the Relationships between Multiple Images: There are often spatial, temporal, and logical relationships between multiple images, and correctly understanding them allows the model to handle user queries better. However, the pre-training data used by previous VLMs (Alayrac et al., 2022) are collected from the internet and lack close connections among images, especially when these images are far apart on the same webpage. This hampers the ability of VLMs to understand the intricate relationships among images and further limits their reasoning ability.
Hard to Learn from In-Context Multi-Modal Demonstrations: Previous studies have shown that pretrained LLMs can benefit from a few in-context demonstrations (Brown et al., 2020; Dong et al., 2023). However, the ICL ability of current VLMs is rather limited. Specifically: 1) VLMs like BLIP-2 (Li et al., 2023d) and LLaVA (Li et al., 2023b) only support multi-modal prompts with a single image, hampering their ability to use multiple multi-modal demonstrations to enhance performance during inference; 2) Although VLMs such as Flamingo (Alayrac et al., 2022) support multi-image inputs during pretraining and exhibit emergent ICL abilities, their context schemes fail to provide text-to-image references and closely related images. This prevents sufficiently sophisticated prompts from being offered to the VLMs, thereby limiting the effectiveness of their ICL ability. Besides, the lack of further supervised instruction tuning hinders their effectiveness across downstream tasks.
In this paper, to address the aforementioned limitations: 1) We present MMICL, a new approach that allows VLMs to efficiently deal with multi-modal inputs, including relationships among multiple
images and text-to-image references. 2) We propose a novel context scheme that incorporates an extra image declaration section and image proxy tokens to enhance the ICL ability of the VLM. 3) We construct a multi-modal in-context learning dataset in accordance with the proposed scheme. The dataset is adapted from a range of existing datasets and can be used to support the training of more capable VLMs.
Figure 2: Comparison of different VLM architectures: VLMs focused on a single image, VLMs with few-shot ability, and MMICL with equal treatment of image and text representation.
Our experiments show that MMICL achieves new state-of-the-art performance on various vision-language benchmarks, including MME (Fu et al., 2023) and MMBench (Liu et al., 2023d)¹. Comprehensive examinations of the three limitations we aim to address reveal that MMICL exhibits exceptional ability in understanding text-to-image references (a 13-point improvement on the vision-language compositionality benchmark Winoground (Thrush et al., 2022a)) and intricate relationships among images (a 12-point improvement on the multi-image reasoning benchmark RAVEN (Huang et al., 2023a)). Moreover, MMICL demonstrates impressive multi-modal ICL performance across various tasks. We also observe that MMICL efficiently mitigates the language bias that often causes VLMs to ignore visual content when facing extensive textual contexts, leading to hallucinations.
2 MMICL
Most VLMs utilize Visual-Prompt Generators (VPG) (e.g., Resampler (Alayrac et al., 2022), Q-
former (Li et al., 2023d)) to extract visual embeddings from the image features encoded by vision
backbones and use visual embeddings to help LLMs understand visual inputs. The model architecture
shown in Fig. 2.a belongs to VLMs that focus on prompts with a single image, such as BLIP-2 (Li et al., 2023d), which always places the image at the top of the entire input and cannot handle inputs with multiple images. In Fig. 2.b, VLMs with few-shot ability, such as Flamingo (Alayrac et al., 2022), encode images into image embeddings with a fixed number of visual tokens and insert new gated cross-attention layers into the LLM to inject visual features. Different from previous work, MMICL, shown in Fig. 2.c, treats image and text representations equally and establishes the reference between image and text via image declaration. This gives users the flexibility to input multiple images and text in any desired order, with no restrictions on the quantity or placement of images in the context. As shown in Fig. 5, each given image is encoded by a vision encoder (e.g., ViT (Radford et al., 2021)) to get the image representation. Then, we use the Q-former as the VPG to encode images into embeddings understandable by the language model. We utilize a fully connected layer as the projection layer to convert each visual embedding to the same dimension as the text embeddings of the LLM. Finally, we combine the visual and text embeddings into an interleaved sequence and feed them into the LLM. This design is a natural extension of the original attention mechanism in LLMs. We set the weights for mapping query and value vectors in the attention layers of the LLM as learnable to better adapt to multi-modal prompts with multiple images. More details are presented in Appendix E.
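The interleaving step can be sketched as follows. This is a minimal PyTorch-style illustration, not the released implementation: the module names (vision_encoder, qformer, llm), the dict-based segment format, and the Hugging-Face-style get_input_embeddings() call are assumptions made for readability.

```python
import torch
import torch.nn as nn


class MMICLSketch(nn.Module):
    """Minimal sketch of the interleaved-embedding forward pass described above."""

    def __init__(self, vision_encoder, qformer, llm, vis_dim, txt_dim):
        super().__init__()
        self.vision_encoder = vision_encoder           # frozen ViT-style encoder
        self.qformer = qformer                         # VPG: image features -> visual tokens
        self.projection = nn.Linear(vis_dim, txt_dim)  # map visual tokens to the LLM embedding size
        self.llm = llm                                 # FLAN-T5-style LLM (Q/V weights trainable)

    def embed_prompt(self, segments):
        """`segments` is an interleaved list of dicts, each either
        {"type": "image", "pixels": tensor} or {"type": "text", "input_ids": tensor}."""
        embeds = []
        for seg in segments:
            if seg["type"] == "image":
                feats = self.vision_encoder(seg["pixels"])       # (1, n_patch, vis_dim)
                vis_tokens = self.qformer(feats)                 # (1, n_query, vis_dim)
                embeds.append(self.projection(vis_tokens))       # (1, n_query, txt_dim)
            else:
                embeds.append(self.llm.get_input_embeddings()(seg["input_ids"]))
        # one interleaved sequence of visual and textual embeddings for the LLM
        return torch.cat(embeds, dim=1)
```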
In this section, we outline the design of the Context Scheme for MMICL. The proposed scheme is
devised to proficiently transform the interleaved image-text data into the training context for MMICL.
¹ Results of MMICL were submitted on August 28th, 2023.
Figure 3: Context scheme for MMICL, which seamlessly transforms the interleaved image-text data
into training context in a unified format.
Figure 4: Illustration of the automatic data construction pipeline for multi-modal data with interconnected images. The automatic construction is based on the existing annotations of the VCR dataset (Zellers et al., 2019) without human involvement. ChatGPT is used for instruction refinement.
We construct few-shot exemplars by sampling instances from the data. These exemplars
are combined with the input instance to produce the multi-modal in-context data. In this way, we
can transform all tasks into a unified multi-modal in-context format, as illustrated in Fig. 3.c. This
method facilitates amassing an extensive amount of high-quality data from different tasks, enriching
the context scheme of MMICL with an abundant diversity of multi-modal in-context data teeming
with diverse instructions. Ultimately, this improves the model’s ability to follow instructions and
multi-modal in-context learning ability. Each instance $I_i$ comprises $N$ exemplars:
$$I_i = \left(\{P_1, \cdots, P_N\}, X_i, q_i, a_i\right) \tag{3}$$
Each exemplar $P_j = (X_j, q_j, a_j)$, where $X_j$ denotes the image declaration of the $j$-th exemplar, and $q_j$ and $a_j$ denote the question and answer for the $j$-th exemplar, respectively.
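A hypothetical sketch of how such an instance can be flattened into a single interleaved prompt is given below; the "[IMGk]" proxy-token format and the declaration wording are illustrative assumptions rather than the exact MIC templates.

```python
# Hypothetical sketch of flattening one in-context instance
# I_i = ({P_1, ..., P_N}, X_i, q_i, a_i) into a single interleaved prompt.

def image_declaration(idx: int) -> str:
    """One image-declaration segment: a named proxy token per image."""
    return f"image {idx}: [IMG{idx}]"


def build_instance(exemplars, query_images, question, answer=None):
    """exemplars: list of (images, question, answer) tuples, i.e. the P_j."""
    parts, img_counter = [], 0
    for imgs, q, a in exemplars:                       # exemplars P_1 .. P_N
        decl = " ".join(image_declaration(img_counter + k) for k in range(len(imgs)))
        parts.append(f"{decl} question: {q} answer: {a}")
        img_counter += len(imgs)
    decl = " ".join(image_declaration(img_counter + k) for k in range(len(query_images)))
    parts.append(f"{decl} question: {question} answer:")   # query part (X_i, q_i)
    return "\n".join(parts), answer                        # a_i is the training target


# Example: two single-image exemplars followed by a single-image query.
prompt, target = build_instance(
    exemplars=[(["img_a"], "What season is it?", "winter"),
               (["img_b"], "What season is it?", "summer")],
    query_images=["img_c"],
    question="What season is it?",
)
```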
To help VLMs understand complex prompts, we construct the MIC dataset by gathering data from public resources and converting them based on the context scheme. It has three key aspects: 1) image declaration, 2) multi-modal data with closely related images, and 3) multi-modal in-context data for different tasks. The training set of MIC comes from 16 datasets across 8 categories, while the test set comes from 18 datasets across 10 categories.
Our dataset is automatically constructed from existing datasets. First, we create an image declaration for each instance in all datasets to produce datasets with explicit text-to-image references. Second, we create an instruction template for each dataset and ask ChatGPT to rewrite the instructions, filling in the data from the existing datasets to obtain a dataset with diverse instruction formats. Finally, we use these datasets with instructions to construct the MIC dataset according to our proposed
context scheme. For the example presented in Fig. 3 and Fig. 4 (e.g., two people quarreling with each other), we construct the data based on the existing annotations (i.e., bounding boxes and the relations between them) provided by the VCR dataset (Zellers et al., 2019). Additionally, we also construct an in-context learning dataset by sampling examples from the original dataset. We also extract eight frames per video from video datasets to generate multi-modal data with interconnected images. Details are presented in Appendix D.
Figure 5: Illustration of the MMICL architecture and training paradigm. The upper part shows an overview of the model architecture, and the bottom shows the pipeline of the two-stage training paradigm.
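Returning to the VCR-based construction described above, the following is a purely illustrative sketch of that step: each referenced region becomes its own image with a declaration the question can point to. The annotation field names ("objects", "boxes") follow common VCR loaders and, like the declaration wording, are assumptions rather than the exact MIC pipeline.

```python
from PIL import Image


def build_referential_instance(image_path, annotation, question_template):
    """Turn a VCR-style annotation (image + named bounding boxes) into a referential instance."""
    scene = Image.open(image_path).convert("RGB")
    images, declarations = [scene], ["image 0 shows the full scene"]
    for k, (name, box) in enumerate(zip(annotation["objects"], annotation["boxes"]), start=1):
        x0, y0, x1, y1 = map(int, box[:4])
        images.append(scene.crop((x0, y0, x1, y1)))     # the referenced region as its own image
        declarations.append(f"image {k} shows the {name} in image 0")
    question = question_template.format(refs="; ".join(declarations))
    return images, question
```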
We convert all data into a vision-language Q&A format to create high-quality multi-modal training data, accumulating 5.8M samples in the MIC dataset. Due to resource constraints, we use approximately 10% of the data, with the sampling strategy described in Appendix F, to finetune MMICL. We anticipate that a larger model trained on all of our data would yield even more promising results.
Stage I: Pretraining. This stage aims to help the model align the image and text embeddings. During this stage, both the vision encoder and the LLM remain frozen. The VPG (i.e., Q-Former) and the projection layer are trained to learn visual embeddings that can be interpreted by the LLM.
Stage II: Multi-Modal In-Context Tuning. In this stage, we aim to address the aforementioned limitations and take our model a step further by extending it to multi-modal in-context learning. Specifically, we aim to make the model understand the intricate referential relationships between text and images as well as the complex relationships among multiple images, and ultimately acquire a proficient multi-modal in-context learning ability. Therefore, we perform multi-modal in-context tuning on the MIC dataset. During Stage II, we freeze the image encoder, Q-former, and LLM while jointly training the projection layer and the query and value vectors. Details can be found in Appendix H.
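A minimal sketch of the Stage-II parameter freezing is shown below; the attribute names (projection, llm) and the T5-style parameter suffixes ("q.weight", "v.weight") are assumptions about the implementation rather than the released code.

```python
def configure_stage2(model):
    """Stage-II multi-modal in-context tuning: freeze everything, then re-enable
    the projection layer and the query/value projections inside the LLM attention layers."""
    for p in model.parameters():
        p.requires_grad = False                        # image encoder, Q-Former, and LLM frozen

    for p in model.projection.parameters():            # projection layer stays trainable
        p.requires_grad = True

    for name, p in model.llm.named_parameters():       # only query/value mappings in attention
        if name.endswith(("q.weight", "v.weight")):
            p.requires_grad = True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable:,}")
```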
3 EXPERIMENT
Evaluation Setup. We aim to develop general-purpose VLMs that can generally adapt to diverse,
challenging multi-modal prompts. Therefore, we evaluate our models in several vision-language
benchmarks, including tasks that involve images and videos. The metrics used in these benchmarks
and further details are shown in Appendix M.
Models and Baselines. We provide two versions of MMICL: (1) MMICL (FLAN-T5), which uses BLIP-2 (Li et al., 2023d) as the backbone, and (2) MMICL (Instruct-FLAN-T5), which uses InstructBLIP (Dai et al., 2023) as the backbone. We adopt both the XL and XXL versions of FLAN-T5 (Chung et al., 2022) for both variants. We compare MMICL with the following strong baselines: Flamingo (Alayrac et al., 2022), KOSMOS-1 (Huang et al., 2023a), BLIP-2-FLAN-T5, InstructBLIP-FLAN-T5, Shikra (Chen et al., 2023a), Otter (Li et al., 2023a), and Ying-VLM (Li et al., 2023e). The details of MMICL and the baselines are shown in Appendix H and Appendix O.
Model  Model Size  |  Cognition: Comm.  Num.  Text.  Code.  |  Perception: Existen.  Count  Pos.  Color  OCR  Poster  Cele.  Scene  Land.  Art.  |  Total Avg.
LLaVA 13B 57.14 50.00 57.50 50.00 50.00 50.00 50.00 55.00 50.00 50.00 48.82 50.00 50.00 49.00 51.25
MiniGPT-4 13B 59.29 45.00 0.00 40.00 68.33 55.00 43.33 75.00 57.50 41.84 54.41 71.75 54.00 60.50 51.85
MultiModal-GPT 9B 49.29 62.50 60.00 55.00 61.67 55.00 58.33 68.33 82.50 57.82 73.82 68.00 69.75 59.50 62.97
VisualGLM-6B 6B 39.29 45.00 50.00 47.50 85.00 50.00 48.33 55.00 42.50 65.99 53.24 146.25 83.75 75.25 63.36
VPGTrans 7B 64.29 50.00 77.50 57.50 70.00 85.00 63.33 73.33 77.50 84.01 53.53 141.75 64.75 77.25 74.27
LaVIN 13B 87.14 65.00 47.50 50.00 185.00 88.33 63.33 75.00 107.50 79.59 47.35 136.75 93.50 87.25 86.66
LLaMA-Adapter-V2 7B 81.43 62.50 50.00 55.00 120.00 50.00 48.33 75.00 125.00 99.66 86.18 148.50 150.25 69.75 87.26
mPLUG-Owl 7B 78.57 60.00 80.00 57.50 120.00 50.00 50.00 55.00 65.00 136.05 100.29 135.50 159.25 96.25 88.82
InstructBLIP 12.1B 129.29 40.00 65.00 57.50 185.00 143.33 66.67 153.33 72.50 123.81 101.18 153.00 79.75 134.25 107.47
BLIP-2 12.1B 110.00 40.00 65.00 75.00 160.00 135.00 73.33 148.33 110.00 141.84 105.59 145.25 138.00 136.50 113.13
Lynx 7B 110.71 17.50 42.50 45.00 195.00 151.67 90.00 170.00 77.50 124.83 118.24 164.50 162.00 119.50 113.50
GIT2 5.1B 99.29 50.00 67.50 45.00 190.00 118.33 96.67 158.33 65.00 112.59 145.88 158.50 140.50 146.25 113.85
Otter 9B 106.43 72.50 57.50 70.00 195.00 88.33 86.67 113.33 72.50 138.78 172.65 158.75 137.25 129.00 114.19
Cheetor 7B 98.57 77.50 57.50 87.50 180.00 96.67 80.00 116.67 100.00 147.28 164.12 156.00 145.73 113.50 115.79
LRV-Instruction 7B 100.71 70.00 85.00 72.50 165.00 111.67 86.67 165.00 110.00 139.04 112.65 147.98 160.53 101.25 116.29
BLIVA 12.1B 136.43 57.50 77.50 60.00 180.00 138.33 81.67 180.00 87.50 155.10 140.88 151.50 89.50 133.25 119.23
MMICL 12.1B 136.43 82.50 132.50 77.50 170.00 160.00 81.67 156.67 100.00 146.26 141.76 153.75 136.13 135.50 129.33
Table 1: Evaluation results on the MME. Top two scores are highlighted and underlined, respectively.
Figure 6: Illustration of two complex vision-language reasoning tasks: Winoground (Thrush et al., 2022b) (left; e.g., matching the captions "some plants surrounding a lightbulb" vs. "a lightbulb surrounding some plants" to the correct images) and RAVEN (Zhang et al., 2019) (right).
We evaluate the general performance of MMICL on both the MME (Fu et al., 2023) and MMBench (Liu et al., 2023d) benchmarks². MME evaluates VLMs with 14 sub-tasks that encompass cognition and perception abilities. Results in Table 1 show that MMICL achieves the best average scores among current VLMs on both cognition and perception tasks. MMICL also demonstrates outstanding performance and significantly surpasses other VLMs on the MMBench benchmark, which thoroughly evaluates the diverse skills of VLMs. The detailed results are presented in Table 22. See Appendix I and Appendix J for MMICL's evaluation details and comparisons with other VLMs.
Winoground (Thrush et al., 2022b) proposes the task of correctly matching two given images and two captions, as depicted on the left of Fig. 6. The challenge lies in the fact that both captions consist of exactly the same words, albeit in a different order. VLMs must compare both images and texts to discern their subtle differences and capture the implicit reference between them. Therefore, we select Winoground to evaluate whether VLMs understand text-to-image references. MMICL is given two images and two captions in each prompt during evaluation. Results in Table 2 demonstrate that MMICL captures the referential relationship between image and text, surpassing previous baselines.

Table 2: Results on Winoground across text, image, and group score metrics.
Model  Text  Image  Group
MTurk Human  89.50  88.50  85.50
VQ2 (Yarom et al., 2023)  47.00  42.20  30.50
PALI (Chen et al., 2022)  46.50  38.00  28.75
BLIP-2 (Li et al., 2023d)  44.00  26.00  23.50
GPT4-V (Wu et al., 2023)  69.25  46.25  39.25
MMICL (FLAN-T5-XXL)  45.00  45.00  43.00
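For reference, the Winoground text, image, and group scores reported in Table 2 can be computed from pairwise caption-image matching scores as sketched below; the score(caption, image) interface is an assumption about how a particular model is queried, not part of the benchmark itself.

```python
def winoground_metrics(examples, score):
    """Compute Winoground text / image / group accuracy.

    `examples` is a list of (c0, c1, i0, i1) tuples; `score(caption, image)`
    is any model-dependent matching score.
    """
    text_ok = image_ok = group_ok = 0
    for c0, c1, i0, i1 in examples:
        s = {(a, b): score(c, i)
             for a, c in enumerate((c0, c1))
             for b, i in enumerate((i0, i1))}
        text = s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)]    # right caption for each image
        image = s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)]   # right image for each caption
        text_ok += text
        image_ok += image
        group_ok += (text and image)
    n = len(examples)
    return text_ok / n, image_ok / n, group_ok / n
```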
² All the reported performance numbers for the baseline methods are taken from the leaderboards of MME (Fu et al., 2023) and MMBench (Liu et al., 2023d). We report the result of MMICL with the InstructBLIP-FLAN-T5-XXL backbone.
Table 4: Main results illustrating the multi-modal in-context learning ability of MMICL across various vision-language tasks⁴. All metrics employed in the evaluations are introduced in Table 25.
As shown in Table 4, we evaluate the multi-modal in-context learning ability of MMICL across various vision-language tasks. MMICL outperforms other VLMs on both the held-in and held-out datasets and achieves state-of-the-art few-shot performance. For example, the few-shot (4-shot) evaluation of MMICL on the VizWiz benchmark outperforms the baselines Flamingo-9B (Alayrac et al., 2022) and KOSMOS-1 (Huang et al., 2023b) by 15.38 and 14.98 points, respectively. Since VizWiz is never exposed in the training data, this superior performance suggests the ability of MMICL to generalize to new tasks with a few exemplars. The few-shot performance on Flickr30K decreases as examples are given because the caption examples may introduce noise for the VLM when finishing the task (i.e., in-context exemplars generally do not provide hints for models to perform image captioning tasks).
⁴ The anomalous score of InstructBLIP on the VizWiz dataset results from correct outputs not being counted by the VQA accuracy metric due to exact-match failures.
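For context, the VQA accuracy metric referred to in this footnote is commonly computed as sketched below (answer normalization is omitted here), which makes clear why free-form outputs can fail the exact match even when semantically correct.

```python
def vqa_accuracy(prediction: str, gt_answers: list) -> float:
    """Standard VQA accuracy: min(#matching annotator answers / 3, 1).

    A prediction only counts when it exactly matches an annotator answer
    (the official metric also normalizes punctuation and articles, omitted here).
    """
    matches = sum(prediction == a for a in gt_answers)
    return min(matches / 3.0, 1.0)


# e.g. ten annotator answers, four of which say "blue"
print(vqa_accuracy("blue", ["blue"] * 4 + ["teal"] * 6))       # 1.0
print(vqa_accuracy("the sky is blue", ["blue"] * 10))          # 0.0, an exact-match failure
```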
Table 6: Ablation study on Training Paradigm across five datasets: VSR (Liu et al., 2022), IconQA-text (Lu et al., 2021), VisDial (Das et al., 2017), IconQA-img, and Bongard-HOI (Jiang et al., 2022).

Table 7: Ablation study on Context Scheme across different tasks. We only report the overall perception and cognition scores of MME and the Group score of Winoground.
Model  MME-Perception  MME-Cognition  Icon-QA  NLVR2  Raven  Winoground
- w/o context scheme  1238.99  316.79  52.80  56.65  8.00  6.00
- w/o image declaration  1170.87  341.07  47.15  61.00  18.00  3.00
- w/o in-context format  1141.02  345.36  51.95  62.63  28.00  20.00
- w/o interrelated images  1207.70  333.21  54.35  59.60  16.00  25.75
MMICL  1303.59  370.71  58.12  72.45  32.00  38.75
3.5 HALLUCINATION AND LANGUAGE BIAS OF VLMS
Current VLMs suffer from significant visual hallucination (Li et al., 2023f), preventing them from fully benefiting from multi-modal ICL. Especially when dealing with complex prompts containing multiple images (e.g., multi-modal chain of thought (Zhang et al., 2023b)), VLMs often overlook visual content when facing extensive text. This language bias reduces their accuracy on questions that require both images and text. ScienceQA-IMG (Lu et al., 2022) is a challenging task that requires the model to use both modalities to answer the question. We manually split the dataset into two groups: questions that need the image to answer and those that do not. Details are presented in Appendix Q. Extensive experiments in Table 5 demonstrate that MMICL effectively mitigates language bias, as it performs equally well on both groups. We also examine object hallucination of MMICL in Appendix L, where it shows impressive performance.
Ablation Study on Training Paradigm: We conduct an ablation study on various tasks to evaluate the effect of multi-modal in-context tuning. Table 6 shows a significant enhancement due to multi-modal in-context tuning, observed across all types and sizes of models, especially for tasks that involve multiple images. This indicates that, with the help of Stage II, MMICL can handle complex multi-modal prompts and accomplish challenging tasks with multiple images. Results in Appendix K also confirm this point with the outstanding performance of MMICL across video datasets.
Ablation Study on Context Scheme: We conduct ablation experiments using InstructBLIP-FLAN-T5-XL as the backbone model under various settings to verify the effectiveness of the context scheme, aiming to pinpoint the exact source of the improvements in certain capabilities of our model. As shown in Table 7, the superiority of MMICL is driven by the collective impact of our design elements: no single component can be removed without hurting performance. Each component of our design contributes significantly to different aspects of our model. Details and further analysis are available in Appendix N.
4 CONCLUSION
In this paper, we highlight the limitations of VLMs in handling complex multi-modal prompts with multiple images, which make VLMs less effective in downstream vision-language tasks. We introduce MMICL to address these limitations and take our model a step further by extending it to multi-modal in-context learning. This enables VLMs to better understand complex multi-modal prompts. Furthermore, MMICL sets new state-of-the-art performance on general VLM benchmarks and complex multi-modal reasoning benchmarks.
5 ACKNOWLEDGEMENT
We extend our heartfelt gratitude to the anonymous reviewers whose dedication and insightful
feedback have significantly enhanced the caliber of this paper. Their constructive critiques and
valuable suggestions were instrumental in refining our work. Additionally, we are deeply appreciative
of the Program Chairs and Area Chairs for their meticulous handling of our submission and for their
comprehensive and invaluable feedback. Their guidance has been pivotal in elevating the quality of
our research.
This work is supported by the National Science Foundation of China under Grant Nos. 61936012 and 61876004.
REFERENCES
Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv
Batra, and Devi Parikh. Vqa: Visual question answering, 2016.
Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra,
Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In ICCV,
pp. 8948–8957, 2019.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel
Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford,
Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick,
Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bińkowski,
Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karén Simonyan. Flamingo: a visual
language model for few-shot learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave,
K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.
23716–23736. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_
files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf.
Kaikai An, Ce Zheng, Bofei Gao, Haozhe Zhao, and Baobao Chang. Coarse-to-fine dual encoders
are better frame identification learners. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),
Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 13455–13466,
Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.
findings-emnlp.897. URL https://aclanthology.org/2023.findings-emnlp.897.
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe,
Yonatan Bitton, Samir Yitzhak Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei
Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source
framework for training large autoregressive vision-language models. ArXiv, abs/2308.01390, 2023.
URL https://api.semanticscholar.org/CorpusID:261043320.
Jeffrey P Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C Miller, Robin
Miller, Aubrey Tatarowicz, Brandyn White, Samual White, et al. Vizwiz: nearly real-time answers
to visual questions. In Proceedings of the 23rd annual ACM symposium on User interface software
and technology, pp. 333–342, 2010.
Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawa-
har, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the
IEEE/CVF international conference on computer vision, pp. 4291–4301, 2019.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler,
Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott
Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
Yucheng Cai, Wentao Ma, Yuchuan Wu, Shuzheng Si, Yuan Shao, Zhijian Ou, and Yongbin Li.
Unipcm: Universal pre-trained conversation model with task-aware automatic prompt, 2023.
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing
web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3558–3568, 2021.
David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In
Proceedings of the 49th annual meeting of the association for computational linguistics: human
language technologies, pp. 190–200, 2011.
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing
multimodal llm’s referential dialogue magic, 2023a.
Liang Chen, Shuming Ma, Dongdong Zhang, Furu Wei, and Baobao Chang. On the pareto front of
multilingual neural machine translation. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt,
and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 33188–
33201. Curran Associates, Inc., 2023b. URL https://proceedings.neurips.cc/paper_
files/paper/2023/file/690eb240baf1180b69dac48fc905c918-Paper-Conference.pdf.
Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Tianyu
Liu, and Baobao Chang. Towards end-to-end embodied decision making via multi-modal large
language model: Explorations with gpt4-vision and beyond, 2023c.
Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Xiangdi
Meng, Tianyu Liu, and Baobao Chang. Pca-bench: Evaluating multimodal large language models
in perception-cognition-action chain, 2024a.
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang.
An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-
language models, 2024b.
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian
Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual
language-image model. arXiv preprint arXiv:2209.06794, 2022.
Xingyu Chen, Zihan Zhao, Lu Chen, JiaBao Ji, Danyang Zhang, Ao Luo, Yuxuan Xiong, and Kai Yu.
WebSRC: A dataset for web-based structural reading comprehension. In Proceedings of the 2021
Conference on Empirical Methods in Natural Language Processing, pp. 4173–4185, Online and
Punta Cana, Dominican Republic, November 2021a. Association for Computational Linguistics.
doi: 10.18653/v1/2021.emnlp-main.343. URL https://aclanthology.org/2021.emnlp-main.
343.
Xingyu Chen, Zihan Zhao, Lu Chen, Danyang Zhang, Jiabao Ji, Ao Luo, Yuxuan Xiong, and
Kai Yu. Websrc: A dataset for web-based structural reading comprehension. arXiv preprint
arXiv:2101.09465, 2021b.
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng,
Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna:
An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https:
//lmsys.org/blog/2023-03-30-vicuna/.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li,
Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun
Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin
Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang,
Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny
Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models, 2022.
Maria Cipollone, Catherine C Schifter, and Rick A Moffat. Minecraft as a creative tool: A case study.
International Journal of Game-Based Learning (IJGBL), 4(2):1–14, 2014.
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang,
Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language
models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh,
and Dhruv Batra. Visual dialog. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pp. 326–335, 2017.
Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei
Li, and Zhifang Sui. A survey on in-context learning, 2023.
Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang.
Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint
arXiv:2103.10360, 2021.
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong
Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 19358–19369, June 2023.
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei
Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A comprehensive
evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394,
2023.
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu,
Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model.
arXiv preprint arXiv:2304.15010, 2023.
Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot
learners. arXiv preprint arXiv:2012.15723, 2020.
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu,
Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for
dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa
matter: Elevating the role of image understanding in visual question answering. In CVPR, July
2017.
Helan Hu, Shuzheng Si, Haozhe Zhao, Shuang Zeng, Kaikai An, Zefan Cai, and Baobao Chang.
Distantly-supervised named entity recognition with uncertainty-aware teacher learning and student-
student collaborative learning, 2023a.
Wenbo Hu, Yifan Xu, Y Li, W Li, Z Chen, and Z Tu. Bliva: A simple multimodal llm for better
handling of text-rich visual questions. arXiv preprint arXiv:2308.09936, 2023b.
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao
Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning
perception with language models. arXiv preprint arXiv:2302.14045, 2023a.
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao
Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning
perception with language models. arXiv preprint arXiv:2302.14045, 2023b.
Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning
and compositional question answering. In CVPR, 2019.
Huaizu Jiang, Xiaojian Ma, Weili Nie, Zhiding Yu, Yuke Zhu, and Anima Anandkumar. Bongard-hoi:
Benchmarking few-shot visual reasoning for human-object interactions. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19056–19065, 2022.
Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia,
and Davide Testuggine. The hateful memes challenge: Detecting hate speech in multimodal
memes. Advances in Neural Information Processing Systems, 33:2611–2624, 2020.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie
Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language
and vision using crowdsourced dense image annotations. International journal of computer vision,
123:32–73, 2017.
Po-Nien Kung and Nanyun Peng. Do models really learn to follow instructions? an empirical study
of instruction tuning. arXiv preprint arXiv:2305.11383, 2023.
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A
multi-modal model with in-context instruction tuning, 2023a.
Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan
Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision
assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023b.
Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-
Seng Chua, Siliang Tang, and Yueting Zhuang. Empowering vision-language models to follow
interleaved vision-language instructions. arXiv preprint arXiv:2308.04152, 2023c.
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-
training for unified vision-language understanding and generation. In International Conference on
Machine Learning, pp. 12888–12900. PMLR, 2022.
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-
training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597,
2023d.
Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang,
Jingjing Xu, Xu Sun, Lingpeng Kong, and Qi Liu. M3 it: A large-scale dataset towards multi-modal
multilingual instruction tuning, 2023e.
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object
hallucination in large vision-language models, 2023f.
Yunshui Li, Binyuan Hui, ZhiChao Yin, Min Yang, Fei Huang, and Yongbin Li. PaCE: Unified
multi-modal dialogue pre-training with progressive and compositional experts. In Proceedings of
the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pp. 13402–13416, Toronto, Canada, July 2023g. Association for Computational Linguistics. doi:
10.18653/v1/2023.acl-long.749. URL https://aclanthology.org/2023.acl-long.749.
Yunshui Li, Binyuan Hui, Xiaobo Xia, Jiaxi Yang, Min Yang, Lei Zhang, Shuzheng Si, Junhao Liu,
Tongliang Liu, Fei Huang, and Yongbin Li. One shot learning as instruction data prospector for
large language models, 2024.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr
Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. arXiv preprint
arXiv:2205.00363, 2022.
Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large
multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. 2023b.
Xinyu Liu, Yan Ding, Kaikai An, Chunyang Xiao, Pranava Madhyastha, Tong Xiao, and Jingbo Zhu.
Towards robust aspect-based sentiment analysis through non-counterfactual augmentations. arXiv
preprint arXiv:2306.13971, 2023c.
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi
Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model
an all-around player?, 2023d.
Yuliang Liu, Xiangru Tang, Zefan Cai, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu,
Zengxian Yang, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Zhengliang
Li, Liang Chen, Yiming Zong, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yujia Qin,
Wangchunshu Zhou, Yilun Zhao, Arman Cohan, and Mark Gerstein. Ml-bench: Large language
models leverage open-source libraries for machine learning tasks, 2023e.
Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang,
and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual
language reasoning. arXiv preprint arXiv:2110.13214, 2021.
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord,
Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for
science question answering. Advances in Neural Information Processing Systems, 35:2507–2521,
2022.
Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and
quick: Efficient vision-language instruction tuning for large language models. arXiv preprint
arXiv:2305.15023, 2023.
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual
question answering benchmark requiring external knowledge. In Conference on Computer Vision
and Pattern Recognition (CVPR), 2019.
Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. Metaicl: Learning to learn in
context. arXiv preprint arXiv:2110.15943, 2021.
Ivona Najdenkoska, Xiantong Zhen, and Marcel Worring. Meta learning to bridge vision and language
models for multimodal few-shot learning. arXiv preprint arXiv:2302.14794, 2023.
OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.
Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 mil-
lion captioned photographs. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K.Q.
Weinberger (eds.), Advances in Neural Information Processing Systems, volume 24. Curran Asso-
ciates, Inc., 2011. URL https://proceedings.neurips.cc/paper_files/paper/2011/file/
5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf.
Junting Pan, Ziyi Lin, Yuying Ge, Xiatian Zhu, Renrui Zhang, Yi Wang, Yu Qiao, and Hongsheng
Li. Retrieving-to-answer: Zero-shot video question answering with frozen large language models,
2023.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. In International conference on machine learning, pp.
8748–8763. PMLR, 2021.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer, 2023.
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations
toward training trillion parameter models. In SC20: International Conference for High Performance
Computing, Networking, Storage and Analysis, pp. 1–16. IEEE, 2020.
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System op-
timizations enable training deep learning models with over 100 billion parameters. In Pro-
ceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, KDD ’20, pp. 3505–3506, New York, NY, USA, 2020. Association for Com-
puting Machinery. ISBN 9781450379984. doi: 10.1145/3394486.3406703. URL https:
//doi.org/10.1145/3394486.3406703.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition
challenge. International journal of computer vision, 115:211–252, 2015.
Babak Saleh and Ahmed Elgammal. Large-scale classification of fine-art paintings: Learning the
right metric on the right feature. arXiv preprint arXiv:1505.00855, 2015.
Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis,
Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of
clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi.
A-okvqa: A benchmark for visual question answering using world knowledge, 2022.
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned,
hypernymed, image alt-text dataset for automatic image captioning. In Iryna Gurevych and Yusuke
Miyao (eds.), Proceedings of the 56th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pp. 2556–2565, Melbourne, Australia, July 2018. Association
for Computational Linguistics. doi: 10.18653/v1/P18-1238. URL https://aclanthology.org/
P18-1238.
Shuzheng Si, Shuang Zeng, and Baobao Chang. Mining clues from incomplete utterance: A
query-enhanced network for incomplete utterance rewriting. In Marine Carpuat, Marie-Catherine
de Marneffe, and Ivan Vladimir Meza Ruiz (eds.), Proceedings of the 2022 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, pp. 4839–4847, Seattle, United States, July 2022a. Association for Computational
Linguistics. doi: 10.18653/v1/2022.naacl-main.356. URL https://aclanthology.org/2022.
naacl-main.356.
Shuzheng Si, Shuang Zeng, Jiaxing Lin, and Baobao Chang. SCL-RAI: Span-based contrastive
learning with retrieval augmented inference for unlabeled entity problem in NER. In Nicoletta
Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-
Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue,
Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond,
and Seung-Hoon Na (eds.), Proceedings of the 29th International Conference on Computational
Linguistics, pp. 2313–2318, Gyeongju, Republic of Korea, October 2022b. International Committee
on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.202.
Shuzheng Si, Zefan Cai, Shuang Zeng, Guoqiang Feng, Jiaxing Lin, and Baobao Chang. SANTA:
Separate strategies for inaccurate and incomplete annotation noise in distantly-supervised named
entity recognition. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Findings of
the Association for Computational Linguistics: ACL 2023, pp. 3883–3896, Toronto, Canada, July
2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.239. URL
https://aclanthology.org/2023.findings-acl.239.
Shuzheng Si, Wentao Ma, Haoyu Gao, Yuchuan Wu, Ting-En Lin, Yinpei Dai, Hangyu Li, Rui Yan,
Fei Huang, and Yongbin Li. Spokenwoz: A large-scale speech-text benchmark for spoken task-
oriented dialogue agents. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine
(eds.), Advances in Neural Information Processing Systems, volume 36, pp. 39088–39118. Curran
Associates, Inc., 2023b. URL https://proceedings.neurips.cc/paper_files/paper/2023/
file/7b16688a2b053a1b01474ab5c78ce662-Paper-Datasets_and_Benchmarks.pdf.
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and
Marcus Rohrbach. Towards VQA models that can read. In IEEE Conference on Computer Vision
and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 8317–8326,
2019.
Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for
reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491,
2018.
Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Can-
dace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
5238–5248, 2022a.
Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Can-
dace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
5238–5248, 2022b.
Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill.
Multimodal few-shot learning with frozen language models. Advances in Neural Information
Processing Systems, 34:200–212, 2021.
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu,
and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv
preprint arXiv:2205.14100, 2022a.
Zijie J. Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng
Chau. DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models.
arXiv:2210.14896 [cs], 2022b. URL https://arxiv.org/abs/2210.14896.
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du,
Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners, 2022.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick
von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger,
Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural
language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for
Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.
6.
Yifan Wu, Pengchuan Zhang, Wenhan Xiong, Barlas Oguz, James C. Gee, and Yixin Nie. The role
of chain-of-thought in complex vision-language reasoning task, 2023.
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-
answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 9777–9786, 2021.
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging
video and language. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 5288–5296, 2016.
Zhiyang Xu, Ying Shen, and Lifu Huang. MultiInstruct: Improving multi-modal zero-shot learn-
ing via instruction tuning. In Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pp. 11445–11465, Toronto, Canada, July
2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.641. URL
https://aclanthology.org/2023.acl-long.641.
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Just ask: Learning to
answer questions from millions of narrated videos. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pp. 1686–1697, 2021.
Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang,
Eran Ofek, and Idan Szpektor. What you see is what you read? improving text-image alignment
evaluation. arXiv preprint arXiv:2305.10400, 2023.
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu,
Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with
multimodality. arXiv preprint arXiv:2304.14178, 2023.
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual
denotations: New similarity metrics for semantic inference over event descriptions. Transactions
of the Association for Computational Linguistics, 2, 2014.
Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context
in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam,
The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pp. 69–85. Springer, 2016.
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang,
and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv
preprint arXiv:2308.02490, 2023.
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual
commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pp. 6720–6731, 2019.
Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, and
Tao Kong. What matters in training a gpt4-style language model with multimodal inputs? arXiv
preprint arXiv:2307.02469, 2023.
Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, and Tat-Seng Chua. Transfer visual
prompt generator across llms. CoRR, abs/2305.01278, 2023a. URL https://doi.org/10.48550/arXiv.2305.01278.
Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu. Raven: A dataset for relational
and analogical visual reasoning. In Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition, pp. 5317–5327, 2019.
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal
chain-of-thought reasoning in language models, 2023b.
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: En-
hancing vision-language understanding with advanced large language models. arXiv preprint
arXiv:2304.10592, 2023.
A RELATED WORK
A.1 VISION-LANGUAGE PRETRAINING
Our work is inspired by recent vision-language pre-training works (Zhu et al., 2023; Liu et al., 2023b;
Li et al., 2022; 2023d), which have been proven effective for aligning visual inputs and frozen LLMs
to obtain cross-modal generalization ability.
BLIP-2 BLIP-2 (Li et al., 2023d) bridges the modality gap with a lightweight Querying Transformer,
which is pre-trained in two stages. The first stage bootstraps vision-language representation learning
from a frozen image encoder. The second stage bootstraps vision-to-language generative learning
from a frozen language model.
InstructBLIP InstructBLIP (Dai et al., 2023) performs vision-language instruction tuning based
on the pre-trained BLIP-2 models with converted multi-modal datasets and the LLaVA (Liu et al.,
2023b) dataset generated by GPT-4.
MiniGPT-4 MiniGPT-4 (Zhu et al., 2023) aligns a CLIP visual encoder with a frozen Vicuna (Chiang et al., 2023) using an artificially collected dialog dataset.
Shikra Shikra (Chen et al., 2023a) is a VLM that can handle spatial-coordinate inputs and outputs in natural language. This allows Shikra to excel at referential dialogue and general vision-language tasks, resulting in outstanding performance.
However, there is still less work focusing on VLMs with multi-image inputs.
Flamingo Earlier work by Tsimpoukelli et al. (2021) achieves multi-visual inputs based on self-attention over images but performs poorly on downstream tasks. Flamingo (Alayrac et al., 2022) supports few-shot learning (FSL) in VLMs via ICL by leveraging its robust capability to handle multi-visual inputs, and it uses cross-attention instead of self-attention to obtain better performance. However, it still cannot explicitly point to specific images, so a hacky cross-attention mask is introduced.
Kosmos-1 Kosmos-1 (Huang et al., 2023a) is trained from scratch on billion-scale multi-modal corpora, including interleaved text-image web pages, image-text captions, and language-only instruction tuning data. It can perform multi-modal few-shot learning and chain-of-thought reasoning, thereby achieving formidable performance.
Otter Otter (Li et al., 2023a) is an open-source implementation of Flamingo trained with multi-modal in-context instruction tuning data.
Meta learner Najdenkoska et al. (2023) use a meta-learning objective to train an adapter that aggregates multiple image features, so that the original VLM together with the adapter becomes a better few-shot learner.
Recent VLMs (Zhu et al., 2023; Liu et al., 2023b; Li et al., 2022; Alayrac et al., 2022; Dai et al., 2023; Chen et al., 2024b) have been proven effective for aligning visual inputs and frozen LLMs to obtain cross-modal generalization ability. However, previous works overlooked multi-image VLMs, mainly focusing on handling single-image prompts. Tsimpoukelli et al. (2021) support multi-image inputs using self-attention over images but perform poorly on downstream tasks. Although Flamingo (Alayrac et al., 2022) supports few-shot learning in VLMs and uses cross-attention to capture text-image relationships, it still struggles to make exact references to specific images.
Pre-trained language models have achieved great success in both natural language understanding (Si
et al., 2022b; 2023a; An et al., 2023; Liu et al., 2023c; Hu et al., 2023a; Chen et al., 2023b) and natural
language generation (Si et al., 2022a; 2023b; Cai et al., 2023; Liu et al., 2023e). Recently, instruction
tuning (Kung & Peng, 2023; Li et al., 2024; Wei et al., 2022) has attracted more and more attention
in order to enable LLMs to follow natural language instructions and complete real-world tasks.
However, multi-modal instruction tuning still requires further exploration. MultiInstruct (Xu et al., 2023) introduces instruction tuning to enhance the instruction-following ability of VLMs. Due to its architectural design, MultiInstruct still struggles with complex contexts containing multiple images. Otter (Li et al., 2023a) fine-tunes OpenFlamingo (Awadalla et al., 2023) to augment its instruction comprehension capabilities. However, Otter's dataset lacks text-to-image references and interconnected image-to-image data. This limitation hinders its capability to handle complex contexts that involve visual-textual relationships.
Enabling ICL in pre-trained language models (PLMs) has been well explored. MetaICL (Min et al., 2021) proposes a meta-training framework that tunes a PLM to perform in-context learning on a large set of training tasks. LM-BFF (Gao et al., 2020) studies few-shot fine-tuning of PLMs. However, ICL in VLMs is still less explored; recent works on VLMs mainly focus on zero-shot evaluation with single-image inputs.
C DATA RESOURCE
The data resource used in constructing the MIC dataset is displayed in Fig. 7. Our training dataset
comes from 8 task categories and 16 datasets.
Image Captioning aims to produce descriptions of the given images according to different needs. Our
training dataset includes MS COCO (Lin et al., 2014), DiffusionDB (Wang et al., 2022b), and Flickr
30K (Young et al., 2014).
Knowledgeable Visual Question Answering (KVQA) requires the model to make use of commonsense
knowledge outside the input image to answer questions. Our training dataset includes OK-VQA
(Marino et al., 2019).
Image Question Answering (IQA) requires the model to answer the questions based on the image
correctly. Our training dataset includes VQAv2 (Goyal et al., 2017), ST-VQA (Biten et al., 2019),
Text-VQA (Singh et al., 2019), WikiART (Saleh & Elgammal, 2015) and RefCOCO (Yu et al.,
2016).
Video Question Answering (VideoQA) requires the model to answer questions based on the video
correctly. We extract eight frames per video as visual inputs for Video QA tasks. Our training dataset
includes MSRVTTQA (Xu et al., 2016).
Video Captioning requires the model to give the caption based on the video. We extract eight frames
per video as visual inputs for Video Captioning tasks. Our training dataset includes MSRVTT (Xu
et al., 2016).
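For concreteness, the eight-frame sampling described above can be implemented as plain uniform index selection. The helper below is an illustrative sketch under that assumption (it is not the released preprocessing code), and it presumes the video decoder exposes the total frame count.

```python
def uniform_frame_indices(num_total_frames: int, num_samples: int = 8) -> list[int]:
    """Pick `num_samples` evenly spaced frame indices from a video.

    A minimal sketch of the eight-frame sampling described above; the real
    pipeline may use a different video decoder or sampling offsets.
    """
    if num_total_frames <= num_samples:
        return list(range(num_total_frames))
    step = num_total_frames / num_samples
    # Take the middle frame of each of the `num_samples` equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]

# Example: a 120-frame clip yields indices [7, 22, 37, 52, 67, 82, 97, 112].
print(uniform_frame_indices(120))
```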
Task            Dataset                                    Used   Train (#samples)   Val       Test     License
Captioning      MS COCO (Lin et al., 2014)                 Yes    413,952            202,496   0        Custom
Captioning      DiffusionDB (Wang et al., 2022b)           Yes    19,963             0         0        MIT
Captioning      Flickr (Young et al., 2014)                Yes    144,896            4,864     4,864    Custom
Captioning      NoCaps (Agrawal et al., 2019)              Yes    0                  45,000    0        CC BY 2.0
Classification  MiniImage (Russakovsky et al., 2015)       Yes    38,400             9,600     12,000   CC0 1.0
VQA             VQA v2 (Goyal et al., 2017)                Yes    592,998            26,173    25,747   CC BY 4.0
VQA             ST-VQA (Biten et al., 2019)                Yes    78,222             0         4,070    CC BY 4.0
VQA             Text-VQA (Singh et al., 2019)              Yes    34,602             0         0        CC BY 4.0
VQA             NLVR2 (Suhr et al., 2018)                  Yes    86,373             6,982     6,967    CC BY 4.0
VQA             RefCOCO (Yu et al., 2016)                  Yes    141,968            0         0        Custom
KVQA            OK-VQA (Marino et al., 2019)               Yes    9,009              5,046     0        CC BY 4.0
Reasoning       GQA (Hudson & Manning, 2019)               Yes    943,000            132,062   12,578   CC BY 4.0
Reasoning       VCR (Zellers et al., 2019)                 Yes    118,156            14,472    5,000    Custom
Reasoning       Winoground (Thrush et al., 2022a)          No     0                  0         400      Custom
Video           MSRVTT-QA (Xu et al., 2016)                Yes    158,581            12,278    72,821   MIT
Video           MSRVTT (Xu et al., 2016)                   Yes    130,260            9,940     59,800   MIT
Others          WikiART (Saleh & Elgammal, 2015)           Yes    13,000             5,500     0        Custom
Others          LLAVA-Instruct-150K (Liu et al., 2023b)    Yes    236,564            0         0        CC BY 4.0
Table 9: Detailed task descriptions and statistics of our instruction tuning tasks, including all datasets
in all types of tasks. The column “Used” indicates whether we use this dataset in the multi-modal
in-context tuning stage.
Figure 7: Illustration of the data resources used to construct the MIC dataset. It consists of 11 tasks and 33
different datasets. The held-in datasets are shown in white and the held-out datasets in yellow.
Visual Reasoning requires the model to correctly perform image reasoning and answer questions. Our
training dataset includes GQA (Hudson & Manning, 2019), VCR (Zellers et al., 2019), and NLVR2
(Suhr et al., 2018).
Image Classification involves classifying an image based on a given set of candidate labels. Our
training dataset includes MiniImage (Russakovsky et al., 2015).
Visual Dialog requires the model to hold a meaningful dialog about visual content with humans in
natural, conversational language. Our training dataset includes LLAVA-Instruct-150K (Liu et al.,
2023b).
Our testing dataset comes from 10 task categories and 18 datasets.
Image Captioning includes the Nocaps (Agrawal et al., 2019) dataset.
Knowledgeable Visual Question Answering (KVQA) includes the ScienceQA (Lu et al., 2022) and
A-OKVQA (Schwenk et al., 2022) datasets.
Image Question Answering (IQA) includes the VizWiz (Bigham et al., 2010) dataset.
Visual Reasoning includes the Winoground (Thrush et al., 2022b), VSR (Liu et al., 2022), and
IconQA (Lu et al., 2021) datasets. Winoground proposes a task of matching two given images and
two captions correctly. The challenge of this task is that both captions contain a completely identical
set of words, only in a different order. VSR describes the spatial relation of two individual objects in
the image, and a VLM needs to judge whether the caption correctly describes the image (True) or
not (False). The IconQA dataset has two sub-datasets: image question answering with multiple text
choice and image question answering with multiple image choice.
Web Page Question Answering (Web QA) includes the WebSRC (Chen et al., 2021a; Huang et al.,
2023a) dataset. The model must answer questions based on the web page image and the optional extracted
texts. We sampled 2000 instances from WebSRC for the evaluation. To align with KOSMOS-1 (Huang
et al., 2023a), we only use the web page image as input.
Video Question Answering (VideoQA) includes the iVQA (Yang et al., 2021), MSVD (Chen & Dolan,
2011), and NExT-QA (Xiao et al., 2021) datasets. The NExT-QA dataset has two sub-datasets: video
question answering with multiple choice and open-domain video question answering.
Few-shot Image Classification includes the HatefulMemes (Kiela et al., 2020) and Bongard-HOI (Jiang
et al., 2022) datasets. HatefulMemes requires the model to determine whether a meme is hateful based
on the image and the provided explanation. Bongard-HOI is a benchmark for evaluating the model's
ability in few-shot visual reasoning about human-object interactions. It provides few-shot examples
with challenging negatives, where positive and negative images differ only in action labels. The
model is then asked whether the final image is positive or negative. We sampled 2000 instances from
Bongard-HOI for the evaluation.
Nonverbal Reasoning includes the Raven IQ test (Huang et al., 2023a). Each instance in the Raven
IQ test has 3 or 8 images as inputs and six candidate images with a unique correct completion, and
the goal is to predict the next image from the candidates.
Visual Dialog includes the Visual Dialog dataset (Das et al., 2017). We use the question of the final
dialogue as the question for each instance and take all preceding dialogues as the context to perform
open-domain image question answering.
OOD Generalization includes the Minecraft dataset that we construct using the Minecraft game (Cipollone
et al., 2014), which requires the VLM to identify whether an animal (e.g., cow, llama, chicken, or
donkey) is present in a picture.
More detailed task descriptions and statistics about the datasets are shown in Table 9.
D DATA CONSTRUCTION
Firstly, we create an image declaration per instance in all datasets to generate datasets with explicit
text-to-image reference. We then have annotators scrutinize every dataset’s samples and provide task
instructions. This practice aids in gaining a comprehensive understanding of the task and helps craft
high-quality templates.
Next, we employ ChatGPT (we use the gpt-3.5-turbo version) to rewrite the instructions so that they
accurately describe the key characteristics of each task. After ChatGPT generates the instructions, we
manually review them to guarantee their high quality.
We then select ten suitable matching templates as candidates and merge the original dataset's input
into a randomly chosen template. We assemble demonstrations for each instance by selecting a small
amount of data from the dataset and arranging it sequentially. These demonstrations are integrated with
the input instance to generate multi-modal contextual data (except for the video datasets, the VCR
dataset, and the LLaVA dataset), as sketched below.
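A rough sketch of this template-filling and demonstration-assembly step is shown below. The field names, the template format string, and the <imagej> placeholder handling are assumptions made for illustration; this is not the released construction script.

```python
import random

def build_instance(template: str, question: str, answer: str, image_ids: list[int]) -> str:
    """Fill one instruction template with an instance and its image declaration."""
    declaration = " ".join(f"image {j}: <image{j}>" for j in image_ids)
    return template.format(declaration=declaration, question=question, answer=answer)

def build_in_context_sample(templates: list[str], pool: list[dict], target: dict,
                            num_demos: int = 2) -> str:
    """Prepend a few demonstrations sampled from the same dataset to the target instance."""
    template = random.choice(templates)
    demos = random.sample(pool, num_demos)
    parts = [build_instance(template, d["question"], d["answer"], d["image_ids"]) for d in demos]
    # The target instance is left with an empty answer slot for the model to complete.
    parts.append(build_instance(template, target["question"], "", target["image_ids"]))
    return "\n".join(parts)

# Example template (hypothetical): "{declaration} Question: {question} Answer: {answer}"
```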
We construct multi-image data by extracting eight frames per video from the MSRVTT (Xu et al., 2016)
and MSRVTT-QA (Xu et al., 2016) datasets. We also crop images from the VCR (Zellers et al., 2019)
dataset using object bounding boxes to produce intertwined multi-modal data with closely related
images. We convert all data into a vision-language Q&A format to create high-quality multi-modal
training data and accumulate 5.8M samples in the MIC dataset. Due to resource constraints, we use
approximately 10% of the data, with the sampling strategy described in Appendix F, to fine-tune
MMICL. We anticipate that a larger model trained on all of our data would yield even more promising
results.
The data construction framework for multi-modal data with interconnected images, described in Sec. 2.2.2,
is developed through an automated process that leverages existing dataset annotations (i.e., the Visual
Commonsense Reasoning (VCR) dataset (Zellers et al., 2019)). This approach obviates the need
for supplementary manual annotation.
The original annotations of the VCR dataset comprise raw images, bounding boxes, and a set of questions
and answers that specifically reference these bounding boxes, as shown in the left part of Fig. 8. This structured
annotation facilitates a comprehensive understanding of the image context. Leveraging
the existing annotations, we develop an automated workflow that extracts sub-images from
the raw images based on the bounding boxes and employs ChatGPT to formulate a variety of instructions.
Initially, we manually craft a set of instruction templates, which, alongside the dataset description, are
fed into ChatGPT. ChatGPT is then asked to expand the templates, thereby yielding a
diverse set of instruction templates.
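The sub-image extraction step reduces to a plain bounding-box crop. The Pillow-based sketch below is an illustrative reconstruction; the (x1, y1, x2, y2) box format and the object-tag naming are assumptions, not the exact VCR preprocessing code.

```python
from PIL import Image

def crop_objects(image_path: str, boxes: dict[str, tuple[float, float, float, float]]) -> dict[str, Image.Image]:
    """Crop one sub-image per annotated object box.

    `boxes` maps an object tag (e.g. "person_1") to an (x1, y1, x2, y2) box;
    the returned sub-images can then be interleaved with the question/answer
    text that references those tags.
    """
    image = Image.open(image_path).convert("RGB")
    return {tag: image.crop(box) for tag, box in boxes.items()}
```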
The data construction framework for the multi-modal in-context format data described in Sec. 2.2.3 is
developed through an automated process that leverages existing dataset annotations. The datasets used
are listed in Appendix F.
Through selective sampling within the dataset, we have formulated in-context examples for each
respective task. These examples encompass a range of elements, including images, questions, and
answers. ChatGPT is used to formulate a variety of instructions. The formulated instruction templates
are presented in Appendix G.
The data construction framework for video data is developed through an automated process that
leverages existing dataset annotations. The datasets used are the MSRVTT (Xu et al., 2016) and
MSRVTT-QA (Xu et al., 2016) datasets, as listed in Appendix F.
Figure 8: Illustration of the automatic data construction pipeline for multi-modal data with interconnected
images. The automatic construction is based on the existing annotations of the VCR dataset (Zellers et al.,
2019) without human involvement. ChatGPT is used for instruction refinement.
E MODEL STRUCTURE
As shown in Fig. 11, MMICL treats the image and language representations equally and combines
them into interleaved image-text representations, mirroring the original input. Each given image is
encoded by a vision encoder (e.g., ViT (Radford et al., 2021; Fang et al., 2023)) to obtain its visual
representation. We then use the Q-Former as the VPG to extract the visual embedding and a fully
connected layer as the projection layer to map each visual embedding to the same dimension as the
text embeddings of the LLM. This alignment helps the LLM understand the images. Our approach
treats the visual and text embeddings equally, enabling a flexible combination of visual and textual
content. Finally, we combine the visual embeddings of multiple images with the text embeddings in
an interleaved style and feed them into the LLM. We set the weights for mapping the query and value
vectors in the attention layers of the LLM to be learnable, so that the model better adapts to multi-modal
contexts with multiple images. During training, we freeze the image encoder,
Q-Former, and the backbone LLM while jointly training the language projection and the query and
value mappings of the LLM.
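To make the interleaving concrete, the PyTorch sketch below shows how per-image visual tokens can be spliced into the text embedding sequence. The module interfaces, the 32-query Q-Former output, and the placeholder-position bookkeeping are simplified assumptions, not the exact MMICL implementation.

```python
import torch
import torch.nn as nn

class InterleavedEmbedder(nn.Module):
    """Sketch of MMICL-style interleaved image-text input construction."""

    def __init__(self, vision_encoder: nn.Module, q_former: nn.Module,
                 q_former_dim: int, llm_embed_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder        # frozen ViT
        self.q_former = q_former                    # frozen Q-Former (VPG)
        self.projection = nn.Linear(q_former_dim, llm_embed_dim)  # trainable projection layer

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor,
                image_positions: list[int]) -> torch.Tensor:
        # images: (num_images, 3, H, W); text_embeds: (seq_len, llm_embed_dim)
        # image_positions: sorted positions in the text sequence where each
        # image's visual tokens should be spliced in.
        visual_tokens = self.projection(self.q_former(self.vision_encoder(images)))
        # visual_tokens: (num_images, 32, llm_embed_dim) for a 32-query Q-Former
        pieces, cursor = [], 0
        for tokens, pos in zip(visual_tokens, image_positions):
            pieces.append(text_embeds[cursor:pos])  # text before this image
            pieces.append(tokens)                   # this image's visual tokens
            cursor = pos
        pieces.append(text_embeds[cursor:])         # remaining text
        return torch.cat(pieces, dim=0)             # interleaved sequence fed to the LLM
```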
F DATA BALANCE
Previous studies have shown that the balance of the training data can significantly influence model
performance (Dai et al., 2023).
Figure 9: Illustration of the automatic data construction pipeline for multi-modal data with in-context
examples. The automatic construction is based on existing annotations without human involvement.
ChatGPT is used for instruction refinement. Data sources can be found in Appendix F. The formulated
instruction templates are presented in Appendix G.
Figure 10: Illustration of the automatic data construction pipeline for multi-modal data with video. The
automatic construction is based on existing annotations without human involvement. ChatGPT is used
for instruction refinement.
[Figure 11: MMICL model structure, showing the Image, Encoder, Q-Former, Projection Layer, and Tokenizer & Embedding components.]
Mixing the training data of each dataset uniformly could cause the model to overfit smaller datasets
and underfit larger datasets, leading to poor performance. To alleviate this problem, we employ a
sampling strategy that selects datasets with probabilities proportional to the square root of their number
of training samples, following Dai et al. (2023).
Formally, given $D$ datasets with training sample counts $\{N_1, N_2, \ldots, N_D\}$, the probability $p_d$ of a data sample being selected from dataset $d$ during training is:
$$p_d = \frac{\sqrt{N_d}}{\sum_{i=1}^{D} \sqrt{N_i}} \qquad (4)$$
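For reference, the sampling weights of Eq. 4 can be computed as in the small helper below; the dataset sizes in the example are made up for illustration.

```python
import math

def sqrt_sampling_probs(dataset_sizes: list[int]) -> list[float]:
    """Eq. 4: p_d = sqrt(N_d) / sum_i sqrt(N_i)."""
    roots = [math.sqrt(n) for n in dataset_sizes]
    total = sum(roots)
    return [r / total for r in roots]

# A 1M-sample dataset is sampled ~100x (not 10,000x) as often as a 100-sample one.
print(sqrt_sampling_probs([100, 10_000, 1_000_000]))  # ~[0.009, 0.090, 0.901]
```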
H EXPERIMENT DETAILS
Following Chung et al. (2022), we use FLAN-T5-XL and FLAN-T5-XXL (Chung et al., 2022) as
the backbone LLMs. In Stage I, we freeze the vision encoder and the language model and utilize the
images from COCO (Lin et al., 2014), CC3M (Sharma et al., 2018), Visual Genome (Krishna et al., 2017),
CC12M (Changpinyo et al., 2021), SBU (Ordonez et al., 2011), and LAION-400M (Schuhmann et al.,
2021), as well as captions generated by BLIP-large (Li et al., 2022), to perform feature alignment
training on the Q-Former. We keep the other parts of the VLM frozen and jointly train the Q-Former
and the projection layer. The Q-Former outputs a visual embedding of 32 tokens in the embedding size
of the backbone language model. To benefit from BLIP-2's strong visual representation extraction
ability, we integrate its powerful vision encoder and initialize the Q-Former and projection layer from
its checkpoint. (In practice, we use the checkpoints of BLIP-2 and InstructBLIP as the backbone for
MMICL, so Stage I is skipped.) In Stage II, we train the model for three epochs with a lower learning
rate of 1e-5. The weights for mapping the query and value vectors in the attention layers of the LLM
are learnable in this stage to better adapt to multi-modal prompts with multiple images; we freeze the
visual encoder, Q-Former, and the backbone LLM and jointly train the projection layer and the query
and value mappings of the LLM.
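A rough sketch of the Stage II trainability setting is given below. The parameter-name patterns assume a BLIP-2/FLAN-T5-style model whose attention query and value projections are named `.q` and `.v`; this is an approximation for illustration, not the exact training script.

```python
import torch.nn as nn

def set_stage2_trainable(model: nn.Module) -> list[str]:
    """Freeze the vision encoder, Q-Former, and LLM backbone; leave only the
    projection layer and the query/value mappings of the LLM attention layers
    trainable, as described above (assumed parameter naming)."""
    trainable = []
    for name, param in model.named_parameters():
        keep = (
            "language_projection" in name                 # visual-to-text projection layer
            or name.endswith((".q.weight", ".v.weight"))  # T5-style attention query/value
        )
        param.requires_grad = keep
        if keep:
            trainable.append(name)
    return trainable
```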
Table 10: Instruction templates used for transforming datasets into instruction tuning data. {image}
denotes the image embedding encoded by the image encoder, which is concatenated with the language
embedding as input; <imagej> denotes the image token that exactly references the j-th image in an
instance, as described in Sec. 2.2.1.
Table 11: Instruction templates used for transforming datasets into instruction tuning data. {image}
denotes the image embedding encoded by the image encoder, which is concatenated with the language
embedding as input; <imagej> denotes the image token that exactly references the j-th image in an
instance, as described in Sec. 2.2.1.
Table 13: Instruction templates for tasks VQAv2, ST-VQA, WikiART, and RefCOCO.
Table 16: Instruction templates for the tasks GQA, VCR, and NLVR2.
Table 17: Evaluation of cognition. In the MME benchmark, each image has two questions,
with answers restricted to 'yes' or 'no'. The evaluation metrics for this benchmark are ACC and
ACC+. ACC is the accuracy computed per question, while ACC+ is the accuracy computed per image,
where both questions must be answered correctly. The Avg. metric denotes the average value across all
scores. Note that all reported figures for the baseline methods are taken from the MME
benchmark (Fu et al., 2023). We use the FLAN-T5-XXL version of MMICL to evaluate the performance.
All experiments are conducted on 6 NVIDIA A40 GPUs with ZeRO-2 offload (Rajbhandari et al.,
2020) from DeepSpeed (Rasley et al., 2020) and the Hugging Face Transformers trainer (Wolf et al.,
2020). The batch size is 10 for MMICL (FLAN-T5-XL) and 4 for MMICL (FLAN-T5-XXL).
The largest model, MMICL (FLAN-T5-XXL), requires about two days for Stage II.
I MME BENCHMARK
Model Existen. Count Pos. Color OCR Poster Cele. Scene Land. Art. Avg.
LLaVA 50.00 50.00 50.00 55.00 50.00 50.00 48.82 50.00 50.00 49.00 50.28
MiniGPT-4 68.33 55.00 43.33 75.00 57.50 41.84 54.41 71.75 54.00 60.50 58.17
MultiModal-GPT 61.67 55.00 58.33 68.33 82.50 57.82 73.82 68.00 69.75 59.50 65.47
VisualGLM-6B 85.00 50.00 48.33 55.00 42.50 65.99 53.24 146.25 83.75 75.25 70.53
VPGTrans 70.00 85.00 63.33 73.33 77.50 84.01 53.53 141.75 64.75 77.25 79.05
LaVIN 185.00 88.33 63.33 75.00 107.50 79.59 47.35 136.75 93.50 87.25 96.36
LLaMA-Adapter-V2 120.00 50.00 48.33 75.00 125.00 99.66 86.18 148.50 150.25 69.75 97.27
mPLUG-Owl 120.00 50.00 50.00 55.00 65.00 136.05 100.29 135.50 159.25 96.25 96.73
InstructBLIP 185.00 143.33 66.67 153.33 72.50 123.81 101.18 153.00 79.75 134.25 121.28
BLIP-2 160.00 135.00 73.33 148.33 110.00 141.84 105.59 145.25 138.00 136.50 129.38
Lynx 195.00 151.67 90.00 170.00 77.50 124.83 118.24 164.50 162.00 119.50 137.32
GIT2 190.00 118.33 96.67 158.33 65.00 112.59 145.88 158.50 140.50 146.25 133.21
Otter 195.00 88.33 86.67 113.33 72.50 138.78 172.65 158.75 137.25 129.00 129.23
Cheetor 180.00 96.67 80.00 116.67 100.00 147.28 164.12 156.00 145.73 113.50 130.00
LRV-Instruction 165.00 111.67 86.67 165.00 110.00 139.04 112.65 147.98 160.53 101.25 129.98
BLIVA 180.00 138.33 81.67 180.00 87.50 155.10 140.88 151.50 89.50 133.25 133.77
MMICL 170.00 160.00 81.67 156.67 100.00 146.26 141.76 153.75 136.13 135.50 138.17
Table 18: Evaluation of coarse-grained and fine-grained recognition and OCR. The settings are
the same as Table 17. It is important to note that all the reported figures for the baseline methods
are obtained from the MME benchmark (Fu et al., 2023). We use the FLAN-T5-XXL version of
MMICL to evaluate the performance.
MME comprehensively evaluates VLMs with 14 sub-tasks that encompass perception and cognition
abilities. Other than OCR, perception ability includes the recognition of coarse-grained and fine-
grained objects. The former identifies the existence, count, position, and color of objects. The latter
recognizes movie posters, celebrities, scenes, landmarks, and artworks. The cognition includes
commonsense reasoning, numerical calculation, text translation, and code reasoning.
MME evaluates a wide range of multi-modal abilities. The compared baselines include LLaVA (Liu
et al., 2023b), MiniGPT-4 (Zhu et al., 2023), MultiModal-GPT (Gong et al., 2023), VisualGLM-6B (Du
et al., 2021), VPGTrans (Zhang et al., 2023a), LaVIN (Luo et al., 2023), mPLUG-Owl (Ye et al., 2023),
LLaMA-Adapter-V2 (Gao et al., 2023), InstructBLIP (Dai et al., 2023), Otter (Li et al., 2023a),
BLIP-2 (Li et al., 2023d), LRV-Instruction (Liu et al., 2023a), Cheetor (Li et al., 2023c),
GIT2 (Wang et al., 2022a), Lynx (Zeng et al., 2023), and BLIVA (Hu et al., 2023b). We also provide
more detailed evaluation results for MMICL in Table 18, Table 19, Table 20, and Table 21. The results
show that MMICL achieves the best average score in comparison with current VLMs.
Table 22: Evaluation on the MM benchmark dev set. All reported performance numbers for the baseline
methods are from the MM benchmark leaderboard (Liu et al., 2023d). We use the FLAN-T5-XXL
version of MMICL to evaluate the performance.
MMICL achieves a significant improvement of 10.86, 4.53, and 2.45 points on MSVD-QA (Chen & Dolan, 2011),
NExT-QA (Xiao et al., 2021), and iVQA (Yang et al., 2021), respectively, compared to the
strongest baselines. It is important to note that our training dataset did not include any videos. This
indicates that MMICL effectively enhances the model's ability to understand temporal information
in videos.
We test the following VLMs on the POPE benchmark to evaluate their object hallucination perfor-
mance: MMICL, Shikra (Chen et al., 2023a), InstructBLIP (Dai et al., 2023), MiniGPT-4 (Zhu et al.,
2023), LLaVA (Liu et al., 2023b), MM-GPT (Gong et al., 2023), and mPLUG-Owl (Ye et al., 2023).
The results are presented in Table 24.
Model                                                  MSVD-QA   NExT-QA (Multi-choice)   iVQA
Flamingo-3B (Alayrac et al., 2022) (Zero-Shot) 27.50 - 32.70
Flamingo-3B (Alayrac et al., 2022) (4-Shot) 33.00 - 35.20
Flamingo-9B (Alayrac et al., 2022) (Zero-Shot) 30.20 - 35.20
Flamingo-9B (Alayrac et al., 2022) (4-Shot) 36.20 - 37.70
Flamingo-80B (Alayrac et al., 2022) (Zero-Shot) 35.60 - 40.70
Flamingo-80B (Alayrac et al., 2022) (4-Shot) 41.70 - 44.10
R2A (Pan et al., 2023) 37.00 - 29.30
BLIP-2 (Li et al., 2023d) (FLANT5-XL) 33.70 61.73 37.30
BLIP-2 (Li et al., 2023d) (FLANT5-XXL) 34.40 61.97 49.38
InstructBLIP (Dai et al., 2023) (FLANT5-XL) 43.40 36.10 25.18
InstructBLIP (Dai et al., 2023) (FLANT5-XXL) 44.30 64.27 36.15
MMICL (FLAN-T5-XL) 47.31 66.17 41.68
MMICL (FLAN-T5-XXL) 55.16 64.67 41.13
MMICL (Instruct-FLAN-T5-XL) 53.68 65.33 49.28
MMICL (Instruct-FLAN-T5-XXL) 52.19 68.80 51.83
Table 23: Results of MMICL compared with other VLMs across different video-language tasks.
For BLIP-2 and InstructBLIP, we concatenate the visual embeddings of all frames and place them on
top of the textual prompts, following Dai et al. (2023).
We use the same VQA Tools as the original VQA paper (Agrawal et al., 2016) for all metrics that
report VQA accuracy.
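For reference, the standard VQA accuracy computed by these tools scores a prediction against the ten human answers as min(#matches / 3, 1). The sketch below is a simplified version that omits the official answer normalization and the averaging over subsets of nine annotations.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(#humans agreeing with the prediction / 3, 1).

    The official VQA Tools additionally normalize punctuation, articles, and
    number words, and average over subsets of nine annotators; those details
    are omitted here.
    """
    matches = sum(answer.strip().lower() == prediction.strip().lower()
                  for answer in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 2 of 10 annotators gave the predicted answer -> accuracy 2/3 ~= 0.67.
print(round(vqa_accuracy("2", ["2", "two", "3", "3", "two", "4", "2", "5", "5", "three"]), 2))
```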
N ABLATION STUDY
We conduct ablation experiments using InstructBLIP-FLAN-T5-XL as the backbone model in the
following settings to verify the effectiveness of our proposed context scheme:
Dataset Metrics
MSVD (Chen & Dolan, 2011) Top-1 Acc.
iVQA (Yang et al., 2021) iVQA Acc.
NExT-QA-multiple-choice (Xiao et al., 2021) Top-1 Acc.
NExT-QA-opendomain (Xiao et al., 2021) WUPS Score.
Hateful Memes (Kiela et al., 2020) AUC Score
WebSRC (Chen et al., 2021b) Exact Match
VSR (Liu et al., 2022) Top-1 Acc.
*VQAv2 (Goyal et al., 2017) VQA Acc.
VizWiz (Bigham et al., 2010) VQA Acc.
IconQA-text (Lu et al., 2021) Top-1 Acc.
IconQA-img (Lu et al., 2021) Top-1 Acc.
ScienceQA-IMG (Lu et al., 2022) Top-1 Acc.
Bongard-HOI (Jiang et al., 2022) Top-1 Acc.
VisDial (Das et al., 2017) Exact Match
NoCaps (Agrawal et al., 2019) Cider Score
A-OKVQA (Schwenk et al., 2022) Top-1 Acc.
*Flickr (Young et al., 2014) Cider Score
Winoground (Thrush et al., 2022b) Winoground metric
Raven IQ Test (Huang et al., 2023a) Top-1 Acc.
Minecraft Top-1 Acc.
Table 25: Summary of the evaluation datasets and metrics. These datasets are used to validate the
general design of MMICL. The datasets marked with * are held-in datasets, whose training sets are
used in training MMICL.
Table 26: Ablation study on Context Scheme for different scores on Winoground.
• w/o context scheme: This model is trained only with instructions using all datasets from
stage 2 of MMICL, without any additional design in context format and model architecture.
It aims to ablate the contribution of the multi-modal instruction tuning.
• w/o image declaration: This model is trained without image declaration but includes
multi-modal data with interconnected images and in-context data. It utilizes all datasets
used in stage 2 of MMICL and intends to evaluate the impact of image declaration design.
• w/o in-context format: This model, devoid of in-context format data, comprises multi-
modal data with related images and image declarations. It uses all datasets found in stage 2
of MMICL with a single image-text pair and aims to assess the effectiveness of the multi-modal
in-context data design.
• w/o interrelated images: This version of the model is trained without any interconnected
images. Its training pipeline contains multi-modal in-context data and image declarations.
It utilizes all datasets used in stage 2 of MMICL, excluding the video and VCR datasets. It
aims to evaluate the contribution of the design that incorporates multi-modal data with
interrelated images.
We use an equivalent amount of data as in Stage II of MMICL for the ablation models. We conduct
the ablation study on the general benchmark (MME), Winoground (Thrush et al., 2022b), and
multi-image datasets (IconQA-img (Lu et al., 2021), NLVR2 (Suhr et al., 2018), and the Raven IQ
test (Huang et al., 2023a)) to comprehensively evaluate the sources of the observed performance
improvement in MMICL.
The ablation results in Table 7 affirm that the superiority of MMICL is driven by the collective impact
of our design elements; removing any single component no longer guarantees superior performance.
Each component of our design contributes significantly to a different aspect of the model.
Ablation of Context Scheme: The improved results demonstrate the effectiveness of our context
scheme design, rather than of merely aggregating more pre-training data into the pipeline. The substantial
improvements across all tasks compared to the w/o context scheme setting further substantiate its
effectiveness. Without the context scheme and the model architecture supporting it, the
model performs poorly across most datasets, except on single-image tasks focusing
on specific object perception (such as Existence and Count). This is presumably because the
additional single image-text pair training aligns with the evaluation format of those tasks.
Ablation of Interrelated Images: Excluding this design leads to a significant decrease in
multi-image understanding ability, as evidenced on the NLVR2 and Raven datasets. However, the multi-
modal in-context training pipeline still enables the model to retain some multi-image understanding
ability, yielding an advantage on datasets with multiple images (IconQA-img, NLVR2, Raven)
over the w/o context scheme setting. At the same time, the multi-modal in-context data
still improves single-image understanding over the w/o in-context format setting, particularly
on tasks about specific object perception (Perception), but the model performs poorly on
more abstract tasks involving recognition and cognition (Cognition). Compared to the w/o
in-context format setting, it also demonstrates a superior understanding of paired image and text data,
evident in its performance on Winoground. This suggests that multi-modal data with interrelated images
provides some assistance in understanding multiple images; however, without the help of multi-modal
ICL data, it does not display superior performance. This aligns with our initial insight.
Ablation of In-context Format: Excluding this design reveals that, even with the introduction
of extra datasets, no performance improvement is achieved on single-image tasks. Although the w/o
in-context format setting receives a noticeable improvement in tasks related to comprehension and
inference (Cognition) compared to the other ablation settings, we observe a significant drop on
tasks about specific object perception (Perception). Training with multiple images enhances the
model's ability to understand abstract logical information within a single image and to perform
better on downstream tasks (Cognition). However, the model loses some of its ability to analyze and
recognize specific objects in a single image, leading to a significant decrease in the MME perception
score. We find that, by benefiting from learning on interrelated image data, its understanding of
multiple images does not decline significantly, exhibiting its superiority over the other settings. The
MME performance also confirms this, indicating that multi-modal in-context data offers significant
advantages for comprehending multiple images and for enhancing the model's ability to understand
abstract information in a single image.
Ablation of Image Declaration: The removal of image declarations leads to a stark decline in
performance on understanding complex image-text relationships (Winoground), due to the lack of
explicit modeling of image-text references. This phenomenon is also observed on tasks that
involve understanding relations among multiple images (Raven, IconQA-img, NLVR2). However,
lacking interrelated images harms the model's multi-image understanding more than missing image
declarations does, given the relatively better performance of the model without image declarations.
This indicates the importance of image declarations in handling multi-image and image-text
relationships.
Ablation of In-context Learning Ability: The ablation results in Table 27 clarify the significance of
the proposed context scheme for boosting the in-context learning ability of MMICL. The effectiveness
of MMICL in improving the in-context learning ability of the VLM is attributed to the integration of
all three designs in the context scheme, not merely the in-context learning format of the designed
scheme. Notably, the ICL-only setting achieves the most substantial enhancement on the VizWiz
dataset compared to the other settings. This highlights the efficacy of tuning with in-context format
data to augment the model's comprehension of in-context instances. However, the modest advancements
on the Flickr dataset indicate that, in the absence of multi-image instruction tuning, the model finds
it challenging to effectively leverage in-context examples to improve its performance on tasks
without fixed output formats, such as image captioning.
Task     ICL example         MMICL   w/o context scheme   w/o in-context format   w/o interrelated images
VizWiz   w/o ICL example     27.10   28.13                15.90                   23.43
VizWiz   w ICL example (5)   42.66   36.23                28.33                   37.13
VizWiz   ∆                   15.56   8.10                 12.43                   13.70
Flickr   w/o ICL example     83.94   77.84                65.33                   76.88
Flickr   w ICL example (5)   88.11   79.38                65.95                   77.09
Flickr   ∆                   4.17    1.54                 0.62                    0.18
Table 27: An Ablation Study of the Contribution of the Context Scheme to the Model’s In-Context
Learning Ability
O BASELINES
Baselines We primarily compare MMICL with recently proposed powerful multi-modal approaches,
including:
(1) Flamingo (Alayrac et al., 2022), where a VLM is trained on large-scale multi-modal web corpora
containing arbitrarily interleaved text and images;
(2) KOSMOS-1 (Huang et al., 2023a) which is trained from scratch on web-scale multi-modal
corpora;
(3) BLIP-2-FLAN-T5 (Li et al., 2023d) where an instruction-tuned Flan-T5 (Chung et al., 2022) is
connected with a powerful visual encoder to perform a series of multi-modal tasks;
(4) InstructBLIP-FLAN-T5 (Dai et al., 2023), a recently proposed instruction-tuning-enhanced
multi-modal agent built on FLAN-T5, trained with converted multi-modal datasets and the LLaVA (Liu et al.,
2023b) dataset generated by GPT-4 (OpenAI, 2023);
(5) Shikra (Chen et al., 2023a), a VLM that can handle spatial coordinate inputs and outputs in
natural language without the need for extra vocabularies or external plugin models. All inputs and
outputs of Shikra are in natural language form.
(6) Otter (Li et al., 2023a), an open-source implementation of Flamingo (Alayrac et al., 2022). By
utilizing multi-modal in-context instruction tuning data, Otter fine-tunes OpenFlamingo to augment
its instruction comprehension capabilities while maintaining its ability to learn in context;
(7) Ying-VLM (Li et al., 2023e), a VLM trained on a multi-modal multilingual instruction
tuning dataset, showcasing its potential to answer complex questions requiring world knowledge,
generalize to unseen video tasks, and comprehend unseen instructions in Chinese.
Table 28: Results of generalization of MMICL to unseen domain in Minecraft. Results show that
MMICL is able to generalize to unseen domains and tasks given a few examples.
An unseen, challenging domain with limited exemplars, which requires analyzing regular patterns,
reasoning, and learning new knowledge (OOD generalization to an unseen domain), is a good way to
test multi-modal ICL ability.
We construct a task using Minecraft (Chen et al., 2024a; 2023c; Cipollone et al., 2014), which requires
the VLM to identify whether an animal (e.g., cow, llama, chicken, or donkey) is present in a picture,
as in case (d) of Fig. 1.
We collect 550 cases and convert the task into a vision-to-text question-answering task to evaluate
the OOD generalization performance of MMICL. The results are shown in Table 28 and demonstrate
that MMICL is able to generalize to the Minecraft domain even though the images are markedly
different from those used in training during Stages I and II (Sec. 2.4).
Q DETAILS OF SCIENCEQA
In the test set of ScienceQA, each item consists of a question and several choices. Out of 4,221 items,
we sampled 2,017 instances that contain multi-modal information incorporating both text and image, a
subset we refer to as ScienceQA-IMG; the remainder includes only text. However, we observe
that not all questions within these 2,017 items necessarily require answers based on image information.
The annotation standard is whether an understanding of the image is essential to answer
the question.8
To illustrate this, we provide two examples of questions and their corresponding choices:
• Question: Which animal is also adapted to be camouflaged among dead leaves? Choices: [
"strawberry poison frog", "Surinam horned frog" ];
• Question: Which property do these three objects have in common? Choices: [’shiny’,
’slippery’, ’opaque’].
Clearly, the first question can be answered without reference to the image, while the second cannot. As
such, ScienceQA-IMG can be divided into two groups: questions that require images to answer
(1,012) and those that do not (1,005).
MM-VET (Yu et al., 2023) is an evaluation benchmark that examines VLMs on complex multi-modal
tasks. It requires free-form answers and is evaluated by GPT-4, providing a different perspective on
the model’s capabilities.
It is worth noting that, compared to the InstructBLIP model with the FLAN-T5 backbone in Table 29,
MMICL with the identical backbone shows a significant improvement on MM-VET. This demonstrates that
our proposed method prevents the model from overfitting to a specific data structure and substan-
tially enhances the model's free-form answering capability. Meanwhile, the absolute performance of
MMICL on MM-VET is relatively low, which is due to MMICL using the FLAN-T5 backbone. Since
T5 is mainly fine-tuned on NLP benchmarks containing many multiple-choice QA and classification
datasets, it tends to output shorter content during free-form question answering, which is
disadvantageous when evaluated with GPT-4.
We evaluate MMICL against other models with multi-modal instruction tuning on various
multi-image datasets, including the Raven IQ dataset and random samples of 2000 instances each from
the NLVR2 and IconQA-img datasets. The query prompts for the different methods are available in Table ??.
Our results in Table 30 demonstrate that our approach consistently outperforms the others.
8 The minimum hourly wage in the United States is near $7.5, which can be found at https://www.worker.gov/.
On average, annotating an example takes 30 seconds. Consequently, the cost per annotator amounts to $262.5.
T LIMITATIONS
T.1 POTENTIAL OVERFITTING
We mitigate the risk of overfitting specific data structures by expanding the instruction templates with
ChatGPT, ensuring diversity in the MIC dataset. The result in Table 28 demonstrates the generalization
ability of MMICL to an unseen domain in Minecraft. This out-of-domain understanding ability
demonstrates that MMICL does not overfit a specific data structure.
The context length of the backbone language model restricts the number of images that can be included
in the input, so MMICL cannot accommodate an unlimited number of images. Consequently, during
MIC tuning, we integrate up to eight images per instance. Notably, MMICL demonstrates robust
generalization capabilities beyond this constraint: for video datasets, MMICL maintains excellent
performance with inputs of up to 12 or 16 frames. We attribute this generalization to the utilization
of relative position embeddings in the architecture, as employed by FLAN-T5 (Chung et al., 2022).
We acknowledge that our exploration has primarily focused on the T5 (Raffel et al., 2023) series
models, and the effectiveness of our proposed approach with decoder-only architectures has not been
comprehensively investigated.