Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari,
Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, Pete Florence
Google

arXiv:2204.00598v1 [cs.CV] 1 Apr 2022

Abstract

Large foundation models can exhibit unique capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g. from spreadsheets, to SAT questions). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this model diversity is symbiotic, and can be leveraged to build AI systems with structured Socratic dialogue – in which new multimodal tasks are formulated as a guided language-based exchange between different pre-existing foundation models, without additional finetuning. In the context of egocentric perception, we present a case study of Socratic Models (SMs) that can provide meaningful results for complex tasks such as generating free-form answers to contextual questions about egocentric video, by formulating video Q&A as short story Q&A, i.e. summarizing the video into a short story, then answering questions about it. Additionally, SMs can generate captions for Internet images, and are competitive with state-of-the-art on zero-shot video-to-text retrieval with 42.8 R@1 on MSR-VTT 1k-A. SMs demonstrate how to compose foundation models zero-shot to capture new multimodal functionalities, without domain-specific data collection. Prototypes are available at socraticmodels.github.io.

Fig. 1: Language-based foundation models trained across different domains learn complementary forms of commonsense. Language is an intermediate representation by which these models can communicate with each other to generate joint predictions for new multimodal tasks, without finetuning. New applications (e.g., augmented reality (AR), human feedback, robotics) can be viewed as adding participants to the multi-model discussion. In this paper, we study inter-model dialogue in the context of multimodal reasoning.
1 Introduction

Foundation models (Bommasani et al. 2021) (e.g., BERT, GPT-3, CLIP) have enabled impressive capabilities in recent years: from zero-shot image classification (Radford et al. 2021; Li et al. 2021a), to high-level planning (Huang et al. 2022; Ahn et al. 2022). These capabilities depend on their training data distribution – and while they may be generic or indiscriminately crawled from the web, their distributions remain distinct across domains. For example, in terms of linguistic data, visual-language models (VLMs) (Wang et al. 2021; Jain et al. 2021) are trained on image and video captions, but large language models (LMs) (Devlin et al. 2018; Thoppilan et al. 2022; Chen et al. 2021) are additionally trained on large corpora of other data such as spreadsheets, fictional novels, and standardized test questions. These different domains offer distinct commonsense knowledge: VLMs can ground text to visual content, but LMs can perform a variety of other linguistic tasks (e.g., answering reading comprehension questions (Rajpurkar, Jia, and Liang 2018)) that to date have not been demonstrated with VLMs alone. In this work, we propose that these model differences are complementary, and can be jointly leveraged to build AI systems with structured Socratic dialogue – in which new multimodal tasks are formulated as a guided exchange between different pre-existing language-based foundation models, without additional finetuning. Rather than scaling training data in the areas of overlap (e.g., alt-text captions (Jia et al. 2021)), or unifying model architectures (Hu and Singh 2021), Socratic Models¹ (SMs) are a class of systems that embrace the zero-shot capabilities of foundation models by engineering guided discussions between the independent models to reach a shared consensus on a task-specific output. SMs use language as the representation by which inter-domain foundation models can jointly be used for inference.

¹The name Socratic Models draws inspiration from an analogy to the Socratic Method, but with language-interactable models that may produce or interpret language.
Fig. 2: In this work we propose Socratic Models (SMs), a framework that uses structured dialogue between pre-existing foundation models,
each of which can exhibit unique (but complementary) capabilities depending on the distributions of data on which they are trained. On various
perceptual tasks (shown), this work presents a case study of SMs with visual language models (VLMs, e.g., CLIP), large language models
(LMs, e.g., GPT-3, RoBERTa), and audio language models (ALMs, e.g., Wav2CLIP, Speech2Text). From video search, to image captioning;
from generating free-form answers to contextual reasoning questions, to forecasting future activities – SMs can provide meaningful results
for complex tasks across classically challenging computer vision domains, without any model finetuning.

Across a number of tasks spanning vision, language, and audio modalities, we find that specific instantiations of SMs, using LMs together with VLMs and audio-language models (ALMs), can generate results on challenging perceptual tasks (examples in Fig. 2) that are often coherent and correct. We present results on Internet image captioning (Sec. 4) and the common video understanding task of video-to-text retrieval (Sec. 5), but our highlighted application is open-ended reasoning in the context of egocentric perception (Fig. 4) – from answering free-form contextual reasoning questions about first-person videos (e.g. "why did I go to the front porch today?"), to forecasting events into the future with commonsense (e.g. "what will I do 3 hours from now?"). Our egocentric SM system consists of two primary components, each of which benefits from multimodal multi-model discussions: (i) assembling video into a language-based world-state history, i.e. a story or event log, then (ii) performing various types of open-ended text-prompted tasks based on that world-state history. We find that simple scripted policies to guide a closed-loop exchange between pre-trained LM, VLM, and ALM models can (a) generate meaningful captions that respond to questions like "what am I doing?" with answers like "receiving a package" that span beyond the label set of standard vision datasets (Sigurdsson et al. 2018; Smaira et al. 2020), and (b) exhibit open-ended contextual Q&A capabilities previously thought to be out-of-reach for egocentric perception without domain-specific data collection (Grauman et al. 2021; Damen et al. 2020).

The goal of this paper is (1) to discuss new perspectives on building AI systems that embrace the heterogeneity of language-based foundation models through structured Socratic dialogue, and (2) to give example demonstrations of what is already possible today with SMs on challenging perceptual tasks. Specifically, our contributions include (i) the Socratic Models framework, (ii) demonstration of an egocentric perception system using Socratic Models, (iii) qualitative results on video understanding (synthesizing video snippets from a full day of activity) that is not covered by existing benchmark datasets, (iv) qualitative comparisons to a state-of-the-art model (Mokady, Hertz, and Bermano 2021) on the task of single-image captioning in egocentric and Internet image domains, (v) quantitative comparisons to state-of-the-art video understanding models on the popular MSR-VTT (Xu et al. 2016; Yu, Kim, and Kim 2018) dataset for video-to-text retrieval, and (vi) a framework for unsupervised quantitative model selection of Socratic Models through sub-model ablations.

Overall, these ideas shed light on promising new opportunities to build simple systems for general applications that compose foundation models out-of-the-box. By construction, multimodal foundation models are likely to be trained on different distributions of Internet data, resulting in different test-time capabilities. These capabilities can be improved for a given target distribution through finetuning, but at the cost of generality and robustness to distribution shifts (Wortsman et al. 2021). SMs offer an alternative approach in which these capabilities can be integrated cooperatively, and in which we can make use of concepts in domain A that are more easily obtained from domain B, without complex alignment of the representations through additional large-scale training across multiple domains. Instead, the common representation is language, and it may be used to compose existing models in a zero-shot manner.

Of course, our demonstrated SM systems are not without their limitations. We discuss these limitations, such as unreliability inherited from the foundation models on which they are constructed, together with other potential broader impacts (Sec. 8.1).
2 Socratic Models Framework

Socratic Models (SMs) is a framework (Fig. 2) in which multiple pretrained language-interactable foundation models are composed (Fig. 3) zero-shot to perform new downstream tasks, e.g., egocentric perception (our primary highlight in this work), as well as Internet data image captioning and video understanding (i.e., video-to-text retrieval). While a common trend in multimodal learning has been to seek embedding spaces in which multiple modalities may co-exist, we instead compose multiple models with language. While we may leverage submodels built with shared embedding spaces, language is the representation itself upon which multiple of these models, as well as other language-interactable models, may interact. In closed-loop, multi-model interactions can perform joint inference. These guided multi-model exchanges are best described through examples, as provided in Sections 3.3, 3.4, 4, and 5.

Fig. 3: Socratic Models are composed via guided multi-model exchanges. In this example, LMs, VLMs, and ALMs are composed closed-loop via language. Examples of guided multi-model discussions are provided in Sec. 3.3, Sec. 3.4, Sec. 4, and Sec. 5 on various multimodal applications.

This work explores the use of several classes of models, each capturing a different domain or functionality. For visualization purposes, outputs from LMs are blue, VLMs green, ALMs purple, other prompt text is gray, LM completions on a shown prompt are bolded in blue, and user inputs are magenta.

Large Language Models (LMs) at the scale of 100B or more parameters trained on Internet-scale data, e.g., GPT-3 (Brown et al. 2020), have been shown to achieve strong zero-shot performance on many NLP tasks including dialogue (Thoppilan et al. 2022), coding (Chen et al. 2021), and reasoning (Wei et al. 2022). LMs can generate language sequences autoregressively, or compute similarity measures between sentence embeddings (Devlin et al. 2018). In the SMs framework, LMs (and the knowledge that they store) are used to perform contextual reasoning and Q&A, conditioned on the information provided by other models. This reasoning may in turn be subsequently checked by other models as well.

Visual-Language Models (VLMs) such as (Radford et al. 2021; Jia et al. 2021; Li et al. 2021a; Jain et al. 2021; Zhai et al. 2021) learn models that map images and text into a shared feature embedding space. VLMs can serve as image-text critics that can rank multiple text suggestions (either generic visual entity categories from existing datasets, or outputs from the LM) against a given image using cosine similarity between features. Functionally, this operates as an open-set vocabulary image classifier, which can be replaced with domain-specific alternatives, e.g., (He et al. 2017). VLMs are often trained on filtered image alt-text HTML pairs crawled from the Internet (Sharma et al. 2018). In the SMs framework, VLMs are used to detect visual entities, across various categories, from large pre-existing dictionaries, and these detections can both inform and check the work of LMs used for reasoning. (Note: we indicate a VLM checking an LM's work, for example through ranking, with green-underlined blue text.)
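For concreteness, the ranking step described above can be sketched in a few lines, assuming the open-source OpenAI CLIP package; this is an illustrative sketch and not the authors' released code:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)  # VLM checkpoint used in the paper

def rank_texts(image_path, candidate_texts, top_k=3):
    """Return the top_k candidate texts most similar to the image (CLIP cosine similarity)."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize(candidate_texts).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feats = model.encode_text(tokens)
    # Cosine similarity = dot product of L2-normalized features.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    scores = (text_feats @ image_feat.T).squeeze(-1)
    top = scores.topk(min(top_k, len(candidate_texts)))
    return [(candidate_texts[i], scores[i].item()) for i in top.indices]

# e.g., rank Places365 scene names or LM-suggested activities against a frame:
# rank_texts("frame.jpg", ["kitchen", "front porch", "office"], top_k=2)

The candidate texts can be fixed dataset categories (an open-set classifier) or free-form suggestions produced by an LM, which is how the "checking an LM's work" pattern above is realized.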
Audio-Language Models (ALMs) ground audio information into language, and can take the form of speech-to-text models, or audio-text critics (Wu et al. 2021a; Zhao et al. 2021) that learn a semantic similarity measure between audio and text. We use both types of models to capture information from the audio domain (e.g., to rank generic sound categories such as "running water" against audio snippets, or to directly transcribe speech from audio inputs). (Note: we indicate an ALM checking an LM's work, for example through ranking, with purple-underlined blue text.)

Composing Multimodal Models. There are many options of foundation models to choose from, but the examples we showcase in the following sections use models that are publicly available, so that our SM systems can be made accessible to the community. In particular, we use CLIP (Radford et al. 2021) as the VLM (for egocentric perception, ViT-L/14 with 428M params); Wav2CLIP (Wu et al. 2021a) as the sound-critic ALM and the Google Cloud Speech-to-Text API as the speech-to-text ALM; and GPT-3 with 175B params (Brown et al. 2020; Ouyang et al. 2022) and RoBERTa (Liu et al. 2019b) with 355M params as the LMs. All pretrained models are used off-the-shelf with no additional finetuning.

In all our example systems shown, the composed foundation models are used zero-shot, and the interactions between these models are scripted. While in future work we are excited to explore the possibility of learning the interactions themselves, we have also found that there are practical benefits of harnessing a framework in which zero application-specific learning is performed. In particular, new applications can be quickly retargeted by a small amount of creative programming, with no training time or compute resources required. Additionally, the zero-shot capabilities may especially benefit domains in which training data is expensive to acquire (e.g., egocentric perception for AR, or robotics). The goal of this work is to demonstrate the capabilities available today in multimodal domains with Socratic Models.

In the following sections, we showcase example SM systems and results for egocentric perception (Sec. 3), generic Internet data image captioning (Sec. 4), and Internet data video-to-text retrieval (Sec. 5).

3 Socratic Models for Egocentric Perception

Here we describe an SM system for egocentric perception. We first provide a brief system overview (Section 3.1) and address "why egocentric perception?" (Section 3.2), then describe and show results for two primary system components: (i) compiling a language-based world-state history (Section 3.3), and (ii) performing open-ended reasoning and video search on egocentric video (Section 3.4). We also explore methods to scale up video search capabilities (Section 3.5).

3.1 System Overview: Socratic Egocentric Perception

Overall, our system formulates egocentric video understanding as SM-guided discussions between different foundation models spanning multiple modalities. In general, our specific system can be thought of as a case study in the ways in which Socratic Models can be composed. Further, a goal of our egocentric perception system is to demonstrate progress towards performing unconstrained video understanding of egocentric videos. The input to this system is a video and an interactive sequence of natural language prompts (dialogue) from a user. For each prompt, depending on its type, the output is either a textual answer, and/or relevant frame(s) or video snippets from the video.

To this end, a key component of our system is using Socratic dialogue (Fig. 5) to turn key moments from an egocentric video into a language-based summary, which we term a language-based world-state history (Fig. 4, middle). This world-state history can then aid in prompting an LM to perform various reasoning tasks (Fig. 4, bottom) via text completion. In contrast to common paradigms for video Q&A, which may involve supervising video-to-text models on labeled datasets or with adversarial training (we refer to (Patel, Parikh, and Shastri 2021) for a recent survey), our approach simply maintains a language-based reconstruction of the video, generated by independently captioning key video frames, then recursively summarizing (Wu et al. 2021b) them to fit as input context to an LM (example in Fig. 5).

For open-ended reasoning, a key aspect of our approach is formulating video understanding as reading comprehension, i.e. re-framing "video Q&A" as a "short story Q&A" problem, where the language-based world-state history can be viewed as an interpretable state representation in the form of short stories, which can be used for reading comprehension Q&A, for which LMs have demonstrated strong zero-shot performance (Brown et al. 2020). Drawing analogies to 3D vision and robotics, this can be thought of as building an on-the-fly reconstruction of the sequence of events in the observable world with language, rather than other representations, such as dynamically-updated 3D meshes (Izadi et al. 2011) or neural fields (Tancik et al. 2022). In addition to open-ended question-answering in the form of text, we can also provide video search capabilities (in the form of image or audio retrieval, Fig. 7) from natural language questions through zero-shot composition of SMs as well.

3.2 Why Egocentric Perception?

We highlight SMs on egocentric perception because it is an important yet challenging computer vision domain (Grauman et al. 2021; Damen et al. 2020; Sigurdsson et al. 2018) with downstream applications in augmented reality (AR) and robotics (Ahn et al. 2022). From unusual viewpoints to the lack of temporal curation – the characteristics of first-person videos are unique and not often found in existing datasets, which focus more on generic Internet content captured from third-person spectator views (Deng et al. 2009; Lin et al. 2014; Sharma et al. 2018). Notably, this domain shift makes it difficult for data-driven egocentric models to benefit from the standard paradigm of pretraining on third-person Internet data (Li et al. 2021b; Sigurdsson et al. 2018). See Related Work (Sec. 7) for a more detailed discussion of prior work in egocentric perception – overall, the key challenges for the field have included how to acquire sufficient egocentric data, and/or how to make sufficient use of this data (either with dense labels, or otherwise).

Despite the challenges of egocentric perception, we find that SMs can reconcile the complementary strengths of pretrained foundation models to address these difficulties through contextual reasoning. For example, while modern activity recognition models trained on third-person data might over-index to the motion of the primary person in video (making the models difficult to adapt to first-person videos), we find that LMs like GPT-3 can suggest equally plausible activities (e.g., "receiving a package") that may be occurring given only a brief description of the scene (e.g., "front porch") and the objects detected in the image ("package, driveway, door") by a VLM. These activity suggestions are often more expressive than the class categories that can be found in typical activity recognition datasets (e.g., Charades (Sigurdsson et al. 2018), Kinetics (Smaira et al. 2020)), and reflect the information already stored in the model, agnostic to the point of view. Our SM system for egocentric perception leverages these advantages, and also suggests future research directions in contextual reasoning that leverage existing language-based models without having to curate large annotated datasets.

3.3 Language-Based World-State History from Video

In order to provide language-based reasoning capabilities for open-ended question-answering, a key aspect of our system is to describe the observed states of the world in language, with the goal of creating a language-based world-state history (Fig. 4) that can be used as context to an LM. To this end, a component of our method generates Socratic image summaries of individual video frames (Sec. 3.3-A), which can then be concatenated (along with timestamps) to form an event log (illustrated at the top and middle of Fig. 4).

3.3-A. Socratic Egocentric Image Summaries. Given an image frame as input, this component generates a natural language summary (e.g., caption) of what is occurring in the image. Our system uses a Socratic approach with guided multimodal multi-model discussion to provide answers to three questions that describe the visual scene: "where am I?", "what do I see?", and "what am I doing?", which are then summarized into a single caption per image frame.
Fig. 4: An instantiation of the SMs framework for open-ended reasoning with egocentric perception. SMs can generate meaningful structured
captions (top) for egocentric images through Socratic dialogue between VLMs (green) and LMs (blue), and qualitatively perform well versus
state-of-the-art captioning models such as ClipCap (Mokady, Hertz, and Bermano 2021). Key moments from egocentric video are summarized
with SMs into a language-based world-state history (middle), which can be provided as context to an LM for open-ended question answering.
Results (bottom) for generated answers (blue) and model explanations (blue) suggest SMs are fairly capable of performing a variety of
reasoning tasks including answering binary yes or no questions, contextual and temporal reasoning questions, as well as subjective questions.
Fig. 5: Examples of guided multi-model exchanges (Socratic Models) for an egocentric perception system: (i, left) parsing a natural language
question into search entities (with LM) to be used to find the most relevant key moments in the video (with VLM); (ii, middle) describing each
key frame by detecting places and objects (VLM), suggesting commonsense activities (LM), pruning the most likely activity (VLM), then
generating a natural language summary (LM) of the SM interaction; (iii, right) concatenating key frame summaries into a language-based
world-state history that an LM can use as context to answer the original question.

• Where am I? For place recognition, we use a VLM to rank Places365 (Zhou et al. 2016) scene categories against the image, with the top n candidates (out of 365) inserted into a prefix: "Places: {place1}, {place2}, {place3}."

• What do I see? For object and people recognition, we use a VLM to rank OpenImages object categories (Kuznetsova et al. 2020) against the image, with the top m categories (out of 600) inserted into a second prefix: "Objects: {object1}, {object2}, {object3}."

• What am I doing? For activity recognition, we use a back-and-forth interaction between an LM and VLM: we first use an LM to infer the activities most related to the places and objects previously listed by the VLM (green):

    Places: {place1}, {place2}, {place3}. Objects: {object1}, {object2}, {object3}. Activities: activity a, activity b, activity c.

We find that generating candidate activities using an LM yields more suitable descriptions of egocentric activities and interactions in first-person video than using standard activity recognition dataset categories (e.g., from Charades or Kinetics). Activity recognition datasets are often tailored to third-person videos, and can only cover a partial subset of human activities, which instead can be more holistically captured through LM reasoning (Petroni et al. 2019) over the objects and places that the VLM perceives. For example, "receiving a package" is a common household activity not found in most datasets. After the LM generates candidate activities, these candidates are then fed back to the VLM and re-ranked to sort out the top k activities by relevance to the key image frame: "Activities: {activity1}, {activity2}, {activity3}."

This process of generating candidate activities from places and objects is one way of extracting commonsense from LMs as knowledge bases (Petroni et al. 2019). Continuing the Socratic dialogue further, this can be repeated likewise to generate new relevant objects (conditioned on activities and places), as well as new places (conditioned on objects and activities). One can iterate the procedure (LM generate, VLM re-rank, repeat) to populate the set of places, objects, and activities until equilibrium (i.e., no more new entities), which generally helps to cover a broader set of places and objects that expand beyond the initial seed categories from Places365 and OpenImages. For example:

    If I am making pancakes, objects that I am likely to see include: a frying pan, a spatula, a bowl, milk, eggs, flour, sugar, baking powder, butter, a plate, syrup.

Given the final set of places, objects, and activities, we use the LM to generate an overall first-person summary of what is happening in the image. Specifically, the prompt is:

    I am in a place1, place2, place3. I see a object1, object2, object3. I am activity1. Question: What am I doing? Answer: I am most likely

The summarization process in general can capture richer descriptions conditioned on the places, objects, and activities, and qualitatively seems to do well at ignoring irrelevant categories (i.e., denoising). For example:

    I am in a nursing home, landfill, living room. I see a wine, wine glass, woman. I am drinking wine. Question: What am I doing? Answer: I am most likely enjoying a glass of wine with a friend or loved one.

However, while the LM's denoising capabilities can compensate for the shortcomings of the VLM, it is important to note that this may also cause unwanted ignoring of notable but rare events (e.g., witnessing a purple unicorn, which may be ignored, but potentially it is Halloween). Finding new ways in which such events can be indexed appropriately may be useful for downstream applications.
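To make the generate-and-rerank loop above concrete, the following is a minimal sketch of one possible implementation. The lm_complete() and vlm_rank() helpers are hypothetical stand-ins for a GPT-3-style text-completion call and the CLIP ranking utility sketched in Sec. 2; they are assumptions for illustration, not the paper's released code:

def socratic_image_summary(image, seed_places, seed_objects, lm_complete, vlm_rank, max_rounds=3):
    places, objects, activities = list(seed_places), list(seed_objects), []
    for _ in range(max_rounds):
        prefix = (f"Places: {', '.join(places)}. "
                  f"Objects: {', '.join(objects)}. Activities:")
        # LM proposes commonsense activities given the current places/objects.
        proposed = [a.strip() for a in lm_complete(prefix).split(",") if a.strip()]
        # VLM re-ranks the proposals against the actual image; keep the top k.
        new_activities = vlm_rank(image, proposed, k=3)
        # LM proposes objects likely to co-occur with the top activity ...
        obj_prompt = f"If I am {new_activities[0]}, objects that I am likely to see include:"
        proposed_objects = [o.strip() for o in lm_complete(obj_prompt).split(",") if o.strip()]
        new_objects = vlm_rank(image, proposed_objects, k=3)
        # ... and the loop stops once no new entities are found (equilibrium).
        grew = any(o not in objects for o in new_objects) or \
               any(a not in activities for a in new_activities)
        objects = list(dict.fromkeys(objects + new_objects))
        activities = list(dict.fromkeys(activities + new_activities))
        if not grew:
            break
    summary_prompt = (f"I am in a {', '.join(places)}. I see a {', '.join(objects)}. "
                      f"I am {activities[0]}. Question: What am I doing? Answer: I am most likely")
    return lm_complete(summary_prompt)

The same pattern (swap which entity set the LM expands and which the VLM prunes) covers the place- and object-expansion rounds described above.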
Egocentric Image Summary Results. On egocentric images, we show several qualitative examples of summaries generated by our system in Fig. 4, and compare them to results from a state-of-the-art image captioning model, ClipCap (Mokady, Hertz, and Bermano 2021). While state-of-the-art captioning models can perform reasonably over several of the images, we find that our system generally produces more relevant captions for a larger portion of the egocentric examples. Image captioning models are biased based on the datasets they are trained on, and have been shown to perform poorly on egocentric images (Agarwal et al. 2020), which aligns with our observations. Relatively less research has been carried out specifically on egocentric image captioning (Fan, Zhang, and Crandall 2018). SMs can nevertheless produce reasonable captions without additional training on domain-specific data.

Fig. 6: Example frame and corresponding (centered) 5-second audio clip which provide the driving example for Sec. 3.3-B, i.e., adding ALMs into the Socratic dialogue to improve single-moment summarization. Note that this waveform mostly represents the background piano music, but the system is still able to correctly rank footsteps as the most likely sound relative to the others in the LM-suggested candidate set.

3.3-B. Adding Audio into Single-moment Summaries. In addition to using visual perceptual inputs, we may use a Socratic approach which engages perceptual inputs from audio as well, via an ALM (audio language model). Our example egocentric perception system uses Wav2CLIP (Wu et al. 2021a) as the ALM. Wav2CLIP is trained on 5-second audio clips from the VGGSound dataset (Chen et al. 2020), and is trained in a contrastive manner by aligning its audio encoder to the visual CLIP embeddings from video.

Incorporating an ALM like Wav2CLIP into our Socratic framework can provide an additional modality with which to perform zero-shot cross-modal reasoning, and this may help further improve inference beyond the vision-language-only case. Fig. 6 displays a driving example for which a visual-only summarization produced the less-than-desirable summary: "I am climbing a staircase, and I may see a hamster or human leg", with the incorrect propagation of the false detection of a hamster and human leg.

To perform audio-aided single-moment summarization, we first run image-based summarization as described previously, but we then prompt the LM to suggest sounds that it may hear, given the visual context, via "⟨visual single-image summary⟩. 5 Possible Sounds:". For the example in Fig. 6, an example prompt (which has already gone through multiple rounds of Socratic dialogue to be generated), together with its completion by the LM, is:

    Places: staircase. Objects: stairs, animal, mammal, hamster, human leg. Activities: climbing. 5 Possible Sounds: footsteps, creaking stairs, someone calling your name, a dog barking, a centipede crawling.

These auditory entities expressed in language can then be ranked by the ALM. In this moment of the video, the sound of footsteps can be faintly heard in the background, and in this case the ALM provides a correct detection by ranking footsteps as the most likely sound. This ranking can then be incorporated into a prompt for the LM to provide the single-image summary, for example:

    I am in a: {place}. I see a: {object1}, {object2}, {object3}, {object4}, {object5}. I think I hear {sound1}. I am: {activity}. Summary: I am most likely

As above, incorporating "I think I hear footsteps" into the summary and prompting this to the LM provides the completion: "climbing a staircase, and I may hear footsteps." In this case, this summary result is preferable to the aforementioned single-image caption without sound.

While this example demonstrates the utility of audio-informed summaries in a certain case, overall in egocentric video, with a variety of background noise, we find that Wav2CLIP can provide reasonable detections for certain language-represented auditory entities such as 'baby babbling' and entities to do with running water, but does not provide detections as robust as CLIP's. Also, while there are many advantages to the specific Wav2CLIP approach, including its use of the CLIP embedding space, a major downside is that the training process is "blind" to hearing things that cannot be seen. Accordingly, for the rest of the demonstrations shown, we simply build the world-state history from VLM-LM interactions alone. We expect, however, that with further attention to model approaches, and scaling of audio-language datasets, approaches like Wav2CLIP will increase in robustness. We also show an additional application (Sec. 3.4) of audio, for audio retrieval. In that case, only a single auditory search entity is required in order to enable a useful application, and so it can be easier to verify that it is a sufficiently robustly-detected entity.
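The sound-ranking step above mirrors the CLIP text ranking: Wav2CLIP audio embeddings live in the CLIP space, so LM-suggested sound names can be scored against an audio clip by cosine similarity. A minimal sketch follows, assuming the wav2clip pip package's get_model()/embed_audio() interface and a 16 kHz mono waveform; treat the exact interfaces as assumptions rather than the paper's implementation:

import torch
import clip        # CLIP text encoder shares the embedding space with Wav2CLIP
import wav2clip    # audio encoder distilled into the CLIP space

def rank_sounds(waveform_16khz, candidate_sounds):
    """Rank candidate sound descriptions (e.g., LM-suggested) against a ~5 s audio clip."""
    audio_model = wav2clip.get_model()
    audio_emb = torch.from_numpy(wav2clip.embed_audio(waveform_16khz, audio_model)).reshape(1, -1)
    clip_model, _ = clip.load("ViT-B/32", device="cpu")
    with torch.no_grad():
        text_emb = clip_model.encode_text(clip.tokenize(candidate_sounds)).float()
    audio_emb = audio_emb / audio_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    scores = (text_emb @ audio_emb.T).squeeze(-1)
    order = scores.argsort(descending=True)
    return [(candidate_sounds[i], scores[i].item()) for i in order]

# e.g. rank_sounds(clip_5s, ["footsteps", "creaking stairs", "a dog barking"])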
3.3-C. Compiling a Language-Based World-State History. Our system compiles the image summaries from each key video frame into a language-based world-state history. Since the total number of frames in the video may be large, compiling a summary for every individual frame would create text that is too large (too many tokens) to be processed directly by an LM as context for Q&A. Accordingly, in this work we propose solutions that sparsify and/or condense language-based world-state histories (e.g., via search-based methods) into practically usable context sizes for reasoning.
In particular, we explore two methods of identifying "key moments" in videos for summarization: (i) uniform sampling over time, and (ii) video search (image or audio retrieval) for on-the-fly compilation of context.

The first method, uniform sampling, is straightforward and compiles a world-state history from Socratic summaries of video frames sampled at fixed time intervals. This can also be condensed hierarchically using recursive linguistic summarization (Wu et al. 2021b), to fit even dense sampling into usable LM-context sizes. However, while broadly indiscriminate, uniform sampling may not have sufficient temporal resolution to capture important spontaneous events in the video (such as adding salt to the pot while cooking soup in the kitchen).

Hence the second method, identifying key moments with video search, uses a VLM or ALM to search for entities most relevant to the question, which can more precisely index the frames in which the subject appears. Specifically, our instantiation of SMs for this component parses a natural language question with an LM into several search entities to be used to find key frames in the video. For example, the question "did I drink coffee today?" yields a search entity "drink coffee" that is then used with language-conditioned video search to index the most relevant n key frames of "drink coffee" in the video. The LM categorizes the search, which can be image-based (VLMs) or audio-based (ALMs), e.g., for language-conditioned auditory recall questions (Oncescu et al. 2021) like "why was my wife laughing today?". While search-based indexing of key moments can be useful for finding spontaneous events, this method for generating context can also have disadvantages for downstream Q&A if the answer to the question depends on events that are not directly related to the search subject. For example, "why was I chopping wood today?" returns key frames related to "chopping wood", but does not return the key frames after the event related to making a campfire. On the other hand, if uniform sampling is employed and the campfire events are captured by the summary, then the LM can successfully return the answer "I was making a campfire." Choosing which method to use for compiling the language-based world-state history may depend on the application.
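A minimal sketch of the search-based key-moment indexing just described, assuming precomputed L2-normalized CLIP frame features, a text_encoder() callable for the matching text features, and a hypothetical lm_complete() wrapper for the question-parsing step (all assumptions for illustration):

import numpy as np

def parse_search_entity(question, lm_complete):
    # Few-shot prompt the LM to pull out a search phrase (prompt wording is illustrative).
    prompt = ("Q: did I drink coffee today? Search: drink coffee\n"
              "Q: why was I chopping wood today? Search: chopping wood\n"
              f"Q: {question} Search:")
    return lm_complete(prompt).strip()

def key_frames_for_question(question, frame_features, text_encoder, lm_complete, n=5):
    """frame_features: (num_frames, d) L2-normalized CLIP image features."""
    entity = parse_search_entity(question, lm_complete)
    text_feat = text_encoder(entity)                      # (d,), L2-normalized
    scores = frame_features @ text_feat                   # cosine similarities per frame
    # Keep the top-n local maxima so that neighboring frames of the same event
    # are not returned repeatedly (cf. the retrieval discussion in Sec. 3.4).
    local_max = [i for i in range(len(scores))
                 if (i == 0 or scores[i] >= scores[i - 1])
                 and (i == len(scores) - 1 or scores[i] >= scores[i + 1])]
    local_max.sort(key=lambda i: scores[i], reverse=True)
    return entity, local_max[:n]

The returned frame indices are then captioned (Sec. 3.3-A) and concatenated into the on-the-fly world-state history used as LM context.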
Language-based World-state History Results. Fig. 4, middle, shows results generated by our system. The specific event log shown in Fig. 4 has been trimmed down for space considerations, but is representative of the type of event logs that may be generated without manual curation. These event logs are used as context to enable LM open-ended reasoning on video, as demonstrated in the next section.

3.4 Open-Ended Reasoning on Egocentric Video

In this section we describe a few examples of how the Socratic Models framework can be used to perform open-ended multimodal-informed completion of text prompts, conditioned on egocentric video (examples in Fig. 2). There are of course limitations to what they can provide, but our demonstrated examples suggest that we can already today often generate compelling answers to open-ended reasoning tasks, at a scope that is beyond what we are aware is possible today with available methods. Of course, the answers may also inherit undesirable characteristics from the component models, such as an LM that is overconfident even when wrong. It is our hope that our results may help inspire work on preparing even more comprehensive video understanding datasets for the community, to assist further assessment.

Our example system uses a language-based world-state history generated through Socratic multi-model discussion (Sec. 3.3), and provides this as context to an LM to enable open-ended reasoning on egocentric videos. Open-ended text prompts from a user, conditioned on an egocentric video, can yield three types of responses: a text-based response, a visual result, and/or an audio clip. The latter two open up the capabilities of the system to respond not only with text, but also with video snippets themselves, which may be a higher-bandwidth way to respond to user requests ("a picture is worth a thousand words"). The specific composition of our system is of course just one example – overall, the modularity of the Socratic approach makes it easy to compose foundation models together, zero-shot, in a variety of ways to provide a spectrum of multimodal reasoning capabilities.

The demonstrated tasks include (i) summarization, (ii) open-ended Q&A, (iii) forecasting, (iv) corrections, and (v) video search for either visual or audio cues. These tasks have predominantly been studied in isolation in the research community – but our example results with SMs suggest they can be subsumed under the same unified language-based system for multimodal reasoning. Descriptions and results for each of (i)-(v) are shown below.

Fig. 7: SMs can interface with the user through dialogue and perform a variety of tasks (formulated as Q&A) with egocentric video: sorting reasoning questions by their output modalities, e.g., text-based responses, images from visual search, video snippets from audio search. Depending on the modality, each question can pass through a different sequence of Socratic interactions between the LM, VLM, and ALM.

(i) Summarization can be implemented by prompting an LM to complete the excerpt "{world-state history} Summary of my day:", to which it can respond with outputs like "I slept in a bed, made coffee, watched TV, did laundry, received a package, bench pressed, showered, ate a sandwich, worked on a computer, and drank wine." Since the language-based world-state history is constructed with summaries of visual content, it carries contextual information that can be complementary to what is found in closed captions (e.g., speech and dialogue, explored in Sec. 5). Summarizing egocentric videos enables a number of applications, including augmenting human memory to recall events, or life-logging of daily activities for caregiver assistance. Our system draws similarity to early work in the area involving text-based summarization and identifying key frames (see (Barbieri, Agnihotri, and Dimitrova 2003) for an early survey and (Del Molino et al. 2016; Apostolidis et al. 2021) for more recent surveys).

(ii) Open-ended Q&A can be implemented by prompting the LM to complete the template: "{world-state history} Q: {question} A:". We find that LMs such as GPT-3 can generate surprisingly meaningful results to binary yes or no questions, contextual reasoning questions, as well as temporal reasoning questions. As in (Yang et al. 2021), we can further prompt the LM to explain the answer by adding "This is because:".
We find that the accuracy of the answers and explanations remains largely conditioned on whether the necessary information can be found within the world-state history. This suggests that the quality of the language-based reconstructions of the videos (e.g., via key frame sampling and captioning in this work) is central to the approach.

We show several qualitative examples of free-form question answering using our SM system on egocentric video in Fig. 4, bottom, Fig. 5, and Fig. 7, generated using a first-person POV video² as input.

²Examples on https://youtu.be/-UXKmqBPk1w, used with permission from Cody Wanner.
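The Q&A template above reduces to a single text completion per question; a minimal sketch (lm_complete() is again a hypothetical wrapper around an LM completion API, not the paper's released code):

def answer_question(world_state_history, question, lm_complete):
    """Reading-comprehension-style Q&A over the language-based world-state history."""
    prompt = f"{world_state_history} Q: {question} A:"
    answer = lm_complete(prompt).strip()
    # Optionally ask the LM to justify its answer, as described above.
    explanation = lm_complete(f"{prompt} {answer} This is because:").strip()
    return answer, explanation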
Recall Questions. SMs can perform simple retrieval of events. For example, "did I eat dinner today?" yields a response "yes I ate dinner today." along with an explanation "I was seen eating a sandwich in a kitchen at 5:27 PM.", which points to the key frame that was captioned with the sandwich in hand. Another example that involves contextual reasoning to recall events is "what was I doing outdoors?", to which the system responds "I was chopping wood in a yard." Likewise, if the entities described in the question do not appear in the world-state history, such as "did I drive today?", the system can respond with a negative answer: "no, I did not drive today." with an explanation "I was at home all day." This capability expands beyond standard video search, which might only return nearest-neighbor video frames, without a natural language response (or a negative response).

The performance of recalling events largely depends on the relevance of the language-based world-state history to the question. We find that recall-type questions work best with world-state history logs that are compiled by using search-based key frame indexing (see Sec. 3.3-B). The system can still return negative responses, since the captioning of the key frames is not influenced by the question.

Temporal Reasoning. SMs can answer questions related to time by appending timestamps to each key moment in the world-state history. By associating image summaries to times of the day, this allows answering questions that time-index various activities. For example, "when did I last drink coffee?" can return the last time drinking coffee was mentioned in the log, with a full response "I last drank coffee at 10:17 AM" and an explanation "I was making coffee in the kitchen." The system can also count events; for example, when asked "how many times did I receive a package today?", the system will respond appropriately "I received a package once today." with an explanation "I was receiving a package at 3:24 PM". We find that a common failure mode for these types of questions is that the system tends to over-count, especially as a reaction to false positive VLM detection results that get surfaced into the world-state history. For example, asking "who did I interact with?" would yield "woman, hamster", where hamster was a false positive prediction from CLIP. These issues become more prominent with search-based key frame sampling, as a byproduct of an inability to distinguish neighboring local argmaxes of the same event from each other.

Cause and Effect Reasoning. SMs can answer questions about cause and effect relationships between events, conditioned on all the events appearing in the world-state history. For example, when asked "why did I go to the front porch today?" the system would respond "I went to the front porch today to receive a package." with an explanation "I saw on the porch a package and knew that I was expecting it." These types of questions are exciting because they suggest opportunities for prompting logical deduction of events. However, since information about both the cause and the effect needs to be in the world-state history, the quality of results remains highly dependent on the key frame sampling strategy used to compile it (Sec. 3.3-B). Uniform sampling gives an unbiased account of events, and is currently the best variant for this form of reasoning. More targeted construction of the world-state history with search-based key frames can sometimes miss frames that capture the answer to the question.
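The temporal and counting behaviors above depend on a timestamp being attached to each key-moment summary. A minimal sketch of assembling such a timestamped event log (assuming per-frame summaries have already been generated as in Sec. 3.3):

from datetime import datetime

def build_world_state_history(key_moments):
    """key_moments: list of (timestamp: datetime, summary: str) for key frames."""
    lines = []
    for ts, summary in sorted(key_moments, key=lambda m: m[0]):
        lines.append(f"{ts.strftime('%I:%M %p').lstrip('0')}: {summary}")
    return "\n".join(lines)

# e.g. build_world_state_history([(datetime(2022, 4, 1, 15, 24),
#                                  "I am receiving a package on the front porch.")])
# -> "3:24 PM: I am receiving a package on the front porch."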
Subjective Reasoning. SMs can also answer more subjective questions, such as "was I happy today?" or "what was my favorite drink today?". Without additional context, these questions rely on biases from the LM's dataset – which could have negative consequences, and should be managed carefully with additional mechanisms for safety and groundedness (Thoppilan et al. 2022). The full personalization of these subjective questions is likely to be conditioned on whether a better context can be constructed of prior user behaviors related to the question.

(iii) Forecasting of future events can be formulated as language-based world-state completion. Our system prompts the LM to complete the rest of an input event log. Timestamps of the predictions can be preemptively specified depending on the application needs. The completion results are generative, and are broader than binary event classification (e.g., (Lei et al. 2020)). Example completion results (also shown in Fig. 2):

    1:46 PM: I am eating a sandwich in a kitchen.
    2:18 PM: I am checking time and working on a laptop in a clean room.
    2:49 PM: I am buying produce from a grocery store or market.
    3:21 PM: I am driving a car.
    4:03 PM: I am in a park and see a playground.
    4:35 PM: I am in a home and see a television.

Few-shot prompting the LM with additional examples of prior event logs most similar to the current one is likely to improve the accuracy of the completion results. Without additional context, these results are again biased towards typical schedules seen by the LM across Internet-scale data.

To a certain extent, this forecasting capability extends and generalizes the traditional topic of activity forecasting in computer vision. In the research community, activity forecasting has often been formulated as an extension of action classification, tracking, or feature generation: given a sequence of image frames, these methods directly predict a few categorized actions (Ryoo 2011; Hoai and De la Torre 2014; Rhinehart and Kitani 2017), human locations (Kitani et al. 2012), or image features (Vondrick, Pirsiavash, and Torralba 2016) to be observed in the future frames. In contrast, Socratic Models with LMs enable generating more semantically interpretable descriptions of future events, conditioned on multimodal information.

(iv) Corrections. SMs can be prompted to incorporate human feedback in the loop as well, which could be useful for interactive language-based systems. For example, given image captions generated from a VLM and LM:

    Context: Where am I? outdoor cabin, campsite, outdoor inn. What do I see? fire, marshmallow, fire iron, hearth, fireside, camp chair. What am I doing? Commonsense suggests: roasting marshmallows, sitting around the fire, chatting. Most likely: sitting around the fire.
    Original Summary: I am camping and enjoying the company of my friends around the fire.
    Corrections: It was actually my family, not friends, sitting around the fire.
    Corrected Summary: I am camping with my family and enjoying the company of them around the fire.

Fig. 8: Example zero-shot language-prompted auditory retrieval (shown: top 2 results) in response to "what did my daughter's laugh sound like today?", for which an LM identifies the audio search query of "daughter's laugh", and an ALM (Wav2CLIP) is used for audio retrieval. The top (left) retrieval is only partially correct, returning a video clip involving the daughter but not laughter. The second (right) retrieval is correct, from a moment of playing (getting tossed into the air). Faces obscured for privacy.

(v) Video Search: Image or Audio Retrieval. Our SM system can also return additional modalities (images, audio) as answers to questions, by simply few-shot prompting the LM to classify a target modality based on the input question. For example, "where did I leave my remote control?" can map to image search using VLM features for "remote control", while "what did my daughter's laugh sound like today?" can map to natural-language-queried audio search (Oncescu et al. 2021) using ALM features for "daughter's laugh" (Fig. 8). This can be useful for some applications (e.g., AR) in which the user may find the retrieved modality to be more useful than a natural language response. Our approach for this uses an LM to parse a search entity from the question to index key video frames. This is done with several few-shot examples provided as context. For example, the question "when did I last wash my hands?" yields a search entity "wash my hands" that is then used with video search to index the most relevant n key frames of "wash my hands" in the video. Specifically, our system runs video search by ranking matching CLIP or Wav2CLIP features of the entity text against all video frames, and returning the top n local maxima. For each frame, the features can either be image features or audio features (e.g., from the surrounding 5 seconds with Wav2CLIP) – where the LM few-shot categorizes which domain to use for any given question. This can be thought of as calling different subprograms for hierarchical search.

Limitations. Overall, our results suggest that SMs are capable of generating meaningful outputs for various egocentric perception tasks via visual contextual reasoning – but their limitations also suggest areas for future work. For example, a primary bottleneck in the Q&A system is that it relies on the richness (i.e., recall) and quality (i.e., precision) of the event log. This likely could be improved with better image and audio detectors or captioning systems (Gu et al. 2021). Also, we find that the Wav2CLIP model we use may provide satisfactory results for certain categories in audio retrieval, but we currently do not involve it in generating the event log, since its robustness and range of open-language detection is not at the same level as CLIP. This seems addressable with further approaches and scaling of datasets in the audio-language domain.

Additionally, accurate responses to cause and effect reasoning questions also require relevant key moments to be reflected in the event log – which points to open-ended questions on how to achieve better key frame sampling (beyond the simple baselines that we have demonstrated). Finally, the dialogue between the different models is fairly structured, with manually engineered prompts. It may be interesting to investigate more autonomous means of achieving language-based closed-loop discussions between the models until a commonsense consensus is reached.

3.5 Scaling Up Socratic Video Search

The search algorithms of the SMs, which may be used both for compiling world-state history (Sec. 3.3-C) and for video search retrieval (Sec. 3.4), rely on a matching procedure conducted in the corresponding latent space (e.g. VLM features of the text snippet against those of the video frames). This can be abstracted as dot-product-maximization key search in the given key-dataset. In practice, if the key-dataset is large (e.g. long videos), a naive linear search is prohibitively expensive. We propose several solutions to this problem.
MIP-Search. The first observation is that several data preprocessing techniques applied in so-called maximum inner product (MIP) search can be directly used to reorganize the keys (e.g. latent representations of video frames) to provide a sub-linear querying mechanism for the incoming text snippet (see (Abuzaid et al. 2019)). Those include pruning and various indexing techniques, such as LSH-hashing (Shrivastava and Li 2014). In the hashing approach, a collection of hash-tables, indexed by the binarized representations of the hashes, is stored, with different entries of the hash table corresponding to the subsets of keys producing a particular hash. There are several cheap ways of computing such hashes, e.g. signed random projections (these in principle linearize the angular distance, but every MIP task can be translated to a minimum-angular-distance search problem). The querying is then conducted by searching for the most similar hash-entries in the hash-tables and then performing linear search only on the subsets of keys corresponding to these entries to obtain the final ranking.

Associative Memories. The above approach provides a sub-linear querying mechanism, but does not address the space complexity problem. In the scenario of strict memory requirements, we propose to leverage recently introduced techniques on linear attention (Choromanski et al. 2021b) combined with modern continuous associative memory (MCAM) models (Ramsauer et al. 2021). MCAM models are de facto differentiable dictionaries (with provable few-shot retrieval) that can be thought of as energy-based models using a negated exponentiated latent-representations-dot-product energy for exponential storage capacity. A naive computation of such an energy still requires explicitly keeping all the patterns (which is exactly what we want to avoid), but this can be bypassed by applying the linearization of that energy (which effectively is just the negated sum of the softmax kernel values) with the FAVOR+ mechanism used in linear-attention Transformers, called Performers (Choromanski et al. 2021b). This modification has several advantages: (1) it makes the size of the dictionary completely independent from the number of implicitly stored patterns; the size now scales linearly with the number of random features used for energy linearization; (2) it provides a constant-time querying mechanism at the price of compressing all the patterns (and thus losing some information).

Random Feature Trees. The other approach, which combines ideas from both MIP-search and linear-attention systems, leverages the so-called random feature tree (RFT) data structure (Rawat et al. 2019). This approach relaxes MIP-search to sampling from the linearized softmax distribution via FAVOR+ (Choromanski et al. 2021a). Sampling from such a linearized distribution can be done in time logarithmic in the number of samples via the RFT, which is a balanced tree with leaves corresponding to latent representations of video frames and nodes encoding representations of subsets of keys (e.g. the video frames), defined as sums of the random feature transforms of the keys.

4 Socratic Internet Data Image Captioning

The SMs framework can also be used to generate text captions for generic Internet images with a guided multi-model exchange between a VLM and LM. We describe an example system in Sec. 4.1 and demonstrate results in Sec. 4.2.

4.1 System: Image Captioning on Internet Data

Overall, our example SMs system for Internet image captioning is extremely similar to how we perform single-image captioning in our egocentric system, but (i) adapted for Internet images rather than egocentric images, and (ii) adapted such that the "final task" is the generation of a single image caption, rather than open-ended tasks based on text-prompted completion.

First, similar to the process of generating egocentric image captions, we may prompt the VLM to zero-shot detect visual entities across different categories of language. As with the egocentric system, we return top matching place categories and object categories. For Internet data, we use Tencent ML-Images (Wu et al. 2019) for object entities.
Fig. 9: The SMs framework with a VLM and LM can be used to zero-shot generate captions for generic Internet images, and can be as
expressive as finetuned-for-captioning state-of-the-art models e.g., ClipCap (Mokady, Hertz, and Bermano 2021).

4 Socratic Internet Data Image Captioning

The SMs framework can also be used to generate text captions for generic Internet images with a guided multi-model exchange between a VLM and LM. We describe an example system in Sec. 4.1 and demonstrate results in Sec. 4.2.

4.1 System: Image Captioning on Internet Data

Overall, our example SMs system for Internet image captioning is extremely similar to how we perform single-image captioning in our egocentric system, but (i) adapted for Internet images rather than egocentric images, and (ii) adapted such that the "final task" is the generation of a single image caption, rather than open-ended tasks based on text-prompted completion.

First, similar to the process of generating egocentric image captions, we may prompt the VLM to zero-shot detect visual entities across different categories of language. As with the egocentric system, we return top matching place categories and object categories. For Internet data, we use Tencent ML-Images (Wu et al. 2019) for object entities. We also choose to detect the image type from the set {photo, cartoon, sketch, painting} and the amount of people from the set {are no people, is one person, are two people, are three people, are several people, are many people}. For generic Internet images, which are not necessarily real photos and are very often taken by people, of people, we find that these additional contextual pieces of information help generate better captions. The various VLM detections give: "Places: {place1}, {place2}, {place3}. Objects: {object1}, {object2}, {object3}. Image type: {image type}. People result: {people result}."

Next, given the VLM detections of various visual entities, we can then prompt the LM to generate several (n) candidate captions. For this step, we employ a non-zero sampling temperature (we find 0.9 gives good results) in order to give sufficient variety of results across the n options, but still generate reasonably likely options:

    I am an intelligent image captioning bot. This image is a {img type}. There {num people}. I think this photo was taken at a {place1}, {place2}, or {place3}. I think there might be a {object1}, {object2}, {object3},... in this {img type}. A creative short caption I can generate to describe this image is:

An example of an actual prompt and generated sample captions, here for the bottom-left photo of Fig. 9, is:

    I am an intelligent image captioning bot. This image is a photo. There are no people. I think this photo was taken at a indoor bow window, dining room, or interior balcony. I think there might be a double-hung window, casement window, sliding window, pivoting window, breakfast area, breakfast nook, dining area, storm window, storm sash, dining room, bay window, bow window, lancet window in this photo. A creative short caption I can generate to describe this image is: ... (run n times, examples shown below)

• Looking through the window to a beautiful view.
• This serene and stately room is centered around a beautiful window, which floods the space with natural light. The double-hung window allows for ventilation, while the storm sash protects against inclement weather. The perfect spot for a lazy Sunday brunch or an intimate dinner party.
• ...

Lastly, after the LLM has generated several captions, they are subsequently ranked by the VLM, and the highest-scoring caption is returned – in the example above, this is: "This image shows an inviting dining space with plenty of natural light."
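The full loop described in Sec. 4.1 (zero-shot VLM detection, LM caption sampling, then VLM re-ranking) can be condensed into a short sketch. This is a hedged approximation rather than the exact released colab: it assumes the open-source `clip` package, abbreviates the prompt template, and treats `generate_text` (a GPT-3-style completion call at temperature 0.9) and the `places`/`objects` vocabularies as user-supplied inputs.

```python
import clip, torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def rank_texts(image_features, texts):
    # Zero-shot scoring of candidate texts against the image with CLIP.
    tokens = clip.tokenize(texts, truncate=True).to(device)
    with torch.no_grad():
        text_features = model.encode_text(tokens)
        text_features /= text_features.norm(dim=-1, keepdim=True)
    return (image_features @ text_features.T).squeeze(0).tolist()

def top_k(image_features, categories, k=3):
    scores = rank_texts(image_features, categories)
    return [c for _, c in sorted(zip(scores, categories), reverse=True)[:k]]

def caption_image(path, places, objects, generate_text, n=10):
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        image_features /= image_features.norm(dim=-1, keepdim=True)

    img_type = top_k(image_features, ["photo", "cartoon", "sketch", "painting"], k=1)[0]
    ppl = top_k(image_features, ["are no people", "is one person", "are two people",
                                 "are three people", "are several people", "are many people"], k=1)[0]
    prompt = (f"I am an intelligent image captioning bot. This image is a {img_type}. "
              f"There {ppl}. I think this photo was taken at a "
              f"{', '.join(top_k(image_features, places))}. I think there might be a "
              f"{', '.join(top_k(image_features, objects, k=10))} in this {img_type}. "
              f"A creative short caption I can generate to describe this image is:")

    candidates = [generate_text(prompt, temperature=0.9) for _ in range(n)]  # LM proposes n captions
    scores = rank_texts(image_features, candidates)                          # VLM ranks them
    return max(zip(scores, candidates))[1]
```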

4.2 Results: Image Captioning on Internet Data


Fig. 9 shows several qualitative comparisons with Clip-
Cap (Mokady, Hertz, and Bermano 2021), a state-of-the-art
method for image captioning specifically trained via finetun-
ing on image captioning. We show zero-shot generation of
captions on the set of images displayed in ClipCap’s code re-
lease.3 We make a colab available to generate these results.4
Overall, our results (Fig. 9) show that the Socratic Model
framework can be adopted to provide often convincing re-
sults for image captioning via the creativity of the LM com-
bined with the visual grounding of the VLM. While these
results are promising, the degree to which visual details are
provided in the captions is largely limited by the capabil-
ities of the VLM. For example, attributes (e.g., color of a
shirt, a person’s facial expression, or the spatial relationships
between objects) are details not often captured in our par-
ticular system, which relies more on the contextual image classification capabilities of the VLM. Future work may explore open-vocabulary object detectors (Gu et al. 2021) as a means to recover salient details for richer captions. It is also important to note that the system may generate captions that reflect unwanted biases found in the Internet-scale data that it is trained on, and should be used with caution (and checked for correctness) in downstream applications.

Fig. 10: Evolution over time of state-of-the-art (SOTA) results for video-to-text retrieval, with R@1 metric, on the popular MSR-VTT (Xu et al. 2016) 1k-A (Yu, Kim, and Kim 2018) dataset. See Tab. 1 for additional information on these methods.
5 Socratic Video-to-Text Retrieval

We also adapt the Socratic Models framework to the task of video-to-text retrieval, a common video understanding task with clear and quantitative metrics. We describe an example SM-based system (Sec. 5.1) which uses a guided multi-model exchange between a VLM, a causal LM, a speech-to-text (ALM) model, and a masked LM, and that achieves state-of-the-art performance for zero-shot video-to-text (i.e., caption) retrieval on the popular MSR-VTT dataset (Sec. 5.2).

5.1 System Overview: Socratic Video-to-Text Retrieval

Our approach uses SMs to augment Portillo-Quintero et al. (2021) with commonsense information from the audio and language domains. Portillo-Quintero et al. (2021) computes a similarity measure between the average VLM (i.e., CLIP) features of the image frames from a video, and the CLIP text features of a caption. This zero-shot method can be used directly for MSR-VTT video-to-text retrieval via one-to-many nearest neighbor matching.

We can improve upon their method through a combination of speech-to-text ALMs together with LM-based commonsense reasoning. First, we transcribe the audio from all videos with speech-to-text ALMs (Bapna et al. 2022) (also called ASR, or automatic speech recognition), using the publicly-available Google Cloud speech-to-text API.5 Although raw transcripts may be challenging to incorporate into meaningful improvements for video/caption retrieval, we may leverage reasoning capabilities from large LMs in order to usefully harness the transcripts. For videos with sufficiently long transcripts, we summarize the content with an LM (e.g., GPT-3) using the following prompt:

    I am an intelligent video captioning bot. I hear a person saying: "{transcript}". Q: What's a short video caption for this video? A: In this video,

We then compute the similarity scores of the generated summary to the set of captions with a masked LM (e.g., similarity between sentence embeddings from RoBERTa (Liu et al. 2019b)), and use those scores to re-weight the CLIP-based ranking from Portillo-Quintero et al. Specifically, for videos with sufficiently long transcripts (we use ≥100-character transcripts), the matching score is:

    (CLIP(caption) · CLIP(video_0)) × (RoBERTa(caption) · RoBERTa(GPT-3-with-prompt(Speech2Text(audio_0)))),

where · represents the normalized dot product of embeddings, and × represents scalar multiplication. For a given video, if there is no audio or the transcript is too short, we resort to exactly the method of Portillo-Quintero et al., which computes matching scores with CLIP(caption) · CLIP(video_0).
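A compact sketch of this scoring rule is given below. It assumes the per-video mean CLIP frame feature and the CLIP caption features are precomputed (as in Portillo-Quintero et al.), uses a sentence-transformers RoBERTa model as a stand-in for the RoBERTa sentence similarity, and leaves `summarize_transcript` (the GPT-3 call with the prompt above) as a user-supplied function.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # stand-in for the RoBERTa similarity

sbert = SentenceTransformer("all-roberta-large-v1")

def normalized(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def matching_scores(clip_video, clip_captions, captions, transcript, summarize_transcript):
    """clip_video: (d,) mean CLIP feature of the video's frames.
       clip_captions: (C, d) CLIP text features of all candidate captions."""
    clip_scores = normalized(clip_captions) @ normalized(clip_video)
    if transcript is None or len(transcript) < 100:
        return clip_scores                      # fall back to CLIP-only ranking
    # LM summarizes the ASR transcript into a caption-like sentence, then a
    # masked-LM sentence embedding similarity re-weights the CLIP ranking.
    summary = summarize_transcript(transcript)  # e.g., GPT-3 with the prompt above
    emb = normalized(sbert.encode([summary] + list(captions)))
    lm_scores = emb[1:] @ emb[0]
    return clip_scores * lm_scores
```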
3 https://github.com/rmokady/CLIP_prefix_caption
4 Note that due to the non-zero temperature used for sampling from the generative language model, results from this approach are stochastic, but comparable results are reproducible.
5 https://cloud.google.com/speech-to-text. Key parameters used include 'model=video' and 'use_enhanced=True'. At 0.006 cents per 15 seconds, this represents an estimated speech-to-text processing cost of under 25 cents (USD) for all MSR-VTT test data.
Category   | Method                                     | MSR-VTT 1k-A: R@1↑ R@5↑ R@10↑ MdR↓ | MSR-VTT Full: R@1↑ R@5↑ R@10↑ MdR↓ | Audio | CLIP enc.
Finetuning | JEMC (Mithun et al. 2018)                  | 12.5  32.1  42.4  16.0             | –                                  | yes   | –
Finetuning | Collaborative Experts (Liu et al. 2019a)   | 20.6  50.3  64.0  5.3              | 15.6  40.9  55.2  8.3              | yes   | –
Finetuning | SSB (Patrick et al. 2020)                  | 28.5  58.6  71.6  3.0              | –                                  | no    | –
Finetuning | CLIP4Clip (Luo et al. 2021)                | 43.1  70.5  81.2  2.0              | –                                  | no    | ViT-B/32
Finetuning | CLIP2Video (Fang et al. 2021)              | 43.5  72.3  82.1  2.0              | 54.6  82.1  90.8  1.0              | no    | ViT-B/32
Finetuning | DRL (Wang et al. 2022), ViT-B/32           | 45.3  73.9  83.3  2.0              | –                                  | no    | ViT-B/32
Finetuning | CAMoE (Cheng et al. 2021)                  | 49.1  74.3  84.3  2.0              | –                                  | no    | ViT-B/32
Finetuning | CLIP2TV (Gao et al. 2021)                  | 54.1  77.4  85.7  1.0              | –                                  | no    | ViT-B/16
Finetuning | DRL (Wang et al. 2022), ViT-B/16 + QB-Norm | 56.2  79.9  87.4  1.0              | –                                  | no    | ViT-B/16
Zero-shot  | SSB (Patrick et al. 2020), zero-shot       | 8.7   23.0  31.1  31.0             | –                                  | no    | –
Zero-shot  | CLIP via Portillo-Quintero (2021)          | 27.2  51.7  62.6  5.0              | 40.3  69.7  79.2  2.0              | no    | ViT-B/32
Zero-shot  | SMs (ours)                                 | 42.8  62.6  70.6  2.0              | 44.7  71.2  80.0  2.0              | yes   | ViT-B/32

Tab. 1: Video-to-text retrieval results on MSR-VTT (Xu et al. 2016) dataset, both on the popular 1k-A (Yu, Kim, and Kim 2018) subset and
the original ‘full’ test set. Differentiated are methods which train on the MSR-VTT dataset (finetuning), compared with zero-shot methods,
which do not. Also noted: whether the methods use audio channels, and if CLIP (Radford et al. 2021) is used, which CLIP encoder is used.
See Fig. 10 for the chronology of the SOTA across each category.

Here, the Socratic interaction lies mainly between the ALM (speech-to-text) and the commonsense LM (GPT-3, which summarizes the transcriptions), and between the commonsense LM and the ranking-based system that combines the VLM (CLIP) and the masked LM (RoBERTa). Note that we may also prompt LMs (in this case, via multiple-choice) to determine if one caption is a better fit than another for a given video. However, for this specific task and dataset, with thousands of possible answers to choose from, the numerical ranking provided by embedding similarity scores provides a practical solution rather than relying on thousand-way multiple-choice commonsense reasoning.

5.2 Results: Socratic Video-to-Text Retrieval

Long-transcript subset of MSR-VTT 1k-A   | R@1↑ | R@5↑ | R@10↑ | MdR↓
CLIP via Portillo-Quintero (2021)        | 28.2 | 49.9 | 60.3  | 6.0
SMs (ours)                               | 55.0 | 71.6 | 76.3  | 1.0
Long-transcript subset of MSR-VTT Full   | R@1↑ | R@5↑ | R@10↑ | MdR↓
CLIP via Portillo-Quintero (2021)        | 41.5 | 69.6 | 77.4  | 2.0
SMs (ours)                               | 54.9 | 74.0 | 79.9  | 1.0

Tab. 2: Evaluation on MSR-VTT for video-to-text retrieval with long-transcript videos: the subset of videos for which our SMs method used transcripts on 1k-A (n=451 out of 1,000) and full (n=1,007 out of 2,990). On these subsets, we evaluate Portillo-Quintero et al. (2021) vs. our method. Outside this subset, we resort to Portillo-Quintero et al.

For video-to-text evaluation we use the MSR-VTT dataset (Xu et al. 2016), which as noted in other recent works (Gao et al. 2021; Cheng et al. 2021) is the most popular benchmark dataset for the task of video-to-text retrieval. Like other recent works (Gao et al. 2021), we focus our results on this dataset. One of the reasons this is a good task and dataset for generally testing the value of the SMs approach is that there is already a strong zero-shot baseline, provided by Portillo-Quintero et al. (2021), which uses CLIP by itself, but does not use the Socratic method: there is no multi-model exchange, and no LMs are used. Additionally, this task provides a great opportunity to incorporate another type of modality – speech-to-text from audio data. We compare our method both with zero-shot methods, and with finetuned methods specifically trained on MSR-VTT.

Results show that our method sets a new zero-shot state-of-the-art for video-to-text retrieval on MSR-VTT (Tab. 1), both on the "1k-A" and "full" test sets. Since our demonstrated system uses exactly the method of Portillo-Quintero et al. (2021) for its processing of CLIP features but additionally incorporates LLM reasoning on speech-to-text transcripts, the increased measured performance of our method (i.e. 27.2 → 42.8 R@1 on MSR-VTT 1k-A) directly reflects the additional benefit of incorporating language-based multimodal reasoning. Additionally, to keep the comparison between our method and Portillo-Quintero et al. (2021) as direct as possible, we maintain the usage of their precomputed CLIP features6 from ViT-B/32. Given results from other recent methods (Tab. 1), it seems likely we may be able to improve our performance by switching to ViT-B/16, or other recent more-performant VLM models (Zhai et al. 2021).

6 https://github.com/Deferf/CLIP_Video_Representation

As shown in Table 2, if we look at only the long-transcript videos, i.e. the videos for which our method used a transcript, then we especially see an increase in performance – on MSR-VTT 1k-A, R@1 almost doubles, from 28.2 to 55.0, for our method compared to Portillo-Quintero et al. (2021). Further, although it is on only a subset of the test set, note that this achieved R@1 of 55.0 is roughly comparable to the R@1 of the best finetuned-SOTA method, DRL (Wang et al. 2022), on the entire 1k-A dataset, with 56.2 R@1 (Tab. 1). If we assume that, for visual-only methods, the videos with or without transcripts are of roughly equal difficulty from a visual-only retrieval perspective, this suggests that on Internet videos with sufficient spoken language
present, our method for zero-shot video-to-text retrieval can nearly match the finetuned-SOTA method for video-to-text retrieval.

Note that on text-to-video retrieval (rather than video-to-text retrieval), a recent method (Li et al. 2022) has shown strong zero-shot results. Other methods have also attempted zero-shot MSR-VTT text-to-video retrieval (Xu et al. 2021; Miech et al. 2020; Bain et al. 2021), but these have all been outperformed by Portillo-Quintero et al. (2021). Our method may be adapted as well to text-to-video, but due to our use of transcripts on only a subset of the videos, unlike in video-to-text, this creates an asymmetry which may require an unwieldy relative weighting for ranking videos with or without transcripts.

Also note that (Tab. 1), prior to the CLIP revolution in video-to-text retrieval, using the audio modality was not uncommon amongst competitive video-to-text retrieval methods (Mithun et al. 2018; Liu et al. 2019a). The trend over the past year, however, has been to instead focus on using only visual features, with all recent competitive methods being based off of CLIP, and not using audio data. Our approach, through leveraging commonsense reasoning stored in the LMs, is able to once again allow audio data to enable progress in this common video understanding task, beyond what CLIP alone can provide.

6 Unsupervised Socratic Model Selection

The combination of complementary models, in which one may compensate for the weaknesses of the other, opens an interesting avenue for unsupervised evaluation of model performance. Since our metric of interest is the combined performance of, e.g., a VLM and an LM, rather than asking the question '(A): how well does this VLM perform in absolute terms?', we can instead ask '(B): how well does this VLM compensate for the weaknesses of the LM?'.

(Strope et al. 2011) proposes a scheme which does so without requiring any evaluation ground truth. They also find that asking question (B) correlates well with answers to question (A), and is useful e.g., for model selection. The method assumes you have access to a weak (wLM) and a strong (sLM) LM (respectively VLM if evaluating the LM's performance). Asking "how well does this VLM compensate for the weaknesses of the LM" is equivalent to asking: "if we have a collection of VLMs, and we combine them with a weak LM, which model is going to perform the closest to the combination of the VLM with a strong LM?" If a VLM combined with a weak LM, instead of a strong one, makes up for the LM's shortcomings and still performs well in combination, then it may serve as a better component in the context of this combined system.

The benefit of this approach, while not entirely making up for doing absolute evaluations against a ground truth, is that because it only measures relative distance between model outputs, it can be performed unsupervised without annotated data: the distance between the output of the weak and strong combination can be measured using measures of semantic distance, for instance here by scoring them against a distinct, held-out language model.

As an example of using this approach, we extend the method in (Strope et al. 2011) to Socratic Models on egocentric perception, where we show it is possible to quantify the mutual dependence between foundation models without ground truth data. Specifically, to evaluate a new VLM (VLM') for generating language-based world-state history, we first use a baseline VLM paired with the strong LM (sLM) to generate pseudo ground truth predictions VLM×sLM. We then take both the baseline VLM and the new VLM VLM', and pair them with a weak LM wLM to generate predictions VLM×wLM and VLM'×wLM respectively. We score these predictions (per image summary) against the pseudo ground truth VLM×sLM. Since the outputs are linguistic, we can measure the similarity of a given prediction to the ground truth by comparing their sentence embeddings produced by another language model, e.g., RoBERTa (Liu et al. 2019b). It is important to use a distinct LM for scoring to avoid spurious correlations with the models under evaluation.

VLM (CLIP) Variants + Weak LM   | RN50x4 | RN50x16 | ViT-B/32 | ViT-B/16 | ViT-L/14
Truth model: GPT-3 + ViT-B/16   | 0.628  | 0.646   | 0.686    | 0.861    | 0.704
Truth model: GPT-3 + RN50x16    | 0.667  | 0.851   | 0.689    | 0.655    | 0.704
ImageNet Accuracy               | 65.8   | 70.5    | 63.2     | 68.6     | 76.2
Size (# params)                 | 178M   | 291M    | 151M     | 150M     | 427M

Tab. 3: Unsupervised evaluation (higher is better) of various VLMs by pairing them with a weak LM and comparing outputs to a VLM paired with a strong LM, which provides relative 'truth gradients' that inform how well the VLMs can compensate for the weak LM. These results suggest that better VLMs (measured by zero-shot ImageNet classification accuracies) can improve Socratic synergies.

Tab. 3 shows example results of this analysis with GPT-3 "Davinci" as the sLM, and GPT-3 "Curie" as the wLM, to compare VLM (i.e., CLIP) variants with different backbones: vision transformers (ViT) (Dosovitskiy et al. 2020) and ResNets (RN50) (He et al. 2016) with different model sizes. We find that this method can capture an ascending performance curve with increasingly better VLMs (e.g., better variants of CLIP (Radford et al. 2021)), as measured by zero-shot image classification accuracy on ImageNet (Deng et al. 2009) – with correlation coefficients of 0.41 and 0.46 between ImageNet accuracies and mean similarity to the truth models via ViT-B/16 and RN50x16 respectively. We find that with our SM system for egocentric perception (and in contrast to the original setting in (Strope et al. 2011)), it is necessary to use a third baseline VLM (bVLM) to generate the pseudo ground truth bVLM×sLM, instead of VLM×sLM. This is because the SM combinations that use the same VLM as the one that generates the ground truth are biased to produce similar visual grounding results and can exhibit an unfair advantage during the comparisons. Those numbers in our tests have been grayed out in Tab. 3.
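The comparison protocol above reduces to a few embedding similarities. The sketch below assumes the per-image summaries from each VLM×LM pairing have already been generated, and uses a sentence-transformers model as the distinct scoring LM; the function names are ours, not the paper's.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

scorer = SentenceTransformer("all-roberta-large-v1")  # distinct LM used only for scoring

def mean_similarity(predictions, ground_truth):
    # predictions, ground_truth: aligned lists of per-image text summaries.
    a = scorer.encode(predictions, normalize_embeddings=True)
    b = scorer.encode(ground_truth, normalize_embeddings=True)
    return float(np.mean(np.sum(a * b, axis=1)))

def rank_candidate_vlms(candidate_outputs, pseudo_truth):
    """candidate_outputs: {vlm_name: summaries produced with the *weak* LM}.
       pseudo_truth: summaries produced by a separate baseline VLM + *strong* LM."""
    scores = {name: mean_similarity(outs, pseudo_truth)
              for name, outs in candidate_outputs.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])  # best-compensating VLM first
```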
7 Related Work

Multi-model multimodal reasoning. In the context of transfer learning (e.g., via fine-tuning), pre-trained foundation models have achieved strong results when combined and trained together for a number of downstream multimodal (Ngiam et al. 2011) applications, including VLMs with LMs for image captioning (e.g., CLIP with GPT-2) (Mokady, Hertz, and Bermano 2021), video understanding (e.g., CLIP with BERT (Gao et al. 2021)), visual question answering e.g., (Song et al. 2022a), and ALMs and LMs for speech and text modeling e.g., (Song et al. 2022b; Bapna et al. 2022). These systems are often finetuned on task-specific data, and while this paradigm is likely to be preferred in domains for which data is abundant, our initial results suggest that SMs can be a strong zero-shot alternative for applications in which data is less available or more expensive to obtain, e.g., egocentric perception and robotics.

The notion as well of "Mixture-of-Experts" ((Jordan and Jacobs 1994), see (Masoudnia and Ebrahimpour 2014) for a review) is a common paradigm for combining the outputs of multiple models, and specifically mixtures of experts across multimodal domains including vision and audio (Liu et al. 2019a) have been studied – note that results from Liu et al. (2019a) are included in Table 1. Investigating further these techniques in the context of recent pretrained foundation models may be a promising direction for future work. Our work may be interpreted as a particular extension of Mixture-of-Experts in which experts may be composed to provide feedback to each other, closed-loop, via the common representation of language.

Egocentric perception continues to be an important problem in computer vision. Early work in the area explores hand-designed first-person visual features for egocentric action recognition, object understanding, and video summarization. This includes ego-motion (e.g., optical flows) (Kitani et al. 2011; Ryoo and Matthies 2013) as well as features from human gaze, hands, and objects (Spriggs, De La Torre, and Hebert 2009; Lee, Ghosh, and Grauman 2012; Fathi, Farhadi, and Rehg 2011; Pirsiavash and Ramanan 2012; Li and Kitani 2013; Lee and Grauman 2015). Focusing on hand-designed features was common in early egocentric vision research, as the availability of data (or videos in general) was very limited. More recent approaches in egocentric perception leverage learned feature representations, utilizing pre-trained convolutional network features (Ryoo, Rothrock, and Matthies 2015), finetuning them (Ma, Fan, and Kitani 2016; Zellers et al. 2022), or training them from scratch (Bambach et al. 2015) with first-person videos. Similar to the topics explored in early work, learning of visual representations capturing human hands, objects, and eye gaze has been extensively studied (Garcia-Hernando et al. 2018; Li, Liu, and Rehg 2018). (Kazakos et al. 2019) learns multimodal embeddings (i.e., video + audio), and (Furnari and Farinella 2019) studies future action anticipation from egocentric videos. Lack of sufficient data, however, consistently remains a bottleneck – motivating researchers to construct new larger-scale egocentric video datasets including EPIC-Kitchens (Damen et al. 2018), Charades-Ego (Sigurdsson et al. 2018), and Ego4D (Grauman et al. 2021).

8 Discussion

Socratic Models are a class of systems that leverage structured dialogue between multiple language-based foundation models to make joint predictions for new multimodal tasks. SMs leverage the commonsense knowledge already stored within foundation models pretrained on different domains of data (e.g., text-to-text, text-to-images, text-to-audio), which may include for example Internet-scale data. Our shown systems for egocentric perception, image captioning, and video-to-text retrieval are just examples of the SMs framework, and may shed light on new opportunities to build simple systems that adapt foundation models to (i) capture new multimodal functionalities zero-shot without having to rely on additional domain-specific data collection or model finetuning, and (ii) do so while retaining their robustness to distribution shifts (which is known to deteriorate after finetuning) (Wortsman et al. 2021).

SMs present a language-based approach to combining the outputs of multiple foundation models, which differs from a classical Bayesian approach where one model is used as a prior and the other as evidence. Relying on language-only multi-model discussion carries both pros and cons. For example, the intermediate outputs of the models may be more interpretable, but are treated as "truth" between models – i.e., not weighing them against the other's priors or evidence, which can lead to more divergent model interactions.

In the context of egocentric perception, we find that formulating video Q&A as reading comprehension in SMs directly leverages the extent to which large LMs are capable of logical reasoning by connecting commonsense relationships with knowledge learned from Internet-scale data. For example, the system returns the following answer when presented with the world-state history log:

    8:00 AM: went to grocery store to buy orange juice, chocolate, and bread.
    8:15 AM: I went to gas station to fill up the vehicle tank.
    8:30 AM: drove back home and left the groceries in the kitchen.
    8:45 AM: started cooking eggs in the pan.
    9:00 AM: the dog went into the kitchen.
    9:15 AM: took the dog out for a walk.
    9:30 AM: the dog is sick.
    Q: Why is the dog sick? A: The dog may have eaten something it was not supposed to, such as chocolate.

Arriving at the answer requires bridging multiple connections between observations, e.g., that the dog went into the kitchen, that the groceries are still in the kitchen, and that the groceries contain chocolate. Such results offer a glimpse of what might be possible using SMs for deductive reasoning across multiple domains of information, and raise interesting research questions on (i) how to better assemble language-based world-state histories (beyond what is presented in this work) that capture relevant evidence to
improve the accuracy of conclusions, and (ii) how to elicit chain of thought prompting (Wei et al. 2022) to decompose multi-step problems into intermediate ones. For example, one promising extension could be prompting the LM with chain of thought sequences to expand on hypotheses:

    Q: What are reasons for why I might be chopping wood?
    A: Reasons might include: needing firewood, wanting to make a statement, or needing the exercise.

to which each hypothesis can be progressively explored by downstream subprograms called at recursively higher resolutions until a conclusion is reached. These directions suggest pathways towards achieving increasingly meaningful utility and analysis by digital multimodal assistants.
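One way such a recursive expansion could be organized is sketched below. This is a speculative illustration of the idea above, not a system from the paper: `ask_lm` is a hypothetical completion wrapper, the hypothesis parsing is deliberately naive, and recursion is bounded by a fixed depth rather than a learned stopping criterion.

```python
def expand_hypotheses(question, ask_lm, depth=2):
    # Ask the LM for candidate hypotheses, then recursively probe each one
    # with a follow-up question until the depth limit is reached.
    answer = ask_lm(f"Q: {question}\nA: Reasons might include:")
    hypotheses = [h.strip() for h in answer.split(",") if h.strip()]  # naive parse
    if depth == 0:
        return {question: hypotheses}
    return {question: [expand_hypotheses(
                f"How could I check whether the reason is '{h}'?", ask_lm, depth - 1)
            for h in hypotheses]}

# Example (with a user-supplied LM wrapper):
# tree = expand_hypotheses("What are reasons for why I might be chopping wood?", ask_lm)
```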
8.1 Broader Impacts

Socratic Models offer a new perspective that encourages building AI systems using off-the-shelf language-interactable foundation models without additional data collection or model finetuning. This leads to several practical benefits, new applications, and risks as well. For one, SMs provide an interpretable window, through language, into the behavior of the systems (even for non-experts). Further, the barrier to entry for this technology is small: SMs can be engineered to capture new functionalities with minimal compute resources. No model training was used to create any demonstrated results. This can be enabling, but also raises potential risks, since it increases the flexibility of unintended end-use applications, and should be carefully monitored over time. We welcome broad discussion on how to maximize the potential positive impacts (enabling broad, new multimodal applications, with minimal new resources) while minimizing the capabilities of bad actors.

Regarding the impact on energy and other resource consumption for machine learning, this work may help pave a path for new, capable machine learning models to be composed with minimal training resource consumption, provided that large foundational pretrained models are available. This may help provide an answer for how large pretrained models may be retargeted to a wide variety of multimodal applications, without considerable additional compute resources required. Since SMs help demonstrate how a wide variety of applications may be addressed with fixed (pretrained) models zero-shot, this may also help foster adoption of new machine learning accelerators (e.g., fixed analog circuitry (Reuther et al. 2020), optical diffraction (Lin et al. 2018)) for inference with substantially lower power consumption and more compact form factors.

We are excited about opportunities as well in downstream applications. For example, SMs suggest promising research directions for data-driven learning in robotics, where the various modules within a robot system (e.g., planning (Ahn et al. 2022; Huang et al. 2022), perception (Shridhar, Manuelli, and Fox 2022)) can be replaced with zero-shot foundation models imbued with commonsense priors across domains. These ideas may give rise to a new class of robot systems where, by grounding affordances (Zeng 2019) on language, control algorithms can begin to tap into the capabilities of models trained on Internet-scale data, and to tackle applications that have traditionally been data-scarce.

9 Acknowledgements

We thank Debidatta Dwibedi and Matthew O'Kelly for excellent feedback on improving this manuscript, Anelia Angelova, Jean-Jacques Slotine, Jonathan Tompson, Maria Attarian, Shuran Song for fruitful technical discussions, Kan Huang for applications support, Ahmed Omran, Aren Jensen, Malcolm Slaney, Karolis Misiunas for advice on audio models, and Cody Wanner for YouTube videos.

References

Abuzaid, F.; Sethi, G.; Bailis, P.; and Zaharia, M. 2019. To index or not to index: Optimizing exact maximum inner product search. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019, 1250–1261. IEEE.
Agarwal, P.; Betancourt, A.; Panagiotou, V.; and Díaz-Rodríguez, N. 2020. Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models. arXiv preprint arXiv:2003.11743.
Ahn, M.; Brohan, A.; Brown, N.; Chebotar, Y.; Cortes, O.; David, B.; Finn, C.; Gopalakrishnan, K.; Hausman, K.; Herzog, A.; Hsu, J.; Ibarz, J.; Ichter, B.; Irpan, A.; Jang, E.; Ruano, R. J.; Jeffrey, K.; Jesmonth, S.; Joshi, N.; Julian, R.; Kalashnikov, D.; Kuang, Y.; Lee, K.-H.; Levine, S.; Lu, Y.; Luu, L.; Parada, C.; Pastor, P.; Quiambao, J.; Rao, K.; Rettinghouse, J.; Reyes, D.; Sermanet, P.; Sievers, N.; Tan, C.; Toshev, A.; Vanhoucke, V.; Xia, F.; Xiao, T.; Xu, P.; Xu, S.; and Yan, M. 2022. Do as I can and not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2022.00000.
Apostolidis, E.; Adamantidou, E.; Metsai, A. I.; Mezaris, V.; and Patras, I. 2021. Video summarization using deep neural networks: A survey. Proceedings of the IEEE 109(11):1838–1863.
Bain, M.; Nagrani, A.; Varol, G.; and Zisserman, A. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1728–1738.
Bambach, S.; Lee, S.; Crandall, D. J.; and Yu, C. 2015. Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1949–1957.
Bapna, A.; Cherry, C.; Zhang, Y.; Jia, Y.; Johnson, M.; Cheng, Y.; Khanuja, S.; Riesa, J.; and Conneau, A. 2022. mSLAM: Massively multilingual joint pre-training for speech and text. arXiv preprint arXiv:2202.01374.
Barbieri, M.; Agnihotri, L.; and Dimitrova, N. 2003. Video summarization: methods and landscape. In Internet Multimedia Management Systems IV, volume 5242, 1–13. International Society for Optics and Photonics.
Bommasani, R.; Hudson, D. A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M. S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33:1877–1901.
Chen, H.; Xie, W.; Vedaldi, A.; and Zisserman, A. 2020. Vg- Garcia-Hernando, G.; Yuan, S.; Baek, S.; and Kim, T.-K. 2018.
gsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 First-person hand action benchmark with rgb-d videos and 3d hand
IEEE International Conference on Acoustics, Speech and Signal pose annotations. In Proceedings of the IEEE conference on com-
Processing (ICASSP), 721–725. IEEE. puter vision and pattern recognition (CVPR), 409–419.
Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H. P. d. O.; Kaplan, Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.;
J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. 2021. Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. 2021.
Evaluating large language models trained on code. arXiv preprint Ego4d: Around the world in 3,000 hours of egocentric video. arXiv
arXiv:2107.03374. preprint arXiv:2110.07058.
Cheng, X.; Lin, H.; Wu, X.; Yang, F.; and Shen, D. 2021. Improv- Gu, X.; Lin, T.-Y.; Kuo, W.; and Cui, Y. 2021. Open-vocabulary ob-
ing video-text retrieval by multi-stream corpus alignment and dual ject detection via vision and language knowledge distillation. arXiv
softmax loss. arXiv preprint arXiv:2109.04290. preprint arXiv:2104.13921.
Choromanski, K.; Chen, H.; Lin, H.; Ma, Y.; Sehanobish, A.; Jain, He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learn-
D.; Ryoo, M. S.; Varley, J.; Zeng, A.; Likhosherstov, V.; Kalash- ing for image recognition. In Proceedings of the IEEE conference
nikov, D.; Sindhwani, V.; and Weller, A. 2021a. Hybrid random on computer vision and pattern recognition, 770–778.
features. to appear in ICLR 2022 abs/2110.04367. He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask r-cnn.
Choromanski, K. M.; Likhosherstov, V.; Dohan, D.; Song, X.; In Proceedings of the IEEE international conference on computer
Gane, A.; Sarlós, T.; Hawkins, P.; Davis, J. Q.; Mohiuddin, A.; vision, 2961–2969.
Kaiser, L.; Belanger, D. B.; Colwell, L. J.; and Weller, A. 2021b. Hoai, M., and De la Torre, F. 2014. Max-margin early event detec-
Rethinking attention with performers. In 9th International Con- tors. International Journal of Computer Vision 107(2):191–202.
ference on Learning Representations, ICLR 2021, Virtual Event, Hu, R., and Singh, A. 2021. Transformer is all you need: Multi-
Austria, May 3-7, 2021. OpenReview.net. modal multitask learning with a unified transformer. arXiv e-prints
Damen, D.; Doughty, H.; Farinella, G. M.; Fidler, S.; Furnari, A.; arXiv–2102.
Kazakos, E.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. Huang, W.; Abbeel, P.; Pathak, D.; and Mordatch, I. 2022. Lan-
2018. Scaling egocentric vision: The epic-kitchens dataset. In Pro- guage models as zero-shot planners: Extracting actionable knowl-
ceedings of the European Conference on Computer Vision (ECCV), edge for embodied agents. arXiv preprint arXiv:2201.07207.
720–736.
Izadi, S.; Kim, D.; Hilliges, O.; Molyneaux, D.; Newcombe, R.;
Damen, D.; Doughty, H.; Farinella, G. M.; Furnari, A.; Kaza- Kohli, P.; Shotton, J.; Hodges, S.; Freeman, D.; Davison, A.; et al.
kos, E.; Ma, J.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, 2011. Kinectfusion: real-time 3d reconstruction and interaction us-
W.; et al. 2020. Rescaling egocentric vision. arXiv preprint ing a moving depth camera. In Proceedings of the 24th annual
arXiv:2006.13256. ACM symposium on User interface software and technology, 559–
Del Molino, A. G.; Tan, C.; Lim, J.-H.; and Tan, A.-H. 2016. Sum- 568.
marization of egocentric videos: A comprehensive survey. IEEE Jain, A.; Guo, M.; Srinivasan, K.; Chen, T.; Kudugunta, S.; Jia, C.;
Transactions on Human-Machine Systems 47(1):65–76. Yang, Y.; and Baldridge, J. 2021. Mural: multimodal, multitask
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. retrieval across languages. arXiv preprint arXiv:2109.05125.
2009. Imagenet: A large-scale hierarchical image database. In Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le,
2009 IEEE conference on computer vision and pattern recognition, Q.; Sung, Y.-H.; Li, Z.; and Duerig, T. 2021. Scaling up visual
248–255. Ieee. and vision-language representation learning with noisy text super-
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: vision. In International Conference on Machine Learning, 4904–
Pre-training of deep bidirectional transformers for language under- 4916. PMLR.
standing. arXiv preprint arXiv:1810.04805. Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; experts and the em algorithm. Neural computation 6(2):181–214.
Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, Kazakos, E.; Nagrani, A.; Zisserman, A.; and Damen, D. 2019.
G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Epic-fusion: Audio-visual temporal binding for egocentric action
Transformers for image recognition at scale. arXiv preprint recognition. In Proceedings of the IEEE/CVF International Con-
arXiv:2010.11929. ference on Computer Vision (ICCV), 5492–5501.
Fan, C.; Zhang, Z.; and Crandall, D. J. 2018. Deepdiary: Lifel- Kitani, K. M.; Okabe, T.; Sato, Y.; and Sugimoto, A. 2011. Fast
ogging image captioning and summarization. Journal of Visual unsupervised ego-action learning for first-person sports videos. In
Communication and Image Representation 55:40–55. Proceedings of the IEEE Conference on Computer Vision and Pat-
Fang, H.; Xiong, P.; Xu, L.; and Chen, Y. 2021. Clip2video: tern Recognition (CVPR), 3241–3248.
Mastering video-text retrieval via image clip. arXiv preprint Kitani, K. M.; Ziebart, B. D.; Bagnell, J. A.; and Hebert, M. 2012.
arXiv:2106.11097. Activity forecasting. In European Conference on Computer Vision
Fathi, A.; Farhadi, A.; and Rehg, J. M. 2011. Understanding ego- (ECCV), 201–214.
centric activities. In Proceedings of the IEEE International Con- Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-
ference on Computer Vision (ICCV), 407–414. Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al.
Furnari, A., and Farinella, G. M. 2019. What would you ex- 2020. The open images dataset v4. International Journal of Com-
pect? anticipating egocentric actions with rolling-unrolling lstms puter Vision 128(7):1956–1981.
and modality attention. In Proceedings of the IEEE/CVF Interna- Lee, Y. J., and Grauman, K. 2015. Predicting important objects for
tional Conference on Computer Vision (ICCV), 6252–6261. egocentric video summarization. International Journal of Com-
Gao, Z.; Liu, J.; Chen, S.; Chang, D.; Zhang, H.; and Yuan, J. puter Vision 114(1):38–55.
2021. Clip2tv: An empirical study on transformer-based methods Lee, Y. J.; Ghosh, J.; and Grauman, K. 2012. Discovering im-
for video-text retrieval. arXiv preprint arXiv:2111.05610. portant people and objects for egocentric video summarization.
In IEEE conference on computer vision and pattern recognition Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C. L.;
(CVPR), 1346–1353. Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al.
Lei, J.; Yu, L.; Berg, T. L.; and Bansal, M. 2020. What is more 2022. Training language models to follow instructions with human
likely to happen next? video-and-language future event prediction. feedback. Preprint.
arXiv preprint arXiv:2010.07999. Patel, D.; Parikh, R.; and Shastri, Y. 2021. Recent advances in
Li, C., and Kitani, K. M. 2013. Pixel-level hand detection in ego- video question answering: A review of datasets and methods. In In-
centric videos. In Proceedings of the IEEE Conference on Com- ternational Conference on Pattern Recognition, 339–356. Springer.
puter Vision and Pattern Recognition (CVPR), 3570–3577. Patrick, M.; Huang, P.-Y.; Asano, Y.; Metze, F.; Hauptmann,
Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; and Hoi, S. A.; Henriques, J.; and Vedaldi, A. 2020. Support-set bot-
C. H. 2021a. Align before fuse: Vision and language represen- tlenecks for video-text representation learning. arXiv preprint
tation learning with momentum distillation. Advances in Neural arXiv:2010.02824.
Information Processing Systems 34. Petroni, F.; Rocktäschel, T.; Lewis, P.; Bakhtin, A.; Wu, Y.; Miller,
Li, Y.; Nagarajan, T.; Xiong, B.; and Grauman, K. 2021b. Ego- A. H.; and Riedel, S. 2019. Language models as knowledge bases?
exo: Transferring visual representations from third-person to first- arXiv preprint arXiv:1909.01066.
person videos. In Proceedings of the IEEE/CVF Conference on Pirsiavash, H., and Ramanan, D. 2012. Detecting activities of daily
Computer Vision and Pattern Recognition, 6943–6953. living in first-person camera views. In 2012 IEEE Conference on
Li, J.; Li, D.; Xiong, C.; and Hoi, S. 2022. Blip: Bootstrap- Computer Vision and Pattern Recognition (CVPR), 2847–2854.
ping language-image pre-training for unified vision-language un- Portillo-Quintero, J. A.; Ortiz-Bayliss, J. C.; and Terashima-Marı́n,
derstanding and generation. arXiv preprint arXiv:2201.12086. H. 2021. A straightforward framework for video retrieval us-
Li, Y.; Liu, M.; and Rehg, J. M. 2018. In the eye of beholder: Joint ing clip. In Mexican Conference on Pattern Recognition, 3–12.
learning of gaze and actions in first person video. In Proceedings Springer.
of the European conference on computer vision (ECCV), 619–635.
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agar-
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, wal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021.
D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common Learning transferable visual models from natural language super-
objects in context. In European conference on computer vision, vision. In International Conference on Machine Learning, 8748–
740–755. Springer. 8763. PMLR.
Lin, X.; Rivenson, Y.; Yardimci, N. T.; Veli, M.; Luo, Y.; Jarrahi, Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know what you
M.; and Ozcan, A. 2018. All-optical machine learning using don’t know: Unanswerable questions for squad. arXiv preprint
diffractive deep neural networks. Science 361(6406):1004–1008. arXiv:1806.03822.
Liu, Y.; Albanie, S.; Nagrani, A.; and Zisserman, A. 2019a. Use Ramsauer, H.; Schäfl, B.; Lehner, J.; Seidl, P.; Widrich, M.; Gru-
what you have: Video retrieval using representations from collabo- ber, L.; Holzleitner, M.; Adler, T.; Kreil, D. P.; Kopp, M. K.; Klam-
rative experts. BMVC. bauer, G.; Brandstetter, J.; and Hochreiter, S. 2021. Hopfield net-
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; works is all you need. In 9th International Conference on Learn-
Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019b. Roberta: ing Representations, ICLR 2021, Virtual Event, Austria, May 3-7,
A robustly optimized bert pretraining approach. arXiv preprint 2021. OpenReview.net.
arXiv:1907.11692. Rawat, A. S.; Chen, J.; Yu, F. X.; Suresh, A. T.; and Kumar, S.
Luo, H.; Ji, L.; Zhong, M.; Chen, Y.; Lei, W.; Duan, N.; and Li, T. 2019. Sampled softmax with random fourier features. In Wal-
2021. Clip4clip: An empirical study of clip for end to end video lach, H. M.; Larochelle, H.; Beygelzimer, A.; d’Alché-Buc, F.;
clip retrieval. arXiv preprint arXiv:2104.08860. Fox, E. B.; and Garnett, R., eds., Advances in Neural Information
Ma, M.; Fan, H.; and Kitani, K. M. 2016. Going deeper into first- Processing Systems 32: Annual Conference on Neural Information
person activity recognition. In Proceedings of the IEEE Conference Processing Systems 2019, NeurIPS 2019, December 8-14, 2019,
on Computer Vision and Pattern Recognition (CVPR), 1894–1903. Vancouver, BC, Canada, 13834–13844.
Masoudnia, S., and Ebrahimpour, R. 2014. Mixture of experts: a Reuther, A.; Michaleas, P.; Jones, M.; Gadepally, V.; Samsi, S.; and
literature survey. Artificial Intelligence Review 42(2):275–293. Kepner, J. 2020. Survey of machine learning accelerators. In 2020
Miech, A.; Alayrac, J.-B.; Smaira, L.; Laptev, I.; Sivic, J.; and IEEE high performance extreme computing conference (HPEC),
Zisserman, A. 2020. End-to-end learning of visual representa- 1–12. IEEE.
tions from uncurated instructional videos. In Proceedings of the Rhinehart, N., and Kitani, K. M. 2017. First-person activity fore-
IEEE/CVF Conference on Computer Vision and Pattern Recogni- casting with online inverse reinforcement learning. In Proceedings
tion, 9879–9889. of the IEEE International Conference on Computer Vision (ICCV),
Mithun, N. C.; Li, J.; Metze, F.; and Roy-Chowdhury, A. K. 2018. 3696–3705.
Learning joint embedding with multimodal cues for cross-modal Ryoo, M. S., and Matthies, L. 2013. First-person activity recog-
video-text retrieval. In Proceedings of the 2018 ACM on Interna- nition: What are they doing to me? In Proceedings of the IEEE
tional Conference on Multimedia Retrieval, 19–27. Conference on Computer Vision and Pattern Recognition (CVPR),
Mokady, R.; Hertz, A.; and Bermano, A. H. 2021. Clipcap: Clip 2730–2737.
prefix for image captioning. arXiv preprint arXiv:2111.09734. Ryoo, M. S.; Rothrock, B.; and Matthies, L. 2015. Pooled mo-
Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; and Ng, A. Y. tion features for first-person videos. In Proceedings of the IEEE
2011. Multimodal deep learning. In ICML. Conference on Computer Vision and Pattern Recognition (CVPR),
Oncescu, A.-M.; Koepke, A.; Henriques, J. F.; Akata, Z.; and Al- 896–904.
banie, S. 2021. Audio retrieval with natural language queries. Ryoo, M. S. 2011. Human activity prediction: Early recognition
arXiv preprint arXiv:2105.02192. of ongoing activities from streaming videos. In Proceedings of the
IEEE International Conference on Computer Vision (ICCV), 1036– Wu, B.; Chen, W.; Fan, Y.; Zhang, Y.; Hou, J.; Liu, J.; and Zhang, T.
1043. 2019. Tencent ml-images: A large-scale multi-label image database
Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Concep- for visual representation learning. IEEE Access 7:172683–172693.
tual captions: A cleaned, hypernymed, image alt-text dataset for Wu, H.-H.; Seetharaman, P.; Kumar, K.; and Bello, J. P. 2021a.
automatic image captioning. In Proceedings of the 56th Annual Wav2clip: Learning robust audio representations from clip. arXiv
Meeting of the Association for Computational Linguistics (Volume preprint arXiv:2110.11499.
1: Long Papers), 2556–2565. Wu, J.; Ouyang, L.; Ziegler, D. M.; Stiennon, N.; Lowe, R.; Leike,
Shridhar, M.; Manuelli, L.; and Fox, D. 2022. Cliport: What and J.; and Christiano, P. 2021b. Recursively summarizing books with
where pathways for robotic manipulation. In Conference on Robot human feedback. arXiv preprint arXiv:2109.10862.
Learning, 894–906. PMLR. Xu, J.; Mei, T.; Yao, T.; and Rui, Y. 2016. Msr-vtt: A large video
Shrivastava, A., and Li, P. 2014. Asymmetric LSH (ALSH) for description dataset for bridging video and language. In Proceed-
sublinear time maximum inner product search (MIPS). In Ghahra- ings of the IEEE conference on computer vision and pattern recog-
mani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Wein- nition, 5288–5296.
berger, K. Q., eds., Advances in Neural Information Processing Xu, H.; Ghosh, G.; Huang, P.-Y.; Okhonko, D.; Aghajanyan, A.;
Systems 27: Annual Conference on Neural Information Processing Metze, F.; Zettlemoyer, L.; and Feichtenhofer, C. 2021. Video-
Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, clip: Contrastive pre-training for zero-shot video-text understand-
2321–2329. ing. arXiv preprint arXiv:2109.14084.
Sigurdsson, G. A.; Gupta, A.; Schmid, C.; Farhadi, A.; and Alahari, Yang, Z.; Gan, Z.; Wang, J.; Hu, X.; Lu, Y.; Liu, Z.; and Wang, L.
K. 2018. Charades-ego: A large-scale dataset of paired third and 2021. An empirical study of gpt-3 for few-shot knowledge-based
first person videos. arXiv preprint arXiv:1804.09626. vqa. arXiv preprint arXiv:2109.05014.
Smaira, L.; Carreira, J.; Noland, E.; Clancy, E.; Wu, A.; and Zis- Yu, Y.; Kim, J.; and Kim, G. 2018. A joint sequence fusion model
serman, A. 2020. A short note on the kinetics-700-2020 human for video question answering and retrieval. In Proceedings of the
action dataset. arXiv preprint arXiv:2010.10864. European Conference on Computer Vision (ECCV), 471–487.
Song, H.; Dong, L.; Zhang, W.-N.; Liu, T.; and Wei, F. 2022a. Clip Zellers, R.; Lu, J.; Lu, X.; Yu, Y.; Zhao, Y.; Salehi, M.; Kusupati,
models are few-shot learners: Empirical studies on vqa and visual A.; Hessel, J.; Farhadi, A.; and Choi, Y. 2022. Merlot reserve:
entailment. arXiv preprint arXiv:2203.07190. Neural script knowledge through vision and language and sound.
arXiv preprint arXiv:2201.02639.
Song, Y.; Fan, X.; Yang, Y.; Ren, G.; and Pan, W. 2022b. Large
Zeng, A. 2019. Learning visual affordances for robotic manipula-
pretrained models on multimodal sentiment analysis. In Artificial
tion. Ph.D. Dissertation, Princeton University.
Intelligence in China. Springer. 506–513.
Zhai, X.; Wang, X.; Mustafa, B.; Steiner, A.; Keysers, D.;
Spriggs, E. H.; De La Torre, F.; and Hebert, M. 2009. Temporal Kolesnikov, A.; and Beyer, L. 2021. Lit: Zero-shot transfer with
segmentation and activity classification from first-person sensing. locked-image text tuning. arXiv preprint arXiv:2111.07991.
In 2009 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition Workshops (CVPRW), 17–24. Zhao, Y.; Hessel, J.; Yu, Y.; Lu, X.; Zellers, R.; and Choi, Y.
2021. Connecting the dots between audio and text without par-
Strope, B.; Beeferman, D.; Gruenstein, A.; and Lei, X. 2011. Un- allel data through visual knowledge transfer. arXiv preprint
supervised testing strategies for asr. arXiv:2112.08995.
Tancik, M.; Casser, V.; Yan, X.; Pradhan, S.; Mildenhall, B.; Srini- Zhou, B.; Khosla, A.; Lapedriza, A.; Torralba, A.; and Oliva, A.
vasan, P. P.; Barron, J. T.; and Kretzschmar, H. 2022. Block- 2016. Places: An image database for deep scene understanding.
nerf: Scalable large scene neural view synthesis. arXiv preprint arXiv preprint arXiv:1610.02055.
arXiv:2202.05263.
Thoppilan, R.; De Freitas, D.; Hall, J.; Shazeer, N.; Kulshreshtha,
A.; Cheng, H.-T.; Jin, A.; Bos, T.; Baker, L.; Du, Y.; et al. 2022.
Lamda: Language models for dialog applications. arXiv preprint
arXiv:2201.08239.
Vondrick, C.; Pirsiavash, H.; and Torralba, A. 2016. Anticipat-
ing visual representations from unlabeled video. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 98–106.
Wang, Z.; Yu, J.; Yu, A. W.; Dai, Z.; Tsvetkov, Y.; and Cao, Y.
2021. Simvlm: Simple visual language model pretraining with
weak supervision. arXiv preprint arXiv:2108.10904.
Wang, Q.; Zhang, Y.; Zheng, Y.; Pan, P.; and Hua, X.-S. 2022.
Disentangled representation learning for text-video retrieval. arXiv
preprint arXiv:2203.07111.
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Chi, E.; Le, Q.;
and Zhou, D. 2022. Chain of thought prompting elicits reasoning
in large language models. arXiv preprint arXiv:2201.11903.
Wortsman, M.; Ilharco, G.; Li, M.; Kim, J. W.; Hajishirzi, H.;
Farhadi, A.; Namkoong, H.; and Schmidt, L. 2021. Robust fine-
tuning of zero-shot models. arXiv preprint arXiv:2109.01903.
