Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari,
Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, Pete Florence
Google
Abstract
Large foundation models can exhibit unique capabilities
depending on the domain of data they are trained on.
While these domains are generic, they may only barely
Across a number of tasks spanning vision, language, and audio modalities, we find that specific instantiations of SMs, using LMs together with VLMs and audio-language models (ALMs), can generate results on challenging perceptual tasks (examples in Fig. 2) that are often coherent and correct. We present results on Internet image captioning (Sec. 4) and the common video understanding task of video-to-text retrieval (Sec. 5), but our highlighted application is open-ended reasoning in the context of egocentric perception (Fig. 4) – from answering free-form contextual reasoning questions about first-person videos (e.g. "why did I go to the front porch today?"), to forecasting events into the future with commonsense (e.g. "what will I do 3 hours from now?"). Our egocentric SM system consists of two primary components, each of which benefits from multimodal multi-model discussions: (i) assembling video into a language-based world-state history, i.e. a story or event log, then (ii) performing various types of open-ended text-prompted tasks based on that world-state history. We find that simple scripted policies to guide a closed-loop exchange between pre-trained LM, VLM, and ALM models can (a) generate meaningful captions that respond to questions like "what am I doing?" with answers like "receiving a package" that span beyond the label set of standard vision datasets (Sigurdsson et al. 2018; Smaira et al. 2020), and (b) exhibit open-ended contextual Q&A capabilities previously thought to be out-of-reach for egocentric perception without domain-specific data collection (Grauman et al. 2021; Damen et al. 2020).

The goal of this paper is (1) to discuss new perspectives on building AI systems that embrace the heterogeneity of language-based foundation models through structured Socratic dialogue, and (2) to give example demonstrations of what is already possible today with SMs on challenging perceptual tasks. Specifically, our contributions include (i) the Socratic Models framework, (ii) demonstration of an egocentric perception system using Socratic Models, (iii) qualitative results on video understanding (synthesizing video snippets from a full day of activity) that is not covered by existing benchmark datasets, (iv) qualitative comparisons to a state-of-the-art model (Mokady, Hertz, and Bermano 2021) on the task of single-image captioning in egocentric and Internet image domains, (v) quantitative comparisons to state-of-the-art video understanding models on the popular MSR-VTT (Xu et al. 2016; Yu, Kim, and Kim 2018) dataset for video-to-text retrieval, and (vi) a framework for unsupervised quantitative model selection of Socratic Models through sub-model ablations.

Overall, these ideas shed light on promising new opportunities to build simple systems for general applications that compose foundation models out-of-the-box. By construction, multimodal foundation models are likely to be trained on different distributions of Internet data, resulting in different test-time capabilities. These capabilities can be improved for a given target distribution through finetuning, but at the cost of generality and robustness to distribution shifts (Wortsman et al. 2021). SMs offer an alternative approach in which these capabilities can be integrated cooperatively, and in which we can make use of concepts in domain A that are more easily obtained from domain B, without complex alignment of the representations, i.e., through additional large-scale training across multiple domains. Instead, the common representation is language, and it may be used to compose existing models in a zero-shot manner.

Of course, our demonstrated SM systems are not without their limitations. We discuss these limitations, such as unreliability inherited from the foundation models on which they are constructed, together with other potential broader impacts (Sec. 8.1).
2 Socratic Models Framework
• Where am I? For place and scene context, we use a VLM to rank Places365 (Zhou et al. 2016) scene categories against the image, with the top n candidates (out of 365) inserted into a prefix: "Places: {place1}, {place2}, {place3}."

• What do I see? For object and people recognition, we use a VLM to rank OpenImages object categories (Kuznetsova et al. 2020) against the image, with the top m categories (out of 600) inserted into a second prefix: "Objects: {object1}, {object2}, {object3}."

• What am I doing? For activity recognition, we use a back-and-forth interaction between an LM and VLM: we first use an LM to infer the activities most related to the places and objects previously listed by the VLM (green):

Places: {place1}, {place2}, {place3}. Objects: {object1}, {object2}, {object3}. Activities: activity a, activity b, activity c.

We find that generating candidate activities using an LM yields more suitable descriptions of egocentric activities and interactions with first-person video than using standard activity recognition dataset categories (e.g., from Charades or Kinetics). Activity recognition datasets are often tailored to third-person videos, and can only cover a partial subset of human activities, which instead can be more holistically captured through LM reasoning (Petroni et al. 2019) over the objects and places that the VLM perceives. For example, "receiving a package" is a common household activity not found in most datasets. After the LM generates candidate activities, these candidates are then fed back to the VLM and re-ranked to sort out the top k activities by relevance to the key image frame: "Activities: {activity1}, {activity2}, {activity3}."

This process of generating candidate activities from places and objects is one way of extracting commonsense from LMs as knowledge bases (Petroni et al. 2019). Continuing the Socratic dialogue further, this can be repeated likewise to generate new relevant objects (conditioned on activities and places), as well as new places (conditioned on objects and activities). One can iterate the procedure (LM generate, VLM re-rank, repeat) to populate the set of places, objects, and activities until equilibrium (i.e., no more new entities), which generally helps to cover a broader set of places and objects that expand beyond the initial seed categories from Places365 and OpenImages. For example:

If I am making pancakes, objects that I am likely to see include: a frying pan, a spatula, a bowl, milk, eggs, flour, sugar, baking powder, butter, a plate, syrup.

Given the final set of places, objects, and activities, we use the LM to generate an overall first-person summary of what is happening in the image. Specifically, the prompt is:

I am in a place1, place2, place3. I see a object1, object2, object3. I am activity1. Question: What am I doing? Answer: I am most likely

The summarization process in general can capture richer descriptions conditioned on the places, objects, and activities, and qualitatively seems to do well at ignoring irrelevant categories (i.e., denoising). For example:

I am in a nursing home, landfill, living room. I see a wine, wine glass, woman. I am drinking wine. Question: What am I doing? Answer: I am most likely enjoying a glass of wine with a friend or loved one.

However, while the LM's denoising capabilities can compensate for the shortcomings of the VLM, it is important to note that this may also cause unwanted ignoring of notable but rare events (e.g., witnessing a purple unicorn, which may be ignored, but potentially it is Halloween). Finding new ways in which such events can be indexed appropriately may be useful for downstream applications.
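For concreteness, the following is a minimal Python sketch of this single-image Socratic loop, assuming the open-source CLIP package as the VLM; the complete() helper (a stand-in for a GPT-3-style completion endpoint) and the pre-loaded category lists are illustrative assumptions, and the iteration-to-equilibrium step is omitted:

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def rank_categories(image, categories, top_k):
    # Zero-shot rank a list of text categories against an image with CLIP.
    image_input = preprocess(image).unsqueeze(0).to(device)
    text_inputs = clip.tokenize([f"a photo of {c}" for c in categories]).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image_input)
        text_feat = model.encode_text(text_inputs)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (image_feat @ text_feat.T).squeeze(0)
    top = scores.topk(top_k).indices.tolist()
    return [categories[i] for i in top]

image = Image.open("key_frame.jpg")
places = rank_categories(image, PLACES365_CATEGORIES, top_k=3)    # assumes category lists are loaded
objects = rank_categories(image, OPENIMAGES_CATEGORIES, top_k=3)

# LM proposes candidate activities from the VLM-detected places and objects,
# then the VLM re-ranks the proposals against the image (one Socratic round).
prompt = f"Places: {', '.join(places)}. Objects: {', '.join(objects)}. Activities:"
candidates = [a.strip() for a in complete(prompt).split(",")]      # complete() is a hypothetical LM helper
activities = rank_categories(image, candidates, top_k=1)

summary_prompt = (f"I am in a {', '.join(places)}. I see a {', '.join(objects)}. "
                  f"I am {activities[0]}. Question: What am I doing? Answer: I am most likely")
summary = complete(summary_prompt)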
Egocentric Image Summary Results. On egocentric images, we show several qualitative examples of summaries generated by our system in Fig. 4, and compare them to results from a state-of-the-art image captioning model, ClipCap (Mokady, Hertz, and Bermano 2021). While state-of-the-art captioning models can perform reasonably over several of the images, we find that our system generally produces more relevant captions for a larger portion of the egocentric examples. Image captioning models are biased based on the datasets they are trained on, and have been shown to perform poorly on egocentric images (Agarwal et al. 2020), which aligns with our observations. Relatively less research has been carried out specifically on egocentric image captioning (Fan, Zhang, and Crandall 2018). SMs can nevertheless produce reasonable captions without additional training on domain-specific data.

3.3-B. Adding Audio into Single-moment Summaries. In addition to using visual perceptual inputs, we may use a Socratic approach which engages perceptual inputs from audio as well, via an ALM (audio language model). Our example egocentric perception system uses Wav2CLIP (Wu et al. 2021a) as the ALM. Wav2CLIP is trained on 5-second audio clips from the VGGSound dataset (Chen et al. 2020), and is trained in a contrastive manner by aligning its audio encoder to the visual CLIP embeddings from video.

Incorporating an ALM like Wav2CLIP into our Socratic framework can provide an additional modality with which to perform zero-shot cross-modal reasoning, and this may help further improve inference beyond the vision-language-only case. Fig. 6 displays a driving example for which a visual-only summarization produced the less-than-desirable summary "I am climbing a staircase, and I may see a hamster or human leg", with the incorrect propagation of the false detection of a hamster and human leg.

Fig. 6: Example frame and corresponding (centered) 5-second audio clip which provide the driving example for Sec. 3.3-B, i.e., adding ALMs into Socratic dialogue to improve single-moment summarization. Note that this waveform mostly represents the background piano music, but the system is still able to correctly rank footsteps as the highest sound relative to others in the LM-suggested candidate set.

To perform audio-aided single-moment summarization, we first run image-based summarization as described previously, but we then prompt the LM to suggest sounds that it may hear, given the visual context, via "⟨visual single-image summary⟩. 5 Possible Sounds:". For the example in Fig. 6, an example prompt, which has already gone through multiple rounds of Socratic dialogue to be generated, together with completion by the LM, is:

Places: staircase. Objects: stairs, animal, mammal, hamster, human leg. Activities: climbing. 5 Possible Sounds: footsteps, creaking stairs, someone calling your name, a dog barking, a centipede crawling.

These auditory entities expressed in language can then be ranked by the ALM. In this moment of the video, the sound of footsteps can be faintly heard in the background, and in this case the ALM provides a correct detection by ranking footsteps as the most likely sound. This ranking can then be incorporated into a prompt for the LM to provide the single-image summary, for example:

I am in a: {place}. I see a: {object1}, {object2}, {object3}, {object4}, {object5}. I think I hear {sound1}. I am: {activity}. Summary: I am most likely

As above, incorporating "I think I hear footsteps" into the summary and prompting this to the LM provides the completion: "climbing a staircase, and I may hear footsteps." In this case, this summary result is preferable to the mentioned single-image caption without sound.

While this example demonstrates in a certain case the utility of audio-informed summaries, overall in egocentric video, with a variety of background noise, we find that Wav2CLIP can provide reasonable detections for certain language-represented auditory entities such as 'baby babbling' and entities to do with running water, but does not provide as robust detections as CLIP. Also, while there are many advantages to the specific Wav2CLIP approach, including its use of the CLIP embedding space, a major downside is that the training process is "blind" to hearing things that cannot be seen. Accordingly, for the rest of the demonstrations shown, we simply build world-state history from VLM-LM interactions alone. We expect however that with further attention to model approaches, and scaling of audio-language datasets, approaches like Wav2CLIP will increase in robustness. We also show an additional application (Sec. 3.4) of audio, for audio retrieval. In that case, only a single auditory search entity is required in order to enable a useful application, and so it can be easier to verify that it is a sufficiently robustly-detected entity.

3.3-C. Compiling a Language-Based World-State History

Our system compiles the image summaries from each key video frame into a language-based world-state history. Since the total number of frames in the video may be large, compiling a summary for every individual frame would create text that is too large (too many tokens) to be processed directly by an LM as context for Q&A. Accordingly in this work, we propose solutions that sparsify and/or condense language-based world-state histories (e.g., via search-based methods)
into practically usable context sizes for reasoning. In particular, we explore two methods of identifying "key moments" in videos for summarization: (i) uniform sampling over time, and (ii) video search (image or audio retrieval) for on-the-fly compilation of context.

The first method, uniform sampling, is straightforward and compiles a world-state history from Socratic summaries of video frames sampled at fixed time intervals. This can also be condensed hierarchically using recursive linguistic summarization (Wu et al. 2021b), to fit even dense sampling into usable LM-context sizes. However, while broadly indiscriminate, uniform sampling may not have sufficient temporal resolution to capture important spontaneous events in the video (such as adding salt to the pot while cooking soup in the kitchen).

Hence the second method, identifying key moments with video search, uses a VLM or ALM to search for entities most relevant to the question, which can more precisely index the frames in which the subject appears. Specifically, our instantiation of SMs for this component parses a natural language question with an LM into several search entities to be used to find key frames in the video. For example, the question "did I drink coffee today?" yields a search entity "drink coffee" that is then used with language-conditioned video search to index the most relevant n key frames of "drink coffee" in the video. The LM categorizes the search, which can be image-based (VLMs) or audio-based (ALMs), e.g., for language-conditioned auditory recall questions (Oncescu et al. 2021) like "why was my wife laughing today?". While search-based indexing of key moments can be useful for finding spontaneous events, this method for generating context can also present disadvantages for downstream Q&A if the answer to the question depends on events that are not directly related to the search subject. For example, "why was I chopping wood today?" returns key frames related to "chopping wood", but does not return the key frames after the event related to making a campfire. On the other hand, if uniform sampling is employed and the campfire events are captured by the summary, then the LM can successfully return the answer "I was making a campfire." Choosing which method to use for compiling the language-based world-state history may depend on the application.

Language-based World-state History Results. Fig. 4, middle, shows results generated by our system. The specific event log shown in Fig. 4 has been trimmed down for space considerations, but is representative of the type of event logs that may be generated without manual curation. These event logs are used as context to enable LM open-ended reasoning on video, as demonstrated in the next section.

3.4 Open-Ended Reasoning on Egocentric Video

In this section we describe a few examples of how the Socratic Models framework can be used to perform open-ended multimodal-informed completion of text prompts, conditioned on egocentric video (examples in Fig. 2). There are of course limitations to what they can provide, but our demonstrated examples suggest that we can already today often generate compelling answers to open-ended reasoning tasks, at a scope that is beyond what we are aware is possible today with available methods. Of course, the answers may also inherit undesirable characteristics from the component models, such as an LM that is overconfident even when wrong. It is our hope that our results may help inspire work on preparing even more comprehensive video understanding datasets for the community, to assist further assessment.

Our example system uses a language-based world-state history generated through Socratic multi-model discussion (Sec. 3.3), and provides this as context to an LM to enable open-ended reasoning on egocentric videos. Open-ended text prompts from a user, conditioned on an egocentric video, can yield three types of responses: a text-based response, a visual result, and/or an audio clip. These latter two open up the capabilities of the system to respond not only with text, but also with video snippets themselves, which may be a higher-bandwidth way to respond to user requests ("a picture is worth a thousand words"). The specific composition of our system is of course just one example – overall, the modularity of the Socratic approach makes it easy to compose together foundation models, zero-shot, in a variety of ways to provide a spectrum of multimodal reasoning capabilities.

The demonstrated tasks include (i) summarization, (ii) open-ended Q&A, (iii) forecasting, (iv) corrections, and (v) video search for either visual or audio cues. These tasks have predominantly been studied in isolation in the research community – but our example results with SMs suggest they can be subsumed under the same unified language-based system for multimodal reasoning. Descriptions and results for each of (i)-(v) are shown below.

(i) Summarization can be implemented by prompting an LM to complete the excerpt "{world-state history} Summary of my day:" to which it can respond with outputs like "I slept in a bed, made coffee, watched TV, did laundry, received a package, bench pressed, showered, ate a sandwich, worked on a computer, and drank wine." Since the language-based world-state history is constructed with summaries of visual content, it carries contextual information that can be complementary to what is found in closed captions (e.g., speech and dialogue, explored in Sec. 5). Summarizing egocentric videos enables a number of applications, including augmenting human memory to recall events, or life-logging of daily activities for caregiver assistance. Our system draws similarity to early work in the area involving text-based summarization and identifying key frames (see (Barbieri, Agnihotri, and Dimitrova 2003) for an early survey and (Del Molino et al. 2016; Apostolidis et al. 2021) for more recent surveys).

(ii) Open-ended Q&A can be implemented by prompting the LM to complete the template: "{world-state history} Q: {question} A:". We find that LMs such as GPT-3 can generate surprisingly meaningful results to binary yes or no questions, contextual reasoning questions, as well as temporal reasoning questions. As in (Yang et al. 2021) we can further
prompt the LM to explain the answer by adding "This is because:". We find that the accuracy of the answers and explanations remains largely conditioned on whether the necessary information can be found within the world-state history. This suggests that the quality of the language-based reconstructions of the videos (e.g., via key frame sampling and captioning in this work) is central to the approach.

We show several qualitative examples of free-form question answering using our SM system on egocentric video in Fig. 4, bottom, Fig. 5, and Fig. 7, generated using a first-person POV video as input (examples on https://youtu.be/-UXKmqBPk1w, used with permission from Cody Wanner).

Fig. 7: SMs can interface with the user through dialogue and perform a variety of tasks (formulated as Q&A) with egocentric video: sorting reasoning questions by their output modalities, e.g., text-based responses, images from visual search, video snippets from audio search. Depending on the modality, each question can pass through a different sequence of Socratic interactions between the LM, VLM, and ALM.

Recall Questions. SMs can perform simple retrieval of events. For example, "did I eat dinner today?" yields a response "yes I ate dinner today." along with an explanation "I was seen eating a sandwich in a kitchen at 5:27 PM." which points to the key frame that was captioned with the sandwich in hand. Another example that involves contextual reasoning to recall events is "what was I doing outdoors?" to which the system responds "I was chopping wood in a yard." Likewise, if the entities described in the question do not appear in the world-state history, such as "did I drive today?", the system can respond with a negative answer: "no, I did not drive today." with an explanation "I was at home all day." This capability expands beyond standard video search, which might only return nearest neighbor video frames, without a natural language response (or a negative response).

The performance of recalling events largely depends on the relevance of the language-based world-state history to the question. We find that recall-type questions work best with world-state history logs that are compiled by using search-based key frame indexing (see Sec. 3.3-B). The system can still return negative responses, since the captioning of the key frames is not influenced by the question.

Temporal Reasoning. SMs can answer questions related to time by appending timestamps to each key moment in the world-state history. By associating image summaries to times of the day, this allows answering questions that time-index various activities. For example, "when did I last drink coffee?" can return the last time drinking coffee was mentioned in the log, with a full response "I last drank coffee at 10:17 AM" and an explanation "I was making coffee in the kitchen." The system can also count events; for example, when asked "how many times did I receive a package today?", the system will respond appropriately "I received a package once today." with an explanation "I was receiving a package at 3:24 PM". We find that a common failure mode for these types of questions is that the system tends to over-count, especially as a reaction to false positive VLM detection results that get surfaced into the world-state history. For example, asking "who did I interact with?" would yield "woman, hamster" where hamster was a false positive prediction from CLIP. These issues become more prominent with search-based key frame sampling, as a byproduct of an inability to distinguish neighboring local argmaxes of the same event from each other.

Cause and Effect Reasoning. SMs can answer questions about cause and effect relationships between events, conditioned on all of the relevant events appearing in the world-state history. For example, when asked "why did I go to the front porch today?" the system would respond "I went to the front porch today to receive a package." and an explanation "I saw on the porch a package and knew that I was expecting it." These types of questions are exciting because they suggest opportunities for prompting logical deduction of events. However, since information about both the cause and the effect needs to be in the world-state history, the quality of results remains highly dependent on the key frame sampling strategy used to compile it (Sec. 3.3-B). Uniform sampling gives an unbiased account of events, and is currently the best variant for this form of reasoning. More targeted construction of the world-state history with search-based key frames can sometimes miss frames that capture the answer to the question.

Fig. 8: Example zero-shot language-prompted auditory retrieval (shown: top 2 results) in response to "what did my daughter's laugh sound like today?", for which an LM identifies the audio search query of "daughter's laugh", and an ALM (Wav2CLIP) is used for audio retrieval. The top (left) retrieval is only partially correct, returning a video clip involving the daughter but not laughter. The second (right) retrieval is correct, from a moment of playing (getting tossed into the air). Faces obscured for privacy.

Subjective Reasoning. SMs can also answer more subjective questions, such as "was I happy today?" or "what was my favorite drink today?". Without additional context, these questions rely on biases from the LM's dataset – which could have negative consequences, and should be managed carefully with additional mechanisms for safety and groundedness (Thoppilan et al. 2022). The full personalization of these subjective questions is likely to be conditioned on whether a better context can be constructed of prior user behaviors related to the question.
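For illustration, here is a minimal sketch of the Q&A prompting pattern described above, assembling a timestamped world-state history into a prompt and requesting an explanation via the "This is because:" suffix; the lm() helper and the example log entries are illustrative assumptions, not part of the released system:

# Minimal sketch of open-ended Q&A over a language-based world-state history.
# lm() is a hypothetical helper wrapping a GPT-3-style text completion endpoint.

world_state_history = [
    ("10:17 AM", "I am making coffee in a kitchen."),
    ("3:24 PM",  "I am receiving a package on the front porch."),
    ("5:27 PM",  "I am eating a sandwich in a kitchen."),
]

def answer(question: str) -> tuple[str, str]:
    history = "\n".join(f"{t}: {s}" for t, s in world_state_history)
    answer_text = lm(f"{history}\nQ: {question} A:").strip()
    # Optional second round: ask the LM to justify its answer from the same context.
    explanation = lm(f"{history}\nQ: {question} A: {answer_text} This is because:").strip()
    return answer_text, explanation

ans, why = answer("When did I last drink coffee?")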
(iii) Forecasting of future events can be formulated as language-based world-state completion. Our system prompts the LM to complete the rest of an input event log. Timestamps of the predictions can be preemptively specified depending on the application needs. The completion results are generative, and are broader than binary event classification (e.g., (Lei et al. 2020)). Example completion results (also shown in Fig. 2):

1:46 PM: I am eating a sandwich in a kitchen.
2:18 PM: I am checking time and working on a laptop in a clean room.
2:49 PM: I am buying produce from a grocery store or market.
3:21 PM: I am driving a car.
4:03 PM: I am in a park and see a playground.
4:35 PM: I am in a home and see a television.

Few-shot prompting the LM with additional examples of prior event logs most similar to the current one is likely to improve the accuracy of the completion results. Without additional context, these results are again biased towards typical schedules seen by the LM across Internet-scale data.

To a certain extent, this forecasting capability extends and generalizes the traditional topic of activity forecasting in computer vision. In the research community, activity forecasting has often been formulated as an extension of action classification, tracking, or feature generation: given a sequence of image frames, such methods directly predict a few categorized actions (Ryoo 2011; Hoai and De la Torre 2014; Rhinehart and Kitani 2017), human locations (Kitani et al. 2012), or image features (Vondrick, Pirsiavash, and Torralba 2016) to be observed in the future frames. In contrast, Socratic Models with LMs enables generating more semantically interpretable descriptions of future events, conditioned on multimodal information.

(iv) Corrections. SMs can be prompted to incorporate human feedback in the loop as well, which could be useful for interactive language-based systems. For example, given image captions generated from a VLM and LM:

Context: Where am I? outdoor cabin, campsite, outdoor inn. What do I see? fire, marshmallow, fire iron, hearth, fireside, camp chair. What am I doing? Commonsense suggests: roasting marshmallows, sitting around the fire, chatting. Most likely: sitting around the fire.
Original Summary: I am camping and enjoying the company of my friends around the fire.
Corrections: It was actually my family, not friends, sitting around the fire.
Corrected Summary: I am camping with my family and enjoying the company of them around the fire.

(v) Video Search: Image or Audio Retrieval. Our SM system can also return additional modalities (images, audio) as answers to questions, by simply few-shot prompting the LM to classify a target modality based on the input question. For example, "where did I leave my remote control" can map to image search using VLM features for "remote control", while "what did my daughter's laugh sound like today?" can map to natural-language-queried audio search (Oncescu et al. 2021) using ALM features for "daughter's laugh" (Fig. 8). This can be useful for some applications (e.g., AR) in which the user may find the retrieved modality to be more useful than a natural language response. Our approach for this uses an LM to parse a search entity from the question to index key video frames. This is done with several few-shot examples provided as context. For example, the question "when did I last wash my hands?" yields a search entity "wash my hands" that is then used with video search to index the most relevant n key frames of "wash my hands" in the video. Specifically, our system runs video search by ranking matching CLIP or Wav2CLIP features of the entity text against all video frames, and returning the top n local maxima. For each frame, the features can either be image features or audio features (e.g., from the surrounding 5 seconds with Wav2CLIP) – where the LM few-shot categorizes which domain to use for any given question. This can be thought of as calling different subprograms for hierarchical search.
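A minimal sketch of the frame-indexing step is shown below, assuming per-frame CLIP (or Wav2CLIP) features have been precomputed and L2-normalized; the search entity would come from the LM parse of the question, and the simple local-maxima selection here is an illustrative stand-in for the full system:

import numpy as np
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def search_key_frames(frame_features: np.ndarray, entity: str, n: int = 3):
    # frame_features: (num_frames, d), L2-normalized, one row per video frame.
    # Returns indices of the top-n local maxima of the text-frame similarity curve.
    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize([entity]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = frame_features @ text_feat.squeeze(0).cpu().numpy()
    # A frame is a peak if it scores at least as high as both neighbors.
    is_peak = np.r_[True, scores[1:] >= scores[:-1]] & np.r_[scores[:-1] >= scores[1:], True]
    peak_idx = np.flatnonzero(is_peak)
    top_peaks = peak_idx[np.argsort(scores[peak_idx])[::-1][:n]]
    return sorted(top_peaks.tolist())

# e.g., entity parsed by the LM from "did I drink coffee today?" -> "drink coffee"
key_frames = search_key_frames(frame_features, "drink coffee", n=5)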
Limitations. Overall, our results suggest that SMs are capable of generating meaningful outputs for various egocentric perception tasks via visual contextual reasoning – but their limitations also suggest areas for future work. For example, a primary bottleneck in the Q&A system is that it relies on the richness (i.e., recall) and quality (i.e., precision) of the event log. This likely could be improved with better image and audio detectors or captioning systems (Gu et al. 2021). Also, we find that the Wav2CLIP model we used may provide satisfactory results for certain categories in audio retrieval, but we currently do not involve it in generating the event log, since its robustness and range of open-language detection is not at the same level as CLIP. This seems addressable with further approaches and scaling of datasets in the audio-language domain.

Additionally, accurate responses to cause and effect reasoning questions also require relevant key moments to be reflected in the event log – which points to open-ended questions on how to achieve better key frame sampling (beyond the simple baselines that we have demonstrated). Finally, the dialogue between the different models is fairly structured with manually engineered prompts. It may be interesting to investigate more autonomous means of achieving language-based closed-loop discussions between the models until a commonsense consensus is reached.

MIP-Search. The first observation is that several data pre-processing techniques applied in so-called maximum inner product (MIP) search can be directly used to reorganize the keys (e.g., latent representations of video frames) to provide a sub-linear querying mechanism for the incoming text snippet (see (Abuzaid et al. 2019)). Those include pruning and various indexing techniques, such as LSH-hashing (Shrivastava and Li 2014). In the hashing approach, a collection of hash-tables, indexed by the binarized representations of the hashes, is stored, with different entries of the hash table corresponding to the subsets of keys producing a particular hash. There are several cheap ways of computing such hashes, e.g., signed random projection (those in principle linearize the angular distance, but every MIP task can be translated to the minimum angular distance search problem). The querying is then conducted by searching for the most similar hash-entries in the hash-tables and then performing linear search only on the subsets of keys corresponding to these entries to obtain the final ranking.

Associative Memories. The above approach provides a sub-linear querying mechanism, but does not address the space complexity problem. In the scenario of strict memory requirements, we propose to leverage recently introduced techniques on linear attention (Choromanski et al. 2021b) combined with modern continuous associative memory (MCAM) models (Ramsauer et al. 2021). MCAM models are de facto differentiable dictionaries (with provable few-shot retrieval) that can be thought of as energy-based models using negated exponentiated latent-representations-dot-product energy for the exponential storage capacity. A naive computation of such an energy still requires explicitly keeping all the patterns (which is exactly what we want to avoid), but this can be bypassed by applying the linearization of that energy (which effectively is just the negated sum of the softmax kernel values) with the FAVOR+ mechanism used in linear-attention Transformers, called Performers (Choromanski et al. 2021b). This modification has several advantages: (1) it makes the size of the dictionary completely independent from the number of the implicitly stored patterns; the size now scales linearly with the number of random features used for energy linearization, and (2) it provides a constant-time querying mechanism at the price of compressing all the patterns (and thus losing some information).

4 Socratic Internet Data Image Captioning

The SMs framework can also be used to generate text captions for generic Internet images with a guided multi-model exchange between a VLM and LM. We describe an example system in Sec. 4.1 and demonstrate results in Sec. 4.2.

4.1 System: Image Captioning on Internet Data

Overall, our example SMs system for Internet image captioning is extremely similar to how we perform single-image captioning in our egocentric system, but (i) adapted for Internet images rather than egocentric images, and (ii) adapted such that the "final task" is the generation of a single image caption, rather than open-ended tasks based on text-prompted completion.

First, similar to the process of generating egocentric image captions, we may prompt the VLM to zero-shot detect visual entities across different categories of language. As with the egocentric system, we return top matching place categories and object categories. For Internet data, we use Tencent ML-Images (Wu et al. 2019) for object entities. We also choose to detect the image type from the set {photo, cartoon, sketch, painting} and the amount of people from the set {are no people, is one person, are two people, are three people, are several people, are many people}. For generic Internet images, which are not necessarily real photos, and very often are taken by people of people, we find that these additional contextual pieces of information help generate better captions. The various VLM detections give: "Places: {place1}, {place2}, {place3}. Objects: {object1}, {object2}, {object3}. Image type: {image type}. People result: {people result}."

Next, given the VLM detections of various visual entities, we can then prompt the LM to generate several (n) candidate captions. For this step, we employ a non-zero sampling temperature (we find 0.9 gives good results) in order to give sufficient variety of results across the n options, but still generate reasonably likely options:

I am an intelligent image captioning bot. This image is a {img type}. There {num people}. I think this photo was taken at a {place1}, {place2}, or {place3}. I think there might be a {object1}, {object2}, {object3}, ... in this {img type}. A creative short caption I can generate to describe this image is:

Fig. 9: The SMs framework with a VLM and LM can be used to zero-shot generate captions for generic Internet images, and can be as expressive as finetuned-for-captioning state-of-the-art models, e.g., ClipCap (Mokady, Hertz, and Bermano 2021).

An example of an actual prompt and generated sample captions, for the bottom-left photo of Fig. 9, is:

I am an intelligent image captioning bot. This image is a photo. There are no people. I think this photo was taken at a indoor bow window, dining room, or interior balcony. I think there might be a double-hung window, casement window, sliding window, pivoting window, breakfast area, breakfast nook, dining area, storm window, storm sash, dining room, bay window, bow window, lancet window in this photo. A creative short caption I can generate to describe this image is: ... (run n times, examples shown below)

• Looking through the window to a beautiful view.
• This serene and stately room is centered around a beautiful window, which floods the space with natural light. The double-hung window allows for ventilation, while the storm sash protects against inclement weather. The perfect spot for a lazy Sunday brunch or an intimate dinner party.
• ...

Lastly, after the LM has generated several captions, they are subsequently ranked by the VLM, and the highest scoring caption is returned – in the example above, this is: "This image shows an inviting dining space with plenty of natural light."
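A condensed sketch of this generate-then-rank loop is shown below, assuming the completions API available for GPT-3 at the time (engine name illustrative) and the open-source CLIP package; the prompt variable is the filled-in template shown above:

import clip
import openai
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def generate_candidates(prompt: str, n: int = 8) -> list[str]:
    # Sample n diverse candidate captions from the LM at temperature 0.9.
    response = openai.Completion.create(
        engine="text-davinci-002",  # illustrative engine name
        prompt=prompt, n=n, temperature=0.9, max_tokens=64)
    return [c.text.strip() for c in response.choices]

def rank_with_vlm(image_path: str, captions: list[str]) -> str:
    # Return the caption whose CLIP text embedding best matches the image.
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize(captions, truncate=True).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    scores = (img_feat @ txt_feat.T).squeeze(0)
    return captions[int(scores.argmax())]

best_caption = rank_with_vlm("internet_image.jpg", generate_candidates(prompt))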
Tab. 1: Video-to-text retrieval results on MSR-VTT (Xu et al. 2016) dataset, both on the popular 1k-A (Yu, Kim, and Kim 2018) subset and
the original ‘full’ test set. Differentiated are methods which train on the MSR-VTT dataset (finetuning), compared with zero-shot methods,
which do not. Also noted: whether the methods use audio channels, and if CLIP (Radford et al. 2021) is used, which CLIP encoder is used.
See Fig. 10 for the chronology of the SOTA across each category.
Here, the Socratic interaction lies mainly between the ALM (speech-to-text) and the commonsense LM (GPT-3, to summarize the transcriptions), and between the commonsense LM and the ranking-based system that is a combination of the VLM (CLIP) and the masked LM (RoBERTa). Note that we may also prompt LMs (in this case, via multiple-choice) to determine if one caption is a better fit than another for a given video. However, for this specific task and dataset, with thousands of possible answers to choose from, the numerical ranking provided by embedding similarity scores provides a practical solution rather than relying on thousand-way multiple-choice commonsense reasoning.

5.2 Results: Socratic Video-to-Text Retrieval

For video-to-text evaluation we use the MSR-VTT dataset (Xu et al. 2016), which as noted in other recent works (Gao et al. 2021; Cheng et al. 2021) is the most popular benchmark dataset for the task of video-to-text retrieval. Like other recent works (Gao et al. 2021), we focus our results on this dataset. One of the reasons this is a good task and dataset for generally testing the value of the SMs approach is that there is already a strong zero-shot baseline, provided by Portillo-Quintero et al. (2021), which uses CLIP by itself, but does not use the Socratic method: there is no multi-model exchange, and no LMs are used. Additionally, this task provides a great opportunity to incorporate another type of modality – speech-to-text from audio data. We compare our method both with zero-shot methods, and with finetuned methods specifically trained on MSR-VTT.

Results show that our method sets a new zero-shot state-of-the-art for video-to-text retrieval on MSR-VTT (Tab. 1), both on the "1k-A" and "full" test sets. Since our demonstrated system uses exactly the method of Portillo-Quintero et al. (2021) for its processing of CLIP features but additionally incorporates LLM reasoning on speech-to-text transcripts, the increased measured performance of our method (i.e. 27.2 → 42.8 R@1 on MSR-VTT 1k-A) directly reflects the additional benefit of incorporating language-based multimodal reasoning. Additionally, to keep the comparison between our method and Portillo-Quintero et al. (2021) as direct as possible, we maintain the usage of their precomputed CLIP features from ViT-B/32 (https://github.com/Deferf/CLIP_Video_Representation). Given results from other recent methods (Tab. 1) it seems likely we may be able to improve our performance by switching to ViT-B/16, or other recent more-performant VLM models (Zhai et al. 2021).

Tab. 2: Evaluation on MSR-VTT for video-to-text retrieval with long-transcript videos: the subset of videos for which our SMs method used transcripts on 1k-A (n=451 out of 1,000) and full (n=1,007 out of 2,990). On these subsets, we evaluate Portillo-Quintero et al. (2021) vs. our method. Outside this subset, we resort to Portillo-Quintero et al.

Long-transcript subset of MSR-VTT 1k-A
  Method                               R@1↑   R@5↑   R@10↑   MdR↓
  CLIP via Portillo-Quintero (2021)    28.2   49.9   60.3    6.0
  SMs (ours)                           55.0   71.6   76.3    1.0

Long-transcript subset of MSR-VTT Full
  Method                               R@1↑   R@5↑   R@10↑   MdR↓
  CLIP via Portillo-Quintero (2021)    41.5   69.6   77.4    2.0
  SMs (ours)                           54.9   74.0   79.9    1.0

As shown in Tab. 2, if we look at only the long-transcript videos, i.e. the videos for which our method used a transcript, then we especially see an increase in performance – on MSR-VTT 1k-A, R@1 almost doubles, from 28.2 to 55.0, for our method compared to Portillo-Quintero et al. (2021). Further, although it is on only a subset of the test set, note that this achieved R@1 of 55.0 is roughly comparable to the R@1 of the best finetuned-SOTA method, DRL (Wang et al. 2022), on the entire 1k-A dataset, with 56.2 R@1 (Tab. 1). If we assume that, for visual-only methods, the videos with or without transcripts are of roughly equal difficulty from a visual-only retrieval perspective, this suggests that on Internet videos with sufficient spoken language present, our method for zero-shot video-to-text retrieval can nearly match the finetuned-SOTA method for video-to-text retrieval.
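For illustration, a heavily simplified sketch of how the two ranking signals could be fused when scoring candidate captions for a video is shown below: the CLIP-only visual score, and a sentence-embedding similarity between each caption and the LM summary of the speech transcript. The equal-weight rank fusion and the helper names are illustrative assumptions, not the exact scheme behind the reported numbers:

import numpy as np
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-distilroberta-v1")  # a RoBERTa-based sentence encoder

def rank_captions(frame_feats: np.ndarray, caption_feats: np.ndarray,
                  captions: list[str], transcript_summary: str) -> list[int]:
    # frame_feats: (F, d) and caption_feats: (C, d), both L2-normalized CLIP features.
    # Returns caption indices sorted best-first by fusing a visual (CLIP) score
    # with a language score against the LM summary of the speech transcript.
    visual_scores = (frame_feats @ caption_feats.T).mean(axis=0)                 # (C,)
    cap_emb = sbert.encode(captions, convert_to_tensor=True)
    sum_emb = sbert.encode(transcript_summary, convert_to_tensor=True)
    language_scores = util.cos_sim(sum_emb, cap_emb).squeeze(0).cpu().numpy()    # (C,)
    # Simple equal-weight rank fusion of the two signals (illustrative choice).
    fused = np.argsort(np.argsort(-visual_scores)) + np.argsort(np.argsort(-language_scores))
    return list(np.argsort(fused))

# transcript_summary would come from prompting GPT-3 to summarize the ASR transcript,
# e.g. lm(f"Transcript: {asr_transcript}\nSummarize what this video is about:").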
Note that on text-to-video retrieval, rather than video-to-text retrieval, a recent method (Li et al. 2022) has shown strong zero-shot results. Other methods have also attempted zero-shot MSR-VTT text-to-video retrieval (Xu et al. 2021; Miech et al. 2020; Bain et al. 2021), but these have all been outperformed by Portillo-Quintero et al. (2021). Our method may be adapted as well to text-to-video, but due to our use of transcripts on only a subset of the videos, unlike in video-to-text, this creates an asymmetry which may require an unwieldy relative weighting for ranking videos with or without transcripts.

Also note that (Tab. 1) prior to the CLIP revolution in video-to-text retrieval, using the audio modality was not uncommon amongst competitive video-to-text retrieval methods (Mithun et al. 2018; Liu et al. 2019a). The trend over the past year, however, has been to instead focus on using only visual features, with all recent competitive methods being based off of CLIP, and not using audio data. Our approach, through leveraging commonsense reasoning stored in the LMs, is able to once again allow audio data to enable progress in this common video understanding task, beyond what CLIP alone can provide.

6 Unsupervised Socratic Model Selection

The combination of complementary models, in which one may compensate for the weaknesses of the other, opens an interesting avenue for unsupervised evaluation of model performance. Since our metric of interest is the combined performance of, e.g., a VLM and an LM – rather than asking the question '(A): how well does this VLM perform in absolute terms?', we can instead ask '(B): how well does this VLM compensate for the weakness of the LM?'.

(Strope et al. 2011) proposes a scheme which does so without requiring any evaluation ground truth. They also find that asking question (B) correlates well with answers to question (A), and is useful, e.g., for model selection. The method assumes you have access to a weak (wLM) and a strong (sLM) LM (respectively VLM if evaluating the LM's performance). Asking "how well does this VLM compensate for the weaknesses of the LM" is equivalent to asking: "if we have a collection of VLMs, and we combine them with a weak LM, which model is going to perform the closest to the combination of the VLM with a strong LM?" If a VLM combined with a weak LM, instead of a strong one, makes up for the LM's shortcomings and still performs well in combination, then it may serve as a better component in the context of this combined system.

The benefit of this approach, while not entirely making up for doing absolute evaluations against a ground truth, is that because it only measures relative distance between model outputs, it can be performed unsupervised without annotated data: the distance between the output of the weak and strong combination can be measured using measures of semantic distance, for instance here by scoring them against a distinct, held-out language model.

As an example of using this approach, we extend the method in (Strope et al. 2011) to Socratic Models on egocentric perception, where we show it is possible to quantify the mutual dependence between foundation models without ground truth data. Specifically, to evaluate a new VLM (VLM') for generating language-based world-state history, we first use a baseline VLM paired with the strong LM (sLM) to generate pseudo ground truth predictions VLM×sLM. We then take both the baseline VLM and the new VLM VLM', and pair them with a weak LM (wLM) to generate predictions VLM×wLM and VLM'×wLM respectively. We score these predictions (per image summary) against the pseudo ground truth VLM×sLM. Since the outputs are linguistic, we can measure the similarity of a given prediction to the ground truth by comparing their sentence embeddings produced by another language model, e.g., RoBERTa (Liu et al. 2019b). It is important to use a distinct LM for scoring to avoid spurious correlations with the models under evaluation.

VLM (CLIP) Variants + Weak LM
  Truth Models        RN50x4   RN50x16   ViT-B/32   ViT-B/16   ViT-L/14
  GPT-3 + ViT-B/16    0.628    0.646     0.686      0.861      0.704
  GPT-3 + RN50x16     0.667    0.851     0.689      0.655      0.704
  ImageNet Accuracy   65.8     70.5      63.2       68.6       76.2
  Size (# params)     178M     291M      151M       150M       427M

Tab. 3: Unsupervised evaluation (higher is better) of various VLMs by pairing them with a weak LM and comparing outputs to a VLM paired with a strong LM, which provides relative 'truth gradients' that inform how well the VLMs can compensate for the weak LM. These results suggest that better VLMs (measured by zero-shot ImageNet classification accuracies) can improve Socratic synergies.

Tab. 3 shows example results of this analysis with GPT-3 "Davinci" as the sLM, and GPT-3 "Curie" as the wLM, to compare VLM (i.e., CLIP) variants with different backbones: vision transformers (ViT) (Dosovitskiy et al. 2020) and ResNets (RN50) (He et al. 2016) with different model sizes. We find that this method can capture an ascending performance curve that correlates with increasingly better VLMs (e.g., better variants of CLIP) (Radford et al. 2021), as measured by zero-shot image classification accuracy on ImageNet (Deng et al. 2009) – with correlation coefficients of 0.41 and 0.46 between ImageNet accuracies and mean similarity to truth models via ViT-B/16 and RN50x16 respectively. We find that with our SM system for egocentric perception (and in contrast to the original setting in (Strope et al. 2011)), it is necessary to use a third baseline VLM (bVLM×sLM) to generate the pseudo ground truth, instead of VLM×sLM. This is because the SM combinations that use the same VLM as the one that generates the ground truth are biased to produce similar visual grounding results and can exhibit an unfair advantage during the comparisons. Those numbers in our tests have been grayed out in Tab. 3.
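A minimal sketch of the scoring step is given below, assuming per-image summaries have already been generated by each VLM×wLM combination and by the baseline VLM×sLM combination; the sentence encoder named here is one RoBERTa-based option, used only as an example of a held-out scorer:

from sentence_transformers import SentenceTransformer, util

scorer = SentenceTransformer("all-distilroberta-v1")  # held-out, RoBERTa-based scorer

def model_selection_score(candidate_summaries: list[str],
                          pseudo_ground_truth: list[str]) -> float:
    # Mean cosine similarity between per-image summaries produced by a candidate
    # VLM paired with the weak LM and the pseudo ground truth summaries produced
    # by the baseline VLM paired with the strong LM (higher is better).
    a = scorer.encode(candidate_summaries, convert_to_tensor=True)
    b = scorer.encode(pseudo_ground_truth, convert_to_tensor=True)
    per_image = util.cos_sim(a, b).diagonal()   # compare i-th summary to i-th truth
    return float(per_image.mean())

# Computed for each CLIP variant paired with the weak LM, as in Tab. 3,
# without requiring any human-annotated ground truth.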
7 Related Work

Multi-model multimodal reasoning. In the context of transfer learning (e.g., via fine-tuning), pre-trained foundation models have achieved strong results when combined and trained together for a number of downstream multimodal (Ngiam et al. 2011) applications, including VLMs with LMs for image captioning (e.g., CLIP with GPT-2) (Mokady, Hertz, and Bermano 2021), video understanding (e.g., CLIP with BERT (Gao et al. 2021)), visual question answering, e.g., (Song et al. 2022a), and ALMs and LMs for speech and text modeling, e.g., (Song et al. 2022b; Bapna et al. 2022). These systems are often finetuned on task-specific data, and while this paradigm is likely to be preferred in domains for which data is abundant, our initial results suggest that SMs can be a strong zero-shot alternative for applications in which data is less available or more expensive to obtain, e.g., egocentric perception and robotics.

The notion as well of "Mixture-of-Experts" ((Jordan and Jacobs 1994), see (Masoudnia and Ebrahimpour 2014) for a review) is a common paradigm for combining the outputs of multiple models, and specifically mixtures of experts across multimodal domains including vision and audio (Liu et al. 2019a) have been studied – note that results from Liu et al. (2019a) are included in Tab. 1. Further investigating these techniques in the context of recent pretrained foundation models may be a promising direction for future work. Our work may be interpreted as a particular extension of Mixture-of-Experts in which experts may be composed to provide feedback to each other, closed-loop, via the common representation of language.

Egocentric perception continues to be an important problem in computer vision. Early work in the area explores hand-designed first-person visual features for egocentric action recognition, object understanding, and video summarization. This includes ego-motion (e.g., optical flows) (Kitani et al. 2011; Ryoo and Matthies 2013) as well as features from human gaze, hands, and objects (Spriggs, De La Torre, and Hebert 2009; Lee, Ghosh, and Grauman 2012; Fathi, Farhadi, and Rehg 2011; Pirsiavash and Ramanan 2012; Li and Kitani 2013; Lee and Grauman 2015). Focusing on hand-designed features was common in early egocentric vision research, as the availability of data (or videos in general) was very limited. More recent approaches in egocentric perception leverage learned feature representations, utilizing pre-trained convolutional network features (Ryoo, Rothrock, and Matthies 2015), finetuning them (Ma, Fan, and Kitani 2016; Zellers et al. 2022), or training them from scratch (Bambach et al. 2015) with first-person videos. Similar to the topics explored in early work, learning of visual representations capturing human hands, objects, and eye gaze has been extensively studied (Garcia-Hernando et al. 2018; Li, Liu, and Rehg 2018). (Kazakos et al. 2019) learns multimodal embeddings (i.e., video + audio), and (Furnari and Farinella 2019) studies future action anticipation from egocentric videos. Lack of sufficient data, however, consistently remains a bottleneck – motivating researchers to construct new larger-scale egocentric video datasets including EPIC-Kitchens (Damen et al. 2018), Charades-Ego (Sigurdsson et al. 2018), and Ego4D (Grauman et al. 2021).

8 Discussion

Socratic Models are a class of systems that leverage structured dialogue between multiple language-based foundation models to make joint predictions for new multimodal tasks. SMs leverage the commonsense knowledge already stored within foundation models pretrained on different domains of data (e.g., text-to-text, text-to-images, text-to-audio), which may include, for example, Internet-scale data. Our shown systems for egocentric perception, image captioning, and video-to-text retrieval are just examples of the SMs framework, and may shed light on new opportunities to build simple systems that adapt foundation models to (i) capture new multimodal functionalities zero-shot without having to rely on additional domain-specific data collection or model finetuning, and (ii) do so while retaining their robustness to distribution shifts (which is known to deteriorate after finetuning) (Wortsman et al. 2021).

SMs present a language-based approach to combining the outputs of multiple foundation models, which differs from a classical Bayesian approach where one model is used as a prior and the other as evidence. Relying on language-only multi-model discussion carries both pros and cons. For example, the intermediate outputs of the models may be more interpretable, but are treated as "truth" between models – i.e., not weighing them against the other's priors or evidence, which can lead to more divergent model interactions.

In the context of egocentric perception, we find that formulating video Q&A as reading comprehension in SMs directly leverages the extent to which large LMs are capable of logical reasoning by connecting commonsense relationships with knowledge learned from Internet-scale data. For example, the system returns the following answer when presented with the world-state history log:

8:00 AM: went to grocery store to buy orange juice, chocolate, and bread.
8:15 AM: I went to gas station to fill up the vehicle tank.
8:30 AM: drove back home and left the groceries in the kitchen.
8:45 AM: started cooking eggs in the pan.
9:00 AM: the dog went into the kitchen.
9:15 AM: took the dog out for a walk.
9:30 AM: the dog is sick.
Q: Why is the dog sick? A: The dog may have eaten something it was not supposed to, such as chocolate.

Arriving at the answer requires bridging multiple connections between observations, e.g., that the dog went into the kitchen, that the groceries are still in the kitchen, and that the groceries contain chocolate. Such results offer a glimpse of what might be possible using SMs for deductive reasoning across multiple domains of information, and raise interesting research questions on (i) how to better assemble language-based world-state histories (beyond what is presented in this work) that capture relevant evidence to
improve the accuracy of conclusions, and (ii) how to elicit chain of thought prompting (Wei et al. 2022) to decompose multi-step problems into intermediate ones. For example, one promising extension could be prompting the LM with chain of thought sequences to expand on hypotheses:

Q: What are reasons for why I might be chopping wood?
A: Reasons might include: needing firewood, wanting to make a statement, or needing the exercise.

to which each hypothesis can be progressively explored by downstream subprograms called at recursively higher resolutions until a conclusion is reached. These directions suggest pathways towards achieving increasingly meaningful utility and analysis by digital multimodal assistants.
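As one illustration of how such a recursive exploration might be scripted, the sketch below expands LM-generated hypotheses to a fixed depth before asking for a conclusion; the lm() helper, the prompts, the depth limit, and the world_state_history_text variable are all illustrative assumptions rather than a tested procedure:

# Illustrative sketch of recursively expanding LM-generated hypotheses.
# lm() is a hypothetical helper wrapping a text-completion endpoint;
# world_state_history_text is an assumed pre-built event log string.

def explore(question: str, context: str, depth: int = 0, max_depth: int = 2) -> str:
    hypotheses = lm(f"{context}\nQ: What are possible reasons for: {question}\n"
                    f"A: Reasons might include:").split(",")
    if depth == max_depth:
        # Base case: ask the LM to commit to the most likely hypothesis.
        return lm(f"{context}\nOf the following reasons: {', '.join(hypotheses)},\n"
                  f"the most likely, given the context, is:")
    # Otherwise expand each hypothesis with a more specific follow-up question.
    findings = [explore(h.strip(), context, depth + 1, max_depth) for h in hypotheses[:3]]
    return lm(f"{context}\nEvidence: {' '.join(findings)}\nQ: {question} A:")

conclusion = explore("why might I be chopping wood?", world_state_history_text)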
References
8.1 Broader Impacts
Socratic Models offer a new perspective that encourages building AI systems using off-the-shelf language-interactable foundation models without additional data collection or model finetuning. This leads to several practical benefits and new applications, but also to risks. For one, SMs provide an interpretable window, through language, into the behavior of the systems (even for non-experts). Further, the barrier to entry for this technology is small: SMs can be engineered to capture new functionalities with minimal compute resources, and no model training was used to create any of the demonstrated results. This can be enabling, but it also raises potential risks, since it increases the flexibility of unintended end-use applications, and should be carefully monitored over time. We welcome broad discussion on how to maximize the potential positive impacts (enabling broad, new multimodal applications with minimal new resources) while minimizing the capabilities of bad actors.
Regarding the impact on energy and other resource consumption for machine learning, this work may help pave a path for new, capable machine learning models to be composed with minimal training resource consumption, provided that large foundational pretrained models are available. This may help provide an answer for how large pretrained models can be retargeted to a wide variety of multimodal applications without considerable additional compute resources. Since SMs help demonstrate how a wide variety of applications may be addressed with fixed (pretrained) models zero-shot, this may also help foster adoption of new machine learning accelerators (e.g., fixed analog circuitry (Reuther et al. 2020), optical diffraction (Lin et al. 2018)) for inference with substantially lower power consumption and more compact form factors.
We are also excited about opportunities in downstream applications. For example, SMs suggest promising research directions for data-driven learning in robotics, where the various modules within a robot system (e.g., planning (Ahn et al. 2022; Huang et al. 2022), perception (Shridhar, Manuelli, and Fox 2022)) can be replaced with zero-shot foundation models imbued with commonsense priors across domains. These ideas may give rise to a new class of robot systems where, by grounding affordances (Zeng 2019) in language, control algorithms can begin to tap into the capabilities of models trained on Internet-scale data and to tackle applications that have traditionally been data-scarce.
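To make this composition concrete, the sketch below treats perception, planning, and control as three interchangeable black boxes that exchange plain text: a VLM-style scene describer, an LM planner, and a language-conditioned policy. The function names, prompt format, and toy stand-ins are hypothetical, not the interfaces of any of the systems cited above.

from typing import Callable, List

def plan_with_lm(lm: Callable[[str], str], scene: str, instruction: str) -> List[str]:
    # Ask the LM to break an instruction into short, affordance-like steps.
    prompt = (f"Scene: {scene}\n"
              f"Task: {instruction}\n"
              "Plan as numbered short steps the robot can execute:\n1.")
    steps = ("1." + lm(prompt)).splitlines()
    return [s.split(".", 1)[1].strip() for s in steps if "." in s]

def run(describe_scene: Callable[[], str], lm: Callable[[str], str],
        execute: Callable[[str], None], instruction: str) -> None:
    # Compose perception, planning, and control purely through language.
    scene = describe_scene()              # e.g., a VLM captioning the camera image
    for step in plan_with_lm(lm, scene, instruction):
        execute(step)                     # e.g., a language-conditioned policy

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs; swap in real VLM / LM / policy backends.
    run(describe_scene=lambda: "a table with a cup and a sponge",
        lm=lambda p: " pick up the sponge\n2. wipe the table\n3. done",
        execute=lambda step: print("executing:", step),
        instruction="clean the table")

Because the modules exchange only text, any one of them can in principle be swapped for a stronger pretrained model without retraining the others, mirroring the out-of-the-box composition that SMs advocate.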
9 Acknowledgements

We thank Debidatta Dwibedi and Matthew O'Kelly for excellent feedback on improving this manuscript, Anelia Angelova, Jean-Jacques Slotine, Jonathan Tompson, Maria Attarian, and Shuran Song for fruitful technical discussions, Kan Huang for applications support, Ahmed Omran, Aren Jensen, Malcolm Slaney, and Karolis Misiunas for advice on audio models, and Cody Wanner for YouTube videos.

References

Abuzaid, F.; Sethi, G.; Bailis, P.; and Zaharia, M. 2019. To index or not to index: Optimizing exact maximum inner product search. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019, 1250–1261. IEEE.

Agarwal, P.; Betancourt, A.; Panagiotou, V.; and Díaz-Rodríguez, N. 2020. Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models. arXiv preprint arXiv:2003.11743.

Ahn, M.; Brohan, A.; Brown, N.; Chebotar, Y.; Cortes, O.; David, B.; Finn, C.; Gopalakrishnan, K.; Hausman, K.; Herzog, A.; Hsu, J.; Ibarz, J.; Ichter, B.; Irpan, A.; Jang, E.; Ruano, R. J.; Jeffrey, K.; Jesmonth, S.; Joshi, N.; Julian, R.; Kalashnikov, D.; Kuang, Y.; Lee, K.-H.; Levine, S.; Lu, Y.; Luu, L.; Parada, C.; Pastor, P.; Quiambao, J.; Rao, K.; Rettinghouse, J.; Reyes, D.; Sermanet, P.; Sievers, N.; Tan, C.; Toshev, A.; Vanhoucke, V.; Xia, F.; Xiao, T.; Xu, P.; Xu, S.; and Yan, M. 2022. Do as I can and not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2022.00000.

Apostolidis, E.; Adamantidou, E.; Metsai, A. I.; Mezaris, V.; and Patras, I. 2021. Video summarization using deep neural networks: A survey. Proceedings of the IEEE 109(11):1838–1863.

Bain, M.; Nagrani, A.; Varol, G.; and Zisserman, A. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1728–1738.

Bambach, S.; Lee, S.; Crandall, D. J.; and Yu, C. 2015. Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1949–1957.

Bapna, A.; Cherry, C.; Zhang, Y.; Jia, Y.; Johnson, M.; Cheng, Y.; Khanuja, S.; Riesa, J.; and Conneau, A. 2022. mSLAM: Massively multilingual joint pre-training for speech and text. arXiv preprint arXiv:2202.01374.

Barbieri, M.; Agnihotri, L.; and Dimitrova, N. 2003. Video summarization: methods and landscape. In Internet Multimedia Management Systems IV, volume 5242, 1–13. International Society for Optics and Photonics.

Bommasani, R.; Hudson, D. A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M. S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.

Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33:1877–1901.

Chen, H.; Xie, W.; Vedaldi, A.; and Zisserman, A. 2020. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 721–725. IEEE.

Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H. P. d. O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Cheng, X.; Lin, H.; Wu, X.; Yang, F.; and Shen, D. 2021. Improving video-text retrieval by multi-stream corpus alignment and dual softmax loss. arXiv preprint arXiv:2109.04290.

Choromanski, K.; Chen, H.; Lin, H.; Ma, Y.; Sehanobish, A.; Jain, D.; Ryoo, M. S.; Varley, J.; Zeng, A.; Likhosherstov, V.; Kalashnikov, D.; Sindhwani, V.; and Weller, A. 2021a. Hybrid random features. To appear in ICLR 2022, abs/2110.04367.

Choromanski, K. M.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlós, T.; Hawkins, P.; Davis, J. Q.; Mohiuddin, A.; Kaiser, L.; Belanger, D. B.; Colwell, L. J.; and Weller, A. 2021b. Rethinking attention with performers. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Damen, D.; Doughty, H.; Farinella, G. M.; Fidler, S.; Furnari, A.; Kazakos, E.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. 2018. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), 720–736.

Damen, D.; Doughty, H.; Farinella, G. M.; Furnari, A.; Kazakos, E.; Ma, J.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. 2020. Rescaling egocentric vision. arXiv preprint arXiv:2006.13256.

Del Molino, A. G.; Tan, C.; Lim, J.-H.; and Tan, A.-H. 2016. Summarization of egocentric videos: A comprehensive survey. IEEE Transactions on Human-Machine Systems 47(1):65–76.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Fan, C.; Zhang, Z.; and Crandall, D. J. 2018. Deepdiary: Lifelogging image captioning and summarization. Journal of Visual Communication and Image Representation 55:40–55.

Fang, H.; Xiong, P.; Xu, L.; and Chen, Y. 2021. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097.

Fathi, A.; Farhadi, A.; and Rehg, J. M. 2011. Understanding egocentric activities. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 407–414.

Furnari, A., and Farinella, G. M. 2019. What would you expect? Anticipating egocentric actions with rolling-unrolling lstms and modality attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 6252–6261.

Gao, Z.; Liu, J.; Chen, S.; Chang, D.; Zhang, H.; and Yuan, J. 2021. Clip2tv: An empirical study on transformer-based methods for video-text retrieval. arXiv preprint arXiv:2111.05610.

Garcia-Hernando, G.; Yuan, S.; Baek, S.; and Kim, T.-K. 2018. First-person hand action benchmark with rgb-d videos and 3d hand pose annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 409–419.

Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. 2021. Ego4d: Around the world in 3,000 hours of egocentric video. arXiv preprint arXiv:2110.07058.

Gu, X.; Lin, T.-Y.; Kuo, W.; and Cui, Y. 2021. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, 2961–2969.

Hoai, M., and De la Torre, F. 2014. Max-margin early event detectors. International Journal of Computer Vision 107(2):191–202.

Hu, R., and Singh, A. 2021. Transformer is all you need: Multimodal multitask learning with a unified transformer. arXiv e-prints arXiv–2102.

Huang, W.; Abbeel, P.; Pathak, D.; and Mordatch, I. 2022. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207.

Izadi, S.; Kim, D.; Hilliges, O.; Molyneaux, D.; Newcombe, R.; Kohli, P.; Shotton, J.; Hodges, S.; Freeman, D.; Davison, A.; et al. 2011. Kinectfusion: Real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, 559–568.

Jain, A.; Guo, M.; Srinivasan, K.; Chen, T.; Kudugunta, S.; Jia, C.; Yang, Y.; and Baldridge, J. 2021. Mural: Multimodal, multitask retrieval across languages. arXiv preprint arXiv:2109.05125.

Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; and Duerig, T. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, 4904–4916. PMLR.

Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the em algorithm. Neural Computation 6(2):181–214.

Kazakos, E.; Nagrani, A.; Zisserman, A.; and Damen, D. 2019. Epic-fusion: Audio-visual temporal binding for egocentric action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 5492–5501.

Kitani, K. M.; Okabe, T.; Sato, Y.; and Sugimoto, A. 2011. Fast unsupervised ego-action learning for first-person sports videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3241–3248.

Kitani, K. M.; Ziebart, B. D.; Bagnell, J. A.; and Hebert, M. 2012. Activity forecasting. In European Conference on Computer Vision (ECCV), 201–214.

Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al. 2020. The open images dataset v4. International Journal of Computer Vision 128(7):1956–1981.

Lee, Y. J., and Grauman, K. 2015. Predicting important objects for egocentric video summarization. International Journal of Computer Vision 114(1):38–55.

Lee, Y. J.; Ghosh, J.; and Grauman, K. 2012. Discovering important people and objects for egocentric video summarization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1346–1353.

Lei, J.; Yu, L.; Berg, T. L.; and Bansal, M. 2020. What is more likely to happen next? Video-and-language future event prediction. arXiv preprint arXiv:2010.07999.

Li, C., and Kitani, K. M. 2013. Pixel-level hand detection in egocentric videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3570–3577.

Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; and Hoi, S. C. H. 2021a. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems 34.

Li, Y.; Nagarajan, T.; Xiong, B.; and Grauman, K. 2021b. Ego-exo: Transferring visual representations from third-person to first-person videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6943–6953.

Li, J.; Li, D.; Xiong, C.; and Hoi, S. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086.

Li, Y.; Liu, M.; and Rehg, J. M. 2018. In the eye of beholder: Joint learning of gaze and actions in first person video. In Proceedings of the European Conference on Computer Vision (ECCV), 619–635.

Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 740–755. Springer.

Lin, X.; Rivenson, Y.; Yardimci, N. T.; Veli, M.; Luo, Y.; Jarrahi, M.; and Ozcan, A. 2018. All-optical machine learning using diffractive deep neural networks. Science 361(6406):1004–1008.

Liu, Y.; Albanie, S.; Nagrani, A.; and Zisserman, A. 2019a. Use what you have: Video retrieval using representations from collaborative experts. BMVC.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019b. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Luo, H.; Ji, L.; Zhong, M.; Chen, Y.; Lei, W.; Duan, N.; and Li, T. 2021. Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860.

Ma, M.; Fan, H.; and Kitani, K. M. 2016. Going deeper into first-person activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1894–1903.

Masoudnia, S., and Ebrahimpour, R. 2014. Mixture of experts: A literature survey. Artificial Intelligence Review 42(2):275–293.

Miech, A.; Alayrac, J.-B.; Smaira, L.; Laptev, I.; Sivic, J.; and Zisserman, A. 2020. End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9879–9889.

Mithun, N. C.; Li, J.; Metze, F.; and Roy-Chowdhury, A. K. 2018. Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, 19–27.

Mokady, R.; Hertz, A.; and Bermano, A. H. 2021. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734.

Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; and Ng, A. Y. 2011. Multimodal deep learning. In ICML.

Oncescu, A.-M.; Koepke, A.; Henriques, J. F.; Akata, Z.; and Albanie, S. 2021. Audio retrieval with natural language queries. arXiv preprint arXiv:2105.02192.

Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C. L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. Preprint.

Patel, D.; Parikh, R.; and Shastri, Y. 2021. Recent advances in video question answering: A review of datasets and methods. In International Conference on Pattern Recognition, 339–356. Springer.

Patrick, M.; Huang, P.-Y.; Asano, Y.; Metze, F.; Hauptmann, A.; Henriques, J.; and Vedaldi, A. 2020. Support-set bottlenecks for video-text representation learning. arXiv preprint arXiv:2010.02824.

Petroni, F.; Rocktäschel, T.; Lewis, P.; Bakhtin, A.; Wu, Y.; Miller, A. H.; and Riedel, S. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.

Pirsiavash, H., and Ramanan, D. 2012. Detecting activities of daily living in first-person camera views. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2847–2854.

Portillo-Quintero, J. A.; Ortiz-Bayliss, J. C.; and Terashima-Marín, H. 2021. A straightforward framework for video retrieval using clip. In Mexican Conference on Pattern Recognition, 3–12. Springer.

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763. PMLR.

Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know what you don't know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822.

Ramsauer, H.; Schäfl, B.; Lehner, J.; Seidl, P.; Widrich, M.; Gruber, L.; Holzleitner, M.; Adler, T.; Kreil, D. P.; Kopp, M. K.; Klambauer, G.; Brandstetter, J.; and Hochreiter, S. 2021. Hopfield networks is all you need. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Rawat, A. S.; Chen, J.; Yu, F. X.; Suresh, A. T.; and Kumar, S. 2019. Sampled softmax with random fourier features. In Wallach, H. M.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E. B.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, 13834–13844.

Reuther, A.; Michaleas, P.; Jones, M.; Gadepally, V.; Samsi, S.; and Kepner, J. 2020. Survey of machine learning accelerators. In 2020 IEEE High Performance Extreme Computing Conference (HPEC), 1–12. IEEE.

Rhinehart, N., and Kitani, K. M. 2017. First-person activity forecasting with online inverse reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 3696–3705.

Ryoo, M. S., and Matthies, L. 2013. First-person activity recognition: What are they doing to me? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2730–2737.

Ryoo, M. S.; Rothrock, B.; and Matthies, L. 2015. Pooled motion features for first-person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 896–904.

Ryoo, M. S. 2011. Human activity prediction: Early recognition of ongoing activities from streaming videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1036–1043.

Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2556–2565.

Shridhar, M.; Manuelli, L.; and Fox, D. 2022. Cliport: What and where pathways for robotic manipulation. In Conference on Robot Learning, 894–906. PMLR.

Shrivastava, A., and Li, P. 2014. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2321–2329.

Sigurdsson, G. A.; Gupta, A.; Schmid, C.; Farhadi, A.; and Alahari, K. 2018. Charades-ego: A large-scale dataset of paired third and first person videos. arXiv preprint arXiv:1804.09626.

Smaira, L.; Carreira, J.; Noland, E.; Clancy, E.; Wu, A.; and Zisserman, A. 2020. A short note on the kinetics-700-2020 human action dataset. arXiv preprint arXiv:2010.10864.

Song, H.; Dong, L.; Zhang, W.-N.; Liu, T.; and Wei, F. 2022a. Clip models are few-shot learners: Empirical studies on vqa and visual entailment. arXiv preprint arXiv:2203.07190.

Song, Y.; Fan, X.; Yang, Y.; Ren, G.; and Pan, W. 2022b. Large pretrained models on multimodal sentiment analysis. In Artificial Intelligence in China, 506–513. Springer.

Spriggs, E. H.; De La Torre, F.; and Hebert, M. 2009. Temporal segmentation and activity classification from first-person sensing. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 17–24.

Strope, B.; Beeferman, D.; Gruenstein, A.; and Lei, X. 2011. Unsupervised testing strategies for asr.

Tancik, M.; Casser, V.; Yan, X.; Pradhan, S.; Mildenhall, B.; Srinivasan, P. P.; Barron, J. T.; and Kretzschmar, H. 2022. Block-nerf: Scalable large scene neural view synthesis. arXiv preprint arXiv:2202.05263.

Thoppilan, R.; De Freitas, D.; Hall, J.; Shazeer, N.; Kulshreshtha, A.; Cheng, H.-T.; Jin, A.; Bos, T.; Baker, L.; Du, Y.; et al. 2022. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239.

Vondrick, C.; Pirsiavash, H.; and Torralba, A. 2016. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 98–106.

Wang, Z.; Yu, J.; Yu, A. W.; Dai, Z.; Tsvetkov, Y.; and Cao, Y. 2021. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904.

Wang, Q.; Zhang, Y.; Zheng, Y.; Pan, P.; and Hua, X.-S. 2022. Disentangled representation learning for text-video retrieval. arXiv preprint arXiv:2203.07111.

Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Chi, E.; Le, Q.; and Zhou, D. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.

Wortsman, M.; Ilharco, G.; Li, M.; Kim, J. W.; Hajishirzi, H.; Farhadi, A.; Namkoong, H.; and Schmidt, L. 2021. Robust fine-tuning of zero-shot models. arXiv preprint arXiv:2109.01903.

Wu, B.; Chen, W.; Fan, Y.; Zhang, Y.; Hou, J.; Liu, J.; and Zhang, T. 2019. Tencent ml-images: A large-scale multi-label image database for visual representation learning. IEEE Access 7:172683–172693.

Wu, H.-H.; Seetharaman, P.; Kumar, K.; and Bello, J. P. 2021a. Wav2clip: Learning robust audio representations from clip. arXiv preprint arXiv:2110.11499.

Wu, J.; Ouyang, L.; Ziegler, D. M.; Stiennon, N.; Lowe, R.; Leike, J.; and Christiano, P. 2021b. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.

Xu, J.; Mei, T.; Yao, T.; and Rui, Y. 2016. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5288–5296.

Xu, H.; Ghosh, G.; Huang, P.-Y.; Okhonko, D.; Aghajanyan, A.; Metze, F.; Zettlemoyer, L.; and Feichtenhofer, C. 2021. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084.

Yang, Z.; Gan, Z.; Wang, J.; Hu, X.; Lu, Y.; Liu, Z.; and Wang, L. 2021. An empirical study of gpt-3 for few-shot knowledge-based vqa. arXiv preprint arXiv:2109.05014.

Yu, Y.; Kim, J.; and Kim, G. 2018. A joint sequence fusion model for video question answering and retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), 471–487.

Zellers, R.; Lu, J.; Lu, X.; Yu, Y.; Zhao, Y.; Salehi, M.; Kusupati, A.; Hessel, J.; Farhadi, A.; and Choi, Y. 2022. Merlot reserve: Neural script knowledge through vision and language and sound. arXiv preprint arXiv:2201.02639.

Zeng, A. 2019. Learning visual affordances for robotic manipulation. Ph.D. Dissertation, Princeton University.

Zhai, X.; Wang, X.; Mustafa, B.; Steiner, A.; Keysers, D.; Kolesnikov, A.; and Beyer, L. 2021. Lit: Zero-shot transfer with locked-image text tuning. arXiv preprint arXiv:2111.07991.

Zhao, Y.; Hessel, J.; Yu, Y.; Lu, X.; Zellers, R.; and Choi, Y. 2021. Connecting the dots between audio and text without parallel data through visual knowledge transfer. arXiv preprint arXiv:2112.08995.

Zhou, B.; Khosla, A.; Lapedriza, A.; Torralba, A.; and Oliva, A. 2016. Places: An image database for deep scene understanding. arXiv preprint arXiv:1610.02055.