Whisper-tiny was finetuned to transcribe audio sequences into text. The finetuning code can be found here.
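A minimal sketch of what such a finetune looks like with Hugging Face Transformers is shown below; the dataset columns (`audio`, `text`) and the hyperparameters are illustrative assumptions, not the repo's actual configuration:

```python
# Hedged sketch of finetuning Whisper-tiny for speech-to-text; the dataset
# columns ("audio", "text") and hyperparameters are assumptions.
from transformers import (WhisperProcessor, WhisperForConditionalGeneration,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

def preprocess(batch):
    # Turn raw waveforms into log-mel input features and tokenize the transcript.
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-tiny-finetuned",
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    max_steps=1000,
)
# trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=...,
#                          data_collator=...)  # a padding collator is required
# trainer.train()
```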
An encoder-decoder model made up of two BERTs (one as the encoder and one as the decoder) was finetuned to extract keywords from text. Finetuned weights from a Bert2Bert summarization task were used to warm-start training, which sped up the finetune process and mitigated the performance issues caused by the small number of training examples. The finetuning code can be found here.
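As a rough illustration of the warm-start idea, the summarization weights can be loaded directly into an `EncoderDecoderModel` rather than pairing two freshly initialized BERTs; the checkpoint name below is an assumption, not the repo's actual weights:

```python
# Hedged sketch: warm-start a Bert2Bert encoder-decoder from a summarization
# checkpoint (the checkpoint name is an assumption, not the repo's weights).
from transformers import EncoderDecoderModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Summarization weights converge faster than two raw BERTs when
# keyword-extraction training examples are scarce.
model = EncoderDecoderModel.from_pretrained(
    "patrickvonplaten/bert2bert-cnn_dailymail-fp16"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Keyword extraction then trains as ordinary seq2seq: the input text is the
# encoder source and the concatenated keywords are the decoder target.
```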
An end-to-end vision-language model, OWLv2, was finetuned with guidance from the sentence encoder all-MiniLM-L6-v2 to perform open-vocabulary object detection. The finetuning code can be found here and is adapted from resources taken from here.
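One plausible way the sentence encoder can guide detection (an assumption about the mechanism, not necessarily what the repo does) is to embed the OWLv2 text prompts with all-MiniLM-L6-v2 and score them against a free-form query phrase:

```python
# Hedged sketch: OWLv2 inference with all-MiniLM-L6-v2 scoring the prompts;
# the image path, prompts, and query phrase are placeholders.
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection
from sentence_transformers import SentenceTransformer, util

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
detector = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

image = Image.open("example.jpg")  # placeholder image
prompts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=[prompts], images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detector(**inputs)

# Convert raw logits to boxes and labels in image coordinates.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)

# Guidance idea: weight each prompt by its sentence-embedding similarity to a
# query phrase, favoring detections that match the query semantics.
query_emb = encoder.encode("domestic animal", convert_to_tensor=True)
prompt_embs = encoder.encode(prompts, convert_to_tensor=True)
prompt_weights = util.cos_sim(query_emb, prompt_embs)
```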
Formatted training and testing data in a 0.75/0.25 split are also provided in train_caption_vlm.jsonl and test_caption_vlm.jsonl respectively. Data can also be formatted from the original vlm.jsonl with dataset_converter.py.
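The conversion step can be sketched as follows; the split logic is an assumption about what dataset_converter.py does, based only on the 0.75/0.25 ratio above:

```python
# Hedged sketch of the vlm.jsonl -> train/test conversion; dataset_converter.py
# may differ (e.g. in seeding or record handling).
import json
import random

random.seed(0)  # fixed seed so the split is reproducible
with open("vlm.jsonl") as f:
    records = [json.loads(line) for line in f]

random.shuffle(records)
cut = int(0.75 * len(records))  # 0.75/0.25 train-test split
splits = {"train_caption_vlm.jsonl": records[:cut],
          "test_caption_vlm.jsonl": records[cut:]}
for path, subset in splits.items():
    with open(path, "w") as f:
        for rec in subset:
            f.write(json.dumps(rec) + "\n")
```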