LLM Multimodal Guidelines

The document outlines guidelines for creating prompts for a Multimodal Large Language Model (LLM) that processes multiple data types, such as text and images. It emphasizes the importance of crafting relevant, varied, and complex queries that require the model to analyze the content of images for accurate responses. Additionally, it provides specific instructions and considerations for writing effective queries, including avoiding basic prompts and ensuring proper language use.

Uploaded by

monkey0luffy237

Multimodal Guidelines
Date Created: 08/09/2024

Table of Contents
Introduction to LLMs (Large Language Models)
Project Overview
Introduction to Multimodal
Diagram for Multimodal processing
Instructions to the Task
DO's and DON'Ts

Introduction to LLMs (Large Language Models)


Large Language Models (LLMs) are advanced artificial intelligence (AI) systems trained on vast amounts of data
to understand and generate human-like text. These models are capable of performing various tasks such as text
generation, summarization, translation, and more.

Project Overview
A Multimodal LLM is a type of AI system that can process multiple types of data simultaneously. Consider the
different ways we learn, such as reading, seeing pictures, and hearing sounds. LLMs can now learn from all
these sources, which helps them perform tasks in a more human-like way.

For example, imagine you have an LLM chatbot built into a personal device such as smart glasses, which
lets you interact with it while on the go. Instead of being limited to typing text into a web browser, you
can ask it about something you see in the real world by “showing” it to the model through the device's
visual or audio input. By combining these modes, multimodal LLMs can improve the accuracy and
naturalness of their responses.

Introduction to Multimodal
In this task, you will be creating prompts/queries about a provided image. A prompt or query is the instruction
or question given to the chatbot. These queries will be used to test and train a multimodal LLM chatbot. By
collecting a large variety of prompts/queries, we can teach the model to understand the context of an image as
well as the best ways to respond to various instructions and questions about it. These prompts need to be
questions that people might feasibly ask this model about the image.

Prompts need to be relevant to the image, sound natural and conversational, and be varied in form and
content. This requires the contributor to use good reasoning and creative thinking skills to produce a variety of
prompts that are neither repetitive nor overly simple. Basic prompts such as “describe this photo” or “how
many jars are in this picture?” are not helpful when trying to train models to process more involved
knowledge-based queries because these can be answered with basic image recognition technology. Instead,
we want to teach the model to “think” about what is in an image before it responds.

This content is for internal use only


For example, for an image of the Statue of Liberty, a good prompt could be, “What time does it open for
visitors?” Another prompt for the same image could be, “Is there a gift shop inside?”

As another example, for an image of a plant, a good prompt might be, “How much sunlight should this
get?”

Diagram for Multimodal processing


This simulator uses the “Text and Image as input” functionality for multimodal processing and returns
“Text” as output.

Instructions to the Task


1. Review the image provided on the left-hand side under “Judgment.” Consider what kind of real-world
situation you might be in when looking at the contents of the image.
2. Type three independent queries about the image into the prompt boxes. Each query must adhere to
the criteria outlined in the “Important Considerations when Writing Queries” section below.
3. After proofreading your three queries, click the “Test Validation” button.
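
As a rough illustration of the proofreading step, the sketch below checks that exactly three non-empty queries are present and flags pairs whose wording is nearly identical. It is not part of the simulator: the function names and the 0.8 similarity threshold are assumptions made for this example, and surface text similarity is only a crude proxy for whether queries are truly independent.

```python
from difflib import SequenceMatcher
from itertools import combinations

REQUIRED_QUERIES = 3    # step 2 asks for three queries
SIMILARITY_LIMIT = 0.8  # hypothetical threshold, not taken from the guidelines


def too_similar_pairs(queries: list[str]) -> list[tuple[int, int]]:
    """Return index pairs of queries whose wording is nearly identical."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(queries), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= SIMILARITY_LIMIT:
            pairs.append((i, j))
    return pairs


def ready_to_submit(queries: list[str]) -> bool:
    """True for exactly three non-empty queries with no near-duplicates."""
    if len(queries) != REQUIRED_QUERIES:
        return False
    if any(not q.strip() for q in queries):
        return False
    return not too_similar_pairs(queries)
```

A check like this only catches near-verbatim repeats; two queries can be worded differently and still ask the same thing, so a manual read-through is still needed.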

Important Considerations when Writing Queries


1. Queries must be relevant to the image. In other words, they must require the LLM to
analyze/consider the content of the image to provide a valid response. For example, if the image
shows a kitchen table with several ingredients on it, a valid query would be, "What main dish could I
make using the ingredients on this table?"

Queries that the LLM could answer without referencing the image, such as "What is the best cut of
meat for pot roast?" are not acceptable because they do not require the LLM to consider the
image's contents.

2. Queries should be no longer than 40 words.

3. Queries should differ from one another as much as possible.

4. Phrasing should be natural and realistic.

5. Queries should be sufficiently complex—they should not be answerable through basic image
recognition software.

6. Queries should aim to elicit brief responses—realistic questions about the contents of an image do
not typically require long responses. For example, an effective query about an image of the Eiffel
Tower would not be, “Provide a detailed description of the history of this monument.” This would
require the model to provide a lengthy response.

7. Queries should be written with proper punctuation and capitalization and free from spelling/grammar
errors, profanity, or otherwise objectionable text or content.

8. Queries should be natural questions that could be asked in the moment and should not be overly
formal, nor should they address the model by beginning with “hey chatbot.”

9. Queries about the scene or objects in the image should not contain the actual words "in this image."
Imagine you are seeing the image in person, not asking about an image on a screen.

10. Queries need to elicit a text response only. The “multimodal” part of this model refers to the form of
inputs it can receive, not the kind of responses it can create. It cannot perform actions like creating a
reminder, note, or contact, nor can it place calls or send a message. However, it can translate,
summarize, or rewrite a block of text if one exists in the image.
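
Some of the considerations above are mechanical enough to check automatically. The sketch below covers the 40-word limit (consideration 2), the “hey chatbot” opening (consideration 8), and the phrase "in this image" (consideration 9). The capitalization and end-punctuation checks are simplified stand-ins for consideration 7 and are assumptions of this example, as is the function name `check_query`. Relevance, complexity, and naturalness cannot be checked this way and still need human judgment.

```python
MAX_WORDS = 40                   # consideration 2
BANNED_PHRASE = "in this image"  # consideration 9
GREETING = "hey chatbot"         # consideration 8


def check_query(query: str) -> list[str]:
    """Return the rule violations found in one query (empty list if none)."""
    text = query.strip()
    if not text:
        return ["query is empty"]

    problems = []
    lowered = text.lower()
    if len(text.split()) > MAX_WORDS:
        problems.append("longer than 40 words")
    if BANNED_PHRASE in lowered:
        problems.append('contains "in this image"')
    if lowered.startswith(GREETING):
        problems.append('begins with "hey chatbot"')
    # Simplified stand-ins for "proper punctuation and capitalization".
    if text[0].isalpha() and not text[0].isupper():
        problems.append("does not start with a capital letter")
    if text[-1] not in ".?!":
        problems.append("does not end with terminal punctuation")
    return problems
```

For instance, `check_query("What main dish could I make using the ingredients on this table?")` returns an empty list, while a query that opens with "hey chatbot" or mentions "in this image" is reported with the matching messages.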

DO's and DON'Ts


Based on the Task Instructions, here are some practical tips for performing this task effectively:

DO:
• Write prompts that rely on the image to be answered.
• Vary your prompts. Repetitive prompts that ask the same or very similar questions are not helpful.
• Create queries that elicit brief responses.
• Use proper punctuation, capitalization, and spelling.
• Write a query that elicits a text response.

DON’T:
• Write basic prompts. Queries should be sufficiently complex.
• Write prompts that are overly formal or that informally address the model by starting with “hey
chatbot.”
• Include the words “in this image.”

This simulator and these guidelines are intended to introduce you to LLMs. Guidelines and requirements will
vary for every project.
