National University of Computer and Emerging Sciences
RAG and Prompt Engineering
AI-4009 Generative AI
Dr. Akhtar Jamil
Department of Computer Science
04/23/2025 Presented by Dr. AKHTAR JAMIL 1
Goals
• Recap of Previous Lecture
• Today’s Lecture
– Retrieval Augmented Generation (RAG)
– Prompt Engineering
Retrieval-Augmented Generation (RAG)
• Retrieval-Augmented Generation (RAG) is a technique
that enhances language model generation by
incorporating external knowledge.
• It works by retrieving relevant information from an external database and using that information to inform the generation process.
• Useful for Generative AI applications over private or custom datasets
• Use-cases include chatbots, question answering, and research and analysis.
Retrieval-Augmented Generation (RAG)
• LLMs are trained on a vast amount of textual data
• Their capabilities are based on the knowledge they
acquire from this data.
– Knowledge Cut-off
• Can’t answer a question about data that is not part of
their training set
– May not be able to respond accurately
• May give a refusal (where the LLM responds with “I don’t
know”)
• Or worse, a hallucination.
Limitations of current LLMs: Hallucination
• What is a hallucination?
– The model generates text that is factually incorrect or nonsensical with respect to the provided source content
• Possible causes of hallucination:
– Data: flawed data sources and poor utilization of the factual knowledge captured in the data
– Training: architecture flaws and misalignment
– Inference: inherent randomness of decoding strategies and imperfect decoding representations
RAG Pipeline
• Steps of RAG (shown as a pipeline figure in the original slides): ingest documents (chunking, embedding, storing in a vector DB), retrieve the top-K chunks for a query, and generate a response with the LLM.
QA with Retrieval Augmented Generation
[Figure: RAG question-answering pipeline. Documents and web pages are split into chunks and embedded; the embeddings are stored in a vector DB. The query ("How many parameters are there in Grok-0?") is embedded and used to search the DB; the top-K chunks form the context inserted into a prompt template for the LLM, which produces the answer.]
Retrieved context: "[...] After announcing xAI, we trained a prototype LLM (Grok-0) with 33 billion parameters. This early model approaches LLaMA 2 (70B) capabilities on standard LM benchmarks but uses only half of its training resources. [...]"
Answer: "Grok-0, the prototype LLM mentioned in the provided context, is stated to have been trained with 33 billion parameters."
Retrieval Augmented Generation: Ingestion
• Chunking
• Chunking is the process of dividing the prompts and/or the documents to be retrieved into smaller, manageable segments or chunks.
• Chunks can be defined by a fixed size (e.g., a specific number of characters) or by natural units such as sentences or paragraphs.
• Each chunk is encoded into an embedding vector for retrieval.
• Smaller, more precise chunks lead to a finer match between the user’s query and the content
• Enhancing the accuracy and relevance of the information retrieved.
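The fixed-size chunking described above can be sketched as follows (a minimal illustration; the chunk size and overlap values are arbitrary choices, not values from the slides):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks with overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "RAG retrieves relevant chunks from a knowledge base. " * 20
chunks = chunk_text(doc, chunk_size=200, overlap=50)
```

Overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which helps retrieval at chunk edges.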
Retrieval Augmented Generation: Ingestion
• Embeddings
• Both the user’s query (prompt) and the documents in the knowledge base are transformed into a representation that can be effectively compared for relevance.
• These representations are called embeddings.
• Embeddings are essential for retrieving the most relevant information from the knowledge base in response to a user query.
• There are many embedding models; select the one best suited for your task
– See HuggingFace’s Massive Text Embedding Benchmark (MTEB) leaderboard.
Retrieval Augmented Generation: Ingestion
• Sparse embedding:
• A method of representing words or phrases in a high-dimensional space where most of the vector elements are zeros.
• E.g., TF-IDF vectors are great for lexical (word-level) matching of the prompt with the documents.
• Computationally less intensive, but may not capture the deeper semantic meaning of the text.
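As an illustration, sparse vectors can be built by counting terms and weighting by inverse document frequency; most entries over the full vocabulary are zero. This is a toy implementation, not a production TF-IDF:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy TF-IDF: returns one sparse dict {term: weight} per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

vecs = tfidf_vectors(["polar bears feed on fish",
                      "glaciers melt as polar bears starve"])
```

Terms that occur in every document get weight log(1) = 0, so only distinctive terms carry weight in the lexical match.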
Retrieval Augmented Generation: Ingestion
• Semantic embedding: Semantic embeddings, such as BERT or SentenceBERT, are more suitable for RAG use cases.
• BERT: suitable for capturing contextual information in both the documents and queries.
• Requires more computational resources
• Offers more semantically rich embeddings.
Sentence BERT
• SentenceBERT: ideal for scenarios where context and meaning at the sentence level are important.
• This is usually the preferred choice for RAG applications
• For example: BERT may not easily distinguish between sentences like "Polar bears endangered as glaciers melt" and "Polar bears feed on fish"
• SentenceBERT can better understand sentence meanings
Sentence BERT
• BERT is a powerful pre-trained network for Masked Language Modeling and Next Sentence Prediction.
• However, BERT faces a challenge in creating detailed sentence embeddings.
• Sentence-BERT makes sentence embedding generation more straightforward.
• Uses Siamese architectures, which effectively produce deeply meaningful sentence embeddings, boosting its performance.
• Uses a Siamese network and triplet loss
SentenceBERT
[Figure]
Sentence BERT
• Bi-Encoder: This encoder processes sentences independently,
resulting in two distinct embeddings
• Embeddings are then compared to determine sentence similarity
– Cosine similarity.
• Cross-Encoder: This encoder processes sentences together
through the Transformer network.
– It outputs a similarity score ranging from 0 to 1.
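The bi-encoder comparison can be illustrated with plain vectors and cosine similarity (the embedding values below are made-up numbers, not the output of a real model):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical sentence embeddings produced independently by a bi-encoder
emb_query = np.array([0.2, 0.9, 0.1])
emb_doc = np.array([0.25, 0.85, 0.05])
score = cosine_similarity(emb_query, emb_doc)
```

Because the bi-encoder embeds each sentence independently, document embeddings can be precomputed and stored; a cross-encoder must re-run the Transformer for every query-document pair.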
Sentence BERT
[Figure: bi-encoder vs cross-encoder]
Sentence BERT
• Sentence-BERT improves upon the original BERT model by
integrating a pooling step with its outputs.
• This addition transforms the outputs into consistent-sized
sentence embeddings.
Which Embedding Model to use?
• You can check the best-suited model, e.g., on HuggingFace’s MTEB:
• Massive Text Embedding Benchmark (MTEB) Leaderboard
Retrieval Augmented Generation: Response
• Response:
• The model synthesizes the retrieved information with its pre-trained knowledge to generate coherent and contextually relevant responses.
• The generated response should be not only informative but also aligned with the user’s original query, maintaining a natural and conversational tone.
• Creating an expanded prompt (with the retrieved top-k chunks) for an LLM can enhance the RAG system’s effectiveness
• And make the system more performant.
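The expanded prompt can be sketched as follows (the template wording and helper name are illustrative assumptions, not from the slides):

```python
def build_rag_prompt(query, top_k_chunks):
    """Assemble an expanded prompt from the query and retrieved chunks."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(top_k_chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_rag_prompt(
    "How many parameters are there in Grok-0?",
    ["After announcing xAI, we trained a prototype LLM (Grok-0) "
     "with 33 billion parameters."],
)
```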
Vector DB: introduction
• A vector database stores vectors of fixed dimensions
(called embeddings)
• We can then query the database for the embeddings closest to a given query vector
– Using a distance metric (such as cosine similarity or Euclidean distance)
• Can use KNN (K Nearest Neighbor) algorithm or another
similarity search algorithm.
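A brute-force KNN search over stored embeddings, as described above (production vector DBs use approximate nearest-neighbour indexes, but the principle is the same):

```python
import numpy as np

def knn_search(query, stored, k=2):
    """Return indices of the k stored vectors closest to query (cosine)."""
    stored = np.asarray(stored, dtype=float)
    q = np.asarray(query, dtype=float)
    sims = stored @ q / (np.linalg.norm(stored, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:k].tolist()

db = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(knn_search([1.0, 0.05], db, k=2))  # → [0, 1]
```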
Vector DB: introduction
[Figure]
Vector DB: introduction
[Figure: ingestion and retrieval with a vector DB. Documents and web pages are split into chunks and embedded; the embeddings are stored in a vector DB. The query ("How many parameters are there in Grok-0?") is embedded and searched against the DB to obtain the top-K context chunks.]
Example chunk: "[...] After announcing xAI, we trained a prototype LLM (Grok-0) with 33 billion parameters. This early model approaches LLaMA 2 (70B) capabilities on standard LM benchmarks but uses only half of its training resources. [...]"
Prompt Engineering
What is prompt engineering?
Find an effective way to communicate with AI tools
What are prompts?
• Prompts involve instructions and context passed to a language model to achieve a desired task
• Prompt engineering is the practice of developing and optimizing prompts to efficiently use language models (LMs) for a variety of applications
• Prompt engineering is a useful skill for AI engineers and researchers to improve and efficiently use language models
What is prompt engineering?
Prompt engineering is a process of creating a set of prompts, or questions, that are used to guide the user toward a desired outcome. It is an effective tool for designers to create user experiences that are easy to use and intuitive. This method is often used in interactive design and software development, as it allows users to easily understand how to interact with a system or product.
Why Prompt Engineering?
• Why learn prompt engineering?
• Important for research, discoveries, and advancement
• Helps to test and evaluate the limitations of LLMs
• Enables all kinds of innovative applications on top of LLMs
Source: https://jobs.lever.co/Anthropic/e3cde481-d446-460f-b576-93cab67bd1ed
First Basic Prompt
Instruction: The sky is
Response: blue. The sky is a beautiful blue color during the day. The blue of the sky is created by the Earth’s atmosphere scattering the sun’s light. The blue is a result of the air molecules in the atmosphere reflecting the shorter wavelength of blue light back to our eyes.
Model: “text-davinci-003”, temperature: 0.7, top-p: 1
Elements of a Prompt
• A prompt is composed with the following
components:
• Instructions
• Context
• Input data
• Output indicator
Classify the text into neutral, negative or positive
Text: I think the food was okay.
Sentiment:
Settings to keep in mind
• When prompting a language model you should keep in mind a few
settings
• You can get very different results with prompts when using
different settings
• One important setting is controlling how deterministic the model is
when generating completion for prompts
• Temperature and top_p are two important parameters to keep in
mind
• Generally, keep these low if you are looking for exact answers
Settings to keep in mind
• The term "temperature" refers to a parameter that controls the randomness of the model's responses.
• It is a scaling factor applied to the logits
• Keep it high if you are looking for more diverse responses
• Zero temperature: effectively turns the model into a greedy algorithm that always chooses the most likely next word based on the model's training.
• Top-p (Nucleus Sampling): focuses randomness on a subset of the most likely next tokens.
– Instead of sampling from the entire distribution, it selects the smallest set of words whose cumulative probability exceeds the threshold p, and then samples only from this set.
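The two settings can be illustrated with a toy sampler over raw logits (an illustrative sketch, not any particular API):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Temperature scaling followed by nucleus (top-p) sampling."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:              # greedy: always the most likely token
        return int(np.argmax(logits))
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    order = np.argsort(-probs)        # most likely first
    cumulative = np.cumsum(probs[order])
    # smallest prefix whose cumulative probability reaches top_p
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))
```

Low temperature sharpens the distribution toward the argmax; low top-p shrinks the candidate set, so both reduce diversity.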
Designing Prompts for Different Tasks
• In the next few slides, we will cover a few
examples of common tasks using different
prompts
• Tasks covered:
• Text Summarization
• Question Answering
• Text Classification
• Role Playing
• Code Generation
• Reasoning
Text Summarization
Context: Antibiotics are a type of medication used to treat bacterial infections. They work by either killing the bacteria or preventing them from reproducing, allowing the body’s immune system to fight off the infection. Antibiotics are usually taken orally in the form of pills, capsules, or liquid solutions, or sometimes administered intravenously. They are not effective against viral infections, and using them inappropriately can lead to antibiotic resistance.
Instruction: Explain the above in one sentence:
Response: Antibiotics are medications used to treat bacterial infections by either killing the bacteria or stopping them from reproducing, but they are not effective against viruses and overuse can lead to antibiotic resistance.
Question Answering
Answer the question based on the context below. Keep the
answer short and concise. Respond "Unsure about answer"
if not sure about the answer.
Context: Teplizumab traces its roots to a New Jersey drug
company called Ortho Pharmaceutical. There, scientists
generated an early version of the antibody, dubbed OKT3.
Originally sourced from mice, the molecule was able to
bind to the surface of T cells and limit their cell-killing potential. In 1986, it was approved to help
prevent organ rejection after kidney transplants, making
it the first therapeutic antibody allowed for human use.
Question: What was OKT3 originally sourced from?
Answer: Mice.
Context obtained from: https://www.nature.com/articles/d41586-023-00400-x
Text Classification
Classify the text into neutral, negative or positive.
Text: I think the food was okay.
Sentiment: Neutral
Role Playing
The following is a conversation with an AI research assistant. The assistant's tone is technical and scientific.
Human: Hello, who are you?
AI: Greetings! I am an AI research assistant. How can I help you today?
Human: Can you tell me about the creation of black holes?
AI: Sure! Black holes are regions of spacetime where the
gravitational force is so strong that nothing, not even
light, can escape from it. They are created when a very
massive star dies and its core collapses in on itself,
forming a singularity of infinite density. The intense
gravity of the singularity pulls in all the matter and
radiation around it, creating the black hole.
Code Generation
"""
Table departments, columns = [DepartmentId, DepartmentName]
Table students, columns = [DepartmentId, StudentId, StudentName]
Create a MySQL query for all students in the Computer Science Department
"""
SELECT StudentId, StudentName
FROM students
WHERE DepartmentId IN (SELECT DepartmentId FROM departments WHERE DepartmentName = 'Computer Science');
Reasoning
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
Solve by breaking the problem into steps. First, identify the odd numbers, add them, and indicate whether the result is odd or even.
Odd numbers: 15, 5, 13, 7, 1
Sum: 41
41 is an odd number.
Prompt Engineering Techniques
• Many advanced prompting techniques have
been designed to improve performance on
complex tasks
• Few-shot prompts
• Chain-of-thought (CoT) prompting
• Self-Consistency
Few-shot prompts
• Few-shot prompting is a method used with large language models
(LLMs) like GPT-3
• The model is provided with a few examples to demonstrate a task
before being asked to perform a similar task.
• It leverages the model's pre-trained knowledge and its ability to generalize from a few examples to understand the task at hand and produce the desired output.
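A few-shot prompt can be assembled programmatically; the sentiment demonstrations below are illustrative, not from the slides:

```python
def few_shot_prompt(examples, query):
    """Build a few-shot prompt from (input, label) demonstration pairs."""
    demos = "\n".join(f"Text: {text}\nSentiment: {label}\n"
                      for text, label in examples)
    return f"{demos}Text: {query}\nSentiment:"

prompt = few_shot_prompt(
    [("I loved the movie!", "Positive"),
     ("The service was terrible.", "Negative")],
    "I think the food was okay.",
)
```

The model completes the final `Sentiment:` line by generalizing from the demonstrated pattern.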
Few-shot prompts
[Figure]
Chain-of-Thought Prompting
• A commonly used method
• Introduces chain-of-thought prompting as a novel method.
• Significantly improves LLMs' performance on complex reasoning tasks.
– Arithmetic reasoning
– Commonsense reasoning
– Symbolic reasoning
Traditional Prompting vs Chain-of-Thought
• Traditional Prompting
– In-context few-shot learners
– Works poorly on tasks that require reasoning abilities
– Does not improve with increasing language model scale
• Chain-of-thought
– A series of intermediate reasoning steps
– Mimics human step-by-step thought
Chain-of-thought Prompting Principles
• Step-by-Step Guidance - breaking complex problems into smaller steps
– “Let’s do it step by step.”
• Explicit Reasoning Chains - creating prompts that detail the reasoning process step by step
– “Consider the dangers of lightning. Next, evaluate the safety of being outdoors vs. indoors. Conclude with the safest option.”
• Use of Examples - providing solved examples with detailed reasoning
– “If A is bigger than B and B is bigger than C, then A is the biggest. For example, if A = 5, B = 3, and C = 2, then 5 is the biggest.”
• Iterative Refinement - refining prompts based on model output for clarity and effectiveness
– "Consider the relationship between A, B, and C step by step, starting with comparing A and B.”
Attractive Properties of CoT
• Enables decomposition of multi-step problems.
• Provides interpretable insights into the LLM's reasoning.
– Shows how the answer was derived
– Easy to debug when the reasoning path goes wrong
• Broad applicability across diverse reasoning tasks.
– (Any) problem solvable via language.
• Simple and off-the-shelf
– No additional training or fine-tuning needed
Evaluation - Arithmetic Reasoning
• Benchmark:
– Five math word problem benchmarks: GSM8K, SVAMP, ASDiv, AQuA, MAWPS
– “Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000
in repairs. This increased the value of the house by 150%. How much profit did he make?”
(GSM8K)
• Baseline: Standard prompting with in-context exemplars
• CoT: 8 few-shot exemplars with chains of thought for prompting
• Language models:
– GPT-3
– LaMDA
– PaLM
– UL2 20B
– Codex
Evaluation - Arithmetic Reasoning
• CoT is an emergent ability of model scale
– Little effect on small models
• Harder problems see greater improvement
– Performance doubled on GSM8K
• Error analysis on LaMDA 137B for GSM8K
– 46% - calculator errors, symbol mapping errors
– 54% - semantic misunderstanding
• Scaling PaLM 62B to PaLM 540B fixes many of these errors
Limitation of CoT
• Emulate the thought process, but not necessarily
reasoning
• Cost of manually augmenting exemplars with CoTs
• Emergence of CoT reasoning only at large model scale
• No guarantee of correct reasoning paths
Self-Consistency Improves Chain of Thought Reasoning in LLMs
• Analogous to the human way of thinking:
– If multiple different ways of thinking lead to the same answer, one has
greater confidence that the final answer is correct.
– Complex reasoning tasks typically admit multiple reasoning paths that
reach a correct answer.
• Correct reasoning processes tend to have greater agreement in
their final answer than incorrect processes.
Traditional CoT vs CoT w/ Self Consistency
[Figure]
3 steps of Self Consistency
1. Prompting LLM with CoT
2. Sampling from the LLM decoder to get diverse reasoning paths
3. Selecting the most consistent answer (e.g., by majority voting)
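The steps above can be sketched with simple majority voting (the sampled answers below are hypothetical):

```python
from collections import Counter

def self_consistency(answers):
    """Pick the most frequent final answer among sampled reasoning paths."""
    return Counter(answers).most_common(1)[0][0]

# hypothetical final answers parsed from several sampled CoT completions
sampled = ["18", "26", "18", "18", "26"]
print(self_consistency(sampled))  # → 18
```

In practice, each answer is extracted from one sampled chain-of-thought completion before voting.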
Another example
[Figure]
Understanding the Consistency
• Decomposition of the decoder output into
– Reasoning path
– Final answer
• Example
– Commonsense reasoning
• Q: Yes or no: Would a pear sink in water?
• Reasoning path: Considering that pears have a density less than water, we can infer that a pear would float in water.
• Final answer: The answer is yes, a pear would float in water.
– Symbolic reasoning
• Q: Take the last letters of the words in “Lady Gaga” and concatenate them.
• Reasoning path: The last letter of “Lady” is “y”. The last letter of “Gaga” is “a”.
• Final answer: The answer is “ya”.
Understanding the Consistency
[Figure]
Self-Refine: Iterative Refinement with Self-Feedback
• Alternates between two generative steps: FEEDBACK and REFINE.
• Three prompts are needed during the entire process
– initial generation, feedback, refinement
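The FEEDBACK/REFINE loop can be sketched as follows (the `llm` callable, the prompt templates, and the stop token are placeholders for illustration, not the paper's actual prompts):

```python
def self_refine(task, llm, max_iters=3, stop_token="DONE"):
    """Iteratively improve an output via FEEDBACK and REFINE prompts.

    `llm` is a stand-in callable (prompt -> text).
    """
    # initial generation
    output = llm(f"Task: {task}\nProduce an initial answer.")
    for _ in range(max_iters):
        # FEEDBACK step: critique the current answer
        feedback = llm(f"Task: {task}\nAnswer: {output}\n"
                       f"Give feedback, or say {stop_token} if it is good.")
        if stop_token in feedback:
            break
        # REFINE step: rewrite the answer using the feedback
        output = llm(f"Task: {task}\nAnswer: {output}\n"
                     f"Feedback: {feedback}\nRewrite the answer.")
    return output
```

The same model plays all three roles; only the prompt changes between generation, feedback, and refinement.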
Self-Refine - Big picture
[Figure]
Thank You