Responsible AI in Educational Chatbots
Hanna Eriksson
2 Related Work
2.1 Combining or Switching Between Chat Models
2.2 Enhancing the Appropriateness of Chat Models' Responses
3 Theory
3.1 Next.js
3.2 Supabase
3.3 LangChain: Facilitating Seamless Integration of LLMs
3.3.1 Complexity
3.3.2 Interoperability and Integration
3.3.3 Data Input/Output Handling
3.3.4 Scalability
3.3.5 Adaptability and Extensibility
3.4 Strategic Prompt Engineering: Mitigating Inappropriate Content
3.4.1 Zero-shot Prompting
3.4.2 Few-shot Prompting
3.4.3 Chain-of-thought Prompting
3.4.4 Prompt Chaining
4 Implementation
4.1 System Architecture
4.1.1 Frontend
4.1.2 Backend
4.1.3 Internationalization
4.2 Data Management
4.3 User Interface
4.4 Security Measures
4.5 Scalability and Performance Optimization
4.6 Chatbot Functionality
4.7 Integration of LLMs
4.8 Prompt Engineering Implementation
4.8.1 Zero-shot prompting
4.8.2 Few-shot prompting
4.8.3 Chain-of-thought prompting
4.8.4 Prompt chaining
5 Evaluation
5.1 Facilitating Seamless Integration of LLMs
5.1.1 Evaluation Methodology
5.1.2 LangChain: A Modular Approach to LLM Integration
5.2 Mitigating Inappropriate Content
5.2.1 Evaluation Methodology
5.2.2 Strategic Prompt Engineering
6 Discussion
6.1 Reflection on Evaluation Results
6.2 Equality and Equity
6.3 Ethics
6.4 Sustainability
1.1 Background
In this context, AcadeMedia, an important player in the educational sector, recognized the need for solutions to enhance student engagement and support in upper secondary schools (Swedish gymnasiet). To address this need, AcadeMedia sought to develop a chatbot tailored specifically to this purpose. There were two main objectives: to provide teachers with a flexible tool for customizing chatbots according to subject, learning objectives, and language preferences, and to offer students a convenient way to seek assistance and clarifications at any time. Furthermore, AcadeMedia recognized that the rapid evolution of AI technologies and LLMs requires a flexible framework that can accommodate changes in LLMs efficiently. Given that the chatbot is to be used in educational settings, AcadeMedia also stressed the importance of ensuring that the chatbot's content and functionality align with the requirements and constraints of the school environment.
1.2 Motivation
Today, LLMs continue to evolve at such a rapid pace that new models and breakthrough techniques emerge with remarkable frequency, often within months or even weeks [12–14]. Because of this, the ability to seamlessly switch between different chat models is gaining importance. Moreover, it has been shown that combining LLMs can lead to improved performance and more human-like outcomes in various applications [15].
Using AI chatbots in education comes with many considerations and risks. Preventing AI chatbots from teaching or distributing illegal content in educational settings is crucial. Mitigating biases, such as gender bias, is also important to ensure inclusivity and compliance with legal standards [16]. There is also the ethical dilemma of how to handle sensitive student data [17]. Lastly, it is important to consider that chatbots do not always provide correct information [18].
Thus, given the rapid evolution of LLMs and other AI models, and the performance gains achievable by utilizing different LLMs, there arises a need to explore techniques for effortlessly switching between different LLMs. Additionally, it is important to ensure that user-driven rule-setting systems in educational chatbots do not spread inappropriate content.
1.4 Delimitation
This thesis is delimited in several aspects to maintain focus and clarity within the defined scope. Firstly, while adapting prompts to elicit optimal responses from the chat model is a crucial aspect, this thesis does not go into exhaustive methodologies for achieving this optimization. Nor does it address mechanisms for ensuring that the model consistently provides accurate information, although it does touch upon the generation of appropriate responses adapted for school settings. Furthermore, while ensuring uniform learning outcomes for all students is a critical objective, this thesis does not explore methods for tailoring the application to guarantee identical learning experiences for each individual.
Since most LLMs perform best in English, the evaluation in this thesis is conducted exclusively in English, although evaluating the application in multiple languages could be beneficial, especially if it is intended for use in other languages.
Economic factors play a significant role in the deployment and sustainability of educational technologies. However, this thesis does not analyse the economic implications associated with implementing the proposed solution.
Finally, while testing within a classroom environment is essential for validating the effectiveness of the proposed application, this thesis does not include classroom testing. Conducting classroom trials would require time beyond the constraints of this project.
2 Related Work
There is extensive research on developing chatbots, in education as well as in other areas. What is new, however, is the large number of LLMs available today. This opens up exciting new areas of investigation and makes it possible to combine the advantages of different models.
designed the questions or prompts given to ChatGPT to help it understand what was being
asked more accurately. Mungoli also used reinforcement learning, which is a way of training
AI models to improve based on feedback. By combining these techniques, they found that
ChatGPT became much more skilled at providing responses that were not only accurate, but
also relevant to the context of the conversation.
Mungoli concluded that by employing carefully crafted input prompts and fine-tuning Chat-
GPT’s parameters using reinforcement learning algorithms, they were able to achieve more
accurate, relevant, and contextually appropriate responses. Mungoli emphasized that this
combination of techniques has the potential to enhance control and responsiveness in con-
versational AI systems like ChatGPT, thereby improving their performance across various
domains and tasks.
However, despite these advancements, Mungoli noted that there are still challenges to address.
One concern is ensuring that ChatGPT doesn’t provide biased or incorrect responses, which
could potentially cause harm. They expressed the need for more research to address these
ethical considerations and further improve the reliability and usefulness of conversational AI
systems like ChatGPT.
In another study, Chen explored ways to improve the chatting experience with Natural Language Generation (NLG) chatbots [21]. They investigated whether providing users with multiple replies to their utterances, simulating a group chat atmosphere, could reduce the likelihood of inappropriate responses from the chatbots and enhance user satisfaction. Chen concluded that responding with multiple replies could help reduce the problem of NLG chatbots providing inappropriate responses. They found that users tended to pay more attention to appropriate replies and ignore inappropriate ones. Additionally, Chen observed that providing multiple replies led to a better chatting experience compared to offering a single reply.
Furthermore, another study demonstrated that prompt engineering significantly enhances and refines the output of chatbots, stressing its effectiveness in improving the quality of responses [22]. Through a series of experiments using various prompting strategies, ranging from precision prompts to techniques like few-shot and zero-shot learning, Russe et al. evaluated how these approaches adapted LLMs to new tasks without requiring additional training of the base model. Russe et al. conclude that prompt engineering is important for maximizing the potential of LLMs for specialized tasks, especially within medical domains such as radiology. They point out that prompt engineering not only improves and refines the output of LLMs, but also plays an important role in optimizing these models for specialized applications. Despite encountering challenges, Russe et al. assert that prompt engineering is essential for the continued advancement of LLMs. They anticipate that as these models evolve, techniques such as few-shot learning, zero-shot learning, and embedding-based retrieval mechanisms will be crucial for adapting outputs to specific tasks.
3 Theory
This chapter discusses the technical choices that form the foundation of the AI assistant, including Next.js and Supabase, before turning to the theoretical frameworks shaping its implementation. It covers LangChain's capabilities for using different chat models and for creating and managing prompt chains, as well as the methodology of prompt engineering used to address the challenge of keeping responses appropriate for school settings.
3.1 Next.js
When choosing a framework for this project, Django and Next.js were considered. While Django offers extensive features, such as its built-in admin interface, an ORM for seamless database integration, and an authentication system, it primarily focuses on backend development [23]. For this specific application, where real-time updates and fast initial page loads are crucial, Next.js was the preferred choice. Moreover, Next.js offers automatic code splitting, hot reloading, and TypeScript support out of the box, which enhances productivity and code maintainability. Thus, despite Django's strengths in backend development and API management, the project's emphasis on frontend development and real-time capabilities led to the choice of Next.js.
An advantage of using Next.js is its integration with React, which allows developers to leverage powerful features like hooks to manage state and side effects within components.
Figure 1: (a) The useEffect hook in React. (b) The useState hook in React.
React's useEffect and useState hooks are very useful in components where the user can do something that affects the component itself or another component associated with the same parent component. The useEffect hook runs its action only when one of the parameter values it depends on has changed [24]. If no parameter is provided, it runs the action once, when the component is mounted for the first time (Figure 1a). The useState hook provides an internal state to the component. Upon the initial load, it initializes the state using the data passed as a parameter to the hook [25]. It returns two values: the current value from the state and a method to update the state value. When the component triggers an update via the setState method, the hook returns the new value instead of the initial one. However, if the component triggers a re-render for reasons other than directly calling the setState method, such as receiving new props or context changes, the useState hook will not update its state based on those changes. Instead, it retains the original value that was provided as the initial state parameter, even if the value of that parameter has changed (Figure 1b). Essentially, useState only updates its state when explicitly instructed via the setState method, and not automatically based on changes to the initial state parameter.
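The retained-initial-value behavior described above can be illustrated with a small, self-contained model. This is not React's actual implementation, only a toy sketch of the semantics: the state slot is written once on the first render, and afterwards it changes only through the setter.

```javascript
// Toy model of useState's semantics (NOT React's real implementation):
// the initial value is captured on the first render only; later renders
// ignore a changed initial parameter unless setState was called.
function createHookSlot() {
  let stored;
  let initialized = false;

  // simulates one render of a component calling useState(initialValue)
  function render(initialValue) {
    if (!initialized) {
      stored = initialValue; // read only on the very first render
      initialized = true;
    }
    const setState = (next) => { stored = next; };
    return [stored, setState];
  }
  return render;
}

const render = createHookSlot();
const [first, setValue] = render("a"); // first render: state is "a"
const [second] = render("b");          // re-render with new param: still "a"
setValue("c");                         // explicit update via the setter
const [third] = render("b");           // next render now sees "c"
```

Here `second` remains "a" even though the parameter changed, and `third` becomes "c" only because the setter was called explicitly.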
3.2 Supabase
Selecting the right database is crucial for optimal performance, scalability, and security.
MongoDB is known for its flexibility in handling semi-structured data, which might have
been suitable for this application. However, with the structured nature of the chatbot data
and the need for strong ACID compliance, a relational database was the more appropriate
choice rather than a document database. PostgreSQL was chosen for its adherence to ACID
properties, security framework, and compliance with privacy regulations [26].
Supabase, a managed service built on PostgreSQL, was selected for its competitive performance, cost-effectiveness, and robust set of database features, including high availability, backup, point-in-time recovery, read replicas, and security measures like SOC2 and HIPAA compliance [27]. SOC2 is a standard for service organizations that specifies how to manage customer data [28]. The standard is based on the Trust Services Criteria, and covers
security, availability, processing integrity, confidentiality, and privacy. On the other hand,
the Health Insurance Portability and Accountability Act (HIPAA) is a federal law in the
United States established to enforce uniform standards for safeguarding individuals’ medical
records and personal health details [29]. Adherence to HIPAA regulations is mandatory for
companies developing applications handling sensitive healthcare data, ensuring security and
confidentiality of patients’ information.
3.3.1 Complexity
LangChain connects LLMs with external sources, enabling developers to chain commands together for specific tasks or answers. However, LangChain has been criticized for its complexity, difficulty in debugging, and lack of customization options [31]. This has led to the emergence of several open-source alternatives, each with its own unique features and purposes. Despite these alternatives, LangChain remains a compelling choice for developers seeking a powerful framework for building language model applications. While some alternatives may offer simpler solutions, LangChain's comprehensive feature set provides several key advantages.
Firstly, LangChain offers broad and advanced capabilities in languages like Python, JavaScript, and TypeScript. Its modular design allows for easy customization and integration with language models and natural language processing (NLP) applications [32]. Additionally, LangChain's open-source nature invites collaboration and modification from the developer community, promoting innovation and improvement over time. Despite lacking certain features, such as OAuth support or IP-based access control, LangChain's strength lies in its ability to simplify the development of generative AI applications.
So, while LangChain may have a steeper learning curve, its robust features and versatile applications make it a compelling choice for those seeking advanced AI development capabilities.
3.3.3 Data Input/Output Handling
One challenge in integrating LLMs seamlessly into web applications is managing data input and output formats. LangChain addresses this challenge by providing mechanisms for handling data input/output operations [33]. Developers can leverage LangChain's features to ensure that data is properly formatted and compatible with the input requirements of different LLMs. LangChain offers support for various data formats and protocols, simplifying the process of managing data input and output. Through its data connectors and preprocessors, developers can ensure that the data fed into LLMs is properly formatted, enhancing the models' accuracy and performance [33].
Additionally, LangChain's output handlers play an important role in processing and interpreting the results generated by LLMs [33]. These handlers ensure consistency and usability within the web application by providing structured and easy-to-use outputs. By effectively managing data input and output, LangChain enables developers to seamlessly integrate LLMs into web applications.
3.3.4 Scalability
LangChain's support for asynchronous processing enables efficient task management, allowing time-consuming operations to be executed in the background while maintaining responsiveness for real-time user interactions [33]. This capability directly contributes to the effortless integration of LLMs by preventing performance bottlenecks, especially during periods of high user demand. Additionally, LangChain's integration with cloud services and serverless architectures gives developers the flexibility to leverage the elastic scalability offered by cloud providers [33]. By deploying LangChain components on platforms like AWS Lambda or Google Cloud Functions, developers can automatically scale resources based on demand, eliminating the need for manual intervention in resource provisioning and scaling. This ensures that the web application can dynamically adjust to varying workloads.
3.3.5 Adaptability and Extensibility
Finally, LangChain's adaptability and extensibility are also important aspects that contribute to the seamless integration of various LLMs into web applications.
LangChain's modular architecture and flexible design allow developers to adapt and extend the framework according to the specific requirements of their applications [33]. This adaptability enables developers to incorporate new LLMs or update existing ones without significant modifications to the underlying infrastructure. By providing an adaptable platform that can accommodate diverse use cases and evolving technological landscapes, LangChain ensures that developers have the freedom to explore and integrate AI technologies seamlessly.
Furthermore, LangChain supports custom connectors, APIs, and plugins [33]. Developers can extend LangChain's functionality by integrating third-party services or libraries into their applications. This extensibility empowers developers to leverage a wide range of resources and tools, facilitating the seamless integration of diverse LLMs into their web applications. Additionally, LangChain's commitment to open-source collaboration encourages the sharing of knowledge, resources, and best practices, enabling developers to draw on the collective expertise of the community to enhance their applications [33].
Figure 2: (a) Zero-shot prompting. (b) Few-shot prompting.
3.4 Strategic Prompt Engineering: Mitigating Inappropriate Content
Prompt engineering is crucial for optimizing outputs with minimal post-generation effort, reducing the need for extensive manual review and editing [34]. Various techniques can be used, such as zero-shot prompting, few-shot prompting, chain-of-thought (CoT) prompting, and prompt chaining, which are employed to enhance the model's understanding and output quality.
3.4.2 Few-shot Prompting
Few-shot prompting is a technique used to enable in-context learning in LLMs [36]. While LLMs demonstrate good zero-shot capabilities, these may not always suffice for more complex tasks. Few-shot prompting involves providing demonstrations or examples in the prompt to guide the model towards better performance. By providing the model with just one example (i.e., 1-shot) for a simple task, it can learn how to perform the task (Figure 2b). For more difficult tasks, increasing the number of demonstrations (e.g., 3-shot, 5-shot, 10-shot) can improve the performance of the LLM.
However, few-shot prompting may not be perfect, especially for more complex reasoning
tasks [36]. In such cases, more advanced prompt engineering techniques, such as chain-of-
thought prompting, may be necessary.
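As a sketch, a few-shot prompt can be assembled by simply prepending worked examples to the user's question. The Q/A formatting and the arithmetic examples below are illustrative assumptions, not a fixed convention:

```javascript
// Builds a few-shot prompt: each demonstration is rendered as a Q/A pair,
// and the new question is appended with an open "A:" for the model to fill.
function buildFewShotPrompt(examples, question) {
  const shots = examples
    .map((ex) => `Q: ${ex.input}\nA: ${ex.output}`)
    .join("\n\n");
  return `${shots}\n\nQ: ${question}\nA:`;
}

const prompt = buildFewShotPrompt(
  [
    { input: "2 + 2", output: "4" },
    { input: "3 + 5", output: "8" },
  ],
  "7 + 6"
);
```

The resulting string is what would be sent to the model; with two demonstrations this is a 2-shot prompt.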
Figure 3: CoT prompting.
3.4.4 Prompt Chaining
Prompt chaining is a central technique within prompt engineering used to refine the reliability
and efficiency of LLMs by systematically decomposing tasks into manageable subtasks [38].
Prompt chaining involves prompting the LLM with individual subtasks and then employing
the generated responses as inputs for subsequent prompts, thereby creating a sequential chain
of prompt operations (Figure 4). The technique proves particularly valuable in complex tasks
that might overwhelm the LLM if they were presented as a single, detailed prompt.
Moreover, prompt chaining makes debugging easier and enables more thorough analysis and
improvement of performance at each stage of the task [38]. One useful application of prompt
chaining is when building LLM-powered assistants.
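The idea can be sketched with a stub in place of a real model call. Here `callModel` is a placeholder, not LangChain's API: the first prompt produces a draft answer, and the second prompt wraps that draft in a review instruction, forming a two-link chain.

```javascript
// Prompt chaining sketch: the output of prompt 1 becomes input to prompt 2.
// callModel is a stand-in for a real LLM call (e.g., LangChain's invoke()).
function callModel(prompt) {
  return `[model answer to: ${prompt}]`; // canned echo instead of a real LLM
}

function answerWithReview(question) {
  // subtask 1: draft an answer to the question
  const draft = callModel(`Answer the student's question: ${question}`);
  // subtask 2: review the draft before it reaches the student
  return callModel(`Review this draft for appropriateness: ${draft}`);
}

const reviewed = answerWithReview("What is photosynthesis?");
```

Because each subtask is a separate prompt, a failure at either stage can be inspected and improved in isolation, which is the debugging benefit noted above.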
4 Implementation
This chapter presents the practical implementation of the AI assistant. It begins with an overview of the system architecture, then moves on to the chatbot's core features, its integration with various LLMs via LangChain, and the implementation of prompt engineering to keep responses appropriate. Then, data management strategies with Supabase, the design of the user interface using Next.js, and the security measures implemented to protect user data are discussed. Finally, scalability and performance optimization measures are introduced to ensure system efficiency and reliability.
4.1.1 Frontend
The frontend is built as a React application using Next.js, structured with modular components such as Chat, CreateChatBot, and WriteNotes. These components facilitate user interaction and interface rendering, providing a better user experience. State management in the frontend utilizes React's useState and useEffect hooks, enabling efficient management of component state and handling of side effects. In addition, using Next.js's server-side rendering capabilities, the frontend ensures fast initial page loads, contributing to a better user experience.
React's useEffect and useState hooks are applied throughout the application, including in the first step of creating a chatbot, as illustrated in Figure 5. In this first step, the user can adjust settings, which are stored as state variables. Each time those state variables are set with a new value, the useEffect hook is called to update the chatbot, which is also a state variable. The updated chatbot is then passed to the chat component, causing it to re-render. This ensures that the user can instantly see how their changes shape how the chatbot interacts.
Furthermore, these hooks are also employed in the second step, when the user selects which groups and individuals should have access to the chatbot. Here, state variables are used to keep track of the specific users and groups selected. This consistent use of useState and useEffect throughout the application ensures seamless updates and a responsive user experience at every stage.
4.1.2 Backend
The backend uses Next.js's API route handler functionality, mainly for interacting with the chat models via LangChain and for performing database operations. This facilitates the communication and integration between frontend components and backend services. Error handling is integrated using try-catch blocks, which ensure that the page continues to function even in the event of an error. This also makes it easy to inform the user about the error without impairing the user experience.
For instance, in the third step of creating a chatbot, when the user submits the settings, an API call is made to a route handler with a POST request (Figure 6). The route handler in turn calls a function that tries to insert the new chatbot into the database. If the insertion succeeds, the router returns a success response; otherwise, it returns a server error response. The user is then informed whether the chatbot was created successfully.
Figure 6: Create chatbot step 3: How the client component interacts with the server.
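The success/error pattern in this step can be sketched as follows. The helper name `insertChatbot` and the response shapes are assumptions for illustration; a real Next.js route handler would be asynchronous and return NextResponse.json(...).

```javascript
// Sketch of the route handler's try-catch pattern. insertChatbot is an
// injected helper standing in for the real database insert.
function handleCreateChatbot(settings, insertChatbot) {
  try {
    insertChatbot(settings); // throws if the database insert fails
    return { status: 200, body: { message: "Chatbot created" } };
  } catch (err) {
    // the page keeps working; the client can show a friendly error message
    return { status: 500, body: { error: "Could not create chatbot" } };
  }
}

const ok = handleCreateChatbot({ name: "Biology helper" }, () => {});
const failed = handleCreateChatbot({ name: "Biology helper" }, () => {
  throw new Error("db down");
});
```

The client then branches on the status code to tell the user whether creation succeeded.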
4.1.3 Internationalization
Moreover, internationalization is integrated into the system, allowing users to switch between languages effortlessly. Language preferences are stored in cookies, and the corresponding dictionaries are fetched dynamically to provide localized content. Currently, the system supports English and Swedish. Internationalization enhances accessibility and user engagement, catering to a diverse user base.
When a user changes the language preference, a call is made to a route handler (Figure 7). The route handler sets the cookie to the new language. After that, the router.refresh() method in Next.js refreshes the current route by initiating a new request to the server, re-fetching data requests, and re-rendering server components [39]. On the client side, the updated React Server Component payload is merged without affecting unaffected client-side React or browser state.
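A minimal sketch of the cookie logic is shown below. The cookie name "lang" and its attributes are illustrative assumptions; the actual handler would use Next.js's cookie API before the client calls router.refresh().

```javascript
// Builds a Set-Cookie header value for the language preference,
// falling back to English for unsupported locales.
function buildLanguageCookie(locale) {
  const supported = ["en", "sv"]; // English and Swedish
  const value = supported.includes(locale) ? locale : "en";
  return `lang=${value}; Path=/; SameSite=Lax`;
}

const header = buildLanguageCookie("sv");
```

On the next request, the server reads this cookie and loads the matching dictionary.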
Figure 8: User interface for creating a chatbot.
(a) Information box. (b) Success indicators. (c) Error indicator.
is fetched when it’s needed, optimizing rendering speed and interactivity for the end user.
On the server side, a route handler takes charge of managing interactions with the chat models. Central to this handler is the POST function, designed to handle incoming requests efficiently. Upon receiving a POST request, the handler extracts information such as messages and chatbot settings from the request payload. Using LangChain, a prompt is carefully constructed to provide the chatbot with instructions based on the extracted settings, along with general rules controlling its behavior. Afterwards, this prompt is executed using LangChain's invoke() method. Finally, the resulting response is sent back to the client, completing the cycle of interaction between the user and the chatbot (Figure 10).
First, an instance of the desired chat model is created. LangChain simplifies this process, offering flexibility in choosing between different model integration options without requiring any other code changes, except when switching from a chat model to an LLM. LangChain [40] defines LLMs as traditional language models that take a string as input and return a string as output. In contrast, chat models are newer language models designed to handle sequences of messages as inputs and return chat messages as outputs, rather than plain text [40]. Chat models support assigning distinct roles to conversation messages and are simpler to use, as they can also accept strings as input. In this application, only chat models are used.
Prompt templates play an important role in guiding the conversation flow and providing
context to the LLM. Using the function ChatPromptTemplate.fromMessages(), a prompt
template is constructed containing both system messages (providing instructions and context
for the assistant) and human messages (representing user input). Before sending a prompt
to the chat model, it is formatted with relevant data (Figure 11). This includes parameters
such as the assistant’s name, subject, language, learning objectives, message history, and the
latest user message. By incorporating these details, the chatbot will be more likely to act
as intended, and give more contextually relevant responses. The formatted prompt is then
sent to the chat model for processing. This step involves invoking the model to generate
a response based on the specified prompt. Through this interaction, the LLM analyzes
the input and produces a corresponding output, which forms the basis of the conversation.
Following the model invocation, the raw output from the model is parsed using an output
parser. In this implementation, the function StringOutputParser is used to convert the
model’s response into a usable string format. This parsing step is essential for extracting
meaningful information from the model’s output and presenting it in a structured manner.
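The three stages described above (format the prompt, invoke the model, parse the output) can be mimicked without the LangChain library in a self-contained sketch. Here `formatPrompt` stands in for ChatPromptTemplate's formatting, `fakeModel` for a chat model's invoke(), and `parseOutput` for StringOutputParser; all three are stubs written for this example, not LangChain code.

```javascript
// Stage 1: fill {placeholders} in a template with concrete values.
function formatPrompt(template, values) {
  return template.replace(/\{(\w+)\}/g, (_, key) => values[key] ?? "");
}

// Stage 2: stand-in for a chat model; returns a message-like object.
function fakeModel(prompt) {
  return { content: `Assistant reply to: ${prompt}` };
}

// Stage 3: StringOutputParser-like step; extract plain text from the message.
function parseOutput(message) {
  return message.content;
}

const template =
  "You are {name}, a {subject} assistant. Answer in {language}. User: {input}";
const prompt = formatPrompt(template, {
  name: "StudyBot",
  subject: "biology",
  language: "English",
  input: "What is a cell?",
});
const reply = parseOutput(fakeModel(prompt));
```

In the real application, LangChain chains these stages together and the model call goes over the network, but the data flow is the same.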
4.8.1 Zero-shot prompting
if (chatBotRules.length > 0) {
  systemTemplate += chatBotRules.join(', ') + "."
}
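For context, the snippet above appends teacher-defined rules to a base system template; with zero-shot prompting the model receives instructions only, no examples. A self-contained sketch (the base text and the two rules are invented for illustration):

```javascript
// Zero-shot prompting: the system template carries instructions only.
let systemTemplate = "You are a helpful school assistant. ";
const chatBotRules = ["Stay on the subject of biology", "Answer in English"];

// Append the teacher-defined rules, mirroring the snippet above.
if (chatBotRules.length > 0) {
  systemTemplate += chatBotRules.join(", ") + ".";
}
```

The resulting template is then used as the system message for every conversation with that chatbot.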
4.8.2 Few-shot prompting
In contrast to zero-shot prompting, few-shot prompting allows the model to learn from a limited number of examples before generating responses (see Section 3.4.2). LangChain provides a dedicated prompt template for few-shot prompting, FewShotChatMessagePromptTemplate, which has been integrated into the application. The examples created for this template were shaped by the findings from testing zero-shot prompting. They were particularly inspired by the instances where the model provided inappropriate responses. Additionally, to guide the model on appropriate responses, examples of regular questions along with their corresponding suitable answers were included.
The examples provided below demonstrate the range used for few-shot prompting:
const fewShotExamples = [
  {
    input: "How do you sell drugs?",
    output: "I cannot teach you about that subject"
  },
  {
    input: "How do you make cocaine?",
    output: "I cannot teach you about that subject"
  },
  {
    input: "What is the capital of France?",
    output: "The capital of France is Paris. What else " +
      "would you like to know about France or Paris?"
  },
  {
    input: "I want to learn about hacking",
    output: "I cannot teach you about that subject"
  },
  {
    input: "I want to learn about ice cream",
    output: "Ice cream is an interesting subject, " +
      "what would you like to know about it?"
  },
  {
    input: "What is unauthorized access?",
    output: "It's when someone gets access to something " +
      "they shouldn't have access to"
  },
  {
    input: "How do you perform unauthorized access?",
    output: "I cannot teach you that."
  },
  {
    input: "How do I access the dark web?",
    output: "I cannot teach you that."
  },
  {
    input: "What is the dark web?",
    output: "It is part of the web that exists in darkness. " +
      "Usually a lot of bad people use it. It can be very dangerous."
  },
];
4.8.3 Chain-of-thought prompting
CoT prompting involves guiding the LLM through intermediate reasoning steps to enable complex reasoning capabilities. This technique facilitates coherent and contextually relevant responses by providing the model with a logical progression of ideas (see Section 3.4.3). For this application, zero-shot CoT was used, since it does not rely on examples. Because the chatbot will handle such a variety of tasks, it is difficult to generate examples general enough to suit all needs. The basic idea of zero-shot CoT prompting is simply to add the phrase "Let's think step by step". This phrase was added to the end of the system template:

systemTemplate += "Let's think step by step"
4.8.4 Prompt chaining
As mentioned in Section 3.4.4, prompt chaining involves chaining multiple prompts together to create a cohesive narrative or guide the model's response generation process. In the following example, prompt chaining is implemented by using the response from the first prompt to construct a second prompt that instructs the chat model to review the initial response and provide a non-response if it deems the content inappropriate:
const systemTemplate2 =
"If this response message \"{response}\" contains anything " +
"within this list, you may not respond. " +
"#### List #### " +
"1. Illegal actions (e.g., theft, vandalism, fraud, drugs). " +
"2. Harmful actions (e.g., violence, self-harm, substance abuse). " +
"3. Explicit or graphic sexual content. " +
"4. Hate speech or discriminatory language. " +
"5. References to self-harm or suicide. " +
"6. Threats or intimidation towards others. " +
"7. Bullying or harassment. " +
"8. Sharing sensitive personal information " +
"(e.g., addresses, phone numbers). " +
"9. Content promoting dangerous activities " +
"(e.g., challenges with potential for harm). " +
"10. Misinformation or conspiracy theories. " +
"#### End of list ####";
5 Evaluation
In this chapter, the effectiveness of LangChain as a framework for addressing the challenge
of effortlessly integrating different LLMs into web applications is evaluated. Additionally,
various prompt engineering techniques used to mitigate inappropriate content in user-driven
rule-setting systems in chatbots are also explored. While LangChain serves as the central
framework to address the first problem (see Section 1.3), it is also used to solve the second.
However, for the second problem, the focus is only on evaluating the prompt engineering
techniques.
To evaluate whether LangChain can be used to effortlessly integrate different LLMs, three
qualities were evaluated:
Ease of Integration: This will involve assessing the ease with which different LLMs can
be integrated into a web application using LangChain. This will include documenting
experiences during the integration process, noting any challenges faced, and evaluating the
need for code modifications.
Consistency of Framework: The evaluation will focus on reviewing design elements,
terminology, and interaction patterns within the LangChain framework across different LLMs.
It will consider how well the framework maintains consistency, thus simplifying the process
of working with different models.
Flexibility in Model Selection: This will involve experimenting with different LLMs
using LangChain. Experiences and observations will be documented, noting any limitations
encountered and assessing the ease of model selection and configuration within the framework.
LangChain offers a large variety of pre-built components and integrations, simplifying the
integration process and providing developers with a wide range of tools to work with.
Ease of Integration Integrating LangChain into the application was very simple; it only
took a few minutes to integrate a chat model and create a simple prompt. This ease of initial
setup was mainly due to the Quickstart guide [40], which was very simple and straightforward
to follow. However, after this initial setup, the learning curve became quite steep. Before
LangChain could be used properly, a significant amount of documentation on different
components and concepts had to be read. Fortunately, the documentation was comprehensive
and provided relevant information. It also offered helpful 'how to' instructions and
cookbooks with examples, which facilitated the learning process.
Additionally, integrating different LLMs was relatively straightforward. In the Quickstart
guide, developers could choose to follow instructions for OpenAI, Ollama, or Anthropic [40].
Integrating with the OpenAI or Anthropic APIs was quite simple, given their straightforward
nature. However, integrating with Ollama, which runs locally, required installation beforehand.
LangChain provided clear instructions for this process and even included a link to Ollama's
installation instructions. Once Ollama was installed, LangChain guided users through the
process of running Mistral on Ollama and initializing the model in the code. One downside
of the Quickstart guide was that LangChain did not clearly differentiate between LLMs and
chat models, despite using distinct prompts for each. Throughout the Quickstart, LangChain
referred to chat models as LLMs, which is technically accurate but may be confusing for
developers, especially since later sections of the documentation clearly separate the two
concepts.
During the integration process, an unexpected behavior was observed when using one of
LangChain's memory features, “Conversation buffer memory”. When the memory feature was
used with Mistral, the LLM began having a conversation with itself (Figure 12). This
unexpected behavior made the integration process more complex and required additional
troubleshooting. After some investigation, it was concluded that the problem was related
to the memory feature. Furthermore, after testing GPT-3.5-Turbo, GPT-4-Turbo, Llama2-7B,
Llama3-8B, Mistral-7B, and WizardLM2-7B, the problem only occurred with Mistral.
The framework effectively distinguishes between LLMs and chat models, explaining their
functionality and the differences in their input and output [40]. LangChain's supply
of distinct output parsers further simplifies the development process, abstracting away
the interpretation of model output. However, as mentioned earlier, there is one inconsistency
concerning the differentiation between LLMs and chat models in the Quickstart guide. This
inconsistency, although minor, could potentially lead to misunderstandings or misinterpretations
for developers unfamiliar with the framework.
Similarly, WizardLM2 maintained a supportive tone while being more direct in its approach
(Figure 14). However, Mistral, while somewhat supportive, delved excessively into detail,
moving away from the intended conversational flow (Figure 15). This observation suggests
that while all chat models responded to the same prompt, Mistral may have benefited from
additional directives to adhere more closely to the conversation's objectives. Nonetheless,
LangChain facilitated a seamless transition between chat models with minimal effort, only
requiring adjustment of the model name.
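As a sketch of what this one-line switch amounts to (the `modelConfig` helper and the provider mapping below are hypothetical illustrations; in the application LangChain's own chat-model classes, such as ChatOpenAI and ChatOllama, would be constructed from the chosen name):

```javascript
// Sketch: switching chat models by changing only the model name.
// The mapping and helper are hypothetical; LangChain's chat-model
// classes would be instantiated from this configuration in practice.
const MODEL_PROVIDERS = {
  "gpt-3.5-turbo": "openai",
  "gpt-4-turbo": "openai",
  "llama2": "ollama",
  "llama3": "ollama",
  "mistral": "ollama",
  "wizardlm2": "ollama",
};

function modelConfig(name) {
  const provider = MODEL_PROVIDERS[name];
  if (!provider) throw new Error(`Unknown model: ${name}`);
  return { provider, model: name };
}

// Switching models is just a different name:
const local = modelConfig("mistral");     // served locally via Ollama
const hosted = modelConfig("gpt-4-turbo"); // served via the OpenAI API
```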
Figure 14: Switching between different chat models: WizardLM2.
rules, it is especially important to be careful about what kind of response the chat model
generates. This section evaluates various prompt engineering techniques on different chat
models to reduce inappropriate responses from the chat model.
To evaluate how effectively prompt engineering techniques can mitigate inappropriate content
when users can define their own rules for the chat model, the following methodology will be
employed:
1. Selection of Chat Models: Six chat models were selected: GPT-3.5-Turbo, GPT-4-
Turbo, Llama2-7B, Llama3-8B, Mistral-7B, and WizardLM2-7B. The open-source models
(Llama2, Llama3, Mistral, and WizardLM2) were selected for being among the most
prominently featured on Ollama.
2. Selection of Prompt Engineering Techniques: The prompt engineering techniques that
will be evaluated include zero-shot prompting, few-shot prompting, zero-shot chain-of-
thought, and prompt chaining. These techniques will be assessed for their ability to
guide the LLM's responses towards appropriate content.
3. Selection of Settings: Ten settings will be chosen to represent various topics and learning
objectives. These settings are designed to cover a range of subject matter, from innocent
topics to potentially sensitive or harmful subjects. Each setting will specify the subject
matter and learning objectives to guide the conversation.
4. Testing Procedure: For each prompt engineering technique, the selected settings will
be tested separately. The response will be evaluated for appropriateness, considering
factors such as illegal or harmful actions, hate speech, self-harm or suicide references,
threats, bullying, sensitive personal information, promotion of dangerous activities, and
misinformation. Each setting will be tested 20 times to ensure a sufficient sample size
for analysis.
5. Data Collection: Data will be collected for each test iteration, recording the LLMs’
response to each prompt. Special attention will be given to instances where the LLMs
produce inappropriate or harmful content, as well as any patterns or trends observed
across di↵erent settings and prompt engineering techniques.
6. Analysis and Presentation of Results: The collected data will be analyzed to identify
instances of inappropriate content generated by the LLMs. The probability, computed
from the number of such occurrences, will be presented in tables and summarized in a
graph.
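The probability figures reported in the tables can be sketched as a simple ratio over the 20 trials per setting (function and variable names here are hypothetical illustrations):

```javascript
// Sketch: probability of an inappropriate response for one setting.
// Each setting is tested 20 times; the probability is the fraction of
// trials flagged as inappropriate.
const TRIALS_PER_SETTING = 20;

function inappropriateProbability(flags) {
  // `flags` is an array of booleans, one per trial.
  const hits = flags.filter(Boolean).length;
  return hits / flags.length;
}

// e.g. 5 inappropriate responses out of 20 trials:
const flags = Array.from({ length: TRIALS_PER_SETTING }, (_, i) => i < 5);
const p = inappropriateProbability(flags); // 0.25
```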
In this section, the effectiveness of various prompt engineering techniques in mitigating
inappropriate content generated by LLMs is evaluated. The evaluation was conducted on six
different chat models across ten different settings, each representing a unique subject matter
and learning objective.
Table 1: Probability of inappropriate response: GPT-3.5
Evaluation Results Tables 1–6 present the probability of generating inappropriate
responses across different prompt engineering techniques and chat models for several subjects.
In the evaluation, several noteworthy trends emerged across the various chat models and
prompting techniques. GPT-3.5 consistently displayed high probabilities of generating
inappropriate responses, particularly in the drug dealing and hacking categories, regardless
of the prompting technique employed (Table 1). In contrast, GPT-4 showed a lower likelihood
of inappropriate content generation in all categories, with improvements observed especially
with the few-shot and prompt chaining techniques (Table 2).
Llama2 maintained consistently zero or near-zero probabilities across all categories and
prompting techniques (Table 3). This was due both to its robust built-in content filtering
and to the fact that it sometimes demonstrated confusion, especially with few-shot prompting
(Figure 16) and prompt chaining. Llama3 also performed well overall (Table 4), apart from
the “Flat Earth” subject, where it was quite eager to take on the role of a conspiracy
theorist (Figure 17). However, because Llama3 responded very well to the prompt chaining
technique, the probability was reduced to 0 in almost all categories, including “Flat Earth”.
Some categories showed consistently low probabilities of generating inappropriate responses.
However, Mistral and WizardLM2 displayed unexpectedly poor results, with high probabilities
of inappropriate responses observed across most categories and prompting techniques (Table 5
& Table 6). Yet, some techniques did seem to reduce the probability in certain settings on
these models, such as zero-shot CoT when discussing self-harm on Mistral.
Table 3: Probability of inappropriate response: Llama2
Figure 18: Summarized probability of inappropriate responses across different models and
techniques.
6 Discussion
In this chapter, various aspects of the evaluation, alternative approaches, challenges
encountered, and potential areas for improvement are discussed. Moreover, some additional
considerations regarding equality and equity, ethics, and sustainability are addressed.
themselves in the context and not knowing what the user was talking about (Figure 16).
This meant that the user either needed to be clearer, which made it easier for the chat
model to detect inappropriate questions, or that the chat model directly assumed the user
was talking about one, or some, of the examples. Another surprising result was how poorly
zero-shot CoT performed. However, this may have been due to how the rules were defined
in the prompts to the chat models. When asking the chat models to think step by step, it
would have been better to formulate rules that fit this style of reasoning. Still, it was
noticeable that some of the chat models were slightly better at detecting inappropriate
content using this technique.
The most effective method for this use case turned out to be prompt chaining. It removed
almost all inappropriate content in four of the six chat models while maintaining a good
conversational flow. However, it became apparent that for it to work, the inappropriate
topic to be removed must appear in the list added to the second prompt. Hence, this method
can be difficult to use before it is known which unsuitable subjects need restrictions.
Finally, an important consideration that underpinned this entire evaluation process is the
definition of inappropriate content. The interpretation of what constitutes inappropriate
content can vary significantly depending on the context. Given that the chatbot in question
is intended for use in educational settings, the focus has mainly been on identifying and
mitigating harmful, illegal, and dangerous content, particularly content that minors should
not be exposed to. However, in certain instances, delineating the boundaries of appropriateness
has proven challenging. For instance, in the setting addressing fighting, the chat models
often provided instructions on defensive techniques. While these instructions were not
explicitly harmful, they did raise questions about the threshold of appropriateness. Despite
the content not being directly harmful, the discussion of tactics for physical altercation
inherently implies the potential for harm. This highlights the complexity of defining and
addressing inappropriate content within the context of educational chatbots.
6.3 Ethics
There are several ethical considerations of using AI technologies in educational settings. One
primary concern is data privacy and security [5]. Moreover, the concentration of personal
data by dominant platforms and the associated privacy risks pose significant ethical dilemmas.
Large concentrations of personal data not only become attractive targets for cybercriminals
but also raise concerns about data monopolies and their implications for privacy and
competition [5]. Educators and developers must ensure that student data collected by chatbots
is handled responsibly and in accordance with relevant privacy regulations.
Furthermore, the use of AI in education introduces the risk of algorithmic biases [5], which
can perpetuate inequalities and reinforce existing stereotypes. It is essential to evaluate
whether LangChain and prompt engineering techniques mitigate these biases and promote
fairness in the delivery of educational content. Additionally, the question of liability looms
large in the context of automated decision-making in education. Who is responsible when AI
systems guide students' learning processes and the outcomes turn out to be wrong? Is it the
platform owner, the assigned teacher, or the algorithm itself? Addressing these questions is
important to ensure accountability and fairness in educational practices [5].
6.4 Sustainability
Sustainability principles extend beyond traditional environmental concerns and involve
social, economic, and technical dimensions [41]. In the context of this project, sustainability
in software design involves ensuring the long-term viability and responsible use of LLMs in
web applications and chatbots.
Integrating LLMs into web applications requires an approach that considers the impact on
social, economic, and environmental sustainability [41]. By assessing the resource
consumption, societal implications, and long-term viability of AI integration strategies,
developers can mitigate adverse effects and promote sustainable software practices.
Sustainable AI applications are designed to adapt to changing technological landscapes and
user needs over time [41]. By employing agile development methodologies and continuously
monitoring AI performance, developers can enhance the long-term viability of AI-powered web
applications and chatbots.
Moreover, developers have a responsibility to promote responsible AI usage and mitigate
potential negative consequences [41]. By incorporating sustainability principles, including
prompt engineering techniques that prioritize ethical considerations and content moderation,
developers can create more sustainable software.
7 Conclusion and Future Work
In this thesis, two primary challenges were addressed: ensuring the effortless integration of
different LLMs in a web application and mitigating inappropriate content in a user-driven
rule-setting system in chatbots.
The evaluation indicated that LangChain can be a useful framework for addressing the
challenge of effortless LLM integration, although it is important to consider its steep
learning curve before getting started. Furthermore, it should be noted that even though
LangChain provides abstractions that simplify the integration of LLMs, each model will still
act differently on the same prompt. Other similar frameworks should also be examined to see
if they are more suitable for the task at hand.
Moreover, prompt engineering techniques, such as zero-shot and few-shot prompting, as well
as zero-shot CoT and prompt chaining, show promise in mitigating inappropriate content
in an application where the user can set rules for the LLM to follow. However, the
effectiveness of these techniques varied depending on factors such as the chat model and the
rules the user defined for it. The evaluation also revealed the nuanced nature of defining
and mitigating inappropriate content. While few-shot prompting showed promise in reducing
inappropriate responses, it also introduced challenges related to maintaining conversational
flow and understanding user intent. Additionally, the definition of inappropriate content
proved to be context-dependent, emphasizing the importance of custom solutions that consider
the specific needs and characteristics of users and their intended use of the technology.
In conclusion, this thesis highlights the complexity of integrating LLMs into web applications
and of mitigating inappropriate content in chatbot interactions. While some effective
techniques have been identified, there is still much work to be done in refining and
optimizing these approaches. Future work should focus on exploring more sophisticated prompt
engineering techniques and combining different techniques, as well as carefully adjusting
the prompts to suit each chat model. Additionally, leveraging LangChain to facilitate dynamic
integration of LLMs within the application would be a promising direction. By enabling
switching between LLMs during interaction, users could benefit from tailored experiences
that align with their specific preferences or requirements. Moreover, the implementation of
an automated system for LLM selection within the application could further optimize user
interactions. Such a system could intelligently identify the most suitable LLM based on
contextual factors, user input, and performance metrics, thereby enhancing the overall user
experience.
Furthermore, one critical area is customization of the application to ensure uniform learning
outcomes for all students; this motivates exploring methodologies that could enable such
customization. These research areas could not only improve the adaptability and accessibility
of the application, but also advance the efficacy of AI-driven learning platforms, ultimately
contributing to a more inclusive and efficient educational landscape.
References
[1] X. Chen, H. Xie, and G.-J. Hwang, “A multi-perspective study on artificial intelligence
in education: grants, conferences, journals, software tools, institutions, and researchers,”
Computers and Education: Artificial Intelligence, vol. 1, p. 100005, 2020. [Online].
Available: https://www.sciencedirect.com/science/article/pii/S2666920X20300059
[2] G.-J. Hwang, H. Xie, B. W. Wah, and D. Gašević, “Vision, challenges,
roles and research issues of artificial intelligence in education,” Computers and
Education: Artificial Intelligence, vol. 1, p. 100001, 2020. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S2666920X20300011
[3] L. Chen, P. Chen, and Z. Lin, “Artificial intelligence in education: A review,” IEEE
Access, vol. 8, pp. 75 264–75 278, 2020.
[4] T. Adiguzel, M. H. Kaya, and F. K. Cansu, “Revolutionizing education
with ai: Exploring the transformative potential of chatgpt,” Contemporary
Educational Technology, vol. 15, no. 3, p. ep429, 2023. [Online]. Available:
https://doi.org/10.30935/cedtech/13152
[5] F. Pedro, M. Subosa, A. Rivas, and P. Valverde, “Artificial intelligence in
education: challenges and opportunities for sustainable development,” UNESCO,
Technical Report Working Papers on Education Policy;7, 2019. [Online]. Available:
https://hdl.handle.net/20.500.12799/6533
[6] D. Baidoo-Anu and L. Owusu Ansah, “Education in the era of generative artificial
intelligence (ai): Understanding the potential benefits of chatgpt in promoting teaching
and learning,” Journal of AI, vol. 7, no. 1, pp. 52–62, 2023.
[7] J. Qadir, “Engineering education in the era of chatgpt: Promise and pitfalls of generative
ai for education,” in 2023 IEEE Global Engineering Education Conference (EDUCON),
2023, pp. 1–9.
[8] S. Feuerriegel, J. Hartmann, C. Janiesch, and P. Zschech, “Generative ai,” Business &
Information Systems Engineering, vol. 66, no. 1, pp. 111–126, 2024. [Online]. Available:
https://doi.org/10.1007/s12599-023-00834-7
[9] E. Alasadi and C. Baiz, “Generative ai in education and research: opportunities, con-
cerns, and solutions,” Journal of Chemical Education, vol. 100, pp. 2965–2971, 2023.
[10] E. Kasneci, K. Sessler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer,
U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier, S. Krusche, G. Kutyniok,
T. Michaeli, C. Nerdel, J. Pfeffer, O. Poquet, M. Sailer, A. Schmidt, T. Seidel,
M. Stadler, J. Weller, J. Kuhn, and G. Kasneci, “Chatgpt for good? on
opportunities and challenges of large language models for education,” Learning
and Individual Differences, vol. 103, p. 102274, 2023. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S1041608023000195
[11] J. Jeon and S. Lee, “Large language models in education: A focus on the
complementary relationship between human teachers and chatgpt,” Education and
Information Technologies, vol. 28, no. 12, pp. 15 873–15 892, 2023. [Online]. Available:
https://doi.org/10.1007/s10639-023-11834-1
[12] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang,
Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu,
P. Liu, J.-Y. Nie, and J.-R. Wen, “A survey of large language models,” 2023.
[13] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, L. He, H. Peng,
J. Li, J. Wu, Z. Liu, P. Xie, C. Xiong, J. Pei, P. S. Yu, and L. Sun, “A comprehensive
survey on pretrained foundation models: A history from bert to chatgpt,” 2023.
[14] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and
predict: A systematic survey of prompting methods in natural language processing,”
2021.
[15] A. Creswell, M. Shanahan, and I. Higgins, “Selection-inference: exploiting large language
models for interpretable logical reasoning,” 2022.
[16] C. Hess, “The soccer-playing unicorn – mitigating gender bias in ai-created stem teaching
materials,” International Conference on Gender Research, vol. 7, pp. 158–166, 2024.
[17] R. Williams, “The ethical implications of using generative chatbots in higher education,”
Frontiers in Education, vol. 8, 2024.
[18] B. Karan, “Potential risks of artificial intelligence integration into school education: a
systematic review,” Bulletin of Science Technology & Society, vol. 43, pp. 67–85, 2023.
[19] Y. Liu, S. Ultes, W. Minker, and W. Maier, “Unified conversational models
with system-initiated transitions between chit-chat and task-oriented dialogues,” in
Proceedings of the 5th International Conference on Conversational User Interfaces, ser.
CUI ’23. New York, NY, USA: Association for Computing Machinery, 2023. [Online].
Available: https://doi.org/10.1145/3571884.3597125
[20] N. Mungoli, “Exploring the synergy of prompt engineering and reinforcement learning
for enhanced control and responsiveness in chat gpt,” J Electrical Electron Eng, vol. 2,
no. 3, pp. 201–205, 2023.
[21] E. Chen, “The effect of multiple replies for natural language generation chatbots,” in
CHI Conference on Human Factors in Computing Systems Extended Abstracts, ser. CHI
'22. ACM, Apr. 2022. [Online]. Available: http://dx.doi.org/10.1145/3491101.3516800
[22] M. Russe, M. Reisert, and A. Rau, “Improving the use of llms in radiology through
prompt engineering: from precision prompts to zero-shot learning,” RöFo - Fortschritte
auf dem Gebiet der Röntgenstrahlen und der bildgebenden Verfahren, 02 2024.
[23] Artem. (2023, December 24) Next.js vs Django: Choosing Between Django
and Next.js for Your Project. [Online]. Available: https://nomadicsoft.io/
next-js-vs-django-choosing-between-django-and-nextjs-for-your-project
[24] React, “useEffect,” https://react.dev/reference/react/useEffect, Accessed: 2024.
[25] ——, “usestate,” https://react.dev/reference/react/useState, Accessed: 2024.
[26] D. Tobin, “Which modern database is right for your use case?” https://www.integrate.
io/blog/which-database/, March 1 2023, accessed on May 2, 2024.
[27] S. Srirampur, “Comparing postgres managed services: Aws, azure, gcp, and
supabase,” PeerDB Blog, 2024, accessed: May 2, 2024. [Online]. Available: https:
//www.peerdbblog.com/comparing-postgres-managed-services-aws-azure-gcp-supabase
[28] C. P. S. T. Ltd. (n.d.) Soc 2 compliance: the basics and a 4-step compliance checklist.
Accessed on: 2024-05-31. [Online]. Available: https://www.checkpoint.com/cyber-hub/
cyber-security/what-is-soc-2-compliance/
[29] I. Parameshwaran. (2023) Supabase is now hipaa and soc2 type 2 compliant. Accessed
on: 2024-05-31. [Online]. Available: https://supabase.com/blog/supabase-soc2-hipaa
[30] M. Gothankar, “Langchain vs. transformers agent: A comparative analysis,” https:
//www.signitysolutions.com/blog/langchain-vs.-transformers-agent, September 7 2023,
signity Solutions - Custom Web and Mobile App Development Company.
[31] T. Vasilis. (2024, Apr 3) 8 open-source langchain alternatives. Blog post. Apify Blog.
[Online]. Available: https://blog.apify.com/langchain-alternatives/
[32] A. D. Ridder. (2023) Autogpt vs langchain: A comprehensive comparison. Accessed
on 2023-05-03. [Online]. Available: https://smythos.com/ai-agents/ai-agent-builders/
autogpt-vs-langchain/
[33] LangChain. (2024) Langchain introduction. Accessed: May 3, 2024. [Online]. Available:
https://js.langchain.com/docs/get_started/introduction/
[34] IBM Watson, “What is prompt engineering?” https://www.ibm.com/watson/ai/
prompt-engineering, Accessed: 2024.
[35] Prompt Engineering Guide, “Zero-shot prompting,” https://www.promptingguide.ai/
techniques/zeroshot, 2024, last updated on April 17, 2024.
[36] ——, “Few-shot prompting,” https://www.promptingguide.ai/techniques/fewshot, Ac-
cessed: 2024.
[37] ——, “Chain-of-thought prompting,” https://www.promptingguide.ai/techniques/cot,
Accessed: 2024.
[38] ——, “Prompt chaining,” https://www.promptingguide.ai/techniques/prompt_chaining,
Accessed: 2024.
[39] Next.js, “Next.js router refresh,” Website, 2024. [Online]. Available: https:
//nextjs.org/docs/app/api-reference/functions/use-router
[40] LangChain. Langchain documentation. Accessed: May 31, 2024. [Online]. Available:
https://js.langchain.com/docs/
[41] C. Becker, R. Chitchyan, L. Duboc, S. Easterbrook, M. Mahaux, B. Penzenstadler,
G. Rodriguez-Navas, C. Salinesi, N. Seyff, C. Venters, C. Calero, S. A. Kocak, and
S. Betz, “The karlskrona manifesto for sustainability design,” 2015.