SteLLA: A Structured Grading System Using LLMs with RAG
Hefei Qiu∗ , Brian White† , Ashley Ding‡ , Reinaldo Costa† , Ali Hachem† , Wei Ding† and Ping Chen†
∗ Department of Computer Science
Fitchburg State University, 160 Pearl Street, Fitchburg, MA 01420-2697
† Department of Computer Science
Email: [email protected]
Abstract—Large Language Models (LLMs) have shown strong general capabilities in many applications. However, how to make them reliable tools for some specific tasks such as automated short answer grading (ASAG) remains a challenge. We present SteLLA (Structured Grading System Using LLMs with RAG), in which a) a Retrieval Augmented Generation (RAG) approach is used to empower LLMs specifically on the ASAG task by extracting structured information from highly relevant and reliable external knowledge based on the instructor-provided reference answer and rubric, and b) an LLM performs a structured, question-answering-based evaluation of student answers to provide analytical grades and feedback. A real-world dataset containing students' answers in an exam was collected from a college-level Biology course. Experiments show that our proposed system can achieve substantial agreement with the human grader while providing break-down grades and feedback on all the knowledge points examined in the problem. A qualitative and error analysis of the feedback generated by GPT4 shows that GPT4 is good at capturing facts but may be prone to inferring too much implication from the given text in the grading task, which provides insights into the usage of LLMs in ASAG systems.

Index Terms—LLM-based ASAG system, RAG, QA-based evaluation, structured evaluation

I. INTRODUCTION

Assessment plays an important role in the teaching and learning process. It usually includes closed-ended questions such as multiple choice and open-ended questions such as short-answer questions. Although open-ended questions are more powerful in evaluating students' learning, grading such questions is more time-consuming. In some scenarios, such as introductory-level college courses with hundreds of students or online courses with an even larger number of learners, the potentially heavy workload of manually grading open-ended short-answer questions hinders their usage in practice. An automated grading system can provide prompt feedback to a learner, support large-scale learning environments, and further facilitate active and life-long learning. The recent development of Large Language Models (LLMs) has shown their strong general capabilities in many tasks. However, how to use them to automatically provide reliable grading and feedback remains a challenge. We propose SteLLA (Structured Grading System Using LLMs with RAG), an automatic grading system that performs a structured grading based on Question Answering (QA) techniques, which is empowered by highly relevant augmented information retrieved from the instructor-provided reference answer and rubric.

The field of automatic grading and feedback systems has been explored through various domains such as programming [1], [2] and mathematics [3], [4], as well as on different types of answers such as essays [5], [6] and short answers [7]–[9]. Compared with an essay, which is usually long and consists of multiple paragraphs, a short answer is much shorter, with just a couple of sentences. Grading short answers is more focused on correctness and does not consider text coherence or writing style as in essay grading. SteLLA is a system designed for automatic short-answer grading (ASAG).

There have been many attempts to build automatic grading and feedback systems, many of which utilize recent developments in Natural Language Processing (NLP). Motivated by the huge progress of LLMs and the needs of instructors and learners, our design uses LLMs as a key component. To ground a general LLM on the specific task of grading, we propose a reference answer and rubric based retrieval augmented generation (R-RAG) approach. Given an instructor-provided reference answer and a rubric, R-RAG extracts highly relevant and structured information from them. It applies question-generation and question-answering techniques to generate a set of evaluation questions and corresponding answers. An LLM performs a structured grading by checking how well a student's response answers these evaluation questions. Eventually, an overall grade, the breakdown grades, and feedback are generated for the user.

The contributions of this work are as follows:
• We propose an LLM-based ASAG system, SteLLA, that shows substantial agreement with human graders.
• We present R-RAG, which is specifically designed for the grading task. It treats an instructor-provided reference answer and rubric as a knowledge base to extract highly
relevant augmented information to ground a general-trained LLM on the grading task.
• Our system is the first attempt to apply a QA-based structured grading. Compared with the text-similarity-based grading approach, i.e., directly comparing the similarity between the student answer and the reference answer, the QA-based approach provides a tool to induce a deeper semantic understanding of the text in grading. Moreover, it provides not only an overall grade but also decomposed grades and feedback on the knowledge points examined in the problem.
• We systematically analyze the responses generated by an LLM and show both its capabilities and the errors it is prone to make, which provides some insights on how to properly use an LLM in the grading task.

The rest of the paper is organized as follows: Section II introduces the background and related work; the method and system architecture are explained in Section III; how we collected the data is described in Section IV; Section V presents the experiments and results; the last section, Section VI, gives the conclusion and future work.

II. BACKGROUND AND RELATED WORK

A. QA-Based Evaluation

While Question Answering itself is one of the major tasks in NLP, the QA-based approach is novel in applying QA techniques to perform text evaluation for other NLP tasks. This approach has been applied to evaluate the quality of texts in summarization or text compression tasks. Some of the earlier work used the QA evaluation paradigm to examine to what extent documents could be summarized without affecting comprehension of them [10], or to perform human evaluations of summaries [11]. Along with the progress of question generation techniques, multiple studies have automatically generated questions from the reference summary [12], [13], from the source document [14], and from the evaluated summary [15], [16] to check fact-based consistency or faithfulness. Extending previous work, QuestEval combines both recall and precision approaches and shows an improved QA-based metric for evaluating summarization [17]. QuestEval has also been applied to evaluating text simplification [18] and text converted from semi-structured data such as tables [19]. To the best of our knowledge, our work is the first attempt to apply QA-based evaluation to the grading and feedback task.

B. Large Language Models (LLMs)

Language Modeling (LM) has been one of the central tasks in NLP. In general, LM is to learn a probability distribution over sequences of tokens by predicting the probabilities of the next or missing token(s). Pre-trained language models such as BERT [20] have shown surprising capability in learning context-aware word representations and achieved high performance in a series of NLP tasks. Since the launch of GPT-3 [21], LLMs have attracted a lot of attention. Compared with pre-trained language models, LLMs are scaled with a much larger number of model parameters and much more training data. They show emerging abilities to solve more complex tasks. ChatGPT (OpenAI 2022), developed upon the GPT-3 (OpenAI 2021) and later series, provides a highly accessible and effective way to use LLMs in a conversational manner and without fine-tuning. This has stimulated a large amount of research and many applications. The most recent versions, GPT-4 (OpenAI 2023) and GPT-4o (OpenAI 2024), are multimodal models that accept both text and images as inputs.

Recent LLMs use the Transformer [22] as the backbone architecture. Originally introduced for the machine translation task, the vanilla Transformer is built on an encoder-decoder structure. The encoder and decoder are both stacks of transformer blocks. Through the multi-head attention mechanism, the encoder encodes the input sentence in one language into a latent representation space; the decoder decodes this representation to autoregressively generate the translated sentence. Different from the vanilla Transformer, the GPT series uses the decoder only.

C. Retrieval-Augmented Generation

Although LLMs have shown strong general capabilities, there are some key challenges these models still suffer from, e.g., factual hallucination [23]–[25]. Retrieval-Augmented Generation (RAG) [26], [27] has been proposed and established as a technique to alleviate such challenges. It references reliable external knowledge by retrieving relevant information and thereby further enhances the performance of LLMs. Some works use the retrieved data as augmented inputs to guide the generation of LLMs [26], [28]. Others apply this approach in the middle of generation [29], [30] or after the generation [31], [32]. We apply RAG by using it to retrieve augmented information as inputs. We treat an instructor-provided reference answer as an external knowledge base, extract the information that contains the target answer to an evaluation question, and send it together with the student response and the evaluation question to an LLM to perform the assessment.

D. Automatic Short-Answer Grading

Research on ASAG has a long history. In the earlier days of ASAG research, many traditional methods used rule-based models [33]. For example, the idea of Concept Mapping is more rule-based: it breaks the student answers into several concepts and detects whether each concept is present or not [34]–[36]. The approach that uses information retrieval techniques is also more rule-based. It usually checks student answers by relying on pattern matching through, e.g., regular expressions or parsing trees [37], [38].

Along with the development of machine learning in NLP, it has also become popular in ASAG systems. Some of them apply clustering methods, such as grouping student responses together using LDA clustering to lessen the workload for a human grader [7] or using the k-means algorithm based on common word similarity [8].
Fig. 1. (a) System architecture of SteLLA, consisting of i) the R-RAG Module, which takes the instructor-provided reference answer and rubrics as inputs, generates and extracts a list of evaluation questions with gold answers, and sends it to the LLM; ii) the LLM and QA-based Evaluation Module, in which an LLM is prompted to perform grading using the QA-based evaluation approach; and iii) the Scoring Module, which generates a final grade and feedback. (b) The R-RAG approach. (c) A typical RAG approach.
Others treat it as a classification problem, using, for example, a k-nearest neighbor classifier to detect and diagnose semantic errors in student answers [39].

Most recently, interest in Pre-trained Language Models (PLMs) and LLMs has increased significantly. Accordingly, there has been much research on possible applications of LLMs to the educational field [40]. PLMs such as BERT can be pre-trained on domain resources to improve ASAG. [9] uses LLM-based one-shot prompting and a text similarity scoring model based on Sentence-BERT [41] to grade short answers. [42] evaluated using ChatGPT to perform auto-grading on short text answers, in which ChatGPT directly assesses answers provided by both the educator and the students. They concluded that LLMs can currently be used as a complementary viewpoint but are not ready to serve as an independent tool yet.

Our approach is different from the above in that we use the instructor-provided reference answer and rubrics as a highly relevant external knowledge base, extract structured information in the form of evaluation question-answer pairs, and then ask LLMs to assess to what extent a student's response answers all these evaluation questions.

III. METHOD AND SYSTEM ARCHITECTURE

In this section, we present our approach and the system design. The overall method is to apply the RAG approach to generate structured evaluation questions and corresponding answers from the instructor-provided reference answer and rubrics for a problem. These augmented evaluation question-answer pairs are used to ground an LLM's grading.
Fig. 2. An example to show the flow of grading.
Together with a student's response and prompts, they are sent to an LLM as inputs. The LLM performs the question-answering task to assess to what extent a student's response answers all these evaluation questions and gives the grades and feedback. The grades on all these questions are eventually consolidated into a final grade. Figure 1 part (a) shows the design of the entire system. A concrete example in Figure 2 illustrates the flow of grading. The proposed system is composed of three key modules: a) the R-RAG module based on the reference answer and rubric, b) the Evaluation module based on an LLM and QA, and c) the Scoring module. All the modules are explained in detail in the following.

A. R-RAG Module

The R-RAG module applies the RAG approach based on the reference answer and rubrics. It is specifically designed for the grading task. A typical RAG approach is shown in Figure 1 part (c). Given a query from the user, a retriever, usually using information retrieval techniques, retrieves relevant information from an external knowledge base such as Wikipedia or other reliable datasets. This highly relevant information serves as part of the prompts and guides the LLM to generate specific results for the given query.

As shown in Figure 1 part (b), the R-RAG module takes the instructor-provided reference answer and rubrics as inputs, generates and extracts a list of evaluation questions with gold answers, and sends it to the LLM. More specifically, given a full-credit reference response r and a rubric b, each rubric point is marked as a conditioned target answer. A question-generation model generates a corresponding question for each target answer based on the reference answer. For example, for the rubric point "C and H", the corresponding question could be "What does molecule 1 consist of?" Eventually, this module generates a set of evaluation questions Q = {q1, ..., qn} and their gold answers L = {l1, ..., ln}, where n is the number of rubric points. Each evaluation question reflects a rubric point. Each gold answer is supported by both the reference response and the rubric. The R-RAG module has some unique designs specifically for the grading task.

Highly Relevant Knowledge Base. R-RAG treats the instructor-provided reference answers and rubrics as an external knowledge base, which is highly relevant to the grading task that the LLM is going to perform. Normally, the external knowledge that the RAG approach relies on is very large and requires sophisticated techniques to retrieve query-relevant information. Inspired by the traditional learning assessment process, in which an instructor usually provides a reference answer and rubrics to facilitate graders in grading, we directly use such available data as external knowledge. They are small and highly relevant to the student responses that need to be graded. This has the potential to simplify the system and further enhance its usability.

Structured Information. Because the external knowledge typically used in RAG is large, information retrieval techniques such as ranking are usually used to get the most relevant information. In R-RAG, instead of retrieving ranked relevant information, we aim to extract structured information. This choice is made to perform a structured assessment. To a learner, while it is important to get a correct grade on the answer, it is even more important to understand the knowledge points tested in the problem and how he/she does on each of them. A structured assessment provides more valuable feedback to improve both learning and teaching. Under this consideration, the outputs from R-RAG are structured following the rubric, each of them reflecting a rubric point.

QA-Based Evaluation. When humans grade a student's response to a problem, we do not just compare how similar it is to the reference answer. Instead, for each knowledge point, we ask whether the student's response answers it correctly. Inspired by this human grading process, question answering becomes a natural approach in our automatic grading system. Each bullet point in a rubric is marked as a conditioned answer, and a question generation model is applied to generate a question for it based on the reference answer. Meanwhile, the subset of the reference answer that contains the conditioned answer
phrase is also extracted for the generated question. Together they form a question-answer pair. A list of such pairs is sent as part of the inputs to the LLM.
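A minimal sketch of this answer-conditioned question-generation step is shown below. It assumes a T5-style answer-aware question-generation checkpoint (here valhalla/t5-base-qg-hl, which marks the target answer with <hl> tokens) as one possible automation; recall that the evaluation questions used in the experiments were ultimately written manually and reviewed by the instructor, so this is only an illustration of the intended pipeline.

```python
# Illustrative sketch of R-RAG question generation (not the authors' exact code).
# Assumes a T5-style answer-aware QG checkpoint; the rubric point must literally
# appear in the reference answer for the simple highlighting used here.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

QG_MODEL = "valhalla/t5-base-qg-hl"  # assumed checkpoint; answer span is marked with <hl>
tokenizer = AutoTokenizer.from_pretrained(QG_MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(QG_MODEL)

def build_qa_pairs(reference_answer: str, rubric_points: list[str]) -> list[dict]:
    """For each rubric point (conditioned target answer), generate an evaluation
    question from the reference answer and keep the supporting sentence."""
    pairs = []
    for point in rubric_points:
        # Highlight the rubric point inside the reference answer for the QG model.
        highlighted = reference_answer.replace(point, f"<hl> {point} <hl>", 1)
        inputs = tokenizer(f"generate question: {highlighted}",
                           return_tensors="pt", truncation=True)
        question_ids = model.generate(**inputs, max_new_tokens=48)
        question = tokenizer.decode(question_ids[0], skip_special_tokens=True)
        # Extract the subset of the reference answer that contains the rubric point.
        support = next((s.strip() + "." for s in reference_answer.split(".")
                        if point.lower() in s.lower()), reference_answer)
        pairs.append({"rubric_point": point, "question": question, "gold_answer": support})
    return pairs

# Example with the rubric point discussed above:
# build_qa_pairs("Molecule 1 consists entirely of C and H atoms. ...", ["C and H"])
```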
B. LLM-based Evaluation Module

The LLM-based evaluation module takes the outputs from the R-RAG module, a student's response, and other prompts as inputs. The outputs from this module are a set of numeric grades and detailed feedback to justify its grading.

We apply zero-shot and few-shot learning when prompting the LLM. To better select shots, which are a few task-specific samples provided to an LLM, we use clustering techniques to select the learning samples. All students' responses are sent to a sentence encoder such as SBERT [41] to get their embeddings. Then a clustering algorithm such as KMeans is applied to group them into k clusters. The centroids of all clusters are identified and selected as the few shots. If a centroid is not itself a student's response, the student's response closest to the centroid is used instead.
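The shot-selection step above can be sketched as follows. This is a minimal sketch assuming the sentence-transformers and scikit-learn libraries; the encoder checkpoint and k below are illustrative choices, not the exact settings used in the paper.

```python
# Minimal sketch of clustering-based shot selection, assuming sentence-transformers
# and scikit-learn; "all-MiniLM-L6-v2" and k are illustrative, not the paper's settings.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def select_shots(student_responses: list[str], k: int = 4) -> list[str]:
    """Embed all responses with an SBERT-style encoder, cluster them with KMeans,
    and return the response closest to each cluster centroid as a few-shot sample."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(student_responses)            # shape: (n, d)
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10).fit(embeddings)
    shots = []
    for centroid in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - centroid, axis=1)
        shots.append(student_responses[int(np.argmin(distances))])
    return shots
```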
C. Scoring Module

The Scoring module takes the set of grades and feedback from the Evaluation module as inputs. Based on the weight of each evaluation question, this module performs a calculation such as a weighted sum to generate a final grade for a student's response and a unified feedback. Since the final grade & feedback and the breakdown grades & feedback are all valuable, they are all presented to the user as the outputs of the system.
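A minimal sketch of this weighted-sum consolidation is shown below; the record layout and the per-question weights are illustrative assumptions, not the system's exact implementation.

```python
# Illustrative sketch of the Scoring module's weighted-sum consolidation.
# The per-question weights and record layout are assumptions for illustration.
from typing import NamedTuple

class QuestionResult(NamedTuple):
    question: str
    grade: int        # binary grade from the LLM-based Evaluation module (0 or 1)
    feedback: str
    weight: float     # e.g., 1.0 per rubric point in the dataset used here

def consolidate(results: list[QuestionResult]) -> dict:
    """Combine per-question grades into a final grade while keeping the breakdown."""
    final_grade = sum(r.grade * r.weight for r in results)
    unified_feedback = " ".join(f"[{r.question}] {r.feedback}" for r in results)
    return {
        "final_grade": final_grade,
        "breakdown": [{"question": r.question, "grade": r.grade, "feedback": r.feedback}
                      for r in results],
        "feedback": unified_feedback,
    }
```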
IV. DATA

In this section, we report the data collected for this study. We first describe the data source, then explain how we redacted the data to protect students' privacy, and lastly present statistics of the data.

A. Data Source

The data used in this study were collected from an undergraduate-level introductory Biology course in the Fall 2018 semester at a public university in the United States. The data are students' answers to a problem from an exam. We will make the dataset public after publication. As shown in Figure 3, in part (a) of this problem, students are provided with 3 images of different molecules and asked to rank them in order from the most hydrophobic to the most hydrophilic. In part (b) of the problem, students are asked to briefly explain their choices in part (a). Their short answers to part (b) are the data collected for this study.

B. Privacy Protection

We take our responsibility to protect students' privacy seriously. The data used in this study are all under the approval of the Institutional Review Board (IRB) at the school where the data were collected. We redact the data to make them de-identified through the following pre-processing: a) removing student names and using file names as indices instead; b) removing any information in the answers that can be linked to any specific individual.

C. Labeling Process

Two undergraduate Research Assistants, who had taken the same Biology course before and understood the course materials well, did the labeling as human graders. The entire labeling is an iterative process: the two graders first worked separately to give only a final grade to each student's answer, then added grades for all rubric points, and in the end consolidated the two graders' labels into an agreed version. For the selected few-shot samples, the human graders also give text feedback to justify their grading. This process lasted about two semesters.

The two human graders are first trained by the instructor on how to do grading specifically for assignments or exams for this course. Then they label the data in two steps. In step one, they do the labeling separately. For each evaluation question on a problem, they check to what extent a student's response answers the question. If it answers the question completely correctly, they label it with 1; otherwise, they label it with 0. We do not consider partial credit since the evaluation has been decomposed into a set of questions, each of which is focused on one knowledge point. We originally started with only one final grade for each answer. Along with the development of the approach, the graders were instructed to add labels for all the rubric points of a problem. Then in step two, under the guidance of the instructor, the two human graders identify all the labels on which they do not agree with each other, have a discussion, and converge on labels they both agree on. Eventually, this process gives us the ground-truth labels for evaluation.

D. Characteristics and Statistics

The collected data contain a total of 176 samples. Due to one empty entry, the number of valid samples is 175. The average length of a student's answer is around 39 words. Each answer, which is a paragraph, contains around 2 sentences on average. This is consistent with the usual description of short answers, such as lengths of "phrases to three to four sentences" or "a few words to approximately 100 words" [43]. To facilitate the human graders, the instructor provides one reference answer and a grading rubric which contains 4 key rubric points such as O/OH and H-Bonds. The score of each rubric point is 1. This leads to 4 points total as the full score for the problem. Originally, part (b) of the exam was worth 5 points. The instructor adjusted it to 4 points based on the rubric. Accordingly, the score on each rubric point is binary (0/1) and the score of each student is an integer value in the range of 0-4 inclusive. The following are the reference answer and a sample student answer:

Reference answer: Molecule 1 consists entirely of C and H atoms. This makes molecule 1 entirely non-polar and therefore very hydrophobic. Molecule 3 has an O atom which can form hydrogen bonds, making it polar and hydrophilic.

Sample student answer: Molecule 1: No lone pairs, No special hydrogens therefore hydrophobic. Molecule 2: Two lone pairs, has special hydrogen therefore more hydrophilic than molecule 1
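The resulting labeling scheme can be pictured with the following illustrative record; the field names, identifier, and exact rubric-point labels are assumptions for illustration, not the released format of the dataset.

```python
# Illustrative layout of one labeled sample (not the released dataset format).
# Each of the 4 rubric points receives a binary label, so the total score is 0-4.
sample = {
    "response_id": "example_001",        # file name used as index after de-identification
    "student_answer": "Molecule 1: No lone pairs, No special hydrogens therefore hydrophobic. ...",
    "rubric_labels": {                    # consolidated labels from the two human graders
        "rubric_point_1": 1,              # e.g., the O/OH point
        "rubric_point_2": 1,              # e.g., the H-Bonds point
        "rubric_point_3": 0,
        "rubric_point_4": 0,
    },
}
total_score = sum(sample["rubric_labels"].values())   # integer in the range 0-4
```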
Fig. 3. The problem, the reference answer, and the rubric in the dataset.
V. EXPERIMENTS AND RESULTS

In this section, we describe our experiment settings and report the experiment results.

A. Experiment Settings

In the R-RAG module, the instructor-provided reference answer and the rubric are both supplied. Answer-conditioned question generation is applied to the reference answer, in which one rubric point is set as a conditioned answer to generate one question. To make sure the generated questions are of high quality, so that we can have a more consistent and solid evaluation of the LLM's performance, we manually generate three questions for each rubric point based on the reference answer. The course instructor reviews these questions and selects the best one of the three. In the Evaluation Module, the system calls the GPT4 API (the GPT4-Turbo-Preview version). When prompting GPT4, we design a general instruction and a question-specific instruction. The general instruction specifies the role, task, detailed instructions, and constraints on how to grade, such as the grade scale, the criteria for each grade, etc. The question-specific instruction addresses a grader's personal criteria. For example, for the evaluation question "What does molecule 1 consist of?", although the reference answer expects a student's answer to contain the information that molecule 1 consists of C and H atoms, the course instructor considers a student's answer that only mentions the C (carbon) atom to also be correct. This personal criterion is addressed in the question-specific instruction. We apply few-shot learning, in which the 4-shot setting gives us the best performance. To select samples, we perform random selection. Selected samples are excluded from evaluation. The following shows an example of the instructions in the prompt:

General instruction: You are the instructor of a college-level Introductory Biology course. You are going to grade the exam for this course. Your grading should be based on the question asked, the full-credit answer, the student's answer, and nothing else. Give the binary score 1 or 0, in which 1 means the student's answer is correct and 0 means the student's answer is incorrect or does not answer the question, and justify your grading.

Question-specific instruction: As long as the answer mentions or implies that the molecule contains just carbon, it should be considered as being correct and graded as 1.
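A minimal sketch of how such a per-question grading call could be issued is shown below, assuming the OpenAI Python client (openai>=1.x). The message template, the few-shot record fields, and the decoding parameters are illustrative assumptions; only the instruction types and the GPT4-Turbo-Preview model follow the description above.

```python
# Minimal sketch of the per-question grading call, assuming the OpenAI Python
# client (openai>=1.x). Prompt template and parameters are illustrative; the
# paper specifies only the instruction types, few shots, and gpt-4-turbo-preview.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_one_question(general_instruction: str, question_instruction: str,
                       question: str, gold_answer: str, student_answer: str,
                       shots: list[dict]) -> str:
    messages = [{"role": "system",
                 "content": general_instruction + "\n" + question_instruction}]
    for shot in shots:  # few-shot samples with human-written grades and feedback
        messages.append({"role": "user", "content": shot["prompt"]})
        messages.append({"role": "assistant", "content": shot["graded_output"]})
    messages.append({"role": "user", "content": (
        f"Evaluation question: {question}\n"
        f"Full-credit answer: {gold_answer}\n"
        f"Student's answer: {student_answer}\n"
        "Give the binary score (1 or 0) and justify your grading."
    )})
    response = client.chat.completions.create(model="gpt-4-turbo-preview",
                                              messages=messages, temperature=0)
    return response.choices[0].message.content
```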
B. Evaluation Results

SteLLA essentially takes the role of a grader. Thus we evaluate the results by calculating the agreement with the human grader's grading, which is commonly used in grading evaluation. Because this work is pioneering in applying QA-based evaluation to the ASAG task, on a newly collected real-world dataset, and this field is relatively new, we were not able to find highly related models or systems to compare with. As explained in the labeling process section, under the instructor's supervision, the two human graders discussed the differences in the grades they assigned to the same questions, reached an agreement, and reassigned the agreed grades to those questions as the ground-truth labels. We compare the agreement between the results from our system and the ground-truth labels and report both Cohen's Kappa coefficient (κ) [44] and Raw Agreement (Accuracy).

Agreement Results. As shown in Table I, the Cohen's Kappa coefficient between the human grader and the ground-truth labels reaches 0.8315, which is normally accepted as a near-perfect agreement.
Although our system still does not reach human performance, it achieves a substantial agreement with the ground-truth labels, with κ = 0.6720. As for the raw agreement, it is about 8% lower than the human grader's. These results show that our system is promising for automatic grading while maintaining high accuracy.

TABLE I
AGREEMENT RESULTS BETWEEN THE SYSTEM AND LABELS

              Cohen's Kappa    Raw Agreement
Human            0.8315           0.9157
Our System       0.6720           0.8358
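The agreement numbers in Table I can be computed from two aligned label vectors as sketched below, assuming scikit-learn; the label arrays here are placeholders, not the study's data.

```python
# Sketch of the agreement computation behind Table I, assuming scikit-learn.
# y_ground_truth / y_system below are placeholder vectors, not the study's labels.
from sklearn.metrics import cohen_kappa_score, accuracy_score

def agreement(y_reference: list[int], y_grader: list[int]) -> tuple[float, float]:
    """Return (Cohen's kappa, raw agreement) between two binary label vectors,
    one label per (student response, evaluation question) pair."""
    kappa = cohen_kappa_score(y_reference, y_grader)
    raw = accuracy_score(y_reference, y_grader)   # raw agreement = fraction of identical labels
    return kappa, raw

# Example usage with toy labels:
y_ground_truth = [1, 0, 1, 1, 0, 1, 0, 0]
y_system       = [1, 0, 1, 0, 0, 1, 0, 1]
print(agreement(y_ground_truth, y_system))
```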
Human Evaluation on Feedback. To further understand the text generated by the LLM to justify its grading, we did a human evaluation of all the justifications generated by GPT4. The two human graders were instructed to do the evaluation. In the human evaluation, the question we ask is how relevant the justification generated by GPT4 is in supporting its grading. In other words, whether the grade assigned by GPT4 is correct or incorrect, does the justification support this grading? The data we use for human evaluation are from a 6-shot learning experiment setting, which leaves a total of 169 samples for evaluation. Since 4 evaluation questions are generated for the problem, there is a total of 676 GPT4 responses to be evaluated. Very surprisingly, only 1 response is evaluated to be irrelevant to the numeric grade. Even when the grading of GPT4 is incorrect, it is usually still based on the relevant facts but with too much or not enough inference, as will be shown in the sample results analysis below. This shows that GPT4 does do the grading based on the relevant facts, which increases the confidence in using an application based on it.

C. Sample Grading and Feedback Analysis

In Table II, we list three sample students' responses and GPT4's grading results on the evaluation questions. We have several findings about using GPT4 to do grading:

• GPT4 is good at identifying relevant facts or statements. For example, in Q1 and Q2 for the student response 9328795, GPT4 is able to identify that molecule 3 has an Oxygen atom and can form hydrogen bonds even though the two phrases are a bit far from each other in the original text answer. In student response 9328809, GPT4 identifies the question-related information that molecule 3 has an O atom and that it cannot form H-bonds, and then grades the student response on Q1 as correct and on Q3 as incorrect.

• GPT4 can be tolerant of some typos in the input. For example, in student response 9328795, there are typos or errors such as tho and then. But they do not affect GPT4's understanding of the response text.

• GPT4 sometimes can infer the meaning of the text properly, while at other times it infers too much implication from the given text. For example, in Q3 for the student response 9328790, based on "Molecule 1 is most hydrophobic because it is all carbons and it can't make hydrogen bonds.", GPT4 properly infers that the student implies molecule 1 consists of carbon atoms and does not contain elements like oxygen or nitrogen which can form hydrogen bonds, and further grades it as being correct on this evaluation question. However, in Q4 for the student response 9328809, GPT4 interprets the student's statement "Molecule 1 does not have donor or acceptor" as suggesting that molecule 1 is non-polar and grades it as being correct, which is actually incorrect. In this example, GPT4's interpretation might be true in general. However, it infers too much from the student's response in this specific problem, in which the instructor tries to test the concept of non-polarity. We notice this is a type of error that GPT4 is prone to make in this grading task. This error type shows that, since an LLM such as GPT is trained on massive data and is expected to have learned a large amount of general knowledge, how to ground it to some specific task and some specific domain is a big challenge. Our methods of R-RAG and structured evaluation provide an approach to address this issue. We also experimented with prompt engineering to set some constraints, such as defining the role to be a college-level Biology instructor and explicitly asking GPT4 to do the grading based only on the student's response, the evaluation question, the reference answer to the question, and nothing else. However, we find it is still hard to eliminate such error types by refining the prompts only.

• Error cases of Q1 and Q2 in the student response 9328790 show the complexity of the grading task. Because the student did not give any statements about molecule 3, GPT4 grades the response as incorrect on these two questions, which are both about molecule 3. However, the human grader is more focused on the concept that the most hydrophilic molecule has an OH which makes it able to form H-Bonds. Based on this, although the student discusses molecule 2 instead of molecule 3, the response shows he/she indeed understands the concept correctly. Accordingly, the human graders give the student full credit on these two questions. During the human evaluation process, the course instructor and the two human graders all agree that, in such cases, GPT4 does the job properly based on the instructions it is given. The challenge lies not only in how to make an LLM understand the abstract concept behind the text, but also in how to formulate what is examined in a problem in the learning process itself.

D. Ablation Study

We did the following ablation studies to show the effect of some parameters and settings. Due to time and cost constraints, the following experiments were done using GPT-4.
TABLE II
EXAMPLE GRADINGS AND FEEDBACK
Effect of Clustering. As shown in Figure 4, applying a clustering algorithm to select samples for few-shot learning consistently improves the Cohen's Kappa coefficient compared with not using clustering, e.g., about a 0.2 increment in the κ value under one-shot. This supports the effectiveness of the clustering approach in selecting learning samples that are expected to better represent the distribution of the data, and further empowers the capability of an LLM such as GPT4 on this specific dataset and task.

Number of Shots. We experimented with different shot numbers. The Cohen's Kappa coefficient values in Figure 4 show that a few learning samples can significantly improve the performance of a general LLM on a specific task such as grading. Under the setting with clustering, the 4-shot gives the best result, which is significantly higher than the 3-shot and slightly higher than the 5-shot. Under the setting without clustering, the performance under the 6-shot is significantly better than the 1-shot, while the 10-shot does not show much further improvement compared with the 6-shot. This is consistent with the common understanding that few-shot in-context learning can guide a general LLM toward a specific task such as grading in this experiment. Meanwhile, the effect declines after reaching a reasonable shot number.

Fig. 4. Agreement (Cohen's κ) with and without clustering-based shot selection under different numbers of shots.

[5] W. A. Mansour, S. Albatarni, S. Eltanbouly, and T. Elsayed, "Can large language models automatically score proficiency of written essays?" in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds. Torino, Italia: ELRA and ICCL, May 2024, pp. 2777–2786. [Online]. Available: https://aclanthology.org/2024.lrec-main.247

[6] Y. Wang, C. Wang, R. Li, and H. Lin, "On the use of bert for automated essay scoring: Joint learning of multi-scale essay representation," in Proceedings of the 2022 Conference of the North