SteLLA: A Structured Grading System Using LLMs with RAG
Hefei Qiu∗ , Brian White† , Ashley Ding‡ , Reinaldo Costa† , Ali Hachem† , Wei Ding† and Ping Chen†
∗ Department of Computer Science
Fitchburg State University, 160 Pearl Street, Fitchburg, MA 01420-2697
† Department of Computer Science
Email: [email protected]
Abstract—Large Language Models (LLMs) have shown strong general capabilities in many applications. However, how to make them reliable tools for some specific tasks such as automated short answer grading (ASAG) remains a challenge. We present SteLLA (Structured Grading System Using LLMs with RAG), in which a) a Retrieval Augmented Generation (RAG) approach is used to empower LLMs specifically on the ASAG task by extracting structured information from highly relevant and reliable external knowledge based on the instructor-provided reference answer and rubric, and b) an LLM performs a structured, question-answering-based evaluation of student answers to provide analytical grades and feedback. A real-world dataset containing students' answers in an exam was collected from a college-level Biology course. Experiments show that our proposed system can achieve substantial agreement with the human grader while providing break-down grades and feedback on all the knowledge points examined in the problem. A qualitative and error analysis of the feedback generated by GPT4 shows that GPT4 is good at capturing facts but may be prone to inferring too much implication from the given text in the grading task, which provides insights into the usage of LLMs in ASAG systems.

Index Terms—LLM-based ASAG system, RAG, QA-based evaluation, structured evaluation

I. INTRODUCTION

Assessment plays an important role in the teaching and learning process. It usually includes closed-ended questions such as multiple choice and open-ended questions such as short-answer questions. Although open-ended questions are more powerful in evaluating students' learning, grading such questions is more time-consuming. In some scenarios, such as introductory-level college courses with hundreds of students or online courses with an even larger number of learners, the potentially heavy workload of manually grading open-ended short-answer questions hinders their usage in practice. An automated grading system can provide prompt feedback to a learner, support large-scale learning environments, and further facilitate active and life-long learning. The recent development of Large Language Models (LLMs) has shown their strong general capabilities in many tasks. However, how to use them to automatically provide reliable grading and feedback remains a challenge. We propose SteLLA (Structured Grading System Using LLMs with RAG), an automatic grading system that performs a structured grading based on Question Answering (QA) techniques, which is empowered by highly relevant augmented information retrieved from the instructor-provided reference answer and rubric.

The field of automatic grading and feedback systems has been explored through various domains such as programming [1], [2] and mathematics [3], [4], as well as on different types of answers such as essays [5], [6] and short answers [7]–[9]. Compared with an essay, which is usually long and consists of multiple paragraphs, a short answer is much shorter, with just a couple of sentences. Grading short answers is more focused on correctness and does not consider text coherence or writing style as in essay grading. SteLLA is a system designed for automatic short-answer grading (ASAG).

There have been many attempts to build automatic grading and feedback systems, many of which utilize recent developments in Natural Language Processing (NLP). Motivated by the huge progress of LLMs and the needs of instructors and learners, our design uses LLMs as a key component. To ground a general LLM on the specific task of grading, we propose a reference answer and rubric based retrieval augmented generation (R-RAG) approach. Given an instructor-provided reference answer and a rubric, R-RAG extracts highly relevant and structured information from them. It applies question-generation and question-answering techniques to generate a set of evaluation questions and corresponding answers. An LLM performs a structured grading by checking how well a student's response answers these evaluation questions. Eventually, an overall grade, the breakdown grades, and feedback are generated for the user.

The contributions of this work are as follows:
• We propose an LLM-based ASAG system, SteLLA, that shows substantial agreement with human graders.
• We present R-RAG, which is specifically designed for the grading task. It treats an instructor-provided reference answer and rubric as a knowledge base to extract highly
relevant augmented information to ground a general-trained LLM on the grading task.
• Our system is the first attempt to apply a QA-based structured grading. Compared with the text-similarity-based grading approach, i.e., directly comparing the similarity between the student answer and the reference answer, the QA-based approach provides a tool to induce a deeper semantic understanding of the text in grading. Moreover, it provides not only an overall grade but also decomposed grades and feedback on the knowledge points examined in the problem.
• We systematically analyze the responses generated by an LLM and show both its capabilities and the errors it is prone to make, which provides some insights on how to properly use an LLM in the grading task.

The rest of the paper is organized as follows: Section II introduces the background and related work; the method and system architecture are explained in Section III; how we collected the data is described in Section IV; Section V presents the experiments and results; the last section, Section VI, gives the conclusion and future work.

II. BACKGROUND AND RELATED WORK

A. QA-Based Evaluation

While Question Answering itself is one of the major tasks in NLP, the QA-based approach is novel in applying QA techniques to perform text evaluation for other NLP tasks. This approach has been applied to evaluate the quality of texts in summarization or text compression tasks. Some of the earlier work used the QA evaluation paradigm to examine to what extent documents could be summarized without affecting comprehension of them [10], or to perform human evaluations of summaries [11]. Along with the progress of question generation techniques, multiple studies have automatically generated questions from the reference summary [12], [13], from the source document [14], and from the evaluated summary [15], [16] to check fact-based consistency or faithfulness. Extending previous work, QuestEval combines both recall and precision approaches and shows an improved QA-based metric for evaluating summarization [17]. QuestEval has also been applied to evaluating text simplification [18] and text converted from semi-structured data such as tables [19]. To the best of our knowledge, our work is the first attempt to apply QA-based evaluation to the grading and feedback task.

B. Large Language Models (LLMs)

Language Modeling (LM) has been one of the central tasks in NLP. In general, LM is to learn a probability distribution over sequences of tokens by predicting the probabilities of the next or missing token(s). Pre-trained language models such as BERT [20] have shown surprising capability in learning context-aware word representations and achieved high performance in a series of NLP tasks. Since the launch of GPT-3 [21], LLMs have attracted a lot of attention. Compared with pre-trained language models, LLMs are scaled with a much larger number of model parameters and much more training data. They show emerging abilities to solve more complex tasks. ChatGPT (OpenAI 2022), developed upon the GPT-3 (OpenAI 2021) and later series, provides a highly accessible and effective way to use LLMs in a conversational manner and without fine-tuning. This has stimulated a large amount of research and many applications. The most recent versions, GPT-4 (OpenAI 2023) and GPT-4o (OpenAI 2024), are multimodal models that accept both text and images as inputs.

Recent LLMs use the Transformer [22] as the backbone architecture. Originally introduced for the machine translation task, the vanilla Transformer is built on an encoder-decoder structure. The encoder and decoder are both stacks of transformer blocks. Through the multi-head attention mechanism, the encoder encodes the input sentence in one language into a latent representation space; the decoder decodes this representation to autoregressively generate the translated sentence. Different from the vanilla Transformer, the GPT series uses the decoder only.

C. Retrieval-Augmented Generation

Although LLMs have shown strong general capabilities, there are some key challenges these models still suffer from, e.g., factual hallucination [23]–[25]. Retrieval-Augmented Generation (RAG) [26], [27] has been proposed and established as a technique to alleviate such challenges. It references reliable external knowledge by retrieving relevant information and thereby further enhances the performance of LLMs. Some works use the retrieved data as augmented inputs to guide the generation of LLMs [26], [28]. Others apply this approach in the middle of generation [29], [30] or after the generation [31], [32]. We apply RAG by using it to retrieve augmented information as inputs. We treat an instructor-provided reference answer as an external knowledge base, extract the information that contains the target answer to an evaluation question, and send it together with the student response and the evaluation question to an LLM to perform the assessment.

D. Automatic Short-Answer Grading

Research on ASAG has a long history. In the earlier days of ASAG research, many traditional methods used rule-based models [33]. For example, the idea of Concept Mapping is more rule-based: it breaks the student answers into several concepts and detects whether each concept is present or not [34]–[36]. The approach that uses information retrieval techniques is also more rule-based. It usually checks student answers by relying on pattern matching through, e.g., regular expressions or parsing trees [37], [38].

Along with the development of machine learning in NLP, it has also become popular in ASAG systems. Some of them apply clustering methods, such as grouping student responses together using LDA clustering to lessen the workload for a human grader [7] or using the k-means algorithm based on common word similarity [8].
Fig. 1. (a) System architecture of SteLLA, consisting of i) the R-RAG Module, which takes the instructor-provided reference answer and rubrics as inputs, generates and extracts a list of evaluation questions with gold answers, and sends it to the LLM; ii) the LLM and QA-based Evaluation Module, in which an LLM is prompted to perform grading using the QA-based evaluation approach; and iii) the Scoring Module, which generates a final grade and feedback. (b) The R-RAG approach. (c) A typical RAG approach.
Others treat it as a classification problem, using, for example, a k-nearest neighbor classifier to detect and diagnose semantic errors in student answers [39].

Most recently, interest in Pre-trained Language Models (PLMs) and LLMs has increased significantly. Accordingly, there has been much research on possible applications of LLMs to the educational field [40]. PLMs such as BERT can be pre-trained on domain resources to improve ASAG. [9] uses LLM-based one-shot prompting and a text similarity scoring model based on Sentence-BERT [41] to grade short answers. [42] evaluated using ChatGPT to perform auto-grading on short text answers, in which ChatGPT directly assesses answers provided by both the educator and the students. They concluded that LLMs can currently be used as a complementary viewpoint but are not ready to serve as an independent tool yet.

Our approach is different from the above in that we use the instructor-provided reference answer and rubrics as a highly relevant external knowledge base, extract structured information in the form of evaluation question-answer pairs, and then ask LLMs to assess to what extent a student's response answers all these evaluation questions.

III. METHOD AND SYSTEM ARCHITECTURE

In this section, we present our approach and the system design. The overall method is to apply the RAG approach to generate structured evaluation questions and corresponding answers from the instructor-provided reference answer and rubrics for a problem. These augmented evaluation question-answer pairs are used to ground an LLM's grading.
Fig. 2. An example to show the flow of grading.
Together with a student's response and prompts, they are sent to an LLM as inputs. The LLM performs the question-answering task to assess to what extent a student's response answers all these evaluation questions and gives the grades and feedback. The grades on all these questions are eventually consolidated into a final grade. Figure 1 part (a) shows the design of the entire system. A concrete example in Figure 2 illustrates the flow of grading. The proposed system is composed of three key modules: a) the R-RAG module based on the reference answer and rubric, b) the Evaluation module based on an LLM and QA, and c) the Scoring module. All the modules are explained in detail in the following.

A. R-RAG Module

The R-RAG module applies the RAG approach based on the reference answer and rubrics. It is specifically designed for the grading task. A typical RAG approach is shown in Figure 1 part (c). Given a query from the user, a retriever, usually using information retrieval techniques, retrieves relevant information from an external knowledge base such as Wikipedia or other reliable datasets. This highly relevant information serves as part of the prompts and guides the LLM to generate specific results for the given query.

As shown in Figure 1 part (b), the R-RAG module takes the instructor-provided reference answer and rubrics as inputs, generates and extracts a list of evaluation questions with gold answers, and sends it to the LLM. More specifically, given a full-credit reference response r and a rubric b, each rubric point is marked as a conditioned target answer. A question-generation model generates a corresponding question for each target answer based on the reference answer. For example, for the rubric point "C and H", the corresponding question could be "What does molecule 1 consist of?" Eventually, this module generates a set of evaluation questions Q = {q1, ..., qn} and their gold answers L = {l1, ..., ln}, where n is the number of rubric points. Each evaluation question reflects a rubric point. Each gold answer is supported by both the reference response and the rubric. The R-RAG module has some unique designs specifically for the grading task.

Highly Relevant Knowledge Base. R-RAG treats the instructor-provided reference answers and rubrics as an external knowledge base, which is highly relevant to the grading task that the LLM is going to perform. Normally, the external knowledge that the RAG approach relies on is very large and requires sophisticated techniques to retrieve query-relevant information. Inspired by the traditional learning assessment process, in which an instructor usually provides a reference answer and rubrics to facilitate graders in grading, we directly use such available data as external knowledge. They are small and highly relevant to the student responses that need to be graded. This has the potential to simplify the system and further enhance its usability.

Structured Information. Because the external knowledge typically used in RAG is large, information retrieval techniques such as ranking are usually used to get the most relevant information. In R-RAG, instead of retrieving ranked relevant information, we aim to extract structured information. This choice is made to perform a structured assessment. To a learner, while it is important to get a correct grade on the answer, it is even more important to understand the knowledge points tested in the problem and how he/she does on each of them. A structured assessment provides more valuable feedback to improve both learning and teaching. Under this consideration, the outputs from R-RAG are structured following the rubric, each of them reflecting a rubric point.

QA-Based Evaluation. When humans grade a student's response to a problem, we do not just compare how similar it is to the reference answer. Instead, for each knowledge point, we ask whether the student's response answers it correctly. Inspired by this human grading process, question answering becomes a natural approach in our automatic grading system. Each bullet point in a rubric is marked as a conditioned answer, and a question generation model is applied to generate a question for it based on the reference answer. Meanwhile, the subset of the reference answer that contains the conditioned answer
phrase is also extracted for the generated question. Together they form a question-answer pair. A list of such pairs is sent as part of the inputs to the LLM.
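A minimal sketch of this answer-conditioned question-generation step is shown below. It assumes a T5-style answer-aware question-generation checkpoint (here valhalla/t5-base-qg-hl, which marks the target answer with <hl> tokens) as one possible automation; recall that the evaluation questions used in the experiments were ultimately written manually and reviewed by the instructor, so this is only an illustration of the intended pipeline.

```python
# Illustrative sketch of R-RAG question generation (not the authors' exact code).
# Assumes a T5-style answer-aware QG checkpoint; the rubric point must literally
# appear in the reference answer for the simple highlighting used here.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

QG_MODEL = "valhalla/t5-base-qg-hl"  # assumed checkpoint; answer span is marked with <hl>
tokenizer = AutoTokenizer.from_pretrained(QG_MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(QG_MODEL)

def build_qa_pairs(reference_answer: str, rubric_points: list[str]) -> list[dict]:
    """For each rubric point (conditioned target answer), generate an evaluation
    question from the reference answer and keep the supporting sentence."""
    pairs = []
    for point in rubric_points:
        # Highlight the rubric point inside the reference answer for the QG model.
        highlighted = reference_answer.replace(point, f"<hl> {point} <hl>", 1)
        inputs = tokenizer(f"generate question: {highlighted}",
                           return_tensors="pt", truncation=True)
        question_ids = model.generate(**inputs, max_new_tokens=48)
        question = tokenizer.decode(question_ids[0], skip_special_tokens=True)
        # Extract the subset of the reference answer that contains the rubric point.
        support = next((s.strip() + "." for s in reference_answer.split(".")
                        if point.lower() in s.lower()), reference_answer)
        pairs.append({"rubric_point": point, "question": question, "gold_answer": support})
    return pairs

# Example with the rubric point discussed above:
# build_qa_pairs("Molecule 1 consists entirely of C and H atoms. ...", ["C and H"])
```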
B. LLM-based Evaluation Module

The LLM-based evaluation module takes the outputs from the R-RAG module, a student's response, and other prompts as inputs. The outputs from this module are a set of numeric grades and detailed feedback to justify its grading.

We apply zero-shot and few-shot learning when prompting the LLM. To better select shots, which are a few task-specific samples provided to an LLM, we use clustering techniques to select the learning samples. All students' responses are sent to a sentence encoder such as SBERT [41] to get their embeddings. Then a clustering algorithm such as KMeans is applied to group them into k clusters. The centroids of all clusters are identified and selected as the few shots. If a centroid is not itself a student's response, the student's response closest to the centroid is used instead.
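The shot-selection step above can be sketched as follows. This is a minimal sketch assuming the sentence-transformers and scikit-learn libraries; the encoder checkpoint and k below are illustrative choices, not the exact settings used in the paper.

```python
# Minimal sketch of clustering-based shot selection, assuming sentence-transformers
# and scikit-learn; "all-MiniLM-L6-v2" and k are illustrative, not the paper's settings.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def select_shots(student_responses: list[str], k: int = 4) -> list[str]:
    """Embed all responses with an SBERT-style encoder, cluster them with KMeans,
    and return the response closest to each cluster centroid as a few-shot sample."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(student_responses)            # shape: (n, d)
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10).fit(embeddings)
    shots = []
    for centroid in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - centroid, axis=1)
        shots.append(student_responses[int(np.argmin(distances))])
    return shots
```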
C. Scoring Module

The Scoring module takes the set of grades and feedback from the Evaluation module as inputs. Based on the weight of each evaluation question, this module performs a calculation such as a weighted sum to generate a final grade for a student's response and a unified feedback. Since the final grade & feedback and the breakdown grades & feedback are all valuable, they are all presented to the user as the outputs of the system.
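A minimal sketch of this weighted-sum consolidation is shown below; the record layout and the per-question weights are illustrative assumptions, not the system's exact implementation.

```python
# Illustrative sketch of the Scoring module's weighted-sum consolidation.
# The per-question weights and record layout are assumptions for illustration.
from typing import NamedTuple

class QuestionResult(NamedTuple):
    question: str
    grade: int        # binary grade from the LLM-based Evaluation module (0 or 1)
    feedback: str
    weight: float     # e.g., 1.0 per rubric point in the dataset used here

def consolidate(results: list[QuestionResult]) -> dict:
    """Combine per-question grades into a final grade while keeping the breakdown."""
    final_grade = sum(r.grade * r.weight for r in results)
    unified_feedback = " ".join(f"[{r.question}] {r.feedback}" for r in results)
    return {
        "final_grade": final_grade,
        "breakdown": [{"question": r.question, "grade": r.grade, "feedback": r.feedback}
                      for r in results],
        "feedback": unified_feedback,
    }
```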
IV. DATA

In this section, we report the data collected for this study. We first describe the data source, then explain how we redacted the data to protect students' privacy, and lastly present statistics of the data.

A. Data Source

The data used in this study were collected from an undergraduate-level introductory Biology course in the Fall 2018 semester at a public university in the United States. The data are students' answers to a problem from an exam. We will make the dataset public after publication. As shown in Figure 3, in part (a) of this problem, students are provided with 3 images of different molecules and asked to rank them in order from the most hydrophobic to the most hydrophilic. In part (b) of the problem, students are asked to briefly explain their choices in part (a). Their short answers to part (b) are the data collected for this study.

B. Privacy Protection

We take our responsibility to protect students' privacy seriously. The data used in this study are all under the approval of the Institutional Review Board (IRB) at the school where the data were collected. We redact the data to make them de-identified through the following pre-processing: a) removing student names and using file names as indices instead; b) removing any information in the answers that can be linked to any specific individual.

C. Labeling Process

Two undergraduate Research Assistants, who had taken the same Biology course before and understood the course materials well, did the labeling as human graders. The entire labeling is an iterative process: the two graders first worked separately to give only a final grade to each student's answer, then added grades for all rubric points, and in the end consolidated the two graders' labels into an agreed version. For the selected few-shot samples, the human graders also give text feedback to justify their grading. This process lasted about two semesters.

The two human graders are first trained by the instructor on how to do grading specifically for assignments or exams for this course. Then they label the data in two steps. In step one, they do the labeling separately. For each evaluation question on a problem, they check to what extent a student's response answers the question. If it answers the question completely correctly, they label it with 1; otherwise, they label it with 0. We do not consider partial credit since the evaluation has been decomposed into a set of questions, each of which is focused on one knowledge point. We originally started with only one final grade for each answer. Along with the development of the approach, the graders were instructed to add labels for all the rubric points of a problem. Then in step two, under the guidance of the instructor, the two human graders identify all the labels on which they do not agree with each other, have a discussion, and converge on labels they both agree on. Eventually, this process gives us the ground-truth labels for evaluation.

D. Characteristics and Statistics

The collected data contain a total of 176 samples. Due to one empty entry, the number of valid samples is 175. The average length of a student's answer is around 39 words. Each answer, which is a paragraph, contains around 2 sentences on average. This is consistent with the usual description of short answers, such as lengths of "phrases to three to four sentences" or "a few words to approximately 100 words" [43]. To facilitate the human graders, the instructor provides one reference answer and a grading rubric which contains 4 key rubric points such as O/OH and H-Bonds. The score of each rubric point is 1. This leads to 4 points total as the full score for the problem. Originally, part (b) of the exam was worth 5 points. The instructor adjusted it to 4 points based on the rubric. Accordingly, the score on each rubric point is binary (0/1) and the score of each student is an integer value in the range of 0-4 inclusive. The following are the reference answer and a sample student answer:

Reference answer: Molecule 1 consists entirely of C and H atoms. This makes molecule 1 entirely non-polar and therefore very hydrophobic. Molecule 3 has an O atom which can form hydrogen bonds, making it polar and hydrophilic.

Sample student answer: Molecule 1: No lone pairs, No special hydrogens therefore hydrophobic. Molecule 2: Two lone pairs, has special hydrogen therefore more hydrophilic than molecule 1
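The resulting labeling scheme can be pictured with the following illustrative record; the field names, identifier, and exact rubric-point labels are assumptions for illustration, not the released format of the dataset.

```python
# Illustrative layout of one labeled sample (not the released dataset format).
# Each of the 4 rubric points receives a binary label, so the total score is 0-4.
sample = {
    "response_id": "example_001",        # file name used as index after de-identification
    "student_answer": "Molecule 1: No lone pairs, No special hydrogens therefore hydrophobic. ...",
    "rubric_labels": {                    # consolidated labels from the two human graders
        "rubric_point_1": 1,              # e.g., the O/OH point
        "rubric_point_2": 1,              # e.g., the H-Bonds point
        "rubric_point_3": 0,
        "rubric_point_4": 0,
    },
}
total_score = sum(sample["rubric_labels"].values())   # integer in the range 0-4
```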
Fig. 3. The problem, the reference answer, and the rubric in the dataset.
V. EXPERIMENTS AND RESULTS

In this section, we describe our experiment settings and report the experiment results.

A. Experiment Settings

In the R-RAG module, the instructor-provided reference answer and the rubric are both supplied. Answer-conditioned question generation is applied to the reference answer, in which one rubric point is set as a conditioned answer to generate one question. To make sure the generated questions are of high quality, so that we can have a more consistent and solid evaluation of the LLM's performance, we manually generate three questions for each rubric point based on the reference answer. The course instructor reviews these questions and selects the best one of the three. In the Evaluation Module, the system calls the GPT4 API (the GPT4-Turbo-Preview version). When prompting GPT4, we design a general instruction and a question-specific instruction. The general instruction specifies the role, task, detailed instructions, and constraints on how to grade, such as the grade scale, the criteria for each grade, etc. The question-specific instruction addresses a grader's personal criteria. For example, for the evaluation question "What does molecule 1 consist of?", although the reference answer expects a student's answer to contain the information that molecule 1 consists of C and H atoms, the course instructor considers a student's answer that only mentions the C (carbon) atom to also be correct. This personal criterion is addressed in the question-specific instruction. We apply few-shot learning, in which the 4-shot setting gives us the best performance. To select samples, we perform random selection. Selected samples are excluded from evaluation. The following shows an example of the instructions in the prompt:

General instruction: You are the instructor of a college-level Introductory Biology course. You are going to grade the exam for this course. Your grading should be based on the question asked, the full-credit answer, the student's answer, and nothing else. Give the binary score 1 or 0, in which 1 means the student's answer is correct and 0 means the student's answer is incorrect or does not answer the question, and justify your grading.

Question-specific instruction: As long as the answer mentions or implies that the molecule contains just carbon, it should be considered as being correct and graded as 1.
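A minimal sketch of how such a per-question grading call could be issued is shown below, assuming the OpenAI Python client (openai>=1.x). The message template, the few-shot record fields, and the decoding parameters are illustrative assumptions; only the instruction types and the GPT4-Turbo-Preview model follow the description above.

```python
# Minimal sketch of the per-question grading call, assuming the OpenAI Python
# client (openai>=1.x). Prompt template and parameters are illustrative; the
# paper specifies only the instruction types, few shots, and gpt-4-turbo-preview.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_one_question(general_instruction: str, question_instruction: str,
                       question: str, gold_answer: str, student_answer: str,
                       shots: list[dict]) -> str:
    messages = [{"role": "system",
                 "content": general_instruction + "\n" + question_instruction}]
    for shot in shots:  # few-shot samples with human-written grades and feedback
        messages.append({"role": "user", "content": shot["prompt"]})
        messages.append({"role": "assistant", "content": shot["graded_output"]})
    messages.append({"role": "user", "content": (
        f"Evaluation question: {question}\n"
        f"Full-credit answer: {gold_answer}\n"
        f"Student's answer: {student_answer}\n"
        "Give the binary score (1 or 0) and justify your grading."
    )})
    response = client.chat.completions.create(model="gpt-4-turbo-preview",
                                              messages=messages, temperature=0)
    return response.choices[0].message.content
```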
B. Evaluation Results

SteLLA essentially takes the role of a grader. Thus we evaluate the results by calculating the agreement with the human grader's grading, which is commonly used in grading evaluation. Because this work is pioneering in applying QA-based evaluation to the ASAG task, on a newly collected real-world dataset, and this field is relatively new, we were not able to find highly related models or systems to compare with. As explained in the labeling process section, under the instructor's supervision, the two human graders discussed the differences in the grades they assigned to the same questions, reached an agreement, and reassigned the agreed grades to those questions as the ground-truth labels. We compare the agreement between the results from our system and the ground-truth labels and report both Cohen's Kappa coefficient (κ) [44] and Raw Agreement (Accuracy).

Agreement Results. As shown in Table I, the Cohen's Kappa coefficient between the human grader and the ground-truth labels reaches 0.8315, which is normally accepted as a near-perfect agreement.
Although our system still does not reach human performance, it achieves a substantial agreement with the ground-truth labels, with κ = 0.6720. As for the raw agreement, it is about 8% lower than the human grader's. These results show that our system is promising for automatic grading while maintaining high accuracy.

TABLE I
AGREEMENT RESULTS BETWEEN THE SYSTEM AND LABELS

              Cohen's Kappa    Raw Agreement
Human            0.8315           0.9157
Our System       0.6720           0.8358
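The agreement numbers in Table I can be computed from two aligned label vectors as sketched below, assuming scikit-learn; the label arrays here are placeholders, not the study's data.

```python
# Sketch of the agreement computation behind Table I, assuming scikit-learn.
# y_ground_truth / y_system below are placeholder vectors, not the study's labels.
from sklearn.metrics import cohen_kappa_score, accuracy_score

def agreement(y_reference: list[int], y_grader: list[int]) -> tuple[float, float]:
    """Return (Cohen's kappa, raw agreement) between two binary label vectors,
    one label per (student response, evaluation question) pair."""
    kappa = cohen_kappa_score(y_reference, y_grader)
    raw = accuracy_score(y_reference, y_grader)   # raw agreement = fraction of identical labels
    return kappa, raw

# Example usage with toy labels:
y_ground_truth = [1, 0, 1, 1, 0, 1, 0, 0]
y_system       = [1, 0, 1, 0, 0, 1, 0, 1]
print(agreement(y_ground_truth, y_system))
```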
Human Evaluation on Feedback. To further understand the text generated by the LLM to justify its grading, we did a human evaluation of all the justifications generated by GPT4. The two human graders were instructed to do the evaluation. In the human evaluation, the question we ask is how relevant the justification generated by GPT4 is in supporting its grading. In other words, whether the grade assigned by GPT4 is correct or incorrect, does the justification support this grading? The data we use for human evaluation are from a 6-shot learning experiment setting, which leaves a total of 169 samples for evaluation. Since 4 evaluation questions are generated for the problem, there is a total of 676 GPT4 responses to be evaluated. Very surprisingly, only 1 response is evaluated to be irrelevant to the numeric grade. Even when the grading of GPT4 is incorrect, it is usually still based on the relevant facts but with too much or not enough inference, as will be shown in the sample results analysis below. This shows that GPT4 does do the grading based on the relevant facts, which increases the confidence in using an application based on it.

C. Sample Grading and Feedback Analysis

In Table II, we list three sample students' responses and GPT4's grading results on the evaluation questions. We have several findings about using GPT4 to do grading:

• GPT4 is good at identifying relevant facts or statements. For example, in Q1 and Q2 for the student response 9328795, GPT4 is able to identify that molecule 3 has an Oxygen atom and can form hydrogen bonds even though the two phrases are a bit far from each other in the original text answer. In student response 9328809, GPT4 identifies the question-related information that molecule 3 has an O atom and that it cannot form H-bonds, and then grades the student response on Q1 as correct and on Q3 as incorrect.

• GPT4 can be tolerant of some typos in the input. For example, in student response 9328795, there are typos or errors such as tho and then. But they do not affect GPT4's understanding of the response text.

• GPT4 sometimes can infer the meaning of the text properly, while at other times it infers too much implication from the given text. For example, in Q3 for the student response 9328790, based on "Molecule 1 is most hydrophobic because it is all carbons and it can't make hydrogen bonds.", GPT4 properly infers that the student implies molecule 1 consists of carbon atoms and does not contain elements like oxygen or nitrogen which can form hydrogen bonds, and further grades it as being correct on this evaluation question. However, in Q4 for the student response 9328809, GPT4 interprets the student's statement "Molecule 1 does not have donor or acceptor" as suggesting that molecule 1 is non-polar and grades it as being correct, which is actually incorrect. In this example, GPT4's interpretation might be true in general. However, it infers too much from the student's response in this specific problem, in which the instructor tries to test the concept of non-polarity. We notice this is a type of error that GPT4 is prone to make in this grading task. This error type shows that, since an LLM such as GPT is trained on massive data and is expected to have learned a large amount of general knowledge, how to ground it to some specific task and some specific domain is a big challenge. Our methods of R-RAG and structured evaluation provide an approach to address this issue. We also experimented with prompt engineering to set some constraints, such as defining the role to be a college-level Biology instructor and explicitly asking GPT4 to do the grading based only on the student's response, the evaluation question, the reference answer to the question, and nothing else. However, we find it is still hard to eliminate such error types by refining the prompts only.

• Error cases of Q1 and Q2 in the student response 9328790 show the complexity of the grading task. Because the student did not give any statements about molecule 3, GPT4 grades the response as incorrect on these two questions, which are both about molecule 3. However, the human grader is more focused on the concept that the most hydrophilic molecule has an OH which makes it able to form H-Bonds. Based on this, although the student discusses molecule 2 instead of molecule 3, the response shows he/she indeed understands the concept correctly. Accordingly, the human graders give the student full credit on these two questions. During the human evaluation process, the course instructor and the two human graders all agree that, in such cases, GPT4 does the job properly based on the instructions it is given. The challenge lies not only in how to make an LLM understand the abstract concept behind the text, but also in how to formulate what is examined in a problem in the learning process itself.

D. Ablation Study

We did the following ablation studies to show the effect of some parameters and settings. Due to time and cost constraints, the following experiments were done using GPT-4.
TABLE II
EXAMPLE GRADINGS AND FEEDBACK
Effect of Clustering. As shown in Figure 4, applying a clustering algorithm to select samples for few-shot learning consistently improves the Cohen's Kappa coefficient compared with not using clustering, e.g., about a 0.2 increment in the κ value under one-shot. This supports the effectiveness of the clustering approach in selecting learning samples that are expected to better represent the distribution of the data, and further empowers the capability of an LLM such as GPT4 on this specific dataset and task.

Number of Shots. We experimented with different shot numbers. The Cohen's Kappa coefficient values in Figure 4 show that a few learning samples can significantly improve the performance of a general LLM on a specific task such as grading. Under the setting with clustering, the 4-shot gives the best result, which is significantly higher than the 3-shot and slightly higher than the 5-shot. Under the setting without clustering, the performance under the 6-shot is significantly better than the 1-shot, while the 10-shot does not show much further improvement compared with the 6-shot. This is consistent with the common understanding that few-shot in-context learning can guide a general LLM toward a specific task such as grading in this experiment. Meanwhile, the effect declines after reaching a reasonable shot number.

Fig. 4. Agreement (Cohen's κ) with and without clustering-based shot selection under different numbers of shots.

[5] W. A. Mansour, S. Albatarni, S. Eltanbouly, and T. Elsayed, "Can large language models automatically score proficiency of written essays?" in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds. Torino, Italia: ELRA and ICCL, May 2024, pp. 2777–2786. [Online]. Available: https://aclanthology.org/2024.lrec-main.247

[6] Y. Wang, C. Wang, R. Li, and H. Lin, "On the use of bert for automated essay scoring: Joint learning of multi-scale essay representation," in Proceedings of the 2022 Conference of the North