Adapting LLMs for Educational Use
Qingyao Li1, Lingyue Fu1, Weiming Zhang1, Xianyu Chen1, Jingwei Yu1, Wei Xia2, Weinan Zhang1∗, Ruiming Tang2 and Yong Yu1
1 Shanghai Jiao Tong University
2 Huawei Noah's Ark Lab
Abstract

Online education platforms, leveraging the internet to distribute education resources, seek to provide convenient education but often fall short in real-time communication with students. They often struggle to offer personalized education resources due to the challenge of addressing the diverse obstacles students encounter throughout their learning journey. Recently, the emergence of large language models (LLMs), such as ChatGPT, offers the possibility of resolving this issue by comprehending individual requests. Although LLMs have been successful in various fields, creating an LLM-based education system is still challenging because of the wide range of educational skills required. This paper reviews the recently emerged LLM research related to education.

[Figure 1: A student request handled by an LLM-based education assistant, drawing on the foundational capabilities of Programming, Writing, Knowledge-based Question Answering, Reasoning, and Mathematics.
Student: I am asked to write a Python program to calculate the sum from 1 to 100 in a simple way. Could you help me to solve this problem?
Education Assistant: Sure! You can use a mathematical formula to calculate the sum of numbers from 1 to 100. The formula for the sum of an arithmetic series is n(n+1)/2, where n is the last number in the series. In this case, n = 100. Here is a simple Python program using this formula:
    def calculate_sum():
        n = 100
        total_sum = n * (n + 1) // 2
        return total_sum
    result = calculate_sum()
    print(f"The sum of numbers from 1 to 100 is: {result}")]
[Figure: Taxonomy of the foundational education-related capabilities of LLMs and representative works.
Mathematics: Challenging Mathematics Problems (WizardMath [Luo et al., 2023a], MetaMath [Yu et al., 2023]); Multimodal Problems (MathVista [Lu et al., 2023], UniGeo [Chen et al., 2022b]); Mathematical Proof (LeanDojo [Yang et al., 2023b], LEGO-Prover [Xin et al., 2023]).
Writing: Text Summarization (BRIO [Liu et al., 2022], PROM [Ma et al., 2023b]); Grammatical Error Correction (CoEdit [Raheja et al., 2023], GrammarGPT [Fan et al., 2023]).
Programming: Foundational Code Writing (CodeLlama [Rozière et al., 2023], WizardCoder [Luo et al., 2023b]); Code Refinement (CodeT [Chen et al., 2022a], SEIDR [Liventsev et al., 2023]).
Reasoning: Supervised Finetuning (CAGE [Rajani et al., 2019], Scratchpad [Nye et al., 2021]); Prompt Engineering (CoT [Wei et al., 2022], Self-Consistency [Wang et al., 2022]); Hybrid Methodologies (Fine-tune-CoT [Ho et al., 2022], STaR [Zelikman et al., 2022]).]
answers. This approach leads to a significant improvement in the zero-shot accuracy of mathematical problem-solving.

Problems Involving Multi-Modal Information
Multi-modal inputs like geometric shapes in math problems require LLMs to understand text and image information to reach solutions. Research in multi-modal LLMs for mathematical reasoning is emerging [Lu et al., 2023; Chen et al., 2022b; Peng et al., 2023b]. This type of task poses high requirements on the formation and quality of the training data. Chen et al. [2022b] introduced UniGeo, a unified geometry problem benchmark combining calculation and proving tasks. Based on this dataset, the study presented a framework capable of simultaneously solving calculation and proving tasks through a sequence generation approach. Lu et al. [2023] proposed MATHVISTA, a benchmark for diverse mathematical and visual challenges. Frameworks for this task usually contain separate encoders for different modalities. For geometric problems, Zhang et al. [2023b] converted diagrams into text clauses, using a convolutional neural network and a language model for encoding and a GRU-based framework for answer generation. Additionally, Lu et al. [2022a] presented the Tabular Math Word Problems (TABMWP) dataset, which requires reasoning over textual and tabular data, and introduced PROMPTPG, a policy gradient-based selector for training and prompt construction for test samples.

Mathematical Proof
Unlike other types of mathematical problems, where LLMs primarily focus on reasoning and providing answers, LLMs' role in mathematical proofs emphasizes integration with proof assistants such as Coq [Barras et al., 1997], Isabelle [Nipkow et al., 2002], and Lean [de Moura et al., 2015]. These proof assistants correspond to specific programming languages, requiring users to input proofs in these languages, after which the assistant can verify the proof's correctness. Many LLM-based methods have been proposed to help with theorem proving [Jiang et al., 2023; Liu et al., 2023b; Xiong et al., 2023]. Based on these proof assistants, there are two main approaches to utilizing LLMs for mathematical proofs. The first approach is formal proof search, exemplified by models like GPT-f [Polu and Sutskever, 2020], which involves prompting LLMs to produce the next proof step (also called a 'tactic' in proof assistants) based on the current proof state and some optional context. Subsequently, Thor [Jiang et al., 2022b] was proposed to help select premises for theorem proving. Yang et al. [2023b] introduced an open-source framework named LeanDojo based on the Lean proof assistant. This framework comprises data, toolkits, models, and benchmarks, and it has led to the development of ReProver (Retrieval-Augmented Prover), which enhances proof accuracy by using retrieval methods to extract the premises on which LLMs base their proofs.
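To make the notions of 'proof state' and 'tactic' concrete, the toy Lean 4 proof below (an illustrative example, not taken from any of the systems above) proves a simple arithmetic identity in two tactic steps; after each step the proof assistant reports the remaining goal, which is exactly the context a formal proof-search model is prompted with to predict the next tactic.

    -- Toy Lean 4 proof: each tactic transforms the current proof state.
    example (a b c : Nat) : a + (b + c) = c + b + a := by
      rw [Nat.add_comm b c]   -- goal becomes: a + (c + b) = c + b + a
      rw [Nat.add_comm]       -- goal becomes: c + b + a = c + b + a, closed by reflexivity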
The second is natural proof translation, also known as autoformalization, represented by models like DSP [Jiang et al., 2022a], which converts math proofs written in natural language into formalized versions. This approach primarily addresses the issue of insufficient data for formal mathematical proofs: by utilizing autoformalization, a significant increase in this type of data can be achieved, which in turn enhances the proof-generating capabilities of neural provers fine-tuned on the expanded dataset. The strategy leverages the transformation of existing proofs into a formal structure, thereby enriching the training material for machine learning models in the field of mathematical proof generation. For example, Wu et al. [2022] employed LLMs for autoformalization, transforming mathematical proofs and problems expressed in natural language into formal specifications and proofs in the Isabelle language. The generated data was used to train a neural theorem prover, enhancing the effectiveness of the original prover. Following it, Cunningham et al. [2023] utilized an encoder-decoder framework based on the universal transformer architecture, converting both problem statements and mathematical proofs written in LaTeX into the language of the Coq interactive prover.

In mathematical education, proof problems are indispensable. Currently, LLMs for math proof primarily operate in the form of interactive theorem proving, in which LLMs complete proofs by interacting with software proof assistants. To realize completely automated theorem proving with LLMs, it is essential that these models possess not only strong reasoning skills but also the capability to formalize concepts effectively. There is no room for hallucination in mathematical proofs, which poses a formidable challenge for LLMs.

2.2 Writing
Writing proficiency is crucial for LLMs, underpinning their ability to comprehend inputs deeply and produce semantically and syntactically accurate outputs [Dong et al., 2022; Chang et al., 2023]. In education, the writing capability of LLMs holds the potential to transform how writing is taught. They can assist in content creation, simplify complex topics for students, and offer personalized educational materials. In this part, we dive into LLMs' writing capability on two education-related tasks: text summarization and grammatical error correction.

Text Summarization
Text summarization, which condenses long texts into summaries, challenges LLMs to understand diverse content like news and cross-lingual texts. It is evident that traditional fine-tuning methods are less effective with the advent of more advanced LLMs [Pu et al., 2023; Liu et al., 2022]. Pu et al. [2023] and Liang et al. [2022] showed that LLMs like ChatGPT initially lag behind fine-tuned models like T5 [Raffel et al., 2020] and BART [Lewis et al., 2019] in ROUGE scores [Lin, 2004] for text summarization. However, when human judges evaluate overall quality, LLMs outperform fine-tuned models and even standard human summaries, and are superior in aspects like factual consistency, fluency, and diversity. This discovery underscores the limitations of traditional evaluation methods and suggests a need for new paradigms to guide summarization tasks in the LLM era. For example, BRIO [Liu et al., 2022] implemented a ranking task to foster more diverse summarizations. Furthermore, Liu et al. [2023e] utilized a GPT model based on BRIO to directly generate training data to guide the learning process of other models, which is similar to the process of RLHF [Stiennon et al., 2020].
Grammatical Error Correction
We are well aware of the remarkable capability of LLMs to generate fluent and coherent conversations. However, from an educational perspective, the importance of producing grammatically correct dialogues cannot be overstated, especially for students learning a new language. The correctness of grammar in conversations plays a vital role in language acquisition, providing students with reliable examples to emulate and learn from. Numerous studies have evaluated the effectiveness of LLMs in grammatical error correction. Several works [Wu et al., 2023; Fang et al., 2023; Mohammed et al., 2023] first evaluated the error correction performance of closed-source LLMs such as ChatGPT. Although there exists a pronounced gap between ChatGPT and the previous SOTA models [Omelianchuk et al., 2020; Grammarly, 2023] on the overall F0.5 metric, closer analysis shows that ChatGPT underperforms other models in terms of precision but far exceeds them in terms of recall; that is, LLMs like ChatGPT are good at error detection. A detailed manual analysis of ChatGPT's outputs revealed that, in most cases, it maintained grammar accuracy better than the previous methods. However, it often overcorrects sentences to increase diversity and fluency, resulting in a decrease in precision. Nevertheless, it demonstrates that LLMs are adept at generating diverse and accurate content. Some studies have attempted to explore open-source LLMs, such as CoEdit [Raheja et al., 2023], which covered multiple text editing tasks (including GEC) by fine-tuning LLMs to integrate the capabilities brought by these tasks. GrammarGPT [Fan et al., 2023], on the other hand, asked ChatGPT to generate grammatically incorrect sentences to guide the training of other models. Collectively, these studies underscore the prowess of LLMs in generating correct text.

Leveraging LLMs' proficiency in text summarization and grammatical error correction can significantly benefit education. Their capability to condense complex material into concise summaries facilitates efficient learning, while error correction tools help improve students' writing and language skills. However, critical challenges need to be resolved to integrate these writing-related capabilities into education. It becomes evident that more refined evaluation metrics and task-specific optimizations are essential for LLMs.

2.3 Programming
Programming is a process of writing code and correcting it if unexpected results are obtained. Incorporating LLMs in programming education is reshaping the future of AI-assisted programming learning. LLMs could play multiple roles: as instructors providing guidance, as teaching assistants offering personalized tutorials, and as collaborative coding partners. Studies like [Ma et al., 2023a] demonstrated improved performance (17%) and efficiency (13%) among programming novices using LLM-based assistants. Research from [Phung et al., 2023] focused on programming education tasks, and benchmarks like [Fu et al., 2023] and [Ding et al., 2023] were used to evaluate the effectiveness of LLMs. This section mainly discusses LLMs' coding development from two perspectives, code writing and code refinement, corresponding to the two stages in programming.

Code Writing
Unlike natural language tasks, generating code requires a more rigorous token syntax and places higher demands on the training stage. A common method to improve LLMs' performance in generating code is to train or fine-tune them on extensive code datasets [Chen et al., 2021a; Nijkamp et al., 2023]. WizardCoder [Luo et al., 2023b] introduced the Evol-Instruct [Xu et al., 2023] method to generate more complex and diverse instruction datasets for code-related tasks. To emulate the iterative process of humans repeatedly modifying and reviewing code, InCoder [Fried et al., 2023] utilized bidirectional encoding instead of left-to-right encoding. In addition to next-token prediction, training or fine-tuning code-aimed LLMs on additional code-related tasks can enhance their programming capabilities. LLMs first learn language patterns and representations from a large amount of text data through unsupervised learning. Then, they can be fine-tuned on labeled code tasks, allowing them to learn more targeted code representations and gain a deeper understanding of code structure and semantics based on the provided labels. CodeT5+ [Wang et al., 2023b] introduced the concepts of unimodal and bimodal alignment, increasing the model's adaptability to function in different modes for various downstream tasks. During the bimodal alignment phase, the model synchronizes the representations of text-code pairs using multiple tasks, which improves its ability to understand and generate content across different modalities. CodeLlama [Rozière et al., 2023] also applied multi-task objectives, including autoregression and causal infilling prediction, which achieves better performance among open models. MFTCoder [Liu et al., 2023a] utilized the Multi-Task Learning (MTL) technique and incorporated a training loss computation algorithm to alleviate the instability and imbalance of multi-task training.

Code Refinement
In most cases, LLMs cannot generate the correct code at once. We can enable LLMs to generate a code sketch (either actual code or pseudocode) and utilize various methods to guide the model to modify and refine the code. By leveraging the inherent code correction ability of LLMs, the overall precision and quality of the code can be significantly enhanced. Liventsev et al. [2023] constructed a pipeline, Synthesize, Execute, Instruct, Debug, and Rank (SEIDR), which first generates multiple different programs, then filters and debugs them, ultimately selecting the best code among them. According to Magister et al. [2022], teaching an LLM to debug its program draft via few-shot demonstrations can improve performance on code generation tasks. Another debugging method is to let the LLM itself generate unit tests and check its code against them [Chen et al., 2022a]. By mimicking the human coding process, the LLM's programming ability is greatly enhanced. However, these methods lead to an increased number of calls to LLMs, resulting in a significant increase in inference time.
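The sketch below illustrates the general 'generate tests and check the code' idea in a simplified form; it is not the exact procedure of CodeT or SEIDR, and llm_generate is a hypothetical helper standing in for whatever code LLM is used. It selects the candidate program that passes the most model-generated tests.

    # Sketch of test-based candidate selection; llm_generate(prompt, n) is a
    # hypothetical helper returning n generated code strings.
    def passes(candidate_src: str, test_src: str) -> bool:
        env = {}
        try:
            exec(candidate_src, env)   # define the candidate function (sandbox this in practice)
            exec(test_src, env)        # run an assert-style test against it
            return True
        except Exception:
            return False

    def select_best(task: str, n_candidates: int = 5, n_tests: int = 5) -> str:
        candidates = llm_generate(f"Write a Python function for: {task}", n_candidates)
        tests = llm_generate(f"Write one assert statement testing a solution to: {task}", n_tests)
        # Keep the candidate that satisfies the most generated tests.
        return max(candidates, key=lambda c: sum(passes(c, t) for t in tests))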
In coding education, the guidance given by current LLMs still falls short of the level offered by human instructors. Firstly, there is significant room for improvement in the coding abilities of LLMs: when it comes to generating complex algorithms, LLMs' performance rapidly declines compared to the functionality achieved by humans [Chen et al., 2021b]. Additionally, due to the lack of real-world data, LLMs struggle to learn the intermediate thinking process of code writing, making it difficult for them to provide relevant explanations and instructions to beginners. As a result, the use of LLMs in programming education still needs to be improved, especially in terms of interpretability.

2.4 Reasoning
The reasoning capability of LLMs offers significant potential for educational use. LLMs can act as advanced tools, enhancing students' cognitive processes, providing personalized mentorship, and offering tailored learning support. Consequently, these models can bolster students' problem-solving skills and promote holistic cognitive development in an educational setting. This section reviews the strategies for developing LLMs' general reasoning ability.

Supervised Fine-tuning for Reasoning
Prior research on LLMs' reasoning ability is primarily based on fully supervised fine-tuning. For instance, Rajani et al. [2019] effectively fine-tuned a pre-trained GPT model to generate rationales for predictions based on the CoS-E dataset [Talmor et al., 2018], enhancing performance in commonsense question-answering tasks. However, the success of fine-tuning mainly depends on the availability of a dataset with explicit reasoning, which can be challenging to obtain. Furthermore, the inferences drawn from fine-tuning are mainly limited to specific domain datasets, relying more on the content within the fine-tuning data than on the LLM's inherent reasoning abilities.

Prompt Engineering for Reasoning
Efforts have been made in recent research to tackle the constraints inherent in fine-tuning LLMs: fine-tuned models tend to overfit specific dataset distributions, reducing their effectiveness on more diverse datasets. To mitigate this issue, numerous studies have developed strategies that utilize the robust reasoning abilities of LLMs, taking advantage of their extensive pre-trained parameters. One such strategy involves guiding LLMs to generate inference and reasoning processes through demonstrations or prompts. For example, Wei et al. [2022] introduced the "Chain of Thought" (CoT) method, which employs natural language reasoning steps as prompts to enhance the reasoning performance of LLMs. Similarly, Wang et al. [2022] proposed a self-consistency strategy that samples a variety of reasoning paths and selects the most consistent answer, which has been shown to improve model performance. Addressing the constraint of fixed manually annotated demonstrations, Diao et al. [2023] introduced an active selection approach, which dynamically identifies the most effective demonstrations relevant to task requirements from a pool of queries. Furthermore, Zhou et al. [2022] proposed a prompting strategy that decomposes complex problems into simpler sub-problems, encouraging sequential problem-solving and potentially enhancing the performance of LLMs further.
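As a minimal sketch of the self-consistency idea, the snippet below samples several chain-of-thought completions and majority-votes the final answer; llm_sample and the answer-extraction rule are assumptions for illustration, not the original implementation of Wang et al. [2022].

    # Sketch of self-consistency decoding: sample k reasoning paths, vote on answers.
    from collections import Counter

    def self_consistent_answer(question: str, k: int = 10) -> str:
        prompt = f"{question}\nLet's think step by step."
        answers = []
        for _ in range(k):
            completion = llm_sample(prompt, temperature=0.7)              # hypothetical sampling call
            answers.append(completion.rsplit("Answer:", 1)[-1].strip())   # assumes a final "Answer: ..." line
        return Counter(answers).most_common(1)[0][0]                      # most frequent answer wins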
Hybrid Methodologies for Reasoning
Despite the success of prompt engineering in improving the performance of LLMs, it does not change the model parameters and therefore does not boost their core reasoning skills. To address this limitation, new approaches combining fine-tuning with prompt engineering have been developed, aiming to enhance the model's reasoning abilities and responsiveness to prompts. One approach employs LLMs in a pedagogical capacity to instruct smaller student models. Studies by Ho et al. [2022] and Magister et al. [2022] demonstrated significant performance gains across various datasets by fine-tuning student models on the reasoning outputs of larger LLMs, even for smaller-scale models. Moreover, Zelikman et al. [2022] reported substantial performance improvements across multiple datasets by generating step-by-step rationales and fine-tuning models on those that lead to correct answers, thus facilitating model learning from its own reasoning.
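The loop below sketches the bootstrapping pattern behind such methods: sample rationales, keep only those that reach the correct answer, and fine-tune on them. It is a simplification rather than a faithful reproduction of STaR, and llm_sample, extract_answer, and fine_tune are hypothetical helpers.

    # Sketch of rationale bootstrapping over several rounds.
    def bootstrap_rationales(model, dataset, rounds: int = 3):
        for _ in range(rounds):
            keep = []
            for question, gold_answer in dataset:
                trace = llm_sample(model, f"{question}\nLet's think step by step.")
                if extract_answer(trace) == gold_answer:    # keep only correct reasoning traces
                    keep.append((question, trace))
            model = fine_tune(model, keep)                   # train on self-generated rationales
        return model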
In conclusion, the discussed methods have significantly advanced the reasoning capability of LLMs, optimally utilizing their unique features for diverse reasoning tasks. This enhancement can greatly benefit educational applications. However, it is crucial to recognize the limitations. As underscored by Valmeekam et al. [2022] and Ruis et al. [2022], LLMs struggle with complex reasoning tasks and those requiring implicit expressions. In complex reasoning scenarios, their performance can decrease notably. This is particularly relevant in educational contexts, where incorrect problem-solving modeled by LLMs could misguide students and lead to misunderstandings or flawed comprehension. Thus, despite LLMs' immense potential in education, their limitations must be carefully considered to ensure they facilitate rather than obstruct learning.

2.5 Knowledge-based Question Answering
In knowledge-based question answering with LLMs, the user presents a question to the LLM, and the LLM leverages knowledge-based methods to respond with the corresponding answer. Previous work by Ren et al. [2023] showed that LLMs have an inaccurate perception of their factual boundaries and often exhibit overconfidence. Many studies have explored and utilized external knowledge from open-world and domain-specific databases to enhance the knowledge base of these LLMs.

Open-domain QA
Open-domain question answering requires LLMs to accurately determine the reliability of information in the open world and craft their responses based on that understanding. Jiang et al. [2021] evaluated the accuracy of LLM responses to a particular question from the perspective of calibration. Through experiments, the researchers discovered that models such as T5, BERT, and GPT-2 are not well calibrated in QA tasks. While the study suggests that incorporating calibration-related methods into the fine-tuning process can effectively enhance QA performance, it is evident that purely pre-trained language models still face significant challenges in open-domain tasks. To overcome this challenge, many works tried to add additional information to help the LLMs answer correctly [Khandelwal et al., 2019; Guu et al., 2020; Borgeaud et al., 2022]. Lazaridou et al. [2022] employed information gathered from web searches as prompt input for the LLM, conditioning it to generate answers to questions. This approach effectively enables the LLM to use open-world information to answer questions.

The development of LLM-based open-domain question answering highlights significant challenges, particularly in dealing with hallucinations. In the context of an LLM-based education system, this issue becomes more critical, as seemingly correct yet wrong answers can mislead students. Drawing insights from approaches that introduce additional information from sources such as the web or textbooks can offer valuable lessons for the development of an LLM-based education system.

Domain-specific QA
Although LLMs are trained on vast corpora, they may still exhibit gaps in understanding specific domains. Typically, there are dedicated knowledge repositories for professional content that consolidate domain-specific knowledge, such as MedlinePlus (https://medlineplus.gov), GeeksforGeeks (https://www.geeksforgeeks.org), etc. Choi et al. [2023] utilized an external knowledge base to generate a set of question-answer pairs and then employed fine-tuning to transfer financial knowledge to LLMs, significantly improving financial question-answering tasks. Another common approach is to leverage the in-context learning capability of LLMs by incorporating retrieved knowledge from the knowledge base into prompts. Peng et al. [2023a] used text embeddings for retrieval, followed by the LLM's automatic feature extraction, to achieve higher accuracy in pest identification tasks. Zhang et al. [2023d] utilized K-nearest neighbors (KNN) to search for the most similar K records in an accounting database, serving as k-shot examples and greatly improving accounting efficiency. There are also works that train and improve the retriever encoder [Zhang et al., 2023c], as well as distill and refine the data in the database [Jeronymo et al., 2023]. Such retrieval frameworks have lower costs and can be applied more flexibly across different domains. Liu et al. [2023c] built upon a retrieval-enhanced framework and provided several plug-and-play modules, enabling its application in various professional fields.
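A minimal sketch of this retrieve-then-prompt pattern is shown below: the question is embedded, the nearest records in a small knowledge base are found by cosine similarity, and they are prepended to the prompt as k-shot context. The embed and llm_answer functions are hypothetical placeholders for whatever embedding model and LLM are actually used.

    # Sketch of embedding-based retrieval for k-shot prompting.
    import numpy as np

    def retrieve_examples(question: str, knowledge_base: list, k: int = 3) -> list:
        q = embed(question)                                   # hypothetical embedding call
        kb = np.stack([embed(doc) for doc in knowledge_base])
        sims = kb @ q / (np.linalg.norm(kb, axis=1) * np.linalg.norm(q) + 1e-8)
        top = np.argsort(-sims)[:k]                           # indices of the k nearest records
        return [knowledge_base[i] for i in top]

    def answer_with_context(question: str, knowledge_base: list) -> str:
        examples = retrieve_examples(question, knowledge_base)
        prompt = "\n\n".join(examples) + f"\n\nQuestion: {question}\nAnswer:"
        return llm_answer(prompt)                             # hypothetical LLM call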
By integrating Information Retrieval (IR) systems, LLMs can enhance their capabilities with professional knowledge, gaining valuable and precise supplemental information. Furthermore, according to Ren et al. [2023], retrieval augmentation can also be employed to improve LLMs' ability to perceive the boundaries of their factual knowledge, mitigating the issue of hallucinations. During the education process, different majors or courses involve different professional content. Applying external knowledge repositories as an enhancement mechanism can provide more accurate guidance in domain-specific contexts and mitigate the issues caused by misleading information. Therefore, domain-specific question-answering ability is crucial for developing an LLM-based education system.

2.6 Overall Development Status
We select representative benchmarks to assess the current development of LLMs across education-related capabilities. Specifically, we mainly collect results from three sources: Huggingface (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), OpenCompass (https://opencompass.org.cn/leaderboard-llm) and C-Eval (https://cevalbenchmark.com/static/leaderboard.html). The former two are comprehensive leaderboards; C-Eval is a Chinese evaluation suite for foundation models spanning 52 diverse disciplines. We collect performance data from popular general LLMs on these benchmarks, and the compiled results are presented in Table 1, where one can observe that:
• It is hard for a single LLM to be superior across all capabilities. GPT-4 exhibits the best overall performance among current LLMs. However, LLMs still lag significantly behind humans in some crucial abilities: on TruthfulQA [Lin et al., 2021], for example, humans achieve 94% accuracy, a performance level far superior to the optimal results of LLMs.
• Most LLMs display considerable variation in developing these skills. While certain models (such as Alpaca and Yi) might excel in text comprehension tasks, their effectiveness often diminishes in areas requiring deep understanding and reasoning, like Mathematics and Programming. This reveals the substantial challenges in building a unified education-focused LLM, since it may fail in certain areas.

Table 1: Overview of LLMs' performance on foundational education-related capabilities.

3 Potential of LLM-based Education System
LLMs can potentially transform online education by understanding a wide range of student questions, similar to human teachers. They aim to provide support across different subjects and skill levels. With the latest developments in LLMs, we suggest two approaches for creating LLM-based education systems. The first involves training a comprehensive LLM that can handle questions from various subjects. The second approach uses a mixture-of-experts (MoE) framework, integrating specialized models to support the system, with an LLM controller managing interactive dialogs with students.

3.1 Unified Approach
The most straightforward idea for establishing an LLM-based education system is to train a language model capable of answering students' questions across all subjects. As shown in Figure 3 (a), the foundational capabilities are included in the unified LLM, and the student can directly communicate with it and ask questions.

Figure 3: Two frameworks for an LLM-based education system. (a) is the unified framework, where one LLM model handles all the education-related questions. (b) is the MoE framework, where the LLM controller is responsible for task distribution and the corresponding expert model finishes the given task. (In the illustrated example, the controller calls the Mathematics model to ask how to calculate the sum from 1 to 100 in a simple way and the Programming model to write a Python program implementing the calculation.)

Research on whether LLMs can handle educational tasks has been underway. Wang and Demszky [2023] introduced three teacher coaching tasks for generative AI: (A) scoring transcript segments using classroom observation instruments, (B) identifying highlights and missed opportunities for effective instructional strategies, and (C) offering actionable suggestions to encourage more student reasoning. As evaluated by human teachers, ChatGPT's responses on these tasks for elementary math classroom transcripts are relevant to improving instruction, but they are often not novel or insightful. Beyond that, Phung et al. [2023] assessed the programming education ability of ChatGPT and GPT-4 by comparing them with human tutors. The results show that GPT-4 performs considerably better than ChatGPT and even comes close to human tutors in some scenarios, while also highlighting situations in which GPT-4 struggles; in particular, there remains a substantial gap between the performance of GPT-4 and that of human tutors in the grading-feedback and task-creation scenarios.

3.2 MoE Approach
Section 2 reviewed the current development of LLMs across various capabilities. Unfortunately, despite the existence of comprehensive language models such as GPT-4, these models often exhibit notable deficiencies in certain abilities. This situation poses a challenge, indicating that relying solely on a single LLM for educational guidance involving all these capabilities is currently a difficult task. Yet, LLMs can achieve excellent results when fine-tuned for individual capabilities, and their ability to comprehend human language is exceptionally strong. Therefore, we can aggregate models with distinct capabilities using a mixture-of-experts approach. By establishing an LLM-based controller for language interaction and task assignment with students, a currently feasible education system can be generated.

An education framework implemented with a mixture-of-experts (MoE) approach is illustrated in Figure 3(b), consisting of multiple models that excel in individual capabilities (not necessarily LLMs) and an LLM controller. The controller is mainly responsible for three tasks: (1) Understand the student's request and decide which specific area or areas the request is about. (2) Re-form the request to fit the input of the specific areas' expert models. (3) Aggregate the output of the related experts and generate the final response to the student.
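One possible, deliberately simplified realization of such a controller is sketched below: a routing LLM labels the request with the relevant capability areas, each matching expert is queried with a reformulated sub-request, and the controller composes the final reply. controller_llm and the expert callables are hypothetical placeholders, not an implementation of the framework in Figure 3(b).

    # Sketch of an LLM controller for the MoE-style education framework.
    EXPERTS = {
        "mathematics": math_expert,          # hypothetical expert models, one per capability
        "programming": programming_expert,
        "writing": writing_expert,
    }

    def handle_student_request(request: str) -> str:
        # (1) Decide which capability areas the request involves.
        raw = controller_llm(f"List the capabilities needed (comma-separated) for: {request}")
        areas = [a.strip().lower() for a in raw.split(",")]
        # (2) Re-form the request to fit each expert model's input.
        outputs = [EXPERTS[a](controller_llm(f"Rewrite this request for a {a} expert: {request}"))
                   for a in areas if a in EXPERTS]
        # (3) Aggregate the expert outputs into the final response.
        return controller_llm("Combine these expert answers into one reply for the student:\n"
                              + "\n".join(outputs))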
The advantage of this approach is that training is less challenging, requiring only training on individual capability-related datasets to obtain multiple models, each superior in a specific capability. However, the drawback lies in potential understanding errors introduced during inference by the interactions: this includes errors introduced by the LLM controller when assigning tasks based on student input and when integrating outputs from the various experts. Currently, this approach might be a realistic way of achieving an LLM-based educational assistant system.

4 Future Directions
Recently, more and more researchers have been trying to apply LLMs to handle education tasks, such as course design, student evaluation, lesson plan design, and others. Nevertheless, there are still many challenges and opportunities.

4.1 Interdisciplinary Reasoning Ability
Students may encounter interdisciplinary reasoning problems during real-world learning, requiring the education system to integrate multiple capabilities to formulate responses. As illustrated in Figure 1, the student intends to write a program to solve a mathematical problem, and the model needs first to comprehend the mathematical problem, devise a solution, and then generate the code. This process necessitates the model to synthesize both mathematical and programming capabilities. However, there is currently limited research on the integration of multiple interdisciplinary capabilities in LLMs at this stage, including both datasets and algorithms. Boyko et al. [2023] examined how LLMs augment scientific inquiry, code development, the scientific writing process, etc., and they propose that LLMs can foster interdisciplinary work by bridging knowledge gaps across scientific fields. However, they mainly discuss the LLMs' ability to help researchers' interdisciplinary collaboration instead of their ability to answer interdisciplinary questions. Cultivating an LLM to obtain this ability would help to develop a unified education system, which is an essential research direction.

4.2 Student Modeling
Establishing an LLM-based education system allows students to articulate their personalized needs through dialogs. Through such conversations, we can extract or infer personalized features about students, such as their current mastery of topics and their preferences in learning styles.

Besides modeling students from conversations, some research [Aher et al., 2023; Argyle et al., 2023] has shown that LLMs have certain abilities in simulating humans and generating human samples. Applied to education, this ability indicates a potential for LLM-based student simulation. In this way, for students with few interaction records, the LLM-based simulator could generate more samples and provide data to help the expert model better understand the student. It could also help human teachers develop their teaching skills.

4.3 Multi-modal Education
In education, multi-modal information is common, like geometry problems combining text and images or textbook concepts with illustrations. Building a general intelligent education system requires handling such multi-modal data. Thus, developing a multi-modal educational model is a promising future direction.

Notably, the development of multi-modal LLMs is rapidly advancing [Ye et al., 2023; Du et al., 2022]. Different kinds of architectures and pre-training tasks have been proposed [Du et al., 2022]. However, the education domain often exhibits unique distribution characteristics in multi-modal information. Therefore, it may place higher demands on the model's multi-modal reasoning capability.
5 Conclusion
In this paper, we presented an overview of the development of LLM-based education systems. We first reviewed the important developments of LLMs' education-related abilities. Then, we analyzed their potential and proposed two different ways of building such a system. We also highlighted the future directions that are worth working on. We hope this survey provides some insight into future research in this direction.

Acknowledgments
The SJTU team is partially supported by National Natural Science Foundation of China (62177033).

References
Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning, pages 337–371. PMLR, 2023.
Md Adnan Arefeen, Biplob Debnath, and Srimat Chakradhar. Leancontext: Cost-efficient domain-specific question answering using llms. arXiv preprint arXiv:2309.00841, 2023.
Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351, 2023.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
Bruno Barras, Samuel Boutin, Cristina Cornes, Judicaël Courant, Jean-Christophe Filliatre, Eduardo Gimenez, Hugo Herbelin, Gerard Huet, Cesar Munoz, Chetan Murthy, et al. The Coq proof assistant reference manual: Version 6.1. PhD thesis, Inria, 1997.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pages 2206–2240. PMLR, 2022.
James Boyko, Joseph Cohen, Nathan Fox, Maria Han Veiga, Jennifer I Li, Jing Liu, Bernardo Modenesi, Andreas H Rauch, Kenneth N Reid, Soumi Tribedi, et al. An interdisciplinary outlook on large language models for scientific research. arXiv preprint arXiv:2311.04929, 2023.
Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.
Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests, 2022.
Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression. arXiv preprint arXiv:2212.02746, 2022.
Stephen Choi, William Gazeley, Siu Ho Wong, and Tingting Li. Conversational financial information retrieval model (confirm), 2023.
Garett Cunningham, Razvan C Bunescu, and David Juedes. Towards autoformalization of mathematics and code correctness: Experiments with elementary proofs. arXiv preprint arXiv:2301.02195, 2023.
Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris Van Doorn, and Jakob von Raumer. The lean theorem prover (system description). In Automated Deduction-CADE-25: 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings 25, pages 378–388. Springer, 2015.
Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. Active prompting with chain-of-thought for large language models. arXiv preprint arXiv:2302.12246, 2023.
Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion, 2023.
Chenhe Dong, Yinghui Li, Haifan Gong, Miaoxin Chen, Junxin Li, Ying Shen, and Min Yang. A survey of natural language generation. ACM Comput. Surv., 55(8), dec 2022.
Yifan Du, Zikang Liu, Junyi Li, and Wayne Xin Zhao. A survey of vision-language pre-trained models. arXiv preprint arXiv:2202.10936, 2022.
Yaxin Fan, Feng Jiang, Peifeng Li, and Haizhou Li. Grammargpt: Exploring open-source llms for native chinese grammatical error correction with supervised fine-tuning. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 69–80. Springer, 2023.
Tao Fang, Shu Yang, Kaixin Lan, Derek F Wong, Jinpeng Hu, Lidia S Chao, and Yue Zhang. Is chatgpt a highly fluent grammatical error correction system? a comprehensive evaluation. arXiv preprint arXiv:2304.01746, 2023.
Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen tau Yih, Luke Zettlemoyer, and Mike Lewis. Incoder: A generative model for code infilling and synthesis, 2023.
Lingyue Fu, Huacan Chai, Shuang Luo, Kounianhua Du, Weiming Zhang, Longteng Fan, Jiayi Lei, Renting Rui, Jianghao Lin, Yuchen Fang, Yifan Liu, Jingkuan Wang, Siyuan Qi, Kangning Zhang, Weinan Zhang, and Yong Yu. Codeapex: A bilingual programming evaluation benchmark for large language models, 2023.
FudanUniversity. moss. https://github.com/OpenLMLab/MOSS, 2023.
Wensheng Gan, Zhenlian Qi, Jiayang Wu, and Jerry Chun-Wei Lin. Large language models in education: Vision and opportunities. arXiv preprint arXiv:2311.13160, 2023.
Mor Geva, Ankit Gupta, and Jonathan Berant. Injecting numerical reasoning skills into language models. arXiv preprint arXiv:2004.04487, 2020.
Grammarly. Grammarly, 2023.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pages 3929–3938. PMLR, 2020.
Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers. arXiv preprint arXiv:2212.10071, 2022.
Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, and Rodrigo Nogueira. Inpars-v2: Large language models as efficient dataset generators for information retrieval, 2023.
Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. How can we know when language models know? on the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9:962–977, 2021.
Albert Q Jiang, Sean Welleck, Jin Peng Zhou, Wenda Li, Jiacheng Liu, Mateja Jamnik, Timothée Lacroix, Yuhuai Wu, and Guillaume Lample. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283, 2022.
Albert Qiaochu Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygóźdź, Piotr Miłoś, Yuhuai Wu, and Mateja Jamnik. Thor: Wielding hammers to integrate language models and automated theorem provers. Advances in Neural Information Processing Systems, 35:8360–8373, 2022.
Albert Q Jiang, Wenda Li, and Mateja Jamnik. Multilingual mathematical autoformalization. arXiv preprint arXiv:2311.03755, 2023.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019.
Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115, 2022.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models, 2022. URL https://arxiv.org/abs/2206.14858.
Nian Li, Chen Gao, Yong Li, and Qingmin Liao. Large language model-empowered agents for simulating macroeconomic activities. arXiv preprint arXiv:2310.10436, 2023.
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
Jianghao Lin, Xinyi Dai, Yunjia Xi, Weiwen Liu, Bo Chen, Xiangyang Li, Chenxu Zhu, Huifeng Guo, Yong Yu, Ruiming Tang, et al. How can recommender systems benefit from large language models: A survey. arXiv preprint arXiv:2306.05817, 2023.
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
lingyiwanwu. Yi. https://www.lingyiwanwu.com/, 2023.
Tiedong Liu and Bryan Kian Hsiang Low. Goat: Fine-tuned llama outperforms gpt-4 on arithmetic tasks. arXiv preprint arXiv:2305.14201, 2023.
Yixin Liu, Pengfei Liu, Dragomir Radev, and Graham Neubig. Brio: Bringing order to abstractive summarization. arXiv preprint arXiv:2203.16804, 2022.
Bingchang Liu, Chaoyu Chen, Cong Liao, Zi Gong, Huan Wang, Zhichao Lei, Ming Liang, Dajun Chen, Min Shen, Hailian Zhou, Hang Yu, and Jianguo Li. Mftcoder: Boosting code llms with multitask fine-tuning, 2023.
Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, et al. Fimo: A challenge formal dataset for automated theorem proving. arXiv preprint arXiv:2309.04295, 2023.
Jiongnan Liu, Jiajie Jin, Zihan Wang, Jiehan Cheng, Zhicheng Dou, and Ji-Rong Wen. Reta-llm: A retrieval-augmented large language model toolkit, 2023.
Junling Liu, Ziming Wang, Qichen Ye, Dading Chong, Peilin Zhou, and Yining Hua. Qilin-med-vl: Towards chinese large vision-language model for general healthcare. arXiv preprint arXiv:2310.17956, 2023.
Yixin Liu, Alexander R Fabbri, Pengfei Liu, Dragomir Radev, and Arman Cohan. On learning to summarize with large language models as references. arXiv preprint arXiv:2305.14239, 2023.
Vadim Liventsev, Anastasiia Grishina, Aki Härmä, and Leon Moonen. Fully autonomous programming with large language models. In Proceedings of the Genetic and Evolutionary Computation Conference. ACM, jul 2023.
Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. arXiv preprint arXiv:2209.14610, 2022.
Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. A survey of deep learning for mathematical reasoning. arXiv preprint arXiv:2212.10535, 2022.
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct, 2023.
Qianou Ma, Hua Shen, Kenneth Koedinger, and Tongshuang Wu. Hypocompass: Large-language-model-based tutor for hypothesis construction in debugging for novices, 2023.
Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Prom: A phrase-level copying mechanism with pre-training for abstractive summarization. arXiv preprint arXiv:2305.06647, 2023.
Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. Teaching small language models to reason. arXiv preprint arXiv:2212.08410, 2022.
Osamah Mohammed, Thaeer Mueen Sahib, Israa M Hayder, Sani Salisu, Misbah Shahid, et al. Chatgpt evaluation: Can it replace grammarly and quillbot tools? British Journal of Applied Linguistics, 3(2):34–46, 2023.
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis, 2023.
Tobias Nipkow, Markus Wenzel, and Lawrence C Paulson. Isabelle/HOL: a proof assistant for higher-order logic. Springer, 2002.
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, and Oleksandr Skurzhanskyi. Gector: grammatical error correction: tag, not rewrite. arXiv preprint arXiv:2005.12592, 2020.
OpenAI. chatgpt. https://chat.openai.com/, 2023.
OpenAI. Gpt-4 technical report, 2023.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? arXiv preprint arXiv:2103.07191, 2021.
Ruoling Peng, Kang Liu, Po Yang, Zhipeng Yuan, and Shunbao Li. Embedding-based retrieval with llm for effective agriculture information extracting from unstructured data, 2023.
Shuai Peng, Di Fu, Yijun Liang, Liangcai Gao, and Zhi Tang. Geodrl: A self-learning framework for geometry problem solving using reinforcement learning in deductive reasoning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13468–13480, 2023.
Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. Generative ai for programming education: Benchmarking chatgpt, gpt-4, and human tutors, 2023.
Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020.
Xiao Pu, Mingqi Gao, and Xiaojun Wan. Summarization is (almost) dead. arXiv preprint arXiv:2309.09558, 2023.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Vipul Raheja, Dhruv Kumar, Ryan Koo, and Dongyeop Kang. Coedit: Text editing by task-specific instruction tuning. arXiv preprint arXiv:2305.09857, 2023.
Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself! leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361, 2019.
Ruiyang Ren, Yuhao Wang, Yingqi Qu, Wayne Xin Zhao, Jing Liu, Hao Tian, Hua Wu, Ji-Rong Wen, and Haifeng Wang. Investigating the factual knowledge boundary of large language models with retrieval augmentation, 2023.
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code, 2023.
Laura Ruis, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rocktäschel, and Edward Grefenstette. Large language models are not zero-shot communicators. arXiv preprint arXiv:2210.14986, 2022.
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. alpaca. https://crfm.stanford.edu/2023/03/13/alpaca.html, 2023.
TheVicunaTeam. vicuna. https://lmsys.org/blog/2023-03-30-vicuna/, 2023.
TigerResearch. Tigerbot. https://github.com/TigerResearch/TigerBot, 2023.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can't plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498, 2022.
Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, et al. Freshllms: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214, 2023.
Rose E Wang and Dorottya Demszky. Is chatgpt a good teacher coach? measuring zero-shot performance for scoring and providing actionable insights on classroom instruction. arXiv preprint arXiv:2306.03090, 2023.
Yan Wang, Xiaojiang Liu, and Shuming Shi. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854, 2017.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023.
Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, and Steven C. H. Hoi. Codet5+: Open code large language models for code understanding and generation, 2023.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models. Advances in Neural Information Processing Systems, 35:32353–32368, 2022.
Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, and Michael Lyu. Chatgpt or grammarly? evaluating chatgpt on grammatical error correction benchmark. arXiv preprint arXiv:2303.13648, 2023.
Huajian Xin, Haiming Wang, Chuanyang Zheng, Lin Li, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, et al. Lego-prover: Neural theorem proving with growing libraries. arXiv preprint arXiv:2310.00656, 2023.
Jing Xiong, Jianhao Shen, Ye Yuan, Haiming Wang, Yichun Yin, Zhengying Liu, Lin Li, Zhijiang Guo, Qingxing Cao, Yinya Huang, et al. Trigo: Benchmarking formal mathematical proof reduction for generative language models. arXiv preprint arXiv:2310.10180, 2023.
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions, 2023.
Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
Kaiyu Yang, Aidan M Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models. arXiv preprint arXiv:2306.15626, 2023.
Zhen Yang, Ming Ding, Qingsong Lv, Zhihuan Jiang, Zehai He, Yuyi Guo, Jinfeng Bai, and Jie Tang. Gpt can solve mathematical problems without a calculator. arXiv preprint arXiv:2309.03241, 2023.
Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv preprint arXiv:2311.04257, 2023.
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, and Songfang Huang. How well do large language models perform in arithmetic tasks? arXiv preprint arXiv:2304.02015, 2023.
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D Goodman. Star: Bootstrapping reasoning with reasoning, 2022. URL https://arxiv.org/abs/2203.14465.
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
Beichen Zhang, Kun Zhou, Xilin Wei, Wayne Xin Zhao, Jing Sha, Shijin Wang, and Ji-Rong Wen. Evaluating and improving tool-augmented computation-intensive math reasoning. arXiv preprint arXiv:2306.02408, 2023.
Ming-Liang Zhang, Fei Yin, and Cheng-Lin Liu. A multi-modal neural geometric solver with textual clauses parsed from diagram. arXiv preprint arXiv:2302.11097, 2023.
Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, and Jian-Yun Nie. Retrieve anything to augment large language models, 2023.
Sabrina Zhang, Daksha Yadav, and Tom Jin. Cash transaction booking via retrieval augmented llm. In KDD 2023 Workshop on Robust NLP for Finance (RobustFin), 2023.
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, et al. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921, 2023.