Fortune: Formula-Driven Reinforcement Learning for Symbolic Table Reasoning in Language Models
Lang Cao1†∗  Jingxian Xu2∗  Hanbing Liu3∗  Jinyu Wang4∗  Mengyu Zhou†
Haoyu Dong  Shi Han  Dongmei Zhang
1University of Illinois Urbana-Champaign  2Nankai University  3Tsinghua University  4Shandong University  Microsoft Research
[email protected] [email protected]
Abstract
Tables are a fundamental structure for organizing and analyzing data, making
effective table understanding a critical capability for intelligent systems. While
large language models (LMs) demonstrate strong general reasoning abilities, they
continue to struggle with accurate numerical or symbolic reasoning over tabular
data, especially in complex scenarios. Spreadsheet formulas provide a powerful
and expressive medium for representing executable symbolic operations, encoding
rich reasoning patterns that remain largely underutilized. In this paper, we propose
Formula Tuning (Fortune), a reinforcement learning (RL) framework that trains
LMs to generate executable spreadsheet formulas for question answering over
general tabular data. Formula Tuning reduces the reliance on supervised formula
annotations by using binary answer correctness as a reward signal, guiding the
model to learn formula derivation through reasoning. We provide a theoretical
analysis of its advantages and demonstrate its effectiveness through extensive
experiments on seven table reasoning benchmarks. Formula Tuning substantially
enhances LM performance, particularly on multi-step numerical and symbolic
reasoning tasks, enabling a 7B model to outperform O1 on table understanding.
This highlights the potential of formula-driven RL to advance symbolic table
reasoning in LMs.
[Figure 1: Overview of Formula Tuning. Given a table and a query (e.g., "Sum values in A1:A10"), the LM reasons within <Think>...</Think> and emits a spreadsheet formula such as =SUM(A1:A10) within <Answer>...</Answer>; a formula executor computes the final answer, which is compared against the ground truth to yield the reward for reinforcement learning, in contrast to direct textual table reasoning.]
1 Introduction
Tables are a common and practical data structure in daily life, playing a central role in data collection,
representation, and analysis [1, 2]. Recent advances in large language models (LLMs) [3, 4, 5] have
∗Work done during internship at Microsoft.
†Corresponding authors.
• We propose Formula Tuning (Fortune), a reinforcement learning framework that enhances symbolic
reasoning for table understanding by training language models to generate executable spreadsheet
formulas.
• We provide a theoretical analysis and discussion comparing textual versus symbolic reasoning in
table understanding, as well as supervised fine-tuning versus reinforcement learning in symbolic
table reasoning.
• We conduct extensive experiments on seven table understanding benchmarks, demonstrating the
effectiveness of Formula Tuning, and perform comprehensive analyses to provide deeper insights.
2 Related Work
Table Understanding and Reasoning. Many studies have explored fine-tuning language models
(LMs) to improve their ability to understand and reason over tabular data. Building on the masked
language modeling introduced by BERT [27], models such as TaPas [28], PaSTA [29], and TUTA
[30] propose specialized pretraining strategies tailored for tables. TAPEX [31] pretrains an encoder-
decoder model as a neural SQL executor to better capture the semantics of table operations. Recent
efforts, including TableLLaMA [32] and TableGPT [33], build upon large decoder-only language
models pretrained for general-purpose table understanding across a wide range of downstream tasks.
Other studies focus on enabling LMs to better perform table-related tasks without fine-tuning. For
example, Dater [34] proposes strategies for dynamically constructing sub-tables, modifying the input
context to enhance comprehension. Chain-of-Table [35] models table reasoning as a sequence of
transformations using predefined operations, gradually generating sub-tables to support complex
multi-step inference. TableMaster [11] introduces a general framework for table understanding
and underscores the importance of symbolic reasoning in handling complex scenarios. Given the
structured and often numerical nature of tabular data, program-of-thought prompting [13] and other
symbolic approaches [36, 37, 38] have demonstrated strong effectiveness for table reasoning.
Formula Learning. A growing body of research has explored the potential of spreadsheet formulas as
a powerful means to enhance table understanding. NL2Formula [26] constructs a formula generation
dataset by converting text-to-SQL tasks into spreadsheet formulas, enabling position-aware reasoning
from natural language queries. ForTap [39] leverages spreadsheet formulas as pretraining signals to
enhance numerical reasoning. Auto-Formula [40] applies contrastive learning to transfer formulas
from similar spreadsheets for formula recommendation. SpreadsheetCoder [41] formulates formula
prediction as a program synthesis task, leveraging both headers and surrounding cell values for context.
FLAME [42] trains a small domain-specific model tailored for formula repair and completion. TabAF
[22] jointly generates answers and formulas for table question answering, but relies on supervised
fine-tuning over datasets generated by LLMs.
Reinforcement Learning for Language Models. Reinforcement Learning (RL) [43] is a machine
learning paradigm that trains agents to make decisions through interaction with an environment, with
the goal of maximizing cumulative rewards. In the era of large language models (LLMs), RL has
gained significant traction as an effective framework for aligning models with human preferences. A
prominent example is Reinforcement Learning from Human Feedback (RLHF) [44, 45, 46], which
leverages the Proximal Policy Optimization (PPO) algorithm [47] and human preference data to train
a reward model that guides the fine-tuning of LLMs. Building on RLHF, more recent algorithms such
as GRPO [48] and REINFORCE++ [49] aim to enhance reward modeling and mitigate issues like
biased optimization [50].
Reasoning with Language Models. It has been observed that sufficiently large language models
(LMs) can demonstrate emergent reasoning capabilities [51, 52]. Chain-of-thought prompting [53] is
one technique used to elicit step-by-step reasoning, significantly improving performance on complex
tasks. Further advances include self-consistency [54] and structuring the reasoning process in forms
like trees [55] or graphs [56, 57] are also useful for more complex reasoning tasks. RL has also
been used to directly improve reasoning skills during training [58, 59]. Notably, DeepSeek-R1 [17]
demonstrates that large-scale RL can substantially boost the reasoning abilities of LMs. In terms of
application, DeepRetrieval [18] applies RL to teach models how to reason about interacting with
search engines for information retrieval, while DeepCoder [19] uses RL for code reasoning and
generation tasks. Rec-R1 [60] also bridges LLMs and recommendation systems through RL.
Figure 2: Comparison of Textual versus Symbolic Reasoning in Table Understanding, and Supervised
Fine-Tuning (SFT) versus Reinforcement Learning (RL) in Symbolic Table Reasoning.
3 Methodology
In this section, we present a theoretical analysis and discussion comparing textual versus symbolic
reasoning in table understanding, as well as supervised fine-tuning (SFT) versus reinforcement
learning (RL) in symbolic table reasoning (Figure 2). We then introduce our proposed training
framework, Formula Tuning. All notations are listed in Appendix M.
Table Understanding with Language Models. We consider a language model (LM) as a conditional
generation policy πθ (a | s), where θ denotes its parameters. The input s ∈ S comprises a table T
and a natural-language query q, i.e., s = (T, q). The table T is a two-dimensional grid of cells:
$$T_{m\times n} \;=\; \begin{pmatrix} C_{1,1} & C_{1,2} & \cdots \\ C_{2,1} & C_{i,j} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}, \qquad (1)$$
where each Ci,j may contain a data value, structural information (e.g., a top header or a left header),
or be empty. In practice, we linearize T into a text sequence before feeding it to the LM.
The LM then generates an output a ∈ A, which can be either a final textual answer or a spreadsheet
formula f that produces the answer upon execution. Our goal is to optimize the parameters θ that
maximize the expected table-understanding performance, measured by a reward function r(a | s).
Formally,
$$\max_{\theta}\; \mathbb{E}_{s\sim p(s),\, a\sim\pi_\theta(\cdot\mid s)}\big[r(a\mid s)\big], \qquad (2)$$
where p(s) denotes the empirical distribution over table–query pairs and r(a | s) evaluates the
correctness of the final answer from the LM given the input.
Definition 1 (Textual and Symbolic Policies). Given an input s = (T, q), we consider two types of
reasoning strategies:
1. Textual policy π_θ^txt: The language model generates a chain of thought and directly produces a textual answer a ∈ A_txt.
2. Symbolic policy π_θ^sym: The language model generates a chain of thought followed by a spreadsheet formula f ∈ F; the final answer is obtained by executing the formula deterministically: a = exec(f, T).
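The executor exec(f, T) is treated as a deterministic black box (in practice, a spreadsheet engine). As a rough, self-contained illustration of the idea, and not the engine used in the paper, the sketch below evaluates a tiny subset of spreadsheet formulas, here only =SUM over an A1-style range, against an in-memory grid:

```python
import re
from typing import List, Tuple, Union

Number = Union[int, float]

def _addr_to_rc(addr: str) -> Tuple[int, int]:
    """Convert an A1-style address to 0-based (row, col) indices."""
    m = re.fullmatch(r"([A-Z]+)(\d+)", addr)
    col = 0
    for ch in m.group(1):
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return int(m.group(2)) - 1, col - 1

def exec_formula(formula: str, table: List[List[Number]]) -> Number:
    """Evaluate a restricted formula of the form '=SUM(A1:A10)' against a 2-D grid."""
    m = re.fullmatch(r"=SUM\(([A-Z]+\d+):([A-Z]+\d+)\)", formula.strip(), re.IGNORECASE)
    if m is None:
        raise ValueError(f"Unsupported formula: {formula}")
    (r1, c1), (r2, c2) = _addr_to_rc(m.group(1).upper()), _addr_to_rc(m.group(2).upper())
    total = 0
    for r in range(min(r1, r2), max(r1, r2) + 1):
        for c in range(min(c1, c2), max(c1, c2) + 1):
            total += table[r][c]
    return total

if __name__ == "__main__":
    grid = [[i + 1] for i in range(10)]        # A1:A10 hold 1..10
    print(exec_formula("=SUM(A1:A10)", grid))  # 55
```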
Lemma 1 (Reward Decomposition). Let the reward be defined as r(a | s) = 1[a = a⋆(s)], where a⋆(s) denotes the ground-truth answer.
For textual reasoning, the expected reward is
$$\mathbb{E}_{a\sim\pi_\theta^{txt}}[r(a\mid s)] \;=\; \sum_{a}\pi_\theta^{txt}(a\mid s)\,\mathbb{1}\big[a = a^\star(s)\big], \qquad (3)$$
which represents the probability of generating both a logically valid reasoning path and a numerically correct final answer.
For symbolic reasoning, the model generates a formula f, which is executed to produce an answer a = exec(f, T). The expected reward becomes
$$\mathbb{E}_{f\sim\pi_\theta^{sym}}\big[r(\mathrm{exec}(f, T)\mid s)\big] \;=\; \sum_{f}\pi_\theta^{sym}(f\mid s)\,\mathbb{1}\big[\mathrm{exec}(f, T) = a^\star(s)\big]. \qquad (4)$$
This corresponds to the probability of generating a valid reasoning path and a formula that yields
the correct answer. Importantly, any formula that produces the correct output receives full reward,
regardless of whether it matches the canonical ground-truth formula.
Assumption 1 (Symbolic Reasoning Setting).
1. The formula executor is sound and complete with respect to the formula language F.
2. All symbolic outputs are executed deterministically and without numerical error.
3. The textual and symbolic policies are equally capable of planning high-level solution strategies
in their respective formats (text or formula).
Theorem 1 (Symbolic Reasoning Superiority). Under Assumption 1, the expected reward achieved
by symbolic reasoning is greater than or equal to that of textual reasoning for any input s:
$$\mathbb{E}_{a\sim\pi_\theta^{sym}}[r(a\mid s)] \;\ge\; \mathbb{E}_{a\sim\pi_\theta^{txt}}[r(a\mid s)]. \qquad (5)$$
where H(πg ) denotes the entropy of the teacher policy. Thus, maximizing the log-likelihood of πθ
under samples from πg is equivalent to minimizing the KL divergence from πg to πθ .
Lemma 3 (Convergence of SFT and Reward Upper Bound). Let s = (T, q) ∈ S, where T is the input
table and q is the natural language question. Suppose the model generates a formula f ∼ πθ (· | s),
and let the final answer be computed deterministically as a = exec(f, T).
Under Assumption 2, the optimal supervised fine-tuning (SFT) policy
$$\pi_{\theta^\star} \;=\; \arg\max_{\theta}\; \mathbb{E}_{s\sim p,\, f\sim\pi_g}\big[\log \pi_\theta(f\mid s)\big] \qquad (8)$$
satisfies
$$\pi_{\theta^\star}(f\mid s) = \pi_g(f\mid s) \quad \text{for almost every } s \in \mathcal{S}. \qquad (9)$$
Consequently, for the reward function r(a | s) = 1[a = a⋆(s)], we have
$$\mathbb{E}_{s\sim p,\, f\sim\pi_{\theta^\star}}\big[r(\mathrm{exec}(f, T)\mid s)\big] \;\le\; \mathbb{E}_{s\sim p,\, f\sim\pi_g}\big[r(\mathrm{exec}(f, T)\mid s)\big], \qquad (10)$$
provided that the teacher policy π_g does not fully cover all correct solutions.
The proof of Theorem 2 is provided in Appendix C.4.
Remark 3 (RL Objective and Advantage). Unlike SFT, which is constrained to imitating the teacher
policy πg , reinforcement learning (RL) directly maximizes the expected task reward:
$$\max_{\theta}\; \mathbb{E}_{s\sim p(s),\, a\sim\pi_\theta(\cdot\mid s)}\big[r(a\mid s)\big]. \qquad (12)$$
This objective allows the model to discover high-reward actions that may lie outside the support
of πg —for example, alternative formulas that yield the correct answer but differ syntactically or
structurally from those seen during supervised training.
In symbolic table reasoning, this flexibility is especially valuable. Since many distinct formulas may
produce the same correct result, RL can exploit this many-to-one mapping by exploring diverse yet
semantically valid expressions. As a result, RL has the potential to exceed the SFT reward bound
established in Lemma 3, provided the reward signal is well aligned with the task objective and
sufficient exploration occurs.
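As a small concrete illustration of this many-to-one mapping, using the toy executor sketched above, several syntactically different formulas can execute to the same value and would therefore all receive full reward under an execution-based criterion:

```python
grid = [[3], [4], [5]]                        # cells A1:A3 hold 3, 4, 5
print(exec_formula("=SUM(A1:A3)", grid))      # 12
print(exec_formula("=sum(a1:a3)", grid))      # 12, lower-case variant of the same range
print(exec_formula("=SUM(A3:A1)", grid))      # 12, reversed range, still the correct answer
```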
Definition. Formula Tuning is a reinforcement learning (RL) framework that defines spreadsheet
formulas as an explicit symbolic reasoning space for table understanding. Specifically, we fine-tune a
pretrained LM πθ to generate formulas f ∈ F, which are executed by a deterministic spreadsheet
engine exec(f, T). The resulting answer a = exec(f, T) is compared against the ground-truth answer
a⋆ (s), and the model receives a reward:
$$r(a\mid s) \;=\; \begin{cases} 1, & \text{if } a = a^\star(s),\\ 0.2, & \text{if } a \ne a^\star(s) \text{ and } f \text{ is executable},\\ 0, & \text{if } f \text{ is not executable}. \end{cases} \qquad (13)$$
This reward function encourages the model to explore valid, executable formulas—even if initially
incorrect—while assigning full credit only when the answer exactly matches the ground truth.
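As a concrete sketch, the reward in Eq. (13) can be computed directly from the executor's outcome. The snippet below reuses the illustrative `exec_formula` helper from above and simplifies answer comparison to normalized string equality; the paper's exact-match criterion and any normalization details may differ:

```python
def formula_reward(formula: str, table, gold_answer: str) -> float:
    """Reward from Eq. (13): 1.0 for a correct answer, 0.2 for an
    executable-but-wrong formula, 0.0 if execution fails."""
    try:
        predicted = exec_formula(formula, table)   # deterministic spreadsheet execution
    except Exception:
        return 0.0                                 # formula is not executable
    if str(predicted).strip() == str(gold_answer).strip():
        return 1.0                                 # exact match with the ground truth
    return 0.2                                     # executable but incorrect
```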
Objective. Formula Tuning maximizes the expected reward using RL algorithms, such as proximal policy optimization (PPO) [47], with the action space constrained to spreadsheet formulas:
$$\max_{\theta}\; \mathbb{E}_{s\sim p(s),\, f\sim\pi_\theta(\cdot\mid s)}\big[r(\mathrm{exec}(f, T)\mid s)\big].$$
Training Workflow.
1. Decoding: The LM generates a chain of thought and samples candidate formulas f_1, f_2, . . . from its current policy π_θ.
2. Execution: Each formula f_k is executed to produce the corresponding answer a_k.
3. Rewarding: The environment returns a scalar reward r_k = r(a_k | s) based on the correctness and executability of the result.
4. Policy Update: The LM parameters θ are updated with an RL algorithm (e.g., PPO) based on the observed tuples (s, f_k, r_k).
This framework enables the model to perform symbolic reasoning over general tables with higher
accuracy, with its advantages analyzed in Section 3.2 and Section 3.3. For additional discussion and
practical challenges of the methodology, please refer to Appendix D.
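A minimal sketch of one training iteration is shown below, assuming each batch element is a dict with "table", "query", and "answer" fields, and that a `policy` object exposes `sample` and `update` methods standing in for the PPO machinery; these names are illustrative rather than the authors' implementation, and `linearize_table` / `formula_reward` are the toy helpers sketched earlier:

```python
def formula_tuning_step(policy, batch, num_samples: int = 4):
    """One illustrative Formula Tuning iteration: decode candidate formulas,
    execute them, compute rewards (Eq. 13), and update the policy."""
    experience = []
    for example in batch:
        # Linearize the table and append the natural-language query as the prompt.
        prompt = linearize_table(example["table"]) + "\n" + example["query"]
        for _ in range(num_samples):
            formula = policy.sample(prompt)              # 1. Decoding
            reward = formula_reward(                     # 2. Execution + 3. Rewarding
                formula, example["table"], example["answer"]
            )
            experience.append((prompt, formula, reward))
    policy.update(experience)                            # 4. Policy update (e.g., PPO)
    return experience
```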
4 Experiments
4.1 Settings
We conduct experiments on seven diverse table understanding benchmarks, including WikiTQ [61],
TabFact [62], FinQA [63], HiTab [25], MultiHiertt [64], AIT-QA [65], and TableBench [66]. These
datasets differ in domain sources, table structure types, and question complexity, collectively covering
the full spectrum of table understanding tasks. For training, we merge the first five datasets into a
single training corpus and train the model jointly on this combined set, then evaluate it separately
on each dataset. Among these, AIT-QA and TableBench are treated as out-of-distribution (OOD)
evaluation sets, while the rest are considered in-distribution (ID). Our experiments cover a range of
models, including GPT-4o-mini, GPT-4o, O1, Llama-3.1-8B, and Qwen2.5-Coder-7B. Following prior
work [61, 25], we use exact match accuracy as our primary evaluation metric. The prompts used in
our experiments are provided in Appendix L. Detailed settings are provided in Appendix E.
Table 1 presents the performance of different formula learning methods under supervised fine-tuning
(SFT), reinforcement learning (RL), RL with a cold-start strategy (RL w/ CS), and direct zero-shot
inference without any training. This experiment is primarily designed to validate the theoretical analysis presented earlier and to examine how model behavior in practical scenarios aligns with it. Several key analyses and insights are summarized below:
Large zero-shot performance gap between textual and symbolic reasoning. Closed-source models
such as GPT-4o already achieve an overall accuracy of 66.5% with purely textual reasoning, yet
their formula accuracy drops to 58.5%. For open-source 7–8B models the situation is more severe: textual scores hover around 40%, while formula accuracies nearly collapse (13–36%). The only partial exception is Qwen2.5-Coder-7B, whose additional code pre-training endows it with a modest ability to emit formulas, confirming that vanilla pre-training leaves models largely unaware of spreadsheet
syntax and semantics. This underscores the necessity of dedicated formula tuning.
SFT rapidly narrows the textual and symbolic reasoning gap. After SFT, open-source models
gain 17–20 accuracy points (e.g., Llama-3.1-8B rises from 41.5% to 58.7%), and the residual gap between
textual and symbolic reasoning shrinks to only 1–2 points. SFT thus injects essential task knowledge
and brings symbolic reasoning almost on par with textual reasoning.
RL yields further gains, especially for formulas and OOD. Starting from scratch, RL improves
formula accuracy by an additional 5–9 points for both backbones, pushing overall performance
above 63%. The improvements for text-only reasoning are comparatively smaller, suggesting that RL
primarily enhances the model’s ability to synthesize correct formulas rather than improve surface-level
Table 1: Performance under Zero-shot, SFT, and RL settings across models and datasets. Values in
the table indicate accuracy (%). Text and Formula refer to textual and symbolic reasoning methods,
respectively. w/ CS denotes cold-start RL [17] initialized from SFT. For open-source models, the best
performance in each column is highlighted in dark blue, and the second-best in light blue.
Base Model        Method          WikiTQ  TabFact  FinQA  HiTab  MultiHiertt  AIT-QA  TableBench  Overall
(In-distribution: WikiTQ, TabFact, FinQA, HiTab, MultiHiertt. Out-of-distribution: AIT-QA, TableBench.)

Zero-Shot
GPT-4o-mini       Text            67.36   88.44    59.20  57.89  22.41        77.67   36.79       58.54
GPT-4o-mini       Formula         49.16   74.90    43.07  48.17  28.26        75.53   34.65       50.53
GPT-4o            Text            78.57   94.52    63.12  69.26  36.11        81.36   42.66       66.51
GPT-4o            Formula         53.36   79.94    48.65  65.40  38.41        85.83   37.92       58.50
O1                Text            77.90   95.50    55.71  74.31  38.31        81.55   45.03       66.90
O1                Formula         70.40   91.11    42.81  75.00  48.95        88.93   44.02       65.89
Llama-3.1-8B      Text            50.46   67.84    42.37  29.92  18.77        59.61   21.29       41.47
Llama-3.1-8B      Formula         12.58   15.77    22.67  6.57   4.23         20.00   10.31       13.16
Qwen2.5-Coder-7B  Text            55.53   78.26    55.54  50.38  26.11        73.60   24.12       51.93
Qwen2.5-Coder-7B  Formula         38.96   52.52    34.44  32.60  15.90        52.43   26.05       36.13

Supervised Fine-Tuning (SFT)
Llama-3.1-8B      Text            66.15   82.95    50.04  70.27  39.98        78.45   23.10       58.71
Llama-3.1-8B      Formula         59.62   72.48    58.50  72.54  44.00        74.95   31.48       59.08
Qwen2.5-Coder-7B  Text            65.98   81.13    59.46  72.35  43.93        81.55   24.01       61.20
Qwen2.5-Coder-7B  Formula         63.46   78.16    58.33  71.83  42.68        76.12   35.56       60.88

Reinforcement Learning (RL)
Llama-3.1-8B      Text            64.37   82.16    62.60  68.81  31.99        82.91   28.08       60.13
Llama-3.1-8B      Text w/ CS      71.56   87.01    56.84  77.64  49.34        85.66   28.16       65.17
Llama-3.1-8B      Formula         57.64   80.09    60.85  67.93  29.40        80.78   30.69       58.20
Llama-3.1-8B      Formula w/ CS   70.49   83.04    71.99  79.29  54.55        81.29   36.64       68.18
Qwen2.5-Coder-7B  Text            66.95   85.43    64.34  74.24  35.55        85.28   27.86       62.80
Qwen2.5-Coder-7B  Text w/ CS      71.31   86.07    64.77  77.42  54.25        85.43   25.94       66.46
Qwen2.5-Coder-7B  Formula         67.80   84.19    62.16  71.19  41.72        81.17   35.45       63.38
Qwen2.5-Coder-7B  Formula w/ CS   70.90   86.18    69.21  77.89  56.78        79.14   39.25       68.48
responses. Furthermore, the out-of-distribution results on AIT-QA and TableBench show substantial
gains over supervised fine-tuning, demonstrating the generalization benefits of RL.
Cold-start RL is essential for open-source models. Initializing RL from an SFT model instead
of from scratch (the w/ CS rows) delivers an additional 3–7 point lift. The Formula w/ CS setting
consistently achieves the best open-source numbers: 68.2% for Llama-3.1-8B and 68.4% for Qwen2.5-Coder-7B. SFT provides the knowledge base, while RL pushes performance toward the ceiling.
Symbolic reasoning shows markedly higher robustness on complex benchmarks. On the chal-
lenging TableBench OOD set, Formula w/ CS lifts Llama-3.1-8B from a 10.3% zero-shot score to 36.6%,
surpassing Text w/ CS by 8.5 points. The advantage is most pronounced on datasets that require
multi-step calculations, range manipulation, or nested aggregation.
Textual reasoning remains preferable for simple look-up QA. On TabFact, Text w/ CS slightly
outperforms formula-based reasoning (87.0% vs. 83.0% for Llama-3.1-8B), suggesting that composing
an explicit formula is not always necessary when a direct textual response suffices. A similar trend is
observed on AIT-QA, where Text w/ CS again yields higher accuracy (85.06% vs. 80.64%). These
results indicate that textual reasoning is more effective in settings where answers can be directly
extracted from the table without requiring symbolic composition.
Formula-tuned open-source models now rival or surpass closed-source LMs. After SFT + RL,
both Llama-3.1-8B and Qwen2.5-Coder-7B attain overall accuracies (68.2–68.4%) on par with or exceed-
ing the proprietary GPT-4o-mini baseline (58.5%) and even approach the much larger O1 model
(66.9%). This highlights the effectiveness of our formula-centric training pipeline for democratizing
high-quality table reasoning.
Table 2 presents the performance of Fortune and Fortune++ compared to several strong baselines,
as detailed in Appendix E. Fortune is derived from the best overall performance achieved through
cold-start RL in formula-based symbolic reasoning. Following prior work [68, 69, 22], we adopt the
self-consistency strategy [54] to enhance table understanding performance. This strategy involves
Table 2: Performance comparison of different methods. Values in the table indicate accuracy (%).
Values marked with * indicate out-of-distribution results. ‘-’ indicates results not reported in the
related paper. For fine-tuning-based methods, the best performance in each column is highlighted in
dark blue, and the second-best in light blue.
Method Backbone WikiTQ TabFact HiTab FinQA AIT-QA
Prompting-Based Methods
Binder [36] CodeX 64.60 85.10 - - -
Dater [34] CodeX 65.90 85.60 - - -
API-Assisted [11] CodeX 42.40 - 69.30 - -
ReAcTable [67] CodeX 68.00 86.10 - - -
Chain-of-Table [35] PaLM 2 67.31 86.61 - - -
Norm-DP&Agent [68] GPT-3.5 73.65 88.50 - - -
TIDE DP&Agent [69] GPT-3.5 75.00 89.82 - - -
TableMaster [11] GPT-4o-mini 78.13 90.12 - 66.40 -
E5 [70] GPT-4 - - 85.08 - -
SS-CoT [71] Llama-3.1-70B 76.80 - 79.10 - -
Finetuning-Based Methods
FORTAP [39] BERT+LSTM - - 47.00 - -
TAPEX-Large [31] BART-Large 59.10 84.20 45.60 - -
OmniTab [72] BART-Large 62.80 - - - -
TableLlama [32] Llama-2-7B 32.14* 82.55 60.48 2.27* 26.99*
TableLLM [73] Qwen2-7B 53.59 69.81 43.88 8.63* 64.85
TableGPT2 [74] Qwen2.5-7B 61.42 77.80 70.27 40.28* 12.43*
TabAF [22] Qwen2.5-Coder-7B 74.72 83.99 78.41 45.07* 62.33*
Fortune (Ours) Qwen2.5-Coder-7B 67.05 85.08 69.74 62.16 80.39*
Fortune++ (Ours) Qwen2.5-Coder-7B 82.54 95.06 87.24 80.47 93.20*
generating multiple candidate formulas and selecting the final answer based on majority voting. It is
a widely adopted and effective approach for improving accuracy. Fortune follows previous work by
generating 10 symbolic reasoning outputs. To further leverage the complementary strengths of textual
and symbolic reasoning, we introduce Fortune++, which produces a balanced mix of 5 textual and 5
symbolic outputs. Neither method relies on a cold-start strategy.
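To make the hybrid self-consistency concrete, a rough sketch of answer selection by majority vote over mixed textual and formula-derived candidates might look as follows (a simplification; the exact voting and tie-breaking rules follow Appendix E):

```python
from collections import Counter
from typing import List, Optional

def self_consistency_vote(text_answers: List[str],
                          formula_answers: List[Optional[str]]) -> Optional[str]:
    """Pick the final answer by majority vote over textual and formula-derived
    candidates. Non-executable formulas should be passed as None and are ignored."""
    candidates = [a.strip() for a in text_answers + formula_answers if a is not None]
    if not candidates:
        return None
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer

if __name__ == "__main__":
    # Fortune++ draws 5 textual and 5 symbolic samples per question (values below are made up).
    print(self_consistency_vote(["120", "120", "115", "120", "118"],
                                ["120", "120", None, "121", "120"]))  # -> "120"
```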
Fortune++ delivers consistently strong performance across benchmarks. Fortune++ surpasses all
finetuning-based methods across the reported datasets. Specifically, it achieves 80.47% on FinQA,
demonstrating strong complex mathematical reasoning ability. On AIT-QA, Fortune++ brings an
improvement of 30.9 points, highlighting the out-of-distribution robustness enabled by RL. These
results also show that smaller open-source models can outperform larger closed-source models.
Despite using only a 7B-parameter Qwen backbone, Fortune++ consistently outperforms nearly all
prompting-based methods, including those powered by GPT-4o. The original Fortune, which relies
solely on formula-based reasoning, also achieves competitive performance across benchmarks.
RL surpasses SFT. TabAF [22] is a strong baseline that uses SFT for formula-based symbolic reasoning and similarly adopts a hybrid self-consistency strategy with 5 textual and 5 formula-based
outputs. Nevertheless, Fortune++ significantly outperforms TabAF, demonstrating that RL offers
clear advantages over SFT-only models distilled from stronger teacher models.
Additional results and further analysis. Fortune achieves 40.85% on MultiHiertt and 35.22%
on TableBench, while Fortune++ achieves 51.73% and 44.96%, respectively. An ablation study
and upper-bound performance analysis of Fortune and Fortune++ are presented in Appendix F. A
statistical analysis of the generated formulas is provided in Appendix I, and qualitative case studies
are discussed in Appendix J. A comparative study of formula tuning versus other symbolic table
reasoning methods is included in Appendix G, and an impact analysis of the reasoning process
during formula tuning appears in Appendix H.
5 Conclusion
In this paper, we introduced Formula Tuning (Fortune), a reinforcement learning framework that
trains language models to generate executable spreadsheet formulas for table understanding tasks.
Our findings highlight the promise of formula-driven learning in enhancing the reasoning capabilities of
language models on tabular tasks. Limitations and future work are discussed in Appendix A.
References
[1] Xinyi He, Mengyu Zhou, Xinrun Xu, Xiaojun Ma, Rui Ding, Lun Du, Yan Gao, Ran Jia,
Xu Chen, Shi Han, Zejian Yuan, and Dongmei Zhang. Text2analysis: A benchmark of table
question answering with advanced data analysis and unclear queries, 2023.
[2] Deyin Yi, Yihao Liu, Lang Cao, Mengyu Zhou, Haoyu Dong, Shi Han, and Dongmei Zhang.
Tablepilot: Recommending human-preferred tabular data analysis with large language models.
arXiv preprint arXiv: 2503.13262, 2025.
[3] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno,
Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil
Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tau-
man Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need, 2023.
[4] OpenAI. Gpt-4 technical report, 2024.
[5] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo-
thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez,
Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation
language models, 2023.
[6] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier
Amatriain, and Jianfeng Gao. Large language models: A survey, 2024.
[7] Yilun Zhu, Joel Ruben Antony Moniz, Shruti Bhargava, Jiarui Lu, Dhivya Piraviperumal, Site
Li, Yuan Zhang, Hong Yu, and Bo-Hsiang Tseng. Can large language models understand
context? In Yvette Graham and Matthew Purver, editors, Findings of the Association for
Computational Linguistics: EACL 2024, pages 2004–2018, St. Julian’s, Malta, March 2024.
Association for Computational Linguistics.
[8] Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, and Thomas Back.
Reasoning with large language models, a survey, 2024.
[9] Xi Fang, Weijie Xu, Fiona Anting Tan, Jiani Zhang, Ziqing Hu, Yanjun Qi, Scott Nickleach,
Diego Socolinsky, Srinivasan Sengamedu, and Christos Faloutsos. Large language models(llms)
on tabular data: Prediction, generation, and understanding – a survey, 2024.
[10] Xuanliang Zhang, Dingzirui Wang, Longxu Dou, Qingfu Zhu, and Wanxiang Che. A survey of
table reasoning with large language models, 2024.
[11] Lang Cao and Hanbing Liu. Tablemaster: A recipe to advance table understanding with language
models. arXiv preprint arXiv: 2501.19378, 2025.
[12] Yang Yan, Yu Lu, Renjun Xu, and Zhenzhong Lan. Do phd-level llms truly grasp elementary
addition? probing rule learning vs. memorization in large language models, 2025.
[13] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts
prompting: Disentangling computation from reasoning for numerical reasoning tasks, 2023.
[14] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan,
and Graham Neubig. Pal: Program-aided language models, 2023.
[15] Yuan Yang, Siheng Xiong, Ali Payani, Ehsan Shareghi, and Faramarz Fekri. Can LLMs reason
in the wild with programs? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,
Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9806–9829,
Miami, Florida, USA, November 2024. Association for Computational Linguistics.
[16] Zhen Bi, Ningyu Zhang, Yinuo Jiang, Shumin Deng, Guozhou Zheng, and Huajun Chen. When
do program-of-thoughts work for reasoning?, 2023.
[17] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin
Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu,
Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan
Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang,
Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli
Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng
Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li,
Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian
Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean
Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan
Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian,
Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong
Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan
Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting
Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun,
T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu,
Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao
Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su,
Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang
Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X.
Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao
Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang
Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He,
Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong
Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha,
Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan
Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu,
Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang,
and Zhen Zhang. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement
learning, 2025.
[18] Pengcheng Jiang, Jiacheng Lin, Lang Cao, R. Tian, S. Kang, Z. Wang, Jimeng Sun, and Jiawei
Han. Deepretrieval: Hacking real search engines and retrievers with large language models via
reinforcement learning. arXiv preprint arXiv: 2503.00223, 2025.
[19] Michael Luo, Sijun Tan, Roy Huang, Ameen Patel, Alpay Ariyak, Qingyang Wu, Xiaoxiang
Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, Ion
Stoica, and Tianjun Zhang. Deepcoder: A fully open-source 14b coder at o3-mini level, 2025.
Notion Blog.
[20] Microsoft Corporation. Overview of formulas in excel, 2025.
[21] Haoyu Dong, Jianbo Zhao, Yuzhang Tian, Junyu Xiong, Mengyu Zhou, Yun Lin, José Cam-
bronero, Yeye He, Shi Han, and Dongmei Zhang. Encoding spreadsheets for large language
models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the
2024 Conference on Empirical Methods in Natural Language Processing, pages 20728–20748,
Miami, Florida, USA, November 2024. Association for Computational Linguistics.
[22] Zhongyuan Wang, Richong Zhang, and Zhijie Nie. General table question answering via
answer-formula joint generation, 2025.
[23] Charles Smalley. Excel's new LAMBDA function makes it Turing complete, 2023. Accessed:
2025-05-09.
[24] Simon Thorne. Experimenting with chatgpt for spreadsheet formula generation: Evidence of
risk in ai generated spreadsheets, 2023.
[25] Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang
Lou, and Dongmei Zhang. Hitab: A hierarchical table dataset for question answering and
natural language generation. arXiv preprint arXiv:2108.06712, 2021.
[26] Wei Zhao, Zhitao Hou, Siyuan Wu, Yan Gao, Haoyu Dong, Yao Wan, Hongyu Zhang, Yulei Sui,
and Haidong Zhang. NL2Formula: Generating spreadsheet formulas from natural language
queries. In Yvette Graham and Matthew Purver, editors, Findings of the Association for
Computational Linguistics: EACL 2024, pages 2377–2388, St. Julian’s, Malta, March 2024.
Association for Computational Linguistics.
[27] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding, 2019.
[28] Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Martin
Eisenschlos. TAPAS: weakly supervised table parsing via pre-training. CoRR, abs/2004.02349,
2020.
[29] Zihui Gu, Ju Fan, Nan Tang, Preslav Nakov, Xiaoman Zhao, and Xiaoyong Du. Pasta: Table-
operations aware fact verification via sentence-table cloze pre-training, 2022.
[30] Zhiruo Wang, Haoyu Dong, Ran Jia, Jia Li, Zhiyi Fu, Shi Han, and Dongmei Zhang. Tuta:
Tree-based transformers for generally structured table pre-training. In Proceedings of the 27th
ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1780–1790, 2021.
[31] Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, and Jian-Guang Lou.
Tapex: Table pre-training via learning a neural sql executor, 2022.
[32] Tianshu Zhang, Xiang Yue, Yifei Li, and Huan Sun. Tablellama: Towards open large generalist
models for tables, 2024.
[33] Liangyu Zha, Junlin Zhou, Liyao Li, Rui Wang, Qingyi Huang, Saisai Yang, Jing Yuan,
Changbao Su, Xiang Li, Aofeng Su, Tao Zhang, Chen Zhou, Kaizhe Shou, Miao Wang, Wufang
Zhu, Guoshan Lu, Chao Ye, Yali Ye, Wentao Ye, Yiming Zhang, Xinglong Deng, Jie Xu, Haobo
Wang, Gang Chen, and Junbo Zhao. Tablegpt: Towards unifying tables, nature language and
commands into one gpt, 2023.
[34] Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. Large language
models are versatile decomposers: Decompose evidence and questions for table-based reasoning,
2023.
[35] Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng
Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, and Tomas Pfister. Chain-
of-table: Evolving tables in the reasoning chain for table understanding, 2024.
[36] Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming
Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu.
Binding language models in symbolic languages, 2023.
[37] Md Nahid and Davood Rafiei. TabSQLify: Enhancing reasoning capabilities of LLMs through
table decomposition. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings
of the 2024 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5725–5737,
Mexico City, Mexico, June 2024. Association for Computational Linguistics.
[38] Qingyang Mao, Qi Liu, Zhi Li, Mingyue Cheng, Zheng Zhang, and Rui Li. Potable: Program-
ming standardly on table-based reasoning like a human analyst, 2024.
[39] Zhoujun Cheng, Haoyu Dong, Ran Jia, Pengfei Wu, Shi Han, Fan Cheng, and Dongmei Zhang.
Fortap: Using formulas for numerical-reasoning-aware table pretraining, 2022.
[40] Sibei Chen, Yeye He, Weiwei Cui, Ju Fan, Song Ge, Haidong Zhang, Dongmei Zhang, and
Surajit Chaudhuri. Auto-formula: Recommend formulas in spreadsheets using contrastive
learning for table representations. Proceedings of the ACM on Management of Data, 2(3):1–27,
2024.
[41] Xinyun Chen, Petros Maniatis, Rishabh Singh, Charles Sutton, Hanjun Dai, Max Lin, and
Denny Zhou. Spreadsheetcoder: Formula prediction from semi-structured context, 2021.
[42] Harshit Joshi, Abishai Ebenezer, José Cambronero, Sumit Gulwani, Aditya Kanade, Vu Le,
Ivan Radiček, and Gust Verbruggen. Flame: A small language model for spreadsheet formulas,
2023.
[43] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A
survey. J. Artif. Intell. Res., 4:237–285, 1996.
[44] Paul Francis Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario
Amodei. Deep reinforcement learning from human preferences. ArXiv, abs/1706.03741, 2017.
[45] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan J. Lowe, Chelsea Voss, Alec
Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback.
ArXiv, abs/2009.01325, 2020.
[46] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton,
Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis
Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with
human feedback. ArXiv, abs/2203.02155, 2022.
[47] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal
policy optimization algorithms. ArXiv, abs/1707.06347, 2017.
[48] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang,
Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of
mathematical reasoning in open language models, 2024.
[49] Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models.
ArXiv, abs/2501.03262, 2025.
[50] Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weiling Liu, Zhiyu Mei, Guangju Wang, Chao
Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study. ArXiv,
abs/2404.10719, 2024.
[51] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani
Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto,
Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language
models, 2022.
[52] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won
Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challeng-
ing big-bench tasks and whether chain-of-thought can solve them, 2022.
[53] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi,
Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language
models, 2023.
[54] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha
Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language
models, 2023.
[55] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik
Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023.
[56] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas
Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten
Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceed-
ings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, March 2024.
[57] Lang Cao. GraphReason: Enhancing reasoning capabilities of large language models through a
graph-based verification approach. In Bhavana Dalvi Mishra, Greg Durrett, Peter Jansen, Ben
Lipkin, Danilo Neves Ribeiro, Lionel Wong, Xi Ye, and Wenting Zhao, editors, Proceedings of
the 2nd Workshop on Natural Language Reasoning and Structured Explanations (@ACL 2024),
pages 1–12, Bangkok, Thailand, August 2024. Association for Computational Linguistics.
[58] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan
Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023.
[59] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang,
Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with
process- and outcome-based feedback, 2022.
[60] Jiacheng Lin, Tian Wang, and Kun Qian. Rec-r1: Bridging generative large language models
and user-centric recommendation systems via reinforcement learning, 2025.
[61] Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables,
2015.
[62] Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou
Zhou, and William Yang Wang. Tabfact: A large-scale dataset for table-based fact verification,
2020.
[63] Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema
Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. FinQA:
A dataset of numerical reasoning over financial data. In Marie-Francine Moens, Xuanjing
Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on
Empirical Methods in Natural Language Processing, pages 3697–3711, Online and Punta Cana,
Dominican Republic, November 2021. Association for Computational Linguistics.
[64] Yilun Zhao, Yunxiang Li, Chenying Li, and Rui Zhang. MultiHiertt: Numerical reasoning over
multi hierarchical tabular and textual data. In Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pages 6588–6600, Dublin,
Ireland, May 2022. Association for Computational Linguistics.
[65] Yannis Katsis, Saneem Chemmengath, Vishwajeet Kumar, Samarth Bharadwaj, Mustafa Canim,
Michael Glass, Alfio Gliozzo, Feifei Pan, Jaydeep Sen, Karthik Sankaranarayanan, and Soumen
Chakrabarti. Ait-qa: Question answering dataset over complex tables in the airline industry,
2021.
[66] Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xinrun Du, Di Liang, Daixin
Shu, Xianfu Cheng, Tianzhen Sun, Guanglin Niu, Tongliang Li, and Zhoujun Li. Tablebench:
A comprehensive and complex benchmark for table question answering, 2025.
[67] Yunjia Zhang, Jordan Henkel, Avrilia Floratou, Joyce Cahoon, Shaleen Deep, and Jignesh M.
Patel. Reactable: Enhancing react for table question answering, 2023.
[68] Tianyang Liu, Fei Wang, and Muhao Chen. Rethinking tabular data understanding with large
language models, 2023.
[69] Zhen Yang, Ziwei Du, Minghan Zhang, Wei Du, Jie Chen, Zhen Duan, and Shu Zhao. Triples
as the key: Structuring makes decomposition and verification easier in LLM-based tableQA. In
The Thirteenth International Conference on Learning Representations, 2025.
[70] Zhehao Zhang, Yan Gao, and Jian-Guang Lou. e5 : Zero-shot hierarchical table analysis using
augmented LLMs via explain, extract, execute, exhibit and extrapolate. In Kevin Duh, Helena
Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies
(Volume 1: Long Papers), pages 1244–1258, Mexico City, Mexico, June 2024. Association for
Computational Linguistics.
[71] Zilong Zhao, Yao Rong, Dongyang Guo, Emek Gözlüklü, Emir Gülboy, and Enkelejda Kasneci.
Stepwise self-consistent mathematical reasoning with large language models, 2024.
[72] Zhengbao Jiang, Yi Mao, Pengcheng He, Graham Neubig, and Weizhu Chen. OmniTab:
Pretraining with natural and synthetic data for few-shot table-based question answering. In
Proceedings of the 2022 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, July 2022.
[73] Xiaokang Zhang, Sijia Luo, Bohan Zhang, Zeyao Ma, Jing Zhang, Yang Li, Guanlin Li, Zijun
Yao, Kangli Xu, Jinchang Zhou, Daniel Zhang-Li, Jifan Yu, Shu Zhao, Juanzi Li, and Jie Tang.
Tablellm: Enabling tabular data manipulation by llms in real office usage scenarios, 2025.
[74] Aofeng Su, Aowen Wang, Chao Ye, Chen Zhou, Ga Zhang, Gang Chen, Guangcheng Zhu,
Haobo Wang, Haokai Xu, Hao Chen, Haoze Li, Haoxuan Lan, Jiaming Tian, Jing Yuan, Junbo
Zhao, Junlin Zhou, Kaizhe Shou, Liangyu Zha, Lin Long, Liyao Li, Pengzuo Wu, Qi Zhang,
Qingyi Huang, Saisai Yang, Tao Zhang, Wentao Ye, Wufang Zhu, Xiaomeng Hu, Xijun Gu,
Xinjie Sun, Xiang Li, Yuhang Yang, and Zhiqing Xiao. Tablegpt2: A large multimodal model
with tabular data integration, 2024.
[75] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun
Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei
Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng
Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-coder technical report, 2024.
[76] Meta AI. The llama 3 herd of models, 2024.
[77] Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang
Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large
language models, 2023.
[78] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua
Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv
preprint arXiv: 2409.19256, 2024.
[79] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu,
Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large
language model serving with pagedattention, 2023.
[80] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang,
Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A
distributed framework for emerging ai applications, 2018.
[81] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast
and memory-efficient exact attention with io-awareness, 2022.
[82] Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann
Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett. To cot or not to cot? chain-of-
thought helps mainly on math and symbolic reasoning, 2025.
[83] Ryan Liu, Jiayi Geng, Addison J. Wu, Ilia Sucholutsky, Tania Lombrozo, and Thomas L.
Griffiths. Mind your step (by step): Chain-of-thought can reduce performance on tasks where
thinking makes humans worse, 2024.
Contents of Appendix
C Supplementary Proofs
  C.1 Proof of Symbolic Reasoning Dominance
  C.2 Proof of MLE Minimizes KL Divergence
  C.3 Proof of Convergence of SFT and Reward Upper Bound
  C.4 Proof of RL Superiority
G Comparative Analysis of Formula Tuning and Other Symbolic Table Reasoning Methods
J Case Study
  J.1 Textual vs. Symbolic Reasoning
  J.2 Performance of SFT vs. RL
M Notation Table
A Limitations and Future Work
While Fortune demonstrates strong performance, several limitations remain and suggest promising
directions for future research.
Limited datasets and experimental coverage. Due to the resource-intensive nature of reinforcement
learning, our evaluation is limited to a few representative public datasets, which primarily consist of
clean and well-structured tables. This may not fully capture real-world scenarios, where spreadsheets
often contain noisy, irregular, or complex two-dimensional layouts. Additionally, we experimented
with only a limited set of base models and reinforcement learning algorithms (e.g., PPO). Nonetheless,
we believe the experiments conducted in this work sufficiently demonstrate the effectiveness of our
approach. Future work should explore a broader range of model sizes, architectures, and reinforcement
learning algorithms across different downstream scenarios, such as applying Formula Tuning to larger
models to achieve even better performance.
Applicability to broader table understanding tasks. Our method assumes that answers can be fully
derived from tabular data via executable formulas, which holds for many symbolic reasoning tasks.
However, this assumption may not extend to tasks involving free-form text, multi-modal inputs, or
ambiguous supervision. Nonetheless, formulas may still serve as useful intermediate representations,
auxiliary objectives, or reasoning grounding mechanisms in such settings. Investigating how formula
tuning can benefit or integrate with these broader tasks is an important direction.
Extensions to other formula-related tasks. Executable formulas are central not only to reasoning
but also to related tasks such as formula completion, correction, and refilling. These tasks could
benefit from multi-task learning or joint training alongside formula reasoning. Conversely, using
these tasks as pre-training objectives may also enhance symbolic reasoning capabilities via formulas.
Exploring how these tasks can be unified within a single framework could lead to more powerful and
general-purpose symbolic table models.
Cold-start challenges in reinforcement learning. For base models with limited symbolic reasoning
capabilities and minimal knowledge of spreadsheet formulas, reinforcement learning from scratch
can be unstable. In our experiments, we mitigated this by using the same training corpus for both SFT
cold-start and RL. However, curating independent and high-quality cold-start corpora and identifying
optimal initialization checkpoints for reinforcement learning remain open challenges. Furthermore,
reinforcement learning itself is inherently unstable. Developing practical techniques to stabilize
training and improve performance remains a critical area for exploration. Enhancing warm-up
strategies and training stability could lead to significantly better RL outcomes.
Reward design for formula optimization. Our current reward signal is based solely on binary
execution accuracy. While simple and effective, it overlooks important factors such as formula
efficiency, token redundancy, and partial credit. Future work can incorporate more fine-grained
reward shaping, including length penalties or structure-aware scoring, to improve both learning
stability and the quality of generated formulas.
These limitations point to several promising directions for future research: (1) scaling Formula
Tuning to diverse domains and tasks, (2) exploring joint learning of symbolic tasks, (3) developing
more stable and adaptive reinforcement learning strategies, and (4) advancing reward engineering for
structured output generation.
could lead to downstream harms (e.g., miscalculated budgets or flawed data reports). Furthermore,
since symbolic reasoning via formulas may be more accessible in high-resource languages or domains
with well-structured spreadsheets, deployment in low-resource settings could exacerbate inequalities
in model performance and accessibility.
Safeguards. To mitigate such risks, we recommend several safeguards for future use of Fortune and
similar symbolic reasoning systems. First, generated formulas should undergo verification through
deterministic execution engines to ensure correctness. Second, evaluations should be conducted
across diverse domains and spreadsheet structures, particularly including noisy or adversarial formats.
Third, human-in-the-loop validation should be used in high-stakes applications (e.g., healthcare or
financial audits) to ensure interpretability and safety. Finally, we advocate for transparent reporting of
formula generation limitations and the inclusion of provenance indicators that show how a particular
output was derived, enabling error tracing and accountability.
C Supplementary Proofs
C.1 Proof of Symbolic Reasoning Dominance
Proof. Let E1 denote the event that the model selects a correct high-level reasoning plan—i.e., a
valid logical strategy that, if accurately followed, can lead to the correct answer.
By Assumption 1 (3), both the symbolic policy π_θ^sym and the textual policy π_θ^txt are equally capable of producing such high-level plans:
$$\mathbb{P}_{sym}[E_1] = \mathbb{P}_{txt}[E_1]. \qquad (15)$$
We now compare how these two policies operationalize the same plan downstream:
• Symbolic reasoning. After selecting a correct high-level plan, the symbolic policy proceeds
by emitting a formal expression—typically a spreadsheet formula f —that directly encodes the
solution. This formula is then passed to an external executor, which deterministically computes
the final answer a = exec(f, T). Under Assumptions 1 (1) and (2), if the plan is correct, the
execution will reliably yield the correct answer a⋆ (s). Thus, the expected reward under the
symbolic policy is:
$$\mathbb{E}_{a\sim\pi_\theta^{sym}}[r(a\mid s)] = \mathbb{P}[E_1]. \qquad (16)$$
• Textual reasoning. In contrast, after selecting the same correct high-level plan, the textual
policy must verbalize the intermediate reasoning steps and compute results step-by-step in free
text. This includes performing arithmetic, maintaining numerical precision, and formatting the
final answer string. Let E2 denote the event that all intermediate computations and the final
output are accurate. Then, the expected reward under the textual policy is:
$$\mathbb{E}_{a\sim\pi_\theta^{txt}}[r(a\mid s)] = \mathbb{P}[E_1]\cdot \mathbb{P}[E_2\mid E_1]. \qquad (17)$$
Unlike symbolic execution, this textual process is inherently fragile. Errors in numerical
calculations, token prediction, or formatting can easily lead to incorrect final answers, resulting
in a reward of 0.
Since P[E_2 | E_1] ≤ 1, we conclude
$$\mathbb{E}_{a\sim\pi_\theta^{txt}}[r(a\mid s)] \;\le\; \mathbb{P}[E_1] \;=\; \mathbb{E}_{a\sim\pi_\theta^{sym}}[r(a\mid s)], \qquad (18)$$
which completes the proof.
C.3 Proof of Convergence of SFT and Reward Upper Bound
(ii) Convergence. Under Assumption 2(1), the model class {πθ } is expressive enough such that there
exists some θ⋆ satisfying
$$\inf_{\theta}\; \mathbb{E}_{s\sim p(s)}\big[D_{\mathrm{KL}}\big(\pi_g(\cdot\mid s)\,\big\|\,\pi_\theta(\cdot\mid s)\big)\big] = 0. \qquad (22)$$
Assumption 2(2) ensures the optimization algorithm converges to this global optimum, and As-
sumption 2(3) guarantees that the empirical distribution p̂(s, f ) converges to the true distribution
p(s)πg (f | s) as the sample size N → ∞.
Therefore, at convergence,
DKL (πg ∥ πθ⋆ ) = 0 almost everywhere, (23)
which implies pointwise equivalence between the student and teacher policies:
πθ⋆ (f | s) = πg (f | s) for almost every s ∈ S. (24)
(iii) Reward upper bound. Let r(a | s) = 1[a = a⋆ (s)] be the task reward, where a = exec(f, T)
is the executed output. Since execution is deterministic and the student mimics the teacher exactly,
we have:
Es∼p,f ∼πθ⋆ [r(exec(f, T) | s)] = Es∼p,f ∼πg [r(exec(f, T) | s)] . (25)
Thus, supervised fine-tuning under ideal assumptions can at best match the teacher’s reward perfor-
mance. In particular, this expected reward serves as an upper bound for what SFT can achieve when
trained only on demonstrations from πg .
C.4 Proof of RL Superiority
Proof. Let a⋆(s) be the ground-truth answer for input s, and suppose that the teacher policy π_g(f | s)
covers only a strict subset of all possible formulas f such that exec(f, T) = a⋆ (s).
By Lemma 3, supervised fine-tuning under ideal assumptions can at best match the expected reward
of πg :
Es∼p, f ∼πθ⋆ [r(exec(f, T) | s)] = Es∼p, f ∼πg [r(exec(f, T) | s)] . (26)
Now consider an RL policy πθRL . Under Assumption 3, the RL policy explores the full action space
and assigns non-zero probability to correct formulas f ′ that are not in the support of πg but still
satisfy exec(f ′ , T) = a⋆ (s).
As the reward function r(a | s) depends solely on execution correctness, and not formula structure,
RL is able to collect reward on these additional correct actions that πg does not generate. Therefore,
Es∼p, f ∼πθRL [r(exec(f, T) | s)] > Es∼p, f ∼πg [r(exec(f, T) | s)] , (27)
which implies the desired result.
In addition to the formal analysis in Section 3.2, we highlight several conceptual advantages of
symbolic reasoning for table understanding:
• Compositionality and structure. Spreadsheet formulas offer compositional and type-aware
representations, providing stronger structural priors than unstructured text.
• Verifiability and transparency. Symbolic outputs are interpretable and verifiable: they can be
inspected, tested, reused, or debugged—enabling traceable and auditable reasoning processes.
• Discrete action space. The symbolic action space is bounded and discrete, which facilitates more
stable exploration and optimization during training.
• Robustness to token-level variability. Unlike textual reasoning, which is prone to errors from
exposure bias or numerical drift, symbolic reasoning delegates exact computation to the executor,
reducing dependency on fragile token generation.
We also expand upon the discussion in Section 3.3, comparing supervised fine-tuning (SFT) and
reinforcement learning (RL) for symbolic reasoning:
• SFT limitations. SFT imitates teacher demonstrations at the token level and struggles to general-
ize beyond the training distribution. It penalizes semantically correct but structurally different
formulas, constraining exploration.
• Reward-aligned optimization. RL optimizes directly for task-level correctness using execution-
based rewards, allowing the model to discover diverse yet valid solution strategies.
• Support for many-to-one mappings. Since different formulas can yield the same correct answer,
RL naturally accommodates this multiplicity, whereas SFT often fails to reward such diversity.
• Flexible reward shaping. RL allows for auxiliary reward terms—such as penalties on length,
syntactic constraints, or correctness under verification—which are difficult to incorporate in SFT.
• Improved generalization. By optimizing for semantic correctness rather than mimicking surface-
level token patterns, RL enables the model to generalize more effectively in both in-distribution
(ID) and out-of-distribution (OOD) scenarios, including novel question types, unseen table
schemas, and structurally diverse formulas.
While reinforcement learning (RL) offers significant advantages for symbolic table reasoning, it also introduces several practical challenges, especially under the assumptions outlined in Sections 3.2 and 3.3.
• Exploration bottlenecks. Assumption 3 assumes that the RL policy can eventually explore
correct formulas. However, the space of possible formulas is extremely large, and valid, executable
ones are rare—especially at the start of training. This makes it difficult for the model to receive
useful reward signals, leading to slow or unstable learning.
• Limited symbolic priors. Unlike supervised fine-tuning (SFT), RL does not benefit from direct
examples of correct formulas. If the model lacks prior knowledge of spreadsheet syntax or
symbolic structures, it may struggle to generate meaningful outputs. This weak starting point
often results in inefficient exploration and poor early performance.
• RL training instability. When training from scratch, the model often produces repetitive, invalid,
or meaningless formulas in the early stages, receiving no reward. This can cause unstable training
and hinder convergence. Empirically, initializing with a supervised or pretrained model leads to
more stable training and faster reward learning.
• Sparse and coarse reward signals. Execution-based rewards typically only indicate whether the
final answer is correct or not, without offering any feedback on partially correct or structurally
promising outputs. This makes it harder for the model to learn from near misses. Design-
ing more informative reward functions—such as those based on formula structure or partial
execution—remains an important direction.
Overcoming these challenges is essential for scaling Formula Tuning to more complex symbolic
tasks, broader domains, and higher-capacity models. Future work may explore techniques such as
curriculum learning, hybrid supervision, symbolic inductive priors, or multi-objective optimization to
improve training stability and exploration efficiency.
Table 3: Overview of the training data and table benchmarks used in this study.
Evaluation Type | Dataset | # Train Data | # Test Data | Table Type | Domain | License | Source
In-Distribution | WikiTQ [61] | 13,753 | 4,217 | Relational | Wikipedia | CC-BY-SA-4.0 | Link
In-Distribution | TabFact [62] | 10,000 | 2,024 | Relational | Wikipedia | CC-BY-4.0 | Link
In-Distribution | FinQA [63] | 6,251 | 1,147 | Relational | Finance | MIT | Link
In-Distribution | HiTab [25] | 7,399 | 1,583 | Hierarchical | Statistical Reports | C-UDA 1.0 | Link
In-Distribution | MultiHiertt [64] | 7,795 | 1,038 | Multiple & Hierarchical | Finance | MIT | Link
Out-of-Distribution | AIT-QA [65] | – | 515 | Hierarchical | Airline | CDLA-Sharing-1.0 | Link
Out-of-Distribution | TableBench [66] | – | 883 | Relational | Cross Domain | CC0-1.0 | Link
Table Encoding. We adopt a table encoding method similar to SpreadsheetEncoder [21], which
converts a table into a linearized markdown-style format. Each cell is represented by its spreadsheet
address and value, forming text sequences such as A1,Year|A2,Profit. This encoding preserves
both structural and content information, enabling the model to better understand cell-level references.
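To make this encoding concrete, the sketch below shows one possible implementation of the cell-address linearization; the function names, the traversal order, and the exact separator conventions are illustrative assumptions rather than the released Fortune code.

```python
# Minimal sketch of the cell-address table encoding described above.
# Helper names, traversal order, and separators are assumptions.

def column_letter(idx: int) -> str:
    """Convert a 0-based column index to a spreadsheet column letter (0 -> A, 26 -> AA)."""
    letters = ""
    idx += 1
    while idx > 0:
        idx, rem = divmod(idx - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters

def encode_table(rows: list[list[str]]) -> str:
    """Linearize a table into 'address,value' pairs joined by '|'."""
    cells = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            address = f"{column_letter(c)}{r + 1}"
            cells.append(f"{address},{value}")
    return "|".join(cells)

# Example: a small table with a header row.
table = [["Year", "Profit"], ["2023", "1.2M"]]
print(encode_table(table))  # A1,Year|B1,Profit|A2,2023|B2,1.2M
```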
Output Format. Following the structured reasoning paradigm, the model is required to produce
outputs in a two-stage format:
$y = \underbrace{\langle \mathrm{think} \rangle \; t \; \langle /\mathrm{think} \rangle}_{\text{reasoning trajectory}} \;\; \underbrace{\langle \mathrm{answer} \rangle \; \{\mathrm{json}\} \; \langle /\mathrm{answer} \rangle}_{\text{final answer}},$
where t is a free-form natural language reasoning process (i.e., the thinking process), and the answer
block contains a JSON object from which the final prediction is extracted. This design enables
decoupling the reasoning trajectory from the answer payload and facilitates more structured reward
computation.
To encourage adherence to this format, we introduce a lightweight format reward. If the output fails
to follow the required structure (e.g., malformed tags or unparseable JSON), the model receives a
penalty of −2. If the format is valid and the answer can be successfully parsed from the JSON object,
a small positive reward of +0.1 is added to the answer-level reward. Therefore, the final reward is:
$r_{\mathrm{final}}(a \mid s) = r_{\mathrm{ans}}(a \mid s) + r_{\mathrm{fmt}}(a). \quad (28)$
This reward shaping helps stabilize training and guide the model toward producing reliably structured
outputs.
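A minimal sketch of this reward computation is shown below, assuming literal <think>/<answer> tags and a JSON payload with an "answer" key; the parsing helper and the answer-level reward stub are illustrative assumptions, not the exact Fortune implementation.

```python
import json
import re

# Sketch of the format + answer reward from Eq. (28).
# Tag names and the JSON "answer" key are assumptions about the concrete format.
ANSWER_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def final_reward(output: str, ground_truth: str, answer_reward_fn) -> float:
    """Return r_final = r_ans + r_fmt."""
    match = ANSWER_RE.search(output)
    if match is None:
        return -2.0  # malformed tags: format penalty, no answer reward
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return -2.0  # unparseable JSON: format penalty, no answer reward
    answer = payload.get("answer")
    # Valid format: +0.1 format bonus plus the answer-level reward
    return 0.1 + answer_reward_fn(answer, ground_truth)

# Example answer-level reward: binary exact match on normalized strings.
em = lambda a, g: 1.0 if str(a).strip() == str(g).strip() else 0.0
```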
Baselines. We compare our proposed framework against a broad range of strong baselines, including
both prompting-based and fine-tuning-based methods. To ensure a fair comparison, we require all
methods to output short, deterministic answers rather than open-ended free-form text. Following this
criterion, we exclude TableLLM [73], which relies on a critique model for answer evaluation and
does not produce a directly verifiable answer string. Prompting-based methods currently dominate the
TableQA landscape, with most relying on large closed-source models for performance. We compare
Fortune and Fortune++ with several representative methods in this category: Binder [36], Dater [34],
API-Assisted [11], Chain-of-Table [35], ReAcTable [67], Norm-DP [68], TIDE [69], E5 [70], and
SS-CoT [71]. We also include TableMaster [11], a recent recipe-based prompting framework built
on GPT-4o-mini. For fine-tuning-based methods, we select models specifically trained for table
question answering tasks. These include TAPEX-Large [31], OmniTab [72], TableLlama [32],
TableGPT2 [74], and TabAF [22], a recent strong method that combines formula generation and
hybrid self-consistency. Our framework is evaluated under the same settings to ensure consistency
and comparability across methods.
RL Training. We use Proximal Policy Optimization (PPO) for reinforcement learning (RL). The
maximum prompt length is set to 8192 tokens, and the maximum response length is 512 tokens.
The critic model is initialized with the same weights as the actor model. The actor is trained with
a learning rate of 1e-6, while the critic uses a slightly higher learning rate of 1e-5 to enable faster
value estimation. We set the KL divergence coefficient to 0.001 to balance exploration and policy
stability. The generation temperature is set to 0.6 to encourage a mix of determinism and diversity in
the generated reasoning chains and formula outputs. The PPO mini-batch size is 64. We evaluate
performance every 20 steps and report the results based on the best performance achieved on each
dataset.
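For reference, the hyperparameters above can be collected into a configuration sketch like the following; the key names are our own shorthand and do not correspond to VERL's actual configuration schema.

```python
# Illustrative PPO configuration mirroring the hyperparameters reported above.
# Key names are shorthand, not VERL's configuration schema.
ppo_config = {
    "max_prompt_length": 8192,     # tokens
    "max_response_length": 512,    # tokens
    "actor_learning_rate": 1e-6,
    "critic_learning_rate": 1e-5,  # critic initialized from the actor weights
    "kl_coefficient": 0.001,       # KL penalty balancing exploration and stability
    "rollout_temperature": 0.6,    # mix of determinism and diversity during rollouts
    "ppo_mini_batch_size": 64,
    "eval_interval_steps": 20,     # evaluate every 20 training steps
}
```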
SFT Training. The supervised fine-tuning (SFT) training corpus is distilled from GPT-4o by
prompting it with ground-truth answers, eliciting chain-of-thought reasoning followed by a final
answer. We adopt a rejection-based fine-tuning (RFT) strategy [77], retaining only examples where
the generated answer exactly matches the ground truth. For symbolic reasoning tasks, correctness is
determined by executing the generated formula and verifying that the resulting answer matches the
expected output. This approach ensures high-quality supervision for fine-tuning. All SFT models
are trained for 4 epochs, and we report results based on the checkpoint with the highest exact match
accuracy on the test set. We use a learning rate of 2e-5 and a batch size of 64.
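The rejection-based filtering step can be sketched as follows; the execution and matching helpers are placeholders standing in for the formula executor and the exact-match check described elsewhere in this section, not the authors' released pipeline.

```python
# Sketch of rejection-based fine-tuning (RFT) filtering: keep only distilled
# examples whose final answer matches the ground truth.
# execute_formula and answers_match are assumed placeholder callables.

def filter_sft_corpus(samples, execute_formula, answers_match):
    kept = []
    for sample in samples:
        if sample["mode"] == "formula":
            # Symbolic example: execute the generated formula against the table
            predicted = execute_formula(sample["formula"], sample["table"])
        else:
            # Textual example: use the distilled final answer directly
            predicted = sample["answer"]
        if answers_match(predicted, sample["ground_truth"]):
            kept.append(sample)  # retain only verified-correct demonstrations
    return kept
```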
Evaluation Inference. For all models, including both open-source and proprietary ones, we use
a temperature of 0, top-k of 50, and top-p of 0.7 during inference. Setting the temperature to
0 encourages deterministic outputs and improves stability in single-pass predictions. For self-
consistency decoding in Fortune++, we use a higher temperature of 0.6 to promote diversity across
multiple samples, enabling the model to better explore reasoning variations and improve final answer
voting.
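The final aggregation step can be approximated by a simple majority vote over the sampled candidates, as sketched below; the normalization and tie-breaking shown here are assumptions, and the actual Fortune++ aggregation may differ in detail.

```python
from collections import Counter

# Naive majority vote over mixed textual and formula candidates (a sketch;
# formula candidates are assumed to have already been executed into answers).
def majority_vote(candidate_answers: list[str]) -> str:
    normalized = [a.strip().lower() for a in candidate_answers if a]
    if not normalized:
        return ""
    answer, _ = Counter(normalized).most_common(1)[0]
    return answer

# Example: 5 textual + 5 formula candidates.
print(majority_vote(["42", "42", "41", "42", "42", "42", "43", "42", "41", "42"]))  # "42"
```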
Evaluation Metrics. Following prior work [61, 25], we primarily use exact match (EM) as the
evaluation metric, applying numeric tolerance when comparing numerical values. Official evaluation
scripts are used whenever available to ensure consistency. For TabFact, which is formulated as a
binary classification task, we report standard classification accuracy. Since our training objective
aligns with evaluation, we also use exact match (EM) for answer reward calculation.
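A sketch of exact match with numeric tolerance might look like the following; the tolerance value and normalization are illustrative assumptions, since the official per-dataset scripts are used whenever available.

```python
def exact_match(pred: str, gold: str, rel_tol: float = 1e-4) -> bool:
    """Exact match with numeric tolerance (sketch; tolerance value is assumed)."""
    pred, gold = str(pred).strip(), str(gold).strip()
    try:
        # Compare numerically when both sides parse as numbers
        p = float(pred.replace(",", ""))
        g = float(gold.replace(",", ""))
        return abs(p - g) <= rel_tol * max(1.0, abs(g))
    except ValueError:
        # Otherwise fall back to case-insensitive string comparison
        return pred.lower() == gold.lower()
```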
Software. We implement Fortune in Python 3.11, with the VERL framework [78] serving as the core infrastructure for both reinforcement learning and supervised fine-tuning of language models.
Our implementation utilizes VLLM (v0.8.3) [79] for efficient LLM inference and generation, PyTorch
(v2.4.0) with CUDA 12.8 for deep learning operations, and Ray [80] for distributed training and
inference. FlashAttention-2 [81] is integrated to accelerate attention computation. For proprietary
LLMs, we access OpenAI models via the Microsoft Azure platform4 . For formula execution, we
use the open-source spreadsheet engine formulas5 (EUPL 1.1+ License), which supports a wide
range of standard spreadsheet operators. A representative list of symbolic operators used in our table
reasoning framework is provided in Appendix K.
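As an illustration, the open-source formulas engine can compile and evaluate a standalone formula roughly as follows; this follows the library's documented Parser interface, but treat it as a sketch rather than the project's actual execution pipeline, which additionally binds the encoded table cells.

```python
import formulas

# Sketch: compile and evaluate a spreadsheet formula with the `formulas` engine.
# Input ordering is assumed to follow the library's reported cell references.
func = formulas.Parser().ast('=(A1 + B1) * 2')[1].compile()
print(list(func.inputs))  # cell references the formula depends on, e.g. ['A1', 'B1']
print(func(3, 4))         # bind A1=3, B1=4 positionally -> 14.0
```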
Hardware. All experiments are conducted on machines equipped with either NVIDIA A100 80GB
PCIe or NVIDIA H100 80GB PCIe GPUs, along with 1.0 TB of RAM. For reinforcement learning
(RL) training of open-source models, we use 8 × NVIDIA H100 80GB PCIe GPUs by default. For
supervised fine-tuning (SFT) of open-source models, we use 4 × NVIDIA A100 80GB PCIe GPUs
by default.
Table 4: Performance comparison of TabAF, Fortune, and Fortune++ variants. Values in the table
indicate accuracy (%). ‘-’ indicates results not reported in the related paper. Gray rows represent upper-
bound performance. Numbers in parentheses with a downward arrow (↓) indicate the performance
drop relative to the default Fortune++ configuration. Top results are highlighted in dark blue.
Method | Variant | WikiTQ | TabFact | FinQA | HiTab | MultiHiertt | AIT-QA | TableBench
TabAF [22] | Upper Bound | 80.13 | 94.02 | - | 82.07 | - | - | -
TabAF [22] | 5 Text | 61.42 | 81.47 | - | 74.24 | - | - | -
TabAF [22] | 5 Formula | 64.20 | 67.54 | - | 74.87 | - | - | -
TabAF [22] | 5 Text + 5 Formula | 74.72 | 83.99 | - | 78.41 | - | - | -
Fortune | Upper Bound | 77.35 | 96.49 | 72.01 | 79.91 | 54.82 | 88.93 | 43.26
Fortune | 10 Formula | 67.05 | 85.08 | 62.16 | 69.74 | 40.85 | 80.39 | 35.22
Fortune++ | Upper Bound | 93.62 | 99.51 | 91.02 | 95.01 | 71.29 | 98.06 | 61.16
Fortune++ | 5 Text | 64.52 (↓18.02) | 85.18 (↓9.88) | 63.64 (↓16.83) | 74.48 (↓12.76) | 36.42 (↓15.31) | 83.30 (↓9.90) | 28.31 (↓16.65)
Fortune++ | 5 Formula | 66.48 (↓16.06) | 82.41 (↓12.65) | 61.99 (↓18.48) | 68.54 (↓18.70) | 39.60 (↓12.13) | 79.42 (↓13.78) | 34.65 (↓10.31)
Fortune++ | 5 Text + 5 Formula | 82.54 | 95.06 | 80.47 | 87.24 | 51.73 | 93.20 | 44.96
We conduct an ablation study of Fortune++ to investigate the complementary roles and effectiveness
of textual and symbolic reasoning. Table 4 reveals several instructive patterns.
Combining textual and symbolic reasoning yields the most robust performance. The balanced
sampling strategy used by Fortune++ (five textual and five formula candidates) consistently outper-
forms both the pure-text and pure-formula variants across all benchmarks. This confirms that textual
and symbolic reasoning address complementary error modes. Disabling either modality leads to
substantial accuracy degradation, with drops approaching 19 percentage points (e.g., ↓16.83 on FinQA for text-only and ↓18.70 on HiTab for formula-only).
Text and formula reasoning each excel in different scenarios. Textual reasoning performs better
on simpler table QA tasks, such as TabFact and AIT-QA, where natural language understanding
and logical inference dominate. In contrast, formula-based reasoning excels on arithmetic-heavy or
structured computation tasks like FinQA and MultiHiertt, where symbolic execution is crucial for
deriving the correct answer. This division reinforces the importance of integrating both modalities for
general-purpose table understanding.
Reinforcement learning enhances symbolic reasoning beyond supervised fine-tuning. Compared with TabAF, which uses the same backbone but is trained only with SFT, the RL-trained formula-only variant of Fortune++ achieves substantially stronger performance (e.g., 82.41% vs. 67.54% on TabFact with
5 Formula). This suggests that reinforcement learning encourages the model to explore more reliable
and executable reasoning paths, ultimately improving symbolic program quality.
4 https://azure.microsoft.com/
5 https://github.com/vinci1it2000/formulas
[Figure: per-dataset accuracy of SQL, Python, and Formula reasoning under Zero-Shot and RL settings, with panels for FinQA and TabFact alongside overall scores.]
Many correct answers are lost due to naive majority voting. The upper-bound rows show
that Fortune++ frequently generates correct answers that are not selected by simple majority vote.
The discrepancy between upper-bound and actual performance reaches 19 pp on MultiHiertt and
17 pp on TableBench, indicating considerable headroom for improved candidate selection through
confidence-based aggregation or smarter reranking mechanisms.
Complex, low-resource benchmarks remain the greatest challenge. The largest performance
gaps appear on structurally complex and low-resource datasets such as MultiHiertt and TableBench.
These results highlight the limitations of current voting and reasoning mechanisms and point to future
directions including symbolic planner integration, adaptive sampling, and confidence-calibrated
answer selection.
Table 5: Performance comparison of different symbolic reasoning methods (SQL, Python, and
Formula) under Zero-shot and RL settings. Values in the table indicate accuracy (%). The best
performance in each column is highlighted in dark blue.
Setting | Method | WikiTQ | TabFact | FinQA | TableBench | Overall
Zero-Shot | SQL | 17.07 | 21.15 | 1.51 | 7.64 | 11.84
Zero-Shot | Python | 30.96 | 60.39 | 34.87 | 16.33 | 35.64
Zero-Shot | Formula | 38.96 | 52.52 | 34.44 | 26.05 | 37.99
RL | SQL | 67.58 | 83.94 | 38.79 | 32.09 | 55.60
RL | Python | 70.46 | 84.42 | 65.85 | 35.26 | 64.00
RL | Formula | 70.67 | 84.50 | 65.89 | 35.84 | 64.23
Spreadsheet formulas offer the strongest zero-shot symbolic reasoning. Without any task-specific
training, formulas achieve the highest out-of-the-box performance, with an overall accuracy of
37.99%. They slightly outperform Python (35.64%) and significantly surpass SQL (11.84%). The gap
is especially pronounced on datasets like WikiTQ and FinQA, suggesting that pre-trained language
models possess some intuitive understanding of spreadsheet-style operations, but struggle to generate
valid and executable SQL or Python code without further adaptation. On TableBench, which features
complex table QA questions, Python code falls short. Models without fine-tuning often fail to
[Figure legend: Zero-Shot w/o Reasoning, Zero-Shot w/ Reasoning, RL w/o Reasoning, RL w/ Reasoning.]