Official implementation for Pseudo-Formalization for Automatic Proof Verification. Two experimental pipelines are provided:
- arxiv: for running our ArxivMathGradingBench benchmark (found at data/arxiv_grading_bench.jsonl), research-level mathematics.
- hard2verify: for running the Salesforce/Hard2Verify benchmark, IMO- and Putnam-level mathematics.
Both pipelines support two verification methods:
- baseline — LLM-as-judge over the full proof.
- pseudo-formalisation — translates the proof into Pseudo-Formal (PF) modules, then runs Block Verification (BV) on each module independently.
Requirements:
- Python 3.10+
- openai, pandas, tqdm, matplotlib, seaborn, scikit-learn, datasets
export OPENAI_API_KEY="your-openai-api-key"

Data:
- data/arxiv_grading_bench.jsonl: the ArxivMathGradingBench benchmark metadata (arXiv IDs, versions, ground-truth error locations). The paper PDFs are not included in this repo and need to be downloaded by running:
  python scripts/download_arxiv_pdfs.py --out-dir data/arxiv_grading_bench_pdf_files
  This fetches each paper from arXiv at the benchmark version into arXiv-<id><version>.pdf. Then point the runner at that directory with --pdf-dir.
- data/hard2verify.json: the decrypted Hard2Verify data read by hard2verify_grading.py. Not included; to produce it:
  - Clone the Hard2Verify repo into ./Hard2Verify/ from https://github.com/SalesforceAIResearch/Hard2Verify; it provides the decrypt_sample utility that decrypt_hard2verify.py imports.
  - Run python scripts/decrypt_hard2verify.py, which loads Salesforce/Hard2Verify from Hugging Face, decrypts it, and writes data/hard2verify.json (and a .csv for inspection). Note: do not commit the decrypted version of this benchmark to any public repo (as requested by the Hard2Verify authors).
First download the paper PDFs (see Data above):

python scripts/download_arxiv_pdfs.py --out-dir data/arxiv_grading_bench_pdf_files

For running the baseline:

python arxiv_grading.py --method baseline \
    --model gpt-5.4-mini-2026-03-17 --effort medium \
    --n-runs 1 --concurrency 16 \
    --pdf-dir data/arxiv_grading_bench_pdf_files

For running Pseudo-formal + block verification:

python arxiv_grading.py --method pseudo-formalisation \
    --model gpt-5.4-mini-2026-03-17 --effort medium \
    --n-runs 1 --concurrency 16 \
    --pdf-dir data/arxiv_grading_bench_pdf_files

The arxiv pipeline expects PDF source files in a directory pointed to by
--pdf-dir <path> (you can also use export ARXIV_PDF_DIR=...).
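
The two ways of supplying the PDF directory can be illustrated with a minimal sketch. It is not the code in arxiv_grading.py: the helper name `resolve_pdf_dir` is hypothetical, and the precedence shown (the flag wins over the environment variable) is an assumption.

```python
import argparse
import os
from pathlib import Path


def resolve_pdf_dir(argv=None, environ=None):
    """Return the PDF directory from --pdf-dir, else from ARXIV_PDF_DIR.

    Hypothetical helper: the flag-over-environment precedence is an
    assumption, not documented behaviour of the runner.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--pdf-dir", default=None)
    # parse_known_args lets this sketch ignore the runner's other flags.
    args, _ = parser.parse_known_args(argv)
    value = args.pdf_dir or environ.get("ARXIV_PDF_DIR")
    if not value:
        raise SystemExit("Pass --pdf-dir <path> or set ARXIV_PDF_DIR.")
    return Path(value)
```

Calling `resolve_pdf_dir(["--pdf-dir", "data/arxiv_grading_bench_pdf_files"])` returns that path, while `resolve_pdf_dir([])` falls back to the environment variable if it is set.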
For running the Hard2Verify baseline:

python hard2verify_grading.py --method baseline \
    --model gpt-5.4-mini-2026-03-17 --effort medium \
    --n-runs 1 --dev-set-size 200 --concurrency 16 \
    --output-dir outputs/hard2verify-baseline

For running Pseudo-formal + block verification:

python hard2verify_grading.py --method pseudo-formalisation \
    --model gpt-5.4-mini-2026-03-17 --effort medium \
    --n-runs 1 --dev-set-size 200 --concurrency 16 \
    --output-dir outputs/hard2verify-pseudo-formalisation

For the pseudo-formalisation method, step-level calibration runs inside this
command: it uses the original solution steps from data/hard2verify.json,
audits the rewritten-proof verifier flags, and emits per-step verdicts plus
incorrect steps in each run's step_verification field.
Both runners save raw per-run model outputs to a single
grading_full_output.json; neither runner computes metrics itself. The
metrics we report (and how the saved output maps to them) are defined per
pipeline below.
arxiv_grading.py writes outputs/arxiv-<method>/grading_full_output.json.
Per paper it stores the ground-truth error location and, for each of the n
runs, the model's raw output: extracted_errors (a list of
{location, description}), extracted_locations, an extraction_status, and
token usage.
The metrics treat finding an error as the positive class. For each predicted
error, an LLM judge decides whether its location refers to a ground-truth
error location of that paper (the same labelled result — e.g. Theorem 19 —
or a step inside its proof). To aggregate k parallel runs, the predicted
errors from the k runs are unioned and de-duplicated by location. Counting
across all papers gives
- TP = ground-truth locations matched by some prediction,
- FP = predictions matching no ground-truth location,
- FN = ground-truth locations matched by no prediction,
so precision = TP / (TP + FP) and recall = TP / (TP + FN). Sweeping k
from 1..n traces a precision–recall curve. Because the benchmark only
annotates the author-disclosed error(s), FP is an upper bound and precision a
lower bound on the true values.
Compute these from a run's output with:

python scripts/arxiv_metrics.py outputs/arxiv-baseline/grading_full_output.json

This runs the LLM judge (gpt-5.4-mini by default) and prints
precision/recall for each k. Pass --csv <path> to save the table.
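
The TP/FP/FN counting and the union over k runs described above can be sketched in a few lines. This is an illustration only, not the logic of scripts/arxiv_metrics.py: the function names and the input layout are hypothetical, exact string equality stands in for the LLM judge, and returning 0.0 for an empty denominator is an assumed convention.

```python
def union_predictions(runs, k):
    """Union the predicted locations of the first k runs, de-duplicated."""
    seen = []
    for run in runs[:k]:
        for location in run:
            if location not in seen:
                seen.append(location)
    return seen


def precision_recall(papers, k, matches=lambda pred, truth: pred == truth):
    """Precision and recall over all papers, with 'error found' as positive.

    papers: list of (ground_truth_locations, runs), where runs is a list of
    per-run predicted location lists (a hypothetical layout).
    matches: stand-in for the LLM judge; exact equality by default.
    """
    tp = fp = fn = 0
    for truth_locations, runs in papers:
        preds = union_predictions(runs, k)
        for truth in truth_locations:
            if any(matches(p, truth) for p in preds):
                tp += 1  # ground-truth location matched by some prediction
            else:
                fn += 1  # ground-truth location matched by no prediction
        # predictions matching no ground-truth location
        fp += sum(not any(matches(p, t) for t in truth_locations) for p in preds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: two papers, two runs each.
papers = [
    (["Theorem 19"], [["Lemma 3"], ["Theorem 19", "Lemma 3"]]),
    (["Lemma 2"], [[], ["Section 4"]]),
]
for k in (1, 2):
    print(k, precision_recall(papers, k))
```

On the toy data, k = 1 finds no ground-truth location, while k = 2 gives TP = 1, FP = 2, FN = 1, so precision is 1/3 and recall is 1/2. Adding runs can only grow the union, which is why recall never decreases with k.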
hard2verify_grading.py writes <output-dir>/grading_full_output.json
(default outputs/hard2verify-<method>/). Per problem it stores the human
step labels (human_labels), a binary Points (7 iff all steps correct), and
LLM_Full_Output with each run's per-step and per-solution predictions (plus
the calibrated step_verification for pseudo-formalisation).
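
As a small illustration of the binary Points field, the rule "7 iff all steps correct" could be written as below. The function name is hypothetical, and two details are assumptions: that each human label is a boolean with True meaning the step is correct, and that the non-passing value is 0.

```python
def points_from_labels(human_labels):
    """Return 7 if every step is labelled correct, else 0.

    Assumes human_labels is a list of booleans (True = step correct) and
    that the failing score is 0; the saved file may encode labels differently.
    """
    return 7 if all(human_labels) else 0


print(points_from_labels([True, True, True]))
print(points_from_labels([True, False, True]))
```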
The metrics are at the step level, with incorrect step as the positive
class. Across k runs the predictions are aggregated pessimistically — a step
is predicted incorrect if any of the k runs flags it — and TP/FP/FN are
tallied over all steps to give step-level precision and recall, again swept
over k.
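
The pessimistic aggregation and the step-level counts can be sketched as follows. This is a minimal sketch under stated assumptions, not the repository's metric code: the function names and the input layout (boolean flags, True = incorrect step) are hypothetical, and returning 0.0 for an empty denominator is an assumed convention.

```python
def aggregate_pessimistic(run_flags, k):
    """A step is predicted incorrect if any of the first k runs flags it.

    run_flags: list of runs, each a list of booleans (True = flagged incorrect).
    """
    return [any(flags) for flags in zip(*run_flags[:k])]


def step_precision_recall(problems, k):
    """Step-level precision/recall with 'incorrect step' as the positive class.

    problems: list of (truth_incorrect, run_flags) pairs, where
    truth_incorrect[i] is True when step i is labelled incorrect.
    """
    tp = fp = fn = 0
    for truth_incorrect, run_flags in problems:
        predicted = aggregate_pessimistic(run_flags, k)
        for truth, pred in zip(truth_incorrect, predicted):
            tp += int(truth and pred)
            fp += int(pred and not truth)
            fn += int(truth and not pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: two problems, two runs each.
problems = [
    ([False, True, False], [[False, False, False], [False, True, True]]),
    ([False, False], [[True, False], [False, False]]),
]
for k in (1, 2):
    print(k, step_precision_recall(problems, k))
```

On the toy data, k = 1 misses the one incorrect step, and k = 2 catches it at the cost of two false flags (precision 1/3, recall 1). This shows the trade-off the sweep over k is meant to expose: an "any run flags it" rule raises recall and tends to lower precision.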
arxiv_grading.py # arxiv pipeline entry point
hard2verify_grading.py # Hard2Verify pipeline entry point
src/
utils.py # Data loading, metrics, plotting utilities
verifier/
verifier.py # Verifier (baseline) and PseudoFormalisationVerifier
complex_verifier.py # Decomposed-rewriter (IMO/H2V)
arxiv_complex_verifier.py
prompts.py # Prompt templates
complex_prompts.py
arxiv_complex_prompts.py
data/
arxiv_grading_bench.jsonl
scripts/ # metrics, PDF download, and one-off utility scripts
Hard2Verify/ # NOT included — clone the Hard2Verify repo here for its decrypt utility (see Data)