Official implementation for the paper: On Path to Multimodal Historical Reasoning: HistBench and HistAgent
Welcome to the official open-source repository for HistAgent, a multi-agent system for historical question answering. HistAgent combines multimodal components such as OCR, image analysis, and file parsing to tackle the intricate challenges of historical inquiry.
Reproduce our results using the resources provided in this repository, as detailed in our paper:
"On the Path to Multimodal Historical Reasoning: HistBench and HistAgent"
(Jiahao Qiu, Fulian Xiao, Yimin Wang, Yuchen Mao, Yijia Chen, Xinzhe Juan, Siran Wang, Xuan Qi, Tongcheng Zhang, Zixin Yao, Jiacheng Guo, Yifu Lu, Charles Argon, Jundi Cui, Daixin Chen, Junran Zhou, Shuyao Zhou, Zhanpeng Zhou, Ling Yang, Shilong Liu, Hongru Wang, Kaixuan Huang, Xun Jiang, Yuming Cao, Yue Chen, Yunfei Chen, Zhengyi Chen, Ruowei Dai, Mengqiu Deng, Jiye Fu, Yunting Gu, Zijie Guan, Zirui Huang, Xiaoyan Ji, Yumeng Jiang, Delong Kong, Haolong Li, Jiaqi Li, Ruipeng Li, Tianze Li, Zhuoran Li, Haixia Lian, Mengyue Lin, Xudong Liu, Jiayi Lu, Jinghan Lu, Wanyu Luo, Ziyue Luo, Zihao Pu, Zhi Qiao, Ruihuan Ren, Liang Wan, Ruixiang Wang, Tianhui Wang, Yang Wang, Zeyu Wang, Zihua Wang, Yujia Wu, Zhaoyi Wu, Hao Xin, Weiao Xing, Ruojun Xiong, Weijie Xu, Yao Shu, Xiao Yao, Xiaorui Yang, Yuchen Yang, Nan Yi, Jiadong Yu, Yangyuxuan Yu, Huiting Zeng, Danni Zhang, Yunjie Zhang, Zhaoyu Zhang, Zhiheng Zhang, Xiaofeng Zheng, Peirong Zhou, Linyan Zhong, Xiaoyin Zong, Ying Zhao, Zhenxin Chen, Lin Ding, Xiaoyu Gao, Bingbing Gong, Yichao Li, Yang Liao, Guang Ma, Tianyuan Ma, Xinrui Sun, Tianyi Wang, Han Xia, Ruobing Xian, Gen Ye, Tengfei Yu, Wentao Zhang, Yuxi Wang, Xi Gao, Mengdi Wang)
HistAgent is rooted in the smolagents framework with adaptations for the challenges of historical reasoning. This system is rigorously evaluated on HistBench, a benchmark featuring 414 questions and covering 29 languages, extensively described in our paper.
- Paper: On Path to Multimodal Historical Reasoning: HistBench and HistAgent
- Dataset: HistBench on Hugging Face
- Specialized Agents: Individual agents manage OCR, image analysis, document parsing (PDF, Word, Excel), translation, speech transcription, web browsing, and academic literature search.
- Multimodal and Multilingual Reasoning: Supports diverse source materials: manuscripts, images, audio, video, inscriptions, and texts in 29 modern and ancient languages.
- Historical OCR Integration: Uses Transkribus for Western scripts and specialized models for non-Latin scripts to transcribe manuscripts and inscriptions.
- Citation-Aware Literature Search: Prioritizes peer-reviewed sources via Google Scholar, Google Books, and the Springer API, returning exact quotes with full bibliographic metadata.
- HistBench Compatibility: Optimized for seamless operation on HistBench, evaluating historical reasoning across multiple dimensions.
- Modular and Extensible: Built on smolagents for easier customization and tool extension.
- Three-Stage Human-in-the-Loop Review Pipeline: HistBench questions undergo preliminary screening, LLM-based filtering, and expert validation to ensure academic rigor and fairness in evaluation.
- Structured Historical Evaluation Framework: HistBench consists of 414 questions across 36 subfields, stratified into three difficulty levels and annotated with metadata (source, topic, reasoning dimension, and explanation).
- Competitive and Generalizable Performance: HistAgent achieves 27.54% pass@1 and 36.47% pass@2 accuracy on HistBench, outperforming GPT-4o (18.60% pass@1) and ODR-smolagents. It also maintains strong performance (60.00% pass@1) on the general-purpose GAIA benchmark.
Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions designed to evaluate AI's capacity for historical reasoning, authored by more than 40 expert contributors. The tasks span a wide range of historical problems, from factual retrieval based on primary sources to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Observing the poor performance of LLMs and other agents on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in history. On HistBench, HistAgent based on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1 (14.49%), and Open Deep Research-smolagents (20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning.
HistAgent demonstrates superior performance over existing LLMs and generalist agents on HistBench.
Fig. 1: Performance comparison on HistBench.
HistBench is the first dedicated benchmark for historical reasoning in AI:
- 414 questions, curated by domain experts and students, covering:
- 6 reasoning dimensions: bibliographic retrieval, source identification, source processing, historical analysis, interdisciplinary integration, cultural contextualization.
- 3 difficulty levels: Basic (Level 1), Intermediate (Level 2), and Challenging (Level 3).
- 29 languages: including English, Classical Chinese, Latin, Sanskrit, Old Uyghur, and more.
- Multimodal sources: manuscripts, inscriptions, images, audio, video.
- 36+ historical subfields: e.g., epigraphy, climate history, intellectual history, and material culture.
- Three-stage review: format/semantic check → LLM-based filtering → expert validation.
Fig. 2: HistBench difficulty levels.
Fig. 3: Language diversity (left) and geographic coverage (right) in HistBench.
For details on HistBench, refer to Section 3 of our paper and Appendix A.
HistAgent utilizes a Manager Agent to coordinate specialized agents, each with targeted tools, thus embracing a modular structure for complex historical reasoning tasks.
Fig. 4: HistAgent architecture overview.
| Agent | Focus | Core Functions |
|---|---|---|
| Manager Agent | Orchestrator | Directs execution, manages request parsing and validation, handles agent outputs. |
| Text WebBrowser Agent | Web & Text Interaction | Conducts web searches, navigates web content, extracts text. |
| Image Information Agent | Image Analysis | Executes reverse searches, assesses image parameters, interfaces with OCR for text extraction. |
| Literature Search Agent | Academic Source Retrieval | Utilizes scholarly databases for literature retrieval, processes PDFs for detailed content extraction. |
| File Processing Agent | Document Handling | Processes files like PDFs, DOCX, XLSX, etc., to extract and interpret content. |
| OCR Agent | Optical Character Recognition | Deployed for recognizing text from images, including historical manuscripts with specialized models. |
| Speech Recognition Agent | Audio Transcription | Converts speech to text, suitable for audio sources such as interviews or oral histories. |
| Translator Agent | Language Conversion | Handles multilingual document translation, ensuring historical language accuracy. |
| Video Agent | Video Analysis | Extracts and processes frames, enabling multimodal interrogation of video content. |
The Manager Agent orchestrates tasks in a CodeAct loop, ensuring stepwise verification, citation integrity, and structured outputs. For more detailed information on the architecture and agent functionalities, consult Section 5 of our paper.
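For a concrete feel of this pattern, here is a minimal sketch of a manager agent delegating to one managed sub-agent, assuming smolagents' `CodeAgent`/`ToolCallingAgent` API; the agent names and wiring are illustrative, not the exact configuration used by run_hist.py:

```python
from smolagents import CodeAgent, OpenAIServerModel, ToolCallingAgent, WebSearchTool

model = OpenAIServerModel(model_id="gpt-4o")  # uses OPENAI_API_KEY from the environment

# One managed sub-agent standing in for the Text WebBrowser Agent.
web_agent = ToolCallingAgent(
    tools=[WebSearchTool()],
    model=model,
    name="text_webbrowser_agent",
    description="Searches the web and extracts text from pages.",
)

# The manager runs a CodeAct loop and delegates subtasks to sub-agents by name.
manager = CodeAgent(
    tools=[],
    model=model,
    managed_agents=[web_agent],  # OCR, literature, and file agents would join here
)

print(manager.run("In which year was the Edict of Milan issued, and by whom?"))
```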
The Literature Search Agent is critical in leveraging academic research capabilities within HistAgent:
Fig. 5: Literature Search Agent architecture.
- Smart Retrieval: Leverages academic databases efficiently, prioritizing peer-reviewed and reputable sources.
- Full-Text Analysis: Engages in detailed content parsing, extracting precise quotes and sections needed for historical context.
- Citation Aware: Ensures integrity with metadata extraction and citation-ready outputs.
- Integrated API Use: Combines Springer API and browser enhancements for comprehensive sourcing.
This agent significantly enhances the integrity and depth of historical research, ensuring results are bibliographically verifiable.
```bash
git clone https://github.com/CharlesQ9/HistAgent.git
cd HistAgent
```

Use Conda for efficient Python environment management:

```bash
conda create -n HistAgent python=3.12
conda activate HistAgent
```

Install the necessary packages via requirements.txt and smolagents:

```bash
pip install -r requirements.txt
pip install 'smolagents[dev]'
```

HistAgent requires a modified browser_use library for enhanced capabilities. Ensure replacement of any existing installations with our provided version:
1. Identify your `site-packages` directory:

   ```bash
   python -c "import site; print(site.getsitepackages()[0])"
   ```

2. Install the modified version of `browser_use`: manually replace the installed copy in that directory with the version provided here (a sketch follows).
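A minimal sketch of the replacement step in Python, assuming the modified `browser_use` copy sits at the repository root; adjust `source` if it lives elsewhere:

```python
import shutil
import site
from pathlib import Path

# Locate the active environment's site-packages directory.
site_packages = Path(site.getsitepackages()[0])
target = site_packages / "browser_use"

# Assumed location of the modified copy shipped with this repository.
source = Path("browser_use")

if target.exists():
    shutil.rmtree(target)        # remove the stock installation
shutil.copytree(source, target)  # drop in the modified version
print(f"Replaced {target} with {source.resolve()}")
```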
Make sure git-lfs is installed and then clone the dataset from Hugging Face:
```bash
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/jiahaoq/HistBench

# If you want to clone without large files - just their pointers
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/jiahaoq/HistBench
```

Several API keys are needed for HistAgent's operation. Configure these either via a .env file or set them as environment variables in your system.
Place a .env file at the project root with your keys:
```
OPENAI_API_KEY=your_openai_api_key
SERPAPI_API_KEY=your_serpapi_api_key
IMGBB_API_KEY=your_imgbb_api_key
SPRINGER_API_KEY=your_springer_api_key
LLAMA_API_KEY=your_llama_api_key
TRANSKRIBUS_USERNAME=your_transkribus_username
TRANSKRIBUS_PASSWORD=your_transkribus_password
HF_TOKEN=your_huggingface_token
OPENROUTER_API_KEY=your_openrouter_api_key
OPENROUTER_API_BASE=https://openrouter.ai/api/v1
TEXTIN_APP_ID=your_textin_app_id
TEXTIN_SECRET_CODE=your_textin_secret_code
```
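If you use the `.env` route, here is a minimal sketch for verifying that the keys load, assuming the `python-dotenv` package (an assumption; confirm it is in `requirements.txt`):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file at the project root

# Spot-check a few of the required keys.
for key in ("OPENAI_API_KEY", "SERPAPI_API_KEY", "SPRINGER_API_KEY"):
    print(key, "->", "set" if os.getenv(key) else "MISSING")
```

Alternatively, set the environment variables directly: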
```bash
export OPENAI_API_KEY="your_openai_api_key"
export SERPAPI_API_KEY="your_serpapi_api_key"
export IMGBB_API_KEY="your_imgbb_api_key"
export SPRINGER_API_KEY="your_springer_api_key"
export LLAMA_API_KEY="your_llama_api_key"
export TRANSKRIBUS_USERNAME="your_transkribus_username"
export TRANSKRIBUS_PASSWORD="your_transkribus_password"
```

For quick tests, provide API keys as command-line arguments:
python run_hist.py --run-name "test_run_cli_keys" --api-key "your_openai_key" --springer-api-key "your_springer_key" --llama-api-key "your_llama_key"Use run_hist.py to leverage HistAgent on HistBench.
Run HistAgent for Level 2 questions with various capabilities enabled:
python run_hist.py --run-name "trial_level2" --use-image-agent --use-file-agent --use-literature-agent --use-browser --level level2| Flag | Description | Default Value |
|---|---|---|
--run-name |
Required. Unique identifier for this run. | None |
--model-id |
Specifies the language model for use (e.g., gpt-4o). |
gpt-4o |
--concurrency |
Number of parallel tasks. | 8 |
--level |
HistBench difficulty level: level1, level2, level3, or all. |
level2 |
--question-ids |
Specific question IDs to process (comma-separated). | None |
--start-id, --end-id |
Range of question IDs to process. | None, None |
--results-json-path |
Path to a previous run's .jsonl output for filtering or re-evaluation. |
None |
--baseline |
Use a simpler baseline agent architecture. | False |
--use-image-agent |
Enables the Image Information Agent. | False |
--use-file-agent |
Enables the File Processing Agent. | False |
--use-literature-agent |
Enables the Literature Search Agent. | False |
--use-browser |
Enables browser-based tools (modified browser_use library required). |
False |
--api-key |
OpenAI API key (overrides other sources). | None |
--springer-api-key |
Springer API key (overrides other sources). | None |
--llama-api-key |
LlamaParse API key (overrides other sources). | None |
1. Full Level 2 Execution:

   ```bash
   python run_hist.py --run-name "histagent_full_level2" --use-image-agent --use-file-agent --use-literature-agent --use-browser --level level2
   ```

2. Baseline Comparison for Level 1:

   ```bash
   python run_hist.py --run-name "baseline_comparison_level1" --baseline --level level1
   ```

3. Specific Level 3 Questions:

   ```bash
   python run_hist.py --run-name "selected_level3" --use-image-agent --use-file-agent --use-literature-agent --use-browser --level level3 --question-ids "5,8,20"
   ```

4. Custom Dataset Processing:

   ```bash
   python run_hlejson.py --run-name "custom_hle_run" --use-image-agent --use-file-agent --use-literature-agent --use-browser --path-to-hle-json "path/to/custom_dataset.json"
   ```

Note: Ensure run_hlejson.py aligns with run_hist.py options if based on the same architecture.
Each run's results are saved in directories organized by parameters and run name:

- Standard runs: `output/{LEVEL}_summary/{RUN_NAME}/`
- Baseline runs: `output_baseline/{LEVEL}_summary/{RUN_NAME}/`
Each directory contains:
- JSONL: Detailed results, agent activity, and tool interactions (a loading sketch follows this list).
- Excel: Summary of answers and key metrics.
- TXT: Human-readable answer summaries.
- Stats JSON: Aggregate statistics of runs (accuracy, tool use).
- Logs Directory:
  - Main Log: Run-wide logging.
  - Task-Specific Logs: Per-question/task logging.
  - Error Log: Details of runtime errors encountered.
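As referenced above, here is a minimal sketch for loading a run's JSONL output; the run directory is an example, and the `is_correct` field only appears once results have been evaluated (see the evaluation section below):

```python
import json
from pathlib import Path

# Example run directory; substitute your own {LEVEL}_summary/{RUN_NAME}.
run_dir = Path("output/level2_summary/trial_level2")

entries = []
for jsonl_path in run_dir.glob("*.jsonl"):
    with jsonl_path.open(encoding="utf-8") as f:
        entries.extend(json.loads(line) for line in f if line.strip())

print(f"Loaded {len(entries)} task records")

# 'is_correct' is assumed present only in evaluated results (judgement.py output).
judged = [e for e in entries if "is_correct" in e]
if judged:
    acc = sum(e["is_correct"] for e in judged) / len(judged)
    print(f"Accuracy over {len(judged)} judged tasks: {acc:.2%}")
```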
We provide an evaluation script, judgement.py, to assess the accuracy of HistAgent's outputs against ground-truth answers. This script builds upon and refines the evaluation logic from the HLE benchmark (Humanity's Last Exam).
After generating result .jsonl files using run_hist.py, you can evaluate them as follows:

```bash
python judgement.py --input_file path/to/your_results.jsonl --output_file path/to/evaluated_results.jsonl
```

For batch processing of multiple files, use:

```bash
python judgement.py --input_dir path/to/your_results_dir --output_dir path/to/evaluated_results_dir
```

The script uses your OpenAI API key for LLM-based semantic judging and allows customization of the evaluation model (e.g., gpt-4o).
- HLE-Inspired Framework: The evaluation criteria inherit the HLE benchmark's semantic-consistency logic, checking whether the core meaning and necessary key points of the response align with the ground truth, while allowing for small numerical errors and variations in phrasing.
- Final Answer Extraction: The script extracts a `final_answer` from each model's response. If no exact answer is present, it records `None`.
- Semantic Judgement: An LLM judges whether the extracted final answer matches the ground truth. Any inconsistency, ambiguity, or missing key point marks the answer as incorrect (a sketch of this step follows the list).
- Accuracy Calculation: The script computes overall accuracy as the proportion of correct answers. It also logs detailed information for each task, including extracted answers, reasoning, and confidence scores.
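As a sketch of the semantic-judgement step referenced above, assuming the official OpenAI Python client; the prompt wording and parsing are illustrative, not the exact logic in judgement.py:

```python
from openai import OpenAI  # official OpenAI Python client (assumed installed)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_answer(question: str, model_answer: str, true_answer: str) -> bool:
    """Ask an LLM whether the model answer semantically matches the ground truth."""
    prompt = (
        f"Question: {question}\n"
        f"Ground-truth answer: {true_answer}\n"
        f"Model answer: {model_answer}\n"
        "Does the model answer convey the same meaning as the ground truth, "
        "allowing small numerical errors and variations in phrasing? "
        "Reply with exactly 'correct' or 'incorrect'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")
```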
Example output for each entry:
```json
{
  "task_id": "001",
  "question": "Who painted the Mona Lisa?",
  "model_answer": "The Mona Lisa was painted by Leonardo da Vinci.",
  "true_answer": "Leonardo da Vinci",
  "is_correct": true,
  "extracted_final_answer": "Leonardo da Vinci",
  "reasoning": "The model answer directly matches the true answer without any discrepancies.",
  "confidence": 100
}
```

For detailed evaluation logic, see judgement.py.
Combine and analyze results from different experiments using combine_results.py.
```bash
python combine_results.py output/level2_summary/run1/*.jsonl output/level2_summary/run2/*.jsonl --output-dir combined_analysis --output-name comparison_run
```

| Flag | Description | Default Value |
|---|---|---|
| `input_files` | Required. List of `.jsonl` result files or glob patterns to combine. | None |
| `--output-dir` | Directory for saving combined analysis outputs. | `output/combined` |
| `--output-name` | Custom name for the output files. | `YYYYMMDD_HHMMSS_combined` |
| `--conflict-strategy` | Conflict resolution when duplicate questions are detected: `first`, `latest`, `model`. | `latest` |
| `--preferred-model` | With `--conflict-strategy model`, the preferred model. | None |
| `--formats` | Desired output formats: `jsonl`, `excel`, `txt`, `all`. | `all` |
| `--add-readme` | Include a README that outlines the combined results and settings. | False |
| `--level` | Filter combinations by task level: `level1`, `level2`, `level3`, `all`. | `all` |
| `--copy-images` | Shuffle and copy associated images to the output directory. | False |
| `--images-dir` | Original directory containing dataset images; required if `--copy-images` is active. | `dataset/` |
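A minimal sketch of the `latest` conflict strategy described above; the `task_id` field follows the example output in the evaluation section and is an assumption about the actual file schema:

```python
import glob
import json


def combine_results(patterns, strategy="latest"):
    """Merge .jsonl result files, resolving duplicate task_ids.

    'latest' lets entries from later files overwrite earlier ones; 'first'
    keeps the first occurrence. The 'model' strategy of combine_results.py
    is omitted here for brevity.
    """
    merged = {}
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if not line.strip():
                        continue
                    entry = json.loads(line)
                    key = entry.get("task_id")
                    if strategy == "first" and key in merged:
                        continue
                    merged[key] = entry  # 'latest': overwrite on duplicates
    return list(merged.values())


combined = combine_results(["output/level2_summary/*/*.jsonl"])
print(f"{len(combined)} unique tasks after merging")
```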
If you use HistAgent or HistBench, please kindly cite:
```bibtex
@article{qiu2025path,
  title={On Path to Multimodal Historical Reasoning: HistBench and HistAgent},
  author={Qiu, Jiahao and Xiao, Fulian and Wang, Yimin and Mao, Yuchen and Chen, Yijia and Juan, Xinzhe and Wang, Siran and Qi, Xuan and Zhang, Tongcheng and Yao, Zixin and others},
  journal={arXiv preprint arXiv:2505.20246},
  year={2025}
}
```

Contributions to HistAgent are invaluable! We're open to enhancements, fixes, or new features. Get involved by:
- Opening an issue to discuss your ideas.
- Forking this repository and submitting a pull request.
Refer to our CONTRIBUTING.md for detailed guidelines.
HistAgent is distributed under the Apache-2.0 license. See the LICENSE file for complete information.
A special thanks to the smolagents developers for their foundational work, and additional gratitude to all who helped develop and refine HistBench.