Official implementation for the paper: On Path to Multimodal Historical Reasoning: HistBench and HistAgent
Welcome to the official open-source repository for HistAgent, a multi-agent system for historical question answering. HistAgent combines multimodal components such as OCR, image analysis, and file parsing to tackle the intricate challenges of historical inquiry.
Reproduce our results using the resources provided in this repository, as detailed in our paper:
"On the Path to Multimodal Historical Reasoning: HistBench and HistAgent"
(Jiahao Qiu, Fulian Xiao, Yimin Wang, Yuchen Mao, Yijia Chen, Xinzhe Juan, Siran Wang, Xuan Qi, Tongcheng Zhang, Zixin Yao, Jiacheng Guo, Yifu Lu, Charles Argon, Jundi Cui, Daixin Chen, Junran Zhou, Shuyao Zhou, Zhanpeng Zhou, Ling Yang, Shilong Liu, Hongru Wang, Kaixuan Huang, Xun Jiang, Yuming Cao, Yue Chen, Yunfei Chen, Zhengyi Chen, Ruowei Dai, Mengqiu Deng, Jiye Fu, Yunting Gu, Zijie Guan, Zirui Huang, Xiaoyan Ji, Yumeng Jiang, Delong Kong, Haolong Li, Jiaqi Li, Ruipeng Li, Tianze Li, Zhuoran Li, Haixia Lian, Mengyue Lin, Xudong Liu, Jiayi Lu, Jinghan Lu, Wanyu Luo, Ziyue Luo, Zihao Pu, Zhi Qiao, Ruihuan Ren, Liang Wan, Ruixiang Wang, Tianhui Wang, Yang Wang, Zeyu Wang, Zihua Wang, Yujia Wu, Zhaoyi Wu, Hao Xin, Weiao Xing, Ruojun Xiong, Weijie Xu, Yao Shu, Xiao Yao, Xiaorui Yang, Yuchen Yang, Nan Yi, Jiadong Yu, Yangyuxuan Yu, Huiting Zeng, Danni Zhang, Yunjie Zhang, Zhaoyu Zhang, Zhiheng Zhang, Xiaofeng Zheng, Peirong Zhou, Linyan Zhong, Xiaoyin Zong, Ying Zhao, Zhenxin Chen, Lin Ding, Xiaoyu Gao, Bingbing Gong, Yichao Li, Yang Liao, Guang Ma, Tianyuan Ma, Xinrui Sun, Tianyi Wang, Han Xia, Ruobing Xian, Gen Ye, Tengfei Yu, Wentao Zhang, Yuxi Wang, Xi Gao, Mengdi Wang)
HistAgent is rooted in the smolagents framework with adaptations for the challenges of historical reasoning. This system is rigorously evaluated on HistBench, a benchmark featuring 414 questions and covering 29 languages, extensively described in our paper.
- Paper: On Path to Multimodal Historical Reasoning: HistBench and HistAgent
- Dataset: HistBench on Hugging Face
- Specialized Agents: Individual agents manage OCR, image analysis, document parsing (PDF, Word, Excel), translation, speech transcription, web browsing, and academic literature search.
- Multimodal and Multilingual Reasoning: Supports diverse source materials: manuscripts, images, audio, video, inscriptions, and texts in 29 modern and ancient languages.
- Historical OCR Integration: Uses Transkribus for Western scripts and specialized models for non-Latin scripts to transcribe manuscripts and inscriptions.
- Citation-Aware Literature Search: Prioritizes peer-reviewed sources via Google Scholar, Google Books, and the Springer API, returning exact quotes with full bibliographic metadata.
- HistBench Compatibility: Optimized for seamless operation on HistBench, evaluating historical reasoning across multiple dimensions.
- Modular and Extensible: Built on smolagents for easier customization and tool extension.
- Three-Stage Human-in-the-Loop Review Pipeline: HistBench questions undergo preliminary screening, LLM-based filtering, and expert validation to ensure academic rigor and fairness in evaluation.
- Structured Historical Evaluation Framework: HistBench consists of 414 questions across 36 subfields, stratified into three difficulty levels and annotated with metadata (source, topic, reasoning dimension, and explanation).
- Competitive and Generalizable Performance: HistAgent achieves 27.54% pass@1 and 36.47% pass@2 accuracy on HistBench, outperforming GPT-4o (18.60% pass@1) and ODR-smolagents. It also maintains strong performance (60.00% pass@1) on the general-purpose GAIA benchmark.
Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions designed to evaluate AI's capacity for historical reasoning, authored by more than 40 expert contributors. The tasks span a wide range of historical problems, from factual retrieval based on primary sources to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Observing the poor performance of LLMs and other agents on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in history. On HistBench, HistAgent based on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1 (14.49%), and Open Deep Research-smolagents (20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning.
HistAgent demonstrates superior performance over existing LLMs and generalist agents on HistBench.
Fig. 1: Performance comparison on HistBench.
HistBench is the first dedicated benchmark for historical reasoning in AI:
- 414 questions, curated by domain experts and students, covering:
- 6 reasoning dimensions: bibliographic retrieval, source identification, source processing, historical analysis, interdisciplinary integration, cultural contextualization.
- 3 difficulty levels: Basic (Level 1), Intermediate (Level 2), and Challenging (Level 3).
- 29 languages: including English, Classical Chinese, Latin, Sanskrit, Old Uyghur, and more.
- Multimodal sources: manuscripts, inscriptions, images, audio, video.
- 36+ historical subfields: e.g., epigraphy, climate history, intellectual history, and material culture.
- Three-stage review: format/semantic check → LLM-based filtering → expert validation.
Fig. 2: HistBench difficulty levels.
Fig. 3: Language diversity (left) and geographic coverage (right) in HistBench.
For details on HistBench, refer to Section 3 of our paper and Appendix A.
HistAgent utilizes a Manager Agent to coordinate specialized agents, each with targeted tools, thus embracing a modular structure for complex historical reasoning tasks.
Fig. 4: HistAgent architecture overview.
| Agent | Focus | Core Functions |
|---|---|---|
| Manager Agent | Orchestrator | Directs execution, manages request parsing and validation, handles agent outputs. |
| Text WebBrowser Agent | Web & Text Interaction | Conducts web searches, navigates web content, extracts text. |
| Image Information Agent | Image Analysis | Executes reverse searches, assesses image parameters, interfaces with OCR for text extraction. |
| Literature Search Agent | Academic Source Retrieval | Utilizes scholarly databases for literature retrieval, processes PDFs for detailed content extraction. |
| File Processing Agent | Document Handling | Processes files like PDFs, DOCX, XLSX, etc., to extract and interpret content. |
| OCR Agent | Optical Character Recognition | Deployed for recognizing text from images, including historical manuscripts with specialized models. |
| Speech Recognition Agent | Audio Transcription | Converts speech to text, suitable for audio sources such as interviews or oral histories. |
| Translator Agent | Language Conversion | Handles multilingual document translation, ensuring historical language accuracy. |
| Video Agent | Video Analysis | Extracts and processes frames, enabling multimodal interrogation of video content. |
The Manager Agent orchestrates tasks in a CodeAct loop, ensuring stepwise verification, citation integrity, and structured outputs. For more detailed information on the architecture and agent functionalities, consult Section 5 of our paper.
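For a concrete feel of this pattern, here is a minimal sketch of a manager agent delegating to one managed sub-agent, assuming smolagents' `CodeAgent`/`ToolCallingAgent` API; the agent names and wiring are illustrative, not the exact configuration used by run_hist.py:

```python
from smolagents import CodeAgent, OpenAIServerModel, ToolCallingAgent, WebSearchTool

model = OpenAIServerModel(model_id="gpt-4o")  # uses OPENAI_API_KEY from the environment

# One managed sub-agent standing in for the Text WebBrowser Agent.
web_agent = ToolCallingAgent(
    tools=[WebSearchTool()],
    model=model,
    name="text_webbrowser_agent",
    description="Searches the web and extracts text from pages.",
)

# The manager runs a CodeAct loop and delegates subtasks to sub-agents by name.
manager = CodeAgent(
    tools=[],
    model=model,
    managed_agents=[web_agent],  # OCR, literature, and file agents would join here
)

print(manager.run("In which year was the Edict of Milan issued, and by whom?"))
```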
The Literature Search Agent is critical in leveraging academic research capabilities within HistAgent:
Fig. 5: Literature Search Agent architecture.
- Smart Retrieval: Leverages academic databases efficiently, prioritizing peer-reviewed and reputable sources.
- Full-Text Analysis: Engages in detailed content parsing, extracting precise quotes and sections needed for historical context.
- Citation Aware: Ensures integrity with metadata extraction and citation-ready outputs.
- Integrated API Use: Combines Springer API and browser enhancements for comprehensive sourcing.
This agent significantly enhances the integrity and depth of historical research, ensuring results are bibliographically verifiable.
```bash
git clone https://github.com/CharlesQ9/HistAgent.git
cd HistAgent
```

Use Conda for efficient Python environment management:

```bash
conda create -n HistAgent python=3.12
conda activate HistAgent
```

Install the necessary packages via requirements.txt and smolagents:

```bash
pip install -r requirements.txt
pip install 'smolagents[dev]'
```

HistAgent requires a modified browser_use library for enhanced capabilities. Ensure replacement of any existing installations with our provided version:
1. Identify your `site-packages` directory:

   ```bash
   python -c "import site; print(site.getsitepackages()[0])"
   ```

2. Install the modified version of `browser_use`: manually replace the installed copy in that directory with the version provided here (a sketch follows).
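A minimal sketch of the replacement step in Python, assuming the modified `browser_use` copy sits at the repository root; adjust `source` if it lives elsewhere:

```python
import shutil
import site
from pathlib import Path

# Locate the active environment's site-packages directory.
site_packages = Path(site.getsitepackages()[0])
target = site_packages / "browser_use"

# Assumed location of the modified copy shipped with this repository.
source = Path("browser_use")

if target.exists():
    shutil.rmtree(target)        # remove the stock installation
shutil.copytree(source, target)  # drop in the modified version
print(f"Replaced {target} with {source.resolve()}")
```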
Make sure git-lfs is installed and then clone the dataset from Hugging Face:
```bash
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/jiahaoq/HistBench

# If you want to clone without large files - just their pointers
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/jiahaoq/HistBench
```

Several API keys are needed for HistAgent's operation. Configure these either via a .env file or set them as environment variables in your system.
Place a .env file at the project root with your keys:
```
OPENAI_API_KEY=your_openai_api_key
SERPAPI_API_KEY=your_serpapi_api_key
IMGBB_API_KEY=your_imgbb_api_key
SPRINGER_API_KEY=your_springer_api_key
LLAMA_API_KEY=your_llama_api_key
TRANSKRIBUS_USERNAME=your_transkribus_username
TRANSKRIBUS_PASSWORD=your_transkribus_password
HF_TOKEN=your_huggingface_token
OPENROUTER_API_KEY=your_openrouter_api_key
OPENROUTER_API_BASE=https://openrouter.ai/api/v1
TEXTIN_APP_ID=your_textin_app_id
TEXTIN_SECRET_CODE=your_textin_secret_code
```
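If you use the `.env` route, here is a minimal sketch for verifying that the keys load, assuming the `python-dotenv` package (an assumption; confirm it is in `requirements.txt`):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file at the project root

# Spot-check a few of the required keys.
for key in ("OPENAI_API_KEY", "SERPAPI_API_KEY", "SPRINGER_API_KEY"):
    print(key, "->", "set" if os.getenv(key) else "MISSING")
```

Alternatively, set the environment variables directly: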
```bash
export OPENAI_API_KEY="your_openai_api_key"
export SERPAPI_API_KEY="your_serpapi_api_key"
export IMGBB_API_KEY="your_imgbb_api_key"
export SPRINGER_API_KEY="your_springer_api_key"
export LLAMA_API_KEY="your_llama_api_key"
export TRANSKRIBUS_USERNAME="your_transkribus_username"
export TRANSKRIBUS_PASSWORD="your_transkribus_password"
```

For quick tests, provide API keys as command-line arguments:
python run_hist.py --run-name "test_run_cli_keys" --api-key "your_openai_key" --springer-api-key "your_springer_key" --llama-api-key "your_llama_key"Use run_hist.py to leverage HistAgent on HistBench.
Run HistAgent for Level 2 questions with various capabilities enabled:
python run_hist.py --run-name "trial_level2" --use-image-agent --use-file-agent --use-literature-agent --use-browser --level level2| Flag | Description | Default Value |
|---|---|---|
--run-name |
Required. Unique identifier for this run. | None |
--model-id |
Specifies the language model for use (e.g., gpt-4o). |
gpt-4o |
--concurrency |
Number of parallel tasks. | 8 |
--level |
HistBench difficulty level: level1, level2, level3, or all. |
level2 |
--question-ids |
Specific question IDs to process (comma-separated). | None |
--start-id, --end-id |
Range of question IDs to process. | None, None |
--results-json-path |
Path to a previous run's .jsonl output for filtering or re-evaluation. |
None |
--baseline |
Use a simpler baseline agent architecture. | False |
--use-image-agent |
Enables the Image Information Agent. | False |
--use-file-agent |
Enables the File Processing Agent. | False |
--use-literature-agent |
Enables the Literature Search Agent. | False |
--use-browser |
Enables browser-based tools (modified browser_use library required). |
False |
--api-key |
OpenAI API key (overrides other sources). | None |
--springer-api-key |
Springer API key (overrides other sources). | None |
--llama-api-key |
LlamaParse API key (overrides other sources). | None |
1. Full Level 2 Execution:

   ```bash
   python run_hist.py --run-name "histagent_full_level2" --use-image-agent --use-file-agent --use-literature-agent --use-browser --level level2
   ```

2. Baseline Comparison for Level 1:

   ```bash
   python run_hist.py --run-name "baseline_comparison_level1" --baseline --level level1
   ```

3. Specific Level 3 Questions:

   ```bash
   python run_hist.py --run-name "selected_level3" --use-image-agent --use-file-agent --use-literature-agent --use-browser --level level3 --question-ids "5,8,20"
   ```

4. Custom Dataset Processing:

   ```bash
   python run_hlejson.py --run-name "custom_hle_run" --use-image-agent --use-file-agent --use-literature-agent --use-browser --path-to-hle-json "path/to/custom_dataset.json"
   ```

Note: Ensure run_hlejson.py aligns with run_hist.py options if based on the same architecture.
Each run's results are saved in directories organized by parameters and run name:

- Standard runs: `output/{LEVEL}_summary/{RUN_NAME}/`
- Baseline runs: `output_baseline/{LEVEL}_summary/{RUN_NAME}/`
Each directory contains:
- JSONL: Detailed results, agent activity, and tool interactions (a loading sketch follows this list).
- Excel: Summary of answers and key metrics.
- TXT: Human-readable answer summaries.
- Stats JSON: Aggregate statistics of runs (accuracy, tool use).
- Logs Directory:
  - Main Log: Run-wide logging.
  - Task-Specific Logs: Per-question/task logging.
  - Error Log: Details of runtime errors encountered.
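As referenced above, here is a minimal sketch for loading a run's JSONL output; the run directory is an example, and the `is_correct` field only appears once results have been evaluated (see the evaluation section below):

```python
import json
from pathlib import Path

# Example run directory; substitute your own {LEVEL}_summary/{RUN_NAME}.
run_dir = Path("output/level2_summary/trial_level2")

entries = []
for jsonl_path in run_dir.glob("*.jsonl"):
    with jsonl_path.open(encoding="utf-8") as f:
        entries.extend(json.loads(line) for line in f if line.strip())

print(f"Loaded {len(entries)} task records")

# 'is_correct' is assumed present only in evaluated results (judgement.py output).
judged = [e for e in entries if "is_correct" in e]
if judged:
    acc = sum(e["is_correct"] for e in judged) / len(judged)
    print(f"Accuracy over {len(judged)} judged tasks: {acc:.2%}")
```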
We provide an evaluation script, judgement.py, to assess the accuracy of HistAgent's outputs against ground-truth answers. This script builds upon and refines the evaluation logic from the HLE benchmark (Humanity's Last Exam).
After generating result .jsonl files using run_hist.py, you can evaluate them as follows:

```bash
python judgement.py --input_file path/to/your_results.jsonl --output_file path/to/evaluated_results.jsonl
```

For batch processing of multiple files, use:

```bash
python judgement.py --input_dir path/to/your_results_dir --output_dir path/to/evaluated_results_dir
```

The script uses your OpenAI API key for LLM-based semantic judging and allows customization of the evaluation model (e.g., gpt-4o).
- HLE-Inspired Framework: The evaluation criteria inherit the HLE benchmark's semantic-consistency logic, checking whether the core meaning and necessary key points of the response align with the ground truth, while allowing for small numerical errors and variations in phrasing.
- Final Answer Extraction: The script extracts a `final_answer` from each model's response. If no exact answer is present, it records `None`.
- Semantic Judgement: An LLM judges whether the extracted final answer matches the ground truth. Any inconsistency, ambiguity, or missing key point marks the answer as incorrect (a sketch of this step follows the list).
- Accuracy Calculation: The script computes overall accuracy as the proportion of correct answers. It also logs detailed information for each task, including extracted answers, reasoning, and confidence scores.
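As a sketch of the semantic-judgement step referenced above, assuming the official OpenAI Python client; the prompt wording and parsing are illustrative, not the exact logic in judgement.py:

```python
from openai import OpenAI  # official OpenAI Python client (assumed installed)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_answer(question: str, model_answer: str, true_answer: str) -> bool:
    """Ask an LLM whether the model answer semantically matches the ground truth."""
    prompt = (
        f"Question: {question}\n"
        f"Ground-truth answer: {true_answer}\n"
        f"Model answer: {model_answer}\n"
        "Does the model answer convey the same meaning as the ground truth, "
        "allowing small numerical errors and variations in phrasing? "
        "Reply with exactly 'correct' or 'incorrect'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")
```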
Example output for each entry:
```json
{
  "task_id": "001",
  "question": "Who painted the Mona Lisa?",
  "model_answer": "The Mona Lisa was painted by Leonardo da Vinci.",
  "true_answer": "Leonardo da Vinci",
  "is_correct": true,
  "extracted_final_answer": "Leonardo da Vinci",
  "reasoning": "The model answer directly matches the true answer without any discrepancies.",
  "confidence": 100
}
```

For detailed evaluation logic, see judgement.py.
Combine and analyze results from different experiments using combine_results.py.
```bash
python combine_results.py output/level2_summary/run1/*.jsonl output/level2_summary/run2/*.jsonl --output-dir combined_analysis --output-name comparison_run
```

| Flag | Description | Default Value |
|---|---|---|
| `input_files` | Required. List of `.jsonl` result files or glob patterns to combine. | None |
| `--output-dir` | Directory for saving combined analysis outputs. | `output/combined` |
| `--output-name` | Custom name for the output files. | `YYYYMMDD_HHMMSS_combined` |
| `--conflict-strategy` | Conflict resolution when duplicate questions are detected: `first`, `latest`, `model`. | `latest` |
| `--preferred-model` | With `--conflict-strategy model`, the preferred model. | None |
| `--formats` | Desired output formats: `jsonl`, `excel`, `txt`, `all`. | `all` |
| `--add-readme` | Include a README that outlines the combined results and settings. | False |
| `--level` | Filter combinations by task level: `level1`, `level2`, `level3`, `all`. | `all` |
| `--copy-images` | Shuffle and copy associated images to the output directory. | False |
| `--images-dir` | Original directory containing dataset images; required if `--copy-images` is active. | `dataset/` |
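A minimal sketch of the `latest` conflict strategy described above; the `task_id` field follows the example output in the evaluation section and is an assumption about the actual file schema:

```python
import glob
import json


def combine_results(patterns, strategy="latest"):
    """Merge .jsonl result files, resolving duplicate task_ids.

    'latest' lets entries from later files overwrite earlier ones; 'first'
    keeps the first occurrence. The 'model' strategy of combine_results.py
    is omitted here for brevity.
    """
    merged = {}
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if not line.strip():
                        continue
                    entry = json.loads(line)
                    key = entry.get("task_id")
                    if strategy == "first" and key in merged:
                        continue
                    merged[key] = entry  # 'latest': overwrite on duplicates
    return list(merged.values())


combined = combine_results(["output/level2_summary/*/*.jsonl"])
print(f"{len(combined)} unique tasks after merging")
```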
If you use HistAgent or HistBench, please kindly cite:
```bibtex
@article{qiu2025path,
  title={On Path to Multimodal Historical Reasoning: HistBench and HistAgent},
  author={Qiu, Jiahao and Xiao, Fulian and Wang, Yimin and Mao, Yuchen and Chen, Yijia and Juan, Xinzhe and Wang, Siran and Qi, Xuan and Zhang, Tongcheng and Yao, Zixin and others},
  journal={arXiv preprint arXiv:2505.20246},
  year={2025}
}
```

Contributions to HistAgent are invaluable! We're open to enhancements, fixes, or new features. Get involved by:
- Opening an issue to discuss your ideas.
- Forking this repository and submitting a pull request.
Refer to our CONTRIBUTING.md for detailed guidelines.
HistAgent is distributed under the Apache-2.0 license. See the LICENSE file for complete information.
A special thanks to the smolagents developers for their foundational work, and additional gratitude to all who helped develop and refine HistBench.