Thanks to visit codestin.com
Credit goes to github.com

Skip to content

CharlesQ9/HistAgent

On the Path to Multimodal Historical Reasoning: HistBench and HistAgent

Official Implementation for paper: On Path to Multimodal Historical Reasoning: HistBench and HistAgent

Paper License: Apache-2.0 Python 3.12+

Welcome to the official open-source repository for HistAgent, an innovative multi-agent system crafted for historical question answering. This cutting-edge tool employs advanced multimodal approaches, leveraging components such as OCR, image analysis, and file parsing to tackle the intricate challenges of historical inquiries effectively.

Reproduce our results using the resources provided in this repository, as detailed in our paper:

"On the Path to Multimodal Historical Reasoning: HistBench and HistAgent"
(Jiahao Qiu, Fulian Xiao, Yimin Wang, Yuchen Mao, Yijia Chen, Xinzhe Juan, Siran Wang, Xuan Qi, Tongcheng Zhang, Zixin Yao, Jiacheng Guo, Yifu Lu, Charles Argon, Jundi Cui, Daixin Chen, Junran Zhou, Shuyao Zhou, Zhanpeng Zhou, Ling Yang, Shilong Liu, Hongru Wang, Kaixuan Huang, Xun Jiang, Yuming Cao, Yue Chen, Yunfei Chen, Zhengyi Chen, Ruowei Dai, Mengqiu Deng, Jiye Fu, Yunting Gu, Zijie Guan, Zirui Huang, Xiaoyan Ji, Yumeng Jiang, Delong Kong, Haolong Li, Jiaqi Li, Ruipeng Li, Tianze Li, Zhuoran Li, Haixia Lian, Mengyue Lin, Xudong Liu, Jiayi Lu, Jinghan Lu, Wanyu Luo, Ziyue Luo, Zihao Pu, Zhi Qiao, Ruihuan Ren, Liang Wan, Ruixiang Wang, Tianhui Wang, Yang Wang, Zeyu Wang, Zihua Wang, Yujia Wu, Zhaoyi Wu, Hao Xin, Weiao Xing, Ruojun Xiong, Weijie Xu, Yao Shu, Xiao Yao, Xiaorui Yang, Yuchen Yang, Nan Yi, Jiadong Yu, Yangyuxuan Yu, Huiting Zeng, Danni Zhang, Yunjie Zhang, Zhaoyu Zhang, Zhiheng Zhang, Xiaofeng Zheng, Peirong Zhou, Linyan Zhong, Xiaoyin Zong, Ying Zhao, Zhenxin Chen, Lin Ding, Xiaoyu Gao, Bingbing Gong, Yichao Li, Yang Liao, Guang Ma, Tianyuan Ma, Xinrui Sun, Tianyi Wang, Han Xia, Ruobing Xian, Gen Ye, Tengfei Yu, Wentao Zhang, Yuxi Wang, Xi Gao, Mengdi Wang)

HistAgent is rooted in the smolagents framework with adaptations for the challenges of historical reasoning. This system is rigorously evaluated on HistBench, a benchmark featuring 414 questions and covering 29 languages, extensively described in our paper.

๐Ÿ“„ Paper and Dataset


๐ŸŒŸ Key Features

  • Specialized Agents
    Individual agents manage OCR, image analysis, document parsing (PDF, Word, Excel), translation, speech transcription, web browsing, and academic literature search.
  • Multimodal and Multilingual Reasoning
    Supports diverse source materials: manuscripts, images, audio, video, inscriptions, and texts in 29 modern and ancient languages.
  • Historical OCR Integration
    Uses Transkribus for Western scripts and specialized models for non-Latin scripts to transcribe manuscripts and inscriptions.
  • Citation-Aware Literature Search
    Prioritizes peer-reviewed sources via Google Scholar, Google Books and Springer API, returning exact quotes with full bibliographic metadata.
  • HistBench Compatibility: Optimized for seamless operation on HistBench, evaluating historical reasoning across various dimensions.
  • Modular and Extensible: Built on smolagents for easier customization and tool extension.
  • Three-Stage Human-in-the-Loop Review Pipeline
    HistBench questions undergo preliminary screening, LLM-based filtering, and expert validation to ensure academic rigor and fairness in evaluation.
  • Structured Historical Evaluation Framework
    HistBench consists of 414 questions across 36 subfields, stratified into three difficulty levels and annotated with metadata (source, topic, reasoning dimension, and explanation).
  • Competitive and Generalizable Performance
    HistAgent achieves 27.54% pass@1 and 36.47% pass@2 accuracy on HistBench, outperforming GPT-4o (18.60% pass@1) and ODR-smolagents. It also maintains strong performance (60.00% pass@1) on the general-purpose GAIA benchmark.

๐Ÿ“œ Abstract

Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions designed to evaluate AI's capacity for historical reasoning and authored by more than 40 expert contributors. The tasks span a wide range of historical problems-from factual retrieval based on primary sources to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Finding the poor performance of LLMs and other agents on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in History. On HistBench, HistAgent based on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1(14.49%) and Open Deep Research-smolagents(20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning.

๐Ÿ“Š Performance Highlights

HistAgent demonstrates superior performance over existing LLMs on HistBench's challenging benchmarks.

Performance of LLMs and Agents on HistBench
Fig. 1: Performance comparison on HistBench.

๐Ÿ’ก HistBench Overview

HistBench is the first dedicated benchmark for historical reasoning in AI:

  • 414 questions, curated by domain experts and students, covering:
    • 6 reasoning dimensions: bibliographic retrieval, source identification, source processing, historical analysis, interdisciplinary integration, cultural contextualization.
    • 3 difficulty levels: Basic (Level 1), Intermediate (Level 2), and Challenging (Level 3).
    • 29 languages: including English, Classical Chinese, Latin, Sanskrit, Old Uyghur, and more.
    • Multimodal sources: manuscripts, inscriptions, images, audio, video.
    • 36+ historical subfields: e.g., epigraphy, climate history, intellectual history, and material culture.
  • Three-stage review: format/semantic check โ†’ LLM-based filtering โ†’ expert validation.

HistBench Difficulty Level Definitions
Fig. 2: HistBench difficulty levels.

Language Distribution in HistBench Geographic Coverage of HistBench
Fig. 3: Language diversity (left) and geographic coverage (right) in HistBench.

For details in HistBench, refer to Section 3 of our paper and Appendix A.

๐Ÿง  HistAgent Architecture & Specialist Agents

HistAgent utilizes a Manager Agent to coordinate specialized agents, each with targeted tools, thus embracing a modular structure for complex historical reasoning tasks.

HistAgent System Architecture
Fig. 4: HistAgent architecture overview.

Agent Focus Core Functions
Manager Agent Orchestrator Directs execution, manages request parsing and validation, handles agent outputs.
Text WebBrowser Agent Web & Text Interaction Conducts web searches, navigates web content, extracts text.
Image Information Agent Image Analysis Executes reverse searches, assesses image parameters, interfaces with OCR for text extraction.
Literature Search Agent Academic Source Retrieval Utilizes scholarly databases for literature retrieval, processes PDFs for detailed content extraction.
File Processing Agent Document Handling Processes files like PDFs, DOCX, XLSX, etc., to extract and interpret content.
OCR Agent Optical Character Recognition Deployed for recognizing text from images, including historical manuscripts with specialized models.
Speech Recognition Agent Audio Transcription Converts speech to text, suitable for audio sources such as interviews or oral histories.
Translator Agent Language Conversion Handles multilingual document translation, ensuring historical language accuracy.
Video Agent Video Analysis Extracts and processes frames, enabling multimodal interrogation of video content.

The Manager Agent orchestrates tasks in a CodeAct loop, ensuring stepwise verification, citation integrity, and structured outputs. For more detailed information on the architecture and agent functionalities, consult Section 5 of our paper.

๐Ÿ” Academic Literature Tools

The Literature Search Agent is critical in leveraging academic research capabilities within HistAgent:

Literature Search Agent Architecture
Fig. 5: Literature Search Agent architecture.

  • Smart Retrieval: Leverages academic databases efficiently, prioritizing peer-reviewed and reputable sources.
  • Full-Text Analysis: Engages in detailed content parsing, extracting precise quotes and sections needed for historical context.
  • Citation Aware: Ensures integrity with metadata extraction and citation-ready outputs.
  • Integrated API Use: Combines Springer API and browser enhancements for comprehensive sourcing.

This agent significantly enhances the integrity and depth of historical research, ensuring results are biblically verifiable.

๐Ÿ› ๏ธ Installation

Step 1: Clone the Repository

git clone https://github.com/CharlesQ9/HistAgent.git
cd HistAgent

Step 2: Create and Activate a Conda Environment

Use Conda for efficient Python environment management.

conda create -n HistAgent python=3.12
conda activate HistAgent

Step 3: Install Dependencies

Install the necessary packages via requirements.txt and smolagents.

pip install -r requirements.txt
pip install 'smolagents[dev]'

Step 4: Configure Modified browser_use Library

HistAgent requires a modified browser_use library for enhanced capabilities. Ensure replacement of any existing installations with our provided version:

  1. Identify your site-packages directory:

    python -c "import site; print(site.getsitepackages()[0])"
  2. Install the modified version of browser_use: Manually replace it after identifying your environment directory.

Step 5: Download HistBench Dataset

Make sure git-lfs is installed and then clone the dataset from Hugging Face:

# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install

git clone https://huggingface.co/datasets/jiahaoq/HistBench

# If you want to clone without large files - just their pointers
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/jiahaoq/HistBench

โš™๏ธ Environment Configuration

Several API keys are needed for HistAgent's operation. Configure these either via a .env file or set them as environment variables in your system.

Use a .env File

Place a .env file at the project root with your keys:

OPENAI_API_KEY=your_openai_api_key
SERPAPI_API_KEY=your_serpapi_api_key
IMGBB_API_KEY=your_imgbb_api_key
SPRINGER_API_KEY=your_springer_api_key
LLAMA_API_KEY=your_llama_api_key
TRANSKRIBUS_USERNAME=your_transkribus_username
TRANSKRIBUS_PASSWORD=your_transkribus_password
HF_TOKEN=your_huggingface_token
OPENROUTER_API_KEY=your_openrouter_api_key
OPENROUTER_API_BASE=https://openrouter.ai/api/v1
TEXTIN_APP_ID=your_textin_app_id
TEXTIN_SECRET_CODE=your_textin_secret_code

Alternatively, Export via Terminal

Set environment variables directly:

export OPENAI_API_KEY="your_openai_api_key"
export SERPAPI_API_KEY="your_serpapi_api_key"
export IMGBB_API_KEY="your_imgbb_api_key"
export SPRINGER_API_KEY="your_springer_api_key"
export LLAMA_API_KEY="your_llama_api_key"
export TRANSKRIBUS_USERNAME="your_transkribus_username"
export TRANSKRIBUS_PASSWORD="your_transkribus_password"

Command Line Overrides

For quick tests, provide API keys as command-line arguments:

python run_hist.py --run-name "test_run_cli_keys" --api-key "your_openai_key" --springer-api-key "your_springer_key" --llama-api-key "your_llama_key"

๐Ÿš€ Running HistAgent (run_hist.py)

Use run_hist.py to leverage HistAgent on HistBench.

Example Command

Run HistAgent for Level 2 questions with various capabilities enabled:

python run_hist.py --run-name "trial_level2" --use-image-agent --use-file-agent --use-literature-agent --use-browser --level level2

Command Line Arguments for run_hist.py

Flag Description Default Value
--run-name Required. Unique identifier for this run. None
--model-id Specifies the language model for use (e.g., gpt-4o). gpt-4o
--concurrency Number of parallel tasks. 8
--level HistBench difficulty level: level1, level2, level3, or all. level2
--question-ids Specific question IDs to process (comma-separated). None
--start-id, --end-id Range of question IDs to process. None, None
--results-json-path Path to a previous run's .jsonl output for filtering or re-evaluation. None
--baseline Use a simpler baseline agent architecture. False
--use-image-agent Enables the Image Information Agent. False
--use-file-agent Enables the File Processing Agent. False
--use-literature-agent Enables the Literature Search Agent. False
--use-browser Enables browser-based tools (modified browser_use library required). False
--api-key OpenAI API key (overrides other sources). None
--springer-api-key Springer API key (overrides other sources). None
--llama-api-key LlamaParse API key (overrides other sources). None

๐Ÿงช Example Commands

1. Full Level 2 Execution:

python run_hist.py --run-name "histagent_full_level2" --use-image-agent --use-file-agent --use-literature-agent --use-browser --level level2

2. Baseline Comparison for Level 1:

python run_hist.py --run-name "baseline_comparison_level1" --baseline --level level1

3. Specific Level 3 Questions:

python run_hist.py --run-name "selected_level3" --use-image-agent --use-file-agent --use-literature-agent --use-browser --level level3 --question-ids "5,8,20"

4. Custom Dataset Processing:

python run_hlejson.py --run-name "custom_hle_run" --use-image-agent --use-file-agent --use-literature-agent --use-browser --path-to-hle-json "path/to/custom_dataset.json"

Note: Ensure run_hlejson.py aligns with run_hist.py options if based on the same architecture.

๐Ÿ“‚ Output Files

Each run's results are saved in organized directories based on parameters and run names:

  • Standard runs: output/{LEVEL}_summary/{RUN_NAME}/
  • Baseline runs: output_baseline/{LEVEL}_summary/{RUN_NAME}/

Each directory contains:

  • JSONL: Detailed results, agent activity, tool interactions.
  • Excel: Summary of answers and key metrics.
  • TXT: Human-readable answer summaries.
  • Stats JSON: Aggregate statistics of runs (accuracy, tool use).
  • Logs Directory:
    • Main Log: Run-wide logging.
    • Task-Specific Logs: Per question/task logging.
    • Error Log: Details of runtime errors encountered.

๐Ÿงฎ Evaluation

We provide an evaluation script, judgement.py, to assess the accuracy of HistAgentโ€™s outputs against ground-truth answers. This script builds upon and refines the evaluation logic from the HLE benchmark (Humanityโ€™s Last Exam).

How to Run the Evaluation

After generating result .jsonl files using run_hist.py, you can evaluate them as follows:

python judgement.py --input_file path/to/your_results.jsonl --output_file path/to/evaluated_results.jsonl

For batch processing multiple files, use:

python judgement.py --input_dir path/to/your_results_dir --output_dir path/to/evaluated_results_dir

The script uses your OpenAI API key for LLM-based semantic judging and allows customization of the evaluation model (e.g., gpt-4o).

Evaluation Logic

  • HLE-Inspired Framework:
    The evaluation criteria inherit from the HLE benchmark's semantic consistency logic. It checks whether the core meaning and necessary key points in the response align with the ground truth, while allowing for small numerical errors and variations in phrasing.
  • Final Answer Extraction:
    The script extracts a final_answer from each model's response. If no exact answer is present, it records None.
  • Semantic Judgement:
    An LLM judges whether the extracted final answer matches the ground truth. If there are any inconsistencies, ambiguities, or missing key points, the answer is marked as incorrect.
  • Accuracy Calculation:
    The script computes overall accuracy as the proportion of correct answers. It also logs detailed information for each task, including extracted answers, reasoning, and confidence scores.

Example output for each entry:

{
  "task_id": "001",
  "question": "Who painted the Mona Lisa?",
  "model_answer": "The Mona Lisa was painted by Leonardo da Vinci.",
  "true_answer": "Leonardo da Vinci",
  "is_correct": true,
  "extracted_final_answer": "Leonardo da Vinci",
  "reasoning": "The model answer directly matches the true answer without any discrepancies.",
  "confidence": 100
}

For detailed evaluation logic, see judgement.py.

๐Ÿ”— Results Combination Tool (combine_results.py)

Combine and analyze results from different experiments using combine_results.py.

Basic Usage

python combine_results.py output/level2_summary/run1/*.jsonl output/level2_summary/run2/*.jsonl --output-dir combined_analysis --output-name comparison_run

Arguments

Flag Description Default Value
input_files Required. List of .jsonl result files or patterns for combination. None
--output-dir Directory for saving combined analysis outputs. output/combined
--output-name Specifies a custom name for output files. YYYYMMDD_HHMMSS_combined
--conflict-strategy Defines the conflict resolution approach if duplicate questions are detected: first, latest, model. latest
--preferred-model If using --conflict-strategy model, specify your preferred model. None
--formats Desired output formats: jsonl, excel, txt, all. all
--add-readme Check this to include a README that outlines combined results and settings. False
--level Filter combinations by task level: level1, level2, level3, all. all
--copy-images Shuffle and copy associated images to the output directory. False
--images-dir Specify the original directory containing dataset images, necessary if --copy-images is active. dataset/

๐Ÿ“œ Citation

If you use HistAgent or HistBench, please kindly cite:

@article{qiu2025path,
  title={On Path to Multimodal Historical Reasoning: HistBench and HistAgent},
  author={Qiu, Jiahao and Xiao, Fulian and Wang, Yimin and Mao, Yuchen and Chen, Yijia and Juan, Xinzhe and Wang, Siran and Qi, Xuan and Zhang, Tongcheng and Yao, Zixin and others},
  journal={arXiv preprint arXiv:2505.20246},
  year={2025}
}

๐Ÿค Contributing

Contributions to HistAgent are invaluable! We're open to enhancements, fixes, or new features. Get involved by:

  1. Opening an issue to discuss your ideas.
  2. Forking this repository and submitting a pull request.

Refer to our CONTRIBUTING.md for detailed guidelines.

๐Ÿ“„ License

HistAgent is distributed under the Apache-2.0. See the LICENSE file for complete info.

๐Ÿ™ Acknowledgements

A special thanks to the smolagents developers for their foundational work, and additional gratitude to all who helped develop and refine HistBench.

About

No description, website, or topics provided.

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •