InfoMosaic is a comprehensive framework for advanced information retrieval, multi-step reasoning, and performance evaluation of large language models (LLMs). The project accompanies the paper "InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents" and uses the InfoMosaic_Bench dataset for evaluation.
The framework enables:
- Multi-source information retrieval and integration
- Rigorous evaluation of LLMs' reasoning capabilities
- Flexible tool usage for enhanced information acquisition
- Parallel processing for efficient benchmarking
Comparison of 14 LLM agents equipped with a web search tool on InfoMosaic-Bench, evaluated across six domains and the overall average. Metrics include Accuracy (Acc) and Pass Rate. The best overall Accuracy and Pass Rate are highlighted in bold.
InfoMosaic/
├── data/                 # Data preparation and management
├── eval/                 # Evaluation utilities
├── inference/            # Inference components
├── tool_backends/        # Backend services for tools
│   ├── MCP/              # Model Context Protocol (MCP) tools
│   ├── api_proxy/        # API proxies
│   ├── configs/          # Configuration files
│   └── test/             # Test scripts for tools
├── infer_answer.py       # Main inference script
├── ensemble_answer.py    # Script to avoid answer leakage
├── gen_result.py         # Result generation
├── utils.py              # Utility functions
└── README.md             # Project documentation
- Python 3.8+ or Docker
- API keys for external services (Serper, Google Maps, YouTube, SerpAPI)
# Clone the repository
git clone git@github.com:DorothyDUUU/Info-Mosaic.git
cd InfoMosaic

# Install dependencies
pip install .

For detailed instructions on API key configuration, please refer to the API Key and Configuration Management Guide.
That guide provides comprehensive information about configuring API keys for all external services used in InfoMosaic, including web search, maps, and other tools.
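As a quick illustration (not a substitute for the guide), the sketch below assumes the backends read keys from environment variables; only SERPER_API_KEY, OPENAI_API_KEY, and OPENAI_API_BASE_URL appear elsewhere in this README, and the other variable names are placeholders.

```python
# Illustration only: reading API keys from environment variables.
# SERPER_API_KEY / OPENAI_API_* appear later in this README; the remaining
# services (Google Maps, YouTube, SerpAPI) are assumed to follow the same pattern.
import os

SERPER_API_KEY = os.environ["SERPER_API_KEY"]    # web search (Serper)
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]    # LLM calls
OPENAI_API_BASE_URL = os.environ.get("OPENAI_API_BASE_URL", "https://api.openai.com/v1")
```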
This is by far the simplest automated tool deployment setup: InfoMosaic Tool Backend Services launches the MCP servers on top of a Python sandbox and provides a one-click deployment solution.
To enable the full functionality of the tools, please refer to the detailed deployment guide:
That guide provides complete deployment steps, including Docker deployment, quick deployment scripts, service management, and testing tools. All services can be configured and started with a single command.
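Once the backends are running, a minimal smoke test might look like the sketch below; the host, port, and path are hypothetical placeholders, so substitute the endpoint from the deployment guide and tool_backends/configs/.

```python
# Hypothetical smoke test for a locally deployed tool backend service.
# The URL below is a placeholder; use the endpoint from your deployment config.
import requests

resp = requests.get("http://localhost:8000/health", timeout=5)
print(resp.status_code, resp.text)
```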
First, prepare the dataset by combining the HuggingFace benchmark data with ground truth answers:
python data/prepare_data.py

This script will:
- Download the InfoMosaic_Bench dataset from HuggingFace
- Load the ground truth answers from data/info_mosaic_gt_answer.jsonl
- Combine the datasets and save to data/info_mosaic_w_gt.jsonl
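For reference, a minimal sketch of these steps is shown below; the join key ("id"), the field names, and the dataset split are assumptions for illustration, not the exact logic of data/prepare_data.py.

```python
# Rough sketch of the data-preparation steps above (field names assumed).
import json
from datasets import load_dataset  # pip install datasets

bench = load_dataset("Dorothydu/InfoMosaic_Bench", split="train")  # split assumed

# Ground-truth answers shipped with this repo, keyed by example id (assumed).
gt = {}
with open("data/info_mosaic_gt_answer.jsonl") as f:
    for line in f:
        row = json.loads(line)
        gt[row["id"]] = row["answer"]

# Merge and write the combined file used by inference and evaluation.
with open("data/info_mosaic_w_gt.jsonl", "w") as out:
    for ex in bench:
        record = dict(ex)
        record["gt_answer"] = gt.get(record["id"])
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
```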
Run the inference script to evaluate a model on the benchmark:
export OPENAI_API_BASE_URL="https://api.openai.com/v1"
export OPENAI_API_KEY="sk-..."
export SERPER_API_KEY="your_serper_api_key"
sh inference/run_infer.sh

Key arguments:
- --model_name: Type of agent to use: agent_wo_tool (no tools), agent_w_web_tool (web tool only), or agent_w_multi_tool (multiple tools)
- --parallel_size: Number of parallel threads for processing
- --llm_name: Name of the LLM to use; default is "gpt-5-mini"
- --domain: Domain to evaluate; default is "all"; options are 'all', 'map', 'bio', 'financial', 'web', 'video', 'multidomain'
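As a usage sketch, the run can also be launched programmatically with explicit arguments; the flag names come from the list above, but whether run_infer.sh forwards these flags or expects them to be edited inside the script is an assumption here.

```python
# Hypothetical invocation of the inference run with explicit arguments.
# Flag forwarding by run_infer.sh is assumed, not confirmed.
import subprocess

subprocess.run(
    [
        "sh", "inference/run_infer.sh",
        "--model_name", "agent_w_web_tool",  # agent_wo_tool | agent_w_web_tool | agent_w_multi_tool
        "--llm_name", "gpt-5-mini",
        "--parallel_size", "8",
        "--domain", "web",                   # all | map | bio | financial | web | video | multidomain
    ],
    check=True,
)
```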
Evaluate the model's performance using the pass rate evaluation script:
export OPENAI_API_BASE_URL="https://api.openai.com/v1"
export OPENAI_API_KEY="sk-..."
sh eval/run_eval.sh

Key arguments:
- --model_name: Type of agent to use: agent_wo_tool (no tools), agent_w_web_tool (web tool only), or agent_w_multi_tool (multiple tools)
- --domain: Domain to evaluate; default is "all"; options are 'all', 'map', 'bio', 'financial', 'web', 'video', 'multidomain'
This script will:
- Load the model's generated answers
- Use a judge LLM to evaluate the correctness of answers
- Calculate pass rates for sub-questions and final answers
- Generate detailed evaluation metrics
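To make the judging step concrete, here is a minimal LLM-as-judge sketch for final-answer pass rate; the result file path, record schema, and judge prompt are assumptions, not the exact logic of eval/run_eval.sh.

```python
# Minimal LLM-as-judge sketch (result path, record fields, and prompt are assumed).
import json
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ["OPENAI_API_BASE_URL"],
                api_key=os.environ["OPENAI_API_KEY"])

def judged_correct(question: str, prediction: str, ground_truth: str) -> bool:
    """Ask the judge LLM whether the predicted answer matches the ground truth."""
    prompt = (
        f"Question: {question}\n"
        f"Predicted answer: {prediction}\n"
        f"Ground-truth answer: {ground_truth}\n"
        "Reply with exactly 'yes' or 'no': is the predicted answer correct?"
    )
    reply = client.chat.completions.create(
        model="gpt-5-mini",  # judge model; swap in your preferred judge
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")

# Pass rate over final answers (hypothetical result file and schema).
with open("results/agent_w_web_tool_all.jsonl") as f:
    records = [json.loads(line) for line in f]
passed = sum(judged_correct(r["question"], r["prediction"], r["gt_answer"]) for r in records)
print(f"Final-answer pass rate: {passed / len(records):.3f}")
```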
Looking forward to the release of InfoMosaic Flow!
If you use this framework or dataset in your research, please cite our paper:
@article{du2025infomosaic,
title={InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents},
author={Du, Yaxin and Zhang, Yuanshuo and Yang, Xiyuan and Zhou, Yifan and Wang, Cheng and Zou, Gongyi and Pang, Xianghe and Wang, Wenhao and Chen, Menglan and Tang, Shuo and others},
journal={arXiv preprint arXiv:2510.02271},
year={2025}
}
And the dataset:
@dataset{InfoMosaic_Bench,
title = {InfoMosaic_Bench},
author = {Dorothydu},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/Dorothydu/InfoMosaic_Bench}
}
Contributions to improve InfoMosaic are welcome! Please refer to the project's GitHub repository for contribution guidelines.
We would like to express our gratitude to the following open-source projects that have inspired and supported the development of InfoMosaic:
- Browse-Master
- mcp_sandbox
- amap-mcp-server
- biomcp
- fmp-mcp-server
- googlemap-mcp
- serper-mcp-server
- mcp-youtube
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.