โณ TIME

Paper Code TIME Dataset TIME-Lite TIME-Lite TIME-Lite

[NeurIPS'25 Spotlight] TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Peking University · Huawei Noah's Ark Lab

🎉🎉 Congratulations! This paper has been accepted as a Spotlight 🌟🔥 at the NeurIPS 2025 Datasets & Benchmarks (D&B) track.

🌟 If you found this work helpful, please consider giving us a ⭐ on GitHub!

📋 Project Information

Authors: Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang
Affiliations: Peking University; Huawei Noah's Ark Lab
Contact: [email protected]

📖 Abstract

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing work largely neglects three real-world challenges for temporal reasoning:

  • Intensive temporal information
  • Fast-changing event dynamics
  • Complex temporal dependencies in social interactions

To bridge this gap, we propose TIME, a multi-level benchmark designed for temporal reasoning in real-world scenarios.

TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial.

We conduct extensive experiments on both reasoning and non-reasoning models, perform an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset, to foster future research and standardized evaluation in temporal reasoning.

TIME Dataset Overview

🚀 Get Started

📥 Step 1: Install Dependencies

# Install git-lfs
pip install git-lfs
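# Note: git-lfs can also be installed through a system package manager
# (e.g., apt-get install git-lfs or brew install git-lfs), followed by: git lfs install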

📊 Step 2: Download Dataset

We provide two datasets. Choose according to your needs:

โš ๏ธ Option 1: Complete TIME Dataset (Large dataset - may be too large for quick evaluation)

# Navigate to the working directory and download the benchmark dataset TIME
chmod +x scripts/download_data_time.sh

# Download the data
./scripts/download_data_time.sh

✅ Option 2: TIME-Lite Dataset (Recommended - High-quality subset)

# Navigate to the working directory and download the benchmark dataset TIME-Lite
chmod +x scripts/download_data_time_lite.sh

# Download the data
./scripts/download_data_time_lite.sh
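
If you prefer to pull TIME-Lite directly from the Hugging Face Hub rather than through the download script, a minimal Python sketch using the datasets library is shown below (the dataset repository ID and split name are assumptions; check the TIME-Lite page on Hugging Face for the exact identifiers):

# Minimal sketch: load TIME-Lite with the Hugging Face `datasets` library.
# NOTE: the repo ID and split name below are assumptions, not confirmed paths.
from datasets import load_dataset

dataset = load_dataset("sylvain-wei/TIME-Lite", split="test")  # hypothetical repo ID
print(dataset)     # features and number of rows
print(dataset[0])  # inspect a single QA pair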

🔧 Step 3: Install Evaluation Dependencies

pip install -r evaluation/requirements.txt

โ–ถ๏ธ Step 4: Run Evaluation

Option A: Evaluate TIME dataset

./scripts/eval_time.sh

Option B: Evaluate TIME-Lite dataset (Recommended)

./scripts/eval_timelite.sh
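
The evaluation scripts above wrap the full pipeline. As a rough illustration of what scoring the benchmark's QA pairs involves, a minimal exact-match scorer might look like the sketch below; it is not the repository's evaluation code, and the file layout and field names (answer, prediction) are assumptions:

# Minimal sketch of an exact-match scorer over model predictions (not the repo's script).
# Assumes a JSONL file with one {"answer": ..., "prediction": ...} object per line.
import json

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different strings still match.
    return " ".join(text.lower().split())

def exact_match_accuracy(predictions_path: str) -> float:
    correct, total = 0, 0
    with open(predictions_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            correct += normalize(record["prediction"]) == normalize(record["answer"])
            total += 1
    return correct / total if total else 0.0

print(f"Exact match: {exact_match_accuracy('predictions.jsonl'):.2%}")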

🧠 Construction Pipeline

TIME Construction Pipeline

📊 Data Quantity

📈 Dataset Statistics:

  • TIME: 38,522 QA pairs (Complete benchmark)
  • TIME-Lite: 943 QA pairs (High-quality subset)

Here is a detailed breakdown of the dataset statistics:

Dataset           All Tasks   Ext.   Loc.   Comp.   D.C.   O.C.   E.R.   O.R.   R.R.   C.T.   T.L.   C.F.
TIME                  38522   1480   3546    3376   3401   3549   3537   3538   3537   3513   5508   3537
TIME-Wiki             13848   1261   1299    1126   1151   1299   1287   1288   1287   1263   1300   1287
TIME-News             19958      0   1800    1800   1800   1800   1800   1800   1800   1800   3758   1800
TIME-Dial              4716    219    447     450    450    450    450    450    450    450    450    450
TIME-Lite               943     60     90      78     86     90     90     90     90     90     89     90
TIME-Lite-Wiki          322     30     30      24     28     30     30     30     30     30     30     30
TIME-Lite-News          299      0     30      30     30     30     30     30     30     30     29     30
TIME-Lite-Dial          322     30     30      24     28     30     30     30     30     30     30     30

Task abbreviations: Ext. (Extract), Loc. (Localization), Comp. (Computation), D.C. (Duration Compare), O.C. (Order Compare); E.R. (Explicit Reasoning), O.R. (Order Reasoning), R.R. (Relative Reasoning); C.T. (Co-temporality), T.L. (Timeline), C.F. (Counterfactual).
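
To reproduce a per-task breakdown like the table above from a loaded split, you can count QA pairs by their task label. The sketch below assumes each example carries a task field naming its sub-task, which may differ from the actual schema:

# Sketch: count QA pairs per sub-task in a loaded split (repo ID and field name are assumed).
from collections import Counter
from datasets import load_dataset

dataset = load_dataset("sylvain-wei/TIME-Lite", split="test")  # hypothetical repo ID
task_counts = Counter(example["task"] for example in dataset)
for task, count in sorted(task_counts.items()):
    print(f"{task}: {count}")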

๐Ÿ’ช๐Ÿป Evaluation Results

๐Ÿ“Š TIME-Lite Results Radar Charts

Here are the detailed evaluation results for the TIME-Lite dataset on different sub-datasets:

๐Ÿ—„๏ธ TIME-Lite-Wiki

TIME-Lite-Wiki Results

📰 TIME-Lite-News

TIME-Lite-News Results

💬 TIME-Lite-Dial

TIME-Lite-Dial Results

💬 Citation

If you find our work interesting and useful, please consider starring this repo, upvoting our Hugging Face repo TIME, and citing our paper as follows:

@article{wei2025time,
  title={TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios},
  author={Wei, Shaohang and Li, Wei and Song, Feifan and Luo, Wen and Zhuang, Tianyi and Tan, Haochen and Guo, Zhijiang and Wang, Houfeng},
  journal={arXiv preprint arXiv:2505.12891},
  year={2025}
}
