[Project Page] [HuggingFace] [Preprint] [Code] [Raw Data]
Shuo Sun1,3, Yimin Zhao1, Christina Dao Wen Lee1, Jiawei Sun1, Chengran Yuan1,
Zefan Huang1,3, Dongen Li1,3, Justin KW Yeoh1, Alok Prakash3,
Thomas W. Malone2,3, Marcelo H. Ang Jr.1,3
1National University of Singapore 2Massachusetts Institute of Technology
3Singapore MIT Alliance for Research and Technology
You can go to our Project Page for a more detailed rating distribution analysis.
Visualized test case samples and their associated ratings for all six datasets are available in our HuggingFace Collection.
As the field progresses toward Artificial General Intelligence (AGI), there is a pressing need for more comprehensive and insightful evaluation frameworks that go beyond aggregate performance metrics. This paper introduces a unified rating system that jointly models the difficulty of individual test cases and the competency of AI models (or humans) across vision, language, and action domains. Unlike existing metrics that focus solely on models, our approach allows for fine-grained, difficulty-aware evaluations through competitive interactions between models and tasks, capturing both the long-tail distribution of real-world challenges and the competency gap between current models and full task mastery. We validate the generalizability and robustness of our system through extensive experiments on multiple established datasets and models across distinct AGI domains. The resulting rating distributions offer novel perspectives and interpretable insights into task difficulty, model progression, and the outstanding challenges that remain on the path to achieving full AGI task mastery.
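To make the idea of a "competitive interaction" between a model and a test case concrete, here is a minimal sketch of a classic Elo update in which the model and the test case are treated as two players. This is an illustration only: the function names, the K-factor of 32, and the scale of 400 are assumptions borrowed from the standard chess formulation, and the rating update actually used in the paper may differ.

```python
def expected_score(r_model: float, r_case: float, scale: float = 400.0) -> float:
    """Probability that the model 'beats' the test case under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_case - r_model) / scale))


def elo_update(r_model: float, r_case: float, model_won: bool, k: float = 32.0):
    """Return updated (model_rating, case_rating) after one match.

    The update is zero-sum: whatever the model gains, the test case loses.
    k and the 400-point scale are illustrative defaults, not values from the paper.
    """
    outcome = 1.0 if model_won else 0.0
    delta = k * (outcome - expected_score(r_model, r_case))
    return r_model + delta, r_case - delta


# A model rated 1500 solves a test case rated 1500.
model_rating, case_rating = elo_update(1500.0, 1500.0, model_won=True)
print(model_rating, case_rating)
```

With equal starting ratings the expected score is 0.5, so a win moves the model up by 16 points and the test case down by 16. A test case that many strong models fail would accumulate a high rating, which is how difficulty and competency end up on one shared scale.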
- Clone this repository
git clone https://github.com/SS47816/AGI-Elo.git
cd AGI-Elo
- Install all dependencies
# Auto install conda env AGI_Elo
direnv allow
make install
conda activate AGI_Elo
# Auto install all pip dependencies from requirements.txt
make pip-install

Each .pkl file should contain the prediction results of one model evaluated across all test cases.
You can download our precomputed prediction files from: Google Drive: Raw Data.
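Before wiring a downloaded file into the pipeline, it can help to check what it contains. The sketch below loads one .pkl file and reports the type and length of the top-level object. The helper name `summarize_pickle` is hypothetical, and the internal structure of the prediction files is not documented here, so the code deliberately inspects only the outermost object.

```python
import pickle
from pathlib import Path


def summarize_pickle(path):
    """Load a pickle file and return (type name, length or None) of its top-level object.

    Only unpickle files from a source you trust: pickle can execute code on load.
    Loading may also require the libraries the objects were saved with.
    """
    with Path(path).open("rb") as f:
        obj = pickle.load(f)
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size
```

Calling `summarize_pickle` on one of the downloaded files (the path is yours to fill in) returns a pair such as a container type and its item count, which is usually enough to tell whether the download completed and the file is readable.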
After downloading, organize the ./data folder with the following structure:
   ./data
   ├── imagenet_class_index.json
   │
   ├── classification/
   │   ├── ImageNet/
   │       ├── val/
   │           ├── predictions/
   │           ├── ...
   │
   ├── detection/
   │   ├── COCO/
   │       ├── val/
   │           ├── predictions/
   │           ├── ...
   │
   ├── question_answering/
   │   ├── MMLU/
   │       ├── test/
   │           ├── predictions/
   │           ├── ...
   │
   ├── coding/
   │   ├── LiveCodeBench/
   │       ├── test/
   │           ├── predictions/
   │           ├── ...
   │
   ├── motion_prediction/
   │   ├── Waymo/
   │       ├── val/
   │           ├── predictions/
   │           ├── ...
   │
   ├── motion_planning/
   │   ├── NAVSIM/
   │       ├── val/
   │           ├── predictions/
   │           ├── ...
   │
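A quick way to confirm that the folder matches the tree above is to check each expected path programmatically. The following is a minimal sketch: the `EXPECTED_LAYOUT` mapping simply transcribes the tree shown here, and the helper name `missing_data_paths` is hypothetical rather than part of this repository.

```python
from pathlib import Path

# task folder -> (dataset folder, split folder), transcribed from the tree above
EXPECTED_LAYOUT = {
    "classification": ("ImageNet", "val"),
    "detection": ("COCO", "val"),
    "question_answering": ("MMLU", "test"),
    "coding": ("LiveCodeBench", "test"),
    "motion_prediction": ("Waymo", "val"),
    "motion_planning": ("NAVSIM", "val"),
}


def missing_data_paths(data_root="./data"):
    """Return the expected files and folders that are absent under data_root."""
    root = Path(data_root)
    missing = []
    index_file = root / "imagenet_class_index.json"
    if not index_file.is_file():
        missing.append(index_file)
    for task, (dataset, split) in EXPECTED_LAYOUT.items():
        predictions = root / task / dataset / split / "predictions"
        if not predictions.is_dir():
            missing.append(predictions)
    return missing


if __name__ == "__main__":
    for path in missing_data_paths():
        print(f"missing: {path}")
```

An empty result means every `predictions/` folder and the class index file are in place. The check does not validate the contents of the folders.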
To run rating estimation across all tasks and datasets, use:
python3 AGI_Elo/scripts/run_all_experiments.py

Optionally, you can run a single task on its own (e.g., classification):
python3 AGI_Elo/pipeline/classification.py

The results will be saved to their respective ratings/ folders.
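To locate the outputs after a run, one option is to search for every folder named `ratings`. This sketch assumes the `ratings/` folders are created somewhere under `./data`, which is a guess based on the layout above, not something this README states. The helper name `find_rating_dirs` is hypothetical.

```python
from pathlib import Path


def find_rating_dirs(data_root="./data"):
    """Return all directories named 'ratings' under data_root, sorted by path.

    Assumption: results are written below data_root. Adjust the root if the
    pipeline writes elsewhere.
    """
    root = Path(data_root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("ratings") if p.is_dir())


if __name__ == "__main__":
    for ratings_dir in find_rating_dirs():
        n_files = sum(1 for p in ratings_dir.iterdir() if p.is_file())
        print(f"{ratings_dir}: {n_files} file(s)")
```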
If you find our work interesting, please consider citing our paper:
@misc{sun2025agielofarmasteringtask,
  title={AGI-Elo: How Far Are We From Mastering A Task?}, 
  author={Shuo Sun and Yimin Zhao and Christina Dao Wen Lee and Jiawei Sun and Chengran Yuan and Zefan Huang and Dongen Li and Justin KW Yeoh and Alok Prakash and Thomas W. Malone and Marcelo H. Ang Jr},
  year={2025},
  eprint={2505.12844},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2505.12844}, 
}
This repository is licensed under the Apache License 2.0.
This project is based on Nesta's data science project template.