This repository is the official codebase of our paper "RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs" [paper] [slide]. The proposed RouterEval is a comprehensive benchmark for evaluating router performance in the Routing LLMs paradigm, featuring 12 LLM evaluations, 8,500+ LLMs, and 200,000,000+ data records.
2025-10 - We released our raw data (including original answers) in [Hugging Face]. 👈🎉Please try it!
2025-03 - We released our full dataset in [Baidu Drive] [Google Drive] [Hugging Face]. 👈🎉Please try it!
2025-03 - We released a curated list of awesome works in the Routing LLMs [Link]. 👈🎉Please check it out!
Create a Python virtual environment and install all the packages listed in `requirements.txt`.
conda create -n RouterEval python=3.10
conda activate RouterEval
pip install -r requirements.txt

Data Download: [Baidu Drive] [Google Drive] [Hugging Face]
The data in the cloud drive is organized as follows. For basic use, you only need to download `router_dataset`.
data/
├── leaderboard_score/ # 200M score records across 8500 LLMs and 12 datasets
├── leaderboard_prompt/ # Full prompts for all test cases
├── leaderboard_embed/ # Pre-computed embeddings (4 types)
└── router_dataset/ # ready-to-use router evaluation data (12 datasets)
Recommendation➡️ For direct use of our pre-built router datasets:
- Create a `data` folder and download `router_dataset` into it.
- For basic use, there is NO NEED to download `leaderboard_score`, `leaderboard_prompt`, and `leaderboard_embed`.
# Create a 'data' directory in the root of this repository
mkdir data
cd data
# Download the dataset file (router_dataset.zip) to data/
# Download using the wget command or manually download from the link above
ids="1BurZNXnHkva2umQxKbvhgccuKQ35p_Ki"
url="https://drive.google.com/uc?id=$ids&export=download"
wget --no-check-certificate "$url" -O router_dataset.zip
unzip router_dataset.zip

Run `quick_start.ipynb` to view information about the router dataset, build a simple router, train and test the router on the dataset, and check the performance metrics.
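As a minimal illustration of what a router does, here is a sketch of a nearest-neighbor router in pure NumPy. The toy data layout (prompt embeddings plus a per-LLM score matrix) is an assumption for illustration only; see `quick_start.ipynb` for the actual `router_dataset` format.

```python
import numpy as np

# Hypothetical toy data: 6 training prompts, 2 test prompts,
# 4-dim embeddings, 3 candidate LLMs. The real router_dataset
# layout may differ -- this only illustrates the routing idea.
rng = np.random.default_rng(0)
train_embed = rng.normal(size=(6, 4))
train_score = rng.random(size=(6, 3))   # score of each LLM on each training prompt
test_embed = rng.normal(size=(2, 4))

# "Training": remember which candidate LLM scored best on each training prompt.
best_llm = train_score.argmax(axis=1)

# Routing: send each test prompt to the best LLM of its nearest
# training prompt (1-NN in embedding space).
dist = ((test_embed[:, None, :] - train_embed[None, :, :]) ** 2).sum(-1)
routed = best_llm[dist.argmin(axis=1)]
print(routed)  # one chosen LLM index per test prompt
```

In practice the kNN router shipped in `router/PRKnn-knn/` plays this role with the real embeddings and scores.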
| Difficulty Level | Candidate Pool Size | Candidate Groups |
|---|---|---|
| Easy | [3, 5] | all strong / all weak / strong to weak |
| Hard | [10, 100, 1000] | all strong / all weak / strong to weak |
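The table above defines difficulty by the candidate pool size and by how the candidates are grouped by strength. The sketch below illustrates one plausible way such groups could be drawn from a score matrix; the matrix shape and the selection rule are assumptions for illustration, not the repository's actual construction code.

```python
import numpy as np

# Toy score matrix: 100 candidate LLMs x 50 test cases.
# Real pools in RouterEval are drawn from 8,500+ LLMs; this only
# illustrates the "all strong" / "all weak" / "strong to weak" idea.
rng = np.random.default_rng(1)
scores = rng.random(size=(100, 50))
mean_perf = scores.mean(axis=1)
order = mean_perf.argsort()              # indices sorted weak -> strong

m = 5  # candidate pool size (the easy setting uses 3 or 5)
all_weak = order[:m]                     # m lowest-scoring candidates
all_strong = order[-m:]                  # m highest-scoring candidates
strong_to_weak = order[:: len(order) // m][:m]  # spread across the whole range

print(len(all_strong), len(all_weak), len(strong_to_weak))
```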
router/
├── C-RoBERTa-cluster/ # C-RoBERTa router
├── MLPR_LinearR/ # mlp & linear router
├── PRKnn-knn/ # kNN router
├── R_o/ # Oracle & r_o & random router
└── RoBERTa-MLC/ # MLC router
In `test_router.py`, change `baseline = 'knn'` to one of `['knn', 'oracle', 'random', 'r_o_0.5', 'linear', 'mlp', 'roberta_cluster', 'roberta_MLC']`, then run
python test_router.py
If you want to design a router and test its performance on the router datasets, you can follow the steps below.
1. Create a new folder under `router/`.
2. Implement your method in the required format:

       # train your router
       ......
       # test your router
       ......
       # compute metrics (must print these three metrics at the end)
       ......
       print(mu, vb, ep)

3. Add a command to run your router in `test_router.py`.
4. Run `test_router.py` to test your custom router.
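The steps above can be sketched as a minimal script skeleton. The metric names `mu`, `vb`, `ep` follow the required `print` line; the random routing choice and the metric formulas below are placeholders for illustration, not the repository's actual definitions, and the toy score matrix stands in for loading `data/router_dataset`.

```python
import numpy as np

# Placeholder data; replace with loading from data/router_dataset.
rng = np.random.default_rng(42)
n_test, n_llm = 20, 5
test_score = rng.random(size=(n_test, n_llm))  # score of each LLM on each case

# train your router
# (a real router would fit on training embeddings/scores here)

# test your router
choice = rng.integers(0, n_llm, size=n_test)   # placeholder: random routing
routed_score = test_score[np.arange(n_test), choice]

# compute metrics (must print these three metrics at the end)
mu = routed_score.mean()             # placeholder: mean routed score
vb = test_score.max(axis=1).mean()   # placeholder: oracle ("best") reference
ep = mu / vb                         # placeholder: routed-to-oracle ratio
print(mu, vb, ep)
```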
Advanced Usage (optional) ➡️ For custom embeddings, you can:
- Download `leaderboard_prompt` and process it with your own embedding model.
- Download `leaderboard_embed` and use the existing pre-computed embeddings (four embedding models: longformer, RoBERTa, RoBERTa_last, and sentence_bert).
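If you process `leaderboard_prompt` with your own model, the output just needs to be one vector per prompt. The sketch below uses a trivial hashed bag-of-words stand-in so it runs without downloading any model; in practice you would swap in a real encoder (e.g. sentence_bert), and the example prompts are invented placeholders.

```python
import numpy as np

def embed(texts, dim=64):
    """Toy hashed bag-of-words embedding (stand-in for a real model)."""
    out = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            out[i, hash(tok) % dim] += 1.0
    # L2-normalize so downstream kNN distances are comparable
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)

# Invented example prompts standing in for leaderboard_prompt entries
prompts = ["What is 2+2?", "Translate hello to French."]
vecs = embed(prompts)
print(vecs.shape)  # (2, 64)
```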
Advanced Usage (optional) ➡️ To reproduce the construction process of the Router Dataset, you can:
1. Download `leaderboard_score`, `leaderboard_prompt`, and `leaderboard_embed`.
2. Place the three folders in the `data/` directory.
3. Run `get_router_dataset.py` to build the router datasets:
python get_router_dataset.py