We investigate the tool learning ability of 41 prevalent LLMs by reproducing 33 benchmarks and enabling one-click evaluation for seven of them, forming the Tool Learning Platform (ToLeaP). Our motivation is to deepen insight into current work and thereby inform future directions in the tool learning domain. ToLeaP integrates seven of the 33 benchmarks and, given an LLM as input, reports all 64 evaluation metrics defined by those benchmarks.
conda create -n toleap python=3.10 -y && conda activate toleap
git clone https://github.com/Hytn/ToLeaP.git && cd ToLeaP
pip install -e .
cd scripts
pip install vllm==0.6.5
pip install rouge_score # taskbench
pip install mmengine # teval
pip install nltk accelerate # injecagent
bash ../src/benchmark/bfcl/bfcl_setup.sh
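Optionally, you can sanity-check that vLLM was installed correctly; this quick import check is only a suggestion, not part of the official setup:
python -c "import vllm; print(vllm.__version__)" # should print 0.6.5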
Next, create the per-benchmark data directories from the repository root:
cd data
mkdir rotbench sealtools taskbench injecagent glaive stabletoolbench apibank
cd ..
Then prepare the data for each benchmark, running each block from the repository root. For RoTBench:
cd src/benchmark/rotbench
bash rotbench.sh
For Seal-Tools:
cd src/benchmark/sealtools
bash sealtools.sh
For TaskBench:
cd src/benchmark/taskbench
python taskbench.py
For Glaive:
cd src/benchmark/glaive
python glaive2sharegpt.py
For T-Eval:
cd data
unzip teval.zip
rm teval.zip
For InjecAgent:
cd src/benchmark/injecagent
bash injecagent.sh
For StableToolBench, download the data from this link and place the files in the data/stabletoolbench folder.
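For example, assuming the download arrives as a single archive named stabletoolbench.zip (the file name here is only a placeholder), it could be unpacked from the repository root as follows:
unzip stabletoolbench.zip -d data/stabletoolbench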
After downloading the data, the directory structure should look like this:
├── /data/
│   ├── /glaive/
│   │   ├──
│   │   └── ...
│   ├── /injecagent/
│   │   ├── attacker_cases_dh.jsonl
│   │   └── ...
├── /scripts/
│   ├── /gorilla/
│   ├── bfcl_standard.py
│   ├── ...
├── /src/
│   ├── /benchmark/
│   │   └── ...
│   ├── /cfg/
│   │   └── ...
│   ├── /utils/
│   │   └── ...
Before running any evaluation, create the results directories from the repository root:
mkdir results
cd results
mkdir rotbench sealtools taskbench teval injecagent glaive stabletoolbench
cd ..
If you want to perform one-click evaluation, run:
cd scripts
# usage: bash one-click-evaluation.sh <model_path> <is_api> <gpu_num> <batch_size> <input> <output> <display_name>
bash one-click-evaluation.sh meta-llama/Llama-3.1-8B-Instruct false 1 256 4096 512 llama3.1
If you prefer to evaluate each benchmark separately, follow the instructions below.
For RoTBench:
cd scripts
python rotbench_eval.py --model meta-llama/Llama-3.1-8B-Instruct
To evaluate API models, add --is_api True.
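For example (the model name below is only illustrative; the same flag applies to the other *_eval.py scripts described later):
python rotbench_eval.py --model gpt-3.5-turbo --is_api True # assumes your API key is already exported, e.g. OPENAI_API_KEY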
For Seal-Tools:
cd scripts
python sealtools_eval.py --model meta-llama/Llama-3.1-8B-Instruct
To evaluate API models, add --is_api True.
For TaskBench:
cd scripts
python taskbench_eval.py --model meta-llama/Llama-3.1-8B-Instruct
To evaluate API models, add --is_api True.
WARNING: the official BFCL codebase changes frequently; if the following instructions do not work, please refer to the latest official repository.
Before using BFCL for evaluation, some preparation steps are required:
- Ensure that the model you want to evaluate is included in the handler mapping file: scripts/gorilla/berkeley-function-call-leaderboard/bfcl/constants/model_config.py.
- If you want to evaluate API models, set the API key:
  export OPENAI_API_KEY="your-api-key"
  To use an unofficial base URL, modify the following code in scripts/gorilla/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py:
  self.client = OpenAI(
      api_key=os.getenv("OPENAI_API_KEY"),
      base_url=os.getenv("OPENAI_API_BASE")
  )
  Then export both variables:
  export OPENAI_API_KEY="your-api-key"
  export OPENAI_API_BASE="your-base-url"
- To add the --max-model-len or --tensor-parallel-size parameters, modify the code around line 130 in scripts/gorilla/berkeley-function-call-leaderboard/bfcl/model_handler/local_inference/base_oss_handler.py.
- To run the evaluation in parallel, change the VLLM_PORT in scripts/gorilla/berkeley-function-call-leaderboard/bfcl/constants/eval_config.py.
- If you want to use a locally trained model, ensure the model path does not contain underscores (_). Otherwise, to avoid conflicts, manually add the following code after model_name_escaped = model_name.replace("_", "/"):
  - in the generate_leaderboard_csv function in scripts/gorilla/berkeley-function-call-leaderboard/bfcl/eval_checker/eval_runner_helper.py, and
  - in the runner function in scripts/gorilla/berkeley-function-call-leaderboard/bfcl/eval_checker/eval_runner.py:
  if model_name == "sft_model_merged_lora_checkpoint-20000":
      model_name_escaped = "/sft_model/merged_lora/checkpoint-20000"
- To ensure the evaluation results are properly recorded, add the model path to scripts/gorilla/berkeley-function-call-leaderboard/bfcl/constants/model_metadata.py. Example:
  MODEL_METADATA_MAPPING = {
      "/path/to/sft_model/merged_lora/checkpoint-60000": [
          "",
          "",
          "",
          "",
      ],
      ...
  }
Finally, run
bfcl generate \
--model meta-llama/Llama-3.1-8B-Instruct \
--test-category parallel,multiple,simple,parallel_multiple,java,javascript,irrelevance,live,multi_turn \
--num-threads 1
bfcl evaluate --model meta-llama/Llama-3.1-8B-Instruct
For Glaive:
cd scripts
python glaive_eval.py --model meta-llama/Llama-3.1-8B-Instruct
To evaluate API models, add --is_api True.
For T-Eval:
cd scripts
bash teval_eval.sh meta-llama/Llama-3.1-8B-Instruct Llama-3.1-8B-Instruct False 4
Then run
python standard_teval.py ../results/teval/Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct_-1_.json
to obtain the clean results.
To evaluate API models, run:
bash teval_eval.sh gpt-3.5-turbo gpt-3.5-turbo True
For InjecAgent:
cd scripts
export OPENAI_API_KEY="your-openai-api-key"
python injecagent_eval.py \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--use_cache
For API-Bank:
cd scripts
python apibank_eval.py --model_name meta-llama/Llama-3.1-8B-Instruct
To evaluate API models, add --is_api True.
For the SkyThought benchmarks, first set up the environment:
pip install skythought
According to the original author's recommendation, you must use datasets==2.21.0. Otherwise, some benchmarks will not run correctly.
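For example, the pinned version can be installed explicitly after installing skythought:
pip install datasets==2.21.0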
Then run:
bash one-click-sky.sh
to evaluate all tasks. You can specify models and tasks within the script.