This repo contains the code and data of *VoiceBench: Benchmarking LLM-Based Voice Assistants*.
- **2025.04.20** Released `wildvoice`, a crowd-sourced dataset comprising human-recorded speech with diverse accents.
- **2025.04.12** Released `bbh`, a crowd-sourced dataset comprising human-recorded speech, for evaluating the reasoning ability of voice assistants.
- **2024.12.11** Updated the VoiceBench Leaderboard to include `mmsu`.
- **2024.12.10** Added a curated list of awesome voice assistants.
- **2024.11.24** Expanded the test samples in VoiceBench to include `mmsu`, covering 12 diverse domains from `mmlu-pro`.
- **2024.11.12** Updated the VoiceBench Leaderboard to include: 1) Mini-Omni2, GPT-4o-Audio, and Whisper-v3+GPT-4o, and 2) multiple-choice QA from OpenBookQA.
- **2024.10.30** Expanded the test samples in VoiceBench to include: 1) the complete set of open-ended QA from `alpacaeval`, and 2) multiple-choice QA from `openbookqa`.
| Rank | Model | AlpacaEval | CommonEval | WildVoice | SD-QA | MMSU | OBQA | BBH | IFEval | AdvBench | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Whisper-v3-large+GPT-4o | 4.80 | 4.47 | 4.62 | 75.77 | 81.69 | 92.97 | 87.20 | 76.51 | 98.27 | 87.80 |
| 2 | GPT-4o-Audio | 4.78 | 4.49 | 4.58 | 75.50 | 80.25 | 89.23 | 84.10 | 76.02 | 98.65 | 86.75 |
| 3 | GPT-4o-mini-Audio | 4.75 | 4.24 | 4.40 | 67.36 | 72.90 | 84.84 | 81.50 | 72.90 | 98.27 | 82.84 |
| 4 | Ultravox-v0.6-LLaMA-3.3-70B | 4.69 | 4.26 | 4.38 | 82.60 | 69.20 | 86.40 | 78.80 | 61.50 | 91.20 | 81.81 |
| 5 | Parakeet-TDT-0.6b-V2+Qwen3-8B | 4.68 | 4.46 | 4.35 | 47.47 | 59.10 | 80.00 | 77.90 | 78.99 | 99.81 | 79.23 |
| 6 | Whisper-v3-large+LLaMA-3.1-8B | 4.53 | 4.04 | 4.16 | 70.43 | 62.43 | 72.53 | 69.70 | 69.53 | 98.08 | 77.48 |
| 7 | Kimi-Audio | 4.46 | 3.97 | 4.20 | 63.12 | 62.17 | 83.52 | 69.70 | 61.10 | 100.00 | 76.91 |
| 8 | Whisper-v3-turbo+LLaMA-3.1-8B | 4.55 | 4.02 | 4.12 | 58.23 | 62.04 | 72.09 | 69.10 | 71.12 | 98.46 | 76.09 |
| 9 | Ultravox-v0.5-LLaMA-3.1-8B | 4.59 | 4.11 | 4.28 | 58.68 | 54.16 | 68.35 | 67.80 | 66.51 | 98.65 | 74.86 |
| 10 | Ultravox-v0.4.1-LLaMA-3.1-8B | 4.55 | 3.90 | 4.12 | 53.35 | 47.17 | 65.27 | 66.30 | 66.88 | 98.46 | 72.09 |
| 11 | Baichuan-Omni-1.5 | 4.50 | 4.05 | 4.06 | 43.40 | 57.25 | 74.51 | 62.70 | 54.54 | 97.31 | 71.32 |
| 12 | MiniCPM-o | 4.42 | 4.15 | 3.94 | 50.72 | 54.78 | 78.02 | 60.40 | 49.25 | 97.69 | 71.23 |
| 13 | Whisper-v3-turbo+LLaMA-3.2-3B | 4.45 | 3.82 | 4.04 | 49.28 | 51.37 | 60.66 | 63.90 | 69.71 | 98.08 | 71.02 |
| 14 | Baichuan-Audio | 4.41 | 4.08 | 3.92 | 45.84 | 53.19 | 71.65 | 54.80 | 50.31 | 99.42 | 69.27 |
| 15 | MERaLiON | 4.50 | 3.77 | 4.12 | 55.06 | 34.95 | 27.23 | 62.60 | 62.93 | 94.81 | 65.04 |
| 16 | VITA-1.5 | 4.21 | 3.66 | 3.48 | 38.88 | 52.15 | 71.65 | 55.30 | 38.14 | 97.69 | 64.53 |
| 17 | Phi-4-multimodal | 3.81 | 3.82 | 3.56 | 39.78 | 42.19 | 65.93 | 61.80 | 45.35 | 100.00 | 64.32 |
| 18 | Ola | 4.12 | 2.97 | 3.19 | 33.82 | 45.97 | 67.91 | 51.10 | 39.57 | 90.77 | 59.42 |
| 19 | Lyra-Base | 3.85 | 3.50 | 3.42 | 38.25 | 49.74 | 72.75 | 59.00 | 36.28 | 59.62 | 59.00 |
| 20 | Ultravox-v0.5-LLaMA-3.2-1B | 4.04 | 3.57 | 3.47 | 34.72 | 30.03 | 35.60 | 52.70 | 45.56 | 96.92 | 57.46 |
| 21 | DiVA | 3.67 | 3.54 | 3.74 | 57.05 | 25.76 | 25.49 | 51.80 | 39.15 | 98.27 | 57.39 |
| 22 | GLM-4-Voice | 3.97 | 3.42 | 3.18 | 36.98 | 39.75 | 53.41 | 52.80 | 25.92 | 88.08 | 56.48 |
| 23 | Qwen2-Audio | 3.74 | 3.43 | 3.01 | 35.71 | 35.72 | 49.45 | 54.70 | 26.33 | 96.73 | 55.80 |
| 24 | Freeze-Omni | 4.03 | 3.46 | 3.15 | 53.45 | 28.14 | 30.98 | 50.70 | 23.40 | 97.30 | 55.20 |
| 25 | Step-Audio | 4.13 | 3.09 | 2.93 | 44.21 | 28.33 | 33.85 | 50.60 | 27.96 | 69.62 | 50.84 |
| 26 | Megrez-3B-Omni | 3.50 | 2.95 | 2.34 | 25.95 | 27.03 | 28.35 | 50.30 | 25.71 | 87.69 | 46.76 |
| 27 | Ichigo | 3.79 | 3.17 | 2.83 | 36.53 | 25.63 | 26.59 | 46.50 | 21.59 | 57.50 | 45.57 |
| 28 | Lyra-Mini | 2.99 | 2.69 | 2.58 | 19.89 | 31.42 | 41.54 | 48.40 | 20.91 | 80.00 | 45.26 |
| 29 | Mair-hub-0.5B-Omni | 3.06 | 2.87 | 2.48 | 21.70 | 25.60 | 25.27 | 50.90 | 14.85 | 94.81 | 44.59 |
| 30 | LLaMA-Omni | 3.70 | 3.46 | 2.92 | 39.69 | 25.93 | 27.47 | 49.20 | 14.87 | 11.35 | 41.12 |
| 31 | VITA-1.0 | 3.38 | 2.15 | 1.87 | 27.94 | 25.70 | 29.01 | 47.70 | 22.82 | 26.73 | 36.43 |
| 32 | SLAM-Omni | 1.90 | 1.79 | 1.60 | 4.16 | 26.06 | 25.27 | 48.80 | 13.38 | 94.23 | 35.30 |
| 33 | Mini-Omni2 | 2.32 | 2.18 | 1.79 | 9.31 | 24.27 | 26.59 | 46.40 | 11.56 | 57.50 | 33.49 |
| 34 | Mini-Omni | 1.95 | 2.02 | 1.61 | 13.92 | 24.69 | 26.59 | 46.30 | 13.58 | 37.12 | 30.42 |
| 35 | Moshi | 2.01 | 1.60 | 1.30 | 15.64 | 24.04 | 25.93 | 47.40 | 10.12 | 44.23 | 29.51 |
We encourage you to submit new voice assistant results directly through the issue tracker. The ranking list will be updated accordingly.
```bash
conda create -n voicebench python=3.10
conda activate voicebench
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install xformers==0.0.23 --no-deps
pip install -r requirements.txt
```
The data used in this project is available at the VoiceBench Dataset hosted on Hugging Face.
You can access it directly via the link and integrate it into your project by using the Hugging Face datasets library.
To load the dataset in your Python environment:
```python
from datasets import load_dataset

# Load the VoiceBench dataset
# Available subsets: alpacaeval, commoneval, sd-qa, ifeval, advbench, ...
dataset = load_dataset("hlt-lab/voicebench", 'alpacaeval')
```
| Subset | # Samples | Audio Source | Task Type |
|---|---|---|---|
| alpacaeval | 199 | Google TTS | Open-Ended QA |
| alpacaeval_full | 636 | Google TTS | Open-Ended QA |
| commoneval | 200 | Human | Open-Ended QA |
| wildvoice | 1,000 | Human | Open-Ended QA |
| openbookqa | 455 | Google TTS | Multiple-Choice QA |
| mmsu | 3,074 | Google TTS | Multiple-Choice QA |
| sd-qa | 553 | Human | Reference-Based QA |
| mtbench | 46 | Google TTS | Multi-Turn QA |
| ifeval | 345 | Google TTS | Instruction Following |
| bbh | 1,000 | Human | Reasoning |
| advbench | 520 | Google TTS | Safety |
PS: `alpacaeval` contains the `helpful_base` and `vicuna` data, while `alpacaeval_full` is constructed from the complete data. `alpacaeval_full` is used in the leaderboard.
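As a quick sanity check after loading a subset, the sketch below prints a single example. The field names used here (`prompt`, `audio`) are assumptions based on typical Hugging Face audio datasets, not guaranteed by this repo; check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Minimal sketch: load one subset and peek at a single example.
# The field names below ("prompt", "audio") are assumptions; consult the
# dataset card for the actual schema.
dataset = load_dataset("hlt-lab/voicebench", "commoneval", split="test")

sample = dataset[0]
print(sample.keys())                 # see which fields the subset provides
print(sample.get("prompt"))          # transcript of the spoken instruction (assumed field)

audio = sample.get("audio")          # decoded audio: {"array": ..., "sampling_rate": ...}
if audio is not None:
    print(audio["sampling_rate"], len(audio["array"]))
```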
To obtain the responses from the voice assistant model, run the following command:
```bash
python main.py --model naive --data alpacaeval --split test --modality audio
```
Supported Arguments:
- `--model`: Specifies the model to use for generating responses. Replace `naive` with the model you want to test (e.g., `qwen2`, `diva`).
- `--data`: Selects the subset of the dataset. Replace `alpacaeval` with other subsets like `commoneval`, `sd-qa`, etc., depending on your evaluation needs.
- `--split`: Chooses the data split to evaluate.
  - For most datasets (`alpacaeval`, `commoneval`, `ifeval`, `advbench`), use `test` as the value.
  - For the `sd-qa` subset, provide a region code instead of `test`, such as `aus` for Australia or `usa` for the United States.
- `--modality`: Use `audio` for spoken instructions, `text` for text-based instructions.
This will generate the output and save it to a file named naive-alpacaeval-test-audio.jsonl.
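If you want to inspect the raw responses before judging, the sketch below reads that generated JSONL file. The record keys (`prompt`, `response`) are assumptions and may differ from what `main.py` actually writes, so confirm them against one line of the file.

```python
import json

# Minimal sketch: inspect the raw responses written by main.py before judging.
# The record keys ("prompt", "response") are assumptions; open one line of the
# file to confirm the actual schema.
with open("naive-alpacaeval-test-audio.jsonl") as f:
    records = [json.loads(line) for line in f]

print(len(records), "responses generated")
first = records[0]
print(sorted(first.keys()))
print(first.get("prompt"))
print(first.get("response"))
```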
For datasets alpacaeval, commoneval, wildvoice, and sd-qa, we use gpt-4o-mini to evaluate the responses. Run the following command to get the GPT score:
```bash
python api_judge.py --src_file naive-alpacaeval-test-audio.jsonl
```
The GPT evaluation scores will be saved to `result-naive-alpacaeval-test-audio.jsonl`.
Note: This step should be skipped for the other datasets, as they are not scored by the GPT judge.
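Before moving on to the final evaluation, it can be useful to verify that every response actually received a judge score. The sketch below assumes `api_judge.py` stores the judgement under a `score` key, which may not match the real field name.

```python
import json

# Minimal sanity check before running evaluate.py: confirm that every record
# received a judge score. The "score" key is an assumption about what
# api_judge.py writes; adjust it if the actual field name differs.
with open("result-naive-alpacaeval-test-audio.jsonl") as f:
    results = [json.loads(line) for line in f]

missing = [i for i, r in enumerate(results) if not r.get("score")]
print(f"{len(results)} judged responses, {len(missing)} without a score")
```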
To generate the final evaluation results, run:
```bash
python evaluate.py --src_file result-naive-alpacaeval-test-audio.jsonl --evaluator open
```
Supported Arguments:
- `--evaluator`: Specifies the evaluator type:
  - Use `open` for `alpacaeval`, `commoneval`, and `wildvoice`.
  - Use `qa` for `sd-qa`.
  - Use `ifeval` for `ifeval`.
  - Use `harm` for `advbench`.
  - Use `mcq` for `openbookqa` and `mmsu`.
  - Use `bbh` for `bbh`.
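To run this final step over several subsets without remembering the mapping above, a small driver like the sketch below can help. The file-naming pattern and the `usa` region code for `sd-qa` are assumptions used for illustration: GPT-judged subsets take the `result-*.jsonl` file produced by `api_judge.py`, the others take the raw output of `main.py`.

```python
import subprocess

# Convenience sketch: map each subset to its evaluator type (from the list
# above) and run evaluate.py for each one. The file-naming pattern is an
# assumption; adjust it to the files you actually produced.
EVALUATORS = {
    "alpacaeval": "open",
    "commoneval": "open",
    "wildvoice": "open",
    "sd-qa": "qa",
    "ifeval": "ifeval",
    "advbench": "harm",
    "openbookqa": "mcq",
    "mmsu": "mcq",
    "bbh": "bbh",
}

model, modality = "naive", "audio"
for subset, evaluator in EVALUATORS.items():
    split = "usa" if subset == "sd-qa" else "test"             # sd-qa uses a region code
    prefix = "result-" if evaluator in ("open", "qa") else ""  # assumed naming scheme
    src_file = f"{prefix}{model}-{subset}-{split}-{modality}.jsonl"
    subprocess.run(
        ["python", "evaluate.py", "--src_file", src_file, "--evaluator", evaluator],
        check=True,
    )
```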
If you use VoiceBench in your research, please cite the following paper:
```bibtex
@article{chen2024voicebench,
  title={VoiceBench: Benchmarking LLM-Based Voice Assistants},
  author={Chen, Yiming and Yue, Xianghu and Zhang, Chen and Gao, Xiaoxue and Tan, Robby T. and Li, Haizhou},
  journal={arXiv preprint arXiv:2410.17196},
  year={2024}
}
```