This repo contains the code and data of *VoiceBench: Benchmarking LLM-Based Voice Assistants*.
- **2025.04.20** Released `wildvoice`, a crowd-sourced dataset comprising human-recorded speech with diverse accents.
- **2025.04.12** Released `bbh`, a crowd-sourced dataset comprising human-recorded speech, for evaluating the reasoning ability of voice assistants.
- **2024.12.11** Updated the VoiceBench Leaderboard to include `mmsu`.
- **2024.12.10** Added a curated list of awesome voice assistants.
- **2024.11.24** Expanded the test samples in VoiceBench to include `mmsu`, covering 12 diverse domains from `mmlu-pro`.
- **2024.11.12** Updated the VoiceBench Leaderboard to include: 1) Mini-Omni2, GPT-4o-Audio, and Whisper-v3+GPT-4o, and 2) multiple-choice QA from OpenBookQA.
- **2024.10.30** Expanded the test samples in VoiceBench to include: 1) the complete set of open-ended QA from `alpacaeval`, and 2) multiple-choice QA from `openbookqa`.
| Rank | Model | AlpacaEval | CommonEval | WildVoice | SD-QA | MMSU | OBQA | BBH | IFEval | AdvBench | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Whisper-v3-large+GPT-4o | 4.80 | 4.47 | 4.62 | 75.77 | 81.69 | 92.97 | 87.20 | 76.51 | 98.27 | 87.80 |
| 2 | GPT-4o-Audio | 4.78 | 4.49 | 4.58 | 75.50 | 80.25 | 89.23 | 84.10 | 76.02 | 98.65 | 86.75 |
| 3 | GPT-4o-mini-Audio | 4.75 | 4.24 | 4.40 | 67.36 | 72.90 | 84.84 | 81.50 | 72.90 | 98.27 | 82.84 |
| 4 | Ultravox-v0.6-LLaMA-3.3-70B | 4.69 | 4.26 | 4.38 | 82.60 | 69.20 | 86.40 | 78.80 | 61.50 | 91.20 | 81.81 |
| 5 | Parakeet-TDT-0.6b-V2+Qwen3-8B | 4.68 | 4.46 | 4.35 | 47.47 | 59.10 | 80.00 | 77.90 | 78.99 | 99.81 | 79.23 |
| 6 | Whisper-v3-large+LLaMA-3.1-8B | 4.53 | 4.04 | 4.16 | 70.43 | 62.43 | 72.53 | 69.70 | 69.53 | 98.08 | 77.48 |
| 7 | Kimi-Audio | 4.46 | 3.97 | 4.20 | 63.12 | 62.17 | 83.52 | 69.70 | 61.10 | 100.00 | 76.91 |
| 8 | Whisper-v3-turbo+LLaMA-3.1-8B | 4.55 | 4.02 | 4.12 | 58.23 | 62.04 | 72.09 | 69.10 | 71.12 | 98.46 | 76.09 |
| 9 | Ultravox-v0.5-LLaMA-3.1-8B | 4.59 | 4.11 | 4.28 | 58.68 | 54.16 | 68.35 | 67.80 | 66.51 | 98.65 | 74.86 |
| 10 | Ultravox-v0.4.1-LLaMA-3.1-8B | 4.55 | 3.90 | 4.12 | 53.35 | 47.17 | 65.27 | 66.30 | 66.88 | 98.46 | 72.09 |
| 11 | Baichuan-Omni-1.5 | 4.50 | 4.05 | 4.06 | 43.40 | 57.25 | 74.51 | 62.70 | 54.54 | 97.31 | 71.32 |
| 12 | MiniCPM-o | 4.42 | 4.15 | 3.94 | 50.72 | 54.78 | 78.02 | 60.40 | 49.25 | 97.69 | 71.23 |
| 13 | Whisper-v3-turbo+LLaMA-3.2-3B | 4.45 | 3.82 | 4.04 | 49.28 | 51.37 | 60.66 | 63.90 | 69.71 | 98.08 | 71.02 |
| 14 | Baichuan-Audio | 4.41 | 4.08 | 3.92 | 45.84 | 53.19 | 71.65 | 54.80 | 50.31 | 99.42 | 69.27 |
| 15 | MERaLiON | 4.50 | 3.77 | 4.12 | 55.06 | 34.95 | 27.23 | 62.60 | 62.93 | 94.81 | 65.04 |
| 16 | VITA-1.5 | 4.21 | 3.66 | 3.48 | 38.88 | 52.15 | 71.65 | 55.30 | 38.14 | 97.69 | 64.53 |
| 17 | Phi-4-multimodal | 3.81 | 3.82 | 3.56 | 39.78 | 42.19 | 65.93 | 61.80 | 45.35 | 100.00 | 64.32 |
| 18 | Ola | 4.12 | 2.97 | 3.19 | 33.82 | 45.97 | 67.91 | 51.10 | 39.57 | 90.77 | 59.42 |
| 19 | Lyra-Base | 3.85 | 3.50 | 3.42 | 38.25 | 49.74 | 72.75 | 59.00 | 36.28 | 59.62 | 59.00 |
| 20 | Ultravox-v0.5-LLaMA-3.2-1B | 4.04 | 3.57 | 3.47 | 34.72 | 30.03 | 35.60 | 52.70 | 45.56 | 96.92 | 57.46 |
| 21 | DiVA | 3.67 | 3.54 | 3.74 | 57.05 | 25.76 | 25.49 | 51.80 | 39.15 | 98.27 | 57.39 |
| 22 | GLM-4-Voice | 3.97 | 3.42 | 3.18 | 36.98 | 39.75 | 53.41 | 52.80 | 25.92 | 88.08 | 56.48 |
| 23 | Qwen2-Audio | 3.74 | 3.43 | 3.01 | 35.71 | 35.72 | 49.45 | 54.70 | 26.33 | 96.73 | 55.80 |
| 24 | Freeze-Omni | 4.03 | 3.46 | 3.15 | 53.45 | 28.14 | 30.98 | 50.70 | 23.40 | 97.30 | 55.20 |
| 25 | Step-Audio | 4.13 | 3.09 | 2.93 | 44.21 | 28.33 | 33.85 | 50.60 | 27.96 | 69.62 | 50.84 |
| 26 | Megrez-3B-Omni | 3.50 | 2.95 | 2.34 | 25.95 | 27.03 | 28.35 | 50.30 | 25.71 | 87.69 | 46.76 |
| 27 | Ichigo | 3.79 | 3.17 | 2.83 | 36.53 | 25.63 | 26.59 | 46.50 | 21.59 | 57.50 | 45.57 |
| 28 | Lyra-Mini | 2.99 | 2.69 | 2.58 | 19.89 | 31.42 | 41.54 | 48.40 | 20.91 | 80.00 | 45.26 |
| 29 | Mair-hub-0.5B-Omni | 3.06 | 2.87 | 2.48 | 21.70 | 25.60 | 25.27 | 50.90 | 14.85 | 94.81 | 44.59 |
| 30 | LLaMA-Omni | 3.70 | 3.46 | 2.92 | 39.69 | 25.93 | 27.47 | 49.20 | 14.87 | 11.35 | 41.12 |
| 31 | VITA-1.0 | 3.38 | 2.15 | 1.87 | 27.94 | 25.70 | 29.01 | 47.70 | 22.82 | 26.73 | 36.43 |
| 32 | SLAM-Omni | 1.90 | 1.79 | 1.60 | 4.16 | 26.06 | 25.27 | 48.80 | 13.38 | 94.23 | 35.30 |
| 33 | Mini-Omni2 | 2.32 | 2.18 | 1.79 | 9.31 | 24.27 | 26.59 | 46.40 | 11.56 | 57.50 | 33.49 |
| 34 | Mini-Omni | 1.95 | 2.02 | 1.61 | 13.92 | 24.69 | 26.59 | 46.30 | 13.58 | 37.12 | 30.42 |
| 35 | Moshi | 2.01 | 1.60 | 1.30 | 15.64 | 24.04 | 25.93 | 47.40 | 10.12 | 44.23 | 29.51 |
We encourage you to submit new voice assistant results directly through the issue tracker. The ranking list will be updated accordingly.
```bash
conda create -n voicebench python=3.10
conda activate voicebench
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install xformers==0.0.23 --no-deps
pip install -r requirements.txt
```
The data used in this project is available at the VoiceBench Dataset hosted on Hugging Face.
You can access it directly via the link and integrate it into your project by using the Hugging Face datasets library.
To load the dataset in your Python environment:
```python
from datasets import load_dataset

# Load the VoiceBench dataset
# Available subsets: alpacaeval, commoneval, sd-qa, ifeval, advbench, ...
dataset = load_dataset("hlt-lab/voicebench", 'alpacaeval')
```
| Subset | # Samples | Audio Source | Task Type |
|---|---|---|---|
| alpacaeval | 199 | Google TTS | Open-Ended QA |
| alpacaeval_full | 636 | Google TTS | Open-Ended QA |
| commoneval | 200 | Human | Open-Ended QA |
| wildvoice | 1,000 | Human | Open-Ended QA |
| openbookqa | 455 | Google TTS | Multiple-Choice QA |
| mmsu | 3,074 | Google TTS | Multiple-Choice QA |
| sd-qa | 553 | Human | Reference-Based QA |
| mtbench | 46 | Google TTS | Multi-Turn QA |
| ifeval | 345 | Google TTS | Instruction Following |
| bbh | 1,000 | Human | Reasoning |
| advbench | 520 | Google TTS | Safety |
PS: `alpacaeval` contains the `helpful_base` and `vicuna` data, while `alpacaeval_full` is constructed from the complete data. `alpacaeval_full` is used in the leaderboard.
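As a quick sanity check after loading a subset, the sketch below prints a single example. The field names used here (`prompt`, `audio`) are assumptions based on typical Hugging Face audio datasets, not guaranteed by this repo; check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Minimal sketch: load one subset and peek at a single example.
# The field names below ("prompt", "audio") are assumptions; consult the
# dataset card for the actual schema.
dataset = load_dataset("hlt-lab/voicebench", "commoneval", split="test")

sample = dataset[0]
print(sample.keys())                 # see which fields the subset provides
print(sample.get("prompt"))          # transcript of the spoken instruction (assumed field)

audio = sample.get("audio")          # decoded audio: {"array": ..., "sampling_rate": ...}
if audio is not None:
    print(audio["sampling_rate"], len(audio["array"]))
```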
To obtain the responses from the voice assistant model, run the following command:
```bash
python main.py --model naive --data alpacaeval --split test --modality audio
```
Supported Arguments:
- `--model`: Specifies the model to use for generating responses. Replace `naive` with the model you want to test (e.g., `qwen2`, `diva`).
- `--data`: Selects the subset of the dataset. Replace `alpacaeval` with other subsets like `commoneval`, `sd-qa`, etc., depending on your evaluation needs.
- `--split`: Chooses the data split to evaluate.
  - For most datasets (`alpacaeval`, `commoneval`, `ifeval`, `advbench`), use `test` as the value.
  - For the `sd-qa` subset, provide a region code instead of `test`, such as `aus` for Australia or `usa` for the United States.
- `--modality`: Use `audio` for spoken instructions, `text` for text-based instructions.
This will generate the output and save it to a file named naive-alpacaeval-test-audio.jsonl.
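If you want to inspect the raw responses before judging, the sketch below reads that generated JSONL file. The record keys (`prompt`, `response`) are assumptions and may differ from what `main.py` actually writes, so confirm them against one line of the file.

```python
import json

# Minimal sketch: inspect the raw responses written by main.py before judging.
# The record keys ("prompt", "response") are assumptions; open one line of the
# file to confirm the actual schema.
with open("naive-alpacaeval-test-audio.jsonl") as f:
    records = [json.loads(line) for line in f]

print(len(records), "responses generated")
first = records[0]
print(sorted(first.keys()))
print(first.get("prompt"))
print(first.get("response"))
```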
For datasets alpacaeval, commoneval, wildvoice, and sd-qa, we use gpt-4o-mini to evaluate the responses. Run the following command to get the GPT score:
```bash
python api_judge.py --src_file naive-alpacaeval-test-audio.jsonl
```
The GPT evaluation scores will be saved to `result-naive-alpacaeval-test-audio.jsonl`.
Note: This step should be skipped for the other datasets, as they are not scored by the GPT judge.
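Before moving on to the final evaluation, it can be useful to verify that every response actually received a judge score. The sketch below assumes `api_judge.py` stores the judgement under a `score` key, which may not match the real field name.

```python
import json

# Minimal sanity check before running evaluate.py: confirm that every record
# received a judge score. The "score" key is an assumption about what
# api_judge.py writes; adjust it if the actual field name differs.
with open("result-naive-alpacaeval-test-audio.jsonl") as f:
    results = [json.loads(line) for line in f]

missing = [i for i, r in enumerate(results) if not r.get("score")]
print(f"{len(results)} judged responses, {len(missing)} without a score")
```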
To generate the final evaluation results, run:
```bash
python evaluate.py --src_file result-naive-alpacaeval-test-audio.jsonl --evaluator open
```
Supported Arguments:
- `--evaluator`: Specifies the evaluator type:
  - Use `open` for `alpacaeval`, `commoneval`, and `wildvoice`.
  - Use `qa` for `sd-qa`.
  - Use `ifeval` for `ifeval`.
  - Use `harm` for `advbench`.
  - Use `mcq` for `openbookqa` and `mmsu`.
  - Use `bbh` for `bbh`.
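To run this final step over several subsets without remembering the mapping above, a small driver like the sketch below can help. The file-naming pattern and the `usa` region code for `sd-qa` are assumptions used for illustration: GPT-judged subsets take the `result-*.jsonl` file produced by `api_judge.py`, the others take the raw output of `main.py`.

```python
import subprocess

# Convenience sketch: map each subset to its evaluator type (from the list
# above) and run evaluate.py for each one. The file-naming pattern is an
# assumption; adjust it to the files you actually produced.
EVALUATORS = {
    "alpacaeval": "open",
    "commoneval": "open",
    "wildvoice": "open",
    "sd-qa": "qa",
    "ifeval": "ifeval",
    "advbench": "harm",
    "openbookqa": "mcq",
    "mmsu": "mcq",
    "bbh": "bbh",
}

model, modality = "naive", "audio"
for subset, evaluator in EVALUATORS.items():
    split = "usa" if subset == "sd-qa" else "test"             # sd-qa uses a region code
    prefix = "result-" if evaluator in ("open", "qa") else ""  # assumed naming scheme
    src_file = f"{prefix}{model}-{subset}-{split}-{modality}.jsonl"
    subprocess.run(
        ["python", "evaluate.py", "--src_file", src_file, "--evaluator", evaluator],
        check=True,
    )
```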
If you use VoiceBench in your research, please cite the following paper:
```bibtex
@article{chen2024voicebench,
  title={VoiceBench: Benchmarking LLM-Based Voice Assistants},
  author={Chen, Yiming and Yue, Xianghu and Zhang, Chen and Gao, Xiaoxue and Tan, Robby T. and Li, Haizhou},
  journal={arXiv preprint arXiv:2410.17196},
  year={2024}
}
```