- [2025/10/17]
- [2025/05/22]
  - Added support for Qwen2.5-Omni (qwen2.5-omni-audio, qwen2.5-omni-speech) and Kimi-Audio-7B-Instruct (kimiaudio, kimiaudio-speech), and updated the audio understanding leaderboard.
- [2025/05/8]
  - Faster checkpoint resume: the `-r/--resume` flag can now automatically locate the most recent interrupted run when no checkpoint file is specified (a usage sketch follows this list).
  - Evaluation can start directly from an inference file via the `--infer-file` flag, with no need to regenerate inference results.
- [2025/03/23]
  - Added support for evaluating and ranking the step-audio model (Step-Audio-Chat); see leaderboard.md for ranking details.
- [2025/03/04]
  - Support for [resuming an incomplete evaluation](docs/Procedures for Restarting an Incomplete Evaluation.md) via the `--resume $checkpoint_res_file` command-line flag.
  - glm-4-voice service deployment, with UltraEval-Audio evaluation support; see GLM-4-Voice for details.
  - Parallel evaluation via the `--workers $num_workers` command-line flag.
- [2025/01/13] release v1.0.0
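As a quick illustration of the flags above, here is a minimal sketch; it reuses the `sample` dataset and `qwen2-audio-chat` model from the quick-start examples in this README, and `$checkpoint_res_file`, `$num_workers`, and `$your_infer_file` are placeholders for your own values:

```bash
# Resume the most recent interrupted run automatically (no checkpoint file given):
python audio_evals/main.py --dataset sample --model qwen2-audio-chat -r

# Resume from a specific checkpoint file, evaluating with parallel workers:
python audio_evals/main.py --dataset sample --model qwen2-audio-chat \
    --resume $checkpoint_res_file --workers $num_workers

# Score an existing inference file without re-running inference:
python audio_evals/main.py --dataset sample --model qwen2-audio-chat \
    --infer-file $your_infer_file
```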
UltraEval-Audio is the world's first open-source framework to support evaluation of both speech understanding and speech generation, purpose-built for assessing audio LLMs. It bundles 34 authoritative benchmarks covering the four domains of speech, sound, medicine, and music, supports ten languages, and spans twelve task categories. With UltraEval-Audio, evaluation becomes markedly more convenient and efficient:
- One-click benchmark management 📥: no more tedious manual downloads and data wrangling; UltraEval-Audio automates all of it, so the benchmark data you need is one command away.
- Built-in evaluation tools ⚙️: no need to hunt for scoring utilities; eight common evaluation methods (e.g. WER, WER-ZH, BLEU, G-Eval) are built in, covering both rule-based and model-based metrics.
- Powerful yet flexible 🛠️: preview runs, random sampling, error retry, and checkpoint resume keep the evaluation process controllable while improving efficiency and accuracy.
- Seamless custom datasets 💼: beyond public benchmarks, a custom-dataset facility lets you apply the framework quickly in your own engineering scenarios.
- Easy integration with existing systems 🔗: an extensible, standardized design lets UltraEval-Audio plug into an established evaluation pipeline, simplifying project management and producing uniformly formatted results.
Audio Understanding LLM: Speech + Text → Text
Audio Generation LLM: Speech → Speech
| Rank | Model | ASR | AST | EMO |
|---|---|---|---|---|
| 🏅 | MiniCPM-o 2.6 | 95 | 38 | 52 |
| 🥈 | Kimi-Audio-Instruct | 97 | 27 | 59 |
| 🥉 | Gemini-1.5-Pro | 94 | 35 | 48 |
| 4 | Qwen2.5-Omni | 96 | 27 | 54 |
| 5 | Qwen2-Audio | 95 | 35 | 43 |
| 6 | GPT-4o-Realtime | 87 | 26 | 33 |
| 7 | Qwen2-Audio-Instruction | 94 | 31 | 17 |
| 8 | Step-Audio-Chat | 94 | 20 | 26 |
| 9 | Gemini-1.5-Flash | 27 | 21 | 45 |
See leaderboard.md for detailed per-model metrics.
# Quick Start
```bash
git clone https://github.com/OpenBMB/UltraEval-Audio.git
cd UltraEval-Audio
conda create -n audioeval python=3.10 -y
conda activate audioeval
pip install -r requirments.txt
export PYTHONPATH=$PWD:$PYTHONPATH
# In some regions you may need a mirror to speed up downloads:
# export HF_ENDPOINT=https://hf-mirror.com
```
```bash
# Test MiniCPM-o 2.6 speech understanding
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset sample --prompt mini-cpm-omni-asr-zh --model MiniCPMo2_6-audio

# Test MiniCPM-o 2.6 speech generation
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset llama-questions-s2t --model MiniCPMo2_6-speech

# Test GPT-4o-Realtime speech understanding
export OPENAI_API_KEY=$your-key
python audio_evals/main.py --dataset sample --model gpt4o_audio

# Test GPT-4o-Realtime speech generation
export OPENAI_API_KEY=$your-key
python audio_evals/main.py --dataset llama-questions-s2t --model gpt4o_speech

# Test gemini-1.5-pro speech understanding
export GOOGLE_API_KEY=$your-key
python audio_evals/main.py --dataset sample --model gemini-pro

# Test qwen2-audio-offline speech understanding
pip install -r requirments-offline-model.txt
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset sample --model qwen2-audio-chat
```

If you hit errors or cannot reproduce the MiniCPM-o 2.6 results, check the FAQ first.
After evaluation, the result files are laid out as follows:

```
res
|-- $model-name
    |-- $dataset
        |-- $time.jsonl
        |-- $time-overview.jsonl
```
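To spot-check a finished run, you can inspect these files directly. A small sketch, where the model, dataset, and timestamp in the paths are hypothetical stand-ins for whatever your run actually produced:

```bash
# Aggregated metrics for the run:
cat res/MiniCPMo2_6-audio/sample/20250113-120000-overview.jsonl

# First per-sample prediction record:
head -n 1 res/MiniCPMo2_6-audio/sample/20250113-120000.jsonl
```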
Evaluation command:

```bash
python audio_evals/main.py --dataset <dataset_name> --model <model_name>
```

`--dataset` specifies the dataset to evaluate (a usage example follows the reference table below). Supported datasets:
speech-chatbot-alpaca-eval, llama-questions, speech-web-questions, speech-triviaqa, tedlium-release1, tedlium-release2, tedlium-release3, catdog, audiocaps, covost2-en-ar, covost2-en-ca, covost2-en-cy, covost2-en-de, covost2-en-et, covost2-en-fa, covost2-en-id, covost2-en-ja, covost2-en-lv, covost2-en-mn, covost2-en-sl, covost2-en-sv, covost2-en-ta, covost2-en-tr, covost2-en-zh, covost2-zh-en, covost2-it-en, covost2-fr-en, covost2-es-en, covost2-de-en, GTZAN, TESS, nsynth, meld-emo, meld-sentiment, clotho-aqa, ravdess-emo, ravdess-gender, COVID-recognizer, respiratory-crackles, respiratory-wheezes, KeSpeech, audio-MNIST, librispeech-test-clean, librispeech-dev-clean, librispeech-test-other, librispeech-dev-other, mls_dutch, mls_french, mls_german, mls_italian, mls_polish, mls_portuguese, mls_spanish, heartbeat_sound, vocalsound, fleurs-zh, voxceleb1, voxceleb2, chord-recognition, wavcaps-audioset, wavcaps-freesound, wavcaps-soundbible, air-foundation, air-chat, desed, peoples-speech, WenetSpeech-test-meeting, WenetSpeech-test-net, gigaspeech, aishell-1, cv-15-en, cv-15-zh, cv-15-fr, cv-15-yue
| `<dataset_name>` | Benchmark | Task | Domain | Metric |
|---|---|---|---|---|
| speech-chatbot-alpaca-eval | speech-chatbot-alpaca-eval | SpeechQA | speech2speech | GPT-score |
| llama-questions | llama-questions | SpeechQA | speech2speech | acc |
| speech-web-questions | speech-web-questions | SpeechQA | speech2speech | acc |
| speech-triviaqa | speech-triviaqa | SpeechQA | speech2speech | acc |
| tedlium-* | tedlium | ASR(Automatic Speech Recognition) | speech | wer |
| clotho-aqa | ClothoAQA | AQA(AudioQA) | sound | acc |
| catdog | catdog | AQA | sound | acc |
| mls-* | multilingual_librispeech | ASR | speech | wer |
| KeSpeech | KeSpeech | ASR | speech | cer |
| librispeech-* | librispeech | ASR | speech | wer |
| fleurs-* | FLEURS | ASR | speech | wer |
| aishell-1 | AISHELL-1 | ASR | speech | wer |
| WenetSpeech-* | WenetSpeech | ASR | speech | wer |
| covost2-* | covost2 | STT(Speech Text Translation) | speech | BLEU |
| GTZAN | GTZAN | MQA(MusicQA) | music | acc |
| TESS | TESS | EMO(emotional recognition) | speech | acc |
| nsynth | nsynth | MQA | music | acc |
| meld-emo | meld | EMO | speech | acc |
| meld-sentiment | meld | SEN(sentiment recognition) | speech | acc |
| ravdess-emo | ravdess | EMO | speech | acc |
| ravdess-gender | ravdess | GEND(gender recognition) | speech | acc |
| COVID-recognizer | COVID | MedicineCls | medicine | acc |
| respiratory-* | respiratory | MedicineCls | medicine | acc |
| audio-MNIST | audio-MNIST | AQA | speech | acc |
| heartbeat_sound | heartbeat | MedicineCls | medicine | acc |
| vocalsound | vocalsound | MedicineCls | medicine | acc |
| voxceleb* | voxceleb | GEND | speech | acc |
| chord-recognition | chord | MQA | music | acc |
| wavcaps-* | wavcaps | AC(AudioCaption) | sound | acc |
| air-foundation | AIR-BENCH | AC,GEND,MQA,EMO | sound,music,speech | acc |
| air-chat | AIR-BENCH | AC,GEND,MQA,EMO | sound,music,speech | GPT4-score |
| desed | desed | AQA | sound | acc |
| peoples-speech | peoples-speech | ASR | speech | wer |
| gigaspeech | gigaspeech | ASR | speech | wer |
| cv-15-* | common voice 15 | ASR | speech | wer |
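For example, evaluating an offline model on one benchmark from the table above might look like this (the dataset and model names come from the supported lists in this README):

```bash
# Run the Qwen2-Audio-7B-Instruct offline model on LibriSpeech test-clean (ASR, WER)
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py \
    --dataset librispeech-test-clean \
    --model qwen2-audio-chat
```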
To build your own dataset, see docs/how add a dataset.md.
`--model` specifies the model to evaluate. Supported models:
- gpt4o_audio: gpt-4o-realtime-preview-2024-10-01, audio-to-text modality.
- gpt4o_speech: gpt-4o-realtime-preview-2024-10-01, audio-to-speech modality.
- gpt4o_audio_ms: gpt-4o-realtime-preview-2024-10-01 (on Azure), audio-to-text modality.
- gpt4o_speech_ms: gpt-4o-realtime-preview-2024-10-01 (on Azure), audio-to-speech modality.
- gemini-pro: the Gemini Pro model.
- gemini-1.5-pro: the Gemini 1.5 Pro model.
- gemini-1.5-flash: the Gemini 1.5 Flash model.
- gemini-2.0-flash-exp: the Gemini 2.0 Flash model.
- qwen-audio: the qwen-audio-chat API model.
- qwen2-audio-offline: the Qwen2-Audio-7B offline model.
- qwen2-audio-chat: the Qwen2-Audio-7B-Instruct offline model.
- qwen-audio-chat-offline: the Qwen-Audio-Chat offline model.
- qwen-audio-pretrain-offline: the Qwen-Audio offline model.
- ultravox: the ultravox-v0_4 offline model.
speech2speech models (glm4voice, mini-omni, ...) will be supported later.
To evaluate your own model, see docs/how eval your model.md.
We referenced the registry code from evals.
If you have any suggestions or questions, open an issue or join our Discord group: https://discord.gg/PHGy66QP