In this repo, we present the Audio Flamingo series of advanced audio understanding language models:
- Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities (ICML 2024)
- Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities (ICML 2025)
- Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models (NeurIPS 2025, Spotlight)
- Music Flamingo: Scaling Music Understanding in Audio Language Models (arXiv)
Music Flamingo (MF) is a fully open, state-of-the-art Large Audio-Language Model (LALM) built on the Audio Flamingo 3 backbone, designed to advance music (including song) understanding in foundational audio models. MF brings together innovations in:
- Deep music understanding across songs and instrumentals.
- Rich, theory-aware captions and question answering (harmony, structure, timbre, lyrics, cultural context).
- Reasoning-centric training using chain-of-thought + reinforcement learning with custom rewards for step-by-step reasoning (see the reward sketch after this list).
- Long-form song reasoning over full-length, multicultural audio (extended context).
Extensive evaluations confirm Music Flamingo's effectiveness: it sets new state-of-the-art results on more than 10 public music understanding and reasoning tasks.
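As a rough illustration of the "custom rewards for step-by-step reasoning" mentioned above, an RL reward for a sampled response might combine answer correctness with a check that the response contains an explicit reasoning trace. Everything in this sketch (the function name, weights, and the <think> tag convention) is our assumption, not Music Flamingo's actual reward design:

```python
import re

# Illustrative reward combining answer correctness with a check for an
# explicit reasoning trace. The weights and the <think>...</think> tag
# convention are assumptions, not Music Flamingo's actual reward design.
def reasoning_reward(response: str, gold_answer: str) -> float:
    has_trace = bool(re.search(r"<think>.+</think>", response, re.DOTALL))
    final_answer = response.split("</think>")[-1].strip()
    correct = gold_answer.lower() in final_answer.lower()
    return 0.8 * float(correct) + 0.2 * float(has_trace)

resp = "<think>The tempo is fast and the guitar is distorted...</think> Rock."
print(reasoning_reward(resp, "rock"))  # 1.0
```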
Audio Flamingo 3 is our latest model, based on a 7B language model and the LLaVA architecture. We trained our unified AF-Whisper audio encoder, based on Whisper, to handle understanding tasks beyond speech recognition. We included speech-related tasks in Audio Flamingo 3 and scaled up the training dataset to about 50M audio-text pairs. As a result, Audio Flamingo 3 can handle all three audio modalities: sound, music, and speech. It outperforms prior SOTA models, including GAMA, Audio Flamingo, Audio Flamingo 2, Qwen-Audio, Qwen2-Audio, Qwen2.5-Omni, LTU, LTU-AS, SALMONN, AudioGPT, Gemini Flash v2, and Gemini Pro v1.5, on a number of understanding and reasoning benchmarks.
Audio Flamingo 3 can take audio inputs up to 10 minutes long, and includes a streaming TTS module (AF3-Chat) for voice output.
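At a high level, this follows the LLaVA pattern: features from the audio encoder are projected into the language model's embedding space and concatenated with text token embeddings. The sketch below illustrates that wiring with a hypothetical module name and toy dimensions (AudioProjector and the shapes are placeholders, not the repo's actual classes; see the branch code for the real implementation):

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps audio-encoder features into the LM embedding space (illustrative)."""
    def __init__(self, audio_dim: int, lm_dim: int):
        super().__init__()
        # A simple 2-layer MLP projector, in the spirit of LLaVA-style designs.
        self.net = nn.Sequential(
            nn.Linear(audio_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        return self.net(audio_feats)  # (batch, n_audio_tokens, lm_dim)

def build_multimodal_inputs(audio_feats, text_embeds, projector):
    """Prepend projected audio tokens to the text embeddings (illustrative)."""
    audio_tokens = projector(audio_feats)
    return torch.cat([audio_tokens, text_embeds], dim=1)

# Toy shapes only: 750 audio frames of dim 1280, LM embedding dim 3584.
projector = AudioProjector(audio_dim=1280, lm_dim=3584)
audio_feats = torch.randn(1, 750, 1280)   # stand-in for AF-Whisper output
text_embeds = torch.randn(1, 32, 3584)    # stand-in for LM token embeddings
inputs_embeds = build_multimodal_inputs(audio_feats, text_embeds, projector)
print(inputs_embeds.shape)  # torch.Size([1, 782, 3584])
```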
Audio Flamingo Sound-CoT significantly improves chain-of-thought (CoT) reasoning abilities. Our 3B model, finetuned from Audio Flamingo 2, is comparable to several 7B reasoning baselines on reasoning benchmarks.
We introduce AF-Reasoning-Eval, a sound reasoning benchmark targeting common-sense reasoning and the ability to discriminate among closely related choices. We also introduce AF-CoT-Train, which contains about 1M CoT reasoning traces, to advance the field of audio understanding.
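For orientation, a CoT trace pairs an audio-grounded question with intermediate reasoning steps and a final answer. The record below is a purely hypothetical shape (the field names are our assumption, not the released AF-CoT-Train schema):

```python
# Hypothetical shape of one AF-CoT-Train record; field names are illustrative,
# not the released schema.
example_trace = {
    "audio_path": "sounds/example.wav",
    "question": "What most likely produced the repeating metallic sound?",
    "chain_of_thought": [
        "The sound repeats at a regular interval, suggesting a mechanical source.",
        "The bright metallic timbre is consistent with metal striking metal.",
    ],
    "answer": "A hammer striking a metal surface.",
}
```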
Audio Flamingo 2 significantly improves on Audio Flamingo in several aspects. First, we re-trained a better CLAP with stronger text understanding abilities. Second, we scaled up the training set to about 10M audio-text pairs with a focus on several understanding skills (AudioSkills) and understanding of longer audio (LongAudio). Third, we carefully ablated the training recipes and curriculums and found that a 3-stage training strategy yields the best results. Audio Flamingo 2 is based on a 3B language model. It achieves SOTA results on several individual and mixed audio understanding benchmarks of captioning, classification, and question answering. It can also understand audio up to 5 minutes long.
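The staged idea can be pictured as progressively unfreezing components while shifting the data mix. The schematic below is illustrative only; the stage names, trainable components, and data mixes are assumptions, not the actual Audio Flamingo 2 recipe (see the branch's training scripts for that):

```python
# Schematic of a 3-stage curriculum; stage names, trainable components, and
# data mixes are illustrative, not the actual Audio Flamingo 2 recipe.
STAGES = [
    {"name": "alignment",   "trainable": ["projector"],                        "data": "short audio-text pairs"},
    {"name": "pretraining", "trainable": ["projector", "audio encoder"],       "data": "AudioSkills mix"},
    {"name": "finetuning",  "trainable": ["projector", "audio encoder", "LM"], "data": "LongAudio + SFT"},
]

for stage in STAGES:
    # In a real run: unfreeze stage["trainable"], build the stage's data mix,
    # then train for the stage's budget before moving to the next stage.
    print(f"stage={stage['name']}: train {stage['trainable']} on {stage['data']}")
```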
Audio Flamingo is our first audio language model based on the Flamingo architecture. It is based on a 1.3B language model and has in-context few-shot learning and multi-turn dialogue abilities (see Audio Dialogues for details of the dialogue data). We curated about 5.9M audio-text pairs to train our model. It achieves SOTA results on several zero-shot, few-shot, and in-distribution benchmarks of captioning, classification, and question answering.
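In-context few-shot learning works by interleaving (audio, text) demonstration pairs before the query in a single prompt. Below is a minimal sketch of that construction, where the <audio:...> placeholder and the Q/A template are made up for illustration rather than the model's actual special tokens:

```python
# Minimal sketch of few-shot prompt construction with interleaved audio-text
# pairs. The "<audio:...>" placeholder and Q/A template are illustrative; the
# model's actual special tokens live in the branch code.
def build_few_shot_prompt(demos, query_audio, query_question):
    parts = [f"<audio:{a}> Q: {q} A: {ans}" for a, q, ans in demos]
    parts.append(f"<audio:{query_audio}> Q: {query_question} A:")
    return "\n".join(parts)

demos = [
    ("dog_bark.wav", "What animal is making this sound?", "A dog barking."),
    ("rain.wav", "What is the weather like?", "It is raining."),
]
print(build_few_shot_prompt(demos, "guitar.wav", "What instrument is playing?"))
```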
Each branch contains the code to train and run inference with the corresponding model.
- The code in this repo is under MIT license.
- The checkpoints are for non-commercial use only (see NVIDIA OneWay Noncommercial License). They are also subject to other restrictions (see README and incl_licenses within each branch).
- Notice: Audio Flamingo is built with OPT-IML and is subject to the OPT-IML license.
- Notice: Audio Flamingo 2 and Audio Flamingo 3 are built with Qwen-2.5. Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
- Audio Flamingo
@inproceedings{kong2024audio,
title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
booktitle={International Conference on Machine Learning},
pages={25125--25148},
year={2024},
organization={PMLR}
}
- Audio Flamingo 2
@inproceedings{ghosh2025audio,
title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=xWu5qpDK6U}
}
- Audio Flamingo 3
@article{goel2025audio,
title={Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models},
author={Goel, Arushi and Ghosh, Sreyan and Kim, Jaehyeon and Kumar, Sonal and Kong, Zhifeng and Lee, Sang-gil and Yang, Chao-Han Huck and Duraiswami, Ramani and Manocha, Dinesh and Valle, Rafael and Catanzaro, Bryan},
journal={arXiv preprint arXiv:2507.08128},
year={2025}
}
- Audio Flamingo Sound-CoT
@article{kong2025audio,
title={Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding},
author={Kong, Zhifeng and Goel, Arushi and Santos, Joao Felipe and Ghosh, Sreyan and Valle, Rafael and Ping, Wei and Catanzaro, Bryan},
journal={arXiv preprint arXiv:2508.11818},
year={2025}
}
- Music Flamingo
@article{ghosh2025music,
title={Music Flamingo: Scaling Music Understanding in Audio Language Models},
author={Ghosh, Sreyan and Goel, Arushi and Koroshinadze, Lasha and Lee, Sang-gil and Kong, Zhifeng and Santos, Joao Felipe and Duraiswami, Ramani and Manocha, Dinesh and Ping, Wei and Shoeybi, Mohammad and Catanzaro, Bryan},
journal={arXiv preprint arXiv},
year={2025}
}