AI/TLDR

NVIDIA / University of Maryland · 2026-04-13 · notable

Audio Flamingo Next — Open Audio-Language Model with 30-Minute Temporal Reasoning

NVIDIA and UMD open-source AF-Next for research — three audio-language model variants with 30-minute audio support and timestamp-grounded CoT reasoning. AF-Next-Think scores 58.7 on MMAU-Pro, beating Gemini 2.5 Pro (57.4).


NVIDIA and UMD open-source an audio-language model that timestamps its reasoning through 30-minute recordings and beats Gemini 2.5 Pro on MMAU-Pro.

Key specs

License: NVIDIA non-commercial
AF-Next-Think, MMAU-Pro: 58.7
Gemini 2.5 Pro, MMAU-Pro: 57.4
AF-Next-Instruct, LongAudioBench: 73.9
Gemini 2.5 Pro, LongAudioBench: 60.4
Max audio length: 30 min

What is it?

Audio Flamingo Next (AF-Next) is an open audio-language model from NVIDIA and the University of Maryland. It comes in three variants: AF-Next-Instruct for general question answering, AF-Next-Think for multi-step reasoning with explicit timestamp grounding, and AF-Next-Captioner for dense long-form audio description. The model handles speech, environmental sounds, and music in a single unified architecture. Weights, code, and training data are all open under the NVIDIA non-commercial research license.

How does it work?

AF-Next uses an AF-Whisper audio encoder connected via an MLP adapter to a Qwen-based decoder. The key innovation is Temporal Audio Chain-of-Thought (T-ACoT): rather than answering directly, the Think variant anchors each reasoning step to a specific timestamp before drawing a conclusion. Rotary Time Embeddings (RoTE) encode absolute position in audio time so the model can cite minute 3 while reasoning at minute 24. Training uses a curriculum strategy across 20 audio understanding and reasoning benchmarks.
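To make the time-embedding idea concrete, here is a minimal sketch of a rotary embedding driven by timestamps. It assumes RoTE follows the standard rotary construction with the timestamp in seconds substituted for the token index; the function names, the base of 10000, and the toy vectors are illustrative assumptions, not taken from the AF-Next code.

```python
import math

def rotary_time_embed(vec, t_seconds, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to a timestamp.

    Assumption: standard rotary frequencies, with the absolute audio time in
    seconds used where a token index would normally go. Hypothetical helper,
    not the AF-Next implementation.
    """
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    half = len(vec) // 2
    out = []
    for i in range(half):
        freq = base ** (-i / half)      # lower frequency for later pairs
        angle = t_seconds * freq        # angle grows with audio time
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2-D rotation of the pair
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors (made up for illustration).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Query at minute 24 attending to a key at minute 3 ...
score_a = dot(rotary_time_embed(q, 24 * 60), rotary_time_embed(k, 3 * 60))
# ... versus the same 21-minute gap measured from the start of the audio.
score_b = dot(rotary_time_embed(q, 21 * 60), rotary_time_embed(k, 0))
```

Under this construction each token is rotated by its own absolute timestamp, yet the score between two tokens depends only on the time gap between them, so `score_a` and `score_b` agree up to floating-point error. That is one way a model reasoning at minute 24 could still locate evidence from minute 3.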

Why does it matter?

Long-form audio remains one of the unstructured media that AI models handle poorly. AF-Next-Think scores 58.7 on MMAU-Pro, 1.3 points ahead of Gemini 2.5 Pro (57.4), and AF-Next-Instruct reaches 73.9 on LongAudioBench, 13.5 points ahead of Gemini 2.5 Pro (60.4). Open weights and training data let researchers fine-tune on their own audio domains without building encoder and LLM stacks from scratch.

Who is it for?

Audio AI researchers, podcast and video platform developers, accessibility tooling authors.

Try it

from transformers import AutoModelForAudioTextToText
model = AutoModelForAudioTextToText.from_pretrained("nvidia/audio-flamingo-next-hf")


Tags

  • audio
  • multimodal
  • open-weights
  • audio-understanding
  • speech
  • music
  • long-audio
  • reasoning
  • nvidia
