AI/TLDR

VibeVoice

Open-source voice AI from Microsoft for long-form, multi-speaker TTS and ASR

Overview

VibeVoice is a family of open-source voice AI models from Microsoft Research that covers both directions of speech: text-to-speech (TTS) and automatic speech recognition (ASR). It is built around continuous speech tokenizers running at a low 7.5 Hz frame rate and a next-token diffusion framework, where a language model handles textual context and dialogue flow while a diffusion head produces the acoustic detail.

The lineup includes VibeVoice-TTS for long-form, multi-speaker audio, a smaller VibeVoice-Realtime-0.5B model for streaming text-to-speech, and VibeVoice-ASR for long-form transcription. The ASR model is available through Hugging Face Transformers, so you can load it with AutoProcessor and the library's VibeVoice ASR model class.

VibeVoice belongs to the speech and audio side of multimodal AI and is aimed at researchers and developers who need to generate or transcribe long stretches of conversational audio. Note that Microsoft removed the original VibeVoice-TTS code from the repository after misuse and frames the project as research-focused.

What it does

  • VibeVoice-TTS synthesizes conversational or single-speaker speech up to 90 minutes in a single pass
  • Supports up to 4 distinct speakers in one dialogue with natural turn-taking and consistent voices
  • VibeVoice-ASR transcribes up to 60 minutes of audio in one pass within a 64K token window
  • ASR output is structured by Who (speaker), When (timestamps), and What (content), combining transcription and diarization
  • Customized hotwords let you supply names or technical terms to improve recognition accuracy
  • ASR is natively multilingual across more than 50 languages; TTS covers English, Chinese, and others
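
To put the 7.5 Hz frame rate in context with the durations listed above, the sketch below counts how many tokenizer frames 90 and 60 minutes of audio correspond to. It assumes one token per frame and that "64K" means 65,536 tokens; the source does not spell out how the ASR window is shared between audio, prompt, and output tokens, so treat the ratio as a rough illustration only.

```python
FRAME_RATE_HZ = 7.5         # tokenizer frame rate stated in the overview
CONTEXT_TOKENS = 64 * 1024  # assumption: "64K" means 65,536 tokens


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


tts_frames = speech_frames(90)  # 90 min * 60 s * 7.5 Hz = 40500
asr_frames = speech_frames(60)  # 60 min * 60 s * 7.5 Hz = 27000

print(f"90 minutes of speech: {tts_frames} frames")
print(f"60 minutes of speech: {asr_frames} frames")
# Assumption: one token per frame, ignoring prompt and output text tokens.
print(f"Share of a 64K window used by 60 minutes: {asr_frames / CONTEXT_TOKENS:.0%}")
```

At this frame rate an hour of audio stays in the tens of thousands of frames, which is what makes single-pass processing of long recordings plausible.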

Getting started

The ASR model ships through Hugging Face Transformers, which is the most direct way to try VibeVoice today.

Install Transformers

VibeVoice-ASR is included in recent Transformers releases, so install or upgrade to a supported version.

bash
pip install "transformers>=5.3.0"
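
As a quick sanity check after installing, the following sketch compares the installed Transformers version against the 5.3.0 minimum from the pip command above. The helper names (version_tuple, meets_minimum, installed_transformers_version) are made up for this example, and the comparison deliberately ignores pre-release suffixes such as .dev0 or rc1.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

MIN_VERSION = "5.3.0"  # minimum taken from the pip command above


def version_tuple(version: str) -> Tuple[int, ...]:
    """Return the leading numeric release part, e.g. '5.3.0.dev0' -> (5, 3, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = MIN_VERSION) -> bool:
    """Compare release numbers, padding with zeros so '5.3' equals '5.3.0'."""
    have, need = version_tuple(installed), version_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is missing."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


found = installed_transformers_version()
if found is None:
    print("transformers is not installed")
elif meets_minimum(found):
    print(f"transformers {found} meets the {MIN_VERSION} minimum")
else:
    print(f"transformers {found} is older than {MIN_VERSION}; upgrade with pip")
```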

Load the ASR model

Load the processor and model by their Hugging Face id. device_map="auto" places the model on available hardware.

python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")

Transcribe audio

Build a transcription request from an audio source, generate, and decode the parsed result with speaker segments.

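python
# Build the model inputs from an audio source (here, a sample WAV URL),
# then move them to the model's device and dtype.
inputs = processor.apply_transcription_request(
    audio="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
).to(model.device, model.dtype)

# Generate, then slice off the prompt tokens so only the new tokens remain.
output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]

# Decode into the parsed format and print each speaker segment of the first result.
transcription = processor.decode(generated_ids, return_format="parsed")[0]
for speaker_transcription in transcription:
    print(speaker_transcription)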
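
Once you have the parsed segments, you will probably want them as readable transcript lines. The sketch below is one way to do that. It assumes each segment is a dict with the keys "Start", "End", "Speaker", and "Content", with times in seconds; those key names are an assumption about the parsed format, so print one real segment first and adjust them if they differ. The helper names and the sample data are made up for illustration.

```python
from typing import Any, Dict, Iterable, List


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as HH:MM:SS, dropping fractional seconds."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(segments: Iterable[Dict[str, Any]]) -> List[str]:
    """Turn Who/When/What segments into one readable line per segment."""
    lines = []
    for seg in segments:
        # Assumed key names; change these if your parsed output differs.
        start = format_timestamp(float(seg.get("Start", 0.0)))
        end = format_timestamp(float(seg.get("End", 0.0)))
        speaker = seg.get("Speaker", "?")
        text = str(seg.get("Content", "")).strip()
        lines.append(f"[{start} - {end}] Speaker {speaker}: {text}")
    return lines


# Made-up sample data standing in for a real parsed transcription.
sample = [
    {"Start": 0.0, "End": 15.4, "Speaker": 0, "Content": "Hello and welcome."},
    {"Start": 15.4, "End": 3725.0, "Speaker": 1, "Content": " Thanks for having me. "},
]

for line in format_segments(sample):
    print(line)
```

With a real result, you would pass the transcription list from the previous step in place of the sample.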

Try the other models

For streaming text-to-speech, the VibeVoice-Realtime-0.5B model has a Colab notebook linked from the repository, and the TTS and ASR docs cover the full lineup.

Commands and code are distilled from the project's own documentation. Always check the official repo for the latest.

When to use it

  • Generate long podcast-style audio with several speakers from a written script
  • Produce single-speaker narration or audiobook segments that stay consistent over long passages
  • Transcribe hour-long meetings or interviews with speaker labels and timestamps in one pass
  • Add domain-specific names and terms as hotwords to improve transcription of technical content

How VibeVoice compares

VibeVoice alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| Whisper | ★ 103k | OpenAI's speech recognition model that transcribes and translates audio across many languages. |
| GPT-SoVITS | ★ 58.9k | An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning. |
| VibeVoice | ★ 49.5k | Open-source voice AI from Microsoft for long-form, multi-speaker TTS and ASR. |
| Coqui TTS | ★ 45.6k | A library of text-to-speech models including the multilingual XTTS voice-cloning model. |
| ChatTTS | ★ 39.5k | ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody. |
| MockingBird | ★ 36.9k | An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line. |
| OpenVoice | ★ 36.7k | OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation. |
| VoxCPM | ★ 31k | An open-source text-to-speech system that generates natural multilingual speech, designs voices from text descriptions, and clones any voice from a short clip. |