VibeVoice

Open-source voice AI from Microsoft for long-form, multi-speaker TTS and ASR

github.com/microsoft/VibeVoice (★ 49.5k) · microsoft.github.io/VibeVoice

Overview

VibeVoice is a family of open-source voice AI models from Microsoft Research that covers both directions of speech: text-to-speech (TTS) and automatic speech recognition (ASR). It is built around continuous speech tokenizers running at a low 7.5 Hz frame rate and a next-token diffusion framework, where a language model handles textual context and dialogue flow while a diffusion head produces the acoustic detail.
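To get a feel for what a 7.5 Hz frame rate means in practice, the arithmetic below counts how many acoustic frames cover a given duration. It is a back-of-the-envelope sketch: `frames_for` is a hypothetical helper, and the 50 Hz comparison rate is an illustrative assumption rather than a figure from the VibeVoice documentation.

```python
import math

FRAME_RATE_HZ = 7.5  # tokenizer frame rate quoted in the overview


def frames_for(duration_seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of frames needed to cover a duration, rounded up."""
    return math.ceil(duration_seconds * frame_rate_hz)


one_minute = frames_for(60)                               # 450 frames
ninety_minutes = frames_for(90 * 60)                      # 40500 frames
ninety_minutes_at_50hz = frames_for(90 * 60, 50.0)        # 270000 frames (assumed rate)

print(one_minute, ninety_minutes, ninety_minutes_at_50hz)
```

The count covers acoustic frames only. How those frames map onto positions in the language model's context is not stated here, so treat the numbers as a rough sense of scale.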

The lineup includes VibeVoice-TTS for long-form, multi-speaker audio, a smaller VibeVoice-Realtime-0.5B model for streaming text-to-speech, and VibeVoice-ASR for long-form transcription. The ASR model is available through Hugging Face Transformers, so you can load it with AutoProcessor and the VibeVoiceAsrForConditionalGeneration model class.

VibeVoice is a speech and audio project aimed at researchers and developers who need to generate or transcribe long stretches of conversational audio. Microsoft removed the original VibeVoice-TTS code from the repository after misuse and frames the project as research-focused.

What it does

VibeVoice-TTS synthesizes conversational or single-speaker speech up to 90 minutes in a single pass
Supports up to 4 distinct speakers in one dialogue with natural turn-taking and consistent voices
VibeVoice-ASR transcribes up to 60 minutes of audio in one pass within a 64K token window
ASR output is structured by Who (speaker), When (timestamps), and What (content), combining transcription and diarization
Customized hotwords let you supply names or technical terms to improve recognition accuracy
ASR is natively multilingual across more than 50 languages; TTS covers English, Chinese, and others
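The Who / When / What structure can be pictured as a list of timed, speaker-labelled segments. The sketch below is one way to model that downstream; the `Segment` class, its field names, and `merge_turns` are hypothetical and are not part of the VibeVoice or Transformers API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: int   # Who: speaker label
    start: float   # When: start time in seconds
    end: float     # When: end time in seconds
    text: str      # What: transcribed content


def merge_turns(segments):
    """Merge consecutive segments from the same speaker into a single turn."""
    turns = []
    for seg in segments:
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Extend the previous turn instead of starting a new one.
            turns[-1] = Segment(last.speaker, last.start, seg.end, f"{last.text} {seg.text}")
        else:
            turns.append(seg)
    return turns


segments = [
    Segment(0, 0.0, 2.0, "Hi"),
    Segment(0, 2.0, 4.0, "there."),
    Segment(1, 4.0, 6.0, "Hello!"),
]
for turn in merge_turns(segments):
    print(turn)
```

Keeping speaker, timing, and text in one record is what lets a single pass serve as both transcript and diarization.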

Getting started

The ASR model ships through Hugging Face Transformers, which is the most direct way to try VibeVoice today.

Install Transformers

VibeVoice-ASR is included in recent Transformers releases, so install or upgrade to a supported version.

bash

pip install "transformers>=5.3.0"
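After installing, it can help to confirm that the environment meets the minimum version from the command above. The following is a minimal sketch using only the standard library; `release_tuple` and `meets_minimum` are hypothetical helpers, and the comparison looks only at leading numeric components, so it is cruder than a full version parser such as `packaging.version`.

```python
from importlib.metadata import PackageNotFoundError, version

MINIMUM = "5.3.0"  # taken from the pip command above


def release_tuple(version_string: str) -> tuple:
    """Extract the leading numeric components, e.g. '5.3.0.dev0' -> (5, 3, 0)."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric piece such as 'dev0' or '0rc1'
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MINIMUM) -> bool:
    """Return True if the installed version is at least the minimum."""
    return release_tuple(installed) >= release_tuple(minimum)


try:
    installed = version("transformers")
    print(f"transformers {installed}: supported={meets_minimum(installed)}")
except PackageNotFoundError:
    print("transformers is not installed")
```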

Load the ASR model

Load the processor and model by their Hugging Face id. device_map="auto" places the model on available hardware.

python

from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

model_id = "microsoft/VibeVoice-ASR-HF"
processor = AutoProcessor.from_pretrained(model_id)
model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")

Transcribe audio

Build a transcription request from an audio source, generate, strip the prompt tokens from the output, and decode with return_format="parsed" to get the result as speaker segments.

python

# Build model inputs from an audio URL and move them to the model's device and dtype.
inputs = processor.apply_transcription_request(
    audio="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
).to(model.device, model.dtype)

# Generate, then drop the prompt tokens so only the newly generated ids remain.
output_ids = model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]

# Decode into structured segments and print each one.
transcription = processor.decode(generated_ids, return_format="parsed")[0]
for speaker_transcription in transcription:
    print(speaker_transcription)
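Once the segments are parsed, a small formatter can turn them into a readable transcript. The sketch below assumes each parsed entry is a dict with `Start`, `End`, `Speaker`, and `Content` keys, with times in seconds. That shape is an assumption, so inspect the output of your installed version and adjust the keys if they differ. `to_clock` and `format_segment` are hypothetical helpers.

```python
def to_clock(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segment(segment: dict) -> str:
    """Render one segment as '[start - end] Speaker N: text'.

    The key names used here are assumed, not confirmed by the library docs.
    """
    start = to_clock(segment["Start"])
    end = to_clock(segment["End"])
    return f"[{start} - {end}] Speaker {segment['Speaker']}: {segment['Content']}"


# Hand-written sample data standing in for real model output.
sample = [
    {"Start": 0.0, "End": 15.43, "Speaker": 0, "Content": "Welcome to the show."},
    {"Start": 15.43, "End": 3725.4, "Speaker": 1, "Content": "Thanks for having me."},
]
for segment in sample:
    print(format_segment(segment))
```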

Try the other models

For streaming text-to-speech, the VibeVoice-Realtime-0.5B model has a Colab notebook linked from the repository, and the TTS and ASR docs cover the full lineup.

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Generate long podcast-style audio with several speakers from a written script
Produce single-speaker narration or audiobook segments that stay consistent over long passages
Transcribe hour-long meetings or interviews with speaker labels and timestamps in one pass
Add domain-specific names and terms as hotwords to improve transcription of technical content

How VibeVoice compares

VibeVoice alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Whisper	★ 103k	OpenAI's speech recognition model that transcribes and translates audio across many languages.
GPT-SoVITS	★ 58.9k	An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning.
VibeVoice	★ 49.5k	Open-source voice AI from Microsoft for long-form, multi-speaker TTS and ASR
Coqui TTS	★ 45.6k	A library of text-to-speech models including the multilingual XTTS voice-cloning model.
ChatTTS	★ 39.5k	ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody.
MockingBird	★ 36.9k	An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line.
OpenVoice	★ 36.7k	OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation.
VoxCPM	★ 31k	An open-source text-to-speech system that generates natural multilingual speech, designs voices from text descriptions, and clones any voice from a short clip.
