Overview
Dia is a 1.6B-parameter text-to-speech model from Nari Labs that turns a written transcript directly into spoken dialogue. You write lines tagged with [S1] and [S2], and the model generates audio for a back-and-forth conversation between two speakers.
It is aimed at developers and researchers who need conversational speech rather than a single narrator. Beyond plain words, Dia can produce nonverbal sounds such as laughter, coughing, and throat-clearing using inline tags, and you can condition the output on a short audio prompt to control voice, emotion, and tone.
Within the speech and audio space, Dia is an open-weights alternative to hosted dialogue TTS services. The weights are on Hugging Face under Apache-2.0, and you can run the model through the project's own code, a Gradio UI, a CLI, or the Hugging Face Transformers integration. The model currently supports English only.
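The transcript format described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the Dia repo: `build_transcript` and the sample lines are hypothetical names and data, and the only things taken from the description are the [S1]/[S2] speaker tags and the inline nonverbal tags.

```python
# Hypothetical helper for illustration; it is not part of Dia or Transformers.

def build_transcript(turns):
    """Join (speaker_number, line) pairs into one [S1]/[S2]-tagged string."""
    parts = []
    for speaker, line in turns:
        # The format described above has exactly two speakers.
        if speaker not in (1, 2):
            raise ValueError(f"expected speaker 1 or 2, got {speaker!r}")
        parts.append(f"[S{speaker}] {line.strip()}")
    return " ".join(parts)


# Nonverbal tags such as (laughs) are written inline, inside a speaker's line.
turns = [
    (1, "Did you see the demo?"),
    (2, "I did. (laughs) It was better than I expected."),
    (1, "(clears throat) Let's try it ourselves."),
]
transcript = build_transcript(turns)
print(transcript)
```

The resulting string is the kind of input the generation example in the Getting started section passes as `text`.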
What it does
- Generates two-speaker dialogue from a transcript using [S1] and [S2] tags
- Produces nonverbal sounds such as (laughs), (coughs), (sighs), and (clears throat)
- Clones a voice by conditioning on a short (5-10 second) audio prompt and its transcript
- Open weights on Hugging Face (Dia-1.6B-0626) under the Apache-2.0 license
- Runs via the project repo, a Gradio web UI, a CLI, or Hugging Face Transformers
- Sampling controls including guidance_scale, temperature, top_p, and top_k for output tuning
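To give a feel for what the temperature, top_k, and top_p controls do, here is a generic sketch of how these three filters are commonly applied to a toy next-token distribution. It is not Dia's or Transformers' implementation; `filter_distribution` and the example logits are made up for illustration, and guidance_scale is not covered.

```python
import math


def _normalize(pairs):
    """Rescale (probability, index) pairs so the probabilities sum to 1."""
    total = sum(p for p, _ in pairs)
    return [(p / total, i) for p, i in pairs]


def filter_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    # Temperature: divide logits before softmax; values above 1 flatten the
    # distribution, values below 1 sharpen it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    pairs = [(math.exp(x - peak), i) for i, x in enumerate(scaled)]
    pairs = sorted(_normalize(pairs), reverse=True)

    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        pairs = _normalize(pairs[:top_k])

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for p, i in pairs:
            kept.append((p, i))
            cumulative += p
            if cumulative >= top_p:
                break
        pairs = _normalize(kept)

    return {i: p for p, i in pairs}


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filter_distribution(toy_logits, temperature=1.8, top_k=3, top_p=0.90))
```

Sampling then draws from the filtered distribution, so tighter top_k and top_p settings reduce variety while a higher temperature increases it.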
Getting started
You can run Dia through Hugging Face Transformers or from the project repo. The Transformers path needs the main branch of transformers and a CUDA GPU.
Install via Transformers
Install the main branch of transformers, which contains the Dia implementation.
pip install git+https://github.com/huggingface/transformers.git
Generate dialogue audio
Load the processor and model, pass a transcript using [S1] and [S2] tags, then save the result. This example requires a CUDA device.
from transformers import AutoProcessor, DiaForConditionalGeneration
torch_device = "cuda"
model_checkpoint = "nari-labs/Dia-1.6B-0626"
# One transcript string per batch item; [S1] and [S2] mark the speaker turns.
text = [
    "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
]
# Prepare the model inputs and move them to the GPU.
processor = AutoProcessor.from_pretrained(model_checkpoint)
inputs = processor(text=text, padding=True, return_tensors="pt").to(torch_device)
model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
outputs = model.generate(
    **inputs, max_new_tokens=3072, guidance_scale=3.0, temperature=1.8, top_p=0.90, top_k=45
)
# Decode the generated output to audio and write it to disk.
outputs = processor.batch_decode(outputs)
processor.save_audio(outputs, "example.mp3")
Or install from the repo
Clone the repository and install in editable mode, then run a bundled example. A Gradio UI (app.py) and CLI (cli.py) are also included.
git clone https://github.com/nari-labs/dia.git
cd dia
pip install -e .
python example/simple.py
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
When to use it
- Generating two-speaker dialogue for podcasts, demos, or scripted conversations
- Adding nonverbal cues like laughter or sighs to make synthetic speech sound more natural
- Cloning a target voice from a short reference clip to keep a consistent speaker
- Researching or prototyping open-weights conversational TTS without a hosted API
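For the voice-cloning use case, the inputs can be prepared with a couple of small checks before calling the model. The sketch below is hedged: the helper names are hypothetical, the 44,100 Hz sample rate is only an illustrative value, and placing the prompt clip's transcript ahead of the new script is an assumption about how the prompt and its transcript are paired, so check the project's voice-cloning example before relying on it.

```python
# Hypothetical helpers for preparing voice-cloning inputs; not part of Dia's API.

def clip_duration_seconds(num_samples, sample_rate):
    """Length of a mono audio clip in seconds."""
    return num_samples / sample_rate


def is_recommended_prompt_length(seconds, low=5.0, high=10.0):
    """True if the clip falls in the 5-10 second range mentioned above."""
    return low <= seconds <= high


def build_cloning_text(prompt_transcript, new_script):
    """Assumption: the transcript of the reference clip comes first, followed
    by the new script that should be spoken in the cloned voice."""
    return f"{prompt_transcript.strip()} {new_script.strip()}"


# Illustrative numbers: a 7-second clip at an assumed 44,100 Hz sample rate.
seconds = clip_duration_seconds(num_samples=308_700, sample_rate=44_100)
if not is_recommended_prompt_length(seconds):
    raise ValueError(f"prompt is {seconds:.1f}s; expected 5-10 seconds")

text = build_cloning_text(
    "[S1] This is the reference clip. [S2] And this is the second voice.",
    "[S1] Here is the new line to generate. [S2] (laughs) Sounds good.",
)
print(text)
```

Keeping the same speaker tags in the prompt transcript and the new script is what ties each cloned voice to a consistent speaker.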
How Dia compares
Dia alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Whisper | ★ 103k | OpenAI's speech recognition model that transcribes and translates audio across many languages. |
| GPT-SoVITS | ★ 58.9k | An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning. |
| VibeVoice | ★ 49.5k | Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts. |
| Coqui TTS | ★ 45.6k | A library of text-to-speech models including the multilingual XTTS voice-cloning model. |
| ChatTTS | ★ 39.5k | An open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody. |
| MockingBird | ★ 36.9k | An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line. |
| OpenVoice | ★ 36.7k | A voice-cloning model that works from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation. |
| Dia | ★ 19.3k | A 1.6B-parameter text-to-speech model for realistic multi-speaker dialogue. |