AI/TLDR

Dia

A 1.6B text-to-speech model for realistic multi-speaker dialogue

Overview

Dia is a 1.6B-parameter text-to-speech model from Nari Labs that turns a written transcript directly into spoken dialogue. You write lines tagged with [S1] and [S2], and the model generates audio for a back-and-forth conversation between two speakers.

It is aimed at developers and researchers who need conversational speech rather than a single narrator. Beyond plain words, Dia can produce nonverbal sounds such as laughter, coughing, and throat-clearing using inline tags, and you can condition the output on a short audio prompt to control voice, emotion, and tone.

Within the speech and audio space, Dia is an open-weights alternative to hosted dialogue TTS services. The weights are on Hugging Face under Apache-2.0, and you can run it through the project's own code, a Gradio UI, a CLI, or the Hugging Face Transformers integration. The model currently supports English only.

What it does

  • Generates two-speaker dialogue from a transcript using [S1] and [S2] tags
  • Produces nonverbal sounds such as (laughs), (coughs), (sighs), and (clears throat)
  • Clones a voice by conditioning on a short (5-10 second) audio prompt and its transcript
  • Ships open weights on Hugging Face (Dia-1.6B-0626) under the Apache-2.0 license
  • Runs via the project repo, a Gradio web UI, a CLI, or Hugging Face Transformers
  • Exposes sampling controls (guidance_scale, temperature, top_p, and top_k) for tuning the output
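
The tag format in the list above can be assembled programmatically. The following is a minimal sketch, not part of the Dia project: build_transcript and KNOWN_NONVERBALS are hypothetical names, and the allowed nonverbal set is limited to the four tags this page mentions (the model may accept others).

```python
from typing import Iterable, Optional, Tuple

# Nonverbal tags named on this page; treated here as an allow-list (an assumption).
KNOWN_NONVERBALS = {"laughs", "coughs", "sighs", "clears throat"}

Turn = Tuple[int, str, Optional[str]]  # (speaker number, spoken text, nonverbal or None)


def build_transcript(turns: Iterable[Turn]) -> str:
    """Join dialogue turns into one string using [S1]/[S2] speaker tags."""
    parts = []
    for speaker, line, nonverbal in turns:
        if speaker not in (1, 2):
            raise ValueError(f"speaker must be 1 or 2, got {speaker!r}")
        segment = f"[S{speaker}] {line.strip()}"
        if nonverbal is not None:
            if nonverbal not in KNOWN_NONVERBALS:
                raise ValueError(f"unknown nonverbal tag: {nonverbal!r}")
            # Nonverbal cues are written inline in parentheses after the words.
            segment += f" ({nonverbal})"
        parts.append(segment)
    return " ".join(parts)


script = build_transcript([
    (1, "Dia is an open weights text to dialogue model.", None),
    (2, "You get full control over scripts and voices.", None),
    (1, "Wow. Amazing.", "laughs"),
])
print(script)
```

The resulting string can be passed as the single element of the text list in the generation example under Getting started.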

Getting started

You can run Dia through Hugging Face Transformers or from the project repo. The Transformers path needs the main branch of transformers and a CUDA GPU.

Install via Transformers

Install the main branch of transformers, which contains the Dia implementation.

bash
pip install git+https://github.com/huggingface/transformers.git

Generate dialogue audio

Load the processor and model, pass a transcript using [S1] and [S2] tags, then decode and save the result. The example assumes a CUDA device.

python
from transformers import AutoProcessor, DiaForConditionalGeneration

torch_device = "cuda"
model_checkpoint = "nari-labs/Dia-1.6B-0626"

# One transcript per batch item; [S1]/[S2] mark the speakers and
# (laughs) is an inline nonverbal tag.
text = [
    "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
]

# Tokenize the transcript and move the tensors to the GPU.
processor = AutoProcessor.from_pretrained(model_checkpoint)
inputs = processor(text=text, padding=True, return_tensors="pt").to(torch_device)

# Load the model and generate audio tokens with the sampling settings below.
model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
outputs = model.generate(
    **inputs, max_new_tokens=3072, guidance_scale=3.0, temperature=1.8, top_p=0.90, top_k=45
)

# Decode the generated tokens to audio and write the result to disk.
outputs = processor.batch_decode(outputs)
processor.save_audio(outputs, "example.mp3")
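
To give a feel for what temperature, top_k, and top_p do in the generate call above, here is a generic sketch of those three filters applied to a toy list of logits. It is not Dia's or Transformers' implementation: filtered_probs is a hypothetical helper, the order of operations is one common convention and may differ in the library, and guidance_scale is not covered.

```python
import math
from typing import List, Optional


def filtered_probs(
    logits: List[float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: float = 1.0,
) -> List[float]:
    """Return sampling probabilities after temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature: dividing logits by a value above 1 flattens the distribution.
    scaled = [x / temperature for x in logits]

    # Softmax, subtracting the max for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most likely tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    k_mass = sum(probs[i] for i in order)

    # Top-p: keep the smallest prefix whose renormalised mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i] / k_mass
        if mass >= top_p:
            break

    # Renormalise over the surviving tokens; everything else gets zero.
    kept_mass = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / kept_mass
    return out


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filtered_probs(toy_logits, temperature=1.8, top_k=45, top_p=0.90))
```

Lower top_p or top_k values restrict sampling to fewer candidates, which tends to make output more conservative; higher temperature spreads probability across more candidates.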

Or install from the repo

Clone the repository and install in editable mode, then run a bundled example. A Gradio UI (app.py) and CLI (cli.py) are also included.

bash
git clone https://github.com/nari-labs/dia.git
cd dia
pip install -e .
python example/simple.py

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

  • Generating two-speaker dialogue for podcasts, demos, or scripted conversations
  • Adding nonverbal cues like laughter or sighs to make synthetic speech sound more natural
  • Cloning a target voice from a short reference clip to keep a consistent speaker
  • Researching or prototyping open-weights conversational TTS without a hosted API

How Dia compares

Dia alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Whisper | ★ 103k | OpenAI's speech recognition model that transcribes and translates audio across many languages.
GPT-SoVITS | ★ 58.9k | An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning.
VibeVoice | ★ 49.5k | Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts.
Coqui TTS | ★ 45.6k | A library of text-to-speech models including the multilingual XTTS voice-cloning model.
ChatTTS | ★ 39.5k | An open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody.
MockingBird | ★ 36.9k | An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line.
OpenVoice | ★ 36.7k | Clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation.
Dia | ★ 19.3k | A 1.6B text-to-speech model for realistic multi-speaker dialogue.