IndexTTS

Zero-shot text-to-speech with voice cloning and independent emotion control

Overview

IndexTTS2 is a text-to-speech model from bilibili's Index team. You give it a short reference clip of a voice plus some text, and it speaks that text in the cloned voice without any per-speaker training (zero-shot). It runs as a Python package and ships its weights on Hugging Face and ModelScope.

The model is built for developers who need expressive, controllable speech. It separates speaker identity from emotion, so you can keep one person's timbre while changing the emotional tone from a separate style prompt. It also supports a duration-control mode that can fix the number of generated tokens, which helps when speech has to line up with video.

Within the speech and audio space, IndexTTS2 sits alongside other autoregressive zero-shot TTS systems. It targets cases where word accuracy, speaker similarity, and emotional fidelity all matter, such as dubbing and voiceover work.

What it does

Zero-shot voice cloning from a single short reference audio clip
Independent control of timbre and emotion, using separate speaker and style prompts
Two duration modes: precise token-count control, or free autoregressive generation that follows the prompt's prosody
Soft emotion control through natural-language text descriptions
Pre-trained weights distributed on Hugging Face and ModelScope
Python inference API (indextts.infer_v2.IndexTTS2) with optional FP16, CUDA kernel, and DeepSpeed flags

Getting started

IndexTTS2 is a Python project. You clone the repo, install dependencies with uv, download the model weights, then call the inference API. Make sure git and git-lfs are installed first.
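Before cloning, it can help to confirm that the command-line tools this guide relies on are on your PATH. The sketch below is not part of the project: `missing_tools` is a hypothetical helper, and the executable names (`git`, `git-lfs`, `uv`) are assumed from the steps in this section.

```python
import shutil


def missing_tools(tools=("git", "git-lfs", "uv")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Install these before continuing:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```

An empty result only means the executables were found; it does not check their versions.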

Clone the repository

Clone the repo and pull the large files tracked by git-lfs.

bash

git clone https://github.com/index-tts/index-tts.git && cd index-tts
git lfs pull

Install dependencies

The project uses uv to manage its environment. Install all extras.

bash

uv sync --all-extras

Download the model weights

Pull the IndexTTS-2 checkpoints from Hugging Face into a local checkpoints folder.

bash

uv tool install "huggingface-hub[cli,hf_xet]"
hf download IndexTeam/IndexTTS-2 --local-dir=checkpoints
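After the download finishes, a quick check can confirm the files landed where the loader will look. This is a minimal sketch, assuming only what this guide shows: that `config.yaml` sits directly inside `checkpoints`. The helper name `missing_checkpoint_files` is hypothetical, and other checkpoint file names are not listed here, so the required list is left as a parameter.

```python
from pathlib import Path


def missing_checkpoint_files(model_dir="checkpoints", required=("config.yaml",)):
    """Return the entries of `required` that are not files inside model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files()
    if missing:
        print("Missing from checkpoints/:", ", ".join(missing))
    else:
        print("config.yaml found in checkpoints/.")
```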

Synthesize speech in Python

Load the model from the checkpoints directory and pass a reference voice clip plus your text. The voice to clone is read from the file given as spk_audio_prompt, and the generated audio is written to output_path.

python

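from indextts.infer_v2 import IndexTTS2

# Load the model from the local checkpoints folder.
# The optional acceleration flags are all left off here.
tts = IndexTTS2(
    cfg_path="checkpoints/config.yaml",
    model_dir="checkpoints",
    use_fp16=False,
    use_cuda_kernel=False,
    use_deepspeed=False,
)

# Speak the text in the voice of the reference clip and save it as gen.wav.
text = "Translate for me, what is a surprise!"
tts.infer(
    spk_audio_prompt="examples/voice_01.wav",  # reference clip to clone
    text=text,
    output_path="gen.wav",
    verbose=True,
)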

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.
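For longer scripts, the single `infer` call can be wrapped in a loop that writes one file per line. The sketch below uses only the three keyword arguments shown in this guide (`spk_audio_prompt`, `text`, `output_path`); the helper `synthesize_lines` and the `line_001.wav` naming scheme are hypothetical, not something the project provides.

```python
from pathlib import Path


def synthesize_lines(tts, lines, voice, out_dir="out"):
    """Call tts.infer once per non-empty line and return the output paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Drop blank lines so the numbering has no gaps.
    cleaned = [line.strip() for line in lines if line.strip()]

    paths = []
    for index, line in enumerate(cleaned, start=1):
        target = out / f"line_{index:03d}.wav"
        tts.infer(spk_audio_prompt=voice, text=line, output_path=str(target))
        paths.append(str(target))
    return paths
```

With a loaded `tts` object, a call such as `synthesize_lines(tts, ["First sentence.", "Second sentence."], "examples/voice_01.wav")` would write `out/line_001.wav` and `out/line_002.wav`. The helper accepts any object with a compatible `infer` method, which makes it easy to try with a stand-in before loading the real model.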

When to use it

Dub video or animation, using duration control to keep generated speech aligned with the original timing
Clone a single speaker's voice from a short sample to narrate articles, audiobooks, or product walkthroughs
Generate the same line in different emotional tones by swapping the style prompt while keeping one voice
Prototype voice assistants or characters that need consistent timbre with adjustable emotional delivery
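The "same line, different emotions" use case can be planned as a list of jobs that share one speaker clip and differ only in the style prompt. This is a sketch under stated assumptions: this guide does not show the emotion arguments, so the keyword name `emo_audio_prompt` is an assumption to verify against the repo's README, the helper `emotion_variant_jobs` is hypothetical, and the emotion clip paths are placeholders.

```python
def emotion_variant_jobs(text, voice, emotion_clips, style_kwarg="emo_audio_prompt"):
    """Build one dict of infer() keyword arguments per emotion clip.

    `style_kwarg` is an ASSUMED parameter name, not confirmed by this guide.
    """
    jobs = []
    for label, clip in emotion_clips.items():
        jobs.append({
            "spk_audio_prompt": voice,          # same speaker for every variant
            "text": text,
            "output_path": f"gen_{label}.wav",
            style_kwarg: clip,                  # only the style prompt changes
        })
    return jobs


if __name__ == "__main__":
    # Placeholder clip paths; substitute your own recordings.
    clips = {"sad": "emotions/sad.wav", "happy": "emotions/happy.wav"}
    for job in emotion_variant_jobs("Good morning.", "examples/voice_01.wav", clips):
        print(job)
```

If the parameter name matches the installed version, each job could be passed on as `tts.infer(**job)`. Keeping the name in `style_kwarg` means only one argument changes if the real API differs.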

How IndexTTS compares

IndexTTS alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Whisper	★ 103k	OpenAI's speech recognition model that transcribes and translates audio across many languages.
GPT-SoVITS	★ 58.9k	An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning.
VibeVoice	★ 49.5k	Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts.
Coqui TTS	★ 45.6k	A library of text-to-speech models including the multilingual XTTS voice-cloning model.
ChatTTS	★ 39.5k	ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody.
MockingBird	★ 36.9k	An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line.
OpenVoice	★ 36.7k	OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation.
IndexTTS	★ 21.3k	Zero-shot text-to-speech with voice cloning and independent emotion control
