Overview
IndexTTS2 is a text-to-speech model from bilibili's Index team. You give it a short reference clip of a voice plus some text, and it speaks that text in the cloned voice without any per-speaker training (zero-shot). It runs as a Python package and ships its weights on Hugging Face and ModelScope.
The model is built for developers who need expressive, controllable speech. It separates speaker identity from emotion, so you can keep one person's timbre while changing the emotional tone from a separate style prompt. It also supports a duration-control mode that can fix the number of generated tokens, which helps when speech has to line up with video.
Within the speech and audio space, IndexTTS2 sits alongside other autoregressive zero-shot TTS systems. It targets cases where word accuracy, speaker similarity, and emotional fidelity all matter, such as dubbing and voiceover work.
What it does
- Zero-shot voice cloning from a single short reference audio clip
- Independent control of timbre and emotion, using separate speaker and style prompts
- Two duration modes: precise token-count control, or free autoregressive generation that follows the prompt's prosody
- Soft emotion control through natural-language text descriptions
- Pre-trained weights distributed on Hugging Face and ModelScope
- Python inference API (indextts.infer_v2.IndexTTS2) with optional FP16, CUDA kernel, and DeepSpeed flags
Getting started
IndexTTS2 is a Python project. You clone the repo, install dependencies with uv, download the model weights, then call the inference API. Make sure git and git-lfs are installed first.
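Before starting, it can help to confirm that the command-line tools used in the steps below are actually on your PATH. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical and not part of IndexTTS2.

```python
import shutil


def missing_tools(tools=("git", "git-lfs", "uv")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("git, git-lfs, and uv are all on PATH")
```

This only checks that the executables can be found. It does not check their versions.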
Clone the repository
Clone the repo and pull the large files tracked by git-lfs.
git clone https://github.com/index-tts/index-tts.git && cd index-tts
git lfs pull
Install dependencies
The project uses uv to manage its environment. Install all extras.
uv sync --all-extras
Download the model weights
Pull the IndexTTS-2 checkpoints from Hugging Face into a local checkpoints folder.
uv tool install "huggingface-hub[cli,hf_xet]"
hf download IndexTeam/IndexTTS-2 --local-dir=checkpoints
Synthesize speech in Python
Load the model from the checkpoints directory and pass a reference voice clip plus your text. The cloned voice is read from spk_audio_prompt and the result is written to output_path.
from indextts.infer_v2 import IndexTTS2

# Load the model from the downloaded checkpoints.
# The three optional acceleration flags are all switched off here.
tts = IndexTTS2(
    cfg_path="checkpoints/config.yaml",
    model_dir="checkpoints",
    use_fp16=False,
    use_cuda_kernel=False,
    use_deepspeed=False,
)

text = "Translate for me, what is a surprise!"

# Clone the voice in spk_audio_prompt and write the result to output_path.
tts.infer(
    spk_audio_prompt="examples/voice_01.wav",
    text=text,
    output_path="gen.wav",
    verbose=True,
)
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
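If the model fails to load, one thing worth ruling out is a missing or incomplete `checkpoints` folder. The sketch below only confirms that the directory and the `config.yaml` file referenced in the snippet above exist. The helper name `checkpoints_ready` is hypothetical, and the check does not verify the weight files themselves.

```python
from pathlib import Path


def checkpoints_ready(model_dir="checkpoints", cfg_name="config.yaml"):
    """Return True if model_dir exists and contains the config file."""
    model_dir = Path(model_dir)
    return model_dir.is_dir() and (model_dir / cfg_name).is_file()


if checkpoints_ready():
    print("checkpoints/config.yaml found")
else:
    print("checkpoints/config.yaml is missing; re-run the download step")
```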
When to use it
- Dub video or animation, using duration control to keep generated speech aligned with the original timing
- Clone a single speaker's voice from a short sample to narrate articles, audiobooks, or product walkthroughs
- Generate the same line in different emotional tones by swapping the style prompt while keeping one voice
- Prototype voice assistants or characters that need consistent timbre with adjustable emotional delivery
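The "same line, different emotional tones" case can be sketched as a small batch plan: one speaker clip, one text, and a list of style clips. The code below only builds keyword arguments and does not load the model. The keyword `emo_audio_prompt` is an assumption about how the style prompt is passed, since it does not appear in the snippet above, so confirm it against the repo before use. The `styles/*.wav` paths and the helper `plan_emotion_variants` are hypothetical.

```python
from pathlib import Path


def plan_emotion_variants(spk_audio, text, style_prompts, out_dir="out"):
    """Build one set of infer() keyword arguments per style prompt."""
    jobs = []
    for style in style_prompts:
        # Name each output after its style clip, e.g. out/sad.wav
        output_path = Path(out_dir) / f"{Path(style).stem}.wav"
        jobs.append({
            "spk_audio_prompt": spk_audio,  # same timbre for every job
            "emo_audio_prompt": style,      # ASSUMED keyword name
            "text": text,
            "output_path": output_path.as_posix(),
        })
    return jobs


jobs = plan_emotion_variants(
    "examples/voice_01.wav",
    "Translate for me, what is a surprise!",
    ["styles/happy.wav", "styles/sad.wav"],  # hypothetical style clips
)
for job in jobs:
    print(job["output_path"])
    # With a loaded model, each job would be run as: tts.infer(**job)
```

Keeping the plan separate from the model call makes it easy to review the output paths before spending GPU time on synthesis.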
How IndexTTS compares
IndexTTS compared with other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Whisper | ★ 103k | OpenAI's speech recognition model that transcribes and translates audio across many languages. |
| GPT-SoVITS | ★ 58.9k | An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning. |
| VibeVoice | ★ 49.5k | Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts. |
| Coqui TTS | ★ 45.6k | A library of text-to-speech models including the multilingual XTTS voice-cloning model. |
| ChatTTS | ★ 39.5k | ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody. |
| MockingBird | ★ 36.9k | An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line. |
| OpenVoice | ★ 36.7k | OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation. |
| IndexTTS | ★ 21.3k | Zero-shot text-to-speech with voice cloning and independent emotion control. |