AI/TLDR

IndexTTS

Zero-shot text-to-speech with voice cloning and independent emotion control

Overview

IndexTTS2 is a text-to-speech model from bilibili's Index team. You give it a short reference clip of a voice plus some text, and it speaks that text in the cloned voice without any per-speaker training (zero-shot). It runs as a Python package and ships its weights on Hugging Face and ModelScope.

The model is built for developers who need expressive, controllable speech. It separates speaker identity from emotion, so you can keep one person's timbre while changing the emotional tone from a separate style prompt. It also supports a duration-control mode that can fix the number of generated tokens, which helps when speech has to line up with video.

Within the speech and audio space, IndexTTS2 sits alongside other autoregressive zero-shot TTS systems. It targets cases where word accuracy, speaker similarity, and emotional fidelity all matter, such as dubbing and voiceover work.

What it does

  • Zero-shot voice cloning from a single short reference audio clip
  • Independent control of timbre and emotion, using separate speaker and style prompts
  • Two duration modes: precise token-count control, or free autoregressive generation that follows the prompt's prosody
  • Soft emotion control through natural-language text descriptions
  • Pre-trained weights distributed on Hugging Face and ModelScope
  • Python inference API (indextts.infer_v2.IndexTTS2) with optional FP16, CUDA kernel, and DeepSpeed flags

Getting started

IndexTTS2 is a Python project. You clone the repo, install dependencies with uv, download the model weights, then call the inference API. Make sure git, git-lfs, and uv are installed first.

Clone the repository

Clone the repo and pull the large files tracked by git-lfs.

bash
git clone https://github.com/index-tts/index-tts.git && cd index-tts
git lfs pull

Install dependencies

The project uses uv to manage its environment. Install all extras.

bash
uv sync --all-extras

Download the model weights

Pull the IndexTTS-2 checkpoints from Hugging Face into a local checkpoints folder.

bash
uv tool install "huggingface-hub[cli,hf_xet]"
hf download IndexTeam/IndexTTS-2 --local-dir=checkpoints
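
Before loading the model, it can help to confirm that the download landed where the Python example expects it. The following is a minimal sketch and not part of the project: the helper name missing_checkpoint_files is hypothetical, and it checks only for config.yaml, the one checkpoint file this page names explicitly. The other files the checkpoint needs are not listed here, so extend the tuple after checking the official repo.

```python
from pathlib import Path


def missing_checkpoint_files(model_dir, required=("config.yaml",)):
    """Return the required file names that are absent from model_dir.

    Hypothetical helper: only config.yaml is checked by default because it is
    the only checkpoint file named in this guide.
    """
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files("checkpoints")
    if missing:
        print("Missing from checkpoints/:", ", ".join(missing))
    else:
        print("checkpoints/ contains the expected files")
```

An empty list means every file in the tuple was found; anything else suggests re-running the download step.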

Synthesize speech in Python

Load the model from the checkpoints directory and pass a reference voice clip plus your text. The reference voice to clone is read from spk_audio_prompt, and the synthesized audio is written to output_path.

python
from indextts.infer_v2 import IndexTTS2

tts = IndexTTS2(
    cfg_path="checkpoints/config.yaml",
    model_dir="checkpoints",
    use_fp16=False,
    use_cuda_kernel=False,
    use_deepspeed=False,
)

text = "Translate for me, what is a surprise!"
tts.infer(
    spk_audio_prompt="examples/voice_01.wav",
    text=text,
    output_path="gen.wav",
    verbose=True,
)

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.
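
For narration work you will usually synthesize many lines with the same reference voice. The sketch below is one way to wrap the infer call shown above in a loop; it is not taken from the project. The function name synthesize_lines, the out directory, and the line_NNN.wav naming scheme are all hypothetical, and the only IndexTTS2 keywords it uses are the four that appear in the example above. It accepts any object with a compatible infer method, so it can be tried with a stand-in before the real model is loaded.

```python
from pathlib import Path


def synthesize_lines(tts, speaker_wav, lines, out_dir="out"):
    """Call tts.infer once per non-empty line and return the output paths.

    Hypothetical helper: `tts` is any object whose infer() accepts the
    keyword arguments used in the single-line example.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # skip blank lines; their index is left unused
        path = out / f"line_{index:03d}.wav"
        tts.infer(
            spk_audio_prompt=speaker_wav,  # same reference voice for every line
            text=text,
            output_path=str(path),
            verbose=False,
        )
        written.append(str(path))
    return written
```

With the model loaded as above, a call such as synthesize_lines(tts, "examples/voice_01.wav", script_lines) would write one WAV file per non-empty line. Because file numbers follow the position of each line in the input, skipped blank lines leave gaps in the numbering.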

When to use it

  • Dub video or animation, using duration control to keep generated speech aligned with the original timing
  • Clone a single speaker's voice from a short sample to narrate articles, audiobooks, or product walkthroughs
  • Generate the same line in different emotional tones by swapping the style prompt while keeping one voice
  • Prototype voice assistants or characters that need consistent timbre with adjustable emotional delivery
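
The emotion-swapping use case can be sketched as a small job planner that keeps the speaker prompt fixed and varies only the style prompt. This is an illustration under stated assumptions, not the project's documented workflow: the keyword emo_audio_prompt does not appear in the example on this page and is assumed here, so confirm the real parameter name in the official repo before relying on it. The helper name emotion_variant_jobs and the styles/*.wav paths are hypothetical.

```python
def emotion_variant_jobs(speaker_wav, text, style_prompts,
                         emotion_kwarg="emo_audio_prompt"):
    """Build one keyword-argument dict per emotional tone.

    ASSUMPTION: the style prompt is passed under `emotion_kwarg`; the default
    name is not confirmed by this guide, so verify it against the repo.
    """
    jobs = []
    for tone, style_wav in style_prompts.items():
        jobs.append({
            "spk_audio_prompt": speaker_wav,  # same timbre for every job
            emotion_kwarg: style_wav,         # assumed keyword name
            "text": text,
            "output_path": f"gen_{tone}.wav",
        })
    return jobs


jobs = emotion_variant_jobs(
    "examples/voice_01.wav",
    "Translate for me, what is a surprise!",
    {"sad": "styles/sad.wav", "happy": "styles/happy.wav"},  # hypothetical files
)
for job in jobs:
    print(job["output_path"], "<-", job["emo_audio_prompt"])
```

Once a model is loaded and the keyword name is confirmed, each job could be run with tts.infer(**job). Building the arguments separately from the call keeps the plan easy to inspect before any audio is generated.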

How IndexTTS compares

IndexTTS alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Whisper | ★ 103k | OpenAI's speech recognition model that transcribes and translates audio across many languages.
GPT-SoVITS | ★ 58.9k | An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning.
VibeVoice | ★ 49.5k | Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts.
Coqui TTS | ★ 45.6k | A library of text-to-speech models including the multilingual XTTS voice-cloning model.
ChatTTS | ★ 39.5k | ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody.
MockingBird | ★ 36.9k | An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line.
OpenVoice | ★ 36.7k | OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation.
IndexTTS | ★ 21.3k | Zero-shot text-to-speech with voice cloning and independent emotion control