AI/TLDR

Coqui TTS

Open-source text-to-speech with multilingual voice cloning

Overview

Coqui TTS is a Python library for generating speech from text. It ships a collection of pretrained models, including XTTS v2, an end-to-end model that clones a voice from a short audio clip and speaks in 16 languages. You can run it from a Python script or the command line.

It is aimed at developers and researchers who want to add speech synthesis to their own projects, or who need to train and fine-tune their own voice models. Alongside inference, it includes tools for model training, dataset analysis, and dataset curation.

Within the speech and audio space, it sits among local, self-hosted TTS options. You download the models and run them on your own machine (CPU or GPU), so audio never has to leave your environment.

What it does

  • Pretrained text-to-speech models covering a wide range of languages, downloadable and runnable locally
  • XTTS v2 end-to-end model for voice cloning from a short reference clip, with streaming support at under 200 ms latency
  • Many model families included: Tacotron/Tacotron2, Glow-TTS, VITS, YourTTS, plus Tortoise and Bark for inference
  • Python API and a `tts` command-line tool for synthesis
  • Training and fine-tuning support through a Trainer API, with terminal and TensorBoard logs
  • Utilities for multi-speaker TTS, speaker embeddings, and dataset curation
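
The streaming latency figure above depends on hardware, so it is worth measuring on your own machine. The sketch below times how long any chunk iterator takes to deliver its first chunk. It is deliberately generic: `first_chunk_latency` and `fake_stream` are hypothetical names introduced here, and the fake stream stands in for a real audio generator. In real use you would pass the chunk generator from XTTS's streaming inference (the project's docs describe an `inference_stream` method on the XTTS model class; treat that name as an assumption and check the official docs).

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def first_chunk_latency(chunks: Iterable[T]) -> Tuple[float, Iterator[T]]:
    """Return (seconds until the first chunk arrived, iterator over all chunks)."""
    iterator = iter(chunks)
    start = time.perf_counter()
    try:
        # Generators are lazy, so the work for the first chunk happens here.
        first = next(iterator)
    except StopIteration:
        raise ValueError("the stream produced no chunks") from None
    latency = time.perf_counter() - start

    def replay() -> Iterator[T]:
        # Hand the first chunk back so the caller still sees the full stream.
        yield first
        yield from iterator

    return latency, replay()


def fake_stream(delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in that mimics a slow audio stream."""
    for i in range(n_chunks):
        time.sleep(delay_s)
        yield bytes([i]) * 4


if __name__ == "__main__":
    latency, stream = first_chunk_latency(fake_stream())
    print(f"first chunk after {latency * 1000:.0f} ms")
    print(f"received {sum(1 for _ in stream)} chunks")
```

Because the helper returns a replaying iterator, measuring latency does not drop the first chunk of audio.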

Getting started

Install the package from PyPI, then synthesize speech from Python or the command line.

Install Coqui TTS

Install the library with pip. A GPU is optional but speeds up the larger models.

bash
pip install TTS
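
After installing, a quick check can confirm that the package is importable and show which device PyTorch would pick. This is a minimal sketch: `describe_environment` is a hypothetical helper, and it assumes the import name is `TTS` (as in the Python example on this page) and that the distribution name matches the `pip install TTS` command.

```python
import importlib.util
from importlib import metadata


def describe_environment() -> dict:
    """Report whether the TTS package is importable and which device PyTorch would use."""
    info = {
        "tts_installed": importlib.util.find_spec("TTS") is not None,
        "tts_version": None,
        "device": "cpu",
    }
    if info["tts_installed"]:
        try:
            info["tts_version"] = metadata.version("TTS")
        except metadata.PackageNotFoundError:
            pass  # importable, but installed under a different distribution name
    if importlib.util.find_spec("torch") is not None:
        import torch  # only imported when it is actually installed

        info["device"] = "cuda" if torch.cuda.is_available() else "cpu"
    return info


if __name__ == "__main__":
    print(describe_environment())
```

A result with `"device": "cpu"` is still usable, since the GPU is optional; the larger models will simply run more slowly.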

Clone a voice with XTTS v2 in Python

Load the multilingual XTTS v2 model, pass a short reference clip of the voice you want to clone, and write the result to a WAV file.

python
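import torch
from TTS.api import TTS

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pretrained multilingual XTTS v2 model (downloaded on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
# Clone the voice in speaker_wav and write the spoken text to output.wav.
tts.tts_to_file(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav")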
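
Loading the model is the slow part, so for several sentences or languages it may help to load it once and loop. The sketch below extends the example above under a few assumptions: `plan_outputs` and `clone_batch` are hypothetical helpers, the `cloned` output directory and file naming are arbitrary, and the language codes `es` and `fr` are assumed to be among the languages XTTS v2 supports.

```python
from pathlib import Path
from typing import List, Tuple

MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def plan_outputs(jobs: List[Tuple[str, str]], out_dir: str) -> List[Tuple[str, str, Path]]:
    """Pair each (language, text) job with a numbered WAV path inside out_dir."""
    base = Path(out_dir)
    return [
        (lang, text, base / f"{i:02d}_{lang}.wav")
        for i, (lang, text) in enumerate(jobs, start=1)
    ]


def clone_batch(jobs: List[Tuple[str, str]], speaker_wav: str, out_dir: str = "cloned") -> List[Path]:
    """Load XTTS v2 once, then synthesize every job with the same reference voice."""
    # Imported lazily so plan_outputs stays usable without the heavy dependencies.
    import torch
    from TTS.api import TTS

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS(MODEL_NAME).to(device)
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    written = []
    for lang, text, path in plan_outputs(jobs, out_dir):
        tts.tts_to_file(text=text, speaker_wav=speaker_wav, language=lang, file_path=str(path))
        written.append(path)
    return written


# Example call (requires the TTS package, the model download, and a real reference clip):
# jobs = [("en", "Hello world!"), ("es", "Hola mundo."), ("fr", "Bonjour le monde.")]
# clone_batch(jobs, speaker_wav="my/cloning/audio.wav")
```

Keeping the path planning separate from synthesis makes it possible to preview the file names before committing to a long run.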

Synthesize from the command line

List the available models, then generate speech with the default model and write it to a file.

bash
tts --list_models
tts --text "Text for TTS" --out_path output/path/speech.wav
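
If you prefer to drive the command-line tool from a script, one option is to build the argument list in Python and run it with `subprocess`. This is a sketch, not the project's recommended workflow: `build_tts_command` and `run_tts` are hypothetical helpers, and the `--model_name`, `--speaker_wav`, and `--language_idx` flags are taken from the project's README rather than from this page, so confirm them with `tts --help`.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_tts_command(
    text: str,
    out_path: str,
    model_name: Optional[str] = None,
    speaker_wav: Optional[str] = None,
    language: Optional[str] = None,
) -> List[str]:
    """Build an argument list for the `tts` CLI; optional flags are added only when set."""
    cmd = ["tts", "--text", text, "--out_path", out_path]
    if model_name:
        cmd += ["--model_name", model_name]
    if speaker_wav:
        cmd += ["--speaker_wav", speaker_wav]
    if language:
        cmd += ["--language_idx", language]
    return cmd


def run_tts(text: str, out_path: str, **options) -> Path:
    """Create the output directory, run the CLI, and return the output path."""
    target = Path(out_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    subprocess.run(build_tts_command(text, out_path, **options), check=True)
    return target


# Example call (requires the `tts` command on PATH):
# run_tts("Text for TTS", "output/path/speech.wav")
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters.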

Commands and code are distilled from the project's own documentation. Always check the official repo for the latest.

When to use it

  • Clone a voice from a short clip and have it read text aloud in another language
  • Add offline, self-hosted text-to-speech to an app so audio stays on your own machine
  • Fine-tune or train a custom voice model on your own dataset
  • Generate voiceovers or narration for videos, demos, and accessibility tooling

How Coqui TTS compares

Coqui TTS alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Whisper | ★ 103k | OpenAI's speech recognition model that transcribes and translates audio across many languages.
GPT-SoVITS | ★ 58.9k | An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning.
VibeVoice | ★ 49.5k | Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts.
Coqui TTS | ★ 45.6k | Open-source text-to-speech with multilingual voice cloning.
ChatTTS | ★ 39.5k | ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody.
MockingBird | ★ 36.9k | An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line.
OpenVoice | ★ 36.7k | OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation.
VoxCPM | ★ 31k | An open-source text-to-speech system that generates natural multilingual speech, designs voices from text descriptions, and clones any voice from a short clip.