Overview
Coqui TTS is a Python library for generating speech from text. It ships a collection of pretrained models, including XTTS v2, an end-to-end model that clones a voice from a short audio clip and speaks in 16 languages. You can run it from a Python script or the command line.
It is aimed at developers and researchers who want to add speech synthesis to their own projects, or who need to train and fine-tune their own voice models. Alongside inference, it includes tools for model training, dataset analysis, and dataset curation.
Within the speech and audio space, it sits among local, self-hosted TTS options. You download the models and run them on your own machine (CPU or GPU), so audio never has to leave your environment.
What it does
- Pretrained text-to-speech models covering a large range of languages, downloadable and runnable locally
- XTTS v2 end-to-end model for voice cloning from a short reference clip, with streaming support at under 200 ms latency
- Many model families included: Tacotron/Tacotron2, Glow-TTS, VITS, YourTTS, plus Tortoise and Bark for inference
- Python API and a `tts` command-line tool for synthesis
- Training and fine-tuning support through a Trainer API, with terminal and TensorBoard logs
- Utilities for multi-speaker TTS, speaker embeddings, and dataset curation
Getting started
Install the package from PyPI, then synthesize speech from Python or the command line.
Install Coqui TTS
Install the library with pip. A GPU is optional but speeds up the larger models.
pip install TTS
Clone a voice with XTTS v2 in Python
Load the multilingual XTTS v2 model, pass a short reference clip to clone, and write the result to a WAV file.
import torch
from TTS.api import TTS
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
tts.tts_to_file(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav")
Synthesize from the command line
List the available models, then generate speech with the default model and write it to a file.
tts --list_models
tts --text "Text for TTS" --out_path output/path/speech.wav
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
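If you want to drive the `tts` command from a script, the command-line call above can be wrapped in a small Python helper. This is a minimal sketch, not part of Coqui TTS: `build_tts_command` and `synthesize` are hypothetical names, and the optional `--model_name` flag does not appear in the commands above, so confirm it with `tts --help` before relying on it.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tts_command(text: str, out_path: str, model_name: Optional[str] = None) -> List[str]:
    """Build the argument list for the `tts` CLI (hypothetical helper)."""
    cmd = ["tts", "--text", text, "--out_path", out_path]
    if model_name is not None:
        # Assumption: --model_name selects a specific model; omit it for the default.
        cmd += ["--model_name", model_name]
    return cmd


def synthesize(text: str, out_path: str, model_name: Optional[str] = None) -> None:
    """Run the `tts` CLI, failing early if it is not on PATH."""
    if shutil.which("tts") is None:
        raise RuntimeError("`tts` not found on PATH; install it with `pip install TTS`")
    # Passing a list instead of a shell string avoids quoting problems in the text.
    subprocess.run(build_tts_command(text, out_path, model_name), check=True)


if __name__ == "__main__":
    # Only prints the command; call synthesize(...) to actually generate audio.
    print(build_tts_command("Text for TTS", "output/path/speech.wav"))
```

Keeping command construction separate from execution lets you inspect or log the exact call without loading a model.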
When to use it
- Clone a voice from a short clip and have it read text aloud in another language
- Add offline, self-hosted text-to-speech to an app so audio stays on your own machine
- Fine-tune or train a custom voice model on your own dataset
- Generate voiceovers or narration for videos, demos, and accessibility tooling
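The first use case, cloning one voice and having it speak several languages, might look roughly like the sketch below. It reuses the `TTS(...)` and `tts_to_file(...)` calls from the Getting started example. The helper names (`plan_outputs`, `clone_across_languages`), the output file naming, and the language codes in the demo are assumptions for illustration, so check the XTTS v2 model card for the supported codes.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def plan_outputs(texts: Dict[str, str], out_dir: str) -> List[Tuple[str, str, Path]]:
    """Return (language, text, output path) per entry, sorted by language code."""
    base = Path(out_dir)
    return [(lang, texts[lang], base / f"clone_{lang}.wav") for lang in sorted(texts)]


def clone_across_languages(texts: Dict[str, str], speaker_wav: str, out_dir: str) -> List[Path]:
    """Speak each text in its language using one reference voice (hypothetical helper)."""
    # Imported lazily so plan_outputs works even where TTS is not installed.
    import torch
    from TTS.api import TTS

    if not Path(speaker_wav).is_file():
        raise FileNotFoundError(f"reference clip not found: {speaker_wav}")

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Load the model once and reuse it for every language.
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for lang, text, path in plan_outputs(texts, out_dir):
        tts.tts_to_file(text=text, speaker_wav=speaker_wav, language=lang, file_path=str(path))
        written.append(path)
    return written


if __name__ == "__main__":
    # Language codes here are assumptions; verify them against the model card.
    samples = {"en": "Hello world!", "fr": "Bonjour le monde !", "de": "Hallo Welt!"}
    for lang, _, path in plan_outputs(samples, "output"):
        print(lang, path)
```

Loading the model once outside the loop matters because model initialization is typically far slower than synthesizing a single sentence.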
How Coqui TTS compares
Coqui TTS alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Whisper | ★ 103k | OpenAI's speech recognition model that transcribes and translates audio across many languages. |
| GPT-SoVITS | ★ 58.9k | An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning. |
| VibeVoice | ★ 49.5k | Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts. |
| Coqui TTS | ★ 45.6k | Open-source text-to-speech with multilingual voice cloning. |
| ChatTTS | ★ 39.5k | ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody. |
| MockingBird | ★ 36.9k | An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line. |
| OpenVoice | ★ 36.7k | OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation. |
| VoxCPM | ★ 31k | An open-source text-to-speech system that generates natural multilingual speech, designs voices from text descriptions, and clones any voice from a short clip. |