Coqui TTS

Open-source text-to-speech with multilingual voice cloning

Overview

Coqui TTS is a Python library for generating speech from text. It ships a collection of pretrained models, including XTTS v2, an end-to-end model that clones a voice from a short audio clip and speaks in 16 languages. You can run it from a Python script or the command line.

It is aimed at developers and researchers who want to add speech synthesis to their own projects, or who need to train and fine-tune their own voice models. Alongside inference, it includes tools for model training, dataset analysis, and dataset curation.

Coqui TTS is a local, self-hosted TTS option: you download the models and run them on your own machine (CPU or GPU), so audio never has to leave your environment.

What it does

Pretrained text-to-speech models covering a large range of languages, downloadable and runnable locally
XTTS v2 end-to-end model for voice cloning from a short reference clip, with streaming support at under 200ms latency
Many model families included: Tacotron/Tacotron2, Glow-TTS, VITS, YourTTS, plus Tortoise and Bark for inference
Python API and a `tts` command-line tool for synthesis
Training and fine-tuning support through a Trainer API, with terminal and TensorBoard logs
Utilities for multi-speaker TTS, speaker embeddings, and dataset curation
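
Pretrained models are addressed by slash-separated identifiers such as `tts_models/multilingual/multi-dataset/xtts_v2`, the ID used in the Getting started section. The sketch below assumes such IDs always have four parts in the order type/language/dataset/name. `ModelId` and `parse_model_id` are hypothetical helpers written for this illustration; they are not part of the Coqui TTS API.

```python
from typing import NamedTuple


class ModelId(NamedTuple):
    """The four parts of a model identifier (assumed layout, see lead-in)."""
    model_type: str
    language: str
    dataset: str
    name: str


def parse_model_id(model_id: str) -> ModelId:
    """Split 'type/language/dataset/name' into its four named parts."""
    parts = model_id.split("/")
    # Reject IDs that do not have exactly four non-empty segments.
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected 'type/language/dataset/name', got {model_id!r}")
    return ModelId(*parts)
```

For example, `parse_model_id("tts_models/multilingual/multi-dataset/xtts_v2").name` gives `"xtts_v2"`, which can be handy for filtering a model list by language or dataset.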

Getting started

Install the package from PyPI, then synthesize speech from Python or the command line.

Install Coqui TTS

Install the library with pip. A GPU is optional but speeds up the larger models.

```bash
pip install TTS
```
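
To confirm the install succeeded, one option is to ask Python's packaging metadata for the installed version. This is a minimal sketch that assumes the distribution is registered under the PyPI name `TTS`, matching the pip command above. `installed_version` is a hypothetical helper, not something the library provides.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "TTS") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(f"Coqui TTS {version}" if version else "Coqui TTS is not installed")
```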

Clone a voice with XTTS v2 in Python

Load the multilingual XTTS v2 model, pass a short reference clip to clone, and write the result to a WAV file.

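```python
import torch
from TTS.api import TTS

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the multilingual XTTS v2 model.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
# Clone the voice in the reference clip and write the result to a WAV file.
tts.tts_to_file(text="Hello world!", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav")
```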
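
Because cloning depends on the reference clip, it can help to inspect the clip before passing it as `speaker_wav`. The sketch below uses only the standard-library `wave` module to measure a PCM WAV file's duration. The 3 to 30 second bounds are illustrative assumptions, not limits documented by Coqui TTS, and both helper functions are hypothetical.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str, min_seconds: float = 3.0,
                        max_seconds: float = 30.0) -> bool:
    """Check that the clip length falls inside the assumed bounds.

    The default bounds are placeholders for illustration; tune them
    against the model's own guidance.
    """
    return min_seconds <= wav_duration_seconds(path) <= max_seconds
```

A caller might run `is_usable_reference("my/cloning/audio.wav")` first and skip synthesis, or warn the user, when it returns `False`.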

Synthesize from the command line

List the available models, then generate speech with the default model and write it to a file.

```bash
tts --list_models
tts --text "Text for TTS" --out_path output/path/speech.wav
```
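
The same command can be driven from a script. This is a minimal sketch that assumes the `tts` executable is on your `PATH`. It uses the `--text` and `--out_path` flags shown above plus the CLI's `--model_name` flag to pick a specific model. `build_tts_command` and `synthesize` are hypothetical wrapper names chosen for this example.

```python
import subprocess
from typing import List, Optional


def build_tts_command(text: str, out_path: str,
                      model_name: Optional[str] = None) -> List[str]:
    """Build the argument list for one `tts` CLI invocation."""
    cmd = ["tts", "--text", text, "--out_path", out_path]
    if model_name is not None:
        # Omit the flag entirely to let the CLI use its default model.
        cmd += ["--model_name", model_name]
    return cmd


def synthesize(text: str, out_path: str, model_name: Optional[str] = None) -> None:
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_tts_command(text, out_path, model_name), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the text contains spaces or quotation marks.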

Commands and code are distilled from the project's own documentation. Always check the official repo for the latest version.

When to use it

Clone a voice from a short clip and have it read text aloud in another language
Add offline, self-hosted text-to-speech to an app so audio stays on your own machine
Fine-tune or train a custom voice model on your own dataset
Generate voiceovers or narration for videos, demos, and accessibility tooling
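
The first use case, one cloned voice reading text in several languages, can be sketched by reusing the `tts_to_file` call from Getting started in a loop. The output directory, file naming scheme, and both function names are assumptions made for this illustration. The language codes you pass must be ones the model supports.

```python
from pathlib import Path
from typing import Dict, List


def output_path_for(language: str, out_dir: str = "narration") -> Path:
    """Hypothetical naming scheme: one WAV file per language code."""
    return Path(out_dir) / f"speech_{language}.wav"


def narrate_in_languages(texts_by_language: Dict[str, str], speaker_wav: str,
                         out_dir: str = "narration") -> List[Path]:
    """Synthesize each text in its language using one cloned voice."""
    # Imported here so the path helper works without the heavy dependencies.
    import torch
    from TTS.api import TTS

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    written = []
    for language, text in texts_by_language.items():
        target = output_path_for(language, out_dir)
        tts.tts_to_file(text=text, speaker_wav=speaker_wav,
                        language=language, file_path=str(target))
        written.append(target)
    return written
```

Loading the model once outside the loop matters here, since model loading is typically far slower than synthesizing a short sentence.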

How Coqui TTS compares

Coqui TTS alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Whisper	★ 103k	OpenAI's speech recognition model that transcribes and translates audio across many languages.
GPT-SoVITS	★ 58.9k	An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning.
VibeVoice	★ 49.5k	Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts.
Coqui TTS	★ 45.6k	Open-source text-to-speech with multilingual voice cloning
ChatTTS	★ 39.5k	ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody.
MockingBird	★ 36.9k	An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line.
OpenVoice	★ 36.7k	OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation.
VoxCPM	★ 31k	An open-source text-to-speech system that generates natural multilingual speech, designs voices from text descriptions, and clones any voice from a short clip.
