Chatterbox

Open-source text-to-speech and voice cloning that runs on a single GPU

github.com/resemble-ai/chatterbox · ★ 25.1k · resemble.ai

Overview

Chatterbox is a family of open-source text-to-speech (TTS) models from Resemble AI. You give it text and a short reference clip of a voice, and it generates speech that matches that voice. It runs on a single GPU, so you can use it locally without a cloud service.

The family covers different needs. Chatterbox-Turbo is a smaller 350M-parameter model aimed at low-latency English voice agents, and it supports paralinguistic tags like [laugh] and [cough] for more realistic speech. Chatterbox Multilingual V3 is a 500M model that handles 23+ languages with steadier speaker similarity across them, and there are single-language finetunes for priority languages.

It suits developers who want zero-shot voice cloning they can run themselves, for example in voice agents, narration, or localization workflows. A hosted Resemble AI service is available if you later need to scale beyond local hardware.

What it does

Zero-shot voice cloning from a short (around 10s) reference clip
Multilingual generation across 23+ languages with Chatterbox Multilingual V3
Chatterbox-Turbo: a 350M model for low-latency English voice agents
Paralinguistic tags such as [cough], [laugh], and [chuckle] (native in Turbo)
Runs on a single GPU, with device options for cuda, cpu, or mps
Installable from PyPI (chatterbox-tts) or from source
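
The device options listed above can be chosen at runtime rather than hard-coded. The following is a minimal sketch, assuming PyTorch is installed alongside Chatterbox; pick_device and detect_device are hypothetical helper names and are not part of the chatterbox package.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple-silicon MPS, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends are usable (assumes torch is importable)."""
    import torch  # imported lazily so pick_device works without torch

    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)
```

The string returned by detect_device() can be passed as the device argument when loading a model.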

Getting started

Install the package from PyPI, then load a model and generate speech from text plus a reference voice clip.

Install Chatterbox

Install the package from PyPI.

bash

pip install chatterbox-tts

Install from source (optional)

If you want to edit the code or pin dependencies yourself, clone the repo and install in editable mode. Chatterbox was developed and tested on Python 3.11.

bash

git clone https://github.com/resemble-ai/chatterbox.git
cd chatterbox
pip install -e .

Generate English speech

Load the standard model and synthesize a WAV file. To clone a voice, pass the path of a reference clip as audio_prompt_path.

python

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

device = "cuda"  # or "cpu" / "mps"
model = ChatterboxTTS.from_pretrained(device=device)

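# Synthesize the text in the voice of the reference clip
wav = model.generate("Hello there!", audio_prompt_path="your_10s_ref_clip.wav")
# Write the result at the model's own sample rate
ta.save("test.wav", wav, model.sr)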
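
Because voice cloning works from a short reference clip of around 10 seconds, it can help to check a clip's length before passing it to the model. The following is a sketch using only Python's standard wave module, which reads PCM WAV files; the function names and the 5 to 15 second window are assumptions for illustration, not limits stated by the project.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def is_reasonable_reference(path: str, min_s: float = 5.0, max_s: float = 15.0) -> bool:
    """True when the clip length falls inside the window.

    The 5-15 s defaults are an assumed range around the roughly 10 s
    clip length mentioned above, not a documented requirement.
    """
    return min_s <= wav_duration_seconds(path) <= max_s
```

For example, is_reasonable_reference("your_10s_ref_clip.wav") would return True for a 10-second clip.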

Use the Turbo model with tags

For low-latency English with paralinguistic tags, load the Turbo model instead.

python

import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

model = ChatterboxTurboTTS.from_pretrained(device="cuda")
text = "Have you got one minute to chat [chuckle]?"
wav = model.generate(text, audio_prompt_path="your_10s_ref_clip.wav")
ta.save("test-turbo.wav", wav, model.sr)
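
A misspelled tag is easy to miss in a long script, so a quick check of the bracketed tags in the input text can be useful before generation. The following is a sketch only: KNOWN_TAGS holds just the three tags named on this page, the full set Turbo supports is not listed here, and find_tags and unknown_tags are hypothetical helpers rather than part of the chatterbox package.

```python
import re

# Only the tags named on this page; the complete supported set is an assumption.
KNOWN_TAGS = {"laugh", "cough", "chuckle"}

# Matches lowercase words in square brackets, e.g. "[chuckle]".
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def find_tags(text: str) -> list[str]:
    """Return the tag names found in the text, in order of appearance."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text: str) -> list[str]:
    """Return tags that are not in KNOWN_TAGS, e.g. typos."""
    return [tag for tag in find_tags(text) if tag not in KNOWN_TAGS]
```

Calling unknown_tags(text) before model.generate(text, ...) lets you review anything unexpected; extend KNOWN_TAGS from the official repo's list.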

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

Build a low-latency English voice agent that reads back responses in a chosen voice
Clone a voice from a short clip to narrate articles, videos, or audiobooks
Localize spoken content across many languages while keeping a consistent speaker identity
Add expressive cues like [laugh] or [cough] to make generated speech sound more natural

How Chatterbox compares

Chatterbox alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Whisper	★ 103k	OpenAI's speech recognition model that transcribes and translates audio across many languages.
GPT-SoVITS	★ 58.9k	An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning.
VibeVoice	★ 49.5k	Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts.
Coqui TTS	★ 45.6k	A library of text-to-speech models including the multilingual XTTS voice-cloning model.
ChatTTS	★ 39.5k	ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody.
MockingBird	★ 36.9k	An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line.
OpenVoice	★ 36.7k	OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation.
Chatterbox	★ 25.1k	Open-source text-to-speech and voice cloning that runs on a single GPU.
