Overview
Fish Speech is an open text-to-speech and voice-cloning system from Fish Audio. The current model, Fish Audio S2 Pro, has 4B parameters and was trained on a large multilingual audio set covering more than 80 languages. It reads text aloud and can clone a voice from a short reference clip.
It is built for developers and researchers who want to add speech to their own apps without calling a hosted API. You can run it from the command line, through a Gradio WebUI, or as an API server, and the model weights are published on Hugging Face.
Within the speech and audio space, Fish Speech stands out for fine-grained control: you embed natural-language tags like [whisper], [excited], or [angry] directly in the text to shape prosody and emotion, and it can handle multi-speaker, multi-turn dialogue.
What it does
- Multilingual TTS covering more than 80 languages, trained on a large audio corpus
- Voice cloning from a short reference clip plus its matching transcript
- Inline emotion and prosody tags such as [whisper], [excited], [pause], and [laughing]
- Multi-speaker and multi-turn conversation generation
- Run it your way: command-line inference, a Gradio WebUI, or an API server
- Docker Compose profiles for WebUI and server, including CPU-only and AMD ROCm setups
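Because the emotion and prosody tags live inside the input text itself, it can help to see which tags a script uses before sending it to the model. The sketch below is only an illustration: `list_tags` is a hypothetical helper, not part of Fish Speech, and it assumes tags are lowercase words in square brackets, as in the examples above.

```shell
# Hypothetical helper: list the distinct bracketed tags used in some input text.
# Assumption: a tag is one or more lowercase words inside square brackets.
list_tags() {
  printf '%s\n' "$1" | grep -oE '\[[a-z]+( [a-z]+)*\]' | sort -u
}

# Sample input using tags named in this overview.
text='[whisper] Keep your voice down. [pause] [excited] We did it! [pause]'

# Prints each tag once, one per line.
list_tags "$text"
```

The text passed to the model stays unchanged; the helper only reads it, so it is safe to run on any script before generation.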
Getting started
Fish Speech runs on Linux or WSL, and a GPU with about 24 GB of memory is recommended. The simplest path is a conda environment; the project also publishes Docker Compose profiles.
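A quick way to compare your machine against the GPU recommendation is to read the total memory that `nvidia-smi` reports. This is a rough sketch for NVIDIA cards only: `has_recommended_vram` is a hypothetical helper, and the 24000 MiB cut-off is an assumption standing in for "about 24GB".

```shell
# Hypothetical helper: succeed if the given GPU memory (in MiB) is roughly 24 GB or more.
# Assumption: 24000 MiB is used as the cut-off for "about 24GB".
has_recommended_vram() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;  # empty or non-numeric input counts as "not enough"
  esac
  [ "$1" -ge 24000 ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # Total memory of the first GPU, in MiB, without the unit suffix.
  mib="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null | head -n 1)"
else
  mib=""
fi

if has_recommended_vram "$mib"; then
  echo "GPU memory meets the recommendation (${mib} MiB)"
else
  echo "GPU memory is below the recommendation or could not be detected"
fi
```

A result below the recommendation does not necessarily mean the model will not run; it only flags that the setup differs from what the project suggests.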
Install system prerequisites
Install the audio libraries the build depends on.
apt install portaudio19-dev libsox-dev ffmpeg
Create the environment and install
Set up Python 3.12 and install the package with the CUDA extra (use cu126/cu128 for other CUDA versions, or .[cpu] for CPU-only).
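Choosing the extra can be sketched as a small lookup from CUDA version to extra name. `pick_extra` is a hypothetical helper, the mapping covers only the extras named in this section, and falling back to `cpu` for anything else is an assumption. The sketch prints the install command rather than running it, and quotes the `.[...]` argument so that shells such as zsh do not treat the brackets as a glob pattern.

```shell
# Hypothetical helper: map a CUDA version string to one of the extras named above.
# Assumption: any version not listed falls back to the CPU-only extra.
pick_extra() {
  case "$1" in
    12.6*) echo cu126 ;;
    12.8*) echo cu128 ;;
    12.9*) echo cu129 ;;
    *)     echo cpu ;;
  esac
}

# Pass in the CUDA version your system reports; 12.8 is just a sample value.
extra="$(pick_extra "12.8")"

# Print the resulting install command instead of executing it.
printf 'pip install -e ".[%s]"\n' "$extra"
```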
conda create -n fish-speech python=3.12
conda activate fish-speech
pip install -e .[cu129]
Launch the WebUI
Start the Gradio interface to generate speech in the browser, or use Docker Compose instead.
python tools/run_webui.py
Run a quick start with Docker
If you prefer containers, bring up the WebUI profile with Docker Compose.
docker compose --profile webui up
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
When to use it
- Add natural-sounding narration or in-app voice to a product without using a hosted TTS API
- Clone a specific voice from a short sample to keep a consistent character or brand voice
- Generate expressive, emotion-tagged dialogue for games, audiobooks, or video
- Produce multilingual voiceovers across the 80+ supported languages
How Fish Speech compares
Fish Speech alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Whisper | ★ 103k | OpenAI's speech recognition model that transcribes and translates audio across many languages. |
| GPT-SoVITS | ★ 58.9k | An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning. |
| VibeVoice | ★ 49.5k | Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts. |
| Coqui TTS | ★ 45.6k | A library of text-to-speech models including the multilingual XTTS voice-cloning model. |
| ChatTTS | ★ 39.5k | An open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody. |
| MockingBird | ★ 36.9k | An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line. |
| OpenVoice | ★ 36.7k | Clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation. |
| Fish Speech | ★ 30.9k | Multilingual text-to-speech and voice cloning with inline emotion control. |