AI/TLDR

GPT-SoVITS

Few-shot voice cloning and TTS WebUI from one minute of audio

Overview

GPT-SoVITS is a voice cloning and text-to-speech tool with a built-in WebUI. In zero-shot mode, you give it a 5-second vocal sample and it speaks new text in that voice without any training. In few-shot mode, you fine-tune the model on about one minute of audio to get closer voice similarity and more natural results.

It also handles cross-lingual speech, so a voice trained on one language can read text in another. The project currently supports English, Japanese, Korean, Cantonese, and Chinese. The WebUI bundles helper tools for preparing training data, which makes it approachable for people who are not machine-learning experts.

What it does

  • Zero-shot TTS: generate speech in a new voice from just a 5-second sample
  • Few-shot fine-tuning: train on roughly one minute of audio for better voice similarity
  • Cross-lingual inference across English, Japanese, Korean, Cantonese, and Chinese
  • WebUI dataset tools: vocal/accompaniment separation, automatic segmentation, Chinese ASR, and text labeling
  • Runs on NVIDIA CUDA, AMD ROCm, Apple silicon (MPS), or plain CPU
  • Windows integrated package and Docker images for quick setup

Getting started

GPT-SoVITS runs in a conda environment with Python 3.10. Windows users can grab the integrated package, while Linux and macOS users install via the provided script. After setup, you launch the WebUI to clone voices and generate speech.

Create the conda environment

Set up a fresh Python 3.10 environment and activate it before installing.

conda create -n GPTSoVits python=3.10
conda activate GPTSoVits
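
After activating the environment, it can be worth confirming that the active interpreter really is Python 3.10 before running the installer. The snippet below is a minimal sketch, not part of the project: `is_py310` is a hypothetical helper name, and it assumes `python` on your PATH is the one from the activated conda environment.

```bash
# Hypothetical helper (not part of GPT-SoVITS): succeeds only when the given
# version string, as printed by `python --version`, is a Python 3.10 release.
is_py310() {
  case "$1" in
    "Python 3.10"|"Python 3.10."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumes `python` resolves to the interpreter of the activated environment.
if is_py310 "$(python --version 2>&1)"; then
  echo "Python 3.10 is active"
else
  echo "Active interpreter is not Python 3.10; re-run: conda activate GPTSoVits"
fi
```

Matching on the `3.10.` prefix rather than a plain `3.1` substring avoids accepting versions such as 3.11 or 3.12 by accident.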

Install on Linux

Run the install script, choosing your device and the source to download models from. Add --download-uvr5 if you want the vocal-separation models.

bash install.sh --device <CU126|CU128|ROCM|CPU> --source <HF|HF-Mirror|ModelScope> [--download-uvr5]
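
The angle-bracket placeholders above each take exactly one of the listed values. As an illustration only, the sketch below fills them in and prints the resulting command without running it. `build_install_cmd` is a hypothetical helper, not part of the project, and the set of accepted values is simply copied from the usage line above; check the official repo in case the options have changed.

```bash
# Hypothetical helper (not part of GPT-SoVITS): validates the chosen values
# against the options listed in the usage line and prints the install command.
# It only prints the command; it does not execute install.sh.
build_install_cmd() {
  device="$1"
  model_source="$2"
  uvr5="${3:-no}"   # pass "yes" to include the vocal-separation models

  case "$device" in
    CU126|CU128|ROCM|CPU) ;;
    *) echo "unknown device: $device" >&2; return 1 ;;
  esac

  case "$model_source" in
    HF|HF-Mirror|ModelScope) ;;
    *) echo "unknown source: $model_source" >&2; return 1 ;;
  esac

  cmd="bash install.sh --device $device --source $model_source"
  if [ "$uvr5" = "yes" ]; then
    cmd="$cmd --download-uvr5"
  fi
  printf '%s\n' "$cmd"
}

# Example: an NVIDIA GPU build, models from Hugging Face, with the UVR5 models.
build_install_cmd CU128 HF yes
# prints: bash install.sh --device CU128 --source HF --download-uvr5
```

Printing the command first makes it easy to check a typo in the device or source name before starting a long download.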

Start on Windows with the integrated package

Windows users (tested on Windows 10 and newer) can download the integrated package and double-click go-webui.bat to launch the GPT-SoVITS WebUI without manual setup.

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Clone a voice from a short sample to narrate scripts, videos, or audiobooks
  • Fine-tune a custom voice on about a minute of audio for a personal TTS model
  • Produce cross-lingual voiceovers, reading text in a language the voice was not trained on
  • Build a labeled training dataset using the WebUI's separation, segmentation, and ASR tools

How GPT-SoVITS compares

GPT-SoVITS alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Whisper | ★ 103k | OpenAI's speech recognition model that transcribes and translates audio across many languages.
GPT-SoVITS | ★ 58.9k | Few-shot voice cloning and TTS WebUI from one minute of audio
VibeVoice | ★ 49.5k | Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts.
Coqui TTS | ★ 45.6k | A library of text-to-speech models including the multilingual XTTS voice-cloning model.
ChatTTS | ★ 39.5k | ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody.
MockingBird | ★ 36.9k | An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line.
OpenVoice | ★ 36.7k | OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation.
VoxCPM | ★ 31k | An open-source text-to-speech system that generates natural multilingual speech, designs voices from text descriptions, and clones any voice from a short clip.