Overview
MockingBird is an open-source voice cloning toolbox built on PyTorch. It takes a short recording of someone speaking, learns the sound of that voice, and then generates new speech in that voice from any text you give it. Its main focus is Mandarin Chinese, and it has been tested on several public datasets, including aidatatang_200zh, magicdata, aishell3, and data_aishell.
The project is a Chinese-language fork of the well-known Real-Time-Voice-Cloning project, which only supported English. It runs on Windows, Linux, and even Apple M1 Macs, and works with or without a GPU. You can train your own models or load a pretrained one shared by the community, then produce cloned speech through a web server, a desktop toolbox, or a simple command-line script.
What it does
- Clones a target voice from a short audio sample and reuses a pretrained encoder and vocoder, so only a newly trained synthesizer is needed
- Focused on Mandarin Chinese and tested against multiple public datasets including aidatatang_200zh, magicdata, aishell3, and data_aishell
- Three ways to run it: a browser-based web server, a desktop toolbox GUI, and a command-line generator
- Built on PyTorch and runs on Windows, Linux, and Apple M1 Macs, with or without a GPU
- Lets you train your own encoder, synthesizer, and vocoder models, or start from community-shared pretrained weights
- Supports multiple speech model architectures behind the scenes, including the Tacotron synthesizer and the WaveRNN and HiFi-GAN vocoders
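The encoder, synthesizer, and vocoder mentioned above form a three-stage chain. The sketch below only illustrates that data flow and is not MockingBird's code: `clone_voice` and the three `toy_*` functions are hypothetical stand-ins, and real stages would be trained neural networks working on audio arrays rather than short Python lists.

```python
from typing import Callable, List, Sequence

Embedding = List[float]
Mel = List[List[float]]
Waveform = List[float]


def clone_voice(
    reference_audio: Sequence[float],
    text: str,
    encoder: Callable[[Sequence[float]], Embedding],
    synthesizer: Callable[[str, Embedding], Mel],
    vocoder: Callable[[Mel], Waveform],
) -> Waveform:
    """Chain the three stages: voice embedding -> mel spectrogram -> waveform."""
    embedding = encoder(reference_audio)  # summarizes how the voice sounds
    mel = synthesizer(text, embedding)    # the text rendered in that voice
    return vocoder(mel)                   # spectrogram frames turned into samples


# Toy stand-ins (assumptions for illustration only) so the sketch runs.
def toy_encoder(audio):
    return [sum(audio) / max(len(audio), 1)]


def toy_synthesizer(text, embedding):
    # One two-bin "frame" per character of input text.
    return [[embedding[0]] * 2 for _ in text]


def toy_vocoder(mel):
    return [value for frame in mel for value in frame]


samples = clone_voice([0.2, 0.4], "你好", toy_encoder, toy_synthesizer, toy_vocoder)
print(len(samples))  # 2 characters x 2 values per frame = 4
```

Because each stage is passed in as a plain callable, one stage can be swapped without touching the other two, which mirrors the idea of reusing a pretrained encoder and vocoder while training only a new synthesizer.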
Getting started
MockingBird needs Python 3.7 or higher (3.9 is recommended), PyTorch, and ffmpeg. Install the requirements, prepare a synthesizer model, then launch the web app, the toolbox, or the command line.
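Before installing anything else, it can help to confirm those three prerequisites are in place. The following is a minimal sketch using only the Python standard library; `check_environment` is a hypothetical helper, not part of MockingBird, and it only checks that `torch` is importable and that an `ffmpeg` executable is on your PATH, not that either one works correctly.

```python
import importlib.util
import shutil
import sys


def check_environment(min_python=(3, 7)):
    """Report whether each prerequisite named above appears to be available."""
    return {
        # The running interpreter must be at least min_python (3.7 per the text).
        "python": sys.version_info[:2] >= min_python,
        # find_spec returns None when the package cannot be located.
        "torch": importlib.util.find_spec("torch") is not None,
        # shutil.which returns None when no ffmpeg executable is on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```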
Install the requirements
After installing PyTorch and ffmpeg, install the remaining Python packages from the requirements file. You may also need the webrtcvad wheel.
pip install -r requirements.txt
pip install webrtcvad-wheels
Or set up the environment with conda
As an alternative to pip, you can create a virtual environment from the bundled env.yml file using conda or mamba, then activate it.
conda env create -n env_name -f env.yml
conda activate env_name
Run the web server
The easiest way to try MockingBird is the web app. Start it and open the page in your browser, by default at http://localhost:8080.
python web.py
Or generate speech from the command line
You can also clone a voice straight from a text file and a reference wav using the command-line generator.
python gen_voice.py <text_file.txt> your_wav_file.wav
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
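If you want to call the command-line generator from another Python script, one option is to wrap the command shown above with `subprocess`. This is a sketch under assumptions: `build_gen_voice_command` and `run_gen_voice` are hypothetical helpers, the file names are placeholders, and the only interface assumed is the two positional arguments from the command above (a text file, then a reference wav).

```python
import subprocess
import sys
from pathlib import Path


def build_gen_voice_command(text_file, wav_file, script="gen_voice.py"):
    """Return the argument list for: python gen_voice.py <text_file> <wav_file>."""
    return [sys.executable, script, str(text_file), str(wav_file)]


def run_gen_voice(text_file, wav_file, repo_dir="."):
    """Validate both inputs, then run the generator from the repository directory."""
    text_path = Path(text_file).resolve()
    wav_path = Path(wav_file).resolve()
    for path in (text_path, wav_path):
        if not path.is_file():
            raise FileNotFoundError(f"Input file not found: {path}")
    command = build_gen_voice_command(text_path, wav_path)
    # check=True raises CalledProcessError if the script exits with a non-zero status.
    return subprocess.run(command, cwd=repo_dir, check=True)
```

For example, `run_gen_voice("story.txt", "speaker.wav", repo_dir="MockingBird")` would run the generator inside a local checkout, assuming both files exist. Using `sys.executable` keeps the call inside the same Python environment where the requirements were installed.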
When to use it
- Generating Mandarin Chinese speech in a chosen voice for narration, dubbing, or accessibility projects
- Experimenting with and researching voice cloning and text-to-speech models on Chinese datasets
- Training custom voice models on your own audio data using the included encoder, synthesizer, and vocoder scripts
- Serving cloned-voice synthesis to other apps through the built-in web server
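Before pointing another app at the web server, you may want to confirm it is actually running. The sketch below only checks that something answers HTTP at the default address mentioned earlier, http://localhost:8080. It does not call any synthesis endpoint, since the server's routes are not described here, and `is_server_up` is a hypothetical helper rather than part of the project.

```python
import urllib.error
import urllib.request


def is_server_up(url="http://localhost:8080", timeout=2.0):
    """Return True if anything answers HTTP at url, False if the connection fails."""
    # An empty ProxyHandler skips system proxies, which suits a local check.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status (for example 404), so it is running.
        return True
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    print("web server reachable:", is_server_up())
```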
How MockingBird compares
MockingBird alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Whisper | ★ 103k | OpenAI's speech recognition model that transcribes and translates audio across many languages. |
| GPT-SoVITS | ★ 58.9k | An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning. |
| VibeVoice | ★ 49.5k | Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts. |
| Coqui TTS | ★ 45.6k | A library of text-to-speech models including the multilingual XTTS voice-cloning model. |
| ChatTTS | ★ 39.5k | ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody. |
| MockingBird | ★ 36.9k | An AI voice cloning toolbox that clones a Mandarin Chinese voice from a short audio sample. |
| OpenVoice | ★ 36.7k | OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation. |
| VoxCPM | ★ 31k | An open-source text-to-speech system that generates natural multilingual speech, designs voices from text descriptions, and clones any voice from a short clip. |