AI/TLDR

MockingBird

AI voice cloning toolbox that learns a Mandarin Chinese voice in seconds

Overview

MockingBird is an open-source voice cloning toolbox built on PyTorch. It takes a short recording of someone speaking, learns the sound of that voice, and then generates new speech in that voice from any text you give it. Its main focus is Mandarin Chinese, and it has been tested with several public datasets, including aidatatang_200zh, magicdata, aishell3, and data_aishell.

The project is a Chinese-language fork of the well-known Real-Time-Voice-Cloning project, which only supported English. It runs on Windows, Linux, and even Apple M1 Macs, and works with or without a GPU. You can train your own models or load a pretrained one shared by the community, then produce cloned speech through a web server, a desktop toolbox, or a simple command-line script.

What it does

  • Clones a target voice from a short audio sample and reuses a pretrained encoder and vocoder, so only a newly trained synthesizer is needed
  • Focused on Mandarin Chinese and tested against multiple public datasets including aidatatang_200zh, magicdata, aishell3, and data_aishell
  • Three ways to run it: a browser-based web server, a desktop toolbox GUI, and a command-line generator
  • Built on PyTorch and runs on Windows, Linux, and Apple M1 Macs, with or without a GPU
  • Lets you train your own encoder, synthesizer, and vocoder models, or start from community-shared pretrained weights
  • Supports multiple speech model architectures behind the scenes, including the Tacotron synthesizer and the WaveRNN and HiFi-GAN vocoders
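
The split between encoder, synthesizer, and vocoder described in the list above can be pictured as a three-stage pipeline. The sketch below is only an illustration of that data flow: the class name, the function names, and the toy stage implementations are all hypothetical and are not MockingBird's actual API or models.

```python
from dataclasses import dataclass
from typing import Callable, List

Waveform = List[float]        # audio samples
Embedding = List[float]       # fixed-size summary of a speaker's voice
MelFrames = List[List[float]] # spectrogram-like frames

@dataclass
class VoiceCloningPipeline:
    """Hypothetical wrapper showing how the three stages hand data to each other."""
    encoder: Callable[[Waveform], Embedding]
    synthesizer: Callable[[str, Embedding], MelFrames]
    vocoder: Callable[[MelFrames], Waveform]

    def clone(self, reference_audio: Waveform, text: str) -> Waveform:
        embedding = self.encoder(reference_audio)  # who is speaking
        mel = self.synthesizer(text, embedding)    # what is said, in that voice
        return self.vocoder(mel)                   # frames back to audio

# Toy stand-ins so the sketch runs; real stages would be trained neural networks.
def toy_encoder(audio: Waveform) -> Embedding:
    mean = sum(audio) / len(audio)
    peak = max(abs(sample) for sample in audio)
    return [mean, peak]

def toy_synthesizer(text: str, embedding: Embedding) -> MelFrames:
    # One fake frame per character, conditioned on the speaker embedding.
    return [[float(ord(ch) % 7)] + list(embedding) for ch in text]

def toy_vocoder(mel: MelFrames) -> Waveform:
    # Collapse each frame into a single fake audio sample.
    return [sum(frame) for frame in mel]

if __name__ == "__main__":
    pipeline = VoiceCloningPipeline(toy_encoder, toy_synthesizer, toy_vocoder)
    samples = pipeline.clone([0.0, 0.5, -0.5, 1.0], "你好")
    print(f"generated {len(samples)} samples")
```

Because the stages only meet at the embedding and the frames, one stage can be swapped while the other two stay fixed, which is the idea behind reusing a pretrained encoder and vocoder and training only a new synthesizer.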

Getting started

MockingBird needs Python 3.7 or higher (3.9 is recommended), PyTorch, and ffmpeg. Install the requirements, prepare a synthesizer model, then launch the web app, the toolbox, or the command line.
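
Before installing anything else, it can help to confirm that the three prerequisites named above are in place. The following is a minimal sketch using only the Python standard library; the function name `check_prerequisites` is made up for this example and is not part of MockingBird. It assumes that ffmpeg is expected on your PATH and that PyTorch is importable as `torch`.

```python
import importlib.util
import shutil
import sys

def check_prerequisites(min_python=(3, 7)):
    """Report which of the stated prerequisites are present on this machine."""
    return {
        # Compare the running interpreter against the minimum version.
        "python": sys.version_info >= min_python,
        # Look for an installed 'torch' package without importing it.
        "pytorch": importlib.util.find_spec("torch") is not None,
        # Look for an 'ffmpeg' executable on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }

if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

The check only tells you whether each item can be found, not whether its version matches what the project's requirements file expects.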

Install the requirements

After installing PyTorch and ffmpeg, install the remaining Python packages from the requirements file. You may also need the webrtcvad-wheels package.

bash
pip install -r requirements.txt
pip install webrtcvad-wheels

Or set up the environment with conda

As an alternative to pip, you can create a virtual environment from the bundled env.yml file using conda or mamba, then activate it.

bash
conda env create -n env_name -f env.yml
conda activate env_name

Run the web server

The easiest way to try MockingBird is the web app. Start it and open the page in your browser, by default at http://localhost:8080.

bash
python web.py
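
If the page does not load, a quick check from a second terminal can tell you whether anything is answering at that address. This is a small sketch built on the standard library; `is_server_up` is a hypothetical helper, and the default URL simply mirrors the address mentioned above. It does not assume any particular route or API on the MockingBird server.

```python
import http.client
import urllib.error
import urllib.request

def is_server_up(url="http://localhost:8080", timeout=2.0):
    """Return True if an HTTP server answers at url, False otherwise."""
    # Bypass any system proxy settings, since the target is a local address.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status, so it is running.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timed out, or the reply was not valid HTTP.
        return False

if __name__ == "__main__":
    print("server reachable" if is_server_up() else "server not reachable")
```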

Or generate speech from the command line

You can also clone a voice straight from a text file and a reference wav using the command-line generator.

bash
python gen_voice.py text_file.txt your_wav_file.wav

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.
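
If you want to call the command-line generator from another Python script, for example to process several text files in a row, one option is to wrap it with `subprocess`. The sketch below is not taken from the project: the helper names are hypothetical, and the only thing it relies on is the positional form shown above (script name, then text file, then reference wav). It does not assume any additional flags.

```python
import subprocess
import sys
from pathlib import Path

def build_gen_voice_command(text_file, reference_wav, script="gen_voice.py"):
    """Build the argument list for the generator, checking the inputs first."""
    # Resolve to absolute paths so the command works from any working directory.
    text_path = Path(text_file).resolve()
    wav_path = Path(reference_wav).resolve()
    for path in (text_path, wav_path):
        if not path.is_file():
            raise FileNotFoundError(f"Input file not found: {path}")
    # Use the current interpreter so the active environment is reused.
    return [sys.executable, script, str(text_path), str(wav_path)]

def run_gen_voice(text_file, reference_wav, repo_dir="."):
    """Run the generator from the repository directory (assumed location of the script)."""
    cmd = build_gen_voice_command(text_file, reference_wav)
    # check=True raises CalledProcessError if the script exits with an error.
    return subprocess.run(cmd, cwd=repo_dir, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.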

When to use it

  • Generating Mandarin Chinese speech in a chosen voice for narration, dubbing, or accessibility projects
  • Experimenting with and researching voice cloning and text-to-speech models on Chinese datasets
  • Training custom voice models on your own audio data using the included encoder, synthesizer, and vocoder scripts
  • Serving cloned-voice synthesis to other apps through the built-in web server

How MockingBird compares

MockingBird alongside other open-source audio, music & voice tools AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| Whisper | ★ 103k | OpenAI's speech recognition model that transcribes and translates audio across many languages. |
| GPT-SoVITS | ★ 58.9k | An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning. |
| VibeVoice | ★ 49.5k | Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts. |
| Coqui TTS | ★ 45.6k | A library of text-to-speech models including the multilingual XTTS voice-cloning model. |
| ChatTTS | ★ 39.5k | ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody. |
| MockingBird | ★ 36.9k | AI voice cloning toolbox that learns a Mandarin Chinese voice in seconds |
| OpenVoice | ★ 36.7k | OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation. |
| VoxCPM | ★ 31k | An open-source text-to-speech system that generates natural multilingual speech, designs voices from text descriptions, and clones any voice from a short clip. |