MockingBird

AI voice cloning toolbox that learns a Mandarin Chinese voice in seconds

Overview

MockingBird is an open-source voice cloning toolbox built on PyTorch. It takes a short recording of someone speaking, learns the sound of that voice, and then generates new speech in that voice from any text you give it. Its main focus is Mandarin Chinese, and it has been tested on several public datasets, including aidatatang_200zh, magicdata, aishell3, and data_aishell.

The project is a Chinese-language fork of the well-known Real-Time-Voice-Cloning project, which only supported English. It runs on Windows, Linux, and even Apple M1 Macs, and works with or without a GPU. You can train your own models or load a pretrained one shared by the community, then produce cloned speech through a web server, a desktop toolbox, or a simple command-line script.

What it does

Clones a target voice from a short audio sample; because it reuses a pretrained encoder and vocoder, only the synthesizer needs to be newly trained
Focused on Mandarin Chinese and tested against multiple public datasets including aidatatang_200zh, magicdata, aishell3, and data_aishell
Three ways to run it: a browser-based web server, a desktop toolbox GUI, and a command-line generator
Built on PyTorch and runs on Windows, Linux, and Apple M1 Macs, with or without a GPU
Lets you train your own encoder, synthesizer, and vocoder models, or start from community-shared pretrained weights
Supports multiple speech model architectures behind the scenes, including the Tacotron synthesizer and the WaveRNN and HiFi-GAN vocoders
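
The encoder, synthesizer, and vocoder named in the list above form a three-stage pipeline. The sketch below is only a conceptual illustration of how data might flow between the stages; clone_voice and the toy stand-in functions are hypothetical names and do not come from MockingBird's code. It assumes the usual arrangement in this family of models: the encoder turns reference audio into a speaker embedding, the synthesizer turns text plus that embedding into mel-spectrogram frames, and the vocoder turns the frames into a waveform.

```python
from typing import Callable, List

# Hypothetical type aliases, for illustration only.
Audio = List[float]        # raw samples of the reference recording
Embedding = List[float]    # fixed-size description of the speaker's voice
Mel = List[List[float]]    # spectrogram frames produced by the synthesizer


def clone_voice(
    reference_audio: Audio,
    text: str,
    encoder: Callable[[Audio], Embedding],
    synthesizer: Callable[[str, Embedding], Mel],
    vocoder: Callable[[Mel], Audio],
) -> Audio:
    """Chain the three stages: audio -> embedding -> mel frames -> waveform."""
    embedding = encoder(reference_audio)
    mel = synthesizer(text, embedding)
    return vocoder(mel)


# Toy stand-ins that only demonstrate the data flow, not real models.
def toy_encoder(audio: Audio) -> Embedding:
    return [sum(audio) / len(audio)]


def toy_synthesizer(text: str, embedding: Embedding) -> Mel:
    # One two-value "frame" per character of input text.
    return [[embedding[0], embedding[0]] for _ in text]


def toy_vocoder(mel: Mel) -> Audio:
    # Flatten the frames into a single sample list.
    return [sample for frame in mel for sample in frame]


if __name__ == "__main__":
    samples = clone_voice([0.0, 1.0], "ni", toy_encoder, toy_synthesizer, toy_vocoder)
    print(len(samples), "samples generated")
```

Keeping each stage behind a plain callable mirrors the point made in the list: the encoder and vocoder can be swapped for pretrained ones while only the synthesizer changes.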

Getting started

MockingBird needs Python 3.7 or higher (3.9 is recommended), PyTorch, and ffmpeg. Install the requirements, prepare a synthesizer model, then launch the web app, the toolbox, or the command line.
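
Before installing anything else, it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch using only the Python standard library; the function names are hypothetical helpers, not part of MockingBird, and the check assumes that ffmpeg is expected on PATH and that PyTorch is importable as torch.

```python
import importlib.util
import shutil
import sys

MIN_PYTHON = (3, 7)  # minimum version stated in the text above


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


def missing_tools(tools=("ffmpeg",)):
    """Return the command-line tools from `tools` that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules=("torch",)):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Missing tools:", missing_tools() or "none")
    print("Missing modules:", missing_modules() or "none")
```

Run it with the same interpreter you plan to use for MockingBird, since PATH and installed packages can differ between environments.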

Install the requirements

After installing PyTorch and ffmpeg, install the remaining Python packages from the requirements file. If your setup needs webrtcvad, install it through the webrtcvad-wheels package as well.

pip install -r requirements.txt
pip install webrtcvad-wheels

Or set up the environment with conda

As an alternative to pip, you can create a virtual environment from the bundled env.yml file using conda or mamba, then activate it.

conda env create -n env_name -f env.yml
conda activate env_name

Run the web server

The easiest way to try MockingBird is the web app. Start it and open the page in your browser, by default at http://localhost:8080.

python web.py
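
Once the server has been started, a script can check whether it is answering before sending work to it. This is a small sketch using only the standard library; server_is_up is a hypothetical helper, and the default URL is simply the address mentioned above. It says nothing about which routes the web app exposes.

```python
import http.client
import urllib.error
import urllib.request


def server_is_up(url="http://localhost:8080", timeout=2.0):
    """Return True if anything answers HTTP at `url`, False otherwise."""
    # An empty ProxyHandler bypasses any system proxy for this local check.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied, just with an error status code.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, or a malformed reply.
        return False


if __name__ == "__main__":
    print("Web server reachable:", server_is_up())
```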

Or generate speech from the command line

You can also clone a voice straight from a text file and a reference wav using the command-line generator.

python gen_voice.py <text_file.txt> your_wav_file.wav
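
When scripting the generator, it can be useful to inspect the reference recording first and then build the same command programmatically. The sketch below assumes the reference file is an uncompressed PCM wav (the only kind the standard wave module reads) and that it runs from the directory containing gen_voice.py. The helper names are hypothetical, and only the two positional arguments shown above are used.

```python
import subprocess
import sys
import wave


def describe_wav(path):
    """Return channel count, sample rate, and duration of a PCM wav file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": frames / rate,
        }


def build_command(text_file, wav_file, script="gen_voice.py"):
    """Build the argument list for the command shown above."""
    return [sys.executable, script, text_file, wav_file]


def generate(text_file, wav_file):
    """Run the generator; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(text_file, wav_file), check=True)


if __name__ == "__main__" and len(sys.argv) == 3:
    text_path, wav_path = sys.argv[1], sys.argv[2]
    print(describe_wav(wav_path))
    generate(text_path, wav_path)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.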

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Generating Mandarin Chinese speech in a chosen voice for narration, dubbing, or accessibility projects
Experimenting with and researching voice cloning and text-to-speech models on Chinese datasets
Training custom voice models on your own audio data using the included encoder, synthesizer, and vocoder scripts
Serving cloned-voice synthesis to other apps through the built-in web server

How MockingBird compares

MockingBird alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Whisper	★ 103k	OpenAI's speech recognition model that transcribes and translates audio across many languages.
GPT-SoVITS	★ 58.9k	An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning.
VibeVoice	★ 49.5k	Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts.
Coqui TTS	★ 45.6k	A library of text-to-speech models including the multilingual XTTS voice-cloning model.
ChatTTS	★ 39.5k	ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody.
MockingBird	★ 36.9k	AI voice cloning toolbox that learns a Mandarin Chinese voice in seconds.
OpenVoice	★ 36.7k	OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation.
VoxCPM	★ 31k	An open-source text-to-speech system that generates natural multilingual speech, designs voices from text descriptions, and clones any voice from a short clip.
