VoxCPM

Tokenizer-free text-to-speech for multilingual voice design and cloning

github.com/OpenBMB/VoxCPM | ★ 31k | voxcpm.com

Overview

VoxCPM is an open-source text-to-speech (TTS) system from OpenBMB. Instead of turning speech into discrete tokens, it directly generates continuous speech representations through an end-to-end diffusion autoregressive architecture, which the team says produces highly natural and expressive audio.

The latest release, VoxCPM2, is a 2-billion-parameter model trained on over 2 million hours of multilingual speech and built on a MiniCPM-4 backbone. It supports 30 languages, voice design from a plain-text description, controllable voice cloning, and 48kHz studio-quality output. The weights and code are released under the Apache-2.0 license, so the project is free to use commercially.

What it does

Multilingual synthesis across 30 languages plus several Chinese dialects, with no language tag needed in the input text
Voice Design: create a brand-new voice from a natural-language description (gender, age, tone, emotion, pace) without any reference audio
Controllable voice cloning from a short reference clip, with optional style guidance to steer emotion, pace, and expression while keeping the original timbre
Ultimate cloning that reproduces fine vocal detail by continuing from a reference clip paired with its transcript
48kHz high-quality output with built-in super-resolution, accepting 16kHz reference audio and needing no external upsampler
Real-time streaming generation, with much faster inference available through Nano-vLLM or the official vLLM-Omni serving engine

Getting started

VoxCPM ships as a Python package. Install it with pip, then load a pretrained model and generate audio with a few lines of code, or use the bundled command-line tool. It requires Python 3.10 or newer but below 3.13, PyTorch 2.5.0 or newer, and CUDA 12.0 or newer.
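Before installing, it can help to confirm that the local interpreter falls inside the supported range. The following is a minimal sketch using only the Python standard library; the helper names python_supported and installed_version are made up for this illustration and are not part of VoxCPM. The version bounds are the ones stated above.

```python
import sys
from importlib import metadata

# Bounds taken from the requirements stated in this section.
MIN_PYTHON = (3, 10)   # inclusive
MAX_PYTHON = (3, 13)   # exclusive


def python_supported(version=None):
    """Return True if (major, minor) falls in the supported range.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    major, minor = (version or sys.version_info)[:2]
    return MIN_PYTHON <= (major, minor) < MAX_PYTHON


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python in supported range:", python_supported())
    print("torch version:", installed_version("torch"))
    print("voxcpm version:", installed_version("voxcpm"))
```

The script only reports what is installed. It does not compare the PyTorch or CUDA versions against the minimums, since how those version strings are formatted can vary by build.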

Install the package

Install VoxCPM from PyPI with pip.

bash

pip install voxcpm

Generate speech with the Python API

Load the recommended VoxCPM2 model, generate a waveform from text, and write it to a WAV file.

python

from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
)

wav = model.generate(
    text="VoxCPM2 is the current recommended release for realistic multilingual speech synthesis.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("demo.wav", wav, model.tts_model.sample_rate)
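To sanity-check the file that the example writes, the sample rate and duration can be read back with Python's built-in wave module. This is a sketch rather than part of the project: wav_info is a hypothetical helper, and it assumes the file was saved as integer PCM, which is the only encoding the wave module reads.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file.

    Hypothetical helper for illustration; not part of the voxcpm package.
    """
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        frames = wf.getnframes()
        # Duration is the number of frames divided by frames per second.
        return rate, wf.getnchannels(), frames / rate
```

Calling wav_info("demo.wav") after running the example above should report the model's output sample rate along with the clip length in seconds.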

Design or clone a voice from the CLI

The voxcpm command-line tool can design a voice from a description or clone a voice from a reference clip.

bash

# Design a voice from a description, no reference audio needed
voxcpm design \
  --text "VoxCPM2 brings studio-quality multilingual speech synthesis." \
  --output out.wav

# Clone a voice from a reference recording
voxcpm clone \
  --text "This is a voice cloning demo." \
  --reference-audio path/to/voice.wav \
  --output out.wav
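When cloning needs to be scripted, for example to render several lines with the same reference voice, the CLI can be driven from Python. The sketch below uses only the voxcpm clone flags shown above (--text, --reference-audio, --output); clone_command and run_clone are hypothetical wrapper names, and the voxcpm executable is assumed to be on the PATH.

```python
import shlex
import subprocess


def clone_command(text, reference_audio, output):
    """Build the argument list for `voxcpm clone` using the flags shown above."""
    return [
        "voxcpm", "clone",
        "--text", text,
        "--reference-audio", reference_audio,
        "--output", output,
    ]


def run_clone(text, reference_audio, output):
    """Run the CLI; raises CalledProcessError if it exits with a non-zero code."""
    cmd = clone_command(text, reference_audio, output)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters.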

Run the local web demo

Launch the bundled web app and open it in your browser to try synthesis interactively.

bash

python app.py --port 8808  # then open http://localhost:8808
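If the page does not load, a quick first check is whether anything is listening on the chosen port. This is a generic standard-library sketch, not something shipped with the project; port_is_open is a hypothetical helper, and 8808 is simply the port used in the command above.

```python
import socket


def port_is_open(host="localhost", port=8808, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

A True result only shows that some process accepted the connection; it does not confirm that the process is the VoxCPM demo.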

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

Adding natural multilingual voiceovers to videos, audiobooks, and other content across 30 languages
Designing a custom brand or character voice from a written description when no recording is available
Cloning a specific voice from a short clip for narration, dubbing, or accessibility tools
Building real-time voice features into apps using streaming generation and high-throughput serving engines

How VoxCPM compares

VoxCPM alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Whisper	★ 103k	OpenAI's speech recognition model that transcribes and translates audio across many languages.
GPT-SoVITS	★ 58.9k	An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning.
VibeVoice	★ 49.5k	Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts.
Coqui TTS	★ 45.6k	A library of text-to-speech models including the multilingual XTTS voice-cloning model.
ChatTTS	★ 39.5k	ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody.
MockingBird	★ 36.9k	An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line.
OpenVoice	★ 36.7k	OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation.
VoxCPM	★ 31k	Tokenizer-free text-to-speech for multilingual voice design and cloning
