k2-fsa · 2026-04-01 · major

OmniVoice — Zero-Shot TTS for 600+ Languages at 40× Real-Time

Rating: 4
Author: AI/TLDR

Apache 2.0 zero-shot TTS from the k2/Kaldi team: 600+ languages, voice cloning, speaker attribute-based voice design, RTF 0.025 (40× real-time). Built on Qwen3-0.6B. 4.3k GitHub stars, 702 HuggingFace model likes.


Zero-shot TTS covering 600+ languages at 40× real-time speed — Apache 2.0, from the team behind Kaldi and k2.

Key specs

GitHub stars	4.3k
Languages	600+
Real-time factor (RTF)	0.025 (40× real-time)
Hugging Face model likes	702

What is it?

OmniVoice is a massively multilingual, zero-shot text-to-speech model from the k2-fsa team (Daniel Povey's group, the authors of Kaldi and k2). It synthesizes natural-sounding speech in any of its 600+ supported languages without per-voice fine-tuning. Its core capabilities are voice cloning from a short audio clip, voice design from text descriptions of speaker attributes (age, pitch, dialect, accent, style), and both standard and streaming generation modes. It is built on a Qwen3-0.6B language model backbone and released under the Apache 2.0 license.

How does it work?

OmniVoice conditions a language model (Qwen3-0.6B) on a speaker prompt, which is either a reference audio clip converted to a compact acoustic representation or a text description of the desired speaker attributes. It then autoregressively generates codec tokens, which are decoded back to audio via a diffusion language model architecture. The low real-time factor (RTF 0.025 in non-streaming mode, roughly 25 ms of compute per second of generated audio) is attributed to an efficient codec and streaming-aware KV cache management. The 600+ language coverage comes from multilingual training on a large corpus.
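To make the RTF figure concrete, here is a minimal sketch of the arithmetic behind "RTF 0.025 (40× real-time)". The function names and the 10-second clip timing are illustrative assumptions, not part of OmniVoice or its published benchmarks; only the 0.025 and 40× figures come from the text above.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """Speedup relative to playback speed is the reciprocal of the RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical timing: a 10 s clip synthesized in 0.25 s of compute.
rtf = real_time_factor(synthesis_seconds=0.25, audio_seconds=10.0)
print(f"RTF: {rtf:.3f}")
print(f"Speedup: {speedup_over_real_time(rtf):.0f}x real-time")
```

An RTF below 1 means synthesis finishes faster than the audio plays back; the lower the value, the more headroom remains for batching or slower hardware.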

Why does it matter?

Most open-source TTS models cover a dozen languages at best. OmniVoice opens zero-shot synthesis across the long tail of languages under a permissive Apache 2.0 license, making it practical for researchers and developers working on underrepresented languages. At an RTF of 0.025, synthesis runs 40× faster than playback, which leaves enough headroom that it can run comfortably on CPU for many use cases. Coming from the Kaldi/k2 team also gives it credibility with the speech research community.

Who is it for?

Speech researchers and developers building multilingual voice applications

Try it

huggingface.co/k2-fsa/OmniVoice