Overview
Qwen3-Omni is Alibaba's (Qwen team) third-generation omni-modal large language model, released September 22, 2025 as the flagship of the Qwen-Omni line. It is natively end-to-end: a single model takes text, images, audio and video as input and streams both text and natural speech as output in real time, rather than chaining separate ASR, LLM and TTS systems.
The release centers on the Qwen3-Omni-30B-A3B family, published in three open-weight variants under Apache 2.0: Qwen3-Omni-30B-A3B-Instruct (full Thinker plus Talker, audio/video/text in, audio and text out), Qwen3-Omni-30B-A3B-Thinking (Thinker only, with chain-of-thought reasoning), and Qwen3-Omni-30B-A3B-Captioner (a fine-grained audio captioning model). Architecturally it uses a Thinker-Talker Mixture-of-Experts design with roughly 3B active parameters out of 30B, a from-scratch AuT audio encoder, and a SigLIP2-initialized vision encoder.
Qwen3-Omni handles text in 119 languages, speech understanding in 19 languages and speech generation in 10. It supports up to a 65,536-token (64K) context, can process audio recordings up to about 40 minutes per instance, and reaches a theoretical first-packet latency around 234 ms for audio and 547 ms for audio-video. The team reports open-source state-of-the-art on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22, with audio understanding and voice-conversation quality comparable to Gemini 2.5 Pro. Weights are free to download; a hosted Qwen3-Omni-Flash API is also offered through Alibaba Cloud Model Studio.
| Released | 2025-09-22 |
|---|---|
| License | Apache-2.0 |
| Weights | Open weights |
| Parameters | 30B total · ~3B active (Thinker MoE); separate ~3B-A0.3B Talker |
| Context | 64K |
| Max output | 16K |
| Architecture | Thinker-Talker Mixture-of-Experts. The Thinker (30B-A3B MoE transformer) handles perception and text reasoning; the Talker (3B-A0.3B) streams speech via a multi-codebook design. A from-scratch AuT audio encoder (trained on 20M hours of audio) and a ~543M-parameter vision encoder initialized from SigLIP2-So400m feed the Thinker. |
| Modalities | Text, Vision, Audio, Video |
| Status | Generally available |
Benchmarks
- MMAU (audio understanding, v05.15.25)77.5%
- Video-MME (w/o subtitles)70.5%
- WorldSense (audio-visual)54%
- AIME25 (math reasoning)65%
- GPQA (graduate-level QA)69.6%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | $0.43 text · $3.81 audio · $0.78 image/video per 1M tokens |
|---|---|
| Output | $1.66 text per 1M tokens |
Weights are free to download under Apache 2.0; prices shown are for the hosted Qwen3-Omni-Flash API on Alibaba Cloud Model Studio (International), billed per modality. Audio output is billed separately (~$15.11/1M).
Strengths
- Single end-to-end model for text, image, audio and video — no separate ASR/TTS pipeline
- Real-time streaming speech output with low first-packet latency (~234 ms audio, ~547 ms audio-video)
- Open weights under Apache 2.0 across all three variants (Instruct, Thinking, Captioner)
- Broad language coverage: 119 text languages, 19 speech-input, 10 speech-output
- Open-source SOTA on 32 of 36 audio and audio-visual benchmarks per the technical report
- Efficient 30B-A3B MoE — only ~3B parameters active per token
- Long-audio understanding up to ~40 minutes per instance
Best for
- Real-time voice and video assistants that listen, see and speak back
- Speech recognition, transcription and spoken-language understanding across many languages
- Fine-grained audio captioning and audio event description (Captioner variant)
- Multimodal document, chart and video question answering
- Multilingual speech translation and cross-modal dialogue
- On-prem or self-hosted omni-modal apps where open weights and Apache 2.0 licensing matter
How to access
| Provider | Model ID |
|---|---|
| Alibaba Cloud Model Studio (DashScope) ↗ | qwen3-omni-flash |
| Hugging Face (open weights) ↗ | Qwen/Qwen3-Omni-30B-A3B-Instruct |
Qwen-Omni — every version
The full lineage of the Qwen-Omni line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| Qwen3-Omnicurrent | 2025-09-22 | — | Apache-2.0 |
| Qwen2.5-Omni | 2025-03-26 | — | Open weights |
FAQ
Is Qwen3-Omni open source?
Yes. All three variants — Qwen3-Omni-30B-A3B-Instruct, -Thinking and -Captioner — are released with open weights under the Apache 2.0 license, so they can be downloaded, self-hosted and used commercially. A hosted Qwen3-Omni-Flash API is also available through Alibaba Cloud Model Studio.
What modalities does Qwen3-Omni support?
It accepts text, images, audio and video as input and produces text plus natural, real-time speech as output. It supports text in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages, and can process audio recordings up to about 40 minutes per instance.
How big is Qwen3-Omni and how is it built?
The flagship is Qwen3-Omni-30B-A3B: a Thinker-Talker Mixture-of-Experts model with 30B total parameters and roughly 3B active per token. A from-scratch AuT audio encoder trained on 20M hours of audio and a ~543M-parameter SigLIP2-initialized vision encoder feed the Thinker, while a separate Talker streams speech.
How does Qwen3-Omni perform on benchmarks?
Per the technical report it reaches open-source state-of-the-art on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22, with results such as 77.5 on MMAU, 70.5 on Video-MME (without subtitles) and 69.6 on GPQA. Alibaba states its ASR, audio understanding and voice-conversation quality are comparable to Gemini 2.5 Pro.