Qwen3-Omni

Alibaba's natively end-to-end omni-modal model — text, image, audio and video in, real-time text and speech out, open-weight under Apache 2.0.

Overview

Qwen3-Omni is Alibaba's (Qwen team) third-generation omni-modal large language model, released September 22, 2025 as the flagship of the Qwen-Omni line. It is natively end-to-end: a single model takes text, images, audio and video as input and streams both text and natural speech as output in real time, rather than chaining separate ASR, LLM and TTS systems.

The release centers on the Qwen3-Omni-30B-A3B family, published in three open-weight variants under Apache 2.0: Qwen3-Omni-30B-A3B-Instruct (full Thinker plus Talker, audio/video/text in, audio and text out), Qwen3-Omni-30B-A3B-Thinking (Thinker only, with chain-of-thought reasoning), and Qwen3-Omni-30B-A3B-Captioner (a fine-grained audio captioning model). Architecturally it uses a Thinker-Talker Mixture-of-Experts design with roughly 3B active parameters out of 30B, a from-scratch AuT audio encoder, and a SigLIP2-initialized vision encoder.

Qwen3-Omni handles text in 119 languages, speech understanding in 19 languages and speech generation in 10. It supports up to a 65,536-token (64K) context, can process audio recordings up to about 40 minutes per instance, and reaches a theoretical first-packet latency around 234 ms for audio and 547 ms for audio-video. The team reports open-source state-of-the-art on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22, with audio understanding and voice-conversation quality comparable to Gemini 2.5 Pro. Weights are free to download; a hosted Qwen3-Omni-Flash API is also offered through Alibaba Cloud Model Studio.

Released	2025-09-22
License	Apache-2.0
Weights	Open weights
Parameters	30B total · ~3B active (Thinker MoE); separate ~3B-A0.3B Talker
Context	64K
Max output	16K
Architecture	Thinker-Talker Mixture-of-Experts. The Thinker (30B-A3B MoE transformer) handles perception and text reasoning; the Talker (3B-A0.3B) streams speech via a multi-codebook design. A from-scratch AuT audio encoder (trained on 20M hours of audio) and a ~543M-parameter vision encoder initialized from SigLIP2-So400m feed the Thinker.
Modalities	Text, Vision, Audio, Video
Status	Generally available

Benchmarks

MMAU (audio understanding, v05.15.25)77.5%
Video-MME (w/o subtitles)70.5%
WorldSense (audio-visual)54%
AIME25 (math reasoning)65%
GPQA (graduate-level QA)69.6%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.43 text · $3.81 audio · $0.78 image/video per 1M tokens
Output	$1.66 text per 1M tokens

Weights are free to download under Apache 2.0; prices shown are for the hosted Qwen3-Omni-Flash API on Alibaba Cloud Model Studio (International), billed per modality. Audio output is billed separately (~$15.11/1M).

Pricing source ↗

Strengths

Single end-to-end model for text, image, audio and video — no separate ASR/TTS pipeline
Real-time streaming speech output with low first-packet latency (~234 ms audio, ~547 ms audio-video)
Open weights under Apache 2.0 across all three variants (Instruct, Thinking, Captioner)
Broad language coverage: 119 text languages, 19 speech-input, 10 speech-output
Open-source SOTA on 32 of 36 audio and audio-visual benchmarks per the technical report
Efficient 30B-A3B MoE — only ~3B parameters active per token
Long-audio understanding up to ~40 minutes per instance

Best for

Real-time voice and video assistants that listen, see and speak back
Speech recognition, transcription and spoken-language understanding across many languages
Fine-grained audio captioning and audio event description (Captioner variant)
Multimodal document, chart and video question answering
Multilingual speech translation and cross-modal dialogue
On-prem or self-hosted omni-modal apps where open weights and Apache 2.0 licensing matter

How to access

Provider	Model ID
Alibaba Cloud Model Studio (DashScope) ↗	`qwen3-omni-flash`
Hugging Face (open weights) ↗	`Qwen/Qwen3-Omni-30B-A3B-Instruct`

Qwen-Omni — every version

The full lineage of the Qwen-Omni line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
Qwen3-Omnicurrent	2025-09-22	—	Apache-2.0
Qwen2.5-Omni	2025-03-26	—	Open weights

FAQ

Is Qwen3-Omni open source?

Yes. All three variants — Qwen3-Omni-30B-A3B-Instruct, -Thinking and -Captioner — are released with open weights under the Apache 2.0 license, so they can be downloaded, self-hosted and used commercially. A hosted Qwen3-Omni-Flash API is also available through Alibaba Cloud Model Studio.

What modalities does Qwen3-Omni support?

It accepts text, images, audio and video as input and produces text plus natural, real-time speech as output. It supports text in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages, and can process audio recordings up to about 40 minutes per instance.

How big is Qwen3-Omni and how is it built?

The flagship is Qwen3-Omni-30B-A3B: a Thinker-Talker Mixture-of-Experts model with 30B total parameters and roughly 3B active per token. A from-scratch AuT audio encoder trained on 20M hours of audio and a ~543M-parameter SigLIP2-initialized vision encoder feed the Thinker, while a separate Talker streams speech.

How does Qwen3-Omni perform on benchmarks?

Per the technical report it reaches open-source state-of-the-art on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22, with results such as 77.5 on MMAU, 70.5 on Video-MME (without subtitles) and 69.6 on GPQA. Alibaba states its ASR, audio understanding and voice-conversation quality are comparable to Gemini 2.5 Pro.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// Qwen-Omni — every version

// FAQ