AI/TLDR

Qwen2.5-Omni

Alibaba's first end-to-end omni-modal model: text, image, audio and video in, streaming text and speech out.

Overview

Qwen2.5-Omni is Alibaba's (Qwen) first end-to-end omni-modal model line, released on March 26, 2025. A single model takes text, images, audio and video as input and generates both text and natural speech, streaming the speech back in real time. It shipped as open weights in two sizes, a 7B and a smaller 3B variant, under the Apache-2.0 license.

The model uses a Thinker-Talker architecture: the Thinker is a transformer-decoder LLM (with audio and vision encoders) that understands the inputs and produces text, while the Talker is a dual-track autoregressive decoder that turns the Thinker's representations into discrete speech tokens. To keep video frames and audio in sync, Qwen2.5-Omni introduces TMRoPE (Time-aligned Multimodal RoPE), a position embedding that aligns the timestamps of video and audio. The 7B model supports a 32,768-token context and ships with two built-in speech voices, Chelsie (female) and Ethan (male).

Qwen2.5-Omni reports state-of-the-art results on OmniBench, the cross-modal reasoning benchmark, and its performance following spoken instructions is close to its performance on the same tasks given as text. Besides the open weights on Hugging Face and GitHub, it is served through Alibaba Cloud Model Studio (DashScope) as qwen2.5-omni-7b. It was succeeded by Qwen3-Omni in September 2025.

Released2025-03-26
LicenseApache-2.0
WeightsOpen weights
Parameters7B base LLM (~11B total with Talker); also a 3B variant
Context32K
ArchitectureThinker-Talker end-to-end omni-modal transformer with TMRoPE
ModalitiesText, Vision, Audio, Video
StatusGenerally available

Benchmarks

  1. OmniBench (avg)56.13%
  2. MMLU-redux71%
  3. MMMU (val)59.2%
  4. MMStar64%
  5. MVBench70.3%
  6. GSM8K88.7%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$0.10 / 1M tokens
Output$0.40 / 1M tokens

Alibaba Cloud Model Studio (Singapore), qwen2.5-omni-7b, text input / text-only output; multimodal-input output is $0.84/1M.

Pricing source ↗

Strengths

  • Truly omni-modal: one model accepts text, image, audio and video and produces both text and speech, without stitching separate models together.
  • Real-time, streaming speech output with two natural-sounding built-in voices (Chelsie and Ethan).
  • Open weights under Apache-2.0 in both 7B and 3B sizes, so it can be self-hosted and used commercially.
  • Strong cross-modal reasoning: state-of-the-art OmniBench average (56.13) with spoken-instruction following close to text-instruction quality.
  • TMRoPE keeps audio and video frames time-aligned, improving understanding of synced audio-visual content.

Best for

  • Real-time voice assistants and conversational agents that listen and talk back.
  • Audio-visual understanding: answering questions about video clips with synced sound.
  • Speech-to-speech and spoken-instruction following without a separate ASR/TTS pipeline.
  • On-device or self-hosted multimodal apps using the smaller 3B variant.
  • Building accessibility and voice-interface features on open, commercially-licensed weights.

How to access

ProviderModel ID
Alibaba Cloud Model Studio (DashScope) ↗qwen2.5-omni-7b

Qwen-Omni — every version

The full lineage of the Qwen-Omni line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
Qwen3-Omnicurrent2025-09-22Apache-2.0
Qwen2.5-Omni2025-03-26Open weights

FAQ

What is Qwen2.5-Omni?

Qwen2.5-Omni is Alibaba's (Qwen) first end-to-end omni-modal model, released March 26, 2025. A single open-weight model takes text, images, audio and video as input and generates both text and natural speech, streaming the speech back in real time. It comes in 7B and 3B sizes under the Apache-2.0 license.

What is the Thinker-Talker architecture?

It splits the model into two parts. The Thinker is a transformer-decoder LLM (with audio and vision encoders) that understands the inputs and produces text. The Talker is a dual-track autoregressive decoder that turns the Thinker's output into discrete speech tokens, so the model can write and speak at the same time. A position embedding called TMRoPE keeps video and audio time-aligned.

Is Qwen2.5-Omni open source and free to use?

The weights are released under Apache-2.0, so you can download, self-host and use them commercially. It is also offered as a paid API (qwen2.5-omni-7b) on Alibaba Cloud Model Studio, where text input is about $0.10 per 1M tokens and text output about $0.40 per 1M tokens in the Singapore region.

What is the context window of Qwen2.5-Omni?

The 7B model supports a context length of 32,768 tokens (32K), per its Hugging Face config (max_position_embeddings).