Qwen3.7-Plus

Alibaba's low-cost multimodal agent that sees screens, codes, and acts in one loop

Overview

Qwen3.7-Plus is Alibaba's Qwen team's multimodal agent model in the Qwen-Plus line, announced on June 2, 2026 and generally available from June 1, 2026 after a short public preview. It bolts image and video understanding onto the text-only Qwen 3.7 backbone and is positioned as the lower-cost sibling of the flagship Qwen3.7-Max — Alibaba lists it at roughly one-sixth the per-token price of Max.

Unlike a text model with a vision adapter, Qwen3.7-Plus is built to operate as an interactive agent: it perceives real-world scenes, reads screens and grounds clicks on GUIs, writes code from visual references, navigates mobile apps end to end, and answers questions about video frames — blending GUI and command-line actions inside a single agent loop with tool calls, self-testing, and autonomous iteration. It is a perception model only: it reads images and video but returns text, not generated pictures.

The model carries a 1-million-token context window and is exposed as the API endpoint qwen3.7-plus on Alibaba Cloud Model Studio (DashScope), reachable through OpenAI-compatible chat-completions and responses APIs across Beijing, Singapore, and US-Virginia endpoints, and resold through aggregators such as OpenRouter. It is proprietary and API-only — no open weights have been published.

Released	2026-06-02
License	Proprietary (API-only)
Weights	API only
Parameters	Not disclosed
Context	1M
Max output	32,768 tokens
Architecture	Multimodal vision-language agent that extends the Qwen 3.7 text backbone with image and video understanding. It is a perception model — it accepts text, images, and video and returns text only (no image generation). Parameter count and exact architecture are not publicly disclosed.
Knowledge cutoff	Not disclosed
Modalities	Text, Vision, Video
Status	Generally available

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.40 / 1M tokens per 1M tokens
Cached input	$0.04 / 1M tokens (cache write) per 1M tokens
Output	$1.20 / 1M tokens per 1M tokens

Singapore (international) region, non-thinking mode, for prompts up to 256K tokens. Above 256K tokens the rate rises to $1.20 input / $3.60 output per 1M. Thinking-mode output is billed higher ($4 / 1M up to 256K). Pricing is tiered by request length.

Pricing source ↗

Strengths

GUI grounding and on-screen agent control: 79.0 on ScreenSpot Pro and 81.0 on AndroidWorld (vendor-reported)
Very large 1M-token context window for long documents, codebases, and multi-turn agent traces
Native image and video input alongside text — reads screens, frames, and document images
Strong agentic tool use, self-testing, and autonomous iteration inherited from the Qwen 3.7 agent backbone
Low per-token price relative to flagship agent models (about one-sixth the cost of Qwen3.7-Max)
OpenAI-compatible API and multi-region availability make integration straightforward

Best for

Computer-use and mobile agents that read screens and click the right UI element
Long-context document, codebase, and transcript analysis up to 1M tokens
Visual question answering over screenshots, document images, and video frames
Coding from visual references and UI mockups with built-in test-and-iterate loops
Tool-calling agent workflows that mix GUI and command-line actions
Cost-sensitive multimodal deployments that need vision without flagship-tier pricing

How to access

Provider	Model ID
Alibaba Cloud Model Studio (DashScope) ↗	`qwen3.7-plus`
OpenRouter ↗	`qwen/qwen3.7-plus`

Qwen-Plus (multimodal agent) — every version

The full lineage of the Qwen-Plus (multimodal agent) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
Qwen3.7-Pluscurrent	2026-06-02	—	Proprietary
Qwen3.6-Plus	2026-04	—	Proprietary
Qwen3.5-Plus	2026-02-16	1M	Proprietary

FAQ

Is Qwen3.7-Plus open source?

No. Qwen3.7-Plus is a proprietary, API-only model. No open weights have been published; it is accessed through Alibaba Cloud Model Studio (DashScope) and resellers like OpenRouter.

What can Qwen3.7-Plus actually take as input?

It accepts text, images, and video, and returns text only. It is a perception and agent model — it reads screens, document images, and video frames, but it does not generate images.

How is Qwen3.7-Plus different from Qwen3.7-Max?

Qwen3.7-Plus adds vision and video understanding on top of the Qwen 3.7 backbone and is the lower-cost, multimodal agent tier. Alibaba lists it at roughly one-sixth the per-token price of the text-focused flagship Qwen3.7-Max.

How large is the context window?

Qwen3.7-Plus supports a 1-million-token context window, with a maximum output of 32,768 tokens per response according to Alibaba Cloud Model Studio documentation.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// Qwen-Plus (multimodal agent) — every version

// FAQ