AI/TLDR

GLM-5V-Turbo

Native multimodal vision-to-code model that turns design mockups, screenshots and video into working front-end code and GUI actions.

Overview

GLM-5V-Turbo is the multimodal vision model in Z.ai (Zhipu / GLM)'s GLM Turbo line, released 1 April 2026 and now the current flagship of that line. Unlike a text model with a bolt-on image reader, GLM-5V-Turbo treats visual perception as a core part of reasoning, planning and execution: it takes images, short video clips, design drafts and document layouts directly in context and produces code, structured actions or GUI steps as output.

The headline use case is vision-to-code. GLM-5V-Turbo is tuned to turn a design mockup, screenshot or product walkthrough into runnable front-end code, and it is explicitly integrated for agentic coding workflows such as OpenClaw and Claude Code. Architecturally it builds on the GLM-5-Turbo text base, adding the CogViT vision encoder and a Multimodal Multi-Token Prediction (MMTP) path so it keeps competitive text-only coding ability while gaining native multimodal grounding.

GLM-5V-Turbo is served only through Z.ai's API and chat platform — unlike the open-weights GLM-5 flagship, this Turbo vision variant ships proprietary with no published weights. It exposes a 200K-token context window and up to 128K tokens of output, supports thinking mode, streaming, function calling and context caching, and is priced at $1.20 per million input tokens and $4.00 per million output tokens.

Released2026-04-01
LicenseProprietary (API-only)
WeightsAPI only
Context200K
Max output128K
ArchitectureNative multimodal model built on the GLM-5-Turbo text base, pairing the CogViT vision encoder with a Multimodal Multi-Token Prediction (MMTP) decoding path. Trained with '30+ Task Joint Reinforcement Learning' spanning STEM reasoning, visual grounding, video analysis, document understanding, GUI interaction and tool use. Z.ai has not disclosed total or active parameter counts for this variant.
Knowledge cutoffNot disclosed
ModalitiesText, Vision, Video, PDF
StatusAvailable

Benchmarks

Benchmark table comparing GLM-5V-Turbo with Kimi K2.5 and Claude Opus 4.6 across multimodal coding (Design2Code, Flame-VLM-Code, Vision2Web), multimodal tool use (ImageMining, BrowseComp-VL, MMSearch, MMSearch-Plus, SimpleVQA, Facts, V*) and GUI agent (OSWorld, AndroidWorld, WebVoyager) benchmarks.
Z.ai's multimodal coding, tool-use and GUI-agent benchmark results for GLM-5V-Turbo vs Kimi K2.5 and Claude Opus 4.6. — Z.ai (Zhipu / GLM)
Benchmark table comparing GLM-5V-Turbo with GLM-5-Turbo, Kimi K2.5 and Claude Opus 4.6 on text coding (CC-Backend, CC-Frontend, CC-Repo-Exploration) and Claw agent (PinchBench, ClawEval, ZClawBench) benchmarks.
Z.ai's pure-text coding and Claw agent benchmark results for GLM-5V-Turbo vs GLM-5-Turbo, Kimi K2.5 and Claude Opus 4.6. — Z.ai (Zhipu / GLM)

Z.ai-published benchmark comparison for GLM-5V-Turbo vs GLM-5-Turbo, Kimi K2.5 and Claude Opus 4.6 (multimodal coding/tool-use/GUI-agent table + text-coding/Claw-agent table). Values transcribed from the two readable benchmark tables on the official Z.ai docs page.

BenchmarkGLM-5V-TurboGLM-5-TurboKimi K2.5Claude Opus 4.6
Design2Code94.8 score91.3 score77.3 score
Flame-VLM-Code93.8 score88.8 score98.8 score
Vision2Web31 score33.2 score43.5 score
ImageMining30.7 score24.4 score
BrowseComp-VL51.9 score42.9 score35.9 score
MMSearch72.9 score58.7 score63.8 score
MMSearch-Plus30 score25.6 score25.6 score
SimpleVQA78.2 score71.5 score63.2 score
Facts58.6 score57.8 score
V*89 score84.3 score66.5 score
OSWorld62.3 score63.3 score72.2 score
AndroidWorld75.7 score43.1 score62 score
WebVoyager88.5 score84.3 score88 score
CC-Backend22.8 score20.5 score25.3 score26.9 score
CC-Frontend68.4 score69.4 score62.3 score75.9 score
CC-Repo-Exploration72.2 score68.9 score66.7 score74.4 score
PinchBench (Best/Avg)87.0 / 80.7 score86.5 / 81.1 score84.8 / 79.2 score93.3 / 82.9 score
ClawEval (Pass^3/Pass@3)57.7 / 75.0 score51.0 / 72.1 score52.9 / 73.1 score66.3 / 77.9 score
ZClawBench57.6 score60.6 score49.1 score62.3 score

Comparison source ↗

This model's scores

  1. Design2Code (multimodal coding)94.8%
  2. AndroidWorld (GUI agents)75.7%
  3. OSWorld (GUI agents)62.3%
  4. MMSearch (multimodal tool use)72.9%
  5. ImageMining (visual search)30.7%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$1.20 per 1M tokens
Output$4.00 per 1M tokens

Same per-token pricing as the GLM-5-Turbo text model. API-only via the Z.ai platform.

Pricing source ↗

Strengths

  • Native multimodal grounding — images, video, design drafts and document layouts are processed as first-class reasoning inputs, not OCR'd afterthoughts.
  • Vision-to-code: converts mockups, screenshots and walkthroughs into executable front-end code, the model's primary design goal.
  • Built for agentic engineering — explicitly integrated with OpenClaw and Claude Code workflows, with function calling and context caching.
  • Keeps competitive text-only coding ability via the GLM-5-Turbo base while adding multimodal perception.
  • Large 200K context with up to 128K output, suited to long, tool-augmented multi-step tasks.

Best for

  • Turning Figma/design mockups and UI screenshots directly into working front-end code.
  • GUI and computer-use agents that read a screen and take multi-step actions (Android, desktop, web).
  • Multimodal document understanding — extracting structure and answers from complex PDF and document layouts.
  • Visual tool use and multimodal search inside agent frameworks.
  • Video-grounded tasks such as reasoning over screen recordings and product walkthroughs.

How to access

ProviderModel ID
Z.ai ↗glm-5v-turbo

GLM Turbo — every version

The full lineage of the GLM Turbo line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
GLM-5V-Turbocurrent2026-04-01Open weights
GLM-5-Turbo2026-03-15Open weights

FAQ

What is GLM-5V-Turbo?

GLM-5V-Turbo is Z.ai (Zhipu / GLM)'s native multimodal vision model, released on 1 April 2026 as the current flagship of the GLM Turbo line. It takes images, video, design drafts and documents directly in context and outputs code or structured GUI actions, with a focus on turning designs and screenshots into working front-end code.

Is GLM-5V-Turbo open weights?

No. Unlike the open-weights GLM-5 flagship, GLM-5V-Turbo ships proprietary and is available only through Z.ai's API and chat platform — Z.ai has not published downloadable weights for this Turbo vision variant.

How much does GLM-5V-Turbo cost?

Z.ai prices GLM-5V-Turbo at $1.20 per million input tokens and $4.00 per million output tokens — the same per-token rate as the GLM-5-Turbo text model. It is served via the Z.ai API using the model id glm-5v-turbo.

What context length and output limit does GLM-5V-Turbo support?

GLM-5V-Turbo supports a 200K-token context window and up to 128K tokens of output, along with thinking mode, streaming output, function calling and context caching.