GLM-4.5V

Open-weight 106B/12B-active vision-language MoE with a toggleable thinking mode for images, video, documents and GUI agents.

Overview

GLM-4.5V is the open-weight vision-language model in Z.ai's (Zhipu / GLM) GLM-V line, released on August 11, 2025. It is a Mixture-of-Experts model with 106B total parameters and 12B activated per token, built on top of the GLM-4.5-Air-Base text foundation. GLM-4.5V handles images (up to 4K resolution, arbitrary aspect ratio), video, multi-image prompts, and documents such as PDFs and slide decks, and it can output precise bounding-box grounding for elements in a scene.

Like the earlier GLM-4.1V-9B-Thinking, GLM-4.5V ships with a toggleable thinking mode: users can switch between fast direct answers and a deeper chain-of-thought pass for harder visual reasoning, OCR, chart and document parsing, video understanding, and GUI-agent tasks (screen reading, icon detection, desktop operation). Z.ai reports state-of-the-art results among similarly sized open models across 42 public vision-language benchmarks.

The weights are released under the MIT license on Hugging Face (zai-org/GLM-4.5V) and the GLM-V GitHub repo, allowing commercial use and fine-tuning. GLM-4.5V is also served through the Z.ai API platform and third parties such as OpenRouter, with an exposed context window of roughly 64K tokens and up to 16K tokens of output.

Released	2025-08-11
License	MIT
Weights	Open weights
Parameters	106B total / 12B active (MoE)
Context	64K
Max output	16K
Architecture	Mixture-of-Experts vision-language model built on the GLM-4.5-Air-Base text foundation (106B parameters, 12B activated). Pairs an image/video vision encoder with the MoE language model and adds a user-toggleable "thinking" mode that trades latency for deeper multimodal reasoning. Trained with scalable reinforcement learning for visual reasoning, grounding (bounding-box output), and GUI-agent control.
Knowledge cutoff	December 2024
Modalities	Text, Vision, Video, PDF
Status	Available

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.60 / 1M tokens per 1M tokens
Cached input	$0.11 / 1M tokens per 1M tokens
Output	$1.80 / 1M tokens per 1M tokens

Z.ai API platform list price. Open weights (MIT) can also be self-hosted at no per-token cost.

Pricing source ↗

Strengths

Open weights under the permissive MIT license — free for commercial use and fine-tuning
Efficient MoE design: 106B total parameters but only 12B activated per token
Toggleable thinking mode balances fast responses against deeper multimodal reasoning
Broad visual coverage: images up to 4K, video, multi-image, PDFs and slides in one pass
Strong STEM and chart/document scores (MMMU, MathVista, AI2D, ChartQAPro)
Native grounding with bounding-box output and GUI-agent control (OSWorld, WebVoyager)

Best for

Document, chart and long-PDF understanding and information extraction
OCR and structured data extraction from images and plots
Long-video segmentation and event recognition
GUI agents: screen reading, icon detection and desktop operation assistance
Visual grounding and spatial localization with bounding boxes
Multi-image scene analysis, defect inspection and geo/context inference

How to access

Provider	Model ID
Z.ai API Platform ↗	`glm-4.5v`
OpenRouter ↗	`z-ai/glm-4.5v`

GLM-V (vision-language) — every version

The full lineage of the GLM-V (vision-language) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
GLM-4.6Vcurrent	2025-12-08	—	MIT
GLM-4.5V	2025-08-11	—	Open weights
GLM-4.1V-9B-Thinking	2025-07-01	—	Open weights

FAQ

Is GLM-4.5V open source?

Yes. The weights are released under the permissive MIT license on Hugging Face (zai-org/GLM-4.5V) and the GLM-V GitHub repo, allowing commercial use, redistribution and fine-tuning.

How big is GLM-4.5V?

It is a Mixture-of-Experts model with 106B total parameters and 12B activated per token, built on the GLM-4.5-Air-Base text foundation. The efficient MoE design keeps inference cost closer to a 12B model.

What inputs does GLM-4.5V support?

Text plus visual inputs: images up to 4K resolution at any aspect ratio, video, multi-image prompts, and documents such as PDFs and slide decks. It can also output bounding-box grounding and drive GUI agents.

What does GLM-4.5V cost to use?

On the Z.ai API platform it lists at $0.60 per million input tokens and $1.80 per million output tokens, with cached input at $0.11 per million. Because the weights are MIT-licensed, you can also self-host with no per-token fee.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// GLM-V (vision-language) — every version

// FAQ