GLM-4.6V

Open-weight 106B (~12B active) MoE vision-language model with native multimodal function calling — images, screenshots and document pages pass straight in as tool parameters — and a 128K context; MIT license.

Overview

GLM-4.6V is the flagship vision-language model in Z.ai's (Zhipu / GLM) GLM-V line, released December 8, 2025 under the permissive MIT license with open weights on Hugging Face and ModelScope. It is a Mixture-of-Experts model with 106B total parameters and roughly 12B active per token, and it ships alongside a smaller 9B sibling, GLM-4.6V-Flash, for low-latency and local deployment.

The headline change in GLM-4.6V is native multimodal Function Calling. Instead of describing an image in text and then calling a tool, the model passes images, screenshots and document pages directly as tool parameters, and tools can return search-result grids, charts, rendered web pages or product images back into its reasoning. Z.ai frames this as bridging visual perception and executable action for multimodal agents. The model accepts video, image, text and file inputs, scales its training context to 128K tokens (roughly 150 pages of dense documents, 200 slides or about an hour of video in one pass), and is tuned for design-to-code work — reconstructing HTML, CSS and JavaScript from UI screenshots.

GLM-4.6V was evaluated on more than 20 mainstream multimodal benchmarks and reports state-of-the-art results among open-source models of comparable scale across multimodal interaction, logical reasoning and long-context understanding. It is the successor to GLM-4.5V (the model family's technical report, arXiv 2507.01006, covers the GLM-4.6V series) and the current top of the GLM-V vision-language line.

Released	2025-12-08
License	MIT
Weights	Open weights
Parameters	106B total / ~12B active (MoE)
Context	128K
Max output	Not disclosed
Architecture	Mixture-of-Experts vision-language transformer (model class Glm4vMoeForConditionalGeneration) in the GLM-V family, building on the recipe behind GLM-4.5V and GLM-4.1V-Thinking. A vision encoder feeds image, screenshot, document-page and video frames into a 106B-total MoE language backbone with ~12B parameters active per token. Trained with Reinforcement Learning with Curriculum Sampling (RLCS) and scaled to a 128K-token context. Its defining change is native multimodal Function Calling: images, screenshots and document pages are passed directly as tool parameters, and tools can return image grids, charts or rendered pages back into the reasoning loop.
Knowledge cutoff	Not disclosed
Modalities	Text, Vision, Video, PDF
Status	Available

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.30 per 1M tokens
Output	$0.90 per 1M tokens

Third-party API pricing (Novita / OpenRouter) for the non-reasoning vision model at the Dec 2025 release. Because the weights are MIT-licensed, GLM-4.6V can also be self-hosted at no per-token cost.

Pricing source ↗

Strengths

Open weights under the permissive MIT license — free to self-host, fine-tune and ship commercially
Native multimodal Function Calling: images, screenshots and document pages pass directly as tool parameters
Sparse 106B-MoE design activates only ~12B parameters per token, keeping inference cheaper than a dense 100B+ model
128K context handles long documents, large slide decks and roughly an hour of video in a single pass
Strong on chart, document and OCR understanding plus design-to-code (Design2Code 88.6) screenshot-to-frontend tasks
Ships with a 9B GLM-4.6V-Flash variant for local and low-latency deployment

Best for

Multimodal agents and GUI automation that act on what they see (WebVoyager-style web navigation)
Design-to-code: turning UI screenshots and mockups into HTML, CSS and JavaScript
Document, chart and table understanding plus OCR over long PDFs
Video understanding and long-form visual reasoning
Self-hosted multimodal deployments that require open weights and a permissive license
Image grounding and visual question answering inside tool-using pipelines

How to access

Provider	Model ID
Z.ai ↗	`glm-4.6v`
OpenRouter ↗	`z-ai/glm-4.6v`

GLM-V (vision-language) — every version

The full lineage of the GLM-V (vision-language) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
GLM-4.6Vcurrent	2025-12-08	—	MIT
GLM-4.5V	2025-08-11	—	Open weights
GLM-4.1V-9B-Thinking	2025-07-01	—	Open weights

FAQ

Is GLM-4.6V open source?

Yes. Z.ai released GLM-4.6V's weights openly under the MIT license on Hugging Face and ModelScope, so you can download, run, fine-tune and deploy it commercially. A smaller 9B GLM-4.6V-Flash variant is released the same way for local and low-latency use.

What makes GLM-4.6V different from a normal vision model?

Its headline feature is native multimodal Function Calling. Instead of converting an image to text before calling a tool, GLM-4.6V passes images, screenshots and document pages directly as tool parameters, and tools can return image grids, charts or rendered pages back into its reasoning — which Z.ai frames as bridging visual perception and executable action for agents.

How big is GLM-4.6V and what can it take as input?

GLM-4.6V is a Mixture-of-Experts model with 106B total parameters and roughly 12B active per token. It accepts video, image, text and file inputs and outputs text, with a 128K-token context that fits roughly 150 pages of dense documents, 200 slides or about an hour of video in a single pass.

How much does GLM-4.6V cost to use?

Through third-party APIs such as Novita and OpenRouter, GLM-4.6V is priced around $0.30 per million input tokens and $0.90 per million output tokens at launch. Because the weights are MIT-licensed, you can also self-host it with no per-token fee.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// GLM-V (vision-language) — every version

// FAQ