Overview
GLM-4.6V is the flagship vision-language model in Z.ai's (Zhipu / GLM) GLM-V line, released December 8, 2025 under the permissive MIT license with open weights on Hugging Face and ModelScope. It is a Mixture-of-Experts model with 106B total parameters and roughly 12B active per token, and it ships alongside a smaller 9B sibling, GLM-4.6V-Flash, for low-latency and local deployment.
The headline change in GLM-4.6V is native multimodal Function Calling. Instead of describing an image in text and then calling a tool, the model passes images, screenshots and document pages directly as tool parameters, and tools can return search-result grids, charts, rendered web pages or product images back into its reasoning. Z.ai frames this as bridging visual perception and executable action for multimodal agents. The model accepts video, image, text and file inputs, scales its training context to 128K tokens (roughly 150 pages of dense documents, 200 slides or about an hour of video in one pass), and is tuned for design-to-code work — reconstructing HTML, CSS and JavaScript from UI screenshots.
GLM-4.6V was evaluated on more than 20 mainstream multimodal benchmarks and reports state-of-the-art results among open-source models of comparable scale across multimodal interaction, logical reasoning and long-context understanding. It is the successor to GLM-4.5V (the model family's technical report, arXiv 2507.01006, covers the GLM-4.6V series) and the current top of the GLM-V vision-language line.
| Released | 2025-12-08 |
|---|---|
| License | MIT |
| Weights | Open weights |
| Parameters | 106B total / ~12B active (MoE) |
| Context | 128K |
| Max output | Not disclosed |
| Architecture | Mixture-of-Experts vision-language transformer (model class Glm4vMoeForConditionalGeneration) in the GLM-V family, building on the recipe behind GLM-4.5V and GLM-4.1V-Thinking. A vision encoder feeds image, screenshot, document-page and video frames into a 106B-total MoE language backbone with ~12B parameters active per token. Trained with Reinforcement Learning with Curriculum Sampling (RLCS) and scaled to a 128K-token context. Its defining change is native multimodal Function Calling: images, screenshots and document pages are passed directly as tool parameters, and tools can return image grids, charts or rendered pages back into the reasoning loop. |
| Knowledge cutoff | Not disclosed |
| Modalities | Text, Vision, Video, PDF |
| Status | Available |
Benchmarks
- MMBench V1.1 (EN)88.8%
- MMMU (Val)76%
- MathVista85.2%
- AI2D88.8%
- OCRBench86.5%
- Design2Code88.6%
- WebVoyager81%
- VideoMMU74.7%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | $0.30 per 1M tokens |
|---|---|
| Output | $0.90 per 1M tokens |
Third-party API pricing (Novita / OpenRouter) for the non-reasoning vision model at the Dec 2025 release. Because the weights are MIT-licensed, GLM-4.6V can also be self-hosted at no per-token cost.
Strengths
- Open weights under the permissive MIT license — free to self-host, fine-tune and ship commercially
- Native multimodal Function Calling: images, screenshots and document pages pass directly as tool parameters
- Sparse 106B-MoE design activates only ~12B parameters per token, keeping inference cheaper than a dense 100B+ model
- 128K context handles long documents, large slide decks and roughly an hour of video in a single pass
- Strong on chart, document and OCR understanding plus design-to-code (Design2Code 88.6) screenshot-to-frontend tasks
- Ships with a 9B GLM-4.6V-Flash variant for local and low-latency deployment
Best for
- Multimodal agents and GUI automation that act on what they see (WebVoyager-style web navigation)
- Design-to-code: turning UI screenshots and mockups into HTML, CSS and JavaScript
- Document, chart and table understanding plus OCR over long PDFs
- Video understanding and long-form visual reasoning
- Self-hosted multimodal deployments that require open weights and a permissive license
- Image grounding and visual question answering inside tool-using pipelines
How to access
| Provider | Model ID |
|---|---|
| Z.ai ↗ | glm-4.6v |
| OpenRouter ↗ | z-ai/glm-4.6v |
GLM-V (vision-language) — every version
The full lineage of the GLM-V (vision-language) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| GLM-4.6Vcurrent | 2025-12-08 | — | MIT |
| GLM-4.5V | 2025-08-11 | — | Open weights |
| GLM-4.1V-9B-Thinking | 2025-07-01 | — | Open weights |
FAQ
Is GLM-4.6V open source?
Yes. Z.ai released GLM-4.6V's weights openly under the MIT license on Hugging Face and ModelScope, so you can download, run, fine-tune and deploy it commercially. A smaller 9B GLM-4.6V-Flash variant is released the same way for local and low-latency use.
What makes GLM-4.6V different from a normal vision model?
Its headline feature is native multimodal Function Calling. Instead of converting an image to text before calling a tool, GLM-4.6V passes images, screenshots and document pages directly as tool parameters, and tools can return image grids, charts or rendered pages back into its reasoning — which Z.ai frames as bridging visual perception and executable action for agents.
How big is GLM-4.6V and what can it take as input?
GLM-4.6V is a Mixture-of-Experts model with 106B total parameters and roughly 12B active per token. It accepts video, image, text and file inputs and outputs text, with a 128K-token context that fits roughly 150 pages of dense documents, 200 slides or about an hour of video in a single pass.
How much does GLM-4.6V cost to use?
Through third-party APIs such as Novita and OpenRouter, GLM-4.6V is priced around $0.30 per million input tokens and $0.90 per million output tokens at launch. Because the weights are MIT-licensed, you can also self-host it with no per-token fee.