AI/TLDR

GLM-4.6V

Open-weight 106B (~12B active) MoE vision-language model with native multimodal function calling — images, screenshots and document pages pass straight in as tool parameters — and a 128K context; MIT license.

Overview

GLM-4.6V is the flagship vision-language model in Z.ai's (Zhipu / GLM) GLM-V line, released December 8, 2025 under the permissive MIT license with open weights on Hugging Face and ModelScope. It is a Mixture-of-Experts model with 106B total parameters and roughly 12B active per token, and it ships alongside a smaller 9B sibling, GLM-4.6V-Flash, for low-latency and local deployment.

The headline change in GLM-4.6V is native multimodal Function Calling. Instead of describing an image in text and then calling a tool, the model passes images, screenshots and document pages directly as tool parameters, and tools can return search-result grids, charts, rendered web pages or product images back into its reasoning. Z.ai frames this as bridging visual perception and executable action for multimodal agents. The model accepts video, image, text and file inputs, scales its training context to 128K tokens (roughly 150 pages of dense documents, 200 slides or about an hour of video in one pass), and is tuned for design-to-code work — reconstructing HTML, CSS and JavaScript from UI screenshots.

GLM-4.6V was evaluated on more than 20 mainstream multimodal benchmarks and reports state-of-the-art results among open-source models of comparable scale across multimodal interaction, logical reasoning and long-context understanding. It is the successor to GLM-4.5V (the model family's technical report, arXiv 2507.01006, covers the GLM-4.6V series) and the current top of the GLM-V vision-language line.

Released2025-12-08
LicenseMIT
WeightsOpen weights
Parameters106B total / ~12B active (MoE)
Context128K
Max outputNot disclosed
ArchitectureMixture-of-Experts vision-language transformer (model class Glm4vMoeForConditionalGeneration) in the GLM-V family, building on the recipe behind GLM-4.5V and GLM-4.1V-Thinking. A vision encoder feeds image, screenshot, document-page and video frames into a 106B-total MoE language backbone with ~12B parameters active per token. Trained with Reinforcement Learning with Curriculum Sampling (RLCS) and scaled to a 128K-token context. Its defining change is native multimodal Function Calling: images, screenshots and document pages are passed directly as tool parameters, and tools can return image grids, charts or rendered pages back into the reasoning loop.
Knowledge cutoffNot disclosed
ModalitiesText, Vision, Video, PDF
StatusAvailable

Benchmarks

  1. MMBench V1.1 (EN)88.8%
  2. MMMU (Val)76%
  3. MathVista85.2%
  4. AI2D88.8%
  5. OCRBench86.5%
  6. Design2Code88.6%
  7. WebVoyager81%
  8. VideoMMU74.7%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$0.30 per 1M tokens
Output$0.90 per 1M tokens

Third-party API pricing (Novita / OpenRouter) for the non-reasoning vision model at the Dec 2025 release. Because the weights are MIT-licensed, GLM-4.6V can also be self-hosted at no per-token cost.

Pricing source ↗

Strengths

  • Open weights under the permissive MIT license — free to self-host, fine-tune and ship commercially
  • Native multimodal Function Calling: images, screenshots and document pages pass directly as tool parameters
  • Sparse 106B-MoE design activates only ~12B parameters per token, keeping inference cheaper than a dense 100B+ model
  • 128K context handles long documents, large slide decks and roughly an hour of video in a single pass
  • Strong on chart, document and OCR understanding plus design-to-code (Design2Code 88.6) screenshot-to-frontend tasks
  • Ships with a 9B GLM-4.6V-Flash variant for local and low-latency deployment

Best for

  • Multimodal agents and GUI automation that act on what they see (WebVoyager-style web navigation)
  • Design-to-code: turning UI screenshots and mockups into HTML, CSS and JavaScript
  • Document, chart and table understanding plus OCR over long PDFs
  • Video understanding and long-form visual reasoning
  • Self-hosted multimodal deployments that require open weights and a permissive license
  • Image grounding and visual question answering inside tool-using pipelines

How to access

ProviderModel ID
Z.ai ↗glm-4.6v
OpenRouter ↗z-ai/glm-4.6v

GLM-V (vision-language) — every version

The full lineage of the GLM-V (vision-language) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
GLM-4.6Vcurrent2025-12-08MIT
GLM-4.5V2025-08-11Open weights
GLM-4.1V-9B-Thinking2025-07-01Open weights

FAQ

Is GLM-4.6V open source?

Yes. Z.ai released GLM-4.6V's weights openly under the MIT license on Hugging Face and ModelScope, so you can download, run, fine-tune and deploy it commercially. A smaller 9B GLM-4.6V-Flash variant is released the same way for local and low-latency use.

What makes GLM-4.6V different from a normal vision model?

Its headline feature is native multimodal Function Calling. Instead of converting an image to text before calling a tool, GLM-4.6V passes images, screenshots and document pages directly as tool parameters, and tools can return image grids, charts or rendered pages back into its reasoning — which Z.ai frames as bridging visual perception and executable action for agents.

How big is GLM-4.6V and what can it take as input?

GLM-4.6V is a Mixture-of-Experts model with 106B total parameters and roughly 12B active per token. It accepts video, image, text and file inputs and outputs text, with a 128K-token context that fits roughly 150 pages of dense documents, 200 slides or about an hour of video in a single pass.

How much does GLM-4.6V cost to use?

Through third-party APIs such as Novita and OpenRouter, GLM-4.6V is priced around $0.30 per million input tokens and $0.90 per million output tokens at launch. Because the weights are MIT-licensed, you can also self-host it with no per-token fee.