GLM-4.1V-9B-Thinking

Z.ai's 9B open-weight vision-language reasoner that punches at 72B scale

Overview

GLM-4.1V-9B-Thinking is an open-weight vision-language model released on July 1, 2025 by Z.ai (the Zhipu AI / GLM team) together with Tsinghua University's KEG lab. It is the first reasoning-focused entry in the GLM-V multimodal line, built on the GLM-4-9B-0414 base and paired with an AIMv2-Huge vision encoder. Despite being a roughly 9-billion-parameter model, it is positioned to compete with much larger systems on multimodal reasoning.

The defining feature is its 'thinking' paradigm: GLM-4.1V-9B-Thinking produces an explicit chain-of-thought before answering, which the team trained with a method they call Reinforcement Learning with Curriculum Sampling (RLCS). The model accepts text, images, and video, supports a 64K-token context, and handles arbitrary aspect ratios and image resolutions up to 4K via 2D-RoPE positional encoding.

Because the weights ship under an MIT license, GLM-4.1V-9B-Thinking can be downloaded, fine-tuned, and self-hosted without API fees, and its small size makes it practical to run on a single modern GPU. It is also offered as a hosted API through Z.ai's platform and third-party inference providers such as SiliconFlow for teams that prefer not to manage their own infrastructure.

Released	2025-07-01
License	MIT
Weights	Open weights
Parameters	9B
Context	64K
Max output	8K
Architecture	Vision-language model combining an AIMv2-Huge vision encoder, an MLP adapter, and a GLM language decoder built on the GLM-4-9B-0414 base. Adds a "thinking" chain-of-thought reasoning paradigm trained with Reinforcement Learning with Curriculum Sampling (RLCS). Uses 2D-RoPE to handle arbitrary aspect ratios and image resolutions up to 4K.
Knowledge cutoff	Not disclosed
Modalities	Text, Vision, Video
Status	Available

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.035 / 1M tokens per 1M tokens
Output	$0.14 / 1M tokens per 1M tokens

Hosted API pricing via SiliconFlow. The model weights are open under MIT, so self-hosting incurs no per-token fee.

Pricing source ↗

Strengths

Open-weight (MIT) — free to download, fine-tune, and self-host with no usage restrictions
Strong multimodal reasoning for its size: leads 10B-class models on 23 of 28 reported benchmarks and beats Qwen2.5-VL-72B on 18 of them
Explicit chain-of-thought 'thinking' output improves accuracy and interpretability on STEM and math-vision tasks
Handles arbitrary aspect ratios and up to 4K image resolution, plus video input
Small enough (9B) to run on a single GPU, keeping inference cheap

Best for

Solving STEM and math problems presented as images, diagrams, or charts
Document and chart understanding, including OCR-heavy and long-document pages
Video understanding and question answering
GUI / screenshot understanding and UI navigation agents
Self-hosted multimodal reasoning where data privacy or cost rules out a closed API

How to access

Provider	Model ID
Z.ai (Zhipu / BigModel) ↗	`glm-4.1v-thinking-flash`
SiliconFlow ↗	`zai-org/GLM-4.1V-9B-Thinking`
Hugging Face (weights) ↗	`zai-org/GLM-4.1V-9B-Thinking`

GLM-V (vision-language) — every version

The full lineage of the GLM-V (vision-language) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
GLM-4.6Vcurrent	2025-12-08	—	MIT
GLM-4.5V	2025-08-11	—	Open weights
GLM-4.1V-9B-Thinking	2025-07-01	—	Open weights

FAQ

Is GLM-4.1V-9B-Thinking open source?

Yes. The weights are released under the MIT license on Hugging Face (zai-org/GLM-4.1V-9B-Thinking), so you can download, fine-tune, and self-host the model commercially without per-token fees.

What can GLM-4.1V-9B-Thinking process besides text?

It is a vision-language model that accepts images and video alongside text, with text output. It handles arbitrary aspect ratios and image resolutions up to 4K, and supports a 64K-token context window.

How does a 9B model compete with 72B models?

Its 'thinking' chain-of-thought paradigm, trained with Reinforcement Learning with Curriculum Sampling (RLCS), lets it reason step by step before answering. In Z.ai's technical report it leads 10B-class models on 23 of 28 benchmarks and outperforms the much larger Qwen2.5-VL-72B on 18 of them.

How much does GLM-4.1V-9B-Thinking cost to use?

Self-hosting is free since the weights are open. As a hosted API it is inexpensive — SiliconFlow lists about $0.035 per million input tokens and $0.14 per million output tokens, and it is also served through Z.ai's own platform.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// GLM-V (vision-language) — every version

// FAQ