AI/TLDR

GLM-4.1V-9B-Thinking

Z.ai's 9B open-weight vision-language reasoner that punches at 72B scale

Overview

GLM-4.1V-9B-Thinking is an open-weight vision-language model released on July 1, 2025 by Z.ai (the Zhipu AI / GLM team) together with Tsinghua University's KEG lab. It is the first reasoning-focused entry in the GLM-V multimodal line, built on the GLM-4-9B-0414 base and paired with an AIMv2-Huge vision encoder. Despite being a roughly 9-billion-parameter model, it is positioned to compete with much larger systems on multimodal reasoning.

The defining feature is its 'thinking' paradigm: GLM-4.1V-9B-Thinking produces an explicit chain-of-thought before answering, which the team trained with a method they call Reinforcement Learning with Curriculum Sampling (RLCS). The model accepts text, images, and video, supports a 64K-token context, and handles arbitrary aspect ratios and image resolutions up to 4K via 2D-RoPE positional encoding.

Because the weights ship under an MIT license, GLM-4.1V-9B-Thinking can be downloaded, fine-tuned, and self-hosted without API fees, and its small size makes it practical to run on a single modern GPU. It is also offered as a hosted API through Z.ai's platform and third-party inference providers such as SiliconFlow for teams that prefer not to manage their own infrastructure.

Released2025-07-01
LicenseMIT
WeightsOpen weights
Parameters9B
Context64K
Max output8K
ArchitectureVision-language model combining an AIMv2-Huge vision encoder, an MLP adapter, and a GLM language decoder built on the GLM-4-9B-0414 base. Adds a "thinking" chain-of-thought reasoning paradigm trained with Reinforcement Learning with Curriculum Sampling (RLCS). Uses 2D-RoPE to handle arbitrary aspect ratios and image resolutions up to 4K.
Knowledge cutoffNot disclosed
ModalitiesText, Vision, Video
StatusAvailable

Benchmarks

  1. MMStar72.9%
  2. MathVista80.7%
  3. MMMU68%
  4. MMMU-Pro57.1%
  5. AI2D87.9%
  6. MMBench-V1.1-EN85.8%
  7. OCRBench84.2%
  8. VideoMME (with subtitles)73.6%
  9. WeMath63.8%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$0.035 / 1M tokens per 1M tokens
Output$0.14 / 1M tokens per 1M tokens

Hosted API pricing via SiliconFlow. The model weights are open under MIT, so self-hosting incurs no per-token fee.

Pricing source ↗

Strengths

  • Open-weight (MIT) — free to download, fine-tune, and self-host with no usage restrictions
  • Strong multimodal reasoning for its size: leads 10B-class models on 23 of 28 reported benchmarks and beats Qwen2.5-VL-72B on 18 of them
  • Explicit chain-of-thought 'thinking' output improves accuracy and interpretability on STEM and math-vision tasks
  • Handles arbitrary aspect ratios and up to 4K image resolution, plus video input
  • Small enough (9B) to run on a single GPU, keeping inference cheap

Best for

  • Solving STEM and math problems presented as images, diagrams, or charts
  • Document and chart understanding, including OCR-heavy and long-document pages
  • Video understanding and question answering
  • GUI / screenshot understanding and UI navigation agents
  • Self-hosted multimodal reasoning where data privacy or cost rules out a closed API

How to access

ProviderModel ID
Z.ai (Zhipu / BigModel) ↗glm-4.1v-thinking-flash
SiliconFlow ↗zai-org/GLM-4.1V-9B-Thinking
Hugging Face (weights) ↗zai-org/GLM-4.1V-9B-Thinking

GLM-V (vision-language) — every version

The full lineage of the GLM-V (vision-language) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
GLM-4.6Vcurrent2025-12-08MIT
GLM-4.5V2025-08-11Open weights
GLM-4.1V-9B-Thinking2025-07-01Open weights

FAQ

Is GLM-4.1V-9B-Thinking open source?

Yes. The weights are released under the MIT license on Hugging Face (zai-org/GLM-4.1V-9B-Thinking), so you can download, fine-tune, and self-host the model commercially without per-token fees.

What can GLM-4.1V-9B-Thinking process besides text?

It is a vision-language model that accepts images and video alongside text, with text output. It handles arbitrary aspect ratios and image resolutions up to 4K, and supports a 64K-token context window.

How does a 9B model compete with 72B models?

Its 'thinking' chain-of-thought paradigm, trained with Reinforcement Learning with Curriculum Sampling (RLCS), lets it reason step by step before answering. In Z.ai's technical report it leads 10B-class models on 23 of 28 benchmarks and outperforms the much larger Qwen2.5-VL-72B on 18 of them.

How much does GLM-4.1V-9B-Thinking cost to use?

Self-hosting is free since the weights are open. As a hosted API it is inexpensive — SiliconFlow lists about $0.035 per million input tokens and $0.14 per million output tokens, and it is also served through Z.ai's own platform.