GLM-4.7-Flash

Z.ai's free, open-weight 30B-A3B MoE built for fast local coding and agents.

Overview

GLM-4.7-Flash is the lightweight, speed-focused member of Z.ai's (Zhipu AI) GLM-4.7 family, released on January 19, 2026. It is a 30B-A3B Mixture-of-Experts model — roughly 30 billion total parameters with only about 3 billion active per token — which Z.ai positions as the free-tier counterpart to the larger GLM-4.7. The whole point of GLM-4.7-Flash is to deliver responsive, low-latency generation and strong coding at a scale you can actually run yourself.

Despite its small active footprint, GLM-4.7-Flash is tuned specifically for agentic coding: it strengthens code generation, long-horizon task planning, and tool collaboration, and ships with a thinking mode, streaming, function calling, structured output, and context caching. It handles a 200K-token context window and can emit very long outputs, making it practical for working across large codebases and multi-step agent loops.

GLM-4.7-Flash is released under the permissive MIT license with open weights on Hugging Face and ModelScope, so it can be used commercially and deployed locally. Z.ai offers it free through its own API, and it is also hosted by third parties such as OpenRouter and Cloudflare Workers AI. Z.ai describes it as the strongest open model in the 30B class, with open-source SOTA scores on benchmarks like SWE-bench Verified and τ²-Bench among comparable-size models.

Released	2026-01-19
License	MIT
Weights	Open weights
Parameters	30B total / 3B active (30B-A3B MoE)
Context	200K
Max output	128K
Architecture	Sparse Mixture-of-Experts (MoE) transformer with ~30B total parameters and ~3B active per token (30B-A3B). Supports a hybrid "thinking" reasoning mode, function calling, structured output, and context caching. Small enough to run on a single H100, with day-0 inference support in vLLM, SGLang, and Transformers, plus quantized builds for llama.cpp, Ollama, and LM Studio.
Modalities	Text
Status	Available

Benchmarks

Performances on Benchmarks — GLM-4.7-Flash vs Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B, as published on the official Z.ai (zai-org) GLM-4.7-Flash Hugging Face model card.

Benchmark	GLM-4.7-Flash	Qwen3-30B-A3B-Thinking-2507	GPT-OSS-20B
AIME 25	91.6%	85%	91.7%
GPQA	75.2%	73.4%	71.5%
LCB v6	64%	66%	61%
HLE	14.4%	9.8%	10.9%
SWE-bench Verified	59.2%	22%	34%
τ²-Bench	79.5%	49%	47.7%
BrowseComp	42.8%	2.29%	28.3%

Comparison source ↗

This model's scores

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.06 / 1M tokens per 1M tokens
Output	$0.40 / 1M tokens per 1M tokens

Free to use on Z.ai's own API (input, cached input, and output all listed as Free). The per-token prices shown are for third-party hosting via OpenRouter.

Pricing source ↗

Strengths

Free on Z.ai's own API and MIT-licensed with open weights — usable commercially and deployable locally
Tuned for agentic coding: code generation, long-horizon planning, and tool/function calling
Efficient 30B-A3B MoE — only ~3B parameters active per token, runnable on a single H100 or consumer hardware via quantization
Large 200K-token context window with very long max output for big codebases and long agent loops
Day-0 support across vLLM, SGLang, Transformers, Ollama, LM Studio, OpenRouter, and Cloudflare Workers AI
Optional thinking/reasoning mode plus structured output and context caching

Best for

Local and self-hosted AI coding assistants on a single GPU or capable workstation
Agentic workflows that need tool calling, planning, and multi-step execution at low cost
High-throughput, latency-sensitive code completion and refactoring
Long-context tasks: reading large repositories, long documents, and extended chat history
Budget or free prototyping of LLM apps before scaling to a larger model
Chinese-language writing, translation, and long-form text processing

How to access

Provider	Model ID
Z.ai ↗	`glm-4.7-flash`
OpenRouter ↗	`z-ai/glm-4.7-flash`
Cloudflare Workers AI ↗	`glm-4.7-flash`
Ollama ↗	`glm-4.7-flash`

FAQ

Is GLM-4.7-Flash free?

Yes. Z.ai lists GLM-4.7-Flash as completely free on its own API — input, cached input, and output are all priced at Free. The weights are also openly available under the MIT license, so you can run it yourself at no licensing cost. Third-party hosts like OpenRouter charge a small per-token fee (about $0.06 per million input tokens and $0.40 per million output tokens) for their infrastructure.

How big is GLM-4.7-Flash and can I run it locally?

It is a 30B-A3B Mixture-of-Experts model: roughly 30 billion total parameters but only about 3 billion active per token. That sparsity keeps it efficient — it can run on a single H100 GPU, and quantized builds work on consumer hardware via llama.cpp, Ollama, and LM Studio. Day-0 inference support is also available in vLLM, SGLang, and Transformers.

What is GLM-4.7-Flash best at?

It is tuned for fast, agentic coding — code generation, long-horizon task planning, and tool/function calling — and Z.ai cites open-source SOTA results for its size on SWE-bench Verified and τ²-Bench. It also handles math and reasoning well (91.6 on AIME 25) and supports an optional thinking mode, structured output, and a 200K-token context window.

How does GLM-4.7-Flash differ from the full GLM-4.7?

GLM-4.7-Flash is the lightweight, speed-focused, free-tier version. The full GLM-4.7 is larger and more capable (and paid — roughly $0.60 per million input and $2.20 per million output tokens on Z.ai), while Flash trades some peak quality for much lower latency, lower cost, and the ability to self-host a 30B-class model.

// Overview

// Benchmarks

This model's scores

// Pricing

// Strengths

// Best for

// How to access

// FAQ