Overview
GLM-4.7-Flash is the lightweight, speed-focused member of Z.ai's (Zhipu AI) GLM-4.7 family, released on January 19, 2026. It is a 30B-A3B Mixture-of-Experts model — roughly 30 billion total parameters with only about 3 billion active per token — which Z.ai positions as the free-tier counterpart to the larger GLM-4.7. The whole point of GLM-4.7-Flash is to deliver responsive, low-latency generation and strong coding at a scale you can actually run yourself.
Despite its small active footprint, GLM-4.7-Flash is tuned specifically for agentic coding: it strengthens code generation, long-horizon task planning, and tool collaboration, and ships with a thinking mode, streaming, function calling, structured output, and context caching. It handles a 200K-token context window and can emit very long outputs, making it practical for working across large codebases and multi-step agent loops.
GLM-4.7-Flash is released under the permissive MIT license with open weights on Hugging Face and ModelScope, so it can be used commercially and deployed locally. Z.ai offers it free through its own API, and it is also hosted by third parties such as OpenRouter and Cloudflare Workers AI. Z.ai describes it as the strongest open model in the 30B class, with open-source SOTA scores on benchmarks like SWE-bench Verified and τ²-Bench among comparable-size models.
| Released | 2026-01-19 |
|---|---|
| License | MIT |
| Weights | Open weights |
| Parameters | 30B total / 3B active (30B-A3B MoE) |
| Context | 200K |
| Max output | 128K |
| Architecture | Sparse Mixture-of-Experts (MoE) transformer with ~30B total parameters and ~3B active per token (30B-A3B). Supports a hybrid "thinking" reasoning mode, function calling, structured output, and context caching. Small enough to run on a single H100, with day-0 inference support in vLLM, SGLang, and Transformers, plus quantized builds for llama.cpp, Ollama, and LM Studio. |
| Modalities | Text |
| Status | Available |
Benchmarks
Performances on Benchmarks — GLM-4.7-Flash vs Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B, as published on the official Z.ai (zai-org) GLM-4.7-Flash Hugging Face model card.
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B-Thinking-2507 | GPT-OSS-20B |
|---|---|---|---|
| AIME 25 | 91.6% | 85% | 91.7% |
| GPQA | 75.2% | 73.4% | 71.5% |
| LCB v6 | 64% | 66% | 61% |
| HLE | 14.4% | 9.8% | 10.9% |
| SWE-bench Verified | 59.2% | 22% | 34% |
| τ²-Bench | 79.5% | 49% | 47.7% |
| BrowseComp | 42.8% | 2.29% | 28.3% |
This model's scores
- AIME 2591.6%
- GPQA75.2%
- LiveCodeBench v664%
- SWE-bench Verified59.2%
- τ²-Bench79.5%
- BrowseComp42.8%
- Humanity's Last Exam (HLE)14.4%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | $0.06 / 1M tokens per 1M tokens |
|---|---|
| Output | $0.40 / 1M tokens per 1M tokens |
Free to use on Z.ai's own API (input, cached input, and output all listed as Free). The per-token prices shown are for third-party hosting via OpenRouter.
Strengths
- Free on Z.ai's own API and MIT-licensed with open weights — usable commercially and deployable locally
- Tuned for agentic coding: code generation, long-horizon planning, and tool/function calling
- Efficient 30B-A3B MoE — only ~3B parameters active per token, runnable on a single H100 or consumer hardware via quantization
- Large 200K-token context window with very long max output for big codebases and long agent loops
- Day-0 support across vLLM, SGLang, Transformers, Ollama, LM Studio, OpenRouter, and Cloudflare Workers AI
- Optional thinking/reasoning mode plus structured output and context caching
Best for
- Local and self-hosted AI coding assistants on a single GPU or capable workstation
- Agentic workflows that need tool calling, planning, and multi-step execution at low cost
- High-throughput, latency-sensitive code completion and refactoring
- Long-context tasks: reading large repositories, long documents, and extended chat history
- Budget or free prototyping of LLM apps before scaling to a larger model
- Chinese-language writing, translation, and long-form text processing
How to access
| Provider | Model ID |
|---|---|
| Z.ai ↗ | glm-4.7-flash |
| OpenRouter ↗ | z-ai/glm-4.7-flash |
| Cloudflare Workers AI ↗ | glm-4.7-flash |
| Ollama ↗ | glm-4.7-flash |
FAQ
Is GLM-4.7-Flash free?
Yes. Z.ai lists GLM-4.7-Flash as completely free on its own API — input, cached input, and output are all priced at Free. The weights are also openly available under the MIT license, so you can run it yourself at no licensing cost. Third-party hosts like OpenRouter charge a small per-token fee (about $0.06 per million input tokens and $0.40 per million output tokens) for their infrastructure.
How big is GLM-4.7-Flash and can I run it locally?
It is a 30B-A3B Mixture-of-Experts model: roughly 30 billion total parameters but only about 3 billion active per token. That sparsity keeps it efficient — it can run on a single H100 GPU, and quantized builds work on consumer hardware via llama.cpp, Ollama, and LM Studio. Day-0 inference support is also available in vLLM, SGLang, and Transformers.
What is GLM-4.7-Flash best at?
It is tuned for fast, agentic coding — code generation, long-horizon task planning, and tool/function calling — and Z.ai cites open-source SOTA results for its size on SWE-bench Verified and τ²-Bench. It also handles math and reasoning well (91.6 on AIME 25) and supports an optional thinking mode, structured output, and a 200K-token context window.
How does GLM-4.7-Flash differ from the full GLM-4.7?
GLM-4.7-Flash is the lightweight, speed-focused, free-tier version. The full GLM-4.7 is larger and more capable (and paid — roughly $0.60 per million input and $2.20 per million output tokens on Z.ai), while Flash trades some peak quality for much lower latency, lower cost, and the ability to self-host a 30B-class model.