AI/TLDR

GLM-4.7-Flash

Z.ai's free, open-weight 30B-A3B MoE built for fast local coding and agents.

Overview

GLM-4.7-Flash is the lightweight, speed-focused member of Z.ai's (Zhipu AI) GLM-4.7 family, released on January 19, 2026. It is a 30B-A3B Mixture-of-Experts model — roughly 30 billion total parameters with only about 3 billion active per token — which Z.ai positions as the free-tier counterpart to the larger GLM-4.7. The whole point of GLM-4.7-Flash is to deliver responsive, low-latency generation and strong coding at a scale you can actually run yourself.

Despite its small active footprint, GLM-4.7-Flash is tuned specifically for agentic coding: it strengthens code generation, long-horizon task planning, and tool collaboration, and ships with a thinking mode, streaming, function calling, structured output, and context caching. It handles a 200K-token context window and can emit very long outputs, making it practical for working across large codebases and multi-step agent loops.

GLM-4.7-Flash is released under the permissive MIT license with open weights on Hugging Face and ModelScope, so it can be used commercially and deployed locally. Z.ai offers it free through its own API, and it is also hosted by third parties such as OpenRouter and Cloudflare Workers AI. Z.ai describes it as the strongest open model in the 30B class, with open-source SOTA scores on benchmarks like SWE-bench Verified and τ²-Bench among comparable-size models.

Released2026-01-19
LicenseMIT
WeightsOpen weights
Parameters30B total / 3B active (30B-A3B MoE)
Context200K
Max output128K
ArchitectureSparse Mixture-of-Experts (MoE) transformer with ~30B total parameters and ~3B active per token (30B-A3B). Supports a hybrid "thinking" reasoning mode, function calling, structured output, and context caching. Small enough to run on a single H100, with day-0 inference support in vLLM, SGLang, and Transformers, plus quantized builds for llama.cpp, Ollama, and LM Studio.
ModalitiesText
StatusAvailable

Benchmarks

Performances on Benchmarks — GLM-4.7-Flash vs Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B, as published on the official Z.ai (zai-org) GLM-4.7-Flash Hugging Face model card.

BenchmarkGLM-4.7-FlashQwen3-30B-A3B-Thinking-2507GPT-OSS-20B
AIME 2591.6%85%91.7%
GPQA75.2%73.4%71.5%
LCB v664%66%61%
HLE14.4%9.8%10.9%
SWE-bench Verified59.2%22%34%
τ²-Bench79.5%49%47.7%
BrowseComp42.8%2.29%28.3%

Comparison source ↗

This model's scores

  1. AIME 2591.6%
  2. GPQA75.2%
  3. LiveCodeBench v664%
  4. SWE-bench Verified59.2%
  5. τ²-Bench79.5%
  6. BrowseComp42.8%
  7. Humanity's Last Exam (HLE)14.4%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$0.06 / 1M tokens per 1M tokens
Output$0.40 / 1M tokens per 1M tokens

Free to use on Z.ai's own API (input, cached input, and output all listed as Free). The per-token prices shown are for third-party hosting via OpenRouter.

Pricing source ↗

Strengths

  • Free on Z.ai's own API and MIT-licensed with open weights — usable commercially and deployable locally
  • Tuned for agentic coding: code generation, long-horizon planning, and tool/function calling
  • Efficient 30B-A3B MoE — only ~3B parameters active per token, runnable on a single H100 or consumer hardware via quantization
  • Large 200K-token context window with very long max output for big codebases and long agent loops
  • Day-0 support across vLLM, SGLang, Transformers, Ollama, LM Studio, OpenRouter, and Cloudflare Workers AI
  • Optional thinking/reasoning mode plus structured output and context caching

Best for

  • Local and self-hosted AI coding assistants on a single GPU or capable workstation
  • Agentic workflows that need tool calling, planning, and multi-step execution at low cost
  • High-throughput, latency-sensitive code completion and refactoring
  • Long-context tasks: reading large repositories, long documents, and extended chat history
  • Budget or free prototyping of LLM apps before scaling to a larger model
  • Chinese-language writing, translation, and long-form text processing

How to access

ProviderModel ID
Z.ai ↗glm-4.7-flash
OpenRouter ↗z-ai/glm-4.7-flash
Cloudflare Workers AI ↗glm-4.7-flash
Ollama ↗glm-4.7-flash

FAQ

Is GLM-4.7-Flash free?

Yes. Z.ai lists GLM-4.7-Flash as completely free on its own API — input, cached input, and output are all priced at Free. The weights are also openly available under the MIT license, so you can run it yourself at no licensing cost. Third-party hosts like OpenRouter charge a small per-token fee (about $0.06 per million input tokens and $0.40 per million output tokens) for their infrastructure.

How big is GLM-4.7-Flash and can I run it locally?

It is a 30B-A3B Mixture-of-Experts model: roughly 30 billion total parameters but only about 3 billion active per token. That sparsity keeps it efficient — it can run on a single H100 GPU, and quantized builds work on consumer hardware via llama.cpp, Ollama, and LM Studio. Day-0 inference support is also available in vLLM, SGLang, and Transformers.

What is GLM-4.7-Flash best at?

It is tuned for fast, agentic coding — code generation, long-horizon task planning, and tool/function calling — and Z.ai cites open-source SOTA results for its size on SWE-bench Verified and τ²-Bench. It also handles math and reasoning well (91.6 on AIME 25) and supports an optional thinking mode, structured output, and a 200K-token context window.

How does GLM-4.7-Flash differ from the full GLM-4.7?

GLM-4.7-Flash is the lightweight, speed-focused, free-tier version. The full GLM-4.7 is larger and more capable (and paid — roughly $0.60 per million input and $2.20 per million output tokens on Z.ai), while Flash trades some peak quality for much lower latency, lower cost, and the ability to self-host a 30B-class model.