Z.ai & Tsinghua University · 2026-04-29 · major

GLM-5V-Turbo Paper Drops: Z.ai's Native Multimodal Foundation Model with CogViT and MTP

Z.ai posted the technical report for GLM-5V-Turbo, a vision-coding foundation model with the CogViT encoder and an inference-friendly MTP architecture. The paper reached the top of Hugging Face Papers with roughly 2.3K upvotes.

Image: GLM-V GitHub repository banner from Z.ai, showing the multimodal reasoning model family.

Z.ai drops the technical report for GLM-5V-Turbo: native multimodal foundation model with a fresh vision encoder and MTP decoder.

What is it?

GLM-5V-Turbo is Z.ai's first multimodal foundation model purpose-built for agentic workflows over images, videos, webpages, documents, and GUIs. The April 29 arXiv report (arXiv:2604.26752) is the long-form technical writeup behind the API model that started shipping earlier in April. It hit the #1 spot on Hugging Face Papers within hours.

How does it work?

The paper introduces CogViT, a new vision encoder built specifically for the GLM-V family, paired with a multi-token prediction (MTP) decoder for inference efficiency. Training combines hierarchical optimization, multimodal RL, and end-to-end verification of agent rollouts. Tool calling, planning, and execution are integrated into the base model rather than bolted on through separate orchestration.
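To make the inference-efficiency idea concrete, here is a minimal sketch of how multi-token prediction is commonly used at decode time: cheap MTP heads draft several tokens, and the main model keeps only the prefix it agrees with. This is the generic draft-and-verify pattern, not the decoder described in the report. The toy functions `toy_next_token` and `toy_draft`, the helper names, and the greedy acceptance rule are all assumptions made for illustration.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]        # main model: greedy next token
DraftFn = Callable[[List[Token], int], List[Token]]  # MTP heads: k drafted tokens


def mtp_decode_step(prefix: List[Token], next_token: NextTokenFn,
                    draft_tokens: DraftFn, k: int) -> List[Token]:
    """One draft-and-verify step; returns the tokens accepted this step.

    A real system would verify all drafts in a single batched forward pass.
    Here verification is sequential to keep the logic readable.
    """
    accepted: List[Token] = []
    for drafted in draft_tokens(prefix, k):
        target = next_token(prefix + accepted)
        if drafted != target:
            # First disagreement: keep the main model's token and stop.
            accepted.append(target)
            return accepted
        accepted.append(drafted)
    # Every draft matched, so the main model contributes one extra token.
    accepted.append(next_token(prefix + accepted))
    return accepted


def generate(prefix: List[Token], next_token: NextTokenFn,
             draft_tokens: DraftFn, k: int, max_new: int) -> List[Token]:
    """Repeat draft-and-verify steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(mtp_decode_step(out, next_token, draft_tokens, k))
    return out[:len(prefix) + max_new]


# Hypothetical toy "models": the main model counts upward modulo 10, and the
# draft heads are right for their first two guesses, then guess 0.
def toy_next_token(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[Token], k: int) -> List[Token]:
    drafts, last = [], seq[-1]
    for i in range(k):
        last = (last + 1) % 10
        drafts.append(last if i < 2 else 0)
    return drafts


if __name__ == "__main__":
    print(generate([3], toy_next_token, toy_draft, k=4, max_new=5))
```

With greedy acceptance the output is identical to what the main model would produce one token at a time. The gain comes from emitting several tokens per verification step whenever the drafts are correct.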

Why does it matter?

Most 'multimodal agent' systems wrap a text LLM with an image classifier or OCR layer. Z.ai is arguing that perception has to be native to the reasoning loop for agents to handle real GUIs, documents, and video. The 200K-token context window and 131,072-token maximum output leave room for long-horizon coding agents that carry screenshots, documents, and long tool traces in a single session, not just demos.
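For a rough sense of how those two limits interact in an agent loop, the sketch below computes how many output tokens a request can still ask for. It assumes that "200K" means exactly 200,000 tokens and that prompt and output share the same window; neither detail is confirmed by the summary above, and `max_completion_budget` is a hypothetical helper, not part of any Z.ai SDK.

```python
CONTEXT_WINDOW = 200_000   # assumption: "200K" taken as exactly 200,000 tokens
MAX_OUTPUT = 131_072       # max output tokens from the key numbers


def max_completion_budget(prompt_tokens: int,
                          context_window: int = CONTEXT_WINDOW,
                          max_output: int = MAX_OUTPUT) -> int:
    """Largest output length that fits, assuming prompt and output share the window."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    remaining = context_window - prompt_tokens
    return max(0, min(max_output, remaining))


if __name__ == "__main__":
    for used in (50_000, 150_000, 250_000):
        print(used, max_completion_budget(used))
```

Under these assumptions the output cap is the binding limit until the prompt exceeds 68,928 tokens. Beyond that point the remaining context window is what constrains the response.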

Who is it for?

Multimodal-agent researchers and teams building vision-coding agents

Try it

https://docs.z.ai/guides/vlm/glm-5v-turbo

Key numbers

Hugging Face Papers upvotes: 2,290
Context window: 200K tokens
Max output tokens: 131,072
arXiv ID: 2604.26752