AutoGLM-Phone-Multilingual

Z.ai's open-weight 9B vision-language phone agent that reads your Android screen and taps, types, and swipes through tasks in English and Chinese apps.

Overview

AutoGLM-Phone-Multilingual (officially the AutoGLM-Phone-9B-Multilingual weights) is the international-facing variant of Z.ai's open-source mobile agent line, released alongside the Open-AutoGLM project in December 2025. It is a ~9B-parameter vision-language model fine-tuned from GLM-4.1V-9B-Base, with an architecture identical to GLM-4.1V-9B-Thinking. The model looks at a phone screenshot, reasons about what to do next, and emits a concrete UI action.

Unlike a chat model, AutoGLM-Phone-Multilingual is the brain inside a closed loop: the Open-AutoGLM framework captures the Android screen, sends it to the model, and the model returns one of a fixed action set (Launch, Tap, Type, Swipe, Back, Home, Long Press, Double Tap, Wait, plus a hand-back-to-human request for logins or captchas). Those actions are executed on a real device or emulator over ADB (Android Debug Bridge). The 'Multilingual' weight is tuned for English and other-language apps such as Gmail, Google Maps, Amazon, eBay, Booking.com, X, TikTok and WhatsApp, while the sibling AutoGLM-Phone-9B targets 50+ high-frequency Chinese apps like WeChat, Taobao, Douyin and Meituan.

Z.ai (Zhipu AI) ships the weights under the MIT license and the surrounding framework code under Apache-2.0, explicitly so enterprises and developers can self-host and keep screen data, logs and permissions inside their own environment. You can download the model from Hugging Face and ModelScope, run it locally with vLLM or SGLang, or call it through hosted APIs (Z.ai, ModelScope, Novita).

Released	2025-12-11
License	MIT (model weights); Apache-2.0 (framework code)
Weights	Open weights
Parameters	9B
Context	25K
Max output	3K tokens (configurable)
Architecture	Vision-language model fine-tuned from GLM-4.1V-9B-Base; architecture identical to GLM-4.1V-9B-Thinking. Used as the perception-and-action core of the Open-AutoGLM phone-use framework, which drives an Android device over ADB (screenshot in, GUI action out).
Modalities	Text, Vision
Status	Generally available

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.035 / 1M tokens per 1M tokens
Output	$0.138 / 1M tokens per 1M tokens

Pricing shown is Novita AI's hosted serverless rate for the AutoGLM-Phone-9B-Multilingual weights. Z.ai's own developer platform listed the model as 'free for a limited time' at launch. Weights are open and free to self-host (MIT license).

Pricing source ↗

Strengths

Fully open weights (MIT) plus open framework code (Apache-2.0) — self-hostable, no vendor lock-in
Small 9B footprint runs locally via vLLM/SGLang and on consumer hardware through quantized GGUF/MLX builds (Ollama, LM Studio, llama.cpp, Jan)
Purpose-built for on-device phone control: reads screenshots and emits concrete tap/type/swipe actions over ADB
Multilingual tuning targets English-language apps (Gmail, Google Maps, Amazon, eBay, Booking.com) alongside Chinese ones
Privacy-by-design positioning: screen data and permissions stay inside the user's own deployment
Backed by the MobileRL online-RL training method, which reached state-of-the-art mobile-agent success rates

Best for

Autonomous mobile task completion (search a product, order food, book a flight) from a single natural-language instruction
Self-hosted phone-use agents where screen data must stay on-premise for privacy or compliance
QA / UI test automation across real Android apps and emulators
Research on GUI agents and reinforcement learning for mobile control
Building accessibility or hands-free assistants that operate everyday apps on the user's behalf

How to access

Provider	Model ID
Z.ai (Zhipu) Developer Platform ↗	`ZAI/AutoGLM-Phone-9B`
Novita AI ↗	`zai-org/autoglm-phone-9b-multilingual`
ModelScope (self-host / inference) ↗	`ZhipuAI/AutoGLM-Phone-9B-Multilingual`

FAQ

What is AutoGLM-Phone-Multilingual?

It is Z.ai's (Zhipu AI's) open-weight ~9B vision-language model that acts as an autonomous Android phone agent. It reads a screenshot of the phone, reasons about the task, and outputs a concrete UI action (tap, type, swipe, etc.) that the Open-AutoGLM framework executes on the device over ADB. The 'Multilingual' variant is tuned for English and other-language apps in addition to Chinese ones.

Is AutoGLM-Phone-Multilingual open source and free?

Yes. The model weights are released under the MIT license and the surrounding Open-AutoGLM framework code under Apache-2.0, both available on GitHub, Hugging Face and ModelScope. You can self-host it for free. Hosted API access is also available — Z.ai listed it as free for a limited time at launch, and Novita AI offers it on a pay-per-token serverless plan.

How is it different from the Chinese AutoGLM-Phone-9B?

Both share the same ~9B GLM-4.1V-based architecture. AutoGLM-Phone-9B is optimized for 50+ high-frequency Chinese apps (WeChat, Taobao, Douyin, Meituan). AutoGLM-Phone-9B-Multilingual extends support to English and other-language apps such as Gmail, Google Maps, Amazon, eBay and Booking.com, making it suited to international use cases.

What hardware does it need to run?

The model is built on a 9B architecture and can be served with vLLM or SGLang in an OpenAI-compatible format, or run on consumer machines through quantized GGUF/MLX builds for Ollama, LM Studio, llama.cpp and Jan. To actually control a phone you connect an Android device or emulator over ADB through the Open-AutoGLM framework.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// FAQ