AI/TLDR

AutoGLM-Phone-Multilingual

Z.ai's open-weight 9B vision-language phone agent that reads your Android screen and taps, types, and swipes through tasks in English and Chinese apps.

Overview

AutoGLM-Phone-Multilingual (officially the AutoGLM-Phone-9B-Multilingual weights) is the international-facing variant of Z.ai's open-source mobile agent line, released alongside the Open-AutoGLM project in December 2025. It is a ~9B-parameter vision-language model fine-tuned from GLM-4.1V-9B-Base, with an architecture identical to GLM-4.1V-9B-Thinking. The model looks at a phone screenshot, reasons about what to do next, and emits a concrete UI action.

Unlike a chat model, AutoGLM-Phone-Multilingual is the brain inside a closed loop: the Open-AutoGLM framework captures the Android screen, sends it to the model, and the model returns one of a fixed action set (Launch, Tap, Type, Swipe, Back, Home, Long Press, Double Tap, Wait, plus a hand-back-to-human request for logins or captchas). Those actions are executed on a real device or emulator over ADB (Android Debug Bridge). The 'Multilingual' weight is tuned for English and other-language apps such as Gmail, Google Maps, Amazon, eBay, Booking.com, X, TikTok and WhatsApp, while the sibling AutoGLM-Phone-9B targets 50+ high-frequency Chinese apps like WeChat, Taobao, Douyin and Meituan.

Z.ai (Zhipu AI) ships the weights under the MIT license and the surrounding framework code under Apache-2.0, explicitly so enterprises and developers can self-host and keep screen data, logs and permissions inside their own environment. You can download the model from Hugging Face and ModelScope, run it locally with vLLM or SGLang, or call it through hosted APIs (Z.ai, ModelScope, Novita).

Released2025-12-11
LicenseMIT (model weights); Apache-2.0 (framework code)
WeightsOpen weights
Parameters9B
Context25K
Max output3K tokens (configurable)
ArchitectureVision-language model fine-tuned from GLM-4.1V-9B-Base; architecture identical to GLM-4.1V-9B-Thinking. Used as the perception-and-action core of the Open-AutoGLM phone-use framework, which drives an Android device over ADB (screenshot in, GUI action out).
ModalitiesText, Vision
StatusGenerally available

Benchmarks

  1. AndroidWorld (success rate, MobileRL-9B, the RL method behind AutoGLM-Phone)80.2%
  2. AndroidLab (success rate, MobileRL-9B, the RL method behind AutoGLM-Phone)53.6%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$0.035 / 1M tokens per 1M tokens
Output$0.138 / 1M tokens per 1M tokens

Pricing shown is Novita AI's hosted serverless rate for the AutoGLM-Phone-9B-Multilingual weights. Z.ai's own developer platform listed the model as 'free for a limited time' at launch. Weights are open and free to self-host (MIT license).

Pricing source ↗

Strengths

  • Fully open weights (MIT) plus open framework code (Apache-2.0) — self-hostable, no vendor lock-in
  • Small 9B footprint runs locally via vLLM/SGLang and on consumer hardware through quantized GGUF/MLX builds (Ollama, LM Studio, llama.cpp, Jan)
  • Purpose-built for on-device phone control: reads screenshots and emits concrete tap/type/swipe actions over ADB
  • Multilingual tuning targets English-language apps (Gmail, Google Maps, Amazon, eBay, Booking.com) alongside Chinese ones
  • Privacy-by-design positioning: screen data and permissions stay inside the user's own deployment
  • Backed by the MobileRL online-RL training method, which reached state-of-the-art mobile-agent success rates

Best for

  • Autonomous mobile task completion (search a product, order food, book a flight) from a single natural-language instruction
  • Self-hosted phone-use agents where screen data must stay on-premise for privacy or compliance
  • QA / UI test automation across real Android apps and emulators
  • Research on GUI agents and reinforcement learning for mobile control
  • Building accessibility or hands-free assistants that operate everyday apps on the user's behalf

How to access

ProviderModel ID
Z.ai (Zhipu) Developer Platform ↗ZAI/AutoGLM-Phone-9B
Novita AI ↗zai-org/autoglm-phone-9b-multilingual
ModelScope (self-host / inference) ↗ZhipuAI/AutoGLM-Phone-9B-Multilingual

FAQ

What is AutoGLM-Phone-Multilingual?

It is Z.ai's (Zhipu AI's) open-weight ~9B vision-language model that acts as an autonomous Android phone agent. It reads a screenshot of the phone, reasons about the task, and outputs a concrete UI action (tap, type, swipe, etc.) that the Open-AutoGLM framework executes on the device over ADB. The 'Multilingual' variant is tuned for English and other-language apps in addition to Chinese ones.

Is AutoGLM-Phone-Multilingual open source and free?

Yes. The model weights are released under the MIT license and the surrounding Open-AutoGLM framework code under Apache-2.0, both available on GitHub, Hugging Face and ModelScope. You can self-host it for free. Hosted API access is also available — Z.ai listed it as free for a limited time at launch, and Novita AI offers it on a pay-per-token serverless plan.

How is it different from the Chinese AutoGLM-Phone-9B?

Both share the same ~9B GLM-4.1V-based architecture. AutoGLM-Phone-9B is optimized for 50+ high-frequency Chinese apps (WeChat, Taobao, Douyin, Meituan). AutoGLM-Phone-9B-Multilingual extends support to English and other-language apps such as Gmail, Google Maps, Amazon, eBay and Booking.com, making it suited to international use cases.

What hardware does it need to run?

The model is built on a 9B architecture and can be served with vLLM or SGLang in an OpenAI-compatible format, or run on consumer machines through quantized GGUF/MLX builds for Ollama, LM Studio, llama.cpp and Jan. To actually control a phone you connect an Android device or emulator over ADB through the Open-AutoGLM framework.