In plain English
Yes, you can run a language model on a phone — but it looks quite different from the large cloud models you might be used to. Rather than a 70-billion-parameter model running on a rack of GPUs in a data center, a phone runs a small language model (SLM) with 1–7 billion parameters, compressed down to fit in the device's RAM. Your phone does all the thinking locally, with no internet required after the model downloads.

The best analogy is the shift from mainframes to personal computers. For decades, "serious" computing meant big shared machines that you connected to remotely. Then chips got powerful enough that a machine on your desk could do meaningful work. AI is following the same arc: it started in giant cloud data centers, and the compute is now reaching your pocket. The experience feels different — slower output, smaller context window, narrower capabilities — but the same fundamental thing is happening: a neural network running inference, producing tokens, on hardware you hold in your hand.
Why it matters for builders
The obvious appeal is privacy. Anything that never leaves the device — a private journal, a medical symptom checker, a legal clause analyzer — cannot be logged, stored, or leaked by a cloud provider. For sensitive enterprise apps this is not a nice-to-have; it is sometimes the only viable architecture.
The second appeal is availability. A phone has signal roughly 80% of the time but a reliable LTE/5G connection far less often. Subway tunnels, airplane mode, rural areas, and firewalled corporate networks all produce the same result: no cloud API response. An on-device model responds in milliseconds regardless of connectivity.
The third is latency. Even on a fast connection, a round-trip to a cloud API adds 200–600 ms before the first token arrives. An NPU-accelerated model on a Snapdragon 8 Elite can start producing tokens in under 50 ms. For real-time features — autocomplete, live transcription analysis, interactive tutoring — that gap is perceptible.
Finally, there is cost at scale. A consumer app with 10 million daily users calling a cloud LLM 20 times each makes 200 million API calls a day, and the bill scales with every one of them. If a 2B-parameter on-device model handles 90% of those calls adequately, the marginal inference cost of that 90% drops to zero for the developer.
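To make the scale argument concrete, here is a rough back-of-the-envelope sketch. The user and call counts come from the paragraph above; the tokens per call and the price per million tokens are placeholder assumptions, not quotes from any provider.

```python
def daily_cloud_cost(users, calls_per_user, tokens_per_call,
                     usd_per_million_tokens, local_share=0.0):
    """Estimated daily cloud bill in USD when `local_share` of calls run on-device."""
    cloud_calls = users * calls_per_user * (1.0 - local_share)
    return cloud_calls * tokens_per_call * usd_per_million_tokens / 1_000_000


# 10M users x 20 calls come from the text.
# 500 tokens per call and $1 per million tokens are hypothetical placeholders.
all_cloud = daily_cloud_cost(10_000_000, 20, 500, 1.0)
hybrid = daily_cloud_cost(10_000_000, 20, 500, 1.0, local_share=0.9)
print(f"all cloud: ${all_cloud:,.0f}/day, 90% local: ${hybrid:,.0f}/day")
```

Swap in your own traffic and pricing; the point is only that the cloud bill shrinks in proportion to the share of calls handled locally.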
How it works
Three things have to be true simultaneously for a model to fit on a phone: the parameter count must be small, the precision (how many bits per weight) must be reduced, and the runtime must know how to dispatch the work to the phone's dedicated AI chip rather than burning through the CPU.
The NPU: why it changes the math
Every flagship phone shipped since 2022 includes a Neural Processing Unit (NPU): a dedicated block on the phone's system-on-chip built specifically to accelerate the matrix multiplications that dominate transformer inference. Qualcomm's Hexagon NPU (found in Snapdragon 8-series chips) delivers tens of TOPS (tera-operations per second) of dedicated AI throughput. Apple's Neural Engine, integrated into every A-series iPhone chip since the A11, is tightly coupled to Core ML for the same purpose. Google's Tensor chips include a TPU block optimized for Gemini Nano inference.
The difference matters enormously. On a Snapdragon 8 Elite, running a 3B model through the Hexagon NPU achieves roughly 48 tokens/second. Running the same model on the CPU on the same chip yields around 10 tokens/second. A readable, usable conversation requires at least 8–10 tokens/second; anything below that feels like watching someone hunt-and-peck on a keyboard. NPU acceleration is what makes on-device LLMs feel like a real feature rather than a demo.
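A quick way to feel the gap is to turn tokens per second into waiting time. This small sketch uses the two throughput figures quoted above; the 150-token reply length is an arbitrary assumption.

```python
def reply_seconds(num_tokens, tokens_per_second, first_token_ms=0.0):
    """Time to stream a full reply: time to first token plus steady-state decoding."""
    return first_token_ms / 1000.0 + num_tokens / tokens_per_second


# Throughput figures from the text; the reply length is a made-up example.
for label, speed in [("NPU", 48.0), ("CPU", 10.0)]:
    print(f"{label}: {reply_seconds(150, speed):.1f} s for a 150-token reply")
```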
Quantization: shrinking without gutting
A full-precision (FP32) Llama 3.2 3B model weighs about 12 GB, and even the half-precision (FP16) weights come to about 6 GB, far too large to load alongside the OS on any phone. Quantization re-encodes each weight using fewer bits. INT8 halves the FP16 size to roughly 3 GB; INT4 halves it again to around 1.5–2 GB once format overhead is included, which fits comfortably in the RAM of any phone with 8 GB or more. The accuracy cost is modest: INT8 typically loses 1–3% on standard benchmarks; INT4 loses 5–10%, which for everyday tasks is rarely noticeable.
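The idea of re-encoding weights with fewer bits can be shown with a toy example. This is a minimal sketch of plain symmetric quantization with a single shared scale; real formats work on small blocks of weights with per-block scales, and the weight values here are invented for illustration.

```python
def quantize(weights, bits):
    """Symmetric quantization: map floats to signed integers that share one scale."""
    qmax = 2 ** (bits - 1) - 1               # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.98, -0.07, 0.33, -0.91, 0.45, 0.02]  # toy values
for bits in (8, 4):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"INT{bits}: codes={codes} max error={worst:.4f}")
```

Fewer bits means a coarser grid of representable values, so the worst-case rounding error grows as the bit width shrinks. That is the accuracy cost the paragraph above describes.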
The commonly used Q4_K_M format (a mixed 4-bit scheme from llama.cpp) brings Llama 3.2 3B down to 1.88 GB, roughly a 69% reduction from the 6 GB FP16 weights. That single number explains why the mobile LLM space opened up so fast between 2024 and 2026.
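The size figures follow from simple bytes-per-weight arithmetic. The sketch below assumes a rounded count of 3 billion parameters and ignores file metadata, so treat the outputs as approximations rather than exact file sizes.

```python
def weight_gb(params_billion, bits_per_weight):
    """Weight storage in GB (10^9 bytes), ignoring metadata and activations."""
    return params_billion * bits_per_weight / 8


PARAMS_B = 3.0  # rounded parameter count in billions (assumption)
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_gb(PARAMS_B, bits):.1f} GB")

q4_k_m_gb = 1.88  # Q4_K_M file size quoted in the text
reduction = 1 - q4_k_m_gb / weight_gb(PARAMS_B, 16)
print(f"reduction vs FP16: {reduction:.1%}")
```

The Q4_K_M file is larger than the pure 4-bit estimate because a mixed scheme keeps some tensors at higher precision and stores per-block scales.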
Which models actually fit on a phone
The limiting factor is almost always RAM, not storage. A phone's OS, your apps, and the model all compete for the same pool of memory. A flagship with 12 GB RAM might have 6–7 GB free after the system and a few background apps. That gives you a working budget for the model and its KV cache.
| Model | Quantized size | Min RAM needed | Practical speed (Snapdragon 8 Elite) |
|---|---|---|---|
| Gemma 3n E2B | ~1.5 GB (INT4) | 6 GB phone RAM | 60–70 tok/s |
| Llama 3.2 1B Q4 | ~0.7 GB | 6 GB phone RAM | 80+ tok/s |
| Llama 3.2 3B Q4_K_M | ~1.9 GB | 8 GB phone RAM | 40–50 tok/s |
| Gemma 3n E4B | ~2.5 GB (INT4) | 8 GB phone RAM | 25–35 tok/s |
| Phi-4-mini (3.8B) | ~2.4 GB (INT4) | 8 GB phone RAM | 30–40 tok/s |
| Qwen 3 0.6B | ~0.4 GB | 6 GB phone RAM | 80+ tok/s |
| Llama 3.1 8B Q4 | ~4.7 GB | 12 GB phone RAM | 10–15 tok/s |
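
The RAM budget described above can be turned into a quick feasibility check. This is a rough sketch: the quantized sizes are copied from the table, while the layer count, KV-head count, head dimension, and runtime overhead are illustrative assumptions, and one KV-cache size is applied to every model for simplicity.

```python
def kv_cache_gb(context_tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size: keys and values (factor 2) for every layer and cached token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens / 1e9


# Quantized weight sizes in GB, copied from the table above.
MODELS = {
    "Qwen 3 0.6B": 0.4,
    "Llama 3.2 1B Q4": 0.7,
    "Gemma 3n E2B": 1.5,
    "Llama 3.2 3B Q4_K_M": 1.9,
    "Phi-4-mini (3.8B)": 2.4,
    "Gemma 3n E4B": 2.5,
    "Llama 3.1 8B Q4": 4.7,
}


def models_that_fit(free_ram_gb, kv_gb, overhead_gb=0.5):
    """Names of models whose weights + KV cache + runtime overhead fit in free RAM."""
    return [name for name, size in MODELS.items()
            if size + kv_gb + overhead_gb <= free_ram_gb]


# Hypothetical 3B-class shape: 28 layers, 8 KV heads, head dimension 128, FP16 cache.
kv = kv_cache_gb(context_tokens=4096, layers=28, kv_heads=8, head_dim=128)
print(f"KV cache at 4096 tokens: {kv:.2f} GB")
print("fits in 3.0 GB free:", models_that_fit(3.0, kv))
print("fits in 6.5 GB free:", models_that_fit(6.5, kv))
```

The KV cache grows linearly with context length, so doubling the context window doubles that term while the weights stay fixed.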
Google Gemma 3n deserves special mention because it was designed from scratch for mobile. Its MatFormer architecture nests a smaller model inside a larger one, and Per-Layer Embeddings (PLE) allow a large share of the parameters to be kept outside accelerator memory, which reduces effective memory use beyond what quantization alone achieves. The E2B variant has around 5 billion raw parameters but is engineered to run in just 2 GB of memory, a feat that required rethinking the architecture rather than just compressing weights.
Apple Intelligence takes a different approach. The on-device model (AFM 3 Core, ~3B parameters) is baked into iOS and exposed through the Foundation Models framework. Apple uses 2-bit quantization-aware training (QAT), meaning the model was trained knowing it would be heavily quantized, which recovers much of the accuracy usually lost at 2-bit precision. Developers get a clean Swift API; they cannot swap in a different model, but the system-level integration means it is always NPU-accelerated and always current.
Apps and runtimes to try right now
You do not need to write a line of code to experiment with on-device LLMs. Several consumer apps handle the download, quantization selection, and NPU dispatch for you.
- PocketPal AI (iOS + Android) — the most popular third-party local LLM app with 500K+ downloads as of mid-2026. Supports Phi, Gemma, Qwen, Llama families. Downloads directly from Hugging Face. All inference is local.
- Google AI Edge Gallery (Android, iOS in progress) — Google's experimental open-source app for running Gemma 3n and other Hugging Face models fully offline, built on the LiteRT-LM runtime.
- MLC Chat (iOS + Android) — from the MLC-AI team; broad model support (Llama, Qwen, Gemma, Phi) using their own compiled-kernel approach for NPU acceleration.
- Gemini app on Pixel 9+ — system-level Gemini Nano powers Magic Compose, Recorder summaries, and Smart Reply via Android AICore. Third-party apps can integrate via the ML Kit Inference API on supported devices.
- Apple Intelligence (iOS 18.1+) — not an app you open, but a platform feature. Writing tools, mail summaries, and Siri's reasoning layer all run the AFM 3 Core model on-device. Developers access it via the Foundation Models framework in Swift.
Runtimes for developers embedding a model in their own app
- llama.cpp — the original portable C++ inference engine. Android apps can call it via JNI; iOS via a compiled static library. Handles GGUF quantized models. CPU fallback always available; NPU requires a backend plugin.
- ExecuTorch (Meta / PyTorch) — the production PyTorch-to-mobile pipeline. Supports Qualcomm QNN, MediaTek, Apple ANE, ARM NEON, and Vulkan backends. Powers Llama 3.2 1B/3B on Pixel 8 and iPhone 15 Pro at ~40 tok/s. 50 KB base footprint.
- Google LiteRT-LM — C++ orchestration layer on top of LiteRT (the rebranded TensorFlow Lite), optimized for Gemma inference. Handles KV-cache management and prompt caching. The production backend for Gemini Nano on Pixel Watch and Chromebook Plus.
- Core ML (Apple-only) — Apple's on-device ML framework, required for full ANE access on iOS/macOS. Models must be converted to .mlpackage format. The Foundation Models framework wraps it for LLM use cases.
```shell
# Install llama.cpp on Android via Termux (quick experiment)
# git is needed for the clone step and is not part of a fresh Termux install
pkg install clang cmake git
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build -j4
# Download Llama 3.2 1B Q4 (~700 MB) from Hugging Face and run:
./build/bin/llama-cli -m llama-3.2-1b-instruct-q4_k_m.gguf -p "Hello" -n 128
```
Going deeper
Once a model runs, the next challenge is keeping it running. Mobile operating systems are aggressive about reclaiming RAM: iOS will terminate background apps, and Android's memory management can kill a model mid-session. Production apps use a foreground service (Android) or background task with memory locking (iOS) to keep the model loaded. Reloading a 2 GB model from disk takes 1–3 seconds, which is acceptable for a cold start but terrible if it happens in the middle of a conversation.
Context window management is the other deep problem. A typical mobile model supports 4096–8192 tokens. That sounds generous until you realize a single document can fill it and a multi-turn conversation will blow past it. Runtimes handle overflow by either hard-truncating (dropping early context) or using a sliding window. Neither is invisible to the user. If your app needs long context, consider a hybrid architecture: route short, local queries to the on-device model and escalate long or complex ones to a cloud API.
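The sliding-window strategy can be sketched in a few lines. This is a simplified illustration, not how any particular runtime implements it: `count_tokens` is a hypothetical stand-in for a real tokenizer, and the reply reserve is an arbitrary default.

```python
def count_tokens(text):
    """Hypothetical stand-in for a real tokenizer: roughly one token per word."""
    return len(text.split())


def fit_to_window(system_prompt, turns, max_tokens, reserve_for_reply=256):
    """Keep the system prompt plus the most recent turns that fit in the window."""
    budget = max_tokens - reserve_for_reply - count_tokens(system_prompt)
    kept = []
    for turn in reversed(turns):          # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > budget:
            break                         # this turn and everything older is dropped
        kept.append(turn)
        budget -= cost
    return [system_prompt] + kept[::-1]   # restore chronological order


history = ["a b c d", "e f", "g h i"]
print(fit_to_window("You are helpful", history, max_tokens=10, reserve_for_reply=2))
```

Pinning the system prompt and dropping whole turns from the oldest end keeps the prompt well-formed, but the model still forgets early context, which is why the user can notice it.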
Fine-tuning for mobile is still an emerging frontier. The MobileFineTuner paper (2024) demonstrated on-device LoRA adaptation using gradient checkpointing to fit the backward pass in 6 GB of phone RAM. It is slow — hours for a tiny dataset — but it points toward a future where the model adapts to a specific user's writing style or domain vocabulary without any data leaving the device.
The hardware trajectory
NPU TOPS roughly doubled every two years from 2020 to 2026. The Snapdragon 8 Elite delivers around 45 TOPS from its Hexagon NPU. The Apple M4 Neural Engine hits 38 TOPS in a MacBook but the mobile A18 is close behind. At the current trajectory, a 2028 flagship phone will have enough headroom to run a 7B model at comfortable speed with a 32K context window — territory that required a dedicated GPU workstation in 2023.
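The projection is a straightforward extrapolation of the doubling trend. The sketch below assumes the Snapdragon 8 Elite figure as a 2024 baseline and that the two-year doubling continues, which is far from guaranteed.

```python
def projected_tops(base_tops, base_year, target_year, doubling_years=2.0):
    """Extrapolate NPU throughput assuming a fixed doubling period."""
    return base_tops * 2 ** ((target_year - base_year) / doubling_years)


# 45 TOPS in 2024 is taken from the text; the trend itself is an assumption.
print(f"2028 estimate: {projected_tops(45, 2024, 2028):.0f} TOPS")
```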
For builders, the practical implication is: designs that feel constrained today ("we need cloud for reasoning") may become unnecessary within one or two device generations. Architecting your app with a clean local/cloud split — where local handles what it can and cloud handles the rest — means you can shift the boundary as hardware improves without rewriting your app.
On-device model: strengths
- Privacy: data never leaves the phone
- Works offline and on flaky connections
- First-token latency under 100 ms
- Zero marginal API cost
On-device model: limitations
- Limited to ~7B params and short context
- User must download the model (1–3 GB)
Cloud model: strengths
- Access to 70B+ parameter models
- 100K+ token context windows
- Always the latest model version
- No on-device storage needed
Cloud model: limitations
- Requires reliable internet
- Pay per token; data leaves the device
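One way to keep the local/cloud boundary movable is to isolate the routing decision in a single function. The following is a minimal sketch under stated assumptions: the policy fields, the backend labels, and the thresholds are all hypothetical, and a real app would route on more signals than prompt length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingPolicy:
    local_context_tokens: int = 4096   # lower end of the range quoted earlier
    reserve_for_reply: int = 512       # hypothetical headroom for the answer


def choose_backend(prompt_tokens, is_online, needs_privacy, policy=RoutingPolicy()):
    """Pick a backend; private or offline requests never leave the device."""
    fits_locally = (prompt_tokens + policy.reserve_for_reply
                    <= policy.local_context_tokens)
    if needs_privacy or not is_online:
        # No cloud option: run locally, truncating the prompt if it is too long.
        return "local" if fits_locally else "local-truncated"
    return "local" if fits_locally else "cloud"


print(choose_backend(1000, is_online=True, needs_privacy=False))
print(choose_backend(6000, is_online=True, needs_privacy=False))
print(choose_backend(6000, is_online=False, needs_privacy=False))
```

As phones gain memory and longer context windows, raising `local_context_tokens` shifts more traffic on-device without touching the rest of the app.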
FAQ
What is the minimum phone RAM needed to run a local LLM?
A practical minimum is 8 GB of device RAM, which gives you enough free headroom (after OS and apps) to run a 2–3B quantized model with a short context window. 6 GB phones can technically run models of about 1B parameters or fewer, such as Qwen 3 0.6B or Llama 3.2 1B at INT4, but memory pressure from background apps makes crashes common. 12 GB or more is the comfortable zone for 3–7B models.
Does Gemini Nano work in third-party apps on Android?
As of mid-2026, direct access to Gemini Nano via Android AICore is restricted to certain Google system features and approved partners. Third-party developers on supported Pixel and Samsung Galaxy devices can use the ML Kit Inference API for some tasks, or deploy their own model (Gemma 3n via LiteRT, a llama.cpp-backed model, or ExecuTorch) for full control. The API surface is expanding but is not fully open yet.
How fast is an LLM on a phone compared to a desktop?
On a 2024-era flagship with a Snapdragon 8 Elite, a 3B model generates roughly 40–50 tokens per second through the NPU — usable, but noticeably slower than the 80–150 tok/s you would get from a dedicated GPU on a desktop. A MacBook M4 running the same 3B model through Core ML hits 100+ tok/s. The phone gap closes as NPU TOPS improve with each SoC generation.
Can an iPhone run a local LLM like Llama?
Yes. iPhones with 8 GB RAM (iPhone 15 Pro, iPhone 16 series) can run 1–3B models through PocketPal AI or MLC Chat. Apple also ships its own ~3B on-device model (AFM 3 Core) as part of Apple Intelligence on iOS 18.1+, with developer access via the Foundation Models Swift framework. For open-weight models, ExecuTorch with the Core ML backend gives you full Apple Neural Engine acceleration.
Does running an LLM on your phone drain the battery fast?
It depends on the model size and how often you invoke it. Google reports that Gemma 3 270M uses less than 1% battery for 25 short conversations on a Pixel 9 Pro. A heavier 3B model running continuously would drain battery faster — roughly equivalent to a sustained video call. Most apps run inference only during active use and release the NPU immediately after, so battery impact for realistic usage patterns is moderate.
What is the difference between Gemma 3n and regular Gemma models for phones?
Standard Gemma models are general-purpose; they can be quantized and run on phones but were not designed with mobile memory constraints in mind. Gemma 3n was built mobile-first: its MatFormer architecture and Per-Layer Embeddings let the E2B variant, with about 5B raw parameters, run in roughly the memory footprint of a 2B model (about 1.5–2 GB). It also natively supports vision, audio, and text inputs, capabilities that many small text-only models lack. For a phone deployment, Gemma 3n (or its successor Gemma 4 E2B/E4B) is the Google-recommended starting point.