Can You Run an LLM on a Phone? On-Device AI

In plain English

Yes, you can run a language model on a phone — but it looks quite different from the large cloud models you might be used to. Rather than a 70-billion-parameter model running on a rack of GPUs in a data center, a phone runs a small language model (SLM) with 1–7 billion parameters, compressed down to fit in the device's RAM. Your phone does all the thinking locally, with no internet required after the model downloads.

The best analogy is the shift from mainframes to personal computers. For decades, "serious" computing meant big shared machines that you connected to remotely. Then chips got powerful enough that a machine on your desk could do meaningful work. AI is following the same arc: it started in giant cloud data centers, and the compute is now reaching your pocket. The experience feels different — slower output, smaller context window, narrower capabilities — but the same fundamental thing is happening: a neural network running inference, producing tokens, on hardware you hold in your hand.

Why it matters for builders

The obvious appeal is privacy. Anything that never leaves the device — a private journal, a medical symptom checker, a legal clause analyzer — cannot be logged, stored, or leaked by a cloud provider. For sensitive enterprise apps this is not a nice-to-have; it is sometimes the only viable architecture.

The second appeal is availability. A phone has signal roughly 80% of the time but a reliable LTE/5G connection far less often. Subway tunnels, airplane mode, rural areas, and firewalled corporate networks all produce the same result: no cloud API response. An on-device model responds in milliseconds regardless of connectivity.

The third is latency. Even on a fast connection, a round-trip to a cloud API adds 200–600 ms before the first token arrives. An NPU-accelerated model on a Snapdragon 8 Elite can start producing tokens in under 50 ms. For real-time features — autocomplete, live transcription analysis, interactive tutoring — that gap is perceptible.

Finally, there is cost at scale. A consumer app with 10 million daily users calling a cloud LLM 20 times each generates serious API bills. If a 2B-parameter on-device model handles 90% of those calls adequately, the marginal inference cost drops to zero.
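To put rough numbers on that, here is a back-of-the-envelope sketch. The user count, calls per user, and 90% local share come from the paragraph above; the per-call price is a purely hypothetical placeholder, not a quote from any provider.

```python
# Back-of-the-envelope offload arithmetic.
# USERS, CALLS_PER_USER and LOCAL_SHARE come from the text above;
# HYPOTHETICAL_PRICE_PER_CALL_USD is an illustrative placeholder only.

USERS = 10_000_000
CALLS_PER_USER = 20
LOCAL_SHARE = 0.90
HYPOTHETICAL_PRICE_PER_CALL_USD = 0.001


def daily_cloud_calls(users: int, calls_per_user: int, local_share: float) -> int:
    """Number of calls per day that still reach the cloud API."""
    total_calls = users * calls_per_user
    return round(total_calls * (1.0 - local_share))


def daily_cloud_cost(cloud_calls: int, price_per_call_usd: float) -> float:
    """Daily API spend for the calls that are not handled on-device."""
    return cloud_calls * price_per_call_usd


all_cloud_calls = daily_cloud_calls(USERS, CALLS_PER_USER, 0.0)
hybrid_calls = daily_cloud_calls(USERS, CALLS_PER_USER, LOCAL_SHARE)
saved_calls = all_cloud_calls - hybrid_calls

print(f"cloud-only: {all_cloud_calls:,} calls/day")
print(f"hybrid:     {hybrid_calls:,} calls/day")
print(f"saving at the assumed price: "
      f"${daily_cloud_cost(saved_calls, HYPOTHETICAL_PRICE_PER_CALL_USD):,.0f}/day")
```

Whatever the real price is, the saving scales with it linearly, so the interesting variable is the share of calls the local model can handle adequately.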

How it works

Three things have to be true simultaneously for a model to fit on a phone: the parameter count must be small, the precision (how many bits per weight) must be reduced, and the runtime must know how to dispatch the work to the phone's dedicated AI chip rather than burning through the CPU.

The NPU: why it changes the math

Every flagship phone shipped since 2022 includes a Neural Processing Unit (NPU): a dedicated block on the system-on-chip built specifically to accelerate the matrix multiplications that dominate transformer inference. Qualcomm's Hexagon NPU (found in Snapdragon 8-series chips) delivers tens of TOPS (tera-operations per second) of dedicated AI throughput. Apple's Neural Engine, integrated into every A-series iPhone chip since the A11, is tightly coupled to Core ML for the same purpose. Google's Tensor chips include a TPU block optimized for Gemini Nano inference.

The difference matters enormously. On a Snapdragon 8 Elite, running a 3B model through the Hexagon NPU achieves roughly 48 tokens/second. Running the same model on the CPU on the same chip yields around 10 tokens/second. A readable, usable conversation requires at least 8–10 tokens/second; anything below that feels like watching someone hunt-and-peck on a keyboard. NPU acceleration is what makes on-device LLMs feel like a real feature rather than a demo.
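The throughput figures are easier to feel as wall-clock time. The sketch below converts the two rates quoted above into the wait for one reply; the 200-token reply length is an assumption chosen for illustration.

```python
# Convert decode throughput into the time a user waits for a full reply.
# NPU_TOK_S and CPU_TOK_S are the figures quoted in the text;
# REPLY_TOKENS is an assumed reply length, not a measured value.

NPU_TOK_S = 48.0
CPU_TOK_S = 10.0
REPLY_TOKENS = 200


def seconds_to_generate(n_tokens: int, tokens_per_second: float) -> float:
    """Time to stream n_tokens at a steady decode rate (ignores prompt prefill)."""
    return n_tokens / tokens_per_second


npu_seconds = seconds_to_generate(REPLY_TOKENS, NPU_TOK_S)
cpu_seconds = seconds_to_generate(REPLY_TOKENS, CPU_TOK_S)

print(f"NPU: {npu_seconds:.1f} s, CPU: {cpu_seconds:.1f} s, "
      f"speedup: {cpu_seconds / npu_seconds:.1f}x")
```

This ignores prompt processing time, which adds to both paths, so treat it as a lower bound on the wait.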

// How a phone runs an LLM inference request

User types prompt: app, on CPU
Tokenize input: runtime (llama.cpp / LiteRT / ExecuTorch)
Dispatch matrix ops: NPU / ANE / Hexagon
KV cache updated: in device RAM
Next token sampled: back on CPU
Token decoded + streamed to UI: repeat until EOS
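The request flow above can be sketched as a loop. This is a toy illustration only: the vocabulary, the `forward` function (a stand-in for the transformer pass that a real runtime would dispatch to the NPU), and greedy sampling are all assumptions made to keep the example self-contained, not the API of llama.cpp, LiteRT, or ExecuTorch.

```python
# Toy decode loop: tokenize -> forward pass -> update KV cache -> sample -> stream.
# Everything here is a hypothetical stand-in for what a real runtime does.

VOCAB = ["<eos>", "hello", "world", "on", "device", "!"]
EOS_ID = 0


def tokenize(text: str) -> list[int]:
    """Toy whitespace tokenizer; unknown words are dropped."""
    return [VOCAB.index(w) for w in text.lower().split() if w in VOCAB]


def forward(token_id: int, kv_cache: list[int]) -> list[float]:
    """Stand-in for the model's forward pass (the NPU-dispatched step).

    Appends one entry to the cache per processed token and returns fake
    logits that favour the next vocabulary entry, wrapping around to <eos>.
    """
    kv_cache.append(token_id)
    target = (token_id + 1) % len(VOCAB)
    return [5.0 if i == target else 0.0 for i in range(len(VOCAB))]


def sample_greedy(logits: list[float]) -> int:
    """Pick the highest-scoring token (the CPU-side sampling step)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def generate(prompt: str, max_new_tokens: int = 16) -> list[str]:
    kv_cache: list[int] = []
    prompt_ids = tokenize(prompt)
    if not prompt_ids:
        return []
    # Prefill: run every prompt token through the model to fill the cache.
    for token_id in prompt_ids:
        logits = forward(token_id, kv_cache)
    streamed = []
    # Decode: one token per iteration until <eos> or the token limit.
    for _ in range(max_new_tokens):
        next_id = sample_greedy(logits)
        if next_id == EOS_ID:
            break
        streamed.append(VOCAB[next_id])  # decode and "stream" to the UI
        logits = forward(next_id, kv_cache)
    return streamed


print(generate("hello world"))
```

The structure, not the toy model, is the point: each new token costs one forward pass, and the cache grows by one entry per token, which is why context length translates directly into RAM.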

Quantization: shrinking without gutting

A full-precision (FP32) 3B model weighs about 12 GB, far too large for any phone. Quantization re-encodes each weight using fewer bits. Half precision (FP16) brings that to roughly 6 GB, INT8 halves it again to about 3 GB, and INT4 lands near 1.5–2 GB once scales and other metadata are included, which fits comfortably in the RAM of any phone with 8 GB or more. The accuracy cost is modest: INT8 typically loses 1–3% on standard benchmarks; INT4 loses 5–10%, which for everyday tasks is rarely noticeable.
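To make "re-encoding each weight using fewer bits" concrete, here is a minimal sketch of symmetric per-tensor quantization. It is the simplest textbook scheme, assumed here for illustration; production formats such as Q4_K_M use block-wise scales and mixed precision, and the sample weights are made up.

```python
# Symmetric per-tensor quantization round trip (illustrative, not Q4_K_M).

def quantize_symmetric(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Map floats to signed integers of the given width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for INT4, 127 for INT8
    qmin = -(2 ** (bits - 1))           # -8 for INT4, -128 for INT8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


def max_abs_error(weights: list[float], bits: int) -> float:
    """Worst-case reconstruction error after a quantize/dequantize round trip."""
    quantized, scale = quantize_symmetric(weights, bits)
    restored = dequantize(quantized, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


sample_weights = [0.12, -0.5, 0.33, 0.07, -0.21, 0.48]  # made-up values

for bits in (8, 4):
    print(f"INT{bits}: max error {max_abs_error(sample_weights, bits):.4f}")
```

The rounding error per weight is bounded by half the scale, so halving the bit width roughly multiplies the worst-case error by sixteen in this scheme. That is the gap that block-wise and mixed-precision formats try to narrow.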

The commonly used Q4_K_M format (a mixed 4-bit scheme from llama.cpp) brings a 3B model down to 1.88 GB, roughly a 69% reduction from the 6 GB FP16 version. That single number explains why the mobile LLM space opened up so fast between 2024 and 2026.
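The size figures follow from one formula: parameters times bits per weight, divided by eight. The sketch below assumes a nominal 3.0 billion parameters (real "3B" models are usually a little larger) and uses 1 GB = 10^9 bytes.

```python
# Weight-file size as a function of precision, for a nominal 3B-parameter model.

N_PARAMS = 3_000_000_000   # assumption: exactly 3.0B parameters
Q4_K_M_SIZE_GB = 1.88      # figure quoted in the text


def weight_size_gb(n_params: int, bits_per_weight: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_size_gb(N_PARAMS, bits):.1f} GB")

fp16_gb = weight_size_gb(N_PARAMS, 16)
reduction = 1 - Q4_K_M_SIZE_GB / fp16_gb
effective_bits = Q4_K_M_SIZE_GB * 1e9 * 8 / N_PARAMS

print(f"Q4_K_M vs FP16: {reduction:.1%} smaller, "
      f"about {effective_bits:.1f} bits per weight under the 3.0B assumption")
```

The effective bits per weight come out above 4 because mixed schemes keep some tensors at higher precision and store per-block scales alongside the 4-bit codes.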

// Software stack for on-device LLM inference

Your App / Chat UI: Swift, Kotlin, React Native, Flutter
Inference Runtime: llama.cpp / ExecuTorch / LiteRT-LM / Core ML
Hardware Backend: Qualcomm QNN, Apple ANE, ARM NEON, Vulkan GPU
Model Weights on Disk: GGUF / SafeTensors, 4-bit quantized

Which models actually fit on a phone

The limiting factor is almost always RAM, not storage. A phone's OS, your apps, and the model all compete for the same pool of memory. A flagship with 12 GB RAM might have 6–7 GB free after the system and a few background apps. That gives you a working budget for the model and its KV cache.
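A rough way to check that budget is to add the KV cache to the weight size. The sketch below uses the standard KV cache formula for a transformer with grouped-query attention; the layer count, KV head count, and head dimension are assumed values typical of a 3B-class model, and the 0.5 GB runtime overhead is a guess, so substitute the numbers from your model's config.

```python
# Estimate whether weights + KV cache fit in the RAM left over by the OS.
# Architecture numbers below are illustrative assumptions, not a specific model's spec.

GB = 1e9


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def fits_in_budget(free_ram_gb: float, weights_gb: float, kv_gb: float,
                   runtime_overhead_gb: float = 0.5) -> bool:
    """True if the model, its cache, and an assumed runtime overhead fit in free RAM."""
    return weights_gb + kv_gb + runtime_overhead_gb <= free_ram_gb


# Assumed 3B-class config: 28 layers, 8 KV heads, head dimension 128, FP16 cache.
kv_gb = kv_cache_bytes(28, 8, 128, context_tokens=4096) / GB

print(f"KV cache at 4096 tokens: {kv_gb:.2f} GB")
print("fits in 6 GB free:", fits_in_budget(6.0, weights_gb=1.9, kv_gb=kv_gb))
print("fits in 2 GB free:", fits_in_budget(2.0, weights_gb=1.9, kv_gb=kv_gb))
```

The cache grows linearly with context length, so doubling the context window doubles this term while the weights stay fixed.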

Model	Quantized size	Min RAM needed	Practical speed (Snapdragon 8 Elite)
Gemma 3n E2B	~1.5 GB (INT4)	6 GB phone RAM	60–70 tok/s
Llama 3.2 1B Q4	~0.7 GB	6 GB phone RAM	80+ tok/s
Llama 3.2 3B Q4_K_M	~1.9 GB	8 GB phone RAM	40–50 tok/s
Gemma 3n E4B	~2.5 GB (INT4)	8 GB phone RAM	25–35 tok/s
Phi-4-mini (3.8B)	~2.4 GB (INT4)	8 GB phone RAM	30–40 tok/s
Qwen 3 0.6B	~0.4 GB	6 GB phone RAM	80+ tok/s
Llama 3.1 8B Q4	~4.7 GB	12 GB phone RAM	10–15 tok/s
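The table can also be read as a lookup. The sketch below encodes its size and minimum-RAM columns and picks the largest model a given phone qualifies for; "largest that fits" is a simple heuristic assumed for illustration, and a real app would also weigh speed and task quality.

```python
# Pick a model for a device using the size and min-RAM columns of the table above.
from typing import Optional

# (model name, quantized size in GB, minimum phone RAM in GB)
MODELS = [
    ("Gemma 3n E2B", 1.5, 6),
    ("Llama 3.2 1B Q4", 0.7, 6),
    ("Llama 3.2 3B Q4_K_M", 1.9, 8),
    ("Gemma 3n E4B", 2.5, 8),
    ("Phi-4-mini (3.8B)", 2.4, 8),
    ("Qwen 3 0.6B", 0.4, 6),
    ("Llama 3.1 8B Q4", 4.7, 12),
]


def largest_model_for(phone_ram_gb: int) -> Optional[str]:
    """Return the largest listed model whose minimum RAM the phone meets, if any."""
    candidates = [m for m in MODELS if m[2] <= phone_ram_gb]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m[1])[0]


for ram in (4, 6, 8, 12):
    print(f"{ram} GB phone -> {largest_model_for(ram)}")
```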

Google Gemma 3n deserves special mention because it was designed from scratch for mobile. It combines a MatFormer (nested transformer) architecture with Per-Layer Embeddings (PLE), which let a large share of the parameters be kept out of accelerator memory, reducing effective memory use beyond what quantization alone achieves. The E2B variant has around 5 billion raw parameters but is engineered to run in just 2 GB of memory, a feat that required rethinking the architecture rather than just compressing weights.

Apple Intelligence takes a different approach. The on-device model (AFM 3 Core, ~3B parameters) is baked into iOS and exposed through the Foundation Models framework. Apple uses 2-bit quantization-aware training (QAT), meaning the model was trained knowing it would be heavily quantized, which recovers much of the accuracy usually lost at 2-bit precision. Developers get a clean Swift API; they cannot swap in a different model, but the system-level integration means it is always NPU-accelerated and always current.

Apps and runtimes to try right now

You do not need to write a line of code to experiment with on-device LLMs. Several consumer apps handle the download, quantization selection, and NPU dispatch for you.

PocketPal AI (iOS + Android) — the most popular third-party local LLM app with 500K+ downloads as of mid-2026. Supports Phi, Gemma, Qwen, Llama families. Downloads directly from Hugging Face. All inference is local.
Google AI Edge Gallery (Android, iOS in progress) — Google's experimental open-source app for running Gemma 3n and other Hugging Face models fully offline, built on the LiteRT-LM runtime.
MLC Chat (iOS + Android) — from the MLC-AI team; broad model support (Llama, Qwen, Gemma, Phi) using their own compiled-kernel approach for NPU acceleration.
Gemini app on Pixel 9+ — system-level Gemini Nano powers Magic Compose, Recorder summaries, and Smart Reply via Android AICore. Third-party apps can integrate via the ML Kit Inference API on supported devices.
Apple Intelligence (iOS 18.1+) — not an app you open, but a platform feature. Writing tools, mail summaries, and Siri's reasoning layer all run the AFM 3 Core model on-device. Developers access it via the Foundation Models framework in Swift.

Runtimes for developers embedding a model in their own app

llama.cpp — the original portable C++ inference engine. Android apps can call it via JNI; iOS via a compiled static library. Handles GGUF quantized models. CPU fallback always available; NPU requires a backend plugin.
ExecuTorch (Meta / PyTorch) — the production PyTorch-to-mobile pipeline. Supports Qualcomm QNN, MediaTek, Apple ANE, ARM NEON, and Vulkan backends. Powers Llama 3.2 1B/3B on Pixel 8 and iPhone 15 Pro at ~40 tok/s. 50 KB base footprint.
Google LiteRT-LM — C++ orchestration layer on top of LiteRT (the rebranded TensorFlow Lite), optimized for Gemma inference. Handles KV-cache management and prompt caching. The production backend for Gemini Nano on Pixel Watch and Chromebook Plus.
Core ML (Apple-only) — Apple's on-device ML framework, required for full ANE access on iOS/macOS. Models must be converted to .mlpackage format. The Foundation Models framework wraps it for LLM use cases.

# Install llama.cpp on Android via Termux (quick experiment)
pkg install clang cmake git
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build -j4

# Download Llama 3.2 1B Q4 (~700 MB) from Hugging Face and run:
./build/bin/llama-cli -m llama-3.2-1b-instruct-q4_k_m.gguf -p "Hello" -n 128

Going deeper

Once a model runs, the next challenge is keeping it running. Mobile operating systems are aggressive about reclaiming RAM: iOS will terminate background apps, and Android's memory management can kill a model mid-session. On Android, production apps typically host the model in a foreground service to lower the odds of being killed; on iOS, where background execution is tightly limited, apps generally have to plan for reloading the model when they return to the foreground. Reloading a 2 GB model from disk takes 1–3 seconds, which is acceptable for a cold start but terrible in the middle of a conversation.

Context window management is the other deep problem. A typical mobile model supports 4096–8192 tokens. That sounds generous until you realize a single document can fill it and a multi-turn conversation will blow past it. Runtimes handle overflow by either hard-truncating (dropping early context) or using a sliding window. Neither is invisible to the user. If your app needs long context, consider a hybrid architecture: route short, local queries to the on-device model and escalate long or complex ones to a cloud API.
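The sliding-window option can be sketched in a few lines. This version keeps the system prompt and as many of the most recent messages as fit in the token budget; the whitespace-based `count_tokens` is a crude stand-in for the runtime's real tokenizer, and the function names are hypothetical.

```python
# Sliding-window history trimming: keep the newest messages that fit the budget.

def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def sliding_window(messages: list[str], budget: int, system_prompt: str = "") -> list[str]:
    """Return the most recent messages that fit in `budget` tokens.

    The system prompt is always charged against the budget first; older
    messages are dropped once the remaining budget is exhausted.
    """
    remaining = budget - count_tokens(system_prompt)
    kept: list[str] = []
    for message in reversed(messages):       # walk from newest to oldest
        cost = count_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    kept.reverse()                           # restore chronological order
    return kept


history = ["a b c", "d e", "f g h i", "j"]
print(sliding_window(history, budget=7, system_prompt="sys"))
```

Hard truncation is the same idea without the per-message boundary: it cuts at a token offset rather than at a message edge, which is simpler but can leave a half-sentence at the start of the context.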

Fine-tuning for mobile is still an emerging frontier. The MobileFineTuner paper (2024) demonstrated on-device LoRA adaptation using gradient checkpointing to fit the backward pass in 6 GB of phone RAM. It is slow — hours for a tiny dataset — but it points toward a future where the model adapts to a specific user's writing style or domain vocabulary without any data leaving the device.
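Part of why LoRA is the natural fit here is how few parameters it trains. As a hedged illustration (the 3072-wide layer and rank 8 are assumed values, not figures from the paper), a LoRA adapter on one square weight matrix adds two thin matrices instead of updating the full one:

```python
# Trainable parameters for a LoRA adapter versus full fine-tuning of one matrix.
# Layer width and rank are illustrative assumptions.

def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


D = 3072   # assumed hidden size
RANK = 8   # assumed LoRA rank

ratio = lora_params(D, D, RANK) / full_params(D, D)
print(f"LoRA trains {lora_params(D, D, RANK):,} of {full_params(D, D):,} "
      f"parameters ({ratio:.2%})")
```

Fewer trainable parameters means smaller gradients and optimizer state to hold in memory, though the forward activations still have to fit, which is where gradient checkpointing comes in.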

The hardware trajectory

Peak NPU throughput, measured in TOPS, roughly doubled every two years from 2020 to 2026. The Snapdragon 8 Elite delivers around 45 TOPS from its Hexagon NPU. Apple rates the M4's Neural Engine at 38 TOPS, and the mobile A18 is close behind. At the current trajectory, a 2028 flagship phone will have enough headroom to run a 7B model at comfortable speed with a 32K context window, territory that required a dedicated GPU workstation in 2023.
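The trajectory claim is a simple doubling extrapolation. The sketch below anchors it at the 45 TOPS figure above, treating the Snapdragon 8 Elite as a 2024 part; it is a naive projection under the two-year doubling assumption, not a forecast, and TOPS alone does not determine tokens per second.

```python
# Naive extrapolation of peak NPU TOPS under a "doubles every two years" assumption.

BASE_YEAR = 2024        # assumption: Snapdragon 8 Elite treated as a 2024 baseline
BASE_TOPS = 45.0        # figure quoted in the text
DOUBLING_YEARS = 2.0    # doubling period quoted in the text


def projected_tops(year: int) -> float:
    """Projected peak TOPS for a flagship NPU in the given year."""
    return BASE_TOPS * 2 ** ((year - BASE_YEAR) / DOUBLING_YEARS)


for year in (2024, 2026, 2028):
    print(f"{year}: ~{projected_tops(year):.0f} TOPS")
```

Memory bandwidth and RAM capacity tend to grow more slowly than compute, so the context-window half of the prediction depends on more than this curve.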

For builders, the practical implication is: designs that feel constrained today ("we need cloud for reasoning") may become unnecessary within one or two device generations. Architecting your app with a clean local/cloud split — where local handles what it can and cloud handles the rest — means you can shift the boundary as hardware improves without rewriting your app.
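One way to keep that boundary movable is to isolate it in a small routing function driven by a limits object. The sketch below is a hypothetical design, not an established API: the 4096-token default and the 32K value come from figures in this article, while the 512-token reply reserve and the `needs_heavy_reasoning` flag are assumptions.

```python
# Local/cloud router whose boundary is data, not code, so it can shift with hardware.
from dataclasses import dataclass


@dataclass
class LocalLimits:
    context_tokens: int = 4096        # on-device context window
    reply_reserve_tokens: int = 512   # assumed room to leave for the reply


def route(prompt_tokens: int, needs_heavy_reasoning: bool,
          online: bool, limits: LocalLimits) -> str:
    """Return "local", "cloud", or "unavailable" for a single request."""
    fits_locally = prompt_tokens + limits.reply_reserve_tokens <= limits.context_tokens
    if fits_locally and not needs_heavy_reasoning:
        return "local"
    if online:
        return "cloud"
    # Offline: answer locally on a best-effort basis if the prompt fits at all.
    return "local" if fits_locally else "unavailable"


today = LocalLimits()
future = LocalLimits(context_tokens=32_768)

print(route(800, False, True, today))       # short query
print(route(10_000, False, True, today))    # long document, current hardware
print(route(10_000, False, True, future))   # same document, larger local window
```

Shipping new limits in a config update moves traffic from cloud to device without touching call sites.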

// On-device vs cloud LLM: the real trade-offs

On-device

Privacy: data never leaves phone
Works offline and on flaky connections
First-token latency under 100 ms
Zero marginal API cost
Limited to ~7B params and short context
User must download model (1-3 GB)

Cloud API

Access to 70B+ parameter models
100K+ token context windows
Always latest model version
No on-device storage needed
Requires reliable internet
Pay per token; data leaves device

FAQ

What is the minimum phone RAM needed to run a local LLM?

A practical minimum is 8 GB of device RAM, which gives you enough free headroom (after OS and apps) to run a 2–3B quantized model with a short context window. 6 GB phones can technically run sub-1B models like Qwen 3 0.6B or Llama 3.2 1B at INT4, but memory pressure from background apps makes crashes common. 12 GB or more is the comfortable zone for 3–7B models.

Does Gemini Nano work in third-party apps on Android?

As of mid-2026, direct access to Gemini Nano via Android AICore is restricted to certain Google system features and approved partners. Third-party developers on supported Pixel and Samsung Galaxy devices can use the ML Kit Inference API for some tasks, or deploy their own model (Gemma 3n via LiteRT, a llama.cpp-backed model, or ExecuTorch) for full control. The API surface is expanding but is not fully open yet.

How fast is an LLM on a phone compared to a desktop?

On a 2024-era flagship with a Snapdragon 8 Elite, a 3B model generates roughly 40–50 tokens per second through the NPU — usable, but noticeably slower than the 80–150 tok/s you would get from a dedicated GPU on a desktop. A MacBook M4 running the same 3B model through Core ML hits 100+ tok/s. The phone gap closes as NPU TOPS improve with each SoC generation.

Can an iPhone run a local LLM like Llama?

Yes. iPhones with 8 GB RAM (iPhone 15 Pro, iPhone 16 series) can run 1–3B models through PocketPal AI or MLC Chat. Apple also ships its own ~3B on-device model (AFM 3 Core) as part of Apple Intelligence on iOS 18.1+, with developer access via the Foundation Models Swift framework. For open-weight models, ExecuTorch with the Core ML backend gives you full Apple Neural Engine acceleration.

Does running an LLM on your phone drain the battery fast?

It depends on the model size and how often you invoke it. Google reports that Gemma 3 270M uses less than 1% battery for 25 short conversations on a Pixel 9 Pro. A heavier 3B model running continuously would drain battery faster — roughly equivalent to a sustained video call. Most apps run inference only during active use and release the NPU immediately after, so battery impact for realistic usage patterns is moderate.

What is the difference between Gemma 3n and regular Gemma models for phones?

Standard Gemma models are general-purpose; they can be quantized and run on phones but were not designed with mobile memory constraints in mind. Gemma 3n was built mobile-first: its MatFormer architecture and Per-Layer Embeddings let the E2B variant run about 5B raw parameters in roughly the memory footprint of a 2B model. It also natively supports vision, audio, and text inputs, capabilities that many small text-only models lack. For a phone deployment, Gemma 3n (or its successor Gemma 4 E2B/E4B) is the Google-recommended starting point.
