AI/TLDR

How to Run LLMs on a Mac: Apple Silicon, Metal, and MLX

You will understand why Macs punch above their weight for local AI and how to get the most out of yours.

INTERMEDIATE13 MIN READUPDATED 2026-06-12

In plain English

For years, running a real AI model locally meant you needed a gaming PC crammed with an NVIDIA GPU. A MacBook seemed like the wrong tool entirely. That changed when Apple introduced its M-series chips. The hardware architecture inside every Mac since 2020 turns out to be surprisingly well-suited for AI inference — not just "it can technically run a model" well-suited, but genuinely competitive with dedicated gaming GPUs for certain workloads.

Run LLMs on a Mac — diagram
Run LLMs on a Mac — linkedin.com

The key is unified memory. On a typical PC, the CPU has its own pool of RAM and the GPU has its own separate video memory (VRAM). A model's weights have to be loaded into whichever one is doing the work, and when you switch, data is copied across a bus. On an Apple Silicon Mac — M1, M2, M3, M4, or M5 — the CPU, GPU, and Neural Engine all share a single physical memory pool. There's no copying. The GPU can read model weights that already sit in the same memory the CPU loaded, with zero transfer overhead. For AI inference, which is almost entirely about reading billions of weights from memory as fast as possible, this is a massive structural advantage.

Think of it this way. A conventional PC gaming setup is like a restaurant with a prep kitchen (CPU memory) and a fry station with its own separate ingredient storage (VRAM). Every time the fry cook needs something, a runner has to carry it over. Apple Silicon is a restaurant where every station shares the same open pantry — the fry cook, the prep cook, and the expeditor all grab directly from the same shelves at the same time. Less running around means faster service.

Why it matters for local AI

The practical consequence is that a Mac packs more usable AI memory into a laptop form factor than almost any PC alternative. A MacBook Pro M4 Max with 128 GB of unified memory can hold and run a 70B-parameter model in 4-bit quantization entirely in memory — something that on a PC would require a multi-GPU server rig costing several times as much. That same MacBook runs on battery, fits in a backpack, and is quiet.

The reason memory capacity matters so much: LLM inference is memory-bandwidth-bound. During every token it generates, the model must read its entire set of weights from memory. The single number that predicts how fast your Mac generates tokens is memory bandwidth — how many gigabytes per second the GPU can read. This is why Mac chips with higher memory bandwidth (M4 Max at 546 GB/s) are dramatically faster than the base chip (M4 at 120 GB/s) for the same model, and why every generation of chip brings measurable speed gains even on the same model.

For developers and power users who want a local AI setup that just works, this matters practically:

  • No VRAM cliff. On an NVIDIA GPU, if a model doesn't fit in VRAM it falls back to RAM at a speed penalty of 5–10x. On a Mac, unified memory means the GPU reads directly from the main memory pool — no cliff, no fallback penalty.
  • Bigger models on smaller budgets. A Mac Mini M4 Pro with 48 GB of RAM can run a 30B model comfortably for a fraction of what a PC server with 48 GB VRAM would cost.
  • Offline, portable, private. A MacBook is a self-contained AI inference machine. No internet, no API keys, no per-token costs after purchase.
  • The ecosystem caught up. Major tools — Ollama, LM Studio, MLX-LM — now treat Apple Silicon as a first-class platform with dedicated optimizations, not an afterthought.

How the hardware accelerates LLM inference

Three layers of hardware inside every Apple Silicon chip work together to accelerate inference: Metal GPU, Neural Engine, and the unified memory fabric that connects them.

Metal: Apple's GPU compute API

Metal is Apple's low-level graphics and compute API, similar in role to CUDA on NVIDIA hardware. LLM runtimes like llama.cpp use Metal shaders to run matrix multiplications — the core math of every transformer forward pass — on the GPU cores rather than the CPU. Metal has supported LLM inference since llama.cpp added a Metal backend in 2023, and it works on every Apple Silicon Mac.

MLX: the higher-level framework built for unified memory

MLX is Apple's open-source array framework for machine learning, released in late 2023 and reaching production maturity in 2025. The key difference from using Metal directly: MLX is designed with the knowledge that CPU and GPU share the same physical memory. Arrays in MLX live in shared memory and both processors operate on the same data simultaneously — there are no explicit copy operations between CPU and GPU memory because there is no such boundary. Runtimes that use Metal directly still incur overhead from Metal's buffer management even though the underlying hardware has unified memory; MLX eliminates that overhead at the API design level.

In March 2026, Ollama shipped version 0.19 with MLX as its backend for Apple Silicon, replacing its direct Metal integration. The improvement was substantial: prefill speed (prompt processing) jumped from roughly 1,150 tokens per second to 1,810 tokens per second, and generation speed (decode) nearly doubled from 58 to 112 tokens per second, measured on an M5 Max running a 35B model. On M5 chips, Ollama via MLX also gains access to the new GPU Neural Accelerators — dedicated matrix-multiplication units not fully exposed through the Metal API.

How memory bandwidth determines your token speed

During every token generation step, the GPU reads all the model weights from memory to compute the next token. The speed at which it can do that reading is memory bandwidth. This creates a predictable rule: double the bandwidth, roughly double the tokens per second. Here are representative bandwidth figures and the throughput you can expect on a 7B model in 4-bit quantization:

ChipUnified memory bandwidth7B Q4 approximate tok/s
M2 (base)~100 GB/s~18 tok/s
M2 Max~400 GB/s~28 tok/s (varies by GPU config)
M4 (base)~120 GB/s~20 tok/s
M4 Pro~273 GB/s~40 tok/s
M4 Max546 GB/s~58 tok/s
M3 Ultra819 GB/s~80+ tok/s

Picking your runtime: Ollama, MLX-LM, or llama.cpp

Three runtimes dominate local LLM use on Macs, and they each make sense for different situations.

Ollama: the easiest path

Ollama is the recommended starting point for most Mac users. Install it with brew install ollama, and you have a service that handles model downloads, GPU acceleration, and an OpenAI-compatible API endpoint at http://localhost:11434. Since version 0.19, Ollama uses MLX under the hood on Apple Silicon, so you get the performance benefits without any additional setup. Pull and run a model in one command:

Ollama quickstart on Macbash
# Install Ollama
brew install ollama

# Start the Ollama service (runs in background)
ollama serve

# In another terminal: pull and chat with a 4B model (~2.5GB download)
ollama run qwen3:4b

# Or a larger 14B model if you have 16GB+ unified memory
ollama run qwen3:14b

MLX-LM: native Python for maximum performance

MLX-LM is the official Python package from Apple's MLX team. Install it with pip and you can run models from the mlx-community Hugging Face organization — a collection of pre-converted MLX-format models. MLX-LM gives slightly higher raw throughput than Ollama for single-user interactive inference because there's no intermediary service layer. It also exposes more control over generation parameters.

MLX-LM quickstartbash
# Install MLX-LM (requires macOS 14+ and Apple Silicon)
pip install mlx-lm

# Generate from a 4-bit Qwen3 4B model
mlx_lm.generate \
  --model mlx-community/Qwen3-4B-Instruct-4bit \
  --prompt "Explain unified memory in one paragraph"

# Start a local server with an OpenAI-compatible API
mlx_lm.server --model mlx-community/Qwen3-4B-Instruct-4bit
MLX-LM Python APIpython
from mlx_lm import load, generate

# Load model and tokenizer (downloads on first run)
model, tokenizer = load("mlx-community/Qwen3-4B-Instruct-4bit")

# Generate a response
response = generate(
    model,
    tokenizer,
    prompt="Explain what Apple Silicon unified memory means for AI",
    max_tokens=256,
    verbose=True,  # prints tokens as they stream
)

llama.cpp and LM Studio: GGUF model compatibility

llama.cpp uses Apple's Metal API for GPU acceleration and supports the GGUF format — the widest-compatibility format in the local LLM ecosystem. Ollama and LM Studio both run llama.cpp under the hood (Ollama now also has the MLX path; LM Studio uses llama.cpp with GGUF). If you need a specific GGUF model not available in MLX format, or you want a GUI chat app, LM Studio (downloadable from lmstudio.ai) wraps llama.cpp in a polished desktop interface and also supports MLX models. For most Mac users who prefer the terminal, Ollama's MLX backend will outperform a direct llama.cpp Metal setup.

Picking a model your Mac can actually run

The rule of thumb for 4-bit quantized models: you need roughly 0.6 GB of memory per billion parameters, plus headroom for the KV cache (the growing context buffer). Rounding up to ~0.7 GB per billion parameter gives you a safe ceiling. A 7B model at Q4 needs about 5 GB; a 70B model needs about 45 GB. Here's what fits comfortably at different memory tiers:

Unified memoryLargest comfortable model (Q4)Recommended models
8 GB7B (just fits, limited context)Qwen3 4B, Llama 3.2 3B
16 GB8B–13BLlama 3.1 8B, Qwen3 8B, Phi-4
24–32 GB14B–30BQwen3 14B, Mistral Small 22B
48–64 GB32B–70BQwen3 32B, Llama 3.3 70B (Q4)
96–128 GB70B–120BLlama 3.3 70B (Q5/Q6), Qwen3 72B

On an 8 GB Mac (base M1, M2, M3, M4 MacBook Air), a 7B model runs but macOS also needs memory for itself and other apps. In practice, 8 GB machines work better with 3B–4B models for comfortable interactive use. The system will swap rather than crash if you exceed memory, but speed drops dramatically when swapping occurs.

MLX format vs GGUF: which should you download?

If you're using Ollama or LM Studio, you don't need to think about this — they handle format selection for you. If you're using MLX-LM directly, you want models from the mlx-community Hugging Face organization, which are pre-converted to MLX's native format. If you're using llama.cpp directly, you want GGUF files. The quality of a model at the same quantization level is similar across formats; the difference is in which runtime can load it efficiently. MLX models tend to use slightly less peak memory than their GGUF equivalents because MLX avoids some buffer-copy overhead.

Going deeper

Once you have a model running smoothly on your Mac, a few directions become interesting.

Fine-tuning locally with MLX

MLX-LM includes a mlx_lm.lora command for LoRA fine-tuning directly on your Mac. Because unified memory means the GPU can access the full system RAM during training, you can fine-tune models that would require a much larger dedicated GPU on a PC. A LoRA fine-tune of a 7B model on a Mac with 32 GB unified memory is entirely practical. The MLX framework also supports QLoRA — combining quantization with LoRA — so even larger base models become trainable on consumer hardware.

The M5 Neural Accelerators

The M5, M5 Pro, and M5 Max chips (released 2025) introduced GPU-side Neural Accelerators — dedicated matrix-multiplication units for ML inference. These are distinct from the Neural Engine (which handles specific Apple-controlled workloads like Face ID). Ollama 0.19+ via MLX exposes these accelerators automatically on M5 hardware; runtimes that use Metal directly do not yet have full access to them. If you're on an M5 Mac, using Ollama 0.19 or later ensures you're getting the most from the hardware.

Serving models to other devices on your network

Both Ollama and MLX-LM's server mode expose an OpenAI-compatible HTTP API. By binding to 0.0.0.0 instead of localhost, your Mac becomes a local AI server for other devices on your home network — phones, tablets, other laptops. This is a practical pattern for households or small teams where one well-specced Mac (say, a Mac Studio M4 Max with 64 GB) serves as a shared local inference node. Set OLLAMA_HOST=0.0.0.0:11434 in your environment before starting ollama serve to enable network-wide access.

Multi-modal models on Mac

Vision-language models — models that accept both images and text — also run on Apple Silicon. Models like LLaVA, Qwen2-VL, and Llama 3.2 Vision are available through Ollama (ollama pull llava) and through MLX-LM. The memory requirements are somewhat higher than the text-only equivalent (vision models carry an image encoder alongside the language model), but the hardware acceleration path is the same: unified memory and Metal/MLX. A Mac with 32 GB can comfortably run a 7B vision model.

Monitoring GPU utilization

To confirm your Mac is actually using GPU acceleration (and not falling back to CPU), open Activity Monitor and check the GPU History window (Window > GPU History). While a model is generating text, you should see GPU utilization spike significantly. Alternatively, the asitop command-line tool (installable via pip) provides a real-time view of CPU, GPU, and memory bandwidth usage — useful for confirming you're getting the hardware throughput you expect and diagnosing bottlenecks.

Monitor GPU and memory usage during inferencebash
# Install asitop for real-time Apple Silicon metrics
pip install asitop

# Run with sudo for full metrics including memory bandwidth
sudo asitop

# While asitop is running, start inference in another terminal:
# ollama run qwen3:14b
# You should see GPU utilization rise during generation

FAQ

Does a Mac need a dedicated GPU to run LLMs?

No. Every Apple Silicon Mac (M1 and later) has GPU cores built into the chip that share the same unified memory pool as the CPU. Tools like Ollama and MLX-LM automatically use those GPU cores for acceleration. There is no separate GPU to install — the hardware is already there.

What is the minimum Mac spec for running a useful LLM?

An M1 MacBook Air with 8 GB of unified memory can run 3B–4B parameter models in 4-bit quantization at interactive speeds (15–20 tokens per second). Models at this size are genuinely useful for many tasks. For a 7B model with comfortable context headroom, 16 GB is the practical minimum. If you want to run 14B+ models, 32 GB is the right target.

What is the difference between MLX and llama.cpp on a Mac?

Both accelerate LLM inference using the Mac's GPU, but MLX is designed from the ground up to exploit unified memory — CPU and GPU operations work on the same memory without any copying overhead. llama.cpp uses Apple's Metal API, which is lower-level and slightly less efficient for Apple Silicon because it abstracts the memory as if it were discrete. In practice, MLX-based runtimes are 20–40% faster for most models on Mac, and the gap widens on Pro, Max, and Ultra chips.

Is Ollama using MLX on Mac now?

Yes. Ollama version 0.19, released in March 2026, switched its Apple Silicon backend from direct Metal to MLX. This roughly doubled decode speed and improved prefill speed by about 60% on the same hardware. Update to Ollama 0.19 or later with brew upgrade ollama to get the MLX backend automatically.

Can I run a 70B model on a MacBook Pro?

You can if your MacBook Pro has enough unified memory. A 70B model in 4-bit quantization needs roughly 40–45 GB. A MacBook Pro M4 Max with 64 GB or 128 GB of unified memory can handle it at reasonable speed (around 10–20 tokens per second depending on configuration). A MacBook with 16 GB or 32 GB cannot fit a 70B model without degrading to CPU-only or heavily swapping.

Why does my Mac slow down when running a large model?

The most common cause is memory pressure — the model's weights plus the KV cache for your context are exceeding available unified memory, forcing macOS to page to disk. Reduce the context window size, switch to a smaller model, or close other memory-heavy apps. A second cause is thermal throttling: sustained inference heats the chip and macOS lowers clock speeds. MacBook Airs (fanless) throttle more aggressively than MacBook Pros with active cooling.

Further reading