In plain English
Running an LLM on Windows means downloading a model file and a program that reads it, so the model thinks on your own PC instead of in a cloud data center. No API key, no monthly bill, no sending your text to anyone. If you have a gaming desktop or laptop with a decent graphics card, you already own most of the hardware people pay for in the cloud.

Here's the mental model. A local LLM is just a big file of numbers (the weights) plus an engine that runs the math on them. On Windows you mostly use a friendly app — Ollama or LM Studio — that bundles the engine, downloads the file for you, and gives you a chat box. Think of it like installing Steam and then installing a game: the platform handles the boring parts, you just pick what to run.
Windows adds one twist that Mac and Linux don't have: you can run the model the plain native way, or inside WSL2 — a real Linux running invisibly inside Windows. Most people should start native. WSL2 is a power-user shortcut you reach for only when a Linux-only tool won't run otherwise. We'll cover exactly when each one earns its keep.
Why it matters
Windows is, by a wide margin, the most common home for a powerful consumer GPU. Gamers buy NVIDIA and AMD cards with 8, 12, 16, or 24 GB of fast video memory — exactly the resource an LLM is hungry for. That hardware sits idle most of the day. Running models locally is how you put it to work.
- Privacy. Your prompts, code, and documents never leave the machine. For anything sensitive — legal notes, health questions, unreleased work — that alone is reason enough.
- Zero per-token cost. Once the model is downloaded, you can run it a million times for the price of electricity. No surprise bills, no rate limits.
- Works offline. On a plane, a bad hotel connection, or an air-gapped network, a local model keeps answering.
- Learning and tinkering. You can swap models, change quantization, and watch how each choice trades quality for speed — feedback you never see behind a cloud API.
The reason a Windows-specific guide is needed is that the friction is Windows-specific. The model files and engines are cross-platform, but GPU drivers, the CUDA-versus-AMD split, the PATH environment variable, and the native-versus-WSL2 question are all places where Windows users get stuck in ways a Mac guide never warns about. Get those four things right and the rest is just downloading a file.
How it works
When you ask a local model a question on Windows, the same pipeline runs every time. The app loads the model's weights into memory, ideally your GPU's video memory (VRAM), and then runs billions of multiplications per generated word. The faster the memory and the more of the model that fits in VRAM, the faster your answer arrives.
The native path (recommended for most people)
Ollama and LM Studio both ship real Windows installers (.exe). You double-click, the app detects your GPU, and it talks to your graphics driver directly. Nothing about Linux is involved. For NVIDIA cards this 'just works' because the engine uses CUDA, NVIDIA's GPU computing layer, which is mature and bundled inside the app. This is the route to start with — it is the least likely to break.
The WSL2 path (a Linux inside Windows)
WSL2 (Windows Subsystem for Linux, version 2) runs a genuine Linux kernel in a lightweight VM. Modern NVIDIA drivers pass the GPU through to that Linux, so CUDA tools that only ship for Linux can use your card at near-native speed. You'd choose this when a tutorial, library, or build step assumes a Linux shell — common in research and advanced llama.cpp builds. The cost is a second filesystem, a second copy of your models unless you're careful, and one more layer to debug.
- Double-click .exe, done
- App detects GPU for you
- NVIDIA CUDA bundled in
- Best AMD support on Windows
- Files live in normal Windows folders
- Run Linux-only tools and scripts
- Manual CUDA/driver setup
- NVIDIA GPU passthrough is solid
- AMD passthrough is rough/limited
- Watch out for duplicate model files
A first run, native, in five minutes
The fastest way to feel this work is the native Ollama path. It installs an engine plus a tiny command-line tool, then downloads and runs a model in one command. (For the broader Ollama walkthrough see how to run Ollama.)
- Update your GPU driver first. NVIDIA users: install the latest Game Ready or Studio driver. AMD users: install the latest Adrenalin driver. Stale drivers are the single most common cause of 'it runs on CPU and is painfully slow'.
- Download the Windows installer from the official Ollama site and run it. It registers the
ollamacommand and starts a background service. - Open PowerShell (or Windows Terminal) and pull a small model that fits most GPUs.
- Chat. The first run downloads the model file; later runs start instantly from cache.
# A ~3B-parameter model: small, fast, fits in ~4 GB of VRAM
ollama run llama3.2
# Confirm what's installed and how big it is
ollama list
# Check it can see your GPU (look for a CUDA/ROCm line, not 'cpu')
ollama psIf ollama is not recognized as a command, your PATH didn't pick up the new install — close and reopen the terminal, or sign out and back in. That single gotcha trips up a huge share of first-timers and has nothing to do with the model itself.
NVIDIA vs AMD on Windows
Which GPU you have changes the smoothness of the ride more than anything else. NVIDIA is the path of least resistance because CUDA is the default target for nearly every LLM tool. AMD works — and modern AMD apps are good — but you'll meet a couple more acronyms.
| NVIDIA | AMD | |
|---|---|---|
| GPU compute layer | CUDA (default everywhere) | ROCm (Linux) or DirectML (Windows) |
| Native Ollama / LM Studio | Works out of the box | Supported on recent cards; check release notes |
| WSL2 GPU passthrough | Mature and reliable | Limited; often more pain than it's worth |
| What usually 'just works' | Almost everything | Native apps; less so raw Linux tooling |
| Best beginner choice | Yes | Fine, with a little more reading |
A quick glossary so the table makes sense. CUDA is NVIDIA's GPU programming platform — the thing almost every AI tool is built against. ROCm is AMD's equivalent, strongest on Linux. DirectML is a Microsoft layer that lets apps use any DirectX-12 GPU (AMD, NVIDIA, even Intel) on Windows; it's broadly compatible but usually slower than CUDA or ROCm. The takeaway: NVIDIA users rarely think about this; AMD users should prefer the native Windows apps, which pick a working backend for them, rather than fighting ROCm inside WSL2.
Common Windows gotchas
Almost every 'it doesn't work' on Windows is one of a handful of repeat offenders. Walk this list before you blame the model.
- Running on CPU by accident. If answers crawl out one word per second, the GPU probably isn't being used. Update your driver, and check
ollama psor LM Studio's GPU indicator. CPU-only works but is many times slower. - Stale or wrong GPU driver. The most common single cause of trouble. NVIDIA: latest Game Ready/Studio driver. AMD: latest Adrenalin. Reboot after installing.
PATHnot updated. A fresh install adds itself toPATH, but your already-open terminal won't know. Open a new terminal window.- Model too big for VRAM. Pick a model that fits. As a rough guide, a 7–8B model at 4-bit quantization needs roughly 5–6 GB of VRAM; a 3B model needs about 3–4 GB. If it overflows, the app spills to system RAM and slows to a crawl — pick a smaller model or a heavier quantization.
- Antivirus or corporate policy blocking the local server. These apps run a small server on localhost (Ollama defaults to port 11434). If a tool can't connect, check Windows Firewall and any endpoint-security software.
- Duplicating models across native and WSL2. If you use both, each keeps its own copy on its own filesystem. That quietly eats tens of gigabytes. Pick one home for your models.
Going deeper
Once the basics click, a few directions are worth knowing as you push further on Windows.
Quantization is your main quality-vs-fit dial. A model shipped in the GGUF format comes in many sizes — Q4, Q5, Q8 and so on — where a smaller number means a smaller file that fits more easily but loses a little accuracy. On a memory-constrained Windows GPU, choosing the right quantization is often what makes a model usable at all. The quantization explainer covers the tradeoffs, and GGUF vs GPTQ vs AWQ compares the formats.
Partial GPU offload. When a model is slightly too big for VRAM, engines let you put some layers on the GPU and the rest on the CPU. Ollama and LM Studio expose a setting for the number of GPU layers. Pushing more layers onto the GPU until you hit the VRAM ceiling is the standard way to squeeze the most speed out of a card that can't hold the whole model.
Running a local API server. Both apps can expose an OpenAI-compatible HTTP endpoint on localhost, so your own scripts and apps can call the local model exactly like a cloud one — just pointed at http://localhost:11434 (Ollama) or LM Studio's server port. This is how local models slot into coding assistants and small projects without any cloud account.
When to actually commit to WSL2. Reserve it for the moment a specific Linux-only workflow blocks you: compiling a bleeding-edge llama.cpp feature, a research repo that assumes a Linux shell, or Docker-based tooling. For everyday chat and serving, native Windows is faster to set up and easier to debug. And if you also use a Mac or want models on the go, the Mac and phone guides cover those platforms — the model files are the same everywhere.
The durable lesson: on Windows, your bottlenecks are almost always the GPU driver, the amount of VRAM, and the chosen quantization — not the model's intelligence. Get those three right, start native, and only step into WSL2 when something genuinely demands Linux.
FAQ
Can I run an LLM on Windows without WSL?
Yes — and for most people you should. Native Windows apps like Ollama and LM Studio ship real .exe installers that use your GPU directly. WSL2 is only worth the extra setup when you need a Linux-only tool or library; for plain chatting and serving, native is simpler and just as fast.
How do I get Ollama to use my GPU on Windows?
Install the latest GPU driver first (NVIDIA Game Ready/Studio, or AMD Adrenalin), then run the native Ollama installer and reboot. Check with ollama ps — you should see a CUDA or ROCm line rather than cpu. If it still runs on CPU, the driver is usually stale or the model is too big to fit in VRAM.
Can I run local LLMs on an AMD GPU on Windows?
Yes. Recent versions of LM Studio and Ollama support many AMD cards on Windows using ROCm or DirectML, and the native apps pick a working backend for you. AMD generally needs a little more reading than NVIDIA, and AMD GPU passthrough into WSL2 is rough, so prefer the native Windows apps.
How much VRAM do I need to run an LLM on Windows?
It depends on model size and quantization. As a rough guide, a 3B model at 4-bit needs about 3–4 GB of VRAM, and a 7–8B model at 4-bit needs about 5–6 GB. If the model doesn't fit, the app spills to system RAM and slows down — pick a smaller model or a heavier quantization.
Why is my local model so slow on Windows?
Almost always because it's running on the CPU instead of the GPU. The usual culprits are a stale GPU driver, a model too large for your VRAM, or the app not detecting the card. Update the driver, reboot, pick a model that fits your VRAM, and confirm GPU use in the app.
What's the best app to run LLMs on Windows for a beginner?
LM Studio if you want a polished graphical app with a model browser and one-click local server, or Ollama if you're comfortable with a single terminal command. Both are built on llama.cpp, so they run the same GGUF model files — the choice is mostly about whether you prefer a GUI or the command line.