AI/TLDR

How to Run LLMs Locally with Ollama

After reading you will be able to install Ollama, pull a model, chat from the terminal, and call the local OpenAI-compatible API from your own code.

BEGINNER · 9 MIN READ · UPDATED 2026-06-12

In plain English

Ollama is a free, open-source tool that makes running a large language model on your own computer as simple as one terminal command. You type ollama run llama3.2, it downloads the model, and once the download finishes you are having a conversation that runs entirely on your own machine.

Behind the scenes Ollama does several fiddly things for you: it downloads a quantized model file, loads it into memory, manages a background server process, and exposes an HTTP API, all without you touching a configuration file. Choosing a model size that fits your hardware is still up to you. For most people it is the fastest on-ramp to local AI.

Why Ollama specifically

Before Ollama, running a local model meant compiling llama.cpp from source, manually downloading GGUF files from Hugging Face, and writing your own server glue. Ollama wraps all of that into a single binary with a familiar interface modelled on Docker: ollama pull, ollama run, ollama list. Models are versioned and pulled by name just like container images.

Three things make Ollama stand out among local-AI tools:

  • OpenAI-compatible API. Ollama's local server speaks the same /v1/chat/completions schema as OpenAI. Most code that calls OpenAI's chat completions API works with Ollama after changing the base URL, with no library changes needed. Features outside the supported subset may not carry over.
  • Cross-platform. A single download covers macOS (Apple Silicon and Intel), Windows, and Linux. GPU acceleration is detected and enabled automatically for NVIDIA (CUDA), AMD (ROCm), and Apple Silicon (Metal).
  • Huge model library. The official library lists hundreds of models. As of mid-2025, Llama 3.1 alone had been pulled over 115 million times, making Ollama the most popular local-model runner by a wide margin.

How Ollama works under the hood

Ollama runs as a background server process (ollama serve). The desktop app on macOS and Windows, or the system service set up by the Linux installer, normally keeps it running for you. That server owns the GPU or CPU context, loads model weights into memory, and handles all inference. The CLI you type into is just an HTTP client talking to that server on localhost:11434. Your own code can call the same server directly.
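
Because the CLI is only an HTTP client, you can check for the server from code in the same way. The sketch below assumes the default address from the text and the server's /api/version endpoint; server_version and version_endpoint are hypothetical helper names, not part of Ollama.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address mentioned in the text


def version_endpoint(base_url: str = OLLAMA_URL) -> str:
    """Build the URL of the version endpoint for a given server address."""
    return base_url.rstrip("/") + "/api/version"


def server_version(base_url: str = OLLAMA_URL, timeout: float = 2.0):
    """Return the server's version string, or None if nothing usable answers."""
    try:
        with urllib.request.urlopen(version_endpoint(base_url), timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as "not running".
        return None
```

Calling server_version() returns a version string when the server is up and None otherwise, which makes it a cheap pre-flight check before sending a real request.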

Model files are stored in ~/.ollama/models (macOS/Linux) or %USERPROFILE%\.ollama\models (Windows). The weights use the GGUF format, and the default tags are typically quantized to about 4 bits per weight, so a 7B model fits in 4–5 GB rather than the 14 GB it would need at 16-bit precision.
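
The size figures follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below is a back-of-the-envelope estimate only; weight_size_gb is a hypothetical helper. Real files tend to be somewhat larger than the 4-bit result, which is why the text quotes 4–5 GB rather than 3.5 GB.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the raw weights in GB (1 GB = 10**9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: about {weight_size_gb(7, bits):.1f} GB")
# 7B at 16-bit: about 14.0 GB
# 7B at 8-bit: about 7.0 GB
# 7B at 4-bit: about 3.5 GB
```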

The server exposes two API families: its own native /api/chat and /api/generate endpoints, plus an OpenAI-compatible layer at /v1/chat/completions. The native endpoints stream newline-delimited JSON by default; the OpenAI-compatible layer follows OpenAI's convention and streams only when the request sets stream to true. The OpenAI layer accepts the same request shape as openai.chat.completions.create().
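
The two families also return differently shaped JSON for a non-streaming chat request. As a rough illustration, the helper below (extract_reply, a hypothetical name) pulls the assistant text out of either shape. The sample payloads are trimmed to the fields used here and are not complete responses.

```python
def extract_reply(payload: dict) -> str:
    """Return the assistant text from a non-streaming chat response."""
    if "choices" in payload:
        # OpenAI-compatible shape: a list of choices, each wrapping a message.
        return payload["choices"][0]["message"]["content"]
    # Native /api/chat shape: a single top-level message.
    return payload["message"]["content"]


native_sample = {"message": {"role": "assistant", "content": "Hi"}, "done": True}
openai_sample = {"choices": [{"message": {"role": "assistant", "content": "Hi"}}]}

print(extract_reply(native_sample), extract_reply(openai_sample))
```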

Step-by-step: install, pull, and run

Step 1 — Install Ollama

Go to ollama.com/download and grab the installer for your OS. The macOS and Windows packages are point-and-click. On Linux, paste this one-liner into your terminal:

bash
curl -fsSL https://ollama.com/install.sh | sh

After installation, verify it is working:

bash
ollama -v
# ollama version is 0.15.x (or newer)

Step 2 — Pull a model

ollama pull downloads a model without starting a chat. Pick one that fits your RAM. A safe first choice for most machines is llama3.2 (3B, about 2 GB):

bash
# Download the 3B Llama 3.2 model (~2 GB)
ollama pull llama3.2

# Or pull the 8B Llama 3.1 model if you have 8+ GB of free memory
# (Llama 3.2 has no 8B variant in the library)
ollama pull llama3.1:8b

Step 3 — Start an interactive chat

ollama run pulls the model if needed and opens an interactive session:

bash
ollama run llama3.2
# >>> Send a message (/? for help)

>>> Explain what a large language model is in two sentences.
# ... model responds ...

# Type /bye or press Ctrl+D to exit the session.

Step 4 — Useful CLI commands

Here are the commands you will use most often:

bash
ollama list            # show all downloaded models
ollama ps              # show which models are currently loaded in memory
ollama show llama3.2   # display model details: size, context window, family
ollama rm llama3.2     # delete a model to free disk space
ollama serve           # start the server manually (runs automatically otherwise)
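
The same information is available over HTTP. As a sketch, the snippet below reads the server's /api/tags endpoint, which returns the downloaded models, and formats a short listing similar to ollama list. fetch_tags and summarize_models are hypothetical helper names, and the sample payload is trimmed to the two fields used.

```python
import json
import urllib.request


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Ask a running Ollama server for its downloaded models."""
    with urllib.request.urlopen(base_url + "/api/tags", timeout=5) as resp:
        return json.load(resp)


def summarize_models(tags_payload: dict) -> list[str]:
    """Turn the /api/tags payload into 'name  size' lines."""
    lines = []
    for model in tags_payload.get("models", []):
        size_gb = model.get("size", 0) / 1e9
        lines.append(f"{model['name']}  {size_gb:.1f} GB")
    return lines


# Trimmed sample so the formatting can be tried without a server.
sample = {"models": [{"name": "llama3.2:latest", "size": 2019393189}]}
print("\n".join(summarize_models(sample)))
```

With a server running, summarize_models(fetch_tags()) gives the live listing.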

Step 5 — Call the native REST API

Once the server is running (which it is after any ollama run), you can send HTTP requests from any language. Here is a minimal curl call and a Python example:

bash
# One-shot request via curl (non-streaming)
curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "stream": false
  }'

python
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "stream": False,
    },
)
print(response.json()["message"]["content"])
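
The chat endpoint keeps no conversation state between requests, so a multi-turn chat means resending the whole message list each time. A minimal sketch, using only the standard library: post_chat and ask are hypothetical helpers, and the send parameter exists only so the logic can be exercised without a server.

```python
import json
import urllib.request

CHAT_URL = "http://localhost:11434/api/chat"


def post_chat(messages: list, model: str = "llama3.2") -> dict:
    """Send the full message list and return the assistant's message dict."""
    body = json.dumps({"model": model, "messages": messages, "stream": False}).encode()
    req = urllib.request.Request(
        CHAT_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["message"]


def ask(history: list, user_text: str, send=post_chat) -> str:
    """Append the user turn, get a reply, and record it in the history."""
    history.append({"role": "user", "content": user_text})
    reply = send(history)
    history.append(reply)
    return reply["content"]
```

Start with history = [] and call ask(history, "...") repeatedly; each call sends every earlier turn, so the model sees the full conversation.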

Step 6 — Use the OpenAI-compatible API

Ollama exposes an OpenAI-compatible endpoint at /v1/chat/completions. This means you can point the official openai Python library at your local server with a single URL change — no code rewrites, no API key required:

python
from openai import OpenAI

# Point the OpenAI client at your local Ollama server.
# The api_key value is required by the library but ignored by Ollama.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

completion = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": "What is the capital of France?"},
    ],
)

print(completion.choices[0].message.content)
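
Since only the base URL and key differ between a local and a hosted setup, one option is to read both from the environment so the same script can target either. This is a sketch of that idea, not an established convention: LLM_BASE_URL and LLM_API_KEY are hypothetical variable names, and client_kwargs is a hypothetical helper.

```python
import os


def client_kwargs(env=None) -> dict:
    """Build keyword arguments for OpenAI(...), defaulting to local Ollama."""
    env = os.environ if env is None else env
    return {
        # Hypothetical variable names; unset means "use the local server".
        "base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": env.get("LLM_API_KEY", "ollama"),
    }


print(client_kwargs({}))
```

You would then create the client with OpenAI(**client_kwargs()) and leave the rest of the code untouched.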

Going deeper

Once you are comfortable pulling and chatting, here are the natural next steps.

Customise behaviour with a Modelfile. A Modelfile is a short config file (think Dockerfile for models) that lets you set a system prompt, change the temperature, or set a stop sequence without touching your application code. Create one and run ollama create my-assistant -f Modelfile to register it as its own named model:

text
FROM llama3.2
SYSTEM "You are a concise technical assistant. Answer in bullet points."
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
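
If you would rather not register a new model, the native chat endpoint also accepts an options object on each request. The sketch below builds a request body that mirrors the Modelfile above; chat_payload is a hypothetical helper, and the body is meant to be posted to /api/chat as in Step 5.

```python
def chat_payload(prompt: str, model: str = "llama3.2", system=None, **options) -> dict:
    """Build a non-streaming /api/chat body with optional per-request options."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    payload = {"model": model, "messages": messages, "stream": False}
    if options:
        # Same parameter names as the PARAMETER lines in a Modelfile.
        payload["options"] = options
    return payload


body = chat_payload(
    "Summarise GGUF in one line.",
    system="You are a concise technical assistant. Answer in bullet points.",
    temperature=0.2,
    num_ctx=8192,
)
print(body["options"])
```

A Modelfile suits settings you want everywhere; per-request options suit experiments where the values change from call to call.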

Stream responses in Python. The native API streams by default, while the OpenAI-compatible endpoint streams only when you pass stream=True. Either way you iterate over the chunks as they arrive, the same pattern as OpenAI's streaming API. The response then appears token by token instead of after the full completion, which feels far more responsive in a UI.
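
A minimal sketch of that pattern through the OpenAI-compatible endpoint, with stream=True passed explicitly. join_stream and stream_chat are hypothetical helper names. The openai import sits inside the function so the chunk-handling logic can be tried without the library or a server.

```python
def join_stream(chunks) -> str:
    """Print each text piece as it arrives and return the full reply."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices
        piece = chunk.choices[0].delta.content
        if piece:
            print(piece, end="", flush=True)
            parts.append(piece)
    print()
    return "".join(parts)


def stream_chat(prompt: str, model: str = "llama3.2") -> str:
    """Stream one reply from the local server (needs the openai package)."""
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return join_stream(stream)
```

Calling stream_chat("Why is the sky blue?") prints the answer as it is generated and returns the complete text at the end.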

Serve multiple users. By default the server listens only on localhost:11434. If you set OLLAMA_HOST=0.0.0.0 before starting ollama serve, it binds to all network interfaces, so other machines on your local network can call your Ollama instance. For production-grade multi-user serving, look at a dedicated inference server like vLLM instead.

Model quality vs. size. When you hit the limits of a 3B model, try the 8B of the same family before jumping to a 70B — the 8B is usually a large quality step for a modest memory increase. If you have a 24 GB GPU, 34B models become reachable. For anything bigger you need to start thinking about multi-GPU setups or cloud offloading. See why LLMs need GPUs for the hardware picture.

Local RAG. Combine Ollama with a local vector database (Chroma or Qdrant both run locally) and an embedding model like nomic-embed-text to build a fully offline RAG pipeline. Every piece — the embedder, the retriever, and the generator — runs on your machine with no external API calls.
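
To make the retrieval step concrete, here is a deliberately tiny sketch that skips the vector database and compares embeddings in memory. It assumes the server's /api/embed endpoint accepts a list of inputs and returns an embeddings list; embed, cosine, and top_match are hypothetical helper names, and a real pipeline would store the vectors in Chroma or Qdrant instead.

```python
import json
import math
import urllib.request


def embed(texts, model="nomic-embed-text", base_url="http://localhost:11434"):
    """Embed a list of strings with a local Ollama server (assumed endpoint)."""
    body = json.dumps({"model": model, "input": texts}).encode()
    req = urllib.request.Request(
        base_url + "/api/embed", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embeddings"]


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_match(question, documents, embed_fn=embed):
    """Return the document whose embedding is closest to the question's."""
    vectors = embed_fn([question] + list(documents))
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scores = [cosine(query_vec, vec) for vec in doc_vecs]
    best = max(range(len(documents)), key=scores.__getitem__)
    return documents[best]
```

The retrieved text would then go into the prompt of a chat request, which is the generation half of the pipeline.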

FAQ

How do I install Ollama on Windows?

Download the Windows installer from ollama.com/download and run it. The installer sets up the Ollama service automatically. After installation, open a Command Prompt or PowerShell window and type ollama run llama3.2 to pull and run your first model.

Do I need a GPU to run Ollama?

No. Ollama runs on CPU-only machines. On a modern laptop CPU, a small 1B or 3B model answers in a few seconds — slower than a GPU but workable for development. If you have an NVIDIA or AMD GPU, Ollama detects it automatically and uses it for much faster inference.

How do I use Ollama with the OpenAI Python library?

Set base_url to http://localhost:11434/v1 and any non-empty string for api_key when creating the OpenAI client. Everything else — chat.completions.create(), streaming, system messages — works unchanged because Ollama implements the same API schema.

What is the best first model to try with Ollama?

llama3.2 (the 3B default) is the safest first pull: it is about 2 GB, runs on almost any machine including CPU-only laptops, and is capable enough for basic chat and Q&A. Once that works, try llama3.1 (8B) for significantly better quality if your machine has 8+ GB of free memory.

How much disk space do Ollama models take?

Small 1B–3B models are 1–2 GB. Popular 7B–8B models are 4–5 GB in 4-bit quantization. Large 70B models are around 40 GB. Models are stored in ~/.ollama/models and you can delete any of them with ollama rm <model-name> to reclaim disk space.
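
To see how much space your own downloads occupy, you can total the files under the models directory. A small sketch assuming the default location named above; dir_size_bytes is a hypothetical helper.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root, recursively."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def models_dir() -> Path:
    """Default model location on macOS and Linux, per the text above."""
    return Path.home() / ".ollama" / "models"
```

For example, dir_size_bytes(models_dir()) / 1e9 gives the total in GB once at least one model has been pulled.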

Can Ollama run multiple models at the same time?

Yes. Each model request loads the model into memory. ollama ps shows which models are currently loaded. If you have enough VRAM or RAM, multiple models can be resident simultaneously. Models are unloaded from memory after a period of inactivity (5 minutes by default) to free resources.
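
The unload timer can also be set per request. As a hedged sketch, the native API takes a keep_alive field: a duration string such as "30m" keeps the model resident longer, and 0 asks the server to unload it right away. An empty messages list is assumed to load or unload the model without generating text. keep_alive_payload and set_keep_alive are hypothetical helper names.

```python
import json
import urllib.request


def keep_alive_payload(model: str, keep_alive) -> dict:
    """Build an /api/chat body that only adjusts how long the model stays loaded."""
    return {"model": model, "messages": [], "keep_alive": keep_alive}


def set_keep_alive(model: str, keep_alive, base_url: str = "http://localhost:11434") -> dict:
    """Send the body to a running server and return its JSON reply."""
    body = json.dumps(keep_alive_payload(model, keep_alive)).encode()
    req = urllib.request.Request(
        base_url + "/api/chat", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


print(keep_alive_payload("llama3.2", 0))
```

After set_keep_alive("llama3.2", 0), ollama ps should no longer list the model.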

Further reading