AI/TLDR

What Is Ollama? Run Your First Local LLM in Five Minutes

You will go from nothing installed to chatting with a local model through Ollama, and understand what it does under the hood.

BEGINNER · 10 MIN READ · UPDATED 2026-06-12

In plain English

Ollama is a free, open-source tool that lets you download and run large language models on your own computer with a single command. Think of it like Docker, but for AI models instead of containers: you type ollama run llama3.2, and Ollama fetches the model file, picks a compute backend for your hardware, and serves the model from a local server you can talk to.

Before Ollama existed, running a local LLM meant installing llama.cpp from source, manually downloading the right .gguf file from Hugging Face, wrestling with CUDA version flags, and writing your own inference script. Ollama collapses all of that into three shell commands.

The analogy that sticks: Ollama is to local LLMs what brew install is to software on a Mac. One command fetches, installs, and runs — and another command removes it cleanly when you are done.

Why it matters for builders

Cloud LLM APIs are fast to start with, but they come with real friction: every prompt leaves your machine, costs money per token, requires an internet connection, and is rate-limited at scale. For many tasks — writing assistants, code helpers, search over internal documents — those constraints are acceptable. For others they are dealbreakers.

Ollama gives you a practical local alternative. Here is when that trade matters:

  • Privacy — prompts stay on your hardware. Medical notes, source code, legal contracts never leave the machine.
  • Zero per-token cost — run llama3.2 all day and pay nothing beyond the electricity.
  • Offline access — once a model is downloaded, no network is required.
  • Iteration speed — no rate limits, no cold starts, no API key rotation.
  • Experimentation — swap between dozens of models in seconds without waiting for vendor access.

For teams building internal tools — a private code reviewer, a document Q&A bot, a customer-support prototype — Ollama is often the fastest path from idea to working demo, because you can iterate locally and only move to a cloud API if and when you need more scale or a larger model than your hardware supports.

How Ollama works under the hood

Ollama is a thin Go application that wraps llama.cpp — the high-performance C++ inference engine that does the actual matrix math. Ollama's job is to make llama.cpp feel invisible: it manages model storage, selects the right compute backend, exposes a REST API, and handles the request lifecycle.

GGUF and quantization

Ollama stores models in the GGUF format — a single-file binary standard that llama.cpp uses natively. Each GGUF file encodes the model weights at a particular quantization level, which is a lossy compression that trades a small amount of quality for large reductions in file size and memory use.

When you run ollama pull llama3.2, Ollama downloads the default tag, which for this model is the 3B variant quantized as Q4_K_M (about 2 GB). Q4 means the weights are stored at roughly 4 bits each; K_M is the medium k-quant mix, which keeps a subset of the more sensitive tensors at higher precision and the rest at 4 bits. The savings scale with model size: an 8B model that takes ~16 GB at 16-bit precision fits in roughly 5 GB at Q4_K_M, at the cost of a small drop in quality.
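The size arithmetic behind quantization can be sketched in a few lines. This is a back-of-the-envelope estimate, not something Ollama computes: the function name is hypothetical, and the 4.85 bits-per-weight figure for Q4_K_M is an assumed approximation (k-quants store scaling factors alongside the weights, so the effective rate sits above 4 bits).

```python
def estimate_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough weight-file size in GB (10^9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


# An 8B-parameter model at 16-bit precision versus an assumed ~4.85 bits/weight.
fp16 = estimate_model_size_gb(8e9, 16)
q4_k_m = estimate_model_size_gb(8e9, 4.85)
print(f"fp16: {fp16:.1f} GB, Q4_K_M: {q4_k_m:.2f} GB")
```

The estimate covers the weights only; at run time the context (KV cache) needs additional memory on top.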

The local server and API

When you start Ollama (either via ollama serve or automatically at login on macOS/Windows), it opens a local HTTP server on port 11434. That server exposes two API families:

  • Native API at /api/* — Ollama-specific endpoints with streaming, model management, and Modelfile support.
  • OpenAI-compatible API at /v1/* — a drop-in replacement for api.openai.com/v1. Point any existing OpenAI SDK at http://localhost:11434/v1 with api_key='ollama' and it works without code changes.
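To make the native API concrete, here is a minimal sketch that calls /api/chat using only the Python standard library. The helper names build_chat_request and chat are hypothetical, and the code assumes the server is listening on the default port described above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if yours differs


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to the native /api/chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        return json.load(resp)["message"]["content"]


# Requires a running Ollama server and a pulled model:
# print(chat("llama3.2:3b", "Say hello in five words."))
```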

GPU acceleration

Ollama detects your GPU automatically and offloads as many transformer layers as will fit in VRAM. On Apple Silicon it uses Metal; on NVIDIA it uses CUDA; on AMD it uses ROCm. If your GPU does not have enough VRAM for the whole model, Ollama splits layers between GPU and CPU — slower, but it still works. The num_gpu parameter in a Modelfile lets you tune how many layers go to the GPU.
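As a rough illustration of the GPU/CPU split described above, the sketch below estimates how many layers fit in VRAM. It is a deliberate simplification and not Ollama's actual scheduling logic: it assumes all layers are the same size, and reserve_gb is a hypothetical allowance for the KV cache and working buffers.

```python
import math


def layers_on_gpu(model_gb: float, n_layers: int, free_vram_gb: float,
                  reserve_gb: float = 1.0) -> int:
    """Estimate how many equally sized layers fit in the available VRAM."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


print(layers_on_gpu(model_gb=4.0, n_layers=32, free_vram_gb=4.0))   # 24: partial offload
print(layers_on_gpu(model_gb=4.0, n_layers=32, free_vram_gb=12.0))  # 32: whole model fits
```

A number like this could serve as a starting value for num_gpu if the automatic choice needs tuning.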

Install Ollama and run your first model

Installation takes under a minute on any platform. Pick your OS:

Platform   Install method
macOS      brew install ollama, or download the .dmg from ollama.com
Linux      curl -fsSL https://ollama.com/install.sh | sh
Windows    Download the installer from ollama.com (installs as a background service)

Once installed, open a terminal and pull a small model. Llama 3.2 3B is a good beginner choice — it is about 2 GB and fast on any modern laptop:

Pull and run your first model (bash)
# Download the model (one-time, ~2 GB for llama3.2:3b)
ollama pull llama3.2:3b

# Start an interactive chat session
ollama run llama3.2:3b

You will see a >>> prompt. Type anything and press Enter. To exit, type /bye or press Ctrl+D.

Talking to it from code

The more useful path for builders is the API. Because Ollama speaks the OpenAI protocol, you can use the official OpenAI Python SDK with zero changes except the base URL:

Using the OpenAI SDK with Ollama (python)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[
        {"role": "user", "content": "Explain transformers in two sentences."}
    ],
)
print(response.choices[0].message.content)
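For chat interfaces you will usually want tokens to appear as they are generated. The following is a minimal streaming sketch built on the same SDK call; stream_reply is a hypothetical helper, and it expects a client constructed as in the example above.

```python
def stream_reply(client, model: str, prompt: str) -> str:
    """Print the reply as it arrives and return the full text.

    `client` is an openai.OpenAI instance pointed at the local server.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            print(delta, end="", flush=True)
    print()
    return "".join(parts)


# Usage, assuming a running server and a pulled model:
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
# stream_reply(client, "llama3.2:3b", "Explain transformers in two sentences.")
```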

You can also use raw curl against the native API if you prefer:

Calling the native generate endpoint (bash)
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2:3b",
    "prompt": "What is a vector database?",
    "stream": false
  }'
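If you leave out "stream": false, the native API streams its answer as one JSON object per line, each carrying a fragment in the response field and a done flag on the last one. Here is a small sketch of how a client might reassemble that stream; join_generate_stream is a hypothetical helper, and the sample lines are hand-written stand-ins for real server output.

```python
import json
from typing import Iterable


def join_generate_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the "response" fragments of a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        event = json.loads(raw)
        parts.append(event.get("response", ""))
        if event.get("done"):
            break
    return "".join(parts)


# Hand-written sample in the shape the server streams back.
sample = [
    b'{"response": "A vector", "done": false}\n',
    b'{"response": " database", "done": false}\n',
    b'{"response": "", "done": true}\n',
]
print(join_generate_stream(sample))  # prints "A vector database"
```

With a live server, the object returned by urllib.request.urlopen can be passed in directly, since iterating over it yields one line at a time.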

Choosing the right model

The Ollama library at ollama.com/library hosts hundreds of models. Here are the most useful starting points for different hardware and tasks:

Model            Pull command                  RAM needed   Good for
Llama 3.2 3B     ollama pull llama3.2:3b       ~3 GB        Fast chat, quick Q&A on low-end hardware
Llama 3.1 8B     ollama pull llama3.1:8b       ~5 GB        Strong general-purpose chat and coding
Gemma 3 4B       ollama pull gemma3:4b         ~4 GB        Google's efficient instruction-tuned model
Qwen3 8B         ollama pull qwen3:8b          ~5 GB        Multilingual tasks, coding, reasoning
DeepSeek-R1 8B   ollama pull deepseek-r1:8b    ~5 GB        Step-by-step reasoning, math
Mistral 7B       ollama pull mistral           ~4 GB        Fast European model, good at instruction following

The model tag (the part after the colon) selects a specific size or quantization. ollama pull llama3.2 without a tag fetches the latest tag, which usually points at a well-balanced variant (for llama3.2, the 3B model at Q4_K_M). ollama pull llama3.2:3b-instruct-q8_0 picks an explicit 8-bit quantization of the instruct-tuned 3B variant.
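The name-and-tag convention is simple enough to sketch in code. The helper below is hypothetical (it is not part of any Ollama client library) and only illustrates the rule that a reference without a tag resolves to latest.

```python
def split_model_ref(ref: str) -> tuple[str, str]:
    """Split "name:tag" into (name, tag), defaulting the tag to "latest"."""
    # Only look for a colon after the last slash, so that a registry host
    # with a port ("host:5000/name") is not mistaken for a tag.
    head, slash, tail = ref.rpartition("/")
    name, colon, tag = tail.partition(":")
    return (head + slash + name, tag if colon else "latest")


print(split_model_ref("llama3.2"))     # ('llama3.2', 'latest')
print(split_model_ref("llama3.2:3b"))  # ('llama3.2', '3b')
```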

Customizing models with Modelfiles

A Modelfile is a plain text recipe, similar in spirit to a Dockerfile, that lets you bake a custom system prompt and parameter defaults into a named model. Once created, you run your custom model with a short name instead of resending the same system prompt and options every time.

Example Modelfile: a concise coding assistant (text)
FROM llama3.2

SYSTEM """
You are a senior software engineer. Answer every question with working code first,
then a short explanation. Be concise. Default to Python unless asked otherwise.
"""

PARAMETER temperature 0.2
PARAMETER num_ctx 8192

Save that file as Modelfile, then build and run it:

bash
ollama create my-coder -f Modelfile
ollama run my-coder

Key Modelfile parameters worth knowing:

  • temperature: controls randomness (0 = near-deterministic, 1 = creative). Lower values like 0.1–0.3 work well for coding tasks.
  • num_ctx: the context window in tokens. The default depends on the Ollama version (2048 in older releases, larger in newer ones); set it higher (up to the model's maximum) if you need to pass long documents, keeping in mind that a longer context uses more memory.
  • top_p: nucleus sampling probability. Leave it at the default unless you have a specific reason to change it.
  • FROM: an instruction rather than a parameter, but worth knowing that it can point to a local .gguf file path, letting you import any model from Hugging Face that is not yet in the Ollama library.
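If you maintain several custom models, generating Modelfiles from code can keep them consistent. The sketch below renders the same three instruction types used in the example above; render_modelfile and the my-reviewer name are hypothetical, and the function assumes the system prompt contains no triple quotes.

```python
def render_modelfile(base: str, system: str, params: dict[str, object]) -> str:
    """Render a minimal Modelfile with FROM, SYSTEM, and PARAMETER lines."""
    lines = [f"FROM {base}", "", f'SYSTEM """\n{system.strip()}\n"""', ""]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.2",
    "You are a concise code reviewer.",
    {"temperature": 0.2, "num_ctx": 8192},
)
print(text)
# To build it: write `text` to a file named Modelfile, then run
#   ollama create my-reviewer -f Modelfile
```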

Going deeper

Once you are comfortable with the basics, there are several directions worth exploring.

Embedding models

Ollama runs embedding models too: models that turn text into numeric vectors useful for semantic search and retrieval-augmented generation (RAG). Pull nomic-embed-text or mxbai-embed-large and call the /api/embed endpoint (older releases expose the same feature at /api/embeddings). You can run your entire RAG pipeline locally: Ollama generates embeddings, a local vector store like Chroma holds them, and Ollama answers queries, with no cloud dependencies.
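A minimal sketch of the retrieval half of that pipeline follows. It uses the /api/embed endpoint, which takes an input list and returns an embeddings list (the older /api/embeddings endpoint takes a single prompt instead). The helper names are hypothetical, and a real pipeline would hand the vectors to a vector store rather than compare them by hand.

```python
import json
import math
import urllib.request


def embed(texts: list[str], model: str = "nomic-embed-text",
          base_url: str = "http://localhost:11434") -> list[list[float]]:
    """Request one embedding per input string from /api/embed."""
    req = urllib.request.Request(
        f"{base_url}/api/embed",
        data=json.dumps({"model": model, "input": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embeddings"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Requires a running server with the embedding model pulled:
# query, doc = embed(["how do I reset my password?", "Password reset steps"])
# print(cosine_similarity(query, doc))
```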

Vision models

Several models in the Ollama library understand images as well as text. llava (LLaVA) and the multimodal gemma3 variants (4B and larger) accept an image as part of the prompt. Pass the image as a base64 string in the images field of the API request to analyze screenshots, diagrams, or photos entirely on-device.
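Here is a short sketch of how such a request body might be put together for the native /api/generate endpoint, which accepts a list of base64 strings in its images field. The build_vision_payload helper and the screenshot.png file name are hypothetical.

```python
import base64
import json


def build_vision_payload(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build a JSON body for /api/generate with one base64-encoded image."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    })


# Typical use: read the file, then POST the body to /api/generate.
# with open("screenshot.png", "rb") as f:
#     body = build_vision_payload("llava", "What is in this image?", f.read())
```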

Connecting Ollama to applications

Because Ollama is OpenAI-compatible, it plugs straight into popular frameworks. In LangChain, point the OpenAI chat model at base_url http://localhost:11434/v1, or use the dedicated Ollama integration. In LlamaIndex, use the Ollama LLM class. Front ends like Open WebUI and Enchanted wrap Ollama in a polished chat interface that non-developers can use.

Multi-model and concurrent requests

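Ollama can serve multiple models from the same server process, loading each on first use and unloading idle ones to free memory. The OLLAMA_MAX_LOADED_MODELS environment variable caps how many stay loaded at once; older releases defaulted to one, so if you have enough VRAM, setting OLLAMA_MAX_LOADED_MODELS=2 keeps two models hot simultaneously. That is useful when your application needs a fast small model for classification and a larger model for generation in the same pipeline. For concurrent requests to a single model, OLLAMA_NUM_PARALLEL sets how many are processed at once.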

Importing any GGUF from Hugging Face

If a model is not yet in the Ollama library but exists as a .gguf file on Hugging Face, you can still run it. Download the file, create a minimal Modelfile with FROM /path/to/model.gguf, and run ollama create my-model -f Modelfile. The ollama.com/library page is curated and reviewed, but the Hugging Face ecosystem has thousands of fine-tunes and specialized models you can reach this way.

FAQ

Does Ollama work without a GPU?

Yes. Ollama falls back to CPU inference automatically when no compatible GPU is found. Performance is slower — typically 5–20 tokens per second on a modern CPU depending on the model size — but all features work. A GPU is strongly recommended for 13B+ models.

What is the difference between `ollama run` and `ollama serve`?

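ollama serve starts the API server on port 11434 without opening a chat session. ollama run <model> connects to that server, loads the model, and opens an interactive terminal chat; on macOS and Windows it launches the Ollama app if the server is not yet running, while on Linux the server must already be up (the install script normally registers it as a background service). Use serve when you want to call the API from your own code; use run for quick manual conversations.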

How much disk space do models use?

It depends on model size and quantization. A Q4_K_M 7B model is roughly 4–5 GB; a 13B model is around 8 GB; a 70B model is around 40 GB. Run ollama list to see exact sizes of everything you have downloaded, and ollama rm <model> to remove one.
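The information that ollama list prints is also available from the native API at /api/tags, which reports each model's size in bytes. The sketch below formats such a response; summarize_models is a hypothetical helper, and the sample dictionary is made-up illustrative data, not real output.

```python
def summarize_models(tags_response: dict) -> list[str]:
    """Format each model in a /api/tags response as "name  size GB", largest first."""
    models = sorted(tags_response.get("models", []),
                    key=lambda m: m["size"], reverse=True)
    return [f'{m["name"]:<24}{m["size"] / 1e9:6.1f} GB' for m in models]


# Made-up sample in the shape of a /api/tags response.
sample = {"models": [
    {"name": "llama3.2:3b", "size": 2_019_393_189},
    {"name": "mistral:latest", "size": 4_113_301_824},
]}
print("\n".join(summarize_models(sample)))
```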

Can I use Ollama with the OpenAI Python or JavaScript SDK?

Yes — Ollama exposes a /v1 endpoint that mirrors the OpenAI API. Set base_url='http://localhost:11434/v1' and api_key='ollama' (any string works) when constructing the client. Chat completions, streaming, and embeddings all work with no other code changes.

How do I update a model to its latest version?

Run ollama pull <model> again. Ollama checks whether the remote manifest has changed and downloads only the layers that differ — similar to how docker pull works. If nothing has changed, the command exits immediately without re-downloading anything.

Is Ollama the same thing as llama.cpp?

No — Ollama uses llama.cpp internally as its inference engine, but adds a model registry, automatic GPU detection, a REST API layer, Modelfile support, and cross-platform installers on top. llama.cpp is the low-level inference library; Ollama is the developer-friendly product built around it.
