In plain English
Every AI application that uses a language model makes one foundational choice: do you rent the model or own it? A closed model is a proprietary AI system you access through an API — GPT-5, Claude, Gemini. You send a request, pay per token, and get a response. The weights, the training data, and the infrastructure all belong to the provider. An open model is one whose trained weights are published for anyone to download. You run it on your own hardware (or a rented GPU), and nothing leaves your network unless you choose to send it there.
A useful analogy: closed APIs are like a taxi, open models are like owning a car. The taxi is faster to start using — you just hail it. But every trip costs money, you have no control over the route, and the driver keeps a log. The car has a higher upfront cost and you handle maintenance, but once you own it the marginal cost per trip is near zero, you can modify it, and nobody tracks where you go.
Why the decision matters for builders
This choice touches almost every dimension of a production AI system: latency, cost, data privacy, compliance, ability to customise, and how much engineering you need to maintain. Getting it wrong early is expensive — switching from a closed API to self-hosted infrastructure mid-project means rewriting prompt logic, load-testing a new stack, and retraining any fine-tuned variants.
The decision has also gotten genuinely harder over the past two years. In 2023, closed models were clearly superior in quality and open models were the budget option. By 2026 that gap has largely closed: Qwen3 235B, DeepSeek R1, and Llama 4 match or exceed GPT-4-class performance on coding, reasoning, and multilingual tasks, while running on hardware you control. The tradeoff today is less about capability and more about operational complexity versus strategic control.
- Cost at scale: closed API billing grows linearly with usage; self-hosted infrastructure has a fixed monthly cost that becomes cheaper per token as volume increases
- Data privacy: open models on private hardware mean prompts and completions never leave your network — essential for healthcare, legal, and financial workloads
- Customisation: open weights let you fine-tune the model on your own data; closed APIs offer limited fine-tuning with no access to intermediate representations
- Vendor lock-in: a closed API provider can change pricing, deprecate a model version, or go offline — open weights you downloaded cannot be taken away
- Time to first result: a closed API is working in minutes; setting up self-hosted inference takes hours to days, depending on your infrastructure experience
How the two models work
Understanding the mechanics of each approach makes the tradeoffs concrete. With a closed API, your application sends an HTTPS request to the provider's servers, which run the model on their GPU cluster and return a response. With an open model, you download the weights once, load them onto your own GPU (or CPU with quantisation), and run inference entirely within your own infrastructure.
- Your app sends prompt via HTTPS
- Provider runs model on their GPUs
- Response returned as JSON
- You pay per input + output token
- Provider stores request logs (by default)
- Model updates happen silently
- Download weights once (5-100+ GB)
- Load model onto your GPU or CPU
- Run inference entirely on your hardware
- Pay for compute, not per token
- No data leaves your network
- You control model version and updates
The closed API path
Closed providers like OpenAI, Anthropic, and Google invest billions in training and infrastructure. Their frontier models — GPT-5, Claude 4, Gemini 2.x — sit at the edge of what is technically possible and are updated continuously. You get that capability with zero infrastructure work: sign up, get an API key, and you're calling the model in an afternoon. The provider handles scaling automatically; whether you send 10 requests a day or 10 million, the API just works.
The cost model is pay-as-you-go: typically measured in dollars per million tokens, split between input (the prompt) and output (the completion). GPT-5 costs roughly \$1.75 per million input tokens and \$14 per million output tokens as of mid-2026. At low volumes this is extremely cheap. At high volumes — above roughly 10 million tokens per day — the bill becomes the dominant operating cost for many AI products.
The open model path
Open models require you to manage infrastructure. You pick a model (say, Llama 4 Scout or Qwen3 30B), download the weights, choose a serving framework (Ollama for development, vLLM or llama.cpp server for production), and provision enough GPU memory to load the model. A 7B model needs roughly 6-8 GB of VRAM at 4-bit quantisation; a 70B model needs ~40 GB. Once running, your marginal inference cost is just electricity and amortised hardware.
The hidden costs are real: a self-hosted LLM setup typically requires 10-20 hours per month of engineering time for monitoring, updates, and troubleshooting — at senior-engineer rates, that is \$750-\$3,000 per month in labour alone, before you count the GPU hardware or cloud VM bill. Entry-level self-hosting (a 7B-13B model on a cloud GPU instance) starts around \$1,500-\$5,000 per month in total cost.
A practical decision framework
Rather than declaring a winner, the useful question is: which constraints matter most for your project right now? The table below maps key situations to the right starting point.
| Your situation | Best starting point | Why |
|---|---|---|
| Prototype or MVP, low volume | Closed API | Zero infra setup; fast iteration; cost is negligible at small scale |
| Healthcare, legal, or financial data | Open model (self-hosted) | Data never leaves your network; meets HIPAA/PCI-DSS requirements |
| Daily usage > 10M tokens, budget is a constraint | Open model (cloud GPU) | Self-hosting typically breaks even above ~10-30M tokens/day |
| Need domain-specific fine-tuning | Open model | Closed APIs offer limited fine-tuning; open weights give full control |
| No ML/DevOps expertise on the team | Closed API | No infrastructure to manage; focus on product, not serving stack |
| Need highest absolute quality today | Closed API (frontier model) | GPT-5, Claude 4 still lead on ambiguous reasoning and long-context tasks |
| Regulated industry requiring on-premise | Open model (on-premise) | Enterprise agreements rarely substitute for full infrastructure control |
| Batch processing at very high volume | Open model (cloud GPU) | Per-token cost savings are massive; latency requirements are relaxed |
Most mature production deployments end up using both: a closed frontier model for complex general-purpose tasks (customer support, long-document analysis) and a fine-tuned open model for high-volume, domain-specific pipelines where the cost savings justify the infrastructure overhead.
How close is the quality gap in 2026?
The most common objection to open models used to be quality. That objection has weakened significantly. By early 2026, Qwen3 235B (Apache 2.0, Alibaba) leads the broadest range of public benchmarks, combining top-tier reasoning, coding, and multilingual capability. DeepSeek R1 (MIT) reaches frontier-level mathematical reasoning, placing near the top on AIME-style competition problems. Llama 4 Scout (Meta community licence) introduced a 10-million-token context window — larger than any closed model at the time.
Where closed models still hold an edge: ambiguous long-context reasoning (following complex instructions across a 100K+ token document), multimodal tasks that combine vision, audio, and text, and tasks that benefit from continuous model updates. Closed provider labs also ship safety features and alignment work that smaller open-model deployments must replicate themselves.
For the majority of structured tasks — extracting fields from documents, summarising meeting transcripts, answering questions from a knowledge base, classifying support tickets — a fine-tuned open model typically matches or beats the equivalent closed API call at a fraction of the cost, because the task is narrow enough that domain-specific training outweighs raw benchmark performance.
Going deeper
Once you have decided to use open models, a few advanced topics determine how successful your deployment will be:
Licences matter in production
Choosing a model for a commercial product means reading the licence carefully. MIT and Apache 2.0 (DeepSeek R1, Qwen3, Gemma, Phi-4, most Mistral models) allow commercial use, redistribution, and modification without restrictions. Meta's Llama community licence allows commercial use for most companies, but organisations that reach 700 million monthly active users must apply for a separate licence — which rules it out for a handful of mega-platforms but is irrelevant for the vast majority of products. Always verify you are reading the licence for the specific model version you are deploying, as licences can change between releases.
Fine-tuning is the multiplier
The biggest advantage of open weights is not avoiding API costs — it is the ability to fine-tune the model on your own domain data. A 7B model fine-tuned on a few thousand domain-specific examples often outperforms a 70B general-purpose model on the narrow task it was trained for. LoRA (Low-Rank Adaptation) makes this practical: you can fine-tune a 7B model on a single consumer GPU in a few hours with a few hundred annotated examples. The result is a model that is faster, cheaper to serve, and more accurate on your task than any general-purpose API.
The hybrid architecture
Most sophisticated production AI stacks in 2026 are hybrid: a closed frontier API handles rare, complex, ambiguous queries where the highest possible quality is worth the cost; a fine-tuned open model handles the high-volume, repetitive, well-defined tasks where the economics of self-hosting pay off. A routing layer — as simple as a classification prompt — directs each request to the right model. This pattern captures the benefits of both worlds without being locked into either.
def route_request(prompt: str, complexity_score: float):
# Use open model (Ollama local) for routine queries
if complexity_score < 0.6:
return call_ollama("qwen3:8b", prompt)
# Fall back to closed API for complex reasoning
else:
return call_openai("gpt-4o", prompt)
# complexity_score can be a simple classifier, token count,
# or another LLM call that is cheap to run locallyPrivacy and compliance are not binary
Both OpenAI and Anthropic now offer zero-data-retention API tiers and SOC 2 compliance, which addresses many enterprise privacy concerns without self-hosting. For HIPAA (US healthcare), a Business Associate Agreement (BAA) with the provider is often sufficient. Where self-hosting becomes a hard requirement is in environments that prohibit data from crossing national borders, in classified or defence contexts, or in organisations whose security policy simply does not permit any third-party data processor regardless of contractual protections.
FAQ
Is it cheaper to self-host an open LLM than to use the OpenAI API?
It depends on your volume. At low usage (under ~10 million tokens per day), a closed API is almost always cheaper once you factor in the engineering time, GPU hardware or cloud VM cost, and ongoing maintenance. Self-hosting typically breaks even somewhere between 10 million and 30 million tokens per day. At very high volumes — 100 million+ tokens per day — self-hosting can cut costs by 50-80%.
Can I use open-source LLMs for HIPAA-compliant healthcare applications?
Yes — running an open model on infrastructure you fully control is one of the most reliable paths to HIPAA compliance, because no patient data leaves your network. Closed APIs can also be HIPAA-compliant if the provider signs a Business Associate Agreement (BAA) and you configure zero-data-retention — both OpenAI and Anthropic offer this. The choice depends on your security policy and whether your organisation allows any third-party data processor.
How close are open models to GPT-4 and Claude quality in 2026?
Very close on most structured tasks. Qwen3 235B and DeepSeek R1 match or exceed GPT-4o on coding, math, and multilingual benchmarks. For narrow, well-defined tasks (classification, extraction, summarisation), a fine-tuned open model often outperforms a general closed API. Closed models retain an edge on ambiguous long-context reasoning, real-time multimodal tasks, and tasks that benefit from continuous model updates.
What is the easiest way to start running an open LLM locally?
Ollama is the fastest starting point. Install it, then run ollama run llama3.3 or ollama run qwen3:8b — it downloads the model, handles 4-bit quantisation automatically, and exposes a local API compatible with the OpenAI SDK. You can be calling a local LLM from Python in under 10 minutes with no GPU configuration required.
Do open models have the same safety guardrails as closed APIs?
Not by default. Frontier closed models from OpenAI and Anthropic include extensive alignment and safety training that has been continuously refined. Open models include some safety training but it varies widely and can be bypassed more easily. If you deploy an open model in a consumer-facing product, you are responsible for adding your own content filtering, output validation, and safety layers.
Can I fine-tune a closed API model on my own data?
OpenAI and some other providers do offer fine-tuning on their smaller models (e.g. GPT-4o mini). However, you upload your training data to their servers, the resulting model lives on their infrastructure, and you still pay per token to use it. With open weights, fine-tuning uses your own data on your own hardware — no data leaves your network, and you own the resulting model weights outright.