Open vs Closed LLMs: Which Should You Build On?

Q: How close are open models to GPT-4 and Claude quality in 2026?

Very close on most structured tasks. **Qwen3 235B** and **DeepSeek R1** match or exceed GPT-4o on coding, math, and multilingual benchmarks. For narrow, well-defined tasks (classification, extraction, summarisation), a fine-tuned open model often outperforms a general closed API. Closed models retain an edge on ambiguous long-context reasoning, real-time multimodal tasks, and tasks that benefit from continuous model updates.

Q: What is the easiest way to start running an open LLM locally?

**Ollama** is the fastest starting point. Install it, then run `ollama run llama3.3` or `ollama run qwen3:8b` — it downloads the model, handles 4-bit quantisation automatically, and exposes a local API compatible with the OpenAI SDK. You can be calling a local LLM from Python in under 10 minutes with no GPU configuration required.

In plain English

Every AI application that uses a language model makes one foundational choice: do you rent the model or own it? A closed model is a proprietary AI system you access through an API — GPT-5, Claude, Gemini. You send a request, pay per token, and get a response. The weights, the training data, and the infrastructure all belong to the provider. An open model is one whose trained weights are published for anyone to download. You run it on your own hardware (or a rented GPU), and nothing leaves your network unless you choose to send it there.

A useful analogy: closed APIs are like a taxi, open models are like owning a car. The taxi is faster to start using — you just hail it. But every trip costs money, you have no control over the route, and the driver keeps a log. The car has a higher upfront cost and you handle maintenance, but once you own it the marginal cost per trip is near zero, you can modify it, and nobody tracks where you go.

Why the decision matters for builders

This choice touches almost every dimension of a production AI system: latency, cost, data privacy, compliance, ability to customise, and how much engineering you need to maintain. Getting it wrong early is expensive — switching from a closed API to self-hosted infrastructure mid-project means rewriting prompt logic, load-testing a new stack, and retraining any fine-tuned variants.

The decision has also gotten genuinely harder over the past two years. In 2023, closed models were clearly superior in quality and open models were the budget option. By 2026 that gap has largely closed: Qwen3 235B, DeepSeek R1, and Llama 4 match or exceed GPT-4-class performance on coding, reasoning, and multilingual tasks, while running on hardware you control. The tradeoff today is less about capability and more about operational complexity versus strategic control.

Cost at scale: closed API billing grows linearly with usage; self-hosted infrastructure has a fixed monthly cost that becomes cheaper per token as volume increases
Data privacy: open models on private hardware mean prompts and completions never leave your network — essential for healthcare, legal, and financial workloads
Customisation: open weights let you fine-tune the model on your own data; closed APIs offer limited fine-tuning with no access to intermediate representations
Vendor lock-in: a closed API provider can change pricing, deprecate a model version, or go offline — open weights you downloaded cannot be taken away
Time to first result: a closed API is working in minutes; setting up self-hosted inference takes hours to days, depending on your infrastructure experience

How the two models work

Understanding the mechanics of each approach makes the tradeoffs concrete. With a closed API, your application sends an HTTPS request to the provider's servers, which run the model on their GPU cluster and return a response. With an open model, you download the weights once, load them onto your own GPU (or CPU with quantisation), and run inference entirely within your own infrastructure.

// Closed API vs Open Model: how each path works

Closed API (e.g. GPT-5, Claude)

Your app sends prompt via HTTPS
Provider runs model on their GPUs
Response returned as JSON
You pay per input + output token
Provider stores request logs (by default)
Model updates happen silently

Open Model (e.g. Llama 4, Qwen3)

Download weights once (5-100+ GB)
Load model onto your GPU or CPU
Run inference entirely on your hardware
Pay for compute, not per token
No data leaves your network
You control model version and updates

The closed API path

Closed providers like OpenAI, Anthropic, and Google invest billions in training and infrastructure. Their frontier models — GPT-5, Claude 4, Gemini 2.x — sit at the edge of what is technically possible and are updated continuously. You get that capability with zero infrastructure work: sign up, get an API key, and you're calling the model in an afternoon. The provider handles scaling automatically; whether you send 10 requests a day or 10 million, the API just works.

The cost model is pay-as-you-go: typically measured in dollars per million tokens, split between input (the prompt) and output (the completion). GPT-5 costs roughly \$1.75 per million input tokens and \$14 per million output tokens as of mid-2026. At low volumes this is extremely cheap. At high volumes — above roughly 10 million tokens per day — the bill becomes the dominant operating cost for many AI products.

The open model path

Open models require you to manage infrastructure. You pick a model (say, Llama 4 Scout or Qwen3 30B), download the weights, choose a serving framework (Ollama for development, vLLM or llama.cpp server for production), and provision enough GPU memory to load the model. A 7B model needs roughly 6-8 GB of VRAM at 4-bit quantisation; a 70B model needs ~40 GB. Once running, your marginal inference cost is just electricity and amortised hardware.

The hidden costs are real: a self-hosted LLM setup typically requires 10-20 hours per month of engineering time for monitoring, updates, and troubleshooting — at senior-engineer rates, that is \$750-\$3,000 per month in labour alone, before you count the GPU hardware or cloud VM bill. Entry-level self-hosting (a 7B-13B model on a cloud GPU instance) starts around \$1,500-\$5,000 per month in total cost.

A practical decision framework

Rather than declaring a winner, the useful question is: which constraints matter most for your project right now? The table below maps key situations to the right starting point.

Your situation	Best starting point	Why
Prototype or MVP, low volume	Closed API	Zero infra setup; fast iteration; cost is negligible at small scale
Healthcare, legal, or financial data	Open model (self-hosted)	Data never leaves your network; meets HIPAA/PCI-DSS requirements
Daily usage > 10M tokens, budget is a constraint	Open model (cloud GPU)	Self-hosting typically breaks even above ~10-30M tokens/day
Need domain-specific fine-tuning	Open model	Closed APIs offer limited fine-tuning; open weights give full control
No ML/DevOps expertise on the team	Closed API	No infrastructure to manage; focus on product, not serving stack
Need highest absolute quality today	Closed API (frontier model)	GPT-5, Claude 4 still lead on ambiguous reasoning and long-context tasks
Regulated industry requiring on-premise	Open model (on-premise)	Enterprise agreements rarely substitute for full infrastructure control
Batch processing at very high volume	Open model (cloud GPU)	Per-token cost savings are massive; latency requirements are relaxed

Most mature production deployments end up using both: a closed frontier model for complex general-purpose tasks (customer support, long-document analysis) and a fine-tuned open model for high-volume, domain-specific pipelines where the cost savings justify the infrastructure overhead.

How close is the quality gap in 2026?

The most common objection to open models used to be quality. That objection has weakened significantly. By early 2026, Qwen3 235B (Apache 2.0, Alibaba) leads the broadest range of public benchmarks, combining top-tier reasoning, coding, and multilingual capability. DeepSeek R1 (MIT) reaches frontier-level mathematical reasoning, placing near the top on AIME-style competition problems. Llama 4 Scout (Meta community licence) introduced a 10-million-token context window — larger than any closed model at the time.

// The open model quality ladder (mid-2026)

Frontier open models: Qwen3 235B, DeepSeek R1, Llama 4 MaverickMatch or exceed GPT-4o on coding, math, multilingual — Apache 2.0 / MIT / community licenceMid-size open models: Llama 4 Scout, Mistral Large 3, Qwen3 72BGPT-3.5 to GPT-4 quality range; 40-80 GB VRAM; good throughput on a single A100Efficient open models: Gemma 3 27B, Qwen3 30B, Mistral Small 4Strong quality on consumer hardware (16-40 GB VRAM); best quality-per-dollarSmall / edge open models: Phi-4 Mini, Gemma 3 4B, Qwen3 8B4-8 GB VRAM; runs on laptops; best for constrained or on-device use cases

Where closed models still hold an edge: ambiguous long-context reasoning (following complex instructions across a 100K+ token document), multimodal tasks that combine vision, audio, and text, and tasks that benefit from continuous model updates. Closed provider labs also ship safety features and alignment work that smaller open-model deployments must replicate themselves.

For the majority of structured tasks — extracting fields from documents, summarising meeting transcripts, answering questions from a knowledge base, classifying support tickets — a fine-tuned open model typically matches or beats the equivalent closed API call at a fraction of the cost, because the task is narrow enough that domain-specific training outweighs raw benchmark performance.

Going deeper

Once you have decided to use open models, a few advanced topics determine how successful your deployment will be:

Licences matter in production

Choosing a model for a commercial product means reading the licence carefully. MIT and Apache 2.0 (DeepSeek R1, Qwen3, Gemma, Phi-4, most Mistral models) allow commercial use, redistribution, and modification without restrictions. Meta's Llama community licence allows commercial use for most companies, but organisations that reach 700 million monthly active users must apply for a separate licence — which rules it out for a handful of mega-platforms but is irrelevant for the vast majority of products. Always verify you are reading the licence for the specific model version you are deploying, as licences can change between releases.

Fine-tuning is the multiplier

The biggest advantage of open weights is not avoiding API costs — it is the ability to fine-tune the model on your own domain data. A 7B model fine-tuned on a few thousand domain-specific examples often outperforms a 70B general-purpose model on the narrow task it was trained for. LoRA (Low-Rank Adaptation) makes this practical: you can fine-tune a 7B model on a single consumer GPU in a few hours with a few hundred annotated examples. The result is a model that is faster, cheaper to serve, and more accurate on your task than any general-purpose API.

The hybrid architecture

Most sophisticated production AI stacks in 2026 are hybrid: a closed frontier API handles rare, complex, ambiguous queries where the highest possible quality is worth the cost; a fine-tuned open model handles the high-volume, repetitive, well-defined tasks where the economics of self-hosting pay off. A routing layer — as simple as a classification prompt — directs each request to the right model. This pattern captures the benefits of both worlds without being locked into either.

Simple routing pattern: open model for common queries, closed API for complex onespython

def route_request(prompt: str, complexity_score: float):
    # Use open model (Ollama local) for routine queries
    if complexity_score < 0.6:
        return call_ollama("qwen3:8b", prompt)
    # Fall back to closed API for complex reasoning
    else:
        return call_openai("gpt-4o", prompt)

# complexity_score can be a simple classifier, token count,
# or another LLM call that is cheap to run locally

Privacy and compliance are not binary

Both OpenAI and Anthropic now offer zero-data-retention API tiers and SOC 2 compliance, which addresses many enterprise privacy concerns without self-hosting. For HIPAA (US healthcare), a Business Associate Agreement (BAA) with the provider is often sufficient. Where self-hosting becomes a hard requirement is in environments that prohibit data from crossing national borders, in classified or defence contexts, or in organisations whose security policy simply does not permit any third-party data processor regardless of contractual protections.

FAQ

Is it cheaper to self-host an open LLM than to use the OpenAI API?

It depends on your volume. At low usage (under ~10 million tokens per day), a closed API is almost always cheaper once you factor in the engineering time, GPU hardware or cloud VM cost, and ongoing maintenance. Self-hosting typically breaks even somewhere between 10 million and 30 million tokens per day. At very high volumes — 100 million+ tokens per day — self-hosting can cut costs by 50-80%.

Can I use open-source LLMs for HIPAA-compliant healthcare applications?

Yes — running an open model on infrastructure you fully control is one of the most reliable paths to HIPAA compliance, because no patient data leaves your network. Closed APIs can also be HIPAA-compliant if the provider signs a Business Associate Agreement (BAA) and you configure zero-data-retention — both OpenAI and Anthropic offer this. The choice depends on your security policy and whether your organisation allows any third-party data processor.

How close are open models to GPT-4 and Claude quality in 2026?

Very close on most structured tasks. Qwen3 235B and DeepSeek R1 match or exceed GPT-4o on coding, math, and multilingual benchmarks. For narrow, well-defined tasks (classification, extraction, summarisation), a fine-tuned open model often outperforms a general closed API. Closed models retain an edge on ambiguous long-context reasoning, real-time multimodal tasks, and tasks that benefit from continuous model updates.

What is the easiest way to start running an open LLM locally?

Ollama is the fastest starting point. Install it, then run ollama run llama3.3 or ollama run qwen3:8b — it downloads the model, handles 4-bit quantisation automatically, and exposes a local API compatible with the OpenAI SDK. You can be calling a local LLM from Python in under 10 minutes with no GPU configuration required.

Do open models have the same safety guardrails as closed APIs?

Not by default. Frontier closed models from OpenAI and Anthropic include extensive alignment and safety training that has been continuously refined. Open models include some safety training but it varies widely and can be bypassed more easily. If you deploy an open model in a consumer-facing product, you are responsible for adding your own content filtering, output validation, and safety layers.

Can I fine-tune a closed API model on my own data?

OpenAI and some other providers do offer fine-tuning on their smaller models (e.g. GPT-4o mini). However, you upload your training data to their servers, the resulting model lives on their infrastructure, and you still pay per token to use it. With open weights, fine-tuning uses your own data on your own hardware — no data leaves your network, and you own the resulting model weights outright.

Open vs Closed Models: Should You Build on Open Source?

In plain English

Why the decision matters for builders

How the two models work

The closed API path

The open model path

A practical decision framework

How close is the quality gap in 2026?

Going deeper

Licences matter in production

Fine-tuning is the multiplier

The hybrid architecture

Privacy and compliance are not binary

FAQ

Further reading

// In plain English

// Why the decision matters for builders

// How the two models work

The closed API path

The open model path

// A practical decision framework

// How close is the quality gap in 2026?

// Going deeper

Licences matter in production

Fine-tuning is the multiplier

The hybrid architecture

Privacy and compliance are not binary

// FAQ

// Further reading

// Related

In plain English

Why the decision matters for builders

How the two models work

A practical decision framework

How close is the quality gap in 2026?

Going deeper

FAQ

Further reading

Related