How to Choose a Model: Flagship, Mid-Tier, and Small Models Compared

Build a mental model of provider lineups so you can match each task to the cheapest tier that still does the job.

INTERMEDIATE · 11 MIN READ · UPDATED 2026-06-12

In Plain English

Every major LLM provider (Anthropic, OpenAI, Google) sells its models in a lineup of three roughly distinct tiers: a flagship (the most capable, most expensive), a mid-tier (the balanced workhorse), and a small model (the cheap, fast option). You can think of it like a restaurant menu: there is the premium tasting menu, the regular entree, and the lunch special. The food comes from the same kitchen, but the ingredients, time, and price differ substantially.


In concrete terms today (mid-2026): Anthropic has Claude Opus (flagship), Claude Sonnet (mid-tier), and Claude Haiku (small). OpenAI has the GPT-5 series — GPT-5.5 (flagship) down to its smaller GPT-5 mini and nano variants. Google has the Gemini 3 family — Gemini 3.1 Pro (flagship), Gemini 3.5 Flash (mid-tier), and Gemini 3.1 Flash-Lite (small). The names evolve with each product cycle, but the three-tier pattern stays consistent across all providers.

Why It Matters

Choosing the wrong tier is the single easiest way to waste money or ship a product that frustrates users. A developer who routes every request through a flagship model for a simple classification task can spend 5 to 30 times more than necessary, depending on the provider's price gap between tiers. Conversely, a developer who forces a small model to write complex multi-step business logic will get unreliable outputs and spend time debugging failures that better reasoning would have avoided.

At scale the gap is enormous. Suppose your app processes one million user messages per day, each about 500 tokens of input and 300 tokens of output — roughly 800 tokens per call. At Claude Opus pricing ($5 input / $25 output per million tokens), that single day costs around $10,000. The same volume on Claude Haiku ($1 input / $5 output) costs roughly $2,000. Routing smartly — sending 80% of requests to Haiku and 20% to Opus — drops your bill close to $3,600 while keeping quality high where it counts.
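The arithmetic behind those figures can be reproduced in a few lines. This is a minimal sketch that uses only the per-million-token prices and traffic numbers quoted above; the function and variable names are illustrative, not part of any provider SDK.

```python
# Prices in USD per million tokens, taken from the figures quoted above.
PRICES = {
    "opus": {"input": 5.0, "output": 25.0},
    "haiku": {"input": 1.0, "output": 5.0},
}

def daily_cost(model: str, calls: int, in_tokens: int, out_tokens: int) -> float:
    """Cost in USD of `calls` requests with the given tokens per call."""
    price = PRICES[model]
    input_cost = calls * in_tokens / 1_000_000 * price["input"]
    output_cost = calls * out_tokens / 1_000_000 * price["output"]
    return input_cost + output_cost

CALLS, IN_TOK, OUT_TOK = 1_000_000, 500, 300

all_opus = daily_cost("opus", CALLS, IN_TOK, OUT_TOK)
all_haiku = daily_cost("haiku", CALLS, IN_TOK, OUT_TOK)

# 80% of calls go to Haiku, the remaining 20% to Opus.
haiku_calls = CALLS * 80 // 100
opus_calls = CALLS - haiku_calls
routed = (daily_cost("haiku", haiku_calls, IN_TOK, OUT_TOK)
          + daily_cost("opus", opus_calls, IN_TOK, OUT_TOK))

print(f"all Opus:  ${all_opus:,.0f}")   # $10,000
print(f"all Haiku: ${all_haiku:,.0f}")  # $2,000
print(f"80/20 mix: ${routed:,.0f}")     # $3,600
```

Output tokens dominate the bill in this example (7,500 of the 10,000 dollars on Opus), so trimming response length can matter as much as the tier choice.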

Speed is the other axis. Small models typically generate tokens two to three times faster than flagships. For a chatbot where users expect real-time responses, shaving 500 ms off every turn measurably improves satisfaction. For a nightly batch report that nobody reads until morning, latency does not matter at all.

How the Tiers Work

The differences between tiers boil down to three interrelated levers: model size (number of parameters), training compute (how long and expensively the model was trained), and inference compute (how much hardware is needed to generate each token). Bigger models with more training compute consistently outperform smaller ones on complex tasks — this is the empirical foundation of AI scaling laws. But bigger also means slower and more expensive to run.

// Provider Tier Stack (Anthropic example)

Flagship: Claude Opus, $5 input / $25 output per 1M tokens (highest reasoning)
Mid-Tier: Claude Sonnet, $3 input / $15 output per 1M tokens (balanced quality + speed)
Small: Claude Haiku, $1 input / $5 output per 1M tokens (fastest, cheapest)

Flagship models

Flagships are trained longest, on the most data, with the most RLHF fine-tuning. They handle multi-step reasoning, nuanced instruction following, long-horizon coding, complex document analysis, and tasks requiring the model to hold many constraints in mind at once. As of mid-2026: Claude Opus 4.8 is $5 / $25 per million tokens; OpenAI GPT-5.5 is roughly $5 / $30; Gemini 3.1 Pro is $2–$4 / $12–$18 (tiered by context length). These are the models you reach for when output quality is worth paying for and errors are expensive — legal review, medical triage, production code generation.

Mid-tier models

Mid-tier models are the daily workhorses for most production applications. They sit 10–20% below flagship quality on hard benchmarks while costing 40–60% less and running noticeably faster. Claude Sonnet 4.6 is $3 / $15 per million tokens; OpenAI's GPT-5 mid-tier and Google's Gemini 3.5 Flash sit in a comparable band. If you are building a chatbot, a document Q&A tool, a summarisation pipeline, or a coding assistant, the mid-tier is usually the right starting point — it is where providers invest most of their optimisation effort and where the price-to-quality ratio peaks.

Small models

Small models are highly distilled versions, sometimes trained with knowledge from the flagship to punch above their weight. They are optimised for speed and cost, not maximum accuracy. Claude Haiku 4.5 is $1 / $5 per million tokens, with OpenAI's GPT-5 nano-class and Google's Gemini 3.1 Flash-Lite occupying the same cheapest tier. These shine for classification, slot filling, intent detection, short-answer generation, and high-volume pipelines where you are calling the API hundreds of times per user session. They struggle with anything requiring sustained multi-step reasoning or deep domain knowledge.

Matching Tasks to Tiers

The clearest mental model: ask yourself two questions. First, how much can go wrong if the model makes a mistake? Second, how complex is the task's reasoning chain? High stakes plus complex reasoning points to flagship. Low stakes and simple execution points to small. Everything in between is mid-tier.
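The two-question rule is simple enough to write down as a function. The sketch below assumes each question is answered with a coarse "low" or "high"; the function name and the labels are hypothetical, and a real system would likely use finer-grained inputs.

```python
def pick_tier(error_cost: str, reasoning: str) -> str:
    """Map the two questions to a tier. Both arguments are 'low' or 'high'."""
    for value in (error_cost, reasoning):
        if value not in ("low", "high"):
            raise ValueError(f"expected 'low' or 'high', got {value!r}")
    if error_cost == "high" and reasoning == "high":
        return "flagship"   # high stakes plus complex reasoning
    if error_cost == "low" and reasoning == "low":
        return "small"      # low stakes and simple execution
    return "mid-tier"       # everything in between

print(pick_tier("high", "high"))  # flagship
print(pick_tier("low", "low"))    # small
print(pick_tier("high", "low"))   # mid-tier
```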

Task type	Recommended tier	Why
Intent classification (chatbot routing)	Small	Binary or small-class decision, high volume, errors recoverable
Slot filling / entity extraction	Small	Structured extraction from short text, format-constrained output
FAQ answering from a knowledge base	Small to mid-tier	Retrieval-augmented, mostly surface-level composition
Summarisation (news, emails, docs)	Mid-tier	Requires coherent synthesis across longer context
Customer support chatbot	Mid-tier	Needs tone, empathy, and light reasoning; volume is significant
Code review and suggestions	Mid-tier to flagship	Depends on codebase complexity and error cost
Multi-file refactoring / architecture	Flagship	Long context, multi-constraint reasoning, high correctness demand
Legal / medical document analysis	Flagship	Errors are expensive; nuanced domain knowledge required
Agentic task execution with tools	Flagship or mid-tier	Reliability of function calling degrades on small models

Notice that the recommendation depends on your tolerance for errors, not just the task label. A code suggestion feature in a casual side-project can tolerate mid-tier errors. The same feature in a regulated financial system cannot. Let error cost, not task name, drive the decision.

Provider Lineup Comparison

All three major providers follow the same three-tier logic, but the absolute pricing and the quality gap between tiers differ. The table below lists mid-2026 input prices for the flagship and small tiers; a dash marks a price this article does not quote. Prices shift regularly, so treat these as order-of-magnitude reference points and check the official pages before committing to a provider.

Provider	Flagship	Mid-tier	Small	Flagship input $/1M	Small input $/1M
Anthropic	Claude Opus 4.8	Claude Sonnet 4.6	Claude Haiku 4.5	$5.00	$1.00
OpenAI	GPT-5.5	GPT-5 mid-tier	GPT-5 nano-class	~$5.00	—
Google	Gemini 3.1 Pro	Gemini 3.5 Flash	Gemini 3.1 Flash-Lite	$2.00–$4.00	—

A few things to notice in that table. Google's flagship is significantly cheaper than Anthropic's and OpenAI's — and Gemini's Pro tier tends to generate tokens faster than Opus or GPT-5.5. OpenAI's smallest GPT-5 variants are extremely cheap and well-suited for pipelines that need a reliable JSON extractor or a quick intent classifier. Anthropic's mid-tier Sonnet is the most popular choice among developers building production chatbots because its output quality is close to flagship on conversational tasks.

// Tier strengths at a glance

Flagship

Complex multi-step reasoning
Nuanced instruction following
Long-horizon agentic tasks
Best benchmark scores
Highest cost and lowest speed

Mid-Tier

Best price-to-quality ratio
Strong conversational quality
Good code and summarisation
Provider's most optimised model
40–60% cheaper than flagship

Small

Fastest token generation
Lowest cost per call
Excellent for classification
Great for high-volume pipelines
Struggles with hard reasoning

Routing and Hybrid Strategies

The most cost-effective production architectures do not send every request to one fixed model — they route based on task complexity. A routing layer (which can itself be a small, cheap model) classifies each incoming request and sends it to the appropriate tier. Well-executed routing typically sends 70–80% of traffic to the small or mid-tier model while reserving the flagship for the 20–30% of requests that genuinely need it. Real-world deployments report 50–70% cost reductions with minimal quality degradation.

Sketch of a two-tier router (Python)

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SMALL_MODEL = "claude-haiku-4-5"     # fast, cheap
FLAGSHIP_MODEL = "claude-opus-4-8"   # powerful, expensive

def classify_complexity(user_message: str) -> str:
    """Ask the small model whether this task needs the flagship."""
    resp = client.messages.create(
        model=SMALL_MODEL,
        max_tokens=10,
        system="Reply with only 'simple' or 'complex'.",
        messages=[{
            "role": "user",
            "content": f"Is this task simple (lookup/extraction/short QA) or complex (reasoning/code/multi-step)? Task: {user_message}"
        }]
    )
    label = resp.content[0].text.strip().lower()
    # Tolerate replies such as "Complex." and treat anything else as simple.
    return "complex" if "complex" in label else "simple"

def route_and_call(user_message: str) -> str:
    """Classify the request, then answer it with the matching tier."""
    complexity = classify_complexity(user_message)
    model = FLAGSHIP_MODEL if complexity == "complex" else SMALL_MODEL
    resp = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": user_message}]
    )
    return resp.content[0].text

The code above is a minimal illustration. Real routers use confidence scores, task categories, per-user budgets, and fallback logic. Some teams use an embedding-based classifier instead of an LLM call to avoid adding latency to the routing step itself.
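To make the confidence-score and fallback ideas concrete, here is a hedged sketch that runs without network access. The classifier and model calls are passed in as plain functions, and everything here (the `route` function, the 0.8 threshold, the toy classifier rule, the stub models) is an illustrative assumption rather than a provider API.

```python
from typing import Callable, Tuple

Classifier = Callable[[str], Tuple[str, float]]  # returns (label, confidence in 0..1)
ModelCall = Callable[[str], str]                 # takes a prompt, returns a reply

def route(message: str,
          classify: Classifier,
          call_small: ModelCall,
          call_flagship: ModelCall,
          min_confidence: float = 0.8) -> Tuple[str, str]:
    """Return (tier_used, reply).

    Escalates to the flagship when the classifier says 'complex', when it
    is not confident enough, or when the small model call raises.
    """
    label, confidence = classify(message)
    if label == "simple" and confidence >= min_confidence:
        try:
            return "small", call_small(message)
        except Exception:
            # Fallback: any failure on the cheap path retries on the flagship.
            pass
    return "flagship", call_flagship(message)

# Stub components so the sketch is runnable offline.
def toy_classifier(message: str) -> Tuple[str, float]:
    # Hypothetical rule: short messages are confidently simple.
    words = len(message.split())
    return ("simple", 0.95) if words < 12 else ("complex", 0.6)

def fake_small(prompt: str) -> str:
    return f"[small] {prompt}"

def fake_flagship(prompt: str) -> str:
    return f"[flagship] {prompt}"

tier, reply = route("What are your opening hours?",
                    toy_classifier, fake_small, fake_flagship)
print(tier, "->", reply)
```

Passing the model calls in as arguments keeps the routing policy testable on its own; swapping the toy classifier for an embedding-based one would not change `route`.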

Going Deeper

Once you understand the three tiers, several more nuanced factors start to matter in production decisions.

Context window differences

Flagship and mid-tier models from Anthropic and Google now offer 1 million token context windows. Where the provider charges a flat rate, a 900k-token request costs the same per-token rate as a 9k one; Gemini 3.1 Pro, as noted earlier, is tiered by context length, so long requests cost more per token. Small models usually have shorter windows (8k–128k depending on provider). If your application processes long documents, PDFs, or full codebases in a single call, small models may simply not fit the content.
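A quick pre-flight check can rule out tiers whose window is too small before you spend anything on evaluation. The sketch below rests on two stated assumptions: the window sizes are placeholders in the ranges mentioned above, and the token count uses a rough four-characters-per-token heuristic for English text rather than a real tokenizer.

```python
# Hypothetical window sizes in tokens; check each provider's documentation.
CONTEXT_WINDOWS = {
    "flagship": 1_000_000,
    "mid-tier": 1_000_000,
    "small": 128_000,
}

def estimate_tokens(text: str) -> int:
    """Rough estimate assuming about 4 characters per token."""
    return len(text) // 4 + 1

def tiers_that_fit(text: str, reserved_output: int = 4_096) -> list[str]:
    """Tiers whose window holds the input plus room for the reply."""
    needed = estimate_tokens(text) + reserved_output
    return [tier for tier, window in CONTEXT_WINDOWS.items() if needed <= window]

print(tiers_that_fit("Short question about billing."))
print(tiers_that_fit("x" * 600_000))  # about 150k tokens: too large for 'small'
```

For a production check, replace `estimate_tokens` with the provider's own token-counting facility, since the heuristic can be far off for code or non-English text.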

Benchmark scores vs. production performance

Published benchmarks measure performance on curated test sets, not your specific workload. A flagship model that scores 85% on a standardised reasoning benchmark may outperform a mid-tier model by only 5% on your task. Always evaluate on your own data before committing to a tier. Collect 50–100 representative examples, run them through candidates, score the outputs, and let the data decide — not marketing materials.
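A small harness is enough to run that comparison. This sketch assumes a task with a single correct label per example and scores by exact match, which suits classification but not open-ended generation; the stub "models" stand in for real API calls, and all names are illustrative.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, expected output)

def accuracy(model: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples where the output matches the expected label."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(model(x).strip().lower() == y.strip().lower() for x, y in examples)
    return hits / len(examples)

def compare(candidates: Dict[str, Callable[[str], str]],
            examples: List[Example]) -> Dict[str, float]:
    """Score every candidate on the same example set."""
    return {name: accuracy(fn, examples) for name, fn in candidates.items()}

# Stand-ins for real model calls: the 'small' stub misreads negation.
def stub_small(text: str) -> str:
    negative_words = ("terrible", "awful", "bad")
    return "negative" if any(w in text for w in negative_words) else "positive"

def stub_mid(text: str) -> str:
    return "positive" if "not bad" in text else stub_small(text)

EXAMPLES = [
    ("great product", "positive"),
    ("terrible support", "negative"),
    ("not bad at all", "positive"),
    ("awful", "negative"),
]

results = compare({"small": stub_small, "mid-tier": stub_mid}, EXAMPLES)
print(results)  # {'small': 0.75, 'mid-tier': 1.0}
```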

Fine-tuning and distillation

A small model fine-tuned on high-quality examples generated by a flagship can rival the flagship on narrow, well-defined tasks. This is the distillation pattern: use the expensive model to produce a gold dataset, then train the cheap model to replicate that behaviour. OpenAI and Anthropic have both offered fine-tuning for some of their smaller models, though availability varies by model and platform, so confirm support before planning around it. The upfront cost of generating and curating training data is real, but for high-volume, stable tasks the ROI is dramatic.
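The first half of the distillation pattern, producing the gold dataset, might look like the sketch below. The teacher is passed in as a function so the example runs offline. The `prompt`/`completion` field names are an assumption for illustration; each fine-tuning service defines its own schema.

```python
import io
import json
from typing import Callable, Iterable, TextIO

def build_distillation_set(prompts: Iterable[str],
                           teacher: Callable[[str], str],
                           out: TextIO) -> int:
    """Write one JSON object per line and return how many were written."""
    count = 0
    for prompt in prompts:
        completion = teacher(prompt).strip()
        if not completion:
            continue  # skip empty teacher outputs rather than train on them
        record = {"prompt": prompt, "completion": completion}
        out.write(json.dumps(record) + "\n")
        count += 1
    return count

# Stand-in for a flagship model call.
def fake_teacher(prompt: str) -> str:
    return "positive" if "love" in prompt else "negative"

buffer = io.StringIO()
written = build_distillation_set(["I love it", "It broke"], fake_teacher, buffer)
print(written)
print(buffer.getvalue(), end="")
```

In practice the curation step matters more than the generation step: review or filter the teacher's outputs before training on them, since the small model will copy its mistakes too.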

Reasoning models are a separate axis

OpenAI's reasoning-optimised configurations, and Anthropic's extended thinking feature on Opus, represent a different tradeoff: more internal reasoning tokens traded for better performance on hard logical, mathematical, and coding problems. These modes are not simply "better flagship": they spend extra inference compute on chain-of-thought before responding, and they can be overkill for tasks that do not require formal reasoning. Expect such reasoning modes to be 2–5x slower and cost more per task than a standard flagship.
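A rough per-task estimate shows where the extra cost comes from. The sketch assumes that reasoning tokens are billed at the output-token rate and uses the Opus prices quoted earlier. The 2,000-token reasoning budget is a made-up figure for illustration, so check your provider's billing rules and measure real usage.

```python
# USD per million tokens, from the flagship prices quoted earlier.
INPUT_PRICE, OUTPUT_PRICE = 5.0, 25.0

def task_cost(in_tokens: int, out_tokens: int, reasoning_tokens: int = 0) -> float:
    """Cost of one call, assuming reasoning tokens bill as output tokens."""
    billed_output = out_tokens + reasoning_tokens
    return (in_tokens * INPUT_PRICE + billed_output * OUTPUT_PRICE) / 1_000_000

standard = task_cost(500, 300)
with_reasoning = task_cost(500, 300, reasoning_tokens=2_000)  # hypothetical budget

print(f"standard:       ${standard:.4f}")        # $0.0100
print(f"with reasoning: ${with_reasoning:.4f}")  # $0.0600
print(f"ratio:          {with_reasoning / standard:.1f}x")
```

Because the visible answer is the same length in both cases, the whole difference comes from tokens the user never sees, which is why capping the reasoning budget is often the first cost control to reach for.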

A practical decision checklist

Define the task: classification, generation, reasoning, or coding?
Estimate volume: calls per day and tokens per call.
Estimate error cost: what breaks if the model gets it wrong?
Start with the mid-tier. Benchmark on 50–100 real examples.
If quality is sufficient, try the small model. If insufficient, try the flagship.
Add prompt caching and batch mode for any tier to cut costs further.
Revisit every six months — new model releases frequently shift the optimal choice.
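The "start in the middle, then step down or up" part of the checklist can be sketched as a short procedure. It assumes you already have some way to score a tier on your own examples, supplied here as an `evaluate` function returning a value between 0 and 1; the function name, the tier labels, and the sample scores are hypothetical.

```python
from typing import Callable, Optional

def choose_tier(evaluate: Callable[[str], float], threshold: float) -> Optional[str]:
    """Return the cheapest tier that meets the quality bar, or None.

    Evaluates the mid-tier first and only scores another tier when needed,
    mirroring the order in the checklist above.
    """
    if evaluate("mid-tier") >= threshold:
        # Quality is sufficient: see whether the small model also clears the bar.
        return "small" if evaluate("small") >= threshold else "mid-tier"
    # Quality is insufficient: see whether the flagship clears the bar.
    return "flagship" if evaluate("flagship") >= threshold else None

# Made-up scores standing in for results on 50 to 100 real examples.
scores = {"small": 0.75, "mid-tier": 0.92, "flagship": 0.97}

print(choose_tier(scores.__getitem__, threshold=0.90))  # mid-tier
print(choose_tier(scores.__getitem__, threshold=0.70))  # small
print(choose_tier(scores.__getitem__, threshold=0.95))  # flagship
```

A `None` result means no tier meets the bar, which usually points to a prompt or task-design problem rather than a model-choice problem.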

FAQ

Which LLM should I use for a customer support chatbot?

Start with the mid-tier: Claude Sonnet, OpenAI's GPT-5 mid-tier, or Gemini 3.5 Flash. These models handle conversational nuance, tone, and moderate reasoning well without the cost of a flagship. If your product handles sensitive topics (healthcare, finance) or requires precise policy adherence, test the flagship. If your volume is very high and questions are mostly FAQ-style lookups, test a small model first.

Is a small model good enough for production use?

For many tasks, yes. The cheapest tier from each provider — Claude Haiku, OpenAI's smallest GPT-5 variants, or Gemini 3.1 Flash-Lite — performs well on classification, extraction, summarisation of short documents, and conversational Q&A. Small models struggle with long-horizon reasoning, complex coding, and tasks requiring sustained multi-step logic. Evaluate one on your actual workload — do not rely on general benchmarks.

How much cheaper is a small model than a flagship?

Roughly 5–30x cheaper on input tokens, depending on the provider. Anthropic's Haiku is $1/M input versus Opus at $5/M — a 5x gap. OpenAI's gap between its smallest GPT-5 variants and GPT-5.5 is wider still. Output tokens follow a similar ratio. At scale, this difference can mean tens of thousands of dollars per month.

Does using a smaller model always mean worse quality?

Not always. For well-defined, narrow tasks — classifying sentiment, extracting names from a form, answering questions from a retrieved document — small models often match flagship quality because the task does not require broad reasoning. Quality gaps show up most on open-ended generation, complex instruction following, and multi-step problem solving.

What is a model routing strategy and when should I use it?

Model routing classifies each incoming request by complexity and sends it to the cheapest tier that can handle it. You should consider routing once you have significant traffic (thousands of calls per day) and clear evidence that a meaningful fraction of requests are simple enough for a cheaper model. Routing adds a small latency overhead (one cheap classifier call), so it is not worth it for very low-volume applications.

Are Google Gemini models cheaper than Claude and GPT?

For the flagship tier, yes. Gemini 3.1 Pro is priced at $2–$4 per million input tokens versus $5 for Claude Opus and GPT-5.5. Gemini also generates tokens significantly faster. Google's small model (Gemini 3.1 Flash-Lite) is among the cheapest available. Whether it is the right choice depends on your quality requirements — evaluate on your task, not just price.
