AI/TLDR

How to Choose a Model: Flagship, Mid-Tier, and Small Models Compared

Build a mental model of provider lineups so you can match each task to the cheapest tier that still does the job.

INTERMEDIATE | 11 MIN READ | UPDATED 2026-06-12

In Plain English

Every major LLM provider — Anthropic, OpenAI, Google — sells their models in a lineup of three roughly distinct tiers: a flagship (the most capable, most expensive), a mid-tier (the balanced workhorse), and a small model (the cheap, fast option). You can think of it like a restaurant menu: there is the premium tasting menu, the regular entree, and the lunch special. The food comes from the same kitchen, but the ingredients, time, and price differ substantially.

In concrete terms today (mid-2026): Anthropic has Claude Opus (flagship), Claude Sonnet (mid-tier), and Claude Haiku (small). OpenAI has GPT-5.5 (flagship), GPT-4o (mid-tier), and GPT-4o-mini (small). Google has Gemini 3.1 Pro (flagship), Gemini 3.5 Flash (mid-tier), and Gemini 2.5 Flash-Lite (small). The names evolve with each product cycle, but the three-tier pattern stays consistent across all providers.

Why It Matters

Choosing the wrong tier is the single easiest way to waste money or ship a product that frustrates users. A developer who routes every request through a flagship model for a simple classification task can spend 5 to 50 times more than necessary, depending on the provider's price gap between tiers. Conversely, a developer who forces a small model to write complex multi-step business logic will get unreliable outputs and spend time debugging failures that better reasoning would have avoided.

At scale the gap is enormous. Suppose your app processes one million user messages per day, each about 500 tokens of input and 300 tokens of output — roughly 800 tokens per call. At Claude Opus pricing ($5 input / $25 output per million tokens), that single day costs around $10,000. The same volume on Claude Haiku ($1 input / $5 output) costs roughly $2,000. Routing smartly — sending 80% of requests to Haiku and 20% to Opus — drops your bill close to $3,600 while keeping quality high where it counts.
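The arithmetic behind those figures can be reproduced in a few lines. This is a minimal sketch that uses only the prices and token counts quoted above; the function and variable names are illustrative, not part of any provider SDK.

```python
# Prices in USD per million tokens, as quoted in the paragraph above.
PRICES = {
    "opus": {"input": 5.00, "output": 25.00},
    "haiku": {"input": 1.00, "output": 5.00},
}

def daily_cost(model: str, calls: int, input_tokens: int, output_tokens: int) -> float:
    """Daily spend in USD for `calls` requests with the given token sizes."""
    p = PRICES[model]
    total_input = calls * input_tokens * p["input"]
    total_output = calls * output_tokens * p["output"]
    return (total_input + total_output) / 1_000_000

CALLS_PER_DAY = 1_000_000
opus = daily_cost("opus", CALLS_PER_DAY, 500, 300)
haiku = daily_cost("haiku", CALLS_PER_DAY, 500, 300)
# 80% of traffic on Haiku, 20% on Opus.
blended = 0.8 * haiku + 0.2 * opus

print(f"Opus only:  ${opus:,.0f}")     # $10,000
print(f"Haiku only: ${haiku:,.0f}")    # $2,000
print(f"80/20 mix:  ${blended:,.0f}")  # $3,600
```

Note that output tokens dominate the bill here: 300 output tokens cost three times as much as 500 input tokens at these rates, so trimming response length can matter as much as picking a tier.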

Speed is the other axis. Small models typically generate tokens two to three times faster than flagships. For a chatbot where users expect real-time responses, shaving 500 ms off every turn measurably improves satisfaction. For a nightly batch report that nobody reads until morning, latency does not matter at all.

How the Tiers Work

The differences between tiers boil down to three interrelated levers: model size (number of parameters), training compute (how long and expensively the model was trained), and inference compute (how much hardware is needed to generate each token). Bigger models with more training compute consistently outperform smaller ones on complex tasks, which is the empirical pattern that AI scaling laws describe. But bigger also means slower and more expensive to run.

Flagship models

Flagships are trained longest, on the most data, with the most RLHF fine-tuning. They handle multi-step reasoning, nuanced instruction following, long-horizon coding, complex document analysis, and tasks requiring the model to hold many constraints in mind at once. As of mid-2026: Claude Opus 4.8 is $5 / $25 per million tokens; OpenAI GPT-5.5 is roughly $5 / $30; Gemini 3.1 Pro is $2–$4 / $12–$18 (tiered by context length). These are the models you reach for when output quality is worth paying for and errors are expensive — legal review, medical triage, production code generation.

Mid-tier models

Mid-tier models are the daily workhorses for most production applications. They sit 10–20% below flagship quality on hard benchmarks while costing substantially less and running noticeably faster. Claude Sonnet 4.6 is $3 / $15 per million tokens; GPT-4o is $2.50 / $10; Gemini 3.5 Flash is $0.30 / $2.50. Against the flagship input prices quoted in this section, that is 40% less for Sonnet, 50% less for GPT-4o, and 85% or more for Gemini 3.5 Flash. If you are building a chatbot, a document Q&A tool, a summarisation pipeline, or a coding assistant, the mid-tier is usually the right starting point: it is where providers invest most of their optimisation effort and where the price-to-quality ratio peaks.

Small models

Small models are highly distilled versions, sometimes trained with knowledge from the flagship to punch above their weight. They are optimised for speed and cost, not maximum accuracy. Claude Haiku 4.5 is $1 / $5 per million tokens; GPT-4o-mini is $0.15 / $0.60; Gemini 2.5 Flash-Lite is $0.10 / $0.40. These shine for classification, slot filling, intent detection, short-answer generation, and high-volume pipelines where you are calling the API hundreds of times per user session. They struggle with anything requiring sustained multi-step reasoning or deep domain knowledge.

Matching Tasks to Tiers

The clearest mental model: ask yourself two questions. First, how much can go wrong if the model makes a mistake? Second, how complex is the task's reasoning chain? High stakes plus complex reasoning points to flagship. Low stakes and simple execution points to small. Everything in between is mid-tier.

| Task type | Recommended tier | Why |
| --- | --- | --- |
| Intent classification (chatbot routing) | Small | Binary or small-class decision, high volume, errors recoverable |
| Slot filling / entity extraction | Small | Structured extraction from short text, format-constrained output |
| FAQ answering from a knowledge base | Small to mid-tier | Retrieval-augmented, mostly surface-level composition |
| Summarisation (news, emails, docs) | Mid-tier | Requires coherent synthesis across longer context |
| Customer support chatbot | Mid-tier | Needs tone, empathy, and light reasoning; volume is significant |
| Code review and suggestions | Mid-tier to flagship | Depends on codebase complexity and error cost |
| Multi-file refactoring / architecture | Flagship | Long context, multi-constraint reasoning, high correctness demand |
| Legal / medical document analysis | Flagship | Errors are expensive; nuanced domain knowledge required |
| Agentic task execution with tools | Flagship or mid-tier | Reliability of function calling degrades on small models |

Notice that the recommendation depends on your tolerance for errors, not just the task label. A code suggestion feature in a casual side-project can tolerate mid-tier errors. The same feature in a regulated financial system cannot. Let error cost, not task name, drive the decision.
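The two-question rule from this section can be written down as a tiny function. This is a sketch under stated assumptions: the three-level scale and the names `Level` and `pick_tier` are invented for illustration, and the function encodes only the rule as worded above (high plus high is flagship, low plus low is small, everything else is mid-tier).

```python
from enum import IntEnum

class Level(IntEnum):
    """Hypothetical three-point scale for both questions."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3

def pick_tier(error_cost: Level, reasoning_complexity: Level) -> str:
    """Map error cost and reasoning complexity to a starting tier."""
    if error_cost == Level.HIGH and reasoning_complexity == Level.HIGH:
        return "flagship"
    if error_cost == Level.LOW and reasoning_complexity == Level.LOW:
        return "small"
    return "mid-tier"

# The same feature lands on different tiers once error cost changes.
print(pick_tier(Level.LOW, Level.MEDIUM))  # mid-tier (side-project code suggestions)
print(pick_tier(Level.HIGH, Level.HIGH))   # flagship (regulated financial system)
print(pick_tier(Level.LOW, Level.LOW))     # small    (intent classification)
```

Treat the result as a starting point for evaluation rather than a final answer; the table rows that span two tiers are exactly the cases where measurement on your own data decides.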

Provider Lineup Comparison

All three major providers follow the same three-tier logic, but the absolute pricing and the quality gap between tiers differ. The table below shows verified mid-2026 pricing. Prices shift regularly — treat these as order-of-magnitude reference points and check the official pages before committing to a provider.

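| Provider | Flagship | Mid-tier | Small | Flagship input ($ per 1M tokens) | Small input ($ per 1M tokens) |
| --- | --- | --- | --- | --- | --- |
| Anthropic | Claude Opus 4.8 | Claude Sonnet 4.6 | Claude Haiku 4.5 | $5.00 | $1.00 |
| OpenAI | GPT-5.5 | GPT-4o | GPT-4o-mini | ~$5.00 | $0.15 |
| Google | Gemini 3.1 Pro | Gemini 3.5 Flash | Gemini 2.5 Flash-Lite | $2.00–$4.00 | $0.10 |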

A few things to notice in that table. Google's flagship is significantly cheaper than Anthropic's and OpenAI's, and Gemini 3.1 Pro generates tokens faster (around 125 tokens/sec on benchmarks versus 50–60 for Opus and GPT-5.5). OpenAI's small model (GPT-4o-mini) is extremely cheap ($0.15 input) and well-suited for pipelines that need a reliable JSON extractor or a quick intent classifier. Anthropic's mid-tier Sonnet is a popular choice among developers building production chatbots because its output quality is close to flagship on conversational tasks.

Routing and Hybrid Strategies

The most cost-effective production architectures do not send every request to one fixed model — they route based on task complexity. A routing layer (which can itself be a small, cheap model) classifies each incoming request and sends it to the appropriate tier. Well-executed routing typically sends 70–80% of traffic to the small or mid-tier model while reserving the flagship for the 20–30% of requests that genuinely need it. Real-world deployments report 50–70% cost reductions with minimal quality degradation.
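To see where a figure in that 50–70% range can come from, the sketch below prices a routed setup against an all-flagship baseline, including the cost of the classifier call itself. It reuses the Opus and Haiku prices and the 500/300-token workload quoted earlier in this article. The five-token classifier reply and the assumption that the classifier prompt is about as long as the user message are illustrative guesses, not measurements.

```python
def cost_usd(calls, in_tok, out_tok, in_price, out_price):
    """Spend in USD; prices are per million tokens."""
    return (calls * in_tok * in_price + calls * out_tok * out_price) / 1_000_000

def routed_savings(calls, in_tok, out_tok, small_share,
                   small_prices, flagship_prices, classifier_out_tok=5):
    """Return (baseline, routed, fractional_saving) versus all-flagship."""
    baseline = cost_usd(calls, in_tok, out_tok, *flagship_prices)
    small_calls = calls * small_share
    flagship_calls = calls - small_calls
    answers = (cost_usd(small_calls, in_tok, out_tok, *small_prices)
               + cost_usd(flagship_calls, in_tok, out_tok, *flagship_prices))
    # Every request also pays for one classifier call on the small model.
    overhead = cost_usd(calls, in_tok, classifier_out_tok, *small_prices)
    routed = answers + overhead
    return baseline, routed, 1 - routed / baseline

baseline, routed, saving = routed_savings(
    calls=1_000_000, in_tok=500, out_tok=300, small_share=0.8,
    small_prices=(1.00, 5.00), flagship_prices=(5.00, 25.00),
)
print(f"baseline ${baseline:,.0f}, routed ${routed:,.0f}, saving {saving:.1%}")
```

Under these assumptions the classifier adds about $525 per day on top of the $3,600 answer cost, so the saving is a little under 59% rather than the 64% you would get by ignoring routing overhead.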

Sketch of a two-tier router (Python)

```python
import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

SMALL_MODEL = "claude-haiku-4-5"     # fast, cheap
FLAGSHIP_MODEL = "claude-opus-4-8"   # powerful, expensive

def classify_complexity(user_message: str) -> str:
    """Ask the small model whether this task needs the flagship."""
    resp = client.messages.create(
        model=SMALL_MODEL,
        max_tokens=10,
        system="Reply with only 'simple' or 'complex'.",
        messages=[{
            "role": "user",
            "content": (
                "Is this task simple (lookup/extraction/short QA) or "
                f"complex (reasoning/code/multi-step)? Task: {user_message}"
            ),
        }],
    )
    # Lower-case and trim so "Complex" and " complex " compare equal.
    return resp.content[0].text.strip().lower()

def route_and_call(user_message: str) -> str:
    """Answer with the flagship only when the classifier says 'complex'."""
    complexity = classify_complexity(user_message)
    # startswith tolerates trailing punctuation such as "complex."
    model = FLAGSHIP_MODEL if complexity.startswith("complex") else SMALL_MODEL
    resp = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": user_message}],
    )
    return resp.content[0].text
```

The code above is a minimal illustration. Real routers use confidence scores, task categories, per-user budgets, and fallback logic. Some teams use an embedding-based classifier instead of an LLM call to avoid adding latency to the routing step itself.
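As a rough illustration of the confidence and fallback ideas just mentioned, the sketch below escalates to the flagship in two cases: when the classifier is unsure, and when the small model's answer fails a sanity check. Everything here is hypothetical scaffolding. The classifier, the two model callers, and the `looks_ok` check are passed in as plain functions so the logic can be tested without any API calls, and the 0.8 confidence threshold is an arbitrary placeholder.

```python
from typing import Callable

def route_with_fallback(
    message: str,
    classify: Callable[[str], tuple[str, float]],
    call_small: Callable[[str], str],
    call_flagship: Callable[[str], str],
    looks_ok: Callable[[str], bool],
    min_confidence: float = 0.8,
) -> tuple[str, str]:
    """Return (tier_used, answer)."""
    label, confidence = classify(message)
    # Escalate when the classifier says "complex" or is unsure.
    if label == "complex" or confidence < min_confidence:
        return "flagship", call_flagship(message)
    answer = call_small(message)
    # Fallback: retry on the flagship if the cheap answer fails the check.
    if not looks_ok(answer):
        return "flagship", call_flagship(message)
    return "small", answer

# Usage with stand-in functions instead of real model calls.
tier, answer = route_with_fallback(
    "What are your opening hours?",
    classify=lambda m: ("simple", 0.95),
    call_small=lambda m: "We are open 9 to 5.",
    call_flagship=lambda m: "(flagship answer)",
    looks_ok=lambda a: len(a) > 0,
)
print(tier)  # small
```

The fallback path pays for two calls on the same request, so it only makes sense when failed checks are rare.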

Going Deeper

Once you understand the three tiers, several more nuanced factors start to matter in production decisions.

Context window differences

Flagship and mid-tier models from Anthropic and Google now offer 1 million token context windows at standard pricing — meaning a 900k-token request costs the same per-token rate as a 9k one. Small models usually have shorter windows (8k–128k depending on provider). If your application processes long documents, PDFs, or full codebases in a single call, small models may simply not fit the content.
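A quick pre-flight check can tell you which tiers a document fits into before you send it. The sketch below is an approximation only. The four-characters-per-token estimate is a rough rule of thumb for English text, and the window sizes in `CONTEXT_WINDOWS` are placeholder values chosen to match the ranges mentioned above, so substitute the provider's tokenizer and published limits in real use.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token for English text."""
    return len(text) // 4 + 1

# Hypothetical context windows in tokens; replace with published limits.
CONTEXT_WINDOWS = {
    "small": 128_000,
    "mid-tier": 1_000_000,
    "flagship": 1_000_000,
}

def tiers_that_fit(prompt: str, max_output_tokens: int) -> list[str]:
    """List the tiers whose window holds the prompt plus the reply budget."""
    needed = estimate_tokens(prompt) + max_output_tokens
    return [tier for tier, window in CONTEXT_WINDOWS.items() if needed <= window]

short_doc = "x" * 4_000      # about 1k tokens
long_doc = "x" * 600_000     # about 150k tokens
print(tiers_that_fit(short_doc, 1_000))  # ['small', 'mid-tier', 'flagship']
print(tiers_that_fit(long_doc, 1_000))   # ['mid-tier', 'flagship']
```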

Benchmark scores vs. production performance

Published benchmarks measure performance on curated test sets, not your specific workload. A flagship model that scores 85% on a standardised reasoning benchmark may outperform a mid-tier model by only 5% on your task. Always evaluate on your own data before committing to a tier. Collect 50–100 representative examples, run them through candidates, score the outputs, and let the data decide — not marketing materials.
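The evaluation loop described above does not need a framework. Here is a minimal sketch, assuming each candidate model is wrapped in a function that takes a prompt and returns text, and that an exact-match scorer is good enough for the task. Both are simplifications; open-ended tasks usually need a rubric or human grading instead.

```python
from statistics import mean
from typing import Callable

def evaluate(
    candidates: dict[str, Callable[[str], str]],
    examples: list[tuple[str, str]],
    score: Callable[[str, str], float],
) -> dict[str, float]:
    """Mean score per candidate over (prompt, expected) examples."""
    return {
        name: mean(score(model(prompt), expected) for prompt, expected in examples)
        for name, model in candidates.items()
    }

def exact_match(output: str, expected: str) -> float:
    """1.0 when the answers match after trimming and lower-casing."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

# Stand-in "models" so the harness runs without API access.
examples = [("refund request", "billing"), ("app crashes", "technical")]
candidates = {
    "small": lambda prompt: "billing",
    "mid-tier": lambda prompt: "billing" if "refund" in prompt else "technical",
}
print(evaluate(candidates, examples, exact_match))
# {'small': 0.5, 'mid-tier': 1.0}
```

In practice each lambda would wrap a real API call, and `examples` would hold the 50–100 representative cases drawn from your own traffic.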

Fine-tuning and distillation

A small model fine-tuned on high-quality examples generated by a flagship can rival the flagship on narrow, well-defined tasks. This is the distillation pattern: use the expensive model to produce a gold dataset, then train the cheap model to replicate that behaviour. OpenAI offers fine-tuning for some of its smaller models, and Claude Haiku fine-tuning has been offered through Amazon Bedrock; availability changes between model generations, so confirm it for the specific model you plan to use. The upfront cost of generating and curating training data is real, but for high-volume, stable tasks the ROI is dramatic.
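The data-generation half of the distillation pattern can be sketched as follows. The teacher is passed in as a plain function so the example runs offline; in practice it would wrap a flagship API call. The chat-style `messages` record written to each JSONL line is an assumed layout, since every fine-tuning service defines its own schema, and the default `keep` filter (drop empty answers) stands in for real curation.

```python
import json
from typing import Callable, Iterable

def build_distillation_set(
    prompts: Iterable[str],
    teacher: Callable[[str], str],
    keep: Callable[[str, str], bool] = lambda prompt, answer: bool(answer.strip()),
) -> list[str]:
    """Return JSONL lines pairing each prompt with the teacher's answer."""
    lines = []
    for prompt in prompts:
        answer = teacher(prompt)
        if not keep(prompt, answer):
            continue  # curation step: drop examples that fail the filter
        record = {"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]}
        lines.append(json.dumps(record))
    return lines

# Stand-in teacher; a real one would call the flagship model.
fake_teacher = lambda prompt: "positive" if "love" in prompt else ""
rows = build_distillation_set(["I love this", "meh"], fake_teacher)
print(len(rows))  # 1 (the empty answer was filtered out)
```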

Reasoning models are a separate axis

OpenAI's o3 and o4-mini, and Anthropic's extended thinking feature on Opus, represent a different tradeoff: more internal reasoning tokens traded for better performance on hard logical, mathematical, and coding problems. These models are not simply "better flagship". They behave differently at inference time, spending compute on chain-of-thought before responding, and can be overkill for tasks that do not require formal reasoning. Expect o-series models to be 2–5x slower and cost more per task than a standard flagship.
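A short calculation shows why reasoning raises the cost per task even when the per-token price is unchanged. This sketch assumes reasoning tokens are billed at the output rate and uses an invented figure of 2,000 reasoning tokens per call; real counts vary widely by task and settings. Prices are the Opus figures quoted earlier in this article.

```python
def task_cost(input_tokens, visible_output_tokens, reasoning_tokens,
              input_price, output_price):
    """Cost in USD of one call; prices are per million tokens.

    Assumes reasoning tokens are billed at the output rate.
    """
    billed_output = visible_output_tokens + reasoning_tokens
    return (input_tokens * input_price + billed_output * output_price) / 1_000_000

plain = task_cost(500, 300, 0, 5.00, 25.00)
with_reasoning = task_cost(500, 300, 2_000, 5.00, 25.00)  # hypothetical 2,000 tokens

print(f"plain:          ${plain:.4f}")
print(f"with reasoning: ${with_reasoning:.4f}")
print(f"ratio:          {with_reasoning / plain:.1f}x")
```

With these illustrative numbers the same request costs about six times as much once reasoning is switched on, which is why it pays to reserve it for problems that need it.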

A practical decision checklist

  1. Define the task: classification, generation, reasoning, or coding?
  2. Estimate volume: calls per day and tokens per call.
  3. Estimate error cost: what breaks if the model gets it wrong?
  4. Start with the mid-tier. Benchmark on 50–100 real examples.
  5. If quality is sufficient, try the small model. If insufficient, try the flagship.
  6. Add prompt caching and batch mode for any tier to cut costs further.
  7. Revisit every six months — new model releases frequently shift the optimal choice.
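Steps 4 and 5 of the checklist amount to a small decision procedure. The sketch below assumes you have already measured a pass rate for each tier on your own examples and picked a quality threshold; the function name and the `"none"` result (returned when even the flagship misses the bar) are illustrative additions.

```python
def choose_tier(pass_rate: dict[str, float], threshold: float) -> str:
    """Apply steps 4 and 5: start at mid-tier, then step down or up."""
    if pass_rate["mid-tier"] >= threshold:
        # Quality is sufficient: take the small model if it also clears the bar.
        return "small" if pass_rate["small"] >= threshold else "mid-tier"
    # Quality is insufficient: move up, or report that no tier is good enough.
    return "flagship" if pass_rate["flagship"] >= threshold else "none"

print(choose_tier({"small": 0.95, "mid-tier": 0.96, "flagship": 0.99}, 0.90))  # small
print(choose_tier({"small": 0.70, "mid-tier": 0.92, "flagship": 0.97}, 0.90))  # mid-tier
print(choose_tier({"small": 0.50, "mid-tier": 0.80, "flagship": 0.95}, 0.90))  # flagship
```

A `"none"` result is a signal to revisit the prompt, the task definition, or the threshold rather than to ship the flagship anyway.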

FAQ

Which LLM model should I use for a customer support chatbot?

Start with the mid-tier: Claude Sonnet, GPT-4o, or Gemini 3.5 Flash. These models handle conversational nuance, tone, and moderate reasoning well without the cost of a flagship. If your product handles sensitive topics (healthcare, finance) or requires precise policy adherence, test the flagship. If your volume is very high and questions are mostly FAQ-style lookups, test a small model first.

Is GPT-4o-mini good enough for production use?

For many tasks, yes. GPT-4o-mini costs $0.15 / $0.60 per million tokens and performs well on classification, extraction, summarisation of short documents, and conversational Q&A. It struggles with long-horizon reasoning, complex coding, and tasks requiring sustained multi-step logic. Evaluate it on your actual workload — do not rely on general benchmarks.

How much cheaper is a small model than a flagship?

Roughly 5–30x cheaper on input tokens, depending on the provider. Anthropic's Haiku is $1/M input versus Opus at $5/M, a 5x gap. OpenAI's GPT-4o-mini is $0.15/M versus GPT-5.5 at roughly $5/M, a 30x+ gap. Output tokens show a similar or wider gap: 5x for Anthropic ($5/M versus $25/M) and about 50x for OpenAI ($0.60/M versus roughly $30/M). At scale, this difference can mean tens of thousands of dollars per month.

Does using a smaller model always mean worse quality?

Not always. For well-defined, narrow tasks — classifying sentiment, extracting names from a form, answering questions from a retrieved document — small models often match flagship quality because the task does not require broad reasoning. Quality gaps show up most on open-ended generation, complex instruction following, and multi-step problem solving.

What is a model routing strategy and when should I use it?

Model routing classifies each incoming request by complexity and sends it to the cheapest tier that can handle it. You should consider routing once you have significant traffic (thousands of calls per day) and clear evidence that a meaningful fraction of requests are simple enough for a cheaper model. Routing adds a small latency overhead (one cheap classifier call), so it is not worth it for very low-volume applications.

Are Google Gemini models cheaper than Claude and GPT?

For the flagship tier, yes. Gemini 3.1 Pro is priced at $2–$4 per million input tokens versus $5 for Claude Opus and GPT-5.5. Gemini also generates tokens significantly faster. Google's small model (Gemini 2.5 Flash-Lite) is among the cheapest available at $0.10/M input. Whether it is the right choice depends on your quality requirements — evaluate on your task, not just price.
