In plain English
The open-model landscape can feel like a brand alphabet soup: Llama, Qwen, Mistral, DeepSeek, Gemma, Phi. Each is a family: a series of models released by one lab, sharing a common architecture lineage, naming pattern, and general philosophy. Just as a car manufacturer releases the same platform in multiple trim levels, an AI lab releases the same underlying approach in models ranging from a compact 1B you can run on a phone to a 400B mixture-of-experts model that needs a multi-GPU server.

A useful analogy: think of each lab as a wine region and each model family as a grape variety. The region (Meta, Alibaba, Mistral AI) sets the terroir — the training philosophy, data diet, and design priorities. The variety (Llama, Qwen, Mistral) gives you a consistent flavor profile across vintages. Individual releases — Llama 4 Maverick, Qwen3 235B, DeepSeek-R1 — are specific bottles from that variety. Once you learn a region's style, you can navigate new releases with much less research.
Why the family map matters for builders
When you search "best open LLM for coding" you will find a hundred benchmarks and blog posts pointing in different directions. The family map cuts through this noise. Each family has a consistent set of strengths, a characteristic deployment footprint, and a community ecosystem around it. Knowing that DeepSeek specialises in reasoning-via-reinforcement-learning, or that Qwen leads on multilingual coverage, lets you short-list the right family first and then pick the right size within it — instead of evaluating 40 random checkpoints.
The map also matters for long-term maintenance. Labs iterate quickly. Llama 3 was the gold standard in mid-2024; Llama 4 landed in April 2025 with a completely different MoE architecture. If you understand that Meta's philosophy is "highest-quality general-purpose open model," you know to track their releases and upgrade paths rather than being surprised by each new drop.
- Short-list faster: filter by family before evaluating individual checkpoints
- Predict strengths: each lab's design philosophy produces consistent tradeoffs
- Plan upgrades: new versions within a family are often drop-in compatible, but an architecture change such as Llama 4's move to MoE can alter hardware requirements
- Match the license: MIT vs Apache 2.0 vs custom makes a difference for commercial products
- Find the community: fine-tunes, adapters, and tooling cluster around families, not individual checkpoints
How the major families are structured
Six families dominate the open-model landscape as of mid-2026. Each is described by its originating lab, its architectural signature, and its deployment sweet spot. The diagram below shows the split by originating lab and the primary design orientation each family has optimised for.
Llama (Meta)
Meta's Llama family is the most widely deployed open-weight lineage. Llama 2 (2023) established the standard for open fine-tuning; Llama 3 (2024) raised quality dramatically; Llama 4 (April 2025) was the first in the family to use a mixture-of-experts (MoE) architecture and native multimodality. Llama 4 Scout has 17B active parameters across 16 experts (109B total), a 10 million-token context window, and fits on a single H100 GPU when quantised to Int4. Llama 4 Maverick also activates 17B parameters per token but draws them from 128 experts (400B total) for higher quality. The still-in-training Llama 4 Behemoth is reported at 288B active / ~2T total parameters. License: Llama 4 Community License, which is broadly permissive but not an OSI-approved open-source license.
Qwen (Alibaba Cloud)
Alibaba's Qwen family is the strongest multilingual open-weight lineage and the most-downloaded family on Hugging Face as of late 2025. Qwen3 (April 2025) introduced a hybrid thinking / non-thinking mode — the model can switch between fast responses and deep chain-of-thought on demand. The flagship Qwen3 235B-A22B MoE uses 22B active parameters. Qwen3.5 (February 2026) pushed to 397B total / 17B active parameters, added native video input, and expanded language coverage to 201 languages. Apache 2.0 license across the open-weight line.
DeepSeek (DeepSeek AI)
DeepSeek operates two parallel tracks. DeepSeek-V3 (December 2024) is a general-purpose 671B MoE (37B active) trained on 14.8 trillion tokens for a reported $5.6 million in GPU time for the final training run, a fraction of comparable closed-model costs. DeepSeek-R1 (January 2025) applies reinforcement learning to the V3 base model to produce o1-style chain-of-thought reasoning. R1 distilled variants ship at 1.5B, 7B, 8B, 14B, 32B, and 70B parameter sizes, making R1-style reasoning accessible on consumer hardware. Both tracks are MIT licensed, the most permissive license of any frontier-class open model family.
Mistral (Mistral AI)
French startup Mistral AI has consistently punched above its weight by releasing efficient models. Mistral 7B (2023) outperformed models twice its size; Mixtral 8x7B brought MoE into mainstream open-weight use. The 2025–2026 lineup includes Mistral Large 3 (675B total / 41B active MoE, Apache 2.0, December 2025) and Mistral Small 4 (March 2026), which merges reasoning, vision, and agentic coding into a single cheap model. Ministral 3B/8B/14B dense models cover the edge-device range. All 2025-onward open-weight releases are Apache 2.0.
Gemma (Google DeepMind)
Google's Gemma series distills techniques from Google's proprietary research into small, permissively licensed models. Gemma 3 (March 2025) shipped in 1B, 4B, 12B, and 27B sizes with 140-language support. Gemma 4 (April 2026) arrived in sizes from 2B to 31B and added multimodal input across images, video, and audio. The 31B dense variant placed third on the LMSYS Chatbot Arena text leaderboard. Apache 2.0 license.
Phi (Microsoft)
Microsoft's Phi series is defined by its thesis: high-quality synthetic training data enables smaller models to match larger ones on reasoning tasks. Phi-4 (late 2024) and its successors Phi-4-mini and Phi-4-multimodal deliver competitive math, science, and coding performance at 14B parameters and below. Phi-4-multimodal adds vision and audio inputs in a single small package optimised for on-device and edge deployments. MIT license.
Side-by-side: which family for which task
The table below summarises each family's primary strengths, the smallest usable size in its open-weight lineup, and the license. Use it as a first-pass filter before evaluating specific checkpoints.
| Family | Primary strengths | Smallest usable size | License |
|---|---|---|---|
| Llama 4 | General chat, multimodal, long context (10M tokens) | Scout (17B active / 109B total; 1× H100) | Llama 4 Community |
| Qwen3/3.5 | Multilingual (201 langs), hybrid thinking mode, agentic | 0.6B dense | Apache 2.0 |
| DeepSeek-R1 | Step-by-step reasoning, math, science | 1.5B distill | MIT |
| DeepSeek-V3 | General coding, long-context analysis | No tiny distill — 37B active minimum | MIT |
| Mistral | Efficient frontier, function calling, European compliance | Ministral 3B | Apache 2.0 |
| Gemma 4 | On-device multimodal, research reproducibility | Gemma 4 E2B (~2B) | Apache 2.0 |
| Phi-4 | Reasoning quality at 14B and below, math, science | Phi-4-mini | MIT |
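The first-pass filter described above can be sketched in a few lines of Python. This is a minimal illustration, not a recommendation engine: the rows are transcribed from the table, while the tag vocabulary (`"math"`, `"multimodal"`, and so on) and the `shortlist` helper are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Family:
    name: str
    strengths: frozenset
    smallest: str
    license: str


# Rows transcribed from the table; the strength tags are an illustrative
# simplification of the "Primary strengths" column, not an official taxonomy.
FAMILIES = [
    Family("Llama 4", frozenset({"chat", "multimodal", "long-context"}), "Scout", "Llama 4 Community"),
    Family("Qwen3/3.5", frozenset({"multilingual", "thinking-mode", "agentic"}), "0.6B dense", "Apache 2.0"),
    Family("DeepSeek-R1", frozenset({"reasoning", "math", "science"}), "1.5B distill", "MIT"),
    Family("DeepSeek-V3", frozenset({"coding", "long-context"}), "37B active", "MIT"),
    Family("Mistral", frozenset({"efficiency", "function-calling"}), "Ministral 3B", "Apache 2.0"),
    Family("Gemma 4", frozenset({"on-device", "multimodal"}), "Gemma 4 E2B", "Apache 2.0"),
    Family("Phi-4", frozenset({"reasoning", "math", "science"}), "Phi-4-mini", "MIT"),
]


def shortlist(needs, allowed_licenses=None):
    """Return family names whose strengths cover every tag in `needs`."""
    needs = frozenset(needs)
    return [
        f.name
        for f in FAMILIES
        if needs <= f.strengths
        and (allowed_licenses is None or f.license in allowed_licenses)
    ]


print(shortlist({"math"}, allowed_licenses={"MIT"}))
print(shortlist({"multimodal"}, allowed_licenses={"Apache 2.0"}))
```

Filtering on license at the same time as strengths keeps a family that cannot ship in your product from ever reaching the evaluation stage.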
Llama 4, Qwen3/3.5:
- Optimise for the widest range of tasks
- Large MoE architectures
- Heavy investment in multimodal
- Massive download / fine-tune ecosystem

DeepSeek R1/V3, Mistral Large/Small:
- Optimise for specific strengths (reasoning, efficiency)
- Frontier quality at lower compute cost
- Distilled small models for broad access
- Tighter model lines, fewer variants

Gemma 4, Phi-4:
- Distil research know-how into small models
- Strong on-device / edge profile
- Prioritise reproducibility and safety
- Benchmarks emphasise reasoning density
Key architecture patterns across families
As you look across these families you will notice the same architectural moves recurring. Understanding them helps you read model cards and benchmark results more accurately.
Mixture of Experts (MoE)
MoE splits the model's feedforward layers into multiple "expert" sub-networks. Each token is routed to a small subset of experts, so the 671B-parameter DeepSeek-V3 activates only ~37B parameters per token. This lets labs build very large total-parameter models without proportionally increasing per-token compute, although all weights must still be held in memory. Llama 4, Qwen3.5, DeepSeek V3/R1, and Mistral Large 3 all use MoE. Dense models (Gemma, Phi, Ministral) activate all parameters for every token: simpler to serve, but per-token compute grows in step with parameter count.
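To make the routing idea concrete, here is a toy top-k router in plain Python. It is a sketch under simplifying assumptions: the "experts" are scalar functions rather than feedforward networks, the router logits are hard-coded rather than learned, and production routers add details (shared experts, load balancing) that are omitted here. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    gates = route_token(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Four toy experts; only two of them are evaluated for this token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
logits = [0.1, 2.0, 1.5, -1.0]

gates = route_token(logits, k=2)
print("selected experts:", sorted(gates))
print("layer output:", moe_layer(3.0, experts, logits, k=2))
```

With four experts and `k=2`, half of the expert parameters sit idle for this token, which is the same effect that lets a very large MoE activate only a fraction of its weights per token.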
Reinforcement learning for reasoning
DeepSeek pioneered the use of Group Relative Policy Optimization (GRPO) to train R1 after a supervised cold-start phase. The model learns to self-verify, backtrack, and extend its reasoning chain, primarily by being rewarded for correct answers rather than by imitating examples of good reasoning. Qwen3's "thinking mode" and Mistral's reasoning variants follow a similar pattern. The visible symptom is extended <think> blocks before the final answer: the model writes out its intermediate reasoning before committing to an answer.
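The "group relative" part of GRPO can be illustrated with a short sketch. For each prompt, several answers are sampled and scored, and each answer's advantage is its reward normalised against the group's mean and standard deviation. The code below shows only that normalisation step, not the full training objective (clipping and the KL penalty are omitted). The 0/1 correctness reward, the use of the population standard deviation, and the `eps` constant are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against its own sampling group.

    Answers that beat the group average get a positive advantage,
    answers below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the baseline comes from the group itself, no separate value network is needed; and if every sampled answer receives the same reward, all advantages are zero and that prompt contributes no learning signal.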
Hybrid thinking / non-thinking modes
Qwen3 introduced a toggle between reasoning mode (slower, more thorough, uses extra tokens) and standard generation (fast, lower cost). This is controlled by a system prompt or API parameter. Mistral Small 4 merges a similar capability — you get reasoning when you ask for it, and direct generation otherwise. For production deployments this matters: reasoning mode can cost 3–10x more tokens than direct generation for the same query.
```python
import openai

client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Thinking mode ON: thorough, more tokens
response = client.chat.completions.create(
    model="qwen3:30b-a3b",
    messages=[
        {"role": "system", "content": "/think"},
        {"role": "user", "content": "Prove that sqrt(2) is irrational."},
    ],
)

# Thinking mode OFF: fast, fewer tokens
response = client.chat.completions.create(
    model="qwen3:30b-a3b",
    messages=[
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": "Summarise this paragraph in one sentence."},
    ],
)
```

Going deeper
Once you are comfortable with the family map, the next layer is understanding the fine-tune ecosystem that sits on top of each family. Because Llama has the largest base of users, it also has the most community fine-tunes: coding variants (continuing the Code Llama tradition with Llama 4 fine-tunes), roleplay models, domain-specific medical and legal variants, and instruction-tuned versions in dozens of languages. Qwen's large multilingual base makes it the starting point for non-English fine-tunes. DeepSeek's MIT license makes it particularly attractive for commercial fine-tuning without legal complexity.
Evaluating a new family release
When a new model drops, use this checklist before updating your stack:
- License check first. A quality improvement is irrelevant if the new license breaks your commercial use case.
- Check the active-parameter count, not just total parameters. A 235B MoE with 22B active needs less compute per token than a 70B dense model, but you still need memory for all 235B weights.
- Read the technical report or blog post. Look for training data size, context window, and any capability regressions vs the prior version.
- Check community benchmark reproducibility. Official numbers are often reported under best-case prompting; community evals on MMLU, HumanEval, and MATH are usually more representative.
- Test your own task with your own data. No public benchmark matches your production distribution exactly.
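The last checklist item can start as something very small. The sketch below scores two candidate models on your own prompt/answer pairs using exact match. The `generate` callables are stand-ins so the example runs without a server; in practice each would wrap a real API call. The function names and the exact-match metric are illustrative assumptions, and many tasks will need a looser scoring rule.

```python
def exact_match_rate(generate, examples):
    """Fraction of (prompt, expected) pairs where the normalised output matches."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        generate(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in examples
    )
    return hits / len(examples)


def compare_models(candidates, examples):
    """Score each named model callable and return (name, score) pairs, best first."""
    scores = {name: exact_match_rate(fn, examples) for name, fn in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Your own data: a handful of prompts with known-good answers.
examples = [("2+2=", "4"), ("Capital of France?", "Paris"), ("3*3=", "9")]

# Stand-in "models" backed by lookup tables; replace with real client calls.
current = lambda p: {"2+2=": "4", "Capital of France?": "paris"}.get(p, "?")
candidate = lambda p: {"2+2=": "4", "Capital of France?": "Paris", "3*3=": "9"}.get(p, "?")

ranking = compare_models({"current": current, "candidate": candidate}, examples)
print(ranking)
```

Even a few dozen examples drawn from production traffic will often tell you more about an upgrade than a public leaderboard delta.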
The second tier: Falcon, Gemma 3n, SmolLM, and beyond
The six families above are the safest bets for production use. A broader ecosystem exists for specific niches: Falcon (TII, UAE) remains relevant for Arabic and multilingual research. Gemma 3n is Google's variant optimised specifically for on-device execution on phones and tablets. SmolLM3 (Hugging Face, 3B) leads the sub-4B category for edge inference. Command R (Cohere) is tuned for RAG and enterprise retrieval. These families matter when a major family's smallest variant is still too large for your deployment target.
Tracking the landscape
The Open LLM Leaderboard on Hugging Face aggregates benchmarks across families and is updated as new models release. The LMSYS Chatbot Arena adds human preference rankings. Neither is a perfect signal — leaderboard overfitting is real, and human preference scores can reward fluency over accuracy — but together they give you a reasonable weekly snapshot of which families are advancing fastest.
FAQ
What is the difference between Llama 4 Scout and Llama 4 Maverick?
Both have 17B active parameters, but Scout uses 16 experts (109B total) while Maverick uses 128 experts (400B total). Scout fits on a single H100 when quantised to Int4, offers a 10M-token context window, and is faster to serve. Maverick is higher quality (comparable to GPT-4o on many benchmarks) but requires more GPU memory. Scout is the default local deployment choice; Maverick is for cloud-hosted inference where quality matters most.
Is DeepSeek-R1 the same as DeepSeek-V3?
No. DeepSeek-V3 is a general-purpose conversational and coding model. DeepSeek-R1 is built on V3's architecture but adds a reinforcement-learning training stage that teaches explicit chain-of-thought reasoning. R1 is slower and uses more tokens per answer, but it dramatically outperforms V3 on math, science, and multi-step reasoning. For simple tasks, V3 is faster and cheaper.
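Because R1-style models emit their reasoning before the answer, an application usually wants to separate the two, for example to log the reasoning but show only the answer. The sketch below assumes the raw completion text contains literal `<think>...</think>` tags; some serving stacks return the reasoning in a separate response field instead, in which case this step is unnecessary. `split_reasoning` is a hypothetical helper name.

```python
import re

# Non-greedy match so multiple <think> blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from raw model output containing <think> blocks."""
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>2 is prime. 9 = 3*3, so not prime.</think>\nOnly 2 is prime."
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Counting the tokens in the reasoning portion separately is also a simple way to see how much of your bill the thinking stage accounts for.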
Can I use DeepSeek models commercially?
Yes. Both DeepSeek-V3 and DeepSeek-R1 are released under the MIT license, which is one of the most permissive open-source licenses available. You can use them in commercial products, fine-tune them, and redistribute derived works, with minimal restrictions. Always verify the current license on the official Hugging Face model card before shipping.
Which open model family is best for non-English languages?
Qwen is the clear leader for multilingual tasks. Qwen3.5 supports 201 languages and dialects and was trained on a much more diverse multilingual corpus than Llama or Mistral. For European languages specifically, Mistral models also perform well due to the lab's European focus and training data. Gemma 3 supports 140 languages.
What does 'mixture of experts' mean in practice when running a model locally?
In a MoE model, only a fraction of the weights are used per token, so a 235B-parameter Qwen3 might activate only 22B parameters per forward pass. In practice this means the model is faster per token than a 235B dense model, but you still need enough VRAM (or RAM for CPU inference) to hold all 235B parameters. Tools like llama.cpp and Ollama handle the expert routing for you, but expect higher memory requirements than the active-parameter count suggests.
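A back-of-the-envelope calculation shows why memory tracks total parameters while speed tracks active parameters. The sketch below uses the parameter counts quoted in this article and two rough rules of thumb that are assumptions here: weight memory is parameters times bits per parameter, and a forward pass costs about 2 FLOPs per active parameter per token. It ignores the KV cache, activations, and runtime overhead, so treat the numbers as lower bounds.

```python
def weight_memory_gb(total_params_b, bits_per_param=16):
    """Memory for the weights alone, in GB (1e9 bytes), given params in billions."""
    return total_params_b * bits_per_param / 8


def gflops_per_token(active_params_b):
    """Rough forward-pass cost: about 2 FLOPs per active parameter per token."""
    return 2 * active_params_b


# (total, active) parameters in billions, as quoted in this article.
MODELS = {
    "Qwen3 235B-A22B": (235, 22),
    "DeepSeek-V3": (671, 37),
    "Llama 4 Scout": (109, 17),
    "70B dense": (70, 70),
}

for name, (total, active) in MODELS.items():
    print(
        f"{name:16s} 4-bit weights ~ {weight_memory_gb(total, 4):6.1f} GB, "
        f"~ {gflops_per_token(active):4d} GFLOPs/token"
    )
```

On these estimates the 235B MoE needs more than three times the weight memory of the 70B dense model at the same precision, yet does less than a third of the arithmetic per token.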
How do I choose between Qwen3 thinking mode on and off?
Use thinking mode on for tasks that benefit from multi-step reasoning: math proofs, coding challenges, logical puzzles, and analysis tasks. Use thinking mode off for tasks that need a fast, direct answer: summarisation, translation, classification, and conversational responses. Thinking mode generates substantially more tokens (and costs more if you are using an API), so defaulting it to on is wasteful for simple queries.
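That rule of thumb can be encoded as a small dispatcher. This is a sketch, assuming you already classify incoming requests into coarse task types; the category names, the `REASONING_TASKS` set, and the helper functions are hypothetical, and the token estimate simply applies the 3–10x range quoted earlier in this article.

```python
# Task categories that usually benefit from multi-step reasoning (assumed labels).
REASONING_TASKS = {"math", "coding", "logic", "analysis"}


def mode_switch(task_type):
    """Return the Qwen3 soft switch for a task category."""
    return "/think" if task_type in REASONING_TASKS else "/no_think"


def build_messages(task_type, user_text):
    """Build a chat message list with the mode switch in the system prompt."""
    return [
        {"role": "system", "content": mode_switch(task_type)},
        {"role": "user", "content": user_text},
    ]


def estimated_output_tokens(direct_tokens, task_type, multiplier=(3, 10)):
    """Return a (low, high) output-token estimate for the chosen mode."""
    if mode_switch(task_type) == "/no_think":
        return (direct_tokens, direct_tokens)
    low, high = multiplier
    return (direct_tokens * low, direct_tokens * high)


print(build_messages("translation", "Translate 'hello' into French."))
print(estimated_output_tokens(200, "math"))
```

The resulting message list can be passed as `messages` to the same chat-completions call shown earlier, so the routing decision stays in one place instead of being scattered across call sites.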