AI/TLDR

Best Open-Source LLMs: Llama, Mistral, Gemma Compared

You will understand what the major open-source LLM families are, how they compare on benchmarks and licensing, and which one to pick for your task.

Beginner · 9 min read · Updated 2026-06-12

In plain English

An open-source LLM (or open-weight model) is an AI language model whose trained weights are published for anyone to download and run. You don't have to call a paid API or trust a cloud provider — you can run it on your own hardware. That freedom has sparked a race among research labs, companies, and independent teams to publish increasingly capable models.

A handful of families now dominate this space. Llama (Meta) is the most widely used foundation for open models. Mistral (Paris-based startup) bet early on small, efficient models. Gemma (Google DeepMind) packages research know-how into a permissively licensed series. Phi (Microsoft) pursues reasoning quality in very small sizes. DeepSeek (Chinese lab) shocked the AI world with frontier-grade quality at open-weight prices. Each family has its own philosophy, licensing terms, and sweet spot.

Why open-source models matter

For most of AI's history, the best models were locked behind a few proprietary APIs. Open models change the economics: you pay for compute (your own GPU, or a rented VM) instead of per token, and all of your data stays on your own infrastructure, with no calls to external servers.

That shift matters for several reasons. Privacy: hospitals, law firms, and governments can run AI on-premise without sending sensitive data over the internet. Cost: at scale, self-hosted inference can be 10–100× cheaper than per-token API pricing. Customisation: you can fine-tune an open model on your own data in ways that closed APIs simply don't allow. Speed of iteration: the open ecosystem moves faster — within days of a new technique appearing in a paper, community implementations are already available.

The quality gap with closed models has narrowed dramatically. By mid-2025 the best open models matched GPT-4-class performance on many benchmarks, and by 2026 several open models exceed it on coding and reasoning tasks.

  • Privacy / compliance: data never leaves your servers
  • Cost control: flat compute cost instead of per-token billing
  • Fine-tuning freedom: adapt the model to your own domain or style
  • No rate limits: serve as many requests as your hardware allows
  • Vendor independence: you own the weights; the model can't be taken away

How the major families compare

Open-weight families differ in three key dimensions: architecture (dense transformer vs Mixture-of-Experts), size range (from sub-4B for mobile to 671B for servers), and license (MIT / Apache 2.0 for true commercial freedom, or custom licenses with restrictions). Here is how the five main families sit against each other.

Llama

Meta's Llama series is the most widely adopted open model family. Llama 3 (2024) arrived in 8B and 70B dense sizes; Llama 3.1 added a massive 405B variant and extended the context window to 128K tokens. Llama 3.3 70B is still a practical workhorse: it roughly matches GPT-4 (2023) on MMLU at ~86% and fits on two consumer GPUs with 4-bit quantisation (Q4_K_M, ~40 GB RAM). Llama 4 (2025) introduced a Mixture-of-Experts architecture: Scout has 109B total parameters but only 17B active per token, making it far cheaper to serve than its parameter count suggests.

Mistral

Mistral is a French startup that prioritises efficiency. The original Mistral 7B (2023) punched well above its weight — MMLU ~64%, fast inference, and tiny enough to run on a single mid-range GPU. Mixtral 8x7B followed as a Mixture-of-Experts model (8 experts, 2 active per token) giving ~70% MMLU at roughly the speed of a 12B dense model. Most Mistral models ship under Apache 2.0, making them genuinely business-friendly without needing a legal review.

Gemma

Google DeepMind's Gemma series packages the same research insights as Gemini into small, clean models. Gemma 2 (9B and 27B) outperformed Llama 3 in its size class on several benchmarks when it launched. Gemma 3 adds Quantization-Aware Training (QAT): the 27B model runs on a single RTX 3090 GPU while retaining most of its full-precision quality. Gemma 4 extends to multimodal input with a 256K context window. Licensing varies by release: Gemma 1 through Gemma 3 shipped under Google's custom Gemma Terms of Use rather than Apache 2.0, so check the licence attached to the exact version you deploy.

Phi

Microsoft's Phi series is built around one question: how much reasoning can you pack into 3–4 billion parameters? Phi-3 Mini (3.8B) ran comfortably on an iPhone. Phi-4 (14B) beat much larger models on math and coding benchmarks. Phi-4 Mini (3.8B) is the current go-to for edge devices — 4 GB of RAM is enough, and it runs natively on-device without any cloud call. Licence is MIT.

DeepSeek

DeepSeek, a Chinese AI lab, made headlines in early 2025 when DeepSeek-V3 posted MMLU of 88.5% and HumanEval coding scores that rivalled Claude 3.5 Sonnet, at a fraction of the training cost. DeepSeek-R1 went further: trained largely with reinforcement learning, it reached frontier-level reasoning, scoring on par with OpenAI's o1 on AIME (a competition-maths benchmark). Both models are released under MIT, and they use a MoE architecture (671B total, 37B active) that keeps serving costs manageable.

Which model should you pick?

The right choice depends on your hardware budget, use case, and licence requirements. The table below maps common situations to a recommended starting point.

| Situation | Recommended model | Why |
|---|---|---|
| Only have 8 GB RAM (laptop) | Phi-4 Mini 3.8B or Gemma 3 4B | Fits in memory; good reasoning per GB |
| 16 GB GPU, general chat / coding | Llama 3.1 8B or Mistral 7B Q4 | Fast, well-documented, huge community |
| 40 GB RAM, best local quality | Llama 3.3 70B Q4_K_M | Near-GPT-4 quality on consumer hardware |
| Need Apache 2.0 for a product | Mistral or Gemma | No usage-cap clauses; check the licence of the exact release |
| Math, reasoning, research tasks | DeepSeek R1 (distilled variants) | Best open-source reasoning model available |
| On-device / mobile app | Phi-4 Mini or Gemma 3 1B | Designed for constrained hardware and power budgets |
| Long documents (>100K tokens) | Llama 4 Scout | 10M-token context, MoE efficiency |

Benchmark numbers are a useful guide but not the final word. A model that scores 85% on MMLU might still struggle on your specific domain. The best practice is to test two or three candidates on a handful of real prompts from your own use case before committing.
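The side-by-side test described above can be sketched in a few lines. This is a minimal illustration, not a real evaluation framework: `score_candidates` and the `Generate` callable are hypothetical names, each candidate is assumed to be wrapped as a function that takes a prompt and returns text, and keyword matching is a deliberately crude stand-in for reading the answers yourself.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a candidate is any function mapping a prompt to a reply.
Generate = Callable[[str], str]
# A test case is a prompt plus keywords the reply is expected to contain.
Case = Tuple[str, List[str]]


def score_candidates(
    candidates: Dict[str, Generate], cases: List[Case]
) -> Dict[str, float]:
    """Return, for each candidate, the fraction of cases whose keywords all appear."""
    if not cases:
        raise ValueError("need at least one test case")
    results: Dict[str, float] = {}
    for name, generate in candidates.items():
        passed = 0
        for prompt, required in cases:
            reply = generate(prompt).lower()
            if all(word.lower() in reply for word in required):
                passed += 1
        results[name] = passed / len(cases)
    return results


if __name__ == "__main__":
    # Stub "models" so the sketch runs without any model installed.
    stubs = {
        "model_a": lambda prompt: "Paris is the capital of France.",
        "model_b": lambda prompt: "I am not sure.",
    }
    cases = [("What is the capital of France?", ["Paris"])]
    print(score_candidates(stubs, cases))
```

In practice you would replace the stubs with calls to your locally running models and use prompts taken from your real workload.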

Model sizes and hardware requirements

Model size, measured in billions of parameters, is the biggest driver of memory requirements. 4-bit quantisation (for example the GGUF Q4_K_M format) reduces memory by roughly 75% relative to 16-bit weights with only a small quality loss, making models that would otherwise need a data-centre GPU runnable on consumer hardware.
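The arithmetic behind that 75% figure is simple enough to check yourself. The sketch below is a back-of-the-envelope estimate with hypothetical helper names; it counts only the weights at a flat number of bits each, so real files (which store extra scaling data) and runtime overhead such as the context cache will push the totals somewhat higher.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def savings_vs_16bit(bits_per_weight: float) -> float:
    """Fraction of weight memory saved compared with 16-bit storage."""
    return 1 - bits_per_weight / 16


if __name__ == "__main__":
    for label, size in [("8B", 8), ("70B", 70)]:
        full = weight_memory_gb(size, 16)
        quant = weight_memory_gb(size, 4)
        print(f"{label}: {full:.0f} GB at 16-bit, {quant:.0f} GB at 4-bit")
    print(f"saving at 4-bit: {savings_vs_16bit(4):.0%}")
```

A 70B model drops from about 140 GB to about 35 GB of raw weights, which is why it lands near the ~40 GB mark once overhead is included.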

Mixture-of-Experts (MoE) models like DeepSeek V3 and Llama 4 Scout have a large total parameter count but only activate a small subset per token. Their compute cost per token is therefore close to that of a much smaller dense model, but you still need enough RAM to load all the weights.
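To make the memory-versus-compute split concrete, here is a small sketch using the parameter counts quoted in this article. `ModelShape` is a hypothetical helper, and the flat 4 bits per weight is an assumption that ignores quantisation overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    name: str
    total_params_b: float   # billions of weights that must be loaded
    active_params_b: float  # billions of weights used for each token

    def weight_memory_gb(self, bits_per_weight: float = 4.0) -> float:
        """Memory scales with TOTAL parameters (billions x bytes per weight)."""
        return self.total_params_b * bits_per_weight / 8

    def active_fraction(self) -> float:
        """Per-token compute scales with the ACTIVE share of the weights."""
        return self.active_params_b / self.total_params_b


if __name__ == "__main__":
    shapes = [
        ModelShape("Llama 3.3 70B (dense)", 70, 70),
        ModelShape("Llama 4 Scout (MoE)", 109, 17),
        ModelShape("DeepSeek V3 (MoE)", 671, 37),
    ]
    for s in shapes:
        print(
            f"{s.name}: ~{s.weight_memory_gb():.0f} GB at 4-bit, "
            f"{s.active_fraction():.0%} of weights active per token"
        )
```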

Going deeper

Once you have a model running locally, the next step is understanding why certain models excel at certain tasks. A few concepts worth exploring:

  • Quantisation formats: Q4_K_M is the practical default, but Q8_0 retains more quality at roughly 2× the RAM. "What is quantisation" covers this in depth.
  • Fine-tuning: open weights mean you can adapt any of these models to your own domain with a few hundred examples and a single GPU. "What is fine-tuning" and "LoRA" explain how.
  • Inference servers: for production traffic you'll want a server like vLLM, llama.cpp server, or Ollama's API mode. "What is an inference server" compares options.
  • Benchmarks in context: MMLU, HumanEval, and GPQA measure specific things. "What are LLM benchmarks" explains what each benchmark actually tests and what it misses.
  • Leaderboards: the Open LLM Leaderboard on Hugging Face and LMSYS Chatbot Arena give continuously updated rankings as new models are released.
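As a taste of what the inference-server option looks like, the sketch below sends one prompt to a locally running Ollama server over its HTTP API using only the Python standard library. It assumes Ollama is listening on its default port (11434) and that you have already pulled a model; the helper names and the model tag in the usage note are examples, not requirements.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, a call such as `generate("llama3.1:8b", "Explain quantisation in one sentence.")` returns the reply as a plain string. Setting `stream` to false keeps the example short; a chat interface would normally stream tokens as they arrive.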

The open-model ecosystem moves quickly — a "best" ranking from six months ago can be outdated today. Subscribe to model release feeds, follow the Hugging Face blog, and keep an eye on the Open LLM Leaderboard for the latest rankings.

FAQ

What is the best open-source LLM to run locally right now?

It depends on your hardware. For most developers with a 16 GB GPU, Llama 3.1 8B or Mistral 7B (4-bit quantised) are the best starting points: fast, high quality, and widely supported. If you have 40 GB of RAM or two GPUs, Llama 3.3 70B at Q4_K_M quality rivals GPT-4 (2023). For edge devices with 4–8 GB, Phi-4 Mini or Gemma 3 4B punch well above their size.

Is Llama truly open source? Can I use it commercially?

Llama uses a Meta custom licence, not a standard open-source licence like MIT or Apache 2.0. Commercial use is broadly allowed for most businesses, but if your product reaches 700 million monthly active users you must obtain a separate licence from Meta. For early-stage products this restriction is irrelevant, but for enterprise deployments, check with your legal team. Phi and DeepSeek R1 are MIT-licensed and most Mistral models are Apache 2.0, both genuinely permissive; Gemma releases up to Gemma 3 use Google's own Gemma Terms of Use, so confirm the licence for the exact version you plan to ship.

What is the difference between a 7B and a 70B model in practice?

The number refers to the count of trainable parameters, which roughly correlates with quality and memory. A 7B model at 4-bit quantisation needs about 6–8 GB of RAM and runs at 30–80 tokens per second on a consumer GPU. A 70B model needs ~40 GB and runs at 10–20 tokens per second on two GPUs. The 70B model will generally give better answers on complex reasoning, nuanced writing, and rare knowledge — but for most everyday chat tasks, the 7B is surprisingly good.

How does DeepSeek compare to Llama and Mistral?

DeepSeek V3 and R1 are among the strongest open models available, with MMLU scores of roughly 88–91%, in the same range as GPT-4o. The key advantage is reasoning: DeepSeek R1 used reinforcement learning to develop step-by-step thinking skills that rival OpenAI's o1. Both are MIT-licensed. The main consideration is that DeepSeek is a large MoE model (671B total parameters, 37B active): even at 4-bit quantisation the full weights take roughly 335 GB (671B × 0.5 bytes), so you need a multi-GPU server or a machine with several hundred GB of memory to run the full version, though smaller distilled variants work on consumer hardware.

What does Mixture-of-Experts (MoE) mean for open models?

An MoE model has many more total parameters than it uses for any single token. For example, Mixtral 8x7B has 46B total parameters but only activates two of its eight expert groups per token — giving the compute cost of a ~12B dense model with the quality of a much larger one. Llama 4, DeepSeek V3, and Mistral Large all use MoE. The catch: you still need RAM to hold all the weights in memory, even the inactive ones.
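The routing idea can be shown with a toy example. This is a simplified sketch, not any model's actual implementation: in a real MoE layer the gate scores come from a small learned layer applied to each token and the experts are feed-forward networks, whereas here the scores are passed in directly and the experts are plain functions. The function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def top_k_route(gate_logits: Sequence[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(gate_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(
    x: float,
    experts: Sequence[Callable[[float], float]],
    gate_logits: Sequence[float],
    k: int = 2,
) -> float:
    """Run only the selected experts and blend their outputs by routing weight."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(gate_logits, k))


if __name__ == "__main__":
    # Eight toy experts; expert i simply multiplies its input by i.
    experts = [lambda x, scale=i: x * scale for i in range(8)]
    logits = [0.1, 2.0, -1.0, 3.0, 0.5, 0.0, -0.3, 1.0]
    print(top_k_route(logits))          # two (expert index, weight) pairs
    print(moe_layer(1.0, experts, logits))
```

Only two of the eight expert functions are ever called for this input, which is where the compute saving comes from, yet all eight must exist in memory because a different input may route elsewhere.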

Can I fine-tune these open-source models on my own data?

Yes — that is one of the biggest advantages of open weights. LoRA (Low-Rank Adaptation) lets you fine-tune a 7B model on a single consumer GPU in a few hours using a few hundred to a few thousand domain-specific examples. Llama, Mistral, Gemma, and Phi all have extensive LoRA fine-tuning guides and Hugging Face integration. DeepSeek fine-tuning is supported but less documented in English.
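A quick calculation shows why LoRA is so cheap. Instead of updating a full weight matrix, LoRA trains two thin matrices whose product has the same shape, so the number of trainable values depends on the chosen rank. The sketch below uses an assumed 4096 × 4096 projection and rank 8 purely as illustrative numbers; the function names are hypothetical.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable values when fine-tuning the whole weight matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable values for a LoRA adapter: one (rank x d_in) and one (d_out x rank) matrix."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in, d_out, rank = 4096, 4096, 8  # illustrative sizes, not taken from a specific model
    full = full_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)
    print(f"full matrix: {full:,} values")
    print(f"LoRA rank {rank}: {lora:,} values ({lora / full:.2%} of full)")
```

Training well under 1% of the values per adapted matrix is what brings the memory for gradients and optimiser state within reach of a single consumer GPU.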

Further reading