Overview
Gemma 2 is Google's second-generation family of open-weights large language models, launched on June 27, 2024 in 9B and 27B parameter sizes, with a smaller 2B (2.6B) variant following on July 31, 2024. Built by Google DeepMind from the same research and technology used to create the Gemini models, Gemma 2 is text-in, text-out, and is distributed as downloadable weights on Hugging Face, Kaggle, and Vertex AI Model Garden under the custom Gemma license — which permits commercial use, fine-tuning, and redistribution.
Each size ships in two flavors: a pretrained base model (e.g. gemma-2-27b) and an instruction-tuned chat model (e.g. gemma-2-27b-it). Architecturally, Gemma 2 is a decoder-only transformer that interleaves local sliding-window attention with full global attention, uses grouped-query attention for faster inference, and applies logit soft-capping plus RMSNorm pre/post-normalization for training stability. The 2B and 9B models were trained via knowledge distillation from the 27B model, which is why the smaller Gemma 2 models were unusually strong for their size at launch.
At release, Google positioned the 27B model as offering performance competitive with proprietary models and open models more than twice its size, while the 9B model was pitched as outperforming Llama 3 8B and other open models in its class. Gemma 2 has since been succeeded by Gemma 3 (2025) and Gemma 4 (2026), which add multimodality, larger context windows, and multilingual support, but the Gemma 2 weights remain published and widely used for on-device and self-hosted deployment because of their small footprint and the fact that the 9B and 2B models run on a single consumer GPU or even a laptop.
| Released | 2024-06-27 |
|---|---|
| License | Gemma (Gemma Terms of Use) — a custom, commercially permissive license that allows redistribution, fine-tuning, commercial use, and derivative works subject to a prohibited-use policy. Not OSI-approved (not Apache 2.0). |
| Weights | Open weights |
| Parameters | Three sizes: 2B (2.6B), 9B, and 27B parameters |
| Context | 8,192 tokens |
| Max output | Not separately specified; output shares the 8,192-token context window |
| Architecture | Decoder-only transformer. Alternates local sliding-window attention (4,096-token window) with full global attention across the 8,192-token context. Uses grouped-query attention (GQA), RoPE positional embeddings, RMSNorm in both pre- and post-normalization, GeGLU activations, and logit soft-capping (attention cap 50.0, final-layer cap 30.0). The 2B and 9B models were trained with knowledge distillation from the larger 27B teacher; the 27B was trained from scratch. Training data: 27B on 13T tokens, 9B on 8T tokens, 2B on 2T tokens of primarily English web text, code, and math. |
| Knowledge cutoff | Not officially published by Google for Gemma 2 |
| Modalities | text |
| Status | Available (open weights, downloadable). Superseded as Google's current generation by Gemma 3 (March 2025) and Gemma 4 (March 2026), but the weights remain published and usable. |
Benchmarks
- MMLU (5-shot)75.2%
- GSM8K (5-shot, maj@1)74%
- HumanEval (pass@1)51.8%
- MATH (4-shot)42.3%
- HellaSwag (10-shot)86.4%
- ARC-c (25-shot)71.4%
- MMLU (5-shot, 9B)71.3%
- GSM8K (5-shot, maj@1, 9B)68.6%
- HumanEval (pass@1, 9B)40.2%
- MATH (4-shot, 9B)36.6%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | Free to download and self-host / 1M tokens |
|---|---|
| Output | Free to download and self-host / 1M tokens |
Gemma 2 is open weights — Google charges nothing for the model itself; cost depends on your own compute or a third-party inference host. Google did not publish a first-party per-token price for Gemma 2. Various inference providers (e.g. Groq, Together AI) listed hosted per-token rates over time, but those vary by provider and change frequently.
Strengths
- Strong quality-per-parameter: the 9B and 2B models were trained with knowledge distillation from the 27B, making them punch above their size class
- Truly open weights — downloadable and self-hostable with no API dependency, under a commercially permissive license
- Small footprint: the 9B and 2B variants run on a single consumer GPU (or a high-end laptop in quantized form), and the 27B fits on a single A100 80GB or H100
- Efficient architecture (GQA + sliding-window attention) gives fast inference relative to dense models its size
- Broad ecosystem support out of the box — Hugging Face Transformers, llama.cpp/Ollama, vLLM, TensorRT-LLM, Keras, and Vertex AI
Best for
- On-device and self-hosted text generation, summarization, and extraction where data must stay local
- Fine-tuning a small, license-friendly base model for a domain-specific task
- Cost-controlled inference at scale by running open weights on your own GPUs instead of paying per-token API fees
- Research and experimentation that requires full access to model weights and architecture
- Lightweight chat assistants and coding helpers on modest hardware (2B/9B)
How to access
| Provider | Model ID |
|---|---|
| Hugging Face ↗ | google/gemma-2-27b-it |
| Hugging Face ↗ | google/gemma-2-9b-it |
| Google AI for Developers ↗ | gemma-2 |
Gemma (open weights) — every version
The full lineage of the Gemma (open weights) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
FAQ
Is Gemma 2 open source?
Gemma 2 is open weights, but not open source in the strict OSI sense. Google publishes the model weights for free download and allows commercial use, fine-tuning, and redistribution under the custom Gemma license (Gemma Terms of Use). That license is more permissive than many, but it is not Apache 2.0 and carries a prohibited-use policy, so it is not an OSI-approved open-source license.
What sizes does Gemma 2 come in?
Three parameter sizes: 2B (2.6B), 9B, and 27B. The 9B and 27B launched on June 27, 2024, and the 2B variant followed on July 31, 2024. Each size has a pretrained base model and an instruction-tuned (-it) chat model.
What is Gemma 2's context window?
Gemma 2 has an 8,192-token context window. The architecture alternates local sliding-window attention (a 4,096-token window) with full global attention across the full 8,192 tokens, which keeps inference efficient.
Is Gemma 2 still worth using in 2026?
Google has released Gemma 3 (2025) and Gemma 4 (2026), which add multimodality, larger context, and multilingual support, so Gemma 2 is no longer the latest generation. But its weights remain published and it is still a practical choice for lightweight, self-hosted, English-only text tasks — especially the 2B and 9B models, which run on a single consumer GPU or laptop.