Gemma 2

Name: Gemma 2
Author: Google

Google's 2024 open-weights LLM family — 2B, 9B, and 27B models that punch above their size.

Overview

Gemma 2 is Google's second-generation family of open-weights large language models, launched on June 27, 2024 in 9B and 27B parameter sizes, with a smaller 2B (2.6B) variant following on July 31, 2024. Built by Google DeepMind from the same research and technology used to create the Gemini models, Gemma 2 is text-in, text-out, and is distributed as downloadable weights on Hugging Face, Kaggle, and Vertex AI Model Garden under the custom Gemma license — which permits commercial use, fine-tuning, and redistribution.

Each size ships in two flavors: a pretrained base model (e.g. gemma-2-27b) and an instruction-tuned chat model (e.g. gemma-2-27b-it). Architecturally, Gemma 2 is a decoder-only transformer that interleaves local sliding-window attention with full global attention, uses grouped-query attention for faster inference, and applies logit soft-capping plus RMSNorm pre/post-normalization for training stability. The 2B and 9B models were trained via knowledge distillation from the 27B model, which is why the smaller Gemma 2 models were unusually strong for their size at launch.

At release, Google positioned the 27B model as offering performance competitive with proprietary models and open models more than twice its size, while the 9B model was pitched as outperforming Llama 3 8B and other open models in its class. Gemma 2 has since been succeeded by Gemma 3 (2025) and Gemma 4 (2026), which add multimodality, larger context windows, and multilingual support, but the Gemma 2 weights remain published and widely used for on-device and self-hosted deployment because of their small footprint and the fact that the 9B and 2B models run on a single consumer GPU or even a laptop.

Released	2024-06-27
License	Gemma (Gemma Terms of Use) — a custom, commercially permissive license that allows redistribution, fine-tuning, commercial use, and derivative works subject to a prohibited-use policy. Not OSI-approved (not Apache 2.0).
Weights	Open weights
Parameters	Three sizes: 2B (2.6B), 9B, and 27B parameters
Context	8,192 tokens
Max output	Not separately specified; output shares the 8,192-token context window
Architecture	Decoder-only transformer. Alternates local sliding-window attention (4,096-token window) with full global attention across the 8,192-token context. Uses grouped-query attention (GQA), RoPE positional embeddings, RMSNorm in both pre- and post-normalization, GeGLU activations, and logit soft-capping (attention cap 50.0, final-layer cap 30.0). The 2B and 9B models were trained with knowledge distillation from the larger 27B teacher; the 27B was trained from scratch. Training data: 27B on 13T tokens, 9B on 8T tokens, 2B on 2T tokens of primarily English web text, code, and math.
Knowledge cutoff	Not officially published by Google for Gemma 2
Modalities	text
Status	Available (open weights, downloadable). Superseded as Google's current generation by Gemma 3 (March 2025) and Gemma 4 (March 2026), but the weights remain published and usable.

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	Free to download and self-host / 1M tokens
Output	Free to download and self-host / 1M tokens

Gemma 2 is open weights — Google charges nothing for the model itself; cost depends on your own compute or a third-party inference host. Google did not publish a first-party per-token price for Gemma 2. Various inference providers (e.g. Groq, Together AI) listed hosted per-token rates over time, but those vary by provider and change frequently.

Pricing source ↗

Strengths

Strong quality-per-parameter: the 9B and 2B models were trained with knowledge distillation from the 27B, making them punch above their size class
Truly open weights — downloadable and self-hostable with no API dependency, under a commercially permissive license
Small footprint: the 9B and 2B variants run on a single consumer GPU (or a high-end laptop in quantized form), and the 27B fits on a single A100 80GB or H100
Efficient architecture (GQA + sliding-window attention) gives fast inference relative to dense models its size
Broad ecosystem support out of the box — Hugging Face Transformers, llama.cpp/Ollama, vLLM, TensorRT-LLM, Keras, and Vertex AI

Best for

On-device and self-hosted text generation, summarization, and extraction where data must stay local
Fine-tuning a small, license-friendly base model for a domain-specific task
Cost-controlled inference at scale by running open weights on your own GPUs instead of paying per-token API fees
Research and experimentation that requires full access to model weights and architecture
Lightweight chat assistants and coding helpers on modest hardware (2B/9B)

How to access

Provider	Model ID
Hugging Face ↗	`google/gemma-2-27b-it`
Hugging Face ↗	`google/gemma-2-9b-it`
Google AI for Developers ↗	`gemma-2`

Gemma (open weights) — every version

The full lineage of the Gemma (open weights) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
Gemma 4current	2026-04-02	—	Apache-2.0
Gemma 3n	2025-06-26	—	Open weights
Gemma 3	2025-03-12	—	Open weights
Gemma 2	2024-06-27	—	Open weights
Gemma 1	2024-02-21	—	Open weights

FAQ

Is Gemma 2 open source?

Gemma 2 is open weights, but not open source in the strict OSI sense. Google publishes the model weights for free download and allows commercial use, fine-tuning, and redistribution under the custom Gemma license (Gemma Terms of Use). That license is more permissive than many, but it is not Apache 2.0 and carries a prohibited-use policy, so it is not an OSI-approved open-source license.

What sizes does Gemma 2 come in?

Three parameter sizes: 2B (2.6B), 9B, and 27B. The 9B and 27B launched on June 27, 2024, and the 2B variant followed on July 31, 2024. Each size has a pretrained base model and an instruction-tuned (-it) chat model.

What is Gemma 2's context window?

Gemma 2 has an 8,192-token context window. The architecture alternates local sliding-window attention (a 4,096-token window) with full global attention across the full 8,192 tokens, which keeps inference efficient.

Is Gemma 2 still worth using in 2026?

Google has released Gemma 3 (2025) and Gemma 4 (2026), which add multimodality, larger context, and multilingual support, so Gemma 2 is no longer the latest generation. But its weights remain published and it is still a practical choice for lightweight, self-hosted, English-only text tasks — especially the 2B and 9B models, which run on a single consumer GPU or laptop.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// Gemma (open weights) — every version

// FAQ