Overview
Llama 4 Scout is the smaller, more deployable member of Meta's Llama 4 "herd," announced on April 5, 2025 alongside Llama 4 Maverick. It is a natively multimodal, mixture-of-experts (MoE) model with 17 billion active parameters drawn from 109 billion total across 16 experts. Like the rest of the line, it uses early fusion so that text and image tokens are processed in a single backbone rather than bolted together after the fact.
Scout's headline feature is its context window: Meta reports support for up to 10 million tokens, enabled by an interleaved-attention (iRoPE) architecture that generalizes to lengths far beyond training. It accepts multilingual text and image input and produces multilingual text and code output across 12 supported languages, with a knowledge cutoff of August 2024. Despite the large total parameter count, the sparse MoE design keeps only 17B parameters active per token, and Meta says the Int4-quantized model fits on a single NVIDIA H100 GPU.
The weights are open under the Llama 4 Community License Agreement — a custom community license with use restrictions (including limits in the EU) rather than a standard OSI-approved open-source license. Scout can be downloaded from Hugging Face and is hosted by third-party API providers such as Together AI and OpenRouter, making it a practical choice for self-hosted, long-context, multimodal workloads.
| Released | 2025-04-05 |
|---|---|
| License | Llama 4 Community License Agreement |
| Weights | Open weights |
| Parameters | 109B total · 17B active (16 experts) |
| Context | 10M |
| Architecture | Mixture-of-Experts (early-fusion multimodal, iRoPE) |
| Knowledge cutoff | 2024-08 |
| Modalities | Text, Vision |
| Status | Available |
Benchmarks
- MMLU Pro74.3%
- GPQA Diamond57.2%
- MMMU73.4%
- MMMU Pro59.6%
- MathVista73.7%
- LiveCodeBench32.8%
- MGSM90.6%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | $0.18 per 1M tokens |
|---|---|
| Output | $0.59 per 1M tokens |
Llama 4 Scout has no single official price — it is open-weight and priced by each host. Figures shown are Together AI's serverless rate; other providers (OpenRouter, DeepInfra) differ.
Strengths
- Industry-leading 10M-token context window via iRoPE interleaved attention
- Native text + image understanding through early-fusion multimodality
- Efficient sparse MoE: only 17B of 109B parameters active per token
- Single-H100 deployment at Int4 quantization
- Open weights under the Llama 4 Community License, downloadable from Hugging Face
- Multilingual across 12 languages, with text and code output
Best for
- Long-document and large-codebase analysis at up to 10M-token context
- Multimodal assistants that combine image understanding with chat
- Self-hosted, open-weight deployments needing multimodal quality on a single GPU
- Multilingual text generation and coding across 12 languages
- Retrieval-light pipelines where the whole corpus fits in-context
How to access
| Provider | Model ID |
|---|---|
| Together AI ↗ | meta-llama/Llama-4-Scout-17B-16E-Instruct |
| OpenRouter ↗ | meta-llama/llama-4-scout |
Llama 4 — every version
The full lineage of the Llama 4 line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| Llama 4 Maverickcurrent | 2025-04-05 | 1M | Llama 4 Community |
| Llama 4 Scout | 2025-04-05 | — | Open weights |
| Llama 4 Behemoth | 2025-04 | — | Open weights |
FAQ
Is Llama 4 Scout open-weight?
Yes. Meta releases the weights under the Llama 4 Community License Agreement, and they can be downloaded from Hugging Face (meta-llama/Llama-4-Scout-17B-16E-Instruct). The license is a custom community license with use restrictions — including limits in the EU — rather than a standard OSI-approved open-source license. Scout is also hosted by third-party providers such as Together AI and OpenRouter.
How many parameters does Llama 4 Scout have?
Scout is a mixture-of-experts model with 109 billion total parameters across 16 experts, but only 17 billion are active per token. This sparse design gives larger-model quality while keeping inference cost closer to a 17B dense model. It is the smaller sibling of Llama 4 Maverick (128 experts, 400B total).
What is the context window of Llama 4 Scout?
Meta reports support for up to a 10 million-token context window, enabled by an interleaved-attention (iRoPE) architecture. In practice, individual API hosts often serve shorter windows (for example, some providers expose 192K to 1M tokens), so the usable length depends on where you run the model.
Can Llama 4 Scout run on a single GPU?
Yes. Meta states that the Int4-quantized Llama 4 Scout fits on a single NVIDIA H100 GPU. The mixture-of-experts design keeps only 17B of its 109B parameters active per token, which is what makes single-GPU deployment of a natively multimodal model practical.