AI/TLDR

Llama 4 Behemoth

Meta's ~2T-parameter teacher model — previewed in training, never shipped

Overview

Llama 4 Behemoth is the largest model in Meta's Llama 4 "herd," unveiled alongside Llama 4 Scout and Maverick on April 5, 2025. Meta describes it as a natively multimodal mixture-of-experts (MoE) model with 288 billion active parameters drawn from a total parameter count approaching 2 trillion, organized across 16 experts. It accepts text and image input and was positioned as one of the most capable base models in the world for non-reasoning tasks.
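The gap between active and total parameters comes from expert routing: each token is processed by only a few of the experts. The following is a minimal sketch of generic top-k routing, not Meta's implementation. The expert count of 16 comes from the description above, while the router logits, the choice of `k=1`, and the function names are assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k=1):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are renormalized so they sum to 1 over the chosen experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


NUM_EXPERTS = 16  # expert count from Meta's description of Behemoth

# Hypothetical router scores for one token; a real router is a learned layer.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route_token(logits, k=1)  # k=1 is an assumption, not a published value
print(chosen)
```

Only the experts returned by `route_token` run for that token, which is why the per-token compute tracks the active parameter count rather than the total.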

Behemoth's primary published role is as a teacher model. Meta codistilled the smaller Llama 4 Maverick from Behemoth using a novel distillation loss that dynamically weights soft and hard targets during training, which Meta credits for substantial quality gains in the shipped models. On STEM benchmarks, Meta reported that Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several measures including MATH-500 and GPQA Diamond.
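Meta has not published the exact form of this loss, so the following is only a sketch of the general idea: a weighted sum of a soft-target term (matching the teacher's output distribution) and a hard-target term (matching the ground-truth label), with a weight that changes during training. The function names, the linear schedule, and its `start` and `end` values are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, alpha, temperature=1.0):
    """Return alpha * soft-target loss + (1 - alpha) * hard-target loss."""
    student_probs = softmax(student_logits, temperature)
    teacher_probs = softmax(teacher_logits, temperature)
    # Soft target: cross-entropy of the student against the teacher distribution.
    soft = -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
    # Hard target: cross-entropy of the student against the ground-truth label.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1.0 - alpha) * hard


def alpha_schedule(step, total_steps, start=0.9, end=0.5):
    """Hypothetical linear schedule for the soft-target weight (an assumption)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


student = [2.0, 0.5, -1.0]
teacher = [3.0, 0.2, -2.0]
for step in (0, 50, 100):
    alpha = alpha_schedule(step, total_steps=100)
    print(step, round(distillation_loss(student, teacher, label=0, alpha=alpha), 4))
```

With `alpha` near 1 the student mostly imitates the teacher; as `alpha` falls, the ground-truth labels carry more of the training signal.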

Crucially, Llama 4 Behemoth was never publicly released. At the April 2025 launch Meta stated it was "not yet releasing Llama 4 Behemoth as it is still training." Through 2025 and into 2026 its launch was repeatedly delayed amid reports that internal teams were unsure the gains justified shipping a 2-trillion-parameter model, leaving it effectively shelved — never formally cancelled, but with no public weights, API, or pricing.

Announced: 2025-04
License: Llama 4 Community License Agreement (weights never released)
Weights: Not released
Parameters: ~2T total · 288B active (16 experts)
Architecture: Natively multimodal mixture-of-experts (early fusion)
Knowledge cutoff: 2024-08
Modalities: Text, Vision
Status: Previewed in training; not publicly released

Benchmarks

  1. MATH-500: 95%
  2. MMLU Pro: 82.2%
  3. GPQA Diamond: 73.7%
  4. MMMU: 76.1%
  5. Multilingual MMLU: 85.8%
  6. LiveCodeBench: 49.4%

Scores are percentages on a 0–100 scale, as reported by Meta; higher is better.

Strengths

  • State-of-the-art reported results for a non-reasoning model on math, multilinguality, and image benchmarks at the April 2025 announcement
  • Outperformed GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks per Meta
  • Native multimodality (text + image) via an early-fusion MoE backbone
  • Highly effective as a teacher model: codistilled Maverick for substantial downstream quality gains

Best for

  • Knowledge distillation — serving as a teacher to train smaller, deployable Llama 4 models (Scout, Maverick)
  • Internal research benchmark for frontier-scale non-reasoning multimodal performance
  • Reference point for evaluating large-scale MoE design and codistillation approaches

Llama 4 — every version

The full lineage of the Llama 4 line, newest first.

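| Version | Released | Context | License |
|---|---|---|---|
| Llama 4 Maverick (current) | 2025-04-05 | 1M | Llama 4 Community |
| Llama 4 Scout | 2025-04-05 | | Open weights |
| Llama 4 Behemoth | 2025-04 (announced only) | | Llama 4 Community (weights never released) |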

FAQ

Is Llama 4 Behemoth available to download or use?

No. Behemoth was previewed at the April 2025 Llama 4 launch but Meta stated it was "not yet releasing" the model because it was still training. As of 2026 there are no public weights, no Hugging Face repository, and no API. Its launch was repeatedly delayed and the model is effectively shelved, though Meta never issued a formal cancellation. Only Llama 4 Scout and Maverick were released.

How big is Llama 4 Behemoth?

Meta describes Behemoth as a mixture-of-experts model with 288 billion active parameters out of a total parameter count approaching 2 trillion, organized across 16 experts. That makes it by far the largest model in the Llama 4 herd — Maverick has 400B total/17B active and Scout has 109B total/17B active.
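As a quick check on these figures, the share of parameters used per token can be computed directly. This sketch uses only the counts quoted above; the Behemoth total is rounded to 2,000B because Meta gives it only as "approaching 2 trillion," so its ratio is approximate.

```python
# (total, active) parameter counts in billions, taken from the figures above.
MODELS = {
    "Behemoth": (2000, 288),  # total is approximate ("approaching 2 trillion")
    "Maverick": (400, 17),
    "Scout": (109, 17),
}


def active_fraction(total_b, active_b):
    """Fraction of a model's parameters that are active for a single token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.2%} of parameters active per token")
```

By this measure Behemoth activates roughly 14% of its parameters per token, Maverick about 4%, and Scout about 16%, so Behemoth's scale advantage is in both total capacity and per-token compute.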

What is Llama 4 Behemoth used for if it was never released?

Its published role is as a teacher model. Meta codistilled the smaller, shippable Llama 4 Maverick from Behemoth using a distillation loss that dynamically weights soft and hard targets during training, which Meta credits for substantial quality improvements in the released models.

How does Llama 4 Behemoth compare to GPT-4.5 and Claude Sonnet 3.7?

At announcement, Meta reported that Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks, including MATH-500 (95.0) and GPQA Diamond (73.7). These are Meta's own published figures for a model that was never independently benchmarked because it was never released.