Grok 4 Heavy

xAI's top-tier multi-agent version of Grok 4 that runs several reasoning agents in parallel for the hardest problems.

Overview

Grok 4 Heavy is the most powerful configuration of xAI's Grok 4, launched on July 9, 2025 alongside the standard Grok 4. Where ordinary Grok 4 answers with a single reasoning pass, Grok 4 Heavy spawns several Grok 4 agents that work on the same problem in parallel and then compare notes, picking the strongest answer. xAI markets this "test-time compute" approach as its highest-capability offering for the most demanding reasoning, math, and coding tasks.

Grok 4 Heavy is not a separate set of weights — it is the same underlying Grok 4 model run in a multi-agent mode, with a 256,000-token context window and text plus image (vision) input. At launch it set state-of-the-art numbers across several frontier benchmarks: it was the first model publicly reported to clear 50% on Humanity's Last Exam (50.7% on the text-only subset with tools) and posted 15.9% on ARC-AGI-2, 100% on AIME 2025, and 61.9% on the USAMO 2025 math-proof benchmark.

Crucially, Grok 4 Heavy is a consumer-only product. It is available exclusively through the SuperGrok Heavy subscription at $300/month (or $3,000/year) and was never exposed to developers as a standalone API model ID — the xAI API offered the regular Grok 4 model instead. The parallel multi-agent idea later resurfaced as a documented API mode in xAI's newer Grok 4.20 line.

Released	2025-07-09
License	Proprietary
Weights	API only
Context	256K
Architecture	Multi-agent reasoning system that spawns several Grok 4 instances in parallel at inference time and cross-evaluates their outputs to pick the best answer ("test-time compute"). Built on the Grok 4 reasoning model trained with large-scale reinforcement learning on xAI's Colossus cluster.
Knowledge cutoff	November 2024
Modalities	Text, Vision
Status	Generally available

Benchmarks

Humanity's Last Exam (text-only, with tools)50.7%
ARC-AGI-215.9%
GPQA88.9%
AIME 2025100%
USAMO 202561.9%
HMMT 202596.7%
LiveCodeBench (Jan-May)79.4%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

Highest benchmark scores in the Grok 4 family — first model to break 50% on Humanity's Last Exam (50.7% with tools)
Parallel multi-agent reasoning improves reliability on long, multi-step math and coding chains
State-of-the-art on ARC-AGI-2 (15.9%) for closed models at launch
Near-saturated competition-math performance (AIME 2025 100%, HMMT 2025 96.7%)
256K-token context window for long documents and large codebases
Native tool use and live web/X search integrated into reasoning

Best for

Hardest research-grade reasoning and STEM problem solving
Competition-level mathematics and formal proof drafting
Complex multi-step coding and debugging tasks
Deep research that benefits from cross-checking multiple agent answers
High-stakes analysis where extra accuracy justifies the premium tier

FAQ

What is the difference between Grok 4 and Grok 4 Heavy?

Standard Grok 4 answers with a single reasoning pass. Grok 4 Heavy runs several Grok 4 agents in parallel on the same problem and cross-evaluates their outputs to pick the best answer, which raises accuracy on the hardest tasks. Both share the same 256K context window and knowledge cutoff.

How much does Grok 4 Heavy cost?

Grok 4 Heavy is available only on the SuperGrok Heavy plan, priced at $300/month or $3,000/year. It was xAI's most expensive consumer tier at launch.

Can I use Grok 4 Heavy through the xAI API?

No. Grok 4 Heavy was a consumer-only feature with no standalone API model ID. Developers could call the regular Grok 4 model through the API, but the multi-agent Heavy mode was exclusive to the SuperGrok Heavy subscription.

How well does Grok 4 Heavy do on benchmarks?

At launch it was the first model publicly reported to top 50% on Humanity's Last Exam (50.7% with tools), and it posted 15.9% on ARC-AGI-2, 88.9% on GPQA, 100% on AIME 2025, and 61.9% on USAMO 2025.

// Overview

// Benchmarks

// Strengths

// Best for

// FAQ

Overview

Benchmarks

Strengths

Best for

FAQ