Overview
Grok 4 Heavy is the most powerful configuration of xAI's Grok 4, launched on July 9, 2025 alongside the standard Grok 4. Where ordinary Grok 4 answers with a single reasoning pass, Grok 4 Heavy spawns several Grok 4 agents that work on the same problem in parallel and then compare notes, picking the strongest answer. xAI markets this "test-time compute" approach as its highest-capability offering for the most demanding reasoning, math, and coding tasks.
Grok 4 Heavy is not a separate set of weights — it is the same underlying Grok 4 model run in a multi-agent mode, with a 256,000-token context window and text plus image (vision) input. At launch it set state-of-the-art numbers across several frontier benchmarks: it was the first model publicly reported to clear 50% on Humanity's Last Exam (50.7% on the text-only subset with tools) and posted 15.9% on ARC-AGI-2, 100% on AIME 2025, and 61.9% on the USAMO 2025 math-proof benchmark.
Crucially, Grok 4 Heavy is a consumer-only product. It is available exclusively through the SuperGrok Heavy subscription at $300/month (or $3,000/year) and was never exposed to developers as a standalone API model ID — the xAI API offered the regular Grok 4 model instead. The parallel multi-agent idea later resurfaced as a documented API mode in xAI's newer Grok 4.20 line.
| Released | 2025-07-09 |
|---|---|
| License | Proprietary |
| Weights | API only |
| Context | 256K |
| Architecture | Multi-agent reasoning system that spawns several Grok 4 instances in parallel at inference time and cross-evaluates their outputs to pick the best answer ("test-time compute"). Built on the Grok 4 reasoning model trained with large-scale reinforcement learning on xAI's Colossus cluster. |
| Knowledge cutoff | November 2024 |
| Modalities | Text, Vision |
| Status | Generally available |
Benchmarks
- Humanity's Last Exam (text-only, with tools)50.7%
- ARC-AGI-215.9%
- GPQA88.9%
- AIME 2025100%
- USAMO 202561.9%
- HMMT 202596.7%
- LiveCodeBench (Jan-May)79.4%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Strengths
- Highest benchmark scores in the Grok 4 family — first model to break 50% on Humanity's Last Exam (50.7% with tools)
- Parallel multi-agent reasoning improves reliability on long, multi-step math and coding chains
- State-of-the-art on ARC-AGI-2 (15.9%) for closed models at launch
- Near-saturated competition-math performance (AIME 2025 100%, HMMT 2025 96.7%)
- 256K-token context window for long documents and large codebases
- Native tool use and live web/X search integrated into reasoning
Best for
- Hardest research-grade reasoning and STEM problem solving
- Competition-level mathematics and formal proof drafting
- Complex multi-step coding and debugging tasks
- Deep research that benefits from cross-checking multiple agent answers
- High-stakes analysis where extra accuracy justifies the premium tier
FAQ
What is the difference between Grok 4 and Grok 4 Heavy?
Standard Grok 4 answers with a single reasoning pass. Grok 4 Heavy runs several Grok 4 agents in parallel on the same problem and cross-evaluates their outputs to pick the best answer, which raises accuracy on the hardest tasks. Both share the same 256K context window and knowledge cutoff.
How much does Grok 4 Heavy cost?
Grok 4 Heavy is available only on the SuperGrok Heavy plan, priced at $300/month or $3,000/year. It was xAI's most expensive consumer tier at launch.
Can I use Grok 4 Heavy through the xAI API?
No. Grok 4 Heavy was a consumer-only feature with no standalone API model ID. Developers could call the regular Grok 4 model through the API, but the multi-agent Heavy mode was exclusive to the SuperGrok Heavy subscription.
How well does Grok 4 Heavy do on benchmarks?
At launch it was the first model publicly reported to top 50% on Humanity's Last Exam (50.7% with tools), and it posted 15.9% on ARC-AGI-2, 88.9% on GPQA, 100% on AIME 2025, and 61.9% on USAMO 2025.