NIST CAISI · 2026-05-01 · major

NIST CAISI Evaluation: DeepSeek V4 Pro Lags U.S. Frontier by ~8 Months Across Five Domains

First independent US-government technical evaluation of DeepSeek V4 Pro. CAISI finds it the most capable Chinese model tested but trailing the US frontier by about eight months, performing similarly to GPT-5.

Figure: NIST CAISI bar chart titled 'Overall AI Capability' comparing DeepSeek V4 Pro against US frontier models (source: NIST / CAISI)

NIST's CAISI publishes its first independent technical evaluation of DeepSeek V4 Pro across five capability domains.

Key specs

Capability gap vs frontier	~8 months
Benchmarks evaluated	9
Capability domains	5
Held-out benchmarks	ARC-AGI-2 semi-private + PortBench
Cost vs GPT-5.4 mini	53% cheaper to 41% more expensive across 7 benchmarks

What is it?

CAISI, the Center for AI Standards and Innovation, is NIST's AI-evaluation arm, stood up in 2025 under America's AI Action Plan. This is its second public DeepSeek report, following the September 2025 evaluation of earlier DeepSeek models. The new report tests V4 Pro on nine benchmarks spanning five domains: cybersecurity, software engineering, natural sciences, abstract reasoning, and mathematics.

How does it work?

Two of the nine benchmarks are held out from the public to detect benchmark gaming: ARC-AGI-2's semi-private split and PortBench, a software-engineering benchmark CAISI built internally. Models are scored on both capability (against US frontier baselines) and cost per task. The aggregate finding is that V4 Pro performs similarly to GPT-5, which shipped about eight months earlier, even though DeepSeek's own reporting suggests near-parity with current US frontier models.
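One way to read a lag of "about eight months" is as the time between a model's release and the release of the earliest frontier model with a comparable score. The sketch below illustrates only that arithmetic. It is not CAISI's published method, and the `FRONTIER` reference points, scores, and function names are hypothetical placeholders rather than figures from the report.

```python
from datetime import date
from typing import Optional

# Hypothetical (release date, aggregate score) pairs for successive
# frontier models. These are illustrative values, not CAISI data.
FRONTIER = [
    (date(2025, 1, 1), 40.0),
    (date(2025, 5, 1), 50.0),
    (date(2025, 9, 1), 60.0),
    (date(2026, 1, 1), 70.0),
]


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months from `earlier` to `later`."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def lag_in_months(score: float, released: date) -> Optional[int]:
    """Months between `released` and the earliest frontier release
    whose score is at least `score`; None if no frontier model qualifies."""
    for frontier_date, frontier_score in FRONTIER:  # sorted by release date
        if frontier_score >= score:
            return months_between(frontier_date, released)
    return None


# A hypothetical model scoring 60.0, released in May 2026.
lag = lag_in_months(score=60.0, released=date(2026, 5, 1))
print(lag)
```

A real comparison would also have to decide how scores from several benchmarks are combined into one number, which this sketch leaves out.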

Why does it matter?

DeepSeek V4 Pro is the most-discussed Chinese model release of 2026, with vendor benchmarks claiming it closes the gap with GPT-5.5 and Gemini 3.1 Pro. CAISI's independent numbers give policymakers and enterprise buyers a reference point that does not rely on vendor self-reporting. They also document that V4 Pro is more cost-efficient than GPT-5.4 mini on five of seven benchmarks, with its cost ranging from 53% cheaper to 41% more expensive depending on the task.
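Figures such as "53% cheaper" and "41% more expensive" are relative cost differences against a baseline. As a minimal sketch of that arithmetic, assuming per-task costs are available for both models, the example below computes the signed difference per benchmark and lists the benchmarks where the model is cheaper. The benchmark names and dollar amounts are hypothetical and are not taken from the CAISI report.

```python
def relative_cost(model_cost: float, baseline_cost: float) -> float:
    """Signed fractional cost difference versus the baseline.

    Negative means the model is cheaper; -0.53 reads as "53% cheaper"
    and +0.41 as "41% more expensive".
    """
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (model_cost - baseline_cost) / baseline_cost


# Hypothetical per-task costs in USD as (model, baseline) pairs.
costs = {
    "bench_a": (0.47, 1.00),
    "bench_b": (1.41, 1.00),
    "bench_c": (0.80, 1.00),
}

diffs = {name: relative_cost(m, b) for name, (m, b) in costs.items()}
cheaper = [name for name, d in diffs.items() if d < 0]

for name, d in diffs.items():
    label = "cheaper" if d < 0 else "more expensive"
    print(f"{name}: {abs(d):.0%} {label}")
```

Reporting the full range alongside the count of benchmarks won, as the summary above does, avoids hiding tasks where the nominally cheaper model costs more.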

Who is it for?

AI policy analysts, enterprise procurement teams comparing US and Chinese frontier models, and infosec leads weighing open-weight Chinese models.

Try it

https://www.nist.gov/news-events/news/2026/05/caisi-evaluation-deepseek-v4-pro