AI/TLDR

DeepSeek-R1-Zero

The RL-only reasoning model that learned to think with no supervised fine-tuning at all.

Overview

DeepSeek-R1-Zero is the first model in DeepSeek's R1 reasoning line, released on January 20, 2025. It is the headline science experiment behind the R1 launch: starting from the DeepSeek-V3-Base mixture-of-experts model (671B total parameters, 37B activated), DeepSeek trained it with large-scale reinforcement learning alone, with no supervised fine-tuning step beforehand. The reward was simple and rule-based, mostly answer correctness on math and code problems, and the model was left to work out how to reason on its own.
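To make the idea of a simple, rule-based reward concrete, here is a minimal Python sketch. The R1 paper describes an accuracy check combined with a format check that asks for reasoning inside `<think>` tags. Everything else here is an assumption for illustration, not DeepSeek's implementation: the exact regex, the exact-match answer comparison, the `format_weight` value, and the function names.

```python
import re

# Assumed template: reasoning in <think>...</think>, final answer in <answer>...</answer>.
_TEMPLATE = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def format_reward(completion: str) -> float:
    """1.0 if the whole completion follows the assumed tag template, else 0.0."""
    return 1.0 if _TEMPLATE.fullmatch(completion) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer exactly matches the reference, else 0.0.

    Exact string match is a simplification; a real checker for math or code
    would normalize expressions or run test cases.
    """
    match = _TEMPLATE.fullmatch(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def total_reward(completion: str, reference: str, format_weight: float = 0.5) -> float:
    """Accuracy plus a weighted format bonus (the weight is a hypothetical choice)."""
    return accuracy_reward(completion, reference) + format_weight * format_reward(completion)


good = "<think>2 + 2 is 4</think> <answer>4</answer>"
print(total_reward(good, "4"))   # 1.5
print(total_reward("4", "4"))    # 0.0 (right answer, but no tags to extract it from)
```

Because both checks are deterministic rules rather than a learned reward model, the signal is cheap to compute at scale, although it only works for tasks whose answers can be verified mechanically.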

What makes DeepSeek-R1-Zero notable is what emerged from that pure-RL process. Without ever being shown human-written reasoning traces, the model spontaneously developed long chains of thought, self-verification, reflection, and an 'aha moment' in which it learned to spend more thinking time on a problem by re-evaluating its initial approach. Its AIME 2024 score climbed from 15.6% to 71.0% pass@1 over training (86.7% with majority voting over 64 samples), comparable to OpenAI's o1-0912 on that benchmark. The authors describe this as the first open research to validate that reasoning capability can be incentivized through reinforcement learning alone.
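The gap between the pass@1 and majority-voting numbers is easier to see with a small example. The sketch below assumes pass@1 is the mean correctness over several sampled answers to a problem, and that majority voting (cons@k) scores only the most frequent answer. The function names and sample answers are made up for illustration.

```python
from collections import Counter


def pass_at_1(samples: list[str], reference: str) -> float:
    """Average correctness across the sampled answers for one problem."""
    return sum(answer == reference for answer in samples) / len(samples)


def cons_at_k(samples: list[str], reference: str) -> float:
    """1.0 if the most common sampled answer is correct, else 0.0."""
    majority_answer, _count = Counter(samples).most_common(1)[0]
    return 1.0 if majority_answer == reference else 0.0


# Five hypothetical samples for one problem whose correct answer is "72".
samples = ["72", "72", "68", "72", "70"]
print(pass_at_1(samples, "72"))  # 0.6
print(cons_at_k(samples, "72"))  # 1.0
```

A problem the model gets right only 60% of the time still counts as fully solved under majority voting, which is why the voting score sits above pass@1.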

The catch is that DeepSeek-R1-Zero is hard to use in practice. The paper openly documents that its outputs suffer from poor readability, chaotic formatting, and language mixing (switching between English and Chinese mid-thought). Those flaws are exactly why DeepSeek followed it with DeepSeek-R1, which adds a cold-start dataset and multi-stage training to the same RL recipe to clean up the output while keeping the reasoning gains. R1-Zero remains open under the MIT license on Hugging Face, mainly as a research baseline rather than a polished assistant. The R1 work was later peer-reviewed and featured on the cover of Nature (Vol. 645, 18 September 2025).

Released: 2025-01-20
License: MIT
Weights: Open weights
Parameters: 671B total / 37B activated (Mixture-of-Experts)
Context: 128K tokens
Max output: 32,768 tokens
Architecture: Mixture-of-Experts (MoE) transformer with 671B total parameters and 37B activated per token, built on DeepSeek-V3-Base. Trained purely with large-scale reinforcement learning using Group Relative Policy Optimization (GRPO), a critic-free RL algorithm, with no supervised fine-tuning (SFT) cold-start. Reasoning behavior (long chain-of-thought, self-verification, reflection) emerged on its own during RL.
Knowledge cutoff: Inherited from DeepSeek-V3-Base (training data up to ~July 2024)
Modalities: text
Status: Released January 20, 2025 as a research artifact. Still available as open weights on Hugging Face, but superseded by DeepSeek-R1 (and later R1-0528) for practical use. It was never offered as a paid API product.

Benchmarks

  1. AIME 2024 (pass@1): 71.0%
  2. AIME 2024 (cons@64, majority voting): 86.7%
  3. MATH-500 (pass@1): 95.9%
  4. GPQA Diamond (pass@1): 73.3%
  5. LiveCodeBench (pass@1): 50.0%

Scores are percentages on a 0–100 scale; higher is better.

Strengths

  • Demonstrates that reasoning can emerge from reinforcement learning alone, with zero supervised fine-tuning
  • Strong math performance: 71.0% pass@1 on AIME 2024 (86.7% with majority voting), comparable to OpenAI o1-0912 on AIME
  • Open weights under the permissive MIT license, free for commercial and derivative use
  • Efficient MoE design: only 37B of 671B parameters active per token
  • Exposes its full chain-of-thought, useful for studying how reasoning behaviors self-organize

Best for

  • Research baseline for studying RL-driven reasoning and self-evolution in LLMs
  • A starting point for teams experimenting with pure-RL or GRPO training pipelines
  • Math and competition-style problem solving where readable formatting is not required
  • Ablation comparisons against DeepSeek-R1 to measure the value of cold-start SFT data
  • Generating reasoning traces for distillation or analysis (with post-processing for readability)
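For the post-processing mentioned in the last bullet, one simple first pass is to flag traces that mix English and Chinese. The sketch below is a heuristic, not a method from the R1 paper. It measures the share of alphabetic characters that fall in the CJK Unified Ideographs block, and the thresholds `low` and `high` and both function names are assumptions.

```python
def cjk_fraction(text: str) -> float:
    """Share of alphabetic characters that are CJK Unified Ideographs (U+4E00..U+9FFF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cjk_count = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return cjk_count / len(letters)


def is_language_mixed(text: str, low: float = 0.05, high: float = 0.95) -> bool:
    """Flag text that is neither almost entirely CJK nor almost entirely non-CJK."""
    return low < cjk_fraction(text) < high


print(is_language_mixed("Let x be 3, then 所以答案是 9."))  # True
print(is_language_mixed("The answer is 9."))               # False
print(is_language_mixed("所以答案是九"))                     # False
```

A character-range check like this is crude (it ignores other scripts and legitimately quoted Chinese), so it is best treated as a filter for manual review rather than a final quality gate.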

DeepSeek R1 — every version

The full lineage of the DeepSeek R1 line, newest first.

Version                     Released    Context  License
DeepSeek-R1-0528 (current)  2025-05-28  -        MIT
DeepSeek-R1                 2025-01-20  -        MIT
DeepSeek-R1-Zero            2025-01-20  -        MIT

FAQ

What is the difference between DeepSeek-R1-Zero and DeepSeek-R1?

DeepSeek-R1-Zero is trained with reinforcement learning only, directly on DeepSeek-V3-Base, with no supervised fine-tuning. It reasons well but its output suffers from poor readability and language mixing. DeepSeek-R1 adds a cold-start dataset and multi-stage training on top of that RL recipe to fix the readability problems while keeping the reasoning gains, which is why R1 is the model people actually deploy.

How was DeepSeek-R1-Zero trained?

It was trained purely with large-scale reinforcement learning using Group Relative Policy Optimization (GRPO), a critic-free RL algorithm, starting from DeepSeek-V3-Base. There was no supervised fine-tuning step. Reasoning behaviors like long chain-of-thought, self-verification, and reflection emerged on their own during RL rather than being taught from human examples.
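"Critic-free" means GRPO does not train a separate value network. Instead, it samples a group of responses to the same prompt and scores each one relative to the group. The Python sketch below shows only that group-relative advantage step, under stated assumptions: population standard deviation, a small `eps` to avoid division by zero, and a hypothetical function name. It omits the clipped policy-ratio objective and the KL penalty that make up the rest of the algorithm.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its own group.

    The group statistics act as the baseline, so no learned critic is needed.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Four hypothetical sampled responses to one prompt: two correct, two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

If every response in a group earns the same reward, all advantages come out as zero, so that prompt contributes no gradient signal for that step.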

Can I use DeepSeek-R1-Zero via an API?

Not as a dedicated product. DeepSeek published API pricing only for DeepSeek-R1, not for R1-Zero. R1-Zero is released as open weights under the MIT license on Hugging Face, so you can download and self-host it, but DeepSeek never offered it as a paid hosted endpoint.

How big is DeepSeek-R1-Zero and what license is it under?

It is a Mixture-of-Experts model with 671B total parameters and 37B activated per token, supporting a 128K-token context window and up to 32,768 output tokens. It is released under the permissive MIT license, allowing commercial use and derivative works.
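The parameter counts above translate into some quick back-of-the-envelope numbers. The sketch below computes the share of parameters active per token and a rough weight-only memory footprint. The bytes-per-parameter values are assumptions about storage precision, and the estimate ignores KV cache, activations, and runtime overhead.

```python
TOTAL_PARAMS = 671e9    # total parameters, from the spec above
ACTIVE_PARAMS = 37e9    # parameters activated per token, from the spec above


def active_fraction(total: float = TOTAL_PARAMS, active: float = ACTIVE_PARAMS) -> float:
    """Share of the model's parameters used for any single token."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Weight-only memory in gigabytes (1 GB = 1e9 bytes) at a given precision."""
    return params * bytes_per_param / 1e9


print(f"{active_fraction():.1%}")           # 5.5%
print(weight_memory_gb(TOTAL_PARAMS, 1))    # 671.0  (assuming 1 byte per parameter)
print(weight_memory_gb(TOTAL_PARAMS, 2))    # 1342.0 (assuming 2 bytes per parameter)
```

The takeaway for self-hosting is that the MoE design cuts compute per token to roughly 5.5% of the total parameters, but all 671B parameters still have to be held in memory.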