DeepSeekMath

Name: DeepSeekMath
Author: DeepSeek

The original 7B open math model that introduced GRPO — the RL method later used to train DeepSeek-R1.

Overview

DeepSeekMath is a 7-billion-parameter open-weight math model from DeepSeek, first released in February 2024 (with the v3 paper revision following in April 2024). Rather than training from scratch, DeepSeek continued pre-training its DeepSeek-Coder-Base-v1.5 7B checkpoint on 120 billion math-related tokens scraped from Common Crawl, plus natural-language and code data. The result was a small model whose mathematical reasoning rivaled far larger systems: DeepSeekMath approached the MATH-benchmark level of Gemini-Ultra and GPT-4 while being open and self-hostable.

DeepSeek shipped three variants: DeepSeekMath-Base 7B (the continued-pretrained foundation), DeepSeekMath-Instruct 7B (chain-of-thought instruction-tuned), and DeepSeekMath-RL 7B (the strongest, refined with reinforcement learning). The headline result — 51.7% on the competition-level MATH benchmark and 88.2% on GSM8K using chain-of-thought without any external tools — came from the RL variant and beat every open-source model from 7B to 70B at the time. All variants run in a 4,096-token context.

DeepSeekMath's most lasting contribution is GRPO (Group Relative Policy Optimization), the reinforcement-learning algorithm introduced in this paper. GRPO is a memory-efficient variant of PPO that drops the separate value/critic model and instead estimates the baseline from a group of sampled outputs. It is the same method DeepSeek later scaled up to train DeepSeek-R1, making DeepSeekMath a direct technical ancestor of DeepSeek's reasoning models. The line continues today with the much larger DeepSeek-Math-V2.
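The group-based baseline described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's training code: it assumes one scalar reward per sampled output, uses the population standard deviation, and adds a small `eps` to avoid division by zero. The function name and the reward values are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output relative to the other outputs in its group.

    The group mean plays the role of the baseline that a critic model
    would otherwise have to learn.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for one prompt
# (1.0 = correct, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean receive positive advantages and those below receive negative ones, so no separate value model is needed to supply the baseline.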

Released	2024-02 (paper v3 revision: 2024-04)
License	MIT (code) + DeepSeek Model License (weights); commercial use permitted
Weights	Open weights
Parameters	7B (dense)
Context	4K (4,096 tokens)
Architecture	Dense decoder-only transformer, continued from DeepSeek-Coder-Base-v1.5 7B on 120B math tokens; RL variant trained with GRPO
Modalities	Text
Status	Superseded — the current flagship of the DeepSeek Math line is DeepSeek-Math-V2 (Nov 2025). The original 7B weights remain available on Hugging Face.

Benchmarks

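Scores are reported on a 0–100 scale; higher is better. The benchmark chart is not reproduced here; the published figures for the RL variant are 51.7% on MATH and 88.2% on GSM8K (chain-of-thought, no external tools).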

Strengths

Introduced GRPO, the group-relative RL method DeepSeek later reused to train DeepSeek-R1
Strong math reasoning from a small 7B model — 51.7% on MATH and 88.2% on GSM8K without external tools
Continued pre-training on 120B math tokens from public web data, showing that careful data engineering can matter more than raw scale
Fully open weights with commercial-use license, in three variants (Base, Instruct, RL) for self-hosting and fine-tuning
Approaches Gemini-Ultra / GPT-4-level MATH accuracy at a fraction of the parameter count

Best for

Self-hosted math problem solving and step-by-step (chain-of-thought) reasoning
Research baseline for GRPO and reinforcement-learning-for-reasoning experiments
Fine-tuning a small, efficient math foundation model for tutoring or STEM assistants
Studying math-focused continued pre-training and data curation from web corpora
Running competition-style math (GSM8K / MATH) evaluation on commodity hardware
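For the evaluation use case, a scoring loop might look like the sketch below. It assumes the model is prompted to put its final answer inside `\boxed{...}`, a common convention for MATH-style evaluation that this page does not specify, and it uses exact string matching, which is stricter than graders that normalize equivalent expressions. The helper names are hypothetical.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def accuracy(outputs, references):
    """Percentage of outputs whose boxed answer equals the reference string."""
    correct = sum(extract_boxed(o) == ref for o, ref in zip(outputs, references))
    return 100.0 * correct / len(references)


# Hypothetical model outputs and reference answers.
outputs = [r"So the answer is \boxed{42}.", r"Thus \boxed{\frac{1}{2}}", "I am not sure."]
references = ["42", r"\frac{1}{2}", "7"]
score = accuracy(outputs, references)
```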

How to access

Provider	Model ID
Hugging Face (open weights, RL variant)	`deepseek-ai/deepseek-math-7b-rl`
Hugging Face (open weights, Instruct variant)	`deepseek-ai/deepseek-math-7b-instruct`
Hugging Face (open weights, Base variant)	`deepseek-ai/deepseek-math-7b-base`
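The model IDs above can be loaded with the Hugging Face `transformers` library. The following is a minimal sketch, assuming `transformers` and `torch` are installed and that enough memory is available for a 7B model. The helper names `load_model` and `generate_text`, the bfloat16 dtype, and the generation length are illustrative choices, not settings taken from this page.

```python
MODEL_IDS = {
    "base": "deepseek-ai/deepseek-math-7b-base",
    "instruct": "deepseek-ai/deepseek-math-7b-instruct",
    "rl": "deepseek-ai/deepseek-math-7b-rl",
}


def load_model(variant="base"):
    """Download and load one variant (hypothetical helper)."""
    # Imported lazily so the ID table is usable without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = MODEL_IDS[variant]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # bfloat16 is an assumption made here to reduce memory use.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    return tokenizer, model


def generate_text(prompt, variant="base", max_new_tokens=256):
    """Plain text completion; prompt plus output must fit the 4,096-token context."""
    tokenizer, model = load_model(variant)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

The Instruct and RL variants may expect a chat-formatted prompt rather than raw text; the model cards on Hugging Face are the place to confirm the recommended format.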

DeepSeek Math — every version

The full lineage of the DeepSeek Math line, newest first.

Version	Released	Context	License
DeepSeek-Math-V2 (current)	2025-11-27	—	Apache-2.0
DeepSeekMath	2024-02	4K	MIT (code) + DeepSeek Model License (weights)

FAQ

Is DeepSeekMath open source and free to use?

Yes. The weights are published on Hugging Face in three variants (Base, Instruct, and RL). The code is MIT-licensed and the model weights are covered by the DeepSeek Model License, which permits commercial use. You can download, run, and fine-tune the model yourself.

What is GRPO, and why does DeepSeekMath matter?

GRPO (Group Relative Policy Optimization) is the reinforcement-learning algorithm introduced in the DeepSeekMath paper. It is a memory-efficient variant of PPO that removes the separate critic model and instead computes a baseline from a group of sampled answers. DeepSeek later used GRPO to train DeepSeek-R1, so DeepSeekMath is a direct technical ancestor of its reasoning models.

How well does DeepSeekMath perform on math benchmarks?

The DeepSeekMath-RL 7B variant scores 51.7% on the competition-level MATH benchmark and 88.2% on GSM8K using chain-of-thought without external tools. With self-consistency over 64 samples, the model reaches 60.9% on MATH — approaching the MATH accuracy of Gemini-Ultra and GPT-4 despite being only 7B parameters.
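Self-consistency means sampling several answers to the same problem and keeping the most common final answer. A minimal sketch of the voting step follows, assuming the final answers have already been extracted as strings. The function name and sample values are hypothetical, and five samples stand in for the 64 used in the reported result.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent non-empty answer, or None if there is none."""
    counts = Counter(a.strip() for a in answers if a and a.strip())
    if not counts:
        return None
    # Ties resolve to the answer that appeared first among the samples.
    return counts.most_common(1)[0][0]


# Hypothetical final answers from five sampled solutions to one problem.
sampled_answers = ["42", "41", "42", " 42 ", "7"]
final_answer = majority_vote(sampled_answers)
```

Voting costs one generation per sample, so the 60.9% figure comes at roughly 64 times the inference cost of a single chain-of-thought answer.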

Is DeepSeekMath still the latest model in its line?

No. DeepSeekMath (2024) is the original 7B model. DeepSeek released DeepSeek-Math-V2 in November 2025 — a much larger 685B open-weight self-verifying prover. The original 7B DeepSeekMath weights are still available, but V2 is the current flagship of the DeepSeek Math line.
