DeepSeek-R1-Distill-Qwen-7B

Name: DeepSeek-R1-Distill-Qwen-7B
Author: DeepSeek

DeepSeek-R1 reasoning distilled into a 7B Qwen model you can run locally.

Overview

DeepSeek-R1-Distill-Qwen-7B is one of six distilled models DeepSeek released on 20 January 2025 alongside the full DeepSeek-R1. It takes Alibaba's Qwen2.5-Math-7B as the base and fine-tunes it on roughly 800,000 reasoning traces generated by the 671B-parameter DeepSeek-R1. The result is a small, dense 7B model that inherits much of R1's step-by-step "thinking" behaviour — it writes out a long chain of thought before giving a final answer — while being light enough to run on a single consumer GPU.

Because it is built on the math-specialised Qwen2.5-Math-7B backbone, the model is strongest at mathematics and competition-style reasoning. DeepSeek reports 92.8% on MATH-500 and 55.5% on AIME 2024 for this 7B distill, results that the team noted surpass the much larger QwQ-32B-Preview on those tasks. It keeps the standard Qwen2 architecture (28 layers, grouped-query attention, 152k vocabulary) and a 131,072-token context window.

The weights are openly published on Hugging Face under an MIT repository license, and the model derives from Qwen2.5 (Apache 2.0), so commercial use, modification and further distillation are all permitted. DeepSeek advises running it with no system prompt — all instructions go in the user message — and a temperature around 0.6 to keep the reasoning stable. There is no first-party paid API; the model is meant to be self-hosted or run through third-party inference providers.

Released	2025-01-20
License	MIT (repository); derived from Qwen2.5-Math-7B, originally Apache 2.0. Commercial use, modification and further distillation are permitted.
Weights	Open weights
Parameters	7.6B (dense)
Context	131,072 tokens (max_position_embeddings in config.json)
Max output	32,768 tokens recommended for the R1-distill series (long chain-of-thought generations)
Architecture	Dense decoder-only transformer (Qwen2 architecture): 28 layers, hidden size 3584, 28 attention heads, 4 key/value heads (grouped-query attention), vocabulary 152,064. The base Qwen2.5-Math-7B was fine-tuned via supervised distillation on ~800k reasoning samples generated by DeepSeek-R1 — no reinforcement-learning stage on the distilled model itself.
Knowledge cutoff	Not officially published by DeepSeek; inherits the underlying Qwen2.5 / DeepSeek-R1 training data.
Modalities	text
Status	Available (open weights). Superseded for most uses by newer DeepSeek-R1-0528 distills and other small reasoning models, but still actively downloaded and served.

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

Strong math and competition reasoning for its size — 92.8% on MATH-500 and 55.5% on AIME 2024, beating some 32B models
Small enough to run locally: a 7.6B dense model fits comfortably on a single consumer GPU (roughly 6-16 GB VRAM depending on quantization)
Open weights under a permissive MIT/Apache-2.0 chain — free to self-host, fine-tune, and use commercially
Inherits DeepSeek-R1's explicit chain-of-thought style, useful for transparent step-by-step reasoning
Large 131K-token context window for a model this size
Widely supported across inference stacks (vLLM, SGLang, llama.cpp/Ollama, LM Studio) and many third-party API hosts

Best for

Local and offline math, logic, and reasoning assistants on a single GPU
Cost-sensitive reasoning workloads where a 7B model is enough
Education: showing worked, step-by-step solutions to math and competition problems
On-device coding help and algorithmic problem-solving (e.g. LeetCode-style tasks)
A fine-tuning base for domain-specific reasoning models
Research and experimentation with distilled chain-of-thought reasoning

How to access

Provider	Model ID
Hugging Face (weights) ↗	`deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
OpenRouter ↗	`deepseek/deepseek-r1-distill-qwen-7b`
Fireworks AI ↗	`accounts/fireworks/models/deepseek-r1-distill-qwen-7b`
NVIDIA NIM ↗	`deepseek-ai/deepseek-r1-distill-qwen-7b`

DeepSeek R1 Distill — every version

The full lineage of the DeepSeek R1 Distill line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
DeepSeek-R1-0528-Qwen3-8Bcurrent	2025-05-29	131K	MIT
DeepSeek-R1-Distill-Llama-70B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Qwen-32B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Qwen-14B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Llama-8B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Qwen-7B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Qwen-1.5B	2025-01-20	—	Open weights

FAQ

What is DeepSeek-R1-Distill-Qwen-7B based on?

It is a fine-tune of Alibaba's Qwen2.5-Math-7B. DeepSeek distilled the reasoning behaviour of its full 671B DeepSeek-R1 model into it using about 800,000 supervised reasoning samples generated by R1. It keeps the standard Qwen2 dense architecture (28 layers, grouped-query attention).

How good is it at math and coding?

Very strong for a 7B model on math: DeepSeek reports 92.8% on MATH-500 and 55.5% on AIME 2024, which it notes beats the larger QwQ-32B-Preview on those tasks. On coding it scores 37.6% pass@1 on LiveCodeBench, and 49.1% on GPQA Diamond for science reasoning.

What context window and license does it have?

The config sets a 131,072-token context window. The weights ship under an MIT repository license and the model derives from Qwen2.5 (Apache 2.0), so commercial use, modification and further distillation are all allowed.

How much does it cost to use?

DeepSeek does not sell a first-party API for this model — it is open weights, so you can download and self-host it for free. If you do not want to run it yourself, third-party hosts such as OpenRouter, Fireworks AI and NVIDIA NIM serve it; their per-token prices vary, so check the provider for the current rate.

// Overview

// Benchmarks

// Strengths

// Best for

// How to access

// DeepSeek R1 Distill — every version

// FAQ