Overview
DeepSeek-R1-Distill-Qwen-7B is one of six distilled models DeepSeek released on 20 January 2025 alongside the full DeepSeek-R1. It takes Alibaba's Qwen2.5-Math-7B as the base and fine-tunes it on roughly 800,000 reasoning traces generated by the 671B-parameter DeepSeek-R1. The result is a small, dense 7B model that inherits much of R1's step-by-step "thinking" behaviour — it writes out a long chain of thought before giving a final answer — while being light enough to run on a single consumer GPU.
Because it is built on the math-specialised Qwen2.5-Math-7B backbone, the model is strongest at mathematics and competition-style reasoning. DeepSeek reports 92.8% on MATH-500 and 55.5% on AIME 2024 for this 7B distill, results that the team noted surpass the much larger QwQ-32B-Preview on those tasks. It keeps the standard Qwen2 architecture (28 layers, grouped-query attention, 152k vocabulary) and a 131,072-token context window.
The weights are openly published on Hugging Face under an MIT repository license, and the model derives from Qwen2.5 (Apache 2.0), so commercial use, modification and further distillation are all permitted. DeepSeek advises running it with no system prompt — all instructions go in the user message — and a temperature around 0.6 to keep the reasoning stable. There is no first-party paid API; the model is meant to be self-hosted or run through third-party inference providers.
| Released | 2025-01-20 |
|---|---|
| License | MIT (repository); derived from Qwen2.5-Math-7B, originally Apache 2.0. Commercial use, modification and further distillation are permitted. |
| Weights | Open weights |
| Parameters | 7.6B (dense) |
| Context | 131,072 tokens (max_position_embeddings in config.json) |
| Max output | 32,768 tokens recommended for the R1-distill series (long chain-of-thought generations) |
| Architecture | Dense decoder-only transformer (Qwen2 architecture): 28 layers, hidden size 3584, 28 attention heads, 4 key/value heads (grouped-query attention), vocabulary 152,064. The base Qwen2.5-Math-7B was fine-tuned via supervised distillation on ~800k reasoning samples generated by DeepSeek-R1 — no reinforcement-learning stage on the distilled model itself. |
| Knowledge cutoff | Not officially published by DeepSeek; inherits the underlying Qwen2.5 / DeepSeek-R1 training data. |
| Modalities | text |
| Status | Available (open weights). Superseded for most uses by newer DeepSeek-R1-0528 distills and other small reasoning models, but still actively downloaded and served. |
Benchmarks
- AIME 2024 (pass@1)55.5%
- MATH-500 (pass@1)92.8%
- GPQA Diamond (pass@1)49.1%
- LiveCodeBench (pass@1)37.6%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Strengths
- Strong math and competition reasoning for its size — 92.8% on MATH-500 and 55.5% on AIME 2024, beating some 32B models
- Small enough to run locally: a 7.6B dense model fits comfortably on a single consumer GPU (roughly 6-16 GB VRAM depending on quantization)
- Open weights under a permissive MIT/Apache-2.0 chain — free to self-host, fine-tune, and use commercially
- Inherits DeepSeek-R1's explicit chain-of-thought style, useful for transparent step-by-step reasoning
- Large 131K-token context window for a model this size
- Widely supported across inference stacks (vLLM, SGLang, llama.cpp/Ollama, LM Studio) and many third-party API hosts
Best for
- Local and offline math, logic, and reasoning assistants on a single GPU
- Cost-sensitive reasoning workloads where a 7B model is enough
- Education: showing worked, step-by-step solutions to math and competition problems
- On-device coding help and algorithmic problem-solving (e.g. LeetCode-style tasks)
- A fine-tuning base for domain-specific reasoning models
- Research and experimentation with distilled chain-of-thought reasoning
How to access
| Provider | Model ID |
|---|---|
| Hugging Face (weights) ↗ | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
| OpenRouter ↗ | deepseek/deepseek-r1-distill-qwen-7b |
| Fireworks AI ↗ | accounts/fireworks/models/deepseek-r1-distill-qwen-7b |
| NVIDIA NIM ↗ | deepseek-ai/deepseek-r1-distill-qwen-7b |
DeepSeek R1 Distill — every version
The full lineage of the DeepSeek R1 Distill line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| DeepSeek-R1-0528-Qwen3-8Bcurrent | 2025-05-29 | 131K | MIT |
| DeepSeek-R1-Distill-Llama-70B | 2025-01-20 | — | Open weights |
| DeepSeek-R1-Distill-Qwen-32B | 2025-01-20 | — | Open weights |
| DeepSeek-R1-Distill-Qwen-14B | 2025-01-20 | — | Open weights |
| DeepSeek-R1-Distill-Llama-8B | 2025-01-20 | — | Open weights |
| DeepSeek-R1-Distill-Qwen-7B | 2025-01-20 | — | Open weights |
| DeepSeek-R1-Distill-Qwen-1.5B | 2025-01-20 | — | Open weights |
FAQ
What is DeepSeek-R1-Distill-Qwen-7B based on?
It is a fine-tune of Alibaba's Qwen2.5-Math-7B. DeepSeek distilled the reasoning behaviour of its full 671B DeepSeek-R1 model into it using about 800,000 supervised reasoning samples generated by R1. It keeps the standard Qwen2 dense architecture (28 layers, grouped-query attention).
How good is it at math and coding?
Very strong for a 7B model on math: DeepSeek reports 92.8% on MATH-500 and 55.5% on AIME 2024, which it notes beats the larger QwQ-32B-Preview on those tasks. On coding it scores 37.6% pass@1 on LiveCodeBench, and 49.1% on GPQA Diamond for science reasoning.
What context window and license does it have?
The config sets a 131,072-token context window. The weights ship under an MIT repository license and the model derives from Qwen2.5 (Apache 2.0), so commercial use, modification and further distillation are all allowed.
How much does it cost to use?
DeepSeek does not sell a first-party API for this model — it is open weights, so you can download and self-host it for free. If you do not want to run it yourself, third-party hosts such as OpenRouter, Fireworks AI and NVIDIA NIM serve it; their per-token prices vary, so check the provider for the current rate.