Overview
DeepSeek-R1-Distill-Llama-8B is one of six distilled models DeepSeek released alongside its flagship DeepSeek-R1 reasoning model on January 20, 2025. Rather than being a new architecture, it takes Meta's Llama-3.1-8B base model and fine-tunes it on roughly 800,000 reasoning examples generated by the much larger DeepSeek-R1. The goal is to give a small, locally-runnable 8B model some of the step-by-step reasoning behavior of a frontier reasoning system.
Because it is built on Llama 3.1, the model is a dense decoder-only transformer with 8 billion parameters and a context window of up to 128K tokens. It is text-only and, like DeepSeek-R1 itself, it 'thinks out loud' — it produces a long reasoning trace wrapped in <think> tags before giving its final answer. DeepSeek recommends a sampling temperature around 0.6, putting all instructions in the user prompt rather than a system prompt, and forcing the response to begin with reasoning so the model does not skip its thinking step.
The headline result is that a model small enough to run on a single consumer GPU (or even a laptop with enough memory) can post strong math scores — 89.1% on MATH-500 and 50.4% on AIME 2024. It is open weights under the Llama 3.1 Community License and is downloadable from Hugging Face for self-hosting, with quantized GGUF builds widely available through Ollama and LM Studio. As an older January 2025 release it has since been overtaken by newer small reasoning models, but it remains a popular, easy-to-run baseline.
| Released | 2025-01-20 |
|---|---|
| License | Llama 3.1 Community License (the weights are derived from Llama-3.1-8B-Base). The wider DeepSeek-R1 repository is MIT-licensed, but this specific model inherits Meta's Llama 3.1 license terms. |
| Weights | Open weights |
| Parameters | 8 billion (8.03B) |
| Context | 128K tokens (inherited from the Llama 3.1 base; commonly served at 131,072 tokens) |
| Max output | 32K tokens (recommended; reasoning models emit a long <think> trace before the answer) |
| Architecture | Dense decoder-only transformer. Built on Meta's Llama-3.1-8B-Base and supervised-fine-tuned (SFT only, no separate RL stage) on ~800K reasoning samples generated by DeepSeek-R1, transferring chain-of-thought reasoning into the small model. |
| Knowledge cutoff | December 2023 (from the Llama 3.1 base; distillation added reasoning traces, not new pretraining data) |
| Modalities | text |
| Status | Available (open weights). An older January 2025 release: still downloadable and self-hostable, but several hosted-API aggregators no longer track active providers, so treat managed availability as limited and verify before depending on it. |
Benchmarks
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | ~$0.20 per 1M tokens per 1M tokens |
|---|---|
| Output | ~$0.20 per 1M tokens per 1M tokens |
DeepSeek did not publish a first-party hosted price for this open-weights model — it is intended for self-hosting. Third-party hosts vary widely (roughly $0.025–$0.20 per 1M tokens); the figure shown is Fireworks AI's serverless rate for this 8B model. Self-hosting the open weights has no per-token cost.
Strengths
- Strong math and reasoning for its size — 89.1% on MATH-500 and 50.4% on AIME 2024, far above a typical vanilla 8B chat model
- Small enough to run locally on a single consumer GPU, and available as quantized GGUF builds for Ollama and LM Studio
- Open weights you can download, fine-tune, and self-host with no API dependency
- Long 128K-token context inherited from the Llama 3.1 base
- Transparent chain-of-thought: emits a visible <think> reasoning trace before the final answer
Best for
- Running a private, offline reasoning assistant on local hardware for math and logic problems
- Cost-sensitive math, science, and coding-help tasks where a small open model is enough
- Generating step-by-step reasoning traces for research, teaching, or building reasoning datasets
- A lightweight base to fine-tune for domain-specific reasoning on a budget
- Edge or on-device deployments where data cannot leave the machine
How to access
| Provider | Model ID |
|---|---|
| Hugging Face ↗ | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |
| Ollama ↗ | deepseek-r1:8b |
| Fireworks AI ↗ | accounts/fireworks/models/deepseek-r1-distill-llama-8b |
| OpenRouter ↗ | deepseek/deepseek-r1-distill-llama-8b |
DeepSeek R1 Distill — every version
The full lineage of the DeepSeek R1 Distill line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| DeepSeek-R1-0528-Qwen3-8Bcurrent | 2025-05-29 | 131K | MIT |
| DeepSeek-R1-Distill-Llama-70B | 2025-01-20 | — | Open weights |
| DeepSeek-R1-Distill-Qwen-32B | 2025-01-20 | — | Open weights |
| DeepSeek-R1-Distill-Qwen-14B | 2025-01-20 | — | Open weights |
| DeepSeek-R1-Distill-Llama-8B | 2025-01-20 | — | Open weights |
| DeepSeek-R1-Distill-Qwen-7B | 2025-01-20 | — | Open weights |
| DeepSeek-R1-Distill-Qwen-1.5B | 2025-01-20 | — | Open weights |
FAQ
Is DeepSeek-R1-Distill-Llama-8B the same as DeepSeek-R1?
No. DeepSeek-R1 is a very large Mixture-of-Experts reasoning model. This is a small 8B model built on Meta's Llama 3.1 that was fine-tuned on about 800,000 reasoning examples generated by DeepSeek-R1. It borrows R1's reasoning style at a fraction of the size, but it is a distinct, much smaller model.
What license is DeepSeek-R1-Distill-Llama-8B under?
The model weights are under the Llama 3.1 Community License because they are derived from Meta's Llama-3.1-8B base. The broader DeepSeek-R1 repository is MIT-licensed, but for this specific model you should follow Meta's Llama 3.1 license terms, which permit commercial use subject to its conditions.
Can I run it locally?
Yes. It is open weights and small enough to run on a single consumer GPU. Quantized GGUF builds are available through Ollama (as deepseek-r1:8b) and LM Studio, and the full weights can be downloaded from Hugging Face for self-hosting or fine-tuning.
Why does it print <think> tags before answering?
It is a reasoning model, so it generates a step-by-step chain-of-thought inside <think>...</think> before the final answer. DeepSeek recommends a temperature around 0.6, putting instructions in the user prompt instead of a system prompt, and forcing the reply to start with reasoning so the model does not skip its thinking step.