AI/TLDR

DeepSeek-R1-Distill-Llama-8B

An 8B Llama model taught to reason by DeepSeek-R1 — small enough to run locally, strong on math.

Overview

DeepSeek-R1-Distill-Llama-8B is one of six distilled models DeepSeek released alongside its flagship DeepSeek-R1 reasoning model on January 20, 2025. Rather than being a new architecture, it takes Meta's Llama-3.1-8B base model and fine-tunes it on roughly 800,000 reasoning examples generated by the much larger DeepSeek-R1. The goal is to give a small, locally-runnable 8B model some of the step-by-step reasoning behavior of a frontier reasoning system.

Because it is built on Llama 3.1, the model is a dense decoder-only transformer with 8 billion parameters and a context window of up to 128K tokens. It is text-only and, like DeepSeek-R1 itself, it 'thinks out loud' — it produces a long reasoning trace wrapped in <think> tags before giving its final answer. DeepSeek recommends a sampling temperature around 0.6, putting all instructions in the user prompt rather than a system prompt, and forcing the response to begin with reasoning so the model does not skip its thinking step.

The headline result is that a model small enough to run on a single consumer GPU (or even a laptop with enough memory) can post strong math scores — 89.1% on MATH-500 and 50.4% on AIME 2024. It is open weights under the Llama 3.1 Community License and is downloadable from Hugging Face for self-hosting, with quantized GGUF builds widely available through Ollama and LM Studio. As an older January 2025 release it has since been overtaken by newer small reasoning models, but it remains a popular, easy-to-run baseline.

Released2025-01-20
LicenseLlama 3.1 Community License (the weights are derived from Llama-3.1-8B-Base). The wider DeepSeek-R1 repository is MIT-licensed, but this specific model inherits Meta's Llama 3.1 license terms.
WeightsOpen weights
Parameters8 billion (8.03B)
Context128K tokens (inherited from the Llama 3.1 base; commonly served at 131,072 tokens)
Max output32K tokens (recommended; reasoning models emit a long <think> trace before the answer)
ArchitectureDense decoder-only transformer. Built on Meta's Llama-3.1-8B-Base and supervised-fine-tuned (SFT only, no separate RL stage) on ~800K reasoning samples generated by DeepSeek-R1, transferring chain-of-thought reasoning into the small model.
Knowledge cutoffDecember 2023 (from the Llama 3.1 base; distillation added reasoning traces, not new pretraining data)
Modalitiestext
StatusAvailable (open weights). An older January 2025 release: still downloadable and self-hostable, but several hosted-API aggregators no longer track active providers, so treat managed availability as limited and verify before depending on it.

Benchmarks

  1. MATH-500 (pass@1)89.1%
  2. AIME 2024 (pass@1)50.4%
  3. GPQA Diamond (pass@1)49%
  4. LiveCodeBench (pass@1)39.6%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input~$0.20 per 1M tokens per 1M tokens
Output~$0.20 per 1M tokens per 1M tokens

DeepSeek did not publish a first-party hosted price for this open-weights model — it is intended for self-hosting. Third-party hosts vary widely (roughly $0.025–$0.20 per 1M tokens); the figure shown is Fireworks AI's serverless rate for this 8B model. Self-hosting the open weights has no per-token cost.

Pricing source ↗

Strengths

  • Strong math and reasoning for its size — 89.1% on MATH-500 and 50.4% on AIME 2024, far above a typical vanilla 8B chat model
  • Small enough to run locally on a single consumer GPU, and available as quantized GGUF builds for Ollama and LM Studio
  • Open weights you can download, fine-tune, and self-host with no API dependency
  • Long 128K-token context inherited from the Llama 3.1 base
  • Transparent chain-of-thought: emits a visible <think> reasoning trace before the final answer

Best for

  • Running a private, offline reasoning assistant on local hardware for math and logic problems
  • Cost-sensitive math, science, and coding-help tasks where a small open model is enough
  • Generating step-by-step reasoning traces for research, teaching, or building reasoning datasets
  • A lightweight base to fine-tune for domain-specific reasoning on a budget
  • Edge or on-device deployments where data cannot leave the machine

How to access

ProviderModel ID
Hugging Face ↗deepseek-ai/DeepSeek-R1-Distill-Llama-8B
Ollama ↗deepseek-r1:8b
Fireworks AI ↗accounts/fireworks/models/deepseek-r1-distill-llama-8b
OpenRouter ↗deepseek/deepseek-r1-distill-llama-8b

DeepSeek R1 Distill — every version

The full lineage of the DeepSeek R1 Distill line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
DeepSeek-R1-0528-Qwen3-8Bcurrent2025-05-29131KMIT
DeepSeek-R1-Distill-Llama-70B2025-01-20Open weights
DeepSeek-R1-Distill-Qwen-32B2025-01-20Open weights
DeepSeek-R1-Distill-Qwen-14B2025-01-20Open weights
DeepSeek-R1-Distill-Llama-8B2025-01-20Open weights
DeepSeek-R1-Distill-Qwen-7B2025-01-20Open weights
DeepSeek-R1-Distill-Qwen-1.5B2025-01-20Open weights

FAQ

Is DeepSeek-R1-Distill-Llama-8B the same as DeepSeek-R1?

No. DeepSeek-R1 is a very large Mixture-of-Experts reasoning model. This is a small 8B model built on Meta's Llama 3.1 that was fine-tuned on about 800,000 reasoning examples generated by DeepSeek-R1. It borrows R1's reasoning style at a fraction of the size, but it is a distinct, much smaller model.

What license is DeepSeek-R1-Distill-Llama-8B under?

The model weights are under the Llama 3.1 Community License because they are derived from Meta's Llama-3.1-8B base. The broader DeepSeek-R1 repository is MIT-licensed, but for this specific model you should follow Meta's Llama 3.1 license terms, which permit commercial use subject to its conditions.

Can I run it locally?

Yes. It is open weights and small enough to run on a single consumer GPU. Quantized GGUF builds are available through Ollama (as deepseek-r1:8b) and LM Studio, and the full weights can be downloaded from Hugging Face for self-hosting or fine-tuning.

Why does it print <think> tags before answering?

It is a reasoning model, so it generates a step-by-step chain-of-thought inside <think>...</think> before the final answer. DeepSeek recommends a temperature around 0.6, putting instructions in the user prompt instead of a system prompt, and forcing the reply to start with reasoning so the model does not skip its thinking step.