DeepSeek-R1-Distill-Llama-70B

Name: DeepSeek-R1-Distill-Llama-70B
Author: DeepSeek

R1 reasoning distilled onto a Llama 3.3 70B base — the largest of DeepSeek's January 2025 distill set.

Overview

DeepSeek-R1-Distill-Llama-70B is an open-weight reasoning model that DeepSeek released on January 20, 2025 alongside its flagship DeepSeek-R1. Instead of being a from-scratch model, it takes Meta's Llama-3.3-70B-Instruct as a base and fine-tunes it on roughly 800K reasoning samples generated by the full DeepSeek-R1. The result is a 70.6B dense model that produces R1-style chain-of-thought — thinking through a problem step by step before answering — while running on the same hardware and inference stacks already built for Llama 3.3 70B.

It is the largest of the six original DeepSeek-R1 distill checkpoints (the others target 1.5B / 7B / 14B / 32B Qwen bases and an 8B Llama base). DeepSeek's own evaluation table puts the Llama-70B distill at the top of that set on math and reasoning, scoring 70.0 on AIME 2024, 94.5 on MATH-500, and 65.2 on GPQA Diamond — competitive with much larger reasoning systems while being downloadable and self-hostable.

The model is published under the MIT license, which permits commercial use, modification, and further distillation; the underlying Llama 3.3 weights remain subject to Meta's Llama 3.3 Community License. After launch it was widely served by inference providers — Groq, Together, Fireworks, OpenRouter and others — though hosted availability has since narrowed: Groq, for example, deprecated the model in September 2025 and decommissioned it in February 2026, pointing users to newer alternatives. The weights themselves remain available on Hugging Face.

Released	2025-01-20
License	MIT (base model derived from Llama-3.3-70B-Instruct, originally under the Llama 3.3 Community License)
Weights	Open weights
Parameters	70.6B (dense)
Context	128K tokens
Max output	32,768 tokens (recommended max generation length)
Architecture	Dense transformer (Llama 3.3 70B architecture), fine-tuned on DeepSeek-R1 reasoning traces
Knowledge cutoff	Not separately disclosed by DeepSeek; inherits the Llama 3.3 base pretraining cutoff of December 2023
Modalities	Text
Status	Available as open weights; hosted access wound down at some providers (Groq deprecated it Sept 2025 and decommissioned it Feb 2026).

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.80 / 1M tokens
Output	$0.80 / 1M tokens

Open weights, so cost varies by host; example list price from OpenRouter. Free to self-host.

Pricing source ↗

Strengths

Strong math and step-by-step reasoning for its size: 94.5 on MATH-500 and 70.0 on AIME 2024 (pass@1) in DeepSeek's own evaluation
Drops into existing Llama 3.3 70B infrastructure — same architecture, tokenizer family, and serving stacks (vLLM, SGLang, TGI, Ollama, llama.cpp)
Open weights under a permissive MIT license, so it can be self-hosted, quantized, and further fine-tuned or distilled without API lock-in
Exposes its chain-of-thought between <think> tags, which is useful for debugging reasoning and for building transparent agent loops
Quantizes well to 4-bit, putting a 70B reasoning model within reach of a single high-memory GPU or a multi-GPU workstation

Best for

Self-hosted reasoning assistant for math, logic, and structured problem-solving where you want to keep data on-prem
Code generation and debugging that benefits from explicit step-by-step thinking (57.5 on LiveCodeBench pass@1)
A drop-in reasoning upgrade for teams already serving Llama 3.3 70B who want chain-of-thought without changing their stack
Generating reasoning traces / synthetic data to further distill smaller student models
Research and evaluation of open reasoning models under a permissive license

How to access

Provider	Model ID
OpenRouter ↗	`deepseek/deepseek-r1-distill-llama-70b`
Groq (deprecated/decommissioned) ↗	`deepseek-r1-distill-llama-70b`

DeepSeek R1 Distill — every version

The full lineage of the DeepSeek R1 Distill line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
DeepSeek-R1-0528-Qwen3-8Bcurrent	2025-05-29	131K	MIT
DeepSeek-R1-Distill-Llama-70B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Qwen-32B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Qwen-14B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Llama-8B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Qwen-7B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Qwen-1.5B	2025-01-20	—	Open weights

FAQ

Is DeepSeek-R1-Distill-Llama-70B the same as DeepSeek-R1?

No. DeepSeek-R1 is the full 671B-parameter Mixture-of-Experts reasoning model. The Llama-70B distill is a separate, much smaller 70.6B dense model: it takes Meta's Llama-3.3-70B-Instruct and fine-tunes it on reasoning samples generated by DeepSeek-R1. It mimics R1's chain-of-thought style at a fraction of the size, but it is not R1 itself.

What license is it released under, and can I use it commercially?

DeepSeek published the distill weights under the MIT license, which permits commercial use, modification, and redistribution. Because the model is derived from Llama-3.3-70B-Instruct, the underlying base weights are also subject to Meta's Llama 3.3 Community License, so review both before deploying.

What hardware do I need to run it?

It is a 70.6B dense model, so full-precision serving needs roughly 140GB+ of GPU memory (multiple GPUs). With 4-bit quantization it can fit on a single high-memory GPU or a multi-GPU workstation, and it runs in standard Llama 3.3 stacks such as vLLM, SGLang, Ollama, and llama.cpp.

Is it still available through hosted APIs?

The weights remain on Hugging Face and can be self-hosted indefinitely. Hosted API access has narrowed over time, though — Groq, an early host, deprecated the model in September 2025 and decommissioned it in February 2026, directing users to newer models. Check each provider's current catalog before relying on it.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// DeepSeek R1 Distill — every version

// FAQ