DeepSeek-R1-Distill-Qwen-14B

Name: DeepSeek-R1-Distill-Qwen-14B
Author: DeepSeek

A 14B open-weight reasoning model that fine-tunes Qwen2.5-14B on 800K chain-of-thought traces from DeepSeek-R1, putting R1-style step-by-step reasoning on a single consumer GPU.

Overview

DeepSeek-R1-Distill-Qwen-14B is a 14-billion-parameter open-weight reasoning model that DeepSeek released on January 20, 2025, alongside its flagship DeepSeek-R1. Rather than being trained with reinforcement learning from scratch, it is a distillation: DeepSeek took the open-source Qwen2.5-14B model as the base and fine-tuned it on roughly 800,000 reasoning samples generated by the full DeepSeek-R1. The result is a dense model that produces the same explicit, visible chain-of-thought (wrapped in <think> tags) as R1, but small enough to run on a single high-memory consumer or workstation GPU.

It is one of six checkpoints in the original DeepSeek-R1-Distill family, which spans 1.5B, 7B, 8B, 14B, 32B (all Qwen or Llama-based) and a 70B Llama variant. The 14B sits in the middle of the lineup — markedly stronger than the 7B and 8B distills on math and code, while needing far less memory than the 32B and 70B versions. DeepSeek reported that the distilled dense models reach state-of-the-art results for their size and that this 14B checkpoint outperforms OpenAI's o1-mini on several reasoning benchmarks, an unusually strong showing for a model in this parameter range.

Because it inherits Qwen2.5-14B's architecture, the model is a standard dense transformer (not a Mixture-of-Experts) with a 128K-token context window, and DeepSeek caps generation at 32,768 tokens. DeepSeek recommends running it with a temperature around 0.6 (within a 0.5-0.7 range), top-p 0.95, and no system prompt — putting all instructions in the user turn — to avoid repetition loops. The weights are published on Hugging Face under the MIT License, and the model is widely available through inference providers such as Together AI and OpenRouter.

Released	2025-01-20
License	MIT License (DeepSeek code repository and model weights). The model is derived from Qwen2.5-14B, which is originally released under the Apache 2.0 License; commercial use and modification are permitted.
Weights	Open weights
Parameters	14B (dense; based on Qwen2.5-14B)
Context	128K (some serving platforms list 131K / 131,072 tokens)
Max output	32,768 tokens (DeepSeek's recommended maximum generation length for R1 models)
Architecture	Dense decoder-only transformer (not Mixture-of-Experts). The base model is Qwen2.5-14B, fine-tuned by DeepSeek on ~800,000 reasoning samples generated by the full DeepSeek-R1. It produces an explicit chain-of-thought wrapped in <think>...</think> tags before the final answer. DeepSeek recommends temperature 0.5-0.7 (0.6 recommended), top-p 0.95, and no system prompt (instructions in the user turn) to avoid repetition.
Knowledge cutoff	Not separately disclosed by DeepSeek; the base Qwen2.5-14B was pretrained on data up to 2024.
Modalities	Text
Status	available

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.18 / 1M tokens (Together AI) per 1M tokens
Output	$0.18 / 1M tokens (Together AI) per 1M tokens

DeepSeek does not run a first-party hosted endpoint for the distill models; pricing is set by third-party inference providers. Together AI charges a flat $0.18 per million input and output tokens. The weights are MIT-licensed and free to self-host. Reasoning models emit verbose chain-of-thought, so output-token usage per request tends to be higher than non-reasoning models.

Pricing source ↗

Strengths

Strong math and reasoning for its size — 93.9% on MATH-500 and 69.7% pass@1 on AIME 2024, with DeepSeek reporting it beats OpenAI o1-mini on several benchmarks
Emits an explicit, readable chain-of-thought (in <think> tags) you can inspect and verify, distilled directly from full DeepSeek-R1
Open weights under the permissive MIT License, allowing commercial use, modification, and self-hosting
Compact enough (14B, dense) to run on a single high-memory consumer or workstation GPU, especially when quantized to 4-bit (GGUF / AWQ builds are widely available)
Large 128K-token context window inherited from the Qwen2.5-14B base
Cheap to serve through third-party APIs (e.g. ~$0.18 per million tokens on Together AI), with no per-token cost when self-hosted

Best for

Self-hosted, privacy-sensitive reasoning workloads where a small open-weight model with visible chain-of-thought is preferred over a closed API
Multi-step math, logic, and competition-style problem solving (AIME / MATH-class problems)
Coding and algorithmic-reasoning tasks within assistant or agent pipelines
Generating reasoning traces and synthetic data on commodity hardware without calling a frontier API
Local experimentation and research on distilled reasoning models, including further fine-tuning under the MIT License
Cost-sensitive deployments that need R1-style reasoning at a fraction of the size and price of the full DeepSeek-R1

How to access

Provider	Model ID
Together AI ↗	`deepseek-ai/DeepSeek-R1-Distill-Qwen-14B`
OpenRouter ↗	`deepseek/deepseek-r1-distill-qwen-14b`
Hugging Face (self-host) ↗	`deepseek-ai/DeepSeek-R1-Distill-Qwen-14B`

DeepSeek R1 Distill — every version

The full lineage of the DeepSeek R1 Distill line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
DeepSeek-R1-0528-Qwen3-8Bcurrent	2025-05-29	131K	MIT
DeepSeek-R1-Distill-Llama-70B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Qwen-32B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Qwen-14B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Llama-8B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Qwen-7B	2025-01-20	—	Open weights
DeepSeek-R1-Distill-Qwen-1.5B	2025-01-20	—	Open weights

FAQ

Is DeepSeek-R1-Distill-Qwen-14B the same as DeepSeek-R1?

No. DeepSeek-R1 is a very large Mixture-of-Experts reasoning model. DeepSeek-R1-Distill-Qwen-14B is a much smaller 14B dense model: DeepSeek took the open-source Qwen2.5-14B and fine-tuned it on about 800,000 reasoning examples generated by the full R1. It mimics R1's chain-of-thought reasoning style at a fraction of the size, but it is not as capable as the full R1 model.

What license is DeepSeek-R1-Distill-Qwen-14B released under?

The DeepSeek code repository and the model weights are released under the MIT License, which permits commercial use, modification, and redistribution. Because the model is derived from Qwen2.5-14B (originally Apache 2.0), users should also respect that base license. The weights are downloadable on Hugging Face.

What hardware do I need to run it?

As a 14B dense model it is much lighter than the full R1. In full 16-bit precision it needs roughly 28-30GB of GPU memory, but quantized 4-bit GGUF or AWQ builds (widely available from the community) bring that down to around 10GB, letting it run on a single high-memory consumer or workstation GPU.

How good is it compared to other models?

DeepSeek reported that the distilled dense models set new state-of-the-art results for their size, and that the 14B checkpoint outperforms OpenAI's o1-mini on several reasoning benchmarks. It scores 69.7% on AIME 2024 (pass@1), 93.9% on MATH-500, 59.1% on GPQA Diamond, and 53.1% on LiveCodeBench — strong math and reasoning numbers for a 14B model, though it trails larger frontier reasoning models.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// DeepSeek R1 Distill — every version

// FAQ