Overview
gpt-oss-20b is the smaller of OpenAI's two open-weight language models, released August 5, 2025 alongside gpt-oss-120b. It is a 20.9B-parameter Mixture-of-Experts transformer that activates only about 3.6B parameters per token, and it ships under the permissive Apache 2.0 license — part of the first open-weight model release from OpenAI since GPT-2. The weights are freely downloadable from Hugging Face and the openai/gpt-oss GitHub repository.
The headline feature of gpt-oss-20b is that it fits in roughly 16GB of memory. Its MoE weights are quantized to MXFP4 out of the box (a 12.8 GiB checkpoint), so the model runs on a single high-end laptop or consumer GPU rather than a datacenter card. That makes it OpenAI's pick for on-device reasoning, local inference, and rapid iteration. OpenAI reports it delivers results similar to its proprietary o3-mini on common benchmarks, even edging it out on competition math and health questions.
Like its larger sibling, gpt-oss-20b exposes configurable reasoning effort (low / medium / high) with visible chain-of-thought, and natively supports function calling, web browsing, Python tool use, and structured outputs via OpenAI's harmony response format. It is text-only, has a June 2024 knowledge cutoff and a 131,072-token context window. Because it is open-weight rather than served first-party, it runs across many hosts — Ollama, LM Studio, vLLM, llama.cpp, Hugging Face, OpenRouter, Fireworks, Together, AWS, Azure and others — each setting its own price, with several offering a free tier.
| Released | 2025-08-05 |
|---|---|
| License | Apache 2.0 |
| Weights | Open weights |
| Parameters | 20.9B total / 3.6B active (MoE) |
| Context | 131K |
| Max output | 131K |
| Architecture | Mixture-of-Experts transformer with 24 layers and 32 experts, of which the top-4 are active per token, giving roughly 3.6B active parameters out of 20.9B total. Uses Grouped Query Attention, alternating banded-window (128-token bandwidth) and fully dense attention patterns, and rotary position embeddings extended via YaRN to a 131,072-token context. The MoE weights ship in native MXFP4 quantization (~4.25 bits per parameter, a 12.8 GiB checkpoint) so the model runs within about 16GB of memory on a single consumer GPU. Supports configurable reasoning effort (low / medium / high) with full chain-of-thought, and is trained for OpenAI's "harmony" response format. Text-only. |
| Knowledge cutoff | June 2024 |
| Modalities | Text |
| Status | Available |
Benchmarks
- MMLU85.3%
- GPQA Diamond (no tools)71.5%
- AIME 2024 (no tools)92.1%
- AIME 2025 (with tools)98.7%
- SWE-bench Verified60.7%
- HealthBench42.5%
- Humanity's Last Exam (with tools)17.3%
- Tau-Bench Retail54.8%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | $0.029 / 1M tokens per 1M tokens |
|---|---|
| Output | $0.14 / 1M tokens per 1M tokens |
gpt-oss-20b is open-weight, so it is not served first-party by OpenAI — prices vary by host. Figures shown are a representative low rate from OpenRouter providers; several hosts also offer a free tier (e.g. openai/gpt-oss-20b:free). Self-hosting incurs only your own compute cost.
Strengths
- Runs on-device in ~16GB of memory thanks to native MXFP4 quantization (12.8 GiB checkpoint)
- Strong reasoning for its size — comparable to OpenAI o3-mini, beating it on competition math and health
- Permissive Apache 2.0 license allowing commercial use, fine-tuning, and self-hosting
- Efficient MoE design: only ~3.6B of 20.9B parameters active per token for low latency
- Configurable reasoning effort (low/medium/high) with full chain-of-thought visibility
- Native agentic tooling: function calling, web browsing, Python execution, structured outputs
- 131K-token context window for long documents and agent traces
- Fine-tunable on consumer hardware with no vendor lock-in
Best for
- On-device and local reasoning assistants that keep data off the cloud
- Edge and offline deployments where a 16GB memory footprint is the constraint
- Cost-controlled high-volume inference via self-hosting or cheap third-party providers
- Agentic workflows needing tool calling, browsing, and code execution at low latency
- Fine-tuning an open reasoning model for domain-specific applications
- Math, science, and coding tasks where o3-mini-class quality is enough
- Rapid prototyping and research into chain-of-thought with fully visible reasoning traces
How to access
| Provider | Model ID |
|---|---|
| OpenRouter ↗ | openai/gpt-oss-20b |
| Hugging Face ↗ | openai/gpt-oss-20b |
| Ollama ↗ | gpt-oss:20b |
| OpenAI API ↗ | gpt-oss-20b |
gpt-oss (Open Weight) — every version
The full lineage of the gpt-oss (Open Weight) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| gpt-oss-120bcurrent | 2025-08-05 | — | Apache-2.0 |
| gpt-oss-20b | 2025-08-05 | — | Apache-2.0 |
FAQ
What hardware do I need to run gpt-oss-20b?
Because its Mixture-of-Experts weights ship in native MXFP4 quantization (a 12.8 GiB checkpoint), gpt-oss-20b runs within about 16GB of memory — small enough for a high-end laptop or a single consumer GPU. It has 20.9B total parameters but activates only about 3.6B per token, keeping latency low.
Is gpt-oss-20b free and open source?
The weights are released under the permissive Apache 2.0 license, so you can download, run, fine-tune, and deploy gpt-oss-20b commercially without copyleft restrictions. Running it still costs compute, but several hosts (including an OpenRouter free tier) offer it at zero or near-zero cost, and self-hosting incurs only your own hardware cost.
How does gpt-oss-20b compare to OpenAI's o3-mini?
OpenAI reports gpt-oss-20b delivers results similar to its proprietary o3-mini on common benchmarks, and matches or exceeds it on competition mathematics (AIME) and health questions (HealthBench) — while being fully open-weight, self-hostable, and able to run on-device in about 16GB.
Is gpt-oss-20b multimodal?
No. gpt-oss-20b is text-only — it does not accept image, audio, or video input. It supports a 131,072-token context window, configurable reasoning effort (low/medium/high) with visible chain-of-thought, and native tool use including function calling, browsing, and Python execution via OpenAI's harmony response format.