gpt-oss-20b

Name: gpt-oss-20b
Author: OpenAI

OpenAI's 21B open-weight MoE reasoning model that runs in 16GB

Overview

gpt-oss-20b is the smaller of OpenAI's two open-weight language models, released August 5, 2025 alongside gpt-oss-120b. It is a 20.9B-parameter Mixture-of-Experts transformer that activates only about 3.6B parameters per token, and it ships under the permissive Apache 2.0 license — part of the first open-weight model release from OpenAI since GPT-2. The weights are freely downloadable from Hugging Face and the openai/gpt-oss GitHub repository.

The headline feature of gpt-oss-20b is that it fits in roughly 16GB of memory. Its MoE weights are quantized to MXFP4 out of the box (a 12.8 GiB checkpoint), so the model runs on a single high-end laptop or consumer GPU rather than a datacenter card. That makes it OpenAI's pick for on-device reasoning, local inference, and rapid iteration. OpenAI reports it delivers results similar to its proprietary o3-mini on common benchmarks, even edging it out on competition math and health questions.

Like its larger sibling, gpt-oss-20b exposes configurable reasoning effort (low / medium / high) with visible chain-of-thought, and natively supports function calling, web browsing, Python tool use, and structured outputs via OpenAI's harmony response format. It is text-only, has a June 2024 knowledge cutoff and a 131,072-token context window. Because it is open-weight rather than served first-party, it runs across many hosts — Ollama, LM Studio, vLLM, llama.cpp, Hugging Face, OpenRouter, Fireworks, Together, AWS, Azure and others — each setting its own price, with several offering a free tier.

Released	2025-08-05
License	Apache 2.0
Weights	Open weights
Parameters	20.9B total / 3.6B active (MoE)
Context	131K
Max output	131K
Architecture	Mixture-of-Experts transformer with 24 layers and 32 experts, of which the top-4 are active per token, giving roughly 3.6B active parameters out of 20.9B total. Uses Grouped Query Attention, alternating banded-window (128-token bandwidth) and fully dense attention patterns, and rotary position embeddings extended via YaRN to a 131,072-token context. The MoE weights ship in native MXFP4 quantization (~4.25 bits per parameter, a 12.8 GiB checkpoint) so the model runs within about 16GB of memory on a single consumer GPU. Supports configurable reasoning effort (low / medium / high) with full chain-of-thought, and is trained for OpenAI's "harmony" response format. Text-only.
Knowledge cutoff	June 2024
Modalities	Text
Status	Available

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.029 / 1M tokens per 1M tokens
Output	$0.14 / 1M tokens per 1M tokens

gpt-oss-20b is open-weight, so it is not served first-party by OpenAI — prices vary by host. Figures shown are a representative low rate from OpenRouter providers; several hosts also offer a free tier (e.g. openai/gpt-oss-20b:free). Self-hosting incurs only your own compute cost.

Pricing source ↗

Strengths

Runs on-device in ~16GB of memory thanks to native MXFP4 quantization (12.8 GiB checkpoint)
Strong reasoning for its size — comparable to OpenAI o3-mini, beating it on competition math and health
Permissive Apache 2.0 license allowing commercial use, fine-tuning, and self-hosting
Efficient MoE design: only ~3.6B of 20.9B parameters active per token for low latency
Configurable reasoning effort (low/medium/high) with full chain-of-thought visibility
Native agentic tooling: function calling, web browsing, Python execution, structured outputs
131K-token context window for long documents and agent traces
Fine-tunable on consumer hardware with no vendor lock-in

Best for

On-device and local reasoning assistants that keep data off the cloud
Edge and offline deployments where a 16GB memory footprint is the constraint
Cost-controlled high-volume inference via self-hosting or cheap third-party providers
Agentic workflows needing tool calling, browsing, and code execution at low latency
Fine-tuning an open reasoning model for domain-specific applications
Math, science, and coding tasks where o3-mini-class quality is enough
Rapid prototyping and research into chain-of-thought with fully visible reasoning traces

How to access

Provider	Model ID
OpenRouter ↗	`openai/gpt-oss-20b`
Hugging Face ↗	`openai/gpt-oss-20b`
Ollama ↗	`gpt-oss:20b`
OpenAI API ↗	`gpt-oss-20b`

gpt-oss (Open Weight) — every version

The full lineage of the gpt-oss (Open Weight) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
gpt-oss-120bcurrent	2025-08-05	—	Apache-2.0
gpt-oss-20b	2025-08-05	—	Apache-2.0

FAQ

What hardware do I need to run gpt-oss-20b?

Because its Mixture-of-Experts weights ship in native MXFP4 quantization (a 12.8 GiB checkpoint), gpt-oss-20b runs within about 16GB of memory — small enough for a high-end laptop or a single consumer GPU. It has 20.9B total parameters but activates only about 3.6B per token, keeping latency low.

Is gpt-oss-20b free and open source?

The weights are released under the permissive Apache 2.0 license, so you can download, run, fine-tune, and deploy gpt-oss-20b commercially without copyleft restrictions. Running it still costs compute, but several hosts (including an OpenRouter free tier) offer it at zero or near-zero cost, and self-hosting incurs only your own hardware cost.

How does gpt-oss-20b compare to OpenAI's o3-mini?

OpenAI reports gpt-oss-20b delivers results similar to its proprietary o3-mini on common benchmarks, and matches or exceeds it on competition mathematics (AIME) and health questions (HealthBench) — while being fully open-weight, self-hostable, and able to run on-device in about 16GB.

Is gpt-oss-20b multimodal?

No. gpt-oss-20b is text-only — it does not accept image, audio, or video input. It supports a 131,072-token context window, configurable reasoning effort (low/medium/high) with visible chain-of-thought, and native tool use including function calling, browsing, and Python execution via OpenAI's harmony response format.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// gpt-oss (Open Weight) — every version

// FAQ