MiniMax-Text-01

Name: MiniMax-Text-01
Author: MiniMax

Open-weight 456B-total / 45.9B-active hybrid-MoE LLM that scales lightning (linear) attention to a 4M-token inference context.

Overview

MiniMax-Text-01 is the foundational text large language model in MiniMax's MiniMax-01 series, open-sourced on 15 January 2025 alongside the vision-language model MiniMax-VL-01. It is a Mixture-of-Experts model with 456 billion total parameters, of which 45.9 billion are activated per token across 32 experts (Top-2 routing) over 80 layers. MiniMax positions it as the first time linear ("lightning") attention has been scaled to a commercial-grade model of this size.

Its defining trait is a hybrid attention stack: within every block of 8 layers, 7 use lightning (linear) attention and 1 uses standard softmax attention. That mix keeps the cost of processing very long inputs close to linear, which is how MiniMax-Text-01 reaches a 1-million-token context during training and extrapolates to up to 4 million tokens at inference — among the longest contexts of any model at release. MiniMax reports 100% accuracy on a 4M-token Needle-In-A-Haystack retrieval test and a 0.910 RULER score at 1M tokens.

On core academic benchmarks MiniMax-Text-01 lands in the same range as GPT-4o and Claude 3.5 Sonnet — for example 88.5 on MMLU, 94.8 on GSM8K, and 89.1 on both IFEval and Arena-Hard. The weights are released openly under MiniMax's custom Model License Agreement on Hugging Face and GitHub (code is MIT), and MiniMax recommends vLLM for production serving. It is text-only; multimodal input lives in the separate MiniMax-VL-01. The later MiniMax-M1 reasoning model is built on this same Text-01 base.

Released	2025-01-15
License	MiniMax Model License Agreement (open weights, custom)
Weights	Open weights
Parameters	456B total / 45.9B active (MoE, 32 experts, Top-2 routing, 80 layers)
Context	4M
Max output	Not separately published
Architecture	Hybrid Mixture-of-Experts (456B total, 45.9B active across 32 experts, Top-2 routing, 80 layers, hidden size 6144). The attention stack interleaves lightning (linear) attention with periodic softmax attention: within every 8 layers, 7 use lightning attention and 1 uses softmax attention, keeping cost near-linear as the sequence grows. Trained at a 1M-token context and extrapolates to 4M tokens at inference. The vision-language sibling MiniMax-VL-01 adds a vision encoder on top of this base.
Knowledge cutoff	Not officially stated
Modalities	Text
Status	Generally available

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.20 / 1M tokens per 1M tokens
Output	$1.10 / 1M tokens per 1M tokens

MiniMax's official open-source announcement lists $0.20 per million input tokens and $1.10 per million output tokens; OpenRouter lists the same $0.20 in / $1.10 out for minimax/minimax-01. Weights are also free to download and self-host under MiniMax's Model License Agreement.

Pricing source ↗

Strengths

Extremely long context: 1M-token training window that extrapolates to up to 4M tokens at inference — among the longest available at release
Lightning (linear) attention keeps long-context cost near-linear instead of quadratic, unlike a pure softmax transformer
Strong long-context retrieval: 100% on a 4M-token Needle-In-A-Haystack test and 0.910 RULER at 1M tokens (per MiniMax)
Competitive general benchmarks against closed frontier models of its era (MMLU 88.5, GSM8K 94.8, IFEval 89.1, Arena-Hard 89.1)
Open weights with a low hosted API price ($0.20 in / $1.10 out per 1M tokens)
Serves as the open base for the MiniMax-M1 reasoning model

Best for

Long-document and whole-codebase analysis that needs hundreds of thousands to millions of tokens of context
Retrieval over very large inputs (long PDFs, transcripts, logs) where needle-in-a-haystack accuracy matters
Self-hosted general-purpose chat and instruction-following where open weights are required
A base model for fine-tuning or for building reasoning systems (as MiniMax did with M1)
Cost-sensitive long-context API workloads via MiniMax or OpenRouter

How to access

Provider	Model ID
MiniMax ↗	`MiniMax-Text-01`
OpenRouter ↗	`minimax/minimax-01`

FAQ

How large is the MiniMax-Text-01 context window?

MiniMax-Text-01 is trained at a 1-million-token context and can extrapolate to up to 4 million tokens at inference — among the longest contexts available when it launched. Its lightning (linear) attention is what makes processing such long inputs affordable; MiniMax reports 100% accuracy on a 4M-token Needle-In-A-Haystack retrieval test and a 0.910 RULER score at 1M tokens.

What architecture does MiniMax-Text-01 use?

It is a Mixture-of-Experts model with 456 billion total parameters and 45.9 billion activated per token across 32 experts (Top-2 routing) over 80 layers. Its attention stack is hybrid: within every 8 layers, 7 use lightning (linear) attention and 1 uses standard softmax attention, which keeps long-context cost close to linear.

Is MiniMax-Text-01 open source, and what license applies?

The weights are openly downloadable on Hugging Face and GitHub under MiniMax's custom Model License Agreement (the code is MIT-licensed). It allows self-hosting and commercial use but adds conditions — for example, naming and attribution requirements and a restriction on using outputs to improve other large language models — so it is open-weight rather than a standard OSI license like Apache 2.0.

Does MiniMax-Text-01 support images or audio?

No. MiniMax-Text-01 is text-only. Image understanding is handled by the separate MiniMax-VL-01 vision-language model, which adds a vision encoder on top of the same MiniMax-01 base.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// FAQ