Overview
MiniMax-Text-01 is the foundational text large language model in MiniMax's MiniMax-01 series, open-sourced on 15 January 2025 alongside the vision-language model MiniMax-VL-01. It is a Mixture-of-Experts model with 456 billion total parameters, of which 45.9 billion are activated per token across 32 experts (Top-2 routing) over 80 layers. MiniMax positions it as the first time linear ("lightning") attention has been scaled to a commercial-grade model of this size.
Its defining trait is a hybrid attention stack: within every block of 8 layers, 7 use lightning (linear) attention and 1 uses standard softmax attention. That mix keeps the cost of processing very long inputs close to linear, which is how MiniMax-Text-01 reaches a 1-million-token context during training and extrapolates to up to 4 million tokens at inference — among the longest contexts of any model at release. MiniMax reports 100% accuracy on a 4M-token Needle-In-A-Haystack retrieval test and a 0.910 RULER score at 1M tokens.
On core academic benchmarks MiniMax-Text-01 lands in the same range as GPT-4o and Claude 3.5 Sonnet — for example 88.5 on MMLU, 94.8 on GSM8K, and 89.1 on both IFEval and Arena-Hard. The weights are released openly under MiniMax's custom Model License Agreement on Hugging Face and GitHub (code is MIT), and MiniMax recommends vLLM for production serving. It is text-only; multimodal input lives in the separate MiniMax-VL-01. The later MiniMax-M1 reasoning model is built on this same Text-01 base.
| Released | 2025-01-15 |
|---|---|
| License | MiniMax Model License Agreement (open weights, custom) |
| Weights | Open weights |
| Parameters | 456B total / 45.9B active (MoE, 32 experts, Top-2 routing, 80 layers) |
| Context | 4M |
| Max output | Not separately published |
| Architecture | Hybrid Mixture-of-Experts (456B total, 45.9B active across 32 experts, Top-2 routing, 80 layers, hidden size 6144). The attention stack interleaves lightning (linear) attention with periodic softmax attention: within every 8 layers, 7 use lightning attention and 1 uses softmax attention, keeping cost near-linear as the sequence grows. Trained at a 1M-token context and extrapolates to 4M tokens at inference. The vision-language sibling MiniMax-VL-01 adds a vision encoder on top of this base. |
| Knowledge cutoff | Not officially stated |
| Modalities | Text |
| Status | Generally available |
Benchmarks
- MMLU88.5%
- MMLU-Pro75.7%
- GPQA Diamond54.4%
- GSM8K94.8%
- MATH77.4%
- IFEval (avg)89.1%
- Arena-Hard89.1%
- C-SimpleQA67.4%
- LongBench v2 (overall, with CoT)56.5%
- RULER (1M tokens)0.91%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | $0.20 / 1M tokens per 1M tokens |
|---|---|
| Output | $1.10 / 1M tokens per 1M tokens |
MiniMax's official open-source announcement lists $0.20 per million input tokens and $1.10 per million output tokens; OpenRouter lists the same $0.20 in / $1.10 out for minimax/minimax-01. Weights are also free to download and self-host under MiniMax's Model License Agreement.
Strengths
- Extremely long context: 1M-token training window that extrapolates to up to 4M tokens at inference — among the longest available at release
- Lightning (linear) attention keeps long-context cost near-linear instead of quadratic, unlike a pure softmax transformer
- Strong long-context retrieval: 100% on a 4M-token Needle-In-A-Haystack test and 0.910 RULER at 1M tokens (per MiniMax)
- Competitive general benchmarks against closed frontier models of its era (MMLU 88.5, GSM8K 94.8, IFEval 89.1, Arena-Hard 89.1)
- Open weights with a low hosted API price ($0.20 in / $1.10 out per 1M tokens)
- Serves as the open base for the MiniMax-M1 reasoning model
Best for
- Long-document and whole-codebase analysis that needs hundreds of thousands to millions of tokens of context
- Retrieval over very large inputs (long PDFs, transcripts, logs) where needle-in-a-haystack accuracy matters
- Self-hosted general-purpose chat and instruction-following where open weights are required
- A base model for fine-tuning or for building reasoning systems (as MiniMax did with M1)
- Cost-sensitive long-context API workloads via MiniMax or OpenRouter
How to access
| Provider | Model ID |
|---|---|
| MiniMax ↗ | MiniMax-Text-01 |
| OpenRouter ↗ | minimax/minimax-01 |
FAQ
How large is the MiniMax-Text-01 context window?
MiniMax-Text-01 is trained at a 1-million-token context and can extrapolate to up to 4 million tokens at inference — among the longest contexts available when it launched. Its lightning (linear) attention is what makes processing such long inputs affordable; MiniMax reports 100% accuracy on a 4M-token Needle-In-A-Haystack retrieval test and a 0.910 RULER score at 1M tokens.
What architecture does MiniMax-Text-01 use?
It is a Mixture-of-Experts model with 456 billion total parameters and 45.9 billion activated per token across 32 experts (Top-2 routing) over 80 layers. Its attention stack is hybrid: within every 8 layers, 7 use lightning (linear) attention and 1 uses standard softmax attention, which keeps long-context cost close to linear.
Is MiniMax-Text-01 open source, and what license applies?
The weights are openly downloadable on Hugging Face and GitHub under MiniMax's custom Model License Agreement (the code is MIT-licensed). It allows self-hosting and commercial use but adds conditions — for example, naming and attribution requirements and a restriction on using outputs to improve other large language models — so it is open-weight rather than a standard OSI license like Apache 2.0.
Does MiniMax-Text-01 support images or audio?
No. MiniMax-Text-01 is text-only. Image understanding is handled by the separate MiniMax-VL-01 vision-language model, which adds a vision encoder on top of the same MiniMax-01 base.