AI/TLDR

MiniMax-Text-01

Open-weight 456B-total / 45.9B-active hybrid-MoE LLM that scales lightning (linear) attention to a 4M-token inference context.

Overview

MiniMax-Text-01 is the foundational text large language model in MiniMax's MiniMax-01 series, open-sourced on 15 January 2025 alongside the vision-language model MiniMax-VL-01. It is a Mixture-of-Experts model with 456 billion total parameters, of which 45.9 billion are activated per token across 32 experts (Top-2 routing) over 80 layers. MiniMax positions it as the first time linear ("lightning") attention has been scaled to a commercial-grade model of this size.

Its defining trait is a hybrid attention stack: within every block of 8 layers, 7 use lightning (linear) attention and 1 uses standard softmax attention. That mix keeps the cost of processing very long inputs close to linear, which is how MiniMax-Text-01 reaches a 1-million-token context during training and extrapolates to up to 4 million tokens at inference — among the longest contexts of any model at release. MiniMax reports 100% accuracy on a 4M-token Needle-In-A-Haystack retrieval test and a 0.910 RULER score at 1M tokens.

On core academic benchmarks MiniMax-Text-01 lands in the same range as GPT-4o and Claude 3.5 Sonnet — for example 88.5 on MMLU, 94.8 on GSM8K, and 89.1 on both IFEval and Arena-Hard. The weights are released openly under MiniMax's custom Model License Agreement on Hugging Face and GitHub (code is MIT), and MiniMax recommends vLLM for production serving. It is text-only; multimodal input lives in the separate MiniMax-VL-01. The later MiniMax-M1 reasoning model is built on this same Text-01 base.

Released2025-01-15
LicenseMiniMax Model License Agreement (open weights, custom)
WeightsOpen weights
Parameters456B total / 45.9B active (MoE, 32 experts, Top-2 routing, 80 layers)
Context4M
Max outputNot separately published
ArchitectureHybrid Mixture-of-Experts (456B total, 45.9B active across 32 experts, Top-2 routing, 80 layers, hidden size 6144). The attention stack interleaves lightning (linear) attention with periodic softmax attention: within every 8 layers, 7 use lightning attention and 1 uses softmax attention, keeping cost near-linear as the sequence grows. Trained at a 1M-token context and extrapolates to 4M tokens at inference. The vision-language sibling MiniMax-VL-01 adds a vision encoder on top of this base.
Knowledge cutoffNot officially stated
ModalitiesText
StatusGenerally available

Benchmarks

  1. MMLU88.5%
  2. MMLU-Pro75.7%
  3. GPQA Diamond54.4%
  4. GSM8K94.8%
  5. MATH77.4%
  6. IFEval (avg)89.1%
  7. Arena-Hard89.1%
  8. C-SimpleQA67.4%
  9. LongBench v2 (overall, with CoT)56.5%
  10. RULER (1M tokens)0.91%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$0.20 / 1M tokens per 1M tokens
Output$1.10 / 1M tokens per 1M tokens

MiniMax's official open-source announcement lists $0.20 per million input tokens and $1.10 per million output tokens; OpenRouter lists the same $0.20 in / $1.10 out for minimax/minimax-01. Weights are also free to download and self-host under MiniMax's Model License Agreement.

Pricing source ↗

Strengths

  • Extremely long context: 1M-token training window that extrapolates to up to 4M tokens at inference — among the longest available at release
  • Lightning (linear) attention keeps long-context cost near-linear instead of quadratic, unlike a pure softmax transformer
  • Strong long-context retrieval: 100% on a 4M-token Needle-In-A-Haystack test and 0.910 RULER at 1M tokens (per MiniMax)
  • Competitive general benchmarks against closed frontier models of its era (MMLU 88.5, GSM8K 94.8, IFEval 89.1, Arena-Hard 89.1)
  • Open weights with a low hosted API price ($0.20 in / $1.10 out per 1M tokens)
  • Serves as the open base for the MiniMax-M1 reasoning model

Best for

  • Long-document and whole-codebase analysis that needs hundreds of thousands to millions of tokens of context
  • Retrieval over very large inputs (long PDFs, transcripts, logs) where needle-in-a-haystack accuracy matters
  • Self-hosted general-purpose chat and instruction-following where open weights are required
  • A base model for fine-tuning or for building reasoning systems (as MiniMax did with M1)
  • Cost-sensitive long-context API workloads via MiniMax or OpenRouter

How to access

ProviderModel ID
MiniMax ↗MiniMax-Text-01
OpenRouter ↗minimax/minimax-01

FAQ

How large is the MiniMax-Text-01 context window?

MiniMax-Text-01 is trained at a 1-million-token context and can extrapolate to up to 4 million tokens at inference — among the longest contexts available when it launched. Its lightning (linear) attention is what makes processing such long inputs affordable; MiniMax reports 100% accuracy on a 4M-token Needle-In-A-Haystack retrieval test and a 0.910 RULER score at 1M tokens.

What architecture does MiniMax-Text-01 use?

It is a Mixture-of-Experts model with 456 billion total parameters and 45.9 billion activated per token across 32 experts (Top-2 routing) over 80 layers. Its attention stack is hybrid: within every 8 layers, 7 use lightning (linear) attention and 1 uses standard softmax attention, which keeps long-context cost close to linear.

Is MiniMax-Text-01 open source, and what license applies?

The weights are openly downloadable on Hugging Face and GitHub under MiniMax's custom Model License Agreement (the code is MIT-licensed). It allows self-hosting and commercial use but adds conditions — for example, naming and attribution requirements and a restriction on using outputs to improve other large language models — so it is open-weight rather than a standard OSI license like Apache 2.0.

Does MiniMax-Text-01 support images or audio?

No. MiniMax-Text-01 is text-only. Image understanding is handled by the separate MiniMax-VL-01 vision-language model, which adds a vision encoder on top of the same MiniMax-01 base.