MiniMax-M1 (M1-40k / M1-80k)

Name: MiniMax-M1 (M1-40k / M1-80k)
Author: MiniMax

The first open-weight, large-scale hybrid-attention reasoning model — 456B-param MoE with a native 1M-token context.

Overview

MiniMax-M1 is, per MiniMax, the world's first open-weight, large-scale hybrid-attention reasoning model. It is built on the MiniMax-Text-01 base and uses a Mixture-of-Experts design with 456 billion total parameters, of which 45.9 billion are activated per token across 32 experts. Its defining trait is a hybrid attention stack: a softmax-attention transformer block follows every seven 'lightning' (linear) attention blocks, which gives near-linear cost as the sequence grows. MiniMax reports that at a 100K-token generation length M1 uses roughly 25% of the FLOPs that DeepSeek-R1 would.

M1 ships in two variants that differ only in their reasoning (thinking) budget: MiniMax-M1-40k and MiniMax-M1-80k. Both natively support a 1-million-token context window — eight times that of DeepSeek-R1 and on par with closed models like Gemini 2.5 Pro. The 80k variant generally scores a little higher on reasoning and coding, while the 40k variant is cheaper to run and actually leads on some long-context and agent benchmarks. Both were trained with reinforcement learning using MiniMax's CISPO algorithm, which clips importance-sampling weights rather than token updates; the full RL run took 512 H800 GPUs about three weeks at a reported rental cost of $534,700.

The weights are released under the permissive Apache 2.0 license on Hugging Face and GitHub, and MiniMax recommends vLLM (0.9.2+) or Transformers for deployment. M1 is text-only. It is positioned for long-context understanding, software-engineering tasks, and agentic tool use, where MiniMax reports it tops other open-weight models and is competitive with leading proprietary systems.

Released	2025-06-16
License	Apache 2.0
Weights	Open weights
Parameters	456B total / 45.9B active (MoE, 32 experts)
Context	1M
Max output	80k tokens (M1-80k); 40k tokens (M1-40k)
Architecture	Hybrid Mixture-of-Experts (32 experts) with lightning (linear) attention, built on MiniMax-Text-01: one softmax-attention transformer block follows every seven lightning-attention blocks. Trained with large-scale RL using the CISPO algorithm.
Knowledge cutoff	June 2024
Modalities	Text
Status	Generally available

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.40 / 1M tokens (0-200k context); $1.30 / 1M tokens (200k-1M context) per 1M tokens
Output	$2.20 / 1M tokens per 1M tokens

Tiered input pricing by context length from MiniMax's official announcement; OpenRouter lists $0.40 in / $2.20 out. Free unlimited use is offered on the MiniMax app and web.

Pricing source ↗

Strengths

Native 1M-token context — among the largest of any open-weight model, matching closed frontier models
Efficient long-form reasoning: lightning attention cuts FLOPs to ~25% of DeepSeek-R1 at 100K-token generation
Strong agentic tool use — leads open-weight models on TAU-bench and beats Gemini 2.5 Pro on parts of it
Apache 2.0 license with fully open weights — free commercial use and self-hosting
Two thinking budgets (40k/80k) let you trade reasoning depth against cost
Competitive coding and math (AIME 2024 86.0%, SWE-bench Verified 56.0% on the 80k variant)

Best for

Long-document and whole-codebase analysis that needs hundreds of thousands of tokens of context
Agentic tool-use and function-calling workflows
Software engineering: bug fixing and repo-level tasks (SWE-bench-style)
Math and competition-style reasoning
Self-hosted reasoning deployments where an open Apache-2.0 license is required

How to access

Provider	Model ID
MiniMax ↗	`MiniMax-M1`
OpenRouter ↗	`minimax/minimax-m1`

FAQ

What is the difference between MiniMax-M1-40k and MiniMax-M1-80k?

They are the same 456B-parameter model trained with two different reasoning (thinking) budgets: 40,000 tokens versus 80,000 tokens. The 80k variant generally scores slightly higher on reasoning and coding benchmarks, while the cheaper 40k variant leads on some long-context and agentic tool-use tasks. Both share the same 1M-token context window.

Is MiniMax-M1 open source and free to use?

The weights are released under the Apache 2.0 license on Hugging Face and GitHub, so you can download, self-host, and use them commercially for free. MiniMax also offers a hosted API (tiered pricing) and free unlimited use through its own app and website.

How large is the context window?

MiniMax-M1 natively supports a 1-million-token context window — about eight times DeepSeek-R1's and comparable to closed frontier models. Its lightning (linear) attention makes processing very long inputs far cheaper than standard softmax attention.

Does MiniMax-M1 support images or audio?

No. MiniMax-M1 is a text-only reasoning model. It is designed for long-context text understanding, coding, math, and agentic tool use rather than multimodal input.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// FAQ