AI/TLDR

Kimi K2 Thinking

Moonshot AI's open-weight trillion-parameter reasoning agent that chains 200-300 tool calls.

Overview

Kimi K2 Thinking is Moonshot AI's flagship open-weight reasoning model, released on November 6, 2025. It extends the Kimi K2 line into an explicit "thinking" agent: a trillion-parameter Mixture-of-Experts model that activates 32 billion parameters per token and interleaves chain-of-thought reasoning with tool use over long horizons.

What sets Kimi K2 Thinking apart is sustained agentic execution. Moonshot reports that it can run 200 to 300 sequential tool calls without human intervention, reasoning coherently across hundreds of steps to research, plan, and solve multi-stage problems. It pairs this with a 256K-token context window and native INT4 quantization (trained via Quantization-Aware Training), which cuts memory and latency without losing accuracy.

Released under a Modified MIT License with full open weights on Hugging Face, Kimi K2 Thinking was the first open model to match or beat leading closed systems on several agentic and reasoning benchmarks at launch. It is served through Moonshot's own Kimi API as well as third-party providers, and is also usable directly in the Kimi chat app.

Released2025-11-06
LicenseModified MIT License
WeightsOpen weights
Parameters1T total / 32B active (MoE)
Context256K
Max output256K
ArchitectureMixture-of-Experts (MoE) with 1 trillion total parameters and 32 billion activated per token. 61 layers (1 dense), 384 routed experts with 8 selected per token plus 1 shared expert, 64 attention heads, 7168 attention hidden dimension, 160K vocabulary, Multi-head Latent Attention (MLA), and SwiGLU activation. Ships with native INT4 weights via Quantization-Aware Training (QAT) for roughly 2x faster inference at the same quality.
ModalitiesText
StatusAvailable

Benchmarks

Official Moonshot AI benchmark comparison for Kimi K2 Thinking versus GPT-5 (High), Claude Sonnet 4.5 (Thinking), Kimi K2 0905, DeepSeek-V3.2, and Grok-4. Values are exactly as published; an asterisk (*) marks scores Moonshot re-tested under their own conditions, and null marks cells with no published score. Showing 20 of 24 published benchmarks.

BenchmarkKimi K2 ThinkingGPT-5 (High)Claude Sonnet 4.5 (Thinking)Kimi K2 0905DeepSeek-V3.2Grok-4
Humanity's Last Exam (Text-only), no tools23.9%26.3%19.8*%7.9%19.8%25.4%
Humanity's Last Exam (Text-only), w/ tools44.9%41.7%32.0*%21.7%20.3*%41%
AIME 2025, no tools94.5%94.6%87%51%89.3%91.7%
AIME 2025, w/ python99.1%99.6%100%75.2%58.1*%98.8%
HMMT 2025, no tools89.4%93.3%74.6*%38.8%83.6%90%
HMMT 2025, w/ python95.1%96.7%88.8*%70.4%49.5*%93.9%
IMO-AnswerBench, no tools78.6%76.0*%65.9*%45.8%76.0*%73.1%
GPQA-Diamond, no tools84.5%85.7%83.4%74.2%79.9%87.5%
MMLU-Pro, no tools84.6%87.1%87.5%81.9%85%
MMLU-Redux, no tools94.4%95.3%95.6%92.7%93.7%
Longform Writing, no tools73.8%71.4%79.8%62.8%72.5%
HealthBench, no tools58%67.2%44.2%43.8%46.9%
BrowseComp, w/ tools60.2%54.9%24.1%7.4%40.1%
BrowseComp-ZH, w/ tools62.3%63*%42.4*%22.2%47.9%
Seal-0, w/ tools56.3%51.4*%53.4*%25.2%38.5*%
FinSearchComp-T3, w/ tools47.4%48.5*%44.0*%10.4%27.0*%
Frames, w/ tools87%86.0*%85.0*%58.1%80.2*%
SWE-bench Verified, w/ tools71.3%74.9%77.2%69.2%67.8%
SWE-bench Multilingual, w/ tools61.1%55.3*%68%55.9%57.9%
Multi-SWE-bench, w/ tools41.9%39.3*%44.3%33.5%30.6%

Comparison source ↗

This model's scores

  1. Humanity's Last Exam (with tools)44.9%
  2. Humanity's Last Exam (text-only, no tools)23.9%
  3. BrowseComp60.2%
  4. BrowseComp-ZH62.3%
  5. SWE-bench Verified71.3%
  6. LiveCodeBench v683.1%
  7. AIME 2025 (with Python)99.1%
  8. HMMT 2025 (with Python)95.1%
  9. GPQA84.5%
  10. MMLU-Pro84.6%
  11. Artificial Analysis Intelligence Index33index

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$0.60 / 1M tokens per 1M tokens
Output$2.50 / 1M tokens per 1M tokens

Pricing per OpenRouter listing and Artificial Analysis median across providers; the model is also available with open weights for self-hosting.

Pricing source ↗

Strengths

  • Long-horizon agentic execution: stable across 200-300 sequential tool calls
  • Strong reasoning with tools (44.9% on Humanity's Last Exam with tools)
  • State-of-the-art agentic search (60.2% BrowseComp) for an open model
  • Open weights under a permissive Modified MIT License
  • 256K-token context window for large codebases and documents
  • Native INT4 quantization for cheaper, faster deployment without quality loss

Best for

  • Autonomous research agents that browse, gather, and synthesize across many steps
  • Multi-step coding and software-engineering tasks with tool orchestration
  • Deep reasoning over math, science, and competition-style problems
  • Self-hosted or private deployment where open weights are required
  • Long-document analysis and codebase understanding using the 256K context

How to access

ProviderModel ID
Moonshot AI (Kimi API) ↗kimi-k2-thinking
OpenRouter ↗moonshotai/kimi-k2-thinking
Together AI ↗kimi-k2-thinking
Amazon Bedrock ↗kimi-k2-thinking

FAQ

Is Kimi K2 Thinking open source?

Yes. Moonshot AI released Kimi K2 Thinking with open weights on Hugging Face under a Modified MIT License, so you can download and self-host it. The license adds an attribution requirement for very large-scale commercial deployments.

What makes Kimi K2 Thinking different from the original Kimi K2?

Kimi K2 Thinking is an explicit reasoning variant. Instead of answering directly, it produces extended chain-of-thought and interleaves tool calls, sustaining 200 to 300 sequential tool calls across a single task. It shares the trillion-parameter MoE architecture but is tuned for long-horizon, agentic problem solving.

How big is Kimi K2 Thinking and what context window does it support?

It is a Mixture-of-Experts model with 1 trillion total parameters and 32 billion activated per token, and it supports a 256K-token context window. It ships with native INT4 weights via Quantization-Aware Training for faster, cheaper inference.

How much does the Kimi K2 Thinking API cost?

Listed pricing is about $0.60 per million input tokens and $2.50 per million output tokens (per OpenRouter and Artificial Analysis). Because the weights are open, you can also run it on your own hardware instead of paying per token.