AI/TLDR

Pixtral 12B (24.09)

Mistral AI's first multimodal model — a 12B vision-language model with open Apache 2.0 weights.

Overview

Pixtral 12B was Mistral AI's first multimodal (vision-language) model, announced on 17 September 2024. It pairs a 12-billion-parameter text decoder, built on Mistral NeMo, with a 400-million-parameter vision encoder that Mistral trained from scratch. The combined model reads interleaved images and text in a single 128K-token context window and is released under the permissive Apache 2.0 license with the weights published on Hugging Face as pixtral-12b-2409.

Architecturally, Pixtral 12B was notable for ingesting images at their native resolution and aspect ratio rather than forcing every image into a fixed square. It uses RoPE-2D position encodings in the vision encoder and a 16×16 patch size, so the number of tokens spent on an image scales with the image, and a prompt can contain any number of images. This made it well suited to documents, charts, diagrams, and natural photos in the same conversation.

Mistral offered Pixtral 12B through its La Plateforme API (model id pixtral-12b-2409) and made it freely usable in the Le Chat web app at launch. Mistral has since deprecated the model: it was moved to the legacy list with a 2 December 2025 deprecation and 31 December 2025 API retirement, with Ministral 3 14B named as the recommended replacement. Because the weights are Apache 2.0, the model can still be downloaded and self-hosted via vLLM or mistral-inference.

Released2024-09-17
LicenseApache 2.0
WeightsOpen weights
Parameters~12B decoder + ~400M vision encoder (~12.4B total)
Context128K tokens (131,072)
Max outputNot separately published by Mistral
ArchitectureDecoder-only transformer (12B, based on Mistral NeMo) paired with a 400M-parameter vision encoder trained from scratch. The decoder has 40 layers, hidden dimension 14,336, 32 attention heads, 8 key-value heads (GQA), head dimension 128, and a 131,072-token vocabulary. The vision encoder has 24 layers, hidden dimension 4,096, 16 heads, a 16×16 patch size, and uses RoPE-2D position encodings so it can ingest images at their native resolution and aspect ratio. Image and text tokens are interleaved in a single 128K context, allowing any number of images per prompt.
Knowledge cutoffNot officially published by Mistral
Modalitiestext, image
StatusDeprecated. Mistral marked pixtral-12b-2409 as a legacy model (deprecated 2 Dec 2025, API retirement 31 Dec 2025) and recommends Ministral 3 14B as the successor. The open weights remain downloadable on Hugging Face under Apache 2.0.

Benchmarks

  1. MMMU52%
  2. MathVista58.3%
  3. ChartQA81.8%
  4. DocVQA (ANLS)90.7%
  5. VQAv278.6%
  6. MMLU (5-shot, text)69.2%
  7. HumanEval (text)72%
  8. MATH (text)48.1%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$0.15 / 1M tokens per 1M tokens
Output$0.15 / 1M tokens per 1M tokens

Published API price at the September 2024 launch on La Plateforme; the model was also free to use in Le Chat. Pricing no longer applies after the 31 Dec 2025 API retirement; self-hosting the Apache 2.0 weights is free.

Pricing source ↗

Strengths

  • Open Apache 2.0 weights — free to download, fine-tune, and self-host with no usage restrictions
  • Native variable-resolution image handling (RoPE-2D), so charts, documents and high-res images keep their detail
  • Strong document and chart understanding for its size (DocVQA 90.7, ChartQA 81.8)
  • Retains solid text-only ability — multimodal training did not crater language benchmarks
  • Single 128K context that interleaves text and an arbitrary number of images
  • Self-hostable with mainstream tooling (vLLM, mistral-inference)

Best for

  • Document and form understanding (OCR-style extraction, document QA)
  • Chart, figure, and diagram interpretation
  • Image captioning and visual question answering
  • Multimodal chat assistants that mix text and images
  • On-prem / private-cloud vision workloads where open weights are required
  • A fine-tuning base for custom vision-language tasks

How to access

ProviderModel ID
Mistral AI (La Plateforme) ↗pixtral-12b-2409

Pixtral — every version

The full lineage of the Pixtral line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
Pixtral Large (24.11)current2024-11-18Open weights
Pixtral 12B (24.09)2024-09-17Open weights

FAQ

Is Pixtral 12B still available?

Mistral deprecated the hosted model: pixtral-12b-2409 was placed on the legacy list with a 2 December 2025 deprecation and a 31 December 2025 API retirement, with Ministral 3 14B as the recommended successor. However, because the weights are Apache 2.0, you can still download Pixtral-12B-2409 from Hugging Face and run it yourself.

Is Pixtral 12B open source?

The weights are released under the Apache 2.0 license and are downloadable from Hugging Face, so you can use, fine-tune, and self-host the model commercially without restriction. Mistral did not publish full training data, so it is open-weights rather than fully open-source in the strictest sense.

What did Pixtral 12B cost to use?

At its September 2024 launch on Mistral's La Plateforme, Pixtral 12B was priced at $0.15 per million input tokens and $0.15 per million output tokens, and it was also free to use in the Le Chat web app. That hosted pricing ended with the API retirement on 31 December 2025; self-hosting the open weights is free.

How well does Pixtral 12B perform on benchmarks?

In Mistral's technical report it scored 52.0% on MMMU, 58.3% on MathVista, 81.8% on ChartQA, 90.7% on DocVQA, and 78.6% on VQAv2, while keeping strong text performance (69.2% MMLU, 72.0% HumanEval). It was competitive with larger open vision-language models of its era despite its modest 12B size.