Overview
Pixtral 12B was Mistral AI's first multimodal (vision-language) model, announced on 17 September 2024. It pairs a 12-billion-parameter text decoder, built on Mistral NeMo, with a 400-million-parameter vision encoder that Mistral trained from scratch. The combined model reads interleaved images and text in a single 128K-token context window and is released under the permissive Apache 2.0 license with the weights published on Hugging Face as pixtral-12b-2409.
Architecturally, Pixtral 12B was notable for ingesting images at their native resolution and aspect ratio rather than forcing every image into a fixed square. It uses RoPE-2D position encodings in the vision encoder and a 16×16 patch size, so the number of tokens spent on an image scales with the image, and a prompt can contain any number of images. This made it well suited to documents, charts, diagrams, and natural photos in the same conversation.
Mistral offered Pixtral 12B through its La Plateforme API (model id pixtral-12b-2409) and made it freely usable in the Le Chat web app at launch. Mistral has since deprecated the model: it was moved to the legacy list with a 2 December 2025 deprecation and 31 December 2025 API retirement, with Ministral 3 14B named as the recommended replacement. Because the weights are Apache 2.0, the model can still be downloaded and self-hosted via vLLM or mistral-inference.
| Released | 2024-09-17 |
|---|---|
| License | Apache 2.0 |
| Weights | Open weights |
| Parameters | ~12B decoder + ~400M vision encoder (~12.4B total) |
| Context | 128K tokens (131,072) |
| Max output | Not separately published by Mistral |
| Architecture | Decoder-only transformer (12B, based on Mistral NeMo) paired with a 400M-parameter vision encoder trained from scratch. The decoder has 40 layers, hidden dimension 14,336, 32 attention heads, 8 key-value heads (GQA), head dimension 128, and a 131,072-token vocabulary. The vision encoder has 24 layers, hidden dimension 4,096, 16 heads, a 16×16 patch size, and uses RoPE-2D position encodings so it can ingest images at their native resolution and aspect ratio. Image and text tokens are interleaved in a single 128K context, allowing any number of images per prompt. |
| Knowledge cutoff | Not officially published by Mistral |
| Modalities | text, image |
| Status | Deprecated. Mistral marked pixtral-12b-2409 as a legacy model (deprecated 2 Dec 2025, API retirement 31 Dec 2025) and recommends Ministral 3 14B as the successor. The open weights remain downloadable on Hugging Face under Apache 2.0. |
Benchmarks
- MMMU52%
- MathVista58.3%
- ChartQA81.8%
- DocVQA (ANLS)90.7%
- VQAv278.6%
- MMLU (5-shot, text)69.2%
- HumanEval (text)72%
- MATH (text)48.1%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | $0.15 / 1M tokens per 1M tokens |
|---|---|
| Output | $0.15 / 1M tokens per 1M tokens |
Published API price at the September 2024 launch on La Plateforme; the model was also free to use in Le Chat. Pricing no longer applies after the 31 Dec 2025 API retirement; self-hosting the Apache 2.0 weights is free.
Strengths
- Open Apache 2.0 weights — free to download, fine-tune, and self-host with no usage restrictions
- Native variable-resolution image handling (RoPE-2D), so charts, documents and high-res images keep their detail
- Strong document and chart understanding for its size (DocVQA 90.7, ChartQA 81.8)
- Retains solid text-only ability — multimodal training did not crater language benchmarks
- Single 128K context that interleaves text and an arbitrary number of images
- Self-hostable with mainstream tooling (vLLM, mistral-inference)
Best for
- Document and form understanding (OCR-style extraction, document QA)
- Chart, figure, and diagram interpretation
- Image captioning and visual question answering
- Multimodal chat assistants that mix text and images
- On-prem / private-cloud vision workloads where open weights are required
- A fine-tuning base for custom vision-language tasks
How to access
| Provider | Model ID |
|---|---|
| Mistral AI (La Plateforme) ↗ | pixtral-12b-2409 |
Pixtral — every version
The full lineage of the Pixtral line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| Pixtral Large (24.11)current | 2024-11-18 | — | Open weights |
| Pixtral 12B (24.09) | 2024-09-17 | — | Open weights |
FAQ
Is Pixtral 12B still available?
Mistral deprecated the hosted model: pixtral-12b-2409 was placed on the legacy list with a 2 December 2025 deprecation and a 31 December 2025 API retirement, with Ministral 3 14B as the recommended successor. However, because the weights are Apache 2.0, you can still download Pixtral-12B-2409 from Hugging Face and run it yourself.
Is Pixtral 12B open source?
The weights are released under the Apache 2.0 license and are downloadable from Hugging Face, so you can use, fine-tune, and self-host the model commercially without restriction. Mistral did not publish full training data, so it is open-weights rather than fully open-source in the strictest sense.
What did Pixtral 12B cost to use?
At its September 2024 launch on Mistral's La Plateforme, Pixtral 12B was priced at $0.15 per million input tokens and $0.15 per million output tokens, and it was also free to use in the Le Chat web app. That hosted pricing ended with the API retirement on 31 December 2025; self-hosting the open weights is free.
How well does Pixtral 12B perform on benchmarks?
In Mistral's technical report it scored 52.0% on MMMU, 58.3% on MathVista, 81.8% on ChartQA, 90.7% on DocVQA, and 78.6% on VQAv2, while keeping strong text performance (69.2% MMLU, 72.0% HumanEval). It was competitive with larger open vision-language models of its era despite its modest 12B size.