Pixtral 12B (24.09)

Name: Pixtral 12B (24.09)
Author: Mistral AI

Mistral AI's first multimodal model — a 12B vision-language model with open Apache 2.0 weights.

Overview

Pixtral 12B was Mistral AI's first multimodal (vision-language) model, announced on 17 September 2024. It pairs a 12-billion-parameter text decoder, built on Mistral NeMo, with a 400-million-parameter vision encoder that Mistral trained from scratch. The combined model reads interleaved images and text in a single 128K-token context window and is released under the permissive Apache 2.0 license with the weights published on Hugging Face as pixtral-12b-2409.

Architecturally, Pixtral 12B was notable for ingesting images at their native resolution and aspect ratio rather than forcing every image into a fixed square. It uses RoPE-2D position encodings in the vision encoder and a 16×16 patch size, so the number of tokens spent on an image scales with the image, and a prompt can contain any number of images. This made it well suited to documents, charts, diagrams, and natural photos in the same conversation.

Mistral offered Pixtral 12B through its La Plateforme API (model id pixtral-12b-2409) and made it freely usable in the Le Chat web app at launch. Mistral has since deprecated the model: it was moved to the legacy list with a 2 December 2025 deprecation and 31 December 2025 API retirement, with Ministral 3 14B named as the recommended replacement. Because the weights are Apache 2.0, the model can still be downloaded and self-hosted via vLLM or mistral-inference.

Released	2024-09-17
License	Apache 2.0
Weights	Open weights
Parameters	~12B decoder + ~400M vision encoder (~12.4B total)
Context	128K tokens (131,072)
Max output	Not separately published by Mistral
Architecture	Decoder-only transformer (12B, based on Mistral NeMo) paired with a 400M-parameter vision encoder trained from scratch. The decoder has 40 layers, hidden dimension 14,336, 32 attention heads, 8 key-value heads (GQA), head dimension 128, and a 131,072-token vocabulary. The vision encoder has 24 layers, hidden dimension 4,096, 16 heads, a 16×16 patch size, and uses RoPE-2D position encodings so it can ingest images at their native resolution and aspect ratio. Image and text tokens are interleaved in a single 128K context, allowing any number of images per prompt.
Knowledge cutoff	Not officially published by Mistral
Modalities	text, image
Status	Deprecated. Mistral marked pixtral-12b-2409 as a legacy model (deprecated 2 Dec 2025, API retirement 31 Dec 2025) and recommends Ministral 3 14B as the successor. The open weights remain downloadable on Hugging Face under Apache 2.0.

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.15 / 1M tokens per 1M tokens
Output	$0.15 / 1M tokens per 1M tokens

Published API price at the September 2024 launch on La Plateforme; the model was also free to use in Le Chat. Pricing no longer applies after the 31 Dec 2025 API retirement; self-hosting the Apache 2.0 weights is free.

Pricing source ↗

Strengths

Open Apache 2.0 weights — free to download, fine-tune, and self-host with no usage restrictions
Native variable-resolution image handling (RoPE-2D), so charts, documents and high-res images keep their detail
Strong document and chart understanding for its size (DocVQA 90.7, ChartQA 81.8)
Retains solid text-only ability — multimodal training did not crater language benchmarks
Single 128K context that interleaves text and an arbitrary number of images
Self-hostable with mainstream tooling (vLLM, mistral-inference)

Best for

Document and form understanding (OCR-style extraction, document QA)
Chart, figure, and diagram interpretation
Image captioning and visual question answering
Multimodal chat assistants that mix text and images
On-prem / private-cloud vision workloads where open weights are required
A fine-tuning base for custom vision-language tasks

How to access

Provider	Model ID
Mistral AI (La Plateforme) ↗	`pixtral-12b-2409`

Pixtral — every version

The full lineage of the Pixtral line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
Pixtral Large (24.11)current	2024-11-18	—	Open weights
Pixtral 12B (24.09)	2024-09-17	—	Open weights

FAQ

Is Pixtral 12B still available?

Mistral deprecated the hosted model: pixtral-12b-2409 was placed on the legacy list with a 2 December 2025 deprecation and a 31 December 2025 API retirement, with Ministral 3 14B as the recommended successor. However, because the weights are Apache 2.0, you can still download Pixtral-12B-2409 from Hugging Face and run it yourself.

Is Pixtral 12B open source?

The weights are released under the Apache 2.0 license and are downloadable from Hugging Face, so you can use, fine-tune, and self-host the model commercially without restriction. Mistral did not publish full training data, so it is open-weights rather than fully open-source in the strictest sense.

What did Pixtral 12B cost to use?

At its September 2024 launch on Mistral's La Plateforme, Pixtral 12B was priced at $0.15 per million input tokens and $0.15 per million output tokens, and it was also free to use in the Le Chat web app. That hosted pricing ended with the API retirement on 31 December 2025; self-hosting the open weights is free.

How well does Pixtral 12B perform on benchmarks?

In Mistral's technical report it scored 52.0% on MMMU, 58.3% on MathVista, 81.8% on ChartQA, 90.7% on DocVQA, and 78.6% on VQAv2, while keeping strong text performance (69.2% MMLU, 72.0% HumanEval). It was competitive with larger open vision-language models of its era despite its modest 12B size.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// Pixtral — every version

// FAQ