Aya Vision (8B & 32B)

Name: Aya Vision (8B & 32B)
Author: Cohere

Cohere Labs' open-weights multilingual vision-language models, covering 23 languages in two sizes (8B and 32B).

Overview

Aya Vision is Cohere Labs' first vision model line, released on March 4, 2025 as an open-weights research project. It ships in two sizes — Aya Vision 8B and Aya Vision 32B — both built to handle text and image input across 23 languages, with text output. Capabilities span OCR, image captioning, visual question answering, visual reasoning, summarization, translation, and code.

Architecturally, both models pair a SigLIP2-patch14-384 vision encoder with a multilingual language backbone through a multimodal adapter. The 8B variant is initialized from Command R7B for stronger instruction following and world knowledge, while the 32B variant uses Aya Expanse 32B for its multilingual strength. A Pixel Shuffle step compresses image tokens 4x, and the models were trained in two stages: vision-language alignment followed by supervised fine-tuning, partly on synthetic annotations from translated English data.

Cohere Labs released the open weights on Hugging Face and Kaggle under a CC-BY-NC 4.0 (non-commercial) license, and made the models free to try on WhatsApp. They are also served via the Cohere Chat API as c4ai-aya-vision-32b and c4ai-aya-vision-8b. Alongside the models, Cohere released two new multilingual evaluation sets — AyaVisionBench and m-WildVision — to benchmark vision-language models across all 23 languages.

Released	2025-03-04
License	CC-BY-NC 4.0 (with Cohere Lab's Acceptable Use Policy)
Weights	Open weights
Parameters	8B and 32B (two variants)
Context	16K
Max output	4K tokens
Architecture	Vision-language model: SigLIP2-patch14-384 vision encoder connected through a multimodal adapter to a language backbone (Command R7B for the 8B variant, Aya Expanse 32B for the 32B variant). Uses Pixel Shuffle to compress image tokens 4x; 169 tokens per 364x364 tile, up to 12 tiles plus a thumbnail (2,197 image tokens max). Trained in two stages: vision-language alignment, then supervised fine-tuning.
Modalities	Text, Vision
Status	Generally available

Benchmarks

AyaVisionBench win-rate vs models >2x its size (Aya Vision 32B)64% win rate
m-WildVision win-rate vs models >2x its size (Aya Vision 32B)72% win rate
AyaVisionBench win-rate in its class (Aya Vision 8B)79% win rate
m-WildVision (mWildBench) win-rate in its class (Aya Vision 8B)81% win rate

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

Strong multilingual coverage — text and image understanding across 23 languages
Open weights available on Hugging Face and Kaggle for research use
Punches above its parameter class: the 8B variant beats much larger vision models in its class on Cohere's multilingual benchmarks
Two sizes let you trade off quality vs. footprint (8B for efficiency, 32B for top quality)
Free to try on WhatsApp and the Cohere playground

Best for

Multilingual OCR and document understanding
Image captioning and visual question answering
Visual reasoning and image-grounded summarization
Translating text found inside images into coherent text
Research on multilingual multimodal models

How to access

Provider	Model ID
Cohere ↗	`c4ai-aya-vision-32b`
Cohere ↗	`c4ai-aya-vision-8b`

FAQ

What is Aya Vision?

Aya Vision is Cohere Labs' open-weights multilingual vision-language model line, released March 4, 2025. It comes in 8B and 32B sizes, takes text and image input across 23 languages, and produces text output for tasks like OCR, captioning, and visual question answering.

Is Aya Vision free and open-source?

The weights are openly available on Hugging Face and Kaggle under a CC-BY-NC 4.0 license, which permits non-commercial use only. You can also try it free on WhatsApp and the Cohere playground.

How do I access Aya Vision through an API?

Both variants are served through the Cohere Chat API as c4ai-aya-vision-32b and c4ai-aya-vision-8b. The 32B variant is the actively served model on the platform; both sets of open weights remain downloadable from Hugging Face and Kaggle.

How does Aya Vision compare to larger models?

On Cohere's multilingual benchmarks (AyaVisionBench and m-WildVision), Aya Vision 32B reaches win rates of 50-64% on AyaVisionBench and 52-72% on m-WildVision against models more than twice its size, while Aya Vision 8B reaches up to 79% on AyaVisionBench and 81% on mWildBench in its parameter class.

// Overview

// Benchmarks

// Strengths

// Best for

// How to access

// FAQ

Overview

Benchmarks

Strengths

Best for

How to access

FAQ