Overview
Aya Vision is Cohere Labs' first vision model line, released on March 4, 2025 as an open-weights research project. It ships in two sizes — Aya Vision 8B and Aya Vision 32B — both built to handle text and image input across 23 languages, with text output. Capabilities span OCR, image captioning, visual question answering, visual reasoning, summarization, translation, and code.
Architecturally, both models pair a SigLIP2-patch14-384 vision encoder with a multilingual language backbone through a multimodal adapter. The 8B variant is initialized from Command R7B for stronger instruction following and world knowledge, while the 32B variant uses Aya Expanse 32B for its multilingual strength. A Pixel Shuffle step compresses image tokens 4x, and the models were trained in two stages: vision-language alignment followed by supervised fine-tuning, partly on synthetic annotations from translated English data.
Cohere Labs released the open weights on Hugging Face and Kaggle under a CC-BY-NC 4.0 (non-commercial) license, and made the models free to try on WhatsApp. They are also served via the Cohere Chat API as c4ai-aya-vision-32b and c4ai-aya-vision-8b. Alongside the models, Cohere released two new multilingual evaluation sets — AyaVisionBench and m-WildVision — to benchmark vision-language models across all 23 languages.
| Released | 2025-03-04 |
|---|---|
| License | CC-BY-NC 4.0 (with Cohere Lab's Acceptable Use Policy) |
| Weights | Open weights |
| Parameters | 8B and 32B (two variants) |
| Context | 16K |
| Max output | 4K tokens |
| Architecture | Vision-language model: SigLIP2-patch14-384 vision encoder connected through a multimodal adapter to a language backbone (Command R7B for the 8B variant, Aya Expanse 32B for the 32B variant). Uses Pixel Shuffle to compress image tokens 4x; 169 tokens per 364x364 tile, up to 12 tiles plus a thumbnail (2,197 image tokens max). Trained in two stages: vision-language alignment, then supervised fine-tuning. |
| Modalities | Text, Vision |
| Status | Generally available |
Benchmarks
- AyaVisionBench win-rate vs models >2x its size (Aya Vision 32B)64% win rate
- m-WildVision win-rate vs models >2x its size (Aya Vision 32B)72% win rate
- AyaVisionBench win-rate in its class (Aya Vision 8B)79% win rate
- m-WildVision (mWildBench) win-rate in its class (Aya Vision 8B)81% win rate
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Strengths
- Strong multilingual coverage — text and image understanding across 23 languages
- Open weights available on Hugging Face and Kaggle for research use
- Punches above its parameter class: the 8B variant beats much larger vision models in its class on Cohere's multilingual benchmarks
- Two sizes let you trade off quality vs. footprint (8B for efficiency, 32B for top quality)
- Free to try on WhatsApp and the Cohere playground
Best for
- Multilingual OCR and document understanding
- Image captioning and visual question answering
- Visual reasoning and image-grounded summarization
- Translating text found inside images into coherent text
- Research on multilingual multimodal models
How to access
FAQ
What is Aya Vision?
Aya Vision is Cohere Labs' open-weights multilingual vision-language model line, released March 4, 2025. It comes in 8B and 32B sizes, takes text and image input across 23 languages, and produces text output for tasks like OCR, captioning, and visual question answering.
Is Aya Vision free and open-source?
The weights are openly available on Hugging Face and Kaggle under a CC-BY-NC 4.0 license, which permits non-commercial use only. You can also try it free on WhatsApp and the Cohere playground.
How do I access Aya Vision through an API?
Both variants are served through the Cohere Chat API as c4ai-aya-vision-32b and c4ai-aya-vision-8b. The 32B variant is the actively served model on the platform; both sets of open weights remain downloadable from Hugging Face and Kaggle.
How does Aya Vision compare to larger models?
On Cohere's multilingual benchmarks (AyaVisionBench and m-WildVision), Aya Vision 32B reaches win rates of 50-64% on AyaVisionBench and 52-72% on m-WildVision against models more than twice its size, while Aya Vision 8B reaches up to 79% on AyaVisionBench and 81% on mWildBench in its parameter class.
