Overview
Command A Vision (07-2025) is Cohere's first commercial vision-language model, released on July 31, 2025 as the flagship of the Command A Vision line. It pairs a SigLIP2-patch16-512 vision encoder with the existing 111B-parameter Command A text model through a multimodal adapter, for roughly 112B parameters total. It keeps the same Chat API interface as other Command models, so teams can add image understanding to existing Cohere applications with minimal changes.
The model is built for enterprise visual workloads rather than open-ended image chat: it targets chart, graph, and diagram analysis, in-image table understanding, optical character recognition (OCR), document question answering, and object detection. It accepts up to 20 images per request (20MB total) within a 128K-token context window and produces up to 8K output tokens. Command A Vision officially supports six languages — English, Portuguese, Italian, French, German, and Spanish. It takes images as input only and does not generate images, and tool use is not supported with this model.
Command A Vision is released as open weights under a CC-BY-NC (non-commercial) license on Hugging Face for research, while commercial use is available through the Cohere platform and Oracle OCI Generative AI. Cohere positions efficiency as a selling point: the 112B model is designed to run privately on just two A100 GPUs, or a single H100 with 4-bit quantization, which is unusually compact for a model that leads several document and OCR benchmarks.
| Released | 2025-07-31 |
|---|---|
| License | CC-BY-NC 4.0 (open weights, non-commercial) + Cohere commercial API |
| Weights | Open weights |
| Parameters | 112B (111B Command A text tower + SigLIP2 vision encoder) |
| Context | 128K |
| Max output | 8K tokens |
| Architecture | Vision-language model: the SigLIP2-patch16-512 vision encoder feeds visual features through an MLP adapter into the dense 111B-parameter Command A text LLM (~112B total). Images are tiled into up to 12 tiles of 512x512 plus a global thumbnail, consuming up to 3,328 tokens per image. |
| Modalities | Text, Vision |
| Status | Available |
Benchmarks
- DocVQA95.9%
- ChartQA90.9%
- OCRBench86.9%
- AI2D94%
- TextVQA84.8%
- InfoVQA82.9%
- MMMU (CoT)65.3%
- Average (8 vision benchmarks)83.1%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Strengths
- Leads strong proprietary VLMs on document and OCR tasks — DocVQA 95.9%, OCRBench 86.9%, ChartQA 90.9%, AI2D 94.0%
- Compact for its capability: runs on two A100 GPUs, or one H100 with 4-bit quantization
- Open weights under CC-BY-NC for non-commercial research, with a multimodal adapter on top of the proven 111B Command A text tower
- High-resolution document handling via tiling (up to 12 x 512x512 tiles plus a thumbnail) for detailed charts, tables, and scanned pages
- Drop-in Chat API compatibility with other Command models and availability on the Cohere platform plus Oracle OCI
Best for
- Document OCR and question answering over scanned PDFs, forms, and invoices
- Extracting and reasoning over charts, graphs, and diagrams
- Parsing and understanding tables embedded in images
- Multilingual image-text recognition across the six supported languages
- Enterprise document-processing pipelines that need private deployment on modest GPU hardware
How to access
| Provider | Model ID |
|---|---|
| Cohere ↗ | command-a-vision-07-2025 |
| Oracle OCI Generative AI ↗ | cohere.command-a-vision |
| Hugging Face ↗ | CohereLabs/command-a-vision-07-2025 |
FAQ
What is Command A Vision (07-2025)?
It is Cohere's first commercial multimodal model, released July 31, 2025. It adds image understanding to the Command A line by connecting a SigLIP2-patch16-512 vision encoder to the 111B-parameter Command A text model through an adapter (~112B parameters total), and it targets enterprise tasks like chart analysis, table parsing, document OCR, and document question answering.
Is Command A Vision open source?
The weights are published on Hugging Face under a CC-BY-NC (non-commercial) license, so you can use them for research but not commercially. For commercial use, the model is available through Cohere's API and Oracle OCI Generative AI. It is open weights but not a permissive open-source license.
What hardware does Command A Vision need?
Cohere designed the 112B model to run on just two A100 GPUs for private deployment, or a single H100 with 4-bit quantization — compact for a model of its capability.
How does Command A Vision perform on benchmarks?
Cohere reports an 83.1% average across eight vision benchmarks, leading GPT-4.1 (78.6%), Llama 4 Maverick (80.5%), and Mistral Medium 3 (78.3%). It is strongest on document and chart tasks — DocVQA 95.9%, ChartQA 90.9%, OCRBench 86.9% — but trails GPT-4.1 on the general-reasoning MMMU benchmark (65.3% vs 74.8%).
