Command A Vision (07-2025)

Name: Command A Vision (07-2025)
Author: Cohere

Cohere's first multimodal Command model — a 112B vision-language system tuned for enterprise document, chart, and OCR understanding that runs on two GPUs.

Overview

Command A Vision (07-2025) is Cohere's first commercial vision-language model, released on July 31, 2025 as the flagship of the Command A Vision line. It pairs a SigLIP2-patch16-512 vision encoder with the existing 111B-parameter Command A text model through a multimodal adapter, for roughly 112B parameters total. It keeps the same Chat API interface as other Command models, so teams can add image understanding to existing Cohere applications with minimal changes.

The model is built for enterprise visual workloads rather than open-ended image chat: it targets chart, graph, and diagram analysis, in-image table understanding, optical character recognition (OCR), document question answering, and object detection. It accepts up to 20 images per request (20MB total) within a 128K-token context window and produces up to 8K output tokens. Command A Vision officially supports six languages — English, Portuguese, Italian, French, German, and Spanish. It takes images as input only and does not generate images, and tool use is not supported with this model.

Command A Vision is released as open weights under a CC-BY-NC (non-commercial) license on Hugging Face for research, while commercial use is available through the Cohere platform and Oracle OCI Generative AI. Cohere positions efficiency as a selling point: the 112B model is designed to run privately on just two A100 GPUs, or a single H100 with 4-bit quantization, which is unusually compact for a model that leads several document and OCR benchmarks.

Released	2025-07-31
License	CC-BY-NC 4.0 (open weights, non-commercial) + Cohere commercial API
Weights	Open weights
Parameters	112B (111B Command A text tower + SigLIP2 vision encoder)
Context	128K
Max output	8K tokens
Architecture	Vision-language model: the SigLIP2-patch16-512 vision encoder feeds visual features through an MLP adapter into the dense 111B-parameter Command A text LLM (~112B total). Images are tiled into up to 12 tiles of 512x512 plus a global thumbnail, consuming up to 3,328 tokens per image.
Modalities	Text, Vision
Status	Available

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

Leads strong proprietary VLMs on document and OCR tasks — DocVQA 95.9%, OCRBench 86.9%, ChartQA 90.9%, AI2D 94.0%
Compact for its capability: runs on two A100 GPUs, or one H100 with 4-bit quantization
Open weights under CC-BY-NC for non-commercial research, with a multimodal adapter on top of the proven 111B Command A text tower
High-resolution document handling via tiling (up to 12 x 512x512 tiles plus a thumbnail) for detailed charts, tables, and scanned pages
Drop-in Chat API compatibility with other Command models and availability on the Cohere platform plus Oracle OCI

Best for

Document OCR and question answering over scanned PDFs, forms, and invoices
Extracting and reasoning over charts, graphs, and diagrams
Parsing and understanding tables embedded in images
Multilingual image-text recognition across the six supported languages
Enterprise document-processing pipelines that need private deployment on modest GPU hardware

How to access

Provider	Model ID
Cohere ↗	`command-a-vision-07-2025`
Oracle OCI Generative AI ↗	`cohere.command-a-vision`
Hugging Face ↗	`CohereLabs/command-a-vision-07-2025`

FAQ

What is Command A Vision (07-2025)?

It is Cohere's first commercial multimodal model, released July 31, 2025. It adds image understanding to the Command A line by connecting a SigLIP2-patch16-512 vision encoder to the 111B-parameter Command A text model through an adapter (~112B parameters total), and it targets enterprise tasks like chart analysis, table parsing, document OCR, and document question answering.

Is Command A Vision open source?

The weights are published on Hugging Face under a CC-BY-NC (non-commercial) license, so you can use them for research but not commercially. For commercial use, the model is available through Cohere's API and Oracle OCI Generative AI. It is open weights but not a permissive open-source license.

What hardware does Command A Vision need?

Cohere designed the 112B model to run on just two A100 GPUs for private deployment, or a single H100 with 4-bit quantization — compact for a model of its capability.

How does Command A Vision perform on benchmarks?

Cohere reports an 83.1% average across eight vision benchmarks, leading GPT-4.1 (78.6%), Llama 4 Maverick (80.5%), and Mistral Medium 3 (78.3%). It is strongest on document and chart tasks — DocVQA 95.9%, ChartQA 90.9%, OCRBench 86.9% — but trails GPT-4.1 on the general-reasoning MMMU benchmark (65.3% vs 74.8%).

// Overview

// Benchmarks

// Strengths

// Best for

// How to access

// FAQ

Overview

Benchmarks

Strengths

Best for

How to access

FAQ