Vision Models for OCR & Document AI: GPT-4o, Claude, Gemini

In plain English

Traditional OCR (Optical Character Recognition) is a translator: it looks at a scanned image, recognises each character, and outputs raw text. That raw text then has to go through a second, separate system that figures out what the words mean — which number is the invoice total, which row is whose address, which column is the quantity. Two pipelines, two failure points.

A vision language model like GPT-4o, Claude, or Gemini collapses that into one step. You hand it an image of a document — a scanned invoice, a PDF slide, a photo of a handwritten form — and ask it a question in plain English: "Extract the line items as JSON" or "What is the total amount due?" The model reads the pixels, understands the layout and context, and returns a structured answer, all in a single API call.

This is called vision-based Document AI, and it has become one of the most commercially valuable uses of large multimodal models. The catch is that it requires good prompting and a realistic understanding of where these models beat traditional pipelines and where they don't.

Why it matters

Most enterprise data is locked in documents. Invoices, contracts, medical records, insurance claims, shipping manifests, research papers — almost none of it arrives as clean, structured rows in a database. Before vision models, processing these documents at scale required:

An OCR engine (e.g. Tesseract, AWS Textract, Google Document AI) to extract raw text.
A layout parser to identify regions — headers, tables, footers — from bounding-box coordinates.
A custom extraction model or regex rules to pull specific fields from the extracted text.
Post-processing logic to validate, normalise, and handle errors.

Each stage needs maintenance and training data, and each brings its own failure modes. Vision models replace most of that stack with a single model you instruct in natural language. For many use cases, especially those with varied document layouts or occasional handwriting, this is dramatically faster to build and easier to adapt.

Invoice and receipt processing. Extract line items, totals, vendor names, and dates across thousands of invoice layouts with no per-layout training.
Contract and legal document review. Find specific clauses, dates, and parties across dense multi-page PDFs.
Medical form digitisation. Read handwritten or mixed printed and handwritten patient forms into structured records.
Research and report summarisation. Pull key figures, tables, and conclusions from academic PDFs.
ID and credential verification. Extract fields from passports, driving licences, and ID cards.

How it works

The basic pipeline for vision-based document extraction is straightforward. You convert your document to an image (or use the model's native PDF support where available), send that image alongside a carefully written prompt via the model's API, and receive structured text — ideally JSON — in return.

// Vision-based document extraction pipeline

Source document: PDF, scan, photo, DOCX
Convert to image: PDF → PNG pages, or use native PDF input
Build prompt: instruction + JSON schema + image
Vision model API: GPT-4o, Claude, Gemini
Structured output: JSON, Markdown table, plain text
Validate & store: parse, check required fields, write to DB

Step 1: prepare the image

Most vision model APIs accept JPEG or PNG images up to a few MB. PDFs need to be rasterised first; the pdf2image Python library (backed by Poppler) is a common choice. Aim for 150–300 DPI for printed documents, and go higher (300–400 DPI) for handwriting or small print. Too low and characters blur; too high wastes tokens and increases latency.

python

from pdf2image import convert_from_path
import base64, io

# Convert first page of a PDF to a base64-encoded PNG
pages = convert_from_path("invoice.pdf", dpi=200)
buffer = io.BytesIO()
pages[0].save(buffer, format="PNG")
image_b64 = base64.b64encode(buffer.getvalue()).decode()

Step 2: write a high-quality extraction prompt

The prompt is the most important variable in vision document AI. A vague instruction like "extract the data" will produce inconsistent results. A structured prompt with a clear task, an explicit output format, and a JSON schema example reliably outperforms it.

python

SYSTEM_PROMPT = """
You are a document extraction assistant. Extract structured data from the document image.
Return ONLY a valid JSON object matching the schema below. Do not add commentary.

Schema:
{
  "vendor_name": "string",
  "invoice_number": "string",
  "invoice_date": "YYYY-MM-DD",
  "due_date": "YYYY-MM-DD or null",
  "line_items": [
    {
      "description": "string",
      "quantity": "number",
      "unit_price": "number",
      "total": "number"
    }
  ],
  "subtotal": "number",
  "tax": "number or null",
  "total_due": "number"
}

If a field is not present in the document, use null.
"""

Step 3: call the API

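The API call follows the same pattern for most providers: you pass the base64-encoded image alongside the text prompt. The request shape differs slightly between SDKs (OpenAI, for example, expects the image as a base64 data URL, while Anthropic takes a base64 source object). Here is an example using the Anthropic SDK: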

python

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=SYSTEM_PROMPT,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_b64
                    }
                },
                {
                    "type": "text",
                    "text": "Extract all invoice data from this document."
                }
            ]
        }
    ]
)

import json

# Parse the model's reply; json.loads raises json.JSONDecodeError if the reply is not valid JSON
result = json.loads(message.content[0].text)
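The final `json.loads` call above assumes the reply is bare JSON. Below is a minimal sketch of a more defensive parse, assuming the model may occasionally wrap its answer in a Markdown code fence. The names `parse_invoice_json`, `missing_required`, and `REQUIRED_FIELDS` are illustrative and not part of any SDK.

```python
import json

# Hypothetical list of fields this pipeline treats as mandatory.
REQUIRED_FIELDS = ("vendor_name", "invoice_number", "invoice_date", "total_due")

# Three backticks, built programmatically to keep this example easy to embed.
FENCE = "`" * 3


def parse_invoice_json(raw_text: str) -> dict:
    """Parse a model reply into a dict, tolerating a surrounding code fence."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()
        # Drop the opening fence line and, if present, the closing fence line.
        if lines[-1].strip() == FENCE:
            lines = lines[1:-1]
        else:
            lines = lines[1:]
        text = "\n".join(lines)
    data = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


def missing_required(data: dict) -> list:
    """Return the required fields that are absent or null."""
    return [name for name in REQUIRED_FIELDS if data.get(name) is None]
```

In the call above, `parse_invoice_json(message.content[0].text)` would stand in for the direct `json.loads`, and a non-empty `missing_required(...)` result could send the document to a retry or a review queue.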

Prompting strategies that improve accuracy

The difference between 70% and 95% extraction accuracy often comes down to the prompt. Five techniques tend to move the needle:

1. Provide an explicit JSON schema

Listing the exact output schema (field names, types, and a note about nulls) reduces hallucination and schema drift. The model treats the schema as a checklist, so fields that are missing from the document are more likely to come back as null than as a fabricated value.

2. Define entities you care about

Short, concise entity definitions outperform both no definitions and overly verbose ones. For example: "due_date: the payment deadline, printed as 'Due by' or 'Payment due'. Format as YYYY-MM-DD." This tells the model exactly which text to look for and how to normalise it.
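One way to keep definitions short and consistent is to store them as data and render them into the prompt. This is a sketch only: `FIELD_DEFINITIONS` and `render_definitions` are hypothetical names, and the wording of each definition other than `due_date` is an assumption you would adapt to your own documents.

```python
# Hypothetical field definitions; only the due_date wording comes from the text above.
FIELD_DEFINITIONS = {
    "invoice_date": "the date the invoice was issued. Format as YYYY-MM-DD.",
    "due_date": "the payment deadline, printed as 'Due by' or 'Payment due'. "
                "Format as YYYY-MM-DD.",
    "total_due": "the final amount payable.",
}


def render_definitions(definitions: dict) -> str:
    """Render one short definition per line, ready to append to a system prompt."""
    lines = ["Field definitions:"]
    for name, text in definitions.items():
        lines.append(f"- {name}: {text}")
    return "\n".join(lines)


# Example: build the block that would be appended after the schema.
definitions_block = render_definitions(FIELD_DEFINITIONS)
```

Keeping definitions in a dictionary also makes it easy to review them field by field when an eval run shows one field underperforming.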

3. One-shot or few-shot examples

Including one complete input/output example in your system prompt dramatically improves format compliance. Show a fictional document snippet alongside the ideal JSON output. The model calibrates its output format to match your example rather than guessing.
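A minimal sketch of attaching a one-shot example to an existing prompt follows. The example document and output are fictional, and `with_one_shot` is a hypothetical helper rather than a feature of any provider's SDK.

```python
import json

# Fictional document snippet and its ideal output, both made up for illustration.
EXAMPLE_DOCUMENT = "ACME Supplies  Invoice #A-1001  Date: 3 March 2025  Total due: 120.00"
EXAMPLE_OUTPUT = {
    "vendor_name": "ACME Supplies",
    "invoice_number": "A-1001",
    "invoice_date": "2025-03-03",
    "total_due": 120.00,
}


def with_one_shot(base_prompt: str, example_doc: str, example_output: dict) -> str:
    """Append a single worked example (input and ideal JSON) to a prompt."""
    example_json = json.dumps(example_output, indent=2)
    return (
        f"{base_prompt.strip()}\n\n"
        "Example document text:\n"
        f"{example_doc}\n\n"
        "Example output:\n"
        f"{example_json}\n"
    )
```

Serialising the example with `json.dumps` rather than typing it by hand guarantees the example itself is valid JSON, so the model is not shown a malformed target.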

4. Ask for Markdown tables when structure matters

For table-heavy documents (financial statements, data sheets), asking the model to return a Markdown table first and then parsing that table separately can be more reliable than asking for JSON directly. The model's visual table-reading accuracy is high; JSON serialisation is where errors creep in.
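The parsing half of that approach can be sketched in a few lines of plain Python. This assumes a simple pipe-delimited table with a header row and no escaped pipes inside cells; `parse_markdown_table` is an illustrative name.

```python
def parse_markdown_table(markdown: str) -> list:
    """Convert a simple Markdown table into a list of row dicts keyed by header."""
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip any prose the model added around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(cells)
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[1:]

    def is_separator(row: list) -> bool:
        # A separator row looks like |---|:---:| (only dashes and colons).
        return all(cell and set(cell) <= set("-:") for cell in row)

    return [dict(zip(header, row)) for row in body if not is_separator(row)]
```

All values come back as strings, so numeric columns still need converting and validating before they are stored.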

5. Chain-of-thought for complex layouts

For multi-page documents or ambiguous layouts, prepend "Think step by step: first identify all sections, then extract fields from each section" to your instruction. Chain-of-thought prompting helps models navigate complex structures without jumping to conclusions.
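Asking for reasoning means the reply is no longer bare JSON, so any "return ONLY JSON" instruction has to be relaxed and the final answer separated from the reasoning. One possible convention, assumed here purely for illustration, is to ask the model to write a `FINAL_JSON:` marker before its answer. The names `COT_INSTRUCTION`, `MARKER`, and `extract_final_json` are hypothetical.

```python
import json

MARKER = "FINAL_JSON:"  # assumed marker; any unambiguous delimiter would do

COT_INSTRUCTION = (
    "Think step by step: first identify all sections, then extract fields "
    "from each section. Write your reasoning first. Then write the line "
    f"{MARKER} followed by the JSON object and nothing else."
)


def extract_final_json(reply: str) -> dict:
    """Return the JSON object that follows the last marker in the reply."""
    _, separator, tail = reply.rpartition(MARKER)
    if not separator:
        raise ValueError(f"reply does not contain {MARKER!r}")
    return json.loads(tail)  # raises json.JSONDecodeError if the tail is not JSON
```

Using the last occurrence of the marker guards against the model mentioning the marker inside its reasoning.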

Choosing the right model

GPT-4o, Claude, and Gemini all perform well at document extraction, but they have different strengths. Here is a practical comparison based on 2025–2026 benchmarks and practitioner experience:

// Vision model comparison for document AI

GPT-4o

Strong JSON schema compliance
JSON mode forces valid output
Good all-round printed text accuracy
Fastest among flagship models
Higher cost per page at scale

Claude (Sonnet/Opus)

Low hallucination rate on structured fields
Excellent for long documents and contracts
Strong reasoning about document context
Native PDF support in API (no rasterising)
More conservative — may return null vs. guess

Gemini 2.5 Pro

Top benchmark scores for printed text
Large context window for multi-page docs
Competitive on handwriting accuracy
Native PDF and multi-image support
Best cost/accuracy ratio at high volume

Vision models vs traditional OCR tools

Traditional OCR tools like AWS Textract, Google Document AI, and Azure Form Recognizer are purpose-built for specific document types. They are faster (milliseconds vs seconds), cheaper per page at high volume, and offer specialised extractors for common formats like W-2 forms or driving licences. Vision models beat them on flexibility — novel layouts, mixed content, reasoning questions — but lose on throughput and unit economics.

// Traditional OCR vs Vision model tradeoffs

Traditional OCR (Textract, Document AI)

Milliseconds per page
Low cost at high volume
Deterministic, easy to test
Requires per-layout training for custom fields
Poor on handwriting and novel layouts
No reasoning — extracts text, not meaning

Vision LLM (GPT-4o, Claude, Gemini)

Seconds per page
Higher cost per page
Non-deterministic — test with eval sets
Zero training needed — describe in the prompt
Handles handwriting and variable layouts
Understands context and can reason across fields

Going deeper

Once you have a basic extraction pipeline working, there are several directions to explore depending on your use case.

Multi-page documents

For long PDFs, you have two options. Native PDF input (supported by Claude and Gemini) lets you send the entire PDF as a single API input; the model handles page segmentation internally. Page-by-page processing with aggregation gives you more control — process each page independently, then merge the JSON outputs. The latter is more predictable for very long documents and lets you parallelize requests.
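For the page-by-page route, the merge step might look like the sketch below. It assumes each page returns a dict in the invoice schema used earlier, concatenates `line_items` in page order, and lets the first non-null value win for every other field. That policy is an assumption; for fields such as totals, which often appear on the last page, you may prefer the last non-null value instead.

```python
def merge_page_results(pages: list) -> dict:
    """Merge per-page extraction dicts (in page order) into one record."""
    merged = {"line_items": []}
    for page in pages:
        for key, value in page.items():
            if key == "line_items":
                merged["line_items"].extend(value or [])
            elif value is not None and merged.get(key) is None:
                merged[key] = value  # first non-null value wins
            else:
                merged.setdefault(key, None)  # remember the field even if null
    return merged
```

Because each page is independent, the per-page API calls can be issued concurrently and merged once all replies are in.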

Evaluation and quality measurement

Build an eval set of 50–100 documents with known ground-truth values and run your pipeline against it before going to production. Measure field-level accuracy (not just full-record accuracy), track null rates, and log confidence by document type. An LLM-as-a-judge can help grade free-text fields where exact-match comparison doesn't work.
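Field-level accuracy and null rates are straightforward to compute once predictions and ground truth are stored as parallel lists of dicts. The sketch below uses exact-match comparison only; `field_accuracy` and `null_rate` are illustrative names.

```python
def field_accuracy(predictions: list, ground_truth: list, fields: list) -> dict:
    """Fraction of documents where each field exactly matches the ground truth."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    total = len(ground_truth)
    if total == 0:
        return {}
    scores = {}
    for field in fields:
        correct = sum(
            1 for pred, truth in zip(predictions, ground_truth)
            if pred.get(field) == truth.get(field)
        )
        scores[field] = correct / total
    return scores


def null_rate(predictions: list, fields: list) -> dict:
    """Fraction of documents where each field came back null or missing."""
    total = len(predictions)
    if total == 0:
        return {}
    return {
        field: sum(1 for pred in predictions if pred.get(field) is None) / total
        for field in fields
    }
```

A field with high accuracy but a rising null rate usually points at a layout the prompt does not cover, which is easier to spot per field than in a single full-record score.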

Image pre-processing for accuracy

On scanned or photographed documents, pre-processing can significantly improve results: deskew (correct rotation), binarise (convert to black and white), and remove background noise before sending to the model. The Python opencv-python and Pillow libraries cover most of these needs. Even a simple deskew step can recover several accuracy percentage points on difficult scans.
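A minimal Pillow-only sketch of the binarise step, with an optional rotation, is shown below. It does not detect the skew angle itself; the `angle` argument is assumed to come from elsewhere (for example an OpenCV-based estimate), and the default `threshold` of 160 is an arbitrary starting point rather than a recommended value.

```python
from PIL import Image, ImageOps


def preprocess_scan(image: Image.Image, threshold: int = 160, angle: float = 0.0) -> Image.Image:
    """Greyscale, stretch contrast, optionally rotate, then binarise a scan."""
    gray = ImageOps.grayscale(image)
    gray = ImageOps.autocontrast(gray)  # stretch the darkest/lightest pixels to 0/255
    if angle:
        # Rotate counter-clockwise by `angle` degrees, padding with white.
        gray = gray.rotate(angle, expand=True, fillcolor=255)
    # Pixels at or above the threshold become white, the rest black.
    return gray.point(lambda value: 255 if value >= threshold else 0)
```

It is worth measuring this step on your eval set before adopting it, since aggressive thresholding can erase faint handwriting or stamps.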

Agentic document workflows

For complex document workflows — classify the document type, route it to a specialist extraction prompt, validate the result, then request human review if confidence is low — an AI agent with tool use is a natural fit. The vision model becomes one tool in a broader pipeline rather than the whole system.
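The control flow of such a workflow can be sketched without committing to any agent framework. In the sketch below, `classify`, `extract`, and `validate` are caller-supplied functions (each would wrap a model call or a rule check), `PROMPTS` is a hypothetical routing table, and validation errors stand in for "low confidence" as the trigger for human review.

```python
from typing import Callable

# Hypothetical routing table from document type to a specialist prompt.
PROMPTS = {
    "invoice": "Extract invoice fields as JSON.",
    "contract": "Extract parties, dates, and key clauses as JSON.",
}


def process_document(
    image_b64: str,
    classify: Callable[[str], str],
    extract: Callable[[str, str], dict],
    validate: Callable[[dict], list],
) -> dict:
    """Classify, route to a specialist prompt, validate, and flag for review."""
    doc_type = classify(image_b64)
    prompt = PROMPTS.get(doc_type)
    if prompt is None:
        # Unknown document types go straight to a human.
        return {"doc_type": doc_type, "data": None,
                "needs_review": True, "errors": ["unknown document type"]}
    data = extract(image_b64, prompt)
    errors = validate(data)
    return {"doc_type": doc_type, "data": data,
            "needs_review": bool(errors), "errors": errors}
```

Passing the three steps in as functions keeps the routing logic testable with stubs, independent of which model or provider sits behind each step.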

FAQ

Can vision models replace Tesseract or AWS Textract for OCR?

For simple, high-volume, uniform documents, dedicated OCR tools like Tesseract, AWS Textract, or Google Document AI are still faster and cheaper. Vision models are the better choice when documents have variable layouts, mixed printed and handwritten content, or require reasoning to extract meaning — not just raw text. Many production pipelines use both: OCR for the easy cases, a vision model for the hard ones.

Which is more accurate for OCR: GPT-4o, Claude, or Gemini?

Based on 2025–2026 benchmarks, all three perform well on printed text. Gemini 2.5 Pro and Claude Sonnet consistently lead on printed-document accuracy. GPT-4o is strong on JSON schema compliance. For handwriting, GPT-4o and Gemini 2.5 Pro are generally top performers. The differences are small enough that your document type and prompting quality often matter more than the model choice. Always test on a representative sample of your own documents.

How do I extract tables from a PDF using a vision model?

Convert the PDF page to an image at 200–300 DPI and ask the model to return the table as a Markdown table. Provide a clear prompt such as 'Extract the table on this page as a Markdown table with headers.' You can then parse the Markdown table into CSV or JSON programmatically. For tables that span multiple pages, process each page separately and concatenate the rows.

How do I prevent hallucination when extracting document fields?

Include a JSON schema with explicit null handling ('if a field is not present, return null'), add concise entity definitions, and validate critical numeric fields with cross-checks (e.g., line item totals should sum to subtotal). Building a small eval set and measuring null rates and field accuracy helps surface hallucination patterns early. Structured output modes (OpenAI's JSON mode, or Pydantic-backed parsing) add another layer of safety.
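The numeric cross-check mentioned above might be sketched as follows, using the invoice schema from earlier. The second check (subtotal plus tax equals total due) is an added assumption that will not hold for invoices with discounts or shipping lines, and the one-cent tolerance is likewise an illustrative default.

```python
from decimal import Decimal


def check_totals(invoice: dict, tolerance: str = "0.01") -> list:
    """Return a list of arithmetic inconsistencies found in an extracted invoice."""
    problems = []
    allowed = Decimal(tolerance)
    items = invoice.get("line_items") or []
    subtotal = invoice.get("subtotal")
    tax = invoice.get("tax")
    total_due = invoice.get("total_due")

    # Check 1: line item totals should sum to the subtotal.
    if subtotal is not None and items and all(i.get("total") is not None for i in items):
        items_sum = sum(Decimal(str(i["total"])) for i in items)
        if abs(items_sum - Decimal(str(subtotal))) > allowed:
            problems.append("line item totals do not sum to subtotal")

    # Check 2 (assumption): subtotal + tax should equal total_due.
    if subtotal is not None and total_due is not None:
        expected = Decimal(str(subtotal)) + Decimal(str(tax or 0))
        if abs(expected - Decimal(str(total_due))) > allowed:
            problems.append("subtotal plus tax does not equal total_due")
    return problems
```

Values are converted through `str` into `Decimal` so that floats such as 0.1 and 0.2 do not introduce binary rounding noise into the comparison.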

Does Claude support native PDF input without converting to images?

Yes. The Anthropic API supports PDFs as a document type alongside images. You can pass a PDF directly and Claude will process its pages without you needing to rasterise them first. This simplifies the pipeline for multi-page documents. Gemini also supports native PDF input. GPT-4o requires the PDF to be converted to images first, or you can use the Assistants API with file retrieval for text-layer PDFs.
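As a sketch, a PDF content block for the Anthropic Messages API can be built as below. The block shape mirrors the image block used earlier, with `"type": "document"` and an `application/pdf` media type; this reflects the Anthropic documentation as understood at the time of writing, so verify it against the current API reference. `pdf_content_block` is a hypothetical helper name.

```python
import base64


def pdf_content_block(pdf_bytes: bytes) -> dict:
    """Build a base64 document block to place in a user message's content list."""
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": base64.standard_b64encode(pdf_bytes).decode("utf-8"),
        },
    }
```

The returned dict takes the place of the image block in the `content` list, followed by the usual text block carrying the instruction.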

What DPI should I use when converting PDFs to images for vision model OCR?

150–200 DPI is sufficient for most printed documents and balances image quality with file size and token cost. Use 300 DPI for documents with small print, handwriting, or fine table borders. Avoid going above 400 DPI — it increases image size significantly without meaningful accuracy gains for most models, and very large images may be downscaled by the API anyway.
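The cost of higher DPI is easy to see from the arithmetic: pixel dimensions scale linearly with DPI, so the pixel count scales with its square. A small sketch for a US Letter page (8.5 × 11 inches), where `page_pixels` is an illustrative helper:

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple:
    """Pixel dimensions of a page rasterised at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter at the DPI values discussed above.
for dpi in (150, 200, 300, 400):
    width, height = page_pixels(8.5, 11, dpi)
    print(f"{dpi} DPI: {width} x {height} px ({width * height / 1e6:.1f} MP)")
```

Going from 200 to 400 DPI quadruples the pixel count, which is why the higher setting is worth it only when small print or handwriting actually needs it.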
