AI/TLDR

LlamaIndex · 2026-04-13 · notable

ParseBench — LlamaIndex's Document Parsing Benchmark for AI Agents

LlamaIndex's open benchmark for document parsing in AI agent pipelines: ~2,000 human-verified enterprise pages, 169K test rules, and 5 dimensions. Of the 14 methods evaluated, none tops all five. Charts are the hardest dimension: only 4 of 14 parsers exceed 50%.


The first benchmark that measures whether AI agents can actually read enterprise PDFs correctly — tables, charts, semantic formatting, and layout all tested.

Key specs

GitHub stars: 437
Pages evaluated: ~2,000
Test rules: 169,011
Methods evaluated: 14
Best overall (LlamaParse Agentic): 84.9%
Hugging Face dataset downloads: 15,972

What is it?

ParseBench is an open benchmark from LlamaIndex that evaluates document parsing tools on the dimensions that matter for AI agents in production: structural fidelity of tables, chart data extraction, content faithfulness, semantic formatting (strikethroughs, superscripts), and visual grounding. It covers ~2,000 human-verified pages from 1,200+ real enterprise documents with 169,011 test rules. It was published as arXiv:2604.08538 on April 13, 2026, with the dataset on Hugging Face and a public Kaggle leaderboard.

How does it work?

ParseBench evaluates each parser's output against human-written test rules across five capability dimensions. A parser must correctly handle table structures, extract chart data points within tolerance, preserve content without omissions or hallucinations, preserve semantic formatting markers, and accurately locate elements in the document layout. The 14 methods evaluated include vision-language models (GPT-5.4 Vision, Gemini 3.1 Pro), specialized document parsers (AWS Textract, Azure Document Intelligence), and LlamaParse. To evaluate another parser, run the provided evaluation scripts against its output in the benchmark's standardized format.
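To make the rule-based idea concrete, here is a minimal sketch of per-dimension scoring. It is not ParseBench's actual schema or scoring code: the Rule class, the two rule kinds ("contains" and "numeric"), the helper names, and the 5% relative tolerance are all hypothetical, chosen only to illustrate pass/fail rules grouped by dimension.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical test rule; not the benchmark's real rule format."""
    dimension: str               # e.g. "content" or "charts"
    kind: str                    # "contains" or "numeric" (assumed rule kinds)
    expected_text: str = ""      # text that must appear, for "contains" rules
    expected_value: float = 0.0  # target number, for "numeric" rules
    tolerance: float = 0.0       # relative tolerance, for "numeric" rules


def rule_passes(rule, parsed_text, parsed_values):
    """Return True if the parser output satisfies a single rule."""
    if rule.kind == "contains":
        return rule.expected_text in parsed_text
    if rule.kind == "numeric":
        allowed = rule.tolerance * abs(rule.expected_value)
        return any(abs(v - rule.expected_value) <= allowed for v in parsed_values)
    raise ValueError(f"unknown rule kind: {rule.kind}")


def score_by_dimension(rules, parsed_text, parsed_values):
    """Fraction of rules passed, computed separately for each dimension."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for rule in rules:
        total[rule.dimension] += 1
        if rule_passes(rule, parsed_text, parsed_values):
            passed[rule.dimension] += 1
    return {dim: passed[dim] / total[dim] for dim in total}


# Toy input, invented for illustration only.
rules = [
    Rule("content", "contains", expected_text="Total revenue"),
    Rule("content", "contains", expected_text="Net income"),
    Rule("charts", "numeric", expected_value=42.0, tolerance=0.05),
    Rule("charts", "numeric", expected_value=10.0, tolerance=0.05),
]
parsed_text = "Total revenue rose in Q3."
parsed_values = [41.0, 12.0]

scores = score_by_dimension(rules, parsed_text, parsed_values)
```

In this toy run each dimension passes one of its two rules, so both score 0.5. Reporting dimensions separately rather than as a single average is what lets a result like "strong on tables, weak on charts" show up.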

Why does it matter?

Existing document parsing benchmarks use academic papers and web content. ParseBench uses real enterprise documents and measures semantic correctness, the quality agents need to make accurate downstream decisions. Its key finding: no single method is strong across all five dimensions. Charts proved the sharpest divider (only 4 of 14 parsers exceed 50%), and vision-language models scored below 8% on visual grounding while specialized parsers reached 55–80%. This fragmentation has a real cost for teams building production RAG and document agent pipelines.

Who is it for?

Developers building document-processing pipelines, RAG systems, and AI agents over enterprise PDFs

Try it

github.com/run-llama/ParseBench — or submit at kaggle.com/benchmarks/llamaindex-org/parsebench

Tags

  • benchmark
  • document-parsing
  • llamaindex
  • rag
  • ocr
  • pdf-extraction
  • ai-agents
  • enterprise
  • evaluation
  • paper
  • tables
  • charts
