LlamaIndex · 2026-04-13 · notable
ParseBench — LlamaIndex's Document Parsing Benchmark for AI Agents
LlamaIndex's open benchmark for document parsing in AI agent pipelines: ~2,000 human-verified enterprise pages, 169K test rules, and 5 dimensions. Of the 14 methods evaluated, none tops all five. Charts are the hardest dimension: only 4 of 14 parsers exceed 50%.
The first benchmark that measures whether AI agents can actually read enterprise PDFs correctly — tables, charts, semantic formatting, and layout all tested.
Key specs
| Metric | Value |
|---|---|
| GitHub stars | 437 |
| Pages evaluated | ~2,000 |
| Test rules | 169,011 |
| Methods evaluated | 14 |
| Best overall (LlamaParse Agentic) | 84.9% |
| Hugging Face dataset downloads | 15,972 |
What is it?
ParseBench is an open benchmark from LlamaIndex that evaluates document parsing tools on the dimensions that matter for AI agents in production: structural fidelity of tables, chart data extraction, content faithfulness, semantic formatting (strikethroughs, superscripts), and visual grounding. It covers ~2,000 human-verified pages from 1,200+ real enterprise documents with 169,011 test rules. It was published as arXiv:2604.08538 on April 13, 2026, with the dataset on Hugging Face and a public Kaggle leaderboard.
How does it work?
ParseBench evaluates each parser's output against human-written test rules across five capability dimensions. A parser must correctly handle table structures, extract chart data points within tolerance, preserve content without omissions or hallucinations, preserve semantic formatting markers, and accurately locate elements in the document layout. Fourteen methods were evaluated, including vision-language models (GPT-5.4 Vision, Gemini 3.1 Pro), specialized document parsers (AWS Textract, Azure Document Intelligence), and LlamaParse. The provided evaluation scripts can be run against any parser's output once it is in the standardized format.
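As a rough illustration of rule-based scoring, the sketch below checks a parser's output against a handful of pass/fail rules and reports a pass rate per dimension. It is not ParseBench's actual schema or scripts: the `Rule` class, the `within_tolerance` and `score_by_dimension` helpers, the shape of `parsed_output`, and the 5% tolerance are all hypothetical choices made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """Hypothetical test rule: a dimension label plus a pass/fail check."""
    dimension: str
    check: Callable[[dict], bool]


def within_tolerance(expected: float, actual: float, rel_tol: float = 0.05) -> bool:
    # Assumed 5% relative tolerance; the benchmark's real threshold may differ.
    return abs(actual - expected) <= rel_tol * abs(expected)


def score_by_dimension(rules: list[Rule], parsed: dict) -> dict[str, float]:
    """Return the fraction of rules passed in each dimension."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for rule in rules:
        total[rule.dimension] += 1
        if rule.check(parsed):
            passed[rule.dimension] += 1
    return {dim: passed[dim] / total[dim] for dim in total}


# Hypothetical parser output for a single page.
parsed_output = {
    "text": "Revenue grew 12% in Q3.",
    "chart_points": {"Q3": 11.8},
}

rules = [
    # Content faithfulness: the expected sentence must be present.
    Rule("content", lambda doc: "Revenue grew 12%" in doc["text"]),
    # Content faithfulness: this rule fails because "Q4" was never parsed.
    Rule("content", lambda doc: "Q4" in doc["text"]),
    # Chart extraction: the extracted value must be close to the true value.
    Rule("charts", lambda doc: within_tolerance(12.0, doc["chart_points"]["Q3"])),
]

scores = score_by_dimension(rules, parsed_output)
print(scores)
```

Reporting a separate pass rate for each dimension, rather than one blended number, is what makes it possible to see that a parser strong on content can still be weak on charts or layout.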
Why does it matter?
Existing document parsing benchmarks use academic papers and web content. ParseBench uses real enterprise documents and measures semantic correctness, the quality agents need to make accurate downstream decisions. Its key finding: no single method is strong across all five dimensions. Charts proved the sharpest divider (only 4 of 14 parsers exceed 50%), and vision-language models scored below 8% on visual grounding while specialized parsers reached 55–80%. This fragmentation imposes a real cost on teams building production RAG and document agent pipelines.
Who is it for?
Developers building document-processing pipelines, RAG systems, and AI agents over enterprise PDFs
Try it
github.com/run-llama/ParseBench — or submit at kaggle.com/benchmarks/llamaindex-org/parsebench