OpenDataLoader PDF

Open-source PDF parser for AI-ready data and automated accessibility tagging

github.com/opendataloader-project/opendataloader-pdf | ★ 25.4k | opendataloader.org

Overview

OpenDataLoader PDF is an open-source tool that converts PDF files into AI-ready data. It extracts text, tables, headings, lists, and images, then outputs them as Markdown, JSON (with bounding boxes for every element), or HTML. This makes it a solid first step for RAG pipelines and other LLM workflows that need clean, structured input.

It offers a fast, deterministic local mode and an optional hybrid mode that routes hard pages, such as complex tables, scanned documents, and formulas, to an AI backend. Beyond extraction, it also auto-tags untagged PDFs into screen-reader-ready Tagged PDFs, providing a free, Apache-2.0-licensed foundation for PDF accessibility work.

What it does

Extracts Markdown, JSON, and HTML from any PDF, with bounding boxes attached to every element for source citations
Deterministic local mode plus an optional hybrid AI mode for complex tables, formulas, charts, and scanned pages
Built-in OCR (80+ languages) in hybrid mode for scanned and image-based PDFs
Detects reading order and heading hierarchy, as well as numbered, bulleted, and nested lists
Auto-tags untagged PDFs into Tagged PDFs as a free foundation for accessibility and PDF/UA workflows
AI safety filters for prompt-injection content, plus header, footer, and watermark filtering

Getting started

OpenDataLoader PDF needs Java 11+ and Python 3.10+. Check Java with `java -version` first, then install the Python package and convert your files. Node.js and Java SDKs are also available.
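The Java prerequisite can also be checked from Python before installing anything. The following is a minimal sketch, not part of OpenDataLoader PDF: the helper names `parse_java_major` and `installed_java_major` are hypothetical, and the parsing assumes the usual `version "..."` banner that common JDK builds print.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy scheme: Java 8 and earlier report themselves as "1.8", "1.7", ...
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major() -> Optional[int]:
    """Run `java -version` and return the major version, or None if unavailable."""
    if shutil.which("java") is None:
        return None
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    # The JVM prints its version banner to stderr, so both streams are searched.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major is None or major < 11:
        print("Java 11+ not found; install a JDK before using opendataloader-pdf.")
    else:
        print(f"Java {major} detected.")
```

Keeping the parsing separate from the subprocess call makes the version logic easy to test without a JVM on the machine.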

Install the package

Install the latest version from PyPI.

bash

pip install -U opendataloader-pdf

Convert PDFs to structured data

Batch your files in one call and pick the output formats you need. Each call spawns a JVM process, so passing several files together is faster than calling convert repeatedly.

python

import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="markdown,json"
)
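Because each call starts a JVM, it can help to gather every input up front and convert once. The sketch below is one way to do that under stated assumptions: `collect_pdfs` and `convert_in_one_call` are hypothetical helpers, not part of the package, and the `convert` arguments simply mirror the example above. Since `convert` already accepts folders, the pre-expansion is only useful when you want to filter, count, or log the files yourself.

```python
from pathlib import Path
from typing import Iterable, List


def collect_pdfs(roots: Iterable[str]) -> List[str]:
    """Expand files and folders into a sorted, de-duplicated list of PDF paths."""
    found = set()
    for root in roots:
        path = Path(root)
        if path.is_dir():
            # Walk the folder recursively; match the extension case-insensitively.
            found.update(
                p for p in path.rglob("*")
                if p.is_file() and p.suffix.lower() == ".pdf"
            )
        elif path.is_file() and path.suffix.lower() == ".pdf":
            found.add(path)
    return sorted(str(p) for p in found)


def convert_in_one_call(roots: Iterable[str], output_dir: str) -> List[str]:
    """Gather every PDF first, then hand the whole batch to a single convert call."""
    pdfs = collect_pdfs(roots)
    if pdfs:
        # Imported lazily so collect_pdfs can be used without the package installed.
        import opendataloader_pdf

        opendataloader_pdf.convert(
            input_path=pdfs,
            output_dir=output_dir,
            format="markdown,json",
        )
    return pdfs
```

Calling `convert_in_one_call(["file1.pdf", "folder/"], "output/")` would then pay the JVM startup cost once for the whole batch and return the list of files it submitted.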

Handle complex or scanned PDFs

For complex tables, scanned pages, or formulas, install the hybrid extra, start the hybrid server, then run the client with the hybrid flag.

bash

pip install "opendataloader-pdf[hybrid]"
opendataloader-pdf-hybrid --port 5002
opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/

Use it with LangChain

An official LangChain document loader is available for RAG pipelines.

bash

pip install -U langchain-opendataloader-pdf

python

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Build RAG pipelines by parsing PDFs into structured Markdown for chunking, or JSON with bounding boxes for element-level control and source citations
Extract tables, formulas, and chart descriptions from complex or scanned documents using hybrid mode with built-in OCR
Auto-tag untagged PDFs into Tagged PDFs to speed up accessibility remediation at scale instead of paying for manual document-by-document fixes
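For the RAG use case above, the Markdown output still has to be chunked before indexing. The following is a minimal sketch of heading-based chunking in plain Python. It is not part of OpenDataLoader PDF: `chunk_markdown` is a hypothetical helper, it assumes ATX-style `#` headings, and it does not special-case `#` lines inside fenced code blocks.

```python
import re
from typing import Dict, List

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into chunks, one per heading, keeping the heading as metadata."""
    chunks: List[Dict[str, str]] = []
    title = ""  # text before the first heading gets an empty title
    body: List[str] = []

    def flush() -> None:
        # Emit the section collected so far; sections with no body text are skipped.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading": title, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title = match.group(2)
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its heading, which can be stored as metadata next to the embedding so that retrieved passages can be traced back to a section of the source document.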

How OpenDataLoader PDF compares

OpenDataLoader PDF alongside other open-source parsing & ingestion tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
MarkItDown	★ 156k	A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs.
MinerU	★ 68.1k	A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content.
Docling	★ 61.9k	An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON.
Marker	★ 36.2k	A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting.
Repomix	★ 26.4k	Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini.
OpenDataLoader PDF	★ 25.4k	An open-source PDF parser for AI-ready data and automated accessibility tagging.
Unstructured	★ 15k	A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines.
Zerox	★ 12.2k	A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use.
