Overview
OpenDataLoader PDF is an open-source tool that converts PDF files into AI-ready data. It extracts text, tables, headings, lists, and images, then outputs them as Markdown, JSON (with bounding boxes for every element), or HTML. This makes it a solid first step for RAG pipelines and other LLM workflows that need clean, structured input.
It offers a fast, deterministic local mode and an optional hybrid mode that routes hard pages, such as complex tables, scanned documents, and formulas, to an AI backend. Beyond extraction, it also auto-tags untagged PDFs into screen-reader-ready Tagged PDFs, giving PDF accessibility work a free, Apache-2.0-licensed foundation.
What it does
- Extracts Markdown, JSON, and HTML from any PDF, with bounding boxes attached to every element for source citations
- Deterministic local mode plus an optional hybrid AI mode for complex tables, formulas, charts, and scanned pages
- Built-in OCR (80+ languages) in hybrid mode for scanned and image-based PDFs
- Detects reading order, heading hierarchy, and numbered, bulleted, and nested lists
- Auto-tags untagged PDFs into Tagged PDFs as a free foundation for accessibility and PDF/UA workflows
- AI safety filters for prompt-injection content, plus header, footer, and watermark filtering
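To illustrate how per-element bounding boxes can back a source citation, here is a minimal sketch that walks a parsed JSON document and returns the page and box of every element whose text contains a search term. The field names `content`, `page number`, and `bounding box`, as well as the sample data, are assumptions made for illustration; check the project's JSON schema for the real keys.

```python
def collect_citations(node, needle, found=None):
    """Walk a parsed JSON tree and collect page and box for matching elements.

    Assumption: each element is a dict that may carry the hypothetical keys
    "content", "page number", and "bounding box".
    """
    if found is None:
        found = []
    if isinstance(node, dict):
        content = node.get("content")
        if isinstance(content, str) and needle.lower() in content.lower():
            found.append({
                "page": node.get("page number"),
                "bbox": node.get("bounding box"),
                "text": content,
            })
        # Recurse into every value so the container key name does not matter.
        for value in node.values():
            collect_citations(value, needle, found)
    elif isinstance(node, list):
        for item in node:
            collect_citations(item, needle, found)
    return found


# Hypothetical sample shaped like a nested element tree.
sample = {
    "kids": [
        {"type": "heading", "page number": 1,
         "bounding box": [72.0, 700.0, 540.0, 730.0],
         "content": "Introduction"},
        {"type": "paragraph", "page number": 2,
         "bounding box": [72.0, 400.0, 540.0, 460.0],
         "content": "Revenue grew 12% year over year."},
    ]
}
hits = collect_citations(sample, "revenue")
```

Because the walk is generic, it keeps working if elements are nested more deeply; only the three assumed key names would need adjusting.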
Getting started
OpenDataLoader PDF needs Java 11+ and Python 3.10+. Check Java with `java -version` first, then install the Python package and convert your files. Node.js and Java SDKs are also available.
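The Java check can also be scripted, which may help in setup scripts or CI. The sketch below is one possible approach and not part of the project: the helper names `parse_java_major` and `java_is_recent_enough` are hypothetical. It runs `java -version` and compares the reported major version against the required minimum.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output: str):
    """Return the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: Java 8 and earlier report themselves as "1.8.0_xxx".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_is_recent_enough(minimum: int = 11) -> bool:
    """Check that a `java` executable is on PATH and is at least `minimum`."""
    java = shutil.which("java")
    if java is None:
        return False
    # `java -version` writes its banner to stderr, so read both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    major = parse_java_major(result.stderr + result.stdout)
    return major is not None and major >= minimum
```

Calling `java_is_recent_enough()` before installing lets a script fail early with a clear message instead of at the first conversion.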
Install the package
Install the latest version from PyPI.
pip install -U opendataloader-pdf
Convert PDFs to structured data
Batch your files in one call and pick the output formats you need. Each call spawns a JVM process, so passing several files together is faster than calling `convert` once per file.
import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="markdown,json"
)
Handle complex or scanned PDFs
For complex tables, scanned pages, or formulas, install the hybrid extra, start the hybrid server, then run the client with the hybrid flag.
pip install "opendataloader-pdf[hybrid]"
opendataloader-pdf-hybrid --port 5002
opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/
Use it with LangChain
An official LangChain document loader is available for RAG pipelines.
pip install -U langchain-opendataloader-pdf

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
Commands and code are distilled from the project's own documentation. Always check the official repo for the latest.
When to use it
- Build RAG pipelines by parsing PDFs into structured Markdown for chunking, or JSON with bounding boxes for element-level control and source citations
- Extract tables, formulas, and chart descriptions from complex or scanned documents using hybrid mode with built-in OCR
- Auto-tag untagged PDFs into Tagged PDFs to speed up accessibility remediation at scale instead of paying for manual document-by-document fixes
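For the RAG use case, the Markdown output can be chunked along its heading structure. The following is a minimal sketch of that idea using only the standard library; the function name `chunk_markdown_by_heading` is hypothetical and the approach is an assumption, not the project's own chunking method. It treats every ATX heading (`#` through `######`) as a chunk boundary.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown_by_heading(markdown: str):
    """Split Markdown into chunks, one per heading, keeping the heading text."""
    chunks = []
    current = {"heading": None, "lines": []}

    def has_content(chunk):
        return chunk["heading"] is not None or any(
            line.strip() for line in chunk["lines"]
        )

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the previous chunk, if it holds anything.
            if has_content(current):
                chunks.append(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    if has_content(current):
        chunks.append(current)

    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]
```

This sketch does not skip fenced code blocks, so a `#` comment at the start of a line inside a code fence would be treated as a heading; a production splitter would need to track fences.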
How OpenDataLoader PDF compares
The table below lists OpenDataLoader PDF alongside the other open-source parsing and ingestion tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| MarkItDown | ★ 156k | A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs. |
| MinerU | ★ 68.1k | A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content. |
| Docling | ★ 61.9k | An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON. |
| Marker | ★ 36.2k | A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting. |
| Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini. |
| OpenDataLoader PDF | ★ 25.4k | An open-source PDF parser that produces AI-ready data and automates accessibility tagging. |
| Unstructured | ★ 15k | A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines. |
| Zerox | ★ 12.2k | A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use. |