AI/TLDR

OpenDataLoader PDF

Open-source PDF parser for AI-ready data and automated accessibility tagging

Overview

OpenDataLoader PDF is an open-source tool that converts PDF files into AI-ready data. It extracts text, tables, headings, lists, and images, then outputs them as Markdown, JSON (with bounding boxes for every element), or HTML. This makes it a solid first step for RAG pipelines and other LLM workflows that need clean, structured input.

It offers a fast, deterministic local mode and an optional hybrid mode that routes hard pages, such as complex tables, scanned documents, and formulas, to an AI backend. Beyond extraction, it also auto-tags untagged PDFs into screen-reader-ready Tagged PDFs, giving PDF accessibility work a free, Apache-2.0-licensed foundation.

What it does

  • Extracts Markdown, JSON, and HTML from any PDF, with bounding boxes attached to every element for source citations
  • Deterministic local mode plus an optional hybrid AI mode for complex tables, formulas, charts, and scanned pages
  • Built-in OCR (80+ languages) in hybrid mode for scanned and image-based PDFs
  • Detects reading order, heading hierarchy, and numbered, bulleted, and nested lists
  • Auto-tags untagged PDFs into Tagged PDFs as a free foundation for accessibility and PDF/UA workflows
  • AI safety filters for prompt-injection content, plus header, footer, and watermark filtering

Getting started

OpenDataLoader PDF needs Java 11+ and Python 3.10+. Check Java with `java -version` first, then install the Python package and convert your files. Node.js and Java SDKs are also available.
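The version requirements above can also be checked from a script before installing. The following is a minimal sketch, not part of the project: `parse_java_major` and `check_prerequisites` are hypothetical helper names, and the parsing assumes the usual `java -version` output shape (for example `openjdk version "17.0.2"` or the legacy `java version "1.8.0_291"`).

```python
import re
import shutil
import subprocess
import sys
from typing import List, Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output (hypothetical helper)."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as "1.8", "1.7", and so on.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def check_prerequisites() -> List[str]:
    """Return a list of problems found; an empty list means both requirements are met."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    if shutil.which("java") is None:
        problems.append("java was not found on PATH")
        return problems
    # `java -version` typically writes to stderr, so both streams are inspected.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    major = parse_java_major(result.stderr + result.stdout)
    if major is None or major < 11:
        problems.append("Java 11+ is required")
    return problems
```

Calling `check_prerequisites()` and printing the returned list gives a quick pass/fail summary before running `pip install`.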

Install the package

Install the latest version from PyPI.

bash
pip install -U opendataloader-pdf

Convert PDFs to structured data

Batch your files in one call and pick the output formats you need. Each call spawns a JVM process, so passing several files together is faster than calling convert repeatedly.

python
import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["file1.pdf", "file2.pdf", "folder/"],
    output_dir="output/",
    format="markdown,json"
)
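Once the JSON output exists, the bounding boxes can be collected for source citations. The sketch below walks any nested JSON structure and picks out objects that carry a bounding box. The key names (`bounding box`, `page number`, `type`, `content`) and the sample data are assumptions for illustration only; inspect a real output file and adjust them to match.

```python
import json
from typing import Any, Iterator, List, Tuple


def iter_elements(node: Any) -> Iterator[dict]:
    """Yield every dict in a nested JSON structure that has a bounding box key."""
    if isinstance(node, dict):
        if "bounding box" in node:  # assumed key name
            yield node
        for value in node.values():
            yield from iter_elements(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_elements(item)


def citations(document: Any) -> List[Tuple[Any, Any, Any, Any]]:
    """Return (type, page, bounding box, content) for each element found."""
    return [
        (el.get("type"), el.get("page number"), el.get("bounding box"), el.get("content"))
        for el in iter_elements(document)
    ]


# Hypothetical sample mimicking a parsed document; not real tool output.
sample = json.loads("""
{"kids": [
  {"type": "heading", "page number": 1,
   "bounding box": [72.0, 700.0, 540.0, 730.0], "content": "Introduction"},
  {"type": "list", "page number": 1,
   "bounding box": [72.0, 500.0, 540.0, 680.0],
   "kids": [
     {"type": "list item", "page number": 1,
      "bounding box": [80.0, 650.0, 540.0, 680.0], "content": "First point"}
   ]}
]}
""")

for kind, page, bbox, content in citations(sample):
    print(kind, page, bbox, content)
```

Because the walker only looks for the bounding-box key, it does not depend on how child elements are nested. For a real file, replace `sample` with the result of `json.load` on the generated JSON in the output directory.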

Handle complex or scanned PDFs

For complex tables, scanned pages, or formulas, install the hybrid extra, start the hybrid server, then run the client with the hybrid flag.

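bash
pip install "opendataloader-pdf[hybrid]"
# Start the hybrid server and leave it running (for example, in a separate terminal)
opendataloader-pdf-hybrid --port 5002
# Then run the client against it with the hybrid flag
opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/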
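When the server and client are launched from one script, the client should only start once the server is accepting connections. Here is a minimal sketch of that wait, assuming the hybrid server listens for TCP connections on localhost at the port given above; `wait_for_port` is a hypothetical helper, not part of the package.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Usage sketch (commented out because it needs the hybrid extra installed):
# import subprocess
# server = subprocess.Popen(["opendataloader-pdf-hybrid", "--port", "5002"])
# if wait_for_port("127.0.0.1", 5002):
#     subprocess.run(["opendataloader-pdf", "--hybrid", "docling-fast", "file1.pdf"])
# server.terminate()
```

An open port only shows that the process is listening, not that its models have finished loading, so a retry around the client call may still be worthwhile.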

Use it with LangChain

An official LangChain document loader is available for RAG pipelines.

bash
pip install -U langchain-opendataloader-pdf

python
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

  • Build RAG pipelines by parsing PDFs into structured Markdown for chunking, or JSON with bounding boxes for element-level control and source citations
  • Extract tables, formulas, and chart descriptions from complex or scanned documents using hybrid mode with built-in OCR
  • Auto-tag untagged PDFs into Tagged PDFs to speed up accessibility remediation at scale instead of paying for manual document-by-document fixes
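For the RAG use case, heading-aware chunking of the Markdown output might look like the sketch below. It assumes the Markdown uses ATX-style headings (`#`, `##`, and so on), which should be verified against real output. `chunk_by_heading` is a hypothetical helper, and it does not special-case `#` characters inside code blocks.

```python
import re
from typing import List, Tuple

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) chunks; text before the first heading gets an empty heading."""
    chunks: List[Tuple[str, str]] = []
    title = ""
    body: List[str] = []

    def flush() -> None:
        # Skip empty chunks, such as a document that starts directly with a heading.
        if title or any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


example = "Preamble\n\n# Intro\nHello\n\n## Details\nMore text\n"
for heading, text in chunk_by_heading(example):
    print(repr(heading), repr(text))
```

Keeping the heading with each chunk gives the retriever some context for where a passage came from; long sections may still need a secondary split by length.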

How OpenDataLoader PDF compares

OpenDataLoader PDF is listed here alongside the other open-source parsing and ingestion tools that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| MarkItDown | ★ 156k | A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs. |
| MinerU | ★ 68.1k | A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content. |
| Docling | ★ 61.9k | An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON. |
| Marker | ★ 36.2k | A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting. |
| Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini. |
| OpenDataLoader PDF | ★ 25.4k | Open-source PDF parser for AI-ready data and automated accessibility tagging |
| Unstructured | ★ 15k | A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines. |
| Zerox | ★ 12.2k | A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use. |