AI/TLDR

Docling

Convert PDFs, Office files, and more into LLM-ready Markdown or JSON

Overview

Docling is an open-source document conversion library, started by IBM Research, that turns many document formats into a single structured representation. It handles PDF, DOCX, PPTX, XLSX, HTML, EPUB, images, audio, and more, and can export the result to Markdown, HTML, JSON, and other formats.

It is aimed at developers building retrieval and generative-AI systems who need clean, structured text out of messy source documents. Docling goes beyond plain text extraction: it reads PDF page layout, reading order, tables, code, and formulas, and includes OCR for scanned files.

Within the RAG and retrieval space, Docling sits at the parsing and ingestion stage. It produces a unified DoclingDocument that feeds into chunking and embedding, and it ships ready-made integrations with frameworks like LangChain, LlamaIndex, CrewAI, and Haystack.
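
As a rough illustration of the hand-off from parsing to chunking, the sketch below converts a document and splits the resulting DoclingDocument with `HierarchicalChunker` from `docling.chunking`. The helper names `chunks_to_records` and `chunk_document` and the record layout are assumptions made for this example, not part of Docling.

```python
from typing import Any, Iterable


def chunks_to_records(chunks: Iterable[Any], source: str) -> list[dict]:
    """Turn chunk objects (anything with a .text attribute) into plain dicts."""
    return [
        {"id": f"{source}#{i}", "source": source, "text": chunk.text}
        for i, chunk in enumerate(chunks)
    ]


def chunk_document(source: str) -> list[dict]:
    """Convert one document with Docling and split it into chunks."""
    # Imported lazily so chunks_to_records stays usable without Docling installed.
    from docling.chunking import HierarchicalChunker
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(source).document
    chunker = HierarchicalChunker()
    return chunks_to_records(chunker.chunk(dl_doc=doc), source)
```

Calling `chunk_document` with a local path or URL should return one record per chunk, ready to pass to an embedding model of your choice.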

What it does

  • Parses many formats including PDF, DOCX, PPTX, XLSX, HTML, EPUB, images, audio, email (EML, MSG), LaTeX, and plain text
  • Advanced PDF understanding: page layout, reading order, table structure, code, and formulas
  • Unified DoclingDocument representation with export to Markdown, HTML, JSON, DocTags, and more
  • OCR support for scanned PDFs and images, plus support for vision-language models such as GraniteDocling
  • Local, air-gapped execution for sensitive data, with no required cloud calls
  • Integrations with LangChain, LlamaIndex, CrewAI, and Haystack, plus an MCP server and a CLI
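
The OCR and table-structure features listed above are switched on or off through pipeline options. A minimal sketch, assuming the `PdfPipelineOptions` and `PdfFormatOption` classes found in recent Docling releases; the function name `build_pdf_converter` is hypothetical, and option names may differ between versions, so check the official documentation.

```python
def build_pdf_converter(do_ocr: bool = True, do_table_structure: bool = True):
    """Return a DocumentConverter with explicit PDF pipeline settings."""
    # Imported lazily so this module can be loaded without Docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = do_ocr                          # run OCR on scanned pages
    options.do_table_structure = do_table_structure  # recover table rows and columns

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
```

Turning OCR off for born-digital PDFs, for example with `build_pdf_converter(do_ocr=False)`, typically shortens conversion time.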

Getting started

Install the package, then convert a document from the command line or in Python.

Install Docling

Install from PyPI. Requires Python 3.10 or higher; works on macOS, Linux, and Windows (x86_64 and arm64).

bash
pip install docling

Convert a document from the CLI

Point the CLI at a local path or URL. This writes a Markdown file with the structured content into the current directory.

bash
docling https://arxiv.org/pdf/2206.01062
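
To drive the same CLI from a script, one option is to build the argument list in Python and run it with `subprocess`. This is a sketch that assumes the CLI's `--to` (output format) and `--output` (output directory) options. The helper names `docling_command` and `run_docling` are made up for this example.

```python
import subprocess


def docling_command(source: str, output_dir: str = ".", to_format: str = "md") -> list[str]:
    """Build the argument list for one docling CLI invocation."""
    return ["docling", "--to", to_format, "--output", output_dir, source]


def run_docling(source: str, output_dir: str = ".", to_format: str = "md") -> None:
    """Run the docling CLI and raise CalledProcessError if it exits non-zero."""
    subprocess.run(docling_command(source, output_dir, to_format), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.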

Convert a document in Python

Use DocumentConverter to convert a file and export the result to Markdown.

python
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
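
Printing is fine for a quick check, but a pipeline usually wants files on disk. The sketch below writes both a Markdown and a JSON export, assuming the `export_to_markdown()` and `export_to_dict()` methods on the converted document. The helper names, the `out` directory, and the `document` file stem are placeholders chosen for this example.

```python
import json
from pathlib import Path


def write_outputs(markdown: str, doc_dict: dict, out_dir: str, stem: str) -> tuple[Path, Path]:
    """Write the Markdown and JSON exports side by side and return both paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    md_path = out / f"{stem}.md"
    json_path = out / f"{stem}.json"
    md_path.write_text(markdown, encoding="utf-8")
    json_path.write_text(
        json.dumps(doc_dict, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return md_path, json_path


def convert_and_save(source: str, out_dir: str = "out", stem: str = "document") -> tuple[Path, Path]:
    """Convert one document with Docling and save both exports."""
    # Imported lazily so write_outputs stays usable without Docling installed.
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(source).document
    return write_outputs(doc.export_to_markdown(), doc.export_to_dict(), out_dir, stem)
```

The JSON export keeps the document structure, while the Markdown export is the more convenient input for chunking and prompting.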

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

  • Ingest PDFs and Office files into a RAG pipeline as clean Markdown or JSON before chunking and embedding
  • Extract tables, formulas, and reading order from research papers or technical reports
  • Process scanned documents and images with OCR for downstream search or LLM use
  • Parse sensitive documents fully on-premises in an air-gapped environment
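
For the table-extraction use case, a converted document exposes its tables as a list, and each table can be exported to a pandas DataFrame. The sketch below assumes the `tables` attribute and the `export_to_dataframe()` method, and that pandas is installed. Newer Docling releases may prefer passing the document as a `doc` argument to that method. The helper name `export_tables` and the `table-N.csv` naming are illustrative only.

```python
from pathlib import Path


def export_tables(document, out_dir: str) -> list[Path]:
    """Write every table in a converted document to its own CSV file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, table in enumerate(document.tables, start=1):
        frame = table.export_to_dataframe()  # assumed to return a pandas DataFrame
        path = out / f"table-{index}.csv"
        frame.to_csv(path, index=False)
        paths.append(path)
    return paths
```

Given `result = DocumentConverter().convert(source)`, calling `export_tables(result.document, "tables")` would produce one CSV per detected table.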

How Docling compares

Docling alongside other open-source parsing & ingestion tools AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
MarkItDown | ★ 156k | A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs.
MinerU | ★ 68.1k | A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content.
Docling | ★ 61.9k | Convert PDFs, Office files, and more into LLM-ready Markdown or JSON
Marker | ★ 36.2k | A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting.
Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini.
OpenDataLoader PDF | ★ 25.4k | OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs.
Unstructured | ★ 15k | A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines.
Zerox | ★ 12.2k | A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use.