Overview
Docling is an open-source document conversion library, started by IBM Research, that turns many document formats into a single structured representation. It handles PDF, DOCX, PPTX, XLSX, HTML, EPUB, images, audio, and more, and can export the result to Markdown, HTML, JSON, and other formats.
It is aimed at developers building retrieval and generative-AI systems who need clean, structured text out of messy source documents. Docling goes beyond plain text extraction: it reads PDF page layout, reading order, tables, code, and formulas, and includes OCR for scanned files.
Within the RAG and retrieval space, Docling sits at the parsing and ingestion stage. It produces a unified DoclingDocument that feeds into chunking and embedding, and it ships ready-made integrations with frameworks like LangChain, LlamaIndex, Crew AI, and Haystack.
What it does
- Parses many formats including PDF, DOCX, PPTX, XLSX, HTML, EPUB, images, audio, email (EML, MSG), LaTeX, and plain text
- Advanced PDF understanding: page layout, reading order, table structure, code, and formulas
- Unified DoclingDocument representation with export to Markdown, HTML, JSON, DocTags, and more
- OCR support for scanned PDFs and images, plus Visual Language Models such as GraniteDocling
- Local, air-gapped execution for sensitive data, with no required cloud calls
- Integrations with LangChain, LlamaIndex, Crew AI, and Haystack, plus an MCP server and a CLI
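To make the unified-representation bullet concrete, here is a minimal sketch of writing one converted document out in two formats. It assumes the document object exposes `export_to_markdown()` and `export_to_dict()`, which `DoclingDocument` is understood to provide; the helper name `write_exports` and the output file names are hypothetical. The function accepts any object with those two methods, so it can be tried without installing Docling.

```python
import json
from pathlib import Path


def write_exports(document, out_dir, stem="document"):
    """Write Markdown and JSON exports of a converted document.

    Hypothetical helper, not part of Docling. `document` is assumed to
    provide export_to_markdown() -> str and export_to_dict() -> dict.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    md_path = out / f"{stem}.md"
    json_path = out / f"{stem}.json"

    # Markdown is written as-is; the dict export is serialized as JSON.
    md_path.write_text(document.export_to_markdown(), encoding="utf-8")
    json_path.write_text(
        json.dumps(document.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return md_path, json_path
```

With Docling installed, the `document` argument would typically be the `document` attribute of a conversion result.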
Getting started
Install the package, then convert a document from the command line or in Python.
Install Docling
Install from PyPI. Requires Python 3.10 or higher; works on macOS, Linux, and Windows (x86_64 and arm64).
pip install docling
Convert a document from the CLI
Point the CLI at a local path or URL. This writes a Markdown file with the structured content into the current directory.
docling https://arxiv.org/pdf/2206.01062
Convert a document in Python
Use DocumentConverter to convert a file and export the result to Markdown.
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
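Building on the Python snippet above, here is a sketch of converting several sources in a loop and collecting failures instead of stopping at the first one. It relies only on the `convert(...)` and `export_to_markdown()` calls already shown; the helper `convert_many` and its return shape are hypothetical, and the converter is passed in as an argument so the function does not depend on how it was configured.

```python
def convert_many(converter, sources):
    """Convert each source to Markdown, collecting errors per source.

    Hypothetical helper. `converter` is assumed to behave like Docling's
    DocumentConverter, i.e. converter.convert(source).document
    .export_to_markdown() returns a string.
    Returns (markdown_by_source, error_by_source).
    """
    markdown_by_source = {}
    error_by_source = {}
    for source in sources:
        try:
            result = converter.convert(source)
            markdown_by_source[source] = result.document.export_to_markdown()
        except Exception as exc:  # one bad file should not abort the batch
            error_by_source[source] = str(exc)
    return markdown_by_source, error_by_source


# Usage, assuming Docling is installed (file names are placeholders):
# from docling.document_converter import DocumentConverter
# ok, failed = convert_many(DocumentConverter(), ["a.pdf", "b.docx"])
```

Catching a broad `Exception` is a deliberate simplification for a batch job; a production pipeline would likely log the traceback and narrow the exception types.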
When to use it
- Ingest PDFs and Office files into a RAG pipeline as clean Markdown or JSON before chunking and embedding
- Extract tables, formulas, and reading order from research papers or technical reports
- Process scanned documents and images with OCR for downstream search or LLM use
- Parse sensitive documents fully on-premises in an air-gapped environment
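As an illustration of the first use case, the sketch below splits exported Markdown into heading-delimited chunks ahead of embedding. It is a deliberately naive stand-in written for this page, not Docling's own chunking; the function name `split_markdown_by_heading` and the chunk dictionary shape are assumptions. It also treats any line starting with `#` as a heading, so comment lines inside code blocks would be mis-split.

```python
import re

# Matches ATX headings such as "# Title" or "### Section".
_HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_by_heading(markdown):
    """Split Markdown text into chunks, one per heading-led section.

    Text before the first heading is kept under an empty heading.
    Sections with no body text are skipped.
    """
    chunks = []
    heading = ""
    body = []

    def flush():
        # Emit the section collected so far, if it has any text.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        match = _HEADING.match(line)
        if match:
            flush()
            heading = match.group(2)
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk's `text` can then be handed to whichever embedding model the pipeline uses, with `heading` kept as metadata for retrieval.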
How Docling compares
Docling alongside the other open-source parsing and ingestion tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| MarkItDown | ★ 156k | A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs. |
| MinerU | ★ 68.1k | A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content. |
| Docling | ★ 61.9k | Converts PDFs, Office files, and more into LLM-ready Markdown or JSON. |
| Marker | ★ 36.2k | A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting. |
| Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini. |
| OpenDataLoader PDF | ★ 25.4k | OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs. |
| Unstructured | ★ 15k | A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines. |
| Zerox | ★ 12.2k | A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use. |