Docling

Convert PDFs, Office files, and more into LLM-ready Markdown or JSON

github.com/docling-project/docling ★ 61.9k · docling-project.github.io/docling

Overview

Docling is an open-source document conversion library, started by IBM Research, that turns many document formats into a single structured representation. It handles PDF, DOCX, PPTX, XLSX, HTML, EPUB, images, audio, and more, and can export the result to Markdown, HTML, JSON, and other formats.

It is aimed at developers building retrieval and generative-AI systems who need clean, structured text out of messy source documents. Docling goes beyond plain text extraction: it reads PDF page layout, reading order, tables, code, and formulas, and includes OCR for scanned files.

Within the RAG and retrieval space, Docling sits at the parsing and ingestion stage. It produces a unified DoclingDocument that feeds into chunking and embedding, and it ships ready-made integrations with frameworks like LangChain, LlamaIndex, Crew AI, and Haystack.

What it does

Parses many formats including PDF, DOCX, PPTX, XLSX, HTML, EPUB, images, audio, email (EML, MSG), LaTeX, and plain text
Advanced PDF understanding: page layout, reading order, table structure, code, and formulas
Unified DoclingDocument representation with export to Markdown, HTML, JSON, DocTags, and more
OCR support for scanned PDFs and images, plus Visual Language Models such as GraniteDocling
Local, air-gapped execution for sensitive data, with no required cloud calls
Integrations with LangChain, LlamaIndex, Crew AI, and Haystack, plus an MCP server and a CLI

Getting started

Install the package, then convert a document from the command line or in Python.

Install Docling

Install from PyPI. Requires Python 3.10 or higher; works on macOS, Linux, and Windows (x86_64 and arm64).

bash

pip install docling

Convert a document from the CLI

Point the CLI at a local path or URL. This writes a Markdown file with the structured content into the current directory.

bash

docling https://arxiv.org/pdf/2206.01062
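For scripted or batch use, the same CLI can be driven from Python. The sketch below is one possible approach, not an official recipe: it assumes the `--to` and `--output` options behave as shown by `docling --help` for your installed version, and `build_docling_command` and `convert_with_cli` are hypothetical helper names introduced here.

```python
import subprocess


def build_docling_command(source: str, fmt: str = "md", out_dir: str = ".") -> list[str]:
    """Assemble a docling CLI call as an argument list, so no shell quoting is needed."""
    # --to and --output are assumed option names; confirm them with `docling --help`.
    return ["docling", source, "--to", fmt, "--output", out_dir]


def convert_with_cli(source: str, fmt: str = "md", out_dir: str = ".") -> None:
    """Run the CLI and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_docling_command(source, fmt, out_dir), check=True)
```

Calling `convert_with_cli("https://arxiv.org/pdf/2206.01062", fmt="json", out_dir="out")` would then request JSON output in an `out` directory, provided the `docling` executable is on the PATH.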

Convert a document in Python

Use DocumentConverter to convert a file and export the result to Markdown.

python

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
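To keep the converted output instead of printing it, the result can be written to disk. This is a minimal sketch under stated assumptions: it assumes the document offers `export_to_dict()` alongside `export_to_markdown()` and that the returned dict is JSON-serializable. `save_exports`, `convert_and_save`, and the file naming are hypothetical and not part of Docling.

```python
import json
from pathlib import Path


def save_exports(markdown: str, data: dict, out_dir: str, stem: str) -> tuple[Path, Path]:
    """Write a Markdown string and a JSON-serializable dict side by side."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    md_path = target / f"{stem}.md"
    json_path = target / f"{stem}.json"
    md_path.write_text(markdown, encoding="utf-8")
    json_path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
    return md_path, json_path


def convert_and_save(source: str, out_dir: str, stem: str) -> tuple[Path, Path]:
    """Convert one document with Docling and store both export formats."""
    # Imported lazily so save_exports stays usable without Docling installed.
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(source).document
    # export_to_dict() is assumed to return a JSON-serializable dict.
    return save_exports(doc.export_to_markdown(), doc.export_to_dict(), out_dir, stem)
```

For example, `convert_and_save("https://arxiv.org/pdf/2408.09869", "out", "2408.09869")` would produce `out/2408.09869.md` and `out/2408.09869.json`.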

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Ingest PDFs and Office files into a RAG pipeline as clean Markdown or JSON before chunking and embedding
Extract tables, formulas, and reading order from research papers or technical reports
Process scanned documents and images with OCR for downstream search or LLM use
Parse sensitive documents fully on-premises in an air-gapped environment
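As a rough illustration of the step that follows parsing in the RAG use case above, the sketch below splits exported Markdown into heading-scoped chunks ready for embedding. It is a plain-Python stand-in and not Docling's own chunking support; `chunk_markdown` and `max_chars` are hypothetical names chosen for this example.

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[dict]:
    """Split Markdown into chunks that never cross a heading boundary."""
    chunks: list[dict] = []
    heading = ""
    buffer: list[str] = []
    size = 0

    def flush() -> None:
        nonlocal buffer, size
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"heading": heading, "text": text})
        buffer, size = [], 0

    for line in markdown.splitlines():
        # Simplification: a "# comment" inside a fenced code block also counts as a heading.
        if re.match(r"#{1,6}\s", line):
            flush()  # close the previous section
            heading = line.lstrip("#").strip()
            continue
        if buffer and size + len(line) + 1 > max_chars:
            flush()  # section too long: start a new chunk under the same heading
        buffer.append(line)
        size += len(line) + 1
    flush()
    return chunks
```

Each chunk keeps its section heading, which can be stored as metadata next to the embedding. Splitting on line boundaries is a deliberate simplification; a single line longer than `max_chars` is kept whole.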

How Docling compares

Docling alongside the other open-source parsing and ingestion tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
MarkItDown	★ 156k	A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs.
MinerU	★ 68.1k	A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content.
Docling	★ 61.9k	Convert PDFs, Office files, and more into LLM-ready Markdown or JSON
Marker	★ 36.2k	A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting.
Repomix	★ 26.4k	Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini.
OpenDataLoader PDF	★ 25.4k	OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs.
Unstructured	★ 15k	A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines.
Zerox	★ 12.2k	A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use.
