MinerU

Turn PDFs, Office files, and scans into clean Markdown or JSON for RAG and LLMs

github.com/opendatalab/MinerU (★ 68.1k) | opendatalab.github.io/MinerU

Overview

MinerU is an open-source tool that extracts content from documents and outputs it as clean Markdown or JSON. It handles PDFs, DOCX, PPTX, XLSX files, images, and scanned pages, and reconstructs the page layout so the text comes out in normal human reading order with headers and footers removed.

It is aimed at developers building RAG pipelines, search, and LLM or agent workflows who need reliable text out of messy source files. It handles complex cases such as multi-column layouts, cross-page tables, handwriting, and CJK and other non-Latin scripts, and it converts formulas to LaTeX and tables to HTML.

Within document parsing and ingestion, MinerU sits at the front of the workflow, before chunking and embedding. It ships a CLI, a REST API, and integrations (MCP server, LangChain, Dify, FastGPT, and others), and it can run offline on CPU or GPU using its pipeline, VLM, or hybrid backend.

What it does

Parses PDF, DOCX, PPTX, XLSX, images, and web pages into structured Markdown or JSON
Converts formulas to LaTeX and tables to HTML, with layout reconstruction
Handles scanned docs, handwriting, multi-column layouts, and cross-page table merging
VLM + OCR dual engine, with OCR support for 109 languages
Multiple backends: pipeline (CPU/GPU, no hallucination), VLM, and hybrid
Integrations for MCP (Cursor, Claude Desktop), LangChain, LlamaIndex, Dify, FastGPT, plus CLI and REST API
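
The table conversion listed above can be sanity-checked after a run with a small shell helper. This is only a sketch: `count_tables` is a hypothetical name, and it assumes that converted tables appear as literal `<table` tags inside the output Markdown file.

```bash
#!/usr/bin/env bash
# Sketch: count HTML tables in a Markdown file produced by a parser.
# Assumption: each table starts with a literal "<table" tag.
count_tables() {
  local md_file="$1"
  # grep -o prints one line per match, so wc -l gives the number of tables.
  # tr strips the padding that some wc implementations add.
  grep -o '<table' "$md_file" | wc -l | tr -d ' '
}
```

Calling `count_tables <markdown_file>` prints a single number, which can be compared against the number of tables you expect in the source document.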

Getting started

Install MinerU with uv, then parse a file from the command line.

Install MinerU

Install the package with all extras using uv. If you do not have uv yet, install it with pip first.

bash

pip install uv
uv pip install -U "mineru[all]"

Parse a document

Point the CLI at an input file (or directory) and an output folder. MinerU writes the extracted Markdown and JSON there.

bash

mineru -p <input_path> -o <output_path>
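
As a rough sketch of how that command might be wrapped in a script, the helpers below check that `mineru` is on the PATH, run it with the same `-p` and `-o` flags, and list the Markdown files found under the output folder. The function names `parse_docs` and `list_markdown` are hypothetical. The exact layout of the output folder is not assumed, which is why the listing uses a recursive `find`.

```bash
#!/usr/bin/env bash
# Sketch: wrap the mineru CLI and list the Markdown it produced.
# parse_docs and list_markdown are illustrative names, not part of MinerU.

parse_docs() {
  local input_path="$1" output_path="$2"
  # Fail early with a clear message if the CLI is not installed.
  if ! command -v mineru >/dev/null 2>&1; then
    echo "mineru not found on PATH; install it first" >&2
    return 127
  fi
  mkdir -p "$output_path"
  mineru -p "$input_path" -o "$output_path"
}

list_markdown() {
  local output_path="$1"
  # Search recursively so no particular subfolder layout is assumed.
  find "$output_path" -type f -name '*.md' | sort
}
```

A typical call would be `parse_docs <input_path> <output_path> && list_markdown <output_path>`, which prints one Markdown path per line for the next stage of a pipeline.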

Run on a CPU-only machine

Use the pipeline backend, which runs on CPU or GPU and avoids hallucinated content.

bash

mineru -p <input_path> -o <output_path> -b pipeline
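
If the same script has to run on both CPU-only and GPU machines, the backend flag can be chosen at run time. The sketch below is one possible approach, not something taken from the MinerU documentation: `backend_args` is a hypothetical helper, and it assumes that a missing `nvidia-smi` binary means the machine has no usable GPU.

```bash
#!/usr/bin/env bash
# Sketch: choose backend flags for the mineru CLI.
# Assumption: a missing nvidia-smi binary means the machine is CPU-only.
backend_args() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # GPU tooling present: pass no -b flag and keep the CLI default backend.
    printf '%s' ""
  else
    # No GPU tooling found: force the pipeline backend, which runs on CPU.
    printf '%s' "-b pipeline"
  fi
}

# Usage (unquoted on purpose so "-b pipeline" splits into two arguments):
#   mineru -p <input_path> -o <output_path> $(backend_args)
```

The check is deliberately crude. On machines with non-NVIDIA accelerators it falls back to the pipeline backend, which still works because that backend runs on CPU.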

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Build a RAG knowledge base by converting a folder of PDFs and Office files into clean Markdown for chunking and embedding
Extract tables and formulas from scientific or financial PDFs into HTML and LaTeX for downstream processing
Digitize scanned or handwritten documents using OCR across many languages, including CJK content
Feed parsed document text into an LLM or agent workflow via the MCP server or LangChain/Dify integrations
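
For the RAG use case above, the Markdown that MinerU writes still has to be chunked before embedding. A minimal sketch of heading-based chunking follows. The `chunk_markdown` function and the `chunk_NNN.md` file naming are hypothetical and not part of MinerU, and real pipelines often use a dedicated splitter instead.

```bash
#!/usr/bin/env bash
# Sketch: split one Markdown file into per-heading chunk files.
# chunk_markdown and the chunk_NNN.md naming are illustrative only.
chunk_markdown() {
  local md_file="$1" chunk_dir="$2"
  mkdir -p "$chunk_dir"
  awk -v dir="$chunk_dir" '
    # Start a new chunk at every Markdown heading line (and at the first line).
    /^#+ / || NR == 1 {
      if (out != "") close(out)
      out = sprintf("%s/chunk_%03d.md", dir, ++n)
    }
    # Append the current line to the chunk that is open.
    { print > out }
  ' "$md_file"
}
```

Running `chunk_markdown <markdown_file> <chunk_dir>` writes one file per section. Lines that start with `# ` inside fenced code blocks are also treated as headings, so this simple rule can over-split documents that contain shell or Python comments.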

How MinerU compares

MinerU alongside other open-source parsing & ingestion tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
MarkItDown	★ 156k	A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs.
MinerU	★ 68.1k	Turn PDFs, Office files, and scans into clean Markdown or JSON for RAG and LLMs
Docling	★ 61.9k	An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON.
Marker	★ 36.2k	A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting.
Repomix	★ 26.4k	Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini.
OpenDataLoader PDF	★ 25.4k	OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs.
Unstructured	★ 15k	A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines.
Zerox	★ 12.2k	A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use.
