Overview
MinerU is an open-source tool that extracts content from documents and outputs it as clean Markdown or JSON. It handles PDFs, DOCX, PPTX, XLSX files, images, and scanned pages, and reconstructs the page layout so the text comes out in normal human reading order with headers and footers removed.
It is aimed at developers building RAG pipelines, search, and LLM or agent workflows who need reliable text out of messy source files. It deals with complex cases like multi-column layouts, cross-page tables, handwriting, and CJK and other non-Latin scripts, converting formulas to LaTeX and tables to HTML.
Within document parsing and ingestion, MinerU sits at the front of the pipeline: it ships a CLI, an API, and integrations (MCP server, LangChain, Dify, FastGPT and others), and can run offline on CPU or GPU using a pipeline, VLM, or hybrid backend.
What it does
- Parses PDF, DOCX, PPTX, XLSX, images, and web pages into structured Markdown or JSON
- Converts formulas to LaTeX and tables to HTML, with layout reconstruction
- Handles scanned docs, handwriting, multi-column layouts, and cross-page table merging
- Dual VLM + OCR engine, with OCR support for 109 languages
- Multiple backends: pipeline (CPU/GPU, no hallucination), VLM, and hybrid
- Integrations for MCP (Cursor, Claude Desktop), LangChain, LlamaIndex, Dify, FastGPT, plus CLI and REST API
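Because tables are emitted as HTML, downstream code typically has to pull them back out of the Markdown. The following is a minimal sketch using only Python's standard library, assuming the tables appear as plain `<table>` elements embedded in the Markdown text. The names `TableExtractor` and `extract_tables` are hypothetical helpers, not part of MinerU, and the sketch ignores nested tables and `rowspan`/`colspan`.

```python
from html.parser import HTMLParser
from typing import List


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: List[List[List[str]]] = []
        self._row: List[str] = []
        self._cell: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])      # start a new table
        elif tag == "tr":
            self._row = []              # start a new row
        elif tag in ("td", "th"):
            self._cell = []             # start collecting cell text
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._row.append("".join(self._cell).strip())
            self._in_cell = False
        elif tag == "tr" and self.tables:
            self.tables[-1].append(self._row)

    def handle_data(self, data):
        # Text outside table cells (ordinary Markdown prose) is ignored.
        if self._in_cell:
            self._cell.append(data)


def extract_tables(markdown: str) -> List[List[List[str]]]:
    """Return every HTML table found in the given Markdown text."""
    parser = TableExtractor()
    parser.feed(markdown)
    parser.close()
    return parser.tables
```

Each table comes back as a list of rows, which can then be written to CSV or loaded into a dataframe.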
Getting started
Install MinerU with uv, then parse a file from the command line.
Install MinerU
Install the package with all extras using uv. If you do not have uv yet, install it with pip first.
pip install uv
uv pip install -U "mineru[all]"
Parse a document
Point the CLI at an input file (or directory) and an output folder. MinerU writes the extracted Markdown and JSON there.
mineru -p <input_path> -o <output_path>
Run on a CPU-only machine
Use the pipeline backend, which runs on CPU or GPU and avoids hallucinated content.
mineru -p <input_path> -o <output_path> -b pipeline
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
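To call the CLI from a script, one option is to wrap it with `subprocess`. The sketch below assumes `mineru` is installed and on your `PATH`, and it uses only the `-p`, `-o`, and `-b` flags shown above. The function names `build_mineru_command` and `run_mineru` are hypothetical helpers, not part of MinerU.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_mineru_command(input_path: str, output_path: str,
                         backend: Optional[str] = None) -> List[str]:
    """Build the argument list for the mineru CLI (-p, -o, optional -b)."""
    cmd = ["mineru", "-p", str(input_path), "-o", str(output_path)]
    if backend is not None:
        cmd += ["-b", backend]
    return cmd


def run_mineru(input_path: str, output_path: str,
               backend: Optional[str] = None) -> None:
    """Run MinerU; raises subprocess.CalledProcessError on a non-zero exit."""
    # Create the output folder up front so the script does not depend on
    # the CLI creating it.
    Path(output_path).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_mineru_command(input_path, output_path, backend),
                   check=True)


# Example call (hypothetical paths):
# run_mineru("docs/report.pdf", "out", backend="pipeline")
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.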
When to use it
- Build a RAG knowledge base by converting a folder of PDFs and Office files into clean Markdown for chunking and embedding
- Extract tables and formulas from scientific or financial PDFs into HTML and LaTeX for downstream processing
- Digitize scanned or handwritten documents using OCR across many languages, including CJK content
- Feed parsed document text into an LLM or agent workflow via the MCP server or LangChain/Dify integrations
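For the RAG use case, the extracted Markdown usually needs to be split into chunks before embedding. The following is a minimal sketch of heading-based chunking, not something MinerU provides. The function name `chunk_markdown` and the `max_chars` limit are assumptions for illustration, and the sketch does not special-case `#` lines inside code blocks.

```python
import re
from typing import Dict, List

# ATX-style Markdown headings: one to six '#' characters, then text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str, max_chars: int = 1000) -> List[Dict[str, str]]:
    """Split Markdown into chunks, starting a new chunk at each heading
    and whenever the current chunk would exceed max_chars."""
    chunks: List[Dict[str, str]] = []
    heading = ""
    buf: List[str] = []

    def flush() -> None:
        # Emit the buffered lines as one chunk under the current heading.
        body = "\n".join(buf).strip()
        if body:
            chunks.append({"heading": heading, "text": body})
        buf.clear()

    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading = match.group(2).strip()
            continue
        # Approximate size check: buffered lines plus newlines plus this line.
        if buf and sum(len(l) + 1 for l in buf) + len(line) > max_chars:
            flush()
        buf.append(line)
    flush()
    return chunks
```

Keeping the heading alongside each chunk gives the retriever some context about where the text came from.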
How MinerU compares
MinerU alongside the other open-source parsing and ingestion tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| MarkItDown | ★ 156k | A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs. |
| MinerU | ★ 68.1k | Turns PDFs, Office files, and scans into clean Markdown or JSON for RAG and LLMs. |
| Docling | ★ 61.9k | An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON. |
| Marker | ★ 36.2k | A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting. |
| Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini. |
| OpenDataLoader PDF | ★ 25.4k | OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs. |
| Unstructured | ★ 15k | A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines. |
| Zerox | ★ 12.2k | A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use. |