AI/TLDR

MinerU

Turn PDFs, Office files, and scans into clean Markdown or JSON for RAG and LLMs

Overview

MinerU is an open-source tool that extracts content from documents and outputs it as clean Markdown or JSON. It handles PDFs, DOCX, PPTX, XLSX files, images, and scanned pages, and reconstructs the page layout so the text comes out in normal human reading order with headers and footers removed.

It is aimed at developers building RAG pipelines, search, and LLM or agent workflows who need reliable text out of messy source files. It deals with complex cases like multi-column layouts, cross-page tables, handwriting, and CJK and other non-Latin scripts, converting formulas to LaTeX and tables to HTML.

Within document parsing and ingestion, MinerU sits at the front of the pipeline. It ships a CLI, an API, and integrations (MCP server, LangChain, Dify, FastGPT and others), and can run offline on CPU or GPU using one of three backends: pipeline, VLM, or hybrid.

What it does

  • Parses PDF, DOCX, PPTX, XLSX, images, and web pages into structured Markdown or JSON
  • Converts formulas to LaTeX and tables to HTML, with layout reconstruction
  • Handles scanned docs, handwriting, multi-column layouts, and cross-page table merging
  • Dual VLM + OCR engine, with OCR covering 109 languages
  • Multiple backends: pipeline (CPU/GPU, no hallucination), VLM, and hybrid
  • Integrations for MCP (Cursor, Claude Desktop), LangChain, LlamaIndex, Dify, FastGPT, plus CLI and REST API

Getting started

Install MinerU with uv, then parse a file from the command line.

Install MinerU

Install the package with all extras using uv. You can also install uv via pip first if you do not have it.

```bash
pip install uv
uv pip install -U "mineru[all]"
```
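Before scripting against the CLI, it can help to confirm that the `mineru` command is reachable from the current shell. The following is a minimal sketch, not part of MinerU: the helper names `have_mineru` and `report_mineru` are hypothetical, and the check relies only on the shell's `command -v` builtin.

```shell
# have_mineru: succeed if a `mineru` command is reachable from this shell.
# (Hypothetical helper name for this sketch; not shipped by MinerU.)
have_mineru() {
  command -v mineru >/dev/null 2>&1
}

# report_mineru: print a one-line status suitable for setup scripts.
report_mineru() {
  if have_mineru; then
    echo "mineru: found"
  else
    echo "mineru: not found (is the environment you installed into active?)"
  fi
}
```

Calling `report_mineru` right after installation gives a quick signal that the install landed in the environment you are actually using.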

Parse a document

Point the CLI at an input file (or directory) and an output folder. MinerU writes the extracted Markdown and JSON there.

```bash
mineru -p <input_path> -o <output_path>
```
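In a script, the same call can be wrapped with a couple of guards. This is a sketch under stated assumptions: `parse_doc` is a hypothetical wrapper name, it uses only the `-p` and `-o` options shown above, and it makes no assumption about how MinerU arranges subfolders inside the output folder, so it searches recursively for Markdown files.

```shell
# parse_doc INPUT OUTPUT
# Hypothetical wrapper around `mineru -p INPUT -o OUTPUT`.
parse_doc() {
  local input="$1" output="$2"

  # Fail early if the input path does not exist.
  if [ ! -e "$input" ]; then
    echo "parse_doc: no such file or directory: $input" >&2
    return 1
  fi

  # Make sure the output folder exists, then run the parser.
  mkdir -p "$output" || return 1
  mineru -p "$input" -o "$output" || return 1

  # List whatever Markdown files ended up under the output folder.
  # The subfolder layout is not assumed, so search recursively.
  find "$output" -type f -name '*.md' | sort
}
```

For example, `parse_doc ./papers ./parsed` would print the paths of the generated Markdown files, one per line, or stop with a non-zero status if the input is missing or the parser fails.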

Run on a CPU-only machine

Use the pipeline backend, which runs on CPU or GPU and avoids hallucinated content.

```bash
mineru -p <input_path> -o <output_path> -b pipeline
```
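If the same script has to run on machines with and without a GPU, the backend can be chosen at run time. The sketch below is a heuristic, not MinerU's own logic: `pick_backend` and the `GPU_BACKEND` variable are hypothetical names, the check only detects NVIDIA GPUs via `nvidia-smi`, and because only the `pipeline` value is named above, any other backend name must be supplied by you after checking what your installed version accepts.

```shell
# pick_backend: print the backend name to pass to `mineru -b`.
# GPU_BACKEND is a hypothetical variable for this sketch. Set it to the
# VLM or hybrid backend name your MinerU version accepts; if it is unset,
# the function falls back to "pipeline", which runs on CPU or GPU.
pick_backend() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    # An NVIDIA GPU appears to be usable.
    echo "${GPU_BACKEND:-pipeline}"
  else
    # No usable NVIDIA GPU detected: stay on the pipeline backend.
    echo "pipeline"
  fi
}
```

It would then be used as `mineru -p <input_path> -o <output_path> -b "$(pick_backend)"`.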

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

  • Build a RAG knowledge base by converting a folder of PDFs and Office files into clean Markdown for chunking and embedding
  • Extract tables and formulas from scientific or financial PDFs into HTML and LaTeX for downstream processing
  • Digitize scanned or handwritten documents using OCR across many languages, including CJK content
  • Feed parsed document text into an LLM or agent workflow via the MCP server or LangChain/Dify integrations
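For the RAG use case in the first item above, the parsed Markdown still has to be cut into chunks before embedding. One naive approach, sketched here with standard `awk`, is to start a new chunk at every first- or second-level heading. The function name `split_markdown` and the `chunk_NNN.md` naming are hypothetical, and the heading test is deliberately simple: a line starting with `# ` inside a code block would also be treated as a heading.

```shell
# split_markdown FILE OUTDIR
# Write one chunk file per "# " or "## " section of FILE into OUTDIR.
# Text before the first heading goes to chunk_000.md.
split_markdown() {
  local src="$1" outdir="$2"
  mkdir -p "$outdir" || return 1
  awk -v dir="$outdir" '
    # A heading line closes the current chunk and starts the next one.
    /^##? / { if (out != "") close(out); n++ }
    # Every line, headings included, is written to the current chunk.
    { out = sprintf("%s/chunk_%03d.md", dir, n); print > out }
  ' "$src"
}
```

Running `split_markdown parsed/report.md chunks/` leaves each section in its own file, ready to be passed to whatever embedding step the pipeline uses.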

How MinerU compares

MinerU alongside other open-source parsing & ingestion tools AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
|---|---|---|
| MarkItDown | ★ 156k | A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs. |
| MinerU | ★ 68.1k | Turn PDFs, Office files, and scans into clean Markdown or JSON for RAG and LLMs |
| Docling | ★ 61.9k | An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON. |
| Marker | ★ 36.2k | A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting. |
| Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini. |
| OpenDataLoader PDF | ★ 25.4k | OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs. |
| Unstructured | ★ 15k | A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines. |
| Zerox | ★ 12.2k | A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use. |