Overview
PyMuPDF is a Python library for reading, extracting, and rendering content from PDFs and other document formats. It is built on top of MuPDF, a small C engine, and exposes both low-level access to document internals and higher-level helper methods. It installs with a single pip command and has no mandatory external dependencies.
It is aimed at developers who need to pull text, tables, images, or layout metadata out of documents, or who need to render, annotate, redact, merge, and split PDFs. Beyond plain text, it can return per-span detail like font, size, color, and bounding boxes, which is useful when document structure matters.
Within parsing and ingestion for RAG pipelines, PyMuPDF handles the step of turning raw PDFs into usable data. The companion pymupdf4llm package converts documents to structure-aware Markdown that you can pass straight to an LLM or a vector store.
What it does
- Text extraction as plain text or a rich dict with font, size, color, and bounding-box metadata
- Table detection with find_tables(), exporting to Markdown or a Pandas DataFrame
- Page rendering to images at any DPI, plus embedded image extraction
- Tesseract-based OCR for scanned pages, with configurable language
- Annotation, redaction, form filling, and PDF editing (merge, split, reorder pages)
- LLM-ready Markdown and JSON output via the companion pymupdf4llm package
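As a rough illustration of the rich dict output in the first bullet, the sketch below flattens the result of `page.get_text("dict")` into one record per text span. It assumes the nested `blocks` / `lines` / `spans` layout and the `text`, `font`, `size`, `color`, and `bbox` span keys; `iter_spans` and `spans_from_pdf` are hypothetical helper names, not part of PyMuPDF.

```python
from typing import Any, Iterator


def iter_spans(page_dict: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield one flat record per text span from a get_text("dict") result."""
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so .get() skips them.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield {
                    "text": span["text"],
                    "font": span["font"],
                    "size": span["size"],
                    # Assumption: color is an sRGB integer; format it as hex.
                    "color": f"#{span['color']:06x}",
                    "bbox": tuple(span["bbox"]),
                }


def spans_from_pdf(path: str, page_number: int = 0) -> list[dict[str, Any]]:
    """Open a PDF and return the span records for a single page."""
    import pymupdf  # imported lazily so iter_spans works without the library

    with pymupdf.open(path) as doc:
        return list(iter_spans(doc[page_number].get_text("dict")))
```

Calling `spans_from_pdf("document.pdf")` would give a list you can filter by font size, for example to guess which spans are headings.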
Getting started
Install the package with pip and open a document to start extracting content. No external dependencies are required for the core library.
Install PyMuPDF
Wheels are available for Windows, macOS, and Linux on Python 3.10 to 3.14. If no wheel exists for your platform, pip compiles from source and needs a C/C++ toolchain.
```shell
pip install pymupdf
```
Extract text from a PDF
Open the document and read text page by page.
```python
import pymupdf

doc = pymupdf.open("document.pdf")
for page in doc:
    # With no arguments, get_text() returns the page's plain text.
    print(page.get_text())
```
Extract tables
Use find_tables() to locate tables and export them as Markdown or a Pandas DataFrame.
```python
import pymupdf

doc = pymupdf.open("spreadsheet.pdf")
page = doc[0]  # first page
tables = page.find_tables()
for table in tables:
    print(table.to_markdown())
    df = table.to_pandas()  # requires pandas to be installed
```
Convert to Markdown for LLMs
Install the companion pymupdf4llm package to get structure-aware Markdown you can pass to an LLM or vector store.
```shell
pip install pymupdf4llm
```
```python
import pymupdf4llm

md = pymupdf4llm.to_markdown("report.pdf")
```
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
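For RAG ingestion, a common next step is splitting the Markdown into chunks before embedding. The sketch below assumes that `pymupdf4llm.to_markdown` accepts `page_chunks=True` and then returns one dict per page with a `text` key; confirm this against the documentation for your installed version. `pack_paragraphs` and `pdf_to_chunks` are hypothetical helper names, and greedy paragraph packing is only one simple chunking strategy, not something the library prescribes.

```python
def pack_paragraphs(markdown: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Chunks stay within max_chars where possible; a single paragraph
    longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in markdown.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def pdf_to_chunks(path: str, max_chars: int = 1000) -> list[dict]:
    """Convert a PDF to Markdown page by page, then chunk each page."""
    import pymupdf4llm  # imported lazily; only needed for real PDFs

    # Assumption: page_chunks=True yields one dict per page with a "text" key.
    pages = pymupdf4llm.to_markdown(path, page_chunks=True)
    chunks = []
    for page_index, page in enumerate(pages):
        for piece in pack_paragraphs(page["text"], max_chars):
            chunks.append({"page_index": page_index, "text": piece})
    return chunks
```

Keeping the page index next to each chunk makes it possible to cite the source page when a retrieved chunk is shown to a user.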
When to use it
- Ingesting PDFs into a RAG pipeline by converting them to clean Markdown for chunking and embedding
- Pulling structured tables out of reports or spreadsheets-as-PDF into Pandas for analysis
- Extracting text with layout metadata (font, size, position) when document structure matters
- Rendering pages to images, OCRing scanned documents, or redacting sensitive content before sharing
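The rendering use case in the last bullet can be sketched as follows. This is a minimal example that assumes `page.get_pixmap()` accepts a `dpi` keyword and that `Pixmap.save()` infers PNG output from the file extension; `png_path` and `render_pdf` are hypothetical helper names, and the file naming scheme is arbitrary.

```python
from pathlib import Path


def png_path(out_dir: str, stem: str, page_index: int) -> Path:
    """Build a zero-padded, 1-based output name such as report-p0001.png."""
    return Path(out_dir) / f"{stem}-p{page_index + 1:04d}.png"


def render_pdf(pdf_path: str, out_dir: str, dpi: int = 150) -> list[Path]:
    """Render every page of a PDF to a PNG file at the requested DPI."""
    import pymupdf  # imported lazily so png_path works without the library

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # rasterize the page
            target = png_path(out_dir, Path(pdf_path).stem, page.number)
            pix.save(str(target))
            written.append(target)
    return written
```

A higher `dpi` gives sharper images at the cost of larger files, which matters if the images are headed for OCR or a vision model.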
How PyMuPDF compares
PyMuPDF alongside the other open-source parsing and ingestion tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| MarkItDown | ★ 156k | A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs. |
| MinerU | ★ 68.1k | A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content. |
| Docling | ★ 61.9k | An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON. |
| Marker | ★ 36.2k | A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting. |
| Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini. |
| OpenDataLoader PDF | ★ 25.4k | OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs. |
| Unstructured | ★ 15k | A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines. |
| PyMuPDF | ★ 10k | A fast Python library for extracting and rendering content from PDFs and other document formats. |