PyMuPDF

Fast Python library for extracting and rendering PDF and document content

github.com/pymupdf/PyMuPDF (★ 10k) · pymupdf.readthedocs.io

Overview

PyMuPDF is a Python library for reading, extracting, and rendering content from PDFs and other document formats. It is built on top of MuPDF, a small C engine, and exposes both low-level access to document internals and higher-level helper methods. It installs with a single pip command and has no mandatory external dependencies.

It is aimed at developers who need to pull text, tables, images, or layout metadata out of documents, or who need to render, annotate, redact, merge, and split PDFs. Beyond plain text, it can return per-span detail like font, size, color, and bounding boxes, which is useful when document structure matters.

In the parsing and ingestion stage of a RAG pipeline, PyMuPDF turns raw PDFs into usable text and structured data. The companion pymupdf4llm package converts documents to structure-aware Markdown that you can pass straight to an LLM or a vector store.

What it does

Text extraction as plain text or a rich dict with font, size, color, and bounding-box metadata
Table detection with find_tables(), exporting to Markdown or a Pandas DataFrame
Page rendering to images at any DPI, plus embedded image extraction
Tesseract-based OCR for scanned pages, with configurable language (Tesseract's language data files must be installed separately)
Annotation, redaction, form filling, and PDF editing (merge, split, reorder pages)
LLM-ready Markdown and JSON output via the companion pymupdf4llm package
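The rendering and page-editing items above can be sketched as follows. This is a minimal illustration, not the project's recommended recipe: the helper names (zoom_for_dpi, render_page, split_pages) and the file paths are hypothetical, and pymupdf is imported inside the functions so the DPI arithmetic can be read and tested on its own.

```python
def zoom_for_dpi(dpi: int) -> float:
    """PDF user space uses 72 points per inch, so the scale factor is dpi / 72."""
    return dpi / 72.0


def render_page(pdf_path: str, page_index: int, png_path: str, dpi: int = 150) -> None:
    """Render one page to a PNG file at the requested resolution."""
    import pymupdf  # imported lazily; only needed when rendering

    with pymupdf.open(pdf_path) as doc:
        page = doc[page_index]
        zoom = zoom_for_dpi(dpi)
        # A scaling matrix enlarges the page before rasterizing it.
        pix = page.get_pixmap(matrix=pymupdf.Matrix(zoom, zoom))
        pix.save(png_path)


def split_pages(pdf_path: str, first: int, last: int, out_path: str) -> None:
    """Copy pages first..last (0-based, inclusive) into a new PDF."""
    import pymupdf

    with pymupdf.open(pdf_path) as src, pymupdf.open() as dst:
        dst.insert_pdf(src, from_page=first, to_page=last)
        dst.save(out_path)
```

For example, render_page("document.pdf", 0, "page1.png", dpi=300) would write the first page at 300 DPI, and split_pages("document.pdf", 0, 2, "first_three.pdf") would copy the first three pages into a new file.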

Getting started

Install the package with pip and open a document to start extracting content. No external dependencies are required for the core library.

Install PyMuPDF

Wheels are available for Windows, macOS, and Linux on Python 3.10 to 3.14. If no wheel exists for your platform, pip compiles from source and needs a C/C++ toolchain.

```bash
pip install pymupdf
```

Extract text from a PDF

Open the document and read text page by page.

```python
import pymupdf

doc = pymupdf.open("document.pdf")
for page in doc:
    print(page.get_text())
```
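When layout matters, the same get_text() method can return a nested dict instead of a plain string. The sketch below walks that dict (blocks, then lines, then spans) and collects font, size, color, and bounding box for each span. The Span type and the helper names are hypothetical, and the 14pt threshold is only an illustrative heuristic for spotting headings.

```python
from typing import Iterator, NamedTuple


class Span(NamedTuple):
    text: str
    font: str
    size: float
    color: int   # sRGB packed as an integer, 0xRRGGBB
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates


def iter_spans(page_dict: dict) -> Iterator[Span]:
    """Yield every text span from the dict returned by page.get_text("dict")."""
    for block in page_dict.get("blocks", []):
        # Image blocks have type 1 and carry no "lines" key, so skip them.
        if block.get("type") != 0:
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                yield Span(span["text"], span["font"], span["size"],
                           span["color"], tuple(span["bbox"]))


def print_large_text(path: str, min_size: float = 14.0) -> None:
    """Print spans at or above min_size, a rough heuristic for headings."""
    import pymupdf  # imported lazily so iter_spans works on plain dicts

    with pymupdf.open(path) as doc:
        for page in doc:
            for span in iter_spans(page.get_text("dict")):
                if span.size >= min_size:
                    print(f"p{page.number + 1} {span.font} {span.size:.1f}pt: {span.text}")
```

Calling print_large_text("document.pdf") would list the larger text on each page along with its font and size.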

Extract tables

Use find_tables() to locate tables and export them as Markdown or a Pandas DataFrame.

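```python
import pymupdf

doc = pymupdf.open("spreadsheet.pdf")
page = doc[0]  # first page

# find_tables() returns a TableFinder; its .tables attribute lists the detected tables
finder = page.find_tables()
for table in finder.tables:
    print(table.to_markdown())
    df = table.to_pandas()  # requires pandas to be installed
```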
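If you would rather not depend on Pandas, a table's extract() method returns its cells as a list of rows, with None for empty cells. The sketch below turns those rows into dictionaries keyed by the first row. The helper names are hypothetical, and treating the first row as the header is an assumption that will not hold for every table; duplicate header names would also overwrite each other.

```python
def rows_to_records(rows: list[list]) -> list[dict]:
    """Treat the first row as the header and map each later row to a dict."""
    if not rows:
        return []
    # Empty header cells get a positional placeholder name such as "col1".
    header = [(name or f"col{i}").strip() for i, name in enumerate(rows[0])]
    records = []
    for row in rows[1:]:
        cells = [(cell or "").strip() for cell in row]  # None becomes ""
        records.append(dict(zip(header, cells)))
    return records


def tables_as_records(path: str, page_index: int = 0) -> list[list[dict]]:
    """Return one list of records per table detected on the given page."""
    import pymupdf  # imported lazily so rows_to_records works on plain lists

    with pymupdf.open(path) as doc:
        finder = doc[page_index].find_tables()
        return [rows_to_records(table.extract()) for table in finder.tables]
```

For example, tables_as_records("spreadsheet.pdf") would return the tables on the first page as lists of dictionaries.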

Convert to Markdown for LLMs

Install the companion pymupdf4llm package to get structure-aware Markdown you can pass to an LLM or vector store.

```bash
pip install pymupdf4llm
```

Then, in Python:

```python
import pymupdf4llm

md = pymupdf4llm.to_markdown("report.pdf")
```

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Ingesting PDFs into a RAG pipeline by converting them to clean Markdown for chunking and embedding
Pulling structured tables out of reports or spreadsheets-as-PDF into Pandas for analysis
Extracting text with layout metadata (font, size, position) when document structure matters
Rendering pages to images, OCRing scanned documents, or redacting sensitive content before sharing
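For the RAG ingestion case above, the Markdown still has to be cut into chunks before embedding. The sketch below is one simple way to do that, not something PyMuPDF or pymupdf4llm provides: chunk_markdown is a hypothetical helper that splits at headings and then packs paragraphs up to a character limit. The 2000-character default is an arbitrary assumption, and a single paragraph longer than the limit is kept whole.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")
FENCE = "`" * 3  # Markdown code-fence marker


def chunk_markdown(md: str, max_chars: int = 2000) -> list[str]:
    """Split Markdown at headings, then cap chunks at max_chars on paragraph breaks."""
    sections, current = [], []
    in_fence = False
    for line in md.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' lines inside code blocks
        if HEADING.match(line) and not in_fence and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks = []
    for section in sections:
        if not section:
            continue
        buf = ""
        for para in section.split("\n\n"):
            candidate = f"{buf}\n\n{para}" if buf else para
            if len(candidate) > max_chars and buf:
                chunks.append(buf)  # current chunk is full; start a new one
                buf = para
            else:
                buf = candidate
        if buf:
            chunks.append(buf)
    return chunks
```

With pymupdf4llm installed, chunk_markdown(pymupdf4llm.to_markdown("report.pdf")) would produce a list of strings ready to embed.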

How PyMuPDF compares

PyMuPDF alongside other open-source parsing & ingestion tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
MarkItDown	★ 156k	A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs.
MinerU	★ 68.1k	A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content.
Docling	★ 61.9k	An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON.
Marker	★ 36.2k	A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting.
Repomix	★ 26.4k	Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini.
OpenDataLoader PDF	★ 25.4k	OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs.
Unstructured	★ 15k	A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines.
PyMuPDF	★ 10k	Fast Python library for extracting and rendering PDF and document content
