AI/TLDR

PyMuPDF

Fast Python library for extracting and rendering PDF and document content

Overview

PyMuPDF is a Python library for reading, extracting, and rendering content from PDFs and other document formats. It is built on top of MuPDF, a small C engine, and exposes both low-level access to document internals and higher-level helper methods. It installs with a single pip command and has no mandatory external dependencies.

It is aimed at developers who need to pull text, tables, images, or layout metadata out of documents, or who need to render, annotate, redact, merge, and split PDFs. Beyond plain text, it can return per-span detail like font, size, color, and bounding boxes, which is useful when document structure matters.

In a RAG pipeline, PyMuPDF covers the parsing and ingestion step: turning raw PDFs into usable text and structure. The companion pymupdf4llm package converts documents to structure-aware Markdown that you can pass straight to an LLM or a vector store.

What it does

  • Text extraction as plain text or a rich dict with font, size, color, and bounding-box metadata
  • Table detection with find_tables(), exporting to Markdown or a Pandas DataFrame
  • Page rendering to images at any DPI, plus embedded image extraction
  • Tesseract-based OCR for scanned pages, with configurable language
  • Annotation, redaction, form filling, and PDF editing (merge, split, reorder pages)
  • LLM-ready Markdown and JSON output via the companion pymupdf4llm package
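Rendering "at any DPI" comes down to simple arithmetic, because PDF page sizes are measured in points (72 per inch). The sketch below estimates the pixel size of a rendered page. `pixel_size` and `render_page` are hypothetical helpers, not part of PyMuPDF, and `render_page` assumes that `Page.get_pixmap(dpi=...)` and `Pixmap.save()` behave as they do in recent PyMuPDF releases.

```python
POINTS_PER_INCH = 72  # PDF user-space unit: 1 point = 1/72 inch


def pixel_size(width_pt, height_pt, dpi):
    """Estimate the (width, height) in pixels of a page rendered at `dpi`."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def render_page(pdf_path, page_number, dpi, out_path):
    """Render one page to an image file (requires PyMuPDF)."""
    import pymupdf  # imported lazily so pixel_size works without the library

    with pymupdf.open(pdf_path) as doc:
        pix = doc[page_number].get_pixmap(dpi=dpi)
        pix.save(out_path)


# US Letter is 612 x 792 points; at 144 DPI the scale factor is 2.
print(pixel_size(612, 792, 144))  # (1224, 1584)
```

A call such as `render_page("document.pdf", 0, 150, "page-1.png")` would write the first page as a PNG. Higher DPI values give sharper images at the cost of larger files and more memory.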

Getting started

Install the package with pip and open a document to start extracting content. No external dependencies are required for the core library.

Install PyMuPDF

Wheels are available for Windows, macOS, and Linux on Python 3.10 to 3.14. If no wheel exists for your platform, pip compiles from source and needs a C/C++ toolchain.

bash
pip install pymupdf

Extract text from a PDF

Open the document and read text page by page.

python
import pymupdf

doc = pymupdf.open("document.pdf")
for page in doc:
    print(page.get_text())
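When font, size, or position matters, the rich dict output is the one to walk. The sketch below assumes that `page.get_text("dict")` returns a nested structure of blocks, lines, and spans, with each span carrying `text`, `font`, `size`, and `bbox` keys. `iter_spans` is a hypothetical helper, and the sample data is hand-written so the example runs without a PDF.

```python
def iter_spans(page_dict):
    """Yield (text, font, size, bbox) for every text span in a page dict."""
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span["text"], span["font"], span["size"], tuple(span["bbox"])


# Hand-written sample in the shape get_text("dict") is assumed to return.
sample = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {"spans": [{"text": "Quarterly Report", "font": "Helvetica-Bold",
                            "size": 18.0, "bbox": (72.0, 72.0, 230.0, 94.0)}]},
                {"spans": [{"text": "Revenue grew this quarter.", "font": "Helvetica",
                            "size": 11.0, "bbox": (72.0, 110.0, 210.0, 124.0)}]},
            ],
        },
        {"type": 1},  # image block: no text lines
    ]
}

# Treat spans larger than 14 pt as headings (an arbitrary threshold).
headings = [text for text, font, size, bbox in iter_spans(sample) if size > 14]
print(headings)  # ['Quarterly Report']
```

With a real document, pass `page.get_text("dict")` in place of `sample`. The size threshold is only an illustration; real documents usually need a heuristic tuned to their own fonts.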

Extract tables

Use find_tables() to locate tables and export them as Markdown or a Pandas DataFrame.

python
import pymupdf

doc = pymupdf.open("spreadsheet.pdf")
page = doc[0]  # first page

tables = page.find_tables()
for table in tables:
    print(table.to_markdown())
    df = table.to_pandas()  # requires pandas to be installed
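If pandas is not available, the raw cell values can be handled with plain Python. The sketch below assumes that `table.extract()` returns a list of rows, each a list of cell strings (or `None` for empty cells), and that the first row is the header. `rows_to_records` is a hypothetical helper, and the sample rows are hand-written so the example runs without a PDF.

```python
def rows_to_records(rows):
    """Turn [[header...], [cell...], ...] into a list of dicts keyed by header."""
    if not rows:
        return []
    # Empty cells may come back as None, so normalise them to "".
    header = [(cell or "").strip() for cell in rows[0]]
    records = []
    for row in rows[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


# Sample rows in the shape table.extract() is assumed to return.
rows = [
    ["Region", "Units"],
    ["North", " 120 "],
    ["South", None],
]
print(rows_to_records(rows))
# [{'Region': 'North', 'Units': '120'}, {'Region': 'South', 'Units': ''}]
```

With a real page, `rows_to_records(table.extract())` would produce one dict per data row. Tables without a header row, or with merged cells, need different handling.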

Convert to Markdown for LLMs

Install the companion pymupdf4llm package to get structure-aware Markdown you can pass to an LLM or vector store.

bash
pip install pymupdf4llm

Then call it from Python:

python
import pymupdf4llm

md = pymupdf4llm.to_markdown("report.pdf")

Commands and code are distilled from the project's own documentation. Always check the official repo for the latest.

When to use it

  • Ingesting PDFs into a RAG pipeline by converting them to clean Markdown for chunking and embedding
  • Pulling structured tables out of reports or spreadsheets-as-PDF into Pandas for analysis
  • Extracting text with layout metadata (font, size, position) when document structure matters
  • Rendering pages to images, OCRing scanned documents, or redacting sensitive content before sharing
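For the first use case above, the Markdown still has to be split into chunks before embedding. The following is a minimal sketch of one common approach, splitting at heading lines. `chunk_by_heading` is a hypothetical helper, not part of PyMuPDF or pymupdf4llm, and the sample string stands in for the output of `pymupdf4llm.to_markdown()`.

```python
import re

# A Markdown ATX heading: one to six "#" characters followed by a space.
HEADING = re.compile(r"^#{1,6} ")


def chunk_by_heading(markdown):
    """Split Markdown into chunks, starting a new chunk at each heading line.

    Note: lines inside code fences that start with "# " are also treated
    as headings by this simple version.
    """
    chunks, current = [], []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


md = "# Intro\nPyMuPDF reads PDFs.\n\n## Tables\nUse find_tables().\n"
for chunk in chunk_by_heading(md):
    print(repr(chunk))
```

Heading-based chunks keep each section's title next to its text, which tends to help retrieval. Long sections may still need a further split by size before embedding.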

How PyMuPDF compares

PyMuPDF alongside other open-source parsing & ingestion tools AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
MarkItDown | ★ 156k | A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs.
MinerU | ★ 68.1k | A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content.
Docling | ★ 61.9k | An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON.
Marker | ★ 36.2k | A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting.
Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini.
OpenDataLoader PDF | ★ 25.4k | OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs.
Unstructured | ★ 15k | A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines.
PyMuPDF | ★ 10k | Fast Python library for extracting and rendering PDF and document content.