Overview
Marker is a Python document-conversion pipeline that turns PDFs and other files into clean Markdown, JSON, chunks, or HTML. It handles PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB files in any language, and keeps tables, forms, equations, inline math, links, references, and code blocks intact.
It is aimed at developers who need to feed documents into RAG pipelines, search indexes, or LLM applications. By producing structured text that mirrors the original layout, it gives downstream retrieval and chunking steps cleaner input than raw text extraction.
Marker runs on GPU, CPU, or MPS, and works offline by default. For harder documents you can add the optional --use_llm flag to bring in a Gemini or Ollama model, which improves table merging, inline math, and form extraction.
What it does
- Converts PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB files in all languages
- Outputs Markdown, JSON, chunks, or HTML, with tables, equations, inline math, links, references, and code blocks formatted
- Extracts and saves embedded images, and removes headers, footers, and other artifacts
- Optional hybrid mode (--use_llm) uses a Gemini or Ollama model to boost accuracy on tables, math, and forms
- Structured extraction from a JSON schema (beta)
- Runs on GPU, CPU, or MPS, and is extensible with your own formatting and logic
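As a rough illustration of working with the JSON output listed above, the sketch below walks a nested block tree and collects blocks of one type. It assumes each block is a dictionary with `block_type` and `children` keys (and, for content, an `html` key); these field names and the `Table` type are assumptions to check against a real output file, and `iter_blocks` and `blocks_of_type` are hypothetical helper names, not part of Marker.

```python
from typing import Any, Dict, Iterator, List


def iter_blocks(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a block and all of its descendants, depth first."""
    yield node
    # Assumed field name: "children" may be missing or None on leaf blocks.
    for child in node.get("children") or []:
        yield from iter_blocks(child)


def blocks_of_type(root: Dict[str, Any], block_type: str) -> List[Dict[str, Any]]:
    """Return every block whose (assumed) "block_type" field matches."""
    return [b for b in iter_blocks(root) if b.get("block_type") == block_type]
```

With a file loaded through `json.load`, calling `blocks_of_type(document, "Table")` would return the table blocks, assuming that is how the converter labels them.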
Getting started
You need Python 3.10+ and PyTorch installed. Install Marker from PyPI, then convert a file from the command line.
Install Marker
Install the base package for PDFs. Add the [full] extra to convert other document types like DOCX, PPTX, and EPUB.
pip install marker-pdf
# for non-PDF documents:
pip install marker-pdf[full]
Convert a single file
Run marker_single on a PDF or image. Flags control the output format and which pages are converted.
marker_single /path/to/file.pdf --output_format markdown
Try the interactive app (optional)
Marker ships with a Streamlit app for trying conversions with basic options in the browser.
pip install streamlit streamlit-ace
marker_gui
Boost accuracy with an LLM (optional)
Pass --use_llm to merge tables across pages, handle inline math, and extract form values. It defaults to gemini-2.0-flash and can use any Gemini or Ollama model.
marker_single /path/to/file.pdf --use_llm
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
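If you want to drive the command line from a script, one option is to build the argument list in Python and hand it to `subprocess`. The sketch below is a minimal wrapper around the `marker_single` command and the `--output_format` and `--use_llm` flags shown in this section. The helper names `build_marker_command` and `convert` are hypothetical, and the lowercase format values other than `markdown` are an assumption based on the output formats this page lists.

```python
import shlex
import subprocess
from pathlib import Path
from typing import List, Optional, Union

# Assumed flag values: only "markdown" appears verbatim in the commands above.
KNOWN_FORMATS = {"markdown", "json", "html", "chunks"}


def build_marker_command(
    path: Union[str, Path],
    output_format: Optional[str] = None,
    use_llm: bool = False,
) -> List[str]:
    """Build the argv list for a marker_single call without running it."""
    cmd = ["marker_single", str(path)]
    if output_format is not None:
        if output_format not in KNOWN_FORMATS:
            raise ValueError(f"unexpected output format: {output_format!r}")
        cmd += ["--output_format", output_format]
    if use_llm:
        cmd.append("--use_llm")
    return cmd


def convert(path: Union[str, Path], **options) -> None:
    """Run marker_single; requires Marker to be installed and on PATH."""
    cmd = build_marker_command(path, **options)
    print("running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with paths that contain spaces. Calling `convert("/path/to/file.pdf", output_format="markdown")` would run the same command as the single-file example.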
When to use it
- Preparing PDFs and office documents as clean Markdown for ingestion into a RAG or search pipeline
- Extracting tables, equations, and inline math from scientific papers and textbooks into structured JSON
- Batch-converting mixed document formats (DOCX, PPTX, XLSX, EPUB) into a single text format for indexing
- Pulling structured fields from forms or documents using a JSON schema (beta)
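For the RAG ingestion case in the list above, converted Markdown is usually split into smaller pieces before indexing. The sketch below shows one simple approach, splitting on `#`-style headings while leaving fenced code untouched. It is a generic illustration rather than Marker's own chunking (the built-in chunks output format may suit you better), and `chunk_markdown` is a hypothetical helper name.

```python
import re
from typing import Dict, List

# Matches ATX headings such as "# Title" or "### Section".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading, skipping empty sections."""
    chunks: List[Dict[str, str]] = []
    title = ""  # text before the first heading gets an empty heading
    lines: List[str] = []
    in_fence = False

    def flush() -> None:
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"heading": title, "text": body})

    for line in markdown.splitlines():
        # Track fenced code so "# comment" lines inside it are not headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title = match.group(2)
            lines = []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading next to its text, so a retriever can index or display the section title alongside the matched passage.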
How Marker compares
Marker alongside the other open-source parsing and ingestion tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| MarkItDown | ★ 156k | A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs. |
| MinerU | ★ 68.1k | A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content. |
| Docling | ★ 61.9k | An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON. |
| Marker | ★ 36.2k | A Python pipeline that converts PDFs and other documents to Markdown, JSON, or HTML with tables and equations intact. |
| Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini. |
| OpenDataLoader PDF | ★ 25.4k | OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs. |
| Unstructured | ★ 15k | A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines. |
| Zerox | ★ 12.2k | A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use. |