AI/TLDR

Marker

Convert PDFs and documents to Markdown, JSON, or HTML with tables and equations intact

Overview

Marker is a Python document-conversion pipeline that turns PDFs and other files into clean Markdown, JSON, chunks, or HTML. It handles PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB files in any language, and keeps tables, forms, equations, inline math, links, references, and code blocks intact.

It is aimed at developers who need to feed documents into RAG pipelines, search indexes, or LLM applications. By producing structured text that mirrors the original layout, it gives downstream retrieval and chunking steps cleaner input than raw text extraction.
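To make the chunking step concrete, here is a minimal sketch of splitting converted Markdown into heading-delimited sections before indexing. The function name `split_markdown_by_heading` and the sample text are hypothetical, and this hand-rolled splitter is only a stand-in for illustration; it is not Marker's own chunks output.

```python
import re

def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split Markdown text into chunks, one per heading-delimited section."""
    chunks = []
    title = None   # heading of the section being built (None = text before any heading)
    lines = []     # body lines collected for the current section

    def flush():
        # Store the current section unless it is an empty preamble.
        body = "\n".join(lines).strip()
        if title is not None or body:
            chunks.append({"title": title, "text": body})

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # A new heading closes the previous section.
            flush()
            title = match.group(2).strip()
            lines = []
        else:
            lines.append(line)
    flush()
    return chunks

# Hypothetical converter output, used only to exercise the function.
sample = "# Intro\nMarker converts PDFs.\n\n## Tables\n| a | b |\n|---|---|\n| 1 | 2 |\n"
chunks = split_markdown_by_heading(sample)
for chunk in chunks:
    print(chunk["title"], "->", len(chunk["text"]), "characters")
```

Because tables and equations stay inside their section, each chunk keeps its context. One limitation of this sketch: a line starting with `#` inside a code block would be treated as a heading.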

Marker runs on GPU, CPU, or MPS, and works offline by default. For harder documents you can add the optional --use_llm flag to bring in a Gemini or Ollama model, which improves table merging, inline math, and form extraction.

What it does

  • Converts PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB files in all languages
  • Outputs Markdown, JSON, chunks, or HTML, with tables, equations, inline math, links, references, and code blocks formatted
  • Extracts and saves embedded images, and removes headers, footers, and other artifacts
  • Optional hybrid mode (--use_llm) uses a Gemini or Ollama model to boost accuracy on tables, math, and forms
  • Structured extraction from a JSON schema (beta)
  • Runs on GPU, CPU, or MPS, and is extensible with your own formatting and logic
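As a rough illustration of the structured-extraction item above, the sketch below shows what a small JSON schema might look like and how an extracted record could be checked against its required fields. The invoice fields, the sample record, and the `missing_required` helper are all hypothetical and not part of Marker; how a schema is actually passed to Marker is not shown here, so check the project's documentation for that.

```python
import json

# Hypothetical schema for an invoice; the field names are illustrative only.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "due_date": {"type": "string"},
    },
    "required": ["invoice_number", "total"],
}

def missing_required(record: dict, schema: dict) -> list[str]:
    """Return the required field names that are absent from the record."""
    return [name for name in schema.get("required", []) if name not in record]

# Serialize the schema as JSON text, the form in which a schema is usually shared.
schema_json = json.dumps(invoice_schema, indent=2)

# Hypothetical extracted record that lacks the "total" field.
record = {"invoice_number": "INV-001", "due_date": "2024-01-31"}
missing = missing_required(record, invoice_schema)
print("missing fields:", missing)
```

A check like this is worth keeping downstream, since the feature is marked beta and extracted records may be incomplete.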

Getting started

You need Python 3.10+ and PyTorch installed. Install Marker from PyPI, then convert a file from the command line.

Install Marker

Install the base package for PDFs. Add the [full] extra to convert other document types like DOCX, PPTX, and EPUB.

bash
pip install marker-pdf
# for non-PDF documents (quoted so shells like zsh do not expand the brackets):
pip install "marker-pdf[full]"

Convert a single file

Run marker_single on a PDF or image. Flags control the output format (--output_format) and which pages are converted.

bash
marker_single /path/to/file.pdf --output_format markdown
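If you want to drive the same command from a Python script, one possible approach is to build the argument list and hand it to `subprocess`. This is a sketch under assumptions: `build_marker_command` and `convert` are hypothetical helper names, and the lowercase format values other than `markdown` are assumed from the output formats listed on this page. Only `marker_single`, `--output_format`, and `--use_llm` come from the text.

```python
import shlex
import subprocess

# Assumed flag values, based on the output formats named on this page.
ALLOWED_FORMATS = {"markdown", "json", "html", "chunks"}

def build_marker_command(path: str, output_format: str = "markdown",
                         use_llm: bool = False) -> list[str]:
    """Build the argument list for a marker_single call (hypothetical helper)."""
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    command = ["marker_single", path, "--output_format", output_format]
    if use_llm:
        command.append("--use_llm")
    return command

def convert(path: str, output_format: str = "markdown", use_llm: bool = False) -> None:
    """Run the conversion; requires Marker to be installed and on PATH."""
    subprocess.run(build_marker_command(path, output_format, use_llm), check=True)

# Building the command does not need Marker installed, so it is safe to inspect.
command = build_marker_command("/path/to/file.pdf")
print(shlex.join(command))
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces, and `check=True` raises an error if the conversion fails.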

Try the interactive app (optional)

Marker ships with a Streamlit app for trying conversions with basic options in the browser.

bash
pip install streamlit streamlit-ace
marker_gui

Boost accuracy with an LLM (optional)

Pass --use_llm to merge tables across pages, handle inline math, and extract form values. It defaults to gemini-2.0-flash and can use any Gemini or Ollama model.

bash
marker_single /path/to/file.pdf --use_llm

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

  • Preparing PDFs and office documents as clean Markdown for ingestion into a RAG or search pipeline
  • Extracting tables, equations, and inline math from scientific papers and textbooks into structured JSON
  • Batch-converting mixed document formats (DOCX, PPTX, XLSX, EPUB) into a single text format for indexing
  • Pulling structured fields from forms or documents using a JSON schema (beta)
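For the batch-conversion case in the list above, a small script can collect the files worth converting and plan one `marker_single` call per file. This is a sketch, not the project's own batch tooling: the `SUPPORTED` extension set is an assumption derived from the formats named on this page, and `find_convertible` and `plan_commands` are hypothetical helpers. The demo only plans the commands, so it runs without Marker installed.

```python
from pathlib import Path
import tempfile

# Assumed extensions, taken from the document types named on this page.
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".epub", ".html"}

def find_convertible(folder: Path) -> list[Path]:
    """Return files under folder whose extension is in SUPPORTED, sorted by path."""
    return sorted(
        p for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )

def plan_commands(folder: Path) -> list[list[str]]:
    """Build one marker_single argument list per convertible file."""
    return [
        ["marker_single", str(p), "--output_format", "markdown"]
        for p in find_convertible(folder)
    ]

# Demo on a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["report.pdf", "slides.PPTX", "notes.txt"]:
        (root / name).write_text("")
    planned = plan_commands(root)
    names = [Path(cmd[1]).name for cmd in planned]

print(names)
```

Each planned list can then be passed to `subprocess.run(..., check=True)`. Converting everything to Markdown gives the indexer a single format to handle, whatever the source file type was.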

How Marker compares

Marker alongside the other open-source parsing and ingestion tools that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| MarkItDown | ★ 156k | A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs. |
| MinerU | ★ 68.1k | A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content. |
| Docling | ★ 61.9k | An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON. |
| Marker | ★ 36.2k | Convert PDFs and documents to Markdown, JSON, or HTML with tables and equations intact. |
| Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini. |
| OpenDataLoader PDF | ★ 25.4k | OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs. |
| Unstructured | ★ 15k | A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines. |
| Zerox | ★ 12.2k | A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use. |