Marker

Convert PDFs and documents to Markdown, JSON, or HTML with tables and equations intact

github.com/datalab-to/marker | ★ 36.2k | datalab.to

Overview

Marker is a Python document-conversion pipeline that turns PDFs and other files into clean Markdown, JSON, chunks, or HTML. It handles PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB files in any language, and keeps tables, forms, equations, inline math, links, references, and code blocks intact.

It is aimed at developers who need to feed documents into RAG pipelines, search indexes, or LLM applications. By producing structured text that mirrors the original layout, it gives downstream retrieval and chunking steps cleaner input than raw text extraction.
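To illustrate why heading-preserving Markdown helps the chunking step, here is a minimal sketch that splits a converted Markdown file into one chunk file per heading-delimited section. It is not part of Marker: the function name `split_by_heading` and the `chunk_NNN.md` naming are assumptions made for this example, and it naively treats every line starting with `#` as a heading, including comment lines inside code blocks.

```bash
# Split a Markdown file into one chunk file per heading-delimited section.
# Hypothetical helper, not a Marker command.
split_by_heading() (
  md_file="$1"
  out_dir="$2"
  mkdir -p "$out_dir"
  awk -v out="$out_dir" '
    /^#+ / || NR == 1 {   # a heading line (or the first line) opens a new chunk
      if (path != "") close(path)
      n++
      path = sprintf("%s/chunk_%03d.md", out, n)
    }
    { print > path }      # every line goes to the currently open chunk
  ' "$md_file"
)

# Usage (paths are placeholders):
#   split_by_heading output/file.md output/chunks
```

Splitting on headings keeps each section's text together, which is usually a better retrieval unit than fixed-size character windows.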

Marker runs on GPU, CPU, or MPS, and works offline by default. For harder documents you can add the optional --use_llm flag to bring in a Gemini or Ollama model, which improves table merging, inline math, and form extraction.

What it does

Converts PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB files in all languages
Outputs Markdown, JSON, chunks, or HTML, with tables, equations, inline math, links, references, and code blocks formatted
Extracts and saves embedded images, and removes headers, footers, and other artifacts
Optional hybrid mode (--use_llm) uses a Gemini or Ollama model to boost accuracy on tables, math, and forms
Structured extraction from a JSON schema (beta)
Runs on GPU, CPU, or MPS, and is extensible with your own formatting and logic

Getting started

You need Python 3.10+ and PyTorch installed. Install Marker from PyPI, then convert a file from the command line.

Install Marker

Install the base package for PDFs. Add the [full] extra to convert other document types like DOCX, PPTX, and EPUB.

pip install marker-pdf
# for non-PDF documents (quoted so shells such as zsh do not expand the brackets):
pip install "marker-pdf[full]"

Convert a single file

Run marker_single on a PDF or image. Use --output_format to choose the output format; other flags restrict which pages are converted.

marker_single /path/to/file.pdf --output_format markdown
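As a sketch of how the output format flag might be varied, the helper below prints, without running, one marker_single command per format named in this article. The function name `marker_commands` is hypothetical, and the lowercase format values are an assumption based on the formats listed above; check the repo for the values it actually accepts.

```bash
# Dry run: print one marker_single command per output format.
# Assumption: format values are the lowercase names of the listed formats.
marker_commands() (
  input="$1"
  for fmt in markdown json html chunks; do
    printf 'marker_single "%s" --output_format %s\n' "$input" "$fmt"
  done
)

marker_commands /path/to/file.pdf
```

Printing the commands first makes it easy to review them before piping the output to `sh` to run the conversions.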

Try the interactive app (optional)

Marker ships with a Streamlit app for trying conversions with basic options in the browser.

pip install streamlit streamlit-ace
marker_gui

Boost accuracy with an LLM (optional)

Pass --use_llm to merge tables across pages, handle inline math, and extract form values. It defaults to gemini-2.0-flash and can use any Gemini or Ollama model.

marker_single /path/to/file.pdf --use_llm
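Because the LLM step is optional, one possible pattern is to stay offline by default and add --use_llm only for documents you have flagged as hard. The sketch below only builds and prints the command; the `convert_cmd` name and its `offline`/`llm` mode argument are hypothetical conventions for this example, not Marker options.

```bash
# Build (but do not run) a marker_single command, adding --use_llm on request.
# The second argument is an illustrative mode switch: "offline" (default) or "llm".
convert_cmd() (
  file="$1"
  mode="${2:-offline}"
  cmd="marker_single \"$file\" --output_format markdown"
  if [ "$mode" = "llm" ]; then
    cmd="$cmd --use_llm"
  fi
  printf '%s\n' "$cmd"
)

convert_cmd /path/to/file.pdf           # offline conversion
convert_cmd /path/to/scanned.pdf llm    # same command with --use_llm appended
```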

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

Preparing PDFs and office documents as clean Markdown for ingestion into a RAG or search pipeline
Extracting tables, equations, and inline math from scientific papers and textbooks into structured JSON
Batch-converting mixed document formats (DOCX, PPTX, XLSX, EPUB) into a single text format for indexing
Pulling structured fields from forms or documents using a JSON schema (beta)
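For the batch-conversion case, a minimal sketch is to walk a folder and emit one marker_single command per supported file. This is a dry run that only prints commands; the function name `plan_batch` is hypothetical, the extension list mirrors the document formats named in this article (image files are left out for brevity), and matching is case-sensitive. Non-PDF inputs assume the [full] extra is installed.

```bash
# Dry run: list a marker_single command for every supported file under a folder.
# Hypothetical helper; extensions follow the formats named in this article.
plan_batch() (
  dir="$1"
  find "$dir" -type f \( -name '*.pdf' -o -name '*.docx' -o -name '*.pptx' \
    -o -name '*.xlsx' -o -name '*.epub' -o -name '*.html' \) | sort |
    while IFS= read -r file; do
      printf 'marker_single "%s" --output_format markdown\n' "$file"
    done
)

# Usage (the folder is a placeholder):
#   plan_batch ./docs          # review the plan
#   plan_batch ./docs | sh     # run it
```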

How Marker compares

Marker alongside the other open-source parsing and ingestion tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
MarkItDown	★ 156k	A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs.
MinerU	★ 68.1k	A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content.
Docling	★ 61.9k	An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON.
Marker	★ 36.2k	Converts PDFs and other documents to Markdown, JSON, or HTML with tables and equations intact.
Repomix	★ 26.4k	Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini.
OpenDataLoader PDF	★ 25.4k	OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs.
Unstructured	★ 15k	A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines.
Zerox	★ 12.2k	A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use.
