MarkItDown

Convert PDFs, Office docs, and more into clean Markdown for LLMs

Overview

MarkItDown is a lightweight Python utility from Microsoft that converts many file types into Markdown. It supports PDF, PowerPoint, Word, Excel, images, audio, HTML, text formats like CSV/JSON/XML, ZIP archives, EPubs, and YouTube URLs, preserving structure such as headings, lists, tables, and links.

It is built for feeding documents into LLMs and text-analysis pipelines, not for high-fidelity conversions meant for human reading. Markdown stays close to plain text while keeping document structure; mainstream models handle it well, and it is token-efficient. The project compares itself to textract, with a stronger focus on retaining structure.

It fits the web scraping and data extraction space as a normalization step: take mixed source files and turn them into one consistent text format your prompts and pipelines can consume. It runs as a command-line tool or as a Python library.

What it does

Converts PDF, Word, PowerPoint, Excel, HTML, and text formats (CSV, JSON, XML) into Markdown
Handles images (EXIF metadata and OCR), audio (metadata and speech transcription), EPubs, ZIP archives, and YouTube URLs
Optional dependency groups (e.g. [pdf], [docx], [pptx]) let you install only the formats you need
Command-line interface that takes a file path or piped stdin and writes Markdown to stdout or, with -o, to a named file
Supports third-party plugins, including markitdown-ocr for LLM-based OCR of embedded images
Optional Azure Document Intelligence and Azure Content Understanding backends for higher-quality and structured extraction

Getting started

MarkItDown requires Python 3.10 or higher; a virtual environment is recommended. Install it with pip, then convert files from the command line or in Python.

Install with pip

Install MarkItDown with all optional format dependencies. You can also install only the groups you need, e.g. 'markitdown[pdf, docx, pptx]'.

bash

pip install 'markitdown[all]'
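If you would rather install only the groups a project needs, the choice can be derived from the file extensions you expect to convert. The following is an illustrative sketch, not something taken from the project's documentation. The helper name pip_requirement and the suffix-to-group mapping are assumptions; it only knows the three groups named above (pdf, docx, pptx) and falls back to [all] for anything else, which may install more than strictly needed.

```python
from pathlib import Path

# Assumed mapping: only the three groups named in this section are listed.
# Any other extension falls back to the catch-all [all] group.
EXTRA_FOR_SUFFIX = {".pdf": "pdf", ".docx": "docx", ".pptx": "pptx"}


def pip_requirement(paths):
    """Return a pip requirement string that covers the given file paths."""
    extras = set()
    for p in paths:
        extra = EXTRA_FOR_SUFFIX.get(Path(p).suffix.lower())
        if extra is None:
            # Unknown format: do not guess a group name, install everything.
            return "markitdown[all]"
        extras.add(extra)
    if not extras:
        # Nothing to inspect, so stay with the safe default.
        return "markitdown[all]"
    return "markitdown[" + ",".join(sorted(extras)) + "]"


print(pip_requirement(["report.pdf", "slides.pptx"]))  # markitdown[pdf,pptx]
```

Quote the printed string when passing it to pip so the shell does not interpret the square brackets.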

Convert a file from the command line

Pass a file path and redirect the Markdown output, or use -o to set the output file.

bash

markitdown path-to-file.pdf > document.md

Use it from Python

Create a MarkItDown instance, convert a file, and read the Markdown from the result's text_content.

python

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document_with_images.pdf")
print(result.text_content)
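Because the output keeps headings, one possible next step before sending it to a model is to split the text into sections. This is a minimal sketch and not part of MarkItDown: split_by_heading is a hypothetical helper, and it assumes the converted text uses #-style headings, which may not hold for every input format. It also does not treat fenced code specially, so a line starting with # inside a code block would be read as a heading.

```python
import re

# Matches "# Title" through "###### Title"; requires a space after the hashes.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_by_heading(markdown):
    """Split Markdown into (heading, body) pairs.

    Text before the first heading is returned with an empty heading.
    """
    sections = []
    title, lines = "", []

    def flush():
        # Skip an empty leading section, keep everything else.
        if title or any(line.strip() for line in lines):
            sections.append((title, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2), []
        else:
            lines.append(line)
    flush()
    return sections


# Usage with the example above:
# sections = split_by_heading(result.text_content)
```

Each pair can then be sent to a model or indexed on its own, with the heading kept as context for the body.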

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Preparing PDFs, Word, and PowerPoint documents as Markdown context for LLM prompts or RAG pipelines
Normalizing a folder of mixed file types into one consistent text format for analysis
Extracting text and structure from spreadsheets, HTML pages, or EPubs in a data pipeline
Transcribing audio or pulling YouTube transcripts into Markdown for downstream processing
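The folder-normalization case can be sketched as a small batch loop. This is an illustration under stated assumptions, not the project's own tooling: convert_folder and markitdown_converter are hypothetical helpers, and the only MarkItDown calls used are the ones shown in Getting started (MarkItDown(), convert, text_content). The converter is passed in as a callable so the loop can be tried without MarkItDown installed.

```python
from pathlib import Path


def convert_folder(src, dst, convert):
    """Convert every file under src into a .md file under dst.

    `convert` is any callable that takes a path string and returns Markdown.
    Returns (written, failed): output paths, and (path, exception) pairs.
    """
    src, dst = Path(src), Path(dst)
    written, failed = [], []
    # sorted() snapshots the file list before anything is written.
    for path in sorted(p for p in src.rglob("*") if p.is_file()):
        # Keep the original suffix so report.pdf and report.docx do not collide.
        target = dst / path.relative_to(src).parent / (path.name + ".md")
        try:
            text = convert(str(path))
        except Exception as exc:  # one bad file should not stop the batch
            failed.append((path, exc))
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written, failed


def markitdown_converter():
    """Build a converter from the MarkItDown calls shown in Getting started."""
    from markitdown import MarkItDown  # lazy import: only needed for real runs

    md = MarkItDown()
    return lambda path: md.convert(path).text_content


# Usage (folder names are placeholders):
# written, failed = convert_folder("docs_in", "docs_out", markitdown_converter())
```

Collecting failures instead of raising lets a mixed folder finish converting even when some files need an optional dependency group that is not installed.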

How MarkItDown compares

MarkItDown alongside other open-source parsing & ingestion tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
MarkItDown	★ 156k	Convert PDFs, Office docs, and more into clean Markdown for LLMs
MinerU	★ 68.1k	A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content.
Docling	★ 61.9k	An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON.
Marker	★ 36.2k	A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting.
Repomix	★ 26.4k	Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini.
OpenDataLoader PDF	★ 25.4k	OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs.
Unstructured	★ 15k	A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines.
Zerox	★ 12.2k	A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use.
