Overview
MarkItDown is a lightweight Python utility from Microsoft that converts many file types into Markdown. It supports PDF, PowerPoint, Word, Excel, images, audio, HTML, text formats like CSV/JSON/XML, ZIP archives, EPubs, and YouTube URLs, preserving structure such as headings, lists, tables, and links.
It is built for feeding documents into LLMs and text-analysis pipelines, not for high-fidelity conversions meant for human reading. Markdown stays close to plain text while keeping document structure, and mainstream LLMs both read it well and tokenize it efficiently. The project compares itself to textract, with a stronger focus on retaining structure.
It fits the web scraping and data extraction space as a normalization step: take mixed source files and turn them into one consistent text format your prompts and pipelines can consume. It runs as a command-line tool or as a Python library.
What it does
- Converts PDF, Word, PowerPoint, Excel, HTML, and text formats (CSV, JSON, XML) into Markdown
- Handles images (EXIF metadata and OCR), audio (metadata and speech transcription), EPubs, ZIP archives, and YouTube URLs
- Optional dependency groups (e.g. [pdf], [docx], [pptx]) let you install only the formats you need
- Command-line interface that takes a file path or piped stdin and writes Markdown to stdout or, with -o, to a file
- Supports third-party plugins, including markitdown-ocr for LLM-based OCR of embedded images
- Optional Azure Document Intelligence and Azure Content Understanding backends for higher-quality and structured extraction
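The plugin and Azure options in the list above are typically switched on when the converter is constructed. The following is a minimal sketch, not the project's reference usage: `converter_options` and `build_converter` are hypothetical helper names, and the keyword arguments `enable_plugins` and `docintel_endpoint` are taken from the project's README as of this writing, so check them against the version you install.

```python
from typing import Any, Dict, Optional


def converter_options(enable_plugins: bool = False,
                      docintel_endpoint: Optional[str] = None) -> Dict[str, Any]:
    """Collect keyword arguments for MarkItDown(); only set what was asked for."""
    options: Dict[str, Any] = {"enable_plugins": enable_plugins}
    if docintel_endpoint:
        # Assumed keyword: use Azure Document Intelligence for conversion.
        options["docintel_endpoint"] = docintel_endpoint
    return options


def build_converter(**kwargs: Any):
    """Create a MarkItDown instance from the options above (hypothetical helper)."""
    # Imported lazily so converter_options() works without the package installed.
    from markitdown import MarkItDown

    return MarkItDown(**converter_options(**kwargs))


# Example (requires the package and, for plugins, an installed plugin):
#   md = build_converter(enable_plugins=True)
#   print(md.convert("test.xlsx").text_content)
```

Keeping the option-building step separate from the import makes the configuration easy to inspect or log before any conversion runs.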
Getting started
MarkItDown requires Python 3.10 or higher; a virtual environment is recommended. Install it with pip, then convert files from the command line or in Python.
Install with pip
Install MarkItDown with all optional format dependencies. You can also install only the groups you need, e.g. 'markitdown[pdf, docx, pptx]'.
pip install 'markitdown[all]'
Convert a file from the command line
Pass a file path and redirect the Markdown output, or use -o to set the output file.
markitdown path-to-file.pdf > document.md
Use it from Python
Create a MarkItDown instance, convert a file, and read the Markdown from the result's text_content.
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document_with_images.pdf")
print(result.text_content)
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
When to use it
- Preparing PDFs, Word, and PowerPoint documents as Markdown context for LLM prompts or RAG pipelines
- Normalizing a folder of mixed file types into one consistent text format for analysis
- Extracting text and structure from spreadsheets, HTML pages, or EPubs in a data pipeline
- Transcribing audio or pulling YouTube transcripts into Markdown for downstream processing
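The folder-normalization use case above can be sketched as a small batch loop. This is an illustration under stated assumptions rather than a feature of the tool: `normalize_folder`, `markitdown_converter`, and `DEFAULT_SUFFIXES` are hypothetical names, the suffix list is an arbitrary subset, and the only MarkItDown calls used are `MarkItDown()`, `convert(...)`, and `text_content`, as shown in the Python example earlier on this page.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Illustrative subset of extensions to pick up; not the library's full list.
DEFAULT_SUFFIXES = (".pdf", ".docx", ".pptx", ".xlsx", ".html",
                    ".csv", ".json", ".xml", ".epub")


def markitdown_converter() -> Callable[[Path], str]:
    """Return a function that converts one file to Markdown text."""
    from markitdown import MarkItDown  # lazy import: only needed for real runs

    md = MarkItDown()
    return lambda path: md.convert(str(path)).text_content


def normalize_folder(src: Path, dst: Path, convert: Callable[[Path], str],
                     suffixes: Iterable[str] = DEFAULT_SUFFIXES) -> Dict[str, List[Path]]:
    """Convert every matching file under src into a .md file under dst."""
    wanted = {s.lower() for s in suffixes}
    report: Dict[str, List[Path]] = {"converted": [], "failed": []}
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        rel = path.relative_to(src)
        # Keep the original extension in the name so a.pdf and a.docx do not collide.
        out = dst / rel.parent / (rel.name + ".md")
        try:
            text = convert(path)
        except Exception:
            # One unreadable file should not stop the whole batch.
            report["failed"].append(path)
            continue
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(text, encoding="utf-8")
        report["converted"].append(out)
    return report


# Example (requires the package and a real folder):
#   report = normalize_folder(Path("inbox"), Path("markdown"), markitdown_converter())
```

Passing the converter in as a function keeps the loop independent of MarkItDown itself, so the same code can be dry-run with a stub or pointed at a differently configured instance.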
How MarkItDown compares
MarkItDown alongside the other open-source parsing and ingestion tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| MarkItDown | ★ 156k | Convert PDFs, Office docs, and more into clean Markdown for LLMs |
| MinerU | ★ 68.1k | A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content. |
| Docling | ★ 61.9k | An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON. |
| Marker | ★ 36.2k | A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting. |
| Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini. |
| OpenDataLoader PDF | ★ 25.4k | OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs. |
| Unstructured | ★ 15k | A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines. |
| Zerox | ★ 12.2k | A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use. |