AI/TLDR

MarkItDown

Convert PDFs, Office docs, and more into clean Markdown for LLMs

Overview

MarkItDown is a lightweight Python utility from Microsoft that converts many file types into Markdown. It supports PDF, PowerPoint, Word, Excel, images, audio, HTML, text formats like CSV/JSON/XML, ZIP archives, EPubs, and YouTube URLs, preserving structure such as headings, lists, tables, and links.

It is built for feeding documents into LLMs and text-analysis pipelines, not for high-fidelity conversions meant for human reading. Markdown stays close to plain text while still representing document structure; mainstream LLMs handle it well, and its conventions are token-efficient. The project compares itself to textract, with a stronger focus on retaining structure.
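
To illustrate why retained structure matters downstream, here is a minimal sketch (not part of MarkItDown) that splits Markdown text into sections at its headings, the kind of step a text-analysis pipeline might run on converted output. The name `split_by_heading` is hypothetical, and the sketch assumes `#`-style headings; it does not special-case `#` lines that appear inside code blocks.

```python
import re

# Matches "# Title" through "###### Title"; requires a space after the hashes.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) pairs.

    Text that appears before the first heading is returned with an empty heading.
    """
    sections: list[tuple[str, str]] = []
    title, body = "", []

    def flush() -> None:
        # Skip the empty leading section when the text starts with a heading.
        if title or any(line.strip() for line in body):
            sections.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections
```

Each pair can then be sent to a model or indexed separately, with the heading kept as context for the body.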

It fits the web scraping and data extraction space as a normalization step: take mixed source files and turn them into one consistent text format your prompts and pipelines can consume. It runs as a command-line tool or as a Python library.

What it does

  • Converts PDF, Word, PowerPoint, Excel, HTML, and text formats (CSV, JSON, XML) into Markdown
  • Handles images (EXIF metadata and OCR), audio (metadata and speech transcription), EPubs, ZIP archives, and YouTube URLs
  • Optional dependency groups (e.g. [pdf], [docx], [pptx]) let you install only the formats you need
  • Command-line interface that takes a file path or piped stdin and writes Markdown to stdout or to a file set with -o
  • Supports third-party plugins, including markitdown-ocr for LLM-based OCR of embedded images
  • Optional Azure Document Intelligence and Azure Content Understanding backends for higher-quality and structured extraction

Getting started

MarkItDown requires Python 3.10 or higher; a virtual environment is recommended. Install it with pip, then convert files from the command line or in Python.

Install with pip

Install MarkItDown with all optional format dependencies. You can also install only the groups you need, e.g. 'markitdown[pdf, docx, pptx]'.

```bash
pip install 'markitdown[all]'
```

Convert a file from the command line

Pass a file path and redirect the Markdown output, or use -o to set the output file.

```bash
markitdown path-to-file.pdf > document.md
```
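
If you would rather drive the CLI from a script, the command can be wrapped with Python's `subprocess` module. This is a minimal sketch: the helper names `build_markitdown_command` and `convert_with_cli` are hypothetical, and it assumes the `markitdown` executable is on your PATH. It uses the -o form so no shell redirection is needed.

```python
import subprocess
from pathlib import Path


def build_markitdown_command(source: Path, output: Path) -> list[str]:
    """Return the argument list for `markitdown <source> -o <output>`."""
    return ["markitdown", str(source), "-o", str(output)]


def convert_with_cli(source: Path, output: Path) -> None:
    """Run the markitdown CLI; raises CalledProcessError on a non-zero exit."""
    # Passing a list (not a shell string) avoids quoting problems with odd file names.
    subprocess.run(build_markitdown_command(source, output), check=True)


# Example call (assumes the input file exists):
# convert_with_cli(Path("path-to-file.pdf"), Path("document.md"))
```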

Use it from Python

Create a MarkItDown instance, convert a file, and read the Markdown from the result's text_content.

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document_with_images.pdf")
print(result.text_content)
```

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

  • Preparing PDFs, Word, and PowerPoint documents as Markdown context for LLM prompts or RAG pipelines
  • Normalizing a folder of mixed file types into one consistent text format for analysis
  • Extracting text and structure from spreadsheets, HTML pages, or EPubs in a data pipeline
  • Transcribing audio or pulling YouTube transcripts into Markdown for downstream processing
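
The folder-normalization case above could be sketched as follows. This is an illustration under stated assumptions, not the project's own batch tooling: `normalize_folder` and its `convert` parameter are hypothetical names, and the default converter relies only on the `MarkItDown().convert(...).text_content` call shown under Getting started. Files that fail to convert are skipped and reported rather than stopping the run.

```python
from pathlib import Path
from typing import Callable, Optional


def normalize_folder(
    src_dir: Path,
    out_dir: Path,
    convert: Optional[Callable[[Path], str]] = None,
) -> list[Path]:
    """Write one Markdown file per file in src_dir; return the paths written.

    Keep out_dir outside src_dir so earlier outputs are not converted again.
    """
    if convert is None:
        # Imported lazily so the helper can be exercised with a stand-in converter.
        from markitdown import MarkItDown

        md = MarkItDown()  # one instance reused for every file

        def convert(path: Path) -> str:
            return md.convert(str(path)).text_content

    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    # Non-recursive on purpose: only files directly inside src_dir are converted.
    for source in sorted(p for p in src_dir.iterdir() if p.is_file()):
        try:
            markdown = convert(source)
        except Exception as exc:  # unsupported or unreadable file
            print(f"skipped {source.name}: {exc}")
            continue
        # Keep the original extension so report.pdf and report.docx do not collide.
        target = out_dir / f"{source.name}.md"
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

Passing the converter as a parameter keeps the loop independent of MarkItDown itself, so the same helper works with a differently configured instance or a test stub.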

How MarkItDown compares

MarkItDown alongside other open-source parsing & ingestion tools AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| MarkItDown | ★ 156k | Convert PDFs, Office docs, and more into clean Markdown for LLMs |
| MinerU | ★ 68.1k | A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content. |
| Docling | ★ 61.9k | An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON. |
| Marker | ★ 36.2k | A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting. |
| Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini. |
| OpenDataLoader PDF | ★ 25.4k | OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs. |
| Unstructured | ★ 15k | A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines. |
| Zerox | ★ 12.2k | A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use. |