AI/TLDR

Unstructured

Open-source preprocessing that turns messy documents into clean elements for LLMs

Overview

Unstructured is an open-source Python library for ingesting and preprocessing images and text documents. It handles many file types, including PDFs, HTML, Word docs, and more, and turns them into structured outputs you can feed to a language model.

It is aimed at developers building RAG and LLM data pipelines who need a consistent way to extract text from varied, messy source files. Its modular partition functions parse individual files and its connectors bring documents in from different platforms, so ingestion and preprocessing run through one system.

Within the parsing and ingestion category, Unstructured sits at the front of the pipeline: it does the document parsing step so your downstream chunking, embedding, and retrieval stages receive clean, normalized elements instead of raw files.

What it does

  • Partitions many document types, including PDFs, HTML, Word docs, plain text, XML, JSON, and emails
  • Modular partition functions such as partition_pdf and partition_text return documents as structured elements
  • Optional install extras let you add only the dependencies you need (for example unstructured[docx,pptx]) instead of all of them
  • Ships official Docker images for x86_64 and Apple silicon, so you can run it in a container without local setup
  • Connectors and partition functions work together as one system for ingesting data from different platforms
  • Apache-2.0 licensed, with a hosted Unstructured Platform available for production workloads

Getting started

Install the library from PyPI, then partition a document into structured elements in a few lines of Python.

Install from PyPI

Install the library with support for all document types. For plain text, HTML, XML, JSON, and emails, which need no extra dependencies, you can use the base pip install unstructured instead, or add only the extras you need.

bash
pip install "unstructured[all-docs]"
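If you want to script the choice of extras, the following is a minimal sketch that builds the install command from the file extensions you expect to parse. The helper name install_command and both lookup tables are hypothetical and exist only for illustration: the text above confirms only the docx, pptx, and all-docs extras, the pdf extra name is assumed to follow the same pattern, and the file extensions are an assumed mapping of the base-install file types. Check the project's documentation for the actual list of extras.

```python
# Hypothetical helper: build a pip command that installs only the extras needed.
# The mapping below is an assumption for illustration, not an official list.
EXTRA_FOR_EXTENSION = {
    ".docx": "docx",
    ".pptx": "pptx",
    ".pdf": "pdf",  # assumed extra name
}

# Assumed extensions for the file types that work with the base install.
BASE_EXTENSIONS = {".txt", ".html", ".xml", ".json", ".eml"}


def install_command(extensions):
    """Return a pip install command covering the given file extensions."""
    extras = set()
    for ext in extensions:
        ext = ext.lower()
        if ext in BASE_EXTENSIONS:
            continue  # covered by the base package
        # Fall back to all-docs when no narrower extra is known.
        extras.add(EXTRA_FOR_EXTENSION.get(ext, "all-docs"))
    if "all-docs" in extras:
        return 'pip install "unstructured[all-docs]"'
    if not extras:
        return "pip install unstructured"
    return 'pip install "unstructured[{}]"'.format(",".join(sorted(extras)))


print(install_command([".docx", ".pptx", ".txt"]))
```

Falling back to all-docs for unknown extensions trades a larger install for fewer surprises at parse time.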

Install system dependencies

Depending on which document types you parse, install the supporting system packages, for example libmagic-dev (filetype detection), poppler-utils (PDFs and images), and tesseract-ocr (OCR). You may not need all of them.
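To see which of these packages are already present, a quick standard-library check can help. This is a sketch under assumptions: it assumes poppler-utils provides a pdftoppm executable and tesseract-ocr provides a tesseract executable on your PATH, which is typical but can vary by platform, and the function name check_system_dependencies is hypothetical. A "missing" result only means the lookup failed, not that the library cannot work.

```python
import ctypes.util
import shutil


def check_system_dependencies():
    """Report which optional system dependencies appear to be installed."""
    return {
        # libmagic is a shared library, so look it up by library name.
        "libmagic": ctypes.util.find_library("magic") is not None,
        # Assumption: poppler-utils provides the pdftoppm command-line tool.
        "poppler-utils": shutil.which("pdftoppm") is not None,
        # Assumption: tesseract-ocr installs an executable named tesseract.
        "tesseract-ocr": shutil.which("tesseract") is not None,
    }


for name, found in check_system_dependencies().items():
    print(f"{name}: {'found' if found else 'missing'}")
```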

Partition a document

Use a partition function to read a file and return its structured elements.

python
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.text import partition_text

# Each call reads one file and returns its structured elements.
pdf_elements = partition_pdf(filename="example-docs/layout-parser-paper-fast.pdf")
text_elements = partition_text(filename="example-docs/fake-text.txt")
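Once you have the elements, a common next step is to look at what kinds of elements came back. The sketch below assumes each element exposes a category attribute (for example "Title" or "NarrativeText") and a text attribute; verify both against the library's documentation. The summarize helper is hypothetical, and the sample uses stand-in objects so the code runs without the library installed.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(elements):
    """Count elements per category and collect their non-empty text."""
    counts = Counter(el.category for el in elements)
    texts = [el.text for el in elements if el.text.strip()]
    return counts, texts


# Stand-in objects so the sketch runs without the library installed.
# With the real library you would pass the result of a partition function.
sample = [
    SimpleNamespace(category="Title", text="Quarterly report"),
    SimpleNamespace(category="NarrativeText", text="Revenue grew this quarter."),
    SimpleNamespace(category="NarrativeText", text="Costs were flat."),
]

counts, texts = summarize(sample)
print(dict(counts))
print(texts[0])
```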

Or run it in a container

Pull the official image and shell into a running container if you would rather not install dependencies locally.

bash
docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest
docker exec -it unstructured bash

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

  • Extracting clean text from PDFs, Word docs, and HTML to build a RAG knowledge base
  • Normalizing a mixed set of file types into one structured element format before chunking and embedding
  • Preprocessing scanned documents and images with OCR ahead of an LLM pipeline
  • Running document ingestion inside a container as part of an automated data workflow
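To make the "before chunking and embedding" step concrete, here is a minimal sketch of grouping elements into chunks. It is a plain-Python stand-in, not the library's own chunking API: the chunk_elements function, its max_chars parameter, and the rule of starting a new chunk at each "Title" element are all assumptions chosen for illustration, as is the category/text shape of the elements.

```python
from types import SimpleNamespace


def chunk_elements(elements, max_chars=500):
    """Group element text into chunks, breaking at titles and at a size cap."""
    chunks, current = [], []
    size = 0
    for el in elements:
        text = el.text.strip()
        if not text:
            continue  # skip empty elements
        starts_section = el.category == "Title"
        too_big = size + len(text) > max_chars
        # Close the running chunk before a new section or when it would overflow.
        if current and (starts_section or too_big):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append("\n".join(current))
    return chunks


# Stand-in elements; real ones would come from a partition function.
sample_doc = [
    SimpleNamespace(category="Title", text="Intro"),
    SimpleNamespace(category="NarrativeText", text="First paragraph."),
    SimpleNamespace(category="Title", text="Methods"),
    SimpleNamespace(category="NarrativeText", text="Second paragraph."),
]

for chunk in chunk_elements(sample_doc):
    print(repr(chunk))
```

Breaking at titles keeps each chunk within one section, which tends to give an embedding model more coherent input than fixed-size windows alone.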

How Unstructured compares

Unstructured alongside other open-source parsing & ingestion tools AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
MarkItDown | ★ 156k | A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs.
MinerU | ★ 68.1k | A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content.
Docling | ★ 61.9k | An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON.
Marker | ★ 36.2k | A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting.
Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini.
OpenDataLoader PDF | ★ 25.4k | OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs.
Unstructured | ★ 15k | Open-source preprocessing that turns messy documents into clean elements for LLMs.
Zerox | ★ 12.2k | A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use.