Unstructured

Open-source preprocessing that turns messy documents into clean elements for LLMs

github.com/Unstructured-IO/unstructured ★ 15k unstructured.io

Overview

Unstructured is an open-source Python library for ingesting and preprocessing images and text documents. It handles many file types, including PDFs, HTML, Word docs, and more, and turns them into structured outputs you can feed to a language model.

It is aimed at developers building RAG and LLM data pipelines who need a consistent way to extract text from varied, messy source files. Its modular partition functions and connectors form one system that simplifies data ingestion and preprocessing across different platforms.

Within the parsing and ingestion category, Unstructured sits at the front of the pipeline: it does the document parsing step so your downstream chunking, embedding, and retrieval stages receive clean, normalized elements instead of raw files.

What it does

Partitions many document types, including PDFs, HTML, Word docs, plain text, XML, JSON, and emails
Modular partition functions such as partition_pdf and partition_text return documents as structured elements
Optional install extras let you add only the dependencies you need (for example unstructured[docx,pptx]) instead of all of them
Ships official Docker images for x86_64 and Apple silicon, so you can run it in a container without local setup
Connectors and modular functions form a cohesive system for data ingestion across different platforms
Apache-2.0 licensed, with a hosted Unstructured Platform available for production workloads

Getting started

Install the library from PyPI, then partition a document into structured elements in a few lines of Python.

Install from PyPI

Install the library with support for all document types. Plain text, HTML, XML, JSON, and emails need no extra dependencies, so for those you can run pip install unstructured instead, or add only the extras you need.

bash

pip install "unstructured[all-docs]"

Install system dependencies

Depending on which document types you parse, install the supporting system packages, for example libmagic-dev (filetype detection), poppler-utils (PDFs and images), and tesseract-ocr (OCR). You may not need all of them.
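Before partitioning PDFs or scanned images, it can help to confirm that the command-line tools from those packages are on your PATH. The check below is a rough sketch and not part of Unstructured: the name check_system_tools is hypothetical, and it assumes poppler-utils installs a pdftoppm executable and tesseract-ocr installs a tesseract executable, which is the usual layout. libmagic is a shared library rather than a command, so this check does not cover it.

```python
import shutil

# Assumed mapping of system package -> executable it usually installs.
TOOLS = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
}

def check_system_tools(tools=TOOLS):
    """Return {package: bool} saying whether each executable is on PATH."""
    return {package: shutil.which(exe) is not None for package, exe in tools.items()}

if __name__ == "__main__":
    for package, found in check_system_tools().items():
        print(f"{package}: {'found' if found else 'missing'}")
```

A missing entry only matters if you parse the document types that need it.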

Partition a document

Use a partition function to read a file and return its structured elements.

python

# Partition a PDF into a list of elements
from unstructured.partition.pdf import partition_pdf
pdf_elements = partition_pdf(filename="example-docs/layout-parser-paper-fast.pdf")

# Partition a plain-text file the same way
from unstructured.partition.text import partition_text
text_elements = partition_text(filename="example-docs/fake-text.txt")
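Once a partition function has returned its elements, a natural next step is to see what it found. The sketch below assumes each element exposes category and text attributes, as Unstructured's element classes do. The helper name summarize_elements is hypothetical, and the stand-in objects replace a real partition result so the snippet runs without the library installed.

```python
from collections import Counter
from types import SimpleNamespace

def summarize_elements(elements):
    """Count elements per category and collect their non-empty text."""
    counts = Counter(el.category for el in elements)
    texts = [el.text.strip() for el in elements if el.text and el.text.strip()]
    return counts, texts

# Stand-in elements (assumption: real elements have .category and .text).
sample = [
    SimpleNamespace(category="Title", text="LayoutParser"),
    SimpleNamespace(category="NarrativeText", text="  A unified toolkit.  "),
    SimpleNamespace(category="NarrativeText", text=""),
]
counts, texts = summarize_elements(sample)
print(counts.most_common())
print(texts)
```

With real output you would pass the list returned by partition_pdf or partition_text in place of sample.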

Or run it in a container

Pull the official image and shell into a running container if you would rather not install dependencies locally.

bash

docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest
docker exec -it unstructured bash

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Extracting clean text from PDFs, Word docs, and HTML to build a RAG knowledge base
Normalizing a mixed set of file types into one structured element format before chunking and embedding
Preprocessing scanned documents and images with OCR ahead of an LLM pipeline
Running document ingestion inside a container as part of an automated data workflow
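The mixed-file-type case above could be sketched as a small folder walker. This is an illustration under assumptions, not the project's recommended pipeline: partition_folder is a hypothetical helper, and when no partition function is supplied it falls back to the partition function in unstructured.partition.auto, which routes a file to a partitioner based on its detected type. The import is deferred so the helper can be exercised with a stand-in function when the library is not installed.

```python
from pathlib import Path

def partition_folder(root, partition_fn=None):
    """Partition every file under root and return {path: elements}."""
    if partition_fn is None:
        # Lazy import: only needed when no stand-in function is supplied.
        from unstructured.partition.auto import partition as partition_fn
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Each call returns that file's list of elements.
            results[str(path)] = partition_fn(filename=str(path))
    return results
```

Calling partition_folder on a directory with no second argument would use Unstructured's automatic routing, assuming the library and the extras for your file types are installed. The resulting dictionary can then be handed to your chunking and embedding steps.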

How Unstructured compares

Unstructured alongside other open-source parsing & ingestion tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
MarkItDown	★ 156k	A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs.
MinerU	★ 68.1k	A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content.
Docling	★ 61.9k	An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON.
Marker	★ 36.2k	A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting.
Repomix	★ 26.4k	Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini.
OpenDataLoader PDF	★ 25.4k	OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs.
Unstructured	★ 15k	Open-source preprocessing that turns messy documents into clean elements for LLMs
Zerox	★ 12.2k	A tool that OCRs documents by passing page images through a vision model to produce Markdown output for downstream use.
