Overview
Open Parse is a Python library for chunking documents before they go into a RAG system. Instead of flattening a file into raw text and slicing it, it analyzes the document's visual layout (headings, sections, tables and images) and groups related content together so each chunk stays coherent.
It is aimed at developers building retrieval pipelines who find that naive text splitting loses the structure of their source files. It works on PDFs out of the box using pdfminer.six, with optional table detection and OCR support, and an optional semantic pipeline that clusters nodes by embedding similarity.
Within the parsing and ingestion space, Open Parse sits between plain text splitters and heavier ML layout parsers: it handles layout, tables and markdown in one library, and uses pydantic models so you can serialize parsed results to dicts or JSON.
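To make the contrast with naive splitting concrete, here is a minimal toy sketch. It is not Open Parse's algorithm: the `Element` and `Chunk` classes and both functions are hypothetical names invented for illustration, and real layout analysis works from positions and fonts on the page rather than pre-labelled elements.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical layout element: its kind ("heading" or "text") and its text."""
    kind: str
    text: str


@dataclass
class Chunk:
    """Hypothetical chunk: one heading plus the text blocks that follow it."""
    heading: str
    body: List[str] = field(default_factory=list)


def naive_split(elements: List[Element], size: int) -> List[str]:
    """Flatten everything into one string and cut it every `size` characters."""
    flat = " ".join(e.text for e in elements)
    return [flat[i:i + size] for i in range(0, len(flat), size)]


def group_by_heading(elements: List[Element]) -> List[Chunk]:
    """Start a new chunk at each heading and attach the text that follows it."""
    chunks: List[Chunk] = []
    for element in elements:
        if element.kind == "heading":
            chunks.append(Chunk(heading=element.text))
        else:
            if not chunks:
                # Text that appears before any heading gets an untitled chunk.
                chunks.append(Chunk(heading=""))
            chunks[-1].body.append(element.text)
    return chunks


elements = [
    Element("heading", "Installation"),
    Element("text", "Anchor the home before connecting utilities."),
    Element("heading", "Maintenance"),
    Element("text", "Inspect the roof seals every spring."),
]

# Fixed-size slices cut across section boundaries.
print(naive_split(elements, 40))

# Heading-based groups keep each section's text together.
for chunk in group_by_heading(elements):
    print(chunk.heading, "->", chunk.body)
```

The fixed-size slices mix the end of one section with the start of the next, while the heading-based groups keep each section intact, which is the property the overview describes.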
What it does
- Visually analyzes document layout instead of splitting on raw text, so chunks follow the document's real structure
- Basic markdown support for headings, bold and italics
- High-precision table extraction into clean Markdown, with optional deep-learning models (Table Transformer, unitable)
- Optional semantic ingestion pipeline that embeds and clusters related nodes (e.g. via OpenAI embeddings)
- Built on pydantic, so parsed results serialize to a dict or JSON
- Extensible post-processing and optional OCR via Tesseract
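The idea behind clustering nodes by embedding similarity can be sketched in a few lines. This is only an illustration under stated assumptions: the vectors are hand-written toy values rather than real model embeddings, `cluster_adjacent` and the `0.8` threshold are hypothetical, and Open Parse's own pipeline applies its own merging rules (including token limits) that are not reproduced here.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_adjacent(
    embeddings: Sequence[Sequence[float]], threshold: float
) -> List[List[int]]:
    """Group consecutive nodes whose embedding is close to the node before them."""
    groups: List[List[int]] = []
    for index, vector in enumerate(embeddings):
        if groups and cosine_similarity(embeddings[index - 1], vector) >= threshold:
            groups[-1].append(index)   # similar to the previous node: same group
        else:
            groups.append([index])     # dissimilar (or first node): new group
    return groups


# Toy 2-D "embeddings": nodes 0 and 1 point one way, nodes 2 and 3 another.
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]

print(cluster_adjacent(toy_embeddings, threshold=0.8))  # [[0, 1], [2, 3]]
```

Only adjacent nodes are compared, so reading order is preserved and each group maps back to a contiguous span of the document.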
Getting started
Install the core library with pip, then parse a PDF into nodes. ML table detection and a semantic pipeline are optional add-ons.
Install the core library
Open Parse requires Python 3.8+. Install it with pip.
pip install openparse
Parse a document
Create a DocumentParser, parse a PDF path, and iterate over the resulting nodes.
import openparse
basic_doc_path = "./sample-docs/mobile-home-manual.pdf"
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)
for node in parsed_basic_doc.nodes:
    print(node)
Serialize the results
Results are pydantic models, so you can convert them to a dict or to JSON.
parsed_basic_doc.dict()
# or convert to JSON
parsed_basic_doc.json()
Optional: ML table detection
Install the ml extra and download the model weights to enable deep-learning table parsing.
pip install "openparse[ml]"
openparse-download
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
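Once the ml extra and weights are installed, table parsing is switched on from Python when the parser is constructed. The sketch below assumes a `table_args` keyword on `DocumentParser` with `parsing_algorithm` and `table_output_format` keys, recalled from the project README and not confirmed by the text above, so verify the names against the official repo. `build_table_args` and `parse_with_tables` are hypothetical helpers.

```python
from typing import Any, Dict


def build_table_args(algorithm: str = "table-transformers") -> Dict[str, Any]:
    """Build a table_args dict.

    Assumption: the key names and the "table-transformers" / "pymupdf" values
    are recalled from the project README and may differ in current releases.
    """
    return {
        "parsing_algorithm": algorithm,
        "table_output_format": "markdown",
    }


def parse_with_tables(pdf_path: str, algorithm: str = "table-transformers"):
    """Parse a PDF with table detection enabled (requires openparse[ml])."""
    # Imported lazily so this module still loads when openparse is absent.
    import openparse

    parser = openparse.DocumentParser(table_args=build_table_args(algorithm))
    return parser.parse(pdf_path)


print(build_table_args())
```

Calling `parse_with_tables("./sample-docs/mobile-home-manual.pdf")` would then return a parsed document whose nodes can be iterated as in the basic example.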
When to use it
- Preparing PDFs for a RAG pipeline so retrieval returns coherent, structurally-grouped passages instead of arbitrary text slices
- Extracting tables from reports or manuals into clean Markdown for downstream LLM input
- Building a semantic ingestion step that clusters related document sections by embedding similarity
- Keeping document parsing in-house instead of sending data to a commercial document-AI vendor
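For the table use case, the target format is plain Markdown that an LLM can read directly. As a rough illustration of that format only, here is a hypothetical `to_markdown_table` helper that renders already-extracted rows; it does not detect or extract tables, and the sample rows are made up.

```python
from typing import Sequence


def to_markdown_table(header: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Render a header and rows as a Markdown table string."""

    def render(cells: Sequence[str]) -> str:
        # Escape pipes so cell text cannot break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [render(header), "|" + "|".join("---" for _ in header) + "|"]
    lines.extend(render(row) for row in rows)
    return "\n".join(lines)


# Made-up sample data standing in for cells extracted from a manual.
table = to_markdown_table(
    ["Part", "Qty"],
    [["Anchor bolt", "4"], ["Roof seal", "2"]],
)
print(table)
```

The resulting string can be dropped into a prompt as-is, which keeps rows and columns aligned in a way that flattened text does not.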
How open-parse compares
open-parse alongside the other open-source parsing and ingestion tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| MarkItDown | ★ 156k | A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs. |
| MinerU | ★ 68.1k | A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content. |
| Docling | ★ 61.9k | An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON. |
| Marker | ★ 36.2k | A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting. |
| Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini. |
| OpenDataLoader PDF | ★ 25.4k | OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs. |
| Unstructured | ★ 15k | A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines. |
| open-parse | ★ 3.2k | Visually chunk complex documents for RAG, the way a human would |