AI/TLDR

open-parse

Visually chunk complex documents for RAG, the way a human would

Overview

Open Parse is a Python library for chunking documents before they go into a RAG system. Instead of flattening a file into raw text and slicing it, it looks at the document's visual layout (headings, sections, tables and images) and groups related content together so each chunk stays coherent.

It is aimed at developers building retrieval pipelines who find that naive text splitting loses the structure of their source files. It works on PDFs out of the box using pdfminer.six, with optional table detection and OCR support, and an optional semantic pipeline that clusters nodes by embedding similarity.

Within the parsing and ingestion space, Open Parse sits between plain text splitters and heavier ML layout parsers: it handles layout, tables and markdown in one library, and uses pydantic models so you can serialize parsed results to dicts or JSON.
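To make the contrast in this overview concrete, here is a minimal sketch of naive slicing versus layout-aware grouping. It does not use Open Parse itself: the `Element` class, both functions and the sample text are hypothetical stand-ins, and the library's real layout analysis is considerably more involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """A hypothetical layout element: a heading or a block of body text."""
    kind: str  # "heading" or "text"
    text: str


def naive_chunks(elements: List[Element], size: int) -> List[str]:
    """Flatten everything into one string and slice it every `size` characters."""
    flat = " ".join(e.text for e in elements)
    return [flat[i:i + size] for i in range(0, len(flat), size)]


def layout_chunks(elements: List[Element]) -> List[str]:
    """Start a new chunk at every heading so each section stays together."""
    chunks: List[List[str]] = []
    for e in elements:
        if e.kind == "heading" or not chunks:
            chunks.append([])
        chunks[-1].append(e.text)
    return ["\n".join(parts) for parts in chunks]


# Made-up sample content, for illustration only
doc = [
    Element("heading", "Installation"),
    Element("text", "Anchor the home to the foundation."),
    Element("heading", "Maintenance"),
    Element("text", "Inspect the roof seals every spring."),
]

print(naive_chunks(doc, 40))   # slices can cut across section boundaries
print(layout_chunks(doc))      # one chunk per heading-led section
```

The naive version can split a sentence or merge the tail of one section with the start of the next, while the grouped version keeps each heading with its own body text.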

What it does

  • Visually analyzes document layout instead of splitting on raw text, so chunks follow the document's real structure
  • Basic markdown support for headings, bold and italics
  • High-precision table extraction into clean Markdown, with optional deep-learning models (Table Transformer, unitable)
  • Optional semantic ingestion pipeline that embeds and clusters related nodes (e.g. via OpenAI embeddings)
  • Built on pydantic, so parsed results serialize to a dict or JSON
  • Extensible post-processing and optional OCR via Tesseract
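The semantic ingestion idea in the list above, grouping nodes whose embeddings are similar, can be sketched roughly as follows. This is an assumption-laden illustration rather than the library's pipeline: `merge_adjacent`, the 0.8 threshold and the hand-written two-dimensional vectors are all hypothetical, and real embeddings would come from a model such as the OpenAI ones mentioned above.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_adjacent(
    texts: List[str],
    vectors: List[Sequence[float]],
    threshold: float = 0.8,
) -> List[str]:
    """Greedily merge each node into the previous group when its embedding
    is close enough to the node immediately before it."""
    if not texts:
        return []
    groups: List[List[int]] = [[0]]
    for i in range(1, len(texts)):
        if cosine(vectors[i - 1], vectors[i]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return ["\n".join(texts[i] for i in g) for g in groups]


# Toy data: the vectors are hand-written stand-ins for real embeddings
texts = ["Roof seals", "Roof inspection schedule", "Warranty terms"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

for chunk in merge_adjacent(texts, vectors):
    print(repr(chunk))
```

Comparing only neighbouring nodes keeps the merged chunks in reading order, which is usually what a retrieval pipeline wants; a full clustering step could also group distant passages.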

Getting started

Install the core library with pip, then parse a PDF into nodes. ML table detection and a semantic pipeline are optional add-ons.

Install the core library

Open Parse requires Python 3.8+. Install it with pip.

bash
pip install openparse

Parse a document

Create a DocumentParser, parse a PDF path, and iterate over the resulting nodes.

python
import openparse

basic_doc_path = "./sample-docs/mobile-home-manual.pdf"
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)

for node in parsed_basic_doc.nodes:
    print(node)
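If you have a folder of PDFs rather than a single file, the same call can be wrapped in a small loop. The helper below is a hypothetical sketch, not part of Open Parse: it takes the parse function as an argument, so the only library call it assumes is the `parser.parse(path)` shown above.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable


def parse_many(paths: Iterable[Path], parse: Callable[[str], Any]) -> Dict[str, Any]:
    """Call `parse` on every existing PDF path, keyed by file name.

    Paths that are missing or do not end in .pdf are skipped.
    """
    results: Dict[str, Any] = {}
    for path in paths:
        if path.suffix.lower() != ".pdf" or not path.is_file():
            continue
        results[path.name] = parse(str(path))
    return results


# Intended use (requires openparse to be installed):
#   import openparse
#   parser = openparse.DocumentParser()
#   docs = parse_many(Path("./sample-docs").glob("*.pdf"), parser.parse)
```

Passing the parser in keeps the helper easy to test with a fake function and leaves parser configuration in one place.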

Serialize the results

Results use pydantic, so you can convert them to a dict or a JSON-ready dict.

python
# parsed_basic_doc is the result of parser.parse(...) from the previous step
parsed_basic_doc.dict()

# or to convert to a valid json dict
parsed_basic_doc.json()
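Once you have a plain dict, saving it for a later indexing step needs only the standard library. The sketch below assumes nothing about Open Parse's schema: `save_parsed` is a hypothetical helper and the `example` dict is a made-up stand-in, so the real field names may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_parsed(data: Dict[str, Any], out_path: Path) -> None:
    """Write an already-serialized document dict to disk as indented JSON."""
    # default=str is a fallback for values the json module cannot encode natively
    out_path.write_text(json.dumps(data, indent=2, default=str), encoding="utf-8")


# Hypothetical stand-in for a serialized document; real field names may differ
example = {"nodes": [{"text": "Installation", "page": 1}]}

out = Path(tempfile.gettempdir()) / "parsed.json"
save_parsed(example, out)
print(out.read_text(encoding="utf-8"))
```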

Optional: ML table detection

Install the ml extra and download the model weights to enable deep-learning table parsing.

bash
pip install "openparse[ml]"
openparse-download

Commands and code are distilled from the project's own documentation. Always check the official repo for the latest.

When to use it

  • Preparing PDFs for a RAG pipeline so retrieval returns coherent, structurally-grouped passages instead of arbitrary text slices
  • Extracting tables from reports or manuals into clean Markdown for downstream LLM input
  • Building a semantic ingestion step that clusters related document sections by embedding similarity
  • Keeping document parsing in-house instead of sending data to a commercial document-AI vendor
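For the table use case above, the final step is turning extracted cells into Markdown. The following is a minimal sketch of that rendering step only, assuming the cells are already available as a list of rows; `rows_to_markdown` and the sample data are hypothetical, and detecting the table in a PDF is the hard part the library handles.

```python
from typing import List


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render rows (first row is the header) as a Markdown pipe table."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    # Escape pipes so a cell cannot break the table, and pad short rows
    norm = [
        [c.replace("|", "\\|") for c in r] + [""] * (width - len(r))
        for r in rows
    ]
    lines = [
        "| " + " | ".join(norm[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in norm[1:]]
    return "\n".join(lines)


# Made-up cells; the last row is deliberately missing a value
table = [["Part", "Interval"], ["Roof seals", "12 months"], ["Smoke alarm"]]
print(rows_to_markdown(table))
```

Padding ragged rows and escaping pipe characters keeps the output valid even when extraction returns incomplete or messy cells.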

How open-parse compares

open-parse alongside other open-source parsing & ingestion tools AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| MarkItDown | ★ 156k | A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs. |
| MinerU | ★ 68.1k | A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content. |
| Docling | ★ 61.9k | An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON. |
| Marker | ★ 36.2k | A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting. |
| Repomix | ★ 26.4k | Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini. |
| OpenDataLoader PDF | ★ 25.4k | OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs. |
| Unstructured | ★ 15k | A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines. |
| open-parse | ★ 3.2k | Visually chunk complex documents for RAG, the way a human would |