open-parse

Visually chunk complex documents for RAG, the way a human would

github.com/Filimoa/open-parse ★ 3.2k
filimoa.github.io/open-parse

Overview

Open Parse is a Python library for chunking documents before they go into a RAG system. Instead of flattening a file into raw text and slicing it, it looks at the visual layout of a document (headings, sections, tables and images) and groups related content together so each chunk stays coherent.

It is aimed at developers building retrieval pipelines who find that naive text splitting loses the structure of their source files. It works on PDFs out of the box using pdfminer.six, with optional table detection and OCR support, and an optional semantic pipeline that clusters nodes by embedding similarity.

Within the parsing and ingestion space, Open Parse sits between plain text splitters and heavier ML layout parsers: it handles layout, tables and markdown in one library, and uses pydantic models so you can serialize parsed results to dicts or JSON.
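To make the contrast with naive splitting concrete, here is a minimal sketch in plain Python. It does not use Open Parse's API: the `Line` class and its `is_heading` flag are hypothetical stand-ins for what a layout-aware parser would infer from font size or position, and the sample text is made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    text: str
    is_heading: bool  # hypothetical flag; a real parser would infer this from layout


def naive_chunks(lines: List[Line], size: int) -> List[str]:
    """Flatten everything and slice every `size` characters, ignoring structure."""
    flat = " ".join(line.text for line in lines)
    return [flat[i:i + size] for i in range(0, len(flat), size)]


def heading_chunks(lines: List[Line]) -> List[str]:
    """Start a new chunk at each heading so a section stays in one piece."""
    chunks: List[List[str]] = []
    for line in lines:
        if line.is_heading or not chunks:
            chunks.append([])
        chunks[-1].append(line.text)
    return [" ".join(chunk) for chunk in chunks]


doc = [
    Line("Installation", True),
    Line("Level the site before placing the home.", False),
    Line("Warranty", True),
    Line("Coverage lasts one year from delivery.", False),
]

print(naive_chunks(doc, 40))   # slices cut across section boundaries
print(heading_chunks(doc))     # one chunk per section
```

The fixed-size slices break mid-sentence and mix the two sections, while the heading-based grouping keeps each section's heading and body together, which is the property the library is aiming for.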

What it does

Visually analyzes document layout instead of splitting on raw text, so chunks follow the document's real structure
Basic markdown support for headings, bold and italics
High-precision table extraction into clean Markdown, with optional deep-learning models (Table Transformer, unitable)
Optional semantic ingestion pipeline that embeds and clusters related nodes (e.g. via OpenAI embeddings)
Built on pydantic, so parsed results serialize to a dict or JSON
Extensible post-processing and optional OCR via Tesseract
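The idea behind clustering nodes by embedding similarity can be sketched without any external service. The example below is an illustration under stated assumptions, not Open Parse's implementation: `merge_similar` and the 0.8 threshold are hypothetical, and the 2-D vectors are toy values standing in for real embeddings such as those returned by an embeddings API.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_similar(
    texts: List[str],
    vectors: List[Sequence[float]],
    threshold: float = 0.8,  # hypothetical cut-off, tune for real embeddings
) -> List[List[str]]:
    """Group each node with the previous one when their embeddings are similar enough."""
    groups: List[List[str]] = []
    for i, text in enumerate(texts):
        if i > 0 and cosine(vectors[i - 1], vectors[i]) >= threshold:
            groups[-1].append(text)
        else:
            groups.append([text])
    return groups


texts = ["Tire pressure", "Check tires monthly", "Warranty terms"]
vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0)]  # toy vectors, not real embeddings

groups = merge_similar(texts, vectors)
print(groups)
```

Here the first two nodes point in nearly the same direction and are merged, while the third starts a new group. Comparing only adjacent nodes keeps the document's reading order intact, which is one reasonable design choice among several.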

Getting started

Install the core library with pip, then parse a PDF into nodes. ML table detection and a semantic pipeline are optional add-ons.

Install the core library

Open Parse requires Python 3.8+. Install it with pip.

bash

pip install openparse

Parse a document

Create a DocumentParser, parse a PDF path, and iterate over the resulting nodes.

python

import openparse

basic_doc_path = "./sample-docs/mobile-home-manual.pdf"
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)

for node in parsed_basic_doc.nodes:
    print(node)

Serialize the results

Results use pydantic, so you can convert them to a dict or a JSON-ready dict.

python

# parsed_basic_doc is the result returned by parser.parse() in the previous step
parsed_basic_doc.dict()

# or to convert to a valid json dict
parsed_basic_doc.json()

Optional: ML table detection

Install the ml extra and download the model weights to enable deep-learning table parsing.

bash

pip install "openparse[ml]"
openparse-download

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Preparing PDFs for a RAG pipeline so retrieval returns coherent, structurally-grouped passages instead of arbitrary text slices
Extracting tables from reports or manuals into clean Markdown for downstream LLM input
Building a semantic ingestion step that clusters related document sections by embedding similarity
Keeping document parsing in-house instead of sending data to a commercial document-AI vendor
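For the table use case, it helps to see what "clean Markdown" looks like as LLM input. The helper below is a hypothetical sketch, not part of Open Parse: it assumes the table has already been extracted into a header and rows of strings, and the sample values are invented.

```python
from typing import List


def to_markdown_table(header: List[str], rows: List[List[str]]) -> str:
    """Render a header and rows of cells as a pipe-delimited Markdown table."""

    def fmt(cells: List[str]) -> str:
        # Escape literal pipes so they are not read as column separators.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


table = to_markdown_table(
    ["Part", "Torque"],
    [["Lug nut", "90 ft-lb"], ["Hitch bolt", "120 ft-lb"]],
)
print(table)
```

Keeping each row on one line with explicit column separators preserves the row and column relationships that are lost when a table is flattened into running text.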

How open-parse compares

open-parse alongside other open-source parsing & ingestion tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
MarkItDown	★ 156k	A Microsoft Python utility that converts many file types, including Office docs and PDFs, into Markdown for LLMs.
MinerU	★ 68.1k	A document extraction tool that converts PDFs and Office files into clean Markdown or JSON, with strong handling of complex layouts and CJK content.
Docling	★ 61.9k	An IBM-originated document conversion pipeline that turns PDF, DOCX, PPTX, HTML, and more into structured, LLM-ready Markdown or JSON.
Marker	★ 36.2k	A fast pipeline that converts PDFs and other documents to Markdown, JSON, or HTML while preserving tables, equations, and formatting.
Repomix	★ 26.4k	Repomix packs an entire repository into one file that is easy to feed to AI tools like Claude, ChatGPT, and Gemini.
OpenDataLoader PDF	★ 25.4k	OpenDataLoader PDF turns any PDF into structured Markdown, JSON, or HTML with bounding boxes, and auto-tags untagged files into screen-reader-ready Tagged PDFs.
Unstructured	★ 15k	A library for ingesting and preprocessing many document types into clean, chunked elements ready for RAG pipelines.
open-parse	★ 3.2k	Visually chunk complex documents for RAG, the way a human would
