AI/TLDR

PageIndex

Vectorless, reasoning-based RAG that builds a tree index from long documents

Overview

PageIndex is an open-source RAG system that works without a vector database and without chunking. Instead of matching text by vector similarity, it builds a hierarchical tree index from a long document, much like a table of contents, and lets an LLM reason over that tree to find the sections that are actually relevant.

The approach is aimed at long, professional documents such as financial reports, regulatory filings, legal manuals, and academic textbooks that exceed an LLM's context window. Because retrieval is driven by reasoning and grounded in explicit page and section references, every result is traceable and easier to explain than opaque vector search.
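
To make the idea concrete, here is a minimal sketch of reasoning-style retrieval over a tree index. It is not PageIndex's implementation: the `Node` class, the `choose_children` function, and the sample outline are all hypothetical, and a simple keyword-overlap check stands in for the LLM call that would decide which branches to follow.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One section of a document outline (hypothetical structure)."""
    node_id: str
    title: str
    summary: str
    start_page: int
    end_page: int
    children: list["Node"] = field(default_factory=list)


def choose_children(question: str, children: list[Node]) -> list[Node]:
    """Stand-in for the LLM: keep children that share a word with the question."""
    words = set(question.lower().split())
    return [
        c for c in children
        if words & set(f"{c.title} {c.summary}".lower().split())
    ]


def retrieve(question: str, node: Node) -> list[Node]:
    """Descend from the root, following only the branches judged relevant."""
    if not node.children:
        return [node]
    hits = []
    for child in choose_children(question, node.children):
        hits.extend(retrieve(question, child))
    return hits


# Hypothetical outline of a financial report.
root = Node("0000", "Annual Report", "Full report", 1, 40, [
    Node("0001", "Business Overview", "Products and markets", 1, 10),
    Node("0002", "Financial Statements",
         "Audited results including revenue and liquidity", 11, 40, [
             Node("0003", "Revenue", "Revenue by segment", 11, 20),
             Node("0004", "Liquidity", "Cash and debt position", 21, 40),
         ]),
])

for hit in retrieve("how did revenue change by segment", root):
    print(f"{hit.node_id} {hit.title} (pages {hit.start_page}-{hit.end_page})")
# prints: 0003 Revenue (pages 11-20)
```

Because each hit carries its node ID and page range, the answer can point back to an exact section, which is the traceability property described above.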

You can self-host the project from its open-source repository, which uses standard PDF parsing, or use the hosted cloud service and API for enhanced OCR and tree building.

What it does

  • No vector database: retrieval uses document structure and LLM reasoning instead of vector similarity search
  • No chunking: documents are organized into natural sections rather than artificial fixed-size chunks
  • Builds a semantic tree (table-of-contents style) index from long PDFs, with node IDs, summaries, and page ranges
  • Traceable, explainable retrieval grounded in explicit page and section references
  • Context-aware retrieval that can take conversation history and domain knowledge into account
  • Markdown input support and multi-LLM support via LiteLLM
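
The tree index in the list above can be pictured as nested records that flatten into a table-of-contents style outline, which is roughly the kind of compact view a model could reason over. The sketch below is illustrative only: the field names (`node_id`, `title`, `summary`, `start_index`, `end_index`, `nodes`) and the sample data are assumptions, not a documented schema, so check the project's actual output before relying on them.

```python
def render_outline(node: dict, depth: int = 0) -> list[str]:
    """Flatten a nested tree into indented, table-of-contents style lines."""
    line = (
        f"{'  ' * depth}[{node['node_id']}] {node['title']} "
        f"(pp. {node['start_index']}-{node['end_index']}): {node['summary']}"
    )
    lines = [line]
    # Leaf nodes may omit the child list entirely, hence .get().
    for child in node.get("nodes", []):
        lines.extend(render_outline(child, depth + 1))
    return lines


# Hypothetical sample tree; field names are assumptions.
tree = {
    "node_id": "0001", "title": "Risk Factors",
    "start_index": 12, "end_index": 30,
    "summary": "Material risks to the business",
    "nodes": [
        {"node_id": "0002", "title": "Market Risk",
         "start_index": 12, "end_index": 18,
         "summary": "Interest rate and currency exposure"},
    ],
}

print("\n".join(render_outline(tree)))
```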

Getting started

Self-host PageIndex to generate a tree structure from a PDF document. You need Python and an LLM API key.

Install dependencies

Install the required Python packages from the repository's requirements file.

```bash
pip3 install --upgrade -r requirements.txt
```

Set your LLM API key

Create a .env file in the root directory with your LLM API key. Multiple LLM providers are supported through LiteLLM.

```bash
OPENAI_API_KEY=your_openai_key_here
```
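
Before starting a long indexing run, it can help to confirm that the key is actually visible. The following is a small standard-library sketch for that check, not part of PageIndex: `load_env_file` and `has_api_key` are hypothetical helpers, and the parser only understands plain `KEY=value` lines (a library such as python-dotenv handles the full format).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_api_key(name: str = "OPENAI_API_KEY", path: str = ".env") -> bool:
    """True if the key is set in the environment or in the .env file."""
    return bool(os.environ.get(name) or load_env_file(path).get(name))


print("API key found:", has_api_key())
```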

Generate a PageIndex tree for your PDF

Run the script against your document to build its tree structure index. A --md_path flag is also available for Markdown files.

```bash
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
```
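
If you index both PDFs and Markdown files, a thin wrapper can pick the right flag for you. This is a hedged sketch rather than anything shipped with the project: `build_command` and `build_index` are hypothetical helpers, the mapping of `.md` and `.markdown` extensions to `--md_path` is an assumption, and the script is assumed to be run from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_command(document: str, script: str = "run_pageindex.py") -> list[str]:
    """Choose --pdf_path or --md_path from the file extension."""
    suffix = Path(document).suffix.lower()
    if suffix == ".pdf":
        flag = "--pdf_path"
    elif suffix in {".md", ".markdown"}:  # assumed Markdown extensions
        flag = "--md_path"
    else:
        raise ValueError(f"Unsupported document type: {suffix or 'no extension'}")
    return [sys.executable, script, flag, document]


def build_index(document: str) -> None:
    """Run the indexing script and raise if it exits with an error."""
    subprocess.run(build_command(document), check=True)


# Show the arguments without running anything.
print(build_command("/path/to/your/document.pdf")[1:])
# prints: ['run_pageindex.py', '--pdf_path', '/path/to/your/document.pdf']
```

Calling `build_index("/path/to/your/document.pdf")` would then launch the same command shown above using the current Python interpreter.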

Try the agentic vectorless RAG demo

Install the optional dependency and run the included end-to-end example, which uses self-hosted PageIndex with the OpenAI Agents SDK.

```bash
pip3 install openai-agents
python3 examples/agentic_vectorless_rag_demo.py
```

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Question answering over long financial reports, SEC filings, and earnings disclosures
  • Reasoning-based retrieval across regulatory, legal, or technical manuals that exceed an LLM context window
  • Searching and extracting answers from academic textbooks and other complex long-form PDFs
  • Building agentic RAG pipelines where retrieval needs to be traceable and explainable instead of relying on vector similarity

How PageIndex compares

PageIndex alongside the other open-source RAG frameworks and platforms that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
|---|---|---|
| Dify | ★ 146k | An open-source platform with a visual workflow builder for creating LLM and RAG applications without writing much code. |
| RAGFlow | ★ 83.2k | A RAG engine built around deep document understanding that turns complex files into a grounded, citation-backed question-answering layer. |
| Context7 | ★ 57.7k | Context7 pulls current, version-specific documentation and code examples for any library and feeds them into your LLM, available as a CLI skill or an MCP server. |
| Quivr | ★ 39.2k | Quivr is an open-source RAG framework that ingests your documents and answers questions about them, working with any LLM and any file type. |
| LightRAG | ★ 36.8k | A graph-based RAG system that builds an entity-and-relationship knowledge graph for fast retrieval and easy incremental updates. |
| GraphRAG | ★ 33.9k | Microsoft's graph-based RAG system that extracts a knowledge graph from documents to answer broad, multi-document questions. |
| PageIndex | ★ 33.2k | Vectorless, reasoning-based RAG that builds a tree index from long documents. |
| FastGPT | ★ 28.6k | FastGPT is an open-source AI agent platform that pairs a built-in knowledge base with a drag-and-drop Flow editor, so you can build question-answering apps without heavy setup. |