PageIndex

Vectorless, reasoning-based RAG that builds a tree index from long documents

github.com/VectifyAI/PageIndex · ★ 33.2k · pageindex.ai

Overview

PageIndex is an open-source RAG system that works without a vector database and without chunking. Instead of matching text by vector similarity, it builds a hierarchical tree index from a long document, much like a table of contents, and lets an LLM reason over that tree to find the sections that are actually relevant.
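The idea of reasoning over a tree rather than comparing vectors can be illustrated with a small sketch. Everything here is hypothetical and is not PageIndex's own code: the `Node` fields, the sample report, and especially `judge`, which counts shared words only as a stand-in for the relevance judgment an LLM would make from each section's title and summary.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in a table-of-contents style tree (hypothetical schema)."""
    node_id: str
    title: str
    summary: str
    start_page: int
    end_page: int
    children: list["Node"] = field(default_factory=list)


def tokens(text: str) -> list[str]:
    """Lowercase alphabetic words, with punctuation dropped."""
    return re.findall(r"[a-z]+", text.lower())


def judge(question: str, node: Node) -> int:
    """Stand-in for LLM reasoning: count section words that appear in the question."""
    wanted = set(tokens(question))
    return sum(1 for word in tokens(f"{node.title} {node.summary}") if word in wanted)


def search(question: str, node: Node) -> Node:
    """Descend from the root, following the most relevant child at each level."""
    while node.children:
        best = max(node.children, key=lambda child: judge(question, child))
        if judge(question, best) == 0:
            break  # no child looks relevant, so stop at the current section
        node = best
    return node


report = Node("0000", "Annual Report", "Full report", 1, 40, [
    Node("0001", "Business Overview", "Products and markets", 1, 12),
    Node("0002", "Financial Statements",
         "Income statement, balance sheet, and debt notes", 13, 40, [
             Node("0003", "Income Statement", "Revenue and net income", 13, 20),
             Node("0004", "Balance Sheet", "Assets and total debt", 21, 40),
         ]),
])

hit = search("What was total debt on the balance sheet?", report)
print(f"{hit.node_id} {hit.title} (pages {hit.start_page}-{hit.end_page})")
# prints: 0004 Balance Sheet (pages 21-40)
```

The search never embeds or chunks the text. It only decides, level by level, which section to open next, and the answer comes back with a node ID and page range attached.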

The approach is aimed at long, professional documents such as financial reports, regulatory filings, legal manuals, and academic textbooks that exceed an LLM's context window. Because retrieval is driven by reasoning and grounded in explicit page and section references, every result is traceable and easier to explain than the output of an opaque vector search.

You can self-host the project from its GitHub repository, which uses standard PDF parsing, or use the hosted cloud service and API for enhanced OCR and tree building.

What it does

No vector database: retrieval uses document structure and LLM reasoning instead of vector similarity search
No chunking: documents are organized into natural sections rather than artificial fixed-size chunks
Builds a semantic tree (table-of-contents style) index from long PDFs, with node IDs, summaries, and page ranges
Traceable, explainable retrieval grounded in explicit page and section references
Context-aware retrieval that can take conversation history and domain knowledge into account
Markdown input support and multi-LLM support via LiteLLM
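As a rough illustration of how traceable, context-aware retrieval might fit together, the sketch below assembles a node-selection prompt from a compact outline plus optional conversation history and domain notes, then maps the model's reply back to page ranges. The `OUTLINE` contents, the prompt wording, the JSON reply format, and both function names are assumptions made for this example. No LLM is called, and the reply is simulated.

```python
import json

# Hypothetical flat outline: node ID -> (title, summary, first page, last page).
OUTLINE = {
    "0001": ("Business Overview", "Products and markets", 1, 12),
    "0003": ("Income Statement", "Revenue and net income", 13, 20),
    "0004": ("Balance Sheet", "Assets and total debt", 21, 40),
}


def build_prompt(question, history=(), domain_notes=""):
    """Assemble a node-selection prompt from the outline plus optional context."""
    lines = [
        "Select the node IDs most likely to answer the question.",
        'Reply with JSON: {"node_ids": [...]}',
        "",
        "Outline:",
    ]
    for node_id, (title, summary, start, end) in OUTLINE.items():
        lines.append(f"[{node_id}] {title} (pp. {start}-{end}): {summary}")
    if domain_notes:
        lines += ["", f"Domain notes: {domain_notes}"]
    if history:
        lines += ["", "Conversation so far:"] + [f"- {turn}" for turn in history]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


def resolve(reply):
    """Turn a JSON reply into page-level citations, skipping unknown node IDs."""
    citations = []
    for node_id in json.loads(reply)["node_ids"]:
        if node_id in OUTLINE:
            title, _, start, end = OUTLINE[node_id]
            citations.append(f"{title}, pages {start}-{end} (node {node_id})")
    return citations


prompt = build_prompt(
    "How much debt is outstanding?",
    history=["User asked about leverage in the previous turn"],
    domain_notes="Debt figures appear on the balance sheet.",
)
simulated_reply = '{"node_ids": ["0004", "9999"]}'  # stands in for a real LLM response
print(resolve(simulated_reply))
# prints: ['Balance Sheet, pages 21-40 (node 0004)']
```

Only titles, summaries, and page ranges go into the prompt, so the outline stays small even when the document itself is far larger than the context window. Dropping unknown IDs such as `9999` keeps a hallucinated node from turning into a citation.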

Getting started

Self-host PageIndex to generate a tree structure from a PDF document. You need Python and an LLM API key.

Install dependencies

Install the required Python packages from the repository's requirements file.

pip3 install --upgrade -r requirements.txt

Set your LLM API key

Create a .env file in the root directory with your LLM API key. Multiple LLM providers are supported through LiteLLM.

OPENAI_API_KEY=your_openai_key_here
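Before running anything that calls an LLM, it can help to confirm that the key was actually saved. The following is a minimal standard-library sketch, not part of PageIndex: `parse_env` and `key_is_set` are hypothetical helpers, and the parser handles only plain `KEY=value` lines.

```python
from pathlib import Path


def parse_env(text):
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def key_is_set(values, name="OPENAI_API_KEY"):
    """True when the key exists and is not the placeholder from the example."""
    value = values.get(name, "")
    return bool(value) and value != "your_openai_key_here"


env_path = Path(".env")  # expected in the repository root
values = parse_env(env_path.read_text()) if env_path.is_file() else {}
print("API key configured:", key_is_set(values))
```

If you use a different provider through LiteLLM, pass that provider's variable name as `name`.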

Generate a PageIndex tree for your PDF

Run the script against your document to build its tree structure index. A --md_path flag is also available for Markdown files.

python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
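If you index many documents, you could drive the script from Python instead of typing the command each time. This sketch uses only the `--pdf_path` and `--md_path` flags mentioned above. The helper names `build_command` and `build_index` are hypothetical, and choosing the flag from a `.pdf` or `.md` extension is an assumption of this example.

```python
import subprocess
import sys
from pathlib import Path


def build_command(document, script="run_pageindex.py"):
    """Pick --pdf_path or --md_path from the file extension."""
    suffix = Path(document).suffix.lower()
    if suffix == ".pdf":
        flag = "--pdf_path"
    elif suffix == ".md":
        flag = "--md_path"
    else:
        raise ValueError(f"unsupported document type: {suffix or 'none'}")
    return [sys.executable, script, flag, str(document)]


def build_index(document):
    """Run the script from the repository root; raises if it exits non-zero."""
    subprocess.run(build_command(document), check=True)


print(build_command("/path/to/your/document.pdf")[1:])
# prints: ['run_pageindex.py', '--pdf_path', '/path/to/your/document.pdf']
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.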

Try the agentic vectorless RAG demo

Install the optional dependency and run the included end-to-end example, which uses self-hosted PageIndex with the OpenAI Agents SDK.

pip3 install openai-agents
python3 examples/agentic_vectorless_rag_demo.py

Commands and code are distilled from the project's own documentation. Always check the official repo for the latest.

When to use it

Question answering over long financial reports, SEC filings, and earnings disclosures
Reasoning-based retrieval across regulatory, legal, or technical manuals that exceed an LLM context window
Searching and extracting answers from academic textbooks and other complex long-form PDFs
Building agentic RAG pipelines where retrieval needs to be traceable and explainable instead of relying on vector similarity

How PageIndex compares

PageIndex alongside the other open-source RAG frameworks and platforms that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Dify	★ 146k	An open-source platform with a visual workflow builder for creating LLM and RAG applications without writing much code.
RAGFlow	★ 83.2k	A RAG engine built around deep document understanding that turns complex files into a grounded, citation-backed question-answering layer.
Context7	★ 57.7k	Context7 pulls current, version-specific documentation and code examples for any library and feeds them into your LLM, available as a CLI skill or an MCP server.
Quivr	★ 39.2k	Quivr is an open-source RAG framework that ingests your documents and answers questions about them, working with any LLM and any file type.
LightRAG	★ 36.8k	A graph-based RAG system that builds an entity-and-relationship knowledge graph for fast retrieval and easy incremental updates.
GraphRAG	★ 33.9k	Microsoft's graph-based RAG system that extracts a knowledge graph from documents to answer broad, multi-document questions.
PageIndex	★ 33.2k	Vectorless, reasoning-based RAG that builds a tree index from long documents
FastGPT	★ 28.6k	FastGPT is an open-source AI agent platform that pairs a built-in knowledge base with a drag-and-drop Flow editor, so you can build question-answering apps without heavy setup.
