In plain English
LlamaIndex is an open-source Python (and TypeScript) framework that connects your own data — PDFs, Notion pages, a database, a folder of Word docs — to a large language model so the model can answer questions about it. The official tagline calls it "the leading framework for building LLM-powered agents over your data." In short: it's the plumbing between your files and a chatbot.
Here's the everyday analogy. Imagine you hire a brilliant new assistant who has read most of the public internet but has never seen a single document from your company. Ask them "What's our refund policy?" and they'll confidently make something up. LlamaIndex is the onboarding process for that assistant. It takes your refund policy doc, chops it into readable pieces, files those pieces in a searchable cabinet, and — when a question comes in — pulls the right page and hands it to the assistant before they answer. The assistant stays the same brilliant generalist; you've just given it a filing system for your stuff.
That pattern (fetch relevant text, then let the model answer using it) is called retrieval-augmented generation, or RAG. LlamaIndex is one of the most widely used frameworks built specifically to make RAG easy. You can wire up a working question-answering system over your own documents in about five lines of code, then peel back layers and customize every stage as your needs grow.
Why it matters
LLMs have two stubborn limits. First, they only know what was in their training data, which has a cutoff date and never included your private files. Second, when they don't know something, they tend to hallucinate — produce a fluent, plausible, wrong answer. For any app that answers questions about specific content (a support bot, a legal-document search, an internal wiki assistant), both problems are dealbreakers.
The obvious fix — paste the whole document into the prompt — falls apart fast. Your data is bigger than the context window, stuffing it is slow and expensive, and models get worse at finding the needle when the haystack is huge. The real fix is to retrieve only the handful of passages that matter for each question. Doing that well means chunking documents, turning text into embeddings, storing them in a vector index, running similarity search, and assembling a prompt. That's a lot of moving parts to build from scratch.
LlamaIndex packages all of it behind clean defaults. Before frameworks like this, every team rebuilt the same ingest-embed-retrieve-prompt loop by hand. Now you get loaders for 100+ data sources, sensible chunking, a choice of vector stores, retrievers, re-rankers, and query engines out of the box — with escape hatches to swap any piece. It sits in the same neighborhood as LangChain, but where LangChain aims to be a general agent toolkit, LlamaIndex's center of gravity is data: getting your knowledge into a model cleanly.
How it works
LlamaIndex models a RAG app as a short pipeline. Each stage is a swappable component with a good default, so beginners run the whole thing in a few lines while advanced users replace individual stages.
Documents and Nodes
LlamaIndex calls each loaded file a Document. A document is too big to retrieve as one lump, so a node parser splits it into Nodes — small chunks of text (a few sentences to a paragraph) that each carry metadata like the source filename and page number. Nodes are the atomic unit LlamaIndex stores and retrieves. Good chunking matters a lot here: chunks too big waste context and dilute relevance; too small and you lose the surrounding meaning.
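To make the chunk-size trade-off concrete, here is a minimal sketch of sentence-based chunking with overlap in plain Python. It is not LlamaIndex's node parser: the `Node` dataclass, `split_into_nodes`, and the parameter names are hypothetical, chosen only to illustrate the idea.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """A chunk of text plus metadata (hypothetical stand-in, not the library class)."""
    text: str
    metadata: dict = field(default_factory=dict)


def split_into_nodes(text, source, sentences_per_chunk=3, overlap=1):
    """Group sentences into chunks; consecutive chunks share `overlap` sentences."""
    if not 0 <= overlap < sentences_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than sentences_per_chunk")
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = sentences_per_chunk - overlap
    nodes = []
    for start in range(0, len(sentences), step):
        window = sentences[start:start + sentences_per_chunk]
        nodes.append(Node(" ".join(window), {"source": source, "first_sentence": start}))
        if start + sentences_per_chunk >= len(sentences):
            break  # this window already reached the end of the text
    return nodes


doc = (
    "Refunds take 30 days. Contact support first. Keep your receipt. "
    "Shipping is not refunded. Gift cards are final."
)
for node in split_into_nodes(doc, source="refund_policy.txt"):
    print(node.metadata["first_sentence"], "|", node.text)
```

Raising `sentences_per_chunk` gives each node more surrounding meaning at the cost of diluted relevance; raising `overlap` reduces the chance that an answer is split across a chunk boundary.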
Indexes and embeddings
The default and most common index is the VectorStoreIndex. For each node it computes an embedding — a list of numbers capturing the chunk's meaning — and stores those vectors in a vector database. At query time it embeds your question with the same model and finds the nodes whose vectors are nearest, which is semantic search: matching on meaning, not exact keywords. LlamaIndex also offers other index types (a summary index that walks every node, a keyword-table index, a knowledge-graph index) for cases where pure vector search isn't ideal.
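The nearest-vector lookup can be sketched in a few lines. This is an illustration under loud assumptions: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, and `cosine` and `top_k` are illustrative helpers, not LlamaIndex functions.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a sparse word-count vector.

    A real model returns a dense vector that captures meaning; this toy only
    captures word overlap, which is enough to show the search mechanics.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question, chunks, k=2):
    """Embed the question the same way as the chunks and return the k nearest."""
    q = toy_embed(question)
    scored = [(cosine(q, toy_embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "To request a refund, email support with your order number.",
]
for score, chunk in top_k("how many days do I have to request a refund", chunks):
    print(f"{score:.2f}  {chunk}")
```

Note that the word-count toy treats "refund" and "refunds" as unrelated tokens, so the chunk that actually states the 30-day window ranks second. Closing that gap is what a real embedding model is for.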
Retrievers and query engines
A retriever is the component that, given a question, returns the most relevant nodes; the concept is common to RAG tooling in general, not specific to LlamaIndex. A query engine wraps a retriever plus a response synthesizer: it retrieves the top-k nodes, stuffs them into a prompt template, calls the LLM, and returns a finished answer, usually with the source nodes attached so you can show citations. A chat engine is the same idea but keeps conversation history so follow-up questions like "what about the second one?" work.
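The division of labor between retriever and synthesizer can be sketched as two callables wired together. Everything here is hypothetical (`ToyQueryEngine`, `ToyResponse`, `keyword_retrieve`, `fake_llm`, and the prompt template); it shows the shape of the flow, not LlamaIndex's implementation.

```python
from dataclasses import dataclass

PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "Context:\n{context}\n"
    "Question: {question}\n"
    "Answer:"
)


@dataclass
class ToyResponse:
    answer: str
    source_nodes: list  # the chunks that were placed in the prompt


class ToyQueryEngine:
    """Retriever + response synthesizer, reduced to two callables."""

    def __init__(self, retrieve, llm, top_k=2):
        self.retrieve = retrieve  # (question, k) -> list of chunk strings
        self.llm = llm            # prompt string -> answer string
        self.top_k = top_k

    def query(self, question):
        sources = self.retrieve(question, self.top_k)
        context = "\n".join(f"[{i + 1}] {text}" for i, text in enumerate(sources))
        prompt = PROMPT_TEMPLATE.format(context=context, question=question)
        return ToyResponse(answer=self.llm(prompt), source_nodes=sources)


CHUNKS = ["Refunds are accepted within 30 days.", "Support is open 9 to 5."]


def keyword_retrieve(question, k):
    """Hypothetical retriever: rank chunks by number of shared lowercase words."""
    words = set(question.lower().split())
    ranked = sorted(CHUNKS, key=lambda c: len(words & set(c.lower().split())), reverse=True)
    return ranked[:k]


def fake_llm(prompt):
    """Stub model: reports how many numbered context chunks it was given."""
    return f"(stub answer based on {prompt.count('[')} chunk(s))"


engine = ToyQueryEngine(keyword_retrieve, fake_llm, top_k=1)
response = engine.query("Are refunds accepted?")
print(response.answer)
print(response.source_nodes)
```

Because the response carries the chunks alongside the answer, a caller can render citations or debug retrieval without re-running the search.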
A five-line RAG app
This is the canonical "hello world" for LlamaIndex. Drop some files into a data/ folder, point SimpleDirectoryReader at it, and ask a question. The high-level API hides loading, chunking, embedding, storage, retrieval, and synthesis behind sensible defaults.
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
os.environ["OPENAI_API_KEY"] = "sk-..." # default LLM + embeddings
# 1. Load every file under ./data into Documents
docs = SimpleDirectoryReader("data").load_data()
# 2. Parse -> Nodes, embed, and store in an in-memory vector index
index = VectorStoreIndex.from_documents(docs)
# 3. Wrap the index in a query engine and ask a question
engine = index.as_query_engine()
response = engine.query("What is the refund window?")
print(response) # the synthesized answer
print(response.source_nodes) # the chunks it used, for citations
Re-embedding every time you run the script is wasteful. In a real app you persist the index once and reload it, so you pay for embeddings once, when the data is indexed, rather than on every restart.
import os
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
DIR = "./storage"
if os.path.exists(DIR):
    # Reload the already-built index from disk
    ctx = StorageContext.from_defaults(persist_dir=DIR)
    index = load_index_from_storage(ctx)
else:
    # First run: load, chunk, and embed the files, then save the index
    docs = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(docs)
    index.storage_context.persist(persist_dir=DIR) # save for next time
engine = index.as_query_engine(similarity_top_k=4) # retrieve 4 chunks
print(engine.query("Summarize our security policy."))
Connectors, agents, and the wider toolkit
The five-line demo is only the front door. Two things make LlamaIndex worth reaching for on real projects: the breadth of its connectors and its move beyond pure Q&A into agents.
Data connectors (LlamaHub)
SimpleDirectoryReader handles local files, but your data usually lives elsewhere. LlamaHub is LlamaIndex's registry of 100+ community and official connectors that read from Slack, Notion, Google Drive, Confluence, GitHub, Postgres, S3, web pages, and more — each one returning the same Document objects the rest of the pipeline expects. There's also LlamaParse, a hosted service for turning messy PDFs (tables, scanned pages, multi-column layouts) into clean text, which is the part of ingestion that breaks naive pipelines most often.
From query engines to agents
A query engine answers one question against one index. An agent can do more: it reasons about what to do, calls tools, and loops until the task is done. In LlamaIndex you can wrap any query engine as a tool, then hand several tools to an agent so it can decide which knowledge base — or which calculator, API, or function — to use for each step. This is the bridge into agentic RAG, where retrieval becomes a tool the model chooses to call rather than a fixed first step.
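A rough sketch of that tool-choosing loop, in plain Python. The `Tool` dataclass, the stubbed `policy_engine`, and `choose_tool` are all hypothetical; in a real agent the LLM makes the choice that the keyword rule below only imitates.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]


def policy_engine(question: str) -> str:
    """Hypothetical query engine over a policy index (stubbed)."""
    return "Policy index says: refunds within 30 days."


def calculator(expression: str) -> str:
    """Add whitespace-separated integers, e.g. '2 3 5' gives '10'."""
    return str(sum(int(token) for token in expression.split()))


TOOLS = [
    Tool("policy_search", "Answers questions about company policy", policy_engine),
    Tool("add_numbers", "Adds a list of integers", calculator),
]


def choose_tool(task: str) -> Tool:
    """Stand-in for the LLM's decision: use the add tool only if the task is all digits."""
    is_numeric = task.strip() != "" and all(token.isdigit() for token in task.split())
    return TOOLS[1] if is_numeric else TOOLS[0]


def run_agent(tasks):
    """Handle each step by choosing a tool, calling it, and recording the observation."""
    trace = []
    for task in tasks:
        tool = choose_tool(task)
        trace.append((tool.name, tool.run(task)))
    return trace


for name, observation in run_agent(["What is the refund window?", "12 30 8"]):
    print(f"{name}: {observation}")
```

The point of the shape is that a query engine is just one entry in the tool list, on equal footing with a calculator or an API call.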
The modern foundation for this in LlamaIndex is Workflows — an event-driven way to chain steps, agents, retrievers, and tools into a controllable multi-step process. It's the recommended way to build anything more complex than a single query engine, and it sits in the same design space as other agent frameworks like LangGraph and DSPy.
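The event-driven idea can be illustrated with a toy dispatcher: each step handles one event type and emits the next event. This is a sketch of the pattern only, not the Workflows API; the event classes, `HANDLERS`, and `run_workflow` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class StartEvent:
    question: str


@dataclass
class RetrievedEvent:
    question: str
    chunks: list


@dataclass
class StopEvent:
    result: str


def retrieve_step(event):
    """Handle StartEvent: look up chunks (stubbed) and emit RetrievedEvent."""
    chunks = ["Refunds are accepted within 30 days."]
    return RetrievedEvent(question=event.question, chunks=chunks)


def answer_step(event):
    """Handle RetrievedEvent: build an answer (stubbed) and emit StopEvent."""
    return StopEvent(result=f"{event.question} -> based on {len(event.chunks)} chunk(s)")


# Each step is registered against the event type that triggers it.
HANDLERS = {StartEvent: retrieve_step, RetrievedEvent: answer_step}


def run_workflow(start_event, max_steps=10):
    """Dispatch events to handlers until a StopEvent appears."""
    event = start_event
    steps = 0
    while not isinstance(event, StopEvent):
        if steps >= max_steps:
            raise RuntimeError("workflow did not finish within max_steps")
        event = HANDLERS[type(event)](event)
        steps += 1
    return event.result


print(run_workflow(StartEvent(question="What is the refund window?")))
```

Routing on event type rather than on a fixed call order is what lets a step loop back, branch, or hand off to a different step without rewriting the driver.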
Common beginner pitfalls
- Blaming the LLM for retrieval misses. If answers are wrong, the model usually never saw the right chunk. Print response.source_nodes first: bad retrieval, not bad generation, is the most common failure.
- Leaving chunk size at the default. The default node parser is a starting point, not an answer. Dense reference docs and chatty transcripts want very different chunk sizes; tune it and re-evaluate.
- Forgetting to persist. Rebuilding the index on every run re-embeds everything, which is slow and costs money. Persist once, reload after.
- Mismatched embedding models. The model that embeds your documents must be the same one that embeds the query. Switching embedding models means re-indexing all your data.
- Retrieving too few or too many chunks. similarity_top_k too low starves the model of context; too high floods the prompt with noise and cost. Three to six is a common sweet spot to test from.
Going deeper
Once the basic pipeline works, the gains come from making retrieval smarter and measuring it honestly. A few directions experienced teams pursue:
| Technique | What it does | When to reach for it |
|---|---|---|
| Re-ranking | A second model re-scores retrieved nodes for relevance before they hit the prompt | Vector search returns roughly-right but mis-ordered chunks |
| Hybrid search | Combines keyword (BM25) and vector retrieval | Queries hinge on exact terms — product codes, names, acronyms |
| Metadata filtering | Restricts retrieval by tags like date, author, or department | One index spans many sources and you must scope results |
| Sentence-window / auto-merging | Retrieves a small chunk but expands to its surrounding context | Precise matches need neighboring sentences to make sense |
| Query transformation | Rewrites or decomposes the question before retrieving | Multi-part questions one search can't answer |
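As one example from the table, hybrid search needs a way to merge a keyword ranking with a vector ranking. A common technique is reciprocal rank fusion, sketched below in plain Python. The function name, the document ids, and the constant `k=60` are illustrative assumptions, and this is not a claim about how any particular LlamaIndex retriever merges results.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids; each list adds 1 / (k + rank) to a doc's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["sku-4411-spec", "returns-faq", "warranty"]       # e.g. from BM25
vector_hits = ["returns-faq", "refund-policy", "sku-4411-spec"]   # e.g. from embeddings

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Fusion works on ranks rather than raw scores, which sidesteps the problem that BM25 scores and cosine similarities are not on comparable scales.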
Production concerns matter more than any single trick. You'll want evaluation — measuring whether retrieved chunks are relevant and whether answers are grounded in them; LlamaIndex ships evaluators, and the broader practice is covered in how to evaluate RAG. You'll want observability to trace which nodes each answer used. And you'll move from the in-memory index to a real vector database (Qdrant, Pinecone, pgvector, Chroma, and others all have LlamaIndex integrations) once your corpus outgrows a single machine.
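Retrieval quality can be measured with a small labeled set before reaching for any framework evaluator. The sketch below computes hit rate and mean reciprocal rank (MRR) in plain Python; the function name, the question set, and the chunk ids are hypothetical, and it covers retrieval relevance only, not whether answers are grounded.

```python
def hit_rate_and_mrr(results, expected):
    """Score retrieval against labels.

    results:  question -> ranked list of retrieved chunk ids
    expected: question -> the single chunk id that should be retrieved
    """
    hits = 0
    reciprocal_ranks = []
    for question, relevant_id in expected.items():
        ranked = results.get(question, [])
        if relevant_id in ranked:
            hits += 1
            reciprocal_ranks.append(1.0 / (ranked.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)  # a miss contributes nothing
    n = len(expected)
    if n == 0:
        return 0.0, 0.0
    return hits / n, sum(reciprocal_ranks) / n


expected = {"refund window?": "c1", "support hours?": "c7", "warranty length?": "c9"}
results = {
    "refund window?": ["c1", "c4"],     # relevant chunk at rank 1
    "support hours?": ["c2", "c7"],     # relevant chunk at rank 2
    "warranty length?": ["c3", "c5"],   # relevant chunk missing
}
hit_rate, mrr = hit_rate_and_mrr(results, expected)
print(f"hit rate={hit_rate:.2f}  MRR={mrr:.2f}")
```

Re-running the same labeled set after changing chunk size, top-k, or the embedding model turns tuning into a comparison of numbers instead of a guess.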
Two design tensions are worth knowing. First, structured vs. unstructured data: LlamaIndex can also turn natural-language questions into SQL over a real database, so not everything has to be embedded. Second, freshness: an index is a snapshot, so changing documents means an ingestion pipeline that detects updates and re-embeds only what changed — LlamaIndex's IngestionPipeline with a document store handles this deduplication. These are the problems that separate a weekend demo from a system people trust.
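The freshness problem reduces to comparing content hashes against what was stored last time. Here is a minimal sketch of that idea in plain Python. It is not how IngestionPipeline is implemented; `content_hash`, `plan_ingestion`, and the file names are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingestion(documents, seen_hashes):
    """Decide what to re-embed and what to drop.

    documents:   doc_id -> current text
    seen_hashes: doc_id -> hash recorded at the previous ingestion
    """
    to_embed = [
        doc_id
        for doc_id, text in documents.items()
        if seen_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in seen_hashes if doc_id not in documents]
    return to_embed, to_delete


stored = {
    "a.md": content_hash("v1"),
    "b.md": content_hash("same"),
    "c.md": content_hash("old"),
}
current = {"a.md": "v2", "b.md": "same", "d.md": "new"}

to_embed, to_delete = plan_ingestion(current, stored)
print("re-embed:", to_embed)
print("delete:", to_delete)
```

Unchanged documents never reach the embedding model, so the cost of a refresh scales with what changed rather than with the size of the corpus.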
FAQ
Is LlamaIndex free and open source?
Yes. The core llama-index library is open source under the MIT license and free to use; the GitHub repo is at run-llama/llama_index. Some hosted add-ons like LlamaParse and LlamaCloud are paid services, but you can build a full RAG app with only the free library.
What is the difference between LlamaIndex and LangChain?
Both connect LLMs to data and tools, and they overlap. LlamaIndex centers on data — ingesting, indexing, and retrieving over your documents for RAG — with the cleanest defaults for that job. LangChain is a broader, more general toolkit for chains and agents. Many teams use LlamaIndex for the retrieval layer even inside a LangChain or custom app.
Do I need a vector database to use LlamaIndex?
No. VectorStoreIndex.from_documents builds an in-memory index by default, which is perfect for prototypes and small datasets. You only swap in a dedicated vector database like Qdrant or Pinecone when your corpus is large or needs to persist and scale across servers.
Can LlamaIndex work with local or open-source models?
Yes. LlamaIndex is model-agnostic. You set Settings.llm and Settings.embed_model to any supported provider, including local models served through Ollama or Hugging Face, so your data and inference can stay on your own hardware.
What is a query engine in LlamaIndex?
A query engine is the component that turns a question into an answer. It bundles a retriever (which finds relevant chunks) with a response synthesizer (which prompts the LLM with those chunks). Call index.as_query_engine() to create one, then .query("...") to ask. The response includes the source nodes used, so you can show citations.
Does LlamaIndex prevent hallucinations?
It reduces them but doesn't eliminate them. By grounding answers in retrieved passages from your own data, the model has less reason to invent facts. But if retrieval misses the right chunk, the model can still guess. Good chunking, retrieval tuning, and RAG evaluation are what keep answers grounded.