How to Build a "Chat with Your PDF" App

Build the classic beginner RAG project — an app that answers questions about any PDF you upload.

BEGINNER · 15 MIN READ · UPDATED 2026-06-12

In plain English

A "chat with your PDF" app lets you upload any document — a research paper, a legal contract, a product manual — and ask it questions in plain English. The app reads the file, figures out which parts are relevant to your question, and hands those passages to an LLM that writes a grounded answer. No more scrolling through 80 pages looking for the one paragraph you need.

The analogy that makes this click: imagine you hire a research assistant. You hand them a stack of papers and tell them to answer your questions. A smart assistant doesn't re-read every page from scratch each time you ask something — they quickly scan the pile, pull out the most relevant paragraphs, and use those as the basis for their answer. That's exactly what this app does, except the "scanning" is a vector similarity search and the "assistant writing the answer" is an LLM.

This pattern is called RAG — Retrieval-Augmented Generation. It's the most commonly built beginner AI project because it's genuinely useful, it teaches you how embeddings, vector stores, and LLMs fit together, and the core version is about 50 lines of Python. This guide walks you through every step.

Why this project matters for a builder

LLMs are trained on public internet text. They know nothing about your company's internal docs, your client's contract, or a paper published last week. Asking a plain chatbot about private or recent content gets you confident hallucinations — the model will invent plausible-sounding but wrong answers because it genuinely doesn't know.

A chat-with-PDF app solves this by injecting the actual text from your document into the model's context window at query time. The LLM isn't guessing anymore — it's reading. This makes answers both more accurate and more verifiable, because you can show the user which passage the answer came from.

Beyond this one project, the skills you learn here transfer directly to almost every document-heavy AI product: contract review tools, research assistants, customer support bots backed by a knowledge base, internal wikis you can query in natural language. They all run on the same five-step pipeline you're about to build.

Step in the pipeline	What you learn by building it
Extract text from PDF	How to handle real-world document formats
Chunk text into pieces	Why splitting strategy affects answer quality
Embed chunks as vectors	What embeddings are and how similarity search works
Retrieve relevant chunks	How RAG limits context to what's actually needed
Generate answer with citations	How to ground LLM output in retrieved evidence

How the pipeline works

There are two phases. The indexing phase runs once when you load a new PDF — it extracts text, splits it into chunks, converts each chunk to a vector embedding, and stores those vectors in a vector store. The query phase runs each time you ask a question — it embeds the question, finds the most similar chunks, and sends those chunks plus the question to the LLM.

The full RAG pipeline for PDF chat:

1. Load PDF: extract raw text with pypdf or PyMuPDF
2. Chunk text: split into ~500-token overlapping windows
3. Embed chunks: each chunk becomes a vector (e.g. 1536 floats)
4. Store in vector DB: FAISS or Chroma, saved to disk
5. User asks a question: embed the question too
6. Similarity search: find top-k most relevant chunks
7. LLM answers: reads retrieved chunks, writes grounded reply
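
To make the two phases concrete before bringing in any libraries, here is a minimal sketch that uses only the standard library. It assumes word counts as a stand-in for real embeddings, and `ToyIndex`, `embed`, and `cosine` are hypothetical names invented for this illustration, not part of any library used later in this guide.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Hypothetical in-memory index that mirrors the two phases."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, Counter]] = []

    def add(self, chunks: list[str]) -> None:
        # Indexing phase: embed every chunk once and keep the vectors.
        self.entries.extend((chunk, embed(chunk)) for chunk in chunks)

    def query(self, question: str, k: int = 2) -> list[str]:
        # Query phase: embed the question, then rank chunks by similarity.
        q_vec = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q_vec, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


index = ToyIndex()
index.add([
    "The warranty covers parts and labor for two years.",
    "Shipping takes five business days within Europe.",
    "Returns are accepted within thirty days of delivery.",
])
top_chunks = index.query("How long is the warranty?", k=1)
print(top_chunks)
```

Word counts only match exact words, which is why the real pipeline swaps in a learned embedding model. The shape of the code stays the same: `add` runs once per document, `query` runs once per question.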

Step 1 — Extract text

The first job is pulling raw text out of the PDF. The two most popular libraries for this in Python are pypdf (pure Python, easy to install) and PyMuPDF (faster, better at preserving layout). For most beginner projects with native — not scanned — PDFs, either works fine. If your PDF is a scanned image, you'll need to add an OCR step (Tesseract or a vision-model based extractor).

Extract text from a PDF with pypdf (Python)

from pypdf import PdfReader

def extract_text(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return "\n\n".join(pages)  # join pages with double newline

Step 2 — Chunk the text

You can't embed the whole document as one unit — it would be too large to embed, and you'd lose the ability to pinpoint which section answered the question. Instead, you split the text into smaller, overlapping chunks. The overlap is important: it prevents an answer from being split across two chunks that are never retrieved together.

LangChain's RecursiveCharacterTextSplitter is the standard beginner choice. A 2026 benchmark of seven chunking strategies across 50 academic papers found that recursive 512-token splitting placed first at 69% retrieval accuracy, which makes it a good default. Note that the splitter measures chunk_size in characters by default, not tokens, so the numbers in the code below are character counts. A chunk size of 500-1000 characters with 10-20% overlap covers most PDFs well.

Chunk text with LangChain (Python)

from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_text(text: str) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,    # characters per chunk
        chunk_overlap=200,  # overlap between consecutive chunks
        separators=["\n\n", "\n", " ", ""]
    )
    return splitter.split_text(text)
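
To see what overlap actually does, it can help to look at a stripped-down splitter. The sketch below is a plain fixed-size sliding window, not what RecursiveCharacterTextSplitter does internally (that splitter prefers to break on separators such as paragraph breaks). The function name `sliding_window_chunks` is made up for this illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.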

Step 3 — Embed and store

Each chunk gets converted into a vector, a list of numbers that captures its meaning. You then store those vectors in a vector database so you can search them by similarity. For local development, Chroma is the easiest choice: it persists to disk, requires no running server, and integrates directly with LangChain. FAISS (Facebook AI Similarity Search) is a fast alternative that keeps its index in memory; it does not persist automatically, so you have to save and reload the index yourself.

Embed chunks and store in Chroma (Python)

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

def build_vector_store(chunks: list[str], persist_dir: str = "./chroma_db") -> Chroma:
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vector_store = Chroma.from_texts(
        texts=chunks,
        embedding=embeddings,
        persist_directory=persist_dir
    )
    return vector_store

Step 4 — Retrieve and generate

When a user asks a question, you embed the question using the same model, then ask the vector store for the top-k most similar chunks. Those chunks become the "context" you inject into the LLM's prompt alongside the question. The LLM reads the context and writes an answer grounded in the actual document text.

Retrieve chunks and answer a question (Python)

from langchain_chroma import Chroma  # needed for the type hint below
from openai import OpenAI

client = OpenAI()

def answer_question(question: str, vector_store: Chroma, k: int = 4) -> str:
    # 1. Find the most relevant chunks
    results = vector_store.similarity_search(question, k=k)
    context = "\n\n---\n\n".join(doc.page_content for doc in results)

    # 2. Build the prompt
    system = (
        "You are a precise assistant. Answer questions using ONLY the provided context. "
        "If the context does not contain enough information to answer, say so clearly. "
        "Cite the relevant passage by quoting a short excerpt."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"

    # 3. Call the LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user}
        ],
        max_tokens=512,
    )
    return response.choices[0].message.content

Putting it all together: the complete app

Here's a minimal but complete PDF chat app in Python. It extracts text from any PDF you provide, builds a vector store on the first run, and then opens an interactive loop where you can ask questions. On subsequent runs it reloads the existing vector store from disk so you don't have to re-embed the whole document.

chat_pdf.py: complete working app (Python)

import os
from pypdf import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

PDF_PATH = "my_document.pdf"   # <-- change this
PERSIST_DIR = "./chroma_db"

def load_or_build_store(pdf_path: str, persist_dir: str) -> Chroma:
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    if os.path.exists(persist_dir):
        print("Loading existing vector store...")
        return Chroma(persist_directory=persist_dir, embedding_function=embeddings)

    print("Building vector store from PDF...")
    reader = PdfReader(pdf_path)
    full_text = "\n\n".join(p.extract_text() or "" for p in reader.pages)

    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_text(full_text)
    print(f"Created {len(chunks)} chunks.")

    return Chroma.from_texts(
        texts=chunks,
        embedding=embeddings,
        persist_directory=persist_dir
    )

def answer(question: str, store: Chroma) -> str:
    docs = store.similarity_search(question, k=4)
    context = "\n\n---\n\n".join(d.page_content for d in docs)
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Answer using ONLY the provided context. "
                "Cite a short excerpt to support your answer. "
                "If the context is insufficient, say you don't know."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        max_tokens=512,
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    store = load_or_build_store(PDF_PATH, PERSIST_DIR)
    print("Ready! Ask questions about your PDF (Ctrl+C to quit).\n")
    try:
        while True:
            q = input("You: ").strip()
            if q:
                print(f"Bot: {answer(q, store)}\n")
    except (KeyboardInterrupt, EOFError):
        print("\nBye!")  # exit cleanly instead of printing a traceback

Install dependencies and run (bash)

pip install openai langchain langchain-openai langchain-chroma langchain-text-splitters pypdf chromadb python-dotenv
export OPENAI_API_KEY="sk-..."  # or put it in a .env file
python chat_pdf.py

Add a simple web UI with Streamlit

Once your terminal version works, wrapping it in a Streamlit UI takes about 20 more lines. Streamlit gives you a file uploader, a chat history panel, and a text input box: all of the visual chrome a real PDF chat app needs, with zero HTML or CSS. Install it with pip install streamlit and start the app with streamlit run streamlit_app.py.

streamlit_app.py: web UI for PDF chat (Python)

import os
import tempfile
import streamlit as st
from pypdf import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from openai import OpenAI

st.title("Chat with your PDF")

uploaded = st.file_uploader("Upload a PDF", type="pdf")
if uploaded and "store" not in st.session_state:
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as f:
        f.write(uploaded.read())
        tmp_path = f.name
    reader = PdfReader(tmp_path)
    text = "\n\n".join(p.extract_text() or "" for p in reader.pages)
    os.remove(tmp_path)  # clean up the temp file once the text is extracted
    chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_text(text)
    st.session_state.store = Chroma.from_texts(
        texts=chunks, embedding=OpenAIEmbeddings(model="text-embedding-3-small")
    )
    st.session_state.history = []
    st.success(f"Indexed {len(chunks)} chunks. Ask away!")

if "store" in st.session_state:
    for msg in st.session_state.history:
        with st.chat_message(msg["role"]):
            st.write(msg["content"])

    if q := st.chat_input("Ask a question about the PDF..."):
        st.session_state.history.append({"role": "user", "content": q})
        docs = st.session_state.store.similarity_search(q, k=4)
        ctx = "\n\n---\n\n".join(d.page_content for d in docs)
        resp = OpenAI().chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": (
                    "Answer using only the provided context. Cite evidence. "
                    "If the context is insufficient, say you don't know."
                )},
                {"role": "user", "content": f"Context:\n{ctx}\n\nQuestion: {q}"}
            ],
            max_tokens=512,
        )
        reply = resp.choices[0].message.content
        st.session_state.history.append({"role": "assistant", "content": reply})
        st.rerun()

Common mistakes and how to avoid them

Most PDF chat apps that "don't work" have the same handful of problems. Here are the most common ones and how to fix them.

Pitfall 1 — Chunks that are too large or too small

Very large chunks (2000+ characters) give the model too much irrelevant context and hurt retrieval precision — the similarity search latches on to the average meaning of a big chunk rather than the specific passage that matters. Very small chunks (under 200 characters) lose surrounding context, so the model can't make sense of the snippet in isolation. A starting point of 800-1000 characters with 15-20% overlap works for most PDFs. Adjust based on the document's structure.

Pitfall 2 — Not retrieving enough chunks

Retrieving only 1 or 2 chunks makes it easy to miss the answer when it spans multiple passages or when the document uses slightly different wording than your question. Retrieving 4-6 chunks is a good default. You can retrieve more and let the model ignore irrelevant ones — as long as your total context stays within the model's context window.
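
If you raise k, it is worth capping how much text you actually send. The sketch below trims a ranked list of chunks to a character budget. Using characters as a rough proxy for tokens is an assumption made here for simplicity (an exact limit would need the model's tokenizer), and `fit_chunks_to_budget` is a hypothetical helper name.

```python
def fit_chunks_to_budget(chunks: list[str], max_chars: int) -> list[str]:
    """Keep chunks in ranked order until adding the next one would exceed max_chars."""
    selected: list[str] = []
    used = 0
    for chunk in chunks:  # chunks are assumed to be sorted most relevant first
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    return selected


ranked_chunks = ["a" * 400, "b" * 400, "c" * 400]
kept = fit_chunks_to_budget(ranked_chunks, max_chars=1000)
print(len(kept))  # 2
```

Because the list is already sorted by relevance, stopping at the first chunk that does not fit drops the least relevant material first.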

Pitfall 3 — Not instructing the model to say "I don't know"

Without an explicit instruction, the LLM will try to answer even when the retrieved context doesn't contain the answer — and it will sound confident. Always include a line like "If the context does not contain enough information to answer the question, say 'The document does not appear to address this topic.'" in your system prompt. This is the single most important guardrail for a trustworthy PDF chat app.
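
One way to keep that guardrail from getting lost is to build the messages in a single helper. This is a minimal sketch: `build_messages`, `is_refusal`, and the `REFUSAL` constant are names invented for this example, and the message format simply mirrors the chat messages used earlier in this guide.

```python
REFUSAL = "The document does not appear to address this topic."


def build_messages(question: str, context: str) -> list[dict[str, str]]:
    """Build chat messages that always carry the 'say you don't know' instruction."""
    system = (
        "Answer using ONLY the provided context. "
        "If the context does not contain enough information to answer the question, "
        f"say '{REFUSAL}'"
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


def is_refusal(reply: str) -> bool:
    """True if the model used the agreed refusal sentence (case-insensitive)."""
    return REFUSAL.lower() in reply.lower()


messages = build_messages("What is the refund policy?", "Shipping takes five days.")
print(messages[0]["content"])
```

Fixing the exact refusal sentence also lets the UI detect it, for example to hide a "Sources" section when the model declined to answer.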

Pitfall 4 — Scanned PDFs with no text layer

Some PDFs are just images of scanned pages. pypdf will return empty strings for these pages because there's no embedded text to extract. You can detect this (extracted text is blank or contains only whitespace) and fall back to PyMuPDF + Tesseract OCR or use a vision-capable model to extract text from the page images before chunking.
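
The detection step can be as simple as checking how much text each page produced. In the sketch below, `pages_needing_ocr` is a hypothetical helper and the 20-character threshold is an arbitrary assumption you would tune for your documents. It takes the per-page strings you already get from page.extract_text().

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 1-based page numbers whose extracted text is blank or nearly blank."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


extracted = [
    "Chapter 1. Introduction to the product and its features.",
    "   \n",
    "",
]
print(pages_needing_ocr(extracted))  # [2, 3]
```

Only the flagged pages then need the slower OCR or vision-model path; pages with a real text layer can go straight to chunking.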

Pitfall 5 — Re-embedding the document on every run

Embedding costs money and takes time. Persist your vector store to disk (Chroma does this automatically with persist_directory) and only re-embed when the PDF changes. For a multi-document app, store a hash of each PDF so you can detect when re-ingestion is needed.
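
The hash check could look something like the sketch below. It assumes a small JSON manifest file that maps each PDF path to the SHA-256 of its contents; the manifest layout and the function names are choices made for this example, not a LangChain or Chroma feature.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash the file in blocks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def load_manifest(manifest_path: Path) -> dict[str, str]:
    """Read the path-to-hash mapping, or start empty if none exists yet."""
    if manifest_path.exists():
        return json.loads(manifest_path.read_text())
    return {}


def needs_reingest(pdf_path: Path, manifest_path: Path) -> bool:
    """True if the PDF is new or its contents changed since the last ingest."""
    return load_manifest(manifest_path).get(str(pdf_path)) != file_sha256(pdf_path)


def record_ingest(pdf_path: Path, manifest_path: Path) -> None:
    """Store the current hash after a successful ingest."""
    manifest = load_manifest(manifest_path)
    manifest[str(pdf_path)] = file_sha256(pdf_path)
    manifest_path.write_text(json.dumps(manifest, indent=2))
```

Call needs_reingest before embedding and record_ingest after the chunks are safely stored. When a PDF has changed, you would also want to delete its old chunks from the vector store before adding the new ones.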

Going deeper

Once your basic PDF chat app works reliably, there are several directions you can take it. Each upgrade addresses a real limitation of the simple version.

Add metadata and page numbers to citations

Instead of storing plain text chunks, store them as Document objects with metadata — specifically the page number each chunk came from. When you retrieve chunks, include the page number in the context you pass to the LLM and ask it to cite page numbers in its answers. This makes the app much more trustworthy for professional use cases like legal or medical documents.

Store chunks with page-number metadata (Python)

from langchain_core.documents import Document
from pypdf import PdfReader

def extract_docs_with_metadata(pdf_path: str) -> list[Document]:
    reader = PdfReader(pdf_path)
    documents = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        if text.strip():
            documents.append(Document(
                page_content=text,
                metadata={"page": i + 1, "source": pdf_path}
            ))
    return documents
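
Once chunks carry a page number, the retrieval side needs to surface it. The sketch below shows one way to label the context and build a "Sources" line. To keep it runnable without LangChain it uses a small stand-in dataclass named `Chunk` (hypothetical) with the same page_content and metadata attributes as a Document; `format_context` and `sources_line` are also names invented here.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a LangChain Document so this sketch runs on its own."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs: list[Chunk]) -> str:
    """Prefix each retrieved chunk with its page number so the LLM can cite it."""
    parts = []
    for doc in docs:
        page = doc.metadata.get("page", "?")
        parts.append(f"[page {page}]\n{doc.page_content}")
    return "\n\n---\n\n".join(parts)


def sources_line(docs: list[Chunk]) -> str:
    """Build a short 'Sources' line listing the distinct pages that were retrieved."""
    pages = sorted({doc.metadata["page"] for doc in docs if "page" in doc.metadata})
    return "Sources: " + ", ".join(f"p. {p}" for p in pages)


retrieved = [
    Chunk("Liability is capped at the contract value.", {"page": 7}),
    Chunk("Either party may terminate with 30 days notice.", {"page": 3}),
]
context = format_context(retrieved)
print(sources_line(retrieved))  # Sources: p. 3, p. 7
```

With the page labels in the context, a system prompt line such as "cite the page number in brackets" gives the model something concrete to cite.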

Switch from vector-only to hybrid retrieval

Pure vector similarity search works well for semantic questions ("what does the document say about liability?") but can miss exact matches ("find section 4.2.1"). Hybrid retrieval combines vector similarity with keyword (BM25) search, then merges the results. LangChain's EnsembleRetriever lets you combine both retrievers with a few lines of configuration. This is worth adding once your basic retrieval is solid.
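
The "merge" step is the part that tends to feel mysterious. One common merging rule is reciprocal rank fusion (RRF), sketched below in plain Python. This is an illustration of the idea rather than a description of any particular library's internals; the chunk IDs are made up, and k=60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Items near the top of any list earn a larger share of the score.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


vector_ranking = ["c2", "c5", "c1"]   # hypothetical result of similarity search
keyword_ranking = ["c7", "c2", "c5"]  # hypothetical result of BM25 search
merged = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(merged)  # ['c2', 'c5', 'c7', 'c1']
```

Chunks that both retrievers agree on (c2 and c5 here) rise to the top, while a chunk found by only one retriever still makes the list.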

Reranking for higher precision

After retrieval you can run the candidate chunks through a reranker — a small model (Cohere's Rerank API or open-source cross-encoder/ms-marco-MiniLM) that scores each chunk against the question much more precisely than cosine similarity alone. A common pattern is to retrieve a wide set (k=20) cheaply with the vector store, then rerank to the top 4 before passing them to the LLM. This often improves answer quality noticeably without meaningfully increasing cost.
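
The retrieve-wide-then-rerank pattern can be sketched with a pluggable scoring function. In the example below, `overlap_score` is a toy word-overlap scorer standing in for a real reranker model, and `rerank` is a hypothetical helper name. In practice you would pass a function that calls your reranker of choice.

```python
import re
from typing import Callable


def overlap_score(question: str, chunk: str) -> float:
    """Toy relevance score: fraction of question words that appear in the chunk."""
    q_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    c_words = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float] = overlap_score,
    top_n: int = 4,
) -> list[str]:
    """Score every candidate against the question and keep the best top_n."""
    ranked = sorted(candidates, key=lambda c: score(question, c), reverse=True)
    return ranked[:top_n]


candidates = [
    "Payment is due within 30 days of invoice.",
    "The notice period for termination is 60 days.",
    "This agreement is governed by Dutch law.",
]
best = rerank("termination notice period", candidates, top_n=1)
print(best)
```

The vector store does the cheap first pass over the whole document; the scorer only ever sees the shortlist, which is what keeps the extra cost small.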

Scale to multiple documents

The same pipeline scales naturally to a folder of PDFs. Add a source field to your chunk metadata so the LLM can name which document it drew from. For larger collections (hundreds of documents), replace Chroma with a managed service like Pinecone or a database-backed option like pgvector (a PostgreSQL extension) so you don't have to keep the entire index in memory.

The upgrade path: basic PDF chat to production document Q&A

1. Single PDF, terminal UI: the version you just built
2. Streamlit / Gradio UI: file uploader, chat panel
3. Page citations + metadata: trust and verifiability
4. Hybrid retrieval + reranking: better answer quality
5. Multi-document + managed DB: production scale

Use a vision model for complex PDFs

Native text extraction falls apart on PDFs with complex layouts — multi-column text, embedded tables, diagrams with captions. Vision-language models (GPT-4o, Claude, Gemini) can read a PDF page as an image and extract structured text including tables. For a document-heavy production app, running each page through a vision model at ingest time is increasingly the recommended approach, even though it costs more than plain text extraction.

FAQ

Do I need to re-embed the PDF every time I restart the app?

No. Use Chroma's persist_directory option to save your vector store to disk. On subsequent runs, load it with Chroma(persist_directory=..., embedding_function=...) and skip the ingestion step entirely. Only re-embed when the PDF changes.

Why does my app give wrong answers even though the answer is in the document?

The most likely cause is that the relevant passage wasn't retrieved — either the chunk containing the answer scored lower than unrelated chunks, or the answer spans a chunk boundary. Try increasing k (retrieve more chunks), reducing chunk size, or adding chunk overlap. Also verify the passage is actually being extracted from the PDF by printing raw text before chunking.
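
A quick way to narrow this down is to take a passage you know contains the answer and check where it disappears. The sketch below is a hypothetical diagnostic helper (`locate_failure` is not part of any library); it compares text after normalizing case and whitespace, which is an assumption that suits most native PDFs.

```python
def locate_failure(
    expected: str, raw_text: str, chunks: list[str], retrieved: list[str]
) -> str:
    """Report the first pipeline stage where a known answer passage goes missing."""

    def norm(s: str) -> str:
        return " ".join(s.lower().split())  # ignore case and whitespace differences

    needle = norm(expected)
    if needle not in norm(raw_text):
        return "extraction"  # the extracted PDF text never contained the passage
    if not any(needle in norm(c) for c in chunks):
        return "chunking"    # the passage was split across a chunk boundary
    if not any(needle in norm(c) for c in retrieved):
        return "retrieval"   # the right chunk exists but was not in the top-k
    return "generation"      # the LLM saw the passage; inspect the prompt


raw_text = "The fee is 5 percent. Late fees apply after 30 days."
chunks = ["The fee is 5 percent.", "Late fees apply after 30 days."]
retrieved = ["Late fees apply after 30 days."]
print(locate_failure("fee is 5 percent", raw_text, chunks, retrieved))  # retrieval
```

Each result points at a different fix: "extraction" suggests OCR or a better parser, "chunking" suggests more overlap, "retrieval" suggests a larger k or hybrid search, and "generation" suggests tightening the system prompt.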

How much does it cost to run a chat-with-PDF app using the OpenAI API?

Indexing a 50-page PDF with text-embedding-3-small typically costs under $0.01. Each question answered with gpt-4o-mini costs roughly $0.001-0.003 depending on how much context you retrieve. For personal or low-volume use, monthly costs are usually under a dollar.
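
To see how those per-unit figures add up over a month, here is a small worked estimate. It uses only the numbers quoted in this answer (under $0.01 per indexed PDF, $0.001-0.003 per question); the workload of 10 PDFs and 200 questions is a made-up example, and `monthly_cost_range` is a hypothetical helper.

```python
def monthly_cost_range(
    pdfs_indexed: int,
    questions: int,
    index_cost_max: float = 0.01,  # upper bound per 50-page PDF, from the answer above
    per_question: tuple[float, float] = (0.001, 0.003),  # per-answer range, from above
) -> tuple[float, float]:
    """Return a (low, high) monthly estimate in USD from the per-unit figures."""
    low = questions * per_question[0]  # treats indexing as negligible
    high = pdfs_indexed * index_cost_max + questions * per_question[1]
    return low, high


# Hypothetical personal workload: 10 PDFs indexed and 200 questions in a month.
low, high = monthly_cost_range(pdfs_indexed=10, questions=200)
print(f"${low:.2f} to ${high:.2f} per month")  # $0.20 to $0.70 per month
```

Prices change, so treat the defaults as placeholders and plug in current rates before budgeting for anything beyond personal use.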

Can I build this without OpenAI — using a local or open-source model?

Yes. Replace OpenAIEmbeddings with HuggingFaceEmbeddings (using a model like sentence-transformers/all-MiniLM-L6-v2 from the langchain-huggingface package) and replace the OpenAI chat call with a local model via Ollama. The pipeline structure stays identical — only the model providers change.

My PDF is scanned and pypdf returns empty text. What do I do?

Scanned PDFs have no embedded text layer — they're images. You need OCR. Install Tesseract and pytesseract, then render each PDF page to an image with PyMuPDF (fitz.open(pdf_path)[page_num].get_pixmap()) and pass the image to pytesseract.image_to_string(). Alternatively, use a vision-capable LLM to extract text from the page image.

How do I add source citations so users can verify the answer?

Store each chunk as a LangChain Document (langchain_core.documents.Document) with metadata={"page": page_number, "source": filename}. When you retrieve chunks for a question, include the page numbers in the context you send to the LLM and ask it to cite them. You can also display the raw chunk text beneath the answer as a "Sources" section in your UI.
