PaperQA

High-accuracy, citation-backed RAG over scientific papers and documents

github.com/Future-House/paper-qa (★ 8.7k) | futurehouse.gitbook.io/futurehouse-cookbook

Overview

PaperQA2 is a Python library for retrieval-augmented generation (RAG) on PDFs, text files, Microsoft Office documents, and source code, with a focus on scientific literature. It answers questions with grounded responses that include in-text citations pointing back to the exact pages they came from.
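
To make the idea of page-level citations concrete, here is a minimal sketch of how an answer could carry its evidence. The names `Evidence`, `format_citation`, and `render_answer`, as well as the citation format itself, are assumptions made for illustration and are not PaperQA2's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One supporting passage (hypothetical structure, not PaperQA2's own)."""

    doc_key: str      # short citation key for the source document
    first_page: int   # first page the passage appears on
    last_page: int    # last page the passage appears on
    text: str         # the passage itself


def format_citation(ev: Evidence) -> str:
    """Render an in-text citation such as "(key pages 3-4)"."""
    if ev.first_page == ev.last_page:
        return f"({ev.doc_key} page {ev.first_page})"
    return f"({ev.doc_key} pages {ev.first_page}-{ev.last_page})"


def render_answer(claims: list[tuple[str, Evidence]]) -> str:
    """Join claims into one answer, attaching a citation to every claim."""
    return " ".join(f"{claim} {format_citation(ev)}." for claim, ev in claims)
```

Keeping the page range next to each claim is what lets a reader open the source and check the statement directly.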

It is built for researchers, scientists, and developers who need answers they can trust and check, rather than unattributed summaries. It fetches paper metadata from providers like Semantic Scholar, Crossref, and Unpaywall, including citation counts and retraction checks, and builds a local full-text search index over your document folder.
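
As a rough illustration of what a local full-text index does, the sketch below builds a toy inverted index over in-memory documents. It is not how PaperQA2 implements its index; `tokenize`, `build_index`, and `search` are hypothetical helpers, and real indexes add ranking, stemming, and on-disk storage.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of document names that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in docs.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return, sorted by name, the documents containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # .get() avoids creating empty entries in the defaultdict during lookup.
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)
```

A query is answered by intersecting the posting sets of its tokens, so lookups stay fast even when the folder grows.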

As a RAG framework, PaperQA2 supports an agentic workflow where a language agent can iteratively refine its queries and answers. By default it uses OpenAI embeddings and models with a NumPy vector store, but it works with any LiteLLM-supported provider, so you can swap in other closed- or open-source models.
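
For intuition about what a simple in-memory vector store does, here is a sketch that ranks stored passages by cosine similarity to a query vector. It uses plain Python lists rather than NumPy so it runs without dependencies, and `InMemoryVectorStore` is a hypothetical class, not PaperQA2's store. In practice the vectors would come from an embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Toy store: keeps (text, vector) pairs in a list and scans all of them."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str, vector: list[float]) -> None:
        """Store a passage together with its embedding vector."""
        self._items.append((text, vector))

    def top_k(self, query: list[float], k: int = 2) -> list[str]:
        """Return the k stored texts most similar to the query vector."""
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]
```

A full scan like this is adequate for a folder of papers; dedicated vector databases become worthwhile mainly at much larger scales.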

What it does

Grounded answers with in-text citations tied to specific document pages
Works with PDFs, text files, Microsoft Office documents, and source code
Automatic paper metadata fetching, including citation counts and retraction checks, from multiple providers
Local full-text search index over your own repository of files
Agentic RAG: a language agent can iteratively refine queries and answers
Model flexibility through LiteLLM, with default OpenAI embeddings and a NumPy vector store
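
The agentic item in the list above can be pictured as a loop that alternates searching, answering, and query rewriting. The sketch below shows only that control flow; `agentic_answer` and its three callables (`search`, `answer`, `refine`) are hypothetical stand-ins for LLM-backed tools and do not reflect PaperQA2's actual agent.

```python
from typing import Callable


def agentic_answer(
    question: str,
    search: Callable[[str], list[str]],
    answer: Callable[[str, list[str]], tuple[str, bool]],
    refine: Callable[[str, list[str]], str],
    max_steps: int = 3,
) -> tuple[str, list[str]]:
    """Alternate searching and answering until the answer is judged sufficient.

    Returns the final draft answer and the list of queries that were tried.
    """
    query = question
    evidence: list[str] = []
    queries: list[str] = []
    draft = ""
    for _ in range(max_steps):
        queries.append(query)
        # Gather new passages, skipping ones already collected.
        for passage in search(query):
            if passage not in evidence:
                evidence.append(passage)
        draft, sufficient = answer(question, evidence)
        if sufficient:
            break
        # Not enough evidence yet: let the agent rewrite the query.
        query = refine(question, evidence)
    return draft, queries
```

The `max_steps` cap is a design choice in this sketch: it bounds cost when the evidence never becomes sufficient.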

Getting started

Install the package, drop some papers in a folder, then ask a question from the command line. PaperQA2 uses OpenAI models by default, so set your OpenAI API key in your environment first.

Install PaperQA2

Install from PyPI with pip. It requires a supported Python version (see the PyPI badge in the README).

bash

pip install paper-qa

Add papers and ask a question

Put your PDFs in a folder, move into it, then use the pqa CLI to ask a question. PaperQA2 fetches metadata, indexes the documents, and answers with citations.

bash

mkdir my_papers
curl -o my_papers/PaperQA2.pdf https://arxiv.org/pdf/2409.13740
cd my_papers
pqa ask 'What is PaperQA2?'
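
The same `pqa ask` step can be scripted from Python by calling the CLI through `subprocess`. This is a minimal sketch: `build_command` and `ask_from_folder` are hypothetical helper names, and the check for `OPENAI_API_KEY` assumes the default OpenAI setup described above.

```python
import os
import shutil
import subprocess
from pathlib import Path


def build_command(question: str) -> list[str]:
    """Build the argument list for the `pqa ask` command."""
    return ["pqa", "ask", question]


def ask_from_folder(question: str, folder: Path) -> int:
    """Run `pqa ask` inside `folder`.

    Returns the CLI's exit code, or -1 if a precondition is not met.
    """
    if shutil.which("pqa") is None:
        print("pqa not found; install it with: pip install paper-qa")
        return -1
    if not os.environ.get("OPENAI_API_KEY"):
        print("OPENAI_API_KEY is not set; the default OpenAI models need it")
        return -1
    # cwd=folder mirrors `cd my_papers` in the shell version.
    result = subprocess.run(build_command(question), cwd=folder, check=False)
    return result.returncode
```

Calling `ask_from_folder("What is PaperQA2?", Path("my_papers"))` would then behave like the shell commands, with the answer printed by the CLI itself.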

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

Asking questions across a folder of research papers and getting answers with verifiable page-level citations
Summarizing scientific literature for a review while keeping each claim traceable to its source
Detecting contradictions between papers in a collection
Building a local, searchable knowledge base over your own PDFs and documents, with the search index kept on your own machine

How PaperQA compares

PaperQA alongside the other open-source RAG frameworks and platforms that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Dify	★ 146k	An open-source platform with a visual workflow builder for creating LLM and RAG applications without writing much code.
RAGFlow	★ 83.2k	A RAG engine built around deep document understanding that turns complex files into a grounded, citation-backed question-answering layer.
Context7	★ 57.7k	Context7 pulls current, version-specific documentation and code examples for any library and feeds them into your LLM, available as a CLI skill or an MCP server.
Quivr	★ 39.2k	Quivr is an open-source RAG framework that ingests your documents and answers questions about them, working with any LLM and any file type.
LightRAG	★ 36.8k	A graph-based RAG system that builds an entity-and-relationship knowledge graph for fast retrieval and easy incremental updates.
GraphRAG	★ 33.9k	Microsoft's graph-based RAG system that extracts a knowledge graph from documents to answer broad, multi-document questions.
PageIndex	★ 33.2k	PageIndex turns long PDFs into a table-of-contents tree and uses LLM reasoning to retrieve relevant sections, with no vector database and no chunking.
PaperQA	★ 8.7k	High-accuracy, citation-backed RAG over scientific papers and documents.
