AI/TLDR

PaperQA

High-accuracy, citation-backed RAG over scientific papers and documents

Overview

PaperQA2 is a Python library for retrieval-augmented generation (RAG) over PDFs, text files, Microsoft Office documents, and source code, with a focus on scientific literature. It answers questions with grounded responses whose in-text citations point back to the exact pages they came from.

It is built for researchers, scientists, and developers who need answers they can trust and check, rather than unattributed summaries. It fetches paper metadata from providers like Semantic Scholar, Crossref, and Unpaywall, including citation counts and retraction checks, and builds a local full-text search index over your document folder.

As a RAG framework, PaperQA2 supports an agentic workflow where a language agent can iteratively refine its queries and answers. By default it uses OpenAI embeddings and models with a NumPy vector store, but it works with any LiteLLM-supported provider, so you can swap in other closed- or open-source models.
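As a rough illustration of the model swap described above, the sketch below routes the answer, summary, and agent roles to a single LiteLLM model string. The `llm`, `summary_llm`, `embedding`, and `paper_directory` settings and `AgentSettings(agent_llm=...)` follow the project README at the time of writing and may differ between releases. `model_settings_kwargs` and `ask_with_model` are hypothetical helper names, and the model string in the usage comment is only an example.

```python
from typing import Optional


def model_settings_kwargs(model: str, embedding: Optional[str] = None) -> dict:
    """Build keyword arguments that point the answer and summary LLMs at one model.

    Hypothetical helper: it only assembles a dictionary and calls no model.
    """
    kwargs = {"llm": model, "summary_llm": model}
    if embedding is not None:
        # Override the default OpenAI embedding model only when asked to.
        kwargs["embedding"] = embedding
    return kwargs


def ask_with_model(question: str, model: str, paper_directory: str,
                   embedding: Optional[str] = None):
    """Ask a question with a non-default model (assumed API, see lead-in)."""
    # Imported lazily so the helper above works without paper-qa installed.
    from paperqa import Settings, ask
    from paperqa.settings import AgentSettings

    settings = Settings(
        paper_directory=paper_directory,
        agent=AgentSettings(agent_llm=model),  # the agent uses the same model
        **model_settings_kwargs(model, embedding),
    )
    return ask(question, settings=settings)


# Example call (needs paper-qa installed and the provider's API key set):
# ask_with_model("What is PaperQA2?", "claude-3-5-sonnet-20240620", "my_papers")
```

Keeping the overrides in a plain dictionary makes them easy to inspect or log before any model is called. When moving off OpenAI entirely, the embedding model likely needs to change as well, since the default embeddings are OpenAI's.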

What it does

  • Grounded answers with in-text citations tied to specific document pages
  • Works with PDFs, text files, Microsoft Office documents, and source code
  • Automatic paper metadata fetching, including citation counts and retraction checks, from multiple providers
  • Local full-text search index over your own repository of files
  • Agentic RAG: a language agent can iteratively refine queries and answers
  • Model flexibility through LiteLLM, with default OpenAI embeddings and a NumPy vector store

Getting started

Install the package, drop some papers in a folder, then ask a question from the command line. By default PaperQA2 uses OpenAI models, so set your OpenAI API key in the environment (the OPENAI_API_KEY variable) first.
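A small preflight check can catch a missing key before any documents are indexed. The sketch below is only an illustration: `missing_api_key` is a hypothetical helper, and it assumes the key is supplied through the standard `OPENAI_API_KEY` environment variable.

```python
import os
from typing import Mapping, Optional


def missing_api_key(env: Optional[Mapping[str, str]] = None,
                    name: str = "OPENAI_API_KEY") -> bool:
    """Return True when the variable is unset or contains only whitespace."""
    env = os.environ if env is None else env
    return not env.get(name, "").strip()


# Example use before running the CLI or library:
# if missing_api_key():
#     raise SystemExit("Set OPENAI_API_KEY before running pqa")
```

Passing the environment as an argument keeps the check easy to test; with a different provider, the variable name would change accordingly.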

Install PaperQA2

Install from PyPI with pip. It requires a supported Python version (see the PyPI badge in the README).

bash
pip install paper-qa

Add papers and ask a question

Put your PDFs in a folder, move into it, then use the pqa CLI to ask a question. PaperQA2 fetches metadata, indexes the documents, and answers with citations.

bash
mkdir my_papers
curl -o my_papers/PaperQA2.pdf https://arxiv.org/pdf/2409.13740
cd my_papers
pqa ask 'What is PaperQA2?'
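The same question can likely be asked from Python rather than the CLI. The sketch below assumes the `ask` function and the `Settings(temperature=..., paper_directory=...)` arguments shown in the project README, which may change between releases. `list_pdfs` and `ask_papers` are hypothetical helper names, and the shape of the returned object is not assumed here.

```python
from pathlib import Path
from typing import List


def list_pdfs(paper_directory: str) -> List[str]:
    """Return the sorted PDF file names found directly inside the folder."""
    return sorted(p.name for p in Path(paper_directory).glob("*.pdf"))


def ask_papers(question: str, paper_directory: str = "my_papers"):
    """Ask a question over a folder of papers (assumed API, see lead-in)."""
    if not list_pdfs(paper_directory):
        raise FileNotFoundError(f"No PDFs found in {paper_directory!r}")
    # Imported lazily so list_pdfs works without paper-qa installed.
    from paperqa import Settings, ask

    settings = Settings(temperature=0.5, paper_directory=paper_directory)
    return ask(question, settings=settings)


# Example call (needs paper-qa installed and an API key in the environment):
# response = ask_papers("What is PaperQA2?")
```

The folder check covers only PDFs for brevity; a folder of text or Office files would need a broader pattern.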

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

  • Asking questions across a folder of research papers and getting answers with verifiable page-level citations
  • Summarizing scientific literature for a review while keeping each claim traceable to its source
  • Detecting contradictions between papers in a collection
  • Building a local, searchable knowledge base over your own PDFs and documents; the index stays on your machine, and pairing it with a locally hosted model avoids sending document text to an external service

How PaperQA compares

PaperQA alongside the other open-source RAG frameworks and platforms that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| Dify | ★ 146k | An open-source platform with a visual workflow builder for creating LLM and RAG applications without writing much code. |
| RAGFlow | ★ 83.2k | A RAG engine built around deep document understanding that turns complex files into a grounded, citation-backed question-answering layer. |
| Context7 | ★ 57.7k | Context7 pulls current, version-specific documentation and code examples for any library and feeds them into your LLM, available as a CLI skill or an MCP server. |
| Quivr | ★ 39.2k | Quivr is an open-source RAG framework that ingests your documents and answers questions about them, working with any LLM and any file type. |
| LightRAG | ★ 36.8k | A graph-based RAG system that builds an entity-and-relationship knowledge graph for fast retrieval and easy incremental updates. |
| GraphRAG | ★ 33.9k | Microsoft's graph-based RAG system that extracts a knowledge graph from documents to answer broad, multi-document questions. |
| PageIndex | ★ 33.2k | PageIndex turns long PDFs into a table-of-contents tree and uses LLM reasoning to retrieve relevant sections, with no vector database and no chunking. |
| PaperQA | ★ 8.7k | High-accuracy, citation-backed RAG over scientific papers and documents. |