Overview
Ragas is an open-source Python toolkit for evaluating Large Language Model (LLM) applications. It replaces slow, subjective manual reviews with repeatable, metric-based checks, scoring outputs with both LLM-based judges and traditional metrics so you can measure quality with numbers instead of gut feeling.
It is built for developers and teams shipping RAG systems and other LLM-backed features who need a way to tell whether a change made their app better or worse. As an evaluation framework, it pairs scoring metrics with automatic test-data generation, so you can run checks even when you don't have a hand-labeled dataset ready.
Ragas integrates with LLM frameworks such as LangChain and with major observability platforms, and it can use production data to build feedback loops that guide ongoing improvements.
What it does
- Pre-built metrics for common evaluation tasks, combining LLM-based scoring with traditional metrics
- Custom aspect evaluators such as DiscreteMetric to score any quality dimension you define
- Automatic test-dataset generation when you don't have ground-truth labels ready
- A ragas quickstart CLI that scaffolds a complete RAG evaluation project
- Integrations with LLM frameworks like LangChain and major observability tools
- Anonymized, opt-out usage analytics (set RAGAS_DO_NOT_TRACK=true to disable)
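The opt-out in the last bullet is an ordinary environment variable, so it can be set in the shell or from Python. A minimal sketch of the Python route, assuming the variable needs to be in place before Ragas is imported; the helper name `disable_ragas_tracking` is ours and not part of Ragas:

```python
import os


def disable_ragas_tracking() -> None:
    """Opt out of Ragas usage analytics for this process.

    Assumption: the variable should be set before `import ragas` runs,
    so call this at the very top of your entry point.
    """
    os.environ["RAGAS_DO_NOT_TRACK"] = "true"


disable_ragas_tracking()
print(os.environ["RAGAS_DO_NOT_TRACK"])  # prints "true"
```

Setting the variable in the shell or in your CI configuration has the same effect and avoids any import-order concerns.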
Getting started
Install Ragas from PyPI, scaffold an example project, then write a metric to score your app's output. Set your OPENAI_API_KEY before running the evaluation snippet.
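Because the evaluation snippet depends on OPENAI_API_KEY, a small preflight check can stop early with a readable message instead of failing inside the first LLM call. A minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper, not a Ragas function:

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Hypothetical preflight step to run before any evaluation code.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Set these before running the evaluation:", ", ".join(missing))
else:
    print("Environment looks ready.")
```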
Install from PyPI
Install the package with pip. You can also install the latest source directly from GitHub.
    pip install ragas
Scaffold an example project
Use the ragas quickstart command to list templates or generate a ready-made RAG evaluation project.
    # List available templates
    ragas quickstart
    # Create a RAG evaluation project
    ragas quickstart rag_eval
Score an output with a metric
Define a DiscreteMetric, point it at an LLM, and score a response. Make sure OPENAI_API_KEY is set in your environment.
    import asyncio

    from openai import AsyncOpenAI
    from ragas.llms import llm_factory
    from ragas.metrics import DiscreteMetric

    # The client reads OPENAI_API_KEY from the environment.
    client = AsyncOpenAI()
    llm = llm_factory("gpt-4o", client=client)

    # Custom evaluator that must answer with one of the allowed labels.
    metric = DiscreteMetric(
        name="summary_accuracy",
        allowed_values=["accurate", "inaccurate"],
        prompt="""Evaluate if the summary is accurate and captures key information.
    Response: {response}
    Answer with only 'accurate' or 'inaccurate'.""",
    )

    async def main():
        # Score a single response with the judge LLM.
        score = await metric.ascore(
            llm=llm,
            response="The summary of the text is...",
        )
        print(f"Score: {score.value}")  # 'accurate' or 'inaccurate'
        print(f"Reason: {score.reason}")

    if __name__ == "__main__":
        asyncio.run(main())
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
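The snippet above scores one response. For a set of responses, the same async call can be fanned out and the labels tallied. The sketch below shows only that batching pattern: `StubMetric` and `StubResult` are hypothetical stand-ins invented so the code runs offline, and they are not Ragas classes. With a real metric you would also pass `llm=` as in the snippet above.

```python
import asyncio
from collections import Counter
from dataclasses import dataclass


@dataclass
class StubResult:
    """Stand-in for a metric result: a label plus a reason."""
    value: str
    reason: str


class StubMetric:
    """Hypothetical offline scorer; NOT part of Ragas.

    It exposes an async `ascore(response=...)` so the batching logic
    can run without network access.
    """

    async def ascore(self, response: str) -> StubResult:
        await asyncio.sleep(0)  # yield control, as a real API call would
        if "revenue" in response.lower():
            return StubResult("accurate", "mentions the key figure")
        return StubResult("inaccurate", "key figure missing")


async def score_all(metric, responses):
    """Score every response concurrently; results keep input order."""
    return await asyncio.gather(*(metric.ascore(response=r) for r in responses))


def tally(results):
    """Count how often each label occurs."""
    return Counter(r.value for r in results)


responses = [
    "Revenue grew 12% year over year.",
    "The company did some things.",
    "Q3 revenue was flat.",
]
results = asyncio.run(score_all(StubMetric(), responses))
print(dict(tally(results)))  # {'accurate': 2, 'inaccurate': 1}
```

When calling a real LLM, unbounded concurrency can hit provider rate limits, so an `asyncio.Semaphore` around each call is a common addition.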
When to use it
- Measure RAG answer quality after changing your retriever, chunking, or prompt to confirm the change actually helped
- Generate a test dataset automatically when you don't have hand-labeled ground-truth data
- Add LLM-output checks to a CI pipeline so regressions are caught before release
- Build feedback loops from production data to track and improve an LLM app over time
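The first and third use cases come down to comparing scores before and after a change and failing the build when quality drops. A minimal sketch of such a gate, assuming you already have one label per test question from a metric like the one above; the function names, the sample labels, and the 0.05 tolerance are all illustrative choices and not Ragas features:

```python
def pass_rate(labels, passing="accurate"):
    """Fraction of labels equal to `passing`; 0.0 for an empty list."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == passing) / len(labels)


def regression_gate(baseline, candidate, max_drop=0.05):
    """Return (ok, delta), where delta = candidate rate - baseline rate.

    `ok` is False when the candidate falls more than `max_drop` below
    the baseline. The 0.05 default is an arbitrary illustration.
    """
    delta = pass_rate(candidate) - pass_rate(baseline)
    return delta >= -max_drop, delta


# Made-up labels for the same four test questions, before and after a change.
baseline = ["accurate", "accurate", "inaccurate", "accurate"]
candidate = ["accurate", "inaccurate", "inaccurate", "accurate"]

ok, delta = regression_gate(baseline, candidate)
print(f"delta={delta:+.2f} ok={ok}")  # delta=-0.25 ok=False
```

In a CI job, the script would typically end with a non-zero exit code when `ok` is False so the pipeline stops before release.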
How Ragas compares
Ragas alongside the other open-source evaluation and red-teaming tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Strix | ★ 26.1k | Strix runs autonomous AI agents that act like hackers, dynamically running your code to find vulnerabilities and validate them with real proof-of-concepts. |
| promptfoo | ★ 22.4k | A developer-first CLI and library for testing and comparing prompts and models, with red-teaming probes for prompt injection, PII leaks, and other vulnerabilities. |
| OpenAI Evals | ★ 18.7k | A framework and open registry for building and running evaluations of LLMs and LLM-based systems, including prompt chains and tool-using agents. |
| DeepEval | ★ 16.3k | An open-source Python framework that tests LLM apps like unit tests, with 50+ metrics for RAG, agents, chatbots, and safety, and a Pytest integration for CI/CD. |
| Ragas | ★ 14.4k | Metric-driven evaluation and test-set generation for LLM and RAG applications. |
| Arize Phoenix | ★ 10.2k | An open-source observability and evaluation tool for tracing LLM and agent behavior, running evals on traces, and troubleshooting issues in development and production. |
| garak | ★ 8.2k | An LLM vulnerability scanner from NVIDIA with 100+ attack probes that test models for prompt injection, data leakage, jailbreaks, and other security weaknesses. |
| Giskard | ★ 5.4k | An open-source library for testing and scanning LLM and ML models for issues like hallucination, bias, and toxicity, including multi-turn agent testing and a vulnerability scanner. |