Ragas

Metric-driven evaluation and test-set generation for LLM and RAG applications

GitHub: github.com/vibrantlabsai/ragas (★ 14.4k)
Docs: docs.ragas.io

Overview

Ragas is an open-source Python toolkit for evaluating Large Language Model (LLM) applications. It replaces slow, subjective manual reviews with repeatable, metric-based checks, scoring outputs with both LLM-based judges and traditional metrics so you can measure quality with numbers instead of gut feeling.

It is built for developers and teams shipping RAG systems and other LLM-backed features who need a way to tell whether a change made their app better or worse. As an evaluation framework, it pairs scoring metrics with automatic test-data generation, so you can run checks even when you don't have a hand-labeled dataset ready.

Ragas integrates with LLM frameworks such as LangChain and with observability platforms, and it can use production data to build feedback loops that guide ongoing improvements.

What it does

Pre-built metrics for common evaluation tasks, combining LLM-based scoring with traditional metrics
Custom aspect evaluators such as DiscreteMetric to score any quality dimension you define
Automatic test-dataset generation when you don't have ground-truth labels ready
A ragas quickstart CLI that scaffolds a complete RAG evaluation project
Integrations with LLM frameworks like LangChain and major observability tools
Anonymized, opt-out usage analytics (set RAGAS_DO_NOT_TRACK=true to disable)
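
If opting out from Python is more convenient than from the shell, one option is to set the variable at the top of your entry script. This is a minimal sketch: the helper name disable_ragas_tracking is hypothetical, and it assumes Ragas reads RAGAS_DO_NOT_TRACK from the process environment, which is why it runs before any ragas import.

```python
import os


def disable_ragas_tracking() -> None:
    """Set the opt-out flag named in the feature list for this process."""
    # Assumption: Ragas reads this variable from the process environment,
    # so it should be set before `import ragas` runs anywhere.
    os.environ["RAGAS_DO_NOT_TRACK"] = "true"


disable_ragas_tracking()
print(os.environ["RAGAS_DO_NOT_TRACK"])
```

Exporting the variable in your shell profile or CI configuration has the same effect and avoids any dependence on import order.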

Getting started

Install Ragas from PyPI, scaffold an example project, then write a metric to score your app's output. Set your OPENAI_API_KEY before running the evaluation snippet.

Install from PyPI

Install the package with pip. You can also install the latest source directly from GitHub.

bash

pip install ragas
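
To confirm the install landed in the interpreter you plan to use, a quick check with the standard library can help. This is a sketch rather than an official Ragas step: the helper name installed_version is hypothetical, and it only assumes the distribution is published under the name ragas, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "ragas") -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
print(f"ragas {found}" if found else "ragas is not installed in this environment")
```

Running this inside the same virtual environment as your evaluation scripts catches the common case where pip installed into a different interpreter.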

Scaffold an example project

Use the ragas quickstart command to list templates or generate a ready-made RAG evaluation project.

bash

# List available templates
ragas quickstart

# Create a RAG evaluation project
ragas quickstart rag_eval

Score an output with a metric

Define a DiscreteMetric, point it at an LLM, and score a response. Make sure OPENAI_API_KEY is set in your environment.

python

import asyncio
from openai import AsyncOpenAI
from ragas.metrics import DiscreteMetric
from ragas.llms import llm_factory

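# Set up the LLM; AsyncOpenAI() reads OPENAI_API_KEY from the environment
client = AsyncOpenAI()
llm = llm_factory("gpt-4o", client=client)

# Define a custom evaluator that returns one of the allowed labels
metric = DiscreteMetric(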
    name="summary_accuracy",
    allowed_values=["accurate", "inaccurate"],
    prompt="""Evaluate if the summary is accurate and captures key information.

Response: {response}

Answer with only 'accurate' or 'inaccurate'."""
)

# Score one output from your application
async def main():
    score = await metric.ascore(
        llm=llm,
        response="The summary of the text is..."
    )
    print(f"Score: {score.value}")  # 'accurate' or 'inaccurate'
    print(f"Reason: {score.reason}")

if __name__ == "__main__":
    asyncio.run(main())

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.
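
The snippet above scores a single response. To score several, one approach is to fan the same ascore call out with asyncio.gather. The sketch below is not taken from the Ragas docs: score_all, StubMetric, and StubResult are hypothetical names, and the stub stands in for a real DiscreteMetric so the example runs offline without an API key. It assumes only what the snippet above shows, namely that ascore accepts llm and response and returns an object with value and reason.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical stand-in for the result object (has .value and .reason)."""
    value: str
    reason: str


class StubMetric:
    """Hypothetical offline replacement for a DiscreteMetric, for dry runs only."""

    async def ascore(self, llm, response: str) -> StubResult:
        await asyncio.sleep(0)  # yield control, as a real network call would
        label = "accurate" if response.strip() else "inaccurate"
        return StubResult(value=label, reason="stub: non-empty check only")


async def score_all(metric, llm, responses):
    """Score every response concurrently, preserving input order."""
    tasks = [metric.ascore(llm=llm, response=r) for r in responses]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    results = asyncio.run(
        score_all(StubMetric(), llm=None, responses=["A summary.", ""])
    )
    for result in results:
        print(result.value, "-", result.reason)
```

With a real metric and LLM, gather sends all requests at once, so a large batch may hit provider rate limits; wrapping each call in an asyncio.Semaphore is one way to cap concurrency.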

When to use it

Measure RAG answer quality after changing your retriever, chunking, or prompt to confirm the change actually helped
Generate a test dataset automatically when you don't have hand-labeled ground-truth data
Add LLM-output checks to a CI pipeline so regressions are caught before release
Build feedback loops from production data to track and improve an LLM app over time
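
The before/after comparison and the CI check in the list above can share one small gate. The sketch below uses plain Python and makes no Ragas calls: pass_rate and check_regression are hypothetical helpers, the 0.8 floor and 0.05 tolerance are arbitrary illustration values, and the label lists are made-up data standing in for the value field of real metric results.

```python
from collections import Counter


def pass_rate(labels, passing="accurate"):
    """Fraction of labels equal to the passing value (0.0 for an empty run)."""
    if not labels:
        return 0.0
    return Counter(labels)[passing] / len(labels)


def check_regression(baseline, candidate, min_rate=0.8, tolerance=0.05):
    """Return (ok, message); ok is False if the candidate run fails the gate."""
    base, cand = pass_rate(baseline), pass_rate(candidate)
    if cand < min_rate:
        return False, f"pass rate {cand:.2f} is below the floor {min_rate:.2f}"
    if cand < base - tolerance:
        return False, f"pass rate fell from {base:.2f} to {cand:.2f}"
    return True, f"pass rate {cand:.2f} (baseline {base:.2f})"


# Made-up labels for illustration: one run before a change, one after.
baseline = ["accurate"] * 9 + ["inaccurate"]
candidate = ["accurate"] * 8 + ["inaccurate"] * 2
ok, message = check_regression(baseline, candidate)
print("PASS" if ok else "FAIL", "-", message)
```

In a CI job, exiting with a non-zero status when ok is False (for example via sys.exit(1)) is enough to fail the build. Because LLM-judged labels can vary between runs, a tolerance above zero keeps the gate from flapping on noise.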

How Ragas compares

Ragas alongside other open-source evaluation & red-teaming tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Strix	★ 26.1k	Strix runs autonomous AI agents that act like hackers, dynamically running your code to find vulnerabilities and validate them with real proof-of-concepts.
promptfoo	★ 22.4k	A developer-first CLI and library for testing and comparing prompts and models, with red-teaming probes for prompt injection, PII leaks, and other vulnerabilities.
OpenAI Evals	★ 18.7k	A framework and open registry for building and running evaluations of LLMs and LLM-based systems, including prompt chains and tool-using agents.
DeepEval	★ 16.3k	An open-source Python framework that tests LLM apps like unit tests, with 50+ metrics for RAG, agents, chatbots, and safety, and a Pytest integration for CI/CD.
Ragas	★ 14.4k	Metric-driven evaluation and test-set generation for LLM and RAG applications
Arize Phoenix	★ 10.2k	An open-source observability and evaluation tool for tracing LLM and agent behavior, running evals on traces, and troubleshooting issues in development and production.
garak	★ 8.2k	An LLM vulnerability scanner from NVIDIA with 100+ attack probes that test models for prompt injection, data leakage, jailbreaks, and other security weaknesses.
Giskard	★ 5.4k	An open-source library for testing and scanning LLM and ML models for issues like hallucination, bias, and toxicity, including multi-turn agent testing and a vulnerability scanner.
