What Is Ragas? Metric-Driven RAG Evaluation

You will understand what Ragas measures, the core RAG metrics it computes, and where it fits in an evaluation workflow.

INTERMEDIATE | 11 MIN READ | UPDATED 2026-06-14

Docs: docs.ragas.io | GitHub: explodinggradients/ragas (14.4k)

In plain English

You built a RAG system: it retrieves documents and an LLM writes an answer from them. It works on the three questions you tried. But does it work on the other three hundred? Did the retriever fetch the right passages? Did the model stick to them, or quietly make something up? Eyeballing a handful of answers won't tell you, and reading hundreds by hand doesn't scale.

Ragas is an open-source Python library that grades a RAG or LLM application automatically. You hand it the questions, the chunks your retriever returned, the answers your system produced, and (optionally) the correct answers. Ragas runs a set of metrics over that data and gives you numbers: how grounded the answers are, how relevant they are, and how good the retrieval was. The name is a contraction of Retrieval-Augmented Generation Assessment.

Think of Ragas as an automated examiner for an open-book test. A student (your LLM) was given some reference pages (the retrieved chunks) and asked a question. The examiner checks three things: Did you actually answer the question that was asked? Is every claim you made supported by the pages you were given, or did you invent some? And were the right pages even pulled for you in the first place? Ragas turns each of those judgments into a score you can track.

Why it matters

Every RAG system has two places it can fail, and they fail independently. The retriever can pull the wrong chunks, or the generator can ignore good chunks and hallucinate anyway. A single "is the final answer correct?" check can't tell those two apart — and if you don't know which half broke, you don't know what to fix. Ragas exists to break the quality of a RAG pipeline into separate, measurable pieces so you can debug it like an engineer instead of guessing.

The problems it solves

"It looked fine" is not a test. Manually checking a few outputs feels like evaluation but isn't. Ragas lets you score an entire test set in one run, so a change either improves the numbers or it doesn't.
Catching regressions. You tweak a prompt, swap an embedding model, or change your chunk size — and three other things silently get worse. Re-running Ragas on a fixed golden dataset turns that invisible drift into a visible drop in a metric.
Pinpointing the broken stage. Separate retrieval metrics and generation metrics tell you whether to spend your next day improving the retriever or the prompt. That's the difference between a targeted fix and a week of flailing.
Measuring hallucination directly. Its faithfulness metric scores how much of an answer is actually backed by the retrieved context — a concrete, repeatable number for the thing RAG is supposed to prevent.
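
The regression check described above can be sketched as a small gate that compares metric scores from a baseline run against a candidate run on the same golden dataset. This is a minimal sketch, not part of Ragas: `find_regressions` and the `tolerance` value are hypothetical names picked for illustration, and the score dictionaries stand in for whatever your evaluation run reports.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {metric: (old, new)} for metrics that dropped by more than `tolerance`."""
    regressions = {}
    for metric, old_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None:
            continue  # metric was not computed in the candidate run
        if old_score - new_score > tolerance:
            regressions[metric] = (old_score, new_score)
    return regressions


# Illustrative numbers only, not real measurements.
baseline = {"faithfulness": 0.92, "answer_relevancy": 0.88, "context_recall": 0.81}
candidate = {"faithfulness": 0.93, "answer_relevancy": 0.87, "context_recall": 0.70}

print(find_regressions(baseline, candidate))
```

Here only `context_recall` is flagged: the small dip in `answer_relevancy` stays inside the tolerance, which is the assumed allowance for judge noise.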

Who needs it? Anyone shipping a RAG feature they can't personally re-read every day: support bots, "chat with your docs" tools, internal knowledge assistants, anything where a confidently wrong answer is a real problem. Ragas fits inside the broader practice of LLM evals — it's the specialized tool for the retrieval-augmented case.

How it works

Ragas works on samples. Each sample is one interaction with your RAG system, described by a few fields. You collect a batch of these into a dataset, then run one or more metrics across it. The output is a score per metric (typically a 0-to-1 number), plus per-sample scores so you can find the worst offenders.
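
To make the "score per metric plus per-sample scores" idea concrete, here is a minimal sketch in plain Python. It does not call Ragas: the `per_sample` rows, the `id` field, and the helper names `aggregate` and `worst_offenders` are assumptions made for illustration, and the aggregate shown is a simple mean.

```python
from statistics import mean

# Hypothetical per-sample scores, one dict per sample (illustrative values).
per_sample = [
    {"id": "q1", "faithfulness": 1.0, "answer_relevancy": 0.9},
    {"id": "q2", "faithfulness": 0.5, "answer_relevancy": 0.8},
    {"id": "q3", "faithfulness": 0.75, "answer_relevancy": 0.4},
]


def aggregate(rows, metrics):
    """Average each metric across all samples."""
    return {m: mean(r[m] for r in rows) for m in metrics}


def worst_offenders(rows, metric, n=1):
    """Return the ids of the n lowest-scoring samples for one metric."""
    return [r["id"] for r in sorted(rows, key=lambda r: r[metric])[:n]]


print(aggregate(per_sample, ["faithfulness", "answer_relevancy"]))
print(worst_offenders(per_sample, "faithfulness"))
```

The aggregate tells you whether the system as a whole moved; the worst offenders tell you which rows to open and read.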

What a sample contains

Field	What it is	Where it comes from
user_input	The question that was asked	Your test set
retrieved_contexts	The chunks your retriever returned	Logged from your pipeline
response	The answer your LLM generated	Your pipeline's output
reference	The correct / ground-truth answer (optional)	Written once, by you

Notice the reference is optional. This is one of the most useful things about Ragas: some metrics need a human-written correct answer to compare against (reference-based), but others judge the answer purely from the question and the retrieved context (reference-free). Reference-free metrics let you evaluate live or unlabeled traffic where nobody has written down the right answer.
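
One way to picture the split is a lookup that decides which metrics a given sample can support. The sketch below is plain Python and makes no Ragas calls; the metric names are plain labels rather than Ragas class names, and `NEEDS_REFERENCE` and `runnable_metrics` are hypothetical helpers.

```python
# Whether each metric needs a ground-truth answer, following the text above.
# Context precision is listed as reference-free because a reference-free
# variant exists; a reference-based variant exists too.
NEEDS_REFERENCE = {
    "faithfulness": False,
    "answer_relevancy": False,
    "context_precision": False,
    "context_recall": True,
}


def runnable_metrics(sample):
    """List the metrics that can run on this sample."""
    has_reference = bool(sample.get("reference"))
    return [m for m, needs in NEEDS_REFERENCE.items() if has_reference or not needs]


live_sample = {
    "user_input": "How long is the refund window?",
    "retrieved_contexts": ["Refunds are accepted within 30 days."],
    "response": "You have 30 days.",
}
labeled_sample = {**live_sample, "reference": "30 days"}

print(runnable_metrics(live_sample))
print(runnable_metrics(labeled_sample))
```

Unlabeled live traffic still gets three of the four metrics; only the labeled sample unlocks context recall.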

The RAG triad it measures

Most Ragas metrics map onto the three pairwise relationships in a RAG system: question to retrieved context, context to answer, and question to answer. The well-known framing is the RAG triad: three edges of a triangle, each of which can break.

From your pipeline to a score

1. Question + context + answer: one sample
2. Pick metrics: faithfulness, relevancy, …
3. Judge model scores each: LLM-graded + embeddings
4. Aggregate: score per metric

The RAG triad: each edge is a metric family

Question ↔ Context: Did retrieval find relevant chunks? (context precision / recall)
Context ↔ Answer: Is the answer grounded in the chunks? (faithfulness)
Question ↔ Answer: Does the answer address the question? (answer relevancy)

Here's how the LLM-graded part actually works, using faithfulness as the example. Ragas first prompts a judge model to break the answer into individual factual claims. Then, for each claim, it asks the judge: can this be inferred from the retrieved context? The faithfulness score is simply the fraction of claims that the context supports. A low score means the model is adding information that wasn't in its sources — the textbook definition of a RAG hallucination.
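
The arithmetic behind that description is small enough to sketch. In the code below, `faithfulness_score` implements only the final "fraction of supported claims" step; the claim list is written by hand, and `toy_judge` is a deliberately crude stand-in for the judge LLM (it checks word overlap), so treat both as assumptions rather than how Ragas itself decides.

```python
def faithfulness_score(claims, context, is_supported):
    """Fraction of claims that the judge says the context supports."""
    if not claims:
        return None  # nothing to verify
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def toy_judge(claim, context):
    """Stand-in for an LLM call: supported if every word of the claim appears in the context."""
    context_words = set(context.lower().split())
    return all(word in context_words for word in claim.lower().split())


context = "Refunds on physical items are accepted within 30 days of purchase"
claims = [
    "Refunds are accepted within 30 days",  # supported
    "Refunds are on physical items",        # supported
    "Shipping is free",                     # not in the context
]

print(faithfulness_score(claims, context, toy_judge))
```

Two of the three claims are backed by the context, so the score is two thirds. A real judge would reason about meaning rather than matching words, but the scoring step is the same.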

Because metrics call a judge model and an embedding model, you configure those before running. Conceptually the call looks like this:

The shape of a Ragas evaluation (Python):

from ragas import evaluate, EvaluationDataset
from ragas.metrics import (
    Faithfulness,
    ResponseRelevancy,
    LLMContextPrecisionWithoutReference,
)

# Each row = one question your RAG system answered.
samples = [
    {
        "user_input": "How long is the refund window for physical items?",
        "retrieved_contexts": [
            "Refunds on physical items are accepted within 30 days of purchase.",
            "Digital goods are non-refundable once downloaded.",
        ],
        "response": "You can return a physical item within 30 days of purchase.",
        # "reference": "30 days"  # only needed for reference-based metrics
    },
    # ... hundreds more
]

dataset = EvaluationDataset.from_list(samples)

# All three metrics below are reference-FREE: they judge from the question,
# the retrieved context, and the response. A judge LLM scores each one.
result = evaluate(
    dataset=dataset,
    metrics=[
        Faithfulness(),
        ResponseRelevancy(),
        LLMContextPrecisionWithoutReference(),
    ],
)

print(result)  # -> {'faithfulness': 0.92, 'answer_relevancy': 0.88, ...}

The core metrics, in plain terms

Ragas ships many metrics, but four classic ones cover the RAG triad and are where most people start. Two grade the generation (the answer), two grade the retrieval (the chunks).

Metric	Question it answers	Grades	Needs a reference?
Faithfulness	Is every claim in the answer supported by the retrieved context?	Generation	No
Answer relevancy	Does the answer actually address the question (no padding or dodging)?	Generation	No
Context precision	Are the retrieved chunks relevant, and are the useful ones ranked near the top?	Retrieval	Either
Context recall	Did retrieval find all the chunks needed to answer fully?	Retrieval	Yes
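
The two retrieval metrics are easier to trust once you see the arithmetic. The sketch below uses one common rank-aware formulation for context precision (the average of precision@k over the ranks that hold a relevant chunk) and a simple supported-fraction for context recall. To my understanding this mirrors how the Ragas documentation describes them, but treat the exact formulas as an assumption; the 0/1 verdicts would come from a judge model and are hard-coded here.

```python
def context_precision(relevance):
    """Average of precision@k over ranks holding a relevant chunk.

    `relevance` is a list of 0/1 verdicts in retrieval order (1 = relevant).
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0


def context_recall(reference_claim_verdicts):
    """Fraction of reference-answer claims that the retrieved context supports."""
    if not reference_claim_verdicts:
        return None
    return sum(reference_claim_verdicts) / len(reference_claim_verdicts)


# Same two relevant chunks, different ranking.
print(context_precision([1, 0, 1]))  # relevant chunks at ranks 1 and 3
print(context_precision([0, 1, 1]))  # relevant chunks at ranks 2 and 3
print(context_recall([1, 1, 0]))     # 2 of 3 reference claims were retrievable
```

The first ranking scores higher than the second even though both retrieved the same relevant chunks, which is what "are the useful ones ranked near the top?" means in practice.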

Reading them as a pair

The metrics are most useful read together, because the pattern tells you which stage to fix:

Low faithfulness, high context precision and recall → retrieval is fine, but the generator is ignoring its sources and inventing things. Fix the prompt (tell it to answer only from context) or the model.
High faithfulness, low context recall → the model faithfully used what it got, but it didn't get enough. The right chunk never made it into the prompt. Fix chunking, the embedding model, or how many chunks you retrieve.
Low context precision → you're retrieving noise alongside the signal. Consider a reranker, or retrieve fewer, better chunks.
Low answer relevancy → the answer wanders, hedges, or addresses the wrong thing even when the facts are right. Often a prompt or formatting issue.
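
The patterns above can be written down as a small triage function. This is a sketch under assumptions: `diagnose` is a hypothetical helper, the `low=0.7` cut-off is arbitrary and should be tuned to your own data, and any metric missing from the input is treated as healthy.

```python
def diagnose(scores, low=0.7):
    """Map a pattern of metric scores to suggested next steps."""

    def is_low(metric):
        # Missing metrics default to 1.0, i.e. they are treated as not low.
        return scores.get(metric, 1.0) < low

    findings = []
    retrieval_ok = not is_low("context_precision") and not is_low("context_recall")
    if is_low("faithfulness") and retrieval_ok:
        findings.append("generation: answer ignores its sources; tighten the prompt or change the model")
    if not is_low("faithfulness") and is_low("context_recall"):
        findings.append("retrieval: needed chunks are missing; revisit chunking, embeddings, or top-k")
    if is_low("context_precision"):
        findings.append("retrieval: too much noise; consider a reranker or fewer chunks")
    if is_low("answer_relevancy"):
        findings.append("generation: answer is off-target; check the prompt or output format")
    return findings


# Illustrative scores: good retrieval, ungrounded answers.
scores = {
    "faithfulness": 0.55,
    "answer_relevancy": 0.90,
    "context_precision": 0.85,
    "context_recall": 0.90,
}
for finding in diagnose(scores):
    print(finding)
```

The function only encodes the reading guide; it does not replace opening the low-scoring samples and looking at what actually went wrong.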

Ragas vs other eval tools

Ragas is not the only LLM-evaluation library, and the differences are mostly about focus and mental model, not quality. Picking the right one depends on what you're testing.

Where each tool puts its emphasis

Ragas

Specialized for RAG pipelines
Built-in triad metrics out of the box
Strong reference-free options
Dataset-and-metrics mental model
Can also generate test sets

DeepEval

General LLM testing, Pytest-style
Assertions as unit tests
Includes RAG metrics too
Test-case mental model
Fits naturally in CI

Promptfoo

Config-driven prompt/model matrix
Side-by-side prompt comparison
Red-teaming probes
CLI-first workflow
Less RAG-specific

In practice the lines blur — DeepEval includes RAG metrics, and Ragas works fine inside a test runner. A reasonable rule of thumb: reach for Ragas when your central question is "is my retrieval and grounding good?"; reach for a unit-test-style framework when you want pass/fail assertions wired into CI; and reach for a matrix tool when you're comparing many prompts or models head to head. They're complementary, not mutually exclusive.

Common pitfalls

Ragas gives you crisp numbers, and crisp numbers are easy to misread. The traps below catch most newcomers.

Forgetting the judge is an LLM. LLM-graded metrics inherit the judge's quirks and a bit of run-to-run noise. Two runs on identical data can differ slightly. Treat scores as a strong signal with error bars, not exact constants, and use a capable judge model.
Chasing a single number to 1.0. A perfect faithfulness score on a tiny, easy test set means little. What matters is the trend across a representative set and whether a change moved it — not hitting a round number.
Confusing faithfulness with correctness. Faithfulness only checks that the answer matches the retrieved context. If retrieval pulled a wrong-but-relevant document, the answer can be perfectly faithful and still factually wrong. You need the retrieval metrics too, and where you have reference answers, a reference-based comparison against them.
A weak or tiny test set. Metrics are only as meaningful as the questions you feed them. Ten cherry-picked easy questions hide the failures real users will find. Invest in a representative dataset; mind your sample size.
Ignoring cost and latency. Because metrics call a judge (and sometimes make several calls per sample), evaluating thousands of samples across several metrics adds up in tokens and time. Sample sensibly; you rarely need to grade every row on every metric.
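
The first pitfall, judge noise, is the easiest one to put numbers on: run the same metric several times and look at the spread before trusting a difference. The sketch below is a crude heuristic, not a statistical test. `summarize_runs`, `is_real_change`, and the factor `k=2.0` are hypothetical, and the run scores are illustrative.

```python
from statistics import mean, stdev


def summarize_runs(run_scores):
    """Mean and sample standard deviation across repeated runs (needs at least 2 runs)."""
    return mean(run_scores), stdev(run_scores)


def is_real_change(before_runs, after_runs, k=2.0):
    """Treat a change as real only if the means differ by more than k times the larger spread."""
    mean_before, sd_before = summarize_runs(before_runs)
    mean_after, sd_after = summarize_runs(after_runs)
    return abs(mean_after - mean_before) > k * max(sd_before, sd_after)


# Faithfulness from three repeated runs on identical data (illustrative values).
before = [0.90, 0.92, 0.91]
after_tweak = [0.91, 0.90, 0.93]     # differs by about as much as the noise
after_breakage = [0.80, 0.82, 0.81]  # a drop far larger than the noise

print(is_real_change(before, after_tweak))
print(is_real_change(before, after_breakage))
```

Repeating runs costs more judge calls, so this sits in tension with the cost pitfall; a few repeats on a subsample is often enough to learn how noisy your judge is.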

Going deeper

Once the four core metrics make sense, a few directions are worth knowing.

Test-set generation. A good evaluation needs good questions, and writing them by hand is slow. Ragas can synthesize a test set from your own documents — generating questions, ground-truth answers, and the relevant context — including harder multi-document and reasoning questions, not just easy single-fact ones. It's a fast way to bootstrap a golden dataset, though you should still review what it produces.

Custom and rubric metrics. Beyond the built-ins, Ragas lets you define your own LLM-graded metrics against a plain-language rubric — "score 1 to 5 on whether the answer is polite and on-brand," for instance. This is the same idea behind methods like G-Eval: describe the criteria in words and let a judge model apply them consistently. It's how you measure the things specific to your product that no generic metric covers.
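
The mechanics of a rubric metric are simple enough to sketch without any library: build a prompt from the rubric, send it to a judge, and turn the reply into a number. The code below covers the first and last steps only. It does not use the Ragas API for custom metrics; `build_rubric_prompt` and `parse_score` are hypothetical helpers, and the judge reply is hard-coded where a model call would go.

```python
import re

RUBRIC = "Score 1 to 5 on whether the answer is polite and on-brand."


def build_rubric_prompt(rubric, question, answer):
    """Assemble the text a judge model would be asked to grade."""
    return (
        f"{rubric}\n"
        "Reply with a single integer and nothing else.\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
    )


def parse_score(judge_reply, lowest=1, highest=5):
    """Turn the judge's reply into a 0-to-1 score, or None if it is unusable."""
    match = re.match(r"\s*(\d+)", judge_reply)
    if match is None:
        return None
    raw = int(match.group(1))
    if not lowest <= raw <= highest:
        return None  # the judge answered outside the rubric's scale
    return (raw - lowest) / (highest - lowest)


prompt = build_rubric_prompt(RUBRIC, "Where is my order?", "It ships tomorrow, thanks for your patience!")
judge_reply = "4"  # placeholder for the judge model's response
print(parse_score(judge_reply))
```

Normalizing to 0-to-1 keeps a custom rubric on the same scale as the built-in metrics, and returning `None` for unparseable replies keeps a misbehaving judge from silently skewing the average.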

Beyond single-turn RAG. Ragas has grown past basic question-answer pairs toward evaluating multi-turn conversations and tool-using agents, where you score things like whether the agent picked the right tool or stayed on goal. The concept is the same — define what "good" means and have a judge measure it — but the samples are richer than one question and one answer.

Offline scores vs production reality. A clean run on your curated set is a starting line, not a finish line. Real users ask things you didn't anticipate, so pair the offline test suite with reference-free metrics on live traffic, and feed the surprising failures back into the dataset. The honest limitation never fully goes away: an automated grader is a proxy. It scales human judgment, it doesn't replace it — the most reliable RAG evaluation keeps a human in the loop validating that the metrics still mean what you think they mean.

FAQ

What is Ragas used for?

Ragas is an open-source Python library for automatically evaluating RAG and LLM applications. You give it your questions, the retrieved context, and the generated answers, and it scores them on metrics like faithfulness, answer relevancy, and context precision/recall — so you can measure quality across a whole test set instead of eyeballing a few outputs.

What does the faithfulness metric in Ragas measure?

Faithfulness measures how well an answer is grounded in the retrieved context. Ragas uses a judge LLM to break the answer into individual claims and checks what fraction of them can be inferred from the context. A low faithfulness score means the model is adding information its sources don't support — in other words, hallucinating.

Does Ragas need ground-truth answers to work?

Not always. Some metrics are reference-free — faithfulness and answer relevancy judge the answer using only the question and the retrieved context, so you can run them on unlabeled or live traffic. Others, like context recall, are reference-based and need a human-written correct answer to compare against.

Are Ragas metrics calculated by an LLM?

Several of them are. Metrics like faithfulness and answer relevancy prompt a judge model to read the data and score it, which makes Ragas a form of LLM-as-a-judge. Other metrics also use an embedding model. Because a judge LLM is involved, expect a little run-to-run variation and use a capable model as the judge.

What is the difference between Ragas and DeepEval?

Both evaluate LLM apps and both include RAG metrics. Ragas is specialized for RAG and uses a dataset-and-metrics model with strong reference-free options. DeepEval is more general and Pytest-style, framing evaluation as unit-test assertions that drop into CI. They overlap and are often complementary rather than mutually exclusive.

Can Ragas generate a test dataset for me?

Yes. Ragas can synthesize a test set from your own documents — generating questions, ground-truth answers, and relevant context, including harder multi-document and reasoning questions. It's a fast way to bootstrap a golden dataset, but you should still review the generated cases before trusting them.
