AI/TLDR

RAG Evaluation Metrics Explained

You will understand exactly what each of the four core RAGAS metrics measures, how the score is computed step by step, and which part of a broken RAG pipeline each metric points to.

Intermediate · 13 min read · Updated 2026-06-12

In plain English

RAGAS (Retrieval-Augmented Generation Assessment) is a framework that puts four numbers on a RAG pipeline: faithfulness, answer relevance, context precision, and context recall. Together they answer two questions: Did the retriever find the right chunks? (precision and recall) and Did the generator use them correctly? (faithfulness and answer relevance).

What makes RAGAS useful is that much of it is reference-free: you do not need a human to write the ideal answer for every test question. Faithfulness and answer relevance judge the output against the retrieved context and the original question, so you can run them on live production traffic. Context recall always needs a ground-truth reference answer to compare against, and the default context precision metric in the library expects one too, although a reference-free variant exists.

Each metric is a number from 0 to 1. Higher is always better. But the real value is in the pattern: high context recall with low faithfulness means the right chunks are there and the model is ignoring or distorting them. Low context recall with high faithfulness means the model is being careful with what it got — but it didn't get enough. The combination of all four tells a diagnostic story.

Why it matters

The four metrics matter because they give you decomposed accountability. A RAG pipeline has at least two independent failure modes: the retriever pulling wrong chunks, and the generator hallucinating beyond what the chunks say. A single end-to-end score ("answer correctness") hides which one broke. RAGAS separates them.

The second reason is scalability. Human review of 500 chatbot answers per day does not scale. RAGAS turns that into an automated pipeline: an LLM judge runs the faithfulness check in seconds per answer, and the scores flow into a dashboard. The human reads summaries and drills into anomalies, rather than grading every response.

What each metric tells you to fix

| If this score is low… | The likely cause | What to change |
| --- | --- | --- |
| Context precision | Retriever returns irrelevant chunks, diluting the good ones | Tune embedding model, raise similarity threshold, add a reranker |
| Context recall | Retriever misses chunks that contain the answer | Increase top_k, change chunking strategy, improve query rewriting |
| Faithfulness | Generator adds claims not grounded in the retrieved context | Tighten the system prompt, use a more instruction-following model |
| Answer relevance | Generator wanders or gives incomplete answers to the question | Revise the prompt to focus on the specific question, trim context noise |

How each metric is computed

Each RAGAS metric goes through a specific algorithmic pipeline, usually involving one or more LLM calls. The subsections below trace that data flow for a single evaluation sample, one metric at a time.

Faithfulness

Faithfulness measures what fraction of the claims in the generated answer are actually supported by the retrieved context. It catches the generator hallucinating details not present in the chunks.

The computation has two LLM steps. First, an LLM decomposes the answer into atomic statements — short, self-contained claims. The sentence "Paris has a population of 2.1 million and is home to the Eiffel Tower" becomes two statements. Second, for each atomic statement, the LLM checks whether the statement can be inferred from the retrieved context. The final score is the ratio of supported statements to total statements:

```text
Faithfulness = (number of supported statements) / (total statements in answer)

Example:
  Answer decomposed into 5 atomic statements
  4 of 5 are supported by the retrieved context
  Faithfulness = 4 / 5 = 0.80
```

A score of 1.0 means every claim in the answer traces to the context. A score of 0.6 means 40% of the answer is ungrounded — the model invented or extrapolated those facts. Because faithfulness only uses the retrieved context (not a gold answer), it runs reference-free: you can compute it on every production query.
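The ratio step can be sketched in a few lines. This snippet assumes the two LLM calls have already run and left one boolean verdict per atomic statement; `faithfulness_score` and `verdicts` are illustrative names, not part of the RAGAS API.

```python
from typing import Sequence


def faithfulness_score(verdicts: Sequence[bool]) -> float:
    """Fraction of atomic statements that the judge marked as supported."""
    if not verdicts:
        raise ValueError("the answer produced no statements to verify")
    return sum(verdicts) / len(verdicts)


# Hypothetical judge output for the five-statement example above:
# four statements supported by the context, one not.
verdicts = [True, True, True, True, False]
score = faithfulness_score(verdicts)
print(f"faithfulness = {score:.2f}")  # prints "faithfulness = 0.80"
```

Raising on an empty list is a design choice for this sketch; a real pipeline might instead skip answers that decompose into no statements.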

Answer relevance

Answer relevance measures whether the generated answer actually addresses the user's question — not whether it is factually correct, but whether it is on topic and complete. An answer that drifts into tangentially related information, or that only partially responds, gets a lower score.

The computation uses a clever reverse-generation trick. An LLM generates N candidate questions (typically 3–5) that the given answer could plausibly have been written to answer. Each candidate question is embedded as a vector, and the mean cosine similarity between those candidate questions and the original question is the relevance score:

```text
Answer Relevance = (1/N) * Σ cosine_similarity(embed(generated_q_i), embed(original_q))

Example:
  Original question: "What are the side effects of ibuprofen?"
  Answer generated by RAG pipeline
  LLM reverse-generates 3 questions from the answer:
    Q1: "What are the adverse effects of ibuprofen?"           → sim = 0.97
    Q2: "How does ibuprofen affect the stomach?"               → sim = 0.82
    Q3: "What medications have gastrointestinal side effects?" → sim = 0.61
  Answer Relevance = (0.97 + 0.82 + 0.61) / 3 = 0.80
```

The intuition: if the answer truly addresses the original question, an LLM should be able to reconstruct that original question from the answer. High cosine similarity between reconstructed questions and the real question means the answer is on-topic. If the answer wandered, the reconstructed questions will point in a different direction, lowering the score.
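The averaging step can be sketched without any model at all. The vectors below are toy three-dimensional stand-ins for real embeddings, and `cosine_similarity` and `answer_relevance` are hypothetical helper names; a real run would obtain the vectors from an embedding model.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare against a zero vector")
    return dot / (norm_a * norm_b)


def answer_relevance(original_q: Vector, generated_qs: Sequence[Vector]) -> float:
    """Mean similarity between each reverse-generated question and the original."""
    if not generated_qs:
        raise ValueError("need at least one generated question")
    sims = [cosine_similarity(q, original_q) for q in generated_qs]
    return sum(sims) / len(sims)


# Toy embeddings (assumption: real ones have hundreds of dimensions).
original = [1.0, 0.0, 0.0]
generated = [
    [1.0, 0.0, 0.0],  # same direction as the original question
    [1.0, 1.0, 0.0],  # partially aligned
    [0.0, 1.0, 0.0],  # orthogonal, i.e. an off-topic reconstruction
]
relevance = answer_relevance(original, generated)
print(f"answer relevance = {relevance:.2f}")
```

The off-topic third vector drags the mean down, which is exactly how a wandering answer lowers the score.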

Context precision

Context precision measures the signal-to-noise ratio in the retrieved chunks. When you retrieve top_k chunks, some are relevant and some are distractors. Context precision asks: of everything the retriever returned, what fraction was actually useful?

The precision calculation is rank-aware — chunks returned earlier (rank 1, 2, 3…) are weighted more heavily than chunks returned at rank 10. This reflects reality: models and users pay more attention to what comes first in the context window. The formula is a mean of precision-at-k values, computed only at positions where a relevant chunk appears:

```text
Context Precision@K = Σ_{k=1..K} (Precision@k × relevance_k) / (number of relevant chunks in the top K)

where Precision@k = (relevant chunks in top-k) / k
and   relevance_k = 1 if chunk at rank k is relevant, else 0

Example (top_k = 4, relevant chunks at ranks 1 and 3):
  Precision@1 = 1/1 = 1.0  (relevant)
  Precision@2 = 1/2 = 0.5  (irrelevant, doesn't count)
  Precision@3 = 2/3 = 0.67 (relevant)
  Precision@4 = 2/4 = 0.5  (irrelevant, doesn't count)
  Context Precision = (1.0 + 0.67) / 2 ≈ 0.83
```

A score of 1.0 means every relevant chunk was ranked ahead of every irrelevant one; because the average is taken only at relevant positions, trailing irrelevant chunks do not lower it. A low score means the relevant chunks were buried under irrelevant material, so the model had to extract a useful answer from a pile of noise, which increases hallucination risk and can trigger the lost-in-the-middle problem.
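The rank-aware formula can be sketched as a short function. It assumes an LLM judge has already labelled each retrieved chunk as relevant (1) or not (0), in rank order; `context_precision` here is an illustrative helper, not the library's implementation, and returning 0.0 when nothing is relevant is a convention chosen for this sketch.

```python
from typing import Sequence


def context_precision(relevance: Sequence[int]) -> float:
    """Mean of Precision@k taken only at ranks that hold a relevant chunk.

    relevance[i] is 1 if the chunk at rank i + 1 is relevant, else 0.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # assumption: no relevant chunk retrieved scores zero
    hits = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / k  # Precision@k at this relevant rank
    return weighted_sum / total_relevant


early = context_precision([1, 0, 1, 0])  # relevant chunks at ranks 1 and 3
late = context_precision([0, 1, 0, 1])   # same two chunks, ranked lower
print(f"early = {early:.2f}, late = {late:.2f}")  # prints "early = 0.83, late = 0.50"
```

Both rankings contain two relevant chunks out of four, yet the score drops from 0.83 to 0.50 when they slide down one position each, which is the rank weighting at work.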

Context recall

Context recall measures coverage: did the retriever surface all the information needed to answer the question? A precision of 1.0 with recall of 0.5 means every chunk you retrieved was relevant, but you only got half of what you needed — the answer is going to be incomplete.

Context recall always requires a reference answer (ground truth). The LLM decomposes the reference answer into atomic claims, then checks each claim against the retrieved context. The score is the fraction of reference claims that the context can support:

```text
Context Recall = (claims in reference answer supported by retrieved context)
                  / (total claims in reference answer)

Example:
  Reference answer decomposes into 4 claims:
    Claim 1: "The Eiffel Tower is in Paris"            → found in context ✓
    Claim 2: "It was built in 1889"                    → found in context ✓
    Claim 3: "It is 330 metres tall"                   → found in context ✓
    Claim 4: "It was originally a temporary structure" → NOT in context ✗
  Context Recall = 3 / 4 = 0.75
```

A recall of 0.75 here means the retriever missed the chunk about the Eiffel Tower's original purpose. The model either skips that fact (incomplete answer) or invents something (hallucination). Either way, the fix is in the retriever: look for better embedding coverage, a different chunking boundary, or a higher top_k.
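Since the fix lives in the retriever, it helps to report which reference claims went unsupported, not only the ratio. The sketch below assumes the judge's verdicts arrive as a claim-to-boolean mapping; `context_recall` and `claim_supported` are hypothetical names for illustration.

```python
from typing import List, Mapping, Tuple


def context_recall(claim_supported: Mapping[str, bool]) -> Tuple[float, List[str]]:
    """Return (recall score, reference claims the context failed to support)."""
    if not claim_supported:
        raise ValueError("the reference answer produced no claims")
    missing = [claim for claim, ok in claim_supported.items() if not ok]
    score = (len(claim_supported) - len(missing)) / len(claim_supported)
    return score, missing


# Hypothetical judge verdicts for the Eiffel Tower example above.
verdicts = {
    "The Eiffel Tower is in Paris": True,
    "It was built in 1889": True,
    "It is 330 metres tall": True,
    "It was originally a temporary structure": False,
}
recall, missing_claims = context_recall(verdicts)
print(f"context recall = {recall:.2f}")  # prints "context recall = 0.75"
print("missing:", missing_claims)
```

The list of missing claims doubles as a set of test queries for the retriever: each one names a fact that some chunk should have surfaced.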

Running RAGAS in code

RAGAS ships as a Python library. Each metric is a class. You assemble a Dataset of evaluation samples and call evaluate(). The library handles all the LLM judge calls internally.

```bash
pip install ragas
```

```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Each sample needs: question, answer, contexts, ground_truth.
# ground_truth is required for context_recall and is also expected by the
# default context_precision metric.
# These column names follow the older RAGAS schema; newer releases rename
# them, so check the schema for your installed version.
data = {
    "question": [
        "What year was the Eiffel Tower built?",
        "How does HTTPS encryption work?",
    ],
    "answer": [
        "The Eiffel Tower was built in 1889.",
        "HTTPS uses TLS to encrypt the connection between browser and server.",
    ],
    "contexts": [
        ["The Eiffel Tower, completed in 1889, stands in Paris."],
        [
            "TLS (Transport Layer Security) encrypts data in transit.",
            "HTTPS is HTTP over a TLS connection.",
        ],
    ],
    "ground_truth": [
        "The Eiffel Tower was constructed in 1889.",
        "HTTPS uses TLS to encrypt traffic between client and server.",
    ],
}

dataset = Dataset.from_dict(data)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
```

RAGAS uses OpenAI models by default, both for the LLM judge calls and for the embeddings behind answer relevance. You can swap in any LiteLLM-compatible model including Claude or local models. The library exposes a LangchainLLMWrapper class and an llm attribute on each metric for this purpose.

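```python
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness
from langchain_anthropic import ChatAnthropic

# Use Claude as the RAGAS judge instead of OpenAI
judge_llm = LangchainLLMWrapper(ChatAnthropic(model="claude-sonnet-4-5"))

# Point every LLM-judged metric that is passed to evaluate() at the new judge
faithfulness.llm = judge_llm
answer_relevancy.llm = judge_llm
context_precision.llm = judge_llm
context_recall.llm = judge_llm

# Note: answer_relevancy also needs an embedding model, which this snippet
# leaves at the library default.
```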

For production monitoring, you can drop ground_truth from the dataset and skip context_recall. Faithfulness and answer relevance still run unchanged; for context precision, use the library's reference-free variant, because the default context_precision metric expects ground_truth. Log every query and its scores, and alert when your 7-day rolling faithfulness average drops below your threshold.
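The rolling-average alert can be sketched with the standard library alone. `RollingFaithfulnessMonitor` is a hypothetical class, and it averages daily means (so each day counts equally regardless of traffic), which is one reasonable reading of "7-day rolling average" rather than the only one.

```python
from collections import deque
from statistics import mean
from typing import Deque, Sequence


class RollingFaithfulnessMonitor:
    """Tracks daily mean faithfulness and flags a low rolling average."""

    def __init__(self, threshold: float, window_days: int = 7) -> None:
        self.threshold = threshold
        self.daily_means: Deque[float] = deque(maxlen=window_days)

    def add_day(self, scores: Sequence[float]) -> bool:
        """Record one day's per-query scores; return True if an alert should fire.

        The caller should skip days with no traffic, since an empty list
        has no mean.
        """
        self.daily_means.append(mean(scores))
        window_full = len(self.daily_means) == self.daily_means.maxlen
        return window_full and mean(self.daily_means) < self.threshold


# A 3-day window keeps the demo short; production would use the default of 7.
monitor = RollingFaithfulnessMonitor(threshold=0.85, window_days=3)
days = [[0.92, 0.88], [0.90, 0.90], [0.91, 0.89], [0.60, 0.70]]
alerts = [monitor.add_day(day) for day in days]
print(alerts)  # prints "[False, False, False, True]"
```

Waiting for a full window before alerting avoids paging someone over a single bad morning right after deployment.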

Interpreting scores together

Individual metric values are less informative than the pattern across all four, so read the scores together rather than one at a time.

A healthy RAG pipeline typically shows context recall above 0.80, context precision above 0.70, faithfulness above 0.85, and answer relevance above 0.80. These are rough baselines — the right targets depend heavily on domain and user tolerance for error. Medical or legal RAG demands much tighter thresholds than a general FAQ bot.
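As a rough sketch, those baselines can be turned into a triage helper that pairs each low score with the fixes suggested earlier in this article. The thresholds are the rough baselines quoted above, not universal targets, and `diagnose`, `BASELINES`, and `FIXES` are illustrative names.

```python
from typing import Dict, List, Tuple

# Rough baselines from the text; tune them for your own domain.
BASELINES: Dict[str, float] = {
    "context_recall": 0.80,
    "context_precision": 0.70,
    "faithfulness": 0.85,
    "answer_relevance": 0.80,
}

# Suggested fixes, taken from the "what to change" table earlier on.
FIXES: Dict[str, str] = {
    "context_recall": "increase top_k, change chunking, improve query rewriting",
    "context_precision": "tune embeddings, raise similarity threshold, add a reranker",
    "faithfulness": "tighten the system prompt, use a more instruction-following model",
    "answer_relevance": "focus the prompt on the question, trim context noise",
}


def diagnose(scores: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (metric, suggested fix) for every metric below its baseline."""
    return [
        (name, FIXES[name])
        for name, floor in BASELINES.items()
        if scores[name] < floor
    ]


sample = {
    "context_recall": 0.90,
    "context_precision": 0.75,
    "faithfulness": 0.60,
    "answer_relevance": 0.88,
}
for metric, fix in diagnose(sample):
    print(f"{metric} is low: {fix}")
```

With these sample scores only faithfulness is flagged, matching the "right chunks, distorted answer" pattern described in the introduction.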

The precision–recall trade-off

Context precision and context recall naturally pull against each other. Increasing top_k from 4 to 10 almost always improves recall (more chances to grab the right chunk) but hurts precision (more irrelevant chunks mixed in). Adding a reranker is the primary tool for pushing both up at once — it re-orders the top_k results by relevance before they reach the generator, so the most useful chunks come first and irrelevant ones land at the end where they have less influence.
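A toy ranking makes the trade-off concrete. The sketch below uses plain chunk-level precision and recall (a simplification of the claim-based RAGAS metrics), assumes exactly three relevant chunks exist in the corpus, and models the reranker as a perfect sort, which real rerankers only approximate.

```python
from typing import Sequence, Tuple


def precision_recall_at_k(
    flags: Sequence[int], k: int, total_relevant: int
) -> Tuple[float, float]:
    """Chunk-level precision and recall for the first k ranked chunks."""
    hits = sum(flags[:k])
    return hits / k, hits / total_relevant


# 1 = relevant chunk, 0 = distractor, in retriever rank order (toy data).
ranked = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
TOTAL_RELEVANT = 3

p4, r4 = precision_recall_at_k(ranked, 4, TOTAL_RELEVANT)
p10, r10 = precision_recall_at_k(ranked, 10, TOTAL_RELEVANT)

# Idealised reranker: retrieve 10, sort relevant chunks first, keep the top 4.
reranked = sorted(ranked, reverse=True)
p4r, r4r = precision_recall_at_k(reranked, 4, TOTAL_RELEVANT)

print(f"top_k=4:           precision={p4:.2f} recall={r4:.2f}")
print(f"top_k=10:          precision={p10:.2f} recall={r10:.2f}")
print(f"rerank 10, keep 4: precision={p4r:.2f} recall={r4r:.2f}")
```

Widening to ten chunks lifts recall from 0.67 to 1.00 while precision falls from 0.50 to 0.30; reranking and trimming back to four recovers precision (0.75) without giving the recall back.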

When faithfulness and answer relevance disagree

High faithfulness with low answer relevance usually means the model is being too literal — it found a relevant chunk and paraphrased it accurately, but the user's question had a different framing or scope. The fix is a prompt that asks the model to explicitly address the question structure, not just surface what the chunk says. Low faithfulness with high answer relevance is the red flag: the answer sounds perfect but contains invented claims. This is the classic hallucination pattern and the most important one to catch.
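The two disagreement cases can be written down as a small classifier. The cut-offs reuse the rough baselines from this article and are assumptions, as is the function name `classify_generation`.

```python
def classify_generation(
    faithfulness: float,
    answer_relevance: float,
    faithfulness_min: float = 0.85,
    relevance_min: float = 0.80,
) -> str:
    """Label a sample by which generator-side metric, if any, is low."""
    grounded = faithfulness >= faithfulness_min
    on_topic = answer_relevance >= relevance_min
    if grounded and on_topic:
        return "healthy"
    if grounded:
        # Accurate paraphrase of a chunk that misses the question's framing.
        return "too literal"
    if on_topic:
        # Sounds right but contains unsupported claims.
        return "hallucination risk"
    return "both low"


print(classify_generation(0.95, 0.55))  # prints "too literal"
print(classify_generation(0.50, 0.92))  # prints "hallucination risk"
```

Routing "hallucination risk" samples to human review first is a sensible default, since that is the pattern the text singles out as most important to catch.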

Going deeper

The four core metrics cover most RAG quality dimensions, but there are important edge cases and extensions worth knowing.

Limitations of LLM-judged metrics

All four metrics depend on an LLM judge, and faithfulness and context recall lean on it twice: once to decompose answers or references into atomic claims and once to verify them. This introduces judge variance: the same answer can score slightly differently on two runs. Expect statistical noise of roughly ±0.03–0.05 on small datasets. Treat a move from 0.81 to 0.83 as noise; treat 0.80 to 0.90 as a real improvement. Run enough samples (50+) before acting on a metric shift.
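One rough way to separate noise from signal is to compare the shift against the combined standard error of the two runs. This is a rule-of-thumb sketch, not a rigorous significance test; `is_real_shift` and the cut-off `z = 2.0` are assumptions, and the per-sample scores are synthetic.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_with_stderr(scores: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and its standard error."""
    if len(scores) < 2:
        raise ValueError("need at least two samples to estimate spread")
    return mean(scores), stdev(scores) / sqrt(len(scores))


def is_real_shift(before: Sequence[float], after: Sequence[float], z: float = 2.0) -> bool:
    """True if the mean moved by more than z combined standard errors."""
    mean_before, se_before = mean_with_stderr(before)
    mean_after, se_after = mean_with_stderr(after)
    combined_se = sqrt(se_before ** 2 + se_after ** 2)
    return abs(mean_after - mean_before) > z * combined_se


# Synthetic per-sample faithfulness scores, 60 samples per run.
baseline_run = [0.8, 0.9, 0.7, 0.8] * 15
small_change = [s + 0.02 for s in baseline_run]
large_change = [s + 0.10 for s in baseline_run]

print(is_real_shift(baseline_run, small_change))  # prints "False"
print(is_real_shift(baseline_run, large_change))  # prints "True"
```

With 60 samples per run, a 0.02 move stays inside the noise band while a 0.10 move clears it, which lines up with the guidance above.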

Faithfulness vs. factual correctness

Faithfulness is not the same as factual accuracy. An answer is faithful if it matches the retrieved context — even if the context itself is wrong. If your knowledge base contains an outdated document claiming the API rate limit is 100 requests/minute when it was raised to 500, a faithful answer will confidently state the wrong number. Faithfulness catches model hallucinations; it does not audit the quality of your document corpus. For factual correctness you also need a ground-truth reference and an answer correctness check.

Newer RAGAS metrics

The RAGAS library has expanded beyond the original four. Noise sensitivity measures how much the answer changes when irrelevant chunks are added or removed. Response conciseness penalises answers that pad with unnecessary context. Topic adherence is useful for multi-turn conversations. The official RAGAS documentation lists all current metrics with detailed computation notes.

Integrating RAGAS into CI

The practical end-game is RAGAS running in your CI pipeline: every pull request that changes chunking logic, retrieval parameters, or generation prompts triggers an automated eval run. A score regression blocks the merge. A score improvement gets highlighted in the PR summary. This keeps LLMOps discipline close to the development loop rather than relegated to quarterly audits.
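A merge gate of this kind can be sketched as a comparison against stored baseline scores. Everything here is hypothetical scaffolding: the score dictionaries would come from an eval run and a checked-in baseline file, and the 0.03 tolerance is an assumption borrowed from the noise band discussed earlier.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.03
) -> List[str]:
    """Metrics whose score fell below the baseline by more than the tolerance."""
    return [
        name
        for name, floor in baseline.items()
        if current[name] < floor - tolerance
    ]


def gate(baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.03) -> int:
    """Return a process exit code: 1 blocks the merge, 0 lets it through."""
    regressions = find_regressions(baseline, current, tolerance)
    for name in regressions:
        print(f"REGRESSION {name}: {baseline[name]:.2f} -> {current[name]:.2f}")
    return 1 if regressions else 0


# Hypothetical scores: main branch versus the pull request under review.
baseline_scores = {
    "faithfulness": 0.88,
    "answer_relevance": 0.83,
    "context_precision": 0.74,
    "context_recall": 0.81,
}
pr_scores = {
    "faithfulness": 0.79,
    "answer_relevance": 0.84,
    "context_precision": 0.73,
    "context_recall": 0.86,
}
exit_code = gate(baseline_scores, pr_scores)
print("exit code:", exit_code)  # prints "exit code: 1"
```

In a CI job the script would end by passing `exit_code` to `sys.exit`, so a non-zero value fails the check. The tolerance keeps ordinary judge noise, such as the 0.01 dip in context precision here, from blocking merges.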

FAQ

What are the four RAGAS metrics?

Faithfulness (fraction of answer claims supported by retrieved context), answer relevance (how well the answer addresses the question), context precision (proportion of retrieved chunks that are relevant, rank-weighted), and context recall (fraction of reference answer claims present in retrieved context).

How is RAGAS faithfulness score calculated?

An LLM decomposes the generated answer into atomic statements, then verifies each statement against the retrieved context. Faithfulness equals the number of supported statements divided by the total number of statements. A score of 1.0 means every claim in the answer is grounded in the retrieved chunks.

Does RAGAS require a reference answer?

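Always for context recall, which compares retrieved context against the claims in a ground-truth answer. Faithfulness and answer relevance are reference-free and can be run on live production traffic without any pre-labelled answers. Context precision can run either way: the default metric in the library judges chunks against a reference, and a reference-free variant judges them against the generated answer instead.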

What is the difference between context precision and context recall in RAG?

Context precision measures signal-to-noise: of the chunks you retrieved, how many were relevant? Context recall measures coverage: of the relevant information that exists, how much did you retrieve? You can have high precision with low recall (retrieved a small perfect set that misses key facts) or high recall with low precision (retrieved everything including a lot of junk).

How does RAGAS answer relevance work without a reference answer?

It uses a reverse-generation trick: an LLM generates N candidate questions that the answer could have been written to answer, then measures the average cosine similarity between those candidate questions and the original question. If the answer is on-topic, the reconstructed questions will closely match the original question.

What RAGAS scores should I target in production?

Rough baselines: faithfulness above 0.85, answer relevance above 0.80, context recall above 0.80, context precision above 0.70. Exact targets depend on domain and risk tolerance — a medical or legal RAG system should demand tighter thresholds than a general FAQ bot. Track trends over time, not just absolute values.

Further reading