In plain English
Imagine asking a friend for restaurant recommendations and handing them a review to work from. They confidently describe a five-star place, and every detail comes straight from that review, except the review was for a hotel, not a restaurant. Their answer is faithful (every detail comes from the source) but not relevant (it doesn't answer your question). Now flip it: a different friend names a perfect restaurant that sounds spot-on, but they made it up entirely. That answer is relevant but not faithful. These are two orthogonal failure modes, and a single quality score cannot distinguish them.
In RAG evaluation, faithfulness asks: Are the claims in the generated answer actually supported by the retrieved context? It catches hallucinations — moments when the LLM interpolates or invents details beyond what the documents say. Answer relevance asks: Does the generated answer actually address what the user asked? It catches off-topic drift — moments when the model produces a well-grounded wall of text that misses the point.
Both metrics score a single generated answer, but they measure it against different references: faithfulness compares the answer to the retrieved documents, while answer relevance compares it to the original question. That is the core asymmetry. You need both numbers to know whether an answer is both grounded and useful.
Why two separate scores matter
A composite "answer quality" score hides which part of your pipeline is broken. Consider two failing systems: one that consistently retrieves the right documents but lets the LLM invent extras, and one that retrieves irrelevant documents and the LLM faithfully summarises them (producing an accurate-sounding but useless answer). Both systems would score similarly on a single end-to-end quality rating. Splitting the score into faithfulness and relevance immediately tells you which lever to pull.
The four failure quadrants
| Faithfulness | Answer Relevance | What went wrong | Where to look |
|---|---|---|---|
| High | High | Nothing — this is the target state | Ship it |
| High | Low | Model accurately reflects the context but the retriever pulled the wrong docs, or the prompt drifted off-topic | Fix retrieval or tighten the prompt |
| Low | High | Answer looks on-topic but the model fabricated details not in the context — the most dangerous failure | Fix the generation prompt or use a less hallucination-prone model |
| Low | Low | Both retrieval and generation are broken | Start with retrieval; generation often improves once context is good |
High faithfulness with low relevance is frustrating but safe — users get an unhelpful answer, not a wrong one. Low faithfulness with high relevance is the dangerous quadrant: the answer reads like exactly what the user wanted, but the facts are fabricated. A user who trusts a confident, on-topic, made-up answer is worse off than a user who gets a clearly off-topic response they know to ignore.
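The quadrant table can be turned into a small triage helper. The sketch below is illustrative only: the function name `diagnose` and the single cut-off of 0.7 are assumptions, not part of any framework, and real pipelines may want a separate threshold per metric.

```python
def diagnose(faithfulness: float, relevance: float, threshold: float = 0.7) -> str:
    """Map a (faithfulness, relevance) pair to one of the four quadrants.

    The 0.7 threshold is an assumed cut-off for illustration.
    """
    faithful = faithfulness >= threshold
    relevant = relevance >= threshold
    if faithful and relevant:
        return "ok: target state"
    if faithful:
        return "off-topic: check retrieval or tighten the prompt"
    if relevant:
        return "hallucination risk: check the generation prompt or model"
    return "both broken: start with retrieval"


print(diagnose(0.95, 0.40))  # grounded but off-topic
print(diagnose(0.35, 0.90))  # on-topic but fabricated
```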
How each metric is computed
Both metrics use LLM-as-a-judge internally, but their pipelines are structurally different because they compare against different inputs.
Faithfulness: claim decomposition and verification
Faithfulness is computed in two LLM passes. The first pass decomposes the generated answer into atomic factual claims — short, self-contained statements that can each be independently verified. The second pass checks each claim against the retrieved context, labelling it as supported (1) or unsupported (0). The final score is the ratio of supported claims to total claims.
Faithfulness = (number of supported claims) / (total claims in answer)
A score of 0.85 means 85% of the statements in the answer have backing in the retrieved documents. Frameworks like RAGAS and DeepEval implement this pipeline with a configurable LLM judge, for example gpt-4.1 or a Claude Sonnet model. Haystack's FaithfulnessEvaluator does the same decompose-then-verify loop and returns per-statement scores alongside the aggregate. Vectara's open-source Hughes Hallucination Evaluation Model (HHEM) offers a lighter alternative in the NLI tradition: it skips the LLM judge and uses a small fine-tuned classifier (DeBERTa-based in its first release) to score how consistent a piece of text is with its source.
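The ratio itself is simple once the claims and their verdicts exist. The following is a minimal sketch of the second half of the pipeline, assuming the claims have already been extracted. The `toy_judge` function is a hypothetical stand-in for the LLM or NLI verifier (it only checks word overlap), so the example runs without any API.

```python
from typing import Callable


def faithfulness_score(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Return supported claims / total claims for one answer."""
    if not claims:
        raise ValueError("no claims to verify")
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def _words(text: str) -> set[str]:
    return {w.strip(".,!?").lower() for w in text.split()}


def toy_judge(claim: str, context: str) -> bool:
    # Stand-in for a real judge: "supported" means every word of the
    # claim also appears in the context. Far too crude for real use.
    return _words(claim) <= _words(context)


context = "All customers are eligible for a 30-day full refund."
claims = [
    "Customers are eligible for a 30-day full refund",
    "The refund has no extra cost",
]
print(faithfulness_score(claims, context, toy_judge))  # 0.5
```

Keeping the judge as a parameter makes it easy to swap the stand-in for an LLM call or an NLI model without touching the scoring logic.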
Answer relevance: reverse question generation
Answer relevance does not compare the answer to the retrieved documents at all. Instead it asks: If someone had written this answer, what question were they probably answering? The metric generates n hypothetical questions that the answer would plausibly satisfy, then measures how closely those questions match the user's actual question using cosine similarity of embeddings.
Answer Relevance = mean cosine similarity(embeddings of generated questions, embedding of original question)
RAGAS generates three reverse questions by default (n=3). An answer about "how to deploy Docker containers" given in response to "how do I containerise a Node app?" should produce reverse questions close to the original. An answer that drifts into general infrastructure topics will produce reverse questions that are semantically distant, lowering the score. Because embeddings capture meaning rather than wording, a reverse question that paraphrases the original still scores high.
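The scoring step can be sketched in a few lines, assuming the reverse questions have already been generated and embedded. The two-dimensional vectors below are made-up toy embeddings chosen so the arithmetic is easy to follow. A real pipeline would get them from an embedding model.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevance(question_vec: list[float],
                     generated_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question and each
    reverse-generated question."""
    sims = [cosine(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)


original = [1.0, 0.0]                              # toy embedding of the user question
on_topic = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]    # reverse questions near the original
off_topic = [[0.0, 1.0], [0.6, 0.8], [0.0, 1.0]]   # reverse questions that drifted

print(round(answer_relevance(original, on_topic), 2))   # 0.8
print(round(answer_relevance(original, off_topic), 2))  # 0.2
```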
| Property | Faithfulness | Answer relevance |
|---|---|---|
| Compares answer to | Retrieved context | Original question |
| Failure mode caught | Hallucination | Off-topic drift |
| Technique | Claim decomposition + NLI/LLM verify | Reverse question generation + cosine similarity |
| Needs retrieval context | Yes | No |
| Needs ground-truth answer | No | No |
| Score formula | Supported claims / total claims | Mean cosine sim of re-generated Qs |
Measuring both metrics in code
RAGAS, DeepEval, and TruLens all expose faithfulness (or groundedness) and answer relevance as first-class metrics. The following examples show the minimal setup for RAGAS and DeepEval.
RAGAS
# Uses the classic RAGAS column names (question / answer / contexts);
# newer releases may expect different names. Requires a configured LLM API key.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = {
    "question": ["What is the return policy?"],
    "answer": ["We offer a 30-day full refund at no extra cost."],
    "contexts": [["All customers are eligible for a 30-day full refund."]],
    # answer_relevancy does not need 'ground_truth'
}
dataset = Dataset.from_dict(data)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
# Example output: {'faithfulness': 1.0, 'answer_relevancy': 0.94}
# Actual values depend on the judge model and will vary between runs.
print(result)

DeepEval
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric

test_case = LLMTestCase(
    input="What is the return policy?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30-day full refund."]
)
faithfulness_metric = FaithfulnessMetric(threshold=0.7, model="gpt-4.1")
relevancy_metric = AnswerRelevancyMetric(threshold=0.7, model="gpt-4.1")
evaluate(test_cases=[test_case], metrics=[faithfulness_metric, relevancy_metric])

Common pitfalls and edge cases
Faithfulness can be gamed by a vague answer
If the generated answer is so hedged or brief that it makes no falsifiable claims, the faithfulness ratio becomes 0/0. Depending on the implementation, that is reported either as a perfect score (every one of zero claims is supported) or as an undefined value. A response like "There may be some relevant information in the documents" is technically faithful but completely useless. Pair faithfulness with a completeness or coverage metric to catch this pathological edge case.
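One way to keep an answer with no verifiable claims from passing silently is to handle the empty case explicitly. The sketch below is an assumption about how a wrapper might do this, not how any particular framework behaves: it returns `None` instead of a number when there is nothing to verify, so the caller has to decide what a claim-free answer means.

```python
from typing import Optional


def safe_faithfulness(verdicts: list[bool]) -> Optional[float]:
    """Return the supported-claim ratio, or None when the answer
    contained no verifiable claims at all."""
    if not verdicts:
        return None  # nothing to verify: neither faithful nor unfaithful
    return sum(verdicts) / len(verdicts)


def is_vacuous(verdicts: list[bool], min_claims: int = 1) -> bool:
    # min_claims is a hypothetical knob; raise it to flag very thin answers.
    return len(verdicts) < min_claims


print(safe_faithfulness([True, True, False]))  # two of three claims supported
print(safe_faithfulness([]))                   # None
print(is_vacuous([]))                          # True
```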
Answer relevance does not detect hallucinations
The reverse-question trick only checks whether the answer addresses the question. It does not look at the retrieved documents at all. A beautifully relevant, completely fabricated answer can score near the top of the answer relevance range. This is why you must run both metrics together: relevance alone gives a false sense of security.
Long contexts stress faithfulness scoring
When the retrieved context is very long, two problems emerge. First, the LLM judge may miss a supporting passage buried deep in the context — a supported claim gets marked unsupported, dragging the score down unfairly. Second, the generator itself is more likely to stray from a dense, noisy context. Chunking aggressively and applying a reranker to trim the context before generation reduces both problems simultaneously.
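The trimming step can be sketched as "score every chunk against the question, keep the top k". The scorer below is a deliberately crude word-overlap count standing in for a real reranker (a cross-encoder, for instance). The names `trim_context` and `overlap_score` are illustrative, not a library API.

```python
def _words(text: str) -> set[str]:
    return {w.strip(".,!?").lower() for w in text.split()}


def overlap_score(question: str, chunk: str) -> int:
    # Stand-in relevance score: number of words shared with the question.
    return len(_words(question) & _words(chunk))


def trim_context(question: str, chunks: list[str], top_k: int = 2,
                 score=overlap_score) -> list[str]:
    """Keep only the top_k chunks, ranked by the given scoring function."""
    ranked = sorted(chunks, key=lambda chunk: score(question, chunk), reverse=True)
    return ranked[:top_k]


chunks = [
    "Our return policy allows a 30-day full refund.",
    "The company was founded in 1998.",
    "Shipping takes three to five days.",
    "Return requests are handled by the support team.",
]
kept = trim_context("What is the return policy?", chunks, top_k=2)
print(kept)
```

Passing the scorer as a parameter means the same function works unchanged once a proper reranking model replaces the word-overlap stand-in.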
Incomplete answers hurt relevance scores disproportionately
RAGAS's answer relevance metric penalises incomplete answers — a response that answers only half the question will produce reverse questions that match only part of the original, dragging the cosine similarity down. This is intentional behaviour: an incomplete answer is treated as partially irrelevant. If you see relevance scores declining after shortening your system prompt, check whether the shorter prompt is causing truncated answers.
Going deeper
NLI-based faithfulness without an LLM judge
The LLM-as-a-judge approach to faithfulness is accurate but expensive: at least two LLM calls per evaluated answer, one to extract claims and one to verify them. For high-volume production monitoring, Natural Language Inference (NLI) pipelines offer a cheaper alternative. A model like DeBERTa-v3 is fine-tuned to classify whether a hypothesis (a claim from the answer) is entailed, neutral, or contradicted by a premise (a context chunk). Vectara's open-source HHEM model follows the same premise-and-hypothesis idea but returns a single consistency score rather than three labels, and it is small enough to score answers in bulk on a single GPU without any LLM API calls.
Context relevance: the retrieval-side complement
Faithfulness and answer relevance both evaluate generation. On the retrieval side, context precision measures how much of what you retrieved is actually useful (in RAGAS, relevant chunks ranked near the top count for more), and context recall measures whether the chunks that do contain the answer were retrieved at all. The full four-metric RAGAS suite covers both sides of the pipeline. A RAG system with high faithfulness and relevance but low context recall might appear to work well on easy questions while silently failing on anything requiring obscure information.
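To make the two retrieval-side ideas concrete, here is a simplified set-based sketch. It assumes you already know which chunk IDs are relevant for a query (hypothetical labels). The framework versions of these metrics rely on an LLM judge rather than plain set ratios, so treat this as intuition only.

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of relevant chunks that made it into the retrieved set."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)


retrieved = ["a", "b", "c", "d"]   # chunk IDs returned by the retriever
relevant = {"a", "e"}              # chunk IDs a human marked as relevant

print(context_precision(retrieved, relevant))  # 0.25
print(context_recall(retrieved, relevant))     # 0.5
```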
LLM-as-a-judge calibration
Research from 2025 (Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards, arXiv:2505.04847) shows that judge choice matters: different LLMs rate faithfulness quite differently on the same examples. o3-mini with high reasoning effort achieves 84% balanced accuracy on FaithBench, substantially better than smaller judges. For production pipelines where faithfulness errors are costly, calibrate your judge against a human-annotated test set before trusting its scores at scale.
Using scores as CI/CD gates
DeepEval integrates with pytest, making it straightforward to block a deployment when faithfulness drops below a threshold. A common pattern is to maintain a golden test set of 50–200 queries with known good answers, score the full set on every pull request, and fail the build if faithfulness drops below 0.80 or answer relevance drops below 0.75. This catches regressions introduced by prompt changes, model version upgrades, or index rebuilds before they reach users.
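# deepeval pytest integration: fails CI if thresholds are not met
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric

# Placeholder golden set (hypothetical example); in practice load your curated cases.
golden_test_set = [
    LLMTestCase(
        input="What is the return policy?",
        actual_output="We offer a 30-day full refund.",
        retrieval_context=["All customers are eligible for a 30-day full refund."],
    ),
]

@pytest.mark.parametrize("test_case", golden_test_set)
def test_rag_quality(test_case):
    assert_test(test_case, metrics=[
        FaithfulnessMetric(threshold=0.80),
        AnswerRelevancyMetric(threshold=0.75),
    ])

FAQ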
Can a RAG answer score 1.0 on faithfulness and still be wrong?
Yes. Faithfulness only checks whether the answer is consistent with the retrieved context — it does not check whether the context itself is correct. If the retrieved document contains outdated or incorrect information, a perfectly faithful answer will repeat that error. Faithfulness guarantees grounding, not accuracy. Use answer correctness (which requires a ground-truth reference) to catch factually wrong answers.
What is the difference between faithfulness and groundedness?
The terms are used interchangeably in most frameworks. Groundedness is the general concept: every claim in the answer is traceable to the source documents. Faithfulness is the specific metric operationalisation of that concept, typically computed as the ratio of supported claims to total claims. Some tools (Azure AI Foundry, TruLens) prefer "groundedness"; RAGAS, DeepEval, and Haystack use "faithfulness".
Does answer relevance require a ground-truth answer?
No. RAGAS's answer relevance metric is reference-free: it generates hypothetical questions from the answer and compares them to the original question using embedding cosine similarity. No human-written ideal answer is needed. This is what makes it practical for production monitoring — you only have the user's question and the system's response, not a gold standard.
Why would answer relevance drop after I improved my retrieval?
Better retrieval often means more context is passed to the generator. If the additional context is noisy or pushes the model toward a tangential topic, the answer may drift from the original question even though it is more faithful. This is sometimes called context distraction. Reranking retrieved chunks to keep only the most relevant ones before generation typically fixes the relevance drop and also reduces context token usage.
How many test samples do I need for reliable faithfulness and relevance scores?
Scores stabilise around 50–100 samples for most RAG systems, assuming the test set covers diverse query types. Fewer samples produce noisy estimates that can vary by 0.05–0.10 between runs just from LLM judge variance. For CI/CD gates, a golden set of 100–200 representative queries is a practical minimum.
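A rough way to see how sample size affects the noise of an aggregate score is to look at the standard error of the mean. The sketch below uses made-up per-query faithfulness scores. It captures variation across queries only, not the run-to-run variance of the LLM judge, so it is a lower bound on the real uncertainty.

```python
import math
import statistics


def mean_and_stderr(scores: list[float]) -> tuple[float, float]:
    """Return the mean score and the standard error of that mean."""
    n = len(scores)
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(n) if n > 1 else float("nan")
    return mean, stderr


# Hypothetical per-query faithfulness scores from a small test set.
small_set = [1.0, 0.5, 1.0, 0.75, 1.0, 0.5, 1.0, 0.75]
large_set = small_set * 4  # same distribution, four times as many samples

small_mean, small_err = mean_and_stderr(small_set)
large_mean, large_err = mean_and_stderr(large_set)

print(f"n={len(small_set)}: mean={small_mean:.4f}, stderr={small_err:.4f}")
print(f"n={len(large_set)}: mean={large_mean:.4f}, stderr={large_err:.4f}")
```

With the same spread of scores, quadrupling the sample count roughly halves the standard error, which is why small golden sets make threshold gates flaky.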
Can I run faithfulness scoring without paying for LLM API calls?
Yes. Vectara's open-source HHEM model (available on Hugging Face as vectara/hallucination_evaluation_model) uses a small NLI-style classifier to score how well an answer is supported by the context, with no LLM API calls. It is faster and cheaper than LLM-as-a-judge but can be less accurate on subtle or complex claims. For high-volume production monitoring it is a practical alternative to GPT-4-class judges.