AI/TLDR

What Is G-Eval? Chain-of-Thought LLM Scoring

You will understand how G-Eval turns plain-language criteria into a chain-of-thought LLM score you can reuse.

INTERMEDIATE · 11 MIN READ · UPDATED 2026-06-14

In plain English

Suppose you ask a model to summarize a long article, and you want to know: is this summary any good? There is no single right answer to compare against — a good summary can be worded a hundred ways. You could read every output by hand, but that does not scale to thousands of examples. You need a grader that judges quality the way a thoughtful human would, but runs automatically.

[Figure: G-Eval illustration. Source: comet.com]

G-Eval is a recipe for building exactly that grader using another LLM. You write your quality criteria in plain English — for example, "a good summary is faithful to the source, covers the key points, and stays concise." G-Eval then turns those words into a structured scoring procedure: the judge model first reasons step by step about the output, then fills in a score on a fixed scale. You get back a number you can track, plus the reasoning behind it.

Think of it like the rubric a teacher uses to grade essays. The teacher does not just glance at an essay and blurt out "7 out of 10." They have a checklist — thesis, evidence, structure, grammar — and they reason through each one before settling on a grade. G-Eval gives the judge model that same disciplined process: think through the criteria first, then assign the score. The careful reasoning is what makes the final number trustworthy instead of a gut reaction.

Why it matters

Most of the things we want an LLM to do well are open-ended. There is no exact string to compare against when you grade a chatbot reply, a generated email, or a summary. Older automatic metrics tried to score these by measuring word overlap with a reference answer (BLEU, ROUGE), but two summaries can share almost no words and both be excellent — or share many words and one be wrong. Overlap metrics correlate poorly with what humans actually think is good.

G-Eval matters because it closes that gap. By having a capable model reason about your criteria, its scores line up far more closely with human judgment than overlap metrics do — which is the whole point of an automatic evaluator. A builder cares about this for concrete reasons:

  • You can grade subjective qualities. Coherence, helpfulness, tone, faithfulness, relevance — things no exact-match check can capture — become measurable numbers.
  • You define the criteria, not a library author. Because the rubric is just plain English, you can score your notion of quality ("never invents a price," "matches our brand voice") instead of a fixed off-the-shelf metric.
  • It scales past human review. Once the rubric works, you can run it over thousands of outputs in CI to catch regressions when you change a prompt or swap models — something manual review cannot keep up with.
  • You get reasons, not just a number. Because the judge reasons before scoring, you can read why an output lost points, which makes failures debuggable.

G-Eval does not replace human review — it stretches it. You still spend human effort up front designing and calibrating the rubric, but after that the model handles the volume, and you sample its grades to confirm they still agree with you.

How it works

G-Eval has two ingredients that set it apart from naively asking a model "rate this 1 to 5." First, it uses chain-of-thought: the judge generates explicit evaluation steps before scoring. Second, it uses a form-filling pattern: the score comes out in a fixed, parseable slot at the end, not buried in prose. Together these turn a vague request into a repeatable measurement.

Step 1 — You write the criteria

You describe, in one or two sentences, what you are measuring and what "good" looks like. You also pick a score range, commonly 1–5. This is the only part you author by hand. Example: "Coherence (1–5): how well-structured and logically ordered the summary is, so the sentences build into a connected whole rather than a pile of facts."

Step 2 — The model generates evaluation steps

G-Eval asks the judge model to expand your short criteria into a concrete, numbered checklist of how to evaluate. This is a one-time generation per metric — the model effectively writes its own grading rubric from your description. Doing this in the open (chain-of-thought) rather than in its head is what makes the later score more reliable and consistent.
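
As a rough illustration of Steps 1 and 2, the sketch below packages a hand-written criterion and builds the one-time prompt that asks the judge to write its own checklist. The `Metric` container, the `build_steps_prompt` helper, and the prompt wording are all assumptions made for this example; they are not the paper's exact prompt or any library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """A hand-written criterion (hypothetical container, not a library type)."""
    name: str
    criteria: str
    min_score: int = 1
    max_score: int = 5


def build_steps_prompt(metric: Metric) -> str:
    """Build the one-time prompt asking the judge to write evaluation steps."""
    return (
        f"You will evaluate outputs on one metric: {metric.name} "
        f"({metric.min_score}-{metric.max_score}).\n\n"
        f"Evaluation criteria:\n{metric.criteria}\n\n"
        "Write a short numbered list of concrete evaluation steps a grader "
        "should follow to apply these criteria. Do not score anything yet."
    )


coherence = Metric(
    name="Coherence",
    criteria=(
        "How well-structured and logically ordered the summary is, so the "
        "sentences build into a connected whole rather than a pile of facts."
    ),
)

# This string would be sent to the judge model once per metric;
# the reply is the checklist reused for every later grade.
steps_prompt = build_steps_prompt(coherence)
print(steps_prompt)
```

The judge's reply to this prompt is plain text, so you can read it, edit it, and store it alongside the metric definition.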

Step 3 — The judge scores via form-filling

For each output you want to grade, the judge reads the generated steps, the input, and the output, reasons through the checklist, and then emits a single score in the agreed range. Constraining the answer to one slot at the end is the "form-filling" part — it keeps the output machine-readable.
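
The form-filling idea can be sketched in two small helpers: one that assembles the scoring prompt with a fixed final slot, and one that reads the score back out. The template wording, the `Score:` slot name, and both function names are assumptions for illustration, not a standard format.

```python
import re
from typing import Optional

# Hypothetical template: the final line is the "form" the judge fills in.
FORM_TEMPLATE = """{steps}

Source:
{source}

Summary:
{output}

Work through the steps above, then finish with exactly one line:
Score: <integer from {lo} to {hi}>"""


def build_scoring_prompt(steps: str, source: str, output: str,
                         lo: int = 1, hi: int = 5) -> str:
    """Combine the generated steps, the input, and the output to grade."""
    return FORM_TEMPLATE.format(steps=steps, source=source, output=output,
                                lo=lo, hi=hi)


def parse_score(reply: str, lo: int = 1, hi: int = 5) -> Optional[int]:
    """Read the last 'Score: N' line; return None if missing or out of range."""
    matches = re.findall(r"^Score:\s*(\d+)\s*$", reply, flags=re.MULTILINE)
    if not matches:
        return None
    score = int(matches[-1])
    return score if lo <= score <= hi else None


reply = "1. The ordering is clear.\n2. One abrupt jump in paragraph two.\nScore: 4"
print(parse_score(reply))  # 4
```

Returning `None` for a missing or out-of-range slot lets the caller retry or flag the grade instead of silently recording a bad number.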

The probability-weighting refinement

The original paper adds one more refinement. A plain 1–5 integer score is coarse: many outputs land on the same "4," so the metric cannot tell good fours from great fours. To get a finer, more continuous score, G-Eval looks at the probabilities the model assigns to each possible score token and takes their probability-weighted average (the expected score). If the model puts 70% probability on 4 and 30% on 5, the final score is 0.7 × 4 + 0.3 × 5 = 4.3 instead of a flat 4.
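
The weighted average is just an expected value: each candidate score multiplied by its probability, then summed. The sketch below assumes you already have log-probabilities for the tokens at the score position, keyed by token text; how you obtain them depends on your model provider, and both function names here are hypothetical.

```python
import math


def probs_from_logprobs(token_logprobs: dict[str, float],
                        lo: int = 1, hi: int = 5) -> dict[int, float]:
    """Keep only tokens that are valid scores; convert log-probs to probabilities."""
    valid = {str(s) for s in range(lo, hi + 1)}
    probs: dict[int, float] = {}
    for token, logprob in token_logprobs.items():
        text = token.strip()
        if text in valid:
            # Variants such as "4" and " 4" are merged into one score.
            probs[int(text)] = probs.get(int(text), 0.0) + math.exp(logprob)
    return probs


def weighted_score(score_probs: dict[int, float]) -> float:
    """Expected score: sum of score * probability, renormalised over valid scores."""
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("no probability mass on any valid score")
    return sum(score * p for score, p in score_probs.items()) / total


print(round(weighted_score({4: 0.7, 5: 0.3}), 2))  # 4.3
```

Renormalising matters because some probability mass usually falls on tokens that are not scores at all; dividing by the total keeps the result inside the score range.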

A G-Eval metric in practice

In code, G-Eval is usually just a few lines: you name the metric, write the criteria, and the library handles step-generation and scoring. Here is the shape of it using DeepEval, the library that made G-Eval popular among engineers.

summary_coherence.py (Python)
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# 1) Define the metric in plain English. The library will turn this
#    into chain-of-thought evaluation steps automatically.
coherence = GEval(
    name="Coherence",
    criteria=(
        "Judge whether the summary is well-structured and logically "
        "ordered, so the sentences connect into a coherent whole "
        "rather than a disjointed list of facts."
    ),
    # Which fields of the test case the judge is allowed to look at.
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

# 2) One thing to grade: the input and the model's output.
test_case = LLMTestCase(
    input="<the full source article>",
    actual_output="<the summary your app produced>",
)

# 3) Score it. measure() runs the judge and returns 0..1 here,
#    plus a human-readable reason for the grade.
coherence.measure(test_case)
print(coherence.score)   # e.g. 0.82
print(coherence.reason)  # the judge's explanation

The key mental model from DeepEval is the test case: an input, the actual_output your app produced, and optionally an expected_output and retrieval_context. A G-Eval metric is a reusable assertion over that test case, so it slots into a test suite the same way a unit test does — run it in CI, fail the build if the score drops below a threshold.
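
To make the "fail the build" idea concrete, here is a library-independent sketch of a threshold gate over a batch of scores. The helper names, the case ids, and the 0.7 threshold are assumptions for illustration; a real suite would take the scores from whatever metric runner you use.

```python
def failing_cases(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Return the ids of cases scoring below the threshold, sorted for stable output."""
    return sorted(case_id for case_id, score in scores.items() if score < threshold)


def check_suite(scores: dict[str, float], threshold: float = 0.7) -> None:
    """Raise AssertionError (failing a CI job) if any case is below the threshold."""
    failures = failing_cases(scores, threshold)
    if failures:
        raise AssertionError(
            f"{len(failures)} case(s) below {threshold}: {', '.join(failures)}"
        )


# Hypothetical scores on the 0..1 scale, keyed by test-case id.
scores = {"article-001": 0.82, "article-002": 0.55, "article-003": 0.70}
print(failing_cases(scores))  # ['article-002']
```

Reporting which cases failed, and not only that something failed, keeps the gate debuggable: you can open the judge's reason for each listed id.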

G-Eval vs other ways to grade

G-Eval is one tool among several. Knowing when it is the right one — and when something simpler or different fits better — saves you cost and pain.

| Approach | What it measures | Best when |
| --- | --- | --- |
| Exact match / regex | Did the output equal a known answer | There is one correct answer (a label, a number, JSON shape) |
| Overlap metrics (BLEU, ROUGE) | Word overlap with a reference | You have references and only need a cheap, rough signal |
| G-Eval | A custom quality you describe, via CoT + form-filling | Open-ended output, subjective criteria, no single right answer |
| Pairwise LLM judging | Which of two outputs is better | You compare two systems/prompts head-to-head, not absolute scores |
| Human review | Whatever a person decides | Ground truth, calibration, and high-stakes spot checks |

Two distinctions trip people up. First, G-Eval scores one output on an absolute scale (a 4.3 out of 5), whereas pairwise judging ranks two outputs against each other. Pairwise is often more stable for comparing systems; G-Eval is better when you need a standalone quality number you can threshold and track over time.

Second, G-Eval is fully customizable, which is both its strength and its risk. Purpose-built metrics (say, a RAG faithfulness metric) come pre-validated for one job; a G-Eval metric does exactly what your sentence says — including your mistakes. If your criteria are vague, the score will be too.

Common pitfalls

G-Eval inherits every weakness of LLM-as-a-judge, plus a few of its own. None are fatal, but ignoring them produces confident, meaningless numbers.

  • Vague criteria. "Rate the quality" tells the judge nothing. The narrower and more concrete your description, the more consistent the scores. Spell out what loses points.
  • Judge biases. LLM judges tend to favor longer answers (verbosity bias), prefer outputs from their own model family (self-preference), and can be swayed by position or formatting. These are the same judge biases that affect any model-graded eval.
  • No calibration against humans. A G-Eval score is only trustworthy if it agrees with human judgment on a sample you have checked. Always validate the metric before you trust it at scale — see calibrating an LLM judge and judge vs human agreement.
  • Score instability. Run the same grade twice and you may get a 4 and then a 5. Lower the judge's temperature, average several runs, or use the probability-weighting refinement to smooth this out.
  • Treating it as ground truth. A G-Eval number is an estimate of quality from a fallible model, not a measurement from a ruler. Report it as a trend signal, and keep human spot-checks in the loop.
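
Calibration against humans can start very simply: score a small sample both ways and compare. The sketch below computes three rough agreement figures with the standard library only. The function name, the choice of Pearson correlation, and the one-point tolerance are assumptions for illustration; rank correlations such as Spearman are another common choice.

```python
import math


def agreement_report(judge: list[float], human: list[float],
                     tolerance: float = 1.0) -> dict[str, float]:
    """Compare judge scores with human scores on the same examples."""
    if len(judge) != len(human) or len(judge) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(judge)
    mean_j = sum(judge) / n
    mean_h = sum(human) / n
    cov = sum((j - mean_j) * (h - mean_h) for j, h in zip(judge, human))
    var_j = sum((j - mean_j) ** 2 for j in judge)
    var_h = sum((h - mean_h) ** 2 for h in human)
    if var_j == 0 or var_h == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return {
        # Linear correlation between the two sets of scores.
        "pearson": cov / math.sqrt(var_j * var_h),
        # Average size of the disagreement, in score points.
        "mean_abs_error": sum(abs(j - h) for j, h in zip(judge, human)) / n,
        # Share of examples where judge and human differ by at most `tolerance`.
        "within_tolerance": sum(abs(j - h) <= tolerance
                                for j, h in zip(judge, human)) / n,
    }


# Hypothetical sample: judge scores vs. human labels on four summaries.
report = agreement_report([4.3, 2.0, 5.0, 3.1], [4, 2, 5, 3], tolerance=0.5)
print(report)
```

What counts as "good enough" agreement is a judgment call for your task; the point is to look at these numbers before trusting the metric at scale.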

Going deeper

Once the basics click, a few directions are worth knowing as you push G-Eval from a demo into something you can rely on.

Better evaluation-step prompts. The auto-generated checklist in Step 2 is itself a prompt you can inspect and improve. Hand-writing clearer steps, or constraining them to a tight rubric, often tightens agreement with humans more than swapping judge models does. This is the same craft as writing good chain-of-thought judge prompts.

Continuous vs discrete scores. The probability-weighted average is what makes G-Eval finer-grained than a plain integer, but it requires log-probs. When you cannot get them, a common substitute is to ask the judge for a 0–100 score directly, or to sample the grade several times and average. You lose the theoretical cleanliness but keep most of the benefit.
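
When log-probs are unavailable, sampling the grade several times gives an empirical estimate of the same distribution. In the sketch below, the list of sampled integer scores is assumed to come from repeated judge calls at a non-zero temperature; the function names are hypothetical.

```python
from collections import Counter


def score_distribution(samples: list[int]) -> dict[int, float]:
    """Estimate each score's probability from how often it was sampled."""
    if not samples:
        raise ValueError("need at least one sampled score")
    counts = Counter(samples)
    return {score: count / len(samples) for score, count in sorted(counts.items())}


def sampled_score(samples: list[int]) -> float:
    """Frequency-weighted average of the sampled scores (equal to their mean)."""
    return sum(score * p for score, p in score_distribution(samples).items())


# Ten hypothetical samples: seven 4s and three 5s.
samples = [4, 4, 5, 4, 4, 5, 4, 4, 5, 4]
print(score_distribution(samples))       # {4: 0.7, 5: 0.3}
print(round(sampled_score(samples), 2))  # 4.3
```

The trade-off is cost: each sample is another judge call, so the estimate gets smoother only as you pay for more of them.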

Cost and latency. Every grade is a full model call with reasoning, so evaluating thousands of outputs is not free. Teams often grade a representative sample rather than every example, use a cheaper judge for routine runs and a stronger one for releases, and cache evaluation steps so they are generated once per metric, not once per example.
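
Caching the evaluation steps can be as simple as memoising the function that generates them. In this sketch `generate_steps` is a hypothetical stand-in for the expensive judge call; it only counts invocations so the effect of the cache is visible.

```python
from functools import lru_cache

calls = {"count": 0}


def generate_steps(criteria: str) -> str:
    """Hypothetical stand-in for the judge call that writes the checklist."""
    calls["count"] += 1
    return (f"1. Re-read the criteria: {criteria}\n"
            "2. Check the output against each point.")


@lru_cache(maxsize=None)
def cached_steps(criteria: str) -> str:
    """Generate steps once per distinct criteria string, then reuse them."""
    return generate_steps(criteria)


criteria = "The summary is well-structured and logically ordered."
for _ in range(3):  # three outputs graded against the same metric
    steps = cached_steps(criteria)

print(calls["count"])  # 1
```

An in-memory cache only lasts for one process; persisting the generated steps to a file or alongside the metric definition also keeps the rubric stable between runs.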

Where it fits. G-Eval is a method, not a whole testing strategy. In practice it lives inside a larger suite alongside cheap deterministic checks (does it return valid JSON? does it avoid a banned word?) and, for some tasks, pairwise comparisons. Use the cheap checks for things with a right answer, and reserve G-Eval for the open-ended qualities only a reasoning judge can assess.
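
The cheap deterministic checks mentioned above need no model at all. A minimal sketch, assuming the output is a string and the banned words are single words matched case-insensitively; the function names are hypothetical.

```python
import json
import re


def is_valid_json(text: str) -> bool:
    """True if the text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def contains_banned_word(text: str, banned: list[str]) -> bool:
    """True if any banned word appears as a whole word, ignoring case."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return any(word.lower() in words for word in banned)


print(is_valid_json('{"price": 10}'))                                 # True
print(is_valid_json("{price: 10}"))                                   # False
print(contains_banned_word("This is Guaranteed to work", ["guaranteed"]))  # True
```

Running checks like these first means the judge is only called on outputs that already pass the basics, which saves both cost and noise.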

The honest summary: G-Eval is a clever, practical way to make subjective quality measurable, and it correlates with humans far better than overlap metrics. But it is a model judging a model — it is biased, a little noisy, and only as good as the criteria you write and the calibration you do. Treat its scores as a strong, debuggable signal you keep checking, not as an oracle. To zoom out to the broader method this sits inside, start with LLM-as-a-judge explained and its known pitfalls.

FAQ

What is G-Eval in simple terms?

G-Eval is a way to grade open-ended LLM outputs using another LLM as the judge. You describe your quality criteria in plain English; the judge generates step-by-step evaluation instructions (chain-of-thought), then reasons through them and fills in a score on a fixed scale. It is one specific, well-known form of LLM-as-a-judge.

Why does G-Eval use chain-of-thought?

Because reasoning out loud before scoring makes the grade more consistent and more aligned with human judgment than blurting out a number. The model first lays out concrete evaluation steps, then works through them, then assigns a score. You also get a readable explanation of why an output won or lost points.

What is the difference between G-Eval and DeepEval?

G-Eval is the method — chain-of-thought plus form-filling to score against custom criteria, introduced in a 2023 research paper. DeepEval is an open-source testing library that implements G-Eval as a ready-to-use metric, so most engineers first meet G-Eval through DeepEval. The method is provider-neutral; the library is one tool that offers it.

How is the G-Eval score calculated?

The judge picks a score in your chosen range (often 1 to 5). In the original paper, instead of taking a flat integer, G-Eval reads the probabilities the model assigns to each possible score and takes their weighted average, producing a smoother number like 4.3. When token probabilities are not available, tools approximate this by sampling the grade several times and averaging, or by asking the judge for a finer-grained score such as 0–100.

Is G-Eval reliable enough to trust?

It correlates with human judgment much better than word-overlap metrics, but it is still a model grading a model — so it carries biases (verbosity, self-preference) and some run-to-run noise. Treat its scores as a strong trend signal, calibrate the metric against a human-checked sample first, and keep human spot-checks in the loop for high-stakes decisions.

When should I use G-Eval instead of an exact-match check?

Use exact match or regex when there is one correct answer (a label, a number, a required JSON shape). Reach for G-Eval when the output is open-ended and quality is subjective — summaries, chatbot replies, tone, coherence, faithfulness — where no single string can be the right answer.

Further reading