@nkaenzig nkaenzig commented Oct 10, 2025

Closes #913

What this PR does

  • Adds basic abstractions for LLM Judge implementations to eva.language.metrics.llm_judge
  • Implements the G-Eval LLM Judge framework and a specific torchmetrics-compatible metric GEvalCorrectness focused on answer correctness (a rough sketch of the metric's shape follows this list)
    • Differences from the original implementation in the paper:
      • Evaluation steps are provided as input to the prompt, rather than being produced on the fly by the model as proposed in the original paper. This makes results more robust and deterministic, and is recommended here.
      • No confidence-weighted scoring (this requires access to log probabilities, which are not available for many API models).
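
To make that shape concrete, here is a minimal, hypothetical sketch of a torchmetrics-compatible judge metric along the lines described above. The class, method, and state names are illustrative assumptions, not the actual eva.language.metrics.llm_judge API, and the judge call is stubbed out.

from typing import List

import torch
import torchmetrics


class JudgeMetricSketch(torchmetrics.Metric):
    """Accumulates per-sample 1-5 judge scores and reports their mean."""

    def __init__(self, model: str) -> None:
        super().__init__()
        self.model = model
        self.add_state("scores", default=[], dist_reduce_fx="cat")

    def _judge(self, pred: str, target: str) -> int:
        # Placeholder for the real judge call: fill the evaluation-steps prompt,
        # send it to `self.model`, and parse the returned JSON "score" field.
        return 5 if pred.strip() == target.strip() else 1

    def update(self, preds: List[str], targets: List[str]) -> None:
        for pred, target in zip(preds, targets):
            self.scores.append(torch.tensor(float(self._judge(pred, target))))

    def compute(self) -> torch.Tensor:
        # Mean of the accumulated per-sample scores.
        return torch.stack(list(self.scores)).mean()

The real GEvalCorrectness presumably differs in how it builds the prompt and calls the model; the sketch only shows how per-sample scores could be accumulated and averaged inside a torchmetrics Metric.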

Sources:

In particular, the prompt template used here was strongly inspired by the well-established implementation from the deepeval framework: https://github.com/confident-ai/deepeval/blob/main/deepeval/metrics/g_eval/template.py

Example

# Import path assumed from the eva.language.metrics.llm_judge module described above.
from eva.language.metrics.llm_judge import GEvalCorrectness

metric = GEvalCorrectness(model="google/gemini-2.5-flash-lite")

preds = [
    "The capital of France is Paris.",
    "The capital of Germany is Berlin.",
]
targets = [
    "The capital of France is Paris.",
    "The capital of Germany is Munich.",
]

metric.update(preds=preds, targets=targets)
print("G-Eval Correctness:", metric.compute().item())

-> The first example will yield a correctness score of 5, and the second a score of 1.
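
Assuming the metric follows standard torchmetrics semantics (an assumption; the snippet above only shows update/compute), it can be reset between evaluation runs in the usual way:

metric.reset()  # clears the accumulated judge scores before the next evaluation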

Generated prompt

You are an evaluator. Given the following evaluation steps, assess the Model Response below and return a JSON object with two fields:

- `"score"`: an integer between 1 and 5, where higher is better.
- `"reason"`: a brief explanation for why the score was given. This must mention specific strengths or shortcomings, referencing relevant details from the input. Do **not** quote the score itself in the explanation.

Your explanation should:
- Be specific and grounded in the evaluation steps.
- Mention key details from the model response and ground truth.
- Be concise, clear, and focused on the evaluation logic.

Only return valid JSON. Do **not** include any extra commentary or text.

---

Evaluation Steps:
1. Read the Model Response and Ground Truth carefully
2. Identify Key Facts: Extract all important facts, claims, and information from the Ground Truth response.
3. Assess Correctness & Completeness: For each key fact in the Ground Truth, determine if it appears in the Model Response (exactly or paraphrased), and evaluate whether all essential information from Ground Truth is present.
4. Identify Errors: Note any factual contradictions or inaccuracies in the Model Response compared to Ground Truth.

Scoring Criteria:
5 (Excellent): Model response captures all key facts from ground truth accurately. Information is complete and correct, with no factual errors or contradictions. May use different wording but conveys equivalent meaning.
4 (Good): Model response captures most key facts correctly with no significant errors. May miss 1-2 minor details, but all major points are present and accurate.
3 (Acceptable): Model response captures about half of the key information accurately. Some important facts are missing, or there are minor inaccuracies, but no major contradictions with ground truth.
2 (Poor): Model response captures only a small portion of key facts. Major information is missing and may contain factual errors or contradictions with ground truth.
1 (Very Poor): Model response is largely incorrect or incomplete, missing most key facts. Contains significant factual errors or contradictions with ground truth.

Model Response:
The capital of France is Paris.

Ground Truth:
The capital of France is Paris.

---
**Example JSON:**
{
    "reason": "your concise and informative reason here",
    "score": 1
}

JSON:
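
Since the judge is instructed to return only a JSON object like the example above, the metric has to decode that reply. The snippet below is a hedged illustration of how such a reply could be parsed; the function name and the regex fallback are assumptions, not the actual implementation.

import json
import re


def parse_judge_reply(reply: str) -> tuple[int, str]:
    """Extract the integer score and the reason from a judge reply."""
    # Models sometimes wrap the JSON in markdown fences or stray text,
    # so grab the first {...} block before decoding it.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in judge reply")
    payload = json.loads(match.group(0))
    return int(payload["score"]), str(payload.get("reason", ""))


print(parse_judge_reply('{"reason": "Exact match with ground truth.", "score": 5}'))
# -> (5, 'Exact match with ground truth.')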

@nkaenzig nkaenzig linked an issue Oct 10, 2025 that may be closed by this pull request
@nkaenzig nkaenzig self-assigned this Oct 10, 2025
@nkaenzig nkaenzig marked this pull request as ready for review October 13, 2025 10:14
@nkaenzig nkaenzig marked this pull request as draft October 13, 2025 10:24
@MaxFeucht MaxFeucht left a comment

Great addition, only minor comments

@nkaenzig nkaenzig marked this pull request as ready for review October 16, 2025 06:53
@nkaenzig nkaenzig enabled auto-merge (squash) October 16, 2025 07:11
@nkaenzig nkaenzig merged commit 7a750aa into main Oct 16, 2025
7 checks passed
@nkaenzig nkaenzig deleted the 913-add-g-eval-llm-judge branch October 16, 2025 07:22