@nkaenzig nkaenzig commented Oct 10, 2025

Closes #913

What this PR does

  • Adds basic abstractions for LLM Judge implementations to eva.language.metrics.llm_judge
  • Implements the G-Eval LLM Judge framework and a specific torchmetrics-compatible metric GEvalCorrectness focused on answer correctness (a rough sketch of the metric's shape follows this list)
    • Differences from the original implementation in the paper:
      • Evaluation steps are provided as input to the prompt, rather than being produced on the fly by the model as proposed in the original paper. This makes results more robust and deterministic, and is recommended here.
      • No confidence-weighted scoring (this requires access to log probabilities, which are not available for many API models).
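
To make that shape concrete, here is a minimal, hypothetical sketch of a torchmetrics-compatible judge metric along the lines described above. The class, method, and state names are illustrative assumptions, not the actual eva.language.metrics.llm_judge API, and the judge call is stubbed out.

from typing import List

import torch
import torchmetrics


class JudgeMetricSketch(torchmetrics.Metric):
    """Accumulates per-sample 1-5 judge scores and reports their mean."""

    def __init__(self, model: str) -> None:
        super().__init__()
        self.model = model
        self.add_state("scores", default=[], dist_reduce_fx="cat")

    def _judge(self, pred: str, target: str) -> int:
        # Placeholder for the real judge call: fill the evaluation-steps prompt,
        # send it to `self.model`, and parse the returned JSON "score" field.
        return 5 if pred.strip() == target.strip() else 1

    def update(self, preds: List[str], targets: List[str]) -> None:
        for pred, target in zip(preds, targets):
            self.scores.append(torch.tensor(float(self._judge(pred, target))))

    def compute(self) -> torch.Tensor:
        # Mean of the accumulated per-sample scores.
        return torch.stack(list(self.scores)).mean()

The real GEvalCorrectness presumably differs in how it builds the prompt and calls the model; the sketch only shows how per-sample scores could be accumulated and averaged inside a torchmetrics Metric.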

Sources:

In particular, the prompt template used here was strongly inspired by the well-established implementation from the deepeval framework: https://github.com/confident-ai/deepeval/blob/main/deepeval/metrics/g_eval/template.py

Example

# Import path assumed from the eva.language.metrics.llm_judge module described above.
from eva.language.metrics.llm_judge import GEvalCorrectness

metric = GEvalCorrectness(model="google/gemini-2.5-flash-lite")

preds = [
    "The capital of France is Paris.",
    "The capital of Germany is Berlin.",
]
targets = [
    "The capital of France is Paris.",
    "The capital of Germany is Munich.",
]

metric.update(preds=preds, targets=targets)
print("G-Eval Correctness:", metric.compute().item())

-> The first example will yield a correctness score of 5, and the second a score of 1.
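
Assuming the metric follows standard torchmetrics semantics (an assumption; the snippet above only shows update/compute), it can be reset between evaluation runs in the usual way:

metric.reset()  # clears the accumulated judge scores before the next evaluation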

Generated prompt

You are an evaluator. Given the following evaluation steps, assess the Model Response below and return a JSON object with two fields:

- `"score"`: an integer between 1 and 5, where higher is better.
- `"reason"`: a brief explanation for why the score was given. This must mention specific strengths or shortcomings, referencing relevant details from the input. Do **not** quote the score itself in the explanation.

Your explanation should:
- Be specific and grounded in the evaluation steps.
- Mention key details from the model response and ground truth.
- Be concise, clear, and focused on the evaluation logic.

Only return valid JSON. Do **not** include any extra commentary or text.

---

Evaluation Steps:
1. Read the Model Response and Ground Truth carefully
2. Identify Key Facts: Extract all important facts, claims, and information from the Ground Truth response.
3. Assess Correctness & Completeness: For each key fact in the Ground Truth, determine if it appears in the Model Response (exactly or paraphrased), and evaluate whether all essential information from Ground Truth is present.
4. Identify Errors: Note any factual contradictions or inaccuracies in the Model Response compared to Ground Truth.

Scoring Criteria:
5 (Excellent): Model response captures all key facts from ground truth accurately. Information is complete and correct, with no factual errors or contradictions. May use different wording but conveys equivalent meaning.
4 (Good): Model response captures most key facts correctly with no significant errors. May miss 1-2 minor details, but all major points are present and accurate.
3 (Acceptable): Model response captures about half of the key information accurately. Some important facts are missing, or there are minor inaccuracies, but no major contradictions with ground truth.
2 (Poor): Model response captures only a small portion of key facts. Major information is missing and may contain factual errors or contradictions with ground truth.
1 (Very Poor): Model response is largely incorrect or incomplete, missing most key facts. Contains significant factual errors or contradictions with ground truth.

Model Response:
The capital of France is Paris.

Ground Truth:
The capital of France is Paris.

---
**Example JSON:**
{
    "reason": "your concise and informative reason here",
    "score": 1
}

JSON:
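
Since the judge is instructed to return only a JSON object like the example above, the metric has to decode that reply. The snippet below is a hedged illustration of how such a reply could be parsed; the function name and the regex fallback are assumptions, not the actual implementation.

import json
import re


def parse_judge_reply(reply: str) -> tuple[int, str]:
    """Extract the integer score and the reason from a judge reply."""
    # Models sometimes wrap the JSON in markdown fences or stray text,
    # so grab the first {...} block before decoding it.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in judge reply")
    payload = json.loads(match.group(0))
    return int(payload["score"]), str(payload.get("reason", ""))


print(parse_judge_reply('{"reason": "Exact match with ground truth.", "score": 5}'))
# -> (5, 'Exact match with ground truth.')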

@nkaenzig nkaenzig linked an issue Oct 10, 2025 that may be closed by this pull request
@nkaenzig nkaenzig self-assigned this Oct 10, 2025
@nkaenzig nkaenzig marked this pull request as ready for review October 13, 2025 10:14
@nkaenzig nkaenzig marked this pull request as draft October 13, 2025 10:24
@MaxFeucht MaxFeucht left a comment

Great addition, only minor comments

@nkaenzig nkaenzig marked this pull request as ready for review October 16, 2025 06:53
@nkaenzig nkaenzig enabled auto-merge (squash) October 16, 2025 07:11
@nkaenzig nkaenzig merged commit 7a750aa into main Oct 16, 2025
7 checks passed
@nkaenzig nkaenzig deleted the 913-add-g-eval-llm-judge branch October 16, 2025 07:22