Conversation

@schipiga schipiga commented Dec 17, 2024

Hi!

This is an attempt to implement the G-Eval LLM self-assessment approach in promptfoo. It required a bit of code polish (already done), and I'm interested in whether the maintainers would like to include it in promptfoo, since as far as I know there was already an issue requesting this.

It's inspired by:

This is an example of G-Eval in promptfoo.
I used the following criteria:

  • Coherence - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the reply should be well-structured and well-organized. The reply should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic."
  • Consistency - the factual alignment between the reply and the source. A factually consistent reply contains only statements that are entailed by the source document. Annotators were also asked to penalize replies that contained hallucinated facts.
  • Fluency - the quality of the reply in terms of grammar, spelling, punctuation, word choice, and sentence structure.
  • Relevance - selection of important content for the source. The reply should include only important information for the source document. Annotators were instructed to penalize replies which contained redundancies and excess information.

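Under the hood, an approach like this turns each criterion into a rubric prompt and parses a structured verdict from the grading LLM. Here is a minimal sketch of that flow; the function names, prompt wording, JSON shape, and 0.7 threshold are illustrative assumptions, not promptfoo's actual implementation:

```python
import json

# Sketch of a G-Eval-style grader: build a rubric prompt from a criterion,
# ask an LLM for a JSON verdict, and normalize the 0-10 score to 0-1.
# All names and the prompt shape here are illustrative, not promptfoo's code.

def build_geval_prompt(criterion: str, source: str, reply: str) -> str:
    return (
        "You are grading a reply against a source text.\n"
        f"Evaluation criterion: {criterion}\n\n"
        f"Source Text:\n{source}\n\n"
        f"Reply:\n{reply}\n\n"
        'Respond with JSON: {"score": <0-10 integer>, "reason": "<explanation>"}'
    )

def grade(llm_response: str, threshold: float = 0.7) -> dict:
    """Parse the LLM's JSON verdict and decide pass/fail."""
    verdict = json.loads(llm_response)
    normalized = verdict["score"] / 10  # map 0-10 rubric score to 0-1
    return {
        "pass": normalized >= threshold,
        "score": normalized,
        "reason": verdict["reason"],
    }

# Example with a canned LLM answer (no real API call):
result = grade('{"score": 9, "reason": "Fluent, minor structure issues."}')
print(result["pass"], result["score"])  # True 0.9
```

A real implementation would also have to handle malformed JSON from the grader, which is exactly the failure mode discussed later in this thread.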
These criteria gave the following G-Eval responses:

  • score: 10, reason: The reply is well-structured and logically organized. Each sentence builds upon the previous one and collectively provides a comprehensive explanation of why smoking and alcohol are harmful to health. The information presented forms a coherent body of information, clearly supporting the overall topic.
  • score: 10, reason: The reply is grammatically correct, with appropriate tenses, spelling, punctuation, word choice, and clear sentence structure. It effectively communicates the harmful effects of smoking and alcohol on health.
  • score: 10, reason: The reply succinctly covers all the key points mentioned in the source text, discussing harmful chemicals from smoking, their effects on respiratory and cardiovascular health, addictive nature of nicotine, impact of alcohol on the liver and nervous system, and the overall risk to life expectancy. There is no redundant or unnecessary information, making the reply highly relevant and concise.
  • score: 10, reason: The reply aligns perfectly with the source text. It correctly identifies that smoking introduces harmful chemicals that affect the lungs, cardiovascular system, and can lead to cancer. It also accurately describes how alcohol can damage the liver and central nervous system. All factual claims are supported by the source text without including any additional or hallucinated information.

(The UI-rendered result isn't very informative for passed tests.)

More details about the prompts are in the matchers.ts code. Could you please review it?

Regards, Sergei

@schipiga changed the title from "Add G-Eval assertion" to "feat: add G-Eval assertion" on Dec 17, 2024

schipiga commented Dec 17, 2024

Also, here is an example of the implemented G-Eval evaluation for an LLM reply to the request How can I learn another language if I have small free time daily and I have poor memory?:

  • score: 10, reason: The reply exhibits a clear and logical structure, building from introductory advice to specific strategies aligned with the source text's challenges. Each sentence supports the overall topic cohesively, forming a unified body of information.
  • score: 9, reason: The reply is fluent with clear grammar, correct spelling, appropriate word choice, and well-structured sentences. Minor improvement could be made for more variety in sentence structure.
  • score: 9, reason: The reply is highly relevant and provides specific strategies that align with the source text's concerns about limited time and poor memory. It does not include superfluous details. However, the reply could slightly focus more on the unique memory challenges posed by the user.
  • score: 10, reason: The reply is factually consistent with the source text. It provides realistic strategies for learning a language with limited time and poor memory, such as setting realistic goals, using language apps, and practicing speaking. All suggestions are entailed by the source text, which mentions the constraints of small free time and poor memory.

Test config example:

prompts:
  - >-
    How can I learn another language if I have small free time daily and I have
    poor memory?
providers:
  - id: openai:gpt-4o
    config:
      organization: ''
      temperature: 0.5
      max_tokens: 1024
      top_p: 1
      frequency_penalty: 0
      presence_penalty: 0
scenarios: []
tests:
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Coherence - the collective quality of all sentences. We align this
          dimension with the DUC quality question of structure and coherence
          whereby "the reply should be well-structured and well-organized. The
          reply should not just be a heap of related information, but should
          build from sentence to a coherent body of information about a topic."
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Consistency - the factual alignment between the reply and the source.
          A factually consistent reply contains only statements that are
          entailed by the source document. Annotators were also asked to
          penalize replies that contained hallucinated facts.
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Fluency - the quality of the reply in terms of grammar, spelling,
          punctuation, word choice, and sentence structure.
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Relevance - selection of important content for the source. The reply
          should include only important information for the source document.
          Annotators were instructed to penalize replies which contained
          redundancies and excess information.
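Each g-eval assertion above sends its criterion string to the grading model independently, so a run yields one 0-10 score per criterion. If a single combined verdict is wanted, the per-criterion scores (such as those reported earlier in this thread) can be normalized and averaged; the averaging rule in this sketch is my assumption, not anything promptfoo prescribes:

```python
# Illustrative aggregation of per-criterion G-Eval scores (0-10 scale).
# The equal-weight averaging rule is an assumption for this sketch only.
scores = {"coherence": 10, "consistency": 10, "fluency": 9, "relevance": 9}

normalized = {name: s / 10 for name, s in scores.items()}  # map to 0-1
overall = sum(normalized.values()) / len(normalized)

print(round(overall, 3))  # 0.95
```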

@schipiga

Hi @typpo, could you please help with the review?

@schipiga force-pushed the feature/g-eval-assert branch 5 times, most recently from 7ab2523 to 7108008 on December 17, 2024 at 16:55
typpo commented Dec 17, 2024

This is a great addition, @schipiga!

I'm noticing that about half the time the grader thinks the Source Text and Reply are missing when I run the example:

promptfoo eval --no-write --no-cache

Might be worth a look?

@schipiga

Hi @typpo, thank you for the report. Yep, indeed, it looks like the LLM sometimes can't extract the required parts. I will try to add a clarification to the prompt.
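For reference, one common way to make such extraction robust is to mark the two inputs with unambiguous delimiters and fail fast when either is empty. This sketch is only my guess at the kind of clarification meant here, not the actual patch:

```python
# Sketch of a delimiter-based grading prompt (illustrative; not the real fix).
def build_delimited_prompt(source: str, reply: str) -> str:
    # Fail fast instead of letting the grader claim a section is "missing".
    if not source.strip() or not reply.strip():
        raise ValueError("Source Text and Reply must both be non-empty")
    return (
        "Grade the Reply against the Source Text.\n"
        "--- BEGIN SOURCE TEXT ---\n"
        f"{source}\n"
        "--- END SOURCE TEXT ---\n"
        "--- BEGIN REPLY ---\n"
        f"{reply}\n"
        "--- END REPLY ---\n"
    )

print("BEGIN REPLY" in build_delimited_prompt("some source", "some reply"))  # True
```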

@schipiga force-pushed the feature/g-eval-assert branch from 7108008 to 40e2b3d on December 17, 2024 at 19:54
@schipiga

@typpo it looks like the minor prompt tweaks helped. I ran it a dozen times and didn't hit the issue. It may still happen occasionally, since we're interacting with an LLM, but I think much more rarely than before the update. Could you please review and check as well?

@schipiga force-pushed the feature/g-eval-assert branch 2 times, most recently from d1a9c8a to 4aa3739 on December 18, 2024 at 17:37
@schipiga

Hi @typpo, I think the prompt now works correctly. I also updated the test example to use multiple values. Could you please review the PR?


typpo commented Dec 18, 2024 via email

@schipiga force-pushed the feature/g-eval-assert branch from 4aa3739 to 500631d on December 19, 2024 at 08:53
@typpo typpo merged commit e822ec6 into promptfoo:main Dec 20, 2024
20 checks passed

typpo commented Dec 20, 2024

Awesome change! Thanks @schipiga

sklein12 pushed a commit that referenced this pull request Jan 11, 2025