Conversation

@schipiga schipiga commented Dec 17, 2024

Hi!

This is an attempt to implement the G-Eval LLM self-assessment approach in promptfoo. It required a bit of code polish (already done), and I'm interested in whether the maintainers would like to include it in promptfoo, since as far as I know there was already an issue requesting this.

It's inspired by:

This is an example of G-Eval in promptfoo.
I used the following criteria:

  • Coherence - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the reply should be well-structured and well-organized. The reply should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic."
  • Consistency - the factual alignment between the reply and the source. A factually consistent reply contains only statements that are entailed by the source document. Annotators were also asked to penalize replies that contained hallucinated facts.
  • Fluency - the quality of the reply in terms of grammar, spelling, punctuation, word choice, and sentence structure.
  • Relevance - selection of important content for the source. The reply should include only important information for the source document. Annotators were instructed to penalize replies which contained redundancies and excess information.

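Under the hood, an approach like this turns each criterion into a rubric prompt and parses a structured verdict from the grading LLM. Here is a minimal sketch of that flow; the function names, prompt wording, JSON shape, and 0.7 threshold are illustrative assumptions, not promptfoo's actual implementation:

```python
import json

# Sketch of a G-Eval-style grader: build a rubric prompt from a criterion,
# ask an LLM for a JSON verdict, and normalize the 0-10 score to 0-1.
# All names and the prompt shape here are illustrative, not promptfoo's code.

def build_geval_prompt(criterion: str, source: str, reply: str) -> str:
    return (
        "You are grading a reply against a source text.\n"
        f"Evaluation criterion: {criterion}\n\n"
        f"Source Text:\n{source}\n\n"
        f"Reply:\n{reply}\n\n"
        'Respond with JSON: {"score": <0-10 integer>, "reason": "<explanation>"}'
    )

def grade(llm_response: str, threshold: float = 0.7) -> dict:
    """Parse the LLM's JSON verdict and decide pass/fail."""
    verdict = json.loads(llm_response)
    normalized = verdict["score"] / 10  # map 0-10 rubric score to 0-1
    return {
        "pass": normalized >= threshold,
        "score": normalized,
        "reason": verdict["reason"],
    }

# Example with a canned LLM answer (no real API call):
result = grade('{"score": 9, "reason": "Fluent, minor structure issues."}')
print(result["pass"], result["score"])  # True 0.9
```

A real implementation would also have to handle malformed JSON from the grader, which is exactly the failure mode discussed later in this thread.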
These criteria gave the following G-Eval responses:

  • score: 10, reason: The reply is well-structured and logically organized. Each sentence builds upon the previous one and collectively provides a comprehensive explanation of why smoking and alcohol are harmful to health. The information presented forms a coherent body of information, clearly supporting the overall topic.
  • score: 10, reason: The reply is grammatically correct, with appropriate tenses, spelling, punctuation, word choice, and clear sentence structure. It effectively communicates the harmful effects of smoking and alcohol on health.
  • score: 10, reason: The reply succinctly covers all the key points mentioned in the source text, discussing harmful chemicals from smoking, their effects on respiratory and cardiovascular health, addictive nature of nicotine, impact of alcohol on the liver and nervous system, and the overall risk to life expectancy. There is no redundant or unnecessary information, making the reply highly relevant and concise.
  • score: 10, reason: The reply aligns perfectly with the source text. It correctly identifies that smoking introduces harmful chemicals that affect the lungs, cardiovascular system, and can lead to cancer. It also accurately describes how alcohol can damage the liver and central nervous system. All factual claims are supported by the source text without including any additional or hallucinated information.

(The UI-rendered result isn't very informative for passed tests.)

More details about the prompts are in the matchers.ts code. Could you please review it?

Regards, Sergei

@schipiga changed the title from "Add G-Eval assertion" to "feat: add G-Eval assertion" on Dec 17, 2024

schipiga commented Dec 17, 2024

Also, here is an example of the implemented G-Eval evaluation for an LLM reply to the request How can I learn another language if I have small free time daily and I have poor memory?:

  • score: 10, reason: The reply exhibits a clear and logical structure, building from introductory advice to specific strategies aligned with the source text's challenges. Each sentence supports the overall topic cohesively, forming a unified body of information.
  • score: 9, reason: The reply is fluent with clear grammar, correct spelling, appropriate word choice, and well-structured sentences. Minor improvement could be made for more variety in sentence structure.
  • score: 9, reason: The reply is highly relevant and provides specific strategies that align with the source text's concerns about limited time and poor memory. It does not include superfluous details. However, the reply could slightly focus more on the unique memory challenges posed by the user.
  • score: 10, reason: The reply is factually consistent with the source text. It provides realistic strategies for learning a language with limited time and poor memory, such as setting realistic goals, using language apps, and practicing speaking. All suggestions are entailed by the source text, which mentions the constraints of small free time and poor memory.

Test config example:

prompts:
  - >-
    How can I learn another language if I have small free time daily and I have
    poor memory?
providers:
  - id: openai:gpt-4o
    config:
      organization: ''
      temperature: 0.5
      max_tokens: 1024
      top_p: 1
      frequency_penalty: 0
      presence_penalty: 0
scenarios: []
tests:
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Coherence - the collective quality of all sentences. We align this
          dimension with the DUC quality question of structure and coherence
          whereby "the reply should be well-structured and well-organized. The
          reply should not just be a heap of related information, but should
          build from sentence to a coherent body of information about a topic."
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Consistency - the factual alignment between the reply and the source.
          A factually consistent reply contains only statements that are
          entailed by the source document. Annotators were also asked to
          penalize replies that contained hallucinated facts.
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Fluency - the quality of the reply in terms of grammar, spelling,
          punctuation, word choice, and sentence structure.
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Relevance - selection of important content for the source. The reply
          should include only important information for the source document.
          Annotators were instructed to penalize replies which contained
          redundancies and excess information.
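Each g-eval assertion above sends its criterion string to the grading model independently, so a run yields one 0-10 score per criterion. If a single combined verdict is wanted, the per-criterion scores (such as those reported earlier in this thread) can be normalized and averaged; the averaging rule in this sketch is my assumption, not anything promptfoo prescribes:

```python
# Illustrative aggregation of per-criterion G-Eval scores (0-10 scale).
# The equal-weight averaging rule is an assumption for this sketch only.
scores = {"coherence": 10, "consistency": 10, "fluency": 9, "relevance": 9}

normalized = {name: s / 10 for name, s in scores.items()}  # map to 0-1
overall = sum(normalized.values()) / len(normalized)

print(round(overall, 3))  # 0.95
```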

@schipiga

Hi @typpo, could you please help with the review?

@schipiga force-pushed the feature/g-eval-assert branch 5 times, most recently from 7ab2523 to 7108008 on December 17, 2024 at 16:55
typpo commented Dec 17, 2024

This is a great addition, @schipiga!

I'm noticing that about half the time the grader thinks the Source Text and Reply are missing when I run the example:

promptfoo eval --no-write --no-cache

Might be worth a look?

@schipiga

Hi @typpo, thank you for the report. Yep, indeed, it looks like the LLM sometimes can't extract the required parts. I will try to add a clarification to the prompt.
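For reference, one common way to make such extraction robust is to mark the two inputs with unambiguous delimiters and fail fast when either is empty. This sketch is only my guess at the kind of clarification meant here, not the actual patch:

```python
# Sketch of a delimiter-based grading prompt (illustrative; not the real fix).
def build_delimited_prompt(source: str, reply: str) -> str:
    # Fail fast instead of letting the grader claim a section is "missing".
    if not source.strip() or not reply.strip():
        raise ValueError("Source Text and Reply must both be non-empty")
    return (
        "Grade the Reply against the Source Text.\n"
        "--- BEGIN SOURCE TEXT ---\n"
        f"{source}\n"
        "--- END SOURCE TEXT ---\n"
        "--- BEGIN REPLY ---\n"
        f"{reply}\n"
        "--- END REPLY ---\n"
    )

print("BEGIN REPLY" in build_delimited_prompt("some source", "some reply"))  # True
```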

@schipiga force-pushed the feature/g-eval-assert branch from 7108008 to 40e2b3d on December 17, 2024 at 19:54
@schipiga

@typpo it looks like the minor prompt tweaks helped. I ran it a dozen times and didn't hit the issue. It may still happen occasionally, since we're interacting with an LLM, but I think much more rarely than before the update. Could you please review and check as well?

@schipiga force-pushed the feature/g-eval-assert branch 2 times, most recently from d1a9c8a to 4aa3739 on December 18, 2024 at 17:37
@schipiga

Hi @typpo, I think the prompt now works correctly. I also updated the test example to use multiple values. Could you please review the PR?


typpo commented Dec 18, 2024 via email

@schipiga force-pushed the feature/g-eval-assert branch from 4aa3739 to 500631d on December 19, 2024 at 08:53
@typpo typpo merged commit e822ec6 into promptfoo:main Dec 20, 2024
20 checks passed

typpo commented Dec 20, 2024

Awesome change! Thanks @schipiga

sklein12 pushed a commit that referenced this pull request Jan 11, 2025