feat: add G-Eval assertion #2436
Conversation
Also, an example of the implemented G-Eval evaluation of an LLM reply to a request.
Tests config example:

prompts:
  - >-
    How can I learn another language if I have small free time daily and I have
    poor memory?
providers:
  - id: openai:gpt-4o
    config:
      organization: ''
      temperature: 0.5
      max_tokens: 1024
      top_p: 1
      frequency_penalty: 0
      presence_penalty: 0
scenarios: []
tests:
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Coherence - the collective quality of all sentences. We align this
          dimension with the DUC quality question of structure and coherence
          whereby "the reply should be well-structured and well-organized. The
          reply should not just be a heap of related information, but should
          build from sentence to a coherent body of information about a topic."
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Consistency - the factual alignment between the reply and the source.
          A factually consistent reply contains only statements that are
          entailed by the source document. Annotators were also asked to
          penalize replies that contained hallucinated facts.
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Fluency - the quality of the reply in terms of grammar, spelling,
          punctuation, word choice, and sentence structure.
  - description: ''
    vars: {}
    assert:
      - type: g-eval
        value: >-
          Relevance - selection of important content for the source. The reply
          should include only important information for the source document.
          Annotators were instructed to penalize replies which contained
          redundancies and excess information.
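For reference, a more compact variant is sketched below. The `threshold` field is an assumption here (other score-based promptfoo assertions accept one); it is not confirmed anywhere in this thread, so check matchers.ts for the actual behavior:

```yaml
# Hedged sketch: a single g-eval assertion with an explicit pass threshold.
# `threshold` is assumed to work like other score-based promptfoo assertions
# (pass when the normalized score meets or exceeds it) - verify against matchers.ts.
tests:
  - vars: {}
    assert:
      - type: g-eval
        value: >-
          Fluency - the quality of the reply in terms of grammar, spelling,
          punctuation, word choice, and sentence structure.
        threshold: 0.7
```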
Hi @typpo, could you please help with the review?
Force-pushed from 7ab2523 to 7108008.
This is a great addition @schipiga! I'm noticing ~half the time the grader thinks that
Might be worth a look?
Hi @typpo, thank you for the report. Yep, indeed, it looks like sometimes the LLM can't extract the required parts. I will try to add a clarification to the prompt.
Force-pushed from 7108008 to 40e2b3d.
@typpo it looks like the minor prompt tweaks helped. I ran it a dozen times and didn't hit the issue. It may still happen occasionally, since we're interacting with an LLM, but I think much more rarely than before the update. Could you please review and check as well?
Force-pushed from d1a9c8a to 4aa3739.
Hi @typpo, I think the prompt now works correctly. I have also updated the test example for multiple values. Can you please review the PR?
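A hedged sketch of what the updated "multiple values" example might look like, assuming `value` can take a YAML list of criteria (an inference from the comment above, not a confirmed shape); the PR's own example config is the authoritative form:

```yaml
# Hedged sketch: one g-eval assertion carrying several criteria as a list.
# Whether `value` accepts a list (vs. one string per assertion) is an assumption
# based on the "multiple values" comment - see the PR's test example for the real shape.
tests:
  - vars: {}
    assert:
      - type: g-eval
        value:
          - Coherence - the reply should build sentence by sentence into a well-organized whole.
          - Consistency - the reply should contain only statements entailed by the source, with no hallucinated facts.
          - Fluency - grammar, spelling, punctuation, word choice, and sentence structure should be sound.
          - Relevance - the reply should include only important information and avoid redundancy.
```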
will take a look shortly!
Force-pushed from 4aa3739 to 500631d.
awesome change! thanks @schipiga

Hi!
This is an attempt to implement the G-Eval LLM self-assessment approach in promptfoo.
It needed a bit of code polish (that's done already), and I'm interested in whether the maintainers would like to include it in promptfoo, because as far as I know there was already an issue requesting this. It's inspired by:
This is an example of G-Eval in promptfoo.
I used the following criteria:
Which gave the following G-Eval responses:
(The UI-rendered result isn't very informative for passed tests.)

More details about the prompts are in the matchers.ts code. Could you please review it?
Regards, Sergei