The metrics-based approaches in the QAAccuracy eval algorithm seem to harshly penalize verbose models (like Claude) on datasets with concise reference answers (like SQuAD).
It'd be useful if this library could provide support for LLM-based evaluation of LLM results: for example, asking a model whether the reference answer and the generated answer agree or disagree. I'd imagine it working something along the lines of LlamaIndex's CorrectnessEvaluator.
As I understand it, something like this should be possible in principle by building a custom EvalAlgorithmInterface-based class (see the rough sketch after the list below), but there are a lot of design questions to consider, like:
- Is it possible to control whether the same LLM, a different LLM, or a panel of multiple LLMs gets used for the evaluation step, versus the original answer generation?
- Since there are lots of different ways to use LLMs for self-critique, maybe `QAAccuracyByLLMCritic` should be a subtype of some broader class? Certainly it'd be interesting to use LLMs to judge other aspects like relevancy, or specific aspects of tone (e.g. "did it discuss my competitor companies XYZ?").
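
To make this concrete, here's a rough sketch of the kind of thing I mean. It isn't built on fmeval's actual classes; the `JudgeModel` protocol, the judge prompt, and the names `QAAccuracyByLLMCritic` / `evaluate_sample` are all illustrative assumptions, and a real version would presumably subclass `EvalAlgorithmInterface` so it plugs into the existing evaluation flow:

```python
from dataclasses import dataclass
from typing import Protocol


class JudgeModel(Protocol):
    """Anything that can answer a free-form prompt (hypothetical interface):
    the model under test, a different model, or a panel wrapper."""

    def complete(self, prompt: str) -> str: ...


JUDGE_PROMPT = """You are grading a question-answering system.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Does the candidate answer agree with the reference answer?
Reply with exactly one word: AGREE or DISAGREE."""


@dataclass
class JudgeScore:
    agrees: bool
    raw_judgement: str


class QAAccuracyByLLMCritic:
    """Illustrative only -- a real implementation would be an
    EvalAlgorithmInterface-based class rather than a standalone one."""

    def __init__(self, judge: JudgeModel):
        # Which LLM acts as the judge is decided here, outside the
        # algorithm logic: same model, different model, or a panel.
        self.judge = judge

    def evaluate_sample(self, question: str, reference: str, candidate: str) -> JudgeScore:
        prompt = JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate
        )
        verdict = self.judge.complete(prompt).strip().upper()
        return JudgeScore(agrees=verdict.startswith("AGREE"), raw_judgement=verdict)
```

One nice property of passing the judge in as its own object is that the "same LLM vs. different LLM vs. panel of LLMs" question stays out of the eval algorithm itself: a panel could just be another `JudgeModel` implementation that fans out to several models and takes a majority vote.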