
[Feature] LLM-based (QA Accuracy) eval algorithm #163

@athewsey

Description


The metrics-based approaches in the QAAccuracy eval algorithm seem to harshly penalize verbose models (like Claude) on datasets with concise reference answers (like SQuAD).

It'd be useful if this library could provide support for LLM-based evaluation of LLM results: for example, asking a model whether the reference answer and the generated answer agree or disagree. I'd imagine it working something along the lines of LlamaIndex's CorrectnessEvaluator?
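To illustrate the idea, here's a rough sketch of what that critique step could look like. `invoke_judge` is a placeholder for whatever client actually calls the critique model, and the prompt template and AGREE/DISAGREE parsing are my own assumptions, not taken from LlamaIndex or from this library:

```python
# Sketch only: `invoke_judge` is a hypothetical callable that sends a prompt
# to the critique model (e.g. via Bedrock or a SageMaker endpoint) and
# returns its text response.
JUDGE_TEMPLATE = """You are grading a question-answering system.

Question: {question}
Reference answer: {reference}
Model answer: {candidate}

Does the model answer agree with the reference answer? Reply with exactly one
word, AGREE or DISAGREE, followed by a one-sentence justification."""


def judge_agreement(invoke_judge, question: str, reference: str, candidate: str) -> bool:
    """Ask the critique model whether the candidate answer agrees with the reference."""
    prompt = JUDGE_TEMPLATE.format(question=question, reference=reference, candidate=candidate)
    verdict = invoke_judge(prompt).strip().upper()
    return verdict.startswith("AGREE")
```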

As I understand it, it should be possible in theory to implement something like this by building a custom EvalAlgorithmInterface-based class (a rough sketch follows the list below), but there are a lot of design questions to consider, like:

  • Is it possible to control whether the same LLM, a different LLM, or a panel of multiple LLMs gets used for the evaluation step, versus the original answer generation?
  • Since there are lots of different ways to use LLMs for self-critique, maybe a QAAccuracyByLLMCritic should be a subtype of some broader class? It'd certainly be interesting to use LLMs to judge other aspects like relevancy, or specific aspects of tone (e.g. "did it discuss my competitor companies XYZ?").
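
To make those design questions a bit more concrete, here's a minimal sketch (under my own assumptions, not a proposal for the final API) of how a QAAccuracyByLLMCritic could sit on top of a broader LLM-critic base class, with the judge model(s) injected independently of the model under evaluation so the same LLM, a different one, or a panel can run the critique. The class names, the EvalScore shape, and the simple vote-averaging are all hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: a judge takes a prompt string and returns the raw response text.
Judge = Callable[[str], str]


@dataclass
class EvalScore:  # simplified stand-in for the library's score record
    name: str
    value: float


class LLMCriticEvalAlgorithm:
    """Hypothetical base for eval algorithms that delegate scoring to critic LLM(s).

    The judges are configured separately from the model under evaluation, so the
    same LLM, a different LLM, or a panel of several can run the critique step.
    """

    def __init__(self, judges: List[Judge], prompt_template: str):
        self.judges = judges
        self.prompt_template = prompt_template

    def critique(self, **fields) -> float:
        prompt = self.prompt_template.format(**fields)
        votes = [j(prompt).strip().upper().startswith("AGREE") for j in self.judges]
        return sum(votes) / len(votes)  # fraction of judges that voted AGREE


class QAAccuracyByLLMCritic(LLMCriticEvalAlgorithm):
    """Scores QA correctness by asking the judge(s) whether the answers agree."""

    def evaluate_sample(self, question: str, target_output: str, model_output: str) -> List[EvalScore]:
        score = self.critique(question=question, reference=target_output, candidate=model_output)
        return [EvalScore(name="qa_accuracy_by_llm_critic", value=score)]
```

Other aspects (relevancy, tone, "did it mention competitors XYZ") would then just be different prompt templates and score names on the same base; a real base class would presumably also let subclasses control how the judge's verdict gets parsed into a score, rather than hard-coding AGREE/DISAGREE.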
