The metrics-based approaches in the QAAccuracy eval algorithm seem to harshly penalize verbose models (like Claude) on datasets with concise reference answers (like SQuAD).
It'd be useful if this library could provide support for LLM-based evaluation of LLM results: for example, asking a model whether the reference answer and the generated answer agree or disagree. I'd imagine it working something along the lines of LlamaIndex's CorrectnessEvaluator.
As I understand it, something like this should be possible in principle by building a custom EvalAlgorithmInterface-based class (see the rough sketch after the list below), but there are a lot of design questions to consider, like:
- Is it possible to control whether the same LLM, a different LLM, or a panel of multiple LLMs gets used for the evaluation step, versus the original answer generation?
- Since there are lots of different ways to use LLMs for self-critique, maybe `QAAccuracyByLLMCritic` should be a subtype of some broader class? Certainly it'd be interesting to use LLMs to judge other aspects like relevancy, or specific aspects of tone (e.g. "did it discuss my competitor companies XYZ?").
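
To make this concrete, here's a rough sketch of the kind of thing I mean. It isn't built on fmeval's actual classes; the `JudgeModel` protocol, the judge prompt, and the names `QAAccuracyByLLMCritic` / `evaluate_sample` are all illustrative assumptions, and a real version would presumably subclass `EvalAlgorithmInterface` so it plugs into the existing evaluation flow:

```python
from dataclasses import dataclass
from typing import Protocol


class JudgeModel(Protocol):
    """Anything that can answer a free-form prompt (hypothetical interface):
    the model under test, a different model, or a panel wrapper."""

    def complete(self, prompt: str) -> str: ...


JUDGE_PROMPT = """You are grading a question-answering system.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Does the candidate answer agree with the reference answer?
Reply with exactly one word: AGREE or DISAGREE."""


@dataclass
class JudgeScore:
    agrees: bool
    raw_judgement: str


class QAAccuracyByLLMCritic:
    """Illustrative only -- a real implementation would be an
    EvalAlgorithmInterface-based class rather than a standalone one."""

    def __init__(self, judge: JudgeModel):
        # Which LLM acts as the judge is decided here, outside the
        # algorithm logic: same model, different model, or a panel.
        self.judge = judge

    def evaluate_sample(self, question: str, reference: str, candidate: str) -> JudgeScore:
        prompt = JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate
        )
        verdict = self.judge.complete(prompt).strip().upper()
        return JudgeScore(agrees=verdict.startswith("AGREE"), raw_judgement=verdict)
```

One nice property of passing the judge in as its own object is that the "same LLM vs. different LLM vs. panel of LLMs" question stays out of the eval algorithm itself: a panel could just be another `JudgeModel` implementation that fans out to several models and takes a majority vote.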