Finance RAG Evaluation for Assessing Kay
FREAK aims to provide an evaluation framework for running analyses against Kay's finance data RAG. Plug in your API call, run the tester, and we'll take care of the rest.
- In `tester.py`, fill out the `call_my_code` function with a call to your API and convert the response into a `RagResult` (more details below; see the sketch after this list).
- Get a Cohere API key (if you don't have one, you can create one here - cohere signup).
- Run `python tester.py --cohere-api-key <COHERE_KEY> --kay-api-key <KAY_API_KEY> [--verbose] [--save-chunks-file </path/output_file_name.json>] [--query-override-file </path/to/query.txt>]`
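As a starting point, here is a hedged sketch of what `call_my_code` might look like if your retriever sits behind an HTTP endpoint. The function signature, the endpoint URL, and the response shape (`"chunks"` with a `"text"` field) are assumptions for illustration, not the repo's actual code; adapt them to your API.

```python
# RagResult and RagDocument are defined in this repo and are already available in tester.py.
import requests

def call_my_code(query: str) -> RagResult:
    # Call your own retrieval API (hypothetical endpoint and payload).
    resp = requests.post(
        "https://api.example.com/retrieve",   # hypothetical endpoint
        json={"query": query},
        timeout=30,
    )
    resp.raise_for_status()
    chunks = resp.json()["chunks"]            # assumed response shape

    # Wrap each retrieved chunk's text in a RagDocument and return them as a RagResult.
    return RagResult(docs=[RagDocument(text=chunk["text"]) for chunk in chunks])
```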
If you want to add custom queries, use the `--queries-override-file` flag.
To see all chunks that each API outputs, use the `--save-chunks-file` flag.
For a full breakdown of the available command-line flags, run `python tester.py --help`.
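The override file format isn't documented in this README; a plain-text file with one query per line is the assumption in this hypothetical example:

```text
What was Apple's total revenue in Q3 2023?
How did rising interest rates affect regional bank deposits in 2023?
```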
`RagResult` helps us bring retrieved results into a common structure for easy comparison.
A `RagResult` is an array of `RagDocument` objects.
```python
RagResult(
    docs=[
        RagDocument
    ],
)
```
Full definition is here - RagResult
`RagDocument` holds the raw text of the retrieved context, along with optional metadata.
```python
doc1 = RagDocument(text="<TEXT_OF_YOUR_DOC_HERE>")
```
Full definition is here - RagDocument
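For instance, converting a couple of retrieved chunks into the common structure looks like this (the chunk texts are purely illustrative):

```python
# Wrap each retrieved chunk's text in a RagDocument, then collect them in a RagResult.
doc1 = RagDocument(text="Q3 2023 revenue grew 12% year over year, driven by services.")
doc2 = RagDocument(text="The company repurchased $1.2B of common stock during the quarter.")

my_result = RagResult(docs=[doc1, doc2])
```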
We use Cohere re-ranking scores as a proxy for the relevance of retrieved context. While we acknowledge the evident shortcomings, this is a quick way to sanity-check two retriever systems against each other without a golden test set. If you have a golden test set internally, we can add more metrics to compare two retriever systems.
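As a rough illustration of the proxy, the sketch below scores each retrieved document against a query with Cohere's rerank endpoint. The helper function, model name, and exact SDK response shape are assumptions and may not match what `tester.py` does internally.

```python
import cohere

def rerank_scores(cohere_api_key: str, query: str, result: RagResult) -> list[float]:
    """Score each retrieved document's relevance to the query with Cohere rerank."""
    co = cohere.Client(cohere_api_key)
    response = co.rerank(
        model="rerank-english-v3.0",   # assumed model choice
        query=query,
        documents=[doc.text for doc in result.docs],
    )
    # Each rerank result carries the original document index and a relevance score.
    return [r.relevance_score for r in response.results]
```

Comparing the score distributions from two retrievers over the same queries gives a quick, if imperfect, relevance signal.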
At Kay, we are pushing the boundaries of RAG. One of the biggest challenges is continuously and accurately evaluating a retriever system. The intention behind this library is twofold:
- We use it internally to test improvements confidently and track changes.
- We made it publicly available so our users can test Kay's retriever system against their own.
On that note, we would love for you to contribute to this package.