This is the repository for the paper "LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA".
Run the scripts in order (a driver sketch follows the list):

- `python3 run_qa.py`: generate predicted answers with the QA models
- `python3 postprocess_qa.py`: postprocess the predicted answers
- `python3 run_judge.py`: judge the postprocessed answers with the LLM judges
- `python3 run_eval.py`: evaluate the judged outputs
- `python3 get_correlation_score.py`: compute correlation scores against the human judgements
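Assuming each script runs without required command-line arguments (an assumption; add flags to match your setup), a minimal driver for the five stages could look like this:

```python
import subprocess
import sys

# Pipeline stages in execution order; assumes each script needs no
# command-line arguments (append them to the command lists if yours do).
PIPELINE = [
    "run_qa.py",
    "postprocess_qa.py",
    "run_judge.py",
    "run_eval.py",
    "get_correlation_score.py",
]

for script in PIPELINE:
    print(f"==> {script}")
    # check=True aborts the pipeline as soon as any stage fails.
    subprocess.run([sys.executable, script], check=True)
```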
Repository layout:

- `data`: input data
- `data/human_result.json`: human judgement dataset (used for the correlation sketch below)
- `qa_inference`: predicted answers from the 8 QA models
- `qa_postprocess`: postprocessed versions of the predicted answers
- `judge/mistral-v0.3`: judgements produced by mistral-v0.3
- `judge/llama-3.3-70b`: judgements produced by llama-3.3-70b
- `judge/qwen-2.5-72b`: judgements produced by qwen-2.5-72b
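Conceptually, the correlation step compares judge decisions with the human annotations. The sketch below shows one way to compute such agreement with `scipy.stats`; the inline score lists are made-up placeholders, since the on-disk schema of the judge outputs and `data/human_result.json` is not documented here:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-example correctness scores. In the real pipeline these
# would be read from data/human_result.json and a judge/* output file.
human_scores = [1, 0, 1, 1, 0, 1, 0, 1]
judge_scores = [1, 0, 1, 0, 0, 1, 0, 1]

rho, rho_p = spearmanr(human_scores, judge_scores)
r, r_p = pearsonr(human_scores, judge_scores)
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3g})")
print(f"Pearson  r   = {r:.3f} (p = {r_p:.3g})")
```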