This is the repository for the paper "LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA".
Run the scripts in order (a driver sketch follows the list):

- `python3 run_qa.py`: generate predicted answers with the QA models
- `python3 postprocess_qa.py`: postprocess the predicted answers
- `python3 run_judge.py`: judge the postprocessed answers with the LLM judges
- `python3 run_eval.py`: evaluate the judged outputs
- `python3 get_correlation_score.py`: compute correlation scores against the human judgements
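Assuming each script runs without required command-line arguments (an assumption; add flags to match your setup), a minimal driver for the five stages could look like this:

```python
import subprocess
import sys

# Pipeline stages in execution order; assumes each script needs no
# command-line arguments (append them to the command lists if yours do).
PIPELINE = [
    "run_qa.py",
    "postprocess_qa.py",
    "run_judge.py",
    "run_eval.py",
    "get_correlation_score.py",
]

for script in PIPELINE:
    print(f"==> {script}")
    # check=True aborts the pipeline as soon as any stage fails.
    subprocess.run([sys.executable, script], check=True)
```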
Repository layout:

- `data`: input data
- `data/human_result.json`: human judgement dataset (used for the correlation sketch below)
- `qa_inference`: predicted answers from the 8 QA models
- `qa_postprocess`: postprocessed versions of the predicted answers
- `judge/mistral-v0.3`: judgements produced by mistral-v0.3
- `judge/llama-3.3-70b`: judgements produced by llama-3.3-70b
- `judge/qwen-2.5-72b`: judgements produced by qwen-2.5-72b
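Conceptually, the correlation step compares judge decisions with the human annotations. The sketch below shows one way to compute such agreement with `scipy.stats`; the inline score lists are made-up placeholders, since the on-disk schema of the judge outputs and `data/human_result.json` is not documented here:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-example correctness scores. In the real pipeline these
# would be read from data/human_result.json and a judge/* output file.
human_scores = [1, 0, 1, 1, 0, 1, 0, 1]
judge_scores = [1, 0, 1, 0, 0, 1, 0, 1]

rho, rho_p = spearmanr(human_scores, judge_scores)
r, r_p = pearsonr(human_scores, judge_scores)
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3g})")
print(f"Pearson  r   = {r:.3f} (p = {r_p:.3g})")
```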