This repository is for Think-J: Learning to Think for Generative LLM-as-a-Judge
Our implementation is based on Llama-Factory and verl, so you can refer to their repositories to prepare the environment.
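A minimal environment sketch is shown below. It follows the public installation instructions of the two projects, and the division of roles between them (Llama-Factory for the SFT/DPO stages, verl for the GRPO stage) is inferred from the scripts referenced later in this README; check the upstream READMEs for the exact dependency versions (CUDA, vLLM, flash-attention, etc.) required on your machine.

```bash
# Hedged setup sketch; follow the upstream READMEs for version-pinned instructions.
conda create -n think-j python=3.10 -y && conda activate think-j

# Llama-Factory (used here for the SFT / DPO training stages)
git clone https://github.com/hiyouga/LLaMA-Factory.git
pip install -e ./LLaMA-Factory

# verl (used here for the GRPO training stage)
git clone https://github.com/volcengine/verl.git
pip install -e ./verl
```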
For the initialization data, we use the preference data from Skywork; the processed data with thinking-trace annotations is provided in the data directory.
For thinking optimization, we use the preference data from Helpsteer2 and HH-RLHF. For the strength annotations, we leverage the preference annotations from hh-rlhf-strength-cleaned. Please download these datasets and place them in your data directory.
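The commands below read dataset files from a per-dataset subdirectory of your data directory. The layout sketched here is inferred from the paths used in those commands (shown for DATASET=helpsteer2 and PROMPT=strength), not prescribed by the repository:

```bash
# Expected layout (inferred from the commands below; adjust to the files you actually prepare):
DATA_DIR=/path/to/your/data
ls ${DATA_DIR}/helpsteer2
# helpsteer2-infer.json                        # preference data used for thinking optimization
# helpsteer2-strength-ppo.parquet              # train split for online RL (GRPO)
# helpsteer2-strength-ppo-test.parquet         # test split for online RL (GRPO)
# skywork707-strength-671BR1-judgement.json    # SFT initialization data, judge (provided in data/)
# skywork707-strength-671BR1-critique.json     # SFT initialization data, critic (provided in data/)
```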
Based on the LIMJ707 data, initialize the judge with thinking ability via supervised fine-tuning (SFT).
DATASET="helpsteer2"
MODEL="Qwen2.5-7B-Instruct"
PROMPT=strength
DATA_DIR=/path/to/your/data
MODEL_DIR=/path/to/your/model
R1JUDGE_VER="judge-skywork707-${PROMPT}-671BR1"
R1CRITIC_VER="critic-skywork707-${PROMPT}-671BR1"
MOS_REPO=${MODEL}-STaR
DATA=skywork707-${PROMPT}-671BR1-judgement.json
bash train_sft.sh ${MODEL_DIR}/${MOS_REPO}-${R1JUDGE_VER} $DATA_DIR/${DATASET}/$DATA ${MODEL_DIR}/${MODEL}
DATA=skywork707-${PROMPT}-671BR1-critique.json
bash train_sft.sh ${MODEL_DIR}/${MOS_REPO}-${R1CRITIC_VER} $DATA_DIR/${DATASET}/$DATA ${MODEL_DIR}/${MODEL}
```
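The positional arguments of train_sft.sh are not documented here; the reading below is inferred from the two calls above, so confirm it against the script before adapting it:

```bash
# Inferred argument order for train_sft.sh (verify against the script):
#   $1 -> output directory for the fine-tuned model (the thinking-initialized judge / critic)
#   $2 -> SFT data file with thinking-trace annotations (judgement or critique data)
#   $3 -> base model to fine-tune (e.g. Qwen2.5-7B-Instruct)
```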
Based on the preference data, optimize the thinking ability of the judge with either offline RL or online RL.

For offline RL, run the following scripts to first construct preference pairs and then perform DPO training.
DATASET="helpsteer2"
MODEL="Qwen2.5-7B-Instruct"
PROMPT=strength
DATA_DIR=/path/to/your/data
MODEL_DIR=/path/to/your/model
R1JUDGE_VER="judge-skywork707-${PROMPT}-671BR1"
R1CRITIC_VER="critic-skywork707-${PROMPT}-671BR1"
MOS_REPO=${MODEL}-STaR
python3.10 star/infer_judge_create_prompt.py \
--model-path ${MODEL_DIR}/${MOS_REPO}-${R1JUDGE_VER} \
--input-file ${DATA_DIR}/${DATASET}/${DATASET}-infer.json \
--output-file ${DATA_DIR}/${DATASET}/${DATASET}-${MODEL}-judgement-jud.json \
--prompt-type ${PROMPT}_judge_prompt
python3.10 star/infer_critic_create_prompt.py \
--model-path ${MODEL_DIR}/${MOS_REPO}-${R1CRITIC_VER} \
--input-file ${DATA_DIR}/${DATASET}/${DATASET}-infer.json \
--is-positive True \
--output-file ${DATA_DIR}/${DATASET}/${DATASET}-${MODEL}-judgement-pos.json \
--prompt-type ${PROMPT}_judge_prompt
python3.10 star/infer_critic_create_prompt.py \
--model-path ${MODEL_DIR}/${MOS_REPO}-${R1CRITIC_VER} \
--input-file ${DATA_DIR}/${DATASET}/${DATASET}-infer.json \
--is-positive False \
--output-file ${DATA_DIR}/${DATASET}/${DATASET}-${MODEL}-judgement-neg.json \
--prompt-type ${PROMPT}_judge_prompt
python3.10 star/create_judgement_pairs.py \
${DATA_DIR}/${DATASET}/${DATASET}-${MODEL}-judgement-jud.json \
${DATA_DIR}/${DATASET}/${DATASET}-${MODEL}-judgement-pos.json \
${DATA_DIR}/${DATASET}/${DATASET}-${MODEL}-judgement-neg.json \
${DATA_DIR}/${DATASET}/${DATASET}-${MODEL}-judgement.json \
${DATA_DIR}/${DATASET}/${DATASET}-${MODEL}-judgement-sft.json
DATA=${DATASET}-${MODEL}-judgement.json
JUDGE_MODEL=${MOS_REPO}-${DATASET}-judge-dpo-${PROMPT}-671BR1
bash train_dpo.sh ${MODEL_DIR}/${JUDGE_MODEL} $DATA_DIR/${DATASET}/$DATA ${MODEL_DIR}/${MOS_REPO}-${R1JUDGE_VER}
```
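The roles of the intermediate files and the train_dpo.sh arguments are not spelled out above; the summary below is inferred from the commands themselves and should be checked against the scripts:

```bash
# Inferred roles of the intermediate files above (confirm against the scripts):
#   *-judgement-jud.json   output of infer_judge_create_prompt.py (judge-side generations)
#   *-judgement-pos.json   output of infer_critic_create_prompt.py with --is-positive True
#   *-judgement-neg.json   output of infer_critic_create_prompt.py with --is-positive False
#   *-judgement.json       preference pairs assembled by create_judgement_pairs.py (used for DPO)
#   *-judgement-sft.json   auxiliary SFT-format output of create_judgement_pairs.py
#
# Inferred positional arguments of train_dpo.sh (same pattern as train_sft.sh):
#   $1 -> output directory for the DPO-optimized judge
#   $2 -> preference-pair data file
#   $3 -> model to initialize from (the SFT-initialized judge)
```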
DATASET="helpsteer2"
MODEL="Qwen2.5-7B-Instruct"
PROMPT=strength
DATA_DIR=/path/to/your/data
MODEL_DIR=/path/to/your/model
R1JUDGE_VER="judge-skywork707-${PROMPT}-671BR1"
R1CRITIC_VER="critic-skywork707-${PROMPT}-671BR1"
MOS_REPO=${MODEL}-STaR
DATA=${DATASET}-${PROMPT}-ppo.parquet
DATA_TEST=${DATASET}-${PROMPT}-ppo-test.parquet
JUDGE_MODEL=${MOS_REPO}-${DATASET}-judge-grpo-${PROMPT}-671BR1
bash train_grpo.sh ${MODEL_DIR}/${JUDGE_MODEL} $DATA_DIR/${DATASET}/${DATA} $DATA_DIR/${DATASET}/${DATA_TEST} ${MODEL_DIR}/${MOS_REPO}-${R1JUDGE_VER} 8 reward_${PROMPT}
FINAL_STEP=$(cat ${MODEL_DIR}/${JUDGE_MODEL}/latest_checkpointed_iteration.txt)
python3.10 verl/scripts/model_merger.py \
--local_dir ${MODEL_DIR}/${JUDGE_MODEL}/global_step_${FINAL_STEP}/actor
```
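The train_grpo.sh arguments and the purpose of the merging step are summarized below; this reading is inferred from the call above and from verl's checkpoint layout, so verify it against the scripts:

```bash
# Inferred arguments of train_grpo.sh as used above (confirm in the script):
#   $1 -> output directory for the GRPO-optimized judge
#   $2 -> training data (parquet)          $3 -> held-out test data (parquet)
#   $4 -> model to initialize from         $5 -> a numeric run setting (e.g. GPU count)
#   $6 -> reward function, selected by prompt type (here: reward_strength)

# Sanity-check that the final actor checkpoint exists before merging and evaluation
# (path layout assumed from the merging command above):
FINAL_STEP=$(cat ${MODEL_DIR}/${JUDGE_MODEL}/latest_checkpointed_iteration.txt)
ls ${MODEL_DIR}/${JUDGE_MODEL}/global_step_${FINAL_STEP}/actor
```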
We mainly conduct evaluation on RewardBench. Run the following script to test your generative judge.

```bash
PROMPT=strength
JUDGE_MODEL=Qwen2.5-7B-Instruct-STaR-helpsteer2-judge-grpo-strength-671BR1
MODEL_PATH=/path/to/your/model
python3.10 star/eval_reward_bench.py \
--model-path ${MODEL_PATH}/${JUDGE_MODEL} \
--prompt-type ${PROMPT}_judge_prompt
```

This repository is built on Llama-Factory and verl. Many thanks for their excellent work!