Reinforcement Learning from Verifiable Rewards

In this project, I fine-tune the Qwen3-1.7B model using the Group Relative Policy Optimization (GRPO) algorithm, with LoRA adapters for parameter-efficient fine-tuning. The dataset was taken from dataset
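To make the "group relative" part of GRPO concrete, here is a minimal sketch of how a group of rewards for one prompt can be turned into advantages. It is not taken from grpo.py; the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions for illustration, and real implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions sampled for the same prompt.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt; a verifiable reward of 1.0 means the
# answer passed the check, 0.0 means it did not (values are made up).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group mean get a positive advantage and the rest get a negative one, so no separate value model is needed. If every completion in a group gets the same reward, all advantages are zero and that group contributes no learning signal.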

Installation and training are both handled by a single script.

Training

bash train.sh

After training, we merge the LoRA adapter weights into the base model.

Merging

python3 post_train_merge.py
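For intuition, merging under the standard LoRA formulation adds the scaled low-rank update to each adapted weight matrix: W' = W + (alpha / r) * B A. The sketch below shows that arithmetic on a toy layer in plain Python. It is not the contents of post_train_merge.py; the helper names `matmul` and `merge_lora` and all numeric values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight for one layer."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # shape (out_features, in_features)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (all values are made up for illustration).
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape (r, in_features) = (1, 2)
lora_b = [[0.5], [0.0]]  # shape (out_features, r) = (2, 1)
merged = merge_lora(w, lora_a, lora_b, alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

After merging, the model has the same shape as the base model and no longer needs the adapter at inference time.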

Now, we can evaluate the model on the test set.

Evaluation

bash eval.sh
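As a rough illustration of what evaluating against verifiable answers can look like, here is a minimal exact-match accuracy sketch. It is not taken from eval_grpo.py. It assumes each reference answer is a number stored as a string and that the model's final answer is the last number in its completion; both are assumptions, and the helper names `extract_answer` and `accuracy` are hypothetical.

```python
import re


def extract_answer(completion):
    """Return the last number in the completion as a string, or None if there is none."""
    # Assumption: thousands separators are dropped before matching.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None


def accuracy(completions, references):
    """Fraction of completions whose extracted answer equals the reference string."""
    correct = sum(extract_answer(c) == ref for c, ref in zip(completions, references))
    return correct / len(references)


# Made-up completions and reference answers for illustration.
completions = ["So the total is 42.", "The answer is 7", "I am not sure."]
references = ["42", "8", "3"]
print(accuracy(completions, references))
```

The same kind of check can serve as a binary reward during training, which is what makes the reward "verifiable": it is computed by a rule instead of a learned reward model.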

tamoghnokandar/RLVR

Folders and files

Latest commit

History

Repository files navigation

Reinforcement Learning from Verifiable Rewards

Training

Merging

Evaluation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages