training language models from scratch with pure reinforcement learning. read it in depth here: docs/avatarl.md
the normal process for building models for any task is typically pretrain + midtrain + sft + rl, where my understanding is that
- pretrain = builds a world model
- midtrain = adds extensive domain knowledge for any area absent in organic pretrain corpus
- sft = rich demonstration data on instruction following, alignment, establishing persona and style
- rl = optimize from the demonstrable toward what gets fuzzy: improve reasoning, grounded behaviour and harmlessness. essentially learning generalisation over memorization.
this repository contains an optimized implementation of gpt2, but trained purely with reinforcement learning.
and we might be creating lots of unorthodox things here, as any fun loving person should.
- avatarl.py - main training script implementing the avatarl reinforcement learning algorithm for language model pretraining
- train.py - standard pretraining script for creating baseline models for ablation studies
- model.py - gpt model architecture with transformer blocks, attention, and language modeling head
- config/train_avatarl.py - training configuration for avatarl experiments (hyperparameters, model size, optimizer settings)
- configurator.py - command-line configuration override system for experiment management (see the sketch below)
- modal_train.py - modal cloud deployment for distributed training, profiling and ~~benchmaxxing~~ benchmarking
- start.sh - local training launcher to run experiments with environment setup and multi-gpu support
- evaluate_critic.py / modal_evaluate_critic.py - evaluate critic model cross-entropy loss and perplexity on validation data
- docs/avatarl.md - technical documentation explaining the avatarl framework and positive reinforce approach
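for readers unfamiliar with the nanogpt-style configurator pattern, here is a minimal sketch of how a `--key=value` override system like configurator.py typically works; the defaults below are illustrative and the actual file may differ.

```python
# minimal sketch of a nanogpt-style configurator: cli args of the form
# --key=value override module-level defaults. illustrative only; the real
# configurator.py may behave differently.
import sys
from ast import literal_eval

# illustrative defaults that a config file or the cli can override
batch_size = 32
learning_rate = 3e-4
experiment_name = "avatarl_pretrain"

for arg in sys.argv[1:]:
    if not arg.startswith("--"):
        continue  # e.g. a positional config-file path, handled elsewhere
    key, _, val = arg[2:].partition("=")
    if key in globals():
        try:
            globals()[key] = literal_eval(val)  # numbers, bools, tuples
        except (ValueError, SyntaxError):
            globals()[key] = val  # keep plain strings as-is
```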
avatarl/
├── data/
│   └── openwebtext/
│       ├── train.bin
│       └── val.bin
├── out/
│   └── ckpt_*.pt
├── out-avatarl/
│   └── ckpt_*.pt
└── config/
    └── train_avatarl.py
Note: start.sh automatically downloads training data and critic models from HuggingFace if not present.
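as a rough illustration of what that auto-download step can look like in python, the sketch below uses huggingface_hub; the repo ids and filenames are hypothetical placeholders, the real ones live in start.sh.

```python
# illustrative helper for fetching a missing file from the huggingface hub.
# repo ids / filenames here are hypothetical placeholders, not the repo's.
import shutil
from pathlib import Path
from huggingface_hub import hf_hub_download

def ensure_file(local_path: str, repo_id: str, filename: str, repo_type: str = "model") -> str:
    """download filename from repo_id if local_path does not exist yet."""
    target = Path(local_path)
    if target.exists():
        return str(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    cached = hf_hub_download(repo_id=repo_id, filename=filename, repo_type=repo_type)
    shutil.copyfile(cached, target)  # copy out of the hub cache into place
    return str(target)

# hypothetical usage:
# ensure_file("data/openwebtext/val.bin", "<user>/<dataset-repo>", "val.bin", repo_type="dataset")
```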
- modal account (sign up free)
- modal cli: pip install modal
- authenticate: modal setup
- python 3.12
- pytorch 2.6.0
- cuda 12.6+ capable gpu (h200/h100/a100 recommended)
- flash attention (auto-installed)
# setup environment
bash start.sh
# single gpu
python avatarl.py --compile=False
# multi-gpu (8 gpus)
torchrun --nproc_per_node=8 avatarl.py

edit config/train_avatarl.py to change hyperparameters before running.
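config/train_avatarl.py follows the nanogpt convention of a plain python file of hyperparameter assignments. the sketch below shows the shape of such a file; the keys and values are illustrative, not the repo's actual settings.

```python
# illustrative shape of a config/train_avatarl.py-style file; every assignment
# overrides a default in the training script. values here are made up.
experiment_name = "avatarl_pretrain_250M"
out_dir = "out-avatarl"

# model size (gpt2-style)
n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024

# optimizer / schedule
learning_rate = 3e-4
max_iters = 20000
warmup_iters = 500
weight_decay = 0.1

# training setup
batch_size = 32
gradient_accumulation_steps = 8
compile = True  # torch.compile the model
```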
# install and authenticate modal
pip install modal
modal setup
# run training
modal run modal_train.py:train_avatarl_single_node

sample.py looks for checkpoints in the out/ directory by default. the checkpoint filename is constructed as ckpt_{experiment_name}.pt.
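to make the lookup concrete, here is a rough sketch of how a script can resolve and load a checkpoint following the ckpt_{experiment_name}.pt convention; the actual sample.py internals may differ.

```python
# sketch of checkpoint resolution using the ckpt_{experiment_name}.pt naming
# convention described above; not necessarily how sample.py does it internally.
import os
import torch

def load_checkpoint(experiment_name: str, out_dir: str = "out", device: str = "cpu"):
    ckpt_path = os.path.join(out_dir, f"ckpt_{experiment_name}.pt")
    if not os.path.exists(ckpt_path):
        raise FileNotFoundError(f"no checkpoint at {ckpt_path}")
    # map_location keeps this usable on cpu-only machines
    return torch.load(ckpt_path, map_location=device)

# e.g. ckpt = load_checkpoint("avatarl_pretrain_250M_adamw_big_critic")
```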
# generate text from a trained checkpoint (looks in out/ckpt_avatarl_pretrain_250M_adamw_big_critic.pt)
python3 sample.py --experiment_name=avatarl_pretrain_250M_adamw_big_critic --start="The true meaning of life is often hidden in"
# additional options
python3 sample.py \
--experiment_name=avatarl_pretrain_250M_adamw_big_critic \
--start="Once upon a time" \
--temperature=0.8 \
--num_samples=5 \
--max_new_tokens=200
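for reference, temperature and max_new_tokens drive a standard autoregressive sampling loop. a minimal sketch of that loop (not the repo's sample.py, and it assumes the model returns raw logits):

```python
# minimal autoregressive sampling loop showing what temperature and
# max_new_tokens control; illustrative, not the repo's sample.py.
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens=200, temperature=0.8, block_size=1024):
    # idx: (batch, time) tensor of token ids to continue from
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]          # crop to the context window
        logits = model(idx_cond)                  # assumed shape (batch, time, vocab)
        logits = logits[:, -1, :] / temperature   # last step; <1 sharpens, >1 flattens
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```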
# use a different output directory
python3 sample.py \
--out_dir=out-avatarl \
--experiment_name=avatarl_pretrain \
--start="To be or not to be"28may25] so since here i am starting with random weights, i wished to find a way to bootstrap this. and for rl, i am starting with grpo algo.
29may25 i was able to get positive rewards. the approach i tried was a bigram objective for partial scoring of predictions against groundtruth. after about 80% of the run it matches bigram-level performance and then drops off. see wandb runs for detailed metrics.
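to make the bigram partial-scoring idea concrete, here is a rough reconstruction of that kind of reward; it is illustrative, not the exact reward function used in those runs.

```python
# illustrative bigram-level partial reward: credit a prediction that is a
# plausible continuation of the previous groundtruth token even when it is
# not the exact gold token. not the exact reward used in the runs.
from collections import Counter, defaultdict

def build_bigram_table(corpus_tokens):
    """count next-token frequencies for each token in a reference corpus."""
    table = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        table[prev][nxt] += 1
    return table

def bigram_partial_reward(prev_token, predicted, gold, table, partial=0.3):
    if predicted == gold:
        return 1.0                      # exact match: full reward
    counts = table.get(prev_token)
    if counts is not None and predicted in counts:
        return partial                  # plausible bigram continuation
    return 0.0
```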
30may25 so i figured i could ramp up the ngrams (trigram after bigram and so on), but that approach is going to scale badly, so i decided to think deeper on this. since i need to run many experiments on a limited personal budget, i also improved iteration speed from 27-30min per run to 2min50s. now i can do 10x more experiments and it is a good base.
03june25 shared insights on the progress made in bootstrapping a random-weights system with rl pretraining in the bootstrapping.md file.
04june - 01july25 various efforts like curriculum learning and reward chaining to bridge the gap between bootstrapped level of performance and groundtruth accuracy.
1july - 11july25 all previous experiments hit a plateau; i briefly consider changing the experiment to minimal pretrain + rl, then start in a completely new direction.
11july - 24july25 mostly bootstrapping from the model's own judgement to get partial rewards beyond the gold token. but initial learning is very noisy as model judgement is not established this early in training. a method works where i scale the partial reward from zero to 0.3, but i don't go with it because it would be very slow and only start converging well late. it also looks very similar to pretrain + rl. lots of time and compute lost to hidden bugs.
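the "scale partial reward from zero to 0.3" schedule mentioned above is just a linear ramp over training steps; a tiny sketch, with the ramp length as an illustrative choice:

```python
# linear ramp of the partial-reward weight from 0.0 to 0.3 over training.
# ramp_steps is an illustrative value, not the one used in the experiments.
def partial_reward_weight(step: int, ramp_steps: int = 10_000, max_weight: float = 0.3) -> float:
    return min(step / ramp_steps, 1.0) * max_weight

# partial_reward_weight(0) == 0.0, partial_reward_weight(10_000) == 0.3
```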
25july - 06aug25 cleaned up the codebase; essentially i gave avataRL a new avatar. not sorry for the pun. picked up a new idea: a referee model that is trained on groundtruth data and used to score the predictions of the model in training. this works nicely, converging with reasonable compute expense, and the referee model does not need to be bigger than the model being trained. i announce success of the project.
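in code terms, the referee idea is: a separately trained lm scores the trainee's sampled token by its own predictive distribution, and that score becomes the rl reward. the sketch below is a reconstruction under those assumptions (the mixing weights and exact-match bonus are illustrative), not the repo's implementation.

```python
# sketch of a referee-scored reward: the referee is a separate lm trained on
# groundtruth; it scores the token sampled by the model being trained.
# the 0.5/0.5 mix of exact-match bonus and referee probability is illustrative.
import torch

@torch.no_grad()
def referee_reward(referee_model, context_ids, sampled_ids, gold_ids):
    """
    context_ids: (batch, time) prompt token ids
    sampled_ids: (batch,) tokens sampled by the model being trained
    gold_ids:    (batch,) groundtruth next tokens
    returns a per-sample reward in [0, 1].
    """
    logits = referee_model(context_ids)              # assumed (batch, time, vocab)
    probs = torch.softmax(logits[:, -1, :], dim=-1)  # referee's next-token distribution
    sampled_prob = probs.gather(1, sampled_ids[:, None]).squeeze(1)
    exact = (sampled_ids == gold_ids).float()        # bonus for hitting the gold token
    return 0.5 * exact + 0.5 * sampled_prob
```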
07aug25 update everything in public codebase.
09aug25 completed the article for avatarl.
contributions are welcome! please feel free to submit a pull request.
this project is licensed under the apache 2.0 license - see the license file for details.
if you find this work useful in your research, please consider citing:
@software{avatarl2025,
  author = {tokenbender},
  title = {avatarl: training language models from scratch with pure reinforcement learning},
  year = {2025},
  publisher = {github},
  journal = {github repository},
  url = {https://github.com/tokenbender/avatarl}
}

this implementation reuses some code from the modal labs multinode training guide for nanogpt as the base for avataRL.