Fine-tune LLM agents with online reinforcement learning
🔗 Agents for Web Data Extraction • 🐦 Twitter
"Agents" originated in reinforcement learning, where they learn by interacting with an environment and receiving a reward signal. However, LLM-based agents today do not learn online (i.e. continuously in real time) via reinforcement.
OpenAI created Gym to standardize and simplify RL environments, but if you try dropping an LLM-based agent into a Gym environment for training, you'd find it's still quite a bit of code to handle LLM conversation context, episode batches, reward assignment, PPO setup, and more.
LlamaGym seeks to simplify fine-tuning LLM agents with RL. Right now, it's a single Agent abstract class that handles all the issues mentioned above, letting you quickly iterate and experiment with agent prompting & hyperparameters across any Gym environment.
-
Clone the project and bootstrap the environment
git clone https://github.com/KhoomeiK/LlamaGymEDU.git cd LlamaGymEDU python scripts/bootstrap.pyThe bootstrapper creates a
.venvvirtual environment, installs the pinned dependencies fromrequirements/base.txt, and automatically adds Linux-only extras likebitsandbytes. Pass--extras textworldto add the optional TextWorld stack or--no-platform-extrasif you want a minimal install. Runpython scripts/bootstrap.py --helpto see every option. -
Activate the virtual environment
source .venv/bin/activate # macOS / Linux .\.venv\Scripts\activate # Windows PowerShell
-
Disable external logging unless you need it
W&B logging defaults to
mode="disabled"unlessWANDB_PROJECTis set. You can override the behaviour by exportingWANDB_MODE(e.g.offlineoronline). Hugging Face models that require authentication can use the standardHF_TOKENenvironment variable.
Laptop-friendly defaults ship with the repository so that even CPU-only machines can experiment quickly.
python examples/blackjack.py --episodes 5The script loads TinyLlama/TinyLlama-1.1B-Chat-v1.0 by default (≈1.1B parameters) and runs fully on CPU or Apple M-series hardware. It includes SFT warm-start using demo data from sft_data/demo_blackjack.jsonl. To try TextWorld, install the optional dependencies and run:
python scripts/bootstrap.py --extras textworld
python examples/text-world.py --episodes 5Both examples accept --model, --device, and other quality-of-life flags so students can switch between CPU, CUDA, or MPS and try other compact models without editing code.
| Model | Parameters | Notes |
|---|---|---|
TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
1.1B | Fits in <4 GB of RAM; great for CPU-only runs |
microsoft/phi-2 or microsoft/phi-3-mini-4k-instruct |
2.7B–3.8B | Requires ~8 GB RAM; strong instruction following |
HuggingFaceTB/SmolLM-1.7B-Instruct |
1.7B | Community model tuned for fast experiments |
Qwen/Qwen2-1.5B-Instruct |
1.5B | Multilingual baseline with solid reasoning |
All of these models have dedicated entries in llamagym.model_registry, so load_model_and_tokenizer automatically applies appropriate generation and LoRA defaults.
Fine-tuning an LLM-based agent to play in a Gym-style environment with RL has never been easier! Once you install LlamaGym...
pip install llamagym
First, implement 3 abstract methods on the Agent class:
from llamagym import Agent
class BlackjackAgent(Agent):
def get_system_prompt(self) -> str:
return "You are an expert blackjack player."
def format_observation(self, observation) -> str:
return f"Your current total is {observation[0]}"
def extract_action(self, response: str):
return 0 if "stay" in response else 1Then, define your base LLM (as you would for any fine-tuning job) and instantiate your agent:
model = AutoModelForCausalLMWithValueHead.from_pretrained("Llama-2-7b").to(device)
tokenizer = AutoTokenizer.from_pretrained("Llama-2-7b")
agent = BlackjackAgent(model, tokenizer, device)Finally, write your RL loop as usual and simply call your agent to act, reward, and terminate:
env = gym.make("Blackjack-v1")
for episode in trange(5000):
observation, info = env.reset()
done = False
while not done:
action = agent.act(observation) # act based on observation
observation, reward, terminated, truncated, info = env.step(action)
agent.assign_reward(reward) # provide reward to agent
done = terminated or truncated
train_stats = agent.terminate_episode() # trains if batch is fullLlamaGym includes optional stability features that can dramatically improve RL convergence:
agent = BlackjackAgent(
model, tokenizer, device,
# Stability toggles (all optional, default False)
sft_warm_start=True, # Offline→online bridge via replay buffer
use_target_kl=True # Target-KL early stopping
)SFT Warm-Start: Pre-trains the agent with demonstration data before RL begins, then optionally continues SFT on recent successful episodes during training. Provides an "offline→online bridge" that stabilizes pure online RL.
Target-KL Controller: Enables TRL's built-in KL divergence monitoring and early stopping to prevent catastrophic policy drift.
Robust Action Extraction: Automatically tries JSON parsing first (e.g., {"action": 0}), then falls back to regex, reducing action parsing failures.
Some reminders:
- above code snippets are mildly simplified above but a fully working example is available in
examples/blackjack.py - getting online RL to converge is notoriously difficult so you'll have to mess with hyperparameters to see improvement
- our implementation values simplicity so is not as compute efficient as e.g. Lamorel, but easier to start playing around with
- LlamaGym is a weekend project and still a WIP, but we love contributions!
- Grounding Large Language Models with Online Reinforcement Learning
- True Knowledge Comes from Practice: Aligning LLMs with Embodied Environments via Reinforcement Learning
bibtex
@misc{pandey2024llamagym,
title = {LlamaGym: Fine-tune LLM agents with Online Reinforcement Learning},
author = {Rohan Pandey},
year = {2024},
howpublished = {GitHub},
url = {https://github.com/KhoomeiK/LlamaGym}
}