LlamaGym

Fine-tune LLM agents with online reinforcement learning

"Agents" originated in reinforcement learning, where they learn by interacting with an environment and receiving a reward signal. However, LLM-based agents today do not learn online (i.e. continuously in real time) via reinforcement.

OpenAI created Gym to standardize and simplify RL environments, but if you try dropping an LLM-based agent into a Gym environment for training, you'll find it still takes quite a bit of code to handle LLM conversation context, episode batches, reward assignment, PPO setup, and more.

LlamaGym seeks to simplify fine-tuning LLM agents with RL. Right now, it's a single Agent abstract class that handles all the issues mentioned above, letting you quickly iterate and experiment with agent prompting & hyperparameters across any Gym environment.

🚀 Quickstart (macOS / Windows / Linux)

Clone the project and bootstrap the environment
```
git clone https://github.com/KhoomeiK/LlamaGymEDU.git
cd LlamaGymEDU
python scripts/bootstrap.py
```
The bootstrapper creates a .venv virtual environment, installs the pinned dependencies from requirements/base.txt, and automatically adds Linux-only extras like bitsandbytes. Pass --extras textworld to add the optional TextWorld stack or --no-platform-extras if you want a minimal install. Run python scripts/bootstrap.py --help to see every option.

Activate the virtual environment

source .venv/bin/activate          # macOS / Linux
.\.venv\Scripts\activate           # Windows PowerShell

Disable external logging unless you need it

W&B logging defaults to mode="disabled" unless WANDB_PROJECT is set. You can override the behaviour by exporting WANDB_MODE (e.g. offline or online). Hugging Face models that require authentication can use the standard HF_TOKEN environment variable.
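The logging behaviour described above can be pictured as a small resolution step. This is a minimal sketch, not the repository's code: the helper name `resolve_wandb_mode`, the precedence (an explicit `WANDB_MODE` wins over `WANDB_PROJECT`), and the choice of `"online"` when only a project is set are all assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_wandb_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: choose a W&B mode from environment variables."""
    env = os.environ if env is None else env
    explicit = env.get("WANDB_MODE")
    if explicit:
        # An explicit mode (e.g. "offline" or "online") overrides everything else.
        return explicit
    # Assumption: a configured project means the user wants online logging.
    return "online" if env.get("WANDB_PROJECT") else "disabled"
```

Passing a plain dict instead of reading `os.environ` directly makes it easy to check the behaviour without touching your shell configuration.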

Run a quick sanity check

Laptop-friendly defaults ship with the repository so that even CPU-only machines can experiment quickly.

python examples/blackjack.py --episodes 5

The script loads TinyLlama/TinyLlama-1.1B-Chat-v1.0 by default (≈1.1B parameters) and runs fully on CPU or Apple M-series hardware. It includes SFT warm-start using demo data from sft_data/demo_blackjack.jsonl. To try TextWorld, install the optional dependencies and run:

python scripts/bootstrap.py --extras textworld
python examples/text-world.py --episodes 5

Both examples accept --model, --device, and other quality-of-life flags so students can switch between CPU, CUDA, or MPS and try other compact models without editing code.
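For readers curious how such flags are typically wired up, here is a rough sketch using only the standard library. The flag names `--model`, `--device`, and `--episodes` come from this README; the defaults, the `"auto"` device choice, and the helper names `build_parser` and `pick_device` are hypothetical and may differ from what the example scripts actually do.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flag names follow the README; defaults and choices are illustrative only.
    parser = argparse.ArgumentParser(description="Hypothetical example CLI")
    parser.add_argument("--model", default="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    parser.add_argument(
        "--device", default="auto", choices=["auto", "cpu", "cuda", "mps"]
    )
    parser.add_argument("--episodes", type=int, default=5)
    return parser


def pick_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Resolve "auto" to the best available backend; pass explicit choices through."""
    if requested != "auto":
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```

With PyTorch installed, the two availability flags would usually come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.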

Laptop-sized model suggestions

| Model | Parameters | Notes |
| --- | --- | --- |
| `TinyLlama/TinyLlama-1.1B-Chat-v1.0` | 1.1B | Fits in <4 GB of RAM; great for CPU-only runs |
| `microsoft/phi-2` or `microsoft/phi-3-mini-4k-instruct` | 2.7B–3.8B | Requires ~8 GB RAM; strong instruction following |
| `HuggingFaceTB/SmolLM-1.7B-Instruct` | 1.7B | Community model tuned for fast experiments |
| `Qwen/Qwen2-1.5B-Instruct` | 1.5B | Multilingual baseline with solid reasoning |

All of these models have dedicated entries in llamagym.model_registry, so load_model_and_tokenizer automatically applies appropriate generation and LoRA defaults.

Usage

Fine-tuning an LLM-based agent to play in a Gym-style environment with RL has never been easier! Once you install LlamaGym...

pip install llamagym

First, implement 3 abstract methods on the Agent class:

from llamagym import Agent

class BlackjackAgent(Agent):
    def get_system_prompt(self) -> str:
        return "You are an expert blackjack player."

    def format_observation(self, observation) -> str:
        return f"Your current total is {observation[0]}"

    def extract_action(self, response: str):
        return 0 if "stay" in response else 1
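In Gymnasium's `Blackjack-v1`, each observation is a 3-tuple of (player sum, dealer's visible card, usable ace), and the actions are 0 (stick) and 1 (hit). If you want the prompt to carry the whole observation, something along these lines may work. It is a standalone sketch so it can be tried without loading a model; the function names `describe_blackjack` and `parse_blackjack_action` are made up for this example and are not part of LlamaGym.

```python
from typing import Tuple


def describe_blackjack(observation: Tuple[int, int, int]) -> str:
    """Turn a Blackjack-v1 observation into a prompt-friendly sentence."""
    player_sum, dealer_card, usable_ace = observation
    ace_text = "a usable ace" if usable_ace else "no usable ace"
    return (
        f"Your current total is {player_sum} with {ace_text}. "
        f"The dealer is showing {dealer_card}."
    )


def parse_blackjack_action(response: str) -> int:
    """Map free text to a Blackjack-v1 action: 0 = stick, 1 = hit."""
    text = response.lower()  # make the keyword match case-insensitive
    return 0 if ("stay" in text or "stick" in text) else 1
```

The bodies of these two functions could be dropped into `format_observation` and `extract_action` respectively. Like the simplified version above, anything that does not mention staying is treated as a hit.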

Then, define your base LLM (as you would for any fine-tuning job) and instantiate your agent:

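from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

device = "cpu"  # or "cuda" / "mps", depending on your hardware
model = AutoModelForCausalLMWithValueHead.from_pretrained("Llama-2-7b").to(device)
tokenizer = AutoTokenizer.from_pretrained("Llama-2-7b")
agent = BlackjackAgent(model, tokenizer, device)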

Finally, write your RL loop as usual and simply call your agent to act, reward, and terminate:

import gymnasium as gym
from tqdm import trange

env = gym.make("Blackjack-v1")

for episode in trange(5000):
    observation, info = env.reset()
    done = False

    while not done:
        action = agent.act(observation)  # act based on observation
        observation, reward, terminated, truncated, info = env.step(action)
        agent.assign_reward(reward)  # provide reward to agent
        done = terminated or truncated

    train_stats = agent.terminate_episode()  # trains if batch is full
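The "trains if batch is full" comment implies that finished episodes are buffered and a training step only fires once enough of them have accumulated. The sketch below illustrates that pattern in isolation. It is not LlamaGym's implementation: `EpisodeBatcher`, its `train_fn` callback (a stand-in for a PPO update), and the choice to summarize each episode by its total reward are all assumptions.

```python
from typing import Callable, List, Optional


class EpisodeBatcher:
    """Hypothetical buffer: collect finished episodes, train once the batch is full."""

    def __init__(self, batch_size: int, train_fn: Callable[[List[float]], dict]):
        self.batch_size = batch_size
        self.train_fn = train_fn  # stand-in for a PPO update
        self.current_rewards: List[float] = []
        self.batch: List[float] = []

    def assign_reward(self, reward: float) -> None:
        # Record one step's reward for the episode in progress.
        self.current_rewards.append(reward)

    def terminate_episode(self) -> Optional[dict]:
        # Summarize the episode by its total reward (an illustrative choice).
        self.batch.append(sum(self.current_rewards))
        self.current_rewards = []
        if len(self.batch) < self.batch_size:
            return None  # not enough episodes yet, so no training step
        stats = self.train_fn(self.batch)
        self.batch = []  # start collecting the next batch
        return stats
```

This also explains why `train_stats` in the loop above can be empty for most episodes: under this pattern, statistics only come back on the episode that completes a batch.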

Stability Features

LlamaGym includes optional stability features intended to make online RL converge more reliably:

agent = BlackjackAgent(
    model, tokenizer, device,
    # Stability toggles (all optional, default False)
    sft_warm_start=True,  # Offline→online bridge via replay buffer
    use_target_kl=True    # Target-KL early stopping
)

SFT Warm-Start: Pre-trains the agent with demonstration data before RL begins, then optionally continues SFT on recent successful episodes during training. Provides an "offline→online bridge" that stabilizes pure online RL.
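To make the "recent successful episodes" idea concrete, here is one way such a replay buffer could look. Everything in it is hypothetical: the class name `SuccessReplayBuffer`, the reward threshold, the capacity, and the way conversations are joined into training text are illustrative choices, not a description of LlamaGym's internals.

```python
from collections import deque
from typing import Deque, List, Tuple

Episode = Tuple[List[str], float]  # (conversation turns, total reward)


class SuccessReplayBuffer:
    """Hypothetical bounded buffer of recent episodes whose reward clears a threshold."""

    def __init__(self, capacity: int = 64, min_reward: float = 1.0):
        self.min_reward = min_reward
        # A deque with maxlen silently drops the oldest episode when full.
        self.episodes: Deque[Episode] = deque(maxlen=capacity)

    def maybe_add(self, turns: List[str], total_reward: float) -> bool:
        """Store the episode only if it counts as successful."""
        if total_reward < self.min_reward:
            return False
        self.episodes.append((list(turns), total_reward))
        return True

    def sft_texts(self) -> List[str]:
        # Join each stored conversation into one training string.
        return ["\n".join(turns) for turns, _ in self.episodes]
```

The strings returned by `sft_texts` could then be fed to whatever supervised fine-tuning step you use between RL updates.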

Target-KL Controller: Enables TRL's built-in KL divergence monitoring and early stopping to prevent catastrophic policy drift.
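The general idea behind target-KL early stopping can be shown without any RL library: keep applying optimization epochs on a batch until the estimated KL divergence from the reference policy exceeds a tolerance. The sketch below is a generic illustration, not TRL's code; the function name and the 1.5× tolerance factor are assumptions borrowed from common PPO implementations.

```python
from typing import Iterable, List


def run_epochs_with_target_kl(
    kl_per_epoch: Iterable[float], target_kl: float, tolerance: float = 1.5
) -> List[float]:
    """Consume per-epoch KL estimates, stopping once one exceeds tolerance * target_kl.

    Returns the KL values of the epochs that were actually applied.
    """
    applied: List[float] = []
    for kl in kl_per_epoch:
        if kl > tolerance * target_kl:
            break  # policy drifted too far from the reference; skip remaining epochs
        applied.append(kl)
    return applied
```

In a real trainer the KL values would be measured after each optimization pass rather than supplied up front; they are passed in here only to keep the example self-contained.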

Robust Action Extraction: Automatically tries JSON parsing first (e.g., {"action": 0}), then falls back to regex, reducing action parsing failures.
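A JSON-first, regex-second parser of the kind described above might look roughly like this. It is a sketch under assumptions: the function name `extract_action_robust`, the `valid_actions` parameter, the specific regular expressions, and the choice to return `None` on failure are all illustrative rather than LlamaGym's actual behaviour.

```python
import json
import re
from typing import Optional


def extract_action_robust(response: str, valid_actions=(0, 1)) -> Optional[int]:
    """Hypothetical parser: try a JSON object first, then fall back to a regex."""
    # 1) Look for a JSON object such as {"action": 0} anywhere in the response.
    match = re.search(r"\{.*?\}", response, flags=re.DOTALL)
    if match:
        try:
            action = json.loads(match.group(0)).get("action")
            if action in valid_actions:
                return int(action)
        except (json.JSONDecodeError, AttributeError):
            pass  # malformed JSON or not an object; fall through to the regex
    # 2) Fallback: first phrase like "action: 1" or "Action = 0".
    match = re.search(r"action\W*(\d+)", response, flags=re.IGNORECASE)
    if match and int(match.group(1)) in valid_actions:
        return int(match.group(1))
    return None  # caller decides how to handle an unparseable response
```

Returning `None` instead of guessing lets the caller choose a policy for failures, such as sampling a random action or applying a small penalty.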

Some reminders:

- The code snippets above are mildly simplified, but a fully working example is available in examples/blackjack.py.
- Getting online RL to converge is notoriously difficult, so expect to tune hyperparameters before you see improvement.
- Our implementation values simplicity, so it is not as compute-efficient as e.g. Lamorel, but it is easier to start playing around with.
- LlamaGym is a weekend project and still a WIP, but we love contributions!

Relevant Work

Citation

@misc{pandey2024llamagym,
  title        = {LlamaGym: Fine-tune LLM agents with Online Reinforcement Learning},
  author       = {Rohan Pandey},
  year         = {2024},
  howpublished = {GitHub},
  url          = {https://github.com/KhoomeiK/LlamaGym}
}
