L0 is a scalable, end-to-end training pipeline for general-purpose agents. It provides:
- An RL training framework for complex agentic environments, featuring a low-cost, extensible, and sandboxed concurrent agent worker pool.
- A generalist agentic scaffold, Notebook Agent (NB-Agent), which operates in a "code-as-action" fashion via a Read-Eval-Print Loop (REPL) backed by a Jupyter kernel.
- A simple yet effective agentic multi-turn training recipe with agentic policy gradient and verifiable multi-step rewards.
- A series of models trained with L0, including L0-4B (Qwen 3), L0-7B (Qwen2.5), and L0-32B (Qwen2.5). We claim that these models are capable of general agentic tasks; a deep searcher case using L0-32B (Qwen2.5) is provided in the examples.
- L0: Reinforcement Learning to Become General Agents
We significantly improved the model's performance on multiple benchmarks using L0:
- "Model name + NB-Agent" means the model is evaluated directly with NB-Agent, without training.
L0 also achieves competitive performance compared with other works:
- All comparisons use 7B models.
- Agentic Policy Gradient: Optimizes policy gradient for agents, treats a complete "think-code" sequence as a single action
- Verifiable Reward Function: Provides multi-faceted rewards for answer correctness, format compliance, and code execution
- Strict On-Policy Training: Uses a pure on-policy approach with a KL-divergence penalty for stable learning
- DAPO-Inspired Rejection Sampling: Advanced rejection sampling strategy for improved policy optimization
- Decoupled Architecture: Separates CPU agent workers from a GPU inference server for independent scaling
- Flexible server-client architecture: Scalable agent task execution with FastAPI-based orchestration. Refer to the trajectory sampler design document for more details
- Lightweight Sandboxing: Uses Bubblewrap for secure, low-overhead, and parallel agent environments
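To make the "single action" and "multi-faceted reward" ideas from the list above more concrete, here is a minimal sketch in plain Python. It is not the L0 implementation: the `Step` container, the function names, and the reward weights are all hypothetical, and the KL penalty and rejection sampling are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One agent action: a complete "think-code" sequence (hypothetical container)."""
    token_logprobs: list[float]  # log-probabilities of the generated tokens
    format_ok: bool              # did the output follow the expected format?
    code_ok: bool                # did the generated code execute without error?


def action_logprob(step: Step) -> float:
    # Treating the whole sequence as one action means its log-probability
    # is the sum of the log-probabilities of its tokens.
    return sum(step.token_logprobs)


def step_reward(step: Step, w_format: float = 0.1, w_code: float = 0.1) -> float:
    # Illustrative weights only; these are assumptions, not L0's values.
    return w_format * step.format_ok + w_code * step.code_ok


def trajectory_reward(steps: list[Step], answer_correct: bool, w_answer: float = 1.0) -> float:
    # Final-answer correctness plus per-step format and execution rewards.
    return w_answer * float(answer_correct) + sum(step_reward(s) for s in steps)


def policy_gradient_loss(steps: list[Step], reward: float, baseline: float = 0.0) -> float:
    # REINFORCE-style surrogate: -(R - b) * sum over actions of log pi(action).
    advantage = reward - baseline
    return -advantage * sum(action_logprob(s) for s in steps)
```

In a real trainer the log-probabilities would be tensors and the loss would be averaged over a batch of trajectories; the scalar version only shows where the action boundary sits.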
NB-Agent is designed as a general-purpose agent that follows the "Code-as-Action" paradigm. It works in a REPL, which lets the agent interact with its environment by generating code snippets that are executed in a Jupyter Notebook environment.
- You could refer to the NB-Agent Documentation for more details on the design and architecture of NB-Agent.
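The REPL idea can be illustrated with a toy loop. This is only a sketch under simplifying assumptions: it uses Python's built-in `exec` with a shared namespace instead of a Jupyter kernel, it has no sandboxing, and `MiniRepl` is a hypothetical name that does not appear in NB-Agent.

```python
import contextlib
import io
import traceback


class MiniRepl:
    """Keeps one namespace alive across executions, similar to a notebook kernel."""

    def __init__(self):
        self.namespace = {}

    def run(self, code: str) -> str:
        """Execute a code snippet and return what it printed (or its traceback)."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            # Return the traceback as the observation so the agent can react to it.
            buffer.write(traceback.format_exc())
        return buffer.getvalue()


repl = MiniRepl()
repl.run("x = 21")                # first "cell" defines state
print(repl.run("print(x * 2)"))   # second "cell" reuses that state
```

Each string passed to `run` plays the role of one agent action, and the returned text plays the role of the observation fed back to the model.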
# Clone this repository
git clone https://github.com/cmriat/l0.git
cd l0
We use Pixi for package management.
- Pixi is a fast, reliable, and cross-platform package manager for Python and other languages. Visit pixi.sh to learn more and install it.
After installing Pixi, set up the environment:
export CONDA_OVERRIDE_CUDA=12.9
# Install dependencies using Pixi.
pixi install
# Enter the environment
pixi shell
This example demonstrates training an NB-Agent using the REINFORCE++ algorithm on QA datasets.
1. Prepare dataset
python examples/data_preprocess/l0_qa.py --local_dir LOCAL_DIR_TO_SAVE_DATASET
2. Start Agent Execution Manager Server
On the remote machines (one or more; they only need CPUs):
bash examples/start_task_server.sh
3. Configure Remote Server URLs
Edit the training script to specify remote server URLs:
# File: examples/nb_agent_training/train_qa_reinforcepp*.sh
# Line: actor_rollout_ref.traj_sampler.remote_exec_server_url=['http://IP1:8000', 'http://IP2:8000', 'http://IP3:8000']
4. API Keys Configuration
Some tools of NB-Agent require API keys for external services. The following services are required in QA training:
- Content Processing: Jina (required)
- Search Services: At least one of Exa, Firecrawl, or Serper (choose one or more)
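The rule above (Jina is required, plus at least one search service) can be checked before launching training. The sketch below assumes the keys are exposed as environment variables with the names used in the `.env` file of this section; `missing_settings` is a hypothetical helper, not part of L0.

```python
import os

# Variable names follow the .env template in this section.
REQUIRED = ["JINA_API_KEY"]
SEARCH_ANY = ["EXA_API_KEY", "FIRECRAWL_BASE_URL", "SERPER_API_KEY"]


def missing_settings(env=None):
    """Return a list of human-readable problems; an empty list means all good."""
    env = os.environ if env is None else env
    problems = [name for name in REQUIRED if not env.get(name)]
    # At least one search backend must be configured.
    if not any(env.get(name) for name in SEARCH_ANY):
        problems.append("one of " + ", ".join(SEARCH_ANY))
    return problems


if __name__ == "__main__":
    for problem in missing_settings():
        print("Missing:", problem)
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.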
Create a .env file in the root directory with the configurations of dependent services:
JINA_API_KEY="YOUR_JINA_API_KEY"
FIRECRAWL_BASE_URL="YOUR_FIRECRAWL_BASE_URL" # e.g. "http://firecrawl.lionrock.com"
EXA_API_KEY="YOUR_EXA_API_KEY"
SERPER_API_KEY="YOUR_SERPER_API_KEY"
5. Running in Container
Since L0 uses Bubblewrap to isolate the environments of agent rollouts, running it inside a container requires granting the container the following capabilities:
- security-opt: apparmor=unconfined
- CAPABILITY: ALL
Alternatively, you can use --privileged to give the container all capabilities, but this is not recommended for security reasons.
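Before starting the task server inside a container, it can help to confirm that Bubblewrap is actually able to create a sandbox there. The following is a rough sketch, not part of L0: `bwrap_works` is a hypothetical helper, and the particular `bwrap` flags are just one reasonable smoke test that may differ from the options L0 uses internally.

```python
import shutil
import subprocess


def bwrap_works() -> bool:
    """Return True if `bwrap` is installed and can run a trivial sandboxed command."""
    bwrap = shutil.which("bwrap")
    if bwrap is None:
        return False
    # Read-only view of the host root, fresh namespaces, then run `true`.
    cmd = [bwrap, "--ro-bind", "/", "/", "--unshare-all", "--die-with-parent", "true"]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("bubblewrap usable:", bwrap_works())
```

If this reports `False` inside the container but `True` on the host, the container is probably missing the capabilities listed above.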
Choose Your Training Configuration
Select the appropriate training script based on your hardware setup and model size requirements.
1. Single Node Training
For single-node setups with limited GPU resources:
- 0.6B Model (Qwen3-0.6B)
  - Hardware Requirements: 1 node, 1 GPU
  - Command:
    bash examples/nb_agent_training/train_qa_reinforcepp_dynamic_0.6b.sh
- 4B Model (Qwen3-4B)
  - Hardware Requirements: 1 node, 8 GPUs
  - Command:
    bash examples/nb_agent_training/train_qa_reinforcepp_dynamic_4b.sh
2. Multi-Node Training
For larger models requiring distributed training, you need to set up a Ray cluster first:
Step 2.1: Launch Ray Cluster
# On the head node:
ray start --head --dashboard-host=0.0.0.0
# On worker nodes:
ray start --address=YOUR_HEAD_NODE_IP:6379
Step 2.2: Submit Training Jobs
- 7B Model (Qwen2.5-7B-Instruct)
  - Hardware Requirements: 2 nodes, 16 GPUs
  - Command:
    RAY_ADDRESS=YOUR_HEAD_NODE_IP:8265 ray job submit examples/nb_agent_training/train_qa_reinforcepp_dynamic_7b.sh
- 32B Model (Qwen2.5-32B-Instruct)
  - Hardware Requirements: 4 nodes, 32 GPUs
  - Command:
    RAY_ADDRESS=YOUR_HEAD_NODE_IP:8265 ray job submit examples/nb_agent_training/train_qa_reinforcepp_dynamic_32b.sh
- For ease of use, we have packaged NB-Agent. You can install and use it separately via pixi install nbagent.
- In our tests, existing frontier models like Gemini and Claude have demonstrated powerful capabilities under NB-Agent without training.
- You could refer to the NB-Agent Example for a deep searcher example using NB-Agent.
We adapt the model conversion script directly from verl; it is located at examples/model_converter.py. Please refer to the verl model converter document for usage after training.
Since NB-Agent needs the model's tokenizer, we patch SGLang to provide extra endpoints. Refer to the example patched SGLang server document for how to launch a patched SGLang server.
We provide an evaluation suite for QA datasets that uses the agent worker pool for parallel sampling. Refer to the evaluation document for more details.
We provide a data preprocessing pipeline for QA datasets, which includes downloading, merging, quality assessment, and filtering. You could refer to the data preprocessing document for more details.
# Install development dependencies
pixi install --env dev
# Enter the development environment
pixi shell -e dev
# Install pre-commit hooks
pre-commit install
# Running Tests
pytest ./tests
- If you encounter Out of Memory (OOM) errors while the SGLang server captures CUDA graphs, try launching the Ray cluster first and then submitting your training script, as described in the multi-node training section. This also works for single-node training.
- This project adapts code from verl and SGLang. We are grateful for their contributions to the open-source community.
- Thanks to Open-Reasoner-Zero and DAPO for sharing their training techniques and insights.
- Special thanks to the Pixi team for their excellent package management tool, which greatly simplifies our development process.
If you find this project useful, please cite it in your research:
@misc{liconrockai2025l0,
      title={L0: Reinforcement Learning to Become General Agents}, 
      author={Lionrock-AI},
      year={2025},
      URL="https://github.com/cmriat/l0/tree/main/papers/paper.pdf"
}