We present MUA-RL (Multi-turn User-interacting Agent Reinforcement Learning for Agentic Tool Use), a reinforcement learning framework for training large language models with multi-turn conversation capabilities and agentic tool usage. MUA-RL targets multi-turn user-interaction scenarios in which agents must maintain context across conversation turns while effectively using tools to complete complex tasks.
- 🔄 Multi-turn Conversation Support: Maintain context across multiple conversation turns for complex task completion
- 🛠️ Agentic Tool Usage: Seamless integration with various tools and APIs for real-world applications
- 📊 Flexible Environment Management: Dynamic environment creation for each rollout to ensure fresh context
- 🔧 Easy Checkpoint Conversion: Automatic conversion from distributed checkpoints to Hugging Face format
- Python 3.8+
- CUDA 11.8+ (for GPU training)
- PyTorch 2.0+
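A quick way to verify these prerequisites is a small Python check (`check_env` is a hypothetical helper for illustration, not part of MUA-RL):

```python
import sys
import importlib.util

def check_env(min_py=(3, 8)):
    """Return (python_ok, torch_installed) for a quick prerequisite check."""
    python_ok = sys.version_info >= min_py
    # find_spec avoids importing torch just to see whether it is installed
    torch_installed = importlib.util.find_spec("torch") is not None
    return python_ok, torch_installed

if __name__ == "__main__":
    py_ok, has_torch = check_env()
    print(f"Python >= 3.8: {py_ok}, PyTorch installed: {has_torch}")
```

CUDA availability is best checked separately with `torch.cuda.is_available()` once PyTorch is installed.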
```bash
# Clone the repository
git clone https://github.com/zzwkk/MUA-RL.git
cd MUA-RL

# Install dependencies
pip install -e .
pip install -r requirements_sglang.txt
pip install transformers==4.51.1
```
Edit the training script parameters:

```bash
# Edit model path and other parameters in the script
vim examples/sglang_multiturn/mua_32b.sh
```
| Parameter | Description | Example |
|---|---|---|
| `MODEL_PATH` | Path to your base model | `/path/to/your/model` |
| `N_NODE` | Number of nodes for distributed training | `4` |
| `BATCH_SIZE` | Training batch size | `32` |
| `EPOCH_NUM` | Number of training epochs | `30` |
| `API_KEY` | OpenAI API key for the evaluation model | `sk-...` |
| `BASE_URL` | OpenAI API base URL | `https://api.openai.com/v1` |
| `CKPT_DIR` | Directory for saving model checkpoints | `/path/to/checkpoints` |
| `TENSORBOARD_DIR` | Directory for TensorBoard logs | `/path/to/tensorboard` |
| `ROLLOUT_LOG_PATH` | Directory for rollout generation logs | `/path/to/rollout_logs` |
| `VALID_LOG_PATH` | Directory for validation logs | `/path/to/validation_logs` |
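For reference, the parameters above might be set inside the script roughly like this (all values are placeholders; the exact variable layout in `mua_32b.sh` may differ):

```shell
# Placeholder values -- adjust for your cluster and paths
MODEL_PATH=/path/to/your/model
N_NODE=4
BATCH_SIZE=32
EPOCH_NUM=30
API_KEY=sk-...                          # evaluation-model API key
BASE_URL=https://api.openai.com/v1
CKPT_DIR=/path/to/checkpoints
TENSORBOARD_DIR=/path/to/tensorboard
ROLLOUT_LOG_PATH=/path/to/rollout_logs
VALID_LOG_PATH=/path/to/validation_logs
```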
```bash
# For a 4x8 GPU setup; H200 (141 GB) GPUs are suggested
bash examples/sglang_multiturn/mua_grpo.sh
```
After training, convert distributed checkpoints to Hugging Face format:
```bash
# Edit the merge script configuration
vim scripts/merge.sh

# Set your model path and name
BASE_DIR="/path/to/your/checkpoints/"
MODEL_NAME="your_model_name"

# Run the conversion
bash scripts/merge.sh
```
The script will automatically:
- 🔍 Find all `global_step_*` directories
- 🔄 Convert FSDP/Megatron checkpoints to Hugging Face format
- 💾 Save merged models to `iter_XXXXXX/actor/unify_checkpoint/`
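The discovery step can be sketched as follows (a minimal illustration, assuming checkpoints live in `global_step_<N>` subdirectories; `find_checkpoint_steps` is a hypothetical name, not taken from `merge.sh`):

```python
import re
from pathlib import Path

def find_checkpoint_steps(base_dir):
    """Return global_step_* directories under base_dir, sorted by step number."""
    pattern = re.compile(r"global_step_(\d+)$")
    steps = []
    for path in Path(base_dir).iterdir():
        match = pattern.search(path.name)
        if path.is_dir() and match:
            # Keep the numeric step so sorting is numeric, not lexicographic
            steps.append((int(match.group(1)), path))
    return [p for _, p in sorted(steps)]
```

Sorting numerically matters here: a lexicographic sort would place `global_step_10` before `global_step_9`.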
| File/Directory | Description |
|---|---|
| `examples/sglang_multiturn/mua_grpo.sh` | Main training script for GRPO |
| `scripts/merge.sh` | Checkpoint conversion script |
| `scripts/model_merger.py` | Model format conversion utilities |
| `verl/workers/rollout/sglang_rollout/` | Core rollout implementation |
| `verl/workers/rollout/schemas.py` | Conversation management and backpropagation control |
| `MUA_environments/` | Environment management system |
MUA-RL follows a modular architecture designed for scalability and flexibility:
- Environment Manager: Creates fresh environments for each rollout
- Tool Registry: Manages available tools and their configurations
- Data Loader: Handles data loading and preprocessing
- Rollout Worker: Executes multi-turn conversations with tool usage
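To make the "fresh environment per rollout" idea concrete, here is a hypothetical sketch (class and method names are illustrative, not the actual `MUA_environments` API):

```python
import copy

class EnvironmentManager:
    """Illustrative: hands each rollout an isolated copy of the initial state."""

    def __init__(self, initial_state):
        self._initial_state = initial_state

    def create_env(self):
        # Deep-copy so tool calls in one rollout never leak into another
        return copy.deepcopy(self._initial_state)

manager = EnvironmentManager({"orders": [], "user_profile": {"name": "Alice"}})
env_a = manager.create_env()
env_b = manager.create_env()
env_a["orders"].append("order_123")  # mutating one rollout's env leaves the other untouched
```

Isolating state this way keeps rollouts reproducible and prevents cross-contamination when many rollouts run in parallel.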
For questions and support, please open an issue on GitHub Issues.
- Thanks to the open-source community for the excellent tools and libraries
- Special thanks to all contributors who help improve MUA-RL
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you use MUA-RL in your research, please cite our paper:
```bibtex
@misc{zhao2025mua,
      title={MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for Agentic Tool Use},
      author={Weikang Zhao and Xili Wang and Chengdi Ma and Lingbin Kong and Zhaohua Yang and Mingxiang Tuo and Xiaowei Shi and Yitao Zhai and Xunliang Cai},
      year={2025},
      eprint={2508.18669},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.18669}
}
```