This repository presents a reinforcement learning (RL) recipe for improving end-to-end agentic capabilities, inspired by the perspective in *The Second Half* that realistic progress now hinges on defining problems and evaluations.
We provide a full τ-bench retail (τ-retail) environment to train and evaluate agents on policy-following, strategic clarification, and multi-hop tool use in realistic, rule-bound workflows.
Despite the controlled setup, initial results highlight that it remains challenging for an RL-based agent to:
- Strategically query the user for missing or clarifying information
- Perform effective multi-hop tool calls to achieve the intended outcome (a simplified rollout of this kind is sketched below)
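To make the target behavior concrete, below is a minimal sketch of the kind of multi-turn rollout τ-retail exercises: the agent first asks a clarifying question, then chains tool calls before acting. Everything here is illustrative only; the tool functions, message schema, and toy database are hypothetical stand-ins, not the actual τ-bench or verl APIs.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for τ-retail tools; names and the toy database are illustrative only.
FAKE_DB = {"orders": {"#W123": {"status": "delivered", "items": ["USB-C cable"]}}}

def find_user_id_by_email(email: str) -> str:
    """Toy tool: resolve a user id from an email address."""
    return "user_001" if "@" in email else ""

def get_order_details(order_id: str) -> dict:
    """Toy tool: look up an order in the fake database."""
    return FAKE_DB["orders"].get(order_id, {})

@dataclass
class Turn:
    role: str      # "agent", "user", or "tool"
    content: str

@dataclass
class Rollout:
    turns: list = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.turns.append(Turn(role, content))

def run_episode() -> Rollout:
    """One episode: clarify with the user first, then chain two tool calls before acting."""
    rollout = Rollout()

    # The task is under-specified, so the agent spends a turn on a clarifying question.
    rollout.add("agent", "Could you share the email on the account and the order id?")
    rollout.add("user", "alex@example.com, order #W123")

    # Hop 1: resolve the user id from the email.
    user_id = find_user_id_by_email("alex@example.com")
    rollout.add("tool", f"find_user_id_by_email -> {user_id}")

    # Hop 2: fetch the order before acting on it.
    order = get_order_details("#W123")
    rollout.add("tool", f"get_order_details -> {order}")

    rollout.add("agent", f"Order #W123 is {order['status']}; I'll confirm the exchange next.")
    return rollout

if __name__ == "__main__":
    for turn in run_episode().turns:
        print(f"[{turn.role}] {turn.content}")
```

Since the reward in this setup reflects end-of-episode task success (pass^1 below), the policy has to learn on its own when a clarifying question is worth a turn and which tool to call next.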
Key takeaway: the RL-tuned model improves the τ-retail score from 0.478 to 0.496 (+0.018 absolute, +3.8% relative) in the non-thinking setting.
Note: Reported rewards correspond to pass^1.
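For context, pass^k (as defined in the τ-bench paper) is the probability that all k i.i.d. trials on a task succeed, averaged over tasks, so pass^1 reduces to the plain per-task success rate. A minimal sketch of the computation, assuming per-task lists of boolean trial outcomes (this input format is an assumption, not the repo's actual logging format):

```python
from math import comb

def pass_at_k(trial_outcomes: list[list[bool]], k: int) -> float:
    """pass^k over tasks: P(all k sampled trials succeed), averaged across tasks.

    `trial_outcomes[i]` holds the boolean results of n i.i.d. trials for task i.
    """
    per_task = []
    for outcomes in trial_outcomes:
        n, c = len(outcomes), sum(outcomes)
        per_task.append(comb(c, k) / comb(n, k) if c >= k else 0.0)
    return sum(per_task) / len(per_task)

# With a single trial per task, pass^1 is just the mean success rate.
outcomes = [[True], [False], [True], [True]]
print(pass_at_k(outcomes, k=1))  # 0.75
```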
The training dataset contains 500 samples, and the test dataset contains 115 samples.
| Strategy | Pass^1 |
|---|---|
| TC (claude-3-5-sonnet-20241022) | TBD |
| TC (gpt-4o) | TBD |
| Baselines | TBD |
*TC = tool-calling strategy (as described in the τ-bench paper)
Refer to the VERL installation guide for detailed setup instructions.
Preprocess the τ-retail dataset:

```bash
python -m examples.data_preprocess.tau_retail.preprocess_tau_retail_dataset
```

You should see output similar to:

```
train dataset len : 500
test dataset len : 115
```

Hardware requirement: At least H100 × 8 GPUs are recommended to reproduce the results.
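Before launching training, you can sanity-check the preprocessed splits with pandas. The parquet filenames and directory below are assumptions based on verl's typical data layout; point them at whatever the preprocessing script actually writes:

```python
import pandas as pd

# Assumed output location; adjust to where preprocess_tau_retail_dataset wrote its files.
train = pd.read_parquet("data/tau_retail/train.parquet")
test = pd.read_parquet("data/tau_retail/test.parquet")

print(len(train), len(test))   # expect 500 and 115
print(train.columns.tolist())  # fields the trainer will consume (prompt, tools, reward spec, ...)
print(train.iloc[0])           # peek at a single sample
```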
Set your OpenAI API key and launch training in the background:

```bash
export OPENAI_API_KEY=<YOUR-API-KEY>
nohup bash examples/sglang_multiturn/run_tau_retail_multiturn.sh > train.log 2>&1 &
```

If you use this repository in your research or work, please cite it as follows:
```bibtex
@misc{tau_retail_rl,
  title  = {Tau-Retail End-to-End RL Experiment},
  author = {Shin, Seungyoun},
  year   = {2025},
  url    = {https://github.com/SeungyounShin/tau-retail-rl}
}
```