Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games), the action space is a joint distribution over tokens, making it extremely large and combinatorial. Sampling actions in such a space leads to extreme reward sparsity, which in turn causes high reward variance and hinders effective reinforcement learning.
To address this, we propose ARIA, a method that Aggregates Rewards in Intention space to enable efficient and effective language Agents training. ARIA aims to project natural language actions from the high-dimensional joint token distribution space into a low-dimensional intention space, where semantically similar actions are clustered and assigned shared rewards. This intention-aware reward aggregation reduces reward variance by densifying reward signals, fostering efficient and effective policy optimization.
Extensive experiments demonstrate that ARIA not only significantly reduces gradient variance but also delivers substantial performance gains, averaging 9.95% across four downstream tasks (e.g., negotiation and text-based games), consistently outperforming strong offline and online RL baselines.
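Conceptually, the reward aggregation step can be pictured as: embed each sampled language action, cluster the embeddings into intentions, and let every action in a cluster share the cluster's mean reward. Below is a minimal sketch of that idea, assuming a sentence-transformers embedder and KMeans with a fixed k; it is illustrative only, not the actual pipeline under reward_aggregation/.

```python
# Minimal sketch of intention-space reward aggregation (illustrative only).
# Assumes: sentence-transformers for embeddings, KMeans for clustering, fixed k.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def aggregate_rewards(actions, rewards, k=20):
    """Cluster natural-language actions and replace each reward with its cluster mean."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder embedding model
    embeddings = embedder.encode(actions)                # (N, d) action embeddings
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
    rewards = np.asarray(rewards, dtype=float)
    aggregated = np.empty_like(rewards)
    for c in range(k):
        mask = labels == c
        if mask.any():
            aggregated[mask] = rewards[mask].mean()      # shared reward per intention cluster
    return labels, aggregated
```

Sharing rewards within a cluster densifies the signal each distinct intention receives, which is what drives the variance reduction described above.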
You can install ARIA using the following steps:
# Create and activate conda environment
conda create -n aria python=3.10
conda activate aria
# Install the package
pip install -e .

The complete data processing pipeline transforms raw game data into training-ready datasets through several steps:
Starting with raw game data llama3-8b_{game}_msgs.jsonl, generate clustering labels for different k values (k=2 to k=100):
# For different game environments
cd reward_aggregation/{game}_clustering
python preprocess.py
python clustering.py
python postprocess.py

Supported games:
- bargaining (multi-agent)
- negotiation (multi-agent)
- guess_my_city (single-agent with actions & observations)
- twenty_questions (single-agent with special observation processing)
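For orientation, this labeling step boils down to clustering the action embeddings once per candidate k and storing every labeling with each record. A rough sketch under assumed field names (action_embedding, label_k{k}); the actual schema is defined by the preprocess/clustering/postprocess scripts above:

```python
# Illustrative sketch of Step 1: attach cluster labels for every k in [2, 100].
# Field names ("action_embedding", "label_k{k}") are assumptions, not the real schema.
import json
import numpy as np
from sklearn.cluster import KMeans

with open("llama3-8b_bargaining_msgs.jsonl") as f:
    records = [json.loads(line) for line in f]
embeddings = np.array([r["action_embedding"] for r in records])  # assumed precomputed by preprocess.py

for k in range(2, 101):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
    for record, label in zip(records, labels):
        record[f"label_k{k}"] = int(label)

with open("llama3-8b_bargaining_with_labels_k2_to_k100.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```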
Determine the optimal k values for clustering using silhouette analysis:
cd clustering
# For multi-agent games (bargaining, negotiation)
python clustering_multi.py --data_path {game}_clustering
# For single-agent games (guess_my_city, twenty_questions)
python clustering_single.py --data_path {game}_clustering

This will output the optimal k values for each agent/component.
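Silhouette analysis scores each candidate k by how well separated the resulting clusters are and keeps the k with the highest mean silhouette. A minimal sketch of that selection (the repo scripts additionally handle per-agent and per-component selection):

```python
# Sketch of Step 2: choose the k whose clustering maximizes the silhouette score.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_optimal_k(embeddings, k_min=2, k_max=100):
    best_k, best_score = k_min, -1.0
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)   # mean silhouette across all points
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```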
Use the optimal k values to generate the final labeled dataset:
cd clustering
python game_data_processor.py {environment} {input_file} {output_file}

Examples:
# Multi-agent games (bargaining, negotiation)
python game_data_processor.py bargaining \
llama3-8b_bargaining_with_labels_k2_to_k100.jsonl \
llama3-8b_bargaining_with_selected_labels.jsonl \
--alice-k 20 --bob-k 20
python game_data_processor.py negotiation \
llama3-8b_negotiation_with_labels_k2_to_k100.jsonl \
llama3-8b_negotiation_with_selected_labels.jsonl \
--alice-k 16 --bob-k 16
# Single-agent with action/observation clustering
python game_data_processor.py guess_my_city \
llama3-8b_guess_my_city_with_labels_k2_to_k100.jsonl \
llama3-8b_guess_my_city_with_selected_labels.jsonl \
--action-k 28 --observation-k 28
# Single-agent with special observation processing
python game_data_processor.py twenty_questions \
llama3-8b_twenty_questions_with_labels_k2_to_k100.jsonl \
llama3-8b_twenty_questions_with_selected_labels.jsonl \
--k 36

Convert the processed data into the final training format:
cd clustering
python gen_reinforce_multi.py

The final dataset will be saved to /ARIA/dataset/actor_reinforce_llama3-8b_multi.json and is ready for training.
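In spirit, this last conversion attaches the cluster-aggregated return to each (context, action) pair so the REINFORCE trainer can consume it directly. A hedged sketch of what one entry might look like; the authoritative schema is whatever gen_reinforce_multi.py emits:

```python
# Sketch of Step 4: turn labeled trajectories into REINFORCE training examples.
# Field names ("context", "action", "reward", "selected_label") are assumptions.
import json
from collections import defaultdict

with open("llama3-8b_bargaining_with_selected_labels.jsonl") as f:
    records = [json.loads(line) for line in f]

# Average raw returns within each intention cluster.
cluster_rewards = defaultdict(list)
for r in records:
    cluster_rewards[r["selected_label"]].append(r["reward"])
cluster_mean = {c: sum(v) / len(v) for c, v in cluster_rewards.items()}

examples = [
    {"context": r["context"], "action": r["action"],
     "reward": cluster_mean[r["selected_label"]]}
    for r in records
]
with open("actor_reinforce_llama3-8b_multi.json", "w") as f:
    json.dump(examples, f, indent=2)
```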
The complete pipeline:
llama3-8b_{game}_msgs.jsonl
↓ (Step 1: preprocess.py → clustering.py → postprocess.py)
llama3-8b_{game}_with_labels_k2_to_k100.jsonl
↓ (Step 2: clustering_multi.py / clustering_single.py)
optimal k values
↓ (Step 3: game_data_processor.py)
llama3-8b_{game}_with_selected_labels.jsonl
↓ (Step 4: gen_reinforce_multi.py)
actor_reinforce_llama3-8b_multi.json
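That dataset feeds a REINFORCE-style objective in which each sampled action carries its cluster-aggregated reward. A self-contained toy sketch of such a loss (placeholder tensors, not the actual trainer code):

```python
# Toy REINFORCE-style loss with cluster-aggregated rewards (illustration only).
import torch

def reinforce_loss(action_logprobs, aggregated_rewards, baseline=None):
    """action_logprobs: (B,) summed log-probs of sampled actions.
    aggregated_rewards: (B,) shared rewards from the intention clusters."""
    if baseline is None:
        baseline = aggregated_rewards.mean()             # simple mean baseline
    advantages = (aggregated_rewards - baseline).detach()
    return -(advantages * action_logprobs).mean()

# Toy usage with placeholder tensors.
logp = torch.randn(8, requires_grad=True)
rewards = torch.rand(8)
reinforce_loss(logp, rewards).backward()
```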
cd scripts
# Offline training
python run_offline.py --config-name reinforce_llm
accelerate launch --config_file config/accelerate_config/default_config.yaml run_offline.py --config-name reinforce_llm
# Online training
python run_online.py --config-name onlinereinforce_llm
accelerate launch --config_file config/accelerate_config/default_config.yaml run_online.py --config-name onlinereinforce_llm

We train a reward model (RM) on the actor's past rollout results (refreshed every 50 steps) to ensure accurate advantage estimation for the actor; a sketch of this refresh loop follows the RM training command below.
cd scripts
bash train_rm.sh
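The RM refresh mentioned above can be pictured as the loop below: the actor takes a policy-gradient step every iteration, while the reward model is refit on the accumulated rollouts every 50 steps. All components here are toy stand-ins, not the classes used by run_online.py:

```python
# Toy sketch of the periodic reward-model refresh (every 50 actor steps).
import random

class ToyRewardModel:
    def __init__(self):
        self.mean_reward = 0.0
    def fit(self, rollouts):
        self.mean_reward = sum(r["reward"] for r in rollouts) / len(rollouts)
    def advantage(self, rollout):
        return rollout["reward"] - self.mean_reward      # reward minus learned baseline

RM_UPDATE_INTERVAL = 50
reward_model, rollout_buffer = ToyRewardModel(), []

for step in range(200):
    rollout = {"reward": random.random()}                # stand-in for an actor rollout
    rollout_buffer.append(rollout)
    advantage = reward_model.advantage(rollout)          # fed into the policy-gradient update
    # actor.update(rollout, advantage) would happen here in the real trainer
    if (step + 1) % RM_UPDATE_INTERVAL == 0:
        reward_model.fit(rollout_buffer)                 # refresh RM on past rollout results
```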
To evaluate your model in single-agent environments (e.g., Twenty Questions, Guess My City), follow these steps:

First, make sure your model is served with vLLM. Here's an example command:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --model model_path --port 8036 --tensor-parallel-size 4 --gpu-memory-utilization 0.6

- Replace model_path with the path to your model checkpoint (or adjust the entry point to however you serve the model).
- The model will be accessible at http://localhost:8036.
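Once the server is up, a quick request against the OpenAI-compatible endpoint confirms it is reachable; the model name below is a placeholder and must match whatever the server reports at /v1/models:

```python
# Sanity-check the vLLM server started above (model name is a placeholder).
import requests

resp = requests.post(
    "http://localhost:8036/v1/completions",
    json={"model": "llama3-8B", "prompt": "Hello,", "max_tokens": 16},
)
print(resp.json()["choices"][0]["text"])
```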
Navigate to the evaluation directory and execute the evaluation script:
cd evaluation
bash eval.sh

The script does the following:
- Evaluates the model on two environments: twenty_questions and guess_my_city.
- Sends requests to the vLLM server via the specified BASE_URL.
- Uses the model named llama3-8B (you can change this by editing MODEL_NAME in eval.sh).
- Repeats each evaluation 200 times.
- Saves results to ../results/single_agent/llama3-8B.