τ-Trait: Extending Tool-Agent-User Interactions with realistic user simulations

$\tau$-Trait is a benchmark for evaluating large language models (LLMs) with realistic, persona-aware simulations. It builds on $\tau$-Bench but introduces two key modifications:

TraitBasis-generated personas – more accurate and interpretable user simulations.
Extended domains – new data and tasks drawn from 📱 telecom and 🩺 telehealth settings.

$\tau$-Trait is designed to test model robustness, personalization, and fairness in high-impact, customer-facing domains where user traits strongly influence interaction quality.

Results

Change in agent performance when the simulated user exhibits a TraitBasis trait, relative to a simulated user with no traits, averaged across 3 runs. Negative values are drops.

Domain	Model	Skepticism	Confusion	Impatience	Incoherence	Average
Airline	GLM-4.5	-11.0%	-16.9%	-12.8%	-12.2%	-13.2%
Airline	GPT-4o	-6.7%	-5.0%	-4.4%	-6.7%	-5.7%
Airline	Kimi K2	-11.8%	-9.5%	-6.2%	-7.1%	-8.7%
Retail	GLM-4.5	0.2%	-5.4%	-2.6%	-0.5%	-2.1%
Retail	GPT-4o	-29.2%	-34.2%	-25.9%	-22.9%	-28.1%
Retail	Kimi K2	-21.9%	-45.7%	-31.2%	-21.4%	-30.0%
Telecom	GLM-4.5	0.8%	-16.8%	-3.9%	-2.3%	-5.5%
Telecom	GPT-4o	-11.5%	-14.0%	-16.9%	-8.7%	-12.8%
Telecom	Kimi K2	-11.4%	-18.1%	-14.7%	-4.5%	-12.2%
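
The Average column appears to be the unweighted mean of the four trait columns. The short check below recomputes it for the Airline rows; the dictionary layout and the `mean` helper are only for illustration and are not part of the repository.

```python
# Per-trait changes (in percent) copied from the Airline rows of the table.
airline = {
    "GLM-4.5": [-11.0, -16.9, -12.8, -12.2],
    "GPT-4o": [-6.7, -5.0, -4.4, -6.7],
    "Kimi K2": [-11.8, -9.5, -6.2, -7.1],
}
reported_average = {"GLM-4.5": -13.2, "GPT-4o": -5.7, "Kimi K2": -8.7}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


for model, changes in airline.items():
    # The table is rounded to one decimal place, so allow a small tolerance.
    assert abs(mean(changes) - reported_average[model]) < 0.1
    print(f"{model}: {mean(changes):.2f}% (reported {reported_average[model]}%)")
```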

🚀 Getting Started

Setup

Clone this repository:

git clone https://github.com/collinear-ai/tau-trait && cd ./tau-trait

Install from source along with other required packages:

conda create -n tau_trait -y python=3.11
conda activate tau_trait
pip install jupyterlab ipykernel nest_asyncio jinja2 --no-cache
pip install "openai>=1.13.3" "mistralai>=0.4.0" "anthropic>=0.26.1" "google-generativeai>=0.5.4" "tenacity>=8.3.0" "termcolor>=2.4.0" "numpy>=1.26.4" "litellm==1.41.0"
pip install -e .

Set up your steer API key (for the user model) and your OpenAI / Anthropic / Google / Mistral / AnyScale API keys (for the assistant model) as environment variables. Note: if you do not have a steer API key, you can use tautrait-001 for non-production workloads. Please reach out if you are interested in your own API key for optimized inference and throughput.

OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
MISTRAL_API_KEY=...
STEER_API_KEY=...
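
Before launching a run, it can help to confirm which of these variables are visible to Python. The following is a minimal sketch; `missing_keys` is a hypothetical helper, and which keys you actually need depends on the providers you select.

```python
import os

# Variable names listed above. Not all of them are needed for every run.
KEY_NAMES = [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "MISTRAL_API_KEY",
    "STEER_API_KEY",
]


def missing_keys(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    for name in missing_keys(KEY_NAMES):
        print(f"warning: {name} is not set")
```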

Usage

Quick-Start

To experiment interactively, see the notebook at notebooks/getting_started.ipynb.

CLI commands

python run.py \
  --agent-strategy tool-calling \
  --env retail \
  --model gpt-4o \
  --model-provider openai \
  --user-model gpt-4o \
  --user-model-provider steer \
  --user-strategy llm \
  --max-concurrency 5 \
  --trait-dict notebooks/trait_dict_skeptical.json \
  --result-fp notebooks/results/test.json \
  --endpoint https://steer.collinear.ai/steer_bare
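
If you prefer to launch runs from Python (for example, to loop over several trait files), the same command can be assembled as an argument list. This is a sketch that uses only the flags shown above; `build_command` is a hypothetical helper and not part of the repository.

```python
import sys


def build_command(settings, script="run.py"):
    """Turn a {flag-name: value} mapping into an argv list for run.py."""
    argv = [sys.executable, script]
    for flag, value in settings.items():
        argv += [f"--{flag}", str(value)]
    return argv


# Settings copied from the CLI example above.
settings = {
    "agent-strategy": "tool-calling",
    "env": "retail",
    "model": "gpt-4o",
    "model-provider": "openai",
    "user-model": "gpt-4o",
    "user-model-provider": "steer",
    "user-strategy": "llm",
    "max-concurrency": 5,
    "trait-dict": "notebooks/trait_dict_skeptical.json",
    "result-fp": "notebooks/results/test.json",
    "endpoint": "https://steer.collinear.ai/steer_bare",
}

command = build_command(settings)
print(" ".join(command[1:]))
# To launch it from the repository root:
#   import subprocess; subprocess.run(command, check=True)
```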

Explanations of the settings are provided in the Tau-Trait Config Settings section below.

Tau-Trait Config Settings

General

--num-trials (int, default: 1)
Number of independent trials to run.
--seed (int, default: 10)
Random seed for reproducibility.
--shuffle (int, default: 0)
Whether to shuffle task order (0 = no, 1 = yes).
--log-dir (str, default: results)
Directory where logs and results are stored.
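
As a rough illustration of how --seed and --shuffle might interact, the sketch below shuffles task indices with a seeded generator so that the order is the same on every run. This is an assumption about the behavior, not the repository's code; `task_order` is a hypothetical helper.

```python
import random


def task_order(num_tasks, seed=10, shuffle=0):
    """Return task indices, shuffled reproducibly when shuffle is 1."""
    indices = list(range(num_tasks))
    if shuffle:
        # A dedicated generator keeps the order independent of global state.
        random.Random(seed).shuffle(indices)
    return indices


print(task_order(5))             # [0, 1, 2, 3, 4]
print(task_order(5, shuffle=1))  # same permutation on every run for a given seed
```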

Environment & Tasks

--env (str, choices: retail, airline, default: retail)
Domain environment in which to run simulations.
--task-split (str, choices: train, test, dev, default: test)
Dataset split of tasks to run (applies only to the retail domain currently).
--start-index (int, default: 0)
Index of the first task to run.
--end-index (int, default: -1)
Index of the last task to run. Use -1 to run all remaining tasks.
--task-ids (list of int, optional)
Explicit list of task IDs to run (overrides index ranges).
--trait-dict (dict of str, optional)
Traits and intensities that the simulated user should exhibit during the simulation.
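
The interaction between --start-index, --end-index, and --task-ids can be sketched as follows. `select_task_indices` is a hypothetical helper written for illustration. Whether the end index is inclusive is not stated here, so the sketch assumes it is exclusive, like Python's `range`.

```python
def select_task_indices(num_tasks, start_index=0, end_index=-1, task_ids=None):
    """Pick which task indices to run from the documented options."""
    if task_ids:
        # An explicit list of IDs overrides the index range.
        return list(task_ids)
    # -1 means "run everything from start_index onward".
    stop = num_tasks if end_index == -1 else min(end_index, num_tasks)
    # Assumption: end_index is exclusive, like Python's range().
    return list(range(start_index, stop))


print(select_task_indices(10, start_index=7))          # [7, 8, 9]
print(select_task_indices(10, 2, 5))                   # [2, 3, 4]
print(select_task_indices(10, 2, 5, task_ids=[1, 8]))  # [1, 8]
```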

Agent Configuration

--model (str, required)
The model to use for the agent.
--model-provider (str, choices from provider_list)
Provider for the agent’s model.
--temperature (float, default: 0.0)
Sampling temperature for the action model (higher = more randomness).
--few-shot-displays-path (str, optional)
Path to a JSONL file containing few-shot demonstration examples.

User Simulator Configuration

--user-model (str, default: gpt-4o)
Model to use for the user simulator.
--user-model-provider (str, optional)
Provider for the user simulator’s model.
--user-strategy (str, choices from UserStrategy, default: llm)
Strategy for the simulated user (e.g., LLM-based).

Execution Controls

--max-concurrency (int, default: 1)
Number of tasks to run in parallel.
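
To make the effect of --max-concurrency concrete, here is a minimal sketch that caps the number of tasks in flight with a thread pool. Whether the repository uses threads, processes, or asyncio is not stated here; `run_task` and `run_all` are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_index):
    """Placeholder for running one simulated conversation."""
    return {"task_id": task_index, "reward": 0.0}


def run_all(task_indices, max_concurrency=1):
    """Run tasks with at most max_concurrency in flight at once."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() returns results in input order even if tasks finish out of order.
        return list(pool.map(run_task, task_indices))


results = run_all(range(4), max_concurrency=2)
print([r["task_id"] for r in results])  # [0, 1, 2, 3]
```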

τ-Trait vs τ-Bench rollouts

@article{he2025impatient,
  title        = {Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents},
  author       = {He, Muyu and Kumar, Anand and Mackey, Tsach and Rajeev, Meghana and Zou, James and Rajani, Nazneen},
  journal      = {arXiv preprint arXiv:2510.04491},
  year         = {2025},
  url          = {https://arXiv.org/abs/2510.04491}
}
