The tau2 package - packaged by NVIDIA
$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
All contributions to this repository must be licensed under the Apache License, Version 2.0. Existing code under other licenses remains unchanged; this policy applies only to new additions.
NVIDIA NeMo Evaluator
This is a forked τ²-bench package wrapped in NVIDIA NeMo Evaluator.
The goal of NVIDIA NeMo Evaluator is to advance and refine state-of-the-art methodologies for model evaluation, and deliver them as modular evaluation packages (evaluation containers and pip wheels) that teams can use as standardized building blocks.
Quick Start Guide
NVIDIA NeMo Evaluator provides you with evaluation clients that are specifically built to evaluate model endpoints using our Standard API.
Launching an Evaluation for an LLM
- Install the package
pip install nvidia-tau2
- (Optional) Set a token for your API endpoint if it's protected
export MY_API_KEY="your_api_key_here"
- List the available evaluations
$ nemo-evaluator ls
Available tasks:
* tau2_bench_telecom
* tau2_bench_airline
* tau2_bench_retail
...
- Run the evaluation of your choice
nemo-evaluator run_eval \
--eval_type tau2_bench_telecom \
--model_id meta/llama-3.1-70b-instruct \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_type chat \
--api_key_name MY_API_KEY \
--output_dir /workspace/results
- Gather the results
cat /workspace/results/results.yml
Command-Line Tool
Each package comes pre-installed with a set of command-line tools, designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for tau2-bench:
Commands
1. List Evaluation Types
nemo-evaluator ls
Displays the evaluation types available within the harness.
2. Run an Evaluation
The nemo-evaluator run_eval command executes the evaluation process. Below are the flags and their descriptions:
Required flags
- --eval_type <string> - The type of evaluation to perform (e.g., tau2_bench_airline, tau2_bench_retail, tau2_bench_telecom)
- --model_id <string> - The name or identifier of the model to evaluate
- --model_url <url> - The API endpoint where the model is accessible
- --model_type <string> - The type of the model to evaluate, currently either "chat", "completions", or "vlm"
- --output_dir <directory> - The directory to use as the working directory for the evaluation. The results, including the results.yml output file, will be saved here
Optional flags
- --api_key_name <string> - The name of the environment variable that stores the Bearer token for the API, if authentication is required
- --run_config <path> - Specifies the path to a YAML file containing the evaluation definition
Example
nemo-evaluator run_eval \
--eval_type tau2_bench_telecom \
--model_id my_model \
--model_type chat \
--model_url http://localhost:8000 \
--output_dir ./evaluation_results
If the model API requires authentication, set the API key in an environment variable and reference it using the --api_key_name flag:
export MY_API_KEY="your_api_key_here"
nemo-evaluator run_eval \
--eval_type tau2_bench_telecom \
--model_id my_model \
--model_type chat \
--model_url http://localhost:8000 \
--api_key_name MY_API_KEY \
--output_dir ./evaluation_results
Configuring Evaluations via YAML
Evaluations in NVIDIA NeMo Evaluator are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow a standard API which ensures consistency across evaluations.
Example of a YAML config:
config:
type: tau2_bench_telecom
params:
parallelism: 10
limit_samples: 20
target:
api_endpoint:
model_id: meta/llama-3.1-8b-instruct
type: chat
url: https://integrate.api.nvidia.com/v1/chat/completions
api_key_name: NVIDIA_API_KEY
The priority of overrides is as follows:
- Command line arguments
- User config (as seen above)
- Task defaults (defined per task type)
- Framework defaults
The --dry_run option allows you to print the final run configuration and command without executing the evaluation.
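For example, a minimal sketch of combining a YAML config with a command-line override (the config file name here is illustrative):
# Assumes my_config.yml contains a config like the YAML example above.
# --model_id on the command line overrides model_id from the config file;
# --dry_run prints the final merged configuration without executing.
nemo-evaluator run_eval \
    --eval_type tau2_bench_telecom \
    --run_config my_config.yml \
    --model_id meta/llama-3.1-70b-instruct \
    --output_dir /workspace/results \
    --dry_run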
Figure 1: τ²-bench allows users to interact with the agent and the environment
Figure 2: Trajectory of a conversation between an agent and a user
🆕 What's New
🤖 Reinforcement Learning Support (New!)
τ²-bench now supports RL training with a Gymnasium-compatible interface:
- 🏋️ Train RL Agents: Use the gym interface to train agents with popular RL frameworks.
- 🎮 Play as Agent or User: Interactive mode lets you control either the agent or the user in conversations
- 📊 Train/Test Splits: To help support experiments around training Agents and evaluating them, all domains include standardized task splits for proper train/test evaluation.
⚠️ IMPORTANT FOR BACKWARD COMPATIBILITY: If you are just evaluating an agent (not training), you MUST use the base task split to evaluate on the complete task set that matches the original τ²-bench structure. This ensures your results are comparable to previous evaluations and maintains consistency with the established benchmark. (If you don't specify a task split, it will default to base.)
- 🔧 Gymnasium Compatible: Standard gym interface works with existing RL tools and libraries
→ See Gym Documentation | → Try CLI Play Mode
🏆 Live Leaderboard (v0.2.0)
The τ²-bench leaderboard is now live at taubench.com!
- 📊 Interactive Rankings: Compare model performance across all domains
- 📱 Mobile-Friendly: View results on any device
- 🔍 Detailed Analysis: Explore trajectories and conversation flows
- 📥 Easy Submission: Submit your results directly through the interface
→ Visit the Leaderboard | → Submit Your Results
Overview
$\tau^2$-bench implements a simulation framework for evaluating customer service agents across various domains.
$\tau^2$-bench is the new iteration of the original $\tau$-bench, featuring code fixes and an additional telecom domain.
Each domain specifies:
- a policy that the agent must follow
- a set of tools that the agent can use
- a set of tasks to evaluate the agent's performance
- optionally, a set of tools that the user simulator can use
The available domains are: mock, airline, retail, and telecom.
All the information that an agent developer needs to build an agent for a domain can be accessed through the domain's API docs. See View domain documentation for more details.
Installation
- Clone the repository:
git clone https://github.com/sierra-research/tau2-bench
cd tau2-bench
- Create a new environment (optional)
$\tau^2$-bench requires Python 3.10 or higher. You may create and activate a new environment:
python -m venv .venv
source .venv/bin/activate
- Install tau2
pip install -e .
This will enable you to run the tau2 command.
Note: If you use pip install . (without -e), you'll need to set the TAU2_DATA_DIR environment variable to point to your data directory:
export TAU2_DATA_DIR=/path/to/your/tau2-bench/data
Check your data directory setup:
After installation, you can verify that your data directory is correctly configured by running:
tau2 check-data
This command will check if the data directory exists and print instructions if it is missing.
To remove all the generated files and the virtual environment, run:
make clean
Quick Start
Setup LLM API keys
We use LiteLLM to manage LLM APIs, so you can use any LLM provider supported by LiteLLM.
To provide your API keys, copy .env.example as .env and edit it to include your API keys.
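For example, a minimal .env might look like the following (the key names follow LiteLLM's provider conventions; include only the providers you actually use):
# .env -- API keys read by LiteLLM; variable names depend on your provider
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here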
Run agent evaluation
To run a test evaluation on only 5 tasks with 1 trial per task, run:
tau2 run \
--domain airline \
--agent-llm gpt-4.1 \
--user-llm gpt-4.1 \
--num-trials 1 \
--num-tasks 5
Results will be saved in data/tau2/simulations/.
💡 Tip: For full agent evaluation that matches the original τ²-bench methodology, remove --num-tasks and use --task-split base to evaluate on the complete task set.
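For example, a full evaluation run might look like this (the trial count is illustrative):
tau2 run \
    --domain airline \
    --agent-llm gpt-4.1 \
    --user-llm gpt-4.1 \
    --num-trials 4 \
    --task-split base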
Command Line Interface
The tau2 command provides a unified interface for all functionality:
Running Benchmark
tau2 run \
--domain <domain> \
--agent-llm <llm_name> \
--user-llm <llm_name> \
--num-trials <trial_count> \
--task-ids <task_ids> \
--max-concurrency <concurrent_sims> \
...
Interactive Play Mode
tau2 play
Experience τ²-bench from either perspective! The play mode allows you to:
- Play as Agent: Manually control the agent's responses and tool calls
- Play as User: Control the user while an LLM agent handles requests (available in domains with user tools like telecom)
- Understand tasks by walking through scenarios step-by-step
- Test strategies before implementing them in code
- Choose task splits to practice on training data or test on held-out tasks
This is perfect for:
- Getting familiar with domain policies and tools from both perspectives
- Debugging task scenarios and conversation flows
- Developing intuition for agent strategies
- Testing user behavior and agent responses
- Training yourself before training your model!
See the Gym Documentation for more details on using the gymnasium interface programmatically, including the AgentGymEnv (play as agent) and UserGymEnv (play as user).
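As a rough illustration, a programmatic loop over the gym interface could look like the sketch below. The import path, constructor arguments, and observation/action formats here are assumptions; consult the Gym Documentation for the actual API.
# Hypothetical sketch of a Gymnasium-style interaction loop.
# The import path and constructor arguments are assumptions.
from tau2.gym import AgentGymEnv

def my_policy(observation):
    # Placeholder policy: return the agent's next message or tool call.
    return "Hello, how can I help you today?"

env = AgentGymEnv(domain="telecom", task_split="train")  # args assumed
obs, info = env.reset()
done = False
while not done:
    action = my_policy(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated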
Viewing Results
tau2 view
This tool allows you to:
- Browse simulation files (in data/tau2/simulations/)
- View agent performance metrics
- View a particular simulation
- View task details
View domain documentation
tau2 domain <domain>
Visit http://127.0.0.1:8004/redoc to see the domain policy and API documentation.
Check data configuration
tau2 check-data
This command checks if your data directory is properly configured and all required files are present.
Leaderboard Submission
To submit your agent results to the τ²-bench leaderboard, you need to prepare a valid submission package that meets specific requirements.
Requirements for Valid Submissions
Your trajectory runs must follow these constraints:
- Complete domain coverage: Include results for all three domains: retail, airline, and telecom
- Consistent model configuration: All trajectory files must use:
  - The same agent LLM with identical arguments across all domains
  - The same user simulator LLM with identical arguments across all domains
- One result per domain: Each domain should appear exactly once in your submission
- All tasks completed: Run evaluation on all tasks within each domain (don't use --task-ids or --num-tasks filters)
📝 Note: For consistency with the original τ²-bench evaluation methodology, use the base task split when evaluating your agent to ensure you're testing on the complete, standard task set.
Preparing Your Submission
Step 1: Run Evaluations
First, run your agent evaluation on all domains with consistent settings:
# Example: Run complete evaluation for all domains
tau2 run --domain retail --agent-llm gpt-4.1 --user-llm gpt-4.1 --num-trials 4 --save-to my_model_retail
tau2 run --domain airline --agent-llm gpt-4.1 --user-llm gpt-4.1 --num-trials 4 --save-to my_model_airline
tau2 run --domain telecom --agent-llm gpt-4.1 --user-llm gpt-4.1 --num-trials 4 --save-to my_model_telecom
Important: Use identical --agent-llm, --user-llm, and their arguments across all runs.
Step 2: Prepare Submission Package
Use the submission preparation tool to create your leaderboard submission:
tau2 submit prepare data/tau2/simulations/my_model_*.json --output ./my_submission
This command will:
- Verify all trajectory files are valid
- Check that submission requirements are met
- Compute performance metrics (Pass^k rates)
- Prompt for required metadata (model name, organization, contact email)
- Create a structured submission directory with:
  - submission.json: Metadata and metrics
  - trajectories/: Your trajectory files
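The resulting layout looks roughly like this (the trajectory file names are illustrative):
my_submission/
├── submission.json
└── trajectories/
    ├── my_model_retail.json
    ├── my_model_airline.json
    └── my_model_telecom.json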
Step 3: Validate Your Submission
Before submitting, validate your submission package:
tau2 submit validate ./my_submission
This will verify:
- All required files are present
- Trajectory files are valid
- Domain coverage is complete
- Model configurations are consistent
Additional Options
Skip Verification (if needed)
tau2 submit prepare data/tau2/simulations/my_model_*.json --output ./my_submission --no-verify
Verify Individual Trajectory Files
tau2 submit verify-trajs data/tau2/simulations/my_model_*.json
Submitting to the Leaderboard
Once your submission package is prepared and validated:
- Review the generated submission.json file
- Follow the submission guidelines in web/leaderboard/public/submissions/README.md to create a Pull Request
- Keep your trajectories/ directory for reference
The leaderboard will display your model's Pass^k success rates (k=1,2,3,4) across all domains.
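For reference, pass^k (following the definition introduced in the original τ-bench work) estimates the probability that an agent succeeds at a task in all of k independent trials. With n trials per task, of which c succeed, it is computed as an average over tasks:

$$\text{pass}^k = \mathbb{E}_{\text{task}}\left[\binom{c}{k} \middle/ \binom{n}{k}\right]$$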
Experiments
Experimental Code Directory
The @experiments/ directory contains experimental features and research code that extends beyond the core tau2 benchmark. This directory is designed for community contributions of innovative approaches, prototypes, and new features that are not part of the core evaluation framework.
- Purpose: Research code and experimental features
- Location: src/experiments/
- Usage: Each experimental component has its own README with documentation
- Status: Experimental code is provided as-is and may not be fully tested or supported
For more details, see the experiments README.
Running Ablation Studies (No User, or Agent with Oracle Plan)
The telecom domain enables running ablation studies.
- Running an LLM in no-user mode. In this mode, the LLM is given all the tools and information upfront. Just choose llm_agent_solo as the agent and dummy_user as the user.
tau2 run \
--domain telecom \
--agent llm_agent_solo \
--agent-llm gpt-4.1 \
--user dummy_user \
...
- Running an LLM in oracle-plan mode. In this mode, the LLM is given an oracle plan ahead of time, alleviating the need for action planning. Just choose llm_agent_gt as the agent.
tau2 run \
--domain telecom \
--agent llm_agent_gt \
--agent-llm gpt-4.1 \
--user-llm gpt-4.1 \
...
Running Telecom Domain with Workflow Policy
To test the impact of policy format, we provide an additional "workflow" policy for the telecom domain.
To run using this policy, use the telecom-workflow domain.
tau2 run \
--domain telecom-workflow \
--agent-llm gpt-4.1 \
--user-llm gpt-4.1 \
...
Domains
For all the details see the domains README.
Basics
- Code is located in src/tau2/domains/
- Data is located in data/tau2/domains/
- Each domain has its own configuration and task definitions
View domain-specific policy and API docs:
Run the following command to see the domain policy and API documentation.
tau2 env <domain>
Then visit http://127.0.0.1:8004/redoc
Environment CLI (beta)
An interactive command-line interface for directly querying and testing domain environments. Features:
- Interactive query interface with domain-specific tools
- Support for multiple domains (airline, mock, etc.)
- Session management with history
To use:
make env-cli
Available commands:
- :q - quit the program
- :d - change domain
- :n - start new session (clears history)
Example usage:
$ make env-cli
Welcome to the Environment CLI!
Connected to airline domain.
Query (:n new session, :d change domain, :q quit)> What flights are available from SF to LA tomorrow?
Assistant: Let me check the flight availability for you...
[Flight details will appear here]
The Environment CLI is useful for:
- Testing domain tools and queries
- Debugging environment responses
- Exploring available domain functionality
- Quick domain interaction without starting the full server stack
Run tests
To run the test suite, use the command:
make test
Config
To configure the framework, see the config file.
LLM Call Caching
LLM call caching is disabled by default.
To enable LLM call caching:
- Make sure Redis is running (one way to start it is shown below).
- Update the Redis config in config.py if necessary.
- Set LLM_CACHE_ENABLED to True in config.py.
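For example, one way to start a local Redis instance, assuming Docker is available (the container name is illustrative):
docker run -d --name tau2-redis -p 6379:6379 redis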
Evaluate Your Own Agent
For local or remote agent evaluation, see our agent developer guide.
Contributing
We welcome contributions to τ²-bench! Whether you're fixing bugs, adding new features, creating new domains, or contributing experimental research code, please see our Contributing Guide for detailed guidelines on:
- Opening issues before starting work
- Branch naming conventions and development workflow
- Code quality standards and testing requirements
- Pull request guidelines for clean, reviewable contributions
- Guidelines specific to domain and experimental contributions
For experimental features and research code, check out the @experiments/ directory.
Orchestration Sequence Diagram
sequenceDiagram
participant O as Orchestrator
participant A as Agent
participant U as UserSimulator
participant E as Environment
Note over O: Initialize(task)
rect rgb(100, 150, 150)
O->>A: get_init_state_info(message_history)
A->>O: agent_state_info
O->>U: get_init_state_info(message_history)
U->>O: user_state_info
O->>E: set_state(initialization_data, initialization_actions, message_history)
end
Note over O: Start simulation
loop Pass messages between Agent, User, and Environment
alt Agent/Env to User
rect rgb(200, 150, 150)
O->>U: generate_next_message(msg, user_state_info)
U-->>O: (user_msg, user_state_info)
end
Note over O: Check if user_msg is STOP
else User/Env to Agent
rect rgb(100, 200, 100)
O->>A: generate_next_message(msg, agent_state_info)
A-->>O: (assistant_msg, agent_state_info)
Note over O: Check if too many errors
end
else User/Agent to Environment
rect rgb(150, 150, 200)
O->>E: get_response(tool_call)
E-->>O: tool_message
end
end
Note over O: Check if max turns reached.
end
Note over O: Return simulation run
Citation
@misc{barres2025tau2,
title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan},
year={2025},
eprint={2506.07982},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.07982},
}