Engineering

τ³-Bench: Fixing Airline + Retail

February 2026

TL;DR

We audited and fixed 50+ tasks across the airline and retail domains in τ-bench. Fixes addressed incorrect expected actions, ambiguous user instructions, impossible constraints, and missing fallback behaviors.

Since its release, τ-bench has seen widespread adoption as a benchmark for evaluating tool-using language agents. Along with that adoption, the community has been invaluable in identifying issues — from annotation errors to underspecified tasks — in the original airline (50 tasks) and retail (114 tasks) domains.

As we prepared for the τ³ release, we set out to address this feedback head-on. The majority of fixes were obtained directly from τ-Bench Verified (Cuadron et al., SABER), which systematically identified annotation errors and underspecified tasks in the original benchmark. We collaborated closely with the team at Amazon on this effort — going back and forth on proposed revisions until reaching consensus on each fix. We also incorporated additional fixes from community pull requests, including several from Anthropic, and integrated everything into the official task set.

This post summarizes what we fixed and how it changed model scores.

What We Fixed

Fixes fell into five broad categories:

Across both domains, we fixed 27 airline tasks and 26 retail tasks.

1. Incorrect Expected Actions

This was the most impactful category: several tasks included expected actions that were wrong according to the domain's own policy.

2. Ambiguous User Instructions

Many tasks had user simulation instructions that left key information underspecified, causing the simulated user to behave unpredictably across trials.

3. Impossible or Contradictory Constraints

Some tasks asked agents to do things that were structurally impossible.

4. Missing Fallback Behaviors

Several tasks never specified what the user should do when the agent's search failed to return the expected results.

5. Policy Loophole Prevention

Some tasks could be "solved" by agents using clever workarounds that violated the spirit of the task.

🔧 All Task Fixes (table: Task · Domain · Category · Description)

Before & After Results

We re-ran all three models on both domains after applying the fixes. The chart below shows pass^1 and pass^4 scores before and after:

📊 Before vs. After Task Fixes (4 trials per task)

The most dramatic improvement is in airline, where pass^1 rose by 14.0 to 20.0 points depending on the model. This is largely because many of the original airline tasks penalized agents for correctly following the policy; once those incorrect expected actions were removed, all models saw substantial gains.

Retail improvements were more modest (-0.4 to +5.5 points on pass^1) because the original retail tasks had fewer outright incorrect expected actions. Most retail fixes addressed ambiguity and missing fallback behaviors, which primarily reduced variance across trials rather than uniformly boosting scores.

An interesting pattern: pass^4 improvements were often even larger than pass^1 improvements in airline, particularly for GPT-5.2 (high), which went from 50.0% to 72.0%, a 22-point gain. This suggests the fixes disproportionately helped tasks where the model was intermittently succeeding, by removing evaluation noise that had turned some correct completions into false negatives.
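For readers unfamiliar with the metric: pass^k in τ-bench measures the probability that k independent trials of the same task all succeed, so it penalizes inconsistency more heavily as k grows. Below is a minimal sketch of the standard unbiased estimator, C(c, k) / C(n, k) for c observed successes in n trials; the function name and specific trial counts are illustrative, not taken from the benchmark harness:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k: the probability that k i.i.d. trials all succeed,
    given c observed successes out of n trials (unbiased estimator C(c,k)/C(n,k))."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# With 4 trials per task, a task that passes 3 of 4 trials contributes:
print(pass_hat_k(4, 3, 1))  # 0.75 toward pass^1
print(pass_hat_k(4, 3, 4))  # 0.0  toward pass^4 (one failed trial zeroes it out)
```

This is why removing noise that occasionally flips a correct completion into a failure can lift pass^4 more than pass^1: a single flipped trial zeroes out a task's pass^4 contribution while only shaving a quarter off its pass^1.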

The updated tasks are now live on the leaderboard, and all trajectory data is available in the trajectory visualizer. The fixes are also reflected in the open-source repository.

Citation

@misc{cuadron2025sabersmallactionsbig,
  title={SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents},
  author={Alejandro Cuadron and Pengfei Yu and Yang Liu and Arpit Gupta},
  year={2025},
  eprint={2512.07850},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2512.07850}
}

@misc{barres2025tau2,
  title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
  author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan},
  year={2025},
  eprint={2506.07982},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.07982}
}