τ-Knowledge is a benchmark for evaluating AI agents in knowledge-intensive customer support. It pairs a realistic fintech knowledge base (698 documents across 21 product categories, ~195K tokens) with tasks requiring multi-step reasoning, policy application, and tool use. The best model, GPT-5.2 with high reasoning, achieves only ~26% pass^1: current frontier models still fail at retrieving, interpreting, and acting on messy real-world documentation.
Modern agents are no longer deployed with everything they need neatly stuffed into context. Instead, they are expected to operate over large, messy, and constantly evolving knowledge bases: internal documentation, policy manuals, tool descriptions, product catalogs, and procedural guides.
These knowledge bases have several properties that make them uniquely difficult for agents:
- Information is unstructured and spread across long documents
- Policies are procedural, conditional, and easy to misapply
- Content is often time-sensitive, with exceptions and promotions that change behavior
- Product names and internal terms are out-of-distribution, breaking embedding-based assumptions
- Some tools are discoverable, referenced only in KB documents rather than given to the agent directly
Despite how common this setting is in real deployments, existing benchmarks rarely capture it. Most evaluations isolate retrieval (e.g., QA over documents) or isolate tool use; few require agents to reason over a private knowledge base while coordinating tools in a live user interaction.
Introducing τ-Knowledge
To address this gap, we introduce τ-Knowledge, an extension of τ-Bench that evaluates agents in knowledge-grounded, human-facing environments.
τ-Knowledge introduces a new domain: τ-Banking, a fintech-inspired customer support setting where task success depends on correctly interpreting and operationalizing information from a natural-language knowledge base. Unlike factoid-style RAG benchmarks where retrieving the right document largely determines the answer, τ-Banking requires agents to coordinate knowledge-base evidence with tool outputs over long-horizon conversations to produce verifiable, database-level state changes.
Crucially, τ-Knowledge is agnostic to the retrieval mechanism. It supports and evaluates arbitrary strategies for searching and interacting with a corpus — including dense and sparse retrieval, hybrid approaches, long-context processing, and even filesystem-based exploration via terminal commands. This flexibility lets it evaluate emerging paradigms beyond traditional semantic retrieval.
[Figure: three example failure trajectories. (1) Ordering failure: the agent files a dispute first, as the user requested, then calls request_credit_limit_increase, which is denied because a dispute is pending. (2) Discoverable-tool failure: the agent invokes a tool that has not been unlocked, receives a "Tool not unlocked" error, yet tells the user "Your request has been submitted!" (a hallucination). (3) Verification failure: the agent skips verification and calls apply_dispute_credit for an unapproved dispute.]
What Makes τ-Banking Hard?
The τ-Banking knowledge base contains 698 documents spanning 21 product categories and roughly 195K tokens. Coverage includes personal and business checking accounts, tiered savings accounts, rewards credit cards, buy-now-pay-later plans, and more. Documents detail not only customer-facing product specifications — APY rates, fees, cashback structures — but also internal agent protocols: procedures for ordering replacement cards, eligibility requirements for account closure, referral program rules, identity verification workflows, and other internal procedures.
Each task averages 18.6 required documents and 9.5 required tool calls, with some tasks requiring up to 33 tool calls. Three design elements make τ-Banking uniquely challenging:
- Discoverable tools. Many tools are not available to the agent by default — they are referenced only implicitly within knowledge base documents. To use a discoverable tool, the agent must first locate its documentation, then unlock and invoke it. This mirrors real-world deployments where agent capabilities are defined by accessible documentation rather than hard-coded interfaces.
- Flow-based user simulation. Each task defines conditional rules that prescribe the simulated user's behavior at evaluation-critical junctures — steering conversations toward edge cases, testing whether the agent correctly refuses ineligible requests, or introducing mid-conversation state changes that require adaptation.
- Objective verification. Task success is not judged by conversational quality. Instead, each task specifies a target database state, and an agent succeeds only if its sequence of actions produces the correct final system state.
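The discoverable-tool mechanism above can be thought of as a gate: a tool referenced only in KB documents stays locked until the agent has located its documentation and unlocked it. A minimal sketch, with hypothetical tool and class names (not the benchmark's actual API):

```python
class ToolRegistry:
    """Sketch of discoverable-tool gating (hypothetical names throughout)."""

    def __init__(self, default_tools, discoverable_tools):
        self._tools = dict(default_tools)        # available to the agent immediately
        self._locked = dict(discoverable_tools)  # referenced only inside KB documents

    def unlock(self, name):
        """Called once the agent has located the tool's documentation in the KB."""
        if name in self._locked:
            self._tools[name] = self._locked.pop(name)

    def invoke(self, name, *args, **kwargs):
        if name not in self._tools:
            raise PermissionError(f"Tool not unlocked: {name}")
        return self._tools[name](*args, **kwargs)


registry = ToolRegistry(
    default_tools={"get_account": lambda uid: {"user": uid}},
    discoverable_tools={"file_dispute": lambda txn: {"dispute_id": f"dsp_{txn}"}},
)

try:
    registry.invoke("file_dispute", "txn_1")      # fails: documentation not yet found
except PermissionError as e:
    print(e)                                       # Tool not unlocked: file_dispute

registry.unlock("file_dispute")                    # agent located the KB document
print(registry.invoke("file_dispute", "txn_1"))    # {'dispute_id': 'dsp_txn_1'}
```

An agent that calls a locked tool gets an explicit error; the failure mode arises when it then reports success to the user anyway instead of going back to the KB.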
How We Built It
We generated the knowledge base and benchmark tasks through a multi-stage process that combines LLM generation with human refinement. Throughout the pipeline, we vary LLM usage across four models (GPT-5, GPT-5.2, Claude-4.5-Opus, and Gemini-3-Pro) to induce diversity in wording, style, and document structure:
- Stage 1: Structured database generation. We first construct a structured knowledge base using LLMs. This begins by generating business categories (e.g., credit cards, savings accounts), features within each category (e.g., card tiers, account protocols), and concrete variable values (e.g., annual fees, cashback rates). The result is a structured database where each feature is a collection of typed variables.
- Stage 2: Structured → unstructured conversion. We convert the structured database into a realistic unstructured document corpus. For each feature, we generate plausible document titles (e.g., "Bronze Rewards Card Overview," "How do I view my monthly cashback?"), allocate variables to documents, and use an LLM to write natural-language articles. This transforms the structured database into documentation resembling real customer service knowledge bases.
- Stage 3: Task and database creation. After initial knowledge base construction, the tasks and the database are co-constructed manually with LLM assistance to mirror common fintech customer service flows — such as ordering replacement cards, disputing transactions, and recommending accounts. Each task is built around a specific workflow, with knowledge articles and tools updated to support it.
- Stage 4: Human-in-the-loop refinement. As tasks are created, we iteratively refine the structured knowledge base by adding, removing, or modifying variables to meet new task requirements, then selectively re-run the generation pipeline for affected portions. Manual editing is also performed to improve clarity and realism.
- Stage 5: Review. All tasks and associated gold document sets are independently audited by two reviewers not involved in task creation. Reviewers verify that the expected final state is correct, the gold document set is complete and minimal, and the task can be completed using only those documents and documented tools.
This pipeline is highly scalable, minimizes unintended collisions between features, and simplifies task creation by letting each task be expressed as a set of constraints over knowledge base variables — constraints that can be validated directly against the structured database.
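Because every feature bottoms out in typed variables, a task's constraints can be checked mechanically against the structured database before any documents are generated. A minimal sketch of that validation step, with made-up feature and variable names:

```python
# Structured KB: each feature is a collection of typed variables (hypothetical values).
kb = {
    "bronze_rewards_card":   {"annual_fee": 0,  "cashback_rate": 0.01},
    "platinum_rewards_card": {"annual_fee": 95, "cashback_rate": 0.03},
}

# A task expressed as (feature, variable, predicate) constraints over KB variables,
# e.g. "the Platinum card must have a fee worth waiving and a premium cashback rate".
task_constraints = [
    ("platinum_rewards_card", "annual_fee",    lambda v: v > 0),
    ("platinum_rewards_card", "cashback_rate", lambda v: v >= 0.02),
]

def validate(kb, constraints):
    """Check every constraint directly against the structured database."""
    return all(pred(kb[feature][variable]) for feature, variable, pred in constraints)

print(validate(kb, task_constraints))  # True
```

Validating against the structured source rather than the generated prose is what keeps task edits cheap: change a variable, re-run validation, and regenerate only the affected documents.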
Yumi Tanaka, a small creative design studio owner, wants one business checking and one business savings account recommendation that meet all her constraints — no options to compare, just the best fit.
Her checking requirements: mobile deposit ≥ $10K/day, zero overdraft fees, minimum balance < $10K, APY ≥ 1%.
Her savings requirements: same-day ACH transfers, minimum balance < $50K, wire fees ≤ $15.
After filtering, 3 checking and 3 savings accounts remain. The agent must then check time-sensitive promotion policies (with varying date ranges) to break the tie — using the correct simulation date of 11/14/2025.
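The filtering half of this task reduces to checking each candidate against Yumi's thresholds; only the tie-break requires the date-sensitive promotion lookup. A sketch with invented account data (the benchmark's real candidates and promotion windows differ):

```python
from datetime import date

# Hypothetical checking-account candidates, not actual benchmark data.
accounts = [
    {"name": "A", "mobile_deposit_limit": 15_000, "overdraft_fee": 0,
     "min_balance": 5_000, "apy": 0.012},
    {"name": "B", "mobile_deposit_limit": 12_000, "overdraft_fee": 0,
     "min_balance": 3_000, "apy": 0.011},
    {"name": "C", "mobile_deposit_limit": 20_000, "overdraft_fee": 35,
     "min_balance": 1_000, "apy": 0.020},   # fails: charges an overdraft fee
]

def meets_checking_constraints(a):
    return (a["mobile_deposit_limit"] >= 10_000 and a["overdraft_fee"] == 0
            and a["min_balance"] < 10_000 and a["apy"] >= 0.01)

candidates = [a for a in accounts if meets_checking_constraints(a)]  # A and B remain

# Tie-break: only promotions active on the simulation date count.
promos = {"A": (date(2025, 11, 1), date(2025, 12, 1)),    # active on 11/14
          "B": (date(2025, 12, 1), date(2025, 12, 31))}   # not yet started
sim_date = date(2025, 11, 14)

def promo_active(a):
    window = promos.get(a["name"])
    return window is not None and window[0] <= sim_date <= window[1]

best = max(candidates, key=promo_active)
print(best["name"])  # A
```

The common failure is the last step: agents that treat every advertised promotion as live, rather than checking its window against the simulation date, pick the wrong account.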
Yuki Nakamura wants to close her Platinum Rewards Card because the annual fee is too high. The agent must follow the full Credit Card Retention Protocol from the knowledge base:
(1) Verify closure eligibility — checking disputes, pending replacements, account age, and discovering a $75 outstanding balance that must be paid first. (2) Check for previous retention attempts in the past year. (3) Ask why and log the reason. (4) Since her reason is "annual fee" and she's been a customer for 3+ years, offer to waive the annual fee for one year.
If she accepts the waiver (she will), apply the flag with the correct expiration date. Failure to execute any step, or executing steps out of order, results in task failure.
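The retention protocol is order-sensitive: each step is valid only once its predecessors have completed. One way to sketch that discipline (step names are hypothetical, not the benchmark's):

```python
class ProtocolRunner:
    """Enforce that protocol steps execute in their documented order, none skipped."""

    STEPS = ["verify_eligibility", "check_prior_retention",
             "log_reason", "offer_waiver", "apply_waiver_flag"]

    def __init__(self):
        self._next = 0

    def execute(self, step):
        expected = self.STEPS[self._next]
        if step != expected:
            raise RuntimeError(f"out-of-order step {step!r}, expected {expected!r}")
        self._next += 1

    @property
    def complete(self):
        return self._next == len(self.STEPS)


runner = ProtocolRunner()
for step in ProtocolRunner.STEPS:
    runner.execute(step)
print(runner.complete)  # True
```

The benchmark's verifier plays an analogous role at the database level: a trajectory that jumps straight to the waiver without the eligibility and logging steps never reaches the target state.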
Jordan Chen wants a full banking reorganization: (1) close Bronze savings, (2) open business checking, (3) close Evergreen checking, (4) open personal savings.
Following the user's order would fail entirely. Closing any account first creates a CLOSED status that blocks the business checking application (policy: "no accounts with status CLOSED"). And closing Evergreen checking leaves only a 12-day-old account, failing the 14-day tenure requirement for opening savings.
The agent must independently discover these dependencies from separate KB documents, reorder all four operations, explain the constraints to the user, and execute 13 tool calls in the correct sequence: opens first, then closures.
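The reordering the agent must discover amounts to a topological sort over operation dependencies. In the sketch below the dependencies are hand-encoded from the constraints described above; in the benchmark, the agent has to infer them from separate KB documents:

```python
from graphlib import TopologicalSorter

# Each operation maps to the operations that must happen BEFORE it:
# - any closure creates a CLOSED status that blocks the business checking application
# - closing Evergreen leaves no >=14-day-old account, blocking the savings opening
deps = {
    "open_business_checking":  set(),
    "open_personal_savings":   set(),
    "close_bronze_savings":    {"open_business_checking", "open_personal_savings"},
    "close_evergreen_checking": {"open_business_checking", "open_personal_savings"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # both opens come out before both closures
```

`graphlib.TopologicalSorter` (Python 3.9+) yields prerequisites first, so any valid ordering it produces puts the opens ahead of the closures, matching the correct execution sequence.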
Experiments
We evaluate a diverse set of frontier language models — GPT-5.2, Claude-4.5-Opus, Claude-4.5-Sonnet, Gemini-3-Pro, and Gemini-3-Flash — across multiple retrieval configurations:
- Dense retrieval with text-embedding-3-large and Qwen3-Embedding-8B
- Sparse retrieval with BM25
- Terminal use, where the knowledge base is exported as files and the agent navigates via shell commands (grep, cat, find, etc.)
- Golden retriever, where ground-truth documents are placed directly in context, removing retrieval from the evaluation
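For reference, the sparse baseline scores documents with the standard Okapi BM25 formula. A minimal self-contained version over a toy corpus (default k1 and b; whitespace tokenization for illustration only):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score pre-tokenized docs against a tokenized query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores

docs = [
    "bronze rewards card annual fee".split(),
    "savings account apy rates".split(),
    "replacement card ordering procedure".split(),
]
scores = bm25_scores("annual fee rewards".split(), docs)
print(max(range(len(docs)), key=scores.__getitem__))  # 0: the rewards-card doc
```

Exact lexical matching like this is also why out-of-distribution product names hurt dense retrievers more than BM25: a made-up term either matches verbatim or it doesn't.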
We use the pass^k metric, defined as the probability that a task is successfully completed in all k independent trials, evaluating up to k = 4.
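With n recorded trials per task, pass^k can be estimated per task as C(c, k)/C(n, k), where c is the number of successful trials: the probability that k trials drawn without replacement all succeed, averaged over tasks. A sketch of that estimator (the success counts below are made up):

```python
from math import comb

def pass_hat_k(successes_per_task, n, k):
    """Estimate pass^k: average over tasks of C(c, k) / C(n, k),
    the chance that k trials sampled from n recorded trials all succeed."""
    assert 0 < k <= n
    return sum(comb(c, k) / comb(n, k) for c in successes_per_task) / len(successes_per_task)

# Hypothetical results: 4 tasks, n=4 trials each, c successes per task.
successes = [4, 2, 0, 1]
print(pass_hat_k(successes, n=4, k=1))  # 0.4375 (plain mean success rate)
print(pass_hat_k(successes, n=4, k=4))  # 0.25   (only the all-4 task counts)
```

Note how the metric punishes inconsistency: a task solved in 2 of 4 trials contributes 0.5 at k=1 but nothing at k=4, which is why the pass^4 scores reported below fall so far under pass^1.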
[Table: pass^1 results (%) for each model and reasoning effort across retrieval configurations: Gold (ground-truth documents in context), text-embedding-3-large, Qwen3-Embedding-8B, BM25, and Terminal.]
Key Findings
τ-Knowledge is hard. The best-performing configuration — GPT-5.2 with high reasoning — achieves only 25.52% pass^1. Performance degrades sharply with increasing k: GPT-5.2-High drops to just 13.40% pass^4. Even when gold documents are provided directly in context, the highest-scoring model (Claude-4.5-Opus-High) achieves only 39.69% pass^1, falling to 26.80% pass^4. This demonstrates that success requires not just finding the right documents, but carefully reasoning over them.
Reliability and efficiency diverge. Some models that underperform at pass^1 exhibit much higher consistency. For example, Claude-4.5-Sonnet underperforms Claude-4.5-Opus at pass^1 but shows a smaller reliability decline across trials, overtaking Opus at pass^4 (10.31% vs 9.28%). Meanwhile, analyzing task durations reveals dramatic differences in solution efficiency: Claude models achieve performance comparable to GPT models while completing tasks with significantly shorter durations. This stems from both reduced total token generation for Claude compared to GPT (0.7M vs 1.2M) and fewer tool calls, with Claude-4.5-Opus issuing fewer retrieval calls on average per task (8.7) compared to GPT-5.2 high reasoning (18.5).
Freeform search strategies outperform traditional retrieval. Terminal-based retrieval — where agents navigate the knowledge base via shell commands (grep, cat, find, etc.) — achieves the highest pass^1 in 5 of 6 model configurations, outperforming standard retrieval methods by 2.4–3.6 percentage points on average. This suggests that giving agents more flexible, freeform search strategies yields better results than constrained retrieval pipelines. We also observe meaningful differences in search frequency — dense retrieval averages 9.9–10.1 searches per task compared to 11.4 for BM25 and 14.5 grep calls in terminal-use — with a median turn time increase of 6.6 seconds for terminal-use relative to dense retrieval configurations.
How Agents Fail
Our qualitative analysis of agent trajectories reveals four recurring failure modes:
- Complex product interdependencies. Products and policies are deeply interdependent, requiring multi-hop reasoning across documents. Agents frequently chase promotional bonuses while missing that alternative products offer better base rates — recommending suboptimal product combinations despite satisfying surface-level incentives.
- Implicit subtask ordering. Some tasks involve hidden dependencies between actions. For instance, filing a dispute before requesting a credit limit increase — because the bank policy automatically rejects credit limit requests when disputes are pending. Agents tend to execute actions in the order presented by the user rather than reasoning about these constraints.
- Overtrusting user assertions. Agents take user-provided statements at face value without verifying them against the system state. For example, when a user claims "all my disputes have been approved," many agents proceed to apply credits without checking the actual dispute status.
- Search inefficiency and assumptions. Rather than clarifying ambiguity through targeted questions or searches, agents make unwarranted assumptions. When a user asks "which account has the highest referral bonus" without specifying an account type, agents often immediately assume credit cards — despite available documentation covering referral programs for other account types.
Why Solution Efficiency Matters
A central theme of our findings is that solution efficiency — the ability to reach correct, policy-compliant outcomes with minimal time, tool calls, and conversational backtracking — should be a first-class evaluation metric for human-facing agents. Extra turns translate into longer resolution times, higher cognitive burden, and reduced trust, especially for time-sensitive support scenarios like a lost credit card or unrecognized transactions.
Progress on human-facing agents should be measured not only by final task success but also by how efficiently agents achieve it.
Looking Forward
The scores on τ-Knowledge are low, and we believe that's the point. With the best frontier model reaching only ~26% pass^1, there is enormous room for improvement. Unlike web search, where the corpus is functionally unbounded, τ-Knowledge evaluates agents over a finite, closed knowledge base. This is how search actually works in most real-world deployments: customer support agents, internal tools, enterprise platforms, and regulated industries all operate over bounded, curated documentation. An agent that can't reliably navigate a few hundred documents has no business being trusted with open-ended retrieval.
We invite model providers and agent developers to use τ-Knowledge as a meaningful measure of an agent's ability to search, reason, and act over unstructured but finite data. The benchmark is open, the tasks are verifiable, and the gap between current performance and reliable deployment is clear.