τ-Knowledge is a benchmark for evaluating AI agents in knowledge-intensive customer support. It pairs a realistic fintech knowledge base (698 documents across 21 product categories, ~195K tokens) with tasks requiring multi-step reasoning, policy application, and tool use. The best model, GPT-5.2 with high reasoning, achieves only ~26% pass^1: current frontier models still fail at retrieving, interpreting, and acting on messy real-world documentation.
Modern agents are no longer deployed with everything they need neatly stuffed into context. Instead, they are expected to operate over large, messy, and constantly evolving knowledge bases: internal documentation, policy manuals, tool descriptions, product catalogs, and procedural guides.
These knowledge bases have several properties that make them uniquely difficult for agents:
- Information is unstructured and spread across long documents
- Policies are procedural, conditional, and easy to misapply
- Content is often time-sensitive, with exceptions and promotions that change behavior
- Product names and internal terms are out-of-distribution, breaking embedding-based assumptions
- Some tools are discoverable, referenced only in KB documents rather than given to the agent directly
Despite how common this setting is in real deployments, existing benchmarks rarely capture it. Most evaluations isolate retrieval (e.g., QA over documents) or isolate tool use; few require agents to reason over a private knowledge base while coordinating tools in a live user interaction.
Introducing τ-Knowledge
To address this gap, we introduce τ-Knowledge, an extension of τ-Bench that evaluates agents in knowledge-grounded, human-facing environments.
τ-Knowledge introduces a new domain: τ-Banking, a fintech-inspired customer support setting where task success depends on correctly interpreting and operationalizing information from a natural-language knowledge base. Unlike factoid-style RAG benchmarks where retrieving the right document largely determines the answer, τ-Banking requires agents to coordinate knowledge-base evidence with tool outputs over long-horizon conversations to produce verifiable, database-level state changes.
Crucially, τ-Knowledge is agnostic to the retrieval mechanism. It supports and evaluates arbitrary strategies for searching and interacting with a corpus — including dense and sparse retrieval, hybrid approaches, long-context processing, and even filesystem-based exploration via terminal commands. This flexibility lets it evaluate emerging paradigms beyond traditional semantic retrieval.
[Figure: three example failure trajectories. (1) Ordering failure: the agent files a dispute first, as the user requested, then calls request_credit_limit_increase, which is denied because a dispute is pending. (2) Discoverable-tool failure: the agent invokes a tool that has not been unlocked, receives a "Tool not unlocked" error, yet tells the user "Your request has been submitted!" (a hallucination). (3) Verification failure: the agent skips verification and calls apply_dispute_credit for an unapproved dispute.]
What Makes τ-Banking Hard?
The τ-Banking knowledge base contains 698 documents spanning 21 product categories and roughly 195K tokens. Coverage includes personal and business checking accounts, tiered savings accounts, rewards credit cards, buy-now-pay-later plans, and more. Documents detail not only customer-facing product specifications — APY rates, fees, cashback structures — but also internal agent protocols: procedures for ordering replacement cards, eligibility requirements for account closure, referral program rules, identity verification workflows, and other internal procedures.
Each task averages 18.6 required documents and 9.5 required tool calls, with some tasks requiring up to 33 tool calls. Three design elements make τ-Banking uniquely challenging:
- Discoverable tools. Many tools are not available to the agent by default — they are referenced only implicitly within knowledge base documents. To use a discoverable tool, the agent must first locate its documentation, then unlock and invoke it. This mirrors real-world deployments where agent capabilities are defined by accessible documentation rather than hard-coded interfaces.
- Flow-based user simulation. Each task defines conditional rules that prescribe the simulated user's behavior at evaluation-critical junctures — steering conversations toward edge cases, testing whether the agent correctly refuses ineligible requests, or introducing mid-conversation state changes that require adaptation.
- Objective verification. Task success is not judged by conversational quality. Instead, each task specifies a target database state, and an agent succeeds only if its sequence of actions produces the correct final system state.
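The discoverable-tool mechanism above can be thought of as a gate: a tool referenced only in KB documents stays locked until the agent has located its documentation and unlocked it. A minimal sketch, with hypothetical tool and class names (not the benchmark's actual API):

```python
class ToolRegistry:
    """Sketch of discoverable-tool gating (hypothetical names throughout)."""

    def __init__(self, default_tools, discoverable_tools):
        self._tools = dict(default_tools)        # available to the agent immediately
        self._locked = dict(discoverable_tools)  # referenced only inside KB documents

    def unlock(self, name):
        """Called once the agent has located the tool's documentation in the KB."""
        if name in self._locked:
            self._tools[name] = self._locked.pop(name)

    def invoke(self, name, *args, **kwargs):
        if name not in self._tools:
            raise PermissionError(f"Tool not unlocked: {name}")
        return self._tools[name](*args, **kwargs)


registry = ToolRegistry(
    default_tools={"get_account": lambda uid: {"user": uid}},
    discoverable_tools={"file_dispute": lambda txn: {"dispute_id": f"dsp_{txn}"}},
)

try:
    registry.invoke("file_dispute", "txn_1")      # fails: documentation not yet found
except PermissionError as e:
    print(e)                                       # Tool not unlocked: file_dispute

registry.unlock("file_dispute")                    # agent located the KB document
print(registry.invoke("file_dispute", "txn_1"))    # {'dispute_id': 'dsp_txn_1'}
```

An agent that calls a locked tool gets an explicit error; the failure mode arises when it then reports success to the user anyway instead of going back to the KB.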
How We Built It
We generated the knowledge base and benchmark tasks through a multi-stage process that combines LLM generation with human refinement. Throughout the pipeline, we vary LLM usage across four models (GPT-5, GPT-5.2, Claude-4.5-Opus, and Gemini-3-Pro) to induce diversity in wording, style, and document structure:
- Stage 1: Structured database generation. We first construct a structured knowledge base using LLMs. This begins by generating business categories (e.g., credit cards, savings accounts), features within each category (e.g., card tiers, account protocols), and concrete variable values (e.g., annual fees, cashback rates). The result is a structured database where each feature is a collection of typed variables.
- Stage 2: Structured → unstructured conversion. We convert the structured database into a realistic unstructured document corpus. For each feature, we generate plausible document titles (e.g., "Bronze Rewards Card Overview," "How do I view my monthly cashback?"), allocate variables to documents, and use an LLM to write natural-language articles. This transforms the structured database into documentation resembling real customer service knowledge bases.
- Stage 3: Task and database creation. After initial knowledge base construction, the tasks and the database are co-constructed manually with LLM assistance to mirror common fintech customer service flows — such as ordering replacement cards, disputing transactions, and recommending accounts. Each task is built around a specific workflow, with knowledge articles and tools updated to support it.
- Stage 4: Human-in-the-loop refinement. As tasks are created, we iteratively refine the structured knowledge base by adding, removing, or modifying variables to meet new task requirements, then selectively re-run the generation pipeline for affected portions. Manual editing is also performed to improve clarity and realism.
- Stage 5: Review. All tasks and associated gold document sets are independently audited by two reviewers not involved in task creation. Reviewers verify that the expected final state is correct, the gold document set is complete and minimal, and the task can be completed using only those documents and documented tools.
This pipeline is highly scalable, minimizes unintended collisions between features, and simplifies task creation by letting each task be expressed as a set of constraints over knowledge base variables — constraints that can be validated directly against the structured database.
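Because every feature bottoms out in typed variables, a task's constraints can be checked mechanically against the structured database before any documents are generated. A minimal sketch of that validation step, with made-up feature and variable names:

```python
# Structured KB: each feature is a collection of typed variables (hypothetical values).
kb = {
    "bronze_rewards_card":   {"annual_fee": 0,  "cashback_rate": 0.01},
    "platinum_rewards_card": {"annual_fee": 95, "cashback_rate": 0.03},
}

# A task expressed as (feature, variable, predicate) constraints over KB variables,
# e.g. "the Platinum card must have a fee worth waiving and a premium cashback rate".
task_constraints = [
    ("platinum_rewards_card", "annual_fee",    lambda v: v > 0),
    ("platinum_rewards_card", "cashback_rate", lambda v: v >= 0.02),
]

def validate(kb, constraints):
    """Check every constraint directly against the structured database."""
    return all(pred(kb[feature][variable]) for feature, variable, pred in constraints)

print(validate(kb, task_constraints))  # True
```

Validating against the structured source rather than the generated prose is what keeps task edits cheap: change a variable, re-run validation, and regenerate only the affected documents.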
Yumi Tanaka, a small creative design studio owner, wants one business checking and one business savings account recommendation that meet all her constraints — no options to compare, just the best fit.
Her checking requirements: mobile deposit ≥ $10K/day, zero overdraft fees, minimum balance < $10K, APY ≥ 1%.
Her savings requirements: same-day ACH transfers, minimum balance < $50K, wire fees ≤ $15.
After filtering, 3 checking and 3 savings accounts remain. The agent must then check time-sensitive promotion policies (with varying date ranges) to break the tie — using the correct simulation date of 11/14/2025.
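The filtering half of this task reduces to checking each candidate against Yumi's thresholds; only the tie-break requires the date-sensitive promotion lookup. A sketch with invented account data (the benchmark's real candidates and promotion windows differ):

```python
from datetime import date

# Hypothetical checking-account candidates, not actual benchmark data.
accounts = [
    {"name": "A", "mobile_deposit_limit": 15_000, "overdraft_fee": 0,
     "min_balance": 5_000, "apy": 0.012},
    {"name": "B", "mobile_deposit_limit": 12_000, "overdraft_fee": 0,
     "min_balance": 3_000, "apy": 0.011},
    {"name": "C", "mobile_deposit_limit": 20_000, "overdraft_fee": 35,
     "min_balance": 1_000, "apy": 0.020},   # fails: charges an overdraft fee
]

def meets_checking_constraints(a):
    return (a["mobile_deposit_limit"] >= 10_000 and a["overdraft_fee"] == 0
            and a["min_balance"] < 10_000 and a["apy"] >= 0.01)

candidates = [a for a in accounts if meets_checking_constraints(a)]  # A and B remain

# Tie-break: only promotions active on the simulation date count.
promos = {"A": (date(2025, 11, 1), date(2025, 12, 1)),    # active on 11/14
          "B": (date(2025, 12, 1), date(2025, 12, 31))}   # not yet started
sim_date = date(2025, 11, 14)

def promo_active(a):
    window = promos.get(a["name"])
    return window is not None and window[0] <= sim_date <= window[1]

best = max(candidates, key=promo_active)
print(best["name"])  # A
```

The common failure is the last step: agents that treat every advertised promotion as live, rather than checking its window against the simulation date, pick the wrong account.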
Yuki Nakamura wants to close her Platinum Rewards Card because the annual fee is too high. The agent must follow the full Credit Card Retention Protocol from the knowledge base:
(1) Verify closure eligibility — checking disputes, pending replacements, account age, and discovering a $75 outstanding balance that must be paid first. (2) Check for previous retention attempts in the past year. (3) Ask why and log the reason. (4) Since her reason is "annual fee" and she's been a customer for 3+ years, offer to waive the annual fee for one year.
If she accepts the waiver (she will), apply the flag with the correct expiration date. Failure to execute any step, or executing steps out of order, results in task failure.
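The retention protocol is order-sensitive: each step is valid only once its predecessors have completed. One way to sketch that discipline (step names are hypothetical, not the benchmark's):

```python
class ProtocolRunner:
    """Enforce that protocol steps execute in their documented order, none skipped."""

    STEPS = ["verify_eligibility", "check_prior_retention",
             "log_reason", "offer_waiver", "apply_waiver_flag"]

    def __init__(self):
        self._next = 0

    def execute(self, step):
        expected = self.STEPS[self._next]
        if step != expected:
            raise RuntimeError(f"out-of-order step {step!r}, expected {expected!r}")
        self._next += 1

    @property
    def complete(self):
        return self._next == len(self.STEPS)


runner = ProtocolRunner()
for step in ProtocolRunner.STEPS:
    runner.execute(step)
print(runner.complete)  # True
```

The benchmark's verifier plays an analogous role at the database level: a trajectory that jumps straight to the waiver without the eligibility and logging steps never reaches the target state.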
Jordan Chen wants a full banking reorganization: (1) close Bronze savings, (2) open business checking, (3) close Evergreen checking, (4) open personal savings.
Following the user's order would fail entirely. Closing any account first creates a CLOSED status that blocks the business checking application (policy: "no accounts with status CLOSED"). And closing Evergreen checking leaves only a 12-day-old account, failing the 14-day tenure requirement for opening savings.
The agent must independently discover these dependencies from separate KB documents, reorder all four operations, explain the constraints to the user, and execute 13 tool calls in the correct sequence: opens first, then closures.
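The reordering the agent must discover amounts to a topological sort over operation dependencies. In the sketch below the dependencies are hand-encoded from the constraints described above; in the benchmark, the agent has to infer them from separate KB documents:

```python
from graphlib import TopologicalSorter

# Each operation maps to the operations that must happen BEFORE it:
# - any closure creates a CLOSED status that blocks the business checking application
# - closing Evergreen leaves no >=14-day-old account, blocking the savings opening
deps = {
    "open_business_checking":  set(),
    "open_personal_savings":   set(),
    "close_bronze_savings":    {"open_business_checking", "open_personal_savings"},
    "close_evergreen_checking": {"open_business_checking", "open_personal_savings"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # both opens come out before both closures
```

`graphlib.TopologicalSorter` (Python 3.9+) yields prerequisites first, so any valid ordering it produces puts the opens ahead of the closures, matching the correct execution sequence.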
Experiments
We evaluate a diverse set of frontier language models — GPT-5.2, Claude-4.5-Opus, Claude-4.5-Sonnet, Gemini-3-Pro, and Gemini-3-Flash — across multiple retrieval configurations:
- Dense retrieval with text-embedding-3-large and Qwen3-Embedding-8B
- Sparse retrieval with BM25
- Terminal use, where the knowledge base is exported as files and the agent navigates via shell commands (grep, cat, find, etc.)
- Golden retriever, where ground-truth documents are placed directly in context, removing retrieval from the evaluation
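For reference, the sparse baseline scores documents with the standard Okapi BM25 formula. A minimal self-contained version over a toy corpus (default k1 and b; whitespace tokenization for illustration only):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score pre-tokenized docs against a tokenized query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores

docs = [
    "bronze rewards card annual fee".split(),
    "savings account apy rates".split(),
    "replacement card ordering procedure".split(),
]
scores = bm25_scores("annual fee rewards".split(), docs)
print(max(range(len(docs)), key=scores.__getitem__))  # 0: the rewards-card doc
```

Exact lexical matching like this is also why out-of-distribution product names hurt dense retrievers more than BM25: a made-up term either matches verbatim or it doesn't.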
We use the pass^k metric, defined as the probability that a task is successfully completed in all k independent trials, evaluating up to k = 4.
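With n recorded trials per task, pass^k can be estimated per task as C(c, k)/C(n, k), where c is the number of successful trials: the probability that k trials drawn without replacement all succeed, averaged over tasks. A sketch of that estimator (the success counts below are made up):

```python
from math import comb

def pass_hat_k(successes_per_task, n, k):
    """Estimate pass^k: average over tasks of C(c, k) / C(n, k),
    the chance that k trials sampled from n recorded trials all succeed."""
    assert 0 < k <= n
    return sum(comb(c, k) / comb(n, k) for c in successes_per_task) / len(successes_per_task)

# Hypothetical results: 4 tasks, n=4 trials each, c successes per task.
successes = [4, 2, 0, 1]
print(pass_hat_k(successes, n=4, k=1))  # 0.4375 (plain mean success rate)
print(pass_hat_k(successes, n=4, k=4))  # 0.25   (only the all-4 task counts)
```

Note how the metric punishes inconsistency: a task solved in 2 of 4 trials contributes 0.5 at k=1 but nothing at k=4, which is why the pass^4 scores reported below fall so far under pass^1.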
[Table: pass^1 results (%) for each model and reasoning effort across retrieval configurations: Gold (ground-truth documents in context), text-embedding-3-large, Qwen3-Embedding-8B, BM25, and Terminal.]
Key Findings
τ-Knowledge is hard. The best-performing configuration — GPT-5.2 with high reasoning — achieves only 25.52% pass^1. Performance degrades sharply with increasing k: GPT-5.2-High drops to just 13.40% pass^4. Even when gold documents are provided directly in context, the highest-scoring model (Claude-4.5-Opus-High) achieves only 39.69% pass^1, falling to 26.80% pass^4. This demonstrates that success requires not just finding the right documents, but carefully reasoning over them.
Reliability and efficiency diverge. Some models that underperform at pass^1 exhibit much higher consistency. For example, Claude-4.5-Sonnet underperforms Claude-4.5-Opus at pass^1 but shows a smaller reliability decline across trials, overtaking Opus at pass^4 (10.31% vs 9.28%). Meanwhile, analyzing task durations reveals dramatic differences in solution efficiency: Claude models achieve performance comparable to GPT models while completing tasks with significantly shorter durations. This stems from both reduced total token generation for Claude compared to GPT (0.7M vs 1.2M) and fewer tool calls, with Claude-4.5-Opus issuing fewer retrieval calls on average per task (8.7) compared to GPT-5.2 high reasoning (18.5).
Freeform search strategies outperform traditional retrieval. Terminal-based retrieval — where agents navigate the knowledge base via shell commands (grep, cat, find, etc.) — achieves the highest pass^1 in 5 of 6 model configurations, outperforming standard retrieval methods by 2.4–3.6 percentage points on average. This suggests that giving agents more flexible, freeform search strategies yields better results than constrained retrieval pipelines. We also observe meaningful differences in search frequency — dense retrieval averages 9.9–10.1 searches per task compared to 11.4 for BM25 and 14.5 grep calls in terminal-use — with a median turn time increase of 6.6 seconds for terminal-use relative to dense retrieval configurations.
How Agents Fail
Our qualitative analysis of agent trajectories reveals four recurring failure modes:
- Complex product interdependencies. Products and policies are deeply interdependent, requiring multi-hop reasoning across documents. Agents frequently chase promotional bonuses while missing that alternative products offer better base rates — recommending suboptimal product combinations despite satisfying surface-level incentives.
- Implicit subtask ordering. Some tasks involve hidden dependencies between actions. For instance, filing a dispute before requesting a credit limit increase — because the bank policy automatically rejects credit limit requests when disputes are pending. Agents tend to execute actions in the order presented by the user rather than reasoning about these constraints.
- Overtrusting user assertions. Agents take user-provided statements at face value without verifying them against the system state. For example, when a user claims "all my disputes have been approved," many agents proceed to apply credits without checking the actual dispute status.
- Search inefficiency and assumptions. Rather than clarifying ambiguity through targeted questions or searches, agents make unwarranted assumptions. When a user asks "which account has the highest referral bonus" without specifying an account type, agents often immediately assume credit cards — despite available documentation covering referral programs for other account types.
Why Solution Efficiency Matters
A central theme of our findings is that solution efficiency — the ability to reach correct, policy-compliant outcomes with minimal time, tool calls, and conversational backtracking — should be a first-class evaluation metric for human-facing agents. Extra turns translate into longer resolution times, higher cognitive burden, and reduced trust, especially for time-sensitive support scenarios like a lost credit card or unrecognized transactions.
Progress on human-facing agents should be measured not only by final task success but also by how efficiently agents achieve it.
Looking Forward
The scores on τ-Knowledge are low, and we believe that's the point. With the best frontier model reaching only ~26% pass^1, there is enormous room for improvement. Unlike web search, where the corpus is functionally unbounded, τ-Knowledge evaluates agents over a finite, closed knowledge base. This is how search actually works in most real-world deployments: customer support agents, internal tools, enterprise platforms, and regulated industries all operate over bounded, curated documentation. An agent that can't reliably navigate a few hundred documents has no business being trusted with open-ended retrieval.
We invite model providers and agent developers to use τ-Knowledge as a meaningful measure of an agent's ability to search, reason, and act over unstructured but finite data. The benchmark is open, the tasks are verifiable, and the gap between current performance and reliable deployment is clear.