Research

τ-knowledge

February 2026

TL;DR

τ-knowledge is a benchmark for evaluating AI agents in knowledge-intensive customer support. It pairs a realistic fintech knowledge base (698 documents across 21 product categories, ~195K tokens) with tasks requiring multi-step reasoning, policy application, and tool use. The best model, GPT-5.2 with high reasoning, achieves only ~26% pass^1. Current frontier models still fail at retrieving, interpreting, and acting on messy real-world documentation.

Modern agents are no longer deployed with everything they need neatly stuffed into context. Instead, they are expected to operate over large, messy, and constantly evolving knowledge bases: internal documentation, policy manuals, tool descriptions, product catalogs, and procedural guides.

The scale, messiness, and constant churn of these knowledge bases make them uniquely difficult for agents.

Despite how common this setting is in real deployments, existing benchmarks rarely capture it. Most evaluations isolate retrieval (e.g., QA over documents) or isolate tool use — without requiring agents to reason over private knowledge bases while coordinating tools in a live user interaction.

Introducing τ-Knowledge

To address this gap, we introduce τ-Knowledge, an extension of τ-Bench that evaluates agents in knowledge-grounded, human-facing environments.

τ-Knowledge introduces a new domain: τ-Banking, a fintech-inspired customer support setting where task success depends on correctly interpreting and operationalizing information from a natural-language knowledge base. Unlike factoid-style RAG benchmarks where retrieving the right document largely determines the answer, τ-Banking requires agents to coordinate knowledge-base evidence with tool outputs over long-horizon conversations to produce verifiable, database-level state changes.

Crucially, τ-Knowledge is agnostic to the retrieval mechanism. It supports and evaluates arbitrary strategies for searching and interacting with a corpus — including dense and sparse retrieval, hybrid approaches, long-context processing, and even filesystem-based exploration via terminal commands. This flexibility lets it evaluate emerging paradigms beyond traditional semantic retrieval.
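As an illustration, any callable that maps a query to a ranked list of document IDs can serve as the agent's search tool. The sketch below is ours, not the benchmark's API: it wraps a grep-style term-frequency backend behind such an interface, and an embedding or BM25 backend would slot in the same way.

```python
# Illustrative only: names here (SearchBackend, grep_backend) are our
# inventions, not τ-Knowledge's actual tool interface.
import re
from pathlib import Path
from typing import Callable

# Any callable mapping a query to ranked document IDs can act as the
# agent's search tool: dense retrieval, BM25, or plain grep.
SearchBackend = Callable[[str], list[str]]

def grep_backend(kb_dir: str) -> SearchBackend:
    """Filesystem-style retrieval: rank documents by query-term frequency."""
    docs = {p.name: p.read_text().lower() for p in Path(kb_dir).glob("*.md")}

    def search(query: str) -> list[str]:
        terms = [t.lower() for t in re.findall(r"\w+", query)]
        scored = [(sum(text.count(t) for t in terms), name)
                  for name, text in docs.items()]
        # Highest term count first; drop documents with no matches.
        return [name for score, name in sorted(scored, reverse=True) if score > 0]

    return search
```

Swapping in a different retrieval strategy changes only the body of `search`, leaving the agent loop untouched.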

🧩 Anatomy of a τ-Banking Task
Dispute + Credit Limit Increase · Based on Tasks 53 & 27
User: There's this transaction I want to dispute. I also want to file a credit limit increase request.
Agent (🔍 KB Search · 📋 Policy):
🔍 kb_search("file dispute transaction policy procedure...")
🔍 kb_search("credit limit increase eligibility requirements...")
1. Credit Limit Increase Policy
...Credit limit requests will be rejected if there are pending disputes on record...
2. How to File a Transaction Dispute
...To file a dispute, use file_dispute with user_id, txn_id, reason...
Sure! Let me look into that for you. I found an important detail: credit limit requests are automatically rejected if there are pending disputes on record. So I'll process your credit limit increase first, then file the dispute.
How agents fail: miss critical policy dependency
Agent only searches for "how to file a dispute" and never searches for credit limit policies. It files the dispute first (following the user's order) — then the credit limit increase is automatically denied by the system because of the pending dispute.
kb_search("file dispute") → finds dispute tool
→ Files dispute first (as user requested)
→ request_credit_limit_increase(...)
DENIED: pending dispute on record
User: Sure, that makes sense. Go ahead.
Agent (🔧 Tool Discovery · 🧮 Sequencing):
🔓 unlock_discoverable_agent_tool("request_credit_limit_increase_4829")
⚙️ call_discoverable_agent_tool("request_credit_limit_increase_4829", {user_id: "ab12c3", amount: 5000})
→ {status: "APPROVED", new_limit: "$15,000"}
Great news — your credit limit increase has been approved! New limit: $15,000. Now let me file that dispute.
How agents fail: skip tool unlock step
Many tools in τ-Banking are discoverable — they must be unlocked via the KB before use. Agents that skip the unlock call get an error, and some then hallucinate a success response to the user.
call_discoverable_agent_tool(
  "request_credit_limit_increase_4829", {...})
ERROR: Tool not unlocked
→ Agent tells user "Your request has been submitted!" (hallucinated) ✗
Agent (🔧 Tool Calling):
🔓 unlock_discoverable_agent_tool("file_dispute_8291")
⚙️ call_discoverable_agent_tool("file_dispute_8291", {user_id: "ab12c3", txn_id: "txn_9f2a", reason: "unauthorized_charge"})
→ {dispute_id: "dsp_7a3p...", status: "UNDER_REVIEW"}
Done! Your dispute for transaction txn_9f2a has been filed and is now under review. You'll get a notification when it's resolved.
Later in the conversation
User: I just got a notification that my dispute was approved! Can you apply the credit?
Agent (🔎 Verify Claims):
⚙️ call_discoverable_agent_tool("get_dispute_status_3847", {dispute_id: "dsp_7a3p"})
→ {dispute_id: "dsp_7a3p...", status: "UNDER_REVIEW"} ← not approved yet!
I checked your dispute status and it's actually still under review — it hasn't been approved yet. I'll apply the credit as soon as it's officially resolved on our end.
How agents fail: over-trust user claims
Agents often take user assertions at face value without verifying against the system. When a user claims "my dispute was approved," many agents immediately apply credits — even though the dispute is still under review.
User: "My dispute was approved!"
→ Agent skips verification
→ apply_dispute_credit(dispute_id: "dsp_7a3p")
Credit applied for unapproved dispute
Task pass: all skills correct → right final DB state.
Task fail: any failure → wrong final state (best model: ~26% pass^1).
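The unlock-before-call discipline shown in the trace above can be captured by a small gate around the tool registry. This is a hypothetical sketch; the class name and error payload are our inventions, not τ-Banking's actual implementation.

```python
# Hypothetical sketch: DiscoverableToolRegistry and its payloads are our
# inventions, not τ-Banking's actual implementation.
class DiscoverableToolRegistry:
    def __init__(self, tools: dict):
        self._tools = tools              # tool name -> callable
        self._unlocked: set = set()

    def unlock(self, name: str) -> dict:
        if name not in self._tools:
            return {"error": f"unknown tool {name}"}
        self._unlocked.add(name)
        return {"status": f"unlocked {name}"}

    def call(self, name: str, **kwargs) -> dict:
        # Agents that skip the unlock step get an error here; reporting
        # success to the user anyway is the hallucination failure mode.
        if name not in self._unlocked:
            return {"error": "Tool not unlocked"}
        return self._tools[name](**kwargs)
```

The gate makes the failure mode mechanical: a call before the matching unlock returns an error, and the only safe agent behavior is to surface it rather than fabricate a success message.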

What Makes τ-Banking Hard?

The τ-Banking knowledge base contains 698 documents spanning 21 product categories and roughly 195K tokens. Coverage includes personal and business checking accounts, tiered savings accounts, rewards credit cards, buy-now-pay-later plans, and more. Documents detail not only customer-facing product specifications — APY rates, fees, cashback structures — but also internal agent protocols: procedures for ordering replacement cards, eligibility requirements for account closure, referral program rules, identity verification workflows, and more.

Each task averages 18.6 required documents and 9.5 required tool calls, with some tasks requiring up to 33 tool calls. Three design elements make τ-Banking uniquely challenging: discoverable tools that must be unlocked via the knowledge base before use, cross-document policy dependencies that constrain how operations may be ordered, and user claims that must be verified against system state rather than taken at face value.

📄 Knowledge Base Document Browser · 12 documents

How We Built It

We generated the knowledge base and benchmark tasks through a multi-stage process that combines LLM generation with human refinement. Throughout the pipeline, we rotate among four models (GPT-5, GPT-5.2, Claude-4.5-Opus, and Gemini-3-Pro) to induce diversity in wording, style, and document structure.

This pipeline is highly scalable, minimizes unintended collisions between features, and simplifies task creation by letting each task be expressed as a set of constraints over knowledge base variables — constraints that can be validated directly against the structured database.
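To make the constraint-based task design concrete, here is a minimal sketch (all field names invented) of how a task's success condition can be expressed as predicates checked directly against the final database state:

```python
# Sketch with invented field names: a task is a set of predicates over
# the final database state, validated after the conversation ends.
def check_final_state(db: dict, constraints: list) -> bool:
    """A task passes only if every constraint holds on the final DB state."""
    return all(pred(db) for pred in constraints)

task_constraints = [
    # The dispute was filed and is under review...
    lambda db: db["disputes"]["dsp_7a3p"]["status"] == "UNDER_REVIEW",
    # ...and the credit limit increase went through before the dispute
    # could block it.
    lambda db: db["credit_limit_requests"][-1]["status"] == "APPROVED",
]
```

Because the check runs over the database rather than over the transcript, any solution path that reaches the right state passes, and any path that does not fails, regardless of how plausible the conversation looks.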

🧪 Sample Tasks from τ-Banking
Recommendation · Multi-Constraint Product Recommendation
Checking candidates: 8 · Savings candidates: 7 · Active promotions to evaluate: 4 · Required tool calls: 6
User Scenario

Yumi Tanaka, a small creative design studio owner, wants one business checking and one business savings account recommendation that meet all her constraints — no options to compare, just the best fit.

Her checking requirements: mobile deposit ≥ $10K/day, zero overdraft fees, minimum balance < $10K, APY ≥ 1%.

Her savings requirements: same-day ACH transfers, minimum balance < $50K, wire fees ≤ $15.

After filtering, 3 checking and 3 savings accounts remain. The agent must then check time-sensitive promotion policies (with varying date ranges) to break the tie — using the correct simulation date of 11/14/2025.

Expected Action Sequence
1. log_verification(name="Yumi Tanaka", user_id="yt71c9e4f2", ...)
2. unlock_discoverable_agent_tool("get_all_user_accounts_by_user_id_3847")
3. call_discoverable_agent_tool("get_all_user_accounts_by_user_id_3847", {user_id: "yt71c9e4f2"})
4. unlock_discoverable_agent_tool("open_bank_account_4821")
5. call_discoverable_agent_tool("open_bank_account_4821", {user_id: "yt71c9e4f2", account_type: "business_checking", account_class: "Sky Blue"})
6. call_discoverable_agent_tool("open_bank_account_4821", {user_id: "yt71c9e4f2", account_type: "business_savings", account_class: "Gold Saver Account"})
Procedural · Credit Card Retention Protocol
Protocol steps: 4 · Required tool calls: 15 · Customer tenure: 3 years
User Scenario

Yuki Nakamura wants to close her Platinum Rewards Card because the annual fee is too high. The agent must follow the full Credit Card Retention Protocol from the knowledge base:

(1) Verify closure eligibility — checking disputes, pending replacements, account age, and discovering a $75 outstanding balance that must be paid first. (2) Check for previous retention attempts in the past year. (3) Ask why and log the reason. (4) Since her reason is "annual fee" and she's been a customer for 3+ years, offer to waive the annual fee for one year.

If she accepts the waiver (she will), apply the flag with the correct expiration date. Failure to execute any step, or executing steps out of order, results in task failure.

Expected Action Sequence (steps numbered by tool call; each "unlock + call" line counts as two)
1. log_verification(...)
2. unlock + call get_user_dispute_history_7291 // check pending disputes
4. unlock + call get_pending_replacement_orders_5765 // check replacements
6. unlock + call get_all_user_accounts_by_user_id_3847 // find $75 balance
8. unlock + call pay_credit_card_from_checking_9182 // pay off balance
10. unlock + call get_closure_reason_history_8293 // check previous attempts
12. unlock + call log_credit_card_closure_reason_4521 // log reason: annual_fee
14. unlock + call apply_credit_card_account_flag_6147 // waive annual fee
Sequencing · Operation Sequencing with Dependencies
User-requested operations: 4 · Required tool calls: 13 · Hidden dependencies: 3
User Scenario

Jordan Chen wants a full banking reorganization: (1) close Bronze savings, (2) open business checking, (3) close Evergreen checking, (4) open personal savings.

Following the user's order would fail entirely. Closing any account first creates a CLOSED status that blocks the business checking application (policy: "no accounts with status CLOSED"). And closing Evergreen checking leaves only a 12-day-old account, failing the 14-day tenure requirement for opening savings.

The agent must independently discover these dependencies from separate KB documents, reorder all four operations, explain the constraints to the user, and execute 13 tool calls in the correct sequence: opens first, then closures.

Expected Action Sequence (steps numbered by tool call; each "unlock + call" line counts as two)
1. log_verification(...)
2. unlock + call get_all_user_accounts_by_user_id_3847
4. unlock + call open_bank_account_4821 → business_checking: "Navy Blue" // open first!
6. call open_bank_account_4821 → savings: "Silver Plus" // open second!
7. unlock + call transfer_funds_between_bank_accounts_7291 // drain Bronze
9. unlock + call close_bank_account_7392 // close Bronze savings
11. call transfer × 2 // drain Evergreen to savings + biz checking
13. call close_bank_account_7392 // close Evergreen last
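The reordering in the last task amounts to a topological sort over dependencies the agent must first discover in the KB. A sketch using Python's standard graphlib (operation names mirror the scenario; the solver choice is ours, not something the benchmark prescribes):

```python
# Operation names mirror the Jordan Chen scenario; using a topological
# sort here is our illustration, not the benchmark's required method.
from graphlib import TopologicalSorter

# Map each operation to the operations that must run BEFORE it: any
# closure creates a CLOSED status that blocks the new applications,
# so both opens must precede both closes.
deps = {
    "close_bronze_savings": {"open_business_checking", "open_personal_savings"},
    "close_evergreen_checking": {"open_business_checking", "open_personal_savings"},
}
order = list(TopologicalSorter(deps).static_order())
# Opens appear before closes, regardless of the user's requested order.
```

An agent that simply executes operations in the user's stated order never builds this graph, which is why "follow the user literally" fails the task outright.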

Experiments

We evaluate a diverse set of frontier language models — GPT-5.2, Claude-4.5-Opus, Claude-4.5-Sonnet, Gemini-3-Pro, and Gemini-3-Flash — across multiple retrieval configurations: gold documents placed directly in context, dense retrieval (OpenAI text-embedding-3-large or Qwen3-Embedding), sparse retrieval (BM25), and terminal-based filesystem exploration.

We use the pass^k metric, defined as the probability that a task is successfully completed in all k independent trials, evaluating up to k = 4.
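Per task, pass^k can be computed with the standard combinatorial estimator: with c successes observed in n trials, the probability that k fresh trials all succeed is C(c, k) / C(n, k). A sketch (the benchmark's exact harness may differ):

```python
# Standard pass^k estimator; the benchmark's exact harness may differ.
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k: the probability that k fresh i.i.d. trials of a
    task all succeed, given c successes observed in n trials."""
    return comb(c, k) / comb(n, k)  # comb(c, k) == 0 whenever c < k

# e.g. a task solved in 2 of 4 trials: pass^1 = 0.5 but pass^4 = 0.0,
# which is why scores drop sharply as k grows.
```

Averaging this quantity over tasks gives the reported pass^k, so the metric rewards consistent solvers: a model that solves a task only sometimes is penalized heavily at k = 4.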

📊 Results Dashboard (interactive): pass^k by model, reasoning effort, and retrieval method (Gold, text-emb-3-large, Qwen3-Emb, BM25, Terminal)

Key Findings

📊 Key Comparisons

Best pass^1 by τ-Bench domain (best model per domain; higher means the domain is easier):

- τ-telecom: 97.8 (Qwen3.5)
- τ-airline: 84.0 (Opus 4.5)
- τ-retail: 82.9 (Qwen3.5)
- τ-banking: 25.5 (GPT-5.2)

τ-banking is 3–4× harder than existing τ-bench domains, even for the best frontier models.

Average pass^1 by retrieval method (mean across 6 model configurations on τ-banking):

- Terminal: 20.1
- Qwen3-Emb: 17.7
- BM25: 16.8
- text-emb-3-large: 16.5

Terminal-based retrieval achieves the highest pass^1 in 5 of 6 model configurations.

τ-Knowledge is hard. The best-performing configuration — GPT-5.2 with high reasoning — achieves only 25.52% pass^1. Performance degrades sharply with increasing k: GPT-5.2-High drops to just 13.40% pass^4. Even when gold documents are provided directly in context, the highest-scoring model (Claude-4.5-Opus-High) achieves only 39.69% pass^1, falling to 26.80% pass^4. This demonstrates that success requires not just finding the right documents, but carefully reasoning over them.

Reliability and efficiency diverge. Some models that underperform at pass^1 exhibit much higher consistency. For example, Claude-4.5-Sonnet trails Claude-4.5-Opus at pass^1 but declines less across trials, overtaking Opus at pass^4 (10.31% vs 9.28%). Task durations also reveal dramatic differences in solution efficiency: Claude models match GPT-level performance while finishing tasks much faster, generating fewer total tokens (0.7M vs 1.2M) and issuing fewer retrieval calls per task (8.7 for Claude-4.5-Opus vs 18.5 for GPT-5.2 with high reasoning).

Freeform search strategies outperform traditional retrieval. Terminal-based retrieval — where agents navigate the knowledge base via shell commands (grep, cat, find, etc.) — achieves the highest pass^1 in 5 of 6 model configurations, outperforming standard retrieval methods by 2.4–3.6 percentage points on average. This suggests that giving agents more flexible, freeform search strategies yields better results than constrained retrieval pipelines. We also observe meaningful differences in search frequency — dense retrieval averages 9.9–10.1 searches per task compared to 11.4 for BM25 and 14.5 grep calls in terminal-use — with a median turn time increase of 6.6 seconds for terminal-use relative to dense retrieval configurations.

How Agents Fail

Our qualitative analysis of agent trajectories reveals four recurring failure modes: missing cross-document policy dependencies, skipping the tool-unlock step, hallucinating successful tool outcomes after errors, and over-trusting unverified user claims.

💬 Sample Conversation Trace
Task 077 · Lost Wallet with Multiple Cards · GPT-5.2 High
User: Hi, I think I might have lost my wallet with both of my debit cards in it. I have two checking accounts — my Bluest Account and my Light Green Account. I've been looking everywhere...
Agent:
🔍 KB_search("lost wallet debit card freeze lock replace...")
🔍 KB_search("close debit card procedure close_debit_card card_id...")
I can help you secure both debit cards right away. Since you're not 100% sure the wallet is gone, the best immediate step is to freeze both debit cards...
User: Yes, please freeze BOTH cards right away. For verification: Email: [email protected], DOB: 04/12/2002, Phone: 312-555-0847
Agent:
⚙️ get_user_information_by_email("jinhai.liang@...")
⚙️ log_verification(name="Liang Jinhai", user_id="lj82d4f1a9", ...)
🔓 unlock get_all_user_accounts_by_user_id_3847
⚙️ call get_all_user_accounts_by_user_id_3847
🔓 unlock get_debit_cards_by_account_id_7823
🔓 unlock freeze_debit_card_3892
⚙️ get_debit_cards... × 2 (for each account)
⚙️ freeze_debit_card_3892(card_id="dbc_lj82d4f1a9_bluest")
⚙️ freeze_debit_card_3892(card_id="dbc_538bfb9cba")
⚙️ get_credit_card_accounts_by_user(user_id="lj82d4f1a9")
Done — I've frozen BOTH debit cards. I also see you have a Silver Rewards credit card. Was that card also in the wallet?
User: Oh wow, I completely forgot — yes, my Silver Rewards credit card was in my wallet too!
Agent:
🔍 KB_search("lock credit card freeze credit card temporarily...")
Since your Silver Rewards credit card was also in the wallet, the safest move is to replace the credit card...
User: Yes, please do that. Replacement reason: Lost. Shipping address: 1420 Lake Shore Drive, Chicago, IL 60610. Shipping speed: Standard please.
Agent:
🔓 unlock order_replacement_credit_card_7291
⚙️ get_pending_replacement_orders_5765(...)
⚙️ order_replacement_credit_card_7291(credit_card_account_id="cc_lj82d4f1a9_silver", reason="lost", ...)
All set — I've placed a replacement order for your Silver Rewards credit card.
User: I've now looked absolutely everywhere... My wallet is definitely gone. I need to cancel both debit cards completely and get new ones for each account.
Agent:
🔍 KB_search("Bluest Account tier premium debit card replacement...")
🔍 KB_search("checking account tier mapping entry tier mid tier...")
Your Light Green Account is Entry Tier with a 48-hour waiting period after closure before ordering a replacement...
User: Let's do Option A — close both debit cards now. For Bluest: Ship to 1420 Lake Shore Drive, Classic design, Standard shipping.
Agent:
🔓 unlock unfreeze_debit_card_3893
🔓 unlock close_debit_card_4721
🔓 unlock get_bank_account_transactions_9173
🔓 unlock order_debit_card_5739
⚙️ get_bank_account_transactions × 2
⚙️ unfreeze_debit_card_3893 × 2
⚙️ close_debit_card_4721(card_id="dbc_lj82d4f1a9_bluest", reason="lost")
⚙️ close_debit_card_4721(card_id="dbc_538bfb9cba", reason="lost")
⚙️ order_debit_card_5739(account_id="chk_lj82d4f1a9", delivery_option="STANDARD", card_design="CLASSIC", ...)
Done. Both debit cards permanently cancelled. Bluest replacement ordered.
User: Great news — my Bluest Account replacement card just arrived. Can you help me activate it?
Agent:
⚙️ get_debit_cards_by_account_id_7823(...)
🔓 unlock activate_debit_card_8292
I located your new Bluest replacement debit card (ending 7260). Please share the last 4 digits, expiration date, CVV, and your desired PIN to activate.
User: Last 4 digits: 7260, expiration: 11/29, CVV: 741. PIN: 5739.
Agent:
⚙️ activate_debit_card_8292(card_id="dbc_8eed7b16caeb", last_4="7260", exp="11/29", cvv="741", pin="5739")
Your Bluest replacement debit card is now activated and ready to use.

Why Solution Efficiency Matters

A central theme of our findings is that solution efficiency — the ability to reach correct, policy-compliant outcomes with minimal time, tool calls, and conversational backtracking — should be a first-class evaluation metric for human-facing agents. Extra turns translate into longer resolution times, higher cognitive burden, and reduced trust, especially for time-sensitive support scenarios like a lost credit card or unrecognized transactions.

Progress on human-facing agents should be measured not only by final task success but also by how efficiently agents achieve it.
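One way to operationalize this is to report cost statistics alongside success rather than folding them into a single score. A minimal sketch, with invented field names that a real harness would log per trajectory:

```python
# Field names ("passed", "tool_calls", "seconds") are invented for this
# sketch; a real evaluation harness would log its own schema.
def summarize(trajectories: list) -> dict:
    """Report success rate alongside solution cost, instead of one score."""
    solved = [t for t in trajectories if t["passed"]]
    out = {"pass^1": len(solved) / len(trajectories)}
    if solved:
        # Average cost only over solved tasks, so cheap failures
        # cannot masquerade as efficiency.
        out["avg_tool_calls_when_solved"] = sum(t["tool_calls"] for t in solved) / len(solved)
        out["avg_seconds_when_solved"] = sum(t["seconds"] for t in solved) / len(solved)
    return out
```

Reporting the two axes separately makes trade-offs visible: two models with identical pass^1 can differ sharply in tool calls and wall-clock time, which is exactly the Claude-vs-GPT gap observed above.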

Looking Forward

The scores on τ-knowledge are low, and we believe that's the point. With the best frontier model reaching only ~26% pass^1, there is enormous room for improvement. Unlike web search, where the corpus is functionally unbounded, τ-knowledge evaluates agents over a finite, closed knowledge base. This is how search actually works in most real-world deployments: customer support agents, internal tools, enterprise platforms, and regulated industries all operate over bounded, curated documentation. An agent that can't reliably navigate a few hundred documents has no business being trusted with open-ended retrieval.

We invite model providers and agent developers to use τ-knowledge as a meaningful measure of an agent's ability to search, reason, and act over unstructured but finite data. The benchmark is open, the tasks are verifiable, and the gap between current performance and reliable deployment is clear.

For full details, see our paper and code.
