τ-Voice extends τ-bench to live, full-duplex voice interactions — where both sides speak and listen at once, people interrupt, and calls happen in noisy environments. Rather than clean audio in a quiet room, τ-Voice simulates realistic conditions: accents, street noise, burst sounds, connection drops, and natural turn-taking dynamics.
A simulated τ-Voice call in the retail domain. The main timeline shows six minutes of overlapping speech, interruptions, and noise. Inset A decomposes the audio the agent receives; Inset B highlights turn-taking dynamics.
The examples below demonstrate how these conditions affect agent performance: the same agent can complete a task under clean audio and fail it under realistic conditions. A full blog post with detailed results is coming soon.
Explore full voice trajectories
Browse and replay complete voice simulations, with transcripts, tool calls, and turn-by-turn details, in the interactive trajectory visualizer.
Task 14 succeeds under clean audio but fails when realistic effects are applied — same task, same agent, different outcome.
Clean: Gemini (Success)
Realistic: Gemini (Logical failure)
Transcription failures
Both conversations fail due to transcription errors. In clean audio, characters that the user spells out aloud trip up the agent; in realistic audio, accent and noise compound the problem.
Clean: xAI (Transcription failure)
Realistic: xAI (Transcription failure)
Logical failures
Both conversations fail due to reasoning errors — wrong policy application or missed constraints — independent of audio quality.
Clean: OpenAI (Logical failure)
Realistic: Gemini (Logical failure)
Annotated Speech Activity Timeline
The interactive visualization below annotates the realistic Task 14 audio with speech-activity markers (user and agent speech, interruptions, noise effects, backchannels, and more). Press play to step through the conversation with a synchronized playhead.
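To make the idea of an interruption marker concrete, here is a minimal sketch of how overlapping speech could be flagged from a list of speech segments. It is an illustration only, not τ-Voice's annotation method: the `Segment` type, the `find_interruptions` helper, the speaker labels, and the timestamps are all hypothetical. It also treats any onset during the other party's speech as an interruption, so it would not separate backchannels from true interruptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One span of speech activity, in seconds from the start of the call."""
    speaker: str  # "user" or "agent" (labels assumed for this sketch)
    start: float
    end: float


def find_interruptions(segments):
    """Return (interrupter, interrupted, time) for each overlapping onset.

    A speaker counts as interrupting when their segment starts strictly
    inside a segment belonging to the other speaker.
    """
    events = []
    for seg in segments:
        for other in segments:
            if other.speaker == seg.speaker:
                continue
            if other.start < seg.start < other.end:
                events.append((seg.speaker, other.speaker, seg.start))
    # Order events by the time the interrupting speech begins.
    return sorted(events, key=lambda event: event[2])


# Hypothetical timeline, not taken from Task 14.
timeline = [
    Segment("agent", 0.0, 4.0),
    Segment("user", 3.2, 6.0),   # starts while the agent is still talking
    Segment("agent", 6.5, 9.0),  # starts after the user has finished
]
interruptions = find_interruptions(timeline)
print(interruptions)
```

In this made-up timeline, only the user's segment begins while the other party is speaking, so it is the single event reported. Telling a backchannel apart from an interruption would need extra signals, such as segment duration or transcript content.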