The short answer: voice agent latency must be designed below 800 ms, but the real product work is turn-taking, interruption, ASR recovery, and multilingual switching.
The old "800 ms feels natural" target has become a trap. Human conversation is closer to a 200 ms mean inter-turn gap, according to the cross-linguistic study by Stivers et al. in PNAS, and the ITU's one-way voice delay guidance treats more than 300 ms as unacceptable for conversational quality in some contexts (ITU-T G.114).
For AI voice agents, the practical 2026 target is harsher: make the user hear useful audio in under 800 ms at P95, keep normal turns in the 300-650 ms band, and treat sub-300 ms as a premium "feels live" experience.
TL;DR: The latency race moved from "pick a faster model" to "control the whole speech loop." Streaming ASR, endpointing, LLM TTFT, tool calls, TTS first-byte, network hops, and barge-in behavior all compound. A 500 ms agent that misunderstands a code-switched caller will feel worse than a 700 ms agent that recovers cleanly.
Key Takeaways
- The 800 ms threshold is now a warning line for voice agent latency, not a best-in-class goal.
- End-of-turn detection can save 200-600 ms versus naive silence thresholds, according to LiveKit's Turn Detector v1.0 write-up.
- LLM TTFT and synchronous tool calls usually dominate the latency budget once ASR and TTS are streaming.
- Native speech-to-speech models improve barge-in and full-duplex interaction, but benchmarked end-to-end latency still varies widely.
- Multilingual voice agents need ASR benchmarks for code-switching, not just English WER.
- Regulated voice UX should sometimes spend latency on accuracy, consent, auditability, or prosody.
Why Voice Agent Latency Moved Below 800 ms
The 800 ms number survived because it was convenient. It gave product teams a simple threshold, and it mapped loosely to when users start noticing a lag.
But live speech has a tighter clock. A user can tell when the system is waiting for silence, waiting for a transcript, waiting for a model, then waiting again for speech synthesis.
Microsoft's April 27, 2026 general availability of real-time voice agents in Copilot Studio framed the product around "speech-to-speech, low latency, interruptible" behavior in Dynamics 365 Contact Center (Microsoft Copilot Studio). Microsoft didn't publish a millisecond target, but the market signal matters: enterprise voice buyers now expect interruption and low-latency speech as table stakes.
The same shift shows up in orchestration docs. LiveKit says a typical well-built pipeline lands around 500-650 ms P95 response latency in production conditions (LiveKit). That is the new center of gravity.
The Voice Agent Latency Budget, in Practice
A voice agent doesn't have one latency number. It has a chain of small waits that users experience as one pause.
| Component | 2026 practical target | What to measure |
|---|---|---|
| VAD and endpointing | <30 ms | Speech onset, silence, end-of-turn |
| Streaming ASR first partial | <200 ms | Time to useful interim transcript |
| LLM TTFT | <400 ms, ideally <300 ms | First token after prompt is ready |
| Tool calls | <500 ms | P95 per API or retrieval step |
| TTS first-byte | <200 ms, ideally <150 ms | First playable audio chunk |
| Barge-in stop | <200 ms | User speech to agent silence |
| End-to-end TTFA | <800 ms P95 | User turn end to audible response |
The main product lesson is that medians lie. A demo with a 450 ms median can still produce 1.8 second stalls when a tool call blocks, endpointing waits too long, or ASR confidence collapses in background noise.
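To make the "medians lie" point concrete, here is a minimal sketch that compares the median and the P95 of a latency sample. The sample values are invented for illustration (90 turns at 450 ms plus 10 stalls at 1,800 ms), and `percentile` is a hypothetical helper using the nearest-rank method, not a function from any voice framework.

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical TTFA samples in ms: 90 fast turns plus 10 stalls from a blocked tool call.
ttfa_ms = [450] * 90 + [1800] * 10

median_ms = statistics.median(ttfa_ms)
p95_ms = percentile(ttfa_ms, 95)
print(f"median={median_ms} ms, p95={p95_ms} ms")
```

With this made-up distribution the median sits at the demo-friendly number while the P95 lands on the stall, which is why the budget table above states its end-to-end target at P95.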
This is why OpenTelemetry-style tracing matters. LiveKit Agents and Pipecat both expose component-level hooks, and teams increasingly track VAD, STT, LLM, tool, and TTS spans rather than a single opaque "response time" (LiveKit Agents docs, Pipecat docs).
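As a rough illustration of per-component spans, the sketch below times each stage of a turn with a standard-library context manager. It is not the OpenTelemetry, LiveKit Agents, or Pipecat API; `TurnTrace` and the span names are assumptions made for this example, and the `time.sleep` calls stand in for real work.

```python
import time
from contextlib import contextmanager

class TurnTrace:
    """Hypothetical per-turn trace: records how long each named stage took, in ms."""

    def __init__(self):
        self.spans = {}

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.spans[name] = (time.perf_counter() - start) * 1000.0

    def slowest(self):
        """Name of the stage that contributed the most wall-clock time."""
        return max(self.spans, key=self.spans.get)

trace = TurnTrace()
with trace.span("stt"):
    time.sleep(0.01)   # stand-in for streaming ASR
with trace.span("llm"):
    time.sleep(0.03)   # stand-in for LLM time to first token
print(trace.slowest(), trace.spans)
```

In a real deployment the same shape would usually be exported through a tracing backend so that a slow turn can be attributed to one stage rather than to a single opaque number.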
The Hard Part Is Turn-Taking AI
Turn-taking AI fails in ways users feel immediately. It talks over them, waits too long, stops too late, or answers before the user has finished a thought.
Silence thresholds are too crude for production. A user pausing after "my account number is..." has not ended the turn. A caller saying "no, wait" while the agent is speaking has clearly taken the floor.
Modern systems use a dedicated end-of-turn model or VAD layer. LiveKit's Turn Detector v1.0 reports 200-600 ms saved versus naive silence thresholding (LiveKit), while TEN VAD is designed for low-latency real-time agents (TEN VAD).
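One way to picture the difference from a fixed silence threshold: let an end-of-turn score shorten or lengthen the wait. The sketch below is an assumption-laden illustration, not how LiveKit's Turn Detector or TEN VAD works internally. `eot_probability` is a hypothetical score in [0, 1] from some end-of-turn model, and the millisecond values and the 0.8 cutoff are placeholders.

```python
def should_end_turn(silence_ms, eot_probability,
                    short_wait_ms=200, long_wait_ms=1200, confident=0.8):
    """Decide whether the agent should take the floor.

    When the (hypothetical) end-of-turn model is confident the user has finished,
    a short pause is enough. Otherwise wait much longer before responding.
    """
    wait_ms = short_wait_ms if eot_probability >= confident else long_wait_ms
    return silence_ms >= wait_ms

# "my account number is..." followed by a 300 ms pause: low end-of-turn score, keep listening.
print(should_end_turn(silence_ms=300, eot_probability=0.1))
# "that's all, thanks" followed by the same pause: high score, respond now.
print(should_end_turn(silence_ms=300, eot_probability=0.9))
```

The design choice is that the silence timer stays as a fallback, so a wrong score only delays the response instead of blocking it.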
Barge-in is the sharper test. Native speech-to-speech models can keep a full-duplex audio channel open, detect interruption, and stop closer to mid-syllable. Pipeline systems built from separate ASR, LLM, and TTS services need explicit buffer flushing and cancellation logic.
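For pipeline systems, "explicit buffer flushing and cancellation logic" might look roughly like the asyncio sketch below. Everything here is hypothetical: `speak` pretends each chunk is about 20 ms of audio, and a timer stands in for the VAD event that would signal user speech in a real agent.

```python
import asyncio
from collections import deque

async def speak(buffer, played):
    """Pretend playback loop: each chunk stands in for roughly 20 ms of audio."""
    while buffer:
        played.append(buffer.popleft())
        await asyncio.sleep(0.02)

async def run_turn_with_barge_in(chunks, interrupt_after_s):
    buffer, played = deque(chunks), []
    playback = asyncio.create_task(speak(buffer, played))
    await asyncio.sleep(interrupt_after_s)   # stand-in for VAD detecting user speech
    playback.cancel()                        # stop playback at the next await point
    try:
        await playback
    except asyncio.CancelledError:
        pass
    dropped = len(buffer)
    buffer.clear()                           # flush queued audio so nothing plays late
    return played, dropped

played, dropped = asyncio.run(run_turn_with_barge_in(range(50), interrupt_after_s=0.1))
print(f"played {len(played)} chunks, dropped {dropped}")
```

Cancelling the task without clearing the buffer is the usual way an agent ends up finishing its sentence after the user has already started talking.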
A practical barge-in test is simple: have a caller interrupt the agent 50 times across different syllables, background-noise levels, and languages. If the agent keeps talking for more than 200 ms after speech onset, users will perceive it as rude.
Current State of Real-Time Speech AI
The 2026 stack splits into two camps: native speech-to-speech models and modular pipelines.
Native speech-to-speech systems reduce handoff cost and tend to handle interruption better. OpenAI's May 2026 Realtime API updates introduced the current GPT-Realtime generation with unified speech capabilities (OpenAI), while Google's Live API supports real-time conversational audio flows (Google Gemini Live API).
xAI also shipped Grok Voice Think Fast 1.0 via API in April 2026 (xAI).
The modular pipeline still wins when teams need control. You can swap ASR by language, choose a lower-latency LLM for the main loop, route hard tool calls to a reasoning model, and use a different TTS voice for regulated payloads.
| Architecture | Best fit | Main advantage | Main risk |
|---|---|---|---|
| Native speech-to-speech | Full-duplex agents, interruptions, fast prototypes | Fewer handoffs and better barge-in | Less control over ASR, tools, and TTS behavior |
| Modular ASR → LLM → TTS | Contact centers, regulated flows, multilingual routing | Component-level optimization and observability | More handoff latency and cancellation complexity |
| Hybrid routing | Production systems with varied call types | Fast path for simple turns, controlled path for risky turns | Requires strong orchestration and eval coverage |
For most teams, the best 2026 default is boring: use LiveKit Agents for managed cloud plus SIP, or Pipecat for self-hosted and edge-heavy deployments. Both have enough ecosystem depth to keep you out of bespoke WebRTC maintenance.
The ASR Benchmark That Matters Is Your Caller Mix
An ASR benchmark is only useful if it resembles your traffic. English WER on clean audio tells you little about a bilingual caller in a noisy kitchen giving a policy number.
Deepgram's Nova-3 is positioned around low-latency streaming speech-to-text and real-time agent use cases (Deepgram). AssemblyAI's Universal-3 Pro streaming is cited in the research report as a sub-200 ms model with promptable speech behavior. ElevenLabs Scribe v2 Realtime targets multilingual real-time transcription and sits in the same production class (ElevenLabs models).
For multilingual voice agents, the June 2026 shift is code-switching. ServiceNow-AI's AU-Harness evaluates bilingual speech across Spanish-English, French-English, French-Canadian-English, and German-English using WER, semantic WER, and accent error rate, as described in the ServiceNow-AI Hugging Face materials (Hugging Face).
Google's Chirp 3 is the obvious candidate when broad language coverage matters (Google Cloud Speech-to-Text). NVIDIA's Canary and Parakeet families matter for teams that need open models and on-prem control, especially across European languages (NVIDIA NeMo ASR).
The mistake is treating multilingual support as a checkbox. The right test set includes accents, code-switching, phone audio, interruptions, named entities, and the exact alphanumeric strings your customers say.
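A minimal sketch of slice-level evaluation: compute WER separately for each traffic slice instead of reporting one blended number. The slice names and transcripts are invented, and `word_errors` is a plain word-level edit distance with lowercasing as the only normalization. Production evals typically normalize numbers and punctuation as well.

```python
from collections import defaultdict

def word_errors(reference, hypothesis):
    """Return (edit distance in words, reference length) using Levenshtein distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                              # deletion
                           cur[j - 1] + 1,                           # insertion
                           prev[j - 1] + (ref_word != hyp_word)))    # substitution or match
        prev = cur
    return prev[-1], len(ref)

def wer_by_slice(samples):
    """samples: iterable of (slice_name, reference, hypothesis). Returns WER per slice."""
    errors, words = defaultdict(int), defaultdict(int)
    for slice_name, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[slice_name] += e
        words[slice_name] += n
    return {name: errors[name] / words[name] for name in words if words[name]}

# Hypothetical eval set: one clean English call, one code-switched call.
samples = [
    ("clean_en", "please check my balance", "please check my balance"),
    ("code_switch", "quiero pagar my bill today", "quiero pagar mi bill"),
]
print(wer_by_slice(samples))
```

Summing errors and reference words per slice, rather than averaging per-utterance WER, keeps short utterances from dominating the result.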
The 800 ms Wall Is a UX Problem First
A voice agent UX breaks before the latency dashboard turns red. Users repeat themselves, start speaking over the system, or switch to shorter phrases because they no longer trust the interaction rhythm.
That means latency design belongs in product specs. The spec should define when the agent backchannels, when it waits, when it confirms, when it interrupts itself, and when it escalates to a person.
For example, a support agent can say "I’m checking that" quickly while a tool call runs. That buys perceived responsiveness without pretending the answer is ready. But the pattern only works if the system has already captured the user's intent and won't use filler to mask confusion.
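The holding-utterance pattern can be sketched with a time-boxed wait. In this illustration `answer_with_holding`, `say`, and `slow_lookup` are hypothetical names, and the budget is shortened so the example runs quickly. It assumes the intent is already captured, as the paragraph above requires.

```python
import asyncio

async def answer_with_holding(tool_coro, say, budget_s=0.5):
    """Speak the tool result, adding a holding phrase if the tool exceeds its time box."""
    task = asyncio.create_task(tool_coro)
    try:
        # shield() keeps the tool call running even if the wait times out.
        result = await asyncio.wait_for(asyncio.shield(task), timeout=budget_s)
    except asyncio.TimeoutError:
        say("I'm checking that")
        result = await task
    say(result)
    return result

async def slow_lookup():
    await asyncio.sleep(0.05)   # stand-in for a slow API or retrieval step
    return "Your order shipped"

spoken = []
asyncio.run(answer_with_holding(slow_lookup(), spoken.append, budget_s=0.01))
print(spoken)
```

The filler is spoken only after the budget is exceeded, so fast tool calls go straight to the answer.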
The same rule applies to confirmations. If ASR confidence is low for an address, account number, medication, or legal name, the agent should slow down and confirm. A faster wrong answer creates more handle time than a slightly slower repair turn.
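As a small sketch of that rule, the function below chooses between proceeding and inserting a repair turn. The field names come from the paragraph above. The 0.85 threshold and the `next_action` helper are assumptions for illustration, and a real threshold would need calibrating against your own ASR confidence scores.

```python
RISKY_FIELDS = {"address", "account_number", "medication", "legal_name"}

def next_action(field, asr_confidence, threshold=0.85):
    """Return "confirm" when a high-stakes field was heard with low confidence."""
    if field in RISKY_FIELDS and asr_confidence < threshold:
        return "confirm"
    return "proceed"

print(next_action("account_number", 0.62))
print(next_action("greeting", 0.62))
```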
Where Lower Latency Can Make the Product Worse
Healthcare is the cleanest example. The research report notes that streaming TTS can lose important context on alphanumeric and clinical payloads such as drug names, medical record numbers, and prescription codes. In that situation, batch-mode TTS for risky phrases is a better product decision even with a 200-500 ms penalty.
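One hedged way to implement "batch-mode TTS for risky phrases" is a router that inspects each utterance before synthesis. The regex, the `tts_mode` helper, and the example terms are assumptions for illustration. A production system would more likely rely on structured slots than on pattern matching alone.

```python
import re

# Tokens of four or more characters that contain at least one digit, e.g. record numbers.
ALNUM_CODE = re.compile(r"\b(?=\w*\d)\w{4,}\b")

def tts_mode(text, risky_terms=()):
    """Pick "batch" synthesis for risky payloads and "streaming" for everything else."""
    lowered = text.lower()
    if ALNUM_CODE.search(text) or any(term.lower() in lowered for term in risky_terms):
        return "batch"
    return "streaming"

print(tts_mode("Your record number is MRN48213"))
print(tts_mode("Take metformin twice daily", risky_terms=("metformin",)))
print(tts_mode("Thanks for calling today"))
```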
Financial services has a different constraint. A voice agent may need a recording disclosure, opt-out path, and audit trail before the useful conversation starts. The relevant metric becomes time-to-compliant-state, not only turn latency.
Mental health and emotional-support products have another tradeoff. The American Psychological Association's November 2025 advisory warned against substituting AI for human therapists in crisis intervention, and that should shape voice design. Prosody, silence pacing, and escalation behavior matter more than shaving another 100 ms.
The general rule: optimize latency until the interaction feels responsive, then spend the next engineering cycle on accuracy, repair, compliance, and task success.
A Practical Decision Framework for AI Voice Agents
Choose the stack by failure mode, not by leaderboard rank.
| Your deployment | Prefer | Why |
|---|---|---|
| English-heavy support calls | Deepgram or AssemblyAI ASR, low-TTFT LLM, fast streaming TTS | Best cost-latency balance |
| Bilingual or code-switched calls | ElevenLabs Scribe, Google Chirp 3, or dedicated multilingual routing | English-only WER will mislead you |
| Regulated healthcare | Private LLM path, batch TTS for risky payloads | Accuracy and trust boundary dominate |
| High-interruption flows | Native speech-to-speech or carefully tuned full-duplex pipeline | Barge-in quality drives perceived intelligence |
| SIP/contact center replacement | LiveKit Agents or vendor-native contact-center stack | Telephony, routing, and observability are first-order requirements |
| Edge or on-prem | Pipecat plus open ASR/TTS options | Control over data, region, and deployment topology |
The one-sentence takeaway for your team Slack: voice latency below 800 ms gets you into the conversation, but turn-taking and recovery determine whether users stay there.
Implementation Checklist
Use this checklist before calling a voice agent production-ready.
- Instrument VAD, ASR, LLM TTFT, tool calls, TTS first-byte, TTFA, and barge-in latency as separate spans.
- Set promotion gates at P95 TTFA under 800 ms, P95 end-to-end under 1,000 ms, and barge-in stop under 200 ms.
- Replace silence thresholds with LiveKit Turn Detector, TEN VAD, Silero VAD, or an equivalent end-of-turn model.
- Run ASR evals on your own calls, with noise, accents, code-switching, names, IDs, and domain terms.
- Disable reasoning mode in the main conversational loop; route hard reasoning to a parallel branch or delayed tool path.
- Time-box synchronous tools to 500 ms and return a useful holding utterance when the answer needs longer.
- Use streaming TTS by default, then switch to safer synthesis for numbers, drug names, IDs, and proper nouns.
- Test interruption as a first-class scenario, including user speech while TTS is mid-word.
- Colocate ASR, LLM, and TTS services in the same region where possible.
- Gate every release with at least a 1,000-call synthetic or shadow A/B evaluation before broad rollout.
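The promotion gates in the checklist can be expressed as a small release check. This is a sketch under assumptions: the metric names are hypothetical, "end-to-end" is taken to mean whatever full-turn measure your team defines, and a missing metric is treated as a failure so that uninstrumented stages cannot pass silently.

```python
# Gate values from the checklist, in ms. "Under" means strictly below the limit.
GATES_MS = {"ttfa_p95": 800, "end_to_end_p95": 1000, "barge_in_stop": 200}

def failed_gates(metrics_ms):
    """Return the names of gates that are missing or not strictly under their limit."""
    failures = []
    for name, limit in GATES_MS.items():
        value = metrics_ms.get(name)
        if value is None or value >= limit:
            failures.append(name)
    return failures

print(failed_gates({"ttfa_p95": 720, "end_to_end_p95": 940, "barge_in_stop": 180}))
print(failed_gates({"ttfa_p95": 820, "end_to_end_p95": 940}))
```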
What This Means for You
If you're building AI voice agents in 2026, don't start with a model bake-off. Start with the call trace.
Find the largest waits in real traffic. Then decide whether the fix is faster inference, better endpointing, parallel tool calls, a different ASR path, or a UX repair turn.
Voice agent latency is no longer a single vendor claim. It's the product surface of real-time speech AI, and users will judge every missed interruption, awkward pause, and mangled code-switch as part of the same experience.
Sources
- Universals and cultural variation in turn-taking in conversation
- ITU-T G.114 one-way transmission time
- LiveKit: Understand and Improve Voice Agent Latency
- LiveKit: Solving end-of-turn detection
- Microsoft Copilot Studio real-time voice agents
- OpenAI: Introducing gpt-realtime and Realtime API updates
- Google Gemini Live API capabilities
- Deepgram Nova-3 speech-to-text API
- ElevenLabs models documentation
- Google Chirp 3 transcription
- NVIDIA NeMo multilingual and code-switched ASR
- Pipecat documentation
