What is a good voice agent latency target in 2026?

A production AI voice agent should target sub-800 ms P95 time-to-first-audio, with 300-650 ms as the practical sweet spot. Sub-300 ms can feel live, but only if interruption handling and ASR accuracy hold up.

Is 800 ms still fast enough for AI voice agents?

800 ms is usable, but it no longer feels especially natural. The more important target is whether the agent can detect end-of-turn quickly, stop on barge-in, and avoid making users repeat themselves.

Which part of a voice agent stack usually causes the most latency?

The LLM step usually dominates the budget, especially TTFT and streaming token generation. Tool calls can exceed it when RAG, CRM, or account lookups run synchronously.

How should teams measure real-time speech AI latency?

Measure every turn with component spans: VAD, ASR first partial, LLM TTFT, tool latency, TTS first-byte, TTFA, and barge-in latency. P95 and failure modes matter more than a polished median.

Voice Agent Latency Hit a Wall. Design Around It

The short answer: voice agents must be designed for latency below 800 ms, but the real product work is turn-taking, interruption, ASR recovery, and multilingual switching.

The old "800 ms feels natural" target has become a trap. Human conversation is closer to a 200 ms mean inter-turn gap, according to the cross-linguistic study by Stivers et al. in PNAS, and the ITU's one-way voice delay guidance treats more than 300 ms as unacceptable for conversational quality in some contexts (ITU-T G.114).

For AI voice agents, the practical 2026 target is harsher: make the user hear useful audio in under 800 ms at P95, keep normal turns in the 300-650 ms band, and treat sub-300 ms as a premium "feels live" experience.

TL;DR: The latency race moved from "pick a faster model" to "control the whole speech loop." Streaming ASR, endpointing, LLM TTFT, tool calls, TTS first-byte, network hops, and barge-in behavior all compound. A 500 ms agent that misunderstands a code-switched caller will feel worse than a 700 ms agent that recovers cleanly.

Key Takeaways

The 800 ms threshold is now a warning line for voice agent latency, not a best-in-class goal.
End-of-turn detection can save 200-600 ms versus naive silence thresholds, according to LiveKit's Turn Detector v1.0 write-up.
LLM TTFT and synchronous tool calls usually dominate the latency budget once ASR and TTS are streaming.
Native speech-to-speech models improve barge-in and full-duplex interaction, but benchmarked end-to-end latency still varies widely.
Multilingual voice agents need ASR benchmarks for code-switching, not just English WER.
Regulated voice UX should sometimes spend latency on accuracy, consent, auditability, or prosody.

Why Voice Agent Latency Moved Below 800 ms

The 800 ms number survived because it was convenient. It gave product teams a simple threshold, and it mapped loosely to when users start noticing a lag.

But live speech has a tighter clock. A user can tell when the system is waiting for silence, waiting for a transcript, waiting for a model, then waiting again for speech synthesis.

Microsoft's April 27, 2026 general availability of real-time voice agents in Copilot Studio framed the product around "speech-to-speech, low latency, interruptible" behavior in Dynamics 365 Contact Center (Microsoft Copilot Studio). Microsoft didn't publish a millisecond target, but the market signal matters: enterprise voice buyers now expect interruption and low-latency speech as table stakes.

The same shift shows up in orchestration docs. LiveKit says a typical well-built pipeline lands around 500-650 ms P95 response latency in production conditions (LiveKit). That is the new center of gravity.

The Voice Agent Latency Budget, in Practice

A voice agent doesn't have one latency number. It has a chain of small waits that users experience as one pause.

Component	2026 practical target	What to measure
VAD and endpointing	<30 ms	Speech onset, silence, end-of-turn
Streaming ASR first partial	<200 ms	Time to useful interim transcript
LLM TTFT	<300-400 ms	First token after prompt is ready
Tool calls	<500 ms	P95 per API or retrieval step
TTS first-byte	<150-200 ms	First playable audio chunk
Barge-in stop	<200 ms	User speech to agent silence
End-to-end TTFA	<800 ms P95	User turn end to audible response

The main product lesson is that medians lie. A demo with a 450 ms median can still produce 1.8 second stalls when a tool call blocks, endpointing waits too long, or ASR confidence collapses in background noise.
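To make the gap between median and tail concrete, here is a minimal sketch using invented TTFA samples: 18 turns at 450 ms plus 2 turns that stall at 1,800 ms. The `percentile` helper uses the nearest-rank definition, which is only one of several reasonable choices, and the sample values are illustrative rather than measured.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical per-turn TTFA in milliseconds: mostly fast, with two blocked tool calls.
ttfa_ms = [450] * 18 + [1800] * 2

median_ms = statistics.median(ttfa_ms)
p95_ms = percentile(ttfa_ms, 95)
print(f"median={median_ms} ms, p95={p95_ms} ms")
```

With this sample the median stays at 450 ms while the P95 lands on the 1,800 ms stall, which is the number a caller on a bad turn actually experiences.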

This is why OpenTelemetry-style tracing matters. LiveKit Agents and Pipecat both expose component-level hooks, and teams increasingly track VAD, STT, LLM, tool, and TTS spans rather than a single opaque "response time" (LiveKit Agents docs, Pipecat docs).
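As a rough illustration of per-component spans, the sketch below records named durations for a single turn using only the standard library. `TurnTrace` and the span names are hypothetical; this is not the LiveKit, Pipecat, or OpenTelemetry API, just the shape of the data those tools would give you.

```python
import time
from contextlib import contextmanager


class TurnTrace:
    """Collects per-component durations (in ms) for one conversational turn."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so tests can supply a fake clock
        self.spans_ms = {}

    @contextmanager
    def span(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            # Accumulate, so a component called twice in one turn is summed.
            self.spans_ms[name] = self.spans_ms.get(name, 0.0) + elapsed_ms

    def slowest(self):
        """Name of the component that consumed the most time this turn."""
        return max(self.spans_ms, key=self.spans_ms.get)


trace = TurnTrace()
with trace.span("asr_first_partial"):
    pass  # the streaming ASR call would go here
with trace.span("llm_ttft"):
    pass  # waiting for the first LLM token would go here
print(trace.spans_ms)
```

In a real deployment you would export these spans to your tracing backend instead of keeping them in a dict, but the per-turn breakdown is the part that matters.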

[Figure: Typical 2026 Voice Agent Latency Budget]

The Hard Part Is Turn-Taking AI

Turn-taking AI fails in ways users feel immediately. It talks over them, waits too long, stops too late, or answers before the user has finished a thought.

Silence thresholds are too crude for production. A user pausing after "my account number is..." has not ended the turn. A caller saying "no, wait" while the agent is speaking has clearly taken the floor.

Modern systems use a dedicated end-of-turn model or VAD layer. LiveKit's Turn Detector v1.0 reports 200-600 ms saved versus naive silence thresholding (LiveKit), while TEN VAD is designed for low-latency real-time agents (TEN VAD).
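The difference between a fixed silence threshold and a context-aware one can be sketched in a few lines. The word list and the 250 ms and 1,200 ms values below are assumptions chosen for illustration; this toy heuristic is a stand-in for a trained end-of-turn model and does not represent how LiveKit's Turn Detector or TEN VAD work internally.

```python
# Hypothetical cue words suggesting the speaker is mid-thought.
INCOMPLETE_ENDINGS = ("is", "and", "but", "my", "the", "to", "um", "uh")


def looks_incomplete(partial_transcript):
    """True when the last word suggests the user has not finished the sentence."""
    words = partial_transcript.lower().rstrip(" .…").split()
    return bool(words) and words[-1].strip(",") in INCOMPLETE_ENDINGS


def required_silence_ms(partial_transcript, short_ms=250, long_ms=1200):
    """Wait longer after an unfinished phrase, shorter after a complete one."""
    return long_ms if looks_incomplete(partial_transcript) else short_ms


def turn_ended(partial_transcript, silence_ms):
    return silence_ms >= required_silence_ms(partial_transcript)


print(turn_ended("my account number is...", 700))      # pause mid-thought
print(turn_ended("I need to reset my password", 300))  # complete request
```

A naive 700 ms threshold would cut off the first caller and make the second one wait; the context-aware version does neither.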

Barge-in is the sharper test. Native speech-to-speech models can keep a full-duplex audio channel open, detect interruption, and stop closer to mid-syllable. Pipeline systems built from separate ASR, LLM, and TTS services need explicit buffer flushing and cancellation logic.
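For pipeline systems, the cancellation logic might look roughly like the following asyncio sketch. `Playback` is a hypothetical stand-in for a TTS output loop, and the 20 ms sleep simulates writing one audio chunk; a real pipeline would also need to cancel the upstream LLM and TTS requests.

```python
import asyncio


class Playback:
    """Hypothetical TTS playback loop that drains queued audio chunks."""

    def __init__(self, chunks):
        self.queue = list(chunks)
        self.played = []

    async def run(self):
        while self.queue:
            chunk = self.queue.pop(0)
            await asyncio.sleep(0.02)  # stand-in for writing ~20 ms of audio
            self.played.append(chunk)

    def flush(self):
        """Drop any audio that has been synthesized but not yet played."""
        self.queue.clear()


async def speak_until_barge_in(playback, barge_in):
    """Play audio until it finishes or the barge-in event fires, whichever is first."""
    play_task = asyncio.create_task(playback.run())
    barge_task = asyncio.create_task(barge_in.wait())
    _done, pending = await asyncio.wait(
        {play_task, barge_task}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = play_task in pending
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    if interrupted:
        playback.flush()  # without this, stale audio would resume later
    return interrupted


async def demo():
    playback = Playback([f"chunk-{i}" for i in range(50)])
    barge_in = asyncio.Event()
    # Simulate the user starting to speak 50 ms into the agent's response.
    asyncio.get_running_loop().call_later(0.05, barge_in.set)
    interrupted = await speak_until_barge_in(playback, barge_in)
    return interrupted, len(playback.played), len(playback.queue)


interrupted, played, remaining = asyncio.run(demo())
print(interrupted, played, remaining)
```

The two steps that teams tend to miss are the explicit `flush()` and waiting for the cancelled task to actually finish before starting the next turn.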

A practical barge-in test is simple: have a caller interrupt the agent 50 times across different syllables, background-noise levels, and languages. If the agent keeps talking for more than 200 ms after speech onset, users will perceive it as rude.
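One way to score such a test run is sketched below. The `BargeInTrial` record and the three sample trials are invented for illustration; in practice the timestamps would come from your audio capture or call traces.

```python
from dataclasses import dataclass


@dataclass
class BargeInTrial:
    condition: str          # e.g. noise level and language
    speech_onset_ms: float  # when the caller started speaking
    agent_silent_ms: float  # when agent audio actually stopped

    @property
    def stop_latency_ms(self):
        return self.agent_silent_ms - self.speech_onset_ms


def summarize(trials, limit_ms=200.0):
    """Count trials where the agent kept talking longer than limit_ms."""
    late = [t for t in trials if t.stop_latency_ms > limit_ms]
    return {
        "trials": len(trials),
        "late": len(late),
        "worst_ms": max(t.stop_latency_ms for t in trials),
        "late_conditions": sorted({t.condition for t in late}),
    }


# Hypothetical results; a real run would have 50 or more trials.
trials = [
    BargeInTrial("quiet/en", 1000, 1120),
    BargeInTrial("noisy/en", 2000, 2260),
    BargeInTrial("quiet/es", 1500, 1690),
]
report = summarize(trials)
print(report)
```

Grouping late trials by condition shows whether the problem is noise, language, or the cancellation path itself.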

Current State of Real-Time Speech AI

The 2026 stack splits into two camps: native speech-to-speech models and modular pipelines.

Native speech-to-speech systems reduce handoff cost and tend to handle interruption better. OpenAI's May 2026 Realtime API updates introduced the current GPT-Realtime generation with unified speech capabilities (OpenAI), while Google's Live API supports real-time conversational audio flows (Google Gemini Live API).

xAI also shipped Grok Voice Think Fast 1.0 via API in April 2026 (xAI).

The modular pipeline still wins when teams need control. You can swap ASR by language, choose a lower-latency LLM for the main loop, route hard tool calls to a reasoning model, and use a different TTS voice for regulated payloads.

Architecture	Best fit	Main advantage	Main risk
Native speech-to-speech	Full-duplex agents, interruptions, fast prototypes	Fewer handoffs and better barge-in	Less control over ASR, tools, and TTS behavior
Modular ASR → LLM → TTS	Contact centers, regulated flows, multilingual routing	Component-level optimization and observability	More handoff latency and cancellation complexity
Hybrid routing	Production systems with varied call types	Fast path for simple turns, controlled path for risky turns	Requires strong orchestration and eval coverage

For most teams, the best 2026 default is boring: use LiveKit Agents for managed cloud plus SIP, or Pipecat for self-hosted and edge-heavy deployments. Both have enough ecosystem depth to keep you out of bespoke WebRTC maintenance.

The ASR Benchmark That Matters Is Your Caller Mix

An ASR benchmark is only useful if it resembles your traffic. English WER on clean audio tells you little about a bilingual caller in a noisy kitchen giving a policy number.

Deepgram's Nova-3 is positioned around low-latency streaming speech-to-text and real-time agent use cases (Deepgram). AssemblyAI's Universal-3 Pro streaming is cited in the research report as a sub-200 ms model with promptable speech behavior. ElevenLabs Scribe v2 Realtime targets multilingual real-time transcription and sits in the same production class (ElevenLabs models).

For multilingual voice agents, the June 2026 shift is code-switching. ServiceNow-AI's AU-Harness evaluates bilingual speech across Spanish-English, French-English, French-Canadian-English, and German-English using WER, semantic WER, and accent error rate, as described in the ServiceNow-AI Hugging Face materials (Hugging Face).

Google's Chirp 3 is the obvious candidate when broad language coverage matters (Google Cloud Speech-to-Text). NVIDIA's Canary and Parakeet families matter for teams that need open models and on-prem control, especially across European languages (NVIDIA NeMo ASR).

The mistake is treating multilingual support as a checkbox. The right test set includes accents, code-switching, phone audio, interruptions, named entities, and the exact alphanumeric strings your customers say.
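A sliced evaluation can be as simple as computing word error rate per traffic category. The sketch below implements standard word-level edit distance; the slice names and the three sample utterances are made up, and real evals would also want semantic or entity-level scoring, which this does not attempt.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance; returns (errors, reference word count)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_slice(samples):
    """samples: iterable of (slice_name, reference, hypothesis) tuples."""
    totals = {}
    for slice_name, reference, hypothesis in samples:
        errors, words = word_errors(reference, hypothesis)
        e, w = totals.get(slice_name, (0, 0))
        totals[slice_name] = (e + errors, w + words)
    return {name: e / w for name, (e, w) in totals.items() if w}


# Hypothetical transcripts; replace with labeled calls from your own traffic.
samples = [
    ("clean_english", "i want to check my balance", "i want to check my balance"),
    ("code_switched", "quiero pagar my bill today", "quiero pagar my build today"),
    ("alphanumeric", "policy number a b seven four", "policy number a b seventy four"),
]
scores = wer_by_slice(samples)
print(scores)
```

Reporting one number per slice keeps a strong clean-English score from hiding a weak code-switched or alphanumeric one.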

The 800 ms Wall Is a UX Problem First

A voice agent UX breaks before the latency dashboard turns red. Users repeat themselves, start speaking over the system, or switch to shorter phrases because they no longer trust the interaction rhythm.

That means latency design belongs in product specs. The spec should define when the agent backchannels, when it waits, when it confirms, when it interrupts itself, and when it escalates to a person.

For example, a support agent can say "I’m checking that" quickly while a tool call runs. That buys perceived responsiveness without pretending the answer is ready. But the pattern only works if the system has already captured the user's intent and won't use filler to mask confusion.
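The holding-utterance pattern can be sketched with asyncio as follows. `respond_with_holding`, `say`, and `slow_lookup` are hypothetical names, and the budget is shortened so the demo runs quickly. The sketch assumes intent has already been captured before this function is called.

```python
import asyncio


async def respond_with_holding(tool_call, say, budget_s=0.5,
                               holding="I'm checking that."):
    """Speak a short holding phrase if the tool call exceeds its latency budget."""
    task = asyncio.ensure_future(tool_call)
    # asyncio.wait with a timeout does not cancel the task, unlike wait_for.
    done, _pending = await asyncio.wait({task}, timeout=budget_s)
    if not done:
        await say(holding)
    await say(await task)


async def demo():
    spoken = []

    async def say(text):
        spoken.append(text)  # stand-in for sending text to TTS

    async def slow_lookup():
        await asyncio.sleep(0.1)  # simulated slow CRM lookup
        return "Your order shipped yesterday."

    await respond_with_holding(slow_lookup(), say, budget_s=0.02)
    return spoken


spoken = asyncio.run(demo())
print(spoken)
```

The tool call keeps running while the holding phrase plays, so the user hears something quickly without the lookup being restarted.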

The same rule applies to confirmations. If ASR confidence is low for an address, account number, medication, or legal name, the agent should slow down and confirm. A faster wrong answer creates more handle time than a slightly slower repair turn.
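A confirmation policy along these lines could be expressed as a small decision function. The field names and the 0.85 threshold are assumptions for illustration; the right threshold depends on how your ASR vendor calibrates confidence scores.

```python
# Fields where a wrong value is costly enough to justify a repair turn.
HIGH_RISK_FIELDS = {"address", "account_number", "medication", "legal_name"}


def next_action(field, asr_confidence, threshold=0.85):
    """Return 'confirm' when a high-risk field was heard with low confidence."""
    if field in HIGH_RISK_FIELDS and asr_confidence < threshold:
        return "confirm"
    return "proceed"


print(next_action("account_number", 0.62))  # low confidence on a risky field
print(next_action("greeting", 0.62))        # low confidence, but low stakes
```

Keeping the rule explicit makes it easy to review with compliance or support teams and to tune per field.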

Where Lower Latency Can Make the Product Worse

Healthcare is the cleanest example. The research report notes that streaming TTS can lose important context on alphanumeric and clinical payloads such as drug names, medical record numbers, and prescription codes. In that situation, batch-mode TTS for risky phrases is a better product decision even with a 200-500 ms penalty.
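One possible way to route risky phrases to batch synthesis is a cheap pre-check on each sentence before it reaches TTS. The regular expression and the two-entry drug list below are placeholders; a production system would use a proper clinical lexicon and entity tagger.

```python
import re

# Tokens of 4+ characters containing at least one digit, e.g. MRN-84213 or RX4471.
ID_PATTERN = re.compile(r"\b(?=[A-Za-z-]*\d)[A-Za-z0-9-]{4,}\b")

# Placeholder lexicon; swap in a real drug-name list for clinical use.
DRUG_NAMES = {"metformin", "lisinopril"}


def tts_mode(sentence):
    """Choose 'batch' synthesis for sentences carrying IDs or drug names."""
    lowered = sentence.lower()
    if any(name in lowered for name in DRUG_NAMES):
        return "batch"
    if ID_PATTERN.search(sentence):
        return "batch"
    return "streaming"


print(tts_mode("Your record number is MRN-84213."))
print(tts_mode("Thanks for calling, how can I help?"))
```

Only the risky sentences pay the batch penalty, so ordinary conversational turns keep their streaming latency.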

Financial services has a different constraint. A voice agent may need a recording disclosure, opt-out path, and audit trail before the useful conversation starts. The relevant metric becomes time-to-compliant-state, not only turn latency.

Mental health and emotional-support products have another tradeoff. The American Psychological Association's November 2025 advisory warned against substituting AI for human therapists in crisis intervention, and that should shape voice design. Prosody, silence pacing, and escalation behavior matter more than shaving another 100 ms.

The general rule: optimize latency until the interaction feels responsive, then spend the next engineering cycle on accuracy, repair, compliance, and task success.

A Practical Decision Framework for AI Voice Agents

Choose the stack by failure mode, not by leaderboard rank.

Your deployment	Prefer	Why
English-heavy support calls	Deepgram or AssemblyAI ASR, low-TTFT LLM, fast streaming TTS	Best cost-latency balance
Bilingual or code-switched calls	ElevenLabs Scribe, Google Chirp 3, or dedicated multilingual routing	English-only WER will mislead you
Regulated healthcare	Private LLM path, batch TTS for risky payloads	Accuracy and trust boundary dominate
High-interruption flows	Native speech-to-speech or carefully tuned full-duplex pipeline	Barge-in quality drives perceived intelligence
SIP/contact center replacement	LiveKit Agents or vendor-native contact-center stack	Telephony, routing, and observability are first-order requirements
Edge or on-prem	Pipecat plus open ASR/TTS options	Control over data, region, and deployment topology

The one-sentence takeaway for your team Slack: voice latency below 800 ms gets you into the conversation, but turn-taking and recovery determine whether users stay there.

Implementation Checklist

Use this checklist before calling a voice agent production-ready.

Instrument VAD, ASR, LLM TTFT, tool calls, TTS first-byte, TTFA, and barge-in latency as separate spans.
Set promotion gates at P95 TTFA under 800 ms, P95 end-to-end under 1,000 ms, and barge-in stop under 200 ms.
Replace silence thresholds with LiveKit Turn Detector, TEN VAD, Silero VAD, or an equivalent end-of-turn model.
Run ASR evals on your own calls, with noise, accents, code-switching, names, IDs, and domain terms.
Disable reasoning mode in the main conversational loop; route hard reasoning to a parallel branch or delayed tool path.
Time-box synchronous tools to 500 ms and return a useful holding utterance when the answer needs longer.
Use streaming TTS by default, then switch to safer synthesis for numbers, drug names, IDs, and proper nouns.
Test interruption as a first-class scenario, including user speech while TTS is mid-word.
Colocate ASR, LLM, and TTS services in the same region where possible.
Gate every release with at least a 1,000-call synthetic or shadow A/B evaluation before broad rollout.
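The promotion gates in the checklist can be encoded as an automated release check, sketched below. The metric names are hypothetical, and the checklist does not say which percentile applies to barge-in stop, so the sketch simply takes whatever value you report for it.

```python
# Limits taken from the checklist above; values must be strictly under the limit.
GATES_MS = {
    "p95_ttfa_ms": 800,
    "p95_end_to_end_ms": 1000,
    "barge_in_stop_ms": 200,
}
MIN_EVAL_CALLS = 1000


def release_blockers(metrics, eval_calls):
    """Return a list of human-readable reasons the release should not ship."""
    blockers = [
        f"{name}={metrics[name]} is not under {limit}"
        for name, limit in GATES_MS.items()
        if metrics[name] >= limit
    ]
    if eval_calls < MIN_EVAL_CALLS:
        blockers.append(f"only {eval_calls} eval calls, need {MIN_EVAL_CALLS}")
    return blockers


# Hypothetical metrics from a shadow evaluation run.
candidate = {"p95_ttfa_ms": 720, "p95_end_to_end_ms": 950, "barge_in_stop_ms": 180}
print(release_blockers(candidate, eval_calls=1200))
```

An empty list means every gate passed; anything else gives the team a concrete reason to hold the release.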

What This Means for You

If you're building AI voice agents in 2026, don't start with a model bake-off. Start with the call trace.

Find the largest waits in real traffic. Then decide whether the fix is faster inference, better endpointing, parallel tool calls, a different ASR path, or a UX repair turn.

Voice agent latency is no longer a single vendor claim. It's the product surface of real-time speech AI, and users will judge every missed interruption, awkward pause, and mangled code-switch as part of the same experience.
