Economics Of Ai Coding Agents

Voice AI Under 500ms: Latency Architecture for Agents

Sub-500ms round trips are the line between a voice agent people prefer and one they hang up on; here's the architecture that gets you there.

By June 27, 202612 min read
voice AI latencyAI voice agent architecturevoice AI telephony integration
Voice AI Under 500ms: Latency Architecture for Agents

A voice agent that takes 700ms to respond doesn't feel slow. It feels broken. Users hang up, reroute to a human, and never come back. Yet most of the $15B-odd voice AI market is shipping systems that sit in exactly that range, because the engineering literature on hitting sub-500ms round trips is thin and scattered across vendor docs.

Voice AI latency is the single number that decides whether an agent feels like conversation or like autocomplete with a phone number. Hitting sub-500ms end-to-end is a solved engineering problem as of mid-2026, but only if you co-optimize all four pipeline stages: ASR, LLM inference, TTS, and the telephony layer connecting them.

Isolated wins in any one stage do not get you there.

TL;DR

Sub-500ms round-trip latency is achievable today with streaming-native ASR (OpenAI's GPT-Realtime-Whisper, shipped May 2026), GPU-optimized LLM serving with speculative decoding, and sub-100ms TTS from Cartesia or ElevenLabs. The biggest wins are architectural: start TTS while the LLM is still streaming, use VAD-triggered chunking instead of fixed buffers, and co-locate every stage in one region.

The 700ms "human tolerance" figure people cite is a misreading of turn-taking research, and it is not a target you should build to.

Key takeaways

  • Target 450ms with 50ms headroom; allocate roughly 40/100/150/120/40ms across transport, ASR, LLM TTFT, TTS TTFA, and buffer.
  • Use streaming-native ASR. Chunked Whisper is a latency trap; a March 2026 benchmark showed naive 2-second chunking produced p50 7.20s latency versus 3.03s offline.
  • Start TTS the moment the LLM emits substantive tokens. This single change can remove 200-300ms of perceived latency.
  • VAD-triggered processing (Silero, 1.8MB, 23x real-time) filters 50-70% of silence and cuts downstream ASR compute 30-60%.
  • Verify any ROI figure against primary sources. The widely cited "$15.12B market / 45% call deflection / $3.50 ROI per dollar" cluster has no traceable attribution.

What is voice AI latency, and why 500ms?

Voice AI latency is the end-to-end round trip from the moment a user stops speaking to the moment the agent's first audio byte reaches their ear. It is the sum of four stages: speech recognition (ASR), language model inference (LLM), speech synthesis (TTS), and the telephony transport that carries audio both ways.

Five hundred milliseconds is the threshold where a system stops feeling like it is "thinking" and starts feeling like it is responding. Above 300-400ms, users report the system pausing before each reply.

Above 500ms, conversation flow breaks: backchannels ("uh-huh," "right") arrive too late, interruptions feel ignored, and the dialogue collapses into rigid turn-taking. HCI research consistently shows response latency correlates with perceived competence, independent of actual capability.

The 700ms myth, and why it misleads builders

A common counterargument holds that humans tolerate up to 700ms of response latency, citing Sacks, Schegloff, and Jefferson's classic work on conversation turn-taking. That citation is being used wrong.

The 700ms figure describes the acceptable gap between one speaker's completed turn and the next speaker's turn initiation. It is a measure of conversational rhythm, not of system response time after a direct question.

It also represents an upper bound for patient users in low-stress dialogue, not a design target for a system competing with a human receptionist.

Production implementations from OpenAI's Advanced Voice Mode, LiveKit's Agents framework, and vendors like Retell and Airi all target sub-500ms or lower as a competitive baseline. Build to 500ms, not 700ms.

The 2026 voice AI stack, stage by stage

ASR: stop using chunked Whisper for streaming

The Whisper family still dominates open-source ASR, but the production-relevant lineage has split. Whisper Large v3 (November 2023) added Cantonese and 100-language support. Distil-Whisper cut parameters 49% for roughly 6x faster inference within 1% WER.

Whisper Large v3 Turbo (October 2024 DevDay) dropped decoder layers from 32 to 4 and hit 216x real-time on Groq LPUs.

The problem: all of these process audio in 30-second chunks. That is incompatible with real-time streaming, and adapting them with naive chunking is a known trap. A March 2026 benchmark on NVIDIA L4 GPUs found Whisper with 10-second chunks caused a 3.5% absolute WER regression versus optimal chunk configurations.

A separate gist benchmark showed naive 2-second chunking of Faster-Whisper small on Apple M2 produced p50 7.20s latency, worse than offline processing at p50 3.03s.

The fix is streaming-native models. OpenAI's GPT-Realtime-Whisper, launched May 7, 2026, processes audio token-by-token at $0.017/minute and reports roughly 90% fewer hallucinations than original Whisper in OpenAI's internal noise robustness tests.

For on-device work, WhisperKit (ICML 2025) reports 0.46s end-to-end with 2.2% WER on Apple Silicon, beating OpenAI's gpt-4o-transcribe and Deepgram nova-3 in its benchmark. For self-hosted batch-adjacent workloads, Faster-Whisper with CTranslate2 int8 quantization delivers roughly 4x speedup over the reference implementation at matching accuracy, needing ~2.9 GB VRAM.

LLM: time-to-first-token is the only number that matters

For voice, total generation time is irrelevant. The user never waits for the full response; they wait for the first audio byte, which depends on the LLM's time-to-first-token (TTFT).

Well-optimized GPU serving of 7B-13B models hits 150-250ms TTFT. Larger 70B+ models land at 250-400ms unless you put them on streaming-optimized hardware. Three techniques compound:

For serving, vLLM handles decoder-only LLMs with dynamic batching, TGI gives Hugging Face compatibility, and NVIDIA Triton fronts multiple models behind one gateway. Groq LPUs are purpose-built for streaming workloads and consistently lead TTFT benchmarks.

TTS: chase time-to-first-audio, not total synthesis

TTS latency is measured as time-to-first-audio (TTFA): how quickly sound starts after the LLM emits text. Reported model inference times are misleading on their own because end-to-end TTFA in production includes network transport and codec work and typically lands 150-300ms.

The current low-latency field, as of mid-2026:

Provider Reported model TTFA Notes
Cartesia Sonic ~40ms Independent benchmarks measured higher in production
ElevenLabs Flash v2.5 ~75ms 32 languages, 40k char limit
Kokoro (Apache 2.0) ~100ms on GPU Open-source option
OpenAI TTS-1-HD Higher Not suited for latency-sensitive paths

The architectural move that matters is chunked streaming synthesis: begin playing audio incrementally as it is generated, chunking at sentence boundaries to preserve prosody. Deepgram's TTS WebSocket docs describe this pattern explicitly. Persistent WebSocket connections cut per-request overhead measurably in production.

Telephony: pick WebRTC over PSTN bridges

The telephony layer is where many teams quietly lose 50-100ms they never budgeted for. WebRTC transport adds 20-80ms depending on geography; PSTN integration through Twilio adds another 20-50ms of codec and transport overhead on top.

LiveKit Agents SDK 1.6.4 (released June 24, 2026) is the current reference for WebRTC-native voice agents, with plugins for OpenAI Realtime API, ElevenLabs TTS, and adaptive interruption handling. Twilio's Voice SDK handles PSTN/VoIP with WebSocket media streams but requires custom ASR/LLM/TTS wiring and carries extra overhead. Vapi (raised $50M Series B in May 2025 at ~$500M valuation) and Retell trade control for pre-built integrations.

The sub-500ms latency budget

For a 450ms target with 50ms headroom, allocate like this:

Component Target Acceptable range
Audio transport (one-way) 40ms 20-80ms
ASR processing 100ms 80-150ms
LLM first token 150ms 120-200ms
TTS first audio 120ms 80-180ms
Buffer/overhead 40ms 20-60ms
Total 450ms 400-500ms

This budget only holds under a streaming architecture where TTS begins the instant the LLM produces substantive output. Run the stages sequentially and you will land at 700ms+ no matter how fast each individual component is.

Sub-500ms voice agent latency budget (target allocation)Audio transport40msASR100msLLM first token150msTTS first audio120msBuffer/overhead40ms
Sub-500ms voice agent latency budget (target allocation)

Where the biggest wins come from

Five optimizations, in rough order of impact:

  1. Parallel ASR-TTS initiation. Start TTS the moment the LLM emits substantive tokens. This can remove 200-300ms of effective end-to-end latency.
  2. VAD-triggered chunking. Replace timer-based audio buffers with voice-activity detection. Silero VAD is the de-facto standard: 1.8MB, processes 32ms chunks at 23x real-time, used by Airi, WhisperX, and Moonshine. It filters 50-70% of silence and cuts downstream ASR compute 30-60%.
  3. Streaming-native ASR. Token-by-token models (GPT-Realtime-Whisper, production RNN-T/Conformer) hit sub-200ms first-token latency versus 500ms+ for chunked Whisper.
  4. Optimized LLM serving. Streaming-optimized hardware (Groq LPUs, TRT-LLM GPU instances) cuts TTFT 50-70% versus naive deployment.
  5. Progressive TTS playback. Chunked synthesis reduces perceived latency even when total processing time is unchanged.

How do you handle barge-in and interruptions?

Barge-in, the ability for a user to interrupt the AI mid-sentence, is what separates a voice agent from a voicemail system. It requires sub-100ms VAD response to user speech onset, immediate audio capture routing, and clean cancellation of pending TTS without audible artifacts.

Three patterns work:

  • Full duplex with ducking. Both streams run simultaneously; the system lowers AI volume when user speech is detected. Most natural, most complex.
  • Kill-and-switch. On user speech, stop AI audio immediately and process the interruption. Simpler, can feel abrupt.
  • Soft handover. Capture user speech while AI audio fades out over a few hundred milliseconds. Natural, requires careful mixing.

Production systems typically require 150-300ms of continuous user speech before treating input as an intentional interruption rather than brief overlap. Tune VAD sensitivity per environment: too sensitive and background noise triggers false interrupts; too insensitive and real interrupts go undetected.

The LLM must also be notified of the interruption so it can adjust conversation state, either ignoring or acknowledging it based on context.

Quality-latency tradeoffs you will actually face

Every stage has a real tradeoff. Naming it helps you tune deliberately instead of accidentally.

ASR. Smaller models are faster but lose WER on accents, domain terms, and noise. Streaming ASR typically runs 5-15% higher WER than batch on the same audio because it lacks future-token context. GPT-Realtime-Whisper's reported hallucination reduction helps, but at higher compute cost.

LLM. Quantization (INT8, INT4) shrinks memory bandwidth and lifts throughput but can degrade complex reasoning. Larger context windows maintain multi-turn coherence at the cost of per-token compute. Voice apps rarely need extreme context; budget enough for coherent history and stop there.

TTS. Premium voices with emotional expressiveness add 50-100ms to TTFA. Customer-service bots may tolerate slightly robotic output; healthcare and accessibility applications usually do not. ElevenLabs' Turbo v2.5 is the documented balance point for latency-sensitive work.

What this means for you: a deployment checklist

Before you ship an AI receptionist or any voice agent to production, run through this:

  • Pick streaming-native ASR. GPT-Realtime-Whisper for cloud, WhisperKit for on-device. Do not adapt chunked Whisper and hope.
  • Co-locate ASR, LLM, and TTS in one region. Cross-region hops between stages are invisible in component benchmarks and fatal in production.
  • Instrument every stage. Track per-component latency, set degradation alerts, and measure end-to-end conversation latency including user speech duration. Component metrics miss the optimizations that matter.
  • Wire TTS to start on first substantive LLM token. This is the single highest-leverage architectural decision in the whole stack.
  • Use VAD-triggered processing, not fixed buffers. Silero is the default for a reason.
  • Plan for network degradation. Adaptive bitrate, dynamic jitter buffers, codec fallback to G.711, and graceful partial-response strategies. A voice agent that dies on flaky cellular is a voice agent nobody uses.
  • Validate ROI against primary sources. Gartner projects 80% of common customer service issues handled autonomously by 2029 with 30% operational cost reduction. IBM research cites 30% cost reduction from conversational AI. Juniper's 2026 report projects 8.2 billion automated banking interactions generating $11B in annual savings. The widely circulated "$15.12B market / 45% deflection / $3.50 ROI per dollar" figures could not be traced to any primary source; do not put them in a board deck.

Sub-500ms is not a stretch goal. It is the price of entry for voice agents people prefer over the alternative. The stack to hit it exists today; the work is in the orchestration.

Sources

Frequently asked questions

What is the latency budget for a sub-500ms voice AI agent?

A reasonable 450ms target allocates roughly 40ms to audio transport, 100ms to ASR, 150ms to LLM time-to-first-token, 120ms to TTS time-to-first-audio, and 40ms of buffer overhead. The budget only holds if TTS starts streaming as soon as the LLM emits substantive tokens.

Which ASR model is best for streaming voice agents in 2026?

Streaming-native models beat chunked Whisper for latency. OpenAI's GPT-Realtime-Whisper (May 2026) processes audio token-by-token at $0.017/minute with far fewer hallucinations. For on-device work, WhisperKit reports 0.46s end-to-end with 2.2% WER on Apple Silicon.

How do you handle barge-in and interruptions in a voice agent?

Use VAD (Silero is the de-facto standard) to detect user speech during AI output, then cancel pending TTS audio, notify the LLM of the interruption, and route the user's audio back through ASR. Production systems typically require 150-300ms of continuous user speech before treating it as an intentional interruption.

Is the 700ms human conversation tolerance figure a valid latency target?

No. The 700ms figure comes from research on turn-taking gaps, not response latency, and it describes an upper bound for patient users. Production voice AI from OpenAI, LiveKit, and Retell consistently targets sub-500ms because perceived agency, conversation flow, and trust all degrade above 400ms.

What's the biggest single latency win in a voice AI pipeline?

Starting TTS generation while the LLM is still streaming tokens, rather than waiting for the full response. Combined with VAD-triggered chunking instead of fixed-length buffers, this can cut effective end-to-end latency by 200-300ms.