Hamming.ai analyzed more than 4 million production voice agent calls and found the median end-to-end latency sits at 1.4 to 1.7 seconds, with the slowest 10% of calls running 3 to 5 seconds. Humans take conversational turns with a median gap closer to 200 milliseconds.
That gap, roughly seven to one at the median and far worse in the tail, is the difference between a colleague and a call center hold queue.
The strange part is that almost nobody measures it the same way. Ask three teams how they evaluate a voice agent and you get three answers: one quotes a TTS vendor's "75ms" number, another tracks transcription accuracy, a third runs vibes-based QA on call recordings. There is no shared scorecard.
This piece proposes one. A voice agent evaluation framework rests on four metrics that, together, tell you whether your agent is fast enough, heard the user correctly, sounded human, and stopped talking when interrupted: latency (headlined by TTFA, time to first audio), WER, MOS, and an interruption (barge-in) budget.
TL;DR
A complete voice agent evaluation methodology needs four numbers, not one. Track end-to-end latency with time-to-first-audio as the primary figure, word error rate for what the agent heard, mean opinion score for how it sounds, and an interruption budget for barge-in.
The industry median of 1.4-1.7s (per Hamming.ai) fails because human conversation runs near 250ms, and most of that latency hides in stages teams never instrument.
Key takeaways
- Human turn-taking gaps cluster around 200-300ms (Stivers et al., 2009, PNAS, and later re-analyses); voice agents that exceed ~800ms TTFA start to feel unnatural.
- TTFA is the metric that matters, because it measures the silence the user actually hears, not total response length.
- WER and MOS measure different failures: WER catches mishearing, MOS catches robotic or distorted output. You need both.
- Streaming TTS is the single biggest lever on TTFA, but the LLM stage often eats 60-70% of total latency (per Coval, cited by DILR.ai).
- A single "median latency" number is meaningless without p95/p90: Hamming's p90 hits 3.3-3.8s.
Why is voice agent latency the metric that breaks UX?
A voice agent feels human when it responds inside the rhythm humans use to take turns. That rhythm is well measured. The canonical cross-linguistic study, Stivers et al. (2009), analyzed ten languages across five continents and found an overall median response offset near +100ms, with most languages landing inside a +200ms band.
Later re-analyses of turn-taking corpora put the figure higher, around a 210ms median and 251ms mean (Turn-end Estimation, 2021) or a 236ms mean (Enfield, 2022).
The telecom world drew its line decades ago. ITU-T G.114 recommends a one-way transmission delay of 150ms or less for interactive voice. Past that, conversation degrades.
So the practitioner target is well established: 200-300ms feels natural, beyond ~500ms the conversational illusion starts to slip, and beyond 1 second you are in phone-tree territory.
Now compare that to what ships. DILR.ai's April 2026 measurement found an optimized in-house median of 680ms, down from roughly 1,200ms in 2024. Real deployments, under noise, accents, and barge-in, drift back into the 1.4-1.7s range Hamming reports.
The reconciliation matters: 680ms is the best case; 1.4-1.7s is what production looks like with the typical mix of conditions.
What is TTFA, and why is it the headline latency number?
Time-to-first-audio (TTFA) is the wall-clock delay from the moment a user stops speaking to the first playable sample of the agent's reply. It is the silence the user hears, and it is the number that should anchor any conversational latency benchmark.
This is the consensus definition across vendors. ElevenLabs, Hamming, Sierra, and Gradium all define TTFA as user-finish to agent-starts-speaking. Gradium adds a useful warning: measure the first playable audio sample, not the first streaming container or HTTP header, or you will flatter your numbers.
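Gradium's warning is easy to see in code. The following is a minimal sketch, assuming your pipeline can log three timestamps per turn from a monotonic clock; the event names (`user_speech_end`, `first_header`, `first_audio_sample`) are hypothetical and do not come from any vendor SDK.

```python
from dataclasses import dataclass


@dataclass
class TurnEvents:
    """Monotonic timestamps (seconds) for one turn. Field names are hypothetical."""
    user_speech_end: float      # the user actually stopped talking
    first_header: float         # first HTTP/container bytes from TTS (not audible)
    first_audio_sample: float   # first sample handed to the playback device


def ttfa_ms(turn: TurnEvents) -> float:
    """TTFA: user stops speaking -> first playable audio sample."""
    return (turn.first_audio_sample - turn.user_speech_end) * 1000.0


def header_based_ms(turn: TurnEvents) -> float:
    """The flattering variant: stops the clock at the first header."""
    return (turn.first_header - turn.user_speech_end) * 1000.0


# Illustrative numbers only, not measurements.
turn = TurnEvents(user_speech_end=10.000, first_header=10.620, first_audio_sample=10.810)
print(f"TTFA (first playable sample): {ttfa_ms(turn):.0f} ms")
print(f"Header-based (flattering):    {header_based_ms(turn):.0f} ms")
```

In this made-up turn the header-based clock would understate the silence by about 190ms, which is the kind of gap that makes two teams' "TTFA" numbers incomparable.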
TTFA is not one thing. It decomposes into a pipeline of tunable stages, and the canonical breakdown is stable across Twilio, Parloa, and the OpenAI Realtime guide.
| Stage | Typical range (ms) | Main tunable knob |
|---|---|---|
| VAD / endpointing fires end-of-turn | 100-500 | silence_duration_ms, eager end-of-turn |
| STT partial-to-final | 200-600 | streaming vs. batched |
| LLM time-to-first-token | 200-800 | model size, streaming, prompt length |
| LLM first usable tokens for TTS | 100-500 | min-tokens-before-stream |
| TTS first audio chunk | 80-300 | streaming mode, chunk size |
| Network (each way) | 30-150 | WebRTC vs. SIP vs. WS |
| Playback / jitter buffer | 20-60 | buffer size |
Ranges aggregated from Twilio, Parloa, and OpenAI Realtime documentation; treat as directional, not vendor-certified.
The lesson from this table: the LLM stages, not the TTS stage, usually dominate. Coval's analysis (cited by DILR.ai) attributes 60-70% of total latency to LLM time-to-first-token. Teams obsess over shaving 20ms off TTS while a regional LLM endpoint quietly adds 800ms.
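To find out which stage dominates in your own stack, time each one. Here is a minimal sketch of per-stage instrumentation, assuming each stage can be wrapped in a context manager; the `StageTimer` class and the stage names are hypothetical, and the `time.sleep` calls stand in for real LLM and TTS calls.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations per pipeline stage for one turn."""

    def __init__(self):
        self.durations_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = (time.perf_counter() - start) * 1000.0
            # Accumulate so a stage entered twice is counted in full.
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed

    def shares(self):
        """Fraction of the summed stage time spent in each stage."""
        total = sum(self.durations_ms.values())
        if total == 0:
            return {}
        return {name: ms / total for name, ms in self.durations_ms.items()}


timer = StageTimer()
with timer.stage("llm_ttft"):
    time.sleep(0.05)   # stand-in for the real LLM call
with timer.stage("tts_first_chunk"):
    time.sleep(0.01)   # stand-in for the real TTS call

for name, share in sorted(timer.shares().items(), key=lambda kv: -kv[1]):
    print(f"{name}: {timer.durations_ms[name]:.0f} ms ({share:.0%})")
```

One caveat on the design: in a streaming pipeline the stages overlap, so the summed stage time can exceed the wall-clock TTFA. Treat the shares as a ranking of where to look first, not as an exact decomposition.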
How streaming TTS and endpointing actually move TTFA
Two levers give the most TTFA per unit of effort: streaming TTS at the last hop, and smarter endpointing at the first.
Streaming TTS delivers a large gain over non-streaming synthesis, because playback can start on the first chunk instead of waiting for the full utterance to render. The current generation of low-latency models, measured independently, comes in under 300ms for first audio.
| TTS model | TTFA (p50) | Source / date |
|---|---|---|
| Cartesia Sonic 2.5 | 65 ms | andrew.ooo, Apr 2026 |
| ElevenLabs Flash v3 | 75 ms | andrew.ooo, Apr 2026 |
| Cartesia Sonic-3 | 188 ms | Gradium/Coval, May 2026 |
| ElevenLabs Turbo v2.5 | 264 ms | Gradium/Coval, May 2026 |
| ElevenLabs Flash v2.5 | 288 ms | Gradium/Coval, May 2026 |
All figures are third-party measurements, not vendor-certified, and the numbers move with each release. Cartesia's latest is Sonic-3 (released Oct 28, 2025); ElevenLabs shipped Turbo v3 and Flash v3 in April 2026. Re-verify before citing.
Note the disagreement built into that table: andrew.ooo measures Sonic 2.5 at 65ms while Gradium/Coval measures the newer Sonic-3 at 188ms. Versioning and test harnesses differ. This is exactly why your evaluation framework should measure your own stack rather than trust a leaderboard.
On the endpointing side, the shift in 2025-2026 is from fixed-silence VAD to learned eager end-of-turn. Deepgram Flux predicts end-of-turn before a long silence elapses, running alongside Nova-3 STT, and Telnyx reports a meaningful median TTFA reduction from it. LiveKit's Smart-Turn plays the same role. On the OpenAI Realtime API, this is the turn_detection object, where silence_duration_ms is the single most consequential parameter you will tune.
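Tuning that parameter is mostly a matter of sweeping it and re-measuring TTFA. The sketch below builds a set of candidate `turn_detection` objects. The field names are the ones I understand the Realtime API's server VAD mode to accept, and the values are illustrative; where the object nests inside a `session.update` payload differs between API versions, so check the current reference before sending it. The `sweep_silence` helper is hypothetical.

```python
import copy
import json

# Assumed server-VAD settings; values are illustrative, not recommendations.
BASE_TURN_DETECTION = {
    "type": "server_vad",
    "threshold": 0.5,             # VAD activation threshold (0.0 to 1.0)
    "prefix_padding_ms": 300,     # audio kept from before speech was detected
    "silence_duration_ms": 500,   # silence required before end-of-turn fires
    "create_response": True,      # start a response automatically at end-of-turn
    "interrupt_response": True,   # cut assistant audio when the user speaks
}


def sweep_silence(base, candidates_ms):
    """Return one config per candidate silence window, leaving `base` untouched."""
    configs = []
    for ms in candidates_ms:
        cfg = copy.deepcopy(base)
        cfg["silence_duration_ms"] = ms
        configs.append(cfg)
    return configs


for cfg in sweep_silence(BASE_TURN_DETECTION, [200, 350, 500]):
    print(json.dumps(cfg))
```

Shorter windows cut TTFA directly but raise the odds of cutting off a user who pauses mid-sentence, so each candidate should be scored on both latency and false end-of-turn rate.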
WER and MOS: measuring what latency can't
Speed is necessary but not sufficient. A fast agent that mishears the user or sounds like a 2010 GPS is still a bad agent. Two metrics cover the rest.
Word error rate (WER) scores the ASR stage. The formula is fixed: WER = (S + D + I) / N, substitutions plus deletions plus insertions over reference word count, computed by minimum edit distance. The de facto standard implementation is the open-source jiwer library, which traces back to the NIST sclite lineage and Morris et al. (2004). Production English ASR typically lands below 10%; Speechmatics Ursa 2, for instance, reports 7.88% on the English Kincaid46 set. For languages without clear word boundaries, use character error rate (CER) instead.
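For intuition, here is a dependency-free sketch of the formula using word-level edit distance. It is not jiwer's implementation, and the lowercase-and-split normalization is a simplifying assumption; in production you would call `jiwer.wer(reference, hypothesis)` with a proper text normalization pipeline.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (S + D + I) / N via minimum edit distance over words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("a" -> "the") and one deletion ("at"): 2 errors / 7 words.
score = wer("book a table for two at seven", "book the table for two seven")
print(f"WER = {score:.3f}")
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is reported as a rate rather than an accuracy.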
Mean opinion score (MOS) scores perceived voice quality. It comes from ITU-T P.800 (1996), an Absolute Category Rating test where naive listeners rate audio on a 5-point scale (5 Excellent to 1 Bad). The modern, scalable variant is ITU-T P.808 (2018), a crowdsourcing methodology with a validated open-source implementation from Microsoft Research.
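Arithmetically, MOS is just the mean of those ratings, but a bare mean hides how few listeners stood behind it. The sketch below reports the mean with a 95% confidence interval using a normal approximation, which is my simplification and not something P.800 or P.808 prescribes; the ratings are made up.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Mean opinion score plus the half-width of a normal-approximation 95% CI."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ACR ratings must be integers from 1 to 5")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


ratings = [4, 5, 4, 3, 4, 5, 4, 4]   # made-up listener votes for one clip
mean, half_width = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")
```

With only eight votes the interval is wide, which is the practical argument for the crowdsourced P.808 approach: more listeners per clip, narrower intervals.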
One caveat that trips teams up: WER does not apply cleanly to unified speech-to-speech models like OpenAI's gpt-realtime, Google's Gemini Live, or xAI's Grok Voice. There is no separate transcript to score. For those, lean on MOS for output quality and task-success metrics for correctness, and drop WER.
The four-metric scorecard
| Metric | What it measures | Target / reference | Standard tool |
|---|---|---|---|
| TTFA (latency) | User-finish to first agent audio | <800ms feels conversational; ~250ms is human | Per-stage instrumentation |
| Interruption / barge-in | How fast the agent stops when the user speaks | Directional; track time-to-interrupt | Galileo |
| WER | ASR accuracy (cascaded stacks only) | <10% English; use CER otherwise | jiwer |
| MOS | Synthesized voice quality | 5-point ACR scale | ITU-T P.808 |
Barge-in is the fourth dimension and the least standardized. No first-party source publishes a clean "interrupt within X ms" threshold; the practical approach is to measure time-to-interrupt and set your own budget inside the overall ~600-800ms conversational envelope Twilio describes.
On the Realtime API, the interrupt_response flag controls whether assistant audio truncates when the user starts speaking.
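Measuring time-to-interrupt needs only two timestamps per barge-in. This is a minimal sketch, assuming you can log when the user started speaking over the agent and when agent audio actually stopped playing; the `BargeIn` type and the 300ms budget are hypothetical, since no first-party threshold exists.

```python
from dataclasses import dataclass


@dataclass
class BargeIn:
    """One interruption, timestamps in seconds from a monotonic clock."""
    user_speech_start: float   # user begins talking over the agent
    agent_audio_stop: float    # last agent sample leaves the speaker


def time_to_interrupt_ms(event: BargeIn) -> float:
    return (event.agent_audio_stop - event.user_speech_start) * 1000.0


def over_budget(events, budget_ms):
    """Return the interruptions where the agent kept talking past the budget."""
    return [e for e in events if time_to_interrupt_ms(e) > budget_ms]


BUDGET_MS = 300.0  # hypothetical budget; set your own from user testing
events = [BargeIn(5.0, 5.25), BargeIn(12.0, 12.5), BargeIn(30.0, 30.125)]

for e in events:
    print(f"time-to-interrupt: {time_to_interrupt_ms(e):.0f} ms")
print(f"{len(over_budget(events, BUDGET_MS))} of {len(events)} over budget")
```

As with TTFA, stop the clock at the audio the user actually hears, not at the moment the server sends a cancel event; buffered audio keeps playing after that.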
What the latest models tell you about the ceiling
The frontier is converging on sub-second TTFA from unified speech-to-speech architectures. xAI's Grok Voice Agent API (launched Dec 17, 2025, in partnership with LiveKit) claims a sub-700ms response and an average TTFA under 1 second by processing speech in and out within a single model, priced at a flat $0.05/minute. OpenAI's gpt-realtime reached GA in August 2025; Google's Gemini 2.5 Flash Native Audio shipped in December 2025; Amazon's Nova 2 Sonic went GA December 2, 2025.
The catch: none of these vendors publishes a verifiable p50/p95 TTFA in its first-party docs. Grok's <1s is a vendor claim that names Artificial Analysis as auditor, but I could not independently locate the corroborating figure.
Google's developer forums even document latency drift during long Gemini Live sessions. Treat every vendor number as a starting hypothesis for your own measurement.
What this means for you
Build the scorecard before you pick a model. Instrument all seven latency stages so you know whether your bottleneck is the LLM (it usually is) or the TTS (it usually isn't).
Always report p50 and p95 together. A 680ms median with a 3.5s tail is a worse product than a steady 900ms, and only the tail tells you that.
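The comparison is easy to reproduce with the standard library. The sketch below uses two synthetic latency distributions, invented to match the example in the text; the `p50_p95` helper is hypothetical.

```python
import statistics


def p50_p95(latencies_ms):
    """Return the (p50, p95) of a list of per-turn latencies in milliseconds."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile, 94 the 95th.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


# Synthetic data: 90% of turns at 680 ms with a 3.5 s tail, versus a steady 900 ms.
fast_median_long_tail = [680.0] * 90 + [3500.0] * 10
steady = [900.0] * 100

for name, data in [("long tail", fast_median_long_tail), ("steady", steady)]:
    p50, p95 = p50_p95(data)
    print(f"{name}: p50 = {p50:.0f} ms, p95 = {p95:.0f} ms")
```

Ranked on p50 alone, the long-tail agent wins; ranked on p95, one turn in ten leaves the user in silence for 3.5 seconds.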
Match the metrics to the architecture. Cascaded STT-LLM-TTS stacks get all four metrics. Unified speech-to-speech models get MOS, latency, and task success; WER does not fit.
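One way to keep that rule from being forgotten is to encode it next to your evaluation reports. This is a minimal sketch; the architecture labels, metric keys, and `missing_metrics` helper are all hypothetical names chosen for illustration.

```python
# Which metrics each architecture is expected to report, per the rule above.
REQUIRED_METRICS = {
    "cascaded": {"ttfa_ms", "barge_in_ms", "wer", "mos"},
    "speech_to_speech": {"ttfa_ms", "mos", "task_success"},
}


def missing_metrics(architecture, report):
    """Return the sorted metric names the report should contain but does not."""
    if architecture not in REQUIRED_METRICS:
        raise ValueError(f"unknown architecture: {architecture!r}")
    return sorted(REQUIRED_METRICS[architecture] - set(report))


# Made-up report for a unified model that has no task-success number yet.
report = {"ttfa_ms": 720.0, "mos": 4.1}
print(missing_metrics("speech_to_speech", report))
print(missing_metrics("cascaded", report))
```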
And re-verify the version-specific numbers every few weeks. Cartesia, ElevenLabs, and OpenAI are shipping new voice models on roughly monthly cadences, and a TTFA figure from last quarter is already stale.
What would change my mind
If a vendor published an audited, reproducible p50/p95 TTFA on a public harness, and independent benchmarks confirmed it, the case for rolling your own measurement would weaken. As of June 2026, no major voice vendor does this, which is precisely why the four-metric scorecard has to live in your own pipeline.
Sources
- Hamming.ai, Voice Agent Evaluation Metrics Guide
- DILR.ai, Voice Agent Latency & Quality Benchmarks
- Stivers et al. (2009), PNAS, Universals and cultural variation in turn-taking
- ITU-T G.114 (one-way transmission time)
- ITU-T P.808 summary (crowdsourced MOS)
- jiwer, WER reference implementation
- Twilio, Core Latency in AI Voice Agents
- Parloa, Speech Latency in Voice AI
- ElevenLabs, Understanding latency
- Gradium/Coval, TTS Latency Benchmark 2026
- OpenAI Realtime API guide
- Deepgram Flux, eager end-of-turn
- LiveKit Turn Detector (Smart-Turn)
- xAI, Grok Voice Agent API
