Voice Agent Evaluation: Latency, MOS, WER & TTFA

A reproducible four-metric scorecard for production voice agents, and why a 1.4s median latency quietly breaks human-like conversation.

June 18, 2026 · 11 min read

Hamming.ai analyzed more than 4 million production voice agent calls and found the median end-to-end latency sits at 1.4 to 1.7 seconds, with the slowest 10% of calls running 3 to 5 seconds. Humans take conversational turns with a median gap of roughly 200 to 300 milliseconds.

That gap, five to one or worse, is the difference between a colleague and a call center hold queue.

The strange part is that almost nobody measures it the same way. Ask three teams how they evaluate a voice agent and you get three answers: one quotes a TTS vendor's "75ms" number, another tracks transcription accuracy, a third runs vibes-based QA on call recordings. There is no shared scorecard.

This piece proposes one. A voice agent evaluation framework rests on four metrics that, together, tell you whether your agent is fast enough, hears correctly, sounds human, and yields when interrupted: latency (headlined by TTFA, time to first audio), WER, MOS, and an interruption (barge-in) budget.

TL;DR

A complete voice agent evaluation methodology needs four numbers, not one. Track end-to-end latency with time-to-first-audio as the primary figure, word error rate for what the agent heard, mean opinion score for how it sounds, and an interruption budget for barge-in.

The industry median of 1.4-1.7s (per Hamming.ai) fails because human conversation runs near 250ms, and most of that latency hides in stages teams never instrument.

Key takeaways

  • Human turn-taking medians cluster around 200-300ms (Stivers et al., 2009, PNAS); voice agents that exceed ~800ms TTFA start to feel unnatural.
  • TTFA is the metric that matters, because it measures the silence the user actually hears, not total response length.
  • WER and MOS measure different failures: WER catches mishearing, MOS catches robotic or distorted output. You need both.
  • Streaming TTS is the single biggest lever on TTFA, but the LLM stage often eats 60-70% of total latency (per Coval, cited by DILR.ai).
  • A single "median latency" number is meaningless without p95/p90: Hamming's p90 hits 3.3-3.8s.

Why is voice agent latency the metric that breaks UX?

A voice agent feels human when it responds inside the rhythm humans use to take turns. That rhythm is well measured. The canonical cross-linguistic study, Stivers et al. (2009), analyzed ten languages across five continents and found an overall median response offset near +100ms, with most languages landing inside a +200ms band.

Later re-analyses of turn-taking corpora put the figure higher, around a 210ms median and 251ms mean (Turn-end Estimation, 2021) or a 236ms mean (Enfield, 2022).

The telecom world drew its line decades ago. ITU-T G.114 recommends a one-way transmission delay of 150ms or less for interactive voice. Past that, conversation degrades.

So the practitioner target is well established: 200-300ms feels natural, beyond ~500ms the conversational illusion starts to slip, and beyond 1 second you are in phone-tree territory.

Now compare that to what ships. DILR.ai's April 2026 measurement found an optimized in-house median of 680ms, down from roughly 1,200ms in 2024. Real deployments, under noise, accents, and barge-in, drift back into the 1.4-1.7s range Hamming reports.

The reconciliation matters: 680ms is the best case; 1.4-1.7s is what production looks like with the typical mix of conditions.

Figure: Voice agent latency vs human baseline. Human turn-taking median: 250ms; optimized production (Apr 2026): 680ms; industry p50 (Hamming): 1,550ms; industry p90 (Hamming): 3,550ms.

What is TTFA, and why is it the headline latency number?

Time-to-first-audio (TTFA) is the wall-clock delay from the moment a user stops speaking to the first playable sample of the agent's reply. It is the silence the user hears, and it is the number that should anchor any conversational latency benchmark.

This is the consensus definition across vendors. ElevenLabs, Hamming, Sierra, and Gradium all define TTFA as user-finish to agent-starts-speaking. Gradium adds a useful warning: measure the first playable audio sample, not the first streaming container or HTTP header, or you will flatter your numbers.
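The definition above can be turned into a small measurement helper. This is a minimal sketch, assuming your pipeline already logs monotonic timestamps for each turn; the event names (`user_speech_end`, `first_response_byte`, `first_playable_sample`) and the sample values are hypothetical, not taken from any vendor SDK.

```python
from dataclasses import dataclass


@dataclass
class TurnTimestamps:
    """Monotonic-clock timestamps (seconds) for one conversational turn.

    Field names are hypothetical; map them to whatever your stack logs.
    """
    user_speech_end: float        # user stops speaking
    first_response_byte: float    # first header/container byte from TTS
    first_playable_sample: float  # first decoded sample reaches playback


def ttfa_ms(turn: TurnTimestamps) -> float:
    """TTFA: user-finish to first playable audio sample, in milliseconds."""
    return (turn.first_playable_sample - turn.user_speech_end) * 1000.0


def first_byte_ms(turn: TurnTimestamps) -> float:
    """The flattering number you get by stopping the clock at the first byte."""
    return (turn.first_response_byte - turn.user_speech_end) * 1000.0


# Illustrative values only.
turn = TurnTimestamps(user_speech_end=10.000,
                      first_response_byte=10.420,
                      first_playable_sample=10.730)
print(f"TTFA: {ttfa_ms(turn):.0f} ms")
print(f"First-byte latency: {first_byte_ms(turn):.0f} ms")
```

Logging both numbers side by side makes the gap Gradium warns about visible: the first-byte figure is always the smaller one, and it is not what the user hears.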

TTFA is not one thing. It decomposes into a pipeline of tunable stages, and the canonical breakdown is stable across Twilio, Parloa, and the OpenAI Realtime guide.

| Stage | Typical range (ms) | Main tunable knob |
| --- | --- | --- |
| VAD / endpointing fires end-of-turn | 100-500 | silence_duration_ms, eager end-of-turn |
| STT partial-to-final | 200-600 | streaming vs. batched |
| LLM time-to-first-token | 200-800 | model size, streaming, prompt length |
| LLM first usable tokens for TTS | 100-500 | min-tokens-before-stream |
| TTS first audio chunk | 80-300 | streaming mode, chunk size |
| Network (each way) | 30-150 | WebRTC vs. SIP vs. WS |
| Playback / jitter buffer | 20-60 | buffer size |

Ranges aggregated from Twilio, Parloa, and OpenAI Realtime documentation; treat as directional, not vendor-certified.
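To see what those ranges imply end to end, the sketch below simply adds them up. It rests on two assumptions that are not in the sources: the stages run strictly in sequence (a streaming pipeline overlaps some of them, so the real figure can be lower), and the network hop is paid twice, once in each direction. The stage keys are hypothetical labels for the table rows.

```python
# (low_ms, high_ms) per stage, copied from the stage table.
STAGES = {
    "vad_endpointing": (100, 500),
    "stt_partial_to_final": (200, 600),
    "llm_time_to_first_token": (200, 800),
    "llm_first_usable_tokens": (100, 500),
    "tts_first_audio_chunk": (80, 300),
    "network_each_way": (30, 150),
    "playback_jitter_buffer": (20, 60),
}

# Assumption: the network hop is paid twice (user -> agent, agent -> user).
HOPS = {"network_each_way": 2}


def budget_ms(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum best-case and worst-case latency across sequential stages."""
    low = sum(lo * HOPS.get(name, 1) for name, (lo, _) in stages.items())
    high = sum(hi * HOPS.get(name, 1) for name, (_, hi) in stages.items())
    return low, high


low, high = budget_ms(STAGES)
print(f"best case: {low} ms, worst case: {high} ms")
# prints "best case: 760 ms, worst case: 3060 ms"
```

Under these assumptions even the best case lands at 760ms, which is why no single stage can be ignored when chasing a sub-800ms TTFA.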

The lesson from this table: the LLM stages, not the TTS stage, usually dominate. Coval's analysis (cited by DILR.ai) attributes 60-70% of total latency to LLM time-to-first-token. Teams obsess over shaving 20ms off TTS while a regional LLM endpoint quietly adds 800ms.

How streaming TTS and endpointing actually move TTFA

Two levers buy the most TTFA reduction per unit of effort: streaming TTS at the last hop, and smarter endpointing at the first.

Streaming TTS is the bigger win: non-streaming synthesis has to render the whole utterance before playback can start. The current generation of low-latency models, measured independently, clusters under 300ms for first audio.

| TTS model | TTFA (p50) | Source / date |
| --- | --- | --- |
| Cartesia Sonic 2.5 | 65 ms | andrew.ooo, Apr 2026 |
| ElevenLabs Flash v3 | 75 ms | andrew.ooo, Apr 2026 |
| Cartesia Sonic-3 | 188 ms | Gradium/Coval, May 2026 |
| ElevenLabs Turbo v2.5 | 264 ms | Gradium/Coval, May 2026 |
| ElevenLabs Flash v2.5 | 288 ms | Gradium/Coval, May 2026 |

All figures are third-party measurements, not vendor-certified, and the numbers move with each release. Cartesia's latest is Sonic-3 (released Oct 28, 2025); ElevenLabs shipped Turbo v3 and Flash v3 in April 2026. Re-verify before citing.

Note the spread built into that table: andrew.ooo measures Sonic 2.5 at 65ms while the Gradium/Coval figures put Sonic-3 at 188ms. Model versions and test harnesses both differ, so the two numbers are not directly comparable. This is exactly why your evaluation framework should measure your own stack rather than trust a leaderboard.

On the endpointing side, the shift in 2025-2026 is from fixed-silence VAD to learned eager end-of-turn. Deepgram Flux predicts end-of-turn before a long silence elapses, running alongside Nova-3 STT, and Telnyx reports a meaningful median TTFA reduction from it. LiveKit's Smart-Turn plays the same role. On the OpenAI Realtime API, this is the turn_detection object, where silence_duration_ms is the single most consequential parameter you will tune.
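For orientation, here is roughly what tuning that parameter looks like as a session update payload. Treat it as a sketch: the field names inside `turn_detection` follow OpenAI's Realtime documentation as I understand it, the values are illustrative rather than recommended, and the exact nesting of the `session` object differs between API versions, so check the current reference before sending anything like this.

```python
import json

# Sketch of a Realtime API session update that tunes server-side VAD.
# Values are illustrative; the nesting under "session" is an assumption
# and varies between API versions.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,            # speech-probability cutoff
            "prefix_padding_ms": 300,    # audio kept before detected speech
            "silence_duration_ms": 400,  # silence required to end the turn
            "create_response": True,     # auto-respond when the turn ends
            "interrupt_response": True,  # let user speech cut off agent audio
        }
    },
}

payload = json.dumps(session_update)
print(payload)
```

Lower `silence_duration_ms` values cut TTFA directly but raise the odds of cutting the user off mid-sentence, so it is worth sweeping this value against both latency and false end-of-turn rate.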

WER and MOS: measuring what latency can't

Speed is necessary but not sufficient. A fast agent that mishears the user or sounds like a 2010 GPS is still a bad agent. Two metrics cover the rest.

Word error rate (WER) scores the ASR stage. The formula is fixed: WER = (S + D + I) / N, substitutions plus deletions plus insertions over reference word count, computed by minimum edit distance. The de facto standard implementation is the open-source jiwer library, which traces back to the NIST sclite lineage and Morris et al. (2004). Production English ASR typically lands at 0-10%; Speechmatics Ursa 2, for instance, reports 7.88% on the English Kincaid46 set. For languages without clear word boundaries, use character error rate (CER) instead.
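The formula is short enough to write out by hand. The sketch below is a plain-Python illustration of word-level edit distance, not a replacement for jiwer; the normalization (lowercasing and whitespace splitting) is a simplifying assumption, and real evaluations usually also strip punctuation and normalize numbers.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (S + D + I) / N via word-level edit distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # ref word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "i would like to book a table for two"
hypothesis = "i would like to look a table for you two"
print(f"WER: {wer(reference, hypothesis):.3f}")
# prints "WER: 0.222" (one substitution + one insertion over 9 reference words)
```

Because insertions count against a fixed reference length, WER can exceed 100% on badly hallucinated transcripts, which is worth remembering when you aggregate it.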

Mean opinion score (MOS) scores perceived voice quality. It comes from ITU-T P.800 (1996), an Absolute Category Rating test where naive listeners rate audio on a 5-point scale (5 Excellent to 1 Bad). The modern, scalable variant is ITU-T P.808 (2018), a crowdsourcing methodology with a validated open-source implementation from Microsoft Research.
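Once you have listener votes, the score itself is just a mean, but it should travel with an interval. A minimal sketch, assuming integer ACR votes on the 1-5 scale and a normal approximation for the confidence interval; the ratings below are made up for illustration, and this does not implement the rater-screening steps that P.808 describes.

```python
import math
import statistics


def mos_with_ci(ratings: list[int], z: float = 1.96) -> tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ACR ratings must be on the 1-5 scale")
    mean = statistics.fmean(ratings)
    # Normal approximation; more trustworthy with a few dozen ratings or more.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical listener votes for one synthesized clip.
ratings = [4, 5, 4, 3, 4, 4, 5, 3, 4, 4]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval keeps you from reading meaning into a 0.1 MOS difference between two voices that only ten people rated.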

One caveat that trips teams up: WER does not apply cleanly to unified speech-to-speech models like OpenAI's gpt-realtime, Google's Gemini Live, or xAI's Grok Voice. There is no separate transcript to score. For those, lean on MOS for output quality and task-success metrics for correctness, and drop WER.

The four-metric scorecard

| Metric | What it measures | Target / reference | Standard tool |
| --- | --- | --- | --- |
| TTFA (latency) | User-finish to first agent audio | <800ms feels conversational; ~250ms is human | Per-stage instrumentation |
| Interruption / barge-in | How fast the agent stops when the user speaks | Directional; track time-to-interrupt | Galileo |
| WER | ASR accuracy (cascaded stacks only) | 0-10% English; use CER otherwise | jiwer |
| MOS | Synthesized voice quality | 5-point ACR scale | ITU-T P.808 |

Barge-in is the fourth dimension and the least standardized. No first-party source publishes a clean "interrupt within X ms" threshold; the practical approach is to measure time-to-interrupt and set your own budget inside the overall ~600-800ms conversational envelope Twilio describes.

On the Realtime API, the interrupt_response flag controls whether assistant audio truncates when the user starts speaking.
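Since no standard threshold exists, the simplest useful measurement is the share of barge-ins that land inside a budget you pick yourself. The sketch below assumes you can log when the user starts talking over the agent and when agent audio actually stops playing; the field names, the sample events, and the 300ms budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BargeInEvent:
    """Timestamps in seconds; field names are hypothetical."""
    user_speech_start: float  # user begins talking over the agent
    agent_audio_stop: float   # last agent sample actually played


def time_to_interrupt_ms(event: BargeInEvent) -> float:
    """How long the agent kept talking after the user cut in."""
    return (event.agent_audio_stop - event.user_speech_start) * 1000.0


def within_budget(events: list[BargeInEvent], budget_ms: float) -> float:
    """Fraction of barge-ins where the agent went quiet inside the budget."""
    hits = sum(time_to_interrupt_ms(e) <= budget_ms for e in events)
    return hits / len(events)


# Illustrative events: roughly 180, 450, 210, and 260 ms to interrupt.
events = [
    BargeInEvent(12.00, 12.18),
    BargeInEvent(30.50, 30.95),
    BargeInEvent(47.20, 47.41),
    BargeInEvent(61.00, 61.26),
]
print(f"{within_budget(events, budget_ms=300):.0%} of barge-ins within 300 ms")
# prints "75% of barge-ins within 300 ms"
```

Measure the stop time at the playback device rather than at the server, for the same reason TTFA is measured at the first playable sample: buffered audio keeps playing after the server has stopped sending.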

What the latest models tell you about the ceiling

The frontier is converging on sub-second TTFA from unified speech-to-speech architectures. xAI's Grok Voice Agent API (launched Dec 17, 2025, in partnership with LiveKit) claims a sub-700ms response and an average TTFA under 1 second by processing speech in and out within a single model, priced at a flat $0.05/minute. OpenAI's gpt-realtime reached GA in August 2025; Google's Gemini 2.5 Flash Native Audio shipped in December 2025; Amazon's Nova 2 Sonic went GA December 2, 2025.

The catch: none of these vendors publishes a verifiable p50/p95 TTFA in its first-party docs. Grok's <1s is a vendor claim that names Artificial Analysis as auditor, but I could not locate the corroborating figure independently.

Google's developer forums even document latency drift during long Gemini Live sessions. Treat every vendor number as a starting hypothesis for your own measurement.

What this means for you

Build the scorecard before you pick a model. Instrument all seven latency stages so you know whether your bottleneck is the LLM (it usually is) or the TTS (it usually isn't).

Always report p50 and p95 together. A 680ms median with a 3.5s tail is a worse product than a steady 900ms, and only the tail tells you that.
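A few lines of code make the point concrete. This sketch uses synthetic TTFA samples built to mirror the example above (a 680ms median with a 3.5s tail versus a steady 900ms); the data is invented, and the percentile method (`inclusive` interpolation from Python's `statistics` module) is one reasonable choice among several.

```python
import statistics


def p50_p95(samples_ms: list[float]) -> tuple[float, float]:
    """Median and 95th percentile of per-turn TTFA samples."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return statistics.median(samples_ms), cuts[94]  # cuts[94] is p95


# Synthetic data: agent A is fast but spiky, agent B is slower but steady.
agent_a = [680.0] * 90 + [3500.0] * 10
agent_b = [900.0] * 100

for name, samples in (("A", agent_a), ("B", agent_b)):
    p50, p95 = p50_p95(samples)
    print(f"agent {name}: p50={p50:.0f} ms, p95={p95:.0f} ms")
# prints "agent A: p50=680 ms, p95=3500 ms"
# prints "agent B: p50=900 ms, p95=900 ms"
```

Ranked by median alone, agent A wins; ranked by p95, one caller in ten waits 3.5 seconds, and agent B is the better product.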

Match the metrics to the architecture. Cascaded STT-LLM-TTS stacks get all four metrics. Unified speech-to-speech models get MOS, latency, and task success; WER does not fit.

And re-verify the version-specific numbers every few weeks. Cartesia, ElevenLabs, and OpenAI are shipping new voice models on roughly monthly cadences, and a TTFA figure from last quarter is already stale.

What would change my mind

If a vendor published an audited, reproducible p50/p95 TTFA on a public harness, and independent benchmarks confirmed it, the case for rolling your own measurement would weaken. As of June 2026, no major voice vendor does this, which is precisely why the four-metric scorecard has to live in your own pipeline.

Frequently asked questions

What metrics should you use to evaluate a voice agent?

Use four: end-to-end latency (with time-to-first-audio as the headline number), word error rate (WER) for the ASR stage, mean opinion score (MOS) for synthesized voice quality, and an interruption/barge-in measure. Latency and TTFA capture responsiveness, WER captures whether the agent heard you, and MOS captures whether the output sounds human.

What is a good latency for a voice agent?

Humans take turns in conversation with a median gap near 200-300ms (Stivers et al., 2009). Anything under roughly 800ms time-to-first-audio still feels conversational; past roughly 1 second the interaction starts to feel like a phone tree. Hamming.ai's January 2026 analysis of 4M+ production calls found a median of 1.4-1.7s, about five times the human baseline.

What is TTFA (time to first audio)?

Time-to-first-audio is the wall-clock delay from the moment a user stops speaking to the first playable sample of the agent's reply. It is the single most important latency number for perceived responsiveness because it measures the silence the user actually hears, not total response duration.

How is WER calculated for voice agents?

Word error rate is (substitutions + deletions + insertions) divided by the number of reference words, computed via minimum edit distance. The open-source jiwer library is the de facto standard. Production English ASR typically lands in the 0-10% range; use character error rate (CER) for languages without clear word boundaries.

Does MOS apply to unified speech-to-speech models?

MOS (ITU-T P.800, 1996) rates perceived audio quality on a 5-point scale and still applies to any synthesized output, including unified models like gpt-realtime or Grok Voice. WER, however, does not map cleanly onto speech-to-speech systems because there is no separate transcript stage to score.