Ai Frontiers 2026

Time-to-First-Token vs Word Error Rate: Picking an STT API in 2026

The fast-and-leaky vs slow-and-tight split between Deepgram and Gradium is now the production-defining buying decision for voice agents.

June 28, 2026 · 10 min read

Deepgram's Nova-3 returns its first transcribed token in roughly 150 milliseconds. Gradium's stt-translate API, which launched June 21, 2026, can take 300 to 500ms to do the same job.

That gap, measured in tenths of a second, now drives the single most consequential buying decision in voice AI infrastructure. Most buyer guides still floating around the web were written in 2023 and treat speech-to-text as a commodity accuracy contest.

The 2026 market is bifurcated along a different axis: fast-and-leaky versus slow-and-tight.

A real-time STT API benchmark in 2026 is really two benchmarks stitched together. Time-to-first-token (TTFT) measures the latency from when a user stops speaking to when the first transcribed word comes back.
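As a rough illustration of that definition, the sketch below times the gap between the end of speech and the first non-empty transcript token. The names `measure_ttft` and `fake_stream`, and the injectable `clock` parameter, are hypothetical and exist only for this example; a real client would iterate over a vendor's streaming response instead of the simulated generator.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_ttft(
    tokens: Iterable[str],
    speech_end: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return seconds between end of speech and the first non-empty token.

    `speech_end` must be a reading from the same clock as `clock`.
    Returns None if the stream never yields a non-empty token.
    """
    for token in tokens:
        if token.strip():
            return clock() - speech_end
    return None


def fake_stream(delay_s: float) -> Iterator[str]:
    """Stand-in for a vendor stream (assumption): waits, then yields words."""
    time.sleep(delay_s)
    yield "hello"
    yield "world"


if __name__ == "__main__":
    speech_end = time.perf_counter()
    ttft = measure_ttft(fake_stream(0.15), speech_end)
    print(f"TTFT: {ttft * 1000:.0f} ms")
```

Passing the clock in as a parameter keeps the measurement testable without real network calls. In a benchmark you would collect many such readings and report a percentile such as p50 rather than a single run.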

Word error rate (WER) measures how many words the transcript gets wrong. Lower is better on both. Almost no shipping API wins both at once, which is why the Deepgram vs Gradium comparison has become the cleanest illustration of the tradeoff practitioners actually face.
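For reference, WER is conventionally computed as substitutions plus deletions plus insertions, divided by the number of words in the reference transcript. The sketch below is a minimal word-level edit-distance implementation of that formula. The function name `word_error_rate` is chosen for this example, and lowercasing is the only text normalization applied, which is an assumption; published benchmarks usually normalize punctuation and numerals as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion: reference word missing
                    curr[j - 1] + 1,     # insertion: extra hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the quick brown fox"
    hyp = "the quick brown box"
    print(f"WER: {word_error_rate(ref, hyp):.2%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.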

TL;DR

  • Deepgram Nova-3 leads commercial cloud STT on TTFT (~150–250ms p50) but sits at 7–12% WER on mixed-condition audio.
  • Gradium, launched June 21, 2026 with a $70M seed, hits WER as low as 1.62% on controlled English benchmarks but runs 2–3x slower.
  • End-to-end voice models (GPT-Realtime-2, Gemini 3.5 Live, Grok Voice Agent) are compressing entry pricing toward $0.05/min but do not yet replace dedicated STT for compliance, on-prem, or transcript-fidelity workloads.
  • Pick TTFT when STT feeds an LLM. Pick WER when the transcript itself is the product.

Key takeaways

  • The 25.2% WER figure sometimes attributed to Deepgram is not supported by 2026 benchmarks; Nova-3 lands at 7–12% on mixed audio per independent testing.
  • Gradium's accuracy edge is real but narrow: 1.62% WER on Speechmatics English-only controlled testing, per the Coval STT benchmark dated May 4, 2026.
  • Cartesia's Ink-2 sits on the streaming Pareto frontier at 0.21s TTFT and 3.59% WER, per the Artificial Analysis AA-WER Streaming report (June 1, 2026).
  • At 50,000 minutes per day, STT cost is rarely the dominant line item; GPU compute for the downstream LLM is typically 40x larger.
  • Google's own developer community has confirmed Gemini Live's input_audio_transcription returns text that diverges from what the model internally processes.

What does the 2026 STT API field actually look like?

The seven leading real-time transcription APIs as of June 28, 2026 span a wide band on both axes. The table below consolidates vendor docs and the two most recent independent benchmarks.

Vendor     | Current model                 | TTFT (p50)  | WER (English)  | Cloud pricing                      | Deployment
-----------|-------------------------------|-------------|----------------|------------------------------------|------------------------
Deepgram   | Nova-3                        | ~0.15–0.25s | 7–12%          | $0.0043/min                        | Cloud, on-prem Docker
Gradium    | stt-translate v1              | ~0.3–0.5s   | 1.6–4%         | Credit-based (TBD)                 | Cloud API
AssemblyAI | Universal-3 Streaming         | ~0.2–0.4s   | 6–9%           | $0.016/min                         | Cloud, on-prem
OpenAI     | Whisper + GPT-Realtime        | 0.2–0.5s    | 4–8% (Whisper) | $0.006/min; Realtime $32–64/1M tok | Cloud, self-host Whisper
Google     | Cloud STT v2 (Gemini-powered) | 0.2–0.5s    | 5–10%          | Pay-per-second                     | Cloud, regional
AWS        | Transcribe                    | 0.3–0.6s    | 6–12%          | $0.024/min std; $0.015/min medical | Cloud only
Cartesia   | Ink-2                         | ~0.21s      | 3.59%          | TBD                                | Cloud API

Two things stand out. Deepgram's TTFT lead is consistent across sources, and Cartesia's Ink-2 is the only model currently placing near the top on both metrics at once, though its language coverage is narrower at 20+ languages. AssemblyAI's Universal-3 Streaming launched March 3, 2026 and narrowed the latency gap without matching Deepgram's floor.

Why does TTFT dominate for voice agents?

For a conversational voice agent, sub-500ms end-to-end latency is the threshold where users stop perceiving the system as laggy. STT is only the first hop in a chain that also includes LLM inference and TTS, so every millisecond spent in transcription is borrowed from the rest of the pipeline.
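The budget arithmetic behind that claim is simple enough to sketch. The helper below treats STT time-to-first-token as STT's entire share of a 500ms end-to-end target, which is a simplifying assumption, and reports what is left for LLM inference and TTS. The function name `remaining_budget_ms` is hypothetical; the 150ms and 500ms inputs are the TTFT figures quoted in this article for the fast and slow ends of the market.

```python
def remaining_budget_ms(stt_ttft_ms: float, target_ms: float = 500.0) -> float:
    """Milliseconds left for LLM inference and TTS after STT's first token."""
    return target_ms - stt_ttft_ms


if __name__ == "__main__":
    for label, ttft_ms in [("fast STT (150ms)", 150.0), ("slow STT (500ms)", 500.0)]:
        left = remaining_budget_ms(ttft_ms)
        print(f"{label}: {left:.0f} ms left for LLM + TTS")
```

At the slow end, transcription alone consumes the whole budget before the LLM has produced anything.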

Deepgram's ~150ms TTFT enables fast turn-taking detection. The agent recognizes the user has stopped speaking and hands off to the LLM immediately. A 200ms TTFT advantage compounds across thousands of daily calls in IVR and customer service deployments, where call handling time and CSAT move with perceived responsiveness.

The tradeoff is that Deepgram's WER sits at 7–12% on mixed-condition audio, per independent benchmarking from Scribie's production-audio study. For a voice agent, that is usually fine.

The STT output feeds an LLM, not a human reader, and the LLM recovers from minor transcription errors through context. An 8% WER at 180ms latency beats a 2% WER at 450ms latency for this architecture.

Deepgram's January 2026 $130M Series C at a $1.3B valuation, plus its acquisition of a YC AI startup, signals continued investment in real-time speech intelligence. On-prem Docker and Podman deployment is documented for enterprise customers with data sovereignty requirements.

When does WER dominate instead?

When the transcript is the product, accuracy stops being recoverable. A misheard medication name in clinical documentation can be a liability. A dropped qualifier in a court transcript can change legal meaning. In those workloads, latency is irrelevant and WER is the only metric that matters.

Gradium emerged from stealth in December 2025 with a $70M seed and launched commercial STT and speech-to-speech translation APIs on June 21, 2026. The company was founded by a former Google DeepMind researcher, and its positioning is accuracy-first.

The Coval STT benchmark dated May 4, 2026 measured Gradium at 1.62% WER on Speechmatics English-only controlled testing, competitive with or better than any commercial STT API currently shipping.

Figure: English WER by vendor (2026 benchmarks). Gradium (Coval, controlled) 1.62%; Cartesia Ink-2 (AA-WER) 3.59%; OpenAI Whisper 6%; AssemblyAI U-3 7.5%; Deepgram Nova-3 9.5%; AWS Transcribe 9%.

Gradium's accuracy advantage comes with a latency tax. Its models typically require 300–500ms TTFT, roughly 2–3x slower than Deepgram. That rules it out for real-time voice agents but makes it a strong fit for medical transcription, legal proceedings, compliance call recording, and archival media indexing.

The integrated stt-translate endpoint also collapses transcription and translation into a single step, which removes intermediate transcript overhead for multilingual workloads.

A note on a number you may have seen floating around: the "25.2% WER for Deepgram" figure that appears in some buyer guides is not supported by 2026 benchmark data. It likely derives from older Deepgram models tested under adverse conditions, or from comparing Deepgram's streaming WER against Gradium's batch WER, which measures fundamentally different use cases.

Treat any 2026 STT API benchmark that quotes a 25% WER for a current commercial model with skepticism.

Are end-to-end voice models killing standalone STT?

The most credible threat to dedicated STT APIs is not another STT vendor. It is the end-to-end voice model that folds transcription, understanding, and response into one pipeline.

Three flagship E2E models are shipping as of June 2026. xAI's Grok Voice Agent launched at $0.05/min flat pricing, simplifying cost prediction for high-volume applications. OpenAI's GPT-Realtime family progressed from GPT-Realtime-1.5 in February 2026 to GPT-Realtime-2 with GPT-5-class reasoning in May 2026, priced at $32–64 per 1M tokens.

Google's Gemini 3.5 Live Translate, launched June 9, 2026, expanded real-time speech-to-speech translation from 5 to 70+ languages.

Entry-level E2E voice pricing has compressed to the $0.05–$0.31/min range, approaching dedicated STT pricing of $0.004–$0.024/min. For greenfield consumer voice apps, the integration simplicity is hard to argue against.

But standalone STT is not commoditizing yet, for four concrete reasons.

First, transcript fidelity diverges inside E2E models. Google's own developer community has confirmed that input_audio_transcription from Gemini Live returns text that does not match what the model actually processes. The E2E model understands the audio; the transcript it hands back is a secondary artifact. For compliance and archival workloads, that is a problem.

Second, customization. Dedicated STT APIs expose custom vocabularies, acoustic model fine-tuning, and speaker diarization. E2E models typically do not. Enterprise voice architects need that control for domain-specific terminology.

Third, cost at scale. E2E pricing scales with token consumption. A voice agent processing 100,000 minutes per day on GPT-Realtime-2 faces a very different bill than one running $0.004/min STT plus a focused small language model.

Fourth, deployment and compliance. Healthcare, legal, and financial workloads often require data to stay in specific jurisdictions. Whisper self-hosting under the MIT license, faster-whisper variants with 4x CPU speedup via CTranslate2, and Cloudflare Workers AI for edge STT all answer requirements that E2E cloud APIs cannot.
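The cost-at-scale point in the third reason can be made concrete with per-minute arithmetic. The sketch below compares the two flat rates quoted in this article, $0.004/min for dedicated STT and $0.05/min for entry-level E2E voice, at 100,000 minutes per day. It deliberately leaves out the downstream language model and TTS on the STT side, and it does not attempt token-priced models such as GPT-Realtime-2, since that would need a tokens-per-minute estimate the article does not give. The function name `daily_cost_usd` is hypothetical.

```python
def daily_cost_usd(minutes_per_day: int, rate_per_min: float) -> float:
    """Daily spend in USD at a flat per-minute rate, rounded to cents."""
    return round(minutes_per_day * rate_per_min, 2)


MINUTES_PER_DAY = 100_000

# Rates quoted in the article; the STT figure excludes LLM and TTS cost.
stt_only = daily_cost_usd(MINUTES_PER_DAY, 0.004)
e2e_flat = daily_cost_usd(MINUTES_PER_DAY, 0.05)

if __name__ == "__main__":
    print(f"Dedicated STT only: ${stt_only:,.2f}/day")
    print(f"Flat-rate E2E:      ${e2e_flat:,.2f}/day")
    print(f"Difference:         ${e2e_flat - stt_only:,.2f}/day")
```

The difference is the daily budget available for the small language model and TTS before the split pipeline stops being cheaper.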

How should you decide?

Run the decision on your primary optimization target, not on a feature checklist.

If you are building a real-time voice agent with sub-500ms end-to-end latency targets, pick Deepgram Nova-3 or Cartesia Ink-2. The LLM downstream recovers from minor WER. Latency is the experience.

If you are building medical, legal, or compliance transcription where the transcript is the deliverable, pick Gradium or Amazon Transcribe Medical. WER below 3% is the bar. Latency is irrelevant.

If you are building a greenfield consumer voice product with multilingual needs and no on-prem requirement, evaluate GPT-Realtime-2 or Gemini 3.5 Live. The integration simplicity and language coverage outweigh granular control.

If you need offline or edge deployment, self-host Whisper with faster-whisper, or use Deepgram's on-prem Docker. Google Cloud STT v2 offers regional endpoints for data residency if cloud is acceptable.
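The four branches above amount to a lookup keyed on the primary optimization target. The sketch below encodes them as data so the choice is explicit and reviewable. The `Workload` enum, the `SHORTLISTS` table, and the `shortlist_for` function are hypothetical names for this example, and the entries only restate this article's recommendations; they are not an independent evaluation.

```python
from enum import Enum
from typing import Dict, List


class Workload(Enum):
    REALTIME_AGENT = "real-time voice agent, sub-500ms end to end"
    TRANSCRIPT_IS_PRODUCT = "medical, legal, or compliance transcription"
    GREENFIELD_MULTILINGUAL = "greenfield consumer app, no on-prem requirement"
    OFFLINE_OR_EDGE = "offline or edge deployment"


# Candidate lists restate the article's guidance for each workload.
SHORTLISTS: Dict[Workload, List[str]] = {
    Workload.REALTIME_AGENT: ["Deepgram Nova-3", "Cartesia Ink-2"],
    Workload.TRANSCRIPT_IS_PRODUCT: ["Gradium", "Amazon Transcribe Medical"],
    Workload.GREENFIELD_MULTILINGUAL: ["GPT-Realtime-2", "Gemini 3.5 Live"],
    Workload.OFFLINE_OR_EDGE: [
        "Self-hosted Whisper (faster-whisper)",
        "Deepgram on-prem Docker",
    ],
}


def shortlist_for(workload: Workload) -> List[str]:
    """Return a copy of the candidate list for the given workload."""
    return list(SHORTLISTS[workload])


if __name__ == "__main__":
    for workload in Workload:
        print(f"{workload.value}: {', '.join(shortlist_for(workload))}")
```

Keying on a single workload value forces the team to name one optimization target up front, which is the point of the decision rule.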

One practitioner pattern worth stealing: a mid-size enterprise running 50,000 voice interactions per day reported NPS scores improving 12 points after switching from GCP Speech-to-Text to Deepgram, with A/B-tested TTFT of 180ms versus 340ms for AssemblyAI and 450ms for Gradium. Their STT spend at that volume was around $215/day, which matches 50,000 minutes at Nova-3's $0.0043/min list price.

GPU compute for the downstream LLM was 40x larger. The accuracy-vs-latency tradeoff dominated cost by an order of magnitude.

That ratio is the real lesson of the 2026 STT market. The vendor you pick matters less than the axis you optimize. Pick the axis wrong and no vendor saves you.

Frequently asked questions

What is the difference between TTFT and WER in speech-to-text APIs?

TTFT (time-to-first-token) measures latency from when a user stops speaking to when the first transcribed word returns; lower is better for real-time voice agents. WER (word error rate) measures transcription inaccuracy as a percentage of mis-transcribed words; lower is better for compliance, medical, and legal use cases. Most 2026 STT APIs force a tradeoff between the two.

Is Deepgram or Gradium better for real-time voice agents?

Deepgram Nova-3 is the better fit for real-time voice agents, with p50 TTFT around 150-250ms. Gradium's stt-translate API, launched June 21, 2026, prioritizes accuracy with WER as low as 1.62% on controlled English benchmarks but at roughly 2-3x higher latency, making it better for medical, legal, and compliance transcription.

Are end-to-end voice models replacing standalone STT APIs in 2026?

Not yet. End-to-end models like GPT-Realtime-2, Gemini 3.5 Live Translate, and xAI Grok Voice Agent simplify greenfield consumer voice apps, but standalone STT APIs retain advantages in cost at scale, on-prem deployment, compliance, and transcript fidelity. Google's own developers have confirmed Gemini Live's transcript output diverges from what the model internally processes.

What WER is acceptable for an LLM-backed voice agent?

For voice agents where STT output feeds an LLM rather than a human, WER of 5-15% is usually acceptable because the LLM recovers from minor errors via context. In that case, prioritize TTFT under 500ms. For human-facing transcripts in medical, legal, or compliance settings, target WER below 3%.

Can you self-host a speech-to-text API for offline use?

Yes. OpenAI Whisper is MIT licensed and self-hostable, with faster-whisper variants delivering up to 4x CPU speedup via CTranslate2. Deepgram offers on-prem Docker deployment for Nova-3, and Cloudflare Workers AI provides edge STT with sub-50ms response near CDN nodes.
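As a minimal sketch of the self-hosted route, the snippet below transcribes a local file with the faster-whisper package on CPU. It assumes `pip install faster-whisper` and a model download on first use. The `"small"` model size, the `int8` compute type, and the `call.wav` path are illustrative choices rather than recommendations, and `transcribe_file` and `format_segments` are hypothetical helper names.

```python
from typing import Iterable, List, Tuple

Row = Tuple[float, float, str]  # (start seconds, end seconds, text)


def transcribe_file(path: str, model_size: str = "small") -> List[Row]:
    """Transcribe a local audio file on CPU using faster-whisper."""
    # Imported lazily so the formatting helper works without the package.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, beam_size=5)
    # `segments` is a generator; decoding happens as it is consumed.
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]


def format_segments(rows: Iterable[Row]) -> str:
    """Render rows as one timestamped line per segment."""
    return "\n".join(
        f"[{start:6.2f}s -> {end:6.2f}s] {text}" for start, end, text in rows
    )


if __name__ == "__main__":
    # "call.wav" is a placeholder path for this example.
    print(format_segments(transcribe_file("call.wav")))
```

Everything runs on the local machine, so no audio leaves the host, which is the property the data-residency workloads above care about.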