Deepgram's Nova-3 returns its first transcribed token in roughly 150 milliseconds. Gradium's stt-translate API, which launched June 21, 2026, can take 300 to 500ms to do the same job.
That gap, measured in tenths of a second, now drives the single most consequential buying decision in voice AI infrastructure. Most buyer guides still floating around the web were written in 2023 and treat speech-to-text as a commodity accuracy contest.
The 2026 market is bifurcated along a different axis: fast-and-leaky versus slow-and-tight.
A real-time STT API benchmark in 2026 is really two benchmarks stitched together. Time-to-first-token (TTFT) measures the latency from when a user stops speaking to when the first transcribed word comes back.
Word error rate (WER) measures how many words the transcript gets wrong. Lower is better on both. Almost no shipping API wins both at once, which is why the Deepgram vs Gradium comparison has become the cleanest illustration of the tradeoff practitioners actually face.
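For readers who want the WER half of that definition pinned down, here is a minimal sketch of the standard calculation: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the lowercase-and-split normalization are assumptions made for illustration; published benchmarks apply their own text normalization, so their numbers will not match this sketch exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("two" -> "to") and one deletion ("daily") over 4 words.
    print(word_error_rate("take two tablets daily", "take to tablets"))
```

Because insertions count as errors, WER can exceed 100% on a badly hallucinated transcript, which is one reason a single headline number needs its test conditions attached.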
TL;DR
- Deepgram Nova-3 leads commercial cloud STT on TTFT (~150–250ms p50) but sits at 7–12% WER on mixed-condition audio.
- Gradium, which raised a $70M seed and launched its APIs on June 21, 2026, hits WER as low as 1.62% on controlled English benchmarks but runs 2–3x slower.
- End-to-end voice models (GPT-Realtime-2, Gemini 3.5 Live, Grok Voice Agent) are compressing entry pricing toward $0.05/min but do not yet replace dedicated STT for compliance, on-prem, or transcript-fidelity workloads.
- Pick TTFT when STT feeds an LLM. Pick WER when the transcript itself is the product.
Key takeaways
- The 25.2% WER figure sometimes attributed to Deepgram is not supported by 2026 benchmarks; Nova-3 lands at 7–12% on mixed audio per independent testing.
- Gradium's accuracy edge is real but narrow: 1.62% WER on Speechmatics English-only controlled testing, per the Coval STT benchmark dated May 4, 2026.
- Cartesia's Ink-2 sits on the streaming Pareto frontier at 0.21s TTFT and 3.59% WER, per the Artificial Analysis AA-WER Streaming report (June 1, 2026).
- At 50,000 minutes per day, STT cost is rarely the dominant line item; GPU compute for the downstream LLM is typically 40x larger.
- Google's own developer community has confirmed Gemini Live's input_audio_transcription returns text that diverges from what the model internally processes.
What does the 2026 STT API field actually look like?
The seven leading real-time transcription APIs as of June 28, 2026 span a wide band on both axes. The table below consolidates vendor docs and the two most recent independent benchmarks.
| Vendor | Current model | TTFT (p50) | WER (English) | Cloud pricing | Deployment |
|---|---|---|---|---|---|
| Deepgram | Nova-3 | ~0.15–0.25s | 7–12% | $0.0043/min | Cloud, on-prem Docker |
| Gradium | stt-translate v1 | ~0.3–0.5s | 1.6–4% | Credit-based (TBD) | Cloud API |
| AssemblyAI | Universal-3 Streaming | ~0.2–0.4s | 6–9% | $0.016/min | Cloud, on-prem |
| OpenAI | Whisper + GPT-Realtime | 0.2–0.5s | 4–8% (Whisper) | $0.006/min; Realtime $32–64/1M tok | Cloud, self-host Whisper |
| Google Cloud | STT v2 (Gemini-powered) | 0.2–0.5s | 5–10% | Pay-per-second | Cloud, regional |
| AWS | Transcribe | 0.3–0.6s | 6–12% | $0.024/min std; $0.015/min medical | Cloud only |
| Cartesia | Ink-2 | ~0.21s | 3.59% | TBD | Cloud API |
Two things stand out. Deepgram's TTFT lead is consistent across sources, and Cartesia is the only vendor whose model, Ink-2, sits clearly on the Pareto frontier while being near the front on both metrics at once, though its language coverage is narrower at 20+ languages. AssemblyAI's Universal-3 Streaming launched March 3, 2026 and narrowed the latency gap without matching Deepgram's floor.
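To make "Pareto frontier" concrete, the sketch below checks which table rows are not beaten on both axes by any other row. One loud assumption: where the table gives a range, the code uses the midpoint of that range, which is an illustrative simplification and not a measured value. The `Vendor` type and function names are hypothetical.

```python
from typing import NamedTuple


class Vendor(NamedTuple):
    name: str
    ttft_s: float   # time-to-first-token in seconds (range midpoint, assumed)
    wer_pct: float  # English WER in percent (range midpoint, assumed)


# Midpoints of the ranges in the table above; Ink-2 is a single reported point.
VENDORS = [
    Vendor("Deepgram Nova-3", 0.20, 9.5),
    Vendor("Gradium stt-translate v1", 0.40, 2.8),
    Vendor("AssemblyAI Universal-3 Streaming", 0.30, 7.5),
    Vendor("OpenAI Whisper", 0.35, 6.0),
    Vendor("Google Cloud STT v2", 0.35, 7.5),
    Vendor("AWS Transcribe", 0.45, 9.0),
    Vendor("Cartesia Ink-2", 0.21, 3.59),
]


def dominates(a: Vendor, b: Vendor) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.ttft_s <= b.ttft_s and a.wer_pct <= b.wer_pct
    strictly_better = a.ttft_s < b.ttft_s or a.wer_pct < b.wer_pct
    return no_worse and strictly_better


def pareto_frontier(vendors: list[Vendor]) -> list[Vendor]:
    """Vendors that no other vendor dominates (lower is better on both axes)."""
    return [v for v in vendors if not any(dominates(o, v) for o in vendors)]


if __name__ == "__main__":
    for v in pareto_frontier(VENDORS):
        print(f"{v.name}: {v.ttft_s:.2f}s TTFT, {v.wer_pct:.2f}% WER")
```

Under these midpoint assumptions the fastest and the most accurate entries also land on the frontier, as single-axis leaders always do. Ink-2 is the frontier point that gets there without being an extreme on either axis, and it dominates the four remaining rows.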
Why does TTFT dominate for voice agents?
For a conversational voice agent, sub-500ms end-to-end latency is the threshold where users stop perceiving the system as laggy. STT is only the first hop in a chain that also includes LLM inference and TTS, so every millisecond spent in transcription is borrowed from the rest of the pipeline.
Deepgram's ~150ms TTFT enables fast turn-taking detection. The agent recognizes the user has stopped speaking and hands off to the LLM immediately. A 200ms TTFT advantage compounds across thousands of daily calls in IVR and customer service deployments, where call handling time and CSAT move with perceived responsiveness.
The tradeoff is that Deepgram's WER sits at 7–12% on mixed-condition audio, per independent benchmarking from Scribie's production-audio study. For a voice agent, that is usually fine.
The STT output feeds an LLM, not a human reader, and the LLM recovers from minor transcription errors through context. An 8% WER at 180ms latency beats a 2% WER at 450ms latency for this architecture.
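The latency argument is simple budget arithmetic, sketched below. It assumes the 500ms end-to-end target mentioned earlier and treats STT time-to-first-token as the whole STT share of the budget, which is a simplification; the function name is hypothetical.

```python
def downstream_budget_ms(stt_ttft_ms: int, target_ms: int = 500) -> int:
    """Milliseconds left for LLM inference and TTS once STT returns its first token."""
    return target_ms - stt_ttft_ms


if __name__ == "__main__":
    # The two operating points compared in the paragraph above.
    for label, ttft_ms in [("8% WER at 180ms", 180), ("2% WER at 450ms", 450)]:
        left = downstream_budget_ms(ttft_ms)
        print(f"{label}: {left}ms left for LLM + TTS")
```

At 180ms the rest of the pipeline has 320ms to work with. At 450ms it has 50ms, which leaves little room for LLM inference and TTS, however accurate the transcript is.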
Deepgram's January 2026 $130M Series C at a $1.3B valuation, plus its acquisition of a YC AI startup, signals continued investment in real-time speech intelligence. On-prem Docker and Podman deployment is documented for enterprise customers with data sovereignty requirements.
When does WER dominate instead?
When the transcript is the product, accuracy stops being recoverable. A misheard medication name in clinical documentation can be a liability. A dropped qualifier in a court transcript can change legal meaning. In those workloads, latency is irrelevant and WER is the only metric that matters.
Gradium emerged from stealth in December 2025 with a $70M seed and launched commercial STT and speech-to-speech translation APIs on June 21, 2026. The company was founded by a former Google DeepMind researcher, and its positioning is accuracy-first.
The Coval STT benchmark dated May 4, 2026 measured Gradium at 1.62% WER on Speechmatics English-only controlled testing, competitive with or better than any commercial STT API currently shipping.
Gradium's accuracy advantage comes with a latency tax. Its models typically require 300–500ms TTFT, roughly 2–3x slower than Deepgram. That rules it out for real-time voice agents but makes it a strong fit for medical transcription, legal proceedings, compliance call recording, and archival media indexing.
The integrated stt-translate endpoint also collapses transcription and translation into a single step, which removes intermediate transcript overhead for multilingual workloads.
A note on a number you may have seen floating around: the "25.2% WER for Deepgram" figure that appears in some buyer guides is not supported by 2026 benchmark data. It likely derives from older Deepgram models tested under adverse conditions, or from comparing Deepgram's streaming WER against Gradium's batch WER, two numbers that measure fundamentally different use cases.
Treat any 2026 STT API benchmark that quotes a 25% WER for a current commercial model with skepticism.
Are end-to-end voice models killing standalone STT?
The most credible threat to dedicated STT APIs is not another STT vendor. It is the end-to-end voice model that folds transcription, understanding, and response into one pipeline.
Three flagship E2E models are shipping as of June 2026. xAI's Grok Voice Agent launched at $0.05/min flat pricing, simplifying cost prediction for high-volume applications. OpenAI's GPT-Realtime family progressed from GPT-Realtime-1.5 in February 2026 to GPT-Realtime-2 with GPT-5-class reasoning in May 2026, priced at $32–64 per 1M tokens.
Google's Gemini 3.5 Live Translate, launched June 9, 2026, expanded real-time speech-to-speech translation from 5 to 70+ languages.
Entry-level E2E voice pricing has compressed to the $0.05–$0.31/min range, approaching dedicated STT pricing of $0.004–$0.024/min. For greenfield consumer voice apps, the integration simplicity is hard to argue against.
But standalone STT is not commoditizing yet, for four concrete reasons.
First, transcript fidelity diverges inside E2E models. Google's own developer community has confirmed that input_audio_transcription from Gemini Live returns text that does not match what the model actually processes. The E2E model understands the audio; the transcript it hands back is a secondary artifact. For compliance and archival workloads, that is a problem.
Second, customization. Dedicated STT APIs expose custom vocabularies, acoustic model fine-tuning, and speaker diarization. E2E models typically do not. Enterprise voice architects need that control for domain-specific terminology.
Third, cost at scale. E2E pricing scales with token consumption. A voice agent processing 100,000 minutes per day on GPT-Realtime-2 faces a very different bill than one running $0.004/min STT plus a focused small language model.
Fourth, deployment and compliance. Healthcare, legal, and financial workloads often require data to stay in specific jurisdictions. Whisper self-hosting under Apache 2.0, faster-whisper variants with 4x CPU speedup via CTranslate2, and Cloudflare Workers AI for edge STT all answer requirements that E2E cloud APIs cannot.
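The cost-at-scale point can be roughed out with per-minute arithmetic. The sketch below uses the per-minute figures quoted earlier in this section ($0.004/min for dedicated STT, $0.05–$0.31/min for entry-level E2E voice) at 100,000 minutes per day. It deliberately does not convert GPT-Realtime-2's per-token price into minutes, since that needs a tokens-per-minute figure the sources here do not give, and the STT line excludes the cost of the downstream small language model. Names are hypothetical.

```python
MINUTES_PER_DAY = 100_000

# Per-minute rates quoted in the article; labels are illustrative.
RATES_USD_PER_MIN = {
    "dedicated STT (low end)": 0.004,
    "E2E voice (low end)": 0.05,
    "E2E voice (high end)": 0.31,
}


def daily_cost_usd(minutes_per_day: int, usd_per_minute: float) -> float:
    """Daily spend in USD, rounded to cents."""
    return round(minutes_per_day * usd_per_minute, 2)


if __name__ == "__main__":
    for label, rate in RATES_USD_PER_MIN.items():
        print(f"{label}: ${daily_cost_usd(MINUTES_PER_DAY, rate):,.2f} per day")
```

That works out to $400 per day for the STT leg against $5,000 to $31,000 per day for E2E voice at the quoted range, so the comparison turns on how much the downstream language model adds to the dedicated-STT side.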
How should you decide?
Run the decision on your primary optimization target, not on a feature checklist.
If you are building a real-time voice agent with sub-500ms end-to-end latency targets, pick Deepgram Nova-3 or Cartesia Ink-2. The LLM downstream recovers from minor WER. Latency is the experience.
If you are building medical, legal, or compliance transcription where the transcript is the deliverable, pick Gradium or Amazon Transcribe Medical. WER below 3% is the bar. Latency is irrelevant.
If you are building a greenfield consumer voice product with multilingual needs and no on-prem requirement, evaluate GPT-Realtime-2 or Gemini 3.5 Live. The integration simplicity and language coverage outweigh granular control.
If you need offline or edge deployment, self-host Whisper with faster-whisper, or use Deepgram's on-prem Docker. Google Cloud STT v2 offers regional endpoints for data residency if cloud is acceptable.
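The four branches above amount to a lookup keyed on the primary optimization target. Here is a minimal sketch of that lookup, using only the options named in this section; the enum and function names are hypothetical, and a real evaluation would still benchmark the shortlist on your own audio.

```python
from enum import Enum


class Target(Enum):
    REALTIME_AGENT = "real-time voice agent, sub-500ms end-to-end"
    TRANSCRIPT_PRODUCT = "medical, legal, or compliance transcription"
    GREENFIELD_MULTILINGUAL = "greenfield consumer voice, multilingual, no on-prem"
    OFFLINE_OR_EDGE = "offline or edge deployment"


# Shortlists taken from the decision rules above.
SHORTLISTS = {
    Target.REALTIME_AGENT: ["Deepgram Nova-3", "Cartesia Ink-2"],
    Target.TRANSCRIPT_PRODUCT: ["Gradium", "AWS Transcribe (medical)"],
    Target.GREENFIELD_MULTILINGUAL: ["GPT-Realtime-2", "Gemini 3.5 Live"],
    Target.OFFLINE_OR_EDGE: ["Whisper via faster-whisper", "Deepgram on-prem Docker"],
}


def shortlist(target: Target) -> list[str]:
    """Return the candidate list for a single primary optimization target."""
    return SHORTLISTS[target]


if __name__ == "__main__":
    for target in Target:
        print(f"{target.value}: {', '.join(shortlist(target))}")
```

Keying on exactly one target is the point of the design: a workload that seems to need two of these at once is usually two workloads with different pipelines.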
One practitioner pattern worth stealing: a mid-size enterprise running 50,000 voice interactions per day reported NPS scores improving 12 points after switching from GCP Speech-to-Text to Deepgram, with A/B tested TTFT of 180ms versus 340ms for AssemblyAI and 450ms for Gradium. Their STT spend at that volume was around $215 per day, which matches 50,000 minutes at Deepgram's $0.0043/min rate.
GPU compute for the downstream LLM was 40x larger. The accuracy-vs-latency tradeoff, not the STT price, was the decision that mattered.
That ratio is the real lesson of the 2026 STT market. The vendor you pick matters less than the axis you optimize. Pick the axis wrong and no vendor saves you.
Sources
- Deepgram product overview
- Deepgram raises $130M at $1.3B valuation (TechCrunch, Jan 2026)
- Deepgram AWS Docker / Podman deployment docs
- Gradium launches with $70M to build voice layer for AI
- AssemblyAI (Universal-3 Streaming)
- Scribie: Real-world STT accuracy benchmarking AssemblyAI, Deepgram, WhisperX
- OpenAI: Advancing voice intelligence with new models in the API
- xAI Grok Voice Agent API plugin (LiveKit docs)
- Gemini 3.5 updates: Live Translate, Flash/Pro GA (Google Cloud blog)
- Gemini Live API: input_audio_transcription returns incorrect text (Google AI developer forum)
- openai/whisper-large-v3 on Hugging Face
- Cloudflare Workers AI pricing
- Google Cloud Speech-to-Text pricing
