Cascaded vs End-to-End Voice Agents: Which Architecture Ships in Healthcare?

The latency gap is narrowing, but the workflow, not the benchmark, picks the architecture.

June 29, 2026 · 13 min read
Tags: voice agent architecture, cascaded vs end to end speech, healthcare voice AI
A common framing in mid-2026 holds that "Groq-cascaded" voice agents hit ~780ms while Google's Gemini Live end-to-end model sits near 2,980ms, and that this gap defines the cascaded-versus-end-to-end debate. That framing is wrong on the facts.

The 780ms number belongs to xAI's Grok Voice Think Fast 1.0, which is itself an end-to-end speech-to-speech model released April 23, 2026. Both sides of that comparison are end-to-end systems.

The real production-defining axis for healthcare voice AI is not cascaded versus end-to-end latency in a benchmark. It is which architecture survives the clinical workflow's accuracy, auditability, and interruption requirements under real load.

As Andrew Ng put it in April 2026, low-latency voice models lack reliability while agentic STT-LLM-TTS pipelines are intelligent but too slow. That tension, not a single TTFT number, is what healthcare architects have to resolve.

TL;DR: Cascaded pipelines (STT → LLM → TTS) remain the pragmatic default for regulated healthcare workflows because they let you audit errors stage by stage and swap components independently. End-to-end speech-to-speech models have crossed the 300ms latency wall in benchmarks and are closing in on production viability for patient-facing triage, but their black-box failure modes and 4-10x cost premium keep them out of most clinical deployments today. The right choice depends on the workflow, not the leaderboard.

Key takeaways

  • The "Groq-cascaded vs Gemini Live" comparison is a misconception: Grok Voice is end-to-end, and the 780ms figure is xAI first-party data, not a cascaded baseline.
  • End-to-end models (OpenAI gpt-realtime-2, NVIDIA PersonaPlex-7B) have hit 180-265ms TTFT in benchmarks, but Gemini Live users reported 8-30s in production as recently as March 2026.
  • Cascaded architectures dominate ambient clinical documentation because accuracy and error attribution matter more than latency.
  • Patient-facing triage is the one workflow where end-to-end's latency advantage could win, but only after production validation at scale.
  • Hippocratic AI's Polaris 5.0 constellation (30+ models with two-level verifiers) points to a hybrid middle path that neither pure architecture satisfies alone.

What Is Cascaded vs End-to-End Speech Architecture?

A cascaded voice agent decomposes conversation into three sequential neural stages: speech-to-text transcribes audio, an LLM reasons over the transcript and generates a text response, and text-to-speech synthesizes the audible reply. Each stage is independently trainable, replaceable, and auditable.

A medical STT model can be swapped in for clinical vocabulary; a clinical LLM can enforce SOAP structure; a healthcare-compliant TTS can hold a calm tone. The cost of this modularity is a latency floor driven by audio buffering, network hops, and endpoint detection, typically landing at 600-1,200ms in production according to LatencyGrid and Picovoice's latency analysis.
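
The modularity described above can be sketched as three swappable callables with a per-stage audit trail. This is a minimal illustration, not a production design: `CascadedPipeline`, `StageRecord`, and the `fake_*` stand-in stages are hypothetical names, and real stages would call actual STT, LLM, and TTS services.

```python
from dataclasses import dataclass, field
from time import perf_counter
from typing import Callable, List


@dataclass
class StageRecord:
    stage: str         # "stt", "llm", or "tts"
    output: str        # text emitted by the stage, or a short description of audio
    elapsed_ms: float  # wall-clock time spent in the stage


@dataclass
class CascadedPipeline:
    # Each stage is a plain callable, so any one can be replaced on its own.
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]
    audit_log: List[StageRecord] = field(default_factory=list)

    def _timed(self, stage, fn, arg):
        # Run one stage, then record what it produced and how long it took.
        start = perf_counter()
        result = fn(arg)
        elapsed_ms = (perf_counter() - start) * 1000
        shown = result if isinstance(result, str) else f"<{len(result)} bytes of audio>"
        self.audit_log.append(StageRecord(stage, shown, elapsed_ms))
        return result

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self._timed("stt", self.stt, audio_in)
        reply_text = self._timed("llm", self.llm, transcript)
        return self._timed("tts", self.tts, reply_text)


# Hypothetical stand-in stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(transcript: str) -> str:
    return f"Noted: {transcript}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
audio_out = pipeline.respond(b"patient reports chest pain")
for record in pipeline.audit_log:
    print(f"{record.stage}: {record.output!r} ({record.elapsed_ms:.2f} ms)")
```

Swapping in a medical STT model then means passing a different `stt` callable; the other two stages and the audit log are untouched.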

An end-to-end speech-to-speech model collapses all three stages into a single forward pass from audio input to audio output, usually running full-duplex with native barge-in handling. Representative systems as of June 2026 include OpenAI gpt-realtime-2 (180ms WebRTC / 340ms PSTN, released May 7, 2026), NVIDIA PersonaPlex-7B (205-265ms, open weights, January 15, 2026), Gemini 3.1 Flash Live (March 26, 2026), and the aforementioned Grok Voice.

The academic reference point remains Kyutai Moshi (160-240ms theoretical, September 2024).

How Does the TTFT Benchmark Actually Look in 2026?

Time-to-first-token is the metric most often cited in architecture debates, and it is the one most often misread. The table below consolidates first-party and production measurements as of June 2026.

Model | Architecture | TTFT (benchmark) | TTFT (production reports) | Release
OpenAI gpt-realtime-2 | End-to-end | 180ms WebRTC / 340ms PSTN | ~340ms measured | May 7, 2026
NVIDIA PersonaPlex-7B | End-to-end | 205-265ms | Not widely reported | Jan 15, 2026
Kyutai Moshi | End-to-end | 160-240ms (theoretical) | Limited deployments | Sep 2024
Grok Voice Think Fast 1.0 | End-to-end | ~780ms | Not widely reported | Apr 23, 2026
Gemini 3.1 Flash Live | End-to-end | ~2,980ms | 8-30s (user forum, Mar 2026) | Mar 26, 2026
Typical cascaded (STT→LLM→TTS) | Cascaded | 600-1,200ms | 600-1,200ms baseline | Various
Figure: First-turn latency by model (benchmark vs production, ms). gpt-realtime-2 (WebRTC): 180ms; PersonaPlex-7B: 235ms; Moshi (theoretical): 200ms; Grok Voice 1.0: 780ms; Gemini Live (benchmark): 2,980ms; cascaded baseline: 900ms.

Two things deserve emphasis. First, the spread within end-to-end systems is enormous: from 180ms to nearly 3,000ms. Architecture alone explains almost none of it. Model size, optimization, and infrastructure dominate.

Second, the Gemini Live gap between Google's ~2,980ms benchmark and the 8-30s production reports on the Google AI developer forum is the credibility problem that matters for healthcare. A model that benchmarks at 3s and ships at 30s under load is not a system you put in front of a patient on a triage line.

Industry consensus has converged on the "300ms wall": sub-300ms feels natural, 500ms feels delayed, 700ms-plus feels broken, per Spheron's deployment analysis. End-to-end models have crossed it in controlled conditions. Production is a different question.
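
As a rough illustration, the thresholds can be turned into a small bucketing function. The source gives anchor points (sub-300ms, 500ms, 700ms-plus) rather than exact boundaries, so treating everything from 300ms up to 700ms as "delayed" is an assumption of this sketch; the function name is hypothetical and the sample values are the benchmark figures quoted in the table above.

```python
def perceived_quality(ttft_ms: float) -> str:
    """Bucket a first-response latency by how it tends to feel to a caller."""
    if ttft_ms < 300:
        return "natural"
    if ttft_ms < 700:   # assumption: the whole 300-700ms band counts as "delayed"
        return "delayed"
    return "broken"


# Benchmark figures quoted earlier in this article.
benchmarks_ms = {
    "gpt-realtime-2 (WebRTC)": 180,
    "gpt-realtime-2 (PSTN)": 340,
    "Grok Voice Think Fast 1.0": 780,
    "Gemini 3.1 Flash Live": 2980,
}

for model, ttft in benchmarks_ms.items():
    print(f"{model}: {ttft} ms -> {perceived_quality(ttft)}")
```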

Why Healthcare Workflow, Not Latency, Picks the Architecture

Healthcare voice AI serves two operational contexts with opposing requirements, and conflating them is the most common architectural mistake.

Ambient scribe: latency-tolerant, accuracy-critical

Ambient AI scribes run in post-visit batch mode. The system listens to an encounter and generates a SOAP note, H&P, or follow-up instructions minutes to hours later. Latency is effectively irrelevant. Accuracy is everything.

A transcription error that turns "hydrochlorothiazide" into "hydrocortisone" is a medication safety incident, not a UX nit. The HIPAA clinical workflow testing checklist from Hamming AI frames this precisely: every stage of a voice pipeline touching PHI needs verifiable error attribution.

Cascaded architectures give you that. You can pull the STT log, see the mis-transcription, and fix the medical vocabulary model. An end-to-end model gives you audio in and audio out with no intermediate transcript to inspect.
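
A minimal sketch of what that attribution can look like, assuming a reviewer knows which term was actually spoken and has the text output of each stage on hand. The function name and the sample log entries are hypothetical.

```python
from typing import List, Optional, Tuple


def first_faulty_stage(
    expected_term: str, stage_outputs: List[Tuple[str, str]]
) -> Optional[str]:
    """Return the first stage whose output no longer contains the expected term."""
    needle = expected_term.lower()
    for stage, text in stage_outputs:
        if needle not in text.lower():
            return stage
    return None  # the term survived every stage


# Case 1: the STT stage mis-heard the drug name.
stt_error = [
    ("stt", "start hydrocortisone 25 mg daily"),
    ("llm", "Plan: hydrocortisone 25 mg once daily"),
]
# Case 2: the transcript was right, but the LLM substituted the drug.
llm_error = [
    ("stt", "start hydrochlorothiazide 25 mg daily"),
    ("llm", "Plan: hydrocortisone 25 mg once daily"),
]

print(first_faulty_stage("hydrochlorothiazide", stt_error))  # stt
print(first_faulty_stage("hydrochlorothiazide", llm_error))  # llm
```

The same check has nothing to run against when the only artifacts are input audio and output audio.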

The evidence on ambient scribe impact is now substantial. The NEJM AI randomized trial (Lukac et al., November 2025) across 238 outpatient physicians in 14 specialties showed burnout and cognitive load improvements with DAX Copilot and Nabla.

A JAMA Network Open multicenter study tracked burnout falling from 51.9% to 38.8% at 30 days post-rollout across six health systems. Kaiser Permanente's 63-week evaluation saved 15,791 documentation hours, equivalent to 1,794 eight-hour workdays, per Sunoh's reporting.

The error picture is less rosy. A ScienceDirect study from December 2025 reported a mean percent error of ~26.3% across platforms, with 4.8-71% of signed notes containing errors. That range is why auditability is non-negotiable, and why cascaded remains the recommendation for ambient documentation.

Patient-facing triage: latency-critical, interruption-heavy

Patient triage voice agents operate in real time on phone calls or voice interfaces. The latency budget is brutal. Users perceive delays over 1-2 seconds as unnatural, and the hard ceiling is roughly 1 second before callers terminate, according to Coval's latency guide and Retell AI's healthcare front-desk analysis.

Production targets for natural conversation sit at 300-500ms.

Interruption handling is the other hard requirement. Roughly 1 in 5 triage calls involves a patient barge-in, and the component-level targets are tight: TTS flush around 60ms, LLM cancel around 40ms.

Cascaded pipelines give you fine-grained control over both because TTS and LLM are independent processes you can kill on demand. End-to-end models handle barge-in natively but with less predictable cancellation behavior under load.
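
The "kill on demand" control can be sketched with `asyncio`: the LLM stream runs as a task that is cancelled on barge-in, and any speech still queued for TTS is dropped. This is a simplified model under stated assumptions: `stream_reply` is a hypothetical stand-in for a streaming LLM, the queue stands in for a TTS buffer, and the 40ms and 60ms limits are the component targets quoted above.

```python
import asyncio
import time


async def stream_reply(tts_queue: asyncio.Queue) -> None:
    """Hypothetical stand-in for an LLM streaming text chunks toward TTS."""
    for i in range(100):
        await tts_queue.put(f"chunk {i}")
        await asyncio.sleep(0.01)


async def handle_barge_in(llm_task: asyncio.Task, tts_queue: asyncio.Queue) -> dict:
    """Cancel generation, drop queued speech, and time both steps."""
    start = time.perf_counter()
    llm_task.cancel()
    try:
        await llm_task  # wait until the cancellation has actually landed
    except asyncio.CancelledError:
        pass
    llm_cancel_ms = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    while not tts_queue.empty():
        tts_queue.get_nowait()  # discard speech that has not been played yet
    tts_flush_ms = (time.perf_counter() - start) * 1000

    return {"llm_cancel_ms": llm_cancel_ms, "tts_flush_ms": tts_flush_ms}


async def main() -> dict:
    tts_queue: asyncio.Queue = asyncio.Queue()
    llm_task = asyncio.create_task(stream_reply(tts_queue))
    await asyncio.sleep(0.05)  # the agent has been talking for about 50 ms
    timings = await handle_barge_in(llm_task, tts_queue)  # the caller interrupts
    timings["cancelled"] = llm_task.cancelled()
    timings["queue_empty"] = tts_queue.empty()
    timings["within_targets"] = (
        timings["llm_cancel_ms"] <= 40 and timings["tts_flush_ms"] <= 60
    )
    return timings


timings = asyncio.run(main())
print(timings)
```

In a real deployment the cancel and flush would cross process or network boundaries, which is where most of the latency budget goes; the in-process timings here only show where to measure.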

Recent deployments show where this is heading. Infermedica launched a Nurse Triage Co-Pilot at Médis in Portugal in July 2025 that cut call durations by up to 4 minutes and collected 4x more symptoms. In June 2026, Infermedica paired with Healthdirect Australia for a closed beta combining symptom triage with DERM skin cancer screening on AWS-secured architecture.

These are cascaded systems, and they are winning the deployed market because they meet the latency budget and the accuracy bar simultaneously.

What Does the Accuracy-vs-Speed Trade-off Mean for Medical Voice Assistants?

This is the unresolved core of the debate. Low-latency inference favors smaller, faster models. Healthcare accuracy requires larger, more careful reasoning. As end-to-end latency improves, vendors face pressure to push models toward faster inference, which can sacrifice the reasoning depth that clinical safety demands.

Hippocratic AI's Polaris 5.0, released April 30, 2026, is the clearest response to this tension. The company reports 99.9% clinical safety, achieved through a "constellation" of 30+ specialized models plus two-level verifiers.

The company's position is blunt: healthcare voice AI requires "north of 99% accuracy on every reasoning subtask," and a single hallucinated medication dose or missed escalation is "not an evaluation curiosity; it is a safety event."

The constellation pattern is neither pure cascaded nor pure end-to-end. It is an orchestration layer that treats verifiability as a first-class architectural concern. For a medical voice assistant operating in triage, this is the design pattern most likely to satisfy both the latency budget and the accuracy floor, regardless of what the underlying speech model looks like.
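
One generic way to picture verification as a first-class step is a gate that releases a drafted reply only when every check passes and escalates otherwise. The sketch below is not Hippocratic AI's implementation, whose internals are not described here; the reading of "two-level" as specialist checks followed by one overall check, and every name and rule in the code, are assumptions made for illustration.

```python
from typing import Callable, List

Check = Callable[[str], bool]

ESCALATION = "I want to connect you with a nurse for that question."


def gated_reply(draft: str, specialist_checks: List[Check], overall_check: Check) -> str:
    """Release the draft only if all specialist checks and the overall check pass."""
    if all(check(draft) for check in specialist_checks) and overall_check(draft):
        return draft
    return ESCALATION  # anything that fails a check goes to a human


# Hypothetical checks; real verifiers would be models, not string rules.
def no_dose_advice(text: str) -> bool:
    return " mg" not in text.lower()


def not_empty(text: str) -> bool:
    return bool(text.strip())


def short_enough(text: str) -> bool:
    return len(text.split()) <= 40


checks = [no_dose_advice, not_empty]
safe = gated_reply("Your appointment is confirmed for Tuesday.", checks, short_enough)
blocked = gated_reply("Take 50 mg twice a day.", checks, short_enough)
print(safe)
print(blocked)
```

The gate sits above the speech layer, so it behaves the same whether the draft came from a cascaded LLM stage or from the text channel of an end-to-end model.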

OpenAI's gpt-realtime-2 release in May 2026 is telling on the latency side. The launch emphasized Big Bench Audio accuracy improvement (81.4% to 96.6%), context window expansion (32K to 128K), and configurable reasoning effort.

It did not chase further latency reduction. First-turn latency is increasingly treated as a solved problem in benchmarks; the frontier has moved to multi-turn memory, persona maintenance, and reasoning depth.

How Do Cost and Vendor Lock-in Differ?

Cascaded architectures win on cost tunability and vendor flexibility. End-to-end systems carry a 4-10x per-minute premium today but reduce integration overhead.

Component | Cost (2026) | Notes
STT (Deepgram) | $0.0077/min | Medical models available
STT (AssemblyAI) | $0.0035/min | Universal-3 Pro
TTS (Cartesia Sonic) | $0.03/min | Expressive voices
gpt-realtime-2 | $0.12-0.20/min audio | $32/$64 per million tokens
PersonaPlex-7B (self-hosted) | Infra cost only | MIT license, open weights

The vendor lock-in story matters more in healthcare than in most verticals. When you are cascaded, you can swap in a better medical STT model the week it ships.

When you are end-to-end, you wait for your vendor to prioritize healthcare accuracy over consumer voice features, and you renegotiate your HIPAA BAA at the system level rather than per component. For compliance teams that need to audit each stage against PHI handling requirements, that concentration of vendor risk is a real cost.

What the Market Is Funding

The capital flowing into healthcare voice AI confirms this is past proof-of-concept. Prosper AI closed a $30M Series A led by Andreessen Horowitz on June 22, 2026, with Base10 Partners, Emergence Capital, Y Combinator, and Company Ventures participating, confirmed across SiliconANGLE, Fierce Healthcare, and Forbes. Prosper reports 35,000 providers on platform and is broadening from voice agents into an integrated AI workforce for scheduling, insurance verification, billing, and payer follow-up.

The broader funding picture tells the same story: Ambience Healthcare raised $243M in 2024 for ambient documentation, Assort Health raised $76M in 2026 for voice AI in healthcare, and Attuned Intelligence raised a $13M seed for hospital communication AI. Investors are funding platforms that address comprehensive care coordination, not point solutions that win a single benchmark.

What This Means for You

If you are architecting a healthcare voice agent today, the decision rubric is straightforward.

For ambient clinical documentation, choose cascaded. The ability to distinguish an STT error from an LLM hallucination is a compliance requirement, not a nice-to-have. Use specialized medical STT, a clinical LLM, and a healthcare-compliant TTS. Latency is effectively irrelevant because notes are generated post-visit.

For patient-facing triage, start cascaded and validate production latency at scale before considering end-to-end. Martin Schweiger, Chief AI Architect and author of Building Enterprise Realtime Voice Agents from Scratch (Salesforce AI Research, March 2026), puts it directly: systems that benchmark at 200ms in the lab can hit 2-3 seconds under load with real patient audio.

Prove the cascaded system first, then migrate only when you have production data confirming sub-500ms at peak.
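
That gate can be written down as a simple check over latency samples collected at peak load. The sketch below assumes the 95th percentile is the statistic that must stay under the 500ms budget; the source says only "sub-500ms at peak", so the percentile choice, the function names, and the sample data are all assumptions.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def ready_to_migrate(peak_latencies_ms: Sequence[float],
                     budget_ms: float = 500, pct: int = 95) -> bool:
    """True only if the chosen percentile of peak-hour latency is under budget."""
    return percentile(peak_latencies_ms, pct) < budget_ms


# Hypothetical peak-hour samples: mostly lab-like 200 ms, with a slow tail under load.
mostly_fast = [200.0] * 95 + [2500.0] * 5
heavier_tail = [200.0] * 94 + [2500.0] * 6

print(ready_to_migrate(mostly_fast))   # True: p95 is still 200 ms
print(ready_to_migrate(heavier_tail))  # False: p95 lands on a 2,500 ms sample
```

A one-point shift in the slow tail flips the decision, which is why an average or a lab benchmark is not enough evidence to migrate.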

For either workflow, demand vendor transparency on failure modes. If an end-to-end provider cannot explain where a triage decision went wrong, that is a disqualifying gap for regulated healthcare use.

Plan for modularity so you can upgrade components as the technology evolves, and monitor the market quarterly. The pace of improvement is fast enough that today's recommendation may shift within a couple of release cycles.

The honest projection: end-to-end systems will likely match cascaded production latency by 2027-2028, but healthcare-specific accuracy requirements will keep favoring cascaded or hybrid constellation approaches for longer. The architecture that ships in healthcare is the one you can audit, and that is still cascaded for most clinical workflows.

Frequently asked questions

What is the difference between cascaded and end-to-end voice agent architecture?

Cascaded architecture chains three specialized models (STT, LLM, TTS) so each stage can be optimized, audited, and swapped independently. End-to-end speech-to-speech models process audio directly to audio in one forward pass, typically with built-in full-duplex interruption handling and lower best-case latency but less transparent failure modes.

Which voice agent architecture is better for healthcare?

For regulated clinical workflows like ambient documentation, cascaded architectures win because errors can be attributed to a specific stage and corrected. For patient-facing triage with validated sub-500ms production latency, end-to-end systems are becoming viable, but most teams still start cascaded and migrate only after production validation.

What is the 300ms latency threshold for voice agents?

Sub-300ms time-to-first-token feels natural to users, 500ms feels delayed, and 700ms-plus feels broken. End-to-end models have crossed this threshold in controlled benchmarks, but real-world load, network conditions, and concurrent users often push production latency higher.

Is Grok Voice a cascaded or end-to-end model?

Grok Voice Think Fast 1.0 is an end-to-end speech-to-speech model, not a cascaded pipeline. The ~780ms latency figure often attributed to a "Groq-cascaded" system actually describes xAI's end-to-end Grok Voice agent, so it should not be used as a cascaded-versus-end-to-end comparison point.

How does Hippocratic AI achieve 99.9% clinical safety in voice AI?

Hippocratic AI's Polaris 5.0 uses a constellation approach: 30-plus specialized models orchestrated with two-level verifiers. This hybrid pattern combines end-to-end flexibility with cascaded verifiability, reflecting that neither pure architecture alone meets the healthcare accuracy bar.