Is AI actually improving medical diagnosis today?

In specific, validated use cases, yes. Annalise CXR improved radiologist perception for 102 of 124 chest X-ray findings in a Lancet Digital Health study and is live across 40+ NHS trusts. But counterexamples like the Epic Sepsis Model, which missed 67% of sepsis cases in external validation and caught only 7% of the cases clinicians had not already treated in time, show that deployment without independent validation is the real risk.

What happened in Mata v. Avianca and why does it matter?

In 2023, two New York attorneys filed a brief containing six fictitious cases generated by ChatGPT, and the court imposed a $5,000 sanction jointly on them and their firm. It matters because the pattern replicated: a tracker by Damien Charlotin has since catalogued more than 900 court decisions involving AI-generated hallucinations across multiple jurisdictions.

Does 'human in the loop' make AI decision-making safe?

Not by itself. Evidence from the IDF's Lavender system, where reviewers reportedly spent under 20 seconds per AI-flagged target, shows that scale and speed can turn human review into a rubber stamp. Meaningful oversight requires time, context, and genuine authority to override the system.

How does the EU AI Act treat high-stakes AI?

The Act, in force since August 2024 with most obligations applying from August 2026, classifies AI in medical devices, criminal justice, and law enforcement as high-risk. That triggers conformity assessments, human oversight, and post-market monitoring. Fines under the Act reach €35 million or 7% of global turnover for prohibited practices, and €15 million or 3% for breaches of high-risk obligations.

What is the biggest documented case of AI bias?

A 2019 Science study by Obermeyer and colleagues examined a widely used commercial healthcare algorithm, of a type applied to roughly 200 million people in the US each year, that used cost as a proxy for need. At the same risk score, Black patients were sicker than white patients, and correcting the bias would have raised the share of Black patients flagged for extra care from 17.7% to 46.5%.

AI Decision-Making in High-Stakes Sectors: Risks and Rewards

A radiologist shortage of 30%, projected to hit 40% by 2028. More than 900 court decisions involving AI-hallucinated case law. A targeting system that flagged 37,000 people while human reviewers reportedly spent less than 20 seconds per name.

Those three numbers, from the UK's Royal College of Radiologists, legal researcher Damien Charlotin's sanctions tracker, and the +972 Magazine investigation into the IDF's Lavender system, describe the same phenomenon from three angles. AI decision-making has already arrived in the highest-stakes corners of modern life.

The question is no longer whether to deploy it, but whether the oversight around it is real or decorative.

AI decision-making in critical sectors means using machine learning systems to inform or drive consequential judgments: who gets diagnosed, who gets sentenced, who gets targeted. The evidence shows it works where independently validated and fails badly where humans defer to unverified outputs.

TL;DR:

AI in healthcare has its strongest evidence base in NHS England radiology, where Annalise CXR improved radiologist accuracy in a peer-reviewed Lancet Digital Health study and now runs across 40+ trusts.
AI in law is being adopted faster than it's being verified. The Mata v. Avianca sanctions cascade has produced 900+ documented hallucination cases in courts worldwide.
In national security, AI targeting operates at a scale that can hollow out human review entirely.
The recurring failure mode isn't the algorithm. It's deployment at scale before independent validation, plus humans who stop checking.

The choice is not between AI and no AI. It is between AI deployed with independent validation and meaningful human override, and AI deployed with neither.

Key takeaways

Demand published external validation before any high-stakes AI deployment. The credible vendors (Annalise.ai, Qure.ai, GRAIL) are the ones publishing data.
"Human in the loop" without time, context, and authority to override is a liability shield, not a safeguard.
Bias is structural, not incidental: a 2025 systematic review found 75% of clinical machine-learning studies reported some form of bias.
The EU AI Act classifies nearly every system discussed here as high-risk, with most obligations applying from August 2026. The US remains a patchwork.

AI in healthcare: where the evidence is strongest

The best-documented success story for AI decision-making is NHS England's chest X-ray program, anchored by peer-reviewed validation rather than vendor claims. Bradford Teaching Hospitals went live with Annalise CXR in May 2024, covering up to 124 findings from lung nodules to misplaced feeding tubes.

The evidence behind it is unusually strong. In a study published in The Lancet Digital Health, Annalise CXR as an assist device significantly improved radiologist perception for 102 of 124 findings, was non-inferior for 19, and degraded accuracy on none.

Deployment has scaled accordingly. In November 2024, Bolton NHS Foundation Trust and six other Greater Manchester trusts rolled out the system across a population of 2.8 million, in a region where lung cancer incidence runs 24% above the national average.

The vendor reports 40+ NHS trusts now use the tool, and systematic reviews of AI in lung cancer screening support the broader pattern.

Beyond imaging, the headline result came from the NHS-Galleri trial, a randomized controlled trial of 142,250 participants run with GRAIL. At ASCO 2026, GRAIL reported that annual multi-cancer blood testing reduced stage IV diagnoses of 12 prespecified cancers by 22% in screening round two and 26% in round three.

Those are vendor-stated figures from a press release, so treat them as promising rather than settled. But a stage-shift signal of that size, in an RCT of that scale, is the kind of evidence most clinical AI never produces.

Figure: NHS-Galleri trial, stage IV reductions reported at ASCO 2026 (vendor-stated): 22% in screening round two, 26% in round three.

The Malawi caution: don't credit AI for human work

Here's where honesty matters. Malawi's under-5 mortality has fallen more than 75% since 2000, per the UN Inter-agency Group for Child Mortality Estimation. A tempting narrative credits digital health and AI.

The evidence doesn't support that attribution. UN reporting credits midwives, skilled birth attendants, and community health workers. Programs like m-mama (emergency transport) and Imaging the World's ultrasound training are real and valuable, but they're digital-enabled human systems, not AI-driven ones.

Conflating the two over-claims AI and under-claims frontline health workers. AI ethics starts with not taking credit that belongs to people.

What happens when AI in law goes wrong?

The legal profession's AI reckoning began with Mata v. Avianca in 2023, when two attorneys filed a brief citing six fictitious ChatGPT-generated cases and Judge P. Kevin Castel imposed a $5,000 sanction jointly on them and their firm. The misconduct was individual. The replication was systemic: Charlotin's tracker now catalogues more than 900 court decisions involving AI hallucinations across the US, UK, Canada, and Australia.

And adoption keeps accelerating anyway. Harvey AI serves major firms including Paul, Weiss and A&O Shearman, which co-built an agentic AI product with the company. Harvey's platform now spans drafting, research, and agentic workflows, competing with Lexis+ AI's Protégé and Thomson Reuters CoCounsel. Law360 Pulse reports clients themselves are now driving adoption.

Regulators are responding. The American Bar Association's Formal Opinion 512 (July 2024) requires lawyers to understand generative AI's risks, protect confidentiality, supervise AI output like any non-lawyer assistant, and not silently bill for time AI saved.

California's COPRAC went further in March 2026, proposing to treat undisclosed AI use in client work as a disciplinable ethical breach.

The operational fix is boring and non-negotiable: verify every AI-cited case in an authoritative database before filing. Firms that haven't made this a workflow gate are running the Avianca experiment again with their own name on the docket.
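A workflow gate of this kind can be sketched in a few lines of Python. This is a minimal illustration, not a description of any firm's or vendor's system: the Citation type, the lookup callback, the in-memory KNOWN_CASES set, and the case names and reporter strings are all hypothetical stand-ins. In practice the lookup would query whatever authoritative citator or database the firm already licenses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Citation:
    case_name: str
    reporter_cite: str  # placeholder format, not a real reporter


# A lookup answers one question: does the authoritative source confirm this case?
Lookup = Callable[[Citation], bool]


def unverified_citations(citations: Iterable[Citation], lookup: Lookup) -> list[Citation]:
    """Return every citation the authoritative source could not confirm."""
    return [c for c in citations if not lookup(c)]


def filing_gate(citations: Iterable[Citation], lookup: Lookup) -> None:
    """Raise if any citation is unconfirmed, so the filing step cannot proceed."""
    missing = unverified_citations(citations, lookup)
    if missing:
        names = "; ".join(f"{c.case_name}, {c.reporter_cite}" for c in missing)
        raise ValueError(f"Unverified citations, do not file: {names}")


# Hypothetical stand-in for an authoritative database (assumption, not a real API).
KNOWN_CASES = {Citation("Example v. Placeholder", "1 Hyp. 100")}


def in_memory_lookup(citation: Citation) -> bool:
    return citation in KNOWN_CASES


draft = [
    Citation("Example v. Placeholder", "1 Hyp. 100"),
    Citation("Invented v. Nonexistent", "2 Hyp. 200"),  # simulates a hallucinated case
]
print(unverified_citations(draft, in_memory_lookup))
```

The design choice that matters is that filing_gate raises rather than warns: an unverified citation blocks the filing step instead of producing a notice someone can click past.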

Why does "human in the loop" keep failing?

Human oversight fails when the system's scale and speed make genuine review impossible, converting judgment into ratification. The starkest documented case is Lavender, the IDF targeting system investigated by +972 Magazine in April 2024. It flagged as many as 37,000 Palestinians as potential operatives; reviewers reportedly spent under 20 seconds per recommendation. The IDF disputed parts of the reporting, then acknowledged the system exists as a recommendation tool with humans making final calls.

The same dynamic runs through the Pentagon's Maven Smart System, now a multi-billion-dollar program of record supporting targeting workflows, and Palantir's AIP, marketed explicitly for warfare. DoD Directive 3000.09 requires traceability and senior review for autonomous weapons.

But a policy requirement for review is not the same as an interface, a timeline, and an incentive structure that make review real.

This is automation complacency, and it's sector-agnostic. The more accurate a system is, the more dangerous its failures become, because humans stop checking. Radiologists defer to triage that's right 95% of the time. Lawyers stop verifying citations that are usually real.

The countermeasure is designing for engaged oversight: monitor override rates, give reviewers time, and make the override path cheaper than the rubber stamp.
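Monitoring override rates and review time can start from a plain log of reviewer decisions. The sketch below is one possible shape, under stated assumptions: the Review record, its field names, the sample rows, and the 60-second floor are all invented for illustration and are not drawn from any system discussed here.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Review:
    reviewer: str
    ai_recommendation: str
    human_decision: str
    seconds_spent: float


def oversight_metrics(reviews: list[Review], min_seconds: float) -> dict[str, float]:
    """Summarize how often humans disagree with the AI and how fast they decide."""
    if not reviews:
        raise ValueError("no reviews to summarize")
    n = len(reviews)
    overrides = sum(r.human_decision != r.ai_recommendation for r in reviews)
    rushed = sum(r.seconds_spent < min_seconds for r in reviews)
    return {
        "override_rate": overrides / n,   # share of decisions that differ from the AI
        "rushed_share": rushed / n,       # share of reviews under the time floor
        "median_seconds": median([r.seconds_spent for r in reviews]),
    }


# Hypothetical log rows; the 60-second floor is an arbitrary illustrative threshold.
log = [
    Review("a", "flag", "flag", 12.0),
    Review("a", "flag", "flag", 15.0),
    Review("b", "flag", "clear", 240.0),
    Review("b", "clear", "clear", 18.0),
]
metrics = oversight_metrics(log, min_seconds=60.0)
print(metrics)
```

A very low override rate combined with a high rushed share is the pattern to watch for: it suggests ratification rather than review, whatever the policy document says.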

AI risks and rewards: the documented record

The rewards are speed, scale, and consistency; the risks are bias, hallucination, and over-reliance, and all six are now quantified rather than hypothetical.

Sector	Flagship system	Strongest evidence	Documented failure mode
Healthcare	Annalise CXR, Galleri	Lancet Digital Health validation; 142,250-person RCT	Epic Sepsis Model: missed 67% of sepsis cases; about 88% of alerts were false alarms (JAMA Internal Medicine, 2021)
Law	Harvey, CoCounsel, Lexis+ AI	Enterprise adoption at top firms	900+ hallucination decisions since Mata v. Avianca
National security	Maven Smart System, Lavender	Operational scale across theaters	20-second human review of lethal targeting recommendations

The bias record deserves its own line. Obermeyer and colleagues showed in Science (2019) that a widely used commercial algorithm, of a type applied to roughly 200 million people in the US each year, used healthcare cost as a proxy for need. At the same risk score, Black patients were sicker than white patients, and correcting the bias would have raised the share of Black patients flagged for extra care from 17.7% to 46.5%.

Dressel and Farid showed in Science Advances (2018) that COMPAS, used in pretrial risk assessment, predicted recidivism no better than a two-variable model of age and prior offenses.

And the commercial post-mortems rhyme. IBM Watson for Oncology produced "unsafe and incorrect" recommendations and cost MD Anderson $62 million before being pulled. Zillow Offers shut down in November 2021 after its pricing model overpaid at scale, taking a quarter of the company's workforce with it.

Every one of these failures shared three features: high cost of error, deployment before independent validation, and an organization strongly incentivized to trust its own model.

Who regulates AI decision-making?

The EU AI Act is the only comprehensive law in force, classifying medical, criminal-justice, and law-enforcement AI as high-risk. It entered into force in August 2024; most obligations apply from August 2026. High-risk classification triggers conformity assessments, documentation, human oversight, and post-market monitoring. Fines under the Act reach €35 million or 7% of global turnover for prohibited practices, and €15 million or 3% for breaches of high-risk obligations.

The US, by contrast, is a patchwork. NIST's AI Risk Management Framework (2023) is voluntary but widely referenced. FDA guidance on AI/ML medical devices, including Predetermined Change Control Plans for post-deployment model updates, governs the clinical slice.

The Biden-era Executive Order 14110 was rescinded in January 2025 in favor of an "AI dominance" posture, and Congress has still not passed a comprehensive AI law. International instruments (Bletchley, the G7 Hiroshima Code of Conduct, the OECD Principles, the UN's "Governing AI for Humanity" report) remain non-binding.

The practical consequence: if you deploy high-stakes AI in or into Europe, compliance is a hard requirement on a 2026 clock. Everywhere else, the standard you hold yourself to is largely the standard you choose.

What this means for you

If you're deploying AI into consequential decisions, the evidence reviewed here compresses into four moves:

Make validation a procurement gate. Require published, external validation on a population resembling yours. No paper, no purchase order.
Instrument the human. Track acceptance and override rates on AI recommendations. A 99% acceptance rate isn't a sign the model is great; it's a sign nobody is checking.
Disclose. To patients, to clients, to courts. The ABA, the AMA, and the EU AI Act are all converging on disclosure as the floor.
Plan for drift. Models change after deployment. Borrow the FDA's change-control framing even where you're not regulated.
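The first move, validation as a procurement gate, comes down to arithmetic on an external study's confusion matrix. The Python sketch below shows that arithmetic under stated assumptions: the ValidationCounts type, the counts, and the thresholds are hypothetical and chosen only for illustration; they are not taken from any study cited in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationCounts:
    """Confusion-matrix counts from an external validation cohort."""
    true_positives: int
    false_positives: int
    false_negatives: int
    true_negatives: int

    @property
    def sensitivity(self) -> float:
        # Share of real cases the model caught. Assumes the cohort has real cases.
        return self.true_positives / (self.true_positives + self.false_negatives)

    @property
    def ppv(self) -> float:
        # Share of alerts that were real cases. Assumes the model raised alerts.
        return self.true_positives / (self.true_positives + self.false_positives)


def passes_gate(counts: ValidationCounts, min_sensitivity: float, min_ppv: float) -> bool:
    """True only if the external results clear both thresholds."""
    return counts.sensitivity >= min_sensitivity and counts.ppv >= min_ppv


# Hypothetical cohort: 100 real cases among 10,000 patients.
external = ValidationCounts(
    true_positives=40, false_positives=360, false_negatives=60, true_negatives=9540
)
print(external.sensitivity, external.ppv)
print(passes_gate(external, min_sensitivity=0.8, min_ppv=0.3))
```

In this invented cohort the model misses 60 of 100 real cases and nine in ten alerts are false, so it fails the gate. The thresholds themselves are a clinical and organizational decision; the code only makes the decision explicit and checkable.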

The technology is moving faster than the law, and the people most affected by AI decisions are usually the least equipped to challenge them. Closing that gap is not a compliance chore.

It's the difference between the NHS radiology story and the Epic Sepsis story, and every organization deploying AI is currently choosing which one it wants to be.

Sources

Bradford Teaching Hospitals welcomes AI technology in Radiology, first-party NHS trust announcement of the Annalise CXR go-live, May 2024.
GRAIL reports full results from the NHS-Galleri trial at ASCO 2026, vendor-stated stage IV reduction figures from the 142,250-participant RCT.
Systematic review: AI in lung cancer screening on chest X-rays (PMC), peer-reviewed context for CXR AI performance.
Annalise.ai on the Sectra Amplifier Marketplace, vendor deployment footprint across NHS trusts.
Paul, Weiss partners with Harvey AI on new AI workflows, first-party firm announcement of enterprise legal AI adoption.
A&O Shearman, Harvey partner on agentic AI product (Law360 Pulse), reporting on agentic AI in big law.
Harvey, AI software for legal and professional services, vendor platform documentation.
Harvey (software), Wikipedia, third-party overview of Harvey's deployment across legal workflows.
