On January 16, 2026, ClickHouse acquired Langfuse, the open-source LLM observability platform, in the largest consolidation event the LLMOps market has seen. That deal is a useful signal: enterprise LLM spend more than doubled, from $3.5B to $8.4B, in twelve months according to Menlo Ventures' mid-2025 enterprise update, and the operational tooling around LLMs is now valuable enough to buy outright.
LLMOps is the discipline of operating large language model applications in production. It extends MLOps rather than replacing it, adding prompt engineering, RAG evaluation, and token-cost monitoring to the traditional model lifecycle. That framing, LLMOps as a specialization of MLOps rather than a separate discipline, is the consensus across Microsoft's GenAIOps playbook, Google Cloud, Red Hat, and MLflow.
TL;DR: Your model registry and CI/CD survive the transition. What changes is the artifact: you ship a prompt-plus-retrieval-plus-tools bundle instead of trained weights, you evaluate with task-specific judges instead of held-out accuracy, and you defend against a failure mode classical MLOps never had: the vendor changing your model behind an unchanged API endpoint.
Key takeaways:
- LLMOps adds three artifacts to MLOps: a prompt registry, an eval harness, and trace-level production observability.
- Silent model drift, where a vendor update degrades your app with no code change on your side, is now a first-class failure mode. Pin model IDs in every trace.
- Prompt caching cuts cost by up to 90% per AWS Bedrock's figures. Do it before any other optimization.
- 37% of enterprises run five or more models in production, per a16z's January 2026 analysis, so a routing gateway outlasts any single provider choice.
- The eval set is the new model artifact. Version it, regression-test it, and gate merges on it.
What is the actual difference between LLMOps and MLOps?
The difference comes down to what you deploy and how fast it changes. MLOps ships a trained model artifact on a weekly or monthly cadence. LLMOps ships a bundle of prompts, retrieval indices, tool definitions, and guardrails that changes at the speed of business context.
Microsoft's framing splits the work into an inner loop (prompt and retrieval iteration, paced in minutes to hours) and an outer loop (deployment, evaluation, monitoring). Classical MLOps centered the outer loop because the model was stable.
LLMOps inverts that: most engineering hours go into the inner loop, while most failures emerge in the outer one.
| Dimension | MLOps | LLMOps |
|---|---|---|
| Primary artifact | Trained model weights | Prompt + retrieval + tools + guardrails |
| Inner-loop cadence | Days to weeks | Minutes to hours |
| Determinism | Deterministic at inference | Non-deterministic outputs |
| Primary eval signal | Accuracy/AUC on held-out set | LLM-as-Judge, RAG metrics, human preference |
| Cost driver | GPU-hours | Token spend and context size |
| Top monitoring need | Feature and prediction drift | Hallucination, retrieval precision, cost-per-task, injection attempts |
The team structure shifts too. The dominant 2026 pattern seats eval-harness and platform ownership with a central team while product teams own their prompts and retrieval indices.
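To make the "bundle as artifact" idea concrete, here is a minimal Python sketch of a versioned release bundle. The class name, field names, and IDs are all hypothetical; the only point is that prompt, model snapshot, retrieval index, tools, and guardrails are hashed together, so a change to any one of them produces a new version.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ReleaseBundle:
    """Everything that ships together. All field names are illustrative."""
    prompt_template: str
    model_id: str               # exact snapshot, not just the model family
    retrieval_index: str        # identifier of a specific index build
    tools: tuple[str, ...]      # names of the tool definitions in use
    guardrails: tuple[str, ...]

    def version(self) -> str:
        # Canonical JSON (sorted keys) so identical content hashes identically.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


v1 = ReleaseBundle(
    prompt_template="Answer using only the context:\n{context}\n\nQ: {question}",
    model_id="example-model-2026-01-01",   # hypothetical snapshot ID
    retrieval_index="kb-index-0042",       # hypothetical index build
    tools=("search_kb",),
    guardrails=("pii_filter",),
)

# A one-line prompt edit yields a different bundle version.
v2 = replace(v1, prompt_template=v1.prompt_template + "\nBe concise.")
print(v1.version(), v2.version())
```

Hashing the whole bundle, rather than versioning the prompt alone, means a retrieval-index rebuild or a guardrail change also shows up as a new version in traces and eval runs.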
Why do LLM apps degrade when nobody changed anything?
Because there are now three drift modes instead of two. AWS's prescriptive guidance on GenAI drift covers the classical pair, data drift and concept drift, and adds drift in upstream foundation-model behavior. A 2026 paper from the Semantic Fidelity Lab names the third mode silent model drift: the vendor pushes an update behind the same endpoint, with no version bump you can see.
The numbers from regression-testing vendor bards.ai illustrate the blast radius, though they come from a single source and should be read with that caveat. In one production case, an unrelated vendor-side regression at agent step 3 dropped end-task success from 89% to 83%, and a single prompt edit broke 8% of evaluation traces.
The defense is mechanical. Record the exact model ID and snapshot in every trace, never just the model family. Then replay a small golden set against the production endpoint continuously, and alert on a statistically significant shift in any metric.
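A minimal sketch of both halves of that defense, in Python. The `TraceRecord` schema, the function names, and the choice of a two-proportion z-test on the golden-set pass rate are assumptions made for illustration; any drift statistic suited to your metric would fit the same shape.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRecord:
    trace_id: str
    model_id: str        # full snapshot ID as reported by the provider
    prompt_version: str
    passed: bool         # outcome of the golden-set check for this call


def unexpected_model_ids(traces: list[TraceRecord], pinned: str) -> list[str]:
    """Model IDs seen in traces that differ from the pinned one."""
    return sorted({t.model_id for t in traces if t.model_id != pinned})


def pass_rate_drop_z(baseline: list[bool], current: list[bool]) -> float:
    """Two-proportion z-score; positive means the current pass rate is lower."""
    n1, n2 = len(baseline), len(current)
    p1, p2 = sum(baseline) / n1, sum(current) / n2
    pooled = (sum(baseline) + sum(current)) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (p1 - p2) / se


def should_alert(baseline: list[bool], current: list[bool],
                 z_threshold: float = 2.33) -> bool:
    # 2.33 is roughly a one-sided 1% false-alarm rate (normal approximation).
    return pass_rate_drop_z(baseline, current) > z_threshold


baseline = [True] * 180 + [False] * 20   # 90% pass rate in the reference window
current = [True] * 150 + [False] * 50    # 75% pass rate in the latest window
print(should_alert(baseline, current))
```

Small golden sets only detect large drops with this kind of test, so the set size and alert threshold are worth tuning together.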
Eval hygiene matters even for the vendors themselves. Anthropic shipped three quality regressions in Claude Code over six weeks that its own evals failed to catch, as documented in Simon Willison's 2026 coverage. If the model builders can miss regressions, your eval set needs the same versioning and maintenance discipline as production code.
How do you keep LLM inference fast and cheap?
Cache first. AWS Bedrock advertises up to 90% cost reduction and 85% latency reduction on cached prompt prefixes, and the same pattern is standard across Anthropic and OpenAI.
Because per-token rates collapsed over 80% in two years (GPT-4o's $5/$15 per million tokens gave way to GPT-4.1-mini at $0.40/$1.60), total cost is now dominated by how much context you load per request rather than the rate you pay for it.
Long context got cheaper too. Anthropic introduced a 1M-token window for Claude Sonnet 4 in August 2025 with a 2x surcharge above 200K tokens; by March 2026 the surcharge was gone for Opus 4.6 and Sonnet 4.6.
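The arithmetic behind "context size dominates cost" can be sketched in a few lines of Python. The default rates are the GPT-4.1-mini figures quoted above; the assumption that cached input tokens bill at 10% of the normal rate simply mirrors the "up to 90%" headline and will differ by provider and model.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    cached_tokens: int = 0,
    input_per_m: float = 0.40,      # USD per million input tokens
    output_per_m: float = 1.60,     # USD per million output tokens
    cached_discount: float = 0.90,  # assumed discount on cached input tokens
) -> float:
    """Estimated cost of one request, with part of the input served from cache."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    cached_rate = input_per_m * (1 - cached_discount)
    total = (
        uncached_tokens * input_per_m
        + cached_tokens * cached_rate
        + output_tokens * output_per_m
    )
    return total / 1_000_000


# A 20,000-token prompt (mostly a stable system prompt plus retrieved context)
# with a 500-token answer, without and with an 18,000-token cached prefix.
cold = request_cost_usd(20_000, 500)
warm = request_cost_usd(20_000, 500, cached_tokens=18_000)
print(f"cold=${cold:.5f} warm=${warm:.5f} saving={1 - warm / cold:.0%}")
```

Under these assumptions the cold request costs about $0.0088 and the warm one about $0.0023. Output tokens and the uncached tail are never discounted, which is why the saving on the whole request is lower than the headline discount on the cached prefix.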
For self-hosted serving, vLLM from UC Berkeley is the default. Its PagedAttention design delivers 2-4x the throughput of FasterTransformer per the SOSP 2023 paper, and a 2026 SitePoint benchmark measured vLLM at 2.8s p99 latency versus Ollama's 24.7s at 50 concurrent users.
Stripe reportedly cut inference cost 73% running 50M daily calls on vLLM, though that figure comes from a third-party consultant rather than an audited source.
Measure with the vocabulary from NVIDIA's NIM benchmarking docs: time-to-first-token, inter-token latency, tokens per second, all at p50/p95/p99.
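As a rough illustration of those metrics, the Python sketch below derives them from per-token arrival timestamps of a single streamed response and then summarizes a batch with nearest-rank percentiles. The function names and these simplified definitions are assumptions; check them against NVIDIA's documentation before comparing numbers across tools.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def stream_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Latency metrics for one streamed response; all times are in seconds."""
    ttft = token_times[0] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    inter_token = sum(gaps) / len(gaps) if gaps else 0.0
    total = token_times[-1] - request_start
    return {
        "ttft_s": ttft,                            # time to first token
        "itl_s": inter_token,                      # mean inter-token latency
        "tokens_per_s": len(token_times) / total,  # per-request throughput
    }


# One response: first token after 0.5 s, then a token every 0.1 s.
one = stream_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
print(one)

# Summarize TTFT across many requests at p50/p95/p99 (synthetic values).
ttfts = [0.2 + 0.01 * i for i in range(100)]
print({p: round(percentile(ttfts, p), 2) for p in (50, 95, 99)})
```

Reporting percentiles rather than means matters here because tail latency under concurrency is where serving stacks differ most.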
Which LLMOps tools should you actually adopt?
The market settled into three layers, and knowing which layer a tool occupies answers most build-vs-buy questions.
Orchestration: LangChain with LangSmith (Developer free, Plus $39/seat/month), LlamaIndex for RAG-first work, DSPy for programmatic prompt optimization, and Haystack for compliance-heavy pipelines.
Experiment tracking: MLflow 3 (Apache 2.0, 30M+ monthly downloads, now with a prompt registry and GenAI evals) and W&B Weave. TensorBoard remains useful for classical ML but cannot render a multi-turn agent trace.
Observability and evaluation: Langfuse (MIT-licensed, OpenTelemetry-native, now ClickHouse-owned), Arize Phoenix, Helicone (free up to 10K requests/month), plus RAGAS for the four reference-free RAG metrics and TruLens for the RAG Triad of context relevance, groundedness, and answer relevance.
The decision rule: extend your MLOps platform when fine-tuning dominates and you already run a registry. Adopt a dedicated LLMOps layer when prompt iteration, agent tracing, and cost-per-token monitoring dominate. Most teams land on a hybrid, with LiteLLM as the unified gateway across 100+ providers.
The gateway matters because the provider market keeps moving. Menlo Ventures' data shows how fast the shares shifted in one year: OpenAI's share had been roughly 50% just a year before the latest figures.
Keep prompts, eval sets, and trace schemas in vendor-neutral formats (OpenTelemetry spans, JSONL datasets) so the framework underneath can be swapped without losing the eval asset.
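A vendor-neutral eval set can be as plain as one JSON object per line. The Python sketch below round-trips such a file using only the standard library; the field names (`id`, `input`, `expected`, `tags`) are hypothetical and not tied to any particular tool's schema.

```python
import io
import json
from typing import Any, TextIO


def write_jsonl(records: list[dict[str, Any]], fh: TextIO) -> None:
    """Write one JSON object per line, with sorted keys for stable diffs."""
    for record in records:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def read_jsonl(fh: TextIO) -> list[dict[str, Any]]:
    """Read a JSONL stream back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in fh if line.strip()]


golden_set = [
    {"id": "refund-001", "input": "How do I get a refund?",
     "expected": "Points to the refund policy page", "tags": ["billing"]},
    {"id": "login-004", "input": "I can't log in.",
     "expected": "Asks which sign-in method is used", "tags": ["auth"]},
]

buffer = io.StringIO()          # stands in for a file checked into the repo
write_jsonl(golden_set, buffer)
buffer.seek(0)
print(read_jsonl(buffer) == golden_set)
```

Because the file is line-oriented text, it diffs cleanly in code review and loads into most evaluation tools with a thin adapter.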
What does AI agent operations look like in production?
The honest picture includes both wins and reversals. Klarna's OpenAI assistant handled 2.3M conversations a month, the workload of roughly 700 agents, with $40M in projected savings. By Q3 2025 the company had saved about $60M but watched CSAT fall 22% on complex queries; the CEO told Bloomberg they "cut too deep" and rehired humans.
These are company-reported numbers, but the lesson holds: automation rate without a quality gate is a vanity metric.
Morgan Stanley's deployment shows the durable pattern. 98% of its 16,000+ advisor teams actively use its GPT-4 assistant, and the load-bearing component is a custom eval framework where advisors and prompt engineers grade summarization output. The eval set is treated as a continuously maintained asset.
For orchestration, LangGraph dominates stateful multi-step agents, running in production at Klarna, Uber, LinkedIn, and J.P. Morgan, with a frozen 1.0 API as of April 2026. NVIDIA's scale-out playbook for taking a LangGraph agent from 1 to 1,000 users follows a simple sequence: profile a single user, load-test, then monitor with OpenTelemetry and OpenInference semantic conventions.
The metrics set that matters for agents: task success rate (end-to-end, since per-step metrics hide compounding failures), dollars per successful task, time-to-first-token, prompt-injection attempts, and tool-call failure rate. Governance rides alongside: EU AI Act Article 53 obligations, NIST AI RMF mapping, and 90-day trace retention for incident investigation are integration work your platform team owns, because no orchestration framework ships them out of the box.
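A minimal Python sketch of three of those metrics, plus the compounding effect that per-step numbers hide. The `AgentRun` record and its fields are hypothetical, and the compounding function assumes steps fail independently, which real agents only approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    succeeded: bool      # did the whole task succeed end to end?
    cost_usd: float      # total model and tool spend for the run
    tool_calls: int
    tool_failures: int


def agent_metrics(runs: list[AgentRun]) -> dict[str, float]:
    successes = sum(r.succeeded for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    calls = sum(r.tool_calls for r in runs)
    failures = sum(r.tool_failures for r in runs)
    return {
        "task_success_rate": successes / len(runs),
        # Failed runs still cost money, so divide ALL spend by successes only.
        "usd_per_successful_task": total_cost / successes if successes else float("inf"),
        "tool_call_failure_rate": failures / calls if calls else 0.0,
    }


def end_to_end_success(per_step_rate: float, steps: int) -> float:
    """End-to-end success if every step must succeed and steps are independent."""
    return per_step_rate ** steps


runs = [
    AgentRun(True, 0.10, 4, 0),
    AgentRun(True, 0.10, 5, 1),
    AgentRun(True, 0.10, 3, 0),
    AgentRun(False, 0.20, 8, 3),
]
print(agent_metrics(runs))
print(round(end_to_end_success(0.98, 10), 3))
```

Under the independence assumption, a 98% per-step success rate over ten steps compounds to roughly 82% end to end, which is why the task-level number is the one to alert on.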
What this means for you
The 90-day ladder is deliberately boring, and that is its strength.
- Days 0-30: route every call through LiteLLM with model and prompt version recorded, stand up Langfuse or Phoenix for traces, and build a golden eval set of 50-200 prompts that runs in CI on every prompt change.
- Days 30-60: add RAGAS or TruLens metrics and a calibrated LLM-as-Judge, targeting at least 80% agreement with human graders, and gate merges on eval results.
- Days 60-90: get 100% of production requests traced, trend cost per business task, and push prompt cache hit rate above 40%.
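The judge-calibration and merge-gate steps can be sketched as a small CI check in Python. The function names are hypothetical; the 80% agreement floor comes from the target above, while the 95% minimum pass rate is an assumed placeholder you would set per product.

```python
def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of items where the LLM judge and the human grader agree."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def merge_allowed(
    judge_labels: list[str],
    human_labels: list[str],
    eval_pass_rate: float,
    min_agreement: float = 0.80,   # target from the 90-day plan
    min_pass_rate: float = 0.95,   # assumed threshold, tune per product
) -> bool:
    """Trust the judge's verdict only if the judge itself is calibrated."""
    judge_is_calibrated = agreement_rate(judge_labels, human_labels) >= min_agreement
    return judge_is_calibrated and eval_pass_rate >= min_pass_rate


human = ["pass", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "fail", "fail"]
print(agreement_rate(judge, human))
print(merge_allowed(judge, human, eval_pass_rate=0.96))
```

Raw agreement does not correct for agreement by chance, so on heavily imbalanced label sets a chance-corrected statistic such as Cohen's kappa is a safer calibration check.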
The anti-pattern is buying LangSmith, Langfuse, Phoenix, and Helicone on day one and ending up with four incompatible dashboards and no golden set. Eval first, observability second, continuous improvement third.
If you internalize one thing: the eval set is the new model artifact. It is the only durable defense against silent regressions, judge bias, and vendor-side model swaps, and it deserves the same versioning, testing, and ownership you give production code.
Sources
- Microsoft GenAIOps: Operational management of LLMs
- Google Cloud: What is LLMOps
- Red Hat: What is LLMOps
- MLflow: LLMOps guide
- AWS: Detecting drift in production GenAI applications
- Menlo Ventures enterprise LLM spend update
- a16z: Leaders and gainers in the enterprise AI arms race
- AWS Bedrock prompt caching
- Anthropic: Claude Sonnet 4 1M-token context
- vLLM, UC Berkeley Sky Computing Lab
- NVIDIA NIM LLM benchmarking metrics
- ClickHouse acquires Langfuse
- RAGAS: Automated evaluation of RAG (arXiv)
- TruLens RAG Triad
- OpenAI: Morgan Stanley case study
- NVIDIA: Scaling LangGraph agents in production
- Simon Willison: How we contain Claude
- EU AI Act Article 53
