Question 1

What is production AI observability?

Accepted Answer

Production AI observability is the practice of tracing, evaluating, and monitoring LLM and agent applications in production. It covers request traces, token cost, eval quality, and drift or regression detection, and it extends traditional software observability with LLM-specific signals such as prompt version, model, and judge-based quality scoring.
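To make those signals concrete, here is a minimal sketch of what a single trace record might carry. The `LLMTrace` class, its field names, the `example-model` name, and the prices in `PRICE_PER_1K` are all hypothetical, chosen for illustration; no tool listed here is claimed to use this schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-1K-token prices in USD. Real prices vary by model and vendor.
PRICE_PER_1K = {"example-model": {"input": 0.50, "output": 1.50}}


@dataclass
class LLMTrace:
    """One request's worth of LLM-specific observability signals (illustrative)."""
    trace_id: str
    model: str
    prompt_version: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    judge_score: Optional[float] = None  # 0.0-1.0 from an LLM-as-judge eval, if one ran

    def cost_usd(self) -> float:
        """Token cost for this request under the assumed price table."""
        price = PRICE_PER_1K[self.model]
        return (self.input_tokens * price["input"]
                + self.output_tokens * price["output"]) / 1000


trace = LLMTrace("t-001", "example-model", "v3", 1200, 300, 840.0, judge_score=0.9)
print(f"{trace.trace_id}: ${trace.cost_usd():.2f}, judge={trace.judge_score}")
```

Keeping the prompt version and model on every record is what lets you later group cost and quality by release, which is where regression detection starts.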

Question 2

Which observability tools support OpenTelemetry?

Accepted Answer

Arize Phoenix is built on OpenTelemetry (OTel), and Langfuse offers partial OTel compatibility. OpenTelemetry GenAI provides the semantic-conventions spec that other tools can build on. If OTel-native instrumentation matters to your existing pipeline, Phoenix is the closest fit.
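As a rough illustration of what "building on the spec" means, the sketch below assembles span attributes using attribute names from the OpenTelemetry GenAI semantic conventions as I understand them (`gen_ai.operation.name`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`). The conventions are still evolving, so check the names against the current spec. The helper functions and the `example-model` name are hypothetical, and the code uses only the standard library rather than the OpenTelemetry SDK.

```python
def genai_span_name(operation: str, model: str) -> str:
    """Span name in the '{operation} {model}' form the conventions suggest."""
    return f"{operation} {model}"


def genai_span_attributes(model: str, input_tokens: int, output_tokens: int,
                          operation: str = "chat") -> dict:
    """Build a plain dict of GenAI-style span attributes (illustrative helper)."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


attrs = genai_span_attributes("example-model", 1200, 300)
name = genai_span_name("chat", "example-model")

# With the OpenTelemetry SDK installed, each pair would typically be attached
# to an active span via span.set_attribute(key, value).
for key, value in attrs.items():
    print(f"{name}: {key}={value}")
```

Because the attribute names are shared, any backend that understands the conventions can read the same spans, which is the practical benefit of OTel-native instrumentation.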

Question 3

Can I self-host an LLM observability stack?

Accepted Answer

Yes. Arize Phoenix, Langfuse, and Lunary are open-source and self-hostable. Helicone offers a self-host option. Braintrust is primarily a managed platform. Self-hosting gives you data residency control but adds operational burden.

Question 4

How do I choose between eval-first and tracing-first tools?

Accepted Answer

If your primary risk is quality regression (hallucination, drift), an eval-first platform like Braintrust or Langfuse fits. If your primary risk is debugging production failures and latency, a tracing-first tool like Phoenix or Helicone fits. Most mature teams end up needing both, which is why OTel-compatible tracing plus a separate eval store is a common pattern.
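The eval side of that pattern can be sketched as a simple regression check over judge scores pulled from an eval store. This is a minimal sketch under stated assumptions: the `detect_regression` function, the 0.05 threshold, and the sample scores are hypothetical, and a mean comparison is a deliberately simple stand-in for whatever statistics a real platform applies.

```python
from statistics import mean


def detect_regression(baseline, candidate, max_drop=0.05):
    """Flag a regression when the mean judge score drops by more than max_drop.

    Returns (regressed, drop), where drop = mean(baseline) - mean(candidate).
    """
    drop = mean(baseline) - mean(candidate)
    return drop > max_drop, drop


# Hypothetical judge scores (0.0-1.0) for the same eval set under two prompt versions.
baseline_scores = [0.9, 0.8, 1.0, 0.9]
candidate_scores = [0.7, 0.8, 0.8, 0.7]

regressed, drop = detect_regression(baseline_scores, candidate_scores)
print(f"regressed={regressed}, drop={drop:.2f}")
```

A check like this catches quality drops that tracing alone would miss, since a regressed response can still return quickly and without errors.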

Question 5

Is there a free or open-source option?

Accepted Answer

Yes. Arize Phoenix, Langfuse, and Lunary are open-source. OpenTelemetry GenAI is a free spec. Several managed tools offer free tiers, but verify current limits against each vendor's pricing page.

Tool	Request tracing	LLM-as-judge eval	Cost / token tracking	Drift / regression detection	Self-hostable	OpenTelemetry-compatible	Notes
Arize Phoenix	✓	✓	~	~	✓	✓	Open-source LLM tracing + eval; OTel-native.
Langfuse	✓	✓	✓	~	✓	~	Open-source LLM engineering platform; tracing, eval, cost.
Helicone	✓	—	✓	—	~	—	LLM observability + caching + cost tracking.
Braintrust	~	✓	~	✓	—	—	Eval-first platform; prompt playground + regression detection.
Lunary	✓	~	✓	—	✓	—	Open-source LLM observability; tracing + cost.
OpenTelemetry GenAI	~	—	—	—	✓	✓	Semantic-conventions spec, not a product. Foundation other tools build on.
Legend: ✓ = supported; ~ = partial or limited; — = not supported or not applicable.

The production AI observability matrix
