The OpenTelemetry project now has a versioned, citable answer to the question "what does a correct AI agent trace look like?" As of June 17, 2026, the GenAI agent-spans specification sits at release v1.41.1, tagged 2026-05-11, and it defines exactly five agent span operations.
It is also still labeled Development, which means the names can change under you. Both facts matter, and this guide treats them as a single engineering problem.
If you are building agent observability with the OpenTelemetry GenAI semantic conventions, the durable move is to emit those five spans now and insulate your code from the churn. The vendor side is converging fast: ClickHouse acquired Langfuse in January 2026, and Cisco completed its acquisition of Galileo in May 2026.
Both consolidated platforms ingest OTel-shaped traces as their primary input. Picking a non-OTel wire format in 2026 means opting out of the two biggest acquisitions in the category.
TL;DR. The OTel GenAI spec defines five agent spans (create_agent, invoke_agent_client, invoke_agent_internal, invoke_workflow, execute_tool) plus a gen_ai.* attribute vocabulary. The spec is still Development, so shim, pin, dual-emit, and test. Your traces land in warehouse-first backends like Langfuse on ClickHouse or Splunk via Galileo.
Key takeaways
- The agent spec defines five named span operations, every one carrying
gen_ai.operation.name. - Five attributes do most of the work: provider, model, input tokens, output tokens, conversation ID.
- The spec is
Developmentas of v1.41.1 (2026-05-11); names can still break. gen_ai.usage.reasoning.output_tokens(added v1.41.0) is the new cost trap for o-series and extended-thinking models.- Warehouse-first storage (Postgres for state, ClickHouse for OLAP) is now the reference architecture.
What are the OpenTelemetry GenAI semantic conventions?
The OpenTelemetry GenAI semantic conventions are a standard vocabulary of span names and attributes for tracing generative-AI systems, including a dedicated page for AI agents that defines five span operations and a gen_ai.* attribute set for models, tools, and token usage.
The agent-spans page lists the five operations in this order:
| Span operation | Kind | What it wraps |
|---|---|---|
create_agent |
client | Instantiating an agent (a LangGraph graph, an AutoGen agent, a hand-rolled class) |
invoke_agent_client |
client | The caller side, treating the agent as a remote service |
invoke_agent_internal |
internal | The agent's top-level reasoning loop; parent of model and tool calls |
invoke_workflow |
client | A discrete workflow step or sub-graph node |
execute_tool |
client | The agent running a tool: a function, retrieval, code interpreter, shell |
There are two execute_tool definitions in the wider namespace, and the distinction trips people up. The LLM provider asking for a function is a gen_ai.chat span with a gen_ai.tool.* event.
The agent runtime actually running that function is the execute_tool operation on the agent page. The generative-AI spans page covers the non-agent operations like chat, embeddings, and generate_content.
The gen-ai.* attribute table you actually need
The full registry is large. In practice, five attributes are the ones every backend (Langfuse, ClickHouse ClickStack, Splunk, Honeycomb) projects into a queryable column, and the rest is gravy.
| Attribute | Required for | Notes |
|---|---|---|
gen_ai.operation.name |
all agent spans | One of the five values above |
gen_ai.provider.name |
all agent spans | Renamed from gen_ai.system in v1.37 |
gen_ai.request.model |
model invocations | Required |
gen_ai.usage.input_tokens |
model invocations | Conditionally Required when provider returns counts |
gen_ai.usage.output_tokens |
model invocations | Conditionally Required |
gen_ai.conversation.id |
session-scoped spans | Your primary grouping key |
gen_ai.usage.reasoning.output_tokens |
reasoning models | Opt-In, added v1.41.0 |
gen_ai.tool.name / gen_ai.tool.call.id |
execute_tool |
Required on tool spans |
error.type |
failed spans | Stable |
The gen_ai.usage.* attributes are the cost-tracking spine. Input and output tokens are Conditionally Required on model spans, meaning you set them whenever the provider returns counts and omit them otherwise.
The new one to watch is gen_ai.usage.reasoning.output_tokens, added in v1.41.0 (April 2026) per the semantic-conventions CHANGELOG. Reasoning models such as OpenAI's o-series and Anthropic extended-thinking can multiply per-call cost by 5x to 20x against the base input rate.
A cost pipeline that ignores this attribute will systematically under-report. Treat it as Opt-In until v1.42.0 confirms the convention.
OTel also standardizes a metric, gen_ai.client.operation.duration, a histogram in seconds keyed on provider, operation, and model. Its dimensions are stable even when span attribute names churn, which makes it the right foundation for latency SLOs.
The canonical multi-step agent loop
The shape the spec expects is a parent-child tree. The caller opens an invoke_agent_client span; the agent opens invoke_agent_internal; each model call is a gen_ai.chat child; each tool the model requests becomes an execute_tool child.
invoke_agent_client (CLIENT)
└── invoke_agent_internal (INTERNAL)
├── gen_ai.chat (model call #1 → requests a tool)
├── execute_tool (the tool runs)
└── gen_ai.chat (model call #2 with the tool result → final answer)
Here is the loop in Python, with the attributes that earn their keep:
import os
from opentelemetry import trace
# Opt into the latest experimental GenAI conventions. Leave OFF to keep v1.36 names.
os.environ.setdefault("OTEL_SEMCONV_STABILITY_OPT_IN", "gen_ai_latest_experimental")
tracer = trace.get_tracer("com.example.agent", "1.0.0")
def invoke_model(messages, model="gpt-4.1"):
with tracer.start_as_current_span("gen_ai.chat") as span:
span.set_attribute("gen_ai.operation.name", "chat")
span.set_attribute("gen_ai.provider.name", "openai")
span.set_attribute("gen_ai.request.model", model)
span.set_attribute("gen_ai.conversation.id", messages.session_id)
reply = call_provider(messages, model)
span.set_attribute("gen_ai.usage.input_tokens", reply.usage.input_tokens)
span.set_attribute("gen_ai.usage.output_tokens", reply.usage.output_tokens)
return reply
def run_agent(user_message):
with tracer.start_as_current_span("invoke_agent_client") as client:
client.set_attribute("gen_ai.operation.name", "invoke_agent_client")
client.set_attribute("gen_ai.agent.name", "research-assistant")
client.set_attribute("gen_ai.agent.id", AGENT_ID)
with tracer.start_as_current_span("invoke_agent_internal") as agent:
agent.set_attribute("gen_ai.operation.name", "invoke_agent_internal")
for _ in range(MAX_STEPS):
reply = invoke_model(messages)
if reply.tool_call:
execute_tool(reply.tool_call.name, reply.tool_call)
else:
return reply.text
The TypeScript and Go SDKs follow the same shape. In Node, wrap each step in tracer.startActiveSpan(...), set error.type and SpanStatusCode.ERROR in the catch block, and end the span in finally.
In Go, tracer.Start(ctx, "gen_ai.chat") with defer span.End() and span.SetAttributes(...) is the idiom. The attribute keys are identical across all three languages, which is the whole point of a shared convention.
Attributing latency and cost per span
Per-span latency is just end_time − start_time from the OTLP fields. For the loop, sum the durations of every child gen_ai.chat and execute_tool span; the gap between that sum and the parent's duration is the agent's own overhead, including prompt assembly, tool selection, and any LLM-based routing the framework does outside the model API call.
Token cost is a join with a model-price table at query time:
cost = input_tokens * price_in(model)
+ output_tokens * price_out(model)
+ reasoning_output_tokens * price_reasoning(model) -- v1.41.0+
+ cached_input_tokens * price_cached(model)
Langfuse surfaces this pre-aggregated as cost-per-trace because it stores usage_details on every observation. A "which conversations are burning the budget" query in ClickHouse groups by conversation_id and agent_id over a recent window and orders by cost_usd, no price-table join needed at query time because Langfuse stores the joined cost on ingest.
Where the spans land: the 2026 vendor consolidation
Two acquisitions reshaped the storage layer this year, and both point the same direction.
Langfuse joined ClickHouse. On 2026-01-16, ClickHouse announced a $400M Series D at a $15B valuation and the acquisition of Langfuse in the same release. Langfuse stayed MIT-licensed and self-hostable. CEO Marc Klingen framed it plainly: "LLM observability and evaluation is fundamentally a data problem... We moved our data layer to ClickHouse, and that technical decision turned into a real partnership."
The 2026 Langfuse architecture is hybrid: PostgreSQL for transactional state (sessions, projects, accounts), ClickHouse for the OLAP store (traces, observations, scores), Redis or Valkey for cache and queue, and S3 or MinIO for object payloads. For OTel users, the consequence is direct: Langfuse's OTLP ingestion endpoint accepts your spans as-is, projects the gen_ai.* attributes into ClickHouse columns, and exposes them as filtering dimensions in the UI.
No Langfuse-specific code required.
Galileo joined Cisco. Cisco announced intent to acquire Galileo on 2026-04-09 and updated the post on 2026-05-22 to confirm completion; the Cisco acquisitions list records it as closed. Galileo is being folded into Splunk Observability Cloud's AI Agent Monitoring, not rebranded standalone. Its docs recommend OpenTelemetry and OpenInference integration paths, so the same gen_ai.* attributes flow into its evaluation surface today.
Read together, these moves say OTel is the wire format and warehouse-first storage is winning. A 2026 observability stack without a columnar OLAP layer behind the trace store is structurally more expensive to run at the volume agentic systems produce.
How do you survive an unstable spec?
This is the honest caveat, paired with the workaround. The conventions page itself warns that instrumentation libraries "should NOT change the version of the GenAI conventions that they emit by default" while the spec is Development.
In roughly 11 months from v1.36.0 to the in-development v1.42.0, the GenAI namespace shipped multiple material versions and breaking renames, including gen_ai.system to gen_ai.provider.name in v1.37.
The defensive posture is the same one that worked for OTel's database and HTTP conventions:
- Use a shim. Traceloop's OpenLLMetry or Arize's OpenInference own the attribute vocabulary, so a spec change becomes a shim-version bump, not an application edit.
- Pin the conventions library. Pin
opentelemetry-semantic-conventions(and its Go/JS/Java siblings) at the version matching the spec page you code against. Treat upgrades as explicit events. - Gate experimental attributes. Set
OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimentalonly where you want the next version. Keep production on the default and dual-emit in staging to see the diff first. - Centralize keys. Put every attribute name in one constants module. When
gen_ai.systembecamegen_ai.provider.name, a well-organized codebase changed one file. - Lock the spec in tests. A golden span exporter that asserts exact attribute keys catches a churn-induced rename in CI before it reaches production.
- Prefer stable metrics. Dashboards reading
gen_ai.client.operation.durationsurvive renames; dashboards reading span-level attribute keys do not. - Read the CHANGELOG on every release. The GitHub issue cadence on the genai subdirectory is the cleanest early-warning signal.
OpenTelemetry graduated within CNCF on 2026-05-21, which underwrites the bet even while the GenAI subspec matures.
What this means for you
Emit the five agent operations explicitly, parent every model and tool call under invoke_agent_internal, and put gen_ai.operation.name on every span. Cover provider, model, both token counts, and conversation ID on each model span, and add reasoning tokens when the model produces them.
Choose OTel as the wire format so your backend stays swappable. If your fleet produces more than roughly 100M observations a month, the ClickHouse-backed Langfuse path scales with storage cost and lets you query spans without exporting them. If volume is small, Langfuse Cloud, Arize Phoenix, and Honeycomb all consume the same traces.
Then shim, pin, dual-emit, and test, because the spec will move again before the year is out. The workflow you build around the five spans will still be correct after the next rename. The exact attribute strings might not be, and that is precisely why you keep them in one file.
Sources
- Semantic Conventions for GenAI agent and framework spans, OpenTelemetry
- Semantic conventions for generative AI client spans, OpenTelemetry
- Inside the LLM Call: GenAI Observability with OpenTelemetry
- semantic-conventions CHANGELOG, GitHub
- ClickHouse acquires Langfuse
- ClickHouse Raises $400M Series D, BusinessWire
- Langfuse ClickHouse self-hosting docs
- Langfuse OpenTelemetry tracing support
- Cisco announces intent to acquire Galileo
- Cisco acquisitions by year
- Galileo OpenTelemetry and OpenInference integration docs
- OpenTelemetry CNCF graduation announcement
