If you’re an engineering executive reporting AI productivity gains to your board while quietly worrying about what AI is costing you in production, this post is the one that matters.
Resist the temptation to report tokens consumed as productivity. It’s the old, naive “lines of code” metric dressed up for the AI world.
Last month an agent in our coding pipeline quietly introduced inefficient query patterns across our ETL processing, and all our automated tests and code reviews still passed. We found the anomaly only after the feature shipped, when cloud costs on our workloads started climbing: S3 object scan volumes and Athena query costs drifted upward across our tenants. Our FinOps cost anomaly detection caught the problem after every dev/test cycle had passed.
The code was AI generated and the stage gates before deployment did not catch the problem. Production observability did. Testing happens in prod, whether you acknowledge it or continue to live a lie.
It’s the end of the SDLC as we know it and I feel fine
Development and Operations are fragmented in many orgs.
You have the development loop, with build, test, merge.
Then you have the Operations loop, with monitor, alert, observe, fix.
Those loops rarely talk to each other unless there is serious investment in DevOps/SRE/Platform engineering.
AI coding agents have compressed SDLC cycle time so much that the two loops can no longer keep up while running separately.
The other shift is that AI agents now run inside our feature code, and they behave non-deterministically in prod. In the old world behaviour was deterministic: the code artifact you tested in CI was basically the behaviour you deployed. For agentic workloads the artifact you promote is just a harness around a model. The actual system behaviour is code plus model weights plus prompts plus tools plus live data.
Because the same input to an LLM doesn’t produce the same output, pre-prod evals only prove the artifact passed a few samples, and even those are not guaranteed to pass in subsequent runs. If your observability was designed around known failure cases, it won’t catch this non-determinism. We have to shift from predicting bugs in pre-prod to observing the agent’s behaviour in prod. Did it use the right tools? Did it stay within its permissions? Did it burn through too many tokens?
If all you have is logs, you’re doing archaeology.
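Those three questions can be turned into a check that runs over a session’s spans. This is a minimal sketch, assuming the spans have been exported as plain dictionaries. The `Envelope` class, the `permission.denied` attribute and the limits are hypothetical names chosen for illustration; the `gen_ai.*` attribute names mirror the ones used in the hook snippets later in this post.

```python
from dataclasses import dataclass, field


@dataclass
class Envelope:
    """Hypothetical behavioural limits for one agent session."""
    allowed_tools: set = field(default_factory=lambda: {"Read", "Edit", "Bash"})
    max_total_tokens: int = 200_000


def check_session(spans, envelope):
    """Return a list of human-readable violations for one session's spans."""
    violations = []
    total_tokens = 0
    for span in spans:
        attrs = span.get("attributes", {})
        tool = attrs.get("gen_ai.tool.name")
        # Did it use the right tools?
        if tool is not None and tool not in envelope.allowed_tools:
            violations.append(f"unexpected tool: {tool}")
        # Did it stay within its permissions? (assumed attribute name)
        if attrs.get("permission.denied"):
            violations.append(f"permission denied: {tool}")
        # Did it burn through too many tokens?
        total_tokens += attrs.get("gen_ai.usage.input_tokens", 0)
        total_tokens += attrs.get("gen_ai.usage.output_tokens", 0)
    if total_tokens > envelope.max_total_tokens:
        violations.append(f"token budget exceeded: {total_tokens}")
    return violations


session = [
    {"attributes": {"gen_ai.tool.name": "Read"}},
    {"attributes": {"gen_ai.tool.name": "WebFetch"}},
    {"attributes": {"gen_ai.usage.input_tokens": 150_000,
                    "gen_ai.usage.output_tokens": 60_000}},
]
print(check_session(session, Envelope()))
```

The point is that the check runs against what the agent actually did in prod, not against a sample that passed in CI.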
Instrumenting your Agentic Coding Tool
I use Claude Code, but Codex and OpenCode have similar extensibility with OpenTelemetry.
Claude Code exposes hooks that fire at key points in the agent lifecycle: PreToolUse, PostToolUse, Notification, Stop. You can emit OTel spans from these hooks to get a trace of what the agent actually did during a session, using the same OTel collector your production services are already sending to.
I’ve published the full hook setup as a dotfiles repo you can drop straight into your ~/.claude/ folder. It covers every hook event the CLI exposes, session start and end, every tool call pre and post, subagent lifecycle, context compaction, permission requests, the lot: github.com/nimeshjm/claude-otel-hooks.
Here are a couple of snippets as an example. These attributes are captured when Claude calls a tool.

# hooks/post-tool-use.py: emit a span for every tool call
attrs = {
    "session.id": session_id,
    "cwd": data.get("cwd", ""),
    "turn.id": turn_id,
    "gen_ai.operation.name": "tool_call",
    "gen_ai.tool.name": tool_name,
    "gen_ai.tool.type": "extension" if is_mcp else "function",
    "gen_ai.tool.success": True,
    "tool_use_id": tool_use_id,
    "tool.duration_ms": (now - start_ns) // 1_000_000,
}

And the turn-end hook, which is where token data lives:
# hooks/stop.py: fires when the agent finishes a turn
emit_span(
    "claude_code.turn.stop",
    {
        "session.id": session_id,
        "cwd": data.get("cwd", ""),
        "turn.id": turn_id,
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "agent.stop_reason": stop_reason,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": usage.get("output_tokens", 0),
        "gen_ai.usage.cache_creation_tokens": usage.get("cache_creation_input_tokens", 0),
        "gen_ai.usage.cache_read_tokens": cache_read,
        "gen_ai.usage.cache_hit_ratio": cache_hit_ratio,
    },
    start_time_ns=now,
    end_time_ns=now,
    # A turn that stopped on an error or hit max_turns is marked as failed.
    status_ok=stop_reason not in ("error", "max_turns"),
    error_message=stop_reason if stop_reason in ("error", "max_turns") else "",
)

What you get: a trace per session showing every tool call the agent made, in order, with timing. This is crucial because standard APM only sees fragmented API calls, which are below the granularity you want. These OTel hooks capture the full agentic transaction, i.e. the actual decision graph the agent walked.
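The turn-end hook reports `cache_hit_ratio` without showing how it is derived. One plausible definition is sketched below. It is an assumption, not necessarily what the repo computes: the ratio of cache-read tokens to all input-side tokens for the turn. The `cache_read_input_tokens` key is assumed by analogy with the `cache_creation_input_tokens` key used in the hook.

```python
def cache_hit_ratio(usage):
    """Fraction of input-side tokens served from cache for one turn.

    Missing counters are treated as zero, and an empty turn yields 0.0
    rather than a division error.
    """
    fresh = usage.get("input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    total = fresh + created + read
    return read / total if total else 0.0


print(cache_hit_ratio({
    "input_tokens": 200,
    "cache_creation_input_tokens": 300,
    "cache_read_input_tokens": 1500,
}))
```

A ratio that drops sharply between sessions on the same repo is worth a look, since it usually means the agent is paying full price for context it could have reused.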
Over the past week this gave us 578 tool calls: 337 Bash, 101 Read, 67 Edit. The read-to-write ratio is healthy most of the time. Our 5.16% error rate and the claude_code.turn.stop_failure spans are what flag the sessions worth looking at.
A few things worth watching once you have this data:
High tool call counts on specific tasks. If an agent consistently makes a lot of tool calls to complete a task that should take a couple, something is wrong. Maybe the task is too large and the agent needs to scan a large part of your codebase to make progress.
Read/write ratios. If the agent is reading far more than it’s writing, that is exploration overhead, which drives up your token costs.
My previous post, Agentic RAG in practice, discusses some mitigation strategies.
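Both signals can be computed from the tool-call spans. A minimal sketch, assuming the spans are available as dictionaries carrying the attributes emitted by the post-tool-use hook; which tools count as reads and which as writes is my assumption for illustration.

```python
from collections import Counter

# Classification of tools into reads and writes is an assumption.
READ_TOOLS = {"Read", "Grep", "Glob"}
WRITE_TOOLS = {"Edit", "Write"}


def session_stats(spans):
    """Summarise tool usage per session from tool_call spans."""
    per_session = {}
    for span in spans:
        attrs = span["attributes"]
        if attrs.get("gen_ai.operation.name") != "tool_call":
            continue
        counts = per_session.setdefault(attrs["session.id"], Counter())
        counts[attrs["gen_ai.tool.name"]] += 1

    stats = {}
    for session_id, counts in per_session.items():
        reads = sum(counts[t] for t in READ_TOOLS)
        writes = sum(counts[t] for t in WRITE_TOOLS)
        stats[session_id] = {
            "tool_calls": sum(counts.values()),
            # None when nothing was written, to avoid dividing by zero.
            "read_write_ratio": reads / writes if writes else None,
        }
    return stats


def _span(session_id, tool):
    """Build a sample tool_call span for the demo below."""
    return {"attributes": {"session.id": session_id,
                           "gen_ai.operation.name": "tool_call",
                           "gen_ai.tool.name": tool}}


spans = ([_span("s1", "Read")] * 6 + [_span("s1", "Edit")] * 2
         + [_span("s1", "Bash"), _span("s2", "Read")])
print(session_stats(spans))
```

What counts as a healthy ratio depends on the task, so treat the numbers as something to compare across similar sessions rather than against a fixed threshold.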
Instrumenting your services
Enforcing instrumentation during development
While developing we ensure span attributes are consistent so our troubleshooting agent can pull and reason over them at investigation time. We have a section in CLAUDE.md that defines the pattern for observability.
## Observability
- Prefer adding attributes to existing spans over creating new child spans, unless the operation is both interesting and aggregable.
- Add `tenant.id` to the active span.
- Add the stack trace on every error to the span and set `error=true`.

Once it’s in CLAUDE.md, the coding agent applies this pattern consistently on every code modification.
Baseline metadata
There are a number of cross-cutting attributes that we add to every span. These are set “globally” using resource attributes.
Resource attributes: set once and they are inherited by all spans.
# app.py, run once at startup
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "service.name": "billing-service",
    "service.version": os.environ.get("BUILD_COMMIT_SHA", ""),
    "build.commit_sha": os.environ.get("BUILD_COMMIT_SHA", ""),
    "deploy.environment": os.environ.get("ENVIRONMENT", ""),
    "work_tracker.issue_id": os.environ.get("ISSUE_ID", ""),
})

provider = TracerProvider(resource=resource)
trace.set_tracer_provider(provider)

Every time you emit a span, all these attributes are added. This is useful to identify failures caused by a specific version of the code.
Span attributes: specific metadata of the methods that are being executed.
tenant.id, billing.record_count and s3.bytes_scanned are specific to each invocation.
span.set_attributes({
    "tenant.id": tenant_id,
    "billing.record_count": len(records),
    "s3.bytes_scanned": bytes_scanned,
})

build.commit_sha isn’t in this snippet because it’s a resource attribute, so we don’t need to add it to every span individually.
Connecting day 1 with day 2: agent sessions to production impact
When you have a production issue, you know the build commit SHA and the issue ID, so you can quickly check whether it was caused by the last deployment.
The work tracker issue ID is already there
Most teams already follow a branch naming convention like feature/PROJ-1234-add-payment-flow. The issue ID is already present at PR creation and build time.
- name: Extract issue ID from branch name
  env:
    # Pass the branch name through env instead of inlining it in the script.
    HEAD_REF: ${{ github.head_ref }}
  run: |
    # head_ref is only set on pull_request events; keep the first match.
    ISSUE_ID=$(echo "$HEAD_REF" | grep -oP '[A-Z]+-\d+' | head -n1)
    echo "ISSUE_ID=${ISSUE_ID}" >> "$GITHUB_ENV"

- name: Deploy with resource attributes
  run: ./deploy.sh
  env:
    BUILD_COMMIT_SHA: ${{ github.sha }}
    ISSUE_ID: ${{ env.ISSUE_ID }}

When production behaviour changes after a deploy, you filter by the commit SHA and see every span produced by that change, whether the code was written by a human or an agent.
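The same branch-name extraction is handy outside CI, for example in a local hook that needs the issue ID before a PR exists. A minimal sketch in Python that mirrors the grep pattern; the function name is hypothetical, and returning an empty string for branches without a ticket is my choice, matching the empty default used for the resource attribute.

```python
import re

# Same pattern as the grep in the workflow: uppercase project key, hyphen, digits.
ISSUE_ID_PATTERN = re.compile(r"[A-Z]+-\d+")


def extract_issue_id(branch_name):
    """Return the first issue ID in a branch name, or '' if none is present."""
    match = ISSUE_ID_PATTERN.search(branch_name)
    return match.group(0) if match else ""


print(extract_issue_id("feature/PROJ-1234-add-payment-flow"))
print(repr(extract_issue_id("hotfix/no-ticket")))
```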
The agent writes back to the ticket
When the agent writes code, it should leave a record in the ticket of why it made those changes, writing back its plan.
A Stop hook fires when the agent finishes a session. The hook receives an event dictionary containing the session ID, a summary of what the agent did, the list of files it modified, and usage metrics like tool call count and duration. You can use it to post a structured summary back to the work tracker:
# hooks/stop.py, fires when the agent completes a session
import json
import os
import sys

import requests

# Claude Code passes the hook payload as JSON on stdin.
# The keys read below follow the event shape described above.
event = json.load(sys.stdin)

issue_id = os.environ["ISSUE_ID"]
jira_url = f"{os.environ['JIRA_BASE_URL']}/rest/api/2/issue/{issue_id}/comment"

# Attach the agent's plan if it wrote one to plan.md.
plan_content = ""
if os.path.exists("plan.md"):
    with open("plan.md", "r") as f:
        plan_content = f"\n\n*Plan:*\n{f.read()}"

comment_body = (
    f"*Agent session complete* ({event['session_id']})\n\n"
    f"{event['final_summary']}{plan_content}\n\n"
    f"Files modified: {', '.join(event['files_modified'])}\n"
    f"Tool calls: {event['tool_call_count']} | "
    f"Duration: {event['duration_ms']}ms\n"
)

response = requests.post(
    jira_url,
    json={"body": comment_body},
    auth=(os.environ["JIRA_USER"], os.environ["JIRA_API_TOKEN"]),
    timeout=10,  # don't let a slow tracker hang the hook
)
response.raise_for_status()

When a P1 hits, the error spans carry the issue ID, which points to the ticket where the agent’s reasoning is recorded. The AI SRE can correlate the reasoning behind the change, the actual changeset and the trace to reconstruct the failure path.
Agentic AI SRE trace-to-fix loop
Once you can correlate agent sessions to production behaviour, the next step is automating the investigation itself.
We pull production telemetry data nightly and run failure events through an AI model. The output is a list of issues, an investigation path and a description of the problem for an engineer to review. The engineer can then create an incident and a follow-up PR from that output.

# Pseudocode, runs nightly in CI
# signoz, work_tracker, github and claude stand in for your own client wrappers.
traces = signoz.query(
    filter='name = "claude_code.turn.stop_failure" OR error = true',
    group_by=['exception.slug', 'build.commit_sha', 'service.name'],
    time_range='last_24h',
)

for error_group in traces.top_errors(limit=20):
    issue = work_tracker.get_issue(error_group.work_tracker_issue_id)
    context = {
        "spans": error_group.sample_spans(n=5),
        "diff": github.get_diff(error_group.build_commit_sha),
        "acceptance_criteria": issue.description,
    }
    pr = claude.complete(FIX_PROMPT, context=context)
    github.create_draft_pr(title=f"fix: {error_group.slug}", body=pr)

The code change, the production telemetry and the reasoning behind the change all give the AI SRE agent the context it needs to determine the root cause and propose fixes.
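The grouping step in the pseudocode above can be tried locally before wiring up a backend. A minimal sketch, assuming failure spans have been exported as dictionaries; the function name and the sample data are hypothetical, and a real backend would do this aggregation in its query engine.

```python
from collections import Counter


def top_error_groups(spans, limit=20):
    """Group failing spans by (slug, commit, service), most frequent first."""
    groups = Counter()
    for span in spans:
        attrs = span["attributes"]
        failed = (attrs.get("error") is True
                  or span.get("name") == "claude_code.turn.stop_failure")
        if not failed:
            continue
        key = (
            attrs.get("exception.slug", "unknown"),
            attrs.get("build.commit_sha", ""),
            attrs.get("service.name", ""),
        )
        groups[key] += 1
    return groups.most_common(limit)


sample = [
    {"name": "load_invoices", "attributes": {
        "error": True, "exception.slug": "timeout",
        "build.commit_sha": "abc123", "service.name": "billing-service"}},
    {"name": "load_invoices", "attributes": {
        "error": True, "exception.slug": "timeout",
        "build.commit_sha": "abc123", "service.name": "billing-service"}},
    {"name": "load_invoices", "attributes": {
        "build.commit_sha": "abc123", "service.name": "billing-service"}},
]
print(top_error_groups(sample))
```

Grouping by commit SHA is what lets each error group map back to one change, and from there to one ticket.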
Beyond debugging, this observability layer becomes your production intelligence and audit trail. When the CISO or your SOC2 auditor asks how you verify agent behavior in production, CI attestation won’t be enough. The trace data, which shows the decision-making process and how the agent stayed within its behavioral envelope, is the evidence you hand them.
Tracing the Decision Tree for SOC2
To provide that evidence, you don’t need a custom database to track the agent’s decision tree; the OTel trace hierarchy is the decision tree. The root span is the agent session. The child spans are each LLM turn (the “thought”). The leaf spans are the tool calls or context retrievals. This gives auditors a chronological flowchart of exactly what the agent decided to do.
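To show how little is needed to turn a trace into that flowchart, here is a minimal sketch that renders spans as an indented outline. It assumes each span is a dictionary with `span_id`, `parent_id`, `name` and `start_ns` keys; those key names are illustrative, so map them to whatever your backend exports.

```python
from collections import defaultdict


def outline(spans):
    """Render spans as an indented outline, siblings ordered by start time."""
    children = defaultdict(list)
    for span in spans:
        children[span.get("parent_id")].append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s["start_ns"])

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + span["name"])
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


trace_spans = [
    {"span_id": "a", "parent_id": None, "name": "session", "start_ns": 0},
    {"span_id": "c", "parent_id": "a", "name": "turn 2", "start_ns": 30},
    {"span_id": "b", "parent_id": "a", "name": "turn 1", "start_ns": 10},
    {"span_id": "d", "parent_id": "b", "name": "tool: Read", "start_ns": 12},
    {"span_id": "e", "parent_id": "c", "name": "tool: Edit", "start_ns": 35},
]
print("\n".join(outline(trace_spans)))
```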
But there’s a catch for enterprise governance: logging full prompts and model completions directly into your observability backend can leak PII or secrets. The enterprise pattern is to stream the raw input/output payloads to a secure, encrypted blob store (like S3) and only log the reference in your OTel span:
span.set_attribute("gen_ai.payload.s3_uri", f"s3://audit-logs/agent-sessions/{session_id}/turn_4.json")

This gives you the best of both worlds. Your observability dashboard shows the metadata, timing, and decision graph, while the sensitive text inputs and thoughts are locked down in a SOC2-compliant vault, accessible only when an auditor actually needs to see the exact context.
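The reference pattern generalises to any turn with a small helper. This is a sketch under assumptions: the function name is hypothetical, the key layout simply follows the URI in the example above, and the upload itself is left to whatever S3 client you already use.

```python
import json


def payload_reference(bucket, session_id, turn_index, payload):
    """Return (s3_uri, body_bytes) for one turn's raw prompt and completion.

    The caller uploads body_bytes to the bucket and records only the URI on
    the span, so no prompt text reaches the observability backend.
    """
    key = f"agent-sessions/{session_id}/turn_{turn_index}.json"
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return f"s3://{bucket}/{key}", body


uri, body = payload_reference(
    "audit-logs", "sess-42", 4, {"prompt": "...", "completion": "..."}
)
span_attributes = {"gen_ai.payload.s3_uri": uri}
print(span_attributes)
```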
OTel is the foundation
OpenTelemetry is the standard we’re betting on. It’s vendor-neutral, hosted by the CNCF and well-supported across every language and platform we run.
We run two backends: Honeycomb and SigNoz. Both have a SaaS offering, and SigNoz also has a self-hosted version that we use for prototyping or in air-gapped environments.
Getting SigNoz running takes two commands:

git clone https://github.com/SigNoz/signoz.git
docker compose -f signoz/deploy/docker/docker-compose.yaml up -d

Point your apps’ OTEL_EXPORTER_OTLP_ENDPOINT at http://localhost:4317 and traces start flowing into SigNoz.
The Claude Code Sessions dashboard we run has 16 panels: time-series for sessions, tool calls, token usage (input, cache, output), cache hit ratio, and model usage; tables for tool failures by gen_ai.tool.name, permission denials, tool duration, lines edited per session, stop reason distribution, subagent activity, and context compaction.
The two panels I look at most are tool failures and permission denials. Last week those told me the agent was hitting a Bash permission boundary on a specific workflow, where it was trying to access data outside the boundaries we had set. That is not desirable behaviour, so we took steps to correct it.
On the Honeycomb side the Canvas investigation tool offers a really good experience. The more context we can include in our spans, the better the investigation is.
The evidence beyond our team
Mixpanel rolled out Claude Code to their entire engineering org and immediately started tracking agent costs with an observability board as one of their sources of truth. When agents are pushing code to prod you need the same or better observability you’d apply to any other production system.
Intercom’s Fin chatbot handles millions of customer conversations across thousands of organisations. When resolution rates started degrading and time to first token started climbing, they needed to understand why, not just that something was wrong. The resolution rate is now at its highest level.
One pattern at the companies navigating this well: they treat observability as a requirement before any agentic system ships, not something they bolt on after the first incident. You can’t predict the failure modes of a non-deterministic system in advance. You can make sure you’ll be able to see what happened when it goes wrong.
What to do on your next SRE/DevOps/Platform Engineering sprint
- Clone claude-otel-hooks and install it on the machine of every person using Claude Code. Configure it to point to your OTel collector, to Honeycomb, to SigNoz Cloud, or deploy your own. The SigNoz docker-compose setup is a proof of concept for local evaluation. For production use, harden the deployment or use SigNoz Cloud or Honeycomb instead.
- Import the Claude Code Sessions dashboard JSON from the hooks repo. Sixteen panels ready on first session.
- Add three resource attributes to every service: build.commit_sha, deploy.environment, work_tracker.issue_id.
- Add the observability block to the CLAUDE.md of your code repos.
The cost spike took a week to surface because we didn’t have SLOs or alerting on athena.query_cost or s3.bytes_scanned as span attributes at the time. Every incident that improves the instrumentation makes the next incident faster to find. That’s the compound engineering part.
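The missing alert can start as something very small. A minimal sketch, assuming you can pull a daily total of s3.bytes_scanned per tenant from your backend; the function name, the seven-day minimum and the z-score threshold are hypothetical starting points, not tuned values.

```python
from statistics import mean, stdev


def is_anomalous(history, today, min_days=7, z_threshold=3.0):
    """Flag today's value if it sits far above the trailing baseline."""
    if len(history) < min_days:
        return False  # not enough data for a baseline
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return today > baseline
    return (today - baseline) / spread > z_threshold


# Made-up daily totals for one tenant, in GB scanned.
daily_gb_scanned = [100, 104, 98, 101, 99, 103, 97]
print(is_anomalous(daily_gb_scanned, 102))  # within normal variation
print(is_anomalous(daily_gb_scanned, 160))  # sharp jump
```

A trailing baseline like this catches jumps but adapts to slow drift, which is what we saw, so it is worth also comparing the current week against a fixed window from before the last few deploys.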