Chapter 22: Observability and Monitoring #
"You cannot improve what you cannot measure. And in production, you cannot debug what you cannot observe." -- Observability engineering axiom
What You Will Learn #
In this chapter, you will learn how to observe and monitor Neam agents in production. You will understand the three health check endpoints and their semantics, configure OpenTelemetry integration for distributed tracing and metrics, visualize traces in Jaeger, build Prometheus dashboards, monitor the LLM Gateway (rate limits, circuit breaker state, cache hits, cost), trace requests across multi-agent systems, and design alerting strategies. By the end of this chapter, you will be able to answer the question "why is my agent slow?" in under five minutes.
22.1 Health Check Semantics #
Neam v0.6.0 exposes three health check endpoints, each with distinct semantics. These endpoints are used by Kubernetes probes, load balancers, and monitoring systems to determine the operational state of a Neam agent.
GET /health (Liveness) #
The liveness endpoint answers one question: is the Neam process alive and able to respond to HTTP requests?
What it checks:

- The HTTP server is listening and can process requests
- The main event loop has not deadlocked

What it does NOT check:

- External dependencies (database, LLM providers, OTel collector)
- Whether agents are loaded or initialized
Response when healthy (HTTP 200):
{
"status": "ok",
"version": "0.6.0",
"uptime_seconds": 3672
}
When it fails: The process is irrecoverably broken. Kubernetes kills the pod and restarts it.
Kubernetes configuration:
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 15
periodSeconds: 20
timeoutSeconds: 5
failureThreshold: 3
This means: after an initial 15-second delay, check /health every 20 seconds. If
3 consecutive checks fail (each with a 5-second timeout), kill and restart the pod.
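To make the threshold semantics concrete, here is an illustrative Python model of the restart decision (not Kubernetes code — just the consecutive-failure rule the settings above imply):

```python
def should_restart(probe_results, failure_threshold=3):
    """Return True once `failure_threshold` consecutive probes have failed.

    `probe_results` is an ordered list of booleans (True = probe passed).
    Kubernetes resets the failure counter on any success, so only a
    consecutive run of failures triggers a restart.
    """
    consecutive_failures = 0
    for passed in probe_results:
        if passed:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return True
    return False

# A failure followed by a success resets the counter: no restart.
assert should_restart([True, False, True, False]) is False
# Three consecutive failures trigger the restart.
assert should_restart([True, False, False, False]) is True
```

Note that intermittent failures never accumulate: only a sustained outage (3 failures x 20s period, roughly a minute of unresponsiveness) kills the pod.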
GET /ready (Readiness) #
The readiness endpoint answers: can this pod serve traffic right now?
What it checks:

1. State backend connectivity: Can the VM connect to and query the configured state backend (SQLite, PostgreSQL, Redis, DynamoDB, CosmosDB)?
2. LLM provider availability: Is at least one LLM provider circuit in the Closed or HalfOpen state? (If all circuits are Open, the agent cannot make LLM calls.)
3. Telemetry health: If telemetry is enabled, is the export queue below its capacity limit? (A full queue indicates the OTLP endpoint is down.)
Response when ready (HTTP 200):
{
"status": "ready",
"checks": {
"state_backend": {
"status": "connected",
"type": "postgres",
"latency_ms": 2
},
"llm_providers": {
"openai": {
"status": "healthy",
"circuit": "closed",
"requests_total": 1547,
"failures_total": 3
},
"anthropic": {
"status": "healthy",
"circuit": "closed",
"requests_total": 42,
"failures_total": 0
}
},
"telemetry": {
"status": "ok",
"pending_spans": 12,
"queue_capacity": 1000
}
}
}
Response when not ready (HTTP 503):
{
"status": "not_ready",
"checks": {
"state_backend": {
"status": "connection_refused",
"type": "postgres",
"error": "could not connect to server: Connection refused"
},
"llm_providers": {
"openai": {
"status": "unhealthy",
"circuit": "open",
"last_failure": "2026-01-30T14:32:05Z",
"error": "429 Too Many Requests"
},
"anthropic": {
"status": "healthy",
"circuit": "closed"
}
},
"telemetry": {
"status": "ok"
}
}
}
When it fails: Kubernetes removes the pod from Service endpoints. No traffic is routed to it. The pod stays running (it is not killed -- that is the liveness probe's job). Once the dependency recovers, the next readiness check passes, and traffic resumes.
Kubernetes configuration:
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
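To make the 200/503 semantics concrete, here is a simplified Python sketch of how a readiness handler might aggregate its checks. The check names and the "any failure means not ready" rule are illustrative simplifications; Neam's actual logic (e.g. "at least one provider circuit not Open") is richer:

```python
def readiness_status(checks):
    """Aggregate dependency checks into an HTTP status code.

    `checks` maps a check name to a dict with at least a "status" key.
    The pod reports ready only if every check is healthy; any single
    failing dependency yields 503 so the load balancer stops routing
    traffic to this pod (without killing it).
    """
    healthy = {"connected", "healthy", "ok", "ready"}
    all_ok = all(c["status"] in healthy for c in checks.values())
    return 200 if all_ok else 503

checks = {
    "state_backend": {"status": "connected", "type": "postgres"},
    "llm_openai": {"status": "healthy", "circuit": "closed"},
    "telemetry": {"status": "ok"},
}
assert readiness_status(checks) == 200

# One failing dependency flips the whole endpoint to 503:
checks["state_backend"]["status"] = "connection_refused"
assert readiness_status(checks) == 503
```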
GET /startup (Startup) #
The startup endpoint answers: has the Neam VM completed its initialization sequence?
What it checks:

1. Bytecode is loaded and validated
2. Agents are registered in the VM
3. State backend connection is established
4. Knowledge bases are ingested (if any)
5. Autonomous executor is started (if configured)
6. LLM Gateway is initialized (if configured)
Response when startup complete (HTTP 200):
{
"status": "started",
"initialized_at": "2026-01-30T14:00:05Z",
"agents_registered": 3,
"knowledge_bases_loaded": 1,
"autonomous_agents": 1
}
Response during startup (HTTP 503):
{
"status": "starting",
"phase": "ingesting_knowledge_bases",
"progress": "2/5 sources processed"
}
Kubernetes configuration:
startupProbe:
httpGet:
path: /startup
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 30
This allows up to 150 seconds (30 x 5s) for startup. Once the startup probe succeeds, Kubernetes switches to the liveness and readiness probes. This is critical for agents with large knowledge bases that take time to ingest.
Health Check Summary #
| Endpoint | Question | Failure Action | Checks Dependencies | Frequency |
|---|---|---|---|---|
| `/health` | Is the process alive? | Kill and restart | No | Every 20s |
| `/ready` | Can it serve traffic? | Remove from LB | Yes | Every 10s |
| `/startup` | Is init complete? | Wait for startup | Yes (init only) | Every 5s |
22.2 OpenTelemetry Integration #
Neam v0.6.0 integrates with the OpenTelemetry standard for distributed tracing and metrics. The integration is in-process -- no sidecar or agent is required (though an OTel Collector is recommended in production for reliable delivery).
Architecture #
+---------------------------------------------------------------+
| |
| Neam Agent (in-process) |
| +-----------------------------------------------------------+
| | |
| | Agent.ask() |
| | | |
| | v |
| | TelemetryExporter |
| | +---------------------------+ |
| | | start_span("agent.ask") | |
| | | start_span("llm.call") | |
| | | set_attribute(...) | |
| | | end_span() | |
| | | start_span("rag.query")| |
| | | end_span() | |
| | | end_span() | |
| | +---------------------------+ |
| | | |
| | Batch buffer (100 spans or 5s) |
| | | |
| +---------|--------------------------------------------------+
| | |
| v OTLP/HTTP JSON |
| +---------+----------+ |
| | OTel Collector | |
| | (otel-collector) | |
| +----+----------+----+ |
| | | |
| v v |
| +--------+ +-----------+ |
| | Jaeger | | Prometheus| |
| | (traces)| | (metrics) | |
| | :16686 | | :9090 | |
| +--------+ +-----------+ |
| |
+---------------------------------------------------------------+
Configuration #
Enable telemetry in neam.toml:
[telemetry]
enabled = true
endpoint = "http://otel-collector:4318"
service-name = "neam-agent"
sampling-rate = 0.5
Or via environment variables:
export NEAM_TELEMETRY_ENABLED=true
export NEAM_OTEL_ENDPOINT=http://otel-collector:4318
export NEAM_TELEMETRY_SERVICE_NAME=neam-agent
export NEAM_TELEMETRY_SAMPLING_RATE=0.5
Automatic Span Creation #
The Neam VM automatically creates spans for the following operations:
| Span Name | When Created | Key Attributes |
|---|---|---|
| `neam.agent.ask` | Every `Agent.ask()` call | `agent.name`, `agent.provider`, `agent.model` |
| `neam.llm.call` | Each LLM API request | `gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.prompt_tokens`, `gen_ai.usage.completion_tokens` |
| `neam.tool.call` | Tool invocation | `tool.name`, `tool.duration_ms` |
| `neam.rag.query` | RAG retrieval | `rag.strategy`, `rag.top_k`, `rag.documents_retrieved` |
| `neam.reflection` | Self-reflection pass | `reflection.dimensions`, `reflection.min_confidence`, `reflection.score` |
| `neam.learning.review` | Learning review trigger | `learning.strategy`, `learning.interactions_reviewed` |
| `neam.handoff` | Agent handoff | `handoff.from`, `handoff.to`, `handoff.reason` |
| `neam.mcp.call` | MCP tool execution | `mcp.server`, `mcp.tool`, `mcp.duration_ms` |
| `neam.gateway.ratelimit` | Rate limit wait | `gateway.provider`, `gateway.wait_ms` |
| `neam.gateway.circuitbreak` | Circuit breaker trip | `gateway.provider`, `gateway.circuit_state` |
| `neam.gateway.cache` | Cache hit/miss | `gateway.provider`, `gateway.cache_hit` |
Span Hierarchy #
A typical agent call produces a tree of spans:
neam.agent.ask (TriageAgent, 1200ms)
|
+-- neam.rag.query (strategy: basic, 45ms)
| +-- Retrieved 3 documents
|
+-- neam.llm.call (openai/gpt-4o-mini, 850ms)
| +-- prompt_tokens: 1200
| +-- completion_tokens: 150
| +-- cost_usd: 0.0018
|
+-- neam.reflection (accuracy: 0.9, relevance: 0.85, 400ms)
| +-- neam.llm.call (openai/gpt-4o-mini, 350ms)
|
+-- neam.handoff (TriageAgent -> RefundAgent, 0ms)
OTLP Export Format #
Neam exports spans as OTLP/HTTP JSON (not protobuf) to avoid the protobuf dependency. The OTel Collector accepts both formats:
{
"resourceSpans": [{
"resource": {
"attributes": [
{"key": "service.name", "value": {"stringValue": "neam-agent"}},
{"key": "service.version", "value": {"stringValue": "0.6.0"}},
{"key": "deployment.environment", "value": {"stringValue": "production"}}
]
},
"scopeSpans": [{
"scope": {"name": "neam", "version": "0.6.0"},
"spans": [
{
"traceId": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",
"spanId": "1a2b3c4d5e6f7a8b",
"parentSpanId": "",
"name": "neam.agent.ask",
"kind": 2,
"startTimeUnixNano": "1706620800000000000",
"endTimeUnixNano": "1706620801200000000",
"attributes": [
{"key": "agent.name", "value": {"stringValue": "TriageAgent"}},
{"key": "agent.provider", "value": {"stringValue": "openai"}},
{"key": "agent.model", "value": {"stringValue": "gpt-4o-mini"}}
],
"status": {"code": 1}
}
]
}]
}]
}
Batching and Background Export #
Spans are buffered in memory and exported in batches:
- Batch size: 100 spans (or fewer if the flush interval triggers first)
- Flush interval: 5 seconds
- Export thread: A background thread performs the HTTP POST to the OTLP endpoint
- Backpressure: If the export queue exceeds 1000 pending spans (configurable), new spans are dropped with a warning log
- Failure handling: Failed exports are retried once with a 1-second delay, then dropped
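The batching rules above can be modeled in a few lines of Python. This is an illustrative sketch only (the real exporter runs this logic on a background thread inside the VM); the class name and method names are invented for the example:

```python
import time

class BatchSpanBuffer:
    """Illustrative model of the batching behaviour described above:
    flush when `batch_size` spans accumulate or `flush_interval` seconds
    elapse, and drop new spans once the queue exceeds its capacity."""

    def __init__(self, batch_size=100, flush_interval=5.0, capacity=1000):
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.capacity = capacity
        self.queue = []
        self.last_flush = time.monotonic()
        self.dropped = 0

    def add(self, span):
        if len(self.queue) >= self.capacity:
            self.dropped += 1  # backpressure: drop with a warning log
            return
        self.queue.append(span)

    def take_batch_if_due(self):
        """Return the next batch if either trigger fired, else None."""
        due = (len(self.queue) >= self.batch_size
               or time.monotonic() - self.last_flush >= self.flush_interval)
        if not due or not self.queue:
            return None
        batch = self.queue[:self.batch_size]
        self.queue = self.queue[self.batch_size:]
        self.last_flush = time.monotonic()
        return batch

# Small limits to show both triggers:
buf = BatchSpanBuffer(batch_size=3, flush_interval=60.0, capacity=5)
for i in range(7):
    buf.add({"span_id": i})
assert buf.dropped == 2                   # capacity 5, 7 spans offered
assert len(buf.take_batch_if_due()) == 3  # size trigger fires first
```

The key design point: dropping spans under backpressure trades telemetry completeness for agent availability, so a dead OTLP endpoint never blocks request handling.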
Sampling #
The `sampling-rate` setting controls what fraction of traces are exported:
| Rate | Effect | Use Case |
|---|---|---|
| `1.0` | Every request traced | Development, debugging |
| `0.5` | 50% of requests | Staging |
| `0.1` | 10% of requests | Production (moderate traffic) |
| `0.01` | 1% of requests | Production (high traffic) |
Sampling is deterministic per trace: if a trace is sampled, all spans within that trace (including child spans from tool calls, RAG queries, and reflections) are included. This is achieved by hashing the trace ID and comparing against the sampling threshold.
// This code behaves identically regardless of sampling rate.
// The telemetry layer is transparent to agent logic.
agent TracedAgent {
provider: "openai"
model: "gpt-4o-mini"
system: "You are a helpful assistant."
}
{
let response = TracedAgent.ask("Explain observability.");
emit response;
// If this trace is sampled, spans are exported automatically.
// If not sampled, zero overhead is added.
}
22.3 Jaeger for Trace Visualization #
Jaeger is an open-source distributed tracing platform. The Docker Compose stack from Chapter 20 includes Jaeger, and Neam traces flow through the OTel Collector to Jaeger automatically.
Accessing Jaeger #
# If running Docker Compose
open http://localhost:16686
# If running in Kubernetes
kubectl port-forward svc/jaeger-query -n observability 16686:16686
open http://localhost:16686
Finding Traces #
In the Jaeger UI:
- Select Service: `neam-agent`
- Select Operation: `neam.agent.ask` (or leave as "all")
- Set a time range
- Click Find Traces
Each trace shows the complete span tree for one request, including:
- Total request duration
- Time spent in each LLM call
- RAG retrieval latency
- Reflection overhead
- Handoff chain
Reading a Trace #
A trace for a customer service triage request might look like:
Trace: a1b2c3d4 (1450ms total)
[==============================================] neam.agent.ask (TriageAgent) 1450ms
[====] neam.rag.query (basic, 3 docs) 50ms
[===================] neam.llm.call (openai/gpt-4o-mini) 900ms
prompt_tokens: 1500 completion_tokens: 80 cost: $0.0020
[======] neam.reflection (accuracy: 0.92) 350ms
[====] neam.llm.call (openai/gpt-4o-mini) 300ms
[] neam.handoff (TriageAgent -> RefundAgent) 1ms
[====================================] neam.agent.ask (RefundAgent) 750ms
[==========================] neam.llm.call (openai/gpt-4o-mini) 600ms
prompt_tokens: 800 completion_tokens: 200 cost: $0.0015
From this trace, you can immediately see:
- The total request took 1450ms
- The LLM call to OpenAI was the bottleneck (900ms for triage, 600ms for refund)
- RAG retrieval was fast (50ms)
- Reflection added 350ms of overhead (with its own LLM call)
- The handoff from TriageAgent to RefundAgent was instantaneous
22.4 Prometheus Metrics #
Neam exports metrics to Prometheus via the OTel Collector. These metrics provide aggregate visibility across all requests, complementing the per-request detail of traces.
Exported Metrics #
| Metric | Type | Labels | Description |
|---|---|---|---|
| `neam_llm_requests_total` | Counter | `provider`, `model`, `status` | Total LLM API calls |
| `neam_llm_tokens_total` | Counter | `provider`, `model`, `type` | Tokens consumed (prompt/completion) |
| `neam_llm_latency_seconds` | Histogram | `provider`, `model` | LLM call latency distribution |
| `neam_llm_cost_usd_total` | Counter | `provider`, `model` | Accumulated LLM cost |
| `neam_agent_requests_total` | Counter | `agent`, `status` | Agent `ask()` calls |
| `neam_agent_latency_seconds` | Histogram | `agent` | End-to-end agent latency |
| `neam_rag_queries_total` | Counter | `strategy`, `knowledge_base` | RAG retrieval queries |
| `neam_rag_latency_seconds` | Histogram | `strategy` | RAG retrieval latency |
| `neam_tool_calls_total` | Counter | `tool`, `status` | Tool invocations |
| `neam_reflection_score` | Gauge | `agent`, `dimension` | Latest reflection scores |
| `neam_gateway_rate_limit_waits_total` | Counter | `provider` | Rate limit delays |
| `neam_gateway_circuit_breaker_state` | Gauge | `provider` | Circuit state (0=closed, 1=open, 2=half-open) |
| `neam_gateway_cache_hits_total` | Counter | `provider` | Cache hits |
| `neam_gateway_cache_misses_total` | Counter | `provider` | Cache misses |
| `neam_gateway_cost_daily_usd` | Gauge | (none) | Current daily cost |
| `neam_gateway_cost_budget_usd` | Gauge | (none) | Configured daily budget |
Prometheus Configuration #
# docker/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'otel-collector'
static_configs:
- targets: ['otel-collector:8889']
metrics_path: /metrics
- job_name: 'neam-agent'
static_configs:
- targets: ['neam-agent:8080']
metrics_path: /metrics
Useful PromQL Queries #
Request rate (requests per second):
rate(neam_agent_requests_total[5m])
P95 agent latency:
histogram_quantile(0.95, rate(neam_agent_latency_seconds_bucket[5m]))
LLM cost per hour:
rate(neam_llm_cost_usd_total[1h]) * 3600
Token consumption rate by provider:
sum by (provider) (rate(neam_llm_tokens_total[5m]))
Cache hit ratio:
sum(rate(neam_gateway_cache_hits_total[5m]))
/
(sum(rate(neam_gateway_cache_hits_total[5m])) + sum(rate(neam_gateway_cache_misses_total[5m])))
Circuit breaker status (1 = problem):
neam_gateway_circuit_breaker_state > 0
Budget utilization percentage:
neam_gateway_cost_daily_usd / neam_gateway_cost_budget_usd * 100
22.5 LLM Gateway Monitoring #
The LLM Gateway is the most critical component to monitor because it controls the flow of all LLM requests. The gateway exposes its internal state through the readiness endpoint and through Prometheus metrics.
Rate Limit Tracking #
The gateway tracks per-provider request rates and enforces the limits defined in
neam.toml:
[llm.rate-limits.openai]
requests-per-minute = 120
Monitoring rate limits:
# Current request rate vs. limit
rate(neam_llm_requests_total{provider="openai"}[1m]) * 60
# Compare against the configured limit of 120
# Rate limit wait events (indicates you are approaching the limit)
rate(neam_gateway_rate_limit_waits_total{provider="openai"}[5m])
When rate limit waits increase, it means the gateway is throttling requests to stay within the configured limit. If waits are frequent, consider:
- Increasing the `requests-per-minute` limit (if the provider allows it)
- Adding a fallback provider to distribute load
- Enabling response caching to reduce redundant calls
Circuit Breaker State #
The circuit breaker has three states, represented as a gauge metric:
| Value | State | Meaning |
|---|---|---|
| 0 | Closed | Normal operation |
| 1 | Open | Provider is down; all requests rejected |
| 2 | Half-Open | Probing the provider with a single request |
# Alert when any circuit is open
neam_gateway_circuit_breaker_state{provider="openai"} == 1
Visualizing circuit breaker transitions:
In Grafana, create a state timeline panel with the `neam_gateway_circuit_breaker_state` metric. This shows exactly when each provider went down and how long it took to recover:
Time: 00:00 00:05 00:10 00:15 00:20 00:25 00:30
OpenAI: [--- Closed ---][Open][HO][--- Closed ---]
Anthropic: [---------- Closed ----------------------------------]
Cache Hit Rates #
# Cache hit ratio (higher is better, saves money)
sum(rate(neam_gateway_cache_hits_total[5m]))
/
(sum(rate(neam_gateway_cache_hits_total[5m])) + sum(rate(neam_gateway_cache_misses_total[5m])))
A cache hit ratio of 0 means caching is not effective (likely because all agents use temperature > 0, so no two prompts produce identical cache keys). A ratio above 0.3 means roughly 30% of LLM calls were served from cache instead of being billed to the provider.
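The ratio itself is simple arithmetic over the two counters; a quick Python sketch of the same computation the PromQL expression performs:

```python
def cache_hit_ratio(hits: float, misses: float) -> float:
    """Fraction of LLM requests answered from cache rather than
    billed to the provider. Guards the no-traffic case, which the
    PromQL expression would render as NaN."""
    total = hits + misses
    return hits / total if total else 0.0

# 30 hits out of 100 requests: ~30% of calls avoided the provider.
assert abs(cache_hit_ratio(30, 70) - 0.3) < 1e-9
assert cache_hit_ratio(0, 0) == 0.0  # no traffic yet, not an error
```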
Cost Tracking #
The gateway tracks real-time cost using Neam's built-in pricing table:
# Daily cost (USD)
neam_gateway_cost_daily_usd
# Budget utilization
neam_gateway_cost_daily_usd / neam_gateway_cost_budget_usd * 100
# Cost by provider
sum by (provider) (rate(neam_llm_cost_usd_total[1h])) * 3600
# Cost by model
sum by (model) (rate(neam_llm_cost_usd_total[1h])) * 3600
Cost dashboard example:
+----------------------------------------------+
| Daily LLM Cost |
| |
| $47.32 / $100.00 budget (47.3%) |
| [========================............] 47% |
| |
| By Provider: |
| OpenAI: $38.50 (81%) |
| Anthropic: $8.82 (19%) |
| |
| By Model: |
| gpt-4o-mini: $32.10 |
| gpt-4o: $6.40 |
| claude-3.5: $8.82 |
+----------------------------------------------+
22.6 Distributed Tracing Across Multi-Agent Systems #
When a request flows through multiple agents (triage -> specialist -> supervisor), distributed tracing keeps the entire chain visible as a single trace.
Trace Propagation #
Within a single Neam VM, trace propagation is automatic. The VM maintains a trace context stack, and when one agent hands off to another, the child agent's span is created with the parent agent's span ID.
Cross-Service Tracing #
When agents communicate across services (via the A2A protocol), the trace context is propagated via HTTP headers following the W3C Trace Context standard:
POST /a2a HTTP/1.1
Host: specialist-service.internal
Content-Type: application/json
traceparent: 00-abc123def456abc123def456abc123de-1a2b3c4d5e6f7a8b-01
tracestate: neam=agent:TriageAgent
{"jsonrpc": "2.0", "method": "tasks/send", ...}
The receiving service picks up the traceparent header and creates its spans as
children of the calling service's span. This means a single trace in Jaeger can show
the complete request path across multiple Neam services.
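To make the header format concrete, here is an illustrative Python parser for the `traceparent` header (a simplified sketch of W3C Trace Context handling, not Neam's implementation; it ignores the optional `tracestate` header):

```python
def parse_traceparent(header: str):
    """Split a W3C `traceparent` header into its four hex fields:
    version-traceid-parentid-flags."""
    version, trace_id, parent_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(parent_id) == 16
    return {"version": version, "trace_id": trace_id,
            "parent_span_id": parent_id, "sampled": flags == "01"}

def child_traceparent(parent: dict, new_span_id: str) -> str:
    """Build the header the receiving service sends onward: the trace
    ID is preserved, while each hop substitutes its own span ID."""
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{new_span_id}-{flags}"

ctx = parse_traceparent(
    "00-abc123def456abc123def456abc123de-1a2b3c4d5e6f7a8b-01")
assert ctx["trace_id"] == "abc123def456abc123def456abc123de"
assert ctx["sampled"] is True
# The downstream hop shares the trace ID but carries a new span ID
# (the span ID here is a made-up example value):
assert child_traceparent(ctx, "9f8e7d6c5b4a3f2e").startswith(
    "00-abc123def456abc123def456abc123de-")
```

Because the trace ID survives every hop, the sampling decision (which hashes the trace ID) also stays consistent across services.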
Practical Example: Multi-Service Tracing #
agent TriageAgent {
provider: "openai"
model: "gpt-4o-mini"
system: "Route customer requests."
handoffs: [RefundAgent]
}
{
// This creates a root span: neam.agent.ask
let triage = TriageAgent.ask("I need a refund for order #123");
// Handoff propagates the trace context
// The RefundAgent span becomes a child of this span
}
agent RefundAgent {
provider: "openai"
model: "gpt-4o"
system: "Process refund requests."
}
{
// When called via A2A, the trace context is inherited
// from the traceparent header
let result = RefundAgent.ask("Process refund for order #123");
emit result;
}
In Jaeger, the combined trace shows:
Trace abc123 (2100ms)
Service: triage-service
neam.agent.ask (TriageAgent) ........................ 1200ms
neam.llm.call (openai/gpt-4o-mini) .............. 900ms
neam.handoff (TriageAgent -> RefundAgent) ........ 1ms
Service: refund-service
neam.agent.ask (RefundAgent) ....................... 900ms
neam.llm.call (openai/gpt-4o) ................... 750ms
22.7 Alerting Strategies #
Monitoring without alerting is just logging with a GUI. Here are alerting rules for the most important Neam operational signals.
Prometheus Alerting Rules #
# alerting-rules.yaml
groups:
- name: neam-agent
rules:
# Alert when error rate exceeds 5%
- alert: NeamHighErrorRate
expr: |
sum(rate(neam_agent_requests_total{status="error"}[5m]))
/
sum(rate(neam_agent_requests_total[5m]))
> 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "Neam agent error rate above 5%"
description: "{{ $value | humanizePercentage }} of requests are failing"
# Alert when P95 latency exceeds 5 seconds
- alert: NeamHighLatency
expr: |
histogram_quantile(0.95, rate(neam_agent_latency_seconds_bucket[5m]))
> 5
for: 10m
labels:
severity: warning
annotations:
summary: "Neam P95 latency above 5 seconds"
# Alert when a circuit breaker is open
- alert: NeamCircuitBreakerOpen
expr: neam_gateway_circuit_breaker_state == 1
for: 2m
labels:
severity: critical
annotations:
summary: "LLM provider {{ $labels.provider }} circuit breaker is open"
description: "All requests to {{ $labels.provider }} are being rejected"
# Alert when daily cost exceeds 80% of budget
- alert: NeamCostBudgetWarning
expr: |
neam_gateway_cost_daily_usd / neam_gateway_cost_budget_usd > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "Daily LLM cost at {{ $value | humanizePercentage }} of budget"
# Alert when daily cost exceeds 95% of budget
- alert: NeamCostBudgetCritical
expr: |
neam_gateway_cost_daily_usd / neam_gateway_cost_budget_usd > 0.95
for: 1m
labels:
severity: critical
annotations:
summary: "Daily LLM cost at {{ $value | humanizePercentage }} of budget"
# Alert when rate limit waits are frequent
- alert: NeamRateLimitPressure
expr: |
rate(neam_gateway_rate_limit_waits_total[5m]) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "Provider {{ $labels.provider }} under rate limit pressure"
# Alert when state backend is unreachable
- alert: NeamStateBackendDown
expr: |
up{job="neam-agent"} == 1
unless
neam_health_state_backend_connected == 1
for: 2m
labels:
severity: critical
annotations:
summary: "Neam state backend is unreachable"
# Alert when all pods are not ready
- alert: NeamNoReadyPods
expr: |
kube_deployment_status_replicas_ready{deployment="neam-agent"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "No ready Neam agent pods"
Alert Priority Matrix #
| Condition | Severity | Response Time | Action |
|---|---|---|---|
| All pods down | Critical | Immediate | Page on-call, investigate cluster |
| Circuit breaker open | Critical | 5 min | Check provider status, verify failover |
| Cost > 95% budget | Critical | 15 min | Investigate usage, consider throttling |
| Error rate > 5% | Warning | 30 min | Review traces, check for bad inputs |
| P95 latency > 5s | Warning | 1 hour | Review traces, check provider latency |
| Rate limit pressure | Warning | 1 hour | Consider increasing limits or caching |
| Cost > 80% budget | Warning | 4 hours | Review cost trends, adjust budget |
22.8 Operational Runbook #
Here is a practical runbook for diagnosing common issues using the observability stack.
"Why is my agent slow?" #
1. Check Prometheus: Query `histogram_quantile(0.95, rate(neam_agent_latency_seconds_bucket[5m]))` to confirm the latency baseline.
2. Check Jaeger: Find a slow trace. Look at the span tree:
   - Is the LLM call slow? (Provider issue or large prompt)
   - Is RAG retrieval slow? (Knowledge base too large or slow vector search)
   - Is reflection adding latency? (Consider reducing `min_confidence` or disabling reflection for non-critical agents)
3. Check rate limits: Query `rate(neam_gateway_rate_limit_waits_total[5m])`. If rate limit waits are high, the gateway is throttling requests.
4. Check cache hit ratio: If the cache is available but the hit ratio is 0, check that `temperature: 0` is set on deterministic agents.
"Why is my agent returning errors?" #
1. Check circuit breaker state: Query `neam_gateway_circuit_breaker_state`. If a circuit is open (1), the provider is down.
2. Check the readiness endpoint: `curl http://neam-agent:8080/ready` shows which components are unhealthy.
3. Check Jaeger: Find traces with error status. The error span will have a `status_message` attribute explaining the failure.
4. Check provider health: Query `sum by (provider, status) (rate(neam_llm_requests_total[5m]))` to see error rates per provider.
"Am I spending too much?" #
1. Check daily cost: Query `neam_gateway_cost_daily_usd` for the current total.
2. Break down by model: Query `sum by (model) (rate(neam_llm_cost_usd_total[1h])) * 3600` to find the most expensive model.
3. Check cache effectiveness: A low cache hit ratio means you are paying for redundant calls.
4. Check token usage: Query `sum by (agent) (rate(neam_llm_tokens_total[1h]))` to find agents consuming the most tokens. Long system prompts or large RAG contexts inflate token counts.
22.9 Complete Observability Example #
Here is a complete Neam agent with full observability configuration:
# neam.toml
[project]
name = "observed-agent"
version = "1.0.0"
[project.entry_points]
main = "src/main.neam"
[state]
backend = "postgres"
connection-string = "postgresql://neam:pass@postgres:5432/neam"
[llm]
default-provider = "openai"
default-model = "gpt-4o-mini"
[llm.rate-limits.openai]
requests-per-minute = 120
[llm.circuit-breaker]
failure-threshold = 3
reset-timeout-seconds = 60
[llm.cache]
enabled = true
max-entries = 1000
ttl-seconds = 600
[llm.cost]
daily-budget-usd = 100.0
[telemetry]
enabled = true
endpoint = "http://otel-collector:4318"
service-name = "observed-agent"
sampling-rate = 1.0
agent AnalystAgent {
provider: "openai"
model: "gpt-4o-mini"
temperature: 0.3
system: "You are a data analyst. Provide clear, data-driven answers."
reasoning: chain_of_thought
reflect: {
after: each_response
evaluate: [accuracy, clarity]
min_confidence: 0.7
on_low_quality: {
strategy: "revise"
max_revisions: 1
}
}
learning: {
strategy: "experience_replay"
review_interval: 20
}
memory: "analyst_memory"
}
{
let query = input();
let answer = AnalystAgent.ask(query);
emit answer;
// Check learning stats periodically
let stats = agent_learning_stats("AnalystAgent");
emit "Interactions: " + str(stats["total_interactions"]);
emit "Avg score: " + str(stats["avg_reflection_score"]);
}
With this configuration, every request generates:
- Traces in Jaeger showing the agent call, LLM request, RAG query (if any), and reflection pass
- Metrics in Prometheus tracking request rate, latency, token usage, cost, cache hits, and circuit breaker state
- Health endpoints for Kubernetes probes
22.10 Observability Standard Library Modules #
The Neam standard library includes a comprehensive observability package organized into six sub-packages. These modules let you extend the built-in telemetry with custom instrumentation, alternative exporters, and diagnostic tools.
Package Overview #
| Sub-package | Modules | Purpose |
|---|---|---|
| `observability/core` | `tracer`, `meter`, `logger`, `context`, `sampling` | Core OTel providers and context management |
| `observability/exporters` | `otlp`, `jaeger`, `elasticsearch`, `mlflow`, `langfuse`, `sqlite`, `local`, `multi` | Export destinations for traces, metrics, and logs |
| `observability/instrumentation` | `llm`, `agent`, `tool`, `handoff`, `memory` | Automatic span creation for Neam operations |
| `observability/semantic` | `attributes`, `genai`, `events` | OpenTelemetry semantic conventions for AI |
| `observability/triage` | `triage`, `anomaly`, `patterns`, `compare`, `dependencies`, `gaps`, `replay`, `reports` | Diagnostic analysis and debugging |
| `observability/config` | `programmatic`, `environment`, `runtime` | Configuration methods |
Using the Core Modules #
The core modules give you direct access to the OTel tracer, meter, and logger providers for custom instrumentation:
import observability/core/tracer
import observability/core/meter
fun process_order(order_id) {
let span = tracer.start_span("process_order", {
"order.id": order_id,
"order.source": "web"
})
let counter = meter.counter("orders_processed_total", {
description: "Total orders processed"
})
let result = do_processing(order_id)
counter.add(1, { "status": result.status })
span.set_attribute("order.status", result.status)
span.end()
return result
}
Sampling Strategies #
The sampling module provides four strategies beyond the default trace-ID ratio:
import observability/core/sampling
let sampler = sampling.create({
strategy: "parent_based",
root: {
strategy: "trace_id_ratio",
rate: 0.1
}
})
| Strategy | Description |
|---|---|
| `always_on` | Sample every trace (development) |
| `always_off` | Sample nothing (disable telemetry without removing config) |
| `trace_id_ratio` | Sample a fixed percentage based on trace ID hash |
| `parent_based` | Inherit sampling decision from parent span; use a fallback strategy for root spans |
Alternative Exporters #
Beyond OTLP and Jaeger, Neam supports several specialized exporters:
import observability/exporters/elasticsearch
import observability/exporters/langfuse
import observability/exporters/mlflow
let es_exporter = elasticsearch.create({
url: "https://elasticsearch:9200",
traces_index: "neam-traces",
metrics_index: "neam-metrics",
logs_index: "neam-logs"
})
let langfuse_exporter = langfuse.create({
public_key: env("LANGFUSE_PUBLIC_KEY"),
secret_key: env("LANGFUSE_SECRET_KEY"),
host: "https://cloud.langfuse.com"
})
let mlflow_exporter = mlflow.create({
tracking_uri: "http://mlflow:5000",
experiment_name: "neam-agent-eval"
})
| Exporter | Best For |
|---|---|
| `otlp` | Standard OTel Collector pipeline |
| `jaeger` | Direct Jaeger ingestion (no collector) |
| `elasticsearch` | Full-text search over traces and logs |
| `langfuse` | LLM-specific observability with prompt tracking |
| `mlflow` | ML experiment tracking and model registry |
| `sqlite` | Local development without external services |
| `local` | File-based export for offline analysis |
| `multi` | Route different signals to different exporters |
The multi exporter lets you send traces and metrics to different destinations:
import observability/exporters/multi
let pipeline = multi.create({
traces: [otlp_exporter, langfuse_exporter],
metrics: [otlp_exporter],
logs: [elasticsearch_exporter]
})
Semantic Conventions for AI #
The semantic/attributes module defines standard attribute names following the
OpenTelemetry GenAI semantic conventions:
import observability/semantic/attributes
// GenAI operation attributes
attributes.GEN_AI_SYSTEM // "gen_ai.system" (e.g., "openai")
attributes.GEN_AI_REQUEST_MODEL // "gen_ai.request.model"
attributes.GEN_AI_REQUEST_MAX_TOKENS
attributes.GEN_AI_REQUEST_TEMPERATURE
// GenAI response attributes
attributes.GEN_AI_USAGE_PROMPT_TOKENS
attributes.GEN_AI_USAGE_COMPLETION_TOKENS
attributes.GEN_AI_RESPONSE_FINISH_REASONS
// Agent-specific attributes
attributes.AGENT_NAME // "agent.name"
attributes.AGENT_ID // "agent.id"
attributes.AGENT_TEAM // "agent.team"
attributes.AGENT_ROLE // "agent.role"
attributes.AGENT_PARENT // "agent.parent"
Using standard attribute names ensures your traces are compatible with any OTel-
compatible backend and enables cross-tool queries like "show me all traces where
gen_ai.usage.prompt_tokens > 5000."
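That cross-tool query is ultimately just an attribute filter. A Python sketch over hand-made span data (the span dicts are illustrative, not output of a real backend):

```python
# Spans with standard attribute names can be filtered uniformly,
# whatever backend stores them. (Illustrative data, not real traces.)
spans = [
    {"name": "neam.llm.call",
     "attributes": {"gen_ai.system": "openai",
                    "gen_ai.request.model": "gpt-4o-mini",
                    "gen_ai.usage.prompt_tokens": 6200}},
    {"name": "neam.llm.call",
     "attributes": {"gen_ai.system": "anthropic",
                    "gen_ai.request.model": "claude-3.5",
                    "gen_ai.usage.prompt_tokens": 800}},
]

# "Show me all spans where gen_ai.usage.prompt_tokens > 5000":
heavy = [s for s in spans
         if s["attributes"].get("gen_ai.usage.prompt_tokens", 0) > 5000]
assert len(heavy) == 1
assert heavy[0]["attributes"]["gen_ai.system"] == "openai"
```

Had the spans used ad-hoc names like `prompt_len` or `tokens_in`, the same query would need per-source translation; the semantic conventions make the attribute key the contract.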
22.11 Structured Logging #
In addition to traces and metrics, Neam supports structured logging through the OpenTelemetry Logs API. Structured logs attach key-value attributes to each log record, making them searchable and correlatable with traces.
Log Configuration #
[telemetry]
enabled = true
endpoint = "http://otel-collector:4318"
service-name = "neam-agent"
log-level = "info"
The log level controls which records are emitted:
| Level | Emitted At | Examples |
|---|---|---|
| `debug` | Development only | Prompt text, full LLM responses, internal state |
| `info` | Normal operations | Agent started, request processed, handoff completed |
| `warn` | Potential issues | Rate limit approached, cache eviction, slow query |
| `error` | Failures | LLM call failed, state backend timeout, circuit open |
Log Records #
Each log record is a structured JSON object exported via OTLP alongside traces and metrics:
{
"timestamp": "2026-01-30T14:32:05.123Z",
"severity": "WARN",
"body": "Rate limit approaching threshold",
"attributes": {
"provider": "openai",
"current_rpm": 108,
"limit_rpm": 120,
"utilization_pct": 90
},
"traceId": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",
"spanId": "1a2b3c4d5e6f7a8b"
}
The traceId and spanId fields correlate logs with the trace that produced
them. In Grafana, this means you can click from a log line directly to the
corresponding trace in Jaeger.
Custom Log Records #
Use the logger module to emit structured logs from your agent code:
import observability/core/logger
let log = logger.create({ name: "order-processor" })
fun process_order(order) {
log.info("Processing order", {
"order.id": order.id,
"order.total": order.total,
"customer.tier": order.customer_tier
})
if (order.total > 10000) {
log.warn("High-value order requires review", {
"order.id": order.id,
"order.total": order.total
})
}
}
Log Aggregation Pipeline #
In the OTel Collector, logs flow through the same pipeline as traces and metrics:
service:
pipelines:
logs:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
exporters: [elasticsearch, debug]
Elasticsearch is the recommended log destination because it supports full-text
search, aggregations, and Kibana dashboards. For simpler setups, the debug
exporter writes logs to stdout, which Docker and Kubernetes capture automatically.
22.12 Privacy and Redaction #
Production agents handle sensitive data — customer names, account numbers, API
keys in prompts. The observability stack must not leak this data into traces or
logs. The observability/privacy module provides configurable redaction rules.
Redaction Configuration #
import observability/privacy
let privacy_config = privacy.create({
mode: "redact",
rules: [
{ pattern: "\\b\\d{4}[- ]?\\d{4}[- ]?\\d{4}[- ]?\\d{4}\\b", replace: "[CARD]" },
{ pattern: "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b", replace: "[EMAIL]" },
{ pattern: "sk-[a-zA-Z0-9]{20,}", replace: "[API_KEY]" },
{ pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b", replace: "[SSN]" }
],
capture_prompts: false,
capture_responses: false
})
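To make the rules above concrete, here is what redact mode does to a sample log body (the input line is invented for this illustration):

// Before redaction:
//   "Refund card 4242 4242 4242 4242 for jane.doe@example.com (key sk-AbC123xYz456DeF789GhI)"
// After applying the rules above:
//   "Refund card [CARD] for [EMAIL] (key [API_KEY])"

Span names, durations, and non-sensitive attributes pass through untouched; only the matched substrings are replaced before export.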
Privacy Modes #
| Mode | Behavior |
|---|---|
| full | Capture everything — prompts, responses, tool inputs/outputs (development only) |
| redact | Apply regex rules to sanitize sensitive patterns before export |
| hash | Replace sensitive values with one-way hashes (preserves cardinality for analysis) |
| minimal | Capture only span names, durations, and status codes — no content attributes |
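The cardinality-preserving property of hash mode can be illustrated as follows (the digests shown are truncated placeholders, not real hash output):

// In hash mode, the same sensitive value always maps to the same digest,
// so questions like "how many distinct customers hit this error?" remain
// answerable without exposing raw emails.
// jane.doe@example.com  -> "h:9f2c…"   // placeholder digest
// jane.doe@example.com  -> "h:9f2c…"   // same input, same digest
// john.roe@example.com  -> "h:41ab…"   // different input, different digest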
Controlling What Gets Traced #
By default, Neam traces include span names and metadata attributes (agent name, provider, model, token counts, latency). Prompt and response text are not captured unless explicitly enabled:
[telemetry]
capture-prompts = false # Do not include prompt text in spans
capture-responses = false # Do not include response text in spans
For debugging specific issues, you can enable prompt capture temporarily using the runtime configuration module:
import observability/config/runtime
runtime.set_capture("TriageAgent", {
capture_prompts: true,
capture_responses: true,
duration: "30m"
})
This enables prompt/response capture for TriageAgent only, for 30 minutes,
then automatically reverts to the default configuration.
Access Control #
The privacy module supports role-based access to observability data:
let access_config = privacy.access_control({
roles: {
"developer": ["traces", "metrics"],
"ops": ["traces", "metrics", "logs"],
"security": ["traces", "metrics", "logs", "prompts"]
}
})
This does not enforce access at the Neam level — it sets metadata tags on exported data that downstream systems (Grafana, Kibana) can use for RBAC filtering.
22.13 Diagnostic Triage #
The observability/triage module provides automated diagnostic tools for
identifying issues in production without manual trace inspection.
Anomaly Detection #
The anomaly detector monitors metrics for deviations from learned baselines:
import observability/triage/anomaly
let detector = anomaly.create({
metrics: ["neam_agent_latency_seconds", "neam_llm_cost_usd_total"],
window: "1h",
sensitivity: 2.0,
on_anomaly: fun(alert) {
log.warn("Anomaly detected: " + alert.metric, {
"expected": alert.expected,
"actual": alert.actual,
"deviation": alert.deviation
})
}
})
The detector uses a rolling window to compute the mean and standard deviation of
each metric. When the current value deviates by more than sensitivity standard
deviations, the on_anomaly callback fires.
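To make the threshold concrete, here is the z-score check the detector performs, sketched as a worked example (the numbers are illustrative, and this is a simplification of the library's internals):

// Rolling 1h window for neam_agent_latency_seconds:
let mean = 2.0        // learned baseline mean (seconds)
let stddev = 0.5      // learned baseline standard deviation
let sensitivity = 2.0 // from the detector config above
let current = 3.4     // latest observed value

let deviation = (current - mean) / stddev // = 2.8 standard deviations
if (deviation > sensitivity) {
  // fires on_anomaly with expected=2.0, actual=3.4, deviation=2.8
}

With sensitivity 2.0, anything outside the band mean ± 2·stddev (here, 1.0s to 3.0s) triggers the callback; raising the sensitivity widens the band and reduces alert noise.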
Error Pattern Analysis #
The pattern analyzer groups errors by type and identifies recurring failure modes:
import observability/triage/patterns
let analysis = patterns.analyze({
window: "24h",
min_occurrences: 5
})
for (pattern, details) in analysis {
emit "Pattern: " + pattern
emit " Count: " + str(details.count)
emit " First seen: " + details.first_seen
emit " Last seen: " + details.last_seen
emit " Affected agents: " + str(details.agents)
}
Dependency Graph #
The dependency graph builder analyzes traces to map service-to-service relationships:
import observability/triage/dependencies
let graph = dependencies.build({ window: "1h" })
for (service, deps) in graph {
emit service + " depends on: " + join(deps, ", ")
}
This is useful for understanding blast radius: if a provider goes down, which agents and services are affected?
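A blast-radius query is simply the dependency graph inverted. As a sketch (contains and append are assumed helpers, not necessarily the standard library's names; join appears in the example above):

import observability/triage/dependencies

let graph = dependencies.build({ window: "1h" })

// Invert the edges: which services would be affected if "openai" failed?
let affected = []
for (service, deps) in graph {
  if (contains(deps, "openai")) { // assumed membership helper
    affected = append(affected, service)
  }
}
emit "openai outage would affect: " + join(affected, ", ")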
Diagnostic Reports #
The report generator combines anomaly detection, error patterns, and dependency analysis into a structured diagnostic report:
import observability/triage/reports
let report = reports.generate({
window: "24h",
include: ["anomalies", "errors", "dependencies", "recommendations"]
})
emit report.summary
for rec in report.recommendations {
emit " - " + rec
}
A typical report might contain:
Diagnostic Report (last 24h)
=============================
Anomalies: 2
- neam_agent_latency_seconds: 3.2x above baseline (P95: 8.1s vs. 2.5s baseline)
- neam_llm_cost_usd_total: 1.8x above baseline ($142 vs. $78 baseline)
Error patterns: 1
- "429 Too Many Requests" from openai (47 occurrences, affecting TriageAgent)
Dependencies:
TriageAgent → openai, postgres
RefundAgent → openai, postgres
SupervisorAgent → anthropic, postgres
Recommendations:
- Increase OpenAI rate limit or add fallback provider (47 rate limit errors)
- Investigate TriageAgent prompt length (high token cost correlates with latency)
- Consider caching for TriageAgent (0% cache hit rate)
Summary #
In this chapter, you learned:
- Three health check endpoints: /health (liveness), /ready (readiness), /startup (initialization) and their distinct semantics
- OpenTelemetry integration: OTLP/HTTP JSON export, automatic span creation, sampling, batching
- Jaeger for trace visualization: reading span trees, identifying bottlenecks
- Prometheus metrics: LLM cost tracking, token usage, latency histograms, cache hits, circuit breaker state
- LLM Gateway monitoring: rate limits, circuit breakers, cache effectiveness, cost budgets
- Distributed tracing across multi-agent, multi-service systems
- Alerting strategies with Prometheus alerting rules
- An operational runbook for diagnosing latency, errors, and cost issues
- Observability standard library: 56 modules across core providers, exporters (OTLP, Jaeger, Elasticsearch, Langfuse, MLflow), instrumentation, and semantic conventions
- Structured logging with the OTel Logs API, log levels, and trace-log correlation
- Privacy and redaction: four privacy modes, configurable regex rules, runtime capture control
- Diagnostic triage: automated anomaly detection, error pattern analysis, dependency graphing, and report generation
These tools and techniques give you complete visibility into your Neam agents in production. Combined with the deployment patterns from Chapters 20 and 21, you now have everything needed to build, deploy, and operate production AI agent systems.
Exercises #
Exercise 22.1: Health Check Design #
A Neam agent uses PostgreSQL for state, OpenAI and Anthropic for LLM calls, and has
telemetry enabled. Write the expected JSON response for /ready in each of these
scenarios:
- Everything is healthy
- PostgreSQL is down, LLM providers are fine
- OpenAI circuit is open, Anthropic is healthy
- Both OpenAI and Anthropic circuits are open
For each scenario, state whether the readiness probe passes (HTTP 200) or fails (HTTP 503) and explain why.
Exercise 22.2: Trace Analysis #
Given the following Jaeger trace for a customer service request:
neam.agent.ask (TriageAgent) .................... 3500ms
  neam.rag.query (hybrid, 5 docs) .............. 1200ms
  neam.llm.call (openai/gpt-4o) ................ 1800ms
    prompt_tokens: 4500
    completion_tokens: 200
  neam.reflection (accuracy: 0.65) .............. 400ms
    neam.llm.call (openai/gpt-4o) .............. 350ms
  neam.reflection (revision 1, accuracy: 0.82) .. 400ms
    neam.llm.call (openai/gpt-4o) .............. 350ms
Answer the following:
- What is the biggest contributor to latency?
- Why did the reflection pass run twice?
- How many total LLM calls were made?
- Estimate the total token cost assuming GPT-4o at $5/1M input, $15/1M output.
- Suggest three optimizations to reduce the total latency.
Exercise 22.3: Prometheus Queries #
Write PromQL queries for the following:
- The average number of LLM tokens consumed per agent request (over the last hour)
- The cache hit ratio for OpenAI calls specifically
- The number of circuit breaker state transitions in the last 24 hours
- The top 3 agents by total cost in the last day
- An alert rule that fires when the rate limit wait time exceeds 1 second on average
Exercise 22.4: Alerting Configuration #
Design an alerting strategy for a Neam deployment with these SLAs:
- 99.5% availability (measured as successful responses / total requests)
- P99 latency under 10 seconds
- Monthly LLM budget of $3,000
Write Prometheus alerting rules with appropriate thresholds, for durations (how long a condition must persist before the alert fires), and severity levels. Include both warning and critical tiers for each SLA.
Exercise 22.5: Cost Optimization Analysis #
A production Neam deployment has these metrics over 24 hours:
- neam_llm_requests_total{provider="openai",model="gpt-4o"}: 5,000
- neam_llm_requests_total{provider="openai",model="gpt-4o-mini"}: 45,000
- neam_llm_tokens_total{type="prompt"}: 25,000,000
- neam_llm_tokens_total{type="completion"}: 5,000,000
- neam_gateway_cache_hits_total: 8,000
- neam_gateway_cache_misses_total: 42,000
Answer the following:
- Calculate the approximate daily LLM cost (use GPT-4o at $5/$15 per 1M tokens, GPT-4o-mini at $0.15/$0.60 per 1M tokens).
- What is the current cache hit ratio?
- If the cache hit ratio improved to 40%, how much would you save daily?
- Should any agents be migrated from GPT-4o to GPT-4o-mini? What information would you need to make this decision?
Exercise 22.6: Distributed Tracing Design #
Design the tracing instrumentation for a multi-service Neam deployment with:
- Gateway Service: Accepts HTTP requests, authenticates users, routes to agents
- Triage Service: Runs TriageAgent to classify requests
- Specialist Service: Runs RefundAgent, BillingAgent, and TechSupportAgent
- Review Service: Runs SupervisorAgent to review specialist responses
Draw the span hierarchy for a request that goes through all four services. List the attributes you would set on each span. Explain how the trace context propagates between services.