Chapter 22: Observability and Monitoring #
"You cannot improve what you cannot measure. And in production, you cannot debug what you cannot observe." -- Observability engineering axiom
What You Will Learn #
In this chapter, you will learn how to observe and monitor Neam agents in production. You will understand the three health check endpoints and their semantics, configure OpenTelemetry integration for distributed tracing and metrics, visualize traces in Jaeger, build Prometheus dashboards, monitor the LLM Gateway (rate limits, circuit breaker state, cache hits, cost), trace requests across multi-agent systems, and design alerting strategies. By the end of this chapter, you will be able to answer the question "why is my agent slow?" in under five minutes.
22.1 Health Check Semantics #
Neam v0.6.0 exposes three health check endpoints, each with distinct semantics. These endpoints are used by Kubernetes probes, load balancers, and monitoring systems to determine the operational state of a Neam agent.
GET /health (Liveness) #
The liveness endpoint answers one question: is the Neam process alive and able to respond to HTTP requests?
What it checks:

- The HTTP server is listening and can process requests
- The main event loop has not deadlocked

What it does NOT check:

- External dependencies (database, LLM providers, OTel collector)
- Whether agents are loaded or initialized
Response when healthy (HTTP 200):
{
"status": "ok",
"version": "0.6.0",
"uptime_seconds": 3672
}
When it fails: The process is irrecoverably broken. Kubernetes kills the pod and restarts it.
Kubernetes configuration:
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 15
periodSeconds: 20
timeoutSeconds: 5
failureThreshold: 3
This means: after an initial 15-second delay, check /health every 20 seconds. If
3 consecutive checks fail (each with a 5-second timeout), kill and restart the pod.
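To make the threshold semantics concrete, here is an illustrative Python model of the restart decision (not Kubernetes code — just the consecutive-failure rule the settings above imply):

```python
def should_restart(probe_results, failure_threshold=3):
    """Return True once `failure_threshold` consecutive probes have failed.

    `probe_results` is an ordered list of booleans (True = probe passed).
    Kubernetes resets the failure counter on any success, so only a
    consecutive run of failures triggers a restart.
    """
    consecutive_failures = 0
    for passed in probe_results:
        if passed:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return True
    return False

# A failure followed by a success resets the counter: no restart.
assert should_restart([True, False, True, False]) is False
# Three consecutive failures trigger the restart.
assert should_restart([True, False, False, False]) is True
```

Note that intermittent failures never accumulate: only a sustained outage (3 failures x 20s period, roughly a minute of unresponsiveness) kills the pod.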
GET /ready (Readiness) #
The readiness endpoint answers: can this pod serve traffic right now?
What it checks:

1. State backend connectivity: Can the VM connect to and query the configured state backend (SQLite, PostgreSQL, Redis, DynamoDB, CosmosDB)?
2. LLM provider availability: Is at least one LLM provider circuit in the Closed or HalfOpen state? (If all circuits are Open, the agent cannot make LLM calls.)
3. Telemetry health: If telemetry is enabled, is the export queue below its capacity limit? (A full queue indicates the OTLP endpoint is down.)
Response when ready (HTTP 200):
{
"status": "ready",
"checks": {
"state_backend": {
"status": "connected",
"type": "postgres",
"latency_ms": 2
},
"llm_providers": {
"openai": {
"status": "healthy",
"circuit": "closed",
"requests_total": 1547,
"failures_total": 3
},
"anthropic": {
"status": "healthy",
"circuit": "closed",
"requests_total": 42,
"failures_total": 0
}
},
"telemetry": {
"status": "ok",
"pending_spans": 12,
"queue_capacity": 1000
}
}
}
Response when not ready (HTTP 503):
{
"status": "not_ready",
"checks": {
"state_backend": {
"status": "connection_refused",
"type": "postgres",
"error": "could not connect to server: Connection refused"
},
"llm_providers": {
"openai": {
"status": "unhealthy",
"circuit": "open",
"last_failure": "2026-01-30T14:32:05Z",
"error": "429 Too Many Requests"
},
"anthropic": {
"status": "healthy",
"circuit": "closed"
}
},
"telemetry": {
"status": "ok"
}
}
}
When it fails: Kubernetes removes the pod from Service endpoints. No traffic is routed to it. The pod stays running (it is not killed -- that is the liveness probe's job). Once the dependency recovers, the next readiness check passes, and traffic resumes.
Kubernetes configuration:
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
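To make the 200/503 semantics concrete, here is a simplified Python sketch of how a readiness handler might aggregate its checks. The check names and the "any failure means not ready" rule are illustrative simplifications; Neam's actual logic (e.g. "at least one provider circuit not Open") is richer:

```python
def readiness_status(checks):
    """Aggregate dependency checks into an HTTP status code.

    `checks` maps a check name to a dict with at least a "status" key.
    The pod reports ready only if every check is healthy; any single
    failing dependency yields 503 so the load balancer stops routing
    traffic to this pod (without killing it).
    """
    healthy = {"connected", "healthy", "ok", "ready"}
    all_ok = all(c["status"] in healthy for c in checks.values())
    return 200 if all_ok else 503

checks = {
    "state_backend": {"status": "connected", "type": "postgres"},
    "llm_openai": {"status": "healthy", "circuit": "closed"},
    "telemetry": {"status": "ok"},
}
assert readiness_status(checks) == 200

# One failing dependency flips the whole endpoint to 503:
checks["state_backend"]["status"] = "connection_refused"
assert readiness_status(checks) == 503
```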
GET /startup (Startup) #
The startup endpoint answers: has the Neam VM completed its initialization sequence?
What it checks:

1. Bytecode is loaded and validated
2. Agents are registered in the VM
3. State backend connection is established
4. Knowledge bases are ingested (if any)
5. Autonomous executor is started (if configured)
6. LLM Gateway is initialized (if configured)
Response when startup complete (HTTP 200):
{
"status": "started",
"initialized_at": "2026-01-30T14:00:05Z",
"agents_registered": 3,
"knowledge_bases_loaded": 1,
"autonomous_agents": 1
}
Response during startup (HTTP 503):
{
"status": "starting",
"phase": "ingesting_knowledge_bases",
"progress": "2/5 sources processed"
}
Kubernetes configuration:
startupProbe:
httpGet:
path: /startup
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 30
This allows up to 150 seconds (30 x 5s) for startup. Once the startup probe succeeds, Kubernetes switches to the liveness and readiness probes. This is critical for agents with large knowledge bases that take time to ingest.
Health Check Summary #
| Endpoint | Question | Failure Action | Checks Dependencies | Frequency |
|---|---|---|---|---|
| `/health` | Is the process alive? | Kill and restart | No | Every 20s |
| `/ready` | Can it serve traffic? | Remove from LB | Yes | Every 10s |
| `/startup` | Is init complete? | Wait for startup | Yes (init only) | Every 5s |
22.2 OpenTelemetry Integration #
Neam v0.6.0 integrates with the OpenTelemetry standard for distributed tracing and metrics. The integration is in-process -- no sidecar or agent is required (though an OTel Collector is recommended in production for reliable delivery).
Architecture #
+---------------------------------------------------------------+
| |
| Neam Agent (in-process) |
| +-----------------------------------------------------------+
| | |
| | Agent.ask() |
| | | |
| | v |
| | TelemetryExporter |
| | +---------------------------+ |
| | | start_span("agent.ask") | |
| | | start_span("llm.call") | |
| | | set_attribute(...) | |
| | | end_span() | |
| | | start_span("rag.query")| |
| | | end_span() | |
| | | end_span() | |
| | +---------------------------+ |
| | | |
| | Batch buffer (100 spans or 5s) |
| | | |
| +---------|--------------------------------------------------+
| | |
| v OTLP/HTTP JSON |
| +---------+----------+ |
| | OTel Collector | |
| | (otel-collector) | |
| +----+----------+----+ |
| | | |
| v v |
| +--------+ +-----------+ |
| | Jaeger | | Prometheus| |
| | (traces)| | (metrics) | |
| | :16686 | | :9090 | |
| +--------+ +-----------+ |
| |
+---------------------------------------------------------------+
Configuration #
Enable telemetry in neam.toml:
[telemetry]
enabled = true
endpoint = "http://otel-collector:4318"
service-name = "neam-agent"
sampling-rate = 0.5
Or via environment variables:
export NEAM_TELEMETRY_ENABLED=true
export NEAM_OTEL_ENDPOINT=http://otel-collector:4318
export NEAM_TELEMETRY_SERVICE_NAME=neam-agent
export NEAM_TELEMETRY_SAMPLING_RATE=0.5
Automatic Span Creation #
The Neam VM automatically creates spans for the following operations:
| Span Name | When Created | Key Attributes |
|---|---|---|
| `neam.agent.ask` | Every `Agent.ask()` call | `agent.name`, `agent.provider`, `agent.model` |
| `neam.llm.call` | Each LLM API request | `gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.prompt_tokens`, `gen_ai.usage.completion_tokens` |
| `neam.tool.call` | Tool invocation | `tool.name`, `tool.duration_ms` |
| `neam.rag.query` | RAG retrieval | `rag.strategy`, `rag.top_k`, `rag.documents_retrieved` |
| `neam.reflection` | Self-reflection pass | `reflection.dimensions`, `reflection.min_confidence`, `reflection.score` |
| `neam.learning.review` | Learning review trigger | `learning.strategy`, `learning.interactions_reviewed` |
| `neam.handoff` | Agent handoff | `handoff.from`, `handoff.to`, `handoff.reason` |
| `neam.mcp.call` | MCP tool execution | `mcp.server`, `mcp.tool`, `mcp.duration_ms` |
| `neam.gateway.ratelimit` | Rate limit wait | `gateway.provider`, `gateway.wait_ms` |
| `neam.gateway.circuitbreak` | Circuit breaker trip | `gateway.provider`, `gateway.circuit_state` |
| `neam.gateway.cache` | Cache hit/miss | `gateway.provider`, `gateway.cache_hit` |
Span Hierarchy #
A typical agent call produces a tree of spans:
neam.agent.ask (TriageAgent, 1200ms)
|
+-- neam.rag.query (strategy: basic, 45ms)
| +-- Retrieved 3 documents
|
+-- neam.llm.call (openai/gpt-4o-mini, 850ms)
| +-- prompt_tokens: 1200
| +-- completion_tokens: 150
| +-- cost_usd: 0.0018
|
+-- neam.reflection (accuracy: 0.9, relevance: 0.85, 400ms)
| +-- neam.llm.call (openai/gpt-4o-mini, 350ms)
|
+-- neam.handoff (TriageAgent -> RefundAgent, 0ms)
OTLP Export Format #
Neam exports spans as OTLP/HTTP JSON (not protobuf) to avoid the protobuf dependency. The OTel Collector accepts both formats:
{
"resourceSpans": [{
"resource": {
"attributes": [
{"key": "service.name", "value": {"stringValue": "neam-agent"}},
{"key": "service.version", "value": {"stringValue": "0.6.0"}},
{"key": "deployment.environment", "value": {"stringValue": "production"}}
]
},
"scopeSpans": [{
"scope": {"name": "neam", "version": "0.6.0"},
"spans": [
{
"traceId": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",
"spanId": "1a2b3c4d5e6f7a8b",
"parentSpanId": "",
"name": "neam.agent.ask",
"kind": 2,
"startTimeUnixNano": "1706620800000000000",
"endTimeUnixNano": "1706620801200000000",
"attributes": [
{"key": "agent.name", "value": {"stringValue": "TriageAgent"}},
{"key": "agent.provider", "value": {"stringValue": "openai"}},
{"key": "agent.model", "value": {"stringValue": "gpt-4o-mini"}}
],
"status": {"code": 1}
}
]
}]
}]
}
Batching and Background Export #
Spans are buffered in memory and exported in batches:
- Batch size: 100 spans (or fewer if the flush interval triggers first)
- Flush interval: 5 seconds
- Export thread: A background thread performs the HTTP POST to the OTLP endpoint
- Backpressure: If the export queue exceeds 1000 pending spans (configurable), new spans are dropped with a warning log
- Failure handling: Failed exports are retried once with a 1-second delay, then dropped
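The batching rules above can be modeled in a few lines of Python. This is an illustrative sketch only (the real exporter runs this logic on a background thread inside the VM); the class name and method names are invented for the example:

```python
import time

class BatchSpanBuffer:
    """Illustrative model of the batching behaviour described above:
    flush when `batch_size` spans accumulate or `flush_interval` seconds
    elapse, and drop new spans once the queue exceeds its capacity."""

    def __init__(self, batch_size=100, flush_interval=5.0, capacity=1000):
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.capacity = capacity
        self.queue = []
        self.last_flush = time.monotonic()
        self.dropped = 0

    def add(self, span):
        if len(self.queue) >= self.capacity:
            self.dropped += 1  # backpressure: drop with a warning log
            return
        self.queue.append(span)

    def take_batch_if_due(self):
        """Return the next batch if either trigger fired, else None."""
        due = (len(self.queue) >= self.batch_size
               or time.monotonic() - self.last_flush >= self.flush_interval)
        if not due or not self.queue:
            return None
        batch = self.queue[:self.batch_size]
        self.queue = self.queue[self.batch_size:]
        self.last_flush = time.monotonic()
        return batch

# Small limits to show both triggers:
buf = BatchSpanBuffer(batch_size=3, flush_interval=60.0, capacity=5)
for i in range(7):
    buf.add({"span_id": i})
assert buf.dropped == 2                   # capacity 5, 7 spans offered
assert len(buf.take_batch_if_due()) == 3  # size trigger fires first
```

The key design point: dropping spans under backpressure trades telemetry completeness for agent availability, so a dead OTLP endpoint never blocks request handling.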
Sampling #
The `sampling-rate` setting controls what fraction of traces are exported:
| Rate | Effect | Use Case |
|---|---|---|
| `1.0` | Every request traced | Development, debugging |
| `0.5` | 50% of requests | Staging |
| `0.1` | 10% of requests | Production (moderate traffic) |
| `0.01` | 1% of requests | Production (high traffic) |
Sampling is deterministic per trace: if a trace is sampled, all spans within that trace (including child spans from tool calls, RAG queries, and reflections) are included. This is achieved by hashing the trace ID and comparing against the sampling threshold.
// This code behaves identically regardless of sampling rate.
// The telemetry layer is transparent to agent logic.
agent TracedAgent {
provider: "openai"
model: "gpt-4o-mini"
system: "You are a helpful assistant."
}
{
let response = TracedAgent.ask("Explain observability.");
emit response;
// If this trace is sampled, spans are exported automatically.
// If not sampled, zero overhead is added.
}
22.3 Jaeger for Trace Visualization #
Jaeger is an open-source distributed tracing platform. The Docker Compose stack from Chapter 20 includes Jaeger, and Neam traces flow through the OTel Collector to Jaeger automatically.
Accessing Jaeger #
# If running Docker Compose
open http://localhost:16686
# If running in Kubernetes
kubectl port-forward svc/jaeger-query -n observability 16686:16686
open http://localhost:16686
Finding Traces #
In the Jaeger UI:
- Select Service: `neam-agent`
- Select Operation: `neam.agent.ask` (or leave as "all")
- Set a time range
- Click Find Traces
Each trace shows the complete span tree for one request, including:
- Total request duration
- Time spent in each LLM call
- RAG retrieval latency
- Reflection overhead
- Handoff chain
Reading a Trace #
A trace for a customer service triage request might look like:
Trace: a1b2c3d4 (1450ms total)
[==============================================] neam.agent.ask (TriageAgent) 1450ms
[====] neam.rag.query (basic, 3 docs) 50ms
[===================] neam.llm.call (openai/gpt-4o-mini) 900ms
prompt_tokens: 1500 completion_tokens: 80 cost: $0.0020
[======] neam.reflection (accuracy: 0.92) 350ms
[====] neam.llm.call (openai/gpt-4o-mini) 300ms
[] neam.handoff (TriageAgent -> RefundAgent) 1ms
[====================================] neam.agent.ask (RefundAgent) 750ms
[==========================] neam.llm.call (openai/gpt-4o-mini) 600ms
prompt_tokens: 800 completion_tokens: 200 cost: $0.0015
From this trace, you can immediately see:
- The total request took 1450ms
- The LLM call to OpenAI was the bottleneck (900ms for triage, 600ms for refund)
- RAG retrieval was fast (50ms)
- Reflection added 350ms of overhead (with its own LLM call)
- The handoff from TriageAgent to RefundAgent was instantaneous
22.4 Prometheus Metrics #
Neam exports metrics to Prometheus via the OTel Collector. These metrics provide aggregate visibility across all requests, complementing the per-request detail of traces.
Exported Metrics #
| Metric | Type | Labels | Description |
|---|---|---|---|
| `neam_llm_requests_total` | Counter | `provider`, `model`, `status` | Total LLM API calls |
| `neam_llm_tokens_total` | Counter | `provider`, `model`, `type` | Tokens consumed (prompt/completion) |
| `neam_llm_latency_seconds` | Histogram | `provider`, `model` | LLM call latency distribution |
| `neam_llm_cost_usd_total` | Counter | `provider`, `model` | Accumulated LLM cost |
| `neam_agent_requests_total` | Counter | `agent`, `status` | Agent `ask()` calls |
| `neam_agent_latency_seconds` | Histogram | `agent` | End-to-end agent latency |
| `neam_rag_queries_total` | Counter | `strategy`, `knowledge_base` | RAG retrieval queries |
| `neam_rag_latency_seconds` | Histogram | `strategy` | RAG retrieval latency |
| `neam_tool_calls_total` | Counter | `tool`, `status` | Tool invocations |
| `neam_reflection_score` | Gauge | `agent`, `dimension` | Latest reflection scores |
| `neam_gateway_rate_limit_waits_total` | Counter | `provider` | Rate limit delays |
| `neam_gateway_circuit_breaker_state` | Gauge | `provider` | Circuit state (0=closed, 1=open, 2=half-open) |
| `neam_gateway_cache_hits_total` | Counter | `provider` | Cache hits |
| `neam_gateway_cache_misses_total` | Counter | `provider` | Cache misses |
| `neam_gateway_cost_daily_usd` | Gauge | (none) | Current daily cost |
| `neam_gateway_cost_budget_usd` | Gauge | (none) | Configured daily budget |
Prometheus Configuration #
# docker/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'otel-collector'
static_configs:
- targets: ['otel-collector:8889']
metrics_path: /metrics
- job_name: 'neam-agent'
static_configs:
- targets: ['neam-agent:8080']
metrics_path: /metrics
Useful PromQL Queries #
Request rate (requests per second):
rate(neam_agent_requests_total[5m])
P95 agent latency:
histogram_quantile(0.95, rate(neam_agent_latency_seconds_bucket[5m]))
LLM cost per hour:
rate(neam_llm_cost_usd_total[1h]) * 3600
Token consumption rate by provider:
sum by (provider) (rate(neam_llm_tokens_total[5m]))
Cache hit ratio:
sum(rate(neam_gateway_cache_hits_total[5m]))
/
(sum(rate(neam_gateway_cache_hits_total[5m])) + sum(rate(neam_gateway_cache_misses_total[5m])))
Circuit breaker status (1 = problem):
neam_gateway_circuit_breaker_state > 0
Budget utilization percentage:
neam_gateway_cost_daily_usd / neam_gateway_cost_budget_usd * 100
22.5 LLM Gateway Monitoring #
The LLM Gateway is the most critical component to monitor because it controls the flow of all LLM requests. The gateway exposes its internal state through the readiness endpoint and through Prometheus metrics.
Rate Limit Tracking #
The gateway tracks per-provider request rates and enforces the limits defined in
neam.toml:
[llm.rate-limits.openai]
requests-per-minute = 120
Monitoring rate limits:
# Current request rate vs. limit
rate(neam_llm_requests_total{provider="openai"}[1m]) * 60
# Compare against the configured limit of 120
# Rate limit wait events (indicates you are approaching the limit)
rate(neam_gateway_rate_limit_waits_total{provider="openai"}[5m])
When rate limit waits increase, it means the gateway is throttling requests to stay within the configured limit. If waits are frequent, consider:
- Increasing the `requests-per-minute` limit (if the provider allows it)
- Adding a fallback provider to distribute load
- Enabling response caching to reduce redundant calls
Circuit Breaker State #
The circuit breaker has three states, represented as a gauge metric:
| Value | State | Meaning |
|---|---|---|
| 0 | Closed | Normal operation |
| 1 | Open | Provider is down; all requests rejected |
| 2 | Half-Open | Probing the provider with a single request |
# Alert when any circuit is open
neam_gateway_circuit_breaker_state{provider="openai"} == 1
Visualizing circuit breaker transitions:
In Grafana, create a state timeline panel with the `neam_gateway_circuit_breaker_state` metric. This shows exactly when each provider went down and how long it took to recover:
Time: 00:00 00:05 00:10 00:15 00:20 00:25 00:30
OpenAI: [--- Closed ---][Open][HO][--- Closed ---]
Anthropic: [---------- Closed ----------------------------------]
Cache Hit Rates #
# Cache hit ratio (higher is better, saves money)
sum(rate(neam_gateway_cache_hits_total[5m]))
/
(sum(rate(neam_gateway_cache_hits_total[5m])) + sum(rate(neam_gateway_cache_misses_total[5m])))
A cache hit ratio of 0 means caching is not effective (likely because all agents use temperature > 0, so no two prompts produce identical cache keys). A ratio above 0.3 means roughly 30% of LLM calls were served from cache instead of being billed to the provider.
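The ratio itself is simple arithmetic over the two counters; a quick Python sketch of the same computation the PromQL expression performs:

```python
def cache_hit_ratio(hits: float, misses: float) -> float:
    """Fraction of LLM requests answered from cache rather than
    billed to the provider. Guards the no-traffic case, which the
    PromQL expression would render as NaN."""
    total = hits + misses
    return hits / total if total else 0.0

# 30 hits out of 100 requests: ~30% of calls avoided the provider.
assert abs(cache_hit_ratio(30, 70) - 0.3) < 1e-9
assert cache_hit_ratio(0, 0) == 0.0  # no traffic yet, not an error
```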
Cost Tracking #
The gateway tracks real-time cost using Neam's built-in pricing table:
# Daily cost (USD)
neam_gateway_cost_daily_usd
# Budget utilization
neam_gateway_cost_daily_usd / neam_gateway_cost_budget_usd * 100
# Cost by provider
sum by (provider) (rate(neam_llm_cost_usd_total[1h])) * 3600
# Cost by model
sum by (model) (rate(neam_llm_cost_usd_total[1h])) * 3600
Cost dashboard example:
+----------------------------------------------+
| Daily LLM Cost |
| |
| $47.32 / $100.00 budget (47.3%) |
| [========================............] 47% |
| |
| By Provider: |
| OpenAI: $38.50 (81%) |
| Anthropic: $8.82 (19%) |
| |
| By Model: |
| gpt-4o-mini: $32.10 |
| gpt-4o: $6.40 |
| claude-3.5: $8.82 |
+----------------------------------------------+
22.6 Distributed Tracing Across Multi-Agent Systems #
When a request flows through multiple agents (triage -> specialist -> supervisor), distributed tracing keeps the entire chain visible as a single trace.
Trace Propagation #
Within a single Neam VM, trace propagation is automatic. The VM maintains a trace context stack, and when one agent hands off to another, the child agent's span is created with the parent agent's span ID.
Cross-Service Tracing #
When agents communicate across services (via the A2A protocol), the trace context is propagated via HTTP headers following the W3C Trace Context standard:
POST /a2a HTTP/1.1
Host: specialist-service.internal
Content-Type: application/json
traceparent: 00-abc123def456abc123def456abc123de-1a2b3c4d5e6f7a8b-01
tracestate: neam=agent:TriageAgent
{"jsonrpc": "2.0", "method": "tasks/send", ...}
The receiving service picks up the traceparent header and creates its spans as
children of the calling service's span. This means a single trace in Jaeger can show
the complete request path across multiple Neam services.
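To make the header format concrete, here is an illustrative Python parser for the `traceparent` header (a simplified sketch of W3C Trace Context handling, not Neam's implementation; it ignores the optional `tracestate` header):

```python
def parse_traceparent(header: str):
    """Split a W3C `traceparent` header into its four hex fields:
    version-traceid-parentid-flags."""
    version, trace_id, parent_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(parent_id) == 16
    return {"version": version, "trace_id": trace_id,
            "parent_span_id": parent_id, "sampled": flags == "01"}

def child_traceparent(parent: dict, new_span_id: str) -> str:
    """Build the header the receiving service sends onward: the trace
    ID is preserved, while each hop substitutes its own span ID."""
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{new_span_id}-{flags}"

ctx = parse_traceparent(
    "00-abc123def456abc123def456abc123de-1a2b3c4d5e6f7a8b-01")
assert ctx["trace_id"] == "abc123def456abc123def456abc123de"
assert ctx["sampled"] is True
# The downstream hop shares the trace ID but carries a new span ID
# (the span ID here is a made-up example value):
assert child_traceparent(ctx, "9f8e7d6c5b4a3f2e").startswith(
    "00-abc123def456abc123def456abc123de-")
```

Because the trace ID survives every hop, the sampling decision (which hashes the trace ID) also stays consistent across services.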
Practical Example: Multi-Service Tracing #
agent TriageAgent {
provider: "openai"
model: "gpt-4o-mini"
system: "Route customer requests."
handoffs: [RefundAgent]
}
{
// This creates a root span: neam.agent.ask
let triage = TriageAgent.ask("I need a refund for order #123");
// Handoff propagates the trace context
// The RefundAgent span becomes a child of this span
}
agent RefundAgent {
provider: "openai"
model: "gpt-4o"
system: "Process refund requests."
}
{
// When called via A2A, the trace context is inherited
// from the traceparent header
let result = RefundAgent.ask("Process refund for order #123");
emit result;
}
In Jaeger, the combined trace shows:
Trace abc123 (2100ms)
Service: triage-service
neam.agent.ask (TriageAgent) ........................ 1200ms
neam.llm.call (openai/gpt-4o-mini) .............. 900ms
neam.handoff (TriageAgent -> RefundAgent) ........ 1ms
Service: refund-service
neam.agent.ask (RefundAgent) ....................... 900ms
neam.llm.call (openai/gpt-4o) ................... 750ms
22.7 Alerting Strategies #
Monitoring without alerting is just logging with a GUI. Here are alerting rules for the most important Neam operational signals.
Prometheus Alerting Rules #
# alerting-rules.yaml
groups:
- name: neam-agent
rules:
# Alert when error rate exceeds 5%
- alert: NeamHighErrorRate
expr: |
sum(rate(neam_agent_requests_total{status="error"}[5m]))
/
sum(rate(neam_agent_requests_total[5m]))
> 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "Neam agent error rate above 5%"
description: "{{ $value | humanizePercentage }} of requests are failing"
# Alert when P95 latency exceeds 5 seconds
- alert: NeamHighLatency
expr: |
histogram_quantile(0.95, rate(neam_agent_latency_seconds_bucket[5m]))
> 5
for: 10m
labels:
severity: warning
annotations:
summary: "Neam P95 latency above 5 seconds"
# Alert when a circuit breaker is open
- alert: NeamCircuitBreakerOpen
expr: neam_gateway_circuit_breaker_state == 1
for: 2m
labels:
severity: critical
annotations:
summary: "LLM provider {{ $labels.provider }} circuit breaker is open"
description: "All requests to {{ $labels.provider }} are being rejected"
# Alert when daily cost exceeds 80% of budget
- alert: NeamCostBudgetWarning
expr: |
neam_gateway_cost_daily_usd / neam_gateway_cost_budget_usd > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "Daily LLM cost at {{ $value | humanizePercentage }} of budget"
# Alert when daily cost exceeds 95% of budget
- alert: NeamCostBudgetCritical
expr: |
neam_gateway_cost_daily_usd / neam_gateway_cost_budget_usd > 0.95
for: 1m
labels:
severity: critical
annotations:
summary: "Daily LLM cost at {{ $value | humanizePercentage }} of budget"
# Alert when rate limit waits are frequent
- alert: NeamRateLimitPressure
expr: |
rate(neam_gateway_rate_limit_waits_total[5m]) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "Provider {{ $labels.provider }} under rate limit pressure"
# Alert when state backend is unreachable
- alert: NeamStateBackendDown
expr: |
up{job="neam-agent"} == 1
unless
neam_health_state_backend_connected == 1
for: 2m
labels:
severity: critical
annotations:
summary: "Neam state backend is unreachable"
# Alert when all pods are not ready
- alert: NeamNoReadyPods
expr: |
kube_deployment_status_replicas_ready{deployment="neam-agent"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "No ready Neam agent pods"
Alert Priority Matrix #
| Condition | Severity | Response Time | Action |
|---|---|---|---|
| All pods down | Critical | Immediate | Page on-call, investigate cluster |
| Circuit breaker open | Critical | 5 min | Check provider status, verify failover |
| Cost > 95% budget | Critical | 15 min | Investigate usage, consider throttling |
| Error rate > 5% | Warning | 30 min | Review traces, check for bad inputs |
| P95 latency > 5s | Warning | 1 hour | Review traces, check provider latency |
| Rate limit pressure | Warning | 1 hour | Consider increasing limits or caching |
| Cost > 80% budget | Warning | 4 hours | Review cost trends, adjust budget |
22.8 Operational Runbook #
Here is a practical runbook for diagnosing common issues using the observability stack.
"Why is my agent slow?" #
1. Check Prometheus: Query `histogram_quantile(0.95, rate(neam_agent_latency_seconds_bucket[5m]))` to confirm the latency baseline.
2. Check Jaeger: Find a slow trace. Look at the span tree:
   - Is the LLM call slow? (Provider issue or large prompt)
   - Is RAG retrieval slow? (Knowledge base too large or slow vector search)
   - Is reflection adding latency? (Consider reducing `min_confidence` or disabling reflection for non-critical agents)
3. Check rate limits: Query `rate(neam_gateway_rate_limit_waits_total[5m])`. If rate limit waits are high, the gateway is throttling requests.
4. Check cache hit ratio: If the cache is available but the hit ratio is 0, check that `temperature: 0` is set on deterministic agents.
"Why is my agent returning errors?" #
1. Check circuit breaker state: Query `neam_gateway_circuit_breaker_state`. If a circuit is open (1), the provider is down.
2. Check the readiness endpoint: `curl http://neam-agent:8080/ready` shows which components are unhealthy.
3. Check Jaeger: Find traces with error status. The error span will have a `status_message` attribute explaining the failure.
4. Check provider health: Query `sum by (provider, status) (rate(neam_llm_requests_total[5m]))` to see error rates per provider.
"Am I spending too much?" #
1. Check daily cost: Query `neam_gateway_cost_daily_usd` for the current total.
2. Break down by model: Query `sum by (model) (rate(neam_llm_cost_usd_total[1h])) * 3600` to find the most expensive model.
3. Check cache effectiveness: A low cache hit ratio means you are paying for redundant calls.
4. Check token usage: Query `sum by (agent) (rate(neam_llm_tokens_total[1h]))` to find agents consuming the most tokens. Long system prompts or large RAG contexts inflate token counts.
22.9 Complete Observability Example #
Here is a complete Neam agent with full observability configuration:
# neam.toml
[project]
name = "observed-agent"
version = "1.0.0"
[project.entry_points]
main = "src/main.neam"
[state]
backend = "postgres"
connection-string = "postgresql://neam:pass@postgres:5432/neam"
[llm]
default-provider = "openai"
default-model = "gpt-4o-mini"
[llm.rate-limits.openai]
requests-per-minute = 120
[llm.circuit-breaker]
failure-threshold = 3
reset-timeout-seconds = 60
[llm.cache]
enabled = true
max-entries = 1000
ttl-seconds = 600
[llm.cost]
daily-budget-usd = 100.0
[telemetry]
enabled = true
endpoint = "http://otel-collector:4318"
service-name = "observed-agent"
sampling-rate = 1.0
agent AnalystAgent {
provider: "openai"
model: "gpt-4o-mini"
temperature: 0.3
system: "You are a data analyst. Provide clear, data-driven answers."
reasoning: chain_of_thought
reflect: {
after: each_response
evaluate: [accuracy, clarity]
min_confidence: 0.7
on_low_quality: {
strategy: "revise"
max_revisions: 1
}
}
learning: {
strategy: "experience_replay"
review_interval: 20
}
memory: "analyst_memory"
}
{
let query = input();
let answer = AnalystAgent.ask(query);
emit answer;
// Check learning stats periodically
let stats = agent_learning_stats("AnalystAgent");
emit "Interactions: " + str(stats["total_interactions"]);
emit "Avg score: " + str(stats["avg_reflection_score"]);
}
With this configuration, every request generates:
- Traces in Jaeger showing the agent call, LLM request, RAG query (if any), and reflection pass
- Metrics in Prometheus tracking request rate, latency, token usage, cost, cache hits, and circuit breaker state
- Health endpoints for Kubernetes probes
22.10 Observability Standard Library Modules #
The Neam standard library includes a comprehensive observability package organized into six sub-packages. These modules let you extend the built-in telemetry with custom instrumentation, alternative exporters, and diagnostic tools.
Package Overview #
| Sub-package | Modules | Purpose |
|---|---|---|
| `observability/core` | `tracer`, `meter`, `logger`, `context`, `sampling` | Core OTel providers and context management |
| `observability/exporters` | `otlp`, `jaeger`, `elasticsearch`, `mlflow`, `langfuse`, `sqlite`, `local`, `multi` | Export destinations for traces, metrics, and logs |
| `observability/instrumentation` | `llm`, `agent`, `tool`, `handoff`, `memory` | Automatic span creation for Neam operations |
| `observability/semantic` | `attributes`, `genai`, `events` | OpenTelemetry semantic conventions for AI |
| `observability/triage` | `triage`, `anomaly`, `patterns`, `compare`, `dependencies`, `gaps`, `replay`, `reports` | Diagnostic analysis and debugging |
| `observability/config` | `programmatic`, `environment`, `runtime` | Configuration methods |
Using the Core Modules #
The core modules give you direct access to the OTel tracer, meter, and logger providers for custom instrumentation:
import observability/core/tracer
import observability/core/meter
fun process_order(order_id) {
let span = tracer.start_span("process_order", {
"order.id": order_id,
"order.source": "web"
})
let counter = meter.counter("orders_processed_total", {
description: "Total orders processed"
})
let result = do_processing(order_id)
counter.add(1, { "status": result.status })
span.set_attribute("order.status", result.status)
span.end()
return result
}
Sampling Strategies #
The sampling module provides four strategies beyond the default trace-ID ratio:
import observability/core/sampling
let sampler = sampling.create({
strategy: "parent_based",
root: {
strategy: "trace_id_ratio",
rate: 0.1
}
})
| Strategy | Description |
|---|---|
| `always_on` | Sample every trace (development) |
| `always_off` | Sample nothing (disable telemetry without removing config) |
| `trace_id_ratio` | Sample a fixed percentage based on trace ID hash |
| `parent_based` | Inherit sampling decision from parent span; use a fallback strategy for root spans |
Alternative Exporters #
Beyond OTLP and Jaeger, Neam supports several specialized exporters:
import observability/exporters/elasticsearch
import observability/exporters/langfuse
import observability/exporters/mlflow
let es_exporter = elasticsearch.create({
url: "https://elasticsearch:9200",
traces_index: "neam-traces",
metrics_index: "neam-metrics",
logs_index: "neam-logs"
})
let langfuse_exporter = langfuse.create({
public_key: env("LANGFUSE_PUBLIC_KEY"),
secret_key: env("LANGFUSE_SECRET_KEY"),
host: "https://cloud.langfuse.com"
})
let mlflow_exporter = mlflow.create({
tracking_uri: "http://mlflow:5000",
experiment_name: "neam-agent-eval"
})
| Exporter | Best For |
|---|---|
| `otlp` | Standard OTel Collector pipeline |
| `jaeger` | Direct Jaeger ingestion (no collector) |
| `elasticsearch` | Full-text search over traces and logs |
| `langfuse` | LLM-specific observability with prompt tracking |
| `mlflow` | ML experiment tracking and model registry |
| `sqlite` | Local development without external services |
| `local` | File-based export for offline analysis |
| `multi` | Route different signals to different exporters |
The multi exporter lets you send traces and metrics to different destinations:
import observability/exporters/multi
let pipeline = multi.create({
traces: [otlp_exporter, langfuse_exporter],
metrics: [otlp_exporter],
logs: [elasticsearch_exporter]
})
Semantic Conventions for AI #
The semantic/attributes module defines standard attribute names following the
OpenTelemetry GenAI semantic conventions:
import observability/semantic/attributes
// GenAI operation attributes
attributes.GEN_AI_SYSTEM // "gen_ai.system" (e.g., "openai")
attributes.GEN_AI_REQUEST_MODEL // "gen_ai.request.model"
attributes.GEN_AI_REQUEST_MAX_TOKENS
attributes.GEN_AI_REQUEST_TEMPERATURE
// GenAI response attributes
attributes.GEN_AI_USAGE_PROMPT_TOKENS
attributes.GEN_AI_USAGE_COMPLETION_TOKENS
attributes.GEN_AI_RESPONSE_FINISH_REASONS
// Agent-specific attributes
attributes.AGENT_NAME // "agent.name"
attributes.AGENT_ID // "agent.id"
attributes.AGENT_TEAM // "agent.team"
attributes.AGENT_ROLE // "agent.role"
attributes.AGENT_PARENT // "agent.parent"
Using standard attribute names ensures your traces are compatible with any OTel-
compatible backend and enables cross-tool queries like "show me all traces where
gen_ai.usage.prompt_tokens > 5000."
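That cross-tool query is ultimately just an attribute filter. A Python sketch over hand-made span data (the span dicts are illustrative, not output of a real backend):

```python
# Spans with standard attribute names can be filtered uniformly,
# whatever backend stores them. (Illustrative data, not real traces.)
spans = [
    {"name": "neam.llm.call",
     "attributes": {"gen_ai.system": "openai",
                    "gen_ai.request.model": "gpt-4o-mini",
                    "gen_ai.usage.prompt_tokens": 6200}},
    {"name": "neam.llm.call",
     "attributes": {"gen_ai.system": "anthropic",
                    "gen_ai.request.model": "claude-3.5",
                    "gen_ai.usage.prompt_tokens": 800}},
]

# "Show me all spans where gen_ai.usage.prompt_tokens > 5000":
heavy = [s for s in spans
         if s["attributes"].get("gen_ai.usage.prompt_tokens", 0) > 5000]
assert len(heavy) == 1
assert heavy[0]["attributes"]["gen_ai.system"] == "openai"
```

Had the spans used ad-hoc names like `prompt_len` or `tokens_in`, the same query would need per-source translation; the semantic conventions make the attribute key the contract.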
22.11 Structured Logging #
In addition to traces and metrics, Neam supports structured logging through the OpenTelemetry Logs API. Structured logs attach key-value attributes to each log record, making them searchable and correlatable with traces.
Log Configuration #
[telemetry]
enabled = true
endpoint = "http://otel-collector:4318"
service-name = "neam-agent"
log-level = "info"
The log level controls which records are emitted:
| Level | Emitted At | Examples |
|---|---|---|
| `debug` | Development only | Prompt text, full LLM responses, internal state |
| `info` | Normal operations | Agent started, request processed, handoff completed |
| `warn` | Potential issues | Rate limit approached, cache eviction, slow query |
| `error` | Failures | LLM call failed, state backend timeout, circuit open |
Log Records #
Each log record is a structured JSON object exported via OTLP alongside traces and metrics:
{
"timestamp": "2026-01-30T14:32:05.123Z",
"severity": "WARN",
"body": "Rate limit approaching threshold",
"attributes": {
"provider": "openai",
"current_rpm": 108,
"limit_rpm": 120,
"utilization_pct": 90
},
"traceId": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",
"spanId": "1a2b3c4d5e6f7a8b"
}
The traceId and spanId fields correlate logs with the trace that produced
them. In Grafana, this means you can click from a log line directly to the
corresponding trace in Jaeger.
Custom Log Records #
Use the logger module to emit structured logs from your agent code:
import observability/core/logger
let log = logger.create({ name: "order-processor" })
fun process_order(order) {
log.info("Processing order", {
"order.id": order.id,
"order.total": order.total,
"customer.tier": order.customer_tier
})
if (order.total > 10000) {
log.warn("High-value order requires review", {
"order.id": order.id,
"order.total": order.total
})
}
}
Log Aggregation Pipeline #
In the OTel Collector, logs flow through the same pipeline as traces and metrics:
service:
pipelines:
logs:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
exporters: [elasticsearch, debug]
Elasticsearch is the recommended log destination because it supports full-text
search, aggregations, and Kibana dashboards. For simpler setups, the debug
exporter writes logs to stdout, which Docker and Kubernetes capture automatically.
22.12 Privacy and Redaction #
Production agents handle sensitive data — customer names, account numbers, API
keys in prompts. The observability stack must not leak this data into traces or
logs. The observability/privacy module provides configurable redaction rules.
Redaction Configuration #
import observability/privacy
let privacy_config = privacy.create({
mode: "redact",
rules: [
{ pattern: "\\b\\d{4}[- ]?\\d{4}[- ]?\\d{4}[- ]?\\d{4}\\b", replace: "[CARD]" },
{ pattern: "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b", replace: "[EMAIL]" },
{ pattern: "sk-[a-zA-Z0-9]{20,}", replace: "[API_KEY]" },
{ pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b", replace: "[SSN]" }
],
capture_prompts: false,
capture_responses: false
})
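To make the rules above concrete, here is what redact mode does to a sample log body (the input line is invented for this illustration):

// Before redaction:
//   "Refund card 4242 4242 4242 4242 for jane.doe@example.com (key sk-AbC123xYz456DeF789GhI)"
// After applying the rules above:
//   "Refund card [CARD] for [EMAIL] (key [API_KEY])"

Span names, durations, and non-sensitive attributes pass through untouched; only the matched substrings are replaced before export.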
Privacy Modes #
| Mode | Behavior |
|---|---|
| full | Capture everything — prompts, responses, tool inputs/outputs (development only) |
| redact | Apply regex rules to sanitize sensitive patterns before export |
| hash | Replace sensitive values with one-way hashes (preserves cardinality for analysis) |
| minimal | Capture only span names, durations, and status codes — no content attributes |
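The cardinality-preserving property of hash mode can be illustrated as follows (the digests shown are truncated placeholders, not real hash output):

// In hash mode, the same sensitive value always maps to the same digest,
// so questions like "how many distinct customers hit this error?" remain
// answerable without exposing raw emails.
// jane.doe@example.com  -> "h:9f2c…"   // placeholder digest
// jane.doe@example.com  -> "h:9f2c…"   // same input, same digest
// john.roe@example.com  -> "h:41ab…"   // different input, different digest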
Controlling What Gets Traced #
By default, Neam traces include span names and metadata attributes (agent name, provider, model, token counts, latency). Prompt and response text are not captured unless explicitly enabled:
[telemetry]
capture-prompts = false # Do not include prompt text in spans
capture-responses = false # Do not include response text in spans
For debugging specific issues, you can enable prompt capture temporarily using the runtime configuration module:
import observability/config/runtime
runtime.set_capture("TriageAgent", {
capture_prompts: true,
capture_responses: true,
duration: "30m"
})
This enables prompt/response capture for TriageAgent only, for 30 minutes,
then automatically reverts to the default configuration.
Access Control #
The privacy module supports role-based access to observability data:
let access_config = privacy.access_control({
roles: {
"developer": ["traces", "metrics"],
"ops": ["traces", "metrics", "logs"],
"security": ["traces", "metrics", "logs", "prompts"]
}
})
This does not enforce access at the Neam level — it sets metadata tags on exported data that downstream systems (Grafana, Kibana) can use for RBAC filtering.
22.13 Diagnostic Triage #
The observability/triage module provides automated diagnostic tools for
identifying issues in production without manual trace inspection.
Anomaly Detection #
The anomaly detector monitors metrics for deviations from learned baselines:
import observability/triage/anomaly
let detector = anomaly.create({
metrics: ["neam_agent_latency_seconds", "neam_llm_cost_usd_total"],
window: "1h",
sensitivity: 2.0,
on_anomaly: fun(alert) {
log.warn("Anomaly detected: " + alert.metric, {
"expected": alert.expected,
"actual": alert.actual,
"deviation": alert.deviation
})
}
})
The detector uses a rolling window to compute the mean and standard deviation of
each metric. When the current value deviates by more than sensitivity standard
deviations, the on_anomaly callback fires.
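To make the threshold concrete, here is the z-score check the detector performs, sketched as a worked example (the numbers are illustrative, and this is a simplification of the library's internals):

// Rolling 1h window for neam_agent_latency_seconds:
let mean = 2.0        // learned baseline mean (seconds)
let stddev = 0.5      // learned baseline standard deviation
let sensitivity = 2.0 // from the detector config above
let current = 3.4     // latest observed value

let deviation = (current - mean) / stddev // = 2.8 standard deviations
if (deviation > sensitivity) {
  // fires on_anomaly with expected=2.0, actual=3.4, deviation=2.8
}

With sensitivity 2.0, anything outside the band mean ± 2·stddev (here, 1.0s to 3.0s) triggers the callback; raising the sensitivity widens the band and reduces alert noise.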
Error Pattern Analysis #
The pattern analyzer groups errors by type and identifies recurring failure modes:
import observability/triage/patterns
let analysis = patterns.analyze({
window: "24h",
min_occurrences: 5
})
for (pattern, details) in analysis {
emit "Pattern: " + pattern
emit " Count: " + str(details.count)
emit " First seen: " + details.first_seen
emit " Last seen: " + details.last_seen
emit " Affected agents: " + str(details.agents)
}
Dependency Graph #
The dependency graph builder analyzes traces to map service-to-service relationships:
import observability/triage/dependencies
let graph = dependencies.build({ window: "1h" })
for (service, deps) in graph {
emit service + " depends on: " + join(deps, ", ")
}
This is useful for understanding blast radius: if a provider goes down, which agents and services are affected?
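A blast-radius query is simply the dependency graph inverted. As a sketch (contains and append are assumed helpers, not necessarily the standard library's names; join appears in the example above):

import observability/triage/dependencies

let graph = dependencies.build({ window: "1h" })

// Invert the edges: which services would be affected if "openai" failed?
let affected = []
for (service, deps) in graph {
  if (contains(deps, "openai")) { // assumed membership helper
    affected = append(affected, service)
  }
}
emit "openai outage would affect: " + join(affected, ", ")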
Diagnostic Reports #
The report generator combines anomaly detection, error patterns, and dependency analysis into a structured diagnostic report:
import observability/triage/reports
let report = reports.generate({
window: "24h",
include: ["anomalies", "errors", "dependencies", "recommendations"]
})
emit report.summary
for rec in report.recommendations {
emit " - " + rec
}
A typical report might contain:
Diagnostic Report (last 24h)
=============================
Anomalies: 2
- neam_agent_latency_seconds: 3.2x above baseline (P95: 8.1s vs. 2.5s baseline)
- neam_llm_cost_usd_total: 1.8x above baseline ($142 vs. $78 baseline)
Error patterns: 1
- "429 Too Many Requests" from openai (47 occurrences, affecting TriageAgent)
Dependencies:
TriageAgent → openai, postgres
RefundAgent → openai, postgres
SupervisorAgent → anthropic, postgres
Recommendations:
- Increase OpenAI rate limit or add fallback provider (47 rate limit errors)
- Investigate TriageAgent prompt length (high token cost correlates with latency)
- Consider caching for TriageAgent (0% cache hit rate)
Summary #
In this chapter, you learned:
- Three health check endpoints: /health (liveness), /ready (readiness), /startup (initialization) and their distinct semantics
- OpenTelemetry integration: OTLP/HTTP JSON export, automatic span creation, sampling, batching
- Jaeger for trace visualization: reading span trees, identifying bottlenecks
- Prometheus metrics: LLM cost tracking, token usage, latency histograms, cache hits, circuit breaker state
- LLM Gateway monitoring: rate limits, circuit breakers, cache effectiveness, cost budgets
- Distributed tracing across multi-agent, multi-service systems
- Alerting strategies with Prometheus alerting rules
- An operational runbook for diagnosing latency, errors, and cost issues
- Observability standard library: 56 modules across core providers, exporters (OTLP, Jaeger, Elasticsearch, Langfuse, MLflow), instrumentation, and semantic conventions
- Structured logging with the OTel Logs API, log levels, and trace-log correlation
- Privacy and redaction: four privacy modes, configurable regex rules, runtime capture control
- Diagnostic triage: automated anomaly detection, error pattern analysis, dependency graphing, and report generation
These tools and techniques give you complete visibility into your Neam agents in production. Combined with the deployment patterns from Chapters 20 and 21, you now have everything needed to build, deploy, and operate production AI agent systems.
Exercises #
Exercise 22.1: Health Check Design #
A Neam agent uses PostgreSQL for state, OpenAI and Anthropic for LLM calls, and has
telemetry enabled. Write the expected JSON response for /ready in each of these
scenarios:
- Everything is healthy
- PostgreSQL is down, LLM providers are fine
- OpenAI circuit is open, Anthropic is healthy
- Both OpenAI and Anthropic circuits are open
For each scenario, state whether the readiness probe passes (HTTP 200) or fails (HTTP 503) and explain why.
Exercise 22.2: Trace Analysis #
Given the following Jaeger trace for a customer service request:
neam.agent.ask (TriageAgent) .................... 3500ms
  neam.rag.query (hybrid, 5 docs) .............. 1200ms
  neam.llm.call (openai/gpt-4o) ................ 1800ms
    prompt_tokens: 4500
    completion_tokens: 200
  neam.reflection (accuracy: 0.65) .............. 400ms
    neam.llm.call (openai/gpt-4o) .............. 350ms
  neam.reflection (revision 1, accuracy: 0.82) .. 400ms
    neam.llm.call (openai/gpt-4o) .............. 350ms
Answer the following:
- What is the biggest contributor to latency?
- Why did the reflection pass run twice?
- How many total LLM calls were made?
- Estimate the total token cost assuming GPT-4o at $5/1M input, $15/1M output.
- Suggest three optimizations to reduce the total latency.
Exercise 22.3: Prometheus Queries #
Write PromQL queries for the following:
- The average number of LLM tokens consumed per agent request (over the last hour)
- The cache hit ratio for OpenAI calls specifically
- The number of circuit breaker state transitions in the last 24 hours
- The top 3 agents by total cost in the last day
- An alert rule that fires when the rate limit wait time exceeds 1 second on average
Exercise 22.4: Alerting Configuration #
Design an alerting strategy for a Neam deployment with these SLAs:
- 99.5% availability (measured as successful responses / total requests)
- P99 latency under 10 seconds
- Monthly LLM budget of $3,000
Write Prometheus alerting rules with appropriate thresholds, for durations (how long a condition must persist before the alert fires), and severity levels. Include both warning and critical tiers for each SLA.
Exercise 22.5: Cost Optimization Analysis #
A production Neam deployment has these metrics over 24 hours:
- neam_llm_requests_total{provider="openai",model="gpt-4o"}: 5,000
- neam_llm_requests_total{provider="openai",model="gpt-4o-mini"}: 45,000
- neam_llm_tokens_total{type="prompt"}: 25,000,000
- neam_llm_tokens_total{type="completion"}: 5,000,000
- neam_gateway_cache_hits_total: 8,000
- neam_gateway_cache_misses_total: 42,000
Answer the following:
- Calculate the approximate daily LLM cost (use GPT-4o at $5/$15 per 1M tokens, GPT-4o-mini at $0.15/$0.60 per 1M tokens).
- What is the current cache hit ratio?
- If the cache hit ratio improved to 40%, how much would you save daily?
- Should any agents be migrated from GPT-4o to GPT-4o-mini? What information would you need to make this decision?
Exercise 22.6: Distributed Tracing Design #
Design the tracing instrumentation for a multi-service Neam deployment with:
- Gateway Service: Accepts HTTP requests, authenticates users, routes to agents
- Triage Service: Runs TriageAgent to classify requests
- Specialist Service: Runs RefundAgent, BillingAgent, and TechSupportAgent
- Review Service: Runs SupervisorAgent to review specialist responses
Draw the span hierarchy for a request that goes through all four services. List the attributes you would set on each span. Explain how the trace context propagates between services.