Programming Neam

Chapter 22: Observability and Monitoring #


"You cannot improve what you cannot measure. And in production, you cannot debug what you cannot observe." -- Observability engineering axiom


What You Will Learn #

In this chapter, you will learn how to observe and monitor Neam agents in production. You will understand the three health check endpoints and their semantics, configure OpenTelemetry integration for distributed tracing and metrics, visualize traces in Jaeger, build Prometheus dashboards, monitor the LLM Gateway (rate limits, circuit breaker state, cache hits, cost), trace requests across multi-agent systems, and design alerting strategies. By the end of this chapter, you will be able to answer the question "why is my agent slow?" in under five minutes.


22.1 Health Check Semantics #

Neam v0.6.0 exposes three health check endpoints, each with distinct semantics. These endpoints are used by Kubernetes probes, load balancers, and monitoring systems to determine the operational state of a Neam agent.

GET /health (Liveness) #

The liveness endpoint answers one question: is the Neam process alive and able to respond to HTTP requests?

What it checks:

  - The HTTP server is listening and can process requests
  - The main event loop has not deadlocked

What it does NOT check:

  - External dependencies (database, LLM providers, OTel collector)
  - Whether agents are loaded or initialized

Response when healthy (HTTP 200):

json
{
  "status": "ok",
  "version": "0.6.0",
  "uptime_seconds": 3672
}

When it fails: The process is irrecoverably broken. Kubernetes kills the pod and restarts it.

Kubernetes configuration:

yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  timeoutSeconds: 5
  failureThreshold: 3

This means: after an initial 15-second delay, check /health every 20 seconds. If 3 consecutive checks fail (each with a 5-second timeout), kill and restart the pod.
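A useful back-of-envelope check is the worst-case time between the process hanging and Kubernetes sending the kill signal. This is a rough sketch, assuming probes fire at fixed intervals and the final failed probe consumes its full timeout; the function name is ours, not part of Neam:

```python
def max_detection_seconds(period: int, timeout: int, failure_threshold: int) -> int:
    """Worst-case delay between a hang and the restart signal.

    The hang can occur just after a successful probe, so Kubernetes needs
    `failure_threshold` more probes spaced `period` seconds apart, and the
    last probe still waits out its timeout before being counted as failed.
    """
    return period * failure_threshold + timeout

# With the probe settings above (period=20, timeout=5, threshold=3):
print(max_detection_seconds(20, 5, 3))  # 65
```

In other words, a deadlocked pod can serve nothing for roughly a minute before the restart kicks in; tighten `periodSeconds` if that is too long for your SLO.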

GET /ready (Readiness) #

The readiness endpoint answers: can this pod serve traffic right now?

What it checks:

  1. State backend connectivity: Can the VM connect to and query the configured state backend (SQLite, PostgreSQL, Redis, DynamoDB, CosmosDB)?
  2. LLM provider availability: Is at least one LLM provider circuit in the Closed or HalfOpen state? (If all circuits are Open, the agent cannot make LLM calls.)
  3. Telemetry health: If telemetry is enabled, is the export queue below its capacity limit? (A full queue indicates the OTLP endpoint is down.)

Response when ready (HTTP 200):

json
{
  "status": "ready",
  "checks": {
    "state_backend": {
      "status": "connected",
      "type": "postgres",
      "latency_ms": 2
    },
    "llm_providers": {
      "openai": {
        "status": "healthy",
        "circuit": "closed",
        "requests_total": 1547,
        "failures_total": 3
      },
      "anthropic": {
        "status": "healthy",
        "circuit": "closed",
        "requests_total": 42,
        "failures_total": 0
      }
    },
    "telemetry": {
      "status": "ok",
      "pending_spans": 12,
      "queue_capacity": 1000
    }
  }
}

Response when not ready (HTTP 503):

json
{
  "status": "not_ready",
  "checks": {
    "state_backend": {
      "status": "connection_refused",
      "type": "postgres",
      "error": "could not connect to server: Connection refused"
    },
    "llm_providers": {
      "openai": {
        "status": "unhealthy",
        "circuit": "open",
        "last_failure": "2026-01-30T14:32:05Z",
        "error": "429 Too Many Requests"
      },
      "anthropic": {
        "status": "healthy",
        "circuit": "closed"
      }
    },
    "telemetry": {
      "status": "ok"
    }
  }
}

When it fails: Kubernetes removes the pod from Service endpoints. No traffic is routed to it. The pod stays running (it is not killed -- that is the liveness probe's job). Once the dependency recovers, the next readiness check passes, and traffic resumes.
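The readiness decision is an aggregation over the individual checks. A minimal sketch of that logic in Python, mirroring the JSON shapes above (the function and circuit-state strings are illustrative, not Neam's actual implementation):

```python
def readiness(checks: dict) -> tuple[int, str]:
    """Aggregate component checks into an HTTP status, mirroring /ready.

    Ready requires: the state backend connected, at least one provider
    circuit not open, and (if telemetry is reported) a non-full queue.
    """
    backend_ok = checks["state_backend"]["status"] == "connected"
    any_provider = any(
        p["circuit"] in ("closed", "half_open")   # illustrative state names
        for p in checks["llm_providers"].values()
    )
    telemetry = checks.get("telemetry")
    telemetry_ok = (
        telemetry is None
        or telemetry["pending_spans"] < telemetry["queue_capacity"]
    )
    ready = backend_ok and any_provider and telemetry_ok
    return (200, "ready") if ready else (503, "not_ready")
```

Note that one healthy provider is enough: in the 503 example above, the pod is not ready because of the state backend, not because the OpenAI circuit is open.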

Kubernetes configuration:

yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

GET /startup (Startup) #

The startup endpoint answers: has the Neam VM completed its initialization sequence?

What it checks:

  1. Bytecode is loaded and validated
  2. Agents are registered in the VM
  3. State backend connection is established
  4. Knowledge bases are ingested (if any)
  5. Autonomous executor is started (if configured)
  6. LLM Gateway is initialized (if configured)

Response when startup complete (HTTP 200):

json
{
  "status": "started",
  "initialized_at": "2026-01-30T14:00:05Z",
  "agents_registered": 3,
  "knowledge_bases_loaded": 1,
  "autonomous_agents": 1
}

Response during startup (HTTP 503):

json
{
  "status": "starting",
  "phase": "ingesting_knowledge_bases",
  "progress": "2/5 sources processed"
}

Kubernetes configuration:

yaml
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 30

This allows up to 150 seconds (30 x 5s) for startup. Once the startup probe succeeds, Kubernetes switches to the liveness and readiness probes. This is critical for agents with large knowledge bases that take time to ingest.

Health Check Summary #

| Endpoint | Question | Failure Action | Checks Dependencies | Frequency |
|----------|----------|----------------|---------------------|-----------|
| /health | Is the process alive? | Kill and restart | No | Every 20s |
| /ready | Can it serve traffic? | Remove from LB | Yes | Every 10s |
| /startup | Is init complete? | Wait for startup | Yes (init only) | Every 5s |

22.2 OpenTelemetry Integration #

Neam v0.6.0 integrates with the OpenTelemetry standard for distributed tracing and metrics. The integration is in-process -- no sidecar or agent is required (though an OTel Collector is recommended in production for reliable delivery).

Architecture #

+---------------------------------------------------------------+
|                                                               |
|  Neam Agent (in-process)                                      |
|  +-----------------------------------------------------------+
|  |                                                            |
|  |  Agent.ask()                                               |
|  |    |                                                       |
|  |    v                                                       |
|  |  TelemetryExporter                                         |
|  |  +---------------------------+                             |
|  |  | start_span("agent.ask")  |                             |
|  |  |   start_span("llm.call") |                             |
|  |  |     set_attribute(...)    |                             |
|  |  |   end_span()             |                             |
|  |  |   start_span("rag.query")|                             |
|  |  |   end_span()             |                             |
|  |  | end_span()               |                             |
|  |  +---------------------------+                             |
|  |         |                                                  |
|  |    Batch buffer (100 spans or 5s)                          |
|  |         |                                                  |
|  +---------|--------------------------------------------------+
|            |                                                   |
|            v  OTLP/HTTP JSON                                   |
|  +---------+----------+                                        |
|  | OTel Collector     |                                        |
|  | (otel-collector)   |                                        |
|  +----+----------+----+                                        |
|       |          |                                             |
|       v          v                                             |
|  +--------+  +-----------+                                     |
|  | Jaeger |  | Prometheus|                                     |
|  | (traces)|  | (metrics) |                                    |
|  | :16686  |  | :9090     |                                    |
|  +--------+  +-----------+                                     |
|                                                               |
+---------------------------------------------------------------+

Configuration #

Enable telemetry in neam.toml:

toml
[telemetry]
enabled = true
endpoint = "http://otel-collector:4318"
service-name = "neam-agent"
sampling-rate = 0.5

Or via environment variables:

bash
export NEAM_TELEMETRY_ENABLED=true
export NEAM_OTEL_ENDPOINT=http://otel-collector:4318
export NEAM_TELEMETRY_SERVICE_NAME=neam-agent
export NEAM_TELEMETRY_SAMPLING_RATE=0.5

Automatic Span Creation #

The Neam VM automatically creates spans for the following operations:

| Span Name | When Created | Key Attributes |
|-----------|--------------|----------------|
| neam.agent.ask | Every Agent.ask() call | agent.name, agent.provider, agent.model |
| neam.llm.call | Each LLM API request | gen_ai.system, gen_ai.request.model, gen_ai.usage.prompt_tokens, gen_ai.usage.completion_tokens |
| neam.tool.call | Tool invocation | tool.name, tool.duration_ms |
| neam.rag.query | RAG retrieval | rag.strategy, rag.top_k, rag.documents_retrieved |
| neam.reflection | Self-reflection pass | reflection.dimensions, reflection.min_confidence, reflection.score |
| neam.learning.review | Learning review trigger | learning.strategy, learning.interactions_reviewed |
| neam.handoff | Agent handoff | handoff.from, handoff.to, handoff.reason |
| neam.mcp.call | MCP tool execution | mcp.server, mcp.tool, mcp.duration_ms |
| neam.gateway.ratelimit | Rate limit wait | gateway.provider, gateway.wait_ms |
| neam.gateway.circuitbreak | Circuit breaker trip | gateway.provider, gateway.circuit_state |
| neam.gateway.cache | Cache hit/miss | gateway.provider, gateway.cache_hit |

Span Hierarchy #

A typical agent call produces a tree of spans:

text
neam.agent.ask (TriageAgent, 1200ms)
  |
  +-- neam.rag.query (strategy: basic, 45ms)
  |     +-- Retrieved 3 documents
  |
  +-- neam.llm.call (openai/gpt-4o-mini, 850ms)
  |     +-- prompt_tokens: 1200
  |     +-- completion_tokens: 150
  |     +-- cost_usd: 0.0018
  |
  +-- neam.reflection (accuracy: 0.9, relevance: 0.85, 400ms)
  |     +-- neam.llm.call (openai/gpt-4o-mini, 350ms)
  |
  +-- neam.handoff (TriageAgent -> RefundAgent, 0ms)

OTLP Export Format #

Neam exports spans as OTLP/HTTP JSON (not protobuf) to avoid the protobuf dependency. The OTel Collector accepts both formats:

json
{
  "resourceSpans": [{
    "resource": {
      "attributes": [
        {"key": "service.name", "value": {"stringValue": "neam-agent"}},
        {"key": "service.version", "value": {"stringValue": "0.6.0"}},
        {"key": "deployment.environment", "value": {"stringValue": "production"}}
      ]
    },
    "scopeSpans": [{
      "scope": {"name": "neam", "version": "0.6.0"},
      "spans": [
        {
          "traceId": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",
          "spanId": "1a2b3c4d5e6f7a8b",
          "parentSpanId": "",
          "name": "neam.agent.ask",
          "kind": 2,
          "startTimeUnixNano": "1706620800000000000",
          "endTimeUnixNano": "1706620801200000000",
          "attributes": [
            {"key": "agent.name", "value": {"stringValue": "TriageAgent"}},
            {"key": "agent.provider", "value": {"stringValue": "openai"}},
            {"key": "agent.model", "value": {"stringValue": "gpt-4o-mini"}}
          ],
          "status": {"code": 1}
        }
      ]
    }]
  }]
}

Batching and Background Export #

Spans are buffered in memory and exported in batches. A batch is flushed when it reaches 100 spans or when 5 seconds have elapsed since the first span was buffered, whichever comes first. Export happens in the background, so span delivery never blocks agent execution; if the OTLP endpoint is unreachable, spans accumulate in the export queue up to its capacity (visible as pending_spans in the readiness response).
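The size-or-age flush policy can be sketched in a few lines of Python. The 100-span / 5-second limits come from the architecture diagram above; the class and method names are illustrative, not Neam internals:

```python
import time

class SpanBatcher:
    """Buffer spans and flush when either limit is reached:
    100 buffered spans, or 5 seconds since the first span arrived."""

    def __init__(self, max_spans=100, max_age_seconds=5.0, export=print):
        self.max_spans = max_spans
        self.max_age = max_age_seconds
        self.export = export          # called with the list of spans
        self.buffer = []
        self.first_at = None

    def add(self, span):
        if not self.buffer:
            self.first_at = time.monotonic()
        self.buffer.append(span)
        if len(self.buffer) >= self.max_spans:
            self.flush()

    def tick(self):
        """Call periodically from a background loop to enforce the age limit."""
        if self.buffer and time.monotonic() - self.first_at >= self.max_age:
            self.flush()

    def flush(self):
        batch, self.buffer = self.buffer, []
        self.export(batch)
```

The age limit matters for low-traffic agents: without it, a trickle of spans would sit in the buffer indefinitely waiting for the batch to fill.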

Sampling #

The sampling-rate controls what fraction of traces are exported:

| Rate | Effect | Use Case |
|------|--------|----------|
| 1.0 | Every request traced | Development, debugging |
| 0.5 | 50% of requests | Staging |
| 0.1 | 10% of requests | Production (moderate traffic) |
| 0.01 | 1% of requests | Production (high traffic) |

Sampling is deterministic per trace: if a trace is sampled, all spans within that trace (including child spans from tool calls, RAG queries, and reflections) are included. This is achieved by hashing the trace ID and comparing against the sampling threshold.
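One common construction for this decision (Neam's exact hash function is not documented here, so treat this as an illustrative sketch) is to hash the trace ID into the unit interval and compare against the rate:

```python
import hashlib

def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministic head sampling: hash the trace ID into [0, 1)
    and compare against the sampling rate. Every span sharing a trace
    ID gets the same decision, so traces are never half-exported."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# Same trace ID, same answer -- no matter which span asks.
should_sample("a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6", 0.5)
```

Because the decision depends only on the trace ID, independently instrumented services reach the same verdict for the same trace without coordinating.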

neam
// This code behaves identically regardless of sampling rate.
// The telemetry layer is transparent to agent logic.
agent TracedAgent {
  provider: "openai"
  model: "gpt-4o-mini"
  system: "You are a helpful assistant."
}

{
  let response = TracedAgent.ask("Explain observability.");
  emit response;
  // If this trace is sampled, spans are exported automatically.
  // If not sampled, zero overhead is added.
}

22.3 Jaeger for Trace Visualization #

Jaeger is an open-source distributed tracing platform. The Docker Compose stack from Chapter 20 includes Jaeger, and Neam traces flow through the OTel Collector to Jaeger automatically.

Accessing Jaeger #

bash
# If running Docker Compose
open http://localhost:16686

# If running in Kubernetes
kubectl port-forward svc/jaeger-query -n observability 16686:16686
open http://localhost:16686

Finding Traces #

In the Jaeger UI:

  1. Select Service: neam-agent
  2. Select Operation: neam.agent.ask (or leave as "all")
  3. Set a time range
  4. Click Find Traces

Each trace shows the complete span tree for one request, including per-span durations, token counts, cost attributes, and any handoffs between agents.

Reading a Trace #

A trace for a customer service triage request might look like:

text
Trace: a1b2c3d4 (1450ms total)

[==============================================] neam.agent.ask (TriageAgent) 1450ms
  [====] neam.rag.query (basic, 3 docs)                                       50ms
  [===================] neam.llm.call (openai/gpt-4o-mini)                    900ms
    prompt_tokens: 1500  completion_tokens: 80  cost: $0.0020
  [======] neam.reflection (accuracy: 0.92)                                   350ms
    [====] neam.llm.call (openai/gpt-4o-mini)                                300ms
  [] neam.handoff (TriageAgent -> RefundAgent)                                  1ms

[====================================] neam.agent.ask (RefundAgent)           750ms
  [==========================] neam.llm.call (openai/gpt-4o-mini)             600ms
    prompt_tokens: 800  completion_tokens: 200  cost: $0.0015

From this trace, you can immediately see:

  - The first LLM call dominates the TriageAgent's latency (900ms of 1450ms).
  - Reflection adds 350ms, most of it a second LLM call.
  - The handoff itself is essentially free (1ms).
  - The RefundAgent adds another 750ms, 600ms of which is its own LLM call.


22.4 Prometheus Metrics #

Neam exports metrics to Prometheus via the OTel Collector. These metrics provide aggregate visibility across all requests, complementing the per-request detail of traces.

Exported Metrics #

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| neam_llm_requests_total | Counter | provider, model, status | Total LLM API calls |
| neam_llm_tokens_total | Counter | provider, model, type | Tokens consumed (prompt/completion) |
| neam_llm_latency_seconds | Histogram | provider, model | LLM call latency distribution |
| neam_llm_cost_usd_total | Counter | provider, model | Accumulated LLM cost |
| neam_agent_requests_total | Counter | agent, status | Agent ask() calls |
| neam_agent_latency_seconds | Histogram | agent | End-to-end agent latency |
| neam_rag_queries_total | Counter | strategy, knowledge_base | RAG retrieval queries |
| neam_rag_latency_seconds | Histogram | strategy | RAG retrieval latency |
| neam_tool_calls_total | Counter | tool, status | Tool invocations |
| neam_reflection_score | Gauge | agent, dimension | Latest reflection scores |
| neam_gateway_rate_limit_waits_total | Counter | provider | Rate limit delays |
| neam_gateway_circuit_breaker_state | Gauge | provider | Circuit state (0=closed, 1=open, 2=half-open) |
| neam_gateway_cache_hits_total | Counter | provider | Cache hits |
| neam_gateway_cache_misses_total | Counter | provider | Cache misses |
| neam_gateway_cost_daily_usd | Gauge | (none) | Current daily cost |
| neam_gateway_cost_budget_usd | Gauge | (none) | Configured daily budget |

Prometheus Configuration #

yaml
# docker/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
    metrics_path: /metrics

  - job_name: 'neam-agent'
    static_configs:
      - targets: ['neam-agent:8080']
    metrics_path: /metrics

Useful PromQL Queries #

Request rate (requests per second):

promql
rate(neam_agent_requests_total[5m])

P95 agent latency:

promql
histogram_quantile(0.95, rate(neam_agent_latency_seconds_bucket[5m]))

LLM cost per hour:

promql
rate(neam_llm_cost_usd_total[1h]) * 3600

Token consumption rate by provider:

promql
sum by (provider) (rate(neam_llm_tokens_total[5m]))

Cache hit ratio:

promql
sum(rate(neam_gateway_cache_hits_total[5m]))
/
(sum(rate(neam_gateway_cache_hits_total[5m])) + sum(rate(neam_gateway_cache_misses_total[5m])))

Circuit breaker status (1 = problem):

promql
neam_gateway_circuit_breaker_state > 0

Budget utilization percentage:

promql
neam_gateway_cost_daily_usd / neam_gateway_cost_budget_usd * 100

22.5 LLM Gateway Monitoring #

The LLM Gateway is the most critical component to monitor because it controls the flow of all LLM requests. The gateway exposes its internal state through the readiness endpoint and through Prometheus metrics.

Rate Limit Tracking #

The gateway tracks per-provider request rates and enforces the limits defined in neam.toml:

toml
[llm.rate-limits.openai]
requests-per-minute = 120

Monitoring rate limits:

promql
# Current request rate vs. limit
rate(neam_llm_requests_total{provider="openai"}[1m]) * 60
# Compare against the configured limit of 120

# Rate limit wait events (indicates you are approaching the limit)
rate(neam_gateway_rate_limit_waits_total{provider="openai"}[5m])

When rate limit waits increase, it means the gateway is throttling requests to stay within the configured limit. If waits are frequent, consider:

  1. Increasing the requests-per-minute limit (if the provider allows it)
  2. Adding a fallback provider to distribute load
  3. Enabling response caching to reduce redundant calls
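For intuition, the throttling behavior can be modeled as a sliding-window limiter that reports how long a caller must wait. This is an illustrative Python sketch, not the gateway's actual algorithm (the class name and clock injection are our own):

```python
import time
from collections import deque

class MinuteRateLimiter:
    """Allow at most `limit` requests in any 60-second window;
    otherwise report how long the caller must wait."""

    def __init__(self, limit: int, clock=time.monotonic):
        self.limit = limit
        self.clock = clock
        self.sent = deque()   # timestamps of requests in the last 60s

    def wait_seconds(self) -> float:
        now = self.clock()
        # Drop requests that have aged out of the window.
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            self.sent.append(now)
            return 0.0
        # Wait until the oldest request leaves the window.
        return 60 - (now - self.sent[0])
```

A nonzero return value corresponds to an increment of neam_gateway_rate_limit_waits_total and a neam.gateway.ratelimit span recording the wait.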

Circuit Breaker State #

The circuit breaker has three states, represented as a gauge metric:

| Value | State | Meaning |
|-------|-------|---------|
| 0 | Closed | Normal operation |
| 1 | Open | Provider is down; all requests rejected |
| 2 | Half-Open | Probing the provider with a single request |

promql
# Alert when any circuit is open
neam_gateway_circuit_breaker_state{provider="openai"} == 1

Visualizing circuit breaker transitions:

In Grafana, create a state timeline panel with the neam_gateway_circuit_breaker_state metric. This shows exactly when each provider went down and how long it took to recover:

text
Time:       00:00  00:05  00:10  00:15  00:20  00:25  00:30
OpenAI:     [--- Closed ---][Open][HO][--- Closed ---]
Anthropic:  [---------- Closed ----------------------------------]

Cache Hit Rates #

promql
# Cache hit ratio (higher is better, saves money)
sum(rate(neam_gateway_cache_hits_total[5m]))
/
(sum(rate(neam_gateway_cache_hits_total[5m])) + sum(rate(neam_gateway_cache_misses_total[5m])))

A cache hit ratio of 0 means caching is not effective (likely because all agents use temperature > 0). A ratio above 0.3 means you are saving at least 30% on LLM costs.
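The "ratio above 0.3 saves at least 30%" claim is simple arithmetic, assuming each cache hit replaces a paid call of comparable cost. A quick check (the function name and figures are illustrative):

```python
def effective_cost(requests: int, cost_per_call: float, hit_ratio: float) -> float:
    """Cost with caching in place: only misses reach the provider."""
    return requests * (1 - hit_ratio) * cost_per_call

full = effective_cost(10_000, 0.002, 0.0)    # no caching: $20.00
cached = effective_cost(10_000, 0.002, 0.3)  # 30% hits:   $14.00
```

The assumption breaks down if cached responses skew toward cheap, short prompts, so treat the hit ratio as a lower bound on savings only when hits and misses have similar token counts.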

Cost Tracking #

The gateway tracks real-time cost using Neam's built-in pricing table:

promql
# Daily cost (USD)
neam_gateway_cost_daily_usd

# Budget utilization
neam_gateway_cost_daily_usd / neam_gateway_cost_budget_usd * 100

# Cost by provider
sum by (provider) (rate(neam_llm_cost_usd_total[1h])) * 3600

# Cost by model
sum by (model) (rate(neam_llm_cost_usd_total[1h])) * 3600

Cost dashboard example:

+----------------------------------------------+
| Daily LLM Cost                               |
|                                              |
|  $47.32 / $100.00 budget (47.3%)            |
|  [========================............] 47%  |
|                                              |
|  By Provider:                                |
|    OpenAI:    $38.50 (81%)                   |
|    Anthropic: $8.82  (19%)                   |
|                                              |
|  By Model:                                   |
|    gpt-4o-mini: $32.10                       |
|    gpt-4o:      $6.40                        |
|    claude-3.5:  $8.82                        |
+----------------------------------------------+

22.6 Distributed Tracing Across Multi-Agent Systems #

When a request flows through multiple agents (triage -> specialist -> supervisor), distributed tracing keeps the entire chain visible as a single trace.

Trace Propagation #

text
TriageAgent    -->    RefundAgent    -->    SupervisorAgent
(triage logic)        (refund logic)        (review logic)

Within a single Neam VM, trace propagation is automatic. The VM maintains a trace context stack, and when one agent hands off to another, the child agent's span is created with the parent agent's span ID.
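That stack discipline can be sketched in Python. Field names and ID generation here are illustrative, not Neam internals; the point is that each new span's parent is whatever span is currently on top of the stack:

```python
import secrets

class TraceContextStack:
    """Sketch of in-VM trace propagation via a stack of active spans."""

    def __init__(self):
        self.trace_id = secrets.token_hex(16)   # 32 hex chars, shared by all spans
        self.stack = []

    def start_span(self, name: str) -> dict:
        span = {
            "trace_id": self.trace_id,
            "span_id": secrets.token_hex(8),    # 16 hex chars
            "parent_span_id": self.stack[-1]["span_id"] if self.stack else None,
            "name": name,
        }
        self.stack.append(span)
        return span

    def end_span(self) -> dict:
        return self.stack.pop()

ctx = TraceContextStack()
root = ctx.start_span("neam.agent.ask (TriageAgent)")
child = ctx.start_span("neam.agent.ask (RefundAgent)")
assert child["parent_span_id"] == root["span_id"]
assert child["trace_id"] == root["trace_id"]
```

Handoffs simply start the child agent's span before the parent's span ends, so the whole chain lands in one trace.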

Cross-Service Tracing #

When agents communicate across services (via the A2A protocol), the trace context is propagated via HTTP headers following the W3C Trace Context standard:

text
POST /a2a HTTP/1.1
Host: specialist-service.internal
Content-Type: application/json
traceparent: 00-abc123def456abc123def456abc123de-1a2b3c4d5e6f7a8b-01
tracestate: neam=agent:TriageAgent

{"jsonrpc": "2.0", "method": "tasks/send", ...}

The receiving service picks up the traceparent header and creates its spans as children of the calling service's span. This means a single trace in Jaeger can show the complete request path across multiple Neam services.
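The header format itself is fixed by the W3C Trace Context specification: a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and trace flags, joined by hyphens. A small helper pair in Python (illustrative, not Neam's parser):

```python
def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Serialize a W3C traceparent header (version 00)."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    assert version == "00" and len(trace_id) == 32 and len(span_id) == 16
    return {"trace_id": trace_id,
            "parent_span_id": span_id,
            "sampled": flags == "01"}

ctx = parse_traceparent("00-abc123def456abc123def456abc123de-1a2b3c4d5e6f7a8b-01")
assert ctx["trace_id"] == "abc123def456abc123def456abc123de"
assert ctx["sampled"] is True
```

Note the sampled flag travels with the header, so the downstream service honors the upstream sampling decision instead of re-rolling it.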

Practical Example: Multi-Service Tracing #

neam
agent TriageAgent {
  provider: "openai"
  model: "gpt-4o-mini"
  system: "Route customer requests."
  handoffs: [RefundAgent]
}

{
  // This creates a root span: neam.agent.ask
  let triage = TriageAgent.ask("I need a refund for order #123");
  // Handoff propagates the trace context
  // The RefundAgent span becomes a child of this span
}
neam
agent RefundAgent {
  provider: "openai"
  model: "gpt-4o"
  system: "Process refund requests."
}

{
  // When called via A2A, the trace context is inherited
  // from the traceparent header
  let result = RefundAgent.ask("Process refund for order #123");
  emit result;
}

In Jaeger, the combined trace shows:

text
Trace abc123 (2100ms)
  Service: triage-service
    neam.agent.ask (TriageAgent) ........................ 1200ms
      neam.llm.call (openai/gpt-4o-mini) .............. 900ms
      neam.handoff (TriageAgent -> RefundAgent) ........   1ms

  Service: refund-service
    neam.agent.ask (RefundAgent) ....................... 900ms
      neam.llm.call (openai/gpt-4o) ................... 750ms

22.7 Alerting Strategies #

Monitoring without alerting is just logging with a GUI. Here are alerting rules for the most important Neam operational signals.

Prometheus Alerting Rules #

yaml
# alerting-rules.yaml
groups:
  - name: neam-agent
    rules:
      # Alert when error rate exceeds 5%
      - alert: NeamHighErrorRate
        expr: |
          sum(rate(neam_agent_requests_total{status="error"}[5m]))
          /
          sum(rate(neam_agent_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Neam agent error rate above 5%"
          description: "{{ $value | humanizePercentage }} of requests are failing"

      # Alert when P95 latency exceeds 5 seconds
      - alert: NeamHighLatency
        expr: |
          histogram_quantile(0.95, rate(neam_agent_latency_seconds_bucket[5m]))
          > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Neam P95 latency above 5 seconds"

      # Alert when a circuit breaker is open
      - alert: NeamCircuitBreakerOpen
        expr: neam_gateway_circuit_breaker_state == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LLM provider {{ $labels.provider }} circuit breaker is open"
          description: "All requests to {{ $labels.provider }} are being rejected"

      # Alert when daily cost exceeds 80% of budget
      - alert: NeamCostBudgetWarning
        expr: |
          neam_gateway_cost_daily_usd / neam_gateway_cost_budget_usd > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Daily LLM cost at {{ $value | humanizePercentage }} of budget"

      # Alert when daily cost exceeds 95% of budget
      - alert: NeamCostBudgetCritical
        expr: |
          neam_gateway_cost_daily_usd / neam_gateway_cost_budget_usd > 0.95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Daily LLM cost at {{ $value | humanizePercentage }} of budget"

      # Alert when rate limit waits are frequent
      - alert: NeamRateLimitPressure
        expr: |
          rate(neam_gateway_rate_limit_waits_total[5m]) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Provider {{ $labels.provider }} under rate limit pressure"

      # Alert when state backend is unreachable
      - alert: NeamStateBackendDown
        expr: |
          up{job="neam-agent"} == 1
          unless
          neam_health_state_backend_connected == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Neam state backend is unreachable"

      # Alert when all pods are not ready
      - alert: NeamNoReadyPods
        expr: |
          kube_deployment_status_replicas_ready{deployment="neam-agent"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No ready Neam agent pods"

Alert Priority Matrix #

| Condition | Severity | Response Time | Action |
|-----------|----------|---------------|--------|
| All pods down | Critical | Immediate | Page on-call, investigate cluster |
| Circuit breaker open | Critical | 5 min | Check provider status, verify failover |
| Cost > 95% budget | Critical | 15 min | Investigate usage, consider throttling |
| Error rate > 5% | Warning | 30 min | Review traces, check for bad inputs |
| P95 latency > 5s | Warning | 1 hour | Review traces, check provider latency |
| Rate limit pressure | Warning | 1 hour | Consider increasing limits or caching |
| Cost > 80% budget | Warning | 4 hours | Review cost trends, adjust budget |

22.8 Operational Runbook #

Here is a practical runbook for diagnosing common issues using the observability stack.

"Why is my agent slow?" #

  1. Check Prometheus: Query histogram_quantile(0.95, rate(neam_agent_latency_seconds_bucket[5m])) to confirm the latency baseline.

  2. Check Jaeger: Find a slow trace. Look at the span tree:
     - Is the LLM call slow? (Provider issue or large prompt)
     - Is RAG retrieval slow? (Knowledge base too large or slow vector search)
     - Is reflection adding latency? (Consider reducing min_confidence or disabling for non-critical agents)

  3. Check rate limits: Query rate(neam_gateway_rate_limit_waits_total[5m]). If rate limit waits are high, the gateway is throttling requests.

  4. Check cache hit ratio: If the cache is available but the hit ratio is 0, check that temperature: 0 is set on deterministic agents.

"Why is my agent returning errors?" #

  1. Check circuit breaker state: Query neam_gateway_circuit_breaker_state. If a circuit is open (1), the provider is down.

  2. Check the readiness endpoint: curl http://neam-agent:8080/ready to see which components are unhealthy.

  3. Check Jaeger: Find traces with error status. The error span will have a status_message attribute explaining the failure.

  4. Check provider health: Query sum by (provider, status) (rate(neam_llm_requests_total[5m])) to see error rates per provider.

"Am I spending too much?" #

  1. Check daily cost: Query neam_gateway_cost_daily_usd for the current total.

  2. Break down by model: Query sum by (model) (rate(neam_llm_cost_usd_total[1h])) * 3600 to find the most expensive model.

  3. Check cache effectiveness: A low cache hit ratio means you are paying for redundant calls.

  4. Check token usage: Query sum by (agent) (rate(neam_llm_tokens_total[1h])) to find agents consuming the most tokens. Long system prompts or large RAG contexts inflate token counts.


22.9 Complete Observability Example #

Here is a complete Neam agent with full observability configuration:

toml
# neam.toml
[project]
name = "observed-agent"
version = "1.0.0"

[project.entry_points]
main = "src/main.neam"

[state]
backend = "postgres"
connection-string = "postgresql://neam:pass@postgres:5432/neam"

[llm]
default-provider = "openai"
default-model = "gpt-4o-mini"

[llm.rate-limits.openai]
requests-per-minute = 120

[llm.circuit-breaker]
failure-threshold = 3
reset-timeout-seconds = 60

[llm.cache]
enabled = true
max-entries = 1000
ttl-seconds = 600

[llm.cost]
daily-budget-usd = 100.0

[telemetry]
enabled = true
endpoint = "http://otel-collector:4318"
service-name = "observed-agent"
sampling-rate = 1.0
neam
agent AnalystAgent {
  provider: "openai"
  model: "gpt-4o-mini"
  temperature: 0.3
  system: "You are a data analyst. Provide clear, data-driven answers."

  reasoning: chain_of_thought

  reflect: {
    after: each_response
    evaluate: [accuracy, clarity]
    min_confidence: 0.7
    on_low_quality: {
      strategy: "revise"
      max_revisions: 1
    }
  }

  learning: {
    strategy: "experience_replay"
    review_interval: 20
  }

  memory: "analyst_memory"
}

{
  let query = input();
  let answer = AnalystAgent.ask(query);
  emit answer;

  // Check learning stats periodically
  let stats = agent_learning_stats("AnalystAgent");
  emit "Interactions: " + str(stats["total_interactions"]);
  emit "Avg score: " + str(stats["avg_reflection_score"]);
}

With this configuration, every request generates:

  1. Traces in Jaeger showing the agent call, LLM request, RAG query (if any), and reflection pass
  2. Metrics in Prometheus tracking request rate, latency, token usage, cost, cache hits, and circuit breaker state
  3. Health endpoints for Kubernetes probes

22.10 Observability Standard Library Modules #

The Neam standard library includes a comprehensive observability package organized into six sub-packages. These modules let you extend the built-in telemetry with custom instrumentation, alternative exporters, and diagnostic tools.

Package Overview #

| Sub-package | Modules | Purpose |
|-------------|---------|---------|
| observability/core | tracer, meter, logger, context, sampling | Core OTel providers and context management |
| observability/exporters | otlp, jaeger, elasticsearch, mlflow, langfuse, sqlite, local, multi | Export destinations for traces, metrics, and logs |
| observability/instrumentation | llm, agent, tool, handoff, memory | Automatic span creation for Neam operations |
| observability/semantic | attributes, genai, events | OpenTelemetry semantic conventions for AI |
| observability/triage | triage, anomaly, patterns, compare, dependencies, gaps, replay, reports | Diagnostic analysis and debugging |
| observability/config | programmatic, environment, runtime | Configuration methods |

Using the Core Modules #

The core modules give you direct access to the OTel tracer, meter, and logger providers for custom instrumentation:

neam
import observability/core/tracer
import observability/core/meter

fun process_order(order_id) {
  let span = tracer.start_span("process_order", {
    "order.id": order_id,
    "order.source": "web"
  })

  let counter = meter.counter("orders_processed_total", {
    description: "Total orders processed"
  })

  let result = do_processing(order_id)
  counter.add(1, { "status": result.status })

  span.set_attribute("order.status", result.status)
  span.end()

  return result
}

Sampling Strategies #

The sampling module provides four strategies beyond the default trace-ID ratio:

neam
import observability/core/sampling

let sampler = sampling.create({
  strategy: "parent_based",
  root: {
    strategy: "trace_id_ratio",
    rate: 0.1
  }
})

| Strategy | Description |
|----------|-------------|
| always_on | Sample every trace (development) |
| always_off | Sample nothing (disable telemetry without removing config) |
| trace_id_ratio | Sample a fixed percentage based on trace ID hash |
| parent_based | Inherit sampling decision from parent span; use a fallback strategy for root spans |
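The parent_based strategy is a thin wrapper over the others. As a sketch (the function and its arguments are illustrative, not the sampling module's API): if the span has a parent, inherit its decision; only root spans consult the fallback strategy.

```python
def parent_based_decision(parent_sampled, root_sampler, trace_id: str) -> bool:
    """parent_based sampling: inherit the parent's decision when one exists,
    otherwise apply the configured root strategy (here, any callable)."""
    if parent_sampled is not None:
        return parent_sampled
    return root_sampler(trace_id)

# A child span follows its parent even if the root sampler would say no:
parent_based_decision(True, lambda tid: False, "abc")
```

This is what keeps traces whole: once the root span is sampled (or not), every descendant agrees.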

Alternative Exporters #

Beyond OTLP and Jaeger, Neam supports several specialized exporters:

neam
import observability/exporters/elasticsearch
import observability/exporters/langfuse
import observability/exporters/mlflow

let es_exporter = elasticsearch.create({
  url: "https://elasticsearch:9200",
  traces_index: "neam-traces",
  metrics_index: "neam-metrics",
  logs_index: "neam-logs"
})

let langfuse_exporter = langfuse.create({
  public_key: env("LANGFUSE_PUBLIC_KEY"),
  secret_key: env("LANGFUSE_SECRET_KEY"),
  host: "https://cloud.langfuse.com"
})

let mlflow_exporter = mlflow.create({
  tracking_uri: "http://mlflow:5000",
  experiment_name: "neam-agent-eval"
})
| Exporter | Best For |
| --- | --- |
| otlp | Standard OTel Collector pipeline |
| jaeger | Direct Jaeger ingestion (no collector) |
| elasticsearch | Full-text search over traces and logs |
| langfuse | LLM-specific observability with prompt tracking |
| mlflow | ML experiment tracking and model registry |
| sqlite | Local development without external services |
| local | File-based export for offline analysis |
| multi | Route different signals to different exporters |
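
For local development, the sqlite exporter keeps telemetry on disk with no external services. The exact option names here are assumptions; a minimal sketch:

neam
import observability/exporters/sqlite

// Hypothetical option: path to the local telemetry database file
let dev_exporter = sqlite.create({
  path: "./neam-telemetry.db"
})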

The multi exporter lets you send traces and metrics to different destinations:

neam
import observability/exporters/multi

let pipeline = multi.create({
  traces: [otlp_exporter, langfuse_exporter],
  metrics: [otlp_exporter],
  logs: [elasticsearch_exporter]
})

Semantic Conventions for AI #

The semantic/attributes module defines standard attribute names following the OpenTelemetry GenAI semantic conventions:

neam
import observability/semantic/attributes

// GenAI operation attributes
attributes.GEN_AI_SYSTEM          // "gen_ai.system" (e.g., "openai")
attributes.GEN_AI_REQUEST_MODEL   // "gen_ai.request.model"
attributes.GEN_AI_REQUEST_MAX_TOKENS
attributes.GEN_AI_REQUEST_TEMPERATURE

// GenAI response attributes
attributes.GEN_AI_USAGE_PROMPT_TOKENS
attributes.GEN_AI_USAGE_COMPLETION_TOKENS
attributes.GEN_AI_RESPONSE_FINISH_REASONS

// Agent-specific attributes
attributes.AGENT_NAME             // "agent.name"
attributes.AGENT_ID               // "agent.id"
attributes.AGENT_TEAM             // "agent.team"
attributes.AGENT_ROLE             // "agent.role"
attributes.AGENT_PARENT           // "agent.parent"

Using standard attribute names keeps your traces portable across any OTel-compatible backend and enables cross-tool queries like "show me all traces where gen_ai.usage.prompt_tokens > 5000."
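
When you create custom spans with the core tracer, you can reference these constants instead of hand-typing attribute keys. A sketch (the span name and attribute values are illustrative):

neam
import observability/core/tracer
import observability/semantic/attributes

let span = tracer.start_span("summarize_ticket", {})

// Standard constants prevent typos like "gen_ai.request_model"
span.set_attribute(attributes.GEN_AI_SYSTEM, "openai")
span.set_attribute(attributes.GEN_AI_REQUEST_MODEL, "gpt-4o")
span.set_attribute(attributes.AGENT_NAME, "TriageAgent")
span.end()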


22.11 Structured Logging #

In addition to traces and metrics, Neam supports structured logging through the OpenTelemetry Logs API. Structured logs attach key-value attributes to each log record, making them searchable and correlatable with traces.

Log Configuration #

toml
[telemetry]
enabled = true
endpoint = "http://otel-collector:4318"
service-name = "neam-agent"
log-level = "info"

The log level controls which records are emitted:

| Level | Emitted At | Examples |
| --- | --- | --- |
| debug | Development only | Prompt text, full LLM responses, internal state |
| info | Normal operations | Agent started, request processed, handoff completed |
| warn | Potential issues | Rate limit approached, cache eviction, slow query |
| error | Failures | LLM call failed, state backend timeout, circuit open |

Log Records #

Each log record is a structured JSON object exported via OTLP alongside traces and metrics:

json
{
  "timestamp": "2026-01-30T14:32:05.123Z",
  "severity": "WARN",
  "body": "Rate limit approaching threshold",
  "attributes": {
    "provider": "openai",
    "current_rpm": 108,
    "limit_rpm": 120,
    "utilization_pct": 90
  },
  "traceId": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",
  "spanId": "1a2b3c4d5e6f7a8b"
}

The traceId and spanId fields correlate logs with the trace that produced them. In Grafana, this means you can click from a log line directly to the corresponding trace in Jaeger.

Custom Log Records #

Use the logger module to emit structured logs from your agent code:

neam
import observability/core/logger

let log = logger.create({ name: "order-processor" })

fun process_order(order) {
  log.info("Processing order", {
    "order.id": order.id,
    "order.total": order.total,
    "customer.tier": order.customer_tier
  })

  if (order.total > 10000) {
    log.warn("High-value order requires review", {
      "order.id": order.id,
      "order.total": order.total
    })
  }
}

Log Aggregation Pipeline #

In the OTel Collector, logs flow through the same pipeline as traces and metrics:

yaml
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [elasticsearch, debug]

Elasticsearch is the recommended log destination because it supports full-text search, aggregations, and Kibana dashboards. For simpler setups, the debug exporter writes logs to stdout, which Docker and Kubernetes capture automatically.


22.12 Privacy and Redaction #

Production agents handle sensitive data — customer names, account numbers, API keys in prompts. The observability stack must not leak this data into traces or logs. The observability/privacy module provides configurable redaction rules.

Redaction Configuration #

neam
import observability/privacy

let privacy_config = privacy.create({
  mode: "redact",
  rules: [
    { pattern: "\\b\\d{4}[- ]?\\d{4}[- ]?\\d{4}[- ]?\\d{4}\\b", replace: "[CARD]" },
    { pattern: "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b", replace: "[EMAIL]" },
    { pattern: "sk-[a-zA-Z0-9]{20,}", replace: "[API_KEY]" },
    { pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b", replace: "[SSN]" }
  ],
  capture_prompts: false,
  capture_responses: false
})

Privacy Modes #

| Mode | Behavior |
| --- | --- |
| full | Capture everything — prompts, responses, tool inputs/outputs (development only) |
| redact | Apply regex rules to sanitize sensitive patterns before export |
| hash | Replace sensitive values with one-way hashes (preserves cardinality for analysis) |
| minimal | Capture only span names, durations, and status codes — no content attributes |
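
The difference between redact and hash matters for analysis: redaction collapses every match to one placeholder, while hashing maps each distinct input to a stable token. An illustrative sketch (the hash output shown is made up):

neam
import observability/privacy

// redact: "alice@example.com" -> "[EMAIL]"  (all emails become one value)
// hash:   "alice@example.com" -> "a1f3..."  (stable per input, so
//                                            per-customer counts still work)
let hashing = privacy.create({ mode: "hash" })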

Controlling What Gets Traced #

By default, Neam traces include span names and metadata attributes (agent name, provider, model, token counts, latency). Prompt and response text are not captured unless explicitly enabled:

toml
[telemetry]
capture-prompts = false     # Do not include prompt text in spans
capture-responses = false   # Do not include response text in spans

For debugging specific issues, you can enable prompt capture temporarily using the runtime configuration module:

neam
import observability/config/runtime

runtime.set_capture("TriageAgent", {
  capture_prompts: true,
  capture_responses: true,
  duration: "30m"
})

This enables prompt/response capture for TriageAgent only, for 30 minutes, then automatically reverts to the default configuration.

Access Control #

The privacy module supports role-based access to observability data:

neam
let access_config = privacy.access_control({
  roles: {
    "developer": ["traces", "metrics"],
    "ops": ["traces", "metrics", "logs"],
    "security": ["traces", "metrics", "logs", "prompts"]
  }
})

This does not enforce access at the Neam level — it sets metadata tags on exported data that downstream systems (Grafana, Kibana) can use for RBAC filtering.


22.13 Diagnostic Triage #

The observability/triage module provides automated diagnostic tools for identifying issues in production without manual trace inspection.

Anomaly Detection #

The anomaly detector monitors metrics for deviations from learned baselines:

neam
import observability/triage/anomaly

let detector = anomaly.create({
  metrics: ["neam_agent_latency_seconds", "neam_llm_cost_usd_total"],
  window: "1h",
  sensitivity: 2.0,
  on_anomaly: fun(alert) {
    log.warn("Anomaly detected: " + alert.metric, {
      "expected": alert.expected,
      "actual": alert.actual,
      "deviation": alert.deviation
    })
  }
})

The detector uses a rolling window to compute the mean and standard deviation of each metric. When the current value deviates by more than sensitivity standard deviations, the on_anomaly callback fires.
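
Concretely, the rule reduces to a z-score check: with a one-hour window mean of 2.0s and a standard deviation of 0.4s, sensitivity 2.0 flags any latency outside 2.0 ± 0.8s. A sketch of the rule (illustrative, not the module's internals):

neam
fun is_anomalous(value, mean, stddev, sensitivity) {
  // Fires when the value is more than `sensitivity` standard
  // deviations away from the rolling mean
  return abs(value - mean) > sensitivity * stddev
}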

Error Pattern Analysis #

The pattern analyzer groups errors by type and identifies recurring failure modes:

neam
import observability/triage/patterns

let analysis = patterns.analyze({
  window: "24h",
  min_occurrences: 5
})

for (pattern, details) in analysis {
  emit "Pattern: " + pattern
  emit "  Count: " + str(details.count)
  emit "  First seen: " + details.first_seen
  emit "  Last seen: " + details.last_seen
  emit "  Affected agents: " + str(details.agents)
}

Dependency Graph #

The dependency graph builder analyzes traces to map service-to-service relationships:

neam
import observability/triage/dependencies

let graph = dependencies.build({ window: "1h" })

for (service, deps) in graph {
  emit service + " depends on: " + join(deps, ", ")
}

This is useful for understanding blast radius: if a provider goes down, which agents and services are affected?
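
Inverting the graph answers the blast-radius question directly. A sketch, assuming hypothetical contains() and append() helpers:

neam
import observability/triage/dependencies

let graph = dependencies.build({ window: "1h" })

// Collect every service whose dependency list includes the provider
fun affected_by(graph, provider) {
  let hit = []
  for (service, deps) in graph {
    if (contains(deps, provider)) {
      hit = append(hit, service)
    }
  }
  return hit
}

emit "If openai goes down: " + join(affected_by(graph, "openai"), ", ")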

Diagnostic Reports #

The report generator combines anomaly detection, error patterns, and dependency analysis into a structured diagnostic report:

neam
import observability/triage/reports

let report = reports.generate({
  window: "24h",
  include: ["anomalies", "errors", "dependencies", "recommendations"]
})

emit report.summary
for rec in report.recommendations {
  emit "  - " + rec
}

A typical report might contain:

Diagnostic Report (last 24h)
=============================
Anomalies: 2
  - neam_agent_latency_seconds: 3.2x above baseline (P95: 8.1s vs. 2.5s baseline)
  - neam_llm_cost_usd_total: 1.8x above baseline ($142 vs. $78 baseline)

Error patterns: 1
  - "429 Too Many Requests" from openai (47 occurrences, affecting TriageAgent)

Dependencies:
  TriageAgent → openai, postgres
  RefundAgent → openai, postgres
  SupervisorAgent → anthropic, postgres

Recommendations:
  - Increase OpenAI rate limit or add fallback provider (47 rate limit errors)
  - Investigate TriageAgent prompt length (high token cost correlates with latency)
  - Consider caching for TriageAgent (0% cache hit rate)

Summary #

In this chapter, you learned:

  - The distinct semantics of Neam's three health check endpoints
  - How to use the core tracer, meter, and logger modules for custom instrumentation
  - Sampling strategies and the exporter options (OTLP, Jaeger, Elasticsearch, Langfuse, MLflow, and more)
  - The OpenTelemetry GenAI semantic conventions and Neam's agent-specific attributes
  - Structured logging with traceId/spanId correlation between logs and traces
  - Privacy modes and redaction rules for keeping sensitive data out of telemetry
  - Automated triage: anomaly detection, error pattern analysis, dependency graphs, and diagnostic reports

These tools and techniques give you complete visibility into your Neam agents in production. Combined with the deployment patterns from Chapters 20 and 21, you now have everything needed to build, deploy, and operate production AI agent systems.


Exercises #

Exercise 22.1: Health Check Design #

A Neam agent uses PostgreSQL for state, OpenAI and Anthropic for LLM calls, and has telemetry enabled. Write the expected JSON response for /ready in each of these scenarios:

  1. Everything is healthy
  2. PostgreSQL is down, LLM providers are fine
  3. OpenAI circuit is open, Anthropic is healthy
  4. Both OpenAI and Anthropic circuits are open

For each scenario, state whether the readiness probe passes (HTTP 200) or fails (HTTP 503) and explain why.

Exercise 22.2: Trace Analysis #

Given the following Jaeger trace for a customer service request:

text
neam.agent.ask (TriageAgent) .................... 3500ms
  neam.rag.query (hybrid, 5 docs) .............. 1200ms
  neam.llm.call (openai/gpt-4o) ................ 1800ms
    prompt_tokens: 4500
    completion_tokens: 200
  neam.reflection (accuracy: 0.65) ..............  400ms
    neam.llm.call (openai/gpt-4o) ..............  350ms
  neam.reflection (revision 1, accuracy: 0.82) .. 400ms
    neam.llm.call (openai/gpt-4o) ..............  350ms

Answer the following:

  1. What is the biggest contributor to latency?
  2. Why did the reflection pass run twice?
  3. How many total LLM calls were made?
  4. Estimate the total token cost assuming GPT-4o at $5/1M input, $15/1M output.
  5. Suggest three optimizations to reduce the total latency.

Exercise 22.3: Prometheus Queries #

Write PromQL queries for the following:

  1. The average number of LLM tokens consumed per agent request (over the last hour)
  2. The cache hit ratio for OpenAI calls specifically
  3. The number of circuit breaker state transitions in the last 24 hours
  4. The top 3 agents by total cost in the last day
  5. An alert rule that fires when the rate limit wait time exceeds 1 second on average

Exercise 22.4: Alerting Configuration #

Design an alerting strategy for a Neam deployment with these SLAs:

Write Prometheus alerting rules with appropriate thresholds, for durations, and severity levels. Include both warning and critical tiers for each SLA.

Exercise 22.5: Cost Optimization Analysis #

A production Neam deployment has these metrics over 24 hours:

Exercise 22.6: Distributed Tracing Design #

Design the tracing instrumentation for a multi-service Neam deployment with:

Draw the span hierarchy for a request that goes through all four services. List the attributes you would set on each span. Explain how the trace context propagates between services.
